How sorting works within PaperVision Enterprise and ImageSilo
Version: 10.21 through R79.2
Article ID: PE000087
Description
Summary
Below is how a search is executed today:
- User executes a new search criterion.
- The SQL Server looks for records that meet the specified criteria whose DOCID is greater than zero (0). It sorts the result set by the unique DOCID for each returned document. The query also limits the number of records returned as specified by the Max Results setting.
- As the server requesting the data receives the query results from SQL Server, it evaluates each record to ensure the user has access to it. Those records that are accessible by the user are added to an “Output Result Set” – the rest (i.e. those records the user does NOT have access to) are discarded.
- If the output result set does not yet contain enough records as specified by the Max Results setting, the system returns to step #2 above, except it only looks for documents that meet the specified criteria whose DOCID is greater than the last DOCID evaluated in step #3.
- Once the Output Result Set is full (either by meeting the Max Results value or no more records meet the criteria), the Output Result Set is sorted by the index fields as specified by the user and returned to the client.
If the user requests “More” results (when that option is available), the process starts at step #2 from above, except it only looks for documents that meet the specified criteria whose DOCID is greater than the last DOCID evaluated in the previous query.
The downside of this is that each page of results is sorted separately from all of the other pages of results. However, the upside – in terms of minimized processing overhead on the SQL Servers – is significant. The issue is simply a tradeoff – we must either retrieve a page of results and then sort that page, OR sort all of the results on the server and then just return a page of those sorted results.
Examples:
Let’s use an example – consider a database table that had 5 million records in it. Let’s say that we’re looking for all items whose name starts with an A. Today, PaperVision (and ImageSilo) would execute a query that looks something similar to this:
SELECT TOP 1000 * FROM PVDM_DOCS_1_1 WHERE (DOCINDEX1 LIKE ‘A%’) AND (DOCID > 0) ORDER BY DOCID
Of course, this very quickly returns 1000 documents that are all sorted (within that group of 1000). When the user wants to see the next grouping of 1000 documents that start with A the query might look similar to this:
SELECT TOP 1000 * FROM PVDM_DOCS_1_1 WHERE (DOCINDEX1 LIKE ‘A%’) AND (DOCID > 53211) ORDER BY DOCID
Note that the DOCID changed. The reason for that is that (in our fictional database table), there were exactly 1000 documents that the user had access to AND which started with an A AND whose DOCID was less than or equal to 53211 – those were the 1000 that were returned in the first query. So this query (above) will return the next group of 1000. So what are the options? Well, the most obvious one is to have the database order ALL of the items that meet the client criteria ON THE SQL SERVER and then only return a chunk of that data as requested by the client. Below is a possible example of that:
WITH OrderedResults AS (SELECT *, ROW_NUMBER() OVER (order by DOCINDEX1) as RowNumber FROM PVDM_DOCS_1_1) SELECT * FROM OrderedResults WHERE RowNumber between 1 and 1000
Now, this will certainly return the first 1000 records, but what if there are hundreds of thousands of records that meet the criteria? The SQL Server would first have to sort those hundreds of thousands of records before it could apply a ROW_NUMBER so it could return just 1000 of them. Furthermore, EACH and EVERY time, the next group of records is requested (i.e. the “between 1 and 1000” becomes “between 1001 and 2000”, etc), the SQL Server pays the same price. In other words the SQL Server has to process hundreds of thousands of records that aren’t going to get looked at by the user EVERY TIME they request just a thousand records.
Let’s say, for arguments sake, that the first query took 45 seconds to return (vs. the sub-one second necessary under today’s method) – then every time the user asked for “More Docs”, they’d be waiting another 45 seconds. Meanwhile the SQL Server would be dying under the load of even a couple dozen active users.
What about cursors? Well, database cursors don’t work too great for dynamic data (data that is getting added/deleted constantly – think imports). They also don’t work too great in environments where the dataset is completely disconnected from the client (as in PVE/ImageSilo).
What about the use of the “SET ROWCOUNT” method? Unfortunately, for that method to work, you have to sort/order by a unique column (i.e. DOCID), NOT the index field the user wants to sort by – which effectively yields exactly what PVE/ImageSilo does today. Would changing to a different database (i.e. Oracle) solve this problem? Not at all – this is a logical problem, not a technical one.
To put it simply, PaperVision (and ImageSilo) were designed from the ground up to provide extreme scalability. For every one user that wants to sort data sets greater than their configured Max Records, there are a thousand users that would appreciate having their search results from their large data sets return quickly.
PaperVision customers that want to have larger data sets sorted and returned can simply increase their Max Results. We believe that this provides the optimal balance of usability (by allowing the user to have an administrator-defined number of records sorted) and performance (by keeping the workload on the SQL Server light).