Thursday, January 28, 2016

Top Query Revealed: Top Query to find Random Records

Top Query to find Random Records

This is the fourth in a series on the Top Query. In previous posts, I've discussed Simple Top Queries, Aggregate and Grouping Top Queries, and the Problem of Ties in a Top Query.

One rather surprising use for a Top query is to return a resultset of randomly selected records. It's also surprisingly simple. All you really need to do is sort on a random number, or rather, generate a random number in the Order By clause of the query.

So how do you get a random number? Fortunately, Access has a built in function called Rnd that will supply one.

To use Rnd, you need to supply it with a seed or an initial value to generate a pseudorandom number. Any numeric field will do for a seed value, but I generally use an autonumber, primary key field. In the case of the Orders table, that would be OrderID, but OrderNum would have worked as well.

SELECT TOP 10 Account, OrderNum, [Pairs Shipped], [Total Price]
FROM Invoices
ORDER BY Rnd(OrderID);

Every time this query is run, it will return 10 different records, well, sort of. Used as is, the Rnd function will produce a different set of records each time. However, if you close the database and re-open it, it will run the same records in exactly the same sequence.

In order to get a truly random set of numbers each time, you have to add the Randomize command. To do that, you have to create your own function. In a global module, create the following function:

Public Function gRnd(FeedNum As Long) as Double
Randomize
gRnd = Rnd(FeedNum)
End Function

Then use your user-defined rounding function in your Order By clause:

SELECT TOP 10 Account, OrderNum, [Pairs Shipped], [Total Price]
FROM Invoices
ORDER BY gRnd(OrderID);

A free sample illustrating this process can be found here: http://www.rogersaccesslibrary.com/forum/forum_posts.asp?TID=233

Next time, I'll finish up the Top Query by looking at Creating a Parameter for Top Value.

 

Tuesday, January 19, 2016

Top Query Revealed: Top Query Problem - Ties

Top Query Problem - Ties

As I said last time in Top Query Revealed: Aggregate Values and Top Values By Group, one problem with Top queries becomes apparent when there are duplicates in the top values. When this happens, the query will return more values than asked for.

Simple Top Queries

For instance, the following query will produce four values, rather than three because there are two orders with a total price of $27.50.

SELECT TOP 3 Account, OrderNum, [Total Price]
FROM Invoices
WHERE Account=237001
ORDER BY [Total Price] DESC;

When asked for the top three values, Access can't decide which of the two duplicates is in the top three, so it includes both of them. Figure 4 shows the result.

Figure 4: Result of a query asking for the top three values when there is a tie. Four records are returned.

The solution is to add an additional field to the Order By clause.

SELECT TOP 3 Account, OrderNum, [Total Price]
FROM Invoices
WHERE Account=237001
ORDER BY [Total Price] DESC, OrderNum DESC;

In Figure 5, you'll see that by adding OrderNum to the Order By clause, it does in fact return three values. Since I used DESC in the clause, it returned the larger of the duplicated order numbers: 542724. If I had used ASC (or left it blank) it would have returned 542723 instead.

Figure 5: Top 3 query with one of the duplicates removed.

Top Query By Group

But as you can see in Figure 6, the problem of duplicates can happen in top queries by group as well.

Figure 6: Account 237001 has four records: two with $27.50.

Unfortunately, the solution is not the same. Simply adding a second field to the Order By clause of the subquery, doesn't work; not by itself anyway. However if we add to this what we learned about using a Totals query to return aggregate values, we can find a solution.

In our previous Totals query, we used the Sum aggregate function. This time, however, we'll use a different aggregate function. The Max aggregate function returns just the maximum value of a group rather than summing the group. Adding this to our main query, we get this.

SELECT Account, Max(OrderNum) AS MaxOfOrderNum, [Total Price]
FROM Invoices
GROUP BY Account, [Total Price]
HAVING [Total Price] In
(SELECT TOP 3 [Total Price]
FROM Invoices I2
WHERE Invoices.Account = I2.Account
ORDER BY I2.[Total Price] DESC, I2.OrderNum DESC)
ORDER BY Account, [Total Price] DESC;

So adding the aggregate function to find the maximum OrderNum in the main query and adding a sort on OrderNum in the subquery solves the problem and we get output like Figure 7.


Figure 7: Solution to the duplicates problem in a top query by group.

Notice that account 237001 now has only 3 records and the number of records returned is 378 rather than 403.

A free sample illustrating the "tie" problems and solutions can be found here: http://www.rogersaccesslibrary.com/forum/forum_posts.asp?TID=233

Next time, I'll talk about how use a Top Query to find Random Records.

 

.

 

Saturday, January 16, 2016

Intellectual Adventure: Using MS Access to search 250 Million Records–Part 2

Guest Post by Bob Goethe

This post is a follow up for Intellectual Adventure: Using MS Access to search 250 Million Records

Downloading the Files...from now to January 21, 2016

I have zipped all the files for the 1/4 billion-record database (comprising 1.7 GB zipped, 30 GB unzipped) and made them available for download. You have my permission to do with them as you please, including modifying, and uploading them to other locations and sharing them, without limitation.

Unless you wish to include a text file attesting to how you owe all of your success in life to my example and mentoring-from-afar, you don't need to take any particular action to attribute the files to me.

My personal ISP account has some bandwidth limitations, so rather than letting you download directly from my server, I used YouSendIt/Hightail to share the files. The good news about YouSendIt is, so far as I can tell, there are no bandwidth limitations. Every Access-L member could download a copy and there is no problem.

The bad news is that the files I upload expire a week after I upload them. This means that you can download these files without restriction for the next few days, but the files will become unavailable in the afternoon of January 21.

There are 25 backend MDB files, plus a front end. You can download them all if you choose. You can store them in any folder you wish, but you must make path notations in the table called "DataFiles" so that the frontend knows where to look for its data. The data files can be, in principle, stored in any folder either locally or on a network share. They don't have to be stored in the same location. The frontend knows where to look for each file individually, based on the DataFiles table. The frontend file is called "MS_Jet Test.mdb".

If you would like to play with the database, but don't want to download *all* the files, you can choose to download one file only, called "Files21-25+FrontEnd.zip". This will give you a database of 10,000,000 records only. Again, you will want to go into the DataFiles table in the front end and remove all references to 1.mdb up to 20.mdb if you want to do this.

Files1-4.zip

https://www.hightail.com/download/ZWJYZGVURndvQUp1a3NUQw

Files5-8.zip

https://www.hightail.com/download/ZWJYZGVURnd5UkU4RmNUQw

Files9-12.zip

https://www.hightail.com/download/ZWJYZGVURnc1UjVBSXNUQw

Files13-16.zip

https://www.hightail.com/download/ZWJYZGVURncwMEhOUjhUQw

Files17-20.zip

https://www.hightail.com/download/ZWJYZGVlYStCSWNsYzhUQw

Files21-25+FrontEnd.zip

https://www.hightail.com/download/ZWJYZGVlYStFc0pvSWNUQw

Each zip file contains a collection of further zip files. This odd structure arose as a result of my experimenting with YouSendIt, to find the largest file I could successfully send while using a free account. I mixed-and-matched a bit before I decided that packages of 4 files were a good size to upload. In any case, just keep on unzipping, and eventually you will drill down to MDB files.

If you figure out other interesting things to do with, or to, this database, do share what you learned with this list.

Thanks!

Bob Goethe

Tuesday, January 12, 2016

Top Query Revealed: Aggregate Values and Top Values By Group

Top Query Revealed: Aggregate Values and Top Values By Group

Last time, in Top Query Revealed: Simple Top Query, I talked about the simple top query. A simple top query returns the top values based on individual field values. I showed how this query:

SELECT TOP 10 Account, OrderNum,
[Pairs Shipped], [Total Price]
FROM Invoices
ORDER BY [Total Price] DESC;

Returned the following results:


Figure2: Results of simple Top query.

Top Query based on Aggregate Values

As useful the simple top query is, what if I wanted to find the top 5 Accounts by total sales rather than individual invoices? The solution is to use a Totals (or Aggregate) query.

SELECT TOP 5 Account,
Sum([Total Price]) AS [SumOfTotal Price]
FROM Invoices
GROUP BY Account
ORDER BY Sum([Total Price]) DESC;

Notice the addition of the Group By clause and the Sum function. The result of this query can be seen in Figure 3.

Figure 3: Result of top 5 accounts by total sales.

Another useful thing to do with a Top query is to return the top values for each of a group of data. For instance, suppose instead of the top 3 values for a particular Account, I wanted the top 3 values for all Accounts.

Top Query By Group

To do this, I need a correlated subquery. A subquery is a SELECT statement nested inside a SELECT, SELECT...INTO, INSERT...INTO, DELETE, or UPDATE statement or inside another subquery. A correlated subquery opens a separate instance of the table for each record in the main query, allowing you to compare the results of the subquery to results from the main query.

In this case, my subquery looks like this:

SELECT TOP 3 [Total Price]
FROM Invoices I2
WHERE Invoices.[Account] = I2.[Account]
ORDER BY I2.[Total Price] DESC

The key here is the From clause:

FROM Invoices I2

This creates an "alias" for the table, renaming it I2. This is important because when we embed this query within another query also based on the Invoice table, it needs to know which instance of Invoices we are asking for.

If we run the subquery as is, it will ask us for the value of Invoices.[Account] in a parameter prompt. If we give it an account number, say 391002, it will return the top 3 values for that account.

But when we embed this query in another query, instead of prompting us for the account number, it will take the account number from the each record of the main query, match it to the subquery, and return the top values for that record. The complete query looks like this:

SELECT Account, OrderNum, [Total Price]
FROM Invoices
WHERE [Total Price] In
(SELECT TOP 3 [Total Price]
FROM Invoices I2
WHERE Invoices.[Account] = I2.[Account]
ORDER BY I2.[Total Price] DESC)
ORDER BY Account, [Total Price] DESC;

And returns the values in Figure 4.


Figure 4: Shows the result of the query to show the top 3 invoices for each Account.

Notice, however that account 237001 has 4 values. Why is that? Because two records have the same value, Access cannot decide which to display, so it displays both. Duplicates can happen in both simple and group top queries, but the solution is different for each. I'll address that next in Top Query Problem: Ties.

A free sample illustrating the Top Query can be found here: http://www.rogersaccesslibrary.com/forum/forum_posts.asp?TID=233.

.

Wednesday, January 6, 2016

Intellectual Adventure: Using MS Access to search 250 Million Records

Guest Post by

Bob Goethe

I was reading the Wikipedia article on the MS Jet Database Engine (for which MS Access is essentially a frontend) and the writer commented that there was no essential difference between SQL Server and MS Jet in terms of the total volume of data they could store, since one could split the data across multiple MDB files.

So I sought to create a database backend containing 250 million records.  I did this by generating a collection of 25 mdb files, each containing 10 million records, where each record consisted of:
- RecID
- AuthorID
- Quote

The database was set up as a collection of literary quotes.  While doing the initial populating of the database, every record had a unique RecID, an AuthorID = 1 (i.e. Shakespeare), and a Quote (in the initial populating of the db, a quote from Julius Caesar).

The screen captures below will make this easier to grasp.

Screen Captures

Now, with any database where the backend is split across multiple files, you lose the chance to have the database engine itself enforcing relational integrity.  That task becomes the programmer's job to manage via his code.

(The trendy name for a non-relational database with lots of records is to call it a NoSQL database.  "NoRel" would be more accurate, but doesn't sound as cool.)

Access was capable of producing 5,500+ new records per second...but this still meant that creating 10 million records took a half hour.  So creating 250 million records took several hours of computer time, spread across several days.

The code was designed to search through the first MDB file whose path was stored in a table.  When it reached the end of the records in that file, it cascaded to the second MDB file.  It was irrelevant whether that MDB file was on the same physical box as the first or not.

This structure would lend itself to building a server-cluster of inexpensive Network Attached Storage (NAS) devices running Linux, though in my case I wanted to keep the speed up by eliminating any Ethernet bottlenecks.  So I kept all 25 MDB files (30 GB worth) on the hard drive of a single Win7 box.

This test was run using MS Access 2003.

Results of Test

The search of 1/4 billion records took 177 seconds to complete.  This means that the Jet database was processing 1.4 million records per second, including the time it took to close its connection with one MDB file and open a connection to the next one in line.

Code that Did the Work

Private Sub cmdExecute_Click()
    On Error GoTo Err_Routine
        Dim wrkspc As Workspace
        Dim dbs As Database, vAuthorName As String, vStart As Double, vEnd As Double, vTotalRecCount As Long
        Dim rstSrc As Recordset, rst As Recordset, vAuthorID As Long, vSrc As String, rstTarget As Recordset, rstAuthor As Recordset
1      Set rstSrc = CurrentDb.OpenRecordset("DataFiles")
2      Set rstAuthor = CurrentDb.OpenRecordset("SELECT * FROM Authors WHERE Authors.RecID = " & Me!AuthorID & ";")
3      vAuthorName = rstAuthor!AuthorName
4      vAuthorID = Me!AuthorID
5      CurrentDb.Execute "Delete * from SearchResults"
6      Set rstTarget = CurrentDb.OpenRecordset("SearchResults")
7      Set wrkspc = CreateWorkspace("", "Admin", "")
8      Me.Caption = "Search start: " & Format(Now, "hh:nn:ss")
9      vStart = Timer
10     Do Until rstSrc.EOF
11       vSrc = rstSrc!BackendFile
         ' I have a field on the form to tell me which backend file is being processed this moment.
12       Me!Text10 = vSrc
         'There are no linked tables in the frontend file; I establish a connection as it is required.
13       Set dbs = wrkspc.OpenDatabase(vSrc)
14       Set rst = dbs.OpenRecordset("SELECT * FROM Quotation WHERE (((Quotation.AuthorID)=" & vAuthorID & "));")
15       If rst.RecordCount > 0 Then
16           Do Until rst.EOF
17               rstTarget.AddNew
18                   rstTarget!RecID = rst!RecID
19                   rstTarget!Author = vAuthorName
20                   rstTarget!Quote = rst!Quote
21               rstTarget.Update
22               rst.MoveNext
23           Loop
24       End If
25       If Not dbs Is Nothing Then dbs.Close: Set dbs = Nothing
26       rstSrc.MoveNext
27     Loop
28     vFinish = Timer
29     vduration = vFinish - vStart
30     Me.Caption = Me.Caption & "    " & "End at " & Format(Now, "hh:nn:ss") & "   (" & Int(vduration) & " seconds)"
31     Me.Requery
Exit_Routine:
       Exit Sub
Err_Routine:
       MsgBox "Error " & Err.Number & " on line " & Erl & ": " & Err.Description
       Resume Exit_Routine
End Sub

Private Sub cmdCountRecs_Click()
    On Error GoTo Err_Routine
        Dim wrkspc As Workspace
        Dim dbs As Database, vAuthorName As String, vStart As Double, vEnd As Double, vTotalRecCount As Long
        Dim rstSrc As Recordset, rst As Recordset, vAuthorID As Long, vSrc As String, rstTarget As Recordset, rstAuthor As Recordset
1     Set rstSrc = CurrentDb.OpenRecordset("DataFiles")
2     Set rstTarget = CurrentDb.OpenRecordset("SearchResults")
3     Set wrkspc = CreateWorkspace("", "Admin", "")
4     Do Until rstSrc.EOF
5       vSrc = rstSrc!BackendFile
6       Me!Text10 = vSrc
7       Delay (0.25)
8       Set dbs = wrkspc.OpenDatabase(vSrc)
9       Set rst = dbs.OpenRecordset("Quotation")
10      rst.MoveLast
11      vTotalRecCount = vTotalRecCount + rst.RecordCount
12      If Not dbs Is Nothing Then dbs.Close: Set dbs = Nothing
13      rstSrc.MoveNext
14    Loop
15    Me!Text10 = Format(vTotalRecCount, "###,###,###") & " records searched."
Exit_Routine:
   Exit Sub
Err_Routine:
   MsgBox "Error " & Err.Number & " on line " & Erl & ": " & Err.Description
   Resume Exit_Routine
End Sub

Things to Watch Out For


My first effort involved storing 15 million records per MDB file...but I found that once files grew to 1.5-ish GB each, unpredictable behavior sometimes occurred.  E.g. compact the file, and suddenly RecID is no longer the primary key.  Try and re-designate it as primary key, and you get a message that there is insufficient space in the file to complete the request.

So while the specs on Jet databases say that MDB files of up to 2 GB are supported, my experience is that it is better to treat ~1 GB as a good, working limit.

Next Step


I have never been able to compare the *performance* of an Access database directly with a SQL Server database, since you almost never have the SAME data set hosted on the SAME equipment, and accessed from the SAME frontend.

My plan is to set up this test on a SQL Server box, and then run the two database engines head to head.

My expectation is that SQL Server will perform better, but cost a good bit more than the Access solution.  A cluster of multiple Windows Server Enterprise Edition machines running SQLS could cost well over $100K in hardware and licensing fees of various sorts.

The MS Jet solution, on the other hand, costs $130 for a retail copy of MS Access, a bank of NASes for less than $2000, and users can access the database with no need for additional licensing fees. All they need to do is download, at no charge, the Microsoft Access Runtime application.

What Does it All Mean?

Microsoft's problem - and it is a big one, from their own point of view - is that Access is simply too good, too powerful, too capable.

A good bit of Microsoft's marketing says that Access is just fine for homemakers to utilize in keeping track of their inventory of spices, and then printing out a shopping list to take to the Safeway store. But if you want to run a business, then you want a man's database. You want SQL Server.

Not every company has drunk the Cool Aide of Microsoft's SQL marketing campaign (though probably most 25 year old computer science graduates have). Enterprises the size of Alberta Covenant Health ($1B revenue/year) and even Alberta Health Services ($10B revenue/year) use MS Access databases in mission critical applications (much to the chagrin of some of their own SQL database admins).

If Access has a poor reputation among professional programmers, it is most likely because - since Access is included with MS Office - it is the database tool most often used by doctors or accountants who fancy themselves as computer savants. Of course, the applications created by these kinds of people are often total P.O.S., with no concept of relational data integrity or efficient data management. So sometimes a professional is called in to "tidy up the database" only to find that he has a real mess on his hands.  In this situation, it will be pretty easy for him to come away with a negative perception of Access.

But clearly, in the hands of a competent programmer, MS Access is good for more than just organizing the spice rack in your kitchen.

They say that if your only tool is a hammer, every problem looks like a nail. There are certainly problems that *require* SQLS. But the database professional who adds MS Access to his toolkit expands his resources for efficient problem solving. The possibilities are worth pondering.

Sample files are available for a limited time. See this post for details:
Intellectual Adventure- Using MS Access to search 250 Million Records–Part 2

Bob Goethe

BobPhoto

Bob is working for the Covenant Health Group of hospitals in Alberta, right now developing a seriously large document management-and-retrieval database.  Because of some of the politics that swirl around government healthcare in Canada, Covenant's management would like to have this project built in MS Access rather than SQL Server.  This sample database was built to test concepts that could be used in the operational document-management database.

Top Query Revealed: Simple Top Query

Top Query Revealed: Simple Top Query

Introduction

SQL, the query language of Microsoft Access, has many powerful, yet little known features. One of these is the Top Query. A Top query allows you to limit the results of a query to a certain number (or percentage) of records.

The term "top" is deceiving, because it implies "largest" values. However, this is not strictly the case. You can choose the top largest values or the top smallest values. As I'll show later, you can even choose random records.

Simple Top Query

Suppose I want to create a query which shows me the top 10 invoices (based on price) in my Invoices table.

To create a TOP query in SQL, add the TOP predicate immediately following the SELECT keyword followed by the number of records you want to see:

SELECT TOP 10 * FROM Invoices

But that's not the end of it. To define what is "top", you must sort on one or more fields, using use DESC for largest numbers and ASC for smallest numbers. (Technically, you don't need to add ASC if you wanted the smallest numbers because ascending order is default.) Adding that, the whole query looks like this:

SELECT TOP 10 *
FROM Invoices
ORDER BY [Total Price] DESC;

This will return 10 records with the largest values in the Total Price field. But you can also choose to see the top (or bottom) percent. So if I wanted to see the top 5% best sellers, I could do this:

SELECT TOP 5 PERCENT *
FROM Invoices
ORDER BY [Total Price] DESC;

Although I used the asterisk (*) in the above queries, I did so only for simplicity. Like any query, you can also include a field list in your query, like so:

SELECT TOP 10 Account, OrderNum,
[Pairs Shipped], [Total Price]
FROM Invoices
ORDER BY [Total Price] DESC;

You are not limited to the SQL view of a query to create a Top query, however. To create a TOP query in the Query Builder, open the Query Properties and look for the TOP property. See Figure 1.

Figure1: Shows the Query Builder window to create a Top query.

The drop down box gives you suggestions, but you are not limited to the values in the list. Choosing All, removes the Top predicate from the SQL statement and returns the query to a normal Select query.

The results of this query, whether created in the Query Builder or directly in SQL can be seen in Figure 2.

Figure2: Results of simple Top query.

Simple Top Query With Where Clause

I'm not limited to showing the top values for all invoices. I can also narrow the scope of the query with a Where clause just as you can with any Select query. For instance, if I wanted to see the top 3 invoices of a single Account (say, 237001), my query would look like this.

SELECT TOP 3 Account, OrderNum, [Total Price]
FROM Invoices
WHERE Account=237001
ORDER BY [Total Price] DESC;

Result of top 3 invoices for Account 237001.


As useful as this is, what if I wanted to find the top 10 Accounts in terms of total sales rather than individual invoices? Or suppose instead of the top 10 values for a particular Account, I wanted the top 10 values of each Account. I'll talk about that next time in: Top Query Revealed: Aggregate Values and Top Values By Group.

.