Wednesday, January 6, 2016

Intellectual Adventure: Using MS Access to search 250 Million Records

Guest Post by

Bob Goethe

I was reading the Wikipedia article on the MS Jet Database Engine (for which MS Access is essentially a frontend) and the writer commented that there was no essential difference between SQL Server and MS Jet in terms of the total volume of data they could store, since one could split the data across multiple MDB files.

So I sought to create a database backend containing 250 million records.  I did this by generating a collection of 25 mdb files, each containing 10 million records, where each record consisted of:
- RecID
- AuthorID
- Quote

The database was set up as a collection of literary quotes.  While doing the initial populating of the database, every record had a unique RecID, an AuthorID = 1 (i.e. Shakespeare), and a Quote (in the initial populating of the db, a quote from Julius Caesar).

The screen captures below will make this easier to grasp.

Screen Captures

Now, with any database where the backend is split across multiple files, you lose the chance to have the database engine itself enforcing relational integrity.  That task becomes the programmer's job to manage via his code.

(The trendy name for a non-relational database with lots of records is to call it a NoSQL database.  "NoRel" would be more accurate, but doesn't sound as cool.)

Access was capable of producing 5,500+ new records per second...but this still meant that creating 10 million records took a half hour.  So creating 250 million records took several hours of computer time, spread across several days.

The code was designed to search through the first MDB file whose path was stored in a table.  When it reached the end of the records in that file, it cascaded to the second MDB file.  It was irrelevant whether that MDB file was on the same physical box as the first or not.

This structure would lend itself to building a server-cluster of inexpensive Network Attached Storage (NAS) devices running Linux, though in my case I wanted to keep the speed up by eliminating any Ethernet bottlenecks.  So I kept all 25 MDB files (30 GB worth) on the hard drive of a single Win7 box.

This test was run using MS Access 2003.

Results of Test

The search of 1/4 billion records took 177 seconds to complete.  This means that the Jet database was processing 1.4 million records per second, including the time it took to close its connection with one MDB file and open a connection to the next one in line.

Code that Did the Work

Private Sub cmdExecute_Click()
    On Error GoTo Err_Routine
        Dim wrkspc As Workspace
        Dim dbs As Database, vAuthorName As String, vStart As Double, vEnd As Double, vTotalRecCount As Long
        Dim rstSrc As Recordset, rst As Recordset, vAuthorID As Long, vSrc As String, rstTarget As Recordset, rstAuthor As Recordset
1      Set rstSrc = CurrentDb.OpenRecordset("DataFiles")
2      Set rstAuthor = CurrentDb.OpenRecordset("SELECT * FROM Authors WHERE Authors.RecID = " & Me!AuthorID & ";")
3      vAuthorName = rstAuthor!AuthorName
4      vAuthorID = Me!AuthorID
5      CurrentDb.Execute "Delete * from SearchResults"
6      Set rstTarget = CurrentDb.OpenRecordset("SearchResults")
7      Set wrkspc = CreateWorkspace("", "Admin", "")
8      Me.Caption = "Search start: " & Format(Now, "hh:nn:ss")
9      vStart = Timer
10     Do Until rstSrc.EOF
11       vSrc = rstSrc!BackendFile
         ' I have a field on the form to tell me which backend file is being processed this moment.
12       Me!Text10 = vSrc
         'There are no linked tables in the frontend file; I establish a connection as it is required.
13       Set dbs = wrkspc.OpenDatabase(vSrc)
14       Set rst = dbs.OpenRecordset("SELECT * FROM Quotation WHERE (((Quotation.AuthorID)=" & vAuthorID & "));")
15       If rst.RecordCount > 0 Then
16           Do Until rst.EOF
17               rstTarget.AddNew
18                   rstTarget!RecID = rst!RecID
19                   rstTarget!Author = vAuthorName
20                   rstTarget!Quote = rst!Quote
21               rstTarget.Update
22               rst.MoveNext
23           Loop
24       End If
25       If Not dbs Is Nothing Then dbs.Close: Set dbs = Nothing
26       rstSrc.MoveNext
27     Loop
28     vFinish = Timer
29     vduration = vFinish - vStart
30     Me.Caption = Me.Caption & "    " & "End at " & Format(Now, "hh:nn:ss") & "   (" & Int(vduration) & " seconds)"
31     Me.Requery
       Exit Sub
       MsgBox "Error " & Err.Number & " on line " & Erl & ": " & Err.Description
       Resume Exit_Routine
End Sub

Private Sub cmdCountRecs_Click()
    On Error GoTo Err_Routine
        Dim wrkspc As Workspace
        Dim dbs As Database, vAuthorName As String, vStart As Double, vEnd As Double, vTotalRecCount As Long
        Dim rstSrc As Recordset, rst As Recordset, vAuthorID As Long, vSrc As String, rstTarget As Recordset, rstAuthor As Recordset
1     Set rstSrc = CurrentDb.OpenRecordset("DataFiles")
2     Set rstTarget = CurrentDb.OpenRecordset("SearchResults")
3     Set wrkspc = CreateWorkspace("", "Admin", "")
4     Do Until rstSrc.EOF
5       vSrc = rstSrc!BackendFile
6       Me!Text10 = vSrc
7       Delay (0.25)
8       Set dbs = wrkspc.OpenDatabase(vSrc)
9       Set rst = dbs.OpenRecordset("Quotation")
10      rst.MoveLast
11      vTotalRecCount = vTotalRecCount + rst.RecordCount
12      If Not dbs Is Nothing Then dbs.Close: Set dbs = Nothing
13      rstSrc.MoveNext
14    Loop
15    Me!Text10 = Format(vTotalRecCount, "###,###,###") & " records searched."
   Exit Sub
   MsgBox "Error " & Err.Number & " on line " & Erl & ": " & Err.Description
   Resume Exit_Routine
End Sub

Things to Watch Out For

My first effort involved storing 15 million records per MDB file...but I found that once files grew to 1.5-ish GB each, unpredictable behavior sometimes occurred.  E.g. compact the file, and suddenly RecID is no longer the primary key.  Try and re-designate it as primary key, and you get a message that there is insufficient space in the file to complete the request.

So while the specs on Jet databases say that MDB files of up to 2 GB are supported, my experience is that it is better to treat ~1 GB as a good, working limit.

Next Step

I have never been able to compare the *performance* of an Access database directly with a SQL Server database, since you almost never have the SAME data set hosted on the SAME equipment, and accessed from the SAME frontend.

My plan is to set up this test on a SQL Server box, and then run the two database engines head to head.

My expectation is that SQL Server will perform better, but cost a good bit more than the Access solution.  A cluster of multiple Windows Server Enterprise Edition machines running SQLS could cost well over $100K in hardware and licensing fees of various sorts.

The MS Jet solution, on the other hand, costs $130 for a retail copy of MS Access, a bank of NASes for less than $2000, and users can access the database with no need for additional licensing fees. All they need to do is download, at no charge, the Microsoft Access Runtime application.

What Does it All Mean?

Microsoft's problem - and it is a big one, from their own point of view - is that Access is simply too good, too powerful, too capable.

A good bit of Microsoft's marketing says that Access is just fine for homemakers to utilize in keeping track of their inventory of spices, and then printing out a shopping list to take to the Safeway store. But if you want to run a business, then you want a man's database. You want SQL Server.

Not every company has drunk the Cool Aide of Microsoft's SQL marketing campaign (though probably most 25 year old computer science graduates have). Enterprises the size of Alberta Covenant Health ($1B revenue/year) and even Alberta Health Services ($10B revenue/year) use MS Access databases in mission critical applications (much to the chagrin of some of their own SQL database admins).

If Access has a poor reputation among professional programmers, it is most likely because - since Access is included with MS Office - it is the database tool most often used by doctors or accountants who fancy themselves as computer savants. Of course, the applications created by these kinds of people are often total P.O.S., with no concept of relational data integrity or efficient data management. So sometimes a professional is called in to "tidy up the database" only to find that he has a real mess on his hands.  In this situation, it will be pretty easy for him to come away with a negative perception of Access.

But clearly, in the hands of a competent programmer, MS Access is good for more than just organizing the spice rack in your kitchen.

They say that if your only tool is a hammer, every problem looks like a nail. There are certainly problems that *require* SQLS. But the database professional who adds MS Access to his toolkit expands his resources for efficient problem solving. The possibilities are worth pondering.

Sample files are available for a limited time. See this post for details:
Intellectual Adventure- Using MS Access to search 250 Million Records–Part 2

Bob Goethe


Bob is working for the Covenant Health Group of hospitals in Alberta, right now developing a seriously large document management-and-retrieval database.  Because of some of the politics that swirl around government healthcare in Canada, Covenant's management would like to have this project built in MS Access rather than SQL Server.  This sample database was built to test concepts that could be used in the operational document-management database.


grovelli said...

"My plan is to set up this test on a SQL Server box, and then run the two database engines head to head."
So what's the test results?

SecurityPatrol said...

"But clearly, in the hands of a competent programmer, MS Access is good for more than just organizing the spice rack in your kitchen."

I love this quote. Thanks for the good read. I agree that Access is a powerful tool. I've used it professionally for almost 16 years. I have yet to find a single-user task that it could not handle. Multi-user is situational and requires good planning and a fast network. It does annoy me when IT snobs bad mouth Access just because it's Access.

Please forgive me for saying this, but Timer returns a single, and not a double. Have a great day.

Bob Goethe said...

Ah! SecurityPatrol, thanks for the correction on what Timer returns.

You are right about the need for planning and good networks...though SQLS also requires planning, even if it is of a different sort.

As for networks, it is our good fortune at Covenant Health Alberta (COV) to have a rock-solid network. I don't actually know how they manage it. Almost EVERY network in Alberta has hiccups during thunderstorm season.

Part of their magic is that they replace every piece of computing equipment on a schedule, whether or not that equipment is showing signs of failure.

This is quite different from what other companies I have been involved with do. The normal pattern is "it is working right let's leave the equipment in place until tomorrow." That strategy pretty much makes it *inevitable* that your switches will start dropping packets at some point -- and this will in turn corrupt an MDB backend.

So we can implement MS Access for a multi-user database and it works great for us.

Grovelli, I have not yet done the head-to-head SQLS test. Sorry. Part of why I was experimenting with the 250M record database is that I am working on a big COV doc mgmnt database, and I wanted to experiment with some techniques.

COV alleges that they have 80,000+ documents of various sorts, all of which are important in the running of the Medical Affairs department (which in turn manages their physician staff of 2,000+ doctors). Frankly, I cannot imagine that ANYbody needs 80,000 documents, no matter who they are.

But that said, I can easily imagine that they are quite correct when they say that they lose track of documents all the time. Somebody stores files in some deeply nested subfolder, and then nobody else can find them.

There are no off-the-shelf indexing apps that satisfy their I have created the *M*edical *A*ffairs *T*eam *R*esource Library *I*nde*x*...or "The Matrix". The process of indexing a file involves extracting full text for Word/Excel/PDFs and making provision for users to add meta tags of various sorts. Other files (like JPGs) of course have no full-text to extract, though meta tags are still relevant. And even if no other search fields exist, they find it helpful just to be able to perform a fast search for some fragment of the pathname.

The process of storing full text to facilitate searching has resulted in a backend MDB file of 800 MB for only 27,000 files. So given my experience that 1 GB is a good working maximum for an MDB file, going to a multiple-backend-files design with The Matrix, such as the 250M-record database used, is my #1 priority right now.

You know how it is, sometimes remunerative work gets in the way of intellectual-fun type experiments.

I must say that Access is so far giving me blindingly fast searches...even when I am including full-text as a search option, and even when I am doing proximity searching (which of course, involves rather more computer time, e.g. show me all documents where the phrase 'medication error' occurs within 35 words of the phrase 'The Goethe Memorial Hospital").

I am learning as I go on code-techniques for search optimization with this particular database. Searches that today take less than 1 second were taking 40 seconds to complete just two weeks ago.
Roger and I put our heads together to see if we couldn't post the full 250M-record database for people to download. But no matter how you zip it, it is simply too big.

I shall see if I can't zip it up and put the files on an FTP server here in my own office some time in the next week. I will post a note here when I have done so, and you can download my files and play with them a bit yourself.

I may end up making the files available for a few days until I max out my own bandwidth limitations, then make them unavailable until the next month rolls around.

I will communicate here and through the Access-L group just what I am doing.


SecurityPatrol said...

Where I work right now the network is relatively slow. As a result, Access files are just real slow if they have to open across the network. I've had to do some workarounds to make an Access tool usable such as creating loader scripts that run from the network and copy/setup the accdb front-end on the C drive, linking to back-end databases still on the network that contain only tables such as mapping data that must persist between runs, and sometimes copying those mapping tables locally temporarily before querying against them. Anything to cut down on the network traffic. Most of the company operates on Excel, so Access is used as a processing and automation tool. That means most of my operations now involve importing, processing, and exporting. It works out good there and the pay is good, so no arguing from me.

I used to work for a title insurance company that had a store of 36k forms and rate manuals. That's 36k Word docs on a network file share and 36k pdf's of those on an intranet portal for agents to use. The interface that the IT department made for those poor workers to maintain that portal involved working with only 1 file at a time through HTML forms. lol that's nuts!

I can agree with your limitation of keeping mdb files around 1GB. Another thing you can try if you have that much data is to use SQLite which has a file size limitation of around 2TB. If IT policies aren't restrictive, you could also set up the SQLite ODBC driver and use Access as a front-end to that engine by linking tables.

One problem you could run into with SQLite is that interfaces between the dll and VBA can be slow if not set up correctly. When I first tried it, I was using a COM wrapper called vbRichClient Framework to interface the direct api through VBA. It was rather slow, but there were some amazing features such as register-less COM and you could add custom VBA functions that you could use directly from SQLite. Sometimes worth the speed loss if the requirement is right.

I next moved to "SQLite for Excel" which is available on CodePlex. It's actually just a VBA module that has API declarations for interfacing directly to the native dll in 32/64 bit. I found that to be very fast and have actually built tools in Excel using it that move, index, and process 1 million rows in about 11 seconds. I'm sure it would run even faster if the databases were already established, but I needed to build something that could temporarily process that either in-memory or a temp db file and then destroy it.

Another thing to try would be the ODBC driver for SQLite, but I haven't tried that. Maybe I'll test that one day and create a blog about it ;)

Roger Carlson said...

Sample files are available for a limited time. See this post for details:
Intellectual Adventure- Using MS Access to search 250 Million Records–Part 2