Saturday, January 16, 2016

Intellectual Adventure: Using MS Access to search 250 Million Records–Part 2

Guest Post by Bob Goethe

This post is a follow up for Intellectual Adventure: Using MS Access to search 250 Million Records

Downloading the Files...from now to January 21, 2016

I have zipped all the files for the 1/4 billion-record database (comprising 1.7 GB zipped, 30 GB unzipped) and made them available for download. You have my permission to do with them as you please, including modifying, and uploading them to other locations and sharing them, without limitation.

Unless you wish to include a text file attesting to how you owe all of your success in life to my example and mentoring-from-afar, you don't need to take any particular action to attribute the files to me.

My personal ISP account has some bandwidth limitations, so rather than letting you download directly from my server, I used YouSendIt/Hightail to share the files. The good news about YouSendIt is, so far as I can tell, there are no bandwidth limitations. Every Access-L member could download a copy and there is no problem.

The bad news is that the files I upload expire a week after I upload them. This means that you can download these files without restriction for the next few days, but the files will become unavailable in the afternoon of January 21.

There are 25 backend MDB files, plus a front end. You can download them all if you choose. You can store them in any folder you wish, but you must make path notations in the table called "DataFiles" so that the frontend knows where to look for its data. The data files can be, in principle, stored in any folder either locally or on a network share. They don't have to be stored in the same location. The frontend knows where to look for each file individually, based on the DataFiles table. The frontend file is called "MS_Jet Test.mdb".

If you would like to play with the database, but don't want to download *all* the files, you can choose to download one file only, called "". This will give you a database of 10,000,000 records only. Again, you will want to go into the DataFiles table in the front end and remove all references to 1.mdb up to 20.mdb if you want to do this.

Each zip file contains a collection of further zip files. This odd structure arose as a result of my experimenting with YouSendIt, to find the largest file I could successfully send while using a free account. I mixed-and-matched a bit before I decided that packages of 4 files were a good size to upload. In any case, just keep on unzipping, and eventually you will drill down to MDB files.

If you figure out other interesting things to do with, or to, this database, do share what you learned with this list.


Bob Goethe

Wednesday, January 6, 2016

Intellectual Adventure: Using MS Access to search 250 Million Records

Guest Post by

Bob Goethe

I was reading the Wikipedia article on the MS Jet Database Engine (for which MS Access is essentially a frontend) and the writer commented that there was no essential difference between SQL Server and MS Jet in terms of the total volume of data they could store, since one could split the data across multiple MDB files.

So I sought to create a database backend containing 250 million records.  I did this by generating a collection of 25 mdb files, each containing 10 million records, where each record consisted of:
- RecID
- AuthorID
- Quote

The database was set up as a collection of literary quotes.  While doing the initial populating of the database, every record had a unique RecID, an AuthorID = 1 (i.e. Shakespeare), and a Quote (in the initial populating of the db, a quote from Julius Caesar).

The screen captures below will make this easier to grasp.

Screen Captures

Now, with any database where the backend is split across multiple files, you lose the chance to have the database engine itself enforcing relational integrity.  That task becomes the programmer's job to manage via his code.

(The trendy name for a non-relational database with lots of records is to call it a NoSQL database.  "NoRel" would be more accurate, but doesn't sound as cool.)

Access was capable of producing 5,500+ new records per second...but this still meant that creating 10 million records took a half hour.  So creating 250 million records took several hours of computer time, spread across several days.

The code was designed to search through the first MDB file whose path was stored in a table.  When it reached the end of the records in that file, it cascaded to the second MDB file.  It was irrelevant whether that MDB file was on the same physical box as the first or not.

This structure would lend itself to building a server-cluster of inexpensive Network Attached Storage (NAS) devices running Linux, though in my case I wanted to keep the speed up by eliminating any Ethernet bottlenecks.  So I kept all 25 MDB files (30 GB worth) on the hard drive of a single Win7 box.

This test was run using MS Access 2003.

Results of Test

The search of 1/4 billion records took 177 seconds to complete.  This means that the Jet database was processing 1.4 million records per second, including the time it took to close its connection with one MDB file and open a connection to the next one in line.

Code that Did the Work

Private Sub cmdExecute_Click()
    On Error GoTo Err_Routine
        Dim wrkspc As Workspace
        Dim dbs As Database, vAuthorName As String, vStart As Double, vEnd As Double, vTotalRecCount As Long
        Dim rstSrc As Recordset, rst As Recordset, vAuthorID As Long, vSrc As String, rstTarget As Recordset, rstAuthor As Recordset
1      Set rstSrc = CurrentDb.OpenRecordset("DataFiles")
2      Set rstAuthor = CurrentDb.OpenRecordset("SELECT * FROM Authors WHERE Authors.RecID = " & Me!AuthorID & ";")
3      vAuthorName = rstAuthor!AuthorName
4      vAuthorID = Me!AuthorID
5      CurrentDb.Execute "Delete * from SearchResults"
6      Set rstTarget = CurrentDb.OpenRecordset("SearchResults")
7      Set wrkspc = CreateWorkspace("", "Admin", "")
8      Me.Caption = "Search start: " & Format(Now, "hh:nn:ss")
9      vStart = Timer
10     Do Until rstSrc.EOF
11       vSrc = rstSrc!BackendFile
         ' I have a field on the form to tell me which backend file is being processed this moment.
12       Me!Text10 = vSrc
         'There are no linked tables in the frontend file; I establish a connection as it is required.
13       Set dbs = wrkspc.OpenDatabase(vSrc)
14       Set rst = dbs.OpenRecordset("SELECT * FROM Quotation WHERE (((Quotation.AuthorID)=" & vAuthorID & "));")
15       If rst.RecordCount > 0 Then
16           Do Until rst.EOF
17               rstTarget.AddNew
18                   rstTarget!RecID = rst!RecID
19                   rstTarget!Author = vAuthorName
20                   rstTarget!Quote = rst!Quote
21               rstTarget.Update
22               rst.MoveNext
23           Loop
24       End If
25       If Not dbs Is Nothing Then dbs.Close: Set dbs = Nothing
26       rstSrc.MoveNext
27     Loop
28     vFinish = Timer
29     vduration = vFinish - vStart
30     Me.Caption = Me.Caption & "    " & "End at " & Format(Now, "hh:nn:ss") & "   (" & Int(vduration) & " seconds)"
31     Me.Requery
       Exit Sub
       MsgBox "Error " & Err.Number & " on line " & Erl & ": " & Err.Description
       Resume Exit_Routine
End Sub

Private Sub cmdCountRecs_Click()
    On Error GoTo Err_Routine
        Dim wrkspc As Workspace
        Dim dbs As Database, vAuthorName As String, vStart As Double, vEnd As Double, vTotalRecCount As Long
        Dim rstSrc As Recordset, rst As Recordset, vAuthorID As Long, vSrc As String, rstTarget As Recordset, rstAuthor As Recordset
1     Set rstSrc = CurrentDb.OpenRecordset("DataFiles")
2     Set rstTarget = CurrentDb.OpenRecordset("SearchResults")
3     Set wrkspc = CreateWorkspace("", "Admin", "")
4     Do Until rstSrc.EOF
5       vSrc = rstSrc!BackendFile
6       Me!Text10 = vSrc
7       Delay (0.25)
8       Set dbs = wrkspc.OpenDatabase(vSrc)
9       Set rst = dbs.OpenRecordset("Quotation")
10      rst.MoveLast
11      vTotalRecCount = vTotalRecCount + rst.RecordCount
12      If Not dbs Is Nothing Then dbs.Close: Set dbs = Nothing
13      rstSrc.MoveNext
14    Loop
15    Me!Text10 = Format(vTotalRecCount, "###,###,###") & " records searched."
   Exit Sub
   MsgBox "Error " & Err.Number & " on line " & Erl & ": " & Err.Description
   Resume Exit_Routine
End Sub

Things to Watch Out For

My first effort involved storing 15 million records per MDB file...but I found that once files grew to 1.5-ish GB each, unpredictable behavior sometimes occurred.  E.g. compact the file, and suddenly RecID is no longer the primary key.  Try and re-designate it as primary key, and you get a message that there is insufficient space in the file to complete the request.

So while the specs on Jet databases say that MDB files of up to 2 GB are supported, my experience is that it is better to treat ~1 GB as a good, working limit.

Next Step

I have never been able to compare the *performance* of an Access database directly with a SQL Server database, since you almost never have the SAME data set hosted on the SAME equipment, and accessed from the SAME frontend.

My plan is to set up this test on a SQL Server box, and then run the two database engines head to head.

My expectation is that SQL Server will perform better, but cost a good bit more than the Access solution.  A cluster of multiple Windows Server Enterprise Edition machines running SQLS could cost well over $100K in hardware and licensing fees of various sorts.

The MS Jet solution, on the other hand, costs $130 for a retail copy of MS Access, a bank of NASes for less than $2000, and users can access the database with no need for additional licensing fees. All they need to do is download, at no charge, the Microsoft Access Runtime application.

What Does it All Mean?

Microsoft's problem - and it is a big one, from their own point of view - is that Access is simply too good, too powerful, too capable.

A good bit of Microsoft's marketing says that Access is just fine for homemakers to utilize in keeping track of their inventory of spices, and then printing out a shopping list to take to the Safeway store. But if you want to run a business, then you want a man's database. You want SQL Server.

Not every company has drunk the Cool Aide of Microsoft's SQL marketing campaign (though probably most 25 year old computer science graduates have). Enterprises the size of Alberta Covenant Health ($1B revenue/year) and even Alberta Health Services ($10B revenue/year) use MS Access databases in mission critical applications (much to the chagrin of some of their own SQL database admins).

If Access has a poor reputation among professional programmers, it is most likely because - since Access is included with MS Office - it is the database tool most often used by doctors or accountants who fancy themselves as computer savants. Of course, the applications created by these kinds of people are often total P.O.S., with no concept of relational data integrity or efficient data management. So sometimes a professional is called in to "tidy up the database" only to find that he has a real mess on his hands.  In this situation, it will be pretty easy for him to come away with a negative perception of Access.

But clearly, in the hands of a competent programmer, MS Access is good for more than just organizing the spice rack in your kitchen.

They say that if your only tool is a hammer, every problem looks like a nail. There are certainly problems that *require* SQLS. But the database professional who adds MS Access to his toolkit expands his resources for efficient problem solving. The possibilities are worth pondering.

Sample files are available for a limited time. See this post for details:
Intellectual Adventure- Using MS Access to search 250 Million Records–Part 2

Bob Goethe


Bob is working for the Covenant Health Group of hospitals in Alberta, right now developing a seriously large document management-and-retrieval database.  Because of some of the politics that swirl around government healthcare in Canada, Covenant's management would like to have this project built in MS Access rather than SQL Server.  This sample database was built to test concepts that could be used in the operational document-management database.

Friday, November 20, 2015

What Is A Query?

What is a query?
A query is a multi-purpose object in Microsoft Access. It can be used to display data from a single table or multiple tables, perform calculations on your data, modify data within your tables, and create or modify the structure of tables, indexes or relationships. Basically, anything that uses Structured Query Language (SQL) is called a query.

It is important to realize that queries are dynamic, that is, they allow you to view or manipulate data stored in your tables. I had one customer who complained that she was losing data from her database. Turned out she was creating queries and then deleting rows from the query that she didn't want to see. She didn't realize that in doing so, she was actually deleting the rows from the table the query was based on.

Query Builder View vs. SQL View
Access has two ways to create or modify a query. The Query Builder View allows you to create queries in a graphical interface. SQL View allows you to create queries directly in the SQL language. The two views are interactive. In other words, a query created in the Query Builder will be viewable and editable in the SQL View and vice versa. A good way to learn SQL is to create one in the Query Builder and view it in the SQL View.

Types of Queries

Select queries
Select queries create a dynamic view into the database. In fact, in other SQL implementations like SQL Server and Oracle, Select Queries are called Views. They allow you to see just what you want out of your tables. You can choose to see all columns or just selected ones. You can also restrict the number of rows based on criteria. (see: Select Queries Part 1: Simple Queries)

One important feature of a query is its ability to join multiple tables together on one or more common fields. If you have normalized your database properly (see: What is Normalization?), this is how you put your data back together again. For instance, if you had a CustomerID field in two tables: Customers and Orders, you could join the two tables on the CustomerID field to link all Customers with their respective Orders. (see: What is a Join: Part 1 (Introduction))

Aggregate (Totals) Queries
There are times when you want to aggregate your data, that is, you want to group your data and apply some mathematical function each group. The most common functions are sum and average. Access provides a type of query called a Totals Query that will do this.

Crosstab Queries
A Crosstab Query is a special kind of aggregate query, which calculates an aggregate function (like sum or average), and groups the resulting dataset by two sets of values — one down the side and another across the top. Crosstab queries are particularly useful for displaying summarized data in an easy to understand format.

Action Queries (Data Manipulation Language Queries)
Action Queries (also called Data Manipulation Language Queries or DML) make changes to the data in your tables. You can delete all or some records from your table (see Action Queries: Delete Query). You can change the value of a field or fields in all or some records in your table (see Action Queries: Update Query). You can append records to a table, either selecting records from another table or append specific values (see Action Queries: Append Query). Lastly, you can create a new table from selected record from an existing table (see Action Queries: Make-Table Query).

SQL Specific Queries
The only thing SQL Specific queries have in common is that they cannot be represented in the Query Builder window, so once created, you can never go back to the Query Builder window.

DDL Queries (Data Definition Language Queries)
DDL queries (also called Data Definition Language queries) allow you to create new tables and indexes modify existing ones.

Union Queries
Union queries combine one or more tables. However instead of combining them row-wise like the Join, the union combines tables column-wise. (see: Union Query: Part 1)

Pass-through Queries
A pass-through query is used to return records or manipulate data in a server-based database like SQL Server or Oracle. It simply passes the query on to the host database engine, bypassing the Jet database engine altogether. A pass-through query must be written in the SQL dialect of the host engine. For SQL Server, that would be T-SQL. For Oracle, it would be SQL *Plus.

Updateable vs. Non-updateable Queries

A query can be updateable or non-updateable, depending on a lot of conditions. If a Select Query is updateable, you can actually add, delete and modify records in the base table through the query. In Access, the * button on the Navigation Bar will be visible if the query is updateable. It will be grayed out if it is not. The actual conditions under which a query is updateable can be found in the Access Help System.

Action queries can also be updateable or not depending on the same conditions as the Select query. This is based on whether the dataset returned in the FROM clause of the action query is updateable.

Some queries are inherently non-updateable. These include Union queries, Crosstab queries, and Totals queries. Non-updateable queries can only be used to display data. For more detailed information see "This Recordset Is Not Updateable. Why?"

Creating and Using Queries in Code

Some SQL dialects, like T-SQL, allow you to use procedural constructs like IF statements and Loops, in a query. Access SQL does not. However, Access allows you to create and execute queries in Visual Basic procedures through the use of DAO (Data Access Objects), which is a library of objects that allows you to programmatically manipulate the database.


Monday, November 16, 2015

What is a Primary Key?

If I have a small company, any individual employee can be distinguished by a combination of First Name, Middle Initial, and Last Name. This combination uniquely identifies each employee.

If I have a larger company, the chances increase that I could have two employees that have the same name. So the name can no longer be used to uniquely identify a record. So instead, I could use a number like Social Security Number.

A primary key is a special kind of index (see What is an Index?) that is composed of a field (SSN) or combination of fields (First/Middle/Lastname), which uniquely identify a record.
A primary key has a number of useful properties.

First, the value in the primary key cannot be duplicated. If it is a single field, that value cannot be repeated. If it is composed of multiple fields, that combination of values cannot be duplicated. So in the samples above, having a primary key would mean that I could not put two "Roger J Carlson"s in my database.

Secondly, the primary key cannot be NULL and no portion of the primary key (in the case of a multiple field key) can be NULL. (See What does NULL mean?). The NULL means the value of the field is unknown. Obviously, if we don't know the value of the fields, we can't guarantee the value is unique. So disallowing NULLS guarantees we have a valid value in the key fields.

Thirdly, in order to create relationships, there must be a unique index on the field. Since by definition, a primary key is a unique index, a primary key makes an ideal join field. I will discuss this in a later post.

There are two basic types of primary keys: Natural Keys and Surrogate Keys.

Natural keys
A natural key is one composed of a field or fields that already exist in the table. In my examples above, both Social Security Number and Firstname/Middle/Lastname are natural keys. Natural keys can be composed of a single field or multiple fields.

It is important to note that a table can only have a *single* primary key. It is incorrect to say that a table has multiple primary keys. In the case of a multi-field primary key, it has a single primary key composed of multiple fields.

Surrogate keys
A surrogate key is an artificially created number. It has no real-world meaning, and is used mostly in relationships with other tables. In Access, you use the Autonumber datatype to create a surrogate key. This number is system-created and is guaranteed to be unique.

The disadvantage of a surrogate key is that it does not have real-world uniqueness. It would be possible to enter two records for Roger J Carlson, each with a different system-created number. To protect against that, you should also create a unique index on those fields that would otherwise create a natural key.

The advantage of a surrogate key is that it will never be affected by real-world changes to the database. It is also much more efficient to join tables on a single number than on multiple text fields.

Opinions differ, but I prefer a surrogate primary key with a separate unique index. This separates the functions of the primary key: the surrogate key for relationships and the unique index to control real-world uniqueness. In this way, if the conditions that effect the real-world uniqueness (as in the case of moving from a small business to a large business mentioned above), the table relationships will not be disturbed.

Creating a primary key in the Access User Interface is easy. Just open the table in Design View. For a surrogate or single field primary key, select a single record and click the Primary Key button.


For a multi-field primary key, just hold the Control [Crtl] key as you select the fields and then click the primary key button.

Why have a Surrogate Key AND a Unique Index

The purpose of a primary key is to create a unique record. But uniqueness from a database engine perspective is not necessarily the same as uniqueness from a real-world perspecive.
Natural keys perform both these functions.

However, surrogate keys only perform one of them, i.e. identifying records for the database engine to join. It does not, however, help the database user very much to identify unique entities.
For instance, suppose I a customer (Roger J Carlson) to a customer table and the surogate key gets assigned a value of 233. Now, another data entry person also enters the same customer. This copy of Roger J Carlson is assigned a value of 245.

Both Roger J Carlson's refer to the same real-world customer. Which one is right? Impossible to know.

So, while the record is unique from the database engine perspective, it's not from a real-world perspective.

To guard against that, when you use a surrogate key, you also need to create a unique index on a field or combination of fields that will uniquely identify the real-world Roger J Carlson.
Once Roger J Carlson has been entered, it will there after be impossible to enter another Roger J Carlson.

Obviously, in a large database, such a unique key wouldn't work. However, I could use Name, Gender, Birthdate (or some other unique combination of real-world attributes).

Wednesday, November 11, 2015

What is an Index?

An index is a method of cataloging the records in a table to increase the speed and efficiency of retrieving them. You can think of it as a library card catalog.

If the stacks in a real-world library represent your table, the library card catalog is the index. The library card catalog stores just the information necessary to identify a particular book: author, title, and genre, plus a number that identifies where the book can be found in the stacks.
With this number, you can go directly to the location of your book. Without it, you would have to start at the beginning of the stacks and look at each book until you found the one you wanted. (This assumes that the books are not stored in any particular order.)

An index works the same way. It stores the value of a field or combination of field and the location of each record that match it. With this information, a query that sorts or searches on the indexed field(s) can go directly to the records. Without it, it has to start at the beginning of a table and look at each record until it finds the ones it wants.

There are a number of different kinds of indexes: simple indexes, multi-field indexes, unique indexes, and clustered indexes.

Simple indexes are indexes on a single field. They may or may not allow duplicate values. Their main use is for searching and sorting records in a table.

Multi-field indexes are indexes across multiple fields. These also may or may not allow duplicate values. This is different than having simple indexes on multiple fields. In a multi-field index, the combination of records is indexed.

Unique indexes do not allow duplicate values. They may be either simple or multi-field. If it is multi-field, it will allow duplicates in individual fields within the indexed fields, but it will not allow a duplicate across the all of the fields in the index. This is useful to make sure you don't have duplicate records in your table.

Clustered indexes control how the records are stored in the database. If you have a clustered index, the records will actually be stored in that order. A table can have only one clustered index. This makes sense since records in a table can only be physically stored in one order.

Primary Keys are a special type of index. In Access, the primary key is both a unique index and a clustered index. It has one additional property in that no field in the index can be NULL. (see What does NULL mean? ). There can be only one primary key in a table. For more on primary keys, their creation and uses see: What is a Primary Key?

How do you create an index?

For simple indexes, it's easy. Just go to the Design View of the table and select a field. In the Properties, you'll see an Indexed property. Select one of the "Yes" options. To make it a unique index by selecting Yes (No Duplicates).

For multi-field indexes, it's a little more complicated. The simplest thing to do is create a simple index on the first field in your index.


Then on the line directly below it, add another field WITHOUT giving it an Index Name. If the Index Name field is blank, Access will assume the field is part of the index directly above it.


Add as many fields to the index as you want. From the Indexes Window, you can also set the Unique, Primary, and Allow Nulls properties.

A word of warning about indexes. While they do speed up the retrieval of records, they also slow down the insertion of records. Every time a new record is added, all the indexes need to be rebuilt. If you have too many indexes and a lot of records, this can be a problem. Therefore, it is best to just index those fields that really need indexing. These would include fields that you will be sorting or searching on or fields that participate in Relationships or Joins.

Sunday, November 8, 2015

What does NULL mean? How is it different than the Empty String?

One database concept given to much misunderstanding is the NULL value. What is it?
Null means that the value is unknown. This is different from the Empty String. The Empty String ("") means we know what the value is, and the value is, well, empty. That is, we know that there ISN'T a value.
Consider the case of the middle initial. Some people have a middle initial and some don't. Of those that do, we may or may not know what it is. If I leave it Null, it says that I don't know whether they have one or not. However, I can set the field to "" (Empty string) which says that they DO NOT have a middle initial.
Why is this difference important? Primary keys, for one thing.
Let's assume that my primary key is a composite key composed of FirstName, MiddleInitial, LastName. A primary key CANNOT have a Null as a part of it, thus we can set the value of MiddleInitial to "", and it will accept it. But we cannot leave the value Null.
Why can't a Primary Key have a Null value? Well aside from the fact that Access won't let it, the fact that Null means "I don't know the value" means that it cannot positively guarentee that the field is unique, which is one of the properties of the primary key.
The following four subs illustrate the differences:
Sub test()
If Null = Null Then
MsgBox "True"
MsgBox "false"
End If
End Sub
Running this sub will ALWAYS evaluate to False. Why? Because we don't know what the value of NULL is, so we can't say if it's equal to NULL.
Sub test2()
If IsNull(Null) Then
MsgBox "True"
MsgBox "false"
End If
End Sub
This evaluates to True because the IsNull() is specifically designed to test for NULL.
Sub test3()
If "" = "" Then
MsgBox "True"
MsgBox "false"
End If
End Sub
This evaluates to True because the empty string is a known value.
Sub test4()
If IsNull("") Then
MsgBox "True"
MsgBox "false"
End If
End Sub
False because "" is known, therefore it cannot be unknown.
As I said, NULL=NULL always evaluates to False. Since we don't know the value of Null, we can't say what it equals. This is why you never use NULL this way, but always use the IsNull() function.
Further, Len(Null) does evaluates to Null, as indeed do most things you *compare* with or *do to* Null, ie. Null=Null, X=Null, Null=True, cint(Null), etc.
Len("") ("" representing the 'Empty String' or 'Zero Length String') evaluates to 0 (zero). Because the length of a string of zero length is zero (duh!, my daughter would say).
The problem with explaining NULL, is all the good words are used for something else. You can't define it with itself, so saying "Null is Null" is null content (forgive the expression). You can't use "nothing", because that is an object pointer used to de-allocate object variables. You can't use "empty", because that is a constant specifically for use with a Variant variable. You can't use Zero either, because that represents a specific numeric value, rather than the absence of a value.
I guess we'll have to blame E.F. Codd who declared that any relational database implementation must have a special value called "Null" which represents the "absence of anything" or “unknown” condition for all datatypes.
The best we can do is to say that the value of Null is 'we don't know', which is not to say that we don't know what Null is ... I'd better stop there.

Friday, April 17, 2015

New Video: Ambiguous Outer Joins

In conjunction with Webucator Training Services, I'm pleased to present a video version of one of my most popular posts: Ambiguous Outer Joins