There are a number of circumstances when you as a developer might need to randomize data within a database. You may need to create sample data to test or demonstrate a database application. Or you may need to mask or "de-identify" confidential data to comply with regulatory agency rules. This article demonstrates two methods that can be used, either separately or in conjunction, to scramble your database.
Creating test data for a database application is always a difficult task. You need to create enough records to adequately test database performance. A design that works well with a hundred records, might not work with ten thousand. You also need to create data varied enough to mimic real world situations, data that will test the limits of the business rules. Developers have a tendency to create data that will work with their application, not data that will break it.
The very best source for creating test data is actual customer data. Nothing mimics real-world data like the real-world data. Unfortunately, your customers may not appreciate their data being used in this way. Worse yet, there may be regulatory considerations. For instance, the health care industry has to comply with HIPAA (Health Insurance Portability and Accountability Act) regulations, which has very strict compliance rules.
Let's start by considering the simplest case. You simply want to mask customer data without the necessity of complying with regulations. It is probably sufficient to simply randomize each field that could identify a record to effectively mask the data.
Tables 1 and 2 show the before and after of a small Customer table that has been randomized.
Table 1: Original Customer table
Table 2: The Customer table after randomization
Of course, with only three records, the randomization is not very random, but the larger the result set, the better the randomization will be. So let's see how to do this.
Overview of the Process
The first thing we'll do is create a temporary table that holds the values we want to randomize. This is best done with a Make Table query (SELECT...INTO). Then we'll start at the top record of the main table and copy random values from the temporary table back into the main table. As each value is copied back in, we'll delete it from the temporary table. Repeat this process for each field and the records will be thoroughly scrambled.
One caveat: This process won't work if the field is a Primary Key or has a Unique Index because it will temporarily cause duplicate records. So, you'll have to manually remove the constraint. Of course it is possible to remove and re-establish constraints programmatically, but that is beyond the scope of this article. Lastly, this code will not work with an Autonumber field under any circumstances.
Creating Temporary Table
The first thing we have to do is create a temporary table, called "tblRandom" to hold the field values. (I could have used an array to do the same thing, but since this is a database project, I prefer to use database objects.) As I said, we can use a SQL statement in the form of a Make Table Query. The following subroutine shows how to do that.
Sub CreateRandomTable(TableName As String, FieldName As String)
'*** This routine creates the tblRandom table
' used to hold the values to be randomized
On Error GoTo Err_CreateRandomTable
Dim strSQL As String
'*** create the query based on the passed arguments
strSQL = "SELECT [" & FieldName & "] INTO tblRandom " & _
"FROM [" & TableName & "];"
'*** delete the table if it exists
'*** run SQL make-table query
'*** if the error is 3265, the table does not exist
If Err.Number = 3265 Then
'*** then skip the Delete command
The error trapping code in the function is necessary because if the table "tblRandom" already exists, the Make Table query will fail. But the code traps for that error (3265) and skips the Delete command.
Now we are ready to randomize the fields. The following subroutine will do that.
Sub RandomizeTableField _
(TableName As String, _
FieldName As String)
'***This subroutine randomizes the indicated
' field in the selected table.
On Error GoTo Err_RandomizeTableField
Dim db As DAO.Database
Dim rsTarget As DAO.Recordset
Dim rsSource As DAO.Recordset
Dim i As Integer, upperbound As Integer
'*** used with the RND() function to return
' a random number
'*** open the database
Set db = CurrentDb
'*** open the source table (tblRandom)
Set rsSource = db.OpenRecordset("tblRandom", dbOpenTable)
'*** open target table (original table)
Set rsTarget = db.OpenRecordset(TableName, dbOpenTable)
'*** repeat loop until the all records have been randomized
Do Until rsTarget.EOF
upperbound = rsSource.RecordCount
'*** select a random record in the source
' table and move to it
If upperbound <> 1 Then
i = Int((upperbound - 1) * Rnd + 1)
'*** write value from source table into target
rsTarget(FieldName) = rsSource(0)
'*** delete that value from source table
MsgBox "Field: " & FieldName & " from Table: " & _
TableName & " has been scrambled." & vbCrLf & _
"If you removed an index, please recreate it."
'*** Clean up object variables
Set rsTarget = Nothing
Set rsSource = Nothing
Set db = Nothing
'*** error trapping to catch constraints
'*** if the field has a unique index
If Err.Number = 3022 Then
'*** display message informing user to remove index
MsgBox "The field that you have chosen: " & _
cboFieldName1 & vbCrLf & _
"Has a Unique Index. You must remove this " & _
"index to proceed." & vbCrLf & _
"Once you have scrambled the field, " & _
"you can restore the index."
Calling the Routine
Now, all that's left is to pull it all together with a calling routine. How you do that depends on how it is going to be used. For instance, this code could be placed in a form with combo boxes on the form (cboTable and cboFieldName) holding the table and field names. Then you could select the table and field from lists.
The following code uses such a set up and would be in the On Click event of the cmdRunRandomization command button of the form.
Private Sub cmdRunRandomization_Click()
'*** This code is the calling routine
On Error GoTo Err_cmdRunRandomization_Click
Call CreateRandomTable(cboTable, cboFieldName1)
Call RandomizeTableField(cboTable, cboFieldName1)
'*** delete temporary table
If Err.Number = 3265 Then
'*** then skip the Delete command and
' resume on the next line
Figure 1: Example form that could be used to run the field randomization code.
Using a form like above allows you to select a field to scramble. You can select, in turn, all of fields with identifying data. On my website, you can find a small sample database called Datascramble.accdb, which demonstrates this process, including the calling form.
Sometimes It’s Not Enough
But sometimes just re-arranging field values in the records is not enough. Sometimes, your fields contain sensitive data like Social Security Numbers or Patient Identifiers which can’t be displayed even if it’s attached to the wrong record.
In that case, you may want to scramble the characters within a field itself. I address this problem in De-identifying Data for Confidentiality - Part II