Tuesday, December 5, 2017

De-identifying Data for Confidentiality - Part I


Data Scramble

There are a number of circumstances when you as a developer might need to randomize data within a database. You may need to create sample data to test or demonstrate a database application. Or you may need to mask or "de-identify" confidential data to comply with regulatory agency rules. This article demonstrates two methods that can be used, either separately or in conjunction, to scramble your database.

Creating test data for a database application is always a difficult task. You need to create enough records to adequately test database performance. A design that works well with a hundred records, might not work with ten thousand. You also need to create data varied enough to mimic real world situations, data that will test the limits of the business rules. Developers have a tendency to create data that will work with their application, not data that will break it.

The very best source for creating test data is actual customer data. Nothing mimics real-world data like the real-world data. Unfortunately, your customers may not appreciate their data being used in this way. Worse yet, there may be regulatory considerations. For instance, the health care industry has to comply with HIPAA (Health Insurance Portability and Accountability Act) regulations, which has very strict compliance rules.

Randomizing Records

Let's start by considering the simplest case. You simply want to mask customer data without the necessity of complying with regulations. It is probably sufficient to simply randomize each field that could identify a record to effectively mask the data.

Tables 1 and 2 show the before and after of a small Customer table that has been randomized.

CustomerID

CustomerNumber

FirstName

LastName

1

A003

Roger

Carlson

2

A045

Sam

Martin

3

A014

Sarah

Sutherland

Table 1: Original Customer table

CustomerID

CustomerNumber

FirstName

LastName

1

A045

Sarah

Carlson

2

A014

Sam

Sutherland

3

A003

Roger

Martin

Table 2: The Customer table after randomization

Of course, with only three records, the randomization is not very random, but the larger the result set, the better the randomization will be. So let's see how to do this.

Overview of the Process

The first thing we'll do is create a temporary table that holds the values we want to randomize. This is best done with a Make Table query (SELECT...INTO). Then we'll start at the top record of the main table and copy random values from the temporary table back into the main table. As each value is copied back in, we'll delete it from the temporary table. Repeat this process for each field and the records will be thoroughly scrambled.

One caveat: This process won't work if the field is a Primary Key or has a Unique Index because it will temporarily cause duplicate records. So, you'll have to manually remove the constraint. Of course it is possible to remove and re-establish constraints programmatically, but that is beyond the scope of this article. Lastly, this code will not work with an Autonumber field under any circumstances.

Creating Temporary Table

The first thing we have to do is create a temporary table, called "tblRandom" to hold the field values. (I could have used an array to do the same thing, but since this is a database project, I prefer to use database objects.) As I said, we can use a SQL statement in the form of a Make Table Query. The following subroutine shows how to do that.

Sub CreateRandomTable(TableName As String, FieldName As String)

'*** This routine creates the tblRandom table

' used to hold the values to be randomized

On Error GoTo Err_CreateRandomTable

Dim strSQL As String

'*** create the query based on the passed arguments

strSQL = "SELECT [" & FieldName & "] INTO tblRandom " & _

"FROM [" & TableName & "];"

'*** delete the table if it exists

CurrentDb.TableDefs.Delete "tblRandom"

'*** run SQL make-table query

CurrentDb.Execute strSQL

Exit_CreateRandomTable:

Exit Sub

Err_CreateRandomTable:

'*** if the error is 3265, the table does not exist

If Err.Number = 3265 Then

'*** then skip the Delete command

Resume Next

Else

MsgBox Err.Description

Resume Exit_CreateRandomTable

End If

End Sub

The error trapping code in the function is necessary because if the table "tblRandom" already exists, the Make Table query will fail. But the code traps for that error (3265) and skips the Delete command.

Randomizing Records

Now we are ready to randomize the fields. The following subroutine will do that.

Sub RandomizeTableField _

(TableName As String, _

FieldName As String)

'***This subroutine randomizes the indicated

' field in the selected table.

On Error GoTo Err_RandomizeTableField

Dim db As DAO.Database

Dim rsTarget As DAO.Recordset

Dim rsSource As DAO.Recordset

Dim i As Integer, upperbound As Integer

'*** used with the RND() function to return

' a random number

Randomize

'*** open the database

Set db = CurrentDb

'*** open the source table (tblRandom)

Set rsSource = db.OpenRecordset("tblRandom", dbOpenTable)

'*** open target table (original table)

Set rsTarget = db.OpenRecordset(TableName, dbOpenTable)

'*** repeat loop until the all records have been randomized

Do Until rsTarget.EOF

upperbound = rsSource.RecordCount

'*** select a random record in the source

' table and move to it

If upperbound <> 1 Then

i = Int((upperbound - 1) * Rnd + 1)

rsSource.Move i

End If

'*** write value from source table into target

rsTarget.Edit

rsTarget(FieldName) = rsSource(0)

rsTarget.Update

'*** delete that value from source table

rsSource.Delete

rsSource.MoveFirst

rsTarget.MoveNext

Loop

MsgBox "Field: " & FieldName & " from Table: " & _

TableName & " has been scrambled." & vbCrLf & _

"If you removed an index, please recreate it."

'*** Clean up object variables

rsTarget.Close

rsSource.Close

db.Close

Set rsTarget = Nothing

Set rsSource = Nothing

Set db = Nothing

'*** error trapping to catch constraints

Exit_RandomizeTableField:

Exit Sub

Err_RandomizeTableField:

'*** if the field has a unique index

If Err.Number = 3022 Then

'*** display message informing user to remove index

MsgBox "The field that you have chosen: " & _

cboFieldName1 & vbCrLf & _

"Has a Unique Index. You must remove this " & _

"index to proceed." & vbCrLf & _

"Once you have scrambled the field, " & _

"you can restore the index."

Resume Exit_RandomizeTableField

Else

MsgBox Err.Description

Resume Exit_RandomizeTableField

End If

End Sub

Calling the Routine

Now, all that's left is to pull it all together with a calling routine. How you do that depends on how it is going to be used. For instance, this code could be placed in a form with combo boxes on the form (cboTable and cboFieldName) holding the table and field names. Then you could select the table and field from lists.

The following code uses such a set up and would be in the On Click event of the cmdRunRandomization command button of the form.

Private Sub cmdRunRandomization_Click()

'*** This code is the calling routine

On Error GoTo Err_cmdRunRandomization_Click

Call CreateRandomTable(cboTable, cboFieldName1)

Call RandomizeTableField(cboTable, cboFieldName1)

'*** delete temporary table

CurrentDb.TableDefs.Delete "tblRandom"

Exit_cmdRunRandomization_Click:

Exit Sub

Err_cmdRunRandomization_Click:

If Err.Number = 3265 Then

'*** then skip the Delete command and

' resume on the next line

Resume Next

Else

MsgBox Err.Description

Resume Exit_cmdRunRandomization_Click

End If

End Sub

DataScrambleFigure1

Figure 1: Example form that could be used to run the field randomization code.

Using a form like above allows you to select a field to scramble. You can select, in turn, all of fields with identifying data. On my website, you can find a small sample database called Datascramble.accdb, which demonstrates this process, including the calling form.

Sometimes It’s Not Enough

But sometimes just re-arranging field values in the records is not enough. Sometimes, your fields contain sensitive data like Social Security Numbers or Patient Identifiers which can’t be displayed even if it’s attached to the wrong record.

In that case, you may want to scramble the characters within a field itself.  I address this problem in De-identifying Data for Confidentiality - Part II

No comments: