Thursday, December 21, 2017

De-identifying Data for Confidentiality - Part II

For part I, see De-identifying Data for Confidentiality - Part I

De-identifying Field Data

In certain instances, simply randomizing records is not sufficient. Certain field values have inherent information regardless of the record that holds them. Phone numbers, social security number, patient numbers, and the like all contain information all by themselves that may breach confidentiality. HIPAA regulations in particular are very strict about displaying any data that could identify a patient.

So to further mask your data, it may be necessary to scramble character data with in the field itself. Now, it doesn't make sense to randomize some types of fields. Name and address fields in particular will look strange if randomized in this way.

CustomerID

CustomerNumber

FirstName

LastName

1

0A30

Rgroe

arConsl

2

450A

aSm

rianMt

3

14A0

aharS

ltnrSadueh

Not only does it look strange, the capitalization of the names makes it fairly easy to re-identify the field values. So this function needs to be used with some discretion. It is best used with character data that is composed of numbers.

A further complication is that sometimes character data uses an input mask that displays characters that may either be stored or not. For instance, when you see a social security number in a field 111-22-3333, you don't really know if the number was stored as displayed or stored as 111223333. If the dashes are stored, the number might be randomized as 2-33311-13. So any function that randomizes character data must take these complications into account.

Overview of the Process

This process is similar to the record randomization process, but it only works on character data. Instead of creating a temporary table that holds all the records, we'll create two temporary string variables, one to hold value of the field we want to randomize and another to hold the randomized value. Then we'll grab random characters from the Source string and append them one at time into the Target string. We'll start at the top record of the main table, randomize the character string, write it back into the record, then proceed through the entire table until all values of the field have been randomized.

Unlike the record randomization process, this process will work on fields with Unique Indexes and Primary Keys. In fact, if a field has a unique index, you should not remove it. It is entirely possible to create duplicate records at random, so I will include error trapping to identify duplicate records. If one is discovered, it will randomize the string again until it creates a non-duplicate field value.

While this process will work on a Primary Key field, you should be careful. If this field participates in a Relationship, you will lose relational integrity. That is, the values in the primary key will no longer match values in the foreign key of the related table. If you do this, you should make certain your relationships have Cascade Updates property set to Yes.

This is not an issue, however, if your Primary Key is an Autonumber field for two reasons. One, an Autonumber is a long integer rather than character data, and secondly, you cannot change the value of an Autonumber.

Randomizing Character Fields

Sub RandomizeCharacterField _

(TableName As String, _

FieldName As String, _

FieldFormat As Variant)

'***This subroutine randomizes the characters of the

' indicated field in the selected table. It requires

' a Reference to Microsoft DAO 3.6 Object Library

On Error GoTo Err_RandomizeCharacterField

Dim db As DAO.Database

Dim rsTableName As DAO.Recordset

Dim i As Integer, StrLen As Integer

Dim strSource As String

Dim strTarget As String

Dim LeftSource As String

'*** used with the RND() function to return a random number

Randomize

'*** open the database

Set db = CurrentDb

'*** open the source table (tblRandom)

Set rsTableName = db.OpenRecordset(TableName)

'*** If the field format is Null, set it to the empty string

FieldFormat = Nz(FieldFormat, "")

'Loop thru the table, starting with the first record

Do Until rsTableName.EOF

'return label: used if a duplicate is produced

ReRandomize:

'*** set temp string to the field value

' and set receiving string to the empty string

strSource = rsTableName(FieldName)

strTarget = ""

'*** Repeat loop until all characters have been

' removed from temp string

Do While Len(strSource) > 0

     StrLen = Len(rsTableName(FieldName))

'*** select a character position at random

     If StrLen <> 1 Then

         i = Int((StrLen - 1) * Rnd + 1)

     End If

'*** grab the selected character and append

' it to the target string

     If InStr(FieldFormat, Mid(strSource, i, 1)) = 0 Then

          strTarget = strTarget & Mid(strSource, i, 1)

     End If

'*** delete the selected character from the Source String

     strSource = Left(strSource, i - 1) & _

          Mid(strSource, i + 1)

Loop

'*** when the Target String is complete, write it back to

'*** the original table

rsTableName.Edit

'*** apply the appropriate format to the string

rsTableName(FieldName) = Format(strTarget, FieldFormat)

rsTableName.Update

'*** proceed to the next record

rsTableName.MoveNext

Loop

MsgBox "Field: " & FieldName & " from Table: " & _

     cboTable & " has been character scrambled."

'*** Clean up object variables

rsTableName.Close

db.Close

Set rsTableName = Nothing

Set db = Nothing

Exit_RandomizeCharacterField:

Exit Sub

Err_RandomizeCharacterField:

If Err.Number = 3022 Then

'*** if there is a duplicate value in a field

' with a unique index, randomize it again

Resume ReRandomize

Else

MsgBox Err.Description

MsgBox Err.Number

Resume Exit_RandomizeCharacterField

End If

End Sub

Calling the Routine

In the Character Randomizing routine above, I have a line of code that applies the appropriate format. However, I don't say how you'll know what the appropriate format is. That's because we'll decide this in the calling routine.

Whether we send a format string to the randomization routine depends on whether the field has an input mask. It also depends on what kind of input mask it has.

There are three possibilities for the InputMask property of a field: an input mask that saves the characters, an input mask that does not save display characters, and no input mask at all. However for our purposes, an input mask that does not save characters is the same as no input mask at all. So all we have to test for is an input mask that saves display characters.

An input mask has three sections. The first is the input mask itself. The second is tells the mask whether to save the display characters or not, and the third tells what kind of placeholder character the input mask will display. Semi-colons divide each section.

We are only concerned here with the second section. If the input mask saves the character, the second section holds a zero. A one or nothing indicates the mask will not save the display characters. Therefore, we can test the input mask for ";0;". If this string does not appear in the input mask, the display characters are not saved.

The InputMask property is one of those properties that don't exist for the object unless a value is assigned to it. So if a field does not have an input mask, trying to read the property will produce an error. Therefore, I will trap for that error (3270) to cover the case where there is no input mask.

Again, how you implement this code depends on how it is going to be used. As with the Record Randomizing routine, I chose to create a form with combo boxes on the form (cboTable and cboFieldName) to hold the table, field names, and one for the format. I also added a control that displays the input mask for identification purposes.

The following code uses such a set up and would be in the On Click event of the cmdRunCharRandomization command button of the form.


Private Sub cmdRunCharRandomization_Click()

'*** This code is the calling routine.

On Error GoTo Err_cmdRunCharRandomization_Click

Dim db As DAO.Database

Dim tdf As DAO.TableDef

Dim fld As DAO.Field

Dim Mask As String

Set db = CurrentDb

Set tdf = db.TableDefs(cboTable)

Set fld = tdf.Fields(cboFieldName1)

'*** Read the InputMask property of the field

Mask = fld.Properties("InputMask")

'*** if the InputMask of the field shows that extra

' characters are saved, call the randomization

' routine and include the FieldFormat string.

If InStr(Mask, ";0;") > 0 Then

     Call RandomizeCharacterField( _

         cboTable, cboFieldName1, cboFieldFormat)

    '*** but if the InputMask shows characters are not

     ' saved or if there is no InputMask, call the

     ' randomization routine without sending the

    ' FieldFormat string.

Else

      Call RandomizeCharacterField( _

           cboTable, cboFieldName1, "")

End If

Exit_cmdRunCharRandomization_Click:

Exit Sub

Err_cmdRunCharRandomization_Click:

'*** If there is no mask, it will return a

' Property Not Found error

If Err.Number = 3270 Then

Resume Next

ElseIf Err.Number = 3265 Then

MsgBox "Please select both a table and a field"

Resume Exit_cmdRunCharRandomization_Click

Else

MsgBox Err.Description

MsgBox Err.Number

Resume Exit_cmdRunCharRandomization_Click

End If

End Sub

CharacterScrambleFigure2

Figure 2: Example form that could be used to run the character randomization code.

Using a form like above allows you to select a field to scramble. You can select, in turn, all of fields with identifying data. On my website, you can find a small sample database called CharacterScramble.accdb, which demonstrates this process, including the calling form.

One caution. If you are randomizing fields to de-identify data to comply with regulatory rules, you should get approval for using this process with your regulatory compliance officer. For instance, HIPAA rules specify that if you de-identify data, you must use an algorithm which prevents the data from being reconstructed. Since this routine selects characters at random, I believe it complies with this instruction, but only your regulatory officer would know for sure.

Conclusion

There are many reasons for masking, blinding, scrambling, or de-identifying data within a database. Creating realistic data for demonstration purposes and masking data to comply with regulatory rules are just two. But with a little programming expertise, it doesn't have to be an onerous task.

Tuesday, December 5, 2017

De-identifying Data for Confidentiality - Part I


Data Scramble

There are a number of circumstances when you as a developer might need to randomize data within a database. You may need to create sample data to test or demonstrate a database application. Or you may need to mask or "de-identify" confidential data to comply with regulatory agency rules. This article demonstrates two methods that can be used, either separately or in conjunction, to scramble your database.

Creating test data for a database application is always a difficult task. You need to create enough records to adequately test database performance. A design that works well with a hundred records, might not work with ten thousand. You also need to create data varied enough to mimic real world situations, data that will test the limits of the business rules. Developers have a tendency to create data that will work with their application, not data that will break it.

The very best source for creating test data is actual customer data. Nothing mimics real-world data like the real-world data. Unfortunately, your customers may not appreciate their data being used in this way. Worse yet, there may be regulatory considerations. For instance, the health care industry has to comply with HIPAA (Health Insurance Portability and Accountability Act) regulations, which has very strict compliance rules.

Randomizing Records

Let's start by considering the simplest case. You simply want to mask customer data without the necessity of complying with regulations. It is probably sufficient to simply randomize each field that could identify a record to effectively mask the data.

Tables 1 and 2 show the before and after of a small Customer table that has been randomized.

CustomerID

CustomerNumber

FirstName

LastName

1

A003

Roger

Carlson

2

A045

Sam

Martin

3

A014

Sarah

Sutherland

Table 1: Original Customer table

CustomerID

CustomerNumber

FirstName

LastName

1

A045

Sarah

Carlson

2

A014

Sam

Sutherland

3

A003

Roger

Martin

Table 2: The Customer table after randomization

Of course, with only three records, the randomization is not very random, but the larger the result set, the better the randomization will be. So let's see how to do this.

Overview of the Process

The first thing we'll do is create a temporary table that holds the values we want to randomize. This is best done with a Make Table query (SELECT...INTO). Then we'll start at the top record of the main table and copy random values from the temporary table back into the main table. As each value is copied back in, we'll delete it from the temporary table. Repeat this process for each field and the records will be thoroughly scrambled.

One caveat: This process won't work if the field is a Primary Key or has a Unique Index because it will temporarily cause duplicate records. So, you'll have to manually remove the constraint. Of course it is possible to remove and re-establish constraints programmatically, but that is beyond the scope of this article. Lastly, this code will not work with an Autonumber field under any circumstances.

Creating Temporary Table

The first thing we have to do is create a temporary table, called "tblRandom" to hold the field values. (I could have used an array to do the same thing, but since this is a database project, I prefer to use database objects.) As I said, we can use a SQL statement in the form of a Make Table Query. The following subroutine shows how to do that.

Sub CreateRandomTable(TableName As String, FieldName As String)

'*** This routine creates the tblRandom table

' used to hold the values to be randomized

On Error GoTo Err_CreateRandomTable

Dim strSQL As String

'*** create the query based on the passed arguments

strSQL = "SELECT [" & FieldName & "] INTO tblRandom " & _

"FROM [" & TableName & "];"

'*** delete the table if it exists

CurrentDb.TableDefs.Delete "tblRandom"

'*** run SQL make-table query

CurrentDb.Execute strSQL

Exit_CreateRandomTable:

Exit Sub

Err_CreateRandomTable:

'*** if the error is 3265, the table does not exist

If Err.Number = 3265 Then

'*** then skip the Delete command

Resume Next

Else

MsgBox Err.Description

Resume Exit_CreateRandomTable

End If

End Sub

The error trapping code in the function is necessary because if the table "tblRandom" already exists, the Make Table query will fail. But the code traps for that error (3265) and skips the Delete command.

Randomizing Records

Now we are ready to randomize the fields. The following subroutine will do that.

Sub RandomizeTableField _

(TableName As String, _

FieldName As String)

'***This subroutine randomizes the indicated

' field in the selected table.

On Error GoTo Err_RandomizeTableField

Dim db As DAO.Database

Dim rsTarget As DAO.Recordset

Dim rsSource As DAO.Recordset

Dim i As Integer, upperbound As Integer

'*** used with the RND() function to return

' a random number

Randomize

'*** open the database

Set db = CurrentDb

'*** open the source table (tblRandom)

Set rsSource = db.OpenRecordset("tblRandom", dbOpenTable)

'*** open target table (original table)

Set rsTarget = db.OpenRecordset(TableName, dbOpenTable)

'*** repeat loop until the all records have been randomized

Do Until rsTarget.EOF

upperbound = rsSource.RecordCount

'*** select a random record in the source

' table and move to it

If upperbound <> 1 Then

i = Int((upperbound - 1) * Rnd + 1)

rsSource.Move i

End If

'*** write value from source table into target

rsTarget.Edit

rsTarget(FieldName) = rsSource(0)

rsTarget.Update

'*** delete that value from source table

rsSource.Delete

rsSource.MoveFirst

rsTarget.MoveNext

Loop

MsgBox "Field: " & FieldName & " from Table: " & _

TableName & " has been scrambled." & vbCrLf & _

"If you removed an index, please recreate it."

'*** Clean up object variables

rsTarget.Close

rsSource.Close

db.Close

Set rsTarget = Nothing

Set rsSource = Nothing

Set db = Nothing

'*** error trapping to catch constraints

Exit_RandomizeTableField:

Exit Sub

Err_RandomizeTableField:

'*** if the field has a unique index

If Err.Number = 3022 Then

'*** display message informing user to remove index

MsgBox "The field that you have chosen: " & _

cboFieldName1 & vbCrLf & _

"Has a Unique Index. You must remove this " & _

"index to proceed." & vbCrLf & _

"Once you have scrambled the field, " & _

"you can restore the index."

Resume Exit_RandomizeTableField

Else

MsgBox Err.Description

Resume Exit_RandomizeTableField

End If

End Sub

Calling the Routine

Now, all that's left is to pull it all together with a calling routine. How you do that depends on how it is going to be used. For instance, this code could be placed in a form with combo boxes on the form (cboTable and cboFieldName) holding the table and field names. Then you could select the table and field from lists.

The following code uses such a set up and would be in the On Click event of the cmdRunRandomization command button of the form.

Private Sub cmdRunRandomization_Click()

'*** This code is the calling routine

On Error GoTo Err_cmdRunRandomization_Click

Call CreateRandomTable(cboTable, cboFieldName1)

Call RandomizeTableField(cboTable, cboFieldName1)

'*** delete temporary table

CurrentDb.TableDefs.Delete "tblRandom"

Exit_cmdRunRandomization_Click:

Exit Sub

Err_cmdRunRandomization_Click:

If Err.Number = 3265 Then

'*** then skip the Delete command and

' resume on the next line

Resume Next

Else

MsgBox Err.Description

Resume Exit_cmdRunRandomization_Click

End If

End Sub

DataScrambleFigure1

Figure 1: Example form that could be used to run the field randomization code.

Using a form like above allows you to select a field to scramble. You can select, in turn, all of fields with identifying data. On my website, you can find a small sample database called Datascramble.accdb, which demonstrates this process, including the calling form.

Sometimes It’s Not Enough

But sometimes just re-arranging field values in the records is not enough. Sometimes, your fields contain sensitive data like Social Security Numbers or Patient Identifiers which can’t be displayed even if it’s attached to the wrong record.

In that case, you may want to scramble the characters within a field itself.  I address this problem in De-identifying Data for Confidentiality - Part II