Tuesday, May 31, 2011

Normalizing Repeated Columns: Multiple Columns (Part 2)

Last time, in Normalizing Repeated Columns: Multiple Columns (Part1) I talked about Normalizing data in repeated columns from a spreadsheet, like this:



Into a structure like this:




Using a One-Time process that creates the tables and relationships in addition to copying the data.

Appending to Existing Tables

The process for normalizing repeated column data into Existing Tables is similar to the One-Time process. It does, however, require you to create the tables and relationships correctly ahead of time. Because I've just created them in the previous section, I'll simply delete the data and re-append it from the linked spreadsheet.

The two processes are very similar, but I'll be using Append queries rather than Make-Table queries.

Step 1: Remove the common PC elements.

Since I've already got a table, I'll use an Append query rather than the Make-Table query I used earlier:

INSERT INTO PC ( PC_Num, OS )
SELECT PC_RC.PC_Num, PC_RC.OS
FROM PC_RC;


Or in the Query Builder:


The resulting table looks like this:


Since I already have an autonumber primary key defined on the table, I don't need to do anything further.

Step 2: Create Software_Temp table

This step is exactly the same as Step 2 above. I'll use a Make-Table query to create a temporary table that will be used for creating the relationships later.

INSERT INTO Software_Temp ( PC_ID, Software, SoftwareType )
SELECT PC_Software_Temp.PC_ID, PC_Software_Temp.Software, PC_Software_Temp.SoftwareType
FROM (SELECT PC.PC_ID, Software1 As Software, PC_RC.Software1Type as SoftwareType
FROM PC INNER JOIN PC_RC ON PC.PC_Num = PC_RC.PC_Num
WHERE Software1 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC.Software2, Software2Type as SoftwareType
FROM PC INNER JOIN PC_RC ON PC.PC_Num = PC_RC.PC_Num
WHERE Software2 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC.Software3, Software3Type as SoftwareType
FROM PC INNER JOIN PC_RC ON PC.PC_Num = PC_RC.PC_Num
WHERE Software3 IS NOT NULL) AS PC_Software_Temp;


The result will be a table that looks like this:


Step 3: Remove Duplicates from Software_Temp

I need to create a query that removes the duplicate values. The DISTINCT predicate works well for that. As before, I'll use an Append write the data to my existing Software table:

INSERT INTO Category ( SoftwareType )
SELECT DISTINCT Software_Temp.SoftwareType
FROM Software_Temp;


With the resulting table:


Again, this table already has an autonumber primary key, so I don't need to create one.

Step 4: Creating the Linking Table: PC_Software

So now, I've got my two "One-Side" tables: PC and Software. Now I just need to fill the linking table.
I can do that with an Append Query and a simple join of Software and Software_Temp:

INSERT INTO Software ( Software, CategoryID )
SELECT DISTINCT Software_Temp.Software, Category.CategoryID
FROM Software_Temp INNER JOIN Category ON Software_Temp.SoftwareType = Category.SoftwareType;


Or in the Query Builder:


The final result will look like this:


It is important to note that the order in which the data is moved is vital. Data must be written into the "one-side" tables (PC and Software) first, and then data can be moved into the "linking" tables (PC_Software).

This two-part post is available as a single download (with sample database) here: http://www.rogersaccesslibrary.com/forum/normalizingrepeatingcolumns1_topic562.html.
































Friday, May 27, 2011

Normalizing Repeated Columns: Multiple Columns (Part1)

In my earlier series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. I also showed several different examples of repeated columns and the how difficult it is to query repeated columns as compared to the normalized equivalent. If you haven't read these yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

One of the comments suggested I should discuss how to convert a table with repeated columns into a normalized table structure. I thought that was a really good idea, so I'm going to spend the next few posts doing so. The difficulty in discussing this issue, however, is that the exact solution differs with each table, so there is no single solution. Fortunately, in the last series, I showed 5 different tables, each with a slightly different structure, so while I cannot show a single solution that will work for every table with repeated columns, hopefully, one of the following will work for most cases.

Others in this series:

Complex PC Inventory

In my post Querying Repeated Columns: Multiple Unions I discussed a simple example of a PC Inventory table that looked like this:

Figure 1: PC_Inventory table with repeated columns

Normalized to remove the repeated columns, the tables relationships would look like this:

Figure 2: Relationships for the PC Inventory tables.

So how do I get the data from Figure 1 to Figure 2? Well, first of all, it depends on whether this is a one-time process where you are creating new normalized tables, or whether it is an on-going process where you are appending the data to existing, normalized tables.
In either case, however, not only do I have to split the records into separate tables, I also have to preserve the relationships between those records, so I have to do it is a specific order.


One-Time Process

Suppose I'm given a spreadsheet of data like this:



And I need to create a normalized database from it. I've already determined the tables I need (Figure 2 above), but if you're not certain what your table structure should be, you should read through my blog serie: What is Normalization, Entity-Relationship Diagramming, and The Normal Forms.

Link the spreadsheet into a database.

First of all, I need to get the spreadsheet into the database. I find it is preferable to link rather than import the file. Since I'm not going to be changing the data, there's no reason to import it. Once the file is linked it acts just like a table, and I can begin the process of normalizing it.

Step 1: Remove the common PC elements.

Since I need to create a new table to hold these values, I'll use a Make-Table query:

SELECT PC_Num, OS INTO PC
FROM PC_RC


Or in the Query Builder:


The resulting table looks like this:



Now, if I were using PC_NUM as a natural key, I could stop here. However, I prefer to use Surrogate Keys. It makes this process more difficult, but has many benefits in the long run. (For a discussion of natural vs. surrogate keys, see my blog post: What is a Primary Key?)

Add surrogate key to PC table.

Adding a surrogate key to a table is as simple as opening it in Design View, adding an Autonumber Field, and making it the Primary Key.


Saving the table will automatically fill the autonumber field with values:




Step 2: Create Software_Temp table

As I showed in my post Querying Repeated Columns: Multiple Unions it is necessary to use multiple UNION queries to create a list of Software. Something like this:

SELECT PC.PC_ID, Software1 As Software, PC_RC.Software1Type as SoftwareType
FROM PC INNER JOIN PC_RC ON PC.PC_Num = PC_RC.PC_Num
WHERE Software1 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC.Software2, Software2Type as SoftwareType
FROM PC INNER JOIN PC_RC ON PC.PC_Num = PC_RC.PC_Num
WHERE Software2 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC.Software3, Software3Type as SoftwareType
FROM PC INNER JOIN PC_RC ON PC.PC_Num = PC_RC.PC_Num
WHERE Software3 IS NOT NULL


Note: You can't create or view this query in the Query Builder, however, you can create the first on (Software1) in the QB, then switch to the SQL View and copy and paste, modifying each for the specific column.

I still need to make this query into a Make-Table query. To do that, I surround the above in parentheses and use it in the From clause of the Make-table. Like this:

SELECT PC_Software_Temp.PC_ID, PC_Software_Temp.Software, PC_Software_Temp.SoftwareType INTO Software_Temp
FROM (SELECT PC.PC_ID, Software1 As Software, PC_RC.Software1Type as SoftwareType
FROM PC INNER JOIN PC_RC ON PC.PC_Num = PC_RC.PC_Num
WHERE Software1 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC.Software2, Software2Type as SoftwareType
FROM PC INNER JOIN PC_RC ON PC.PC_Num = PC_RC.PC_Num
WHERE Software2 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC.Software3, Software3Type as SoftwareType
FROM PC INNER JOIN PC_RC ON PC.PC_Num = PC_RC.PC_Num
WHERE Software3 IS NOT NULL) AS PC_Software_Temp;


The result will be a table that looks like this:


Step 3: Remove SoftwareType to Category Table

To fulfill 3NF, I need to create a query that removes the duplicate SoftwareType values into their own table, which I will call Category. The DISTINCT predicate works well for that. As before, I'll use a Make-Table query to create a new table to hold the software values:

SELECT DISTINCT SoftwareType INTO Category
FROM Software_Temp;


And as before, the PC table I created, I'll add an Autonumber primary key, so the table looks like this:
 

Step 4: Remove Duplicate Software Values from Software_Temp

Now I need to remove the duplicate Software values. However, this time I want to include the primary key of the Category table (Category) in the resulting table so I can create the relationship between the tables Software and Category.

As before, I'll use a Make-Table query to create a new table to hold the software values:

SELECT DISTINCT Software_Temp.Software, Category.CategoryID 
INTO Software
FROM Software_Temp 

    INNER JOIN Category ON Software_Temp.SoftwareType =
    Category.SoftwareType;


Or in the Query Builder:


Again, I'll add an Autonumber primary key. However, the make-table query will create CategoryID as an autonumber field. Since I can't have two autonumber fields in a table, I have to first change CategoryID to a Number (Long Integer) value. I have to do this first, save the table, then add the autonumber field (SoftwareID). The table ends up like this:


Notice the CategoryID field is already filled with the correct values from the Category table.

Step 5: Creating the Linking Table: PC_Software

So now, I've got my two "One-Side" tables: PC and Software. Now I just need to create the linking table.

A "Linking table" is a mechanism by which Many-To-Many relationships are built in a relational database. (See: What is Normalizion Part V for more information.)

I can do that with a Make-Table Query and a simple join of Software and Software_Temp:

SELECT Software.SoftwareID, Software_Temp.PC_ID INTO PC_Software
FROM Software INNER JOIN Software_Temp ON Software.Software = Software_Temp.Software;


Or in the Query Builder:


This will give me a table with two fields: PC_ID and SoftwareID. I'll also want to make these fields a compound primary key. To do that, I'll also want to first convert the Autonumber field to a number field:


The final result looks like this:



The final step is to create the relationships between the four tables: PC, PC_Software, Software, and Category.

Step 6: Create the Relationships

The easiest way to create relationships in Access is to use the Relationship Window. Add the four tables to the Relationships Window:


Click and drag PC_ID from the PC table to PC_ID in the PC_Software table. In the pop-up window, choose the Enforce Referential Integrity box:


And click Create. Do the same for SoftwareID between Software and PC_Software and for CategoryID between Software and Category. The final result will look like this:



Next Time:
Next time, I’ll finish up by importing the spreadsheet to existing tables:
Normalizing Repeated Columns: Multiple Columns (Part2)



Wednesday, May 25, 2011

New Sample: NormalizingRepeatingColumns1.mdb

By Roger Carlson

This sample demonstrates how to normalize a simple PC_Inventory table with a single set of repeated columns.  It's purpose is to demonstrate the general principles of normalizing denormalized data from a spreadsheet.

Document Included

You can find the sample here: http://www.rogersaccesslibrary.com/forum/normalizingrepeatingcolumns1_topic562.html

Monday, May 23, 2011

Normalizing Repeated Columns: Single Repeated Column (Part2)

Last time, in Normalizing Repeated Columns: Single Repeated Column (Part 1), I talked about Normalizing data in repeated columns from a spreadsheet, like this:



Into a structure like this:




Using a One-Time process that creates the tables and relationships in addition to copying the data.

Appending to Existing Tables

The process for normalizing repeated column data into Existing Tables is similar to the One-Time process. It does, however, require you to create the tables and relationships correctly ahead of time. Because I've just created them in the previous section, I'll simply delete the data and re-append it from the linked spreadsheet.

The two processes are very similar, but I'll be using Append queries rather than Make-Table queries.

Step 1: Remove the common PC elements.

Since I've already got a table, I'll use an Append query rather than the Make-Table query I used earlier:

INSERT INTO PC ( PC_Num, OperatingSystem )
SELECT PC_Num, OperatingSystem
FROM PC_RepeatedColumns_Link;
Or in the Query Builder:


The resulting table looks like this:


Since I already have an autonumber primary key defined on the table, I don't need to do anything further.

Step 2: Create Software_Temp table

This step is exactly the same as Step 2 above. I'll use a Make-Table query to create a temporary table that will be used for creating the relationships later.

SELECT PC_ID, Software INTO Software_Temp
FROM
(SELECT PC.PC_ID, PC_RC_Link.Software1 As Software
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software1 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software2
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software2 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software3
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software3 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software4
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software4 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software5
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software5 IS NOT NULL)


The result will be a table that looks like this:


Step 3: Remove Duplicates from Software_Temp

I need to create a query that removes the duplicate values. The DISTINCT predicate works well for that. As before, I'll use an Append write the data to my existing Software table:

INSERT INTO Software ( Software )
SELECT DISTINCT Software_Temp.Software
FROM Software_Temp;


With the resulting table:


Again, this table already has an autonumber primary key, so I don't need to create one.

Step 4: Creating the Linking Table: PC_Software

So now, I've got my two "One-Side" tables: PC and Software. Now I just need to fill the linking table.
I can do that with an Append Query and a simple join of Software and Software_Temp:

INSERT INTO PC_Software ( SoftwareID, PC_ID )
SELECT Software.SoftwareID, Software_Temp.PC_ID
FROM Software INNER JOIN Software_Temp ON Software.Software = Software_Temp.Software;


Or in the Query Builder:


The final result will look like this:


It is important to note that the order in which the data is moved is vital. Data must be written into the "one-side" tables (PC and Software) first, and then data can be moved into the "linking" tables (PC_Software).

This two-part post is available as a single download (with sample database) here: http://www.rogersaccesslibrary.com/forum/normalizingrepeatingcolumns1_topic562.html.

































Tuesday, May 17, 2011

Normalizing Repeated Columns: Single Repeated Column(Part1)

In my earlier series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. I also showed several different examples of repeated columns and the how difficult it is to query repeated columns as compared to the normalized equivalent. If you haven't read these yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

One of the comments suggested I should discuss how to convert a table with repeated columns into a normalized table structure. I thought that was a really good idea, so I'm going to spend the next few posts doing so.

The difficulty in discussing this issue, however, is that the exact solution differs with each table, so there is no single solution. Fortunately, in the last series, I showed 5 different tables, each with a slightly different structure, so while I cannot show a single solution that will work for every table with repeated columns, hopefully, one of the following will work for most cases.

Others in this series:
Simple PC Inventory

In my post Querying Repeated Columns: Multiple ORs I discussed an extremely simple example of a PC Inventory table that looked like this:

Figure 1: PC_Inventory table with repeated columns

Normalized to remove the repeated columns, the tables would look like this:

Figure 2: PC_Inventory normalized (First Normal Form (1NF)


That was sufficient for illustrating the difficulty of querying repeated columns, but unfortunately it didn't go far enough. Proper normalization requires that, in addition to moving the PC_Num and OperatingSystem fields to their own table, Software should also be removed to its own table as well. So it really should look like this:


Figure 3: PC_Inventory fully normalized (Third Normal Form (3NF)

The table relationships would look like this:

Figure 4: Relationships for the PC Inventory tables.

So how do I get the data from Figure 1 to Figure 3? Well, first of all, it depends on whether this is a one-time process where you are creating new normalized tables, or whether it is an on-going process where you are appending the data to existing, normalized tables.

In either case, however, not only do I have to split the records into separate tables, I also have to preserve the relationships between those records, so I have to do it is a specific order.

One-Time Process

Suppose I'm given a spreadsheet of data like this:



And I need to create a normalized database from it. I've already determined the tables I need (Figure 3 above), but if you're not certain what your table structure should be, you should read through my blog serie: What is Normalization, Entity-Relationship Diagramming, and The Normal Forms.


Link the spreadsheet into a database.

First of all, I need to get the spreadsheet into the database. I find it is preferable to link rather than import the file. Since I'm not going to be changing the data, there's no reason to import it. Once the file is linked it acts just like a table, and I can begin the process of normalizing it.

Step 1: Remove the common PC elements.

Since I need to create a new table to hold these values, I'll use a Make-Table query:

SELECT PC_Num, OperatingSystem INTO PC
FROM PC_RC_Link;


Or in the Query Builder:



The resulting table looks like this:



Now, if I were using PC_NUM as a natural key, I could stop here. However, I prefer to use Surrogate Keys. It makes this process more difficult, but has many benefits in the long run. (For a discussion of natural vs. surrogate keys, see my blog post: What is a Primary Key?)

Add surrogate key to PC table.

Adding a surrogate key to a table is as simple as opening it in Design View, adding an Autonumber Field, and making it the Primary Key.



Saving the table will automatically fill the autonumber field with values:



Step 2: Create Software_Temp table

As I showed in my post Querying Repeated Columns: Multiple Unions it is necessary to use multiple UNION queries to create a list of Software. Something like this:

SELECT PC.PC_ID, PC_RC_Link.Software1 As Software
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software1 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software2
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software2 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software3
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software3 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software4
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software4 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software5
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software5 IS NOT NULL


Note: You can't create or view this query in the Query Builder, however, you can create the first on (Software1) in the QB, then switch to the SQL View and copy and paste, modifying each for the specific column.

I still need to make this query into a Make-Table query. To do that, I surround the above in parentheses and use it in the From clause of the Make-table. Like this:

SELECT PC_ID, Software INTO Software_Temp
FROM
(SELECT PC.PC_ID, PC_RC_Link.Software1 As Software
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software1 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software2
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software2 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software3
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software3 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software4
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software4 IS NOT NULL
UNION
SELECT PC.PC_ID, PC_RC_Link.Software5
FROM PC INNER JOIN PC_RC_Link ON PC.PC_Num = PC_RC_Link.PC_Num
WHERE Software5 IS NOT NULL)


The result will be a table that looks like this:



If I added a surrogate key to this table, it would fulfill normalization to the First Normal Form as in Figure 1 above. However, I want to normalize it to 3NF. To do that, I still need to remove the duplicates from Software_Temp and create the linking table, PC_Software.

Step 3: Remove Duplicates from Software_Temp

I need to create a query that removes the duplicate values. The DISTINCT predicate works well for that. As before, I'll use a Make-Table query to create a new table to hold the software values:

SELECT DISTINCT Software INTO Software
FROM Software_Temp;


And as before, the PC table I created, I'll add an Autonumber primary key, so the table ends up like this:


Step 4: Creating the Linking Table: PC_Software

So now, I've got my two "One-Side" tables: PC and Software. Now I just need to create the linking table.

A "Linking table" is a mechanism by which Many-To-Many relationships are built in a relational database. (See: What is Normalizion Part V for more information.)

I can do that with a Make-Table Query and a simple join of Software and Software_Temp:

SELECT Software.SoftwareID, Software_Temp.PC_ID INTO PC_Software
FROM Software INNER JOIN Software_Temp ON Software.Software = Software_Temp.Software;


Or in the Query Builder:


This will give me a table with two fields: PC_ID and SoftwareID. I'll also want to make these fields a compound primary key:


The final result looks like this:


The final step is to create the relationships between the three tables: PC, PC_Software, and Software.

Step 5: Create the Relationships

The easiest way to create relationships in Access is to use the Relationship Window. Add the three tables to the Relationships Window:


Click and drag PC_ID from the PC table to PC_ID in the PC_Software table. In the pop-up window, choose the Enforce Referential Integrity box:


And click Create. Do the same for SoftwareID between Software and PC_Software. The final result will look like this:





Next Time:

Next time, I’ll finish up by importing the spreadsheet to existing tables:
Normalizing Repeated Columns: Single Repeated Column (Part2)