Tuesday, March 15, 2011

Querying Repeated Columns: Multiple ORs

In the first post in this series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

What I'm going to concentrate in the next few posts are specific problems with querying data stored in repeated columns. The first of these is Multiple ORs.

Others in this series:
Multiple "OR" Conditions

One of the most common examples of repeated columns in a table come from PC Inventory databases. This is likely because they are created by Help Desk personnel who are familiar with spreadsheet software like Excel. Since repeated columns are allowed, even necessary, in a spreadsheet, it is logical to build a database table the same way. Logical, perhaps, but wrong.

An extremely simplified example of such a table might look like this:

Figure 1: PC_Inventory table with repeated columns

Normalized to remove the repeated columns, the table would look like this:

Figure 2: PC_Inventory normalized
Note: Proper normalization requires the fields PC_Num and OperatingSystem to be removed to their own table. Technically, Software should also be removed to its own table, but I'll address that later.


So suppose I want to produce a simple list of those PCs which have "Word 2010" installed. With the repeated columns table, my SQL statement would look like this:

SELECT PC_ID
FROM PC_RepeatedColumns
WHERE Software1="Word 2010" OR
   Software2="Word 2010" OR
   Software3="Word 2010" OR
   Software4="Word 2010" OR
   Software5="Word 2010";

In the Query Builder, it would look like this:

Figure 3: Query with repeated columns

By contrast, querying the normalized tables would look like this:

SELECT PC_Num 
FROM PC INNER JOIN PCInventory ON PC.PC_ID = PCInventory.PC_ID
WHERE PCInventory.Software ="Word 2010";

In the Query Builder:
Figure 4: Query with normalized tables

The results of both queries are identical:

Figure 5: Query Results





Adding more columns just makes matters worse. With 20 columns in your repeated columns table, you have to have 20 "OR" conditions in your Where clause. What's more, if you find you have to add another column down the road, you have to modify your query (and every query that uses that table). On the other hand, adding a new value to the normalized table requires no change to the SQL statement.

Thus, with the repeated columns table, I have to do a whole lot more work than with the normalized structure, both initially and when a new software category is added. But it gets worse. Next time I'll look at the same PC_Inventory tables with a software type field added for each software.




















No comments: