In the first post in this series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?
What I'm going to concentrate in the next few posts are specific problems with querying data stored in repeated columns. The first of these is Multiple ORs.
Others in this series:
Multiple "OR" Conditions
One of the most common examples of repeated columns in a table come from PC Inventory databases. This is likely because they are created by Help Desk personnel who are familiar with spreadsheet software like Excel. Since repeated columns are allowed, even necessary, in a spreadsheet, it is logical to build a database table the same way. Logical, perhaps, but wrong.
An extremely simplified example of such a table might look like this:
Normalized to remove the repeated columns, the table would look like this:
Note: Proper normalization requires the fields PC_Num and OperatingSystem to be removed to their own table. Technically, Software should also be removed to its own table, but I'll address that later.
So suppose I want to produce a simple list of those PCs which have "Word 2010" installed. With the repeated columns table, my SQL statement would look like this:
WHERE Software1="Word 2010" OR
Software2="Word 2010" OR
Software3="Word 2010" OR
Software4="Word 2010" OR
In the Query Builder, it would look like this:
By contrast, querying the normalized tables would look like this:
FROM PC INNER JOIN PCInventory ON PC.PC_ID = PCInventory.PC_ID
WHERE PCInventory.Software ="Word 2010";
In the Query Builder:
The results of both queries are identical:
Adding more columns just makes matters worse. With 20 columns in your repeated columns table, you have to have 20 "OR" conditions in your Where clause. What's more, if you find you have to add another column down the road, you have to modify your query (and every query that uses that table). On the other hand, adding a new value to the normalized table requires no change to the SQL statement.
Thus, with the repeated columns table, I have to do a whole lot more work than with the normalized structure, both initially and when a new software category is added. But it gets worse. Next time I'll look at the same PC_Inventory tables with a software type field added for each software.