Monday, March 7, 2011

The Problem of Repeated Columns

I've discussed this topic briefly in another post: Why Normalization?, but the issue of Repeated Columns comes up so often that I think it deserves a post of its own.

First of all, what are repeated columns?

A table has repeated columns when the same category of information is stored in multiple columns (or fields). For instance, if I have workstation inventory table, I would want to store all the software installed on a given PC. I could solve this problem with fields like this: Software1, Software2, Software3, and so on. Anytime you have fields which need a suffix to differentiate one field from another, you have repeated columns.

 
Figure 1: Inventory table with repeated fields.


But although fields with suffixes are the easiest to spot, there are other cases. Suppose I had these fields: OS, Wordprocessor, Spreadsheet, PhotoManager, and so forth. This is essentially the same as Software1, Software2. I've just substituted the name of software type for a generic name. This reduces the number of fields from the above example, but it still has problems of its own.

Figure 2: Inventory table with repeated fields with different names.




Yet a third type is Multiple Yes/No fields. Suppose I have a Patient table with fields for symptoms (Cough, Fever, Sneeze, etc) which are Yes/No fields. At first glance, these are not repeated fields, but in fact they are. They store the same category of information, that is, each is a symptom.


Figure 3: Patient table with repeated symptom Yes/No columns.








Problems:

So what's so wrong with repeated columns? After all, we do it all the time in spreadsheets. Unfortunately, while a database table might look like a spreadsheet, it's not, and the rules for using them are different.

It's in the Rules

In a table ALL information should be stored as values in fields, not in the field names. When a table has fields like Cough, Sneeze, Fever, etc, information is being stored in the field name. Field names should identify a category of information, not individual values. This is part of the basic definition of a relational database table (called the First Normal Form) and all SQL (Structured Query Language) implementations are built for it. When a table has repeated columns, queries that should be easy become difficult -- sometimes impossible. I'll show specifics later.

Data Anomolies

In a normalized database, I can create a unique index to prevent duplicate values being entered in a field. There is no way to prevent this at the database level with repeated columns. For instance, there is nothing to prevent a user from entering the operating system in both Software1 and Software2.

How Many, How Few?

And then there's the question of how many repeated fields I should have. For instance, how many software fields (Software1, Software2, etc) should I have in my PC_RepeatedColumns table? Choose some arbitrary minimum (5?) and lose the rest? Or do I try to determine the maximum I could ever have (255?) and leave most of them empty? Or do I pick some arbitrary middle ground and modify my database afterward if I need more? There are disadvantages to each approach and few advantages in any of them.

Columns Are Expensive, Rows Are Cheap


A truism of database design is: "Columns are expensive. Rows are cheap." One of the powerful features of Access (or any database, for that matter) is the ability to create a user application to help end-users enter and query data. Adding a new column requires the developer to make changes in the table and all queries, forms, and reports that use that table as well. This makes adding a column to a table one of the most "expensive" (in terms of development time) development operations.

By contrast, adding a new value to a column, rather than adding a new column, can be done by the user and requires no development time.

Querying:

Querying with repeated columns often turns a simple task into an exercise in frustration, requiring massive, nested IIF statements, multiple OR conditions, or both. Aggregate queries like Sum, Count, and Average are particularly difficult.

These problems are magnified by the "How Many, How Few" issue. If I have a maximum possible number, I need to add columns to my query that may never be used. But if I only have the minimum, adding a new column requires me to modify every query. Either way, it's a lot of unnecessary work.

To illustrate the problems with repeated fields, especially with querying, I'm going to spend the next few posts showing specific examples:

Querying Repeated Columns

4 comments:

Grover Park George said...

Hi, Roger.

You really hit the nail on the head here. Spreadsheet-like tables are all too common in the databases we see on Access help forums recently. I also blogged about this recently, No, Excel is Not a Database. Your blog post really does a good job of explaining some of the issues I alluded to.

George

Crystal said...

hi Roger,

great article, thanks Roger -- and thanks to George for the link on UA ;) ... good explanation and examples -- something everyone (reading your blog) can relate to (smile)

Repeating groups are often used because (1) it does not occur that data should be in separate tables, and (2) and it is (seemingly) easier to put data in and get it out without considering filtering and sorting or reviewing historical data for trends and statistics.

Do you have any articles along the lines of "How to set up a mainform and a continuous subform [/datasheet]" that you could link to so that others can see how to resolve the problem of inputting data into a record in one table with related records in another table ? thanks, Roger

Warm Regards,
Crystal

*
(: have an awesome day :)
*

Crystal said...

hi Roger,

great article, thanks Roger -- and thanks to George for the link on UA ;) ... good explanation and examples -- something everyone (reading your blog) can relate to (smile)

Repeating groups are often used because (1) it does not occur that data should be in separate tables, and (2) and it is (seemingly) easier to put data in and get it out without considering filtering and sorting or reviewing historical data for trends and statistics.

Do you have any articles along the lines of "How to set up a mainform and a continuous subform [/datasheet]" that you could link to so that others can see how to resolve the problem of inputting data into a record in one table with related records in another table ? thanks, Roger

Warm Regards,
Crystal

*
(: have an awesome day :)
*

Roger Carlson said...

Thanks Crystal. That's a terrific idea for a follow-up post.