I've discussed this topic briefly in another post: Why Normalization?, but the issue of Repeated Columns comes up so often that I think it deserves a post of its own.
First of all, what are repeated columns?
A table has repeated columns when the same category of information is stored in multiple columns (or fields). For instance, if I have workstation inventory table, I would want to store all the software installed on a given PC. I could solve this problem with fields like this: Software1, Software2, Software3, and so on. Anytime you have fields which need a suffix to differentiate one field from another, you have repeated columns.
Figure 1: Inventory table with repeated fields.
But although fields with suffixes are the easiest to spot, there are other cases. Suppose I had these fields: OS, Wordprocessor, Spreadsheet, PhotoManager, and so forth. This is essentially the same as Software1, Software2. I've just substituted the name of software type for a generic name. This reduces the number of fields from the above example, but it still has problems of its own.
Figure 2: Inventory table with repeated fields with different names.
Yet a third type is Multiple Yes/No fields. Suppose I have a Patient table with fields for symptoms (Cough, Fever, Sneeze, etc) which are Yes/No fields. At first glance, these are not repeated fields, but in fact they are. They store the same category of information, that is, each is a symptom.
Figure 3: Patient table with repeated symptom Yes/No columns.
So what's so wrong with repeated columns? After all, we do it all the time in spreadsheets. Unfortunately, while a database table might look like a spreadsheet, it's not, and the rules for using them are different.
It's in the Rules
In a table ALL information should be stored as values in fields, not in the field names. When a table has fields like Cough, Sneeze, Fever, etc, information is being stored in the field name. Field names should identify a category of information, not individual values. This is part of the basic definition of a relational database table (called the First Normal Form) and all SQL (Structured Query Language) implementations are built for it. When a table has repeated columns, queries that should be easy become difficult -- sometimes impossible. I'll show specifics later.
In a normalized database, I can create a unique index to prevent duplicate values being entered in a field. There is no way to prevent this at the database level with repeated columns. For instance, there is nothing to prevent a user from entering the operating system in both Software1 and Software2.
How Many, How Few?
And then there's the question of how many repeated fields I should have. For instance, how many software fields (Software1, Software2, etc) should I have in my PC_RepeatedColumns table? Choose some arbitrary minimum (5?) and lose the rest? Or do I try to determine the maximum I could ever have (255?) and leave most of them empty? Or do I pick some arbitrary middle ground and modify my database afterward if I need more? There are disadvantages to each approach and few advantages in any of them.
Columns Are Expensive, Rows Are Cheap
A truism of database design is: "Columns are expensive. Rows are cheap." One of the powerful features of Access (or any database, for that matter) is the ability to create a user application to help end-users enter and query data. Adding a new column requires the developer to make changes in the table and all queries, forms, and reports that use that table as well. This makes adding a column to a table one of the most "expensive" (in terms of development time) development operations.
By contrast, adding a new value to a column, rather than adding a new column, can be done by the user and requires no development time.
Querying with repeated columns often turns a simple task into an exercise in frustration, requiring massive, nested IIF statements, multiple OR conditions, or both. Aggregate queries like Sum, Count, and Average are particularly difficult.
These problems are magnified by the "How Many, How Few" issue. If I have a maximum possible number, I need to add columns to my query that may never be used. But if I only have the minimum, adding a new column requires me to modify every query. Either way, it's a lot of unnecessary work.
To illustrate the problems with repeated fields, especially with querying, I'm going to spend the next few posts showing specific examples:
Querying Repeated Columns