Monday, March 28, 2011

Querying Repeated Columns: Multiple Joins

In the first post in this series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

What I'm going to concentrate in the next few post are specific problems with querying data stored in repeated columns. This time, it's Multiple Joins.

Others in this series:

Multiple Joins

In my last post, Querying Repeated Columns: Multiple Unions, I discussed the problem of repeated columns with multiple categories. That is, in addition to Software1, Software2, ... SoftwareX, the table also has a Type field for each software: Software1Type, Software2Type, ... SoftwareXType. One solution that people come up with to solve this problem is to have a separate, "look-up" table for the software. Something like this:

Figure 1: Incorrect solution to the Multiple Category problem


This solution does have some advantages from a data integrity standpoint. I can use the Software table as the Row Source in a combo box on a form, which will limit data entry errors. I also can't assign an incorrect category to a software, since it is stored in the Software table.

The downside, however, is that I have to create multiple Joins with the Software table, one for each Software column. So just to get my original table back, I have to do something like this:

SELECT
        PC_ID,
        Software1,
        Software.Category AS Software1Type,
        Software2 ,
        Software_1.Category AS Software2Type ,
        Software3                            ,
        Software_2.Category AS Software3Type
FROM
        (( PC_RepeatedColumns
LEFT JOIN
        Software ON
        PC_RepeatedColumns.Software1 = Software.Software )
LEFT JOIN
        Software AS Software_1 ON
        PC_RepeatedColumns.Software2 = Software_1.Software )
LEFT JOIN
        Software AS Software_2 ON
        PC_RepeatedColumns.Software3 = Software_2.Software;

Or in the Query Builder:

Figure 2: Query with multiple joins to display software types.



Unfortunately, this just gets me back to the original, repeated columns table.


But as I showed last time, it does nothing to solve the underlying problem of querying the repeated columns. I would still need to use multiple UNION queries to return a list of PCs with word processors. However, unlike last time, each query in the Union must be joined to the Software table:

SELECT PC_ID, Software1, Software.Category
FROM PC_RepeatedColumns LEFT JOIN Software
   ON PC_RepeatedColumns.Software1 = Software.Software
WHERE Category="word processor"
UNION
SELECT PC_ID, Software2, Software.Category
FROM PC_RepeatedColumns LEFT JOIN Software
   ON PC_RepeatedColumns.Software2 = Software.Software
WHERE Category="word processor"
UNION
SELECT PC_ID, Software3, Software.Category
FROM PC_RepeatedColumns LEFT JOIN Software
   ON PC_RepeatedColumns.Software3 = Software.Software
WHERE Category="word processor"



Figure 3: Query with Multiple Unions and Multiple Joins


But as I showed last time, using a truly normalized structure, like this:

Figure 4: Fully normalized with Category column added.


I need only a simple join to query the data:

SELECT PC.PC_Num, Software, Category
FROM Software INNER JOIN
(PC INNER JOIN PC_Software ON PC.PC_ID = PC_Software.PCID)
ON Software.SoftwareID = PC_Software.SoftwareID
WHERE Category="word processor";


This one can be viewed in the Query Builder:

Figure 5: Word processor for each PC with normalized structure.


Once again, while the results of the queries are the same, querying repeated columns proves much more complicated and much less flexible than querying a normalized table structure.
 
.




Tuesday, March 22, 2011

New Sample: Form_ControlDefaultValues

Form_ControlDefaultValues


By A.D. Tejpal

This sample db demonstrates programmatic setting of default values for form controls, covering different data types (Date, Text and Number). Two methods are illustrated as follows:

1 - Concatenation of intended default value into a string assigned as control's default value property.

2 - Generation of intended default value through embedded reference within the string assigned as control's default value property.

For each of the above styles, two alternative modes are demonstrated as follows:

(a) Set defaults as per current record (On editing or double clicking pertinent controls).

(b) Set defaults as per last record (On editing or double clicking pertinent controls - only if the cursor is on last record).

You can find the sample here: http://www.rogersaccesslibrary.com/forum/Form-controldefaultvalues_topic559.html

.

Monday, March 21, 2011

Querying Repeated Columns: Multiple Unions

In the first post in this series (The Problem of Repeated Columns), I defined Repeated Columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

What I'm going to concentrate in the next few post are specific problems with querying data stored in repeated columns. This time, it's Multiple Unions.

Others in this series:

Multiple UNIONs

Last time, in Querying Repeated Columns: Multiple ORs, I talked about the difficulties of querying data in repeated columns. But it gets worse. Suppose I want to categorize each type of software. In which case, I'd have to add a category column for each software column (Software1Type, Software2Type, ...etc). It could look something like this:

Figure 1: Repeated Columns with repeated category columns.


By comparison, the normalized version would look like Figure 3, with both Software and Category removed to their own table:

Figure 2: Fully normalized with Category column added.

Now, suppose I want to query a listing of all word processors for each PC. To create such a query with repeated columns, I'd have to resort to multiple UNIONs:

SELECT PC_ID, Software1 as Software, Software1Type as Category
FROM Worse_PC_RepeatedColumns
WHERE Software1Type="word processor"
UNION
SELECT PC_ID, Software2 as Software, Software2Type as Category
FROM Worse_PC_RepeatedColumns
WHERE Software2Type="word processor"
UNION
SELECT PC_ID, Software3 as Software, Software3Type as Category
FROM Worse_PC_RepeatedColumns
WHERE Software3Type="word processor"


Unfortunately, this must be done in SQL view since union queries cannot be viewed in the Query Builder.

Figure 3: Word processor for each PC with repeated columns.

By contrast, creating the query in the normalized structure looks like this:

SELECT PC.PC_Num, Software, Category
FROM Software INNER JOIN
     (PC INNER JOIN PC_Software ON PC.PC_ID = PC_Software.PCID)
         ON Software.SoftwareID = PC_Software.SoftwareID
WHERE Category="word processor";


This one can be viewed in the Query Builder:

Figure 4: Word processor for each PC with normalized structure.

Again, the results of both queries are identical with the exception that the query with repeated columns (UNION) is not updateable.

Figure 5: Results of two queries.

More columns in the repeated columns table makes the query more complex. I'd have to add another UNION for each column pair. Worse yet, there is a limit to the number of Unions you can use in a query, so if you have too many categories, your query may not just be difficult, but impossible.

By contrast, adding more values to the the normalized tables requires no change to the query at all, and there is virtually no limit on the number of categories.



Tuesday, March 15, 2011

Querying Repeated Columns: Multiple ORs

In the first post in this series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

What I'm going to concentrate in the next few posts are specific problems with querying data stored in repeated columns. The first of these is Multiple ORs.

Others in this series:
Multiple "OR" Conditions

One of the most common examples of repeated columns in a table come from PC Inventory databases. This is likely because they are created by Help Desk personnel who are familiar with spreadsheet software like Excel. Since repeated columns are allowed, even necessary, in a spreadsheet, it is logical to build a database table the same way. Logical, perhaps, but wrong.

An extremely simplified example of such a table might look like this:

Figure 1: PC_Inventory table with repeated columns

Normalized to remove the repeated columns, the table would look like this:

Figure 2: PC_Inventory normalized
Note: Proper normalization requires the fields PC_Num and OperatingSystem to be removed to their own table. Technically, Software should also be removed to its own table, but I'll address that later.


So suppose I want to produce a simple list of those PCs which have "Word 2010" installed. With the repeated columns table, my SQL statement would look like this:

SELECT PC_ID
FROM PC_RepeatedColumns
WHERE Software1="Word 2010" OR
   Software2="Word 2010" OR
   Software3="Word 2010" OR
   Software4="Word 2010" OR
   Software5="Word 2010";

In the Query Builder, it would look like this:

Figure 3: Query with repeated columns

By contrast, querying the normalized tables would look like this:

SELECT PC_Num 
FROM PC INNER JOIN PCInventory ON PC.PC_ID = PCInventory.PC_ID
WHERE PCInventory.Software ="Word 2010";

In the Query Builder:
Figure 4: Query with normalized tables

The results of both queries are identical:

Figure 5: Query Results





Adding more columns just makes matters worse. With 20 columns in your repeated columns table, you have to have 20 "OR" conditions in your Where clause. What's more, if you find you have to add another column down the road, you have to modify your query (and every query that uses that table). On the other hand, adding a new value to the normalized table requires no change to the SQL statement.

Thus, with the repeated columns table, I have to do a whole lot more work than with the normalized structure, both initially and when a new software category is added. But it gets worse. Next time I'll look at the same PC_Inventory tables with a software type field added for each software.




















Monday, March 7, 2011

The Problem of Repeated Columns

I've discussed this topic briefly in another post: Why Normalization?, but the issue of Repeated Columns comes up so often that I think it deserves a post of its own.

First of all, what are repeated columns?

A table has repeated columns when the same category of information is stored in multiple columns (or fields). For instance, if I have workstation inventory table, I would want to store all the software installed on a given PC. I could solve this problem with fields like this: Software1, Software2, Software3, and so on. Anytime you have fields which need a suffix to differentiate one field from another, you have repeated columns.

 
Figure 1: Inventory table with repeated fields.


But although fields with suffixes are the easiest to spot, there are other cases. Suppose I had these fields: OS, Wordprocessor, Spreadsheet, PhotoManager, and so forth. This is essentially the same as Software1, Software2. I've just substituted the name of software type for a generic name. This reduces the number of fields from the above example, but it still has problems of its own.

Figure 2: Inventory table with repeated fields with different names.




Yet a third type is Multiple Yes/No fields. Suppose I have a Patient table with fields for symptoms (Cough, Fever, Sneeze, etc) which are Yes/No fields. At first glance, these are not repeated fields, but in fact they are. They store the same category of information, that is, each is a symptom.


Figure 3: Patient table with repeated symptom Yes/No columns.








Problems:

So what's so wrong with repeated columns? After all, we do it all the time in spreadsheets. Unfortunately, while a database table might look like a spreadsheet, it's not, and the rules for using them are different.

It's in the Rules

In a table ALL information should be stored as values in fields, not in the field names. When a table has fields like Cough, Sneeze, Fever, etc, information is being stored in the field name. Field names should identify a category of information, not individual values. This is part of the basic definition of a relational database table (called the First Normal Form) and all SQL (Structured Query Language) implementations are built for it. When a table has repeated columns, queries that should be easy become difficult -- sometimes impossible. I'll show specifics later.

Data Anomolies

In a normalized database, I can create a unique index to prevent duplicate values being entered in a field. There is no way to prevent this at the database level with repeated columns. For instance, there is nothing to prevent a user from entering the operating system in both Software1 and Software2.

How Many, How Few?

And then there's the question of how many repeated fields I should have. For instance, how many software fields (Software1, Software2, etc) should I have in my PC_RepeatedColumns table? Choose some arbitrary minimum (5?) and lose the rest? Or do I try to determine the maximum I could ever have (255?) and leave most of them empty? Or do I pick some arbitrary middle ground and modify my database afterward if I need more? There are disadvantages to each approach and few advantages in any of them.

Columns Are Expensive, Rows Are Cheap


A truism of database design is: "Columns are expensive. Rows are cheap." One of the powerful features of Access (or any database, for that matter) is the ability to create a user application to help end-users enter and query data. Adding a new column requires the developer to make changes in the table and all queries, forms, and reports that use that table as well. This makes adding a column to a table one of the most "expensive" (in terms of development time) development operations.

By contrast, adding a new value to a column, rather than adding a new column, can be done by the user and requires no development time.

Querying:

Querying with repeated columns often turns a simple task into an exercise in frustration, requiring massive, nested IIF statements, multiple OR conditions, or both. Aggregate queries like Sum, Count, and Average are particularly difficult.

These problems are magnified by the "How Many, How Few" issue. If I have a maximum possible number, I need to add columns to my query that may never be used. But if I only have the minimum, adding a new column requires me to modify every query. Either way, it's a lot of unnecessary work.

To illustrate the problems with repeated fields, especially with querying, I'm going to spend the next few posts showing specific examples:

Querying Repeated Columns