Monday, April 11, 2011

Querying Repeated Columns: Impossible Joins

In the first post in this series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

What I'm going to concentrate in the next few post are specific problems with querying data stored in repeated columns. This time, it's Impossible Joins statements.

Others in this series:

Impossible Joins

Last time, I discussed problems associated with the sort of repeated columns typified by the use of multiple Yes/No fields. For instance, a patient table with a listing of symptoms as Yes/No fields. Like this:

clip_image002
Figure 1: Patient table with Yes/No fields representing symptoms.

Another problem with this sort of table is the impossibility of joining it to another table based on the column names. For instance, suppose I wanted to group the symptoms into disease groupings. I could create a table like this:

clip_image004
Figure 2: Disease grouping table.

The problem here is there's no way to join this table back to the Patient table. I can't join values in one table to the field names in another. In a relational database, all information is supposed to be stored as values in tables. When you put the value as the field name, you lose much of the capabilities built in the SQL, the query language designed for relational databases.

To solve this, I need to use the solution to the patient/symptom list problem discussed last time. That is, using multiple IIF statements and multiple unions to convert the field names into values. Like this:

SELECT Patient, IIf([cough] = True,"Cough") AS Symptom
FROM Patient_RepeatedColumns WHERE [Cough] = True
UNION ALL
SELECT Patient, IIf([Sneeze] = True,"Sneeze") AS Symptom
FROM Patient_RepeatedColumns WHERE [Sneeze] = True
UNION ALL
SELECT Patient, IIf([Fever] = True,"Fever") AS Symptom
FROM Patient_RepeatedColumns WHERE [Fever] = True
UNION ALL
SELECT Patient, IIf([Body_Aches] = True,"Body aches") AS Symptom
FROM Patient_RepeatedColumns WHERE [Body_Aches] = True
UNION ALL
SELECT Patient, IIf([Nausea] = True,"Nausea") AS Symptom
FROM Patient_RepeatedColumns WHERE [Nausea] = True
ORDER BY Patient, Symptom;

Which gives me the following result:

clip_image006
Figure 3: Intermediate query to be used as a subquery

I can now use this query as a subquery in the From clause of another query by giving it an alias (Symptoms):

SELECT Symptoms.Patient, DiseaseGroup.Group
FROM
(SELECT Patient, IIf([cough],"Cough") AS Symptom
FROM Patient_RepeatedColumns WHERE [Cough] = True
UNION ALL
SELECT Patient, IIf([Sneeze],"Sneeze") AS Symptom
FROM Patient_RepeatedColumns WHERE [Sneeze] = True
UNION ALL
SELECT Patient, IIf([Fever],"Fever") AS Symptom
FROM Patient_RepeatedColumns WHERE [Fever] = True
UNION ALL
SELECT Patient, IIf([Body_Aches],"Body aches") AS Symptom
FROM Patient_RepeatedColumns WHERE [Body_Aches] = True
UNION ALL SELECT Patient, IIf([Nausea],"Nausea") AS Symptom
FROM Patient_RepeatedColumns WHERE [Nausea] = True
ORDER BY Patient, Symptom) as Symptoms
INNER JOIN DiseaseGroup ON Symptoms.Symptom = DiseaseGroup.Symptom
GROUP BY Symptoms.Patient, DiseaseGroup.Group;

By contrast, using a normalized structure like this:

clip_image008
Figure 4: Repeated columns normalized into repeated rows in Symptoms table

I can query the data much more simply:

SELECT Patient.Patient, DiseaseGroup.Group
FROM
(Symptoms INNER JOIN DiseaseGroup ON Symptoms.Symptom = DiseaseGroup.Symptom)
INNER JOIN Patient ON Symptoms.PatientID = Patient.PatientID
GROUP BY Patient.Patient, DiseaseGroup.Group;

Or in the query builder:

clip_image010
Figure 5: Normalized query in the Query Builder.

Once again, the results of the queries are identical:

clip_image012
Figure : Results of both queries

But querying repeated columns proves much more complicated and much less flexible than querying a normalized table structure.

No comments: