Monday, April 25, 2011

Aggregating Across Repeated Columns: Counting

In the first post in this series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

So far in this series, I've discussed the problem with querying textual or Boolean (Yes/No) data from repeated columns. But numeric data offers new challenges because we often want to do math on them. The most common kind of math is aggregation, that is, summing, counting, and averaging numeric values. This time, I'll talk about counting data in repeated columns.

Others in this series:

Counting Across Columns

The ability to count values is useful in a number of applications. In Excel, there is a Count function which will count the cells that have a value. The Excel function will work for any range of cells, across or down.

In Access, however, the Count() function only works down columns, not across columns. So, counting values in repeated rows is easy (see below), but just like summing, counting across repeated columns is more challenging.

For instance, suppose I had a table of student test scores:

image

If I wanted to count the number of tests each student has taken, I need to create an expression that will return a 1 if the field has a value and a 0 if it does not. In this way, I can sum the returned values and that will equal the number of fields that have a value. To do this, I need 3 stages.

1. Checking for NULL

To test whether a field has a value or not, I can test for a NULL value. In Access a blank column is NULL, which does NOT mean either "zero" or "empty string"(see What does NULL mean? How is it different than the Empty String?). To test for a NULL, I have to use the IsNull() function. Like this:

IsNull([Test1])

This will return a value of -1 (Yes) if the field is NULL and 0 (No) if it is not.

2. Checking for NOT NULL

Unfortunately, this is the opposite of what I want. I want a Yes if the field has a value and a No if it does not. To correct this, I can reverse the value returned by the IsNull() function by prefacing it with the NOT operator:

Not IsNull([Test1])

This will return a value of -1 if the field is NOT NULL and 0 if it is. So now I can sum my columns:

(Not IsNull([Test1])) + (Not IsNull([Test2])) + (Not IsNull([Test3])) + (Not IsNull([Test4]))

Note: the extra parentheses around the individual expressions are necessary to evaluate properly.

3. Returning the Absolute Value

This is close, but not exactly what I want because it will return a negative value for the sum. One last thing I have to do is return the absolute value of the returned value:

Abs((Not IsNull([Test1])) + (Not IsNull([Test2])) + (Not IsNull([Test3])) + (Not IsNull([Test4])))

The full query would look like this:

SELECT Student, Abs((Not IsNull([Test1]))+(Not IsNull([Test2]))+(Not IsNull([Test3]))+(Not IsNull([Test4]))) AS TestCount
FROM StudentScores_RepeatedColumns
ORDER BY Student;

Or in the Query Builder:

image

The result would look like this:

image

Count Down Rows

By contrast, suppose I normalize the table to remove the repeated columns. The table should look something like this:

image

Since the table is normalized (that is, the values go down a row), I can use the aggregate (or "Totals") functions that are built in to SQL. In this case, it's the Count() function.

Count(Score) AS TestCount

The full query would look like this:

SELECT StudentID, Count(Score) AS TestCount
FROM StudentScores_Rows
GROUP BY StudentID
ORDER BY StudentID;

Or in the Query Builder:

image

Once again, the results of the queries are identical:

image

Now, with only 4 test scores, the expression to count repeated columns is manageable. But what if there were 20 or 50? The expression quickly becomes long and cumbersome. But with the normalized structure, the query doesn't change no matter how many test values there are.

Monday, April 18, 2011

Aggregating Across Repeated Columns: Summing

In the first post in this series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

So far in this series, I've discussed the problem with querying textual or Boolean (Yes/No) data from repeated columns. But numeric data offers new challenges because we often want to do math on them. The most common kind of math is aggregation, that is, summing, counting, and averaging numeric values. This time, I'll talk about summing data in repeated columns.

Others in this series:

Summing Across Columns

The ability to sum values is useful in a number of applications. In Excel, there is a SUM function which will add up the values of cells that have a value. The Excel function will work for any range of cells, across or down.

In Access, however, the Sum() function only works down columns, not across columns. So, summing values in repeated rows is easy (see below), but just like summing across repeated columns is much more challenging.

For instance, suppose I had a table of student test scores:

image

Summing across a row with repeated columns is much like doing it in a spreadsheet. I can create an expression addressing the field names rather than cell references:

[Test1]+[Test2]+[Test3]+[Test4] AS TestSum

Unfortunately, this won't work as is, because some of the columns are blank. In Access a blank column is NULL, which is different than in a spreadsheet, where a blank cell means "zero" or "empty string"(see What does NULL mean? How is it different than the Empty String?). If I used the expression as written so far, both Ralph and Sue would have blanks:

image

Adding any value to a NULL returns a NULL. So I have to explicitly handle the NULL values and turn them into zeros. Fortunately, Access has the Nz() function which will do that:

Nz([Test1])+Nz([Test2])+Nz([Test3])+Nz([Test4]) AS TestSum

The full query would look like this:

SELECT Student, Nz([Test1])+Nz([Test2])+Nz([Test3])+Nz([Test4]) AS TestSum
FROM StudentScores_RepeatedColumns
ORDER BY Student;

Or in the Query Builder:

image

The result would look like this:

image

Summing Down Rows

By contrast, suppose I normalize the table to remove the repeated columns. The table should look something like this:

image

Since the table is normalized (that is, the values go down a row), I can use the aggregate (or "Totals") functions that are built in to SQL. In this case, it's the SUM() function.

Sum(Nz(Score)) AS TestScore

The full query would look like this:

SELECT StudentID, sum(Nz(Score)) AS TestScore
FROM StudentScores_Rows
GROUP BY StudentID
ORDER BY StudentID;

Or in the Query Builder:

image

Once again, the results of the queries are identical:

image

Now, this doesn't look like too much of an issue, especially compared to some of the problems we've seen with querying repeated columns. However, this is just 4 test scores. What if there were 20 or 50? The expression in the Repeated Columns table quickly becomes long and cumbersome. But with the normalized structure, the query doesn't change no matter how many test values there are.

Alternate Solution: User Defined Function (UDF)

Although the best solution is to normalize your database, it is often not practical with an existing database. In that case, a User Defined Function (UDF) may be a solution.

Fellow MVP, John Spencer has created a UDF for summing any number of columns (up to 29):

Public Function fRowSum(ParamArray Values())
'John Spencer
'Last Update: April 5, 2000
'Calculates the arithmetic sum of a group of values passed to it.
'Sample call:
'mySum = fRowSum("1","TEST","2", "3",4,5,6,0) returns 21
'Ignores values that cannot be treated as numbers.
'
' Max of 29 arguments can be passed to a function in Access SQL
Dim i As Integer, intElementCount As Integer, dblSum As Double
intElementCount = 0
dblSum = 0
For i = LBound(Values) To UBound(Values)
  If IsNumeric(Values(i)) Then 'Ignore Non-numeric values
    dblSum = dblSum + Values(i)
    intElementCount = intElementCount + 1
  End If
Next i
If intElementCount > 0 Then 'At least one number in the group of values
  fRowSum = dblSum
Else 'No number in the group of values
  fRowSum = 0
End If
End Function

The function should go in a General Module. Then it can be used in a query as follows:

SELECT Student, fRowSum([Test1],[Test2],[Test3],[Test4]) AS TestSum
FROM StudentScores_RepeatedColumns
ORDER BY Student;

Monday, April 11, 2011

Querying Repeated Columns: Impossible Joins

In the first post in this series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

What I'm going to concentrate in the next few post are specific problems with querying data stored in repeated columns. This time, it's Impossible Joins statements.

Others in this series:

Impossible Joins

Last time, I discussed problems associated with the sort of repeated columns typified by the use of multiple Yes/No fields. For instance, a patient table with a listing of symptoms as Yes/No fields. Like this:

clip_image002
Figure 1: Patient table with Yes/No fields representing symptoms.

Another problem with this sort of table is the impossibility of joining it to another table based on the column names. For instance, suppose I wanted to group the symptoms into disease groupings. I could create a table like this:

clip_image004
Figure 2: Disease grouping table.

The problem here is there's no way to join this table back to the Patient table. I can't join values in one table to the field names in another. In a relational database, all information is supposed to be stored as values in tables. When you put the value as the field name, you lose much of the capabilities built in the SQL, the query language designed for relational databases.

To solve this, I need to use the solution to the patient/symptom list problem discussed last time. That is, using multiple IIF statements and multiple unions to convert the field names into values. Like this:

SELECT Patient, IIf([cough] = True,"Cough") AS Symptom
FROM Patient_RepeatedColumns WHERE [Cough] = True
UNION ALL
SELECT Patient, IIf([Sneeze] = True,"Sneeze") AS Symptom
FROM Patient_RepeatedColumns WHERE [Sneeze] = True
UNION ALL
SELECT Patient, IIf([Fever] = True,"Fever") AS Symptom
FROM Patient_RepeatedColumns WHERE [Fever] = True
UNION ALL
SELECT Patient, IIf([Body_Aches] = True,"Body aches") AS Symptom
FROM Patient_RepeatedColumns WHERE [Body_Aches] = True
UNION ALL
SELECT Patient, IIf([Nausea] = True,"Nausea") AS Symptom
FROM Patient_RepeatedColumns WHERE [Nausea] = True
ORDER BY Patient, Symptom;

Which gives me the following result:

clip_image006
Figure 3: Intermediate query to be used as a subquery

I can now use this query as a subquery in the From clause of another query by giving it an alias (Symptoms):

SELECT Symptoms.Patient, DiseaseGroup.Group
FROM
(SELECT Patient, IIf([cough],"Cough") AS Symptom
FROM Patient_RepeatedColumns WHERE [Cough] = True
UNION ALL
SELECT Patient, IIf([Sneeze],"Sneeze") AS Symptom
FROM Patient_RepeatedColumns WHERE [Sneeze] = True
UNION ALL
SELECT Patient, IIf([Fever],"Fever") AS Symptom
FROM Patient_RepeatedColumns WHERE [Fever] = True
UNION ALL
SELECT Patient, IIf([Body_Aches],"Body aches") AS Symptom
FROM Patient_RepeatedColumns WHERE [Body_Aches] = True
UNION ALL SELECT Patient, IIf([Nausea],"Nausea") AS Symptom
FROM Patient_RepeatedColumns WHERE [Nausea] = True
ORDER BY Patient, Symptom) as Symptoms
INNER JOIN DiseaseGroup ON Symptoms.Symptom = DiseaseGroup.Symptom
GROUP BY Symptoms.Patient, DiseaseGroup.Group;

By contrast, using a normalized structure like this:

clip_image008
Figure 4: Repeated columns normalized into repeated rows in Symptoms table

I can query the data much more simply:

SELECT Patient.Patient, DiseaseGroup.Group
FROM
(Symptoms INNER JOIN DiseaseGroup ON Symptoms.Symptom = DiseaseGroup.Symptom)
INNER JOIN Patient ON Symptoms.PatientID = Patient.PatientID
GROUP BY Patient.Patient, DiseaseGroup.Group;

Or in the query builder:

clip_image010
Figure 5: Normalized query in the Query Builder.

Once again, the results of the queries are identical:

clip_image012
Figure : Results of both queries

But querying repeated columns proves much more complicated and much less flexible than querying a normalized table structure.

Monday, April 4, 2011

Querying Repeated Columns: Multiple IIFs

In the first post in this series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

What I'm going to concentrate in the next few post are specific problems with querying data stored in repeated columns. This time, it's Multiple IIF statements.

Others in this series:

Multiple IIF Statements

Another sort of repeated columns can be seen by the use of multiple Yes/No fields, each of which are just values of a specific category. For instance, suppose I had a patient table with a listing of symptoms as Yes/No fields. Like this:

clip_image002
Figure : Patient table with Yes/No fields representing symptoms.

If I wanted to create a simple list of patient's symptoms, in addition to the multiple Unions, I need multiple IIF statements to convert the field names into values. Something like this:

SELECT Patient, IIf([cough] = True,"Cough") AS Symptom
FROM Patient_RepeatedColumns WHERE [Cough] = True
UNION ALL
SELECT Patient, IIf([Sneeze] = True,"Sneeze") AS Symptom
FROM Patient_RepeatedColumns WHERE [Sneeze] = True
UNION ALL
SELECT Patient, IIf([Fever] = True,"Fever") AS Symptom
FROM Patient_RepeatedColumns WHERE [Fever] = True
UNION ALL
SELECT Patient, IIf([Body_Aches] = True,"Body aches") AS Symptom
FROM Patient_RepeatedColumns WHERE [Body_Aches] = True
UNION ALL
SELECT Patient, IIf([Nausea] = True,"Nausea") AS Symptom
FROM Patient_RepeatedColumns WHERE [Nausea] = True
ORDER BY Patient, Symptom;

By contrast, if I normalized the table to convert the repeated columns to repeated rows, it might look something like this:

clip_image004
Figure : Repeated columns converted to repeated rows in Symptoms table.

Creating the patient/symptom list is now simple:

SELECT Patient, Symptom
FROM Symptoms INNER JOIN Patient ON Symptoms.PatientID = Patient.PatientID
ORDER BY Patient, Symptom;

Or in the query builder:

clip_image006
Figure : Normalized query in the Query Builder.

The results of the queries are identical except the normalized query is updateable:

clip_image008
Figure : Results of both queries

Once again, querying repeated columns proves much more complicated and much less flexible than querying a normalized table structure.