Monday, April 18, 2011

Aggregating Across Repeated Columns: Summing

In the first post in this series (The Problem of Repeated Columns), I defined repeated columns and talked about the data integrity problems associated with them. If you haven't read that yet, it would be worthwhile to read first. Similarly, if you are not familiar with the concept of Normalization, you should read my blog series What is Normalization?

So far in this series, I've discussed the problem with querying textual or Boolean (Yes/No) data from repeated columns. But numeric data offers new challenges because we often want to do math on them. The most common kind of math is aggregation, that is, summing, counting, and averaging numeric values. This time, I'll talk about summing data in repeated columns.

Others in this series:

Summing Across Columns

The ability to sum values is useful in a number of applications. In Excel, there is a SUM function which will add up the values of cells that have a value. The Excel function will work for any range of cells, across or down.

In Access, however, the Sum() function only works down columns, not across columns. So, summing values in repeated rows is easy (see below), but just like summing across repeated columns is much more challenging.

For instance, suppose I had a table of student test scores:



Summing across a row with repeated columns is much like doing it in a spreadsheet. I can create an expression addressing the field names rather than cell references:

[Test1]+[Test2]+[Test3]+[Test4] AS TestSum

Unfortunately, this won't work as is, because some of the columns are blank. In Access a blank column is NULL, which is different than in a spreadsheet, where a blank cell means "zero" or "empty string"(see What does NULL mean? How is it different than the Empty String?). If I used the expression as written so far, both Ralph and Sue would have blanks:



Adding any value to a NULL returns a NULL. So I have to explicitly handle the NULL values and turn them into zeros. Fortunately, Access has the Nz() function which will do that:

Nz([Test1])+Nz([Test2])+Nz([Test3])+Nz([Test4]) AS TestSum

The full query would look like this:

SELECT 
   Student, 
   Nz([Test1])+Nz([Test2])+Nz([Test3])+Nz([Test4]) AS TestSum
FROM StudentScores_RepeatedColumns
ORDER BY Student;


Or in the Query Builder:



The result would look like this:



Summing Down Rows

By contrast, suppose I normalize the table to remove the repeated columns. The table should look something like this:



Since the table is normalized (that is, the values go down a row), I can use the aggregate (or "Totals") functions that are built in to SQL. In this case, it's the SUM() function.

Sum(Nz(Score)) AS TestScore

The full query would look like this:

SELECT StudentID, sum(Nz(Score)) AS TestScore
FROM StudentScores_Rows
GROUP BY StudentID
ORDER BY StudentID;

Or in the Query Builder:




Once again, the results of the queries are identical:



Now, this doesn't look like too much of an issue, especially compared to some of the problems we've seen with querying repeated columns. However, this is just 4 test scores. What if there were 20 or 50? The expression in the Repeated Columns table quickly becomes long and cumbersome. But with the normalized structure, the query doesn't change no matter how many test values there are.


Alternate Solution: User Defined Function (UDF)

Although the best solution is to normalize your database, it is often not practical with an existing database. In that case, a User Defined Function (UDF) may be a solution.

Fellow MVP, John Spencer has created a UDF for summing any number of columns (up to 29):

Public Function fRowSum(ParamArray Values())
'John Spencer
'Last Update: April 5, 2000
'Calculates the arithmetic sum of a group of values passed to it.
'Sample call:
'mySum = fRowSum("1","TEST","2", "3",4,5,6,0) returns 21
'Ignores values that cannot be treated as numbers.
'
' Max of 29 arguments can be passed to a function in Access SQL
Dim i As Integer, intElementCount As Integer, dblSum As Double
intElementCount = 0
dblSum = 0
For i = LBound(Values) To UBound(Values)
  If IsNumeric(Values(i)) Then 'Ignore Non-numeric values
    dblSum = dblSum + Values(i)
    intElementCount = intElementCount + 1
  End If
Next i
If intElementCount > 0 Then 'At least one number in the group of values
  fRowSum = dblSum
Else 'No number in the group of values
  fRowSum = 0
End If
End Function


The function should go in a General Module. Then it can be used in a query as follows:

SELECT Student, fRowSum([Test1],[Test2],[Test3],[Test4]) AS TestSum
FROM StudentScores_RepeatedColumns
ORDER BY Student;



































2 comments:

Pam said...

I have a question regarding the symptoms table. How can you account for different responses to the symptoms, e.g. "Cough?" Yes (cough present), "No" (no cough), "Unknown" (someone should have asked but answer not recorded), "NA" (patient did not respond or question was not asked). These points seem picky, but only the Yes and No are valid; the other responses are possible Yes/No and analysis requires taking them into account.

Roger Carlson said...

@Pam,

I assume you're referring to my post Querying Repeated Columns: Multiple IIFs . But, I'm not sure if you're thinking of the repeated columns table or the normalized version.

In either case, you'd use a text field restricted to values like "Y", "N", "NA", "UNK". However, doing this in the repeated columns table would just make the issue of querying the date even worse.

It's hard to give a complete answer here in the comments section, and because the question illustrates yet another variation on the repeated columns theme, I'll address it in a new post in a couple of days.

Stay tuned.