Thursday, May 24, 2012

COUNT DISTINCT In Access: Part 3

User-Defined Function
SQL Server has a nice built-in function called COUNT DISTINCT, which is missing in Access SQL.

What does COUNT DISTINCT do?  Well, there are times when you want to count distinct values in a query, that is, a count of values without duplicates.  For instance, given the following table, how many distinct customers have orders?

ORDERS
OrderIDOrderDateCustomerAmount
11/17/2008Ajax Inc.$310.00
21/17/2008Ajax Inc.$510.50
31/17/2008Ajax Inc.$311.00
41/17/2008Baker Corp.$5,144.00
51/17/2008Baker Corp.$61.00
61/17/2008Baker Corp.$110.50
71/17/2008Baker Corp.$11.00
81/17/2008Crystal & Co.$111.85
91/17/2008Crystal & Co.$511.00
101/18/2008Baker Corp.$711.95
111/18/2008Baker Corp.$810.00
121/18/2008Baker Corp.$310.59
131/18/2008Crystal & Co.$311.00
141/18/2008Crystal & Co.$811.50
151/18/2008Ajax Inc.$512.00
161/18/2008D&D LLC$211.00
171/18/2008D&D LLC$3,311.50
181/19/2008Ajax Inc.$410.00
191/19/2008Ajax Inc.$610.50
201/19/2008Baker Corp.$4,411.00
211/19/2008Baker Corp.$511.50
221/19/2008Baker Corp.$611.50

In SQL Server, I can do this:

SELECT COUNT(DISTINCT Customer) AS CountOfCustomer FROM Orders
Which will give me the following:

CountOfCustomer
4

In Access, if I use the Distinct predicate with the count:

SELECT DISTINCT Count(Customer) AS CountOfCustomer FROM Orders;

I get:

CountOfCustomer
22

Since Access SQL does not have the Count Distinct function, what can I do?
There are actually three different methods for simulating the Count Distinct:
  1. Subqueries in the FROM Clause
  2. Subqueries in the Field List
  3. User-Defined Function (this post)
  4. Crosstab Query (reader submitted method)
Each of these methods have advantages and disadvantages, and I'll address each in turn.  This time I'll end with a User-Defined Function.

User-Defined Function

A user-defined function is simply a piece of VBA code that you can call from other code or from a query.  In this case, I'll call it from a query.

This process works pretty much like the built-in Domain functions like DCount, DMax, DLookup, etc.  If you're unfamiliar with Domain Functions, you can read more in my post: Domain Functions Demystified.

Into the function, I send the name of a field, a table (or query), and a Where clause (without the WHERE keyword),  In addition, I also need to send the Group By clause (again without the GROUP BY keyword).

Then I take the query created in Subqueries in the Field List and build it in code with the addition of a WHERE clause that restricts the query to a single row matching the row being evaluated in the query.  The evaluated query returns a single value.  

Lastly, I open a recordset and return the value from the "countof" column from the function.
First of all, create the following Function in a General Module:


Function DistinctCount(FieldName As String, _
                        TableName As String, _
                        WhereClause As String, _
                        GroupByClause As String) As Long
'declare variablesDim db As DAO.Database
Dim rs As DAO.Recordset
Dim strSQL As String

On Error GoTo DistinctCount_Error
'set database variable to current databaseSet db = CurrentDb
'create SQL string with function argument valuesstrSQL = "SELECT Count(Temp." & FieldName & ") as countof " & _
       " FROM (SELECT " & GroupByClause & ", Customer FROM " & _
       TableName & " GROUP BY " & GroupByClause & ", " & _
       FieldName & " ORDER BY " & GroupByClause & ")  AS Temp " & _
       " WHERE " & WhereClause & _
       " GROUP BY " & GroupByClause
      
'open a recordset of the resultant SQL statementSet rs = db.OpenRecordset(strSQL)

'if a record is foundIf Not rs.EOF Then
    'return the value    DistinctCount = rs!countof
Else
    'return -1 to indicate "no records"    DistinctCount = -1
End If

On Error GoTo 0
    'destroy object variables    Set rs = Nothing
    Set db = Nothing

    Exit Function
DistinctCount_Error:
    MsgBox "Error " & Err.Number & " (" & Err.Description & _
    ") in procedure DistinctCount of Module Module1"

End Function

Call the Function

Calling the function, from a query looks something like this in the Query Builder:
CountOfCustomer: DistinctCount("Customer","Orders","OrderDate =# " & [OrderDate] & "#","OrderDate")



In SQL view:

SELECT Orders.OrderDate, DistinctCount("Customer","Orders","OrderDate =# " & [OrderDate] & "#","OrderDate") AS CountOfCustomer, Count(Orders.OrderID) AS CountOfOrderID, Sum(Orders.Amount) AS SumOfAmount
FROM Orders
GROUP BY Orders.OrderDate
ORDER BY Orders.OrderDate;


Which results in the following:

OrderDate CountOfCustomer CountOfOrderID SumOfAmount
1/17/2008 3 9 $7,080.85
1/18/2008 4 8 $6,989.54
1/19/2008 2 5 $6,554.50

How it works

The function runs for each row evaluated. For instance, in the above query, for 1/17/2008, the query that is created in the code looks like this:

SELECT Count(Temp.Customer) as countof  FROM (SELECT OrderDate, Customer FROM Orders GROUP BY OrderDate, Customer ORDER BY OrderDate)  AS Temp  WHERE OrderDate =#1/17/2008# GROUP BY OrderDate

Which returns

countof
3

The SQL for the next row is:

SELECT Count(Temp.Customer) as countof  FROM (SELECT OrderDate, Customer FROM Orders GROUP BY OrderDate, Customer ORDER BY OrderDate)  AS Temp  WHERE OrderDate =#1/18/2008# GROUP BY OrderDate

Which returns

countof
4

The SQL for the last row is:
SELECT Count(Temp.Customer) as countof  FROM (SELECT OrderDate, Customer FROM Orders GROUP BY OrderDate, Customer ORDER BY OrderDate)  AS Temp  WHERE OrderDate =#1/19/2008# GROUP BY OrderDate

Which returns

countof
2

As you can imagine, with a lot of customers, this might not be the best performer.  But I'll leave that for next time, when I'll look at the methods and discuss the pros and cons of each.

5 comments:

Patrick Honorez said...

You can get that Count Distinct MUCH more efficiently by using a Crosstab query:


TRANSFORM Count(*) AS Cell
SELECT Count(cell) AS DistinctCount
FROM Orders
GROUP BY "Anything"
PIVOT CustId In (null);

Roger Carlson said...

@Patrick

Thanks for the suggestion. That's a new one to me, and I've incorporated it into the series. However, you might be surprised by the performance comparison in Part 5.

Anonymous said...

Thank for posting the code for the user defined function. Note: the sql string has "Customer" instead of "FieldName".

Patrick Honorez said...

Very interesting comparison, thanks !
If you still have the tables that you prepared, it would be interesting to benchmark again, but instead of calculating the time to do an OpenQuery, calculating the time of
a) opening the recordset b) do a rs.movelast

Dorv said...

I'm still amazed that a function like this doesn't already exist in Access.

I continually run into the need for it specifically in reports.

Example:

I'm currently working on a report (whose data already comes through several different queries, including a rather large union query), where I have the individual records grouped by State (more specifically the field DepartState).

In the group header, I'd like to add a a counter for the number of individual counties that exists in the set of records displayed (specifically DepartCounty).

(I don't actually show the detail lines here, just the group header, which displays a line for each state).

I THINK the user-defined function is the way to go, but I'm not entirely sure.