Monday, April 30, 2012

COUNT DISTINCT in Access: Part 1

SQL Server has a nice built-in function called COUNT DISTINCT, which is missing in Access SQL.

What does COUNT DISTINCT do?  Well, there are times when you want to count distinct values in a query, that is, a count of values without duplicates.  For instance, given the following table, how many distinct customers have orders?

ORDERS

 OrderID OrderDate Customer Amount 1 1/17/2008 Ajax Inc. \$310.00 2 1/17/2008 Ajax Inc. \$510.50 3 1/17/2008 Ajax Inc. \$311.00 4 1/17/2008 Baker Corp. \$5,144.00 5 1/17/2008 Baker Corp. \$61.00 6 1/17/2008 Baker Corp. \$110.50 7 1/17/2008 Baker Corp. \$11.00 8 1/17/2008 Crystal & Co. \$111.85 9 1/17/2008 Crystal & Co. \$511.00 10 1/18/2008 Baker Corp. \$711.95 11 1/18/2008 Baker Corp. \$810.00 12 1/18/2008 Baker Corp. \$310.59 13 1/18/2008 Crystal & Co. \$311.00 14 1/18/2008 Crystal & Co. \$811.50 15 1/18/2008 Ajax Inc. \$512.00 16 1/18/2008 D&D LLC \$211.00 17 1/18/2008 D&D LLC \$3,311.50 18 1/19/2008 Ajax Inc. \$410.00 19 1/19/2008 Ajax Inc. \$610.50 20 1/19/2008 Baker Corp. \$4,411.00 21 1/19/2008 Baker Corp. \$511.50 22 1/19/2008 Baker Corp. \$611.50

In SQL Server, I can do this:

SELECT COUNT(DISTINCT Customer) AS CountOfCustomer FROM Orders

Which will give me the following:

 CountOfCustomer 4

In Access, if I use the Distinct predicate with the count:

SELECT DISTINCT Count(Customer) AS CountOfCustomer FROM Orders;

I get:

 CountOfCustomer 22

Since Access SQL does not have the Count Distinct function, what can I do?

There are actually four different methods for simulating the Count Distinct:

1. Subqueries in the FROM Clause (this post)
2. Subqueries in the Field List
3. User-Defined Function
4. Crosstab Query (reader submitted method)

Subquery in the FROM Clause Method

Using a subquery in the FROM cause restricts the pool of values for aggregating in the main query.  The main advantage is that it's fairly easy to figure out.  The main disadvantage is lack of flexibility.

Note: I have a couple of choices for using a subqueries.  They can be used either in-line or stacked.  What's the difference?  In the stacked query method, the individual subsqueries are saved as named queries and then referenced by name in later queries.  In the in-line method, they are all combined into a single SQL statement

I find it useful to create use the stacked query method to develop and combine them into a single statement once I've got it working correctly.  It also gives me a method of checking my results.

Problem 1:

So, back to my original problem.  First, I create a simple aggregate query, grouping the customers:

SELECT Customer FROM Orders GROUP BY Customer;

Result:

SimpleStep1 (saved query)

 Customer Ajax Inc. Baker Corp. Crystal & Co. D&D LLC

I'll save that as SimpleStep1.  Then I'll use SimpleStep1 in the FROM clause for a second query:

SELECT Count(Customer) AS CountOfCustomer FROM SimpleStep1;

SimpleStep2

 CountOfCustomer 4

Combining these into a single query:

SELECT Count(Step1.Customer) AS CountOfCustomer
FROM (SELECT Customer FROM Orders GROUP BY Customer) AS Step1;

SimpleCombined

 CountOfCustomer 4

Note: the alias "Step1" could be anything at all, even "A".  I used "Step1" to illustrate how the queries are related.

More Complex Queries: Problem 2

This is all well and good, but, unfortunately, this is only useful in limited situations.  If I wanted a more complex query, say, how many customers per day:

 OrderDate CountOfCustomer 1/17/2008 3 1/18/2008 4 1/19/2008 2

it is much more difficult.

SQL Server

With COUNT DISTINCT, in SQL Server, I can just add an additional field to the Field List and Group By clauses.

SELECT OrderDate, COUNT(DISTINCT Customer) AS CountOfCustomer FROM Orders
GROUP BY OrderDate;

Access

If I try to do the same thing in Access,

SELECT OrderDate, Count(Step1.Customer) AS CountOfCustomer
FROM (SELECT Customer FROM Orders GROUP BY Customer)  AS Step1
GROUP BY OrderDate;

I get a parameter box asking for OrderDate:

In retrospect the reason for this if fairly obvious.  I'm using a GROUP BY in the subquery in the FROM clause to limit the values available and OrderDate isn't in the field list.

So to fix it, I have to add OrderDate to both the subquery and the main query:

SELECT Step1.OrderDate, Count(Step1.Customer) AS CountOfCustomer
FROM (SELECT OrderDate, Customer
FROM Orders GROUP BY OrderDate, Customer)  AS Step1
GROUP BY Step1.OrderDate;

And I get the same result:

 OrderDate CountOfCustomer 1/17/2008 3 1/18/2008 4 1/19/2008 2

Even More Complex Queries: Problem 3

But suppose I want an even more complex query.  Suppose I want to see a distinct count of the Customers AND a count of the number of orders for each date?

In SQL Server, I can simply add my additional aggregrate function (CountOfOrders) to the Field List:

SELECT OrderDate, COUNT(DISTINCT Customer) AS CountOfCustomer
COUNT(OrderID) AS CountOfOrders
FROM Orders
GROUP BY OrderDate;

In Access, if I try to do the same thing:

SELECT Step1.OrderDate, Count(Step1.Customer) AS CountOfCustomer,
COUNT(OrderID) AS CountOfOrders
FROM (SELECT OrderDate, Customer
FROM Orders GROUP BY OrderDate, Customer)  AS Step1
GROUP BY Step1.OrderDate;

once again, I'll get the parameter box but this time asking for the OrderID.

The reason is the same because OrderID is not in the record source.  So I'm going to have to modify the record source to add it.  I'll start with the stacked query method and then combine them.

StackedStep1

SELECT Orders.OrderDate, Orders.Customer
FROM Orders
GROUP BY Orders.OrderDate, Orders.Customer;

StackedStep2

SELECT OrderDate, Count(Customer) AS CountOfCustomer
FROM StackedStep1
GROUP BY OrderDate;

StackedStep3

SELECT Orders.OrderDate, Stackedstep2.CountOfCustomer,
Count(Orders.OrderID) AS CountOfOrderID
FROM Stackedstep2 INNER JOIN Orders ON Stackedstep2.OrderDate = Orders.OrderDate
GROUP BY Orders.OrderDate, Stackedstep2.CountOfCustomer;

Which gives me the following:

 OrderDate CountOfCustomer CountOfOrderID 1/17/2008 3 9 1/18/2008 4 8 1/19/2008 2 5

Combining them into a single query:

SELECT Orders.OrderDate, step2.CountOfCustomer,
Count(Orders.OrderID) AS CountOfOrderID
FROM (SELECT OrderDate, Count(Customer) AS CountOfCustomer
FROM (SELECT OrderDate, Customer FROM Orders
GROUP BY OrderDate, Customer ORDER BY OrderDate)  AS step1
GROUP BY OrderDate)  AS step2
INNER JOIN Orders ON step2.OrderDate = Orders.OrderDate
GROUP BY Orders.OrderDate, step2.CountOfCustomer;

Fun, huh?

I think you can see that while modifying the FROM clause allows me to get the right answer, it's a lot of work and not very flexible, that is, it's not simple to add additional levels of grouping.  That's because I'm removing duplicates from the record source itself.  It would be better if I could leave the record source alone and simply remove the duplicates from the count. This is exactly what the Field List method does.

Next time, I'll look at doing the COUNT DISTINCT in the Field list, and we'll see if it's any better.

Katie said...

Cater said...