Monday, April 30, 2012

COUNT DISTINCT In Access: Part 2

Field List Method

SQL Server has a nice built-in function called COUNT DISTINCT, which is missing in Access SQL.

What does COUNT DISTINCT do?  Well, there are times when you want to count distinct values in a query, that is, a count of values without duplicates.  For instance, given the following table, how many distinct customers have orders?

ORDERS

OrderID

OrderDate

Customer

Amount

1

1/17/2008

Ajax Inc.

$310.00

2

1/17/2008

Ajax Inc.

$510.50

3

1/17/2008

Ajax Inc.

$311.00

4

1/17/2008

Baker Corp.

$5,144.00

5

1/17/2008

Baker Corp.

$61.00

6

1/17/2008

Baker Corp.

$110.50

7

1/17/2008

Baker Corp.

$11.00

8

1/17/2008

Crystal & Co.

$111.85

9

1/17/2008

Crystal & Co.

$511.00

10

1/18/2008

Baker Corp.

$711.95

11

1/18/2008

Baker Corp.

$810.00

12

1/18/2008

Baker Corp.

$310.59

13

1/18/2008

Crystal & Co.

$311.00

14

1/18/2008

Crystal & Co.

$811.50

15

1/18/2008

Ajax Inc.

$512.00

16

1/18/2008

D&D LLC

$211.00

17

1/18/2008

D&D LLC

$3,311.50

18

1/19/2008

Ajax Inc.

$410.00

19

1/19/2008

Ajax Inc.

$610.50

20

1/19/2008

Baker Corp.

$4,411.00

21

1/19/2008

Baker Corp.

$511.50

22

1/19/2008

Baker Corp.

$611.50

In SQL Server, I can do this:

SELECT COUNT(DISTINCT Customer) AS CountOfCustomer FROM Orders

Which will give me the following:

CountOfCustomer

4

In Access, if I use the Distinct predicate with the count:

SELECT DISTINCT Count(Customer) AS CountOfCustomer FROM Orders;

I get:

CountOfCustomer

22

Since Access SQL does not have the Count Distinct function, what can I do? 

There are actually three different methods for simulating the Count Distinct:

  1. Subqueries in the FROM Clause
  2. Subqueries in the Field List (this post)
  3. User-Defined Function
  4. Crosstab Query (reader submitted method)

Each of these methods have advantages and disadvantages, and I'll address each in turn. This time, I'll look at a subquery in the field list.

One problem with the FROM Clause method is its lack of flexibility, that is, it's not simple to add additional levels of grouping.  That's because you're removing duplicates from the record source itself.  It would be better if I could leave the record source alone and simply remove the duplicates from the count. This is exactly what the Field List method does.

Subquery in the Field List Method

With the FROM Clause method, I worked from the inside out, that is, I started with the inner queries and developed more complex queries based on the previous ones.  But that won't work well with the Field List Method.  With this method, I'll work from the outside in.  I also can't use the Query Builder grid very well (although the final query will be viewable in it), so this will mostly be SQL.

I'll start with the query that I though should have worked but didn't:

SELECT OrdersMain.OrderDate, Count(Customer) AS CountOfCustomer
FROM Orders AS OrdersMain
GROUP BY OrdersMain.OrderDate
ORDER BY OrdersMain.OrderDate;

Notice I have used an alias (OrdersMain) in the FROM clause.  This will become necessary later, so I'll put it in now.  The query will produce the original incorrect results:

OrderDate

CountOfCustomer

1/17/2008

9

1/18/2008

8

1/19/2008

5

Now, what I need to do is replace

Count(Customer) AS CountOfCustomer

with a subquery that produces the correct values for each OrderDate.  I've already got such a query from my last post using the FROM clause method (Problem2).  I can use it here:

SELECT Count(OrdersTemp.customer),OrdersTemp.orderdate
FROM (SELECT OrderDate, Customer FROM Orders GROUP BY OrderDate,
      Customer ORDER BY OrderDate)  AS OrdersTemp
GROUP BY OrdersTemp.orderdate;

As before, I've aliased the FROM source, this time as OrdersTemp.  This produces the following:

CountOfcustomer

orderdate

3

1/17/2008

4

1/18/2008

2

1/19/2008

Now, I've just got to get the two queries together.  To make it more obvious what's going on, I'll show the original MainQuery in Blue, the SubQuery in Red, and additional changes necessary to merge the two in Green

SELECT OrdersMain.OrderDate,
(
SELECT Count(OrdersTemp.Customer), OrdersTemp.orderdate
      FROM (SELECT OrderDate, Customer FROM Orders GROUP BY OrderDate,
            Customer ORDER BY OrderDate)  AS OrdersTemp
      GROUP BY OrdersTemp.orderdate
) AS CountOfCustomer
FROM Orders AS OrdersMain
GROUP BY OrdersMain.OrderDate
ORDER BY OrdersMain.OrderDate;

Now, I do need to make a few more modifications.  A subquery in the field list can only return a single column and a single row, that is, just one value.  But as is, the subquery produces two columns and three rows.  What I need to do is match up the orderdate fields in both the main query and sub query, which I do my adding a WHERE Clause.  I also have to remove orderdate from the subquery field list.

So I get the following:

SELECT OrdersMain.OrderDate,
(SELECT Count(OrdersTemp.Customer), OrdersTemp.orderdate

       FROM (SELECT OrderDate, Customer FROM Orders GROUP BY OrderDate, 
             Customer ORDER BY OrderDate)  AS OrdersTemp
       WHERE OrdersMain.OrderDate = OrdersTemp.OrderDate
       GROUP BY OrdersTemp.OrderDate) AS CountOfCustomer
FROM Orders AS OrdersMain
GROUP BY OrdersMain.OrderDate
ORDER BY OrdersMain.OrderDate;

The final query:

SELECT OrdersMain.OrderDate,
(
SELECT Count(OrdersTemp.Customer)
       FROM (SELECT OrderDate, Customer FROM Orders GROUP BY OrderDate, 
             Customer ORDER BY OrderDate)  AS OrdersTemp
      
WHERE OrdersMain.OrderDate = OrdersTemp.OrderDate
       GROUP BY OrdersTemp.OrderDate
) AS CountOfCustomer
FROM Orders AS OrdersMain
GROUP BY OrdersMain.OrderDate
ORDER BY OrdersMain.OrderDate;

OrderDate

CountOfCustomer

1/17/2008

3

1/18/2008

4

1/19/2008

2

 

Wait a second.  This is exactly what I got with the subquery itself.  Why should I bother adding it into the main query?

The reason is the flexibility I talked about earlier.  If I want to add a couple of additional fields, say a count of the orders and the total amount for each date, that is this:

OrderDate

CountOfCustomer

CountOfOrderID

SumOfAmount

1/17/2008

3

9

$7,080.85

1/18/2008

4

8

$6,989.54

1/19/2008

2

5

$6,554.50

It's as simple as adding a new field to the Main field list (in Purple)

SELECT OrdersMain.OrderDate,
      (
SELECT Count(OrdersTemp.Customer)
       FROM (SELECT OrderDate, Customer FROM Orders GROUP BY OrderDate, 
             Customer ORDER BY OrderDate)  AS OrdersTemp
      
WHERE OrdersMain.OrderDate = OrdersTemp.OrderDate
       GROUP BY OrdersTemp.OrderDate
) AS CountOfCustomer,
   
   Count(OrdersMain.OrderID) AS CountOfOrderID,
       Sum(OrdersMain.Amount) AS SumOfAmount
FROM Orders AS OrdersMain
GROUP BY OrdersMain.OrderDate
ORDER BY OrdersMain.OrderDate;

So depending on how complex your aggregate query is it may be simpler to use the Field List method.    By comparison, adding each new column using the FROM clause method requires a complete re-working of the query.

This method also more accurately mimics the Count Distinct of T-SQL which also works at the field list level.

Next time, I'll look at a User-Defined Function.

No comments: