Some people go to great lengths to avoid GROUP BY and HAVING clauses in their queries. The error messages are fussy but they are usually right. GROUP BY and HAVING key words are essential for good SQL reporting.
The primary reason for GROUP BY is to reduce the number of rows, usually by aggregation. It produces only one row for each matching grouping from the input. This allows you to make sophisticated calculations via ordinary SQL.
Fruit Example:
We have some fruit:
item | source | amt | fresh_until ---------+---------+-----+------------ bananas | Chile | 50 | 2019-05-01 bananas | Bolivia | 25 | 2019-04-15 bananas | Chile | 150 | 2019-07-10 apples | USA-WA | 75 | 2019-07-01 apples | USA-CA | 75 | 2019-08-15 apples | Canada | 80 | 2019-08-01 grapes | USA-CA | 120 | 2019-07-15 (7 rows)
This next case allows us to look forward. Mid-year, what fruits will be available? We do this with the same query as above, however, after the query runs we check the values of min(fresh_until) by using a having clause. HAVING is how you qualify an aggregate.
select item, count(source) as srcs, sum(amt) as tot_amt, min(fresh_until) as min_fresh_until from fruit group by item having min(fresh_until) > '30-jun-2019'; item | srcs | tot_amt | min_fresh_until --------+------+---------+---------------- grapes | 1 | 120 | 2019-07-15 apples | 3 | 230 | 2019-07-01 (2 rows)
All of the apples and grapes will be available mid-year.
The list of items between SELECT and FROM, the target list. may contain non-aggregates and aggregates. Those non-aggregate columns in the target list
should be in the group by clause. The error message says so. The order of the columns in the group by clause matters. It determines how the aggregates are grouped. The order is often hierarchical. What that means to your columns is your focus. It could be fruit, or sources and/or fresh_until date.
Playing Cards Examples
Let’s look at another set of examples that illustrate extracting information on playing cards. You can learn about cards on Wikipedia Standard Cards.
Suppose you deal out programmatically six 5-card hands, like six people playing poker. A total of 30 cards are used in this deal. They are in a hand table like the following where the names of cards and suits are joined in by lookup tables. We store ranks so we can sort properly. We use names for display. The names and ranks have a one to one relationship for each of cards and suits.
create or replace view hands_v as select handid, su_name, ca_name from hands join suits using (su_rank) join cards using (ca_rank); handid | su_name | ca_name --------+---------+--------- 1 | Diamond | 2 1 | Club | 8 1 | Spade | J 1 | Diamond | 5 1 | Heart | Q 2 | Heart | 9 2 | Spade | 4 2 | Heart | 7 ...
What is the suit count for each hand? We really only care about any hands that have 3 or more cards of the same suit. That will tell us who has better chances for a poker flush. Note that although GROUP BY would seem to imply ORDER BY, it does not. ORDER BY must be explicit.
select handid, su_name, count(ca_name) from hands_v group by handid, su_name having count(ca_name) >= 3 order by handid, su_name; handid | su_name | count --------+---------+------- 3 | Diamond | 3 4 | Spade | 3 6 | Spade | 3 (3 rows)
So what if you mis-grouped your query? If this hand table is not grouped by handid, then you will get 30 records of 6 hands of 5-cards. If you had aggregates, they would be grouped by row. Not very helpful.
If you aggregate the card name and do not include
the card name solo on the target list and try to order by card name,
you will receive the error message that it should not be in
the order by clause. The order by clause should contain
elements of the group by clause.
However, if the card name is explicitly in the target list,
select handid, ca_name, count(ca_name)...
then the card name must be in the group by clause and
therefore allowable on the order by clause.
If the query is by suit, there will be a minimum of 1 or maximum of 4 records per suit for each of the six hands. Notice that we are sorting by suit rank which
also must be in the group by clause. su_name and su_rank have a one to one relationship.
select handid, su_name, count(su_name) from hands join suits using (su_rank) group by su_rank, su_name, handid order by handid, su_rank; handid | su_name | count --------+---------+------- 1 | Spade | 1 1 | Heart | 1 1 | Diamond | 2 1 | Club | 1 2 | Spade | 1 2 | Heart | 2 2 | Club | 2 3 | Spade | 1 3 | Diamond | 3 3 | Club | 1 4 | Spade | 3 4 | Heart | 1 4 | Diamond | 1 5 | Heart | 2 5 | Diamond | 1 5 | Club | 2 6 | Spade | 3 6 | Diamond | 2 (18 rows)
To see the distribution of cards into hands, We must group by the card rank column. Of course there are 4 suits of each card, so you won’t see a card in more than four hands.
select ca_name, count(handid) as num_hands from hands join suits using (su_rank) join cards using (ca_rank) group by ca_rank, ca_name order by ca_rank; ca_name | num_hands ---------+----------- 2 | 2 3 | 2 4 | 4 5 | 3 6 | 2 7 | 3 8 | 2 9 | 3 J | 1 Q | 2 K | 4 A | 2 (12 rows)
To peek and see who is holding aces, we can use the following short query. Note that there is a WHERE clause which is executed while collecting the rows. HAVING is executed after the rows are collected.
select handid, count(ca_name) from hands_v where ca_name = 'A' group by handid; handid | count --------+------- 5 | 1 6 | 1 (2 rows)
Summary
These examples are simple ways to evaluate known entities. Experiment and use these simple rules.
- If a column is on the target list and not an aggregate, it must be in a GROUP BY clause.
- WHERE clauses occur during the selection process.
- HAVING clauses occur after the aggregates are completed.
- Only non-aggregates can be in the ORDER BY clause.
- Order of the GROUP BY clause matters.