1. sql
  2. /references
  3. /group-by

SQL GROUP BY Statement

Definition

In SQL, the GROUP BY statement is used in conjunction with aggregate functions (like SUM, COUNT, AVG, MAX, MIN, etc.) to group the result-set by one or more columns. The aggregate functions are used to calculate a single value for each group of rows.

Example

For example, consider the following table named "orders":

idcustomer_nameproductquantityprice
1John SmithHammer210
2John SmithScrews1005
3Jane DoeSaw150
4Jane DoeHil2005
5John SmithBen505

If we want to find out the total quantity of each product ordered, we can use the SUM() function with GROUP BY:

SELECT product, SUM(quantity) 
FROM orders 
GROUP BY product;

The result of this query would be:

productSUM(quantity)
Hammer2
Saw1
Screws350

Notice that the SELECT statement only includes the columns mentioned in the GROUP BY clause and the aggregate function. The columns that are not included in the GROUP BY clause, like "customer_name" and "price", are not displayed in the result set.

You can also combine multiple column in GROUP BY, for example:

SELECT customer_name, product, SUM(quantity) 
FROM orders 
GROUP BY customer_name, product;

The result of this query would be:

idcustomer_nameproductSUM (quantity)
1Jane DoeSaw1
2Jane DoeScrews200
3John SmithHammer2
4John SmithScrews150

It will create a separate group for each combination of the columns mentioned in the GROUP BY clause.

Best Practices

Here are some best practices to keep in mind when using the GROUP BY statement:

  • Always include a WHERE clause before the GROUP BY clause, if possible, to filter the data. This will make the data more manageable and improve performance.

  • Use the correct aggregate function for the type of data you're working with. For example, use SUM() for numerical data and COUNT() for categorical data.

  • When selecting multiple columns in the SELECT statement, make sure that the column that is not included in the GROUP BY clause is an aggregate function or a constant.

  • Make sure that you understand how NULL values are handled by the aggregate function you're using. Some functions (like COUNT()) treat NULL values as distinct from other values, while others (like SUM()) ignore NULL values.

  • Use the GROUP BY clause wisely, and make sure you understand how it affects the results of your query. It's very easy to produce incorrect results if you're not careful.

  • Optimize and debug your query, by using the database management system's appropriate EXPLAIN command, testing with a small dataset, and also by checking the performance of the query by seeing how much IO, CPU and Memory it is consuming.