[sql] GROUP BY and COUNT in PostgreSQL

The query:

SELECT COUNT(*) as count_all, 
       posts.id as post_id 
FROM posts 
  INNER JOIN votes ON votes.post_id = posts.id 
GROUP BY posts.id;

Returns n records in Postgresql:

 count_all | post_id
-----------+---------
 1         | 6
 3         | 4
 3         | 5
 3         | 1
 1         | 9
 1         | 10
(6 rows)

I just want to retrieve the number of records returned: 6.

I used a subquery to achieve what I want, but this doesn't seem optimum:

SELECT COUNT(*) FROM (
    SELECT COUNT(*) as count_all, posts.id as post_id 
    FROM posts 
    INNER JOIN votes ON votes.post_id = posts.id 
    GROUP BY posts.id
) as x;

How would I get the number of records in this context right in PostgreSQL?

This question is related to sql postgresql count distinct aggregate-functions

The answer is


WITH uniq AS (
        SELECT DISTINCT posts.id as post_id
        FROM posts
        JOIN votes ON votes.post_id = posts.id
        -- GROUP BY not needed anymore
        -- GROUP BY posts.id
        )
SELECT COUNT(*)
FROM uniq;

There is also EXISTS:

SELECT count(*) AS post_ct
FROM   posts p
WHERE  EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);

In Postgres and with multiple entries on the n-side like you probably have, it's generally faster than count(DISTINCT post_id):

SELECT count(DISTINCT p.id) AS post_ct
FROM   posts p
JOIN   votes v ON v.post_id = p.id;

The more rows per post there are in votes, the bigger the difference in performance. Test with EXPLAIN ANALYZE.

count(DISTINCT post_id) has to read all rows, sort or hash them, and then only consider the first per identical set. EXISTS will only scan votes (or, preferably, an index on post_id) until the first match is found.

If every post_id in votes is guaranteed to be present in the table posts (referential integrity enforced with a foreign key constraint), this short form is equivalent to the longer form:

SELECT count(DISTINCT post_id) AS post_ct
FROM   votes;

May actually be faster than the EXISTS query with no or few entries per post.

The query you had works in simpler form, too:

SELECT count(*) AS post_ct
FROM  (
    SELECT FROM posts 
    JOIN   votes ON votes.post_id = posts.id 
    GROUP  BY posts.id
    ) sub;

Benchmark

To verify my claims I ran a benchmark on my test server with limited resources. All in a separate schema:

Test setup

Fake a typical post / vote situation:

CREATE SCHEMA y;
SET search_path = y;

CREATE TABLE posts (
  id   int PRIMARY KEY
, post text
);

INSERT INTO posts
SELECT g, repeat(chr(g%100 + 32), (random()* 500)::int)  -- random text
FROM   generate_series(1,10000) g;

DELETE FROM posts WHERE random() > 0.9;  -- create ~ 10 % dead tuples

CREATE TABLE votes (
  vote_id serial PRIMARY KEY
, post_id int REFERENCES posts(id)
, up_down bool
);

INSERT INTO votes (post_id, up_down)
SELECT g.* 
FROM  (
   SELECT ((random()* 21)^3)::int + 1111 AS post_id  -- uneven distribution
        , random()::int::bool AS up_down
   FROM   generate_series(1,70000)
   ) g
JOIN   posts p ON p.id = g.post_id;

All of the following queries returned the same result (8093 of 9107 posts had votes).
I ran 4 tests with EXPLAIN ANALYZE ant took the best of five on Postgres 9.1.4 with each of the three queries and appended the resulting total runtimes.

  1. As is.

  2. After ..

    ANALYZE posts;
    ANALYZE votes;
    
  3. After ..

    CREATE INDEX foo on votes(post_id);
    
  4. After ..

    VACUUM FULL ANALYZE posts;
    CLUSTER votes using foo;
    

count(*) ... WHERE EXISTS

  1. 253 ms
  2. 220 ms
  3. 85 ms -- winner (seq scan on posts, index scan on votes, nested loop)
  4. 85 ms

count(DISTINCT x) - long form with join

  1. 354 ms
  2. 358 ms
  3. 373 ms -- (index scan on posts, index scan on votes, merge join)
  4. 330 ms

count(DISTINCT x) - short form without join

  1. 164 ms
  2. 164 ms
  3. 164 ms -- (always seq scan)
  4. 142 ms

Best time for original query in question:

  • 353 ms

For simplified version:

  • 348 ms

@wildplasser's query with a CTE uses the same plan as the long form (index scan on posts, index scan on votes, merge join) plus a little overhead for the CTE. Best time:

  • 366 ms

Index-only scans in the upcoming PostgreSQL 9.2 can improve the result for each of these queries, most of all for EXISTS.

Related, more detailed benchmark for Postgres 9.5 (actually retrieving distinct rows, not just counting):


Using OVER() and LIMIT 1:

SELECT COUNT(1) OVER()
FROM posts 
   INNER JOIN votes ON votes.post_id = posts.id 
GROUP BY posts.id
LIMIT 1;

Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to postgresql

Subtracting 1 day from a timestamp date pgadmin4 : postgresql application server could not be contacted. Psql could not connect to server: No such file or directory, 5432 error? How to persist data in a dockerized postgres database using volumes input file appears to be a text format dump. Please use psql Postgres: check if array field contains value? Add timestamp column with default NOW() for new rows only Can't connect to Postgresql on port 5432 How to insert current datetime in postgresql insert query Connecting to Postgresql in a docker container from outside

Examples related to count

Count the Number of Tables in a SQL Server Database SQL count rows in a table How to count the occurrence of certain item in an ndarray? Laravel Eloquent - distinct() and count() not working properly together How to count items in JSON data Powershell: count members of a AD group How to count how many values per level in a given factor? Count number of rows by group using dplyr C++ - how to find the length of an integer JPA COUNT with composite primary key query not working

Examples related to distinct

Using DISTINCT along with GROUP BY in SQL Server How to "select distinct" across multiple data frame columns in pandas? Laravel Eloquent - distinct() and count() not working properly together SQL - select distinct only on one column SQL: Group by minimum value in one field while selecting distinct rows Count distinct value pairs in multiple columns in SQL sql query distinct with Row_Number Eliminating duplicate values based on only one column of the table MongoDB distinct aggregation Pandas count(distinct) equivalent

Examples related to aggregate-functions

Spark SQL: apply aggregate functions to a list of columns GROUP BY without aggregate function GROUP BY + CASE statement must appear in the GROUP BY clause or be used in an aggregate function Naming returned columns in Pandas aggregate function? Concatenate multiple result rows of one column into one, group by another column How to include "zero" / "0" results in COUNT aggregate? Apply multiple functions to multiple groupby columns Reason for Column is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause Optimal way to concatenate/aggregate strings