A nice little gem in PostgreSQL’s SQL syntax is the
In other words, we would execute these operations in the following logical order:
If your database doesn’t support the
From a readability perspective, I would definitely prefer the standard SQL syntax over
The result being:
DISTINCT ON
clause, which is as powerful as it is esoteric.
In a previous post, we’ve blogged about some caveats to think of when DISTINCT
and ORDER BY
are used together. The bigger picture can be seen in our article about the logical order of operations in SQL SELECT.
The PostgreSQL documentation explains it well:
Again, this is quite esoteric as the distinct-ness is now decided only based on the columns listed separately in parentheses. For all the other columns, only the first row according toSELECT DISTINCT ON ( expression [, ...] )
keeps only the first row of each set of rows where the given expressions evaluate to equal. TheDISTINCT ON
expressions are interpreted using the same rules as forORDER BY
. […] For example:retrieves the most recent weather report for each location. […]SELECT DISTINCT ON (location) location, time, report FROM weather_reports ORDER BY location, time DESC;
ORDER BY
is projected. The SQL language is probably one of the only ones where the syntactic order of operations has absolutely nothing to do with the logical order of operations, and this DISTINCT ON
syntax isn’t helping. In a more straightforward language design, the above statement could read, instead:
FROM weather_reports
WINDOW w AS (PARTITION BY location ORDER BY time DESC)
SELECT
location,
FIRST_VALUE (time) OVER w AS time,
FIRST_VALUE (report) OVER w AS report
DISTINCT
ORDER BY location
FROM
: Access the weather_reports tableWINDOW
: Specify “windows” or “groups” in our data set, grouping by location and ordering contents of each group by time, descendingly (using the term “WINDOW” is no accident as we’ll see afterwards)SELECT
: Project the location (which is the grouping) and per group, the first values of time / report ordered by timeDISTINCT
: Remove all the duplicates, because the above operation will produce the same time and report value for each record that shares the same locationORDER BY
: Finally, order the results per grouping
SELECT DISTINCT
location,
FIRST_VALUE (time) OVER w AS time,
FIRST_VALUE (report) OVER w AS report
FROM weather_reports
WINDOW w AS (PARTITION BY location ORDER BY time DESC)
ORDER BY location
WINDOW
clause, just expand it into the individual window functions. E.g. in Oracle, write:
SELECT DISTINCT
location,
FIRST_VALUE (time) OVER (PARTITION BY location ORDER BY time DESC),
FIRST_VALUE (report) OVER (PARTITION BY location ORDER BY time DESC)
FROM weather_reports
ORDER BY location
DISTINCT ON
.
Want to play around with it? Here’s some sample data:
create table weather_reports (location text, time date, report text);
insert into weather_reports values ('X', DATE '2000-01-01', 'X1');
insert into weather_reports values ('X', DATE '2000-01-02', 'X2');
insert into weather_reports values ('X', DATE '2000-01-03', 'X3');
insert into weather_reports values ('Y', DATE '2000-01-03', 'Y1');
insert into weather_reports values ('Y', DATE '2000-01-05', 'Y2');
insert into weather_reports values ('Z', DATE '2000-01-04', 'Z1');
|location|time |report| |--------|----------|------| |X |2000-01-03|X3 | |Y |2000-01-05|Y2 | |Z |2000-01-04|Z1 |Notice that jOOQ already supports PostgreSQL
DISTINCT ON
and in the future, we might emulate it for other dialects using the above technique: https://github.com/jOOQ/jOOQ/issues/3564