Using DISTINCT ON in Non-PostgreSQL Databases

A nice little gem in PostgreSQL’s SQL syntax is the DISTINCT ON clause, which is as powerful as it is esoteric. In a previous post, we’ve blogged about some caveats to think of when DISTINCT and ORDER BY are used together. The bigger picture can be seen in our article about the logical order of operations in SQL SELECT. The PostgreSQL documentation explains it well:

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY. […] For example:
SELECT DISTINCT ON (location) location, time, report
    FROM weather_reports
    ORDER BY location, time DESC;
retrieves the most recent weather report for each location. […]

Again, this is quite esoteric as the distinct-ness is now decided only based on the columns listed separately in parentheses. For all the other columns, only the first row according to ORDER BY is projected. The SQL language is probably one of the only ones where the syntactic order of operations has absolutely nothing to do with the logical order of operations, and this DISTINCT ON syntax isn’t helping. In a more straightforward language design, the above statement could read, instead:


FROM weather_reports
WINDOW w AS (PARTITION BY location ORDER BY time DESC)
SELECT 
  location, 
  FIRST_VALUE (time) OVER w AS time,
  FIRST_VALUE (report) OVER w AS report
DISTINCT
ORDER BY location

In other words, we would execute these operations in the following logical order:

FROM: Access the weather_reports table
WINDOW: Specify “windows” or “groups” in our data set, grouping by location and ordering contents of each group by time, descendingly (using the term “WINDOW” is no accident as we’ll see afterwards)
SELECT: Project the location (which is the grouping) and per group, the first values of time / report ordered by time
DISTINCT: Remove all the duplicates, because the above operation will produce the same time and report value for each record that shares the same location
ORDER BY: Finally, order the results per grouping

That’s what really happens, and incidentally, the above synthetic SQL syntax matches the actual logical order of operations in the SQL language, so translating it back to an actual SQL statement would yield:


SELECT DISTINCT
  location, 
  FIRST_VALUE (time) OVER w AS time,
  FIRST_VALUE (report) OVER w AS report
FROM weather_reports
WINDOW w AS (PARTITION BY location ORDER BY time DESC)
ORDER BY location

If your database doesn’t support the WINDOW clause, just expand it into the individual window functions. E.g. in Oracle, write:


SELECT DISTINCT
  location, 
  FIRST_VALUE (time) OVER (PARTITION BY location ORDER BY time DESC),
  FIRST_VALUE (report) OVER (PARTITION BY location ORDER BY time DESC)
FROM weather_reports
ORDER BY location

From a readability perspective, I would definitely prefer the standard SQL syntax over DISTINCT ON. Want to play around with it? Here’s some sample data:


create table weather_reports (location text, time date, report text);
insert into weather_reports values ('X', DATE '2000-01-01', 'X1');
insert into weather_reports values ('X', DATE '2000-01-02', 'X2');
insert into weather_reports values ('X', DATE '2000-01-03', 'X3');
insert into weather_reports values ('Y', DATE '2000-01-03', 'Y1');
insert into weather_reports values ('Y', DATE '2000-01-05', 'Y2');
insert into weather_reports values ('Z', DATE '2000-01-04', 'Z1');

The result being:

|location|time      |report|
|--------|----------|------|
|X       |2000-01-03|X3    |
|Y       |2000-01-05|Y2    |
|Z       |2000-01-04|Z1    |

Notice that jOOQ already supports PostgreSQL DISTINCT ON and in the future, we might emulate it for other dialects using the above technique: https://github.com/jOOQ/jOOQ/issues/3564