What I Wish I Had Known About PostgreSQL

For PGSQL Phriday #006, Grant Fritchey asked us to think about one of the helpful tidbits that we wish we had known earlier about PostgreSQL as we tried to learn using it effectively. Like myself and so many others, Grant has come to PostgreSQL with years of experience in another database platform, primarily SQL Server. Even with such a strong background using and teaching about relational databases, transferring that knowledge from one system to another can be challenging at times.

To be honest, it feels as if I’ve become a bit of a broken record about my struggles moving from SQL Server to PostgreSQL. Working with the PostgreSQL community to start an initiative like PGSQL Phriday was a direct result of those struggles and ensuing conversations.

What would I want to go back and tell Ryan in 2018 that would help him along on this PostgreSQL journey?

OK, I can’t pick just one, so I’ll give a more general principle and then something PostgreSQL specific – lateral joins.

Patience You Must Have – (Yoda)

This first point is a reminder to me more than anything. Too often in my technical career I end up getting frustrated and annoyed because things don’t work the way I expect, especially based on my previous experience. Quite honestly, this is the never-ending problem of working in technology, and one of the greatest drivers of innovation.

Just this week, I opened Power BI for the first time in over a year to do some quick analysis and processing of web data. I had done this kind of work before and I understand how Power Query works with the type of data I was processing. Still, it took me longer than I wanted because M is a functional language and I rarely do functional development anymore.

“Why can’t I just take this result and loop over it with this other function!!”

Of course it turned out that I could, just not the way that I wanted to do it.

The same thing has happened to me with PostgreSQL many times over the last few years. I’ll hit a performance issue and start wishing for the kinds of tools SQL Server provided. When I want to get backups running quickly and track their status, I remember when I could do most of that through SSMS or community supported scripts.

But as the years have gone by, I’ve come to understand how much time and dedication the global team of volunteers have given to the project and some of the amazing innovations that PostgreSQL provides which I can’t find anywhere else. We all have a front-row seat to the discussion of each feature, improvement, and bug fix. I can literally read the code for how the query planner works if I have questions (granted, I’m more of a “let’s start with the comments” learner).

No, everything doesn’t work the way I want it to, but the more I’ve come to embrace that, the more I’ve learned, and loved, about PostgreSQL.

(But don’t get me started about the absolutely annoying Regex matching bug with my QNAP… 😂)

Lateral Joins

I was much further in my SQL Server career than I should have been before I understood the usefulness of APPLY in T-SQL. I didn’t realize that APPLY isn’t a SQL standard and so when I started using PostgreSQL and needed similar functionality, I was perplexed trying to find the equivalent. Fortunately, I wasn’t the only one searching.

Like APPLY statements, LATERAL joins allow you to execute an inner query for each row of the outer query and you can reference values of the parent. The canonical example for using a query like this is when you have a table that represents something (accounts, sensors, products) with time-series like data in another table (orders, readings, etc.). To get the most recent value from the time-series table, a LATERAL is the easiest way without going through a bunch of gymnastics.

Learning about LATERAL joins unblocked numerous query and performance struggles I was having. And, much like APPLY statements in SQL Server, LATERAL queries are still one of the main optimization fences we have in PostgreSQL. I honestly can’t count how many times I’ve seen responses on various community channels that start with, “have you tried a lateral join?

No, lateral joins don’t solve every problem. But knowing when they are a viable option to try is 🔥.

With that, let me dive a bit into how CROSS JOIN LATERAL has helped me solve a few tricky data transformation problems.

CROSS JOIN LATERAL

What really started to blow my mind, particularly when wrangling data for transformation in the database (ELT), was the use of CROSS JOIN LATERAL. There are so many set returning functions in PostgreSQL that can be used to transform various bits of data (strings, JSON, arrays) into something more usable from a SQL perspective.

Over the last couple of years I’ve learned two particular ways to use CROSS JOIN LATERAL when processing data.

Retaining Order

Many times, as I’m pulling strings or JSON apart into table structures, it’s helpful to retain the order of the internal components. This was especially helpful when I analyzed Wordle data because I needed to transform a block of emojis into rows and columns, maintaining the order of each emoji as I processed more data. I could have done this with some creative row_number() window function usage, but PostgreSQL provides a simpler, and typically faster, way.

Any Set Returning Function (SRF) in PostgreSQL can utilize the extra WITH ORDINALITY keyword. The key is that the function must be used as a source of data (after the FROM clause) rather than as a scaler/inline function.

As an example, I’ll simply break a string of words apart, adding a column that represents the order of the words as they appeared. I’m using a CTE to mimic a table that’s returning data. If these strings were stored inside of a table, I could just select directly from there. In the `row_number()` example, I would still need a CTE to get the unnested rows before I could apply the window function.

Word order using row_number()

WITH str AS (
        SELECT UNNEST(string_to_array(TOKEN,' ')) word FROM
               (SELECT 'I ❤️ Postgres CROSS JOIN LATERAL!' AS TOKEN) a
)
SELECT word, ROW_NUMBER() OVER() FROM str;

--- results in
word    |row_number|
--------+----------+
I       |         1|
❤️      |         2|
Postgres|         3|
CROSS   |         4|
JOIN    |         5|
LATERAL!|         6|

Word order using CROSS JOIN LATERAL and WITH ORDINALITY

--- the comma between str and UNNEST is shorthand for CROSS JOIN LATERAL
WITH str AS (
       SELECT 'I ❤️ Postgres CROSS JOIN LATERAL!' AS token
)
SELECT t.* FROM str, UNNEST(string_to_array(TOKEN,' ')) WITH ORDINALITY t(word,word_order);

-- produces the same result
word    |row_number|
--------+----------+
I       |         1|
❤️      |         2|
Postgres|         3|
CROSS   |         4|
JOIN    |         5|
LATERAL!|         6|

It might seem small, but it’s such a useful feature to have at your disposal.

Organizing Calculations for Cleaner SELECTs

During my (slow) process of working through the Advent of Code puzzles, I’ve learned a lot about SQL features supported in Postgres from others in the community. One that jumped out at me was how Vik Fearing used CROSS JOIN LATERAL to organize calculations into smaller groups to allow alias references and overall cleaner SELECT statements at the parent (at least, that’s why I ended up using this technique).

This one is a bit harder for me to come up with an example on the spot, but I’ll give it my best shot.

Each example query below two CTEs that generate fake X,Y values. Let’s pretend that we need to:

  • sum each set of X and Y columns between the two tables
  • find the difference between those sums

There are many ways we could go about this, which is partially why it’s difficult for me to think of a better example.

Without using the CROSS JOIN LATERAL feature, I would typically have to do the addition for the first two columns (x+x, y+y) and then a third time as part of the subtraction calculation ((x+x)-(y+y)). It’s not the end of the world, but I really wish I could refer to an aliases of those first two columns instead.

Without using CROSS JOIN LATERAL

WITH ta(x,y) AS (
       SELECT * FROM generate_series(1,2) x, generate_series(3,4) y
),
tb(x,y) AS (
       SELECT * FROM generate_series(10,11) x, generate_series(13,14) y
)
SELECT
       ta.x+tb.x AS x_sum,
       ta.y+tb.y AS y_sum,
       (ta.x+tb.x)-(ta.y+tb.y) AS sum_diff
FROM ta, tb;

Using a CROSS JOIN LATERAL, I can do the first set of calculations and then refer to the aliases further down in the query. Not only does it break up my work, but it also makes the top-level SELECT statement cleaner.

WITH ta(x,y) AS (
       SELECT * FROM generate_series(1,2) x, generate_series(3,4) y
),
tb(x,y) AS (
       SELECT * FROM generate_series(10,11) x, generate_series(13,14) y
)
SELECT
       x_sum,
       y_sum,
       sum_diff
FROM ta, tb
       CROSS JOIN LATERAL (VALUES(ta.x+tb.x,ta.y+tb.y)) AS t(x_sum,y_sum)
       CROSS JOIN LATERAL (VALUES (t.x_sum-t.y_sum)) AS d(sum_diff);

Both solutions produce the same result:

x_sum|y_sum|sum_diff|
-----+-----+--------+
   11|   16|      -5|
   11|   17|      -6|
   12|   16|      -4|
   12|   17|      -5|
   12|   16|      -4|
   12|   17|      -5|
   13|   16|      -3|
   13|   17|      -4|
   11|   17|      -6|
   11|   18|      -7|
   12|   17|      -5|
   12|   18|      -6|
   12|   17|      -5|
   12|   18|      -6|
   13|   17|      -4|
   13|   18|      -5|

The win here, at least for me, is that I can build up each set of calculations and refer to them higher or lower in the query. Being able to refer to my previous calculations by alias is just a huge organizational win. How many times have you wanted to do that in a query?!

Conclusion

I love using these features of PostgreSQL to transform data. I’ve learned most of them by pair programming with other PostgreSQL users and developers and each time I have to really internalize how it works and the value.

There are so many other things I could have chosen to talk about in response to Grant’s prompt, but hopefully knowing a little bit more about CROSS JOIN LATERAL helps you. 😊

4 thoughts on “What I Wish I Had Known About PostgreSQL”

  1. > The canonical example for using a query like this is when you have a table that represents something (accounts, sensors, products) with time-series like data in another table (orders, readings, etc.). To get the most recent value from the time-series table, a LATERAL is the easiest way without going through a bunch of gymnastics.

    Postgres also has a nice DISTINCT ON expression that you can use in select statements for that. The PG documentation has this example:

    SELECT DISTINCT ON (location) location, time, report
    FROM weather_reports
    ORDER BY location, time DESC;

    Reference: https://www.postgresql.org/docs/current/sql-select.html

    Reply
    • For sure! For small things, DISTINCT ON is a really cool feature!

      One of the issues with DISTINCT ON, however, is that it doesn’t (currently) utilize the index to do the distinct and sorting. For something more performant, you can try to craft a query that essentially does a loose index scan, or use an extension like Timescale which included “SkipScan” functionality ~2 years ago which will efficiently utilize an ordered index with DISTINCT ON.

      Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.