Relational or Non-relational Postgres?

For PGSQL Phriday #005, Ryan Lambert asked us to think about how we use PostgreSQL for non-relational workloads. I’ve been following Ryan’s work for a few years now (his blog posts are 💯) and every time I connect with him about mapping and Postgres, I walk away with way more knowledge than I contributed to the conversation. 🙂

Needless to say, I think a discussion about different ways we use non-relational data in Postgres is a great topic, well timed with a few discussions I’ve had recently about schema design and data access functions unique to PostgreSQL.

Postgres for Powerful ETL… Make That ELT

Until recently, most conversations about processing raw data within a data pipeline have focused on the Extract, Transform, Load (ETL) pattern. Partially because databases lacked a more complete set of text processing tools, it was left to developers to exchange and process data externally and then insert the fully transformed data.

Databases have changed significantly over the last decade or so. Postgres, specifically, is replete with powerful text and data processing functions, allowing us to take a similar but database-focused approach: Extract, Load, Transform (ELT).

The difference is subtle. The focus is on extracting the key data as part of the original processing, but then using Postgres as the staging and transformation layer. It used to be that data would be loaded into an intermediate database and manipulated with external scripts and functions. But much of that same work can be handled internally, and often more efficiently. (keep your logic close to your data… right? 😉)
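
To make that concrete, here's a minimal sketch of the idea. The table and column names are hypothetical (and it assumes a relational events table already exists as the target); the point is simply that the raw payload lands first and the transformation is plain SQL running right next to the data.

-- A minimal ELT sketch (hypothetical names): land the raw payload first
CREATE TABLE staging_events (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    received_at timestamptz NOT NULL DEFAULT now(),
    payload     jsonb NOT NULL
);

-- ...then do the "T" inside Postgres, close to the data
-- (assumes a relational target table events(event_id, event_time, event_type) exists)
INSERT INTO events (event_id, event_time, event_type)
SELECT (payload ->> 'id')::bigint,
       (payload ->> 'created_at')::timestamptz,
       payload ->> 'type'
FROM   staging_events;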

If you start to head down this path and think about doing more data transformation in the database, then we have to talk about how that impacts schema.

Relational design for non-relational data

Yeah, that felt weird to write.

The most prominent way that PostgreSQL has displayed its non-relational superpowers in recent years is through the JSONB data type. Arguments abound as to whether Postgres could serve as a replacement for NoSQL workloads by pulling the necessary relational pieces out and then just querying the raw JSON directly as needed. For many applications, there are probably good arguments for harnessing the power of relational data alongside the JSON of modern web service apps. But that’s not what I’m concerned with here.

What if there were ways to put the JSON data that proliferates through our data streams to use more quickly, without overthinking Postgres schema design upfront?

When I was at Timescale, we had many discussions about how to best architect schema design for high-throughput applications. Specifically, we were working to consume API data from time-series web services like stock and cryptocurrency trading platforms.

In many of these platforms, the API responses tend to be in a constant state of change. The main properties were consistent (ids, alternate keys, prices, timestamps, trading volume), but the business metadata was in constant flux. Moreover, as we attempted to learn the APIs to provide working samples to our users, we would often get stuck in debates about which fields to parse from the incoming JSON and which to leave.

One day, a co-worker simply said, “Don’t worry about it so much. Pull out what seems reasonable and then save the rest in a JSONB field and pull it out later if you need it.” 🤯

At first, I protested. Until then, most of the debate around how to use JSON within Postgres had focused on analytical query performance over the data as a potential replacement for NoSQL applications. In most of the work that I did, that framing didn’t really apply. But now I had a compelling reason to consider using JSONB fields more regularly.

His point was simple. When there is some uncertainty about what embedded data may be useful to analyze at a later date, consider using this pattern.

As an example, let’s look at the following Twitter API response. In real life, this payload could easily have 50 or more fields included.

    {
        "data": {
            "author_id": "395950601",
            "created_at": "2022-01-15T03:09:22.000Z",
            "id": "1482188130191122436",
            "text": "Wordle 209 3/6\n\n\u2b1b\u2b1b\u2b1b\ud83d\udfe9\u2b1b\n\u2b1b\u2b1b\ud83d\udfe8\u2b1b\ud83d\udfe8\n\ud83d\udfe9\ud83d\udfe9\ud83d\udfe9\ud83d\udfe9\ud83d\udfe9"
        },
        "includes": {
            "users": [
                {
                    "id": "395950601",
                    "location": "hell",
                    "name": "Hall & Oates Enjoyer",
                    "username": "theviirg",
                    "verified": false
                }
            ]
        },
        "matching_rules": [
            {
                "id": "1482188147178053633",
                "tag": "wordle"
            }
        ]
    }

When I started consuming these tweets in the early months of 2022, I knew a few of the basic questions I wanted to answer:

  • How often did users solve a puzzle in three guesses or less?
  • How often did a winning puzzle start with guessing the first letter correctly?
  • How many puzzles were solved by a specific Twitter handle?

Most of that analysis could come from the Wordle results themselves. There were many pieces of meta information I could retrieve about the tweets, but I wasn’t sure whether they would be useful. What if I just stored them when I loaded the tweet, in case I found a use for them later?
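
In table form, that decision looks something like the sketch below. This isn’t the exact schema I used, just a hypothetical illustration of the pattern: the fields I knew I cared about become real columns, and the rest of the payload rides along in a JSONB catch-all.

-- Hypothetical sketch, not my exact schema: known fields become columns,
-- everything else stays in a JSONB catch-all
CREATE TABLE tweets (
    tweet_id    bigint      PRIMARY KEY,
    author_id   bigint      NOT NULL,
    created_at  timestamptz NOT NULL,
    tweet_text  text,
    payload     jsonb       NOT NULL  -- the full, original API response
);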

ELT For the Win

To consume the stream of tweets, I created a Python script that received the payload and stored it in my Postgres database. Because I knew some of the basic business logic I wanted to perform on the incoming data, it was more efficient to do minimal pre-parsing as the data flowed through. I could quickly accomplish the main goal: pull out certain fields and validate the tweet content. Once that was done, I simply stored the full payload in a separate JSONB column alongside the parsed data for processing later, if necessary.
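
As a rough sketch of what that ingest statement can look like (reusing the hypothetical tweets table from above, with $1 standing in for the raw JSON handed over by the script), Postgres itself can do the light parsing:

-- Hypothetical ingest statement; $1 is the raw JSON payload passed in by the script
INSERT INTO tweets (tweet_id, author_id, created_at, tweet_text, payload)
VALUES (
    ($1::jsonb #>> '{data,id}')::bigint,
    ($1::jsonb #>> '{data,author_id}')::bigint,
    ($1::jsonb #>> '{data,created_at}')::timestamptz,
     $1::jsonb #>> '{data,text}',
     $1::jsonb
);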

I didn’t have to spend a lot of time thinking about and obsessing over what other content I might need from the payload or what the schema design would be. Approaching non-relational data like this in Postgres provides two additional benefits I hadn’t considered before.

First, as I’ve said, there’s an easy fallback to the data if I need more information later. I wasn’t stuck, kicking myself for not pulling a piece of the data out when I had originally consumed it. Instead, I can create a new column (with a NULL or constant default!) and then update it by querying the payload.
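
For example, if I later decided I wanted the author’s username from the includes block, a backfill against the stored payload is all it takes (again, using the hypothetical tweets table from earlier):

-- Add the column after the fact, then backfill it from the stored payload
ALTER TABLE tweets ADD COLUMN author_username text;

UPDATE tweets
SET    author_username = payload #>> '{includes,users,0,username}'
WHERE  author_username IS NULL;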

Second, the actual impact of storing this data may not be as big as I had originally thought, which had been another primary objection I had to storing lots of JSON in my databases. If the JSON data is large, it will likely be TOASTed: stored out of line in the table’s associated TOAST table, with pointers from the main row. This means that if we don’t select this column under normal query scenarios (don’t use SELECT * FROM…), each row response will be significantly smaller and faster. Pointers to TOAST data have essentially no impact on query response if that column is not requested. And even when the value is stored in-line, it will be compressed where possible.
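
If you’re curious what those stored payloads actually cost, pg_column_size() reports the on-disk (compressed, TOASTed) size of each value, so a quick check against the hypothetical tweets table might look like this:

-- Rough check of what the raw payload column really costs on disk
SELECT count(*)                                             AS tweet_count,
       pg_size_pretty(sum(pg_column_size(payload))::bigint) AS total_payload,
       pg_size_pretty(avg(pg_column_size(payload))::bigint) AS avg_payload
FROM   tweets;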

By storing the original payload, you’re able to retain ultimate control and flexibility down the road. You can stop obsessing over every detail of your schema and get to work analyzing the most important data quickly. If your needs change in the future, you’re covered for however long you retain the original payload (or some valid subset therein).

From 🟩🟨⬛ to 🤯

For my Wordle analysis, I found that many of the same functions that allowed me to load and transform the raw JSON data into relational tables also helped me extract meaning out of the emojis that show a user’s daily score.

To accomplish this, I had to turn to the powerful regexp family of functions that are part of Postgres.

Splitting Strings and Keeping Order

The objective was to turn rows of 🟩🟨⬛ into something meaningful to query. I had to retain the order of the rows and the order of each guessed letter.

To do this, I first had to find the row boundaries: exactly five emoji in a row. Because the emoji are multi-byte Unicode characters and can be one of several different colors (depending on user settings), it took a little bit of effort to figure out a regex that worked.

Step 1: Extract the rows

--  Wrapped in a CTE for easier processing in step 2
WITH wordle_score AS (
   SELECT * FROM regexp_matches($$Not as easy as you think
   #Wordle 511 3/6*
   🟨⬜🟨⬜⬜
   ⬜🟨🟨🟨🟨
   🟩🟩🟩🟩🟩
$$,'([🟩|🟧|🟨|🟦|⬛|⬜|]{5})','g') WITH ORDINALITY AS g(guess, guess_num)
) 
SELECT * FROM wordle_score;

By using regexp_matches as a source in the FROM clause, it acts as a Set Returning Function (SRF). In Postgres, SRFs in the FROM clause provide the WITH ORDINALITY functionality, which also returns the order of the results coming back from the function. In this case, it allowed me to know exactly which guess was first, second, third, and so on.

guess           |guess_num|
----------------+---------+
{🟨⬜🟨⬜⬜}|        1|
{⬜🟨🟨🟨🟨}|        2|
{🟩🟩🟩🟩🟩}|        3|

Now that I knew the set of emoji that represented each guessed word, I needed to process each guess, separating the letters in order. For this, I was able to use an implicit CROSS JOIN LATERAL to take the results of the previous query and run each row through another regexp_matches call, this time tracking the order of the letters within each guessed word.

Step 2: Extract letters in each guess

-- Break it apart even further to get each separate letter

WITH wordle_score AS (
  SELECT * FROM regexp_matches($$Not as easy as you think
     #Wordle 511 3/6*
     🟨⬜🟨⬜⬜
     ⬜🟨🟨🟨🟨
     🟩🟩🟩🟩🟩
  $$,'([🟩|🟧|🟨|🟦|⬛|⬜|]{5})','g') WITH ORDINALITY AS g(guess, guess_num)
)
SELECT * FROM wordle_score ws, -- the comma is an implicit CROSS JOIN LATERAL
regexp_matches(ws.guess[1],'([⬛|🟩|🟨|⬜]{1})','g') WITH ORDINALITY AS r(c1, letter);

This results in a table that lists each letter guessed, including the row and column position.

guess           |guess_num|c1   |letter|
----------------+---------+-----+------+
{🟨⬜🟨⬜⬜}|        1|{🟨}|     1|
{🟨⬜🟨⬜⬜}|        1|{⬜}|     2|
{🟨⬜🟨⬜⬜}|        1|{🟨}|     3|
{🟨⬜🟨⬜⬜}|        1|{⬜}|     4|
{🟨⬜🟨⬜⬜}|        1|{⬜}|     5|
{⬜🟨🟨🟨🟨}|        2|{⬜}|     1|
{⬜🟨🟨🟨🟨}|        2|{🟨}|     2|
{⬜🟨🟨🟨🟨}|        2|{🟨}|     3|
{⬜🟨🟨🟨🟨}|        2|{🟨}|     4|
{⬜🟨🟨🟨🟨}|        2|{🟨}|     5|
{🟩🟩🟩🟩🟩}|        3|{🟩}|     1|
{🟩🟩🟩🟩🟩}|        3|{🟩}|     2|
{🟩🟩🟩🟩🟩}|        3|{🟩}|     3|
{🟩🟩🟩🟩🟩}|        3|{🟩}|     4|
{🟩🟩🟩🟩🟩}|        3|{🟩}|     5|
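
Just to hint at where this goes: if those parsed letters were saved into a table, say a hypothetical wordle_guess_letters(tweet_id, guess_num, letter, square), one of my original questions turns into a one-line aggregate.

-- Hypothetical follow-up: how often was the first letter green on the first guess?
SELECT round(100.0 * count(*) FILTER (WHERE square = '🟩') / count(*), 1)
         AS pct_first_letter_green
FROM   wordle_guess_letters
WHERE  guess_num = 1
  AND  letter = 1;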

Once I saw how quickly I could take skills I already had and use them to process textual data like this, creating relations and meaning from green, yellow, and black boxes, I knew there was a lot of power in Postgres I hadn’t yet tapped. I’ll have to save the rest of the Wordle fun for another post.

A Happy Coexistence

For the sake of the challenge presented by Ryan, I’m going to stop here. Honestly, working on the Wordle Tweets project was one of those turning point moments in how I viewed the ability to use Postgres for more than just storing and querying relational data. Hopefully just talking about and showing you a glimpse of what can be done with Postgres does the same for you.

One of my next steps in this direction will probably be to finally learn more about PostGIS. And I happen to know of a great teacher and resource to get started! 😀

P.S. – Will You Contribute to PGSQL Phriday Next Month?

Next month will mark six months of experimenting with a regular community blogging event. I had very low expectations when suggesting the idea, and even lower expectations that folks would have the time to participate. To be honest, I understand how busy everyone is, and answering challenging questions posed by others isn’t a quick task. Oh, how I’ve been pleasantly surprised at the response, me of little faith. Every month so far, I’ve learned something from all of the contributions.

There’s just one thing missing. YOU!

Keep a watch on Twitter, Mastodon, Postgres Slack, and Planet.Postgresql.org for an announcement of the next topic near the end of February. Consider how you could share your knowledge and experience with everyone in the community. I promise that someone will benefit from what you have to offer!

And, if you’re a (semi) regular Postgres blogger and would like to host a future month, reach out and let me know!
