PostgreSQL Patch of Interest (To Me): Documentation Linking

This is part of an occasional series of blogs about PostgreSQL commitfest patches. Look at the series intro which also lists some ground rules and expectations for what to expect in my thoughts below. 🐘

Patch title and thread: Create visible links for HTML elements that have an id to make them discoverable via the web interface
Commitfest at time of review: 2023-03
Status at time of review: Waiting on Author

As I said in the intro to this occasional series, writing about PostgreSQL patches forces me to stay engaged. I’m not the most technical contributor in the PostgreSQL space by any means so every bit of conversation and discussion I can read just helps me level up that much more.

Instead, I often focus on the usability issues as I talk or teach about PostgreSQL because that’s usually where my struggles are. “I wish things would work this way! I wonder if anyone has thought about that?” And sure enough, almost every time I think something like that I find discussion somewhere about how to improve the feature or the challenges that have prevented additional changes.

I still can’t believe how amazing it is that this is all just out there for us to learn from.

Anyway, I digress.

While there are a few other patches I’ve started to follow and write about (more to come soon), a quick succession of events today made me want to jump in and get a quick post out to highlight how efficient this community is at helping each other.

Documentation Usability

If you’ve been around PostgreSQL for any amount of time, you probably know that the PostgreSQL developers and community are very proud of the documentation (they should be!) but that it’s also been a source of frustration to many, particularly those with less experience. In fact, in the 2022 State of PostgreSQL survey, documentation was both the most “go-to” source for information and one of the most commented on things that could be improved.

There is so much good, deep, technical documentation content. Unfortunately, it isn’t always easy to follow the flow and intuitively connect all the sections. In other places, essential bits of wisdom are “hidden” inside of callout (TIP/NOTE) boxes which are easy to miss. Code examples are another area that is usually helpful and to the point, but a bit light on detail in more complex areas.

But that’s not what this short post is about. Honestly, the overall improvement of the docs is pretty amazing over the last couple of years. The community and contributors have stepped up with their efforts and the overall team has worked hard to act on usability suggestions. 👏

Linking to Documentation

That said, linking to various sections and callouts in PostgreSQL documentation has never been easy. A few releases ago (I’m not completely sure when), there was a lot of effort to ensure that most sections had ID tags in the generated HTML, so linking to specific sections became possible, just not easy.

Because I’m doing more PostgreSQL presentations lately, I’ve noticed it more when I want to put links to specific sections for a slide deck or to share on social media. Earlier this week I was at it again and finally decided to ask on Twitter what other PostgreSQL users do.

The responses were super helpful, pointing me to a browser extension that will quickly display hidden ID tags on a page and help you get the URL with anchors. But then, another user directed me to a current commitfest entry that is directly trying to address this issue.

This patch does two things, mentioned in this message of the thread:

  • Create a hover effect link next to hidden ID areas (like sections on a page)
  • Emit a build warning when there is a section without an ID

🔥🔥🔥

I know that might seem small and not really worth a blog post (I think it is!) but the bigger point for me is this.

I asked if others had a solution or if anyone had considered making docs better. Within a few hours, I had multiple responses to solve my current problem (browser extension) and a link to a current patch and conversation around the best way to implement this in documentation.

That’s really freaking cool, folks! 😎

Documentation Patch Status

The patch made quick progress in January and February, but some late conversation around the best, “right” way to apply the patch through the documentation system has put the patch on hold for a bit. I have no previous experience with DocBook, although I have read through the PostgreSQL documentation page for it. Maybe this will be the thing that finally prompts me to get the docs building locally on my machine.

If you have the skills or experience to contribute to the patch, I think you’d be surprised how much of a usability impact this would have over the long haul!

PostgreSQL Content at the 2023 PASS Data Community Summit

TL;DR

The 2023 PASS Data Community Summit will be held from November 14-17 at the Seattle Convention Center. For 2023 it is a fully in-person event. This year we are actively encouraging PostgreSQL experts to submit sessions during the CFP to help us build an inclusive event that spans multiple data platforms.

CFP is open and closes on March 30, 2023 at 11:59PM PST.

But come on… I know you want just a little more detail, right? 😉

The Importance of PASS and Summit To Me

If you’re around me much at a conference or happen to catch a blog post or two of mine, there’s a high likelihood that I’ll talk about community and how valuable it’s been to me professionally and personally. Much of my baseline and expectation for what a technical community feels like was born out of my experience with PASS. From numerous world-wide SQL Saturday events to the yearly PASS Summit, the people and content were always encouraging and inspiring. I honestly walked away from every encounter feeling more empowered to be better at my craft.

As I’ve grown more within the PostgreSQL community and continue to learn how things work, I’m just as inspired and encouraged at the wealth of knowledge and expertise that exists.

When Redgate purchased the PASS Summit in 2021, I had no idea I’d eventually work here as a PostgreSQL Advocate with an opportunity to share my love for the platform and help others be successful with The World’s Most Advanced Open-source Relational Database! 🐘

Even more exciting, I couldn’t have foreseen the continual shift that our industry is experiencing. Many database developers and administrators are no longer tied to a single database platform in all parts of their job. Furthermore, when there are multiple databases being used within a team, there’s a good chance PostgreSQL is one of those databases.

PostgreSQL at Summit

And so, as part of the 2023 PASS Data Community Summit, we are actively seeking PostgreSQL content. We are working to provide a “getting started” learning path as well as content for intermediate and advanced users.

“But isn’t Summit mostly about the Microsoft Data Platform?” Historically that has been true, and the majority of the content this year will still focus on those platforms. However, in my time at Timescale and Redgate, you’d probably be shocked how often I have conversations with customers that have started to incorporate PostgreSQL with new projects. If I had asked these same customers if PostgreSQL would be a part of their architecture two years ago, most of them would have laughed at the question.

The CFP is now open and closes on March 30 (11:59PM PST) – just over two weeks from now!! I would be thrilled to have folks in the PostgreSQL community submit talks for consideration.

Content Ideas

Sometimes it’s easy to forget that even the most basic topics are new to someone. Just think about the customers and users I speak with every day who didn’t know PostgreSQL was right around the corner for them.

Likewise, topics that might seem commonplace often have edge cases which impact users at all experience levels. In these cases, providing tips and tricks to overcome issues is tremendously valuable.

By no means exhaustive, a few general themes of content could include things such as:

  • MVCC – what it is, how it works, and why it matters
  • Indexes (PostgreSQL has so many types!)
  • Query tuning
  • How to use EXPLAIN
  • Tooling
  • Using psql
  • Unique data types and how to use them (array, range, geo, JSON/JSONB, etc.)
  • Dealing with bloat or VACUUM issues
  • Etc.

There are so many possibilities I wish I could hear talks on all of these topics!

Session Details and Speaker Benefits

If you’ve spoken at a typical PostgreSQL conference previously, the PASS Data Community Summit has a few differences in session length and speaker benefits.

  • Talks are 75 minutes total: ~60 minutes of content with ~15 minutes for questions
  • Talks are recorded, but not made public for the first year (generally)
  • Selected speakers get a 3-day pass to the main conference (Wed-Fri)
  • Selected speakers receive one night lodging for the day of your session

Submit Your PostgreSQL Sessions Today!

This is a great opportunity to present to a diverse audience of data professionals. What you have to offer will benefit many people in the data community, especially data professionals that might not normally attend a PostgreSQL-only conference.

Feel free to reach out on Twitter (including a DM) or through email. Ryan @ the domain of this blog. (😉 tricky, eh?)

What I Wish I Had Known About PostgreSQL

For PGSQL Phriday #006, Grant Fritchey asked us to think about one of the helpful tidbits that we wish we had known earlier about PostgreSQL as we tried to learn to use it effectively. Like myself and so many others, Grant has come to PostgreSQL with years of experience in another database platform, primarily SQL Server. Even with such a strong background using and teaching about relational databases, transferring that knowledge from one system to another can be challenging at times.

To be honest, it feels as if I’ve become a bit of a broken record about my struggles moving from SQL Server to PostgreSQL. Working with the PostgreSQL community to start an initiative like PGSQL Phriday was a direct result of those struggles and ensuing conversations.

What would I want to go back and tell Ryan in 2018 that would help him along on this PostgreSQL journey?

OK, I can’t pick just one, so I’ll give a more general principle and then something PostgreSQL specific – lateral joins.

Patience You Must Have – (Yoda)

This first point is a reminder to me more than anything. Too often in my technical career I end up getting frustrated and annoyed because things don’t work the way I expect, especially based on my previous experience. Quite honestly, this is the never-ending problem of working in technology, and one of the greatest drivers of innovation.

Just this week, I opened Power BI for the first time in over a year to do some quick analysis and processing of web data. I had done this kind of work before and I understand how Power Query works with the type of data I was processing. Still, it took me longer than I wanted because M is a functional language and I rarely do functional development anymore.

“Why can’t I just take this result and loop over it with this other function!!”

Of course it turned out that I could, just not the way that I wanted to do it.

The same thing has happened to me with PostgreSQL many times over the last few years. I’ll hit a performance issue and start wishing for the kinds of tools SQL Server provided. When I want to get backups running quickly and track their status, I remember when I could do most of that through SSMS or community supported scripts.

But as the years have gone by, I’ve come to understand how much time and dedication the global team of volunteers have given to the project and some of the amazing innovations that PostgreSQL provides which I can’t find anywhere else. We all have a front-row seat to the discussion of each feature, improvement, and bug fix. I can literally read the code for how the query planner works if I have questions (granted, I’m more of a “let’s start with the comments” learner).

No, everything doesn’t work the way I want it to, but the more I’ve come to embrace that, the more I’ve learned, and loved, about PostgreSQL.

(But don’t get me started about the absolutely annoying Regex matching bug with my QNAP… 😂)

Lateral Joins

I was much further in my SQL Server career than I should have been before I understood the usefulness of APPLY in T-SQL. I didn’t realize that APPLY isn’t a SQL standard and so when I started using PostgreSQL and needed similar functionality, I was perplexed trying to find the equivalent. Fortunately, I wasn’t the only one searching.

Like APPLY, LATERAL joins allow you to execute an inner query for each row of the outer query while referencing values from that outer row. The canonical example for using a query like this is when you have a table that represents something (accounts, sensors, products) with time-series like data in another table (orders, readings, etc.). To get the most recent value from the time-series table, a LATERAL join is the easiest way to do it without going through a bunch of gymnastics, as in the sketch below.
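
The sensors and readings tables in this sketch (and their columns) are hypothetical placeholders rather than a real schema:

-- Hypothetical tables: sensors(id, name) and readings(sensor_id, ts, value)
SELECT s.name, latest.ts, latest.value
FROM sensors s
       LEFT JOIN LATERAL (
              SELECT r.ts, r.value
              FROM readings r
              WHERE r.sensor_id = s.id   -- the subquery can reference the outer row
              ORDER BY r.ts DESC
              LIMIT 1                    -- keep only the most recent reading per sensor
       ) latest ON true;

The LEFT JOIN ... ON true form keeps sensors that have no readings yet; a CROSS JOIN LATERAL would drop them.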

Learning about LATERAL joins unblocked numerous query and performance struggles I was having. And, much like APPLY statements in SQL Server, LATERAL queries are still one of the main optimization fences we have in PostgreSQL. I honestly can’t count how many times I’ve seen responses on various community channels that start with, “have you tried a lateral join?”

No, lateral joins don’t solve every problem. But knowing when they are a viable option to try is 🔥.

With that, let me dive a bit into how CROSS JOIN LATERAL has helped me solve a few tricky data transformation problems.

CROSS JOIN LATERAL

What really started to blow my mind, particularly when wrangling data for transformation in the database (ELT), was the use of CROSS JOIN LATERAL. There are so many set returning functions in PostgreSQL that can be used to transform various bits of data (strings, JSON, arrays) into something more usable from a SQL perspective.

Over the last couple of years I’ve learned two particular ways to use CROSS JOIN LATERAL when processing data.

Retaining Order

Many times, as I’m pulling strings or JSON apart into table structures, it’s helpful to retain the order of the internal components. This was especially helpful when I analyzed Wordle data because I needed to transform a block of emojis into rows and columns, maintaining the order of each emoji as I processed more data. I could have done this with some creative row_number() window function usage, but PostgreSQL provides a simpler, and typically faster, way.

Any Set Returning Function (SRF) in PostgreSQL can utilize the extra WITH ORDINALITY keyword. The key is that the function must be used as a source of data (in the FROM clause) rather than as a scalar/inline function.

As an example, I’ll simply break a string of words apart, adding a column that represents the order of the words as they appeared. I’m using a CTE to mimic a table that’s returning data. If these strings were stored inside of a table, I could just select directly from there. In the `row_number()` example, I would still need a CTE to get the unnested rows before I could apply the window function.

Word order using row_number()

WITH str AS (
        SELECT UNNEST(string_to_array(TOKEN,' ')) word FROM
               (SELECT 'I ❤️ Postgres CROSS JOIN LATERAL!' AS TOKEN) a
)
SELECT word, ROW_NUMBER() OVER() FROM str;

--- results in
word    |row_number|
--------+----------+
I       |         1|
❤️      |         2|
Postgres|         3|
CROSS   |         4|
JOIN    |         5|
LATERAL!|         6|

Word order using CROSS JOIN LATERAL and WITH ORDINALITY

--- the comma between str and UNNEST is shorthand for CROSS JOIN LATERAL
WITH str AS (
       SELECT 'I ❤️ Postgres CROSS JOIN LATERAL!' AS token
)
SELECT t.* FROM str, UNNEST(string_to_array(TOKEN,' ')) WITH ORDINALITY t(word,word_order);

-- produces the same rows, with explicit column names
word    |word_order|
--------+----------+
I       |         1|
❤️      |         2|
Postgres|         3|
CROSS   |         4|
JOIN    |         5|
LATERAL!|         6|

It might seem small, but it’s such a useful feature to have at your disposal.

Organizing Calculations for Cleaner SELECTs

During my (slow) process of working through the Advent of Code puzzles, I’ve learned a lot about SQL features supported in Postgres from others in the community. One that jumped out at me was how Vik Fearing used CROSS JOIN LATERAL to organize calculations into smaller groups to allow alias references and overall cleaner SELECT statements at the parent (at least, that’s why I ended up using this technique).

This one is a bit harder for me to come up with an example on the spot, but I’ll give it my best shot.

Each example query below uses two CTEs that generate fake X,Y values. Let’s pretend that we need to:

  • sum each set of X and Y columns between the two tables
  • find the difference between those sums

There are many ways we could go about this, which is partially why it’s difficult for me to think of a better example.

Without using the CROSS JOIN LATERAL feature, I would typically have to do the addition for the first two columns (x+x, y+y) and then a third time as part of the subtraction calculation ((x+x)-(y+y)). It’s not the end of the world, but I really wish I could refer to aliases for those first two columns instead.

Without using CROSS JOIN LATERAL

WITH ta(x,y) AS (
       SELECT * FROM generate_series(1,2) x, generate_series(3,4) y
),
tb(x,y) AS (
       SELECT * FROM generate_series(10,11) x, generate_series(13,14) y
)
SELECT
       ta.x+tb.x AS x_sum,
       ta.y+tb.y AS y_sum,
       (ta.x+tb.x)-(ta.y+tb.y) AS sum_diff
FROM ta, tb;

Using a CROSS JOIN LATERAL, I can do the first set of calculations and then refer to the aliases further down in the query. Not only does it break up my work, but it also makes the top-level SELECT statement cleaner.

WITH ta(x,y) AS (
       SELECT * FROM generate_series(1,2) x, generate_series(3,4) y
),
tb(x,y) AS (
       SELECT * FROM generate_series(10,11) x, generate_series(13,14) y
)
SELECT
       x_sum,
       y_sum,
       sum_diff
FROM ta, tb
       CROSS JOIN LATERAL (VALUES(ta.x+tb.x,ta.y+tb.y)) AS t(x_sum,y_sum)
       CROSS JOIN LATERAL (VALUES (t.x_sum-t.y_sum)) AS d(sum_diff);

Both solutions produce the same result:

x_sum|y_sum|sum_diff|
-----+-----+--------+
   11|   16|      -5|
   11|   17|      -6|
   12|   16|      -4|
   12|   17|      -5|
   12|   16|      -4|
   12|   17|      -5|
   13|   16|      -3|
   13|   17|      -4|
   11|   17|      -6|
   11|   18|      -7|
   12|   17|      -5|
   12|   18|      -6|
   12|   17|      -5|
   12|   18|      -6|
   13|   17|      -4|
   13|   18|      -5|

The win here, at least for me, is that I can build up each set of calculations and refer to them higher or lower in the query. Being able to refer to my previous calculations by alias is just a huge organizational win. How many times have you wanted to do that in a query?!

Conclusion

I love using these features of PostgreSQL to transform data. I’ve learned most of them by pair programming with other PostgreSQL users and developers and each time I have to really internalize how it works and the value.

There are so many other things I could have chosen to talk about in response to Grant’s prompt, but hopefully knowing a little bit more about CROSS JOIN LATERAL helps you. 😊

PostgreSQL Patch of Interest (to me): Transaction Timeout

This is part of an occasional series of blogs about PostgreSQL commitfest patches. Look at the series intro which also lists some ground rules and expectations for what to expect in my thoughts below. 🐘

Patch title and thread: Transaction Timeout
Commitfest at time of review: 2023-03
Status at time of review: Ready for Committer

Nikolay Samokhvalov (Postgres.ai, Postgres.tv, Postgres.fm) seems to enjoy pushing for small wins/iterative development on the core product. He and Michael Christofides (pgMustard) have been having a lot of great conversations on the Postgres.fm podcast about features of Postgres, changes they’d love to see, and ways that the overall Postgres development community can grow.

In the last few weeks I’ve seen a few posts from Nikolay about live-coding sessions he’s been hosting with Andrey Borodin (@x4mmmmmm) to prepare PoC patches for PostgreSQL.

Probably because I follow Nikolay and have seen some of these sessions, both of these names caught my attention when scanning the (long) list of patches currently included in the March 2023 Commitfest.

The patch has been reviewed and tested by a few folks and is currently sitting in a “Ready for Committer” state, but no committer has assigned themselves to it yet.

Why Add Another Timeout?

After Andrey submitted the patch in December 2022, Nikolay was the first to respond as to why he suggested adding another timeout option to the mix. Currently PostgreSQL has several timeout settings available for configuration. In particular, statement_timeout often gets a lot of airtime in the community as a good failsafe setting to prevent runaway queries from running forever.

What Nikolay pointed out is that some workloads could have many very short running statements that run within one longer transaction. If your statement timeout is 30 seconds for instance, it’s possible that a transaction is started and many small statements are run and completed without any meaningful value being returned to the client. In web applications, this can mean that users have long abandoned the web page, but because the statements are completed under the statement_timeout setting, PostgreSQL continues to do the work.
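
As a rough sketch of the difference, here is how the two settings would compare. statement_timeout already exists; the transaction-level setting is only the name proposed in the patch thread and is not available in a released PostgreSQL version as of this writing:

-- Existing failsafe: abort any single statement that runs longer than 30 seconds
SET statement_timeout = '30s';

-- Proposed by the patch (not yet available in a release): abort the whole
-- transaction once it has been open longer than 5 minutes, even if every
-- individual statement inside it finishes quickly
-- SET transaction_timeout = '5min';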

Getting the combination of timeouts correct for the expected outcome might be a little tricky, but I like that the discussion and opportunity to submit a patch like this points, yet again, to the open nature of Postgres.

How Much Refactoring is Required?

This patch has only existed since December 2022. Assuming a committer picks it up, there’s a chance this could be pretty short lived from concept to completion. Still, in those two months there have already been a number of comments in the thread from other contributors around what an implementation like this adds to the complexity of processing a single statement.

If you begin to follow the Commitfest threads more, this is a common discussion point. Obviously, it makes sense in any development endeavor to simplify function abstractions and be as DRY as possible. I’m sure many of us have done this countless times in design meetings and code reviews.

What often catches my eye, however, within the PostgreSQL context is the deep understanding of how and why a particular function is coded the way it is. I’m not even surprised anymore when someone points back to mailing list or commit messages from a decade or more ago as further context. Having this kind of information about the code, potential improvements, previous attempts at simplifying the code, and more, is an amazing thing to watch.

Yes, it can slow things down because folks often have to rehash the previous discussion and see if anything has changed through the normal course of development, but the end result is typically a better path for long-term maintenance.

Follow Along with Live Coding Examples

If these kinds of small patches interest you, I’d encourage you to follow Nikolay and Postgres.tv. In the ~8 weeks since this patch was submitted, Andrey has been coding other small patches to improve things like UUID formats, buffer pool management, detailed buffer statistics, and more.

PostgreSQL Patch of Interest (to me): Using the System CA Pool for Certificate Verification

This is part of an occasional series of blogs about PostgreSQL commitfest patches. Look at the series intro which also lists some ground rules and expectations for what to expect in my thoughts below. 🐘

Patch title and thread: Use the system CA pool for certificate verification
Commitfest at time of review: 2023-03
Status at time of review: Ready for Committer

Setting up SSL mode in PostgreSQL (and really any hosted service) is an essential, table stakes DBA task. Just as important is understanding how the sslmode parameters of a PostgreSQL connection really work and what users can expect to happen every time they connect to the cluster.

This patch caught my attention because I’ve struggled over the years to get various hosted services (and sometimes various applications) to work because the SSL certificates weren’t in the right spot locally. Before getting to the details of the patch and why it was submitted, let’s briefly talk about SSL connections with Postgres.

SSL Client Communication with PostgreSQL

The way that PostgreSQL connection encryption works is through the sslmode parameter of a connection string. The client communication library, libpq, will use this setting to determine what kind of SSL connection it will request of the PostgreSQL cluster.

With many hosted services and tutorials, this connection string setting is often not discussed and taken for granted. Unfortunately, the default setting for this parameter is sslmode=prefer, which doesn’t guarantee a secure connection. Referring to the table in the libpq-ssl documentation page, we can see that prefer simply means that the client would like to use SSL if it’s available, but no error will be thrown if the server doesn’t support it. 😱

Ouch!

To make matters worse, libpq will only successfully verify the certificate authority and server name if the root CA is in a very specific location on the client: ~/.postgresql/root.crt. That is, even if the server certificate is signed by a common root CA already located in the system-wide CA pool (like Let’s Encrypt), the libpq client will not search for and use it when either sslmode=verify-ca or sslmode=verify-full is specified. The CA certificate must be named root.crt and located in the hidden directory noted above.

This translates into a lot of frustration for end-users and a fair amount of setup/maintenance of user accounts to make sure the root certificates are in the right place. It’s also challenging to have multiple root certificates if you work between multiple organizations and their certificates are signed by different authorities. The only workaround is to specify the root certificate location in the connection string using `sslrootcert`, providing the full path to the certificate.

In short, the big usability issue here is that most of us have the widely supported root CA certificates on our computers already. PostgreSQL is just expecting them to be in a specific location when we want to verify the server authenticity. In fact, this response in the thread from Jelte Fennema (Microsoft Postgres developer) really brings home the issue.

Huge +1 from me. On Azure we're already using public CAs to sign
certificates for our managed postgres offerings[1][2]. Right now, our
customers have to go to the hassle of downloading a specific root cert or
finding their OS default location. Neither of these allow us to give users
a simple copy-pastable connection string that uses secure settings. This
would change this and make it much easier for our customers to use secure
connections to their database.

Improving the End-user SSL Experience in PostgreSQL

To be honest, the patch caught my attention because I’ve struggled with SSL setup in PostgreSQL at various times over the years. The conversation around the topic, patch approach, and even some surprise in how things are working from long-time PostgreSQL contributors is refreshing.

What I didn’t expect was such a good example of how circuitous the lifecycle of some patches can be. Over the last two years, the journey of this patch has looked like this:

  • Thomas Habets submitted the original patch in September 2021 (first email in the patch thread)
  • Discussion on the patch eventually ended on October 4, 2021 because there wasn’t full agreement on the best path forward
  • A year later another developer, Jacob Champion, implemented a different approach and resubmitted the patch
  • Conversation ensued, with multiple developers discussing the new approach and moving it a step closer to inclusion

I love how much this illustrates the power of the PostgreSQL developer community. Ideas are shared and collaboration happens. Sometimes this isn’t the right time or approach for the proposed patch. But, it triggers the right conversation and eventually someone else picks up where the first contributor left off.

Currently Proposed Changes

This patch is proposing a pretty small change to the connection string handling in libpq. In Jacob’s proposed patch, he’s adding the ability to use a new keyword with the sslrootcert parameter. If a connection string has sslrootcert=system, then libpq will only look for the root CA in the system CA pool.

Earlier in the thread, there was some discussion over possibly using some fallback logic so that if there was no root.crt file in the correct location, then using sslmode=verify-full would fall back and look in the CA pool.

Instead, Jacob argued that not using a fallback would be more clear and precise, actually leading to easier debugging if someone asked for help with a connection. In the end, everyone seemed to agree.

Current Status

As of February 15, 2023 (the publish date of this article), this patch is in a “Ready for Committer” state. This means that multiple people have reviewed the patch, applied it locally, and tested it. It is now ready for an official committer to prepare the patch and move it forward into the main codebase.

Because it changes how libpq works I assume it wouldn’t be included as part of a minor release, which means it’s likely to get included with PostgreSQL 16 as long as a committer has the time to shepherd it forward one last step.

PostgreSQL Patch of Interest (to me): An Occasional Blog Series

PostgreSQL is the “World’s Most Advanced Open Source Relational Database” (the website header says so, and I happen to agree).

One of the reasons PostgreSQL popularity continues to increase is because of how new features, bug fixes, and general discussions about improving Postgres are open for anyone to see. Some changes are proposed and merged quickly (relatively speaking). Other ideas might take a few years to discuss and implement in a way that garners wide support.

Although I’ve been using Postgres full-time again for the last 5 years, it’s only in the last 12 months or so that I’ve understood how open the development process is, and how easy it is for you and me to follow along. If you’ve lived most of your career working with proprietary database systems (even those based on Postgres), it might be hard to believe.

Five times a year, the Postgres project runs a Commitfest to track the patches submitted for inclusion in upcoming releases. These patches can be documentation changes, improvements for minor releases, or new features for upcoming major releases.

Jacob Champion, a colleague when I worked at Timescale, wrote a few articles about managing a Commitfest which are worth the read. As Timescale prepared to publish the articles, the internal review helped me better understand how unique the Postgres process is, especially among most major database platforms.

As a result of understanding this process better, I now make it a regular part of my week to review the open Commitfest. I scan the patches that are being worked on and dig a little deeper into the discussion of patches that catch my eye.

Don’t get me wrong. I don’t (currently) have the skillset to review Postgres C code in a way that would meaningfully help that process. Yet, the discussion around patches and the opinions that folks share all contribute to improving Postgres release after release. For a Postgres enthusiast, this is a fascinating process to watch.

Which leads me to this new series, PostgreSQL Patch of Interest (to me).

I often find myself scanning the Commitfest application, clicking through to the associated pgsql-hackers email discussion, and thinking, “This is intriguing! I wonder what other non-hackers think about this?” There are too many patches in flight at any one time to mention them all, but I’m guessing at least one or two a month would trigger some interesting discussion.

My Ground Rules and Disclaimers

  • I’m not a hard-core C developer and I still have a lot to learn about Postgres internals. I may never be up to the challenge of working on a patch myself. But maybe one day I’ll feel that I know enough to help review some. 🤞
  • Most of the time this series will not be a deep dive into the actual patch itself. In fact, the discussion for a small patch can often span months of time, so I can’t even promise to dig into many details. 👷
  • Any thoughts are only musings about the fix or feature and what it means to the wider Postgres discussion. You might not care if LZ4 compression is added to `pg_dump`, but I might happen to think it’s a pretty cool improvement. Let’s still be friends. 🤔
  • I have no issues borrowing from other discussions about a patch if it seems relevant. I’ll attempt to faithfully give credit, but the goal is learning (both for myself and others). 🧑‍🎓
  • If I miss a few weeks, life probably just got busy. I doubt I’ll ever run out of patches I find interesting, but there are only so many hours in a day. ⏱️

Following along at home

I’ll link to any patch I discuss, specifically in the Commitfest website. However, if you want to browse on your own or dig in for more detail, here are a few helpful things to know about the items you’ll encounter on the website.

Home

On the homepage you’ll see a listing of each commitfest back to December of 2014. If you want to see what patches were accepted into the review process of a prior release and what the outcome was, have a glance through anything labeled “Closed”. I’ll be looking at the Open and Future commitfest entries most of the time.

A screenshot of the commitfest homepage
The Commitfest homepage

Commitfest Details

Once you’ve clicked into a Commitfest page, the table lists all patches currently attached to the open review session. There are headers (e.g. “Bug Fixes”, “Clients”, “Documentation”, etc.) interspersed throughout the table to group things together. The current status of a patch is listed, as well as the targeted PostgreSQL version.

A web page table listing the patches currently included in a specific commitfest
The list of patches currently included in a specific commitfest

Patch Detail

Next, you can click on the “title” of the patch to get to more detail, including how many commitfests this patch has been a part of. From here, the most interesting piece for me is the pgsql-hackers email thread that accompanies each patch. Clicking on the first link in the Emails section will direct you to the entire thread of discussion about the patch. I’m just warning you, some of these pages could be very long.

A commitfest patch detail page about a specific patch, including links to the full email discussion
Details about a specific patch, including links to the full email discussion

Which directs you to this:

A webpage that provides the complete email thread from the pgsql-hackers mailing list about a specific patch
The complete pgsql-hackers thread about a specific patch

What you see is the entire email thread for the referenced patch listed in chronological order. The first email you see in the thread is the original patch submission. Most often the original email has the initial patch file(s) attached. Sometimes, however, a feature currently included in a commitfest may have started with discussion of an idea that eventually turned into a working patch.

One final note if you haven’t watched the PostgreSQL mailing lists much. In general the email lists are anti-top posting. This means that all responses should quote the relevant parts of the email they are responding to and essentially make those responses in-line. You should not (or very rarely) see a response at the top of an email with no relevant quoted text. It’s easy to lose context with top-posting and you’ll find that the Postgres email list users are pretty pedantic about the best way to provide feedback through email.

Tracking the Changes

That’s the basic rundown of how I’ll be finding patches to discuss and how you can do the same. As I get to know more folks in the PostgreSQL community, including those that are contributing to the project on a regular basis, watching the development process unfold has been a very good learning opportunity. At the very least it helps me realize and remember that this project has been built by many dedicated people, over a very long time. Quite an achievement, and one that all users benefit from! 🙏

Relational or Non-relational Postgres?

For PGSQL Phriday #005, Ryan Lambert asked us to think about how we use PostgreSQL for non-relational workloads. I’ve been following Ryan’s work for a few years now (his blog posts are 💯) and every time I connect with him about mapping and Postgres, I walk away with way more knowledge than I contributed to the conversation. 🙂

Needless to say, I think a discussion about different ways we use non-relational data in Postgres is a great topic, well timed with a few discussions I’ve had recently about schema design and data access functions unique to PostgreSQL.

Postgres for Powerful ELT (not ETL)

Until recently, most conversation about processing raw data within a data pipeline has focused on the Extract, Transform, Load (ETL) pattern. Partially because databases lacked a more complete set of text processing tools, it was left to developers to exchange and process data externally and then insert the fully transformed data.

Databases have changed significantly over the last decade or so. Postgres, specifically, is replete with powerful text and data processing functions, allowing us to take a similar but database-focused approach: Extract, Load, Transform (ELT).

The difference is subtle. The focus is on extracting the key data as part of the original processing, but then using Postgres as the staging and transformation layer. It used to be that data would be loaded into an intermediate database and manipulated with external scripts and functions. But much of that same work can be handled internally, and often more efficiently. (keep your logic close to your data… right? 😉)

If you start to head down this path and think about doing more data transformation in the database, then we have to talk about how that impacts schema.

Relational design for non-relational data

Yeah, that felt weird to write.

The most prominent way that PostgreSQL has displayed its non-relational superpowers in recent years is through the JSONB data type. Arguments abound as to whether Postgres could serve as a replacement for NoSQL workloads by pulling the necessary, relational pieces out and then just querying the raw JSON directly as needed. For many applications, there are probably good arguments for harnessing the power of relational data alongside the JSON of modern web service apps. But that’s not what I’m concerned with.

What if there are ways to use the JSON data that proliferates our data streams more quickly without overthinking Postgres schema design upfront?

When I was at Timescale, we had many discussions about how to best architect schema design for high-throughput applications. Specifically, we were working to consume API data from time-series web services like stock and cryptocurrency trading platforms.

In many of these platforms, the API responses tend to be in a constant state of change. The main properties were consistent (ids, alternate keys, prices, timestamps, trading volume), but the business metadata was in constant flux. Moreover, as we attempted to learn the APIs to provide working samples to our users, we would often get stuck in debates about which fields to parse from the incoming JSON and which to leave.

One day, a co-worker simply said, “Don’t worry about it so much. Pull out what seems reasonable and then save the rest in a JSONB field and pull it out later if you need it.” 🤯

At first, I protested. Until now, most of the debate around how to use JSON within Postgres focused on the analytical query performance over the data as a potential replacement for NoSQL applications. In most of the work that I did, working that way didn’t really apply. But now I had a compelling reason to consider using JSONB fields more regularly.

His point was simple. When there is some uncertainty about what embedded data may be useful to analyze at a later date, consider using this pattern.

As an example, let’s look at the following Twitter API response. In real life, this payload could easily have 50 or more fields included.

    {
        "data": {
            "author_id": "395950601",
            "created_at": "2022-01-15T03:09:22.000Z",
            "id": "1482188130191122436",
            "text": "Wordle 209 3/6\n\n\u2b1b\u2b1b\u2b1b\ud83d\udfe9\u2b1b\n\u2b1b\u2b1b\ud83d\udfe8\u2b1b\ud83d\udfe8\n\ud83d\udfe9\ud83d\udfe9\ud83d\udfe9\ud83d\udfe9\ud83d\udfe9"
        },
        "includes": {
            "users": [
                {
                    "id": "395950601",
                    "location": "hell",
                    "name": "Hall & Oates Enjoyer",
                    "username": "theviirg",
                    "verified": false
                }
            ]
        },
        "matching_rules": [
            {
                "id": "1482188147178053633",
                "tag": "wordle"
            }
        ]
    }

When I started consuming these tweets in the early months of 2022, I knew a few of the basic questions I wanted to analyze:

  • How often did users solve a puzzle in three guesses or less?
  • How often did a winning puzzle start with guessing the first letter correctly?
  • How many puzzles were solved by a specific Twitter handle?

Most of that analysis could come from the Wordle results themselves. There were many pieces of meta information I could retrieve about the tweets, but I wasn’t sure if they would be useful or not. What if I just stored them when I loaded the tweet in case I found a use for them later?

ELT For the Win

To consume the stream of tweets I had to create a Python script which would receive the payload and store it in my Postgres database. Because I knew some of the basic business logic I wanted to perform on the incoming data, it was more efficient to do minimal pre-parsing as the data flowed through. I could quickly accomplish the main goal; pull out certain fields and validate the tweet content. Once that was done, I simply stored the full payload in a separate JSONB column alongside the data for processing later, if necessary.
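
In spirit, the table looked something like this sketch (hypothetical names and columns, not my exact schema):

CREATE TABLE wordle_tweets (
       tweet_id    bigint PRIMARY KEY,
       author_id   bigint,
       created_at  timestamptz,
       tweet_text  text,   -- the pre-parsed pieces I knew I needed up front
       payload     jsonb   -- the full API response, kept around for later
);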

I didn’t have to spend a lot of time thinking about and obsessing over what other content I might need from the payload or what the schema design would be. Approaching non-relational data like this in Postgres provides two additional benefits I hadn’t considered before.

First, as I’ve said, there’s an easy fallback to the data if I need more information later. I wasn’t stuck, kicking myself for not pulling a piece of the data out when I had originally consumed it. Instead, I can create a new column (with a NULL or constant default!) and then update it by querying the payload.
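
For example, a hypothetical backfill might look like this (the column name is illustrative; the JSON path matches the sample payload above):

-- Add the column later with a NULL default, then fill it from the stored payload
ALTER TABLE wordle_tweets ADD COLUMN username text;

UPDATE wordle_tweets
SET username = payload #>> '{includes,users,0,username}'
WHERE username IS NULL;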

Second, the actual impact of storing this data may not be as big as I had originally thought, which was another primary objection I had to storing lots of JSON in my databases. If the JSON data is large, it will likely be TOASTed: stored out of line in an associated TOAST table, with pointers from the main row. This means that if we don’t select this column under normal query scenarios (don’t use SELECT * FROM…), each row response will be significantly smaller and faster. Pointers to TOAST data have essentially no impact on query response if that column is not requested. Even when the value is stored in-row, it will be compressed as much as possible.

By storing the original payload, you’re able to retain ultimate control and flexibility down the road. You can stop obsessing over every detail of your schema and get to work analyzing the most important data quickly. If your needs change in the future, you’re covered for however long you retain the original payload (or some valid subset therein).

From 🟩🟨⬛ to 🤯

For my Wordle analysis, I found that many of the same functions that allowed me to load and transform the raw JSON data into relational tables also helped me extract meaning out of the emojis that show a user’s daily score.

To accomplish this, I had to turn to the powerful regexp family of functions that are part of Postgres.

Splitting Strings and Keeping Order

The objective was to turn rows of 🟩🟨⬛ into something meaningful to query. I had to retain the order of the rows and the order of each guessed letter.

To do this, I first had to find the row boundaries; exactly five emojis in a row. Because the emoji are extended UTF-8 characters and can be one of multiple different colors (depending on user settings), it took a little bit of effort to figure out the regex that worked.

Step 1: Extract the rows

--  Wrapped in a CTE for easier processing in step 2
WITH wordle_score AS (
   select * from regexp_matches($$Not as easy as you think
   #Wordle 511 3/6*
   🟨⬜🟨⬜⬜
   ⬜🟨🟨🟨🟨
   🟩🟩🟩🟩🟩
$$,'([🟩|🟧|🟨|🟦|⬛|⬜|]{5})','g') WITH ORDINALITY AS g(guess, guess_num)
) 
SELECT * FROM wordle_score;

By using regexp_matches as a source in the FROM clause, it acts as a Set Returning Function (SRF). In Postgres, SRFs support WITH ORDINALITY, which also returns the order of the results coming back from the function. In this case, it allowed me to know exactly which guess was first, second, third, and so on.

guess           |guess_num|
----------------+---------+
{🟨⬜🟨⬜⬜}|        1|
{⬜🟨🟨🟨🟨}|        2|
{🟩🟩🟩🟩🟩}|        3|

Now that I knew the set of emoji that represented each guessed word, I needed to process each guess, separating the letters in order. For this, I was able to use an implicit CROSS JOIN to take the results of the previous query to further process each row using another regexp_matches function, this time tracking the order of the letters within each word guess.

Step 2: Extract letters in each guess

-- Break it apart even further to get each separate letter

WITH wordle_score AS (
  select * from regexp_matches($$Not as easy as you think
     #Wordle 511 3/6*
     🟨⬜🟨⬜⬜
     ⬜🟨🟨🟨🟨
     🟩🟩🟩🟩🟩
  $$,'([🟩|🟧|🟨|🟦|⬛|⬜|]{5})','g') WITH ORDINALITY AS g(guess, guess_num)
)
SELECT * FROM wordle_score ws, -- the comma is an implicit CROSS JOIN LATERAL
regexp_matches(ws.guess[1],'([⬛|🟩|🟨|⬜]{1})','g') WITH ORDINALITY AS r(c1, letter);

This results in a table that lists each letter guessed, including the row and column position.

guess           |guess_num|c1   |letter|
----------------+---------+-----+------+
{🟨⬜🟨⬜⬜}|        1|{🟨}|     1|
{🟨⬜🟨⬜⬜}|        1|{⬜}|     2|
{🟨⬜🟨⬜⬜}|        1|{🟨}|     3|
{🟨⬜🟨⬜⬜}|        1|{⬜}|     4|
{🟨⬜🟨⬜⬜}|        1|{⬜}|     5|
{⬜🟨🟨🟨🟨}|        2|{⬜}|     1|
{⬜🟨🟨🟨🟨}|        2|{🟨}|     2|
{⬜🟨🟨🟨🟨}|        2|{🟨}|     3|
{⬜🟨🟨🟨🟨}|        2|{🟨}|     4|
{⬜🟨🟨🟨🟨}|        2|{🟨}|     5|
{🟩🟩🟩🟩🟩}|        3|{🟩}|     1|
{🟩🟩🟩🟩🟩}|        3|{🟩}|     2|
{🟩🟩🟩🟩🟩}|        3|{🟩}|     3|
{🟩🟩🟩🟩🟩}|        3|{🟩}|     4|
{🟩🟩🟩🟩🟩}|        3|{🟩}|     5|

Once I started to see how quickly I could take skills I already had to process textual data like this, creating relations and meaning from green, yellow, and black boxes, I knew there was a lot of power I hadn’t yet tapped in Postgres. I’ll have to save the Wordle fun for another post.

A Happy Coexistence

For the sake of the challenge presented by Ryan, I’m going to stop here. Honestly, working on the Wordle Tweets project was one of those turning point moments in how I viewed the ability to use Postgres for more than just storing and querying relational data. Hopefully just talking about and showing you a glimpse of what can be done with Postgres does the same for you.

One of my next steps in this direction will probably be to finally learn more about PostGIS. And I happen to know of a great teacher and resource to get started! 😀

P.S. – Will You Contribute to PGSQL Phriday Next Month?

Next month will mark six months of experimenting with a regular community blogging event. I had very low expectations when suggesting the idea, and even lower expectations that folks would have the time to participate. To be honest, I understand how busy everyone is, and answering challenging questions posed by others isn’t a quick task. Oh, how I’ve been pleasantly surprised at the response, me of little faith. And every month so far, I’ve learned something from all of the contributions.

There’s just one thing missing. YOU!

Keep a watch on Twitter, Mastodon, Postgres Slack, and Planet.Postgresql.org for an announcement of the next topic near the end of February. Consider how you could share your knowledge and experience with everyone in the community. I promise that someone will benefit from what you have to offer!

And, if you’re a (semi) regular Postgres blogger and would like to host a future month, reach out and let me know!

Database DevOps and Code management with PostgreSQL

For PGSQL Phriday #004, Hettie Dombrovskaya challenged us to think about working with PostgreSQL as code, by ourselves or within teams, and how we keep track of the scripts and changes that are important to our day-to-day work. There are so many facets of this topic we could discuss, and Hettie gave a few really great prompts, but I’ll focus on two main areas: Database DevOps (generally) and then more specifically PostgreSQL scripts that I find helpful (and how to track them).

Let’s hop to it!

What if Joe joins the circus?

At the risk of sounding like a broken record, the longest tenure of my career, 15 years, was at a software company that used SQL Server dating back to the 1990s. (their usage of SQL Server, not the start of my employment! How old do you think I am?!?! 😱👴)

The only real schema management option companies had at the time was a host of Entity Relationship tools like ER/Win (what we used), ER Studio, and others. Nobody was talking about Agile, DevOps (or Database DevOps), and application releases took months or more of effort.

Around 2010, Agile was taking hold and many changes were incorporated into “regular” application development processes to improve efficiency and the overall speed of product delivery. Unfortunately, database development didn’t quite follow suit in most organizations. At best, database changes began to be incorporated as ORM migrations, often at the expense of utilizing the database features to their fullest extent. It also meant that developers with little real database experience were modifying production databases through feature migrations without much oversight.

Oddly enough, we had the opposite problem. Our part-time DBA was still using the ER-based changeset process developed in the 1990s using bash scripts. We had to rely on him to incorporate all the changes each release, modify upgrade (manual migration) scripts to catch edge cases, ensure static data was up-to-date, and more. Aside from asking quite a lot of one person, and despite numerous attempts to get others up to speed on the process, it became clear that we were very exposed to a break in this link of the development chain. If Joe got hit by a bus (er, joined the circus), database change management was going to be a major issue.

It was here, struggling to lead the development team in a growing SaaS business, managing hundreds of client databases, that I knew there must be a better way.

Database DevOps

“DevOps is the union of people, process, and products to enable continuous delivery of value to our end users.”

Donovan Brown, Microsoft

At the core, the biggest problem we needed to solve was making our database change process visible to the whole team. We had to modify our processes and the expectations we had of individual people within the team to deliver better, more consistent value at the data layer.

The tools we chose for our circumstances don’t apply completely to PostgreSQL because they were SQL Server specific. But the process and benefits of building a “database as code” mindset within the team accelerated our velocity and quality. And Joe didn’t have the weight of the world on his shoulders. 😀

A few principles that we found important and carry through in how I approach/teach database management to this day include:

  • Knowing the true database state is invaluable: One of the biggest issues I have with a migration-only process (whether SQL or ORM) is that it’s hard to validate the intended state of the database schema – the “source of truth”. If your tooling doesn’t provide a method for easily maintaining a record of the expected state at various milestones, you’ll never really know which schema to trust (or find when a production database has “drifted” from that source of truth).
  • All SQL is versioned and committed: Migrations, repeatable scripts, static data, and helper scripts need to be versioned and committed. This provides the same PR-led observability on changes and allows all developers to know where the “newest, best version” of a helper script can be found. When they’re strewn about across developer computers or in browser bookmarks, that’s a recipe for trouble.
  • CI/CD is for databases, too: Database changes should be automated and incorporated just like regular application code. Even committed code that’s not put through a similar CI/CD process will eventually lose out to manual processes. If the toolset you’re using doesn’t help with this, consider moving toolsets. The pain now will reap greater ROI than you can probably see today.

I believe in the process of database development so much that I’ve recently joined Redgate, a leader in database DevOps and tooling for over two decades. Our flagship tool, Flyway Enterprise, combines state-based tracking with intelligent migration generation. Although I had been a Flyway Community user before (a migration-only toolset), the added benefit of tracking state, too, provides control over migrations and assurance that there is always a source of truth to verify against.

If your current toolset doesn’t allow you to work towards a state of committing changes and rolling them out in an automated, test-driven, PR-based way, find a tool that does and start taking small steps there. I promise that once you see a script work its way through dev, staging, and production environments, you’ll be amazed that you worked so long without that process.

And, if you don’t know where to get started, feel free to reach out through Twitter, LinkedIn, or email (Ryan @ the domain of this blog).

Helpful PostgreSQL Scripts

Great! Now that you have a process for managing your database changes and the scripts that matter – what are those scripts, and where do you find them? The answer to that is trickier than I’d like.

PostgreSQL, the global community, and core “leadership” have put so much effort into documentation and examples over the last few decades. Sometimes, however, it can be hard to remember that PostgreSQL is maintained by volunteers, even if some of them are being employed by various companies as contributors to the project. Maintaining pages and repositories of example scripts is challenging for the most committed teams, let alone a globally distributed set of “hackers” focused on turning out some pretty 💯 releases of PostgreSQL every year (for many years running)!!

That said, there have been a few attempts over the years to provide a place for this extra information, a few of which aren’t referred to often enough (IMO). One site that holds lots of information is the PostgreSQL Wiki. Below are a few pages that I’ve referred to over the years, some of which probably need to be updated and moved into some more conspicuous, “official” areas like documentation.

  • Don’t Do This: Still (mostly) relevant information about PostgreSQL features, data types, or setup in a “do this, not that” way.
  • Index Maintenance Scripts: Common starting place for finding duplicate or unused indexes.
  • Developer FAQ: Exactly as it says on the tin, many questions that are repeated often on mailing lists, Slack, and elsewhere. Continues to be maintained.
  • Performance Snippets: Database bloat, loose index scan, finding locks, and more!

In a real project, whenever I use one of these scripts as a starting point, we’ll commit it to a “helpers” repository or folder in the current project with comments that reference the source, provide a URL, and, if necessary, any relevant blog post or Stack Overflow links. Without this habit or expectation, many scripts die the death of an accidentally closed IDE tab. 😬

Work to Be an Example

Most days I feel like the least qualified developer in the PostgreSQL space to provide elegant examples. My Postgres knowledge is hard won (and hard work). But I love the platform, the community, and the constant improvements. Learning how to use data, harness the power of SQL and Postgres, and share it all with others is something I’m trying to get better at.

Whenever I give a presentation at a conference or meetup, I try to make the slides and any SQL available through my GitHub repo. More recently, I’ve started recording a video for each day of the Advent of Code as I’ve completed them, talking through my solutions and what I learned along the way.

I’d encourage you to do the same. Through the 2022 State of PostgreSQL Survey, we learned that many people in the PostgreSQL community prefer reading blog posts and watching videos. The more you contribute, the more likely your work will end up in the comment section of a script that helps a team use Postgres more effectively. 😉

Long-live Postgres!

Advent of Code 2022: Days 6-10 With PostgreSQL

I’m slowly working through the Advent of Code 2022 puzzles using PostgreSQL and SQL as much as possible. For an overview of how I’m approaching these, please see the first post in this series: Advent of Code 2022: Days 1-5 with PostgreSQL

Other resources:

AoC 2022: Day 6-10 with PostgreSQL

Day 6: Tuning Trouble

A fun puzzle: we were given a long string of characters and had to find transmission markers by identifying sets of non-repeating characters.

For Star 1, we had to find the “start-of-packet” marker by finding the first set of four non-repeating characters. After some thought, I chose to split the long string into four-character arrays and then use DISTINCT to identify the first one containing four distinct characters.

Star 2 was similar, only we had to identify the “start-of-message” marker which was 14 consecutive, non-repeating characters. I simply changed a few values to create arrays of 14 characters to solve this puzzle.
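
To give a flavor of the approach (a minimal sketch, not my exact solution), here’s the sliding-window idea against a sample datastream:

```sql
-- For each starting position, grab a 4-character slice and keep the first one
-- whose characters are all distinct; chars_processed is the puzzle's answer.
WITH input AS (
    SELECT 'mjqjpqmgbljsphdztnvjfqwrcgsmlb'::text AS datastream  -- sample datastream
),
windows AS (
    SELECT n + 3 AS chars_processed,
           substring(datastream FROM n FOR 4) AS marker
    FROM input,
         generate_series(1, length(datastream) - 3) AS n
)
SELECT chars_processed
FROM windows
WHERE (SELECT count(DISTINCT c)
       FROM unnest(string_to_array(marker, NULL)) AS c) = 4
ORDER BY chars_processed
LIMIT 1;
```

Swapping the 4s for 14s is essentially the change described for Star 2.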

Advent of Code Day 6 with PostgreSQL

Day 7: No Space Left On Device

Directory structures and recursive CTEs. Whenever someone teaches about recursive CTEs they either use a manager/employee example or a directory structure, both typical parent/child relationships. So while I felt like this should have been easy, for some reason I struggled to get the right relationship. Once I did (with some help), things fell into place.

Star 1 entailed creating the correct parent/child relationship so that we could sum the total file space of each directory in the tree. This meant that a parent folder’s size was the sum of the files it contained plus the sum of all of its child folders. This puzzle made great use of regexp_match and CARDINALITY.
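
As a simplified illustration of that parent/child relationship (not my actual solution, and with a tiny hand-built tree instead of the parsed input), a recursive CTE can expand every directory into the set of its descendants and then sum the file sizes:

```sql
-- Each directory's total size includes everything in its subdirectories.
-- Directory names are unique in this toy tree, so grouping by name is safe here.
WITH RECURSIVE
dirs(id, parent_id, name) AS (
    VALUES (1, NULL::int, '/'),
           (2, 1, 'a'),
           (3, 2, 'e')
),
files(dir_id, file_size) AS (
    VALUES (1, 14848514),
           (2, 29116),
           (3, 584)
),
-- Pair every directory with itself and each of its descendants.
descendants(ancestor, descendant) AS (
    SELECT id, id FROM dirs
    UNION ALL
    SELECT d.ancestor, c.id
    FROM descendants d
    JOIN dirs c ON c.parent_id = d.descendant
)
SELECT dirs.name, sum(files.file_size) AS total_size
FROM descendants
JOIN dirs  ON dirs.id = descendants.ancestor
JOIN files ON files.dir_id = descendants.descendant
GROUP BY dirs.name
ORDER BY total_size DESC;
```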

I kept getting close, but honestly couldn’t have solved it in a timely manner without seeing what someone else did to get the correct parent/child sums. Great learning experience.

Star 2 took the input from the first star to then figure out the smallest folder that could be deleted to free up enough space to do an update. I’m sure there were more elegant solutions, but I simply did some additional math and sub-selects to get the final answer.

Advent of Code Day 7 with PostgreSQL

Day 8: Treetop Tree House

Early in my career I was a teacher for 12-18 year olds (middle and high school for U.S. folks). Part of my teaching load was technology classes, both programming and basics like word processing and spreadsheets. I learned early on that most kids rarely read past the first line of a test question or the first choice in a spellcheck list (I got many papers addressed to Mr. Boo).

Day 8 was my own little reminder to read the full instructions. The puzzle instructions and examples were clear… I just didn’t read them carefully. This meant that I approached Star 1 wrong from the beginning and wasted some time. Once I realized my mistake, the second attempt went much better.

For Star 1 we had to look at a grid of numbers representing the height of trees to find the total number of visible trees. A tree was visible if it could be seen from any direction (row/column) all the way to the edge. The final solution used string_to_table (new in PostgreSQL 14+) and some basic WHERE predicates. I also saw a better solution from Vik Fearing using WINDOW functions which was so much simpler. 🤦‍♂️
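
For a sense of how string_to_table helps here, a minimal sketch (using the small example grid rather than the real input) that turns the grid into one row per tree with its coordinates:

```sql
-- Split each line on every character (NULL delimiter) to get one row per tree;
-- WITH ORDINALITY supplies the 1-based y (row) and x (column) coordinates.
SELECT y, x, height::int AS height
FROM unnest(ARRAY['30373','25512','65332','33549','35390'])
         WITH ORDINALITY AS grid(line, y),
     string_to_table(line, NULL) WITH ORDINALITY AS trees(height, x)
ORDER BY y, x;
```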

Star 2 was a bit closer to what I was originally trying for the first star, but this time we had to find a “scenic score” for each tree using some basic math and a final ORDER BY.

Quick note: I realized after recording the video that one of my “clean up” edits on the SQL actually produced the wrong answer in the video because it relied on the identity column of my data starting at 1. I’ve updated the GitHub repo code to ensure it works regardless of the ID, but didn’t feel it was worth re-recording the video. Live and learn…

Advent of Code Day 8 with PostgreSQL

Day 9: Rope Bridge

I’m not going to lie on this one. Day 9 really got me. The basics of how to approach this puzzle seemed clear, but I once again missed an important detail. We had to track the movements of the head of a rope and then adjust the tail of the rope to follow along based on the rules presented. “Simple enough,” I thought, “a perfect use of a recursive CTE.” That did work in the end, but not without some trials. The solution for each puzzle was about tracking how many unique coordinates the tail visited during all of the movement.

Star 1 was only about two “knots”, a head and a tail. We were provided the movement of the head and then had to move the tail accordingly. The big thing I missed the first time around is that we had to take one step at a time and adjust the tail after each step. Instead, my first attempt applied the total movement in one jump each time, which significantly under-counted the movement of the tail. Lesson learned.

Star 2 was just the first star… on steroids!! Now we had to track the movement of 10 knots while still only knowing where the first knot (the head) moved. I kept getting close, but because the movement of each knot was predicated on the movement of the knot before it, I kept missing one movement value. The simple formula that Vik used in his solution was the final key for me.
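
The follow rule itself boils down to a couple of expressions. Below is a minimal sketch of that single step in plain SQL (my own phrasing, not necessarily the exact formula mentioned above): if the head is more than one square away, the tail takes one step toward it on each axis.

```sql
-- Sample (head, tail) positions; diagonally adjacent counts as "touching".
SELECT hx, hy, tx, ty,
       CASE WHEN greatest(abs(hx - tx), abs(hy - ty)) > 1
            THEN tx + sign(hx - tx) ELSE tx END AS new_tx,
       CASE WHEN greatest(abs(hx - tx), abs(hy - ty)) > 1
            THEN ty + sign(hy - ty) ELSE ty END AS new_ty
FROM (VALUES (3, 1, 1, 1),   -- head two to the right: tail steps right
             (2, 3, 1, 1),   -- head up and over: tail steps diagonally
             (1, 2, 1, 1)    -- head adjacent: tail stays put
     ) AS knots(hx, hy, tx, ty);
```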

My final solution here used a multi-dimensional array with a recursive CTE to track the movement of each knot as the simulation progressed. By cutting out unnecessary work, I was able to reduce the runtime of my initial solution from ~25 minutes to ~1.5 minutes.

Advent of Code 2022 Day 9 with PostgreSQL

Day 10: Cathode-Ray Tube

Honestly, the second star has been my favorite answer so far. Getting there seemed simple enough, but off-by-one issues just kept annoying me! Once again I entered the puzzle with what seemed like a really easy solution, only for it to get complicated 10 minutes later. After solving the puzzle I saw that others solved this much more easily with something like string_to_table and a WINDOW function. One of these days I’ll learn!

Star 1 tracked the value of an integer register as each CPU cycle ticked by, given a set of instructions. Again, I got the answer, but not without overcomplicating it.

Star 2 was honestly just really fun. We had to use the values in the register from Star 1, using them to identify pixels on a small CRT that needed to be turned on. Once that was completed, we had to “draw” the screen to see a set of capital letters. Something about taking a table of data and calling string_agg to reveal the letters just felt fun.
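
The “drawing the screen” step is a nice showcase for string_agg. Here’s a minimal sketch with made-up pixel data rather than the real register values:

```sql
-- Collapse each row of pixels into a single text "scanline" in column order.
WITH pixels(row_no, col_no, lit) AS (
    VALUES (1, 1, true),  (1, 2, true),  (1, 3, false), (1, 4, true),
           (2, 1, false), (2, 2, true),  (2, 3, true),  (2, 4, false)
)
SELECT row_no,
       string_agg(CASE WHEN lit THEN '#' ELSE '.' END, '' ORDER BY col_no) AS scanline
FROM pixels
GROUP BY row_no
ORDER BY row_no;
```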

Advent of Code 2022 Day 10 with PostgreSQL

Key Takeaways

I have no idea how long it will take me to work through all 25 puzzles. In ten “days”, I’m amazed at how much I’ve learned, both about how I approach this kind of problem and about how there are always multiple ways to the same end. Learning the strengths and weaknesses of your tool, PostgreSQL in this case, is essential to improving your knowledge for the next task.

And again, I’m thankful to others like Vik and Feike who are sharing their work so that I can learn more, too!

Advent of Code 2022: Days 1-5 With PostgreSQL

Advent of Code is a yearly coding challenge created by Eric Wastl (@ericwastl) that runs from December 1-25. I’ve seen it referenced over the last 6-7 years in various technical circles, but had never registered or tried to follow along.

This year, I started to see Vik Fearing and Feike Steenbergen tweet about it, specifically solving the puzzles using PostgreSQL. The specific challenge they set for themselves was to only use SQL and base functions provided by PostgreSQL, avoiding functions or stored procedures that require a language like pl/pgsql.

The idea intrigued me and I decided to work alongside my fellow PostgreSQL community members on AoC, learning all I could in the process. (Spoiler alert… I’ve learned a lot already!)

As I go, I’ll upload my finished code to my repo and (hopefully) record a video for each day explaining my approach. I’ve finished through day 8 (as of December 14, 2022), but haven’t finished all of the videos yet. Again, my goal is to enjoy the challenge and learn, not try to be the first one done or get on a leaderboard somewhere. This also means it will probably take into January 2023 to finish the puzzles and record videos.

Enough explanation! Let’s talk about the first five puzzles, including links to the videos.

AoC 2022: Day 1-5 with PostgreSQL

Day 1: Calorie Counting

This was an interesting first puzzle to tackle with SQL. I originally solved it using a complicated back-and-forth between rows and arrays. The puzzle provided input representing the calorie values for the food items each elf carried. The group of items for each elf differed, and the only way to determine the separation of groups was an empty row in the text file.

The simpler solution was to use a FILTER clause on an aggregation over the groups. (Thanks to Feike for that hint.)
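
To show what that can look like in practice, here’s a minimal sketch of using FILTER to build the elf groups, with a few made-up input rows rather than the real puzzle file (and not necessarily Feike’s exact approach):

```sql
-- A running count of blank lines assigns each calorie row to an elf; summing
-- per elf and taking the largest total answers Star 1.
WITH input(line_no, line) AS (
    VALUES (1, '1000'), (2, '2000'), (3, ''),
           (4, '4000'), (5, ''),
           (6, '5000'), (7, '6000')
),
grouped AS (
    SELECT line,
           count(*) FILTER (WHERE line = '')
               OVER (ORDER BY line_no) AS elf_no
    FROM input
)
SELECT elf_no, sum(line::int) AS total_calories
FROM grouped
WHERE line <> ''
GROUP BY elf_no
ORDER BY total_calories DESC
LIMIT 1;
```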

Advent of Code Day 1 with PostgreSQL

Day 2: Rock, Paper, Scissors

So far this has felt like the “easiest” puzzle for me. Although I probably could have completed it with less typing, the concept was pretty straightforward and was accomplished by joining two related tables and adding the appropriate point amounts.

Advent of Code Day 2 with PostgreSQL

Day 3: Rucksack Reorganization

There’s something about simple tasks like splitting strings and creating tables to solve a puzzle. Not the least of which is remembering that most of these operations in PostgreSQL are 1-based, which you rediscover when splitting a string in half doesn’t quite work out the first time. 😀 In the end, it was simplest to create a couple of tables with a CTE and join them to find the “shared items” and solve the puzzle.
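
A minimal sketch of the 1-based split that tripped me up, using a single sample rucksack string:

```sql
-- substring() is 1-based: the first compartment starts at 1, the second at length/2 + 1.
SELECT line,
       substring(line FROM 1 FOR length(line) / 2) AS compartment_1,
       substring(line FROM length(line) / 2 + 1)   AS compartment_2
FROM (VALUES ('vJrwpWtwJgWrhcsFMMfFFhFp')) AS t(line);
```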

Advent of Code Day 3 with PostgreSQL

Day 4: Camp Cleanup

This puzzle was all about overlapping ranges of numbers. I had recently been reviewing the comparison operators for both arrays and range types which made this puzzle pretty easy to solve once the data was prepared. For the fun of it, I did it with both arrays and range types.
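
For the range-type version, the comparison operators do nearly all of the work. A minimal sketch with a few sample pairs (not my full parsing code):

```sql
-- '[]' makes both bounds inclusive; @> tests containment and && tests overlap.
WITH pairs(a_lo, a_hi, b_lo, b_hi) AS (
    VALUES (2, 8, 3, 7),
           (6, 6, 4, 6),
           (2, 4, 6, 8)
)
SELECT a_lo, a_hi, b_lo, b_hi,
       int4range(a_lo, a_hi, '[]') @> int4range(b_lo, b_hi, '[]')
       OR int4range(b_lo, b_hi, '[]') @> int4range(a_lo, a_hi, '[]') AS fully_contained,
       int4range(a_lo, a_hi, '[]') && int4range(b_lo, b_hi, '[]')    AS overlaps_at_all
FROM pairs;
```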

Advent of Code Day 4 with PostgreSQL

Day 5: Supply Stacks

This puzzle entailed iterating through instructions to move items between stacks. Without the ability to use a secondary language with loops and conditionals, I knew this would probably mean a recursive CTE. And, to be honest, recursive CTEs always make my brain work harder than it probably should, but after some work I was able to make it happen.

In the end, I chose to use only an array of strings to move items around the stacks, which honestly made it a bit harder to get to my solution. After the fact, I saw that others in the community created a JSONB object from the data so that they could use the jsonb_set function, which was a bit cleaner.
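
For context, a minimal sketch of why jsonb is handy here (with made-up stack contents, not anyone’s actual solution): jsonb_set can replace a single stack while leaving the rest of the object untouched.

```sql
-- Replace stack "1" after "moving" a crate onto it; stacks "2" and "3" are unchanged.
SELECT jsonb_set(
           '{"1": ["Z","N"], "2": ["M","C","D"], "3": ["P"]}'::jsonb,
           '{1}',                    -- path: the key for stack "1"
           '["Z","N","D"]'::jsonb    -- new contents of that stack
       ) AS stacks_after_move;
```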

Because I had to put some mental effort into creating the recursive CTE correctly, I spent ~10 minutes in my video recap for this day talking about how recursive CTEs work and how one helped me solve the day 5 puzzle.

Advent of Code Day 5 with PostgreSQL

Key Takeaways

I’m happy that I saw others in the PostgreSQL community mention Advent of Code. The opportunity to think about data in a new way, and to process and query it without variables and traditional procedural conveniences, has been fun (if difficult a few times) and shows how powerful PostgreSQL and SQL are.

I can’t wait to see what the next 20 challenges have in store… and what I’ll inevitably learn through the process.