Adventures in PostgreSQL Backups

For PGSQL Phriday #002, Andreas Scherbaum asked us to consider PostgreSQL backups, how we do them, any recommendations we have, and whether pg_dump is a backup tool.

To be honest, I feel pretty unqualified to talk much about backups in Postgres, partially because it’s been a couple of years since I’ve had to manage the uptime of an application. As PostgreSQL has grown in popularity and usage, a lot of effort is being put into tools like pgBackRest, pgBarman, and newer solutions like pgcopydb (OK, I realize this isn’t billed as a “backup” tool, but…).

What I can share are a couple of things I’ve learned about restoring clusters and databases, particularly given the extensible nature of Postgres and the variety of tooling.

Which is a good segue to mention…

Restores > Backups

I think this goes without saying in most cases, but if I’ve learned anything by parenting six kids, repeating the basics (almost) never hurts. 😉

Backups are only as good as your ability to restore the data in a business-appropriate timeframe.

It’s really that simple. No matter how much you work to dump the data or back up the data files in the appropriate way, if they can’t be restored it was all an exercise in futility.

It doesn’t matter whether the tool is pg_dump or pgBackRest: if you can’t get back to a state that is appropriate for your business or application, the backup hasn’t done its job. Knowing how to use the tools requires practice, patience, and knowledge of the potential pitfalls.

What Counts as a Restore?

For most of us, talking about “restores” usually means the data only, and maybe security/users included. In my experience there’s one other aspect of Postgres that can significantly impact the restore of your data and the smooth recovery of your application.

Extensions! 😱

One thing that most users don’t take into account when they restore data is that extensions are not backed up with your data. Regardless of how you do the backup (again, pg_dump or a file-level backup), that’s just (usually) the schema and data of your database. If you installed some extensions that are necessary for your application to work, the actual extension files are not backed up. This presents a few problems if you didn’t plan ahead and are caught in a bind.

Extension Availability

First, if you restore your database to a new server and those extensions aren’t installed or available in the contrib packages that ship with the core PostgreSQL installation, your application might not work even if all of the data was restored.

Let’s say your application relies on a specific Foreign Data Wrapper, maybe the redis_fdw. If that extension isn’t installed on the new server when you restore the data, your application probably isn’t going to work even though all of the data has been restored.

This is because the command to install the extension is backed up (and, in some cases, configuration information the extension needs to run normally), but when the restore executes something like CREATE EXTENSION redis_fdw;, you’re going to get an error. Assuming the restore continues and all of the data still gets restored, the extension itself still isn’t available.

Therefore, all of the clusters that you want to use as destinations for restoring data must have the same extensions installed and ready to go. Again, if you’re only using extensions that are provided as part of contrib, you’re probably OK most of the time. But due diligence is required when your application and users rely upon you being responsible with their data and availability.
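
If you want to sanity-check a destination cluster before you restore, the system catalogs make this easy. Here’s a minimal sketch (redis_fdw is just standing in for whatever extensions your application actually depends on):

-- On the source database: which extensions (and versions) are installed right now?
SELECT extname, extversion
FROM pg_extension
ORDER BY extname;

-- On the destination cluster: are those extensions packaged and available
-- for CREATE EXTENSION before the restore runs?
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name IN ('redis_fdw');

If that second query comes back empty, the CREATE EXTENSION statement in the restore is going to fail no matter how good the dump file is.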

Extension Versions

To make things even more complicated, the version of the extension is important to consider, too. Depending on what the extension is and the functionality being used, you are generally responsible for making sure that the correct version of the extension is available. This is easy to see in a textual pg_dump backup. The CREATE EXTENSION... command doesn’t specify the version, only that the extension needs to be created.

When an extension is upgraded between versions, various scripts can be run if there are catalog objects or metadata to modify. For example, if my application relies on version 1.0 of an extension and I restore to a server that has version 1.5 installed instead, the restored database might not contain the schema changes that version expects. That usually means bad things when I try to query data or use (deprecated or removed) extension functionality.
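
To see exactly which versions are available and to pin the one you need, the catalogs and a couple of DDL commands help here, too. This is just a hedged sketch; redis_fdw and the version numbers are the hypothetical ones from the example above:

-- Every version of the extension the destination server can install, not just the default.
SELECT name, version, installed
FROM pg_available_extension_versions
WHERE name = 'redis_fdw';

-- Pin the version explicitly instead of taking the packaged default...
CREATE EXTENSION redis_fdw VERSION '1.0';

-- ...or run the upgrade scripts to bring an existing installation to a known version.
ALTER EXTENSION redis_fdw UPDATE TO '1.5';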

I don’t want to make too big a deal of this. In many cases, extensions don’t update very frequently and new versions might not have made changes to functions, data, or catalog settings. But any time you rely on a new feature or setup, this is a notable issue to consider.

In short, if you are responsible for data integrity among multiple PostgreSQL clusters, ensure that all extensions are updated to the same versions and document it somewhere along with a changelog. This is helpful for you to remember the state of your database and for others in your organization that need to help maintain the database.

Selective Backup and Restore

The other thing that is particularly helpful with pg_dump and pg_restore is that there are many options to do selective operations.

  • Need to back up everything except that really large table only used for transient ETL data? pg_dump... --exclude-table=large_table
  • Are you trying to create a baseline script for a new database DevOps tool like Flyway and only require the database schema? pg_dump... --schema-only
  • Will this file be used for demonstration purposes and you can’t guarantee the same roles are available on the destination server? pg_dump... --no-owner
  • Conversely, did you receive a backup script and keep getting “role does not exist” warnings when you try to restore it? pg_restore... --no-owner

There are so many runtime options and permutations for each of these tools that we could give examples all day. Suffice it to say, there’s a lot of power in using pg_dump and pg_restore for daily development work and for helping other PostgreSQL users solve problems. Learning how to use both tools effectively is helpful for any PostgreSQL user.

Bonus: Is pg_dump a Backup Tool?

But seriously, I’m not sure I really want to wade into this discussion. At face value, yes. pg_dump produces a transaction-safe backup of whatever you told it to dump. Assuming you know how to correctly restore that dump file, you technically have a good, working backup.

That said, this isn’t a plan for maintaining a production system that requires fine-grained control and responsive replication. And that is an entirely different post (or series of posts).

In summary, pg_dump is technically a backup tool. But you’re going to be in a lot of pain if you try to use it for a high-availability setup. 😀

The Conversation That Never Ends

I’m excited to see what others talk about as part of this PGSQL Phriday. There are so many aspects of backup and restore that impact our daily lives. There’s no shortage of tips, tricks, triumphs, and horror stories in this area of PostgreSQL database management – and I think we could blog about it every month for a few years and still not cover everything.

Hopefully, shedding a little light on how extensions require just a little bit more attention as part of your backup scheme is helpful. Again, it all comes back to a successful restore, and extensions are part of that success.

Should you attend PASS Data Community Summit this year? It depends.

In the technology and database world there’s one timeless response, provided time and again to nearly any form of the question, “should I…?”.

“It depends.” (insert obligatory chuckle from the audience…)

  • Should I use a CROSS APPLY or LATERAL JOIN to solve my query problem? It depends.
  • Should I partition this table to improve parallel queries and ingest? It depends.
  • Should I move to managed instances in cloud X to solve my scaling problem? It depends.

What about your career? With conferences coming back in full force, should you take the step to attend a premier conference, make connections, learn from the best, get an edge heading into the coming year?

Well, it depends – on your career goals, job responsibilities, interest in cutting edge content, and desire to learn best practices from the creators of the features and tools you use every day. It might depend on how energized you get by connecting IRL with the larger data community after a three-year hiatus.

For me, these tangible conference experiences make a tremendous difference to my focus, skills, and relationships. Let me give you an example.

PGConf.EU 2022

I just returned from my first European trip in a long time. PGConf EU was held in Berlin from October 25-28, 2022. Like the PASS Data Community Summit, and nearly every other conference on planet Earth, this was the first time the wider PostgreSQL community in Europe was together, in person, since 2019.

🎉 It sold out a month ahead of time. 🎉

I don’t think this was an anomaly.

Over three days I got to shake hands and have meaningful conversations with more than 20 friends whose relationships started virtually during the pandemic! ❤️ I got to attend sessions led by the core PostgreSQL team as they explained foundational concepts and new feature improvements. 🎓 And I learned more by hearing the questions others asked in real time. 👂

I came away encouraged, thankful for the opportunity, and energized to continue investing in the community and the project. 🐘

The PASS Data Community Summit provides that same opportunity at a larger scale!

400+ Sessions and 3-month Recording Access

For deep-dive sessions, there are 18 different full day pre-conference sessions. I think the introduction to PostgreSQL for SQL Server Developers and DBAs is going to be pretty 🔥, but the other 17 are equally 🥇!

The main three-day conference runs in-person and virtually from Wednesday-Friday, November 16-18. I can’t wait to be in Seattle with thousands of data professionals, listening to compelling keynotes, having access to hundreds of live sessions, attending social events, and spending time in the “hallway” track seeing old friends and making new ones. Plus, you’ll get access to all pre-recorded, virtual sessions that are live right now!

If you haven’t signed up yet and still think you could get to Seattle two weeks from now, I know it will be worth the investment!

A Groundbreaking Virtual Experience, Too!

If you can’t be there in-person, a virtual pass will give you all the same access to teaching and content from the comfort of your own home. Unlike the virtual sessions of the last two years, however, there is one major difference with PASS Summit. As you’re watching the session, streamed live from the conference, you’ll see the presenter and a live audience talking and interacting with one another… and you can join in too! 🎉🎉🎉🎉

Finally, a virtual conference that lets you feel like part of the conversation and crowd!

I think it’s fantastic and I’m excited to see how speakers and attendees alike respond to this mix!

Will You Join Us For PASS Summit This Year?

There’s really no reason to miss the excitement and world-class content. I think it depends (😉) only on whether you’ll be able to make it in-person or online.

PostgreSQL at the PASS Data Community SUMMIT 2022


This year at PASS Data Community Summit, I’ll be presenting a full-day precon about PostgreSQL.

Why is there a PostgreSQL precon at an event primarily focused on SQL Server & the Microsoft Data Platform?

Read along to find out more!

In just under a month, I’ll be in Seattle preparing for the PASS Data Community SUMMIT 2022. This is the first in-person Summit since 2019, and the first ever SUMMIT to offer both in-person and virtual attendance. In many ways, this is a homecoming for the #sqlfamily just as much as it is the premier data event for the growing Microsoft data platform. 🎉

I’m honestly thrilled to be a part of the event, especially because I get to offer a full-day precon session about PostgreSQL, the most advanced open-source relational database in the world… and one that Microsoft, Amazon, Google, and countless others continue to invest significant resources into.

Why is there a PostgreSQL precon at an event focused primarily on the Microsoft Data Platform… and why should you consider attending the training and PASS SUMMIT?

I’m glad you asked. 😉

A SQL Server Developer Walks Into a New Job

I’ve been working with data and software development for nearly 20 years. The first two years I built applications with open-source databases, primarily MySQL and PostgreSQL. For the next 14 years I worked for one software company that exclusively used SQL Server and .NET for all development. It was here that I got to know about PASS, SQL Saturday, and the larger #sqlfamily.

But then, a little more than four years ago, I joined a new company to help grow the data analysis team. Between the time I accepted the job and my first day, someone forgot to tell me that they were migrating away from SQL Server to Amazon RDS PostgreSQL.

That was a rude awakening on my first day.

I had to quickly figure out how to do even the most basic things with PostgreSQL that were so easy with SQL Server.

  • What IDE could I use without SSMS?
  • Why can’t I just select the TOP(1) record from a table? (See the short sketch after this list.)
  • How do I set (and use) a variable in a script?
  • It’s 2018 and the best thing I can get to tune my query is this text dump from EXPLAIN and are you kidding me this is torture!!!!!!!! (only slightly paraphrased quote)
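
For the curious, here’s the kind of translation I mean for that TOP(1) question. The orders table is purely hypothetical:

-- SQL Server habit:
--   SELECT TOP(1) * FROM orders ORDER BY created_at DESC;
-- The PostgreSQL equivalent:
SELECT * FROM orders ORDER BY created_at DESC LIMIT 1;

-- Or the SQL-standard spelling that PostgreSQL also supports:
SELECT * FROM orders ORDER BY created_at DESC FETCH FIRST 1 ROW ONLY;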

I started to quickly find success as I asked for help, applied principles from my years with SQL Server and all that I learned from #sqlfamily, and eventually… by not constantly wishing everything was like SQL Server.

The more I learned, the more I grew to love PostgreSQL and the community that surrounds it.

Ever since, I’ve been on a bit of a mission to pass on as much of that knowledge as I can… even though I know there’s still so much more I have to learn.

From PostgreSQL Newbie to Knowledgeable Hacker

The training course is intended to be a start-to-finish overview of the major parts of PostgreSQL and how they relate to the experience you already have as a SQL Server developer or DBA. While you’ll have the best experience following along inside PostgreSQL yourself, you will still learn a lot (and walk away with all the training scripts and slides) just by attending.

The expected outline of the course will look like this. Because it’s difficult to determine ahead of time the level of experience that everyone has coming into the course, this may be adjusted slightly as we go, but again, any of the prepared material we don’t cover fully will still be provided and I’m happy to discuss it with attendees afterwards.

  1. Connecting to PostgreSQL and running your first SQL – I want you to experience PostgreSQL as quickly as possible. We’ll discuss a few good options for connecting to PostgreSQL quickly, run a few scripts, and talk about what we learned.
  2. Installation options – While it is assumed that attendees will have ready access to a PostgreSQL server to start the session, we will discuss the various options for installing a cluster. Remember,  PostgreSQL is an open-source project which doesn’t have a single company managing it like a commercial product providing finely tuned GUIs for all of your settings.
  3. You’re Not in Kansas Anymore – Before we dive in too deeply, we’ll take a few minutes to talk about the major differences between SQL Server and PostgreSQL that are likely to surprise you. Knowing is half the battle. 😉
  4. Configuration – Like any complex database solution, there are many configuration settings. Knowing which ones to look at first, where to find help, and how to test them can mean the difference between fast queries or a dead server.
  5. Intro to psql – Many of us have grown to love a well-designed GUI for interacting with our database. It took me too long to realize that understanding `psql` and learning the basics made me a better PostgreSQL developer and administrator. This could be a half-day course unto itself, but you owe it to yourself to become familiar with this 🔥 tool.
  6. Hands-on SQL comparison – Now that you’ve gotten connected and run some basic SQL, we’ll take time to look at many of the first “gotcha” moments writing queries in PostgreSQL. By the end of this section, you’ll be ready to read and write many of your common SQL patterns in a new way that works with PostgreSQL.
  7. Enhancing PostgreSQL with Extensions – Extensions are the magic power of PostgreSQL and one of the primary reasons so many companies are investing heavily in the platform. We’ll look at a few of the must-have extensions that are available with most PostgreSQL offerings.
  8. Indexing – Indexing is indexing… until it’s not and you have access to multiple datatype specific index types. 🤯
  9. Query Tuning – We can’t hide from it any longer. It’s time to dive into the world of EXPLAIN and query tuning in PostgreSQL. There are multi-hour classes available on this topic alone, so we can’t dive too deep, but you’ll know what to do with the wall of text and how to take next steps to improve your queries. (A short example of the command follows this outline.)
  10. Functions and Stored Procedures – Once again, they’re the same… but different. And functions aren’t always told they’re bad and to be avoided at all costs. They actually get invited to parties. 😊
  11. Triggers – Like functions, triggers aren’t shunned at all costs. How they work is similar (but again… different… and maybe better) and PostgreSQL developers believe they have a valuable purpose that doesn’t have to be a hidden menace.
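
As a tiny preview of the query tuning section, this is the shape of the command that produces that wall of text. The query itself is just a hypothetical placeholder:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 10;

Reading the resulting plan (and knowing which numbers to care about) is exactly what we’ll practice in that section.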

🎉 BONUS section 🎉

No training would be complete without a review of how to get help and connect with other developers. Although we’ll certainly talk about community throughout the day, I want to take a few minutes at the end of the session to point you in the right direction to continue learning about PostgreSQL and continue being successful when the time comes.

What We Won’t Cover Directly

I know a full day of training feels like we should be able to cover everything. I must keep reminding myself that some subjects are just too deep and specific for the first intro most users will have to this new (to them) database server.

After receiving some questions about the course recently, I changed the title to more clearly align with the description of the course (and the content outline above). That is, the course will more closely align with users that are database developers, architects, and “accidental DBAs”.

Therefore, this time around, we will not specifically cover some deeper-level topics that would be labeled more classic “DBA” tasks including:

  • More complex backup and recovery for enterprise installations
  • HA and replication setups
  • Upgrades from one version of Postgres to another
  • Security and access deep dive
  • Deeper database maintenance scenarios

I have no doubt that some questions and side discussions will hit on aspects of these topics, but we won’t do any hands-on work here specifically. Maybe that will be part two for next year. 😀

Please Join Us!!

I hope that this precon is helpful to many folks attending SUMMIT. Coupled with the other fantastic precon and 300+ regular sessions, there is so much to learn at the PASS Data Community SUMMIT this year as we have the opportunity to come together for the first time in three years.

Please join us, either virtually or in-person if you’re able! I have no doubt it will be the highlight of your year.

See you November 15-18!

PGSQL Phriday #001 Wrap-up and review!

As I sit here on October 7, 2022 watching PostgreSQL friends from all around the globe post contributions to the first ever PGSQL Phriday blogging event, I’m honestly pretty shocked… and very (very, very, very) grateful! While I have lots of ideas and love connecting with people to hear their stories, I wasn’t sure what to expect because, let’s face it, there are so many demands for our time and attention.

The fact that many folks have been supportive of this idea and contributed to this first event truly warms my heart. Thank you each for helping to get this ball rolling!

Recap: How good is the Postgres community at lying? 😉

Quick note to future hosts: You don’t have to give personal background on every contributor. That’s just how I think about these interactions. Feel free to just give a brief summary of the post!

Lætitia Avrot – Lætitia has been one of my go-to sources for learning psql better as I’ve grown in my PostgreSQL journey. (She has lots of other 💯 content and presentations too!) So, it comes as no surprise that for the inaugural PGSQL Phriday, Lætitia would give us some nuggets about the beloved command-line tool, with her own slight twist on the “two truths and a lie” format. 😂 Specifically, I really appreciated her second tip that points to repeatability when collaborating with others.

Lutz Horn – One thing that I’ve grown to love with the PostgreSQL community is the diverse avenues of communication. Lutz submitted his post through the pgsql-general email list and I am so thankful he reached out! Lutz’s post talked about three features of PostgreSQL (ON CONFLICT ... DO UPDATE, table inheritance, and CTEs) that he uses often… or hopes to use more often… in his work with PostgreSQL.

Grant Fritchey – I’ve known Grant’s work and teaching for many years in the SQL Server world. It’s been fun to see him learn more about PostgreSQL and relate it to his expertise with SQL Server, not that dissimilar to my (ongoing) journey these last 4+ years. Grant took this opportunity to talk about backing up and restoring databases in PostgreSQL, including the introduction of a new PostgreSQL command I’d never heard of… 😉

Andreas Scherbaum – Ads is well known to the PostgreSQL community and has been a consistent source of help and connection for me personally. He faithfully helps with multiple PostgreSQL conferences, teaches about multiple topics, and continues to post “PostgreSQL Person of the Week” interviews to help highlight members of the community. The first presentation I ever saw Andreas give was about PostgreSQL deployments with Ansible, so I was happy to see him provide some guidance on configuring dates, backups, and roles in PostgreSQL.

Andy Atkinson – I had the pleasure of meeting Andy at PGConf NYC 2021, in that magical window between Thanksgiving and the third week of December 2021 when a few in-person conferences were able to meet before things shut down again for a month because of COVID spikes. He’s an active Ruby on Rails developer and speaker, and his post looks at three ways to improve a Rails application that uses PostgreSQL as the database. If you’re a Rails developer, you should bookmark Andy’s site and Twitter account to learn more about developing great PostgreSQL apps on Rails.

Pat Wright – More and more folks in the U.S. contingent of PostgreSQL users have seen Pat become more involved in the community and conferences as the world continues to open up again. He’s an avid conference organizer in many communities and has such a great vision for helping the community grow. In his day job, Pat has to manage a lot of database infrastructure, including HA and replication. His post focuses on a few important settings to make sure replication is working well and what to look for when things aren’t going as expected.

Conclusion

As I said at the beginning, I count six blog posts as a huge success for this first PGSQL Phriday. Next month Andreas is going to host and will probably be asking for content around backups (although he’s totally free to change that!).

I hope you’ve learned something and will consider taking part in the future with your own blog post.

PGSQL Phriday #001 – Two truths and a lie about PostgreSQL

Welcome to the inaugural invite of #PGSQLPhriday, a new, monthly blogging event for the larger PostgreSQL community! Each month, a different member of the PostgreSQL community will act as the host, choose the topic, and write an invite post similar to this. Once the blog posts are contributed, the host will write a wrap-up of all the contributions. What a great way to share your knowledge with other PostgreSQL users!!

(The topic for this first event is explained a bit further 👇, after a brief detour to discuss the last-minute name change.)

Hold on, I thought this was #PSQLPhriday?

OK, you caught me. Probably one of the worst things to do on the inaugural invite is to change the event’s name. 😬 The truth is, since announcing the idea, a few different community members expressed concerns about “psql” (rather than “pg”) and “phriday” (rather than “Friday”). They’re all valid concerns, and it’s hard (for me) not to want to accommodate everyone.

Because this idea is taken from the SQL Server community and #TSQLTuesday, I wanted to give a nod to that long-running event, and “psql” (no “G”) seemed like a natural fit. But “psql” is a bit overloaded too because of the 💯 command line tool of the same name, so there was a small chance that name could cause some confusion.

And so this weekend, as I debated what to do, I realized that all of the official PostgreSQL mailing list names used “pgsql” as the prefix. Sticking with this same prefix still allows the event to (subtly) give a tip of the hat to the original TSQL Tuesday event while still pointing to the decades of hard work and collaboration that has occurred on the mailing lists, a place where so much history of Postgres is documented and available.

And so, #PGSQLPhriday it is. No exchanges this time, I promise.

Thanks for understanding!

Choosing a topic for the inaugural event

I’m not going to lie. It’s been challenging for me to pick the first topic to get this party started. I seem to gravitate to topics that are too broad (“Why do you love Postgres 15?”), or too narrow (“Tell us about your best contribution to PostgreSQL”). In the original announcement post, I even hinted that the topic would simply be “Why Postgres?”

I also realize that this first event comes at a unique time in the PostgreSQL lifecycle: many folks just attended PGConf NYC 2022, PostgreSQL 15 is being released on October 13, and PGConf EU is October 25-28 in Berlin. A lot is happening right now, and any of these special milestones could be inspiration for an interesting topic. (“Which PostgreSQL 15 feature are you most excited about?”, “What have you enjoyed most about attending in-person conferences again?”)

But as I contemplated the topic (for far too long…), I recalled one of the more unique talks I was able to attend at PGConf NYC given by Ilya Kosmodemiansky, “PostgreSQL Worst Practices.” In his presentation, Ilya provided insights into many configuration and development issues folks run into with PostgreSQL by admonishing us to do the “worst” practice. (ie. “Absolutely don’t take regular backups of production, but if you do, only use pg_dump”) It was a clever and humorous presentation that provided good laughs and great dialog about true best practices.

So that got me thinking about some clever ways we could share tried and true PostgreSQL development and configuration tips with one another. If you spend any time on Twitter, the mailing lists, or Postgres Slack, there is no shortage of questions about common configuration and development best practices. While some folks have written blog posts to address these common questions over the years, there is always room for others to contribute.

Your mission, should you choose to accept it…

For this first event, I want you to write a blog post that shares three of your go-to Postgres best practices, but write it in the form of the game Two truths and a lie.

For example, perhaps there are three configuration settings you always modify whenever setting up a new cluster. You could share why these settings are important to modify and how you find the correct values, but for one of them, use Ilya’s comical approach to teach us the correct way to set the value. (ie. “I never set shared_buffers to more than 128MB to conserve server resources.” 😱)

Maybe you have three go-to ways of monitoring your cluster or multiple SQL options to overcome a tricky query problem many PostgreSQL developers face. You’re limited only by your imagination.

And obviously, at the end of the blog, be sure to clearly identify which of the three tips was a lie and how someone should actually modify the setting or write that tricky SQL statement. 😉

How to contribute your blog post

There are only a few “rules” that we ask everyone to follow when writing a blog post to contribute each month.

  • Use “PGSQL Phriday #001”/”#PGSQLPhriday 001” in the title or first paragraph of your blog post. (the number will change each month)
  • Link back to this invite. (ie. “For #PGSQLPhriday 001, Ryan asked us to…”). If you don’t link back to the invite, you won’t be included in the summary list of blog posts.
  • Post your blog sometime on Friday, October 7, 2022. The world is a big place, so if it’s a little early or late… it’s probably still Phriday somewhere! 😃
  • Announce your blog post in at least one of the usual community channels (e.g., Twitter, the PostgreSQL Slack, or the pgsql-general mailing list).
  • Read other blog post contributions and encourage the authors. Remember, the goal is to increase community involvement and knowledge sharing.

I can’t wait to see what you contribute on Friday, October 7, 2022! 🎉🎉🎉

PGSQL Phriday: A Monthly Blog Event for the PostgreSQL Community

Not quite a year ago, I had the opportunity to give a keynote address at PGConf NYC 2021 as part of Timescale’s sponsorship. Thankfully the leadership at Timescale supported me addressing the topic of community and how, particularly in light of the upward trend of PostgreSQL adoption, investing in the growing community can help everyone involved.

In that talk, I drew on a few of my experiences as a SQL Server developer, specifically within the PASS and #sqlfamily community. Although I understand that each community has its ethos, I wondered aloud if some of the initiatives that had drawn me further into the SQL Server community could provide ideas on ways to engage the growing PostgreSQL user base.

One of those initiatives was focused on the monthly T-SQL Tuesday blogging event started by community members more than a decade ago. The long-running event is simple.

Each month a community member volunteers to host the event. They select a topic and “invite” anyone to participate through a short blog post addressing the topic. Then, on the first Tuesday of each month, anyone in the community can post their own blog about the topic, linking back to the original invitation blog. The main purpose of the monthly event is to provide a safe, open opportunity for anyone to blog with a group of community members. For many people, the biggest hurdle to writing down and sharing their experience with a specific technology is selecting topics… and believing that anyone cares about what they have to say. T-SQL Tuesday has started many successful blogs over the years because someone had a good prompt and others received encouraging feedback about what they shared. 

As more people got involved, the community grew, spreading the knowledge and support of the community across a larger, more close-knit group.

I don’t know about you, but that sounds 🔥!

What does the PostgreSQL community think?

In the 2022 State of PostgreSQL survey, short and long-form blog posts were among the top three ways users prefer to learn about PostgreSQL! Users want to learn from the experiences of others and how they overcome technical obstacles with PostgreSQL. This seems to echo a recent blog from SQL Server consultant Erik Darling about his frustration with technical books, specifically regarding readability and code examples. Blogs (and other online forums) often have the least amount of friction in unlocking reader engagement.

In short, there are a lot of people in the PostgreSQL community that are looking for the knowledge and experience you have. It just has to be written down and published. Most of us (or at least me) need that consistent prompt to keep going.

The Key Ingredient: Volunteer Hosts

Very little coordination is needed to run a monthly event like this. Folks in the community can contribute a blog post when they’re able and there’s no penalty for missing a month. If someone posts a few hours early or late, that’s OK. The desire is to foster and encourage more participation because we believe it will strengthen the PostgreSQL community.

The only thing the initiative needs is a volunteer host from the community, changing each month.

The responsibilities of the host are pretty simple:

  • Host blogs must be “syndicated” through Planet PostgreSQL. If yours is not, follow the process a few weeks ahead of time and reach out with any questions!
  • Choose the topic for the month and write a short invitation blog post about it.
  • On the Monday before the first Friday of the month (which is sometimes the end of the prior month!), publish your invitation blog post, which should then be syndicated through Planet PostgreSQL. Additionally, please send a reminder on social platforms that links to the invitation and includes the #pgsqlphriday hashtag.
    • At a minimum, post to Twitter, PostgreSQL Slack, and the `pgsql-general` email list
  • Within a few days of the monthly blog event (~the following Monday or Tuesday), compile links and brief reviews of the “submissions” and post the summary to your blog.
  • Ensure there is a new volunteer host for the following month.

The topics can cover anything about, or related to, PostgreSQL. Hosts might often prompt with technical topics, sometimes more general database/data questions, and still other prompts might focus on soft skills for a successful technology career. Example invite posts might include some of the following ideas:

Technical:

  • “Why you love extension x”
  • “Your favorite feature of PG15 and why”
  • “How logical replication saved your bacon”

General PostgreSQL/database:

  • “What is your favorite tool that’s not psql and why?”
  • “What are your favorite queries for monitoring database performance?”
  • “What feature did you learn about reading the documentation that changed your approach?”

Soft topics:

  • “How have you contributed to building the PostgreSQL community this past year?”
  • “What’s one skill every PostgreSQL developer should know or grow in this coming year?”
  • “What was the first talk you gave about PostgreSQL and what did you learn that would encourage others to give talks too?”

That doesn’t look too hard, does it? Do you think you’d be interested in participating or hosting?

Could the PostgreSQL community try #PGSQLPhriday?

What are we waiting for?

When I brought this idea up at PGConf NYC last year, several folks jumped on board with the idea. The fact that it’s taken almost a year for me to create this post and get the ball rolling is mostly due to large doses of imposter syndrome. (Sigh…)

But I’m more bullish on the PostgreSQL community than ever, and I’m ready to dig in and see how I can help more.

Therefore, I’m proposing that we start with the first blog event on Friday, October 7, 2022. This falls between PGConf US and PGConf EU, and hopefully, within a week or two of when PostgreSQL 15 is officially released.

The basic ground rules we’ll encourage everyone to follow are minimal:

  1. For the best audience reach, work to get your blog listed at Planet Postgres before posting your first PostgreSQL blog. Ensure your PostgreSQL blog adheres to the policies and then follow these steps.
  2. Post your blog any time on the first Friday of the month and link back to the host’s topic announcement blog post for reference. We’ll create a logo you can include and link back to the host post. More to come. (wink)
  3. Using your social platform of choice, mention your post and include the #pgsqlphriday hashtag. If the host misses your post in their wrap-up, assume that it simply got lost in the social media deluge and just reach out and give them a link and thank them for hosting! (smile)
  4. Be encouraging and respond in kind to other folks that contribute posts to the event!

Pretty easy… right?!

Ready… Set… Blog!

The volunteer hosts for the first few months are already lined up and listed below, along with names of community leaders that have agreed to host starting in 2023. If you’d like to lead a month in the future, reach out to me through Twitter, the PostgreSQL Slack, or email. I’ll do my best to keep things flowing as we get this party started!

  • October 7, 2022: Ryan Booz – “Why PostgreSQL?”
  • November 4, 2022: Andreas Scherbaum – “Upgrade PostgreSQL”
  • December 2, 2022: Pat Wright
  • January 6, 2023: Henrietta Dombrovskaya

Initial hosts for the first four PGSQL Phriday events!

Upcoming hosts: Ryan Lambert, Grant Fritchey, Michael Christofides

Be on the lookout for the first intro post on October 2! Until then – #LongLivePostgres 🐘

P.S. A special shoutout

Before signing off I wanted to make sure that I gave a shoutout to Shaun Thomas from EDB. When a few community members came up with the “PGSQL Phriday” name, we didn’t know that Shaun had already been blogging for a few years with a similar heading, “PG Phriday”. When I reached out to see how he would feel about us using the name, he was quick to say, “No problem! I’ll probably join in.”

Thanks for your willingness to let us try to grow this a little bit more, Shaun, and get more of the community involved.

The 2022 State of Postgres Survey is about to close! Take part and share your experience!

It’s been quite an exciting month at Timescale for a number of reasons, but one of the biggest initiatives we’ve tackled is the third annual State of PostgreSQL survey. With approximately 45 questions about the community experience with PostgreSQL, helpful tools, best practices, and more – there are so many great insights that will help to encourage and build a stronger community.

If you haven’t already, would you please take 10 minutes to complete the survey and add your voice? All of the data will be anonymized and provided freely to the entire community in a few weeks. Your additions will add even more value!

Although I can’t share any specific insights about the survey yet, there are a few things that have caught my attention to share with you and hopefully build a little anticipation. 😉

Community is Sweet! (and so is 🐝 beekeeping 🐝)

Wait, what? How is beekeeping related to the Postgres community and the yearly State of Postgres survey?

Well, first off, I’m an amateur beekeeper and if you are around me for much time (or watch my Twitter account), you’ll hear me talk about bees for sure. 

Timescalers meet in real life… and meet the Booz farm bees!

But, as responses have been coming in over the last three weeks, it’s been fascinating to see how folks use PostgreSQL – including some tools that I had never heard of before.

  • psql ✅
  • pgAdmin ✅
  • DBeaver ✅
  • IntelliJ IDEA ✅
  • Beekeeper Studio ⁉️

Sure enough, out of the many suggestions of tools that folks use, it was hard for this one not to grab my attention. Maintained by Matthew Rathbone, Beekeeper Studio is an IDE that he’s been working on for a few years. Although he’s not a beekeeper (😢), his early interaction with Apache Hive seems to have given birth to a lot of beekeeper (data) products. Fun to know!

To be honest, without the results of this survey, I don’t know if I would have heard of the tool (sorry Matthew!). But I love that others do know about it and wanted to share.

What about you – is there a specific tool you use a lot and love that others need to know about?

Languages, languages, languages!

Sometimes, I need things like this survey to remind me that the world is a bigger place than my experience. Sure, it seems that nearly every developer on the planet uses either Python, JavaScript, Go, or Java to access their Postgres data. I have to admit, however, that I was a bit surprised to see a few other languages mentioned by folks, including:

  • Delphi
  • Clojure
  • GDScript
  • Pascal (again ⁉️)
  • Perl

When I see responses like this, it reminds me that there is so much diversity within a community like ours – many of us trying to solve similar problems with different tools.

What about you? Have you been working on some research data using Pascal? 😉

Features, features, features!

For one final glimpse into some of the cool things we’ll learn about the Postgres community once all of the data is in – let’s talk features! Developers love the plethora of features that PostgreSQL has to offer and they use many of them in their daily work.

  • Common Table Expressions are 💯 and many respondents seem to use them often
  • FILTER clauses for aggregates, a feature that’s not available in every database, are more popular than I expected! 🔥
  • TABLESAMPLE is known by more people than I would expect, even with folks that only have a few years of experience using Postgres! 😎

In short, Postgres development is driven by community needs and contributions. It’s evident that most of the major features (even relatively new ones) are known and loved by many in the community. 🎉

Will you add your voice, too?

There are so many other interesting insights we’re going to be able to glean from this data. Your answers, and your voice, will help us finish strong with all of the best insights possible!

Take the survey today!

We’ll take a few weeks to crunch the numbers and provide a beautiful report of our findings, but we’ll also make all of the data available for anyone to slice and dice as they see fit. We think there’s a lot for the community as a whole to learn from this data and can’t wait to learn together!

Building community – PostgreSQL edition

Community.

I believe we’re all created for it. I believe most people thrive when they find it. In fact, for many people, community has been their lifeline over the last year.

Truth be told, however, community is hard to build and maintain.

When I jumped onto a new career path in September 2020 as a Developer Advocate with Timescale, I had dreamy ideas about how easy it was going to be to build community within the PostgreSQL world. I’ve been a database user for 20+ years, led a number of teams, had the fortune to be involved in some excellent tech communities over the years, and was growing in my rekindled fondness for Postgres. (Don’t get me wrong, the journey back wasn’t all roses.)

What I didn’t anticipate, however, was the lack of a Postgres “owner” to drive some cohesion within the community. Nor did I fully understand how the flexibility of the platform created so much diverse usage. Sometimes it’s easy to get too stuck in traditional database developer/architect patterns to see the bigger opportunities. You know, the old “can’t see the forest for the trees” problem.

With six months under my belt at Timescale, I’m learning… that I have a lot to learn about the Postgres community. 😃

While the umbrella that encompasses “PostgreSQL” is large, I’ve quickly noticed that I have a tendency to focus on one subset of the user base (new to the fold, new to SQL, looking to grow) because it’s where I’m most comfortable. Instead, I’d rather be thinking about it more holistically, working to understand where everyone started, where they are at currently, and where they are headed with Postgres.

That’s why I’m so excited about The State of PostgreSQL 2021 Survey that we just launched for everyone, far and wide, to participate in.

The State of PostgreSQL 2021 Survey

At Timescale, we believe that investing in and building up the Postgres community helps everyone. Understanding how developers and businesses use and interact with Postgres is key to strengthening the community at large.

https://twitter.com/TimescaleDB/status/1372163878516359171

The survey covers four key areas:

  1. Who are you as a developer or user?
  2. How did you start with Postgres?
  3. What tools and features do you use most?
  4. What communities are you a part of and how have they been helpful?

In total, the survey is ~35 questions and should take you about 10 minutes to complete. Most of the questions are multiple choice, while others allow you to provide free form answers. For example:

  • What is the main reason you chose to use Postgres?
  • How easily do you find it to connect with the community?
  • What tools do you use and recommend to other users?
  • What’s the size of your database deployment?
  • What cloud provider have you used Postgres with?
  • Are you in the PostgreSQL or Postgres camp??? (you’ll notice I flow easily between them both! 😉)

Open Source results for everyone!

There’s no agenda here for Timescale. Within the community there are any number of groups or companies that could do (or have done) similar things in the past. We love to see the growth and excitement around PostgreSQL and want to understand the community better so that we can find more ways to help.

But… we want to make sure everyone else is empowered to help too!

Once the survey closes, all results will be anonymized and made available for anyone to review and slice as they see fit.

So, what are you waiting for?

Let your voice be heard, helping us to provide valuable insights to the entire community. Let’s make the Postgres community stronger, together!

Amazon Babelfish: First Thoughts


Amazon Babelfish promises to help customers seamlessly move from SQL Server to PostgreSQL. I’m not excited about it, but probably not for the reasons you think!

Amazon re:Invent 2020 is in full swing and although some of their announcements impact my current career more than others (still waiting to hear about Timestream updates…), one item did blow up my Twitter feed a bit on Tuesday (12/1/2020).

AWS goes after Microsoft’s SQL Server with Babelfish

Techcrunch

Want more PostgreSQL? You just might like Babelfish

Matt Asay, AWS

My first reaction was, “Whoa… this is a big deal!” Having spent a lot of my career with SQL Server (and I’m still a big fan of the community), the jump back to PostgreSQL wasn’t as smooth as I wanted it to be. Two years later, things have changed dramatically and I’m feeling much more at home (and enthusiastic) about PostgreSQL and the community. So much so that I switched jobs and careers to join the awesome team at Timescale!

https://twitter.com/ryanbooz/status/1305694338417123328?s=20

So back to Babelfish. A week into thinking about the announcement, I’m a lot less excited. Let me explain.

The problem isn’t Babelfish

The impetus for creating the tool is clear for AWS: provide a way for customers to easily connect a SQL Server app to Aurora Postgres, saving big on licensing fees and reducing total cost of ownership. Assuming the tool is successful at some level, I’m sure it will provide a revenue boost for Amazon and some customers might (initially) feel a win. No harm, no foul on Amazon for leading the effort. Free markets, baby!

No matter how clever Babelfish is, however, I just can’t see how this is ultimately a win for SQL Server or PostgreSQL… or the developers that will ultimately need to support these “hybrid” apps.

(Quick Disclaimer: I wasn’t able to attend the live streamed deep dive session from re:Invent on Aurora Postgres and Babelfish (I’m waiting for the recording to post), so I’m sure there are details I’m overlooking, which may impact what I’m about to say. And, one other caveat… I have no doubt that the team spearheading this project is smart, hardworking, and doing everything they can to make Babelfish as awesome as possible.)

For some Amazon customers pricing may actually be reduced, at least initially, resulting in the promised financial win. Depending on the application and code, there might be a negligible performance impact (or maybe even an improvement??) Honestly, there might be plenty of apps that recreate the schema, flip the connection string, and cruise into the sunset – Piña Colada in hand.

Even if this all works better than expected, I still think there will be more lost than gained for one reason:

SQL illiteracy.

The problem is SQL illiteracy

I’ve been involved in software development for nearly 20 years and always with an affinity for databases and SQL in general. Even then, in a room of 100 SQL folks (SQL Server or PostgreSQL), my level of expertise is probably in the bottom 20%. (but my, how I do love learning more every day!) Still, in every job I’ve had, teaching and mentoring have been a focus that I’ve enjoyed and treasured.

While SQL may be the third most popular language in 2020, I haven’t run across many traditional developers that understand the nuances of each SQL platform, even one they’ve been “using” for their entire career. In most cases the culprit for this SQL illiteracy is an ORM. They are convenient and extremely helpful with mundane tasks like selecting lists of objects or saving data without writing your own abstraction layer. And this is often really good stuff because, after all, we’ve been taught over and over to abstract away the mundane. ORMs help us do that.

But beyond select * from... and INSERT INTO..., things almost always start to fall apart at some point. It’s nearly impossible to create a well-written query that takes advantage of a specific SQL dialect and its features when you’re trying to be all things to all people. Too often DBAs are called in to debug a slow query (some bad enough that they bring down servers under load), only to find a query created by an ORM. Who doesn’t love un-nesting 15 levels of sub-select queries when a simple (and more readable) version using a CTE, CROSS APPLY (SQL Server), or LATERAL JOIN (PostgreSQL) would have done the job? Most of us have been there.
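
To make that concrete, here’s a hedged sketch of the LATERAL pattern using a hypothetical customers/orders schema: the classic “latest order per customer” query that an ORM will often express as layers of correlated sub-selects.

-- Latest order for each customer, written with a lateral join instead of
-- nested, correlated sub-selects.
SELECT c.customer_id, c.name, o.order_id, o.created_at
FROM customers c
LEFT JOIN LATERAL (
    SELECT order_id, created_at
    FROM orders
    WHERE orders.customer_id = c.customer_id
    ORDER BY created_at DESC
    LIMIT 1
) o ON true;

(The SQL Server spelling of the same idea is OUTER APPLY.)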

Let’s not even talk about PIVOT and translating to crosstab! 😃

So why do ORMs come to mind when thinking about Babelfish? Because I think it’s going to magnify the SQL illiteracy of most developers making this transition. This is primarily a marketing effort and a way to bring down spend and make CFOs happy… initially.

Babelfish won’t help SQL Server developers understand the features of PostgreSQL better, or why portions of the app don’t work well after the transition (let alone how to improve them). They won’t understand the pluggable extension architecture that PostgreSQL thrives on. (trigram index anyone? Easy GIS queries? Time-series data?) It doesn’t point them to the best resources and trainers on the internet for PostgreSQL (in the last month alone, I’ve seen multiple requests that start, “Who is the Brent Ozar of Postgres?”)

There is also plenty of nuance that goes the other way, too… features that the SQL Server community spends a lot of time teaching about and building consulting careers around. Clustered indexes (vs. MVCC)? Columnstore indexes (“just use them!”)? Query Store (which AWS won’t replicate any time soon)? Built-in query plan visualization? In short, the SQL Server community has spent a lot of effort teaching and learning about the internals of the database query engine to create performant apps. Translating those features will take a lot of effort that probably doesn’t interest AWS at all.

Instead, Babelfish will simply return an error if it can’t correctly figure out what to do. I suspect a lot of developers will be getting those errors.

In the end, nobody will be happy

As a result, I think we’ll have at least three different problem sets when these apps start to implement Babelfish:

  1. ORMs creating complex, (already?) poorly performing queries for SQL Server being translated to (poorly) performing PostgreSQL queries. Lose/lose for everyone
  2. Targeted, hand-crafted queries that perform well on SQL Server that are hard (or impossible) to translate to PostgreSQL
  3. Stored Procedure-based applications that will require manual updates and still not take advantage of PostgreSQL features for improved query performance – features that SQL Server just doesn’t have. (👋 SQL Server folks, did you know you can tune PostgreSQL planner values for functions and stored procedures?!?! A small sketch follows this list.)
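
For the curious, here’s a minimal sketch of what I mean by tuning planner values, using a hypothetical set-returning function:

-- Tell the planner roughly how expensive this function is and how many rows
-- it typically returns (ROWS only applies to set-returning functions).
ALTER FUNCTION get_recent_orders(integer)
    COST 500
    ROWS 100;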

Putting it all together, in the short-term (at least) we’ll have a few initial outcomes:

  1. More revenue for AWS and likely higher long-term costs for users as the need to scale up PostgreSQL instances increases in order to deal with inefficient queries.
  2. Frustrated SQL Server developers that assume PostgreSQL is an inferior database
  3. A PostgreSQL community that thinks SQL Server developers are inferior and “stupid” for not understanding how things really work. (To be honest, a lot of SQL Server devs may be shocked that there’s not as much #SQLFamily community in the #Postgres world. Sorry.)

As I said, I haven’t seen the deep-dive from AWS yet and I’m not currently using Aurora Postgres to give it a test run, so these are preliminary thoughts.

Be the solution! Level-up your SQL

In the end, my greatest desire when I lead a team or teach about a technology is for developers to better understand the tools and technology they use, enabling them to create awesome applications. I get no greater joy for others (or myself honestly) than when new insights are gained and core technology is understood.

If you’re a SQL Server dev that might get caught in this transition, spend some time getting to know PostgreSQL a little better. While I haven’t quite found Brent’s replacement yet, there are a lot of good places to start learning. Start with my old post on PostgreSQL tooling (Part 1 & Part 2) and have a look at the responses to this Reddit post.

If you’re a PostgreSQL dev or instructor, have patience with the folks that will be thrust into learning the basics of Postgres for the first time. Remember that they probably didn’t choose this transition for themselves!

Hopefully, as more details emerge and the Babelfish implementation is better understood, my performance worries will prove misplaced. Hopefully it drives better understanding of both products – because they each have a lot to offer users depending on the use case.

Still, I’m afraid we’ll be missing an opportunity to level up on skills, drive deeper understanding with developers, and increase the ROI for all parties.

What are your thoughts?

Exchanging an OAuth Token in Power BI Custom Data Connectors

Custom Data Connectors can support multiple kinds of authentication. For most modern web apps, OAuth is the obvious choice, and the Power Query SDK, coupled with the authentication examples (here and here), makes it easy to set up.

Unfortunately, two apps that I have recently created Custom Data Connectors for were partway through an OAuth implementation. While the circumstances were unique to each application, I had to authenticate the user with the OAuth Identity Provider (IdP) and then exchange the access_token for one that the application could use. This isn’t an unusual OAuth “flow”, particularly when the application uses a custom payload in the access_token JWT for application functions.

Whether you agree with an exchange flow or not, I had to figure out how to make the Custom Data Connector work within these constraints.

To accomplish this, I had to interrupt the normal OAuth response flow and exchange that IdP-provided access_token for one that the application would accept as a Bearer token. I also needed to ensure that the refresh_token flow would continue to work when Power BI encountered an HTTP 401 (Unauthorized) response from an API, causing it to perform a refresh grant to retrieve a new access_token and keep data flowing without the user being interrupted.

It just wasn’t clear to me exactly where to do all of this work.

In other OAuth frameworks or SDKs, it’s usually pretty easy to see where you’d drop into the flow and make the exchange before moving ahead. Working from the GitHub authentication example, my mind wasn’t thinking functionally at the time (it’s still a challenge for me!), and it wasn’t clear that the response from the FinishLogin function was being consumed and stored by Power BI directly. Once I made that connection it was easy to accomplish what I needed.

The Basic OAuth Setup

I started with the examples from the articles I’ve linked to above, enabling OAuth and setting up each of the functions.

...    
Authentication = [
        OAuth = [
            StartLogin=StartLogin,
            FinishLogin=FinishLogin,
            Refresh=Refresh,
            Logout=Logout
        ]
...

Inside of the FinishLogin function there is a call to the TokenMethod which takes the code grant and exchanges it with the IdP for an access_token.

TokenMethod = (code) =>
    let
        Response = Web.Contents("https://auth.example.com/oauth2/token", [
            Content = Text.ToBinary(Uri.BuildQueryString([
                client_id = client_id,
                client_secret = client_secret,
                code = code,
                redirect_uri = redirect_uri])),
            Headers=[#"Content-type" = "application/x-www-form-urlencoded",
                                 #"Accept" = "application/json"]]),
        LoginResult = Json.Document(Response)
    in
        LoginResult ;

It took me a while to realize that the Json.Document(Response) was simply parsing the OAuth payload (id_token, access_token, and refresh_token) into a Power Query record that the connector/runtime was storing and using later. Therefore, I had to do my exchange somewhere in this function before returning the Record that Power BI needed.

Exchanging the Access Token

In order to exchange the trusted access_token, another function was needed to call the actual Exchange endpoint of each application. YMMV, but here’s what a basic function might look like.

ExchangeToken = (id_token as text) =>
    let
         NewToken = Web.Contents("https://auth.example.com/api/exchangeToken",[
            Content = Text.ToBinary("{id_token: '" & id_token & "'}"),
            Headers = [#"Content-Type"="application/json"]]),
         LoginResult = Json.Document(NewToken),
         AccessToken = LoginResult[token]
    in
        AccessToken

With that in place, I could modify the TokenMethod to do the exchange mid-stream for me like so.

TokenMethod = (code) =>
    let
        Response = Web.Contents("https://Github.com/login/oauth/access_token", [
            Content = Text.ToBinary(Uri.BuildQueryString([
                client_id = client_id,
                client_secret = client_secret,
                code = code,
                redirect_uri = redirect_uri])),
            Headers=[#"Content-type" = "application/x-www-form-urlencoded",
                                 #"Accept" = "application/json"]]),
        body = Json.Document(Response),
        result = if (Record.HasFields(body, {"error", "error_description"})) then 
                    error Error.Record(body[error], body[error_description], body)
                 else
                    // This is a code exchange, we expect a refresh_token
                    if (Record.HasFields(body, {"refresh_token"})) then 
                    [
                        refresh_token=body[refresh_token],
                        access_token = ExchangeToken(body[id_token])
                    ]
                    // This token was obtained by a Refresh request. No refresh_token             
                  else
                    [
                        access_token = ExchangeToken(body[id_token])
                    ]
    in
        result

The key was understanding that the result of the TokenMethod, which is what the FinishLogin method will ultimately respond with, needs to provide a Record with all of the fields necessary to continue the OAuth flow. In this case, simply replacing the access_token and refresh_token allowed Power BI to store this for later use.

Whenever the token expires, producing an HTTP 401 (Unauthorized) response, the RefreshToken logic kicks in, goes back through this same logic, and data keeps flowing.

Conclusion

Learning how to use Power Query to get data from REST APIs has been a fun (if not frustrating at times) learning experience. The documentation continues to improve and the Power Query team and community are very helpful. Each new challenge helps me better understand the building blocks they have provided and how to get to the next milestone.

Now, if I could just have something like a “finally” query that gets executed after an import to create the relationships I need and additional linking tables if needed. Wow, I’d be in heaven! 😉