PostgreSQL for the SQL Server DBA: Transaction Isolation and Table Bloat

This is part of a series I began in 2018, documenting my journey from SQL Server to PostgreSQL. As someone who's been through the process before, my hope is that these discussions will help mitigate some of the surprises and help you get up to speed faster.

Today, I’m going to dig into how PostgreSQL implements transaction isolation, look at the ongoing maintenance tasks it requires, and highlight another group of settings that you’ll need to become familiar with in your quest to keep PostgreSQL in top-notch shape.

Go grab a cup of coffee. You might need it. (BTW, I roast my own coffee for home consumption. If you’re a coffee bean nut, reach out. ☕️)

Quick note:

For most of this article you can assume that I’m talking primarily about READ COMMITTED transaction isolation. It’s the default isolation level in PostgreSQL and, for most applications, the predominant setting. The SQL standard specifies three other isolation levels; PostgreSQL accepts all of them, but READ UNCOMMITTED doesn’t behave the way it does in SQL Server. PostgreSQL treats it as READ COMMITTED, so dirty reads never occur.
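You can see this for yourself in psql. PostgreSQL accepts the READ UNCOMMITTED syntax but quietly runs the transaction at READ COMMITTED semantics:

```sql
-- PostgreSQL accepts the syntax, but internally the transaction
-- behaves as READ COMMITTED, so dirty reads never happen
BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SHOW transaction_isolation;
COMMIT;
```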

Transaction Isolation

All relational databases handle transaction isolation in some way, typically with an implementation of Multi-Version Concurrency Control (MVCC). Plain ol’ mainline SQL Server can use a form of MVCC (when snapshot isolation or READ COMMITTED SNAPSHOT is enabled), but the older row versions (retained for ongoing transactions) are stored in TempDB. Oracle and MySQL do something similar, storing (essentially) diffs of the modified data outside of the table that are merged at runtime for ongoing transactions that still need to see the older data.

Among these databases, PostgreSQL stands alone in the specific way MVCC is implemented. Rather than storing some form of the older data outside of the current table for transactions to query/merge/etc. at runtime, PostgreSQL always creates the newly modified row in-table alongside the existing, older versions that are still needed for running transactions. Yes, every UPDATE creates a new row of data in the table, even if you just change one column.

Your first reaction is probably something like 🤯. That happens a lot.

10k View of PostgreSQL MVCC

As a non-blocking approach to transaction isolation, it’s honestly a pretty simple concept to follow. PostgreSQL doesn’t have to manage multiple data structures or merge modified column data at runtime, for example. When a new row version is created (by INSERT or UPDATE), the ID of the creating transaction (xmin) is stored with the row so that older, still-running transactions don’t see the data, and vice versa for newer transactions. When a query requests rows, PostgreSQL (essentially) just has to check whether each row version is visible to the transaction running the query. If the row was created by a transaction that committed before the query’s snapshot was taken (and hasn’t since been deleted), ✅. If it was created by a later or still-uncommitted transaction, 🚫.
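You can watch this happen through the hidden system columns xmin (the creating transaction’s ID) and ctid (the row version’s physical location). A minimal sketch using a throwaway table:

```sql
CREATE TABLE mvcc_demo (id int PRIMARY KEY, val text);
INSERT INTO mvcc_demo VALUES (1, 'first');

-- note the xmin (creating transaction) and ctid (page, item) for the row
SELECT xmin, xmax, ctid, id, val FROM mvcc_demo;

UPDATE mvcc_demo SET val = 'second' WHERE id = 1;

-- the "same" row now shows a new xmin and a new ctid; the old version
-- still occupies its original slot until VACUUM reclaims the space
SELECT xmin, xmax, ctid, id, val FROM mvcc_demo;
```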

Simple, right?!? Except, nothing is simple.

To be fair, I’m being pretty hand-wavy in my discussion above about MVCC and transaction isolation. It’s a complicated, technical subject, including things like tuple and page freezing, transaction wraparound prevention, and more. The way PostgreSQL chose to implement it has been the topic of intense debate for years (with a particularly renewed focus recently!). For the sake of our discussion, however, you mainly need to remember a few points before we dig deeper into table bloat.

  1. PostgreSQL creates new rows for UPDATEs, always.
  2. The old rows eventually become “dead” and their space needs to be cleaned up.
  3. The VACUUM process frees up this space and keeps tables in tip-top shape (i.e., other magical stuff happens when VACUUM is run).
  4. Vacuuming a database and its tables is so important that the autovacuum daemon should always be running.
  5. As part of the autovacuum process, the table is (auto)analyzed to update statistics if the proper thresholds are met.

Each of those statements is probably worth a small blog post unto itself, just not today.
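If you want to see where your own tables stand on points 2–4, the pg_stat_user_tables view tracks dead tuple counts and the last time (auto)vacuum and (auto)analyze ran:

```sql
-- tables with the most dead tuples, plus when they were last cleaned up
SELECT relname, n_live_tup, n_dead_tup,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```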

Instead, let’s focus on one potential outcome when VACUUM isn’t functioning as efficiently as possible. Table bloat.

Table Bloat in PostgreSQL

I know I’m moving quickly, but hopefully we’re on the same page about how PostgreSQL currently implements MVCC (yes, the emphasis and pun were intended 😂). Depending on how active your application and database are, there’s a reasonable chance autovacuum will just do its job and you’ll rarely think about it. However, if your application has a couple of tables that are primarily update heavy, you need to pay attention.

Autovacuum is triggered based on a percentage of modified rows, set to 20% + 50 rows by default. Likewise, autoanalyze triggers at a 10% + 50 row threshold (because we all know that outdated statistics are bad!). Over time, particularly with a large, update-heavy table, this can become a problem without proper tuning.

| Total Rows | Autovacuum Calculation | Rows Modified Before autovacuum | Rows Modified Before autoanalyze |
|---|---|---|---|
| 10,000 | (10,000 * .20) + 50 | 2,050 | 1,050 |
| 100,000 | (100,000 * .20) + 50 | 20,050 | 10,050 |
| 1,000,000 | (1,000,000 * .20) + 50 | 200,050 | 100,050 |
| 10,000,000 | (10,000,000 * .20) + 50 | 2,000,050 | Any guesses?? 😉 |
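You don’t have to do this math by hand. A quick sketch (assuming the global defaults are in effect and your tables live in the public schema) computes each table’s effective trigger point:

```sql
-- estimated rows that must change before autovacuum triggers,
-- based on the current global threshold and scale factor
SELECT relname,
       reltuples::bigint AS est_rows,
       (current_setting('autovacuum_vacuum_threshold')::int
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * reltuples)::bigint
           AS rows_before_autovacuum
FROM pg_class
WHERE relkind = 'r'
  AND relnamespace = 'public'::regnamespace
ORDER BY reltuples DESC;
```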

As you can see, the larger a table gets, the more dead tuples can accumulate before VACUUM kicks in, which means more pages will be created on disk to store the additional rows. Eventually, when the table is vacuumed, dead rows will be removed and the space on the data pages is made available for reuse. However, the size of the table on disk won’t get smaller.

PostgreSQL doesn’t pre-create empty 8KB pages in blocks according to a setting like SQL Server’s Autogrowth. However, the overall principle is the same. Once a page exists, the space is reserved for that table. Yes, you can “shrink” the table footprint by other means, but none of them are quick and all require some kind of exclusive locking. In the end, if your root problem is misconfigured autovacuum settings, the table will continue to experience bloat over time. Fix the root cause first.
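For reference, the heavyweight option for reclaiming that reserved space looks like this (my_table is a placeholder). Note the exclusive lock, which is exactly why fixing the autovacuum settings is the better long-term answer:

```sql
-- rewrites the entire table into fresh pages and returns the freed space
-- to the operating system; takes an ACCESS EXCLUSIVE lock while it runs
VACUUM FULL my_table;

-- by contrast, a plain VACUUM marks dead space reusable
-- without shrinking the file on disk
VACUUM (VERBOSE) my_table;
```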

The screenshot below from SQL Monitor demonstrates this visually. The sample table currently takes 40GB of disk space, of which only 28GB is live tuples. The remaining 12GB of space is reusable (freed by VACUUM over time) and currently dead tuples. PostgreSQL is cleaning up those dead tuples over time (the dips in the dark blue area), but probably not quickly enough for the activity in this specific table. We’ll discuss later how to tune these settings appropriately.

Quick aside: SQL Server Table Statistics

For some of you, this percentage-based maintenance might sound similar to a SQL Server process? 🤔 UPDATING TABLE STATISTICS!!

Until SQL Server 2016, many of us dealt with query planning problems because statistics were only updated after a percentage of rows were modified. In fact, if you’ve been using SQL Server for a long time, I’d almost bet that you have a SQL Agent job that updates statistics across all tables (I 👀 those Ola Hallengren scripts running). Trace flag 2371, introduced in SQL Server 2008 R2 and made the default behavior in 2016, goes a long way toward improving statistics update frequency. However, it’s still not tunable per table.

Autovacuum and Autoanalyze Threshold Configuration

Modifying the default thresholds can be done at the server level through the postgresql.conf file or by passing in configuration at startup. If your database is hosted on a service like RDS, you can modify these through your instance parameter groups. Remember, however, that anything you change at the server level applies equally to all tables, both large and small.
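On a self-managed server, a global change can also be made with ALTER SYSTEM (which writes to postgresql.auto.conf) instead of editing postgresql.conf directly. A sketch:

```sql
-- lower the global vacuum trigger from 20% of rows to 5%
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;

-- reload the configuration so the change takes effect without a restart
SELECT pg_reload_conf();
```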

There are five main autovacuum/autoanalyze settings to consider on your first pass:

  • autovacuum – this should be on in every circumstance I can realistically think of. I do know of shops that turn it off and manage vacuuming manually, but they have the people and expertise to do it. If you want to set this to false, please consider the potential impact.
  • autovacuum_vacuum_threshold – the minimum number of rows in a table that must be modified before vacuum kicks in. Even if you set the percentage value to zero (don’t!!), this row-based threshold should be set to a non-zero value. In essence, this is a safety for smaller tables going through constant vacuuming when the benefit doesn’t really justify it. Consider adjusting it upward a little if your typical small table is a few hundred rows of lightly changing data.
  • autovacuum_vacuum_scale_factor – the percentage (expressed as a decimal fraction) of rows modified, added to autovacuum_vacuum_threshold, before autovacuum kicks in.
  • autovacuum_analyze_threshold – the minimum number of rows in a table that must be modified before analyze kicks in. Like the vacuum threshold, this is a safety that prevents smaller tables from being analyzed constantly with little benefit to query planning.
  • autovacuum_analyze_scale_factor – the percentage (expressed as a decimal fraction) of rows modified, added to autovacuum_analyze_threshold, before autoanalyze kicks in.
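To see what your server is currently using for all of these (and where each value came from), query pg_settings:

```sql
-- current autovacuum-related settings and where each value was set
SELECT name, setting, source
FROM pg_settings
WHERE name LIKE 'autovacuum%'
ORDER BY name;
```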

To be honest, it’s difficult to provide global guidance on how to set these values. There is no one-size-fits-all approach. You know your database best, including the data access and modification patterns. Think about your most active tables, large and small, and try to work out a reasonable common ground so that as data changes and new tables are created, they have sensible starting points.

The main thing to remember is that frequently seeing AUTOVACUUM or AUTOANALYZE in your tracked queries or logs doesn’t indicate these processes are running too often (in most cases). In years past, folks who didn’t fully understand the purpose of these processes assumed performance problems were being caused by the background processes running over and over. The opposite is usually true. If server resources are available and configured appropriately for your workload, the autovacuum process rarely runs too often. There are exceptions to every rule, but tending toward more frequent vacuums probably isn’t a bad thing for most databases.

Tuning Tables Individually

I often think of tuning autovacuum/autoanalyze in terms of keeping table statistics updated in SQL Server. Over the years, outdated statistics have often been a problem on modification heavy workloads. As I noted above, trace flag 2371 helped a lot, but there are still many DBAs that schedule SQL Agent jobs to update specific table statistics more frequently because there’s no way (as of SQL Server 2019) to tune individual tables. That’s honestly something I always wished I could do.

In PostgreSQL, we can set table-specific values for all of the settings above to ensure our global settings are reasonable in most cases, but when necessary, specific tables can be vacuumed and analyzed more frequently. From a DBA perspective, this is a huge win!

Modifying the settings for a table is done through the ALTER TABLE command. As an example, let’s consider a table with 500,000 rows, at least 20% of which are updated every day. Rather than waiting ~24 hours for this table to be vacuumed based on the default thresholds, we could tell PostgreSQL that we want it to be vacuumed after 5% of the rows (~25,000) are modified (which might work out to once every 4-5 hours). Likewise, because this table is insert and update heavy, we want to update statistics more frequently, too.

-- VACUUM at 5% modification:
ALTER TABLE t SET (autovacuum_vacuum_scale_factor = 0.05);

-- VACUUM threshold added to the percentage:
ALTER TABLE t SET (autovacuum_vacuum_threshold = 500);

-- ANALYZE at 5% modification:
ALTER TABLE t SET (autovacuum_analyze_scale_factor = 0.05);

-- ANALYZE threshold added to the percentage:
ALTER TABLE t SET (autovacuum_analyze_threshold = 500);
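Per-table overrides are stored in pg_class.reloptions, so it’s easy to audit which tables you’ve already tuned:

```sql
-- list ordinary tables carrying per-table storage/autovacuum overrides
SELECT relnamespace::regnamespace AS schemaname, relname, reloptions
FROM pg_class
WHERE relkind = 'r'
  AND reloptions IS NOT NULL;
```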

Identifying tables that might be candidates for individual tuning can be accomplished in a few ways. One simple query just looks at the ratio of live to dead tuples. In theory, the ratio shouldn’t ever climb much past the threshold values discussed above, but if you see a table with millions of dead tuples that account for 20% of the table, that might be a great place to start.

SELECT
   schemaname || '.' || relname AS tablename,
   round(n_dead_tup * 100.0 / n_live_tup, 2) AS pct_dead_tuples
FROM pg_stat_all_tables
WHERE n_live_tup > 0;

That’s a pretty subjective way to look at table bloat, however. The following query, provided courtesy of the fine folks at PGExperts, inspects more catalog information to provide a reasonable estimate of how much actual bloat, in MB, exists within each table. Tuning these tables individually will likely have a noticeable impact down the road on query performance and maintenance tasks.

-- new table bloat query
-- still needs work; is often off by +/- 20%
WITH constants AS (
    -- define some constants for sizes of things
    -- for reference down the query and easy maintenance
    SELECT current_setting('block_size')::numeric AS bs, 23 AS hdr, 8 AS ma
),
no_stats AS (
    -- screen out tables which have attributes
    -- without stats, such as JSON
    SELECT table_schema, table_name,
        n_live_tup::numeric as est_rows,
        pg_table_size(relid)::numeric as table_size
    FROM information_schema.columns
        JOIN pg_stat_user_tables as psut
            ON table_schema = psut.schemaname
            AND table_name = psut.relname
        LEFT OUTER JOIN pg_stats
            ON table_schema = pg_stats.schemaname
            AND table_name = pg_stats.tablename
            AND column_name = attname
    WHERE attname IS NULL
        AND table_schema NOT IN ('pg_catalog', 'information_schema')
    GROUP BY table_schema, table_name, relid, n_live_tup
),
null_headers AS (
    -- calculate null header sizes
    -- omitting tables which don't have complete stats
    -- and attributes which aren't visible
    SELECT
        hdr+1+(sum(case when null_frac <> 0 THEN 1 else 0 END)/8) as nullhdr,
        SUM((1-null_frac)*avg_width) as datawidth,
        MAX(null_frac) as maxfracsum,
        schemaname, tablename, hdr, ma, bs
    FROM pg_stats CROSS JOIN constants
        LEFT OUTER JOIN no_stats
            ON schemaname = no_stats.table_schema
            AND tablename = no_stats.table_name
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
        AND no_stats.table_name IS NULL
        AND EXISTS ( SELECT 1
            FROM information_schema.columns
                WHERE schemaname = columns.table_schema
                    AND tablename = columns.table_name )
    GROUP BY schemaname, tablename, hdr, ma, bs
),
data_headers AS (
    -- estimate header and row size
    SELECT
        ma, bs, hdr, schemaname, tablename,
        (datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
        (maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
    FROM null_headers
),
table_estimates AS (
    -- make estimates of how large the table should be
    -- based on row and page size
    SELECT schemaname, tablename, bs,
        reltuples::numeric as est_rows, relpages * bs as table_bytes,
        CEIL((reltuples*
            (datahdr + nullhdr2 + 4 + ma -
                (CASE WHEN datahdr%ma=0
                    THEN ma ELSE datahdr%ma END)
                )/(bs-20))) * bs AS expected_bytes,
        reltoastrelid
    FROM data_headers
        JOIN pg_class ON tablename = relname
        JOIN pg_namespace ON relnamespace = pg_namespace.oid
            AND schemaname = nspname
    WHERE pg_class.relkind = 'r'
),
estimates_with_toast AS (
    -- add in estimated TOAST table sizes
    -- estimate based on 4 toast tuples per page because we don't have
    -- anything better.  also append the no_stats tables
    SELECT schemaname, tablename,
        TRUE as can_estimate,
        est_rows,
        table_bytes + ( coalesce(toast.relpages, 0) * bs ) as table_bytes,
        expected_bytes + ( ceil( coalesce(toast.reltuples, 0) / 4 ) * bs ) as expected_bytes
    FROM table_estimates LEFT OUTER JOIN pg_class as toast
        ON table_estimates.reltoastrelid = toast.oid
            AND toast.relkind = 't'
),
table_estimates_plus AS (
    -- add some extra metadata to the table data
    -- and calculations to be reused
    -- including whether we can't estimate it
    -- or whether we think it might be compressed
    SELECT current_database() as databasename,
            schemaname, tablename, can_estimate,
            est_rows,
            CASE WHEN table_bytes > 0
                THEN table_bytes::NUMERIC
                ELSE NULL::NUMERIC END
                AS table_bytes,
            CASE WHEN expected_bytes > 0
                THEN expected_bytes::NUMERIC
                ELSE NULL::NUMERIC END
                AS expected_bytes,
            CASE WHEN expected_bytes > 0 AND table_bytes > 0
                AND expected_bytes <= table_bytes
                THEN (table_bytes - expected_bytes)::NUMERIC
                ELSE 0::NUMERIC END AS bloat_bytes
    FROM estimates_with_toast
    UNION ALL
    SELECT current_database() as databasename,
        table_schema, table_name, FALSE,
        est_rows, table_size,
        NULL::NUMERIC, NULL::NUMERIC
    FROM no_stats
),
bloat_data AS (
    -- do final math calculations and formatting
    SELECT databasename,
        schemaname, tablename, can_estimate, est_rows,
        table_bytes, round(table_bytes/(1024^2)::NUMERIC,3) as table_mb,
        expected_bytes, round(expected_bytes/(1024^2)::NUMERIC,3) as expected_mb,
        round(bloat_bytes*100/table_bytes) as pct_bloat,
        round(bloat_bytes/(1024::NUMERIC^2),2) as mb_bloat
    FROM table_estimates_plus
)
-- filter output for bloated tables
SELECT databasename, schemaname, tablename,
    pct_bloat, mb_bloat, table_mb, expected_mb
FROM bloat_data
-- this where clause defines which tables actually appear
-- in the bloat chart
-- example below filters for tables which are either 50%
-- bloated with at least 10MB of bloat, or more than 25%
-- bloated with at least 1GB of bloat
WHERE ( pct_bloat >= 50 AND mb_bloat >= 10 )
    OR ( pct_bloat >= 25 AND mb_bloat >= 1000 )
ORDER BY mb_bloat DESC;

Feel free to adjust the WHERE clause for your database and application, but definitely consider tuning specific tables when the bloat percentage is above 25% and amounts to a gigabyte of space or more.

When Vacuuming Doesn’t Keep Up

One last discussion about this automated processing. Sometimes, even after numerous attempts to tune autovacuum/autoanalyze settings for a database or table, you might find that things are still falling behind. Most of the time this points to the background workers not having enough resources to do the job effectively.

Successfully vacuuming your tables in a timely manner is a combination of maintenance_work_mem (or autovacuum_work_mem) and autovacuum_max_workers. Similar to my discussion in the previous article about work_mem, the maintenance workers need to have enough allocated memory to process the dead tuples in your biggest tables while not exhausting memory resources if PostgreSQL were to kick off the maximum number of workers. There are a lot of ways to consider setting these globally, but I appreciate the brief guidance provided by Amazon for the RDS environment. In essence: “If you have many small tables allocate more workers and less memory. If you have many large tables, do the opposite.” (slightly paraphrased)

The main issue at play is that you want to ensure PostgreSQL autovacuum can quickly vacuum dead tuples with each pass. Otherwise, some tuples will be left unvacuumed, adding to the volume of data for the next vacuum pass to deal with. Without proper management, the leftover rows can build up from vacuum to vacuum and PostgreSQL never fully catches up. Again, this reaches further than table bloat (freezing tuples and transaction ID management are more important in the long run), but table bloat is usually the easier high-level problem to identify for most people newer to PostgreSQL.
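One way to spot a vacuum that can’t keep up is pg_stat_progress_vacuum (available since PostgreSQL 9.6). If the same table repeatedly cycles through the “vacuuming indexes” phase, the workers likely don’t have enough memory to process all of its dead tuples in one pass:

```sql
-- vacuums currently in flight, with phase and progress counters
SELECT p.pid, p.relid::regclass AS table_name, p.phase,
       p.heap_blks_scanned, p.heap_blks_total, p.index_vacuum_count
FROM pg_stat_progress_vacuum p;
```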

Baby Steps to Better Performance

Table bloat isn’t the most important aspect of managing your new PostgreSQL database. However, I know so many applications track update-heavy data that improper tuning of autovacuum can begin to have far reaching impacts. In fact, that’s how I discovered an application architecture problem that manifested itself in PostgreSQL differently from SQL Server.

My main reason for bringing up table bloat as a way to identify autovacuum problems is that it’s usually easy to reason about when you know your database.

In my case, I knew that a specific table had at most 300-400MB worth of active data based on the schema. But as we started investigating why our database seemed to be so large, this one table topped out at over 5GB, not including indexes. It tracked hundreds of thousands of sensors and the application updated the “last_reported” column every 1-5 minutes in most cases. Every one of those updates created a new row on disk. More than almost any other table, it was often the source of slow queries and heavy I/O.

After tuning the autovacuum and autoanalyze settings for this table and modifying autovacuum_work_mem appropriately, we saw improvements over time even as more sensors were tracked and overall application usage increased.

If you’re starting a new project in PostgreSQL, or porting one from SQL Server, keep an eye on dead tuples and overall table growth over time. Once you’ve identified a problematic table, consider tuning those settings!
