<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Richard Yen</title>
    <description></description>
    <link>http://richyen.com/postgres/</link>
    <atom:link href="http://richyen.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 11 May 2026 21:00:28 +0000</pubDate>
    <lastBuildDate>Mon, 11 May 2026 21:00:28 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Making JSONB More Queryable with Generated Columns</title>
        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Over the past year, I’ve worked in a handful of contexts managing large volumes of data stored as JSONB in PostgreSQL. The scenario is common: users appreciate the flexibility of a document-oriented storage model, avoiding the need to predefine schemas or constantly migrate table structures as their data requirements evolve. JSONB documents can be deeply nested with numerous optional fields, and they scale to hundreds of kilobytes per record without issue. However, when the time comes to query these documents – filtering by user ID, event type, timestamps, or nested action properties – the queries can become slow and/or cumbersome to work with.&lt;/p&gt;

&lt;p&gt;The problem I want to address is: “How do we make searching JSONB data more efficient without breaking apart our documents or forcing it into columns in a relational database?” There are several approaches available in Postgres, each with different tradeoffs. I hope to shed some light on those approaches in this article.&lt;/p&gt;

&lt;h2 id=&quot;the-setup&quot;&gt;The Setup&lt;/h2&gt;

&lt;p&gt;I created a basic, no-frills table for the sake of this test:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BIGSERIAL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSONB&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s the document shape I used for testing and writing this post – it’s representative of the event logs and audit trails I’ve encountered: a mix of primitive fields, nested objects, and metadata that accumulates over time.&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
  &quot;user_id&quot;: 5234,
  &quot;event_type&quot;: &quot;event_42&quot;,
  &quot;timestamp&quot;: 1712341200,
  &quot;session_id&quot;: &quot;sess_abc123...&quot;,
  &quot;ip_address&quot;: &quot;192.168.1.42&quot;,
  &quot;action&quot;: {
    &quot;type&quot;: &quot;click&quot;,
    &quot;target_id&quot;: 87654,
    &quot;coordinates&quot;: {&quot;x&quot;: 512, &quot;y&quot;: 768},
    &quot;duration_ms&quot;: 1234
  },
  &quot;device&quot;: {
    &quot;type&quot;: &quot;mobile&quot;,
    &quot;os&quot;: &quot;iOS&quot;,
    &quot;screen_width&quot;: 1920,
    &quot;screen_height&quot;: 1080
  },
  &quot;performance&quot;: {
    &quot;page_load_time&quot;: 1234,
    &quot;dns_lookup&quot;: 123,
    &quot;tcp_connection&quot;: 234,
    &quot;server_response&quot;: 876
  },
  &quot;custom_fields&quot;: { ... }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The queries that matter are straightforward equality and range filters on known fields: find all events for a given user, filter by event type, narrow to a time window. With this setup, we’ll try to discern which kind of index actually serves the specific access pattern, and what the real cost of each option is.&lt;/p&gt;
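
&lt;p&gt;In plain extraction syntax, those access patterns look like this – without any index, each one is a sequential scan over the full table:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- all events for a given user
SELECT id FROM events WHERE cast(data-&amp;gt;&amp;gt;&apos;user_id&apos; AS INT) = 5234;

-- filter by event type
SELECT id FROM events WHERE data-&amp;gt;&amp;gt;&apos;event_type&apos; = &apos;event_42&apos;;

-- narrow to a time window
SELECT id FROM events WHERE cast(data-&amp;gt;&amp;gt;&apos;timestamp&apos; AS BIGINT) &amp;gt; 1700000000;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;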

&lt;p&gt;&lt;em&gt;All tests run on PostgreSQL 18.2 in Docker on an Apple M-series host. Tables contain 50,000 rows with realistic JSONB event documents. Query benchmarks run 20 times on a warm cache and report avg/min/max. Insert benchmarks run 5 trials of 5,000 rows each. Schema and scripts are included throughout so you can reproduce these results.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;three-approaches-to-indexing-jsonb&quot;&gt;Three Approaches to Indexing JSONB&lt;/h2&gt;

&lt;p&gt;There are three realistic options for this access pattern. Let’s look at each in turn – what it costs to build/maintain, what queries it actually helps, and where it falls down.&lt;/p&gt;

&lt;h3 id=&quot;option-1-gin-indexes&quot;&gt;Option 1: GIN Indexes&lt;/h3&gt;

&lt;p&gt;The natural candidate for indexing a JSONB column is a GIN (Generalized Inverted Index). After all, GIN indexes are specifically designed for composite values like JSON documents and full-text search vectors. With the default operator class, a GIN index stores an entry for every key and every value in every document, making the entire structure searchable:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_gin&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;USING&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- or the path-only variant:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_gin_path&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;USING&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jsonb_path_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As a refresher, I’ll mention that GIN is designed for containment and key existence operators (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?|&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&amp;amp;&lt;/code&gt;), not for equality on extracted fields:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- This query uses a GIN index correctly:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&amp;gt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;{&quot;user_id&quot;: 5234}&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- This query does NOT use a GIN index, even if one exists:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the containment form, the GIN index is used and the query is fast – but still slower than a B-tree on the same field, because GIN lookups involve more bookkeeping:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- GIN jsonb_ops + containment operator
Bitmap Index Scan on idx_gin
  Index Cond: (data @&amp;gt; &apos;{&quot;user_id&quot;: 5234}&apos;)

Planning Time: 1.173 ms  |  Execution Time: 1.295 ms

-- GIN jsonb_path_ops + containment operator
Bitmap Index Scan on idx_gin_path
  Index Cond: (data @&amp;gt; &apos;{&quot;user_id&quot;: 5234}&apos;)
Planning Time: 3.342 ms  |  Execution Time: 0.450 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb_path_ops&lt;/code&gt; variant is smaller and faster for containment queries, but it trades away support for key-existence operators (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?|&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&amp;amp;&lt;/code&gt;). Neither GIN variant can help with range predicates like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ts &amp;gt; 1700000000&lt;/code&gt; – those always fall through to a filter step.&lt;/p&gt;
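
&lt;p&gt;For reference, these are the key-existence forms that only the default &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb_ops&lt;/code&gt; class can serve – a quick sketch against the sample documents:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- does the document have a top-level &quot;session_id&quot; key?
SELECT id FROM events WHERE data ? &apos;session_id&apos;;

-- does it have any of these keys?
SELECT id FROM events WHERE data ?| array[&apos;session_id&apos;, &apos;ip_address&apos;];

-- does it have all of them?
SELECT id FROM events WHERE data ?&amp;amp; array[&apos;session_id&apos;, &apos;ip_address&apos;];
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;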

&lt;h3 id=&quot;option-2-expression-indexes&quot;&gt;Option 2: Expression Indexes&lt;/h3&gt;

&lt;p&gt;Postgres lets you create an index on an expression, including JSONB extraction:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_user_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is a B-tree index on the &lt;em&gt;result&lt;/em&gt; of evaluating the expression. When the query predicate matches the indexed expression exactly, and after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ANALYZE&lt;/code&gt; has gathered statistics on it, the planner will use it:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan on events
  Recheck Cond: ((data -&amp;gt;&amp;gt; &apos;user_id&apos;)::integer = 5234)
  Heap Blocks: exact=3
  -&amp;gt;  Bitmap Index Scan on idx_user_id
        Index Cond: ((data -&amp;gt;&amp;gt; &apos;user_id&apos;)::integer = 5234)
Planning Time: 1.168 ms  |  Execution Time: 0.341 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Execution time for this equality predicate is roughly on par with the GIN containment query.&lt;/p&gt;
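
&lt;p&gt;Covering the rest of the access pattern means one expression index per field – note the double parentheses that expression indexes require around a bare expression, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ANALYZE&lt;/code&gt; so the planner picks up statistics on the expressions:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE INDEX idx_event_type ON events ((data-&amp;gt;&amp;gt;&apos;event_type&apos;));
CREATE INDEX idx_ts ON events (cast(data-&amp;gt;&amp;gt;&apos;timestamp&apos; AS BIGINT));
ANALYZE events;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;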

&lt;h3 id=&quot;option-3-generated-columns&quot;&gt;Option 3: Generated Columns&lt;/h3&gt;

&lt;p&gt;Generated columns (available since PostgreSQL 12) let you extract JSONB values into regular typed columns at write time. The values are stored physically alongside the row and kept in sync automatically:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;         &lt;span class=&quot;n&quot;&gt;BIGSERIAL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;       &lt;span class=&quot;n&quot;&gt;JSONB&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt;    &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;TEXT&lt;/span&gt;   &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;event_type&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;         &lt;span class=&quot;nb&quot;&gt;BIGINT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;timestamp&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;BIGINT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;action&lt;/span&gt;     &lt;span class=&quot;nb&quot;&gt;TEXT&lt;/span&gt;   &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;action&apos;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;type&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_user_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_event_type&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_ts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_action&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;action&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Queries against generated columns are plain typed-column lookups. The planner sees them as regular B-tree columns and produces tight estimates:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan on events
  Recheck Cond: (user_id = 5234)
  Heap Blocks: exact=3
  -&amp;gt;  Bitmap Index Scan on idx_user_id
        Index Cond: (user_id = 5234)
Planning Time: 1.159 ms  |  Execution Time: 0.407 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You also get native support for range queries and composite indexes at no extra complexity – just combine columns as you normally would:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Indexed range query on generated timestamp column&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;event_42&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1700000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- Execution Time: 0.698 ms (vs 6.6 ms with GIN + post-filter)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;side-by-side-query-performance&quot;&gt;Side-by-Side: Query Performance&lt;/h2&gt;

&lt;p&gt;With all three approaches set up, here are the warm-cache query results averaged over 20 runs for an equality filter on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_id&lt;/code&gt;:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Approach&lt;/th&gt;
      &lt;th&gt;Avg (ms)&lt;/th&gt;
      &lt;th&gt;Min (ms)&lt;/th&gt;
      &lt;th&gt;Max (ms)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_ops + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;0.198&lt;/td&gt;
      &lt;td&gt;0.101&lt;/td&gt;
      &lt;td&gt;1.769&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_path_ops + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;0.197&lt;/td&gt;
      &lt;td&gt;0.032&lt;/td&gt;
      &lt;td&gt;3.115&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Expression index&lt;/td&gt;
      &lt;td&gt;0.106&lt;/td&gt;
      &lt;td&gt;0.018&lt;/td&gt;
      &lt;td&gt;1.705&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Generated column B-tree&lt;/td&gt;
      &lt;td&gt;0.112&lt;/td&gt;
      &lt;td&gt;0.016&lt;/td&gt;
      &lt;td&gt;1.839&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Expression indexes and generated columns perform very similarly for equality queries – both around 0.1 ms on a warm cache. That makes sense: the real work is done in the B-tree lookup, and both produce the same index structure. GIN with the correct &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt; operator is nearly as fast in PG 18.2 – still slightly slower than a B-tree for this access pattern, but the gap has narrowed. GIN lookups still require a recheck step that B-tree lookups avoid, and the variance remains notable: a GIN max of 3.1 ms vs a B-tree max of 1.8 ms on a warm cache.&lt;/p&gt;

&lt;p&gt;The more surprising result is what happens if the GIN index is present but the query is written with extraction-based equality:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- GIN index exists, but this query gets a seq scan:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- Execution Time: 47.935 ms (same as no index at all)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;GIN has no support for equality on an extracted value – its operator classes cover containment, key existence, and jsonpath matching, not the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;=&lt;/code&gt; comparison produced by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-&amp;gt;&amp;gt;&lt;/code&gt; extraction. This is by far the most common confusion teams run into with JSONB indexing.&lt;/p&gt;
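
&lt;p&gt;If the GIN index is the one you have, the fix is to phrase the predicate as containment rather than extraction – equivalent here as long as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_id&lt;/code&gt; is stored as a JSON number:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- rewritten into a form GIN can serve:
SELECT id FROM events WHERE data @&amp;gt; &apos;{&quot;user_id&quot;: 5234}&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;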

&lt;h2 id=&quot;the-full-cost-picture-storage-and-writes&quot;&gt;The Full Cost Picture: Storage and Writes&lt;/h2&gt;

&lt;h3 id=&quot;storage&quot;&gt;Storage&lt;/h3&gt;

&lt;p&gt;Here’s what the same 50,000 rows cost on disk under each approach:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Approach&lt;/th&gt;
      &lt;th&gt;Table size&lt;/th&gt;
      &lt;th&gt;Index size&lt;/th&gt;
      &lt;th&gt;Total&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Expression indexes (4)&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;3.5 MB&lt;/td&gt;
      &lt;td&gt;21 MB&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Generated columns + B-tree (4)&lt;/td&gt;
      &lt;td&gt;20 MB&lt;/td&gt;
      &lt;td&gt;3.5 MB&lt;/td&gt;
      &lt;td&gt;23 MB&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_path_ops&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;13 MB&lt;/td&gt;
      &lt;td&gt;31 MB&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_ops&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;36 MB&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Expression indexes and generated column B-tree indexes produce &lt;em&gt;identical&lt;/em&gt; index sizes for the same fields – this makes sense, since the index structures are the same; the only extra cost of generated columns is the 2 MB of additional stored column data in the table (~40 bytes per row for four typed columns). GIN indexes are substantially larger: 13–18 MB for a single index vs 3.5 MB for four targeted B-tree indexes. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb_path_ops&lt;/code&gt; variant is smaller because it only stores value hashes for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt; operator path, but it still dwarfs the targeted approach.&lt;/p&gt;

&lt;p&gt;One caveat: these numbers reflect documents with short keys and compact values. Documents with verbose key names, deeply nested structures, or large string values will inflate GIN indexes proportionally more – because GIN indexes every key path. B-tree and expression indexes are unaffected by document verbosity, since they only store the extracted value.&lt;/p&gt;
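
&lt;p&gt;These measurements come straight from the standard size functions, so they’re easy to reproduce on your own tables:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT pg_size_pretty(pg_table_size(&apos;events&apos;))          AS table_size,
       pg_size_pretty(pg_indexes_size(&apos;events&apos;))        AS index_size,
       pg_size_pretty(pg_total_relation_size(&apos;events&apos;)) AS total_size;

-- or per index:
SELECT pg_size_pretty(pg_relation_size(&apos;idx_gin&apos;));
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;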

&lt;h3 id=&quot;write-throughput&quot;&gt;Write Throughput&lt;/h3&gt;

&lt;p&gt;Here’s how 5 trials of 5,000 INSERTs each performed against a table already containing 50,000 rows:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Approach&lt;/th&gt;
      &lt;th&gt;Avg (ms)&lt;/th&gt;
      &lt;th&gt;Min (ms)&lt;/th&gt;
      &lt;th&gt;Max (ms)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Generated columns + B-tree (4)&lt;/td&gt;
      &lt;td&gt;157&lt;/td&gt;
      &lt;td&gt;91&lt;/td&gt;
      &lt;td&gt;317&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Expression indexes (4)&lt;/td&gt;
      &lt;td&gt;163&lt;/td&gt;
      &lt;td&gt;93&lt;/td&gt;
      &lt;td&gt;366&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_path_ops&lt;/td&gt;
      &lt;td&gt;171&lt;/td&gt;
      &lt;td&gt;73&lt;/td&gt;
      &lt;td&gt;408&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_ops&lt;/td&gt;
      &lt;td&gt;334&lt;/td&gt;
      &lt;td&gt;225&lt;/td&gt;
      &lt;td&gt;525&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Generated columns and expression indexes are very close in write cost, with generated columns slightly ahead on average, and GIN jsonb_path_ops is competitive with both. The default GIN jsonb_ops variant, however, is dramatically more expensive – roughly 2× slower than expression indexes or generated columns – because it must decompose the entire document into keys and values and insert an index entry for each one. The variance is also worth noting: a GIN jsonb_ops max of 525 ms vs 366 ms for expression indexes.&lt;/p&gt;
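
&lt;p&gt;The insert workload can be approximated with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;generate_series&lt;/code&gt; – a simplified sketch of the document shape, not the exact benchmark script:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- one trial: 5,000 synthetic event documents
INSERT INTO events (data)
SELECT jsonb_build_object(
         &apos;user_id&apos;,    (random() * 10000)::int,
         &apos;event_type&apos;, &apos;event_&apos; || (random() * 100)::int,
         &apos;timestamp&apos;,  1700000000 + g
       )
FROM generate_series(1, 5000) AS g;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;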

&lt;h2 id=&quot;choosing-the-right-approach&quot;&gt;Choosing the Right Approach&lt;/h2&gt;

&lt;p&gt;The benchmarks above tell a consistent story for workloads dominated by equality and range filters on a known set of fields:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Expression indexes&lt;/strong&gt; are the lowest-cost migration path. They add no schema structure, require no application changes to insert logic, and impose minimal write overhead. If your team already has a table in production and just needs to speed up a handful of known slow queries, a well-placed expression index is your first move. The catch: every query must exactly match the expression as written in the index definition, which can be fragile to maintain as codebases evolve.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Generated columns&lt;/strong&gt; take slightly more storage and impose more write overhead than expression indexes, but they offer something the others can’t: the extracted values become first-class columns. You can build composite indexes across them, reference them in views, expose them via ORMs, and sort or aggregate on them without embedding extraction logic everywhere. For new tables or for tables you’re willing to migrate, they’re the most maintainable long-term solution.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;&lt;strong&gt;GIN indexes&lt;/strong&gt; serve a different purpose. They’re the right tool when your query patterns are flexible or unknown – searching for the existence of a key, filtering on any field in an ad-hoc fashion, or supporting containment queries on arbitrarily-shaped documents. For those access patterns, they’re genuinely powerful and there’s no clean B-tree equivalent. But for consistent equality and range filters on known fields, they cost more in storage, impose higher write latency, and only accelerate their own operators (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt; and key existence, not &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;=&lt;/code&gt; on extracted values).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
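
&lt;p&gt;As a rough sketch of what each option looks like in DDL – assuming the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;events&lt;/code&gt; table and its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; column from the setup above; the index names are illustrative:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Expression index: no schema change, but queries must repeat the expression
CREATE INDEX idx_events_user_id_expr ON events (((data-&amp;gt;&amp;gt;&apos;user_id&apos;)::int));

-- Generated column: a first-class typed column derived from the JSONB
ALTER TABLE events
  ADD COLUMN user_id int GENERATED ALWAYS AS ((data-&amp;gt;&amp;gt;&apos;user_id&apos;)::int) STORED;
CREATE INDEX idx_events_user_id ON events (user_id);

-- GIN: one index covering ad-hoc containment and key-existence queries
CREATE INDEX idx_events_data_gin ON events USING GIN (data);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;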

&lt;p&gt;Here’s a rough decision guide:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Situation&lt;/th&gt;
      &lt;th&gt;Recommended approach&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Unknown or ad-hoc field queries&lt;/td&gt;
      &lt;td&gt;GIN (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;, key existence)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Known fields, few queries, no schema change&lt;/td&gt;
      &lt;td&gt;Expression index&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Known fields, high query volume, evolving codebase&lt;/td&gt;
      &lt;td&gt;Generated columns&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Known fields + range queries (e.g., timestamps)&lt;/td&gt;
      &lt;td&gt;Generated columns + composite B-tree&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mixed: some known fields + some ad-hoc&lt;/td&gt;
      &lt;td&gt;Generated columns + GIN (both)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;caveats-and-considerations&quot;&gt;Caveats and Considerations&lt;/h2&gt;

&lt;p&gt;Regardless of which approach you choose, a few things apply broadly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real win is making data typed and relational again.&lt;/strong&gt; Generated columns aren’t magic. The reason they (and expression indexes) outperform GIN for equality filters is that they produce typed scalar values with precise statistics, letting the planner make accurate row-count estimates and choose cheap comparison operations. JSONB is flexible but opaque; once you extract a field into a typed column or expression, Postgres can reason about it properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expression indexes require exact predicate matching.&lt;/strong&gt; An index on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(data-&amp;gt;&amp;gt;&apos;user_id&apos;)::int&lt;/code&gt; is only considered when the query’s predicate parses to the same expression. (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CAST(... AS INT)&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::int&lt;/code&gt; are interchangeable – Postgres normalizes both to the same parse tree – but a text comparison like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data-&amp;gt;&amp;gt;&apos;user_id&apos; = &apos;42&apos;&lt;/code&gt; is a different expression and won’t use the index.) Generated columns avoid this fragility – any query that references the column name will benefit.&lt;/p&gt;
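
&lt;p&gt;A quick illustration (the index and table names are hypothetical):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE INDEX idx_events_user_id_expr ON events (((data-&amp;gt;&amp;gt;&apos;user_id&apos;)::int));

-- Can use the index: the predicate parses to the indexed expression
SELECT * FROM events WHERE (data-&amp;gt;&amp;gt;&apos;user_id&apos;)::int = 42;

-- Cannot use it: a text comparison is a different expression
SELECT * FROM events WHERE data-&amp;gt;&amp;gt;&apos;user_id&apos; = &apos;42&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;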

&lt;p&gt;&lt;strong&gt;Generated column expressions must be immutable.&lt;/strong&gt; The expression cannot reference functions that depend on time, session state, or anything external. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOW()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CURRENT_USER&lt;/code&gt;, and similar functions are off-limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generated columns cannot be directly updated.&lt;/strong&gt; Their value is always derived from the source column. If you UPDATE the JSONB &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt;, the generated columns recompute automatically.&lt;/p&gt;
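
&lt;p&gt;For example – assuming a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_id&lt;/code&gt; generated column derived from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; (error message paraphrased):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;UPDATE events SET user_id = 99;
-- ERROR:  column &quot;user_id&quot; can only be updated to DEFAULT

-- Updating the JSONB recomputes the generated column automatically
UPDATE events SET data = jsonb_set(data, &apos;{user_id}&apos;, &apos;99&apos;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;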

&lt;p&gt;&lt;strong&gt;GIN maintenance overhead compounds on write-heavy tables.&lt;/strong&gt; GIN indexes build an internal pending list and flush it periodically (controlled by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gin_pending_list_limit&lt;/code&gt;). Under sustained write load, this flushing can cause the latency spikes visible in the benchmark max values above. B-tree indexes don’t have this mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;These benchmarks cover one dataset shape and one machine.&lt;/strong&gt; At much larger row counts (hundreds of millions), cache-miss behavior and index bloat will dominate – relative rankings should hold, but absolute numbers will differ. When in doubt, benchmark on your own data before committing to a migration.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;For workloads dominated by equality and range filters on a predictable set of JSONB fields, the data is clear: B-tree indexes on typed values – whether via expression indexes or generated columns – outperform GIN both on read latency and write throughput. GIN’s strength is flexibility, not speed for known-field access patterns; when you know exactly which fields you’ll filter on, a targeted B-tree beats the GIN every time.&lt;/p&gt;

&lt;p&gt;If you’re starting from scratch or are willing to migrate a table, generated columns are the most maintainable path. They make your frequently-queried fields easily accessible, eliminate JSONB extraction logic from your application’s query layer, and support composite indexes and range queries naturally. If you need to add indexing to an existing table without a schema change, expression indexes get you 90% of the way there with a fraction of the write overhead.&lt;/p&gt;

&lt;p&gt;GIN still belongs in your toolkit – but for the right job: ad-hoc containment searches, key-existence checks, and cases where the query patterns genuinely vary by document. For everything else, make your JSONB fields relational.&lt;/p&gt;
</description>
        <pubDate>Mon, 11 May 2026 06:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/05/11/generated_columns_jsonb.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/05/11/generated_columns_jsonb.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>jsonb</category>
        
        <category>generated</category>
        
        <category>columns</category>
        
        <category>indexing</category>
        
        <category>performance</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>Potential Consequences of Using Postgres as a Job Queue</title>
        <description>&lt;p&gt;&lt;em&gt;This post was originally published on the &lt;a href=&quot;https://techcommunity.microsoft.com/blog/adforpostgresql/potential-consequences-of-using-postgres-as-a-job-queue/4514332&quot;&gt;Microsoft Tech Community Blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;At small scale, using Postgres as a job queue is totally fine, and I’d even say it’s the right call.  Fewer moving parts, one less system to manage, ACID guarantees on your jobs.  What’s not to love?&lt;/p&gt;

&lt;p&gt;The problem is that “small scale” has a ceiling, and the ceiling is lower than most people expect.  When you’ve got thousands of concurrent workers hammering a jobs table with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt;, things start to behave in ways that aren’t obvious from the application layer.  CPU usage creeps up, vacuum can’t keep up, and in the wait event stats you start seeing ominous entries like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LWLock:MultiXactOffsetSLRU&lt;/code&gt; stacking up across many backends.&lt;/p&gt;

&lt;p&gt;This pattern has tripped up teams more than a few times, and it usually plays out the same way: everything works fine in dev and staging, then goes off a cliff in production once the concurrency gets real.  So let’s dig into why this happens, and what the alternatives look like.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-typical-pattern&quot;&gt;The Typical Pattern&lt;/h2&gt;

&lt;p&gt;When using Postgres as a job queue, the standard approach looks something like this:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;         &lt;span class=&quot;n&quot;&gt;bigserial&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;     &lt;span class=&quot;nb&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DEFAULT&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;payload&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;jsonb&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timestamptz&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DEFAULT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;locked_by&lt;/span&gt;  &lt;span class=&quot;nb&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;locked_at&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;timestamptz&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_job_queue_status&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Workers grab jobs with:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;UPDATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;processing&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;locked_by&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;worker-42&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;locked_at&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;FOR&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UPDATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SKIP&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LOCKED&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
 &lt;span class=&quot;n&quot;&gt;RETURNING&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And then mark them done:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;UPDATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;completed&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Some users may &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt; the row entirely.  Either way, the lifecycle is: insert, lock-and-update, update-or-delete.  Repeated thousands of times per second.&lt;/p&gt;

&lt;p&gt;At low concurrency, this works very smoothly.  &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; means workers don’t block each other waiting for the same row.  Postgres handles the locking, visibility, and ordering.  It’s elegant.&lt;/p&gt;

&lt;p&gt;So where does it break?&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-multixact-slru-problem&quot;&gt;The MultiXact SLRU Problem&lt;/h2&gt;

&lt;p&gt;When multiple transactions hold locks on the same row, Postgres stores the set of lockers as a MultiXact ID – a pointer into a side structure under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_multixact/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt;, users might think MultiXacts aren’t involved – after all, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; is supposed to avoid contention.  But in practice, with many concurrent workers all racing to lock rows, there are brief windows where multiple transactions reference the same row before one of them “wins” and the others skip.  If you combine this with any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR SHARE&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR KEY SHARE&lt;/code&gt; locks (which are commonly created implicitly by foreign key checks), MultiXact IDs start accumulating quickly.&lt;/p&gt;

&lt;p&gt;The MultiXact data lives in SLRU buffers (Simple Least Recently Used) – a small, fixed-size shared memory cache.  When backends need to read or write MultiXact data, they acquire LWLocks to access these buffers.  Under high concurrency, this becomes a bottleneck:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event_type | wait_event
-----------------+-------------------
LWLock          | MultiXactMemberSLRU
LWLock          | MultiXactOffsetSLRU
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You’ll see dozens or hundreds of backends piled up on these waits.  The SLRU cache is small (by design – it’s a fixed number of pages in shared memory), and when the working set of MultiXact lookups exceeds what fits in the cache, you get constant eviction and re-reads from disk.  Every lock acquisition and release on a job row potentially triggers a MultiXact SLRU lookup, and at thousands of concurrent sessions, those lookups serialize on LWLocks.&lt;/p&gt;

&lt;p&gt;The result: CPU gets pegged, throughput collapses, and latency spikes – not because the queries are expensive, but because the locking infrastructure itself is overwhelmed.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;bloat-the-silent-killer&quot;&gt;Bloat: The Silent Killer&lt;/h2&gt;

&lt;p&gt;The other side of this coin is table and index bloat.  Every job row goes through multiple updates (and possibly a delete), and each of those operations creates a new tuple version in the heap.  The old versions stick around until &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VACUUM&lt;/code&gt; cleans them up.&lt;/p&gt;

&lt;p&gt;On a busy job queue table:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Dead tuples accumulate faster than autovacuum can clean them.&lt;/strong&gt;  By the time autovacuum finishes one pass, tens of thousands of new dead tuples have appeared.  The table grows and grows.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Index bloat compounds the problem.&lt;/strong&gt;  Every index on the table also accumulates dead entries.  The partial index on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;status = &apos;pending&apos;&lt;/code&gt; gets thrashed especially hard, since rows constantly enter and leave that condition.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scans get slower.&lt;/strong&gt;  As the table bloats, both sequential and index scans do more I/O because the live rows are spread thinly across the heap pages.  Vacuum can return space to the operating system only by truncating empty pages at the end of the table; space freed in the middle of the table is merely marked reusable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Job queue tables can grow to tens of gigabytes while the actual “live” data is only a few megabytes.  That bloat makes everything slower: scans, vacuum, even &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_dump&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can mitigate this by running vacuum more aggressively (lower &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_vacuum_scale_factor&lt;/code&gt;, higher &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_vacuum_cost_limit&lt;/code&gt;), or by partitioning the table and dropping old partitions.  But at some point, you’re fighting the fundamental mismatch between MVCC’s design goals and the write pattern of a job queue.&lt;/p&gt;
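
&lt;p&gt;For reference, per-table overrides might look like this (the values are illustrative, not recommendations):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Vacuum after ~1% of rows are dead instead of the 20% default,
-- and let each vacuum pass do more work before sleeping
ALTER TABLE job_queue SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_cost_limit   = 2000
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;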

&lt;hr /&gt;

&lt;h2 id=&quot;cpu-and-lock-overhead&quot;&gt;CPU and Lock Overhead&lt;/h2&gt;

&lt;p&gt;Beyond the SLRU contention and bloat, there’s just the raw overhead of using Postgres’s full transactional machinery for what is essentially a FIFO dispatch operation:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Every lock/unlock is a full WAL-logged transaction.&lt;/strong&gt;  Grabbing a job writes WAL.  Marking it complete writes WAL.  Deleting it writes WAL.  On a system processing thousands of jobs per second, the WAL volume from the job queue alone can saturate your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wal_writer&lt;/code&gt; and checkpoint processes.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; still touches rows.&lt;/strong&gt;  The name suggests rows are skipped, but Postgres still has to &lt;em&gt;find&lt;/em&gt; them, check their lock status, and move on.  With high concurrency, many workers end up scanning past the same locked rows before finding one they can claim.  This is wasted CPU.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Snapshot management overhead also becomes an issue.&lt;/strong&gt;  Each transaction needs a consistent snapshot, and with thousands of concurrent transactions, the ProcArray (the structure that tracks active transactions) becomes a contention point itself.  You might see &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LWLock:ProcArrayLock&lt;/code&gt; waits alongside the MultiXact ones.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Vacuum contention.&lt;/strong&gt;  While vacuum is cleaning up dead tuples, it needs locks too.  On a table under constant write pressure, vacuum can interfere with the workers and vice versa.  I’ve seen systems where disabling autovacuum on the job queue table improved throughput in the short term.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;better-alternatives&quot;&gt;Better Alternatives&lt;/h2&gt;

&lt;p&gt;So what should you use instead?  It depends on your requirements, but there are several options that handle high-throughput job dispatch more gracefully than a Postgres table.&lt;/p&gt;

&lt;h3 id=&quot;advisory-locks-staying-in-postgres&quot;&gt;Advisory Locks (Staying in Postgres)&lt;/h3&gt;

&lt;p&gt;If you want to stay within Postgres and avoid adding infrastructure, advisory locks are worth considering for certain queue patterns.  Instead of locking rows, you lock on an abstract numeric key:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Worker tries to acquire a lock on the job ID&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pg_try_advisory_lock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;claimed&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Advisory locks are lightweight – they don’t touch the heap, don’t create MultiXact entries, and don’t generate dead tuples.  They live entirely in shared memory.  The trade-off is that you lose the atomicity of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR UPDATE SKIP LOCKED&lt;/code&gt;: you need to handle the case where a lock is acquired but the job processing fails, and you need to release the lock explicitly (or rely on session-end cleanup).&lt;/p&gt;
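
&lt;p&gt;Releasing is a single call with the same key (the job ID here is illustrative):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- When processing finishes (or fails), release the lock explicitly;
-- session-level advisory locks are otherwise held until disconnect
SELECT pg_advisory_unlock(42);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;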

&lt;p&gt;This approach works well when the queue depth is manageable and you want to avoid the MVCC overhead.  But it’s still Postgres, so you’re still subject to connection limits, ProcArray overhead, and general resource contention at very high session counts.&lt;/p&gt;

&lt;h3 id=&quot;pgq-skytools&quot;&gt;pgq (Skytools)&lt;/h3&gt;

&lt;p&gt;pgq is purpose-built for exactly this problem.  It’s a queue implementation that sits inside Postgres but uses a batching model that avoids most of the row-level locking and MVCC pitfalls.  Events are written to a queue table, but consumers read them in batches and the queue maintenance is done via a ticker process that manages rotation.&lt;/p&gt;

&lt;p&gt;The key advantages:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;No row-level contention.  Consumers don’t lock individual rows.&lt;/li&gt;
  &lt;li&gt;Built-in batch processing.  Events are consumed in chunks, reducing transaction overhead.&lt;/li&gt;
  &lt;li&gt;Efficient cleanup.  Old events are rotated out rather than vacuumed row-by-row.&lt;/li&gt;
&lt;/ul&gt;
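
&lt;p&gt;A consumer loop sketch – the function names come from the Skytools &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgq&lt;/code&gt; schema, while the queue and consumer names are illustrative:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- one-time setup
SELECT pgq.create_queue(&apos;jobs&apos;);
SELECT pgq.register_consumer(&apos;jobs&apos;, &apos;worker_1&apos;);

-- main loop: grab a batch, process its events, then mark the batch done
SELECT pgq.next_batch(&apos;jobs&apos;, &apos;worker_1&apos;);  -- returns a batch_id, or NULL if nothing new
SELECT * FROM pgq.get_batch_events(:batch_id);
SELECT pgq.finish_batch(:batch_id);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;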

&lt;p&gt;The downside is that pgq is not as actively maintained as it once was, and it adds operational complexity (the ticker daemon, consumer registration, etc.).  But for teams already deep in the Postgres ecosystem, it’s a battle-tested option.&lt;/p&gt;

&lt;h3 id=&quot;pgque&quot;&gt;PgQue&lt;/h3&gt;

&lt;p&gt;Coincidentally, during the writing of this post, &lt;a href=&quot;https://github.com/NikolayS/pgque&quot;&gt;Nikolay Samokhvalov has built PgQue&lt;/a&gt;, which is a derivative of pgq.  Like pgq, it sits inside Postgres, but ships as a single SQL file – no C extension and no external daemon – making it deployable on managed services like RDS, Aurora, Cloud SQL, AlloyDB, Supabase, and Neon.  Producers &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; events into rotating event tables (recycled via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TRUNCATE&lt;/code&gt; instead of row-by-row deletion), and consumers read batches by diffing two &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_snapshot&lt;/code&gt; values captured by a periodic ticker – so the hot path contains zero &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt;s, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt;s, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt;, and therefore produces no dead tuples on the event tables.  For a deeper dive into the algorithm, see &lt;a href=&quot;https://thebuild.com/blog/2026/05/03/pgque-two-snapshots-and-a-diff/&quot;&gt;Christophe Pettus’s writeup&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;redis&quot;&gt;Redis&lt;/h3&gt;

&lt;p&gt;For many teams, Redis is the natural choice for job queues.  Using Redis lists (BRPOPLPUSH or the Streams API), you get:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Sub-millisecond dispatch latency.  No disk I/O, no MVCC, no vacuum.&lt;/li&gt;
  &lt;li&gt;Atomic pop operations.  Workers grab jobs without any locking protocol.&lt;/li&gt;
  &lt;li&gt;Simple scaling.  Redis handles thousands of concurrent consumers trivially.&lt;/li&gt;
&lt;/ul&gt;
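
&lt;p&gt;The list-based “reliable queue” pattern looks roughly like this (key names and payload are illustrative; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BLMOVE&lt;/code&gt; supersedes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BRPOPLPUSH&lt;/code&gt; as of Redis 6.2):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# producer enqueues a job
LPUSH jobs &apos;{&quot;id&quot;: 7, &quot;task&quot;: &quot;send_email&quot;}&apos;

# worker atomically moves a job to a processing list, blocking until one arrives
BLMOVE jobs processing RIGHT LEFT 0

# worker acknowledges after success by removing the entry
LREM processing 1 &apos;{&quot;id&quot;: 7, &quot;task&quot;: &quot;send_email&quot;}&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;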

&lt;p&gt;The trade-off is durability.  Redis can persist to disk, but it’s not ACID.  If Redis crashes between a pop and the job completing, you might lose or duplicate work (though Redis Streams with consumer groups mitigate this significantly).  For most job queue use cases, at-least-once delivery is acceptable, and Redis does that well.&lt;/p&gt;

&lt;h3 id=&quot;kafka&quot;&gt;Kafka&lt;/h3&gt;

&lt;p&gt;For truly high-throughput, distributed workloads, Apache Kafka is the heavyweight option.  Kafka partitions give you parallel consumption with ordering guarantees per partition, durable storage, and replay capability.  It’s the right tool when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;You need to process thousands of events per second&lt;/li&gt;
  &lt;li&gt;Multiple consumers need to read the same events&lt;/li&gt;
  &lt;li&gt;You want event replay or audit trails&lt;/li&gt;
  &lt;li&gt;Your architecture is already event-driven&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operational overhead is nontrivial – ZooKeeper (or KRaft), brokers, topic management, consumer group coordination.  But for teams already running Kafka for other reasons, adding a job queue topic is practically free.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;choosing-the-right-tool&quot;&gt;Choosing the Right Tool&lt;/h2&gt;

&lt;p&gt;Here’s a rough decision guide:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Scenario&lt;/th&gt;
      &lt;th&gt;Recommendation&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Under 100 concurrent workers, simple jobs&lt;/td&gt;
      &lt;td&gt;Postgres with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; is fine&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Moderate concurrency, want to stay in Postgres&lt;/td&gt;
      &lt;td&gt;Advisory locks or pgq&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;High throughput, low-latency dispatch&lt;/td&gt;
      &lt;td&gt;Redis (Lists or Streams)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Massive scale, distributed, event replay&lt;/td&gt;
      &lt;td&gt;Kafka&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Many teams that start with Postgres (reasonably) hit scaling problems and then try to fix Postgres rather than recognizing that the workload has outgrown the tool.  They throw more autovacuum workers at it, increase &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_connections&lt;/code&gt;, add connection poolers – all of which help at the margins, but don’t address the fundamental issue: Postgres’s MVCC and locking machinery wasn’t designed for this access pattern at high concurrency.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Postgres is great, but it can’t be the best tool for every job.  Using it as a job queue is a perfectly valid choice when your scale is modest.  But when you’re running thousands of concurrent workers, the combination of MultiXact SLRU contention, heap bloat, vacuum pressure, and raw locking overhead will eventually push you toward a purpose-built solution.&lt;/p&gt;

&lt;p&gt;The good news is that you don’t have to rip out everything.  Advisory locks can buy you headroom without adding infrastructure.  Redis can handle dispatch while Postgres keeps owning the data.  And if you’re already using Kafka, a job topic is a natural fit.  Take your pick – there are many queueing options out there!&lt;/p&gt;
</description>
        <pubDate>Mon, 04 May 2026 06:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/05/04/postgres_job_queue.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/05/04/postgres_job_queue.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>scaling</category>
        
        <category>job-queue</category>
        
        <category>multixact</category>
        
        <category>lwlock</category>
        
        <category>advisory-locks</category>
        
        <category>redis</category>
        
        <category>kafka</category>
        
        <category>pgq</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>Understanding Bitmap Heap Scans in PostgreSQL</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;When people first start reading PostgreSQL execution plans, they quickly learn a few common scan types: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Seq Scan&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Index Scan&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Index Only Scan&lt;/code&gt;.  But eventually another one appears that is less obvious: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bitmap Heap Scan&lt;/code&gt;, which is almost always accompanied by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bitmap Index Scan&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At first glance, it sounds like two scans on the same table – surely an inefficient choice?  But bitmap scans are actually one of the planner’s most practical tools for balancing random I/O against sequential access.  Understanding how they work can make execution plans much easier to interpret, so we’ll dive into that a little bit today.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;the-basic-idea&quot;&gt;The Basic Idea&lt;/h1&gt;

&lt;p&gt;A bitmap scan is a two-step process:&lt;/p&gt;

&lt;p&gt;Step 1: Build a bitmap of matching rows using one or more indexes.&lt;/p&gt;

&lt;p&gt;Step 2: Visit the heap pages containing those rows referenced in the bitmap.&lt;/p&gt;

&lt;p&gt;In an execution plan this usually appears as:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan on orders
-&amp;gt; Bitmap Index Scan on orders_customer_id_idx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The important part is that the index lookup and heap access are separated.  This separation lets the executor sort the heap accesses into physical order before visiting any pages, and it means execution plans report costs and actuals for the index phase and the heap phase as separate nodes.&lt;/p&gt;
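
&lt;p&gt;To make this concrete, here is roughly what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt; output for a bitmap scan looks like – the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orders&lt;/code&gt; table and all of the numbers here are made up for illustration:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

Bitmap Heap Scan on orders  (cost=12.17..857.41 rows=450 width=97)
                            (actual time=0.210..1.890 rows=438 loops=1)
  Recheck Cond: (customer_id = 42)
  Heap Blocks: exact=301
  -&amp;gt;  Bitmap Index Scan on orders_customer_id_idx
        (cost=0.00..12.06 rows=450 width=0)
        (actual time=0.131..0.131 rows=438 loops=1)
        Index Cond: (customer_id = 42)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice that the index phase and the heap phase each report their own timing and row counts.&lt;/p&gt;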

&lt;hr /&gt;

&lt;h1 id=&quot;why-not-just-use-an-index-scan&quot;&gt;Why Not Just Use an Index Scan?&lt;/h1&gt;

&lt;p&gt;With a normal index scan, the query executor does something like this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Find a matching entry in the index&lt;/li&gt;
  &lt;li&gt;Jump to the heap page&lt;/li&gt;
  &lt;li&gt;Fetch the row&lt;/li&gt;
  &lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the query returns only a few rows, this works well.  But if the query returns thousands of rows scattered across the table, the database ends up doing many random heap fetches.  Random I/O is expensive, and this scattered access pattern is exactly the problem bitmap scans are designed to solve.&lt;/p&gt;
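
&lt;p&gt;If you want to see the difference on your own data, you can disable bitmap scans for a session and compare plans – a diagnostic trick only, not something to set in production (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orders&lt;/code&gt; table is again hypothetical):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

SET enable_bitmapscan = off;
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

RESET enable_bitmapscan;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;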

&lt;hr /&gt;

&lt;h1 id=&quot;how-the-bitmap-is-built&quot;&gt;How the Bitmap Is Built&lt;/h1&gt;

&lt;p&gt;During the Bitmap Index Scan phase, the executor does not immediately fetch rows.  Instead it records which heap pages contain matching rows.  Conceptually, the structure looks like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Page 101 -&amp;gt; rows 2, 7
Page 205 -&amp;gt; rows 1, 3, 8
Page 410 -&amp;gt; row 5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These page references are stored as a bitmap structure in memory.  Once the bitmap is complete, the executor can visit heap pages in physical order rather than jumping around randomly.  Visiting heap pages in physical order means less random I/O and therefore less latency.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;multiple-indexes-can-be-combined&quot;&gt;Multiple Indexes Can Be Combined&lt;/h1&gt;

&lt;p&gt;One particularly powerful feature is that bitmap scans allow the query planner to combine multiple indexes.  For example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WHERE status = &apos;active&apos;
AND created_at &amp;gt;= &apos;2025-01-01&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The plan might look like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan
-&amp;gt; BitmapAnd
-&amp;gt; Bitmap Index Scan on status_idx
-&amp;gt; Bitmap Index Scan on created_at_idx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each index produces a bitmap, and the planner combines them using logical operations, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BitmapAnd&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BitmapOr&lt;/code&gt;.  This allows the planner to efficiently use multiple indexes even when a single composite index does not exist.&lt;/p&gt;
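
&lt;p&gt;Reproducing this is straightforward – two independent single-column indexes are enough (the table and index names here are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE INDEX status_idx ON orders (status);
CREATE INDEX created_at_idx ON orders (created_at);

EXPLAIN
SELECT * FROM orders
WHERE status = &apos;active&apos;
  AND created_at &amp;gt;= &apos;2025-01-01&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;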

&lt;hr /&gt;

&lt;h1 id=&quot;when-does-the-planner-chooses-bitmap-scans&quot;&gt;When Does the Planner Choose Bitmap Scans?&lt;/h1&gt;

&lt;p&gt;The planner usually prefers bitmap scans in situations where the query returns more rows than a typical index scan, but not enough rows to justify a full sequential scan.  In other words, bitmap scans often appear in the middle selectivity range.&lt;/p&gt;

&lt;p&gt;Very roughly:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Selectivity&lt;/th&gt;
      &lt;th&gt;Likely Plan&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Very small&lt;/td&gt;
      &lt;td&gt;Index Scan&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Medium&lt;/td&gt;
      &lt;td&gt;Bitmap Heap Scan&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Very large&lt;/td&gt;
      &lt;td&gt;Seq Scan&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This is not a strict rule, but it helps explain the planner’s reasoning.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;pros-and-cons&quot;&gt;Pros and Cons&lt;/h1&gt;

&lt;p&gt;As with everything in databases, there’s no free lunch.  Here are some advantages and disadvantages of bitmap scans:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Advantages of Bitmap Heap Scans
    &lt;ul&gt;
      &lt;li&gt;Reduced Random I/O: By grouping heap page accesses, bitmap scans avoid excessive random disk reads.&lt;/li&gt;
      &lt;li&gt;Ability to Combine Indexes: Bitmap operations allow the query planner to use multiple independent indexes efficiently.&lt;/li&gt;
      &lt;li&gt;Better Performance for Medium Selectivity: Queries returning thousands of rows often benefit from bitmap access patterns.&lt;/li&gt;
      &lt;li&gt;Predictable Heap Access: Because heap pages are visited in order, caching behavior tends to improve.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Disadvantages of Bitmap Heap Scans
    &lt;ul&gt;
      &lt;li&gt;Memory Usage: The bitmap structure is stored in memory.  If the result set becomes too large, the query executor may switch to a lossy bitmap, where only page-level information is stored.  This can cause additional filtering work later.&lt;/li&gt;
      &lt;li&gt;Two-Phase Execution: Because the bitmap must be built before heap access begins, the query cannot stream rows immediately.  This can increase latency for queries expecting early rows.&lt;/li&gt;
      &lt;li&gt;Extra CPU Work: Maintaining and combining bitmap structures adds overhead compared to simple index scans.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;lossy-bitmaps&quot;&gt;Lossy Bitmaps&lt;/h1&gt;

&lt;p&gt;When memory limits are reached, the query executor may degrade the bitmap representation.  Instead of tracking individual tuple offsets, it only records:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Page 205 -&amp;gt; possible matches
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;During the heap scan, the executor must then recheck every row on those pages against the original condition.  Note that a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Recheck Cond&lt;/code&gt; line appears on every Bitmap Heap Scan, but the recheck is actually performed only for lossy pages; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt; shows how many pages degraded in a line like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Heap Blocks: exact=112 lossy=4387&lt;/code&gt;.  The results are still correct, but a lossy bitmap reduces efficiency.&lt;/p&gt;
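
&lt;p&gt;You can often reproduce a lossy bitmap by shrinking &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; (which caps the bitmap’s memory) before running a query that matches many rows – again on a hypothetical table:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SET work_mem = &apos;64kB&apos;;

EXPLAIN ANALYZE SELECT * FROM orders WHERE created_at &amp;gt;= &apos;2025-01-01&apos;;
-- look for a line like:  Heap Blocks: exact=112 lossy=4387

RESET work_mem;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;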

&lt;hr /&gt;

&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;Bitmap heap scans are one of the planner’s most practical optimization tools, as they allow the database to reduce random I/O, combine multiple indexes, and handle medium-sized result sets efficiently.&lt;/p&gt;

&lt;p&gt;While they may look complicated at first, the core idea is simple: Find matching rows first, then fetch heap pages efficiently.  What a great concept!&lt;/p&gt;
</description>
        <pubDate>Mon, 27 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/27/bitmap_heap_scan.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/27/bitmap_heap_scan.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>query-planner</category>
        
        <category>indexing</category>
        
        
        
      </item>
    
      <item>
        <title>The Postgres Performance Triangle</title>
        <description>&lt;p&gt;Everyone who’s gone at least knee-deep in  photography knows there’s this idea of the &lt;em&gt;exposure triangle&lt;/em&gt;: aperture, shutter speed, and ISO. Depending on what you’re going for artistically, you adjust the three parameters, knowing that there are trade-offs in doing so.  After working on a few cases, and presenting solutions to customers, I’ve started to think about Postgres performance tuning in a similar way – there are basic parameters that can be tuned, and there are trade-offs for the choices DBAs make:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Memory Allocation&lt;/li&gt;
  &lt;li&gt;Disk I/O&lt;/li&gt;
  &lt;li&gt;Concurrency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these (in broad strokes) affects throughput – how much work your system gets done.&lt;/p&gt;

&lt;p&gt;Caveat: I know that in the academic sense, “throughput” doesn’t quite capture the balance of these concepts, but please bear with me!&lt;/p&gt;

&lt;p&gt;Let’s talk about how each of these three work together with the whole system, and what the trade-offs look like.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;memory-allocation&quot;&gt;Memory Allocation&lt;/h2&gt;

&lt;p&gt;When you increase memory allocation in Postgres, whether it’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shared_buffers&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt;, things tend to feel smoother.  Most notably, queries spill to disk less often, sorts and joins stay in memory, and cache hit rates improve.  But there’s a trade-off that’s easy to miss at first, especially with these two parameters.  A single complex query can consume multiple chunks of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; (see &lt;a href=&quot;https://mydbanotebook.org/posts/work_mem-its-a-trap/&quot;&gt;Laetitia’s excellent post about it&lt;/a&gt;). Multiply that across concurrent queries, and you begin to see the OS consuming swap space, churning at checkpoints, and even the OOM Killer getting invoked.  So while more memory &lt;em&gt;can&lt;/em&gt; make things faster, it also quietly reduces how much concurrency your system can safely handle.&lt;/p&gt;
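
&lt;p&gt;You can watch the spill behavior directly in a session: set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; low, run a big sort under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt;, and then raise it (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;events&lt;/code&gt; table here is just a stand-in):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SET work_mem = &apos;1MB&apos;;
EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
-- Sort Method: external merge  Disk: ...kB

SET work_mem = &apos;256MB&apos;;
EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
-- Sort Method: quicksort  Memory: ...kB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;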

&lt;p&gt;I’d relate this to aperture – you can throw money at some fast glass, but you also get shallower depth of field (in an annoying way).&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;disk-io&quot;&gt;Disk I/O&lt;/h2&gt;

&lt;p&gt;Disk is where things go when memory isn’t enough, or when an access pattern requires it.  We see examples of this in sequential scans, random index lookups, and temporary files from sorts or hashes.  Lowering &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; might increase disk I/O due to sorts spilling to temp files, for example.  We can try to minimize disk I/O by adding indexes, increasing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt;, or simply rewriting queries.&lt;/p&gt;

&lt;p&gt;Another way we can try to affect disk I/O is to tinker with the costs, to encourage the query planner to choose one scan method over the other.  In any case, our attempts to balance disk I/O and memory usage can be pretty straightforward at first, but could become complicated at scale.  That’s where partitioning and read-only replicas come in, but I’m beginning to digress…&lt;/p&gt;
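
&lt;p&gt;As one example of cost tinkering: on SSD or NVMe storage, many DBAs lower &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;random_page_cost&lt;/code&gt; from its spinning-disk default so the planner is less afraid of random index access – measure on a representative query before adopting it:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SHOW random_page_cost;        -- defaults to 4.0, a spinning-disk assumption

SET random_page_cost = 1.1;   -- a common starting point for SSD/NVMe
-- re-run EXPLAIN on a representative query and compare the plans
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;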

&lt;p&gt;Indexes, in particular, are where things start to get interesting.  Adding an index can feel like an easy win, as it leads to fewer rows scanned and less CPU work per query, along with less disk activity, but there are trade-offs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; will update every relevant index&lt;/li&gt;
  &lt;li&gt;Every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; can potentially rewrite index entries&lt;/li&gt;
  &lt;li&gt;Every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt; leaves behind cleanup work (vacuum)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At scale, we also see other effects:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Indexes get large&lt;/li&gt;
  &lt;li&gt;Cache hit rates drop (because there’s more to cache)&lt;/li&gt;
  &lt;li&gt;Random I/O increases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So an index that helps one query might quietly make others worse, or make writes more expensive.&lt;/p&gt;

&lt;p&gt;It’s like raising ISO to compensate for low light. You get the shot, but the noise shows up somewhere else.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;concurrency&quot;&gt;Concurrency&lt;/h2&gt;

&lt;p&gt;So far, this has all been somewhat per-query. But things change when you introduce concurrency.  In a high-demand service, the instinct is to increase &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_connections&lt;/code&gt; to allow the service to scale up, but in my experience there’s a price to pay for this kind of concurrency.  Some people fail to notice that each connection brings its own memory usage, takes up a spot in Postgres’ internal data structures, and puts the system at risk for increased CPU demand and resource contention.&lt;/p&gt;

&lt;p&gt;In the photography analogy, you can turn down the ISO very low on a bright and sunny day, but that won’t be enough.  Soon, you’ll be closing the aperture and increasing the shutter speed, and then you lose your ability to create the artistic feel that you’re actually trying to go for.  So what do photographers do?  They use an ND filter to limit how much light hits the sensor.&lt;/p&gt;

&lt;p&gt;In Postgres, that “ND filter” is something like a connection pooler, such as &lt;a href=&quot;https://www.pgbouncer.org/&quot;&gt;PgBouncer&lt;/a&gt;.  Instead of letting thousands of connections compete for CPU, you cap active queries, allocate more resources to each actual DB session, and trade a bit of latency for stability.  Sometimes, to keep your throughput, you need some additional accessories.&lt;/p&gt;
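
&lt;p&gt;A minimal sketch of that idea in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgbouncer.ini&lt;/code&gt; – the database name and pool sizes are placeholders, so consult the PgBouncer docs before copying:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
pool_mode = transaction    ; hand back server connections at transaction end
max_client_conn = 2000     ; how many clients the pooler will accept
default_pool_size = 20     ; actual Postgres connections per database/user
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;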

&lt;hr /&gt;

&lt;h2 id=&quot;the-art-of-postgres&quot;&gt;The Art of Postgres&lt;/h2&gt;

&lt;p&gt;As a DBA, you can calculate optimal index usage, memory sizing, and expected I/O patterns, but those calculations tend to assume a steady state.  Every DBA knows that real production systems are always changing, due to traffic patterns, scaling, and new features getting rolled out on the application side.  As the organization changes, keeping the database performant depends on the DBA being both a Database Administrator and a Database Artist – working with internal teams to know which indexes to add or drop, how much concurrency to allow, and how to allocate memory without running out of it.&lt;/p&gt;

&lt;p&gt;Instead of asking, “What’s the optimal configuration?” it might be more useful to ask these questions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Where is my system currently paying the cost—memory, disk, or CPU?&lt;/li&gt;
  &lt;li&gt;If I relieve pressure here, where does it move?&lt;/li&gt;
  &lt;li&gt;How much can we tolerate that new pressure?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Costs don’t disappear – they just shift – and it’s the DBA’s job to help decision-makers decide where to shift them.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;There’s more to photography than exposure – there’s composition, color-correction, external lighting, and so much more.  In the same way, this discussion has just been one part of database administration.  There’s so much more to go over, in terms of creating a robust and scalable database.  I wanted to highlight this topic because I do find that some users tend to approach database architecture without considering all the trade-offs.  We can definitely get the database to perform well, but there’s no one-size-fits-all solution for every situation.  It takes thought, planning, testing, and discussion with stakeholders to come up with a good solution.&lt;/p&gt;
</description>
        <pubDate>Mon, 20 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/20/throughput_triangle.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/20/throughput_triangle.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        
        
      </item>
    
      <item>
        <title>Understanding PostgreSQL Wait Events</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;One of the most useful debugging tools in modern PostgreSQL is the wait event system.  When a query slows down or a database becomes CPU bound, a natural question is: “What are sessions actually waiting on?” Postgres exposes this information through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_activity&lt;/code&gt; view via two columns:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event_type
wait_event
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These fields reveal what the backend process is blocked on at a given moment.  Among the different wait types, one category tends to cause confusion:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;LWLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you’ve ever seen dashboards full of LWLock waits, you’re not alone in wondering what they mean and whether they’re a problem.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;where-wait-events-appear&quot;&gt;Where Wait Events Appear&lt;/h1&gt;

&lt;p&gt;The easiest way to see wait events is:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT pid,
wait_event_type,
wait_event,
state,
query
FROM pg_stat_activity
WHERE state != &apos;idle&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Example output might look like:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;pid&lt;/th&gt;
      &lt;th&gt;wait_event_type&lt;/th&gt;
      &lt;th&gt;wait_event&lt;/th&gt;
      &lt;th&gt;state&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1234&lt;/td&gt;
      &lt;td&gt;Lock&lt;/td&gt;
      &lt;td&gt;transactionid&lt;/td&gt;
      &lt;td&gt;active&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;5678&lt;/td&gt;
      &lt;td&gt;LWLock&lt;/td&gt;
      &lt;td&gt;buffer_content&lt;/td&gt;
      &lt;td&gt;active&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;9012&lt;/td&gt;
      &lt;td&gt;IO&lt;/td&gt;
      &lt;td&gt;DataFileRead&lt;/td&gt;
      &lt;td&gt;active&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Each category represents a different kind of wait.  Common types include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Lock&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LWLock&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IO&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Client&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IPC&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Activity&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Among these, LWLock waits often appear during performance incidents.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;what-is-an-lwlock&quot;&gt;What Is an LWLock?&lt;/h1&gt;

&lt;p&gt;LWLock stands for &lt;strong&gt;Lightweight Lock&lt;/strong&gt;.  These are &lt;strong&gt;internal&lt;/strong&gt; Postgres synchronization primitives used to coordinate access to shared memory structures.  Note that they are &lt;strong&gt;NOT&lt;/strong&gt; related to lock contention on tables, or deadlocking when performing DML.  LWLocks protect important internal structures such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;shared buffers&lt;/li&gt;
  &lt;li&gt;WAL buffers&lt;/li&gt;
  &lt;li&gt;lock tables&lt;/li&gt;
  &lt;li&gt;SLRU caches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because these structures are accessed by many processes simultaneously, Postgres must coordinate access carefully.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;why-lwlock-waits-appear&quot;&gt;Why LWLock Waits Appear&lt;/h1&gt;

&lt;p&gt;In healthy systems, LWLocks are acquired and released very quickly.  However, they can become visible when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;contention increases&lt;/li&gt;
  &lt;li&gt;many sessions access the same internal structure&lt;/li&gt;
  &lt;li&gt;CPU saturation occurs&lt;/li&gt;
  &lt;li&gt;shared memory structures become hot spots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seeing LWLock waits in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_activity&lt;/code&gt; doesn’t automatically mean something is wrong.  But persistent LWLock contention usually indicates a scaling issue somewhere in the workload.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;common-lwlock-wait-events&quot;&gt;Common LWLock Wait Events&lt;/h1&gt;

&lt;p&gt;A few LWLock events appear frequently during real-world incidents.&lt;/p&gt;

&lt;p&gt;Understanding them can help narrow down the root cause.&lt;/p&gt;

&lt;h3 id=&quot;buffer_content&quot;&gt;buffer_content&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event_type = LWLock
wait_event = buffer_content
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This occurs when Postgres processes compete to access a shared buffer page.&lt;/p&gt;

&lt;p&gt;Typical causes include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;many concurrent updates to the same rows&lt;/li&gt;
  &lt;li&gt;heavy index modifications&lt;/li&gt;
  &lt;li&gt;hot tables receiving high write volume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you see these locks, try these troubleshooting steps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;check for write-heavy workloads&lt;/li&gt;
  &lt;li&gt;inspect tables experiencing frequent updates&lt;/li&gt;
  &lt;li&gt;look for missing indexes causing excessive page access&lt;/li&gt;
&lt;/ul&gt;
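
&lt;p&gt;A good place to start looking for those hot tables is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_user_tables&lt;/code&gt;, which tracks per-table write activity:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT relname, n_tup_ins, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;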

&lt;h3 id=&quot;walwritelock&quot;&gt;WALWriteLock&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = WALWriteLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This indicates contention while writing to the Write-Ahead Log (WAL).&lt;/p&gt;

&lt;p&gt;Common causes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;high write throughput&lt;/li&gt;
  &lt;li&gt;large batch inserts or updates&lt;/li&gt;
  &lt;li&gt;slow storage affecting WAL flushes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Possible diagnostic steps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;examine WAL generation rate&lt;/li&gt;
  &lt;li&gt;check disk latency&lt;/li&gt;
  &lt;li&gt;review bulk write workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In some systems this appears as commit latency spikes.&lt;/p&gt;
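
&lt;p&gt;To put numbers on WAL volume, you can sample the current WAL position twice and diff the two LSNs (the LSN literal below stands in for your first sample’s value); on Postgres 14 and newer, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_wal&lt;/code&gt; also keeps cumulative counters:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- sample twice, some known interval apart, then diff:
SELECT pg_current_wal_lsn();
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), &apos;0/5A0E2F10&apos;) AS wal_bytes_since_sample;

-- Postgres 14+: cumulative WAL statistics
SELECT wal_records, wal_bytes FROM pg_stat_wal;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;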

&lt;h3 id=&quot;walinsertlock&quot;&gt;WALInsertLock&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = WALInsertLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This occurs when multiple sessions attempt to insert WAL records simultaneously.  It usually appears when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;many concurrent transactions are committing&lt;/li&gt;
  &lt;li&gt;high insert/update workloads exist&lt;/li&gt;
  &lt;li&gt;transaction throughput is extremely high&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Newer Postgres versions have reduced contention here by increasing the number of WAL insertion locks.  Still, very high write concurrency can trigger it.&lt;/p&gt;

&lt;h3 id=&quot;procarraylock&quot;&gt;ProcArrayLock&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = ProcArrayLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This lock protects Postgres’ internal structure tracking active transactions.  It is often associated with:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;snapshot creation&lt;/li&gt;
  &lt;li&gt;visibility checks&lt;/li&gt;
  &lt;li&gt;large numbers of active connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Possible causes include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;very high connection counts&lt;/li&gt;
  &lt;li&gt;long-running transactions&lt;/li&gt;
  &lt;li&gt;frequent snapshot creation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connection pooling (and lowering &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_connections&lt;/code&gt;) often helps reduce this type of contention.&lt;/p&gt;
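
&lt;p&gt;A quick way to gauge connection pressure is to count sessions by state:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;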

&lt;h3 id=&quot;clogcontrollock--slru-locks&quot;&gt;CLogControlLock / SLRU Locks&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = CLogControlLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These involve the SLRU (Simple Least Recently Used) subsystem, which tracks transaction commit status.  Heavy contention here can appear when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;extremely high transaction rates exist&lt;/li&gt;
  &lt;li&gt;frequent visibility checks occur&lt;/li&gt;
  &lt;li&gt;many short transactions are executed&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;diagnosing-lwlock-problems&quot;&gt;Diagnosing LWLock Problems&lt;/h1&gt;

&lt;p&gt;When investigating LWLock waits, a few steps usually help.&lt;/p&gt;

&lt;h3 id=&quot;1-look-for-dominant-wait-events&quot;&gt;1. Look for dominant wait events&lt;/h3&gt;

&lt;p&gt;Start by identifying which LWLock appears most frequently:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type = &apos;LWLock&apos;
GROUP BY wait_event
ORDER BY count(*) DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
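
&lt;p&gt;Because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_activity&lt;/code&gt; is a point-in-time snapshot, a single sample can mislead; in psql you can re-run the query every few seconds and watch which wait events persist:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\watch 5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;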

&lt;h3 id=&quot;2-examine-workload-characteristics&quot;&gt;2. Examine workload characteristics&lt;/h3&gt;

&lt;p&gt;Questions to ask:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Are there many concurrent writers?&lt;/li&gt;
  &lt;li&gt;Is a single table receiving heavy updates?&lt;/li&gt;
  &lt;li&gt;Are there extremely high transaction rates?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-check-connection-counts&quot;&gt;3. Check connection counts&lt;/h3&gt;

&lt;p&gt;Large numbers of connections can amplify contention.  Connection pooling often reduces LWLock pressure significantly.&lt;/p&gt;

&lt;h3 id=&quot;4-look-at-query-patterns&quot;&gt;4. Look at query patterns&lt;/h3&gt;

&lt;p&gt;High-frequency queries touching the same rows or pages can create hotspots.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;PostgreSQL’s wait event system provides valuable insight into what the database is doing internally.  LWLocks, in particular, reveal contention inside shared memory structures that are otherwise invisible.  When investigating performance issues, a good rule of thumb is: &lt;em&gt;If many sessions are waiting on the same LWLock, there is usually a workload hotspot somewhere.&lt;/em&gt; Once you know where the contention lives, the path toward fixing it becomes much clearer.&lt;/p&gt;
</description>
        <pubDate>Mon, 13 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/13/wait_events.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/13/wait_events.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>troubleshooting</category>
        
        <category>wait-events</category>
        
        
        
      </item>
    
      <item>
        <title>WAL as a Data Distribution Layer</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Every so often, I talk to someone working in data analytics who wants access to production data, or at least a snapshot of it.  Sometimes, they tell me about their ETL setup, which takes hours to refresh and can be brittle, with a lot of monitoring around it.  For them, it works, but it sometimes gets me wondering if they need all that plumbing to get a snapshot of their live dataset.  Back at Turnitin, I set up a way to get people access to production data without having to snapshot nightly, and I thought maybe I should share it with people here.&lt;/p&gt;

&lt;h1 id=&quot;common-implementations-and-their-risks&quot;&gt;Common Implementations and Their Risks&lt;/h1&gt;

&lt;p&gt;Typical solutions that we might encounter as we give people a little bit of access to production data:&lt;/p&gt;

&lt;h3 id=&quot;1-query-the-primary&quot;&gt;1. Query the primary&lt;/h3&gt;

&lt;p&gt;This is generally a bad idea: you don’t want users on the production primary, lest they make a mistake or lock up tables in a way that prevents customers from using your apps.  Even with a read-only user, large data analytics queries could cause unwanted interference that negatively affects your uptime.  This is almost certainly not the way to go.&lt;/p&gt;

&lt;h3 id=&quot;2-query-a-streaming-replica&quot;&gt;2. Query a streaming replica&lt;/h3&gt;

&lt;p&gt;This is better, but doing this is not free.  Long-running queries can create replay lag, vacuum conflicts can cancel queries, and I/O contention can affect the primary upstream.  It’s safer since users are forced to be read-only, but that still carries risk.&lt;/p&gt;

&lt;h3 id=&quot;3-nightly-snapshots--rebuilds&quot;&gt;3. Nightly snapshots / rebuilds&lt;/h3&gt;

&lt;p&gt;Time-based snapshots and rebuilds are the most common way of getting data out to analysts.  ETL queries run at night (or on some other regular interval) and provide the information needed to do the necessary work.  This works, but it is another piece of software to maintain, and it produces somewhat stale data – how stale depends on how much staleness can be tolerated.&lt;/p&gt;

&lt;h1 id=&quot;once-upon-a-time-before-streaming-replication&quot;&gt;Once Upon a Time, Before Streaming Replication&lt;/h1&gt;

&lt;p&gt;If you’ve spent any time in Postgres, you already understand streaming replication.  Primary sends WAL to standby, and standby replays the WAL stream.  All the tutorials talk about using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_basebackup&lt;/code&gt;, setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hot_standby&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;standby.signal&lt;/code&gt; and configuring &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;However, many people don’t know that before streaming replication, there was log shipping.  Introduced in v. 8.2, it was the predecessor to what eventually became hot standby/streaming replication in v. 9.0.  Instead of maintaining a live connection between primary and standby, the two clusters are decoupled.  WAL files are shipped (via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scp&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsync&lt;/code&gt; or some other mechanism – maybe even NFS) to the replica, and then replayed there.&lt;/p&gt;

&lt;h1 id=&quot;log-shipping-hits-a-different-point-on-the-tradeoff-curve&quot;&gt;Log Shipping Hits a Different Point on the Tradeoff Curve&lt;/h1&gt;

&lt;p&gt;With WAL log shipping, the standby never connects to the primary and the primary never tracks the standby, so there is no backpressure mechanism (i.e., no queries cancelled because of conflict with recovery, and no need for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hot_standby_feedback&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;While you won’t get up-to-the-millisecond replication lag, you get pretty close to real-time data.  In some cases, lag may even be desirable – you could throttle the replay so you are an hour behind, giving yourself a window to recover a table’s state after someone fat-fingers an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; without a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause.&lt;/p&gt;
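
&lt;p&gt;A deliberate delay like that can be configured on the standby with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;recovery_min_apply_delay&lt;/code&gt; – a minimal sketch, with an illustrative value:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# On the standby, in postgresql.conf:
# hold each WAL record for at least an hour before replaying it
recovery_min_apply_delay = &apos;1h&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;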

&lt;h1 id=&quot;a-subtle-but-important-detail&quot;&gt;A Subtle but Important Detail&lt;/h1&gt;

&lt;p&gt;Postgres doesn’t force you to choose one mechanism over the other.  A standby can use both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt; AND &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restore_command&lt;/code&gt;, toggling between the two depending on availability.  If the streaming connection to the primary is lost, the standby falls back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restore_command&lt;/code&gt; until it can no longer find the WAL file it wants in the archive, and then it flips back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt; again.&lt;/p&gt;
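
&lt;p&gt;A dual-mode standby might look like this – a sketch, with hypothetical hostnames and paths:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# On the standby, in postgresql.conf:
# try streaming from the primary first...
primary_conninfo = &apos;host=primary.example.com user=replicator&apos;
# ...and fall back to the WAL archive when the connection is lost
restore_command = &apos;cp /mnt/wal_archive/%f %p&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;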

&lt;p&gt;Log shipping isn’t just a legacy mode; it’s part of the replication continuum.  It’s like an incremental backup, except that your backup is always fully restored and can be queried against.  For these reasons, keeping your WAL files around is a very good practice.&lt;/p&gt;

&lt;h1 id=&quot;architecture-pattern-introduce-a-wal-hub&quot;&gt;Architecture Pattern: Introduce a WAL Hub&lt;/h1&gt;

&lt;p&gt;Instead of thinking in terms of replication happening between a primary and a number of standbys, it may be useful to think about a central WAL archive host – even if it’s just an S3 bucket – so that many consumers can access the data at any point in time.&lt;/p&gt;

&lt;p&gt;These consumers can be analytics standbys, QA environments, or ad-hoc data sandboxes – or whatever else you want to give a copy of near-realtime production data to, without risking replication backpressure or compromising network security.&lt;/p&gt;

&lt;h1 id=&quot;a-hands-on-approach&quot;&gt;A Hands-On Approach&lt;/h1&gt;

&lt;p&gt;I created a &lt;a href=&quot;https://github.com/richyen/toolbox/tree/master/demos/wal_shipping&quot;&gt;simple demo&lt;/a&gt; that sets this up end-to-end.  It sets up three containers in Docker – a primary, a standby, and a mock WAL archive location.  &lt;em&gt;Disclaimer:&lt;/em&gt; yes, I used AI to help me generate the scripts, but it’s exactly how I had it set up at Turnitin (yes, we used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsyncd&lt;/code&gt; back in 2009 – there might be better options out there these days).&lt;/p&gt;

&lt;p&gt;Some key configuration params for clarity:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_command&lt;/code&gt; pushes WAL files to a directory&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restore_command&lt;/code&gt; pulls WAL files on the standby&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;standby.signal&lt;/code&gt; enables continuous recovery&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hot_standby=on&lt;/code&gt; allows read-only queries&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_mode=on&lt;/code&gt; enables &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_command&lt;/code&gt; on the primary&lt;/li&gt;
&lt;/ul&gt;
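
&lt;p&gt;Put together, the two sides look roughly like this – a sketch under the demo’s assumptions, with illustrative paths:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# On the primary, in postgresql.conf:
archive_mode = on
# copy each completed WAL segment to the archive host
archive_command = &apos;rsync %p archive:/wal_archive/%f&apos;

# On the standby (which also has an empty standby.signal file):
hot_standby = on
# pull WAL segments back from the archive during recovery
restore_command = &apos;rsync archive:/wal_archive/%f %p&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;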

&lt;p&gt;Note some characteristics of the standby in this example:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;No &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;No replication slots used&lt;/li&gt;
  &lt;li&gt;No entries in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_replication&lt;/code&gt; show up on the primary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want, you can set up traditional streaming replication in parallel with this log-shipping standby – it doesn’t interfere with the log shipping so long as WAL files get to the archive location.&lt;/p&gt;

&lt;h1 id=&quot;why-this-pattern-deserves-more-attention&quot;&gt;Why This Pattern Deserves More Attention&lt;/h1&gt;

&lt;p&gt;Most teams default to streaming replication because it’s the most visible feature.&lt;/p&gt;

&lt;p&gt;But Postgres replication isn’t one thing; it’s a set of primitives:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;WAL generation&lt;/li&gt;
  &lt;li&gt;WAL transport&lt;/li&gt;
  &lt;li&gt;WAL replay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streaming replication couples all three; log shipping lets you separate them.  And once you do that, new architectures open up!&lt;/p&gt;
</description>
        <pubDate>Mon, 06 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/06/wal_archiving.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/06/wal_archiving.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>replication</category>
        
        <category>archiving</category>
        
        <category>log_shipping</category>
        
        
        
      </item>
    
      <item>
        <title>The Hidden Behavior of plan_cache_mode</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Most PostgreSQL users use prepared statements to boost performance and prevent SQL injection. Fewer know that the query planner can silently change the execution plan for a prepared statement after exactly five executions.&lt;/p&gt;

&lt;p&gt;This behavior often surprises engineers because a query plan can suddenly shift—sometimes dramatically, even though the query itself hasn’t changed. The reason lies in the planner’s handling of custom plans vs generic plans, controlled by the parameter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plan_cache_mode&lt;/code&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;custom-plans-vs-generic-plans&quot;&gt;Custom Plans vs Generic Plans&lt;/h1&gt;

&lt;p&gt;When a prepared statement is executed with parameters, the planner has two choices:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Custom Plan:&lt;/strong&gt; Generated using the actual parameter values. It is potentially optimal for that specific execution but requires planning overhead every time.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Generic Plan:&lt;/strong&gt; Planned once without knowing specific parameter values. It is reused for all subsequent executions to save planning overhead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plan_cache_mode&lt;/code&gt; is set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto&lt;/code&gt;. In this mode, the planner uses custom plans for the first five executions. On the sixth execution, it compares the average cost of those custom plans against the estimated cost of a generic plan. If the generic plan is deemed “cheaper” or equal, the planner switches to it for the remaining lifetime of that prepared statement.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;demonstrating-with-pgbench&quot;&gt;Demonstrating with pgbench&lt;/h1&gt;

&lt;p&gt;As always, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgbench&lt;/code&gt; provides the schema of choice for simple demonstrations.  I’m using Postgres 18, which is the latest version as of this writing.  Adding a column with highly skewed values makes it easier to trigger the switch, so for the purposes of this post we add a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;flag&lt;/code&gt; column with extreme skew: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;N&apos;&lt;/code&gt; for 0.1% of rows, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;Y&apos;&lt;/code&gt; for the remaining 99.9%:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;### In bash:&lt;/span&gt;
pgbench &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; 10 &lt;span class=&quot;nt&quot;&gt;-U&lt;/span&gt; postgres postgres

&lt;span class=&quot;c&quot;&gt;### In psql:&lt;/span&gt;
ALTER TABLE pgbench_accounts ADD COLUMN flag CHAR&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; NOT NULL DEFAULT &lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
UPDATE pgbench_accounts SET flag &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;N&apos;&lt;/span&gt; WHERE aid &amp;lt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1000&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
CREATE INDEX idx_accounts_flag ON pgbench_accounts&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;flag&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
ANALYZE pgbench_accounts&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

SELECT flag, count&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; FROM pgbench_accounts GROUP BY flag&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

 flag | count
&lt;span class=&quot;nt&quot;&gt;------&lt;/span&gt;+--------
 N    |   1000
 Y    | 999000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Before triggering the auto-switch, let’s force each mode directly to see what the planner produces for the same statement.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Custom plan: planner sees the literal value &apos;Y&apos;, looks it up in column&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- statistics (MCV frequency ≈ 0.999), and picks Seq Scan for 999,033 rows.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plan_cache_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;force_custom_plan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PREPARE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abalance&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pgbench_accounts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;                               QUERY PLAN
-------------------------------------------------------------------------
 Seq Scan on pgbench_accounts  (cost=0.00..28910.00 rows=999033 width=8)
   Filter: (flag = &apos;Y&apos;::bpchar)   &amp;lt;-- literal value &apos;Y&apos; indicates custom plan
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;DEALLOCATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- Generic plan: the planner has no value to look up. With ndistinct = 2&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- (only &apos;Y&apos; and &apos;N&apos; exist), it estimates 1/ndistinct = 50% selectivity,&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- or 500,000 rows. At that estimate, the cheaper path is Index Scan.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plan_cache_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;force_generic_plan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PREPARE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abalance&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pgbench_accounts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;                                            QUERY PLAN
--------------------------------------------------------------------------------------------
 Index Scan using idx_accounts_flag on pgbench_accounts  (cost=0.42..19322.07 rows=500000)
   Index Cond: (flag = $1)   &amp;lt;-- Note the placeholder $1 instead of literal &apos;Y&apos;/&apos;N&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The cost numbers explain the selection of the Index Scan over the Seq Scan: 19,322 &amp;lt; 28,910.&lt;/p&gt;

&lt;h1 id=&quot;the-automatic-switch-in-action&quot;&gt;The Automatic Switch in Action&lt;/h1&gt;

&lt;p&gt;After resetting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plan_cache_mode&lt;/code&gt; back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto&lt;/code&gt;, we execute the statement five times using the common value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;Y&apos;&lt;/code&gt;. Each run generates a custom Seq Scan plan at cost ~28,910. After five such executions, the planner compares the average custom plan cost (~28,910) against the generic plan cost (~19,322).&lt;/p&gt;

&lt;p&gt;Since 19,322 ≤ 28,910, the generic plan is chosen from execution 6 onward.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;DEALLOCATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plan_cache_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;auto&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PREPARE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abalance&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pgbench_accounts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- Executions 1–5: custom plans, each resolving &apos;Y&apos; literally&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Each shows:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;           QUERY PLAN
--------------------------------
 Seq Scan on pgbench_accounts
   Filter: (flag = &apos;Y&apos;::bpchar)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the sixth execution:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (COSTS OFF) EXECUTE flag_lookup(&apos;Y&apos;);
                       QUERY PLAN
--------------------------------------------------------
 Index Scan using idx_accounts_flag on pgbench_accounts
   Index Cond: (flag = $1)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The strategy flips from Seq Scan to Index Scan on the sixth call — even though the query and data are identical. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$1&lt;/code&gt; placeholder confirms the generic plan is now used.&lt;/p&gt;

&lt;h1 id=&quot;does-it-ever-switch-back&quot;&gt;Does it Ever Switch Back?&lt;/h1&gt;

&lt;p&gt;From execution 6 onward, every query — regardless of the parameter value — uses that generic Index Scan. For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;N&apos;&lt;/code&gt; (1,000 rows) an Index Scan happens to be efficient. For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;Y&apos;&lt;/code&gt; (999,000 rows), scanning nearly the entire 1M-row table through random index lookups is dramatically worse than a sequential scan would be.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Executions 7+: generic plan regardless of value&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;-- 999,000 rows via Index Scan (bad!)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;N&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;-- 1,000 rows via Index Scan (fine by accident)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Both show:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;                       QUERY PLAN
--------------------------------------------------------
 Index Scan using idx_accounts_flag on pgbench_accounts
   Index Cond: (flag = $1)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The generic plan stays until &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEALLOCATE flag_lookup&lt;/code&gt; or the session ends.  This is certainly something to be aware of for frequently-executed prepared statements; it has caused significant performance problems for some customers I’ve worked with.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;under-the-hood-the-c-logic&quot;&gt;Under the Hood: The C Logic&lt;/h1&gt;

&lt;p&gt;Just to highlight that the number 5 isn’t determined with any fancy logic, we can find it in the source code. In &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/backend/utils/cache/plancache.c&lt;/code&gt; (around &lt;strong&gt;line 1200&lt;/strong&gt;), the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;choose_custom_plan&lt;/code&gt; spells it out explicitly:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bool&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;choose_custom_plan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CachedPlanSource&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plansource&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;cm&quot;&gt;/* ... settings check for force_custom / force_generic ... */&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/* If we haven&apos;t done 5 custom plans yet, keep doing them */&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plansource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_custom_plans&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/*
     * Otherwise, compare generic_cost against the average custom_cost.
     * If the generic plan is cheaper (or equal), we switch!
     */&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plansource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generic_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plansource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total_custom_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plansource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_custom_plans&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;The query planner’s automatic plan caching is usually a hero, saving CPU cycles. But when you have highly skewed data or volatile temporary objects, that “6th run switch” can negatively affect client/application performance.&lt;/p&gt;

&lt;p&gt;If you see unexplained regressions in a prepared statement, check whether it is being executed more than five times, or try &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SET plan_cache_mode = force_custom_plan&lt;/code&gt; as a troubleshooting step.  This forces a fresh custom plan on every execution, guaranteeing the planner always sees the actual parameter value and can choose the right strategy.&lt;/p&gt;
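
&lt;p&gt;To check whether a session’s prepared statements have already flipped, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_prepared_statements&lt;/code&gt; view exposes per-statement plan counters (available since Postgres 14):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- How many times has each prepared statement used each plan type?
-- A nonzero generic_plans count means the switch has happened.
SELECT name, generic_plans, custom_plans FROM pg_prepared_statements;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;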

&lt;p&gt;Good luck!&lt;/p&gt;
</description>
        <pubDate>Mon, 30 Mar 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/03/30/plan_cache_mode.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/03/30/plan_cache_mode.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>query-planner</category>
        
        <category>prepared-statements</category>
        
        
        
      </item>
    
      <item>
        <title>EXPLAIN&apos;s Other Superpowers</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Most people who work with PostgreSQL eventually learn two commands for query tuning: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; shows the planner’s chosen execution plan, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt; runs the query and adds runtime statistics. For most tuning tasks, this already provides a wealth of information.&lt;/p&gt;

&lt;p&gt;But what many people don’t realize is that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; has a handful of other options that can make troubleshooting much easier. In some cases they answer questions that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt; alone cannot.&lt;/p&gt;

&lt;p&gt;In this post we’ll take a look at a few of those lesser-known options.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;buffers-where-did-the-data-come-from&quot;&gt;BUFFERS: Where Did the Data Come From?&lt;/h1&gt;

&lt;p&gt;One common question during performance analysis is whether data came from shared buffers (cache), from disk, or from temporary buffers.  This is where the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BUFFERS&lt;/code&gt; option comes in handy.  Output can look something like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM mytable WHERE id = 123;
[...]
  Index Scan using mytable_pkey on mytable
  Buffers: shared hit=5 read=2
[...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the example above, we see:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shared hit&lt;/code&gt; – pages already in cache (i.e., cache hit)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shared read&lt;/code&gt; – pages fetched from disk (i.e., cache miss)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;buffers&lt;/code&gt; in this context are 8-kilobyte blocks of memory (PostgreSQL’s default block size).&lt;/p&gt;

&lt;p&gt;This is extremely useful when trying to determine whether performance problems are related to a cold cache, excessive disk reads, or insufficient memory (i.e., the cache is too small to hold all the data being worked with).&lt;/p&gt;

&lt;p&gt;Especially for index scans, this information confirms whether a query that &lt;em&gt;should&lt;/em&gt; be index-friendly is actually pulling large portions of the table into memory.&lt;/p&gt;
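
&lt;p&gt;To put those buffer counts in perspective, here is a quick back-of-the-envelope calculation (plain Python, using the counts from the example above):&lt;/p&gt;

```python
# Interpreting "Buffers: shared hit=5 read=2" from the example above
shared_hit, shared_read = 5, 2
block_size = 8192                      # PostgreSQL's default block size
total = shared_hit + shared_read
hit_ratio = shared_hit / total
bytes_touched = total * block_size
print(f"{hit_ratio:.0%} of pages came from cache")  # prints: 71% of pages came from cache
print(f"{bytes_touched} bytes touched in total")    # prints: 57344 bytes touched in total
```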

&lt;hr /&gt;

&lt;h1 id=&quot;memory-memory-used-by-the-query&quot;&gt;MEMORY: Memory Used by the Query&lt;/h1&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MEMORY&lt;/code&gt; option was introduced in version 17.  It is different from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BUFFERS&lt;/code&gt; in that it tracks the amount of memory consumed during the query planning phase, not execution.  The information appears at the bottom of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; output like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (ANALYZE, MEMORY, TIMING OFF)
SELECT * FROM mytable WHERE id = 123;
[...]
 Planning:
   Buffers: shared hit=36 read=1
   Memory: used=63kB  allocated=64kB

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;wal-how-much-logging-is-happening&quot;&gt;WAL: How Much Logging Is Happening?&lt;/h1&gt;

&lt;p&gt;Another useful option that many people overlook is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WAL&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (ANALYZE, WAL)
INSERT INTO mytable SELECT * FROM staging_table;
[...]
  WAL: records=100, fpi=5, bytes=45000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the example above, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;records&lt;/code&gt; is the number of WAL records generated, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fpi&lt;/code&gt; refers to full-page images that were written (number of pages modified for the first time since the last checkpoint), and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bytes&lt;/code&gt; is the total WAL traffic generated by the query.  This can be helpful when investigating write-heavy workloads, including bulk loads, large updates, index creation, and high replication traffic.&lt;/p&gt;
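
&lt;p&gt;One detail worth noticing: full-page images can dominate WAL volume.  A rough back-of-the-envelope check on the numbers above (plain Python, assuming each FPI is roughly one page):&lt;/p&gt;

```python
# Rough interpretation of "WAL: records=100, fpi=5, bytes=45000"
records, fpi, wal_bytes = 100, 5, 45000
block_size = 8192                      # default page size; each FPI is roughly one page
fpi_bytes = fpi * block_size           # bytes attributable to full-page images
other_bytes = wal_bytes - fpi_bytes    # bytes left for the ordinary records
print(f"FPIs account for roughly {fpi_bytes / wal_bytes:.0%} of the WAL volume")
# prints: FPIs account for roughly 91% of the WAL volume
```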

&lt;hr /&gt;

&lt;h1 id=&quot;settings-remind-me-what-my-environment-looked-like&quot;&gt;SETTINGS: Remind Me What My Environment Looked Like?&lt;/h1&gt;

&lt;p&gt;Sometimes a query behaves differently on two servers even though the SQL is identical.  Or you may have modified some parameters locally before running the query (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt;).  To understand how a query is affected by differences in environment, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SETTINGS&lt;/code&gt; option can be useful; it lists the planner-relevant parameters that have been changed from their default values:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (SETTINGS)
SELECT * FROM mytable WHERE id = 123;
[...]
Settings: effective_cache_size = &apos;48GB&apos;, effective_io_concurrency = &apos;200&apos;, enable_partitionwise_aggregate = &apos;on&apos;, enable_partitionwise_join = &apos;on&apos;, max_parallel_workers = &apos;16&apos;, max_parallel_workers_per_gather = &apos;4&apos;, temp_buffers = &apos;1MB&apos;, search_path = &apos;public&apos;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;verbose-see-the-planners-full-story&quot;&gt;VERBOSE: See the Planner’s Full Story&lt;/h1&gt;

&lt;p&gt;Another useful option is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VERBOSE&lt;/code&gt;, which prints additional information such as internal column references, expanded target lists, and schema-qualified object names:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;postgres=# EXPLAIN (ANALYZE) SELECT * FROM pgbench_accounts a JOIN pgbench_branches b ON b.bid=a.bid ORDER BY 2 DESC;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.12..3898.14 rows=100000 width=461) (actual time=0.216..32.597 rows=100000.00 loops=1)
   Join Filter: (b.bid = a.bid)
   Buffers: shared hit=1642
   -&amp;gt;  Index Scan Backward using pgbench_branches_pkey on pgbench_branches b  (cost=0.12..8.14 rows=1 width=364) (actual time=0.095..0.102 rows=1.00 loops=1)
         Index Searches: 1
         Buffers: shared hit=2
   -&amp;gt;  Seq Scan on pgbench_accounts a  (cost=0.00..2640.00 rows=100000 width=97) (actual time=0.024..9.732 rows=100000.00 loops=1)
         Buffers: shared hit=1640
 Planning Time: 0.346 ms
 Execution Time: 40.078 ms
(10 rows)

postgres=# EXPLAIN (ANALYZE, VERBOSE) SELECT * FROM pgbench_accounts a JOIN pgbench_branches b ON b.bid=a.bid ORDER BY 2 DESC;
                                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.12..3898.14 rows=100000 width=461) (actual time=0.225..32.869 rows=100000.00 loops=1)
   Output: a.aid, a.bid, a.abalance, a.filler, b.bid, b.bbalance, b.filler
   Join Filter: (b.bid = a.bid)
   Buffers: shared hit=1642
   -&amp;gt;  Index Scan Backward using pgbench_branches_pkey on public.pgbench_branches b  (cost=0.12..8.14 rows=1 width=364) (actual time=0.183..0.190 rows=1.00 loops=1)
         Output: b.bid, b.bbalance, b.filler
         Index Searches: 1
         Buffers: shared hit=2
   -&amp;gt;  Seq Scan on public.pgbench_accounts a  (cost=0.00..2640.00 rows=100000 width=97) (actual time=0.026..9.756 rows=100000.00 loops=1)
         Output: a.aid, a.bid, a.abalance, a.filler
         Buffers: shared hit=1640
 Planning Time: 0.547 ms
 Execution Time: 40.228 ms
(13 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;While it may look a bit noisy, it can be helpful when diagnosing:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;view expansion&lt;/li&gt;
  &lt;li&gt;rule rewriting&lt;/li&gt;
  &lt;li&gt;complex query transformations&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;combining-options&quot;&gt;Combining Options&lt;/h1&gt;

&lt;p&gt;The real power of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; comes from combining options together.  For example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS, WAL, SETTINGS)
SELECT * FROM mytable WHERE id = 123;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This produces a plan that shows execution time, cache usage, WAL generation, and the configuration parameters influencing the planner.&lt;/p&gt;

&lt;p&gt;In many cases, this gives a far more complete picture of what the database is doing internally.&lt;/p&gt;

&lt;p&gt;Note that many of these can also be enabled in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto_explain&lt;/code&gt; as parameters in the database configuration.&lt;/p&gt;
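
&lt;p&gt;For reference, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto_explain&lt;/code&gt; counterparts might look something like this in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;postgresql.conf&lt;/code&gt; (a minimal sketch; the duration threshold is illustrative):&lt;/p&gt;

```ini
# Load the module and log plans for statements slower than an illustrative 250ms
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '250ms'

# Counterparts of the EXPLAIN options discussed above
auto_explain.log_analyze  = on    # like ANALYZE (required for the buffer/WAL stats)
auto_explain.log_buffers  = on    # like BUFFERS
auto_explain.log_wal      = on    # like WAL
auto_explain.log_settings = on    # like SETTINGS
auto_explain.log_verbose  = on    # like VERBOSE
```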

&lt;hr /&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt; is powerful, but &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; provides many additional tools for understanding query behavior.  The options above can provide valuable insight into memory usage, disk activity, WAL generation, and planner configuration.  When troubleshooting tricky performance problems, they can reveal details that a basic execution plan might hide.&lt;/p&gt;
</description>
        <pubDate>Mon, 23 Mar 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/03/23/explain_options.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/03/23/explain_options.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>explain</category>
        
        <category>tuning</category>
        
        
        
      </item>
    
      <item>
        <title>Learning AI Fast with pgEdge&apos;s RAG</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;If you’ve been paying attention to the technology landscape recently, you’ve probably noticed that AI is &lt;strong&gt;everywhere&lt;/strong&gt;. New frameworks, new terminology, and a dizzying array of acronyms and jargon: &lt;strong&gt;LLM&lt;/strong&gt;, &lt;strong&gt;RAG&lt;/strong&gt;, &lt;strong&gt;embeddings&lt;/strong&gt;, &lt;strong&gt;vector databases&lt;/strong&gt;, &lt;strong&gt;MCP&lt;/strong&gt;, and more.&lt;/p&gt;

&lt;p&gt;Honestly, it’s been difficult to figure out where to start. Many tutorials either dive deep into machine learning theory (Bayesian transforms?) or hide everything behind a single API call to a hosted model.  Neither approach really explains how these systems actually work.&lt;/p&gt;

&lt;p&gt;Recently I spent some time experimenting with the &lt;a href=&quot;https://www.pgedge.com&quot;&gt;pgEdge&lt;/a&gt; AI tooling after hearing Shaun Thomas’ talk at a &lt;a href=&quot;https://prairiepostgres.org/&quot;&gt;PrairiePostgres&lt;/a&gt; meetup.  He talked about how to set up the various components of an AI chatbot system, starting from ingesting documents into a Postgres database, vectorizing the text, setting up a RAG and then an MCP server.&lt;/p&gt;

&lt;p&gt;When I got home I wanted to try it out for myself – props to the pgEdge team for making it all free and open-source!  What surprised me most was not just that everything worked, but how easy it was to get a complete AI retrieval pipeline running locally. More importantly, it turned out to be one of the clearest ways I’ve found to understand how modern AI systems are constructed behind the scenes.  Thanks so much, Shaun!&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;the-pgedge-ai-components&quot;&gt;The pgEdge AI Components&lt;/h1&gt;

&lt;p&gt;The pgEdge AI ecosystem provides several small tools that fit together naturally.  I’ll run through them quickly here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/pgEdge/doc-converter&quot;&gt;Doc Converter&lt;/a&gt; – The doc-converter normalizes documents into a format that is easy to process downstream. Whether the input is PDF, HTML, Markdown, or plain text, the converter produces clean text output suitable for ingestion.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/pgEdge/pgedge-vectorizer&quot;&gt;Vectorizer&lt;/a&gt; – The vectorizer handles the process of converting text chunks into embeddings.  These embeddings are numeric representations of text that capture semantic meaning. Once generated, they can be stored inside PostgreSQL using &lt;a href=&quot;https://github.com/pgvector/pgvector&quot;&gt;pgvector&lt;/a&gt; and queried with similarity search.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/pgEdge/pgedge-rag-server&quot;&gt;Retrieval-Augmented Generation (RAG) Server&lt;/a&gt; – The RAG framework ties everything together.  It orchestrates:
    &lt;ol&gt;
      &lt;li&gt;embedding the user’s query&lt;/li&gt;
      &lt;li&gt;retrieving similar document chunks&lt;/li&gt;
      &lt;li&gt;assembling prompt context&lt;/li&gt;
      &lt;li&gt;sending the prompt to an LLM&lt;/li&gt;
      &lt;li&gt;returning the generated response&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the full system is running, you essentially have your own ChatGPT- or Gemini-style assistant running on your laptop.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;running-everything-locally-with-ollama&quot;&gt;Running Everything Locally with Ollama&lt;/h1&gt;

&lt;p&gt;With ChatGPT and Gemini, getting tokens or sharing my payment info was a blocker, especially when I just wanted to test things for educational purposes.  Through Shaun’s presentation, I was introduced to &lt;a href=&quot;https://ollama.com&quot;&gt;Ollama&lt;/a&gt;, which is a great alternative if you’re okay with slower performance (especially on an 8GB M1 Mac mini).&lt;/p&gt;

&lt;p&gt;I was pleasantly surprised at how easy it was to run the entire pipeline without relying on external AI APIs.  Specifically, I used the &lt;strong&gt;embeddinggemma&lt;/strong&gt; model for generating embeddings.  This meant the entire stack could run locally, no API keys required!  Running everything locally removes those barriers and definitely makes experimentation much easier.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;understanding-rag-by-actually-running-it&quot;&gt;Understanding RAG by Actually Running It&lt;/h1&gt;

&lt;p&gt;Prior to Shaun’s talk, one of the most confusing concepts in my AI learning was Retrieval-Augmented Generation (RAG).  I learned that what RAG does is:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Before asking the LLM to answer a question, retrieve relevant information and include it in the prompt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With the pgEdge pipeline, the flow becomes very visible.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Documents are converted into clean text&lt;/li&gt;
  &lt;li&gt;Text is split into chunks&lt;/li&gt;
  &lt;li&gt;Chunks are embedded into vectors&lt;/li&gt;
  &lt;li&gt;Vectors are stored in PostgreSQL&lt;/li&gt;
  &lt;li&gt;A question is embedded into a vector&lt;/li&gt;
  &lt;li&gt;A similarity search finds relevant chunks&lt;/li&gt;
  &lt;li&gt;Those chunks are inserted into the prompt&lt;/li&gt;
  &lt;li&gt;The LLM generates the response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From this, I realized that the LLM is not storing my data.  Instead, the system retrieves relevant information &lt;em&gt;on demand&lt;/em&gt; and feeds it into the prompt.  The RAG layer is a facilitator for the LLM’s response.&lt;/p&gt;
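
&lt;p&gt;To make the retrieval steps concrete, here is a toy sketch of steps 5 through 7 in plain Python.  The vectors and document names are made up; in the real pipeline the embeddings come from the model and the similarity search runs inside Postgres via pgvector:&lt;/p&gt;

```python
import math

# Toy "embeddings" -- in the real pipeline these come from an embedding model
docs = {
    "autovacuum": [0.9, 0.1, 0.0],
    "replication": [0.1, 0.9, 0.0],
    "indexes": [0.2, 0.2, 0.9],
}

def cosine_distance(a, b):
    # Same metric as pgvector's cosine-distance operator
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def retrieve(query_vec, k=1):
    # Step 6: rank stored chunks by distance to the query vector
    ranked = sorted(docs, key=lambda name: cosine_distance(query_vec, docs[name]))
    return ranked[:k]

# Step 5: the question is embedded; here we fake a vector close to "autovacuum"
context = retrieve([0.85, 0.15, 0.05])
# Step 7: the retrieved chunks are spliced into the prompt sent to the LLM
prompt = f"Answer using this context: {context}. Question: ..."
print(context)  # prints: ['autovacuum']
```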

&lt;hr /&gt;

&lt;h1 id=&quot;the-role-of-the-vectorizer&quot;&gt;The Role of the Vectorizer&lt;/h1&gt;

&lt;p&gt;The vectorizer is a crucial step in the pipeline.  Its job is to convert human language into embeddings, which are high-dimensional numeric representations of meaning.  With vectors, searching by natural language becomes possible, instead of relying on old-fashioned keyword matching.&lt;/p&gt;

&lt;p&gt;Once the embeddings (vectorized documents) are stored in PostgreSQL using pgvector, everything starts to look familiar again for database engineers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;indexing&lt;/li&gt;
  &lt;li&gt;storage&lt;/li&gt;
  &lt;li&gt;similarity search&lt;/li&gt;
  &lt;li&gt;ranking results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Managing these things looks pretty doable for a database guy like me 😂&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;dont-try-this-at-home&quot;&gt;&lt;del&gt;Don’t&lt;/del&gt; Try This At Home!&lt;/h1&gt;

&lt;p&gt;After writing about the pgEdge stack I wanted to make it as easy as possible for others to reproduce the same experience, so I &lt;a href=&quot;https://github.com/richyen/learn-ai-with-postgres&quot;&gt;packaged everything into a Docker Compose project&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Clone the repository and run:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone https://github.com/richyen/learn-ai-with-postgres.git
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;learn-ai-with-postgres
&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;documents &lt;span class=&quot;c&quot;&gt;# put some txt files in there for vectorization&lt;/span&gt;
docker compose up &lt;span class=&quot;nt&quot;&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That final command:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Builds a custom PostgreSQL image with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgvector&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgedge_vectorizer&lt;/code&gt; compiled in&lt;/li&gt;
  &lt;li&gt;Starts an Ollama container and pulls the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;embeddinggemma&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;glm-4.7-flash&lt;/code&gt; models locally&lt;/li&gt;
  &lt;li&gt;Runs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgedge-docloader&lt;/code&gt; to ingest any documents you’ve put into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;documents/&lt;/code&gt; folder&lt;/li&gt;
  &lt;li&gt;Calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgedge_vectorizer.enable_vectorization()&lt;/code&gt;, which starts background workers inside Postgres that chunk and embed every page&lt;/li&gt;
  &lt;li&gt;Starts the RAG server on port 8080&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No API keys, no cloud services. Everything runs on your own hardware.&lt;/p&gt;

&lt;p&gt;Once the RAG server is up (watch for the setup container to exit cleanly), try asking it a question:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/pipelines/pg-docs &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;-H&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Content-Type: application/json&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;{&quot;query&quot;: &quot;How does autovacuum decide when to run?&quot;}&apos;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  | jq &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The answer comes back a few seconds later, grounded in the actual PostgreSQL documentation:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;answer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Autovacuum in PostgreSQL is triggered based on thresholds defined by two parameters: autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor. The daemon considers a table eligible for vacuuming when the number of dead tuples exceeds the threshold plus (scale_factor × total row count) ...&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can also run raw similarity searches directly in SQL to see exactly what the retrieval step is doing before the LLM touches anything:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snippet&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;documents_content_chunks&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;documents&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;embedding&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;embedding&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;pgedge_vectorizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generate_embedding&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;autovacuum threshold configuration&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is the same pgvector &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;=&amp;gt;&lt;/code&gt; (cosine distance) operator the RAG server uses internally — you can inspect the retrieval step at any time without going through the HTTP API.&lt;/p&gt;

&lt;p&gt;Embeddings are generated in the background by Postgres workers, so you can start querying as soon as a few hundred chunks are ready. Watch the progress with:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;psql postgresql://postgres:password@localhost:5432/pgai &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;
SELECT
  (SELECT count(*) FROM documents)                                             AS total_docs,
  (SELECT count(*) FROM documents_content_chunks WHERE embedding IS NOT NULL)  AS vectorized;
&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The project also includes the pgedge-postgres-mcp server on port 8081, which exposes the knowledge base via the Model Context Protocol — so it can be wired directly into Claude Desktop, VS Code Copilot, or any other MCP-compatible client.&lt;/p&gt;

&lt;hr /&gt;
&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;There’s a lot of pressure right now to “learn AI,” but that phrase can mean many different things.  For people coming from infrastructure, databases, or backend engineering, one of the most approachable paths is simply:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;build a small RAG pipeline and observe how the pieces fit together.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The pgEdge tooling made this surprisingly straightforward.  Instead of assembling half a dozen unrelated frameworks, the components already fit together:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;doc ingestion&lt;/li&gt;
  &lt;li&gt;vectorization&lt;/li&gt;
  &lt;li&gt;PostgreSQL storage&lt;/li&gt;
  &lt;li&gt;retrieval&lt;/li&gt;
  &lt;li&gt;prompt generation&lt;/li&gt;
  &lt;li&gt;LLM response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once I saw the entire flow working end-to-end, the AI ecosystem made a lot more sense.  Setting up the pgEdge RAG stack turned out to be a surprisingly effective way to see that architecture in action.&lt;/p&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;
</description>
        <pubDate>Mon, 16 Mar 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/03/16/pgedge_ai_repos.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/03/16/pgedge_ai_repos.html</guid>
        
        <category>postgres</category>
        
        <category>ai</category>
        
        <category>rag</category>
        
        <category>vector</category>
        
        <category>pgvector</category>
        
        <category>ollama</category>
        
        <category>docker</category>
        
        <category>mcp</category>
        
        
        
      </item>
    
      <item>
        <title>Debugging RDS Proxy Pinning: How a Hidden JIT Toggle Created Thousands of Pinned Connections</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;When using AWS RDS Proxy, the goal is to achieve connection multiplexing – many client connections share a much smaller pool of backend PostgreSQL connections, giving more resources per connection and keeping query execution running smoothly.&lt;/p&gt;

&lt;p&gt;However, if the proxy detects that a session has changed internal state in a way it cannot safely track, it &lt;strong&gt;pins&lt;/strong&gt; the client connection to a specific backend connection. Once pinned, that connection can never be multiplexed again.  This was the case with a recent database I worked on.&lt;/p&gt;

&lt;p&gt;In this case, we observed the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;extremely high CPU usage&lt;/li&gt;
  &lt;li&gt;relatively high LWLock wait times&lt;/li&gt;
  &lt;li&gt;OOM killer activity on the database, maybe once every day or two&lt;/li&gt;
  &lt;li&gt;thousands of active connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What was strange about it all was that the queries involved were relatively simple, with at most one join.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;finding-the-pinning-source&quot;&gt;Finding the Pinning Source&lt;/h1&gt;

&lt;p&gt;To get to the root cause, one option was to look in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_statements&lt;/code&gt;.  However, that approach had two problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Getting a clean snapshot of the statistics while thousands of queries were being actively processed would be tricky.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_statements&lt;/code&gt; normalizes queries and does not expose the values passed to parameter placeholders.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead, to see the actual parameters, we briefly enabled &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_statement = &apos;all&apos;&lt;/code&gt;.  This immediately surfaced something interesting in the logs, which I could download and review at my own pace.&lt;/p&gt;

&lt;p&gt;What we saw were statements like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT set_config($2,$1,$3)&lt;/code&gt; with parameters related to JIT configuration – that was the first real clue.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;getting-to-the-bottom&quot;&gt;Getting to the Bottom&lt;/h1&gt;

&lt;p&gt;After tracing the behavior through the stack, the root cause turned out to be surprisingly indirect.  The application created new connections through SQLAlchemy’s asyncpg dialect, and we needed to drill down into that driver’s behavior.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;step-1--reviewing-how-sqlalchemy-registers-json-codecs&quot;&gt;Step 1 – Reviewing how SQLAlchemy registers JSON codecs&lt;/h3&gt;

&lt;p&gt;During connection initialization, SQLAlchemy runs an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;on_connect&lt;/code&gt; hook:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;await_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;setup_asyncpg_json_codec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;await_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;setup_asyncpg_jsonb_codec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This registers optimized JSON and JSONB codecs (the client’s application deals with a lot of JSONB data).&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;step-2--observing-how-asyncpg-introspects-type-metadata&quot;&gt;Step 2 – Observing how asyncpg introspects type metadata&lt;/h3&gt;

&lt;p&gt;Registering those codecs requires looking up type OIDs in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_catalog&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That triggers asyncpg’s internal function: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_introspect_types()&lt;/code&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;step-3--catching-asyncpg-temporarily-disabling-jit&quot;&gt;Step 3 – Catching asyncpg temporarily disabling JIT&lt;/h3&gt;

&lt;p&gt;Inside &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_introspect_types()&lt;/code&gt; there is this block:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_introspect_types&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typeoids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timeout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_server_caps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;cfgrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__execute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;SELECT current_setting(&apos;jit&apos;) AS cur,
                      set_config(&apos;jit&apos;, &apos;off&apos;, false) AS new&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This block is harmless in itself: to avoid rare edge cases with complex type queries, asyncpg temporarily disables JIT, runs the introspection query, and restores the setting afterwards.  For direct PostgreSQL connections, this is perfectly fine.&lt;/p&gt;

&lt;p&gt;Unfortunately, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_config()&lt;/code&gt; changes session state.  RDS Proxy cannot safely track this change, so it pins the client connection to a dedicated backend session.  Once pinned, that connection can never be multiplexed again for the duration of the session.&lt;/p&gt;
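&lt;p&gt;For reference, the third argument to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_config()&lt;/code&gt; controls the scope of the change (per the PostgreSQL documentation), and it is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt; in asyncpg’s call that makes the change session-wide:&lt;/p&gt;

```python
# set_config(setting_name, new_value, is_local) -- the third argument
# controls scope (per the PostgreSQL docs):
#   is_local = true  : change reverts at the end of the current transaction
#   is_local = false : change persists for the rest of the session, which
#                      is the session-state change RDS Proxy reacts to
SET_JIT_LOCAL = "SELECT set_config('jit', 'off', true)"
SET_JIT_SESSION = "SELECT set_config('jit', 'off', false)"  # what asyncpg runs
```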

&lt;p&gt;In short, since every connection initialization triggers the JIT toggle, every RDS Proxy connection gets pinned to a database connection, effectively defeating RDS Proxy’s whole purpose of connection multiplexing.  With thousands of live connections doing relatively little work, the postmaster accumulates significant LWLock overhead, memory buffers don’t get flushed, and the OOM killer can be invoked when conditions are right.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;the-fix&quot;&gt;The Fix&lt;/h1&gt;

&lt;p&gt;The key observation is that asyncpg only runs the JIT toggle if it believes the server supports JIT.&lt;/p&gt;

&lt;p&gt;That capability is stored in an internal structure &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_server_caps&lt;/code&gt;. If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jit&lt;/code&gt; is set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;False&lt;/code&gt;, asyncpg skips the entire block.&lt;/p&gt;
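&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_server_caps&lt;/code&gt; is a namedtuple, which is why the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_replace()&lt;/code&gt; trick in the fix below works.  A self-contained sketch of the mechanics (the field list approximates asyncpg’s internal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ServerCapabilities&lt;/code&gt;, which is an internal detail and may differ across versions):&lt;/p&gt;

```python
from collections import namedtuple

# Approximation of asyncpg's internal ServerCapabilities namedtuple;
# the exact field list is an internal detail and may vary by version.
ServerCapabilities = namedtuple(
    "ServerCapabilities",
    ["advisory_locks", "notifications", "plpgsql", "sql_reset",
     "sql_close_all", "jit"],
)

caps = ServerCapabilities(
    advisory_locks=True, notifications=True, plpgsql=True,
    sql_reset=True, sql_close_all=True, jit=True,
)

# _replace() returns a new tuple with only the named field changed,
# leaving the original untouched -- exactly what the hook relies on.
patched = caps._replace(jit=False)
```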

&lt;p&gt;So we added a SQLAlchemy connection hook:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;listens_for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;engine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sync_engine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;connect&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_prevent_rds_proxy_session_pinning&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbapi_connection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;connection_record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;raw_conn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dbapi_connection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_connection&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;hasattr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;raw_conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;_server_caps&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;raw_conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_server_caps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;raw_conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_server_caps&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;raw_conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_server_caps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This configuration does the following:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Registers a connection hook so that it runs every time a new connection is created.&lt;/li&gt;
  &lt;li&gt;Passes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;insert=True&lt;/code&gt; so that our handler runs &lt;strong&gt;before&lt;/strong&gt; SQLAlchemy’s own &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;on_connect&lt;/code&gt; logic.  That is important because the JSON codec registration is what triggers the introspection.&lt;/li&gt;
  &lt;li&gt;Disables the JIT capability flag. By using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_server_caps._replace(jit=False)&lt;/code&gt;, we tell asyncpg to skip the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_config()&lt;/code&gt; block entirely.&lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;the-result&quot;&gt;The Result&lt;/h1&gt;

&lt;p&gt;After deploying the asyncpg fix, we saw the number of pinned sessions drop precipitously:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/richyen/richyen.github.io/refs/heads/gh-pages/img/rds_proxy_pinning.png&quot; alt=&quot;RDS Proxy Pinning Graph&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Of course, we were still seeing many pinned sessions, which we continued to address through other fixes, but this first step alone reduced pinning by over 50%.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;other-fix-attempts-that-didnt-work&quot;&gt;Other Fix Attempts That Didn’t Work&lt;/h1&gt;

&lt;p&gt;Before landing on this fix, we attempted a few other approaches.&lt;/p&gt;

&lt;p&gt;First, we attempted to disable JIT via connection parameters by setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;server_settings={&quot;jit&quot;: &quot;off&quot;}&lt;/code&gt;.  This fails because RDS Proxy rejects it with a message like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;FeatureNotSupportedError:
RDS Proxy currently doesn&apos;t support the option jit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We also tried disabling prepared statement caching with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prepared_statement_cache_size=0&lt;/code&gt; in the configuration.  This didn’t work because it prevents named prepared statement pinning, but it does not prevent &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_config()&lt;/code&gt; pinning.&lt;/p&gt;
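&lt;p&gt;In code, the two failed attempts looked roughly like this (these are sketches of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;connect_args&lt;/code&gt; involved, not verbatim from our configuration):&lt;/p&gt;

```python
# 1) Passing jit=off as a startup parameter -- rejected by RDS Proxy
#    with FeatureNotSupportedError:
rejected_connect_args = {"server_settings": {"jit": "off"}}

# 2) Disabling asyncpg's prepared-statement cache -- accepted, but it
#    only prevents named-prepared-statement pinning, not set_config()
#    pinning:
insufficient_connect_args = {"prepared_statement_cache_size": 0}

# Either dict would be passed as:
# create_async_engine(url, connect_args=...)
```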

&lt;p&gt;The only fix that worked was to add the pin-prevention hook as described above.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;lessons-learned&quot;&gt;Lessons Learned&lt;/h1&gt;

&lt;p&gt;A few takeaways from this debugging experience:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;RDS Proxy pinning can come from unexpected places.  Even small session-level changes can disable multiplexing.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_statements&lt;/code&gt; hides parameter values.  It’s great for spotting query patterns, but it does not expose bound parameters, which can hide critical clues.  Sometimes the fastest diagnostic tool is temporarily enabling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_statement = &apos;all&apos;&lt;/code&gt;, which is what exposed the parameters in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_config()&lt;/code&gt; call here.&lt;/li&gt;
  &lt;li&gt;SQLAlchemy and asyncpg have some quirks that need to be addressed when using them with RDS Proxy.&lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;The entire chain looked like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SQLAlchemy connection
 → asyncpg codec registration
 → asyncpg type introspection
 → temporary JIT disable via set_config()
 → RDS Proxy detects session state change
 → connection gets pinned
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A single hidden configuration toggle resulted in &lt;strong&gt;thousands of pinned sessions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once identified, the fix was only a few lines of code.&lt;/p&gt;

&lt;p&gt;But getting there required following the entire stack – from SQLAlchemy to asyncpg to PostgreSQL to RDS Proxy.&lt;/p&gt;

&lt;p&gt;Hopefully this saves someone else a few hours (or days) of debugging.&lt;/p&gt;
</description>
        <pubDate>Thu, 12 Mar 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/03/12/rds_proxy_pinning.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/03/12/rds_proxy_pinning.html</guid>
        
        <category>postgres</category>
        
        <category>rds-proxy</category>
        
        <category>sqlalchemy</category>
        
        <category>asyncpg</category>
        
        <category>performance</category>
        
        <category>debugging</category>
        
        
        
      </item>
    
  </channel>
</rss>
