<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Richard Yen</title>
    <description></description>
    <link>http://richyen.com/postgres/</link>
    <atom:link href="http://richyen.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 11 May 2026 21:00:28 +0000</pubDate>
    <lastBuildDate>Mon, 11 May 2026 21:00:28 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Making JSONB More Queryable with Generated Columns</title>
        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Over the past year, I’ve worked in a handful of contexts managing large volumes of data stored as JSONB in PostgreSQL. The scenario is common: users appreciate the flexibility of a document-oriented storage model, avoiding the need to predefine schemas or constantly migrate table structures as their data requirements evolve. JSONB documents can be deeply nested with numerous optional fields, and they scale to hundreds of kilobytes per record without issue. However, when the time comes to query these documents – filtering by user ID, event type, timestamps, or nested action properties – the queries can become slow and/or cumbersome to work with.&lt;/p&gt;

&lt;p&gt;The problem I want to address is: “How do we make searching JSONB data more efficient without breaking apart our documents or forcing it into columns in a relational database?” There are several approaches available in Postgres, each with different tradeoffs. I hope to shed some light on those approaches in this article.&lt;/p&gt;

&lt;h2 id=&quot;the-setup&quot;&gt;The Setup&lt;/h2&gt;

&lt;p&gt;I created a basic, no-frills table for the sake of this test:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BIGSERIAL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSONB&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s the document shape I used for testing and writing this post – it’s representative of the event logs and audit trails I’ve encountered: a mix of primitive fields, nested objects, and metadata that accumulates over time.&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
  &quot;user_id&quot;: 5234,
  &quot;event_type&quot;: &quot;event_42&quot;,
  &quot;timestamp&quot;: 1712341200,
  &quot;session_id&quot;: &quot;sess_abc123...&quot;,
  &quot;ip_address&quot;: &quot;192.168.1.42&quot;,
  &quot;action&quot;: {
    &quot;type&quot;: &quot;click&quot;,
    &quot;target_id&quot;: 87654,
    &quot;coordinates&quot;: {&quot;x&quot;: 512, &quot;y&quot;: 768},
    &quot;duration_ms&quot;: 1234
  },
  &quot;device&quot;: {
    &quot;type&quot;: &quot;mobile&quot;,
    &quot;os&quot;: &quot;iOS&quot;,
    &quot;screen_width&quot;: 1920,
    &quot;screen_height&quot;: 1080
  },
  &quot;performance&quot;: {
    &quot;page_load_time&quot;: 1234,
    &quot;dns_lookup&quot;: 123,
    &quot;tcp_connection&quot;: 234,
    &quot;server_response&quot;: 876
  },
  &quot;custom_fields&quot;: { ... }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The queries that matter are straightforward equality and range filters on known fields: find all events for a given user, filter by event type, narrow to a time window. With this setup, we’ll try to discern which kind of index actually serves the specific access pattern, and what the real cost of each option is.&lt;/p&gt;
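
&lt;p&gt;In plain extraction syntax, those access patterns look like this – without any index, each one is a sequential scan over the full table:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- all events for a given user
SELECT id FROM events WHERE cast(data-&amp;gt;&amp;gt;&apos;user_id&apos; AS INT) = 5234;

-- filter by event type
SELECT id FROM events WHERE data-&amp;gt;&amp;gt;&apos;event_type&apos; = &apos;event_42&apos;;

-- narrow to a time window
SELECT id FROM events WHERE cast(data-&amp;gt;&amp;gt;&apos;timestamp&apos; AS BIGINT) &amp;gt; 1700000000;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;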

&lt;p&gt;&lt;em&gt;All tests run on PostgreSQL 18.2 in Docker on an Apple M-series host. Tables contain 50,000 rows with realistic JSONB event documents. Query benchmarks run 20 times on a warm cache and report avg/min/max. Insert benchmarks run 5 trials of 5,000 rows each. Schema and scripts are included throughout so you can reproduce these results.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;three-approaches-to-indexing-jsonb&quot;&gt;Three Approaches to Indexing JSONB&lt;/h2&gt;

&lt;p&gt;There are three realistic options for this access pattern. Let’s look at each in turn – what it costs to build/maintain, what queries it actually helps, and where it falls down.&lt;/p&gt;

&lt;h3 id=&quot;option-1-gin-indexes&quot;&gt;Option 1: GIN Indexes&lt;/h3&gt;

&lt;p&gt;The natural candidate for indexing a JSONB column is a GIN (Generalized Inverted Index). After all, GIN indexes are specifically designed for composite values like JSON documents and full-text search vectors. With the default operator class, a GIN index stores an entry for every key and every value in every document, making the entire structure searchable:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_gin&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;USING&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- or the path-only variant:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_gin_path&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;USING&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jsonb_path_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As a refresher, I’ll mention that GIN is designed for containment and key existence operators (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?|&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&amp;amp;&lt;/code&gt;), not for equality on extracted fields:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- This query uses a GIN index correctly:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&amp;gt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;{&quot;user_id&quot;: 5234}&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- This query does NOT use a GIN index, even if one exists:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the containment form, the GIN index is used and the query is fast – but still slower than a B-tree on the same field, because GIN lookups involve more bookkeeping:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- GIN jsonb_ops + containment operator
Bitmap Index Scan on idx_gin
  Index Cond: (data @&amp;gt; &apos;{&quot;user_id&quot;: 5234}&apos;)

Planning Time: 1.173 ms  |  Execution Time: 1.295 ms

-- GIN jsonb_path_ops + containment operator
Bitmap Index Scan on idx_gin_path
  Index Cond: (data @&amp;gt; &apos;{&quot;user_id&quot;: 5234}&apos;)
Planning Time: 3.342 ms  |  Execution Time: 0.450 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb_path_ops&lt;/code&gt; variant is smaller and faster for containment queries, but it trades away support for key-existence operators (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?|&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&amp;amp;&lt;/code&gt;). Neither GIN variant can help with range predicates like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ts &amp;gt; 1700000000&lt;/code&gt; – those always fall through to a filter step.&lt;/p&gt;
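
&lt;p&gt;For reference, these are the key-existence forms that only the default &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb_ops&lt;/code&gt; class can serve – a quick sketch against the sample documents:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- does the document have a top-level &quot;session_id&quot; key?
SELECT id FROM events WHERE data ? &apos;session_id&apos;;

-- does it have any of these keys?
SELECT id FROM events WHERE data ?| array[&apos;session_id&apos;, &apos;ip_address&apos;];

-- does it have all of them?
SELECT id FROM events WHERE data ?&amp;amp; array[&apos;session_id&apos;, &apos;ip_address&apos;];
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;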

&lt;h3 id=&quot;option-2-expression-indexes&quot;&gt;Option 2: Expression Indexes&lt;/h3&gt;

&lt;p&gt;Postgres lets you create an index on an expression, including JSONB extraction:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_user_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is a B-tree index on the &lt;em&gt;result&lt;/em&gt; of evaluating the expression. When the query predicate matches the indexed expression exactly, and after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ANALYZE&lt;/code&gt; has gathered statistics on it, the planner will use it:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan on events
  Recheck Cond: ((data -&amp;gt;&amp;gt; &apos;user_id&apos;)::integer = 5234)
  Heap Blocks: exact=3
  -&amp;gt;  Bitmap Index Scan on idx_user_id
        Index Cond: ((data -&amp;gt;&amp;gt; &apos;user_id&apos;)::integer = 5234)
Planning Time: 1.168 ms  |  Execution Time: 0.341 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Execution time for this equality predicate is roughly on par with the GIN containment query.&lt;/p&gt;
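
&lt;p&gt;Covering the rest of the access pattern means one expression index per field – note the double parentheses that expression indexes require around a bare expression, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ANALYZE&lt;/code&gt; so the planner picks up statistics on the expressions:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE INDEX idx_event_type ON events ((data-&amp;gt;&amp;gt;&apos;event_type&apos;));
CREATE INDEX idx_ts ON events (cast(data-&amp;gt;&amp;gt;&apos;timestamp&apos; AS BIGINT));
ANALYZE events;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;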

&lt;h3 id=&quot;option-3-generated-columns&quot;&gt;Option 3: Generated Columns&lt;/h3&gt;

&lt;p&gt;Generated columns (available since PostgreSQL 12) let you extract JSONB values into regular typed columns at write time. The values are stored physically alongside the row and kept in sync automatically:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;         &lt;span class=&quot;n&quot;&gt;BIGSERIAL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;       &lt;span class=&quot;n&quot;&gt;JSONB&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt;    &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;TEXT&lt;/span&gt;   &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;event_type&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;         &lt;span class=&quot;nb&quot;&gt;BIGINT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;timestamp&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;BIGINT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;action&lt;/span&gt;     &lt;span class=&quot;nb&quot;&gt;TEXT&lt;/span&gt;   &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;action&apos;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;type&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_user_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_event_type&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_ts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_action&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;action&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Queries against generated columns are plain typed-column lookups. The planner sees them as regular B-tree columns and produces tight estimates:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan on events
  Recheck Cond: (user_id = 5234)
  Heap Blocks: exact=3
  -&amp;gt;  Bitmap Index Scan on idx_user_id
        Index Cond: (user_id = 5234)
Planning Time: 1.159 ms  |  Execution Time: 0.407 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You also get native support for range queries and composite indexes at no extra complexity – just combine columns as you normally would:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Indexed range query on generated timestamp column&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;event_42&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1700000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- Execution Time: 0.698 ms (vs 6.6 ms with GIN + post-filter)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;side-by-side-query-performance&quot;&gt;Side-by-Side: Query Performance&lt;/h2&gt;

&lt;p&gt;With all three approaches set up, here are the warm-cache query results averaged over 20 runs for an equality filter on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_id&lt;/code&gt;:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Approach&lt;/th&gt;
      &lt;th&gt;Avg (ms)&lt;/th&gt;
      &lt;th&gt;Min (ms)&lt;/th&gt;
      &lt;th&gt;Max (ms)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_ops + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;0.198&lt;/td&gt;
      &lt;td&gt;0.101&lt;/td&gt;
      &lt;td&gt;1.769&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_path_ops + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;0.197&lt;/td&gt;
      &lt;td&gt;0.032&lt;/td&gt;
      &lt;td&gt;3.115&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Expression index&lt;/td&gt;
      &lt;td&gt;0.106&lt;/td&gt;
      &lt;td&gt;0.018&lt;/td&gt;
      &lt;td&gt;1.705&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Generated column B-tree&lt;/td&gt;
      &lt;td&gt;0.112&lt;/td&gt;
      &lt;td&gt;0.016&lt;/td&gt;
      &lt;td&gt;1.839&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Expression indexes and generated columns perform very similarly for equality queries – both around 0.1 ms on a warm cache. That makes sense: the real work is done in the B-tree lookup, and both produce the same index structure. GIN with the correct &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt; operator is nearly as fast in PG 18.2 – still slightly slower than a B-tree for this access pattern, but the gap has narrowed. GIN lookups still require a recheck step that B-tree lookups avoid, and the variance remains notable: a GIN max of 3.1 ms vs a B-tree max of 1.8 ms on a warm cache.&lt;/p&gt;

&lt;p&gt;The more surprising result is what happens if the GIN index is present but the query is written with extraction-based equality:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- GIN index exists, but this query gets a seq scan:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- Execution Time: 47.935 ms (same as no index at all)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;GIN has no support for equality on an extracted value – its operator classes cover containment, key existence, and jsonpath matching, not the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;=&lt;/code&gt; comparison produced by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-&amp;gt;&amp;gt;&lt;/code&gt; extraction. This is by far the most common confusion teams run into with JSONB indexing.&lt;/p&gt;
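
&lt;p&gt;If the GIN index is the one you have, the fix is to phrase the predicate as containment rather than extraction – equivalent here as long as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_id&lt;/code&gt; is stored as a JSON number:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- rewritten into a form GIN can serve:
SELECT id FROM events WHERE data @&amp;gt; &apos;{&quot;user_id&quot;: 5234}&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;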

&lt;h2 id=&quot;the-full-cost-picture-storage-and-writes&quot;&gt;The Full Cost Picture: Storage and Writes&lt;/h2&gt;

&lt;h3 id=&quot;storage&quot;&gt;Storage&lt;/h3&gt;

&lt;p&gt;Here’s what the same 50,000 rows cost on disk under each approach:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Approach&lt;/th&gt;
      &lt;th&gt;Table size&lt;/th&gt;
      &lt;th&gt;Index size&lt;/th&gt;
      &lt;th&gt;Total&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Expression indexes (4)&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;3.5 MB&lt;/td&gt;
      &lt;td&gt;21 MB&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Generated columns + B-tree (4)&lt;/td&gt;
      &lt;td&gt;20 MB&lt;/td&gt;
      &lt;td&gt;3.5 MB&lt;/td&gt;
      &lt;td&gt;23 MB&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_path_ops&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;13 MB&lt;/td&gt;
      &lt;td&gt;31 MB&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_ops&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;36 MB&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Expression indexes and generated column B-tree indexes produce &lt;em&gt;identical&lt;/em&gt; index sizes for the same fields – this makes sense, since the index structures are the same; the only extra cost of generated columns is the 2 MB of additional stored column data in the table (~40 bytes per row for four typed columns). GIN indexes are substantially larger: 13–18 MB for a single index vs 3.5 MB for four targeted B-tree indexes. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb_path_ops&lt;/code&gt; variant is smaller because it only stores value hashes for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt; operator path, but it still dwarfs the targeted approach.&lt;/p&gt;

&lt;p&gt;One caveat: these numbers reflect documents with short keys and compact values. Documents with verbose key names, deeply nested structures, or large string values will inflate GIN indexes proportionally more – because GIN indexes every key path. B-tree and expression indexes are unaffected by document verbosity, since they only store the extracted value.&lt;/p&gt;
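
&lt;p&gt;These measurements come straight from the standard size functions, so they’re easy to reproduce on your own tables:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT pg_size_pretty(pg_table_size(&apos;events&apos;))          AS table_size,
       pg_size_pretty(pg_indexes_size(&apos;events&apos;))        AS index_size,
       pg_size_pretty(pg_total_relation_size(&apos;events&apos;)) AS total_size;

-- or per index:
SELECT pg_size_pretty(pg_relation_size(&apos;idx_gin&apos;));
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;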

&lt;h3 id=&quot;write-throughput&quot;&gt;Write Throughput&lt;/h3&gt;

&lt;p&gt;Here’s how 5 trials of 5,000 INSERTs each performed against a table already containing 50,000 rows:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Approach&lt;/th&gt;
      &lt;th&gt;Avg (ms)&lt;/th&gt;
      &lt;th&gt;Min (ms)&lt;/th&gt;
      &lt;th&gt;Max (ms)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Generated columns + B-tree (4)&lt;/td&gt;
      &lt;td&gt;157&lt;/td&gt;
      &lt;td&gt;91&lt;/td&gt;
      &lt;td&gt;317&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Expression indexes (4)&lt;/td&gt;
      &lt;td&gt;163&lt;/td&gt;
      &lt;td&gt;93&lt;/td&gt;
      &lt;td&gt;366&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_path_ops&lt;/td&gt;
      &lt;td&gt;171&lt;/td&gt;
      &lt;td&gt;73&lt;/td&gt;
      &lt;td&gt;408&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_ops&lt;/td&gt;
      &lt;td&gt;334&lt;/td&gt;
      &lt;td&gt;225&lt;/td&gt;
      &lt;td&gt;525&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Generated columns and expression indexes are very close in write cost, with generated columns slightly ahead on average, and GIN jsonb_path_ops is competitive with both. The default GIN jsonb_ops variant, however, is dramatically more expensive – roughly 2× slower than expression indexes or generated columns – because it must decompose the entire document into keys and values and insert an index entry for each one. The variance is also worth noting: a GIN jsonb_ops max of 525 ms vs 366 ms for expression indexes.&lt;/p&gt;
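
&lt;p&gt;The insert workload can be approximated with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;generate_series&lt;/code&gt; – a simplified sketch of the document shape, not the exact benchmark script:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- one trial: 5,000 synthetic event documents
INSERT INTO events (data)
SELECT jsonb_build_object(
         &apos;user_id&apos;,    (random() * 10000)::int,
         &apos;event_type&apos;, &apos;event_&apos; || (random() * 100)::int,
         &apos;timestamp&apos;,  1700000000 + g
       )
FROM generate_series(1, 5000) AS g;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;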

&lt;h2 id=&quot;choosing-the-right-approach&quot;&gt;Choosing the Right Approach&lt;/h2&gt;

&lt;p&gt;The benchmarks above tell a consistent story for workloads dominated by equality and range filters on a known set of fields:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Expression indexes&lt;/strong&gt; are the lowest-cost migration path. They add no schema structure, require no application changes to insert logic, and impose minimal write overhead. If your team already has a table in production and just needs to speed up a handful of known slow queries, a well-placed expression index is your first move. The catch: every query must exactly match the expression as written in the index definition, which can be fragile to maintain as codebases evolve.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Generated columns&lt;/strong&gt; take slightly more storage and impose more write overhead than expression indexes, but they offer something the others can’t: the extracted values become first-class columns. You can build composite indexes across them, reference them in views, expose them via ORMs, and sort or aggregate on them without embedding extraction logic everywhere. For new tables or for tables you’re willing to migrate, they’re the most maintainable long-term solution.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;&lt;strong&gt;GIN indexes&lt;/strong&gt; serve a different purpose. They’re the right tool when your query patterns are flexible or unknown – searching for the existence of a key, filtering on any field in an ad-hoc fashion, or supporting containment queries on arbitrarily-shaped documents. For those access patterns, they’re genuinely powerful and there’s no clean B-tree equivalent. But for consistent equality and range filters on known fields, they cost more in storage, impose higher write latency, and only accelerate their own operators (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt; and key existence, not &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;=&lt;/code&gt; on extracted values).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
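
&lt;p&gt;As a rough sketch of what each option looks like in DDL – assuming the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;events&lt;/code&gt; table and its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; column from the setup above; the index names are illustrative:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Expression index: no schema change, but queries must repeat the expression
CREATE INDEX idx_events_user_id_expr ON events (((data-&amp;gt;&amp;gt;&apos;user_id&apos;)::int));

-- Generated column: a first-class typed column derived from the JSONB
ALTER TABLE events
  ADD COLUMN user_id int GENERATED ALWAYS AS ((data-&amp;gt;&amp;gt;&apos;user_id&apos;)::int) STORED;
CREATE INDEX idx_events_user_id ON events (user_id);

-- GIN: one index covering ad-hoc containment and key-existence queries
CREATE INDEX idx_events_data_gin ON events USING GIN (data);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;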

&lt;p&gt;Here’s a rough decision guide:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Situation&lt;/th&gt;
      &lt;th&gt;Recommended approach&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Unknown or ad-hoc field queries&lt;/td&gt;
      &lt;td&gt;GIN (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;, key existence)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Known fields, few queries, no schema change&lt;/td&gt;
      &lt;td&gt;Expression index&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Known fields, high query volume, evolving codebase&lt;/td&gt;
      &lt;td&gt;Generated columns&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Known fields + range queries (e.g., timestamps)&lt;/td&gt;
      &lt;td&gt;Generated columns + composite B-tree&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mixed: some known fields + some ad-hoc&lt;/td&gt;
      &lt;td&gt;Generated columns + GIN (both)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;caveats-and-considerations&quot;&gt;Caveats and Considerations&lt;/h2&gt;

&lt;p&gt;Regardless of which approach you choose, a few things apply broadly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real win is making data typed and relational again.&lt;/strong&gt; Generated columns aren’t magic. The reason they (and expression indexes) outperform GIN for equality filters is that they produce typed scalar values with precise statistics, letting the planner make accurate row-count estimates and choose cheap comparison operations. JSONB is flexible but opaque; once you extract a field into a typed column or expression, Postgres can reason about it properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expression indexes require exact predicate matching.&lt;/strong&gt; An index on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(data-&amp;gt;&amp;gt;&apos;user_id&apos;)::int&lt;/code&gt; is only considered when the query’s predicate parses to the same expression. (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CAST(... AS INT)&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::int&lt;/code&gt; are interchangeable – Postgres normalizes both to the same parse tree – but a text comparison like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data-&amp;gt;&amp;gt;&apos;user_id&apos; = &apos;42&apos;&lt;/code&gt; is a different expression and won’t use the index.) Generated columns avoid this fragility – any query that references the column name will benefit.&lt;/p&gt;
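
&lt;p&gt;A quick illustration (the index and table names are hypothetical):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE INDEX idx_events_user_id_expr ON events (((data-&amp;gt;&amp;gt;&apos;user_id&apos;)::int));

-- Can use the index: the predicate parses to the indexed expression
SELECT * FROM events WHERE (data-&amp;gt;&amp;gt;&apos;user_id&apos;)::int = 42;

-- Cannot use it: a text comparison is a different expression
SELECT * FROM events WHERE data-&amp;gt;&amp;gt;&apos;user_id&apos; = &apos;42&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;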

&lt;p&gt;&lt;strong&gt;Generated column expressions must be immutable.&lt;/strong&gt; The expression cannot reference functions that depend on time, session state, or anything external. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOW()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CURRENT_USER&lt;/code&gt;, and similar functions are off-limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generated columns cannot be directly updated.&lt;/strong&gt; Their value is always derived from the source column. If you UPDATE the JSONB &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt;, the generated columns recompute automatically.&lt;/p&gt;
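
&lt;p&gt;For example – assuming a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_id&lt;/code&gt; generated column derived from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; (error message paraphrased):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;UPDATE events SET user_id = 99;
-- ERROR:  column &quot;user_id&quot; can only be updated to DEFAULT

-- Updating the JSONB recomputes the generated column automatically
UPDATE events SET data = jsonb_set(data, &apos;{user_id}&apos;, &apos;99&apos;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;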

&lt;p&gt;&lt;strong&gt;GIN maintenance overhead compounds on write-heavy tables.&lt;/strong&gt; GIN indexes build an internal pending list and flush it periodically (controlled by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gin_pending_list_limit&lt;/code&gt;). Under sustained write load, this flushing can cause the latency spikes visible in the benchmark max values above. B-tree indexes don’t have this mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;These benchmarks cover one dataset shape and one machine.&lt;/strong&gt; At much larger row counts (hundreds of millions), cache-miss behavior and index bloat will dominate – relative rankings should hold, but absolute numbers will differ. When in doubt, benchmark on your own data before committing to a migration.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;For workloads dominated by equality and range filters on a predictable set of JSONB fields, the data is clear: B-tree indexes on typed values – whether via expression indexes or generated columns – outperform GIN both on read latency and write throughput. GIN’s strength is flexibility, not speed for known-field access patterns; when you know exactly which fields you’ll filter on, a targeted B-tree beats the GIN every time.&lt;/p&gt;

&lt;p&gt;If you’re starting from scratch or are willing to migrate a table, generated columns are the most maintainable path. They make your frequently-queried fields easily accessible, eliminate JSONB extraction logic from your application’s query layer, and support composite indexes and range queries naturally. If you need to add indexing to an existing table without a schema change, expression indexes get you 90% of the way there with a fraction of the write overhead.&lt;/p&gt;

&lt;p&gt;GIN still belongs in your toolkit – but for the right job: ad-hoc containment searches, key-existence checks, and cases where the query patterns genuinely vary by document. For everything else, make your JSONB fields relational.&lt;/p&gt;
</description>
        <pubDate>Mon, 11 May 2026 06:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/05/11/generated_columns_jsonb.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/05/11/generated_columns_jsonb.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>jsonb</category>
        
        <category>generated</category>
        
        <category>columns</category>
        
        <category>indexing</category>
        
        <category>performance</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>Potential Consequences of Using Postgres as a Job Queue</title>
        <description>&lt;p&gt;&lt;em&gt;This post was originally published on the &lt;a href=&quot;https://techcommunity.microsoft.com/blog/adforpostgresql/potential-consequences-of-using-postgres-as-a-job-queue/4514332&quot;&gt;Microsoft Tech Community Blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;At small scale, using Postgres as a job queue is totally fine, and I’d even say it’s the right call.  Fewer moving parts, one less system to manage, ACID guarantees on your jobs.  What’s not to love?&lt;/p&gt;

&lt;p&gt;The problem is that “small scale” has a ceiling, and the ceiling is lower than most people expect.  When you’ve got thousands of concurrent workers hammering a jobs table with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt;, things start to behave in ways that aren’t obvious from the application layer.  CPU usage creeps up, vacuum can’t keep up, and in the wait event stats you start seeing ominous entries like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LWLock:MultiXactOffsetSLRU&lt;/code&gt; stacking up across many backends.&lt;/p&gt;

&lt;p&gt;This pattern has tripped up teams more than a few times, and it usually plays out the same way: everything works fine in dev and staging, then goes off a cliff in production once the concurrency gets real.  So let’s dig into why this happens, and what the alternatives look like.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-typical-pattern&quot;&gt;The Typical Pattern&lt;/h2&gt;

&lt;p&gt;When using Postgres as a job queue, the standard approach looks something like this:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;         &lt;span class=&quot;n&quot;&gt;bigserial&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;     &lt;span class=&quot;nb&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DEFAULT&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;payload&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;jsonb&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timestamptz&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DEFAULT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;locked_by&lt;/span&gt;  &lt;span class=&quot;nb&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;locked_at&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;timestamptz&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_job_queue_status&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Workers grab jobs with:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;UPDATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;processing&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;locked_by&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;worker-42&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;locked_at&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;FOR&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UPDATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SKIP&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LOCKED&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
 &lt;span class=&quot;n&quot;&gt;RETURNING&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And then mark them done:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;UPDATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;completed&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Some users may &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt; the row entirely.  Either way, the lifecycle is: insert, lock-and-update, update-or-delete.  Repeated thousands of times per second.&lt;/p&gt;

&lt;p&gt;At low concurrency, this works very smoothly.  &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; means workers don’t block each other waiting for the same row.  Postgres handles the locking, visibility, and ordering.  It’s elegant.&lt;/p&gt;

&lt;p&gt;So where does it break?&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-multixact-slru-problem&quot;&gt;The MultiXact SLRU Problem&lt;/h2&gt;

&lt;p&gt;When multiple transactions hold locks on the same row, Postgres stores the set of lockers as a MultiXact ID – a pointer into a side structure under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_multixact/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt;, users might think MultiXacts aren’t involved – after all, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; is supposed to avoid contention.  But in practice, with many concurrent workers all racing to lock rows, there are brief windows where multiple transactions reference the same row before one of them “wins” and the others skip.  If you combine this with any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR SHARE&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR KEY SHARE&lt;/code&gt; locks (which are commonly created implicitly by foreign key checks), MultiXact IDs start accumulating quickly.&lt;/p&gt;

&lt;p&gt;The MultiXact data lives in SLRU buffers (Simple Least Recently Used) – a small, fixed-size shared memory cache.  When backends need to read or write MultiXact data, they acquire LWLocks to access these buffers.  Under high concurrency, this becomes a bottleneck:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event_type | wait_event
-----------------+-------------------
LWLock          | MultiXactMemberSLRU
LWLock          | MultiXactOffsetSLRU
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You’ll see dozens or hundreds of backends piled up on these waits.  The SLRU cache is small (by design – it’s a fixed number of pages in shared memory), and when the working set of MultiXact lookups exceeds what fits in the cache, you get constant eviction and re-reads from disk.  Every lock acquisition and release on a job row potentially triggers a MultiXact SLRU lookup, and at thousands of concurrent sessions, those lookups serialize on LWLocks.&lt;/p&gt;

&lt;p&gt;The result: CPU gets pegged, throughput collapses, and latency spikes – not because the queries are expensive, but because the locking infrastructure itself is overwhelmed.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;bloat-the-silent-killer&quot;&gt;Bloat: The Silent Killer&lt;/h2&gt;

&lt;p&gt;The other side of this coin is table and index bloat.  Every job row goes through multiple updates (and possibly a delete), and each of those operations creates a new tuple version in the heap.  The old versions stick around until &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VACUUM&lt;/code&gt; cleans them up.&lt;/p&gt;

&lt;p&gt;On a busy job queue table:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Dead tuples accumulate faster than autovacuum can clean them.&lt;/strong&gt;  By the time autovacuum finishes one pass, tens of thousands of new dead tuples have appeared.  The table grows and grows.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Index bloat compounds the problem.&lt;/strong&gt;  Every index on the table also accumulates dead entries.  The partial index on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;status = &apos;pending&apos;&lt;/code&gt; gets thrashed especially hard, since rows constantly enter and leave that condition.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scans get slower.&lt;/strong&gt;  As the table bloats, both sequential and index scans do more I/O because the live rows are spread thinly across the heap pages.  Vacuum can return space to the operating system only by truncating empty pages at the end of the table; space freed in the middle of the table is merely marked reusable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Job queue tables can grow to tens of gigabytes while the actual “live” data is only a few megabytes.  That bloat makes everything slower: scans, vacuum, even &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_dump&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can mitigate this by running vacuum more aggressively (lower &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_vacuum_scale_factor&lt;/code&gt;, higher &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_vacuum_cost_limit&lt;/code&gt;), or by partitioning the table and dropping old partitions.  But at some point, you’re fighting the fundamental mismatch between MVCC’s design goals and the write pattern of a job queue.&lt;/p&gt;
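
&lt;p&gt;For reference, per-table overrides might look like this (the values are illustrative, not recommendations):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Vacuum after ~1% of rows are dead instead of the 20% default,
-- and let each vacuum pass do more work before sleeping
ALTER TABLE job_queue SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_cost_limit   = 2000
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;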

&lt;hr /&gt;

&lt;h2 id=&quot;cpu-and-lock-overhead&quot;&gt;CPU and Lock Overhead&lt;/h2&gt;

&lt;p&gt;Beyond the SLRU contention and bloat, there’s just the raw overhead of using Postgres’s full transactional machinery for what is essentially a FIFO dispatch operation:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Every lock/unlock is a full WAL-logged transaction.&lt;/strong&gt;  Grabbing a job writes WAL.  Marking it complete writes WAL.  Deleting it writes WAL.  On a system processing thousands of jobs per second, the WAL volume from the job queue alone can saturate your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wal_writer&lt;/code&gt; and checkpoint processes.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; still touches rows.&lt;/strong&gt;  The name suggests rows are skipped, but Postgres still has to &lt;em&gt;find&lt;/em&gt; them, check their lock status, and move on.  With high concurrency, many workers end up scanning past the same locked rows before finding one they can claim.  This is wasted CPU.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Snapshot management overhead also becomes an issue.&lt;/strong&gt;  Each transaction needs a consistent snapshot, and with thousands of concurrent transactions, the ProcArray (the structure that tracks active transactions) becomes a contention point itself.  You might see &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LWLock:ProcArrayLock&lt;/code&gt; waits alongside the MultiXact ones.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Vacuum contention.&lt;/strong&gt;  While vacuum is cleaning up dead tuples, it needs locks too.  On a table under constant write pressure, vacuum can interfere with the workers and vice versa.  I’ve seen systems where disabling autovacuum on the job queue table improved throughput in the short term.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;better-alternatives&quot;&gt;Better Alternatives&lt;/h2&gt;

&lt;p&gt;So what should you use instead?  It depends on your requirements, but there are several options that handle high-throughput job dispatch more gracefully than a Postgres table.&lt;/p&gt;

&lt;h3 id=&quot;advisory-locks-staying-in-postgres&quot;&gt;Advisory Locks (Staying in Postgres)&lt;/h3&gt;

&lt;p&gt;If you want to stay within Postgres and avoid adding infrastructure, advisory locks are worth considering for certain queue patterns.  Instead of locking rows, you lock on an abstract numeric key:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Worker tries to acquire a lock on the job ID&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pg_try_advisory_lock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;claimed&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Advisory locks are lightweight – they don’t touch the heap, don’t create MultiXact entries, and don’t generate dead tuples.  They live entirely in shared memory.  The trade-off is that you lose the atomicity of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR UPDATE SKIP LOCKED&lt;/code&gt;: you need to handle the case where a lock is acquired but the job processing fails, and you need to release the lock explicitly (or rely on session-end cleanup).&lt;/p&gt;
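
&lt;p&gt;Releasing is a single call with the same key (the job ID here is illustrative):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- When processing finishes (or fails), release the lock explicitly;
-- session-level advisory locks are otherwise held until disconnect
SELECT pg_advisory_unlock(42);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;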

&lt;p&gt;This approach works well when the queue depth is manageable and you want to avoid the MVCC overhead.  But it’s still Postgres, so you’re still subject to connection limits, ProcArray overhead, and general resource contention at very high session counts.&lt;/p&gt;

&lt;h3 id=&quot;pgq-skytools&quot;&gt;pgq (Skytools)&lt;/h3&gt;

&lt;p&gt;pgq is purpose-built for exactly this problem.  It’s a queue implementation that sits inside Postgres but uses a batching model that avoids most of the row-level locking and MVCC pitfalls.  Events are written to a queue table, but consumers read them in batches and the queue maintenance is done via a ticker process that manages rotation.&lt;/p&gt;

&lt;p&gt;The key advantages:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;No row-level contention.  Consumers don’t lock individual rows.&lt;/li&gt;
  &lt;li&gt;Built-in batch processing.  Events are consumed in chunks, reducing transaction overhead.&lt;/li&gt;
  &lt;li&gt;Efficient cleanup.  Old events are rotated out rather than vacuumed row-by-row.&lt;/li&gt;
&lt;/ul&gt;
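
&lt;p&gt;A consumer loop sketch – the function names come from the Skytools &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgq&lt;/code&gt; schema, while the queue and consumer names are illustrative:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- one-time setup
SELECT pgq.create_queue(&apos;jobs&apos;);
SELECT pgq.register_consumer(&apos;jobs&apos;, &apos;worker_1&apos;);

-- main loop: grab a batch, process its events, then mark the batch done
SELECT pgq.next_batch(&apos;jobs&apos;, &apos;worker_1&apos;);  -- returns a batch_id, or NULL if nothing new
SELECT * FROM pgq.get_batch_events(:batch_id);
SELECT pgq.finish_batch(:batch_id);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;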

&lt;p&gt;The downside is that pgq is not as actively maintained as it once was, and it adds operational complexity (the ticker daemon, consumer registration, etc.).  But for teams already deep in the Postgres ecosystem, it’s a battle-tested option.&lt;/p&gt;

&lt;h3 id=&quot;pgque&quot;&gt;PgQue&lt;/h3&gt;

&lt;p&gt;Coincidentally, during the writing of this post, &lt;a href=&quot;https://github.com/NikolayS/pgque&quot;&gt;Nikolay Samokhvalov has built PgQue&lt;/a&gt;, which is a derivative of pgq.  Like pgq, it sits inside Postgres, but ships as a single SQL file – no C extension and no external daemon – making it deployable on managed services like RDS, Aurora, Cloud SQL, AlloyDB, Supabase, and Neon.  Producers &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; events into rotating event tables (recycled via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TRUNCATE&lt;/code&gt; instead of row-by-row deletion), and consumers read batches by diffing two &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_snapshot&lt;/code&gt; values captured by a periodic ticker – so the hot path contains zero &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt;s, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt;s, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt;, and therefore produces no dead tuples on the event tables.  For a deeper dive into the algorithm, see &lt;a href=&quot;https://thebuild.com/blog/2026/05/03/pgque-two-snapshots-and-a-diff/&quot;&gt;Christophe Pettus’s writeup&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;redis&quot;&gt;Redis&lt;/h3&gt;

&lt;p&gt;For many teams, Redis is the natural choice for job queues.  Using Redis lists (BRPOPLPUSH or the Streams API), you get:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Sub-millisecond dispatch latency.  No disk I/O, no MVCC, no vacuum.&lt;/li&gt;
  &lt;li&gt;Atomic pop operations.  Workers grab jobs without any locking protocol.&lt;/li&gt;
  &lt;li&gt;Simple scaling.  Redis handles thousands of concurrent consumers trivially.&lt;/li&gt;
&lt;/ul&gt;
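
&lt;p&gt;The list-based “reliable queue” pattern looks roughly like this (key names and payload are illustrative; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BLMOVE&lt;/code&gt; supersedes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BRPOPLPUSH&lt;/code&gt; as of Redis 6.2):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# producer enqueues a job
LPUSH jobs &apos;{&quot;id&quot;: 7, &quot;task&quot;: &quot;send_email&quot;}&apos;

# worker atomically moves a job to a processing list, blocking until one arrives
BLMOVE jobs processing RIGHT LEFT 0

# worker acknowledges after success by removing the entry
LREM processing 1 &apos;{&quot;id&quot;: 7, &quot;task&quot;: &quot;send_email&quot;}&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;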

&lt;p&gt;The trade-off is durability.  Redis can persist to disk, but it’s not ACID.  If Redis crashes between a pop and the job completing, you might lose or duplicate work (though Redis Streams with consumer groups mitigate this significantly).  For most job queue use cases, at-least-once delivery is acceptable, and Redis does that well.&lt;/p&gt;

&lt;h3 id=&quot;kafka&quot;&gt;Kafka&lt;/h3&gt;

&lt;p&gt;For truly high-throughput, distributed workloads, Apache Kafka is the heavyweight option.  Kafka partitions give you parallel consumption with ordering guarantees per partition, durable storage, and replay capability.  It’s the right tool when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;You need to process thousands of events per second&lt;/li&gt;
  &lt;li&gt;Multiple consumers need to read the same events&lt;/li&gt;
  &lt;li&gt;You want event replay or audit trails&lt;/li&gt;
  &lt;li&gt;Your architecture is already event-driven&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operational overhead is nontrivial – ZooKeeper (or KRaft), brokers, topic management, consumer group coordination.  But for teams already running Kafka for other reasons, adding a job queue topic is practically free.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;choosing-the-right-tool&quot;&gt;Choosing the Right Tool&lt;/h2&gt;

&lt;p&gt;Here’s a rough decision guide:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Scenario&lt;/th&gt;
      &lt;th&gt;Recommendation&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Under 100 concurrent workers, simple jobs&lt;/td&gt;
      &lt;td&gt;Postgres with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; is fine&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Moderate concurrency, want to stay in Postgres&lt;/td&gt;
      &lt;td&gt;Advisory locks or pgq&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;High throughput, low-latency dispatch&lt;/td&gt;
      &lt;td&gt;Redis (Lists or Streams)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Massive scale, distributed, event replay&lt;/td&gt;
      &lt;td&gt;Kafka&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Many teams that start with Postgres (reasonably) hit scaling problems and then try to fix Postgres rather than recognizing that the workload has outgrown the tool.  They throw more autovacuum workers at it, increase &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_connections&lt;/code&gt;, add connection poolers – all of which help at the margins, but don’t address the fundamental issue: Postgres’s MVCC and locking machinery wasn’t designed for this access pattern at high concurrency.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Postgres is great, but it can’t be the best tool for every job.  Using it as a job queue is a perfectly valid choice when your scale is modest.  But when you’re running thousands of concurrent workers, the combination of MultiXact SLRU contention, heap bloat, vacuum pressure, and raw locking overhead will eventually push you toward a purpose-built solution.&lt;/p&gt;

&lt;p&gt;The good news is that you don’t have to rip out everything.  Advisory locks can buy you headroom without adding infrastructure.  Redis can handle dispatch while Postgres keeps owning the data.  And if you’re already using Kafka, a job topic is a natural fit.  Take your pick – there are many queueing options out there!&lt;/p&gt;
</description>
        <pubDate>Mon, 04 May 2026 06:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/05/04/postgres_job_queue.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/05/04/postgres_job_queue.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>scaling</category>
        
        <category>job-queue</category>
        
        <category>multixact</category>
        
        <category>lwlock</category>
        
        <category>advisory-locks</category>
        
        <category>redis</category>
        
        <category>kafka</category>
        
        <category>pgq</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>Understanding Bitmap Heap Scans in PostgreSQL</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;When people first start reading PostgreSQL execution plans, they quickly learn a few common scan types: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Seq Scan&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Index Scan&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Index Only Scan&lt;/code&gt;.  But eventually another one appears that is less obvious: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bitmap Heap Scan&lt;/code&gt;, which is almost always accompanied by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bitmap Index Scan&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At first glance, it sounds like two scans on the same table – surely an inefficient choice?  But bitmap scans are actually one of the planner’s most practical tools for balancing random I/O against sequential access.  Understanding how they work can make execution plans much easier to interpret, so we’ll dive into that a little bit today.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;the-basic-idea&quot;&gt;The Basic Idea&lt;/h1&gt;

&lt;p&gt;A bitmap scan is a two-step process:&lt;/p&gt;

&lt;p&gt;Step 1: Build a bitmap of matching rows using one or more indexes.&lt;/p&gt;

&lt;p&gt;Step 2: Visit the heap pages containing those rows referenced in the bitmap.&lt;/p&gt;

&lt;p&gt;In an execution plan this usually appears as:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan on orders
-&amp;gt; Bitmap Index Scan on orders_customer_id_idx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The important part is that the index lookup and heap access are separated.  This separation lets the executor sort the heap accesses into physical order before visiting any pages, and it means execution plans report costs and actuals for the index phase and the heap phase as separate nodes.&lt;/p&gt;
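
&lt;p&gt;To make this concrete, here is roughly what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt; output for a bitmap scan looks like – the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orders&lt;/code&gt; table and all of the numbers here are made up for illustration:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

Bitmap Heap Scan on orders  (cost=12.17..857.41 rows=450 width=97)
                            (actual time=0.210..1.890 rows=438 loops=1)
  Recheck Cond: (customer_id = 42)
  Heap Blocks: exact=301
  -&amp;gt;  Bitmap Index Scan on orders_customer_id_idx
        (cost=0.00..12.06 rows=450 width=0)
        (actual time=0.131..0.131 rows=438 loops=1)
        Index Cond: (customer_id = 42)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice that the index phase and the heap phase each report their own timing and row counts.&lt;/p&gt;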

&lt;hr /&gt;

&lt;h1 id=&quot;why-not-just-use-an-index-scan&quot;&gt;Why Not Just Use an Index Scan?&lt;/h1&gt;

&lt;p&gt;With a normal index scan, the query executor does something like this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Find a matching entry in the index&lt;/li&gt;
  &lt;li&gt;Jump to the heap page&lt;/li&gt;
  &lt;li&gt;Fetch the row&lt;/li&gt;
  &lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the query returns only a few rows, this works well.  But if the query returns thousands of rows scattered across the table, the database ends up doing many random heap fetches.  Random I/O is expensive, and this scattered access pattern is exactly the problem bitmap scans are designed to solve.&lt;/p&gt;
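
&lt;p&gt;If you want to see the difference on your own data, you can disable bitmap scans for a session and compare plans – a diagnostic trick only, not something to set in production (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orders&lt;/code&gt; table is again hypothetical):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

SET enable_bitmapscan = off;
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

RESET enable_bitmapscan;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;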

&lt;hr /&gt;

&lt;h1 id=&quot;how-the-bitmap-is-built&quot;&gt;How the Bitmap Is Built&lt;/h1&gt;

&lt;p&gt;During the Bitmap Index Scan phase, the executor does not immediately fetch rows.  Instead it records which heap pages contain matching rows.  Conceptually, the structure looks like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Page 101 -&amp;gt; rows 2, 7
Page 205 -&amp;gt; rows 1, 3, 8
Page 410 -&amp;gt; row 5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These page references are stored as a bitmap structure in memory.  Once the bitmap is complete, the executor can visit heap pages in physical order rather than jumping around randomly.  Visiting heap pages in physical order means less random I/O and therefore less latency.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;multiple-indexes-can-be-combined&quot;&gt;Multiple Indexes Can Be Combined&lt;/h1&gt;

&lt;p&gt;One particularly powerful feature is that bitmap scans allow the query planner to combine multiple indexes.  For example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WHERE status = &apos;active&apos;
AND created_at &amp;gt;= &apos;2025-01-01&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The plan might look like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan
-&amp;gt; BitmapAnd
-&amp;gt; Bitmap Index Scan on status_idx
-&amp;gt; Bitmap Index Scan on created_at_idx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each index produces a bitmap, and the planner combines them using logical operations, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BitmapAnd&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BitmapOr&lt;/code&gt;.  This allows the planner to efficiently use multiple indexes even when a single composite index does not exist.&lt;/p&gt;
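
&lt;p&gt;Reproducing this is straightforward – two independent single-column indexes are enough (the table and index names here are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CREATE INDEX status_idx ON orders (status);
CREATE INDEX created_at_idx ON orders (created_at);

EXPLAIN
SELECT * FROM orders
WHERE status = &apos;active&apos;
  AND created_at &amp;gt;= &apos;2025-01-01&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;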

&lt;hr /&gt;

&lt;h1 id=&quot;when-does-the-planner-chooses-bitmap-scans&quot;&gt;When Does the Planner Choose Bitmap Scans?&lt;/h1&gt;

&lt;p&gt;The planner usually prefers bitmap scans in situations where the query returns more rows than a typical index scan, but not enough rows to justify a full sequential scan.  In other words, bitmap scans often appear in the middle selectivity range.&lt;/p&gt;

&lt;p&gt;Very roughly:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Selectivity&lt;/th&gt;
      &lt;th&gt;Likely Plan&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Very small&lt;/td&gt;
      &lt;td&gt;Index Scan&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Medium&lt;/td&gt;
      &lt;td&gt;Bitmap Heap Scan&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Very large&lt;/td&gt;
      &lt;td&gt;Seq Scan&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This is not a strict rule, but it helps explain the planner’s reasoning.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;pros-and-cons&quot;&gt;Pros and Cons&lt;/h1&gt;

&lt;p&gt;As with everything in databases, there’s no free lunch.  Here are some advantages and disadvantages of bitmap scans:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Advantages of Bitmap Heap Scans
    &lt;ul&gt;
      &lt;li&gt;Reduced Random I/O: By grouping heap page accesses, bitmap scans avoid excessive random disk reads.&lt;/li&gt;
      &lt;li&gt;Ability to Combine Indexes: Bitmap operations allow the query planner to use multiple independent indexes efficiently.&lt;/li&gt;
      &lt;li&gt;Better Performance for Medium Selectivity: Queries returning thousands of rows often benefit from bitmap access patterns.&lt;/li&gt;
      &lt;li&gt;Predictable Heap Access: Because heap pages are visited in order, caching behavior tends to improve.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Disadvantages of Bitmap Heap Scans
    &lt;ul&gt;
      &lt;li&gt;Memory Usage: The bitmap structure is stored in memory.  If the result set becomes too large, the query executor may switch to a lossy bitmap, where only page-level information is stored.  This can cause additional filtering work later.&lt;/li&gt;
      &lt;li&gt;Two-Phase Execution: Because the bitmap must be built before heap access begins, the query cannot stream rows immediately.  This can increase latency for queries expecting early rows.&lt;/li&gt;
      &lt;li&gt;Extra CPU Work: Maintaining and combining bitmap structures adds overhead compared to simple index scans.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;lossy-bitmaps&quot;&gt;Lossy Bitmaps&lt;/h1&gt;

&lt;p&gt;When memory limits are reached, the query executor may degrade the bitmap representation.  Instead of tracking individual tuple offsets, it only records:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Page 205 -&amp;gt; possible matches
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;During the heap scan, the executor must then recheck every row on those pages against the original condition.  Note that a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Recheck Cond&lt;/code&gt; line appears on every Bitmap Heap Scan, but the recheck is actually performed only for lossy pages; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt; shows how many pages degraded in a line like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Heap Blocks: exact=112 lossy=4387&lt;/code&gt;.  The results are still correct, but a lossy bitmap reduces efficiency.&lt;/p&gt;
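
&lt;p&gt;You can often reproduce a lossy bitmap by shrinking &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; (which caps the bitmap’s memory) before running a query that matches many rows – again on a hypothetical table:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SET work_mem = &apos;64kB&apos;;

EXPLAIN ANALYZE SELECT * FROM orders WHERE created_at &amp;gt;= &apos;2025-01-01&apos;;
-- look for a line like:  Heap Blocks: exact=112 lossy=4387

RESET work_mem;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;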

&lt;hr /&gt;

&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;Bitmap heap scans are one of the planner’s most practical optimization tools, as they allow the database to reduce random I/O, combine multiple indexes, and handle medium-sized result sets efficiently.&lt;/p&gt;

&lt;p&gt;While they may look complicated at first, the core idea is simple: Find matching rows first, then fetch heap pages efficiently.  What a great concept!&lt;/p&gt;
</description>
        <pubDate>Mon, 27 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/27/bitmap_heap_scan.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/27/bitmap_heap_scan.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>query-planner</category>
        
        <category>indexing</category>
        
        
        
      </item>
    
      <item>
        <title>The Postgres Performance Triangle</title>
        <description>&lt;p&gt;Everyone who’s gone at least knee-deep in  photography knows there’s this idea of the &lt;em&gt;exposure triangle&lt;/em&gt;: aperture, shutter speed, and ISO. Depending on what you’re going for artistically, you adjust the three parameters, knowing that there are trade-offs in doing so.  After working on a few cases, and presenting solutions to customers, I’ve started to think about Postgres performance tuning in a similar way – there are basic parameters that can be tuned, and there are trade-offs for the choices DBAs make:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Memory Allocation&lt;/li&gt;
  &lt;li&gt;Disk I/O&lt;/li&gt;
  &lt;li&gt;Concurrency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these (in broad strokes) affects throughput – how much work your system gets done.&lt;/p&gt;

&lt;p&gt;Caveat: I know that in the academic sense, “throughput” doesn’t quite capture the balance of these concepts, but please bear with me!&lt;/p&gt;

&lt;p&gt;Let’s talk about how each of these three work together with the whole system, and what the trade-offs look like.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;memory-allocation&quot;&gt;Memory Allocation&lt;/h2&gt;

&lt;p&gt;When you increase memory allocation in Postgres, whether it’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shared_buffers&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt;, things tend to feel smoother.  Most notably, queries spill to disk less often, sorts and joins stay in memory, and cache hit rates improve.  But there’s a trade-off that’s easy to miss at first, especially with these two parameters.  A single complex query can consume multiple chunks of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; (see &lt;a href=&quot;https://mydbanotebook.org/posts/work_mem-its-a-trap/&quot;&gt;Laetitia’s excellent post about it&lt;/a&gt;). Multiply that across concurrent queries, and you begin to see the OS consuming swap space, churning at checkpoints, and even the OOM Killer getting invoked.  So while more memory &lt;em&gt;can&lt;/em&gt; make things faster, it also quietly reduces how much concurrency your system can safely handle.&lt;/p&gt;
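
&lt;p&gt;You can watch the spill behavior directly in a session: set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; low, run a big sort under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt;, and then raise it (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;events&lt;/code&gt; table here is just a stand-in):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SET work_mem = &apos;1MB&apos;;
EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
-- Sort Method: external merge  Disk: ...kB

SET work_mem = &apos;256MB&apos;;
EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
-- Sort Method: quicksort  Memory: ...kB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;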

&lt;p&gt;I’d relate this to aperture – you can throw money at some fast glass, but you also get shallower depth of field (in an annoying way).&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;disk-io&quot;&gt;Disk I/O&lt;/h2&gt;

&lt;p&gt;Disk is where things go when memory isn’t enough, or when an access pattern requires it.  We see examples of this in sequential scans, random index lookups, and temporary files from sorts or hashes.  Lowering &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; might increase disk I/O due to sorts spilling to temp files, for example.  We can try to minimize disk I/O by adding indexes, increasing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt;, or simply rewriting queries.&lt;/p&gt;

&lt;p&gt;Another way we can try to affect disk I/O is to tinker with the costs, to encourage the query planner to choose one scan method over the other.  In any case, our attempts to balance disk I/O and memory usage can be pretty straightforward at first, but could become complicated at scale.  That’s where partitioning and read-only replicas come in, but I’m beginning to digress…&lt;/p&gt;
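
&lt;p&gt;As one example of cost tinkering: on SSD or NVMe storage, many DBAs lower &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;random_page_cost&lt;/code&gt; from its spinning-disk default so the planner is less afraid of random index access – measure on a representative query before adopting it:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SHOW random_page_cost;        -- defaults to 4.0, a spinning-disk assumption

SET random_page_cost = 1.1;   -- a common starting point for SSD/NVMe
-- re-run EXPLAIN on a representative query and compare the plans
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;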

&lt;p&gt;Indexes, in particular, are where things start to get interesting.  Adding an index can feel like an easy win, as it leads to fewer rows scanned and less CPU work per query, along with less disk activity, but there are trade-offs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; will update every relevant index&lt;/li&gt;
  &lt;li&gt;Every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; can potentially rewrite index entries&lt;/li&gt;
  &lt;li&gt;Every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt; leaves behind cleanup work (vacuum)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At scale, we also see other effects:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Indexes get large&lt;/li&gt;
  &lt;li&gt;Cache hit rates drop (because there’s more to cache)&lt;/li&gt;
  &lt;li&gt;Random I/O increases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So an index that helps one query might quietly make others worse, or make writes more expensive.&lt;/p&gt;

&lt;p&gt;It’s like raising ISO to compensate for low light. You get the shot, but the noise shows up somewhere else.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;concurrency&quot;&gt;Concurrency&lt;/h2&gt;

&lt;p&gt;So far, this has all been somewhat per-query. But things change when you introduce concurrency.  In a high-demand service, the instinct is to increase &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_connections&lt;/code&gt; to allow the service to scale up, but in my experience there’s a price to pay for this kind of concurrency.  Some people fail to notice that each connection brings its own memory usage, takes up a spot in Postgres’ internal data structures, and puts the system at risk for increased CPU demand and resource contention.&lt;/p&gt;

&lt;p&gt;In the photography analogy, you can turn down the ISO very low on a bright and sunny day, but that won’t be enough.  Soon, you’ll be closing the aperture and increasing the shutter speed, and then you lose your ability to create the artistic feel that you’re actually trying to go for.  So what do photographers do?  They use an ND filter to limit how much light hits the sensor.&lt;/p&gt;

&lt;p&gt;In Postgres, that “ND filter” is something like a connection pooler, such as &lt;a href=&quot;https://www.pgbouncer.org/&quot;&gt;PgBouncer&lt;/a&gt;.  Instead of letting thousands of connections compete for CPU, you cap active queries, allocate more resources to each actual DB session, and trade a bit of latency for stability.  Sometimes, to keep your throughput, you need some additional accessories.&lt;/p&gt;
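
&lt;p&gt;A minimal sketch of that idea in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgbouncer.ini&lt;/code&gt; – the database name and pool sizes are placeholders, so consult the PgBouncer docs before copying:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
pool_mode = transaction    ; hand back server connections at transaction end
max_client_conn = 2000     ; how many clients the pooler will accept
default_pool_size = 20     ; actual Postgres connections per database/user
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;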

&lt;hr /&gt;

&lt;h2 id=&quot;the-art-of-postgres&quot;&gt;The Art of Postgres&lt;/h2&gt;

&lt;p&gt;As a DBA, you can calculate optimal index usage, memory sizing, and expected I/O patterns, but those calculations tend to assume a steady state.  Every DBA knows that real production systems are always changing, due to traffic patterns, scaling, and new features getting rolled out on the application side.  As the organization changes, keeping the database performant depends on the DBA being both a Database Administrator and a Database Artist – working with internal teams to know which indexes to add or drop, how much concurrency to allow, and how to allocate memory without running out of it.&lt;/p&gt;

&lt;p&gt;Instead of asking, “What’s the optimal configuration?” it might be more useful to ask these questions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Where is my system currently paying the cost—memory, disk, or CPU?&lt;/li&gt;
  &lt;li&gt;If I relieve pressure here, where does it move?&lt;/li&gt;
  &lt;li&gt;How much can we tolerate that new pressure?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Costs don’t disappear – they just shift – and it’s the DBA’s job to help decision-makers decide where to shift them.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;There’s more to photography than exposure – there’s composition, color-correction, external lighting, and so much more.  In the same way, this discussion has just been one part of database administration.  There’s so much more to go over, in terms of creating a robust and scalable database.  I wanted to highlight this topic because I do find that some users tend to approach database architecture without considering all the trade-offs.  We can definitely get the database to perform well, but there’s no one-size-fits-all solution for every situation.  It takes thought, planning, testing, and discussion with stakeholders to come up with a good solution.&lt;/p&gt;
</description>
        <pubDate>Mon, 20 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/20/throughput_triangle.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/20/throughput_triangle.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        
        
      </item>
    
      <item>
        <title>Understanding PostgreSQL Wait Events</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;One of the most useful debugging tools in modern PostgreSQL is the wait event system.  When a query slows down or a database becomes CPU bound, a natural question is: “What are sessions actually waiting on?” Postgres exposes this information through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_activity&lt;/code&gt; view via two columns:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event_type
wait_event
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These fields reveal what the backend process is blocked on at a given moment.  Among the different wait types, one category tends to cause confusion:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;LWLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you’ve ever seen dashboards full of LWLock waits, you’re not alone in wondering what they mean and whether they’re a problem.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;where-wait-events-appear&quot;&gt;Where Wait Events Appear&lt;/h1&gt;

&lt;p&gt;The easiest way to see wait events is:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT pid,
wait_event_type,
wait_event,
state,
query
FROM pg_stat_activity
WHERE state != &apos;idle&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Example output might look like:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;pid&lt;/th&gt;
      &lt;th&gt;wait_event_type&lt;/th&gt;
      &lt;th&gt;wait_event&lt;/th&gt;
      &lt;th&gt;state&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1234&lt;/td&gt;
      &lt;td&gt;Lock&lt;/td&gt;
      &lt;td&gt;transactionid&lt;/td&gt;
      &lt;td&gt;active&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;5678&lt;/td&gt;
      &lt;td&gt;LWLock&lt;/td&gt;
      &lt;td&gt;buffer_content&lt;/td&gt;
      &lt;td&gt;active&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;9012&lt;/td&gt;
      &lt;td&gt;IO&lt;/td&gt;
      &lt;td&gt;DataFileRead&lt;/td&gt;
      &lt;td&gt;active&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Each category represents a different kind of wait.  Common types include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Lock&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LWLock&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IO&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Client&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IPC&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Activity&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Among these, LWLock waits often appear during performance incidents.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;what-is-an-lwlock&quot;&gt;What Is an LWLock?&lt;/h1&gt;

&lt;p&gt;LWLock stands for &lt;strong&gt;Lightweight Lock&lt;/strong&gt;.  These are &lt;strong&gt;internal&lt;/strong&gt; Postgres synchronization primitives used to coordinate access to shared memory structures.  Note that they are &lt;strong&gt;NOT&lt;/strong&gt; related to lock contention on tables, or deadlocking when performing DML.  LWLocks protect important internal structures such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;shared buffers&lt;/li&gt;
  &lt;li&gt;WAL buffers&lt;/li&gt;
  &lt;li&gt;lock tables&lt;/li&gt;
  &lt;li&gt;SLRU caches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because these structures are accessed by many processes simultaneously, Postgres must coordinate access carefully.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;why-lwlock-waits-appear&quot;&gt;Why LWLock Waits Appear&lt;/h1&gt;

&lt;p&gt;In healthy systems, LWLocks are acquired and released very quickly.  However, they can become visible when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;contention increases&lt;/li&gt;
  &lt;li&gt;many sessions access the same internal structure&lt;/li&gt;
  &lt;li&gt;CPU saturation occurs&lt;/li&gt;
  &lt;li&gt;shared memory structures become hot spots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seeing LWLock waits in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_activity&lt;/code&gt; doesn’t automatically mean something is wrong.  But persistent LWLock contention usually indicates a scaling issue somewhere in the workload.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;common-lwlock-wait-events&quot;&gt;Common LWLock Wait Events&lt;/h1&gt;

&lt;p&gt;A few LWLock events appear frequently during real-world incidents.&lt;/p&gt;

&lt;p&gt;Understanding them can help narrow down the root cause.&lt;/p&gt;

&lt;h3 id=&quot;buffer_content&quot;&gt;buffer_content&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event_type = LWLock
wait_event = buffer_content
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This occurs when Postgres processes compete to access a shared buffer page.&lt;/p&gt;

&lt;p&gt;Typical causes include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;many concurrent updates to the same rows&lt;/li&gt;
  &lt;li&gt;heavy index modifications&lt;/li&gt;
  &lt;li&gt;hot tables receiving high write volume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you see these locks, try these troubleshooting steps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;check for write-heavy workloads&lt;/li&gt;
  &lt;li&gt;inspect tables experiencing frequent updates&lt;/li&gt;
  &lt;li&gt;look for missing indexes causing excessive page access&lt;/li&gt;
&lt;/ul&gt;
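
&lt;p&gt;A good place to start looking for those hot tables is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_user_tables&lt;/code&gt;, which tracks per-table write activity:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT relname, n_tup_ins, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;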

&lt;h3 id=&quot;walwritelock&quot;&gt;WALWriteLock&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = WALWriteLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This indicates contention while writing to the Write-Ahead Log (WAL).&lt;/p&gt;

&lt;p&gt;Common causes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;high write throughput&lt;/li&gt;
  &lt;li&gt;large batch inserts or updates&lt;/li&gt;
  &lt;li&gt;slow storage affecting WAL flushes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Possible diagnostic steps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;examine WAL generation rate&lt;/li&gt;
  &lt;li&gt;check disk latency&lt;/li&gt;
  &lt;li&gt;review bulk write workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In some systems this appears as commit latency spikes.&lt;/p&gt;
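
&lt;p&gt;To put numbers on WAL volume, you can sample the current WAL position twice and diff the two LSNs (the LSN literal below stands in for your first sample’s value); on Postgres 14 and newer, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_wal&lt;/code&gt; also keeps cumulative counters:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- sample twice, some known interval apart, then diff:
SELECT pg_current_wal_lsn();
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), &apos;0/5A0E2F10&apos;) AS wal_bytes_since_sample;

-- Postgres 14+: cumulative WAL statistics
SELECT wal_records, wal_bytes FROM pg_stat_wal;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;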

&lt;h3 id=&quot;walinsertlock&quot;&gt;WALInsertLock&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = WALInsertLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This occurs when multiple sessions attempt to insert WAL records simultaneously.  It usually appears when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;many concurrent transactions are committing&lt;/li&gt;
  &lt;li&gt;high insert/update workloads exist&lt;/li&gt;
  &lt;li&gt;transaction throughput is extremely high&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Newer Postgres versions have reduced contention here by increasing the number of WAL insertion locks.  Still, very high write concurrency can trigger it.&lt;/p&gt;

&lt;h3 id=&quot;procarraylock&quot;&gt;ProcArrayLock&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = ProcArrayLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This lock protects Postgres’ internal structure tracking active transactions.  It is often associated with:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;snapshot creation&lt;/li&gt;
  &lt;li&gt;visibility checks&lt;/li&gt;
  &lt;li&gt;large numbers of active connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Possible causes include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;very high connection counts&lt;/li&gt;
  &lt;li&gt;long-running transactions&lt;/li&gt;
  &lt;li&gt;frequent snapshot creation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connection pooling (and lowering &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_connections&lt;/code&gt;) often helps reduce this type of contention.&lt;/p&gt;
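
&lt;p&gt;A quick way to gauge connection pressure is to count sessions by state:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;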

&lt;h3 id=&quot;clogcontrollock--slru-locks&quot;&gt;CLogControlLock / SLRU Locks&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = CLogControlLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These involve the SLRU (Simple Least Recently Used) subsystem, which tracks transaction commit status.  Heavy contention here can appear when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;extremely high transaction rates exist&lt;/li&gt;
  &lt;li&gt;frequent visibility checks occur&lt;/li&gt;
  &lt;li&gt;many short transactions are executed&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;diagnosing-lwlock-problems&quot;&gt;Diagnosing LWLock Problems&lt;/h1&gt;

&lt;p&gt;When investigating LWLock waits, a few steps usually help.&lt;/p&gt;

&lt;h3 id=&quot;1-look-for-dominant-wait-events&quot;&gt;1. Look for dominant wait events&lt;/h3&gt;

&lt;p&gt;Start by identifying which LWLock appears most frequently:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type = &apos;LWLock&apos;
GROUP BY wait_event
ORDER BY count(*) DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
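
&lt;p&gt;Because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_activity&lt;/code&gt; is a point-in-time snapshot, a single sample can mislead; in psql you can re-run the query every few seconds and watch which wait events persist:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;\watch 5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;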

&lt;h3 id=&quot;2-examine-workload-characteristics&quot;&gt;2. Examine workload characteristics&lt;/h3&gt;

&lt;p&gt;Questions to ask:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Are there many concurrent writers?&lt;/li&gt;
  &lt;li&gt;Is a single table receiving heavy updates?&lt;/li&gt;
  &lt;li&gt;Are there extremely high transaction rates?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-check-connection-counts&quot;&gt;3. Check connection counts&lt;/h3&gt;

&lt;p&gt;Large numbers of connections can amplify contention.  Connection pooling often reduces LWLock pressure significantly.&lt;/p&gt;

&lt;h3 id=&quot;4-look-at-query-patterns&quot;&gt;4. Look at query patterns&lt;/h3&gt;

&lt;p&gt;High-frequency queries touching the same rows or pages can create hotspots.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;PostgreSQL’s wait event system provides valuable insight into what the database is doing internally.  LWLocks, in particular, reveal contention inside shared memory structures that are otherwise invisible.  When investigating performance issues, a good rule of thumb is: &lt;em&gt;If many sessions are waiting on the same LWLock, there is usually a workload hotspot somewhere.&lt;/em&gt; Once you know where the contention lives, the path toward fixing it becomes much clearer.&lt;/p&gt;
</description>
        <pubDate>Mon, 13 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/13/wait_events.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/13/wait_events.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>troubleshooting</category>
        
        <category>wait-events</category>
        
        
        
      </item>
    
      <item>
        <title>WAL as a Data Distribution Layer</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Every so often, I talk to someone working in data analytics who wants access to production data, or at least a snapshot of it.  Sometimes, they tell me about their ETL setup, which takes hours to refresh and can be brittle, with a lot of monitoring around it.  For them, it works, but it sometimes gets me wondering if they need all that plumbing to get a snapshot of their live dataset.  Back at Turnitin, I set up a way to get people access to production data without having to snapshot nightly, and I thought maybe I should share it with people here.&lt;/p&gt;

&lt;h1 id=&quot;common-implementations-and-their-risks&quot;&gt;Common Implementations and Their Risks&lt;/h1&gt;

&lt;p&gt;Typical solutions that we might encounter as we give people a little bit of access to production data:&lt;/p&gt;

&lt;h3 id=&quot;1-query-the-primary&quot;&gt;1. Query the primary&lt;/h3&gt;

&lt;p&gt;This is generally a bad idea: you don’t want users on the production primary, lest they make a mistake or lock up tables in a way that prevents customers from using your apps.  Even with a read-only user, large data analytics queries could cause unwanted interference that negatively affects your uptime.  This is almost certainly not the way to go.&lt;/p&gt;

&lt;h3 id=&quot;2-query-a-streaming-replica&quot;&gt;2. Query a streaming replica&lt;/h3&gt;

&lt;p&gt;This is better, but doing this is not free.  Long-running queries can create replay lag, vacuum conflicts can cancel queries, and I/O contention can affect the primary upstream.  It’s safer since users are forced to be read-only, but that still carries risk.&lt;/p&gt;

&lt;h3 id=&quot;3-nightly-snapshots--rebuilds&quot;&gt;3. Nightly snapshots / rebuilds&lt;/h3&gt;

&lt;p&gt;Time-based snapshots and rebuilds are the most common way of getting data out to analysts.  ETL queries run at night (or on some other regular interval) and provide the information needed to do the necessary work.  This works, but it is another piece of software to maintain, and it produces somewhat stale data – how stale depends on how much staleness can be tolerated.&lt;/p&gt;

&lt;h1 id=&quot;once-upon-a-time-before-streaming-replication&quot;&gt;Once Upon a Time, Before Streaming Replication&lt;/h1&gt;

&lt;p&gt;If you’ve spent any time in Postgres, you already understand streaming replication.  Primary sends WAL to standby, and standby replays the WAL stream.  All the tutorials talk about using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_basebackup&lt;/code&gt;, setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hot_standby&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;standby.signal&lt;/code&gt; and configuring &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;However, many people don’t know that before streaming replication, there was log shipping.  Introduced in v. 8.2, it was the predecessor to what eventually became hot standby/streaming replication in v. 9.0.  Instead of maintaining a live connection between primary and standby, the two clusters are decoupled.  WAL files are shipped (via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scp&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsync&lt;/code&gt; or some other mechanism – maybe even NFS) to the replica, and then replayed there.&lt;/p&gt;

&lt;h1 id=&quot;log-shipping-hits-a-different-point-on-the-tradeoff-curve&quot;&gt;Log Shipping Hits a Different Point on the Tradeoff Curve&lt;/h1&gt;

&lt;p&gt;With WAL log shipping, the standby never connects to the primary and the primary never tracks the standby, so there is no backpressure mechanism (i.e., no queries cancelled because of conflict with recovery, and no need for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hot_standby_feedback&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;While you won’t get up-to-the-millisecond replication lag, you get pretty close to real-time data.  In some cases, lag may even be desirable – you could throttle the replay so you are an hour behind, giving yourself a window to recover a table’s state after someone fat-fingers an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; without a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause.&lt;/p&gt;
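
&lt;p&gt;A deliberate delay like that can be configured on the standby with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;recovery_min_apply_delay&lt;/code&gt; – a minimal sketch, with an illustrative value:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# On the standby, in postgresql.conf:
# hold each WAL record for at least an hour before replaying it
recovery_min_apply_delay = &apos;1h&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;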

&lt;h1 id=&quot;a-subtle-but-important-detail&quot;&gt;A Subtle but Important Detail&lt;/h1&gt;

&lt;p&gt;Postgres doesn’t force you to choose one mechanism over the other.  A standby can use both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt; AND &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restore_command&lt;/code&gt;, toggling between the two depending on availability.  If the streaming connection to the primary is lost, the standby falls back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restore_command&lt;/code&gt; until it can no longer find the WAL file it wants in the archive, and then it flips back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt; again.&lt;/p&gt;
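
&lt;p&gt;A dual-mode standby might look like this – a sketch, with hypothetical hostnames and paths:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# On the standby, in postgresql.conf:
# try streaming from the primary first...
primary_conninfo = &apos;host=primary.example.com user=replicator&apos;
# ...and fall back to the WAL archive when the connection is lost
restore_command = &apos;cp /mnt/wal_archive/%f %p&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;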

&lt;p&gt;Log shipping isn’t just a legacy mode; it’s part of the replication continuum.  It’s like an incremental backup, except that your backup is always fully restored and can be queried against.  For these reasons, keeping your WAL files around is a very good practice.&lt;/p&gt;

&lt;h1 id=&quot;architecture-pattern-introduce-a-wal-hub&quot;&gt;Architecture Pattern: Introduce a WAL Hub&lt;/h1&gt;

&lt;p&gt;Instead of thinking in terms of replication happening between a primary and a number of standbys, it may be useful to think about a central WAL archive host – even if it’s just an S3 bucket – so that many consumers can access the data at any point in time.&lt;/p&gt;

&lt;p&gt;These consumers can be analytics standbys, QA environments, or ad-hoc data sandboxes – or whatever else you want to give a copy of near-realtime production data to, without risking replication backpressure or compromising network security.&lt;/p&gt;

&lt;h1 id=&quot;a-hands-on-approach&quot;&gt;A Hands-On Approach&lt;/h1&gt;

&lt;p&gt;I created a &lt;a href=&quot;https://github.com/richyen/toolbox/tree/master/demos/wal_shipping&quot;&gt;simple demo&lt;/a&gt; that sets this up end-to-end.  It sets up three containers in Docker – a primary, a standby, and a mock WAL archive location.  &lt;em&gt;Disclaimer:&lt;/em&gt; yes, I used AI to help me generate the scripts, but it’s exactly how I had it set up at Turnitin (yes, we used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsyncd&lt;/code&gt; back in 2009 – there might be better options out there these days).&lt;/p&gt;

&lt;p&gt;Some key configuration params for clarity:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_command&lt;/code&gt; pushes WAL files to a directory&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restore_command&lt;/code&gt; pulls WAL files on the standby&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;standby.signal&lt;/code&gt; enables continuous recovery&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hot_standby=on&lt;/code&gt; allows read-only queries&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_mode=on&lt;/code&gt; enables &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_command&lt;/code&gt; on the primary&lt;/li&gt;
&lt;/ul&gt;
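
&lt;p&gt;Put together, the two sides look roughly like this – a sketch under the demo’s assumptions, with illustrative paths:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# On the primary, in postgresql.conf:
archive_mode = on
# copy each completed WAL segment to the archive host
archive_command = &apos;rsync %p archive:/wal_archive/%f&apos;

# On the standby (which also has an empty standby.signal file):
hot_standby = on
# pull WAL segments back from the archive during recovery
restore_command = &apos;rsync archive:/wal_archive/%f %p&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;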

&lt;p&gt;Note some characteristics of the standby in this example:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;No &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;No replication slots used&lt;/li&gt;
  &lt;li&gt;No entries in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_replication&lt;/code&gt; show up on the primary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want, you can set up traditional streaming replication in parallel with this log-shipping standby – it doesn’t interfere with the log shipping so long as WAL files get to the archive location.&lt;/p&gt;

&lt;h1 id=&quot;why-this-pattern-deserves-more-attention&quot;&gt;Why This Pattern Deserves More Attention&lt;/h1&gt;

&lt;p&gt;Most teams default to streaming replication because it’s the most visible feature.&lt;/p&gt;

&lt;p&gt;But Postgres replication isn’t one thing; it’s a set of primitives:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;WAL generation&lt;/li&gt;
  &lt;li&gt;WAL transport&lt;/li&gt;
  &lt;li&gt;WAL replay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streaming replication couples all three; log shipping lets you separate them.  And once you do that, new architectures open up!&lt;/p&gt;
</description>
        <pubDate>Mon, 06 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/06/wal_archiving.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/06/wal_archiving.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>replication</category>
        
        <category>archiving</category>
        
        <category>log_shipping</category>
        
        
        
      </item>
    
      <item>
        <title>The Hidden Behavior of plan_cache_mode</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Most PostgreSQL users use prepared statements to boost performance and prevent SQL injection. Fewer know that the query planner can silently change the execution plan for a prepared statement after exactly five executions.&lt;/p&gt;

&lt;p&gt;This behavior often surprises engineers because a query plan can suddenly shift—sometimes dramatically, even though the query itself hasn’t changed. The reason lies in the planner’s handling of custom plans vs generic plans, controlled by the parameter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plan_cache_mode&lt;/code&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;custom-plans-vs-generic-plans&quot;&gt;Custom Plans vs Generic Plans&lt;/h1&gt;

&lt;p&gt;When a prepared statement is executed with parameters, the planner has two choices:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Custom Plan:&lt;/strong&gt; Generated using the actual parameter values. It is potentially optimal for that specific execution but requires planning overhead every time.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Generic Plan:&lt;/strong&gt; Planned once without knowing specific parameter values. It is reused for all subsequent executions to save planning overhead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plan_cache_mode&lt;/code&gt; is set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto&lt;/code&gt;. In this mode, the planner uses custom plans for the first five executions. On the sixth execution, it compares the average cost of those custom plans against the estimated cost of a generic plan. If the generic plan is deemed “cheaper” or equal, the planner switches to it for the remaining lifetime of that prepared statement.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;demonstrating-with-pgbench&quot;&gt;Demonstrating with pgbench&lt;/h1&gt;

&lt;p&gt;As always, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgbench&lt;/code&gt; provides the schema of choice for simple demonstrations.  I’m using Postgres 18, which is the latest version as of this writing.  Adding a column with highly skewed values makes it easier to trigger the switch, so for the purposes of this post we add a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;flag&lt;/code&gt; column with extreme skew: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;N&apos;&lt;/code&gt; for 0.1% of rows, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;Y&apos;&lt;/code&gt; for the remaining 99.9%:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;### In bash:&lt;/span&gt;
pgbench &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; 10 &lt;span class=&quot;nt&quot;&gt;-U&lt;/span&gt; postgres postgres

&lt;span class=&quot;c&quot;&gt;### In psql:&lt;/span&gt;
ALTER TABLE pgbench_accounts ADD COLUMN flag CHAR&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; NOT NULL DEFAULT &lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
UPDATE pgbench_accounts SET flag &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;N&apos;&lt;/span&gt; WHERE aid &amp;lt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1000&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
CREATE INDEX idx_accounts_flag ON pgbench_accounts&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;flag&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
ANALYZE pgbench_accounts&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

SELECT flag, count&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; FROM pgbench_accounts GROUP BY flag&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

 flag | count
&lt;span class=&quot;nt&quot;&gt;------&lt;/span&gt;+--------
 N    |   1000
 Y    | 999000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Before triggering the auto-switch, let’s force each mode directly to see what the planner produces for the same statement.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Custom plan: planner sees the literal value &apos;Y&apos;, looks it up in column&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- statistics (MCV frequency ≈ 0.999), and picks Seq Scan for 999,033 rows.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plan_cache_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;force_custom_plan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PREPARE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abalance&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pgbench_accounts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;                               QUERY PLAN
-------------------------------------------------------------------------
 Seq Scan on pgbench_accounts  (cost=0.00..28910.00 rows=999033 width=8)
   Filter: (flag = &apos;Y&apos;::bpchar)   &amp;lt;-- literal value &apos;Y&apos; indicates custom plan
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;DEALLOCATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- Generic plan: the planner has no value to look up. With ndistinct = 2&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- (only &apos;Y&apos; and &apos;N&apos; exist), it estimates 1/ndistinct = 50% selectivity,&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- or 500,000 rows. At that estimate, the cheaper path is Index Scan.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plan_cache_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;force_generic_plan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PREPARE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abalance&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pgbench_accounts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;                                            QUERY PLAN
--------------------------------------------------------------------------------------------
 Index Scan using idx_accounts_flag on pgbench_accounts  (cost=0.42..19322.07 rows=500000)
   Index Cond: (flag = $1)   &amp;lt;-- Note the placeholder $1 instead of literal &apos;Y&apos;/&apos;N&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The cost numbers explain the selection of the Index Scan over the Seq Scan: 19,322 &amp;lt; 28,910.&lt;/p&gt;

&lt;h1 id=&quot;the-automatic-switch-in-action&quot;&gt;The Automatic Switch in Action&lt;/h1&gt;

&lt;p&gt;After resetting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plan_cache_mode&lt;/code&gt; back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto&lt;/code&gt;, we execute the statement five times using the common value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;Y&apos;&lt;/code&gt;. Each run generates a custom Seq Scan plan at cost ~28,910. After five such executions, the planner compares the average custom plan cost (~28,910) against the generic plan cost (~19,322).&lt;/p&gt;

&lt;p&gt;Since 19,322 ≤ 28,910, the generic plan is chosen from execution 6 onward.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;DEALLOCATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plan_cache_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;auto&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PREPARE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abalance&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pgbench_accounts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- Executions 1–5: custom plans, each resolving &apos;Y&apos; literally&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Each shows:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;           QUERY PLAN
--------------------------------
 Seq Scan on pgbench_accounts
   Filter: (flag = &apos;Y&apos;::bpchar)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the sixth execution:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (COSTS OFF) EXECUTE flag_lookup(&apos;Y&apos;);
                       QUERY PLAN
--------------------------------------------------------
 Index Scan using idx_accounts_flag on pgbench_accounts
   Index Cond: (flag = $1)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The strategy flips from Seq Scan to Index Scan on the sixth call — even though the query and data are identical. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$1&lt;/code&gt; placeholder confirms the generic plan is now used.&lt;/p&gt;

&lt;h1 id=&quot;does-it-ever-switch-back&quot;&gt;Does it Ever Switch Back?&lt;/h1&gt;

&lt;p&gt;From execution 6 onward, every query — regardless of the parameter value — uses that generic Index Scan. For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;N&apos;&lt;/code&gt; (1,000 rows) an Index Scan happens to be efficient. For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;Y&apos;&lt;/code&gt; (999,000 rows), scanning nearly the entire 1M-row table through random index lookups is dramatically worse than a sequential scan would be.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Executions 7+: generic plan regardless of value&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;Y&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;-- 999,000 rows via Index Scan (bad!)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COSTS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flag_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;N&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;-- 1,000 rows via Index Scan (fine by accident)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Both show:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;                       QUERY PLAN
--------------------------------------------------------
 Index Scan using idx_accounts_flag on pgbench_accounts
   Index Cond: (flag = $1)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The generic plan stays until &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEALLOCATE flag_lookup&lt;/code&gt; or the session ends.  This is certainly something to be aware of for frequently-executed prepared statements; it has caused significant performance problems for some customers I’ve worked with.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;under-the-hood-the-c-logic&quot;&gt;Under the Hood: The C Logic&lt;/h1&gt;

&lt;p&gt;Just to highlight that the number 5 isn’t determined with any fancy logic, we can find it in the source code. In &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/backend/utils/cache/plancache.c&lt;/code&gt; (around &lt;strong&gt;line 1200&lt;/strong&gt;), the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;choose_custom_plan&lt;/code&gt; spells it out explicitly:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bool&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;choose_custom_plan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CachedPlanSource&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plansource&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;cm&quot;&gt;/* ... settings check for force_custom / force_generic ... */&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/* If we haven&apos;t done 5 custom plans yet, keep doing them */&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plansource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_custom_plans&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/*
     * Otherwise, compare generic_cost against the average custom_cost.
     * If the generic plan is cheaper (or equal), we switch!
     */&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plansource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generic_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plansource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total_custom_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plansource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_custom_plans&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;The query planner’s automatic plan caching is usually a hero, saving CPU cycles. But when you have highly skewed data or volatile temporary objects, that “6th run switch” can negatively affect client/application performance.&lt;/p&gt;

&lt;p&gt;If you see unexplained regressions in a prepared statement, check whether it is being executed more than five times, or try &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SET plan_cache_mode = force_custom_plan&lt;/code&gt; as a troubleshooting step.  This forces a fresh custom plan on every execution, guaranteeing the planner always sees the actual parameter value and can choose the right strategy.&lt;/p&gt;
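
&lt;p&gt;To check whether a session’s prepared statements have already flipped, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_prepared_statements&lt;/code&gt; view exposes per-statement plan counters (available since Postgres 14):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- How many times has each prepared statement used each plan type?
-- A nonzero generic_plans count means the switch has happened.
SELECT name, generic_plans, custom_plans FROM pg_prepared_statements;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;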

&lt;p&gt;Good luck!&lt;/p&gt;
</description>
        <pubDate>Mon, 30 Mar 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/03/30/plan_cache_mode.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/03/30/plan_cache_mode.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>query-planner</category>
        
        <category>prepared-statements</category>
        
        
        
      </item>
    
      <item>
        <title>EXPLAIN&apos;s Other Superpowers</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Most people who work with PostgreSQL eventually learn two commands for query tuning: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; shows the planner’s chosen execution plan, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt; runs the query and adds runtime statistics. For most tuning tasks, this already provides a wealth of information.&lt;/p&gt;

&lt;p&gt;But what many people don’t realize is that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; has a handful of other options that can make troubleshooting much easier. In some cases they answer questions that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt; alone cannot.&lt;/p&gt;

&lt;p&gt;In this post we’ll take a look at a few of those lesser-known options.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;buffers-where-did-the-data-come-from&quot;&gt;BUFFERS: Where Did the Data Come From?&lt;/h1&gt;

&lt;p&gt;One common question during performance analysis is whether data came from shared buffers (cache), from disk, or from temporary buffers.  This is where the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BUFFERS&lt;/code&gt; option comes in handy.  Output can look something like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM mytable WHERE id = 123;
[...]
  Index Scan using mytable_pkey on mytable
  Buffers: shared hit=5 read=2
[...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the example above, we see:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shared hit&lt;/code&gt; – pages already in cache (i.e., cache hit)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shared read&lt;/code&gt; – pages fetched from disk (i.e., cache miss)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;buffers&lt;/code&gt; in this context are 8-kilobyte blocks of memory (PostgreSQL’s default block size).&lt;/p&gt;

&lt;p&gt;This is extremely useful when trying to determine whether performance problems are related to a cold cache, excessive disk reads, or insufficient memory (i.e., the cache is too small to hold all the data being worked with).&lt;/p&gt;

&lt;p&gt;Especially for index scans, this information confirms whether a query that &lt;em&gt;should&lt;/em&gt; be index-friendly is actually pulling large portions of the table into memory.&lt;/p&gt;
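
&lt;p&gt;To put those buffer counts in perspective, here is a quick back-of-the-envelope calculation (plain Python, using the counts from the example above):&lt;/p&gt;

```python
# Interpreting "Buffers: shared hit=5 read=2" from the example above
shared_hit, shared_read = 5, 2
block_size = 8192                      # PostgreSQL's default block size
total = shared_hit + shared_read
hit_ratio = shared_hit / total
bytes_touched = total * block_size
print(f"{hit_ratio:.0%} of pages came from cache")  # prints: 71% of pages came from cache
print(f"{bytes_touched} bytes touched in total")    # prints: 57344 bytes touched in total
```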

&lt;hr /&gt;

&lt;h1 id=&quot;memory-memory-used-by-the-query&quot;&gt;MEMORY: Memory Used by the Query&lt;/h1&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MEMORY&lt;/code&gt; option was introduced in version 17.  It is different from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BUFFERS&lt;/code&gt; in that it tracks the amount of memory consumed during the query planning phase, not execution.  The information appears at the bottom of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; output like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (ANALYZE, MEMORY, TIMING OFF)
SELECT * FROM mytable WHERE id = 123;
[...]
 Planning:
   Buffers: shared hit=36 read=1
   Memory: used=63kB  allocated=64kB

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;wal-how-much-logging-is-happening&quot;&gt;WAL: How Much Logging Is Happening?&lt;/h1&gt;

&lt;p&gt;Another useful option that many people overlook is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WAL&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (ANALYZE, WAL)
INSERT INTO mytable SELECT * FROM staging_table;
[...]
  WAL: records=100, fpi=5, bytes=45000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the example above, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;records&lt;/code&gt; is the number of WAL records generated, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fpi&lt;/code&gt; refers to full-page images that were written (number of pages modified for the first time since the last checkpoint), and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bytes&lt;/code&gt; is the total WAL traffic generated by the query.  This can be helpful when investigating write-heavy workloads, including bulk loads, large updates, index creation, and high replication traffic.&lt;/p&gt;
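
&lt;p&gt;One detail worth noticing: full-page images can dominate WAL volume.  A rough back-of-the-envelope check on the numbers above (plain Python, assuming each FPI is roughly one page):&lt;/p&gt;

```python
# Rough interpretation of "WAL: records=100, fpi=5, bytes=45000"
records, fpi, wal_bytes = 100, 5, 45000
block_size = 8192                      # default page size; each FPI is roughly one page
fpi_bytes = fpi * block_size           # bytes attributable to full-page images
other_bytes = wal_bytes - fpi_bytes    # bytes left for the ordinary records
print(f"FPIs account for roughly {fpi_bytes / wal_bytes:.0%} of the WAL volume")
# prints: FPIs account for roughly 91% of the WAL volume
```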

&lt;hr /&gt;

&lt;h1 id=&quot;settings-remind-me-what-my-environment-looked-like&quot;&gt;SETTINGS: Remind Me What My Environment Looked Like?&lt;/h1&gt;

&lt;p&gt;Sometimes a query behaves differently on two servers even though the SQL is identical.  Or you may have modified some parameters locally before running the query (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt;).  To understand how a query is affected by differences in environment, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SETTINGS&lt;/code&gt; option can be useful; it lists the planner-relevant parameters that have been changed from their default values:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (SETTINGS)
SELECT * FROM mytable WHERE id = 123;
[...]
Settings: effective_cache_size = &apos;48GB&apos;, effective_io_concurrency = &apos;200&apos;, enable_partitionwise_aggregate = &apos;on&apos;, enable_partitionwise_join = &apos;on&apos;, max_parallel_workers = &apos;16&apos;, max_parallel_workers_per_gather = &apos;4&apos;, temp_buffers = &apos;1MB&apos;, search_path = &apos;public&apos;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;verbose-see-the-planners-full-story&quot;&gt;VERBOSE: See the Planner’s Full Story&lt;/h1&gt;

&lt;p&gt;Another useful option is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VERBOSE&lt;/code&gt;, which prints additional information such as internal column references, expanded target lists, and schema-qualified object names:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;postgres=# EXPLAIN (ANALYZE) SELECT * FROM pgbench_accounts a JOIN pgbench_branches b ON b.bid=a.bid ORDER BY 2 DESC;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.12..3898.14 rows=100000 width=461) (actual time=0.216..32.597 rows=100000.00 loops=1)
   Join Filter: (b.bid = a.bid)
   Buffers: shared hit=1642
   -&amp;gt;  Index Scan Backward using pgbench_branches_pkey on pgbench_branches b  (cost=0.12..8.14 rows=1 width=364) (actual time=0.095..0.102 rows=1.00 loops=1)
         Index Searches: 1
         Buffers: shared hit=2
   -&amp;gt;  Seq Scan on pgbench_accounts a  (cost=0.00..2640.00 rows=100000 width=97) (actual time=0.024..9.732 rows=100000.00 loops=1)
         Buffers: shared hit=1640
 Planning Time: 0.346 ms
 Execution Time: 40.078 ms
(10 rows)

postgres=# EXPLAIN (ANALYZE, VERBOSE) SELECT * FROM pgbench_accounts a JOIN pgbench_branches b ON b.bid=a.bid ORDER BY 2 DESC;
                                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.12..3898.14 rows=100000 width=461) (actual time=0.225..32.869 rows=100000.00 loops=1)
   Output: a.aid, a.bid, a.abalance, a.filler, b.bid, b.bbalance, b.filler
   Join Filter: (b.bid = a.bid)
   Buffers: shared hit=1642
   -&amp;gt;  Index Scan Backward using pgbench_branches_pkey on public.pgbench_branches b  (cost=0.12..8.14 rows=1 width=364) (actual time=0.183..0.190 rows=1.00 loops=1)
         Output: b.bid, b.bbalance, b.filler
         Index Searches: 1
         Buffers: shared hit=2
   -&amp;gt;  Seq Scan on public.pgbench_accounts a  (cost=0.00..2640.00 rows=100000 width=97) (actual time=0.026..9.756 rows=100000.00 loops=1)
         Output: a.aid, a.bid, a.abalance, a.filler
         Buffers: shared hit=1640
 Planning Time: 0.547 ms
 Execution Time: 40.228 ms
(13 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;While it may look a bit noisy, it can be helpful when diagnosing:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;view expansion&lt;/li&gt;
  &lt;li&gt;rule rewriting&lt;/li&gt;
  &lt;li&gt;complex query transformations&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;combining-options&quot;&gt;Combining Options&lt;/h1&gt;

&lt;p&gt;The real power of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; comes from combining options together.  For example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS, WAL, SETTINGS)
SELECT * FROM mytable WHERE id = 123;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This produces a plan that shows execution time, cache usage, WAL generation, and the configuration parameters influencing the planner.&lt;/p&gt;

&lt;p&gt;In many cases, this gives a far more complete picture of what the database is doing internally.&lt;/p&gt;

&lt;p&gt;Note that many of these can also be enabled in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto_explain&lt;/code&gt; as parameters in the database configuration.&lt;/p&gt;
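
&lt;p&gt;For reference, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto_explain&lt;/code&gt; counterparts might look something like this in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;postgresql.conf&lt;/code&gt; (a minimal sketch; the duration threshold is illustrative):&lt;/p&gt;

```ini
# Load the module and log plans for statements slower than an illustrative 250ms
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '250ms'

# Counterparts of the EXPLAIN options discussed above
auto_explain.log_analyze  = on    # like ANALYZE (required for the buffer/WAL stats)
auto_explain.log_buffers  = on    # like BUFFERS
auto_explain.log_wal      = on    # like WAL
auto_explain.log_settings = on    # like SETTINGS
auto_explain.log_verbose  = on    # like VERBOSE
```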

&lt;hr /&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN ANALYZE&lt;/code&gt; is powerful, but &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; provides many additional tools for understanding query behavior.  The options above can provide valuable insight into memory usage, disk activity, WAL generation, and planner configuration.  When troubleshooting tricky performance problems, they can reveal details that a basic execution plan might hide.&lt;/p&gt;
</description>
        <pubDate>Mon, 23 Mar 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/03/23/explain_options.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/03/23/explain_options.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>explain</category>
        
        <category>tuning</category>
        
        
        
      </item>
    
      <item>
        <title>Learning AI Fast with pgEdge&apos;s RAG</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;If you’ve been paying attention to the technology landscape recently, you’ve probably noticed that AI is &lt;strong&gt;everywhere&lt;/strong&gt;. New frameworks, new terminology, and a dizzying array of acronyms and jargon: &lt;strong&gt;LLM&lt;/strong&gt;, &lt;strong&gt;RAG&lt;/strong&gt;, &lt;strong&gt;embeddings&lt;/strong&gt;, &lt;strong&gt;vector databases&lt;/strong&gt;, &lt;strong&gt;MCP&lt;/strong&gt;, and more.&lt;/p&gt;

&lt;p&gt;Honestly, it’s been difficult to figure out where to start. Many tutorials either dive deep into machine learning theory (Bayesian transforms?) or hide everything behind a single API call to a hosted model.  Neither approach really explains how these systems actually work.&lt;/p&gt;

&lt;p&gt;Recently I spent some time experimenting with the &lt;a href=&quot;https://www.pgedge.com&quot;&gt;pgEdge&lt;/a&gt; AI tooling after hearing Shaun Thomas’ talk at a &lt;a href=&quot;https://prairiepostgres.org/&quot;&gt;PrairiePostgres&lt;/a&gt; meetup.  He talked about how to set up the various components of an AI chatbot system, starting from ingesting documents into a Postgres database, vectorizing the text, setting up a RAG and then an MCP server.&lt;/p&gt;

&lt;p&gt;When I got home I wanted to try it out for myself – props to the pgEdge team for making it all free and open-source!  What surprised me most was not just that everything worked, but how easy it was to get a complete AI retrieval pipeline running locally. More importantly, it turned out to be one of the clearest ways I’ve found to understand how modern AI systems are constructed behind the scenes.  Thanks so much, Shaun!&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;the-pgedge-ai-components&quot;&gt;The pgEdge AI Components&lt;/h1&gt;

&lt;p&gt;The pgEdge AI ecosystem provides several small tools that fit together naturally.  I’ll run through them quickly here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/pgEdge/doc-converter&quot;&gt;Doc Converter&lt;/a&gt; – The doc-converter normalizes documents into a format that is easy to process downstream. Whether the input is PDF, HTML, Markdown, or plain text, the converter produces clean text output suitable for ingestion.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/pgEdge/pgedge-vectorizer&quot;&gt;Vectorizer&lt;/a&gt; – The vectorizer handles the process of converting text chunks into embeddings.  These embeddings are numeric representations of text that capture semantic meaning. Once generated, they can be stored inside PostgreSQL using &lt;a href=&quot;https://github.com/pgvector/pgvector&quot;&gt;pgvector&lt;/a&gt; and queried with similarity search.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/pgEdge/pgedge-rag-server&quot;&gt;Retrieval-Augmented Generation (RAG) Server&lt;/a&gt; – The RAG framework ties everything together.  It orchestrates:
    &lt;ol&gt;
      &lt;li&gt;embedding the user’s query&lt;/li&gt;
      &lt;li&gt;retrieving similar document chunks&lt;/li&gt;
      &lt;li&gt;assembling prompt context&lt;/li&gt;
      &lt;li&gt;sending the prompt to an LLM&lt;/li&gt;
      &lt;li&gt;returning the generated response&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the full system is running, you essentially have your own ChatGPT- or Gemini-style assistant running on your laptop.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;running-everything-locally-with-ollama&quot;&gt;Running Everything Locally with Ollama&lt;/h1&gt;

&lt;p&gt;With ChatGPT and Gemini, getting tokens or sharing my payment info was a blocker, especially when I just wanted to test things for educational purposes.  Through Shaun’s presentation, I was introduced to &lt;a href=&quot;https://ollama.com&quot;&gt;Ollama&lt;/a&gt;, which is a great alternative if you’re okay with slower performance (especially on an 8GB M1 Mac mini).&lt;/p&gt;

&lt;p&gt;I was pleasantly surprised at how easy it was to run the entire pipeline without relying on external AI APIs.  Specifically, I used the &lt;strong&gt;embeddinggemma&lt;/strong&gt; model for generating embeddings.  This meant the entire stack could run locally, no API keys required!  Running everything locally removes those barriers and definitely makes experimentation much easier.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;understanding-rag-by-actually-running-it&quot;&gt;Understanding RAG by Actually Running It&lt;/h1&gt;

&lt;p&gt;Prior to Shaun’s talk, one of the most confusing concepts in my AI learning was Retrieval-Augmented Generation (RAG).  I learned that what RAG does is:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Before asking the LLM to answer a question, retrieve relevant information and include it in the prompt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With the pgEdge pipeline, the flow becomes very visible.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Documents are converted into clean text&lt;/li&gt;
  &lt;li&gt;Text is split into chunks&lt;/li&gt;
  &lt;li&gt;Chunks are embedded into vectors&lt;/li&gt;
  &lt;li&gt;Vectors are stored in PostgreSQL&lt;/li&gt;
  &lt;li&gt;A question is embedded into a vector&lt;/li&gt;
  &lt;li&gt;A similarity search finds relevant chunks&lt;/li&gt;
  &lt;li&gt;Those chunks are inserted into the prompt&lt;/li&gt;
  &lt;li&gt;The LLM generates the response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From this, I realized that the LLM is not storing my data.  Instead, the system retrieves relevant information &lt;em&gt;on demand&lt;/em&gt; and feeds it into the prompt.  The RAG layer is a facilitator for the LLM’s response.&lt;/p&gt;
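
&lt;p&gt;To make the retrieval steps concrete, here is a toy sketch of steps 5 through 7 in plain Python.  The vectors and document names are made up; in the real pipeline the embeddings come from the model and the similarity search runs inside Postgres via pgvector:&lt;/p&gt;

```python
import math

# Toy "embeddings" -- in the real pipeline these come from an embedding model
docs = {
    "autovacuum": [0.9, 0.1, 0.0],
    "replication": [0.1, 0.9, 0.0],
    "indexes": [0.2, 0.2, 0.9],
}

def cosine_distance(a, b):
    # Same metric as pgvector's cosine-distance operator
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def retrieve(query_vec, k=1):
    # Step 6: rank stored chunks by distance to the query vector
    ranked = sorted(docs, key=lambda name: cosine_distance(query_vec, docs[name]))
    return ranked[:k]

# Step 5: the question is embedded; here we fake a vector close to "autovacuum"
context = retrieve([0.85, 0.15, 0.05])
# Step 7: the retrieved chunks are spliced into the prompt sent to the LLM
prompt = f"Answer using this context: {context}. Question: ..."
print(context)  # prints: ['autovacuum']
```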

&lt;hr /&gt;

&lt;h1 id=&quot;the-role-of-the-vectorizer&quot;&gt;The Role of the Vectorizer&lt;/h1&gt;

&lt;p&gt;The vectorizer is a crucial step in the pipeline.  Its job is to convert human language into embeddings, which are high-dimensional numeric representations of meaning.  With vectors, searching by natural language becomes possible, instead of relying on old-fashioned keyword matching.&lt;/p&gt;

&lt;p&gt;Once the embeddings (vectorized documents) are stored in PostgreSQL using pgvector, everything starts to look familiar again for database engineers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;indexing&lt;/li&gt;
  &lt;li&gt;storage&lt;/li&gt;
  &lt;li&gt;similarity search&lt;/li&gt;
  &lt;li&gt;ranking results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Managing these things looks pretty doable for a database guy like me 😂&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;dont-try-this-at-home&quot;&gt;&lt;del&gt;Don’t&lt;/del&gt; Try This At Home!&lt;/h1&gt;

&lt;p&gt;After writing about the pgEdge stack I wanted to make it as easy as possible for others to reproduce the same experience, so I &lt;a href=&quot;https://github.com/richyen/learn-ai-with-postgres&quot;&gt;packaged everything into a Docker Compose project&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Clone the repository and run:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone https://github.com/richyen/learn-ai-with-postgres.git
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;learn-ai-with-postgres
&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;documents &lt;span class=&quot;c&quot;&gt;# put some txt files in there for vectorization&lt;/span&gt;
docker compose up &lt;span class=&quot;nt&quot;&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That final command:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Builds a custom PostgreSQL image with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgvector&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgedge_vectorizer&lt;/code&gt; compiled in&lt;/li&gt;
  &lt;li&gt;Starts an Ollama container and pulls the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;embeddinggemma&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;glm-4.7-flash&lt;/code&gt; models locally&lt;/li&gt;
  &lt;li&gt;Runs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgedge-docloader&lt;/code&gt; to ingest any documents you’ve put into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;documents/&lt;/code&gt; folder&lt;/li&gt;
  &lt;li&gt;Calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgedge_vectorizer.enable_vectorization()&lt;/code&gt;, which starts background workers inside Postgres that chunk and embed every page&lt;/li&gt;
  &lt;li&gt;Starts the RAG server on port 8080&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No API keys, no cloud services. Everything runs on your own hardware.&lt;/p&gt;

&lt;p&gt;Once the RAG server is up (watch for the setup container to exit cleanly), try asking it a question:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/pipelines/pg-docs &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;-H&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Content-Type: application/json&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;{&quot;query&quot;: &quot;How does autovacuum decide when to run?&quot;}&apos;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  | jq &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The answer comes back a few seconds later, grounded in the actual PostgreSQL documentation:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;answer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Autovacuum in PostgreSQL is triggered based on thresholds defined by two parameters: autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor. The daemon considers a table eligible for vacuuming when the number of dead tuples exceeds the threshold plus (scale_factor × total row count) ...&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can also run raw similarity searches directly in SQL to see exactly what the retrieval step is doing before the LLM touches anything:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snippet&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;documents_content_chunks&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;documents&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;embedding&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;embedding&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;pgedge_vectorizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generate_embedding&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;autovacuum threshold configuration&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is the same pgvector &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;=&amp;gt;&lt;/code&gt; (cosine distance) operator the RAG server uses internally — you can inspect the retrieval step at any time without going through the HTTP API.&lt;/p&gt;

&lt;p&gt;Embeddings are generated in the background by Postgres workers, so you can start querying as soon as a few hundred chunks are ready. Watch the progress with:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;psql postgresql://postgres:password@localhost:5432/pgai &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;
SELECT
  (SELECT count(*) FROM documents)                                             AS total_docs,
  (SELECT count(*) FROM documents_content_chunks WHERE embedding IS NOT NULL)  AS vectorized;
&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The project also includes the pgedge-postgres-mcp server on port 8081, which exposes the knowledge base via the Model Context Protocol — so it can be wired directly into Claude Desktop, VS Code Copilot, or any other MCP-compatible client.&lt;/p&gt;

&lt;hr /&gt;
&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;There’s a lot of pressure right now to “learn AI,” but that phrase can mean many different things.  For people coming from infrastructure, databases, or backend engineering, one of the most approachable paths is simply:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;build a small RAG pipeline and observe how the pieces fit together.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The pgEdge tooling made this surprisingly straightforward.  Instead of assembling half a dozen unrelated frameworks, the components already fit together:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;doc ingestion&lt;/li&gt;
  &lt;li&gt;vectorization&lt;/li&gt;
  &lt;li&gt;PostgreSQL storage&lt;/li&gt;
  &lt;li&gt;retrieval&lt;/li&gt;
  &lt;li&gt;prompt generation&lt;/li&gt;
  &lt;li&gt;LLM response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once I saw the entire flow working end-to-end, the AI ecosystem made a lot more sense.  Setting up the pgEdge RAG stack turned out to be a surprisingly effective way to see that architecture in action.&lt;/p&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;
</description>
        <pubDate>Mon, 16 Mar 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/03/16/pgedge_ai_repos.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/03/16/pgedge_ai_repos.html</guid>
        
        <category>postgres</category>
        
        <category>ai</category>
        
        <category>rag</category>
        
        <category>vector</category>
        
        <category>pgvector</category>
        
        <category>ollama</category>
        
        <category>docker</category>
        
        <category>mcp</category>
        
        
        
      </item>
    
      <item>
        <title>Debugging RDS Proxy Pinning: How a Hidden JIT Toggle Created Thousands of Pinned Connections</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;When using AWS RDS Proxy, the goal is to achieve connection multiplexing – many client connections share a much smaller pool of backend PostgreSQL connections, giving more resources per connection and keeping query execution running smoothly.&lt;/p&gt;

&lt;p&gt;However, if the proxy detects that a session has changed internal state in a way it cannot safely track, it &lt;strong&gt;pins&lt;/strong&gt; the client connection to a specific backend connection. Once pinned, that connection can never be multiplexed again.  This was the case with a recent database I worked on.&lt;/p&gt;

&lt;p&gt;In this case, we observed the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;extremely high CPU usage&lt;/li&gt;
  &lt;li&gt;relatively high LWLock wait times&lt;/li&gt;
  &lt;li&gt;OOM killer activity on the database, maybe once every day or two&lt;/li&gt;
  &lt;li&gt;thousands of active connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What was strange about it all was that the queries involved were relatively simple, with at most one join.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;finding-the-pinning-source&quot;&gt;Finding the Pinning Source&lt;/h1&gt;

&lt;p&gt;To get to the root cause, one option was to look in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_statements&lt;/code&gt;.  However, that approach had two problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Getting a clean snapshot of the statistics while thousands of queries were being actively processed would be tricky.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_statements&lt;/code&gt; normalizes queries and does not expose the values passed to parameter placeholders.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead, to see the actual parameters, we briefly enabled &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_statement = &apos;all&apos;&lt;/code&gt;.  This immediately surfaced something interesting in the logs, which I could download and review at my own pace.&lt;/p&gt;

&lt;p&gt;What we saw were statements like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT set_config($2,$1,$3)&lt;/code&gt; with parameters related to JIT configuration – that was the first real clue.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;getting-to-the-bottom&quot;&gt;Getting to the Bottom&lt;/h1&gt;

&lt;p&gt;After tracing the behavior through the stack, the root cause turned out to be surprisingly indirect.  The application created new connections through SQLAlchemy’s asyncpg dialect, and we needed to drill down into that driver’s behavior.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;step-1--reviewing-how-sqlalchemy-registers-json-codecs&quot;&gt;Step 1 – Reviewing how SQLAlchemy registers JSON codecs&lt;/h3&gt;

&lt;p&gt;During connection initialization, SQLAlchemy runs an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;on_connect&lt;/code&gt; hook:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;await_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;setup_asyncpg_json_codec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;await_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;setup_asyncpg_jsonb_codec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This registers optimized JSON and JSONB codecs (the client’s application deals with a lot of JSONB data).&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;step-2--observing-how-asyncpg-introspects-type-metadata&quot;&gt;Step 2 – Observing how asyncpg introspects type metadata&lt;/h3&gt;

&lt;p&gt;Registering those codecs requires looking up type OIDs in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_catalog&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That triggers asyncpg’s internal function: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_introspect_types()&lt;/code&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;step-3--catching-asyncpg-temporarily-disabling-jit&quot;&gt;Step 3 – Catching asyncpg temporarily disabling JIT&lt;/h3&gt;

&lt;p&gt;Inside &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_introspect_types()&lt;/code&gt; there is this block:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_introspect_types&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typeoids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timeout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_server_caps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;cfgrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__execute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;SELECT current_setting(&apos;jit&apos;) AS cur,
                      set_config(&apos;jit&apos;, &apos;off&apos;, false) AS new&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This block is harmless in itself: to avoid rare edge cases with complex type queries, asyncpg temporarily disables JIT, runs the introspection query, and restores the setting afterwards.  For direct PostgreSQL connections, this is perfectly fine.&lt;/p&gt;

&lt;p&gt;Unfortunately, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_config()&lt;/code&gt; changes session state.  RDS Proxy cannot safely track this change, so it pins the client connection to a dedicated backend session.  Once pinned, that connection can never be multiplexed again for the duration of the session.&lt;/p&gt;
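&lt;p&gt;For reference, the third argument to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_config()&lt;/code&gt; controls the scope of the change (per the PostgreSQL documentation), and it is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt; in asyncpg’s call that makes the change session-wide:&lt;/p&gt;

```python
# set_config(setting_name, new_value, is_local) -- the third argument
# controls scope (per the PostgreSQL docs):
#   is_local = true  : change reverts at the end of the current transaction
#   is_local = false : change persists for the rest of the session, which
#                      is the session-state change RDS Proxy reacts to
SET_JIT_LOCAL = "SELECT set_config('jit', 'off', true)"
SET_JIT_SESSION = "SELECT set_config('jit', 'off', false)"  # what asyncpg runs
```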

&lt;p&gt;In short, since every connection initialization triggers the JIT toggle, every RDS Proxy connection gets pinned to a database connection, effectively defeating RDS Proxy’s whole purpose of connection multiplexing.  With thousands of live connections doing relatively little work, the postmaster accumulates significant LWLock overhead, memory buffers don’t get flushed, and the OOM killer can be invoked when conditions are right.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;the-fix&quot;&gt;The Fix&lt;/h1&gt;

&lt;p&gt;The key observation is that asyncpg only runs the JIT toggle if it believes the server supports JIT.&lt;/p&gt;

&lt;p&gt;That capability is stored in an internal structure &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_server_caps&lt;/code&gt;. If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jit&lt;/code&gt; is set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;False&lt;/code&gt;, asyncpg skips the entire block.&lt;/p&gt;
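&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_server_caps&lt;/code&gt; is a namedtuple, which is why the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_replace()&lt;/code&gt; trick in the fix below works.  A self-contained sketch of the mechanics (the field list approximates asyncpg’s internal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ServerCapabilities&lt;/code&gt;, which is an internal detail and may differ across versions):&lt;/p&gt;

```python
from collections import namedtuple

# Approximation of asyncpg's internal ServerCapabilities namedtuple;
# the exact field list is an internal detail and may vary by version.
ServerCapabilities = namedtuple(
    "ServerCapabilities",
    ["advisory_locks", "notifications", "plpgsql", "sql_reset",
     "sql_close_all", "jit"],
)

caps = ServerCapabilities(
    advisory_locks=True, notifications=True, plpgsql=True,
    sql_reset=True, sql_close_all=True, jit=True,
)

# _replace() returns a new tuple with only the named field changed,
# leaving the original untouched -- exactly what the hook relies on.
patched = caps._replace(jit=False)
```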

&lt;p&gt;So we added a SQLAlchemy connection hook:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;listens_for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;engine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sync_engine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;connect&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_prevent_rds_proxy_session_pinning&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dbapi_connection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;connection_record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;raw_conn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dbapi_connection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_connection&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;hasattr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;raw_conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;_server_caps&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;raw_conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_server_caps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;raw_conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_server_caps&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;raw_conn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_server_caps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This configuration does the following:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Registers a connection hook so that it runs every time a new connection is created.&lt;/li&gt;
  &lt;li&gt;Passes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;insert=True&lt;/code&gt; so that our handler runs &lt;strong&gt;before&lt;/strong&gt; SQLAlchemy’s own &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;on_connect&lt;/code&gt; logic.  That is important because the JSON codec registration is what triggers the introspection.&lt;/li&gt;
  &lt;li&gt;Disables the JIT capability flag. By using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_server_caps._replace(jit=False)&lt;/code&gt;, we tell asyncpg to skip the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_config()&lt;/code&gt; block entirely.&lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;the-result&quot;&gt;The Result&lt;/h1&gt;

&lt;p&gt;After deploying the asyncpg fix, we saw the number of pinned sessions drop precipitously:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/richyen/richyen.github.io/refs/heads/gh-pages/img/rds_proxy_pinning.png&quot; alt=&quot;RDS Proxy Pinning Graph&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Of course, we were still seeing many pinned sessions, which we continued to address through other fixes, but this first step alone reduced pinning by over 50%.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;other-fix-attempts-that-didnt-work&quot;&gt;Other Fix Attempts That Didn’t Work&lt;/h1&gt;

&lt;p&gt;Before landing on this fix, we attempted a few other approaches.&lt;/p&gt;

&lt;p&gt;First, we attempted to disable JIT via connection parameters by setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;server_settings={&quot;jit&quot;: &quot;off&quot;}&lt;/code&gt;.  This fails because RDS Proxy rejects it with a message like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;FeatureNotSupportedError:
RDS Proxy currently doesn&apos;t support the option jit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We also tried disabling prepared statement caching with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prepared_statement_cache_size=0&lt;/code&gt; in the configuration.  This didn’t work because it prevents named prepared statement pinning, but it does not prevent &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_config()&lt;/code&gt; pinning.&lt;/p&gt;
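&lt;p&gt;In code, the two failed attempts looked roughly like this (these are sketches of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;connect_args&lt;/code&gt; involved, not verbatim from our configuration):&lt;/p&gt;

```python
# 1) Passing jit=off as a startup parameter -- rejected by RDS Proxy
#    with FeatureNotSupportedError:
rejected_connect_args = {"server_settings": {"jit": "off"}}

# 2) Disabling asyncpg's prepared-statement cache -- accepted, but it
#    only prevents named-prepared-statement pinning, not set_config()
#    pinning:
insufficient_connect_args = {"prepared_statement_cache_size": 0}

# Either dict would be passed as:
# create_async_engine(url, connect_args=...)
```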

&lt;p&gt;The only fix that worked was to add the pin-prevention hook as described above.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;lessons-learned&quot;&gt;Lessons Learned&lt;/h1&gt;

&lt;p&gt;A few takeaways from this debugging experience:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;RDS Proxy pinning can come from unexpected places.  Even small session-level changes can disable multiplexing.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_statements&lt;/code&gt; hides parameter values.  It’s great for spotting query patterns, but it does not expose bound parameters, which can hide critical clues.  Sometimes the fastest diagnostic tool is temporarily enabling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_statement = &apos;all&apos;&lt;/code&gt;, which is what exposed the parameters in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_config()&lt;/code&gt; call here.&lt;/li&gt;
  &lt;li&gt;SQLAlchemy and asyncpg have some quirks that need to be addressed when using them with RDS Proxy.&lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;The entire chain looked like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SQLAlchemy connection
 → asyncpg codec registration
 → asyncpg type introspection
 → temporary JIT disable via set_config()
 → RDS Proxy detects session state change
 → connection gets pinned
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A single hidden configuration toggle resulted in &lt;strong&gt;thousands of pinned sessions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once identified, the fix was only a few lines of code.&lt;/p&gt;

&lt;p&gt;But getting there required following the entire stack – from SQLAlchemy to asyncpg to PostgreSQL to RDS Proxy.&lt;/p&gt;

&lt;p&gt;Hopefully this saves someone else a few hours (or days) of debugging.&lt;/p&gt;
</description>
        <pubDate>Thu, 12 Mar 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/03/12/rds_proxy_pinning.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/03/12/rds_proxy_pinning.html</guid>
        
        <category>postgres</category>
        
        <category>rds-proxy</category>
        
        <category>sqlalchemy</category>
        
        <category>asyncpg</category>
        
        <category>performance</category>
        
        <category>debugging</category>
        
        
        
      </item>
    
  </channel>
</rss>
