Data ScienceJune 5, 2026

Optimizing Postgres With Generated Columns For Jsonb Key Existence Filters

S

Written by

Sage Stream

The problem I kept running into

I was building a real-time analytics pipeline on PostgreSQL, where events came in as JSONB. Each event had a payload, and one of the most common questions in dashboards was along the lines of:

“Show me rows where payload contains key X (and optionally equals some value).”

The catch: the key was inside JSONB, and the filter looked like a simple boolean check.

When I ran it against a few million rows, the query would sometimes take seconds—despite the predicate being “obviously simple.” What I found after poking around was that PostgreSQL often can’t use a normal index for “does this JSONB key exist?” patterns in the way people expect.

So I built a niche optimization that turned the predicate into something index-friendly using generated columns.


The niche pattern: indexing JSONB key existence with a generated column

What generated columns do (plain English)

A generated column is a column whose value is computed from other columns automatically. You can compute a “derived” value (like “payload has user_id key?”) and store it on disk, so PostgreSQL can index it like a regular column.

Why this helps

Instead of filtering with:

  • payload ? 'user_id' (JSONB “key exists” operator)

I transform it into:

  • payload_has_user_id (a boolean column computed from payload)

Then I index that boolean column (or index another computed expression), and PostgreSQL can use it efficiently.


Repro: baseline query that was slow

Table setup

I used a simplified event table:

-- Run in psql. Works on modern PostgreSQL versions (12+; generated columns are long supported). DROP TABLE IF EXISTS events_jsonb; CREATE TABLE events_jsonb ( id bigserial PRIMARY KEY, created_at timestamptz NOT NULL DEFAULT now(), payload jsonb NOT NULL ); -- Helpful baseline indexes CREATE INDEX events_jsonb_created_at_idx ON events_jsonb(created_at);

A slow query

The dashboard query is essentially “key exists”:

EXPLAIN (ANALYZE, BUFFERS) SELECT id, created_at, payload FROM events_jsonb WHERE payload ? 'user_id' ORDER BY created_at DESC LIMIT 50;

On a larger dataset, the planner typically can’t turn that payload ? 'user_id' into an indexable lookup using the created_at index alone. It may scan lots of rows to find matches, because created_at ordering doesn’t tell it where the JSON keys live.


Step 1: Populate with realistic JSONB data

To make the difference measurable, I inserted test data where some rows have the key and some don’t.

INSERT INTO events_jsonb (created_at, payload) SELECT now() - (g * interval '1 minute') as created_at, CASE WHEN (g % 7) = 0 THEN jsonb_build_object('user_id', (100000 + g), 'plan', 'pro') ELSE jsonb_build_object('plan', 'free') END as payload FROM generate_series(1, 300000) g;

This creates:

  • ~1/7 of rows with user_id
  • ~6/7 without

That distribution is important because “rare matches” make poor index usage painfully obvious.


Step 2: Add a generated column for key existence

Now I add a computed boolean:

ALTER TABLE events_jsonb ADD COLUMN payload_has_user_id boolean GENERATED ALWAYS AS (payload ? 'user_id') STORED; -- Index the derived value CREATE INDEX events_jsonb_payload_has_user_id_idx ON events_jsonb(payload_has_user_id, created_at DESC);

Why the index includes created_at DESC

The query orders by created_at DESC and limits to 50 rows. A composite index on (payload_has_user_id, created_at DESC) gives PostgreSQL a way to:

  1. Jump directly to matching payload_has_user_id = true
  2. Read results already in the desired time order
  3. Stop early due to LIMIT 50

Step 3: Re-run the query and compare plans

EXPLAIN (ANALYZE, BUFFERS) SELECT id, created_at, payload FROM events_jsonb WHERE payload_has_user_id ORDER BY created_at DESC LIMIT 50;

What I expect to see in the plan

The important signs are usually:

  • An Index Scan using events_jsonb_payload_has_user_id_idx
  • Very few rows visited compared to a scan of the base table

Even if the actual timing varies by hardware, the planner behavior should switch from “filter after scanning” to “seek directly into an index.”


Step 4: Use equality filters too (value inside JSONB)

Key existence is only half the story. Often the predicate is:

  • key exists AND equals some value

For example: payload->>'user_id' = '123'

Here’s the same approach, but computed into a generated column.

ALTER TABLE events_jsonb ADD COLUMN user_id_text text GENERATED ALWAYS AS (payload->>'user_id') STORED; CREATE INDEX events_jsonb_user_id_text_idx ON events_jsonb(user_id_text, created_at DESC);

Now the query becomes:

EXPLAIN (ANALYZE, BUFFERS) SELECT id, created_at, payload FROM events_jsonb WHERE user_id_text = '100007' ORDER BY created_at DESC LIMIT 50;

Why this is better than indexing (payload) with a GIN index

GIN indexes are excellent for containment queries, but “simple equality on a specific extracted JSON value” is often more directly optimized by converting that extraction into a real indexed column. In practice, this tends to reduce planner uncertainty and makes performance far more predictable.


Data observability angle: why this helps real-time pipelines

In real-time analytics, you often have:

  • frequent queries
  • mixed data quality (missing keys, different JSON shapes)
  • strict latency budgets

When the JSON predicate is turned into an indexed generated column, query latency becomes more stable. That stability matters because monitoring “query time” alone isn’t enough; you need signals that data-driven filters won’t randomly degrade when key presence changes.

Generated columns provide that bridge between:

  • raw semi-structured payloads
  • and structured, observable, indexable features

Gotchas I hit while building this

1) Generated columns need to be STORED for indexing

Using VIRTUAL instead of STORED prevents efficient indexing in the way I needed. STORED physically materializes the computed values so the index stays fast.

2) Don’t forget to query the generated column

The index exists, but PostgreSQL can only use it when the query references the computed column directly (like WHERE payload_has_user_id), not only when you use the original JSON operator.

3) JSONB extraction types matter

payload->>'user_id' produces text. If your data is numeric and you compare against an integer, the planner may cast and ruin index usage. Keeping consistent types avoids that.


Clean up example (optional)

If you ever need to remove the added columns:

ALTER TABLE events_jsonb DROP COLUMN payload_has_user_id; ALTER TABLE events_jsonb DROP COLUMN user_id_text; DROP INDEX IF EXISTS events_jsonb_payload_has_user_id_idx; DROP INDEX IF EXISTS events_jsonb_user_id_text_idx;

Conclusion

I learned that “simple” JSONB predicates like “key exists” (payload ? 'user_id') can be surprisingly hard for PostgreSQL to optimize with general-purpose JSON indexing. By adding STORED generated columns for JSONB key existence and extracted JSON values—and indexing those generated columns with time-aware composite indexes—I was able to turn unpredictable slow scans into fast index seeks that respect ORDER BY ... LIMIT. This approach bridges semi-structured event payloads with real-time SQL performance and makes latency much more observable and stable.