Postgres index bloat: a debugging story (with a happy ending)

Six months after launch, a client pinged us: 'the dashboard takes 12 seconds to load now.' At launch, the same dashboard rendered in under 300ms. Same query, same data shape, ~5× more rows. The slowdown was non-linear, and that should have been our first clue.

What we ruled out

Query plan regression — EXPLAIN ANALYZE confirmed the planner was still using the same index.
Missing index — every WHERE clause column had one.
Connection pool exhaustion — pgBouncer was idle 70% of the time.
Disk I/O — IOPS was nowhere near the ceiling.

The culprit: dead tuples in a hot index

The table had a high-churn boolean column (an is_active flag) that was indexed. Every UPDATE that touched the row created a new tuple, and the index page accumulated dead pointers. VACUUM was running, but autovacuum thresholds hadn't been tuned for write volume — the index was 80% bloat by the time we looked.

A REINDEX CONCURRENTLY on the offending index brought query time back to 280ms. But the real fix was upstream: dropping the index entirely on the low-cardinality boolean (it was a partial-scan red herring) and adding a partial index for the actual hot path.

What we now monitor

pg_stat_user_indexes.idx_scan vs idx_tup_read — high reads with low scans = bloat suspect.
pgstattuple_approx on critical indexes, weekly.
Autovacuum log lines — if naptime > target, the index is falling behind.

Postgres rewards operators who keep an eye on it. The defaults are sane for moderate workloads — they are not sane for high-churn columns over a year of growth.

What we ruled out

The culprit: dead tuples in a hot index

What we now monitor

Keep reading

Why your AI agent keeps lying — and how we fixed it with eval harnesses

React Server Components, two years in: the patterns that survived

Building a RAG eval framework with 200 golden questions

Get the field notes in your inbox.