Six months after launch, a client pinged us: 'the dashboard takes 12 seconds to load now.' At launch, the same dashboard rendered in under 300ms. Same query, same data shape, ~5× more rows. The slowdown was non-linear, and that should have been our first clue.
What we ruled out
- Query plan regression — EXPLAIN ANALYZE confirmed the planner was still using the same index.
- Missing index — every WHERE clause column had one.
- Connection pool exhaustion — pgBouncer was idle 70% of the time.
- Disk I/O — IOPS was nowhere near the ceiling.
The culprit: dead tuples in a hot index
The table had a high-churn boolean column (an is_active flag) that was indexed. Every UPDATE that touched the row created a new tuple, and the index page accumulated dead pointers. VACUUM was running, but autovacuum thresholds hadn't been tuned for write volume — the index was 80% bloat by the time we looked.
A REINDEX CONCURRENTLY on the offending index brought query time back to 280ms. But the real fix was upstream: dropping the index entirely on the low-cardinality boolean (it was a partial-scan red herring) and adding a partial index for the actual hot path.
What we now monitor
- pg_stat_user_indexes.idx_scan vs idx_tup_read — high reads with low scans = bloat suspect.
- pgstattuple_approx on critical indexes, weekly.
- Autovacuum log lines — if naptime > target, the index is falling behind.
Postgres rewards operators who keep an eye on it. The defaults are sane for moderate workloads — they are not sane for high-churn columns over a year of growth.