Six months after launch, a client pinged us: 'the dashboard takes 12 seconds to load now.' At launch, the same dashboard rendered in under 300ms. Same query, same data shape, ~5× more rows. The slowdown was non-linear, and that should have been our first clue.

What we ruled out

  • Query plan regression — EXPLAIN ANALYZE confirmed the planner was still using the same index.
  • Missing index — every WHERE clause column had one.
  • Connection pool exhaustion — pgBouncer was idle 70% of the time.
  • Disk I/O — IOPS was nowhere near the ceiling.

The culprit: dead tuples in a hot index

The table had a high-churn boolean column (an is_active flag) that was indexed. Every UPDATE that touched the row created a new tuple, and the index page accumulated dead pointers. VACUUM was running, but autovacuum thresholds hadn't been tuned for write volume — the index was 80% bloat by the time we looked.

A REINDEX CONCURRENTLY on the offending index brought query time back to 280ms. But the real fix was upstream: dropping the index entirely on the low-cardinality boolean (it was a partial-scan red herring) and adding a partial index for the actual hot path.

What we now monitor

  • pg_stat_user_indexes.idx_scan vs idx_tup_read — high reads with low scans = bloat suspect.
  • pgstattuple_approx on critical indexes, weekly.
  • Autovacuum log lines — if naptime > target, the index is falling behind.

Postgres rewards operators who keep an eye on it. The defaults are sane for moderate workloads — they are not sane for high-churn columns over a year of growth.