
March 21, 2026

Odoo PostgreSQL Dead Tuple and Autovacuum Bloat Incident Runbook

A production-safe runbook to detect dead tuple buildup, unblock autovacuum, and recover Odoo query latency before bloat becomes outage-level.

When Odoo gets slower over hours or days (without obvious CPU spikes), dead tuple accumulation and stalled autovacuum are common causes. This runbook gives a deterministic response: confirm bloat pressure, remove blockers safely, recover table health, and prevent recurrence.

Incident signals

Treat this as an active reliability incident when multiple signals persist:

  • Odoo list views/searches degrade across multiple modules (sale.order, stock.move, account.move).
  • Database size grows quickly while live row counts do not.
  • n_dead_tup keeps rising on high-write tables.
  • Long-running transactions or sessions stuck in the idle in transaction state are present.
  • Autovacuum workers are active but not reducing dead tuples over repeated checks.

Step 0 — Stabilize write pressure first

Before touching vacuum settings or running manual vacuum, reduce incoming churn:

  1. Pause non-critical imports/batch writes/large cron jobs.
  2. Keep checkout, invoicing, and warehouse-critical flows online if possible.
  3. Assign one operator to DB commands and one to incident notes/timeline.

This prevents chasing a moving target while dead tuples keep accumulating.
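If cron-driven churn is a big contributor, one low-risk lever is disabling non-critical scheduled jobs directly in ir_cron. A sketch, with assumptions: the name filter below is an example only, and cron_name is the column in recent Odoo versions (older releases used name). Record the returned ids so you can re-enable the jobs after the incident.

```shell
# Sketch: disable non-critical Odoo cron jobs during the incident.
# The ilike filter is illustrative -- adapt it to your deployment.
psql "$ODOO_DB_URI" -c "
update ir_cron
set active = false
where active
  and cron_name ilike '%import%'
returning id, cron_name;
"
```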

Step 1 — Baseline the blast radius

# Top dead-tuple tables in current database
psql "$ODOO_DB_URI" -c "
select
  schemaname,
  relname,
  n_live_tup,
  n_dead_tup,
  round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
  last_autovacuum,
  vacuum_count,
  autovacuum_count
from pg_stat_user_tables
order by n_dead_tup desc
limit 20;
"
# Database-wide tuple churn signal
psql "$ODOO_DB_URI" -c "
select datname,
       numbackends,
       xact_commit,
       xact_rollback,
       tup_inserted,
       tup_updated,
       tup_deleted,
       deadlocks
from pg_stat_database
where datname = current_database();
"

Capture this output in incident notes so you can verify trend reversal later.

Step 2 — Check whether autovacuum is blocked

# Active vacuum progress
psql "$ODOO_DB_URI" -c "
select
  pid,
  relid::regclass as relation,
  phase,
  heap_blks_total,
  heap_blks_scanned,
  heap_blks_vacuumed,
  index_vacuum_count,
  max_dead_tuples,
  num_dead_tuples
from pg_stat_progress_vacuum;
"
# Long transactions that can block cleanup horizons
psql "$ODOO_DB_URI" -c "
select
  pid,
  usename,
  application_name,
  state,
  now() - xact_start as txn_age,
  now() - query_start as query_age,
  wait_event_type,
  wait_event,
  left(query, 160) as query
from pg_stat_activity
where datname = current_database()
  and xact_start is not null
order by xact_start asc
limit 20;
"

Triage checklist

  • Are the oldest transactions older than your typical request or worker job duration?
  • Is state = 'idle in transaction' present for many minutes?
  • Is autovacuum repeatedly scanning the same hot table with little progress?
  • Are lock waits (wait_event_type = 'Lock') preventing vacuum completion?

If yes, clear blockers before forcing aggressive manual cleanup.
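When the checklist points at lock waits, pg_blocking_pids() maps each waiting backend to the sessions blocking it, which tells you exactly which pid to target first:

```shell
# Which sessions are blocked, and by whom (empty result = no lock waits).
psql "$ODOO_DB_URI" -c "
select
  pid,
  pg_blocking_pids(pid) as blocked_by,
  wait_event_type,
  wait_event,
  left(query, 120) as query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) > 0;
"
```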

Step 3 — Remove blockers in safest order

Use escalation, not bulk termination:

  1. Let short-running, near-complete transactions finish.
  2. Cancel obvious runaway statements.
  3. Terminate only clearly stuck sessions that keep cleanup blocked.

# Safer first action
psql "$ODOO_DB_URI" -c "select pg_cancel_backend(<pid>);"

# Escalate only when cancellation fails and impact is widening
psql "$ODOO_DB_URI" -c "select pg_terminate_backend(<pid>);"

Avoid terminating replication, backup, or migration sessions unless confirmed as incident root cause.
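The escalation above can be wrapped in a small helper so termination is only reached when cancellation demonstrably failed. A sketch; the 10-second grace period is an arbitrary choice to tune:

```shell
# Cancel first; terminate only if the backend is still alive after a grace period.
cancel_then_terminate() {
  local pid="$1"
  psql "$ODOO_DB_URI" -c "select pg_cancel_backend($pid);"
  sleep 10  # grace period -- adjust to your environment
  local alive
  alive=$(psql "$ODOO_DB_URI" -tAc \
    "select count(*) from pg_stat_activity where pid = $pid;")
  if [ "$alive" = "1" ]; then
    psql "$ODOO_DB_URI" -c "select pg_terminate_backend($pid);"
  fi
}
```

Usage: cancel_then_terminate 12345 (with the pid from Step 2's activity query).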

Step 4 — Controlled manual vacuum on hottest tables

After blocker cleanup and write-pressure reduction, run targeted vacuum on top offender tables.

# Example: run one table at a time, reassess between each
psql "$ODOO_DB_URI" -c "vacuum (verbose, analyze) public.stock_move;"
psql "$ODOO_DB_URI" -c "vacuum (verbose, analyze) public.account_move_line;"

Guidance:

  • Prefer VACUUM (ANALYZE) first; it is online and safer during incidents.
  • Do not run VACUUM FULL during peak traffic unless outage severity justifies an ACCESS EXCLUSIVE lock; it rewrites the table and blocks all reads and writes for the duration.
  • If index bloat is dominant and urgent, use REINDEX INDEX CONCURRENTLY (PostgreSQL 12+) for targeted indexes (slower but less blocking).
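The one-table-at-a-time discipline can be scripted against the Step 1 ranking. A sketch; the limit of 5 tables and the 60-second pause are assumptions, and you should watch latency between iterations rather than letting it run unattended:

```shell
# Vacuum the top dead-tuple tables one at a time, pausing between each
# so you can reassess application latency before the next table.
for t in $(psql "$ODOO_DB_URI" -tAc "
  select schemaname || '.' || relname
  from pg_stat_user_tables
  order by n_dead_tup desc
  limit 5;"); do
  echo "vacuuming $t"
  psql "$ODOO_DB_URI" -c "vacuum (verbose, analyze) $t;"
  sleep 60
done
```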

Step 5 — Verification and rollback guardrails

Verify recovery trend

# Re-check dead tuples on the same top tables every 5–10 minutes
psql "$ODOO_DB_URI" -c "
select relname, n_live_tup, n_dead_tup, last_autovacuum
from pg_stat_user_tables
where relname in ('stock_move', 'account_move_line', 'mail_message')
order by n_dead_tup desc;
"

# Check Odoo error/latency symptoms improving
odoocli logs tail --service odoo --since 15m --grep "timeout|canceling statement|could not serialize|deadlock"

Incident can be closed when:

  • Dead-tuple counts trend downward across consecutive samples.
  • User-facing Odoo flows recover to normal latency.
  • No sustained lock/timeout burst appears in logs.
  • Autovacuum activity resumes normal cadence on hot tables.
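The "consecutive samples" evidence can come from a bounded sampling loop. A sketch; the three-sample count, 5-minute interval, and log filename are all assumptions:

```shell
# Take three samples of dead tuples, 5 minutes apart, and keep them with the
# incident notes as trend evidence.
for sample in 1 2 3; do
  date -u
  psql "$ODOO_DB_URI" -c "
    select relname, n_live_tup, n_dead_tup, last_autovacuum
    from pg_stat_user_tables
    order by n_dead_tup desc
    limit 5;"
  sleep 300
done | tee -a incident-dead-tuples.log
```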

Rollback/safety actions if recovery worsens

  • Stop manual vacuum cycle and reassess blockers/lock graph.
  • Re-enable paused workloads gradually (one lane at a time) only after trend improves.
  • If vacuum attempts cause unacceptable latency, defer heavy operations and schedule off-peak maintenance window.

Hardening and prevention checklist

  • Alert on high n_dead_tup for critical Odoo tables.
  • Alert on oldest transaction age and idle in transaction session count.
  • Enforce idle_in_transaction_session_timeout for leak-prone clients.
  • Tune autovacuum per hot table (scale factor and threshold) instead of global-only tuning.
  • Review large batch jobs to commit in smaller chunks.
  • Run periodic bloat review (tables + indexes) and planned maintenance before incidents.
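Two of the checklist items translate directly into settings. A sketch; the table name, role name, and numbers are assumptions to adapt to your own hot tables and connection roles:

```shell
# Per-table autovacuum tuning: vacuum stock_move after roughly 2% churn
# instead of the global 20% default scale factor.
psql "$ODOO_DB_URI" -c "
alter table public.stock_move set (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_threshold = 1000
);"

# Kill leaked 'idle in transaction' sessions after 5 minutes for the Odoo role.
psql "$ODOO_DB_URI" -c "
alter role odoo set idle_in_transaction_session_timeout = '5min';"
```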

The operating principle: clear blockers first, then vacuum surgically, then tune with evidence. This avoids turning bloat recovery into a lock-amplified outage.
