Odoo PostgreSQL Dead Tuple and Autovacuum Bloat Incident Runbook
A production-safe runbook to detect dead tuple buildup, unblock autovacuum, and recover Odoo query latency before bloat escalates into an outage.
When Odoo gets slower over hours or days (without obvious CPU spikes), dead tuple accumulation and stalled autovacuum are common causes. This runbook gives a deterministic response: confirm bloat pressure, remove blockers safely, recover table health, and prevent recurrence.
Incident signals
Treat this as an active reliability incident when multiple signals persist:
- Odoo list views/searches degrade across multiple modules (sale.order, stock.move, account.move).
- Database size grows quickly while live row counts do not.
- n_dead_tup keeps rising on high-write tables.
- Long-running transactions or idle in transaction sessions are present.
- Autovacuum workers are active but not reducing dead tuples over repeated checks.
Step 0 — Stabilize write pressure first
Before touching vacuum settings or running manual vacuum, reduce incoming churn:
- Pause non-critical imports/batch writes/large cron jobs.
- Keep checkout, invoicing, and warehouse-critical flows online if possible.
- Assign one operator to DB commands and one to incident notes/timeline.
This prevents chasing a moving target while dead tuples keep accumulating.
Step 1 — Baseline the blast radius
# Top dead-tuple tables in current database
psql "$ODOO_DB_URI" -c "
select
schemaname,
relname,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
last_autovacuum,
vacuum_count,
autovacuum_count
from pg_stat_user_tables
order by n_dead_tup desc
limit 20;
"
# Database-wide tuple churn signal
psql "$ODOO_DB_URI" -c "
select datname,
numbackends,
xact_commit,
xact_rollback,
tup_inserted,
tup_updated,
tup_deleted,
deadlocks
from pg_stat_database
where datname = current_database();
"
Capture this output in incident notes so you can verify trend reversal later.
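To compare baseline samples consistently, the dead_pct expression from the query above can be mirrored in a small helper. A minimal sketch — the 20% flag threshold and the sample rows are illustrative assumptions, not PostgreSQL defaults:

```python
def dead_pct(n_live: int, n_dead: int) -> float:
    """Dead-tuple percentage, mirroring the SQL expression above."""
    total = n_live + n_dead
    return round(100.0 * n_dead / total, 2) if total else 0.0

def flag_offenders(rows, threshold=20.0):
    """rows: iterable of (relname, n_live_tup, n_dead_tup).
    Returns tables at or above the (assumed) dead_pct threshold, worst first."""
    flagged = [(rel, dead_pct(live, dead)) for rel, live, dead in rows]
    return sorted((r for r in flagged if r[1] >= threshold),
                  key=lambda r: r[1], reverse=True)

# Illustrative sample, not real incident data
sample = [("stock_move", 80_000, 40_000),
          ("account_move_line", 500_000, 25_000),
          ("mail_message", 0, 0)]
print(flag_offenders(sample))  # → [('stock_move', 33.33)]
```

Re-running the helper on each sampling round gives a consistent offender list to track through the rest of the incident.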
Step 2 — Check whether autovacuum is blocked
# Active vacuum progress (max_dead_tuples/num_dead_tuples apply to PostgreSQL 16 and earlier; PostgreSQL 17 renamed these columns)
psql "$ODOO_DB_URI" -c "
select
pid,
relid::regclass as relation,
phase,
heap_blks_total,
heap_blks_scanned,
heap_blks_vacuumed,
index_vacuum_count,
max_dead_tuples,
num_dead_tuples
from pg_stat_progress_vacuum;
"
# Long transactions that can block cleanup horizons
psql "$ODOO_DB_URI" -c "
select
pid,
usename,
application_name,
state,
now() - xact_start as txn_age,
now() - query_start as query_age,
wait_event_type,
wait_event,
left(query, 160) as query
from pg_stat_activity
where datname = current_database()
and xact_start is not null
order by xact_start asc
limit 20;
"
Triage checklist
- Are the oldest transactions older than your normal request/worker profile?
- Is state = 'idle in transaction' present for many minutes?
- Is autovacuum repeatedly scanning the same hot table with little progress?
- Are lock waits (wait_event_type = 'Lock') preventing vacuum completion?
If yes, clear blockers before forcing aggressive manual cleanup.
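One way to make "scanning the same table with little progress" concrete is to diff two pg_stat_progress_vacuum samples taken a few minutes apart. A minimal sketch, assuming you record heap_blks_scanned per worker pid at each check (the function name and min_blocks cutoff are illustrative):

```python
def stalled_vacuums(earlier: dict, later: dict, min_blocks: int = 1) -> list:
    """Return pids of vacuum workers that advanced heap_blks_scanned by
    fewer than min_blocks between two samples taken minutes apart.
    A stalled worker usually points at a lock wait or an old xmin horizon."""
    return [pid for pid, scanned in later.items()
            if pid in earlier and scanned - earlier[pid] < min_blocks]
```

Pair any stalled pid with the pg_stat_activity query above to see what that worker is waiting on.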
Step 3 — Remove blockers in safest order
Use escalation, not bulk termination:
- Let short-running, near-complete transactions finish.
- Cancel obvious runaway statements.
- Terminate only clearly stuck sessions that keep cleanup blocked.
# Safer first action
psql "$ODOO_DB_URI" -c "select pg_cancel_backend(<pid>);"
# Escalate only when cancellation fails and impact is widening
psql "$ODOO_DB_URI" -c "select pg_terminate_backend(<pid>);"
Avoid terminating replication, backup, or migration sessions unless confirmed as incident root cause.
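The escalation order above can be encoded as a small decision helper. A minimal sketch — the 10- and 30-minute cutoffs are illustrative assumptions, not PostgreSQL defaults, and critical sessions (replication, backup, migration) are always left alone:

```python
def escalation_action(state: str, txn_age_min: float,
                      is_critical_session: bool = False) -> str:
    """Suggest the least-destructive next action for one session."""
    if is_critical_session:
        return "leave alone"            # replication/backup/migration
    if txn_age_min < 10:
        return "wait"                   # likely to finish on its own
    if state == "active":
        return "pg_cancel_backend"      # cancel the running statement first
    if state == "idle in transaction" and txn_age_min >= 30:
        return "pg_terminate_backend"   # no active query, so cancel cannot help
    return "monitor"
```

Note that pg_cancel_backend only interrupts a running statement; a session sitting idle in transaction has nothing to cancel, which is why it skips straight to termination once clearly stuck.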
Step 4 — Controlled manual vacuum on hottest tables
After blocker cleanup and write-pressure reduction, run targeted vacuum on top offender tables.
# Example: run one table at a time, reassess between each
psql "$ODOO_DB_URI" -c "vacuum (verbose, analyze) public.stock_move;"
psql "$ODOO_DB_URI" -c "vacuum (verbose, analyze) public.account_move_line;"
Guidance:
- Prefer VACUUM (ANALYZE) first; it runs online and is safer during incidents.
- Do not run VACUUM FULL during peak traffic unless outage severity justifies its ACCESS EXCLUSIVE lock.
- If index bloat is dominant and urgent, use REINDEX INDEX CONCURRENTLY for targeted indexes (slower but less blocking).
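The guidance above can be expressed as a small chooser. A minimal sketch — the 50% dead_pct cutoff for considering VACUUM FULL is an illustrative assumption; the three actions are the ones named in the bullets:

```python
def recovery_action(dead_pct: float, index_bloat_dominant: bool,
                    peak_traffic: bool) -> str:
    """Pick the least-blocking recovery action for one table."""
    if index_bloat_dominant:
        return "REINDEX INDEX CONCURRENTLY"   # targeted, less blocking
    if dead_pct >= 50 and not peak_traffic:
        return "VACUUM FULL"                  # ACCESS EXCLUSIVE lock; off-peak only
    return "VACUUM (VERBOSE, ANALYZE)"        # online, incident-safe default
```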
Step 5 — Verification and rollback guardrails
Verify recovery trend
# Re-check dead tuples on the same top tables every 5–10 minutes
psql "$ODOO_DB_URI" -c "
select relname, n_live_tup, n_dead_tup, last_autovacuum
from pg_stat_user_tables
where relname in ('stock_move', 'account_move_line', 'mail_message')
order by n_dead_tup desc;
"
# Check Odoo error/latency symptoms improving
odoocli logs tail --service odoo --since 15m --grep "timeout|canceling statement|could not serialize|deadlock"
Incident can be closed when:
- Dead-tuple counts trend downward across consecutive samples.
- User-facing Odoo flows recover to normal latency.
- No sustained lock/timeout burst appears in logs.
- Autovacuum activity resumes normal cadence on hot tables.
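The "trend downward across consecutive samples" criterion can be checked mechanically. A minimal sketch, assuming you append each 5–10 minute n_dead_tup reading for a table to a list (requiring two consecutive drops is an illustrative closing bar, not a PostgreSQL rule):

```python
def trending_down(samples: list, window: int = 3) -> bool:
    """True when the last `window` n_dead_tup readings strictly decrease,
    i.e. at least window - 1 consecutive drops."""
    if len(samples) < window:
        return False
    tail = samples[-window:]
    return all(later < earlier for earlier, later in zip(tail, tail[1:]))
```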
Rollback/safety actions if recovery worsens
- Stop manual vacuum cycle and reassess blockers/lock graph.
- Re-enable paused workloads gradually (one lane at a time) only after trend improves.
- If vacuum attempts cause unacceptable latency, defer heavy operations and schedule off-peak maintenance window.
Hardening and prevention checklist
- Alert on high n_dead_tup for critical Odoo tables.
- Alert on oldest transaction age and idle in transaction session count.
- Enforce idle_in_transaction_session_timeout for leak-prone clients.
- Tune autovacuum per hot table (scale factor and threshold) instead of global-only tuning.
- Review large batch jobs to commit in smaller chunks.
- Run periodic bloat review (tables + indexes) and planned maintenance before incidents.
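Per-table tuning matters because the default trigger scales with table size: PostgreSQL launches autovacuum on a table when n_dead_tup exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples (defaults 50 and 0.2). A minimal sketch of that formula, useful for sizing per-table overrides:

```python
def autovacuum_fires(n_dead_tup: int, reltuples: int,
                     threshold: int = 50, scale_factor: float = 0.2) -> bool:
    """PostgreSQL's autovacuum trigger condition for one table
    (defaults: autovacuum_vacuum_threshold=50,
    autovacuum_vacuum_scale_factor=0.2)."""
    return n_dead_tup > threshold + scale_factor * reltuples

# On a 5M-row table, defaults wait for >1,000,050 dead tuples; lowering
# scale_factor to 0.01 for that table drops the bar to ~50,050.
```

The per-table override uses storage parameters, e.g. ALTER TABLE stock_move SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_threshold = 1000); the values here are illustrative and should be sized from your own churn data.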
The operating principle: clear blockers first, then vacuum surgically, then tune with evidence. This avoids turning bloat recovery into a lock-amplified outage.