Odoo PostgreSQL Replication Slot WAL Retention Runbook
A CLI-first incident workflow for when stale replication slots retain WAL, grow disk usage, and threaten Odoo database availability.
When PostgreSQL disk usage grows fast during otherwise normal Odoo traffic, stale replication slots are a common cause. This runbook gives a deterministic sequence: verify slot retention, identify the stale consumer, recover safely, and harden.
Incident signals
- pg_wal volume grows continuously without matching business traffic spikes.
- Alerts fire on free disk dropping while CPU and query load look normal.
- Replica or CDC consumers are offline, but their replication slots still exist.
- PostgreSQL logs include checkpoint pressure or low-disk warnings.
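Before blaming slots, confirm that pg_wal is actually the growth source. A minimal sketch, assuming a default PGDATA layout (the `wal_share` helper and the `PGDATA` variable are illustrative; adjust paths for your install):

```shell
# Hypothetical helper: pg_wal bytes as a percentage of used bytes on the volume.
wal_share() {
  # $1 = pg_wal bytes, $2 = total used bytes on the same filesystem
  echo $(( $1 * 100 / $2 ))
}

# Only meaningful on the database host; skipped when PGDATA is unset.
if [ -n "${PGDATA:-}" ]; then
  du -sb "$PGDATA/pg_wal"   # bytes currently held in WAL
  df -B1 "$PGDATA"          # usage for the volume pg_wal lives on
fi
```

If pg_wal accounts for most of the used space, proceed; otherwise look at bloat, logs, or attachments first.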
Step 0 — Stabilize before editing slots
- Freeze non-critical writes (bulk imports, reprocessing jobs, large attachment backfills).
- Assign one operator to execute commands and one to track timeline/decisions.
- Confirm backups are recent before any destructive slot action.
Do not delete slots blindly. A healthy consumer may be behind but still recoverable.
Step 1 — Measure retained WAL by slot
psql "$ODOO_DB_URI" -c "
select
slot_name,
slot_type,
active,
restart_lsn,
confirmed_flush_lsn,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) as retained_bytes
from pg_replication_slots
order by retained_bytes desc nulls last;
"
Focus on inactive slots retaining large WAL (active = f and high retained_bytes).
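To narrow the output to actionable slots, the same query can be filtered to inactive slots above a byte threshold. A sketch, with a hypothetical 1 GiB cutoff (tune it to your disk headroom):

```shell
# SQL: only inactive slots retaining more than ~1 GiB of WAL.
SQL="
select slot_name, slot_type,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_replication_slots
where not active
  and pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1024 * 1024 * 1024
order by pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) desc;
"

# Runs only when a database URI is configured.
if [ -n "${ODOO_DB_URI:-}" ]; then
  psql "$ODOO_DB_URI" -c "$SQL"
fi
```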
Step 2 — Verify whether the consumer is expected to return
For each high-retention slot, capture owner and purpose from your infra inventory:
- physical standby replication slot
- logical CDC pipeline (ETL/warehouse/integration)
- temporary migration tooling that should already be removed
If ownership is unknown, treat as high-risk and escalate before dropping.
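When the inventory is incomplete, PostgreSQL itself can help identify connected consumers: `pg_replication_slots.active_pid` joins to `pg_stat_activity`, exposing the client address and application_name of whatever currently holds the slot. A sketch (inactive slots will show NULLs here, which is itself a signal):

```shell
# SQL: map each slot's active_pid to its client connection, if any.
SQL="
select s.slot_name, s.active, s.active_pid,
       a.application_name, a.client_addr
from pg_replication_slots s
left join pg_stat_activity a on a.pid = s.active_pid
order by s.slot_name;
"

if [ -n "${ODOO_DB_URI:-}" ]; then
  psql "$ODOO_DB_URI" -c "$SQL"
fi
```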
Step 3 — Recovery decision tree
Path A: consumer is valid and should continue
- Restore consumer connectivity first (network, credentials, process health).
- Keep slot intact.
- Monitor retained WAL every few minutes until it trends down.
watch -n 15 "psql \"$ODOO_DB_URI\" -Atc \"select slot_name||'|'||active||'|'||pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) from pg_replication_slots order by 1\""
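"Trends down" can be made explicit by comparing two retained-byte samples taken a few minutes apart. A minimal sketch; `trending_down` is a hypothetical helper, not part of PostgreSQL:

```shell
# Hypothetical helper: succeeds only when the later sample is strictly
# smaller, i.e. the consumer is catching up on WAL.
trending_down() {
  # $1 = earlier retained bytes, $2 = later retained bytes
  [ "$2" -lt "$1" ]
}
```

Feed it the `retained_bytes` column from the Step 1 query; a flat or growing value after the consumer reconnects means the slot is still not consuming.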
Path B: consumer is permanently decommissioned
Only execute after explicit confirmation from the system owner.
# Replace <slot_name> with the confirmed stale slot
psql "$ODOO_DB_URI" -c "select pg_drop_replication_slot('<slot_name>');"
Drop one slot at a time and immediately re-check free disk and retained WAL.
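The drop can be wrapped in a guard so an active slot is never removed by accident. A sketch, assuming the slot name is already confirmed with its owner (`should_drop` is a hypothetical helper that only approves when the slot reports `active = f`):

```shell
# Hypothetical guard: approve the drop only for a confirmed-inactive slot.
should_drop() {
  [ "$1" = "f" ]
}

slot="<slot_name>"   # replace with the confirmed stale slot

if [ -n "${ODOO_DB_URI:-}" ]; then
  active="$(psql "$ODOO_DB_URI" -Atc \
    "select active from pg_replication_slots where slot_name = '$slot'")"
  if should_drop "$active"; then
    psql "$ODOO_DB_URI" -c "select pg_drop_replication_slot('$slot');"
  else
    echo "refusing to drop '$slot': slot is active or missing" >&2
  fi
fi
```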
Step 4 — Validate recovery before closing incident
- pg_wal growth rate normalizes.
- Free disk trends upward or stabilizes above your safety threshold.
- Remaining replication slots show expected ownership and acceptable retention.
- Critical Odoo transactions (login, quote save, invoice post) remain healthy.
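The free-disk criterion can be checked mechanically before closing. A sketch assuming GNU `df` and a 20% free-space threshold (both assumptions; BSD `df` lacks `--output`, and the threshold should match your alerting policy):

```shell
# Hypothetical helper: succeeds when free percentage meets the threshold.
free_pct_ok() {
  # $1 = observed free %, $2 = required free %
  [ "$1" -ge "$2" ]
}

if [ -n "${PGDATA:-}" ]; then
  used_pct="$(df --output=pcent "$PGDATA" | tail -1 | tr -dc '0-9')"
  free_pct=$(( 100 - used_pct ))   # df reports used%, convert to free%
  if free_pct_ok "$free_pct" 20; then
    echo "disk headroom OK (${free_pct}% free)"
  else
    echo "still tight (${free_pct}% free)"
  fi
fi
```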
Post-incident hardening
- Alert on per-slot retained WAL bytes, not only raw disk usage.
- Keep a source-of-truth registry for slot owner, purpose, and decommission procedure.
- Require teardown checklist items for migrations/integrations that create temporary slots.
- Rehearse this scenario in staging by intentionally stalling a non-production slot.
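The first hardening item can be sketched as a cron-able check, assuming a Nagios-style exit-code contract (exit 2 on critical) and a 5 GiB per-slot threshold — both are assumptions to tune for your environment:

```shell
# Alert when any replication slot retains more than MAX_BYTES of WAL.
MAX_BYTES=$(( 5 * 1024 * 1024 * 1024 ))   # 5 GiB, an illustrative threshold

if [ -n "${ODOO_DB_URI:-}" ]; then
  worst="$(psql "$ODOO_DB_URI" -Atc \
    "select coalesce(max(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)), 0)
     from pg_replication_slots")"
  if [ "$worst" -gt "$MAX_BYTES" ]; then
    echo "CRITICAL: a slot is retaining $worst bytes of WAL"
    exit 2
  fi
  echo "OK: max retained WAL is $worst bytes"
fi
```

On PostgreSQL 13+, also consider setting `max_slot_wal_keep_size` so a runaway slot invalidates itself before it fills the disk.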
The operating principle: WAL-retention incidents are usually ownership/governance failures, not PostgreSQL tuning failures. Confirm ownership first, then take the smallest safe action.