Odoo PostgreSQL Checkpoint I/O Spike Mitigation Runbook
A practical incident runbook for stabilizing Odoo when PostgreSQL checkpoint bursts cause latency spikes, write stalls, and cascading worker slowdowns.
When Odoo latency suddenly jumps during traffic bursts, and PostgreSQL disk writes spike in the same window, checkpoint pressure is a common root cause.
This runbook gives a safe response order: confirm checkpoint-driven pressure, reduce write burst intensity, recover throughput, then harden settings so the pattern does not return.
Incident signals that justify immediate action
- Odoo endpoints with write-heavy flows (checkout, invoicing, stock moves) become slow at the same time.
- PostgreSQL logs show frequent checkpoint messages (`checkpoints are occurring too frequently`).
- Disk I/O utilization (`%util`) stays high with growing request latency.
- Odoo worker timeout/slow-request rates increase even when CPU is not fully saturated.
- `pg_stat_bgwriter` checkpoint counters climb faster than the normal baseline.
Step 0 — Stabilize before changing PostgreSQL settings
- Freeze non-critical write amplification lanes (bulk imports, recompute jobs, historical backfills).
- Pause deploys/module upgrades and incident-unrelated cron lanes.
- Keep one operator running commands and one operator tracking timeline/actions.
- Do not restart PostgreSQL first; capture evidence before any tuning or reload.
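Evidence capture can be scripted so nothing is lost under pressure. A minimal sketch (the `capture` helper and `/tmp` path are assumptions; `ODOO_DB_URI` is the connection string used throughout this runbook):

```shell
#!/bin/sh
# Snapshot checkpoint-related state BEFORE any tuning or restart.
EVIDENCE_DIR="/tmp/pg-incident-$(date +%Y%m%dT%H%M%S)"
mkdir -p "$EVIDENCE_DIR"

capture() {
  # capture <label> <command...>: append one command's output with a timestamped header
  label="$1"; shift
  {
    echo "== $label @ $(date -u +%FT%TZ) =="
    "$@" 2>&1
  } >> "$EVIDENCE_DIR/$label.txt"
}

capture bgwriter psql "$ODOO_DB_URI" -c "select * from pg_stat_bgwriter;"
capture settings psql "$ODOO_DB_URI" -c "select name, setting from pg_settings where name like 'checkpoint%' or name like '%wal_size';"
capture iostat  iostat -xz 1 3
echo "Evidence saved under $EVIDENCE_DIR"
```

Failures (a missing binary, an unreachable database) are captured into the evidence files rather than aborting the script, so a partial snapshot still survives.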
Step 1 — Confirm checkpoint pressure and quantify severity
1.1 Inspect checkpoint cadence and write pressure
Note: the queries below use the pre-17 layout; on PostgreSQL 17 and later, the checkpoint counters moved from `pg_stat_bgwriter` to `pg_stat_checkpointer`.
psql "$ODOO_DB_URI" -c "
select
checkpoints_timed,
checkpoints_req,
checkpoint_write_time,
checkpoint_sync_time,
buffers_checkpoint,
buffers_backend,
maxwritten_clean
from pg_stat_bgwriter;
"
Interpretation during incident:
- Fast growth in `checkpoints_req` suggests WAL pressure is forcing early checkpoints.
- High `checkpoint_sync_time` suggests fsync flush pressure (storage can't absorb the checkpoint burst smoothly).
- High `buffers_backend` means backend processes are doing writes themselves (background write smoothing is insufficient).
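To turn "fast growth" into a number, compare two samples taken a known interval apart. A sketch (the `forced_per_min` helper is a hypothetical name; input is `psql -At` output in `timed|req` form):

```shell
# Quantify severity as forced checkpoints per minute from two samples.
sample() {
  psql "$ODOO_DB_URI" -Atc "select checkpoints_timed, checkpoints_req from pg_stat_bgwriter;"
}

forced_per_min() {
  # forced_per_min <before> <after> <seconds-between-samples>
  echo "$1 $2 $3" | awk -F'[| ]' '{ printf "%.1f\n", ($4 - $2) * 60 / $5 }'
}

# During an incident:
#   s1=$(sample); sleep 60; s2=$(sample)
#   forced_per_min "$s1" "$s2" 60
forced_per_min "100|5" "100|8" 60   # prints 3.0
```

Any sustained non-zero forced-checkpoint rate during the latency window strengthens the checkpoint-pressure hypothesis.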
1.2 Check if WAL growth is repeatedly hitting limits
psql "$ODOO_DB_URI" -c "
select
name,
setting,
unit
from pg_settings
where name in (
'checkpoint_timeout',
'max_wal_size',
'min_wal_size',
'checkpoint_completion_target'
)
order by name;
"
# Optional host-level pressure check (Linux)
iostat -xz 1 10
vmstat 1 10
1.3 Validate query and lock side-effects in Odoo traffic
psql "$ODOO_DB_URI" -c "
select
count(*) filter (where state = 'active') as active,
count(*) filter (where state = 'idle in transaction') as idle_in_txn,
count(*) filter (
where state = 'active'
and now() - query_start > interval '30 seconds'
) as active_over_30s
from pg_stat_activity
where datname = current_database();
"
If long-running active statements rise only during checkpoint bursts, the incident is likely an I/O pacing failure, not a pure SQL-plan regression.
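To also rule out lock queues hiding behind the slow writers, a sketch using `pg_blocking_pids()` (available on PostgreSQL 9.6+):

```sql
select pid,
       pg_blocking_pids(pid) as blocked_by,
       wait_event_type,
       now() - query_start as waiting_for,
       left(query, 80) as query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) > 0
  and datname = current_database();
```

An empty result during the spike points back at I/O pacing; a deep blocking chain points at a lock problem that checkpoint tuning will not fix.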
Step 2 — Apply production-safe remediation order
2.1 Reduce write burst amplitude first
- Temporarily pause high-write non-critical jobs (mass stock valuation recomputes, large import batches, bulk mail state updates).
- Keep revenue-critical and accounting-critical writes alive where possible.
- If queue workers exist, reduce concurrency instead of hard stop.
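One way to pause non-critical Odoo cron lanes directly is through the `ir_cron` table. A sketch only: the name patterns below are hypothetical placeholders for your own non-critical jobs, and it assumes the stored `cron_name` column present in recent Odoo versions:

```sql
-- Disable non-critical scheduled jobs; RETURNING records exactly what was touched.
-- The ILIKE patterns are hypothetical examples — substitute your own job names.
update ir_cron
set active = false
where active
  and cron_name ilike any (array['%recompute%', '%import%', '%mass mail%'])
returning id, cron_name;
```

Keep the `RETURNING` output in the incident timeline so you can re-enable exactly those jobs, and only those jobs, after verification.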
2.2 Smooth checkpoint behavior (safe-first)
If your current values are aggressive for current write volume, prefer gradual tuning:
-- Persist carefully, then reload/restart based on your platform policy.
ALTER SYSTEM SET checkpoint_completion_target = '0.9';
ALTER SYSTEM SET max_wal_size = '8GB';
ALTER SYSTEM SET min_wal_size = '2GB';
Apply config changes:
SELECT pg_reload_conf();
Notes:
- `checkpoint_completion_target = 0.9` spreads checkpoint I/O over more of the interval.
- Increasing `max_wal_size` reduces forced checkpoints during short write spikes.
- Validate disk headroom before increasing WAL retention limits.
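After `pg_reload_conf()`, confirm the new values actually took effect. All three parameters above are reload-safe, so `pending_restart` should read false for each:

```sql
select name, setting, unit, pending_restart
from pg_settings
where name in ('checkpoint_completion_target', 'max_wal_size', 'min_wal_size')
order by name;
```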
2.3 Protect user traffic from indefinite waits during incident
-- Use only if not already enforced by policy; prefer role-level for Odoo user.
ALTER ROLE odoo SET statement_timeout = '60s';
ALTER ROLE odoo SET lock_timeout = '10s';
This limits the blast radius of stalled write paths while storage pressure is being stabilized. Note that role-level settings apply only to new sessions: existing Odoo connections keep their old values until they reconnect.
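To confirm the role-level overrides are in place, query the catalog that stores per-role settings:

```sql
select r.rolname, s.setconfig
from pg_db_role_setting s
join pg_roles r on r.oid = s.setrole
where r.rolname = 'odoo';
```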
Step 3 — Verification loop before reopening full load
Run a short loop while gradually re-enabling paused lanes:
watch -n 10 "psql \"$ODOO_DB_URI\" -Atc \"
select
now(),
checkpoints_timed,
checkpoints_req,
checkpoint_write_time,
checkpoint_sync_time
from pg_stat_bgwriter;
\""
And confirm user-path health at the same time:
- Login, quotation confirm, invoice post, and stock transfer flows succeed.
- Odoo timeout/error rate returns near baseline.
- Disk `%util` and await metrics trend down from the incident peak.
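The user-path check can be semi-automated with a latency probe. A hypothetical sketch: `ODOO_URL`, the login page as target, and the 2-second threshold are all assumptions to adapt:

```shell
# Probe end-to-end latency of a cheap Odoo page while lanes are re-enabled.
ODOO_URL="${ODOO_URL:-http://localhost:8069/web/login}"

classify() {
  # classify <seconds> <threshold-seconds>: OK below threshold, SLOW otherwise
  awk -v t="$1" -v lim="$2" 'BEGIN { if (t + 0 < lim + 0) print "OK"; else print "SLOW" }'
}

probe() {
  t=$(curl -o /dev/null -sS -w '%{time_total}' "$ODOO_URL") || t=999
  echo "$(date -u +%T) ${t}s $(classify "$t" 2.0)"
}

# Run alongside the watch loop above: while true; do probe; sleep 10; done
```

A run of `SLOW` results right after re-enabling a write lane is the signal to re-freeze that lane (see the rollback section).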
Rollback and safety checks
If latency worsens after tuning changes:
- Re-freeze the last write lane you re-enabled.
- Revert the most recent parameter change first (single-variable rollback).
- Re-check `pg_stat_bgwriter` counters and disk pressure.
- Escalate to the storage/IOPS capacity path if checkpoint smoothing alone does not recover service.
Example rollback:
ALTER SYSTEM RESET checkpoint_completion_target;
ALTER SYSTEM RESET max_wal_size;
ALTER SYSTEM RESET min_wal_size;
SELECT pg_reload_conf();
Only reset values that were changed during the incident; preserve known-good baseline overrides. If the role-level timeouts from Step 2.3 were not already standing policy, revert them as well (`ALTER ROLE odoo RESET statement_timeout;` and `ALTER ROLE odoo RESET lock_timeout;`).
Hardening checklist (post-incident)
- Alert on checkpoint frequency and `checkpoints_req` acceleration.
- Track the `checkpoint_write_time` / `checkpoint_sync_time` trend, not just one-off values.
- Keep large imports/backfills behind throttled job lanes and off peak interactive windows.
- Validate WAL/disk headroom before seasonal traffic events.
- Enable and review `pg_stat_statements` to separate checkpoint issues from SQL-plan regressions.
- Run a staging load drill with burst writes and verify checkpoint metrics stay controlled.
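A minimal `postgresql.conf` fragment covering the logging and `pg_stat_statements` items above. The numeric values are starting points, not prescriptions, and loading the extension library requires a restart:

```conf
# Checkpoint visibility
log_checkpoints = on
checkpoint_warning = 30s

# Query-level attribution (restart required to load the library)
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 5000
pg_stat_statements.track = top
# Then, once per database: CREATE EXTENSION pg_stat_statements;
```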
Practical references
- PostgreSQL WAL and checkpoint settings: https://www.postgresql.org/docs/current/runtime-config-wal.html
- PostgreSQL monitoring statistics (`pg_stat_bgwriter`, `pg_stat_activity`): https://www.postgresql.org/docs/current/monitoring-stats.html
- PostgreSQL logging and checkpoint warnings: https://www.postgresql.org/docs/current/runtime-config-logging.html
- Odoo deployment guidance: https://www.odoo.com/documentation/17.0/administration/on_premise/deploy.html
Operational rule: do not treat checkpoint incidents as “just increase IOPS” by default. First reduce write burst pressure, then smooth checkpoint pacing, then scale capacity with measured evidence.