Odoo PostgreSQL Read Replica Failover and Cutover Runbook
A production-safe, CLI-first runbook for promoting a PostgreSQL replica during Odoo incidents, cutting over traffic, and avoiding split-brain.
Failover incidents are where teams most often turn the original fault into a bigger outage. This runbook gives a deterministic sequence: confirm primary failure, pick a safe replica, fence the old primary, promote, cut over Odoo, then verify and harden.
Incident signals that justify failover evaluation
- Primary DB is unreachable or repeatedly crashing while Odoo requests time out.
- Write paths in Odoo fail (could not connect to server, connection reset, or repeated transaction aborts).
- Infrastructure events indicate node/storage/network loss on the primary.
- You cannot restore primary service inside your defined RTO.
A failover is a data-risk decision. Always measure replication lag first.
Step 0 — Stabilize and assign roles
- Freeze non-critical write traffic (imports, bulk updates, heavy cron jobs, queue replays).
- Assign one command operator and one incident scribe.
- Confirm who can approve promotion if data-loss risk is non-zero.
Do not allow parallel ad-hoc commands on both primary and replica clusters.
Step 1 — Validate candidate replica freshness
Run on each replica candidate:
psql "$REPLICA_DB_URI" -c "
select
now() as observed_at,
pg_is_in_recovery() as in_recovery,
now() - pg_last_xact_replay_timestamp() as replay_delay,
pg_last_wal_receive_lsn() as receive_lsn,
pg_last_wal_replay_lsn() as replay_lsn,
pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) as bytes_not_replayed;
"
Interpretation:
- in_recovery = t confirms it is still a standby (good before promotion).
- replay_delay near zero is ideal.
- Large bytes_not_replayed means you are promoting with higher potential data loss.
If all replicas are stale, escalate explicitly: "failover is available, but data loss up to X seconds/bytes."
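The freshness numbers above can be gated mechanically instead of eyeballed under pressure. A minimal sketch, where replica_eligible is an illustrative name and the two thresholds are stand-ins for your own failover SLO values:

```shell
# Decide whether a replica is promotable under a lag SLO.
# $1: replay delay in whole seconds (e.g. from extract(epoch from replay_delay))
# $2: bytes_not_replayed (from pg_wal_lsn_diff above)
replica_eligible() {
  max_delay_secs=5            # illustrative SLO: tune per environment
  max_wal_gap_bytes=16777216  # 16 MiB, also illustrative
  if [ "$1" -le "$max_delay_secs" ] && [ "$2" -le "$max_wal_gap_bytes" ]; then
    echo "ELIGIBLE: within failover SLO"
  else
    echo "STALE: escalate with measured data-loss bound (${1}s / ${2} bytes)"
  fi
}

# Example: a replica 2 s / 1 KiB behind passes; 30 s / 95 MB behind does not.
replica_eligible 2 1024
replica_eligible 30 95000000
```

Printing the measured bound in the STALE branch gives the approver the exact "data loss up to X seconds/bytes" statement the escalation requires.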
Step 2 — Confirm old primary state (split-brain prevention)
Before promotion, verify the old primary is truly unavailable or fenced.
# If reachable, check whether it still accepts writes
psql "$PRIMARY_DB_URI" -c "select now(), pg_is_in_recovery();"
If old primary is reachable and writable (pg_is_in_recovery = f), fence it before continuing:
- Stop PostgreSQL service on old primary host, or
- Remove old primary from load balancer / security group path to app layer, or
- Shut down node at orchestrator level.
The goal is simple: after promotion, only one writable primary may exist.
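The reachable-and-writable check can be scripted so the fencing decision is unambiguous. A sketch, where fence_required is an illustrative name fed the trimmed output of the psql check above:

```shell
# Decide whether the old primary must be fenced before promotion.
# $1: output of  psql -tA "$PRIMARY_DB_URI" -c "select pg_is_in_recovery();"
#     "f" = writable primary, "t" = standby, "" = unreachable
fence_required() {
  case "$1" in
    f) return 0 ;;  # still accepting writes: fence first
    *) return 1 ;;  # standby or already down: nothing to fence
  esac
}

if fence_required "f"; then
  echo "FENCE old primary (stop service, pull from LB, or power off node)"
fi
```

Treating "unreachable" as not-fence-required is only safe if the host is also blocked from returning on its own, which is why the orchestrator-level shutdown option exists.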
Step 3 — Promote selected replica
On chosen replica:
psql "$REPLICA_DB_URI" -c "select pg_promote(wait_seconds => 60);"
Then verify promotion completed:
psql "$REPLICA_DB_URI" -c "select pg_is_in_recovery() as in_recovery;"
Expected: in_recovery = f.
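Promotion can take a moment to complete, so verification is safer as a bounded poll than a single check. A sketch, where wait_until_promoted and stub_check are illustrative names and the stub stands in for the real psql call:

```shell
# Poll a check command until it reports "f" (out of recovery), with a cap.
# $1: max attempts; remaining args: command that prints pg_is_in_recovery()
wait_until_promoted() {
  attempts=$1; shift
  while [ "$attempts" -gt 0 ]; do
    if [ "$("$@")" = "f" ]; then
      echo "promotion confirmed"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  echo "promotion NOT confirmed: investigate before cutover" >&2
  return 1
}

# In production the check would be:
#   psql -tA "$REPLICA_DB_URI" -c "select pg_is_in_recovery();"
stub_check() { echo "f"; }   # stand-in for the psql call
wait_until_promoted 60 stub_check
```

A non-zero exit here should block Step 4: never re-point Odoo at a node that still reports in_recovery = t.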
Step 4 — Cut Odoo traffic to new primary
- Update the connection target used by Odoo (db_host, DNS name, PgBouncer upstream, or secret/parameter store).
- Reload/restart connection pools before app workers, so stale sockets do not pin the old target.
- Restart Odoo workers in controlled batches.
Example (systemd deployment):
sudo systemctl restart pgbouncer
sudo systemctl restart odoo
If using container orchestration, roll deployments with maxUnavailable tuned so the API stays available during the restart.
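For a classic odoo.conf deployment, the re-point itself can be a single guarded edit. A sketch, assuming the stock "db_host = <host>" ini style; repoint_db_host and the example hostname are illustrative:

```shell
# Re-point Odoo's db_host key, keeping a .bak copy for rollback.
# $1: path to odoo.conf   $2: new primary hostname
repoint_db_host() {
  sed -i.bak "s/^db_host *=.*/db_host = $2/" "$1"
  grep '^db_host' "$1"   # echo the new value into the incident log
}

# Usage during cutover (path and host are examples):
#   repoint_db_host /etc/odoo/odoo.conf pg-new-primary.internal
#   sudo systemctl restart pgbouncer && sudo systemctl restart odoo
```

The .bak copy is the "re-point once" rollback artifact: restoring it and restarting the pooler undoes the cutover without hand-editing under stress.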
Step 5 — Verification checklist (must pass before incident close)
Run against new primary:
psql "$NEW_PRIMARY_DB_URI" -c "
select
current_setting('transaction_read_only') as tx_read_only,
pg_is_in_recovery() as in_recovery;
"
Expected: tx_read_only = off and in_recovery = f.
Validate application behavior:
- Odoo login works.
- Create/update business objects (quote, sale order, invoice draft).
- Queue/cron throughput resumes without error spikes.
- No continued connection attempts from Odoo to old primary endpoint.
Check for straggler connections still targeting the retired primary host:
# Run where PgBouncer is used; "show servers" lists upstream connection addresses
psql "$PGBOUNCER_ADMIN_URI" -c "show servers;"
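The admin-console output can be grepped for the retired address rather than scanned by eye. A sketch, where count_stragglers is an illustrative name and the canned rows stand in for real PgBouncer admin-console output:

```shell
# Count pooled connections whose row mentions the retired host/IP.
# stdin: rows from the PgBouncer admin console; $1: retired address
count_stragglers() {
  grep -c -- "$1" || true   # grep -c prints 0 (and exits 1) on no match
}

# Demo with canned rows; in production pipe in the output of:
#   psql "$PGBOUNCER_ADMIN_URI" -c "show servers;"
printf '%s\n' \
  'S odoo odoo 10.0.0.5:5432 idle' \
  'S odoo odoo 10.0.0.9:5432 active' \
  | count_stragglers '10.0.0.5:5432'   # prints 1
```

A non-zero count at incident close means some pool member is still pinned to the fenced node and needs a pool reload.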
Rollback guidance (if cutover destabilizes)
Only roll back if the old primary is verified consistent and controlled.
- Stop Odoo writes again.
- Decide authoritative node (do not allow dual-primary writes).
- Re-point app/pooler once, restart, and re-verify write/read checks.
If both nodes accepted writes, treat as data divergence incident and execute reconciliation plan (not ad-hoc failback).
Hardening and prevention checklist
- Define explicit failover SLOs: max acceptable replay lag and max tolerated data loss.
- Alert on replica lag (replay_delay, WAL gap bytes) and stale replication state.
- Store a tested fencing procedure per environment (VM, Kubernetes, managed DB).
- Practice quarterly failover drills in staging with timed RTO/RPO measurement.
- Keep Odoo DB endpoint indirection simple (single DNS/pooler target) to reduce cutover errors.
- Document exactly who can approve data-loss failover decisions.
Reference material used for technical alignment
- PostgreSQL docs: replication, standby operation, and failover behavior.
- PostgreSQL docs: administrative functions, including pg_promote() and recovery checks.
- Operational best practices from production SRE failover playbooks (fencing first, single-writer guarantee).
The operating principle: failover is not "promote fast"; it is "promote safely with one writer and known data risk."