Odoo PostgreSQL Read Replica Failover and Cutover Runbook
A production-safe, CLI-first runbook for promoting a PostgreSQL replica during Odoo incidents, cutting over traffic, and avoiding split-brain.
Failover incidents are where teams most often turn the original fault into a bigger outage. This runbook gives a deterministic sequence: confirm primary failure, pick a safe replica, fence the old primary, promote, cut over Odoo, then verify and harden.
Incident signals that justify failover evaluation
- Primary DB is unreachable or repeatedly crashing while Odoo requests time out.
- Write paths in Odoo fail (could not connect to server, connection reset, or repeated transaction aborts).
- Infrastructure events indicate node/storage/network loss on the primary.
- You cannot restore primary service inside your defined RTO.
A failover is a data-risk decision. Always measure replication lag first.
Step 0 — Stabilize and assign roles
- Freeze non-critical write traffic (imports, bulk updates, heavy cron jobs, queue replays).
- Assign one command operator and one incident scribe.
- Confirm who can approve promotion if data-loss risk is non-zero.
Do not allow parallel ad-hoc commands on both primary and replica clusters.
Step 1 — Validate candidate replica freshness
Run on each replica candidate:
psql "$REPLICA_DB_URI" -c "
select
now() as observed_at,
pg_is_in_recovery() as in_recovery,
now() - pg_last_xact_replay_timestamp() as replay_delay,
pg_last_wal_receive_lsn() as receive_lsn,
pg_last_wal_replay_lsn() as replay_lsn,
pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) as bytes_not_replayed;
"
Interpretation:
- in_recovery = t confirms it is still a standby (good before promotion).
- replay_delay near zero is ideal.
- Large bytes_not_replayed means you are promoting with higher potential data loss.
If all replicas are stale, escalate explicitly: "failover is available, but data loss up to X seconds/bytes."
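The freshness numbers above can be gated mechanically instead of eyeballed under pressure. A minimal sketch, where replica_eligible is an illustrative name and the two thresholds are stand-ins for your own failover SLO values:

```shell
# Decide whether a replica is promotable under a lag SLO.
# $1: replay delay in whole seconds (e.g. from extract(epoch from replay_delay))
# $2: bytes_not_replayed (from pg_wal_lsn_diff above)
replica_eligible() {
  max_delay_secs=5            # illustrative SLO: tune per environment
  max_wal_gap_bytes=16777216  # 16 MiB, also illustrative
  if [ "$1" -le "$max_delay_secs" ] && [ "$2" -le "$max_wal_gap_bytes" ]; then
    echo "ELIGIBLE: within failover SLO"
  else
    echo "STALE: escalate with measured data-loss bound (${1}s / ${2} bytes)"
  fi
}

# Example: a replica 2 s / 1 KiB behind passes; 30 s / 95 MB behind does not.
replica_eligible 2 1024
replica_eligible 30 95000000
```

Printing the measured bound in the STALE branch gives the approver the exact "data loss up to X seconds/bytes" statement the escalation requires.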
Step 2 — Confirm old primary state (split-brain prevention)
Before promotion, verify the old primary is truly unavailable or fenced.
# If reachable, check whether it still accepts writes
psql "$PRIMARY_DB_URI" -c "select now(), pg_is_in_recovery();"
If old primary is reachable and writable (pg_is_in_recovery = f), fence it before continuing:
- Stop PostgreSQL service on old primary host, or
- Remove old primary from load balancer / security group path to app layer, or
- Shut down node at orchestrator level.
The goal is simple: after promotion, only one writable primary may exist.
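The reachable-and-writable check can be scripted so the fencing decision is unambiguous. A sketch, where fence_required is an illustrative name fed the trimmed output of the psql check above:

```shell
# Decide whether the old primary must be fenced before promotion.
# $1: output of  psql -tA "$PRIMARY_DB_URI" -c "select pg_is_in_recovery();"
#     "f" = writable primary, "t" = standby, "" = unreachable
fence_required() {
  case "$1" in
    f) return 0 ;;  # still accepting writes: fence first
    *) return 1 ;;  # standby or already down: nothing to fence
  esac
}

if fence_required "f"; then
  echo "FENCE old primary (stop service, pull from LB, or power off node)"
fi
```

Treating "unreachable" as not-fence-required is only safe if the host is also blocked from returning on its own, which is why the orchestrator-level shutdown option exists.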
Step 3 — Promote selected replica
On chosen replica:
psql "$REPLICA_DB_URI" -c "select pg_promote(wait_seconds => 60);"
Then verify promotion completed:
psql "$REPLICA_DB_URI" -c "select pg_is_in_recovery() as in_recovery;"
Expected: in_recovery = f.
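Promotion can take a moment to complete, so verification is safer as a bounded poll than a single check. A sketch, where wait_until_promoted and stub_check are illustrative names and the stub stands in for the real psql call:

```shell
# Poll a check command until it reports "f" (out of recovery), with a cap.
# $1: max attempts; remaining args: command that prints pg_is_in_recovery()
wait_until_promoted() {
  attempts=$1; shift
  while [ "$attempts" -gt 0 ]; do
    if [ "$("$@")" = "f" ]; then
      echo "promotion confirmed"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  echo "promotion NOT confirmed: investigate before cutover" >&2
  return 1
}

# In production the check would be:
#   psql -tA "$REPLICA_DB_URI" -c "select pg_is_in_recovery();"
stub_check() { echo "f"; }   # stand-in for the psql call
wait_until_promoted 60 stub_check
```

A non-zero exit here should block Step 4: never re-point Odoo at a node that still reports in_recovery = t.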
Step 4 — Cut Odoo traffic to new primary
- Update the connection target used by Odoo (db_host, DNS name, PgBouncer upstream, or secret/parameter store).
- Reload/restart connection pools before app workers, so stale sockets do not pin the old target.
- Restart Odoo workers in controlled batches.
Example (systemd deployment):
sudo systemctl restart pgbouncer
sudo systemctl restart odoo
If using container orchestration, roll deployments with maxUnavailable tuned so the API stays available during the restart.
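For a classic odoo.conf deployment, the re-point itself can be a single guarded edit. A sketch, assuming the stock "db_host = <host>" ini style; repoint_db_host and the example hostname are illustrative:

```shell
# Re-point Odoo's db_host key, keeping a .bak copy for rollback.
# $1: path to odoo.conf   $2: new primary hostname
repoint_db_host() {
  sed -i.bak "s/^db_host *=.*/db_host = $2/" "$1"
  grep '^db_host' "$1"   # echo the new value into the incident log
}

# Usage during cutover (path and host are examples):
#   repoint_db_host /etc/odoo/odoo.conf pg-new-primary.internal
#   sudo systemctl restart pgbouncer && sudo systemctl restart odoo
```

The .bak copy is the "re-point once" rollback artifact: restoring it and restarting the pooler undoes the cutover without hand-editing under stress.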
Step 5 — Verification checklist (must pass before incident close)
Run against new primary:
psql "$NEW_PRIMARY_DB_URI" -c "
select
current_setting('transaction_read_only') as tx_read_only,
pg_is_in_recovery() as in_recovery;
"
Expected: tx_read_only = off and in_recovery = f.
Validate application behavior:
- Odoo login works.
- Create/update business objects (quote, sale order, invoice draft).
- Queue/cron throughput resumes without error spikes.
- No continued connection attempts from Odoo to old primary endpoint.
Check for straggler connections still targeting the retired primary host:
# Run where PgBouncer is used; "show servers" lists upstream connection addresses
psql "$PGBOUNCER_ADMIN_URI" -c "show servers;"
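The admin-console output can be grepped for the retired address rather than scanned by eye. A sketch, where count_stragglers is an illustrative name and the canned rows stand in for real PgBouncer admin-console output:

```shell
# Count pooled connections whose row mentions the retired host/IP.
# stdin: rows from the PgBouncer admin console; $1: retired address
count_stragglers() {
  grep -c -- "$1" || true   # grep -c prints 0 (and exits 1) on no match
}

# Demo with canned rows; in production pipe in the output of:
#   psql "$PGBOUNCER_ADMIN_URI" -c "show servers;"
printf '%s\n' \
  'S odoo odoo 10.0.0.5:5432 idle' \
  'S odoo odoo 10.0.0.9:5432 active' \
  | count_stragglers '10.0.0.5:5432'   # prints 1
```

A non-zero count at incident close means some pool member is still pinned to the fenced node and needs a pool reload.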
Rollback guidance (if cutover destabilizes)
Only roll back if the old primary is verified consistent and controlled.
- Stop Odoo writes again.
- Decide authoritative node (do not allow dual-primary writes).
- Re-point app/pooler once, restart, and re-verify write/read checks.
If both nodes accepted writes, treat as data divergence incident and execute reconciliation plan (not ad-hoc failback).
Hardening and prevention checklist
- Define explicit failover SLOs: max acceptable replay lag and max tolerated data loss.
- Alert on replica lag (replay_delay, WAL gap bytes) and stale replication state.
- Store a tested fencing procedure per environment (VM, Kubernetes, managed DB).
- Practice quarterly failover drills in staging with timed RTO/RPO measurement.
- Keep Odoo DB endpoint indirection simple (single DNS/pooler target) to reduce cutover errors.
- Document exactly who can approve data-loss failover decisions.
Reference material used for technical alignment
- PostgreSQL docs: replication, standby operation, and failover behavior.
- PostgreSQL docs: administrative functions, including pg_promote() and recovery checks.
- Operational best practices from production SRE failover playbooks (fencing first, single-writer guarantee).
The operating principle: failover is not "promote fast"; it is "promote safely with one writer and known data risk."