WORK IN PROGRESS

March 21, 2026

Odoo PostgreSQL Read Replica Failover and Cutover Runbook

A production-safe, CLI-first runbook for promoting a PostgreSQL replica during Odoo incidents, cutting over traffic, and avoiding split-brain.

Failover incidents are where teams accidentally create bigger outages than the original fault. This runbook gives a deterministic sequence: confirm primary failure, pick a safe replica, fence the old primary, promote, cut over Odoo, then verify and harden.

Incident signals that justify failover evaluation

  • Primary DB is unreachable or repeatedly crashing while Odoo requests time out.
  • Write paths in Odoo fail ("could not connect to server", connection resets, or repeated transaction aborts).
  • Infrastructure events indicate node/storage/network loss on the primary.
  • You cannot restore primary service inside your defined RTO.

A failover is a data-risk decision. Always measure replication lag first.

Step 0 — Stabilize and assign roles

  1. Freeze non-critical write traffic (imports, bulk updates, heavy cron jobs, queue replays).
  2. Assign one command operator and one incident scribe.
  3. Confirm who can approve promotion if data-loss risk is non-zero.

Do not allow parallel ad-hoc commands on both primary and replica clusters.
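Freezing cron traffic (step 1 above) can be done in Odoo by deactivating rows in ir_cron, Odoo's scheduled-job table. This is only a sketch: it builds and prints the SQL rather than running it, and a real procedure should first record which jobs were active so they can be restored afterward.

```shell
# Sketch only: build the SQL that pauses all active Odoo cron jobs.
# Review it, snapshot current state first, then run it by hand.
FREEZE_SQL="UPDATE ir_cron SET active = false WHERE active;"
echo "$FREEZE_SQL"
# In a real incident (assumes $PRIMARY_DB_URI is still reachable):
#   psql "$PRIMARY_DB_URI" -c "$FREEZE_SQL"
```

Re-enable only the jobs that were active before the freeze once the incident closes; a blanket re-activation would wake jobs that were intentionally disabled.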

Step 1 — Validate candidate replica freshness

Run on each replica candidate:

psql "$REPLICA_DB_URI" -c "
select
  now() as observed_at,
  pg_is_in_recovery() as in_recovery,
  now() - pg_last_xact_replay_timestamp() as replay_delay,
  pg_last_wal_receive_lsn() as receive_lsn,
  pg_last_wal_replay_lsn() as replay_lsn,
  pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) as bytes_not_replayed;
"

Interpretation:

  • in_recovery = t confirms it is still a standby (good before promotion).
  • replay_delay near zero is ideal.
  • Large bytes_not_replayed means you are promoting with higher potential data loss.

If all replicas are stale, escalate explicitly: "failover is available, but data loss up to X seconds/bytes."
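The interpretation above can be folded into a simple go/no-go gate. This sketch assumes a 30-second RPO budget; both the threshold and the hard-coded delay value are illustrative, and in practice the delay would come from the freshness query.

```shell
# Sketch of a promotion gate, assuming a 30-second RPO budget.
# In practice REPLAY_DELAY_SECONDS comes from the query above, e.g.:
#   psql "$REPLICA_DB_URI" -Atc "select coalesce(extract(epoch from now() - pg_last_xact_replay_timestamp()), 0)::int;"
RPO_BUDGET_SECONDS=30
REPLAY_DELAY_SECONDS=3   # hard-coded here for illustration only

if [ "$REPLAY_DELAY_SECONDS" -le "$RPO_BUDGET_SECONDS" ]; then
  DECISION="PROMOTE_OK"
else
  DECISION="ESCALATE: lag ${REPLAY_DELAY_SECONDS}s exceeds RPO budget ${RPO_BUDGET_SECONDS}s"
fi
echo "$DECISION"
```

An ESCALATE result is exactly the "failover is available, but data loss up to X" message above: the gate does not block promotion, it forces the approval step.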

Step 2 — Confirm old primary state (split-brain prevention)

Before promotion, verify the old primary is truly unavailable or fenced.

# If reachable, check whether it still accepts writes
psql "$PRIMARY_DB_URI" -c "select now(), pg_is_in_recovery();"

If old primary is reachable and writable (pg_is_in_recovery = f), fence it before continuing:

  • Stop PostgreSQL service on old primary host, or
  • Remove old primary from load balancer / security group path to app layer, or
  • Shut down node at orchestrator level.

The goal is simple: after promotion, only one writable primary may exist.
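The three fencing options above can be captured as one decision per environment. This sketch only prints the chosen action (dry run); FENCE_MODE, the host name, and the service name are assumptions to adapt per environment.

```shell
# Sketch: choose exactly one fencing action and print it instead of
# executing it (dry run). FENCE_MODE and the host name are assumptions.
FENCE_MODE="systemd"        # systemd | lb | orchestrator
OLD_PRIMARY_HOST="db-old"   # hypothetical host name

case "$FENCE_MODE" in
  systemd)      FENCE_CMD="ssh $OLD_PRIMARY_HOST sudo systemctl stop postgresql" ;;
  lb)           FENCE_CMD="deregister $OLD_PRIMARY_HOST from the app-facing load balancer pool" ;;
  orchestrator) FENCE_CMD="stop the node running $OLD_PRIMARY_HOST via the orchestrator" ;;
  *)            echo "unknown FENCE_MODE: $FENCE_MODE" >&2; exit 1 ;;
esac
echo "$FENCE_CMD"   # dry run; execute by hand once verified
```

Keeping the fencing action in one place per environment supports the hardening item below: a tested, written-down fencing procedure instead of improvisation mid-incident.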

Step 3 — Promote selected replica

On the chosen replica (pg_promote() is available from PostgreSQL 12; on older versions, use pg_ctl promote instead):

psql "$REPLICA_DB_URI" -c "select pg_promote(wait_seconds => 60);"

Then verify promotion completed:

psql "$REPLICA_DB_URI" -c "select pg_is_in_recovery() as in_recovery;"

Expected: in_recovery = f.

Step 4 — Cut Odoo traffic to new primary

  1. Update connection target used by Odoo (db_host, DNS name, PgBouncer upstream, or secret/parameter store).
  2. Reload/restart connection pools before app workers, so stale sockets do not pin the old target.
  3. Restart Odoo workers in controlled batches.

Example (systemd deployment):

sudo systemctl restart pgbouncer
sudo systemctl restart odoo

If using container orchestration, roll deployments with maxUnavailable tuned to keep the API alive.
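The batched restart in step 3 can be sketched as a loop over worker units. The unit names, batch size, and pause are assumptions; the commands are echoed rather than executed (dry run), and a production version would wait on health checks between batches instead of sleeping.

```shell
# Sketch: restart Odoo workers in batches. Unit names are hypothetical;
# commands are echoed (dry run) rather than executed.
ODOO_UNITS="odoo-worker@1 odoo-worker@2 odoo-worker@3 odoo-worker@4"
BATCH_SIZE=2
RESTARTED=0
for unit in $ODOO_UNITS; do
  echo "sudo systemctl restart $unit"   # drop the echo to execute for real
  RESTARTED=$((RESTARTED + 1))
  if [ $((RESTARTED % BATCH_SIZE)) -eq 0 ]; then
    sleep 1   # shortened; in production, gate on health checks between batches
  fi
done
echo "restarted $RESTARTED workers"
```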

Step 5 — Verification checklist (must pass before incident close)

Run against new primary:

psql "$NEW_PRIMARY_DB_URI" -c "
select
  current_setting('transaction_read_only') as tx_read_only,
  pg_is_in_recovery() as in_recovery;
"

Expected: tx_read_only = off and in_recovery = f.

Validate application behavior:

  • Odoo login works.
  • Create/update business objects (quote, sale order, invoice draft).
  • Queue/cron throughput resumes without error spikes.
  • No continued connection attempts from Odoo to the old primary endpoint.

Check for straggler connections still targeting the retired primary host:

# Run where PgBouncer is used; "show servers" lists the upstream
# (database-side) connections and the address each pool targets
psql "$PGBOUNCER_ADMIN_URI" -c "show servers;"
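The straggler check can be scripted against captured pooler output. This sketch counts lines that still reference the retired address; the listing format, addresses, and variable names below are fabricated for illustration, so adapt the match to your pooler's actual columns.

```shell
# Sketch: count pooled connections still pointing at the retired host in a
# captured pooler listing. The listing and addresses are fabricated.
OLD_PRIMARY_ADDR="10.0.0.11"
SERVER_LIST="$(cat <<'EOF'
S odoo odoo 10.0.0.11 5432 idle
S odoo odoo 10.0.0.12 5432 active
EOF
)"
STRAGGLERS=$(printf '%s\n' "$SERVER_LIST" | grep -c "$OLD_PRIMARY_ADDR")
echo "straggler connections to $OLD_PRIMARY_ADDR: $STRAGGLERS"
```

A non-zero count means some pool or worker was missed during cutover and is still pinned to the old target.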

Rollback guidance (if cutover destabilizes)

Only roll back if the old primary is verified consistent and controlled.

  1. Stop Odoo writes again.
  2. Decide authoritative node (do not allow dual-primary writes).
  3. Re-point app/pooler once, restart, and re-verify write/read checks.

If both nodes accepted writes, treat as data divergence incident and execute reconciliation plan (not ad-hoc failback).

Hardening and prevention checklist

  • Define explicit failover SLOs: max acceptable replay lag and max tolerated data loss.
  • Alert on replica lag (replay_delay, WAL gap bytes) and stale replication state.
  • Store a tested fencing procedure per environment (VM, Kubernetes, managed DB).
  • Practice quarterly failover drills in staging with timed RTO/RPO measurement.
  • Keep Odoo DB endpoint indirection simple (single DNS/pooler target) to reduce cutover errors.
  • Document exactly who can approve data-loss failover decisions.
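For the replica-lag alerting item, PostgreSQL's pg_stat_replication view on the primary reports per-standby replay position and lag. This sketch only prints the query (the URI variable is an assumption); an alerting pipeline would run it on a schedule and compare the results to the SLO thresholds.

```shell
# Sketch: per-standby lag query for alerting; run on the current primary.
# pg_stat_replication exposes replay_lsn and replay_lag (PostgreSQL 10+).
LAG_SQL="select application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes,
       coalesce(extract(epoch from replay_lag), 0) as replay_lag_seconds
  from pg_stat_replication;"
echo "$LAG_SQL"
# In production (assumed URI): psql "$PRIMARY_DB_URI" -c "$LAG_SQL"
```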

Reference material used for technical alignment

  • PostgreSQL docs: replication, standby operation, and failover behavior.
  • PostgreSQL docs: administrative functions including pg_promote() and recovery checks.
  • Operational best practices from production SRE failover playbooks (fencing first, single writer guarantee).

The operating principle: failover is not "promote fast"; it is "promote safely with one writer and known data risk."
