Odoo PgBouncer Pool Exhaustion Incident Runbook
A deterministic response guide for Odoo outages caused by PgBouncer pool/client exhaustion, with safe triage, recovery order, and hardening steps.
If Odoo starts failing with `no more connections allowed` errors, or request latency explodes while PostgreSQL still has headroom, PgBouncer is often the real bottleneck.
This runbook gives you a production-safe sequence: confirm pool pressure, relieve it without random restarts, then harden the system so the incident does not recur.
Incident signals that justify immediate response
- Odoo users see intermittent 500/timeout errors while app hosts are otherwise healthy.
- PgBouncer logs include messages like `no more connections allowed (max_client_conn)`.
- Odoo queue/cron throughput collapses even though PostgreSQL CPU is not pegged.
- Connection errors appear at the app tier before PostgreSQL reaches `max_connections`.
Step 0 — Stabilize and prevent self-inflicted damage
- Freeze non-critical traffic first (heavy reports, bulk imports, low-priority cron lanes).
- Keep checkout/invoicing/auth paths prioritized.
- Do not restart PostgreSQL as a first move.
- Do not blindly increase every pool limit at once.
Why: random restarts drop in-flight transactions and hide root cause. You need a controlled pressure release.
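One low-risk way to freeze low-priority cron lanes is to disable them at the database level. A sketch assuming a standard Odoo schema (the column is `cron_name` on recent Odoo versions, `name` on older ones; the name filter is illustrative only):

```sql
-- Hedged sketch: pause non-critical Odoo cron jobs via the ir_cron table.
-- The ILIKE pattern is a hypothetical example -- match it to your own
-- low-priority lanes, and record what you disabled for re-enabling later.
UPDATE ir_cron
   SET active = false
 WHERE active
   AND cron_name ILIKE '%report%'   -- illustrative low-priority lane
RETURNING id, cron_name;
```

After recovery, re-enable the same rows with `SET active = true`, using the returned ids.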
Step 1 — Confirm whether the choke point is PgBouncer or PostgreSQL
First, inspect PgBouncer internals through the admin console.
# Replace values as needed
export PGBOUNCER_HOST=127.0.0.1
export PGBOUNCER_PORT=6432
export PGBOUNCER_ADMIN_USER=pgbouncer
psql "host=$PGBOUNCER_HOST port=$PGBOUNCER_PORT dbname=pgbouncer user=$PGBOUNCER_ADMIN_USER" -c "SHOW POOLS;"
psql "host=$PGBOUNCER_HOST port=$PGBOUNCER_PORT dbname=pgbouncer user=$PGBOUNCER_ADMIN_USER" -c "SHOW STATS;"
psql "host=$PGBOUNCER_HOST port=$PGBOUNCER_PORT dbname=pgbouncer user=$PGBOUNCER_ADMIN_USER" -c "SHOW CLIENTS;"
psql "host=$PGBOUNCER_HOST port=$PGBOUNCER_PORT dbname=pgbouncer user=$PGBOUNCER_ADMIN_USER" -c "SHOW SERVERS;"
Then confirm PostgreSQL actual pressure:
psql "$ODOO_DB_URI" -c "
with limits as (
select setting::int as max_connections
from pg_settings where name='max_connections'
), usage as (
select count(*) as current_connections,
count(*) filter (where state='active') as active_connections,
count(*) filter (where state='idle in transaction') as idle_in_txn
from pg_stat_activity
where datname=current_database()
)
select usage.*, limits.max_connections,
round(100.0*usage.current_connections/nullif(limits.max_connections,0),1) as pct_used
from usage cross join limits;
"
Interpretation:
- If PgBouncer clients are saturated (high `cl_active`, nonzero `cl_waiting`, growing `maxwait`) while PostgreSQL sits well below `max_connections`, PgBouncer limits are your immediate bottleneck.
- If both layers are saturated, you have coupled exhaustion and need strict traffic shedding plus connection-leak cleanup.
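When reading `SHOW POOLS`, a few columns carry most of the signal. A quick reference (column names per the PgBouncer admin console; semantics summarized here):

```sql
-- Key SHOW POOLS columns in the PgBouncer admin console:
--   cl_active  -- client connections currently linked to a server connection
--   cl_waiting -- client connections queued, waiting for a free server slot
--   sv_active  -- server connections currently serving a client
--   sv_idle    -- server connections idle and immediately reusable
--   maxwait    -- age in seconds of the oldest queued client request
-- Sustained cl_waiting > 0 with rising maxwait means the pool, not
-- PostgreSQL, is the queue where requests are stalling.
SHOW POOLS;
```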
Step 2 — Identify who is consuming client slots
psql "$ODOO_DB_URI" -c "
select
coalesce(application_name,'(unset)') as app,
usename,
client_addr,
state,
count(*) as sessions
from pg_stat_activity
where datname=current_database()
group by 1,2,3,4
order by sessions desc
limit 30;
"
Check for:
- unexpected client sources (ad-hoc scripts, BI jobs, stale workers),
- excessive `idle in transaction` sessions,
- noisy background jobs that can be paused without user-facing impact.
Step 3 — Relieve pressure in a safe order
Use this order to avoid unnecessary transaction loss.
- Pause non-critical Odoo background throughput.
- Cancel long-running low-priority SQL first:
psql "$ODOO_DB_URI" -c "
select pid, now()-query_start as run_for, left(query,120) as sample
from pg_stat_activity
where datname=current_database()
and state='active'
and now()-query_start > interval '2 minutes'
order by query_start asc
limit 20;
"
# Safer first action
psql "$ODOO_DB_URI" -c "select pg_cancel_backend(<pid>);"
- Terminate only confirmed offenders that do not clear:
psql "$ODOO_DB_URI" -c "select pg_terminate_backend(<pid>);"
- If PgBouncer config was tuned during incident, reload PgBouncer (not Postgres):
psql "host=$PGBOUNCER_HOST port=$PGBOUNCER_PORT dbname=pgbouncer user=$PGBOUNCER_ADMIN_USER" -c "RELOAD;"
Never bulk-kill all sessions. Preserve critical app traffic and replication/admin channels.
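If you do raise limits mid-incident, touch the smallest set of knobs and reload. A minimal `pgbouncer.ini` fragment (values are illustrative, not recommendations; size them from your own measurements):

```ini
; pgbouncer.ini -- illustrative values only, size from measurement
[pgbouncer]
max_client_conn      = 2000   ; total client sockets PgBouncer will accept
default_pool_size    = 40     ; server connections per (database, user) pool
reserve_pool_size    = 5      ; emergency headroom per pool
reserve_pool_timeout = 3      ; seconds a client waits before using the reserve
pool_mode            = transaction
```

Apply the change with `RELOAD;` from the admin console, as above, rather than restarting the pooler.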
Step 4 — Recover in controlled increments
Bring capacity back lane by lane:
- Web traffic first; confirm it is stable
- Then critical queues
- Then remaining cron/reporting workloads
During recovery, watch both layers every 10s:
watch -n 10 '
psql "host='$PGBOUNCER_HOST' port='$PGBOUNCER_PORT' dbname=pgbouncer user='$PGBOUNCER_ADMIN_USER'" -Atc "SHOW POOLS;" | head -n 20
psql "$ODOO_DB_URI" -Atc "select count(*) from pg_stat_activity where datname=current_database();"
'
If pressure immediately re-appears, re-freeze low-priority load and continue leak/source isolation.
Step 5 — Exit criteria before incident closure
- No new PgBouncer `no more connections allowed` events in logs for at least 15 minutes.
- Odoo critical transactions succeed (login, quote/sale, invoice post, payment capture).
- Queue depth is decreasing, not flat.
- PostgreSQL and PgBouncer connection metrics are stable and below emergency thresholds.
Hardening checklist (after service is stable)
- Tune PgBouncer deliberately: `max_client_conn`, `default_pool_size`, `reserve_pool_size`, `pool_mode`.
- Keep `pool_mode = transaction` for most Odoo workloads unless a validated reason exists otherwise.
- Add alerting on:
  - PgBouncer client saturation
  - wait queue growth / pool starvation
  - PostgreSQL `idle in transaction` age and count
- Enforce timeout guardrails (`statement_timeout`, `lock_timeout`, `idle_in_transaction_session_timeout`) with app-aware defaults.
- Review Odoo worker/cron concurrency versus actual DB capacity; avoid unconstrained worker growth.
- Add a staging game day that intentionally saturates PgBouncer and rehearses this runbook.
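A quick back-of-envelope for that worker review (a sketch: it assumes each Odoo process may open up to `db_maxconn` connections, and the worker/cron counts are illustrative `odoo.conf` values, not recommendations):

```shell
# Rough worst-case count of client connections Odoo can open toward PgBouncer.
# Assumption: each process may open up to db_maxconn connections (Odoo default 64).
ODOO_WORKERS=8        # odoo.conf: workers (illustrative)
MAX_CRON_THREADS=2    # odoo.conf: max_cron_threads (illustrative)
DB_MAXCONN=64         # odoo.conf: db_maxconn
# +1 accounts for the gevent/longpolling process in a multi-worker setup
WORST_CASE_CLIENTS=$(( (ODOO_WORKERS + MAX_CRON_THREADS + 1) * DB_MAXCONN ))
echo "worst-case Odoo client connections: $WORST_CASE_CLIENTS"
```

Compare the result against PgBouncer's `max_client_conn`, and compare your `default_pool_size` totals against PostgreSQL's `max_connections`; if the worst case exceeds either limit, saturation is a matter of when, not if.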
Fast diagnostic query pack
# PostgreSQL oldest idle-in-tx
psql "$ODOO_DB_URI" -c "
select pid, usename, application_name, now()-xact_start as txn_age, left(query,120)
from pg_stat_activity
where datname=current_database()
and state='idle in transaction'
order by xact_start asc
limit 20;
"
# PostgreSQL blockers/waiters
psql "$ODOO_DB_URI" -c "
select a.pid as waiting_pid,
a.query as waiting_query,
b.pid as blocking_pid,
b.query as blocking_query
from pg_locks wa
join pg_stat_activity a on a.pid=wa.pid
join pg_locks wb on wb.locktype=wa.locktype
and wb.database is not distinct from wa.database
and wb.relation is not distinct from wa.relation
and wb.page is not distinct from wa.page
and wb.tuple is not distinct from wa.tuple
and wb.virtualxid is not distinct from wa.virtualxid
and wb.transactionid is not distinct from wa.transactionid
and wb.classid is not distinct from wa.classid
and wb.objid is not distinct from wa.objid
and wb.objsubid is not distinct from wa.objsubid
and wb.pid <> wa.pid
join pg_stat_activity b on b.pid=wb.pid
where not wa.granted and wb.granted;
"
The operational rule: treat PgBouncer incidents as queueing failures with upstream causes, not as a restart problem. Stabilize load, remove offenders safely, then tune with measurements.