Odoo PostgreSQL Connection Saturation Incident Runbook
A practical CLI-first recovery workflow for when Odoo starts hitting max PostgreSQL connections and user traffic begins to fail.
When an Odoo incident starts with `too many clients already`, panic actions usually make it worse.
This runbook gives operators a deterministic sequence: confirm saturation, stop the leak, recover service, then harden.
Incident signals worth immediate action
- Odoo logs show `FATAL: sorry, too many clients already`.
- Login, save, or checkout flows fail intermittently while CPU looks normal.
- Queue workers stop progressing even though jobs remain pending.
- Monitoring shows rapid connection growth without matching request throughput.
Step 0 — Stabilize before tuning anything
- Freeze non-critical batch traffic (imports, low-priority cron jobs, heavy reports).
- Keep one operator running commands and one operator tracking timeline/decisions.
- Do not raise `max_connections` as a first response unless memory headroom is verified.
Increasing connection limits under pressure can shift the failure from connection errors to RAM exhaustion.
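The freeze in the first bullet can be applied at the database level by deactivating non-critical scheduled actions. A minimal sketch that only builds the SQL, assuming Odoo's standard `ir_cron` table with an `active` flag and a `cron_name` column (the column name varies by Odoo version, and the allowlist contents here are hypothetical examples):

```python
# Sketch: build parameterized SQL to pause all cron jobs except a critical allowlist.
# Assumes Odoo's ir_cron table (active boolean, cron_name varchar) -- verify the
# name column on your Odoo version before running. Allowlist entries are examples.
CRITICAL_CRONS = ("Mail: Email Queue Manager",)  # jobs that must keep running

def pause_noncritical_crons_sql(critical=CRITICAL_CRONS):
    placeholders = ", ".join("%s" for _ in critical)
    return (
        "update ir_cron set active = false "
        f"where active and cron_name not in ({placeholders});",
        list(critical),
    )

sql, params = pause_noncritical_crons_sql()
# Execute via psycopg2/psql during the incident; re-activate the same rows afterwards.
print(sql)
```

Keep a record of which rows you deactivated so the post-incident cleanup is a simple inverse update.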
Step 1 — Confirm saturation and where sessions come from
# Current usage vs configured max_connections
psql "$ODOO_DB_URI" -c "
with limits as (
select setting::int as max_connections
from pg_settings
where name = 'max_connections'
), usage as (
select
count(*) as current_connections,
count(*) filter (where state = 'active') as active_connections,
count(*) filter (where state = 'idle') as idle_connections,
count(*) filter (where state = 'idle in transaction') as idle_in_txn
from pg_stat_activity
where datname = current_database()
)
select
usage.current_connections,
limits.max_connections,
round(100.0 * usage.current_connections / nullif(limits.max_connections, 0), 1) as pct_used,
usage.active_connections,
usage.idle_connections,
usage.idle_in_txn
from usage cross join limits;
"
# Attribute sessions by application/user/client
psql "$ODOO_DB_URI" -c "
select
coalesce(application_name, '(unset)') as app,
usename,
client_addr,
state,
count(*) as sessions
from pg_stat_activity
where datname = current_database()
group by 1,2,3,4
order by sessions desc
limit 30;
"
Step 2 — Remove the highest-risk leak safely
Prioritize long `idle in transaction` sessions first: they hold resources, block vacuum, and can amplify lock incidents.
psql "$ODOO_DB_URI" -c "
select pid, application_name, usename, client_addr,
now() - xact_start as txn_age,
now() - state_change as idle_for,
left(query, 140) as last_query
from pg_stat_activity
where datname = current_database()
and state = 'idle in transaction'
order by xact_start asc
limit 20;
"
Cancel before terminate:
# Safer first move
psql "$ODOO_DB_URI" -c "select pg_cancel_backend(<pid>);"
# Use terminate only if cancel does not clear within your incident timeout
psql "$ODOO_DB_URI" -c "select pg_terminate_backend(<pid>);"
Never bulk-kill every session. Keep Odoo app workers and replication/admin sessions alive unless explicitly identified as the leak source.
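The "cancel before terminate, never bulk-kill" rules above can be encoded so the operator reviews an explicit kill list instead of improvising under pressure. A sketch, assuming session rows shaped like the Step 2 query output; the protected role names are examples, not a complete list for your environment:

```python
from datetime import timedelta

# Example admin/replication roles to never touch -- adjust for your environment.
PROTECTED_USERS = {"postgres", "replicator"}

def pids_to_cancel(sessions, min_idle=timedelta(minutes=5)):
    """Pick idle-in-transaction backends that are safe to pg_cancel_backend().
    `sessions` mirrors the Step 2 query: dicts with pid/usename/state/idle_for."""
    return [
        s["pid"]
        for s in sessions
        if s["state"] == "idle in transaction"
        and s["usename"] not in PROTECTED_USERS
        and s["idle_for"] >= min_idle
    ]

sessions = [
    {"pid": 101, "usename": "odoo", "state": "idle in transaction",
     "idle_for": timedelta(minutes=42)},
    {"pid": 102, "usename": "replicator", "state": "idle in transaction",
     "idle_for": timedelta(hours=1)},
    {"pid": 103, "usename": "odoo", "state": "active",
     "idle_for": timedelta(seconds=2)},
]
print(pids_to_cancel(sessions))  # → [101]
```

Note the replication session (pid 102) is excluded even though it is the oldest offender; that is deliberate.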
Step 3 — Recover throughput in controlled increments
- Re-check `pct_used` and confirm it is trending down.
- Re-enable paused traffic one lane at a time (queue workers, then cron/reporting).
- Watch for immediate re-growth; that indicates the leak is still active.
# Fast repeated check during recovery
watch -n 10 "psql \"$ODOO_DB_URI\" -Atc \"select count(*) from pg_stat_activity where datname = current_database();\""
Step 4 — Exit criteria before closing incident
- Connection usage stable below emergency threshold (for example <75%) for at least 15 minutes.
- No new `too many clients already` entries in logs.
- Critical user transactions (login, quote, invoice posting) complete successfully.
- Queue depth is decreasing, not flatlining.
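The four exit criteria can be checked mechanically before anyone declares the incident closed. A sketch, with the 75%/15-minute thresholds taken from the bullets above and the input shapes assumed for illustration:

```python
def may_close_incident(pct_samples, minutes_stable, new_fatal_log_lines,
                       critical_flows_ok, queue_depths):
    """Exit check mirroring Step 4: stable <75% for >=15 min, no new
    'too many clients already' log lines, critical flows pass, queue draining."""
    stable = minutes_stable >= 15 and all(p < 75 for p in pct_samples)
    queue_draining = len(queue_depths) >= 2 and queue_depths[-1] < queue_depths[0]
    return (stable and new_fatal_log_lines == 0
            and critical_flows_ok and queue_draining)

print(may_close_incident([62, 58, 55], 20, 0, True, [400, 250, 120]))  # → True
print(may_close_incident([62, 58, 55], 20, 0, True, [400, 400, 400]))  # → False
```

The second call fails only on the flatlining queue, which is exactly the trap the last bullet warns about.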
Hardening actions after recovery
- Put an alert on % of `max_connections` and on `idle in transaction` session count.
- Review Odoo worker and connection-pool settings in code/config (not ad-hoc shell edits).
- Add timeouts for sessions leaked by clients (`idle_in_transaction_session_timeout` where appropriate).
- Add a weekly replay drill: intentionally saturate staging and rehearse this runbook.
The principle: connection incidents are usually leak problems, not capacity problems. Fix the leak path first, then resize with evidence.