Odoo Worker Timeout Storm and Long-Polling Degradation Runbook
A production-safe, CLI-first runbook for stabilizing Odoo when worker timeouts cascade and bus/long-polling starts failing.
When Odoo users report random 502/504 errors, frozen Kanban updates, and delayed notifications at the same time, you are often in a worker-timeout storm.
This runbook is optimized for operators: confirm the blast radius, stop the timeout cascade, recover long-polling, then harden so it does not return next peak hour.
Incident signals that justify immediate response
- Odoo logs contain repeated Worker (...) timeout entries or forced worker reload events.
- Nginx/ALB shows rising 502/504 responses from the Odoo upstream.
- Chatter/live updates (bus/long-polling) are delayed or dead.
- PostgreSQL shows spikes in long-running active queries or idle in transaction sessions.
- Request latency increases before CPU is fully saturated (classic queueing collapse).
Step 0 — Stabilize before changing config
- Pause non-critical traffic: bulk imports, scheduled heavy reports, low-priority cron jobs.
- Announce incident mode and freeze deploys/module upgrades.
- Keep one command operator + one incident scribe.
- Do not restart everything blindly; collect evidence first to avoid looping failure.
Step 1 — Confirm timeout storm and identify where time is spent
1.1 Check recent worker timeout frequency
# systemd deployments
journalctl -u odoo -S -30min | grep -Ei "timeout|longpoll|bus" | tail -n 200
# container deployments
docker logs --since 30m <odoo_container> 2>&1 | grep -Ei "timeout|longpoll|bus" | tail -n 200
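To judge whether the storm is accelerating or fading, bucket the timeout lines per minute instead of eyeballing the tail. This sketch assumes journalctl's default short timestamp format (month, day, HH:MM:SS in the first three fields):

```shell
# Count worker-timeout log lines per minute.
# substr($3, 1, 5) keeps HH:MM from the HH:MM:SS timestamp field.
journalctl -u odoo -S -30min \
  | grep -Ei "worker.*timeout" \
  | awk '{print $1, $2, substr($3, 1, 5)}' \
  | sort | uniq -c
```

A flat or falling per-minute count during recovery actions is a good early signal; a rising count means the mitigation has not landed yet.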
1.2 Verify HTTP edge errors (Nginx example)
sudo tail -n 200 /var/log/nginx/error.log
sudo awk '$9 ~ /^50[24]$/' /var/log/nginx/access.log | tail -n 100
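Raw matching lines do not show how bad the ratio is. The one-liner below computes the 502/504 share of recent traffic, assuming the combined log format where the status code is the ninth whitespace-separated field:

```shell
# 502/504 share of the last 5000 requests (combined log format, status = field 9)
sudo tail -n 5000 /var/log/nginx/access.log \
  | awk '$9 ~ /^50[24]$/ {bad++} {total++} END {if (total) printf "%d/%d (%.1f%%)\n", bad, total, 100*bad/total}'
```

Track this number through the incident: a ratio that stays near zero for 15 minutes is one of the exit criteria in Step 4.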
1.3 Inspect PostgreSQL pressure from Odoo sessions
psql "$ODOO_DB_URI" -c "
select
count(*) filter (where state = 'active') as active,
count(*) filter (where state = 'idle in transaction') as idle_in_txn,
count(*) filter (where now() - query_start > interval '30 seconds' and state = 'active') as active_over_30s
from pg_stat_activity
where datname = current_database();
"
psql "$ODOO_DB_URI" -c "
select
pid,
usename,
application_name,
now() - query_start as query_age,
wait_event_type,
wait_event,
left(query, 180) as query
from pg_stat_activity
where datname = current_database()
and state = 'active'
order by query_start asc
limit 20;
"
If active query age is high and worker timeouts are rising, your workers are likely blocked on slow DB/lock paths rather than pure CPU shortage.
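When long-running queries pile up, it helps to distinguish slow statements from blocked ones; pg_blocking_pids (PostgreSQL 9.6+) exposes the lock chain directly. A sketch in the same style as the queries above:

```sql
-- Sessions that are waiting on locks, with the pids blocking them
select
  pid,
  pg_blocking_pids(pid) as blocked_by,
  now() - query_start as query_age,
  left(query, 120) as query
from pg_stat_activity
where datname = current_database()
  and cardinality(pg_blocking_pids(pid)) > 0
order by query_start asc;
```

If the same pid appears repeatedly in blocked_by, that session is the head of the chain and the best candidate for cancellation in Step 2.2.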
Step 2 — Recover service with safest-first ordering
2.1 Drain expensive background load first
- Temporarily reduce queue/cron concurrency (do not fully disable business-critical jobs).
- Pause known heavy jobs (mass mailers, recomputations, large exports).
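One way to pause scheduled jobs without a deploy is to flip them off in the database. ir_cron is Odoo's scheduled-action table; the id list below is a placeholder you must fill from your own environment, and toggling the same jobs in Settings → Technical → Scheduled Actions is the safer equivalent while the UI still responds:

```sql
-- List active scheduled actions, then disable chosen heavy, non-critical ones
select id, active, nextcall from ir_cron where active = true order by nextcall;

-- Placeholder ids: pause only known-heavy jobs, never everything
update ir_cron set active = false where id in (/* heavy job ids */);

-- After the incident, restore:
-- update ir_cron set active = true where id in (/* same ids */);
```

Record every id you disable in the incident notes so the restore step cannot be forgotten.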
2.2 Cancel long queries before terminating backends
# Prefer cancel first
psql "$ODOO_DB_URI" -c "select pg_cancel_backend(<pid>);"
# Escalate only if cancel fails and impact continues
psql "$ODOO_DB_URI" -c "select pg_terminate_backend(<pid>);"
Never bulk-kill all sessions. Preserve replication/admin connections and healthy Odoo workers.
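If chasing individual pids is too slow, the cancel can be scoped by role and query age in one statement. This is a sketch: it assumes Odoo connects as the role odoo, and that five minutes is beyond any legitimate interactive query in your workload; adjust both before use.

```sql
-- Scoped cancel: application role only, client backends only, long-running only
select pid, pg_cancel_backend(pid)
from pg_stat_activity
where datname = current_database()
  and usename = 'odoo'                 -- assumption: Odoo's application role
  and backend_type = 'client backend'  -- excludes replication/autovacuum/etc.
  and state = 'active'
  and now() - query_start > interval '5 minutes';
```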
2.3 Protect app from infinite waits during incident window
Use conservative timeout guardrails (session-level or role-level) so stuck statements fail fast:
-- Apply carefully; validate with app owners before persistent ALTER ROLE
ALTER ROLE odoo SET statement_timeout = '60s';
ALTER ROLE odoo SET lock_timeout = '10s';
If already set, verify effective values:
psql "$ODOO_DB_URI" -c "show statement_timeout; show lock_timeout;"
Step 3 — Restore long-polling health (bus/gevent lane)
Long-polling often fails as a downstream symptom. Treat it as a separate lane:
- Confirm dedicated long-polling endpoint/process is reachable.
- Ensure proxy timeouts are not shorter than expected long-poll wait windows.
- Restart only the degraded service lane if needed (long-polling/websocket process), then re-check user notifications.
Quick probe example:
# Endpoint differs by version/deployment (Odoo 16+ uses /websocket); verify your route
# Any HTTP response (even 404/405) proves the lane is reachable; a timeout does not
curl -I -m 10 https://<odoo-host>/longpolling/poll
If the endpoint fails while core HTTP works, recover bus lane without a full platform restart.
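A common deployment pattern, shown in Odoo's own deploy documentation, routes the long-polling path to the dedicated gevent port with a longer read timeout than the main HTTP lane. The port and path below follow that pattern as a sketch; they must match your odoo.conf (gevent/longpolling port, typically 8072) and your version's route:

```nginx
# Sketch: split the HTTP lane and the long-polling lane at the proxy
upstream odoo          { server 127.0.0.1:8069; }
upstream odoo-longpoll { server 127.0.0.1:8072; }

server {
    # ... listen/server_name/TLS omitted ...

    location /longpolling {
        proxy_pass http://odoo-longpoll;
        proxy_read_timeout 720s;   # must exceed the long-poll wait window
    }
    location / {
        proxy_pass http://odoo;
        proxy_read_timeout 120s;
    }
}
```

A proxy_read_timeout shorter than the long-poll wait window produces exactly the 504-on-longpolling symptom this step diagnoses.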
Step 4 — Verification and rollback checks
Exit incident mode only when all conditions hold for at least 15 minutes:
- Worker timeout log rate returns to baseline.
- 502/504 rate remains near zero.
- Long-polling/chat presence updates flow normally.
- active_over_30s and idle in transaction counts remain controlled.
- Critical user flows (login, sales order confirm, invoice post) pass.
Verification loop:
watch -n 10 "psql \"$ODOO_DB_URI\" -Atc \"select count(*) from pg_stat_activity where datname=current_database() and state='active' and now()-query_start>interval '30 seconds';\""
If user-facing latency regresses after re-enabling jobs, roll back the last traffic lane you reintroduced and reassess top slow statements.
Hardening checklist (post-incident)
- Set alerts on worker timeout rate, 502/504 ratio, and long-polling endpoint failures.
- Track top SQL by total time and calls (pg_stat_statements) and tune the highest offenders.
- Enforce sane Odoo worker sizing and timeout configuration for your CPU/memory envelope.
- Separate heavy cron/queue workloads from interactive traffic windows.
- Add index/query fixes for repeatedly timed-out business operations.
- Run a staged load drill that includes long-polling behavior, not just login throughput.
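For the pg_stat_statements item, a starting query; it assumes the extension is installed (shared_preload_libraries plus CREATE EXTENSION) and uses PostgreSQL 13+ column names (older versions use total_time instead of total_exec_time):

```sql
-- Top 10 statements by cumulative execution time
select
  calls,
  round(total_exec_time::numeric, 0) as total_ms,
  round((total_exec_time / calls)::numeric, 1) as avg_ms,
  left(query, 120) as query
from pg_stat_statements
order by total_exec_time desc
limit 10;
```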
Practical references
- Odoo deployment/system configuration guidance: https://www.odoo.com/documentation/17.0/administration/on_premise/deploy.html
- PostgreSQL client timeout settings (statement_timeout, lock_timeout): https://www.postgresql.org/docs/current/runtime-config-client.html
- PostgreSQL runtime stats (pg_stat_activity): https://www.postgresql.org/docs/current/monitoring-stats.html
Core principle: during timeout storms, reduce queueing pressure first, then remove the slow/blocking root causes. Full restarts hide evidence and usually make recurrence more likely.