Odoo Worker Timeout Storm and Long-Polling Degradation Runbook
A production-safe, CLI-first runbook for stabilizing Odoo when worker timeouts cascade and bus/long-polling starts failing.
When Odoo users report random 502/504 errors, frozen Kanban updates, and delayed notifications at the same time, you are often in a worker-timeout storm.
This runbook is optimized for operators: confirm the blast radius, stop the timeout cascade, recover long-polling, then harden so it does not return next peak hour.
Incident signals that justify immediate response
- Odoo logs contain repeated Worker (...) timeout entries or forced worker reload events.
- Nginx/ALB shows rising 502/504 responses from the Odoo upstream.
- Chatter/live updates (bus/long-polling) are delayed or dead.
- PostgreSQL shows spikes in long-running active queries or idle in transaction sessions.
- Request latency increases before CPU is fully saturated (classic queueing collapse).
Step 0 — Stabilize before changing config
- Pause non-critical traffic: bulk imports, scheduled heavy reports, low-priority cron jobs.
- Announce incident mode and freeze deploys/module upgrades.
- Keep one command operator + one incident scribe.
- Do not restart everything blindly; collect evidence first to avoid looping failure.
Step 1 — Confirm timeout storm and identify where time is spent
1.1 Check recent worker timeout frequency
# systemd deployments
journalctl -u odoo -S -30min | grep -Ei "timeout|longpoll|bus" | tail -n 200
# container deployments
docker logs --since 30m <odoo_container> 2>&1 | grep -Ei "timeout|longpoll|bus" | tail -n 200
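To judge whether the storm is accelerating or fading, bucket the timeout lines per minute instead of eyeballing the tail. This sketch assumes journalctl's default short timestamp format (month, day, HH:MM:SS in the first three fields):

```shell
# Count worker-timeout log lines per minute.
# substr($3, 1, 5) keeps HH:MM from the HH:MM:SS timestamp field.
journalctl -u odoo -S -30min \
  | grep -Ei "worker.*timeout" \
  | awk '{print $1, $2, substr($3, 1, 5)}' \
  | sort | uniq -c
```

A flat or falling per-minute count during recovery actions is a good early signal; a rising count means the mitigation has not landed yet.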
1.2 Verify HTTP edge errors (Nginx example)
sudo tail -n 200 /var/log/nginx/error.log
sudo awk '$9 ~ /^50[24]$/' /var/log/nginx/access.log | tail -n 100
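Raw matching lines do not show how bad the ratio is. The one-liner below computes the 502/504 share of recent traffic, assuming the combined log format where the status code is the ninth whitespace-separated field:

```shell
# 502/504 share of the last 5000 requests (combined log format, status = field 9)
sudo tail -n 5000 /var/log/nginx/access.log \
  | awk '$9 ~ /^50[24]$/ {bad++} {total++} END {if (total) printf "%d/%d (%.1f%%)\n", bad, total, 100*bad/total}'
```

Track this number through the incident: a ratio that stays near zero for 15 minutes is one of the exit criteria in Step 4.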
1.3 Inspect PostgreSQL pressure from Odoo sessions
psql "$ODOO_DB_URI" -c "
select
count(*) filter (where state = 'active') as active,
count(*) filter (where state = 'idle in transaction') as idle_in_txn,
count(*) filter (where now() - query_start > interval '30 seconds' and state = 'active') as active_over_30s
from pg_stat_activity
where datname = current_database();
"
psql "$ODOO_DB_URI" -c "
select
pid,
usename,
application_name,
now() - query_start as query_age,
wait_event_type,
wait_event,
left(query, 180) as query
from pg_stat_activity
where datname = current_database()
and state = 'active'
order by query_start asc
limit 20;
"
If active query age is high and worker timeouts are rising, your workers are likely blocked on slow DB/lock paths rather than pure CPU shortage.
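When long-running queries pile up, it helps to distinguish slow statements from blocked ones; pg_blocking_pids (PostgreSQL 9.6+) exposes the lock chain directly. A sketch in the same style as the queries above:

```sql
-- Sessions that are waiting on locks, with the pids blocking them
select
  pid,
  pg_blocking_pids(pid) as blocked_by,
  now() - query_start as query_age,
  left(query, 120) as query
from pg_stat_activity
where datname = current_database()
  and cardinality(pg_blocking_pids(pid)) > 0
order by query_start asc;
```

If the same pid appears repeatedly in blocked_by, that session is the head of the chain and the best candidate for cancellation in Step 2.2.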
Step 2 — Recover service with safest-first ordering
2.1 Drain expensive background load first
- Temporarily reduce queue/cron concurrency (do not fully disable business-critical jobs).
- Pause known heavy jobs (mass mailers, recomputations, large exports).
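One way to pause scheduled jobs without a deploy is to flip them off in the database. ir_cron is Odoo's scheduled-action table; the id list below is a placeholder you must fill from your own environment, and toggling the same jobs in Settings → Technical → Scheduled Actions is the safer equivalent while the UI still responds:

```sql
-- List active scheduled actions, then disable chosen heavy, non-critical ones
select id, active, nextcall from ir_cron where active = true order by nextcall;

-- Placeholder ids: pause only known-heavy jobs, never everything
update ir_cron set active = false where id in (/* heavy job ids */);

-- After the incident, restore:
-- update ir_cron set active = true where id in (/* same ids */);
```

Record every id you disable in the incident notes so the restore step cannot be forgotten.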
2.2 Cancel long queries before terminating backends
# Prefer cancel first
psql "$ODOO_DB_URI" -c "select pg_cancel_backend(<pid>);"
# Escalate only if cancel fails and impact continues
psql "$ODOO_DB_URI" -c "select pg_terminate_backend(<pid>);"
Never bulk-kill all sessions. Preserve replication/admin connections and healthy Odoo workers.
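If chasing individual pids is too slow, the cancel can be scoped by role and query age in one statement. This is a sketch: it assumes Odoo connects as the role odoo, and that five minutes is beyond any legitimate interactive query in your workload; adjust both before use.

```sql
-- Scoped cancel: application role only, client backends only, long-running only
select pid, pg_cancel_backend(pid)
from pg_stat_activity
where datname = current_database()
  and usename = 'odoo'                 -- assumption: Odoo's application role
  and backend_type = 'client backend'  -- excludes replication/autovacuum/etc.
  and state = 'active'
  and now() - query_start > interval '5 minutes';
```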
2.3 Protect app from infinite waits during incident window
Use conservative timeout guardrails (session-level or role-level) so stuck statements fail fast:
-- Apply carefully; validate with app owners before persistent ALTER ROLE
ALTER ROLE odoo SET statement_timeout = '60s';
ALTER ROLE odoo SET lock_timeout = '10s';
If already set, verify effective values:
psql "$ODOO_DB_URI" -c "show statement_timeout; show lock_timeout;"
Step 3 — Restore long-polling health (bus/gevent lane)
Long-polling often fails as a downstream symptom. Treat it as a separate lane:
- Confirm dedicated long-polling endpoint/process is reachable.
- Ensure proxy timeouts are not shorter than expected long-poll wait windows.
- Restart only the degraded service lane if needed (long-polling/websocket process), then re-check user notifications.
Quick probe example:
# Endpoint differs by version/deployment (Odoo 16+ uses /websocket); verify your route
# Any HTTP response (even 404/405) proves the lane is reachable; a timeout does not
curl -I -m 10 https://<odoo-host>/longpolling/poll
If the endpoint fails while core HTTP works, recover bus lane without a full platform restart.
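A common deployment pattern, shown in Odoo's own deploy documentation, routes the long-polling path to the dedicated gevent port with a longer read timeout than the main HTTP lane. The port and path below follow that pattern as a sketch; they must match your odoo.conf (gevent/longpolling port, typically 8072) and your version's route:

```nginx
# Sketch: split the HTTP lane and the long-polling lane at the proxy
upstream odoo          { server 127.0.0.1:8069; }
upstream odoo-longpoll { server 127.0.0.1:8072; }

server {
    # ... listen/server_name/TLS omitted ...

    location /longpolling {
        proxy_pass http://odoo-longpoll;
        proxy_read_timeout 720s;   # must exceed the long-poll wait window
    }
    location / {
        proxy_pass http://odoo;
        proxy_read_timeout 120s;
    }
}
```

A proxy_read_timeout shorter than the long-poll wait window produces exactly the 504-on-longpolling symptom this step diagnoses.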
Step 4 — Verification and rollback checks
Exit incident mode only when all conditions hold for at least 15 minutes:
- Worker timeout log rate returns to baseline.
- 502/504 rate remains near zero.
- Long-polling/chat presence updates flow normally.
- active_over_30s and idle in transaction counts remain controlled.
- Critical user flows (login, sales order confirm, invoice post) pass.
Verification loop:
watch -n 10 "psql \"$ODOO_DB_URI\" -Atc \"select count(*) from pg_stat_activity where datname=current_database() and state='active' and now()-query_start>interval '30 seconds';\""
If user-facing latency regresses after re-enabling jobs, roll back the last traffic lane you reintroduced and reassess top slow statements.
Hardening checklist (post-incident)
- Set alerts on worker timeout rate, 502/504 ratio, and long-polling endpoint failures.
- Track top SQL by total time and calls (pg_stat_statements) and tune the highest offenders.
- Enforce sane Odoo worker sizing and timeout configuration for your CPU/memory envelope.
- Separate heavy cron/queue workloads from interactive traffic windows.
- Add index/query fixes for repeatedly timed-out business operations.
- Run a staged load drill that includes long-polling behavior, not just login throughput.
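For the pg_stat_statements item, a starting query; it assumes the extension is installed (shared_preload_libraries plus CREATE EXTENSION) and uses PostgreSQL 13+ column names (older versions use total_time instead of total_exec_time):

```sql
-- Top 10 statements by cumulative execution time
select
  calls,
  round(total_exec_time::numeric, 0) as total_ms,
  round((total_exec_time / calls)::numeric, 1) as avg_ms,
  left(query, 120) as query
from pg_stat_statements
order by total_exec_time desc
limit 10;
```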
Practical references
- Odoo deployment/system configuration guidance: https://www.odoo.com/documentation/17.0/administration/on_premise/deploy.html
- PostgreSQL client timeout settings (statement_timeout, lock_timeout): https://www.postgresql.org/docs/current/runtime-config-client.html
- PostgreSQL runtime stats (pg_stat_activity): https://www.postgresql.org/docs/current/monitoring-stats.html
Core principle: during timeout storms, reduce queueing pressure first, then remove the slow/blocking root causes. Full restarts hide evidence and usually make recurrence more likely.