
March 21, 2026

Odoo Worker Timeout Storm and Long-Polling Degradation Runbook

A production-safe, CLI-first runbook for stabilizing Odoo when worker timeouts cascade and bus/long-polling starts failing.

When Odoo users report random 502/504 errors, frozen Kanban updates, and delayed notifications at the same time, you are often in a worker-timeout storm.

This runbook is optimized for operators: confirm the blast radius, stop the timeout cascade, recover long-polling, then harden so it does not return next peak hour.

Incident signals that justify immediate response

  • Odoo logs contain repeated Worker (...) timeout or forced worker reload events.
  • Nginx/ALB shows rising 502/504 to Odoo upstream.
  • Chatter/live updates (bus/long-polling) are delayed or dead.
  • PostgreSQL has spikes in long-running active queries or idle in transaction sessions.
  • Request latency increases before CPU is fully saturated (classic queueing collapse).

Step 0 — Stabilize before changing config

  1. Pause non-critical traffic: bulk imports, scheduled heavy reports, low-priority cron jobs.
  2. Announce incident mode and freeze deploys/module upgrades.
  3. Keep one command operator + one incident scribe.
  4. Do not restart everything blindly; collect evidence first to avoid looping failure.
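
When the web UI is too slow to use, low-priority scheduled actions can be paused directly in the database. A hedged sketch, assuming direct psql access; ir_cron is Odoo's scheduled-action table, but column names vary slightly across versions and the name filter below is a placeholder you must adapt:

```sql
-- Record current state first so you can restore it after the incident.
SELECT id, cron_name, active FROM ir_cron ORDER BY active DESC, id;

-- Deactivate a known-heavy job by name (placeholder pattern -- adapt it).
UPDATE ir_cron SET active = false WHERE cron_name ILIKE '%mass mailing%';
```

Keep the SELECT output with your incident notes so re-enabling after recovery is a checklist item, not a memory test.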

Step 1 — Confirm timeout storm and identify where time is spent

1.1 Check recent worker timeout frequency

# systemd deployments
journalctl -u odoo -S -30min | grep -Ei "timeout|longpoll|bus" | tail -n 200

# container deployments
docker logs --since 30m <odoo_container> 2>&1 | grep -Ei "timeout|longpoll|bus" | tail -n 200
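
To judge whether the storm is accelerating or tapering off, bucket the timeout events per minute. A small sketch, assuming the default Odoo log timestamp prefix ("YYYY-MM-DD HH:MM:SS,mmm"); the log path in the example is an assumption, use your deployment's path:

```shell
# timeout_rate: count worker-timeout log lines per minute.
# Relies on the default Odoo timestamp prefix, so the first 16
# characters of each line identify the minute bucket.
timeout_rate() {
  grep -iE 'worker.*timeout' "$1" | cut -c1-16 | sort | uniq -c
}

# Example: timeout_rate /var/log/odoo/odoo-server.log
```

A rising count per minute means the storm is still building; flat or falling means your drain steps are taking effect.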

1.2 Verify HTTP edge errors (Nginx example)

sudo tail -n 200 /var/log/nginx/error.log
sudo awk '$9 ~ /^50[24]$/ {print $9}' /var/log/nginx/access.log | sort | uniq -c
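
Raw counts only show that 5xx responses exist; a ratio tells you how bad the blast radius is. A sketch assuming the default combined log format, where field 9 is the status code:

```shell
# err_ratio: percentage of 502/504 responses in an access log.
# Assumes nginx "combined" log format (status code is the 9th field).
err_ratio() {
  awk '$9 ~ /^50[24]$/ {e++} {n++} END {if (n) printf "%.1f%% (%d/%d)\n", 100*e/n, e, n}' "$1"
}

# Example: err_ratio /var/log/nginx/access.log
```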

1.3 Inspect PostgreSQL pressure from Odoo sessions

psql "$ODOO_DB_URI" -c "
select
  count(*) filter (where state = 'active') as active,
  count(*) filter (where state = 'idle in transaction') as idle_in_txn,
  count(*) filter (where now() - query_start > interval '30 seconds' and state = 'active') as active_over_30s
from pg_stat_activity
where datname = current_database();
"
psql "$ODOO_DB_URI" -c "
select
  pid,
  usename,
  application_name,
  now() - query_start as query_age,
  wait_event_type,
  wait_event,
  left(query, 180) as query
from pg_stat_activity
where datname = current_database()
  and state = 'active'
order by query_start asc
limit 20;
"

If active query age is high and worker timeouts are rising, your workers are likely blocked on slow DB/lock paths rather than pure CPU shortage.

Step 2 — Recover service with safest-first ordering

2.1 Drain expensive background load first

  • Temporarily reduce queue/cron concurrency (do not fully disable business-critical jobs).
  • Pause known heavy jobs (mass mailers, recomputations, large exports).
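
Concurrency and timeout knobs live in odoo.conf. A hedged fragment with illustrative values only; these are standard Odoo options, but the right numbers depend on your CPU/memory envelope:

```
# odoo.conf excerpt -- example values, size to your hardware

# fewer cron threads during the incident window
max_cron_threads = 1
# seconds of CPU per request before the worker is recycled
limit_time_cpu = 60
# wall-clock seconds per request (keep above limit_time_cpu)
limit_time_real = 120
```

Apply with a rolling restart, and restore your normal values once the incident closes.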

2.2 Cancel long queries before terminating backends

# Prefer cancel first
psql "$ODOO_DB_URI" -c "select pg_cancel_backend(<pid>);"

# Escalate only if cancel fails and impact continues
psql "$ODOO_DB_URI" -c "select pg_terminate_backend(<pid>);"

Never bulk-kill all sessions. Preserve replication/admin connections and healthy Odoo workers.
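
If many statements must be cancelled, target them by role, state, and age rather than sweeping the whole server. A sketch; the 'odoo' role name and the 5-minute threshold are examples to adapt:

```sql
-- Cancel (not terminate) only long-running active statements from the
-- application role, never touching this session or other roles.
SELECT pid, now() - query_start AS query_age, pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE datname = current_database()
  AND usename = 'odoo'
  AND state = 'active'
  AND now() - query_start > interval '5 minutes'
  AND pid <> pg_backend_pid();
```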

2.3 Protect app from infinite waits during incident window

Use conservative timeout guardrails (session- or role-level) so stuck statements fail fast. Note that role-level settings only apply to new connections, so recycle Odoo workers after changing them:

-- Apply carefully; validate with app owners before persistent ALTER ROLE
ALTER ROLE odoo SET statement_timeout = '60s';
ALTER ROLE odoo SET lock_timeout = '10s';

If already set, verify effective values:

psql "$ODOO_DB_URI" -c "show statement_timeout; show lock_timeout;"

Step 3 — Restore long-polling health (bus/gevent lane)

Long-polling often fails as a downstream symptom. Treat it as a separate lane:

  1. Confirm dedicated long-polling endpoint/process is reachable.
  2. Ensure proxy timeouts are not shorter than expected long-poll wait windows.
  3. Restart only the degraded service lane if needed (long-polling/websocket process), then re-check user notifications.
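
For item 2, the proxy's read timeout on the bus route must exceed the long-poll hold time. A hedged nginx sketch; port 8072 is Odoo's default gevent/longpolling port, and the timeout values are examples:

```
# nginx excerpt -- route the bus lane to the gevent port with a read
# timeout generous enough that held connections are not cut mid-poll
location /longpolling {
    proxy_pass http://127.0.0.1:8072;
    proxy_read_timeout 720s;
    proxy_connect_timeout 5s;
}
```

Odoo 16+ moved the bus to /websocket, which additionally needs the Upgrade/Connection header pair for websocket proxying.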

Quick probe example:

# Endpoint differs by version/deployment (Odoo 16+ uses /websocket); verify your route.
# The route expects POST JSON-RPC, so any quick HTTP response (even 404/405)
# proves the lane is reachable; a hang or 502/504 indicates trouble.
curl -I -m 10 https://<odoo-host>/longpolling/poll

If the endpoint fails while core HTTP works, recover bus lane without a full platform restart.
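
A small curl-based probe loop covering both known bus routes; a sketch, since which route answers depends on your Odoo version (/longpolling/poll up to 15.0, /websocket from 16.0):

```shell
# bus_probe: report the HTTP status of each candidate bus route.
# "000" means no HTTP response at all (timeout/refused) -- the bad case.
bus_probe() {
  for path in /longpolling/poll /websocket; do
    printf '%s -> HTTP %s\n' "$path" \
      "$(curl -s -o /dev/null -w '%{http_code}' -m 10 "$1$path")"
  done
}

# Example: bus_probe https://odoo.example.com
```

Even a 404 or 405 is acceptable here, since it proves the lane answers; only a hang or 5xx points at the bus lane itself.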

Step 4 — Verification and rollback checks

Exit incident mode only when all conditions hold for at least 15 minutes:

  • Worker timeout log rate returns to baseline.
  • 502/504 rate remains near zero.
  • Long-polling/chat presence updates flow normally.
  • active_over_30s and idle in transaction remain controlled.
  • Critical user flows (login, sales order confirm, invoice post) pass.

Verification loop:

watch -n 10 "psql \"$ODOO_DB_URI\" -Atc \"select count(*) from pg_stat_activity where datname=current_database() and state='active' and now()-query_start>interval '30 seconds';\""

If user-facing latency regresses after re-enabling jobs, roll back the last traffic lane you reintroduced and reassess top slow statements.

Hardening checklist (post-incident)

  • Set alerts on worker timeout rate, 502/504 ratio, and long-polling endpoint failures.
  • Track top SQL by total time and calls (pg_stat_statements) and tune the highest offenders.
  • Enforce sane Odoo worker sizing and timeout configuration for your CPU/memory envelope.
  • Separate heavy cron/queue workloads from interactive traffic windows.
  • Add index/query fixes for repeatedly timed-out business operations.
  • Run a staged load drill that includes long-polling behavior, not just login throughput.
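
For the pg_stat_statements item, a starting query; the extension must be installed and preloaded, and the column names changed in PostgreSQL 13 (total_exec_time/mean_exec_time; older versions use total_time/mean_time):

```sql
-- Top statements by cumulative execution time (PostgreSQL 13+ columns).
-- Requires pg_stat_statements in shared_preload_libraries and
-- CREATE EXTENSION pg_stat_statements in the database.
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       left(query, 100)                   AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```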

Closing principle

Core principle: during timeout storms, reduce queueing pressure first, then remove the slow/blocking root causes. Full restarts hide evidence and usually make recurrence more likely.
