
March 20, 2026

Odoo Queue Backlog Recovery Runbook (When Jobs Stop Draining)

A CLI-first incident pattern to diagnose queue saturation, unblock safe jobs, and recover worker throughput without blind retries.

When queue_job stops draining, teams often make it worse by retrying everything at once. This runbook focuses on controlled recovery: measure, isolate, then recover throughput safely.

Incident signals

Treat backlog as an incident when one or more of these persist for 10–15 minutes:

  • pending jobs rising while started stays flat.
  • Same jobs repeatedly cycling failed -> pending.
  • Worker CPU idle but queue depth still grows.
  • Business flows tied to async jobs (invoice send, stock sync, webhooks) visibly lag.
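
The first two signals can be sketched as a check over two state-count snapshots taken 10–15 minutes apart (the dict shape here is an assumption for illustration, not queue_job's schema):

```python
def is_backlog_incident(prev: dict, curr: dict) -> bool:
    """Flag the core pattern: pending rises while started stays flat.

    prev/curr are {state: count} snapshots of queue_job counts,
    taken roughly 10-15 minutes apart.
    """
    pending_rising = curr.get("pending", 0) > prev.get("pending", 0)
    started_flat = curr.get("started", 0) <= prev.get("started", 0)
    return pending_rising and started_flat
```

A single positive check is noise; treat it as an incident only when it persists across consecutive snapshots.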

Step 1 — Capture queue shape before touching anything

# Count jobs by state
psql "$ODOO_DB_URI" -c "
select state, count(*)
from queue_job
group by state
order by count(*) desc;
"

# Top channels by pending depth
psql "$ODOO_DB_URI" -c "
select channel, count(*) as pending_jobs
from queue_job
where state = 'pending'
group by channel
order by pending_jobs desc
limit 10;
"

Save this output to your incident notes. You need a baseline to confirm real recovery.
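
If you capture the counts with psql's `--csv` output mode, a few lines of parsing turn the baseline into something you can diff against later snapshots (the sample values below are illustrative):

```python
import csv
import io

def parse_state_counts(psql_csv: str) -> dict:
    """Parse `psql --csv` output of the state/count query into {state: count}."""
    reader = csv.DictReader(io.StringIO(psql_csv))
    return {row["state"]: int(row["count"]) for row in reader}

# Example: baseline captured at incident start (numbers are made up).
baseline = parse_state_counts("state,count\npending,1240\nfailed,87\nstarted,4\n")
```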

Step 2 — Find the jobs that are blocking throughput

# Oldest pending and failed jobs (often where the blockage starts)
psql "$ODOO_DB_URI" -c "
select id, channel, eta, date_created, substring(exc_info from 1 for 140) as err
from queue_job
where state in ('pending', 'failed')
order by date_created asc
limit 30;
"

Focus on repeated exceptions from one model/integration first (for example, a dead webhook target or missing external credential).
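
To spot the dominant failure quickly, group the `err` column from the query above by its first line (a sketch; the row shape mirrors the query's column aliases):

```python
from collections import Counter

def top_error_signatures(rows, n=5):
    """Group jobs by a coarse exception signature (first line of err,
    truncated) so one broken integration stands out from background noise."""
    sigs = Counter((row["err"] or "").split("\n")[0][:80] for row in rows)
    return sigs.most_common(n)
```

When one signature dominates the top of this list, fix or quarantine that integration before touching anything else.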

Step 3 — Quarantine noisy failures, then retry only safe subsets

If one integration is broken, stop flooding retries from that channel.

# Example: postpone one failing channel for 30 minutes
psql "$ODOO_DB_URI" -c "
update queue_job
set eta = now() + interval '30 minutes'
where state = 'pending'
  and channel = 'root.webhook_sync';
"

Then retry only idempotent jobs (never bulk-retry payment capture or non-idempotent external writes blindly).

# Example selective retry
odoocli queue retry --channel root.stock_sync --older-than 20m --limit 200
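
One way to keep the idempotency rule enforceable is an explicit allowlist that retry tooling consults before any bulk action (the allowlist contents are an assumption; only channels you have verified as idempotent belong in it):

```python
# Channels verified safe for bulk retry (illustrative; maintain this
# list in your own ops docs, never by guesswork during an incident).
SAFE_RETRY_CHANNELS = {"root.stock_sync"}

def can_bulk_retry(channel: str) -> bool:
    """Only explicitly documented idempotent channels may be bulk-retried."""
    return channel in SAFE_RETRY_CHANNELS
```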

Step 4 — Restore worker throughput deliberately

# Inspect worker and memory pressure first
odoocli doctor --env production

# If healthy, increase queue workers one step (not 3x at once)
odoocli workers scale queue --count +2

Re-check state counts after 5–10 minutes. If started increases and pending trends down, keep the setting. If not, roll back worker count and revisit blocking exceptions.
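
The keep-or-roll-back decision above reduces to comparing the same state counts before and after scaling (a sketch, using the same {state: count} snapshot shape as Step 1):

```python
def scaling_helped(before: dict, after: dict) -> bool:
    """Keep the new worker count only if started rose AND pending fell."""
    return (after.get("started", 0) > before.get("started", 0)
            and after.get("pending", 0) < before.get("pending", 0))
```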

Step 5 — Exit criteria before closing incident

  • pending depth decreases across three consecutive checks.
  • No rapid reappearance of the same exception signature.
  • Critical business jobs complete within expected SLA.
  • Temporary channel quarantines are either removed or converted into a tracked fix.
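
The first exit criterion can be checked mechanically from a list of pending-depth samples:

```python
def pending_strictly_decreasing(samples):
    """True when pending depth decreased across three consecutive checks."""
    if len(samples) < 3:
        return False
    a, b, c = samples[-3:]
    return a > b > c
```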

Hardening after recovery

  • Add channel-level alerting (pending depth + oldest job age).
  • Document which channels are safe for bulk retry and which are not.
  • Add a weekly replay drill in staging from a production-like queue snapshot.
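
A minimal sketch of the channel-level alert rule, assuming you already collect pending depth and oldest-job age per channel (field names and thresholds are placeholders to tune for your workload):

```python
def channel_alerts(channels, depth_limit=500, max_age_minutes=30):
    """Emit (channel, reason) pairs for the two alert dimensions above:
    pending depth and oldest pending job age."""
    alerts = []
    for ch in channels:
        if ch["pending"] > depth_limit:
            alerts.append((ch["name"], "depth"))
        if ch["oldest_age_min"] > max_age_minutes:
            alerts.append((ch["name"], "age"))
    return alerts
```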

The main rule: do not optimize for clearing the queue fast. Optimize for clearing it safely without creating duplicate side effects.
