Odoo Queue Backlog Recovery Runbook (When Jobs Stop Draining)
A CLI-first incident pattern to diagnose queue saturation, unblock safe jobs, and recover worker throughput without blind retries.
When queue_job stops draining, teams often make it worse by retrying everything at once.
This runbook focuses on controlled recovery: measure, isolate, then recover throughput safely.
Incident signals
Treat backlog as an incident when one or more of these persist for 10–15 minutes:
- pending jobs rising while started stays flat.
- Same jobs repeatedly cycling failed -> pending.
- Worker CPU idle but queue depth still grows.
- Business flows tied to async jobs (invoice send, stock sync, webhooks) visibly lag.
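The "rising pending, flat started" signal is easy to codify. A minimal sketch (the `QueueSample` type and sample values are illustrative, not from Odoo itself) that decides whether consecutive state-count readings look like a stalled queue:

```python
from dataclasses import dataclass

@dataclass
class QueueSample:
    """One point-in-time reading of queue_job state counts."""
    pending: int
    started: int

def backlog_incident(samples: list[QueueSample]) -> bool:
    """True when pending rises across every consecutive pair of
    samples while started stays flat over the same window."""
    if len(samples) < 2:
        return False
    pairs = list(zip(samples, samples[1:]))
    pending_rising = all(b.pending > a.pending for a, b in pairs)
    started_flat = all(b.started == a.started for a, b in pairs)
    return pending_rising and started_flat

# Three checks, roughly 5 minutes apart
readings = [QueueSample(120, 8), QueueSample(240, 8), QueueSample(410, 8)]
print(backlog_incident(readings))  # True: pending grows, started is flat
```

Feed it the output of the Step 1 queries taken at intervals; two or three samples over 10-15 minutes is enough to avoid paging on a momentary spike.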
Step 1 — Capture queue shape before touching anything
# Count jobs by state
psql "$ODOO_DB_URI" -c "
select state, count(*)
from queue_job
group by state
order by count(*) desc;
"
# Top channels by pending depth
psql "$ODOO_DB_URI" -c "
select channel, count(*) as pending_jobs
from queue_job
where state = 'pending'
group by channel
order by pending_jobs desc
limit 10;
"
Save this output to your incident notes. You need a baseline to confirm real recovery.
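"Real recovery" should be judged against that baseline, not against gut feel. A small sketch of the comparison (state names match the Step 1 query; the counts are invented for illustration):

```python
def recovery_confirmed(baseline: dict[str, int], current: dict[str, int]) -> bool:
    """Compare state counts against the Step 1 baseline.
    Recovery means pending has actually shrunk and failed has not
    grown -- not merely that workers look busy."""
    return (current.get("pending", 0) < baseline.get("pending", 0)
            and current.get("failed", 0) <= baseline.get("failed", 0))

baseline = {"pending": 5200, "failed": 340, "started": 4}
later = {"pending": 3100, "failed": 340, "started": 12}
print(recovery_confirmed(baseline, later))  # True
```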
Step 2 — Find the jobs that are blocking throughput
# Oldest pending jobs (often where blockage starts)
psql "$ODOO_DB_URI" -c "
select id, channel, eta, date_created, substring(exc_info from 1 for 140) as err
from queue_job
where state in ('pending', 'failed')
order by date_created asc
limit 30;
"
Focus on repeated exceptions from one model/integration first (for example, a dead webhook target or missing external credential).
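To spot that dominant integration quickly, you can cluster the truncated `err` column from the query above by prefix, so one broken endpoint shows up as a single signature. A sketch (the error strings are examples, and prefix clustering is a heuristic that assumes similar failures share an exception header):

```python
from collections import Counter

def top_exception_signatures(errors: list[str], prefix_len: int = 60) -> list[tuple[str, int]]:
    """Cluster error strings by a fixed-length prefix so one broken
    integration surfaces as a single dominant signature."""
    counts = Counter(e[:prefix_len] for e in errors if e)
    return counts.most_common(5)

errs = [
    "requests.exceptions.ConnectionError: webhook.example.com refused",
    "requests.exceptions.ConnectionError: webhook.example.com refused",
    "ValueError: missing credential for stock API",
]
for signature, count in top_exception_signatures(errs):
    print(count, signature)
```

The signature with the highest count is usually where quarantine (Step 3) should start.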
Step 3 — Quarantine noisy failures, then retry only safe subsets
If one integration is broken, stop flooding retries from that channel.
# Example: postpone one failing channel for 30 minutes
psql "$ODOO_DB_URI" -c "
update queue_job
set eta = now() + interval '30 minutes'
where state = 'pending'
and channel = 'root.webhook_sync';
"
Then retry only idempotent jobs (never bulk-retry payment capture or non-idempotent external writes blindly).
# Example selective retry
odoocli queue retry --channel root.stock_sync --older-than 20m --limit 200
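The idempotency guard is worth encoding rather than remembering under pressure. A minimal sketch, assuming you maintain an explicit allow-list of channels that are safe to bulk-retry (the channel names and the attempt budget here are hypothetical):

```python
# Channels declared safe for bulk retry by the team -- hypothetical names.
SAFE_FOR_BULK_RETRY = {"root.stock_sync", "root.mail_send"}

def retryable(channel: str, attempts: int, max_attempts: int = 5) -> bool:
    """Allow bulk retry only for channels declared idempotent,
    and only while the job still has retry budget left."""
    return channel in SAFE_FOR_BULK_RETRY and attempts < max_attempts

print(retryable("root.stock_sync", attempts=2))       # True
print(retryable("root.payment_capture", attempts=0))  # False: not on the allow-list
```

Anything not on the allow-list gets investigated individually, never retried in bulk.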
Step 4 — Restore worker throughput deliberately
# Inspect worker and memory pressure first
odoocli doctor --env production
# If healthy, increase queue workers one step (not 3x at once)
odoocli workers scale queue --count +2
Re-check state counts after 5–10 minutes. If started increases and pending trends down, keep the setting.
If not, roll back worker count and revisit blocking exceptions.
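The keep-or-rollback decision can be stated as a single predicate over before/after readings. A sketch (the numbers are illustrative):

```python
def keep_new_worker_count(pending_before: int, pending_after: int,
                          started_before: int, started_after: int) -> bool:
    """Keep the scaled-up worker count only if throughput actually
    moved: started increased AND pending trends down."""
    return started_after > started_before and pending_after < pending_before

# Readings 10 minutes apart, after adding two workers
print(keep_new_worker_count(5000, 4100, 4, 10))  # True: keep the setting
print(keep_new_worker_count(5000, 5200, 4, 10))  # False: roll back, jobs are blocked
```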
Step 5 — Exit criteria before closing incident
- pending depth decreases across three consecutive checks.
- No rapid reappearance of the same exception signature.
- Critical business jobs complete within expected SLA.
- Temporary channel quarantines are either removed or converted into a tracked fix.
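The first two exit criteria are mechanical and can be checked from your incident notes. A sketch (depth values and signature strings are illustrative):

```python
def exit_criteria_met(pending_depths: list[int], recent_signatures: list[str]) -> bool:
    """pending must strictly decrease across the last three checks,
    and no exception signature may appear twice in the recent window."""
    if len(pending_depths) < 3:
        return False
    a, b, c = pending_depths[-3:]
    decreasing = a > b > c
    no_repeats = len(set(recent_signatures)) == len(recent_signatures)
    return decreasing and no_repeats

print(exit_criteria_met([900, 700, 450], ["ValueError: stock API"]))  # True
print(exit_criteria_met([900, 950, 800], []))                         # False: not decreasing
```

The SLA and quarantine criteria still need a human check before the incident is closed.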
Hardening after recovery
- Add channel-level alerting (pending depth + oldest job age).
- Document which channels are safe for bulk retry and which are not.
- Add a weekly replay drill in staging from a production-like queue snapshot.
The main rule: do not optimize for clearing the queue fast. Optimize for clearing it safely without creating duplicate side effects.