Odoo Queue Backlog Recovery Runbook (When Jobs Stop Draining)
A CLI-first incident pattern to diagnose queue saturation, unblock safe jobs, and recover worker throughput without blind retries.
When queue_job stops draining, teams often make it worse by retrying everything at once.
This runbook focuses on controlled recovery: measure, isolate, then recover throughput safely.
Incident signals
Treat backlog as an incident when one or more of these persist for 10–15 minutes:
- pending jobs rising while started stays flat.
- Same jobs repeatedly cycling failed -> pending.
- Worker CPU idle but queue depth still grows.
- Business flows tied to async jobs (invoice send, stock sync, webhooks) visibly lag.
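The "rising pending, flat started" signal is easy to codify. A minimal sketch (the `QueueSample` type and sample values are illustrative, not from Odoo itself) that decides whether consecutive state-count readings look like a stalled queue:

```python
from dataclasses import dataclass

@dataclass
class QueueSample:
    """One point-in-time reading of queue_job state counts."""
    pending: int
    started: int

def backlog_incident(samples: list[QueueSample]) -> bool:
    """True when pending rises across every consecutive pair of
    samples while started stays flat over the same window."""
    if len(samples) < 2:
        return False
    pairs = list(zip(samples, samples[1:]))
    pending_rising = all(b.pending > a.pending for a, b in pairs)
    started_flat = all(b.started == a.started for a, b in pairs)
    return pending_rising and started_flat

# Three checks, roughly 5 minutes apart
readings = [QueueSample(120, 8), QueueSample(240, 8), QueueSample(410, 8)]
print(backlog_incident(readings))  # True: pending grows, started is flat
```

Feed it the output of the Step 1 queries taken at intervals; two or three samples over 10-15 minutes is enough to avoid paging on a momentary spike.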
Step 1 — Capture queue shape before touching anything
# Count jobs by state
psql "$ODOO_DB_URI" -c "
select state, count(*)
from queue_job
group by state
order by count(*) desc;
"
# Top channels by pending depth
psql "$ODOO_DB_URI" -c "
select channel, count(*) as pending_jobs
from queue_job
where state = 'pending'
group by channel
order by pending_jobs desc
limit 10;
"
Save this output to your incident notes. You need a baseline to confirm real recovery.
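"Real recovery" should be judged against that baseline, not against gut feel. A small sketch of the comparison (state names match the Step 1 query; the counts are invented for illustration):

```python
def recovery_confirmed(baseline: dict[str, int], current: dict[str, int]) -> bool:
    """Compare state counts against the Step 1 baseline.
    Recovery means pending has actually shrunk and failed has not
    grown -- not merely that workers look busy."""
    return (current.get("pending", 0) < baseline.get("pending", 0)
            and current.get("failed", 0) <= baseline.get("failed", 0))

baseline = {"pending": 5200, "failed": 340, "started": 4}
later = {"pending": 3100, "failed": 340, "started": 12}
print(recovery_confirmed(baseline, later))  # True
```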
Step 2 — Find the jobs that are blocking throughput
# Oldest pending jobs (often where blockage starts)
psql "$ODOO_DB_URI" -c "
select id, channel, eta, date_created, substring(exc_info from 1 for 140) as err
from queue_job
where state in ('pending', 'failed')
order by date_created asc
limit 30;
"
Focus on repeated exceptions from one model/integration first (for example, a dead webhook target or missing external credential).
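To spot that dominant integration quickly, you can cluster the truncated `err` column from the query above by prefix, so one broken endpoint shows up as a single signature. A sketch (the error strings are examples, and prefix clustering is a heuristic that assumes similar failures share an exception header):

```python
from collections import Counter

def top_exception_signatures(errors: list[str], prefix_len: int = 60) -> list[tuple[str, int]]:
    """Cluster error strings by a fixed-length prefix so one broken
    integration surfaces as a single dominant signature."""
    counts = Counter(e[:prefix_len] for e in errors if e)
    return counts.most_common(5)

errs = [
    "requests.exceptions.ConnectionError: webhook.example.com refused",
    "requests.exceptions.ConnectionError: webhook.example.com refused",
    "ValueError: missing credential for stock API",
]
for signature, count in top_exception_signatures(errs):
    print(count, signature)
```

The signature with the highest count is usually where quarantine (Step 3) should start.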
Step 3 — Quarantine noisy failures, then retry only safe subsets
If one integration is broken, stop flooding retries from that channel.
# Example: postpone one failing channel for 30 minutes
psql "$ODOO_DB_URI" -c "
update queue_job
set eta = now() + interval '30 minutes'
where state = 'pending'
and channel = 'root.webhook_sync';
"
Then retry only idempotent jobs (never bulk-retry payment capture or non-idempotent external writes blindly).
# Example selective retry
odoocli queue retry --channel root.stock_sync --older-than 20m --limit 200
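The idempotency guard is worth encoding rather than remembering under pressure. A minimal sketch, assuming you maintain an explicit allow-list of channels that are safe to bulk-retry (the channel names and the attempt budget here are hypothetical):

```python
# Channels declared safe for bulk retry by the team -- hypothetical names.
SAFE_FOR_BULK_RETRY = {"root.stock_sync", "root.mail_send"}

def retryable(channel: str, attempts: int, max_attempts: int = 5) -> bool:
    """Allow bulk retry only for channels declared idempotent,
    and only while the job still has retry budget left."""
    return channel in SAFE_FOR_BULK_RETRY and attempts < max_attempts

print(retryable("root.stock_sync", attempts=2))       # True
print(retryable("root.payment_capture", attempts=0))  # False: not on the allow-list
```

Anything not on the allow-list gets investigated individually, never retried in bulk.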
Step 4 — Restore worker throughput deliberately
# Inspect worker and memory pressure first
odoocli doctor --env production
# If healthy, increase queue workers one step (not 3x at once)
odoocli workers scale queue --count +2
Re-check state counts after 5–10 minutes. If started increases and pending trends down, keep the setting.
If not, roll back worker count and revisit blocking exceptions.
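The keep-or-rollback decision can be stated as a single predicate over before/after readings. A sketch (the numbers are illustrative):

```python
def keep_new_worker_count(pending_before: int, pending_after: int,
                          started_before: int, started_after: int) -> bool:
    """Keep the scaled-up worker count only if throughput actually
    moved: started increased AND pending trends down."""
    return started_after > started_before and pending_after < pending_before

# Readings 10 minutes apart, after adding two workers
print(keep_new_worker_count(5000, 4100, 4, 10))  # True: keep the setting
print(keep_new_worker_count(5000, 5200, 4, 10))  # False: roll back, jobs are blocked
```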
Step 5 — Exit criteria before closing incident
- pending depth decreases across three consecutive checks.
- No rapid reappearance of the same exception signature.
- Critical business jobs complete within expected SLA.
- Temporary channel quarantines are either removed or converted into a tracked fix.
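The first two exit criteria are mechanical and can be checked from your incident notes. A sketch (depth values and signature strings are illustrative):

```python
def exit_criteria_met(pending_depths: list[int], recent_signatures: list[str]) -> bool:
    """pending must strictly decrease across the last three checks,
    and no exception signature may appear twice in the recent window."""
    if len(pending_depths) < 3:
        return False
    a, b, c = pending_depths[-3:]
    decreasing = a > b > c
    no_repeats = len(set(recent_signatures)) == len(recent_signatures)
    return decreasing and no_repeats

print(exit_criteria_met([900, 700, 450], ["ValueError: stock API"]))  # True
print(exit_criteria_met([900, 950, 800], []))                         # False: not decreasing
```

The SLA and quarantine criteria still need a human check before the incident is closed.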
Hardening after recovery
- Add channel-level alerting (pending depth + oldest job age).
- Document which channels are safe for bulk retry and which are not.
- Add a weekly replay drill in staging from a production-like queue snapshot.
The main rule: do not optimize for clearing the queue fast. Optimize for clearing it safely without creating duplicate side effects.