Troubleshooting

Symptom → likely cause → fix. Each section below is the same shape: what you see, what's probably wrong, how to confirm, what to change, and a link into the reference page that owns the underlying design.

Symptom | Likely cause
event=lease_lost recurring in logs | Handler P99 > lease_ttl_seconds
Outbox row count grows + lease_lost spike | DLQ CTE failing (DLQ schema drift)
Outbox row count grows, no lease_lost | Fetch loop not running, or rows future-dated
Idle dispatch latency > max_fetch_interval | LISTEN setup failed → polling fallback
Subscriber dispatch never starts; rows pile up | Engine pool exhausted on writer-connection checkout
Duplicate handler invocations | Lease expired before handler returned, or handler not idempotent
Rolling deploy leaks rows | graceful_timeout < handler P99, or k8s grace too short
activate_in / activate_at fires immediately in tests | TestOutboxBroker(run_loops=False) ignores scheduling
AckPolicy.ACK_FIRST raises ValueError at registration | By design (would defeat outbox reliability)
OutboxResponse(...) + foreign-publisher decorator gets nacked | By design (dual-fire footgun)
Chained OutboxResponse retries after the handler "succeeded" | Follow-on publish fails post-handler; nacks the inbound row
validate_schema() raises ImportError | [validate] extra not installed

event=lease_lost recurring in logs

Symptom. WARNING-level logs with structured field event=lease_lost, typically with phase=terminal or phase=retry, one per affected row.

Likely cause. The subscriber's lease_ttl_seconds is shorter than the handler's P99 duration. A handler took longer than the lease, another fetch reclaimed the row mid-flight, and the original handler's terminal DELETE / UPDATE matched zero rows.

Diagnose. Grep for event=lease_lost over the last hour and compare the rate against dispatched. A non-zero baseline rate (rather than occasional spikes) confirms TTL is the issue.
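
One way to tell a steady baseline from occasional spikes is to bucket the lease_lost lines by minute. The following is a minimal sketch, not part of the library: it assumes each log line starts with an ISO-8601 timestamp followed by space-separated key=value fields, which is a hypothetical format you would adapt to your own logger. The helper names and the 50% threshold are illustrative.

```python
from collections import Counter
from datetime import datetime


def lease_lost_per_minute(lines):
    """Count event=lease_lost log lines per minute bucket.

    Assumes (hypothetically) that each line begins with an ISO-8601
    timestamp followed by space-separated key=value fields.
    """
    buckets = Counter()
    for line in lines:
        fields = line.split()
        if "event=lease_lost" not in fields:
            continue
        stamp = datetime.fromisoformat(fields[0])
        # Truncate to the minute so every line in that minute shares a key.
        buckets[stamp.replace(second=0, microsecond=0)] += 1
    return buckets


def looks_like_baseline(buckets, window_minutes, min_share=0.5):
    """True when lease_lost appears in at least min_share of the minutes.

    min_share=0.5 is an arbitrary illustrative threshold.
    """
    return len(buckets) / window_minutes >= min_share
```

If most minutes in the window contain at least one lease_lost line, the pattern is closer to a baseline than to isolated spikes.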

Fix. Raise lease_ttl_seconds for the affected subscriber, OR segregate slow work onto its own subscriber with a longer TTL (recommended, because it keeps the fast queue's reclaim tight). TTL must exceed handler P99 with margin.
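
To pick a TTL from measured data, compute the handler's P99 and add headroom. A minimal sketch, assuming you already have a list of handler durations in seconds from your own metrics; the function names and the 2x margin are illustrative choices, not library defaults.

```python
def p99(durations_s):
    """Nearest-rank 99th percentile of handler durations, in seconds."""
    if not durations_s:
        raise ValueError("need at least one duration sample")
    ordered = sorted(durations_s)
    # 1-based nearest rank, ceil(0.99 * n), computed with integers.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def suggested_lease_ttl(durations_s, margin=2.0):
    """P99 multiplied by a safety margin (2x is an illustrative choice)."""
    return p99(durations_s) * margin
```

Compare the result with the subscriber's current lease_ttl_seconds; if the current value is at or below the measured P99, healthy handlers will keep racing their lease.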

Reference. Subscriber § Slow handlers — dedicated queue.

Outbox row count grows + lease_lost spike

Symptom. Two things at once: row count in the outbox table grows without bound, and event=lease_lost log rate spikes.

Likely cause. The DLQ CTE is failing on every terminal flush — DLQ schema drift means the INSERT INTO <dlq> clause inside the WITH deleted AS (DELETE … RETURNING …) statement rolls back the DELETE too. Rows stay in the outbox, leases keep expiring, the pattern compounds.

Diagnose. Run await broker.validate_schema() against the live DB (the [validate] extra is required). It will surface missing columns / indexes on the DLQ table. A frequent cause on older deployments is a hand-written DLQ migration missing the timer_id column — validate_schema() reports it as a missing column on the DLQ table. (The Alembic guide now includes it; pre-fix migrations may not.)

Fix. Bring the DLQ schema up to spec (apply the missing migration, or rename / drop the drifted column / index). After the schema is correct, the next claim of each stuck row flushes through the CTE and the outbox drains naturally.

Recommended alerts. A persistent DLQ misconfiguration (or a permanent relay config error) is the one way a config bug degrades into a storage-exhaustion outage — the affected rows cycle through fetch/fail forever while new rows accumulate. There is no built-in circuit breaker, so alert on outbox row count (trend / absolute ceiling) and on the lease_lost rate, and watch dlq_written vs nacked_terminal divergence (a gap means terminal failures aren't reaching the DLQ).
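
The three alert conditions above can be expressed as a small predicate over a metrics snapshot. This is a sketch under stated assumptions: the OutboxSnapshot type, the alert names, and both thresholds are hypothetical, and how you collect the numbers (SQL count, metrics backend) is left to your setup.

```python
from dataclasses import dataclass


@dataclass
class OutboxSnapshot:
    """Hypothetical container for values scraped from your own metrics."""
    row_count: int
    lease_lost_per_min: float
    dlq_written: int
    nacked_terminal: int


def fired_alerts(snap, max_rows=100_000, max_lease_lost_per_min=1.0):
    """Return the names of alert conditions that hold for this snapshot.

    Both thresholds are placeholders; tune them to your traffic.
    """
    fired = []
    if snap.row_count > max_rows:
        fired.append("outbox_row_count")
    if snap.lease_lost_per_min > max_lease_lost_per_min:
        fired.append("lease_lost_rate")
    # More terminal nacks than DLQ writes means some failures never landed.
    if snap.nacked_terminal > snap.dlq_written:
        fired.append("dlq_divergence")
    return fired
```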

Reference. DLQ § Atomicity, Schema validation.

Outbox row count grows, no lease_lost

Symptom. Outbox rows accumulate, but logs are clean — no lease_lost, no exceptions.

Likely cause. Either no subscriber is registered for that queue, or the rows are future-dated (activate_in / activate_at set) and genuinely waiting to fire.

Diagnose. Inspect a stuck row's next_attempt_at — if it's in the future, the row is correctly waiting. Otherwise check whether a subscriber is registered: walk broker.subscribers (the property, which covers router-attached subscribers — broker._subscribers will miss them).
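
The next_attempt_at check can be applied to a batch of rows at once. A minimal sketch, assuming you have already selected the next_attempt_at values from the outbox table as timezone-aware datetimes; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def triage_rows(next_attempt_times, now=None):
    """Split rows into 'waiting' (future-dated) and 'due' (eligible now)."""
    now = now or datetime.now(timezone.utc)
    waiting = sum(1 for t in next_attempt_times if t > now)
    due = sum(1 for t in next_attempt_times if t <= now)
    return {"waiting": waiting, "due": due}


# Example with a fixed clock: one row scheduled ahead, one already eligible.
fixed_now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
sample = [fixed_now + timedelta(minutes=5), fixed_now - timedelta(minutes=5)]
summary = triage_rows(sample, now=fixed_now)
```

A large "due" count with clean logs points at the missing-subscriber case rather than at scheduling.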

Fix. Register the subscriber, or adjust the producer's activate_* arg if the future date was unintentional.

Reference. Subscriber, Router § Gotcha: walking every subscriber, Timers.

Idle dispatch latency > max_fetch_interval

Symptom. Rows arrive but take up to max_fetch_interval (default 10 s) to dispatch, even though no other rows are in flight. NOTIFY should short-circuit the idle wait to ~10 ms.

Likely cause. LISTEN setup failed at subscriber start. The raw asyncpg connection that owns LISTEN outbox_<table> is separate from the SQLAlchemy fetch connection; common failure modes are: the asyncpg driver isn't installed (no [asyncpg] extra), the engine URL is not asyncpg, or the Postgres user lacks LISTEN permission.

Diagnose. A connection or permission failure (asyncpg.connect / add_listener raising) logs a WARNING once at startup noting the NOTIFY fallback to polling. A missing asyncpg driver or a non-asyncpg engine URL falls back silently — there is no log line, so check the engine URL (drivername must be postgresql+asyncpg) and that the [asyncpg] extra is installed.
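
Because the silent fallback leaves no log line, a startup check can make the two preconditions explicit. This is a sketch, not a library feature: the helper name is hypothetical, and it splits the URL string by hand to stay standard-library-only (SQLAlchemy's make_url would give you the drivername more robustly).

```python
import importlib.util


def listen_preconditions(engine_url):
    """Report whether the two silent-fallback conditions are satisfied."""
    # The drivername is everything before "://", e.g. "postgresql+asyncpg".
    drivername = engine_url.split("://", 1)[0]
    return {
        "asyncpg_url": drivername == "postgresql+asyncpg",
        # find_spec returns None when the module cannot be imported.
        "asyncpg_installed": importlib.util.find_spec("asyncpg") is not None,
    }
```

If either value is False, expect polling at max_fetch_interval rather than NOTIFY-driven dispatch.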

Fix. Install the [asyncpg] extra and use an asyncpg-driven engine URL (postgresql+asyncpg://...). Restart the subscriber.

Reference. Installation § Optional extras, How it works § Fetch loop.

Subscriber dispatch never starts; rows pile up

Symptom. Rows are published but never dispatched (the table grows) and the subscriber's loops emit repeating reconnect ERROR logs. broker.start() (or the FastAPI include_router lifespan) itself returns normally — it only schedules the loop tasks, so the failure shows up after startup, not as a hang.

Likely cause. SQLAlchemy pool exhausted on the per-worker writer connection checkout — the fetch/worker loops can't acquire their connections, so each cycle errors and backs off. Each subscriber needs max_workers + 1 pool connections; the default pool is pool_size=5, max_overflow=10. A handful of single-worker subscribers fits, but a fleet of high-max_workers subscribers does not.

Diagnose. Inspect the engine pool. Sum (max_workers + 1) over every registered subscriber and compare the total to pool_size + max_overflow.

Fix. Raise pool_size / max_overflow on the engine, OR lower max_workers per subscriber. Also confirm Postgres max_connections ≥ replicas × Σ (max_workers + 2), summed over all subscribers (the pool's max_workers + 1 plus the raw LISTEN connection for each). Rolling deploys multiply the demand, since old and new replicas can run side by side.
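
The connection budget is simple arithmetic, sketched here as plain functions. The function names are hypothetical; the formulas and the pool_size=5 / max_overflow=10 defaults are the ones stated above.

```python
def pool_demand(max_workers_per_sub):
    """Pool connections needed per process: sum of (max_workers + 1)."""
    return sum(workers + 1 for workers in max_workers_per_sub)


def pool_fits(max_workers_per_sub, pool_size=5, max_overflow=10):
    """True when the demand fits inside pool_size + max_overflow."""
    return pool_demand(max_workers_per_sub) <= pool_size + max_overflow


def postgres_demand(max_workers_per_sub, replicas):
    """Server-side connections: replicas x sum of (max_workers + 2).

    The extra connection per subscriber is the raw LISTEN connection.
    """
    return replicas * sum(workers + 2 for workers in max_workers_per_sub)
```

For example, three single-worker subscribers need 6 pool connections and fit the default pool, while two subscribers with max_workers=8 need 18 and do not.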

Reference. Subscriber § Connection budget, Production checklist § Sizing.

Duplicate handler invocations

Symptom. The same outbox row's handler runs more than once. Side effects double up if the handler isn't idempotent.

Likely cause. Either the handler's wall-clock duration exceeded lease_ttl_seconds and another fetch reclaimed the row mid-flight, or the worker crashed between the handler's external side effect and the terminal DELETE. Both are at-least-once-delivery edge cases.

Diagnose. Cross-reference handler-side logs (the side effect) with event=lease_lost logs. Matching row IDs confirm TTL is too short. Crash-induced duplicates correlate with worker-process restarts.
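
The cross-reference is a set intersection over row IDs. A minimal sketch, assuming both log streams carry a row_id=<value> field; that field name is hypothetical, so substitute whatever identifier your handler and broker logs actually share.

```python
import re

# Hypothetical field name; adjust to match your log format.
ROW_ID = re.compile(r"\brow_id=(\S+)")


def row_ids(lines, must_contain=None):
    """Collect row IDs from log lines, optionally filtered by a substring."""
    ids = set()
    for line in lines:
        if must_contain is not None and must_contain not in line:
            continue
        match = ROW_ID.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def ttl_suspects(handler_lines, broker_lines):
    """Row IDs that produced a side effect and also lost their lease."""
    return row_ids(handler_lines) & row_ids(broker_lines, "event=lease_lost")
```

Duplicated rows that do not appear in the result are candidates for the crash-induced case instead.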

Fix. Two layers: (a) make handlers idempotent — this is a contract of the outbox pattern, not a knob, and (b) tune lease_ttl_seconds above handler P99 so healthy handlers don't race their lease.
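
Layer (a) usually comes down to remembering which messages have already been applied. The sketch below illustrates the idea with an in-memory set and a hypothetical message_id argument; it is not the library's API. In a real deployment the "seen" record would need to be durable and, ideally, written in the same transaction as the side effect, otherwise a crash between the two still leaves a duplicate window.

```python
class IdempotentHandler:
    """Wrap a side effect so a redelivered message is applied only once."""

    def __init__(self, side_effect):
        self._side_effect = side_effect
        # In-memory for illustration only; use durable storage in practice.
        self._seen = set()

    def __call__(self, message_id, payload):
        """Return True if the side effect ran, False if it was skipped."""
        if message_id in self._seen:
            return False
        self._side_effect(payload)
        self._seen.add(message_id)
        return True
```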

Reference. How it works § At-least-once delivery, Subscriber § Slow handlers — dedicated queue.

Rolling deploy leaks rows

Symptom. During a rolling restart, outbox rows are left in the "acquired" state until lease expiry, even though handlers were nominally healthy. Drain duration appears longer than expected.

Likely cause. Either the broker's graceful_timeout is shorter than the in-flight handler's remaining work, or Kubernetes terminationGracePeriodSeconds is shorter than the broker's graceful_timeout × parallel-drain factor, so SIGKILL arrives mid-drain.

Diagnose. Time a clean shutdown locally (docker compose kill -s SIGTERM application) and compare to your k8s grace period. Look for log lines indicating drain abandonment.

Fix. Raise graceful_timeout past handler P99 + margin. Raise terminationGracePeriodSeconds past graceful_timeout + buffer for parallel-subscriber drain. The dispatch_one shutdown-race guard is always on; you don't need to opt into it.
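
The two inequalities can be checked together before a deploy. A minimal sketch with hypothetical names; the margin and buffer defaults are placeholders, not recommended values.

```python
def drain_budget_problems(
    handler_p99_s,
    graceful_timeout_s,
    k8s_grace_s,
    margin_s=5.0,
    buffer_s=10.0,
):
    """Return a list of violated drain-timing constraints (empty if none)."""
    problems = []
    if graceful_timeout_s < handler_p99_s + margin_s:
        problems.append("graceful_timeout below handler P99 + margin")
    if k8s_grace_s < graceful_timeout_s + buffer_s:
        problems.append(
            "terminationGracePeriodSeconds below graceful_timeout + buffer"
        )
    return problems
```

With a 20-second handler P99, a 30-second graceful_timeout, and a 30-second Kubernetes grace period, only the second constraint is reported.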

Reference. Production checklist § Drain & lifecycle.

activate_in / activate_at fires immediately in tests

Symptom. A unit test publishes a row with activate_in set to 30 seconds, and the handler runs synchronously inside await broker.publish(...).

Likely cause. By design. TestOutboxBroker(run_loops=False) (the default) drives handlers synchronously through dispatch_one, which ignores next_attempt_at. This is the documented test-broker contract: it trades production parity for test ergonomics.

Diagnose. Check the call site: TestOutboxBroker(broker) → sync mode, expected immediate firing.

Fix. Opt into TestOutboxBroker(broker, run_loops=True) for tests that need scheduled delivery to actually wait. Loop mode runs the real _fetch_loop / _worker_loop against the fake client.

Reference. Testing § Loop-driven mode, Timers § Test broker note.

AckPolicy.ACK_FIRST raises ValueError at registration

Symptom. @broker.subscriber("q", ack_policy=AckPolicy.ACK_FIRST) fails with ValueError at decoration time.

Likely cause. By design. ACK_FIRST would delete the outbox row before the handler runs, so a handler crash would silently drop the message — exactly the failure mode the outbox pattern exists to prevent.

Diagnose. None needed; the message identifies the policy.

Fix. Use the default AckPolicy.NACK_ON_ERROR (retry on handler exception via the configured retry strategy), or AckPolicy.REJECT_ON_ERROR (delete on first failure), or AckPolicy.MANUAL (handler calls ack / nack / reject).

Reference. Subscriber § Ack policy.

OutboxResponse(...) + foreign-publisher decorator gets nacked

Symptom. A handler with both @kafka_pub and an OutboxResponse(...) return value gets nacked on every dispatch, with a _OutboxConfigError logged.

Likely cause. By design. The combination would both insert a row into the outbox and publish to Kafka — a dual-fire that doubles delivery. The subscriber refuses the chain composition by raising _OutboxConfigError via the process_message override; it rides the normal nack path so the row is retried (and logged) until the configuration is fixed.

Diagnose. Inspect the handler decorator stack and return type.

Fix. Pick one path. Either return the plain body (the foreign publisher picks it up) or return OutboxResponse(body, queue="...", session=...) (an outbox-internal chain), but not both.

Reference. Relay § What not to do, Publisher § Chained publishing.

A chained OutboxResponse row's handler keeps retrying after the handler "succeeded"

Symptom. A handler that returns OutboxResponse(...) completes its own logic, yet the inbound row keeps nacking/retrying (and may DLQ as retry_terminal), with an exception that's about the publish, not the handler's work.

Likely cause. The follow-on OutboxResponse row is published after the handler returns, inside the same consume scope — so a failure there (e.g. a DB error on the follow-on insert) unwinds through the AcknowledgementMiddleware and nacks the inbound row. There is currently no distinct signal separating "handler OK, relay-publish failed" from an ordinary handler exception (F5-03): the metric reads as a normal nacked_retried/retry_terminal, and the ERROR log shows the publish exception rather than a handler one. (The most common trigger — header propagation re-encoding conflicts — was removed in #85.)

Diagnose. Read the logged exception: a sqlalchemy/asyncpg error or an envelope ValueError naming content-type/correlation_id points at the relay publish, not the handler body.

Fix. Resolve the underlying publish failure (schema/connection for the follow-on insert; drop conflicting headers). For non-idempotent chains, pass a deterministic timer_id so a redelivery's insert is a no-op.
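
A deterministic timer_id has to depend only on the inbound message, so that every redelivery computes the same value. One possible derivation is a name-based UUID, sketched below. The helper name, the key layout, and the assumption that timer_id accepts a string are all hypothetical; check the accepted type against the Publisher reference.

```python
import uuid


def chain_timer_id(target_queue, inbound_id):
    """Derive a stable identifier from the target queue and inbound row ID.

    uuid5 is name-based: the same inputs always give the same UUID.
    """
    name = f"outbox-chain:{target_queue}:{inbound_id}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))
```

Because the value is a pure function of its inputs, a retried handler passes the same timer_id on every attempt.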

Reference. Publisher § Chained publishing.

validate_schema() raises ImportError

Symptom. Calling await broker.validate_schema() raises ImportError("requires alembic").

Likely cause. The [validate] extra isn't installed. Alembic is an optional dependency by design — every other code path works without it, but the schema validator delegates to Alembic's autogenerate.compare_metadata and so requires it.

Diagnose. pip show alembic returns nothing, or pip list | grep alembic is empty.

Fix. pip install 'faststream-outbox[validate]'. The validator runs unchanged after that; nothing else in the package needs to change.

Reference. Schema validation.