Production checklist
A scannable list of pre-launch checks. Each item is one to two lines and links to the existing reference page that owns the full story.
Sizing
- [ ] Engine pool ≥ `Σ subs × (max_workers + 1)`: every subscriber holds `max_workers + 1` SQLAlchemy pool connections (one writer per worker plus one fetch) and, outside the pool, one raw asyncpg connection for `LISTEN`. Sub-budget formula in Subscriber § Connection budget.
- [ ] Postgres `max_connections` ≥ `replicas × Σ subs × (max_workers + 2)`: `max_workers + 1` pool connections plus the raw asyncpg `LISTEN` connection per subscriber. The formula is per-process, and rolling deploys multiply it because old and new pods hold connections at the same time. Failure mode: pods refuse with `FATAL: too many connections`.
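The two sizing formulas above can be worked through as plain arithmetic. The sketch below is only an illustration: `SubscriberSpec` and both helper functions are hypothetical names, not part of the library, and the subscriber counts are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubscriberSpec:
    """Hypothetical container for one subscriber's sizing input."""
    name: str
    max_workers: int


def engine_pool_size(subs: list[SubscriberSpec]) -> int:
    """Pool connections per process: max_workers + 1 for each subscriber."""
    return sum(s.max_workers + 1 for s in subs)


def postgres_connections(subs: list[SubscriberSpec], replicas: int) -> int:
    """Connections across all replicas: pool share plus one raw LISTEN
    connection per subscriber, i.e. max_workers + 2 each."""
    return replicas * sum(s.max_workers + 2 for s in subs)


subs = [
    SubscriberSpec("orders", max_workers=4),
    SubscriberSpec("emails", max_workers=2),
]
pool = engine_pool_size(subs)          # (4 + 1) + (2 + 1) = 8
total = postgres_connections(subs, 3)  # 3 * ((4 + 2) + (2 + 2)) = 30
```

During a rolling deploy, pass the peak number of coexisting processes as `replicas` rather than the steady-state count.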
Subscribers
- [ ] `lease_ttl_seconds` > handler P99 with margin; otherwise healthy in-flight handlers race their own lease expiry. The lease cutoff is computed server-side with `make_interval(...)`, so it is immune to clock skew. Tuning: Subscriber § Slow handlers — dedicated queue.
- [ ] Slow handlers segregated onto their own subscriber with a higher `lease_ttl_seconds`. Don't raise it globally: that delays reclaim of genuinely stuck rows everywhere.
- [ ] `max_deliveries` set (or knowingly unbounded). Default is unbounded; pair with a non-`NoRetry()` retry strategy or wedge-prone handlers can replay forever.
- [ ] Retry strategy chosen. The default `ExponentialRetry(initial_delay_seconds=1.0, multiplier=2.0, max_delay_seconds=300.0, max_attempts=10, jitter_factor=0.2)` is fine for most services. Opt into `NoRetry()` explicitly for an audit feed.
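To judge whether the default retry parameters suit a service, it can help to see roughly how long a failing row stays in play. The sketch below assumes the usual capped exponential backoff, `min(initial × multiplier^n, max)`, with one delay per attempt, and ignores jitter. `nominal_delays` is a hypothetical helper, and the library's actual schedule may differ.

```python
def nominal_delays(
    initial_delay_seconds: float = 1.0,
    multiplier: float = 2.0,
    max_delay_seconds: float = 300.0,
    max_attempts: int = 10,
) -> list[float]:
    """Pre-jitter delays under an ASSUMED capped exponential backoff."""
    return [
        min(initial_delay_seconds * multiplier ** attempt, max_delay_seconds)
        for attempt in range(max_attempts)
    ]


delays = nominal_delays()
# Doubles from 1 s up to 256 s; the tenth delay (512 s) is capped at 300 s.
total_seconds = sum(delays)  # 811.0 under these assumptions
```

Under these assumptions a row that always fails is retried over roughly 13.5 minutes before it goes terminal, which is a useful number to compare against alerting windows.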
DLQ
- [ ] `dlq_table=` configured: opt-in, but recommended for any service where terminal failures need forensic recovery. See Dead-letter queue.
- [ ] Alert on divergence between the `nacked_terminal` rate and `dlq_written`: persistent divergence means either DLQ schema drift (the CTE rolls back) or `lease_ttl_seconds` too low. See DLQ § Metric: dlq_written.
- [ ] DLQ retention plan. Partition by `failed_at` and cron-drop old partitions, or use a simple `DELETE … WHERE failed_at < interval` cron for low volume. Walk-through: Alembic migrations § DLQ retention via partition drop.
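One way to express "persistent divergence" as an alert rule is to compare per-interval counter deltas and fire only after several consecutive mismatches. The sketch below is an assumption about how the two metrics might be sampled; `persistent_divergence` and the `min_consecutive` threshold are hypothetical and not part of the library.

```python
def persistent_divergence(
    samples: list[tuple[int, int]],
    min_consecutive: int = 3,
) -> bool:
    """Return True when nacked_terminal and dlq_written deltas disagree
    for at least `min_consecutive` scrape intervals in a row.

    Each sample is (nacked_terminal_delta, dlq_written_delta) for one
    interval; how the deltas are collected is left to the metrics stack.
    """
    streak = 0
    for nacked_terminal, dlq_written in samples:
        # A single mismatch can be scrape skew, so only a streak counts.
        streak = streak + 1 if nacked_terminal != dlq_written else 0
        if streak >= min_consecutive:
            return True
    return False


transient = [(2, 2), (3, 1), (1, 1), (4, 0)]
sustained = [(1, 1), (2, 0), (1, 0), (3, 0)]
```

Here `transient` would not alert, because each mismatch is followed by agreement or the series ends, while `sustained` would.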
Drain & lifecycle
- [ ] `graceful_timeout` ≥ handler P99 + margin; otherwise `OutboxSubscriber.stop()` cancels in-flight work and rows are reclaimed mid-handler.
- [ ] Kubernetes `terminationGracePeriodSeconds` ≥ broker `graceful_timeout`, with margin for the parallel-subscriber drain. The broker gathers subscriber drains in parallel, but k8s `SIGKILL`s after the grace period regardless.
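The two drain inequalities chain together: handler P99 sits under `graceful_timeout`, which sits under `terminationGracePeriodSeconds`. A minimal sketch of a pre-deploy check follows. `check_drain_budget` and the 5-second default margin are hypothetical choices, not values taken from the library.

```python
def check_drain_budget(
    handler_p99_s: float,
    graceful_timeout_s: float,
    termination_grace_period_s: float,
    margin_s: float = 5.0,
) -> list[str]:
    """Return human-readable problems; an empty list means the chain holds."""
    problems = []
    if graceful_timeout_s < handler_p99_s + margin_s:
        problems.append(
            "graceful_timeout is below handler P99 + margin: "
            "stop() may cancel in-flight work"
        )
    if termination_grace_period_s < graceful_timeout_s + margin_s:
        problems.append(
            "terminationGracePeriodSeconds is below graceful_timeout + margin: "
            "SIGKILL may arrive before the drain finishes"
        )
    return problems


ok = check_drain_budget(12.0, 20.0, 30.0)
bad = check_drain_budget(12.0, 15.0, 18.0)
```

A check like this could run in CI against the deployment manifest and the measured P99, so a timeout change in one place does not silently break the other.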
Schema
- [ ] `/health` calls `validate_schema()`: opt-in; requires the `[validate]` extra. Do not call it at `broker.start()`, since that would crash-loop on a pending migration. See Schema validation § Where to call it.
- [ ] Outbox `table_name` short enough for the NOTIFY channel: the channel name is `outbox_<table_name>`, and `make_outbox_table` raises `ValueError` at table-build time when that exceeds Postgres' 63-byte identifier limit. There is no silent truncation or polling fallback; the guard makes an over-long name impossible to ship.
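The channel-name limit can be checked before a table name is ever chosen. The sketch below mirrors the rule described above (prefix `outbox_`, 63-byte limit) but is not the library's own guard; `notify_channel` and `channel_fits` are hypothetical helpers.

```python
POSTGRES_MAX_IDENTIFIER_BYTES = 63  # NAMEDATALEN (64) minus the terminator


def notify_channel(table_name: str) -> str:
    """Channel name as described in the checklist: outbox_<table_name>."""
    return f"outbox_{table_name}"


def channel_fits(table_name: str) -> bool:
    """The limit is in bytes, so measure the UTF-8 encoding, not len(str)."""
    encoded = notify_channel(table_name).encode("utf-8")
    return len(encoded) <= POSTGRES_MAX_IDENTIFIER_BYTES


# The 7-byte prefix leaves 56 bytes for the table name itself.
fits_ascii = channel_fits("a" * 56)      # 7 + 56 = 63 bytes
too_long_ascii = channel_fits("a" * 57)  # 7 + 57 = 64 bytes
fits_utf8 = channel_fits("é" * 28)       # 7 + 2 * 28 = 63 bytes
```

Counting bytes rather than characters matters for non-ASCII table names, where each character can take two or more bytes.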
Observability
- [ ] `metrics_recorder` set, native middleware registered, or both; the recommended setup is both. See Instrumentation seams § Layering.
- [ ] Alert on `lease_lost` rate: non-zero means `lease_ttl_seconds < handler P99` for at least one subscriber. See Troubleshooting § `event=lease_lost`.
- [ ] `LISTEN`/`NOTIFY` fallback understood: a connection or permission failure (`asyncpg.connect` / `add_listener` raising) logs a WARNING once and falls back to polling. A missing asyncpg driver or a non-asyncpg engine URL falls back silently (no log); diagnose those from the engine URL, not the logs. Either way the subscriber lives with up to `max_fetch_interval` of idle latency.
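Because the silent fallback leaves nothing in the logs, a startup-time check on the engine URL and driver can surface it. The sketch below assumes that "asyncpg engine URL" means the SQLAlchemy `postgresql+asyncpg://` scheme; the library's real detection logic may differ, and both function names are hypothetical.

```python
import importlib.util


def url_uses_asyncpg(engine_url: str) -> bool:
    """True when the URL scheme names the asyncpg dialect (assumed rule)."""
    scheme = engine_url.split("://", 1)[0]
    return scheme == "postgresql+asyncpg"


def asyncpg_installed() -> bool:
    """True when the asyncpg package can be imported in this environment."""
    return importlib.util.find_spec("asyncpg") is not None


def silent_polling_fallback_likely(engine_url: str) -> bool:
    """Flags the two silent cases from the checklist: wrong URL or no driver."""
    return not (url_uses_asyncpg(engine_url) and asyncpg_installed())


asyncpg_url = url_uses_asyncpg("postgresql+asyncpg://app@db/outbox")
psycopg_url = url_uses_asyncpg("postgresql+psycopg://app@db/outbox")
```

Logging the result of `silent_polling_fallback_likely` once at startup gives the silent cases the same visibility as the WARNING-logged ones.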