Debugging Eventual Consistency With A Deterministic “Outbox Storm” Simulator
Written by
Elena Holos
I ran into a bug that felt haunted: data looked correct most of the time, then occasionally—usually right after a deploy or a load spike—it “snapped” into the wrong state for a few minutes. No errors in logs, no failed requests, just users seeing stale data long enough to file tickets.
What finally helped wasn’t a better dashboard. It was a mental model I could simulate: how an “outbox” (an event queue) can create an “outbox storm” that temporarily breaks the assumptions of the code that reads from a downstream store.
Below is a tiny deterministic simulator I built in Python that makes this failure mode obvious. I use it like a microscope: step through the system, watch the message backlog grow, and see how read-after-write assumptions collapse.
The mental model: “Outbox storms” happen when time is a participant
Many teams implement event-driven updates with an outbox: when you change a record in your primary database, you also store an “event message” in an outbox table in the same transaction. A background worker later forwards those outbox rows to a message bus (or directly to a consumer).
A simple (but often false) assumption is:
“After I write, the read model will be updated soon enough that my next read will be correct.”
Here’s the twist: the system has time dynamics. If the worker processing the outbox lags, the consumer reads will race against the backlog. That’s an outbox storm: the backlog grows faster than it drains, and downstream reads can observe an older world.
To see that clearly, I made a deterministic simulator.
The simulator: a primary write model + a delayed read model
I model three things:
- Primary store: the source of truth (what you write to).
- Outbox: event messages created on each write, waiting to be processed.
- Read model: a denormalized projection built from events, but with processing delay.
The system loop is simple:
- Each “tick” represents a unit of time.
- Writes generate outbox messages.
- Each tick, the worker processes a limited number of outbox messages (capacity).
- The consumer updates the read model when messages are processed.
- I also simulate “reads” that happen right after writes.
When outbox capacity drops for a few ticks, stale reads appear. That’s the outbox storm.
Working code (deterministic): step through the storm
from dataclasses import dataclass from typing import Dict, List, Tuple @dataclass class OutboxEvent: order_id: str new_status: str # message "created at" tick for visibility created_tick: int class OutboxStormSimulator: def __init__( self, worker_capacity_by_tick: Dict[int, int], tick_total: int, ): self.tick_total = tick_total # Primary source of truth: current status of each order self.primary: Dict[str, str] = {} # Outbox backlog: events waiting to be processed self.outbox: List[OutboxEvent] = [] # Downstream read model: status as of last processed event self.read_model: Dict[str, str] = {} # Worker capacity: how many outbox events can be processed per tick self.worker_capacity_by_tick = worker_capacity_by_tick # For debugging: record mismatches between primary and read_model self.mismatches: List[Tuple[int, str, str, str]] = [] def enqueue_write(self, tick: int, order_id: str, new_status: str) -> None: # This mimics the transactional behavior: primary write + outbox insert self.primary[order_id] = new_status self.outbox.append(OutboxEvent(order_id, new_status, tick)) def process_outbox(self, tick: int) -> None: # Worker processes up to capacity; remaining events stay in the backlog capacity = self.worker_capacity_by_tick.get(tick, 0) processed = 0 while processed < capacity and self.outbox: ev = self.outbox.pop(0) # Consumer updates the read model based on the event self.read_model[ev.order_id] = ev.new_status processed += 1 def simulate_read_after_write(self, tick: int, order_id: str) -> None: # A "read" checks what the UI sees (read model) vs source of truth (primary) primary_status = self.primary.get(order_id) read_status = self.read_model.get(order_id) # If read model hasn't caught up yet, read_status can be None or older if read_status != primary_status: self.mismatches.append((tick, order_id, primary_status, read_status)) def run(self, writes: List[Tuple[int, str, str]], reads: List[Tuple[int, str]]) -> None: # Pre-load writes/reads into per-tick buckets writes_by_tick: Dict[int, List[Tuple[str, str]]] = {} for t, order_id, status in writes: writes_by_tick.setdefault(t, []).append((order_id, status)) reads_by_tick: Dict[int, List[str]] = {} for t, order_id in reads: reads_by_tick.setdefault(t, []).append(order_id) for tick in range(self.tick_total + 1): # 1) Writes happen for order_id, status in writes_by_tick.get(tick, []): self.enqueue_write(tick, order_id, status) # 2) Worker processes outbox (possibly limited or stalled) self.process_outbox(tick) # 3) Reads happen (UI reads from read model) for order_id in reads_by_tick.get(tick, []): self.simulate_read_after_write(tick, order_id) # Optional: print a small trace for specific ticks # (kept minimal here, but still deterministic) if tick in (0, 1, 2, 3, 4, 5, 6, 7, 8): print( f"tick={tick} " f"primary={self.primary.get('A')} " f"read_model={self.read_model.get('A')} " f"outbox_backlog={len(self.outbox)}" ) def report(self) -> None: print("\n--- MISMATCHES (stale reads) ---") for tick, order_id, primary_status, read_status in self.mismatches: print( f"tick={tick} order={order_id} " f"primary={primary_status} read_model={read_status}" ) print(f"\nTotal mismatches: {len(self.mismatches)}") def demo_outbox_storm() -> None: # Capacity is high at first, then drops to near zero for a few ticks (the storm). # This simulates a deploy, GC pause, noisy neighbor, DB slowdown, etc. worker_capacity_by_tick = { 0: 2, 1: 2, 2: 0, # storm begins: worker can't keep up 3: 1, 4: 0, # storm persists 5: 3, # recovers 6: 3, 7: 3, 8: 3, } sim = OutboxStormSimulator(worker_capacity_by_tick=worker_capacity_by_tick, tick_total=8) # Writes: multiple status transitions for the same order A in quick succession. # Each write enqueues an outbox event. writes = [ (0, "A", "CREATED"), (1, "A", "PAID"), (2, "A", "PACKED"), (3, "A", "SHIPPED"), (4, "A", "DELIVERED"), ] # Reads: UI tries to read right after each write tick. reads = [ (0, "A"), (1, "A"), (2, "A"), (3, "A"), (4, "A"), # and one later check to show convergence (6, "A"), ] sim.run(writes=writes, reads=reads) sim.report() if __name__ == "__main__": demo_outbox_storm()
What each block is doing (and why it matters)
enqueue_write(...)updates the primary state and appends anOutboxEventto the outbox. This is the whole point of the outbox pattern: you don’t “fire and forget” an event outside the transaction.process_outbox(...)consumes outbox events at a fixed per-tick capacity. This is the “worker” behavior. When capacity drops, backlog grows.simulate_read_after_write(...)compares what the UI reads (read_model) to what the system wrote (primary). That mismatch is the concrete symptom.
Run it: watch stale reads appear during capacity collapse
When I run the demo, the trace shows something like this (exact values depend only on the deterministic script):
- At tick 0 and 1, the worker keeps up, so
read_modelmatchesprimary. - At tick 2, capacity is
0, so outbox events pile up. - Reads at tick 2/3/4 occur while the read model is behind.
- By tick 6, the worker catches up and the system converges.
That’s the mental model in action: the system’s behavior is governed by the queue dynamics between “write” and “projection update,” not just by code paths.
A concrete takeaway: the “correctness boundary” is queue health
This simulator made a counterintuitive thing feel obvious:
- Your write path can be perfectly correct.
- Your projection logic can be perfectly correct.
- And you can still serve wrong answers temporarily because your read model is governed by backlog.
In other words, the mental model shifts from:
“Is my code wrong?”
to:
“Is my system fast enough (end-to-end) to meet the timing assumptions of the UI?”
That’s why incident response often lands on worker throughput, consumer lag, and backlog growth rates—not just application exceptions.
Closing thoughts
I learned to treat eventual consistency like a system with time and queue dynamics, not just a messaging detail. By simulating an outbox storm deterministically, I can literally see how capacity collapse creates stale reads—even when every individual component is “working.”