At 19:01:58 UTC, our Valkey instance was OOM-killed by the kernel.
The pod's memory limit was 512 MiB, a low default carried over from initial setup, and Valkey itself had no eviction ceiling (`maxmemory`) configured. Under normal application growth, its working set therefore crept past the pod's limit before the configured LRU eviction policy ever had a chance to engage. Valkey came back ~1 second later, but because it is an in-memory store, all ephemeral state (including the snowflake node-ID leases held by every API pod) was wiped by the restart.
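The missing piece was a memory ceiling inside Valkey set safely below the pod limit, so eviction engages before the kernel does. A minimal sketch of the relevant directives (the 400 MiB figure is illustrative, not our actual setting; it just needs headroom under the 512 MiB pod limit):

```conf
# Cap Valkey's own usage below the pod's memory limit so the
# eviction policy fires before the kernel OOM-killer does.
maxmemory 400mb

# Without a maxmemory cap, this policy never engages at all.
maxmemory-policy allkeys-lru
```

With no `maxmemory` set, Valkey's default behavior is to grow unbounded, which is why the pod limit was the first ceiling actually hit.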
Starting around 19:04:30 UTC — roughly the renewal interval after the restart — every API pod tried to renew its lease, found it gone, and treated the loss as terminal instead of recoverable. From that point until mitigation, 100% of message sends and webhook deliveries returned 500 across the fleet.
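The client-side fix is to treat a vanished lease as a recoverable condition: if renewal finds no lease, fall back to acquiring a fresh one instead of failing permanently. A minimal sketch of that pattern, with a toy in-memory stand-in for Valkey (the `LeaseStore`, `acquire`, and `renew` names are hypothetical, not our actual client):

```python
class LeaseStore:
    """Stand-in for Valkey: restart() wipes all state, like the incident."""

    def __init__(self):
        self.leases = {}

    def restart(self):
        # In-memory store: everything is gone after a restart.
        self.leases.clear()

    def acquire(self, pod_id):
        # Toy node-ID assignment; real allocation would be atomic.
        node_id = len(self.leases)
        self.leases[pod_id] = node_id
        return node_id

    def renew(self, pod_id):
        # Returns False once the store has restarted and the lease is gone.
        return pod_id in self.leases


def renew_or_reacquire(store, pod_id):
    # Incident behavior: treat a failed renew as terminal and return 500s.
    # Recoverable behavior: fall back to acquiring a fresh lease.
    if store.renew(pod_id):
        return "renewed"
    store.acquire(pod_id)
    return "reacquired"


store = LeaseStore()
store.acquire("api-pod-0")
store.restart()  # simulates the OOM-kill restart wiping all leases
print(renew_or_reacquire(store, "api-pod-0"))  # -> reacquired
```

Under this pattern, the 19:04:30 renewal wave would have re-acquired leases within one renewal interval instead of turning a ~1-second store restart into a fleet-wide outage.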
We mitigated at ~19:09 UTC by rolling both API deployments, which let pods acquire fresh leases on startup. Total user-visible impact: ~5 minutes of fleet-wide message-send failures.