At 19:01:58 UTC, our Valkey instance was OOM-killed by the kernel.
The pod's memory limit was 512 MiB, a low default carried over from initial setup, and Valkey itself had no eviction ceiling (`maxmemory`) configured. Under normal application growth, its working set therefore crept past the pod's limit before the configured LRU eviction policy ever had a chance to engage. Valkey came back ~1 second later, but because it is an in-memory store, all ephemeral state (including the snowflake node-ID leases held by every API pod) was wiped by the restart.
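The missing piece was a memory ceiling inside Valkey set safely below the pod limit, so eviction engages before the kernel does. A minimal sketch of the relevant directives (the 400 MiB figure is illustrative, not our actual setting; it just needs headroom under the 512 MiB pod limit):

```conf
# Cap Valkey's own usage below the pod's memory limit so the
# eviction policy fires before the kernel OOM-killer does.
maxmemory 400mb

# Without a maxmemory cap, this policy never engages at all.
maxmemory-policy allkeys-lru
```

With no `maxmemory` set, Valkey's default behavior is to grow unbounded, which is why the pod limit was the first ceiling actually hit.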
Starting around 19:04:30 UTC — roughly the renewal interval after the restart — every API pod tried to renew its lease, found it gone, and treated the loss as terminal instead of recoverable. From that point until mitigation, 100% of message sends and webhook deliveries returned 500 across the fleet.
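The client-side fix is to treat a vanished lease as a recoverable condition: if renewal finds no lease, fall back to acquiring a fresh one instead of failing permanently. A minimal sketch of that pattern, with a toy in-memory stand-in for Valkey (the `LeaseStore`, `acquire`, and `renew` names are hypothetical, not our actual client):

```python
class LeaseStore:
    """Stand-in for Valkey: restart() wipes all state, like the incident."""

    def __init__(self):
        self.leases = {}

    def restart(self):
        # In-memory store: everything is gone after a restart.
        self.leases.clear()

    def acquire(self, pod_id):
        # Toy node-ID assignment; real allocation would be atomic.
        node_id = len(self.leases)
        self.leases[pod_id] = node_id
        return node_id

    def renew(self, pod_id):
        # Returns False once the store has restarted and the lease is gone.
        return pod_id in self.leases


def renew_or_reacquire(store, pod_id):
    # Incident behavior: treat a failed renew as terminal and return 500s.
    # Recoverable behavior: fall back to acquiring a fresh lease.
    if store.renew(pod_id):
        return "renewed"
    store.acquire(pod_id)
    return "reacquired"


store = LeaseStore()
store.acquire("api-pod-0")
store.restart()  # simulates the OOM-kill restart wiping all leases
print(renew_or_reacquire(store, "api-pod-0"))  # -> reacquired
```

Under this pattern, the 19:04:30 renewal wave would have re-acquired leases within one renewal interval instead of turning a ~1-second store restart into a fleet-wide outage.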
We mitigated at ~19:09 UTC by rolling both API deployments, which let pods acquire fresh leases on startup. Total user-visible impact: ~5 minutes of fleet-wide message-send failures.