Fluxer - Issues with sending messages – Incident details


Issues with sending messages

Resolved
Major outage
Started 1 day ago · Lasted 4 minutes

Affected

Fluxer API (api.fluxer.app)

Major outage from 7:07 PM to 7:11 PM

Updates
  • Postmortem

    At 19:01:58 UTC, our Valkey instance was OOM-killed by the kernel.

    The pod's memory limit was set to 512 MiB, a low default carried over from initial setup, and Valkey itself had no `maxmemory` eviction ceiling configured. Under normal application growth, its working set crept past the pod's limit before the configured LRU eviction policy ever had a chance to engage. Valkey came back ~1 second later, but because it is an in-memory store, all ephemeral state (including the snowflake node-ID leases held by every API pod) was wiped on restart.
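    The guardrail that was missing can be sketched as a small config fragment. The specific values below are illustrative, not Fluxer's production settings; the point is that `maxmemory` must sit below the pod's limit so eviction fires before the kernel OOM-killer does:

    ```conf
    # valkey.conf (illustrative values)
    # Cap Valkey's own usage comfortably below the pod's 512 MiB limit,
    # leaving headroom for allocator overhead and client buffers.
    maxmemory 384mb
    # Evict least-recently-used keys once the cap is reached, instead of
    # growing until the kernel kills the process.
    maxmemory-policy allkeys-lru
    ```

    With no `maxmemory` set, the eviction policy is irrelevant: the process simply grows until the container runtime enforces the pod limit.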

    Starting around 19:04:30 UTC — roughly the renewal interval after the restart — every API pod tried to renew its lease, found it gone, and treated the loss as terminal instead of recoverable. From that point until mitigation, 100% of message sends and webhook deliveries returned 500 across the fleet.
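    The recoverable handling the postmortem implies can be sketched as follows. `FakeStore`, `renew_or_reacquire`, and the key layout are illustrative stand-ins, not Fluxer's actual client or schema; the store fake only mimics SET-with-NX/EX semantics:

    ```python
    import time

    class FakeStore:
        """In-memory stand-in for a Valkey-like store with TTLs (illustrative)."""
        def __init__(self):
            self._data = {}  # key -> (value, expires_at)

        def get(self, key):
            entry = self._data.get(key)
            if entry and entry[1] > time.monotonic():
                return entry[0]
            return None

        def set_ex(self, key, value, ttl):
            # Unconditional set with expiry (like SET key value EX ttl).
            self._data[key] = (value, time.monotonic() + ttl)

        def set_nx_ex(self, key, value, ttl):
            # Set only if absent (like SET key value NX EX ttl).
            if self.get(key) is not None:
                return False
            self.set_ex(key, value, ttl)
            return True

        def flush(self):
            self._data.clear()  # simulates the restart wiping all ephemeral state

    def renew_or_reacquire(store, key, pod_id, ttl=30):
        """One iteration of a lease-renewal loop that treats loss as recoverable.

        The failure mode described above was to treat a vanished lease as
        terminal; re-acquiring with NX instead lets pods heal on their own.
        """
        owner = store.get(key)
        if owner == pod_id:
            store.set_ex(key, pod_id, ttl)   # normal renewal: extend the TTL
            return True
        if owner is None:
            # Lease gone (store restarted or TTL lapsed): claim it again.
            return store.set_nx_ex(key, pod_id, ttl)
        return False  # held by another pod; caller must pick a different node ID
    ```

    With logic like this, the 19:04:30 renewal pass would have re-acquired the wiped leases instead of failing until a redeploy.
    
    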

    We mitigated at ~19:09 UTC by rolling both API deployments, which let pods acquire fresh leases on startup. Total user-visible impact: ~5 minutes of fleet-wide message-send failures.

  • Resolved
    This incident has been resolved.
  • Investigating
    We are currently investigating this incident.