Between 2026-04-19 06:16 UTC and 11:44 UTC (~5h 28m), the public API returned errors on every Cassandra-backed request. A Cassandra container log grew unbounded (no Docker log rotation) and filled the disk on a database node, which tripped Cassandra's disk_failure_policy=stop and took the database offline.
We truncated the log, restarted Cassandra, rolled the API pods, then added Docker log rotation, silenced the noisy logger, and installed a daily cron that truncates any runaway container log as a safety net.
We apologise for how long this took to remediate. We're setting up alerting for this class of failure that bypasses do-not-disturb so it can't happen again.