Affected
- Major outage: 7:13 AM to 2:12 PM
- Degraded performance: 2:12 PM to 4:39 PM
- Under maintenance: 4:39 PM to 12:00 AM
- Update
Now awake again and working through the remaining checklist before takeoff.
- Update
Giving exact ETAs has turned out to be not such a good idea so far, but there isn't much left to do, and the two people working on this migration will need to catch up on some missed sleep!
- Investigating
And right as I posted that update, it went down again.
We're going to limit access to the app and show a clear message while we finish the migration (a rough sketch of what that gate looks like is below). These outages are pulling time away from the migration work, and the team working on it is extremely small. Right now it's a single person: me.
We do have one new hire though, and are working towards expanding the team!
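For the technically curious, "limiting access" typically means a maintenance-mode gate in front of the app. Here is a minimal sketch assuming an Express-style server; the flag, route, and message are illustrative, not our real middleware:

```typescript
// Minimal sketch of a maintenance-mode gate (hypothetical, Express-style).
// While the flag is on, most requests get a clear 503 message instead of
// hitting overloaded backends; health checks still pass.
import express from "express";

const app = express();
const MAINTENANCE = process.env.MAINTENANCE_MODE === "1";

app.use((req, res, next) => {
  if (MAINTENANCE && req.path !== "/health") {
    res
      .status(503)
      .set("Retry-After", "3600") // hint clients to come back later
      .json({ message: "We're migrating to new infrastructure. Back soon!" });
    return;
  }
  next();
});

app.get("/health", (_req, res) => {
  res.send("ok");
});

app.listen(3000);
```

Rejecting requests early and cheaply like this keeps the overloaded backends free to serve the migration work instead of a flood of half-failing traffic.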
A massive surge of new signups is also overwhelming the single production server we are working to migrate away from. That server is currently serving about 120,000 users, all of whom arrived in just two weeks.
If everything goes as expected, we'll be back up at 10 AM UTC on Saturday, better than before and fully on the new environment. That means wider voice server coverage in Johannesburg, Mumbai, São Paulo, Sydney, Tokyo, Miami, Dallas, Madrid, Frankfurt, Nuremberg, Stockholm, and more to come, plus improved anti-abuse and platform moderation tools to fight spam and raids, and more!
I can also reveal that we've got a surprise for all existing Plutonium and lifetime Visionary users, and for all non-paying users too, as soon as everything is back up and running.
Thanks for your patience, and have an awesome weekend!
- Resolved
We're really sorry about the downtime!
Things are going to get better soon, but right now we have to keep two worlds alive at the same time. We have to maintain the old production environment that is already overloaded, and we also have to keep working through issues in the new environment we're trying to move everyone to.
The hard part is that a lot of people want back in all at once, and most things are still running on that old environment. That creates the classic thundering herd effect. Requests pile up, some time out, clients retry, the retries add even more load, and it can spiral into downtime across multiple layers of the stack.
We've had to tweak a lot of things to blunt that surge and stop the negative loop. These are well-documented symptoms of systems that scale fast; Discord saw the same thing in its early days.
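To make that concrete: one standard way to blunt a retry storm is capped exponential backoff with full jitter, so clients that failed at the same moment don't all retry at the same moment. This is a sketch of the general technique, not our actual client code; the helper name and parameters are illustrative:

```typescript
// Sketch: capped exponential backoff with full jitter.
// Instead of retrying immediately (which amplifies the surge),
// each client waits a randomized, growing delay between attempts,
// spreading the retry load out over time.
async function fetchWithBackoff(
  url: string,
  maxAttempts = 5,
  baseDelayMs = 500,
  maxDelayMs = 30_000,
): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const res = await fetch(url);
      if (res.ok) return res;
      // 5xx and 429 are worth retrying; other 4xx are not.
      if (res.status < 500 && res.status !== 429) return res;
    } catch {
      // Network error: fall through to the backoff below.
    }
    // Full jitter: uniform random delay in [0, capped exponential].
    const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
    const delay = Math.random() * cap;
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  throw new Error(`Gave up on ${url} after ${maxAttempts} attempts`);
}
```

The randomness matters as much as the growth: without jitter, every client that failed together retries together, and the spike just comes back on a schedule.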
We honestly did not expect to be operating at this scale so quickly, and we are an extremely small team. It is basically a single person driving the core work (with one new hire just today!). We're trying to do better <3
- Identified
The API has been brought back online. However, the real-time Gateway is being slammed with requests to bring your communities back online. We have identified the source of the slowness and are working on unclogging the queue.
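To illustrate what "unclogging the queue" means in general terms (a hypothetical sketch, not our actual Gateway code): the backlog of sessions waiting to resume can be drained at a fixed rate, so the Gateway admits them steadily instead of all at once.

```typescript
// Illustrative sketch: drain a backlog of pending reconnects at a
// fixed rate, so a flood of waiting sessions is admitted steadily
// instead of all at once.
type ReconnectJob = () => Promise<void>;

class ReconnectQueue {
  private backlog: ReconnectJob[] = [];
  private timer?: ReturnType<typeof setInterval>;

  constructor(private jobsPerSecond: number) {}

  enqueue(job: ReconnectJob): void {
    this.backlog.push(job);
    this.start();
  }

  private start(): void {
    if (this.timer) return;
    this.timer = setInterval(() => {
      const job = this.backlog.shift();
      if (!job) {
        // Backlog empty: stop ticking until the next enqueue.
        clearInterval(this.timer!);
        this.timer = undefined;
        return;
      }
      job().catch(() => {
        // In a real system: re-enqueue with backoff or alert.
      });
    }, 1000 / this.jobsPerSecond);
  }
}

// Example: admit at most 50 session resumes per second.
const queue = new ReconnectQueue(50);
```

Capping the admission rate trades a longer tail of reconnects for a Gateway that stays responsive the whole time, which is usually the right trade during recovery.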
- Update
- Monitoring
We implemented a fix to get everyone back in and are currently monitoring the result; you may experience some missing servers and instability.
- Identified
We are currently working on resolving the elevated error rates on the API.