Fluxer - Platform access is temporarily limited – Incident details

System Under Maintenance

Platform access is temporarily limited

Investigating
Degraded performance
Started 1 day ago

Affected

Fluxer API (api.fluxer.app)

Major outage from 7:13 AM to 2:12 PM, Degraded performance from 2:12 PM to 4:39 PM, Under maintenance from 4:39 PM to 12:00 AM

Updates
  • Update

    Now awake again and working through the remaining checklist before takeoff.

  • Update

    Giving exact ETAs has turned out not to be such a good idea so far, but there isn't much left to do, and the two people working on this migration will need to catch up on some missed sleep!

  • Investigating

    And right as I posted that update, it went down again.

    We’re going to limit access to the app and show a clear message while we finish the migration. These outages are pulling time away from the migration work, and the team working on it is extremely small. Right now it is a single person, me.

    We do have one new hire though, and are working towards expanding the team!

    A massive surge of new signups is also overwhelming the single production server we are working to migrate away from. That server is currently serving about 120,000 users, and we received those users in just two weeks.

    If everything goes as expected, things will be back up and better at 10 AM UTC on Saturday, fully on the new environment. That means wider voice server coverage in Johannesburg, Mumbai, São Paulo, Sydney, Tokyo, Miami, Dallas, Madrid, Frankfurt, Nuremberg, Stockholm, and more to come, plus improved anti-abuse and platform moderation tools to fight spam and raids, and more!

    I can also reveal that we've got a surprise for all pre-existing Plutonium and lifetime Visionary users, and all non-paying users too, as soon as everything is back and running.

    Thanks for your patience, and have an awesome weekend!

  • Resolved

    We're really sorry about the downtime!

    Things are going to get better soon, but right now we have to keep two worlds alive at the same time. We have to maintain the old production environment that is already overloaded, and we also have to keep working through issues in the new environment we're trying to move everyone to.

    The hard part is that a lot of people want back in all at once, and most things are still running on that old environment. That creates the classic thundering herd effect. Requests pile up, some time out, clients retry, the retries add even more load, and it can spiral into downtime across multiple layers of the stack.

    We've had to tweak a lot of things to blunt that surge and break the negative feedback loop. These are the well-documented symptoms you see in systems that scale fast, including Discord in its early days.
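A common client-side mitigation for the retry spiral described above is exponential backoff with jitter, so that retries spread out instead of arriving in synchronized waves. This is only an illustrative sketch (the function names and parameters are hypothetical, not Fluxer's actual code):

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff: pick a random delay in
    [0, min(cap, base * 2**attempt)] so clients don't retry in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(request, max_attempts=5):
    """Call `request()`, retrying on timeout with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return request()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(backoff_delay(attempt))
```

Without the jitter, every client that timed out at the same moment would retry at the same moment too, recreating the original spike.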

    We honestly did not expect to be operating at this scale so quickly, and we are an extremely small team. It is basically a single person driving the core work (with one new hire just today!). We're trying to do better <3

  • Identified

    The API has been brought back online. However, the real-time Gateway is being slammed with requests to bring your communities back online. We have identified the source of the slowness and are working on unclogging the queue.
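One standard way to keep a reconnect stampede from overwhelming a gateway is to rate-limit admissions, for example with a token bucket. A minimal sketch, purely illustrative and not Fluxer's actual gateway code:

```python
import time


class TokenBucket:
    """Admit at most `rate` reconnects per second, allowing short
    bursts up to `capacity`. Excess requests are rejected (and would
    retry later), so the queue drains instead of growing unbounded."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Reconnecting clients that get rejected back off and try again, which smooths the spike into a steady, sustainable drain rate.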

  • Update

  • Monitoring

    We implemented a fix to get everyone back in and are currently monitoring the result. You may experience some missing servers and instability.

  • Identified

    We are currently working on resolving the elevated error rates on the API.