Fluxer - Issues with connecting to the app – Incident details

Issues with connecting to the app

Resolved
Major outage
Started 1 day ago
Lasted about 1 hour

Affected

Fluxer Gateway (gateway.fluxer.app)

Degraded performance from 2:07 AM to 2:22 AM, Major outage from 2:22 AM to 3:09 AM, Degraded performance from 3:09 AM to 3:19 AM

Updates
  • Postmortem

    Fluxer is now running a fully distributed Erlang cluster of 16 gateway instances, spread across four physical machines. This should minimise the impact of individual node failures going forward. Thanks for flying Fluxer.

  • Resolved

    They said it couldn't be done.

  • Update
    We implemented a fix and are currently monitoring the result.
  • Update

    We identified issues with reconnecting to communities that yielded inconsistent state depending on which node in the cluster you're connected to. We're currently channeling our inner Joe Armstrong and the powers of the BEAM to rectify the situation as quickly as possible.

  • Update

    Recovery is progressing. There is light at the end of the tunnel!

  • Update

    The waves are doing their thing, and we'll soon be ready to reconnect people to communities in waves too. Did I mention that the gateway is now running at 16 replicas across 4 physical nodes?

  • Update

    We're now attempting to let everyone back in again in ~~waves~~.

  • Update

    We're changing strategy to prevent a thundering herd: we're forcing the cluster to settle down first, and once load returns to normal, we'll disconnect individual gateway sockets in waves at runtime instead.

  • Update

    We are now rolling the cluster to force clients to reconnect and recognise the new session rollout.

  • Update

    We have removed the taint on the Kubernetes worker node previously reserved for the single gateway replica, enabled clustering in the gateway deployment, scaled it out to 16 replicas across all four worker nodes, and rescheduled all stateless workloads in the cluster to balance things out. We are monitoring the result.

  • Update

    We're taking this opportunity to roll out our new gateway clustering system.

  • Monitoring
    We implemented a fix and are currently monitoring the result.
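
The updates above mention disconnecting gateway sockets in waves to avoid a thundering herd of simultaneous reconnects. As a rough illustration of that idea only, here is a minimal Python sketch. It is not Fluxer's implementation (the gateway runs on the BEAM), and every name in it (`plan_waves`, `disconnect_in_waves`, the `disconnect` callback, the wave size and pause values) is a hypothetical chosen for this example.

```python
import random
import time
from typing import Callable, List, Optional, Sequence


def plan_waves(
    session_ids: Sequence[str],
    wave_size: int,
    seed: Optional[int] = None,
) -> List[List[str]]:
    """Shuffle sessions and split them into waves of at most wave_size."""
    if wave_size <= 0:
        raise ValueError("wave_size must be positive")
    ids = list(session_ids)
    # Shuffling spreads each wave across users instead of following
    # whatever order the sessions happened to be listed in.
    random.Random(seed).shuffle(ids)
    return [ids[i:i + wave_size] for i in range(0, len(ids), wave_size)]


def disconnect_in_waves(
    session_ids: Sequence[str],
    wave_size: int,
    pause_seconds: float,
    disconnect: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
    seed: Optional[int] = None,
) -> int:
    """Disconnect sessions one wave at a time, pausing between waves.

    `disconnect` is a hypothetical callback that closes one socket.
    Returns the number of waves that were processed.
    """
    waves = plan_waves(session_ids, wave_size, seed)
    for index, wave in enumerate(waves):
        for session_id in wave:
            disconnect(session_id)
        # Pause between waves so reconnects arrive in batches,
        # but do not wait after the final wave.
        if index < len(waves) - 1:
            sleep(pause_seconds)
    return len(waves)


if __name__ == "__main__":
    sessions = [f"session-{n}" for n in range(10)]
    count = disconnect_in_waves(
        sessions,
        wave_size=4,
        pause_seconds=0.1,
        disconnect=lambda sid: print("closing", sid),
    )
    print("waves:", count)
```

In a sketch like this, the wave size and the pause between waves would be tuned against how quickly the cluster absorbs each batch of reconnects; the values used here are placeholders.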