Gen 3 devices were offline for nearly nine hours yesterday. The disruption started on Thursday, March 14, at 3:29pm Pacific Time and was fully resolved at 12:16am on March 15. Our alerting systems did not detect the issue; a member of the engineering team noticed the problem after about 40 minutes.
The root cause was an accidental change to a security group, made around 11:30am PT, that blocked new network connections between two specific clusters. Because the change affected only new connections, it did not immediately interrupt any running services. As services restarted for various reasons later in the day, their existing connections were closed and could not be reopened on startup.
Next steps:
Finish implementing the alert, already in progress, that detects this type of failure.
Manage the accidentally changed infrastructure as Terraform code so that similar changes are discovered faster and are easy to roll back.
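
Because this failure only affected new connections, a check that reuses a long-lived connection would not have caught it. As a rough illustration of the kind of probe such an alert could rely on, here is a minimal Python sketch that opens a fresh TCP connection on every run. It is not our actual alert; the function names and the default timeout are assumptions made for the example.

```python
import socket


def can_open_new_connection(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a brand-new TCP connection to host:port succeeds."""
    try:
        # create_connection performs a full TCP handshake, so it exercises
        # the same path a restarting service would use.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def failed_targets(
    targets: list[tuple[str, int]], timeout: float = 3.0
) -> list[tuple[str, int]]:
    """Return the targets that could not accept a new connection."""
    return [t for t in targets if not can_open_new_connection(t[0], t[1], timeout)]
```

Run on a schedule from one cluster against endpoints in the other, a non-empty result from failed_targets would be the signal to alert on, even while already-established connections still look healthy.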