3rd Generation devices can't connect to the cloud
Incident Report for Particle
Postmortem

For 8.5 hours yesterday Gen 3 devices were offline. The disruption started on Thursday March 14 at 3:29pm Pacific Time and fully resolved at 12:16am Mar 15. Our alerting systems did not detect the issue. A member of the engineering team detected the problem after about 40 minutes.

The root technical cause stemmed from an accidental change made to a security group around 11:30am PT that blocked new network connections between 2 specific clusters. Since the change only impacted new connections, it did not interrupt any of the existing services immediately. As services restarted for various reasons later in the day, network connections were closed and failed to reopen upon startup.

Next steps:

  • Finish implementing an alert to detect this type of failure that we had already in progress.

  • Manage the infrastructure that was accidentally changed as Terraform code so we have faster discovery and easy rollback in similar future incidents.

Posted 2 months ago. Mar 15, 2019 - 14:48 PDT

Resolved
This incident has been resolved.
Posted 2 months ago. Mar 15, 2019 - 00:02 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted 2 months ago. Mar 14, 2019 - 23:54 PDT
Identified
We've identified a low level internal networking problem. We are working toward applying a fix.
Posted 2 months ago. Mar 14, 2019 - 21:25 PDT
Investigating
The Device Service that handles our Gen 3 devices (Borons, Argons and Xenons) isn't accepting connections currently. Our on-call team is looking into the issue.
Posted 2 months ago. Mar 14, 2019 - 17:03 PDT
This incident affected: Device Service.