Between May 2 and May 8, approximately 500 devices (mostly with sleepy, low-power use cases) were unable to communicate with the Device Cloud. These devices were typically going to sleep before the Device OS could destroy their expired session keys. At Particle, we take reliability seriously, and we understand that when downtime happens, you and your team are negatively affected. We apologize for any adverse impact this incident may have caused you, and we strive to make sure the services you depend on are robust and fully operational.
As a result of our investigations and remediation, the platform is now robust to the failure mode that presented itself during this window, automatically and instantly healing any device that enters this state. Additionally, Device OS fixes are in progress to prevent devices from entering the failure mode in the first place.
On Thursday, May 2, a Redis cluster responsible for caching device session keys unexpectedly crashed, losing all data. For most devices, this triggered a handshake with the Device Cloud to establish a new session. However, unknown to us, approximately 500 devices never came back online.
On Friday, May 3, an enterprise customer reported to our customer success team that their webhooks were not firing as expected, and this issue was escalated to engineering.
On Saturday, May 4, it became clear that the enterprise support issue was not isolated to a single customer. In response, we immediately escalated the issue internally and involved significant engineering resources.
On Sunday, May 5, a large engineering meeting was held to distill what we knew and assign next steps.
What we knew was this: a small number of devices, usually with sleepy, low-power use cases, were continuing to use old session keys instead of handshaking, and the cloud could not decrypt their communications.
The only obvious way to force a handshake in this scenario was to remove power from the device. However, that would require a person physically near the device to take action, and if at all possible, we wanted to avoid that for our customers.
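The failure mode described above can be illustrated with a toy model. This is a sketch for illustration only, not the actual Device Cloud protocol or Device OS code: the class names, the reply strings, the `sleeps_before_reply` flag, and the use of an HMAC tag in place of real encryption are all assumptions made for the example.

```python
import hashlib
import hmac
import os


class Cloud:
    """Hypothetical cloud that keeps session keys only in a volatile cache."""

    def __init__(self):
        self.session_cache = {}  # device_id -> session key

    def handshake(self, device_id):
        # Establish a fresh session key and remember it in the cache.
        key = os.urandom(16)
        self.session_cache[device_id] = key
        return key

    def receive(self, device_id, payload, tag):
        key = self.session_cache.get(device_id)
        if key is None:
            # The cache has no key, so the message cannot be verified.
            return "unknown_session"
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        return "ok" if hmac.compare_digest(expected, tag) else "bad_tag"


class Device:
    """Hypothetical device; sleepy ones never process the cloud's reply."""

    def __init__(self, device_id, cloud, sleeps_before_reply):
        self.device_id = device_id
        self.cloud = cloud
        self.sleeps_before_reply = sleeps_before_reply
        self.key = cloud.handshake(device_id)

    def publish(self, payload):
        tag = hmac.new(self.key, payload, hashlib.sha256).digest()
        reply = self.cloud.receive(self.device_id, payload, tag)
        if reply != "ok" and not self.sleeps_before_reply:
            # Awake long enough to notice the rejection:
            # discard the stale key, handshake again, and resend.
            self.key = self.cloud.handshake(self.device_id)
            tag = hmac.new(self.key, payload, hashlib.sha256).digest()
            reply = self.cloud.receive(self.device_id, payload, tag)
        return reply


cloud = Cloud()
always_on = Device("always-on", cloud, sleeps_before_reply=False)
sleepy = Device("sleepy", cloud, sleeps_before_reply=True)

cloud.session_cache.clear()  # the cache loses all session keys

results = {d.device_id: d.publish(b"temp=21") for d in (always_on, sleepy)}
print(results)  # {'always-on': 'ok', 'sleepy': 'unknown_session'}
```

In this model the always-on device recovers on its own, while the sleepy device keeps reusing its stale key on every wake cycle, which is why something outside the device had to prompt the handshake.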
We brainstormed several outside-the-box solutions and queued up work to evaluate them. Our best hope involved sending malformed packets to devices. Key engineers worked late into the night trying everything they could think of.
On Monday, May 6, the consensus within engineering was that no remote communications had any effect on the impacted devices, though we continued to evaluate variations of packets in different combinations of device and cloud states. The engineering team devised a multi-pronged plan to improve both short- and long-term reliability using what we had learned from the incident. People around the company began trying to assess the extent of the problem so that we could communicate effectively with impacted customers. Approximately 500 devices were affected, only a few of which had elevated data usage.
Early Tuesday morning, May 7, the team identified a malformed datagram that might trigger a device handshake. This potential solution required a great deal of testing and refinement, which was performed over the course of the day. By late afternoon we had high confidence that we could heal devices remotely, we had tested on staging and deployed behind feature flags on low-risk production infrastructure, and we had a runbook for rolling out the fix safely to everyone on Wednesday.
By the evening of Wednesday, May 8, 480 devices had been remotely brought back online, and the flow of forced handshakes had slowed to a trickle.
Finally, on the afternoon of Wednesday, May 15, we deployed one more set of changes that helped heal a further subset of devices that had not been fixed the previous week.
For more information, see the full postmortem on our blog.