On Monday, 8 October 2018, Particle's Site Reliability Engineering (SRE) team responded to an incident that began around 7:03 a.m. PDT. The on-call engineer escalated the issue to the broader engineering team, and by 8:38 a.m. PDT service was fully restored. During this incident, many devices were unable to establish a durable cloud session, and many messages sent to the API could not be delivered to devices.
Normally in these situations the engineering team is able to restore service more quickly, but in this case a recent code deploy was initially suspected as the cause of the downtime. The team rolled back the deploy and spent several minutes observing the system, which delayed identification of the root cause. Once it was established that the code change was not responsible for the incident, the team quickly identified a critical piece of infrastructure that had become partially unresponsive. That machine was restarted, and service returned to normal.
This particular piece of infrastructure plays an important role in the routing of messages, and our typical metrics didn’t catch this partial failure mode. We are already in the process of rolling out new infrastructure to avoid this type of failure in the future, and adding metrics to catch this failure mode early.
At 7:03 a.m. PDT on Monday, 8 October 2018, part of the routing service became unresponsive but did not crash. Normally, when a service dies for any reason, the system immediately restarts it with no public impact. In this case, other services in the cloud continued to work, but messages sent to the unresponsive machine did not receive a response.
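To make that distinction concrete, here is a minimal, hypothetical sketch of a supervisor loop (not our actual process manager, and "routing-worker" is an illustrative name). It restarts a worker the moment the process exits, which is why a crash has no public impact, but it has no way to notice a process that is still alive yet no longer answering requests.

```go
// Illustrative sketch only: restart a worker process whenever it exits.
// A crash is handled automatically; a hang is invisible to this loop.
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	for {
		// "routing-worker" is a hypothetical binary name for this example.
		cmd := exec.Command("./routing-worker")
		if err := cmd.Start(); err != nil {
			log.Fatalf("failed to start worker: %v", err)
		}
		// Wait blocks until the process exits; a crash returns here and the
		// loop restarts the worker almost immediately.
		err := cmd.Wait()
		log.Printf("worker exited (%v); restarting in 1s", err)
		time.Sleep(time.Second)
		// If the worker hangs instead of exiting, Wait never returns, so the
		// supervisor sees a "healthy" process and takes no action.
	}
}
```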
At 7:22, automatic alerts started to trigger, indicating that one server’s CPU usage was higher than normal. Usually this alert is low urgency and serves as an “early warning” that we should consider taking preemptive action to prevent user-facing problems. At the same time, we received the first user reports of disconnected devices.
At 7:29, we received several alerts related to device connectivity. This was the first moment the on-call team realized something more serious might be happening, and a full investigation began.
We had deployed a code change (ironically, in service of our goal of scaling the routing architecture) a few hours before the incident, so initial efforts focused on rolling back to the previous version of the code. As the rollback progressed, we saw mixed signals: some systems seemed to improve while others did not.
At 7:54, the on-call team escalated the incident, and several of us joined a video chat and began posting updates to our public status page.
At 8:29, we discovered that the root cause was the partially unresponsive machine in the routing system. We immediately restarted it, and cloud metrics quickly improved.
After the routing system was fully operational, the engineering team continued to monitor the situation, and began writing a postmortem and identifying next steps.
To mitigate issues like these in the future, we will make two platform improvements in the coming weeks.
We identified a new class of alert to help detect a partial hardware or service failure of this nature. The new alert is meant to detect when certain metrics haven’t updated in a reasonable time window. We expect this new type of alert would have caught this failure and quickly pointed us to the root cause. We will be implementing and deploying it this week.
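The core idea behind this alert is a staleness check: instead of looking at metric values, it looks at how long it has been since a metric last updated. Below is a minimal, hypothetical sketch of that check; the metric names, the freshness window, and the map of last-update times are illustrative assumptions, not our actual monitoring configuration.

```go
// Illustrative sketch: flag metrics that have not updated within a
// freshness window -- the signature of a service that is up but hung.
package main

import (
	"fmt"
	"time"
)

// lastUpdated would be fed by the monitoring pipeline each time the
// routing service reports its metrics. Values here are examples.
var lastUpdated = map[string]time.Time{
	"routing.messages_processed": time.Now().Add(-10 * time.Minute),
	"routing.sessions_active":    time.Now().Add(-30 * time.Second),
}

// checkStaleness returns the metrics that have gone quiet for longer
// than the allowed window.
func checkStaleness(window time.Duration) []string {
	var stale []string
	for name, ts := range lastUpdated {
		if time.Since(ts) > window {
			stale = append(stale, name)
		}
	}
	return stale
}

func main() {
	if stale := checkStaleness(5 * time.Minute); len(stale) > 0 {
		// In production, this condition would page the on-call engineer.
		fmt.Println("ALERT: metrics have gone stale:", stale)
	}
}
```

A value-based alert (like the high-CPU warning we received) can lag or stay quiet during a partial failure; a staleness check fires precisely when a component stops reporting, which is what happened here.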
The same service failed (in a very different way) on March 20th of this year. We have made significant progress toward scaling and improving the message routing subsystem, and some important changes will be fully deployed within the next 30 days.
Once those changes are completely rolled out, the scope of impact of a similar failure in the future will be significantly reduced. In November we will begin the final phase of routing improvements, which will make this failure mode obsolete — we’re targeting completion in May 2019.