Messages not passing through cloud
Incident Report for Particle
Postmortem

Summary

On Tuesday, 20 March, 2018, Particle's "device broker" service was unavailable from 16:08 to 17:23 Pacific Time. During this period, while devices could connect to the Device Cloud, they could neither send nor receive messages to or from other services (such as the API and webhooks).

The high uptime and availability of the Particle Device Cloud is one of the key features we offer, and I apologize for any adverse effects you or your customers might have experienced as a result of this outage.

Every incident teaches us how we can improve the platform. I want to share with you the events that transpired, including some background, the root causes, and the immediate next steps we are taking to prevent similar downtime in the future.

The event

At 16:08 Pacific Time on Tuesday, 20 March, 2018, the AWS server hosting our device broker became unresponsive (likely due to hardware failure). The device broker is an internal service responsible for relaying messages between devices and other endpoints like integrations and the API. All other services were fully operational and unaffected by the hardware failure, but messages couldn't be sent to or from devices, and devices would have been reported as offline during this period (even though they were technically online, just unreachable).

Particle has alerting systems set up to inform us immediately of any issues like this. Unfortunately, our primary alerting system, hosted by InfluxData, was not operating as expected. As a result, we did not become aware of the issue until 16:41 PT (33 minutes after the server went down). At that point, our team sprang into action.

What we did

At 16:41 PT, the engineering team started a video conference with 5–7 engineers present at any given moment to collaborate, communicate, and address the issue. After diagnosing the root cause of the failure, we redeployed the device broker on new hardware at 17:04 PT. Events started trickling through the device broker at 17:06 PT, and it was fully operational at 17:23 PT.

After the service was fully operational, the engineering team continued monitoring the broker, assembling a postmortem, and planning mitigations.

Next steps

To mitigate issues like these in the future, there are two major platform improvements we will make over the coming weeks and months.

Improving system monitoring

As we've scaled the platform, downtime has become increasingly rare, and these days, when problems do arise, on-call engineers are typically able to address them in minutes.

An engineer would have been paged within seconds if our alerting system had been operating properly. We have a diverse portfolio of monitoring and alerting infrastructure, but we rely heavily on Kapacitor hosted by InfluxData. That hosted service has not lived up to our expectations of reliability. We are actively working with InfluxData to improve both our systems and theirs.

Additionally, we will diversify our alerting infrastructure, duplicating some high value alerting pipelines across multiple providers, enabling redundancy that insulates us from failures like this one.
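To illustrate the idea of duplicated alerting pipelines, here is a minimal sketch of a "dead-man" alert evaluated independently by multiple providers. This is a hypothetical simplification, not our actual alerting code: each provider expects a periodic heartbeat from a service, fires when heartbeats stop arriving, and the on-call engineer is paged if any one provider fires, so a single provider outage cannot blind us.

```python
def deadman_fired(last_seen_s, now_s, threshold_s=60):
    """A dead-man alert fires when no heartbeat has arrived within the threshold."""
    return now_s - last_seen_s > threshold_s


def page_on_call(per_provider_last_seen, now_s, threshold_s=60):
    """With the pipeline duplicated across providers, page if ANY provider fires.

    per_provider_last_seen maps a provider name to the timestamp (in seconds)
    of the last heartbeat that provider observed from the monitored service.
    """
    return any(
        deadman_fired(last_seen, now_s, threshold_s)
        for last_seen in per_provider_last_seen.values()
    )
```

The key property is the `any()`: redundancy means the pipelines are ORed together, so losing one alerting provider degrades nothing as long as another still sees (or stops seeing) the heartbeat.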

Highly available broker

The device broker is one of the longest-lived and most stable pieces of the Device Cloud. It is a highly optimized and reliable service, almost unchanged since the Spark Core days in 2013. Because it has been so rock solid, for years we didn't prioritize making it horizontally scalable. Architecturally, it is the last piece of the Device Cloud that remains a single point of failure, and no amount of software reliability can protect it from hardware failures.

Earlier this year, thanks to the success and rapid scaling of our customers, we decided to prioritize the architectural overhaul of the device broker to add redundancy and make it horizontally scalable. This work has been actively in progress for the last couple of months, and this incident only deepens our commitment to have a highly available broker deployed in production this summer.
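The core of horizontal scalability is removing the single point of failure by spreading device sessions across several broker instances. The sketch below is a hypothetical simplification of that idea (it is not Particle's actual design, and it uses plain modulo hashing rather than a production-grade scheme like consistent hashing): each device is deterministically mapped to one of the healthy brokers, so losing a host affects only a fraction of devices, and those devices reconnect to the survivors.

```python
import hashlib

def pick_broker(device_id, healthy_brokers):
    """Deterministically map a device to one of the healthy broker instances.

    healthy_brokers is an ordered list of instance names; when a host fails,
    it is removed from the list and its devices reconnect elsewhere.
    """
    if not healthy_brokers:
        raise RuntimeError("no healthy brokers available")
    digest = hashlib.sha256(device_id.encode("utf-8")).hexdigest()
    return healthy_brokers[int(digest, 16) % len(healthy_brokers)]
```

With a single broker this function degenerates to today's architecture (every device maps to the one instance); with N instances, a hardware failure like this one would have taken down roughly 1/N of device traffic instead of all of it. A real deployment would also need session handoff and a scheme that minimizes remapping when membership changes, which is what makes this an architectural overhaul rather than a quick fix.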

Conclusion

In closing, let me talk about some of Particle's core values: caring, curiosity, openness, and trust. We care about our customers and our product, and we're deeply sorry for any problems this incident may have caused you. We are providing this postmortem in a spirit of openness because we want you to trust us with your IoT future — we're doing everything in our power to prevent and mitigate the severity of future failures. And last, but certainly not least, we are curious to hear your feedback at community.particle.io and support.particle.io.

Zachary Crockett
Particle Founder & CTO

Mar 21, 2018 - 19:02 PDT

Resolved
All systems are stable.
Mar 20, 2018 - 18:50 PDT
Monitoring
Everything looks OK, but we're continuing to actively monitor the system.
Mar 20, 2018 - 17:27 PDT
Update
Some messages are beginning to go through.
Mar 20, 2018 - 17:10 PDT
Update
We currently suspect a hardware failure in AWS us-east. We are spinning up new infrastructure.
Mar 20, 2018 - 17:04 PDT
Investigating
While devices and API calls are working externally, messages are not being passed through the cloud. This also means webhooks are not being sent. The team is responding immediately.
Mar 20, 2018 - 16:57 PDT