On October 1, a subset of Particle integrations stopped firing because of a failure on one of the servers that handles those integrations. As a result, a portion of messages using webhooks were not forwarded on to third-party servers. Since this was a partial service disruption, different customers were affected differently; most customers experienced no negative impact, while others experienced a significant outage.
We take platform reliability extremely seriously at Particle, as any downtime is a failure to deliver on our promise of reliable, scalable, and secure messaging between IoT devices and the web. The approach that we are taking in response to this incident is a broad internal reliability review; we want to address both the root cause of this particular issue and root out any other problems that might bite us down the road.
This post starts with a postmortem of this particular issue, but expands into a broader discussion of reliability practices at Particle, both existing practices and new ones that we’ve developed as a response to this incident. My goal is to show that we are taking these incidents seriously and are taking action to significantly improve the reliability and resilience of our platform to ensure that we can meet your high expectations.
Particle provides four basic communication primitives to support communication to and from devices. Two of these communication primitives are designed for request/response communications: you can send messages to a device by remotely calling a function or retrieving a variable. The other two primitives are designed for pub/sub communications; you can publish a message from a device, and you can subscribe to receive published messages.
Published messages from devices can be routed through multiple channels. You can subscribe to these messages from other devices; you can subscribe to message topics through our API; our Device Management Console consumes published messages for display; and you can route published messages onto other web services through our integrations with popular systems like Azure and Google Cloud or through webhooks that allow you to transform these messages and send them on to other servers.
When a published message is sent to another server through a webhook, the message is routed through a series of “microservices”, each of which is designed to scale independently. The offending service, the “Integration Worker”, is responsible for handling messages that will be sent through webhooks and third-party integrations, and inserts those messages into a queue for dissemination.
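In code, the worker's role can be sketched in a few lines. This is a hedged, in-memory illustration of the responsibility described above, not Particle's implementation; the event fields, the `has_webhook` flag, and the local queue are all assumptions standing in for the real integration-matching logic and the real dissemination queue.

```python
from queue import Queue

# Stand-in for the dissemination queue the worker feeds.
outbound: Queue = Queue()

def handle_published_event(event: dict) -> None:
    # Only events that match a configured webhook or integration
    # are placed on the outbound queue for delivery.
    if event.get("has_webhook"):
        outbound.put(event)

handle_published_event({"name": "temp", "data": "21.5", "has_webhook": True})
handle_published_event({"name": "debug", "data": "x", "has_webhook": False})
print(outbound.qsize())  # → 1
```

The key property for what follows is that the worker sits *between* the devices and the queue: if it dies, messages arrive from upstream but never reach the queue.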
On October 1, 2020, at 10:32pm PDT (5:32am UTC), one of the containers for the “Integration Worker” service failed, disrupting the flow of messages for integrations for a subset of customers.
Usually, when a container fails catastrophically, it is because it has run out of a resource like RAM or compute. As a result, we monitor the use of those resources on all of our servers to ensure that we scale up and distribute load across different servers appropriately.
In this case, the worker failed for a different reason: it ran out of a Linux kernel resource known as “file descriptors”. In Linux, everything is a file, including network connections. A server can be configured to allow more or fewer files to be opened simultaneously. Our Site Reliability Engineering team has raised this limit on most of our high-traffic servers to ensure that they do not run out of file descriptors. However, this particular server was still at the default limit of 65,535.
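For readers unfamiliar with this limit: a process can inspect and, up to its hard limit, raise its own file-descriptor ceiling. The sketch below uses Python's standard `resource` module to show the mechanism; it is illustrative only, and the values you see depend on your system's configuration.

```python
import resource

# Query this process's file-descriptor limits (RLIMIT_NOFILE).
# "soft" is the ceiling currently enforced; "hard" is the most the
# soft limit can be raised to without elevated privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# An unprivileged process may raise its soft limit up to the hard limit;
# raising the hard limit itself requires root (e.g. via limits.conf
# or the service manager's configuration).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

A server under heavy connection load exhausts this limit exactly the way it would exhaust RAM, except that file descriptors are rarely on anyone's monitoring dashboard by default, which is why this failure mode went unwatched.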
When this server hit its file descriptor limit, the integration worker crashed and stopped inserting messages into the SQS queue. Because this service sits upstream of the queueing service, those messages were not recoverable.
Particle has monitoring and automated alerting in place on all of our web services to ensure that our Site Reliability Engineering (SRE) team is notified immediately whenever there is an issue. An alert was triggered at 5:40am UTC – 8 minutes after the incident began – that a service had crashed. However, this particular alert was incorrectly set to be a “non-paging” alert – in other words, an alert that posts in Slack rather than waking up the on-call engineer. Because the alert fired at night and was non-paging, our U.S.-based SRE team did not see the alert.
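The paging vs. non-paging distinction can be made concrete with a tiny routing sketch. The severity names and channels below are illustrative assumptions, not Particle's actual alerting configuration; the point is that the routing decision is a single, easily misconfigured mapping.

```python
# Hypothetical severity-to-channel mapping for an alerting system.
SEVERITY_ROUTES = {
    "critical": "pagerduty",  # pages (wakes) the on-call engineer
    "warning": "slack",       # posts to a channel; nobody is woken up
}

def route_alert(severity: str) -> str:
    # Route unknown severities to paging by default, so a
    # misclassified alert fails loud rather than silent.
    return SEVERITY_ROUTES.get(severity, "pagerduty")

print(route_alert("warning"))   # → slack
print(route_alert("critical"))  # → pagerduty
```

In this incident, the service-crash alert was effectively mapped to the `slack` route when it should have been on the paging route, and because it fired overnight, no one saw it.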
The services upstream and downstream from the integration worker have additional alerting in place to ensure that traffic is flowing. The upstream microservices were happily passing along messages from devices; the downstream microservice was happily sending messages along to third-party servers. Because the “Integration Worker” failure was only a partial failure, where one of many containers failed, traffic dipped, but not enough to trigger the downstream alert on traffic flow.
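The dip described above can be sketched numerically. The baseline, threshold, and container count below are illustrative assumptions, not Particle's real values; they simply show why a one-of-N container failure can stay under a coarse traffic-flow alert.

```python
# Hypothetical numbers: a 10k msg/min baseline and an alert that only
# fires when traffic falls below 50% of that baseline.
BASELINE_MSGS_PER_MIN = 10_000
ALERT_THRESHOLD = 0.5

def should_alert(observed: int) -> bool:
    return observed < BASELINE_MSGS_PER_MIN * ALERT_THRESHOLD

# One of, say, eight worker containers fails: traffic dips by ~1/8.
observed = int(BASELINE_MSGS_PER_MIN * (1 - 1 / 8))
print(should_alert(observed))  # → False: the dip stays above the threshold
```

A threshold loose enough to ignore normal traffic fluctuation is, by the same arithmetic, loose enough to ignore a single failed container.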
Our east coast support team became aware of the issue early in the morning when they saw a forum post reporting it. At 4:17am PDT (11:17am UTC), the tech support team paged the Site Reliability Engineering team, and work began to address the issue. At 5:05am PDT (12:05pm UTC), the service was scaled up and was operating as expected.
Here is a detailed timeline of the incident:
Our first goal is to ensure reliable delivery of messages between IoT devices and the web. In order to deliver on this promise, two things must be true: our web services must be resilient, so that messages are still delivered even if any particular container or service fails; and we have to have proper alerting set up so that if anything goes wrong we can step in with corrective action immediately.
Our first failure here was on resilience; the service in question did not recover gracefully. Services will occasionally crash, and while we work to avoid crashes, reliable web services are reliable not because they never crash but because when individual containers or servers crash the customer doesn’t notice.
Our second failure here was on alerting; this failure slipped between a crack in our alerting infrastructure. While we had alerting in place, this particular failure mode wasn’t caught. As a result, our response to the issue was slower than it should have been.
We have begun two separate lines of work in response to this incident. First, we will address the lack of resilience and alerting in this particular service, ensuring that integrations traffic is delivered reliably to its destination. Second, we are making a broader set of investments in the reliability of our entire platform. That work has already begun and is paying dividends, but there is more to be done. Below I will describe the work that has already been completed as well as the scope of work still ahead of us.
Fixing this particular issue
As soon as this incident happened, the offending service was scaled up and alerting was put in place to detect this particular failure mode. It is no longer possible for this service to fail without us knowing about it.
We have also addressed the root cause of the failure, increasing the file descriptor limit both for this particular service and across our entire global infrastructure. Not only will this ensure that this particular failure mode doesn’t reoccur, it will ensure that no other service will crash for the same reason in the future.
Over the next two weeks, we will be rearchitecting the data pipeline for integrations to be more resilient to failure. We will deploy a redundant set of integration workers to ensure that even if one worker fails it will not result in data loss while the on-call team brings the failed worker back online. We expect to complete this work by October 22.
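The redundancy plan above can be sketched from the producer's side: rather than a single worker being a single point of loss, the sender tries the next worker when one fails. This is a hedged, simplified illustration; the worker interface, the `ConnectionError` failure mode, and the local list standing in for SQS are all assumptions.

```python
from typing import Callable, Sequence

def enqueue_with_failover(message: str,
                          workers: Sequence[Callable[[str], None]]) -> bool:
    """Try each worker in turn; return True once any accepts the message."""
    for enqueue in workers:
        try:
            enqueue(message)
            return True
        except ConnectionError:
            continue  # this worker is down; fall through to the next one
    return False      # every worker failed; the caller must buffer and retry

def broken_worker(_msg: str) -> None:
    # Simulates the failed container from this incident.
    raise ConnectionError("worker container crashed")

delivered: list = []
ok = enqueue_with_failover("event:temp", [broken_worker, delivered.append])
print(ok, delivered)  # → True ['event:temp']
```

The essential change is that a single worker failure degrades capacity instead of dropping messages, buying the on-call team time to restore the failed worker.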
Broader investments in reliability
Throughout 2020 we have made substantial investments in the reliability and resilience of a number of aspects of our platform, and that work has already borne fruit. Here are some of the improvements we have made:
However, as this incident shows, our work is clearly not done. Our efforts have been focused specifically on device connectivity; after all, if the device isn’t connected, there’s not much else we can do. With this event, we will be expanding the scope of that reliability work to cover all of the mission-critical services that we provide.
We are committed to making the following improvements to our platform and our processes:
Thank you for entrusting us with the infrastructure that powers your products and business. We understand that the reliability of our platform is critical to our customers’ businesses, and we do not take that responsibility lightly.
My team and I have taken this issue as a very clear call to action: we need to deliver on reliability above all else, and we need to prove to you, our customers, that we can deliver on our promise. We have made great strides over the course of 2020 but there is more work to be done, and we will not rest until that work is complete.
Particle founder and CEO