On October 1, a subset of Particle integrations stopped firing because of a failure on one of the servers that handles those integrations. As a result, a portion of messages using webhooks were not forwarded on to third-party servers. Since this was a partial service disruption, different customers were affected differently; most customers experienced no negative impact, while others experienced a significant outage.
We take platform reliability extremely seriously at Particle, as any downtime is a failure to deliver on our promise of reliable, scalable, and secure messaging between IoT devices and the web. The approach that we are taking in response to this incident is a broad internal reliability review; we want to address both the root cause of this particular issue and root out any other problems that might bite us down the road.
This post starts with a postmortem of this particular issue, but expands into a broader discussion of reliability practices at Particle, both existing practices and new ones that we’ve developed as a response to this incident. My goal is to show that we are taking these incidents seriously and are taking action to significantly improve the reliability and resilience of our platform to ensure that we can meet your high expectations.
Particle provides four basic communication primitives to support communication to and from devices. Two of these communication primitives are designed for request/response communications: you can send messages to a device by remotely calling a function or retrieving a variable. The other two primitives are designed for pub/sub communications; you can publish a message from a device, and you can subscribe to receive published messages.
Published messages from devices can be routed through multiple channels. You can subscribe to these messages from other devices; you can subscribe to message topics through our API; our Device Management Console consumes published messages for display; and you can route published messages onto other web services through our integrations with popular systems like Azure and Google Cloud or through webhooks that allow you to transform these messages and send them on to other servers.
When a published message is sent to another server through a webhook, the message is routed through a series of “microservices”, each of which is designed to scale independently. The offending service, the “Integration Worker”, is responsible for handling messages that will be sent through webhooks and third-party integrations, and inserts those messages into a queue for dissemination.
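In code, the worker's role can be sketched in a few lines. This is a hedged, in-memory illustration of the responsibility described above, not Particle's implementation; the event fields, the `has_webhook` flag, and the local queue are all assumptions standing in for the real integration-matching logic and the real dissemination queue.

```python
from queue import Queue

# Stand-in for the dissemination queue the worker feeds.
outbound: Queue = Queue()

def handle_published_event(event: dict) -> None:
    # Only events that match a configured webhook or integration
    # are placed on the outbound queue for delivery.
    if event.get("has_webhook"):
        outbound.put(event)

handle_published_event({"name": "temp", "data": "21.5", "has_webhook": True})
handle_published_event({"name": "debug", "data": "x", "has_webhook": False})
print(outbound.qsize())  # → 1
```

The key property for what follows is that the worker sits *between* the devices and the queue: if it dies, messages arrive from upstream but never reach the queue.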
On October 1, 2020, at 10:32pm PDT (5:32am UTC), one of the containers for the “Integration Worker” service failed, disrupting the flow of messages for integrations for a subset of customers.
Usually, when a container fails catastrophically, it is because it has run out of a resource like RAM or compute. As a result, we monitor the use of those resources on all of our servers to ensure that we scale up and distribute load across different servers appropriately.
In this case, the worker failed for a different reason: it ran out of a Linux kernel resource known as “file descriptors”. In Linux, everything is a file, including network connections. A server can be configured to allow more or fewer files to be opened simultaneously. Our Site Reliability Engineering team has raised this limit on most of our high-traffic servers to ensure that they do not run out of file descriptors. However, this particular server was still at the default limit of 65,535.
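For readers unfamiliar with this limit: a process can inspect and, up to its hard limit, raise its own file-descriptor ceiling. The sketch below uses Python's standard `resource` module to show the mechanism; it is illustrative only, and the values you see depend on your system's configuration.

```python
import resource

# Query this process's file-descriptor limits (RLIMIT_NOFILE).
# "soft" is the ceiling currently enforced; "hard" is the most the
# soft limit can be raised to without elevated privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# An unprivileged process may raise its soft limit up to the hard limit;
# raising the hard limit itself requires root (e.g. via limits.conf
# or the service manager's configuration).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

A server under heavy connection load exhausts this limit exactly the way it would exhaust RAM, except that file descriptors are rarely on anyone's monitoring dashboard by default, which is why this failure mode went unwatched.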
When this server hit its file descriptor limit, the integration worker crashed and stopped inserting messages into the SQS queue. Because this service sits upstream of the queueing service, those messages were not recoverable.
Particle has monitoring and automated alerting in place on all of our web services to ensure that our Site Reliability Engineering (SRE) team is notified immediately whenever there is an issue. An alert was triggered at 5:40am UTC – 8 minutes after the incident began – that a service had crashed. However, this particular alert was incorrectly set to be a “non-paging” alert – in other words, an alert that posts in Slack rather than waking up the on-call engineer. Because the alert fired at night and was non-paging, our U.S.-based SRE team did not see the alert.
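The paging vs. non-paging distinction can be made concrete with a tiny routing sketch. The severity names and channels below are illustrative assumptions, not Particle's actual alerting configuration; the point is that the routing decision is a single, easily misconfigured mapping.

```python
# Hypothetical severity-to-channel mapping for an alerting system.
SEVERITY_ROUTES = {
    "critical": "pagerduty",  # pages (wakes) the on-call engineer
    "warning": "slack",       # posts to a channel; nobody is woken up
}

def route_alert(severity: str) -> str:
    # Route unknown severities to paging by default, so a
    # misclassified alert fails loud rather than silent.
    return SEVERITY_ROUTES.get(severity, "pagerduty")

print(route_alert("warning"))   # → slack
print(route_alert("critical"))  # → pagerduty
```

In this incident, the service-crash alert was effectively mapped to the `slack` route when it should have been on the paging route, and because it fired overnight, no one saw it.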
The services upstream and downstream from the integration worker have additional alerting in place to ensure that traffic is flowing. The upstream microservices were happily passing along messages from devices; the downstream microservice was happily sending messages along to third-party servers. Because the “Integration Worker” failure was only a partial failure, where one of many containers failed, traffic dipped, but not enough to trigger the downstream alert on traffic flow.
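The dip described above can be sketched numerically. The baseline, threshold, and container count below are illustrative assumptions, not Particle's real values; they simply show why a one-of-N container failure can stay under a coarse traffic-flow alert.

```python
# Hypothetical numbers: a 10k msg/min baseline and an alert that only
# fires when traffic falls below 50% of that baseline.
BASELINE_MSGS_PER_MIN = 10_000
ALERT_THRESHOLD = 0.5

def should_alert(observed: int) -> bool:
    return observed < BASELINE_MSGS_PER_MIN * ALERT_THRESHOLD

# One of, say, eight worker containers fails: traffic dips by ~1/8.
observed = int(BASELINE_MSGS_PER_MIN * (1 - 1 / 8))
print(should_alert(observed))  # → False: the dip stays above the threshold
```

A threshold loose enough to ignore normal traffic fluctuation is, by the same arithmetic, loose enough to ignore a single failed container.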
Our east coast support team became aware of the issue early in the morning when they saw a forum post reporting it. At 4:17am PDT (11:17am UTC), the tech support team paged the Site Reliability Engineering team, and work began to address the issue. At 5:05am PDT (12:05pm UTC), the service was scaled up and was operating as expected.
Here is a detailed timeline of the incident:
Our first goal is to ensure reliable delivery of messages between IoT devices and the web. In order to deliver on this promise, two things must be true: our web services must be resilient, so that messages are still delivered even if any particular container or service fails; and we have to have proper alerting set up so that if anything goes wrong we can step in with corrective action immediately.
Our first failure here was on resilience; the service in question did not recover gracefully. Services will occasionally crash, and while we work to avoid crashes, reliable web services are reliable not because they never crash but because when individual containers or servers crash the customer doesn’t notice.
Our second failure here was on alerting; this failure slipped between a crack in our alerting infrastructure. While we had alerting in place, this particular failure mode wasn’t caught. As a result, our response to the issue was slower than it should have been.
We have begun two separate lines of work in response to this incident. First, we will address the lack of resilience and alerting in this particular service, ensuring that integrations traffic is delivered reliably to its destination. Second, we are making a broader set of investments in the reliability of our entire platform. That work has already begun and is paying dividends, but there is more to be done. Below I will describe the work that has already been completed as well as the scope of work still ahead of us.
Fixing this particular issue
As soon as this incident happened, the offending service was scaled up and alerting was put in place to detect this particular failure mode. It is no longer possible for this service to fail without us knowing about it.
We have also addressed the root cause of the failure, increasing the file descriptor limit both for this particular service and across our entire global infrastructure. Not only will this ensure that this particular failure mode doesn’t reoccur, it will ensure that no other service will crash for the same reason in the future.
Over the next two weeks, we will be rearchitecting the data pipeline for integrations to be more resilient to failure. We will deploy a redundant set of integration workers to ensure that even if one worker fails it will not result in data loss while the on-call team brings the failed worker back online. We expect to complete this work by October 22.
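The redundancy plan above can be sketched from the producer's side: rather than a single worker being a single point of loss, the sender tries the next worker when one fails. This is a hedged, simplified illustration; the worker interface, the `ConnectionError` failure mode, and the local list standing in for SQS are all assumptions.

```python
from typing import Callable, Sequence

def enqueue_with_failover(message: str,
                          workers: Sequence[Callable[[str], None]]) -> bool:
    """Try each worker in turn; return True once any accepts the message."""
    for enqueue in workers:
        try:
            enqueue(message)
            return True
        except ConnectionError:
            continue  # this worker is down; fall through to the next one
    return False      # every worker failed; the caller must buffer and retry

def broken_worker(_msg: str) -> None:
    # Simulates the failed container from this incident.
    raise ConnectionError("worker container crashed")

delivered: list = []
ok = enqueue_with_failover("event:temp", [broken_worker, delivered.append])
print(ok, delivered)  # → True ['event:temp']
```

The essential change is that a single worker failure degrades capacity instead of dropping messages, buying the on-call team time to restore the failed worker.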
Broader investments in reliability
Throughout 2020 we have made substantial investments in the reliability and resilience of a number of aspects of our platform, and that work has already borne fruit. Here are some of the improvements we have made:
However, as this incident shows, our work is clearly not done. Our efforts have been focused specifically on device connectivity; after all, if the device isn’t connected, there’s not much else we can do. With this event, we will be expanding the scope of that reliability work to cover all of the mission-critical services that we provide.
We are committed to making the following improvements to our platform and our processes:
Thank you for entrusting us with the infrastructure that powers your products and business. We understand that the reliability of our platform is critical to our customers’ businesses, and we do not take that responsibility lightly.
My team and I have taken this issue as a very clear call to action: we need to deliver on reliability above all else, and we need to prove to you, our customers, that we can deliver on our promise. We have made great strides over the course of 2020 but there is more work to be done, and we will not rest until that work is complete.
Particle founder and CEO