Partial degradation of UDP device service
On Feb 19-20 we observed a partial degradation of our UDP device service, which governs cellular device connections to the Particle Cloud. As a result, a fraction of cellular devices temporarily lost the ability to publish data. The issue was resolved yesterday, and we have released a platform fix to prevent it from recurring.
Feb 21, 14:17 PST
Partial degradation of some Particle services (now resolved)
From 3:28 PST to 3:43 PST, we experienced degraded availability for some services. During this 15-minute window, the impact was:
- Setting and enforcing individual SIM billing thresholds, changing billing plans, and monitoring cellular data usage may have been unavailable
- Some webhooks may have been delayed
This issue is now resolved. Our cloud team is investigating the root cause and taking steps to ensure this will not occur again.
Feb 2, 17:34 PST
Webhook system not firing events
We're closing this incident out after monitoring the system for a few hours. Webhooks have been firing properly for some time.
Sep 14, 19:16-22:41 PST
Public API Interruption
While rolling out a change to one of our cloud services, we encountered an unexpected failure. The automated rollback strategy also failed, requiring a manual intervention that took about 10 minutes to fully bring the service back up. The API is now operating as expected.
Aug 30, 14:11-14:44 PST
Updating API rate limits
With the new rate limits in place, API latency is the most consistent it's ever been. We'll be reaching out to users with overly aggressive scripts next week. If you see HTTP status 429, you know what to work on!
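If your script is hitting the new limits, one common fix is to back off exponentially whenever a request comes back with 429. Here's a minimal sketch using only the Python standard library; the function names are our own illustration, not part of any Particle SDK, and the backoff schedule (doubling from one second) is just an example starting point:

```python
import time
import urllib.request
import urllib.error

def backoff_delays(retries, base=1.0):
    """Yield exponentially growing wait times: base, 2*base, 4*base, ..."""
    delay = base
    for _ in range(retries):
        yield delay
        delay *= 2

def get_with_backoff(url, retries=5):
    """Fetch a URL, sleeping between attempts whenever the server answers HTTP 429."""
    for delay in backoff_delays(retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # not a rate limit; let other errors propagate
            time.sleep(delay)  # rate limited: wait, then try again
    raise RuntimeError("still rate limited after %d attempts" % retries)
```

Spacing out retries this way (rather than hammering the endpoint in a tight loop) keeps your script under the limit and keeps latency consistent for everyone.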
Aug 5, 19:08-19:56 PST
Device service degraded performance
Database latency is now stable and back to where it was before our database provider began reporting issues. As a result, all Particle services are functioning as expected again.
Aug 2, 07:06-07:53 PST
This incident has been resolved.
Jun 23, 10:58-11:30 PST
Webhooks were partially unavailable overnight
We detected and resolved an issue this morning that did not trigger an alert for our on-duty engineers. We're in the process of scaling out webhooks due to increased demand, and we're working on resolving the recently elevated error rate.
Jun 16, 07:31 PST
Brief Webhooks issue
We discovered one of our webhooks workers had an issue this morning that wasn't caught by our automatic alerts. We've restored the worker and set up alerts so we can react more quickly in the future. Thanks!
Jun 2, 08:14 PST
3rd party infrastructure changes preventing Build (build.particle.io) from loading
We've identified the source of the problem and fixed the underlying cause. We also know why our automated alerts did not catch this and notify us or automatically scale up to resolve the issue.
The root cause was complex: a third-party infrastructure change coupled with a new load-bearing dependency on a particular HTTP endpoint in our code that had been running fine in production for months. Our automated alerts did not catch it because that endpoint was not included in the periodic health check. Sorry for the inconvenience, and thanks for your patience.
May 16, 08:23-08:54 PST
Cellular outage for some customers
Our telephony partner's outage has been resolved. Electrons using Particle SIMs are connecting as they should.
May 2, 19:14 - May 3, 00:10 PST
Our telephony partner's API is down
Our telephony partner has resolved their API issue. SIM card activation and deactivation should work again.
Apr 25, 16:43 - Apr 26, 06:02 PST
Build IDE loading
We've identified the root cause of this issue and have deployed a fix. If you've used Build in the last couple of hours, you may need to click the "refresh libraries" button to see them all again.
Feb 23, 18:38-22:17 PST
[Scheduled] Brief window during scale-up this morning
The deploy went smoothly as expected, and we'll continue to monitor the performance and reliability of the cloud. We'll also deploy smaller code changes later today. Thanks!
Feb 1, 08:00-09:38 PST
Issue with an upstream service provider resolved
We responded to an issue with an upstream service provider that required more API restarts than we normally allow in a given day. Although each instance of downtime lasted only a few seconds, we apologize for the disruption. The issue should be resolved, and we'll continue to monitor it closely. Thanks!
Oct 21, 16:56 PST
Cloud Compiler down
A dependency issue was resolved, compiling should be back to normal now.
Oct 12, 21:30-23:32 PST
Cloud compiler issues
For the last couple of hours, the cloud compiler (used by the API, the Web IDE (Build), the CLI, and the Local IDE (Dev)) has not been performing as expected. We've added multiple alerts that will trigger if this same issue arises again, allowing us to address it more quickly. Additionally, we'll investigate the root cause more deeply this week to prevent the issue from arising in the first place.
Aug 22, 15:14 PST
updating build farm
Updates have been rolled out to the build farm and look good. Thanks!
Aug 20, 10:38-11:22 PST
Cloud Compile Service
Cloud compile service logs and metrics have appeared normal for the last 25 minutes; the service is operating normally.
Jul 31, 09:41-10:12 PST
Rolled out some minor updates
Some small changes were rolled out to help address some flashing issues that have been reported in the IDE.
Jul 27, 10:17 PST
Build farm had a brief partial outage
We discovered one of our build farm workers ran out of memory and was unable to perform builds for some time this afternoon. We'll be adding alerts to catch this earlier and working to identify the underlying issue this week. Everything should be back up and happy now. Thanks!
Jul 26, 15:47 PST
Short API Downtime
We experienced a very short blip in API availability resulting from an attempted deployment. We have rolled back and service has been restored.
Jul 23, 14:46 PST
Compile service ran out of space briefly
Looks like the workspace cleanup wasn't happening as expected and some disks filled up. We'll look into why we weren't alerted, but in the meantime online IDE builds should be running normally again. :)
May 28, 22:15 PST
Temporary Dashboard Outage
dashboard.particle.io experienced a short outage this afternoon during an attempted deployment. The issue has been identified, and the application has been rolled back to its original state until a fix is implemented.
May 26, 16:09 PST
Devices not connecting
Everything's back up. We'll be actively working to mitigate future such events in the coming weeks.
May 24, 00:03-00:46 PST
community forums load balancer is experiencing issues
The issue with the forums should now be resolved. We were taking a snapshot of that machine instance, but the snapshot process restarted the box without warning, and some forum services needed attention after the unexpected restart. We'll continue to monitor this service closely today.
May 21, 10:38-10:57 PST
[Scheduled] Expected downtime
The scheduled maintenance has been completed. We're very happy to report no incidents and zero downtime.
May 2, 23:30 - May 3, 01:30 PST
Some brief issues verifying software in the IDE
We deployed new code this morning and quickly rolled back after we saw an increased error rate. We're fixing those issues now and will roll out the improvements as they're fixed.
Apr 22, 12:04 PST
Devices currently unable to connect
Looks like we were pushing the memory ceiling on one of the device service boxes. All's well now. We'll set up new alerts to catch this in the future before it causes an outage.
Apr 12, 00:43-01:50 PST
build farm is back up
Sorry about that! It looks like an attack on GitHub impacted our build farm, even though our build farm should be protected against that; we'll look into what caused things to jam up. In the meantime the build farm should be back up and running, but we'll keep monitoring it closely.
Mar 27, 05:48 PST
[Scheduled] Scheduled Cloud Upgrade
Thanks everyone! The scheduled maintenance went well, and we're running on shiny new hardware! Woo! Things should be back to normal, if not a bit faster. Thanks!
Mar 23, 09:15-10:20 PST
Seeing some cores having difficulties connecting
Our database hosting provider switched over between primary and secondary servers, causing several device-service processes to stop responding to handshakes for a few minutes.
Mar 10, 21:37-21:52 PST
Community Site Down
Discourse's automatic backups filled the disk, so we've cleared out old backups. This box is already scheduled for an upgrade this sprint. We look forward to avoiding these kinds of outages in the future.
Feb 8, 10:23-10:34 PST
Devices unable to connect
There was a problem with the device service's database connections. Everything's back online now.
Feb 2, 06:27-06:39 PST
beta functionality degraded
Looks like Amazon stopped our beta services box unexpectedly. We're adding more alerts and monitoring to our beta box so this doesn't happen again.
Jan 19, 12:47-12:56 PST
community site outage
The server hosting the community site ran out of disk space again; I've freed up some space. It looks like our PagerDuty alerts weren't hooked up to the forums, so we weren't woken up for this one, sorry! We'll schedule some maintenance later today to upgrade the disk on that box and set up alerts.
Jan 7, 06:31 PST
device service downtime
Balance restored. We might reboot things one more time to carry some changes forward. Sorry about the downtime!
Oct 16, 13:38-13:49 PST
Reports of cores appearing offline to the API after some time.
The self-healing / routing patch is deployed and working well. It might take the cloud a few seconds sometimes, but we'll keep working on that and get it back down to instantaneous. Thanks again to everybody who reported issues!
Oct 8, 19:36 - Oct 9, 10:45 PST
investigating some unusual metrics
We've been watching things closely for the last 3 hours just to be safe, and things have continued to be very stable, so I'm marking this as resolved for now. We'll keep monitoring for any unusual behavior.
Oct 4, 15:33-18:55 PST
community downtime this morning
We're looking into the cause of some community downtime we had this morning. The site is back up after a restart, and we'll continue monitoring it throughout the day.
Sep 29, 05:31 PST
Build site is down
Our host is fixing the failed secondary instance, and we reconfigured and restarted the site to ignore the secondary. We'll also need to update a database driver for the site, since it didn't fail over cleanly. The rest of our services were not impacted by this downtime. Sorry!
Sep 19, 18:34-19:24 PST
Community SSL certificate temporary issue
Yesterday, for a few hours, Windows 7 users were unable to use the community site because certain browsers could not use a more modern, more secure SSL encryption algorithm.
Very quickly, @bko and @peekay123 (at community.spark.io) alerted us that they were having problems, and we reverted it—thanks for letting us know so quickly!
This incident occurred while we were rotating our SSL certificates in response to the Heartbleed bug. Note that we patched all of our systems within 24 hours of the initial announcement; SSL certificate rotation is an additional mitigation measure we're taking.
In the coming days we'll be finishing this process. We'll be sure to update this status page if any issues arise.
The Spark Team
Apr 11, 09:33 PST
Community Being Upgraded & DNS Change
community.spark.io has been operating normally since 4:45.
Within minutes of posting our first status update, two awesome people in our community, kennethlimcp and hypnopompia, helped us realize that image URLs were pointing at an old DNS record for sparkdevices.com rather than spark.io. This resulted in uploaded images being unviewable (for up to an hour) while the DNS change propagated.
Mar 5, 13:16-16:50 PST
Tinker version 1 was being sent to Cores, even though the Device Service was telling Cores to update if they were at a version lower than 2. Version 2 is now being distributed correctly, as intended.
Jan 23, 23:19-23:47 PST
Trouble manually claiming cores using Chrome
The issues with manually claiming cores using the build site at spark.io/build are resolved. For those affected, thank you for providing the details we needed to diagnose and fix the error, and for your patience while we worked out a solution!
Dec 28, 13:45 - Jan 6, 11:48 PST