Global API CDN unavailable for 30 minutes

Incident Report for Sanity

Postmortem

Incident summary

On Tuesday, 5 September 2023, from 11:40 to 12:10 UTC, customers received 503 responses when trying to access cached objects through the API CDN. Access to Studio and the ability to log in to Sanity were also disrupted during this incident. The incident was identified and mitigated within 30 minutes.

Incident timeline

All times UTC on 2023-Sep-05.

11:40 Tracing changes rolled out - Outage begins

11:44 First system alert is fired (elevated 5xx errors)

11:47 Incident declared

12:06 Health check identified as root cause of API CDN outage

12:08 Health check corrected

12:08 Tracing work rolled back - Incident mitigated

12:09 503 error rate returns to normal level

12:10 API CDN service fully recovered, customers can log in to Sanity

12:15 Incident state moved to monitoring - Root cause analysis underway

Root cause

We recently developed more advanced tracing capabilities for our platform to improve system observability. This change was rolled out several weeks ago, but it hit an edge case that unexpectedly increased the load on our identity management service. This behavior was not caught in our testing and staging environments. The safety mechanism in the tracing library that is meant to prevent this sort of failure had a default value set too high for the system to cope with, so the identity management service failed. Our API CDN depended on a health check against this service, and this dependency caused the API CDN to stop serving traffic. Because the identity service was unavailable, customers were also unable to use Sanity Studio or log in to Sanity.
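To illustrate how a safety limit with a default that is too high can fail to protect a downstream service, here is a minimal sketch of a generic token-bucket limiter. It is not the tracing library's actual mechanism, which this report does not describe; `TokenBucket`, `calls_admitted`, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Hypothetical limiter that caps calls made to a downstream service."""

    rate_per_sec: float  # tokens added back per second
    capacity: float      # maximum burst size (the "safety limit")
    tokens: float = field(init=False, default=0.0)
    last_refill: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        # Start with a full bucket so an initial burst up to `capacity` passes.
        self.tokens = self.capacity

    def allow(self, now: float) -> bool:
        """Return True if one call may proceed at time `now` (in seconds)."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def calls_admitted(limiter: TokenBucket, attempts: int, now: float = 0.0) -> int:
    """Count how many of `attempts` simultaneous calls the limiter lets through."""
    return sum(1 for _ in range(attempts) if limiter.allow(now))


# A limit sized for the downstream service absorbs the burst...
tight = calls_admitted(TokenBucket(rate_per_sec=10, capacity=10), attempts=1000)
# ...while a limit set far above what the service can handle admits everything.
loose = calls_admitted(TokenBucket(rate_per_sec=10, capacity=10_000), attempts=1000)
```

In this sketch the limiter with the oversized `capacity` passes the whole burst through, so the downstream service sees the full load even though a safety mechanism is nominally in place.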

This is our current understanding of the incident. If anything new and material comes to light as we continue to investigate, we will update this report with further details.

Remediation and prevention

Sanity engineers were alerted to the issue at 11:44 UTC and promptly began investigating the API CDN failures. By 12:06 UTC the team had determined that the root cause of the API CDN outage was a health check against the identity service, which was corrected at 12:08 UTC. The team also rolled back the tracing change, which was the root cause of the identity service outage. As these changes were rolled out, error rates subsided, the identity service started answering requests again, and regular API CDN traffic resumed.
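One common way to keep a health check from taking a service offline when a non-essential dependency fails is to separate critical probes from optional ones. The sketch below shows that general pattern only. It is not the correction that was applied here, and the function names, the probe names, and the decision about which dependency counts as optional are all assumptions.

```python
from typing import Callable, Dict, Tuple

Probe = Callable[[], bool]


def _safe(probe: Probe) -> bool:
    """Run a probe, treating any exception as a failed check."""
    try:
        return bool(probe())
    except Exception:
        return False


def evaluate_health(
    critical: Dict[str, Probe], optional: Dict[str, Probe]
) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, per-dependency report).

    Only a failing critical probe yields 503. A failing optional probe is
    reported as "degraded" but does not take the service out of rotation.
    """
    status = 200
    report: Dict[str, str] = {}
    for name, probe in critical.items():
        ok = _safe(probe)
        report[name] = "ok" if ok else "failed"
        if not ok:
            status = 503
    for name, probe in optional.items():
        report[name] = "ok" if _safe(probe) else "degraded"
    return status, report


def identity_down() -> bool:
    """Hypothetical probe simulating an unreachable identity service."""
    raise ConnectionError("identity service unreachable")


# Identity listed as optional: the service keeps serving, but flags degradation.
status_optional, report_optional = evaluate_health(
    critical={"cache": lambda: True}, optional={"identity": identity_down}
)
# Identity listed as critical: the same failure marks the whole service unhealthy.
status_critical, report_critical = evaluate_health(
    critical={"cache": lambda: True, "identity": identity_down}, optional={}
)
```

With the identity probe classified as optional, the check still returns 200 and surfaces the problem in the report. Classified as critical, the same failure produces a 503, which is the kind of coupling described above.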

In addition to resolving the underlying cause, we will be implementing updates to both prevent this type of failure and minimize its impact in the future. Given the critical nature of our CDN infrastructure, we are also initiating a complete audit of our caching layer, including making sure no additional legacy dependencies exist.

We would like to apologize to our customers for the impact this incident had on their operations and business. We take the reliability of our platform extremely seriously, especially when it comes to availability across regions.

Posted Sep 05, 2023 - 21:03 UTC

Resolved

This incident has been resolved.

Posted Sep 05, 2023 - 12:26 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Sep 05, 2023 - 12:18 UTC

Investigating

We are experiencing an elevated level of API errors for some of our customers and are currently investigating the issue.

Posted Sep 05, 2023 - 11:59 UTC

This incident affected: apicdn.sanity.io.