On Tuesday, 5 September 2023, from 11:40 to 12:10 UTC, customers received HTTP 503 errors when trying to access cached objects through the API CDN. Access to Sanity Studio and the ability to log in to Sanity were also disrupted during this incident. The incident was identified and mitigated within 30 minutes.
All times UTC on 2023-Sep-05.
11:40 Tracing changes rolled out - Outage begins
11:44 First system alert is fired (elevated 5xx errors)
11:47 Incident declared
12:06 Health check identified as root cause of API CDN outage
12:08 Health check corrected
12:08 Tracing work rolled back - Incident mitigated
12:09 503 error rate returns to normal level
12:10 API CDN service fully recovered, customers can log in to Sanity
12:15 Incident state moved to monitoring - Root cause analysis underway
We recently developed more advanced tracing capabilities for our platform to improve system observability. The change was rolled out several weeks ago, but it hit an edge case that unexpectedly increased the load on our identity management service, a failure mode not caught in our testing and staging environments. The tracing library includes a safety mechanism designed to prevent exactly this kind of overload, but its default limit was set too high for the system to cope with, and the identity management service failed. Our API CDN's health check depended on this service, so the failing check caused the CDN to stop serving traffic. The unavailability of the identity service also blocked the use of Sanity Studio and logging in to Sanity.
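The health-check coupling described above can be illustrated with a minimal sketch. All names and the structure here are assumptions for illustration, not Sanity's actual implementation: the idea is that a CDN node's health endpoint should only take the node out of rotation when a dependency it truly needs to serve cached traffic fails, while non-critical dependencies such as an identity service are reported as degraded rather than failing the whole check.

```python
def _safe(check):
    """Run a dependency check, treating any exception as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def check_health(critical_checks, optional_checks):
    """Return (status_code, report) for a CDN node's health endpoint.

    Only failures of critical dependencies (those required to serve
    cached objects) produce a 503 and remove the node from rotation.
    Optional dependencies are reported as "degraded" so that, e.g.,
    an identity-service outage does not stop the CDN from serving
    cached traffic.
    """
    report = {}
    healthy = True
    for name, check in critical_checks.items():
        ok = _safe(check)
        report[name] = "ok" if ok else "failing"
        healthy = healthy and ok
    for name, check in optional_checks.items():
        report[name] = "ok" if _safe(check) else "degraded"
    return (200 if healthy else 503), report
```

With this shape, an identity-service failure would surface in the health report as degraded, but the node would keep returning 200 and continue serving cached objects.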
This is our current understanding of the incident. As we continue to investigate, we will update this report if anything new and material comes to light.
Sanity engineers were alerted to the issue at 11:44 UTC and began investigating the API CDN failures promptly. At 12:06 UTC the team determined that the root cause of the API CDN outage was a health check against the identity service, which was immediately corrected. The team also rolled back the tracing change, the root cause of the identity service outage. As these changes rolled out, error rates subsided, the identity service started answering requests again, and regular API CDN traffic resumed.
In addition to resolving the underlying cause, we will be implementing updates to both prevent and minimize the impact of this type of failure in the future. Given the critical nature of our CDN infrastructure, we are also initiating a complete audit of our caching layer, including verifying that no additional legacy dependencies exist.
We would like to apologize to our customers for the impact this incident had on their operations and business. We take the reliability of our platform extremely seriously, especially when it comes to availability across regions.