Infrastructure connectivity issue impacting multiple systems
Incident Report for Datadog
Postmortem

We’ve completed our analysis of this incident and have published details here: https://www.datadoghq.com/blog/2020-09-25-infrastructure-connectivity-issue/

Posted Oct 06, 2020 - 18:59 EDT

Resolved
All live data, including APM metrics, is now current, as are the corresponding APM alerts. Note that a subset of historical APM metric data may still show gaps; it will be recalculated, along with SLOs, over the next 24 hours. We apologize again for the inconvenience this outage has caused.
Posted Sep 24, 2020 - 21:40 EDT
Update
Events data is now current. We are continuing to backfill delayed data for APM metrics.
Posted Sep 24, 2020 - 20:55 EDT
Update
Processes and NPM data are now current. We are processing the remaining data backlogs and continuing to backfill delayed data for events and APM metrics.
Posted Sep 24, 2020 - 19:49 EDT
Monitoring
We are currently processing the remaining data backlogs. We’re now current with metric data and alerts, and are working on backfilling delayed data for events, APM metrics, processes, and NPM.
Posted Sep 24, 2020 - 18:31 EDT
Update
We are making further progress in recovering customer-facing systems. The web application and APIs are operational, as are logs and their corresponding alerts and live APM traces. A subset of metric data is still delayed and is being caught up. However, we are still working through backlogged APM metrics and other types of alerts.
Posted Sep 24, 2020 - 17:19 EDT
Update
We are making further progress in recovering customer-facing systems. The web application and APIs are operational, as are logs and their corresponding alerts and live APM traces. A subset of metric data is still delayed and is being caught up. However, we are still working through backlogged APM metrics and other types of alerts.
Posted Sep 24, 2020 - 17:17 EDT
Update
We are making further progress in recovering customer-facing systems. The web application and APIs are operational, as are logs and their corresponding alerts and live APM traces. A subset of metric data is still delayed and is being caught up. However, we are still working through backlogged APM metrics and other types of alerts.
Posted Sep 24, 2020 - 17:03 EDT
Update
We are making further progress in recovering customer-facing systems. The web application and APIs are operational, as are logs and their corresponding alerts and live APM traces. A subset of metric data is still delayed and is being caught up. However, we are still working through backlogged APM metrics and other types of alerts.
Posted Sep 24, 2020 - 16:59 EDT
Update
We are making progress in recovering customer-facing systems. The web application error rate is down and metrics data is available, although we are still catching up on some delayed data. Logs data is available and timely. We are still working on re-enabling all functionality and catching up our alerting systems.
Posted Sep 24, 2020 - 16:28 EDT
Update
We are making progress in recovering customer-facing systems. The web application error rate is down and metrics data is available, although we are still catching up on some delayed data. Logs data is available and timely. We are still working on re-enabling all functionality and catching up our alerting systems.
Posted Sep 24, 2020 - 16:26 EDT
Update
We are still working to resolve this outage. We are diverting traffic away from the affected components and restoring our customer-facing services. Our mitigations are showing progress, but we are still observing high error rates in our web application and API, and delays in metrics processing and alerting.
Posted Sep 24, 2020 - 15:36 EDT
Update
We are currently experiencing a widespread outage in our US-1 data center, and all hands are on deck to resolve it. We are truly sorry for the inconvenience and are working towards a timely resolution. The infrastructure that allows the configuration and resolution of our services is currently severely degraded, disrupting a number of customer-facing services. This results in high error rates in our web application and API, delays in metrics processing, and disrupted alerting.
Posted Sep 24, 2020 - 14:32 EDT
Update
We are continuing to actively work to mitigate the internal infrastructure connectivity issue impacting multiple systems.
Posted Sep 24, 2020 - 14:19 EDT
Update
We are continuing to actively work to mitigate the internal infrastructure connectivity issue impacting multiple systems.
Posted Sep 24, 2020 - 13:16 EDT
Identified
We are actively working on an issue that affects internal infrastructure connectivity and is impacting multiple systems.
Posted Sep 24, 2020 - 12:35 EDT
Update
We are continuing to investigate this issue.
Posted Sep 24, 2020 - 12:31 EDT
Update
We are continuing to investigate the elevated error rate on the web application.
Posted Sep 24, 2020 - 12:19 EDT
Update
We are continuing to investigate the elevated error rate on the web application.
Posted Sep 24, 2020 - 11:42 EDT
Update
We are continuing to investigate this issue.
Posted Sep 24, 2020 - 11:07 EDT
Update
We are continuing to investigate the elevated error rate on the web application.
Posted Sep 24, 2020 - 11:06 EDT
Investigating
We are seeing an elevated error rate on the web application and are currently investigating the issue. Note that monitoring data is being processed properly and that no data has been lost.
Posted Sep 24, 2020 - 10:27 EDT
This incident affected: Alerting Engine, API, API Crawlers, APM, Metrics Pipeline, Processes, and Web Application.