AWS metrics delayed

Incident Report for Datadog US1

Resolved

AWS metrics are up to date and alerts on them have been re-enabled.
Posted Mar 30, 2016 - 18:18 EDT

Update

AWS has resolved their API issue and our CloudWatch metrics are processing the backlog at normal throughputs again. Respective alerts are still disabled until the backlog has been cleared.
Posted Mar 30, 2016 - 17:33 EDT

Update

AWS endpoints are still recovering: AWS CloudWatch metrics remain delayed and corresponding alerts disabled. We're testing a workaround to mitigate the impact of the AWS API errors and accelerate recovery.
Posted Mar 30, 2016 - 16:46 EDT

Monitoring

AWS endpoints are coming back online and we're progressively resuming collection of AWS Cloudwatch metrics. Corresponding alerts remain disabled until fully caught-up
Posted Mar 30, 2016 - 16:03 EDT

Identified

AWS is investigating elevated error rates in their authentication service which we use to collect AWS CloudWatch metrics. We're waiting for a resolution from Amazon.
Posted Mar 30, 2016 - 15:24 EDT

Investigating

We’re seeing a potential outage with some AWS API endpoints we’re using to retrieve metrics. As a result, AWS metrics are delayed and corresponding alerts have been suspended. We’re investigating further. All other metrics and integrations are unaffected.
Posted Mar 30, 2016 - 14:48 EDT