Degraded metric import performance
Incident Report for Scout
Resolved
With the fixes deployed late last week, metric import is now running normally.
Posted Mar 28, 2016 - 13:29 MDT
Monitoring
We've deployed the critical updates, and things are looking much better. We'll be monitoring to ensure everything stays steady.
Posted Mar 25, 2016 - 15:25 MDT
Update
We're rolling out some fixes now. There may be a bit of degradation during deploys, before the improvements take hold.
Posted Mar 25, 2016 - 13:07 MDT
Update
We have a background queue optimization that we believe will resolve the remaining 10% of the problem. We plan to deploy it tomorrow (Friday) in the late morning or early afternoon.
Posted Mar 24, 2016 - 17:05 MDT
Identified
We've reduced the timeouts significantly through a combination of adding more hardware, optimizing DB queries, and rebalancing large accounts. The issue isn't fully resolved yet, so we're now improving background job processing.
Posted Mar 24, 2016 - 10:55 MDT
Update
We've made significant gains on the data gaps with a round of query optimizations (a 4x reduction). We'll update again tomorrow on our progress.
Posted Mar 22, 2016 - 23:47 MDT
Update
We added a new app server, which ameliorated the problem but didn't solve it. We are working on some further database optimizations. Thanks for your patience while we work through this.
Posted Mar 22, 2016 - 17:37 MDT
Update
We completed a round of database optimizations this morning and changed how some workloads are performed. Unfortunately, that didn't have an impact. We're about to deploy an additional application server to help bring down the queue backups.
Posted Mar 22, 2016 - 14:49 MDT
Update
We're working on several paths to resolve this issue. The root of the problem is a sharp increase in database response times, which bubbled up through our stack and generated sporadic 503s on metric ingestion.

Our first round of deployments to resolve this issue will be in place by this afternoon.
Posted Mar 22, 2016 - 11:12 MDT
Update
We're continuing to investigate the data gaps. We'll provide an update later this morning on our progress.
Posted Mar 22, 2016 - 08:56 MDT
Investigating
Metric imports are taking longer than usual. In some cases, agents are timing out when they try to send metrics to our servers. The potential impact: you might see charts with small gaps from time to time. In most cases, it won't be visible. Triggers will still fire, and notifications will still be sent. We are investigating the root cause of the slowness. Send us an email if you have any questions.
Posted Mar 21, 2016 - 22:14 MDT