Some APM data being dropped
We found the root cause of the issue and fixed it by adjusting our Kafka partition configuration.
Ultimately, the impact was as follows (all times MST):
10:51am - 12:50pm: a small amount of data was dropped
12:50pm - 3:06pm: incoming data was delayed by about 10 minutes
3:06pm onward: data is current and operations are normal
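For readers unfamiliar with the fix: partition counts on an existing Kafka topic can be raised with Kafka's stock admin tooling. A rough illustration only; the topic name and counts below are hypothetical, not Scout's actual configuration:

```shell
# Increase the partition count on the ingest topic
# (topic name and partition count are illustrative).
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic apm-ingest --partitions 12

# Verify the new partition layout.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic apm-ingest
```

More partitions let more consumers drain a backlog in parallel, which is why partition configuration is a common lever when ingestion falls behind.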
Sep 19, 12:08-15:56 MST
Two-minute gap in metric ingestion caused by database upgrade
During a database upgrade, we restarted our ingestion pipeline too soon (before the database was fully back online) and, as a result, lost two minutes of incoming data from 10:24am - 10:25am MDT. You'll see metrics flatline on your charts for those two minutes. Apologies for this; we take data continuity very seriously.
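One way to avoid that ordering problem is to gate the pipeline restart on a readiness probe. A minimal sketch in Python, assuming a health-check callable; the probe and restart function are hypothetical, not Scout's actual tooling:

```python
import time

def wait_until_ready(probe, timeout=300.0, interval=5.0):
    """Poll `probe` (a zero-argument callable that returns True once the
    dependency is healthy) until it succeeds or `timeout` seconds pass.
    Returns True on success, False on timeout. Always probes at least once."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Hypothetical usage: only restart ingestion once the database answers.
# if wait_until_ready(database_is_up):
#     restart_ingestion_pipeline()
```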
Aug 25, 10:43 MST
scoutapp.com is down
A database spike caused app servers to freeze. The impact was short-lived. We're investigating how to avoid a repeat.
Aug 18, 16:44-17:08 MST
Connectivity into our data center is resolved and traffic has been stable for the last hour.
Jun 25, 02:22-03:21 MST
Dropped metrics during 2:40-2:46 PM MDT
A brief networking issue between load balancers and database cluster caused an unintended database failover. Connectivity has remained stable since the incident. We'll be tuning the failover condition to better handle this scenario.
Jun 23, 14:54-17:53 MST
APM - Delay in metric insertion
Our metric aggregator failed to restart during an application restart, resulting in a backlog of metrics. We manually restarted the aggregator, and metric insertion quickly returned to normal. We'll be investigating the restart failure.
Feb 25, 22:46-23:04 MST
We're back to steady state, with no additional incidents overnight. Closing this issue.
Jan 27, 16:29 - Jan 28, 10:01 MST
Incoming data delayed 20 minutes
The backlog triggered the need for a database upgrade. Please see the following incident for updates: https://status.scoutapp.com/incidents/blhhgnhnbkfj.
Jan 27, 16:11-16:35 MST
Server Monitoring: 503 errors
We had a misconfiguration on a new timeseries database server that resulted in problems with reverse DNS lookups. After fixing the issue, application errors dropped back to zero.
We're updating our configuration recipes to ensure this is set up correctly going forward.
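For anyone wanting to sanity-check a server's reverse (PTR) record themselves, Python's standard library is enough. A quick sketch; the IP used in the usage note is illustrative:

```python
import socket

def ptr_record(ip):
    """Return the hostname from a reverse DNS (PTR) lookup of `ip`,
    or None if no PTR record resolves."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None

# e.g. ptr_record("127.0.0.1") typically returns "localhost" on most hosts.
```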
Dec 8, 11:17-12:20 MST
Server Monitoring - Short blip in connectivity
We just saw a short blip in connectivity to scoutapp.com impacting server monitoring. Traffic is back to normal levels.
We're communicating with the data center to ensure it is resolved.
Dec 4, 19:09 MST
APM: Some gaps in reported data over the last 12 hours
Unfortunately, due to a combination of under-provisioned AWS infrastructure and an increase in load, we lost some data over the weekend. You will see flatlines in your charts during those periods of data loss.
Even though we're still in Tech Preview, we make every effort to prevent incidents like this. On the upside, we learned a lot about capacity and breaking points, and it's far better to learn now, during the preview phase, than after we go into General Availability.
Thank you for your patience while we work the kinks out. We're making APM better, every day!
Oct 24, 11:03 - Oct 25, 22:27 MST
APM UI down
The issue was caused by a bad connection between our app servers and our ElasticSearch provider. We are working to make this connection more robust.
Oct 8, 12:13-12:35 MST
App Monitoring Unavailable
We've tracked it down to a connectivity issue between the Influx nodes. We're working on handling these connectivity issues more gracefully.
Sep 28, 13:51-14:18 MST
We're back. Sorry for the issues - we're working to make these db updates less impactful.
Sep 14, 15:01-15:03 MST
We were rolling out some updated queries for our alerts API, and these caused issues when 100k+ alerts matched a query. We've rolled the change back and are working on an updated query.
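A common guard against this class of problem is to cap how many alerts any single query touches by batching. A hypothetical sketch; the batch size and names are illustrative, not our actual query code:

```python
def batched(items, size=1000):
    """Yield successive slices of at most `size` items, so no single
    query ever carries an unbounded list of alert IDs."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. for batch in batched(alert_ids, 1000): run_query(batch)
```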
Sep 14, 10:29-10:37 MST
Isolated connectivity issues
Connectivity continues to be fine after traffic was no longer routed through GTT. As mentioned in an earlier update, we're considering plans to resolve these isolated connectivity issues faster.
Dec 27, 12:39 - Dec 29, 09:01 MST
Continued isolated connectivity issues
No update on root cause, but connectivity has been stable for the past 7 hours. We'll mark this as resolved and create a new incident if anything changes.
Dec 26, 23:49 - Dec 27, 07:20 MST
Isolated connectivity issue (Seattle area)
Connectivity from the affected servers has been holding. Message from our host, RailsMachine:
> "I haven't heard anything back from GNAX/Zcolo yet but I ran a few online ping tests (https://www.site24x7.com/public/t/results-1419656583928.html and https://cloudmonitor.ca.com/en/ping.php) and they either reported everything fine or the errors that reports gave the same results for google.com. This indicates either a problem with their testing tool or a localized network issue in the locations that failed to connect to the site. We haven't received any reports from other customers so this is likely a localized issue outside of our control. We also haven't received alerts for any sites that we monitor using pingdom."
Marking this as resolved.
Dec 26, 22:02-22:24 MST
Apparent network issues. We are in contact with our hosting provider.
Message from our host: "Everything appears to be back online. We are pursuing a Root Cause Analysis from our datacenter, and will post that on our status page as soon as possible. We are also requesting that they increase communication during future outages or technical problems."
We (Scout) will tweet the root cause analysis when it becomes available as well. I apologize for this outage. It doesn't represent the level of reliability we aim for.
Mar 31, 12:30-13:37 MST
503 and 504 errors
We've brought down the load on our metric server and the timeouts have resolved.
We will look into optimizing/safeguarding the offending logic to prevent this issue going forward.
Mar 26, 21:14-21:50 MST
Intermittent 503 errors
503 errors are resolved. The root cause was a significant volume of disk activity on a metric server due to operations stuck in a loop. This slowed down the rest of Scout. We implemented a quick fix (an edge case resulted in this behavior) and are working on a longer-term fix.
Mar 11, 17:51-19:18 MST
This incident has been resolved.
Apr 28, 21:28 - Apr 29, 01:24 MST
PagerDuty integration may be dropping some alerts
This appears to have been a temporary network issue between our servers and PagerDuty's API endpoints. We have added monitoring so we can identify similar issues and react more quickly if this should happen again.
Apr 22, 18:27-19:43 MST
503 errors on scoutapp.com
Our hosting provider has identified a hardware issue as the root cause. We'll be migrating the server to a new host this week.
Apr 8, 07:41-10:09 MST
Isolated agent connectivity issues
Our hosting provider was unable to identify any issues on their end, but all servers are reporting. If you are seeing similar issues, email us at email@example.com.
Mar 29, 21:38 - Apr 8, 08:34 MST