Scout

February 2017

Network instability - Server Monitoring outage
This incident has been resolved.
Feb 14, 15:56-21:46 MST
AWS Networking Issue
AWS resolved their network issue and we should be back to normal.
Feb 9, 19:38-20:54 MST
[Scheduled] Zayo Scheduled Maintenance Overnight
The scheduled maintenance has been completed.
Feb 2, 20:00 - Feb 3, 04:00 MST

January 2017

[Scheduled] Switch Upgrade. Possible service interruption for 5-10 minutes.
The scheduled maintenance has been completed.
Jan 6, 22:00-22:15 MST
Somewhat degraded performance while datacenter upgrades switches
This incident has been resolved.
Jan 6, 12:26-13:11 MST
Server Monitoring: brief metric ingestion outage while swapping database writer role
Scout Server Monitoring had a brief ingestion outage from 4:27PM to 4:31PM MDT while swapping a database writer role.
Jan 1, 16:43 MST

December 2016

Brief downtime (database fix)
From 8:58PM-09:07PM MDT 2016-12-31, scoutapp.com was unavailable during a database alteration. Data was not collected during this time.
Dec 31, 22:05 MST
[Server Monitoring] Incorrect alert routing/Alerts not being sent out
We have corrected the underlying database issue causing the incorrectly routed alerts. Alerts should be back to normal for all accounts.
Dec 31, 19:58-21:44 MST
Network connectivity issues
Network connectivity is restored. There will be a 7-minute drop in charts corresponding to the outage.
Dec 13, 09:17-09:24 MST
Server Monitoring install packages are temporarily unavailable
The package repos are back and operating normally.
Dec 3, 04:05-08:00 MST

November 2016

Metric Data Delayed
Looks like everything is back to normal.
Nov 17, 12:46-14:50 MST
[Server Monitoring] Connectivity issues to scoutapp.com
Our host has confirmed service has been restored.
Nov 14, 14:16-16:05 MST
[Server Monitoring] Connectivity issues to scoutapp.com
This incident has been resolved.
Nov 11, 20:56 - Nov 12, 21:07 MST
Server Monitoring customers may experience connectivity issues
This incident has been resolved.
Nov 4, 08:47-09:54 MST
Delayed data and slow charts on some Server Monitoring accounts
This incident was caused by a rogue xinetd process on one of our servers. Performance and data intake are back to normal.
Nov 2, 09:19-10:42 MST

October 2016

[Scheduled] [Application Monitoring] Postgres cluster upgrade
The scheduled maintenance has been completed.
Oct 30, 21:05 - Oct 31, 13:44 MST
Metric ingestion lag for clients running scout_apm version <= 2.1.8
This incident has been resolved.
Oct 26, 16:18 - Oct 27, 07:10 MST
[Scheduled] Time Series Capacity Increase - Application Monitoring
Metric ingestion has caught up and the maintenance is complete.
Oct 20, 21:00-22:07 MST
[Scheduled] Time Series Database Upgrade
The scheduled maintenance has been completed.
Oct 11, 21:00-21:18 MST
[Scheduled] Metric ingestion, kafka changes
The scheduled maintenance has been completed.
Oct 3, 21:30-22:19 MST

September 2016

[Application Monitoring] Agent versions <= 1.3.4 ingestion lag
Legacy agent checkins are caught up
Sep 23, 18:57 - Sep 24, 07:21 MST
[Application Monitoring] data being dropped
Ingestion has caught back up for agents sending data in JSON format. We're still catching up for agents sending our legacy data format.
Sep 23, 15:48-18:15 MST
[Scheduled] Database changes requiring downtime
The scheduled maintenance has been completed.
Sep 22, 14:01-14:15 MST
Slowness displaying charts and some pages, particularly for apps with lots of endpoints.
We expect the changes made today to gradually lower the response times for charts over the next 24 hours.
Sep 20, 11:18-16:47 MST
[Scheduled] Database upgrade and reboot
The scheduled maintenance has been completed.
Sep 20, 12:31-12:45 MST
Some APM data being dropped
We found the root cause of the issue, and fixed it by adjusting Kafka partition configuration. Ultimately, the impact is: (all times MST): 10:51am - 12:50pm: small amount of data dropped; 12:50pm-3:06pm: incoming data delayed by about 10min; 3:06PM & onward: data is current; operations normal.
Sep 19, 12:08-15:56 MST
[Scheduled] Time Series Database Upgrade
The database upgrade is complete. All app metrics reporting should now be caught up to the present time.
Sep 16, 10:00-10:15 MST

August 2016

Server Monitoring - Network Connectivity Issues
This incident has been resolved.
Aug 28, 18:35-20:25 MST
Two-minute gap in metric ingestion caused by database upgrade
During a database upgrade, we restarted our ingestion pipeline too soon (before the database was fully back online), and as a result lost two minutes of incoming data, from 10:24am - 10:25am MDT. You'll see metrics flatline on your charts for those two minutes. Apologies for this; we take data continuity very seriously.
Aug 25, 10:43 MST
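The failure mode described above (restarting ingestion before the database was fully back online) is commonly guarded against with a readiness gate. The following is a minimal sketch under stated assumptions, not Scout's actual pipeline code: `is_ready` is a hypothetical health check supplied by the caller, and the timeout and interval values are illustrative.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     timeout_s: float = 60.0,
                     interval_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses.

    is_ready is a hypothetical health check (for example, a trivial
    query against the database). It is an assumption for illustration.
    Returns True if the check passed, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A caller would restart the ingestion pipeline only when `wait_until_ready(...)` returns `True`, and otherwise leave it stopped and alert an operator.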
scoutapp.com is down
A database spike caused app servers to freeze. The impact was short-lived. We're investigating how to avoid a repeat.
Aug 18, 16:44-17:08 MST
Connectivity issue - likely DNS
This incident has been resolved.
Aug 16, 16:30-19:44 MST

July 2016

[Server Monitoring] Network connectivity
While our datacenter is investigating the incident and hasn't yet posted the cause, we're going to go ahead and close the incident. We'll create a new incident should the situation re-appear.
Jul 30, 10:38-14:17 MST
[Server Monitoring] 503 errors connecting to scoutapp.com
We've deployed the update to our rate-limiting. All issues should be resolved.
Jul 29, 10:18-12:13 MST
Delay in sending out some Server Monitoring email and SMS alerts
Resolved. Emails and SMS messages are being sent normally.
Jul 20, 22:15-22:22 MST

June 2016

[APM] Metric Ingestion Delayed
Metric Ingestion has caught up.
Jun 28, 11:22-11:31 MST
Network connectivity
Connectivity into our data center is resolved and traffic has been stable for the last hour.
Jun 25, 02:22-03:21 MST
Dropped metrics during 2:40-2:46 PM MDT
A brief networking issue between load balancers and database cluster caused an unintended database failover. Connectivity has remained stable since the incident. We'll be tuning the failover condition to better handle this scenario.
Jun 23, 14:54-17:53 MST
[Server Monitoring] Connectivity issues to Scoutapp.com
Our upstream advised the issue was mitigated at 2AM EDT. Traffic has been stable since.
Jun 9, 20:52 - Jun 10, 07:12 MST

May 2016

[Server Monitoring] Connectivity issues to Scoutapp.com
Server Monitoring functionality is restored. Reach out to support@scoutapp.com if you have any questions.
May 4, 09:40-10:31 MST

April 2016

[Scheduled] Scheduled Datacenter Maintenance
The scheduled maintenance has been completed.
Apr 29, 08:49 MST
Metric Ingest issues for one hour this morning (Server Monitoring only)
We've resolved the database configuration inconsistency and are no longer seeing metric ingest issues.
Apr 21, 09:08-13:54 MST
Network Connectivity Issues
Network operations have returned to normal as of 11:37 EDT.
Apr 19, 06:41-10:23 MST
[Scout APM] Dropping some metrics
The issue has been resolved and we have retried all failed payloads.
Apr 17, 18:42-19:13 MST

March 2016

Server Monitoring: queue backlog in alert notifications
We've caught back up. We had a mis-configured alert on our end when our notification worker crashed following a deploy. This has been adjusted.
Mar 29, 18:48-19:00 MST
Degraded metric import performance
With the fixes late last week, metric import is now running normally.
Mar 21, 22:14 - Mar 28, 13:29 MST

February 2016

APM - Delay in metric insertion
Our metric aggregator failed to restart on an application restart, resulting in the backup. We manually restarted the aggregator and metric insertion quickly returned to normal. We'll be investigating the restart failure.
Feb 25, 22:46-23:04 MST
[Scheduled] Database upgrade
The scheduled maintenance has been completed.
Feb 23, 22:00-22:15 MST
[Scheduled] Scheduled Database upgrade
The scheduled maintenance has been completed.
Feb 23, 21:00-21:15 MST
[APM] Slow Request Stream Inaccessible
Slow stream is fully operational, and all incoming data that was delayed is caught up. Sorry for the delays!
Feb 4, 16:21 - Feb 5, 00:37 MST

January 2016

Metric Backlog
We're back to steady state, with no additional incidents overnight. Closing this issue.
Jan 27, 16:29 - Jan 28, 10:01 MST
Incoming data delayed 20 minutes
The backlog triggered the need for a database upgrade. Please see the following incident for updates: https://status.scoutapp.com/incidents/blhhgnhnbkfj.
Jan 27, 16:11-16:35 MST
[Scheduled] APM - Additional database disk upgrade
The scheduled maintenance has been completed.
Jan 16, 21:30-23:15 MST
[Scheduled] APM database disk upgrade
The scheduled maintenance has been completed.
Jan 15, 21:30-23:00 MST
Delayed data import during database disk upgrade
This incident has been resolved.
Jan 15, 11:24-14:27 MST
Connectivity issues to scoutapp.com
The hosting provider resolved this issue, and connectivity is restored.
Jan 2, 00:24-07:03 MST
Connectivity issues to scoutapp.com
Traffic has been at normal levels for the last several hours - we're going to close out the incident.
Jan 1, 15:19-19:11 MST
Network issues to scoutapp.com
zColo network engineers have contained the DDoS attack and services have been restored.
Jan 1, 10:57-14:26 MST

December 2015

Delayed metric inserts
Metric insert performance has returned to normal.
Dec 30, 11:11 - Dec 31, 11:41 MST
APM: Backup in metric processing
This incident has been resolved.
Dec 18, 15:43-23:32 MST
We're investigating a drop in APM service
Outage is resolved. No data was lost.
Dec 14, 15:32-15:44 MST
We are investigating degraded performance on APM
Read operations were being starved for database connections. We've fixed the issue, and performance is back to normal.
Dec 11, 10:50-11:25 MST
Server Monitoring: 503 errors
We had a misconfiguration on a new timeseries database server resulting in problems with reverse DNS lookups. After fixing the issue, application errors have dropped back down to zero. We're updating our configuration recipes to ensure this is setup correctly going forward.
Dec 8, 11:17-12:20 MST
APM: Backup in metric insertion
Response times are back to normal. We're continuing to work w/our database on the cause.
Dec 7, 14:45-20:32 MST
Server Monitoring - Short blip in connectivity
We just saw a short blip in connectivity to scoutapp.com impacting server monitoring. Traffic is back to normal levels. We're communicating with the data center to ensure it is resolved.
Dec 4, 19:09 MST

November 2015

Application Monitoring: errors viewing metrics
Metrics have been importing normally for six hours. Our team is working on several infrastructure updates to prevent the data consistency issues from appearing again.
Nov 9, 09:40-15:33 MST
APM: large gaps in data
We've restored all the historical data that was temporarily not showing up on charts. If you have any questions about this incident, drop us a support email.
Nov 2, 10:26 - Nov 5, 09:10 MST
Upstream High Latency and Packet Loss making Scoutapp.com unavailable
This incident has been resolved.
Nov 1, 09:24-12:05 MST

October 2015

APM: Some gaps in reported data over the last 12 hours
Unfortunately, due to a combination of under-provisioned AWS infrastructure and an increase in load, we lost some data over the weekend. You will see flatlines in your charts during those periods of data loss. Even though we're still in Tech Preview, we make every effort for incidents like this not to happen. On the upside, we learned a lot about capacity & breaking points, and it's far better to learn it now, in the preview phase, before we go into General Availability. Thank you for your patience while we work the kinks out. We're making APM better, every day!
Oct 24, 11:03 - Oct 25, 22:27 MST
APM has spotty availability due to database issues
This incident has been resolved.
Oct 24, 12:24 - Oct 25, 22:18 MST
[Scheduled] Database upgrade
Completed.
Oct 25, 13:21-13:57 MST
APM was offline for several minutes
Closing this issue and opening a new one for data re-sync.
Oct 24, 08:51-11:02 MST
[Scheduled] RDS database upgrade
The scheduled maintenance has been completed.
Oct 21, 22:00-22:30 MST
Connectivity issues from Azure cloud western Europe to scoutapp.com
High packet loss within European networks caused issues with scout clients reporting from Oct 15 20:00 to Oct 16 12:15 UTC.
Oct 16, 03:07-10:17 MST
APM has been taken offline to investigate a database issue
This incident has been resolved.
Oct 14, 15:50-16:10 MST
Database connection issues
An error in our last deploy caused brief database connection issues
Oct 9, 12:32-12:41 MST
APM UI down
The issue was caused by a bad connection between our app servers and our ElasticSearch provider. We are working to make this connection more robust.
Oct 8, 12:13-12:35 MST
Issues with displaying metrics in app monitoring
Things look stable now. We'll continue to work w/the Influx team investigating the cause.
Oct 4, 20:13-20:32 MST

September 2015

Database upgrade
This incident has been resolved.
Sep 30, 10:38-11:36 MST
Influx connection issues
This incident has been resolved.
Sep 28, 17:38-17:48 MST
App Monitoring Unavailable
We've tracked it down to a connectivity issue between the Influx nodes. We're working on handling these connectivity issues more gracefully.
Sep 28, 13:51-14:18 MST
Seeing Influx issues with our backend data store (APM only). Server Monitoring is not affected.
This incident has been resolved.
Sep 22, 11:02-14:19 MST
503 errors accessing slow requests in app monitoring
We're seeing better performance after upgrading our Elasticsearch Cluster.
Sep 21, 15:00-21:11 MST
We have some slow queries preventing access to the Scout site.
This incident has been resolved.
Sep 14, 15:01-15:05 MST
scoutapp.com inaccessible
We're back. Sorry for the issues - we're working to make these db updates less impactful.
Sep 14, 15:01-15:03 MST
scoutapp.com inaccessible
We were rolling out some updated queries for our alerts API, and these caused issues when 100k+ alerts were in the queries. We've rolled this back and are working on an updated query.
Sep 14, 10:29-10:37 MST
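The incident above does not say what the updated query looks like. One common way to keep a query bounded when 100k+ records are involved is to process them in fixed-size batches. The sketch below is an illustration only: `chunked` and the batch size are hypothetical and are not taken from Scout's code.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch (for example, a list of alert IDs) can then be passed to its own bounded query rather than one query covering every alert at once.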

August 2015

Scoutapp.com inaccessible
We're back online.
Aug 19, 11:04-11:11 MST
Scoutapp.com unreachable
The logs are inconclusive at this time - it appears as though the host just locked up, but all the checks we've run look good. We will keep an eye on it overnight, and continue looking into the logs tomorrow morning for additional information.
Aug 10, 22:09-22:46 MST

July 2015

Repo server down
and we're back! Sorry for the issues.
Jul 27, 18:35-19:00 MST

June 2015

External Notifications Outage
This is resolved: the outage was caused by alerts w/very large text fields not fitting into our background job queue. We increased the column size so it's far greater than the possible alert text.
Jun 17, 09:53-12:18 MST
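The fix described above was to enlarge the column. A complementary guard, which the incident report does not describe, is to cap the payload size before it is enqueued. The following is a minimal sketch assuming a byte-limited text column and UTF-8 encoding; `fit_to_column` and its limit are hypothetical names for illustration.

```python
def fit_to_column(text: str, max_bytes: int) -> str:
    """Truncate text so its UTF-8 encoding fits within max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing bytes may cut a multi-byte character in half;
    # errors="ignore" drops that trailing partial character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

Truncation loses part of the alert text, so whether to truncate, reject, or store the text elsewhere is a design choice that depends on how the alert is consumed.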

May 2015

Increasing 503s
API calls to the endpoint have been enabled again.
May 18, 09:52-20:58 MST

April 2015

503 errors accessing scoutapp.com
Reporting should be back to normal. Very sorry for the interruption - we'll be looking at taking further steps to prevent this from happening.
Apr 16, 18:08-19:09 MST
Shared dashboards errors
We renamed some fields and forgot to update them in our migration. This is fixed now, and shared dashboards should be functioning normally.
Apr 16, 15:44-16:27 MST
scoutapp.com unreachable
Our datacenter reported it was a DDoS attack, and that they mitigated the issue.
Apr 6, 09:37-12:10 MST

March 2015

No incidents reported for this month.

February 2015

No incidents reported for this month.

January 2015

Datacenter outage; possibly related to today's switch upgrade
This incident has been resolved.
Jan 11, 11:14-14:50 MST
Scheduled Maintenance
Update from Rails Machine: 10:50AM ET network connectivity has been restored. We apologize for the extended outage. http://status.railsmachine.com/
Jan 11, 08:14-09:04 MST

December 2014

Isolated connectivity issues
Connectivity continues to be fine after traffic was no longer routed through GTT. As mentioned in an earlier update, we're considering plans to resolve these isolated connectivity issues faster.
Dec 27, 12:39 - Dec 29, 09:01 MST
Continued isolated connectivity issues
No update on root cause, but connectivity has been established for the past 7 hours. We'll mark this as resolved and create a new incident if things change.
Dec 26, 23:49 - Dec 27, 07:20 MST
Isolated connectivity issue (Seattle area)
Connectivity from the affected servers has been holding. Message from our host, RailsMachine: > "I haven't heard anything back from GNAX/Zcolo yet but I ran a few online ping tests (https://www.site24x7.com/public/t/results-1419656583928.html and https://cloudmonitor.ca.com/en/ping.php) and they either reported everything fine or the errors that reports gave the same results for google.com. This indicates either a problem with their testing tool or a localized network issue in the locations that failed to connect to the site. We haven't received any reports from other customers so this is likely a localized issue outside of our control. We also haven't received alerts for any sites that we monitor using pingdom." Marking this as resolved.
Dec 26, 22:02-22:24 MST
Work queues backed up; role-based plugin installs will be delayed.
This incident has been resolved.
Dec 18, 13:01-13:56 MST

November 2014

No incidents reported for this month.

October 2014

504 errors
Our hosting provider reports that other customers are reporting similar issues and a network issue is suspected. We aren't seeing 504 errors anymore - looks like a temporary networking hiccup at the datacenter. If things change, we'll update.
Oct 1, 10:56-11:21 MST

September 2014

Degraded performance and occasional timeouts during software upgrade
Scout is fully operational and performance is good. The timeouts were due to an overload on service restart.
Sep 9, 17:28-17:52 MST

August 2014

Intermittent network connectivity
Networking issues have been resolved by the datacenter.
Aug 22, 06:26-09:09 MST

July 2014

No incidents reported for this month.

June 2014

Deployment issue is causing 1-2 minute outage
This incident has been resolved.
Jun 30, 12:02-12:04 MST
Spike in timeouts from PagerDuty's API
According to our logs, all recent calls to the PagerDuty API have been successful.
Jun 4, 00:09-00:24 MST

May 2014

No incidents reported for this month.

April 2014

Invalid No Data Alerts
This incident has been resolved.
Apr 28, 10:25-13:35 MST
500 errors
We had an issue compiling assets - all fixed. Alerts + metrics were collected during this period (just impacted the UI).
Apr 4, 16:42-16:45 MST
503 errors accessing scoutapp.com
As of 9:25 Mountain Time, all accounts are reporting data into Scout.
Apr 2, 19:41-21:49 MST

March 2014

Apparent network issues. We are in contact with our hosting provider.
Message from our host: "Everything appears to be back online. We are pursuing a Root Cause Analysis from our datacenter, and will post that on our status page as soon as possible. We are also requesting that they increase communication during future outages or technical problems." We (Scout) will tweet the root cause analysis when it becomes available as well. I apologize for this outage. It doesn't represent the level of reliability we aim for.
Mar 31, 12:30-13:37 MST
503 and 504 errors
We've brought down the load on our metric server and the timeouts have resolved. We will look into optimizing/safeguarding the offending logic to prevent this issue going forward.
Mar 26, 21:14-21:50 MST
Intermittent 503 errors
503 errors are resolved. The root cause was a significant volume of disk activity on a metric server due to operations stuck in a loop. This slowed down the rest of Scout. We implemented a quick fix (an edge case resulted in this behavior) and are working on a longer-term fix.
Mar 11, 17:51-19:18 MST

February 2014

Brief outage
Brief blip in service caused by upgrading database servers.
Feb 25, 09:47-10:00 MST
Database Upgrade
Syncing up data was taking too long, so we ended up cancelling the outage period. We'll create a new incident for the next round.
Feb 16, 10:28-22:20 MST

January 2014

503 errors accessing scoutapp.com
An overly aggressive backup - things have returned to normal.
Jan 28, 10:06-10:18 MST
503 Errors
Service is fully restored. A metrics server was under high disk utilization. We rebalanced accounts and disk utilization is back to normal.
Jan 26, 17:22-18:03 MST
503 errors
Performance appears to have stabilized as the metric storage I/O activity has dropped back down to normal.
Jan 22, 14:40-14:54 MST

December 2013

emergency maintenance on one of our servers
Thanks again for your patience if you were affected by this issue. Marking the issue "resolved".
Dec 10, 12:48-16:38 MST
Email deliverability problem to Google mail
Gmail SPAM issues appear to be resolved.
Nov 7, 14:39 - Dec 5, 09:38 MST

November 2013

Alert Delays
Things are back to normal. A collection of servers were mistakenly generating a large volume of alerts and we've throttled the servers.
Nov 19, 17:11-17:37 MST

October 2013

scoutapp.com Unreachable
scoutapp.com is back up - we manually switched our load balancer to another server. We're investigating why automatic failover didn't work.
Oct 29, 10:13-10:18 MST
Connectivity issues to scoutapp.com
...and we're back. Sorry for the issues - data center issues are resolved.
Oct 14, 08:01-08:17 MST

September 2013

No incidents reported for this month.

August 2013

No incidents reported for this month.

July 2013

No incidents reported for this month.

June 2013

Connectivity issues from AWS East Region
This appears to have resolved itself. If you are running into issues, shoot an email to support@scoutapp.com.
Jun 10, 14:02-17:34 MST

May 2013

No incidents reported for this month.

April 2013

site outage
This incident has been resolved.
Apr 28, 21:28 - Apr 29, 01:24 MST
PagerDuty integration may be dropping some alerts
This appears to have been a temporary network issue between our servers and PagerDuty's API endpoints. We have added monitoring so we can identify similar issues and react more quickly if this should happen again.
Apr 22, 18:27-19:43 MST
GNAX facility temporarily had high packet loss.
All access is restored. Send us an email if you are still experiencing any issues.
Apr 15, 10:38-11:34 MST
503 errors on scoutapp.com
Our hosting provider has identified a hardware issue as the root cause. We'll be migrating the server to a new host this week.
Apr 8, 07:41-10:09 MST
Isolated agent connectivity issues
Our hosting provider was unable to identify any issues on their end, but all servers are reporting. If you are seeing similar issues, email us at support@scoutapp.com.
Mar 29, 21:38 - Apr 8, 08:34 MST

March 2013

No incidents reported for this month.

February 2013

No incidents reported for this month.