[Server Monitoring] Connectivity issues to scoutapp.com
Incident Report for Scout
Postmortem

Summary

Our server monitoring service is hosted by Rails Machine in a Zayo datacenter in Atlanta, Georgia. On the evening of 2016-11-11, the entire datacenter lost utility power. In this scenario, failover to the UPS batteries should happen immediately and transparently while the on-site diesel generators start up to provide longer-term power until utility power is restored. This process failed, and as a result our service was offline for 17 hours while power was restored and we worked through database corruption issues caused by the outage. In eight years of Scout, this was by far our worst outage.

At this time we are waiting for a report from Zayo to understand why the failover did not work correctly.

We understand that you rely on Scout to be resilient and operational at all times, because you depend on our service to alert you to critical, time-sensitive issues in your own infrastructure. We are evaluating options for multi-datacenter colocation that would allow us to handle datacenter-wide incidents in the future.

Timeline

  • 2016-11-11 8:50PM MST: Scout's monitoring systems indicate outage. We contact our datacenter hosting provider (Rails Machine).
  • 2016-11-11 8:56PM MST: Rails Machine investigates apparent network connectivity loss, posts Status Page incident.
  • 2016-11-11 8:56PM MST: Scout posts incident on Status Page.
  • 2016-11-11 11:33PM MST: Scout learns from Rails Machine that the outage is due to all power being lost at the datacenter and the backup generators failing.
  • 2016-11-12 5:26AM MST: Power is restored and Scout's servers are operational, though the Scout UI/website is still offline. Scout's Multi-Master MySQL database is in an inoperable state due to the abrupt power loss.
  • 2016-11-12 7:06AM MST: MySQL data integrity checks indicate data corruption. As a failsafe, backups of the databases taken prior to the power outage are prepared as a standby. Work begins to correct MySQL data corruption.
  • 2016-11-12 1:17PM MST: MySQL corruption repaired. Scout brings site online, but not yet ingesting new metrics.
  • 2016-11-12 2:15PM MST: Scout turns ingestion back on. We are ingesting new metrics, but there is some unusual latency while services spin up.
  • 2016-11-12 3:10PM MST: Scout’s services are back up and operating normally.
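
For readers who want to check the elapsed times implied by the timeline above, the arithmetic can be sketched as follows. This is an illustrative calculation only, not part of Scout's tooling; the event labels and the `elapsed_minutes` helper are names chosen here for the example, and the timestamps are copied from the timeline (all MST).

```python
from datetime import datetime

# Format of the timeline timestamps, e.g. "2016-11-11 8:50PM".
FMT = "%Y-%m-%d %I:%M%p"

# Timestamps copied from the timeline (all MST). The labels are
# shorthand chosen for this example, not official milestone names.
events = {
    "outage detected": "2016-11-11 8:50PM",
    "power restored": "2016-11-12 5:26AM",
    "site back online": "2016-11-12 1:17PM",
    "ingestion resumed": "2016-11-12 2:15PM",
    "fully recovered": "2016-11-12 3:10PM",
}


def elapsed_minutes(start: str, end: str) -> int:
    """Return whole minutes between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


start = events["outage detected"]

# Minutes from first detection to each later milestone.
durations = {
    name: elapsed_minutes(start, ts)
    for name, ts in events.items()
    if name != "outage detected"
}

for name, minutes in durations.items():
    print(f"{name}: {minutes // 60}h {minutes % 60:02d}m after detection")
```

Measured from first detection, this gives 8h 36m until power was restored, 16h 27m until the site was back online, 17h 25m until ingestion resumed, and 18h 20m until services were operating normally.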
Posted about 2 years ago. Nov 16, 2016 - 12:58 MST

Resolved
This incident has been resolved.
Posted about 2 years ago. Nov 12, 2016 - 21:07 MST
Monitoring
Systems are back to normal. Data has been ingesting normally since 2:15pm Mountain. We will continue to monitor the situation, and will supply a post-mortem as we learn more from our host and datacenter.
Posted about 2 years ago. Nov 12, 2016 - 15:10 MST
Update
We've started ingesting data again, but expect delays while the system warms up.
Posted about 2 years ago. Nov 12, 2016 - 14:37 MST
Update
We are coming back online. Historical data is visible, but new data is not yet being ingested.
Posted about 2 years ago. Nov 12, 2016 - 13:54 MST
Update
We're working to restore our MySQL database, which was corrupted during the power loss.
Posted about 2 years ago. Nov 12, 2016 - 13:13 MST
Update
We are fixing some database corruption caused by the power outage.
Posted about 2 years ago. Nov 12, 2016 - 08:28 MST
Update
We've regained SSH access to our servers and are working to bring services back online.
Posted about 2 years ago. Nov 12, 2016 - 05:26 MST
Identified
There isn't additional information at this time. We will continue to update.
Posted about 2 years ago. Nov 12, 2016 - 01:18 MST
Update
From Rails Machine:

There was a datacenter-wide power outage and generator failure at the datacenter. We are actively working with Zayo to restore services as quickly as possible.
Posted about 2 years ago. Nov 11, 2016 - 22:47 MST
Update
Our data center (Rails Machine) hasn't posted an additional update at this time. We'll update this incident hourly.
Posted about 2 years ago. Nov 11, 2016 - 22:22 MST
Update
Our datacenter has created an incident: http://status.railsmachine.com/incidents/ds8qfydyqmvg
Posted about 2 years ago. Nov 11, 2016 - 20:58 MST
Investigating
We're investigating connectivity issues for scoutapp.com (server monitoring).
Posted about 2 years ago. Nov 11, 2016 - 20:56 MST