Summary
Our server monitoring service is hosted by Rails Machine in a Zayo datacenter in Atlanta, Georgia. On the evening of 2016-11-11, the entire datacenter lost utility power. In this scenario, failover to the UPS batteries should happen immediately and transparently while the on-site diesel generators spin up to provide longer-term power until utility service is restored. This process failed, and as a result our service was offline for 17 hours while power was restored and we worked through database corruption caused by the outage. In eight years of operating Scout, this was by far our worst outage.
At this time we are waiting for a report from Zayo to understand why the failover did not work correctly.
We understand that you rely on Scout to be resilient and operational at all times, since you depend on our service to alert you to critical, time-sensitive issues in your own infrastructure. We are evaluating options for multi-datacenter colocation that would allow us to handle datacenter-wide incidents in the future.
Timeline
- 2016-11-11 8:50PM MST: Scout's monitoring systems indicate outage. We contact our datacenter hosting provider (Rails Machine).
- 2016-11-11 8:56PM MST: Rails Machine investigates apparent network connectivity loss, posts Status Page incident.
- 2016-11-11 8:56PM MST: Scout posts incident on Status Page.
- 2016-11-11 11:33PM MST: Scout learns from Rails Machine that the outage is due to all power being lost at the datacenter and the backup generators failing.
- 2016-11-12 5:26AM MST: Power is restored, Scout's servers are operational, though the Scout UI/website is still offline. Scout's Multi-Master MySQL database is in an inoperable state due to the abrupt power loss.
- 2016-11-12 7:06AM MST: MySQL data integrity checks indicate data corruption. As a failsafe, backups of the databases taken prior to the power outage are prepared as a standby. Work begins to correct the MySQL data corruption.
- 2016-11-12 1:17PM MST: MySQL corruption repaired. Scout brings the site online, but is not yet ingesting new metrics.
- 2016-11-12 2:15PM MST: Scout turns ingestion back on. We are ingesting new metrics, but there is some unusual latency while services spin up.
- 2016-11-12 3:10PM MST: Scout’s services are back up and operating normally.
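The timeline does not describe the exact repair procedure. For context, a common approach to InnoDB corruption after abrupt power loss is to start mysqld in forced-recovery mode, dump the data, and reload it into a clean instance; this is a hypothetical sketch of the relevant setting, not Scout's actual runbook:

```ini
# /etc/mysql/my.cnf -- temporary crash-recovery setting (hypothetical example)
[mysqld]
# innodb_force_recovery lets mysqld start despite corrupted InnoDB pages.
# Start at 1 and raise only as far as needed; levels 4-6 can themselves
# discard data, so the server should be used read-only, just long enough
# to export the data with mysqldump.
innodb_force_recovery = 1
```

Once the server starts, the data is typically exported with `mysqldump` and restored into a freshly initialized data directory. If recovery fails at every level, restoring from pre-outage backups (which Scout had standing by) is the fallback.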