[Server Monitoring] Incorrect alert routing/Alerts not being sent out
Incident Report for Scout
Postmortem

Server Monitoring 12/31/2016 Postmortem

At 5:35PM MDT, the database table that stores alerts hit the auto-increment limit of its primary key datatype. As a result, new alerts were either not created as they should have been or, in some cases, created and associated with the wrong account. Because the alerts table is huge, modifying it in place was not an option. Instead, we began a sequence of altering the table on a MySQL read-only instance, switching multi-master to the secondary, and then modifying the primary database. Shortly thereafter, we temporarily disabled notifications for all accounts to minimize the impact of the alterations.
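For readers unfamiliar with the failure mode, the ceiling of an auto-increment key follows directly from the width of its integer type. The sketch below is illustrative only: this report does not say which datatype the alerts table used, and the helper name `auto_increment_limit` is made up for the example.

```python
# Storage size in bytes of each MySQL integer type.
MYSQL_INT_BYTES = {
    "TINYINT": 1,
    "SMALLINT": 2,
    "MEDIUMINT": 3,
    "INT": 4,
    "BIGINT": 8,
}


def auto_increment_limit(column_type: str, unsigned: bool = False) -> int:
    """Return the largest value an auto-increment column of this type can hold."""
    bits = 8 * MYSQL_INT_BYTES[column_type.upper()]
    # Signed types give up one bit to the sign; unsigned types use all bits.
    return 2**bits - 1 if unsigned else 2 ** (bits - 1) - 1


if __name__ == "__main__":
    for name in MYSQL_INT_BYTES:
        print(name, auto_increment_limit(name), auto_increment_limit(name, unsigned=True))
```

Once the counter reaches this value, the column cannot produce a new, larger id, which is consistent with the failed and misrouted alerts described above.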

By 8:37PM MDT, the alterations were complete. Unfortunately, a glitch in the multi-master switchover process resulted in a 7-minute outage from 8:58PM to 9:07PM MDT. The glitch was caused by a duplicate mmm_mond process, which repeatedly killed MySQL's replication thread and made the database unstable.
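A check for this condition could look roughly like the sketch below. It is not our actual monitoring code: the function names, the `ps` invocation, and the rule of matching on the first two tokens of the command line (to cover both a direct launch and a launch through an interpreter such as `perl`) are all assumptions made for illustration.

```python
import os
import subprocess


def list_command_lines() -> list[str]:
    """Return the command line of every running process (assumes a POSIX `ps`)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "args="], capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def count_instances(command_lines: list[str], name: str = "mmm_mond") -> int:
    """Count processes whose executable or script basename equals `name`."""
    count = 0
    for line in command_lines:
        # Inspect only the first two tokens: the program and, if the program
        # is an interpreter, the script it was asked to run.
        tokens = line.split()[:2]
        if any(os.path.basename(token) == name for token in tokens):
            count += 1
    return count


def classify(command_lines: list[str], name: str = "mmm_mond", expected: int = 1) -> str:
    """Return 'ok', 'missing', or 'duplicate' for the monitored process."""
    found = count_instances(command_lines, name)
    if found == expected:
        return "ok"
    return "missing" if found < expected else "duplicate"


if __name__ == "__main__":
    print(classify(list_command_lines()))
```

Anything other than `ok` would be a reason to alert before attempting a switchover.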

What We Have Done to Ensure This Does Not Happen Again

  1. We have added monitoring and alerting on MySQL Multi-master's mmm_mond process, to ensure that only one process is running at a time.
  2. We have audited all tables in our database to ensure that no other tables are close to exceeding their primary key auto-increment limit. While none are currently close, there are two tables at 50% of their limit, so we will be migrating these tables proactively during an upcoming scheduled maintenance window.
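
An audit like the one in item 2 can be approximated by comparing each table's next auto-increment value with the maximum of its key type. The sketch below assumes those two numbers have already been collected (for example, MySQL exposes the next value in the `AUTO_INCREMENT` column of `information_schema.TABLES`). The table names, the `TableUsage` type, and the 50% default threshold are illustrative and not taken from our schema.

```python
from typing import NamedTuple


class TableUsage(NamedTuple):
    table: str    # table name (hypothetical in the example below)
    next_id: int  # next auto-increment value the table will hand out
    max_id: int   # largest value the primary key type can hold


def usage_fraction(row: TableUsage) -> float:
    """Fraction of the key space already consumed."""
    return row.next_id / row.max_id


def tables_at_risk(rows: list[TableUsage], threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (table, fraction) pairs at or above the threshold, worst first."""
    flagged = [(r.table, usage_fraction(r)) for r in rows if usage_fraction(r) >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    int_max = 2**31 - 1  # signed INT, used here only as an example
    sample = [
        TableUsage("alerts_example", int_max, int_max),
        TableUsage("events_example", 2**30, int_max),
        TableUsage("users_example", 1000, int_max),
    ]
    for table, fraction in tables_at_risk(sample):
        print(f"{table}: {fraction:.0%} of key space used")
```

Running a check like this on a schedule, rather than once, would surface a table approaching its limit well before inserts start to fail.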
Posted Jan 04, 2017 - 13:20 MST

Resolved
We have corrected the underlying database issue causing the incorrectly routed alerts. Alerts should be back to normal for all accounts.
Posted Dec 31, 2016 - 21:44 MST
Identified
Alerts are not being routed correctly. We have identified the problem, and alerts have been disabled for all accounts while the fix is implemented.
Posted Dec 31, 2016 - 19:58 MST