At 5:35PM MDT, the database table that stores our alerts hit the auto-increment limit for its primary key datatype. As a result, new alerts were either not created as they should have been or, in some cases, created and associated with the wrong account. Because the alerts table is huge, modifying it in place was not an option. Instead, we began a sequence of altering the table on a read-only MySQL instance, switching multi-master to the secondary, and then modifying the primary database. Shortly thereafter, we temporarily disabled notifications for all accounts to minimize the impact of the alterations.
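For readers who want to watch for this class of failure, here is a minimal sketch of a headroom check for an auto-increment key. The integer ranges are MySQL's documented limits. The function names and the 20% threshold are illustrative assumptions, and this post does not say which datatype the alerts table used. The next auto-increment value can be read from the `AUTO_INCREMENT` column of `information_schema.TABLES`.

```python
# Maximum values for MySQL integer column types: (signed, unsigned).
MYSQL_INT_MAX = {
    "TINYINT": (2**7 - 1, 2**8 - 1),
    "SMALLINT": (2**15 - 1, 2**16 - 1),
    "MEDIUMINT": (2**23 - 1, 2**24 - 1),
    "INT": (2**31 - 1, 2**32 - 1),
    "BIGINT": (2**63 - 1, 2**64 - 1),
}


def headroom(column_type, next_auto_increment, unsigned=False):
    """Return the fraction of the key space still unused (0.0 to 1.0)."""
    limit = MYSQL_INT_MAX[column_type.upper()][1 if unsigned else 0]
    # Values from next_auto_increment up to and including limit are still free.
    remaining = max(limit - next_auto_increment + 1, 0)
    return remaining / limit


def needs_attention(column_type, next_auto_increment, unsigned=False, threshold=0.2):
    """Hypothetical alerting rule: flag keys with less than `threshold` headroom."""
    return headroom(column_type, next_auto_increment, unsigned) < threshold
```

A signed `INT` key whose next value is 2,000,000,000 has under 7% of its range left, so `needs_attention("INT", 2_000_000_000)` flags it. The same value in a `BIGINT` column is nowhere near the limit.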
By 8:37PM MDT, alterations were complete. Unfortunately, a glitch in the multi-master switchover process resulted in a 7-minute outage from 8:58PM to 9:07PM MDT. The glitch was caused by a duplicate mmm_mond process, which repeatedly killed MySQL's replication thread and destabilized the database.
mmm_mond process, to ensure that only one process is running at a time.
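One common way to guarantee that only a single instance of a daemon runs is an exclusive, non-blocking lock on a lock file. The sketch below illustrates that general technique on Unix-like systems. It is not the safeguard described in this post, nor how mmm_mond itself works. The function name and the lock path are hypothetical.

```python
import fcntl
import os


def acquire_single_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    lock_file = open(lock_path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the holder.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    # Record the holder's PID so an operator can see which process owns the lock.
    lock_file.seek(0)
    lock_file.truncate()
    lock_file.write(str(os.getpid()))
    lock_file.flush()
    return lock_file
```

A wrapper script would call this at startup and exit if it returns `None`. The lock is released when the file is closed or the process exits, so a crashed process cannot leave a stale lock behind.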