While working with our time series database vendor on ways to improve performance, we implemented a change that caused silent write failures, ultimately resulting in a roughly one-hour gap in customer data. Although our metric ingestion pipeline allows us to replay data from days earlier, we decided that in this particular case, because of the bug, there was more benefit in not replaying all of the data.
Metric data was dropped for most customers from 2016-09-23 21:25 UTC to 2016-09-23 22:30 UTC.
We triggered a rarely encountered bug in our time series database.
We have been working with our time series database vendor to improve performance over the last few days. At our vendor's suggestion, we made a change to our shard duration policy, and we tested the change against copies of production data before applying it to customer data.
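For illustration only: the post does not name the database or the exact statement we ran, but in an InfluxDB-style time series database, a shard duration change on a retention policy might resemble the following sketch. The database, policy, and duration names here are hypothetical, not our actual production values.

```python
# Hypothetical sketch of an InfluxDB-style shard duration change.
# All names and values below are illustrative placeholders.
def alter_shard_duration(database: str, policy: str, shard_duration: str) -> str:
    """Build an InfluxQL-style ALTER statement that changes only shard duration."""
    return (
        f'ALTER RETENTION POLICY "{policy}" ON "{database}" '
        f"SHARD DURATION {shard_duration}"
    )

statement = alter_shard_duration("app_metrics", "default", "1d")
print(statement)
```

A statement like this changes how the database groups points into on-disk shards; it does not itself delete data, which is part of why the subsequent dropped writes were surprising.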
21:25 UTC: At the suggestion of our vendor, we altered the shard duration for production databases to improve performance. After the change, the database silently dropped incoming time series data for nearly all customer apps, even though its logs showed successful writes.
21:45 UTC: We identified that many customer dashboards were not returning the most recent writes and began investigating the root cause.
22:10 UTC: In direct communication with our vendor, we began investigating why queries returned no recent results even though writes were reporting success.
22:30 UTC: We paused metric ingestion to the databases, holding incoming metrics in a queue.
23:35 UTC: We implemented a fix by reverting the exact retention policy change and skipping writes for metrics that had already been dropped.
23:45 UTC: We began backfilling customer data from 22:30 UTC to the present.
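The pause-and-backfill steps above can be sketched as follows. This is an illustrative model, not our production pipeline: while ingestion is paused, incoming metric points accumulate in a queue, and once the fix is in place the queue is drained into storage in arrival order.

```python
from collections import deque

class MetricPipeline:
    """Toy model of pausing ingestion and backfilling from a queue."""

    def __init__(self):
        self.paused = False
        self.queue = deque()  # holds points while ingestion is paused
        self.stored = []      # stands in for the time series database

    def ingest(self, point):
        if self.paused:
            self.queue.append(point)  # keep the point; write it later
        else:
            self.stored.append(point)

    def backfill(self):
        """Drain queued points into storage in arrival order."""
        while self.queue:
            self.stored.append(self.queue.popleft())

pipeline = MetricPipeline()
pipeline.ingest({"t": "21:20", "v": 1})
pipeline.paused = True                   # 22:30 UTC: ingestion paused
pipeline.ingest({"t": "22:35", "v": 2})  # queued rather than written
pipeline.paused = False                  # fix deployed
pipeline.backfill()                      # 23:45 UTC: backfill begins
print(len(pipeline.stored))              # 2
```

Because the queue preserves arrival order, backfilled points land in storage in the same sequence customers originally sent them.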
00:15 UTC (2016-09-24): Most customers have up-to-date metrics on their dashboards. Customers running legacy (<= 1.3) versions of scout_apm may continue to see dashboard lag.
05:20 UTC: Customers running legacy (<= 1.3) versions of scout_apm have up-to-date data on their dashboards.