Summary

After working with our time series database provider on ways to improve performance, we implemented a change that caused silent write failures and ultimately left a one-hour gap in customer data. Although our metric ingestion pipeline can replay data from days ago, we decided in this case that, because of the bug we hit, there was more benefit in not replaying all of the data.

Impact

Metric data was dropped for most customers from 2016-09-23 21:25 UTC to 2016-09-23 22:30 UTC.

Root Cause

We triggered a rarely encountered bug in our time series database.

Details

What changed

We have been working with our time series database vendor to improve performance over the last few days. At the suggestion of our vendor, we made a change to our shard duration policy. We tested the change against copies of production data before applying the change to customer data.
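The failure mode in this incident was a write that the database acknowledged but that never showed up in queries. A minimal sketch of a read-after-write check that could surface that condition is shown here. It is illustrative only and is not the test we ran: `verify_write_visible`, `HealthyStore`, and `SilentlyDroppingStore` are hypothetical names, and the two store classes are in-memory stand-ins rather than a real database client.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, List


def verify_write_visible(
    write_point: Callable[[str, datetime, float], bool],
    query_times: Callable[[str, datetime], Iterable[datetime]],
    measurement: str = "canary",
) -> bool:
    """Write one canary point, then confirm that a query returns it."""
    ts = datetime.now(timezone.utc).replace(microsecond=0)
    if not write_point(measurement, ts, 1.0):
        return False  # the write failed loudly, which is the easy case
    # A silent drop is an acknowledged write that never appears in queries.
    return ts in set(query_times(measurement, ts))


class HealthyStore:
    """Hypothetical in-memory stand-in for a working time series database."""

    def __init__(self) -> None:
        self.points: Dict[str, Dict[datetime, float]] = {}

    def write(self, measurement: str, ts: datetime, value: float) -> bool:
        self.points.setdefault(measurement, {})[ts] = value
        return True

    def query(self, measurement: str, since: datetime) -> List[datetime]:
        return [t for t in self.points.get(measurement, {}) if t >= since]


class SilentlyDroppingStore(HealthyStore):
    """Stand-in that acknowledges every write but stores nothing."""

    def write(self, measurement: str, ts: datetime, value: float) -> bool:
        return True
```

Run against each stand-in, the check passes for `HealthyStore` and fails for `SilentlyDroppingStore`, even though both report the write as successful.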

Timeline

21:25 UTC: We altered the shard duration for production databases to improve performance at the suggestion of our vendor. After the change, time series writes for nearly all customer apps were silently dropped, although the database logs showed successful writes.

21:45 UTC: We identified that many customer dashboards were not returning the most recent writes. Investigation into the root cause began.

22:10 UTC: In direct communication with our vendor, we began investigating why queries returned no results although writes were reporting success.

22:30 UTC: We paused metric ingestion to the databases, keeping incoming metrics in the queue.

23:35 UTC: A fix was implemented by reverting the exact retention policy change and skipping the metric writes that had already been dropped.

23:45 UTC: Backfill of customer data from 22:30 UTC to the present began.

00:15 UTC (2016-09-24): Most customers had up-to-date metrics on dashboards. Customers running legacy (<=1.3) scout_apm could still see dashboard lag.

05:20 UTC (2016-09-24): Customers running legacy (<=1.3) scout_apm had up-to-date data on dashboards.
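The pause and backfill steps in the timeline can be sketched as a queue that holds metrics while writes are paused and replays them from a cutoff on resume. This is a simplified illustration under assumptions, not our actual ingestion pipeline: `BufferedIngest`, `Metric`, and `replay_from` are hypothetical names, and the sink here is just a Python list.

```python
from collections import deque
from datetime import datetime, timezone
from typing import Callable, Deque, List, Tuple

Metric = Tuple[datetime, str, float]  # (timestamp, name, value)


class BufferedIngest:
    """Forwards metrics to a sink, or holds them in a queue while paused."""

    def __init__(self, sink: Callable[[Metric], None]) -> None:
        self.sink = sink
        self.paused = False
        self.queue: Deque[Metric] = deque()

    def submit(self, metric: Metric) -> None:
        if self.paused:
            self.queue.append(metric)  # keep it for later replay
        else:
            self.sink(metric)

    def pause(self) -> None:
        self.paused = True

    def resume(self, replay_from: datetime) -> int:
        """Drain the queue in arrival order, skipping points older than replay_from."""
        replayed = 0
        while self.queue:
            metric = self.queue.popleft()
            if metric[0] >= replay_from:
                self.sink(metric)
                replayed += 1
        self.paused = False
        return replayed


# Usage: pause, queue two points, then backfill from the 22:30 UTC cutoff.
stored: List[Metric] = []
ingest = BufferedIngest(stored.append)
cutoff = datetime(2016, 9, 23, 22, 30, tzinfo=timezone.utc)
ingest.pause()
ingest.submit((datetime(2016, 9, 23, 22, 29, tzinfo=timezone.utc), "cpu", 0.4))
ingest.submit((datetime(2016, 9, 23, 22, 31, tzinfo=timezone.utc), "cpu", 0.5))
replayed = ingest.resume(replay_from=cutoff)
```

In this sketch only the point at or after the cutoff is replayed. The earlier point is discarded, which mirrors the decision not to replay the window in which writes had already been dropped.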

Posted Sep 24, 2016 - 08:19 MDT

Resolved

Ingestion has caught up for agents sending data in JSON format. We're still catching up for agents sending our legacy data format.

Posted Sep 23, 2016 - 18:15 MDT

Update

Ingestion is catching up, but we've lost data from 21:24 to 22:24 UTC (around 1 hour).

Posted Sep 23, 2016 - 18:09 MDT

Identified

We ran into a bug* in our metric storage engine with how shard duration updates were handled. We've rolled back the change and verified we're able to record fresh data. We're working on restarting our ingestion process.

* - https://github.com/influxdata/influxdb/issues/5878

Posted Sep 23, 2016 - 17:44 MDT

Update

We modified the shard duration in our metric storage, and databases with more than one shard haven't returned data since the change. We're working with our metric database provider on the issue.

Posted Sep 23, 2016 - 16:04 MDT

Investigating

We're seeing blocks of missing data on a number of apps and are investigating.

Posted Sep 23, 2016 - 15:48 MDT