Missing data: force TSDB corruption scan
-
The plug was pulled on the computer. Upon plugging it back in, some point values are plotting flatlines. We are using the TSDB.
In env.properties, db.nosql.runCorruptionOnStartupIfDirty=false, so no corruption scan happened when Mango restarted. Changing that property to true and restarting Mango does not go back and fix the corruption, because it was shut down cleanly. A kill -9 of Mango and then restarting does initiate the corruption scan, but I still have flatlines in the logs for the time between the computer being plugged back in and my cleanly restarting Mango.
Is there a way to force Mango to go back and run a corruption scan on the whole TSDB database so the data from the time period between (plug pulled) and (Mango cleanly restarted) can be plotted, or is that data not even recorded since part of the database for those point values was corrupted?
Is there any reason runCorruptionOnStartupIfDirty is set to false by default?
Is there any way I can detect when data is not being logged due to a corrupted database?
Thanks.
-
Hi Craig,
The reason the default setting is false these days (it was true in Mango 2.7 and before, IIRC) is that the database will automatically attempt to fix corruption when it finds it. That setting means that the database will attempt to read all the data in a shard if there is a corresponding .drty file for its shard number in the Mango/databases/mangoTSDB/*/*/ directories.
To trigger a corruption scan of everything, you would query each point for all its data. This would cause the database to read all the shards and fix any corruption it found. For big databases this would take a long time, but if you know the time range where you would have expected corruption to occur, then you could just read over that section of time.
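If you want to script that "read everything" approach rather than clicking through each point, something along these lines could work against the REST API. This is only a sketch: the endpoint paths, query parameters, and token authentication below are assumptions to check against your Mango version's API documentation.

```python
# Sketch: read every data point over a suspect time range so the TSDB
# touches (and repairs) any shard it finds corrupt in that range.
# Endpoints, parameters, and auth are assumptions -- verify against your
# Mango version's REST API docs before relying on this.
import requests

BASE = "http://localhost:8080"
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}        # assumed token auth
SUSPECT_FROM = "2019-09-26T00:00:00.000-07:00"          # start of suspect range
SUSPECT_TO = "2019-09-27T00:00:00.000-07:00"            # end of suspect range

# Fetch the XIDs of the data points (assumed list endpoint and response shape).
points = requests.get(f"{BASE}/rest/v2/data-points?limit(10000)",
                      headers=HEADERS).json()["items"]

for p in points:
    xid = p["xid"]
    # Reading the values forces the shards covering this range to be scanned.
    r = requests.get(f"{BASE}/rest/v2/point-values/{xid}",
                     params={"from": SUSPECT_FROM, "to": SUSPECT_TO},
                     headers=HEADERS)
    print(xid, r.status_code, len(r.json()) if r.ok else "no data read")
```

Reading over just the suspect window keeps the scan short; dropping the from/to range would sweep the whole database but could take a very long time on a big TSDB.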
Typically, corruption is a result of the file buffer from writes not getting flushed to the disk before Mango is terminated, which is why you saw that it had some corruption scanning to do when it was shut down with a kill -9. In normal circumstances, Mango will flush the buffers and delete the .drty files.
I am not aware of any lingering issues with fixing the corruption on the fly. In the email one of your colleagues sent to support not so long ago, about a buffer overflow exception in an Excel report, version 3.6.4 of the NoSQL module was released to handle that case. Is there another instance where fixing it on the fly isn't working? (Also note that the buffer overflow corruption case wouldn't have been fixed by the old corruption scanner's behavior of doing that work at startup instead of runtime, because the old scanner didn't handle that case.)
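As for detecting it, one simple check is to look for leftover .drty markers under the TSDB directory while Mango is running normally; a minimal sketch, assuming the directory layout described above (adjust the base path to wherever your Mango data lives):

```python
# Sketch: list leftover .drty marker files. Each one marks a shard the
# database considers dirty and will scan/repair on the next read (or at
# startup if runCorruptionOnStartupIfDirty=true). Base path is an assumption.
from pathlib import Path

MANGO_TSDB = Path("/opt/mango/databases/mangoTSDB")   # adjust to your install

dirty = sorted(MANGO_TSDB.glob("*/*/*.drty"))
for f in dirty:
    # The parent directory is the data point ID, the file stem is the shard number.
    print(f"data point {f.parent.name}, dirty shard {f.stem}")
print(f"{len(dirty)} dirty shard marker(s) found")
```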
Is there a way to force Mango to go back and run a corruption scan on the whole TSDB database so the data from the time period between (plug pulled) and (Mango cleanly restarted) can be plotted, or is that data not even recorded since part of the database for those point values was corrupted?
I'm not sure what you mean. Between the time of an abrupt Mango termination and Mango restarting there would be nothing gathering data.
-
Thanks for the details on the corruption scan. Maybe not related at all!
Part of the problem appears to be that we have used meta points, for one reason or another that I don't understand. The Modbus data point looks fine: all the data is there except for the 3-hour period when the machine was powered off.
In the watchlist, some of the meta points are flatlined and did not update after the machine was powered off. Here is a screenshot showing the orange penstock flow and yellow PHDS gauge flow (meta points) flatlined starting at 3:25 when the machine was powered off; they did not recover until Mango was restarted again, whereas the others carried on just fine from September 26 onwards:
However, on the Modbus data source it is all there:
The Excel report that uses the meta data points oddly doesn't show any flatline except where there is actually missing data, so it doesn't make any sense to me how it is getting the right data from the meta point when flatlines are shown in the watchlist for the same meta point over the same time period:
And lastly, the e-mail report (not Excel), also using the meta data point, is not even able to produce a plot with a flatline:
Probably re-generating the point history for the meta points will fix all this. Since I don't understand why we are using meta points at all, we can probably just delete them, use the Modbus points, and stop having these issues; otherwise I will probably have to re-generate the meta point history after the next dirty shutdown.
Before I re-generate the meta history or get rid of the meta points altogether, let me know if you'd like to take a closer look at the data and configuration that is there, in case anything looks like an issue with Mango that you would like to fix.
Thanks for your help and continued work on Mango.
Craig
-
Hmm, offhand I'm not sure why that would be. Perhaps another point in their context was disabled? If so, there would have been an event message about disabled points in context.
What is the meta point's update event? Were you using a rollup in the Excel report? (In that case the statistics will use the value prior to the period.) It's not clear that it's the same color line in the Excel report.
And lastly, the e-mail report (not Excel), also using the meta data point, is not even able to produce a plot with a flatline
I suspect that has to do with NaN values, which have polluted the average and sum in the image you've shown.
Probably re-generating the point history for the meta points will fix all this. Since I don't understand why we are using meta points at all, we can probably just delete them, use the Modbus points, and stop having these issues; otherwise I will probably have to re-generate the meta point history after the next dirty shutdown.
You shouldn't have to. I would be curious to know the logging type of the meta point; I wonder if the meta point's interval logging somehow recorded a NaN value, and then, because it's an interval average, the NaN kept polluting each subsequent period, which is why there are 722 values in the emailed report yet no chart.
I agree that regenerating the history should solve it, but if it doesn't, then something in the meta point is producing the NaN value(s). I would have expected the NaN to still update the watchlist, though.
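To illustrate the pollution I mean, here is a toy sketch of the arithmetic only (not Mango's actual averaging code): once one period logs a NaN, every later period that folds in the carried-over previous value is NaN as well, so the report can count hundreds of "values" and still have nothing it can chart.

```python
# Toy illustration (not Mango's implementation) of NaN propagation through
# interval-average logging where each period's statistic includes the value
# carried over from the previous period.
import math

# Simulated per-period samples; the third period produced a NaN.
period_samples = [[10.0, 10.2], [10.1], [float("nan")], [10.3, 10.2], [10.1]]

previous_logged = 10.0   # start-of-period value carried from the prior period
for n, samples in enumerate(period_samples, start=1):
    values = [previous_logged] + samples
    logged = sum(values) / len(values)     # NaN in -> NaN out
    print(f"period {n}: logged average = {logged}")
    previous_logged = logged               # the NaN contaminates every later period

print("still NaN at the end:", math.isnan(previous_logged))
```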
-
No points have been disabled and there are no event messages about disabled points; it's just some meta points not starting back up properly after a dirty shutdown.
The meta point has a 2-minute cron as its update event, the script is "return p.previous(MINUTE,2).average", and it is set to average interval logging every 2 minutes.
The Excel report is using rollups.
Let me know if there is any other information I can provide before I start changing things that will make any further troubleshooting of the root cause impossible.
-
Feel free to get things running again.
The only thing that occurs to me is that the Mango/databases/mangoTSDB/*/{dataPointId}/{lastShardNumber}.data.rev file could be interesting, but if you have already read from the point, it may have been repaired if it was to do with corruption. You would see something about that data point ID in the Mango/logs/iastsdb-corruption.log file if it had something to do with corruption, so it may be worth checking that, too.
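If it helps to capture both of those before you regenerate anything, a quick sketch along these lines would collect them; the base path and point ID below are placeholders for your installation:

```python
# Sketch: gather any .data.rev files for one data point and the matching
# lines from the TSDB corruption log. Paths follow the layout described
# above; MANGO_HOME and POINT_ID are placeholders to adjust.
from pathlib import Path

MANGO_HOME = Path("/opt/mango")   # assumed install location
POINT_ID = "1234"                 # the data point ID you're investigating

# Leftover .data.rev files for that point's shards.
for rev in (MANGO_HOME / "databases" / "mangoTSDB").glob(f"*/{POINT_ID}/*.data.rev"):
    print("rev file:", rev)

# Corruption-log lines mentioning that point ID.
log = MANGO_HOME / "logs" / "iastsdb-corruption.log"
if log.exists():
    for line in log.read_text(errors="replace").splitlines():
        if POINT_ID in line:
            print("log line:", line)
```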