Internal data source attribute 'Point values to be written' keeps climbing. Mitigation strategies?
-
Have you checked your log file and events for any errors in writing?
It almost sounds like a TSDB file is owned by a high privilege user, or something, and Mango is attempting to write the data out but getting denied.
I'm not too surprised that the batching didn't handle it in this case. Like I said, it sounds to me the count may be differing from the reality somehow, or like a particular series cannot write its data to the disk, getting either permissions issues or file system issues when it tries.
About how fast is it rising? Does that correlate with any poll rates or cron patterns in the configuration?
-
Have you checked your log file and events for any errors in writing?
I did not find any ERRORs, but when I searched for tsdb I found WARNings:
07-14-2019-1.ma.log:WARN 2019-07-14T13:46:10,746 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T05:05:10,802 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T06:35:10,970 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T12:13:51,286 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T12:31:48,767 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T13:20:10,772 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T16:35:10,747 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T17:35:11,002 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-13-2019-1.ma.log:WARN 2019-07-13T12:53:03,262 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-13-2019-1.ma.log:WARN 2019-07-13T16:04:19,245 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-14-2019-1.ma.log:WARN 2019-07-14T04:35:10,769 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-14-2019-1.ma.log:WARN 2019-07-14T06:20:10,757 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-14-2019-1.ma.log:WARN 2019-07-14T13:46:10,746 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1
I had noticed that message a few weeks ago, after running a Generate History on a metadata point, but I had thought that was related to a large meta point history generation and that it would resolve itself after the task queues caught up when the CPU utilization fell back down. By the time I was noticed the rising values to be written, I had forgotten about the backdate issue. I think the Generate History function may be triggering or exacerbating the problem. However, the values to be written was rising daily long before I started the Generate History. Either way, I don't know what to do about it.
About how fast is it rising?
The rate is irregular when viewed on an hour timescale: the samples to be written may go down every few minutes, but the increases exceed the decreases and a linear trend is seen in the graph. It has been averaging out to an additional 1,700 values per day. Day/1700 ≈ 0.847 min/value ≈ 50.8 sec/value, so not your typical cron job interval. Viewed on a 1-day timescale, the values to be written plot looks like a straight line with a steady rising slope.
System uptime is 3,500 hours (almost 146 days). 3,500h/272,458values = 46.24 seconds/value in the last 146 days. 146 days ago the values to be written was 445,000, rising at 14,000values/day. At restart that day the values to be written reset to zero, then rose at roughly 1,800values/day. That must be the day I last upgraded Mango. If I remember correctly, I had to kill mango many minutes after initiating shutdown because it was taking too long to complete and I was concerned about prolonging a Mango outage.
$ date -d 'now - 146 days' Tue Feb 19 11:44:35 EST 2019
I poked around for large changes in internal data source readings before and after the update, and today, to see if they correlate with the rate of change in values to be written:
- Backdates running: usually 0 then and now.
- Backdates queued: Usually close to 0; 1 today. Spiked above 100 on June 20. The values to be written rose at a much faster rate around that time, but then the slope resumed to 1,700values/day. This was probably the day I ran the Generate History on a metadata point.
- Consecutive run polls: currently 0; spiked from 0 to 205,000 on June 15, then suddenly down to 0 on June 27.
- Currently scheduled high priority tasks is typically 12,000-18,000. Does not seem to correlate with other issues.
- Point value write threads is currently 3. It was 5 before the Feb 19 restart, and appears to be going up in proportion to the number of values to be written.
- Total backdates is currently 216,293. It had climbed slowly to 1,200 at which time it was reset to 0 at the Feb 19 restart. Since then it climbed slowly to around 3,000 on May 1, at which point it shot up suddenly to almost 60,000 that day, then resumed its slower rate of increase. On June 21 it shot up suddenly from about 60,000 to 215,000, then resumed its slower increase rate.
I believe the total backdates shot up when I ran a Generate History, and the system has not caught up with the backlog. This is odd because CPU utilization is usually more than 80% idle and CPU % IO wait is currently 0. If I had not ran the Generate History, it seems that these numbers would still be increasing indefinitely, but not at such a fast rate.
-
Thank you for going into such detail! It may take me a couple days to test things / read around to see if I can reproduce what you're reporting, but I just wanted to say thanks for so much information!
Either way, I don't know what to do about it.
If it's just the count getting off and not values accruing in memory, it's not a big problem. it won't interfere with writing the data that is actually queued to get written, and so there would not be an increasing amount of memory allocated to this task, just a number that doesn't reflect the accurate size of the queue (the values waiting aren't counted in the queue, they tracked, because it's much more efficient not to count the large queue).
-
@phildunlap said in Internal data source attribute 'Point values to be written' keeps climbing. Mitigation strategies?:
If it's just the count getting off and not values accruing in memory, it's not a big problem.
Thanks. Hopefully that's the case, but I think there are accrued values hiding somewhere, as that would explain the prolonged shutdowns I experienced when I restarted Mango, whereby I eventually had to kill Mango. Accrued values aside, I need to be able to generate history after creating or changing metadata points, as they are of limited value without a history. I'm thinking that whatever is underlying this symptom is also affecting my ability to generate metadata value histories.
Sometimes when I generate history I'm wondering if I'm going to push Mango over the edge, where excessive resource demands cause a positive feedback loop of error messages that demand more resources, eventually requiring a restart for Mango to catch up. This occurred a year or two ago, but I can't recall whether it was a history generation or an Internet outage (and inability to send email alarms) that kicked off the positive feedback loop.
It would be nice to have automatic ad-hoc metadata points as a feature enhancement: if the metadata value is not found in the TSDB point log (e.g. because it was not logged), then it should be calculated at the time it is retrieved. This would result in calculating historical values only when the data is being retrieved, thus reducing computational load and the storage demand, yet still making it available for review or download on the occasion where the data is needed for analysis. Since the value would be calculated upon retrieval, values that are never retrieved would not have to be calculated or recorded. The values could optionally be logged when they are calculated, as they are now. Live values would still be displayed. Should I submit this feature request on Github?
-
@Pedro I was just reviewing this and had a few questions:
-
Are you seeing any Data Lost events? If so these should enlighten us as to why this is happening, they get raised whenever a batch fails to write. I would make sure that event level is set to something other than Do No Log too.
-
Do you have any Alphanumeric points that would be saving very large strings of text? The NoSQL database has a limit to the size of each entry, which is large but its possible this could cause these symptoms.
-
-
I recommend changing the
point clean interval (ms)
and thestale data period (ms)
to60000
, which is at least currently the default. There is a decent chance this is actually causing the issue, but we're still investigating. -
@terrypacker said in Internal data source attribute 'Point values to be written' keeps climbing. Mitigation strategies?:
Are you seeing any Data Lost events?
I have not seen Data Lost events. However, since I've had to kill Mango to complete the last several restarts, if there were a data loss it would not have generated an event at that time.
If so these should enlighten us as to why this is happening, they get raised whenever a batch fails to write.
Thanks, that's reassuring. I had not remembered that there was such an event.
I would make sure that event level is set to something other than Do No Log too.
I had apparently set that event to critical. I did not see any such events, and in the last month I re-enabled critical events emails forwarding through my mobile phone network's email to SMS gateway. My phone sounds an attention getting submarine dive alarm when receiving Mango text messages.
Do you have any Alphanumeric points that would be saving very large strings of text?
Only one. I tried listing ALPHANUMERIC points in the table on the data_sources page, but for some reason it would not show a full query response (there were hourglasses at the bottom of the list). In any case, the JSONata command-line utility jfq enables me to specify more detailed reporting of the Mango configuration.
Query:
# Show name and loggingType and context update of all enabled alphanumeric points that are configured to log their values: $ jfq 'dataPoints[pointLocator.dataType="ALPHANUMERIC"][enabled=true][loggingType!="NONE"].[name,enabled,loggingType,pointLocator.context]' Mango-Configuration-Jul-17-2019_000500.json
I generate a large JSON structure to predict the tides based on a pressure sensor data. The table is displayed in a Mango 2.x graphical view via a server-side script graphical object that converts the JSON to an HTML table. The table is apparently using the Alphanumeric_Default template, which saves "When point value changes." However, it only updates context when the tide direction changes, which is only a handful of times per day. The only other alphanumeric point being logged has a context that triggers only twice a year.
The NoSQL database has a limit to the size of each entry, which is large but its possible this could cause these symptoms.
The
Point values to be written
rises by about 1,700 values per day, which far exceeds the number of times the tide changes direction in one day.I see there is now a Failed Login event. That's wonderful, thanks.
I just changed
the point clean interval (ms) and the stale data period (ms) to 60000
I will report back whether or not I see a change in the trend of the
Point values to be written
. It should take an hour or two to see a change.Thanks for your help.
-
Thanks very much: 3.5 hours after changing the TSDB settings, I see that the
Point values to be written
has leveled off, on average. Plotted on a one day scale, it will surely look like a flat horizontal line. Now it would be nice to figure out how to make it go down from 276,196 values to zero.My current configuration:
$ jfq 'systemSettings' Mango-Configuration-Jul-17-2019_201152.json | grep -i nosql "mangoNoSql.writeBehind.statusProviderPeriodMs": 5000, "mangoNoSql.writeBehind.maxInstances": 10, "mangoNoSql.backupHour": 4, "mangoNoSql.backupEnabled": true, "mangoNoSql.backupMinute": 0, "mangoNoSql.writeBehind.maxInsertsPerPoint": 10000, "mangoNoSql.backupPeriods": 1, "mangoNoSql.backupFileCount": 3, "mangoNoSql.backupIncremental": false, "mangoNoSql.backupFileLocation": "/mnt/WesternDigitalUSB/mango-backup", "mangoNoSql.corruptionScanThreadCount": 100, "systemEventAlarmLevel.NOSQL_DATA_LOST": "CRITICAL", "mangoNoSql.writeBehind.minInsertsPerPoint": 1000, "mangoNoSql.writeBehind.stalePointDataPeriod": 60000, "mangoNoSql.writeBehind.batchProcessPeriodMs": 500, "mangoNoSql.writeBehind.maxRowsPerInstance": 100000, "mangoNoSql.backupLastSuccessfulRun": "Jul-13-2019_040000", "mangoNoSql.writeBehind.backdateDelay": 4985, "action.noSqlBackup": "", "mangoNoSql.writeBehind.stalePointCleanInterval": 60000, "action.noSqlRestore": "", "mangoNoSql.writeBehind.minDataFlushIntervalMs": 100, "mangoNoSql.backupPeriodType": "WEEKS", "mangoNoSql.writeBehind.spawnThreshold": 100000, "mangoNoSql.intraShardPurge": false
-
Glad to hear it! Thanks for bringing this possibility to our attention, and with such detail along the way!
Alas the best way to zero it out would be restarting Mango.
-
@phildunlap said in Internal data source attribute 'Point values to be written' keeps climbing. Mitigation strategies?:
the best way to zero it out would be restarting Mango.
Ironically, restarting Mango will result in a prolonged data outage while the values are being written. Therefore it becomes a matter of choosing which data I want to lose (today's, or the accumulated values). I also wonder if those values are from a particular point, or from random points.
-
As predicted, the
Point values to be written
has continued to be fairly flat since yesterday's configuration change. It is fluctuating between 276,180 values and 276,230 values, except for a brief spike to 276,300 values. It is not trending up or down. I wonder what will happen next time Igenerate history
on a Metadata point. -
I would not expect a history generation to cause the issue again with the settings changed.
I see your point about the delay in shutdown. You can wait in the shutdown until it is reporting that it is waiting for batch write behind tasks to finish, and then use a
kill -9
to stop Mango, this should not cause issues. It is possible to reflect out the counter and manually set it to something, but I think it would be better to restart Mango.