Internal data source attribute 'Point values to be written' keeps climbing. Mitigation strategies?
-
Hi Phil,
This is MA core 3.5.6 with NoSqlDtabase 3.5.1. Sorry the version I provided earlier made no sense.
The Point values to be written is now at 265,275 points. I don't usually look at this value, so it piled up at a steady rate before I noticed. Doesn't a low iowait indicate that a faster hard drive or lowering the write rate should not make a difference?
If this Mango version is new enough, what are the NoSQL fiddling recommendations?
I have never changed them from the default:Batch process manager period (ms) 500 Batch write behind spawn threshold 100000 Max batch write behind tasks 100 Maximum batch size 1 Minimum batch size 1 Delay for writing backate batches (ms) 5000 Minimum time to wait before flushing a small batch (ms) 1 Point clean interval (ms) 1000 Status update period (ms) 5000 Stale data period (ms) 1000 Corruption scan task threads 100
Would changing these settings require a restart?
-
Is the value still climbing? That version should be new enough.
Your settings seem fine for the general case, I would not have 100 max write behind tasks, the default is 10. I am not sure why you adjusted the defaults on the clean interval or stale data period, but I wouldn't expect those to cause this issue.
No, changing those settings does not require a restart. If the issue was that you cannot keep up with a deluge of data, most likely from one or more persistent TCP syncs, then the solution would be to increase the minimum time before flushing a small batch, increase the maximum batch size to something like 10000 and increase the minimum batch size.
The default settings are geared toward quality of service, not necessarily throughput. I would only expect persistent TCP syncs or extreme numbers of points / poll rates to encounter that (or slow disks, which are aided by batching writes, generally).
-
Thanks for your help. Yes, the values to be written are still climbing at a steady rate, as they have been at least for months, probably since last time I restarted Mango. I presume there were a lot of accumulated points to be written during my upgrade to 3.5.6, as that would explain the long shutdown time.
The TSDB setting values I posted yesterday were either the default when I converted to TSDB years ago, or they were changed to new values recommended by IA, probably around the day I converted to TSDB. I doubt I would have the courage to change them on my own without understanding their meaning. Since I have not seen documentation regarding the meaning of each of these settings, I have been afraid to change them. Before noticing that the
values to be written
was consistently increasing, I also did not see a need to change the settings, so I left them alone.Per your suggestion I just changed only the following settings (old ==> new):
max write behind tasks: 100 ==> 10 increase the minimum time before flushing a small batch: 1 ==> 100 increase the maximum batch size: 1 ==> 10000 increase the minimum batch size: 1 ==> 1000
The default settings are geared toward quality of service, not necessarily throughput. I would only expect persistent TCP syncs or extreme numbers of points / poll rates to encounter that
There are 58 data sources (of which 15 are metadata sources). Most of the data sources are Modbus/TCP with a typical polling rate of either 1, 2, or 5 seconds.
There are about 1300 enabled data points, though many are from metadata sources.
(or slow disks, which are aided by batching writes, generally).
I don't think the issue is slow disks because iowait is low.
It's too soon to see if the trend of increasing
values to be written
changed. I'll check thevalues to be written
later, and see if it started to decrease, then report back.Thanks again
-
I graphed the
Point values to be written
three hours before and after the change in TSDB settings. It continues to rise at the same rate, and is now at 267,341 points. I saw no visible difference in the graph.The
Database batch writes per second
varies between 85 and 100, as it did before. -
Have you checked your log file and events for any errors in writing?
It almost sounds like a TSDB file is owned by a high privilege user, or something, and Mango is attempting to write the data out but getting denied.
I'm not too surprised that the batching didn't handle it in this case. Like I said, it sounds to me the count may be differing from the reality somehow, or like a particular series cannot write its data to the disk, getting either permissions issues or file system issues when it tries.
About how fast is it rising? Does that correlate with any poll rates or cron patterns in the configuration?
-
Have you checked your log file and events for any errors in writing?
I did not find any ERRORs, but when I searched for tsdb I found WARNings:
07-14-2019-1.ma.log:WARN 2019-07-14T13:46:10,746 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T05:05:10,802 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T06:35:10,970 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T12:13:51,286 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T12:31:48,767 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T13:20:10,772 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T16:35:10,747 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T17:35:11,002 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-13-2019-1.ma.log:WARN 2019-07-13T12:53:03,262 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-13-2019-1.ma.log:WARN 2019-07-13T16:04:19,245 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-14-2019-1.ma.log:WARN 2019-07-14T04:35:10,769 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-14-2019-1.ma.log:WARN 2019-07-14T06:20:10,757 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-14-2019-1.ma.log:WARN 2019-07-14T13:46:10,746 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1
I had noticed that message a few weeks ago, after running a Generate History on a metadata point, but I had thought that was related to a large meta point history generation and that it would resolve itself after the task queues caught up when the CPU utilization fell back down. By the time I was noticed the rising values to be written, I had forgotten about the backdate issue. I think the Generate History function may be triggering or exacerbating the problem. However, the values to be written was rising daily long before I started the Generate History. Either way, I don't know what to do about it.
About how fast is it rising?
The rate is irregular when viewed on an hour timescale: the samples to be written may go down every few minutes, but the increases exceed the decreases and a linear trend is seen in the graph. It has been averaging out to an additional 1,700 values per day. Day/1700 ≈ 0.847 min/value ≈ 50.8 sec/value, so not your typical cron job interval. Viewed on a 1-day timescale, the values to be written plot looks like a straight line with a steady rising slope.
System uptime is 3,500 hours (almost 146 days). 3,500h/272,458values = 46.24 seconds/value in the last 146 days. 146 days ago the values to be written was 445,000, rising at 14,000values/day. At restart that day the values to be written reset to zero, then rose at roughly 1,800values/day. That must be the day I last upgraded Mango. If I remember correctly, I had to kill mango many minutes after initiating shutdown because it was taking too long to complete and I was concerned about prolonging a Mango outage.
$ date -d 'now - 146 days' Tue Feb 19 11:44:35 EST 2019
I poked around for large changes in internal data source readings before and after the update, and today, to see if they correlate with the rate of change in values to be written:
- Backdates running: usually 0 then and now.
- Backdates queued: Usually close to 0; 1 today. Spiked above 100 on June 20. The values to be written rose at a much faster rate around that time, but then the slope resumed to 1,700values/day. This was probably the day I ran the Generate History on a metadata point.
- Consecutive run polls: currently 0; spiked from 0 to 205,000 on June 15, then suddenly down to 0 on June 27.
- Currently scheduled high priority tasks is typically 12,000-18,000. Does not seem to correlate with other issues.
- Point value write threads is currently 3. It was 5 before the Feb 19 restart, and appears to be going up in proportion to the number of values to be written.
- Total backdates is currently 216,293. It had climbed slowly to 1,200 at which time it was reset to 0 at the Feb 19 restart. Since then it climbed slowly to around 3,000 on May 1, at which point it shot up suddenly to almost 60,000 that day, then resumed its slower rate of increase. On June 21 it shot up suddenly from about 60,000 to 215,000, then resumed its slower increase rate.
I believe the total backdates shot up when I ran a Generate History, and the system has not caught up with the backlog. This is odd because CPU utilization is usually more than 80% idle and CPU % IO wait is currently 0. If I had not ran the Generate History, it seems that these numbers would still be increasing indefinitely, but not at such a fast rate.
-
Thank you for going into such detail! It may take me a couple days to test things / read around to see if I can reproduce what you're reporting, but I just wanted to say thanks for so much information!
Either way, I don't know what to do about it.
If it's just the count getting off and not values accruing in memory, it's not a big problem. it won't interfere with writing the data that is actually queued to get written, and so there would not be an increasing amount of memory allocated to this task, just a number that doesn't reflect the accurate size of the queue (the values waiting aren't counted in the queue, they tracked, because it's much more efficient not to count the large queue).
-
@phildunlap said in Internal data source attribute 'Point values to be written' keeps climbing. Mitigation strategies?:
If it's just the count getting off and not values accruing in memory, it's not a big problem.
Thanks. Hopefully that's the case, but I think there are accrued values hiding somewhere, as that would explain the prolonged shutdowns I experienced when I restarted Mango, whereby I eventually had to kill Mango. Accrued values aside, I need to be able to generate history after creating or changing metadata points, as they are of limited value without a history. I'm thinking that whatever is underlying this symptom is also affecting my ability to generate metadata value histories.
Sometimes when I generate history I'm wondering if I'm going to push Mango over the edge, where excessive resource demands cause a positive feedback loop of error messages that demand more resources, eventually requiring a restart for Mango to catch up. This occurred a year or two ago, but I can't recall whether it was a history generation or an Internet outage (and inability to send email alarms) that kicked off the positive feedback loop.
It would be nice to have automatic ad-hoc metadata points as a feature enhancement: if the metadata value is not found in the TSDB point log (e.g. because it was not logged), then it should be calculated at the time it is retrieved. This would result in calculating historical values only when the data is being retrieved, thus reducing computational load and the storage demand, yet still making it available for review or download on the occasion where the data is needed for analysis. Since the value would be calculated upon retrieval, values that are never retrieved would not have to be calculated or recorded. The values could optionally be logged when they are calculated, as they are now. Live values would still be displayed. Should I submit this feature request on Github?
-
@Pedro I was just reviewing this and had a few questions:
-
Are you seeing any Data Lost events? If so these should enlighten us as to why this is happening, they get raised whenever a batch fails to write. I would make sure that event level is set to something other than Do No Log too.
-
Do you have any Alphanumeric points that would be saving very large strings of text? The NoSQL database has a limit to the size of each entry, which is large but its possible this could cause these symptoms.
-
-
I recommend changing the
point clean interval (ms)
and thestale data period (ms)
to60000
, which is at least currently the default. There is a decent chance this is actually causing the issue, but we're still investigating. -
@terrypacker said in Internal data source attribute 'Point values to be written' keeps climbing. Mitigation strategies?:
Are you seeing any Data Lost events?
I have not seen Data Lost events. However, since I've had to kill Mango to complete the last several restarts, if there were a data loss it would not have generated an event at that time.
If so these should enlighten us as to why this is happening, they get raised whenever a batch fails to write.
Thanks, that's reassuring. I had not remembered that there was such an event.
I would make sure that event level is set to something other than Do No Log too.
I had apparently set that event to critical. I did not see any such events, and in the last month I re-enabled critical events emails forwarding through my mobile phone network's email to SMS gateway. My phone sounds an attention getting submarine dive alarm when receiving Mango text messages.
Do you have any Alphanumeric points that would be saving very large strings of text?
Only one. I tried listing ALPHANUMERIC points in the table on the data_sources page, but for some reason it would not show a full query response (there were hourglasses at the bottom of the list). In any case, the JSONata command-line utility jfq enables me to specify more detailed reporting of the Mango configuration.
Query:
# Show name and loggingType and context update of all enabled alphanumeric points that are configured to log their values: $ jfq 'dataPoints[pointLocator.dataType="ALPHANUMERIC"][enabled=true][loggingType!="NONE"].[name,enabled,loggingType,pointLocator.context]' Mango-Configuration-Jul-17-2019_000500.json
I generate a large JSON structure to predict the tides based on a pressure sensor data. The table is displayed in a Mango 2.x graphical view via a server-side script graphical object that converts the JSON to an HTML table. The table is apparently using the Alphanumeric_Default template, which saves "When point value changes." However, it only updates context when the tide direction changes, which is only a handful of times per day. The only other alphanumeric point being logged has a context that triggers only twice a year.
The NoSQL database has a limit to the size of each entry, which is large but its possible this could cause these symptoms.
The
Point values to be written
rises by about 1,700 values per day, which far exceeds the number of times the tide changes direction in one day.I see there is now a Failed Login event. That's wonderful, thanks.
I just changed
the point clean interval (ms) and the stale data period (ms) to 60000
I will report back whether or not I see a change in the trend of the
Point values to be written
. It should take an hour or two to see a change.Thanks for your help.
-
Thanks very much: 3.5 hours after changing the TSDB settings, I see that the
Point values to be written
has leveled off, on average. Plotted on a one day scale, it will surely look like a flat horizontal line. Now it would be nice to figure out how to make it go down from 276,196 values to zero.My current configuration:
$ jfq 'systemSettings' Mango-Configuration-Jul-17-2019_201152.json | grep -i nosql "mangoNoSql.writeBehind.statusProviderPeriodMs": 5000, "mangoNoSql.writeBehind.maxInstances": 10, "mangoNoSql.backupHour": 4, "mangoNoSql.backupEnabled": true, "mangoNoSql.backupMinute": 0, "mangoNoSql.writeBehind.maxInsertsPerPoint": 10000, "mangoNoSql.backupPeriods": 1, "mangoNoSql.backupFileCount": 3, "mangoNoSql.backupIncremental": false, "mangoNoSql.backupFileLocation": "/mnt/WesternDigitalUSB/mango-backup", "mangoNoSql.corruptionScanThreadCount": 100, "systemEventAlarmLevel.NOSQL_DATA_LOST": "CRITICAL", "mangoNoSql.writeBehind.minInsertsPerPoint": 1000, "mangoNoSql.writeBehind.stalePointDataPeriod": 60000, "mangoNoSql.writeBehind.batchProcessPeriodMs": 500, "mangoNoSql.writeBehind.maxRowsPerInstance": 100000, "mangoNoSql.backupLastSuccessfulRun": "Jul-13-2019_040000", "mangoNoSql.writeBehind.backdateDelay": 4985, "action.noSqlBackup": "", "mangoNoSql.writeBehind.stalePointCleanInterval": 60000, "action.noSqlRestore": "", "mangoNoSql.writeBehind.minDataFlushIntervalMs": 100, "mangoNoSql.backupPeriodType": "WEEKS", "mangoNoSql.writeBehind.spawnThreshold": 100000, "mangoNoSql.intraShardPurge": false
-
Glad to hear it! Thanks for bringing this possibility to our attention, and with such detail along the way!
Alas the best way to zero it out would be restarting Mango.
-
@phildunlap said in Internal data source attribute 'Point values to be written' keeps climbing. Mitigation strategies?:
the best way to zero it out would be restarting Mango.
Ironically, restarting Mango will result in a prolonged data outage while the values are being written. Therefore it becomes a matter of choosing which data I want to lose (today's, or the accumulated values). I also wonder if those values are from a particular point, or from random points.
-
As predicted, the
Point values to be written
has continued to be fairly flat since yesterday's configuration change. It is fluctuating between 276,180 values and 276,230 values, except for a brief spike to 276,300 values. It is not trending up or down. I wonder what will happen next time Igenerate history
on a Metadata point. -
I would not expect a history generation to cause the issue again with the settings changed.
I see your point about the delay in shutdown. You can wait in the shutdown until it is reporting that it is waiting for batch write behind tasks to finish, and then use a
kill -9
to stop Mango, this should not cause issues. It is possible to reflect out the counter and manually set it to something, but I think it would be better to restart Mango.