Data points stop logging

mattonfarm

Hi all,
We're having an issue with our Mango instance crashing randomly. Data points are received from sources but not committed to the database. The issue can be re-created by running a NoSQL backup. The relevant logs are attached below with the 'Timer already cancelled.' error repeating until the process is rebooted. Rebooting the process gets things going again for another couple of days.

INFO  2017-09-25T09:03:36,695 (com.serotonin.m2m2.rt.maint.work.DatabaseBackupWorkItem.execute:109) - Starting database backup WorkItem. 
INFO  2017-09-25T09:03:36,697 (com.serotonin.m2m2.rt.maint.work.DatabaseBackupWorkItem.createLogOutputStream:224) - Writing upgrade log to /opt/mango/logs/com.serotonin.m2m2.rt.maint.work.DatabaseBackupWorkItem.log 
FATAL 2017-09-25T09:03:51,645 (com.serotonin.timer.TimerThread.run:41) - TimerThread failed 
java.lang.NullPointerException: null
	at com.serotonin.timer.OrderedThreadPoolExecutor$TimePriorityLimitedTaskQueue.add(OrderedThreadPoolExecutor.java:442) ~[mango-3.2.1.jar:?]
	at com.serotonin.timer.OrderedThreadPoolExecutor.execute(OrderedThreadPoolExecutor.java:204) ~[mango-3.2.1.jar:?]
	at com.serotonin.timer.OrderedTimerThread.executeTask(OrderedTimerThread.java:26) ~[mango-3.2.1.jar:?]
	at com.serotonin.timer.TimerThread.mainLoop(TimerThread.java:133) ~[mango-3.2.1.jar:?]
	at com.serotonin.timer.TimerThread.run(TimerThread.java:38) [mango-3.2.1.jar:?]
ERROR 2017-09-25T09:03:53,506 (com.serotonin.m2m2.rt.maint.BackgroundProcessing$RejectableWorkItemRunnable.run:559) - Error in work item 
java.lang.IllegalStateException: Timer already cancelled.
	at com.serotonin.timer.RealTimeTimer.scheduleImpl(RealTimeTimer.java:110) ~[mango-3.2.1.jar:?]
	at com.serotonin.timer.AbstractTimer.schedule(AbstractTimer.java:26) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.maint.BackgroundProcessing.schedule(BackgroundProcessing.java:94) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.util.timeout.TimeoutTask.<init>(TimeoutTask.java:41) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.util.timeout.TimeoutTask.<init>(TimeoutTask.java:35) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.event.detectors.TimeoutDetectorRT.scheduleJob(TimeoutDetectorRT.java:103) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.event.detectors.TimeDelayedEventDetectorRT.scheduleJob(TimeDelayedEventDetectorRT.java:27) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.event.detectors.DifferenceDetectorRT.pointData(DifferenceDetectorRT.java:42) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.event.detectors.NoUpdateDetectorRT.pointUpdated(NoUpdateDetectorRT.java:22) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:94) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:94) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:94) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:94) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.dataImage.DataPointRT$EventNotifyWorkItem.execute(DataPointRT.java:672) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.maint.BackgroundProcessing$RejectableWorkItemRunnable.run(BackgroundProcessing.java:556) [mango-3.2.1.jar:?]
	at com.serotonin.timer.Task.runTask(Task.java:179) [mango-3.2.1.jar:?]
	at com.serotonin.timer.TaskWrapper.run(TaskWrapper.java:23) [mango-3.2.1.jar:?]
	at com.serotonin.timer.OrderedThreadPoolExecutor$OrderedTaskCollection.run(OrderedThreadPoolExecutor.java:307) [mango-3.2.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
ERROR 2017-09-25T09:03:55,954 (com.serotonin.m2m2.rt.maint.BackgroundProcessing$RejectableWorkItemRunnable.run:559) - Error in work item 
java.lang.IllegalStateException: Timer already cancelled.
	at com.serotonin.timer.RealTimeTimer.scheduleImpl(RealTimeTimer.java:110) ~[mango-3.2.1.jar:?]
	at com.serotonin.timer.AbstractTimer.schedule(AbstractTimer.java:26) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.maint.BackgroundProcessing.schedule(BackgroundProcessing.java:94) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.util.timeout.TimeoutTask.<init>(TimeoutTask.java:41) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.util.timeout.TimeoutTask.<init>(TimeoutTask.java:35) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.event.detectors.TimeoutDetectorRT.scheduleJob(TimeoutDetectorRT.java:103) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.event.detectors.TimeDelayedEventDetectorRT.scheduleJob(TimeDelayedEventDetectorRT.java:27) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.event.detectors.DifferenceDetectorRT.pointData(DifferenceDetectorRT.java:42) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.event.detectors.NoUpdateDetectorRT.pointUpdated(NoUpdateDetectorRT.java:22) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:94) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:94) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.dataImage.DataPointRT$EventNotifyWorkItem.execute(DataPointRT.java:672) ~[mango-3.2.1.jar:?]
	at com.serotonin.m2m2.rt.maint.BackgroundProcessing$RejectableWorkItemRunnable.run(BackgroundProcessing.java:556) [mango-3.2.1.jar:?]
	at com.serotonin.timer.Task.runTask(Task.java:179) [mango-3.2.1.jar:?]
	at com.serotonin.timer.TaskWrapper.run(TaskWrapper.java:23) [mango-3.2.1.jar:?]
	at com.serotonin.timer.OrderedThreadPoolExecutor$OrderedTaskCollection.run(OrderedThreadPoolExecutor.java:307) [mango-3.2.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]

I've also noticed that the persistent data historical sync also gets held up when trying to fill the holes where data wasn't logged to the main server during these outages (we are using MangoES units on site for redundancy).

Any ideas on where to start looking? We need to get this sorted as we are going live to our clients in the next few days.

Thanks in advance,
Matt.

phildunlap

Hi Matt,

The Timer Already Cancelled message will happen if an exception or error managed to reach all the way back to the timer thread. Most of the time, i'd say that's an out of memory error. This would probably appear elsewhere in your logs, but it could have been a fair time prior.

If a NoSQL backup prompts it, I wonder if 1) you have enough disk space for that NoSQL backup. Check df -h and du -h /opt/mango/backup/* 2) you are doing incremental NoSQL backups or complete NoSQL backups? This would be in the NoSQL section of /system_settings.shtm, 3) Is your NoSQL module up to date?

To the persistent data sync, have you increased the minimum window size? This can greatly accelerate syncs and lower the burden of the sync (but it can cause problems if the receiver isn't using NoSQL!). A window size of 500 may be a fair value for your application. Also good to check is your number of sync threads. The default value is 10 which is quite a few. I would consider lowering that to 2 and if all is well increasing it to 4 at most.

What version of Mango are you using?

mattonfarm

Hi Phil,
Thanks for your response.

The virtual machine has been upgraded recently and Mango now has access to up to 6GB of RAM (up from 1.5) so memory shouldn't be the issue.
We have plenty of hdd space with 17% used of 70GB.

I've turned off backups for now as they cause the system to crash. Errors will begin populating within the first minute of the backup starting.

The persistent data sync is running with the default settings so the thread count is set to 10. We currently have 4 persistent data sources on the main server and one persistent data publisher.

I'm not sure what you mean by the minimum window size. I cannot find this option in the publisher, source or system settings.

We're running core 3.2.1+ enterprise with NoSQL and all modules are up to date.

To me it feels like there is a bad data point somewhere in the system and the backup and persistent processes force Mango to probe it, causing an exception. If this is the case, how would I go about finding the problematic point?

Cheers,
Matt.

phildunlap

My mistake, I got the name of the property a little wrong. It's the "Minimum overlap when syncing blocks of data"

| We currently have 4 persistent data sources on the main server and one persistent data publisher.

Other way around? One data source at central and four publishers? If so, I really would recommend turning down the number of sync threads on the publishers.

If it were a data point, you should collect thread dumps while Mango is in that state. Collect them via going to the /rest/v1/threads?asFile=true&stackDepth=30 endpoint and by using jstack which will be something like $JAVA_HOME/bin/jstack -l $(pidof java) > jstacked then email the thread dumps to us.

For a point to be "bad," it could be problematic in the NoSQL database or in the SQL database. If the NoSQL database, it should fix itself. If the SQL database, you would have exceptions when you tried to load that point elsewhere in the UI, like the data points table. If a NoSQL issue has happened, that should be evidenced in the thread dumps. You may also see your point values to be written climbing on the internal metrics page.

That is not my expectation, though. I still feel like this is going to be a memory issue. The 6GB is given to Mango in a Mango/bin/ext-enabled/ script? You see the -Xmx and -Xms arguments when you do ps $(pidof java) (presuming this is *nix)? If you are able to easily reproduce it by triggering a NoSQL backup, I advise launching Mango on the command line with Mango/bin/ma.sh start and then triggering a NoSQL backup and watching for the exceptions. An error could be escaping to stderr.

I have seen this issue (Timer already cancelled) on an ES doing publishing, and in that instance decreasing the sync threads and increasing the minimum overlap alleviated their problem sufficiently they didn't probe farther.

phildunlap

It should also be mentioned you may wish to update Java. Back around the release of 8_121 we advised everyone update as a memory link was found in an important Java structure around then, although I would doubt if that's the entirety of the issue.

mattonfarm

Hi Phil,
Thanks for the tips.
I upgraded JAVA about a month ago on all systems so that shouldn't be the issue. Yes, 4 persistent sources (from 4 MangoES publishers) and one publisher (to a MangoES). I'll adjust the thread count and overlap values and see what happens.
Yes, the 6GB option has been enabled in the ext-enabled folder. Mango reports 5.3GB with 5.1GB free. The server has 8 GB total.

I did notice this while checking the memory usage.
0_1506405881395_2db5c4b1-2358-4679-8ef9-55be26359f4e-image.png
I guess this could be due to the large thread count from the persistent publishers.

I will do some more probing in the console to see if I get a bit more info. At least I can trigger the fault easily by running a backup.

Will keep you posted.

Matt.

phildunlap

Did you check the other tabs on the the internal page to see if those tasks are identified?