Mango locking up after 2.8.4 update

mattonfarm

Hi all,
I am having issues with the latest update.
It seems like after a random amount of time (usually around 30mins) Mango gets stuck in some sort of loop, consuming 100% of the CPU and not allowing HTTP access.
Attached is a screenshot of the Thread Monitoring page showing the runaway thread.

[Attachment: thread monitoring.jpg]

This corresponds to what I see in top.

I'm getting a number of these errors in ma.log which may be the cause of the lockup or may be a side effect of it.

ERROR 2017-01-06 14:17:39,989 (com.infiniteautomation.datafilesource.rt.DataFileDataSourceRT.doPoll:269) -
java.lang.NullPointerException
at com.infiniteautomation.datafilesource.rt.DataFileDataSourceRT.loadNewFiles(DataFileDataSourceRT.java:182)
at com.infiniteautomation.datafilesource.rt.DataFileDataSourceRT.doPoll(DataFileDataSourceRT.java:259)
at com.infiniteautomation.datafilesource.rt.DataFileDataSourceRT.doPollNoSync(DataFileDataSourceRT.java:250)
at com.serotonin.m2m2.rt.dataSource.PollingDataSource.scheduleTimeout(PollingDataSource.java:134)
at com.serotonin.m2m2.util.timeout.TimeoutTask.run(TimeoutTask.java:69)
at com.serotonin.timer.TimerTask.runTask(TimerTask.java:148)
at com.serotonin.timer.OrderedTimerTaskWorker.run(OrderedTimerTaskWorker.java:29)
at com.serotonin.timer.OrderedThreadPoolExecutor$OrderedTask.run(OrderedThreadPoolExecutor.java:278)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Logs attached: ma.log

I've performed a clean install leaving only the databases and db folders behind and am still having issues.
I am running a Persistent TCP data source to sync with a remote MangoES.

Rebooting Mango seems to sort things out for 30 mins or so.

Cheers,
Matt.

phildunlap

Hi Matt,

I think the error you posted is probably a symptom of the problem. We could check for this condition in the code (and we probably ought to, so thank you for bringing it to our attention), but it's unlikely to be the root cause.

The more interesting message to me in that log is that you've got too many open files. I wonder what all the open files are, and what the limit is. The commands below assume only one Java process exists. If you have more than one on your server, you could run ps $(pidof java) | grep overrides and you will probably find the pid for Mango.

To get an output of open files:

lsof -p $(pidof java) > ~/lsof-output
#then to count the files
wc -l ~/lsof-output
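
If the count is large, it may help to see which kinds of files dominate. Here is a rough sketch that groups the saved listing by lsof's TYPE column. It assumes the default lsof column layout (COMMAND PID USER FD TYPE ...), and the function name summarize_lsof_types is just an illustration. Note that wc -l also counts the header line, and that lsof lists entries such as cwd, txt and mem that are not numbered file descriptors, so the raw line count will be somewhat higher than what counts against the limit.

```shell
# Summarize an lsof listing by the TYPE column (REG, IPv4, sock, ...).
# Assumes the default lsof column layout: COMMAND PID USER FD TYPE ...
summarize_lsof_types() {
    # Skip the header line, count each TYPE, and print the largest first.
    awk 'NR > 1 { count[$5]++ } END { for (t in count) print count[t], t }' "$1" |
        sort -rn
}

# Usage, with the file written by the lsof command above:
# summarize_lsof_types ~/lsof-output
```

A large REG count would point at regular files on disk, while a large IPv4/IPv6 or sock count would point at network connections.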

To get the limit for the user,
ulimit -Hn

To set the limit higher if necessary,

#you'll first have to modify /etc/security/limits.conf most likely, create/modify the "user hard nofile 65535" line
#then you can create a script for Mango/bin/ext-enabled/ which does:
ulimit -Hn 65535
#You will need a new SSH session for that limit to get applied.

Another way to see the limits applying to a pid,

cd /proc/$(pidof java)/
cat limits
#Most interesting line: Max open files 4096 4096 files
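
To keep an eye on this over time, something like the following could be run periodically (from cron, for example). It is only a sketch for Linux: it counts the entries in /proc/PID/fd (one per open descriptor) and reads the soft limit from the "Max open files" line shown above. The function names and the 80% threshold are my own assumptions, not anything Mango provides.

```shell
# Print "<open descriptors> <soft limit>" for a pid, using /proc (Linux only).
fd_usage() {
    pid=$1
    open=$(ls "/proc/$pid/fd" | wc -l)
    # In /proc/PID/limits the soft limit is the 4th whitespace-separated
    # field of the "Max open files" line; the hard limit is the 5th.
    soft=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "$open $soft"
}

# Print a warning when usage exceeds a percentage of the soft limit.
# The default threshold of 80 is an arbitrary choice.
fd_warning() {
    open=$1
    soft=$2
    percent=${3:-80}
    if [ "$open" -gt $(( soft * percent / 100 )) ]; then
        echo "warning: $open of $soft file descriptors in use"
    fi
}

# Usage, assuming a single Java process as in the commands above:
# fd_warning $(fd_usage "$(pidof java)")
```

Run it as the same user as Mango (or as root), since /proc/PID/fd is not readable by other users.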

phildunlap

My expectation is that ulimit -Hn will probably say 4096, and ulimit -Sn may only give you 1024. You are probably using the NoSQL database with ~500+ points and are perhaps hitting this limit during startup. You may be able to simply add an ext-enabled script containing ulimit -Hn 4096, which would let you avoid doing anything in /etc/security/limits.conf.
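
As a variation on that idea, here is a sketch of a script that raises the soft limit up to whatever the hard limit currently is, which an unprivileged user is allowed to do. It assumes the Mango start script sources the files in Mango/bin/ext-enabled/ so that the new limit applies to the shell that launches Java; the function name is hypothetical, and I have not verified this against a particular Mango version.

```shell
# Raise the soft open-file limit to the current hard limit.
# An unprivileged user may raise the soft limit, but only up to the hard limit.
raise_soft_nofile() {
    hard=$(ulimit -Hn)
    ulimit -Sn "$hard"
}

raise_soft_nofile
# Report the result so it shows up in the startup output.
echo "open-file limit: soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
```

If the hard limit itself is too low, it still has to be raised in /etc/security/limits.conf first.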

mattonfarm

Hi Phil,
I think you're right about the error being a symptom of something else.
Looking at the data, I noticed that one of the persistent data points only showed data up until the date the Mango server started to crash, even though data on the MangoES was still being recorded. I'm guessing something had corrupted the database, and whenever a historical sync started it would get hung up on this data point. Clearing all of the data point data and doing a complete re-sync with the MangoES on site seems to have solved things.

I'm guessing the file limit issue is due to too many hung historical sync threads running. I'll look into the user limits though.

I'd still like a way to work out whether this is happening again and how to fix it, especially if Mango is unable to start at all.

Many thanks for your support.
Matt.

phildunlap

Hi Matt,

I can't say for sure if it's related or not, but I have placed a new version of the NoSQL module into the store. I was guided to the change I made by investigating a description of your events, though, so perhaps it is related, and it could conceivably produce symptoms like what you describe (too many open files, apparent corruption) in unfortunate circumstances. I'd encourage you to update!

mattonfarm

@phildunlap Thanks Phil. I'll do the update and let you know how I get on.
Interestingly I seem to be getting the following error repeating over and over, sometimes only 10 or so seconds apart.

High priority task: com.serotonin.m2m2.persistent.ds.PersistentDataSourceRT$StatusProvider was rejected because it is already running.

This is on the data source end or the persistent data source publisher. Could this be an issue of corruption?

phildunlap

Hi Matt,

I actually just realized there was a small error in what I made available. I'm coding tests for the fix right now. I will have another module in the store before the end of my day.

I wouldn't typically worry about that Status provider getting rejected right now. It could be a symptom. It is happening a lot, though. After I update the module, we can check if that event is happening a lot again.

phildunlap

Hi Matt,

I have made version 1.3.4 of the NoSQL database module available. Version 1.3.3 (the one I made available two days ago) should be abandoned quickly by anyone who updated to it.
Thanks!