CPU Jumps after 50+ Days of Uptime & Adding Modbus devices

cmusselm

I've found this post that has the exact same error as we were seeing in our MA log file. https://forum.mango-os.com/topic/4668/persistent-tcp-publishers-stop-transfer/1

It's a different scenario that caused the errors, but made me wonder if the fix could be the same.

The resolution in that case was to increase the thread pools. Any ideas if that may fix the issue we're running into now?

Thanks,
Chad

Ralf

Hi Chad,

increase the thread pools a bit, that will help you I think.
Our system is running over one year without any problems (restarts only for updates). 268 Data Sources, 32,359 Data Points and 9 publishers.

Another thing I do for each mango-installation is to config the Mango-internal-data-source to the max (Mango process CPU load, System CPU load etc.) to get an overview what happens with the system and the data-sources (= PLCs). Don't forget to set "override system purge settings" for this ds or you will get a very large database...

Also check your update period versus poll duration for the modbus-ds. Maybe you have to increase your update period because the PLC needs to much time or the network-traffic is limited?

cmusselm

Ralf,

Thank you for the help! It's greatly appreciated. It's nice to see how you have the parameters set and your overall number of data sources and data points. We'll definitely update the thread pool sizes.

I have added several more of the internal data points as suggested and will check them in a few days to get a baseline number, and set up alerts going forward if it looks like anything is spiking. Good tip on the purge override as well so we don't bloat the DB.

Good thought about the update period vs poll duration. Before the guys have the device fully set up, the poll durations can be way longer than our update period as comms can be sketchy. We'll look at that as well and see if we should have a longer update period until the devices are fully configured and working as expected.

Thanks,
Chad

cmusselm

We've had another jump in CPU after the reboot last week. It went up to 28% later yesterday.

We had the following warning getting written to the ma.log file at the rate of 10 times per ms, 10,000 per second. We went through 5 log files in a minute. Luckily that burst of writing to the log file didn't last long.

WARN  2021-03-11T15:18:23,659 (com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler$PointValueWebSocketListener.pointUpdated:246) - Error sending websocket message 
com.serotonin.m2m2.web.mvc.rest.v1.websockets.WebSocketClosedException: null
	at com.serotonin.m2m2.web.mvc.rest.v1.websockets.MangoWebSocketHandler.sendStringMessageAsync(MangoWebSocketHandler.java:133) ~[?:?]
	at com.serotonin.m2m2.web.mvc.rest.v1.websockets.MangoWebSocketHandler.sendMessage(MangoWebSocketHandler.java:99) ~[?:?]
	at com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler.sendMessage(PointValueWebSocketHandler.java:143) ~[?:?]
	at com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler$PointValueWebSocketListener.sendNotification(PointValueWebSocketHandler.java:195) ~[?:?]
	at com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler$PointValueWebSocketListener.pointUpdated(PointValueWebSocketHandler.java:243) ~[?:?]
	at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:152) ~[mango-3.7.7.jar:?]
	at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:143) ~[mango-3.7.7.jar:?]
	at com.serotonin.m2m2.rt.dataImage.DataPointRT$EventNotifyWorkItem.execute(DataPointRT.java:796) ~[mango-3.7.7.jar:?]
	at com.serotonin.m2m2.rt.maint.BackgroundProcessingImpl$RejectableWorkItemRunnable.run(BackgroundProcessingImpl.java:629) ~[mango-3.7.7.jar:?]
	at com.serotonin.timer.Task.runTask(Task.java:179) ~[mango-3.7.7.jar:?]
	at com.serotonin.timer.TaskWrapper.run(TaskWrapper.java:23) ~[mango-3.7.7.jar:?]
	at com.serotonin.timer.OrderedThreadPoolExecutor$OrderedTaskCollection.run(OrderedThreadPoolExecutor.java:314) ~[mango-3.7.7.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]

Based on this behavior I have a few questions.

Why did Mango write the same message to the log file 100,000+ times in a minute? What could have happened with a Modbus device to do that?
What does this warning actually mean?
Is there a way to limit what Mango writes to the log file?
Could this writing to the log file spike the CPU, and then the Java process never release the CPU on the server side?

Based on Ralf's suggestion I did have additional internal data points being monitored. Screenshots of a few of those are posted below. Nothing looks out of the ordinary... The circled area represents when we say the increase in CPU usage.

High Priority Tasks Queued has stayed at 0

Active Thread Count:
Active Tread Count.png

High Priority Items Running
High Priority Items Running.png

Let me know if any other information would help with this.

Thanks,
Chad

Ralf

Hi Chad,

I always turn the logs off. Only if there are any problems with the data-source I turn it on to find the reason.
If you config the data-source correct, what is the reason to log the things continuously?
It needs r/w-recources and if you have many to log CPU Power.
If something went wrong you can get an event and check the data-source.

It would be a good thing to have a limit for a "message shower" (logs or events).

CraigWeb

@cmusselm Those warnings are coming from the rest controller. It seems like there are dashboards that have opened WebSockets to get updates on those points and the WebSocket messages are falling. As to the number of warnings, this would depend on the number of datapoints x the number of WebSockets opened on each point.

cmusselm

Thank you, Craig, and Ralf.

That helps a lot to know that the messages are coming from the dashboard. We do have a couple views set up for the users to see widgets for the hardware with various data points displaying. This particular client has a good amount of devices in the field, so we're gathering quite a bit of info via <ma-get-point-value tags. We have 224 on one page and 483 on another. Seeing it typed out looks like a lot. Is that too many, or should I try to use a different method if needing more than x data points on a page?

Quick question around Ralf's comment on disabling logging for a data source. I am not using the I/O log feature. Is there a way to make a data source not write data to the ma.log file?

MattFox

Use a ma-point-values and combine all your points into that. Better to do 1 query than over 400 at a time....

cmusselm

After doing more digging, here's what we have done to hopefully resolve this, or at least reduce the chance it happens again. We have made the updates, but have not rebooted production to have them take place. Once we do that, I'll let it run for a few weeks and hopefully remember to report back on how things are going.

Thread Pools

We increased thread pools as originally suggested by Ralf

API Calls

We could see that each <ma-get-point-value has its own API call, and the Watchlist made 2 API calls (1 to get the query string, and a 2nd to use the query string to pull all of the data points)
In our case, the Watchlist had grown to over 1,200 data points for various reason, and a majority of them did not need to appear on the UI.
The API call to get all of those data points from the Watchlist took ~6 seconds.
Because of this we refactored our UI to not use the Watchlist and only use <ma-get-point-value
Each one of these API calls ran for only ~100ms on average, and were performed asynchronously so overall it didn't take long to get the values

WebSockets

Thanks to Craig's reply we know two different messages we've seen in the ma.log are related to websockets (log messages listed in the threads above)
As seen in the screenshot, we had a websocket for both the Watchlist and our point-values.
Along the same lines as API calls, the elimination of using a Watchlist will eliminate unnecessary data point values from being pulled from the system
This way, if the websocket fails in the future, we won’t have to worry about 1000+ unneeded data point values failing and trying to write to the log

Logging

In addition to the measures taken above, we have also made updates to not log warnings or info events in production for both org.eclipse.jetty and com.serotonin.m2m2
These are what wrote thousands of messages to the log file when the websocket failed
We believe it was writing to the log file that contributed to the CPU spike
With the log level increased to Error for these items, even if there is a websocket issue in the future, the info and warning messages should not be written and hopefully we won’t see an increase in CPU
- Updated log4j2.xml file to have the <Loggers> level be error

cmusselm

Unfortunately we're still seeing CPU increases after the updates above were made. The ma.log file is clean for the times that the CPU is increasing, so our updates to reduce the log size did work at least.

Its always the Java process eating up the CPU. We're using OpenJDK14U-jdk_x64_windows_hotspot_14.0.1_7 We've been using this 6/15/2020. Any thoughts about the Java version and if we should look at using a different one?

The CPU increases seem to happen when we are working to add new Modbus devices to Mango, configuring them to ensure communications are working as expected.

Has any one else experienced anything like this, or is there something we may be doing wrong when getting new Modbus hardware online?

Thanks,
Chad

Ralf

Hi Chad,

we're using AdoptOpenJDK 11.0.8.10 (x64) without any problems...
Maybe use this Version?

cmusselm

Thanks, Ralf. I'll see if/when we can switch to that and post an update.

cmusselm

We installed Adopt OpenJDK v11.0.10.9 on our production environment last Thursday and for a couple days things looked good; CPU was steadily at or below 4%. On Saturday it jumped up to 16% and is holding there, which is the same behavior we saw with v14 of Adopt OpenJDK. Every few days or weeks the CPU jumps and stays there until it gets to 99% and we have to restart the app to get CPU back down.

cmusselm

It's been quite a while, but figured I would post a message to let everyone know our fix.

Turns out we were running into a Jetty bug which the support team identified. The latest to v3 core (3.7.8) was applied, and so far after 3+ days we have not seen the issue anymore.

Thanks,
Chad

cwangv

@cmusselm
that is a good outcome! Thanks for the update.