CPU Jumps after 50+ Days of Uptime & Adding Modbus devices
-
Hello,
We have seen some significant jumps in CPU over the past 6 days. The instance had been running for 57 days using only 2–4% CPU, but the Java process is now taking 99% on a Windows server. (We've since restarted the Mango service and things are back to normal, but we hope to find out what caused the spike so we can avoid it in the future.)
We have deployed 12 Modbus devices (water pumps) over the past week, so could it be something related to how we are setting them up?
More details are below:
Version 3.7.8
Java Version: OpenJDK14U-jdk_x64_windows_hotspot_14.0.1_7
CPU Increases:
• From 2% to 27% on 2/26 @ 16:40 – 51 days uptime with ~2 – 4% CPU usage
• From 27% to 39% on 2/26 @ 19:20
• From 39% to 65% on 2/27 @ 17:00
• From 65% to 100% on 3/2 @ 16:25
Hardware setup
12 water pumps using Modbus data sources. Each pump also has a corresponding scripting data source that runs every 3 seconds to see if the user has clicked a button to start or stop the pump. In addition, there is 1 Meta data source that houses 5 data points per pump for various calculated fields (60 Meta data points total for the 12 pumps).
The IT team gets the data sources configured in Mango along with a dashboard/UI, ready for the field techs to add an IP address and enable them. Once the devices are enabled, the field techs tweak cell modems, check or uncheck the Encapsulation option depending on whether they are RTU, and make other modifications like that to get Mango reading values properly. While these tweaks are being worked on, Mango usually cannot communicate with the cell modem and the slave monitor is 0.
Timing of Pump Data Sources Being Enabled and Worked:
Name - Enable Date
Pump 1 - 2/26 @ 3pm
Pump 2 - 2/27 @ 2:25pm
Pump 3 - 2/27 @ 2:58pm
Pump 4 - 3/1 @ 1:01pm
Pump 5 - 3/1 @ 5:03pm
Pump 6 - 3/2 @ 3:27pm
Pump 7 - 3/2 @ 4:07pm
Pump 8 - 3/2 @ 4:21pm
Pump 9 - 3/3 @ 8:02am
Pump 10 - 3/3 @ 3:30pm
Pump 11 - 3/3 @ 3:47pm
Pump 12 - 3/3 @ 3:56pm
MA Logs
We reviewed the ma.log files and found quite a bit going on. Yesterday the logs were cycling every 3 minutes because they filled up at 5MB.
On 2/26 we started to get the following error roughly every 4 milliseconds, about 250 per second. It would stop for a while, then come back around the same times the CPU increases occurred.
ERROR 2021-02-26T15:44:12,322 (com.serotonin.m2m2.db.dao.AbstractBasicDao$1.extractData:1063) - Internal Server Error org.eclipse.jetty.io.EofException: Closed com.infiniteautomation.mango.rest.v2.exception.GenericRestException: Internal Server Error org.eclipse.jetty.io.EofException: Closed
We’ve also noticed the following in the log file, though not nearly as frequently as the error listed above.
WARN 2021-03-02T15:48:34,770 (org.eclipse.jetty.websocket.common.extensions.compress.CompressExtension$Flusher.failed:455) - java.nio.channels.ClosedChannelException: null
Any help to dig into this is appreciated. Let me know if any additional info is needed and we’ll provide what we can.
Thanks,
Chad -
I found this post that shows the exact same error we were seeing in our ma.log file. https://forum.mango-os.com/topic/4668/persistent-tcp-publishers-stop-transfer/1
A different scenario caused the errors there, but it made me wonder if the fix could be the same.
The resolution in that case was to increase the thread pools. Any ideas if that may fix the issue we're running into now?
Thanks,
Chad -
Hi Chad,
Increase the thread pools a bit; I think that will help.
Our system has been running for over a year without any problems (restarts only for updates): 268 data sources, 32,359 data points and 9 publishers.
Another thing I do for each Mango installation is to configure the Mango internal data source with as many points as possible (Mango process CPU load, system CPU load etc.) to get an overview of what happens with the system and the data sources (= PLCs). Don't forget to set "override system purge settings" for this data source or you will get a very large database...
Also check your update period versus poll duration for the Modbus data sources. Maybe you have to increase your update period because the PLC needs too much time or the network traffic is limited?
-
Ralf,
Thank you for the help! It's greatly appreciated. It's nice to see how you have the parameters set and your overall number of data sources and data points. We'll definitely update the thread pool sizes.
I have added several more of the internal data points as suggested and will check them in a few days to get a baseline number, and set up alerts going forward if it looks like anything is spiking. Good tip on the purge override as well so we don't bloat the DB.
Good thought about the update period vs poll duration. Before the guys have the device fully set up, the poll durations can be way longer than our update period as comms can be sketchy. We'll look at that as well and see if we should have a longer update period until the devices are fully configured and working as expected.
Thanks,
Chad -
We've had another jump in CPU after the reboot last week. It went up to 28% later yesterday.
We had the following warning getting written to the ma.log file at the rate of 10 times per ms, 10,000 per second. We went through 5 log files in a minute. Luckily that burst of writing to the log file didn't last long.
WARN 2021-03-11T15:18:23,659 (com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler$PointValueWebSocketListener.pointUpdated:246) - Error sending websocket message
com.serotonin.m2m2.web.mvc.rest.v1.websockets.WebSocketClosedException: null
    at com.serotonin.m2m2.web.mvc.rest.v1.websockets.MangoWebSocketHandler.sendStringMessageAsync(MangoWebSocketHandler.java:133) ~[?:?]
    at com.serotonin.m2m2.web.mvc.rest.v1.websockets.MangoWebSocketHandler.sendMessage(MangoWebSocketHandler.java:99) ~[?:?]
    at com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler.sendMessage(PointValueWebSocketHandler.java:143) ~[?:?]
    at com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler$PointValueWebSocketListener.sendNotification(PointValueWebSocketHandler.java:195) ~[?:?]
    at com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler$PointValueWebSocketListener.pointUpdated(PointValueWebSocketHandler.java:243) ~[?:?]
    at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:152) ~[mango-3.7.7.jar:?]
    at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:143) ~[mango-3.7.7.jar:?]
    at com.serotonin.m2m2.rt.dataImage.DataPointRT$EventNotifyWorkItem.execute(DataPointRT.java:796) ~[mango-3.7.7.jar:?]
    at com.serotonin.m2m2.rt.maint.BackgroundProcessingImpl$RejectableWorkItemRunnable.run(BackgroundProcessingImpl.java:629) ~[mango-3.7.7.jar:?]
    at com.serotonin.timer.Task.runTask(Task.java:179) ~[mango-3.7.7.jar:?]
    at com.serotonin.timer.TaskWrapper.run(TaskWrapper.java:23) ~[mango-3.7.7.jar:?]
    at com.serotonin.timer.OrderedThreadPoolExecutor$OrderedTaskCollection.run(OrderedThreadPoolExecutor.java:314) ~[mango-3.7.7.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
    at java.lang.Thread.run(Thread.java:832) [?:?]
Based on this behavior I have a few questions.
- Why did Mango write the same message to the log file 100,000+ times in a minute? What could have happened with a Modbus device to do that?
- What does this warning actually mean?
- Is there a way to limit what Mango writes to the log file?
- Could this writing to the log file spike the CPU, with the Java process then never releasing it on the server side?
Based on Ralf's suggestion I did have additional internal data points being monitored. Screenshots of a few of those are posted below. Nothing looks out of the ordinary... The circled area represents when we saw the increase in CPU usage.
High Priority Tasks Queued has stayed at 0.
Active Thread Count: (screenshot)
High Priority Items Running: (screenshot)
Let me know if any other information would help with this.
Thanks,
Chad -
Hi Chad,
I always turn the logs off. Only if there are any problems with a data source do I turn them on to find the reason.
If you configure the data source correctly, what is the reason to log things continuously?
It needs read/write resources and, if you have many things to log, CPU power.
If something goes wrong you can get an event and check the data source. It would be a good thing to have a limit for a "message shower" (logs or events).
-
@cmusselm Those warnings are coming from the REST controller. It seems like there are dashboards that have opened WebSockets to get updates on those points, and the WebSocket messages are failing. As to the number of warnings, this would depend on the number of data points x the number of WebSockets opened on each point.
-
Thank you, Craig and Ralf.
It helps a lot to know that the messages are coming from the dashboard. We do have a couple of views set up for the users to see widgets for the hardware with various data points displaying. This particular client has a good number of devices in the field, so we're gathering quite a bit of info via <ma-get-point-value> tags. We have 224 on one page and 483 on another. Seeing it typed out, that looks like a lot. Is that too many, or should I try to use a different method when needing more than x data points on a page?
Quick question around Ralf's comment on disabling logging for a data source. I am not using the I/O log feature. Is there a way to make a data source not write data to the ma.log file?
-
Use an <ma-point-values> and combine all your points into that. Better to do 1 query than over 400 at a time...
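Something along these lines should work, assuming the points can be gathered into one array first. Treat the attribute names and the RQL query string as examples only and double-check them against the API docs page on your install:

    <!-- collect the pump points once; the query string and limit are only illustrations -->
    <ma-point-query query="'name=like=Pump*'" limit="600" points="pumpPoints"></ma-point-query>
    <!-- one request for the latest value of every matched point, instead of 400+ separate calls -->
    <ma-point-values points="pumpPoints" values="pumpValues" latest="1"></ma-point-values>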
-
After doing more digging, here's what we have done to hopefully resolve this, or at least reduce the chance it happens again. We have made the updates, but have not rebooted production to have them take effect. Once we do that, I'll let it run for a few weeks and hopefully remember to report back on how things are going.
Thread Pools
- We increased thread pools as originally suggested by Ralf
API Calls
- We could see that each <ma-get-point-value> makes its own API call, and the Watchlist made 2 API calls (1 to get the query string, and a 2nd to use the query string to pull all of the data points)
- In our case, the Watchlist had grown to over 1,200 data points for various reasons, and a majority of them did not need to appear on the UI.
- The API call to get all of those data points from the Watchlist took ~6 seconds.
- Because of this we refactored our UI to not use the Watchlist and only use <ma-get-point-value
- Each of these API calls ran for only ~100ms on average and was performed asynchronously, so overall it didn't take long to get the values (a rough example of the markup is below)
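To illustrate the pattern, each point on the page gets its own directive along these lines. The XID and variable name are made up for the example, and the attribute names should be checked against the API docs on your install:

    <!-- one directive per point; each one registers its own request/WebSocket subscription for that point -->
    <ma-get-point-value point-xid="DP_pump1_flow" point="pump1Flow"></ma-get-point-value>
    <span>{{ pump1Flow.value }}</span>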
WebSockets
- Thanks to Craig's reply we know the two different messages we've seen in ma.log are related to WebSockets (log messages listed in the posts above)
- As seen in the screenshot, we had a websocket for both the Watchlist and our point-values.
- Along the same lines as the API calls, eliminating the Watchlist stops unnecessary data point values from being pulled from the system
- This way, if the WebSocket fails in the future, we won’t have to worry about 1,000+ unneeded data point values failing and trying to write to the log
Logging
- In addition to the measures taken above, we have also made updates to not log warnings or info events in production for both org.eclipse.jetty and com.serotonin.m2m2
- These are what wrote thousands of messages to the log file when the websocket failed
- We believe it was writing to the log file that contributed to the CPU spike
- With the log level increased to Error for these items, even if there is a websocket issue in the future, the info and warning messages should not be written and hopefully we won’t see an increase in CPU
- Updated the log4j2.xml file so the logger levels under <Loggers> are set to error (rough snippet below)
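For reference, the logger section of our log4j2.xml now looks roughly like this. The AppenderRef is a placeholder; keep whatever appender refs were already in your file:

    <Loggers>
        <!-- only errors from Jetty and the Mango core are written now -->
        <Logger name="org.eclipse.jetty" level="error"/>
        <Logger name="com.serotonin.m2m2" level="error"/>
        <Root level="error">
            <AppenderRef ref="..."/> <!-- placeholder: keep the existing appender refs -->
        </Root>
    </Loggers>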
-
Unfortunately we're still seeing CPU increases after the updates above were made. The ma.log file is clean for the times that the CPU is increasing, so our updates to reduce the log size did work at least.
It's always the Java process eating up the CPU. We're using OpenJDK14U-jdk_x64_windows_hotspot_14.0.1_7, which we've been running since 6/15/2020. Any thoughts about the Java version and whether we should look at using a different one?
The CPU increases seem to happen when we are working to add new Modbus devices to Mango, configuring them to ensure communications are working as expected.
Has anyone else experienced anything like this, or is there something we may be doing wrong when getting new Modbus hardware online?
Thanks,
Chad -
Hi Chad,
We're using AdoptOpenJDK 11.0.8.10 (x64) without any problems...
Maybe use this version? -
Thanks, Ralf. I'll see if/when we can switch to that and post an update.
-
We installed Adopt OpenJDK v11.0.10.9 on our production environment last Thursday and for a couple days things looked good; CPU was steadily at or below 4%. On Saturday it jumped up to 16% and is holding there, which is the same behavior we saw with v14 of Adopt OpenJDK. Every few days or weeks the CPU jumps and stays there until it gets to 99% and we have to restart the app to get CPU back down.
-
It's been quite a while, but figured I would post a message to let everyone know our fix.
It turns out we were running into a Jetty bug, which the support team identified. The latest v3 core release (3.7.8) was applied, and so far, after 3+ days, we have not seen the issue again.
Thanks,
Chad -
@cmusselm
That is a good outcome! Thanks for the update.