Unfortunately we're still seeing CPU increases after the updates above were made. The ma.log file is clean for the times that the CPU is increasing, so our updates to reduce the log size did work at least.
It's always the Java process eating up the CPU. We're using OpenJDK14U-jdk_x64_windows_hotspot_14.0.1_7, and we've been using it since 6/15/2020. Any thoughts about the Java version and whether we should look at using a different one?
The CPU increases seem to happen when we are working to add new Modbus devices to Mango, configuring them to ensure communications are working as expected.
Has anyone else experienced anything like this, or is there something we may be doing wrong when bringing new Modbus hardware online?
Thanks, Craig, that explains some of the results I've seen.
For the specific example I showed, the poll had a timeout of 4 seconds, and it was coming back in 3.2 seconds. Since there was no Modbus device actually connected to that IP, did Mango report the poll simply because it received TCP/IP info from the server?
I have a question around Modbus communications, specifically the poll duration. While working on items in our test environment, we enabled a Modbus data source with the server’s loopback IP (127.0.0.1) and port 502. The server has no Modbus device attached to it, and no simulator or anything else listening on port 502.
As expected, the Slave Monitor is 0 since there is no Modbus device for Mango to talk to.
What we get that is somewhat unexpected is successful polls. In the Runtime Status tab, the polls are succeeding, taking about 3.2 seconds each.
We thought we’d see timeouts instead of successful polls. Are there any ideas why the polls aren’t timing out? Is the server sending TCP/IP info back to Mango, and that is triggering the successful poll?
I did turn on I/O logging for the data source, but nothing gets logged outside of the data source being enabled and disabled.
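Not Mango-specific, but one way to sanity-check what raw TCP does here: on localhost, connecting to a port with no listener is normally refused almost immediately rather than hanging until the timeout, so a poll that runs ~3.2 s and reports success suggests something did accept the connection. A minimal Python sketch (the port is just whatever free port the OS hands back, not port 502):

```python
import socket

# Grab a localhost port that is currently free by binding to port 0
# and releasing it (small race window, but fine for a demo).
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
free_port = probe.getsockname()[1]
probe.close()

# Connecting to a port with no listener on localhost is refused right
# away; it does not sit there until the 4-second timeout expires.
try:
    socket.create_connection(("127.0.0.1", free_port), timeout=4)
    result = "connected"
except ConnectionRefusedError:
    result = "refused"
except socket.timeout:
    result = "timed out"
print(result)
```

If that prints "refused" on your test box too, then whatever is answering on 127.0.0.1:502 during the "successful" polls is worth hunting down.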
After doing more digging, here's what we have done to hopefully resolve this, or at least reduce the chance it happens again. We have made the updates, but have not rebooted production for them to take effect. Once we do that, I'll let it run for a few weeks and hopefully remember to report back on how things are going.
- We increased thread pools as originally suggested by Ralf
- We could see that each <ma-get-point-value tag makes its own API call, and the Watchlist made 2 API calls (one to get the query string, and a second to use that query string to pull all of the data points)
- In our case, the Watchlist had grown to over 1,200 data points for various reasons, and a majority of them did not need to appear on the UI.
- The API call to get all of those data points from the Watchlist took ~6 seconds.
- Because of this, we refactored our UI to drop the Watchlist and use only <ma-get-point-value tags
- Each of these API calls ran for only ~100ms on average, and they were performed asynchronously, so overall it didn't take long to get the values
- Thanks to Craig's reply we know two different messages we've seen in the ma.log are related to websockets (log messages listed in the threads above)
- As seen in the screenshot, we had a websocket for both the Watchlist and our point-values.
- Along the same lines as the API calls, dropping the Watchlist also eliminates unnecessary data point values being pulled from the system
- This way, if the websocket fails in the future, we won’t have to worry about 1000+ unneeded data point values failing and trying to write to the log
- In addition to the measures taken above, we have also made updates to not log warnings or info events in production for both org.eclipse.jetty and com.serotonin.m2m2
- These are what wrote thousands of messages to the log file when the websocket failed
- We believe it was writing to the log file that contributed to the CPU spike
- With the log level increased to Error for these items, even if there is a websocket issue in the future, the info and warning messages should not be written and hopefully we won’t see an increase in CPU
- We updated the log4j2.xml file to set the <Loggers> level to error
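For anyone wanting to make the same change, a fragment along these lines worked for us (this is a sketch, not our full file; the logger names are the two packages mentioned above, and your surrounding log4j2.xml layout and appenders may differ):

```xml
<Loggers>
    <!-- Raise these packages to error so the websocket warn/info
         bursts are no longer written to ma.log -->
    <Logger name="org.eclipse.jetty" level="error"/>
    <Logger name="com.serotonin.m2m2" level="error"/>
</Loggers>
```

With additivity left at its default, these named loggers still forward error-level events to the root logger's appenders, so real errors still reach ma.log.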
One other quick note I forgot to include...
When I first tried this solution I was getting an error about a valid certificate path. I found this on Stack Overflow, which resolved that issue: https://stackoverflow.com/questions/21076179/pkix-path-building-failed-and-unable-to-find-valid-certification-path-to-requ
Thank you, Craig, and Ralf.
That helps a lot to know that the messages are coming from the dashboard. We do have a couple of views set up for the users to see widgets for the hardware, with various data points displaying. This particular client has a good amount of devices in the field, so we're gathering quite a bit of info via <ma-get-point-value tags. We have 224 on one page and 483 on another. Seeing it typed out, that looks like a lot. Is that too many, or should I try a different method when needing more than x data points on a page?
Quick question around Ralf's comment on disabling logging for a data source. I am not using the I/O log feature. Is there a way to make a data source not write data to the ma.log file?
Thank you, Fox!!!! I got a proof of concept working and will be off and running now.
For those that want a bit more info on what was done, here are the steps I took for this.
- Read Fox's post: https://forum.mango-os.com/topic/4834/how-to-get-json-data-via-api/2
- Read the help page on logging into Mango via the API: https://docs-v3.mango-os.com/mango-rest-api-authentication/
In my case we were trying to hit the Mango API from a scripting data source within Mango. We used HttpBuilder, as Fox indicated and outlined well in his post.
When determining the headers needed to send the request I mainly looked at the Mango help page linked to above.
At first I tried to hit the runtime status endpoint directly and got an unauthorized status back. This was because the resource hitting the API wasn't logged in. I then added the logic to log into Mango (following the first login example on the Mango help doc).
Even after logging in via the API, I still received a 401 error when trying to hit the runtime status endpoint.
After flailing around unsuccessfully for a day I decided to try the JWT method at the bottom of the help page. That worked perfectly, and easily! After setting up the JWT for a user, I was able to use the generated token and access the runtime status endpoint with no issues.
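For anyone following the same path: once the JWT is generated, the call is just a normal HTTP request with an Authorization header. Here's a minimal Python sketch of building that request. The host, data source XID, and token are placeholders, and the exact path prefix may differ by Mango version, so check it against your install:

```python
import urllib.request

# Placeholders -- substitute your Mango host and the JWT you generated
# for the user in Mango's UI.
BASE_URL = "https://mango.example.com"
TOKEN = "PLACEHOLDER.JWT.TOKEN"

def status_request(ds_xid: str) -> urllib.request.Request:
    """Build (but don't send) a data-source status request with the
    bearer token attached. Path prefix is an assumption for Mango v3."""
    return urllib.request.Request(
        f"{BASE_URL}/rest/v2/data-sources/status/{ds_xid}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )

req = status_request("DS_modbus_01")
print(req.get_header("Authorization"))
```

Sending it is then just `urllib.request.urlopen(req)`; the key part is that the `Authorization: Bearer <token>` header replaces the whole login/session dance.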
I have a client that would like to see what % of the previous X polls were successful. I've tried many things, but can't get what he really wants. Hopefully I'm missing something obvious.
We can look at the Runtime Status, or hit the /data-sources/status endpoint to get a snapshot of the last 10 successful and aborted polls. Any chance these are stored as a data point somewhere with the history?
Here's what we've looked into so far:
I know there are the internal data points, which include Poll Success Percent, but that goes back to the time the data source was restarted. That won't give us a good picture of the polls over the past 1 or 2 minutes.
I've also looked into the "previous sequential successful polls" but that seems to have a bug. See this post I had on it for more details: https://forum.mango-os.com/topic/4948/question-on-internal-metric-previous-sequential-successful-polls/
Right now we're using the Slave Monitor 1/0 value to know the % of time over the past x seconds Mango has been able to communicate with the Modbus device, but that's not really what the client wants. That doesn't tell him if individual polls aborted.
The only other solution I can think of is to hit the /data-sources/status API every x seconds, parse through the results and save them as data points.
If we did that, is there a way to hit the Mango API from a Scripting Data Source? That would make it easier to schedule the job and save results into a data point. It doesn't look like jQuery or an XMLHttpRequest can be used in the scripting data source. If possible, I'd like to avoid custom jobs and code being written outside of Mango.
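If we do go the route of polling /data-sources/status on a schedule, the per-window math itself is simple. Here's a Python sketch of the calculation; note the JSON field names (`latestPolls`, `latestAbortedPolls`) are my assumptions based on what the Runtime Status tab shows, not the verified payload, so they'd need to be checked against a real response before moving this logic into a scripting data source:

```python
import json

# Hypothetical snapshot -- field names are guesses modeled on the
# Runtime Status tab (last successful and aborted polls).
sample = json.loads("""
{
  "latestPolls":        [{"startTime": 1}, {"startTime": 2}, {"startTime": 4}],
  "latestAbortedPolls": [{"startTime": 3}]
}
""")

def poll_success_percent(status: dict) -> float:
    """Percent of the polls in this snapshot that completed successfully."""
    ok = len(status.get("latestPolls", []))
    aborted = len(status.get("latestAbortedPolls", []))
    total = ok + aborted
    return 100.0 * ok / total if total else 0.0

print(poll_success_percent(sample))  # 75.0
```

Saving that percentage into a data point on each run would give the client the rolling per-window success view he's after.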
We've had another jump in CPU after the reboot last week. It went up to 28% later yesterday.
We had the following warning getting written to the ma.log file at the rate of 10 times per ms, 10,000 per second. We went through 5 log files in a minute. Luckily that burst of writing to the log file didn't last long.
WARN 2021-03-11T15:18:23,659 (com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler$PointValueWebSocketListener.pointUpdated:246) - Error sending websocket message
com.serotonin.m2m2.web.mvc.rest.v1.websockets.WebSocketClosedException: null
    at com.serotonin.m2m2.web.mvc.rest.v1.websockets.MangoWebSocketHandler.sendStringMessageAsync(MangoWebSocketHandler.java:133) ~[?:?]
    at com.serotonin.m2m2.web.mvc.rest.v1.websockets.MangoWebSocketHandler.sendMessage(MangoWebSocketHandler.java:99) ~[?:?]
    at com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler.sendMessage(PointValueWebSocketHandler.java:143) ~[?:?]
    at com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler$PointValueWebSocketListener.sendNotification(PointValueWebSocketHandler.java:195) ~[?:?]
    at com.serotonin.m2m2.web.mvc.rest.v1.websockets.pointValue.PointValueWebSocketHandler$PointValueWebSocketListener.pointUpdated(PointValueWebSocketHandler.java:243) ~[?:?]
    at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:152) ~[mango-3.7.7.jar:?]
    at com.serotonin.m2m2.rt.dataImage.DataPointEventMulticaster.pointUpdated(DataPointEventMulticaster.java:143) ~[mango-3.7.7.jar:?]
    at com.serotonin.m2m2.rt.dataImage.DataPointRT$EventNotifyWorkItem.execute(DataPointRT.java:796) ~[mango-3.7.7.jar:?]
    at com.serotonin.m2m2.rt.maint.BackgroundProcessingImpl$RejectableWorkItemRunnable.run(BackgroundProcessingImpl.java:629) ~[mango-3.7.7.jar:?]
    at com.serotonin.timer.Task.runTask(Task.java:179) ~[mango-3.7.7.jar:?]
    at com.serotonin.timer.TaskWrapper.run(TaskWrapper.java:23) ~[mango-3.7.7.jar:?]
    at com.serotonin.timer.OrderedThreadPoolExecutor$OrderedTaskCollection.run(OrderedThreadPoolExecutor.java:314) ~[mango-3.7.7.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
    at java.lang.Thread.run(Thread.java:832) [?:?]
Based on this behavior I have a few questions.
- Why did Mango write the same message to the log file 100,000+ times in a minute? What could have happened with a Modbus device to do that?
- What does this warning actually mean?
- Is there a way to limit what Mango writes to the log file?
- Could this writing to the log file spike the CPU, and could the Java process then fail to release the CPU on the server side?
Based on Ralf's suggestion, I did have additional internal data points being monitored. Screenshots of a few of those are posted below. Nothing looks out of the ordinary. The circled area represents when we saw the increase in CPU usage.
High Priority Tasks Queued has stayed at 0
Active Thread Count:
High Priority Items Running
Let me know if any other information would help with this.