High CPU usage

btrombo

On a clean reboot two cores are pinned to 100% while the remaining 6 are low as expected. Ram usage is 50% and Mango still functions but after days to weeks I need to restart the system as datasources stop updating(At this point almost all the ram is used). I have been trying to figure out where the issues arise. I have 6 GB allocated to Java and 8GB total for the system. I have all the newest Mango updates. Restarted the server recently. I included some screenshot metrics page and htop of the system. This is a very BACnet heavy system. I do get the timeout alerts that others have seen from the recent update also. I'm looking for some advice on how to resolve this issue.
0_1475708501509_upload-e63a45f3-9c7e-4815-8de5-50083f27b75f
0_1475708541440_upload-f9376fcb-6e01-405e-bb48-5a4a8907e8b6
0_1475708584818_upload-88467dfb-b7a6-4486-87f7-8443dc3627dc
0_1475709025760_upload-d1685e1b-3349-4b63-8b8d-40dd94c8dd9a

phildunlap

Hi btrombo,

A few questions / thoughts:

Do you have Process event handlers? What're they doing? I would guess if the PPID was visible in that top display, we would see almost all the Java processes have the same PPID, which is likely the PID in Mango/bin/ma.pid, the Mango process. I'm guessing the rest of those are forks for process handlers. The PPID of the core Mango process should be the ma-start. sh process. Is a child process the one eating the CPU?
How many BACnet data sources do you have? It can often be more efficient to have all the points on a single BACnet data source for each Local Device. If this route is to be tried, you'll want to backup your database, then you can run an SQL command like this one (which will put all BACnet IP points on the same data source, such that the others can be disabled), and then restart Mango:

UPDATE dataPoints SET dataSourceId=(SELECT id FROM dataSources WHERE dataSourceType='BACnetIP' LIMIT 1) WHERE dataSourceId IN (SELECT id FROM dataSources WHERE dataSourceType='BACnetIP');

Can you swap your Mango/classes/log4j.xml and Mango/classes/debug-log4j.xml files, then reload your Log4J Configuration on the System Settings page? Is there anything interesting in your log files? Can you share them after debug log is enabled for a while the problem is observed?

btrombo

Hey Phildunlap,

There are event handlers but not "process event handlers". Here is the top in tree format, it looks like there is 3 layers. ma-start.sh > java > java and looks like the second level is using 100%.

0_1475789740425_upload-77a6591a-5558-4213-818d-4da8d1da8cbe

The way we have it arranged is each physical device is a datasource. I tried this command and didn't really change anything in terms of performance. Same resource usage and everything was working.
Looks like there is a lot of theses in the logs.

 2016-10-06 15:40:48,337 (com.serotonin.ma.bacnet.BACnetDataSourceRT.fireMessageExceptionEvent:651) -  
java.lang.IllegalStateException: Cannot send future request in the transport thread. Use a callback call instead, or make this
call in a new thread.
	at com.serotonin.bacnet4j.transport.DefaultTransport.send(DefaultTransport.java:239)
	at com.serotonin.bacnet4j.LocalDevice.send(LocalDevice.java:389)
	
at com.serotonin.ma.bacnet.BACnetDataSourceRT.sendCovSubscriptionImpl(BACnetDataSourceRT.java:630)
	at com.serotonin.ma.bacnet.BACnetDataSourceRT.covNotificationReceived(BACnetDataSourceRT.java:548)
	
at com.serotonin.bacnet4j.event.DeviceEventHandler.fireCovNotification(DeviceEventHandler.java:134)
	at com.serotonin.bacnet4j.service.unconfirmed.UnconfirmedCovNotificationRequest.handle(Uncon 2016-10-06 15:40:48,337 
(com.serotonin.ma.bacnet.BACnetDataSourceRT.fireMessageExceptionEvent:651) -  
java.lang.IllegalStateException: Cannot send future request in the transport thread. Use a callback call instead, or make this call in a new thread.
	
at com.serotonin.bacnet4j.transport.DefaultTransport.send(DefaultTransport.java:239)
	at com.serotonin.bacnet4j.LocalDevice.send(LocalDevice.java:389)
	
at com.serotonin.ma.bacnet.BACnetDataSourceRT.sendCovSubscriptionImpl(BACnetDataSourceRT.java:630)
	at com.serotonin.ma.bacnet.BACnetDataSourceRT.covNotificationReceived(BACnetDataSourceRT.java:548)
	
at com.serotonin.bacnet4j.event.DeviceEventHandler.fireCovNotification(DeviceEventHandler.java:134)
	at com.serotonin.bacnet4j.service.unconfirmed.UnconfirmedCovNotificationRequest.handle(UnconfirmedCovNotificationRequest.java:69)
	at com.serotonin.bacnet4j.transport.DefaultTransport.receiveAPDU(DefaultTransport.java:539)
	at com.serotonin.bacnet4j.transport.DefaultTransport.receiveImpl(DefaultTransport.java:463)
	
at com.serotonin.bacnet4j.transport.DefaultTransport.run(DefaultTransport.java:404)
	at java.lang.Thread.run(Thread.java:745)
INFO

I noticed in a different thread about garbage collection and using the concurrent instead of the parallel. I tried the concurrent and lower the RAM by 1G to allow more for the OS and boot up and UI performance seemed a lot better but the CPU core was still pinned, I turned this back for now but I will look at that again once this issue is resolved.

phildunlap

Are you sure? I do not know any other paths to cause forks, with possible exceptions of the ES Config module and the VMstat module (which we can see in your top display). In the first image it looked like a child process was doing some heavy lifting of some sort.

I'll send you an email about this one, and we can work on it together via email.

btrombo

Just want to clarify that by process event handlers you are talking about an event detector that runs a local process or shell command? I have many event detectors that send emails if a data point is above or below a threshold but nothing that runs OS level commands.

I received your email.

phildunlap

Yes, you understood me correctly. How very interesting, then; I wonder how those child processes have come into being.

btrombo

I found this on the thread monitoring, I'm assuming the high CPU time would relate to the two cores that are pinned. Not sure if this can help you diagnose the issues. I also emailed you the log files from the BACnet issue.

0_1476904402601_upload-370c9e4c-3bb1-4df7-8f1a-3093c9a72952

phildunlap

Interesting. That would suggest to me that main slowdown is writing out the data. So, a few questions,

Do you have very many Meta points? If so, do any of them have multiple context updates from the same data source? This can produce a lot of extra load on the CPU, and while there are plans to mitigate this condition, it isn't currently available.
What version of the MangoNoSql module do you have? If 1.2.3, what's db.nosql.shardStreamType set to in your env.properties file? (I'm guessing it's not there?)

btrombo

I would say yes I have a lot of these(75ish without counting) and a bunch of them do use points from the same data source and set to update on context change.
Yes 1.2.3, db.nosql.shardStreamType=INPUT_STREAM

JoelHaggar

Those meta data points could be causing your issues. Just go through them and make sure only 1 data point from the same data source is checked to do context update. One trick to do this is to export the JSON of the Meta Data Points and then go through it that way.

phildunlap

Hi btrombo,

I just revisited your log files, and I noticed that across all 11 log files only 1 minute of time goes by, which means that error is pounding the log file. I suspect a considerable amount of the CPU is directed toward that error. If fixing the Meta points doesn't resolve it, try adding this category near the bottom of your Mango/class/log4j.xml (you will see others) and then reloading the log4j configuration using the Log4J Reset section of the system settings.

<category name="com.serotonin.bacnet4j.transport.DefaultTransport"><level value="fatal"/></category>

This will prevent the ERROR level log messages you were getting from the DefaultTransport class from being logged.

btrombo

So this took a while but I did both of these and still a CPU core pinned. I'm going to double check the meta data points tomorrow as that was a lot to go through and might have missed some. Although I did get at least 95%+ of them there is no improvement, I'm not sure if one could cause this issue or if the CPU usage should have been progressively less as I changed them.

Those errors are no longer logging.

phildunlap

You can verify that you've gotten all the Meta points by doing something like this in a python script. You may wish to go to your System Settings page and do a configuration backup just before running it, such that you know you have the latest JSON of your configuration.

import json

configFile = open("/path/to/Mango/backup/Mango-Configuration.json")
config = json.load(configFile)
configFile.close()

dpXidDict = {}
for dp in config["dataPoints"] :
	dpXidDict[dp["xid"]] = dp
	
for dp in config["dataPoints"] :
	#check if it's got a context, use that as the suggestion it's a meta point
	if "context" not in dp["pointLocator"] :
		continue
	countDict = {}
	for ctx in dp["pointLocator"]["context"] :
		if ctx["updateContext"] :
			dsXid = dpXidDict[ctx["dataPointXid"]]["dataSourceXid"]
			if dsXid not in countDict :
				countDict[dsXid] = 1
			else :
				countDict[dsXid] += 1
	found = False
	for dsXid in countDict :
		if countDict[dsXid] > 1 :
			found = True
			print "Found too many context updates from dataSource %s on data point: %s - %s   with XID: %s" % (dsXid, dp["deviceName"], dp["name"], dp["xid"])
	if found :
		print "\n"

It should also be noted that you can probably cause this exact same issue by having different data sources updating the context, but with the same polling rate and quantize=true, but that multiple updates from the same data source is the main cause.

btrombo

I'm not sure if this has any help in trying to debug this but I noticed that I would get many alarms(timeouts, aborted polls) during the autobackup of the H2 Database Backup Settings.

btrombo

Gone through all these options, but still a pinned core. I tried to clean a lot of stuff up in the system(unused sources and points), I purged a bunch of old event logs and such and the UI is much more responsive in combination with all the meta point changes. The memory usage does still creep up over time as well. Any more ideas?

phildunlap

One this that can give us a little more information is doing a thread dump, or a few, while you're observing the problem. I prefer the output from the Logging Console page, and typically I will grab 3-5 thread dumps that are a few seconds apart. You can email those into support@infiniteautomation.com and we'll take a look.

I wonder if when you observe this is when that TimeoutTask count is reliably large on the internal metric page? Also having an internal metric point tracking the available memory can help get a grasp on the frequency of the issue.

btrombo

Hey thanks for the reply, I downloaded 5 dumps and emailed them in. I do have internal metrics for almost all the parameters available.
Here are two internal metric screenshots, I look at these often and they are stable as in I don't notice much change to either of these.

0_1478818536595_upload-58452cc5-90f4-44fe-8e6e-67480904fb47 0_1478818557900_upload-c3ce6a6a-4add-4158-9851-870b9bcdc904

Idle Memory:
0_1478818999560_upload-ba4f51d2-3320-4b52-93e5-7a6a843a82db

Virtual Memory:
0_1478819028518_upload-a310dad7-82ae-4b74-9029-bf0cfc7cfb8d

phildunlap

Can you post a picture of your Mango NoSql settings section of the system settings page? From the threads you sent in (processed by a python script to make them more readable and sort them), this thread looks like the killer:

    @ ConcurrentHashMap.java:advance:3339
		CpuTime: 1809691511511583	UserTime: 1809333440000000	LogTime: 15	Thread: high-pool-1-thread-1666	ID: 1711
		--=====--=====--=====--=====--=====--=====--=====--=====--
	  java.util.concurrent.ConcurrentHashMap$Traverser:advance:3339
	  java.util.concurrent.ConcurrentHashMap$BaseIterator:<init>:3391
	  java.util.concurrent.ConcurrentHashMap$ValueIterator:<init>:3430
	  java.util.concurrent.ConcurrentHashMap$ValuesView:iterator:4683
	  com.infiniteautomation.nosql.MangoNoSqlBatchWriteBehind:execute:63
	  com.serotonin.m2m2.rt.maint.BackgroundProcessing$1:run:54
	  java.util.concurrent.ThreadPoolExecutor:runWorker:1142
	  java.util.concurrent.ThreadPoolExecutor$Worker:run:617
	  java.lang.Thread:run:745
________________________________________________________________________________
	
    @ PlainDatagramSocketImpl.java:receive0:-2
		CpuTime: 413337358676	UserTime: 205690000000	LogTime: 11	Thread: BACnet4J IP socket listener	ID: 52
		--=====--=====--=====--=====--=====--=====--=====--=====--
	  java.net.PlainDatagramSocketImpl:receive0:-2
	  java.net.AbstractPlainDatagramSocketImpl:receive:143
	  java.net.DatagramSocket:receive:812
	  com.serotonin.bacnet4j.npdu.ip.IpNetwork:run:369
	  java.lang.Thread:run:745```

It's taking four orders of magnitude more time than the next long running runnable thread! Holy cow! I would wonder how many data points you have ( SELECT count(id) FROM dataPoints; ) but it shouldn't matter, as I've seen that HashMap working quite fast with hundreds of thousands of points (it's likely that it's the writing data loop, not the HashMap, so maybe this task has just never gotten ahead of the incoming data, and thus never died). I would also wonder if there's anything interesting like scripting data sources enabling and disabling large amount of data points or a few data sources. But, that also seems unlikely.

I suppose I also wonder if your Meta points are constantly querying ranges of data. The recommendation for stream type that follows would a significant effect in massive querying. If you are doing something like...

return p.past(MINUTE).sum;

you can potentially help Mango out by using some knowledge you have about that, with something like,

//Untested, replacement assuming a 2 second polling interval
var minuteAgo = new Date().getTime() - 60000;
var values = p.last(30);

var sum = 0;
for( var k = 0; k < values.length; k+=1 ) {
   if( values[k].time > minuteAgo )
      sum += values[k].value;
}
return sum;

as Mango will begin caching with the expectation of getting that value request (the last 30 values)

Another possibility is that a large number of threads used by BACnet is outcompeting this thread, and this thread isn't getting to run. We could test that by increasing the number of write behind threads in the No SQL settings on the system settings page (or lowering the points waiting spawn task threshhold if your point write threads is already > 1) such that more threads are trying to out-compete the BACnet threads for CPU time. But, I'm not sure that'll help the memory issue if you're not seeing the "Point values to be written" climbing. Your second internal metric page picture does have more values waiting and only 1 write thread. If you have liberty to play with the system some, I would try some settings like the following:

If this is a *nix, change db.shardStreamType to be db.shardStreamType=MAPPED_BYTE_BUFFER and use RANDOM_ACCESS_FILE if you are one Windows (MBB doesn't work on Windows yet, and I would expect MAPPED_BYTE_BUFFER to make a more significant impact than RAF, MBB is very fast). If your system is query heavy, this suggestion alone may make a huge difference. It would require a Mango restart to pick up the changed env.property
Lower the NoSQL setting (on the system settings) for "Batch write behind spawn threshold" to something pretty low, like 100 or 200 (not knowing your expected points/s it is difficult to guess confidently)
Raise the NoSQL setting for "Max batch write behind tasks" to 50 or so.

In most systems the most efficient number of write behind tasks is typically between 2-8 (displayed on the internal metric page as "Point value write threads" but I'm theorizing your optimal may fall between 30 and 50, so manipulate the spawn threshhold as such. Unfortunately, I am not overly confident in this being the solution, but it is something to try.

You may also consider disabling NoSQL backups (especially if you are not doing incremental backups, also in the NoSQL system settings section (I've seen massive databases take longer to finish a backup than it does for the next backup to start! And they were backing up the whole database every time!)).

I also wonder how many cores the computer running this has?

phildunlap

Nevermind the cores question, I see the answer in the first post. Also seeing that post again now, 16000 sounds like a gigantic threadpool. In your thread dumps, and 100 of both medium and low also seems very large. Medium and low are static pools, so that's incurring some overhead to keep those threads alive. About 250 threads, mostly from the low priority and medium priority pools, are just hanging out.

I know BACnet can require a decent sized high priority pool, but I would perhaps try adjusting your threadpools as
High priority = 20
High priority max = 200-1000 (if you have experienced a problem with a lower cap that drove the cap so high, then do as is needed)
Medium Priority = 4-8
Low priority = 3

Also, can you keep an eye out at your top for what Java process is consuming the CPU? In one of your pictures two java processes are consuming roughly equal CPU, and the forking is still mysterious to me, as you do not have process event handlers.

I editted my last post to include some details about using Mango's caching to lighten the query load of Meta points, and some other things.

btrombo

I adjusted the core pool to your suggestion except high priority max I changed to 10000. My system is 95% BACnet thus a while ago we needed to increase it because of issues and we have not touched it since. Those pictures were using htop a standard top is attached.
0_1479143095565_upload-10443284-fd3e-4512-9128-eec59047bfb6

Here is the current NoSQL settings:
0_1479143217656_upload-b7f1e1cd-e59f-46b9-8b4c-4b63bdf3af09

There are no scripts enabling or disabling data sources or data points. I know that we do use similar to return p.past(MINUTE).sum; but I will have to look a little closer at what and when that is happening.

I will try out some of your suggestions today and see if I can notice anything. Currently mango is running on a vmware server (ESXI) with about 7 other servers. The hardware of the server is an 8x2ghz cpu and mango has access to all 8 cores and 8GB of memory. The disks in the server I believe are Western digital red drives in a RAID configuration. The other 7 servers don’t use much of any of the resources (cpu time, disk access) most of the time. It is currently not an option to move mango to a dedicated server. It may be possible in the future but at this time I see no evidence of the other machines interfering with mango.