BACnet/IP points reading value timeout issue when configured to periodic update only in the data source

cwangv

Hi,
I have been battling the issue of my 198 BACnet/IP points (from one single BACnet/IP device) not updating in a timely manner or not at all after a period of time when configured to COV. The COV subscription period is set to 60 minutes in Mango.
The symptom is that the COV works for a couple of days or less, then nearly all of them stopped updating with new values. However, when a 'force read' was executed, a new value (which is above COV change threshhold) was successfully obtained. At the same time, from another device (not Mango) that is subscribed to the same data source , the points are updating normally. I can discovery/read all points under the device browser tab within the data source UI.

So as an alternative, I proceed to disable all COV on all points and only rely on periodic update of 3 minutes interval and logging set to 'All'. The moment i did that, the poll duration of the data source went up from milliseconds to value like 48100 milliseconds. See below:
One thing that i have noticed is that the 'successful Poll%' always at 100% even while these online/offline issues happening. The events are shown in the second picture.
0_1567916109567_ef5bf4a1-8b2a-4f62-8473-8e3012963c06-image.png
0_1567917084843_02bf80ac-7da2-4255-ae6e-50f73271150a-image.png
In order to narrow down the issue, i decided to put all the points back to COV enabled then proceeded to disable the COV setting on a number of them at a time and see when crazy poll duration happened.
Out of the 198 points, i finally reduced the number of points with COV enabled to '25'. The moment I disabled one more out of these 25, the poll duration suffered as shown. I then re-enabled one point with COV, the poll duration went back to normal values. From the screenshot, you can see the changes happen.
In trying to determine if i have misconfigured any settings, i have gone through and manipulated the following settings:
1- Thread pools: 'High priority pool size' (from 0 to 300); High priority maximum pool size' (from 100 to 1000);'Medium priority pool size' (from 1 to 300); 'Low priority pool size' (from 1 to 100). see below:
0_1567916663405_49c3577f-f19e-4db1-b128-59387c97a57e-image.png

I have done a couple of thread dumps and am happy to provide them for your review.
By the way, the free JVM memory never went below 2900MB if this helps.

Please let me know if you need more info.

Thanks!

phildunlap

Hi cwangv,

I would definitely not have that many medium and low priority threads, unless you're running on a massive server-class machine. Even then, 300 medium priority and 100 low priority threads is excessive. On machines with 40 cores I usually only go 30/10 on those settings, to provide some context. But, I don't think that's the issue with the COV points.

One thing that i have noticed is that the 'successful Poll%' always at 100% even while these online/offline issues happening.

The successful poll percentage refers to aborting polling for usually a timing reason. For instance, if a modbus data source was polling every 2 seconds but it always took 3.5 seconds to poll, it would abort 50% of its polls and show a 50% successful poll rate.

In order to narrow down the issue, i decided to put all the points back to COV enabled then proceeded to disable the COV setting on a number of them at a time and see when crazy poll duration happened.

Is this BACnet IP or MSTP ? It would be difficult to diagnose the cause of it slowing down that much - that would probably take a wireshark to see which half of the conversation was pausing. To me, this doesn't sound like it would affect the COV issue either.

In some previous versions there was an issue with having two BACnet data sources sharing the same local device and doing COV subscriptions. If you can confirm your BACnet module version that could be helpful.

Can you check your ma.log files from around the time COV subscription stopped working for a message like Subscription failed due to error response for instance at the WARN level?

cwangv

@phildunlap
Phil
It is a Bacnet/IP data source. Bacnet Module is 3.6.1. I have just recreated the comms problem (no COV) and have done Wireshark capture.

Can you let me know how I should submit the wireshark log file?

Thanks.

phildunlap

You can email it to support@infiniteautomation.com

cwangv

@phildunlap
Hi, Phil
I have just sent an email to the address you provided. It has got links to two Wireshark captures.
If you need more info, please let me know.

thanks.

cwangv

@phildunlap
Hi, Phil
I am just wondering if the Wireshark captures have been helpful to you guys.
I can reliably re-produce the issue whenever i set all the points to 'COV' Disabled.
So you need more capture data or log files from my Mango installation. Please don't hesitate to reach out to me.

Thanks.

phildunlap

Hi cwangv,

Thanks for sending in the captures. I did take a look at them and raise the issue, but did not yet resolve the matter. Can you fully describe the replication steps you've found?

cwangv

@phildunlap
Hi, Phil
A summary of the software configuration:
1-Mango Automation Software: Core 3.6.4, Bacnet 3.6.1, Windows 10 Pro 64 Bit.
2-Data Source: BACnet IP device with 198 AV points defined, all initially set to 'COV' Enabled and logging to 'All' with polling period of 5 minutes in the data source.

Steps to generate the issues:
-by setting all 198 AV points to 'COV' Disabled (unchecked), all 198 AV points will go unreliable and lots of 'RTN' events generating for each point. It also appears Mango Automation software will start polling the points in 10 seconds interval as soon as errors occurred.
-Sometimes, this issue can be generated if less than a dozen points are left as 'COV' Enabled (checked). Below is a typical event message when Mango is stuck in such state:
Note: this happens even when moments ago the polling was working fine for a long period of time. So i dont think it is the physical device problem.

0_1568850808192_44f87c8c-76ad-4536-a25e-03c5d593f28c-image.png

I have observed previously that the 'COV' functions complete stopped for all the points,ie, no updating even though the actual data source points are changing.

I hope the above description is helpful.

phildunlap

Okay, it sounds like there are two questions to be investigated, then.

First, that the points stop receiving COV notifications after some amount of time working without issue. You captured one of these events in the wiresharks you sent in, yes?

The second question concerns why the poll time would just to 48000-ish milliseconds. That number would suggest to me that some of the requests are timing out, and that to poll all those points the requests are being split up. Your Local Device has a timeout of 6000 and a retries of 3, so 48000 would be 8 timeouts, which may or may not all be in retrying one request. If they all did occur in requesting a single set of points, that would produce the event you have a picture of there. It could be that the device is more easily overwhelmed by the large read requests, or perhaps it is generating some kind of log messages about why it isn't responding to certain read requests.

cwangv

@phildunlap
Yes, one of the log files contained the traffic after the COV update stopped between the data source and Mango.
One thing to note - that while all these abnormal events happening inside the Mango instance we have, i have another device (Not Mango) polling the same data source via BACnet/IP for the same amount point with no issue at all.

Let me know if you need more wireshark captures.

Thanks.

phildunlap

Hi cwangv,

BACnet 3.6.2 was just released which may have a fix for the COV points stopping collecting data after some indeterminate amount of time. It also has a fix for COV points that are set to log on an interval not always storing data.

I did not see what would be causing the long poll times you saw at a certain threshhold of points. Maybe a wireshark of that specific event could be illustrative in the timing of the BACnet conversation.

cwangv

@phildunlap
Phil,
you guys rock. I will give it a go.
Thanks.

cwangv

@phildunlap
I tried again and experienced the same problem with massive timeout issue with every single points.
I will send the links for the wireshark captures for your guys to review.
At the meantime, i will continue to monitor the interval logging function on points with COV enabled.

thanks.

phildunlap

Thanks for the kind words and the captures!

What I notice is that the device at 192.168.69.105 doesn't respond to fragmented readPropertyMultiple requests. Can you try reading the device object's "Segmentation Supported" property to see what sort of segmentation is allowable?

I found this article relevant: https://store.chipkin.com/articles/segmentation-in-bacnet

We can see in the capture that the device can certainly send segmented messages, but perhaps it can not receive them and the other BACnet client is polling in small requests that do not require segmentation. I would have expected this to have been handled in the code (since it can read the segmentation support off the device object), but what I see in the capture is that it does not reply to segmented requests.

Edit: I believe I found in the IAm device 70105 in the captures that it professes to support segmentation in both directions. Hmm.

cwangv

@phildunlap
Phil,
On the PICS of device 70105, it states below:
0_1570141527173_3fa29a34-0873-49d2-83f9-30015e21cc27-image.png .
I then mapped the segmentation property out in Mango, it shows value of '0' which does correspond to supporting both direction of segmentation of messages.
0_1570141820740_9639192c-2b67-49b0-8312-085b443a64fb-image.png

Would you be able to email me back the captures that shows Mango sending out segmented request but not receiving segmented responses?

Then , I will ask our technical support about device 70105 actual capability of supporting 'receive segmented messages'.

Thanks.

phildunlap

The file called Mango Wireshark between 2019-10-03 1904PM to 2035PM I saw it in. Too large for email, as it was when you sent it!

We can see frame 170715 for instance is sent but no Segment ACK is sent in response to it. I don't know if BACnet4J would wait to send the completing fragment until after receiving the ACK, but that's what appears to be happening. We can then see the six second retries. I'm not sure why the request in question is being sent out twice in rapid succession, though.