Possible issues with v1 Mango ES and latest updates

psysak

Hey, so my Mango has stopped polling the MODBUS slave again. Here is what I see on the data source page.

Sadly I don't have ssh access to this unit. The serial number for that unit is 2006.

psysak

To resolve the issue I click the green circular arrows, then the blue arrow to select ttyUSB0, and save the data source. Essentially I change nothing, but after I do that and save, comms return.

phildunlap

Hi psysak,

Was that screenshot taken before or after hitting the refresh arrow? I would expect what you describe if the port ID had not appeared in the dropdown, then you hit the green circle arrow, and then it did. That should be solved in the next Core update.

https://github.com/infiniteautomation/ma-core-public/issues/1229

Edit: But, if it was there the whole time, then I think we'll want to know what feedback you would have gotten in the events, were they raised.

psysak

Hey that was before I clicked it. You want me to enable all the logging and alarm stuff?

I don't know if this is the same thing, but I just had something similar happen at another site: Modbus comms went down for a couple of hours earlier today and then recovered. These two sites have that same old Mango type, with the really big housing. I can get you the serial number. I've noticed that when this happens, if you wait long enough it will recover on its own.

I can get you all the logs you need, let me just see if I can get the ssh passwords from Jason at DHC who installed them. Otherwise I'll get back to you and ask for them from you guys.

phildunlap

Turning on IO logging will definitely shed light on it. Also, if you were storing events, that would probably give us more information too. It doesn't sound like refreshing the serial port list did anything; it was more likely saving the data source.

psysak

I enabled all my events and IO logging, just going to wait for the next failure. It happened a couple hours ago but those useful things weren't enabled.

psysak

Hey, this has happened a couple more times on one particular site. Here is an event I see from about that time:

'250_ALBERT': Exception from modbus master: CRC mismatch: given=47104, calc=20786

This was on Apr 21. It seems to be a fairly consistent occurrence before I lose comms. Sometimes it clears itself: I can see the event active for several hours before it clears. In this case, after it had been active for 1.86 days, I disabled the data source, re-enabled it, and off it went.

Unfortunately the IO logs seem to roll over every 1.5 hours and I only have it set to keep 10, so the IO log covering the incident is gone :( I'll see if there's anything else I can provide.

psysak

It happened again today

2018-04-26 10:57:30 - '250_ALBERT': Exception from modbus master: CRC mismatch: given=47104, calc=20786

This time I do have IO logs. Here is a very strange exchange which looks like it may start the issue:

2018/04/26-10:57:30,000 O 0a03006d004b955b
2018/04/26-10:57:30,503 O 0a03006d004b955b
2018/04/26-10:57:30,523 I 0a039603e800481054491214d0424c000000ff00ff00ff03e8001905b84939d2a04188000000ff00ff00ff03e80003008c46c9ae003f80000000ff00ff00ff03e8000b02cd494a97c04080000000ff00ff00ff03e80002007549433ba00000000000ff00ff00ff03e8000300be4808d4004000000000ff00ff00ff03e8000100584926df300000000000ff00ff00ff03e80004012548a9c1a045b8
2018/04/26-10:57:30,524 O 0a0300b800024555
2018/04/26-10:57:30,765 I 0a0300b800024555
2018/04/26-10:57:30,774 I 54491214d0424c000000ff00ff00ff
2018/04/26-10:57:30,775 I 03e800
2018/04/26-10:57:30,795 I 1905b84939d2a04188000000ff00ff00ff03e80003008c46c9ae003f80000000
2018/04/26-10:57:30,796 I ff00ff
2018/04/26-10:57:30,816 I 00ff03e8000b02cd494a97c04080000000ff00ff00ff03e80002007549433b
2018/04/26-10:57:30,837 I a00000000000ff00ff00ff03e8000300be4808d4004000000000ff00ff00ff03e80001
2018/04/26-10:57:30,857 I 00584926df300000000000ff00ff00ff03e80004012548a9c1a0aafc

What's odd is that the error reports a given CRC of 47104, which is B800 in hex. The message immediately after the really long one, at 10:57:30.524, just so happens to contain B800... Don't know.
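The B800 coincidence can be checked with a few lines of arithmetic. The sketch below is only a sanity check, not Mango's actual code: it assumes the master treated the 8 echoed bytes as a function-03 response, so that the third byte (00) is read as the byte count and the two bytes after it (b8 00) as the CRC. The helper names are made up for illustration.

```python
def crc16_modbus(data: bytes) -> int:
    """CRC-16/MODBUS: initial value 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def wire_order(crc: int) -> int:
    """CRC as it reads when its two bytes are taken in transmit order (low byte first)."""
    return ((crc & 0xFF) << 8) | (crc >> 8)


# The 8 bytes logged as input at 10:57:30,765 (same bytes as the request sent at 10:57:30,524).
echo = bytes.fromhex("0a0300b800024555")

# Read as a request, the frame is valid: CRC over the first 6 bytes equals the last 2.
request_ok = wire_order(crc16_modbus(echo[:6])) == int.from_bytes(echo[6:], "big")

# ASSUMPTION: the master parses the echo as a read response, so byte 3 is the
# byte count (0) and the CRC is expected immediately after it.
byte_count = echo[2]
given = int.from_bytes(echo[3 + byte_count:5 + byte_count], "big")
calc = wire_order(crc16_modbus(echo[:3 + byte_count]))

print(request_ok, given, calc)  # prints: True 47104 20786
```

Under that assumption both numbers in the event are reproduced: given=47104 is the b8 00 from the echoed request, and calc=20786 is the CRC of just the bytes 0a 03 00.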

Logs available this time

psysak

Same thing happened yesterday afternoon at 16:55 and cleared itself at 7:37 this morning.

phildunlap

The thing that strikes me in that message is that the response to the second read request was fragmented over a hundred milliseconds while the first was delivered in its entirety to the serial port twenty milliseconds after requested. I also think it's very strange that the first input message in the second response is an echo, rather than a proper response.
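The timing difference can be read straight off the log timestamps. A rough sketch follows, assuming O marks bytes sent and I marks bytes received, and that consecutive O lines are retries of the same request. The function names are illustrative only.

```python
from datetime import datetime, timedelta

# (timestamp, direction) pairs copied from the IO log posted above.
LOG = [
    ("2018/04/26-10:57:30,000", "O"),
    ("2018/04/26-10:57:30,503", "O"),
    ("2018/04/26-10:57:30,523", "I"),
    ("2018/04/26-10:57:30,524", "O"),
    ("2018/04/26-10:57:30,765", "I"),
    ("2018/04/26-10:57:30,774", "I"),
    ("2018/04/26-10:57:30,775", "I"),
    ("2018/04/26-10:57:30,795", "I"),
    ("2018/04/26-10:57:30,796", "I"),
    ("2018/04/26-10:57:30,816", "I"),
    ("2018/04/26-10:57:30,837", "I"),
    ("2018/04/26-10:57:30,857", "I"),
]


def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y/%m/%d-%H:%M:%S,%f")


def group_exchanges(log):
    """Start a new exchange at each O that follows received data; other O lines count as retries."""
    exchanges = []
    for ts, direction in log:
        t = parse(ts)
        if direction == "O":
            if not exchanges or exchanges[-1]["received"]:
                exchanges.append({"sent": [], "received": []})
            exchanges[-1]["sent"].append(t)
        elif exchanges:
            exchanges[-1]["received"].append(t)
    return exchanges


def summarize(exchange):
    """Timing summary for an exchange that received at least one chunk."""
    ms = timedelta(milliseconds=1)
    rx = exchange["received"]
    return {
        "sends": len(exchange["sent"]),
        "chunks": len(rx),
        "first_byte_ms": (rx[0] - exchange["sent"][-1]) // ms,
        "spread_ms": (rx[-1] - rx[0]) // ms,
    }


summaries = [summarize(e) for e in group_exchanges(LOG)]
for s in summaries:
    print(s)
```

For this log the first exchange shows two sends (the request at 30,000 was repeated at 30,503) and a single chunk 20 ms after the repeat. The second shows one send, then 8 chunks starting 241 ms later and spread over 92 ms.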

I am suspicious of the bus. Can you provide information about how many devices are on the bus? Is it properly terminated with a terminating resistor?

phildunlap

I would also check the dmesg output to see if the serial port is re-enumerating or anything unfortunate like that. But, I would expect the polling to stop in such a case, not get CRC errors.
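If scanning dmesg by eye is tedious, something like the following could narrow it down. This is a sketch only: it assumes Python is available on the unit, and the keyword pattern is a guess at what a re-enumerating USB serial adapter leaves in the kernel log, so adjust it for the adapter's driver. The function names are hypothetical.

```python
import re
import subprocess

# ASSUMPTION: these keywords are a guess at the relevant kernel messages,
# not a definitive list for any particular adapter or driver.
PATTERN = re.compile(r"ttyUSB\d+|USB disconnect|new .*USB device", re.IGNORECASE)


def read_dmesg() -> str:
    """Return the kernel ring buffer as text (may require root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def port_events(dmesg_text: str) -> list:
    """Keep only the lines that mention a ttyUSB port or a USB connect/disconnect."""
    return [line for line in dmesg_text.splitlines() if PATTERN.search(line)]
```

Calling port_events(read_dmesg()) on the unit and comparing the kernel timestamps of any hits with the time of the CRC event would show whether the port dropped out around then.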

psysak

Will do, thanks Phil. I don't know much about that bus as I'm, of course, coming in well after the fact. I thought it very strange that there are echoes as well, and was going to ask if that's something you expect to see in those logs. I shouldn't be seeing that at all.