Possible issues with v1 Mango ES and latest updates
Hey, I just wanted to report some odd behaviour I have seen on old v1 Mango ES's. By v1 I mean the really old, big-housing ones. I have a couple of these out here, both running 2.8, and I updated both of them to the latest module versions within the last month. Both of these devices have experienced odd loss of comms to a MODBUS/RTU slave. I only went to the one site, but from what I could see the RS485 port had lost its mind a little bit. For some reason it was showing two RS485 ports (dev/rs485 and something else like dev/ttla). That one I fixed up by rebooting, but I have never seen that before. Both of them lost comms to their respective RS485 Modbus slaves again yesterday, but then both recovered at some point during the day. They are not in the same building, and it was not an Ethernet comms issue because my ES's were still publishing data; the data was just not incrementing.
Just wanted to pass that along in case you've seen something like this. If it happens again I will try to troubleshoot a bit more. Is there anything I should try specifically which would help you?
V1 old? Might have to post a picture, I'm not sure we conceive of V1 being the same thing... maybe I'm thinking of V0.5
But, that you have a /dev/rs485 would imply you have udev rules to properly enumerate the USB ports should they have serial adapters. Can you confirm, is there a file like
/etc/udev/rules.d/15-usb-serial.rules? What's in it?
The next item in this would be checking the serial port regex in the Mango/overrides/properties/env.properties file. If you make your regex more specific, so that it matches only the symlinks created in the udev rule, then no user will set it up for ttyUSB0 and find the converter has enumerated this boot at /dev/ttyUSB1, or some such. These settings govern that
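To illustrate the idea of narrowing the regex (this is not the contents of env.properties, just a sketch of the matching behaviour), here is a small Python example. The device names and both patterns are hypothetical, Mango itself uses Java regex syntax, and whether it matches against the bare name or the full path is an assumption here.

```python
import re

# Hypothetical device names as they might appear under /dev on one boot.
candidates = ["ttyS0", "ttyUSB0", "ttyUSB1", "rs485"]

# Broad pattern (assumed): also accepts raw USB names, whose numbering
# can change from one boot to the next.
broad = re.compile(r"(ttyS|ttyUSB)[0-9]{1,3}|rs485")

# Narrow pattern (assumed): accepts only the symlink a udev rule creates.
narrow = re.compile(r"rs485")


def visible_ports(pattern, names):
    """Return the names that the pattern matches in full."""
    return [name for name in names if pattern.fullmatch(name)]


print(visible_ports(broad, candidates))
print(visible_ports(narrow, candidates))
```

With the narrow pattern, only the stable symlink would be offered in the port list, so a data source could not be saved against a ttyUSB number that moves after a reboot.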
And finally if you notice that it isn't polling but the serial port file exists, you may need to press the refresh icon by the serial port list for any serial port data source.
The most helpful thing in probing deep would be a very clear picture of what 'lost its mind' exactly means.
Haha :) Will do, if this happens again I'll get you this info.
Hey, so my Mango has stopped polling the MODBUS slave again. Here is what I see on the data source page.
Sadly I don't have ssh access to this unit. The serial number for that unit is 2006.
To resolve the issue I click on the green circle arrows, then the blue arrow selecting ttyUSB0 and save the data source. Essentially I changed nothing but after I do that and save comms return.
phildunlap last edited by phildunlap
Was that screenshot before or after hitting the refresh arrow? I would expect what you describe if that port ID had not appeared in the dropdown, you then hit the green circle arrow, and then it did appear. That should be solved in the next Core update.
Edit: But, if it was there the whole time, then I think we'll want to know what feedback you would have gotten in the events, were they raised.
Hey that was before I clicked it. You want me to enable all the logging and alarm stuff?
I don't know if this is the same thing but I just had something which appears similar happen at another site, modbus comms went down for a couple hours earlier today and then recovered. These two sites have that same old Mango type, with the really big housing, I can get you the serial number. I've noticed that when this happens, if you wait long enough it will recover on its own.
I can get you all the logs you need, let me just see if I can get the ssh passwords from Jason at DHC who installed them. Otherwise I'll get back to you and ask for them from you guys.
Turning on IO logging will definitely shed light on it. Also, if you were storing events, that would probably give us more information too. It doesn't sound like refreshing the serial port did anything; it was more likely saving the data source that restored comms.
I enabled all my events and IO logging, just going to wait for the next failure. It happened a couple hours ago but those useful things weren't enabled.
Hey, this has happened a couple times again on one particular site. Here are some events which I see from about that time
'250_ALBERT': Exception from modbus master: CRC mismatch: given=47104, calc=20786
This was on Apr 21. It seems to be a fairly consistent occurrence before I lose comms. Sometimes it clears itself: I can see in the events that it stays active for several hours and then clears. In this case, after it had been active for 1.86 days, I disabled the data source, then enabled it, and off it went.
Unfortunately the IO logs seem to roll over every 1.5 hours and I only have it set to keep 10, so the IO log for that incident is gone :( I'll see if there's anything else I can provide
It happened again today
2018-04-26 10:57:30 - '250_ALBERT': Exception from modbus master: CRC mismatch: given=47104, calc=20786
This time I do have IO logs, here is a very strange message which looks like it may start the issue
2018/04/26-10:57:30,000 O 0a03006d004b955b
2018/04/26-10:57:30,503 O 0a03006d004b955b
2018/04/26-10:57:30,523 I 0a039603e800481054491214d0424c000000ff00ff00ff03e8001905b84939d2a04188000000ff00ff00ff03e80003008c46c9ae003f80000000ff00ff00ff03e8000b02cd494a97c04080000000ff00ff00ff03e80002007549433ba00000000000ff00ff00ff03e8000300be4808d4004000000000ff00ff00ff03e8000100584926df300000000000ff00ff00ff03e80004012548a9c1a045b8
2018/04/26-10:57:30,524 O 0a0300b800024555
2018/04/26-10:57:30,765 I 0a0300b800024555
2018/04/26-10:57:30,774 I 54491214d0424c000000ff00ff00ff
2018/04/26-10:57:30,775 I 03e800
2018/04/26-10:57:30,795 I 1905b84939d2a04188000000ff00ff00ff03e80003008c46c9ae003f80000000
2018/04/26-10:57:30,796 I ff00ff
2018/04/26-10:57:30,816 I 00ff03e8000b02cd494a97c04080000000ff00ff00ff03e80002007549433b
2018/04/26-10:57:30,837 I a00000000000ff00ff00ff03e8000300be4808d4004000000000ff00ff00ff03e80001
2018/04/26-10:57:30,857 I 00584926df300000000000ff00ff00ff03e80004012548a9c1a0aafc
What's odd is that the error states a given CRC of 47104 which in HEX is B800. The message immediately after the really long one, at 10:57:30.524 just so happens to contain B800... Don't know.
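For what it's worth, both numbers in the error can be reproduced from the echoed frame alone. The sketch below is plain Python using the standard Modbus RTU CRC-16 (polynomial 0xA001, initial value 0xFFFF). It assumes the master read the echoed request as a function-03 response, so the third byte (0x00) was taken as the byte count and the next two bytes (b8 00) as the CRC, and it assumes the error message prints both CRCs in wire byte order. The helper names are made up for illustration and are not the Mango or modbus4j code.

```python
def crc16_modbus(data: bytes) -> int:
    """Standard Modbus RTU CRC-16 (reflected, poly 0xA001, init 0xFFFF)."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def wire_order(crc: int) -> int:
    """The CRC as a number read from its two bytes in transmit order (low byte first)."""
    return ((crc & 0xFF) << 8) | (crc >> 8)


def read_as_response(frame: bytes):
    """Interpret a frame as a function-03 response: addr, func, byte count, data, CRC.

    Returns (given, calc) the way the error message appears to report them
    (assumption: both values are in wire byte order).
    """
    count = frame[2]                       # byte-count field
    body = frame[:3 + count]               # everything the CRC should cover
    crc_bytes = frame[3 + count:5 + count]
    given = int.from_bytes(crc_bytes, "big")
    calc = wire_order(crc16_modbus(body))
    return given, calc


# The echoed request from the IO log at 10:57:30,765.
echo = bytes.fromhex("0a0300b800024555")
print(read_as_response(echo))  # (47104, 20786)
```

So the B800 is not a coincidence: it is the register address from the request, which ends up in the CRC position when the echo is parsed as if it were a reply with zero data bytes.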
Logs available this time
Same thing happened yesterday afternoon at 16:55 and cleared itself at 7:37 this morning.
The thing that strikes me in that message is that the response to the second read request was fragmented over a hundred milliseconds while the first was delivered in its entirety to the serial port twenty milliseconds after requested. I also think it's very strange that the first input message in the second response is an echo, rather than a proper response.
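A rough way to pull the timing and echo observations out of an IO log automatically is sketched below in Python. It assumes the line layout seen in the excerpt above (timestamp, `O` or `I`, hex payload), and the function names are hypothetical. For each input chunk it reports the delay since the most recent output and whether the chunk is byte-for-byte identical to that output.

```python
from datetime import datetime, timedelta

# Excerpt from the IO log posted above (second request and the first inputs after it).
SAMPLE = """\
2018/04/26-10:57:30,524 O 0a0300b800024555
2018/04/26-10:57:30,765 I 0a0300b800024555
2018/04/26-10:57:30,774 I 54491214d0424c000000ff00ff00ff
2018/04/26-10:57:30,775 I 03e800
"""


def parse(text):
    """Turn log text into (timestamp, direction, hex payload) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not "timestamp direction payload"
        stamp, direction, payload = parts
        when = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S,%f")
        entries.append((when, direction, payload))
    return entries


def summarize(entries):
    """For each input chunk: (ms since the last output, is it an echo of that output)."""
    report = []
    last_out = None
    for when, direction, payload in entries:
        if direction == "O":
            last_out = (when, payload)
        elif direction == "I" and last_out is not None:
            delay_ms = (when - last_out[0]) // timedelta(milliseconds=1)
            report.append((delay_ms, payload == last_out[1]))
    return report


for delay_ms, is_echo in summarize(parse(SAMPLE)):
    print(delay_ms, "ms", "ECHO" if is_echo else "")
```

Run over a full log file, a report like this would show how often an input matches the preceding output and how far the fragments of a single response are spread out.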
I am suspicious of the bus. Can you provide information about how many devices are on the bus? Is it properly terminated with a terminating resistor?
I would also check the dmesg output to see if the serial port is re-enumerating or anything unfortunate like that. But, I would expect the polling to stop in such a case, not get CRC errors.
Will do, thanks Phil. I don't know much about that bus as I'm, of course, coming in well after the fact. I thought it very strange that there are echoes as well, and was going to ask if that's something you expect to see in those logs. I shouldn't be seeing that at all.