Virtual Data Source Disabled on Reboot / Lifecycle terminating
-
Some very strange behaviour occurred with our Mango (3.5.6). The device became uncontactable (no response to pings over Ethernet), and upon reboot a virtual data source appeared enabled (green LED) but reported being disabled. Toggling enabled/disabled re-enabled the data source.
On review, an information event recorded a System Shutdown. Is a system shutdown only ever a user-issued command, or can it be triggered autonomously?
INFO 2022-07-09T03:05:08,153 (com.infiniteautomation.mango.excelreports.ExcelReportPurgeDefinition.execute:32) - Report purge ended, 1 report instance…
INFO 2022-07-09T06:57:53,784 (com.serotonin.m2m2.Lifecycle.terminate:422) - Mango Lifecycle terminating...
INFO 2022-07-09T06:57:54,205 (com.serotonin.m2m2.rt.DataSourceGroupTerminator.terminate:72) - Terminating 2 LAST priority data sources in 8 threads.
INFO 2022-07-09T06:57:54,221 (com.serotonin.m2m2.rt.RuntimeManagerImpl.stopDataSourceShutdown:423) - Data source 'MetaPoints' stopped
INFO 2022-07-09T06:57:54,267 (com.serotonin.m2m2.rt.RuntimeManagerImpl.stopDataSourceShutdown:423) - Data source 'Meta' stopped
INFO 2022-07-09T06:57:54,307 (com.serotonin.m2m2.rt.DataSourceGroupTerminator.terminate:102) - Termination of 2 LAST priority data sources took 103m…
INFO 2022-07-09T06:57:54,308 (com.serotonin.m2m2.rt.DataSourceGroupTerminator.terminate:72) - Terminating 14 NORMAL priority data sources in 8 threads.
INFO 2022-07-09T06:57:54,631 (com.serotonin.m2m2.rt.RuntimeManagerImpl.stopDataSourceShutdown:423) - Data source 'MangoES System' stopped … -
@MaP been a while.
Is this an ES unit or something you have going by yourself?
Might need to see a few timestamps before that, to check whether anything started that caused excess memory or CPU usage.
I've had Mango cut out on me because I ran out of CPU resources.
It might be worth ensuring you have enough high-priority threads enabled (even if the report is low to medium priority).
Fox
-
Hi @MattFox!
Yes, an ES instance. There have been some OOM-like behaviours in the past, and because of those incidents I have been trending the JVM memory; it looked pretty stable at the time. Those incidents usually end with the Mango service crashing, but I can still SSH in and restart the service. The most disconcerting thing here is the Ethernet interface going dead. I would like to discover the root cause, but from a mitigation standpoint I would also like a recovery option that doesn't involve driving to site. A shell script which monitors systemctl active status and reboots the device seems like an option. Tips for discovering the root cause, or mitigation strategies, appreciated! -
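For reference, here is a minimal sketch of that sort of recovery script. The unit name `mango.service`, the `mango-watchdog` log tag, and the restart-before-reboot policy are all assumptions, not anything the ES actually ships with; check the real unit name with `systemctl list-units` first, and test with harmless `CHECK`/`RECOVER` overrides before arming the reboot.

```shell
#!/bin/sh
# Hypothetical systemd watchdog sketch. "mango.service" is an assumed
# unit name; CHECK and RECOVER are overridable so the logic can be
# tested without actually rebooting anything.
UNIT="${UNIT:-mango.service}"
CHECK="${CHECK:-systemctl is-active --quiet $UNIT}"
RECOVER="${RECOVER:-/sbin/reboot}"   # last-resort action

# Run the status probe; on failure, log and run the recovery action.
watchdog_tick() {
    if eval "$CHECK"; then
        return 0
    fi
    logger -t mango-watchdog "$UNIT inactive, recovering" 2>/dev/null
    eval "$RECOVER"
    return 1
}

# Uncomment to run once per invocation, e.g. from a root cron entry:
#   */5 * * * * /usr/local/bin/mango-watchdog.sh
# watchdog_tick
```

Note this only helps while the OS itself is still alive; if the whole board locks up (as a dead Ethernet interface suggests it might), only an external watchdog can pull the plug.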
@MaP in the past the company I was with built a microcontroller-based watchdog
As long as the mango system pinged it, it would reset the timer.
If it stopped, ten minutes later we pulled the plug, counted to five and reconnected power.
Your idea is good as well assuming the ES isn't locking up or overheating.
For the Ethernet issue, you can look at dmesg and /var/log/syslog to see what's happening.
Can't sit and chat now, but will come back ASAP with some other possibilities.
Fox
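The dmesg/syslog check suggested above can be scripted as a quick filter. This is a hypothetical sketch: the pattern list is a guess at likely culprits (NIC link changes, thermal events, OOM kills, watchdog resets), and /var/log/syslog is the Debian-style path used on these units; adjust for your distro.

```shell
#!/bin/sh
# Filter kernel and syslog output for likely lockup culprits.
PATTERN='eth[0-9]|link is (up|down)|thermal|out of memory|oom|watchdog'

dmesg 2>/dev/null | grep -iE "$PATTERN" | tail -n 50
if [ -r /var/log/syslog ]; then
    grep -iE "$PATTERN" /var/log/syslog | tail -n 50
fi
```

Rotated logs (/var/log/syslog.1, the .gz archives via zgrep) are worth the same treatment if the lockup happened a few days back.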
-
@MaP Found it! It is a thermal shutdown!
Jul 28 08:44:01 mangoES3576 CRON[1977]: (root) CMD ((sleep 30; [ -x /usr/sbin/systemInfo/cronRunner ] && /usr/sbin/systemInfo/cronRunner) >/dev/null…
Jul 28 08:44:02 mangoES3576 python[6945]: AEMO Healthy - PriceNow:92.31
Jul 28 08:44:57 mangoES3576 kernel: [1471180.187164] thermal thermal_zone2: critical temperature reached(107 C),shutting down
Jul 28 08:44:57 mangoES3576 systemd[1]: Started Synchronise Hardware Clock to System Clock.
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopping Session c1 of user root.
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopping system-ifup.slice.
Jul 28 08:44:57 mangoES3576 systemd[1]: Removed slice system-ifup.slice.
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopping User Manager for UID 0...
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopping Graphical Interface.
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopped target Graphical Interface.
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopping Multi-User System.
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopped target Multi-User System.
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopping Deferred execution scheduler...
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopping OpenBSD Secure Shell server...
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopping AEMO Service...
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopping Regular background program processing daemon...
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopping OpenVPN service...
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopped OpenVPN service.
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopping Login Prompts.
Jul 28 08:44:57 mangoES3576 systemd[1]: Stopped target Login Prompts. -
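The 107 C trip point in the kernel line above comes from the standard Linux sysfs thermal interface, which can also be read directly to catch the temperature climbing before the kernel pulls the plug. A small sketch (zone numbering is board-specific, and the `fmt_temp` helper is just an illustrative name):

```shell
#!/bin/sh
# Print every thermal zone's type and temperature. The sysfs "temp"
# files report millidegrees Celsius.
fmt_temp() {
    printf '%d.%03d C' $(($1 / 1000)) $(($1 % 1000))
}

for zone in /sys/class/thermal/thermal_zone*; do
    [ -r "$zone/temp" ] || continue
    echo "$(basename "$zone") ($(cat "$zone/type")): $(fmt_temp "$(cat "$zone/temp")")"
done
```

Feeding this into a trended data point, alongside the existing JVM-memory trend, would give advance warning well before the critical trip.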
@MaP Sorry for the late reply; for some reason I'm not receiving notifications...
Yes, a thermal shutdown can definitely be a faulty fan; that and how the unit is stored are usually the leading causes of ES units failing (maybe IO as well?).
Get yourself a replacement fan if you can, they're usually around the $5 USD mark.
A dead giveaway is the noise of the bearings in the fan itself.
Also, how is the unit stored? Is it in an open-air case, or is it sealed in an IP-rated box that gets really muggy?
I do know the ED units are prone to overheating, but they have a guts load more kick than the GT's spineless Raspberry Pi 3.
Fox