Alarms due to high priority task already running
-
A Mango instance has generated 50 "urgent" alarms in the last 8 days with the message:
High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@84cbc5 was rejected because it is already running.
I have changed the "rejected work item" system event alarm level to "none" so we won't keep getting this alarm that we can't do anything about.
How can I tell what the task was and whether or not it is an important issue I need to address?
-
Hi Craig,
Going forward, we're going to be putting more informative handlers in every place these errors can be generated. The only information directly gleaned from that message is 1) it is not a data source poll, since those already implement their own handler, and 2) it is a repeating task.
Repeating tasks reschedule themselves before attempting execution, so the fact that it is 50 events in 8 days (192 hours, roughly one event every 3.8 hours) could give you clues on the timing as it sits now (presuming they were all the same task). Are the task IDs the same for each message? I am referring to the hex value after the '@' in TimeoutTask@84cbc5. That could at least tell you whether it is the same task attempting to run while an instance of it is already running, provided Mango was not restarted during the sample.
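If there are too many messages to compare by eye, a quick script can tally the IDs from ma.log. This is just a rough sketch (not something built into Mango), assuming Node.js is available and that ma.log sits in Mango's logs directory; adjust the path for your installation:
// Tally how often each TimeoutTask ID appears in ma.log.
var fs = require('fs');
var log = fs.readFileSync('/opt/mango/logs/ma.log', 'utf8'); // adjust to your logs directory
var counts = {};
var re = /TimeoutTask@([0-9a-f]+)/g;
var match;
while ((match = re.exec(log)) !== null) {
    counts[match[1]] = (counts[match[1]] || 0) + 1;
}
console.log(counts);
If one ID dominates the counts, it is very likely the same task repeatedly rescheduling on top of itself.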
-
I'm receiving a lot of these alarms too. They do not occur at regular time intervals:
Event active notification ** Urgent ** 08:25:30 - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@502c592b was rejected because it is already running.
They usually coincide with this alarm:
Event active notification ** Urgent ** 08:25:29 - 'Setting': Data source with xid: DS_Setting and name: Setting, aborted 163 polls.
DS_Setting is a "Virtual data source" with a 1-second Update period. There are 27 "type":"NO_CHANGE" datapoints in that source. The only other change type is "RANDOM_MULTISTATE," but that point is disabled.
What follows at 8:25am is multiple timeouts of Modbus or BACnet polls, reported in ma.log:
2016-03-25 08:25:30,476 WARN (com.serotonin.m2m2.rt.dataSource.PollingDataSource.scheduleTimeout:117) - SCM: poll scheduled at 2016/03/25 08:25:29.171 aborted because its start time of 2016/03/25 08:25:30.476 is more than the time allocated for in its poll period.
2016-03-25 08:25:30,475 WARN (com.serotonin.m2m2.rt.dataSource.PollingDataSource$2.execute:226) - Mod1: poll scheduled at 2016/03/25 08:25:30.000 aborted because Task Currently Running
2016-03-25 08:25:30,475 WARN (com.serotonin.m2m2.rt.dataSource.PollingDataSource$2.execute:226) - Mod2: poll scheduled at 2016/03/25 08:25:30.000 aborted because Task Currently Running
2016-03-25 08:25:30,478 WARN (com.serotonin.m2m2.rt.dataSource.PollingDataSource.scheduleTimeout:117) - Mod3: poll scheduled at 2016/03/25 08:25:29.173 aborted because its start time of 2016/03/25 08:25:30.470 is more than the time allocated for in its poll period.
2016-03-25 08:25:30,475 WARN (com.serotonin.m2m2.rt.dataSource.PollingDataSource$2.execute:226) - Mod4: poll scheduled at 2016/03/25 08:25:29.172 aborted because Task Currently Running
Errors continue for many more data sources.
During these errors, the CPU idle time, according to Mango's vmstat data source, has not fallen below 60%.
High priority core pool size = 100
High priority maximum pool size = 5000
-
I just received this
2016-03-28 20:01:09,035 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@38c10714 rejected because Task Currently Running
I don't know if it's related, but around that time my CPU idle time plummeted, according to Mango's vmstat datasource:
And even though the idle time eventually came back up somewhat, there are still some suspicious threads like this one:
TIMED_WAITING
Unsafe.java:-2
LockSupport.java:215
AbstractQueuedSynchronizer.java:2078
ArrayBlockingQueue.java:418
QueuedThreadPool.java:531
QueuedThreadPool.java:47
QueuedThreadPool.java:590
Thread.java:745
... which I pasted from the thread monitor page:
I have noticed since last year that the idle time decreases more or less week after week, and that there is much more idle time the first few days after a Mango restart. Do you have any suggestions?
-
Hi Pedro,
I hesitate to speculate about why those tasks are getting rejected. I would bet it's the garbage collector spawning and locking everything down for an instant. I'm certainly looking forward to them all having descriptions.
The thread you posted is not a concern: that's an unoccupied thread, sitting in the pool waiting for allocation. You don't have messages about the thread pool being full, just that a certain rescheduling task is aborting because its previous incarnation is still running.
I will keep an eye out for the CPU consumption behavior you describe.
-
@phildunlap I am becoming increasingly alarmed by the high number of alarms, alarm emails, and possible internal side effects of the timeouts described here, especially since they pertain to an internal virtual data source. I don't know whether the higher-than-usual CPU utilization is a cause, an effect, or unrelated. Mango is running on a powerful rack server, so I don't expect this to be due to an underpowered CPU.
According to /events.shtm, in approximately the last twelve hours I have received 136 events like this one:
High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@10868dc7 was rejected because it is already running.
which are associated with the ma.log entry:
2016-04-13 12:46:46,384 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@502c592b rejected because Task Currently Running
and received 126 events like this:
'Setting': Data source with xid: DS_Setting and name: Setting, aborted 7413 polls.
which are associated with the ma.log entry:
2016-04-13 12:46:46,392 WARN (com.serotonin.m2m2.rt.dataSource.PollingDataSource.scheduleTimeout:117) - Setting: poll scheduled at 2016/04/13 12:46:45.171 aborted because its start time of 2016/04/13 12:46:46.392 is more than the time allocated for in its poll period.
Sometimes I receive several such errors per minute.
The "Setting" datasource is simply a virtual source used to hold constant values:
{ "dataSources":[ { "xid":"DS_Setting", "name":"Setting", "enabled":true, "type":"VIRTUAL", "alarmLevels":{ "POLL_ABORTED":"URGENT" }, "purgeType":"YEARS", "updatePeriodType":"SECONDS", "updatePeriods":1, "editPermission":null, "purgeOverride":true, "purgePeriod":10 } ] }
One point is "type":"RANDOM_MULTISTATE" but it is disabled.
All other points are "type":"NO_CHANGE". I don't understand why this is causing such problems. Let me know if I should email you the complete JSON point list. What can be done about this?
Thanks.
-
Pedro,
It's very odd for a virtual data source to abort polls, as there are only a few ways it can run out of time to complete its poll. Can you supply us with the following information:
- The JSON config for the points of your 'Setting' data source.
- A thread dump from a time when you are experiencing the symptoms. This can be done by hitting this URL in a web browser (or from the command line; see the sketch after this list):
http://{mango-host}:{mango-port}/rest/v1/threads?depth=30&asFile=true
- If getting the dump proves difficult, you can try to set up an email event handler for that data source on the Poll Aborted event and include the system information, although this approach might be dangerous if you are aborting a lot of polls frequently.
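For the command-line option, something like this rough sketch will save the dump to a file. It assumes Node.js is available and that the endpoint answers without a login; depending on your security settings you may need to authenticate first, so treat it only as a starting point:
// Download the thread dump to threads.json. Replace the host and port with your Mango instance.
var http = require('http');
var fs = require('fs');
var url = 'http://mango-host:8080/rest/v1/threads?depth=30&asFile=true';
http.get(url, function (res) {
    res.pipe(fs.createWriteStream('threads.json'));
});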
Thanks.
-
Terry,
Like you, I was also surprised that a virtual data source would abort polls considering it does not have much work to do. I dumped the threads to a 1MB JSON file. This was easy as this FATAL error occurs frequently. I will email it to support@...., along with the JSON for the points of the Settings data source and a snippet of ma.log. Unfortunately the high number of alarms this error generated has triggered a spam threshold, which caused my email provider to disable outgoing emails from my domain. So I must wait for the spam limiter to reset before I can send the email.
The Settings datasource is used only to hold constants. Is it best practice to hold constants in a NO_CHANGE virtual datasource, or in a metadata point as:
return my.value;
set to update once per year?
-
Pedro,
It all depends on what you are using the constants for. If they are only used in a Meta or Scripting data source then a Global Script constant would be best. If you need them inside a data point or the database then a low period polling Virtual data source would be best. I would think running a script to store them in a data point would be the least ideal option, as the JavaScript engine needs to be spun up to execute the script, whereas a No Change virtual point literally does nothing when it polls besides spawning the poll thread.
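As a rough illustration of the Global Script option (the names here are made up, and the exact setup depends on your Mango version), a value defined once in the global scripts area becomes visible to every Meta or Scripting data source context:
// Defined once in the Global Scripts area, shared by all script contexts:
var HIGH_TEMP_LIMIT = 75; // hypothetical threshold value

// A Meta data point script (with a context variable named 'temperature') could then use it:
// return temperature.value > HIGH_TEMP_LIMIT ? 1 : 0;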
-
Thanks. The points are used by meta data source functions, and displayed in a graphical view where certain users can easily change their values by directly entering a new point value. Perhaps it would be more accurate to call them user-edited "threshold values," i.e. "variables that seldom change," rather than calling them constants. Some are changed by a metadata function, but that is the exception.
I probably have well over one hundred JavaScript metadata functions. I don't want the settings in a global script because they would not be user-editable, nor would their history be logged. From what you state, it sounds like I should move those seldom-changing variables from meta points to a NO_CHANGE virtual source. However, if I do that, I will lose my point history unless I roll up my sleeves and issue some commands from Mango's SQL console, which I'd rather avoid unless I can expect a noticeable improvement in performance.
-
Pedro,
A few questions about performance:
- From what you said about the SQL console, it appears you are storing all your point value data in MySQL. Is that correct?
- Are you running Java 7 or 8? I benchmarked the latest Java 8 Nashorn JavaScript engine against the old Mozilla engine and found it to be almost 2x as fast when running Mango's data source scripts.
-
I'm storing pointValues in TSDB. However, if I want to retain the point history I presume I would have to reassign the point using SQL, since I cannot move points from one datasource to another. It would be very nice if I could do that, as there are a number of metadata points that I would like to move from one metadata source to another without having to get dirty with SQL, which I find risky.
I'm running Java 8 under Ubuntu:
$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
-
@terrypacker My email hosting provider finally unblocked my outgoing emails, which had been blocked because it was also hosting Mango's alarm emails. So I just emailed the requested data to the support email address.
Since Mango was even slower today, I finally restarted it, taking the opportunity to upgrade the API, dashboard, and report modules.
The % idle went up significantly today:
And the CPU usage (100 - idle%) went down accordingly.
My working theories are that either:
- There is a memory leak that led to excessive disk swapping
- Some error messages that used to show up only in ma.log now also show up in /events.shtm. The combination of writing errors to SQL and text to ma.log may have caused Mango to fall behind on its tasks, to the point where it could not keep up unless it cancelled high priority tasks
- An intermittent task continually raised the CPU utilization, and was killed during the restart.
Of the graphs below, the most interesting ones to me are the block writes, interrupts per second, swap memory, and heap memory. I think the block writes growing may be the ma.log writing rate increasing due to a positive feedback loop. What explains JVM-max-available-memory?
[Graphs: CPU-IOwait, free-memory, interrupts-per-second, jvm-heap-memory, JVM-max-available-memory, swap-memory, scheduled-work-items]
Thanks.
-
@terrypacker There may be a way to reproduce this issue, if it is caused by a high CPU load: since recent Mango versions take more CPU to render graphical views, if enough users or web browsers display a complex graphical view, the CPU idle time will fall too low, task timeouts will occur, and errors will start filling the event list and ma.log. Additional CPU will be utilized to send event emails.
The additional load will likely put Mango into a sustained state where tasks cannot keep up with real time, and the task manager will start shedding tasks and generating more error messages. At that point, reducing the page views will not lower the CPU load enough for Mango to dig itself out of this vicious cycle, because servicing those very errors sustains the high CPU load.
Obviously I do not want to cause this error on my live system, but I offer the following methods of preventing, assessing, and mitigating these issues:
Assessment:
- Add a profiler to display to the admins how much time each module is using. A core module time per page rendered would also be useful.
- Add a "Runtime status" section to report poll times of metadata sources, or of metadata points. If it can be resource intensive, pass it as a command switch for troubleshooting.
Prevention (or reduction):
- Use the above profiling data to determine what is causing most of the CPU load, so that developers and users can see which parts of a configuration consume the most CPU and know where to optimize.
- Optimize the graphical view renderer for better performance
- Does Mango log in and out of the SMTP server for each email it sends? If so, increase email notification performance by not logging out between emails.
Mitigation: once Mango is in this vicious cycle of reporting errors, the CPU load spent servicing errors must be reduced to break the cycle:
- Allow the administrator to temporarily raise the bar on servicing events, such as:
- Suspend reporting timeout errors to the event database
- Suspend reporting timeout errors to ma.log
- Raise ma.log reporting level to not report WARN
- Raise the bar on outgoing event emails (Information, Urgent, Critical)
- Raise the bar on logging to the event database (Information, Urgent, Critical)
- Temporarily flush either the low, medium, or high priority job queue
- Once the system is caught up, these measures should be put back to normal.
Perhaps this could be thought of as a soft reset rather than a hard restart.
-
@terrypacker Since the restart the CPU idle time has been around 70% and graphical view simple points have been updating much faster. However, high priority tasks are still being "rejected because Task Currently Running":
2016-05-08 23:28:36,733 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@39de09ff rejected because Task Currently Running
2016-05-08 23:47:59,172 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@775e67cb rejected because Task Currently Running
2016-05-08 23:47:59,327 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@39de09ff rejected because Task Currently Running
2016-05-09 00:07:11,462 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@1c7b07ad rejected because Task Currently Running
2016-05-09 00:26:35,707 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@30133195 rejected because Task Currently Running
2016-05-09 00:26:35,719 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@588b748a rejected because Task Currently Running
2016-05-09 00:45:48,775 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@1260f118 rejected because Task Currently Running
2016-05-09 01:05:18,388 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@2795a94 rejected because Task Currently Running
2016-05-09 01:24:34,660 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@3dc97162 rejected because Task Currently Running
2016-05-09 01:43:50,236 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@4b00d126 rejected because Task Currently Running
2016-05-09 02:03:05,363 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@fec7ec6 rejected because Task Currently Running
2016-05-09 02:22:22,817 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@39de09ff rejected because Task Currently Running
2016-05-09 02:41:36,815 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@3dc97162 rejected because Task Currently Running
2016-05-09 03:00:53,652 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@39de09ff rejected because Task Currently Running
2016-05-09 03:20:15,385 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@1260f118 rejected because Task Currently Running
2016-05-09 03:39:32,514 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@588b748a rejected because Task Currently Running
2016-05-09 03:58:53,301 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@4b00d126 rejected because Task Currently Running
2016-05-09 04:18:13,426 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@3dc97162 rejected because Task Currently Running
2016-05-09 04:37:13,392 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@30133195 rejected because Task Currently Running
2016-05-09 04:56:29,434 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@2795a94 rejected because Task Currently Running
2016-05-09 05:16:03,176 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@2795a94 rejected because Task Currently Running
2016-05-09 05:56:16,256 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@4b00d126 rejected because Task Currently Running
2016-05-09 06:16:13,354 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@b082053 rejected because Task Currently Running
2016-05-09 06:35:48,535 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@2795a94 rejected because Task Currently Running
2016-05-09 06:55:42,575 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@4b00d126 rejected because Task Currently Running
2016-05-09 07:15:19,662 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@fec7ec6 rejected because Task Currently Running
2016-05-09 07:15:19,947 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@79cb4799 rejected because Task Currently Running
2016-05-09 07:35:53,464 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@30133195 rejected because Task Currently Running
2016-05-09 07:55:40,366 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@2795a94 rejected because Task Currently Running
2016-05-09 08:16:35,096 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@fec7ec6 rejected because Task Currently Running
2016-05-09 08:37:16,287 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@30133195 rejected because Task Currently Running
2016-05-09 09:17:54,381 FATAL (com.serotonin.m2m2.util.timeout.RejectedHighPriorityTaskEventGenerator.rejected:27) - High priority task: com.serotonin.m2m2.util.timeout.TimeoutTask@33bf3c4c rejected because Task Currently Running
-
@Pedro thanks for the information,
I was unable to come up with a solution for your system in the time I had allotted. In the next few weeks (for the 2.8.0 release) we will be making changes to the code in these areas. Once we re-open development I will be able to take a closer look and set up some simulations to see where we can make improvements.
I will be in touch with any questions or requests for more information as soon as we begin 2.8.0 development.
Thanks,
Terry -
@terrypacker Thanks for the update. Please keep in mind that this problem is not unique to my installation.