NoSQL Task Queue Full

phildunlap

You do not need to restart Mango for the settings to take effect. Sometimes things may need to settle (like reducing the maximum number of batch write behind tasks) for them to take effect, but that is not the case of the batch sizes / delays / backdate delays / any of the intervals.

I'm curious if your generation produced any values before it crashed? If not, you may just have enough context points that loading all the data from them all since 2015 (which the history generation will do, it doesn't stream the points out that cause context updates, which is a potential improvement) and shortening the time period will be the main benefactor.

mihairosu

Yes the generation produced a lot of values, I'd say about half of it, before it stopped. Many test of thousands of values.

So in this instance you think it would benefit to reduce the "Delay for backdate batches"?

phildunlap

It sounds like you're in the many-values-few-points situation described. So, I would expect the larger, 60000 ms to fare better than the default 5000 ms. That you have reduced the time range will probably skirt the problem more significantly, though.

mihairosu

You are correct, reducing those intervals does not create any issues. Ideally i would not sit down for hours, manually editing the intervals until I have all the data I want.

Is there some way to throw more hardware at the problem, and change some settings to make more use of that hardware? As far as I can tell, it's some sort of software issue, since we are not running out of memory.

Here's an htop during times when we've hit the Task Queue Full error:

0_1514921823392_mango hardware stats.png

phildunlap

Yes, I have done that too, it's not ideal. Did you try the whole range with the modified settings?

Hmm. Can you check if you have any hs_err files in your Mango/ directory that may reveal the OOM? Otherwise your log should have some indication of what's gone astray. And, if it doesn't, you should definitely keep an eye on stderr.

The task queue message is probably incidental to the crash. If you change your spawn threshold / max instances you can see this (force it to have max instances all the time with a low spawn threshold, the messages will go away, but it will probably still crash).

It may help to purge your event, audit and userEvent tables.

mihairosu

Oh wow I DO have a bunch of he_error files...I didn't realize they'd be here. Perhaps they should go in the log folder?

According to these logs, I am indeed running out of memory.

I wonder if it could be that the memory spikes up only temporarily before being dropped. I was not staring at it the whole time, to be fair.

Ok now we're getting somewhere!

Thanks for that information.

mihairosu

Ok I just noticed a problem is of my own making.

I had created an ext-available script with a new JAVA heap size, which was even larger than the real memory I had allocated to the VM.

I must have forgotten all this information and messed around with the memory allocation.

I've increased both, and made sure I have enough real physical memory allocated.

That could have been causing some major problems.

mihairosu

I'm checking out a current historical generation process, and I am not seeing memory usage increase above 4GB for the system, even though I've provided plenty.

Now it could be that the system just does not need that much, or maybe DISK IO is limiting database reads/writes.

In any case, I appreciate your insights, but I have another question:

Which NoSQL settings can I tweak to make sure more available memory can be used?

phildunlap

Good catch! Definitely if the allocation was above where the OOM Killer would awaken this would happen, even if we never were really using that much memory at one time, because the garbage collector wouldn't attempt to free anything in an intense computation moment like history generation.

Hmm, I've never looked into / thought of moving the hs_err files to the logs directory. Those are created by the HotSpot Java runtime - and they only occur if things have gotten off the rails, such as OOM, SegFault, etc. I'll bring it up to people, but usually we tell people to look in their ma.log file anyway, so I wouldn't expect that direction to reveal the hs_err files either. The advantage of them being in the Mango/ directory is that they're abnormal events so it's nice to have an easy reference to them, without having to look past lots of other log files.

mihairosu

I am noticing one more interesting tidbit with htop.

I always see one virtual CPU pegged at 100% with none of the others really showing any work.

Perhaps this process is not running multi-threaded?

phildunlap

If you are using the clocks in the interface, it'll run in a Jetty thread. You can run more than one at once, and it is multi-threaded in that sense. But, what you would find, is that after you get somewhere between 3 and 6 going you cannot start any more and your interface isn't especially responsive. You are correct in inferring that a meta point's history generation or a meta data source's history generation is only going to use one thread.

Phillip Weeks

@phildunlap said in NoSQL Task Queue Full:

The task queue message is probably incidental to the crash. If you change your spawn threshold / max instances you can see this (force it to have max instances all the time with a low spawn threshold, the messages will go away, but it will probably still crash).

Phil would you elaborate a little on the above comment, what is considered a low threshold?
Our situation is similar to the one discussed in this thread in that our systems have a limited number of points < 300 and very lengthy periods 2-5 years on every minute so about 100 million of values a year. I noticed your remark about streaming as opposed to loading the entire period's values in memory, Are there any plans to move in this direction? I have tried your tuning suggestions above and I believe you correctly suggested shortening the time period is the most obvious solution so I modified the run loop script you offered in the other thread so the the period gets divided-up and points processed repediavely and linearly keeping things under control with a RuntimeManager.sleep(5000) thrown in for good luck if the pointsTBW gets out of control. Our system was consistently running out of memory before I read this thread and now that I understand why a little better the result is crashes have been eliminated completely on metapoint regenerations. Combined with your tuning suggestions in this thread, the system works much better and we get hassle-free regeneration over a lengthy periods of a year or more.

I also remember something discussed in another thread about automatically regenerating the missing metapoints whenever the source data sync runs; Is there any motivation within your collective talents to create this feature?
Thanks in advance.

phildunlap

Phil would you elaborate a little on the above comment, what is considered a low threshold?

In the context of that post, anything that would cause the maximum number of batch write behind instances to exist in regular operating conditions.

I noticed your remark about streaming as opposed to loading the entire period's values in memory, Are there any plans to move in this direction?

I can't say if anyone will want to change anything about it, but I began I REST controller that did streaming values (or optionally load them all), generating multiple points' histories in one request, and provided the ability to cancel a history generation task. There was some stuff left there to be finished / fixed up, but I'm sure some more feature-ful REST controller is on its way.

I also remember something discussed in another thread about automatically regenerating the missing metapoints whenever the source data sync runs; Is there any motivation within your collective talents to create this feature?

Possible, but unlikely. I don't think you should have meta points on the receiver end unless you really cannot have them on the publisher's side. You could already take that history generation loop I gave you (invokes the meta edit dwr from a loop of points to generate iirc) and place that into a set point event handler for the persistent receiver's sync completed event. So, in that sense it is possible and not my favorite idea, so I'm unlikely to encourage it further without good cause.

There some discussion of this in this thread: https://forum.infiniteautomation.com/topic/3189/using-persistent-tcp-points-in-script-calculations