Mango consuming 100% CPU , web interface not responding, how to diagnose further?
-
Hi,
We are encountering unusual behaviour of our Mango process (running on a ubuntu 14.04 VM). CPU on the VM jumps up to consume all (4) cores. Disk and memory utilisation are low. This is a lightly loaded installation, with < 7 mangoES publishers pushing data to the instance. When this happens we are unable to view the management web interface, and javascript API calls are timing out (e.g. via the dashboard module). Attached are the stack traces from the two times this happened today, prior to today the issue occurred two weeks ago. The problem has occurred while running the latest version of the software and modules, as well as the prior version. The only way to bring it back is to kill the java process (with kill -9), then start Mango (service mango start) . What options do we have to track down the cause and stop this from happening?
Output of jstack when the CPU cores are busy and the web interface is timing out:
0_1463468117135_stack-trace.log
There seem to be a large number of threads working with JdbcConnectionPool when the problem occurs, does this suggest the embedded DB is running slowly or locked in some way?
Cheers,
Tim -
Thanks for doing so much leg work already!
Is this occurring reliably, after a consistent interval, or seemingly randomly?
From looking at the stack trace you've provided I would say something may be going awry inside the H2 database engine. It looks like lots of threads are blocked on the same object. There are 145 threads waiting for a lock, and 1 that seems to have gotten into some weird sounding spot:
p1005093407-3287" #3287 prio=5 os_prio=0 tid=0x00007f66a5334800 nid=0x7162 runnable [0x00007f6677dbe000] java.lang.Thread.State: RUNNABLE at org.h2.jdbc.JdbcSQLException.buildMessage(JdbcSQLException.java:156) at org.h2.jdbc.JdbcSQLException.setSQL(JdbcSQLException.java:152) at org.h2.message.DbException.addSQL(DbException.java:128) at org.h2.command.Command.executeUpdate(Command.java:266) - locked <0x00000000c2573e18> (a org.h2.engine.Database) at org.h2.jdbc.JdbcConnection.rollbackInternal(JdbcConnection.java:1501) at org.h2.jdbc.JdbcConnection.rollback(JdbcConnection.java:489) - locked <0x00000000f214e9a8> (a org.h2.jdbc.JdbcConnection) at org.h2.jdbcx.JdbcXAConnection.getConnection(JdbcXAConnection.java:107) at org.h2.jdbcx.JdbcConnectionPool.getConnectionNow(JdbcConnectionPool.java:229) at org.h2.jdbcx.JdbcConnectionPool.getConnection(JdbcConnectionPool.java:199) - locked <0x00000000c25738a8> (a org.h2.jdbcx.JdbcConnectionPool) at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111) at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:615) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:680) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:712) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:722) at com.serotonin.db.spring.ExtendedJdbcTemplate.queryForObject(ExtendedJdbcTemplate.java:39) at com.serotonin.db.DaoUtils.queryForObject(DaoUtils.java:332) at com.serotonin.m2m2.db.dao.AbstractDao.getByXid(AbstractDao.java:188) at com.serotonin.m2m2.web.mvc.rest.v1.PointValueRestController.getPointValues(PointValueRestController.java:318) ....
Do you have applications built using the API? About how many data points are there? Did anything interesting make it into the regular log file at Mango/logs/ma.log?
-
Hi,
Thanks for the fast response. The ma.log file had entries similar to the following at the time of the problem:
--
ERROR 2016-05-17 14:22:37,909 (com.serotonin.m2m2.web.mvc.rest.v1.exception.RestExceptionHandler.handleMangoError:58) - Could not get JDBC Connection; nested exception is org.h2.jdbc.JdbcSQLException: Database may be already in use: "L
ocked by another process". Possible solutions: close all other connection(s); use the server mode [90020-181]
ERROR 2016-05-17 14:22:43,606 (com.serotonin.m2m2.web.mvc.rest.v1.exception.RestExceptionHandler.handleMangoError:58) - Could not get JDBC Connection; nested exception is org.h2.jdbc.JdbcSQLException: Database may be already in use: "L
ocked by another process". Possible solutions: close all other connection(s); use the server mode [90020-181]
org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is org.h2.jdbc.JdbcSQLException: Database may be already in use: "Locked by another process". Possible solutions: close all othe
r connection(s); use the server mode [90020-181]
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:615)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:680)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:712)The high CPU core utilisation activity appears to be correlated to API calls from the dashboard module (when retrieving various stats and the current/live values for 6 points). There are 19 million rows in the pointValues table. After logging this post, noticed the mangoNoSqlDatabase module was not installed - so have just installed it.
So migrating data from the old H2 store pointValue table into the new mangoNoSqlDatabase store may help. This however may have hit a wall, as I have received the following error: "GC overhead limit exceeded" twice (perhaps a few minutes apart), after processing 2.1 million rows.
I can now see entries like this while the migration is failing:
--
Caused by: java.lang.OutOfMemoryError: Java heap space
ERROR 2016-05-17 15:42:02,996 (org.springframework.web.servlet.mvc.method.annotation.ExceptionHandlerExceptionResolver.doResolveHandlerMethodException:367) - Failed to invoke @ExceptionHandler method: protected org.springframework.http
.ResponseEntity<java.lang.Object> com.serotonin.m2m2.web.mvc.rest.v1.exception.RestExceptionHandler.handleMangoError(java.lang.Exception,org.springframework.web.context.request.WebRequest)
java.lang.ClassCastException: java.lang.OutOfMemoryError cannot be cast to java.lang.Exception
at com.serotonin.m2m2.web.mvc.rest.v1.exception.RestExceptionHandler.handleExceptionInternal(RestExceptionHandler.java:78)
at com.serotonin.m2m2.web.mvc.rest.v1.exception.RestExceptionHandler.handleMangoError(RestExceptionHandler.java:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)--
So a few further questions:
-
how can I solve the out of memory error while migrating data from the h2 instance into the new mangonosqlmodule?
-
is it safe to retry the import of data from pointValue, or will duplication occur in the destination?
-
would prefer the server to not shut down in 8 hours (and then drop data), I suspect we have the license (but not sure which account it is associated with. Is there a temporary way I can keep this server up? If there is no way to stop the shutdown, I can switch off the module, but then what will happen to data which was received into the modules nosql store, will that need to be re-sent from the child MangoES devices?
Regards,
Tim -
-
Re the java heap space issue, going to give the Mango VM another 4GB of RAM, plus swap, and try again.
-
A few things have been done. Will wait to see if it fixes the problem.
- enabled the mangoNoSqlDatabase module
- configured the process to start with 6GB heap size via:
#!/bin/bash
case "$1" in
init)
# Startup with Java Memory setup for Medium size installation
# The heap is set to non-expanding for increased performance. If memory use is a factor
# set the Minimum heap size to a lower number and let the JVM adjust when necessary.
JAVAOPTS=' -Xms6g -Xmx6g'
;;
esac- imported the old points from H2
- API calls for stats are completing in 1/20th the time
Will reply if the CPU load issue happens again.
Regards,
Tim -
Using the mangoNoSqlDatabase has fixed the performance issue we were seeing when retrieving point values and statistics via the API. The client is unable to justify the cost of this module however, and is asking if an alternate option exists for improving the performance of retrieving data from the built-in pointValues table . The stack trace and log file appears to suggest database locking is in place which slows down concurrent requests to point values. Any ideas?
-
Converting the H2 database to MySQL will also be an improvement assuming your MySQL server has enough resources to operate. It won't by any means be as fast at the NoSQL database and, as with any SQL database holding time series data, you may see performance suffer as the data set grows.
You might revisit your logging and purge settings and see if you can make adjustments to reduce the overall amount and frequency that data is stored.
-
Here is a link to information on database conversions: https://help.infiniteautomation.com/support/solutions/articles/14000022650-database-conversions
Note that when using MySQL Mango will not perform automatic backups of the database like it does with H2 and NoSQL.