When carrying out an on-premises troubleshooting engagement for a new customer running a WildFly and HornetQ application environment, I discovered that they had no monitoring system in place for the WildFly servers. With the affected application critical to their business processes, the challenge was to investigate the symptoms and identify the root cause without monitoring data. This post describes the steps I took to investigate the problem and identify the potential root causes.
First of all, let's describe the issues that the customer was facing.
Following an application deployment earlier in the month, several WildFly servers were experiencing terminations of their Netty client connections to the HornetQ brokers in the cluster. On reviewing the WildFly server logs, an ‘OutOfMemoryError: GC overhead limit exceeded’ error was reported immediately before the first Netty connections were terminated, pointing to a potential issue with the JVM’s garbage collection or memory utilisation.
The parallel collector throws an OutOfMemoryError when excessive time is being spent in garbage collection and too little heap space is being recovered: specifically, if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered. This feature is designed to prevent applications from running for an extended period while making little or no progress because the heap is too small.
This feature can be disabled by adding the option -XX:-UseGCOverheadLimit to the JVM options; however, if the heap space cannot be recovered on subsequent garbage collections due to memory leaks, an OutOfMemoryError will eventually be thrown anyway.
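As an illustration only (this is not the HotSpot implementation), the overhead check can be thought of as comparing two ratios against the thresholds above; the constants below mirror the defaults of the real -XX:GCTimeLimit=98 and -XX:GCHeapFreeLimit=2 options:

```java
// Illustrative sketch only -- not HotSpot's actual code.
// Mirrors the defaults of -XX:GCTimeLimit=98 and -XX:GCHeapFreeLimit=2.
public class GcOverheadCheck {
    static final double GC_TIME_LIMIT = 0.98;      // > 98% of time spent in GC
    static final double GC_HEAP_FREE_LIMIT = 0.02; // < 2% of heap recovered

    static boolean wouldTripOverheadLimit(double gcTimeFraction,
                                          double heapRecoveredFraction) {
        return gcTimeFraction > GC_TIME_LIMIT
            && heapRecoveredFraction < GC_HEAP_FREE_LIMIT;
    }

    public static void main(String[] args) {
        // 99% of time in GC, only 1% of heap recovered -> limit exceeded
        System.out.println(wouldTripOverheadLimit(0.99, 0.01)); // true
        // Healthy collection: 5% of time in GC, 40% of heap recovered
        System.out.println(wouldTripOverheadLimit(0.05, 0.40)); // false
    }
}
```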
If garbage collection is running for long periods, the WildFly server may become CPU-bound, causing it to become unresponsive and resulting in connection problems. HornetQ clients periodically send "ping" requests to brokers to check that they are available. If the server does not receive a ping within the configured connection ‘time to live’ (TTL), the sessions relating to that connection are terminated, which could explain the cause of the problem.
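For reference, the ping period and TTL are configured on the connection factory in WildFly's messaging subsystem (standalone-full.xml). The fragment below is an illustrative sketch: the element names follow the HornetQ subsystem schema, and the 30000/60000 millisecond values are examples rather than the customer's configuration — check them against your WildFly version's schema.

```xml
<!-- Illustrative fragment, not the customer's actual configuration -->
<connection-factory name="RemoteConnectionFactory">
    <connectors>
        <connector-ref connector-name="netty"/>
    </connectors>
    <entries>
        <entry name="java:jboss/exported/jms/RemoteConnectionFactory"/>
    </entries>
    <!-- how often the client pings the broker (ms) -->
    <client-failure-check-period>30000</client-failure-check-period>
    <!-- how long the broker keeps the connection alive without a ping (ms) -->
    <connection-ttl>60000</connection-ttl>
</connection-factory>
```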
Implementing a Solution
To investigate potential JVM issues, it is essential to monitor garbage collection and heap utilisation metrics to identify trends such as excessive garbage collection or memory leaks. As this was a production environment, any configuration or installation of monitoring tools would need to go through the change control process; given the limited time I had on the customer site, this left me with few options for monitoring the JVM.
One option was the command-line tool jstat, which comes with the JDK (see the Java Virtual Machine Statistics Monitoring Tool documentation). Executing the following in a command shell prints the garbage collection statistics, showing the utilisation of the generational spaces as a percentage of each space's capacity, together with the time spent in collection.
jstat -gcutil -h 10 <pid> 10s
- -h 10: prints the header again after 10 lines
- <pid>: the process id of the JVM, which can be found by running the jps command-line tool
- 10s: the interval between readings, in this case every 10 seconds
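When attaching a tool to the process is not an option, similar figures can also be sampled in-process via the standard java.lang.management API. The sketch below is my own illustration (not part of the customer's setup); it prints each collector's accumulated collection count and time, plus overall heap utilisation:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class GcStats {
    public static void main(String[] args) {
        // Accumulated GC activity per collector (e.g. PS Scavenge / PS MarkSweep
        // when running the parallel collector)
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        // Overall heap utilisation; note getMax() can return -1 if undefined
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used: %d of %d bytes%n", heap.getUsed(), heap.getMax());
    }
}
```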
I also advised them to add the following JVM options to WildFly’s start-up configuration, which would enable GC logging to a file that can be used to analyse any JVM-related issues in the future.
-Xloggc:pathtofolder/gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10m
The GC log files can be analysed using GCViewer, which charts the garbage collection statistics and enables you to identify trends in memory utilisation caused by potential memory leaks. To find the leaks themselves, heap dumps need to be taken at regular intervals so that the objects on the heap can be analysed.
To investigate the problem further, I arranged for a series of heap dumps to be taken at 3-hour intervals over a period of 24 hours. The Eclipse Memory Analyzer Tool (MAT) was used to analyse the dumps by running its Leak Suspects report, which identified two potential leak suspects.
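One way to take such dumps is the jmap utility that ships with the JDK (jmap -dump:live,format=b,file=dump.hprof <pid>); the same can be done programmatically from inside the JVM via the HotSpot-specific HotSpotDiagnosticMXBean. The helper below is a sketch of that approach, not the mechanism used on site:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void dump(String file, boolean liveObjectsOnly) throws Exception {
        // HotSpot-specific bean; the target file should end in .hprof
        // and must not already exist
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // liveObjectsOnly=true forces a full GC first, dumping only reachable objects
        bean.dumpHeap(file, liveObjectsOnly);
    }

    public static void main(String[] args) throws Exception {
        dump("heap-" + System.currentTimeMillis() + ".hprof", true);
    }
}
```

Scheduling a call to this helper (or the equivalent jmap command via cron) at fixed intervals produces a series of comparable .hprof files for MAT to analyse.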
Looking at Suspect 1 in the report, objects were being kept alive by a thread running the class org.xnio.WorkerThread (Id 0x54124fbd0, ‘default I/O-8’), as shown in the ‘Shortest Paths To the Accumulation Point’ section of the Leak Suspects report. The same thread appeared in all the heap dumps taken over the 24-hour period, making it a likely candidate for the cause of the memory problem. The thread was initiated by a servlet request to the Undertow web container.
The Dominator Tree allows you to identify the largest objects held on the heap, together with their references. The objects being held by the thread were related to the Hibernate layer: they retrieved data from the database and loaded it into Java objects (com.aviseurope.*) for subsequent CRUD operations. These objects, shown in the ‘Accumulated Objects by Class in the Dominator Tree’ section of the Leak Suspects report, added up to approximately 1.295GB and should have been released when the user either saved their changes or ended their session.
The second leak suspect had similar characteristics: the thread holding the objects on the heap appeared in the dumps across the same 24-hour period, and those objects should likewise have been released when the user saved their changes or ended their session. This points to a potential problem with Undertow, Hibernate, the application code, or a combination of all three, and required further investigation.
In this post, I have shown how tools such as the Eclipse Memory Analyzer Tool can be effective in investigating JVM problems and help track down where they are occurring. Pinning down the root cause, however, requires further investigation of the problem areas identified by the heap analysis, including possible bugs in Undertow, Hibernate and the application code.