Interesting Tomcat not starting problem

Interesting Tomcat not starting problem

I was working with a customer on an upgrade to IDM 4.5.2 and switching from JBoss to clustered Tomcat. I must say, the 4.5 install process is much better than in 4.02, which is always nice to see.

We got OSP doing SAML federation to NAM, and then behind a NAM reverse proxy, and it was all going well. To get that going in Dev, we tested in steps. First Tomcat/JVM/OSP/SSPR and got Username password login working. Next we got SAML working. Then we moved on to reverse proxy. Finally we installed the User Application and it was great.

We modified our install procedure to do all the installs, and then mostly drop in config files (And certificates, so many certificates in keystores, but that is a topic for another article) and we were up and running in the QA environment.

Everything was going great, the servers were humming, the cluster was clustering, the testers were testing and then suddenly that all changed.

After a Tomcat restart (/etc/init.d/idmapps_tomcat_init restart command) everything died.

The solution was simple in the end, but I thought the process of how we approached troubleshooting would be helpful. TL;DR - The PostGres DB was down, which is why it affected both nodes at once. But I think the list of troubleshooting steps we took might help someone with a different but similar problem.

Our problem start to manifest as suddenly the catalina.out (/opt/netiq/idm/apps/tomcat/logs/catalina.out) had a grand total of 5 lines. All we would see if the following:

2015-12-29 15:49:07,441 [main] INFO  org.apache.catalina.core.AprLifecycleListener- The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2015-12-29 15:49:08,192 [main] INFO org.apache.coyote.http11.Http11Protocol- Initializing ProtocolHandler ["http-bio-8080"]
2015-12-29 15:49:08,223 [main] INFO org.apache.coyote.http11.Http11Protocol- Initializing ProtocolHandler ["http-bio-8443"]
2015-12-29 15:49:09,269 [main] INFO org.apache.coyote.ajp.AjpProtocol- Initializing ProtocolHandler ["ajp-bio-8009"]
2015-12-29 15:49:09,280 [main] INFO org.apache.catalina.startup.Catalina- Initialization processed in 2765 ms


Now if you have seen User App startup, and these boxes had User App, Catalog Admin, Dash, Landing, SSPR, and OSP all in one Tomcat instance you will know that there are about 10,000 lines of trace missing.

If you read that, it reads to me that port 8080 (http-alt), 8443 (pcsync-https), and 8009 were bound. Ok, checking with netstat -an (though a colleague introduced me to netstat -tln which is potentially simpler) we saw that yes, we were listening on 8009, 8080, and 8443.

That made sense, those were our three core ports. 8009 is the AJP port, for an Apache front end web server to talk to Tomcat. (Joe Doupnik has a very good presentation on how to use Apache in front of different web apps to protect things like Filr or iPrint appliance). The 8080 port is the default HTTP port, and we enabled SSL on 8443. (Then we used a redirect script on the server to redirect 443 traffic to 8443, since OSP is very sensitive to the full URL, and we did not want people to see 8443 in the URL).

But that was all we got in the log. Where was the rest? What was NOT happening next?

Well, on the next restart attempt we saw this:

2015-12-29 15:59:02,444 [main] ERROR org.apache.catalina.startup.Catalina- Could not contact 127.0.0.1:8005. Tomcat may not be running.
2015-12-29 15:59:02,444 [main] ERROR org.apache.catalina.startup.Catalina- Catalina.stop:
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:211)
at org.apache.catalina.startup.Catalina.stopServer(Catalina.java:498)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.catalina.startup.Bootstrap.stopServer(Bootstrap.java:370)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:457)


Followed by the same 5 lines we saw initially. When I tried opening an incident with support and submitted these short logs from catalina.out they did not believe me that this was the entire log.

This was the first inkling we had that somehow port 8005 is involved. If you look it up, you will see that the 8005 port is used to control the Tomcat process. It is waiting to hear a command to shutdown the instance. The idmapps_tomcat_init script in /etc/init.d on Linux tries to stop Tomcat, first by sending a SHUTDOWN command to port 8005. (Literally makes an HTTP connection on 8005, and sends the word SHUTDOWN. You could do this via Telnet if you wanted to. At some level this is a security hole, so protect 8005 from the outside world). Then if it fails, as it did in my example, then it does a kill command for the process ID (pid).

So even though we did not bind 8005, we were able to stop Tomcat, which is good I guess.

We tried everything we could think of in terms of what might have changed overnight on this server, to suddenly block binding to port 8005. We looked at permissions, everything was novlua:novlua. We looked firewall settings, internal and external (though that would not matter).

We distrusted the server OS level guys who said they did not patch anything last night, and tried the command:
rpm -qa --last | grep ' Dec '


This showed us a series of patches from Dec 10th, two weeks earlier. So it was not them. Darn, would have been so much easier if someone else's fault.

They were running an anti virus program, got the manager tool to disable the anti virus, still no joy, so we stopped the processes we could, killed the rest, still no joy.

We thought maybe some process level limit was hitting us in Linux, so we went /proc/PID#forJava/ and looked at the limits file:

Limit                     Soft Limit           Hard Limit           Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 1024 unlimited bytes
Max resident set 7012548608 unlimited bytes
Max processes 62855 62855 processes
Max open files 8192 8192 files
Max locked memory 65536 262144 bytes
Max address space 8310865920 unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 62855 62855 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us


Use ps -ef | grep java to find your Tomcat process. One will have ActiveMQ in the string, the second will have the catalina bootstrap class in the name. You want the catalina one. Remember the PID for it. Then you can:
less /proc/pid#/limits

and see the table above, whose numbers are basically good. (Maybe we could use more open files, but high enough it should not block startup.

Looking at a working box, we could see that 8005 was bound and listening when it worked, so clearly this was somehow related. But we were stuck on permissions to the port. We knew it was not another port using it, since between netstat and lsof tools, we could see nothing was holding 8005 open.

To see what I mean, try:
lsof | grep IPv | less 


This will show you all the ports held open (lsof means list of open file handles? or some such). That is when we noticed that ActiveMQ was listening on IPv6 not IPv4, so we disabled IPv6 entirely on the server, still no joy. Man this was getting frustrated. For fun, I did not know how to get lsof to return the port number, not the mapped 'pretty name' so 8080 came up as http-alt, and 8443 as pcsync-https, and 8009 had no mapping.

If you are interested, look in /etc/services file and find the port number to see where it is mapped. In our case, 8005 is mapped to mxi.

We looked at config files (server.xml, web.xml) compared to the working Dev instance. We kept coming up with NOTHING.

If you look at the server.xml file in /opt/netiq/idm/apps/tomcat/conf/server.xml you will see that there is a port definition for 8005 that looks like:

<Server port="8005" shutdown="SHUTDOWN">


When we first commented this out, Tomcat started with all sorts of XML parsing errors, which made us realize that the entire Tomcat config definition, starts with that <Server> node, so commenting it out was breaking the XML, since there was a close node, and no open node.

Then we tried changing the port to 8006, since if there was an issue blocking 8005, surely not 8006 as well? For fun we tried some crazy numbers in case the 8xxx series was somehow blocked, and no joy either.

Reading Google hits, we added to the Server node, an XML attribute of address="127.0.0.1" and many other combinations, on the theory that somehow we were not resolving localhost properly but that was not the issue either.

As part of the 4.5.2 patch, we applied the Java 1.8.0_65 upgrade. (IDM 4.5 ships with Java 7 of some flavour, and that is being deprecated, and 4.5.2 asks that you go get a JRE 8 from Oracle, and install it.) We kept the old JVM just in case we had to fail back/test with Java7. Tried that as well, same problems, so not a Java 7 vs 8 issue.

As we looked at this, we realized that there must be something more going on. We looked at a properly starting Tomcat in Dev, to see what happens next in the process. These are the hard ones to diagnose, the errors where the next message is missed, and that gives you the hint, since you need a working example to compare with. This is why I like writing these troubleshooting or error message articles, to give people examples to compare against.

We looked in Dev's catalina.out and saw that after our last line of Initialization and a time in milliseconds, we would see the following:

2015-12-15 16:14:45,589 [main] INFO  org.apache.catalina.core.StandardService- Starting service Catalina
2015-12-15 16:14:45,589 [main] INFO org.apache.catalina.core.StandardEngine- Starting Servlet Engine: Apache Tomcat/7.0.55


And then it would start deploying the web apps, in this case, starting with osp.war. (I think it might be timestamp or inode number based, since it is not alphabetical. That is oldest WAR file written to the file system executes first. If you happen to know how the order is determined, please post a comment.).

2015-12-15 16:14:45,792 [localhost-startStop-1] INFO  org.apache.catalina.startup.HostConfig- Deploying web application archive /opt/netiq/idm/apps/tomcat/webapps/osp.war


What this told us that Catalina had not properly started. Or was having an issue getting to that part. It was clearly reading the server.xml config file, since that holds the port definitions for 8009, 8080, and 8443 which were getting read.

Ok, so how do we enable logging? Well in troubleshooting User Application issues, you need to know the class name with the issue, to enable logging so you can see the rest of the message. I tried maintaining a web page with things I had learned about the various log levels in User App, but it was forced to be taken down. Which I think was a stupid decision, so I host it privately, and make it available to those who ask to avoid hassling on the topic. Sorry that everyone loses on that front, but it was a stupid decision that was made, and I have no influence over, regardless of how I tried. So how do you know what class to enable logging, to see the info you need, to debug your error? There are hundreds of classes in User App, so which do you need? Apparently the answer is "You Can't Handle The Truth". The reality is, you find a message that is nearby and enable logging on a higher parent class so you get flooded in logs in the hopes of finding the proper class to enable.

In this case, seeing a working example told us we wanted to see possibly the org.apache.catalina.core.StandardService and org.apache.catalina.core.StandardEngine classes tracing. Or maybe we just wanted the parent, org.apache.catalina.core class.

In User App you have two ways of enabling logging (Well with OSP, there it's own method, totally different) via the GUI in the Logging sub tab of the Administration tab. Or by directly editing the idm_userapp_logging.xml file. In fact, when you make a change in the GUI there is an option at the bottom "Persist changes" which means they will persist through a restart. It does this by writing to the logging file.

However this is logging inside the User App, inside Tomcat, where the WAR for User App is already running. Our issue is higher level (lower level? I always forget? Are we higher or lower in the stack now?)

The docs on changing this were just plain confusing. Check out this link and see what I mean: https://tomcat.apache.org/tomcat-7.0-doc/logging.html

Looked at the /opt/netiq/idm/apps/tomcat/bin/catalina.sh and noticed that there was a test for the existence of a file in the conf directory named logging.properties. Snagging that sample, this lead me to think we needed a file in the conf,

# Set juli LogManager config file if it is present and an override has not been issued
if [ -z "$LOGGING_CONFIG" ]; then
if [ -r "$CATALINA_BASE"/conf/logging.properties ]; then
LOGGING_CONFIG="-Djava.util.logging.config.file=$CATALINA_BASE/conf/logging.properties"
else
# Bugzilla 45585
LOGGING_CONFIG="-Dnop"
fi
fi


We tried making the logging.properties file, but we could not seem to get the syntax correct. Then we noticed that the JRE itself is allowed to enable logging. Look in /opt/netiq/idm/apps/jre/lib/logging.properties and in that file we saw:

# Default global logging level.
# This specifies which kinds of events are logged across
# all loggers. For any given facility this global level
# can be overriden by a facility specific level
# Note that the ConsoleHandler also has a separate level
# setting to limit messages printed to the console.
.level= INFO


So we set that to .level= ALL and that did not help. But then we noticed a paragraph or three lower it said:

# Limit the message that are printed on the console to INFO and above.
java.util.logging.ConsoleHandler.level = INFO


We knew from reading the Tomcat docs about logging that the console logs are using this class. So even though we set the system level logging to ALL, we were then overriding it down to INFO. Oops. Fixed that, and suddenly we got a flood of info in catalina.out. The full trace is 867 lines, but most of it is repetitive so I will edit it down for size.

Dec 30, 2015 2:06:58 PM org.apache.catalina.startup.ClassLoaderFactory createClassLoader
FINE: Creating new class loader
Dec 30, 2015 2:06:58 PM org.apache.catalina.startup.ClassLoaderFactory createClassLoader
FINE: Including directory file:/opt/netiq/idm/apps/tomcat/lib/


It appears to log every single JAR that it loads which takes a lot of lines.

Dec 30, 2015 2:06:59 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor registerMBean
FINER: ObjectName = JMImplementation:type=MBeanServerDelegate


Then we see a couple of hundred lines of registerMBean lines like above, edited out.

But now we got to something interesting. Here we have hit the Bootstrap class init, which is where we THINK we get stuck.

Dec 30, 2015 2:06:59 PM org.apache.catalina.startup.Bootstrap init
FINE: Loading startup class
Dec 30, 2015 2:06:59 PM org.apache.catalina.startup.Bootstrap init
FINE: Setting startup class properties
Dec 30, 2015 2:06:59 PM org.apache.catalina.startup.Bootstrap load
FINE: Calling startup class public void org.apache.catalina.startup.Catalina.load(java.lang.String[])
2015-12-30 14:07:00,466 [main] INFO org.apache.catalina.core.AprLifecycleListener- The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib


That last line we are used to seeing in our log, so we know we are before the 8009/8080/8443 binding but getting there. This is a good landmark. A lot more stuff is shown in trace that I am going to cut out since it looks the same as above, but here is the next interesting part.

At first I dismissed this, by explaining this is just loading the data source definitions. That is, User App knows about the ActiveMQ Java Message Service (JMS) instance and talks to it, and we see the definition being loaded. Same for the User App database.

Dec 30, 2015 2:07:01 PM com.sun.jmx.mbeanserver.Repository contains
FINER: name = Catalina:type=Resource,resourcetype=Global,class=org.apache.activemq.ActiveMQConnectionFactory,name="jms/ConnectionFactory"
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor registerMBean
FINER: ObjectName = Catalina:type=Resource,resourcetype=Global,class=org.apache.activemq.ActiveMQConnectionFactory,name="jms/ConnectionFactory"
Dec 30, 2015 2:07:01 PM com.sun.jmx.mbeanserver.Repository addMBean
FINER: name = Catalina:type=Resource,resourcetype=Global,class=org.apache.activemq.ActiveMQConnectionFactory,name="jms/ConnectionFactory"
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor addObject
FINER: Send create notification of object Catalina:class=org.apache.activemq.ActiveMQConnectionFactory,name="jms/ConnectionFactory",resourcetype=Global,type=Resource
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.registered Catalina:type=Resource,resourcetype=Global,class=org.apache.activemq.ActiveMQConnectionFactory,name="jms/ConnectionFactory"
Dec 30, 2015 2:07:01 PM com.sun.jmx.mbeanserver.Repository contains
FINER: name = Catalina:type=Resource,resourcetype=Global,class=org.apache.catalina.UserDatabase,name="UserDatabase"
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor registerMBean
FINER: ObjectName = Catalina:type=Resource,resourcetype=Global,class=org.apache.catalina.UserDatabase,name="UserDatabase"
Dec 30, 2015 2:07:01 PM com.sun.jmx.mbeanserver.Repository addMBean
FINER: name = Catalina:type=Resource,resourcetype=Global,class=org.apache.catalina.UserDatabase,name="UserDatabase"
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor addObject
FINER: Send create notification of object Catalina:class=org.apache.catalina.UserDatabase,name="UserDatabase",resourcetype=Global,type=Resource
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.registered Catalina:type=Resource,resourcetype=Global,class=org.apache.catalina.UserDatabase,name="UserDatabase"
Dec 30, 2015 2:07:01 PM com.sun.jmx.mbeanserver.Repository contains
FINER: name = Catalina:type=Resource,resourcetype=Global,class=org.apache.activemq.command.ActiveMQTopic,name="topic/IDMNotificationDurableTopic"
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor registerMBean
FINER: ObjectName = Catalina:type=Resource,resourcetype=Global,class=org.apache.activemq.command.ActiveMQTopic,name="topic/IDMNotificationDurableTopic"
Dec 30, 2015 2:07:01 PM com.sun.jmx.mbeanserver.Repository addMBean
FINER: name = Catalina:type=Resource,resourcetype=Global,class=org.apache.activemq.command.ActiveMQTopic,name="topic/IDMNotificationDurableTopic"
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor addObject
FINER: Send create notification of object Catalina:class=org.apache.activemq.command.ActiveMQTopic,name="topic/IDMNotificationDurableTopic",resourcetype=Global,type=Resource
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.registered Catalina:type=Resource,resourcetype=Global,class=org.apache.activemq.command.ActiveMQTopic,name="topic/IDMNotificationDurableTopic"
Dec 30, 2015 2:07:01 PM com.sun.jmx.mbeanserver.Repository contains
FINER: name = Catalina:type=Resource,resourcetype=Global,class=javax.sql.DataSource,name="shared/IDMUADataSource"
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor registerMBean
FINER: ObjectName = Catalina:type=Resource,resourcetype=Global,class=javax.sql.DataSource,name="shared/IDMUADataSource"
Dec 30, 2015 2:07:01 PM com.sun.jmx.mbeanserver.Repository addMBean
FINER: name = Catalina:type=Resource,resourcetype=Global,class=javax.sql.DataSource,name="shared/IDMUADataSource"
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor addObject
FINER: Send create notification of object Catalina:class=javax.sql.DataSource,name="shared/IDMUADataSource",resourcetype=Global,type=Resource
Dec 30, 2015 2:07:01 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.registered Catalina:type=Resource,resourcetype=Global,class=javax.sql.DataSource,name="shared/IDMUADataSource"


Ok, there are several data sources. Now in fact this sent us on a bit of a snipe hunt looking for ActiveMQ communication issues (See the IPv6 discussion earlier). But we never thought to test the User App Database itself. We were running on a two node cluster, with a third server running PostGres for the DB.

Dec 30, 2015 2:07:01 PM com.sun.jmx.mbeanserver.Repository contains
FINER: name = Catalina:type=Connector,port=8080


Above we see that it is starting to process the server.xml with the three Connector definitions for our three main listening ports. I snipped out 50 lines of it loading that port (and then the same for the next two ports, but you get the idea). The important line, that acts a landmark, looks like this, for the 8443 port.

2015-12-30 14:07:01,263 [main] INFO  org.apache.coyote.http11.Http11Protocol- Initializing ProtocolHandler ["http-bio-8443"]


This is what we see three times in the catalina.out, so we know we are in the right place, seeing a bit more info now. But will it help us?

Dec 30, 2015 2:07:02 PM com.sun.jmx.interceptor.DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.registered Catalina:type=Mapper,port=8009
2015-12-30 14:07:02,060 [main] INFO org.apache.catalina.startup.Catalina- Initialization processed in 2484 ms


Finally we get to the end and we see that there is nothing further after the last line, which is the same last line we saw before the tracing was enabled. Drat, that seems like a dead end.

In hindsight we should have seen it sooner, but did not. It was only when we tried restoring one node entirely with no success, and then tried reinstalling the second node from scratch that we hit a point where the User App installer tries to make a connection to the PostGres DB. That just sat there hanging.

Ok, test it via PgAdminIII and we found that we could start to open a connection (or at least the port was listening) but it was not responding. We tried to SSH into the server and could not. Finally we forced a reboot and when it came up, suddenly Tomcat started properly and all was well, our apps deployed and everything was good.

Thus it seems that it was because our DB was in a strange state of being able to start opening the connection and that Tomcat loads the datasource definitions early, and they never completed, so we got stuck waiting for one of them to complete with no messaging or hints.

I am not sure this is something NetIQ can fix in their code, as this is all Tomcat code, earlier than any component from NetIQ running.

Once we discovered this, one of the developers trying to help noted that WebLogic (Owned by Oracle now, formerly BEA, and thus being killed by Oracle) had a similar issue that it would hang if a datasource was not responding during startup. I do not recall JBoss having this issue, but it is hard to know, since it might be different in the half dead case, versus a totally dead case.

Regardless, I am hopeful that some of these steps help you when you encounter something similar and need to know where to look for possible issues.





DISCLAIMER:

Some content on Community Tips & Information pages is not officially supported by Micro Focus. Please refer to our Terms of Use for more detail.
Top Contributors
Version history
Revision #:
1 of 1
Last update:
‎2016-01-08 17:29
Updated by:
 
The opinions expressed above are the personal opinions of the authors, not of Micro Focus. By using this site, you accept the Terms of Use and Rules of Participation. Certain versions of content ("Material") accessible here may contain branding from Hewlett-Packard Company (now HP Inc.) and Hewlett Packard Enterprise Company. As of September 1, 2017, the Material is now offered by Micro Focus, a separately owned and operated company. Any reference to the HP and Hewlett Packard Enterprise/HPE marks is historical in nature, and the HP and Hewlett Packard Enterprise/HPE marks are the property of their respective owners.