How to troubleshoot NAM SSL Handshake errors with origin web server
Troubleshooting NAM SSL Handshake errors
Recently there has been a lot of hoopla over security vulnerabilities such as POODLE, Logjam and others. To help mitigate or resolve some of these issues, stricter implementations of the SSL/TLS protocols and ciphers may have to be used. In a NetIQ Access Manager (NAM) environment, this can present a few interesting challenges, not only for the NAM product components themselves, but also for the origin web servers involved. I shall attempt to demonstrate a recent issue we encountered, how we determined the culprit, and how we resolved it.
In our scenario, we have the NAM 4.0.1 product in place. The Admin Console, IDS and MAG appliance are all separate machines in VMware. We have multiple clusters and setups, including a test lab setup. We front-end almost all of our web servers (internally) with NAM. All of our externally available web servers are front-ended with NAM. If you don't authenticate successfully, no soup for you!
SSL Handshake Error - known cause
We have an old Filr 1.1 setup. In preparation for upgrading to Filr 1.2, we had to perform many upgrades, the first of which was to apply the first security patch (since this was an original-release 1.1 setup). So I powered off the test VM, snapshotted it, powered it back on and applied the security patch. I then decided to test access via NAM, whereupon I received a rather ugly error message:
OK, the above is fairly obvious: the SSL Handshake failed. Now, in *this* particular case, I happened to know that the security patch for Filr 1.1 was specifically meant to address POODLE. I also happened to know that POODLE fixes rely on stronger protocols.
So I went digging and opened an SR and was pointed to the NetIQ docs which gave instructions on how to enable TLS 1.1/1.2 in NAM. The doc link here:
The instructions were rather easy to follow, so we installed the package, the above error went away, and all was well.
That is, until I got a phone call the next day about *one* particular web server.
Scenario - Unknown SSL Handshake error
Unfortunately we have an application that is not front-ended internally with NAM. However it is front-ended by NAM externally. And unfortunately we do not have an "external" test lab. And not many people use said application either. After applying the above openssl package to our NAM servers, the next day we got a call from one user reporting the following error:
Unfortunately that's not very helpful. My initial assumption was that it had something to do with the TLS/POODLE fixes, perhaps on the origin web server. But if the POODLE fixes were on the origin web server, we should've gotten this error prior to enabling the NAM TLS support, and that was not the case. Oddly, if we access the web server through NAM but go to the "root" URL of the web server instead of the .jsp page, we get the following error:
Unfortunately we don't maintain that web server, so it was impossible for us to find out if the POODLE fix or any other fixes were applied to that server. Now the hunt begins.
Part of this was due to the documentation not being clear. In the doc linked above, the NAM docs just state the following about installing the package:
"After the new package is installed, the Access Gateway can accept connections from clients by using any SSL or TLS versions ranging from SSL 2.0, SSL 3.0, TLS 1.0, TLS 1.1, to TLS 1.2."
Apparently what the doc writers meant was:
"After the new package is installed, the Access Gateway can accept connections from Web Browser clients by using any SSL or TLS versions ranging from SSL 2.0, SSL 3.0, TLS 1.0, TLS 1.1, to TLS 1.2. "
Origin servers, where NAM itself acts as the client, are a different story. But I had not yet discovered that.
So the next thing is to take a look at Google.
The closest I could find was this TID:
(Note that the above TID may since have been corrected.) At the time of the issue, the TID stated that TLS 1.2 was set to be used for *all* connections. But wait: the docs didn't mention this would happen, nor did I configure anything to use ONLY TLS 1.2. The docs just said that NAM could *accept* connections, nothing about restricting them all to TLS 1.2. However, the TID is useful because it shows the pertinent error log to look at.
"Users accessing the problem secure Web server would get 502 errors, and the error_log file on the AG would report the following:
[error] (502)Unknown error 502: proxy: pass request body failed to 10.175.121.57:443 (10.175.121.57)
AMEVENTID#8: proxy: Error during SSL Handshake with remote server returned by
So I looked at my MAG appliance for the:
Sure enough, I saw the error:
Jun 18 11:34:23 nam-lag-sles11 httpd: [error] (502)Unknown error 502: proxy: pass request body failed to 10.10.104.102:443 (10.10.104.102)
Jun 18 11:34:23 nam-lag-sles11 httpd: [error] AMEVENTID#19355: proxy: Error during SSL Handshake with remote server returned by /OA_HTML/AppsLocalLogin.jsp, referer: https://nam-idp.something.com/nidp/idff/sso?sid=0&sid=0
Jun 18 11:34:23 nam-lag-sles11 httpd: [error] AMEVENTID#19355: proxy: pass request body failed to 10.10.104.102:443 (10.10.104.102) from 10.10.108.209 (), referer: https://nam-idp.something.com/nidp/idff/sso?sid=0&sid=0
OK, so this pretty much narrows things down to a protocol (SSL/TLS) issue, a cipher issue, or both.
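When the error_log is busy, it can help to pull out just the handshake failures and the origin servers they point to. The following is a minimal sketch in Python, assuming the log lines look like the samples quoted above; the function name and regular expressions are my own and are not part of NAM.

```python
import re

# Matches the error_log line that reports a failed back-end handshake.
# The pattern is based only on the sample lines quoted in this article.
HANDSHAKE_RE = re.compile(
    r"Error during SSL Handshake with remote server returned by (?P<path>[^,\s]+)"
)
# Matches the companion line that names the origin server address and port.
BACKEND_RE = re.compile(
    r"pass request body failed to (?P<host>[\d.]+):(?P<port>\d+)"
)


def summarize_handshake_failures(lines):
    """Return (paths, backends) found in handshake-failure log lines."""
    paths, backends = set(), set()
    for line in lines:
        match = HANDSHAKE_RE.search(line)
        if match:
            paths.add(match.group("path"))
        match = BACKEND_RE.search(line)
        if match:
            backends.add((match.group("host"), int(match.group("port"))))
    return sorted(paths), sorted(backends)


# The three log lines shown above, copied verbatim.
SAMPLE = [
    "Jun 18 11:34:23 nam-lag-sles11 httpd: [error] (502)Unknown error 502: "
    "proxy: pass request body failed to 10.10.104.102:443 (10.10.104.102)",
    "Jun 18 11:34:23 nam-lag-sles11 httpd: [error] AMEVENTID#19355: proxy: "
    "Error during SSL Handshake with remote server returned by "
    "/OA_HTML/AppsLocalLogin.jsp, referer: "
    "https://nam-idp.something.com/nidp/idff/sso?sid=0&sid=0",
    "Jun 18 11:34:23 nam-lag-sles11 httpd: [error] AMEVENTID#19355: proxy: "
    "pass request body failed to 10.10.104.102:443 (10.10.104.102) from "
    "10.10.108.209 (), referer: "
    "https://nam-idp.something.com/nidp/idff/sso?sid=0&sid=0",
]

if __name__ == "__main__":
    paths, backends = summarize_handshake_failures(SAMPLE)
    print("failing paths:", paths)
    print("origin servers:", backends)
```

To run it against a real log, pass an open file object instead of SAMPLE; the function only needs something it can iterate over line by line.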
Well, the error log shows something interesting as well:
Jun 18 11:34:23 nam-lag-sles11 httpd: [info] AMEVENTID#19359: sending request to webserver
Jun 18 11:34:23 nam-lag-sles11 httpd: [debug] ssl_engine_kernel.c(1872): OpenSSL: Handshake: start
Jun 18 11:34:23 nam-lag-sles11 httpd: [debug] ssl_engine_kernel.c(1880): OpenSSL: Loop: before/connect initialization
Jun 18 11:34:23 nam-lag-sles11 httpd: [debug] ssl_engine_kernel.c(1880): OpenSSL: Loop: SSLv2/v3 write client hello A
Jun 18 11:34:23 nam-lag-sles11 httpd: [debug] ssl_engine_io.c(1907): OpenSSL: read 7/7 bytes from BIO#7faac3764980 [mem: 7faac3c67450] (BIO dump follows)
Jun 18 11:34:23 nam-lag-sles11 httpd: [debug] ssl_engine_io.c(1840): +-------------------------------------------------------------------------+
Jun 18 11:34:23 nam-lag-sles11 httpd: [debug] ssl_engine_io.c(1879): | 0000: 15 03 01 00 02 02 ...... |
Jun 18 11:34:23 nam-lag-sles11 httpd: [debug] ssl_engine_io.c(1883): | 0007 - <SPACES/NULS>
Jun 18 11:34:23 nam-lag-sles11 httpd: [debug] ssl_engine_io.c(1885): +-------------------------------------------------------------------------+
Jun 18 11:34:23 nam-lag-sles11 httpd: [debug] ssl_engine_kernel.c(1885): OpenSSL: Read: SSLv2/v3 read server hello A
Jun 18 11:34:23 nam-lag-sles11 httpd: [debug] ssl_engine_kernel.c(1909): OpenSSL: Exit: error in SSLv2/v3 read server hello A
Jun 18 11:34:23 nam-lag-sles11 httpd: [info] SSL Proxy connect failed
Jun 18 11:34:23 nam-lag-sles11 httpd: [info] SSL Library Error: 336032744 error:140773E8:SSL routines:SSL23_GET_SERVER_HELLO:reason(1000)
Jun 18 11:34:23 nam-lag-sles11 httpd: [info] Connection closed to child 0 with abortive shutdown (server badserver.something.com:443)
Oh, OK. I see that NAM sends its Client Hello and the handshake fails before a Server Hello ever comes back. I am now putting 2 and 2 together.
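The BIO dump and the "SSL Library Error" line in the log above can be decoded a little further. Here is a rough sketch, assuming the standard TLS record header layout (type, version, length) and the OpenSSL 1.0.x packing of error codes (8 bits library, 12 bits function, 12 bits reason). The helper names are hypothetical. Note that the dump as shown contains only six of the seven bytes read, so the sketch stops at the alert level and does not guess the last byte.

```python
# Record-layer content types and versions, per the TLS specifications.
CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
}
VERSIONS = {(3, 0): "SSL 3.0", (3, 1): "TLS 1.0", (3, 2): "TLS 1.1", (3, 3): "TLS 1.2"}
ALERT_LEVELS = {1: "warning", 2: "fatal"}


def decode_record_prefix(hex_dump):
    """Decode the 5-byte record header plus the alert level, if present."""
    data = bytes.fromhex(hex_dump)
    info = {
        "type": CONTENT_TYPES.get(data[0], "unknown"),
        "version": VERSIONS.get((data[1], data[2]), "unknown"),
        "length": int.from_bytes(data[3:5], "big"),
    }
    if info["type"] == "alert" and len(data) > 5:
        info["alert_level"] = ALERT_LEVELS.get(data[5], "unknown")
    return info


def split_openssl_error(code):
    """Split an OpenSSL 1.0.x packed error code into (lib, func, reason)."""
    return (code >> 24) & 0xFF, (code >> 12) & 0xFFF, code & 0xFFF


if __name__ == "__main__":
    # Bytes copied from the BIO dump above (the seventh byte is not shown there).
    print(decode_record_prefix("15 03 01 00 02 02"))
    # Decimal code copied from the "SSL Library Error" line above.
    print(split_openssl_error(336032744))
```

The dump decodes as a fatal alert record, and the error code splits into library 20, function 119, reason 1000, which matches the hex form 140773E8 printed in the same log line. As far as I know, OpenSSL reports alerts received from the peer as 1000 plus the alert number, so reason(1000) would correspond to alert 0, close_notify. That is consistent with a server that simply hangs up on the Client Hello.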
The TID mentioned that:
Tests were done adding a few SSL advanced options, but to no avail. These options included:
- SSLProxyCheckPeerCN off
- SSLProxyProtocol +SSLv2 +SSLv3 +TLSv1 +TLSv1.1
- SSLProxyVerify none
So I'm leaning towards an SSL protocol issue. So now the question is:
Since we don't have access to the server, can't administer it, and don't know what patches are on it, how do we determine what protocol/cipher is used by the origin web server? I'm pretty certain at this stage that the origin web server is NOT forcing the use of TLS 1.2. If it were, then NAM should work (see my Filr example above).
So let's fire up Wireshark and see what a regular web browser going direct to the origin server (bypassing NAM) is able to achieve. (IP addresses obscured):
Look at packet #4 where the first TLSv1 Client Hello is seen.
Now let's look at the packet details:
OK, we see TLS 1.2, but notice in the first trace screenshot that we're not exactly successful. In packet #6, we get a Close Notify message. (No soup for you!)
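A side note on why the packet list can say "TLSv1" while the packet details say TLS 1.2: a Client Hello carries two version fields, one in the record header and one inside the hello itself. One plausible reading of the trace is that the browser sent a TLS 1.0 record version with a TLS 1.2 hello version, which clients commonly do for compatibility. The sketch below reads both fields. The sample bytes are hypothetical (a hand-built, truncated prefix, not bytes from our trace) and the function name is my own.

```python
VERSIONS = {(3, 0): "SSL 3.0", (3, 1): "TLS 1.0", (3, 2): "TLS 1.1", (3, 3): "TLS 1.2"}


def client_hello_versions(record):
    """Return (record_layer_version, client_hello_version) for a Client Hello.

    Layout assumed (standard TLS):
      byte 0      content type (0x16 = handshake)
      bytes 1-2   record-layer version
      bytes 3-4   record length
      byte 5      handshake type (0x01 = client_hello)
      bytes 6-8   handshake length
      bytes 9-10  client_version offered in the hello
    """
    if len(record) < 11 or record[0] != 0x16 or record[5] != 0x01:
        raise ValueError("not a TLS Client Hello record")
    record_version = VERSIONS.get((record[1], record[2]), "unknown")
    hello_version = VERSIONS.get((record[9], record[10]), "unknown")
    return record_version, hello_version


# Hypothetical, truncated example: TLS 1.0 record header, TLS 1.2 hello.
HYPOTHETICAL_HELLO = bytes.fromhex("16 03 01 00 06 01 00 00 02 03 03")

if __name__ == "__main__":
    print(client_hello_versions(HYPOTHETICAL_HELLO))
```

When comparing traces, the hello version is the one that tells you what the client is really asking for.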
But if we continue further in the trace file, we see that the origin server and client are "stepping" down the Protocol used:
OK, the handshake is now down to TLS 1.1
But we're still getting rejected. (Close/notify packet messages are still seen). Let's continue to work our way down to the next TLSv1 Protocol packet, which is packet #23.
We go down to the next "Hello" packet #23
OK, now we've stepped down to TLS 1.0
The next set of packets #24-29 show good stuff now:
We get a Server Hello this time, followed by the Certificate. Notice that we have now successfully done a Key Exchange in packet #29.
OK, so that tells us some interesting things:
- A Web Browser (Firefox 31.2.0 ESR in this case) tries to use TLS 1.2 to talk to the origin web server
- The origin web server apparently can't handle this and responds with Close Notify alerts (failing the handshake); the browser then steps down to lower protocol versions until it finally succeeds with TLS 1.0
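The browser's step-down can be imitated without Wireshark. Below is a minimal sketch using Python's standard ssl module: it pins one protocol version per attempt and walks from TLS 1.2 down to TLS 1.0. The function names are my own, and certificate checking is switched off because this is only a diagnostic probe. One caveat: newer OpenSSL builds may refuse TLS 1.0/1.1 locally, so a failure can reflect the probing machine's policy and not the server.

```python
import socket
import ssl

# Highest to lowest, mirroring the step-down seen in the browser trace.
FALLBACK_ORDER = [
    ssl.TLSVersion.TLSv1_2,
    ssl.TLSVersion.TLSv1_1,
    ssl.TLSVersion.TLSv1,
]


def try_handshake(host, port, version, timeout=5.0):
    """Return True if a handshake pinned to exactly `version` completes."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Diagnostic probe only: certificate checks are disabled on purpose.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # Pin both bounds so only this one protocol version is offered.
    ctx.minimum_version = version
    ctx.maximum_version = version
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except (ssl.SSLError, OSError):
        return False


def first_accepted_version(attempt, order=FALLBACK_ORDER):
    """Walk `order` and return the first version for which attempt(v) is True."""
    for version in order:
        if attempt(version):
            return version
    return None
```

A call such as `first_accepted_version(lambda v: try_handshake("badserver.something.com", 443, v))` would then report the highest version the origin server accepts from that machine, or None if every attempt fails. Keeping the walk separate from the network call makes the logic easy to check with a stand-in function.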
What's interesting is that if we look at a LAN trace with NAM (with the openssl package installed), we only ever see NAM try to use TLS 1.2. It never tries to step down.
See our first TLS Client Hello
The details of that packet:
2 packets later, we get a close Notify. So far, just like the web browser.
Now let's go down to the NEXT TLS Client Hello packet:
Wait a minute, it's still trying TLS 1.2. Unlike the web browser, NAM does not appear to be negotiating lower protocols. So that's why we're failing the SSL Handshake: NAM insists on using a protocol that the origin web server does not support.
In the documentation link (not the TID), we find something interesting in section 5.0:
"After installing Access Manager 4.0 Hotfix 3 or 4.0 SP1, if you have enabled SSL communication between the Access Gateway and the Web server, the Access Gateway uses the highest version of the TLS that the Web server supports. For example, if you have configured the Web server to use TLS 1.1 or TLS 1.2, the Access Gateway sends requests to the Web server by using the specified TLS version."
However, this appears to be either incorrect documentation or a software bug. Remember, we're using NAM 4.0.1 HF3, which is not the most recent version. Unfortunately the documentation doesn't tell you how to go about changing/negotiating the protocol list, only how to enable SSL (i.e., use "this" certificate and use port 443), at least in the section that the "Enabling TLS support" document points us to.
But the TID gives us some insight: it indicates there is an Advanced Options parameter, SSLProxyProtocol, that can be used:
SSLProxyProtocol +SSLv2 +SSLv3 +TLSv1 +TLSv1.1
Now, since we know from our web browser LAN trace that we can negotiate with TLS 1.0, I decided to try that.
In the NAM Admin console, I found the reverse proxy for the affected origin web server (remember my Filr setup is working OK, so that means I do NOT want to do this at the Global Advanced Options level).
I click on the name of the reverse proxy:
Click the Advanced Options item.
And I insert the following line (below the first line):
SSLProxyProtocol +SSLv2 +SSLv3 +TLSv1
Save the changes, and apply them.
Re-test and we have success!!
Clearer documentation/TIDs might have helped prevent this issue in the first place, as would more thorough testing on our part. However, if you have a specific origin web server that is having issues after implementing the openssl packages, you can use a LAN trace with a web browser to figure out what protocol is being successfully negotiated and then adjust the NAM settings accordingly. I suppose we could have tried random protocol settings in the NAM Advanced Options, but that would have been more time-consuming, and I learned more this way.