Users unable to log in to ALM, error: "ALM failed to retrieve authentication data"
Wondering if anyone came across this before and can shed some light on what may have caused it:
Yesterday, all users who attempted to log in to ALM were met with the error below:
“ALM failed to retrieve authentication data”
We have two Windows ALM servers (12.21 patch 7) in high availability with LDAP authentication in place. Users get the same error when they hit either node.
Thankfully, after restarting the ALM service on both ALM servers, users were able to access ALM again. I would still like to understand what happened here.
QC logs mention:
Invalid request: Remote host: xx.xx.xx.xx, Meta Data: [Function Name: Logout, Login Session ID: 21510287, Project Session ID: 10038317, Call ID: 45]. Error: Failed to obtain a connection to schema 'qcsiteadmin_xxxxxxx' - timeout expired.
Any insight appreciated
The issue reoccurred this morning. Users were unable to log in. Both application server nodes returned the same error message when users attempted to log in:
"ALM failed to retrieve authentication data"
In the SA logs the following error is thrown.
"Exception thrown when executing the job I am alive
Failed to set application server's last touch time.; Failed to obtain a connection to schema 'qcsiteadmin_db_tcoe_ALM1220' - timeout expired;"
We were able to resolve it by restarting the ALM service on both application server nodes, but it is very concerning that this is happening randomly in our Production environment.
Our DBAs checked our database server at the time of the issue and did not find anything of concern.
It looks like the database server, repository server and application servers are not being rebooted in the correct order (one after another) after weekend maintenance, which is why you are facing this issue. The issue occurs because the DB server might not be fully up at the time the application service needs it to be available.
Ideally, your database server and application server should not be restarted at the same time.
We have to ensure that the application server restarts only after both the database server and the repository server have been restarted and are available (up and online).
What you can do is schedule the database server patching maintenance a week before the application server patching. This way the two servers will be on different weekly schedules and the outage can be avoided.
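As a rough illustration of the "start the application server only after the database is up" idea, here is a minimal Python sketch that waits until a TCP port on the database host accepts connections. The host name and port in the comment are hypothetical placeholders (1433 is the SQL Server default, 1521 the Oracle default), and nothing here is part of ALM itself.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=300.0, interval_s=5.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before timeout_s elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect only proves the listener is up.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Hypothetical usage in a startup script (host and port are assumptions):
# if wait_for_port("db-server.example.local", 1433):
#     ...start the ALM service here...
```

Note that an open port does not guarantee that the site admin schema is already usable, so a startup script might still want a short extra delay or a real test query after this check.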
I checked the server boot times as you suggested.
DB Server: 06/21/2020 7PM
App Server 1: 06/20/2020 10PM
App Server 2: 06/20/2020 10:30PM
Repo is on a NAS share. Server was not rebooted.
However, this issue occurred on Friday June 19th, before any servers were patched and rebooted.
So the issue seems to be that on random occasions the application loses its connection to the qcsiteadmin_db.
Our systems support and DBAs would definitely notify us if there was an issue with the DB server losing network connectivity or being unavailable.
We use Dynatrace monitoring on all of our servers and I cannot see anything in there this morning when the latest instance of this issue occurred.
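The boot-time comparison above can be written down as a small Python check. The boot times are the ones listed in this post; the time of day of the first incident was not recorded here, so the end of Friday June 19th is used as an assumption.

```python
from datetime import datetime

# Last boot times as listed above (24-hour clock).
boot_times = {
    "DB Server": datetime(2020, 6, 21, 19, 0),
    "App Server 1": datetime(2020, 6, 20, 22, 0),
    "App Server 2": datetime(2020, 6, 20, 22, 30),
}

# Assumption: exact time unknown, so take the latest possible time on June 19th.
first_incident = datetime(2020, 6, 19, 23, 59)


def rebooted_before(incident, boots):
    """Return the servers whose last boot happened at or before the incident."""
    return [name for name, booted in boots.items() if booted <= incident]


# An empty list means every listed reboot happened after the first incident.
print(rebooted_before(first_incident, boot_times))
```

With these values the list is empty, which matches the point that the first occurrence cannot be explained by the weekend reboot order.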
Yes, true. From the error log it seems to be a random issue with the connection to the SA schema on your DB server.
Please check the stability of the DB server's network connection. Another thing to check is whether your DB server is undergoing maintenance work that affects the availability of that specific schema.
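One way to check network stability from the application server side is to probe the database port at regular intervals and record any failures. The following is only a sketch in Python: the host and port in the comment are hypothetical, and it tests plain TCP reachability, not the ALM connection pool or the schema itself.

```python
import socket
import time
from datetime import datetime


def probe_once(host, port, timeout_s=3.0):
    """Return the TCP connect time in seconds, or None if the connect failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        return None


def monitor(host, port, samples, interval_s=10.0):
    """Probe host:port `samples` times and return (timestamp, latency) pairs."""
    results = []
    for i in range(samples):
        latency = probe_once(host, port)
        results.append((datetime.now(), latency))
        if latency is None:
            print(f"{results[-1][0].isoformat()} connect to {host}:{port} FAILED")
        if i < samples - 1:
            time.sleep(interval_s)
    return results


# Hypothetical usage (host and port are assumptions):
# monitor("db-server.example.local", 1433, samples=360, interval_s=10.0)
```

If the failures logged by such a probe line up with the times users saw the login error, that points at the network or the DB listener. If the probe stays clean during an incident, the problem is more likely inside the application or database layer.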