DP 23.4 [61:2052] Bar backup session was started but no client connected in 600 seconds. Aborting session!

Hi all,

We have this setup for our DP 23.4 backups of our VMware VMs:

  • RHEL 7.9 DP Cell Manager VM
  • 2 x RHEL 7.9 physical media servers
  • 1 x HPE Storeonce D2D
  • physical servers are SAN-connected to the D2D

Backups had been working fine for over 4 weeks without issue: approx. 500 VMs to back up across approx. 100 jobs, spread over the whole night from 16:00 to 03:30, 7 days a week, 6 incrementals and 1 full. Great, DP working as designed. Then this week it happened twice: 99% of the nightly jobs failed with this error after successful VMware snapshots:

[61:2052] Bar backup session was started but no client connected in 600 seconds. Aborting session!

A couple of jobs worked; the rest all had this issue. The first time it happened we rebooted the D2D and both media servers, tested a backup, and it worked, and the next two nights of 100 jobs ran fine. OK, problem "solved". Then last night we again had a 99% failure rate with the 600 seconds error. So this second time round we rebooted only the D2D and changed the timeout to 30 minutes, as per the manual:

https://docs.microfocus.com/doc/Data_Protector/24.1/ZDBBarBackupSessionAborted

The re-run worked, but only after hanging for 18 minutes waiting to start the backup. Hmm, we thought, this hasn't fixed the issue; it's only allowed the job to hang for more than 600 seconds before starting. So we rebooted both RHEL physical media servers and re-ran the job, and instead of hanging for 18 minutes, it hung for about 2 seconds between the snapshot and the start of the backup! So there seems to be something wrong on the RHEL physical media servers which is hanging backups, but we don't know why.
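As a quick first check on the media servers, timing name resolution can rule out slow DNS, which is a common cause of clients being slow to connect back to the Session Manager. This is only a sketch of that check, not a confirmed cause; it assumes GNU date (standard on RHEL), and the hostname argument is a placeholder you would replace with your Cell Manager or media server name:

```shell
#!/bin/sh
# Sketch: time a single forward name lookup. Run it on each media server
# against the Cell Manager name (and vice versa); a lookup taking seconds
# rather than milliseconds points at DNS/network rather than DP itself.
HOST="${1:-localhost}"            # placeholder; pass your Cell Manager name

start=$(date +%s%N)               # nanoseconds since epoch (GNU date)
getent hosts "$HOST" > /dev/null  # resolve via the system's NSS order
end=$(date +%s%N)

elapsed_ms=$(( (end - start) / 1000000 ))
echo "resolution of $HOST took ${elapsed_ms} ms"
```

Running it in a loop during the backup window would also show whether resolution degrades only at certain times of night.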

Has anyone else seen this type of behaviour before? Any suggestions of what to look out for? Oh, and we can't send debug logs to OpenText; the customer does not allow it. So the only chance to track down the cause is to raise forum posts and trawl through 60,000 lines of debug output looking for the cause of the 18-minute hang.

Thanks,

Andy


    Oh, it's actually 50 jobs, not 100: 50 incremental jobs and 50 full jobs, as we have separate jobs for INCR and FULL. There are good reasons for this, honest!


    So, just to be sure we're getting the picture: it's a total of 500 VMs spread over 50 different barlists. Are these 50 barlists all started at the same time or not? Did you already try starting fewer at a time and building in some delay before starting others?

    I'm not saying there's no problem. Only trying to think about possible tests, both to work around it and to better understand what's happening. Besides that ... yeah, debugs may tell more. You'll have to find out who's waiting on what during these 18 minutes.


    Koen Verbelen

    Although I am an OpenText employee, I am speaking for myself and not for OpenText.
    If you found this post useful, give it a “Like” or click on "Verify Answer" under the "More" button.
    You may also be interested in my Data Protector Support Tips listed per category

  • Verified Answer


    I had the same issue, and in my case it was caused by slow network response. My first suggestion is to increase the global variable SmWaitForFirstBackupClient from its default value of 10 minutes (600 seconds) to a greater value. In my case, with a lot of VMs to be discovered, I had to increase it to 90 minutes (this is not normal, but it let the backups start successfully). Having said that, I suggest setting SmWaitForFirstBackupClient=30 and then letting me know the results. I need to know the last message you see before the timeout error, and the timeout error message itself.
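    On a Linux Cell Manager the global options file is normally /etc/opt/omni/server/options/global. The sketch below shows the edit on a scratch copy so nothing real is touched; the demo file contents are an assumption, the value is in minutes, and you would point GLOBAL at the real file (after backing it up) on your own system:

```shell
#!/bin/sh
# Sketch: set SmWaitForFirstBackupClient=30 in a DP-style global options
# file. Here we build and edit a scratch copy instead of the real file.
GLOBAL="${1:-/tmp/dp_global_demo}"

# Demo stand-in for the real global file (assumed excerpt).
cat > "$GLOBAL" <<'EOF'
# Data Protector global options (excerpt)
#SmWaitForFirstBackupClient=10
EOF

# Uncomment/replace the variable if present, otherwise append it.
if grep -q '^[#[:space:]]*SmWaitForFirstBackupClient' "$GLOBAL"; then
    sed -i 's/^[#[:space:]]*SmWaitForFirstBackupClient.*/SmWaitForFirstBackupClient=30/' "$GLOBAL"
else
    echo 'SmWaitForFirstBackupClient=30' >> "$GLOBAL"
fi

grep '^SmWaitForFirstBackupClient' "$GLOBAL"
```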

    Kind regards,

    Dionisio Sciascia

    Although I am an OpenText employee, I am speaking for myself and not for OpenText.
    If you found this post useful, give it a “Like” or click on "Verify Answer" under the "More" button.



    Hi Koen,

    Apologies for not responding, I've been away. So to confirm my rather garbled explanation:

    • We have 500 VMs to back up
    • VMs are split into different groups based on size, varying between 1 VM per job and as many as 20 VMs
    • We then have two backup jobs per group of VMs, 100 jobs in total
    • 50 of them are scheduled 6 days a week as INCR; we call these the "daily" jobs
    • On the 7th day of the week, the other 50 jobs are run as FULL; these are the "weekly" jobs
    • Weeklies are spread throughout the week, i.e. they are not all run on any one night. So on any one night we could have 40 INCRs and 10 FULLs, or 45 INCRs and 5 FULLs, or 35 INCRs and 15 FULLs, etc.
    • Each job, INCR or FULL, is started at a different time of night, spread between 17:00 and 04:00

    We'd taken the decision after the second occurrence to increase SmWaitForFirstBackupClient to 30 mins (SmWaitForFirstBackupClientSec=1800), and we've not had a problem since. This matches what Dionisio mentioned below. As above, we did also reboot the Linux media servers, and that dropped the "delay" down to seconds rather than 10+ minutes.

    We're hoping this wait time will help alleviate any future issues. Not knowing exactly why we needed to introduce the 30-minute delay is a little frustrating, though; it would be good to understand it for our own peace of mind. But knowing someone else had the same issue and identified a slow network helps; it's something we can get our network guys to check should it happen again.

    Thanks, Andy


    Please note that setting the variable SmWaitForFirstBackupClient to 30 mins does not introduce a delay; it simply tells the session to wait for up to 30 minutes before aborting if no client connects. If the backup starts before 30 minutes, nothing changes in backup behaviour.

    To better understand what is happening in your environment, compare the VM discovery time now (look for the "Adding VM ..." lines) with the VM discovery time you had in the past. In my situation discovery time changed from 3-4 seconds per VM to 40 seconds (10 times more), so I had to increase the timeout to avoid the session aborting before all VM discovery was completed. I then executed the same backup spec using a different backup host, and on the new backup host discovery time was again 3-4 seconds per VM, so my conclusion was that the issue was not related to vCenter but to something else between the backup host (included) and vCenter (excluded).
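    A rough way to make that comparison is to measure the gap between consecutive "Adding VM" lines in a session report. A minimal sketch on a demo log follows; the HH:MM:SS timestamp prefix and line layout are assumptions, so adjust the awk fields to your actual log format:

```shell
#!/bin/sh
# Sketch: print the per-VM discovery gap from "Adding VM" lines.
# The log below is a demo stand-in with an assumed format:
#   HH:MM:SS Adding VM <name>
LOG="${1:-/tmp/dp_session_demo.log}"

cat > "$LOG" <<'EOF'
22:00:01 Adding VM vm-app01
22:00:05 Adding VM vm-app02
22:00:45 Adding VM vm-db01
EOF

# For each "Adding VM" line, convert HH:MM:SS to seconds and report the
# time since the previous VM was discovered.
gaps=$(awk '/Adding VM/ {
    split($1, t, ":")
    now = t[1]*3600 + t[2]*60 + t[3]
    if (prev != "") printf "%s discovered %d s after previous VM\n", $4, now - prev
    prev = now
}' "$LOG")

echo "$gaps"
```

    On the demo data this reports a 4-second gap and then a 40-second gap; a sudden jump like that in a real session report is the symptom described above.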

    Another suggestion is to execute a backup preview in debug mode and then open a support case, attaching the debug files.

    I hope this helps with your troubleshooting.

    Kind regards,

    Dionisio Sciascia

    Although I am an OpenText employee, I am speaking for myself and not for OpenText.
    If you found this post useful, give it a “Like” or click on "Verify Answer" under the "More" button.
