Exchange 2016 backup keeps failing. BSM loses connection to the MA at the end of the job

Hi all. Is anyone able to unravel this mystery?

We are running DP version 10.30. All of the backup infrastructure involved in this issue, except for the Cell Manager, resides in Azure (the client being backed up and the MA/gateway are in Azure, backing up to Azure Blob storage).

What seems odd is that the incremental backup (backup of the Exchange logs) of this job completes fine. It is the FULL job that keeps failing, usually at the end of the backup session.

The error message indicates that the BSM running on the session manager loses its connection to the Media Agent (MA):

[Major] From: BSM@"client_fqdn" "Job_name" Time: 28/11/2021 06:28:28
[61:3003] Lost connection to BMA named "Gateway_servername" [GW 262200:0:3786551445010777510]" on host "MA".
Ipc subsystem reports: "IPC Read Error System error: [10054] Connection reset by peer"

Then followed by this error:

[Major] From: BSM@cell-manager_FQDN "Job_name" Time: 28/11/2021 06:28:28
[61:3003] Lost connection to BMA named "Gateway server name" [GW 262200:0:3786551445010777510]" on host MA.
Ipc subsystem reports: "IPC Read Error System error: [10054] Connection reset by peer"

[Major] From: OB2BAR_VSSBAR_COMP@client_FQDN "MS Exchange 2010+ Server" Time: 28/11/2021 06:28:30
Failed to update object relationship attribute table in the IDB for the object: '05 bclient_FQDN:/Microsoft Exchange Writer(Exchange Replication Service)/Microsoft Information Store/715586fb-8eec-4b3d-8ff6-b229291db172/File // MSVSSW-APP'

The errors below are seen in the event logs of the Exchange server node used for the backup (the passive copy resides on this server) at the exact time the above errors are logged in the backup session messages.

Errors found in the event logs:

1. AppLogEventID: 2034 AppLogEventMessage: The Microsoft Exchange Replication service VSS Writer (Instance 72e2b553-8d69-4149-8d2b-6302859407b0) failed with error FFFFFFFC when processing the backup completion event.

2. The Cluster Shared Volume VSS writer is also found in an Error state.

    However, the time of this writer failure does not correlate with the time of the backup failure.
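For anyone checking the same symptom: the writer states can be re-checked on the Exchange node straight after a failed session with the built-in VSS query below. Look at the State and Last error fields of the Exchange and Cluster Shared Volume writers (the exact writer names shown vary with the server's role).

    rem Run on the Exchange node hosting the passive copy, right after a failed session:
    vssadmin list writers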

What we have done so far, none of which made any difference:

1. Deployed the Media Agent component on the same client being backed up - this cuts out the network element, so the backup data flow is restricted to within the client itself.

2. Deployed another blob target using another, empty container - this removes the possibility of an issue with the original blob container.

3. Removed variables from the omnirc files:

    3.1 Removed all variables from the client omnirc file - the client now uses the defaults, the same as the other clients in the environment.

    3.2 Removed all other time-out variables from the MA omnirc file, leaving only the following in the file:

OB2SHUTDOWNFLAG=9
OB2SHUTDOWNTIMEOUT=20
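For anyone chasing similar reconnect behaviour, the timeout/reconnect tunables from the omnirc template are the usual suspects to rule out. A sketch of the relevant entries is below; the values are examples only, and the variable names should be verified against the omnirc.tmpl shipped with your DP version.

    # Example only - reconnect tunables from the omnirc template (verify names/defaults in omnirc.tmpl):
    OB2RECONNECT_RETRY=600     # seconds an agent keeps retrying a broken connection
    OB2RECONNECT_ACK=1200      # seconds to wait for an acknowledgment before giving up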

Of course, I have logged the issue with the MF engineering team, and they seem to be struggling with it as well.

Is this perhaps a bug in DP version 10.30?

  • Anyone with ideas on progressing this?

  • Suggested Answer

    Did you execute the backup in debug mode in order to get more information on the issue? I do not remember if the Exchange backup specification has the "Reconnect broken connections" flag. Try to set that flag as well to see if it helps.
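    If you have not already, debugging can also be switched on host-wide through omnirc rather than per session. A rough sketch, assuming the OB2DBG/OB2DBGDIR variables from the omnirc template (the range, postfix and directory here are illustrative; confirm the exact syntax in the omnirc.tmpl for 10.30):

        # Example omnirc entries on the client and the MA (illustrative values):
        OB2DBG=1-200 exch_fail
        OB2DBGDIR=C:\Temp\dp_debug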

    Kind regards,

  • Yes, I did run the backup in debug mode. The debug files were sent to MF support, and their feedback was that nothing in the debug logs suggests the DP software is the culprit.

    As for the "Reconnect broken connections" flag, I enabled it on the job. The job session indeed showed that the session reconnected, but nothing happened after that - the job did not progress from there. In one instance I left the reconnected session running overnight and it did not progress the job. I ended up aborting it.

    Now, I have discovered something interesting in my further troubleshooting.

    I configured a file library. The file system path used as the repository for the file library is on the client itself, and I deployed the DP Media Agent component on that same client. I set the file depot size to the maximum size of the file system (512 GB), just for this test. The backup ran and, to my surprise, when the size of the media file being created reached 460GB, the job attempted to create another file depot file. By then there was not enough free space left in the file system, so it threw a mount request.
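    For anyone wanting to reproduce this, a crude watch loop like the one below can log the depot file sizes during the session and pin down exactly where the job stops (D:\FileLib and the log path are placeholders; point them at your own file library repository):

        @echo off
        rem Crude watcher - logs the size of every depot file every 5 minutes.
        rem D:\FileLib is a placeholder; use the path configured for the file library.
        :loop
        for %%F in (D:\FileLib\*) do echo %date% %time% %%~nxF %%~zF bytes >> C:\Temp\depot_growth.log
        timeout /t 300 /nobreak >nul
        goto loop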

    The key discovery here is that all of my jobs (at least 5 different job sessions) that I had been running previously, and that had been failing, failed when the backup data set (media) size reached 460GB! At that point the BSM disconnects from the BMA and the job fails. That is where the perception of a network disconnect came from, but to me that is not a network disconnect.

    MF Support is looking at this, and the initial response from the CSE is that there is no known soft limit in DP that can produce this behaviour. My view is that it is a bug which the dev team needs to recreate in the lab in order to resolve.

    I have checked the variables in the global options file on the CM and the omnirc files on the client and the MA server; nothing relates to this.
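    For reference, a quick way to rule out anything size-related in the global options file is to search it directly on the CM (the path below is the usual Windows default for DP 10.x; adjust if DP is installed elsewhere):

        rem Search the Cell Manager global options file for size/limit related settings:
        findstr /i /n "limit size block segment" "C:\ProgramData\OmniBack\Config\Server\Options\global"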

    I can only sit back now and hope for favorable feedback from the MF dev team!