Hi all. Is anyone able to unravel this mystery?
We are running DP version 10.30. Except for the cell manager, all DP backup infrastructure involved in this issue resides in Azure (the client being backed up and the MA/gateway are in Azure, backing up to Azure blob storage).
What seems odd is that the incremental backup (backup of Exchange logs) of this job completes fine. It is the FULL job that keeps failing, usually at the end of the backup session.
The error message indicates that the BSM running on the session manager loses its connection to the media agent (MA):
[Major] From: BSM@"client_fqdn" "Job_name" Time: 28/11/2021 06:28:28
[61:3003] Lost connection to BMA named "Gateway_servername" [GW 262200:0:3786551445010777510]" on host "MA".
Ipc subsystem reports: "IPC Read Error
System error: [10054] Connection reset by peer

Then followed by this error:

[Major] From: BSM@cell-manager_FQDN "Job_name" Time: 28/11/2021 06:28:28
[61:3003] Lost connection to BMA named "Gateway server name" [GW 262200:0:3786551445010777510]" on host MA.
Ipc subsystem reports: "IPC Read Error
System error: [10054] Connection reset by peer "

[Major] From: OB2BAR_VSSBAR_COMP@client_FQDN "MS Exchange 2010+ Server" Time: 28/11/2021 06:28:30
Failed to update object relationship attribute table in the IDB for the object:
'05 bclient_FQDN:/Microsoft Exchange Writer(Exchange Replication Service)/Microsoft Information Store/715586fb-8eec-4b3d-8ff6-b229291db172/File // MSVSSW-APP'

The following errors are seen in the event logs of the Exchange server node used for the backup (the passive copy resides on this server), at the exact time the above error is logged in the backup session messages.
Other errors found in event logs
1. AppLogEventID: 2034 AppLogEventMessage: The Microsoft Exchange Replication service VSS Writer (Instance 72e2b553-8d69-4149-8d2b-6302859407b0) failed with error FFFFFFFC when processing the backup completion event.
2. The Cluster Shared Volume VSS writer is also found in Error state.
But the time of this failure does not correlate with the time of the backup failure.
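To see which VSS writers are in a non-stable state on the Exchange node, the output of `vssadmin list writers` can be filtered with a short script. The following is only a minimal Python sketch: the function name `unstable_writers` and the sample text are made up for illustration (they are not output from our server), and it assumes the usual `Writer name` / `State` / `Last error` layout of the English vssadmin output.

```python
import re


def unstable_writers(vssadmin_output):
    """Return (name, state, last_error) for each writer whose state is not Stable."""
    results = []
    name = state = None
    for raw in vssadmin_output.splitlines():
        line = raw.strip()
        m = re.match(r"Writer name: '(.*)'$", line)
        if m:
            # A new writer block starts here.
            name, state = m.group(1), None
            continue
        m = re.match(r"State: \[\d+\] (.*)$", line)
        if m:
            state = m.group(1)
            continue
        m = re.match(r"Last error: (.*)$", line)
        if m and name is not None and state is not None:
            if state != "Stable":
                results.append((name, state, m.group(1)))
            name = state = None
    return results


# Hypothetical sample text, NOT taken from the affected server.
SAMPLE = """\
Writer name: 'Example Writer A'
   Writer Id: {00000000-0000-0000-0000-000000000001}
   State: [1] Stable
   Last error: No error

Writer name: 'Example Writer B'
   Writer Id: {00000000-0000-0000-0000-000000000002}
   State: [8] Failed
   Last error: Retryable error
"""

for writer, writer_state, last_error in unstable_writers(SAMPLE):
    print(writer, writer_state, last_error)
```

On the real node, the text would come from running `vssadmin list writers` in an elevated prompt right after a failed FULL session and passing the captured output to the function.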
What we have tried so far, none of which made any difference:
1. Deployed the media agent component to the same client that is being backed up. This cuts out the network element, so the backup data flow is restricted to within the client.
2. Deployed another blob target using another empty container. This removes the possibility of an issue with the original blob container.
3. Removed variables in the client and MA omnirc files.
3.1 Removed all variables in the client omnirc file. The client will use default variables, same as other clients in the environment.
3.2 Removed all other time-out variables in the MA omnirc file. Left only the following variables in the file:
OB2SHUTDOWNFLAG=9
OB2SHUTDOWNTIMEOUT=20
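To double-check that only these two variables are still active in the MA omnirc file, something along the following lines could be used. This is a rough sketch, assuming the file consists of `NAME=value` lines and that lines starting with `#` are comments. The function name `active_omnirc_variables` and the commented-out `EXAMPLE_TIMEOUT_VARIABLE` are hypothetical and only there for illustration.

```python
def active_omnirc_variables(text):
    """Return a dict of the uncommented NAME=value entries in omnirc-style text."""
    variables = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            variables[name.strip()] = value.strip()
    return variables


# Sample content: the two variables left in the MA file, plus a
# hypothetical commented-out entry that must be ignored.
SAMPLE_OMNIRC = """\
# EXAMPLE_TIMEOUT_VARIABLE=30
OB2SHUTDOWNFLAG=9
OB2SHUTDOWNTIMEOUT=20
"""

print(active_omnirc_variables(SAMPLE_OMNIRC))
```

Running the same function against the client omnirc file should then return an empty dict if all variables were really removed there.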
Of course, I have logged the issue with the MF engineering team, and they seem to be struggling as well.
Is this perhaps a bug in DP version 10.30?