DP9: Hunting down intermittent error during replication

Every so often we get the following error.

[Major] From: CMA@s928-dbsapep1.ssmhc.com "IHT Catalyst_gw1 [GW 5748:4:2793244080635848377]" Time: 2/20/2017 10:24:05 PM

[90:54] \\COFC-2M235002N701\IHT_Store/a5688927_58abad47_0339_a07f

Cannot open device (StoreOnce error: StoreOnce device offline, network error occurred or secure communication failed while contacting the StoreOnce device)

Any ideas on how to hunt down the real cause? Does it mean anything was missed? Here is the end of the session message:

 Replication statistics:
   
  Session Queuing Time (hours)         0.00 
  ----------------------------------------
  Completed Objects ............       5   
  Failed Objects ...............       9   
  Aborted Objects ..............       0   
  ----------------------------------------
  Objects Total  ...............      14   
  ========================================
  Completed Media Agents .......      12   
  Failed Media Agents ..........       0   
  Aborted Media Agents .........       3   
  ----------------------------------------
  Media Agents Total  ..........      15   
  ===========================================
  Throttled Gateways ...........       0   
  Gateways Total ...............       1   
  ========================================
  Overall Deduplication Ratio .. 202.6 : 1    
  ===========================================
  Mbytes Total ................. 1309613 MB
  Used Media Total .............      10  

  • Similar but [90:51].  Intermittent when it occurs.

    [Major] From: BMA@CellManager "Storeonce_Win_gw1 [GW 23964:2:410716141146769696]" Time: 2/23/2017 7:42:07 PM
    [90:51] \\StoreOnce.IP\Win_Cat\8e9913ac_58af65f6_3a58_4189
    Cannot write to device (StoreOnce error: StoreOnce device offline, network error occurred or secure communication failed while contacting the StoreOnce device)

    Which version of DP are you using? I have really begun to notice these errors since 9.08, now 9.08_113. It seems to occur most often on a filesystem backup that is drawing off of an RDM volume.

    Didn't you upgrade your StoreOnce devices recently too? 

     


  • In reply to the above:


    The GUI is 9.07. The CM is:

    (root@sapbck)# omnicc -version

    HPE Data Protector A.09.08: OMNICC, internal build 112, built on Tue Oct 11 16:22:46 2016

    Yes, and updated the StoreOnce to 3.15.1-1636.1

    Any thoughts?

  • On the filesystem backup where I saw the issue, I got 4 of the errors in January (this job runs INCR 3x a day during the week, with a Full on Saturday); then in February, after I installed 9.08_118, I got 12 jobs with errors.

    Last week I turned on Reconnect Broken Connections on all my filesystem backup specs and started reducing the number of Max devices assigned in the Destination tab of the backup specs, on the theory that the network load on the StoreOnce may be causing the issue. I haven't seen the issue recur yet, but I'm keeping an eye on it. I've also been reducing the devices on my VM backups that run during those times.

    I haven't found a way to reproduce the issue or discerned a pattern yet as to when it will occur. I hate to turn on debugging and let it run until I catch something, because that's a whole lot of logs.

    If you do run debugging for an extended time, use the omnirc file and redirect the logs to a drive that won't hose up your OS or applications if it fills up.

     

    # OB2DBGDIR=<pathname>
    # Default: none
    # This variable is used to change the location of debug files on a per
    # system basis. You have to specify a fully qualified path of an existing
    # directory. This variable has precedence over the paths specified by the
    # postfix parameter.
    # By default, this variable is not set. If this variable is not set, the
    # pathname is set as /tmp (UNIX) or <Data_Protector_home>\tmp (Windows).
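
    The omnirc change above can be scripted; here is a minimal sketch. The /tmp paths are stand-ins for demonstration only -- substitute your real omnirc location (typically /opt/omni/.omnirc on UNIX) and a debug filesystem with plenty of room:

    ```shell
    # Redirect Data Protector debug files via OB2DBGDIR in the omnirc file.
    # Stand-in paths for demonstration -- point these at your real omnirc
    # and a dedicated debug filesystem.
    OMNIRC=/tmp/demo.omnirc
    DEBUG_DIR=/tmp/omni_debug

    mkdir -p "$DEBUG_DIR"          # OB2DBGDIR must point at an existing directory
    touch "$OMNIRC"
    # Append the setting only if it is not already present
    grep -q '^OB2DBGDIR=' "$OMNIRC" || echo "OB2DBGDIR=$DEBUG_DIR" >> "$OMNIRC"
    df -k "$DEBUG_DIR" | tail -1   # sanity-check free space before a long debug run
    ```

    Data Protector processes read omnirc at startup, so sessions started after the edit should pick up the new location.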

  • I have it set up.  What do you recommend the OB2DBG values be set to?  Here is what was there in the past when I worked with HPE:

    ## OB2DBG=1-331,333-500 debug.txt

    ## OB2DBG=1-500 omnirpt.txt

    ## OB2DBG=1-500 /debug/BMAdbg1.txt

    # change debug file location

    OB2DBGDIR=/omniback/debug

    -vince

  • How big is your debug_d2d.log? It's a circular log file limited to 10 MB, so what you see at the end of the file is not necessarily the latest entry. What you will get in the MA debugs will be pretty much the same error that is logged in that file anyway.
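
    Because the file is circular, a quick way to find the genuinely newest entries is to sort on the leading timestamp. A sketch against a fabricated stand-in log (point it at your real debug_d2d.log instead):

    ```shell
    # debug_d2d.log wraps at ~10 MB, so the newest entry can sit anywhere in
    # the file. Each line starts with "YYYY-MM-DD HH:MM:SS", so a plain sort
    # restores chronological order. Demo with a fabricated wrapped log:
    LOG=/tmp/demo_d2d.log
    printf '%s\n' \
      '2017-02-28 09:15:00 : ERROR : newer entry (wrapped to the top of the file)' \
      '2017-02-22 10:00:00 : ERROR : older entry (physically last in the file)' > "$LOG"

    sort "$LOG" | tail -1          # the genuinely latest entry
    ```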


  • In reply to the above:


     

    (root@sapbck)# ls -lrt debug_d2d.log

    -rw-rw-rw- 1 root sys 10485751 Feb 28 09:15 debug_d2d.log

    The first entry was from 2/22; the last was this morning.

  • OK, so in the log, the message matching the session report should carry this error code:

    ReturnStatus = -1404 (OSCLT_ERR_SERVER_OFFLINE) 

    and there might be some more details logged, e.g.:

    2017-02-17 09:31:36 : 41364 : ERROR : 1 : TC : Skt : 10.94.90.16 : Ix : MMA : osCltTcp_Connect : Ln 278 : Error connecting client socket to IPv4 server address 10*.*.*, port 9387. Errno : 10060.
    2017-02-17 09:31:36 : 41364 : ERROR : 1 : TC : Skt : 10.94.90.16 : Ix : MMA : osCltTcp_Connect : Ln 282 : GOTO ReturnStatus = -1 (Unknown).
    2017-02-17 09:31:36 : 41364 : ERROR : 1 : TC : Skt : 10.94.90.16 : Ix : MMA : osCltTrnCmd_OpenCommandSession : Ln 299 : Failed to connect to server : 10.*.*.*.
    2017-02-17 09:31:36 : 41364 : ERROR : 1 : TC : Skt : 10.94.90.16 : Ix : MMA : osCltTrnCmd_OpenCommandSession : Ln 300 : GOTO ReturnStatus = -1404 (OSCLT_ERR_SERVER_OFFLINE).

    Obviously it would be different for CoFC, but that's just an idea of what to look for.
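
    For context, errno 10060 is the Winsock connect-timeout code (WSAETIMEDOUT), and port 9387 is the StoreOnce Catalyst command port, so these look like plain TCP connect timeouts to the device. To see whether the -1404 drops cluster at particular times (e.g. backup-window peaks), you can bucket them by hour; a sketch against a fabricated stand-in log -- point LOG at your real debug_d2d.log instead:

    ```shell
    # Count OSCLT_ERR_SERVER_OFFLINE (-1404) hits per date+hour bucket to
    # spot clustering. Demo data mimics the debug_d2d.log line format.
    LOG=/tmp/demo_d2d_hourly.log
    printf '%s\n' \
      '2017-02-17 09:31:36 : ... GOTO ReturnStatus = -1404 (OSCLT_ERR_SERVER_OFFLINE).' \
      '2017-02-17 09:45:02 : ... GOTO ReturnStatus = -1404 (OSCLT_ERR_SERVER_OFFLINE).' \
      '2017-02-18 22:10:11 : ... GOTO ReturnStatus = -1404 (OSCLT_ERR_SERVER_OFFLINE).' > "$LOG"

    # Keep date + hour, then count occurrences per bucket
    grep 'OSCLT_ERR_SERVER_OFFLINE' "$LOG" | awk '{print $1, substr($2,1,2)}' \
      | sort | uniq -c
    ```

    If the drops line up with your heaviest sessions, that would support the theory above that concurrent device load (Max devices) on the StoreOnce is the trigger.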