VEPA - session timeout with 'transport method: SAN' and HPE Primera LUNs

Hi,

we changed the storage for our ESXi datastores from EMC VMAX to HPE Primera.

Now, with these new LUNs, we see a 20-30 minute delay in a VEAgent / VEPA backup session (e.g. about 20 VMs, full backup, CBT active) before I/O starts when using 'transport method: SAN'; switching to 'transport method: hot-add' starts right away.

For optimal restore speed there is a virtual Linux machine with the VEPA integration installed in each ESX cluster for VEAgent restores, since we use thin-provisioned LUNs anyway.

Our preferred 'transport method' for backups is SAN.


Our physical and dedicated Linux DP Backup Hosts (VEPA integration installed) see all ESXi LUNs from 2x HPE Primera systems via SAN for backup purposes.

96 LUNs from our set of HPE Primera SAN storage systems (ESX-LUNs):

  • each of those 96 LUNs has 8 paths (768 paths in total)
  • half of those paths (384) are 'active ready running', the other half (384) are 'active ghost running' (this changes in case of a disaster / site failover)
  • all LUNs are synchronously mirrored between site1 and site2

Example of the paths from a backup host to one Primera ESX-LUN:

360002ac0000000000000006c00027224 dm-130 3PARdata,VV
size=3.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:50  sdaf  65:240  active ready running
| |- 1:0:1:50  sdeh  128:144 active ready running
| |- 3:0:0:50  sdqt  132:464 active ready running
| `- 3:0:1:50  sduv  67:624  active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  |- 1:0:2:50  sdij  135:48  active ghost running
  |- 1:0:3:50  sdml  69:464  active ghost running
  |- 3:0:2:50  sdyx  130:528 active ghost running
  `- 3:0:3:50  sdacz 8:944   active ghost running

^ 'active ready running' paths go to the storage at the local site, 'active ghost running' paths go to the storage at the remote site.
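To sanity-check the path split across all LUNs from a backup host, a quick count like the following can help (a rough sketch; it assumes multipath-tools is in place and that the state strings match the output above):

# count mapped 3PARdata multipath devices (expected: 96)
multipath -ll | grep -c "3PARdata,VV"

# count paths per ALUA state (expected: 384 each, 768 in total)
multipath -ll | grep -c "active ready running"
multipath -ll | grep -c "active ghost running"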

Multipathing is configured as the HPE whitepapers recommend.

/etc/multipath.conf

defaults {
    polling_interval           10
    retain_attached_hw_handler 0
}

devices {
    device {
        vendor               "3PARdata"
        product              "VV"
        path_grouping_policy "group_by_prio"
        uid_attribute        "ID_SERIAL"
        prio                 "alua"
        path_selector        "service-time 0"
        path_checker         "tur"
        hardware_handler     "1 alua"
        failback             "immediate"
        rr_weight            "uniform"
        no_path_retry        18
        rr_min_io_rq         1
        fast_io_fail_tmo     10
        dev_loss_tmo         "infinity"
    }
}
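A typical way to apply and verify changes to this file (a sketch; exact commands can vary with the multipath-tools version):

# push the edited configuration to the running daemon
multipathd -k"reconfigure"

# show the merged configuration the daemon actually uses
multipathd -k"show config"

# re-check the resulting path layout
multipath -ll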

All involved components are DP11 (DP11P1 where applicable). The timeouts were also there when we were on DP10.91 and DP10.80.
- ESXi, 6.7.0, 18828794
- vSphere Version 6.7.0.51000

Question: does anybody here have a similar setup, and do you see this timeout issue when using 'transport method: SAN', or not?

Thanks in advance!

BR,
Alex

  • Hi,
    sorry, completely forgot to mention (added version info to post).

    - DP11P1
    - ESXi, 6.7.0, 18828794
    - vSphere Version 6.7.0.51000

Excerpt from the session messages... (the session writes a mirror to two Data Domain systems via DDBoost GWs (MA/VEPA)):

    -snip-

    [Normal] From: VEPALIB_VMWARE@backuphost1-ddboostgw1.fqdn ""  Time: 12/20/21 10:00:48
            Virtual Machine 'VM4.fqdn': Getting Changed Blocks for disk 'scsi0:0'.

    [Normal] From: VEPALIB_VMWARE@backuphost1-ddboostgw1.fqdn ""  Time: 12/20/21 10:00:49
            Virtual Machine 'VM3.fqdn': Getting Changed Blocks for disk 'scsi0:0'.

    [Normal] From: VEPALIB_VMWARE@backuphost1-ddboostgw1.fqdn ""  Time: 12/20/21 10:00:50
            Virtual Machine 'VM3.fqdn': Getting Changed Blocks for disk 'scsi0:1'.

    [Normal] From: VEPALIB_VMWARE@backuphost1-ddboostgw1.fqdn ""  Time: 12/20/21 10:00:50
            Virtual Machine 'VM3.fqdn': Getting Changed Blocks for disk 'scsi1:0'.

    [Normal] From: VEPALIB_VMWARE@backuphost1-ddboostgw1.fqdn ""  Time: 12/20/21 10:00:51
            Virtual Machine 'VM3.fqdn': Getting Changed Blocks for disk 'scsi1:1'.

    [Normal] From: VEPALIB_VMWARE@backuphost1-ddboostgw1.fqdn ""  Time: 12/20/21 10:26:35
            Virtual Machine 'template-sles-125' (UUID '50044398-66ca-e432-603b-220c3f373e6d', Disk 'scsi0:0'): Starting full backup ...

    [Normal] From: BMA@backuphost2-ddboostgw2.fqdn "datadomain2_vmware_backuphost2-ddboostgw2 [GW 64453:0:1853676645313675371]"  Time: 20.12.2021 10:26:36
            STARTING Media Agent "datadomain2_vmware_backuphost2-ddboostgw2 [GW 64453:0:1853676645313675371]"

    [Normal] From: BMA@backuphost1-ddboostgw1.fqdn "datadomain1_vmware_backuphost1-ddboostgw1 [GW 64453:0:1925734239351603307]"  Time: 20.12.2021 10:26:36
            STARTING Media Agent "datadomain1_vmware_backuphost1-ddboostgw1 [GW 64453:0:1925734239351603307]"

    -snip-


    There was a 26-minute gap between the last 'Getting Changed Blocks for disk' message and the Media Agents starting to move I/O.
    Once it gets going, this session moves around 9 TB to the DDs in about 3 hours.

    We had to raise the global parameter 'SmWaitForFirstBackupClient' from the default of 10 minutes to something higher than the longest wait, just to get the session to finish at all (see the sketch below for what that change looks like).
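    For reference, a minimal sketch of that change, assuming a Linux Cell Manager with the standard Data Protector install path (the value is just an example, in minutes):

    # /etc/opt/omni/server/options/global  (on the Cell Manager)
    # wait up to 60 minutes for the first backup client to deliver data
    SmWaitForFirstBackupClient=60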

    BR,
    Alex

    vepa_session_messages.txt

  • Hello

    By default a session is aborted if the Media Agent has not received any data within 10 minutes; those 26 minutes are taken by vCenter to create the snapshots and provide the CBT state to DP, so increasing this value is a good idea. My apologies, but I am not sure whether you still have an issue after that change; if so, please let me know.

    Best Regards

  • Hi,
    thanks.
    I am not so sure. If I had not seen it run a lot faster before, and this were normal, I would accept the 20-30 minute wait.

    But why does this same task take only seconds when using hot-add, and why was this not an issue with the EMC storage?
    I will do some more testing / ruling out.

    Raised a case.

    BR,
    Alex

  • Did some more testing. The issue clearly lies with the HPE Primera LUN paths.
    With the same setup against, e.g., Huawei OceanStor ESXi LUNs there is no timeout at all; I/O starts right away.

    I will do some more tests with only 'active' Primera LUN paths and/or fewer 'active' / 'active ghost' LUN paths (one way to stage such a test is sketched below).
    If this turns out to be an issue with the Primera LUN path concept, there surely needs to be some change made to the DP/VEPA components regarding standby-path handling.
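    One possible way to take the remote-site 'ghost' paths of a LUN out of play for such a test, without touching zoning, is to fail them via multipathd (a rough sketch; the device names are the ghost paths from the example output above and will differ on your host):

    # temporarily mark the remote-site (ghost) paths of one LUN as failed
    multipathd -k"fail path sdij"
    multipathd -k"fail path sdml"
    multipathd -k"fail path sdyx"
    multipathd -k"fail path sdacz"

    # run the SAN-transport backup test, then bring the paths back
    multipathd -k"reinstate path sdij"
    multipathd -k"reinstate path sdml"
    multipathd -k"reinstate path sdyx"
    multipathd -k"reinstate path sdacz"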

    Anybody else with this setup, please 'raise a hand'.

    Thanks,

    BR,

    Alex