VEPA - session timeout with 'transport method: SAN' and HPE Primera LUNs

Hi,

We changed the storage for our ESXi datastores from EMC VMAX to HPE Primera.

Now, with these new LUNs, we see a 20-30 min timeout in a VEAgent / VEPA backup session (about 20 VMs, full backup, CBT active) before any I/O starts when using 'transport method: SAN'. Switching to 'transport method: hot-add' makes the backup start right away.

For optimal restore speed, each ESX cluster has a virtual Linux machine with the VEPA integration installed for VEAgent restores, since we use thin-provisioned LUNs anyway.

Our preferred 'transport method' for backups is SAN.


Our physical, dedicated Linux DP backup hosts (VEPA integration installed) see all ESXi LUNs from the two HPE Primera systems via SAN for backup purposes.

96 LUNs from our set of HPE Primera SAN storage systems (ESX LUNs):

  • those 96 LUNs have 8 paths each (768! in total)
  • half of those paths are 'active ready running' (384!), the other half (384) are in state 'active ghost running' (this changes in case of a disaster / site failover)
  • all LUNs are synchronous mirrored between site1 and site2
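Not part of the original post, but a quick way to sanity-check those path-state counts is to parse saved `multipath -ll` output; a minimal sketch (the helper name is my own):

```python
import re

def count_path_states(multipath_ll: str) -> dict:
    """Count path states ('ready', 'ghost', ...) in `multipath -ll` output."""
    counts: dict = {}
    for line in multipath_ll.splitlines():
        # Path lines end in e.g. "active ready running" or "active ghost running"
        m = re.search(r"active\s+(\w+)\s+running\s*$", line)
        if m:
            counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return counts

# Trimmed sample in the same shape as the output shown below
sample = """\
360002ac0000000000000006c00027224 dm-130 3PARdata,VV
size=3.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:50  sdaf  65:240  active ready running
| `- 3:0:1:50  sduv  67:624  active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  |- 1:0:2:50  sdij  135:48  active ghost running
  `- 3:0:3:50  sdacz 8:944   active ghost running
"""
print(count_path_states(sample))  # {'ready': 2, 'ghost': 2}
```

On the full host output, the expectation from the bullet list above would be 384 'ready' and 384 'ghost'.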

Example of the paths from a backup host to one Primera ESX LUN:

360002ac0000000000000006c00027224 dm-130 3PARdata,VV
size=3.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:50  sdaf  65:240  active ready running
| |- 1:0:1:50  sdeh  128:144 active ready running
| |- 3:0:0:50  sdqt  132:464 active ready running
| `- 3:0:1:50  sduv  67:624  active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  |- 1:0:2:50  sdij  135:48  active ghost running
  |- 1:0:3:50  sdml  69:464  active ghost running
  |- 3:0:2:50  sdyx  130:528 active ghost running
  `- 3:0:3:50  sdacz 8:944   active ghost running

^ 'active ready running' paths to storage at local site, 'active ghost running' paths to storage at remote site.

Multipathing is configured as the HPE whitepapers recommend.

/etc/multipath.conf

defaults {
    polling_interval            10
    retain_attached_hw_handler  0
}
devices {
    device {
        vendor                  "3PARdata"
        product                 "VV"
        path_grouping_policy    "group_by_prio"
        uid_attribute           "ID_SERIAL"
        prio                    "alua"
        path_selector           "service-time 0"
        path_checker            "tur"
        hardware_handler        "1 alua"
        failback                "immediate"
        rr_weight               "uniform"
        no_path_retry           18
        rr_min_io_rq            1
        fast_io_fail_tmo        10
        dev_loss_tmo            "infinity"
    }
}
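One timing detail implied by that config (my own arithmetic, not a claim from the post): per multipath.conf semantics, `no_path_retry N` queues I/O for N polling intervals after all paths in a map have failed, so:

```python
# No-path queueing window implied by the multipath.conf above
# (multipath.conf(5): no_path_retry N = queue I/O for N polling intervals)
polling_interval_s = 10   # seconds, from the defaults section
no_path_retry = 18        # from the 3PARdata device section

queue_window_s = polling_interval_s * no_path_retry
print(queue_window_s)  # 180 -> up to 3 minutes of queueing before I/O errors out
```

That 3-minute window is much shorter than the observed 20-30 min delay, so the stall described here is unlikely to be explained by `no_path_retry` alone.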

All involved components are DP 11, where applicable DP 11 P1. The timeouts were also present on DP 10.91 and DP 10.80.
- ESXi, 6.7.0, 18828794
- vSphere Version 6.7.0.51000

Question: does anybody here have a similar setup, and do you (or do you not) see this timeout issue when using 'transport method: SAN'?

Thanks in advance!

BR,
Alex

Update:

When we present only the 'active ready running' LUN paths from the HPE Primera storage system to the DP GW hosts, there is no timeout at all; the backup starts right away.

So we have proved it: VEPA/DP has trouble with standby LUN paths in state 'active ghost running' (from HPE Primera storage systems).

Will continue debugging...
