Hi,
we changed the storage for our ESXi datastores from EMC VMAX to HPE Primera.
With these new LUNs we now see a 20-30 minute timeout at the start of a VEAgent / VEPA backup session (about 20 VMs, full backup, CBT active) before any I/O begins when using 'transport method: SAN'. Switching to 'transport method: hot-add' starts right away.
For optimal restore speed there is a virtual Linux machine with the VEPA integration installed in each ESX cluster for VEAgent restores, since we use thin-provisioned LUNs anyway.
Our preferred 'transport method' for backups is SAN.
Our physical, dedicated Linux DP backup hosts (VEPA integration installed) see all ESXi LUNs from our 2x HPE Primera systems via SAN for backup purposes:
96 LUNs (ESX LUNs) from our HPE Primera SAN storage systems
- those 96 LUNs have 8 paths each (768 in total!)
- half of those paths (384) are 'active ready running', the other half (384) are 'active ghost running' (this changes in case of a disaster / site failover)
- all LUNs are synchronously mirrored between site1 and site2
Example of the paths from a backup host to one Primera ESX LUN:
360002ac0000000000000006c00027224 dm-130 3PARdata,VV
size=3.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:50 sdaf 65:240 active ready running
| |- 1:0:1:50 sdeh 128:144 active ready running
| |- 3:0:0:50 sdqt 132:464 active ready running
| `- 3:0:1:50 sduv 67:624 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
|- 1:0:2:50 sdij 135:48 active ghost running
|- 1:0:3:50 sdml 69:464 active ghost running
|- 3:0:2:50 sdyx 130:528 active ghost running
`- 3:0:3:50 sdacz 8:944 active ghost running
^ The 'active ready running' paths go to the storage at the local site; the 'active ghost running' paths go to the storage at the remote site.
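To verify the per-LUN split of four 'active ready' and four 'active ghost' paths (and the 384/384 totals across all 96 LUNs), the states can be counted straight from the multipath output. A minimal sketch; the sample text here is the LUN listing from above, and on a live host the same two grep calls can be run against the full `multipath -ll` output instead:

```shell
# Count path states in multipath output. The sample below is the single-LUN
# listing from the post; on the backup host, pipe `multipath -ll` into the
# same greps to get the totals across all LUNs.
sample='| |- 1:0:0:50 sdaf 65:240 active ready running
| |- 1:0:1:50 sdeh 128:144 active ready running
| |- 3:0:0:50 sdqt 132:464 active ready running
| `- 3:0:1:50 sduv 67:624 active ready running
  |- 1:0:2:50 sdij 135:48 active ghost running
  |- 1:0:3:50 sdml 69:464 active ghost running
  |- 3:0:2:50 sdyx 130:528 active ghost running
  `- 3:0:3:50 sdacz 8:944 active ghost running'

ready=$(printf '%s\n' "$sample" | grep -c 'active ready running')
ghost=$(printf '%s\n' "$sample" | grep -c 'active ghost running')
echo "ready=$ready ghost=$ghost"   # ready=4 ghost=4 for this one LUN
```

With all 96 LUNs in the input, the expected counts would be ready=384 and ghost=384.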
Multipathing is configured as the HPE whitepapers recommend.
/etc/multipath.conf:
defaults {
    polling_interval 10
    retain_attached_hw_handler 0
}
devices {
    device {
        vendor "3PARdata"
        product "VV"
        path_grouping_policy "group_by_prio"
        uid_attribute "ID_SERIAL"
        prio "alua"
        path_selector "service-time 0"
        path_checker "tur"
        hardware_handler "1 alua"
        failback "immediate"
        rr_weight "uniform"
        no_path_retry 18
        rr_min_io_rq 1
        fast_io_fail_tmo 10
        dev_loss_tmo "infinity"
    }
}
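As a side note on the failover-related values above: a minimal arithmetic sketch, assuming the usual multipathd semantics where no_path_retry counts path-checker intervals of polling_interval seconds each, shows how long I/O would queue with all paths down. That window is only a few minutes, so it would not by itself explain a 20-30 minute stall:

```shell
# Assumed multipathd semantics: with all paths down, I/O is queued for
# no_path_retry checker intervals, each polling_interval seconds long.
polling_interval=10   # from the defaults section above
no_path_retry=18      # from the device section above

queue_secs=$(( polling_interval * no_path_retry ))
echo "all-paths-down queueing window: ${queue_secs}s"   # 180s = 3 minutes
```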
All involved components are on DP 11, or DP 11 P1 where applicable. The timeouts were also present when we were on DP 10.91 and DP 10.80.
- ESXi, 6.7.0, 18828794
- vSphere Version 6.7.0.51000
Question: does anybody here have a similar setup, and do you see (or not see) this timeout issue when using 'transport method: SAN'?
Thanks in advance!
BR,
Alex