Idea ID: 1653378

Restores should have priority over backups

Status : Under Consideration
over 2 years ago

Brief description:

When multiple sessions that are configured to use the same device are scheduled, queueing is possible. The active session locks the device and cartridge in the MMDB, and the remaining sessions enter the queueing state and wait for the device to become free. The queueing mechanism in Data Protector decides randomly which waiting session gets the lock next. This is fatal if the operator starts a critical restore session: the restore should get priority over any backup session that is already waiting, but this is not the case.

It should be possible to start a "high priority restore" that will either win the queueing (if there is any) OR pause a running backup session currently locking the device and resume it after the restore.
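
To make the two requested behaviors concrete, here is a minimal sketch in Python. It is a toy model, not Data Protector code: `DeviceQueue`, `request`, `release`, and the priority constants are all hypothetical names invented for illustration. It shows a high-priority restore either winning the queue or pausing the running backup, which then resumes ahead of the other waiting backups.

```python
import heapq
import itertools

# Hypothetical priority levels: a lower number is served first.
RESTORE_HIGH = 0
BACKUP = 10


class DeviceQueue:
    """Toy model of one device lock with a priority-ordered wait queue."""

    def __init__(self):
        self.holder = None             # (priority, session) or None
        self._heap = []                # (priority, rank, seq, session)
        self._seq = itertools.count()  # FIFO tie-breaker

    def request(self, session, priority, preempt=False):
        """Ask for the device; returns "running" or "queued"."""
        if self.holder is None:
            self.holder = (priority, session)
            return "running"
        if preempt and priority < self.holder[0]:
            # Pause the current holder. Rank 0 puts it ahead of fresh
            # waiters (rank 1) that have the same priority.
            old_priority, old_session = self.holder
            heapq.heappush(
                self._heap, (old_priority, 0, next(self._seq), old_session)
            )
            self.holder = (priority, session)
            return "running"
        heapq.heappush(self._heap, (priority, 1, next(self._seq), session))
        return "queued"

    def release(self):
        """Free the device and hand it to the best waiter, if any."""
        if self._heap:
            priority, _, _, session = heapq.heappop(self._heap)
            self.holder = (priority, session)
            return session
        self.holder = None
        return None
```

With `preempt=False` the restore simply jumps ahead of every queued backup. With `preempt=True` it also displaces the running backup, which a real implementation would have to pause or checkpoint before the device changes hands.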

Benefit:

Restores usually have priority over backups, but Data Protector does not offer an easy way to prioritize a restore.

 


  • I believe that was sort of the idea of ABR (adaptive backup and recovery?), introduced with the advanced scheduler back in 8.x/9.x. There you could specify priorities for scheduled sessions, and the idea was that higher-priority sessions could interrupt lower-priority sessions (basically by aborting them and resuming them later, which AFAIK is one of the reasons for checkpoint resume). But AFAIK two things hampered these efforts:

    • The approach is completely oblivious to MMD and attempts to infer device overlap by comparing the device lists in backup specifications. This is of course flawed, since it ignores device load balancing. This is listed as a limitation in the docs, but device load balancing is the norm rather than the exception (e.g. there is no alternative for integration backups), so the feature ends up being mostly a dud.
    • ABR does not apply to restore sessions: they cannot be scheduled, one cannot specify a priority for them, and ABR does not know which devices they will use.

    The correct approach is to delegate these decisions to MMD. You could poll its state instead, but it changes so frequently that you would just be open to TOCTOU races.

    Ideally, you would delegate a large part of the device optimization strategy to MMD, because it is the only part of DP with cross-session awareness of device resources. But that requires:

    • giving MMD more information (backup windows, expected job sizes/durations, candidate host lists, full version->mediaset graphs to be able to pick collision-free paths for copies/restores)
    • relinquishing to MMD some of the control from session managers (AMSS, device filtering, ability to preempt, abort or interrupt sessions, etc.)
    • providing insight into MMD state (e.g. right now you have no idea which session is holding hostage the device or media that you need for a restore, or how much contention and queueing there is for specific devices)
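
The points about delegation and insight can be sketched as a single arbiter that both makes the decision and exposes its state. This is only an illustration in Python under stated assumptions: `DeviceArbiter`, `try_acquire`, `release`, and `snapshot` are hypothetical names and do not correspond to any real MMD interface. Because the free check and the lock happen under one mutex, there is no poll-then-act gap for a TOCTOU race, and load-balanced sessions are handled by passing the whole candidate list.

```python
import threading


class DeviceArbiter:
    """Toy central arbiter: atomic device selection plus visible state."""

    def __init__(self, devices):
        self._lock = threading.Lock()
        self._holders = {d: None for d in devices}
        self._waiters = {d: [] for d in devices}

    def try_acquire(self, session, candidates):
        """Atomically lock the first free candidate device, else register as waiting."""
        with self._lock:
            for device in candidates:
                if self._holders[device] is None:
                    self._holders[device] = session
                    # The session is no longer waiting on any candidate.
                    for d in candidates:
                        if session in self._waiters[d]:
                            self._waiters[d].remove(session)
                    return device
            for device in candidates:
                if session not in self._waiters[device]:
                    self._waiters[device].append(session)
            return None

    def release(self, session, device):
        """Free a device if the given session actually holds it."""
        with self._lock:
            if self._holders[device] == session:
                self._holders[device] = None

    def snapshot(self):
        """Report, per device, who holds it and who is waiting for it."""
        with self._lock:
            return {
                d: {"holder": h, "waiting": list(self._waiters[d])}
                for d, h in self._holders.items()
            }
```

An operator whose restore is stuck could call `snapshot()` to see which session holds the needed drive and how long the wait list is, which is exactly the information that is missing today.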
  • This is a great idea!
    I always need to kill my jobs to start a restore, and that's horrible.
    In my environment, I run more than 1200 jobs a day, so it's very difficult to find a window for restores. The only option is to stop the whole environment.