AS78

Restore in a virtualised environment

After a server or HD crash, the normal recovery procedure involves setting up a new server with the OS, eDir and, in our case, IM software, patching these to the levels present before the crash, and afterwards restoring the eDir content from a backup obtained with dsbk, e.g. along the lines sketched below.
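As a minimal sketch of what we mean (paths are placeholders, and the switch meanings are per our reading of the dsbk documentation, so please verify them for your eDir version):

  # full backup (-b) including stream files (-t), overwriting older files (-w)
  dsbk backup -b -f /var/nds-backups/full.bak -l /var/nds-backups/backup.log -t -w

  # restore the DIB (-r), activate it after verification (-a), open the DB (-o)
  dsbk restore -r -a -o -f /var/nds-backups/full.bak -l /var/nds-backups/restore.log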

We were wondering whether VM snapshots can simplify the first step. However, snapshots taken at the current patch level will also contain a running eDir instance with outdated content, which will cause inconsistencies in a multi-server eDir environment. We have therefore run some tests bringing one server of a two-server replica ring back to a VM snapshot taken earlier, while inhibiting communication with the other replica server. We wiped all eDir data off the "crashed" server and installed a new temporary tree, keeping OS and IM software versions and patch levels. We then continued by restoring the eDir content from a full backup taken earlier with dsbk. We deliberately did not include roll-forward logs, because for the time being we cannot store them on a separate storage device and hence cannot guarantee that all roll-forward logs needed after a real crash would still be available.

Simply following section 15.7, "Recovering the Database if Restore Verification Fails", of the eDir 9 admin guide yielded a "locked Directory Services Database" error when changing the replica information on the failed server into external references using ndsrepair -R -Ad -xk2. ndstrace reports "DSAgentOpenLocal failed, ds locked" (error -663). Our test box is an eDir 9.0.4/SLES 12.2/IM 4.5 installation with the partitions IDMDriverset and root.

I'd therefore like to ask the experts whether anybody has an idea why the restore procedure following sections 15.6 and 15.7 of the eDir 9 administration guide fails.

Moreover, we wonder whether a restore to get hold of the RST files is necessary at all, or whether outdated partitions brought back by VM snapshots could safely be removed from the rest of the replica ring before allowing any sync between the servers. After re-establishing communication, our idea would be to re-add the partitions from replica servers not affected by the crash.

Thanks in advance

Axel
ab (Knowledge Partner)

Re: Restore in a virtualised environment

Lots of good, and common, questions. Well done thinking through this.

First, taking snapshots of in-use files is always risky, particularly with transactional things like databases (PostgreSQL, MariaDB, etc.) or, in this case, eDirectory's DIB. I have had to help several clients in the past who used a VM snapshot and planned on doing the right things, like XK2-ing the DIB and cleaning the replica ring appropriately before letting it communicate, but who instead found a -618 waiting for them, or a -663 that would not go away, meaning they could not open the DIB to do the XK2 and then get the server back into the tree by adding replicas. Painful, that is.

To work around that, it may be valid, as part of your backup procedure, to stop eDirectory, let the VM snapshot happen, and then start it again (a sketch follows the list below). This has at least two benefits:

1. You will not end up with a backup taken in the middle of a write, leaving a corrupt DIB that prevents any use of that DIB, however careful you are, when the time comes. If this does not work, nothing else matters and you are back to dsbk.

2. You may end up with a snapshot that, when restored, has eDirectory stopped, which is pretty convenient: it helps you avoid accidentally letting the restored box talk to other boxes before you disable communications somehow, so you can then XK2 the thing, clean replica rings, etc.
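Something along these lines, as a sketch only: the snapshot command is a placeholder for your hypervisor's tooling (virsh snapshot-create-as, PowerCLI, etc.), and ndsd may be controlled via the init script, ndsmanage or systemctl depending on the install:

  #!/bin/bash
  # quiesce the DIB on disk before the snapshot is taken
  /etc/init.d/ndsd stop

  # 'take-vm-snapshot' is a made-up placeholder for your hypervisor CLI
  take-vm-snapshot edir-server-01

  # bring eDirectory back up once the snapshot has completed
  /etc/init.d/ndsd start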

Another option you may have is to treat your eDirectory data volume completely separately, e.g. mount it from a SAN directly so that the VM snapshots are completely independent of the DIB data. The benefit is that if the server goes haywire and you need to restore a snapshot, the snapshot will not roll back the DIB, which is your entire concern. Of course, if your DIB IS corrupt (because eDirectory was writing when the box crashed) then you still need a backup/restore option, or just to remove the server and add it back to the tree, but that is always a possibility since hardware is not perfect.
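As a sketch of that separation, assuming the default eDirectory data location; the device name and filesystem are made up for illustration:

  # /etc/fstab: keep the DIB on a SAN LUN that VM snapshots never touch
  /dev/mapper/san-edir-dib  /var/opt/novell/eDirectory/data  xfs  defaults,nofail  0  2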

Regardless of the type of DIB restore you use (VM snapshots, ndsrc.pl archives, dsbk), if you do not have the RFL files then eDirectory deliberately disables itself to a degree because, by definition, the restore is not valid, and you already know the possible issues there. As a result, as long as your DIB is not corrupt and you have the RFL files (which is considered part of verifying a valid, non-corrupt DIB), eDirectory should start; otherwise you should be XK2-ing things, cleaning replica rings and adding the server back, if not removing the server from the tree entirely and re-adding it (maybe with DIB Clone).
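Since the RFLs matter so much, here is a sketch of enabling them and pointing them at a separate volume with dsbk; the switches are per my reading of the backup docs, so verify them on your version:

  # start keeping roll-forward logs (-L) on a dedicated mount (-r <dir>)
  dsbk setconfig -L -r /mnt/rfl-volume/nds-rfl

  # review the resulting backup configuration
  dsbk getconfig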

--
Good luck.

If you find this post helpful and are logged into the web interface,
show your appreciation and click on the star below.

If you want to send me a private message, please let me know in the
forum as I do not use the web interface often.
AS78

Re: Restore in a virtualised environment

Thank you for your thorough explanation. I feel encouraged to continue considering VM snapshots as part of our backup strategy.

ab;2471722 wrote:
...
To work around that, it may be valid, as part of your backup procedure, to stop eDirectory, let the VM snapshot happen, and then start it again. This has at least two benefits: [...]


I feel guilty for not having mentioned that we only take snapshots after all participating replica servers have been shut down, so I don't expect any hanging eDirectory operations to be frozen in. Shutting down just ndsd, as you suggest, will probably work even better, because it allows activating the server snapshot without automatically starting eDir (and any IM drivers configured to start automatically).
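On our SLES 12 box something like this should keep eDir from coming up automatically once a snapshot is activated; this assumes ndsd is registered as a systemd (or compat init) service, which may differ per installation:

  systemctl disable ndsd   # do not start eDir automatically at boot
  systemctl stop ndsd      # quiesce the DIB before taking the snapshot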

Our test procedure was as follows:

Take snapshots of all servers in the replica ring (two in our test case) after they have been shut down
Restart them and configure RFL
Take a full dsbk backup including NICI
Delete a few objects to produce a different state than the one backed up
Run a few health checks (all successful)
Isolate the ring from the server to "crash" (in our case by switching off the network adapter of one server; a firewall-based alternative is sketched after this list; the cut may land in the middle of a sync operation in progress, but so could a real failure)
Stop ndsd on the "crashing" server and remove all eDirectory stuff
Reestablish communication
Reinstall and configure a temporary tree as described in the documentation
Do a dsbk restore of the full backup, without RFLs, because with our setup there is a high likelihood of losing the latest RFL
As expected, the restore fails with -6034 errors for the two partitions involved and a final -755. Files RST.01, ~.db, ~.lck, ~.rfl are left in the DIB directory of the temporary eDir tree as well as a RFL.lck.lck file. The NICI restore procedure is reported as successful.
Continue with the recovery procedure according to section 15.7.1 of the eDir 9 admin doc, cleaning up the "surviving" part of the replica ring
Continue with section 15.7.2, "Repair the failed server": dsbk restadv -v -l somelog.log does not report any errors, and in fact all RST files are gone except for RFL.lck.lck; ndsmanage still shows the temporary instance as active, though. The subsequent ndsrepair -R -Ad -xk2 starts promisingly, reports killing all objects present and even states "Repair process completed, total errors found = 0", but finishes with "Could not open the Directory Services Database, the repair procedure was not successful. ... Total errors: 0 NDSRepair process completed. Repair operation aborted since NDS Server is down." The -663 error is only reported by ndstrace.
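For reference, the firewall-based isolation mentioned above would look roughly like this; 10.0.0.2 is a placeholder for the surviving replica server's address:

  # drop all traffic to/from the peer replica before rolling back the snapshot
  iptables -A INPUT  -s 10.0.0.2 -j DROP
  iptables -A OUTPUT -d 10.0.0.2 -j DROP

  # ... snapshot rollback and cleanup happen here ...

  # re-establish communication afterwards
  iptables -D INPUT  -s 10.0.0.2 -j DROP
  iptables -D OUTPUT -d 10.0.0.2 -j DROP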

This behaviour is reproducible and therefore doesn't seem to be attributable to a particularly unfavourable moment at which the communication was interrupted.

I'm therefore wondering whether our test setup is lacking something obvious, or whether replication isn't useful for disaster recovery at all. I may add that we also tested interrupting communication, bringing one server back to an earlier snapshot and applying -xk2 directly, without first restoring a backup, so that the outdated replica information would simply be overwritten with intact replica information from a different server. In this scenario ndsrepair -R -Ad -xk2 succeeded without errors, but re-adding the root partition, after the first re-added partition had reached the "on" state, ended in an endless "new" state.
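For completeness, these are the ndsrepair reports we watch while re-adding replicas (a sketch; output formatting varies between versions):

  ndsrepair -P   # list partitions and replica states on this server
  ndsrepair -E   # report replica synchronization status for the ring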

Therefore any refinements of our testing approach would be welcome.

ab;2471722 wrote:
...
Painful, that is.


We're using a test box, so no harm is done to any important data 😉 I will certainly take a closer look at ndsrc.pl, especially when it comes to migrations to a different server OS, but unlike replication or RFLs, saving on a schedule won't provide disaster recovery up to the moment a failure happens. Moreover, I'd like to find out whether replication is useful for us at all: a single server could be reset to a VM snapshot and restored to the latest saved dsbk backup within minutes...
AS78

Short update

I had opened an SR on this, and meanwhile the xk2 misbehaviour has been reproduced by support. More information should follow shortly.

Axel