eDirectory Disaster Recovery

This article does not present an entirely new idea; it is an update to an older article originally published by Jim Henderson, based on research that he and Peter Kuo did and presented at BrainShare. You can (and should) read his original article here: http://www.novell.com/coolsolutions/feature/15748.html

I have used Jim's original article dozens of times, and highly recommend it. It is well thought through, straightforward, and most importantly, tested. It works. So if it is so good, why update it? Because in a large tree it can take several hours to use the 'srvref' utility to find all references to the server's objects, and another several hours to replace them. Also, because 'srvref' is a Win32 utility.

Recently, I found that I could work through a modified version of this procedure much more quickly and without the Windows requirement. After checking with Jim, I'm presenting the results here, along with some additional thoughts and things to look out for.

This is not the only way to restore eDirectory after a system failure. There are other options available - such as using the Hot Continuous Backup feature of eDirectory. This scenario assumes no backup exists of the database on the server, but other servers in the tree hold replicas of the partitions on the server that has crashed.

Here is a quick checklist of things to remember to do - the following sections will go into detail about each of these steps.



  1. Plan Ahead!
  2. Don't Panic!
  3. Reconfigure Time Synchronization
  4. Referential Integrity
  5. Delete the Old Server Object
  6. Documentation and What Else Needs To Be Fixed
  7. Verify Replica Rings are Clean
  8. Rebuild and Re-Patch the Crashed Server
  9. Fix File System Trustee Assignments for SYS Volume
  10. Remove eDirectory if Installed into Temporary Tree
  11. Reconfigure Time Synchronization
  12. Install Rebuilt Server into Production Tree
  13. Fixing the Unknown objects
  14. Re-issue Server Certificates
  15. Perform Post-Recovery Tasks



1. Plan Ahead!

Long before any disaster happens, you should have planned for one, and should know what your options will be. You should have read and understood the documentation for the various ways to recover. You should have tested them all in a non-production environment. You should be comfortable with them. You should be able to work through to a successful conclusion, even if you have been up all night and are sleep deprived. Because disasters rarely happen when it is convenient, knowing your options ahead of time will help you manage both the problem itself, and the accompanying stress.

Novell's official backup / restore of eDirectory documentation is here:
Backing Up and Restoring Novell eDirectory

Geoffrey documents how to use dsrepair to dump eDirectory to a file and to restore the database from that dump file here:
Restoring eDirectory with Dsrepair -rc and Dsrepair -$du


2. Don't Panic!

Readers familiar with the works of Douglas Adams may chuckle at this step, but it is offered as a serious part of a disaster recovery scenario, and is an absolutely critical step in the process.

Troubleshooting and disaster recovery scenarios are - without a doubt - high-stress situations. The directory is a core part of any business operation, because it allows services to authenticate users; without the directory, authentication stops working, and everything that depends on that authentication also stops working.

When a user's applications stop working, the user becomes idle. Often, the first instinct any user has is to find out why they are unable to complete tasks necessary for their jobs. The calls to the help desk start, and in many cases, users will know who is responsible for the systems in question and they will begin to seek out those individuals.

System administrators (both experienced and inexperienced) frequently react to this potentiality by scrambling to get the system back up. The amount of pressure that undoubtedly will be applied can create a sense of urgency to resolve the issue as quickly as possible. Unfortunately, this sense of urgency can be so intense as to result in steps being missed or bad decisions being made. It is important to have someone who can run interference with the users so the administrator can concentrate on the task at hand - getting the system operational with as little downtime as possible.

Making mistakes during a disaster recovery situation increases recovery time.


3. Reconfigure Time Synchronization

Time Synchronization is often misunderstood in an eDirectory environment. eDirectory itself does not provide time services; it is a time services consumer, not a provider. eDirectory depends on the time being correct on the platform in order to ensure timestamps are correct when applied to changes in the directory.

In order to ensure that events in the directory consistently are applied on all servers in a replica ring, time synchronization must be properly configured.

If the server that is being recovered is a time source for any other server in the tree, time services need to be reconfigured in order to provide consistent time during the recovery.

Time synchronization differs between the platforms; each of the major platforms will be considered individually.

NetWare

NetWare allows the use of two types of time synchronization: a "legacy" TIMESYNC configuration that uses a time server type (single reference, reference, primary, or secondary), and an option to use an NTP configuration.

When using a legacy TIMESYNC configuration, determine if the down server was a SINGLE, REFERENCE, or PRIMARY time server; if it was, find the servers that used the down server as a time source and reconfigure them to point to an alternate source. If the down server was configured as a SECONDARY time server, then nothing else need be done.

With an NTP configuration on NetWare, the time synchronization configuration is set using configured sources on each server. When using this configuration, the configured time source points to the server network address and must specify port 123 as part of the time source configuration:


MAGRATHEA:set timesync time sources
TIMESYNC Time Sources: 172.16.1.1:123;
Maximum length: 149
Description: This server contacts the servers in this list as time
providers. Each time server (IP Address, DNS Name) in this list
is separated by a ';'.
For example:
A ";" clears the list
"MyServer;" specifies that MyServer is the NetWare time source.
"MyServer:123;" specifies that MyServer is a NTP time source.



In this example, the time source points to 172.16.1.1:123, which indicates the configuration is using NTP.
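If you have documented each server's TIMESYNC Time Sources value ahead of time (see Step #1), working out which servers pointed at the down server is a simple text exercise. The following Python sketch is only an illustration: the server names FS2, FS3, and FS9, the extra address, and the idea of keeping the values in a dictionary are assumptions of mine, not something NetWare provides.

```python
def parse_time_sources(value):
    """Split a TIMESYNC 'Time Sources' value into (host, is_ntp) pairs."""
    sources = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(":")
        # A ':123' suffix marks the source as NTP rather than legacy TIMESYNC.
        sources.append((host, port == "123"))
    return sources


def servers_to_repoint(configs, down_server):
    """Return the servers whose source list names the down server."""
    affected = []
    for server, value in sorted(configs.items()):
        hosts = [host for host, _ in parse_time_sources(value)]
        if down_server in hosts:
            affected.append(server)
    return affected


# Hypothetical documentation: server name -> its TIMESYNC Time Sources value.
documented = {
    "MAGRATHEA": "172.16.1.1:123;",
    "FS2": "172.16.1.1:123;172.16.1.5:123;",
    "FS3": "FS9;",
}

print(parse_time_sources(documented["MAGRATHEA"]))   # [('172.16.1.1', True)]
print(servers_to_repoint(documented, "172.16.1.1"))  # ['FS2', 'MAGRATHEA']
```

Each server in the second list needs its time source changed before the recovery continues.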

Linux/Unix

The Linux and Unix platforms use very similar configurations - timesync is typically implemented using NTP.

On SUSE Linux Enterprise Server 9, for example, the configuration is stored in /etc/ntp.conf. Changing the time synchronization configuration in this environment involves modifying this file's "server" line:

(File clipped to show just the relevant section that needs to be changed)


##
## Outside source of synchronized time
##
## server xx.xx.xx.xx # IP address of server
server 172.16.1.1
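
If several Linux servers need the same change, the edit can be scripted. This is a minimal Python sketch, assuming the file has already been read into a string; the replacement address 172.16.1.2 is a made-up alternate source, and writing the result back to /etc/ntp.conf and restarting the NTP daemon are left to you.

```python
def repoint_ntp_server(conf_text, old_source, new_source):
    """Replace old_source with new_source on active 'server' lines only."""
    out = []
    for line in conf_text.splitlines():
        fields = line.split()
        # Commented lines begin with '#', so their first field is never 'server'.
        if len(fields) >= 2 and fields[0] == "server" and fields[1] == old_source:
            fields[1] = new_source
            line = " ".join(fields)  # note: collapses extra whitespace on this line
        out.append(line)
    return "\n".join(out) + "\n"


sample = """##
## Outside source of synchronized time
##
## server xx.xx.xx.xx # IP address of server
server 172.16.1.1
"""

# 172.16.1.2 is a hypothetical alternate time source.
print(repoint_ntp_server(sample, "172.16.1.1", "172.16.1.2"), end="")
```

The commented template line is left alone; only the active "server" line changes.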



Configuration is similar on other Linux and Unix platforms. Use:


ntpq -p



to verify that NTP is configured and working.
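
In the ntpq -p listing, the source that the NTP daemon has currently selected for synchronization is flagged with a leading '*'. If you want to check this from a script, something like the following Python sketch would do. The sample output here is invented for illustration; in practice you would capture the real output of ntpq -p and pass it in.

```python
def sync_peer(ntpq_output):
    """Return the remote marked '*' (the selected source), or None."""
    for line in ntpq_output.splitlines():
        if line.startswith("*"):
            fields = line[1:].split()
            if fields:
                return fields[0]
    return None


# Illustrative output only; the addresses and statistics are made up.
sample = "\n".join([
    "     remote           refid      st t when poll reach   delay   offset  jitter",
    "==============================================================================",
    "*172.16.1.1      .GPS.            1 u   33   64  377    0.512    0.120   0.045",
    "+172.16.1.2      172.16.1.1       2 u   12   64  377    0.601   -0.310   0.080",
])

peer = sync_peer(sample)
print(peer if peer else "not synchronized to any source yet")
```

Right after a restart it is normal for no line to carry the '*' yet, so give the daemon a few minutes before deciding something is wrong.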

Windows

The Windows platforms use a subset of NTP called SNTP (for "Simple Network Time Protocol"). This time protocol is compatible with NTP servers, so Windows servers can participate in the same time synchronization environment.




NOTE: Windows time services are configured within a domain environment as well; designate one domain controller to receive time externally and allow Windows' time configuration to take care of the rest of your domain or forest if using a domain configuration.





To change the time source in Windows, enter:


NET TIME /SETSNTP:172.16.1.1



This will change the time source to point at server 172.16.1.1.

Then restart the W32TIME service:


NET STOP W32TIME
NET START W32TIME




4. Referential Integrity

The concept of Referential Integrity refers to the ability to properly track links between objects in the directory. For example, if a user is a member of a group, it is important that that reference to the group object be maintained regardless of where the group is located in the directory or what its name is. eDirectory maintains this information by using a piece of data (the "entry ID" or EID) that is not tied to the object's name or location in the directory.

Think of the EID as being similar to a database's primary key for a row of data: if the row of data is deleted, the key is no longer valid, and the reference is invalidated.

When an object is deleted from the directory, the references to that object become invalid and are removed from the objects that contain the reference. In some cases, this creates a slight inconvenience, but in others (e.g., volume objects), the loss of the server object in the tree will result in a schema class violation, and the object will mutate into an unknown object.
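
The idea can be illustrated with a toy model in Python. This is not how eDirectory is implemented; the class, the "refs" field, and the object names are all invented for the example. It only shows that a reference held by ID survives a rename, and that deleting the target removes the reference from whatever pointed at it.

```python
class Directory:
    """Toy directory: objects are keyed by an ID, never by name."""

    def __init__(self):
        self.objects = {}   # eid -> {"name": str, "refs": set of eids}
        self.next_eid = 1

    def add(self, name):
        eid = self.next_eid
        self.next_eid += 1
        self.objects[eid] = {"name": name, "refs": set()}
        return eid

    def delete(self, eid):
        del self.objects[eid]
        # Remove the now-invalid reference from every object that held it.
        for obj in self.objects.values():
            obj["refs"].discard(eid)


d = Directory()
group = d.add("Staff")
user = d.add("arthur")
d.objects[user]["refs"].add(group)       # the user is a member of the group

d.objects[group]["name"] = "Employees"   # renaming does not touch the ID
print(group in d.objects[user]["refs"])  # True

d.delete(group)                          # deleting invalidates the reference
print(d.objects[user]["refs"])           # set()
```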

Why does this matter? Because in the next step we will be breaking some objects. See Step #2, though. Don't panic. It's ok that we are going to break them. This is where we diverge from Jim's original article.


5. Delete the Old Server Object

The NCP Server object for the down server must be deleted from eDirectory. This is necessary in order to allow the server to be reinstalled into the tree as if it were a new server. Delete ONLY the server object, no others.

This will have several side effects, most of them beneficial. First, eDirectory will clean up the replica rings for the partitions that the down server held replicas of. As the server object is removed, these partitions should be able to synchronize the remaining replicas on other servers. If this server held the MASTER replica of any partition, that will have to be fixed, but this is a minor problem.

Second, any other objects in eDirectory that referenced this server's NCP Server object will mutate to Unknown objects. This is eDirectory's mechanism for dealing with a problem that is not supposed to happen. If the object is defined in the schema with attributes that are flagged as 'mandatory', then the object cannot exist without them. Several objects (Volume, NLS LSP License, Volume Pool) that are associated with servers are defined with the EID of the server object as an attribute, and this is flagged as 'mandatory'. When the server object is removed, eDirectory cleans up and removes the invalid EID reference to the server. This then violates the schema definition for the object. In order to maintain consistency, eDirectory modifies the invalid object to object class "Unknown". This class has no mandatory attributes, so the object can exist without problems. The original object class is preserved in a secondary attribute.
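
As a rough mental model only, the mutation can be sketched in a few lines of Python. The dictionary layout, the EID value 1001, and the one-entry schema are simplifications I made up for the example; the attribute names Host Server and Unknown Base Class are the ones you will see on the real objects later in this article.

```python
# Toy schema: class name -> set of mandatory attribute names.
SCHEMA = {
    "Volume": {"Host Server"},
    "Unknown": set(),
}


def remove_server_reference(obj, server_eid):
    """Drop references to a deleted server; mutate the object if its class rules break."""
    for attr, value in list(obj["attrs"].items()):
        if value == server_eid:
            del obj["attrs"][attr]
    missing = SCHEMA[obj["class"]] - set(obj["attrs"])
    if missing:
        # Remember what the object used to be, then drop to the class
        # that has no mandatory attributes.
        obj["attrs"]["Unknown Base Class"] = obj["class"]
        obj["class"] = "Unknown"
    return obj


# Hypothetical Volume object whose Host Server points at server EID 1001.
vol = {"class": "Volume", "attrs": {"Host Server": 1001, "Host Resource Name": "SYS"}}
remove_server_reference(vol, 1001)
print(vol["class"])                        # Unknown
print(vol["attrs"]["Unknown Base Class"])  # Volume
```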

We want to leave the Volume objects in place if at all possible, because other objects in eDirectory may have references to them. This primarily affects User objects and their Home Directory attribute, because that contains the EID of the Volume object on which the directory resides. If we delete the Volume object, then all of these other objects will be updated to clean up the references. If, on the other hand, we can preserve the Volume object and fix it later, the other objects that reference it do not need to know that anything happened.


6. Documentation and What Else Needs To Be Fixed

If the crashed server is a NetWare server, some additional documentation about the server's configuration is available for recovery. With the TSA used for filesystem backup, there is a target called Server Specific Information (or SSI) that contains the following files:
SERVDATA.NDS
DSMISC.LOG
VOLSINFO.TXT
STARTUP.NCF
AUTOEXEC.NCF

TID 10062402 (http://support.novell.com/cgi-bin/search/searchtid.cgi?/10062402.htm) describes the contents of each of these files. For this procedure, use all of these files except for the SERVDATA.NDS file.

When restored, these files will be placed into a directory under SYS:SYSTEM that is named after the server. For example, if the server's name is FS1, the path this server's SSI files will be restored to will be SYS:SYSTEM\FS1, regardless of the server to which data is restored.

If the crashed server is Linux, Unix, or Windows, see Step #1 and Be Prepared. You will need to know what partitions this server held replicas of. Lacking documentation, you will have to examine all partitions in your tree in the next step. It's better to have the documentation ahead of time.


7. Verify Replica Rings are Clean

Using DSREPAIR - and, if available, the DSMISC.LOG from the SSI restore - verify that the replica rings are healthy. As cited in the article Using the DSREPAIR Utility Appropriately (http://www.novell.com/coolsolutions/feature/15312.html), this is one of the circumstances where DSREPAIR should be used.

Correct any replica ring inconsistencies using DSREPAIR with the commands listed in the table below:


Platform      Command
NetWare       DSREPAIR -A
Windows       Start DSREPAIR.DLM with -A
Linux/Unix    ndsrepair -P -Ad



Clearing the replica ring inconsistencies involves forcibly removing the server from the replica ring - this should be handled by step 5, but verification is very important. If the server needs to be removed, view the servers in the affected replica ring, and select the option to remove the server from the ring. If the crashed server held the master replica for any partition, use the DSREPAIR utility to designate a new server as the master for the ring; after the recovery is complete, use the standard administration tools (not DSREPAIR) to move the master to the crashed server. Remember that DSREPAIR should be used for this type of operation only when the standard utilities fail to do so, and only after diagnosing the problem.
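
If you documented your replica rings ahead of time, a few lines of Python can turn that record into a worklist for this step. Everything in this sketch is hypothetical: the partition names, the server names, and the dictionary format are just one way you might have written the information down.

```python
# Hypothetical record of replica rings: partition -> {server: replica type}.
rings = {
    "[Root]": {"FS1": "Master", "FS2": "Read/Write"},
    "OU=Sales.O=Acme": {"FS2": "Master", "FS1": "Read/Write", "FS3": "Read/Write"},
    "OU=Lab.O=Acme": {"FS3": "Master"},
}


def ring_actions(rings, down_server):
    """List the rings to verify, and what to check in each one."""
    actions = []
    for partition, replicas in sorted(rings.items()):
        if down_server not in replicas:
            continue
        if replicas[down_server] == "Master":
            actions.append((partition, "remove from ring and designate new master"))
        else:
            actions.append((partition, "remove from ring"))
    return actions


for partition, action in ring_actions(rings, "FS1"):
    print(partition, "->", action)
# OU=Sales.O=Acme -> remove from ring
# [Root] -> remove from ring and designate new master
```

The script does not touch the tree; it only tells you which rings to look at in DSREPAIR and which of them lost their master.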


8. Rebuild and Re-Patch the Crashed Server

Rebuild the server using configuration information restored from SSI or as found in your documentation (See Step #1).




NOTE: For NetWare servers only, the server must be installed into a temporary tree in order to complete the installation. This step is only necessary on NetWare because the eDirectory installation is an integral component of the OS installation; with the other platforms, patches can be applied prior to the eDirectory configuration.





This rebuild should include all patches that were on the system prior to the crash.




WARNING: The server must be rebuilt utilizing the configuration in use prior to the server crash. This is not the time to change IP addresses, rename the server, or perform other maintenance tasks.






9. Fix File System Trustee Assignments for SYS Volume




NOTE: This step applies to NetWare servers only.





Trustee assignments on Traditional File System (TFS) and NetWare 5's version of NSS used the EID of objects stored locally on the server. If the object did not exist in a local replica, an external reference was created so there would be an EID on the local server for this purpose.

When rebuilding a server like this, any pre-existing filesystem trustee rights will be assigned to EIDs set up for [Public], [Root], and other local IDs. When the DS is removed from the server in the next step, those file system trustees are orphaned. A reinstall of the directory can cause unusual filesystem trustee assignments to appear, because different EIDs are used for those objects. This makes it necessary to delete the existing trustee assignments.

With NSS on NetWare 6 and later, the GUID attribute is used by the filesystem to hold the trustee assignment information. However, the GUIDs will no longer be valid (or will be assigned to other objects in the tree) because this server was installed into a temporary tree, so even with filesystem rights managed by GUID, this step is still necessary.

If recovering a lost SYS volume, LOAD DSREPAIR -XK6 and select "Check Volume Objects and Trustees" in the advanced menu. When prompted to make the change on the SYS volume, answer no. For all other volumes, answer yes.




10. Remove eDirectory if Installed into Temporary Tree




NOTE: This step applies to NetWare servers only.





In order to reinstall the server into the production tree, it is necessary to remove the server from its temporary tree.


11. Reconfigure Time Synchronization

If, in Step 3, it was necessary to make changes to the environment's time synchronization configuration, this may be a good time to set the configuration as it was prior to the start of recovery. This is not a required step, but one goal of a DR procedure should be to restore the initial configuration. You may also want to wait and do this later.


12. Install Rebuilt Server into Production Tree

At this point, everything should be ready for the server's re-installation into the tree. Perform one last health check on the production tree using the procedures in the article Using iMonitor to Perform eDirectory Health Checks (http://www.novell.com/coolsolutions/feature/15336.html).


13. Fixing the Unknown objects

Using iManager or other management utilities now, you will see that some objects representing this server's volumes are no longer "Volume" objects, but are now "Unknown". As discussed above, this is eDirectory's protection mechanism for itself. Now it is time to un-do what it did.

13.1 Volumes

First, you need to re-establish the link from the Volume object to its host NCP Server. This recreates the lost Host Server attribute on the object. Use dsrepair / Advanced Options / Check Volumes and Trustees.

To confirm that this worked, use iMonitor to view the Volume object. It should now have a Host Server. It will still be an Unknown object, but now it has everything it needs to be converted back to its original object class. You can also see in iMonitor that the original base object class of the object is stored in an attribute named Unknown Base Class. As long as the Unknown Base Class attribute is available, and the 'mandatory' attributes are present, iMonitor can restore the object to the correct class.

To do this, enable iMonitor's "Advanced Mode (possibly dangerous)". Click on the NDS iMonitor logo in the upper left corner to get to the screen where this mode can be enabled.



Click the Enabled radio button, then Submit. This enables some hidden options that are, as the message says, potentially dangerous, but that are also potentially useful. When you click Submit, nothing interesting will seem to have happened. Browse back to the Unknown Volume object. You will now have a new menu option, Advanced Options.

Select the Advanced Options link to see the available choices.



Because you've enabled the hidden Advanced Mode, and are electing to use a feature that, at least in theory, could cause a problem, Novell have a disclaimer here letting you know that if something goes wrong, it's not their fault. Click on the Mutate Object radio button, and then on Ok.

iMonitor will use the Unknown Base Class attribute to figure out what type of object this is supposed to be. It will check to see that the required mandatory attributes are present. And it will then update the object class to restore the object back to what it was before.
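
The three checks just described can be written out as a toy model in Python. This is my own simplification for illustration, not iMonitor's code; the dictionary layout, the one-entry schema, and the EID value 2002 are invented.

```python
# Toy schema: class name -> set of mandatory attribute names.
SCHEMA = {"Volume": {"Host Server"}}


def can_mutate_back(obj):
    """Return (ok, detail): detail is the class to restore, or the reason it cannot be."""
    base = obj["attrs"].get("Unknown Base Class")
    if obj["class"] != "Unknown" or base is None:
        return False, "no Unknown Base Class to restore from"
    missing = SCHEMA.get(base, set()) - set(obj["attrs"])
    if missing:
        return False, "missing mandatory attributes: " + ", ".join(sorted(missing))
    return True, base


def mutate_back(obj):
    """Restore the original class once the mandatory attributes are present."""
    ok, detail = can_mutate_back(obj)
    if not ok:
        raise ValueError(detail)
    obj["class"] = detail
    return obj


vol = {"class": "Unknown", "attrs": {"Unknown Base Class": "Volume"}}
print(can_mutate_back(vol))  # (False, 'missing mandatory attributes: Host Server')

vol["attrs"]["Host Server"] = 2002  # hypothetical EID of the rebuilt server
print(mutate_back(vol)["class"])    # Volume
```

This is why the Host Server link has to be re-established with dsrepair first: without it, the object does not yet satisfy the Volume class definition.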

You will need to do this for each Volume object. But this only takes a few seconds per object. You will not be able to do this for the _Admin volume. That is ok.

13.2 NSS Volume Pools

For the NSS Volume Pool objects, the easiest way to fix them is to delete them and recreate them from NSSMU. Nothing else should be referencing them, so there is no loss in doing so.

13.3 _Admin Volume

For the _Admin volume, which is not actually a real disk volume anyway, you can delete the object and recreate it with iManager or ConsoleOne.

13.4 NLS LSP License

Delete the NLS LSP License object, if present, then reinstall NLS on the server. That will recreate the object.


14. Re-issue Server Certificates

Server certificates are created in the process of a normal installation of eDirectory. In some cases, this step may be unnecessary, but you should verify that the certificates that exist for the server are valid.

If the server being recovered is the certificate authority for the tree, this step can be much more involved:


  1. Delete the existing CA
  2. Create a new CA
  3. Re-issue all certificates in the tree



The last step can be done over time - the certificates will not be invalidated by the CA being replaced, but the certificates will be non-verifiable because the signing CA no longer exists. Eventually, they should be replaced with freshly issued certificates. See TID 10089041 (http://www.novell.com/support/php/search.do?cmd=displayKC&docType=kc&externalId=7000484&sliceId=1&docTypeID=DT_TID_1_1&dialogID=154949987&stateId=0 0 154947912) for details.

If the server is not the CA for the tree, issue new certificates only for the server itself. The SSL CertificateIP and SSL CertificateDNS certificates will likely need to be reissued. You can do this with iManager.


15. Perform Post-Recovery Tasks

Now that the recovery of the eDirectory server is completed, the total recovery of the server can be concluded by performing various recovery tasks.


  1. Restore Data and Trustee Information
  2. Re-install Server-based Applications
  3. Re-establish Replica Information
  4. Identity Manager
  5. Message Server



15.1 Restore Data and Trustee Information

When restoring data on a NetWare server, the option exists in most backup software to perform a data restore or a trustee restore. For the SYS volume, it is recommended that trustees be restored unless there is application data that needs to be restored. For other volumes, it depends on whether the volume had to be recovered. For example, if only the SYS volume on a server was lost, there's no need to restore the data for the other volumes - but you may wish to restore trustee information, depending on whether the trustee assignments were preserved properly. Chances are good if using NetWare 6 or later, the trustee information is fine, as that information is based on object GUIDs rather than EIDs. If the trustees appear not to have recovered properly, use the DSREPAIR -XK6 option outlined in step 9 on each volume affected and then perform a trustee-only restore.

15.2 Re-install Server-based Applications

If there are server-based applications installed on the server - anti-virus, backup, firewall, or others - restore these as well. When this step is performed depends on the criticality of the application to the business; less important applications may be able to wait until after data is restored. More important applications may need to be installed as soon as possible.

15.3 Re-establish Replica Information

Using iManager or ConsoleOne and information from the DSMISC.LOG file from SSI (if this is available), replace the replicas that were on the server and set the replica types as appropriate. This step can take some time to complete and may impact performance of the server; it may be advisable to do this after hours, depending on the number of replicas and their sizes.

15.4 Identity Manager

If the crashed server hosted an Identity Manager Driver Set, it is important to note that several of the attributes of the driver set and driver are flagged 'per-server' in the schema definitions. This means that they do not replicate to other servers in the tree; they are stored only locally in this server's DIB set. Since this server crashed and has been rebuilt, this data has been lost. The easiest and best way to recover from this is to use IDM Designer (http://www.novell.com/coolsolutions/dirxml/designer/) and to re-deploy the Driver Set. Of course, this assumes that you have planned ahead (See Step #1 above) and that all of your IDM development has been done in Designer. But if you are doing IDM development, you should be doing your development in Designer anyway.

15.5 Message Server

The Message Server attribute of a User object (seen in the management utilities as "Default Server") is an EID reference to a Server object. As discussed above, eDirectory will have cleaned up these references, so the User object that had a Message Server reference to the deleted Server object will no longer have it.

On one hand, this could be seen as a failure of this recovery method vs. the use of 'srvref', as that did save and restore the Message Server references. If this is important, it is possible to export these references via LDAP and to restore them afterwards.

On the other hand, the only thing that I am aware of that uses the Message Server attribute is the old broadcast message protocol. In many environments, broadcast messages from one user to another are not exactly important. In this case, the Message Server attribute does not matter, and can be safely ignored.

Another possibility would be to use LDAP to search for User objects where the Message Server attribute is missing, then use LDAP to update those objects with a new Message Server attribute pointing at the recovered server's NCP Server object.
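
As a sketch of that last approach, the following Python builds an LDAP search filter and an LDIF file that your LDAP tools of choice could then use. Treat the details as assumptions to verify in your own tree: I am assuming the Message Server attribute is mapped to the LDAP name messageServer and that User objects are mapped to inetOrgPerson (check the attribute and class mappings on your LDAP Group object), and the DNs are invented. The sketch also assumes plain ASCII DNs that need no base64 encoding in LDIF.

```python
# Filter for User objects that have no Message Server value at all
# (attribute and class names are assumed LDAP mappings; verify them first).
MISSING_FILTER = "(&(objectClass=inetOrgPerson)(!(messageServer=*)))"


def message_server_ldif(user_dns, server_dn):
    """Build LDIF 'modify' records that add messageServer to each user."""
    records = []
    for dn in user_dns:
        records.append(
            "dn: {}\n"
            "changetype: modify\n"
            "add: messageServer\n"
            "messageServer: {}\n"
            "-\n".format(dn, server_dn)
        )
    # Each record ends with a newline, so joining with "\n" leaves the
    # blank line that LDIF requires between records.
    return "\n".join(records)


# Hypothetical DNs, as they might come back from a search using MISSING_FILTER.
users = ["cn=arthur,ou=users,o=acme", "cn=ford,ou=users,o=acme"]
print(message_server_ldif(users, "cn=FS1,o=acme"), end="")
```

Review the generated LDIF before importing it, and try it against a single test user first.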
