rafaelrpm Absent Member.

eDir comms failed after reboot

Replica ring with 6 servers (four 8887 and two 904), large DIB with 14GB, around 1.4m objects.
One 8887 server was being removed from the replica and after all partitions removed, ndsconfig rm was executed, triggering as expected a huge sync between the servers and some -625 and -626 false-positive errors.
One of these 904 servers that remained in the replica ring (and also runs IDM with a driverset associated) was rebooted and when it got back, it was not receiving synchronization. I could see some -622 and -625 errors on ndstrace +SYNC +SYDL while others were flooded with sync operations. After ndsrepair -N with repair all network addresses, the server started receiving inbound sync again.
Question is: what could have triggered the network comm failure?
I could not see errors in /var/log/messages for the startup, and in ndsd.log I saw "Received invalid signal -1." lines, but other than that all modules were started successfully.
I have some logs but due to rotation I don't have the logs from when the server was not receiving sync.

Re: eDir comms failed after reboot

On 05/16/2018 02:24 PM, rafaelrpm wrote:
>
> Replica ring with 6 servers (four 8887 and two 904), large DIB with
> 14GB, around 1.4m objects.
> One 8887 server was being removed from the replica and after all
> partitions removed, ndsconfig rm was executed, triggering as expected a
> huge sync between the servers and some -625 and -626 false-positive
> errors.


Why would removing the server cause a huge sync? That does not make
sense, and should not normally happen. Removing a server from the tree
might delete at least one NCP Server object from the tree, and before that
will remove one value from one attribute on each partition root object the
server held in its DIB (a few bytes each, so maybe 1 KiB for a really
important server), and might even change a few other attribute references
(LDAP Server, LDAP Group, SAS Service, etc.), but that's not a storm by
any definition.  In your case you removed replicas first, so even the
per-partition attribute value removal was omitted. There might be a lot
of changes within a DIB as external references are updated, but that's not
replication either.

You presumably saw something making you believe there was a sync storm,
but what, and where, and is that something you see at other times?

> One of these 904 servers that remained in the replica ring (and also
> runs IDM with a driverset associated) was rebooted and when it got back,
> it was not receiving synchronization. I could see some -622 and -625
> errors on ndstrace +SYNC +SYDL while others were flooded with sync
> operations. After ndsrepair -N with repair all network addresses, the
> server started receiving inbound sync again.


A -625 is often reported after eDirectory restarts, so that is not too
unusual. If it does not clean up within a reasonable period of time then
that is odd, though. You should not need to force any kind of repair to
get there, but "reasonable period of time" may be something longer than a
few seconds, or even a few minutes, depending on your environment.  It
should definitely be within an hour on the high end, and that would be
unusual with an environment your size during a normal day.
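
One way to judge whether the -625 is clearing on its own is to save trace
output at intervals and count the error codes in each capture.  The
following is only a rough sketch: it assumes nothing about the trace format
beyond error codes appearing as standalone tokens like -625, and the
function name is made up for illustration.

```python
import re
from collections import Counter

# Matches a standalone negative three-digit code in the 6xx range, such as
# -622 or -625.  The lookbehind skips things like "100-625"; the lookahead
# skips longer numbers such as "-6250".
ERROR_CODE = re.compile(r"(?<![\w-])-(6\d{2})(?!\d)")


def tally_error_codes(lines):
    """Count how often each -6xx code appears in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for code in ERROR_CODE.findall(line):
            counts[-int(code)] += 1
    return counts
```

Running this against captures taken a few minutes apart, and comparing the
-625 counts, would show whether the error is fading or persisting.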

> Question is: what could have triggered the network comm failure?
> I could not see errors in /var/log/messages for the startup and in
> ndsd.log I saw "Received invalid signal -1." lines but other than that
> all modules were started successfully.
> I have some logs but due to rotation I don't have the logs from when the
> server was not receiving sync.


It would be interesting to see if the -625 happens again when eDirectory
(but not the server) is restarted. It would also be interesting to see
the output showing the -625, and understanding that portion. Every box
has network addresses of other boxes in the replica ring, so merely
rebooting would not prevent knowing where to replicate data, but all kinds
of things could potentially block the data at the time. Had you checked
replication to ensure no errors prior to the reboot? Did you test to
ensure network comms really were down at the TCP or IP layers using
something like netcat or were all of your conclusions from eDirectory's
own tools?
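
If netcat is not handy, the same TCP-level check can be sketched in a few
lines of Python.  This is a minimal example, not a supported tool: it
assumes eDirectory's NCP listener is on its default TCP port 524 on every
server in the ring, and the function names are made up for illustration.
Adjust the port if your servers are configured differently.

```python
import socket
import sys

NCP_PORT = 524  # assumption: NCP listener on its default TCP port


def tcp_reachable(host, port=NCP_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ring(hosts, port=NCP_PORT, timeout=3.0):
    """Map each host to whether its TCP port accepted a connection."""
    return {host: tcp_reachable(host, port, timeout) for host in hosts}


if __name__ == "__main__":
    # Pass the replica ring members' addresses on the command line.
    for host, ok in check_ring(sys.argv[1:]).items():
        print(f"{host}:{NCP_PORT} {'reachable' if ok else 'UNREACHABLE'}")
```

Run it from the server that was not receiving sync, passing the addresses
of the other ring members.  A successful connect only shows the TCP path is
open; it says nothing about whether eDirectory itself is healthy.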

--
Good luck.

If you find this post helpful and are logged into the web interface,
show your appreciation and click on the star below.

If you want to send me a private message, please let me know in the
forum as I do not use the web interface often.