jmckinne Absent Member.

eDir 8.8.8.8 (-618) Inconsistent Database error detection

I recently had a high-priority incident with a (-618) Inconsistent Database error reported by an LDAP application that was failing to authenticate and/or authorize its users. The tricky part was that only one LDAP application was affected, and I had a really difficult time finding which object or objects were returning the (-618) error.

It turns out the following was true on the Users OU of the directory.

-618 FFFFFD96 INCONSISTENT DATABASE

Possible Cause: An occurrence of this error during the eDirectory or NDS Limber background process can indicate replica ring discrepancies if the following conditions exist:

The source server does not have a replica attribute value for itself in one or more of the replica rings of the eDirectory or NDS partitions held by the source server.
The source server was unable to locate the replica attribute value identifying the eDirectory or NDS server holding the master replica of an eDirectory or NDS partition held by the source server.
Action: Run DSREPAIR > Advanced Options > Repair Local eDirectory or NDS Database.

Of course, I ran ndsrepair on the local DIB set on all 6 of my LDAP servers and it did resolve the issue. But the question I'm being asked, and am now posing, is: how could I proactively monitor for this type of error? Or, if monitoring is not possible, how could I have found the corrupt object sooner?

I was having a heck of a time trying to determine which server was returning (-618) Inconsistent Database to the application, and for which object. The application is not very well written: its searches were not filtered to just user objects, even though that's all it needed, but that is beside the point. I did attempt ndstrace with LDAP, Inbound and Outbound synchronization turned on, but on 6 different, very busy production LDAP servers I was unable to find anything. Are there other ndstrace flags that would have been more helpful? Is there an iMonitor report I could have run?

Thanks in advance for the help and advice,

Joe
Knowledge Partner

Re: eDir 8.8.8.8 (-618) Inconsistent Database error detection

If I were in your shoes trying to figure out which server a client
application was accessing without having the configuration directly from
that application I would run a LAN/wire trace from that application's host
and see what it was doing, using the following command:


sudo /usr/sbin/tcpdump -n -s 0 -i any -w /tmp/ldap.cap -v port 389 or port 636


You could remove the '-w /tmp/ldap.cap' and '-v' parts to see the lines
written directly to the screen and have enough to see where it is
pointing, but I prefer writing to a file and then opening in Wireshark
because it lets me do a lot more than I could from seeing just the packet
headers from the command line.

Otherwise iMonitor is the right way to see tree health generally, and
using ndsrepair JUST for its Report Synchronization Status output can be
useful:


ndsrepair -E


If that does not turn up anything on any box, though, then the corruption
is not with replication but is just with an object in the DIB, and now
you're back to the LAN/wire trace and figuring out what the application is
doing. ndstrace can help with that, but busy environments make it a
little harder sometimes. Normally you should be able to get past that,
though, using ndstrace and 'less' or 'vi' or something to work through the
files. You know the IP address of the client application so searching for
that in the trace will show the connections that come in. Searching for
-618 should also let you focus in from later in the same connection, so
you can see which object is being accessed.
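
The search described above can be scripted once the trace has been saved to a file. What follows is a minimal sketch, not a vendor tool. It assumes only that the trace is plain text and that the client address and the error code appear literally in it; it makes no assumption about the rest of the ndstrace line format. The function names and the 20-line window are illustrative choices.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple


def find_error_context(
    lines: Iterable[str],
    client_ip: str,
    error_code: str = "-618",
    context: int = 20,
) -> Iterator[Tuple[int, bool, List[str]]]:
    """Yield (line_number, client_seen, block) for each line containing error_code.

    block is the matching line preceded by up to `context` earlier lines.
    client_seen is True when client_ip occurs anywhere in that block.
    Matching is plain substring matching, so "-618" also matches "-6180"
    and "10.0.0.1" also matches "10.0.0.10"; tighten it for real use.
    """
    window = deque(maxlen=context)  # rolling buffer of the most recent lines
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if error_code in line:
            block = list(window) + [line]
            client_seen = any(client_ip in entry for entry in block)
            yield number, client_seen, block
        window.append(line)


def scan_file(path: str, client_ip: str) -> None:
    """Print every error block found in the trace file at `path`."""
    with open(path, errors="replace") as trace:
        for number, client_seen, block in find_error_context(trace, client_ip):
            note = "client seen nearby" if client_seen else "client not in window"
            print(f"--- line {number} ({note}) ---")
            print("\n".join(block))
```

Calling `scan_file` with the path of a saved trace and the application's IP address prints each -618 with the lines leading up to it, which is where the object being accessed would be expected to show up. On a busy server the window may need to be much larger than 20 lines.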

--
Good luck.

If you find this post helpful and are logged into the web interface,
show your appreciation and click on the star below.

If you want to send me a private message, please let me know in the
forum as I do not use the web interface often.
jmckinne Absent Member.

Re: eDir 8.8.8.8 (-618) Inconsistent Database error detection

Thanks, ab

OK. So, that's essentially what we did. I don't have direct access to these production boxes, so I was running everything from iMonitor only. Of course, that limits my ability to grep or less or anything like that from the console. iMonitor was not showing any errors in replica synchronization. The only time I actually saw the (-618) error in iMonitor was when browsing directly to the corrupt object on the server that held it. We did end up requesting a wire trace, but what we got was not the complete file. It did help me narrow it down to the exact server. Next time, I will ask for the full cap file.

The purpose of my post was to make sure that, when I tell the powers-that-be there's no good vendor tool that will monitor for this type of error, I'm correct in saying so. It sounds like maybe we could monitor the traffic for that specific error, but there is nothing in the Micro Focus/NetIQ tooling that would have alerted us to the error in this specific incident. Obviously, a (-618) error when opening a database would be easy to identify. 🙂

Finally, my last thought was to schedule ndsrepair periodically, but that presents its own set of monitoring and alerting issues, and the results are only point-in-time anyway. I'm not sure I have a great answer for monitoring and detecting this type of error.

Thanks again,

Joe
Knowledge Partner

Re: eDir 8.8.8.8 (-618) Inconsistent Database error detection

On 11/14/2018 10:36 AM, jmckinne wrote:
>
> The purpose in me posting this was to make sure, when I told the
> powers-that-be, that there's no good vendor tool that will monitor this
> type of error, I'm correct in saying so. It sounds like maybe we could


Usually -618 means some kind of corruption of the filesystem has happened
for some strange reason. As a result, monitoring for it is pretty hard to
do since it's usually a hardware-ish level thing. eDirectory does not
randomly scan its entire DIB just looking for possibly-wrong things, as
that would be crazy on multiple levels. Sometimes errors may show up in
the ndsd.log file when they are encountered/thrown but I have not normally
used that as a specific troubleshooting step.

> monitor the traffic for that specific error, but nothing in the
> Microfocus/NetIQ tooling that would alert us on such an error for this
> specific incident. Obviously, a (-618) error when opening a database
> would be easy to identify. 🙂


Yes, usually those are nicely clear, but that's the only thing about them
that is nice. 🙂

> Finally, my last thought was a periodic scheduling of ndsrepair, but


No, don't do that.

> that presents its own set of monitoring and alerting issues and those
> are only point in time anyway. I'm not sure I have a great answer for
> monitoring and detecting this type of error.


Running a repair should be done when it is believed that doing so will fix
a specific problem that exists. Running it for no reason other than
hoping to fix things that have not been found is a good way to cause
problems. While not common, and usually bugs, ndsrepair issues have
caused bigger problems than the one you had.

Ultimately if something in eDirectory can result in a -618 the right fix
is to have engineering fix it. As that has happened over the years, -618s
are rare (other than for hardware issues mentioned above) and it would be
nice to make sure any new issues are summarily squashed as well.

Knowledge Partner

Re: eDir 8.8.8.8 (-618) Inconsistent Database error detection


> Running a repair should be done when it is believed that doing so will fix
> a specific problem that exists. Running it for no reason other than
> hoping to fix things that have not been found is a good way to cause
> problems. While not common, and usually bugs, ndsrepair issues have
> caused bigger problems than the one you had.


One example of ndsrepair causing an issue: there was a simple
schema mistake, which could have been fixed, but a repair was run first, and
the data was cleared from all 20,000 users because the repair considered it
incorrect. Had they fixed the schema first, it would have been a non-issue.
That is an edge case, but you never know.

jmckinne Absent Member.

Re: eDir 8.8.8.8 (-618) Inconsistent Database error detection

Thanks, geoffc and ab,

Yes, very valid points. I will refrain from recommending scheduled ndsrepairs. Thank you both for your insights. I agree that this is extremely rare. I have only seen one other -618, and it was during ndsd startup, so this one was definitely unique for me.

Right now, my best lead is actually coming from the network. The F5 is showing the servers going up and down at random times and for random durations. Of course, the servers themselves are not going down at all: their uptimes show no reboot around the time the LDAP application first detected this error. However, the communication between the servers and the F5 is suspect, so I'm investigating that angle at the moment. The file system and server hardware seem fine, but I will make an inquiry there as well.

Thanks again,

Joe
jmckinne Absent Member.

Re: eDir 8.8.8.8 (-618) Inconsistent Database error detection

I just discovered something I find to be very interesting, strange and alarming. It definitely points to a communication issue going on in this tree.

The revision count on the object which had the -618 error is extremely high for an OU object: Revision Count 1,863,165 (commas added for clarity)

Creation TimeStamp 08/01/05 10:44:30 AM 1:1
Modification TimeStamp 11/14/18 09:37:55 AM 2:1
Revision Count 1862987

As you can see, the OU was modified this morning, so I checked and saw that the following attributes were modified at that time:
Partition Status, Transitive Vector, Purge Vector

Other attributes modified today:
Revision (duh), Local Received Up To, modifiersName (=the server object), and sssActiveServerList

So, I checked the replica synchronization in iMonitor and I do see periodic (-625) errors on various servers. There are 6 servers in total: 3 in the primary site and 3 in the active secondary site. The (-625) errors do clear up eventually, but they reappear, and not always on the same servers. There's definitely something going on communication-wise, which is very strange because our Identity Vault has an identical hardware, OS and eDirectory install (except for the number of partitions) and does not suffer from the same communication problems. A partition root in that tree has a revision count of 33,589. Granted, there are many different reasons an object could be modified.

I did check the other partition roots on the LDAP tree (Yes, it has 6 partitions! Don't ask, I inherited this tree 😐 ) and they all have similarly high revision counts. In fact the group OU has over 2 million.

What do you think? Could communication issues like these be the root cause in this tree? To me that seems the most likely, but I'd appreciate your input.

Thanks again,

Joe
Knowledge Partner

Re: eDir 8.8.8.8 (-618) Inconsistent Database error detection

On 11/14/2018 03:34 PM, jmckinne wrote:
>
> The revision count on the object which had the -618 error is extremely
> high for an OU object:Revision Count 1,863,165 (commas added for
> clarity)
>
> Creation TimeStamp 08/01/05 10:44:30 AM 1:1
> Modification TimeStamp 11/14/18 09:37:55 AM 2:1
> Revision Count 1862987


Thirteen years of changes, and 2 million revisions; sounds just fine to
me. If it syncs every hour, that's 24/day * 365 days/year * 13 years =
113,880 changes just from any single replica's attributes being updated,
but then you have multiple replicas, so multiple transitive vector and
replica attribute values, plus actual changes to this container's objects
causing real changes.
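
The arithmetic above can be checked against the numbers posted from iMonitor. This is only a back-of-the-envelope sketch: it assumes the timestamps are in MM/DD/YY form (so the object was created on August 1, 2005), and it treats the Revision value as a simple running count on this one replica.

```python
from datetime import date

HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
YEARS = 13

# Baseline from the reply: one change per hourly sync for thirteen years.
hourly_sync_changes = HOURS_PER_DAY * DAYS_PER_YEAR * YEARS

# Values quoted from the object's iMonitor page earlier in the thread.
created = date(2005, 8, 1)          # assumes MM/DD/YY timestamps
last_modified = date(2018, 11, 14)
revision_count = 1_862_987

days_alive = (last_modified - created).days
revisions_per_day = revision_count / days_alive
revisions_per_hour = revisions_per_day / HOURS_PER_DAY

print(f"hourly-sync baseline over {YEARS} years: {hourly_sync_changes:,}")
print(f"days between creation and last modification: {days_alive:,}")
print(f"average revisions per day:  {revisions_per_day:.0f}")
print(f"average revisions per hour: {revisions_per_hour:.1f}")
```

The observed count works out to roughly 16 revisions per hour on average, against a baseline of one per hour for a single replica's hourly update. The rest has to come from the additional replicas and from real changes under the container.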

> As you can see, the OU has been modified this morning so I checked and
> saw the following attributes were modified at that time:
> Partition Status, Transitive Vector, Purge Vector,


A partition root sees updates every hour; that's nothing.

> Other attributes modified today:
> Revision (duh), Local Received Up To, modifiersName (=the server
> object), and sssActiveServerList


Normal things, yes.

> So, I checked the replica synchronization in iMonitor and I do see
> periodic (-625) errors on various servers. There are a total of 6
> servers. 3 in primary site and 3 in active secondary site. The (-625)
> errors do clear up eventually but re-appear and not always on the same
> servers. There's definitely something going on communication wise.
> Which is very strange because our Identity vault has an identical
> hardware, OS and eDirectory install (except for the number of
> partitions) and it does not suffer from the same communication problems.
> A partition root in that tree has a Revision count of 33,589. Granted,
> there are many different reasons an object could be modified.


Keep in mind that I believe Revision is also a per-replica attribute, so
it will not replicate, so a server that is new will have a low count,
while the oldest server with that object may have a high one.

> I did check the other partition roots on the LDAP tree (Yes, it has 6
> partitions! Don't ask, I inherited this tree 😐 ) and they all have
> similarly high revision counts. In fact the group OU has over 2
> million.


Merge the partitions if you do not need them; they just increase
maintenance/monitoring time otherwise.

> What do you think? A tree with communication issues like this, could it
> be the root cause? To me, it seems the most likely, but I'd appreciate
> your input.


-625s are not good, but they do not cause -618s; typically they cause
other things, like authentication or authorization problems because
something does not replicate properly. The biggest cause of these, in my
experience (both in Support and now as a consultant) is misconfigured
networks/routers/firewalls, for example allowing access from box A (the
master of [root] and everything) to talk to box B and box C in an
unsolicited way on TCP 524 (by default), but without allowing the reverse
direction as well. As a result when box A gets a change, it sends just
fine, but when box B needs to send its change, it cannot until box A
reaches out randomly (or during a heartbeat, or its own change), and
that's just poor design of the networking stuff. Both network-based and
host-based firewalls should be configured properly to allow any eDir box
to be reached by every other eDir box. Usually outbound rules do not
apply as outbound (on a host firewall) is wide open, but verify that
assumption or Murphy will get you.
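
One way to test the "every box can reach every other box" requirement is a plain TCP connect check, run from each server in turn. The sketch below is an illustration under stated assumptions, not an eDirectory tool: it uses TCP 524 because that is the default mentioned above, the host names in the usage note are hypothetical, and a successful connect shows only that the port is reachable from the machine running the script.

```python
import socket
from typing import Dict, Iterable

NCP_PORT = 524  # default port named above; change it if the tree uses another


def check_reachable(
    hosts: Iterable[str],
    port: int = NCP_PORT,
    timeout: float = 3.0,
) -> Dict[str, bool]:
    """Return {host: True/False} for whether a TCP connection could be opened.

    This tests one direction only: from the machine running the script to
    each host. Run it on every server to cover the reverse paths as well.
    """
    results: Dict[str, bool] = {}
    for host in hosts:
        try:
            # create_connection resolves the name and performs the TCP handshake
            with socket.create_connection((host, port), timeout=timeout):
                results[host] = True
        except OSError:
            # covers refused connections, timeouts, and name-resolution failures
            results[host] = False
    return results
```

For example, `check_reachable(["edir-b.example.com", "edir-c.example.com"])` run on box A, then the equivalent list run on box B and box C, would expose the one-directional rule described above: A reaches B, but B does not reach A.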

Anyway, I doubt any of this caused your -618, but I wouldn't bet my life
on anything without more evidence. eDirectory is made to handle comms
issues (maintenance, WAN link outages, etc.) so -625 should never hurt,
other than in the ways that are obvious (data not able to synchronize).

jmckinne Absent Member.

Re: eDir 8.8.8.8 (-618) Inconsistent Database error detection

OK, so a more likely root cause would be a file system issue at the moment the replica holder was attempting to update the object. Doesn't it stand to reason, then, that more frequent updates to the object mean more frequent writes to disk, which mean a higher chance of corruption? I guess I'm saying the -625s didn't cause the corruption per se, but the frequency of updates is a potential risk. If I can reduce the communication errors, I would reduce the writes to disk, which in turn would reduce the risk of corruption.

I agree with your assessment of the -625 errors. We have a pretty unique and complex setup because these servers reside in our private cloud. Micro Focus consulting helped with the install and with the iptables configuration on these servers, since it required NATing the internal IP addresses to the external IP address for communication across sites. There's a whole discovery process involved in adding servers to and removing servers from the tree. It is pretty slick, but now I'm wondering if there's something else going on at the Neutron router level.

And, yes, Revision is a per-replica attribute. The count on each replica holder is slightly different, but similar.

Again, thank you. It's great to be able to run my thoughts by someone else for clarity and direction. No one else knows eDirectory intimately around here. Not that I do either! 🙂
Knowledge Partner

Re: eDir 8.8.8.8 (-618) Inconsistent Database error detection

ab;2490784 wrote:
> On 11/14/2018 10:36 AM, jmckinne wrote:
>> The purpose in me posting this was to make sure, when I told the
>> powers-that-be, that there's no good vendor tool that will monitor this
>> type of error, I'm correct in saying so. It sounds like maybe we could
>
> Usually -618 means some kind of corruption of the filesystem has happened
> for some strange reason. As a result, monitoring for it is pretty hard to
> do since it's usually a hardware-ish level thing. eDirectory does not
> randomly scan its entire DIB just looking for possibly-wrong things, as
> that would be crazy on multiple levels. Sometimes errors may show up in
> the ndsd.log file when they are encountered/thrown but I have not normally
> used that as a specific troubleshooting step.


For lack of a better description of it, there are two types of "-618" error.

The first, corrupt DIB, is reported as a -618 error on startup, and is essentially fatal. I have never successfully recovered from a -618 corrupt DIB error. If you have this one, it's time to get out your disaster recovery plan and start using it.

The other -618 is a corrupt object. I've seen this primarily with Groups. I never could track down exactly what was corrupt about the object, because iMonitor won't display the object: it just returns the -618. Kinda unhelpful there. Anyway, with these you can sometimes use iMonitor to find a "good" copy of the object in another replica. Other times, every replica will have a copy of the corruption. I suspect that it's something like the member (DN) attribute pointing to an object that was removed, or something equally "can't happen, but does". For these, a simple single-object repair in iMonitor will quickly fix the broken object in this server's replica. Repeat on the other replica servers that show a -618, and away you go.