Tuning a large DIB

eDirectory 9.03 (On the way to 9.1.1)
RHEL 6.10 (on the way to RHEL 7.5)


Current DIB:
DIBFileSize: 104365670400 Bytes (104.3656704 GB)
DIBRflmFileSize: 49290947 Bytes
DIBRollBackFileSize: 104857600 Bytes
DIBStreamFileSize: 34428453 Bytes
TotalDIBSize: 104554247400 Bytes


Mostly reads, along with a lot of binds and bind-update writes.
About 63,000 binds/hour on average, with peaks over 100,000 binds/hour.

Current Memory:
free -g
             total       used       free     shared    buffers     cached
Mem:           126        114         11          0          0        104
-/+ buffers/cache:         10        115
Swap:            7          0          7
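
For reference, the -/+ buffers/cache line shows processes (ndsd included) using
only about 10 GB, with roughly 104 GB sitting in the Linux page cache, much of
it presumably the DIB files. To see what ndsd itself holds, I check something
like this (ndsd being the eDirectory daemon on Linux):

ps -o pid,rss,vsz,comm -C ndsd   # RSS is approximately ndsd's resident memory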

Looking for any insights on tuning.

Thanks
-jim

  • On 11/15/2018 8:24 AM, jwilleke wrote:
    >
    > eDirectory 9.03 (On the way to 9.1.1)
    > RHEL 6.10 (on the way to RHEL 7.5)
    >
    >
    > Current DIB:
    > DIBFileSize: 104365670400 Bytes (104.3656704 GB)
    > DIBRflmFileSize: 49290947 Bytes
    > DIBRollBackFileSize: 104857600 Bytes
    > DIBStreamFileSize: 34428453 Bytes
    > TotalDIBSize: 104554247400 Bytes
    >
    >
    > Mostly reads along with a lot of binds and bind update information.
    > 63,000 binds/hour but peaks over 100,000 binds/hour
    >
    > Current Memory:
    > free -g
    >              total       used       free     shared    buffers     cached
    > Mem:           126        114         11          0          0        104
    > -/+ buffers/cache:         10        115
    > Swap:            7          0          7


    Store it on SSD? :)

    That is a large DIB. Is it many small objects or large objects?
    (Curious how many users/objects...)



  • On 11/15/2018 06:24 AM, jwilleke wrote:
    >
    > eDirectory 9.03 (On the way to 9.1.1)
    > RHEL 6.10 (on the way to RHEL 7.5)
    >
    > Current DIB:
    > DIBFileSize: 104365670400 Bytes (104.3656704 GB)
    > DIBRflmFileSize: 49290947 Bytes
    > DIBRollBackFileSize: 104857600 Bytes
    > DIBStreamFileSize: 34428453 Bytes
    > TotalDIBSize: 104554247400 Bytes


    It would be interesting to know what they are storing in ~34 MB of stream
    files (maybe pictures of people), and what accounts for the ~104 GB main
    DIB file.

    > Mostly reads along with a lot of binds and bind update information.
    > 63,000 binds/hour but peaks over 100,000 binds/hour


    As I am sure you know, binds are super-simple and do not need much
    optimization once the binding object (usually a user) is found, other than
    processing power (for the encryption work around binds, or even just LDAPS
    traffic) and disk (to load the user's credentials for comparison, or to
    write back the attributes updated because of the login or login attempt).

    > Current Memory:
    > free -g
    >              total       used       free     shared    buffers     cached
    > Mem:           126        114         11          0          0        104
    > -/+ buffers/cache:         10        115
    > Swap:            7          0          7


    Forget swap; it is crap.

    You have a lot of RAM used for caching, so that is good and probably
    helping performance. You could increase the eDirectory DIB cache if your
    cache hit percentage is low, but how much that will help depends on what
    is not working well now, so let's take a step back.
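
    If you do want to look at the DIB cache, the usual first stop is iMonitor
    (http://<server>:8028/nds, Agent Configuration -> Database Cache) to see
    the hit ratios and the current cache size, and then the cache setting in
    _ndsdb.ini if you decide to raise it. A rough sketch, assuming the default
    Linux DIB path; the exact _ndsdb.ini syntax varies by version, so check
    the tuning documentation before changing it:

    # Current cache setting (if any) and DIB size on disk:
    grep -i cache /var/opt/novell/eDirectory/data/dib/_ndsdb.ini
    du -sh /var/opt/novell/eDirectory/data/dib

    # Example hard limit of ~16 GB (value in bytes) in _ndsdb.ini:
    #   cache=17179869184
    # then restart ndsd for it to take effect, e.g.:
    /etc/init.d/ndsd restart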

    How, if at all, are they experiencing slowness? If they are not, but they
    want a 0.05 ms bind to now take 0.04 ms, then they may want to consider
    the cost/benefit ratio. Maybe it is worth it, but a single box handling
    100,000 binds per hour already seems okay.

    How are the users being found? Are they all in one container, so a simple
    username goes directly to the DN of the object without a search, or are
    they also doing some kind of search to find the object which is possibly
    (in your case certainly) using an index? Maybe look at other types of
    searches and be sure they are indexed; doing any kind of full DIB scan in
    this environment would be suicide, so presumably things are already
    indexed well.
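
    If you want to double-check what is actually indexed, something along
    these lines works (a sketch; the host, DNs, and server object are
    placeholders, and the exact ndsindex options are worth confirming against
    your version's man page):

    # List the indexes defined for a given server:
    ndsindex list -h ldap.example.com -D cn=admin,o=org -W \
        -s cn=server1,ou=servers,o=org

    # Time a representative query to confirm it behaves like an indexed lookup:
    time ldapsearch -H ldaps://ldap.example.com -D cn=admin,o=org -W \
        -b o=org "(uid=someuser)" dn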

    If logins themselves cause tons of traffic you could always disable some
    or all login attribute updates. On a normal login, out of the box,
    eDirectory keeps track of things like the login time, the source network
    address, etc. In a failed login case the intruder address, attempt
    counts, etc. are all tracked. All of these involve writing back to the
    DIB when doing a simple bind (something that feels like a simple compare
    or read), so disabling these can improve performance overall by decreasing
    the amount of data replicating all over. Of course, you may also lose
    some nice features of eDirectory, like being able to tell if a user is
    logging in at all, or if they are being attacked by intruders.
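
    If you want to see exactly what gets written for your configuration, watch
    a test account's bookkeeping attributes before and after a bind (a sketch;
    the host and DNs are placeholders, and your NMAS methods may touch
    additional attributes):

    ldapsearch -H ldaps://ldap.example.com -D cn=admin,o=org -W -b o=org \
        "(uid=testuser)" loginTime networkAddress \
        loginIntruderAttempts loginIntruderAddress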

    --
    Good luck.

  • More than 7 million user entries.
  • > How are the users being found? Are they all in one container, so a simple
    > username goes directly to the DN of the object without a search, or are
    > they also doing some kind of search to find the object which is possibly
    > (in your case certainly) using an index? Maybe look at other types of
    > searches and be sure they are indexed; doing any kind of full DIB scan in
    > this environment would be suicide, so presumably things are already
    > indexed well.


    There is a trace setting (RECMAN, maybe?) that shows index work. I am
    curious how big the indexes are for 7 million users. Is there a good way
    to find out how much space the indices are taking up?
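
    Something like this should show the record-manager activity during a test
    search (from memory, so confirm the tag name against the flag list in your
    version; logging goes to ndstrace.log, location depending on your install):

    ndstrace                      # start the trace utility on the server
    set ndstrace = nodebug        # clear the default flags
    set ndstrace = +time
    set ndstrace = +recm          # record manager / index activity
    ndstrace file on              # start writing ndstrace.log
    # ...run the test search from another session, then:
    ndstrace file off
    exit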

    > If logins themselves cause tons of traffic you could always disable some
    > or all login attribute updates. On a normal login, out of the box,
    > eDirectory keeps track of things like the login time, the source network
    > address, etc. In a failed login case the intruder address, attempt
    > counts, etc. are all tracked. All of these involve writing back to the
    > DIB when doing a simple bind (something that feels like a simple compare
    > or read), so disabling these can improve performance overall by decreasing
    > the amount of data replicating all over. Of course, you may also lose
    > some nice features of eDirectory, like being able to tell if a user is
    > logging in at all, or if they are being attacked by intruders.


    I seem to recall this is in two places: one is NMAS-related, the other is
    somewhere in iManager that I cannot recall offhand.

    Also curious, how long does a DIBClone take? :)

    How long for an ndsrepair? :)
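
    If you ever time it, the unattended full local-database repair is probably
    the interesting number (a sketch; run it in a maintenance window):

    time ndsrepair -R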

  • On 11/15/2018 8:24 AM, jwilleke wrote:
    >
    > eDirectory 9.03 (On the way to 9.1.1)
    > RHEL 6.10 (on the way to RHEL 7.5)
    >
    >
    > Current DIB:
    > DIBFileSize: 104365670400 Bytes (104.3656704 GB)
    > DIBRflmFileSize: 49290947 Bytes
    > DIBRollBackFileSize: 104857600 Bytes
    > DIBStreamFileSize: 34428453 Bytes
    > TotalDIBSize: 104554247400 Bytes



    One thing to consider is trimming out all the extra NMAS methods. There is
    some bug that causes the methods to reload under certain circumstances. It
    should be fixed, but I think they are still checked on login, so they
    waste time.
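
    A quick way to see which login methods are installed in the tree (a
    sketch; the host and admin DN are placeholders):

    ldapsearch -H ldaps://ldap.example.com -D cn=admin,o=org -W \
        -b "cn=Authorized Login Methods,cn=Security" -s one "(objectClass=*)" cn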


  • geoffc;2490928 wrote:
    > How are the users being found? Are they all in one container, so a simple
    > username goes directly to the DN of the object without a search, or are
    > they also doing some kind of search to find the object which is possibly
    > (in your case certainly) using an index? Maybe look at other types of
    > searches and be sure they are indexed; doing any kind of full DIB scan in
    > this environment would be suicide, so presumably things are already
    > indexed well.


    There is a trace setting (RECMAN, maybe?) that shows index work. I am
    curious how big the indexes are for 7 million users. Is there a good way
    to find out how much space the indices are taking up?

    > If logins themselves cause tons of traffic you could always disable some
    > or all login attribute updates. On a normal login, out of the box,
    > eDirectory keeps track of things like the login time, the source network
    > address, etc. In a failed login case the intruder address, attempt
    > counts, etc. are all tracked. All of these involve writing back to the
    > DIB when doing a simple bind (something that feels like a simple compare
    > or read), so disabling these can improve performance overall by decreasing
    > the amount of data replicating all over. Of course, you may also lose
    > some nice features of eDirectory, like being able to tell if a user is
    > logging in at all, or if they are being attacked by intruders.


    I seem to recall this is in two places: one is NMAS-related, the other is
    somewhere in iManager that I cannot recall offhand.



    How are the users being found?
    Two primary methods: searches for uid=value and searches for a "String"
    custom GUID attribute placed on all new users.

    Are they all in one container?
    No.
    Currently there are 11 partitions and 40 containers (mostly due to mergers
    and different apps using the same LDAP).
    I think they should be in one location and one ROOT partition.
    Please tell me what you think.
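
    Since both lookups are equality searches, I at least make sure the custom
    GUID attribute has a value index alongside the default uid index. Roughly
    like this (myAppGUID is a placeholder for our custom attribute, and the
    name;attribute;type format is from memory, so check the ndsindex man page):

    ndsindex add -h ldap.example.com -D cn=admin,o=org -W \
        -s cn=server1,ou=servers,o=org "myAppGUID-IX;myAppGUID;value"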

    Also curious, how long does a DIBClone take? :)
    An amazingly short time to create the clone files (it must be the same
    process as used for DSBK).
    Did one recently and it took less than 10 minutes to create the clone file.
    It takes longer to transfer it across the network to the other server.

    How long for an ndsrepair? :)
    3-4 hours, so we obviously try to avoid it.
    We do run several local entry repairs, as we see some regularly occurring
    issues, which I blame on the way an application is used and on the
    thousands of "group-like" entries that sit in partitions separate from the
    users.
    We dropped all the partitions in a lower environment and all the sync
    issues went away.

    Questions, comments, and suggestions are always encouraged!

    Thanks to ALL of you for taking the time to provide feedback. Your helpful
    comments are much appreciated.

    -jim
  • jwilleke;2493824 wrote:

    Are they all in one container?
    No.
    Currently there are 11 partitions and 40 containers (mostly due to mergers
    and different apps using the same LDAP).
    I think they should be in one location and one ROOT partition.
    Please tell me what you think.


    I strongly disagree here. With lots of objects, you want multiple partitions to make replication faster and more efficient.
  • Thanks for the feedback.

    All of my customer sites, since moving off NetWare (i.e., file-and-print
    directories), are LDAP-access only and have one ROOT partition.
    They rarely ever had replication issues and almost never needed ndsrepair.

    Also, in this case, replication in PILOT was 2-5 minutes with 4 servers, each holding the 11 partitions

    • Admittedly smaller (5 million entries)
    • Admittedly fewer daily connections
    • Had to use ndsrepair regularly (more than once a week)


    and when the pilot was collapsed to one partition, replication was measured in seconds.

    Anyone else have thoughts on this subject?

    -jim
  • On 01/17/2019 09:04 AM, jwilleke wrote:
    >
    > All of my customer sites, since moving off NetWare (i.e. File and Print
    > directories), are all LDAP access and have one ROOT partition.
    > Rarely ever had replication issues and almost never used ndsrepair.
    >
    > Also, in this case, the replication in PILOT was 2-5 minutes with 4
    > servers each with 11 partitions
    >
    > - Admittedly smaller (5 million entries)
    > - Admittedly fewer daily connections
    > - Had to use ndsrepair regularly (more than once a week)
    >
    >
    > and when pilot was collapsed to one partition replication is measured in
    > seconds.


    There may be good reasons for that difference. I do not think
    partitioning was meant primarily to help with total tree replication time.
    Partitioning lets you divide up objects as they are held on servers, so
    more partitions means you now have the option of having some objects on
    these servers, others on those servers, and all of them on this one server
    right here, and that's about it (my understanding).

    With regard to replication, when serverA tries to send data to serverB, it
    is also able, simultaneously, to replicate to servers C, D, E, F, and G,
    and assuming none of those servers are busy writing already, that
    replication can happen from one to many as quickly as the hardware will
    allow, but that is for one partition. If you then have a second partition,
    serverA needs to contact everybody again to replicate that partition, and
    then the third partition, and the fourth. This assumes every partition has
    changes, which is not always the case, but since we are comparing behavior
    under high load, assume all partitions are getting changes.

    The problem is that while serverA can send to many at once, every server
    can only receive from one server at a time; many readers, one writer per
    instance of eDirectory. As a result, it is not uncommon for replication
    from serverA to many servers to have at least one of those servers return
    -698 (I'm busy, come back soon) REPLICA_IN_SKULK, and then the process to
    that one box must be tried again. With more servers in a partition, there
    are more possible sources of updates, and that could mean more time that
    replication is blocked by some other server replicating changes out to
    peers. Having many partitions also adds to the issue. Once serverA
    contacts serverB to send a single-partition tree's data, all data should
    replicate. If there are a dozen partitions, it takes more connections,
    adding to the possibility of the other server eventually being busy.
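
    If you want to watch this on a live box, the per-partition view is the
    useful one (a sketch; output details differ a bit between versions):

    ndsrepair -E   # replica synchronization status and any -6xx errors per partition

    iMonitor's Agent Synchronization page (http://<server>:8028/nds) shows the
    same thing, along with the time since each partition last synchronized.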

    Back in Support I saw a box with 100 partitions on it, and while it was
    the only one with that many, it was that way because it replicated with
    small sites around the world. It was ALWAYS replicating, and as a result
    it was often telling other boxes that it was too busy to replicate their
    changes. The heartbeat is set to once per hour by default, meaning that at
    least once per hour a partition will be replicated if things are healthy,
    but that is a per-partition timer, and if you have 100 partitions, then
    each of those must make a connection per hour; if you also have many
    servers per partition, the overhead alone is significant.

    eDirectory 8.8 had an issue where entries going into its change cache
    would start a (as I recall) five (5) second timer, after which the changes
    would be replicated if they were sync-immediate changes (as many things
    are: creates, password changes, etc.). The catch was that if another event
    came in after just four (4) seconds, the timer was restarted, and another
    change four (4) seconds later restarted it again. This was contrary to how
    things were supposed to work, and it is one of the MANY performance fixes
    built into eDirectory 9.0 and later by default, so replication times in
    9.x are much better than they were in 8.8.

    Another change in 9.x is the option of a larger NCP packet size, so
    replicating lots of data is much faster for that reason too.

    --
    Good luck.
