SLP service registrations lost

We have OES 2018 with two SLP DAs and several file servers that advertise their volumes to these DAs.

From time to time the service registrations are lost; only "rcslpd reload" on the file server brings them back.

I believe this is what's happening: when an SLP SA tries to refresh a service registration while the DA is temporarily unavailable (network interruption, server reboot, ...), it removes the DA from its list of known DAs and never tries again.

Is this a bug or "works as designed"? To me this does not make sense.

Thanks,

Mirko

  • No, this doesn't make any sense at all. "bindery.novell" registrations are triggered by ndsd, so in order to push a "register NOW" you'd have to bounce the latter. Are you absolutely sure that bouncing slpd really forces a registration?

    Popular reasons for "lost" registrations are e.g. the "timezone pitfall" (ndsd initializes BEFORE the time stuff, often caused by the system clock NOT being set to UTC on the box or on the hosting VM host; ndsd then gets confused and tends to skip every second registration cycle) or an increased log level on the DA's slp daemon (the slpd log is pretty small by default, and the daemon has to be bounced by logrotate to trigger a rollover).

    But anyway, I've never seen that restarting slpd would force a registration of bindery.novell or ndap.novell entries.

  • Here is an example of what I see in the logs:

    Two days ago at about 6:56 we had a network problem that prevented the SA on one of our file servers from seeing the DAs; when the server tried to register the services, it failed and removed the DAs from its list:

    SLPD: Didn't receive response from DA at xxx.xxx.xxx.2, removing it from list.
    Wed Aug 14 06:56:27 2019
    KNOWNDA - Removal:
        DA address = xxx.xxx.xxx.2
        directory-agent-url = service:directory-agent://xxx.xxx.xxx.2
        bootstamp = 5cd859f4
        scope = DEFAULT
        attributes =
    SLPD: Didn't receive response from DA at xxx.xxx.xxx.2, removing it from list.
    SLPD: Didn't receive response from DA at xxx.xxx.xxx.2, removing it from list.
    SLPD: Didn't receive response from DA at xxx.xxx.xxx.2, removing it from list.

    The second DA was removed later:

    SLPD: Didn't receive response from DA at xxx.xxx.xxx.1, removing it from list.
    Thu Aug 15 08:09:29 2019
    KNOWNDA - Removal:
        DA address = xxx.xxx.xxx.1
        directory-agent-url = service:directory-agent://xxx.xxx.xxx.1
        bootstamp = 5d51bdc8
        scope = DEFAULT
        attributes =

    At 10:26 I did a rcslpd reload:

    ****************************************
    Thu Aug 15 10:27:09 2019
    SLPD daemon reset by SIGHUP
    ****************************************
    [..]
    ****************************************
    Thu Aug 15 10:27:09 2019
    SLPD daemon reset finished
    ****************************************
    [..]
    Thu Aug 15 10:27:09 2019
    KNOWNDA - Addition:
        DA address = xxx.xxx.xxx.1
        directory-agent-url = service:directory-agent://xxx.xxx.xxx.1
        bootstamp = 5d51bdc8
        scope = DEFAULT
        attributes =
    Thu Aug 15 10:27:09 2019
    KNOWNDA - Addition:
        DA address = xxx.xxx.xxx.2
        directory-agent-url = service:directory-agent://xxx.xxx.xxx.2
        bootstamp = 5d539513
        scope = DEFAULT
        attributes =
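For anyone who wants to spot this state without reading the log by hand, here is a minimal Python sketch. It only assumes that the "KNOWNDA - Removal:" and "KNOWNDA - Addition:" entries look like the excerpts above; the function name `lost_das` and the shortened sample text are made up for illustration, and this is not an official tool.

```python
import re

# Matches the KNOWNDA entries as they appear in the excerpts above. The \s*
# lets the pattern work whether an entry sits on one line or is wrapped.
EVENT_RE = re.compile(r"KNOWNDA - (Removal|Addition):\s*DA address = (\S+)")


def lost_das(log_text):
    """Return DA addresses whose most recent KNOWNDA event is a removal."""
    last_event = {}
    for kind, address in EVENT_RE.findall(log_text):
        last_event[address] = kind  # later entries overwrite earlier ones
    return sorted(addr for addr, kind in last_event.items() if kind == "Removal")


# Shortened, hypothetical copy of the excerpts (addresses masked as in the post).
before_reload = """
Wed Aug 14 06:56:27 2019 KNOWNDA - Removal: DA address = xxx.xxx.xxx.2
Thu Aug 15 08:09:29 2019 KNOWNDA - Removal: DA address = xxx.xxx.xxx.1
"""
after_reload = before_reload + """
Thu Aug 15 10:27:09 2019 KNOWNDA - Addition: DA address = xxx.xxx.xxx.1
Thu Aug 15 10:27:09 2019 KNOWNDA - Addition: DA address = xxx.xxx.xxx.2
"""

print(lost_das(before_reload))  # ['xxx.xxx.xxx.1', 'xxx.xxx.xxx.2']
print(lost_das(after_reload))   # []
```

Fed with the real slpd log, a non-empty result would mean the SA currently has DAs it dropped and did not add back, which is the situation where "rcslpd reload" helped here.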

    I have set:

    n4u.nds.advertise-life-time=300

    I think this makes the problem more likely to happen. But I think this should not happen at all...

  • If the SA really removes the DA entries when they're unreachable, then yes, reloading slpd on the SA (to make it connect to the DAs again) would indeed fix it, without a need to bounce ndsd. NDSD registers with the local slpd, not with the DA directly.

  • Congrats. In this case you really have "lost" DAs rather than lost registrations. The latter ones are by far more common...

    n4u.nds.advertise-life-time=300

    is a pretty, let's say, aggressive setting, which of course can "help" to unveil a connectivity issue sooner, but as you've stated, the daemon should catch up on its own. How long were the DAs unreachable, and how long did you wait before bouncing slpd after connectivity had been restored?

  • The common condition when this happens is a reboot of the DA server for a kernel update or OES maintenance, so the DA is unreachable for a few minutes. There were situations where I "waited" many days before reloading the SA, simply because I wasn't aware of the situation. This is why I think slpd never checks for a "lost" DA again. Btw, the "aggressive" advertise-life-time setting is because of issues we had: registrations were lost and only reappeared after advertise-life-time; iirc this was when ndsd crashed. I am not sure if we need this any more.
  • I did some testing with these options:

    • net.slp.staleDACheckPeriod
    • net.slp.DAActiveDiscoveryInterval
    • net.slp.activeDADetection
    • net.slp.passiveDADetection

    Nothing prevented the DA from disappearing, and it never came back without a reload.

    The only useful thing I learned, which I wasn't aware of: even if the DAs are configured with net.slp.DAAddresses, activeDADetection is still needed. If I disabled net.slp.activeDADetection, no DA was used.
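The last observation can be turned into a quick sanity check of slp.conf. The Python sketch below is only an illustration: the helper names are made up, it assumes the usual "key = value" layout with ";" or "#" starting a comment line, and it assumes that activeDADetection counts as enabled when it is not set at all.

```python
def parse_slp_conf(text):
    """Parse 'key = value' lines; skip blanks and ';' or '#' comment lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in ";#":
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def da_detection_warning(settings):
    """Warn when static DAs are configured but active DA detection is off."""
    has_static_das = bool(settings.get("net.slp.DAAddresses"))
    # Assumption: an unset option behaves like "true".
    active = settings.get("net.slp.activeDADetection", "true").lower()
    if has_static_das and active == "false":
        return ("net.slp.DAAddresses is set but net.slp.activeDADetection "
                "is false; in the tests described above no DA was used "
                "with this combination")
    return None


# Hypothetical config fragment (addresses masked as in the post).
sample = """
; static DA list
net.slp.DAAddresses = xxx.xxx.xxx.1,xxx.xxx.xxx.2
net.slp.activeDADetection = false
"""
print(da_detection_warning(parse_slp_conf(sample)))
```

Run against the real slp.conf of a file server, a returned message would point at the combination that left the SA without any DA in the tests above.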