What is the cause of high processor utilization?

We have NetWare 6.5 SP8. About a month ago we started getting high processor utilization (in Monitor it goes up to around 50% - 55%) on the server every day around 11:00AM, and it stays high for about an hour, and then after an hour or so, it will drop back down to normal. During the high utilization time, the server really slows down to a crawl, causing major user problems. I have no idea what is causing this; as far as I know nothing has changed on the server. I checked the backup system and nothing is running. I also checked the cron file and no cron jobs are running. In Monitor, checking the Kernel's Busiest Threads list doesn't give much information either. It always shows something like "Server 34" or "Server 17". Is there any way to find out what is causing the high utilization?

Thanks....
  • On Mon, 01 Jul 2013 11:06:01 +0000, ncharleyhog wrote:

    > We have NetWare 6.5 SP8. About a month ago we started getting high
    > processor utilization (in Monitor it goes up to around 50% - 55%) on the
    > server every day around 11:00AM, and it stays high for about an hour,
    > and then after an hour or so, it will drop back down to normal.


    Something external to the server is overloading it.


    > During
    > the high utilization time, the server really slows down to a crawl
    > causing major user problems.


    This server has two CPUs, right? You're seeing, in Monitor, that the
    "average" utilization is 50%, which often means that one CPU is running
    at 100% and the other at 0%.


    > I have no idea what is causing this; as far
    > as I know nothing has changed on the server.


    It's most likely something external, not the server itself. You'll
    probably need a packet trace to see what's coming in and where it's
    coming from. Sometimes you can get a hint by looking at the users'
    connections in Monitor and finding the one where the "requests"
    statistic is going through the roof.


    > are running. In Monitor, checking the Kernel's Busiest Threads list
    > doesn't give much information either. It always shows something like
    > "Server 34" or "Server 17".


    Those are "worker threads", which tells you something, but not much.
    Essentially what happens is that a workstation makes a request of
    eDirectory. eDirectory allocates that request to a worker thread ("Server
    17"). The thread does whatever needs to be done, and the reply is sent
    back to the workstation that made the request.

    Picture it this way. You're sitting in your office. You have 100 minions
    at your disposal. 5000 people are outside your door. Every time one of
    them asks you for something, you hand the task to a minion, who performs
    it. Normally this scheme works pretty well. But then one of those people
    goes insane and starts asking for stuff you don't have, like "smoke
    benders" and "left handed bacon stretchers", as fast as he can. Pretty
    soon, all of your minions are 100% busy.

    The fault here isn't you, and it isn't the minions, it's the crazy
    person. The trick is figuring out where the insane connection is.
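
    To make the analogy concrete, here is a minimal sketch (plain Python,
    nothing NetWare-specific; the pool size, timings and client names are
    all invented) of how one flooding client ties up a fixed pool of worker
    threads so that everyone else's requests queue up behind it:

    # Minimal sketch: a fixed pool of worker threads shared by all clients.
    # One misbehaving client floods the pool, so a normal user's single
    # request has to wait behind everything already queued.
    import time
    from concurrent.futures import ThreadPoolExecutor

    POOL_SIZE = 4                    # the "minions"

    def handle_request(client, submitted):
        time.sleep(0.05)             # pretend each request takes 50 ms of work
        return client, time.time() - submitted

    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        futures = [pool.submit(handle_request, "flooder", time.time())
                   for _ in range(200)]          # the "insane" connection
        futures.append(pool.submit(handle_request, "normal-user", time.time()))

        for f in futures[-3:]:
            client, elapsed = f.result()
            print(f"{client}: waited {elapsed:.2f}s for a 0.05s job")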


    > Is there any way to find out what is causing the high
    > utilization?


    Sometimes Monitor will show it. Other times you need a packet trace and
    you look for the busiest connection. Then you go look at whatever it is
    that's gone insane.
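
    If you do end up with a packet trace, a quick way to find the busiest
    connection is simply to count packets per source address; a rough sketch
    in Python using scapy (the capture filename is a placeholder):

    # Rough sketch: rank source IP addresses in a capture by packet count to
    # find the busiest talker. "capture.pcap" stands in for whatever trace
    # you took during the slowdown.
    from collections import Counter
    from scapy.all import rdpcap, IP

    packets = rdpcap("capture.pcap")
    by_source = Counter(pkt[IP].src for pkt in packets if IP in pkt)

    for src, count in by_source.most_common(10):
        print(f"{src:15}  {count} packets")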


    --
    --------------------------------------------------------------------------
    David Gersic dgersic_@_niu.edu
    Knowledge Partner http://forums.netiq.com

    Please post questions in the forums. No support provided via email.

  • Thanks for the suggestions.... About the User Requests, what do you consider "going through the roof"? Our utilization is not high right at this moment, and in looking through the user requests they range from 0 - 194,000.
  • On Mon, 01 Jul 2013 15:56:01 +0000, ncharleyhog wrote:

    > Thanks for the suggestions.... About the User Requests, what do you
    > consider "going through the roof"? Our utilization is not high right at
    > this moment, and in looking through the user requests they range from
    > 0 - 194,000.


    It's a counter. You may find one that's increasing rapidly, you may not.
    Watch some for a while to see what normal looks like; then, when you're
    seeing the problem happen, try to spot the one that isn't normal.
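
    Since those numbers are lifetime counters, the growth rate matters more
    than the absolute value; a small sketch of the idea (the connection names
    and counts below are invented):

    # Sketch: the absolute "requests" number matters less than how fast it is
    # growing. Take two snapshots a minute apart and rank connections by the
    # increase. The connection names and counts below are invented.
    snapshot_1 = {"JSMITH": 152000, "PRINTSRV": 48300, "APPSVC": 194000}
    snapshot_2 = {"JSMITH": 152040, "PRINTSRV": 48310, "APPSVC": 310500}

    deltas = {conn: snapshot_2[conn] - snapshot_1.get(conn, 0)
              for conn in snapshot_2}

    for conn, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{conn:10} +{delta} requests in the sample interval")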


    --
    --------------------------------------------------------------------------
    David Gersic dgersic_@_niu.edu
    Knowledge Partner http://forums.netiq.com

    Please post questions in the forums. No support provided via email.

  • In addition to David's comments (love the minions reference), I'd only add that in NoRM you can see the stack trace of a process: NoRM --> Diagnose Server --> Profile / Debug --> Server XXX (from the list, where you can also see the % utilization). This also works for any process you see running, but since your issue is in the server work-to-dos (WTOs), we'll look at them.

    After a few tries you can catch the WTO in the act of actually processing work. The output will include something like the following, including a work description and the source of the work. This is NOT always present; most of the time the worker thread is idle and you will see a very abbreviated stack dump and not much else. Keep trying and you will get more detailed output referencing the type of work, as well as potentially the module and a stack dump of what is happening.

    -- Bob

    Example:

    Thread Name	Server 11
    Thread ID 8AA3A5A0
    Parent NLM SERVER.NLM
    Processor Assignment 0
    Total Execution Time days: 20, hours: 20, minutes: 46, seconds: 4, milliseconds: 304
    MP Status In classic NetWare
    Stack Bottom 896C2040
    Current Stack Pointer 896C9A18
    Stack Top 896C9FC0
    Stack Size 32,640 Bytes
    Suspend Reason Blocked on Semaphore
    Work Description _SkulkerWorkerProc (Outbound Replication) <-- Ah Ha!
    Semaphore Information
    Semaphore Name NOVDB
    Semaphore ID 8FCA90A0
    Semaphore Value -1
    Threads Waiting on Semaphore 1
    Current Owner Server 11

    Active Work To Do Information <-- All of this is useful to know.
    Work To Do ID 8FF4BC14
    Work Location 90C86CE2 (DS.NLM|BackgroundEvent 0)
    Work To Do Description DS AES Process
    Work Owner DS.NLM


    Stack:

    896C9A18: 00226431 (SERVER.NLM|SchedSwitch 2A9)
    896C9A1C: --896C9A50 ?
    896C9A20: --00000246 (LOADER.NLM|KernelAddressSpace 246)
    896C9A24: --8FCA90A0 ?
    896C9A28: --1C23E388 ?
    896C9A2C: 002168A4 (SERVER.NLM|kSemaphoreWait E0)
    896C9A30: --896C9A50 ?
    896C9A34: --ADB0BB24 ?
    896C9A38: --ABA05828 ?
    896C9A3C: --8FF3A408 ?
    896C9A40: 90E67991 (DS.NLM|waitToComplete$F_IOBuffer 14)
    896C9A44: --8FCA90A0 ?
    896C9A48: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9A4C: --8FF3A408 ?
    896C9A50: --896C9A60 ?
    896C9A54: 90E6751B (DS.NLM|waitForAllPendingIO$F_IOBufferMgr 14)
    896C9A58: --ABA05828 ?
    896C9A5C: --8A9AD088 (NSSWIN.NLM|DisplayHelp 9B8)
    896C9A60: --896C9A94 ?
    896C9A64: 90D90D47 (DS.NLM|completeTransWrites$F_Rfl 2D0)
    896C9A68: --8FF3A408 ?
    896C9A6C: --FE2D4900 ?
    896C9A70: --896C9A94 ?
    896C9A74: --896C9AAC ?
    896C9A78: --00000001 (LOADER.NLM|KernelAddressSpace 1)
    896C9A7C: --00000001 (LOADER.NLM|KernelAddressSpace 1)
    896C9A80: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9A84: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9A88: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9A8C: --ADB0BB24 ?
    896C9A90: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9A94: --896C9B0C ?
    896C9A98: 90D85794 (DS.NLM|flmCommitDbTrans 4FB)
    896C9A9C: --8A9AD088 (NSSWIN.NLM|DisplayHelp 9B8)
    896C9AA0: --ADB0BB24 ?
    896C9AA4: --00000001 (LOADER.NLM|KernelAddressSpace 1)
    896C9AA8: --00000001 (LOADER.NLM|KernelAddressSpace 1)
    896C9AAC: --8AA3A5A0 ?
    896C9AB0: --ADB0BB24 ?
    896C9AB4: --3C8D5048 ?
    896C9AB8: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9ABC: 00214492 (SERVER.NLM|kMutexUnlock 10A)
    896C9AC0: --8AA3A5A0 ?
    896C9AC4: --00000001 (LOADER.NLM|KernelAddressSpace 1)
    896C9AC8: --8ACF71EC (DHOST.NLM|?_V__YAXPAX_Z 2BDC)
    896C9ACC: --896C9B04 ?
    896C9AD0: --ADB0BB24 ?
    896C9AD4: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9AD8: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9ADC: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9AE0: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9AE4: --3C8D5048 ?
    896C9AE8: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9AEC: --00000001 (LOADER.NLM|KernelAddressSpace 1)
    896C9AF0: --8ACF7288 (DHOST.NLM|?_V__YAXPAX_Z 2C78)
    896C9AF4: --00000001 (LOADER.NLM|KernelAddressSpace 1)
    896C9AF8: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9AFC: --8ACF7064 (DHOST.NLM|?_V__YAXPAX_Z 2A54)
    896C9B00: --896C9C3C ?
    896C9B04: --00000000 (LOADER.NLM|KernelAddressSpace 0)
    896C9B08: --ADB0BB24 ?
    896C9B0C: --896C9C28 ?
    896C9B10: 90D859E9 (DS.NLM|FlmDbTransCommit CB)
    896C9B14: --ADB0BB24 ?
  • BTW, my usual suspect for things like this is some process that is pulling data from LDAP. I found out that some of the LDAP widgets used by our systems team would perform queries like "give me your entire tree, and then I'll pick the one user I need," and so on. Repeated login/logout requests can spin the server. Virus scanning. Etc.
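
    For illustration, here is roughly what that difference looks like in a
    script using the Python ldap3 module (the host, credentials and DNs are
    placeholders, not anything from a real tree):

    # Sketch of the difference between "give me the whole tree" and a targeted
    # query, using the Python ldap3 module. Host, credentials and DNs are
    # placeholders, not real values.
    from ldap3 import Server, Connection, SUBTREE

    conn = Connection(Server("ldap.example.com"), "cn=admin,o=acme", "secret",
                      auto_bind=True)

    # Expensive pattern: pull every object in the tree, then filter client-side.
    conn.search("o=acme", "(objectClass=*)", search_scope=SUBTREE,
                attributes=["cn", "mail"])
    user = next(e for e in conn.entries if str(e.cn) == "jsmith")

    # Cheap pattern: let the server return only the one object you need.
    conn.search("o=acme", "(cn=jsmith)", search_scope=SUBTREE,
                attributes=["cn", "mail"])
    user = conn.entries[0]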

    -- Bob
  • On 01.07.2013 13:06, ncharleyhog wrote:
    > every day around 11:00AM, and it stays high for about an hour,
    > and then after an hour or so, it will drop back down to normal


    Look at NoRM -> Diagnose Server -> Health Monitor -> Statistic Trend
    Graph. Watch at least a week of CPU, LAN and disk I/O and look for
    correlations with the high CPU.

    I've had cases where the storage was under heavy load from scripts
    doing a lot of I/O on the server. That can be verified through
    Monitor -> Current Disk Requests increasing rapidly (it should be zero;
    only short, higher peaks are OK).

    I first identified the volumes on which the actions were performed;
    from there it is easy to identify the connection (user, script)
    causing the I/O load.
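
    As a rough sketch of that kind of correlation check (the hourly samples
    below are invented; substitute values read from the trend graph or your
    own logging):

    # Sketch: correlate hourly CPU samples with disk-request samples to see
    # whether the 11:00 spike lines up with storage load. The numbers below
    # are invented.
    cpu  = [12, 14, 13, 55, 52, 15, 13]   # % utilization, hourly samples
    disk = [ 2,  3,  2, 40, 38,  4,  3]   # current disk requests, same hours

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    print(f"CPU vs. disk requests correlation: {pearson(cpu, disk):.2f}")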

    Tom