High CPU utilization and connectivity lost to Post Office

We are experiencing a serious problem on our large Post Offices (1,200 users at peak), since the GroupWise 2014 upgrade.

During normal work days, CPU utilization peaks at 30%. The problem comes in with the big scheduled check that starts at 02:00 on Saturday.
The check pulls a steady 25%, most of which is disk I/O, until Tuesday morning, when it seems to start checking the msgxx.db files.

Total utilization then hovers around 95% (disk 60%), but the PO still feels normal. However, 2-3 times during the day,
all client connectivity is lost for about 10-15 minutes, during which total CPU falls to 80% but disk jumps to 75%.
On both Tuesdays this problematic phase started around the same time. Also, the logs just show a gap for that period - no errors.

Do you have any recommendations?

The server has 4 CPUs and 48 GB of memory assigned.


  • Hi,

    What SP version is your GroupWise 2014 backend install?

    What operating system is this running on - Windows or Linux?

    Please let us know - thanks.

    Cheers,
  • GW2014 SP2 and we are on SLES11 SP2, OES11.
  • Hi,

    Awesome. When you are experiencing such high CPU utilization, have you run "top" to see what exactly is "killing" the CPU? If you don't mind doing that and reporting back, we can see which GroupWise component (MTA, POA, DVA) is actually causing the issue. The reason I ask is that there were issues in earlier versions of 2014 with the DVA running wild.
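
    For example (just a rough sketch - the Linux agent process names are assumed to be gwpoa, gwmta, gwdva and gwia; adjust if yours differ), a one-shot "top" snapshot filtered to the GroupWise agents shows which one is busiest:

    # batch-mode snapshot, keeping the header row plus the GroupWise agents
    top -b -n 1 | egrep -i 'COMMAND|gwpoa|gwmta|gwdva|gwia'

    If the DVA is the one pegging a core, that would point at the known runaway-DVA issue.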

    Cheers,
  • We did move inactive mailboxes to another POA. This will reduce the size of the POA and should have minimal impact on the maintenance tasks. We also changed the scheduled jobs around, moving them to start earlier on Saturday morning. Hopefully this will solve the problem.
  • In article <plaubscher.726kzb@no-mx.forums.microfocus.com>, Plaubscher
    wrote:
    > The server has 4 CPU's and 48GByte memory assigned.

    I am assuming you just missed mentioning the SP1 on OES11 to match the
    SP2 level of SLES11; otherwise that mismatch could be causing you some grief.

    That should be enough resources, so let's confirm that it is actually
    seen by the OS and apps. Check during a regular work day, during off
    hours, and while the GWCheck is running:

    - "Free -m" command will show what the OS sees and will show if it is
    using it all up. Caching will certainly have 'used' it all up, so the
    real test is if the system dips into swap much. If your swap space is
    getting used up, then more RAM would help.

    - "top" then press "1" which should then show all 4 CPUs. Are the
    GWCheck threads spreading over all 4 CPUs. As Laura asked, confirm
    which tasks are sucking the CPU to make sure we aren't hitting one of
    those known issues.

    For more bits on free and top,
    http://www.konecnyad.ca/andyk/nixadmin.htm
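
    A rough sketch of both checks (the exact column layout varies a bit
    between procps versions):

    free -m      # watch the swap line: growing "used" there means real memory pressure
    vmstat 5 5   # non-zero si/so columns mean the box is actively swapping
    top          # press "1" for the per-CPU view (CPU0..CPU3), "P" to sort by %CPU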

    Other things to look at: what file system are you running on? EXT3?
    NSS? Other? They all have their own optimizations that would make an
    impact on a system of this size. Salvage should be off for NSS, and
    noatime and nodiratime should be set for any file system where they
    apply (both NSS and EXT3).
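
    For the EXT3 case that is just mount options; the following is only a
    sketch with a placeholder device and mount point, so adapt it to your
    own /etc/fstab. (For NSS, the salvage and atime behaviour are set
    through the NSS tools such as NSSMU/nsscon rather than fstab.)

    # /etc/fstab - hypothetical GroupWise volume on EXT3
    /dev/sdb1   /grpwise   ext3   defaults,noatime,nodiratime   1 2

    # apply to a mounted volume without a reboot:
    mount -o remount,noatime,nodiratime /grpwise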



    Andy of
    http://KonecnyConsulting.ca in Toronto
    Knowledge Partner
    http://forums.novell.com/member.php/75037-konecnya
    If you find a post helpful and are logged in the Web interface, please
    show your appreciation by clicking on the star below. Thanks!

  • On 25.08.2015 15:16, plaubscher wrote:
    >
    > GW2014 SP2 and we are on SLES11 SP2, OES11.
    >
    >

    That would be an unsupported combination, GW2014 officially starts at
    SLES11SP3, aka OES11SP1.

    On top of that, we know nothing about your storage, but your issue is
    very clearly storage related.

    CU,
    --
    Massimo Rosen
    Novell Knowledge Partner
    No emails please!
    http://www.cfc-it.de
  • Sorry, my mistake. I am just confirming my SLES and OES versions as shown below; they are not as I stated above. I will try the suggestions.

    #cat /etc/SuSE-release
    SUSE Linux Enterprise Server 11 (x86_64)
    VERSION = 11
    PATCHLEVEL = 3
    # cat /etc/issue
    Welcome to SUSE Linux Enterprise Server 11 SP3 (x86_64) - Kernel \r (\l).
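
    For completeness, the OES patch level can be confirmed the same way (assuming the standard OES 11 release file is present):

    # cat /etc/novell-release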
  • mrosen;2405585 wrote:
    SLES11SP3, aka OES11SP1.


    For the record, that should read SLES11SP3, aka OES11SP2. I know it's a typo on your side Massimo, but just to clear it up.

    As for seeing a higher load on Linux servers running GroupWise 2014 vs. the load they had when running GroupWise 2012 (simply comparing before- and after-upgrade load), it does seem "something" in GroupWise 2014 is causing more load.

    I haven't been able to pinpoint it myself, but disk I/O seems comparable... yet the reported system load (top) does peak higher (without seeing the DVA or other "stuff" continually causing CPU peaking).

    "Thinking out loud":
    Verbose logging on the POA, in my case, has so far not shown pointers to what might be causing a generally higher load (2012 vs 2014). So I'm curious what's causing the load increase that we are experiencing after an upgrade.
    On a side note, since upgrading to 2014 SP2 I have been seeing Mobility nag about searches without a filter set (I don't have the exact message ATM). The Mobility servers in question are running 2.1 code.
    That could possibly be one thing that's putting more load on the POA and slowing other stuff down.

    Cheers,
    Willem
  • magic31;2408791 wrote:
    ...On a side note, since upgrading to 2014 SP2 I have been seeing Mobility nag about searches without a filter set (I don't have the exact message ATM). The Mobility servers in question are running 2.1 code.


    This is the one I mean: EA18 Searching over the entire mailbox requires a filter

    Could well be a red herring.... but I don't remember seeing it before having upgraded a POA to GroupWise 2014 SP2, and certainly not as much as I've seen it after the upgrade.
    The EA18 'errors' are caused by buggy communication between GMS 2.1 and GW 2014.
    This is fixed in the coming GMS 2.2 / GW Cornell releases.

    There is an engineering build for GMS 2.1 that fixes the problem - maybe NTS can help you get it. The bug ID is 948060.