14.2: GMS Agent Manager stops working after a while

Hi,

Sorry to bother you guys, but GMS 14.2 is buggier than any other release of GMS. The GMS Agent Manager (datasync-connectors) stops working after some time (2-3 hours), which prevents mobile devices from syncing mail. Although the status shows as running in both the console and the admin console, the Agent Manager doesn't do its job.

When I restart the GMS Agent Manager via rcdatasync-connectors restart, mail starts getting synced again. I suspect it is the GroupWise agent that is causing the malfunction.
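
For reference, this is roughly the sequence I run when it hangs (the status action is just the standard init-script check, so treat this as a sketch rather than an official procedure):

    # status often still claims "running" even while devices cannot sync
    rcdatasync-connectors status

    # restarting only the connectors (Agent Manager) gets mail syncing again
    rcdatasync-connectors restart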

Any ideas on this?

GMS 14.2 on SLES 11 SP4
  • Have you performed any maintenance on the databases? We found a lot of issues after our upgrade. We continue to have problems with some accounts even after we ran the repairs. One of our post offices, the only one running on Windows, had an NTP problem; its time was off. It had never had problems until we upgraded to the latest version of GroupWise and Mobility.
  • As advised by support, I ran dsapp -i -v with no luck. I assume that the crash, which is recognized by neither the shell console nor the web admin, is related to high CPU usage of the Python process (way above 150%). We run GMS on vSphere with 4 vCPUs. The CPU speed varies from 100 MHz, which it should normally be at when there is no work to do, up to 6 GHz under heavy load. We have only 30 users, and heavy load should only occur when reinitializing all users.

    GWCheck with DELUPFOLDERS and FOLDERRESET on the specific user database didn't do anything. I don't remember having these issues with GMS 2.1.
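
    To keep an eye on the Python load I just use standard tools, something along these lines (nothing GMS-specific, only an illustration):

        # list the busiest processes; the GMS connectors show up as python processes
        ps -eo pid,pcpu,pmem,etime,cmd --sort=-pcpu | head -15

        # or watch it live, refreshing every 5 seconds
        top -d 5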
  • On Tue, 12 Jan 2016 07:06:03 GMT, bahsig
    <bahsig@no-mx.forums.microfocus.com> wrote:

    > As advised by support, I ran dsapp -i -v with no luck. I assume the crash
    > is related to high CPU usage of the Python process (way above 150%). [...]


    When you find the agent manager stops, is it affecting all users or
    only some?

    I'm running GMS 14.2 (SLES 11 SP3) on ESXi with 1 vCPU and 6 GB of
    memory. I have 59 users, most with 2 devices. The only problem I am
    running into is that a few users occasionally stop syncing and I have to
    re-initialize their accounts to get them going again.

    Ken
  • ketter;2416368 wrote:
    When you find the agent manager stops, is it affecting all users or
    only some? [...]


    This is affecting all users. But I managed to get it stable by switching from SLES 11 SP4 back to SP3. CPU usage is back to normal and the GMS Agent Manager keeps running. There might be an issue with GMS 14.2 running on SLES 11 SP4; this needs further investigation.
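
    In case anyone wants to compare service pack levels, the quickest check is (standard SLES file, shown only as a pointer):

        # PATCHLEVEL shows the service pack (3 = SP3, 4 = SP4)
        cat /etc/SuSE-release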
  • Hello, we have exactly the same error. We are running GMS 14.2 on SLES 11 SP3 with the latest patches, on a VM with 4 vCPUs and 16 GB memory. The connectors also stop working, although when we do an rcgms status, everything looks fine. We also see a very high load on the Python services, around 200% CPU. We cannot stop the Device Sync Agent through the GUI; only through /etc/init.d/datasync-connectors stop/start does synchronisation pick up the work again.

    We ran a Wireshark capture, stopped/started the network services on the Linux server, and powered the devices off and on again; nothing helps. We did a telnet to port 443 during the issue, which worked, but still no synchronisation. The issue appears completely unexpectedly: sometimes it works for a few days without problems, and the next time the issue is there twice a day. GMS sends a 'ping', as we see in the log, but devices cannot connect to GMS during the issue.

    We are very curious about the solution. We also opened an SR at Novell but have not got a solution yet. It looks like a combination of GMS, SLES and VM(?). But this issue is certainly new and comes with GMS 14.2; before this version we had other issues, but not this one.

    kind regards, HKO
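
    For completeness, these are the kinds of checks we run when it happens (typed from memory, so only a sketch; replace the host name with your own server):

        # the service claims to be fine even though devices cannot sync
        rcgms status

        # port 443 still answers during the issue
        telnet your-gms-server 443

        # only a full stop/start of the connectors gets sync going again
        /etc/init.d/datasync-connectors stop
        /etc/init.d/datasync-connectors start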
  • In article <HKO.7bx3i0@no-mx.forums.microfocus.com>, Hko wrote:
    > Running on a VM with 4 vCPUs and 16 GB memory. But the connectors stop
    > working, although when we do an rcgms status everything looks fine. We
    > also see a very high load on the Python services, around 200% CPU.


    Make sure you have much more than 10% disk space free. I have seen high
    CPU on GMS's Python when space gets tight.
    How is that 16 GB of memory being used? Run the "free -m" command to make
    sure there isn't some memory problem/leak.
    Run dsapp's General Health Check (7, 1) to see if it flags anything that
    is a problem.
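
    A minimal sketch of those checks (the mount point is an example; adjust to wherever your GMS data lives):

        # free space, as a percentage, on the partition holding the GMS data
        df -h /var/lib/datasync

        # memory and swap; a large "cached" figure is normal, not a leak
        free -m

        # then launch dsapp however you normally run it and choose option 7,
        # then 1 (General Health Check)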



    Andy of
    http://KonecnyConsulting.ca in Toronto
    Knowledge Partner
    http://forums.novell.com/member.php/75037-konecnya
    If you find a post helpful and are logged in the Web interface, please
    show your appreciation by clicking on the star below. Thanks!
    GMS troubleshooting tips at http://www.konecnyad.ca/andyk/gwmobility.htm


  • konecnya;2421486 wrote:
    Make sure you have much more than 10% disk space free. I have seen high
    CPU on GMS's Python when space gets tight. [...]



    I managed to get GMS running stable by resetting the GroupWise client folder structure back to default. This has to be done by hand and cannot be performed by GWCheck. GMS is very sensitive about system folders: never try to move a system folder into a subfolder, or you will get headaches.

    Novell needs to block moving system folders into subfolders by default.

    I also disabled all SSL encryption between the GWIA and GMS.

    Run tail -f /var/log/datasync/connectors/mobility-agent.log and try to locate any users with messed-up system folders. Sorry, but I can't tell you exactly what to look for, as I sent my logs to Novell and had them find the problematic users.
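
    What I basically did was watch the log while a suspect user syncs, for example (just an illustration; "someuser" is a placeholder and the exact log lines will differ):

        # follow the mobility agent log and filter for one user at a time
        tail -f /var/log/datasync/connectors/mobility-agent.log | grep -i someuser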
  • Hello Andy. We checked the disk space; this was not an issue, as there is enough space left (212 GB). With free -m we get total memory: 16081, used: 15914, free: 167, shared: 524, buffers: 82 and cached: 14015, so almost all of the memory is in use.

    What we also see when the issue happens: if we do a netstat --tcp --udp --listening --program, the ports on SLES 11 SP3 are in state LISTEN, but no service is connected to those ports anymore. Another thing we see when the issue happens: if we do a netstat -nap | awk '/tcp/ {print $6}' | sort | uniq -c, there are 798 connections in CLOSE_WAIT. After a restart of GMS there are only 18 in CLOSE_WAIT. So maybe something with too many connections? We have set this to 1024 (ulimit -a, open files).

    Thank you for the tip about dsapp. We ran dsapp version 227, option 7 and 1. The log file mentioned a warning on 'Checking Memory', although we already have 16 GB; a warning on POA status, because we have not enabled LDAP (we use a GroupWise group for the members on GMS); and a 'Failed' on Checking referenceCount: detected referenceCount issue in datasync db. SOLUTION: See TID 7012163. So no 'big' errors that would cause this issue, I think.

    kind regards, HKO.
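
    For the record, this is how we count the stuck connections and check the limit (the same commands as above, assembled here as a sketch):

        # count connections per TCP state; during the issue CLOSE_WAIT climbs to ~800
        netstat -nap | awk '/tcp/ {print $6}' | sort | uniq -c

        # current open-files limit of the shell (1024 in our case)
        ulimit -n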
  • In article <bahsig.7bzvyn@no-mx.forums.microfocus.com>, Bahsig wrote:
    > Novell needs to block moving system folders into subfolders by default.


    While they don't block it by default, they do have an option to fix it
    up afterwards: a manual Contents Check on a user with RESFLDR in the
    Misc tab will put those folders back where they are supposed to be.

    Such a block as you suggest would make a good Idea at
    https://ideas.microfocus.com/MFI/novell-gw and, taking a look there, I
    see someone already has one:
    https://ideas.microfocus.com/MFI/novell-gw/Idea/Detail/1651
    It is worth voting for, as I just did.


    Andy of
    http://KonecnyConsulting.ca in Toronto
    Knowledge Partner
    http://forums.novell.com/member.php/75037-konecnya
    If you find a post helpful and are logged in the Web interface, please
    show your appreciation by clicking on the star below. Thanks!
    GMS troubleshooting tips at http://www.konecnyad.ca/andyk/gwmobility.htm


  • In article <HKO.7c00lb@no-mx.forums.microfocus.com>, Hko wrote:
    ...
    > space left (212 GB). With free -m we get total memory: 16081, used: 15914,
    > free: 167, shared: 524, buffers: 82 and cached: 14015.

    ...

    Hello HKO

    That space might be too tight if you have a drive larger than 2 TB. It is
    all about the percentage: there is an hourly cronned job that checks whether
    the partition is more than 90% full, and if it is, it shuts down GMS. If you
    actually have that big a drive you can change that percentage, though not
    before being sure of how much the data on that partition fluctuates.

    That memory is fine; note how most of it is used for caching, which is
    just making good use of memory. The thing to watch is how much swap is
    being used, as seen in the last line of free -m. I go by the rule that if
    more than half the swap is used on a box with a long uptime, or a quarter
    is used on a box up for less than a day, then it is time to add more memory.
    http://www.konecnyad.ca/andyk/nixadmin.htm
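
    A quick way to eyeball both of those (standard commands, nothing GMS-specific):

        # usage percentage per partition; the cron job shuts GMS down once the
        # relevant partition goes over 90% used
        df -h

        # the Swap line is the one to compare against the rule above
        free -m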

    netstat CLOSE_WAIT counts and the 'ulimit -a' open-files limit are
    completely different things; what makes you think they are connected?
    What would need to be factored in is how many devices you have
    connecting to the system. I would expect some relationship between the
    CLOSE_WAITs and the number of devices after they've all had a chance to
    connect, though I haven't done any comparisons at this point.

    I've seen dsapp not always detect memory correctly, so likely that is what
    you are seeing.

    Do make sure you've resolved the issue that pointed you to TID 7012163, as
    fixing one database issue can often fix others that aren't clearly
    visible.
    Have you vacuumed and reindexed the databases?
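
    If not, a rough sketch of doing it directly against PostgreSQL, assuming the default datasync and mobility database names and that GMS is stopped first (I believe dsapp also offers a vacuum option under its database menu):

        rcgms stop                                        # stop GMS before maintenance
        su - postgres -c "vacuumdb --analyze datasync"    # vacuum and analyze the sync DB
        su - postgres -c "vacuumdb --analyze mobility"    # and the mobility DB
        su - postgres -c "reindexdb datasync"             # rebuild the indexes
        su - postgres -c "reindexdb mobility"
        rcgms start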


    Andy of
    http://KonecnyConsulting.ca in Toronto
    Knowledge Partner
    http://forums.novell.com/member.php/75037-konecnya
    If you find a post helpful and are logged in the Web interface, please
    show your appreciation by clicking on the star below. Thanks!
    GMS troubleshooting tips at http://www.konecnyad.ca/andyk/gwmobility.htm