GWIA stops processing messages and link goes down/up

Running GW2012 SP1 on OES 11 SP1, physical servers, 2 nodes using NCS

We have started to experience issues with our GWIA: every couple of weeks the agent stops processing messages and gwmon fires off a bunch of link down and link up messages. When you go to the HTTP admin portal, all the message queues show negative one (-1). The only way to fix the issue is to manually stop and then start the service at the command line (rcgrpwise stop gwia.domain

  • I would focus on
    Too many files open Error, error code = 820A
  • Looked into that, but cannot find out why it is having this issue, since:

    - GroupWise 8.0.2 and above should have had this issue corrected
    - the workaround for older versions (before 8.0.2) is still in place
    - there is more than enough space
    - no other agents on the same node (2 POAs, 1 MTA and another GWIA) are having such issues.
  • By default, the maximum number of open files per process on a Linux server is 1024. Are you sure you are not hitting this maximum?

    "The ulimit -a command will show the maximum number
    of open files. For OES Linux/Suse servers, the default is 1024 files. On
    a busy Webaccess server, this will be too low, particularly if the
    agents are running as root. Additionally, all resources in linux are
    treated as a file, including sockets, so it's important each process has
    enough file handles to do its job."
    http://ngwlist.com/pipermail/ngw/2007-October/116289.html

    Cheers,
  • As mentioned previously the workaround for GW8 is still in place on this server, so in limits.conf I have


    > cat /etc/security/limits.conf | grep nofile
    * soft nofile 8192
    * hard nofile 65535


    I also checked the max files for sysctl as explained in the link to the NGW list thread and I got:

    > sysctl -a | grep fs.file-max
    fs.file-max = 807716

    From both of the above it would seem to be a limit on the number of open files. Or am I incorrect in that assumption? When I run "lsof | wc -l" I usually get numbers from 10000-13000, but that is for the entire server, correct? Whereas the soft limit is for an individual process.
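
To look at the per-process side of that question, one option is to count the entries under /proc/<pid>/fd for a single process and compare the count with the soft limit that running process actually has. This is only a sketch: fd_usage is a hypothetical helper name, and how you find the GWIA PID depends on your install.

```shell
#!/bin/sh
# Report how many file descriptors one process holds versus its own soft limit.
# Usage: fd_usage PID   (reading another user's process needs root)
fd_usage() {
    pid="$1"
    # Each entry in /proc/PID/fd is one open descriptor (file, pipe or socket).
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # /proc/PID/limits shows the limits the running process really has, which
    # can differ from limits.conf if the agent was started before a change.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open=$open soft_limit=$soft"
}
```

For example, fd_usage "$(pgrep -o -f gwia)" run as root, assuming the agent's command line contains "gwia". For the system-wide picture, cat /proc/sys/fs/file-nr prints three numbers: handles currently allocated, allocated but unused, and the maximum (the same value as fs.file-max).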
  • In article <bogdansk.5p9z1d@no-mx.forums.novell.com>, Bogdansk wrote:
    > As mentioned previously the workaround for GW8 is still in place on this
    > server, so in limits.conf I have
    >
    > > cat /etc/security/limits.conf | grep nofile
    > * soft nofile 8192
    > * hard nofile 65535

    Perhaps it might be time to remove those for now, given that the particular
    issue at that time was a bug that was patched. I don't have any such
    entries on my systems, and it is possible that they now represent
    roadblocks.

    > I also checked the max files for sysctl as explained in the link to the
    > NGW list thread and I got:
    >
    > > sysctl -a | grep fs.file-max
    > fs.file-max = 807716

    I'm thinking that this is just a representation of what is allocated and
    not a limit as that number is all over the place (404080, 153936, 290420,
  • Did you ever find the underlying issue?

    I too am experiencing this every couple of weeks, running gw12sp1 on OES 10sp3. Before that I was running 8.0.3 and never had the issue.

    Error - Unable to create send file
    07:02:01 F145 MSG 6803132 Error: Fatal error processing message
    07:02:01 F145 MSG 6803132 Deferred delivery file memory error -- message undeliverable.

    Just checking before I open an SR

    Thanks
    Christa
  • From the NGW list I have heard that it is a known socket (file) leak issue in the code for the GWIA (v2012 SP1). They have corrected the issue in the private beta of SP2, but it won't be released to the public until the general release of SP2 in a month or two.
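
If you want to watch for that kind of leak while waiting for SP2, a rough approach is to count how many of a process's descriptors are sockets and log the number over time. This is a sketch under assumptions, not anything from the vendor: socket_fds is a hypothetical helper name, and it relies on Linux showing socket descriptors in /proc/<pid>/fd as links of the form socket:[inode].

```shell
#!/bin/sh
# Count how many open descriptors of a process are sockets.
# Usage: socket_fds PID   (reading another user's process needs root)
socket_fds() {
    pid="$1"
    count=0
    for fd in /proc/"$pid"/fd/*; do
        # readlink prints something like "socket:[123456]" for a socket;
        # regular files print their path, pipes print "pipe:[...]".
        case "$(readlink "$fd" 2>/dev/null)" in
            socket:*) count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}
```

Logging the result with a timestamp from cron, say once an hour, should show whether the count keeps climbing between restarts, which would be consistent with a leak. A number that stays flat would point elsewhere.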
  • Hello,
    we are having the same issue with the same symptoms as described above. We raised an SR and were pointed to a beta release of SP2,
    which of course we cannot deploy in production, so we have to wait for a fully tested SP2.
    Our environment is SLES11 SP1 and OES11.
    Hopefully engineering will get a wriggle on and supply it soon, as there are a number of pressing issues to be fixed.
    Regards DEE
  • We also had this error pop up last night. Thanks for posting; you saved me the time of opening an SR.


    >>> bogdansk<bogdansk@no-mx.forums.novell.com> 16/01/2013 10:46 AM >>>


    Running GW2012 SP1 on OES 11 SP1, physical servers, 2 nodes using NCS

    We have started to experience issues with our GWIA, where as every
    couple of weeks the agent will stop processing messages and gwmon will
    fire off a bunch of link down and link up messages. When you go to the
    http admin portal, all the message queues will show a negative one (-1).
    The only way to fix the issue is to manually stop and then start the
    service at the command line (rcgrpwise stop gwia.domain