Some email not going out

Been having some weirdness on outbound email. It's been quite
sporadic, but it seems to be getting worse lately.

Some outbound emails that get sent seem to sit in a pending status from
within Groupwise. If that email is resent, or if a different email is
sent to that same address, they go right through.

Any ideas on this? Is it the email actually not going out, or is the
status message getting hung up somewhere?

Running gw2012sp2 (build 108211) on sles11sp3 / oes11sp2.

--
Stevo

  • In article <07zxv.3928$BB4.3909@novprvlin0913.provo.novell.com>, Stevo
    wrote:
    > Some outbound emails that get sent seem to sit in a pending status from
    > within Groupwise. If that email is resent, or if a different email is
    > sent to that same address, they go right through.
    >

    Are these messages getting to GWIA? Let's rule out agent link problems
    first.
    What do GWIA's logs say about the message?

    Are all your agents running the same build?

    Do you send mail directly to the servers of the recipients or do you go
    through any sort of relay such as for antivirus and/or compliance
    monitoring?


    Andy of
    KonecnyConsulting.ca in Toronto
    Knowledge Partner
    http://forums.novell.com/member.php/75037-konecnya
    If you find a post helpful and are logged in the Web interface, please
    show your appreciation by clicking on the star below. Thanks!

  • Hi Stevo

    Further to what Andy has said/asked... Is there anything sitting in the "defer" directory in your GWIA file subsystem? Those would be emails that have not yet been delivered and are deferred according to your retry interval.

    Let us know.

    Cheers,

  • > Are these messages getting to GWIA? Let's rule out agent link
    > problems first.
    > What do GWIA's logs say about the message?


    Well, the email my boss said he sent first thing that morning does not
    appear in the gwia log. The only email (at least that I can find)
    that he sent to this person was at 10:26:41, even though the boss said
    he sent one first thing that morning.


    > Are all your agents running the same build?


    All are the same build (108211) aside from one POA that, per NTS, is
    running build 115067.


    > Do you send mail directly to the servers of the recipients or do you
    > go through any sort of relay such as for antivirus and/or compliance
    > monitoring?


    Well, we do have a Gwava appliance that acts as our mail relay, but
    I've had confusion about that as well. I have received undeliverable
    bounceback emails that seem to come from my gwia, instead of an
    undeliverable coming back from our Gwava box.

    --
    Stevo
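
One way to double-check what the log shows for a given recipient is a plain text search. The following is a minimal sketch, not GroupWise tooling: `search_logs` is a hypothetical helper name, and the log files are whatever your GWIA actually writes (they are passed in, not assumed). It makes no assumption about the log line format and simply prints every line that mentions the address.

```shell
#!/bin/bash
# search_logs ADDRESS FILE...
# Print file name, line number, and text of every log line that contains
# ADDRESS (case-insensitive, treated as a literal string, not a regex).
search_logs() {
    local address="$1"
    shift
    grep -F -i -n -H -- "$address" "$@"
}

# Example call (the path is a placeholder, substitute your own log location):
# search_logs 'someone@example.com' /path/to/gwia/logs/*
```

If the address shows up only at 10:26:41 and not earlier, that supports the idea that the first message never reached GWIA at all.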
  • laurabuckley sounds like they 'said':

    > Further to what Andy has said/asked... Is there anything sitting in
    > the "defer" directory in your GWIA file subsystem? Those would be
    > emails that have not yet been delivered and are deferred according to
    > your retry interval.


    So my response to laurabuckley's comment is...

    The only items in my defer directory are from earlier today, like 30-45
    minutes ago, and those are auto-reply (out of office) emails from a
    couple people in the IT dept to auto generated emails that do not have
    a valid email address.

    --
    Stevo
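
To tell fresh deferrals from ones that have been sitting through several retry intervals, the files can be listed by age. This is a minimal sketch, assuming a `find` that supports `-mmin` (GNU find does); `older_than` is a hypothetical helper name and the defer directory path is passed in rather than assumed.

```shell
#!/bin/bash
# older_than DIR MINUTES
# List regular files under DIR whose last modification is more than
# MINUTES minutes ago, sorted by path. Prints nothing if there are none.
older_than() {
    local dir="$1" minutes="$2"
    find "$dir" -type f -mmin +"$minutes" | sort
}

# Example call (placeholder path): files deferred for more than an hour
# older_than /path/to/gwia/defer 60
```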
  • In article <r8Wxv.4001$BB4.1694@novprvlin0913.provo.novell.com>, Stevo
    wrote:
    > Well, the email my boss said he sent first thing that morning does not
    > appear in the gwia log. The only email (at least that I can find)
    > that he sent to this person was at 10:26:41, even though the boss said
    > he sent one first thing that morning.

    This strongly suggests that there is a linkage problem on the path from
    the POA through the MTA(s) to GWIA. Is this a basic one PO, one Domain
    system all running on one box including GWIA, or are there more boxes and
    agents involved?
    We will need to look for stuck files in any of the wpcsin/# or wpcsout#
    of all the agents involved, and possibly the mslocal/mshold. From a
    Windows client, Total Commander's Alt-Shift-Enter at those levels very
    quickly shows the totals (should be either zero or one tiny file). Or
    the Linux command on the server of
    du -hx --max-depth=1
    does the same. In either case be in the wpcs*/ level where you see the
    numbers, or just in the mslocal/mshold.
    If you have the web console to each of the agents running (which is a
    very good idea), start by looking at the MTA(s) and the Links showing. I
    suspect some are closing often enough, and if so we need to fix that.

    > Well, we do have a Gwava appliance that acts as our mail relay, but
    > I've had confusion about that as well. I have received undeliverable
    > bounceback emails that seem to come from my gwia, instead of an
    > undeliverable coming back from our Gwava box.

    We would have to look at a selection of those errors to tell exactly what
    is happening. One suspicion is that GWAVA can tell right away that some of
    those messages are a problem and triggers the reject mid-stream, which
    would have GWIA telling the user. Once the message is all the way over to
    GWAVA, then it would appear as GWAVA notifying.


    Andy of
    KonecnyConsulting.ca in Toronto
    Knowledge Partner
    http://forums.novell.com/member.php/75037-konecnya

  • Andy Konecny sounds like they 'said':

    > This strongly suggests that there is a linkage problem on the path from
    > the POA through the MTA(s) to GWIA. Is this a basic one PO, one
    > Domain system all running on one box including GWIA or are there more
    > boxes and agents involved?


    Not quite a basic setup. Have 7 PO's on 7 different servers, one of
    which has the main MTA on it too. 1 server for webaccess, 1 server
    with another MTA for my GWIA.

    Primary MTA
  • Andy Konecny sounds like they 'said':

    > We will need to look for stuck files in any of the wpcsin/# or
    > wpcsout# of all the agents involved, and possibly the mslocal/mshold.
    > From a Windows client, Total Commander's Alt-Shift-Enter at those
    > levels very quickly shows the totals (should be either zero or one
    > tiny file). Or the Linux command on the server of
    > du -hx --max-depth=1
    > does the same. In either case be in the wpcs*/ level where you see
    > the numbers or just in the mslocal/mshold


    Checked the wpcsin folders on all POA servers as well as both MTA
    servers, 0 files.

    --
    Stevo
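
For a check like this repeated across several servers, a per-directory file count is quicker to read than sizes. This is a minimal sketch, assuming a directory that holds the numbered queue subdirectories mentioned above; `count_queue_files` is a hypothetical helper name and the path is passed in, not assumed. Going by Andy's rule of thumb, each line should show zero or one file.

```shell
#!/bin/bash
# count_queue_files DIR
# For each immediate subdirectory of DIR, print "<file count> <subdir name>".
# Files in nested directories below each subdirectory are included.
count_queue_files() {
    local base="$1"
    local dir n
    for dir in "$base"/*/; do
        [ -d "$dir" ] || continue          # skip if the glob matched nothing
        n=$(find "$dir" -type f | wc -l)
        printf '%d %s\n' "$((n))" "$(basename "$dir")"
    done
}

# Example call (placeholder path, substitute your own domain or PO directory):
# count_queue_files /path/to/domain/wpcsin
```

Running it a second time a minute later helps separate a file that is merely passing through from one that is stuck.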
  • In article <sudyv.4061$BB4.1288@novprvlin0913.provo.novell.com>, Stevo
    wrote:
    > Not quite a basic setup. Have 7 PO's on 7 different servers, one of
    > which has the main MTA on it too. 1 server for webaccess, 1 server
    > with another MTA for my GWIA.

    So 6 POAs that have to travel the network to get to the MTA. Does the
    problem happen to just the one POA with the MTA, only the 6 that don't,
    all of the above, or too little data to be sure yet?

    > Primary MTA
  • Andy Konecny sounds like they 'said':

    > So 6 POAs that have to travel the network to get to the MTA. Does the
    > problem happen to just the one POA with the MTA, only the 6 that
    > don't, all of the above, or too little data to be sure yet?


    Too little data to know for sure. The only POA that has the issue that
    I've *heard* of is the one on the same box as the MTA.



    > How 'far away' are those other POAs? We also have the POA to MTA link
    > to consider. Beyond a certain distance (WAN links) it becomes a
    > better practice for them to have their own MTAs.


    Remote POA servers are 3 hops away, with all being on 1GB connections
    sans one. That one is via T-1, but it is a very small POA (6 users),
    small enough that if we had a better connection to that site the users
    would be migrated to our POA and that POA would go away.



    > Now that is suspicious given they are on the same box.
    > Internet domain? Is that the name of the Domain for that MTA? Or
    > something else? Either way very strange and we certainly need to look
    > at that box closely.


    Internet is the name of the 'domain' containing external users.


    > Either GWIA or its host. So tell me more about this host: RAM (free
    > -m), CPU (top), Disk use levels (df -h) i.e. Are we pushing any of
    > them too hard. Also check the network interfaces with ifconfig. If
    > there are a lot of errors, it could be a failing NIC/cable/port.
    > Really dig into the different things reporting on the web admin
    > interface, Status page, for both the MTA and GWIA. Load might just
    > be enough we need to up some of those thread options, but first to
    > figure out which one(s). Also possible something is DOSing that box,
    > I've seen where some other system that has gone defective or
    > misconfigured just pummels the box you are trying to sort out, so
    > compare packet rates to some of your other servers.


    free -m: (2 GB allocated to it)
                 total       used       free     shared    buffers     cached
    Mem:          1878       1706        171          0          7        360
    -/+ buffers/cache:       1338        540
    Swap:         2047        720       1327

    top: gwia hovers around 9% of cpu with occasional spikes to 13-20%

    df -h:
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda2        20G   11G  8.1G  57% /
    udev            940M  100K  940M   1% /dev
    tmpfs           940M  2.1M  938M   1% /dev/shm
    admin           4.0M     0  4.0M   0% /_admin
    /dev/pool/DATA  5.1G   93M  5.0G   2% /opt/novell/nss/mnt/.pools/DATA
    DATA            5.1G   27M  5.0G   1% /media/nss/DATA

    Zero errors in ifconfig, but there are some dropped RX packets, less
    than 4000 out of almost 19,000,000.


    --
    Stevo
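
To put the figures above in proportion, here is a minimal sketch that pulls two numbers out of `free -m` output in the older layout shown in this post (newer versions of `free` no longer print the `-/+ buffers/cache` line, so this would need adjusting there) and turns the dropped-packet count into a percentage. `mem_summary` and `drop_pct` are hypothetical helper names, and the packet counts used in the example are the rounded upper bounds quoted above, not exact values.

```shell
#!/bin/bash
# mem_summary: read `free -m` output (old layout with a "-/+ buffers/cache"
# line) on stdin and print memory used by applications and swap in use, in MB.
mem_summary() {
    awk '
        /buffers\/cache:/ { app_used = $(NF - 1) }   # "used" minus buffers and cache
        /^Swap:/          { swap_used = $3 }         # third field is swap "used"
        END { printf "app_used_mb=%d swap_used_mb=%d\n", app_used, swap_used }
    '
}

# drop_pct DROPPED TOTAL: dropped packets as a percentage of total packets.
drop_pct() {
    LC_ALL=C awk -v d="$1" -v t="$2" \
        'BEGIN { if (t > 0) printf "%.3f\n", 100 * d / t; else print "n/a" }'
}

# Example calls:
# free -m | mem_summary
# drop_pct 4000 19000000    # about 0.021 percent with the figures quoted above
```

With the output posted above, `mem_summary` reports 1338 MB used by applications and 720 MB of swap in use. That much swap on a 2 GB guest may be worth a closer look, though by itself it does not explain the missing messages. The drop rate works out to roughly two packets in ten thousand.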
  • Andy Konecny sounds like they 'said':

    > Either GWIA or its host. So tell me more about this host: RAM (free
    > -m), CPU (top), Disk use levels (df -h) i.e. Are we pushing any of
    > them too hard. Also check the network interfaces with ifconfig. If
    > there are a lot of errors, it could be a failing NIC/cable/port.


    Crud, didn't notice you were talking about the host until after I
    posted the last reply.

    Hosts are x222 nodes in an IBM PureFlex, each with 128GB RAM and dual
    8-core 2.3GHz Xeon CPUs.

    Current host the gwia server resides on is running about 14% cpu util,
    and 68% memory util.

    Storage pool that houses this VM has over 400GB free.

    Both nics on the host have 0 errors and 0 drops.

    --
    Stevo