GPF in GWPOA with the latest 24.3 build but very old GroupWise data (oldest user creation date 2002)

Hello there to the dying breed of Groupwise Admins!

The post office has been running on an 18.4.2 build since 2023.

Some GPFs happened in the past; most of the time we rebooted the VM and the post office kept running for months. But on Monday the 26th of August, the GWPOA of two of the four large post offices started to GPF continuously on startup of the GroupWise service.

After 3 days of nearly continuous downtime (many thanks to the sleepy support engineers of the GroupWise front line), a back line engineer from Rotterdam renamed NGWDFR.DB, the database that tracks messages sent with the delay delivery send option, and one of the two large post offices has been stable again since.

The other one, with 470 users, more than 1.2 million files and more than 2 terabytes of disk space used for the /grpwise/po data files, kept GPFing every few minutes.

I was able to solve the GPF (General Protection Fault) on my own through a 24-hour TeamViewer dial-in from the hotel during my vacation, by running a standalone GWCHECK with all options while the GroupWise service was stopped on the host. It took about 12 hours.

But I can dupe the GPF on a test host that takes over the data with dbcopy, as soon as the GWCHECK content check with fix problems starts to rebuild the NGWDFR.DB defer database.

My assumption is that there must be "dangerous messages" in the post office, with a delivery date still in the future, that corrupt the defer database as soon as the GWCHECK content check with fix problems finds them and populates the defer database with the information from those weird messages.

We have a lot of users who are really fond of the delay delivery send option and who complain about the C05D error when they try to send a message with delay delivery active while NGWDFR.DB is not there.

Any idea how to get out of this s(h)ituation?

My only desperate last resort would be to create a new post office and move every user from the corrupted post office to the new one, but the effort is huge, since after each move you have to recheck which user move carries the error from one post office to the other.

so far - so good (or bad), Stefano

  • 0  

    My two cents ...

    I was in a similar situation at one of my customers and created a cronjob to watch the POA every two seconds. In case of trouble, this cronjob restarts the POA. This just helps to avoid watching the system via TeamViewer.

    I assume that you start GWCHECK via the GW administration console. If there are corrupted files, your POA will die and your users will suffer from this POA crash. However, you can run a "standalone" GWCHECK. There is a binary in /opt/novell/groupwise/agentsbin.
    I access this server with PuTTY, with X11 forwarding enabled. On my Windows computer I start e.g. Xming to capture the X11 output. Now you can run GWCHECK independently of your POA.
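
A minimal sketch of such a POA watchdog, written as a small polling loop because cron itself only schedules at one-minute granularity. The process name `gwpoa` and the systemd unit name `grpwise` are assumptions and must be adapted to the actual host; the function names are made up for illustration.

```python
import subprocess
import time


def poa_is_running() -> bool:
    # pgrep exits with 0 when at least one process with this exact name exists.
    # Assumption: the POA process is called "gwpoa".
    result = subprocess.run(["pgrep", "-x", "gwpoa"], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def restart_poa() -> None:
    # Assumption: the agents are managed by a systemd unit called "grpwise".
    subprocess.run(["systemctl", "restart", "grpwise"], check=False)


def watch(is_running=poa_is_running, restart=restart_poa,
          interval=2.0, max_checks=None, sleep=time.sleep) -> int:
    """Poll the POA every `interval` seconds and restart it when it is gone.

    Returns the number of restarts once `max_checks` polls are done;
    with max_checks=None the loop runs until it is interrupted.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if not is_running():
            restart()
            restarts += 1
        sleep(interval)
    return restarts
```

Calling `watch()` with the defaults polls forever every two seconds. A watchdog like this only hides the symptom: if the POA crashes on every start, it will simply restart it in a loop.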



  • Suggested Answer

    0   in reply to   

    As I see it, corruption in the ngwdfr.db is causing the POA to crash. The tool to fix DB corruption is GWCHECK, and as Diethmar pointed out, the standalone version can be tried when the one in the POA seems to crash the POA, to see if it can fix the DB.

    If this is not able to fix it either, I see only two options.

    1. Restore the ngwdfr.db from a backup taken before the corruption appeared
    2. Create a new ngwdfr.db

    The C05D error is a simple one: it says the ngwguard.db knows about the DB, but the DB is not found in the PO directory. With the restore the error will be gone; however, delayed deliveries for items created since the backup will not happen.

    Recreating the ngwdfr.db means a new empty database must be created by the POA, which is only possible when the ngwguard.db no longer knows about the old one, so you need to use GWCHECK (standalone version) to drop the DB from the ngwguard.db. Run GWCHECK with the name ngwdfr.db in the User/Resource field and select Structural Rebuild as the Action. After this run, the reference to ngwdfr.db is dropped from the ngwguard.db and the POA can create the new ngwdfr.db; with this the C05D is gone as well.

    In case of corruption some data might be lost; we can't do much about that. However, the best option would be a restore from backup, as this still has data.

    Do not consider moving users to and/or from the PO until this ngwdfr.db is there again (restored or new), as this database is key to the move process; when it is not there, moves will not complete.

    If the system were updated to the latest supported release, we would also be able to provide the core file to development, but as this version is no longer supported, this is an option we don't currently have.

  • 0 in reply to   

    Hi all,

    I can dupe the problem on a test VM with a dbcopy copy of the live post office.

    The whole system was updated from 18.4.2 to 24.3 on the 22nd of September.

    I am able to provide a scc<whatever>.txz: I take the PID from coredumpctl list and pass it to getappcore -j <PID>.

    I was even able to dupe the problem with GWPOA run under strace on the test VM with the dbcopy copy of the live post office data.

    I would rather identify the dangerous mails and then be able to continue without moving the users.

    But first of all I want to try the structural rebuild of the ngwdfr.db and recreate a fresh one.

    I run all my SSH sessions from MobaXterm: a first SSH session goes to my Linux workstation VM, and from there I use my favorite good old KDE Konsole to connect over SSH to all other Linux VMs with ssh -t -Y <LDAP user>@<VM IP Addr>

    For the dbcopy copy of the live data I always ran the standalone GWCHECK via sh /opt/novell/groupwise/admin/gwcheck.sh

    I will let you know how it goes on my test server with the copy of the post office mail data.

    - Stefano

  • 0

    Hi all, here is the standalone GWCHECK output with ngwdfr.db renamed to ngwdfr.dba:

    STRUCTURAL VERIFICATION of system databases
    STRUCTURAL VERIFICATION of database ngwguard.db
    - Database is structurally consistent
    Reading Guardian Database store catalog info
    Processing Post Office = PO02, Store Catalog Path = /grpwise/po02prod
    STRUCTURAL VERIFICATION of database /grpwise/po02prod/ofmsg/ngwdfr.db
    - Attempting to correct structural problem in database
    Problem 39- Unknown file ngwdfr.dba - 77824 bytes, 09/30/24 10:15
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Problem 39- Unknown file ngwdfr.dbb - 217088 bytes, 09/24/24 17:35
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Problem 39- Unknown file ngwdfr.dbc - 73728 bytes, 09/24/24 10:38
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Problem 39- Unknown file ngwdfr.dbd - 77824 bytes, 09/27/24 13:56
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Problem 39- Unknown file ngwdfr.dbe - 217088 bytes, 09/24/24 17:35
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Problem 39- Unknown file ngwdfr.dbf - 221184 bytes, 09/30/24 00:41
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Error 0x8209 opening /grpwise/po02prod/ofmsg/ngwdfr.db
    - Beginning rebuild for database ngwdfr.db
    Error 26- DbRebuild error STORE_FILE_NOT_FOUND (0xC05D)
    - Store will be dropped from guardian catalog so it can be re-created
    *WARNING*: no records were recovered from database during
    rebuild process. Try to restore an earlier backup of the
    file, or else run CONTENTS check to repair system folders.
    Validating file references in database:
    Error 18- MESSAGE database open error INVALID_STORE_NUM (0xC067) on n
    Suggestion- Try physical check/rebuild of database
    PROCESSING COMPLETED- total processing time: 0:00:00

    *********************************************************************
    Uncorrectable conditions encountered:
    CODE DESCRIPTION COUNT
    ---- -------------------------------------------------- -----
    18 Message database open errors....................... 1
    26 Errors trying to do structural database rebuild.... 1
    Correctable conditions encountered:
    CODE DESCRIPTION COUNT
    ---- -------------------------------------------------- -----
    39 Unrecognized or invalid files in mail directories.. 6
    *********************************************************************
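
A log like the one above can be condensed with a few lines of Python when many runs have to be compared. This is only a sketch that assumes the line formats shown in this output ("Problem 39- Unknown file ..." and "Error ..." lines with hex codes); the function name and the sample text are made up for illustration and it is not a GroupWise tool.

```python
import re

# Patterns derived from the log lines shown in this thread (an assumption
# about the format, not an official GWCHECK log specification).
UNKNOWN_FILE = re.compile(r"Problem 39- Unknown file (\S+) - (\d+) bytes, (\S+ \S+)")
HEX_CODE = re.compile(r"0x[0-9A-Fa-f]+")


def summarize_gwcheck_log(text):
    """Collect unknown files and hex error codes from GWCHECK log text."""
    unknown_files = []
    error_codes = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = UNKNOWN_FILE.search(line)
        if match:
            name, size, stamp = match.groups()
            unknown_files.append((name, int(size), stamp))
        elif line.startswith("Error"):
            # One error line may carry a hex code such as 0xC05D.
            error_codes.extend(HEX_CODE.findall(line))
    return {"unknown_files": unknown_files, "error_codes": error_codes}


SAMPLE = """\
    Problem 39- Unknown file ngwdfr.dba - 77824 bytes, 09/30/24 10:15
    Error 0x8209 opening /grpwise/po02prod/ofmsg/ngwdfr.db
    Error 26- DbRebuild error STORE_FILE_NOT_FOUND (0xC05D)
    Error 18- MESSAGE database open error INVALID_STORE_NUM (0xC067) on n
"""

summary = summarize_gwcheck_log(SAMPLE)
print(summary["unknown_files"])
print(summary["error_codes"])
```

Running it over the logs of several GWCHECK passes makes it easier to see whether the same error codes keep coming back after each rebuild.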

  • 0 in reply to 

    Hi there,

    If I understand correctly, a new ngwdfr.db will now be created as soon as I send a message with the delay delivery send option, correct?

    Many thanks in advance, Stefano

  • 0   in reply to 

    As the log shows, the reference to ngwdfr.db has been removed from the ngwguard.db, so the database can be created again:

    Error 26- DbRebuild error STORE_FILE_NOT_FOUND (0xC05D)
    - Store will be dropped from guardian catalog so it can be re-created

    This is all described in one of the TIDs about errors C05D and C067, as both errors are related, and I assume this and/or other TIDs about this were shared as well.

  • 0   in reply to 

    Which file system are your DB and GroupWise running on? What is underneath your VMs? iSCSI? TrueNAS? Rob and Diethmar have given valuable hints, but I also know such behavior from cases where the hardware under the VMs is defective or something else is going on. With TrueNAS and related systems, there is, or recently was, a severe problem with the SCSI protocol under high load.


  • 0 in reply to 

    Hi all,

    The first mail sent with the delay delivery send option activated recreates the ngwdfr.db file for the defer database.

    BUT:

    As soon as some admin configures a GWCHECK on the post office with content check, the GPF comes back, because the content check reintroduces the "dangerous messages" records into the defer database. So I would need to identify the dangerous messages.

    The content check is running on my test server with the dbcopy copy of the live PO. I will let you know.

  • 0 in reply to   

    So far, so good.

    But I assume that the corruption is reintroduced in the post office as soon as you run a GWCHECK with content check / fix problems.

    It takes a long time. I will let you know.

  • 0 in reply to   

    Hi there,

    We have a Dell VxRail HCI 8-node VMware vSphere with vSAN stretched cluster across 2 data centers.

    We moved the data with a host migration just on Sunday the 22nd of September.

    The file system is XFS, created with the mkfs.xfs settings recommended by Veeam.

    We have two GWDR servers in the two data centers with copies of the live post offices. I can dupe the problem on different copies of the live data, so the corruption is in the GroupWise system and not in the underlying infrastructure; of this I am quite sure.

    Open Text support spent days making false claims about the culprit, saying it was the SLES OS and the old GroupWise 18.4.2 patch level. I have duped the problem again with GW 24.3 and the latest SLES 15 SP5 patch level. It was all only a diversionary tactic to avoid spending hours and hours on a Linux core dump analysis.