GPF in GWPOA with the latest and greatest 24.3 build but very old GroupWise data (oldest user creation date: 2002)

Hello there to the dying breed of Groupwise Admins!

The post office has been running since 2023 on an 18.4.2 build.

Some GPFs happened in the past; most of the time we rebooted the VM and the post office kept running for months. But on Monday the 26th of August, the GWPOA of two of the four large post offices started to GPF continuously on startup of the GroupWise service.

After 3 days of nearly continuous downtime (many thanks to the sleepy support engineers of the GroupWise front line), a back-line engineer from Rotterdam renamed the NGWDFR.DB, which tracks messages sent with the delay delivery send option, and one of the two large post offices became stable again.

The other one, with 470 users, more than 1.2 million files and more than 2 terabytes of disk space used for the /grpwise/po data files, kept GPFing every few minutes.

I was able to solve the GPF (General Protection Fault) on my own over a 24-hour TeamViewer session from the hotel during my vacation, by running a standalone GWCHECK with all options while the GroupWise service was stopped on the host. That took about 12 hours.

But I can dupe the GPF on a test host, with the data taken over via dbcopy, as soon as the GWCHECK content check with fix problems starts to rebuild the NGWDFR.DB defer database.

My assumption is that there must be "dangerous messages" in the post office, with a delivery date still in the future, that corrupt the defer database as soon as the GWCHECK content check with fix problems finds them and populates the defer database with the information from those weird messages.

We have a lot of users who are really fond of the delay delivery send option and who complain about the C05D error when they try to send a message with delay delivery active while the NGWDFR.DB is not there.

Any idea how to get out of this s(h)ituation?

My only desperate last resort would be to create a new post office and move every user from the corrupted post office to the new one, but the effort is huge, since you have to keep rechecking to see with which user's move the error travels from one post office to the other.

so far - so good (or bad), Stefano

  • 0  

    My two cents ...

    I was in a similar situation at one of my customers and created a cron job that checks the POA every two seconds. In case of trouble, this cron job restarts the POA. This just saves you from watching the system via TeamViewer.
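A minimal sketch of such a watchdog, not the actual job described above: cron itself fires at most once per minute, so a two-second check is usually a small loop that cron or systemd merely keeps alive. The process name `gwpoa` and the unit name `grpwise` are assumptions here and need to be adjusted to the local installation; `watch`, `poa_is_running` and `restart_poa` are hypothetical helper names.

```python
import subprocess
import time
from typing import Callable, Optional

POA_PROCESS = "gwpoa"      # assumption: process name of the POA binary
SERVICE_UNIT = "grpwise"   # assumption: systemd unit that starts the agents


def poa_is_running(process: str = POA_PROCESS) -> bool:
    """Return True if pgrep finds a process with exactly this name."""
    result = subprocess.run(
        ["pgrep", "-x", process],
        stdout=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def restart_poa(unit: str = SERVICE_UNIT) -> None:
    """Ask systemd to restart the unit (requires sufficient privileges)."""
    subprocess.run(["systemctl", "restart", unit], check=False)


def watch(
    is_alive: Callable[[], bool],
    restart: Callable[[], None],
    interval: float = 2.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll is_alive every `interval` seconds and call restart when it fails.

    Runs forever when max_checks is None. Returns the number of restarts.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not is_alive():
            restart()
            restarts += 1
        checks += 1
        sleep(interval)
    return restarts
```

Calling `watch(poa_is_running, restart_poa)` would poll every two seconds until interrupted. Restarting the POA does not repair the defer database; it only shortens the outage for users while the underlying corruption is still being investigated.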

    I assume that you start gwchecks via the GW administration console. If there are corrupted files, your POA will die and your users will suffer from this POA crash. However, you can run a "standalone" gwcheck. There is a binary in /opt/novell/groupwise/agentsbin.
    I access this server with PuTTY, with X11 forwarding enabled. On my Windows computer I start e.g. Xming to capture the X11 output. That way you can run gwcheck independently of your POA.



  • Suggested Answer

    0   in reply to   

    As I see it, the corruption in the ngwdfr.db is causing the POA to crash. The tool to fix DB corruption is gwcheck, and as Diethmar pointed out, the stand-alone version can be tried when the one in the POA seems to crash the POA, to see if it can fix the DB.

    If this is not able to fix it either, I see only two options:

    1. Restore the ngwdfr.db from a backup taken before the corruption appeared
    2. Create a new ngwdfr.db

    The C05D error is a simple one: it says the ngwguard.db knows about the DB, but the DB is not found in the PO directory. With the restore the error will be gone; however, the delayed deliveries for items created since the backup will not happen.

    Recreating the ngwdfr.db means the POA must create a new, empty database, which is only possible once the ngwguard.db no longer knows about the old one. So you need to use gwcheck (stand-alone version) to drop the DB from the ngwguard.db: run gwcheck with the name ngwdfr.db in the User/Resource field and select Structural Rebuild under Action. After this, the reference to ngwdfr.db is dropped from the ngwguard.db and the POA can create the new ngwdfr.db; with that, the C05D is gone as well.

    In case of corruption some data might be lost, and we can't do much about that. The best option would still be a restore from backup, as that still has data.

    Do not consider moving users to and/or from the PO until this ngwdfr.db is there again (restored or new), as this database is key to the move process; without it, moves will not complete.

    If the system is, or were, updated to the latest supported release, we would also be able to provide the core file to development; but as this version is no longer supported, that is an option we don't currently have.

  • 0 in reply to   

    Hi all,

    I can dupe the problem on a test VM with a dbcopy copy of the live post office.

    The whole system has been updated, as of the 22nd of September, from 18.4.2 to 24.3.

    I am able to provide an scc<whatever>.txz, using coredumpctl list and getappcore -j <PID> with the PID from that list.

    I was even able to dupe the problem with GWPOA started under strace on the test VM with the DBCopy copy of the live post office data.

    I would rather identify the dangerous mails and then be able to continue without moving the users.

    But first of all I want to try the structural rebuild of the ngwdfr.db and recreate a fresh one.

    I run all my SSH sessions from MobaXterm: a first SSH session goes to my Linux workstation VM, and from there I use my favorite good old KDE Konsole to connect over SSH to all other Linux VMs with ssh -t -Y <LDAP user>@<VM IP Addr>

    For the DBCopy copy of the live data I always ran GWCheck as the stand-alone one, via sh /opt/novell/groupwise/admin/gwcheck.sh

    I will let you know how it goes on my test server with the copy of the post office mail data.

    - Stefano

  • 0 in reply to 

    Hi all,

    The first mail sent with the delay delivery send option activated recreates the ngwdfr.db file for the defer database.

    BUT:

    As soon as some admin configures a GWCHECK on the post office with content check, the GPF comes back, because the content check reintroduces the "dangerous messages" records into the defer database. So I would need to identify the dangerous messages.

    A content check is running on the dbcopy copy of the live PO on my test server; I will let you know.