Server crashes when doing DBCopy

We are trying to upgrade to GW2012. We have a new DL360G8 server running SUSE Release 11, kernel 3.0.93-0.5. All HP diagnostics on the computer and RAID pass.

When we run dbcopy to move databases from our existing GW8 system, the server crashes, reporting a RAID controller issue. No more details than that.

We've tried this on 2 new servers - same issue. It only crashes when using dbcopy - not when we copy large files using other methods.

HP and Novell have not been able to come up with any solutions.

Anyone have any ideas on this?
  • Hi,

    Can you provide a bit more info?

    Which server crashes - I'm assuming it's the new one...

    Are you running any specific RAID drivers or using the stock ones supplied with the SUSE kernel?

    At a technical level, I am assuming that dbcopy involves thousands of small files rather than a small number of large ones, which will almost certainly thrash the RAID controller and disks more.
  • Plus also info on what sort of RAID setup you have (RAID 1/5/6 etc.), what filesystems you are using (OES with NSS, ext3, etc.), and whether there is anything in the log files.

    Also, what happens when the crash occurs - is it just dbcopy that crashes, or does the whole system lock up? Does it have caching enabled? Is it a proper hardware RAID controller (which model?) or are you attempting software RAID (a bad idea on a production GroupWise server)?

    This additional info will almost certainly help people trying to assist on the forums - or at least avoid buying the same model you have!

    Good luck and here's hoping somebody has experienced the issue and found a solution.
  • Hi,

    It is indeed a driver / hardware / firmware issue. There is nothing that a user space application such as DBCOPY can do that will crash the kernel or blow up disk drivers. So you should be talking to SUSE and HP, not Novell, as Novell has nothing to do with the issue.

    I have to assume that HP has walked you through all of this stuff, but you never know. In order of decreasing relevance:

    Have you applied the latest HP firmware update CD?
    Have you applied the latest hpsa ( or appropriate ) drivers from the HP PSP? <=== This sounds like the problem.
    Have you disabled the GUI ( trust me, I almost know what I'm talking about ) by setting the default run level to 3 rather than 5 via YaST?
    Have you disabled all the HP power management crap in the BIOS, which is crap, did I mention it's garbage? Because it's crap.
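
    For the run level item, a minimal sketch of checking the configured default from a shell rather than YaST. It assumes the box uses a sysvinit-style /etc/inittab, as SLES 11 normally does; default_runlevel is a hypothetical helper name, not an existing tool.

    ```shell
    # Print the default run level configured in an inittab-style file.
    # Assumption: sysvinit layout, i.e. a line such as "id:5:initdefault:".
    # The file path is an optional argument so the helper can be tried on a copy.
    default_runlevel() {
        awk -F: '$1 !~ /^#/ && $3 == "initdefault" { print $2; exit }' "${1:-/etc/inittab}"
    }

    # On the server itself:
    #   default_runlevel    # 5 means the GUI starts at boot, 3 means text mode
    #   runlevel            # shows the previous and current run level
    ```

    If it reports 5, changing the default in YaST as described above (or running "init 3" as root for a one-off test) should keep X from starting.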

    There was a showstopping issue with older hpsa drivers ( iirc version 3.1.0.something and prior ) which caused the driver to issue a reset, and this would cause the I/O to blow up and a subsequent kernel panic. You will see hpsa related messages in the kernel ring buffer (viewed via dmesg or on the Alt-F11 kernel debug screen) when this is happening, something like "issuing reset on 3.0.0" and then blammo. It's semi critical to see those entries, because they show specifically what error precipitates the crash.

    ( Also, when dealing with HP, it's important to get to the proper backline team, in this case Smart Array Linux, which can be an arduous task. )
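
    A rough sketch of pulling those entries out of the kernel log. hpsa_lines is a hypothetical helper name, and the exact message wording depends on the driver version, so it simply matches anything that mentions hpsa.

    ```shell
    # Keep only the log lines that mention the hpsa driver.
    # Reads from stdin, so it works on live dmesg output or a saved log file.
    hpsa_lines() {
        grep -i 'hpsa'
    }

    # On the affected server (root may be required):
    #   dmesg | hpsa_lines
    #   hpsa_lines < /var/log/messages
    #   modinfo hpsa | grep -i '^version'    # version of the installed hpsa module
    ```

    Since the box locks up hard, the last messages may never reach disk; checking /var/log/messages after the reboot, or watching the console during the copy, is probably the best chance of catching them.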

    -- Bob
  • Server is fully patched with the latest updates. Drives are all set up as EXT. Using the latest HP drivers. We added a new controller and put the OS on one and the GW system on the other. The system is on RAID 1. GW is on RAID 10. When we use other commands to copy files - no problem. It only crashes when using DBCOPY. By crash I mean it locks up - keyboard and mouse do not work. Need to reboot to get it back.
  • Bob - I was told the GUI was required for running GroupWise - is that not the case?
  • dknutson;2297227 wrote:
    Bob - I was told the GUI was required for running Groupwise - is that not the case?

    GUI is not needed. You need X and x/gdm installed because of some dependencies in the agents allowing them to expose a GUI if requested, so they need those bindings. But having gdm or X running is not needed at all once the system is installed.

    And again, we saw a very similar issue. Either the SCSI driver crashes - which disconnects your swap and the OS just seizes up - OR - it causes so much I/O that the OS cannot swap quickly enough and you are never able to operate the GUI to get in. We also saw an issue with the graphics adapters on these servers: with the GUI running we would see this behavior, without it, everything was fine.

    On one initially problematic box, we have had, literally, no unplanned downtime for 3 years once we ditched the GUI.

    Also, you can try to limit the number of threads dbcopy uses with the -t parameter, e.g.

    /opt/novell/groupwise/agents/bin/dbcopy -m -p -f -t 3 /source_path /destination_path
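
    If you want to try a few different thread counts, a small wrapper along these lines may save some typing. It is only a sketch: run_dbcopy and the DBCOPY override variable are hypothetical names, the switches are the same ones as in the command above, and the only check it does is that the thread count is a positive integer.

    ```shell
    # Run dbcopy with an explicit thread count: run_dbcopy THREADS SOURCE DEST
    # DBCOPY may be overridden, e.g. DBCOPY=echo to print the arguments
    # instead of actually copying anything.
    run_dbcopy() {
        case $1 in
            ''|*[!0-9]*|0)
                echo "thread count must be a positive integer" >&2
                return 2
                ;;
        esac
        "${DBCOPY:-/opt/novell/groupwise/agents/bin/dbcopy}" -m -p -f -t "$1" "$2" "$3"
    }

    # Example: start with a single thread and work upwards.
    #   run_dbcopy 1 /source_path /destination_path
    ```

    Starting at one thread and increasing until the lockup comes back would at least show whether the crash depends on how hard the controller is being driven.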

    And I just thought of this: since you mention a server to server copy, if the source / destination is a CIFS or NCP mounted share, you can also run into another issue: oplocks. Disable oplocks on both ends, and on the Linux box, if using CIFS:

    echo 0 > /proc/fs/cifs/OplockEnabled
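
    To confirm the setting took, something like the following sketch can read the value back. cifs_oplock_state is a hypothetical helper name; it assumes the cifs module is loaded and exposes the same /proc file used above.

    ```shell
    # Report whether CIFS oplocks are enabled, based on the /proc flag.
    # The path is an optional argument so the helper can be tried on a test file.
    cifs_oplock_state() {
        f=${1:-/proc/fs/cifs/OplockEnabled}
        if [ ! -r "$f" ]; then
            echo "unknown (cannot read $f)"
            return 1
        fi
        case $(cat "$f") in
            0) echo "disabled" ;;
            *) echo "enabled" ;;
        esac
    }

    # On the Linux box:
    #   cifs_oplock_state
    ```

    Values written under /proc do not survive a reboot, so the echo has to be repeated (or scripted) before each copy session.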

    -- Bob