Fixing a botched StoreOnce Software store?

Hi,

 

A server running StoreOnce Software on top of an XFS file system had an apparent freeze of the SmartArray controller (FS hanging, hpacucli hanging, no way to shut down properly) and had to be power cycled. XFS recovered, StoreOnce Software checked the store for half an hour and finally went online (started, not readonly). It logged a number of zero-length files, as could be expected from a controller hang/power cycle/XFS recovery.

 

Now I'm verifying individual B2D media, and essentially every single one of them fails with the message "Cannot read from device (StoreOnce error: The object store is not ready)" and a debug log that always looks like this:

Apr07 12:11:15.369 ERROR Apr07 12:11:15.369 ERROR:       7f06c2889700 Failed to load container id 54 exception Exception [StorageMgrException]: thrown in (src/dedupe/core/CONISTON/src/corededupe/storagemgr/StorageMgr.h:259 getEntry)
    Error (9) unresolvable error code
    Exception [StorageMgrException]: thrown in (src/dedupe/core/CONISTON/src/corededupe/storagemgr/StorageMgr.h:259 getEntry)
    Error (9) unresolvable error code
    Exception [StorageMgrException]: thrown in (src/dedupe/core/CONISTON/src/corededupe/storagemgr/StorageMgr.h:259 getEntry)
    Error (9) unresolvable error code
    Detected zero length file

Is there some best practice on how to proceed from here?

 

From D2D^WStoreOnce appliances I know that hard power cycles are frowned upon and that the rebuild after one can take hours, but it tends to complete without extreme breakage. How do I achieve that with SO Software?

 

Specifically, I'm irritated by B2D volumes written ages ago now verifying as bad - the blocks and metadata they reference had long been written and stable; they shouldn't disappear just because some newly written files got zeroed. Unless XFS broke more than just new files, which it has been known to do on occasion. Then again, the appliances use XFS as well...

 

TIA,

Andre.

  • If you can find the s.bad_integrity file under $StoreRoot/StoreOncelibrary/dvol_000000x/committed/storemgr/store.d/store_x (x is 01 if the store is the first store, 02 for the second store, and so on), I would start by removing the 's.bad_integrity' file from the StoreOnce media agent.

     

    Please keep in mind that the path shown is just an example; you may have to do a search for 's.bad_integrity'.

     

    1. Stop all stores using the command "StoreOnceSoftware --stop_store --name=StoreName"
    2. Stop the StoreOnceSoftware service
    3. Remove the s.bad_integrity file
    4. Start the StoreOnceSoftware service
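
    Something like this on a Linux media server - a minimal sketch only: the store root, the store name and the way the StoreOnceSoftware service is stopped and started are placeholders, so adjust them to your installation.

        # Locate the flag file first; $StoreRoot stands in for the actual store root
        find "$StoreRoot/StoreOncelibrary" -type f -name s.bad_integrity

        # 1. Stop the store
        StoreOnceSoftware --stop_store --name=StoreName

        # 2. Stop the StoreOnceSoftware service (platform dependent;
        #    on Windows this is "sc stop Storeoncesoftware")

        # 3. Remove the flag file(s) found above
        find "$StoreRoot/StoreOncelibrary" -type f -name s.bad_integrity -delete

        # 4. Start the StoreOnceSoftware service again
        #    (on Windows: "sc start Storeoncesoftware")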

  • Hi Bob,

     

    there is no s.bad_integrity file and the store actually came online (started and is not readonly). I've had a store with s.bad_integrity once, and the presence of that file prevented it from going online in the first place.

     

    Interestingly, some B2D volumes verify OK, but a large number (including very old ones) do not. I've started writing to the store again, and this went OK for a while, but I finally ran into a zero length file error on writing. Now I would have assumed that writing would fix such files (it has the data, so it can write them anew), but this apparently didn't happen. The backup just froze because the gateways in question became unresponsive...

     

    So the question still is, what is the best practice to get a store back to life after such an event?

     

    In order for the customer to at least have a backup, I've created a second store, attached a B2D device to it and routed all backups to the new store. But how to proceed with the old store? Any ideas? Will leaving the store alone, waiting for it to expire all media (or manually wiping them here and now) and letting them disappear through housekeeping finally restore consistency, or is it botched beyond repair?

     

    I'm likely going to open a case on that, but I feel that SO Software is missing some publicly documented procedures on how to proceed in certain error cases. Having boxes freeze and needing a power cycle, while not being an everyday occurrence, isn't something that never happens. I need some tools to check and repair things in such cases; just imagine having to open vendor support cases every time a file system gets corrupted... Is there any kind of scrubbing or fsck procedure for SO stores?

     

    BTW, according to find, there are a total of 611 zero-sized regular files in the store, all of them being either containerIndex_* or portion_* files.
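
    For reference, the count came from plain GNU find, roughly like this (the store root is a placeholder for the actual store path):

        # count the zero-sized regular files below the store root
        find "$StoreRoot" -type f -size 0 | wc -l

        # list them to see which kinds of files (containerIndex_*, portion_*) are affected
        find "$StoreRoot" -type f -size 0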

     

    TIA,

    Andre.

  • I agree, the best course of action for this is to open a new case, because I really can't provide any additional information on this - the StoreOnce documentation is really lacking as to how to deal with actual problems.

     

    You may be asked to debug the first store that you are having problems with. I suggest that you set up the debugging this way, since it is unique to StoreOnce:

     

    From the command prompt on the StoreOnce agent:

     

              cd \Program Files\Omniback\bin

              sc stop Storeoncesoftware

              sc start Storeoncesoftware --log_level=tracing

     

    Run your backup. This should create a debug file in

     

              ProgramData -> Omniback -> tmp

     

    that will start with OB2DBG and contain 'storeoncesoftwarecore' as part of the filename. After the error, go back to normal logging:

     

              sc stop Storeoncesoftware

              sc start Storeoncesoftware --log_level=no_log

  • Re,

     

    FYI, the basic procedure established in a case with HP Support is to (a rough shell sketch follows the list):

    1. Bump to 7.03 as there are relevant fixes in this patch bundle
    2. Stop the store and SO Software
    3. Remove s.bad_integrity should it exist
    4. Remove the house keeping queue
    5. Remove any zero-sized regular file in the store (of which I had 611)
    6. Start the store again, with tracing enabled, and check where you stand.
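
    Rough shell sketch of steps 2-6 - everything here is a placeholder: the store root, the store name, the service handling and especially the housekeeping queue location (which support will point you to) differ per installation.

        # 2. stop the store, then the StoreOnceSoftware service
        StoreOnceSoftware --stop_store --name=StoreName
        #    (stopping/starting the service itself is platform dependent;
        #     on Windows it is "sc stop Storeoncesoftware")

        # 3. remove s.bad_integrity, should it exist
        find "$StoreRoot" -type f -name s.bad_integrity -delete

        # 4. remove the housekeeping queue
        #    (location not reproduced here - support supplied it in my case)

        # 5. remove all zero-sized regular files in the store
        find "$StoreRoot" -type f -size 0 -delete

        # 6. start the service again with tracing enabled
        #    (on Windows: "sc start Storeoncesoftware --log_level=tracing")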

    For me, this changed a number of error messages from "zero length file" to "no such file or directory", but I now seem to be able to recycle and omnimm --delete_unprotected_media damaged media and to have the store housekeep and compact again (it would not touch zero-length files, not even when retiring objects for deletion, so the manual removal seems necessary). I'm not going to try writing to it again before every broken medium is wiped. Sometimes a verify will fail on opening the medium and lock that medium up, so it won't even be purged by delete_unprotected_media; I've found a restart of StoreOnceSoftware cures this state.

     

    Now a little script is working through some 1300 remaining suspect media I pushed to a limbo pool, verifying them[1] and pushing those that verify good back to the pool where they belong. Whatever doesn't pass muster is recycled, SO is restarted, and finally the unprotected media are purged. Let's see how it ends; the verifying chore will likely take another day.

     

    [1] The key to scriptable verification is finding out how to call omnimver correctly for a B2D medium. I was so tired of clicking through the GUI that I tried everything until I found:

     

    omnimver -device b2d-gateway-name -slot location-id

     

    e.g.

     

    omnimver -device B2D-1-G1 -slot da7c78c2_5317b729_09cf_1cf9

     

    The location-id is the hex string you get from omnimm -list_pool in the Location column, where it looks like this:

    # omnimm -list_pool B2D-1_MediaPool | head
    
    Status Medium Label                  Location                               Full Protected
    ==========================================================================================
    Poor   B2D-1_MediaPool_7855           [B2D-1: da7c78c2_5313c2aa_09cf_1be8 No   04/28/14 01:51:56
    Poor   B2D-1_MediaPool_7856           [B2D-1: da7c78c2_5313c992_09cf_1bf0 No   04/28/14 02:36:30
    Poor   B2D-1_MediaPool_7858           [B2D-1: da7c78c2_5313c992_09cf_1bf4 No   04/28/14 02:19:37
    [...]

    For some odd reason, the location id is not the same as the medium id, which gave me extra pause. Anyway, it's working now.
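
    In case it saves somebody the clicking, the verify loop is essentially the sketch below. The pool and gateway names are from my setup, the awk field positions are fitted to the -list_pool output above, and I'm assuming that omnimver returns a non-zero exit code on a failed verify and that omnimm -recycle accepts the medium label - check both against your environment before trusting the script.

        #!/bin/sh
        LIMBO_POOL="B2D-1_LimboPool"    # pool holding the suspect media (example name)
        GATEWAY="B2D-1-G1"              # B2D gateway to verify through

        # pick medium label ($2) and location id ($4) from the -list_pool output
        omnimm -list_pool "$LIMBO_POOL" |
        awk '$3 ~ /^\[/ { print $2, $4 }' |
        while read label slot; do
            if omnimver -device "$GATEWAY" -slot "$slot"; then
                echo "OK   $label"          # verified fine - move it back to its proper pool
            else
                echo "BAD  $label"
                omnimm -recycle "$label"    # recycle so delete_unprotected_media can purge it
            fi
        done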

     

    HTH,

    Andre.

  • Hi

     

    I've been following this thread for a while - just thought I'd add some of my own experiences here.

     

    Our setup here is similar to what you describe - we've got a StoreOnce software device that's been created on RHEL 6.4, running on an XFS filesystem. The XFS filesystems have been created on LUNs presented over a FC SAN.

     

    For various reasons (which we've now resolved) we had a few issues where housekeeping activities on the underlying storage devices caused I/O pauses, which then caused some of the OS filesystems to become read-only, and the only way out of this was a hard reboot of the server.

     

    This clearly caused significant issues with the StoreOnce device, similar to that which you have described above.

     

    I've been using the recovery and verification methods that have been posted here, along with instructions from Support, with some success, but like you I'm a little concerned about the overall integrity of the data store. It seems the only way to completely verify the StoreOnce datastore is to verify the media individually - I've got between 70 and 100 TB of data in the store, so you can imagine how long that is going to take!

     

    What I really would like is a way to verify the datastore directly, rather than via the individual media. Verifying this way would mean checking between 8 and 14 TB rather than the full data set. 

     

    The other thing I'd like to see is that if there *is* a corrupt block, it's flagged, and the next time a backup is taken, the block gets rewritten. There's no reason to completely discard ~20 copies of a backup if you still have a valid source copy you can use...

     

    I've also got a few operational issues - the fact that they put configuration data inside the store right alongside the user data, and the lack of good, consistent logging from the StoreOnce processes, but those are issues for another day...

  • Hi Justin,

     

    thanks for chiming in. As an update, the store in question is now dead in the water - it has completely emptied on the user data side (all remaining media have expired in the meantime), but it still claims some 500 GB on disk. Given there are no tools to repair this, I'm only left with removing that store, which I will do soon. This is clearly not a satisfactory solution.

     

    "What I really would like is a way to verify the datastore directly, rather than via the individual media. Verifying this way would mean checking between 8 and 14 TB rather than the full data set."

     

    Exactly. There is a need for an fsck-like operation that can be triggered externally and would verify the entire structure of a store, repairing anything that can be repaired, removing anything that can't, and returning the store to a consistent state as presented to the outside. I could then call it on my remaining stump of a store and it would turn into a proper store with zero user data - and I could have called it directly after the incident to clean up. There is also a need for automatic as well as externally triggerable scrubbing of entire stores during normal operation, to find errors that have silently crept in. The latter is in the making for the appliances, so let's hope the feature, which is in the codebase anyway, will also be exposed to SO Software users.

     

    "The other thing I'd like to see is that if there *is* a corrupt block, it's flagged, and the next time a backup is taken, the block gets rewritten. There's no reason to completely discard ~20 copies of a backup if you still have a valid source copy you can use..."

     

    That is a great idea. It would mean the fsck-style operation has to come in two flavors - one that leaves missing data chunks pending, and one that finally gives up on the missing data. Then you could do this after a botch:

    1. Run the soft fsck to get the store into a state where it can again be written to, with all media readable, but some of them failing on read due to missing blocks.
    2. Now wait a cycle of full backups, or manually start some. This would repair all the blocks that went missing, magically reviving the media that failed to read after step 1. Some will not make it, though.
    3. Now run the hard fsck to get the store back to a consistent state, finally getting rid of anything that is unrecoverable.

    Ideally, you would never need step 3, because expiring media would slowly rid us of any referenced-but-missing blocks. Both the soft and the hard fsck would also need a way to quickly communicate broken media back to DP, so the media can be marked poor there without needing a verify (which, as you state, is essentially not an option for any properly filled store; I was just lucky I got hit on a less-than-1-TB store where the scripted verify took two days).

     

    The basic problem is probably that the developers never hear us. The layer between them and us which treats every idea or bug report as a support cost factor is quite thick these days...

     

    Thanks,

    Andre.

  • Both Justin and Andre, these are great thoughts, and I would like to see if there is a possibility of them being implemented.

     

    Andre spoke to this:

     

    "The basic problem is probably that the developers never hear us. The layer between them and us which treats every idea or bug report as a support cost factor is quite thick these days..."

     

    While I don't disagree with him, I still think that this is worth at least an Enhancement Request. You can do this from the SSO web site:

     

    https://support.openview.hp.com/enhancement/do/main

     

    I would encourage you to be as verbose and detailed as possible.

  • Hi Bob

     

    Thanks for your assistance. It's always good to see a representative of the product/organisation posting in here.

     

    I have already raised these as a formal enhancement request - if you are interested, I can send you the ER number.