
Pool had an error at block xxxxx

Ahh crap... Can anyone explain why I have gotten this error? This is the
second error of this type in the last couple of weeks. I couldn't find any
TIDs or forum posts on this error the first time it occurred. I did run a
pool verify, which reported errors, and then a pool rebuild, which fixed
the errors with some lost data (21 files, 12 MB). A second pool verify found
no further errors or warnings.

The volume does not dismount and everything continues to 'work', but I'm
not sure whether I can trust the pool, the disks, or the host.

If you know what exactly this error represents, or what steps I can take to
find out the source of the problem, I'd love to hear about it.

The only file activity that I can definitively point to occurring at the
time was a cron job that executes trustee.nlm:
07/14-17:05:00 load trustee /D /A save all apps:\system\trustees.csv

According to the timestamp on the file, it finished writing at 5:11 pm, so
the error occurred right near the start of the file creation process. The
job overwrites the existing file, and it does this every day.
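
For reference, the cron entry driving that job looks roughly like this
(assuming the standard sys:etc\crontab format read by cron.nlm; the command
itself is copied from the log line above):

# sys:etc\crontab - export trustees daily at 17:05
5 17 * * * load trustee /D /A save all apps:\system\trustees.csv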


7-14-2009 5:06:43 pm: COMN-3.27-178 [nmID=A0020]
NSS-2.70-5009: Pool N02/APPS had an error
(20012(beastTree.c[510])) at block 60867455(file block -60867455)(ZID
1).


7-14-2009 5:06:43 pm: COMN-3.27-180 [nmID=A0022]
NSS-2.70-5008: Volume N02/APPS had an error
(20012(beastTree.c[510])) at block 60867455(file block -60867455)(ZID
1).

The environment is as follows:
1. NetWare 6.5 SP8 with the post-SP8 NSS patch.

2. NetWare is running in an ESX 3.5u4 build 153xxx VM with 1 vCPU, ~5 GB
RAM, an e1000 NIC, and an LSI virtual SCSI HBA, which accesses two raw
device mapped (RDM) LUNs of 1.8 TB each, combined into a single pool with a
single volume (APPS).

3. Each of the two RDM LUNs is four WD RE3 1 TB drives in a RAID 10 array
hosted on a CoRAID SR2461 storage shelf, which is connected back to the ESX
server via a 10GbE HBA and an HP6400 10GbE switch.

4. Services hosted on the NetWare server are: ZDM 7 SP1 IR3a, iManager
2.7.2, eDir 8.7.3.10 FTF1.

Thanks for any insight you have!

Ron



Re: Pool had an error at block xxxxx

One more bit of info... the previous - and first - error occurred at nearly
the same time as the recent error:

7-07-2009 5:09:56 pm: COMN-3.27-178
Severity = 5 Locus = 3 Class = 0
NSS-2.70-5009: Pool N02/APPS had an error
(20012(beastTree.c[510])) at block 45975092(file block -45975092)(ZID 1).

7-07-2009 5:09:56 pm: COMN-3.27-180
Severity = 5 Locus = 3 Class = 0
NSS-2.70-5008: Volume N02/APPS had an error
(20012(beastTree.c[510])) at block 45975092(file block -45975092)(ZID 1).


Different block, but the same error, at close to the same time on the same
day of the week. Could just be coincidence...


Re: Pool had an error at block xxxxx

AND the third piece of info, from the pool verify error report:

-------------------------------- ERRORS -----------------
Name Tree Entries != Named Objects.
Name Tree Entries Exist With No Object.
---------------------------------------------------------
+------------------------------- WARNINGS ---------------
| Unaccounted Blocks Exist.
| Volume System Report 'Number of Files' in Error.
| Volume System Report 'Number of Objects' in Error.
| Object Tree Node(s) Contain Corruption.
| A 'Volume Problem Report' Shows Problems.
+--------------------------------------------------------

Clearly I'll be running another pool rebuild... but I need to find the root
cause and correct it, so any help in this regard is dearly appreciated.
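
For reference, I ran the verify and rebuild from the server console with
commands along these lines (pool name APPS taken from the log messages; as I
understand it the pool is offline while these run, so plan for the downtime):

nss /poolverify=APPS
nss /poolrebuild=APPS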

Thx,

Ron


Re: Pool had an error at block xxxxx

On Wed, 15 Jul 2009 01:37:26 +0000, Ron Neilly wrote:

> AND the third piece of info, from the pool verify error report:
>
> -------------------------------- ERRORS -----------------
> Name Tree Entries != Named Objects.
> Name Tree Entries Exist With No Object.
> ---------------------------------------------------------
> +------------------------------- WARNINGS ---------------
> | Unaccounted Blocks Exist.
> | Volume System Report 'Number of Files' in Error.
> | Volume System Report 'Number of Objects' in Error.
> | Object Tree Node(s) Contain Corruption.
> | A 'Volume Problem Report' Shows Problems.
> +--------------------------------------------------------


All the notifications thrown at you are there because NSS detects corruption
in the underlying structures.

Whether these corruptions are physical errors at the given block numbers, or
happened because at some point NSS lost the connection to the storage while
the pool was active, I don't know. That is what you need to investigate.

The beast tree error exists to prevent further data corruption, so NSS
should just be considered the messenger here: it is informing you that
corruption exists and that, in order to maintain data integrity, you need to
take action.

You posted the verify log file output already, but after you have done the
rebuild, what does the rebuild log file (<poolname>.rlf extension) say?

Hans

Re: Pool had an error at block xxxxx

Hans van den Heuvel wrote:

> You posted the verify log file output already but after you have done the
> rebuild, what does the rebuild log file say (<poolname>.rlf extension) ?
>
> Hans


Thanks for the response, Hans. The rebuild log file has this to say:

Jul 14, 2009 6:37:49 pm
******* REPAIR LOG ********
Leaf Pruned, Files Lost.

Not terribly informative - well, not to me anyway. Part way through the
rebuild process it stated that if I chose to continue, data loss could occur
(and that I should try using third-party utilities to recover...). So I
chose to continue. I was then informed that a small number of files (8, I
think) would be lost (about 22 MB). The files were not identified by name,
and nothing is in the repair directory created at the root of the volume. So
not much help there.

I have seen various bits of advice in the VMware forums to modify various
timing parameters, e.g. on the scsihd.cdm driver (RETRIES=8) in startup.ncf
and on the NSS parameter /ZLSSPendingWriteIOs=2000.

I had previously set the scsihd.cdm RETRIES=8, and /ZLSSPendingWriteIOs is
currently set to 1000.
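
To be concrete, the two settings I'm describing look roughly like this (the
actual load line in startup.ncf on your server may carry additional
parameters; the NSS switch can be issued at the console or from an ncf file
run at startup):

# c:\nwserver\startup.ncf - bump the retry count on the SCSI custom device module
load scsihd.cdm retries=8

# server console (or a startup ncf) - raise the pending write I/O limit
nss /ZLSSPendingWriteIOs=2000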

However, this advice is generally for folks with I/O starvation issues from
high latencies due to many hosts on the same SAN, path failover events, and
so on. I'm not sure that it applies to what we are experiencing.

I have a support call into the SAN vendor (CoRAID) to see if it might be
something on their end.

Also our build version for ESX 3.5u4 is a wee bit behind - man they do a lot
of patching between updates! So that is another area to explore...

Basically, still clueless.

Cheers,

Ron


Re: Pool had an error at block xxxxx

NSS believes there is disk corruption. There is no known problem in NSS or NetWare that causes this. Possible suspects range from the disk driver down to the physical media; neither NSS nor NetWare has any way of being more specific.

I would never bother running verifies; IMO they're a waste of time. Instead,
when you get an I/O error in NSS, run rebuilds until you get 0 errors or
warnings; usually two runs should do it.

If errors persist after that, start suspecting the hardware (or in this
case, ESX's emulation of it).

Andrew C Taubman (Sorry, support is not provided via e-mail) Opinions expressed above are not necessarily those of Micro Focus.

Re: Pool had an error at block xxxxx

ataubman wrote:

>
> NSS believes there is disk corruption. There is no known problem in NSS
> or NetWare that causes this. Possible suspects range from the disk
> driver down to the physical media; neither NSS nor NetWare has any way
> of being more specific.


I am looking to see if there is any way to enable or increase logging of NSS
and/or lower-level SCSI driver events/activity... I haven't found anything
yet - are you aware of such a thing?

>
> I would never bother running verifies, IMO they're a waste of time.
> Instead when you get an IO error in NSS, run rebuilds until you get 0
> errors or warnings, usually 2 runs should do it.
>
> If they then persist then start suspecting the hardware (or in this
> case, ESX's emulation of it).
>
>


Yeah, so many layers where things can go wrong. One thought that occurs to
me is that some kind of latency - due to virtualisation, and/or SAN channel
packet loss, and/or issues with NSS's timing of writes to the filesystem,
etc. - means a filesystem write doesn't properly complete, and then the
corruption occurs. For instance, I am aware of issues with NSS in a
virtualised Xen environment that required Novell to introduce a new
parameter called XenBlk Barriers, which they say really should be enabled,
especially for SATA-based disk systems - which is what our SAN uses, even
though it is presented to NetWare as a SCSI disk (yet another layer of
translation where more problems could be hiding, beyond the knowledge of
NSS).

Anyway, it is just a question mark in my mind at this point. Here is what
Novell has to say about NSS in a virtual environment (okay, so they only
refer to Xen, which is a para-virtual driver environment and so is not
directly comparable to ESX... too bad they have nothing to say about running
NSS in an ESX virtual machine - I guess this ESX thing is too new and
unpopular to merit any sort of documentation, best practices, etc.):

NSS Admin Guide Sections 6.1.2 and 6.1.3:

-----
SATA or IDE disks have slower performance because special handling is
required when working through the Xen driver to ensure that data writes are
committed to the disk in the order intended before it reports back.

Configuring Write Barrier Behavior for NetWare in a Guest Environment

Write barriers are needed for controlling I/O behavior when writing to SATA
and ATA/IDE devices and disk images via the Xen I/O drivers from a guest
NetWare server. This is not an issue when NetWare is handling the I/O
directly on a physical server.

The XenBlk Barriers parameter for the SET command controls the behavior of
XenBlk Disk I/O when NetWare is running in a virtualization environment. The
setting appears in the Disk category when you issue the SET command in the
NetWare server console.

Valid settings for the XenBlk Barriers parameter are integer values from 0
to 255, with a default value of 16. A non-zero value specifies the depth of
the driver queue, and also controls how often a write barrier is inserted
into the I/O stream. A value of 0 turns off XenBlk Barriers.

A value of 0 (no barriers) is the best setting to use when the virtual disks
assigned to the guest server’s virtual machine are based on physical SCSI,
Fibre Channel, or iSCSI disks (or partitions on those physical disk types)
on the host server. In this configuration, disk I/O is handled so that data
is not exposed to corruption in the event of power failure or host crash, so
the XenBlk Barriers are not needed. If the write barriers are set to zero,
disk I/O performance is noticeably improved.

Other disk types such as SATA and ATA/IDE can leave disk I/O exposed to
corruption in the event of power failure or a host crash, and should use a
non-zero setting for the XenBlk Barriers parameter. Non-zero settings should
also be used for XenBlk Barriers when writing to Xen LVM-backed disk images
and Xen file-backed disk images, regardless of the physical disk type used
to store the disk images.
-----
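
For completeness, on a Xen guest that parameter would be adjusted at the
NetWare server console (or in autoexec.ncf) via the standard SET command,
something like the line below. It isn't directly applicable to our ESX setup
since it belongs to the XenBlk driver, but it shows the kind of knob
involved:

set xenblk barriers = 0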


Re: Pool had an error at block xxxxx

> I am looking to see if there is any way to enable or increase logging of
> NSS and/or lower-level SCSI driver events/activity... I haven't found
> anything yet - are you aware of such a thing?

Not that I'm aware of in the NSS sphere. I've never come across a disk
driver that does any logging, but I guess that's a question for the driver
vendor.

Andrew C Taubman (Sorry, support is not provided via e-mail) Opinions expressed above are not necessarily those of Micro Focus.

Re: Pool had an error at block xxxxx

On Wed, 15 Jul 2009 18:17:58 +0000, Ron Neilly wrote:


> Jul 14, 2009 6:37:49 pm
> ******* REPAIR LOG ********
> Leaf Pruned, Files Lost.
>
> Not terribly informative - well, not to me anyway. Part way through the
> rebuild process it stated that if I chose to continue, data loss could
> occur (and that I should try using third-party utilities to recover...).
> So I chose to continue. I was then informed that a small number of files
> (8, I think) would be lost (about 22 MB). The files were not identified by
> name, and nothing is in the repair directory created at the root of the
> volume. So not much help there.


I **guess** you ran into some bad blocks on the disk, as the rebuild
specifically complains about a few files that are corrupted.
The reason it refers to third-party tools for data recovery is that they are
often able to recover such files, and Novell is not in the data recovery
business. We can only investigate, and often recover, at the partition
level.

> However this advice is generally for those folks with i/o starvation
> issues from high latencies due to many hosts on the same san, or path


There are a few cases where VMware recommends these settings, yes.
The other important thing is to make sure you are using up-to-date LSI disk
drivers.
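
A quick way to see which driver versions are currently loaded is the MODULES
command at the server console, for example (the name patterns are just
examples; match them to the actual driver names on your box):

modules scsihd*
modules lsi*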

> I have a support call into the SAN vendor (CoRAID) to see if it might be
> something on their end.


Good.

> Also our build version for ESX 3.5u4 is a wee bit behind - man they do a
> lot of patching between updates! So that is another area to explore...


Definitely worth considering applying the latest ESX updates.

Regards
Hans