Monitoring RAID with NetSaint
by Dan Langille
03/17/2005
In my previous article, I talked about my RAID-5 installation. It has been up and running for a few days now. I'm pleased with the result. However, RAID can fail. When it does, you need to take action before the next failure. Two failures close together, no matter how rare that may be, will involve a complete reinstall [1].
I have been using NetSaint since first writing about it back in 2001. NetSaint development has continued under a new name: Nagios. I continue to use NetSaint; it does what I need.
The monitoring consists of three main components:
- NetSaint (which I assume you have installed and configured). I'm guessing my tools will also work with Nagios.
- netsaint_statd, which provides remote monitoring of hosts, as patched with my change.
- check_adptraid.pl, the plugin that monitors the RAID status.
With these simple tools, you'll be able to monitor your RAID array.
[1] For my setup, at least. You might know of RAID setups that allow for multiple failures, but mine does not.
Monitoring the Array
Monitoring the health of your RAID array is vital to the health of your system. Fortunately, Adaptec has a tool for this. It is available within the FreeBSD sysutils/asr-utils port. After installing the port, it took me a while to figure out what to use and how to use it. Compounding the problem, a runtime error took me on a little tangent before I could get it running. I will show you how to integrate this utility into your NetSaint configuration.
My first few attempts at running the monitoring tool failed, with this result:
# /usr/local/sbin/raidutil -L all
Engine connect failed: Open
After some Googling, I found that the problem was shared memory. It seems
that with PostgreSQL running, raidutil could not acquire what it
needed. I hunted around, asked questions, and found a few knobs and
switches:
# grep SHM /usr/src/sys/i386/conf/LINT
options SYSVSHM # include support for shared memory
options SHMMAXPGS=1025 # max amount of shared memory pages (4k on i386)
options SHMALL=1025 # max number of shared memory pages system wide
options SHMMAX="(SHMMAXPGS*PAGE_SIZE+1)"
options SHMMIN=2 # min shared memory segment size (bytes)
options SHMMNI=33 # max number of shared memory identifiers
options SHMSEG=9 # max shared memory segments per process
These kernel options are also available as sysctl values:
$ sysctl -a | grep shm
kern.ipc.shmmax: 33554432
kern.ipc.shmmin: 1
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 8192
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0
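As a rough cross-check of these numbers: the LINT comments describe SHMALL as a count of pages, while SHMMAX is derived from pages multiplied by the page size, so it is in bytes. The sketch below converts one to the other, assuming the 4k page size that LINT mentions for i386. The function name pages_to_bytes is made up here for illustration.

```shell
# Convert a shared memory page count (as in kern.ipc.shmall) to bytes.
# Assumption: 4096-byte pages unless a second argument says otherwise.
pages_to_bytes() {
    pages=$1
    page_size=${2:-4096}
    echo $((pages * page_size))
}

# On a live FreeBSD system the input could come straight from sysctl:
#   pages_to_bytes "$(sysctl -n kern.ipc.shmall)"
```

With the values shown above, 8192 pages comes to 33554432 bytes, the same figure reported for kern.ipc.shmmax.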
I started playing with kern.ipc.shmmax but failed to find
anything useful, even after going up to some very large values. I suspect
someone will suggest appropriate values. I found the solution by reducing the
number of PostgreSQL connections, changing the value of max_connections from
40 to 30 in /usr/local/pgsql/data/postgresql.conf. The following command
signals the PostgreSQL postmaster to reread its configuration (strictly
speaking, HUP reloads rather than restarts the postmaster, and a change to
max_connections normally takes effect only after a full restart):
$ kill -HUP `cat /usr/local/pgsql/data/postmaster.pid`
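Before signaling the postmaster, it can help to confirm which value the configuration file actually carries. This is a minimal sketch, assuming the plain "name = value" layout of postgresql.conf. The helper pg_max_connections is hypothetical and not part of PostgreSQL.

```shell
# Print the max_connections value set in a postgresql.conf-style file.
# Lines that start with # are comments and never match; if the setting
# appears more than once, the last occurrence is reported.
pg_max_connections() {
    conf=$1
    sed -n 's/^[[:space:]]*max_connections[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" |
        tail -n 1
}

# Example, using the path from this article:
#   pg_max_connections /usr/local/pgsql/data/postgresql.conf
```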
Now that raidutil can run, the output should resemble:
$ sudo raidutil -L all
RAIDUTIL Version: 3.04 Date: 9/27/2000 FreeBSD CLI Configuration Utility
Adaptec ENGINE Version: 3.04 Date: 9/27/2000 Adaptec FreeBSD SCSI Engine
# b0 b1 b2 Controller Cache FW NVRAM Serial Status
---------------------------------------------------------------------------
d0 -- -- -- ADAP2400A 16MB 3A0L CHNL 1.1 BF0B111Z0B4 Optimal
Physical View
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b1t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b2t0d0 Disk Drive (DASD) ST380011 A 76319MB Replaced Drive
d0b3t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
Logical View
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 5 (Redundant ADAPTEC RAID-5 228957MB Reconstruct 94%
d0b0t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b1t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b2t0d0 Disk Drive (DASD) ST380011 A 76319MB Replaced Drive
d0b3t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
Address Max Speed Actual Rate / Width
---------------------------------------------------------------------------
d0b0t0d0 50 MHz 100 MB/sec wide
d0b1t0d0 50 MHz 100 MB/sec wide
d0b2t0d0 50 MHz 100 MB/sec wide
d0b3t0d0 10 MHz 100 MB/sec wide
Address Manufacturer/Model Write Cache Mode (HBA/Device)
---------------------------------------------------------------------------
d0b0t0d0 ADAPTEC RAID-5 Write Back / --
d0b0t0d0 ST380011 A -- / Write Back
d0b1t0d0 ST380011 A -- / Write Back
d0b2t0d0 ST380011 A -- / Write Back
d0b3t0d0 ST380011 A -- / Write Back
# Controller Cache FW NVRAM BIOS SMOR Serial
---------------------------------------------------------------------------
d0 ADAP2400A 16MB 3A0L CHNL 1.1 1.62 1.12/79I BF0B111Z0B4
# Controller Status Voltage Current Full Cap Rem Cap Rem Time
---------------------------------------------------------------------------
d0 ADAP2400A No battery
Address Manufacturer/Model FW Serial 123456789012
---------------------------------------------------------------------------
d0b0t0d0 ST380011 A 3.06 1ABW6AY1 -X-XX--X-O--
d0b1t0d0 ST380011 A 3.06 1ABEYH4P -X-XX--X-O--
d0b2t0d0 ST380011 A 3.06 1ABRWK0E -X-XX--X-O--
d0b3t0d0 ST380011 A 3.06 1ABRDS5E -X-XX--X-O--
Capabilities Map: Column 1 = Soft Reset
Column 2 = Cmd Queuing
Column 3 = Linked Cmds
Column 4 = Synchronous
Column 5 = Wide 16
Column 6 = Wide 32
Column 7 = Relative Addr
Column 8 = SCSI II
Column 9 = S.M.A.R.T.
Column 0 = SCAM
Column 1 = SCSI-3
Column 2 = SAF-TE
X = Capability Exists, - = Capability does not exist, O = Not Supported
The output shows:
- I'm using an Adaptec 2400A (ADAP2400A).
- I have four drives, all ST380011 and 80GB (76319MB).
- I'm running RAID-5, giving me 228957MB of space.
- The array is rebuilding and is 94% through the reconstruction.
- I replaced the drive at address d0b2t0d0.
You can use a subset of this information to determine whether all is well with the RAID array. My next task was to experiment and
determine what raidutil reports when the array is in different
states.
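To make the idea of using a subset of the listing concrete, here is a rough sketch that pulls the status column from the RAID line of the Logical View. It is not the check_adptraid.pl plugin. The function name raid_array_state is invented for this example, and the only states it knows are the ones seen so far (Optimal and Reconstruct); anything else is treated as critical until its meaning is known. The numeric return codes follow the Nagios-style plugin convention, which is an assumption here.

```shell
# Summarise the logical array state from "raidutil -L all" output on stdin.
# Prints one status line and returns a plugin-style code
# (assumed convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
raid_array_state() {
    output=$(cat)

    # The shared memory failure shown earlier yields no listing at all.
    case $output in
        *"Engine connect failed"*)
            echo "UNKNOWN: raidutil could not connect to its engine"
            return 3 ;;
    esac

    # Match: an address, the word RAID, a capacity ending in MB, then the
    # status text. Only the first such line (the first array) is used.
    state=$(printf '%s\n' "$output" | sed -n \
        -e 's/[[:space:]]*$//' \
        -e 's/^d[0-9]*b[0-9]*t[0-9]*d[0-9]*[[:space:]]\{1,\}RAID[[:space:]].*[0-9]MB[[:space:]]\{1,\}\(.*\)$/\1/p' |
        head -n 1)

    case $state in
        "")           echo "UNKNOWN: no RAID line found"; return 3 ;;
        Optimal)      echo "OK: array is $state"; return 0 ;;
        Reconstruct*) echo "WARNING: array is rebuilding ($state)"; return 1 ;;
        *)            echo "CRITICAL: array is $state"; return 2 ;;
    esac
}

# Possible use on the RAID host:
#   sudo raidutil -L all | raid_array_state
```

Fed the listing above, this would report a warning with the state "Reconstruct 94%". The individual drive lines are ignored because the word after their address is "Disk", not "RAID".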
Note: I did not actually replace d0b2t0d0 as the
output above indicates. As part of my RAID testing, I shut down the system,
disconnected the power to one drive, started the system, verified that it still
ran, shut down again, reconnected the drive, powered up again, and started to
rebuild the array.



