<!doctype linuxdoc system>

<article>

<!-- Title information -->

<title>Software-RAID mini-HOWTO
<author>Linas Vepstas, <tt>linas@linas.org</tt>
<date>v0.30 4 November 1997

<abstract>
RAID stands for ``Redundant Array of Inexpensive Disks'', and
is meant to be a way of creating a fast and reliable disk-drive
subsystem out of individual disks.

This document is a tutorial/HOWTO/FAQ for users of
the Linux MD kernel extension, the associated tools, and their use.
The MD extension implements RAID-0 (striping), RAID-1 (mirroring),
RAID-4 and RAID-5 in software. That is, with MD, no special hardware
or disk controllers are required to get many of the benefits of RAID.

This document is <bf>NOT</bf> an introduction to RAID;
you must find this elsewhere.
</abstract>

<!-- Table of contents -->
<toc>


<!-- Begin the document -->

<p>
<descrip>
  <tag>Preamble</tag>
       This document is GPL'ed by Linas Vepstas 
       (<htmlurl url="mailto:linas@linas.org" name="linas@linas.org">).
       Permission to use, copy, distribute this document for any purpose is 
       hereby granted, provided that the author's / editor's name and
       this notice appear in all copies and/or supporting documents; and 
       that an unmodified version of this document is made freely available.
       This document is distributed in the hope that it will be useful, but 
       WITHOUT ANY WARRANTY, either expressed or implied.  While every effort 
       has been taken to ensure the accuracy of the information documented 
       herein, the author / editor / maintainer assumes NO RESPONSIBILITY 
       for any errors, or for any damages, direct or consequential, as a 
       result of the use of the information documented herein.

       RAID, although designed to improve system reliability by adding
       redundancy, can also lead to a false sense of security and confidence 
       when used improperly.  This false confidence can lead to even greater 
       disasters.  In particular, note that RAID is designed to protect against
       *disk* failures, and not against *power* failures. A power failure
       can damage data on the disks in such a way that it is not recoverable!
       RAID is *not* a substitute for proper backup of your system.
       Know what you are doing, test, be knowledgeable and aware!
</descrip>
</p>


<sect>Introduction

<p>
<enum>
  <item><bf>Q</bf>: 
        What is RAID?
        <quote>
          <bf>A</bf>:
          RAID stands for ``Redundant Array of Inexpensive Disks'',
          and is meant to be a way of creating a fast and reliable disk-drive
          subsystem out of individual disks.  
        </quote>

  <item><bf>Q</bf>:
        What is this document?
        <quote>
          <bf>A</bf>:
          This document is a tutorial/HOWTO/FAQ for users of the Linux MD 
          kernel extension, the associated tools, and their use.
          The MD extension implements RAID-0 (striping), RAID-1 (mirroring),
          RAID-4 and RAID-5 in software.   That is, with MD, no special
          hardware or disk controllers are required to get many of the 
          benefits of RAID.

          This document is <bf>NOT</bf> an introduction to RAID&semi;
          you must find this elsewhere.
        </quote>
    
  <item><bf>Q</bf>:
        What levels of RAID does the Linux kernel implement?
        <quote>
            <bf>A</bf>:
            Striping (RAID-0) and linear concatenation are a part
            of the stock 2.x series of kernels.  This code is 
            of production quality&semi; it is well understood and well 
            maintained.  It is being used in some very large USENET 
            news servers.
        
            RAID-1, RAID-4 and RAID-5 are not present in the stock kernel&semi;
            a separate patch needs to be applied to get this functionality. 
            The current snapshots should be considered beta quality&semi; that 
            is, there are no known bugs but there are some rough edges and 
            untested system setups.
        
            RAID-1 hot reconstruction has been recently introduced 
            (August 1997) and should be considered alpha quality. 
            RAID-5 hot reconstruction will be alpha quality any day now ...
        </quote>
    
  <item><bf>Q</bf>:
        Where do I get it?
        <quote>
            <bf>A</bf>:
            Software RAID-0 and linear mode are a stock part of 
            all current Linux kernels.  Patches for Software RAID-1,4,5 
            are available from
            <url url="http://luthien.nuclecu.unam.mx/&tilde;miguel/raid">.
            See also the quasi-mirror 
            <url url="ftp://linux.kernel.org/pub/linux/daemons/raid/">
            for patches, tools and other goodies.
        </quote>

  <item><bf>Q</bf>:
        Are there other Linux RAID references?
        <quote>
            <bf>A</bf>:
            <itemize>
              <item>Generic RAID overview:
                    <url url="http://www.dpt.com/uraiddoc.html">.
              <item>General Linux RAID options:
                    <url url="http://linas.org/linux/raid.html">.
              <item>Linux-RAID mailing list archive:
                    <url url="http://www.linuxhq.com/lnxlists">.
              <item>Linux Software RAID Home Page:
                    <url url="http://luthien.nuclecu.unam.mx/&tilde;miguel/raid">.
              <item>Linux Software RAID tools:
                    <url url="ftp://linux.kernel.org/pub/linux/daemons/raid/">.
              <item>Linux RAID-Geschichten:
                    <url url="http://www.infodrom.north.de/&tilde;joey/Linux/raid/">.
            </itemize>
        </quote>

  <item><bf>Q</bf>:
        Who do I blame for this document?
        <quote>
            <bf>A</bf>:
            Linas Vepstas slapped this thing together.
            However, most of the information,
            and some of the words were supplied by
            <itemize>
              <item>Bradley Ward Allen
                    &lt;<htmlurl url="mailto:ulmo@Q.Net"
                                name="ulmo@Q.Net">&gt;
              <item>Luca Berra
                    &lt;<htmlurl url="mailto:bluca@comedia.it"
                                name="bluca@comedia.it">&gt;
              <item>Brian Candler
                    &lt;<htmlurl url="mailto:B.Candler@pobox.com"
                                name="B.Candler@pobox.com">&gt;
              <item>Bohumil Chalupa
                    &lt;<htmlurl url="mailto:bochal@apollo.karlov.mff.cuni.cz"
                                name="bochal@apollo.karlov.mff.cuni.cz">&gt;
              <item>Anton Hristozov
                    &lt;<htmlurl url="mailto:anton@intransco.com"
                                name="anton@intransco.com">&gt;
              <item>Miguel de Icaza
                    &lt;<htmlurl url="mailto:miguel@luthien.nuclecu.unam.mx"
                                name="miguel@luthien.nuclecu.unam.mx">&gt; 
              <item>Ingo Molnar
                    &lt;<htmlurl url="mailto:mingo@pc7537.hil.siemens.at"
                                name="mingo@pc7537.hil.siemens.at">&gt;
              <item>Alvin Oga
                    &lt;<htmlurl url="mailto:alvin@planet.fef.com"
                                name="alvin@planet.fef.com">&gt;
              <item>Gadi Oxman
                    &lt;<htmlurl url="mailto:gadio@netvision.net.il"
                                name="gadio@netvision.net.il">&gt;
              <item>Martin Schulze
                    &lt;<htmlurl url="mailto:joey@finlandia.infodrom.north.de"
                                name="joey@finlandia.infodrom.north.de">&gt;
              <item>Geoff Thompson
                    &lt;<htmlurl url="mailto:geofft@cs.waikato.ac.nz"
                                name="geofft@cs.waikato.ac.nz">&gt;
              <item>Edward Welbon
                    &lt;<htmlurl url="mailto:welbon@bga.com"
                                name="welbon@bga.com">&gt;
              <item>Rod Wilkens
                    &lt;<htmlurl url="mailto:rwilkens@border.net"
                                name="rwilkens@border.net">&gt;
              <item>Leonard N. Zubkoff
                    &lt;<htmlurl url="mailto:lnz@dandelion.com"
                                name="lnz@dandelion.com">&gt;
              <item>Marc ZYNGIER
                    &lt;<htmlurl url="mailto:zyngier@ufr-info-p7.ibp.fr"
                                name="zyngier@ufr-info-p7.ibp.fr">&gt;
            </itemize>
            <p>
            <bf>Copyrights</bf>
            <itemize>
              <item>Copyright (C) 1994-96 Marc ZYNGIER
              <item>Copyright (C) 1997 Gadi Oxman, Ingo Molnar, Miguel de Icaza
              <item>Copyright (C) 1997 Linas Vepstas
              <item>By copyright law, additional copyrights are implicitly held 
                    by the contributors listed above.
            </itemize>
            <p>
            Thanks all for being there!
        </quote>
</enum>
</p>


<sect>Setup &amp; Installation Considerations

<p>
<enum>
  <item><bf>Q</bf>:
        I must soon install Linux on a new system,
        and one requirement is to have RAID-1.
        Now I'm wondering what the easiest way to do it is.
        <quote>
            <bf>A</bf>:
            I keep rediscovering that file-system planning is one of the more
            difficult Unix configuration tasks.
            To answer your question, I can describe what we did.

            We planned the following setup:
            <itemize>
              <item>two EIDE disks, 2.1 GB each.
                    <tscreen>
                    <verb>
disk partition mount pt.  size    device
  1      1       /        300M   /dev/hda1
  1      2       swap      64M   /dev/hda2
  1      3       /home    800M   /dev/hda3
  1      4       /var     900M   /dev/hda4

  2      1       /root    300M   /dev/hdc1
  2      2       swap      64M   /dev/hdc2
  2      3       /home    800M   /dev/hdc3
  2      4       /var     900M   /dev/hdc4
                    </verb>
                    </tscreen>
              <item>each disk is on a separate controller (&amp; ribbon cable).
                    The theory is that a controller failure and/or
                    ribbon failure won't disable both disks.
                    Possibly get performance boost from parallel operations?

              <item>Install Linux on <tt>/</tt> in <tt>/dev/hda1</tt>.
                    This will allow booting and subsequent installation
                    of the RAID patches, etc.

              <item><tt>/dev/hdc1</tt> will contain a ``cold'' copy of
                    <tt>/dev/hda1</tt>. This is NOT a raid copy,
                    just a copy-copy. It's there just in case disk1 fails
                    completely&semi; we can use a rescue disk,
                    mark <tt>/dev/hdc1</tt> as bootable,
                    and use that to keep going,
                    without having to reinstall the system.

                    The theory here is that in case of severe failure,
                    I can still boot the system without worrying about
                    raid superblock-corruption or other raid failure modes
                    &amp; gotchas that I don't understand.

              <item><tt>/dev/hda3</tt> and <tt>/dev/hdc3</tt> will be mirrored
                    as <tt>/dev/md0</tt>.
              <item><tt>/dev/hda4</tt> and <tt>/dev/hdc4</tt> will be mirrored
                    as <tt>/dev/md1</tt>.

              <item>we picked <tt>/var</tt> and <tt>/home</tt> to be mirrored,
                    and in separate partitions, under the following
                    (convoluted ???) logic:
                    <itemize>
                      <item><tt>/</tt> will contain non-changing data &mdash;
                            for all practical purposes,
                            it will be read-only without actually being
                            read-only.
                      <item><tt>/home</tt> will contain slowly changing data
                            &mdash; an almost-read-only system.
                      <item><tt>/var</tt> will contain rapidly changing data,
                            including mail spools, database contents and
                            web server logs.
                    </itemize>
                    The theory is that <bf>if</bf> for some bizarre reason,
                    the operating system goes wild,
                    corruption is limited to one partition.
                    Thus, if for some unlikely, hypothetical reason,
                    the database starts scribbling everywhere,
                    it might clobber mail and log files,
                    but not <tt>/home</tt>.
    
                    I am not entirely satisfied with my logic &amp; reasoning,
                    but it was the best I could do on short notice.
                    I would like to have some scheme that verifies
                    that files in <tt>/usr</tt> and <tt>/home</tt>
                    are not changed, e.g. some MD5 signature scheme
                    that is run regularly.
                    The idea is to detect hacker intrusion as well as
                    corruption.  Similarly, the database contents are quite
                    valuable, and I don't have a fault-tolerant plan for that
                    that will let me sleep well at night.
            </itemize>
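            <p>
            The MD5 signature idea mentioned above could be sketched roughly
            as follows.  This is only an illustration, not a tool that ships
            with the RAID package: the function names are made up for this
            example, and it assumes that a manifest built while the system
            is known to be good is kept somewhere safe and compared against
            a fresh one on a regular schedule.
            <tscreen>
            <verb>
```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (relative path) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only plain files are hashed.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def compare_manifests(old, new):
    """Return (added, removed, changed) lists of relative paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old).intersection(new) if old[p] != new[p])
    return added, removed, changed
```
            </verb>
            </tscreen>
            Any path reported as added, removed or changed on a partition
            that is supposed to be nearly read-only deserves a closer look.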

            So, to complete the answer to your question:
            <itemize>
              <item>install redhat on disk 1, partition 1.
                    do NOT mount any of the other partitions. 
              <item>install raid per instructions. 
              <item>configure <tt>md0</tt> and <tt>md1</tt>.
              <item>convince yourself that you know
                    what to do in case of a disk failure!
                    Discover sysadmin mistakes now,
                    and not during an actual crisis.
                    Experiment!
                    (we turned off power during disk activity &mdash;
                     this proved to be ugly but informative).
              <item>do some ugly mount/copy/unmount/rename/reboot scheme to
                    move <tt>/var</tt> over to the <tt>/dev/md1</tt>.
                    Done carefully, this is not dangerous.
              <item>enjoy!
           </itemize>
        </quote>
    
  <item><bf>Q</bf>:
        Can I stripe/mirror the root partition (<tt>/</tt>)?
        Why can't I boot Linux directly from the <tt>md</tt> disks?

        <quote>
            <bf>A</bf>:
            Both Lilo and Loadlin need a partition that is neither striped
            nor mirrored to read the kernel image from. If you want to
            stripe/mirror the root partition (<tt>/</tt>),
            then create an unstriped, unmirrored partition
            to hold the kernel image.
            Typically, this is <tt>/boot</tt>.
            Then you either use the initial ramdisk support,
            or some old patches that were posted a while back,
            to allow your root device to be striped.

            Alternately, use <tt>mkinitrd</tt> to build the ramdisk image,
            see below.

            <p>
            Edward Welbon
            &lt;<htmlurl url="mailto:welbon@bga.com"
                        name="welbon@bga.com">&gt;
            writes:
            <itemize>
              ... all that is needed is a script to manage the boot setup.
              To mount an <tt>md</tt> filesystem as root,
              the main thing is to build an initial file system image
              that has the needed modules and md tools to start <tt>md</tt>.
              I have a simple script that does this.
            </itemize>
            <itemize>
              For boot media, I have a small <bf>cheap</bf> SCSI disk
              (170MB&semi; I got it used for &dollar;20).
              This disk runs on an AHA1452, but it could just as well be an
              inexpensive IDE disk on the native IDE.
              The disk need not be very fast since it is mainly for boot. 
            </itemize>
            <itemize>
              This disk has a small file system which contains the kernel and
              the file system image for <tt>initrd</tt>.
              The initial file system image has just enough stuff to allow me
              to load the raid SCSI device driver module and start the
              raid partition that will become root.
              I then do an
              <tscreen>
              <verb>
echo 0x900 > /proc/sys/kernel/real-root-dev
              </verb>
              </tscreen>
              (<tt>0x900</tt> is for <tt>/dev/md0</tt>)
              and exit <tt>linuxrc</tt>.
              The boot proceeds normally from there. 
            </itemize>
            <itemize>
              I have built most support as a module except for the AHA1452
              driver that brings in the <tt>initrd</tt> filesystem.
              So I have a fairly small kernel. The method is perfectly
              reliable, I have been doing this since before 2.1.26 and
              have never had a problem that I could not easily recover from.
              The file systems even survived several 2.1.4&lsqb;45&rsqb; hard
              crashes with no real problems.
            </itemize>
            <itemize>
              At one time I had partitioned the raid disks so that the initial
              cylinders of the first raid disk held the kernel and the initial
              cylinders of the second raid disk held the initial file system
              image. Instead, I now use the initial cylinders of the raid disks
              for swap, since they are the fastest cylinders
              (why waste them on boot?).
            </itemize>
            <itemize>
              The nice thing about having an inexpensive device dedicated to
              boot is that it is easy to boot from and can also serve as
              a rescue disk if necessary. If you are interested,
              you can take a look at the script that builds my initial
              ram disk image and then runs <tt>lilo</tt>.
              <tscreen>
             <url url="http://www.realtime.net/&tilde;welbon/initrd.md.tar.gz">
              </tscreen>
              It is current enough to show the picture.
              It isn't especially pretty and it could certainly build
              a much smaller filesystem image for the initial ram disk.
              It would be easy to make it more efficient.
              But it uses <tt>lilo</tt> as is.
              If you make any improvements, please forward a copy to me. 8-) 
            </itemize>
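            <p>
            A note on the <tt>0x900</tt> value in the message above: it is
            the device number of <tt>/dev/md0</tt> in the classic 16-bit
            form, with the major number (9 for the md driver) in the high
            byte and the minor number in the low byte.  The sketch below
            only does that arithmetic&semi; the function names are invented
            for this example.
            <tscreen>
            <verb>
```python
MD_MAJOR = 9  # major device number of the Linux md block devices


def old_style_dev(major, minor):
    """Pack major/minor into the classic 16-bit form: major in the high byte."""
    if not (0 <= major <= 255 and 0 <= minor <= 255):
        raise ValueError("major and minor must each fit in one byte")
    return (major << 8) | minor


def real_root_dev_value(md_index):
    """Hex string to write to real-root-dev for /dev/md<md_index>."""
    return hex(old_style_dev(MD_MAJOR, md_index))


print(real_root_dev_value(0))  # the value used for /dev/md0 above
```
            </verb>
            </tscreen>
            By the same arithmetic, a root file system on <tt>/dev/md1</tt>
            would use <tt>0x901</tt>.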
        </quote>

  <item><bf>Q</bf>:
        I have heard that I can run mirroring over striping. Is this true?
        <quote>
            <bf>A</bf>:
            Yes, but not the reverse.  That is, you can put a stripe over 
            several disks, and then build a mirror on top of this.  However,
            striping cannot be put on top of mirroring.  

            A brief technical explanation is that the linear and stripe 
            personalities use the <tt>ll_rw_blk</tt> routine for access.
            The <tt>ll_rw_blk</tt> routine 
            maps disk devices and  sectors, not blocks.  Block devices can be
            layered one on top of the other; but devices that do raw, low-level
            disk accesses, such as <tt>ll_rw_blk</tt>, cannot.
        </quote>

  <item><bf>Q</bf>:
        What is the difference between RAID-1 and RAID-5 for a two-disk
        configuration (i.e. the difference between a RAID-1 array  built 
        out of two disks, and a RAID-5 array built out of two disks)?

        <quote>
            <bf>A</bf>:
            There is no difference in storage capacity.  Nor can disks be 
            added to either array to increase capacity (see the question below for
            details).
        
            RAID-1 offers a performance advantage for reads: the RAID-1
            driver uses distributed-read technology to simultaneously read 
            two sectors, one from each drive, thus doubling read performance.
        
            The RAID-5 driver, although it contains many optimizations, does not
            currently (September 1997) realize that the parity disk is actually
            a mirrored copy of the data disk.  Thus, it serializes data reads.
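            <p>
            To see why the parity disk of a two-disk RAID-5 array is a
            mirror, recall that RAID-5 parity is the XOR of the data blocks
            in a stripe.  With only one data block per stripe, the XOR of
            that single block is the block itself.  A minimal sketch of the
            arithmetic (not the driver code) follows.
            <tscreen>
            <verb>
```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, as RAID-5 parity does."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


data_block = b"hello, md"
# Two-disk RAID-5: each stripe has one data block and one parity block.
parity = xor_blocks([data_block])
print(parity == data_block)
```
            </verb>
            </tscreen>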
        </quote>


  <item><bf>Q</bf>:
        Can I add disks to a RAID-5 array?

        <quote>
            <bf>A</bf>:
            Currently (September 1997), no. A conversion utility to allow this 
            does not yet exist.  The problem is that the actual structure and layout 
            of a RAID-5 array depends on the number of disks in the array.
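            <p>
            The dependence on the number of disks can be seen in a toy
            model.  The layout below is an assumption made purely for
            illustration (parity rotating backwards from the last disk,
            data filling the remaining disks in order)&semi; it is not
            claimed to be the layout the MD driver uses.  The point is only
            that the same logical block lands in a different place when the
            disk count changes.
            <tscreen>
            <verb>
```python
def locate(block, ndisks):
    """Return (stripe, disk) of a logical data block in a toy rotating-parity layout."""
    if ndisks < 2:
        raise ValueError("need at least two disks")
    data_per_stripe = ndisks - 1
    stripe, offset = divmod(block, data_per_stripe)
    # Hypothetical rotation: parity starts on the last disk and moves back
    # by one disk with every stripe.
    parity_disk = (ndisks - 1) - (stripe % ndisks)
    # Data blocks occupy the non-parity disks in ascending order.
    disk = offset if offset < parity_disk else offset + 1
    return stripe, disk


for block in range(6):
    print(block, locate(block, 3), locate(block, 4))
```
            </verb>
            </tscreen>
            Most blocks map to a different stripe or disk once a fourth disk
            is added, so growing an array means moving nearly all of the data
            and recomputing the parity.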
        </quote>

  <item><bf>Q</bf>:
        How can I guard against a two-disk failure?

        <quote>
            <bf>A</bf>:
            Some of the RAID algorithms do guard against multiple disk
            failures, but these are not currently implemented for Linux.
            However, Linux Software RAID can guard against multiple
            disk failures by layering an array on top of an array.  For
            example, nine disks can be used to create three raid-5 arrays.
            Then these three arrays can in turn be hooked together into
            a single RAID-5 array on top.  In fact, this kind of a
            configuration will guard against a three-disk failure.  Note that 
            a large amount of disk space is ``wasted'' on the redundancy
            information.

            <tscreen>
            <verb>
    For an NxN raid-5 array,
    N=3, 5 out of 9 disks are used for parity (=~56&percnt;)
    N=4, 7 out of 16 disks
    N=5, 9 out of 25 disks
    ...
    N=9, 17 out of 81 disks (=~21&percnt;)
            </verb>
            </tscreen>
      
            Another alternative is to create a RAID-1 array with 
            three disks.  Note that since all three disks contain
            identical data, two-thirds of the space is ``wasted''.
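            <p>
            The figures in the NxN table above follow from a simple count:
            each of the N lower arrays gives up one disk to parity, and the
            top-level array gives up one whole lower array, which leaves
            (N-1) times (N-1) disks' worth of data.  A small sketch of that
            arithmetic (the function name is made up for this example):
            <tscreen>
            <verb>
```python
def nested_raid5_overhead(n):
    """Return (parity_disks, total_disks) for n RAID-5 arrays of n disks each,
    combined into one RAID-5 array on top."""
    if n < 2:
        raise ValueError("each array needs at least two disks")
    total = n * n
    # The top level keeps (n - 1) arrays' worth of data, and each of those
    # arrays keeps (n - 1) disks' worth, so (n - 1) ** 2 disks hold data.
    data = (n - 1) ** 2
    return total - data, total


for n in (3, 4, 5, 9):
    parity, total = nested_raid5_overhead(n)
    print("N=%d: %d out of %d disks" % (n, parity, total))
```
            </verb>
            </tscreen>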

        </quote>

</enum>
</p>


<sect>Error Recovery

<p>
<enum>
  <item><bf>Q</bf>:
        I have a RAID-1 (mirroring) setup, and lost power
        while there was disk activity.  Now what do I do?

        <quote>
            <bf>A</bf>:
            The redundancy of RAID levels is designed to protect against a 
            <bf>disk</bf> failure, not against a <bf>power</bf> failure.

            There are several ways to recover from this situation. 
        
            <itemize>
              <item>Method (1): Use the raid tools.  These can be used to sync
                    the raid arrays.  They do not fix file-system damage; after
                    the raid arrays are sync'ed, then the file-system still has
                    to be fixed with fsck.  Raid arrays can be checked with 
                    <tt>ckraid /etc/raid1.conf</tt> (for RAID-1&semi; otherwise
                    use <tt>/etc/raid5.conf</tt>, etc.)
                
                    Calling <tt>ckraid /etc/raid1.conf --fix</tt> will pick one of the 
                    disks in the array (usually the first), and use that as the
                    master copy, and copy its blocks to the others in the mirror.
                
                    To designate which of the disks should be used as the master, 
                    you can use the <tt>--force-source</tt> flag: for example,
                    <tt>ckraid /etc/raid1.conf --fix --force-source /dev/hdc3</tt>
                
                    The ckraid command can be safely run without the <tt>--fix</tt> 
                    option 
                    to verify the inactive RAID array without making any changes. 
                    When you are comfortable with the proposed changes, supply 
                    the <tt>--fix</tt>  option.
           
              <item>Method (2): Paranoid, time-consuming, not much better than the
                    first way.  Let's assume a two-disk RAID-1 array, consisting of 
                    partitions <tt>/dev/hda3</tt> and <tt>/dev/hdc3</tt>.  You can 
                    try the following:
                    <enum>
                      <item><tt>fsck /dev/hda3</tt>
                      <item><tt>fsck /dev/hdc3</tt>
                      <item>decide which of the two partitions had fewer errors,
                            or was more easily recovered, or recovered the data
                            that you wanted.  Pick one, either one, to be your new
                            ``master'' copy.  Say you picked <tt>/dev/hdc3</tt>. 
                      <item><tt>dd if=/dev/hdc3 of=/dev/hda3</tt>
                      <item><tt>mkraid raid1.conf -f --only-superblock</tt>
                    </enum>

                    As an alternative to the last two steps, you can run 
                    <tt>ckraid /etc/raid1.conf --fix --force-source /dev/hdc3</tt>
                    which should be a bit faster.

              <item>Method (3): Lazy man's version of above.  If you don't want to 
                    wait for long fsck's to complete, it is perfectly fine to skip 
                    the first three steps above, and move directly to the last 
                    two steps.  
                    Just be sure to run <tt>fsck /dev/md0</tt> after you are done.
                    Method (3) is actually just method (1) in disguise.
            </itemize>

            In any case, the above steps will only sync up the raid arrays.
            The file system probably needs fixing as well: for this, 
            fsck needs to be run on the active, unmounted md device.

            With a three-disk RAID-1 array, there are more possibilities,
            such as using two disks to ``vote'' a majority answer.  Tools
            to automate this do not currently (September 1997) exist.
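            <p>
            The voting idea can be illustrated on a single block.  The
            sketch below is purely hypothetical and is not one of the raid
            tools: given the copy of one block as read from each mirror, it
            returns the value held by a strict majority, and the indexes of
            the mirrors that disagree with it.
            <tscreen>
            <verb>
```python
from collections import Counter


def vote_block(copies):
    """Return (winner, dissenting_indexes) for one block read from each mirror.

    winner is None when no value is held by a strict majority of the mirrors.
    """
    value, count = Counter(copies).most_common(1)[0]
    if count * 2 <= len(copies):
        return None, list(range(len(copies)))
    return value, [i for i, copy in enumerate(copies) if copy != value]


winner, dissenters = vote_block([b"good", b"good", b"junk"])
print(winner, dissenters)
```
            </verb>
            </tscreen>
            Note that with only two mirrors a disagreement can never be
            settled this way, which is why the methods above ask you to
            pick the master copy yourself.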
        </quote>

  <item><bf>Q</bf>:
        I have a RAID-4 or a RAID-5 (parity) setup, and lost power while 
        there was disk activity.  Now what do I do?

        <quote>
            <bf>A</bf>:
            The redundancy of RAID levels is designed to protect against a 
            <bf>disk</bf> failure, not against a <bf>power</bf> failure.

            Since the disks in a RAID-4 or RAID-5 array do not contain a file
            system that fsck can read, there are fewer repair options.  You
            cannot use fsck to do preliminary checking and/or repair; you must
            use ckraid first.
        
            The <tt>ckraid</tt> command can be safely run without the 
            <tt>--fix</tt> option 
            to verify the inactive RAID array without making any changes. 
            When you are comfortable with the proposed changes, supply 
            the <tt>--fix</tt> option.
        
            If you wish, you can try designating one of the disks as a ``failed
            disk''.  Do this with the <tt>--suggest-failed-disk-mask</tt> flag.  
            Only one bit should be set in the flag: RAID-5 cannot recover two 
            failed disks.
            The mask is a binary bit mask, thus:
            <verb>
    0x1 == first disk
    0x2 == second disk
    0x4 == third disk
    0x8 == fourth disk, etc.
            </verb>
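            <p>
            In other words, the mask for a disk is the value 1 shifted left
            by the zero-based position of that disk.  The helper names in
            the following sketch are invented for illustration&semi; it
            merely computes and checks mask values and does not call
            <tt>ckraid</tt>.
            <tscreen>
            <verb>
```python
def failed_disk_mask(disk_index):
    """Mask with only the bit for disk_index set (0 is the first disk)."""
    if disk_index < 0:
        raise ValueError("disk index must not be negative")
    return 1 << disk_index


def disks_in_mask(mask):
    """List the zero-based disk indexes whose bits are set in mask."""
    bits = reversed(format(mask, "b"))  # least significant bit first
    return [i for i, bit in enumerate(bits) if bit == "1"]


def is_single_disk(mask):
    """True when exactly one bit is set, as RAID-5 recovery requires."""
    return len(disks_in_mask(mask)) == 1


print(hex(failed_disk_mask(2)))  # mask for the third disk
```
            </verb>
            </tscreen>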
        
            Alternately, you can choose to modify the parity sectors, by using
            the <tt>--suggest-fix-parity</tt> flag.  This will recompute the 
            parity from the other sectors.
        
            The flags <tt>--suggest-failed-disk-mask</tt> and 
            <tt>--suggest-fix-parity</tt>
            can be safely used for verification. No changes are made if the
            <tt>--fix</tt> flag is not specified.  Thus, you can experiment with
            different possible repair schemes.
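            <p>
            The two repair strategies differ in which block of a stripe is
            treated as suspect.  Fixing parity trusts all the data blocks
            and recomputes the parity block as their XOR&semi; suggesting a
            failed disk trusts everything else, parity included, and
            rebuilds the block of the suspect disk.  The sketch below shows
            that arithmetic on one stripe.  It is an illustration under
            those assumptions, not the <tt>ckraid</tt> implementation, and
            the function names are made up.
            <tscreen>
            <verb>
```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks column by column."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(result):
            raise ValueError("all blocks must have the same length")
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def fix_parity(data_blocks):
    """Recompute the parity block of one stripe from its data blocks."""
    return xor_blocks(data_blocks)


def rebuild_failed(stripe, failed_index):
    """Rebuild the block at failed_index from all other blocks in the stripe.

    stripe holds every block of the stripe, data and parity alike.
    """
    survivors = [b for i, b in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
stripe = data + [fix_parity(data)]
print(rebuild_failed(stripe, 1) == data[1])
```
            </verb>
            </tscreen>
            Because XOR can only solve for one unknown block per stripe,
            only a single disk can be designated as failed.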

        </quote>

  <item><bf>Q</bf>:
        My RAID-1 device, <tt>/dev/md0</tt> consists of two hard drive
        partitions: <tt>/dev/hda3</tt> and <tt>/dev/hdc3</tt>.
        Recently, the disk with <tt>/dev/hdc3</tt> failed,
        and was replaced with a new disk.  My best friend,
        who doesn't understand RAID, said that the correct thing to do now
        is to ``<tt>dd if=/dev/hda3 of=/dev/hdc3</tt>''.
        I tried this, but things still don't work.

        <quote>
            <bf>A</bf>:
            You should keep your best friend away from your computer.  
            Fortunately, no serious damage has been done.
            You can recover from this by running:
            <tscreen>
            <verb>
mkraid raid1.conf -f --only-superblock
            </verb>
            </tscreen>
            By using <tt>dd</tt>, two identical copies of the partition
            were created. This is almost correct, except that the RAID-1
            kernel extension expects the RAID superblocks to be different.
            Thus, when you try to reactivate RAID, the software will notice
            the problem, and deactivate one of the two partitions.
            By re-creating the superblock, you should have a fully usable
            system.
        </quote>

  <item><bf>Q</bf>:
        My RAID-1 device, <tt>/dev/md0</tt> consists of two hard drive
        partitions: <tt>/dev/hda3</tt> and <tt>/dev/hdc3</tt>.
        My best (girl?)friend, who doesn't understand RAID,
        ran <tt>fsck</tt> on <tt>/dev/hda3</tt> while I wasn't looking,
        and now the RAID won't work. What should I do?

        <quote>
            <bf>A</bf>:
            You should re-examine your concept of ``best friend''.
            In general, <tt>fsck</tt> should never be run on the individual
            partitions that compose a RAID array.
            Assuming that neither of the partitions was heavily damaged,
            no data loss has occurred, and the RAID-1 device can be recovered
            as follows:
            <enum>
              <item>make a backup of the file system on <tt>/dev/hda3</tt>
              <item><tt>dd if=/dev/hda3 of=/dev/hdc3</tt>
              <item><tt>mkraid raid1.conf -f --only-superblock</tt>
            </enum>
            This should leave you with a working disk mirror.
        </quote>

  <item><bf>Q</bf>:
        Why does the above work as a recovery procedure?
        <quote>
            <bf>A</bf>:
            Because each of the component partitions in a RAID-1 mirror 
            is a perfectly valid copy of the file system.  In a pinch,
            mirroring can be disabled, and one of the partitions
            can be mounted and safely run as an ordinary, non-RAID
            file system.  When you are ready to restart using RAID-1,
            then unmount the partition, and follow the above 
            instructions to restore the mirror.   Note that the above 
            works ONLY for RAID-1, and not for any of the other levels.

            It may make you feel more comfortable to reverse the direction 
            of the copy above: copy <bf>from</bf> the disk that was untouched
            <bf>to</bf> the one that was.  Just be sure to fsck the final md.
        </quote>

  <item><bf>Q</bf>:
        I am confused by the above questions, but am not yet bailing out.
        Is it safe to run <tt>fsck /dev/md0</tt> ?

        <quote>
            <bf>A</bf>:
            Yes, it is safe to run <tt>fsck</tt> on the <tt>md</tt> devices. 
            In fact, this is the <bf>only</bf> safe place to run <tt>fsck</tt>.
        </quote>

  <item><bf>Q</bf>:
        If a disk is slowly failing, will it be obvious which one it is?
        I am concerned that it won't be, and this confusion could lead to 
        some dangerous decisions by a sysadmin.

        <quote>
            <bf>A</bf>:
            Once a disk fails, an error code will be returned from
            the low level driver to the RAID driver.
            The RAID driver will mark it as ``bad'' in the RAID superblocks
            of the ``good'' disks (so we will later know which mirrors are
            good and which aren't), and continue RAID operation
            on the remaining operational mirrors.

            This, of course, assumes that the disk and the low level driver
            can detect a read/write error, and will not silently corrupt data,
            for example. This is true of current drives
            (error detection schemes are being used internally),
            and is the basis of RAID operation.
        </quote>

  <item><bf>Q</bf>:
        What about hot-repair?

        <quote>
            <bf>A</bf>:
            There is a plan to add ``hot reconstruction'' at some point.
            With this feature, we can add several ``spare'' disks to
            the RAID set (be it level 1 or 4/5), and once a disk fails,
            we will reconstruct it on one of the spare disks at run time,
            without ever needing to shut down the array.

            Gadi Oxman
            &lt;<htmlurl url="mailto:gadio@netvision.net.il"
                        name="gadio@netvision.net.il">&gt;
            writes:
            <itemize>
              Currently, once the first disk is removed, the RAID set will be
              running in degraded mode. To restore full operation mode,
              you need to:
              <itemize>
                <item>stop the array (<tt>mdstop /dev/md0</tt>)
                <item>replace the failed drive
                <item>run <tt>ckraid raid.conf</tt> to reconstruct its contents
                <item>run the array again (<tt>mdadd</tt>, <tt>mdrun</tt>).
              </itemize>
              At this point, the array will be running with all the drives,
              and again protects against a failure of a single drive.
            </itemize>
            As of 22 July 97, there is an alpha version of MD that allows
            <itemize>
              <item>hot reconstruction/resyncing for RAID-1
              <item>a spare disk to be hot-added to an already running
                    RAID-1 array
            </itemize>
        </quote>

  <item><bf>Q</bf>:
        I would like to have an audible alarm for
        ``you schmuck, one disk in the mirror is down'',
        so that the novice sysadmin knows that there is a problem.

        <quote>
            <bf>A</bf>:
            The kernel logs the event with a
            ``<tt>KERN&lowbar;ALERT</tt>'' priority in syslog.
            There are several software packages that will monitor the
            syslog files, and beep the PC speaker, call a pager, send e-mail,
            etc. automatically.
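
            As a minimal sketch of such a monitor: the function below scans
            a log file for a pattern and rings the terminal bell on a match.
            The log file location and the default pattern are assumptions,
            not values taken from the MD driver; look at your own syslog
            for the exact wording of the kernel message before relying on it.
            The name <tt>md&lowbar;alert</tt> is made up for this example.
<tscreen>
<verb>
```shell
# Sketch only. MD_LOG and MD_PATTERN are assumptions, not values taken
# from the MD driver; adjust both after looking at your own syslog.
MD_LOG=${MD_LOG:-/var/log/messages}
MD_PATTERN=${MD_PATTERN:-raid}

# md_alert LOGFILE PATTERN
# Prints the matching lines and returns 0 if any were found, 1 otherwise.
md_alert() {
    if grep -i -e "$2" "$1"; then
        printf '\a'        # ring the terminal bell
        return 0
    fi
    return 1
}

# Typical use, for example from a periodic job:
#   md_alert "$MD_LOG" "$MD_PATTERN"
```
</verb>
</tscreen>
            The bell is only audible on an interactive terminal; a real
            setup would more likely send mail or call a pager at that point.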
        </quote>

  <item><bf>Q</bf>:
        How do I run RAID-5 in degraded mode
        (with one disk failed, and not yet replaced)?

        <quote>
            <bf>A</bf>:
            Gadi Oxman
            &lt;<htmlurl url="mailto:gadio@netvision.net.il"
                        name="gadio@netvision.net.il">&gt;
            writes:
            <itemize>
              Normally, to run a RAID-5 set of n drives you have to:
              <tscreen>
              <verb>
mdadd /dev/md0 /dev/disk1 ... /dev/disk(n)
mdrun -p5 /dev/md0
              </verb>
              </tscreen>
            </itemize>
            Even if one of the disks has failed,
            you still have to <tt>mdadd</tt> it as you would in a normal setup.
            Then,
            <itemize>
              The array will be active in degraded mode with (n - 1) drives.
              If ``<tt>mdrun</tt>'' fails, the kernel has noticed an error
              (for example, several faulty drives, or an unclean shutdown).
              Use ``<tt>dmesg</tt>'' to display the kernel error messages from
              ``<tt>mdrun</tt>''.
            </itemize>
            If the RAID-5 set is corrupted due to a power loss,
            rather than a disk crash, one can try to recover by
            creating a new RAID superblock:
            <tscreen>
            <verb>
mkraid -f --only-superblock raid5.conf
            </verb>
            </tscreen>
            A RAID array doesn't provide protection against a power failure or 
            a kernel crash, and can't guarantee correct recovery.
            Rebuilding the superblock will simply cause the system to ignore
            the condition by marking all the drives as ``OK'',
            as if nothing happened.
        </quote>

  <item><bf>Q</bf>:
        How does RAID-5 work when a disk fails?

        <quote>
            <bf>A</bf>:
            The typical operating scenario is as follows:
            <itemize>
              <item>A RAID-5 array is active.

              <item>One drive fails while the array is active.

              <item>The drive firmware and the low-level Linux disk/controller
                    drivers detect the failure and report an error code to the
                    MD driver.

              <item>The MD driver continues to provide an error-free
                    <tt>/dev/md0</tt>
                    device to the higher levels of the kernel (with a performance
                    degradation) by using the remaining operational drives.

              <item>The sysadmin can <tt>umount /dev/md0</tt> and 
                    <tt>mdstop /dev/md0</tt> as usual.

              <item>If the failed drive is not replaced, the sysadmin can still 
                    start the array in degraded mode as usual, by running 
                    <tt>mdadd</tt> and <tt>mdrun</tt>.
            </itemize>
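
            How the remaining drives stand in for the failed one can be
            illustrated with a little arithmetic. This is a simplified sketch
            of the general RAID-5 parity idea, not the MD driver's code:
            each stripe stores the XOR of its data blocks as parity, so any
            single missing block is the XOR of all the surviving blocks.
            The byte values below are arbitrary examples.
<tscreen>
<verb>
```shell
# Simplified illustration of RAID-5 parity, one byte per "block".
# The values are arbitrary examples.
d0=165   # data block on disk 0
d1=60    # data block on disk 1
d2=15    # data block on disk 2

# Parity block, computed when the stripe is written.
parity=$(( d0 ^ d1 ^ d2 ))

# Disk 1 fails: rebuild its block from the survivors plus parity.
rebuilt_d1=$(( d0 ^ d2 ^ parity ))

echo "parity=$parity rebuilt_d1=$rebuilt_d1"
```
</verb>
</tscreen>
            Every read of a block on the failed disk needs the other disks
            to be read as well, which is where the performance degradation
            comes from.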
        </quote>

  <item><bf>Q</bf>:
        The QuickStart says that <tt>mdstop</tt> is just to make sure that the
        disks are sync'ed. Is this REALLY necessary? Isn't unmounting the
        file systems enough?

        <quote>
            <bf>A</bf>:
            The command <tt>mdstop /dev/md0</tt> will:
            <itemize>
              <item>mark the array ``clean''. This allows us to detect unclean
                    shutdowns, for example due to a power failure or a kernel crash.

              <item>sync the array. This is less important after unmounting a
                    filesystem, but is important if the <tt>/dev/md0</tt> is 
                    accessed directly rather than through a filesystem (for 
                    example, by <tt>e2fsck</tt>).
            </itemize>
        </quote>

  <item><bf>Q</bf>:
        <quote>
            <bf>A</bf>:
        </quote>

  <item><bf>Q</bf>:
        Why is there no question 13?

        <quote>
            <bf>A</bf>:
            If you are concerned about RAID, High Availability, and UPS,
            then it's probably a good idea to be superstitious as well.
        </quote>

  <item><bf>Q</bf>:
        I'd like to understand how it'd be possible to have something 
        like <tt>fsck</tt>: if the partition hasn't been cleanly unmounted, 
        <tt>fsck</tt> runs and fixes the filesystem by itself more than 
        90&percnt; of the time. Since the machine is capable of fixing it 
        by itself with <tt>ckraid --fix</tt>, why not make it automatic?


        <quote>
            <bf>A</bf>:
            Brian Candler &lt;<htmlurl url="mailto:B.Candler@pobox.com"
                                name="B.Candler@pobox.com">&gt;
            responds:

            Then you just put <tt>ckraid</tt> into your system initialization 
            scripts, like <tt>fsck</tt> is.  After the root partition is mounted, 
            add the following to <tt>/etc/rc.d/rc.sysinit</tt>:
            <tscreen>
            <verb>
    # Add the first mirror; if that fails, repair it and try again.
    mdadd /dev/md0 /dev/hda1 /dev/hdc1 || {
        ckraid --fix /etc/raid.usr.conf
        mdadd /dev/md0 /dev/hda1 /dev/hdc1
    }
    # Same for the second mirror.
    mdadd /dev/md1 /dev/hda2 /dev/hdc2 || {
        ckraid --fix /etc/raid.var.conf
        mdadd /dev/md1 /dev/hda2 /dev/hdc2
    }
            </verb>
            </tscreen>

            (Modify the above to suit your system.)

            Gadi Oxman explains the operation:
            In an unclean shutdown, Linux might be in one of the following states:
            <enum>
              <item>The in-memory disk cache was in sync with the RAID set when
                    the unclean shutdown occurred; no data was lost.

              <item>The in-memory disk cache was newer than the RAID set contents
                    when the crash occurred; this results in a corrupted filesystem
                    and potentially in data loss.
      
                    This state can be further divided into the following two states:
      
              <enum>
                <item>Linux was writing data when the unclean shutdown occurred.
                <item>Linux was not writing data when the crash occurred.
              </enum>
            </enum>

            Suppose we were using a RAID-1 array. In (2a), it might happen that
            before the crash, a small number of data blocks were successfully 
            written only to some of the mirrors, so that on the next reboot, 
            the mirrors will no longer contain the same data.
      
            If we ignore the mirror differences, the 0.36.3 read-balancing code
            might choose to read the above data blocks from any of the mirrors, 
            which will result in inconsistent behavior (for example, the output 
            of <tt>e2fsck -n /dev/md0</tt> can differ from run to run).
      
            Since RAID doesn't protect against unclean shutdowns, usually there
            isn't any ``obviously correct'' way to fix the mirror differences and
            the filesystem corruption.
      
            For example, by default <tt>ckraid --fix</tt> will choose the first 
            operational mirror and update the other mirrors with its contents.

            However, depending on the exact timing of the crash, the data on another
            mirror might be more recent, and we might want to use it as the source
            mirror instead, or perhaps use another method for recovery.
      
            If you wish to run <tt>ckraid --fix</tt> automatically, you can check the
            return code of <tt>mdrun</tt> for errors. For example:
            <verb>
    mdrun -p1 /dev/md0
    if [ $? -gt 0 ] ; then
            ckraid --fix /etc/raid1.conf
            mdrun -p1 /dev/md0
    fi
            </verb>
        </quote>

</enum> 
</p>

<sect>Troubleshooting Install Problems

<p>
<enum>
  <item><bf>Q</bf>:
        What is the current best known-stable or probably stable 
        patch for RAID in the 2.0.x series kernels?

        <quote>
            <bf>A</bf>:
            As of 18 Sept 1997, it is 
            ``2.0.30 + pre-9 2.0.31 + Werner Fink's swapping patch 
            + the alpha RAID patch''.
        </quote>

  <item><bf>Q</bf>:
        I get the message: <tt>mdrun -a /dev/md0: Invalid argument</tt>

        <quote>
            <bf>A</bf>:
            Use <tt>mkraid</tt> to initialize the RAID set prior to the first use.
            <tt>mkraid</tt> ensures that the RAID array is initially in a 
            consistent state by erasing the RAID partitions. In addition, 
            <tt>mkraid</tt> will create the RAID superblocks.
        </quote>

  <item><bf>Q</bf>:
        I get the message: <tt>mdrun -a /dev/md0: Invalid argument</tt>
        The setup was:
        <itemize>
          <item>raid1 built as a kernel module
          <item>normal install procedure followed ... mdcreate, mdadd, etc.
          <item><tt>cat /proc/mdstat</tt> shows
                <verb>
    Personalities :
    read_ahead not set
    md0 : inactive sda1 sdb1 6313482 blocks
    md1 : inactive
    md2 : inactive
    md3 : inactive
                </verb>
          <item>mdrun -a creates the error message /dev/md0: Invalid argument
        </itemize>

        <quote>
            <bf>A</bf>:
            Try <tt>lsmod</tt> to see if the module is loaded, and if not,
            load it with <tt>modprobe raid1</tt>.
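
            The empty ``Personalities'' line in the <tt>/proc/mdstat</tt>
            output above is another hint that no personality is registered.
            A minimal sketch of a check built on that, assuming that a loaded
            personality is listed by name on that line (the exact line format
            may differ between versions; the function name
            <tt>has&lowbar;personality</tt> is made up for this example):
<tscreen>
<verb>
```shell
# has_personality NAME [MDSTAT_FILE]
# Returns 0 if NAME appears on the "Personalities" line of the given
# mdstat file (default /proc/mdstat). Assumes a loaded personality is
# listed there by name; the exact line format may differ between versions.
has_personality() {
    grep '^Personalities' "${2:-/proc/mdstat}" 2>/dev/null | grep -q -w -e "$1"
}

# Typical use before mdrun (needs root):
#   has_personality raid1 || modprobe raid1
```
</verb>
</tscreen>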

        </quote>

  <item><bf>Q</bf>:
        Truxton Fulton wrote:
        <quote>
        On my Linux 2.0.30 system, while doing a <tt>mkraid</tt> for a 
        RAID-1 device,
        during the clearing of the two individual partitions, I got
        ``<tt>Cannot allocate free page</tt>'' errors appearing on the console,
        and ``<tt>Unable to handle kernel paging request at virtual address
        ...</tt>''
        errors in the system log.  At this time, the system became quite 
        unusable, but it appears to recover after a while.  The operation 
        appears to have completed with no other errors, and I am 
        successfully using my RAID-1 device.  The errors are disconcerting 
        though.  Any ideas?
        </quote>

        <quote>
            <bf>A</bf>:
            This was a well-known bug in the 2.0.30 kernels.  It is fixed in 
            the 2.0.31 kernel; alternatively, fall back to 2.0.29.
        </quote>

  <item><bf>Q</bf>:
        I'm not able to <tt>mdrun</tt> a raid1, raid4 or raid5 device.
        If I try to <tt>mdrun</tt> a <tt>mdadd</tt>'ed device I get 
        the message ``<tt>invalid raid superblock magic</tt>''.

        <quote>
            <bf>A</bf>:
            Make sure that you've run the <tt>mkraid</tt> part of the install
            procedure.
        </quote>

  <item><bf>Q</bf>:
        When I access <tt>/dev/md0</tt>, the kernel spits out a 
        lot of errors like <tt>md0: device not running, giving up !</tt>
        and <tt>I/O error...</tt>. I've successfully added my devices to 
        the virtual device.

        <quote>
            <bf>A</bf>:
            To be usable, the device must be running. Use 
            <tt>mdrun -px /dev/md0</tt> where x is l for linear, 0 for 
            RAID-0 or 1 for RAID-1, etc.   Even better, create a 
            <tt>mdtab</tt> and do a <tt>mdadd -ar</tt>.

        </quote>

  <item><bf>Q</bf>:
        I've created a linear md-dev with 2 devices. 
        <tt>cat /proc/mdstat</tt> shows
        the total size of the device, but <tt>df</tt> only shows the size of the first
        physical device.

        <quote>
            <bf>A</bf>:
            You must <tt>mkfs</tt> your new md-dev before using it 
            the first time, so that the filesystem will cover the whole device.
        </quote>

  <item><bf>Q</bf>:
        While compiling raidtools 0.42, compilation stops trying to 
        include &lt;pthread.h&gt; but it doesn't exist in my system. 
        How do I fix this?

        <quote>
            <bf>A</bf>:
            raidtools-0.42 requires linuxthreads-0.6 from:
            <url url="ftp://ftp.inria.fr/INRIA/Projects/cristal/Xavier.Leroy">
            Alternatively, use glibc v2.0.
        </quote>

  <item><bf>Q</bf>:
        I get the message <tt>invalid raid superblock magic</tt> while trying to 
        run an array which consists of partitions which are bigger than 4GB.

        <quote>
            <bf>A</bf>:
            This bug is now fixed. (September 97)  Make sure you have the latest 
            raid code.
        </quote>

  <item><bf>Q</bf>:
        <tt>ckraid</tt> currently isn't able to read <tt>/etc/mdtab</tt>

        <quote>
            <bf>A</bf>:
             The RAID0/linear configuration file format used in
             <tt>/etc/mdtab</tt> is obsolete, although it will be supported 
             for a while longer.  The up-to-date config files 
             are currently named <tt>/etc/raid1.conf</tt>, etc.
        </quote>

  <item><bf>Q</bf>:
        The personality modules (<tt>raid1.o</tt>) are not loaded automatically; 
        they have to be manually modprobe'd before mdrun. How can this
        be fixed?

        <quote>
            <bf>A</bf>:
            To autoload the modules, we can add the following to 
            <tt>/etc/conf.modules</tt>:
            <verb>
    alias md-personality-3 raid1
    alias md-personality-4 raid5
            </verb>

        </quote>

</enum>
</p>

<sect>Performance, Tools &amp; General Bone-headed Questions


<p>
<enum>
  <item><bf>Q</bf>:
        I've created a RAID-0 device on <tt>/dev/sda2</tt> and 
        <tt>/dev/sda3</tt>. The device is a lot slower than a 
        single partition. Isn't md a pile of junk?
        <quote>
            <bf>A</bf>:
             To have a RAID-0 device running at full speed, you must 
             use partitions from different disks.  Likewise, putting 
             the two halves of a mirror on the same disk fails to 
             give you any protection whatsoever against disk failure.
        </quote>

  <item><bf>Q</bf>:
        I have 2 Brand X super-duper hard disks and a Brand Y controller,
        and am considering using <tt>md</tt>.
        Does it significantly increase the throughput?
        Is the performance really noticeable?

        <quote>
            <bf>A</bf>:
            The answer depends on the configuration that you use.
            <descrip>
              <tag>Linux MD RAID-0 (striping) performance:</tag>
                   Must wait for all disks to read/write the stripe.

              <tag>Linux MD RAID-1 (mirroring) read performance:</tag>
                   MD implements read balancing. In a low-IO situation,
                   this won't change performance.
                   But, with two disks in a high-IO environment,
                   this could as much as double the read performance.
                   For N disks in the mirror, this could improve performance
                   N-fold.

              <tag>Linux MD RAID-1 (mirroring) write performance:</tag>
                   Must wait for the write to occur to all of the disks
                   in the mirror.
            </descrip>
        </quote>

  <item><bf>Q</bf>:
        How does the chunk size influence the speed of my RAID device?

        <quote>
            <bf>A</bf>:
            The chunk size is the amount of data that is contiguous on the
            virtual device and also contiguous on a single physical device.
            The best choice depends on your workload: ideally, the chunk size
            matches the typical size of your requests, so that two consecutive
            requests have a good chance of landing on different disks and
            being serviced at the same time. Finding the best value takes
            a fair amount of testing with different chunk sizes against your
            average request size.
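
            To make the mapping concrete, here is a small sketch of how a
            byte offset on a striped (RAID-0) device falls onto the member
            disks. It assumes the simplest case of equal-sized members and
            plain round-robin placement of chunks; the function name
            <tt>stripe&lowbar;disk</tt> and the chunk sizes used are
            illustrative choices, not values taken from the MD driver.
<tscreen>
<verb>
```shell
# stripe_disk OFFSET CHUNK_SIZE NDISKS
# Prints the index (0-based) of the disk holding byte OFFSET, assuming
# equal-sized disks and round-robin chunk placement.
stripe_disk() {
    chunk_index=$(( $1 / $2 ))
    echo $(( chunk_index % $3 ))
}

# Two requests at offsets 0 and 65536 on a 2-disk set:
# with 64 KB chunks they land on different disks,
# with 128 KB chunks they share one disk.
stripe_disk 0 65536 2
stripe_disk 65536 65536 2
stripe_disk 65536 131072 2
```
</verb>
</tscreen>
            Requests that share a disk have to wait for each other, which
            is why a chunk size much larger than the typical request tends
            to lose the benefit of striping for that workload.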
        </quote>

  <item><bf>Q</bf>:
        Are linear MD's expandable?
        Can a new hard-drive/partition be added,
        and the size of the existing file system expanded?

        <quote>
            <bf>A</bf>:
            Miguel de Icaza
            &lt;<htmlurl url="mailto:miguel@luthien.nuclecu.unam.mx"
                        name="miguel@luthien.nuclecu.unam.mx">&gt;
            writes:
            <quote>
              I changed the ext2fs code to be aware of multiple-devices
              instead of the regular one device per file system assumption.

              So, when you want to extend a file system,
              you run a utility program that makes the appropriate changes
              on the new device (your extra partition) and then you just tell
              the system to extend the fs using the specified device.

              You can extend a file system with new devices at system operation
              time, no need to bring the system down
              (and whenever I get some extra time, you will be able to remove
              devices from the ext2 volume set, again without even having
              to go to single-user mode or any hack like that).

              You can get the patch for 2.1.x kernel from my web page:
              <tscreen>
               <url url="http://www.nuclecu.unam.mx/&tilde;miguel/ext2-volume">
              </tscreen>
            </quote>
        </quote>

  <item><bf>Q</bf>:
        Where can I put the <tt>md</tt> commands in the startup scripts,
        so that everything will start automatically at boot time?

        <quote>
            <bf>A</bf>:
            Rod Wilkens
            &lt;<htmlurl url="mailto:rwilkens@border.net"
                        name="rwilkens@border.net">&gt;
            writes:
            <quote>
              What I did is put ``<tt>mdadd -ar</tt>'' in
              the ``<tt>/etc/rc.d/rc.sysinit</tt>'' right after the kernel
              loads the modules, and before the ``<tt>fsck</tt>'' disk check.
              This way, you can put the ``<tt>/dev/md?</tt>'' device in the 
              ``<tt>/etc/fstab</tt>''. Then I put the ``<tt>mdstop -a</tt>''
              right after the ``<tt>umount -a</tt>'' unmounting the disks,
              in the ``<tt>/etc/rc.d/init.d/halt</tt>'' file.
            </quote>
            For RAID-5, you will want to look at the return code
            for <tt>mdadd</tt>, and if it failed, do a 
            <tscreen>
            <verb>
ckraid --fix /etc/raid5.conf
            </verb>
            </tscreen>
            to repair any damage.
        </quote>

  <item><bf>Q</bf>:
        I have SCSI adapter brand XYZ (with or without several channels), 
        and disk brand(s) PQR and LMN, will these work with md to create
        a linear/striped/mirrored personality? 

        <quote>
            <bf>A</bf>:
            Yes!  Software RAID will work with any disk controller (IDE
            or SCSI) and any disks.  The disks do not have to be identical,
            nor do the controllers.  For example, a RAID mirror can be
            created with one half the mirror being a SCSI disk, and the 
            other an IDE disk.  The disks do not even have to be the same 
            size.  There are no restrictions on the mixing &amp; matching of 
            disks and controllers.
      
            This is because Software RAID works with disk partitions, not 
            with the raw disks themselves.  The only recommendation is that
            for RAID levels 1 and 5, the disk partitions that are used as part 
            of the same set be the same size. If the partitions used to make 
            up the RAID 1 or 5 array are not the same size, then the excess 
            space in the larger partitions is wasted (not used).
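
            As a worked example of the sizing rule above, the sketch below
            computes the usable and wasted space for a set of partition
            sizes. It assumes that every member contributes only as much
            space as the smallest partition, and, for RAID-5, the usual rule
            that one partition's worth of space goes to parity. The function
            names <tt>usable&lowbar;kb</tt> and <tt>wasted&lowbar;kb</tt>
            are made up for this example.
<tscreen>
<verb>
```shell
# usable_kb LEVEL SIZE_KB...
# Prints the usable capacity in KB for RAID level 1 or 5, assuming every
# member contributes only as much space as the smallest partition.
usable_kb() {
    level=$1
    shift
    n=$#
    min=$1
    for size in "$@"; do
        if [ "$size" -lt "$min" ]; then min=$size; fi
    done
    case $level in
        1) echo "$min" ;;
        5) echo $(( (n - 1) * min )) ;;
        *) echo "unsupported level: $level"; return 1 ;;
    esac
}

# wasted_kb SIZE_KB...
# Prints the total space left unused on the larger partitions.
wasted_kb() {
    n=$#
    min=$1
    total=0
    for size in "$@"; do
        total=$(( total + size ))
        if [ "$size" -lt "$min" ]; then min=$size; fi
    done
    echo $(( total - n * min ))
}

# Three partitions of 1000, 1200 and 1100 KB in a RAID-5 set:
usable_kb 5 1000 1200 1100
wasted_kb 1000 1200 1100
```
</verb>
</tscreen>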
        </quote>

  <item><bf>Q</bf>:
        I was wondering if it's possible to set up striping with more 
        than 2 devices in <tt>md0</tt>? This is for a news server,
        and I have 9 drives... Needless to say I need much more than two.
        Is this possible?

        <quote>
            <bf>A</bf>:
            Yes. (describe how to do this)
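
            Pending a fuller description, here is a hedged sketch based only
            on the <tt>mdadd</tt> and <tt>mdrun</tt> usage shown elsewhere
            in this document. It assumes that <tt>mdadd</tt> accepts any
            number of component devices, as the RAID-5 example suggests,
            and that <tt>-p0</tt> selects RAID-0. The nine partition names
            are hypothetical, and the function only prints the commands
            rather than running them.
<tscreen>
<verb>
```shell
# stripe_commands MD_DEVICE COMPONENT...
# Sketch only: prints the commands rather than running them.
stripe_commands() {
    md=$1
    shift
    echo "mdadd $md $*"
    echo "mdrun -p0 $md"
}

# The nine partition names below are hypothetical.
disks="/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1"
disks="$disks /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1"
stripe_commands /dev/md0 $disks
```
</verb>
</tscreen>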
        </quote>

  <item><bf>Q</bf>:
        When is Software RAID superior to Hardware RAID?
        <quote>
            <bf>A</bf>:
            Normally, Hardware RAID is considered superior to Software 
            RAID, because hardware controllers often have a large cache,
            and can do a better job of scheduling operations in parallel.
            However, integrated Software RAID can (and does) gain certain 
            advantages from being close to the operating system.

            For example, ... ummm. Opaque description of caching of 
            reconstructed blocks in buffer cache elided ...
        </quote>

  <item><bf>Q</bf>:
        If I upgrade my version of raidtools, will it have trouble 
        manipulating older raid arrays?  In short, should I recreate my 
        RAID arrays when upgrading the raid utilities?

        <quote>
            <bf>A</bf>:
            No, not unless the major version number changes.
            An MD version x.y.z consists of three sub-versions:
            <verb>
     x:      Major version.
     y:      Minor version.
     z:      Patchlevel version.
            </verb>

            Version x1.y1.z1 of the RAID driver supports a RAID array with
            version x2.y2.z2 if (x1 == x2) and (y1 &gt;= y2).
        
            Different patchlevel (z) versions for the same (x.y) version are
            designed to be mostly compatible.
        
            The minor version number is increased whenever the RAID array layout
            is changed in a way which is incompatible with older versions of the
            driver. New versions of the driver will maintain compatibility with
            older RAID arrays.
        
            The major version number will be increased if it will no longer make
            sense to support old RAID arrays in the new kernel code.
         
            For RAID-1, it's not likely that either the disk layout or the
            superblock structure will change anytime soon.  Most
            optimizations and new features (reconstruction, multithreaded 
            tools, hot-plug, etc.) do not affect the physical layout.
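
            The compatibility rule stated above (same major version, and a
            driver minor version at least that of the array) can be sketched
            as a small shell function. The function name
            <tt>md&lowbar;compatible</tt> is made up for this example, and
            the version strings in the usage comment are illustrative only.
<tscreen>
<verb>
```shell
# md_compatible DRIVER_VERSION ARRAY_VERSION
# Versions are given as x.y.z. Returns 0 when the driver supports the
# array: same major (x), and driver minor (y) not older than the array's.
# The patchlevel (z) is ignored.
md_compatible() {
    drv_major=${1%%.*}
    rest=${1#*.}
    drv_minor=${rest%%.*}
    arr_major=${2%%.*}
    rest=${2#*.}
    arr_minor=${rest%%.*}
    [ "$drv_major" -eq "$arr_major" ] || return 1
    [ "$drv_minor" -ge "$arr_minor" ]
}

# Typical use:
#   if md_compatible 0.36.3 0.35.2; then echo "array is supported"; fi
```
</verb>
</tscreen>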
        </quote>

  <item><bf>Q</bf>:
        The command <tt>mdstop /dev/md0</tt> says that the device is busy.

        <quote>
            <bf>A</bf>:
            There's a process that has a file open on <tt>/dev/md0</tt>, or
            <tt>/dev/md0</tt> is still mounted.  Terminate the process or 
            <tt>umount /dev/md0</tt>.
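
            A minimal sketch for finding out which of the two it is: the
            function below looks the device up in the mount table, assumed
            here to be <tt>/etc/mtab</tt>. The function name
            <tt>is&lowbar;mounted</tt> is made up for this example, and
            the <tt>fuser</tt> hint in the comment assumes that the psmisc
            package is installed.
<tscreen>
<verb>
```shell
# is_mounted DEVICE [MOUNT_TABLE]
# Returns 0 if DEVICE is the first field of some line in the mount
# table (default /etc/mtab), 1 otherwise.
is_mounted() {
    while read -r dev rest; do
        if [ "$dev" = "$1" ]; then
            return 0
        fi
    done < "${2:-/etc/mtab}"
    return 1
}

# Typical use:
#   if is_mounted /dev/md0; then umount /dev/md0; fi
#   fuser -v /dev/md0     # list processes holding the device open
```
</verb>
</tscreen>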
        </quote>

  <item><bf>Q</bf>:
        Are there performance tools?
        <quote>
            <bf>A</bf>:
            There is a new utility called <tt>iotrace</tt> in the 
            <tt>linux/iotrace</tt>
            directory. It reads <tt>/proc/io-trace</tt> and analyses/plots its
            output.  If you feel your system's block IO performance is too 
            low, just look at the iotrace output.
        </quote>

</enum>
</p>


<sect>Questions Waiting for Answers

<p>
<enum>
  <item><bf>Q</bf>:
        What options have you used for formatting the (RAID) disks?
        I used:
        <tscreen>
        <verb>
mke2fs -b 4096 -R stride=4 ... blah
        </verb>
        </tscreen>
        or is it supposed to be 64K &times; 4 drives:
        <tscreen>
        <verb>
mke2fs -b 4096 -R stride=262000 ... blah
        </verb>
        </tscreen>
        are there any other options ?

  <p>
  <item><bf>Q</bf>:
        For testing raw disk throughput:
        is there a character device for raw reads/raw writes, instead of
        <tt>/dev/sdaxx</tt>, that we can use to measure performance
        on the RAID drives?
        Is there a GUI-based tool for watching the disk throughput?
</enum>
</p>


<sect>Wish List of Enhancements to MD and Related Software

<p>
Bradley Ward Allen
&lt;<htmlurl url="mailto:ulmo@Q.Net" name="ulmo@Q.Net">&gt;
wrote:
  <quote>
  Ideas include:
  <itemize>
    <item>Bootup parameters to tell the kernel which devices are
          to be MD devices (no more ``<tt>mdadd</tt>'')
    <item>Making MD transparent to ``<tt>mount</tt>''/``<tt>umount</tt>''
          such that there is no ``<tt>mdrun</tt>'' and ``<tt>mdstop</tt>''
    <item>Integrating ``<tt>ckraid</tt>'' entirely into the kernel,
          and letting it run as needed
  </itemize>
  (So far, all I've done is suggest getting rid of the tools and putting
   them into the kernel&semi; that's how I feel about it,
   this is a filesystem, not a toy.)
  <itemize>
    <item>Deal with arrays that can easily survive N disks going out
          simultaneously or at separate moments,
          where N is a whole number &gt; 0 settable by the administrator
    <item>Handle kernel freezes, power outages,
          and other abrupt shutdowns better
    <item>Don't disable a whole disk if only parts of it have failed,
          e.g., if the sector errors are confined to less than 50&percnt; of
          access over the attempts of 20 dissimilar requests,
          then it continues just ignoring those sectors of that particular
          disk.
    <item>Bad sectors:
          <itemize>
            <item>A mechanism for saving which sectors are bad,
                  someplace onto the disk.
            <item>If there is a generalized mechanism for marking degraded
                  bad blocks that upper filesystem levels can recognize,
                  use that. Program it if not.
            <item>Perhaps alternatively a mechanism for telling the upper
                  layer that the size of the disk got smaller,
                  even arranging for the upper layer to move out stuff from
                  the areas being eliminated.
                  This would help with degraded blocks as well.
            <item>Failing the above ideas, keeping a small (admin settable)
                  amount of space aside for bad blocks (distributed evenly
                  across disk?), and using them (nearby if possible)
                  instead of the bad blocks when it does happen.
                  Of course, this is inefficient.
                  Furthermore, the kernel ought to log every time the RAID
                  array starts each bad sector and what is being done about
                  it with a ``<tt>crit</tt>'' level warning, just to get
                  the administrator to realize that his disk has a piece of
                  dust burrowing into it (or a head with platter sickness).
          </itemize>
    <item>Software-switchable disks:
          <descrip>
            <tag>``disable this disk''</tag>
                 would block until kernel has completed making sure
                 there is no data on the disk being shut down
                 that is needed (e.g., to complete an XOR/ECC/other error
                 correction), then release the disk from use
                 (so it could be removed, etc.)&semi;
            <tag>``enable this disk''</tag>
                 would <tt>mkraid</tt> a new disk if appropriate
                 and then start using it for ECC/whatever operations,
                 enlarging the RAID5 array as it goes&semi;
            <tag>``resize array''</tag>
                 would respecify the total number of disks
                 and the number of redundant disks, and the result
                 would often be to resize the size of the array&semi;
                 where no data loss would result,
                 doing this as needed would be nice,
                 but I have a hard time figuring out how it would do that&semi;
                 in any case, a mode where it would block
                 (for possibly hours (kernel ought to log something every
                  ten seconds if so)) would be necessary&semi;
            <tag>``enable this disk while saving data''</tag>
                 which would save the data on a disk as-is and move it
                 to the RAID5 system as needed, so that a horrific save
                 and restore would not have to happen every time someone
                 brings up a RAID5 system (instead, it may be simpler to
                 only save one partition instead of two,
                 it might fit onto the first as a gzip'd file even)&semi;
                 finally,
            <tag>``re-enable disk''</tag>
                 would be an operator's hint to the OS to try out
                 a previously failed disk (it would simply call disable
                 then enable, I suppose).
         </descrip>
    </itemize>
  </quote>
</p>

</article>
