GNU/Linux Pet Peeves

How to Hate Your Computer as Much as I Do

by Linas Vepstas <linas@linas.org>

After years of administering a variety of GNU/Linux systems, both server and desktop, I decided I've had enough. That's it. I hate having to deal with all the bullcrap! Below follows a list of my pet peeves, all of the things that are wrong with Linux and really need to be fixed. I am writing them down here because there's one thing I hate even more: arguing with idiots on mailing lists who don't even understand the problem you are trying to report, before they start telling you that its not a problem. Arghhhhh! So I'm just gonna blow of some steam here, and hope that maybe someday these things will get fixed. And before you tell me that its time I tried running Windows Server 2003, go read the last section, about why PC's suck. GNU/Linux, warts and all, is still better than Windows.

The /var Tirade

Either the FHS/LSB doesn't mandate a place where archive files should be kept, or package owners don't follow them. For example, mailman (at least on Debian) puts mailing list archives into /var/lib/mailman/archive, list configuration files into /var/lib/mailman/lists, and assorted crap in other subdirectories. Who knows where in the world it puts subscriber lists. I want to back up the archives, and the conf files, and the subscriber lists. I don't want to backup temporary queues. By mixing together temporary, transient files with permanent archives in the /var directory, it makes it hard to divine the correct backup strategy. This makes my backup scripts insanely complicated, error-prone, and of course ... I didn't realize they were broke until I needed the data. Grrr..

You may ask, "Why didn't you just save yourself a headache and backup the temp data, too?" Answer: Because my backup strategy is to never delete files. If I start backing up temp files, I would soon have a huge collection of crap files named "/var/spool/whizbang/queue/Bxxjzckue.218" that are totally and completely un-needed and unwanted and just chewing up backup space. The backup would grow without bound. The rational solution to this is to have mailman clearly distinguish temp working files, which need to go into /var/tmp or /var/spool or some other directory that is not backed up. But archives and subscriber lists need to go into directories which are backed up. To a (much) lesser degree, postfix is guilty of the same bad behavior.

Ideally, the /var filesystem shouldn't contain both temporary, transient files (such as /var/lock) and permanent, archivable files (such as /var/www) Its stupid to mix transient and permanent files in the same file system, for several reasons. Besides the backup problem, there is also the disk-reliability and disk-performance problem. If a file system is getting a lot of activity, it is more likely to experience hard-drive failure. I want to isolate that activity so that when the hard-drive failure occurs, it only takes down my spool area, and not my precious permanent data. I want my transient spool area on a different partition, or even better, a different disk than /etc, /home, /usr. When the head on the hard drive controller is seeking to the next temporary file, I want it to scrub over a very limited area of the hard drive, where the other temporary files are stored. I don't want it flying back and forth over my permanent data. Its like putting a precious Ming vase on a high shelf in a small shack at the end of an airport runway. Sure, airplanes almost never crash. But living in the flightpath is still a bad idea. Secondly, if my server needs to do a lot of spooling, I want to put that file system on something high-performance, maybe a stripped array, rather a slow-but-safe RAID-5 array. Mixing up transient and permanent files on /var makes it hard to draw this distinction. I shouldn't have to be a SuperHero SysOp to get it right. It should be possible for any WindowsLamer to get a reasonable Unix server config, right out of the box, that wouldn't make a pro Unix admin gag.

Right now, that parts of /var that must be backed up are interspersed, like swiss cheese, with other bits that must not be backed up. In the olden days of Unix, one never-ever backed up /var: by definition, all files in it were temporary, perishable. /var was like /tmp, but one step above it: it consisted entirely of transient files, and unlike /tmp, it just wasn't world writable. And then some damn fool decided that it was a good idea to put permanent files into /var. So now we have total idiocies like /var/www and /var/lib. That is just plain wrong. And packages like mailman, which install themselves into /var. Which is just plain insane. I dunno. The same FHS that explains that /var is for transient data speaks in the same breath of putting non-transient data into /var/lib. Conclusion: FHS is psychotic? Arghhhh! (April 2003)

The IDE/fsck/RAID Tirade

The current Linux storage subsystem works just fine if you have very reliable hardware, and never make any changes. But whoa be to the sysadmin who needs to enlarge, change, modify or repair. In that case, it reveals itself to be fragile, undependable and unreliable. Storage subsystem maintenance in Linux is a nightmare. Below follows a list of problems that I've experienced, and some ideas on steps that could be taken.

IDE-SMART needs to be vastly improved. It really, really needs to have an idiot option: You idiot, your hard drive is about to fail. Currently, it generates cryptic messages which go into your syslog file and are essentially useless, stuff like: smartd: Device: /dev/hda, S.M.A.R.T. Attribute: 1 Changed 1 Like WTF does that mean? I guess it means (in hindsight) that my hard-drive is about to fail. (April 2003)

RAID needs to have bad-block relocation built into it. I run RAID-1 (mirroring) a lot. This means both disks are supposed to be 'identical'. Well, as it so happens, I had a pair of disks, both of which had a lot of bad blocks. Well, it seems like they covered for one another: if one drive was bad, the mirror would suck data off the other drive, and all was well in the world. That is, all was well until I tried to replace one of the failing drives with a good drive. Well, when I started reconstructing the mirror, all hell broke loose. I'd taken out one of the old drives, to make room for the new drive. When the copy started from the other old drive to the new drive, it all went down the shithole. Massive file system corruption. Lock-ups. Reboots. Turns out you can even oops kernel-2.4.18 by rebooting while a RAID reconstruction is going on in the background. Yow. Of course, by the time I realized what was happening, things had been fsck'ed into oblivion.

Yes, this is arguably operator error. I really should have added a third disk to the mirror before I took out the failing disk. RAID pro's would say I commited a cardinal sin. But that's missing the point. The OS and/or tools should be telling me that I'm committing a cardinal sin. Instead of shredding my data. Remember, I'm a stupid dork sysadmin that has to be protected from myself. Just because I wrote the original Linux RAID HOWTO doesn't mean I don't commit stupid mistakes. (April 2003)

We need an on-line, hot fsck. A version of fsck that I can run while the system is on-line and running. I refuse to believe that this is technically hard. And the benefits, as I see them, would be immense. I have not infrequently seen file systems slowly go corrupt. I don't know why, it may be failing hardware, cosmic rays, buggy kernels, flaky interrupts, bad RAM, overheating CPU's. Who knows. But I really, really would like to have something validating and checking the integrity of my file system in the background, all the time. And repairing it, as much as possible, before it accumulates a patina of too many irrevocable, uncorrectable errors.

Do I need to point out that this is particularly urgent for the 'journaling' file systems? If you run ext3, jfs, xfs or reiserfs, you can have many fast happy reboots. In fact, you can have fast, happy reboots on a massively corrupted filesystem. Just because the journal was short, and was speedily dealt with, doesn't mean that you don't have corrupted data on your disk. A cosmic ray could have hit. Gremlins. Shit happens. (April 2003)

RAID and/or block drives need to have a bad-block test daemon running in the background, constantly. This daemon needs to perform write & read-back tests of unused blocks on the hard drive. This daemon needs to occasionally take a used block, and move it somewhere else, and check the underlying block to see if its still OK. (April 2003)

Device drivers for IDE controllers need to have enhanced diagnostics features. IDE controllers and/or ribbons can fail too, and there needs to be a way of distinguishing controller failures from disk failures. Controller failure need not be all-or-nothing. If a controller experiences a large number of parity errors, this is a sure sign that there are probably 2 or 3 bit errors that are not being detected/corrected. I own a controller that silently and seriously corrupted my filesystems. I saw occasional { DriveReady SeekComplete Error} messages in my error log. I made two mistakes: (1) I assumed that these were disk drive errors, and (2) I assumed that these were relatively harmless. I experienced massive file system corruption. Whoa is me. (April 2003)

The fsck tool needs to be particularly sensitive to its operating environment. If fsck is finding lots of errors on a system with a bad IDE/SCSI controller, it may actually worsen the problem by making 'fixes'. In fact, the stuff on the disk may be just fine, and corruption due to a bad ribbon cable, a bad disk controller, or a bad PCI bus/chipset may be the cause of the errors that fsck is seeing. In such a case, it is wrong to let fsck do anything, as it'll almost certainly make a bad situation worse. (May 2003)

The fsck tool needs to start putting things in layman's terms. I've been using Unix for a decade and a half, and I still have only a vague idea of what it means that 'dtime is zero' or 'block count is wrong'. Great. What does this really mean? Does it mean that one of my files might have been corrupted? If so, which one? If its just a harmless symptom of my computer having crashed, well, that's fine, so just tell me that this is the probable reason. But maybe these dtime error things occurred a long, long time ago, and are completely unrelated to the crash that caused the current reboot. If so, then I want to know that info too. I want to know if there's a problem brewing in the background. (May 2003)

The operating system/daemons/utilities need to be more pro-active in the face of possible data loss and corruption. When one gets one of those infamous { DriveReady SeekComplete Error} messages, one needs to be told many things:
1. Was the message harmless in the end? Did a retry eventually recover the data?
2. Is this a disk drive error, a controller error, or a cable-noise problem? Which subsystem is a candidate for replacement?
3. Immediately after the error, the data on the sector in question needs to be relocated, and a bad-clock test needs to be run, and if the sector tests bad, it needs to be retired immediately.
4. If the data was not recoverable, the name of the file with the error needs to be recorded in a log. The offset, in bytes, to the corrupted/lost region, and the length of that region, needs to be recorded. If you know the name of the file that was affected you can recover far more easily than trying to re-establish a whole entire filesystem.
5. If there is data corruption occurring, some sysops might want their system to go offline, so that additional corruption from ongoing operation is halted. I'm not sure what to do next, but if a data error occurs, the system as a whole needs to be very pro-active in dealing with the error, isolating the root cause, and making sure that more will not occur.
6. Some of the problems that fsck fixes involve possible corruption to data files. fsck prints these filenames out onto the screen. It also needs to put these fsck'ing filenames into a syslog, somehow. Because when they scroll off the screen, if you didn't write them down, that's it. There is no way of going back and finding out which of your files were corrupted by the fsck check.
7. The Gnome/KDE desktops need a panel applet that shows how many bad blocks you've got. And the trend rate. In color.
8. If you are managing a bunch of servers, you are probably using a tool called logcheck to report errors found in the syslog. To go along with this, we need a KDE/Gnome tool that blasts a big fat red window onto the screen when a particularly bad error is seen. Cause if you don't read your mail regularly, you may not know about the error for a while.
(May 2003)

Volume management subsystems, such as EVMS need to get off the crapper, and realize that volume management is not just about enlarging and moving around file systems, but also needs to be about managing the errors, problems and bugs associated with storage subsystems. EVMS also has to become a player in the system recovery/boot-diskette arena. EVMS is useless if you have to have a fully functioning system to use it. If you are fighting off a controller failure, or a disk failure, its quite possible that your root partition has been affected. This is particularly likely if your server has fewer than a half-dozen hard drives. In this case, you find yourself doing volume management on a system that you booted from a diskette, a system which not only can't get into graphical mode to run the EVMS GUI, but probably doesn't have /usr mounted, and thus can't access any fancy scripts. In this case, you'd better hope that libncurses is installed in /lib and not /usr/lib. It'd be even better if EVMS-curses fit on a single rescue diskette.

Storage Management

The current relationship between the mount command, the storage subsystem, RAID, and LILO is an unholy mess.

When I move a hard drive from one controller to another, I want the operating system to recognize it by the data on the drive, rather than by the cable its attached to. In particular, I am thinking of the mount command, and of /etc/fstab. If I need to reorganize a system, and /dev/hda3 suddenly becomes /dev/hde3, I shouldn't have to muck with /etc/fstab to get it right. Why? Because it seems like the only time I edit /etc/fstab is under times of duress. Something's down, something's wrong, you've just booted off the rescue disk, you have a only have minimalist editor (e.g. /bin/ae) handy, and you find yourself doing emergency repairs to /etc/fstab. Playing the guessing game of what data is on what controller sucks. You should be able to ask a disk partition "what are you?", and have the disk partition reply, and have the mount command respond to that.

Its even worse when you've mirrored you root partition. In these cases, you often end up working with only one instance of the mirror, rather then the whole thing. There is a variety complicated reasons as to how one gets wedged in such a situation, but its not uncommon. Anyway, in this case, one has two (or more?) copies of /etc/fstab in play. Which one gets used becomes a function of the bootloader, the rescue diskette, the arbitrary BIOS drive numberings scheme, and operator confusion. On any given reboot, the /etc/fstab that gets used may or may not be the one you just got done editing.

Ext2fs labels. Its great that you can put a label on ext2fs with the e2label command. But did you know that if you make a RAID-1 mirror /dev/md1 out of /dev/hda1 and /dev/hdc1, and put an Ext2fs label on it, that all three of these devices will have exactly the same ext2fs label? File-system labels are interesting, but are not a complete answer. Labels are needed at the block-device level, and the label needs to live with the actual storage media being labelled. The mount command, and /etc/fstab need to use block-device labels, not file-system labels.

The current Linux md subsystem uses 'raid superblocks' to store a UUID and cache other info. But there are no tools that really allow you to look at this info, nor to put a label in there. Never mind that you need your partition type to be 0xfd and not 0x83, and yada yada yada.

Maybe MSDOS-compatible hard drive partition tables are wrong? Maybe one of the other partition table styles is better? If I'm not running MSDOS or Windows on the machine, I don't need to use the MSDOS partition table. Where's the HOWTO that talks about stuff at this level? How do we get tools to use this stuff?

The latest LILO (version 25.5.3.1) has gone insane. My last version of LILO didn't complain that both of my hard drives were at 0x80. Now it does. Now I have to tell its explicitly, that drive 1 is 0x80 and drive 2 is 0x81. What's 1 and 2? I dunno. What does 0x80 and 0x81 signify? I dunno, a wild & lucky guess. And, in the keeping with the tradition of computers doing what you told them to do, rather than what you wanted them to do, LILO is now very, very capable of creating a boot sector that is unbootable. This is totally unacceptable.

My root partition is mirrored. I have a copy of / on both /dev/hda1 and on /dev/hdc1. When I tell LILO that I want a boot sector installed on /dev/hda and on /dev/hdc, and to mount either /dev/hda1 or /dev/hdc1 as the root partition, that's what I want. I don't want this L 40 40 40 crap. I don't want LI 80 80 80 either. I don't want it getting confused about which one is /dev/hda and which is /dev/hdc. I especially don't want it coming up with a different opinion on which is which, as compared to my rescue diskette.

After LILO installs a boot sector, it needs to go through the motions of pretending to do a dry-run boot, to see if it can actually accomplish it. If not, it needs to reinstall the old boot sector. There is no excuse for the scenario I experience far, far too often: running LILO, having it complete without error or complaint, and getting an unbootable system as a result. Why, last night, I think I booted a maybe fifteen times, juggling between rescue diskettes and LILO and RAID partitions and disk=/dev/hda bios=0x80 hacks and /etc/fstab edits before I finally unearthed a magic combination that rebooted into the same state that it was shutdown from. That half the boots got hung in the boot loader is just plain wrong.

Yes, I could use GRUB. If I could remember how. GRUB needs to be able to parse the lilo.conf configuration file. My life is already complicated. A little bit of compatibility could go a long, long way.

Wintel PC's Suck

Yes, they do. When you are fighting down in the trenches, PC's just aren't worth the crap they're made out of. Below are a few reasons why. Disclaimer: I am currently employed by a manufacturer of very high-end, very expensive, non-wintel computer hardware. Many PC users simply do not understand why these 'other' kinds of computers cost a factor of ten more than a PC does, for just about the same MIPS and FLOPS, speeds and feeds, RAM and disk.

I worked with a fool who thought it would be cool to build a terrabyte file server out of $5K of commodity PC parts. Yeah, right, nothing worked. Three out of ten hard drives arrived DOA. One hot-plug drive tray was DOA, and looked like it was used, returned as faulty by a previous customer, and shipped again as new. The IDE cables were too short. The on-board Ethernet controller hung after transferring about 1GB of data. There were PCI bus errors until I re-flashed a newer BIOS onto the main-board. The driver for the IDE subsystem was proprietary, not open source, and installed itself in a weird place, which was later blown away during subsequent install. Piece of honking, blowing junk. There is a reason that IBM/Dell/Compaq PC Servers cost $50K and not $5K. That's because they actually work, out of the box. There's a reason that SGI/Sun/IBM/HP Unix servers cost $500K and not $50K. Its because they are actually reliable, and deal with faults and failures in a predictable, recoverable way. Oh, and you actually get service.

I think the litany below might help PC owners to understand the high cost of servers from SGI, Sun, HP and IBM. For example, did you know that the gate oxide on the IBM Power4 CPU is four times thicker than that on the Intel Pentium/Xeon/Celeron, etc. CPU's? The thicker oxide makes the CPU one hundred times less likely to fail. It also makes the CPU run slower, since the gate cannot slew as fast; it cannot source/sink large currents the way that a thin-oxide gate can. The CPU chip clock cannot be run at the higher frequencies seen in the Intel chips. Net/net: the Power4 made a tradeoff between raw performance and reliability, and picked reliability over performance. Now ain't that counter to the currents in the PC world?)

Hard Drive manufacturers who think its a cool trick to put Intel boot agents into the boot sector of a brand-spanking new hard drive. What, exactly, does this accomplish? Show me one single user who actually *needs* this boot agent. I'll tell you what we need. We need a boot sector that prints out the following message: "Your Stupid BIOS just tried to boot the new hard drive you just bought. Sorry. Unplug your new hard drive and try again." This kind of a message would be a *great* help, instead of my having to guess my way out of some screwball Intel PXE boot prompt. Argghhh (April 2003)

Intel BIOS. BIOS sucks. You buy a new IDE controller, and it renames the order of the disks in your system. So that, for example, your new hard drive is now your C: drive. Which is what BIOS tries to boot from. What's up with that, Homey? Adding new hard drives and new controllers should not cause your existing drives to be renumbered every which way. Actually, for this sin, I should blame Microsoft. If WinDOS didn't insist that the world revolved around the C: drive, then we wouldn't have BIOS'es irrationally renumbering drives. (April 2003)

IDE Chipsets suck. Ribbon cables suck. I just went through three hard drives before I started understanding that the errors I was seeing might not have been failing hard drives, but in fact was due to a failing IDE controller and/or ribbon cables. Now maybe I should blame Linux for this, for not providing me with error messages to distinguish these different sources of errors. But I suspect it's not the fault of the software. I blame the hardware. I don't think that most IDE chipsets perform enough error and consistency checking, and do not self-diagnose and report problems. I don't think you can write a device driver for most IDE chipsets that will detect chipset failures, and report them. Instead, you, like I, will experience slow data corruption and vast cascades of error messages that seem to blame the hard drive. And you won't know until too late. All that this means is that if you want a reliable storage subsystem for your server, don't go with IDE. Since IDE sucks, by design. (April 2003)

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included at the URL http://www.linas.org/fdl.html, the web page titled "GNU Free Documentation License".