There used to be a time when Linux was a joy to use. Now, it's a headache-inducing slog through the bowels of the operating system. You have to be a brain surgeon and a rocket scientist. You have to work overtime. You have to have the patience of a saint and the perseverance of a boxer. I am so tired of this. I'm exhausted.
Why can't I have a distribution that boots? Why is it all on me? Why do I have to spend days upon days doing sysadmin surgery?
Update: 2 February 2023 - All I want is a system that boots
About a month ago, I updated one of my servers to Ubuntu 22.04 LTS -- it was time -- it had been running 18.04 LTS before that. The update went great! It was quick, easy, painless. Exactly how things should be! I rebooted, and never gave it another thought.
Day before yesterday, on the 31st of January 2023, there was an ice storm. Lost power for about 4 hours. Desktop is fine, but one server won't boot. Can't mount the rootfs! It drops into the initramfs rescue shell. But my keyboard doesn't work in that shell!? It's as if the keyboard USB driver were missing from the initramfs!??? Well, stuff happens, no big deal. I'm used to this. So I pull out the rescue USB stick and boot that.
And I find this.
bin -> usr/bin
lib -> usr/lib
sbin -> usr/sbin
WTF? Am I hallucinating? Did I do that? How the heck did that happen? Did I really do that? This seems to prove, beyond the shadow of a doubt, that I must have done something really, really dumb. And forgotten about it. Did not even write it down in my sysadmin notes! Must have been 3 AM, and I was punch-drunk and hallucinating. What other possible explanation could there be? I guess I did something truly moronic, and now I am paying for it. Woe is me.
I mean ... my /usr is on a different partition. Which hasn't been mounted yet. So, of course, there's no /bin/bash or anything else. No wonder nothing boots. Also, I find /lib/modules -> /usr/lib/modules, which explains why the initramfs was borked. But now I'm miffed, because update-grub created a broken initramfs and installed it and didn't say anything about it until an ice storm a month later.
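(An aside, in case someone else lands here with the same dead keyboard and unmountable root: the standard way out is to chroot in from the rescue stick and rebuild the initramfs, forcing the USB keyboard modules into it. A rough sketch only; the partition names and module names below are assumptions, adjust for your own layout.)
# From the rescue environment: mount the real root and its separate /usr,
# then chroot in and rebuild the initramfs and the grub config.
mount /dev/sda2 /mnt            # root partition (assumption)
mount /dev/sda3 /mnt/usr        # separate /usr partition (assumption)
for d in proc sys dev; do mount --bind /$d /mnt/$d; done
# Force the usual USB keyboard drivers into every initramfs that gets built:
printf 'usbhid\nhid_generic\nxhci_hcd\n' >> /mnt/etc/initramfs-tools/modules
chroot /mnt update-initramfs -u -k all
chroot /mnt update-grub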
Whatever. Over the next few hours, I recreate a plausible /bin and /lib that allows me to mount root on my rescue image. Just update-grub and everything will be fine. Right? Right? But the grub that runs during boot never finds the images that grub installed during rescue. Huh? Back to rescue, double-check, reboot, no joy; back to rescue, fiddle, check, fiddle, check, no joy. Hunt down UUIDs in, let's see ... ummm ... /etc/default/grub and umm /boot/grub and some third place. Can't keep track of them all. Read a dozen stack-exchange posts. In a moment of desperation, I try the obvious: just boot like this: linux root=/dev/sda2 because hey, you know, this is sure to work, and eff the UUIDs. Except. Well. It doesn't. And you know why? Because ... Because ... there's no /etc/fstab. You gotta be shittin' me. I'm done for the day. Let's take this up again in the morning.
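(Another aside that might save someone a night: when grub's menu entries point at images that no longer exist, you can usually still get up by hand from the grub command line -- press 'c' at the menu. ls shows the real kernel and initrd file names, and set root points at the partition that holds /boot. The (hd0,gpt2), /dev/sda2, and the kernel version below are assumptions; use whatever ls actually shows.)
grub> ls (hd0,gpt2)/boot
grub> set root=(hd0,gpt2)
grub> linux /boot/vmlinuz-5.15.0-58-generic root=/dev/sda2 ro
grub> initrd /boot/initrd.img-5.15.0-58-generic
grub> boot
And once it's running, put back a minimal /etc/fstab -- without it, root=/dev/sda2 alone doesn't get you a usable system, because on a merged-/usr box with /usr on its own partition the initramfs needs this file to mount /usr before handing off to init. Device names here are again assumptions:
/dev/sda2   /      ext4   errors=remount-ro   0   1
/dev/sda3   /usr   ext4   defaults            0   2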
It's snowballing out of control. I can't figure out why grub can't find grub images. I'm getting tired of doing apt install --reinstall stuff by hand. I don't know why the keyboard doesn't work inside of initramfs. And you know what? I never had enough room in / and /var and /usr anyway, so you know what? I'm going to create one giant partition, and reinstall. Not an entirely bad plan, because, 6 hours later, after downloading 4 different ISO images, after 9 or 10 failed install attempts, I finally do get a bootable system. And what do I see, when I type in ls -la /?
bin -> usr/bin
lib -> usr/lib
sbin -> usr/sbin
OMG.
Someone, I don't know who, but someone, intentionally sabotaged my system and made it unbootable. Made it effectively unrepairable. Did it silently, and without warning. Took a combo that MUST ALWAYS WORK -- the bootloader, the kernel, the rootfs, /bin/bash -- and broke it.
WTF Ubuntu? WTF Debian? How could you do this? I've been running Linux for 25 years. I cut my teeth on Slackware. Why, oh why, don't we have a Linux distro that ... I dunno, ... ahh, works?
To add insult to injury, none of the Debian install images were able to get my ethernet card working. My 6-year-old, just-fine, tried-n-true ethernet card. Always worked before. Still works. Works with Debian. Just ... not with the Debian Stable install images. Really? WTF?
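(My best guess, for what it's worth: at the time, the official Debian Stable installer images shipped without non-free firmware, so any NIC that needs a firmware blob comes up dead in the installer even though it works fine on an installed system; there were separate unofficial "including firmware" installer images for exactly this reason. From any shell, the tell-tale signs look something like this:)
dmesg | grep -i firmware          # look for "failed to load firmware ..." complaints
lspci -nnk | grep -iA3 ethernet   # shows the NIC and which kernel driver claims it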
I mean, I picked Debian because perhaps it's time to ditch Ubuntu. I mean, I like you guys. I'm a fan. Just ... stop breaking things.
Anyway, after two whole days of trying to get a running system, I am left with a smoking heap of garbage, manually recovering server config files stored in long-forgotten places. Well, assuming I ever get ethernet to work. Right now, packets go out on one interface and come in on another. Routing issues I've never had before. Unplug only one ethernet cable, and nothing is pingable. Did you guys do something to break networking, too? I sure hope not. Day three better be better than day two.
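(If the symptom really is replies leaving on the wrong interface, the usual cure on a multi-homed box is source-based policy routing: give each interface its own routing table, keyed on the source address. A sketch with made-up addresses, gateways, and interface names, nothing more:)
# Replies from 192.168.1.10 go out eth0, replies from 192.168.1.11 go out eth1.
ip route add 192.168.1.0/24 dev eth0 src 192.168.1.10 table 10
ip route add default via 192.168.1.1 dev eth0 table 10
ip rule add from 192.168.1.10 table 10
ip route add 192.168.1.0/24 dev eth1 src 192.168.1.11 table 20
ip route add default via 192.168.1.1 dev eth1 table 20
ip rule add from 192.168.1.11 table 20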
Update: 4 February 2017 - Systemd fucking rulez
cgroups/cgfs.c:lxc_cgroupfs_create:901 - Could not find writable mount point for cgroup hierarchy 12 while trying to create cgroup.
So, after an apt-get update; apt-get upgrade on Ubuntu 14.04 Trusty, my LXC containers stopped booting, with the above error message. It took me maybe 6 hours, with a dinner break, to find and fix the issue. The fix turned out to be simple -- cgroups were not being mounted, and my hacky workaround was to copy /usr/bin/cgroupfs-mount from another system and run it by hand. Bingo, LXC containers work again.
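(For the curious: the script is tiny, and what it does is roughly this -- mount a tmpfs at /sys/fs/cgroup and then mount each enabled cgroup-v1 controller under it. A paraphrase, not the script verbatim:)
# Roughly what cgroupfs-mount does: one cgroup-v1 mount per enabled controller.
mount -t tmpfs -o uid=0,gid=0,mode=0755 cgroup /sys/fs/cgroup
for sys in $(awk '!/^#/ { if ($4 == 1) print $1 }' /proc/cgroups); do
    mkdir -p /sys/fs/cgroup/$sys
    mount -n -t cgroup -o $sys cgroup /sys/fs/cgroup/$sys
done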
What happened? Well, the interwebs claim that ... who knows. It's got something to do with systemd. The Ubuntu LTS maintainers apparently don't bother testing their code before pushing it anymore ... and they broke LXC. WTF. OK, yes, Ubuntu jumped the shark many years ago; but the new hero, the savior and solution to all our problems, has not yet appeared.
Seriously: operating systems for servers are supposed to be stable. The apt-get update; apt-get upgrade is not supposed to break working systems. WTF.
Update: 24 January 2017 - FUCK YOU SYSTEMD
systemd-udevd[120]: renamed network interface eth0 to p1p1
Why can't systemd just boot my machine without fucking with the network interfaces!? Why is networking so goddamned difficult with systemd? Why can't it just get out of the way and let the networking subsystem do its thing? I just want to boot my machine; I don't want to search on-line help to figure out why my system doesn't boot anymore because systemd renamed eth0 to p1p1 and then causes `ifup eth0` to fail. It's just the frickin' integrated ethernet port on the motherboard! Quit trying to fuck with it! FUCK YOU SYSTEMD!
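(For anyone else bitten by this: on Ubuntu of that era the pXpY names come from biosdevname, and the enpXsY ones from systemd/udev's "predictable" naming. Two common ways to keep plain old eth0 -- sketches, not gospel:)
# Option 1: turn the renaming off on the kernel command line. Add these to
# GRUB_CMDLINE_LINUX in /etc/default/grub, then run update-grub:
#     net.ifnames=0 biosdevname=0
# Option 2: mask the udev rule that applies the new names, then rebuild the initramfs:
ln -s /dev/null /etc/udev/rules.d/80-net-setup-link.rules
update-initramfs -u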
And now back to the original 25 May 2015 rant.
I have 5 Linux boxes I regularly maintain; two are webservers. You're looking at one now. The Linux kernel has fantastic uptime -- a year, two years without reboots. But then there is the inevitable power outage during a thunderstorm. And then, at least one or two of my machines won't boot afterwards. It's been like this for 5 or 6 or 7 years now, and frankly I'm beyond getting tired of it. I'm beyond having had enough. What are the Kübler-Ross stages of grief? Denial, anger, bargaining, depression, acceptance? I used to want to punch, well, I don't know who, maybe Kay Sievers, maybe Lennart Poettering, or someone, anyone, in the face, for all my trouble and my pain. The pain is still there. I think this open letter is a manifestation of the "bargaining" stage. What do I have to do, what price can I pay, to have a system that boots?
It's never the same thing twice in a row. Many years ago, it was udev and dbus. You had to do rocket surgery to get udev-based systems to boot. That eventually sorted itself out, but for a while, I lost back-to-back 12-hour days fighting udev. Then it was plymouth. Or it was upstart. Why were such utterly broken and buggy systems like plymouth and upstart foisted on the world? Things with names like libdevmapper should not crash. And then there is systemd, which, as far as I can tell, is a brick shithouse where the laws of gravity don't hold. I understand the natural urge to design something newer than sysvinit, but how about testing it a bit more? I have 5 different computers, and on any given random reboot, 1 out of 5 of these won't boot. That's a 20% failure rate. It's been a 20% failure rate for over 6 years now.
Exactly how much system testing is needed to push the failure rate to less than 1-out-of-5? Is it really that hard to test software before you ship it? Especially system software related to booting!? If systemd plans to take over the world, it should at least work, instead of failing. Stop killing init. Stop failing to find the root file system. Stop running fsck on file systems that are already mounted r/w. Do you have any idea how hard it is to try to edit plymouth or upstart files from busybox, hoping that maybe this time, all will be OK? To boot rescue images over and over and over and over, tracing a problem through a maze of subsystems, following clues, only to find, two days later, that it was Colonel Mustard, err, systemd that did it in the kitchen, with a candlestick? I mean, I have a really rather high IQ (just look at the web page below), and I have patience that is perhaps unmatched. And I find this stuff challenging. Let's get real: sysvinit was simple and easy to use by comparison, and it worked flawlessly. Between 1995 and 2009, I never once had a boot problem. Sure, there were times when I could not watch YouTube videos ... but then Ubuntu came along and solved even that problem. For a while, it was Heaven on Earth.
Do you have any idea how shameful it is to tell your various bosses how great Linux is, and then have to dissemble and obfuscate, because you can't bear to tell them the reason you did no work for the last 10 days was that your Linux box didn't boot? To say "no thanks" when your boss offers to buy you a new laptop?
And it's not just the low-level stuff, either. There's also the nuttiness known as gnome-shell and Unity. Which crash or hang or draw garbage on your screen. And when they do work, they're unusable, from a day-to-day usability perspective. This wasn't a problem with Gnome 2. Gnome 2 rocked. It was excellent. Why did you take something that worked really, really well, and replace it with a broken, unusable mess? What happened, Gnome and UI developers? What were you thinking? In the grips of what madness? In what design universe is it OK to list 100 apps, whose names I don't recognize, in alphabetical order? Whoever your design and usability hero is, I am pretty sure they would not approve of this.
It's spreading, too. Like cancer. Before 2013, web browsers worked flawlessly. Now, both Mozilla Firefox and Google Chrome are almost unusable. Why, oh why, can't I watch YouTube videos on Firefox? Why does Chrome have to crash whenever I visit adware-infested websites? What's wrong with the concept of a web browser that doesn't crash? Why does googling my error messages bring up web forums with six thousand posts of people saying "me too, I have this same problem"? When you have umpteen tens of thousands of users with the exact same symptoms, why do you continue to blame the user?
I can understand temporary insanity and mass hysteria. It usually passes. I can wait a year or two or three. Or maybe four. Or more. But a trifecta of the Linux boot, the Linux desktop, and the Linux web browser? What software crisis do we live in, that so many things can be going so badly, so consistently, for so long? It's one thing to blame Lennart Poettering for creating buggy, badly designed, untested software. But why are the Gnome developers creating unusable user interfaces at the same time? And what does any of this have to do with the web browser?
I'm not sure it's limited to Linux, either. Read the trade press: everyone bellyaches about the incompatible, fragmented Android universe. And, well, obviously, Microsoft Windows has been a cesspool for decades; it was the #1 reason why I switched to Linux in the first place. Duhh. But why has Linux morphed into all of the worst parts of Microsoft Windows, and none of the best parts? We are all Microsoft Windows, now.
What's at the root of this? Sure, it's some combination of programmer hubris, lack of system testing, inexperienced and callous coders. Overwhelmed coders with a 10-year-long backlog of reported, unfixed bugs. Perhaps some fatigue and depression in the ranks of the Debian and Ubuntu package-maintainer community. Perhaps it is a political problem: the older, more experienced developers have failed to teach, to guide, the younger developers. Perhaps we've hit a fundamental complexity limit: there are too many possible combinations of hardware and software. I fear we have hit a wall in our ability to communally develop software; the community is not working. All bugs are no longer shallow. Or maybe it has something to do with capitalism and corporate profitability. Some malaise presaging the singularity. I don't know. What's the root cause of this train wreck?
We need to figure out what is going wrong, not just at the technical level, but at the social and political level, that is allowing major distros to ship buggy, incomplete, broken software, oblivious to the terrible condition it is in, uncaring and uninterested in fixing it, or perhaps unable to fix it, and unable to see a way forward. But we have to move forward. We need to find a way out of this mess. It cannot continue like this.
Yesterday, there was another thunderstorm, another power outage. Today, I spent the last 11 hours trying to make my other webserver, https://gnucash.org, boot. No matter how I twist and turn, I get a "can't mount root filesystem" or "killing init". It's supposed to be a holiday weekend. I'm not being paid to run these servers. Why can't I just have a system that boots?
-- Linas Vepstas 25 May 2015 Austin TX