Persistent Storage

This web page is a call to interested computer science programmers to implement persistent storage for Unix, and for Linux in particular.

Justification:

It can be very desirable at times to take the image of a running process out of the running state and literally store it on media other than memory, i.e. a hard drive or tape. The need for such a facility arises from long-running processes. If, for example, I wanted to calculate the value of Pi to one billion digits, and I was too lazy to put a stop/restore mechanism into the program, it would certainly be nice to be able to make a backup of the running process in case something went wrong or I had to reboot the machine. More serious scenarios arise with programs such as Gaussian, a computational chemistry program, where even small calculations take weeks and long ones may run for months. There is a serious need to be able to make running backups of the process, for recovery should a system crash occur or a reboot be necessary.

The concept of taking the image of a running process, storing it as a file, and then restoring it for later execution is not new. Programs such as Emacs deliberately force a core dump, which is then manipulated and turned into an executable. This apparently cuts down on the startup time considerably. Other people have written freeze/thaw programs with the intents mentioned in the previous paragraph, but these were limited because they did not deal with the linkage of the running process to the rest of the system, such as file pointers. This was the case with one program of which all I can remember is that it was written by someone at CMU. Others have apparently made much greater strides, such as Condor, though I haven't been able to find out anything regarding its copyright. Condor is actually very interesting and provides functionality that would be very useful in Linux, though it would be a mistake to aim that high in a first-shot implementation starting from ground zero.

It may not be immediately obvious, but this kind of functionality is a first step towards clustering, which Linus indicated in an interview was a possible goal for Linux 3.0. The approach everybody has taken so far has been to implement this kind of program outside the kernel, as an application. Clearly there is something to be said for providing this type of functionality at the system (i.e. kernel) level, in particular as we think of the long-term goals.

Initial goals are not very ambitious. There are two of them, and they should probably be approached in the order listed. They are listed below under Phase 1 and Phase 2.

Phase 1

Provide a mechanism implemented in either the kernel or entirely at the application level to freeze a process, store it on disk, then load the process at a later time, thaw it, and bring it to a runnable state.

An oversimplified way of doing this would be to send the target process a SIGSTOP, then dump its memory contents to a file, along with any process state information (i.e. registers). Core images are really good for this. The core image can then be loaded at a later time, hopefully without too much trouble, and continue execution where it stopped. Of course this doesn't take into account file pointers, which should be restored as well.
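As a rough illustration of only the stop and continue half of this idea, here is a minimal Python sketch. It does not dump memory or registers; it merely shows that a process can be parked with SIGSTOP, observed to be stopped, and resumed with SIGCONT without noticing. The function names (freeze, thaw, demo) are invented for this example, and the victim is a forked child of the script itself rather than an arbitrary process.

```python
import os
import signal

def freeze(pid):
    """Send SIGSTOP and wait until the kernel reports the child as stopped."""
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)

def thaw(pid):
    """Let a stopped process run again."""
    os.kill(pid, signal.SIGCONT)

def demo():
    """Fork a child, freeze it, thaw it, and check that it finishes normally."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: block until the parent sends one byte, then exit with code 7.
        os.close(w)
        os.read(r, 1)
        os._exit(7)
    os.close(r)
    stopped = freeze(pid)
    # A real checkpointer would dump memory and registers at this point.
    thaw(pid)
    os.write(w, b"x")
    os.close(w)
    _, status = os.waitpid(pid, 0)
    return stopped, os.WEXITSTATUS(status)
```

Calling demo() returns whether the child was seen in the stopped state, together with its exit code. Everything hard about Phase 1 lives in the commented gap between freeze and thaw.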

Phase 2

By "persistent storage", I mean the idea that a running system could be powered off, and powered back on, and none of the running programs would "know" that power had been turned off. I believe that a primitive form of this could be implemented in Linux by making some "simple" modifications to swap-file management, and to the process scheduler. They are as follows:
  1. A special "shutoff" utility would put all running programs to sleep. It would then write all virtual memory associated with the programs to the swap-file. It would then write the process tables to some special location on disk. It would then halt the CPU. The machine can now be powered off.

  2. During boot, the swap-file would NOT be purged. The process tables would be loaded back into the kernel. Assuming that pointers are appropriately patched, the "sleeping" processes could be restarted where they last left off.

Assuming that the above is done, there are further gotchas that need to be handled.
  1. Network access. Any program that had an open socket would probably find that its network connection has died. I don't know how to re-establish these.

  2. Print spooling; serial connections. Clearly a print spooler that had been suspended and restored in this way might find that the printer had gone off-line in the interim. Not a big deal, but ... Similarly, any devices attached to the serial ports may wonder what was going on, and may have reset themselves.

Why Do This?

Why is this an interesting thing to do? The most trivial reason is that it makes servers more robust, and better able to survive a power outage.

A more interesting reason is that it can change the way in which programs are designed. If a programmer knows that what they have in memory is persistent, they can have fewer worries about "saving things to disk". A programming model with persistent shared memory is potentially easier to design with than one which needs explicit flushes and syncs. This can open up to novice programmers kinds of programs that today require adept, knowledgeable and experienced programmers to implement.


A Development Game Plan

That said, allow me to free-associate a development plan. First, note that any OS scheduler already knows how to suspend and restart execution of a process. So the first hack would be to grab a process, suspend execution, write its image to disk, read back its image, and see if you can successfully restart it. It will take a few weeks to figure out how to grab some process, and how to write/read to disk. Once you've done this, you want to start hunting for the various buffers and structures that the kernel has associated with the process. Write them out to disk, free the memory, realloc the memory, re-read them from disk, and try to run the process again. That could take another 2-3 weeks.
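Before writing any kernel code, it may help to see what a process image consists of from user space. The sketch below is an assumption about how one might start, not part of the plan above: it parses the Linux /proc/<pid>/maps listing (fields: address range, permissions, offset, device, inode, optional path) and picks out the writable regions, on the guess that those are the ones a checkpoint would have to copy out, while read-only file-backed regions could be mapped again from their files. The helper names parse_maps_line and writable_regions are made up for illustration.

```python
def parse_maps_line(line):
    """Split one /proc/<pid>/maps line into the fields a checkpointer needs."""
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],
        "offset": int(fields[2], 16),
        # Anonymous mappings have no path column at all.
        "path": fields[5].strip() if len(fields) > 5 else "",
    }

def writable_regions(pid="self"):
    """List the writable mappings of a process (Linux only)."""
    with open(f"/proc/{pid}/maps") as maps:
        regions = [parse_maps_line(line) for line in maps]
    return [r for r in regions if "w" in r["perms"]]
```

For instance, parse_maps_line("00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat") yields a region starting at 0x400000 and ending at 0x40b000 with permissions "r-xp". This covers only the memory layout; the kernel-side buffers and structures mentioned above are not visible this way.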

The above should get you 80% there, and, for simple applications, even 100% there. You could, at this point, save to disk, reboot the machine, and reload the process as if nothing happened.

After this, though, the going gets tougher. You have to think about how to handle named pipes, Unix-domain sockets, shared memory, semaphores, signals, devices and TCP/IP sockets. Pipes are the most common form of interprocess communication, and are important to get right: it's a pipe that connects the stdout of one process to the stdin of the next. Try suspending a process, and then reviving it with its pipes in the correct state. Take something that prints "hello world" to stdout every time a key is pressed on stdin, and see if you can get that to function in an unbroken manner.

Next, I suggest dealing with TCP/IP. Since TCP/IP connections cannot be suspended for long periods of time without discombobulating the remote end, I suggest getting the kernel to force a close of those sockets. Most applications do react gracefully to closed sockets, so, when they restart, they'll discover that the sockets are closed, clean up, and then continue chugging along. Processes that merely have a socket in the "listen" state do not need to have that socket closed. These should be able to be frozen and thawed without a problem. Good ones to try might be sendmail, ftpd, telnetd or httpd, since it is easy to test whether they are still working after a re-thaw. They are also good guinea pigs for signals, since kill -HUP usually forces them to re-read some config file.
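The "hello world" guinea pig suggested above for the pipe experiment could look something like the following Python sketch. It is only a test harness under simplifying assumptions: the child is stopped and continued with SIGSTOP and SIGCONT rather than written to disk and reloaded, so it shows the behaviour a real freeze/thaw would have to preserve, not how to preserve it. The names hello_loop and run_experiment are invented for the example.

```python
import os
import signal

def hello_loop(in_fd, out_fd):
    """Guinea pig: write one greeting per byte received, until end of input."""
    while os.read(in_fd, 1):
        os.write(out_fd, b"hello world\n")

def run_experiment():
    """Press a key, freeze the child, press a key while frozen, then thaw."""
    to_child_r, to_child_w = os.pipe()
    from_child_r, from_child_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(to_child_w)
        os.close(from_child_r)
        hello_loop(to_child_r, from_child_w)
        os._exit(0)
    os.close(to_child_r)
    os.close(from_child_w)

    os.write(to_child_w, b"a")
    before = os.read(from_child_r, 64)

    os.kill(pid, signal.SIGSTOP)
    os.waitpid(pid, os.WUNTRACED)   # wait until the child is really stopped
    os.write(to_child_w, b"b")      # this byte sits in the pipe buffer
    os.kill(pid, signal.SIGCONT)
    after = os.read(from_child_r, 64)

    os.close(to_child_w)            # end of input lets the child exit
    os.waitpid(pid, 0)
    os.close(from_child_r)
    return before, after
```

The keystroke sent while the child is stopped is held in the kernel's pipe buffer and is answered once the child continues. A real checkpoint would have to save and restore that buffered data along with the process, which is exactly the part this harness leaves to the kernel.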

Before moving to shared memory and semaphores, I would make sure that the Linux /proc filesystem has been correctly restored for the process in question. Cleaning up lint there will probably lead you to assorted minor bugs in your logic. Ideally, you want the re-thawed process to have the same PID as it had before.

Processes with open devices pose a major conundrum. Most devices (and therefore device drivers) do not allow you to read their current state, save it, and restore it. Many, if not most, have write-only registers, and their current state is heavily dependent on the past history of their state. This is arguably bad hardware design ... the hardware guys have a lot to learn yet, and the vast majority are not at all sensitive to the issues involved. Thus, I would argue to the professor that devices, semaphores and shared memory are extra credit.

Pick a few of the simpler devices, such as IDE disks or CDROM, and see what you can do there. Get a process with an open file that it is reading or writing, and see if you can freeze the process, close the file, reopen the file, and restore the file pointers as you thaw it back out. Next, get that process writing to an unused disk partition. Freeze the process, unmount the disk partition (or CDROM), re-mount the partition, and get the process running again. Be prepared to accidentally corrupt the filesystem. I suggest that all experiments occur on a victim machine, and *not* on the machine on which you code.
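The close, reopen and restore-the-file-pointer step can be tried from user space first. The Python sketch below assumes the simplest possible case, a regular file opened read-only whose path is still valid when it is reopened; the record layout and the names checkpoint_fd, restore_fd and demo are hypothetical.

```python
import os
import tempfile

def checkpoint_fd(fd, path):
    """Record what is needed to reopen the file later: its path and offset."""
    return {"path": path, "offset": os.lseek(fd, 0, os.SEEK_CUR)}

def restore_fd(state, flags=os.O_RDONLY):
    """Reopen the file and move the file pointer back to where it was."""
    fd = os.open(state["path"], flags)
    os.lseek(fd, state["offset"], os.SEEK_SET)
    return fd

def demo():
    """Read part of a file, 'freeze', 'thaw', and read the remainder."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello world")
        path = tmp.name
    try:
        fd = os.open(path, os.O_RDONLY)
        first = os.read(fd, 6)
        state = checkpoint_fd(fd, path)
        os.close(fd)              # the file is closed while "frozen"
        fd = restore_fd(state)    # reopened and repositioned on "thaw"
        rest = os.read(fd, 5)
        os.close(fd)
        return first, rest
    finally:
        os.unlink(path)
```

Note that the new descriptor is not guaranteed to get the same number as the old one, and a file that was renamed or deleted in the meantime cannot be found by path. Those are some of the cases a real implementation would have to handle inside the kernel.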

Next comes the console itself. First, try freezing mgetty, shutting down VGA, restoring VGA, and restarting mgetty. Freezing the X server is a lot harder, because you don't know what state the VGA is in. Furthermore, the X server would need to be hacked to deliver a re-draw event to all windows (including the root window), so that the screen image is restored. We did this at IBM (we allowed the X server to run in virtual terminals) and the X server guys hated it (mostly because it was not supported by the standard off-the-shelf X server, and they had to hack it again for each new release).

By now, you should be most of the way to shutting down a whole system. Freeze the applications, freeze the daemons: inetd, syslogd, klogd, kerneld, kflushd, init. Yikes!

Clearly, I've vastly oversimplified. I did not realize how daunting this could be! I am hardly as knowledgeable in this as I make out to be, so all of the above should be taken with grains of salt, pepper, and a dousing of hot salsa to douse the smell.


Linas Vepstas, June 1997; Radu Duta, October 1997
linas@linas.org

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included at the URL http://www.linas.org/fdl.html, the web page titled "GNU Free Documentation License".
