This book offers an exceptionally up-to-date, in-depth, and broad-based exploration of the latest advances in UNIX-based operating systems. Focusing on the design and implementation of the operating system itself — not on the applications and tools that run on it -- this book compares and analyzes the alternatives offered by several important UNIX variants, and covers several advanced subjects, such as multi-processors and threads. Compares several important UNIX variants —highlighting the issues and alternative solutions for various operating system components. Describes advanced technologies such as multiprocessor and multithreaded systems, log- structured file systems, and modern memory architecture. This text offers an exceptionally up-to-date, in-depth, and broad-based exploration of the latest advances in UNIX-based operating systems. Convert currency.
|Published (Last):||21 December 2004|
|PDF File Size:||1.9 Mb|
|ePub File Size:||11.54 Mb|
|Price:||Free* [*Free Regsitration Required]|
Subscribers to LWN. If you appreciate our content, please buy a subscription and make the next set of articles possible. September 3, Back in , I landed my then dream job as a full-time Linux kernel developer and distribution maintainer for a small embedded systems company. I was thrilled - and horrified. I'd only been working as a programmer for a couple of years and I was sure it was only a matter of time before my new employer figured out they'd hired an idiot.
The only solution was to learn more about operating systems, and quickly. So I pulled out my favorite operating systems textbook and read and re-read it obsessively over the course of the next year. It worked well enough that my company tried very hard to convince me not to quit when I got bored with my "dream job" and left to work at Sun. UNIX Internals is a careful, detailed examination of multiple UNIX implementations as they evolved over time, from the perspective of both the academic theorist and the practical kernel developer.
What makes this book particularly valuable to the practicing operating systems developer is that the review of each operating systems concept - say, processes and threads - is accompanied by descriptions of specific implementations and their histories - say, threading in Solaris, Mach, and Digital UNIX.
Each implementation is then compared on a number of practical levels, including performance, effect on programming interfaces, portability, and long-term maintenance burden - factors that Linux developers care passionately about, but are seldom considered in the academic operating systems literature.
UNIX Internals was published in A valid question is whether a book on the implementation details of UNIX operating systems published so long ago is still useful today. For example, Linux is only mentioned briefly in the introduction, and many of the UNIX variants described are now defunct. It is true that UNIX Internals holds relatively little value for the developer actively staying up to date with the latest research and development in a particular area.
However, my personal experience has been that many of the problems facing today's Linux developers are described in this book - and so are many of the proposed solutions, complete with the unsolved implementation problems. More importantly, the analysis is often detailed enough that it describes exactly the changes needed to improve the technique, if only anyone took the time to implement them.
The chapter on file system implementations shows how lessons learned from the oldest and most basic file system implementations can be useful when solving the latest and hottest file system design problems. The kernel memory allocator KMA is one of the most performance-critical kernel subsystems. A poor KMA implementation will hurt performance in every code path that needs to allocate or free memory.
Worse, it will fragment and waste precious kernel memory - memory that can't be easily freed or paged out - and pollute hardware caches with instructions and data used for allocation management. Vahalia begins with a short conceptual description of kernel memory allocation and then immediately dives into practical implementation, starting with page-level allocation in BSD.
Next, he describes memory allocation in the very earliest UNIX systems: a collection of fixed-size tables for structures like inodes and process table entries, occasional "borrowing" of blocks from the buffer cache, and a few subsystem-specific ad hoc allocators.
This primitive approach required a great deal of tuning, wasted a lot of memory, and made the system fragile. What constitutes a good KMA? After a quick review of the functional requirements, Vahalia lays out the criteria he'll use to judge the allocators: low waste fragmentation , good performance, simple interface appropriate for many different users, good alignment, efficient under changing workloads, reassignment of memory allocated for one buffer size to another, and integration with the paging system.
He also takes into consideration more subtle points, such as the cache and TLB footprint of the KMA's code, along with cache and lock contention in multi-processor systems. The charms of the resource map allocator include simplicity and allocation of exactly the size requested; the vices include high fragmentation and poor performance under nearly every workload. Even this allocation algorithm is useful under the right circumstances; Vahalia describes several subsystems that still use it System V semaphore allocation and management of free space in directory blocks on some systems and some minor tweaks that improve the algorithm.
One tweak to the resource map allocator keeps the description of each free region in the first few bytes of the region, a technique later used in the state-of-the-art SLUB allocator in the Linux kernel.
This is an example of how even the oldest and clunkiest algorithms can influence the design of the latest and greatest. Each following KMA is discussed in terms of the problems it solves from previous allocators, along with the problems it introduces. The accompanying diagrams of the data structures and buffers used to implement each allocator are particularly helpful in understanding the structure of the allocators.
Mach's zone allocator is the only fully garbage-collected KMA studied, with the concomitant unpredictable system-wide performance slowdowns during garbage collection, which would strike it from most developers' lists of useful KMAs.
But as with the resource map allocator, we still have lessons to learn from the zone allocator. The zone allocator creates a "zone" of memory reserved for each class of object allocated e. Pages are allocated to a zone as needed, up to a limit set at zone allocation time. Objects are packed tightly within each zone, even across pages, for very low internal fragmentation.
Anonymous power-of-two zones are also available. Each zone has its own free list and once a zone is set up, allocation and freeing simply add and remove items from the per-zone free list free list structures are also allocated from a zone.
Memory is reclaimed on a per-page basis by the garbage collector, which runs as part of the swapper task. It uses a two-pass algorithm: the first pass counts up the number of free objects in each page, and the second pass frees empty pages. Overall, the zone allocator was a major improvement on previous KMAs: fast, space efficient, and easy to use, marred only by the inefficient and unpredictable garbage collection algorithm.
The next KMA on the list is the hierarchical memory allocator for Dynix, which ran on the highly parallel Sequent S One of the major designers and implementers is our own Paul McKenney, familiar to many LWN readers as the progenitor of the read-copy-update RCU system used in many places in the Linux kernel. The goal of the Dynix allocator was efficient parallel memory allocation, in particular avoiding lock contention between processors. The solution was to create several layers in the memory allocation system, with per-cpu caches at the bottom and collections of large free segments at the top.
As memory is freed or allocated, regions move up and down one level of the hierarchy in batches. For example, each per-cpu cache has two free lists, one in active use and the other in reserve. When the active list runs out of free buffers, the free buffers from the reserve list are moved onto it, and the reserve list replenishes itself with buffers from the global list.
All the work requiring synchronization between multiple CPUs happens in one big transaction, rather than incurring synchronization overhead on each buffer allocation. Its memory reclamation system was far more efficient than the zone allocator's, performed on an on-going basis with bounded worst case performance on each operation.
Performance on SMP systems was unparalleled. Like the zone allocator, SLAB allocates per-object caches along with anonymous caches in useful sizes called kmem caches. Each cache has an associated optional constructor and destructor function run on the objects in a newly allocated and newly freed page, respectively though the destructor has since been removed in the Linux allocator. Each cache is a doubly-linked list of slabs - large contiguous chunks of memory.
Each slab keeps its slab data structure at the end of the slab, and divides the rest of the space into objects. Any leftover free space in the slab is divided between the beginning and end of the objects in order to vary the offset of objects with respect to the CPU cache, improving cache utilization in other words, cache coloring.
Each object has an associated 4-byte free list pointer. The slabs within each kmem cache are in a doubly linked list, sorted so that free slabs are located at one end, fully allocated slabs at the other, and partially allocated slabs in the middle.
Allocations always come from partially allocated slabs before touching free slabs. Freeing an object is simple: since slabs are always the same size and alignment, the base address of the slab can be calculated from the address of the object being freed.
This address is used to find the slab on the doubly linked list. Free counts are maintained on an on-going basis. When memory pressure occurs, the slab allocator walks the kmem caches freeing the free slabs at the end of the cache's slab list.
Slabs for larger objects are organized differently, with the slab management structure allocated separately and additional buffer management data included. As Vahalia notes, the SLAB allocator initially lacked optimizations for multi-processor systems, but these were added shortly afterward, using many of the same techniques as the Dynix hierarchical allocator.
Since then, most production kernel memory allocators have been SLAB-based. When viewed in isolation, the SLAB allocator may appear arbitrary or over-complex; when viewed in the context of previous memory allocators and their problems, the motivation behind each design decision is intuitive and clear. Despite the intervening years, these four chapters are the most comprehensive and practical description of file systems design and implementation I have yet seen.
The chapter on file systems implementations is too packed with useful detail to review fully in this article, so I'll focus on the points that are relevant to current hot file system design problems.
This chapter also covers the implementation of buffer caches, inode caches, directory entry caches, etc. One of the features of this chapter as elsewhere in the book is the carefully chosen bibliography. Bibliographies in research papers serve a double purpose as demonstrations of the authors' breadth of knowledge in the area and tend to be cluttered with more marginal references; the per-chapter bibliographies in UNIX Internals list only the most relevant publications and make excellent supplementary reading guides.
The on-disk layout consisted of a boot block followed by a superblock followed by a single monolithic inode table. The remainder of the disk is used for data and indirect blocks.
The inode free list is partial; when no more free inodes are on the list, it is replenished by scanning the inode table. Free blocks are tracked in a singly linked list rooted in the superblock - a truly terrifying design from the point of view of file system repair, especially given the lack of backup superblocks.
On the other hand, elements of the s5fs design have come back into vogue, often without addressing the inherent drawbacks still unsolved in the intervening decades. This turned out to be a key performance problem for s5fs, as every uncached file read virtually guaranteed a disk seek of non-trivial magnitude between the location of the metadata at the beginning of the disk and the file data, located anywhere except the beginning of the disk.
One of the major advances of FFS was to distribute inodes and bitmaps evenly across the disk and allocate associated file data and indirect blocks nearby.
Recently, collecting metadata in one place has returned as a way to optimize file system check and repair time as well as other metadata-intensive operations. It also appears in designs that keep metadata on a separate high-performance device usually solid state storage.
The problems with these schemes are the same as the first time around. For the fsck optimization case, most normal workloads will suffer from the required seek for reads of file data from uncached inodes in particular, system boot time would suffer greatly.
In the separate metadata device case, the problem of keeping a single, easily-corrupted copy of important metadata returns. Currently, most solid-state storage is less reliable than disk, yet most proposals to move file system metadata to solid state storage make no provision for backup copies on disk. Another cutting edge file system design issue first encountered in s5fs is backup, restore, and general manipulation of sparse files.
System administrators quickly discovered that it was possible to create a user-level backup that could not be restored because the tools would attempt to actually write and allocate the zero-filled unallocated portions of sparse files. Even more intelligent tools that do not explicitly write zero-filled portions of files still had to pointlessly copy pages of zeroes out of the kernel when reading sparse files.
The chapters on file systems are definitely frustratingly out of date, especially with regard to advances in on-disk file system design. You'll find little or no discussion of copy-on-write file systems, extents, btrees, or file system repair outside of the context of non-journaled file systems.
Unfortunately, I can't offer much in the way of a follow-up reading list; most of the papers in my file systems reading list are covered in this book exceptions include the papers on soft updates, WAFL, and XFS. File systems developers seem to publish less often than they used to; often the options for learning about the cutting edge are reading the code, browsing the project wiki, and attending presentations from the developers.
Your next opportunity for the latter is the Linux Plumbers Conference , which has a number of file system-related talks. If you want to learn more about UNIX networking design and implementation, this is the wrong book; buy some of the Stevens and Comer networking books instead. UNIX Internals was the original inspiration for the Kernel Hacker's Bookshelf series, simply because you could always find it on the bookshelf of every serious kernel hacker I knew.
As the age of the book is its most serious weakness, I originally intended to wait until the planned second edition was released before reviewing it.
The Kernel Hacker's Bookshelf: UNIX Internals
Subscribers to LWN. If you appreciate our content, please buy a subscription and make the next set of articles possible. September 3, Back in , I landed my then dream job as a full-time Linux kernel developer and distribution maintainer for a small embedded systems company.
ISBN 13: 9780131019089