Xen lets one piece of hardware run multiple operating systems, controlling their access to the hardware through a kind of meta-operating-system called the Hypervisor.
There is nothing new about the idea of virtualization, of course. It was associated with the IBM 360 further back than most of us can remember (in fact, IBM executives have told me they think Linux has a major role to play keeping old 360-series computers going by running in their virtual machines), and it's now making good money for VMWare. (More about them a bit later.)
As with VMWare, the major uses of Xen seem to be server consolidation (which means running several instances of Linux on one piece of hardware, a useful deployment because Linux seems to work best running only one server daemon) and virtual hosting. Speaker Mike D. Day also showed that Xen could be used to deploy Linux quickly to a large array of computers.
Ian Pratt gave a comprehensive overview of Xen's goals and implementation. He defined Xen's main achievements as two-fold (although his talk really focused on the first): isolating different processes in a secure manner, and controlling resources so different Quality of Service options could be offered to different processes.
Pratt then laid out some of the ingenuity that makes Xen more efficient than VMWare or User Mode Linux. For instance, Xen divides page tables among its guest operating systems and gives each guest full control over its page tables, so that the hardware doesn't slow down under the load of two levels of page management (one by the guest and one by the Hypervisor).
There is one necessary exception to Xen's practice of handing full control over paging to a guest: the guest is not allowed to write the pages that contain its page tables. If it could do that, it could give itself access to the other guests' pages. On the other hand, the guest must be able somehow to indicate that it needs a new page. So the Xen team has found some tricks to make it easy for Xen to trap a guest's writes to the page tables, make sure they're legitimate, and let the guest go ahead with the writes.
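To make the trap-and-validate idea concrete, here is a minimal sketch in Python. It is a toy model and not Xen's actual code: the `Hypervisor` class, the `frame_owner` bookkeeping, and the `update_mapping` call are names I made up for illustration. The only point is the ownership check that happens before a page-table write is allowed through.

```python
from dataclasses import dataclass, field


@dataclass
class Hypervisor:
    """Toy model (hypothetical, not Xen's API) of validated page-table updates."""

    # physical frame number -> id of the guest that owns it
    frame_owner: dict = field(default_factory=dict)
    # guest id -> {virtual page number: physical frame number}
    page_tables: dict = field(default_factory=dict)

    def assign_frame(self, guest, frame):
        """Record that a physical frame belongs to a guest."""
        self.frame_owner[frame] = guest

    def update_mapping(self, guest, vpn, frame):
        """Stand-in for the trap taken when a guest writes to its page table.

        The write goes ahead only if the guest owns the frame it wants to map.
        """
        if self.frame_owner.get(frame) != guest:
            raise PermissionError(f"{guest} does not own frame {frame}")
        self.page_tables.setdefault(guest, {})[vpn] = frame


hv = Hypervisor()
hv.assign_frame("guest_a", 100)
hv.assign_frame("guest_b", 200)

hv.update_mapping("guest_a", vpn=1, frame=100)  # legitimate, so it is applied

try:
    hv.update_mapping("guest_a", vpn=2, frame=200)  # frame belongs to guest_b
except PermissionError as err:
    print("rejected:", err)
```

In this model the guest still decides what its address space looks like; the Hypervisor only vetoes mappings that would reach into another guest's memory.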
Another tour de force Pratt illustrated was how Xen eases failover. Whether scheduled or in a panic situation, hardware sometimes has to go down. Xen can make it easier to migrate processes to new systems with minimal downtime.
Xen does this by making a series of pre-copies while the guest system keeps running and updating its state. The first copy takes a long time because it starts from scratch, but each subsequent copy has less state information to transfer and thus takes less and less time. (One weakness of Xen is that it maintains a lot of state and therefore has a lot of information to copy.) The amount of CPU time devoted to the copying can also be titrated to leave plenty of time for the process to continue handling incoming requests. Pratt actually drew applause when he showed the CPU utilization of a highly loaded Apache server during transfer to another node, and added that it was down for only 164 milliseconds during the transfer.
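The shrinking rounds are easy to see in a small simulation. The sketch below is my own simplified model, not Xen's migration code: the function name `simulate_precopy` and all of its parameters are assumptions, and it supposes the guest dirties pages at a constant rate that is lower than the rate at which pages can be copied. It also ignores the CPU throttling Pratt described.

```python
def simulate_precopy(total_pages, bandwidth, dirty_rate, stop_threshold, max_rounds=30):
    """Model iterative pre-copy migration (hypothetical, simplified).

    total_pages    -- pages of guest memory to move
    bandwidth      -- pages copied per second
    dirty_rate     -- pages the running guest dirties per second
    stop_threshold -- once this few pages remain, pause the guest and finish

    Returns (rounds, downtime): the pages sent in each pre-copy round, and
    the seconds the guest is paused while the last dirty pages are sent.
    """
    if dirty_rate >= bandwidth:
        raise ValueError("pre-copy cannot converge if pages dirty faster than they copy")

    to_send = total_pages  # the first round starts from scratch
    rounds = []
    while to_send > stop_threshold and len(rounds) < max_rounds:
        rounds.append(to_send)
        # Pages dirtied while this round was being copied must be sent again.
        to_send = min(total_pages, to_send * dirty_rate // bandwidth)

    downtime = to_send / bandwidth  # final stop-and-copy phase
    return rounds, downtime


rounds, downtime = simulate_precopy(
    total_pages=100_000, bandwidth=50_000, dirty_rate=5_000, stop_threshold=100
)
print(rounds)    # pages sent per round, shrinking each time
print(downtime)  # seconds the guest is paused at the end
```

With these made-up numbers each round sends a tenth as many pages as the one before, so the guest only has to pause for the small remainder. The figures are illustrative and are not meant to reproduce the measurement Pratt showed.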
As I mentioned, Pratt really concentrated on Xen's goal of isolating processes, but the second goal of doling out resources was touched on by Rik van Riel in a BOF that evening. He divided the resources worth tracking into four types: CPU utilization, memory, data I/O, and network I/O. CPU utilization, he said, was easy to track without intruding on the guest operating system. So was I/O. Memory, which a Hypervisor could easily give too little or too much of to a guest operating system, was a much harder nut to crack. He suggested clever ways that patterns of waiting for reads, waiting for writes, and the length of request queues (way-stations for reads and writes) could tell the Hypervisor whether an operating system was underprovisioned or overprovisioned with memory. But more experimentation is needed to see whether these are valid measures.
Xen offers a lot of features already on 32-bit x86 hardware, with 64-bit x86 and AMD coming along too. A number of operating systems can be guests, including Linux, Solaris, FreeBSD, and OpenBSD. An upcoming facility called VT-x should allow Xen to run operating systems that haven't been instrumented for it--so even Windows will someday show up on the list.
VMWare is not passively accepting the limits that have long been assumed on virtualization. They know that if they can break down the barrier between Hypervisor and guest operating system, and learn just a bit about what the operating system is doing (taking a spin lock, for instance, or releasing a disk block), they can achieve fantastic speed-ups in virtualization. An unannounced speaker from VMWare presented some of their innovations at van Riel's BOF, under the name para-virtualization.
Para-virtualization's goal is to blur the barrier between operating system and Hypervisor enough to obtain useful information, while minimizing engineering costs and the risk of breaking the operating system. VMWare knows that Linux developers would have little tolerance for a development process that required them to slow down so VMWare could keep up, and that Linux distributors would push back if VMWare slowed down performance or introduced risk.
Their solution is to introduce a new layer (named VMI) that would cause some 30 to 50 instructions in the operating system to trap into the Hypervisor instead of executing as normal. This is reminiscent of the trap instructions introduced by a debugger into a binary, but would be even less intrusive, requiring no change to the binary of the kernel.
The solution is unique to each processor being emulated, but could apply to any operating system compiled for that processor. The speaker claimed that para-virtualization had been easy to introduce into the Linux development tree and could be maintained as open source with a typical open source development and testing process.
The wide range of sessions on kernel changes--including solutions to improve storage management, such as multipath device access--was part of the evidence that every kernel task (caching, filesystems, etc.) is being examined under a microscope to determine how it can scale better, adapt to future evolution, and shave off waste.
Other Linux deployments received less attention at this symposium. I noticed nothing about interesting but arcane deployments such as robotics or carrier grade (telephony) applications. Unlike the symposium I attended four years ago, this one gave just a nod to the desktop. Instead, the desktop formed the subject of its own two-day conference preceding this one, as reported in a recent blog of mine. A bit more at the symposium was offered on embedded systems. The developers give a lot of attention to power management, which I suspect is done for the benefit of embedded developers, but also benefits desktop users who have laptops.
Concerns about power management have a major impact on support for hyperthreading, as discussed by Suresh Siddha in his talk on Chip Multi Processing. The driving factor is that power consumption is the same on a chip regardless of whether just one thread is active, or both. If optimal performance is your goal, you want processes distributed among all processors, even if only one thread is active on each. But this maximizes power consumption. So if power management is a concern as well, the algorithm must be quite different, and must try to fill the threads of each active chip while leaving some chips idle.
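The two goals lead to opposite placement rules, which a few lines of Python can show. This is only a sketch of the trade-off as I understood it, not the kernel's scheduler: `place_tasks`, `active_chips`, and the policy names "performance" and "power" are hypothetical, and the model assumes every chip has the same number of hardware threads.

```python
def place_tasks(num_tasks, num_chips, threads_per_chip=2, policy="performance"):
    """Return how many tasks land on each chip under a placement policy.

    "performance" spreads tasks across as many chips as possible.
    "power" fills every thread of one chip before waking the next.
    """
    if num_tasks > num_chips * threads_per_chip:
        raise ValueError("more tasks than hardware threads")

    load = [0] * num_chips
    if policy == "performance":
        # Round-robin, so each task gets a chip to itself while any is free.
        for i in range(num_tasks):
            load[i % num_chips] += 1
    elif policy == "power":
        # Pack tasks onto the fewest chips so the rest can stay idle.
        remaining = num_tasks
        for chip in range(num_chips):
            load[chip] = min(threads_per_chip, remaining)
            remaining -= load[chip]
    else:
        raise ValueError(f"unknown policy: {policy}")
    return load


def active_chips(load):
    """Count chips that have at least one task and so draw full power."""
    return sum(1 for tasks in load if tasks > 0)


spread = place_tasks(4, num_chips=4, policy="performance")
packed = place_tasks(4, num_chips=4, policy="power")
print(spread, active_chips(spread))  # [1, 1, 1, 1] 4
print(packed, active_chips(packed))  # [2, 2, 0, 0] 2
```

With four tasks on four dual-threaded chips, spreading keeps all four chips powered, while packing leaves two of them idle at the cost of making tasks share a chip.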
I was told that the 2.6 kernel is much larger than the 2.4 version because developers honored the feature request list of sites running big iron. The losers in this exchange are embedded developers, many of whom insist on sticking with 2.4. The 2.6 version's slow boot-up is particularly detrimental to adoption for embedded systems.
Several talks offered practical advice on the use of debugging and instrumentation tools to make developers more effective.
I attended the Fedora and Gentoo BOFs partly to see whether I could detect any demographic or cultural differences in the attendees, but they seemed pretty comparable. The Fedora BOF was much larger, of course. Gentoo has an impressive following, though; its BOF was led by two IBM employees who say it's gaining adherents among developers at IBM, and someone pointed out that the Mozilla Project runs its servers on Gentoo.
To capture some of Red Hat's and SUSE's followers, the Gentoo developers are considering a slower moving, more stable Enterprise edition, but an Enterprise edition seems like an oxymoron to me for a distribution that is known as the most adventurous, cutting-edge of the popular distributions, and for a team that prides itself on letting each user customize his or her installation.
The detail could sometimes become tiresome, to be sure. I don't think an audience was well-served by a description of a feature that goes field by field or function by function. What made the talks useful were their summaries of a feature's requirements, history, alternative or rejected implementations, and subtle implications of the chosen implementation.
I sensed less concern at this conference about political trends that could have an impact on Linux and open source. I just didn't hear much talk about them. Perhaps the vote of the European Union against patents eased the worries of attendees. New respect in the business community and larger public for open source (note the groundswell of praise for Firefox) and the gradual receding of the SCO case may also contribute to this lull in political hyperalertness--although more threats are likely to arise.
A lot of the folks here are extraordinarily intelligent and capable of extreme levels of dedicated effort. We're lucky they're obsessed with such things as reverse engineering old video games or getting every feature of power management to work on Linux. If one of them set his mind on evil, he could take over the world. (On the other hand, he couldn't be as evil as the people who are taking over the world.)
The deeper theme at this symposium is that open source is constantly being revitalized by the astonishing energy and intelligence of those drawn to it for whatever reason. And it's making inroads in little-known places. I mentioned earlier the reverse engineering of a game that runs on Windows: the hacker discovered along the way that this game uses Ogg Vorbis files for audio and Python scripts to implement many of its rules. It's hard to imagine a computer field without open source--but then, no such field will ever exist.
Earlier blog on this symposium:
Andy Oram is an editor for O'Reilly Media, specializing in Linux and free software books, and a member of Computer Professionals for Social Responsibility. His web site is www.praxagora.com/andyo.
oreillynet.com Copyright © 2006 O'Reilly Media, Inc.