From 5ead97c84fa7d63a6a7a2f4e9f18f452bd109045 Mon Sep 17 00:00:00 2001
From: Jeremy Fitzhardinge
Date: Tue, 17 Jul 2007 18:37:04 -0700
Subject: xen: Core Xen implementation

This patch is a rollup of all the core pieces of the Xen
implementation, including:
 - booting and setup
 - pagetable setup
 - privileged instructions
 - segmentation
 - interrupt flags
 - upcalls
 - multicall batching

BOOTING AND SETUP

The vmlinux image is decorated with ELF notes which tell the Xen
domain builder what the kernel's requirements are; the domain builder
then constructs the address space accordingly and starts the kernel.

Xen has its own entrypoint for the kernel (contained in an ELF note).
The ELF notes are set up by xen-head.S, which is included into head.S.
In principle it could be linked separately, but it seems to provoke
lots of binutils bugs.

Because the domain builder starts the kernel in a fairly sane state
(32-bit protected mode, paging enabled, flat segments set up), there's
not a lot of setup needed before starting the kernel proper. The main
steps are:
  1. Install the Xen paravirt_ops, which is simply a matter of a
     structure assignment.
  2. Set init_mm to use the Xen-supplied pagetables (analogous to the
     head.S generated pagetables in a native boot).
  3. Reserve address space for Xen, since it takes a chunk at the top
     of the address space for its own use.
  4. Call start_kernel()

PAGETABLE SETUP

Once we hit the main kernel boot sequence, it will end up calling back
via paravirt_ops to set up various pieces of Xen specific state. One
of the critical things which requires a bit of extra care is the
construction of the initial init_mm pagetable. Because Xen places
tight constraints on pagetables (an active pagetable must always be
valid, and must always be mapped read-only to the guest domain), we
need to be careful when constructing the new pagetable to keep these
constraints in mind.
It turns out that the easiest way to do this is to use the initial
Xen-provided pagetable as a template, and then just insert new
mappings for memory where a mapping doesn't already exist.

This means that pagetable setup uses a special version of xen_set_pte
which ignores any attempt to remap a read-only page as read-write
(since Xen will map its own initial pagetable as RO), but lets other
changes to the ptes happen, so that things like NX are set properly.

PRIVILEGED INSTRUCTIONS AND SEGMENTATION

When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
This means that it is more privileged than user-mode in ring 3, but it
still can't run privileged instructions directly. Non-performance
critical instructions are dealt with by taking a privilege exception,
trapping into the hypervisor, and emulating the instruction there, but
more performance-critical instructions have their own specific
paravirt_ops. In many cases we can avoid having to do any hypercalls
for these instructions, or the Xen implementation is quite different
from the normal native version.

The privileged instructions fall into the broad classes of:

  Segmentation: setting up the GDT and the GDT entries, LDT, TLS and
  so on. Xen doesn't allow the GDT to be directly modified; all GDT
  updates are done via hypercalls where the new entries can be
  validated. This is important because Xen uses segment limits to
  prevent the guest kernel from damaging the hypervisor itself.

  Traps and exceptions: Xen uses a special format for trap
  entrypoints, so when the kernel wants to set an IDT entry, it needs
  to be converted to the form Xen expects. Xen sets int 0x80 up
  specially so that the trap goes straight from userspace into the
  guest kernel without going via the hypervisor. sysenter isn't
  supported.

  Kernel stack: The esp0 entry is extracted from the tss and provided
  to Xen.

  TLB operations: the various TLB calls are mapped into corresponding
  Xen hypercalls.
  Control registers: all the control registers are privileged. The
  most important is cr3, which points to the base of the current
  pagetable, and we handle it specially.

Another instruction we treat specially is CPUID, even though it's not
privileged. We want to control what CPU features are visible to the
rest of the kernel, and so CPUID ends up going into a paravirt_op.
Xen implements this mainly to disable the ACPI and APIC subsystems.

INTERRUPT FLAGS

Xen maintains its own separate flag for masking events, which is
contained within the per-cpu vcpu_info structure. Because the guest
kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
ignored (and must be, because even if a guest domain disables
interrupts for itself, it can't disable them overall).

(A note on terminology: "events" and interrupts are effectively
synonymous. However, rather than using an "enable flag", Xen uses a
"mask flag", which blocks event delivery when it is non-zero.)

There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
are implemented to manage the Xen event mask state. The only thing
worth noting is that when events are unmasked, we need to explicitly
see if there's a pending event and call into the hypervisor to make
sure it gets delivered.

UPCALLS

Xen needs a couple of upcall (or callback) functions to be implemented
by each guest. One is the event upcall, which is how events
(interrupts, effectively) are delivered to the guests. The other is
the failsafe callback, which is used to report errors in either
reloading a segment register, or caused by iret. These are
implemented in i386/kernel/entry.S so they can jump into the normal
iret_exc path when necessary.

MULTICALL BATCHING

Xen provides a multicall mechanism, which allows multiple hypercalls
to be issued at once in order to mitigate the cost of trapping into
the hypervisor.
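
As a rough illustration of the batching idea only (not the code in
this patch), queued hypercalls can be collected in a buffer and handed
over in one trap. Every name below (mc_entry, mc_buffer, mc_queue,
mc_flush, fake_multicall, MC_BATCH) is hypothetical, and the trap is
simulated by a counter rather than a real hypercall:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one queued hypercall: opcode plus arguments. */
struct mc_entry {
	unsigned long op;
	unsigned long args[6];
};

#define MC_BATCH 8	/* illustrative buffer size, not taken from the patch */

struct mc_buffer {
	struct mc_entry entries[MC_BATCH];
	size_t count;	/* entries currently queued */
	unsigned traps;	/* simulated traps into the hypervisor */
};

/* Stand-in for the real multicall hypercall: one trap covers the batch. */
static void fake_multicall(struct mc_buffer *b)
{
	b->traps++;
}

/* Issue everything queued so far as a single multicall. */
void mc_flush(struct mc_buffer *b)
{
	if (b->count == 0)
		return;
	fake_multicall(b);
	b->count = 0;
}

/* Queue one hypercall, flushing first if the buffer is already full. */
void mc_queue(struct mc_buffer *b, unsigned long op, unsigned long arg)
{
	if (b->count == MC_BATCH)
		mc_flush(b);
	assert(b->count < MC_BATCH);
	b->entries[b->count] = (struct mc_entry){ .op = op, .args = { arg } };
	b->count++;
}
```

With this shape, a caller queues each operation as it comes up and
flushes once at the end, so three queued operations cost one simulated
trap instead of three.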
This is particularly useful for context switches, since the 4-5
hypercalls they would normally need (reload cr3, update TLS, maybe
update LDT) can be reduced to one. This patch implements a generic
batching mechanism for hypercalls, which gets used in many places in
the Xen code.

Signed-off-by: Jeremy Fitzhardinge
Signed-off-by: Chris Wright
Cc: Ian Pratt
Cc: Christian Limpach
Cc: Adrian Bunk
---
 arch/i386/xen/Makefile | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 arch/i386/xen/Makefile

(limited to 'arch/i386/xen/Makefile')

diff --git a/arch/i386/xen/Makefile b/arch/i386/xen/Makefile
new file mode 100644
index 0000000..60bc1cfb
--- /dev/null
+++ b/arch/i386/xen/Makefile
@@ -0,0 +1 @@
+obj-y := enlighten.o setup.o features.o multicalls.o
-- 
cgit v1.1

From 3b827c1b3aadf3adb4c602d19863f2d24e7cbc18 Mon Sep 17 00:00:00 2001
From: Jeremy Fitzhardinge
Date: Tue, 17 Jul 2007 18:37:04 -0700
Subject: xen: virtual mmu

Xen pagetable handling, including the machinery to implement direct
pagetables.

Xen presents the real CPU's pagetables directly to guests, with no
added shadowing or other layer of abstraction. Naturally this means
the hypervisor must maintain close control over what the guest can put
into the pagetable.

When the guest modifies the pte/pmd/pgd, it must convert its
domain-specific notion of a "physical" pfn into a global machine frame
number (mfn) before inserting the entry into the pagetable. Xen will
check to make sure the domain is allowed to create a mapping of the
given mfn.

Xen also requires that all mappings the guest has of its own active
pagetable are read-only. This is relatively easy to implement in
Linux because all pagetables share the same pte pages for kernel
mappings, so updating the pte in one pagetable will implicitly update
the mapping in all pagetables.
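
The pfn-to-mfn conversion described above might look roughly like the
sketch below. It is an illustration under stated assumptions, not the
code in mmu.c: the struct p2m lookup table and the pfn_to_mfn and
make_pte helpers are hypothetical, and only the low 12 flag bits are
handled (high bits such as NX are ignored for brevity):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PTE_FLAGS_MASK	((1ULL << PAGE_SHIFT) - 1)

/* Hypothetical per-domain table: index is guest pfn, value is the mfn. */
struct p2m {
	const uint64_t *mfn_of_pfn;
	size_t nr_pages;
};

/* Translate a domain-local "physical" frame into a machine frame. */
uint64_t pfn_to_mfn(const struct p2m *p2m, uint64_t pfn)
{
	assert(pfn < p2m->nr_pages);
	return p2m->mfn_of_pfn[pfn];
}

/* Build a pte value that carries the machine frame, not the guest pfn. */
uint64_t make_pte(const struct p2m *p2m, uint64_t pfn, uint64_t flags)
{
	uint64_t mfn = pfn_to_mfn(p2m, pfn);

	return (mfn << PAGE_SHIFT) | (flags & PTE_FLAGS_MASK);
}
```

The point of the sketch is only the ordering: the translation happens
before the entry is written, so the value the hypervisor validates
already names a machine frame.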

Normally a pagetable becomes active when you point to it with cr3 (or
the Xen equivalent), but when you do so, Xen must check the whole
pagetable for correctness, which is clearly a performance problem.

Xen solves this with pinning, which keeps a pagetable effectively
active even if it's currently unused, which means that all the normal
update rules are enforced. This means that it need not revalidate the
pagetable when loading cr3.

This patch has a first-cut implementation of pinning, but it is more
fully implemented in a later patch.

Signed-off-by: Jeremy Fitzhardinge
Signed-off-by: Chris Wright
---
 arch/i386/xen/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'arch/i386/xen/Makefile')

diff --git a/arch/i386/xen/Makefile b/arch/i386/xen/Makefile
index 60bc1cfb..803c1ee 100644
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1 +1 @@
-obj-y := enlighten.o setup.o features.o multicalls.o
+obj-y := enlighten.o setup.o features.o multicalls.o mmu.o
-- 
cgit v1.1

From e46cdb66c8fc1c8d61cfae0f219ff47ac4b9d531 Mon Sep 17 00:00:00 2001
From: Jeremy Fitzhardinge
Date: Tue, 17 Jul 2007 18:37:05 -0700
Subject: xen: event channels

Xen implements interrupts in terms of event channels. Each guest
domain gets 1024 event channels which can be used for a variety of
purposes, such as Xen timer events, inter-domain events,
inter-processor events (IPI) or for real hardware IRQs.

Within the kernel, we map the event channels to IRQs, and implement
the whole interrupt handling using a Xen irq_chip.

Rather than setting NR_IRQ to 1024 under PARAVIRT in order to
accommodate Xen, we create a dynamic mapping between event channels
and IRQs. Ideally, Linux will eventually move towards dynamically
allocating per-irq structures, and we can use a 1:1 mapping between
event channels and irqs.

Signed-off-by: Jeremy Fitzhardinge
Signed-off-by: Chris Wright
Cc: Ingo Molnar
Cc: Eric W. Biederman
---
 arch/i386/xen/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'arch/i386/xen/Makefile')

diff --git a/arch/i386/xen/Makefile b/arch/i386/xen/Makefile
index 803c1ee..7a78f27 100644
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1 +1,2 @@
-obj-y := enlighten.o setup.o features.o multicalls.o mmu.o
+obj-y := enlighten.o setup.o features.o multicalls.o mmu.o \
+	events.o
-- 
cgit v1.1

From 15c84731d647c34d1491793fa6be96f5de3432eb Mon Sep 17 00:00:00 2001
From: Jeremy Fitzhardinge
Date: Tue, 17 Jul 2007 18:37:05 -0700
Subject: xen: time implementation

Xen maintains a base clock which measures nanoseconds since system
boot. This is provided to guests via a shared page which contains a
base time in ns, a tsc timestamp at that point and tsc frequency
parameters. Guests can compute the current time by reading the tsc
and using it to extrapolate the current time from the basetime. The
hypervisor makes sure that the frequency parameters are updated
regularly, particularly if the tsc changes rate or stops.

This is implemented as a clocksource, so the interface to the rest of
the kernel is a simple clocksource which simply returns the current
time directly in nanoseconds.

Xen also provides a simple timer mechanism, which allows a timeout to
be set in the future. When that time arrives, a timer event is sent
to the guest. There are two timer interfaces:
 - An old one which also delivers a stream of (unused) ticks at 100Hz,
   and on the same event, the actual timer events. The 100Hz ticks
   cause a lot of spurious wakeups, but are basically harmless.
 - The new timer interface doesn't have the 100Hz ticks, and can also
   fail if the specified time is in the past.

This code presents the Xen timer as a clockevent driver, and uses the
new interface by preference.
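
The extrapolation described earlier in this message (base time plus a
scaled tsc delta) can be sketched as below. This is a simplified
illustration, not the code in time.c: struct time_info and its field
names are assumptions modelled loosely on the shared record, the tsc
value is passed in rather than read from hardware, and guarding
against the hypervisor updating the record mid-read is omitted:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, hypothetical view of the shared time record. */
struct time_info {
	uint64_t tsc_timestamp;		/* tsc when system_time was sampled */
	uint64_t system_time;		/* ns since boot at tsc_timestamp */
	uint32_t tsc_to_system_mul;	/* 32.32 fixed-point ns per tick */
	int8_t tsc_shift;		/* shift applied to the tsc delta */
};

/* Convert a tsc delta to ns: shift, then multiply and drop 32 bits. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	uint64_t hi, lo;

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/* (delta * mul) >> 32, split so the low product cannot overflow. */
	hi = (delta >> 32) * mul;
	lo = ((delta & 0xffffffffULL) * mul) >> 32;
	return hi + lo;
}

/* Extrapolate the current time in ns from the base time and the tsc. */
uint64_t now_ns(const struct time_info *t, uint64_t tsc)
{
	assert(tsc >= t->tsc_timestamp);
	return t->system_time +
	       scale_delta(tsc - t->tsc_timestamp,
			   t->tsc_to_system_mul, t->tsc_shift);
}
```

For example, a multiplier of 0x80000000 with a shift of 0 describes
half a nanosecond per tick (a 2GHz tsc), so 2000 ticks past the
timestamp add 1000ns to the base time.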

Signed-off-by: Jeremy Fitzhardinge
Signed-off-by: Chris Wright
Cc: Ingo Molnar
Cc: Thomas Gleixner
---
 arch/i386/xen/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'arch/i386/xen/Makefile')

diff --git a/arch/i386/xen/Makefile b/arch/i386/xen/Makefile
index 7a78f27..bf51cab 100644
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1,2 +1,2 @@
 obj-y := enlighten.o setup.o features.o multicalls.o mmu.o \
-	events.o
+	events.o time.o
-- 
cgit v1.1

From f87e4cac4f4e940b328d3deb5b53e642e3881f43 Mon Sep 17 00:00:00 2001
From: Jeremy Fitzhardinge
Date: Tue, 17 Jul 2007 18:37:06 -0700
Subject: xen: SMP guest support

This is a fairly straightforward Xen implementation of smp_ops.

Xen has its own IPI mechanisms, and has no dependency on any
APIC-based IPI. The smp_ops hooks and the flush_tlb_others pv_op
allow a Xen guest to avoid all APIC code in arch/i386 (the only apic
operation is a single apic_read for the apic version number).

One subtle point which needs to be addressed is unpinning pagetables
when another cpu may have a lazy tlb reference to the pagetable. Xen
will not allow an in-use pagetable to be unpinned, so we must find any
other cpus with a reference to the pagetable and get them to shoot
down their references.

Signed-off-by: Jeremy Fitzhardinge
Signed-off-by: Chris Wright
Cc: Benjamin LaHaise
Cc: Ingo Molnar
Cc: Andi Kleen
---
 arch/i386/xen/Makefile | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'arch/i386/xen/Makefile')

diff --git a/arch/i386/xen/Makefile b/arch/i386/xen/Makefile
index bf51cab..fd05f24 100644
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1,2 +1,4 @@
 obj-y := enlighten.o setup.o features.o multicalls.o mmu.o \
 	events.o time.o
+
+obj-$(CONFIG_SMP) += smp.o
-- 
cgit v1.1

From 3e2b8fbeec8f005672f2a2e862fb9c26a0bafedc Mon Sep 17 00:00:00 2001
From: Jeremy Fitzhardinge
Date: Tue, 17 Jul 2007 18:37:07 -0700
Subject: xen: handle external requests for shutdown, reboot and sysrq

The guest domain can be asked to shut down or reboot itself, or have a
sysrq key injected, via xenbus. This patch adds a watcher for those
events, and takes the appropriate action.

Signed-off-by: Jeremy Fitzhardinge
Cc: Chris Wright
---
 arch/i386/xen/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'arch/i386/xen/Makefile')

diff --git a/arch/i386/xen/Makefile b/arch/i386/xen/Makefile
index fd05f24..7bf2ce3 100644
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1,4 +1,4 @@
 obj-y := enlighten.o setup.o features.o multicalls.o mmu.o \
-	events.o time.o
+	events.o time.o manage.o
 
 obj-$(CONFIG_SMP) += smp.o
-- 
cgit v1.1

From 6487673b8a858f99a5348e1078b3f5aec700f9e0 Mon Sep 17 00:00:00 2001
From: Jeremy Fitzhardinge
Date: Tue, 17 Jul 2007 18:37:07 -0700
Subject: xen: Attempt to patch inline versions of common operations

This patch adds the mechanism to allow us to patch inline versions of
common operations.

The implementations of the direct-access versions save_fl, restore_fl,
irq_enable and irq_disable are now in assembler, and the same code is
used for both out of line and inline uses.

Signed-off-by: Jeremy Fitzhardinge
Cc: Chris Wright
Cc: Keir Fraser
---
 arch/i386/xen/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'arch/i386/xen/Makefile')

diff --git a/arch/i386/xen/Makefile b/arch/i386/xen/Makefile
index 7bf2ce3..343df24 100644
--- a/arch/i386/xen/Makefile
+++ b/arch/i386/xen/Makefile
@@ -1,4 +1,4 @@
 obj-y := enlighten.o setup.o features.o multicalls.o mmu.o \
-	events.o time.o manage.o
+	events.o time.o manage.o xen-asm.o
 
 obj-$(CONFIG_SMP) += smp.o
-- 
cgit v1.1