op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	uml: fix a compile error	WANG Cong	2008-10-20	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	Fix arch/um/sys-i386/signal.c: In function 'copy_sc_from_user': arch/um/sys-i386/signal.c:182: warning: dereferencing 'void *' pointer arch/um/sys-i386/signal.c:182: error: request for member '_fxsr_env' in something not a structure or union Signed-off-by: WANG Cong <wangcong@zeuux.org> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	arch/m68k/bvme6000/rtc.c: remove duplicated include	Huang Weiyi	2008-10-20	1	-1/+0
\| \| \| \| \| \| \| \| \| \|	Removed duplicated include file <linux/smp_lock.h> in arch/m68k/bvme6000/rtc.c. Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	container freezer: implement freezer cgroup subsystem	Matt Helsley	2008-10-20	22	-0/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch implements a new freezer subsystem in the control groups framework. It provides a way to stop and resume execution of all tasks in a cgroup by writing in the cgroup filesystem. The freezer subsystem in the container filesystem defines a file named freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in the cgroup. Reading will return the current state. * Examples of usage : # mkdir /containers/freezer # mount -t cgroup -ofreezer freezer /containers # mkdir /containers/0 # echo $some_pid > /containers/0/tasks to get status of the freezer subsystem : # cat /containers/0/freezer.state RUNNING to freeze all tasks in the container : # echo FROZEN > /containers/0/freezer.state # cat /containers/0/freezer.state FREEZING # cat /containers/0/freezer.state FROZEN to unfreeze all tasks in the container : # echo RUNNING > /containers/0/freezer.state # cat /containers/0/freezer.state RUNNING This is the basic mechanism which should do the right thing for user space task in a simple scenario. It's important to note that freezing can be incomplete. In that case we return EBUSY. This means that some tasks in the cgroup are busy doing something that prevents us from completely freezing the cgroup at this time. After EBUSY, the cgroup will remain partially frozen -- reflected by freezer.state reporting "FREEZING" when read. The state will remain "FREEZING" until one of these things happens: 1) Userspace cancels the freezing operation by writing "RUNNING" to the freezer.state file 2) Userspace retries the freezing operation by writing "FROZEN" to the freezer.state file (writing "FREEZING" is not legal and returns EIO) 3) The tasks that blocked the cgroup from entering the "FROZEN" state disappear from the cgroup's set of tasks. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: export thaw_process] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	container freezer: add TIF_FREEZE flag to all architectures	Matt Helsley	2008-10-20	7	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch series introduces a cgroup subsystem that utilizes the swsusp freezer to freeze a group of tasks. It's immediately useful for batch job management scripts. It should also be useful in the future for implementing container checkpoint/restart. The freezer subsystem in the container filesystem defines a cgroup file named freezer.state. Reading freezer.state will return the current state of the cgroup. Writing "FROZEN" to the state file will freeze all tasks in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in the cgroup. * Examples of usage : # mkdir /containers/freezer # mount -t cgroup -ofreezer freezer /containers # mkdir /containers/0 # echo $some_pid > /containers/0/tasks to get status of the freezer subsystem : # cat /containers/0/freezer.state RUNNING to freeze all tasks in the container : # echo FROZEN > /containers/0/freezer.state # cat /containers/0/freezer.state FREEZING # cat /containers/0/freezer.state FROZEN to unfreeze all tasks in the container : # echo RUNNING > /containers/0/freezer.state # cat /containers/0/freezer.state RUNNING This patch: The first step in making the refrigerator() available to all architectures, even for those without power management. The purpose of such a change is to be able to use the refrigerator() in a new control group subsystem which will implement a control group freezer. [akpm@linux-foundation.org: fix sparc] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Pavel Machek <pavel@suse.cz> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@tuxonice.net> Tested-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: rewrite vmap layer	Nick Piggin	2008-10-20	3	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and provide a fast, scalable percpu frontend for small vmaps (requires a slightly different API, though). The biggest problem with vmap is actually vunmap. Presently this requires a global kernel TLB flush, which on most architectures is a broadcast IPI to all CPUs to flush the cache. This is all done under a global lock. As the number of CPUs increases, so will the number of vunmaps a scaled workload will want to perform, and so will the cost of a global TLB flush. This gives terrible quadratic scalability characteristics. Another problem is that the entire vmap subsystem works under a single lock. It is a rwlock, but it is actually taken for write in all the fast paths, and the read locking would likely never be run concurrently anyway, so it's just pointless. This is a rewrite of vmap subsystem to solve those problems. The existing vmalloc API is implemented on top of the rewritten subsystem. The TLB flushing problem is solved by using lazy TLB unmapping. vmap addresses do not have to be flushed immediately when they are vunmapped, because the kernel will not reuse them again (would be a use-after-free) until they are reallocated. So the addresses aren't allocated again until a subsequent TLB flush. A single TLB flush then can flush multiple vunmaps from each CPU. XEN and PAT and such do not like deferred TLB flushing because they can't always handle multiple aliasing virtual addresses to a physical address. They now call vm_unmap_aliases() in order to flush any deferred mappings. That call is very expensive (well, actually not a lot more expensive than a single vunmap under the old scheme), however it should be OK if not called too often. The virtual memory extent information is stored in an rbtree rather than a linked list to improve the algorithmic scalability. There is a per-CPU allocator for small vmaps, which amortizes or avoids global locking. To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces must be used in place of vmap and vunmap. Vmalloc does not use these interfaces at the moment, so it will not be quite so scalable (although it will use lazy TLB flushing). As a quick test of performance, I ran a test that loops in the kernel, linearly mapping then touching then unmapping 4 pages. Different numbers of tests were run in parallel on an 4 core, 2 socket opteron. Results are in nanoseconds per map+touch+unmap. threads vanilla vmap rewrite 1 14700 2900 2 33600 3000 4 49500 2800 8 70631 2900 So with a 8 cores, the rewritten version is already 25x faster. In a slightly more realistic test (although with an older and less scalable version of the patch), I ripped the not-very-good vunmap batching code out of XFS, and implemented the large buffer mapping with vm_map_ram and vm_unmap_ram... along with a couple of other tricks, I was able to speed up a large directory workload by 20x on a 64 CPU system. I believe vmap/vunmap is actually sped up a lot more than 20x on such a system, but I'm running into other locks now. vmap is pretty well blown off the profiles. Before: 1352059 total 0.1401 798784 _write_lock 8320.6667 <- vmlist_lock 529313 default_idle 1181.5022 15242 smp_call_function 15.8771 <- vmap tlb flushing 2472 __get_vm_area_node 1.9312 <- vmap 1762 remove_vm_area 4.5885 <- vunmap 316 map_vm_area 0.2297 <- vmap 312 kfree 0.1950 300 _spin_lock 3.1250 252 sn_send_IPI_phys 0.4375 <- tlb flushing 238 vmap 0.8264 <- vmap 216 find_lock_page 0.5192 196 find_next_bit 0.3603 136 sn2_send_IPI 0.2024 130 pio_phys_write_mmr 2.0312 118 unmap_kernel_range 0.1229 After: 78406 total 0.0081 40053 default_idle 89.4040 33576 ia64_spinlock_contention 349.7500 1650 _spin_lock 17.1875 319 __reg_op 0.5538 281 _atomic_dec_and_lock 1.0977 153 mutex_unlock 1.5938 123 iget_locked 0.1671 117 xfs_dir_lookup 0.1662 117 dput 0.1406 114 xfs_iget_core 0.0268 92 xfs_da_hashname 0.1917 75 d_alloc 0.0670 68 vmap_page_range 0.0462 <- vmap 58 kmem_cache_alloc 0.0604 57 memset 0.0540 52 rb_next 0.1625 50 __copy_user 0.0208 49 bitmap_find_free_region 0.2188 <- vmap 46 ia64_sn_udelay 0.1106 45 find_inode_fast 0.1406 42 memcmp 0.2188 42 finish_task_switch 0.1094 42 __d_lookup 0.0410 40 radix_tree_lookup_slot 0.1250 37 _spin_unlock_irqrestore 0.3854 36 xfs_bmapi 0.0050 36 kmem_cache_free 0.0256 35 xfs_vn_getattr 0.0322 34 radix_tree_lookup 0.1062 33 __link_path_walk 0.0035 31 xfs_da_do_buf 0.0091 30 _xfs_buf_find 0.0204 28 find_get_page 0.0875 27 xfs_iread 0.0241 27 __strncpy_from_user 0.2812 26 _xfs_buf_initialize 0.0406 24 _xfs_buf_lookup_pages 0.0179 24 vunmap_page_range 0.0250 <- vunmap 23 find_lock_page 0.0799 22 vm_map_ram 0.0087 <- vmap 20 kfree 0.0125 19 put_page 0.0330 18 __kmalloc 0.0176 17 xfs_da_node_lookup_int 0.0086 17 _read_lock 0.0885 17 page_waitqueue 0.0664 vmap has gone from being the top 5 on the profiles and flushing the crap out of all TLBs, to using less than 1% of kernel time. [akpm@linux-foundation.org: cleanups, section fix] [akpm@linux-foundation.org: fix build on alpha] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: cleanup to make remove_memory() arch-neutral	Badari Pulavarty	2008-10-20	3	-45/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is nothing architecture specific about remove_memory(). remove_memory() function is common for all architectures which support hotplug memory remove. Instead of duplicating it in every architecture, collapse them into arch neutral function. [akpm@linux-foundation.org: fix the export] Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Gary Hade <garyhade@us.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Export kmap_atomic_pfn for DRM-GEM.	Eric Anholt	2008-10-18	1	-0/+1
\| \| \| \| \| \| \| \| \|	The driver would like to map IO space directly for copying data in when appropriate, to avoid CPU cache flushing for streaming writes. kmap_atomic_pfn lets us avoid IPIs associated with ioremap for this process. Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Dave Airlie <airlied@redhat.com>
*	Merge branch 'for-linus' of git://www.jni.nu/cris	Linus Torvalds	2008-10-17	13	-4/+6
\|\ \| \| \| \| \| \| \| \| \| \|	* 'for-linus' of git://www.jni.nu/cris: CRIS: proper defconfig setup [CRIS] Rename boot-linkscripts and fix the path to them.
\| *	CRIS: proper defconfig setup	Adrian Bunk	2008-10-13	5	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch moves the cris defconfigs to arch/cris/configs/ where they belong. As a side effect they can now be used directly through e.g. make ARCH=cris artpec_3_defconfig The default defconfig is set through KBUILD_DEFCONFIG. Signed-off-by: Adrian Bunk <adrian.bunk@movial.fi> Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
\| *	[CRIS] Rename boot-linkscripts and fix the path to them.	Jesper Nilsson	2008-07-14	8	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This makes the CRIS-port directories follow the same naming convention as the rest of the kernel. Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
* \|	Merge branch 'kvm-updates/2.6.28' of ↵	Linus Torvalds	2008-10-16	42	-1083/+2465
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm * 'kvm-updates/2.6.28' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (134 commits) KVM: ia64: Add intel iommu support for guests. KVM: ia64: add directed mmio range support for kvm guests KVM: ia64: Make pmt table be able to hold physical mmio entries. KVM: Move irqchip_in_kernel() from ioapic.h to irq.h KVM: Separate irq ack notification out of arch/x86/kvm/irq.c KVM: Change is_mmio_pfn to kvm_is_mmio_pfn, and make it common for all archs KVM: Move device assignment logic to common code KVM: Device Assignment: Move vtd.c from arch/x86/kvm/ to virt/kvm/ KVM: VMX: enable invlpg exiting if EPT is disabled KVM: x86: Silence various LAPIC-related host kernel messages KVM: Device Assignment: Map mmio pages into VT-d page table KVM: PIC: enhance IPI avoidance KVM: MMU: add "oos_shadow" parameter to disable oos KVM: MMU: speed up mmu_unsync_walk KVM: MMU: out of sync shadow core KVM: MMU: mmu_convert_notrap helper KVM: MMU: awareness of new kvm_mmu_zap_page behaviour KVM: MMU: mmu_parent_walk KVM: x86: trap invlpg KVM: MMU: sync roots on mmu reload ...
\| * \|	KVM: ia64: Add intel iommu support for guests.	Xiantao Zhang	2008-10-15	2	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With intel iommu hardware, we can assign devices to kvm/ia64 guests. Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: ia64: add directed mmio range support for kvm guests	Xiantao Zhang	2008-10-15	4	-22/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Using vt-d, kvm guests can be assigned physcial devices, so this patch introduce a new mmio type (directed mmio) to handle its mmio access. Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: ia64: Make pmt table be able to hold physical mmio entries.	Xiantao Zhang	2008-10-15	1	-6/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Don't try to do put_page once the entries are mmio. Set the tag to indicate the mmio space for vmm setting TLB's memory attribute. Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: Move irqchip_in_kernel() from ioapic.h to irq.h	Xiantao Zhang	2008-10-15	2	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Moving irqchip_in_kernel() from ioapic.h to irq.h. Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: Separate irq ack notification out of arch/x86/kvm/irq.c	Xiantao Zhang	2008-10-15	6	-48/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Moving irq ack notification logic as common, and make it shared with ia64 side. Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: Move device assignment logic to common code	Xiantao Zhang	2008-10-15	1	-255/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To share with other archs, this patch moves device assignment logic to common parts. Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: Device Assignment: Move vtd.c from arch/x86/kvm/ to virt/kvm/	Zhang xiantao	2008-10-15	2	-194/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Preparation for kvm/ia64 VT-d support. Signed-off-by: Zhang xiantao <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: VMX: enable invlpg exiting if EPT is disabled	Marcelo Tosatti	2008-10-15	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Manually disabling EPT via module option fails to re-enable INVLPG exiting. Reported-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: x86: Silence various LAPIC-related host kernel messages	Jan Kiszka	2008-10-15	2	-11/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	KVM-x86 dumps a lot of debug messages that have no meaning for normal operation: - INIT de-assertion is ignored - SIPIs are sent and received - APIC writes are unaligned or < 4 byte long (Windows Server 2003 triggers this on SMP) Degrade them to true debug messages, keeping the host kernel log clean for real problems. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: Device Assignment: Map mmio pages into VT-d page table	Weidong Han	2008-10-15	1	-18/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Assigned device could DMA to mmio pages, so also need to map mmio pages into VT-d page table. Signed-off-by: Weidong Han <weidong.han@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: PIC: enhance IPI avoidance	Marcelo Tosatti	2008-10-15	3	-2/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The PIC code makes little effort to avoid kvm_vcpu_kick(), resulting in unnecessary guest exits in some conditions. For example, if the timer interrupt is routed through the IOAPIC, IRR for IRQ 0 will get set but not cleared, since the APIC is handling the acks. This means that everytime an interrupt < 16 is triggered, the priority logic will find IRQ0 pending and send an IPI to vcpu0 (in case IRQ0 is not masked, which is Linux's case). Introduce a new variable isr_ack to represent the IRQ's for which the guest has been signalled / cleared the ISR. Use it to avoid more than one IPI per trigger-ack cycle, in addition to the avoidance when ISR is set in get_priority(). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: add "oos_shadow" parameter to disable oos	Marcelo Tosatti	2008-10-15	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Subject says it all. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: speed up mmu_unsync_walk	Marcelo Tosatti	2008-10-15	1	-12/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Cache the unsynced children information in a per-page bitmap. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: out of sync shadow core	Marcelo Tosatti	2008-10-15	3	-18/+197
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Allow guest pagetables to go out of sync. Instead of emulating write accesses to guest pagetables, or unshadowing them, we un-write-protect the page table and allow the guest to modify it at will. We rely on invlpg executions to synchronize individual ptes, and will synchronize the entire pagetable on tlb flushes. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: mmu_convert_notrap helper	Marcelo Tosatti	2008-10-15	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Need to convert shadow_notrap_nonpresent -> shadow_trap_nonpresent when unsyncing pages. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: awareness of new kvm_mmu_zap_page behaviour	Marcelo Tosatti	2008-10-15	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	kvm_mmu_zap_page will soon zap the unsynced children of a page. Restart list walk in such case. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: mmu_parent_walk	Marcelo Tosatti	2008-10-15	1	-0/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce a function to walk all parents of a given page, invoking a handler. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: x86: trap invlpg	Marcelo Tosatti	2008-10-15	5	-5/+71
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With pages out of sync invlpg needs to be trapped. For now simply nuke the entry. Untested on AMD. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: sync roots on mmu reload	Marcelo Tosatti	2008-10-15	2	-0/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: mode specific sync_page	Marcelo Tosatti	2008-10-15	2	-0/+64
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Examine guest pagetable and bring the shadow back in sync. Caller is responsible for local TLB flush before re-entering guest mode. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: do not write-protect large mappings	Marcelo Tosatti	2008-10-15	1	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is not much point in write protecting large mappings. This can only happen when a page is shadowed during the window between is_largepage_backed and mmu_lock acquision. Zap the entry instead, so the next pagefault will find a shadowed page via is_largepage_backed and fallback to 4k translations. Simplifies out of sync shadow. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: move local TLB flush to mmu_set_spte	Marcelo Tosatti	2008-10-15	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since the sync page path can collapse flushes. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: split mmu_set_spte	Marcelo Tosatti	2008-10-15	1	-44/+57
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Split the spte entry creation code into a new set_spte function. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: flush remote TLBs on large->normal entry overwrite	Marcelo Tosatti	2008-10-15	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is necessary to flush all TLB's when a large spte entry is overwritten with a normal page directory pointer. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	x86: pvclock: fix shadowed variable warning	Harvey Harrison	2008-10-15	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	arch/x86/kernel/pvclock.c:102:6: warning: symbol 'tsc_khz' shadows an earlier one include/asm/tsc.h:18:21: originally declared here Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: ia64: add support for Tukwila processors	Xiantao Zhang	2008-10-15	1	-21/+91
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In Tukwila processor, VT-i has been enhanced in its implementation, it is often called VT-i2 techonology. With VTi-2 support, virtulization performance should be improved. In this patch, we added the related stuff to support kvm/ia64 for Tukwila processors. Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: ia64: Implement a uniform vps interface	Xiantao Zhang	2008-10-15	4	-51/+84
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	An uniform entry kvm_vps_entry is added for vps_sync_write/read, vps_resume_handler/guest, and branches to differnt PAL service according to the offset. Singed-off-by: Anthony Xu <anthony.xu@intel.com> Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: don't enter guest after SIPI was received by a CPU	Gleb Natapov	2008-10-15	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The vcpu should process pending SIPI message before entering guest mode again. kvm_arch_vcpu_runnable() returns true if the vcpu is in SIPI state, so we can't call it here. Signed-off-by: Gleb Natapov <gleb@qumranet.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: x86.c make kvm_load_realmode_segment static	Harvey Harrison	2008-10-15	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Noticed by sparse: arch/x86/kvm/x86.c:3591:5: warning: symbol 'kvm_load_realmode_segment' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: switch to get_user_pages_fast	Marcelo Tosatti	2008-10-15	5	-33/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Convert gfn_to_pfn to use get_user_pages_fast, which can do lockless pagetable lookups on x86. Kernel compilation on 4-way guest is 3.7% faster on VMX. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: Device Assignment: Free device structures if IRQ allocation fails	Amit Shah	2008-10-15	1	-41/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When an IRQ allocation fails, we free up the device structures and disable the device so that we can unregister the device in the userspace and not expose it to the guest at all. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: Device Assignment with VT-d	Ben-Ami Yassour	2008-10-15	3	-0/+215
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Based on a patch by: Kay, Allen M <allen.m.kay@intel.com> This patch enables PCI device assignment based on VT-d support. When a device is assigned to the guest, the guest memory is pinned and the mapping is updated in the VT-d IOMMU. [Amit: Expose KVM_CAP_IOMMU so we can check if an IOMMU is present and also control enable/disable from userspace] Signed-off-by: Kay, Allen M <allen.m.kay@intel.com> Signed-off-by: Weidong Han <weidong.han@intel.com> Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com> Signed-off-by: Amit Shah <amit.shah@qumranet.com> Acked-by: Mark Gross <mgross@linux.intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: x86 emulator: Use DstAcc for 'and'	Guillaume Thouvenin	2008-10-15	1	-19/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For instruction 'and al,imm' we use DstAcc instead of doing the emulation directly into the instruction's opcode. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: x86 emulator: Add cmp al, imm and cmp ax, imm instructions (ocodes 3c, 3d)	Guillaume Thouvenin	2008-10-15	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add decode entries for these opcodes; execution is already implemented. Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: x86 emulator: Add DstAcc operand type	Guillaume Thouvenin	2008-10-15	1	-16/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add DstAcc operand type. That means that there are 4 bits now for DstMask. "In the good old days cpus would have only one register that was able to fully participate in arithmetic operations, typically called A for Accumulator. The x86 retains this tradition by having special, shorter encodings for the A register (like the cmp opcode), and even some instructions that only operate on A (like mul). SrcAcc and DstAcc would accommodate these instructions by decoding A into the corresponding 'struct operand'." -- Avi Kivity Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	x86: Move FEATURE_CONTROL bits to msr-index.h	Sheng Yang	2008-10-15	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For MSR_IA32_FEATURE_CONTROL is already there. Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: VMX: Rename IA32_FEATURE_CONTROL bits	Sheng Yang	2008-10-15	2	-11/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: x86 emulator: fix jmp r/m64 instruction	Avi Kivity	2008-10-15	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	jmp r/m64 doesn't require the rex.w prefix to indicate the operand size is 64 bits. Set the Stack attribute (even though it doesn't involve the stack, really) to indicate this. Signed-off-by: Avi Kivity <avi@qumranet.com>
\| * \|	KVM: VMX: Cleanup stalled INTR_INFO read	Jan Kiszka	2008-10-15	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 1c0f4f5011829dac96347b5f84ba37c2252e1e08 left a useless access of VM_ENTRY_INTR_INFO_FIELD in vmx_intr_assist behind. Clean this up. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@qumranet.com>