summaryrefslogtreecommitdiffstats
path: root/include/linux/kvm_host.h
Commit message (Collapse)AuthorAgeFilesLines
* KVM: kvm_io_bus_unregister_dev() should never failDavid Hildenbrand2017-03-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | No caller currently checks the return value of kvm_io_bus_unregister_dev(). This is evil, as all callers silently go on freeing their device. A stale reference will remain in the io_bus, getting at least used again, when the iobus gets teared down on kvm_destroy_vm() - leading to use after free errors. There is nothing the callers could do, except retrying over and over again. So let's simply remove the bus altogether, print an error and make sure no one can access this broken bus again (returning -ENOMEM on any attempt to access it). Fixes: e93f8a0f821e ("KVM: convert io_bus to SRCU") Cc: stable@vger.kernel.org # 3.4+ Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: convert kvm.users_count from atomic_t to refcount_tElena Reshetova2017-03-011-1/+2
| | | | | | | | | | | | | | refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* KVM: x86: remove code for lazy FPU handlingPaolo Bonzini2017-02-171-1/+0
| | | | | | | | The FPU is always active now when running KVM. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: Support vCPU-based gfn->hva cacheCao, Lei2017-02-161-8/+8
| | | | | | | | | | | | | Provide versions of struct gfn_to_hva_cache functions that take vcpu as a parameter instead of struct kvm. The existing functions are not needed anymore, so delete them. This allows dirty pages to be logged in the vcpu dirty ring, instead of the global dirty ring, for ring-based dirty memory tracking. Signed-off-by: Lei Cao <lei.cao@stratus.com> Message-Id: <CY1PR08MB19929BD2AC47A291FD680E83F04F0@CY1PR08MB1992.namprd08.prod.outlook.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* arm/arm64: KVM: Get rid of KVM_MEMSLOT_INCOHERENTMarc Zyngier2017-01-301-1/+0
| | | | | | | | KVM_MEMSLOT_INCOHERENT is not used anymore, as we've killed its only use in the arm/arm64 MMU code. Let's remove the last artifacts. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2016-12-131-0/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: "Small release, the most interesting stuff is x86 nested virt improvements. x86: - userspace can now hide nested VMX features from guests - nested VMX can now run Hyper-V in a guest - support for AVX512_4VNNIW and AVX512_FMAPS in KVM - infrastructure support for virtual Intel GPUs. PPC: - support for KVM guests on POWER9 - improved support for interrupt polling - optimizations and cleanups. s390: - two small optimizations, more stuff is in flight and will be in 4.11. ARM: - support for the GICv3 ITS on 32bit platforms" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits) arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest KVM: arm/arm64: timer: Check for properly initialized timer on init KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs KVM: x86: Handle the kthread worker using the new API KVM: nVMX: invvpid handling improvements KVM: nVMX: check host CR3 on vmentry and vmexit KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry KVM: nVMX: propagate errors from prepare_vmcs02 KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation KVM: nVMX: support restore of VMX capability MSRs KVM: nVMX: generate non-true VMX MSRs based on true versions KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs. KVM: x86: Add kvm_skip_emulated_instruction and use it. KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12 KVM: VMX: Reorder some skip_emulated_instruction calls KVM: x86: Add a return value to kvm_emulate_cpuid KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h ...
| * Merge branch 'kvm-ppc-next' of ↵Radim Krčmář2016-11-291-0/+4
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc PPC KVM update for 4.10: * Support for KVM guests on POWER9 using the hashed page table MMU. * Updates and improvements to the halt-polling support on PPC, from Suraj Jitindar Singh. * An optimization to speed up emulated MMIO, from Yongji Xie. * Various other minor cleanups.
| | * KVM: Export kvm module parameter variablesSuraj Jitindar Singh2016-11-281-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kvm module has the parameters halt_poll_ns, halt_poll_ns_grow, and halt_poll_ns_shrink. Halt polling was recently added to the powerpc kvm-hv module and these parameters were essentially duplicated for that. There is no benefit to this duplication and it can lead to confusion when trying to tune halt polling. Thus move the definition of these variables to kvm_host.h and export them. This will allow the kvm-hv module to use the same module parameters by accessing these variables, which will be implemented in the next patch, meaning that they will no longer be duplicated. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | kvm: x86: don't print warning messages for unimplemented msrsBandan Das2016-11-221-0/+6
| |/ | | | | | | | | | | | | | | | | | | | | | | Change unimplemented msrs messages to use pr_debug. If CONFIG_DYNAMIC_DEBUG is set, then these messages can be enabled at run time or else -DDEBUG can be used at compile time to enable them. These messages will still be printed if ignore_msrs=1. Signed-off-by: Bandan Das <bsd@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* | Merge branch 'x86-fpu-for-linus' of ↵Linus Torvalds2016-12-121-1/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FPU updates from Ingo Molnar: "The main changes in this cycle were: - do a large round of simplifications after all CPUs do 'eager' FPU context switching in v4.9: remove CR0 twiddling, remove leftover eager/lazy bts, etc (Andy Lutomirski) - more FPU code simplifications: remove struct fpu::counter, clarify nomenclature, remove unnecessary arguments/functions and better structure the code (Rik van Riel)" * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu: Remove clts() x86/fpu: Remove stts() x86/fpu: Handle #NM without FPU emulation as an error x86/fpu, lguest: Remove CR0.TS support x86/fpu, kvm: Remove host CR0.TS manipulation x86/fpu: Remove irq_ts_save() and irq_ts_restore() x86/fpu: Stop saving and restoring CR0.TS in fpu__init_check_bugs() x86/fpu: Get rid of two redundant clts() calls x86/fpu: Finish excising 'eagerfpu' x86/fpu: Split old_fpu & new_fpu handling into separate functions x86/fpu: Remove 'cpu' argument from __cpu_invalidate_fpregs_state() x86/fpu: Split old & new FPU code paths x86/fpu: Remove __fpregs_(de)activate() x86/fpu: Rename lazy restore functions to "register state valid" x86/fpu, kvm: Remove KVM vcpu->fpu_counter x86/fpu: Remove struct fpu::counter x86/fpu: Remove use_eager_fpu() x86/fpu: Remove the XFEATURE_MASK_EAGER/LAZY distinction x86/fpu: Hard-disable lazy FPU mode x86/crypto, x86/fpu: Remove X86_FEATURE_EAGER_FPU #ifdef from the crc32c code
| * \ Merge tag 'v4.9-rc1' into x86/fpu, to resolve conflictIngo Molnar2016-10-161-0/+4
| |\ \ | | |/ | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | x86/fpu, kvm: Remove KVM vcpu->fpu_counterRik van Riel2016-10-071-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the removal of the lazy FPU code, this field is no longer used. Get rid of it. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pbonzini@redhat.com Link: http://lkml.kernel.org/r/1475627678-20788-7-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | kvm: Introduce kvm_write_guest_offset_cached()Pan Xinhui2016-11-221-0/+2
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It allows us to update some status or field of a structure partially. We can also save a kvm_read_guest_cached() call if we just update one fild of the struct regardless of its current value. Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: David.Laight@ACULAB.COM Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: benh@kernel.crashing.org Cc: boqun.feng@gmail.com Cc: borntraeger@de.ibm.com Cc: bsingharora@gmail.com Cc: dave@stgolabs.net Cc: jgross@suse.com Cc: kernellwp@gmail.com Cc: konrad.wilk@oracle.com Cc: linuxppc-dev@lists.ozlabs.org Cc: mpe@ellerman.id.au Cc: paulmck@linux.vnet.ibm.com Cc: paulus@samba.org Cc: rkrcmar@redhat.com Cc: virtualization@lists.linux-foundation.org Cc: will.deacon@arm.com Cc: xen-devel-request@lists.xenproject.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-8-git-send-email-xinhui.pan@linux.vnet.ibm.com [ Typo fixes. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | kvm: create per-vcpu dirs in debugfsLuiz Capitulino2016-09-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds the ability for archs to export per-vcpu information via a new per-vcpu dir in the VM's debugfs directory. If kvm_arch_has_vcpu_debugfs() returns true, then KVM will create a vcpu dir for each vCPU in the VM's debugfs directory. Then kvm_arch_create_vcpu_debugfs() is responsible for populating each vcpu directory with arch specific entries. The per-vcpu path in debugfs will look like: /sys/kernel/debug/kvm/29162-10/vcpu0 /sys/kernel/debug/kvm/29162-10/vcpu1 This is all arch specific for now because the only user of this interface (x86) wants to export x86-specific per-vcpu information to user-space. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | kvm: add stubs for arch specific debugfs supportLuiz Capitulino2016-09-161-0/+3
|/ | | | | | | | | | | | | | | | | Two stubs are added: o kvm_arch_has_vcpu_debugfs(): must return true if the arch supports creating debugfs entries in the vcpu debugfs dir (which will be implemented by the next commit) o kvm_arch_create_vcpu_debugfs(): code that creates debugfs entries in the vcpu debugfs dir For x86, this commit introduces a new file to avoid growing arch/x86/kvm/x86.c even more. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: Protect device ops->create and list_add with kvm->lockChristoffer Dall2016-08-121-0/+6
| | | | | | | | | | | | | | | | | | | | KVM devices were manipulating list data structures without any form of synchronization, and some implementations of the create operations also suffered from a lack of synchronization. Now when we've split the xics create operation into create and init, we can hold the kvm->lock mutex while calling the create operation and when manipulating the devices list. The error path in the generic code gets slightly ugly because we have to take the mutex again and delete the device from the list, but holding the mutex during anon_inode_getfd or releasing/locking the mutex in the common non-error path seemed wrong. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* KVM: PPC: Move xics_debugfs_init out of createChristoffer Dall2016-08-121-0/+6
| | | | | | | | | | | | | | As we are about to hold the kvm->lock during the create operation on KVM devices, we should move the call to xics_debugfs_init into its own function, since holding a mutex over extended amounts of time might not be a good idea. Introduce an init operation on the kvm_device_ops struct which cannot fail and call this, if configured, after the device has been created. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* Merge tag 'kvm-arm-for-4.8-take2' of ↵Paolo Bonzini2016-08-041-3/+9
|\ | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/ARM Changes for v4.8 - Take 2 Includes GSI routing support to go along with the new VGIC and a small fix that has been cooking in -next for a while.
| * KVM: arm/arm64: Enable MSI routingEric Auger2016-07-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Up to now, only irqchip routing entries could be set. This patch adds the capability to insert MSI routing entries. For ARM64, let's also increase KVM_MAX_IRQ_ROUTES to 4096: this include SPI irqchip routes plus MSI routes. In the future this might be extended. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| * KVM: Move kvm_setup_default/empty_irq_routing declaration in arch specific ↵Eric Auger2016-07-221-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | header kvm_setup_default_irq_routing and kvm_setup_empty_irq_routing are not used by generic code. So let's move the declarations in x86 irq.h header instead of kvm_host.h. Signed-off-by: Eric Auger <eric.auger@redhat.com> Suggested-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| * KVM: Add devid in kvm_kernel_irq_routing_entryEric Auger2016-07-221-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | Enhance kvm_kernel_irq_routing_entry to transport the device id field, devid. A new flags field makes possible to indicate the devid is valid. Those additions are used for ARM GICv3 ITS MSI injection. The original struct msi_msg msi field is replaced by a new custom structure that embeds the new fields. Signed-off-by: Eric Auger <eric.auger@redhat.com> Suggested-by: Radim Krčmář <rkrcmar@redhat.com> Acked-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
* | Merge tag 'kvm-arm-for-4.8' of ↵Radim Krčmář2016-07-221-0/+2
|\ \ | |/ | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into next KVM/ARM changes for Linux 4.8 - GICv3 ITS emulation - Simpler idmap management that fixes potential TLB conflicts - Honor the kernel protection in HYP mode - Removal of the old vgic implementation
| * KVM: kvm_io_bus: Add kvm_io_bus_get_dev() callAndre Przywara2016-07-181-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kvm_io_bus framework is a nice place of holding information about various MMIO regions for kernel emulated devices. Add a call to retrieve the kvm_io_device structure which is associated with a certain MMIO address. This avoids to duplicate kvm_io_bus' knowledge of MMIO regions without having to fake MMIO calls if a user needs the device a certain MMIO address belongs to. This will be used by the ITS emulation to get the associated ITS device when someone triggers an MSI via an ioctl from userspace. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
* | KVM: pass struct kvm to kvm_set_routing_entryRadim Krčmář2016-07-141-1/+2
| | | | | | | | | | | | | | Arch-specific code will use it. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: remove kvm_guest_enter/exit wrappersPaolo Bonzini2016-07-011-22/+0
| | | | | | | | | | | | | | | | | | | | | | | | Use the functions from context_tracking.h directly. Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | context_tracking: move rcu_virt_note_context_switch out of kvm_host.hPaolo Bonzini2016-06-281-21/+4
|/ | | | | | | | | | | | | | Make kvm_guest_{enter,exit} and __kvm_guest_{enter,exit} trivial wrappers around the code in context_tracking.h. Name the context_tracking.h functions consistently with what those for kernel<->user switch. Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: remove kvm_vcpu_compatiblePaolo Bonzini2016-06-161-6/+0
| | | | | | | | | | | The new created_vcpus field makes it possible to avoid the race between irqchip and VCPU creation in a much nicer way; just check under kvm->lock whether a VCPU has already been created. We can then remove KVM_APIC_ARCHITECTURE too, because at this point the symbol is only governing the default definition of kvm_vcpu_compatible. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: introduce kvm->created_vcpusPaolo Bonzini2016-06-161-0/+8
| | | | | | | | | | | | | | | The race between creating the irqchip and the first VCPU is currently fixed by checking the presence of an irqchip before updating kvm->online_vcpus, and undoing the whole VCPU creation if someone created the irqchip in the meanwhile. Instead, introduce a new field in struct kvm that will count VCPUs under a mutex, without the atomic access and memory ordering that we need elsewhere to protect the vcpus array. This also plugs the race and is more easily applicable in all similar circumstances. Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: Create debugfs dir and stat files for each VMJanosch Frank2016-05-251-0/+7
| | | | | | | | | | | | | | | This patch adds a kvm debugfs subdirectory for each VM, which is named after its pid and file descriptor. The directories contain the same kind of files that are already in the kvm debugfs directory, but the data exported through them is now VM specific. This makes the debugfs kvm data a convenient alternative to the tracepoints which already have per VM data. The debugfs data is easy to read and low overhead. CC: Dan Carpenter <dan.carpenter@oracle.com> [includes fixes by Dan Carpenter] Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: split kvm_vcpu_wake_up from kvm_vcpu_kickRadim Krčmář2016-05-181-0/+1
| | | | | | | | | AVIC has a use for kvm_vcpu_wake_up. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: halt_polling: provide a way to qualify wakeups during pollChristian Borntraeger2016-05-131-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some wakeups should not be considered a sucessful poll. For example on s390 I/O interrupts are usually floating, which means that _ALL_ CPUs would be considered runnable - letting all vCPUs poll all the time for transactional like workload, even if one vCPU would be enough. This can result in huge CPU usage for large guests. This patch lets architectures provide a way to qualify wakeups if they should be considered a good/bad wakeups in regard to polls. For s390 the implementation will fence of halt polling for anything but known good, single vCPU events. The s390 implementation for floating interrupts does a wakeup for one vCPU, but the interrupt will be delivered by whatever CPU checks first for a pending interrupt. We prefer the woken up CPU by marking the poll of this CPU as "good" poll. This code will also mark several other wakeup reasons like IPI or expired timers as "good". This will of course also mark some events as not sucessful. As KVM on z runs always as a 2nd level hypervisor, we prefer to not poll, unless we are really sure, though. This patch successfully limits the CPU usage for cases like uperf 1byte transactional ping pong workload or wakeup heavy workload like OLTP while still providing a proper speedup. This also introduced a new vcpu stat "halt_poll_no_tuning" that marks wakeups that are considered not good for polling. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version) Cc: David Matlack <dmatlack@google.com> Cc: Wanpeng Li <kernellwp@gmail.com> [Rename config symbol. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: Conditionally register IRQ bypass consumerAlex Williamson2016-05-111-0/+1
| | | | | | | | | | If we don't support a mechanism for bypassing IRQs, don't register as a consumer. This eliminates meaningless dev_info()s when the connect fails between producer and consumer, such as on AMD systems where kvm_x86_ops->update_pi_irte is not implemented Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: introduce KVM_MAX_VCPU_IDGreg Kurz2016-05-111-0/+4
| | | | | | | | | | | | | | | | | | | | | | | The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and also the upper limit for vCPU ids. This is okay for all archs except PowerPC which can have higher ids, depending on the cpu/core/thread topology. In the worst case (single threaded guest, host with 8 threads per core), it limits the maximum number of vCPUS to KVM_MAX_VCPUS / 8. This patch separates the vCPU numbering from the total number of vCPUs, with the introduction of KVM_MAX_VCPU_ID, as the maximal valid value for vCPU ids plus one. The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids before passing them to KVM_CREATE_VCPU. This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC. Other archs continue to return KVM_MAX_VCPUS instead. Suggested-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: remove NULL return path for vcpu ids >= KVM_MAX_VCPUSGreg Kurz2016-05-111-3/+4
| | | | | | | | | | | | | | | Commit c896939f7cff ("KVM: use heuristic for fast VCPU lookup by id") added a return path that prevents vcpu ids to exceed KVM_MAX_VCPUS. This is a problem for powerpc where vcpu ids can grow up to 8*KVM_MAX_VCPUS. This patch simply reverses the logic so that we only try fast path if the vcpu id can be tried as an index in kvm->vcpus[]. The slow path is not affected by the change. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: add missing memory barrier in kvm_{make,check}_requestPaolo Bonzini2016-04-201-0/+11
| | | | | | | | | | | kvm_make_request and kvm_check_request imply a producer-consumer relationship; add implicit memory barriers to them. There was indeed already a place that was adding an explicit smp_mb() to order between kvm_check_request and the processing of the request. That memory barrier can be removed (as an added benefit, kvm_check_request can use smp_mb__after_atomic which is free on x86). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: Use simple waitqueue for vcpu->wqMarcelo Tosatti2016-02-251-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem: On -rt, an emulated LAPIC timer instances has the following path: 1) hard interrupt 2) ksoftirqd is scheduled 3) ksoftirqd wakes up vcpu thread 4) vcpu thread is scheduled This extra context switch introduces unnecessary latency in the LAPIC path for a KVM guest. The solution: Allow waking up vcpu thread from hardirq context, thus avoiding the need for ksoftirqd to be scheduled. Normal waitqueues make use of spinlocks, which on -RT are sleepable locks. Therefore, waking up a waitqueue waiter involves locking a sleeping lock, which is not allowed from hard interrupt context. cyclictest command line: This patch reduces the average latency in my tests from 14us to 11us. Daniel writes: Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency benchmark on mainline. The test was run 1000 times on tip/sched/core 4.4.0-rc8-01134-g0905f04: ./x86-run x86/tscdeadline_latency.flat -cpu host with idle=poll. The test seems not to deliver really stable numbers though most of them are smaller. Paolo write: "Anything above ~10000 cycles means that the host went to C1 or lower---the number means more or less nothing in that case. The mean shows an improvement indeed." Before: min max mean std count 1000.000000 1000.000000 1000.000000 1000.000000 mean 5162.596000 2019270.084000 5824.491541 20681.645558 std 75.431231 622607.723969 89.575700 6492.272062 min 4466.000000 23928.000000 5537.926500 585.864966 25% 5163.000000 1613252.750000 5790.132275 16683.745433 50% 5175.000000 2281919.000000 5834.654000 23151.990026 75% 5190.000000 2382865.750000 5861.412950 24148.206168 max 5228.000000 4175158.000000 6254.827300 46481.048691 After min max mean std count 1000.000000 1000.00000 1000.000000 1000.000000 mean 5143.511000 2076886.10300 5813.312474 21207.357565 std 77.668322 610413.09583 86.541500 6331.915127 min 4427.000000 25103.00000 5529.756600 559.187707 25% 5148.000000 1691272.75000 5784.889825 17473.518244 50% 5160.000000 2308328.50000 5832.025000 23464.837068 75% 5172.000000 2393037.75000 5853.177675 24223.969976 max 5222.000000 3922458.00000 6186.720500 42520.379830 [Patch was originaly based on the swait implementation found in the -rt tree. Daniel ported it to mainline's version and gathered the benchmark numbers for tscdeadline_latency test.] Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: linux-rt-users@vger.kernel.org Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* kvm: rename pfn_t to kvm_pfn_tDan Williams2016-01-151-18/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To date, we have implemented two I/O usage models for persistent memory, PMEM (a persistent "ram disk") and DAX (mmap persistent memory into userspace). This series adds a third, DAX-GUP, that allows DAX mappings to be the target of direct-i/o. It allows userspace to coordinate DMA/RDMA from/to persistent memory. The implementation leverages the ZONE_DEVICE mm-zone that went into 4.3-rc1 (also discussed at kernel summit) to flag pages that are owned and dynamically mapped by a device driver. The pmem driver, after mapping a persistent memory range into the system memmap via devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus page-backed pmem-pfns via flags in the new pfn_t type. The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the resulting pte(s) inserted into the process page tables with a new _PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys off _PAGE_DEVMAP to pin the device hosting the page range active. Finally, get_page() and put_page() are modified to take references against the device driver established page mapping. Finally, this need for "struct page" for persistent memory requires memory capacity to store the memmap array. Given the memmap array for a large pool of persistent may exhaust available DRAM introduce a mechanism to allocate the memmap from persistent memory. The new "struct vmem_altmap *" parameter to devm_memremap_pages() enables arch_add_memory() to use reserved pmem capacity rather than the page allocator. This patch (of 18): The core has developed a need for a "pfn_t" type [1]. Move the existing pfn_t in KVM to kvm_pfn_t [2]. [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* KVM: move architecture-dependent requests to arch/Paolo Bonzini2016-01-081-41/+2
| | | | | | | | | Since the numbers now overlap, it makes sense to enumerate them in asm/kvm_host.h rather than linux/kvm_host.h. Functions that refer to architecture-specific requests are also moved to arch/. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: renumber vcpu->request bitsPaolo Bonzini2016-01-081-30/+29
| | | | | | | Leave room for 4 more arch-independent requests. Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: document which architecture uses each request bitPaolo Bonzini2016-01-081-7/+13
| | | | | Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: Remove unused KVM_REQ_KICK to save a bit in vcpu->requestsPaolo Bonzini2016-01-081-1/+1
| | | | | | | | | Suggested-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> [Takuya moved all subsequent constants to fill the void, but that is useless in view of the following patches. So this change looks nothing like the original. - Paolo] Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: Dump guest rIP when the guest tried something unsupportedBorislav Petkov2015-12-161-1/+2
| | | | | | | | | | | | It looks like this in action: kvm [5197]: vcpu0, guest rIP: 0xffffffff810187ba unhandled rdmsr: 0xc001102 and helps to pinpoint quickly where in the guest we did the unsupported thing. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm/x86: Hyper-V SynIC timersAndrey Smetanin2015-12-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per Hyper-V specification (and as required by Hyper-V-aware guests), SynIC provides 4 per-vCPU timers. Each timer is programmed via a pair of MSRs, and signals expiration by delivering a special format message to the configured SynIC message slot and triggering the corresponding synthetic interrupt. Note: as implemented by this patch, all periodic timers are "lazy" (i.e. if the vCPU wasn't scheduled for more than the timer period the timer events are lost), regardless of the corresponding configuration MSR. If deemed necessary, the "catch up" mode (the timer period is shortened until the timer catches up) will be implemented later. Changes v2: * Use remainder to calculate periodic timer expiration time Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Haiyang Zhang <haiyangz@microsoft.com> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm/x86: Hyper-V SynIC message slot pending clearing at SINT ackAndrey Smetanin2015-12-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | The SynIC message protocol mandates that the message slot is claimed by atomically setting message type to something other than HVMSG_NONE. If another message is to be delivered while the slot is still busy, message pending flag is asserted to indicate to the guest that the hypervisor wants to be notified when the slot is released. To make sure the protocol works regardless of where the message sources are (kernel or userspace), clear the pending flag on SINT ACK notification, and let the message sources compete for the slot again. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Haiyang Zhang <haiyangz@microsoft.com> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: Remove unnecessary debugfs dentry referencesJanosch Frank2015-11-301-1/+0
| | | | | | | | | | | | | | | KVM creates debugfs files to export VM statistics to userland. To be able to remove them on kvm exit it tracks the files' dentries. Since their parent directory is also tracked and since each parent direntry knows its children we can easily remove them by using debugfs_remove_recursive(kvm_debugfs_dir). Therefore we don't need the extra tracking in the kvm_stats_debugfs_item anymore. Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com> Reviewed-By: Sascha Silbe <silbe@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
* KVM: use heuristic for fast VCPU lookup by idDavid Hildenbrand2015-11-301-0/+5
| | | | | | | | | | | Usually, VCPU ids match the array index. So let's try a fast lookup first before falling back to the slow iteration. Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
* KVM: kvm_is_visible_gfn can be booleanYaowei Bai2015-11-251-1/+1
| | | | | | | | | | This patch makes kvm_is_visible_gfn return bool due to this particular function only using either one or zero as its return value. No functional change. Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm/x86: Hyper-V kvm exitAndrey Smetanin2015-11-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | A new vcpu exit is introduced to notify the userspace of the changes in Hyper-V SynIC configuration triggered by guest writing to the corresponding MSRs. Changes v4: * exit into userspace only if guest writes into SynIC MSR's Changes v3: * added KVM_EXIT_HYPERV types and structs notes into docs Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm/x86: Hyper-V synthetic interrupt controllerAndrey Smetanin2015-11-251-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SynIC (synthetic interrupt controller) is a lapic extension, which is controlled via MSRs and maintains for each vCPU - 16 synthetic interrupt "lines" (SINT's); each can be configured to trigger a specific interrupt vector optionally with auto-EOI semantics - a message page in the guest memory with 16 256-byte per-SINT message slots - an event flag page in the guest memory with 16 2048-bit per-SINT event flag areas The host triggers a SINT whenever it delivers a new message to the corresponding slot or flips an event flag bit in the corresponding area. The guest informs the host that it can try delivering a message by explicitly asserting EOI in lapic or writing to End-Of-Message (EOM) MSR. The userspace (qemu) triggers interrupts and receives EOM notifications via irqfd with resampler; for that, a GSI is allocated for each configured SINT, and irq_routing api is extended to support GSI-SINT mapping. Changes v4: * added activation of SynIC by vcpu KVM_ENABLE_CAP * added per SynIC active flag * added deactivation of APICv upon SynIC activation Changes v3: * added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into docs Changes v2: * do not use posted interrupts for Hyper-V SynIC AutoEOI vectors * add Hyper-V SynIC vectors into EOI exit bitmap * Hyper-V SyniIC SINT msr write logic simplified Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm/irqchip: kvm_arch_irq_routing_update renaming splitAndrey Smetanin2015-11-251-2/+3
| | | | | | | | | | | | | | | | | | | | | Actually kvm_arch_irq_routing_update() should be kvm_arch_post_irq_routing_update() as it's called at the end of irq routing update. This renaming frees kvm_arch_irq_routing_update function name. kvm_arch_irq_routing_update() weak function which will be used to update mappings for arch-specific irq routing entries (in particular, the upcoming Hyper-V synthetic interrupts). Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
OpenPOWER on IntegriCloud