summaryrefslogtreecommitdiffstats
path: root/virt/kvm
Commit message (Collapse)AuthorAgeFilesLines
* KVM: split kvm_vcpu_wake_up from kvm_vcpu_kickRadim Krčmář2016-05-181-6/+13
| | | | | | | | | AVIC has a use for kvm_vcpu_wake_up. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: shrink halt polling even more for invalid wakeupsChristian Borntraeger2016-05-181-3/+4
| | | | | | | | | | | | | | | | | | | commit 3491caf2755e ("KVM: halt_polling: provide a way to qualify wakeups during poll") added more aggressive shrinking of the polling interval if the wakeup did not match some criteria. This still allows to keep polling enabled if the polling time was smaller that the current max poll time (block_ns <= vcpu->halt_poll_ns). Performance measurement shows that even more aggressive shrinking (shrink polling on any invalid wakeup) reduces absolute and relative (to the workload) CPU usage even further. Cc: David Matlack <dmatlack@google.com> Cc: Wanpeng Li <kernellwp@gmail.com> Cc: Radim Krčmář <rkrcmar@redhat.com> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: halt_polling: provide a way to qualify wakeups during pollChristian Borntraeger2016-05-132-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some wakeups should not be considered a sucessful poll. For example on s390 I/O interrupts are usually floating, which means that _ALL_ CPUs would be considered runnable - letting all vCPUs poll all the time for transactional like workload, even if one vCPU would be enough. This can result in huge CPU usage for large guests. This patch lets architectures provide a way to qualify wakeups if they should be considered a good/bad wakeups in regard to polls. For s390 the implementation will fence of halt polling for anything but known good, single vCPU events. The s390 implementation for floating interrupts does a wakeup for one vCPU, but the interrupt will be delivered by whatever CPU checks first for a pending interrupt. We prefer the woken up CPU by marking the poll of this CPU as "good" poll. This code will also mark several other wakeup reasons like IPI or expired timers as "good". This will of course also mark some events as not sucessful. As KVM on z runs always as a 2nd level hypervisor, we prefer to not poll, unless we are really sure, though. This patch successfully limits the CPU usage for cases like uperf 1byte transactional ping pong workload or wakeup heavy workload like OLTP while still providing a proper speedup. This also introduced a new vcpu stat "halt_poll_no_tuning" that marks wakeups that are considered not good for polling. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version) Cc: David Matlack <dmatlack@google.com> Cc: Wanpeng Li <kernellwp@gmail.com> [Rename config symbol. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: Conditionally register IRQ bypass consumerAlex Williamson2016-05-111-8/+10
| | | | | | | | | | If we don't support a mechanism for bypassing IRQs, don't register as a consumer. This eliminates meaningless dev_info()s when the connect fails between producer and consumer, such as on AMD systems where kvm_x86_ops->update_pi_irte is not implemented Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: introduce KVM_MAX_VCPU_IDGreg Kurz2016-05-111-1/+3
| | | | | | | | | | | | | | | | | | | | | | | The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and also the upper limit for vCPU ids. This is okay for all archs except PowerPC which can have higher ids, depending on the cpu/core/thread topology. In the worst case (single threaded guest, host with 8 threads per core), it limits the maximum number of vCPUS to KVM_MAX_VCPUS / 8. This patch separates the vCPU numbering from the total number of vCPUs, with the introduction of KVM_MAX_VCPU_ID, as the maximal valid value for vCPU ids plus one. The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids before passing them to KVM_CREATE_VCPU. This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC. Other archs continue to return KVM_MAX_VCPUS instead. Suggested-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: arm/arm64: vgic: Rely on the GIC driver to parse the firmware tablesJulien Grall2016-05-033-89/+69
| | | | | | | | | | | | | | | | | Currently, the firmware tables are parsed 2 times: once in the GIC drivers, the other time when initializing the vGIC. It means code duplication and make more tedious to add the support for another firmware table (like ACPI). Use the recently introduced helper gic_get_kvm_info() to get information about the virtual GIC. With this change, the virtual GIC becomes agnostic to the firmware table and KVM will be able to initialize the vGIC on ACPI. Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* KVM: arm/arm64: arch_timer: Rely on the arch timer to parse the firmware tablesJulien Grall2016-05-031-29/+11
| | | | | | | | | | | | | The firmware table is currently parsed by the virtual timer code in order to retrieve the virtual timer interrupt. However, this is already done by the arch timer driver. To avoid code duplication, use the newly function arch_timer_get_kvm_info() which return all the information required by the virtual timer code. Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* KVM: arm/arm64: Handle forward time correction gracefullyMarc Zyngier2016-04-061-10/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On a host that runs NTP, corrections can have a direct impact on the background timer that we program on the behalf of a vcpu. In particular, NTP performing a forward correction will result in a timer expiring sooner than expected from a guest point of view. Not a big deal, we kick the vcpu anyway. But on wake-up, the vcpu thread is going to perform a check to find out whether or not it should block. And at that point, the timer check is going to say "timer has not expired yet, go back to sleep". This results in the timer event being lost forever. There are multiple ways to handle this. One would be record that the timer has expired and let kvm_cpu_has_pending_timer return true in that case, but that would be fairly invasive. Another is to check for the "short sleep" condition in the hrtimer callback, and restart the timer for the remaining time when the condition is detected. This patch implements the latter, with a bit of refactoring in order to avoid too much code duplication. Cc: <stable@vger.kernel.org> Reported-by: Alexander Graf <agraf@suse.de> Reviewed-by: Alexander Graf <agraf@suse.de> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* arm64: KVM: Add braces to multi-line if statement in virtual PMU codeWill Deacon2016-04-011-1/+2
| | | | | | | | | | | | | | | | | | | | The kernel is written in C, not python, so we need braces around multi-line if statements. GCC 6 actually warns about this, thanks to the fantastic new "-Wmisleading-indentation" flag: | virt/kvm/arm/pmu.c: In function ‘kvm_pmu_overflow_status’: | virt/kvm/arm/pmu.c:198:3: warning: statement is indented as if it were guarded by... [-Wmisleading-indentation] | reg &= vcpu_sys_reg(vcpu, PMCNTENSET_EL0); | ^~~ | arch/arm64/kvm/../../../virt/kvm/arm/pmu.c:196:2: note: ...this ‘if’ clause, but it is not | if ((vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E)) | ^~ As it turns out, this particular case is harmless (we just do some &= operations with 0), but worth fixing nonetheless. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* KVM: Replace smp_mb() with smp_load_acquire() in the kvm_flush_remote_tlbs()Lan Tianyu2016-03-221-2/+16
| | | | | | | | | smp_load_acquire() is enough here and it's cheaper than smp_mb(). Adding a comment about reusing memory barrier of kvm_make_all_cpus_request() here to keep order between modifications to the page tables and reading mode. Signed-off-by: Lan Tianyu <tianyu.lan@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: Replace smp_mb() with smp_mb_after_atomic() in the ↵Lan Tianyu2016-03-221-2/+2
| | | | | | | kvm_make_all_cpus_request() Signed-off-by: Lan Tianyu <tianyu.lan@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: fix spin_lock_init order on x86Paolo Bonzini2016-03-221-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Moving the initialization earlier is needed in 4.6 because kvm_arch_init_vm is now using mmu_lock, causing lockdep to complain: [ 284.440294] INFO: trying to register non-static key. [ 284.445259] the code is fine but needs lockdep annotation. [ 284.450736] turning off the locking correctness validator. ... [ 284.528318] [<ffffffff810aecc3>] lock_acquire+0xd3/0x240 [ 284.533733] [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm] [ 284.541467] [<ffffffff81715581>] _raw_spin_lock+0x41/0x80 [ 284.546960] [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm] [ 284.554707] [<ffffffffa0305aa0>] kvm_page_track_register_notifier+0x20/0x60 [kvm] [ 284.562281] [<ffffffffa02ece70>] kvm_mmu_init_vm+0x20/0x30 [kvm] [ 284.568381] [<ffffffffa02dbf7a>] kvm_arch_init_vm+0x1ea/0x200 [kvm] [ 284.574740] [<ffffffffa02bff3f>] kvm_dev_ioctl+0xbf/0x4d0 [kvm] However, it also helps fixing a preexisting problem, which is why this patch is also good for stable kernels: kvm_create_vm was incrementing current->mm->mm_count but not decrementing it at the out_err label (in case kvm_init_mmu_notifier failed). The new initialization order makes it possible to add the required mmdrop without adding a new error label. Cc: stable@vger.kernel.org Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Merge branch 'mm-pkeys-for-linus' of ↵Linus Torvalds2016-03-202-6/+12
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 protection key support from Ingo Molnar: "This tree adds support for a new memory protection hardware feature that is available in upcoming Intel CPUs: 'protection keys' (pkeys). There's a background article at LWN.net: https://lwn.net/Articles/643797/ The gist is that protection keys allow the encoding of user-controllable permission masks in the pte. So instead of having a fixed protection mask in the pte (which needs a system call to change and works on a per page basis), the user can map a (handful of) protection mask variants and can change the masks runtime relatively cheaply, without having to change every single page in the affected virtual memory range. This allows the dynamic switching of the protection bits of large amounts of virtual memory, via user-space instructions. It also allows more precise control of MMU permission bits: for example the executable bit is separate from the read bit (see more about that below). This tree adds the MM infrastructure and low level x86 glue needed for that, plus it adds a high level API to make use of protection keys - if a user-space application calls: mmap(..., PROT_EXEC); or mprotect(ptr, sz, PROT_EXEC); (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice this special case, and will set a special protection key on this memory range. It also sets the appropriate bits in the Protection Keys User Rights (PKRU) register so that the memory becomes unreadable and unwritable. So using protection keys the kernel is able to implement 'true' PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies PROT_READ as well. Unreadable executable mappings have security advantages: they cannot be read via information leaks to figure out ASLR details, nor can they be scanned for ROP gadgets - and they cannot be used by exploits for data purposes either. We know about no user-space code that relies on pure PROT_EXEC mappings today, but binary loaders could start making use of this new feature to map binaries and libraries in a more secure fashion. There is other pending pkeys work that offers more high level system call APIs to manage protection keys - but those are not part of this pull request. Right now there's a Kconfig that controls this feature (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled (like most x86 CPU feature enablement code that has no runtime overhead), but it's not user-configurable at the moment. If there's any serious problem with this then we can make it configurable and/or flip the default" * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/mm/pkeys: Fix mismerge of protection keys CPUID bits mm/pkeys: Fix siginfo ABI breakage caused by new u64 field x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA mm/core, x86/mm/pkeys: Add execute-only protection keys support x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags x86/mm/pkeys: Allow kernel to modify user pkey rights register x86/fpu: Allow setting of XSAVE state x86/mm: Factor out LDT init from context init mm/core, x86/mm/pkeys: Add arch_validate_pkey() mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits() x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU x86/mm/pkeys: Add Kconfig prompt to existing config option x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps x86/mm/pkeys: Dump PKRU with other kernel registers mm/core, x86/mm/pkeys: Differentiate instruction fetches x86/mm/pkeys: Optimize fault handling in access_error() mm/core: Do not enforce PKEY permissions on remote mm access um, pkeys: Add UML arch_*_access_permitted() methods mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys x86/mm/gup: Simplify get_user_pages() PTE bit handling ...
| * mm/gup: Switch all callers of get_user_pages() to not pass tsk/mmDave Hansen2016-02-161-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We will soon modify the vanilla get_user_pages() so it can no longer be used on mm/tasks other than 'current/current->mm', which is by far the most common way it is called. For now, we allow the old-style calls, but warn when they are used. (implemented in previous patch) This patch switches all callers of: get_user_pages() get_user_pages_unlocked() get_user_pages_locked() to stop passing tsk/mm so they will no longer see the warnings. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: jack@suse.cz Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * mm/gup: Introduce get_user_pages_remote()Dave Hansen2016-02-161-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For protection keys, we need to understand whether protections should be enforced in software or not. In general, we enforce protections when working on our own task, but not when on others. We call these "current" and "remote" operations. This patch introduces a new get_user_pages() variant: get_user_pages_remote() Which is a replacement for when get_user_pages() is called on non-current tsk/mm. We also introduce a new gup flag: FOLL_REMOTE which can be used for the "__" gup variants to get this new behavior. The uprobes is_trap_at_addr() location holds mmap_sem and calls get_user_pages(current->mm) on an instruction address. This makes it a pretty unique gup caller. Being an instruction access and also really originating from the kernel (vs. the app), I opted to consider this a 'remote' access where protection keys will not be enforced. Without protection keys, this patch should not change any behavior. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: jack@suse.cz Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2016-03-169-27/+850
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: "One of the largest releases for KVM... Hardly any generic changes, but lots of architecture-specific updates. ARM: - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems - PMU support for guests - 32bit world switch rewritten in C - various optimizations to the vgic save/restore code. PPC: - enabled KVM-VFIO integration ("VFIO device") - optimizations to speed up IPIs between vcpus - in-kernel handling of IOMMU hypercalls - support for dynamic DMA windows (DDW). s390: - provide the floating point registers via sync regs; - separated instruction vs. data accesses - dirty log improvements for huge guests - bugfixes and documentation improvements. x86: - Hyper-V VMBus hypercall userspace exit - alternative implementation of lowest-priority interrupts using vector hashing (for better VT-d posted interrupt support) - fixed guest debugging with nested virtualizations - improved interrupt tracking in the in-kernel IOAPIC - generic infrastructure for tracking writes to guest memory - currently its only use is to speedup the legacy shadow paging (pre-EPT) case, but in the future it will be used for virtual GPUs as well - much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits) KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch KVM: x86: disable MPX if host did not enable MPX XSAVE features arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit arm64: KVM: vgic-v3: Reset LRs at boot time arm64: KVM: vgic-v3: Do not save an LR known to be empty arm64: KVM: vgic-v3: Save maintenance interrupt state only if required arm64: KVM: vgic-v3: Avoid accessing ICH registers KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit KVM: arm/arm64: vgic-v2: Reset LRs at boot time KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers KVM: s390: allocate only one DMA page per VM KVM: s390: enable STFLE interpretation only if enabled for the guest KVM: s390: wake up when the VCPU cpu timer expires KVM: s390: step the VCPU timer while in enabled wait KVM: s390: protect VCPU cpu timer with a seqcount KVM: s390: step VCPU cpu timer during kvm_run ioctl ...
| * \ Merge tag 'kvm-arm-for-4.6' of ↵Paolo Bonzini2016-03-099-13/+834
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/ARM updates for 4.6 - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems - PMU support for guests - 32bit world switch rewritten in C - Various optimizations to the vgic save/restore code Conflicts: include/uapi/linux/kvm.h
| | * | arm64: KVM: vgic-v3: Reset LRs at boot timeMarc Zyngier2016-03-091-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to let the GICv3 code be more lazy in the way it accesses the LRs, it is necessary to start with a clean slate. Let's reset the LRs on each CPU when the vgic is probed (which includes a round trip to EL2...). Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: vgic-v3: Avoid accessing ICH registersMarc Zyngier2016-03-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just like on GICv2, we're a bit hammer-happy with GICv3, and access them more often than we should. Adopt a policy similar to what we do for GICv2, only save/restoring the minimal set of registers. As we don't access the registers linearly anymore (we may skip some), the convoluted accessors become slightly simpler, and we can drop the ugly indexing macro that tended to confuse the reviewers. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hitMarc Zyngier2016-03-091-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The GICD_SGIR register lives a long way from the beginning of the handler array, which is searched linearly. As this is hit pretty often, let's move it up. This saves us some precious cycles when the guest is generating IPIs. Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exitMarc Zyngier2016-03-091-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So far, we're always writing all possible LRs, setting the empty ones with a zero value. This is obvious doing a lot of work for nothing, and we're better off clearing those we've actually dirtied on the exit path (it is very rare to inject more than one interrupt at a time anyway). Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: vgic-v2: Reset LRs at boot timeMarc Zyngier2016-03-091-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to let make the GICv2 code more lazy in the way it accesses the LRs, it is necessary to start with a clean slate. Let's reset the LRs on each CPU when the vgic is probed. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: vgic-v2: Do not save an LR known to be emptyMarc Zyngier2016-03-091-6/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On exit, any empty LR will be signaled in GICH_ELRSR*. Which means that we do not have to save it, and we can just clear its state in the in-memory copy. Take this opportunity to move the LR saving code into its own function. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own functionMarc Zyngier2016-03-091-15/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to make the saving path slightly more readable and prepare for some more optimizations, let's move the GICH_ELRSR saving to its own function. No functional change. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if requiredMarc Zyngier2016-03-091-7/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Next on our list of useless accesses is the maintenance interrupt status registers (GICH_MISR, GICH_EISR{0,1}). It is pointless to save them if we haven't asked for a maintenance interrupt the first place, which can only happen for two reasons: - Underflow: GICH_HCR_UIE will be set, - EOI: GICH_LR_EOI will be set. These conditions can be checked on the in-memory copies of the regs. Should any of these two condition be valid, we must read GICH_MISR. We can then check for GICH_MISR_EOI, and only when set read GICH_EISR*. This means that in most case, we don't have to save them at all. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: vgic-v2: Avoid accessing GICH registersMarc Zyngier2016-03-091-22/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GICv2 registers are *slow*. As in "terrifyingly slow". Which is bad. But we're equaly bad, as we make a point in accessing them even if we don't have any interrupt in flight. A good solution is to first find out if we have anything useful to write into the GIC, and if we don't, to simply not do it. This involves tracking which LRs actually have something valid there. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: timer: Add active state cachingMarc Zyngier2016-02-291-0/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Programming the active state in the (re)distributor can be an expensive operation so it makes some sense to try and reduce the number of accesses as much as possible. So far, we program the active state on each VM entry, but there is some opportunity to do less. An obvious solution is to cache the active state in memory, and only program it in the HW when conditions change. But because the HW can also change things under our feet (the active state can transition from 1 to 0 when the guest does an EOI), some precautions have to be taken, which amount to only caching an "inactive" state, and always programing it otherwise. With this in place, we observe a reduction of around 700 cycles on a 2GHz GICv2 platform for a NULL hypercall. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Add a new vcpu device control group for PMUv3Shannon Zhao2016-02-291-0/+112
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To configure the virtual PMUv3 overflow interrupt number, we use the vcpu kvm_device ioctl, encapsulating the KVM_ARM_VCPU_PMU_V3_IRQ attribute within the KVM_ARM_VCPU_PMU_V3_CTRL group. After configuring the PMUv3, call the vcpu ioctl with attribute KVM_ARM_VCPU_PMU_V3_INIT to initialize the PMUv3. Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Acked-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Add a new feature bit for PMUv3Shannon Zhao2016-02-291-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To support guest PMUv3, use one bit of the VCPU INIT feature array. Initialize the PMU when initialzing the vcpu with that bit and PMU overflow interrupt set. Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Acked-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Free perf event of PMU when destroying vcpuShannon Zhao2016-02-291-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When KVM frees VCPU, it needs to free the perf_event of PMU. Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Reset PMU state when resetting vcpuShannon Zhao2016-02-291-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When resetting vcpu, it needs to reset the PMU state to initial status. Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Add PMU overflow interrupt routingShannon Zhao2016-02-291-1/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When calling perf_event_create_kernel_counter to create perf_event, assign a overflow handler. Then when the perf event overflows, set the corresponding bit of guest PMOVSSET register. If this counter is enabled and its interrupt is enabled as well, kick the vcpu to sync the interrupt. On VM entry, if there is counter overflowed and interrupt level is changed, inject the interrupt with corresponding level. On VM exit, sync the interrupt level as well if it has been changed. Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Add helper to handle PMCR register bitsShannon Zhao2016-02-291-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to ARMv8 spec, when writing 1 to PMCR.E, all counters are enabled by PMCNTENSET, while writing 0 to PMCR.E, all counters are disabled. When writing 1 to PMCR.P, reset all event counters, not including PMCCNTR, to zero. When writing 1 to PMCR.C, reset PMCCNTR to zero. Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Add access handler for PMSWINC registerShannon Zhao2016-02-291-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add access handler which emulates writing and reading PMSWINC register and add support for creating software increment event. Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Add access handler for PMOVSSET and PMOVSCLR registerShannon Zhao2016-02-291-0/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the reset value of PMOVSSET and PMOVSCLR is UNKNOWN, use reset_unknown for its reset handler. Add a handler to emulate writing PMOVSSET or PMOVSCLR register. When writing non-zero value to PMOVSSET, the counter and its interrupt is enabled, kick this vcpu to sync PMU interrupt. Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: PMU: Add perf event map and introduce perf event creating functionShannon Zhao2016-02-291-0/+74
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we use tools like perf on host, perf passes the event type and the id of this event type category to kernel, then kernel will map them to hardware event number and write this number to PMU PMEVTYPER<n>_EL0 register. When getting the event number in KVM, directly use raw event type to create a perf_event for it. Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Add access handler for PMCNTENSET and PMCNTENCLR registerShannon Zhao2016-02-291-0/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the reset value of PMCNTENSET and PMCNTENCLR is UNKNOWN, use reset_unknown for its reset handler. Add a handler to emulate writing PMCNTENSET or PMCNTENCLR register. When writing to PMCNTENSET, call perf_event_enable to enable the perf event. When writing to PMCNTENCLR, call perf_event_disable to disable the perf event. Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Add access handler for event counter registerShannon Zhao2016-02-291-0/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These kind of registers include PMEVCNTRn, PMCCNTR and PMXEVCNTR which is mapped to PMEVCNTRn. The access handler translates all aarch32 register offsets to aarch64 ones and uses vcpu_sys_reg() to access their values to avoid taking care of big endian. When reading these registers, return the sum of register value and the value perf event counts. Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Move vgic-v2 and timer save/restore to virt/kvm/arm/hypMarc Zyngier2016-02-292-0/+151
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already have virt/kvm/arm/ containing timer and vgic stuff. Add yet another subdirectory to contain the hyp-specific files (timer and vgic again). Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| * | | KVM: ensure __gfn_to_pfn_memslot initializes *writablePaolo Bonzini2016-03-041-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the kvm_is_error_hva, ubsan complains if the uninitialized writable is passed to __direct_map, even though the value itself is not used (__direct_map goes to mmu_set_spte->set_spte->set_mmio_spte but never looks at that argument). Ensuring that __gfn_to_pfn_memslot initializes *writable is cheap and avoids this kind of issue. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: async_pf: use list_first_entryGeliang Tang2016-02-231-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To make the intention clearer, use list_first_entry instead of list_entry. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: use list_for_each_entry_safeGeliang Tang2016-02-231-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use list_for_each_entry_safe() instead of list_for_each_safe() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: halt_polling: improve grow/shrink settingsChristian Borntraeger2016-02-161-8/+10
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now halt_poll_ns can be change during runtime. The grow and shrink factors can only be set during module load. Lets fix several aspects of grow shrink: - make grow/shrink changeable by root - make all variables unsigned int - read the variables once to prevent races Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2016-03-142-11/+10
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle are: - Make schedstats a runtime tunable (disabled by default) and optimize it via static keys. As most distributions enable CONFIG_SCHEDSTATS=y due to its instrumentation value, this is a nice performance enhancement. (Mel Gorman) - Implement 'simple waitqueues' (swait): these are just pure waitqueues without any of the more complex features of full-blown waitqueues (callbacks, wake flags, wake keys, etc.). Simple waitqueues have less memory overhead and are faster. Use simple waitqueues in the RCU code (in 4 different places) and for handling KVM vCPU wakeups. (Peter Zijlstra, Daniel Wagner, Thomas Gleixner, Paul Gortmaker, Marcelo Tosatti) - sched/numa enhancements (Rik van Riel) - NOHZ performance enhancements (Rik van Riel) - Various sched/deadline enhancements (Steven Rostedt) - Various fixes (Peter Zijlstra) - ... and a number of other fixes, cleanups and smaller enhancements" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits) sched/cputime: Fix steal_account_process_tick() to always return jiffies sched/deadline: Remove dl_new from struct sched_dl_entity Revert "kbuild: Add option to turn incompatible pointer check into error" sched/deadline: Remove superfluous call to switched_to_dl() sched/debug: Fix preempt_disable_ip recording for preempt_disable() sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity time, acct: Drop irq save & restore from __acct_update_integrals() acct, time: Change indentation in __acct_update_integrals() sched, time: Remove non-power-of-two divides from __acct_update_integrals() sched/rt: Kick RT bandwidth timer immediately on start up sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug sched/debug: Move sched_domain_sysctl to debug.c sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c sched/rt: Fix PI handling vs. sched_setscheduler() sched/core: Remove duplicated sched_group_set_shares() prototype sched/fair: Consolidate nohz CPU load update code sched/fair: Avoid using decay_load_missed() with a negative value sched/deadline: Always calculate end of period on sched_yield() sched/cgroup: Fix cgroup entity load tracking tear-down rcu: Use simple wait queues where possible in rcutree ...
| * \ \ Merge branch 'sched/urgent' into sched/core, to pick up fixes before ↵Ingo Molnar2016-02-293-6/+9
| |\ \ \ | | | |/ | | |/| | | | | | | | | | | | | applying new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | KVM: Use simple waitqueue for vcpu->wqMarcelo Tosatti2016-02-252-11/+10
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem: On -rt, an emulated LAPIC timer instances has the following path: 1) hard interrupt 2) ksoftirqd is scheduled 3) ksoftirqd wakes up vcpu thread 4) vcpu thread is scheduled This extra context switch introduces unnecessary latency in the LAPIC path for a KVM guest. The solution: Allow waking up vcpu thread from hardirq context, thus avoiding the need for ksoftirqd to be scheduled. Normal waitqueues make use of spinlocks, which on -RT are sleepable locks. Therefore, waking up a waitqueue waiter involves locking a sleeping lock, which is not allowed from hard interrupt context. cyclictest command line: This patch reduces the average latency in my tests from 14us to 11us. Daniel writes: Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency benchmark on mainline. The test was run 1000 times on tip/sched/core 4.4.0-rc8-01134-g0905f04: ./x86-run x86/tscdeadline_latency.flat -cpu host with idle=poll. The test seems not to deliver really stable numbers though most of them are smaller. Paolo write: "Anything above ~10000 cycles means that the host went to C1 or lower---the number means more or less nothing in that case. The mean shows an improvement indeed." Before: min max mean std count 1000.000000 1000.000000 1000.000000 1000.000000 mean 5162.596000 2019270.084000 5824.491541 20681.645558 std 75.431231 622607.723969 89.575700 6492.272062 min 4466.000000 23928.000000 5537.926500 585.864966 25% 5163.000000 1613252.750000 5790.132275 16683.745433 50% 5175.000000 2281919.000000 5834.654000 23151.990026 75% 5190.000000 2382865.750000 5861.412950 24148.206168 max 5228.000000 4175158.000000 6254.827300 46481.048691 After min max mean std count 1000.000000 1000.00000 1000.000000 1000.000000 mean 5143.511000 2076886.10300 5813.312474 21207.357565 std 77.668322 610413.09583 86.541500 6331.915127 min 4427.000000 25103.00000 5529.756600 559.187707 25% 5148.000000 1691272.75000 5784.889825 17473.518244 50% 5160.000000 2308328.50000 5832.025000 23464.837068 75% 5172.000000 2393037.75000 5853.177675 24223.969976 max 5222.000000 3922458.00000 6186.720500 42520.379830 [Patch was originaly based on the swait implementation found in the -rt tree. Daniel ported it to mainline's version and gathered the benchmark numbers for tscdeadline_latency test.] Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: linux-rt-users@vger.kernel.org Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | kvm: cap halt polling at exactly halt_poll_nsDavid Matlack2016-03-091-0/+3
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When growing halt-polling, there is no check that the poll time exceeds the limit. It's possible for vcpu->halt_poll_ns grow once past halt_poll_ns, and stay there until a halt which takes longer than vcpu->halt_poll_ns. For example, booting a Linux guest with halt_poll_ns=11000: ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 10000) ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (grow 0) ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000) Signed-off-by: David Matlack <dmatlack@google.com> Fixes: aca6ff29c4063a8d467cdee241e6b3bf7dc4a171 Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | Merge tag 'kvm-arm-for-4.5-rc6' of ↵Paolo Bonzini2016-02-251-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master KVM/ARM fixes for 4.5-rc6 - Fix per-vcpu vgic bitmap allocation - Do not give copy random memory on MMIO read - Fix GICv3 APR register restore order
| * | KVM: arm/arm64: vgic: Ensure bitmaps are long enoughMark Rutland2016-02-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we allocate bitmaps in vgic_vcpu_init_maps, we divide the number of bits we need by 8 to figure out how many bytes to allocate. However, bitmap elements are always accessed as unsigned longs, and if we didn't happen to allocate a size such that size % sizeof(unsigned long) == 0, bitmap accesses may go past the end of the allocation. When using KASAN (which does byte-granular access checks), this results in a continuous stream of BUGs whenever these bitmaps are accessed: ============================================================================= BUG kmalloc-128 (Tainted: G B ): kasan: bad access detected ----------------------------------------------------------------------------- INFO: Allocated in vgic_init.part.25+0x55c/0x990 age=7493 cpu=3 pid=1730 INFO: Slab 0xffffffbde6d5da40 objects=16 used=15 fp=0xffffffc935769700 flags=0x4000000000000080 INFO: Object 0xffffffc935769500 @offset=1280 fp=0x (null) Bytes b4 ffffffc9357694f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object ffffffc935769500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object ffffffc935769510: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object ffffffc935769520: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object ffffffc935769530: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object ffffffc935769540: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object ffffffc935769550: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object ffffffc935769560: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object ffffffc935769570: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Padding ffffffc9357695b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Padding ffffffc9357695c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Padding ffffffc9357695d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Padding ffffffc9357695e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Padding ffffffc9357695f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ CPU: 3 PID: 1740 Comm: kvm-vcpu-0 Tainted: G B 4.4.0+ #17 Hardware name: ARM Juno development board (r1) (DT) Call trace: [<ffffffc00008e770>] dump_backtrace+0x0/0x280 [<ffffffc00008ea04>] show_stack+0x14/0x20 [<ffffffc000726360>] dump_stack+0x100/0x188 [<ffffffc00030d324>] print_trailer+0xfc/0x168 [<ffffffc000312294>] object_err+0x3c/0x50 [<ffffffc0003140fc>] kasan_report_error+0x244/0x558 [<ffffffc000314548>] __asan_report_load8_noabort+0x48/0x50 [<ffffffc000745688>] __bitmap_or+0xc0/0xc8 [<ffffffc0000d9e44>] kvm_vgic_flush_hwstate+0x1bc/0x650 [<ffffffc0000c514c>] kvm_arch_vcpu_ioctl_run+0x2ec/0xa60 [<ffffffc0000b9a6c>] kvm_vcpu_ioctl+0x474/0xa68 [<ffffffc00036b7b0>] do_vfs_ioctl+0x5b8/0xcb0 [<ffffffc00036bf34>] SyS_ioctl+0x8c/0xa0 [<ffffffc000086cb0>] el0_svc_naked+0x24/0x28 Memory state around the buggy address: ffffffc935769400: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffffffc935769480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffffffc935769500: 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffffffc935769580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffffffc935769600: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Fix the issue by always allocating a multiple of sizeof(unsigned long), as we do elsewhere in the vgic code. Fixes: c1bfb577a ("arm/arm64: KVM: vgic: switch to dynamic allocation") Cc: stable@vger.kernel.org Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
* | | KVM: async_pf: do not warn on page allocation failuresChristian Borntraeger2016-02-241-1/+1
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In async_pf we try to allocate with NOWAIT to get an element quickly or fail. This code also handle failures gracefully. Lets silence potential page allocation failures under load. qemu-system-s39: page allocation failure: order:0,mode:0x2200000 [...] Call Trace: ([<00000000001146b8>] show_trace+0xf8/0x148) [<000000000011476a>] show_stack+0x62/0xe8 [<00000000004a36b8>] dump_stack+0x70/0x98 [<0000000000272c3a>] warn_alloc_failed+0xd2/0x148 [<000000000027709e>] __alloc_pages_nodemask+0x94e/0xb38 [<00000000002cd36a>] new_slab+0x382/0x400 [<00000000002cf7ac>] ___slab_alloc.constprop.30+0x2dc/0x378 [<00000000002d03d0>] kmem_cache_alloc+0x160/0x1d0 [<0000000000133db4>] kvm_setup_async_pf+0x6c/0x198 [<000000000013dee8>] kvm_arch_vcpu_ioctl_run+0xd48/0xd58 [<000000000012fcaa>] kvm_vcpu_ioctl+0x372/0x690 [<00000000002f66f6>] do_vfs_ioctl+0x3be/0x510 [<00000000002f68ec>] SyS_ioctl+0xa4/0xb8 [<0000000000781c5e>] system_call+0xd6/0x264 [<000003ffa24fa06a>] 0x3ffa24fa06a Cc: stable@vger.kernel.org Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
OpenPOWER on IntegriCloud