summaryrefslogtreecommitdiffstats
path: root/arch/s390/mm/pgtable.c
Commit message (Collapse)AuthorAgeFilesLines
* KVM: s390/mm: try a cow on read only pages for key opsChristian Borntraeger2014-08-251-0/+10
| | | | | | | | | | The PFMF instruction handler blindly wrote the storage key even if the page was mapped R/O in the host. Lets try a COW before continuing and bail out in case of errors. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Cc: stable@vger.kernel.org
* s390/mm: implement dirty bits for large segment table entriesMartin Schwidefsky2014-08-011-0/+3
| | | | | | | | | | | | | The large segment table entry format has block of bits for the ACC/F values for the large page. These bits are valid only if another bit (AV bit 0x10000) of the segment table entry is set. The ACC/F bits do not have a meaning if the AV bit is off. This allows to put the THP splitting bit, the segment young bit and the new segment dirty bit into the ACC/F bits as long as the AV bit stays off. The dirty and young information is only available if the pmd is large. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* KVM: s390/mm: Fix page table locking vs. split pmd lockChristian Borntraeger2014-08-011-2/+3
| | | | | | | | | | | | commit ec66ad66a0de87866be347b5ecc83bd46427f53b (s390/mm: enable split page table lock for PMD level) activated the split pmd lock for s390. Turns out that we missed one place: We also have to take the pmd lock instead of the page table lock when we reallocate the page tables (==> changing entries in the PMD) during sie enablement. Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into nextLinus Torvalds2014-06-041-16/+73
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: "At over 200 commits, covering almost all supported architectures, this was a pretty active cycle for KVM. Changes include: - a lot of s390 changes: optimizations, support for migration, GDB support and more - ARM changes are pretty small: support for the PSCI 0.2 hypercall interface on both the guest and the host (the latter acked by Catalin) - initial POWER8 and little-endian host support - support for running u-boot on embedded POWER targets - pretty large changes to MIPS too, completing the userspace interface and improving the handling of virtualized timer hardware - for x86, a larger set of changes is scheduled for 3.17. Still, we have a few emulator bugfixes and support for running nested fully-virtualized Xen guests (para-virtualized Xen guests have always worked). And some optimizations too. The only missing architecture here is ia64. It's not a coincidence that support for KVM on ia64 is scheduled for removal in 3.17" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits) KVM: add missing cleanup_srcu_struct KVM: PPC: Book3S PR: Rework SLB switching code KVM: PPC: Book3S PR: Use SLB entry 0 KVM: PPC: Book3S HV: Fix machine check delivery to guest KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs KVM: PPC: Book3S HV: Make sure we don't miss dirty pages KVM: PPC: Book3S HV: Fix dirty map for hugepages KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates() KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number KVM: PPC: Book3S: Add ONE_REG register names that were missed KVM: PPC: Add CAP to indicate hcall fixes KVM: PPC: MPIC: Reset IRQ source private members KVM: PPC: Graciously fail broken LE hypercalls PPC: ePAPR: Fix hypercall on LE guest KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler KVM: PPC: BOOK3S: Always use the saved DAR value PPC: KVM: Make NX bit available with magic page KVM: PPC: Disable NX for old magic page using guests KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest ...
| * KVM: s390: correct locking for s390_enable_skeyMartin Schwidefsky2014-05-161-15/+7
| | | | | | | | | | | | | | | | | | | | | | Use the mm semaphore to serialize multiple invocations of s390_enable_skey. The second CPU faulting on a storage key operation needs to wait for the completion of the page table update. Taking the mm semaphore writable has the positive side-effect that it prevents any host faults from taking place which does have implications on keys vs PGSTE. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * KVM: s390/mm: new gmap_test_and_clear_dirty functionDominik Dingel2014-04-221-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | For live migration kvm needs to test and clear the dirty bit of guest pages. That for is ptep_test_and_clear_user_dirty, to be sure we are not racing with other code, we protect the pte. This needs to be done within the architecture memory management code. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * KVM: s390/mm: use software dirty bit detection for user dirty trackingMartin Schwidefsky2014-04-221-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Switch the user dirty bit detection used for migration from the hardware provided host change-bit in the pgste to a fault based detection method. This reduced the dependency of the host from the storage key to a point where it becomes possible to enable the RCP bypass for KVM guests. The fault based dirty detection will only indicate changes caused by accesses via the guest address space. The hardware based method can detect all changes, even those caused by I/O or accesses via the kernel page table. The KVM/qemu code needs to take this into account. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * KVM: s390: Allow skeys to be enabled for the current processDominik Dingel2014-04-221-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a new function s390_enable_skey(), which enables storage key handling via setting the use_skey flag in the mmu context. This function is only useful within the context of kvm. Note that enabling storage keys will cause a one-time hickup when walking the page table; however, it saves us special effort for cases like clear reset while making it possible for us to be architecture conform. s390_enable_skey() takes the page table lock to prevent reseting storage keys triggered from multiple vcpus. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * KVM: s390: Clear storage keysDominik Dingel2014-04-221-11/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | page_table_reset_pgste() already does a complete page table walk to reset the pgste. Enhance it to initialize the storage keys to PAGE_DEFAULT_KEY if requested by the caller. This will be used for lazy storage key handling. Also provide an empty stub for !CONFIG_PGSTE Lets adopt the current code (diag 308) to not clear the keys. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
* | s390/uaccess: simplify control register updatesMartin Schwidefsky2014-05-201-4/+6
|/ | | | | | | | | | | | | Always switch to the kernel ASCE in switch_mm. Load the secondary space ASCE in finish_arch_post_lock_switch after checking that any pending page table operations have completed. The primary ASCE is loaded in entry[64].S. With this the update_primary_asce call can be removed from the switch_to macro and from the start of switch_mm function. Remove the load_primary argument from update_user_asce/clear_user_asce, rename update_user_asce to set_user_asce and rename update_primary_asce to load_kernel_asce. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2014-04-081-5/+7
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull second set of s390 patches from Martin Schwidefsky: "The second part of Heikos uaccess rework, the page table walker for uaccess is now a thing of the past (yay!) The code change to fix the theoretical TLB flush problem allows us to add a TLB flush optimization for zEC12, this machine has new instructions that allow to do CPU local TLB flushes for single pages and for all pages of a specific address space. Plus the usual bug fixing and some more cleanup" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/uaccess: rework uaccess code - fix locking issues s390/mm,tlb: optimize TLB flushing for zEC12 s390/mm,tlb: safeguard against speculative TLB creation s390/irq: Use defines for external interruption codes s390/irq: Add defines for external interruption codes s390/sclp: add timeout for queued requests kvm/s390: also set guest pages back to stable on kexec/kdump lcs: Add missing destroy_timer_on_stack() s390/tape: Add missing destroy_timer_on_stack() s390/tape: Use del_timer_sync() s390/3270: fix crash with multiple reset device requests s390/bitops,atomic: add missing memory barriers s390/zcrypt: add length check for aligned data to avoid overflow in msg-type 6
| * s390/uaccess: rework uaccess code - fix locking issuesHeiko Carstens2014-04-031-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current uaccess code uses a page table walk in some circumstances, e.g. in case of the in atomic futex operations or if running on old hardware which doesn't support the mvcos instruction. However it turned out that the page table walk code does not correctly lock page tables when accessing page table entries. In other words: a different cpu may invalidate a page table entry while the current cpu inspects the pte. This may lead to random data corruption. Adding correct locking however isn't trivial for all uaccess operations. Especially copy_in_user() is problematic since that requires to hold at least two locks, but must be protected against ABBA deadlock when a different cpu also performs a copy_in_user() operation. So the solution is a different approach where we change address spaces: User space runs in primary address mode, or access register mode within vdso code, like it currently already does. The kernel usually also runs in home space mode, however when accessing user space the kernel switches to primary or secondary address mode if the mvcos instruction is not available or if a compare-and-swap (futex) instruction on a user space address is performed. KVM however is special, since that requires the kernel to run in home address space while implicitly accessing user space with the sie instruction. So we end up with: User space: - runs in primary or access register mode - cr1 contains the user asce - cr7 contains the user asce - cr13 contains the kernel asce Kernel space: - runs in home space mode - cr1 contains the user or kernel asce -> the kernel asce is loaded when a uaccess requires primary or secondary address mode - cr7 contains the user or kernel asce, (changed with set_fs()) - cr13 contains the kernel asce In case of uaccess the kernel changes to: - primary space mode in case of a uaccess (copy_to_user) and uses e.g. the mvcp instruction to access user space. However the kernel will stay in home space mode if the mvcos instruction is available - secondary space mode in case of futex atomic operations, so that the instructions come from primary address space and data from secondary space In case of kvm the kernel runs in home space mode, but cr1 gets switched to contain the gmap asce before the sie instruction gets executed. When the sie instruction is finished cr1 will be switched back to contain the user asce. A context switch between two processes will always load the kernel asce for the next process in cr1. So the first exit to user space is a bit more expensive (one extra load control register instruction) than before, however keeps the code rather simple. In sum this means there is no need to perform any error prone page table walks anymore when accessing user space. The patch seems to be rather large, however it mainly removes the the page table walk code and restores the previously deleted "standard" uaccess code, with a couple of changes. The uaccess without mvcos mode can be enforced with the "uaccess_primary" kernel parameter. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * s390/mm,tlb: optimize TLB flushing for zEC12Martin Schwidefsky2014-04-031-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The zEC12 machines introduced the local-clearing control for the IDTE and IPTE instruction. If the control is set only the TLB of the local CPU is cleared of entries, either all entries of a single address space for IDTE, or the entry for a single page-table entry for IPTE. Without the local-clearing control the TLB flush is broadcasted to all CPUs in the configuration, which is expensive. The reset of the bit mask of the CPUs that need flushing after a non-local IDTE is tricky. As TLB entries for an address space remain in the TLB even if the address space is detached a new bit field is required to keep track of attached CPUs vs. CPUs in the need of a flush. After a non-local flush with IDTE the bit-field of attached CPUs is copied to the bit-field of CPUs in need of a flush. The ordering of operations on cpu_attach_mask, attach_count and mm_cpumask(mm) is such that an underindication in mm_cpumask(mm) is prevented but an overindication in mm_cpumask(mm) is possible. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * s390/mm,tlb: safeguard against speculative TLB creationMartin Schwidefsky2014-04-031-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The principles of operations states that the CPU is allowed to create TLB entries for an address space anytime while an ASCE is loaded to the control register. This is true even if the CPU is running in the kernel and the user address space is not (actively) accessed. In theory this can affect two aspects of the TLB flush logic. For full-mm flushes the ASCE of the dying process is still attached. The approach to flush first with IDTE and then just free all page tables can in theory lead to stale TLB entries. Use the batched free of page tables for the full-mm flushes as well. For operations that can have a stale ASCE in the control register, e.g. a delayed update_user_asce in switch_mm, load the kernel ASCE to prevent invalid TLBs from being created. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"Alex Thorlton2014-04-071-0/+3
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The main motivation behind this patch is to provide a way to disable THP for jobs where the code cannot be modified, and using a malloc hook with madvise is not an option (i.e. statically allocated data). This patch allows us to do just that, without affecting other jobs running on the system. We need to do this sort of thing for jobs where THP hurts performance, due to the possibility of increased remote memory accesses that can be created by situations such as the following: When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will be handed out, and the THP will be stuck on whatever node the chunk was originally referenced from. If many remote nodes need to do work on that same chunk, they'll be making remote accesses. With THP disabled, 4K pages can be handed out to separate nodes as they're needed, greatly reducing the amount of remote accesses to memory. This patch is based on some of my work combined with some suggestions/patches given by Oleg Nesterov. The main goal here is to add a prctl switch to allow us to disable to THP on a per mm_struct basis. Here's a bit of test data with the new patch in place... First with the flag unset: # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g Setting thp_disabled for this task... thp_disable: 0 Set thp_disabled state to 0 Process pid = 18027 PF/ MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/ TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES 512 1.120 0.060 0.000 0.110 0.110 0.000 28571428864 -9223372036854775808 55803572 23 Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g': 273719072.841402 task-clock # 641.026 CPUs utilized [100.00%] 1,008,986 context-switches # 0.000 M/sec [100.00%] 7,717 CPU-migrations # 0.000 M/sec [100.00%] 1,698,932 page-faults # 0.000 M/sec 355,222,544,890,379 cycles # 1.298 GHz [100.00%] 536,445,412,234,588 stalled-cycles-frontend # 151.02% frontend cycles idle [100.00%] 409,110,531,310,223 stalled-cycles-backend # 115.17% backend cycles idle [100.00%] 148,286,797,266,411 instructions # 0.42 insns per cycle # 3.62 stalled cycles per insn [100.00%] 27,061,793,159,503 branches # 98.867 M/sec [100.00%] 1,188,655,196 branch-misses # 0.00% of all branches 427.001706337 seconds time elapsed Now with the flag set: # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g Setting thp_disabled for this task... thp_disable: 1 Set thp_disabled state to 1 Process pid = 144957 PF/ MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/ TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES 512 0.620 0.260 0.250 0.320 0.570 0.001 51612901376 128000000000 100806448 23 Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g': 138789390.540183 task-clock # 641.959 CPUs utilized [100.00%] 534,205 context-switches # 0.000 M/sec [100.00%] 4,595 CPU-migrations # 0.000 M/sec [100.00%] 63,133,119 page-faults # 0.000 M/sec 147,977,747,269,768 cycles # 1.066 GHz [100.00%] 200,524,196,493,108 stalled-cycles-frontend # 135.51% frontend cycles idle [100.00%] 105,175,163,716,388 stalled-cycles-backend # 71.07% backend cycles idle [100.00%] 180,916,213,503,160 instructions # 1.22 insns per cycle # 1.11 stalled cycles per insn [100.00%] 26,999,511,005,868 branches # 194.536 M/sec [100.00%] 714,066,351 branch-misses # 0.00% of all branches 216.196778807 seconds time elapsed As with previous versions of the patch, We're getting about a 2x performance increase here. Here's a link to the test case I used, along with the little wrapper to activate the flag: http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz This patch (of 3): Revert commit 8e72033f2a48 and add in code to fix up any issues caused by the revert. The revert is necessary because hugepage_madvise would return -EINVAL when VM_NOHUGEPAGE is set, which will break subsequent chunks of this patch set. Here's a snip of an e-mail from Gerald detailing the original purpose of this code, and providing justification for the revert: "The intent of commit 8e72033f2a48 was to guard against any future programming errors that may result in an madvice(MADV_HUGEPAGE) on guest mappings, which would crash the kernel. Martin suggested adding the bit to arch/s390/mm/pgtable.c, if 8e72033f2a48 was to be reverted, because that check will also prevent a kernel crash in the case described above, it will now send a SIGSEGV instead. This would now also allow to do the madvise on other parts, if needed, so it is a more flexible approach. One could also say that it would have been better to do it this way right from the beginning..." Signed-off-by: Alex Thorlton <athorlton@sgi.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* s390/mm: remove unnecessary parameter from gmap_do_ipte_notifyDominik Dingel2014-03-211-2/+1
| | | | | Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390/mm: fixing comment so that parameter name matchDominik Dingel2014-03-211-1/+1
| | | | | Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390/mm: enable split page table lock for PMD levelMartin Schwidefsky2014-02-211-2/+2
| | | | | | | | Add the pgtable_pmd_page_ctor/pgtable_pmd_page_dtor calls to the pmd allocation and free functions and enable ARCH_ENABLE_SPLIT_PMD_PTLOCK for 64 bit. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390/kvm: set guest page states to stable on re-iplMartin Schwidefsky2014-02-211-0/+72
| | | | | | | The guest page state needs to be reset to stable for all pages on initial program load via diagnose 0x308. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390/kvm: support collaborative memory managementKonstantin Weitz2014-02-211-0/+77
| | | | | | | | | | | | | | This patch enables Collaborative Memory Management (CMM) for kvm on s390. CMM allows the guest to inform the host about page usage (see arch/s390/mm/cmm.c). The host uses this information to avoid swapping in unused pages in the page fault handler. Further, a CPU provided list of unused invalid pages is processed to reclaim swap space of not yet accessed unused pages. [ Martin Schwidefsky: patch reordering and cleanup ] Signed-off-by: Konstantin Weitz <konstantin.weitz@gmail.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390: Fix misspellings using 'codespell' toolHendrik Brueckner2014-01-161-2/+2
| | | | | Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390: handle pgtable_page_ctor() failKirill A. Shutemov2013-11-151-2/+9
| | | | | | | | Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, thp: do not access mm->pmd_huge_pte directlyKirill A. Shutemov2013-11-151-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently mm->pmd_huge_pte protected by page table lock. It will not work with split lock. We have to have per-pmd pmd_huge_pte for proper access serialization. For now, let's just introduce wrapper to access mm->pmd_huge_pte. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Alex Thorlton <athorlton@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* s390/mm,tlb: correct tlb flush on page table upgradeMartin Schwidefsky2013-11-041-0/+18
| | | | | | | | | | | | The IDTE instruction used to flush TLB entries for a specific address space uses the address-space-control element (ASCE) to identify affected TLB entries. The upgrade of a page table adds a new top level page table which changes the ASCE. The TLB entries associated with the old ASCE need to be flushed and the ASCE for the address space needs to be replaced synchronously on all CPUs which currently use it. The concept of a lazy ASCE update with an exception handler is broken. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390/mm: page_table_realloc returns failureDominik Dingel2013-10-311-8/+13
| | | | | | | | | | | | | There is a possible race between setting has_pgste and reallocation of the page_table, change the order to fix this. Also page_table_alloc_pgste can fail, in that case we need to backpropagte this as -ENOMEM to the caller of page_table_realloc. Based on a patch by Christian Borntraeger <borntraeger@de.ibm.com>. Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390/uaccess: always run the kernel in home spaceMartin Schwidefsky2013-10-241-4/+0
| | | | | | | Simplify the uaccess code by removing the user_mode=home option. The kernel will now always run in the home space mode. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2013-09-111-2/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Heiko Carstens: "This includes one bpf/jit bug fix where the jit compiler could sometimes write generated code out of bounds of the allocated memory area. The rest of the patches are only cleanups and minor improvements" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/irq: reduce size of external interrupt handler hash array s390/compat,uid16: use current_cred() s390/ap_bus: use and-mask instead of a cast s390/ftrace: avoid pointer arithmetics with function pointers s390: make various functions static, add declarations to header files s390/compat signal: add couple of __force annotations s390/mm: add __releases()/__acquires() annotations to gmap_alloc_table() s390: keep Kconfig sorted s390/irq: rework irq subclass handling s390/irq: use hlists for external interrupt handler array s390/dumpstack: convert print_symbol to %pSR s390/perf: Remove print_hex_dump_bytes() debug output s390: update defconfig s390/bpf,jit: fix address randomization
| * s390: make various functions static, add declarations to header filesHeiko Carstens2013-09-071-1/+1
| | | | | | | | | | | | | | Make various functions static, add declarations to header files to fix a couple of sparse findings. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
| * s390/mm: add __releases()/__acquires() annotations to gmap_alloc_table()Heiko Carstens2013-09-071-1/+3
| | | | | | | | | | | | Let sparse not incorrectly complain about unbalanced locking. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
* | Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2013-09-041-65/+118
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Gleb Natapov: "The highlights of the release are nested EPT and pv-ticketlocks support (hypervisor part, guest part, which is most of the code, goes through tip tree). Apart of that there are many fixes for all arches" Fix up semantic conflicts as discussed in the pull request thread.. * 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (88 commits) ARM: KVM: Add newlines to panic strings ARM: KVM: Work around older compiler bug ARM: KVM: Simplify tracepoint text ARM: KVM: Fix kvm_set_pte assignment ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256 ARM: KVM: Bugfix: vgic_bytemap_get_reg per cpu regs ARM: KVM: vgic: fix GICD_ICFGRn access ARM: KVM: vgic: simplify vgic_get_target_reg KVM: MMU: remove unused parameter KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate() KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX KVM: x86: update masterclock when kvmclock_offset is calculated (v2) KVM: PPC: Book3S: Fix compile error in XICS emulation KVM: PPC: Book3S PR: return appropriate error when allocation fails arch: powerpc: kvm: add signed type cast for comparation KVM: x86: add comments where MMIO does not return to the emulator KVM: vmx: count exits to userspace during invalid guest emulation KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmp kvm: optimize away THP checks in kvm_is_mmio_pfn() ...
| * KVM: s390: fix task size checkMartin Schwidefsky2013-07-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | The gmap_map_segment function uses PGDIR_SIZE in the check for the maximum address in the tasks address space. This incorrectly limits the amount of memory usable for a kvm guest to 4TB. The correct limit is (1UL << 53). As the TASK_SIZE has different values (4TB vs 8PB) dependent on the existance of the fourth page table level, create a new define 'TASK_MAX_SIZE' for (1UL << 53). Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: s390: allow sie enablement for multi-threaded programsMartin Schwidefsky2013-07-291-64/+117
| | | | | | | | | | | | | | | | | | | | Improve the code to upgrade the standard 2K page tables to 4K page tables with PGSTEs to allow the operation to happen when the program is already multi-threaded. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | s390/mm: implement software referenced bitsMartin Schwidefsky2013-08-291-11/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The last remaining use for the storage key of the s390 architecture is reference counting. The alternative is to make page table entries invalid while they are old. On access the fault handler marks the pte/pmd as young which makes the pte/pmd valid if the access rights allow read access. The pte/pmd invalidations required for software managed reference bits cost a bit of performance, on the other hand the RRBE/RRBM instructions to read and reset the referenced bits are quite expensive as well. Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | s390/mm: introduce ptep_flush_lazy helperMartin Schwidefsky2013-08-221-3/+3
| | | | | | | | | | | | | | Isolate the logic of IDTE vs. IPTE flushing of ptes in two functions, ptep_flush_lazy and __tlb_flush_mm_lazy. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | s390/mm: cleanup page table definitionsMartin Schwidefsky2013-08-221-29/+30
|/ | | | | | | | | | | | | Improve the encoding of the different pte types and the naming of the page, segment table and region table bits. Due to the different pte encoding the hugetlbfs primitives need to be adapted as well. To improve compatability with common code make the huge ptes use the encoding of normal ptes. The conversion between the pte and pmd encoding for a huge pte is done with set_huge_pte_at and huge_ptep_get. Overall the code is now easier to understand. Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* Merge branch 'next' of ↵Linus Torvalds2013-07-041-2/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Ben Herrenschmidt: "This is the powerpc changes for the 3.11 merge window. In addition to the usual bug fixes and small updates, the main highlights are: - Support for transparent huge pages by Aneesh Kumar for 64-bit server processors. This allows the use of 16M pages as transparent huge pages on kernels compiled with a 64K base page size. - Base VFIO support for KVM on power by Alexey Kardashevskiy - Wiring up of our nvram to the pstore infrastructure, including putting compressed oopses in there by Aruna Balakrishnaiah - Move, rework and improve our "EEH" (basically PCI error handling and recovery) infrastructure. It is no longer specific to pseries but is now usable by the new "powernv" platform as well (no hypervisor) by Gavin Shan. - I fixed some bugs in our math-emu instruction decoding and made it usable to emulate some optional FP instructions on processors with hard FP that lack them (such as fsqrt on Freescale embedded processors). - Support for Power8 "Event Based Branch" facility by Michael Ellerman. This facility allows what is basically "userspace interrupts" for performance monitor events. - A bunch of Transactional Memory vs. Signals bug fixes and HW breakpoint/watchpoint fixes by Michael Neuling. And more ... I appologize in advance if I've failed to highlight something that somebody deemed worth it." * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits) pstore: Add hsize argument in write_buf call of pstore_ftrace_call powerpc/fsl: add MPIC timer wakeup support powerpc/mpic: create mpic subsystem object powerpc/mpic: add global timer support powerpc/mpic: add irq_set_wake support powerpc/85xx: enable coreint for all the 64bit boards powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig powerpc/mpic: Add get_version API both for internal and external use powerpc: Handle both new style and old style reserve maps powerpc/hw_brk: Fix off by one error when validating DAWR region end powerpc/pseries: Support compression of oops text via pstore powerpc/pseries: Re-organise the oops compression code pstore: Pass header size in the pstore write callback powerpc/powernv: Fix iommu initialization again powerpc/pseries: Inform the hypervisor we are using EBB regs powerpc/perf: Add power8 EBB support powerpc/perf: Core EBB support for 64-bit book3s powerpc/perf: Drop MMCRA from thread_struct powerpc/perf: Don't enable if we have zero events ...
| * mm/THP: add pmd args to pgtable deposit and withdraw APIsAneesh Kumar K.V2013-06-201-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | This will be later used by powerpc THP support. In powerpc we want to use pgtable for storing the hash index values. So instead of adding them to mm_context list, we would like to store them in the second half of pmd Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2013-07-031-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Paolo Bonzini: "On the x86 side, there are some optimizations and documentation updates. The big ARM/KVM change for 3.11, support for AArch64, will come through Catalin Marinas's tree. s390 and PPC have misc cleanups and bugfixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits) KVM: PPC: Ignore PIR writes KVM: PPC: Book3S PR: Invalidate SLB entries properly KVM: PPC: Book3S PR: Allow guest to use 1TB segments KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry KVM: PPC: Book3S PR: Fix proto-VSID calculations KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL KVM: Fix RTC interrupt coalescing tracking kvm: Add a tracepoint write_tsc_offset KVM: MMU: Inform users of mmio generation wraparound KVM: MMU: document fast invalidate all mmio sptes KVM: MMU: document fast invalidate all pages KVM: MMU: document fast page fault KVM: MMU: document mmio page fault KVM: MMU: document write_flooding_count KVM: MMU: document clear_spte_count KVM: MMU: drop kvm_mmu_zap_mmio_sptes KVM: MMU: init kvm generation close to mmio wrap-around value KVM: MMU: add tracepoint for check_mmio_spte KVM: MMU: fast invalidate all mmio sptes ...
| * | KVM: s390: Provide function for setting the guest storage keyChristian Borntraeger2013-06-171-0/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From time to time we need to set the guest storage key. Lets provide a helper function that handles the changes with all the right locking and checking. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | s390: fix gmap_ipte_notifier vs. software dirty pagesChristian Borntraeger2013-05-211-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On heavy paging load some guest cpus started to loop in gmap_ipte_notify. This was visible as stalled cpus inside the guest. The gmap_ipte_notifier tries to map a user page and then made sure that the pte is valid and writable. Turns out that with the software change bit tracking the pte can become read-only (and only software writable) if the page is clean. Since we loop in this code, the page would stay clean and, therefore, be never writable again. Let us just use fixup_user_fault, that guarantees to call handle_mm_fault. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
| * | s390/kvm: rename RCP_xxx defines to PGSTE_xxxMartin Schwidefsky2013-05-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The RCP byte is a part of the PGSTE value, the existing RCP_xxx names are inaccurate. As the defines describe bits and pieces of the PGSTE, the names should start with PGSTE_. The KVM_UR_BIT and KVM_UC_BIT are part of the PGSTE as well, give them better names as well. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* | | s390/kvm: Provide function for setting the guest storage keyChristian Borntraeger2013-06-261-0/+48
| |/ |/| | | | | | | | | | | | | | | From time to time we need to set the guest storage key. Lets provide a helper function that handles the changes with all the right locking and checking. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | s390/pgtable: Fix gmap notifier addressChristian Borntraeger2013-05-311-1/+1
| | | | | | | | | | | | | | | | | | The address of the gmap notifier was broken, resulting in unhandled validity intercepts in KVM. Fix the rmap->vmaddr to be on a segment boundary. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | s390: fix gmap_ipte_notifier vs. software dirty pagesChristian Borntraeger2013-05-151-2/+1
|/ | | | | | | | | | | | | | On heavy paging load some guest cpus started to loop in gmap_ipte_notify. This was visible as stalled cpus inside the guest. The gmap_ipte_notifier tries to map a user page and then made sure that the pte is valid and writable. Turns out that with the software change bit tracking the pte can become read-only (and only software writable) if the page is clean. Since we loop in this code, the page would stay clean and, therefore, be never writable again. Let us just use fixup_user_fault, that guarantees to call handle_mm_fault. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390/mm: add pte invalidation notifier for kvmMartin Schwidefsky2013-05-031-4/+117
| | | | | | | | | | Add a notifier for kvm to get control before a page table entry is invalidated. The notifier is only called for ptes of an address space with pgstes that have been explicitly marked to require notification. Kvm will use this to get control before prefix pages of virtual CPU are unmapped. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390/mm,gmap: segment mapping raceMartin Schwidefsky2013-04-231-69/+91
| | | | | | | | | | | | | The gmap_map_segment function creates a special invalid segment table entry with the address of the requested target location in the process address space. The first access will create the connection between the gmap segment table and the target page table of the main process. If two threads do this concurrently both will walk the page tables and allocate a gmap_rmap structure for the same segment table entry. To avoid the race recheck the segment table entry after taking to page table lock. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390/mm,gmap: implement gmap_translate()Heiko Carstens2013-04-231-22/+85
| | | | | | | | | | | | | Implement gmap_translate() function which translates a guest absolute address to a user space process address without establishing the guest page table entries. This is useful for kvm guest address translations where no memory access is expected to happen soon (e.g. tprot exception handler). Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* s390/mm: move kernel_page_present/kernel_map_pages to page_attr.cHeiko Carstens2012-11-231-16/+0
| | | | | | | Keep related functions together and move to appropriate file. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* thp, s390: architecture backend for thp on s390Gerald Schaefer2012-10-091-0/+22
| | | | | | | | | | | | | | | This implements the architecture backend for transparent hugepages on s390. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* thp, s390: disable thp for kvm host on s390Gerald Schaefer2012-10-091-0/+30
| | | | | | | | | | | | | | | | | | This patch is part of the architecture backend for thp on s390. It disables thp for kvm hosts, because there is no kvm host hugepage support so far. Existing thp mappings are split by follow_page() with FOLL_SPLIT, and future thp mappings are prevented by setting VM_NOHUGEPAGE in mm->def_flags. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud