summaryrefslogtreecommitdiffstats
path: root/arch/s390/mm
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of ↵Linus Torvalds2016-10-272-17/+22
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Martin Schwidefsky: "A few more s390 patches for 4.9: - a fix for an overflow in the dasd driver reported by UBSAN - fix a regression and add hotplug memory to the zone movable again - add ignore defines for the pkey system calls - fix the ouput of the merged stack tracer - replace printk with pr_cont in arch/s390 where appropriate - remove the arch specific return_address function again - ignore reserved channel paths at boot time - add a missing hugetlb_bad_size call to the arch backend" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/mm: fix zone calculation in arch_add_memory() s390/dumpstack: use pr_cont within show_stack and die s390/dumpstack: get rid of return_address again s390/disassambler: use pr_cont where appropriate s390/dumpstack: use pr_cont where appropriate s390/dumpstack: restore reliable indicator for call traces s390/mm: use hugetlb_bad_size() s390/cio: don't register chpids in reserved state s390: ignore pkey system calls s390/dasd: avoid undefined behaviour
| * s390/mm: fix zone calculation in arch_add_memory()Gerald Schaefer2016-10-241-17/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Standby (hotplug) memory should be added to ZONE_MOVABLE on s390. After commit 199071f1 "s390/mm: make arch_add_memory() NUMA aware", arch_add_memory() used memblock_end_of_DRAM() to find out the end of ZONE_NORMAL and the beginning of ZONE_MOVABLE. However, commit 7f36e3e5 "memory-hotplug: add hot-added memory ranges to memblock before allocate node_data for a node." moved the call of memblock_add_node() before the call of arch_add_memory() in add_memory_resource(), and thus changed the return value of memblock_end_of_DRAM() when called in arch_add_memory(). As a result, arch_add_memory() will think that all memory blocks should be added to ZONE_NORMAL. Fix this by changing the logic in arch_add_memory() so that it will manually iterate over all zones of a given node to find out which zone a memory block should be added to. Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * s390/mm: use hugetlb_bad_size()Shyam Saini2016-10-171-0/+1
| | | | | | | | | | | | | | | | | | Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported hugepage size is found. Signed-off-by: Shyam Saini <mayhs11saini@gmail.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | mm: replace get_user_pages_unlocked() write/force parameters with gup_flagsLorenzo Stoakes2016-10-181-1/+2
|/ | | | | | | | | | | | This removes the 'write' and 'force' use from get_user_pages_unlocked() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2016-10-044-12/+27
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "The new features and main improvements in this merge for v4.9 - Support for the UBSAN sanitizer - Set HAVE_EFFICIENT_UNALIGNED_ACCESS, it improves the code in some places - Improvements for the in-kernel fpu code, in particular the overhead for multiple consecutive in kernel fpu users is recuded - Add a SIMD implementation for the RAID6 gen and xor operations - Add RAID6 recovery based on the XC instruction - The PCI DMA flush logic has been improved to increase the speed of the map / unmap operations - The time synchronization code has seen some updates And bug fixes all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (48 commits) s390/con3270: fix insufficient space padding s390/con3270: fix use of uninitialised data MAINTAINERS: update DASD maintainer s390/cio: fix accidental interrupt enabling during resume s390/dasd: add missing \n to end of dev_err messages s390/config: Enable config options for Docker s390/dasd: make query host access interruptible s390/dasd: fix panic during offline processing s390/dasd: fix hanging offline processing s390/pci_dma: improve lazy flush for unmap s390/pci_dma: split dma_update_trans s390/pci_dma: improve map_sg s390/pci_dma: simplify dma address calculation s390/pci_dma: remove dma address range check iommu/s390: simplify registration of I/O address translation parameters s390: migrate exception table users off module.h and onto extable.h s390: export header for CLP ioctl s390/vmur: fix irq pointer dereference in int handler s390/dasd: add missing KOBJ_CHANGE event for unformatted devices s390: enable UBSAN ...
| * s390: migrate exception table users off module.h and onto extable.hPaul Gortmaker2016-09-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These files were only including module.h for exception table related functions. We've now separated that content out into its own file "extable.h" so now move over to that and avoid all the extra header content in module.h that we don't really need to compile these files. The additions of uaccess.h are to deal with implict includes like: arch/s390/kernel/traps.c: In function 'do_report_trap': arch/s390/kernel/traps.c:56:4: error: implicit declaration of function 'extable_fixup' [-Werror=implicit-function-declaration] arch/s390/kernel/traps.c: In function 'illegal_op': arch/s390/kernel/traps.c:173:3: error: implicit declaration of function 'get_user' [-Werror=implicit-function-declaration] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * s390/mm: merge local / non-local IDTE helperMartin Schwidefsky2016-08-241-5/+5
| | | | | | | | | | | | | | Merge the __p[m|u]xdp_idte and __p[m|u]dp_idte_local functions into a single __p[m|u]dp_idte function with an additional parameter. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * s390/mm: merge local / non-local IPTE helperMartin Schwidefsky2016-08-242-6/+6
| | | | | | | | | | | | | | | | Merge the __ptep_ipte and __ptep_ipte_local functions into a single __ptep_ipte function with an additional parameter. The __pte_ipte_range function is still extra as the while loops makes it hard to merge. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * s390/mm,kvm: flush gmap address space with IDTEMartin Schwidefsky2016-08-241-0/+15
| | | | | | | | | | | | | | | | | | The __tlb_flush_mm() helper uses a global flush if the mm struct has a gmap structure attached to it. Replace the global flush with two individual flushes by means of the IDTE instruction if only a single gmap is attached the the mm. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | s390/mm/pfault: Convert to hotplug state machineSebastian Andrzej Siewior2016-09-191-18/+12
|/ | | | | | | | | | | | | | Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-s390@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: rt@linutronix.de Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Link: http://lkml.kernel.org/r/20160906170457.32393-18-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* s390/pageattr: handle numpages parameter correctlyHeiko Carstens2016-08-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both set_memory_ro() and set_memory_rw() will modify the page attributes of at least one page, even if the numpages parameter is zero. The author expected that calling these functions with numpages == zero would never happen. However with the new 444d13ff10fb ("modules: add ro_after_init support") feature this happens frequently. Therefore do the right thing and make these two functions return gracefully if nothing should be done. Fixes crashes on module load like this one: Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 000003ff80008000 TEID: 000003ff80008407 Fault in home space mode while using kernel ASCE. AS:0000000000d18007 R3:00000001e6aa4007 S:00000001e6a10800 P:00000001e34ee21d Oops: 0004 ilc:3 [#1] SMP Modules linked in: x_tables CPU: 10 PID: 1 Comm: systemd Not tainted 4.7.0-11895-g3fa9045 #4 Hardware name: IBM 2964 N96 703 (LPAR) task: 00000001e9118000 task.stack: 00000001e9120000 Krnl PSW : 0704e00180000000 00000000005677f8 (rb_erase+0xf0/0x4d0) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 000003ff80008b20 000003ff80008b20 000003ff80008b70 0000000000b9d608 000003ff80008b20 0000000000000000 00000001e9123e88 000003ff80008950 00000001e485ab40 000003ff00000000 000003ff80008b00 00000001e4858480 0000000100000000 000003ff80008b68 00000000001d5998 00000001e9123c28 Krnl Code: 00000000005677e8: ec1801c3007c cgij %r1,0,8,567b6e 00000000005677ee: e32010100020 cg %r2,16(%r1) #00000000005677f4: a78401c2 brc 8,567b78 >00000000005677f8: e35010080024 stg %r5,8(%r1) 00000000005677fe: ec5801af007c cgij %r5,0,8,567b5c 0000000000567804: e30050000024 stg %r0,0(%r5) 000000000056780a: ebacf0680004 lmg %r10,%r12,104(%r15) 0000000000567810: 07fe bcr 15,%r14 Call Trace: ([<000003ff80008900>] __this_module+0x0/0xffffffffffffd700 [x_tables]) ([<0000000000264fd4>] do_init_module+0x12c/0x220) ([<00000000001da14a>] load_module+0x24e2/0x2b10) ([<00000000001da976>] SyS_finit_module+0xbe/0xd8) ([<0000000000803b26>] system_call+0xd6/0x264) Last Breaking-Event-Address: [<000000000056771a>] rb_erase+0x12/0x4d0 Kernel panic - not syncing: Fatal exception: panic_on_oops Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Reported-and-tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Fixes: e8a97e42dc98 ("s390/pageattr: allow kernel page table splitting") Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2016-08-024-117/+1707
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: - ARM: GICv3 ITS emulation and various fixes. Removal of the old VGIC implementation. - s390: support for trapping software breakpoints, nested virtualization (vSIE), the STHYI opcode, initial extensions for CPU model support. - MIPS: support for MIPS64 hosts (32-bit guests only) and lots of cleanups, preliminary to this and the upcoming support for hardware virtualization extensions. - x86: support for execute-only mappings in nested EPT; reduced vmexit latency for TSC deadline timer (by about 30%) on Intel hosts; support for more than 255 vCPUs. - PPC: bugfixes. * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits) KVM: PPC: Introduce KVM_CAP_PPC_HTM MIPS: Select HAVE_KVM for MIPS64_R{2,6} MIPS: KVM: Reset CP0_PageMask during host TLB flush MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX() MIPS: KVM: Sign extend MFC0/RDHWR results MIPS: KVM: Fix 64-bit big endian dynamic translation MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase MIPS: KVM: Use 64-bit CP0_EBase when appropriate MIPS: KVM: Set CP0_Status.KX on MIPS64 MIPS: KVM: Make entry code MIPS64 friendly MIPS: KVM: Use kmap instead of CKSEG0ADDR() MIPS: KVM: Use virt_to_phys() to get commpage PFN MIPS: Fix definition of KSEGX() for 64-bit KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD kvm: x86: nVMX: maintain internal copy of current VMCS KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures KVM: arm64: vgic-its: Simplify MAPI error handling KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers KVM: arm64: vgic-its: Turn device_id validation into generic ID validation ...
| * KVM: s390: backup the currently enabled gmap when scheduled outDavid Hildenbrand2016-06-201-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | Nested virtualization will have to enable own gmaps. Current code would enable the wrong gmap whenever scheduled out and back in, therefore resulting in the wrong gmap being enabled. This patch reenables the last enabled gmap, therefore avoiding having to touch vcpu->arch.gmap when enabling a different gmap. Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: don't fault everything in read-write in gmap_pte_op_fixup()David Hildenbrand2016-06-201-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let's not fault in everything in read-write but limit it to read-only where possible. When restricting access rights, we already have the required protection level in our hands. When reading from guest 2 storage (gmap_read_table), it is obviously PROT_READ. When shadowing a pte, the required protection level is given via the guest 2 provided pte. Based on an initial patch by Martin Schwidefsky. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: allow to check if a gmap shadow is validDavid Hildenbrand2016-06-201-0/+20
| | | | | | | | | | | | | | | | | | It will be very helpful to have a mechanism to check without any locks if a given gmap shadow is still valid and matches the given properties. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: remember the int code for the last gmap faultDavid Hildenbrand2016-06-201-0/+1
| | | | | | | | | | | | | | | | | | | | For nested virtualization, we want to know if we are handling a protection exception, because these can directly be forwarded to the guest without additional checks. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: limit number of real-space gmap shadowsDavid Hildenbrand2016-06-201-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have no known user of real-space designation and only support it to be architecture compliant. Gmap shadows with real-space designation are never unshadowed automatically, as there is nothing to protect for the top level table. So let's simply limit the number of such shadows to one by removing existing ones on creation of another one. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: support real-space for gmap shadowsDavid Hildenbrand2016-06-201-3/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can easily support real-space designation just like EDAT1 and EDAT2. So guest2 can provide for guest3 an asce with the real-space control being set. We simply have to allocate the biggest page table possible and fake all levels. There is no protection to consider. If we exceed guest memory, vsie code will inject an addressing exception (via program intercept). In the future, we could limit the fake table level to the gmap page table. As the top level page table can never go away, such gmap shadows will never get unshadowed, we'll have to come up with another way to limit the number of kept gmap shadows. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: support EDAT2 for gmap shadowsDavid Hildenbrand2016-06-201-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the guest is enabled for EDAT2, we can easily create shadows for guest2 -> guest3 provided tables that make use of EDAT2. If guest2 references a 2GB page, this memory looks consecutive for guest2, but it does not have to be so for us. Therefore we have to create fake segment and page tables. This works just like EDAT1 support, so page tables are removed when the parent table (r3t table entry) is changed. We don't hve to care about: - ACCF-Validity Control in RTTE - Access-Control Bits in RTTE - Fetch-Protection Bit in RTTE - Common-Region Bit in RTTE Just like for EDAT1, all bits might be dropped and there is no guaranteed that they are active. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: support EDAT1 for gmap shadowsDavid Hildenbrand2016-06-201-4/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the guest is enabled for EDAT1, we can easily create shadows for guest2 -> guest3 provided tables that make use of EDAT1. If guest2 references a 1MB page, this memory looks consecutive for guest2, but it might not be so for us. Therefore we have to create fake page tables. We can easily add that to our existing infrastructure. The invalidation mechanism will make sure that fake page tables are removed when the parent table (sgt table entry) is changed. As EDAT1 also introduced protection on all page table levels, we have to also shadow these correctly. We don't have to care about: - ACCF-Validity Control in STE - Access-Control Bits in STE - Fetch-Protection Bit in STE - Common-Segment Bit in STE As all bits might be dropped and there is no guaranteed that they are active ("unpredictable whether the CPU uses these bits", "may be used"). Without using EDAT1 in the shadow ourselfes (STE-format control == 0), simply shadowing these bits would not be enough. They would be ignored. Please note that we are using the "fake" flag to make this look consistent with further changes (EDAT2, real-space designation support) and don't let the shadow functions handle fc=1 stes. In the future, with huge pages in the host, gmap_shadow_pgt() could simply try to map a huge host page if "fake" is set to one and indicate via return value that no lower fake tables / shadow ptes are required. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: prepare for EDAT1/EDAT2 support in gmap shadowDavid Hildenbrand2016-06-201-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for EDAT1/EDAT2 support for gmap shadows, we have to store the requested edat level in the gmap shadow. The edat level used during shadow translation is a property of the gmap shadow. Depending on that level, the gmap shadow will look differently for the same guest tables. We have to store it internally in order to support it later. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: fix races on gmap_shadow creationDavid Hildenbrand2016-06-201-17/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before any thread is allowed to use a gmap_shadow, it has to be fully initialized. However, for invalidation to work properly, we have to register the new gmap_shadow before we protect the parent gmap table. Because locking is tricky, and we have to avoid duplicate gmaps, let's introduce an initialized field, that signalizes other threads if that gmap_shadow can already be used or if they have to retry. Let's properly return errors using ERR_PTR() instead of simply returning NULL, so a caller can properly react on the error. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: avoid races on region/segment/page table shadowingDavid Hildenbrand2016-06-201-27/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have to unlock sg->guest_table_lock in order to call gmap_protect_rmap(). If we sleep just before that call, another VCPU might pick up that shadowed page table (while it is not protected yet) and use it. In order to avoid these races, we have to introduce a third state - "origin set but still invalid" for an entry. This way, we can avoid another thread already using the entry before the table is fully protected. As soon as everything is set up, we can clear the invalid bit - if we had no race with the unshadowing code. Suggested-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: shadow pages with real guest requested protectionDavid Hildenbrand2016-06-202-16/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We really want to avoid manually handling protection for nested virtualization. By shadowing pages with the protection the guest asked us for, the SIE can handle most protection-related actions for us (e.g. special handling for MVPG) and we can directly forward protection exceptions to the guest. PTEs will now always be shadowed with the correct _PAGE_PROTECT flag. Unshadowing will take care of any guest changes to the parent PTE and any host changes to the host PTE. If the host PTE doesn't have the fitting access rights or is not available, we have to fix it up. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: flush tlb of shadows in all situationsDavid Hildenbrand2016-06-201-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | For now, the tlb of shadow gmap is only flushed when the parent is removed, not when it is removed upfront. Therefore other shadow gmaps can reuse the tables without the tlb getting flushed. Fix this by simply flushing the tlb 1. Before the shadow tables are removed (analogouos to other unshadow functions) 2. When the gmap is freed and therefore the top level pages are freed. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: add shadow gmap supportMartin Schwidefsky2016-06-204-31/+1200
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a nested KVM guest the outer KVM host needs to create shadow page tables for the nested guest. This patch adds the basic support to the guest address space (gmap) code. For each guest address space the inner KVM host creates, the first outer KVM host needs to create shadow page tables. The address space is identified by the ASCE loaded into the control register 1 at the time the inner SIE instruction for the second nested KVM guest is executed. The outer KVM host creates the shadow tables starting with the table identified by the ASCE on a on-demand basis. The outer KVM host will get repeated faults for all the shadow tables needed to run the second KVM guest. While a shadow page table for the second KVM guest is active the access to the origin region, segment and page tables needs to be restricted for the first KVM guest. For region and segment and page tables the first KVM guest may read the memory, but write attempt has to lead to an unshadow. This is done using the page invalid and read-only bits in the page table of the first KVM guest. If the first guest re-accesses one of the origin pages of a shadow, it gets a fault and the affected parts of the shadow page table hierarchy needs to be removed again. PGSTE tables don't have to be shadowed, as all interpretation assist can't deal with the invalid bits in the shadow pte being set differently than the original ones provided by the first KVM guest. Many bug fixes and improvements by David Hildenbrand. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: add reference counter to gmap structureMartin Schwidefsky2016-06-201-20/+70
| | | | | | | | | | | | | | | | | | | | Let's use a reference counter mechanism to control the lifetime of gmap structures. This will be needed for further changes related to gmap shadows. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: extended gmap pte notifierMartin Schwidefsky2016-06-202-46/+178
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current gmap pte notifier forces a pte into to a read-write state. If the pte is invalidated the gmap notifier is called to inform KVM that the mapping will go away. Extend this approach to allow read-write, read-only and no-access as possible target states and call the pte notifier for any change to the pte. This mechanism is used to temporarily set specific access rights for a pte without doing the heavy work of a true mprotect call. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: use RCU for gmap notifier list and the per-mm gmap listMartin Schwidefsky2016-06-202-24/+31
| | | | | | | | | | | | | | | | | | The gmap notifier list and the gmap list in the mm_struct change rarely. Use RCU to optimize the reader of these lists. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/kvm: page table invalidation notifierMartin Schwidefsky2016-06-201-3/+16
| | | | | | | | | | | | | | | | | | | | Pass an address range to the page table invalidation notifier for KVM. This allows to notify changes that affect a larger virtual memory area, e.g. for 1MB pages. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * KVM: s390: handle missing storage-key facilityDavid Hildenbrand2016-06-101-0/+37
| | | | | | | | | | | | | | | | | | | | Without the storage-key facility, SIE won't interpret SSKE, ISKE and RRBE for us. So let's add proper interception handlers that will be called if lazy sske cannot be enabled. Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * KVM: s390: pfmf: support conditional-sske facilityDavid Hildenbrand2016-06-101-0/+33
| | | | | | | | | | | | | | | | | | | | | | | | We already indicate that facility but don't implement it in our pfmf interception handler. Let's add a new storage key handling function for conditionally setting the guest storage key. As we will reuse this function later on, let's directly implement returning the old key via parameter and indicating if any change happened via rc. Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: return key via pointer in get_guest_storage_keyDavid Hildenbrand2016-06-101-6/+6
| | | | | | | | | | | | | | | | | | Let's just split returning the key and reporting errors. This makes calling code easier and avoids bugs as happened already. Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: simplify get_guest_storage_keyDavid Hildenbrand2016-06-101-13/+4
| | | | | | | | | | | | | | | | | | We can safe a few LOC and make that function easier to understand by rewriting existing code. Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: set and get guest storage key mmap lockingMartin Schwidefsky2016-06-101-12/+3
| | | | | | | | | | | | | | | | | | | | | | | | Move the mmap semaphore locking out of set_guest_storage_key and get_guest_storage_key. This makes the two functions more like the other ptep_xxx operations and allows to avoid repeated semaphore operations if multiple keys are read or written. Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * s390/mm: don't drop errors in get_guest_storage_keyDavid Hildenbrand2016-06-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | Commit 1e133ab296f3 ("s390/mm: split arch/s390/mm/pgtable.c") changed the return value of get_guest_storage_key to an unsigned char, resulting in -EFAULT getting interpreted as a valid storage key. Cc: stable@vger.kernel.org # 4.6+ Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
* | s390/mm: clean up pte/pmd encodingGerald Schaefer2016-07-311-14/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The hugetlbfs pte<->pmd conversion functions currently assume that the pmd bit layout is consistent with the pte layout, which is not really true. The SW read and write bits are encoded as the sequence "wr" in a pte, but in a pmd it is "rw". The hugetlbfs conversion assumes that the sequence is identical in both cases, which results in swapped read and write bits in the pmd. In practice this is not a problem, because those pmd bits are only relevant for THP pmds and not for hugetlbfs pmds. The hugetlbfs code works on (fake) ptes, and the converted pte bits are correct. There is another variation in pte/pmd encoding which affects dirty prot-none ptes/pmds. In this case, a pmd has both its HW read-only and invalid bit set, while it is only the invalid bit for a pte. This also has no effect in practice, but it should better be consistent. This patch fixes both inconsistencies by changing the SW read/write bit layout for pmds as well as the PAGE_NONE encoding for ptes. It also makes the hugetlbfs conversion functions more robust by introducing a move_set_bit() macro that uses the pte/pmd bit #defines instead of constant shifts. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2016-07-261-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge updates from Andrew Morton: - a few misc bits - ocfs2 - most(?) of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits) thp: fix comments of __pmd_trans_huge_lock() cgroup: remove unnecessary 0 check from css_from_id() cgroup: fix idr leak for the first cgroup root mm: memcontrol: fix documentation for compound parameter mm: memcontrol: remove BUG_ON in uncharge_list mm: fix build warnings in <linux/compaction.h> mm, thp: convert from optimistic swapin collapsing to conservative mm, thp: fix comment inconsistency for swapin readahead functions thp: update Documentation/{vm/transhuge,filesystems/proc}.txt shmem: split huge pages beyond i_size under memory pressure thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE khugepaged: add support of collapse for tmpfs/shmem pages shmem: make shmem_inode_info::lock irq-safe khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page() thp: extract khugepaged from mm/huge_memory.c shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings shmem: add huge pages support shmem: get_unmapped_area align huge page shmem: prepare huge= mount option and sysfs knob mm, rmap: account shmem thp pages ...
| * | mm: do not pass mm_struct into handle_mm_faultKirill A. Shutemov2016-07-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | We always have vma->vm_mm around. Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2016-07-2610-151/+493
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "There are a couple of new things for s390 with this merge request: - a new scheduling domain "drawer" is added to reflect the unusual topology found on z13 machines. Performance tests showed up to 8 percent gain with the additional domain. - the new crc-32 checksum crypto module uses the vector-galois-field multiply and sum SIMD instruction to speed up crc-32 and crc-32c. - proper __ro_after_init support, this requires RO_AFTER_INIT_DATA in the generic vmlinux.lds linker script definitions. - kcov instrumentation support. A prerequisite for that is the inline assembly basic block cleanup, which is the reason for the net/iucv/iucv.c change. - support for 2GB pages is added to the hugetlbfs backend. Then there are two removals: - the oprofile hardware sampling support is dead code and is removed. The oprofile user space uses the perf interface nowadays. - the ETR clock synchronization is removed, this has been superseeded be the STP clock synchronization. And it always has been "interesting" code.. And the usual bug fixes and cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (82 commits) s390/pci: Delete an unnecessary check before the function call "pci_dev_put" s390/smp: clean up a condition s390/cio/chp : Remove deprecated create_singlethread_workqueue s390/chsc: improve channel path descriptor determination s390/chsc: sanitize fmt check for chp_desc determination s390/cio: make fmt1 channel path descriptor optional s390/chsc: fix ioctl CHSC_INFO_CU command s390/cio/device_ops: fix kernel doc s390/cio: allow to reset channel measurement block s390/console: Make preferred console handling more consistent s390/mm: fix gmap tlb flush issues s390/mm: add support for 2GB hugepages s390: have unique symbol for __switch_to address s390/cpuinfo: show maximum thread id s390/ptrace: clarify bits in the per_struct s390: stack address vs thread_info s390: remove pointless load within __switch_to s390: enable kcov support s390/cpumf: use basic block for ecctr inline assembly s390/hypfs: use basic block for diag inline assembly ...
| * | s390/mm: fix gmap tlb flush issuesDavid Hildenbrand2016-07-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __tlb_flush_asce() should never be used if multiple asce belong to a mm. As this function changes mm logic determining if local or global tlb flushes will be neded, we might end up flushing only the gmap asce on all CPUs and a follow up mm asce flushes will only flush on the local CPU, although that asce ran on multiple CPUs. The missing tlb flushes will provoke strange faults in user space and even low address protections in user space, crashing the kernel. Fixes: 1b948d6caec4 ("s390/mm,tlb: optimize TLB flushing for zEC12") Cc: stable@vger.kernel.org # 3.15+ Reported-by: Sascha Silbe <silbe@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | s390/mm: add support for 2GB hugepagesGerald Schaefer2016-07-064-40/+176
| | | | | | | | | | | | | | | | | | | | | | | | This adds support for 2GB hugetlbfs pages on s390. Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | s390/mm: use basic block for essa inline assemblyHeiko Carstens2016-06-281-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use only simple inline assemblies which consist of a single basic block if the register asm construct is being used. Otherwise gcc would generate broken code if the compiler option --sanitize-coverage=trace-pc would be used. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | s390/mm: fix compile for PAGE_DEFAULT_KEY != 0Heiko Carstens2016-06-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The usual problem for code that is ifdef'ed out is that it doesn't compile after a while. That's also the case for the storage key initialisation code, if it would be used (set PAGE_DEFAULT_KEY to something not zero): ./arch/s390/include/asm/page.h: In function 'storage_key_init_range': ./arch/s390/include/asm/page.h:36:2: error: implicit declaration of function '__storage_key_init_range' Since the code itself has been useful for debugging purposes several times, remove the ifdefs and make sure the code gets compiler coverage. The cost for this is eight bytes. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | s390: avoid extable collisionsHeiko Carstens2016-06-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have some inline assemblies where the extable entry points to a label at the end of an inline assembly which is not followed by an instruction. On the other hand we have also inline assemblies where the extable entry points to the first instruction of an inline assembly. If a first type inline asm (extable point to empty label at the end) would be directly followed by a second type inline asm (extable points to first instruction) then we would have two different extable entries that point to the same instruction but would have a different target address. This can lead to quite random behaviour, depending on sorting order. I verified that we currently do not have such collisions within the kernel. However to avoid such subtle bugs add a couple of nop instructions to those inline assemblies which contain an extable that points to an empty label. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | s390: add proper __ro_after_init supportHeiko Carstens2016-06-132-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On s390 __ro_after_init is currently mapped to __read_mostly which means that data marked as __ro_after_init will not be protected. Reason for this is that the common code __ro_after_init implementation is x86 centric: the ro_after_init data section was added to rodata, since x86 enables write protection to kernel text and rodata very late. On s390 we have write protection for these sections enabled with the initial page tables. So adding the ro_after_init data section to rodata does not work on s390. In order to make __ro_after_init work properly on s390 move the ro_after_init data, right behind rodata. Unlike the rodata section it will be marked read-only later after all init calls happened. This s390 specific implementation adds new __start_ro_after_init and __end_ro_after_init labels. Everything in between will be marked read-only after the init calls happened. In addition to the __ro_after_init data move also the exception table there, since from a practical point of view it fits the __ro_after_init requirements. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | s390/mm: simplify the TLB flushing codeMartin Schwidefsky2016-06-132-23/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ptep_flush_lazy and pmdp_flush_lazy use mm->context.attach_count to decide between a lazy TLB flush vs an immediate TLB flush. The field contains two 16-bit counters, the number of CPUs that have the mm attached and can create TLB entries for it and the number of CPUs in the middle of a page table update. The __tlb_flush_asce, ptep_flush_direct and pmdp_flush_direct functions use the attach counter and a mask check with mm_cpumask(mm) to decide between a local flush local of the current CPU and a global flush. For all these functions the decision between lazy vs immediate and local vs global TLB flush can be based on CPU masks. There are two masks: the mm->context.cpu_attach_mask with the CPUs that are actively using the mm, and the mm_cpumask(mm) with the CPUs that have used the mm since the last full flush. The decision between lazy vs immediate flush is based on the mm->context.cpu_attach_mask, to decide between local vs global flush the mm_cpumask(mm) is used. With this patch all checks will use the CPU masks, the old counter mm->context.attach_count with its two 16-bit values is turned into a single counter mm->context.flush_count that keeps track of the number of CPUs with incomplete page table updates. The sole user of this counter is finish_arch_post_lock_switch() which waits for the end of all page table updates. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | s390/mm: fix vunmap vs finish_arch_post_lock_switchMartin Schwidefsky2016-06-131-2/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vunmap_pte_range() function calls ptep_get_and_clear() without any locking. ptep_get_and_clear() uses ptep_xchg_lazy()/ptep_flush_direct() for the page table update. ptep_flush_direct requires that preemption is disabled, but without any locking this is not the case. If the kernel preempts the task while the attach_counter is increased an endless loop in finish_arch_post_lock_switch() will occur the next time the task is scheduled. Add explicit preempt_disable()/preempt_enable() calls to the relevant functions in arch/s390/mm/pgtable.c. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | s390/mm: align swapper_pg_dir to 16kHeiko Carstens2016-06-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The segment/region table that is part of the kernel image must be properly aligned to 16k in order to make the crdte inline assembly work. Otherwise it will calculate a wrong segment/region table start address and access incorrect memory locations if the swapper_pg_dir is not aligned to 16k. Therefore define BSS_FIRST_SECTIONS in order to put the swapper_pg_dir at the beginning of the bss section and also align the bss section to 16k just like other architectures did. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| * | s390/pgtable: add mapping statisticsHeiko Carstens2016-06-132-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add statistics that show how memory is mapped within the kernel identity mapping. This is more or less the same like git commit ce0c0e50f94e ("x86, generic: CPA add statistics about state of direct mapping v4") for x86. I also intentionally copied the lower case "k" within DirectMap4k vs the upper case "M" and "G" within the two other lines. Let's have consistent inconsistencies across architectures. The output of /proc/meminfo now contains these additional lines: DirectMap4k: 2048 kB DirectMap1M: 3991552 kB DirectMap2G: 4194304 kB The implementation on s390 is lockless unlike the x86 version, since I assume changes to the kernel mapping are a very rare event. Therefore it really doesn't matter if these statistics could potentially be inconsistent if read while kernel pages tables are being changed. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
OpenPOWER on IntegriCloud