summaryrefslogtreecommitdiffstats
path: root/arch/x86/kernel/setup.c
Commit message (Collapse)AuthorAgeFilesLines
* ACPI / x86: Cleanup initrd related codeLv Zheng2016-04-181-3/+9
| | | | | | | | | | | | | In arch/x86/kernel/setup.c, the #ifdef kept for CONFIG_ACPI actually is related to the accessibility of initrd_start/initrd_end, so the stub should be provided from this source file and should only be dependent on CONFIG_BLK_DEV_INITRD. Note that when ACPI=n and BLK_DEV_INITRD=y, early_initrd_acpi_init() is still a stub because of the stub prepared for early_acpi_table_init(). Signed-off-by: Lv Zheng <lv.zheng@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* ACPI / tables: Move table override mechanisms to tables.cLv Zheng2016-04-181-1/+1
| | | | | | | | | | | | | | | | | This patch moves acpi_os_table_override() and acpi_os_physical_table_override() to tables.c. Along with the mechanisms, acpi_initrd_initialize_tables() is also moved to tables.c to form a static function. The following functions are renamed according to this change: 1. acpi_initrd_override() -> renamed to early_acpi_table_init(), which invokes acpi_table_initrd_init() 2. acpi_os_physical_table_override() -> which invokes acpi_table_initrd_override() 3. acpi_initialize_initrd_tables() -> renamed to acpi_table_initrd_scan() Signed-off-by: Lv Zheng <lv.zheng@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* Revert "x86: remove the kernel code/data/bss resources from /proc/iomem"Linus Torvalds2016-04-141-0/+37
| | | | | | | | | | | | | | | | | | This reverts commit c4004b02f8e5b9ce357a0bb1641756cc86962664. Sadly, my hope that nobody would actually use the special kernel entries in /proc/iomem were dashed by kexec. Which reads /proc/iomem explicitly to find the kernel base address. Nasty. Anyway, that means we can't do the sane and simple thing and just remove the entries, and we'll instead have to mask them out based on permissions. Reported-by: Zhengyu Zhang <zhezhang@redhat.com> Reported-by: Dave Young <dyoung@redhat.com> Reported-by: Freeman Zhang <freeman.zhang1992@gmail.com> Reported-by: Emrah Demir <ed@abdsec.com> Reported-by: Baoquan He <bhe@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* x86: remove the kernel code/data/bss resources from /proc/iomemLinus Torvalds2016-04-061-37/+0
| | | | | | | | | | | | | | | Let's see if anybody even notices. I doubt anybody uses this, and it does expose addresses that should be randomized, so let's just remove the code. It's old and traditional, and it used to be cute, but we should have removed this long ago. If it turns out anybody notices and this breaks something, we'll have to revert this, and maybe we'll end up using other approaches instead (using %pK or similar). But removing unnecessary code is always the preferred option. Noted-by: Emrah Demir <ed@abdsec.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'mm-pkeys-for-linus' of ↵Linus Torvalds2016-03-201-0/+9
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 protection key support from Ingo Molnar: "This tree adds support for a new memory protection hardware feature that is available in upcoming Intel CPUs: 'protection keys' (pkeys). There's a background article at LWN.net: https://lwn.net/Articles/643797/ The gist is that protection keys allow the encoding of user-controllable permission masks in the pte. So instead of having a fixed protection mask in the pte (which needs a system call to change and works on a per page basis), the user can map a (handful of) protection mask variants and can change the masks runtime relatively cheaply, without having to change every single page in the affected virtual memory range. This allows the dynamic switching of the protection bits of large amounts of virtual memory, via user-space instructions. It also allows more precise control of MMU permission bits: for example the executable bit is separate from the read bit (see more about that below). This tree adds the MM infrastructure and low level x86 glue needed for that, plus it adds a high level API to make use of protection keys - if a user-space application calls: mmap(..., PROT_EXEC); or mprotect(ptr, sz, PROT_EXEC); (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice this special case, and will set a special protection key on this memory range. It also sets the appropriate bits in the Protection Keys User Rights (PKRU) register so that the memory becomes unreadable and unwritable. So using protection keys the kernel is able to implement 'true' PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies PROT_READ as well. Unreadable executable mappings have security advantages: they cannot be read via information leaks to figure out ASLR details, nor can they be scanned for ROP gadgets - and they cannot be used by exploits for data purposes either. We know about no user-space code that relies on pure PROT_EXEC mappings today, but binary loaders could start making use of this new feature to map binaries and libraries in a more secure fashion. There is other pending pkeys work that offers more high level system call APIs to manage protection keys - but those are not part of this pull request. Right now there's a Kconfig that controls this feature (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled (like most x86 CPU feature enablement code that has no runtime overhead), but it's not user-configurable at the moment. If there's any serious problem with this then we can make it configurable and/or flip the default" * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/mm/pkeys: Fix mismerge of protection keys CPUID bits mm/pkeys: Fix siginfo ABI breakage caused by new u64 field x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA mm/core, x86/mm/pkeys: Add execute-only protection keys support x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags x86/mm/pkeys: Allow kernel to modify user pkey rights register x86/fpu: Allow setting of XSAVE state x86/mm: Factor out LDT init from context init mm/core, x86/mm/pkeys: Add arch_validate_pkey() mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits() x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU x86/mm/pkeys: Add Kconfig prompt to existing config option x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps x86/mm/pkeys: Dump PKRU with other kernel registers mm/core, x86/mm/pkeys: Differentiate instruction fetches x86/mm/pkeys: Optimize fault handling in access_error() mm/core: Do not enforce PKEY permissions on remote mm access um, pkeys: Add UML arch_*_access_permitted() methods mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys x86/mm/gup: Simplify get_user_pages() PTE bit handling ...
| * x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smapsDave Hansen2016-02-181-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The protection key can now be just as important as read/write permissions on a VMA. We need some debug mechanism to help figure out if it is in play. smaps seems like a logical place to expose it. arch/x86/kernel/setup.c is a bit of a weirdo place to put this code, but it already had seq_file.h and there was not a much better existing place to put it. We also use no #ifdef. If protection keys is .config'd out we will effectively get the same function as if we used the weak generic function. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Joerg Roedel <jroedel@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Mark Williamson <mwilliamson@undo-software.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210227.4F8EB3F8@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | x86/e820: Set System RAM type and descriptorToshi Kani2016-01-301-3/+3
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change e820_reserve_resources() to set 'flags' and 'desc' from e820 types. Set E820_RESERVED_KERN and E820_RAM's (System RAM) io resource type to IORESOURCE_SYSTEM_RAM. Do the same for "Kernel data", "Kernel code", and "Kernel bss", which are child nodes of System RAM. I/O resource descriptor is set to 'desc' for entries that are (and will be) target ranges of walk_iomem_res() and region_intersects(). Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Joerg Roedel <jroedel@suse.de> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Mark Salter <msalter@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: WANG Chao <chaowang@redhat.com> Cc: linux-arch@vger.kernel.org Cc: linux-mm <linux-mm@kvack.org> Link: http://lkml.kernel.org/r/1453841853-11383-5-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds2016-01-111-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "The main changes in this cycle were: - make the debugfs 'kernel_page_tables' file read-only, as it only has read ops. (Borislav Petkov) - micro-optimize clflush_cache_range() (Chris Wilson) - swiotlb enhancements, which fixes certain KVM emulated devices (Igor Mammedov) - fix an LDT related debug message (Jan Beulich) - modularize CONFIG_X86_PTDUMP (Kees Cook) - tone down an overly alarming warning (Laura Abbott) - Mark variable __initdata (Rasmus Villemoes) - PAT additions (Toshi Kani)" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Micro-optimise clflush_cache_range() x86/mm/pat: Change free_memtype() to support shrinking case x86/mm/pat: Add untrack_pfn_moved for mremap x86/mm: Drop WARN from multi-BAR check x86/LDT: Print the real LDT base address x86/mm/64: Enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN x86/mm: Introduce max_possible_pfn x86/mm/ptdump: Make (debugfs)/kernel_page_tables read-only x86/mm/mtrr: Mark the 'range_new' static variable in mtrr_calc_range_state() as __initdata x86/mm: Turn CONFIG_X86_PTDUMP into a module
| * x86/mm: Introduce max_possible_pfnIgor Mammedov2015-12-061-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | max_possible_pfn will be used for tracking max possible PFN for memory that isn't present in E820 table and could be hotplugged later. By default max_possible_pfn is initialized with max_pfn, but later it could be updated with highest PFN of hotpluggable memory ranges declared in ACPI SRAT table if any present. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akataria@vmware.com Cc: fujita.tomonori@lab.ntt.co.jp Cc: konrad.wilk@oracle.com Cc: pbonzini@redhat.com Cc: revers@redhat.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1449234426-273049-2-git-send-email-imammedo@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | x86/microcode: Initialize the driver late when facilities are upBorislav Petkov2015-11-231-2/+0
|/ | | | | | | | | | | | | | | | | | | | Running microcode_init() from setup_arch() is a bad idea because not even kmalloc() is ready at that point and the loader does all kinds of allocations and init/registration with various subsystems. Make it a late initcall when required facilities are initialized so that the microcode driver initialization can succeed too. Reported-and-tested-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20151120112400.GC4028@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/setup: Fix low identity map for >= 2GB kernel rangeKrzysztof Mazur2015-11-071-1/+1
| | | | | | | | | | | | | | | | | | | The commit f5f3497cad8c extended the low identity mapping. However, if the kernel uses more than 2 GB (VMSPLIT_2G_OPT or VMSPLIT_1G memory split), the normal memory mapping is overwritten by the low identity mapping causing a crash. To avoid overwritting, limit the low identity map to cover only memory before kernel range (PAGE_OFFSET). Fixes: f5f3497cad8c "x86/setup: Extend low identity map to cover whole kernel range Signed-off-by: Krzysztof Mazur <krzysiek@podlesie.net> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Laszlo Ersek <lersek@redhat.com> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1446815916-22105-1-git-send-email-krzysiek@podlesie.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* Merge branch 'ras-core-for-linus' of ↵Linus Torvalds2015-11-031-47/+55
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS changes from Ingo Molnar: "The main system reliability related changes were from x86, but also some generic RAS changes: - AMD MCE error injection subsystem enhancements. (Aravind Gopalakrishnan) - Fix MCE and CPU hotplug interaction bug. (Ashok Raj) - kcrash bootup robustness fix. (Baoquan He) - kcrash cleanups. (Borislav Petkov) - x86 microcode driver rework: simplify it by unmodularizing it and other cleanups. (Borislav Petkov)" * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/mce: Add a default case to the switch in __mcheck_cpu_ancient_init() x86/mce: Add a Scalable MCA vendor flags bit MAINTAINERS: Unify the microcode driver section x86/microcode/intel: Move #ifdef DEBUG inside the function x86/microcode/amd: Remove maintainers from comments x86/microcode: Remove modularization leftovers x86/microcode: Merge the early microcode loader x86/microcode: Unmodularize the microcode driver x86/mce: Fix thermal throttling reporting after kexec kexec/crash: Say which char is the unrecognized x86/setup/crash: Check memblock_reserve() retval x86/setup/crash: Cleanup some more x86/setup/crash: Remove alignment variable x86/setup: Cleanup crashkernel reservation functions x86/amd_nb, EDAC: Rename amd_get_node_id() x86/setup: Do not reserve crashkernel high memory if low reservation failed x86/microcode/amd: Do not overwrite final patch levels x86/microcode/amd: Extract current patch level read to a function x86/ras/mce_amd_inj: Inject bank 4 errors on the NBC x86/ras/mce_amd_inj: Trigger deferred and thresholding errors interrupts ...
| * x86/microcode: Unmodularize the microcode driverBorislav Petkov2015-10-211-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make CONFIG_MICROCODE a bool. It was practically a bool already anyway, since early loader was forcing it to =y. Regardless, there's no real reason to have something be a module which gets built-in on the majority of installations out there. And its not like there's noticeable change in functionality - we still can load late microcode - just the module glue disappears. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/1445334889-300-2-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * x86/setup/crash: Check memblock_reserve() retvalBorislav Petkov2015-10-211-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | memblock_reserve() can fail but the crashkernel reservation code doesn't check that and this can lead the user into believing that the crashkernel region was actually reserved. Make sure we check that return value and we exit early with a failure message in the error case. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Young <dyoung@redhat.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Chao <chaowang@redhat.com> Cc: jerry_hoemann@hp.com Link: http://lkml.kernel.org/r/1445246268-26285-7-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * x86/setup/crash: Cleanup some moreBorislav Petkov2015-10-211-12/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Remove unused auto_set variable * Cleanup local function variable declarations * Reformat printk string and use pr_info() No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Young <dyoung@redhat.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Chao <chaowang@redhat.com> Cc: jerry_hoemann@hp.com Link: http://lkml.kernel.org/r/1445246268-26285-6-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * x86/setup/crash: Remove alignment variableBorislav Petkov2015-10-211-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use a macro instead. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Young <dyoung@redhat.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Chao <chaowang@redhat.com> Cc: jerry_hoemann@hp.com Link: http://lkml.kernel.org/r/1445246268-26285-5-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * x86/setup: Cleanup crashkernel reservation functionsBorislav Petkov2015-10-211-22/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Shorten variable names * Realign code, space out for better readability No code changed: # arch/x86/kernel/setup.o: text data bss dec hex filename 4543 3096 69904 77543 12ee7 setup.o.before 4543 3096 69904 77543 12ee7 setup.o.after md5: 8a1b7c6738a553ca207b56bd84a8f359 setup.o.before.asm 8a1b7c6738a553ca207b56bd84a8f359 setup.o.after.asm Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Young <dyoung@redhat.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Chao <chaowang@redhat.com> Cc: jerry_hoemann@hp.com Link: http://lkml.kernel.org/r/1445246268-26285-4-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * x86/setup: Do not reserve crashkernel high memory if low reservation failedBaoquan He2015-10-211-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | People reported that when allocating crashkernel memory using the ",high" and ",low" syntax, there were cases where the reservation of the high portion succeeds but the reservation of the low portion fails. Then kexec can load the kdump kernel successfully, but booting the kdump kernel fails as there's no low memory. The low memory allocation for the kdump kernel can fail on large systems for a couple of reasons. For example, the manually specified crashkernel low memory can be too large and thus no adequate memblock region would be found. Therefore, we try to reserve low memory for the crash kernel *after* the high memory portion has been allocated. If that fails, we free crashkernel high memory too and return. The user can then take measures accordingly. Tested-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Baoquan He <bhe@redhat.com> [ Massage text. ] Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Joerg Roedel <jroedel@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Young <dyoung@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Chao <chaowang@redhat.com> Cc: jerry_hoemann@hp.com Cc: yinghai@kernel.org Link: http://lkml.kernel.org/r/1445246268-26285-2-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge branch 'core-efi-for-linus' of ↵Linus Torvalds2015-11-031-1/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI changes from Ingo Molnar: "The main changes in this cycle were: - further EFI code generalization to make it more workable for ARM64 - various extensions, such as 64-bit framebuffer address support, UEFI v2.5 EFI_PROPERTIES_TABLE support - code modularization simplifications and cleanups - new debugging parameters - various fixes and smaller additions" * 'core-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) efi: Fix warning of int-to-pointer-cast on x86 32-bit builds efi: Use correct type for struct efi_memory_map::phys_map x86/efi: Fix kernel panic when CONFIG_DEBUG_VIRTUAL is enabled efi: Add "efi_fake_mem" boot option x86/efi: Rename print_efi_memmap() to efi_print_memmap() efi: Auto-load the efi-pstore module efi: Introduce EFI_NX_PE_DATA bit and set it from properties table efi: Add support for UEFIv2.5 Properties table efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format() efifb: Add support for 64-bit frame buffer addresses efi/arm64: Clean up efi_get_fdt_params() interface arm64: Use core efi=debug instead of uefi_debug command line parameter efi/x86: Move efi=debug option parsing to core drivers/firmware: Make efi/esrt.c driver explicitly non-modular efi: Use the generic efi.memmap instead of 'memmap' acpi/apei: Use appropriate pgprot_t to map GHES memory arm64, acpi/apei: Implement arch_apei_get_mem_attributes() arm64/mm: Add PROT_DEVICE_nGnRnE and PROT_NORMAL_WT acpi, x86: Implement arch_apei_get_mem_attributes() efi, x86: Rearrange efi_mem_attributes() ...
| * \ Merge tag 'efi-next' of ↵Ingo Molnar2015-10-141-1/+3
| |\ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into core/efi Pull v4.4 EFI updates from Matt Fleming: - Make the EFI System Resource Table (ESRT) driver explicitly non-modular by ripping out the module_* code since Kconfig doesn't allow it to be built as a module anyway. (Paul Gortmaker) - Make the x86 efi=debug kernel parameter, which enables EFI debug code and output, generic and usable by arm64. (Leif Lindholm) - Add support to the x86 EFI boot stub for 64-bit Graphics Output Protocol frame buffer addresses. (Matt Fleming) - Detect when the UEFI v2.5 EFI_PROPERTIES_TABLE feature is enabled in the firmware and set an efi.flags bit so the kernel knows when it can apply more strict runtime mapping attributes - Ard Biesheuvel - Auto-load the efi-pstore module on EFI systems, just like we currently do for the efivars module. (Ben Hutchings) - Add "efi_fake_mem" kernel parameter which allows the system's EFI memory map to be updated with additional attributes for specific memory ranges. This is useful for testing the kernel code that handles the EFI_MEMORY_MORE_RELIABLE memmap bit even if your firmware doesn't include support. (Taku Izumi) Note: there is a semantic conflict between the following two commits: 8a53554e12e9 ("x86/efi: Fix multiple GOP device support") ae2ee627dc87 ("efifb: Add support for 64-bit frame buffer addresses") I fixed up the interaction in the merge commit, changing the type of current_fb_base from u32 to u64. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * efi: Add "efi_fake_mem" boot optionTaku Izumi2015-10-121-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces new boot option named "efi_fake_mem". By specifying this parameter, you can add arbitrary attribute to specific memory range. This is useful for debugging of Address Range Mirroring feature. For example, if "efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000" is specified, the original (firmware provided) EFI memmap will be updated so that the specified memory regions have EFI_MEMORY_MORE_RELIABLE attribute (0x10000): <original> efi: mem36: [Conventional Memory| | | | | | |WB|WT|WC|UC] range=[0x0000000100000000-0x00000020a0000000) (129536MB) <updated> efi: mem36: [Conventional Memory| |MR| | | | |WB|WT|WC|UC] range=[0x0000000100000000-0x0000000180000000) (2048MB) efi: mem37: [Conventional Memory| | | | | | |WB|WT|WC|UC] range=[0x0000000180000000-0x00000010a0000000) (61952MB) efi: mem38: [Conventional Memory| |MR| | | | |WB|WT|WC|UC] range=[0x00000010a0000000-0x0000001120000000) (2048MB) efi: mem39: [Conventional Memory| | | | | | |WB|WT|WC|UC] range=[0x0000001120000000-0x00000020a0000000) (63488MB) And you will find that the following message is output: efi: Memory: 4096M/131455M mirrored memory Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
* | | x86/setup: Extend low identity map to cover whole kernel rangePaolo Bonzini2015-10-161-0/+8
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On 32-bit systems, the initial_page_table is reused by efi_call_phys_prolog as an identity map to call SetVirtualAddressMap. efi_call_phys_prolog takes care of converting the current CPU's GDT to a physical address too. For PAE kernels the identity mapping is achieved by aliasing the first PDPE for the kernel memory mapping into the first PDPE of initial_page_table. This makes the EFI stub's trick "just work". However, for non-PAE kernels there is no guarantee that the identity mapping in the initial_page_table extends as far as the GDT; in this case, accesses to the GDT will cause a page fault (which quickly becomes a triple fault). Fix this by copying the kernel mappings from swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at identity mapping. For some reason, this is only reproducible with QEMU's dynamic translation mode, and not for example with KVM. However, even under KVM one can clearly see that the page table is bogus: $ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize $ gdb (gdb) target remote localhost:1234 (gdb) hb *0x02858f6f Hardware assisted breakpoint 1 at 0x2858f6f (gdb) c Continuing. Breakpoint 1, 0x02858f6f in ?? () (gdb) monitor info registers ... GDT= 0724e000 000000ff IDT= fffbb000 000007ff CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=00000690 ... The page directory is sane: (gdb) x/4wx 0x32b7000 0x32b7000: 0x03398063 0x03399063 0x0339a063 0x0339b063 (gdb) x/4wx 0x3398000 0x3398000: 0x00000163 0x00001163 0x00002163 0x00003163 (gdb) x/4wx 0x3399000 0x3399000: 0x00400003 0x00401003 0x00402003 0x00403003 but our particular page directory entry is empty: (gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4 0x32b7070: 0x00000000 [ It appears that you can skate past this issue if you don't receive any interrupts while the bogus GDT pointer is loaded, or if you avoid reloading the segment registers in general. Andy Lutomirski provides some additional insight: "AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to point to unmapped memory. Any attempt to use them (segment loads, interrupts, IRET, etc) will try to access that memory as if the access came from CPL 0 and, if the access fails, will generate a valid page fault with CR2 pointing into the GDT or LDT." Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI calls, not in the epilog/prolog calls") interrupts were disabled around the prolog and epilog calls, and the functional GDT was re-installed before interrupts were re-enabled. Which explains why no one has hit this issue until now. ] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reported-by: Laszlo Ersek <lersek@redhat.com> Cc: <stable@vger.kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Matt Fleming <matt.fleming@intel.com> [ Updated changelog. ]
* | kexec: split kexec_load syscall from kexec core codeDave Young2015-09-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two kexec load syscalls, kexec_load another and kexec_file_load. kexec_file_load has been splited as kernel/kexec_file.c. In this patch I split kexec_load syscall code to kernel/kexec.c. And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and use kexec_file_load only, or vice verse. The original requirement is from Ted Ts'o, he want kexec kernel signature being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use kexec_load syscall can bypass the checking. Vivek Goyal proposed to create a common kconfig option so user can compile in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects KEXEC_CORE so that old config files still work. Because there's general code need CONFIG_KEXEC_CORE, so I updated all the architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects KEXEC_CORE in arch Kconfig. Also updated general kernel code with to kexec_load syscall. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Dave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | x86: use generic early mem copyMark Salter2015-09-081-21/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The early_ioremap library now has a generic copy_from_early_mem() function. Use the generic copy function for x86 relocate_initrd(). [akpm@linux-foundation.org: remove MAX_MAP_CHUNK define, per Yinghai Lu] Signed-off-by: Mark Salter <msalter@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | x86/boot: Obsolete the MCA sys_desc_tablePaolo Pisati2015-07-211-5/+0
|/ | | | | | | | | | | | | | | | | | | | | The kernel does not support the MCA bus anymroe, so mark sys_desc_table as obsolete: remove any reference from the code together with the remaining of MCA logic. bloat-o-meter output: add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-55 (-55) function old new delta i386_start_kernel 128 119 -9 setup_arch 1421 1375 -46 Signed-off-by: Paolo Pisati <p.pisati@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1437409430-8491-1-git-send-email-p.pisati@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2015-07-041-4/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Two FPU rewrite related fixes. This addresses all known x86 regressions at this stage. Also some other misc fixes" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu: Fix boot crash in the early FPU code x86/asm/entry/64: Update path names x86/fpu: Fix FPU related boot regression when CPUID masking BIOS feature is enabled x86/boot/setup: Clean up the e820_reserve_setup_data() code x86/kaslr: Fix typo in the KASLR_FLAG documentation
| * Merge branch 'x86/boot' into x86/urgentIngo Molnar2015-06-301-4/+3
| |\ | | | | | | | | | | | | | | | Merge branch that got ready. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * x86/boot/setup: Clean up the e820_reserve_setup_data() codeWei Yang2015-06-051-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Deobfuscate the 'found' logic, it can be replaced with a simple: if (!pa_data) return; and 'found' can be eliminated. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1433398729-8314-1-git-send-email-weiyang@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | x86, mirror: x86 enabling - find mirrored memory rangesTony Luck2015-06-241-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory address ranges. See UEFI 2.5 spec pages 157-158: http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf On EFI enabled systems scan the memory map and tell memblock about any mirrored ranges. Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Xiexiuqi <xiexiuqi@huawei.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2015-06-231-1/+1
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching Pull livepatching fixes from Jiri Kosina: - symbol lookup locking fix, from Miroslav Benes - error handling improvements in case of failure of the module coming notifier, from Minfei Huang - we were too pessimistic when kASLR has been enabled on x86 and were dropping address hints on the floor unnecessarily in such case. Fix from Jiri Kosina - a few other small fixes and cleanups * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: add module locking around kallsyms calls livepatch: annotate klp_init() with __init livepatch: introduce patch/func-walking helpers livepatch: make kobject in klp_object statically allocated livepatch: Prevent patch inconsistencies if the coming module notifier fails livepatch: match return value to function signature x86: kaslr: fix build due to missing ALIGN definition livepatch: x86: make kASLR logic more accurate x86: introduce kaslr_offset()
| * | x86: introduce kaslr_offset()Jiri Kosina2015-04-291-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Offset that has been chosen for kaslr during kernel decompression can be easily computed as a difference between _text and __START_KERNEL. We are already making use of this in dump_kernel_offset() notifier and in arch_crash_save_vmcoreinfo(). Introduce kaslr_offset() that makes this computation instead of hard-coding it, so that other kernel code (such as live patching) can make use of it. Also convert existing users to make use of it. This patch is equivalent transofrmation without any effects on the resulting code: $ diff -u vmlinux.old.asm vmlinux.new.asm --- vmlinux.old.asm 2015-04-28 17:55:19.520983368 +0200 +++ vmlinux.new.asm 2015-04-28 17:55:24.141206072 +0200 @@ -1,5 +1,5 @@ -vmlinux.old: file format elf64-x86-64 +vmlinux.new: file format elf64-x86-64 Disassembly of section .text: $ Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | Merge branch 'x86-core-for-linus' of ↵Linus Torvalds2015-06-221-2/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Ingo Molnar: "There were so many changes in the x86/asm, x86/apic and x86/mm topics in this cycle that the topical separation of -tip broke down somewhat - so the result is a more traditional architecture pull request, collected into the 'x86/core' topic. The topics were still maintained separately as far as possible, so bisectability and conceptual separation should still be pretty good - but there were a handful of merge points to avoid excessive dependencies (and conflicts) that would have been poorly tested in the end. The next cycle will hopefully be much more quiet (or at least will have fewer dependencies). The main changes in this cycle were: * x86/apic changes, with related IRQ core changes: (Jiang Liu, Thomas Gleixner) - This is the second and most intrusive part of changes to the x86 interrupt handling - full conversion to hierarchical interrupt domains: [IOAPIC domain] ----- | [MSI domain] --------[Remapping domain] ----- [ Vector domain ] | (optional) | [HPET MSI domain] ----- | | [DMAR domain] ----------------------------- | [Legacy domain] ----------------------------- This now reflects the actual hardware and allowed us to distangle the domain specific code from the underlying parent domain, which can be optional in the case of interrupt remapping. It's a clear separation of functionality and removes quite some duct tape constructs which plugged the remap code between ioapic/msi/hpet and the vector management. - Intel IOMMU IRQ remapping enhancements, to allow direct interrupt injection into guests (Feng Wu) * x86/asm changes: - Tons of cleanups and small speedups, micro-optimizations. This is in preparation to move a good chunk of the low level entry code from assembly to C code (Denys Vlasenko, Andy Lutomirski, Brian Gerst) - Moved all system entry related code to a new home under arch/x86/entry/ (Ingo Molnar) - Removal of the fragile and ugly CFI dwarf debuginfo annotations. Conversion to C will reintroduce many of them - but meanwhile they are only getting in the way, and the upstream kernel does not rely on them (Ingo Molnar) - NOP handling refinements. (Borislav Petkov) * x86/mm changes: - Big PAT and MTRR rework: making the code more robust and preparing to phase out exposing direct MTRR interfaces to drivers - in favor of using PAT driven interfaces (Toshi Kani, Luis R Rodriguez, Borislav Petkov) - New ioremap_wt()/set_memory_wt() interfaces to support Write-Through cached memory mappings. This is especially important for good performance on NVDIMM hardware (Toshi Kani) * x86/ras changes: - Add support for deferred errors on AMD (Aravind Gopalakrishnan) This is an important RAS feature which adds hardware support for poisoned data. That means roughly that the hardware marks data which it has detected as corrupted but wasn't able to correct, as poisoned data and raises an APIC interrupt to signal that in the form of a deferred error. It is the OS's responsibility then to take proper recovery action and thus prolonge system lifetime as far as possible. - Add support for Intel "Local MCE"s: upcoming CPUs will support CPU-local MCE interrupts, as opposed to the traditional system- wide broadcasted MCE interrupts (Ashok Raj) - Misc cleanups (Borislav Petkov) * x86/platform changes: - Intel Atom SoC updates ... and lots of other cleanups, fixlets and other changes - see the shortlog and the Git log for details" * 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (222 commits) x86/hpet: Use proper hpet device number for MSI allocation x86/hpet: Check for irq==0 when allocating hpet MSI interrupts x86/mm/pat, drivers/infiniband/ipath: Use arch_phys_wc_add() and require PAT disabled x86/mm/pat, drivers/media/ivtv: Use arch_phys_wc_add() and require PAT disabled x86/platform/intel/baytrail: Add comments about why we disabled HPET on Baytrail genirq: Prevent crash in irq_move_irq() genirq: Enhance irq_data_to_desc() to support hierarchy irqdomain iommu, x86: Properly handle posted interrupts for IOMMU hotplug iommu, x86: Provide irq_remapping_cap() interface iommu, x86: Setup Posted-Interrupts capability for Intel iommu iommu, x86: Add cap_pi_support() to detect VT-d PI capability iommu, x86: Avoid migrating VT-d posted interrupts iommu, x86: Save the mode (posted or remapped) of an IRTE iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip iommu: dmar: Provide helper to copy shared irte fields iommu: dmar: Extend struct irte for VT-d Posted-Interrupts iommu: Add new member capability to struct irq_remap_ops x86/asm/entry/64: Disentangle error_entry/exit gsbase/ebx/usermode code x86/asm/entry/32: Shorten __audit_syscall_entry() args preparation x86/asm/entry/32: Explain reloading of registers after __audit_syscall_entry() ...
| * | x86: Remove more unmodified io_apic_opsThomas Gleixner2015-04-241-2/+1
| |/ | | | | | | | | | | | | | | io_apic_ops.init() is either NULL, if IO-APIC support is disabled at compile time or native_io_apic_init_mappings(). No point to have that as we can achieve the same thing with an empty inline. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | x86/crash: Allocate enough low memory when crashkernel=highJoerg Roedel2015-06-111-5/+7
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the crash kernel is loaded above 4GiB in memory, the first kernel allocates only 72MiB of low-memory for the DMA requirements of the second kernel. On systems with many devices this is not enough and causes device driver initialization errors and failed crash dumps. Testing by SUSE and Redhat has shown that 256MiB is a good default value for now and the discussion has lead to this value as well. So set this default value to 256MiB to make sure there is enough memory available for DMA. Signed-off-by: Joerg Roedel <jroedel@suse.de> [ Reflow comment. ] Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Baoquan He <bhe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jörg Rödel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/1433500202-25531-4-git-send-email-joro@8bytes.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds2015-04-131-4/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm changes from Ingo Molnar: "The main changes in this cycle were: - reduce the x86/32 PAE per task PGD allocation overhead from 4K to 0.032k (Fenghua Yu) - early_ioremap/memunmap() usage cleanups (Juergen Gross) - gbpages support cleanups (Luis R Rodriguez) - improve AMD Bulldozer (family 0x15) ASLR I$ aliasing workaround to increase randomization by 3 bits (per bootup) (Hector Marco-Gisbert) - misc fixlets" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Improve AMD Bulldozer ASLR workaround x86/mm/pat: Initialize __cachemode2pte_tbl[] and __pte2cachemode_tbl[] in a bit more readable fashion init.h: Clean up the __setup()/early_param() macros x86/mm: Simplify probe_page_size_mask() x86/mm: Further simplify 1 GB kernel linear mappings handling x86/mm: Use early_param_on_off() for direct_gbpages init.h: Add early_param_on_off() x86/mm: Simplify enabling direct_gbpages x86/mm: Use IS_ENABLED() for direct_gbpages x86/mm: Unexport set_memory_ro() and set_memory_rw() x86/mm, efi: Use early_ioremap() in arch/x86/platform/efi/efi-bgrt.c x86/mm: Use early_memunmap() instead of early_iounmap() x86/mm/pat: Ensure different messages in STRICT_DEVMEM and PAT cases x86/mm: Reduce PAE-mode per task pgd allocation overhead from 4K to 32 bytes
| * x86/mm: Use early_memunmap() instead of early_iounmap()Juergen Gross2015-02-241-4/+4
| | | | | | | | | | | | | | | | | | | | Memory mapped via early_memremap() should be unmapped with early_memunmap() instead of early_iounmap(). Signed-off-by: Juergen Gross <jgross@suse.com> Cc: matt.fleming@intel.com Link: http://lkml.kernel.org/r/1424769211-11378-2-git-send-email-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | x86/mm/KASLR: Propagate KASLR status to kernel properBorislav Petkov2015-04-031-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit: e2b32e678513 ("x86, kaslr: randomize module base load address") made module base address randomization unconditional and didn't regard disabled KKASLR due to CONFIG_HIBERNATION and command line option "nokaslr". For more info see (now reverted) commit: f47233c2d34f ("x86/mm/ASLR: Propagate base load address calculation") In order to propagate KASLR status to kernel proper, we need a single bit in boot_params.hdr.loadflags and we've chosen bit 1 thus leaving the top-down allocated bits for bits supposed to be used by the bootloader. Originally-From: Jiri Kosina <jkosina@suse.cz> Suggested-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Revert "x86/mm/ASLR: Propagate base load address calculation"Borislav Petkov2015-03-161-18/+4
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit: f47233c2d34f ("x86/mm/ASLR: Propagate base load address calculation") The main reason for the revert is that the new boot flag does not work at all currently, and in order to make this work, we need non-trivial changes to the x86 boot code which we didn't manage to get done in time for merging. And even if we did, they would've been too risky so instead of rushing things and break booting 4.1 on boxes left and right, we will be very strict and conservative and will take our time with this to fix and test it properly. Reported-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Baoquan He <bhe@redhat.com> Cc: H. Peter Anvin <hpa@linux.intel.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Junjie Mao <eternal.n08@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/20150316100628.GD22995@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2015-02-211-4/+18
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: "This contains: - EFI fixes - a boot printout fix - ASLR/kASLR fixes - intel microcode driver fixes - other misc fixes Most of the linecount comes from an EFI revert" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/ASLR: Avoid PAGE_SIZE redefinition for UML subarch x86/microcode/intel: Handle truncated microcode images more robustly x86/microcode/intel: Guard against stack overflow in the loader x86, mm/ASLR: Fix stack randomization on 64-bit systems x86/mm/init: Fix incorrect page size in init_memory_mapping() printks x86/mm/ASLR: Propagate base load address calculation Documentation/x86: Fix path in zero-page.txt x86/apic: Fix the devicetree build in certain configs Revert "efi/libstub: Call get_memory_map() to obtain map and desc sizes" x86/efi: Avoid triple faults during EFI mixed mode calls
| * Merge branch 'tip-x86-kaslr' of ↵Ingo Molnar2015-02-191-4/+18
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/urgent Pull ASLR and kASLR fixes from Borislav Petkov: - Add a global flag announcing KASLR state so that relevant code can do informed decisions based on its setting. (Jiri Kosina) - Fix a stack randomization entropy decrease bug. (Hector Marco-Gisbert) Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * x86/mm/ASLR: Propagate base load address calculationJiri Kosina2015-02-191-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit: e2b32e678513 ("x86, kaslr: randomize module base load address") makes the base address for module to be unconditionally randomized in case when CONFIG_RANDOMIZE_BASE is defined and "nokaslr" option isn't present on the commandline. This is not consistent with how choose_kernel_location() decides whether it will randomize kernel load base. Namely, CONFIG_HIBERNATION disables kASLR (unless "kaslr" option is explicitly specified on kernel commandline), which makes the state space larger than what module loader is looking at. IOW CONFIG_HIBERNATION && CONFIG_RANDOMIZE_BASE is a valid config option, kASLR wouldn't be applied by default in that case, but module loader is not aware of that. Instead of fixing the logic in module.c, this patch takes more generic aproach. It introduces a new bootparam setup data_type SETUP_KASLR and uses that to pass the information whether kaslr has been applied during kernel decompression, and sets a global 'kaslr_enabled' variable accordingly, so that any kernel code (module loading, livepatching, ...) can make decisions based on its value. x86 module loader is converted to make use of this flag. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Acked-by: Kees Cook <keescook@chromium.org> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Link: https://lkml.kernel.org/r/alpine.LNX.2.00.1502101411280.10719@pobox.suse.cz [ Always dump correct kaslr status when panicking ] Signed-off-by: Borislav Petkov <bp@suse.de>
* | | Merge branch 'perf-core-for-linus' of ↵Linus Torvalds2015-02-161-1/+1
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf updates from Ingo Molnar: "This series tightens up RDPMC permissions: currently even highly sandboxed x86 execution environments (such as seccomp) have permission to execute RDPMC, which may leak various perf events / PMU state such as timing information and other CPU execution details. This 'all is allowed' RDPMC mode is still preserved as the (non-default) /sys/devices/cpu/rdpmc=2 setting. The new default is that RDPMC access is only allowed if a perf event is mmap-ed (which is needed to correctly interpret RDPMC counter values in any case). As a side effect of these changes CR4 handling is cleaned up in the x86 code and a shadow copy of the CR4 value is added. The extra CR4 manipulation adds ~ <50ns to the context switch cost between rdpmc-capable and rdpmc-non-capable mms" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks perf/x86: Only allow rdpmc if a perf_event is mapped perf: Pass the event to arch_perf_update_userpage() perf: Add pmu callbacks to track event mapping and unmapping x86: Add a comment clarifying LDT context switching x86: Store a per-cpu shadow copy of CR4 x86: Clean up cr4 manipulation
| * | | x86: Store a per-cpu shadow copy of CR4Andy Lutomirski2015-02-041-1/+1
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Context switches and TLB flushes can change individual bits of CR4. CR4 reads take several cycles, so store a shadow copy of CR4 in a per-cpu variable. To avoid wasting a cache line, I added the CR4 shadow to cpu_tlbstate, which is already touched in switch_mm. The heaviest users of the cr4 shadow will be switch_mm and __switch_to_xtra, and __switch_to_xtra is called shortly after switch_mm during context switch, so the cacheline is likely to be hot. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | x86_64: add KASan supportAndrey Ryabinin2015-02-131-0/+3
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds arch specific code for kernel address sanitizer. 16TB of virtual addressed used for shadow memory. It's located in range [ffffec0000000000 - fffffc0000000000] between vmemmap and %esp fixup stacks. At early stage we map whole shadow region with zero page. Latter, after pages mapped to direct mapping address range we unmap zero pages from corresponding shadow (see kasan_map_shadow()) and allocate and map a real shadow memory reusing vmemmap_populate() function. Also replace __pa with __pa_nodebug before shadow initialized. __pa with CONFIG_DEBUG_VIRTUAL=y make external function call (__phys_addr) __phys_addr is instrumented, so __asan_load could be called before shadow area initialized. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Jim Davis <jim.epost@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | x86, setup: Let early_memremap() handle page alignmentWANG Chao2015-01-231-5/+3
|/ | | | | | | | | | | | | | | | early_memremap() takes care of page alignment and map size, so we can just remap the required data size and get rid of the adjustments in the setup code. [tglx: Massaged changelog ] Signed-off-by: WANG Chao <chaowang@redhat.com> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Bryan O'Donoghue <pure.logic@nexus-software.ie> Link: http://lkml.kernel.org/r/1420628150-16872-1-git-send-email-chaowang@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* Merge branch 'x86-vdso-for-linus' of ↵Linus Torvalds2014-12-101-2/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 vdso updates from Ingo Molnar: "Various vDSO updates from Andy Lutomirski, mostly cleanups and reorganization to improve maintainability, but also some micro-optimizations and robustization changes" * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86_64/vsyscall: Restore orig_ax after vsyscall seccomp x86_64: Add a comment explaining the TASK_SIZE_MAX guard page x86_64,vsyscall: Make vsyscall emulation configurable x86_64, vsyscall: Rewrite comment and clean up headers in vsyscall code x86_64, vsyscall: Turn vsyscalls all the way off when vsyscall==none x86,vdso: Use LSL unconditionally for vgetcpu x86: vdso: Fix build with older gcc x86_64/vdso: Clean up vgetcpu init and merge the vdso initcalls x86_64/vdso: Remove jiffies from the vvar page x86/vdso: Make the PER_CPU segment 32 bits x86/vdso: Make the PER_CPU segment start out accessed x86/vdso: Change the PER_CPU segment to use struct desc_struct x86_64/vdso: Move getcpu code from vsyscall_64.c to vdso/vma.c x86_64/vsyscall: Move all of the gate_area code to vsyscall_64.c
| * x86_64,vsyscall: Make vsyscall emulation configurableAndy Lutomirski2014-11-031-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds CONFIG_X86_VSYSCALL_EMULATION, guarded by CONFIG_EXPERT. Turning it off completely disables vsyscall emulation, saving ~3.5k for vsyscall_64.c, 4k for vsyscall_emu_64.S (the fake vsyscall page), some tiny amount of core mm code that supports a gate area, and possibly 4k for a wasted pagetable. The latter is because the vsyscall addresses are misaligned and fit poorly in the fixmap. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/406db88b8dd5f0cbbf38216d11be34bbb43c7eae.1414618407.git.luto@amacapital.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | x86, mpx: On-demand kernel allocation of bounds tablesDave Hansen2014-11-181-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is really the meat of the MPX patch set. If there is one patch to review in the entire series, this is the one. There is a new ABI here and this kernel code also interacts with userspace memory in a relatively unusual manner. (small FAQ below). Long Description: This patch adds two prctl() commands to provide enable or disable the management of bounds tables in kernel, including on-demand kernel allocation (See the patch "on-demand kernel allocation of bounds tables") and cleanup (See the patch "cleanup unused bound tables"). Applications do not strictly need the kernel to manage bounds tables and we expect some applications to use MPX without taking advantage of this kernel support. This means the kernel can not simply infer whether an application needs bounds table management from the MPX registers. The prctl() is an explicit signal from userspace. PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to require kernel's help in managing bounds tables. PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel won't allocate and free bounds tables even if the CPU supports MPX. PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds directory out of a userspace register (bndcfgu) and then cache it into a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT will set "bd_addr" to an invalid address. Using this scheme, we can use "bd_addr" to determine whether the management of bounds tables in kernel is enabled. Also, the only way to access that bndcfgu register is via an xsaves, which can be expensive. Caching "bd_addr" like this also helps reduce the cost of those xsaves when doing table cleanup at munmap() time. Unfortunately, we can not apply this optimization to #BR fault time because we need an xsave to get the value of BNDSTATUS. ==== Why does the hardware even have these Bounds Tables? ==== MPX only has 4 hardware registers for storing bounds information. If MPX-enabled code needs more than these 4 registers, it needs to spill them somewhere. It has two special instructions for this which allow the bounds to be moved between the bounds registers and some new "bounds tables". They are similar conceptually to a page fault and will be raised by the MPX hardware during both bounds violations or when the tables are not present. This patch handles those #BR exceptions for not-present tables by carving the space out of the normal processes address space (essentially calling the new mmap() interface indroduced earlier in this patch set.) and then pointing the bounds-directory over to it. The tables *need* to be accessed and controlled by userspace because the instructions for moving bounds in and out of them are extremely frequent. They potentially happen every time a register pointing to memory is dereferenced. Any direct kernel involvement (like a syscall) to access the tables would obviously destroy performance. ==== Why not do this in userspace? ==== This patch is obviously doing this allocation in the kernel. However, MPX does not strictly *require* anything in the kernel. It can theoretically be done completely from userspace. Here are a few ways this *could* be done. I don't think any of them are practical in the real-world, but here they are. Q: Can virtual space simply be reserved for the bounds tables so that we never have to allocate them? A: As noted earlier, these tables are *HUGE*. An X-GB virtual area needs 4*X GB of virtual space, plus 2GB for the bounds directory. If we were to preallocate them for the 128TB of user virtual address space, we would need to reserve 512TB+2GB, which is larger than the entire virtual address space today. This means they can not be reserved ahead of time. Also, a single process's pre-popualated bounds directory consumes 2GB of virtual *AND* physical memory. IOW, it's completely infeasible to prepopulate bounds directories. Q: Can we preallocate bounds table space at the same time memory is allocated which might contain pointers that might eventually need bounds tables? A: This would work if we could hook the site of each and every memory allocation syscall. This can be done for small, constrained applications. But, it isn't practical at a larger scale since a given app has no way of controlling how all the parts of the app might allocate memory (think libraries). The kernel is really the only place to intercept these calls. Q: Could a bounds fault be handed to userspace and the tables allocated there in a signal handler instead of in the kernel? A: (thanks to tglx) mmap() is not on the list of safe async handler functions and even if mmap() would work it still requires locking or nasty tricks to keep track of the allocation state there. Having ruled out all of the userspace-only approaches for managing bounds tables that we could think of, we create them on demand in the kernel. Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | x86, cma: Reserve DMA contiguous area after initmem_init()Weijie Yang2014-10-281-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fengguang Wu reported a boot crash on the x86 platform via the 0-day Linux Kernel Performance Test: cma: dma_contiguous_reserve: reserving 31 MiB for global area BUG: Int 6: CR2 (null) [<41850786>] dump_stack+0x16/0x18 [<41d2b1db>] early_idt_handler+0x6b/0x6b [<41072227>] ? __phys_addr+0x2e/0xca [<41d4ee4d>] cma_declare_contiguous+0x3c/0x2d7 [<41d6d359>] dma_contiguous_reserve_area+0x27/0x47 [<41d6d4d1>] dma_contiguous_reserve+0x158/0x163 [<41d33e0f>] setup_arch+0x79b/0xc68 [<41d2b7cf>] start_kernel+0x9c/0x456 [<41d2b2ca>] i386_start_kernel+0x79/0x7d (See details at: https://lkml.org/lkml/2014/10/8/708) It is because dma_contiguous_reserve() is called before initmem_init() in x86, the variable high_memory is not initialized but accessed by __pa(high_memory) in dma_contiguous_reserve(). This patch moves dma_contiguous_reserve() after initmem_init() so that high_memory is initialized before accessed. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: iamjoonsoo.kim@lge.com Cc: 'Linux-MM' <linux-mm@kvack.org> Cc: 'Weijie Yang' <weijie.yang.kh@gmail.com> Link: http://lkml.kernel.org/r/000101cfef69%2431e528a0%2495af79e0%24%25yang@samsung.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86: Quark: Comment setup_arch() to document TLB/PGE bugBryan O'Donoghue2014-10-081-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Quark SoC X1000 advertises Page Global Enable for it's Translation Lookaside Buffer via cpuid. The silicon does not in fact support PGE and hence will not flush the TLB when CR4.PGE is rewritten. The Quark documentation makes clear the necessity to instead rewrite CR3 in order to flush any TLB entries, irrespective of the state of CR4.PGE or an individual PTE.PGE See Intel Quark Core DevMan_001.pdf section 6.4.11 In setup.c setup_arch() the code will load_cr3() and then do a __flush_tlb_all(). On Quark the entire TLB will be flushed at the load_cr3(). The __flush_tlb_all() have no effect and can be safely ignored. Later on in the boot process we switch off the flag for cpu_has_pge() which means that subsequent calls to __flush_tlb_all() will call __flush_tlb() not __flush_tlb_global() flushing the TLB in the correct way via load_cr3() not CR4.PGE rewrite This patch documents the behaviour of flushing the TLB for Quark in setup_arch() Comment text suggested by Thomas Gleixner Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie> Cc: davej@redhat.com Cc: hmh@hmh.eng.br Link: http://lkml.kernel.org/r/1412641189-12415-2-git-send-email-pure.logic@nexus-software.ie Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
OpenPOWER on IntegriCloud