op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	Merge branch 'x86-mm-for-linus' of ↵	Linus Torvalds	2013-02-21	1	-0/+5
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm changes from Peter Anvin: "This is a huge set of several partly interrelated (and concurrently developed) changes, which is why the branch history is messier than one would like. The really big items are two humonguous patchsets mostly developed by Yinghai Lu at my request, which completely revamps the way we create initial page tables. In particular, rather than estimating how much memory we will need for page tables and then build them into that memory -- a calculation that has shown to be incredibly fragile -- we now build them (on 64 bits) with the aid of a "pseudo-linear mode" -- a #PF handler which creates temporary page tables on demand. This has several advantages: 1. It makes it much easier to support things that need access to data very early (a followon patchset uses this to load microcode way early in the kernel startup). 2. It allows the kernel and all the kernel data objects to be invoked from above the 4 GB limit. This allows kdump to work on very large systems. 3. It greatly reduces the difference between Xen and native (Xen's equivalent of the #PF handler are the temporary page tables created by the domain builder), eliminating a bunch of fragile hooks. The patch series also gets us a bit closer to W^X. Additional work in this pull is the 64-bit get_user() work which you were also involved with, and a bunch of cleanups/speedups to __phys_addr()/__pa()." * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (105 commits) x86, mm: Move reserving low memory later in initialization x86, doc: Clarify the use of asm("%edx") in uaccess.h x86, mm: Redesign get_user with a __builtin_choose_expr hack x86: Be consistent with data size in getuser.S x86, mm: Use a bitfield to mask nuisance get_user() warnings x86/kvm: Fix compile warning in kvm_register_steal_time() x86-32: Add support for 64bit get_user() x86-32, mm: Remove reference to alloc_remap() x86-32, mm: Remove reference to resume_map_numa_kva() x86-32, mm: Rip out x86_32 NUMA remapping code x86/numa: Use __pa_nodebug() instead x86: Don't panic if can not alloc buffer for swiotlb mm: Add alloc_bootmem_low_pages_nopanic() x86, 64bit, mm: hibernate use generic mapping_init x86, 64bit, mm: Mark data/bss/brk to nx x86: Merge early kernel reserve for 32bit and 64bit x86: Add Crash kernel low reservation x86, kdump: Remove crashkernel range find limit for 64bit memblock: Add memblock_mem_size() x86, boot: Not need to check setup_header version for setup_data ...
\| *	x86: Move some contents of page_64_types.h into pgtable_64.h and page_64.h	Alexander Duyck	2012-11-16	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch is meant to clean-up the fact that we have several functions in page_64_types.h which really don't belong there. I found this issue when I had tried to replace __phys_addr with an inline function. It resulted in the realmode bits generating compile warnings about types. In order to resolve that I am relocating the address translation to page_64.h since this is in keeping with where these functions are located in 32 bit. In addtion I have relocated several functions defined in init_64.c to pgtable_64.h as this seems to be where most of the functions related to memory initialization were already located. [ hpa: added missing #include <asm/pgtable.h> to apic_numachip.c, as reported by Yinghai Lu. ] Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Link: http://lkml.kernel.org/r/20121116215244.8521.31505.stgit@ahduyck-cp1.jf.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Daniel J Blueman <daniel@numascale-asia.com>
* \|	x86/mm: Convert update_mmu_cache() and update_mmu_cache_pmd() to functions	Kirill A. Shutemov	2013-01-24	1	-3/+0
\|/ \| \| \| \| \| \| \| \| \| \|	Converting macros to functions unhide type problems before changes will be integrated and trigger problems on other architectures. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: Add and use update_mmu_cache_pmd() in transparent huge page code.	David Miller	2012-10-09	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The transparent huge page code passes a PMD pointer in as the third argument of update_mmu_cache(), which expects a PTE pointer. This never got noticed because X86 implements update_mmu_cache() as a macro and thus we don't get any type checking, and X86 is the only architecture which supports transparent huge pages currently. Before other architectures can support transparent huge pages properly we need to add a new interface which will take a PMD pointer as the third argument rather than a PTE pointer. [akpm@linux-foundation.org: implement update_mm_cache_pmd() for s390] Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level>	Joe Perches	2012-06-06	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use a more current logging style: - Bare printks should have a KERN_<LEVEL> for consistency's sake - Add pr_fmt where appropriate - Neaten some macro definitions - Convert some Ok output to OK - Use "%s: ", __func__ in pr_fmt for summit - Convert some printks to pr_<level> Message output is not identical in all cases. Signed-off-by: Joe Perches <joe@perches.com> Cc: levinsasha928@gmail.com Link: http://lkml.kernel.org/r/1337655007.24226.10.camel@joe2Laptop [ merged two similar patches, tidied up the changelog ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	thp: add x86 32bit support	Johannes Weiner	2011-01-13	1	-109/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for transparent hugepages to x86 32bit. Share the same VM_ bitflag for VM_MAPPED_COPY. mm/nommu.c will never support transparent hugepages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	thp: transparent hugepage core	Andrea Arcangeli	2011-01-13	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Lately I've been working to make KVM use hugepages transparently without the usual restrictions of hugetlbfs. Some of the restrictions I'd like to see removed: 1) hugepages have to be swappable or the guest physical memory remains locked in RAM and can't be paged out to swap 2) if a hugepage allocation fails, regular pages should be allocated instead and mixed in the same vma without any failure and without userland noticing 3) if some task quits and more hugepages become available in the buddy, guest physical memory backed by regular pages should be relocated on hugepages automatically in regions under madvise(MADV_HUGEPAGE) (ideally event driven by waking up the kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes not null) 4) avoidance of reservation and maximization of use of hugepages whenever possible. Reservation (needed to avoid runtime fatal faliures) may be ok for 1 machine with 1 database with 1 database cache with 1 database cache size known at boot time. It's definitely not feasible with a virtualization hypervisor usage like RHEV-H that runs an unknown number of virtual machines with an unknown size of each virtual machine with an unknown amount of pagecache that could be potentially useful in the host for guest not using O_DIRECT (aka cache=off). hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest both uses this patch (though the guest will limit the addition speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow paging or no-virtualization scenario. So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there would be no significant speedup in the tlb-miss runtime). The first (and more tedious) part of this work requires allowing the VM to handle anonymous hugepages mixed with regular pages transparently on regular anonymous vmas. This is what this patch tries to achieve in the least intrusive possible way. We want hugepages and hugetlb to be used in a way so that all applications can benefit without changes (as usual we leverage the KVM virtualization design: by improving the Linux VM at large, KVM gets the performance boost too). The most important design choice is: always fallback to 4k allocation if the hugepage allocation fails! This is the _very_ opposite of some large pagecache patches that failed with -EIO back then if a 64k (or similar) allocation failed... Second important decision (to reduce the impact of the feature on the existing pagetable handling code) is that at any time we can split an hugepage into 512 regular pages and it has to be done with an operation that can't fail. This way the reliability of the swapping isn't decreased (no need to allocate memory when we are short on memory to swap) and it's trivial to plug a split_huge_page* one-liner where needed without polluting the VM. Over time we can teach mprotect, mremap and friends to handle pmd_trans_huge natively without calling split_huge_page. The fact it can't fail isn't just for swap: if split_huge_page would return -ENOMEM (instead of the current void) we'd need to rollback the mprotect from the middle of it (ideally including undoing the split_vma) which would be a big change and in the very wrong direction (it'd likely be simpler not to call split_huge_page at all and to teach mprotect and friends to handle hugepages instead of rolling them back from the middle). In short the very value of split_huge_page is that it can't fail. The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and incremental and it'll just be an "harmless" addition later if this initial part is agreed upon. It also should be noted that locking-wise replacing regular pages with hugepages is going to be very easy if compared to what I'm doing below in split_huge_page, as it will only happen when page_count(page) matches page_mapcount(page) if we can take the PG_lock and mmap_sem in write mode. collapse_huge_page will be a "best effort" that (unlike split_huge_page) can fail at the minimal sign of trouble and we can try again later. collapse_huge_page will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will work similar to madvise(MADV_MERGEABLE). The default I like is that transparent hugepages are used at page fault time. This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set to three values "always", "madvise", "never" which mean respectively that hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions, or never used. /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage allocation should defrag memory aggressively "always", only inside "madvise" regions, or "never". The pmd_trans_splitting/pmd_trans_huge locking is very solid. The put_page (from get_user_page users that can't use mmu notifier like O_DIRECT) that runs against a __split_huge_page_refcount instead was a pain to serialize in a way that would result always in a coherent page count for both tail and head. I think my locking solution with a compound_lock taken only after the page_first is valid and is still a PageHead should be safe but it surely needs review from SMP race point of view. In short there is no current existing way to serialize the O_DIRECT final put_page against split_huge_page_refcount so I had to invent a new one (O_DIRECT loses knowledge on the mapping status by the time gup_fast returns so...). And I didn't want to impact all gup/gup_fast users for now, maybe if we change the gup interface substantially we can avoid this locking, I admit I didn't think too much about it because changing the gup unpinning interface would be invasive. If we ignored O_DIRECT we could stick to the existing compound refcounting code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu notifier user) would call it without FOLL_GET (and if FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the current task mmu notifier list yet). But O_DIRECT is fundamental for decent performance of virtualized I/O on fast storage so we can't avoid it to solve the race of put_page against split_huge_page_refcount to achieve a complete hugepage feature for KVM. Swap and oom works fine (well just like with regular pages ;). MMU notifier is handled transparently too, with the exception of the young bit on the pmd, that didn't have a range check but I think KVM will be fine because the whole point of hugepages is that EPT/NPT will also use a huge pmd when they notice gup returns pages with PageCompound set, so they won't care of a range and there's just the pmd young bit to check in that case. NOTE: in some cases if the L2 cache is small, this may slowdown and waste memory during COWs because 4M of memory are accessed in a single fault instead of 8k (the payoff is that after COW the program can run faster). So we might want to switch the copy_huge_page (and clear_huge_page too) to not temporal stores. I also extensively researched ways to avoid this cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that fully implemented prefault) but I concluded they're not worth it and they add an huge additional complexity and they remove all tlb benefits until the full hugepage has been faulted in, to save a little bit of memory and some cache during app startup, but they still don't improve substantially the cache-trashing during startup if the prefault happens in >4k chunks. One reason is that those 4k pte entries copied are still mapped on a perfectly cache-colored hugepage, so the trashing is the worst one can generate in those copies (cow of 4k page copies aren't so well colored so they trashes less, but again this results in software running faster after the page fault). Those prefault patches allowed things like a pte where post-cow pages were local 4k regular anon pages and the not-yet-cowed pte entries were pointing in the middle of some hugepage mapped read-only. If it doesn't payoff substantially with todays hardware it will payoff even less in the future with larger l2 caches, and the prefault logic would blot the VM a lot. If one is emebdded transparent_hugepage can be disabled during boot with sysfs or with the boot commandline parameter transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages inside madvise regions) that will ensure not a single hugepage is allocated at boot time. It is simple enough to just disable transparent hugepage globally and let transparent hugepages be allocated selectively by applications in the MADV_HUGEPAGE region (both at page fault time, and if enabled with the collapse_huge_page too through the kernel daemon). This patch supports only hugepages mapped in the pmd, archs that have smaller hugepages will not fit in this patch alone. Also some archs like power have certain tlb limits that prevents mixing different page size in the same regions so they will not fit in this framework that requires "graceful fallback" to basic PAGE_SIZE in case of physical memory fragmentation. hugetlbfs remains a perfect fit for those because its software limits happen to match the hardware limits. hugetlbfs also remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped to be found not fragmented after a certain system uptime and that would be very expensive to defragment with relocation, so requiring reservation. hugetlbfs is the "reservation way", the point of transparent hugepages is not to have any reservation at all and maximizing the use of cache and hugepages at all times automatically. Some performance result: vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep ages3 memset page fault 1566023 memset tlb miss 453854 memset second tlb miss 453321 random access tlb miss 41635 random access second tlb miss 41658 vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3 memset page fault 1566471 memset tlb miss 453375 memset second tlb miss 453320 random access tlb miss 41636 random access second tlb miss 41637 vmx andrea # ./largepages3 memset page fault 1566642 memset tlb miss 453417 memset second tlb miss 453313 random access tlb miss 41630 random access second tlb miss 41647 vmx andrea # ./largepages3 memset page fault 1566872 memset tlb miss 453418 memset second tlb miss 453315 random access tlb miss 41618 random access second tlb miss 41659 vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage vmx andrea # ./largepages3 memset page fault 2182476 memset tlb miss 460305 memset second tlb miss 460179 random access tlb miss 44483 random access second tlb miss 44186 vmx andrea # ./largepages3 memset page fault 2182791 memset tlb miss 460742 memset second tlb miss 459962 random access tlb miss 43981 random access second tlb miss 43988 ============ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #define SIZE (3UL102410241024) int main() { char p = malloc(SIZE), p2; struct timeval before, after; gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset page fault %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset tlb miss %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) p2 = 0; gettimeofday(&after, NULL); printf("random access tlb miss %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) p2 = 0; gettimeofday(&after, NULL); printf("random access second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); return 0; } ============ Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	thp: add pmd mangling functions to x86	Andrea Arcangeli	2011-01-13	1	-7/+112
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add needed pmd mangling functions with symmetry with their pte counterparts. pmdp_splitting_flush() is the only new addition on the pmd_ methods and it's needed to serialize the VM against split_huge_page. It simply atomically sets the splitting bit in a similar way pmdp_clear_flush_young atomically clears the accessed bit. pmdp_splitting_flush() also has to flush the tlb to make it effective against gup_fast, but it wouldn't really require to flush the tlb too. Just the tlb flush is the simplest operation we can invoke to serialize pmdp_splitting_flush() against gup_fast. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	thp: special pmd_trans_* functions	Andrea Arcangeli	2011-01-13	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These returns 0 at compile time when the config option is disabled, to allow gcc to eliminate the transparent hugepage function calls at compile time without additional #ifdefs (only the export of those functions have to be visible to gcc but they won't be required at link time and huge_memory.o can be not built at all). _PAGE_BIT_UNUSED1 is never used for pmd, only on pte. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: remove pte_*map_nested()	Peter Zijlstra	2010-10-26	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since we no longer need to provide KM_type, the whole pte_*map_nested() API is now redundant, remove it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Miller <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	x86, mm: Separate x86_64 vmalloc_sync_all() into separate functions	Haicheng Li	2010-08-26	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	No behavior change. Move some of vmalloc_sync_all() code into a new function sync_global_pgds() that will be useful for memory hotplug. Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> LKML-Reference: <4C6E4ECD.1090607@linux.intel.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
*	gcc-4.6: mm: fix unused but set warnings	Andi Kleen	2010-08-09	1	-2/+2
\| \| \| \| \| \| \| \|	No real bugs, just some dead code and some fixups. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself	Russell King	2010-02-20	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On VIVT ARM, when we have multiple shared mappings of the same file in the same MM, we need to ensure that we have coherency across all copies. We do this via make_coherent() by making the pages uncacheable. This used to work fine, until we allowed highmem with highpte - we now have a page table which is mapped as required, and is not available for modification via update_mmu_cache(). Ralf Beache suggested getting rid of the PTE value passed to update_mmu_cache(): On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables to construct a pointer to the pte again. Passing a pte_t * is much more elegant. Maybe we might even replace the pte argument with the pte_t? Ben Herrenschmidt would also like the pte pointer for PowerPC: Passing the ptep in there is exactly what I want. I want that -instead- of the PTE value, because I have issue on some ppc cases, for I$/D$ coherency, where set_pte_at() may decide to mask out the _PAGE_EXEC. So, pass in the mapped page table pointer into update_mmu_cache(), and remove the PTE value, updating all implementations and call sites to suit. Includes a fix from Stephen Rothwell: sparc: fix fallout from update_mmu_cache API change Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
*	x86, 64-bit: Clean up user address masking	Linus Torvalds	2009-06-20	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The discussion about using "access_ok()" in get_user_pages_fast() (see commit 7f8189068726492950bf1a2dcfd9b51314560abf: "x86: don't use 'access_ok()' as a range check in get_user_pages_fast()" for details and end result), made us notice that x86-64 was really being very sloppy about virtual address checking. So be way more careful and straightforward about masking x86-64 virtual addresses: - All the VIRTUAL_MASK* variants now cover half of the address space, it's not like we can use the full mask on a signed integer, and the larger mask just invites mistakes when applying it to either half of the 48-bit address space. - /proc/kcore's kc_offset_to_vaddr() becomes a lot more obvious when it transforms a file offset into a (kernel-half) virtual address. - Unify/simplify the 32-bit and 64-bit USER_DS definition to be based on TASK_SIZE_MAX. This cleanup and more careful/obvious user virtual address checking also uncovered a buglet in the x86-64 implementation of strnlen_user(): it would do an "access_ok()" check on the whole potential area, even if the string itself was much shorter, and thus return an error even for valid strings. Our sloppy checking had hidden this. So this fixes 'strnlen_user()' to do this properly, the same way we already handled user strings in 'strncpy_from_user()'. Namely by just checking the first byte, and then relying on fault handling for the rest. That always works, since we impose a guard page that cannot be mapped at the end of the user space address space (and even if we didn't, we'd have the address space hole). Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	x86: clean up declarations and variables	Jaswinder Singh Rajput	2009-04-12	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup, no code changed - syscalls.h update declarations due to unifications - irq.c declare smp_generic_interrupt() before it gets used - process.c declare sys_fork() and sys_vfork() before they get used - tsc.c rename tsc_khz shadowed variable - apic/probe_32.c declare apic_default before it gets used - apic/nmi.c prev_nmi_count should be unsigned - apic/io_apic.c declare smp_irq_move_cleanup_interrupt() before it gets used - mm/init.c declare direct_gbpages and free_initrd_mem before they get used Signed-off-by: Jaswinder Singh Rajput <jaswinder@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	x86: Split pgtable_64.h into pgtable_64_types.h and pgtable_64.h	Jeremy Fitzhardinge	2009-02-11	1	-46/+2
\| \| \| \|	Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
*	Merge commit 'remotes/tip/x86/paravirt' into x86/untangle2	Jeremy Fitzhardinge	2009-02-11	1	-1/+0
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* commit 'remotes/tip/x86/paravirt': (175 commits) xen: use direct ops on 64-bit xen: make direct versions of irq_enable/disable/save/restore to common code xen: setup percpu data pointers xen: fix 32-bit build resulting from mmu move x86/paravirt: return full 64-bit result x86, percpu: fix kexec with vmlinux x86/vmi: fix interrupt enable/disable/save/restore calling convention. x86/paravirt: don't restore second return reg xen: setup percpu data pointers x86: split loading percpu segments from loading gdt x86: pass in cpu number to switch_to_new_gdt() x86: UV fix uv_flush_send_and_wait() x86/paravirt: fix missing callee-save call on pud_val x86/paravirt: use callee-saved convention for pte_val/make_pte/etc x86/paravirt: implement PVOP_CALL macros for callee-save functions x86/paravirt: add register-saving thunks to reduce caller register pressure x86/paravirt: selectively save/restore regs around pvops calls x86: fix paravirt clobber in entry_64.S x86/pvops: add a paravirt_ident functions to allow special patching xen: move remaining mmu-related stuff into mmu.c ... Conflicts: arch/x86/mach-voyager/voyager_smp.c arch/x86/mm/fault.c
\| *	x86: remove pda.h	Brian Gerst	2009-01-20	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Signed-off-by: Brian Gerst <brgerst@gmail.com>
* \|	x86: unify io_remap_pfn_range	Jeremy Fitzhardinge	2009-02-06	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify io_remap_pfn_range. Don't demacro yet. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pgd_none	Jeremy Fitzhardinge	2009-02-06	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pgd_none. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pud_none	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pud_none. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pages_to_mb	Jeremy Fitzhardinge	2009-02-06	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pages_to_mb. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pmd_bad	Jeremy Fitzhardinge	2009-02-06	1	-5/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pmd_bad. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pgd_bad	Jeremy Fitzhardinge	2009-02-06	1	-5/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pgd_bad. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pgd_bad	Jeremy Fitzhardinge	2009-02-06	1	-5/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pgd_bad. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pud_large	Jeremy Fitzhardinge	2009-02-06	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pud_large. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pte_offset_kernel	Jeremy Fitzhardinge	2009-02-06	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pte_offset_kernel. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pte_index	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pte_index. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pmd_pfn	Jeremy Fitzhardinge	2009-02-06	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify pmd_pfn. Unfortunately it can't be demacroed because it has a cyclic dependency on linux/mm.h:page_to_nid(). Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pmd_pfn	Jeremy Fitzhardinge	2009-02-06	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pmd_pfn. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: remove redundant pfn_pmd definition	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup It's already defined in pgtable.h Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pmd_offset	Jeremy Fitzhardinge	2009-02-06	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pmd_offset. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pmd_index	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pmd_index. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pmd_page	Jeremy Fitzhardinge	2009-02-06	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pmd_page. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pmd_page_vaddr	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pmd_page_vaddr. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pud_offset	Jeremy Fitzhardinge	2009-02-06	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pud_offset. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pud_index	Jeremy Fitzhardinge	2009-02-06	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pud_index. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pgd_page	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pgd_page. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pud_page	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pud_page. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pud_page_vaddr	Jeremy Fitzhardinge	2009-02-06	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pud_page_vaddr. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pgd_page_vaddr	Jeremy Fitzhardinge	2009-02-06	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pgd_page_vaddr. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pmd_none	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pmd_none. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pmd_present	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pmd_present. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pgd_present	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pgd_present. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pud_present	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pud_present. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pte_present	Jeremy Fitzhardinge	2009-02-06	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pte_present. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pte_same	Jeremy Fitzhardinge	2009-02-06	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pte_same. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* \|	x86: unify pte_none	Jeremy Fitzhardinge	2009-02-06	1	-1/+0
\|/ \| \| \| \| \| \| \|	Impact: cleanup Unify and demacro pte_none. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
*-.	Merge branches 'x86/apic', 'x86/cleanups', 'x86/cpufeature', ↵	Ingo Molnar	2008-12-23	1	-11/+17
\|\ \ \| \| \| \| \| \| \| \| \|	'x86/crashdump', 'x86/debug', 'x86/defconfig', 'x86/detect-hyper', 'x86/doc', 'x86/dumpstack', 'x86/early-printk', 'x86/fpu', 'x86/idle', 'x86/io', 'x86/memory-corruption-check', 'x86/microcode', 'x86/mm', 'x86/mtrr', 'x86/nmi-watchdog', 'x86/pat2', 'x86/pci-ioapic-boot-irq-quirks', 'x86/ptrace', 'x86/quirks', 'x86/reboot', 'x86/setup-memory', 'x86/signal', 'x86/sparse-fixes', 'x86/time', 'x86/uv' and 'x86/xen' into x86/core
\| \| *	x86: PAT: change pgprot_noncached to uc_minus instead of strong uc - v3	venkatesh.pallipadi@intel.com	2008-12-18	1	-6/+0
\| \|/ \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Impact: mm behavior change. Make pgprot_noncached uc_minus instead of strong UC. This will make pgprot_noncached to be in line with ioremap_nocache() and all the other APIs that map page uc_minus on uc request. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>