path: root/arch/tile/mm/pgtable.c
Commit message | Author | Date | Files | Lines

* mm: move most file-based accounting to the node  (Mel Gorman, 2016-07-28, 1 file, -4/+4)

  There are now a number of accounting oddities such as mapped file pages
  being accounted for on the node while the total number of file pages is
  accounted on the zone.  This can be coped with to some extent but it's
  confusing, so this patch moves the relevant file-based accounting.  Due to
  throttling logic in the page allocator for reliable OOM detection, it is
  still necessary to track dirty and writeback pages on a per-zone basis.

  [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
  Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
  Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
  Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
  Acked-by: Vlastimil Babka <vbabka@suse.cz>
  Acked-by: Michal Hocko <mhocko@suse.com>
  Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
  Acked-by: Johannes Weiner <hannes@cmpxchg.org>
  Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
  Cc: Minchan Kim <minchan@kernel.org>
  Cc: Rik van Riel <riel@surriel.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

* mm: move page mapped accounting to the node  (Mel Gorman, 2016-07-28, 1 file, -1/+1)

  Reclaim makes decisions based on the number of pages that are mapped but
  it's mixing node and zone information.  Account NR_FILE_MAPPED and
  NR_ANON_PAGES pages on the node.

  Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net
  Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
  Acked-by: Vlastimil Babka <vbabka@suse.cz>
  Acked-by: Michal Hocko <mhocko@suse.com>
  Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
  Acked-by: Johannes Weiner <hannes@cmpxchg.org>
  Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
  Cc: Minchan Kim <minchan@kernel.org>
  Cc: Rik van Riel <riel@surriel.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

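  For illustration, a minimal sketch of the kind of accounting substitution
  this series makes; the helper name and call site below are assumptions for
  the example, not the actual hunk in pgtable.c:

      #include <linux/mm.h>
      #include <linux/vmstat.h>

      /* Hypothetical helper: unaccount a file page when its last mapping
       * goes away. */
      static void unaccount_mapped_page(struct page *page)
      {
          /* Before the series: __dec_zone_page_state(page, NR_FILE_MAPPED); */
          __dec_node_page_state(page, NR_FILE_MAPPED);   /* now a node counter */
      }
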
* mm, vmscan: move LRU lists to node  (Mel Gorman, 2016-07-28, 1 file, -4/+4)

  This moves the LRU lists from the zone to the node, along with related
  data such as counters, tracing, congestion tracking and writeback
  tracking.

  Unfortunately, due to reclaim and compaction retry logic, it is necessary
  to account for the number of LRU pages at both the zone and the node
  level.  Most reclaim logic is based on the node counters but the retry
  logic uses the zone counters, which do not distinguish inactive and
  active sizes.  It would be possible to leave the LRU counters on a
  per-zone basis but it's a heavier calculation across multiple cache lines
  that is much more frequent than the retry checks.

  Other than the LRU counters, this is mostly a mechanical patch but note
  that it introduces a number of anomalies.  For example, the scans are
  per-zone but using per-node counters.  We also mark a node as congested
  when a zone is congested.  This causes weird problems that are fixed
  later but is easier to review.

  In the event that there is excessive overhead on 32-bit systems due to
  the nodes being on LRU then there are two potential solutions:

  1. Long-term isolation of highmem pages when reclaim is lowmem

     When pages are skipped, they are immediately added back onto the LRU
     list.  If lowmem reclaim persisted for long periods of time, the same
     highmem pages get continually scanned.  The idea would be that lowmem
     keeps those pages on a separate list until a reclaim for highmem pages
     arrives that splices the highmem pages back onto the LRU.  It could
     potentially be implemented similar to the UNEVICTABLE list.  That
     would reduce the skip rate; the potential corner case is that highmem
     pages have to be scanned and reclaimed to free lowmem slab pages.

  2. Linear scan lowmem pages if the initial LRU shrink fails

     This will break LRU ordering but may be preferable and faster during
     memory pressure than skipping LRU pages.

  Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
  Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
  Acked-by: Johannes Weiner <hannes@cmpxchg.org>
  Acked-by: Vlastimil Babka <vbabka@suse.cz>
  Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
  Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
  Cc: Michal Hocko <mhocko@kernel.org>
  Cc: Minchan Kim <minchan@kernel.org>
  Cc: Rik van Riel <riel@surriel.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

* tile: get rid of superfluous __GFP_REPEAT  (Michal Hocko, 2016-06-24, 1 file, -1/+1)

  __GFP_REPEAT has a rather weak semantic but since it has been introduced
  around 2.6.12 it has been ignored for low order allocations.

  pgtable_alloc_one uses the __GFP_REPEAT flag for L2_USER_PGTABLE_ORDER
  but the order is either 0 or 3 if L2_KERNEL_PGTABLE_SHIFT for
  HPAGE_SHIFT.  This means that this flag has never been actually useful
  here because it has always been used only for non-costly
  (!PAGE_ALLOC_COSTLY) requests.

  Link: http://lkml.kernel.org/r/1464599699-30131-16-git-send-email-mhocko@kernel.org
  Signed-off-by: Michal Hocko <mhocko@suse.com>
  Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

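  As a sketch of the resulting allocation, under the assumption that the
  surrounding function and flag mix look roughly like the description above
  (this is not the exact tile code):

      #include <linux/gfp.h>
      #include <linux/mm.h>

      static struct page *l2_pgtable_alloc_sketch(void)
      {
          /* was: GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO; the __GFP_REPEAT
           * bit never had any effect at this low order, so it is dropped. */
          return alloc_pages(GFP_KERNEL | __GFP_ZERO, L2_USER_PGTABLE_ORDER);
      }
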
* tile: Use the more common pr_warn instead of pr_warning  (Joe Perches, 2014-11-11, 1 file, -3/+1)

  And other message logging neatening.

  Other miscellanea:

  o coalesce formats
  o realign arguments
  o standardize a couple of macros
  o use __func__ instead of embedding the function name

  Signed-off-by: Joe Perches <joe@perches.com>
  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

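  A small hypothetical example of the logging style this cleanup moves
  toward (the function and message text are made up):

      #include <linux/printk.h>

      static void report_alloc_failure(void)
      {
          /* Old style: deprecated pr_warning() with the function name
           * embedded in the format string. */
          pr_warning("report_alloc_failure: could not allocate cache\n");

          /* Preferred style: pr_warn() plus __func__, so the name can
           * never go stale. */
          pr_warn("%s: could not allocate cache\n", __func__);
      }
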
* tile: handle pgtable_page_ctor() fail  (Kirill A. Shutemov, 2013-11-15, 1 file, -1/+5)

  Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
  Acked-by: Chris Metcalf <cmetcalf@tilera.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

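  The failure handling this commit adds follows the usual pattern; a sketch
  under assumptions (the helper name and order argument are illustrative,
  not the literal tile diff):

      #include <linux/mm.h>
      #include <linux/gfp.h>

      static struct page *pgtable_alloc_sketch(gfp_t gfp, unsigned int order)
      {
          struct page *p = alloc_pages(gfp, order);

          if (!p)
              return NULL;
          if (!pgtable_page_ctor(p)) {    /* ctor may fail (split ptlock) */
              __free_pages(p, order);     /* so hand the page back */
              return NULL;
          }
          return p;
      }
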
* tile: use pmd_pfn() instead of casting via pte_t  (Chris Metcalf, 2013-09-13, 1 file, -2/+1)

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

* tile: add virt_to_kpte() API and clean up and document behavior  (Chris Metcalf, 2013-09-03, 1 file, -2/+20)

  We use virt_to_pte(NULL, va) a lot, which isn't very obvious.  I added
  virt_to_kpte(va) as a more obvious wrapper function, that also validates
  the va as being a kernel address.

  And, I fixed the semantics of virt_to_pte() so that we handle the pud
  and pmd the same way, and we now document the fact that we handle the
  final pte level differently.

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

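  A minimal sketch of the wrapper as described above; the exact validation
  performed is an assumption:

      /* virt_to_kpte(): look up the kernel PTE for a kernel virtual address. */
      pte_t *virt_to_kpte(unsigned long kaddr)
      {
          BUG_ON(kaddr < PAGE_OFFSET);        /* must be a kernel address */
          return virt_to_pte(NULL, kaddr);    /* NULL mm selects the kernel tables */
      }
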
* tile: handle super huge pages in virt_to_pte  (Chris Metcalf, 2013-08-30, 1 file, -0/+3)

  This tile-specific API had a minor bug, in that if a super huge (>4GB)
  page mapped a particular address range, we wouldn't handle it correctly.
  As part of fixing that bug, I also cleaned up some of the pud and pmd
  accessors to make them more consistent.

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

* tile: remove set/clear_fixmap APIs  (Chris Metcalf, 2013-08-30, 1 file, -49/+0)

  Nothing in the codebase was using them, and as written they took
  "unsigned long" as the physical address rather than "phys_addr_t", which
  is wrong on tilepro anyway.  Rather than fixing stale APIs, just remove
  them; if there's ever demand for them on this platform, we can put them
  back.

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

* tile: avoid struct vm_struct leak  (Chris Metcalf, 2013-08-13, 1 file, -1/+1)

  If ioremap_prot() fails in ioremap_page_range() due to kernel memory
  exhaustion, we previously would leak a struct vm_struct.

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

* mm, vmalloc: change iterating a vmlist to find_vm_area()  (Joonsoo Kim, 2013-04-29, 1 file, -6/+1)

  This patchset removes vm_struct list management after initializing
  vmalloc.  Adding and removing an entry to vmlist takes linear time, so
  it is inefficient.  If we maintain this list, the overall time complexity
  of adding and removing an area to vmalloc space is O(N), even though we
  use an rbtree for finding a vacant place, whose time complexity is just
  O(log N).

  vmlist and vmlist_lock are also used in many places outside of vmalloc.c.
  It is preferable to hide these raw data structures and provide
  well-defined functions for supporting them, because that prevents
  mistakes when manipulating these structures and makes the vmalloc layer
  easier to maintain.

  For kexec and makedumpfile, I export vmap_area_list instead of vmlist.
  This comes from Atsushi's recommendation.  For more information, please
  refer to the link below.

  https://lkml.org/lkml/2012/12/6/184

  This patch: The purpose of iterating the vmlist is to find the vm area
  with a specific virtual address.  find_vm_area() is provided for this
  purpose and is more efficient, because it uses an rbtree.  So change it.

  Signed-off-by: Joonsoo Kim <js1304@gmail.com>
  Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
  Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>
  Acked-by: Ingo Molnar <mingo@kernel.org>
  Acked-by: Chris Metcalf <cmetcalf@tilera.com>
  Cc: Thomas Gleixner <tglx@linutronix.de>
  Cc: "H. Peter Anvin" <hpa@zytor.com>
  Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
  Cc: Dave Anderson <anderson@redhat.com>
  Cc: Eric Biederman <ebiederm@xmission.com>
  Cc: Vivek Goyal <vgoyal@redhat.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

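  A sketch of the lookup change described under "This patch" (the wrapper
  and variable names are assumptions):

      #include <linux/vmalloc.h>

      static struct vm_struct *lookup_area_sketch(void *addr)
      {
          /* Before: walk the global vmlist under vmlist_lock, O(N):
           *     for (area = vmlist; area; area = area->next)
           *             if (area->addr == addr)
           *                     break;
           */

          /* After: a single rbtree lookup, O(log N), no raw list access. */
          return find_vm_area(addr);
      }
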
* swap: add per-partition lock for swapfile  (Shaohua Li, 2013-02-23, 1 file, -1/+1)

  swap_lock is heavily contended when I test swap to 3 fast SSDs (even
  slightly slower than swap to 2 such SSDs).  The main contention comes
  from swap_info_get().  This patch tries to fix the gap by adding a new
  per-partition lock.

  Global data like nr_swapfiles, total_swap_pages, least_priority and
  swap_list are still protected by swap_lock.

  nr_swap_pages is an atomic now, so it can be changed without swap_lock.
  In theory, it's possible that get_swap_page() finds no swap pages even
  though there are actually free swap pages, but that doesn't sound like a
  big problem.

  Accessing partition-specific data (like scan_swap_map and so on) is only
  protected by swap_info_struct.lock.

  Changing swap_info_struct.flags needs to hold both swap_lock and
  swap_info_struct.lock, because scan_swap_map() will check it.  Reading
  the flags is OK with either of the locks held.

  If both swap_lock and swap_info_struct.lock must be held, we always hold
  the former first to avoid deadlock.

  swap_entry_free() can change swap_list.  To delete that code, we add a
  new highest_priority_index.  Whenever get_swap_page() is called, we check
  it.  If it's valid, we use it.

  It's a pity get_swap_page() still holds swap_lock.  But in practice,
  swap_lock isn't heavily contended in my test with this patch (or I can
  say there are other much heavier bottlenecks like TLB flush).  And BTW,
  it looks like get_swap_page() doesn't really need the lock: we never free
  swap_info[] and we check the SWAP_WRITEOK flag.  The only risk without
  the lock is that we could swap out to some low-priority swap, but we can
  quickly recover after several rounds of swap, so it doesn't sound like a
  big deal to me.  But I'd prefer to fix this if it's a real problem.

  "swap: make each swap partition have one address_space" improved the
  swapout speed from 1.7G/s to 2G/s.  This patch further improves the speed
  to 2.3G/s, so around 15% improvement.  It's a multi-process test, so TLB
  flush isn't the biggest bottleneck before the patches.

  [arnd@arndb.de: fix it for nommu]
  [hughd@google.com: add missing unlock]
  [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
  Signed-off-by: Shaohua Li <shli@fusionio.com>
  Cc: Hugh Dickins <hughd@google.com>
  Cc: Rik van Riel <riel@redhat.com>
  Cc: Minchan Kim <minchan.kim@gmail.com>
  Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
  Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
  Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
  Cc: Stephen Rothwell <sfr@canb.auug.org.au>
  Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  Signed-off-by: Hugh Dickins <hughd@google.com>
  Signed-off-by: Minchan Kim <minchan@kernel.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

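  A sketch of the lock-ordering rule spelled out above; swap_lock here
  stands for the swapfile.c-internal global lock, and the si->lock field
  name and flag update are assumptions:

      #include <linux/swap.h>
      #include <linux/spinlock.h>

      static void set_partition_flag_sketch(struct swap_info_struct *si,
                                            unsigned long flag)
      {
          spin_lock(&swap_lock);      /* global lock is always taken first  */
          spin_lock(&si->lock);       /* ... then the per-partition lock    */
          si->flags |= flag;          /* flag changes need both locks held  */
          spin_unlock(&si->lock);
          spin_unlock(&swap_lock);
      }
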
* arch/tile: tilegx PCI root complex support  (Chris Metcalf, 2012-07-18, 1 file, -7/+0)

  This change implements PCIe root complex support for tilegx using the
  kernel support layer for accessing the TRIO hardware shim.

  Reviewed-by: Bjorn Helgaas <bhelgaas@google.com> [changes in 07487f3]
  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

* arch/tile: support multiple huge page sizes dynamically  (Chris Metcalf, 2012-05-25, 1 file, -13/+0)

  This change adds support for a new "super" bit in the PTE, using the new
  arch_make_huge_pte() method.  The Tilera hypervisor sees the bit set at a
  given level of the page table and gangs together 4, 16, or 64 consecutive
  pages from that level of the hierarchy to create a larger TLB entry.

  One extra "super" page size can be specified at each of the three levels
  of the page table hierarchy on tilegx, using the "hugepagesz" argument on
  the boot command line.  A new hypervisor API is added to allow Linux to
  tell the hypervisor how many PTEs to gang together at each level of the
  page table.

  To allow pre-allocating huge pages larger than the buddy allocator can
  handle, this change modifies the Tilera bootmem support to put all of
  memory on tilegx platforms into bootmem.

  As part of this change I eliminate the vestigial CONFIG_HIGHPTE support,
  which never worked anyway, and eliminate the hv_page_size() API in favor
  of the standard vma_kernel_pagesize() API.

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

* arch/tile: Allow tilegx to build with either 16K or 64K page size  (Chris Metcalf, 2012-05-25, 1 file, -15/+12)

  This change introduces new flags for the hv_install_context() API that
  passes a page table pointer to the hypervisor.  Clients can explicitly
  request 4K, 16K, or 64K small pages when they install a new context.  In
  practice, the page size is fixed at kernel compile time and the same size
  is always requested every time a new page table is installed.

  The <hv/hypervisor.h> header changes so that it provides more abstract
  macros for managing "page" things like PFNs and page tables.  For example
  there is now a HV_DEFAULT_PAGE_SIZE_SMALL instead of the old
  HV_PAGE_SIZE_SMALL.  The various PFN routines have been eliminated and
  only PA- or PTFN-based ones remain (since PTFNs are always expressed in
  fixed 2KB "page" size).  The page-table management macros are renamed
  with a leading underscore and take page-size arguments with the
  presumption that clients will use those macros in some single place to
  provide the "real" macros they will use themselves.

  I happened to notice the old hv_set_caching() API was totally broken (it
  assumed 4KB pages) so I changed it so it would nominally work correctly
  with other page sizes.

  Tag modules with the page size so you can't load a module built with a
  conflicting page size.  (And add a test for SMP while we're at it.)

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

* arch/tile: fix up locking in pgtable.c slightly  (Chris Metcalf, 2012-04-02, 1 file, -10/+12)

  We should be holding the init_mm.page_table_lock in shatter_huge_page()
  since we are modifying the kernel page tables.  Then, only if we are
  walking the other root page tables to update them, do we want to take
  the pgd_lock.

  Add a comment about taking the pgd_lock that we always do it with
  interrupts disabled and therefore are not at risk from the tlbflush IPI
  deadlock as is seen on x86.

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

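  Roughly the locking shape being described, as a sketch only (the helper
  body, the interrupt handling and the condition are assumptions):

      static void shatter_huge_page_sketch(unsigned long addr, bool other_pgds)
      {
          unsigned long flags;

          /* Kernel page tables are modified under init_mm.page_table_lock. */
          spin_lock_irqsave(&init_mm.page_table_lock, flags);

          /* ... split the kernel huge mapping covering 'addr' ... */

          if (other_pgds) {
              /* Only when the other root page tables need updating is
               * pgd_lock taken; interrupts are already disabled here, which
               * avoids the x86-style TLB-flush IPI deadlock noted above. */
              spin_lock(&pgd_lock);
              /* ... propagate the change to the other pgds ... */
              spin_unlock(&pgd_lock);
          }

          spin_unlock_irqrestore(&init_mm.page_table_lock, flags);
      }
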
* arch/tile: don't set the homecache of a PTE unless appropriate  (Chris Metcalf, 2012-04-02, 1 file, -4/+12)

  We make sure not to try to set the home for an MMIO PTE (on tilegx) or a
  PTE that isn't referencing memory managed by Linux.

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

* Disintegrate asm/system.h for Tile  (David Howells, 2012-03-28, 1 file, -1/+0)

  Disintegrate asm/system.h for Tile.

  Signed-off-by: David Howells <dhowells@redhat.com>
  Acked-by: Chris Metcalf <cmetcalf@tilera.com>

* lib, arch: add filter argument to show_mem and fix private implementations  (David Rientjes, 2011-03-24, 1 file, -1/+1)

  Commit ddd588b5dd55 ("oom: suppress nodes that are not allowed from
  meminfo on oom kill") moved lib/show_mem.o out of lib/lib.a, which
  resulted in build warnings on all architectures that implement their own
  versions of show_mem():

      lib/lib.a(show_mem.o): In function `show_mem':
      show_mem.c:(.text+0x1f4): multiple definition of `show_mem'
      arch/sparc/mm/built-in.o:(.text+0xd70): first defined here

  The fix is to remove __show_mem() and add its argument to show_mem() in
  all implementations to prevent this breakage.

  Architectures that implement their own show_mem() actually don't do
  anything with the argument yet, but they could be made to filter nodes
  that aren't allowed in the current context in the future just like the
  generic implementation.

  Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
  Reported-by: James Bottomley <James.Bottomley@hansenpartnership.com>
  Suggested-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: David Rientjes <rientjes@google.com>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

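  The interface change amounts to the following signature on the
  architecture side (the body here is only a placeholder):

      #include <linux/mm.h>
      #include <linux/printk.h>

      /* Arch-private show_mem() now matches the generic prototype. */
      void show_mem(unsigned int filter)
      {
          /* filter may carry SHOW_MEM_FILTER_NODES; the private
           * implementations currently ignore it, as noted above. */
          pr_info("Mem-info:\n");
          /* ... per-arch memory reporting ... */
      }
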
* arch/tile: support 4KB page size as well as 64KB  (Chris Metcalf, 2011-03-10, 1 file, -29/+141)

  The Tilera architecture traditionally supports 64KB page sizes to improve
  TLB utilization and improve performance when the hardware is being used
  primarily to run a single application.  For more generic server
  scenarios, it can be beneficial to run with 4KB page sizes, so this
  commit allows that to be specified (by modifying the
  arch/tile/include/hv/pagesize.h header).

  As part of this change, we also re-worked the PTE management slightly so
  that PTE writes all go through a __set_pte() function where we can do
  some additional validation.  The set_pte_order() function was eliminated
  since the "order" argument wasn't being used.

  One bug uncovered was in the PCI DMA code, which wasn't properly flushing
  the specified range.  This was benign with 64KB pages, but with 4KB pages
  we were getting some larger flushes wrong.

  The per-cpu memory reservation code also needed updating to conform with
  the newer percpu stuff; before it always chose 64KB, and that was always
  correct, but with 4KB granularity we now have to pay closer attention and
  reserve the amount of memory that will be requested when the percpu code
  starts allocating.

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

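  A sketch of the __set_pte() funneling described above; the validation
  shown is an assumption, not the actual tile check:

      #include <linux/mm.h>
      #include <linux/bug.h>

      /* Every PTE write goes through one place where extra checks can live. */
      static void __set_pte(pte_t *ptep, pte_t pte)
      {
          WARN_ON_ONCE(ptep == NULL);     /* hypothetical validation hook */
          *ptep = pte;
      }

      void set_pte(pte_t *ptep, pte_t pte)
      {
          __set_pte(ptep, pte);
      }
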
* tile: Fix __pte_free_tlb  (Peter Zijlstra, 2011-02-23, 1 file, -13/+2)

  Tile's __pte_free_tlb() implementation makes assumptions about the
  generic mmu_gather implementation, cure this ;-)

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

* arch/tile: complete migration to new kmap_atomic scheme  (Chris Metcalf, 2010-11-01, 1 file, -2/+2)

  This change makes KM_TYPE_NR independent of the actual deprecated list of
  km_type values, which are no longer used in tile code anywhere.  For now
  we leave it set to 8, allowing that many nested mappings, and thus
  reserving 32MB of address space.

  A few remaining places using KM_* values were cleaned up as well.

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

* arch: tile: mm: pgtable.c: Removed duplicated #include  (Andrea Gelmini, 2010-08-13, 1 file, -1/+0)

  Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

* arch/tile: Miscellaneous cleanup changes.  (Chris Metcalf, 2010-07-06, 1 file, -41/+5)

  This commit is primarily changes caused by reviewing "sparse" and
  "checkpatch" output on our sources, so is somewhat noisy, since things
  like "printk() -> pr_err()" (or whatever) throughout the codebase tend to
  get tedious to read.  Rather than trying to tease apart precisely which
  things changed due to which type of code review, this commit includes
  various cleanups in the code:

  - sparse: Add declarations in headers for globals.
  - sparse: Fix __user annotations.
  - sparse: Using gfp_t consistently instead of int.
  - sparse: removing functions not actually used.
  - checkpatch: Clean up printk() warnings by using pr_info(), etc.; also
    avoid partial-line printks except in bootup code.
  - checkpatch: Use exposed structs rather than typedefs.
  - checkpatch: Change some C99 comments to C89 comments.

  In addition, a couple of minor other changes are rolled in to this
  commit:

  - Add support for a "raise" instruction to cause SIGFPE, etc., to be
    raised.
  - Remove some compat code that is unnecessary when we fully eliminate
    some of the deprecated syscalls from the generic syscall ABI.
  - Update the tile_defconfig to reflect current config contents.

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
  Acked-by: Arnd Bergmann <arnd@arndb.de>

* arch/tile: core support for Tilera 32-bit chips.  (Chris Metcalf, 2010-06-04, 1 file, -0/+566)

  This change is the core kernel support for TILEPro and TILE64 chips.  No
  driver support (except the console driver) is included yet.

  This includes the relevant Linux headers in asm/; the low-level "Tile
  architecture" headers in arch/, which are shared with the hypervisor,
  etc., and are build-system agnostic; and the relevant hypervisor headers
  in hv/.

  Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
  Acked-by: Arnd Bergmann <arnd@arndb.de>
  Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
  Reviewed-by: Paul Mundt <lethal@linux-sh.org>
