summaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
* mm/msync: use offset_in_page macroAlexander Kuleshov2015-11-051-1/+1
| | | | | | | | | linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/list_lru.c: replace nr_node_ids for loop with for_each_node()Raghavendra K T2015-11-051-11/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. [vdavydov@parallels.com: take memcg_aware check outside for_each loop] Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Greg Kurz <gkurz@linux.vnet.ibm.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: fix docbook comment for get_vaddr_frames()Jonathan Corbet2015-11-051-1/+1
| | | | | | | | | | get_vaddr_frames() has a comment that's *almost* a docbook comment; add the missing star so that the tools will find it properly. Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: ratify and consolidate over-charge handlingTejun Heo2015-11-051-49/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | try_charge() is the main charging logic of memcg. When it hits the limit but either can't fail the allocation due to __GFP_NOFAIL or the task is likely to free memory very soon, being OOM killed, has SIGKILL pending or exiting, it "bypasses" the charge to the root memcg and returns -EINTR. While this is one approach which can be taken for these situations, it has several issues. * It unnecessarily lies about the reality. The number itself doesn't go over the limit but the actual usage does. memcg is either forced to or actively chooses to go over the limit because that is the right behavior under the circumstances, which is completely fine, but, if at all avoidable, it shouldn't be misrepresenting what's happening by sneaking the charges into the root memcg. * Despite trying, we already do over-charge. kmemcg can't deal with switching over to the root memcg by the point try_charge() returns -EINTR, so it open-codes over-charing. * It complicates the callers. Each try_charge() user has to handle the weird -EINTR exception. memcg_charge_kmem() does the manual over-charging. mem_cgroup_do_precharge() performs unnecessary uncharging of root memcg, which BTW is inconsistent with what memcg_charge_kmem() does but not broken as [un]charging are noops on root memcg. mem_cgroup_try_charge() needs to switch the returned cgroup to the root one. The reality is that in memcg there are cases where we are forced and/or willing to go over the limit. Each such case needs to be scrutinized and justified but there definitely are situations where that is the right thing to do. We alredy do this but with a superficial and inconsistent disguise which leads to unnecessary complications. This patch updates try_charge() so that it over-charges and returns 0 when deemed necessary. -EINTR return is removed along with all special case handling in the callers. While at it, remove the local variable @ret, which was initialized to zero and never changed, along with done: label which just returned the always zero @ret. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: punt high overage reclaim to return-to-userland pathTejun Heo2015-11-051-8/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, try_charge() tries to reclaim memory synchronously when the high limit is breached; however, if the allocation doesn't have __GFP_WAIT, synchronous reclaim is skipped. If a process performs only speculative allocations, it can blow way past the high limit. This is actually easily reproducible by simply doing "find /". slab/slub allocator tries speculative allocations first, so as long as there's memory which can be consumed without blocking, it can keep allocating memory regardless of the high limit. This patch makes try_charge() always punt the over-high reclaim to the return-to-userland path. If try_charge() detects that high limit is breached, it adds the overage to current->memcg_nr_pages_over_high and schedules execution of mem_cgroup_handle_over_high() which performs synchronous reclaim from the return-to-userland path. As long as kernel doesn't have a run-away allocation spree, this should provide enough protection while making kmemcg behave more consistently. It also has the following benefits. - All over-high reclaims can use GFP_KERNEL regardless of the specific gfp mask in use, e.g. GFP_NOFS, when the limit was breached. - It copes with prio inversion. Previously, a low-prio task with small memory.high might perform over-high reclaim with a bunch of locks held. If a higher prio task needed any of these locks, it would have to wait until the low prio task finished reclaim and released the locks. By handing over-high reclaim to the task exit path this issue can be avoided. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: flatten task_struct->memcg_oomTejun Heo2015-11-051-8/+8
| | | | | | | | | | | | | | | | | | | | | | task_struct->memcg_oom is a sub-struct containing fields which are used for async memcg oom handling. Most task_struct fields aren't packaged this way and it can lead to unnecessary alignment paddings. This patch flattens it. * task.memcg_oom.memcg -> task.memcg_in_oom * task.memcg_oom.gfp_mask -> task.memcg_oom_gfp_mask * task.memcg_oom.order -> task.memcg_oom_order * task.memcg_oom.may_oom -> task.memcg_may_oom In addition, task.memcg_may_oom is relocated to where other bitfields are which reduces the size of task_struct. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/mmap.c: remove useless statement "vma = NULL" in find_vma()Chen Gang2015-11-051-1/+0
| | | | | | | | | | | Before the main loop, vma is already is NULL. There is no need to set it to NULL again. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* uaccess: reimplement probe_kernel_address() using probe_kernel_read()Andrew Morton2015-11-051-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | probe_kernel_address() is basically the same as the (later added) probe_kernel_read(). The return value on EFAULT is a bit different: probe_kernel_address() returns number-of-bytes-not-copied whereas probe_kernel_read() returns -EFAULT. All callers have been checked, none cared. probe_kernel_read() can be overridden by the architecture whereas probe_kernel_address() cannot. parisc, blackfin and um do this, to insert additional checking. Hence this patch possibly fixes obscure bugs, although there are only two probe_kernel_address() callsites outside arch/. My first attempt involved removing probe_kernel_address() entirely and converting all callsites to use probe_kernel_read() directly, but that got tiresome. This patch shrinks mm/slab_common.o by 218 bytes. For a single probe_kernel_address() callsite. Cc: Steven Miao <realmz6@gmail.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/mlock.c: reorganize mlockall() return values and remove goto-out labelAlexey Klimov2015-11-051-5/+4
| | | | | | | | | | | | | | | In mlockall syscall wrapper after out-label for goto code just doing return. Remove goto out statements and return error values directly. Also instead of rewriting ret variable before every if-check move returns to 'error'-like path under if-check. Objdump asm listing showed me reducing by few asm lines. Object file size descreased from 220592 bytes to 220528 bytes for me (for aarch64). Signed-off-by: Alexey Klimov <klimov.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/kmemleak.c: remove unneeded initialization of object to NULLAlexey Klimov2015-11-051-1/+1
| | | | | | | | | | Few lines below object is reinitialized by lookup_object() so we don't need to init it by NULL in the beginning of find_and_get_object(). Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: slab: only move management objects off-slab for sizes larger than ↵Catalin Marinas2015-11-051-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KMALLOC_MIN_SIZE On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc configurations defining ARCH_DMA_MINALIGN to 128), the first kmalloc_caches[] entry to be initialised after slab_early_init = 0 is "kmalloc-128" with index 7. Depending on the debug kernel configuration, sizeof(struct kmem_cache) can be larger than 128 resulting in an INDEX_NODE of 8. Commit 8fc9cf420b36 ("slab: make more slab management structure off the slab") enables off-slab management objects for sizes starting with PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation of the "kmalloc-128" cache would try to place the management objects off-slab. However, since KMALLOC_MIN_SIZE is already 128 and freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size) returns NULL (kmalloc_caches[7] not populated yet). This triggers the following bug on arm64: kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540 Hardware name: Juno (DT) PC is at __kmem_cache_create+0x21c/0x280 LR is at __kmem_cache_create+0x210/0x280 [...] Call trace: __kmem_cache_create+0x21c/0x280 create_boot_cache+0x48/0x80 create_kmalloc_cache+0x50/0x88 create_kmalloc_caches+0x4c/0xf4 kmem_cache_init+0x100/0x118 start_kernel+0x214/0x33c This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE. Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> [3.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/slub: calculate start order with reserved in considerationWei Yang2015-11-051-5/+1
| | | | | | | | | | | | | | | | | | In slub_order(), the order starts from max(min_order, get_order(min_objects * size)). When (min_objects * size) has different order from (min_objects * size + reserved), it will skip this order via a check in the loop. This patch optimizes this a little by calculating the start order with `reserved' in consideration and removing the check in loop. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/slub: use get_order() instead of fls()Wei Yang2015-11-051-2/+1
| | | | | | | | | | | | | | get_order() is more easy to understand. This patch just replaces it. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/slub: correct the comment in calculate_order()Wei Yang2015-11-051-1/+1
| | | | | | | | | | | | | | | | | In calculate_order(), it tries to calculate the best order by adjusting the fraction and min_objects. On each iteration on min_objects, fraction iterates on 16, 8, 4. Which means the acceptable waste increases with 1/16, 1/8, 1/4. This patch corrects the comment according to the code. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/slab_common.c: initialize kmem_cache pointer to NULLAlexandru Moise2015-11-051-2/+1
| | | | | | | | | | | | | | | The assignment to NULL within the error condition was written in a 2014 patch to suppress a compiler warning. However it would be cleaner to just initialize the kmem_cache to NULL and just return it in case of an error condition. Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/slab_common.c: do not warn that cache is busy on destroy more than onceVladimir Davydov2015-11-051-6/+7
| | | | | | | | | | | | | | | | | | | | | | | Currently, when kmem_cache_destroy() is called for a global cache, we print a warning for each per memcg cache attached to it that has active objects (see shutdown_cache). This is redundant, because it gives no new information and only clutters the log. If a cache being destroyed has active objects, there must be a memory leak in the module that created the cache, and it does not matter if the cache was used by users in memory cgroups or not. This patch moves the warning from shutdown_cache(), which is called for shutting down both global and per memcg caches, to kmem_cache_destroy(), so that the warning is only printed once if there are objects left in the cache being destroyed. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/slab_common.c: clear pointers to per memcg caches on destroyVladimir Davydov2015-11-052-21/+78
| | | | | | | | | | | | | | | | | | | | Currently, we do not clear pointers to per memcg caches in the memcg_params.memcg_caches array when a global cache is destroyed with kmem_cache_destroy. This is fine if the global cache does get destroyed. However, a cache can be left on the list if it still has active objects when kmem_cache_destroy is called (due to a memory leak). If this happens, the entries in the array will point to already freed areas, which is likely to result in data corruption when the cache is reused (via slab merging). Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/slab_common.c: rename cache create/destroy helpersVladimir Davydov2015-11-051-19/+18
| | | | | | | | | | | | | | | | | do_kmem_cache_create(), do_kmem_cache_shutdown(), and do_kmem_cache_release() sound awkward for static helper functions that are not supposed to be used outside slab_common.c. Rename them to create_cache(), shutdown_cache(), and release_caches(), respectively. This patch is a pure cleanup and does not introduce any functional changes. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slab: convert slab_is_available() to booleanDenis Kirjanov2015-11-051-1/+1
| | | | | | | | | | | | A good candidate to return a boolean result. Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org> Cc: Christoph Lameter <cl@linux.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'driver-core-4.4-rc1' of ↵Linus Torvalds2015-11-042-8/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here's the "big" driver core updates for 4.4-rc1. Primarily a bunch of debugfs updates, with a smattering of minor driver core fixes and updates as well. All have been in linux-next for a long time" * tag 'driver-core-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: debugfs: Add debugfs_create_ulong() of: to support binding numa node to specified device in devicetree debugfs: Add read-only/write-only bool file ops debugfs: Add read-only/write-only size_t file ops debugfs: Add read-only/write-only x64 file ops debugfs: Consolidate file mode checks in debugfs_create_*() Revert "mm: Check if section present during memory block (un)registering" driver-core: platform: Provide helpers for multi-driver modules mm: Check if section present during memory block (un)registering devres: fix a for loop bounds check CMA: fix CONFIG_CMA_SIZE_MBYTES overflow in 64bit base/platform: assert that dev_pm_domain callbacks are called unconditionally sysfs: correctly handle short reads on PREALLOC attrs. base: soc: siplify ida usage kobject: move EXPORT_SYMBOL() macros next to corresponding definitions kobject: explain what kobject's sd field is debugfs: document that debugfs_remove*() accepts NULL and error values debugfs: Pass bool pointer to debugfs_create_bool() ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'
| * debugfs: Pass bool pointer to debugfs_create_bool()Viresh Kumar2015-10-042-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Its a bit odd that debugfs_create_bool() takes 'u32 *' as an argument, when all it needs is a boolean pointer. It would be better to update this API to make it accept 'bool *' instead, as that will make it more consistent and often more convenient. Over that bool takes just a byte. That required updates to all user sites as well, in the same commit updating the API. regmap core was also using debugfs_{read|write}_file_bool(), directly and variable types were updated for that to be bool as well. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Mark Brown <broonie@kernel.org> Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge tag 'for-linus-4.4-rc0-tag' of ↵Linus Torvalds2015-11-041-8/+21
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from David Vrabel: - Improve balloon driver memory hotplug placement. - Use unpopulated hotplugged memory for foreign pages (if supported/enabled). - Support 64 KiB guest pages on arm64. - CPU hotplug support on arm/arm64. * tag 'for-linus-4.4-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (44 commits) xen: fix the check of e_pfn in xen_find_pfn_range x86/xen: add reschedule point when mapping foreign GFNs xen/arm: don't try to re-register vcpu_info on cpu_hotplug. xen, cpu_hotplug: call device_offline instead of cpu_down xen/arm: Enable cpu_hotplug.c xenbus: Support multiple grants ring with 64KB xen/grant-table: Add an helper to iterate over a specific number of grants xen/xenbus: Rename *RING_PAGE* to *RING_GRANT* xen/arm: correct comment in enlighten.c xen/gntdev: use types from linux/types.h in userspace headers xen/gntalloc: use types from linux/types.h in userspace headers xen/balloon: Use the correct sizeof when declaring frame_list xen/swiotlb: Add support for 64KB page granularity xen/swiotlb: Pass addresses rather than frame numbers to xen_arch_need_swiotlb arm/xen: Add support for 64KB page granularity xen/privcmd: Add support for Linux 64KB page granularity net/xen-netback: Make it running on 64KB page granularity net/xen-netfront: Make it running on 64KB page granularity block/xen-blkback: Make it running on 64KB page granularity block/xen-blkfront: Make it running on 64KB page granularity ...
| * | mm: memory hotplug with an existing resourceDavid Vrabel2015-10-231-8/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add add_memory_resource() to add memory using an existing "System RAM" resource. This is useful if the memory region is being located by finding a free resource slot with allocate_resource(). Xen guests will make use of this in their balloon driver to hotplug arbitrary amounts of memory in response to toolstack requests. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
* | | Merge tag 'arc-4.4-rc1' of ↵Linus Torvalds2015-11-032-53/+49
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC updates from Vineet Gupta: - Support for new MM features in ARCv2 cores (THP, PAE40) Some generic THP bits are touched - all ACKed by Kirill - Platform framework updates to prepare for EZChip arrival (still in works) - ARC Public Mailing list setup finally (linux-snps-arc@lists.infraded.org) * tag 'arc-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (42 commits) ARC: mm: PAE40 support ARC: mm: PAE40: tlbex.S: Explicitify the size of pte_t ARC: mm: PAE40: switch to using phys_addr_t for physical addresses ARC: mm: HIGHMEM: populate high memory from DT ARC: mm: HIGHMEM: kmap API implementation ARC: mm: preps ahead of HIGHMEM support #2 ARC: mm: preps ahead of HIGHMEM support ARC: mm: use generic macros _BITUL()/_AC() ARC: mm: Improve Duplicate PD Fault handler MAINTAINERS: Add public mailing list for ARC ARC: Ensure DT mem base is same as what kernel is built with ARC: boot: Non Master cpus only need to call EARLY_CPU_SETUP once ARCv2: smp: [plat-*]: No need to explicitly call mcip_init_smp() ARC: smp: Introduce smp hook @init_irq_cpu called for all cores ARC: smp: Rename platform hook @init_smp -> @init_cpu_smp ARCv2: smp: [plat-*]: No need to explicitly call mcip_init_early_smp() ARC: smp: Introduce smp hook @init_early_smp for Master core ARC: remove @init_time, @init_irq platform callbacks ARC: smp: irqchip: handle IPI as percpu irq like timer ARC: boot: Support Halt-on-reset and Run-on-reset SMP booting modes ...
| * | | mm,thp: introduce flush_pmd_tlb_rangeVineet Gupta2015-10-172-7/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ARCHes with special requirements for evicting THP backing TLB entries can implement this. Otherwise also, it can help optimize TLB flush in THP regime. stock flush_tlb_range() typically has optimization to nuke the entire TLB if flush span is greater than a certain threshhold, which will likely be true for a single huge page. Thus a single thp flush will invalidate the entrire TLB which is not desirable. e.g. see arch/arc: flush_pmd_tlb_range Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20151009100816.GC7873@node Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
| * | | mm,thp: reduce ifdef'ery for THP in generic codeVineet Gupta2015-10-171-21/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - pgtable-generic.c: Fold individual #ifdef for each helper into a top level #ifdef. Makes code more readable - Converted the stub helpers for !THP to BUILD_BUG() vs. runtime BUG() Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20151009133450.GA8597@node Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
| * | | mm: group pte related helpers togetherVineet Gupta2015-10-171-25/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reduces/simplifies the diff for the next patch which moves THP specific code. No semantical changes ! Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Link: http://lkml.kernel.org/r/1442918096-17454-9-git-send-email-vgupta@synopsys.com Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
* | | | mm: get rid of 'vmalloc_info' from /proc/meminfoLinus Torvalds2015-11-011-47/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It turns out that at least some versions of glibc end up reading /proc/meminfo at every single startup, because glibc wants to know the amount of memory the machine has. And while that's arguably insane, it's just how things are. And it turns out that it's not all that expensive most of the time, but the vmalloc information statistics (amount of virtual memory used in the vmalloc space, and the biggest remaining chunk) can be rather expensive to compute. The 'get_vmalloc_info()' function actually showed up on my profiles as 4% of the CPU usage of "make test" in the git source repository, because the git tests are lots of very short-lived shell-scripts etc. It turns out that apparently this same silly vmalloc info gathering shows up on the facebook servers too, according to Dave Jones. So it's not just "make test" for git. We had two patches to just cache the information (one by me, one by Ingo) to mitigate this issue, but the whole vmalloc information of of rather dubious value to begin with, and people who *actually* want to know what the situation is wrt the vmalloc area should just look at the much more complete /proc/vmallocinfo instead. In fact, according to my testing - and perhaps more importantly, according to that big search engine in the sky: Google - there is nothing out there that actually cares about those two expensive fields: VmallocUsed and VmallocChunk. So let's try to just remove them entirely. Actually, this just removes the computation and reports the numbers as zero for now, just to try to be minimally intrusive. If this breaks anything, we'll obviously have to re-introduce the code to compute this all and add the caching patches on top. But if given the option, I'd really prefer to just remove this bad idea entirely rather than add even more code to work around our historical mistake that likely nobody really cares about. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2015-10-243-46/+79
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer fixes from Jens Axboe: "A final set of fixes for 4.3. It is (again) bigger than I would have liked, but it's all been through the testing mill and has been carefully reviewed by multiple parties. Each fix is either a regression fix for this cycle, or is marked stable. You can scold me at KS. The pull request contains: - Three simple fixes for NVMe, fixing regressions since 4.3. From Arnd, Christoph, and Keith. - A single xen-blkfront fix from Cathy, fixing a NULL dereference if an error is returned through the staste change callback. - Fixup for some bad/sloppy code in nbd that got introduced earlier in this cycle. From Markus Pargmann. - A blk-mq tagset use-after-free fix from Junichi. - A backing device lifetime fix from Tejun, fixing a crash. - And finally, a set of regression/stable fixes for cgroup writeback from Tejun" * 'for-linus' of git://git.kernel.dk/linux-block: writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy() NVMe: Fix memory leak on retried commands block: don't release bdi while request_queue has live references nvme: use an integer value to Linux errno values blk-mq: fix use-after-free in blk_mq_free_tag_set() nvme: fix 32-bit build warning writeback: fix incorrect calculation of available memory for memcg domains writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions writeback: bdi_writeback iteration must not skip dying ones writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration nbd: Add locking for tasks xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing)
| * | | | writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in ↵Tejun Heo2015-10-211-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgwb_bdi_destroy() a20135ffbc44 ("writeback: don't drain bdi_writeback_congested on bdi destruction") added rbtree_postorder_for_each_entry_safe() which is used to remove all entries; however, according to Cody, the iterator isn't safe against operations which may rebalance the tree. Fix it by switching to repeatedly removing rb_first() until empty. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Cody P Schafer <dev@codyps.com> Fixes: a20135ffbc44 ("writeback: don't drain bdi_writeback_congested on bdi destruction") Link: http://lkml.kernel.org/g/1443997973-1700-1-git-send-email-dev@codyps.com Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | block: don't release bdi while request_queue has live referencesTejun Heo2015-10-151-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bdi's are initialized in two steps, bdi_init() and bdi_register(), but destroyed in a single step by bdi_destroy() which, for a bdi embedded in a request_queue, is called during blk_cleanup_queue() which makes the queue invisible and starts the draining of remaining usages. A request_queue's user can access the congestion state of the embedded bdi as long as it holds a reference to the queue. As such, it may access the congested state of a queue which finished blk_cleanup_queue() but hasn't reached blk_release_queue() yet. Because the congested state was embedded in backing_dev_info which in turn is embedded in request_queue, accessing the congested state after bdi_destroy() was called was fine. The bdi was destroyed but the memory region for the congested state remained accessible till the queue got released. a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") changed the situation. Now, the root congested state which is expected to be pinned while request_queue remains accessible is separately reference counted and the base ref is put during bdi_destroy(). This means that the root congested state may go away prematurely while the queue is between bdi_dstroy() and blk_cleanup_queue(), which was detected by Andrey's KASAN tests. The root cause of this problem is that bdi doesn't distinguish the two steps of destruction, unregistration and release, and now the root congested state actually requires a separate release step. To fix the issue, this patch separates out bdi_unregister() and bdi_exit() from bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue() and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a simple wrapper calling the two steps back-to-back. While at it, the prototype of bdi_destroy() is moved right below bdi_setup_and_register() so that the counterpart operations are located together. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") Cc: stable@vger.kernel.org # v4.2+ Reported-and-tested-by: Andrey Konovalov <andreyknvl@google.com> Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com Reviewed-by: Jan Kara <jack@suse.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | writeback: fix incorrect calculation of available memory for memcg domainsTejun Heo2015-10-122-29/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For memcg domains, the amount of available memory was calculated as min(the amount currently in use + headroom according to memcg, total clean memory) This isn't quite correct as what should be capped by the amount of clean memory is the headroom, not the sum of memory in use and headroom. For example, if a memcg domain has a significant amount of dirty memory, the above can lead to a value which is lower than the current amount in use which doesn't make much sense. In most circumstances, the above leads to a number which is somewhat but not drastically lower. As the amount of memory which can be readily allocated to the memcg domain is capped by the amount of system-wide clean memory which is not already assigned to the memcg itself, the number we want is the amount currently in use + min(headroom according to memcg, clean memory elsewhere in the system) This patch updates mem_cgroup_wb_stats() to return the number of filepages and headroom instead of the calculated available pages. mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the above calculation from file, headroom, dirty and globally clean pages. v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading to build failure when !CGROUP_WRITEBACK. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling") Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | writeback: memcg dirty_throttle_control should be initialized with ↵Tejun Heo2015-10-121-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wb->memcg_completions MDTC_INIT() is used to initialize dirty_throttle_control for memcg domains. It used DTC_INIT_COMMON() to initialized mdtc->wb and ->wb_completions which is incorrect as DTC_INIT_COMMON() sets the latter to wb->completions instead of wb->memcg_completions. This can lead to wildly incorrect results when calculating the proportion of dirty memory the memcg domain should get. Remove DTC_INIT_COMMON() and update MDTC_INIT() to initialize mdtc->wb_completions to wb->memcg_completions. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling") Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | writeback: bdi_writeback iteration must not skip dying onesTejun Heo2015-10-122-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bdi_for_each_wb() is used in several places to wake up or issue writeback work items to all wb's (bdi_writeback's) on a given bdi. The iteration is performed by walking bdi->cgwb_tree; however, the tree only indexes wb's which are currently active. For example, when a memcg gets associated with a different blkcg, the old wb is removed from the tree so that the new one can be indexed. The old wb starts dying from then on but will linger till all its inodes are drained. As these dying wb's may still host dirty inodes, writeback operations which affect all wb's must include them. bdi_for_each_wb() skipping dying wb's led to sync(2) missing and failing to sync the inodes belonging to those wb's. This patch adds a RCU protected @bdi->wb_list which lists all wb's beloinging to that bdi. wb's are added on creation and removed on release rather than on the start of destruction. bdi_for_each_wb() usages are replaced with list_for_each[_continue]_rcu() iterations over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed. v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs() fixed and unnecessary list head severing in cgwb_bdi_destroy() removed. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com> Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()") Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback ↵Tejun Heo2015-10-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iteration laptop_mode_timer_fn() was using bdi_for_each_wb() without the required RCU locking leading to the following warning. WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170() ... Call Trace: <IRQ> [<ffffffff81480cdc>] dump_stack+0x4e/0x82 [<ffffffff81051912>] warn_slowpath_common+0x82/0xc0 [<ffffffff81051a0a>] warn_slowpath_null+0x1a/0x20 [<ffffffff8115f0e6>] laptop_mode_timer_fn+0x106/0x170 [<ffffffff810ca8e3>] call_timer_fn+0xb3/0x2f0 [<ffffffff810cad25>] run_timer_softirq+0x205/0x370 [<ffffffff81056854>] __do_softirq+0xd4/0x460 [<ffffffff81056d69>] irq_exit+0x89/0xa0 [<ffffffff8185a892>] smp_apic_timer_interrupt+0x42/0x50 [<ffffffff81858a44>] apic_timer_interrupt+0x84/0x90 ... Fix it by adding rcu_read_lock() around the iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a06fd6b10228 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's") Reviewed-by: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | | mm: make sendfile(2) killableJan Kara2015-10-231-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently a simple program below issues a sendfile(2) system call which takes about 62 days to complete in my test KVM instance. int fd; off_t off = 0; fd = open("file", O_RDWR | O_TRUNC | O_SYNC | O_CREAT, 0644); ftruncate(fd, 2); lseek(fd, 0, SEEK_END); sendfile(fd, fd, &off, 0xfffffff); Now you should not ask kernel to do a stupid stuff like copying 256MB in 2-byte chunks and call fsync(2) after each chunk but if you do, sysadmin should have a way to stop you. We actually do have a check for fatal_signal_pending() in generic_perform_write() which triggers in this path however because we always succeed in writing something before the check is done, we return value > 0 from generic_perform_write() and thus the information about signal gets lost. Fix the problem by doing the signal check before writing anything. That way generic_perform_write() returns -EINTR, the error gets propagated up and the sendfile loop terminates early. Signed-off-by: Jan Kara <jack@suse.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | thp: use is_zero_pfn() only after pte_present() checkMinchan Kim2015-10-231-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use is_zero_pfn() on pteval only after pte_present() check on pteval (It might be better idea to introduce is_zero_pte() which checks pte_present() first). Otherwise when working on a swap or migration entry and if pte_pfn's result is equal to zero_pfn by chance, we lose user's data in __collapse_huge_page_copy(). So if you're unlucky, the application segfaults and finally you could see below message on exit: BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3 Fixes: ca0984caa823 ("mm: incorporate zero pages into transparent huge pages") Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [4.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | mm: cma: fix incorrect type conversion for size during dma allocationRohit Vaswani2015-10-231-2/+2
| |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This was found during userspace fuzzing test when a large size dma cma allocation is made by driver(like ion) through userspace. show_stack+0x10/0x1c dump_stack+0x74/0xc8 kasan_report_error+0x2b0/0x408 kasan_report+0x34/0x40 __asan_storeN+0x15c/0x168 memset+0x20/0x44 __dma_alloc_coherent+0x114/0x18c Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2015-10-163-4/+7
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc fixes from Andrew Morton: "6 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: sh: add copy_user_page() alias for __copy_user() lib/Kconfig: ZLIB_DEFLATE must select BITREVERSE mm, dax: fix DAX deadlocks memcg: convert threshold to bytes builddeb: remove debian/files before build mm, fs: obey gfp_mapping for add_to_page_cache()
| * | | | mm, dax: fix DAX deadlocksRoss Zwisler2015-10-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following two locking commits in the DAX code: commit 843172978bb9 ("dax: fix race between simultaneous faults") commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX") introduced a number of deadlocks and other issues which need to be fixed for the v4.3 kernel. The list of issues in DAX after these commits (some newly introduced by the commits, some preexisting) can be found here: https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault"). This undoes most of the changes introduced by those two commits, essentially returning us to the DAX locking scheme that was used in v4.2. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Tested-by: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | memcg: convert threshold to bytesShaohua Li2015-10-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | page_counter_memparse() returns pages for the threshold, while mem_cgroup_usage() returns bytes for memory usage. Convert the threshold to bytes. Fixes: 3e32cb2e0a12b6915 ("memcg: rename cgroup_event to mem_cgroup_event"). Signed-off-by: Shaohua Li <shli@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm, fs: obey gfp_mapping for add_to_page_cache()Michal Hocko2015-10-161-4/+4
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache allocation paths") has caught some users of hardcoded GFP_KERNEL used in the page cache allocation paths. This, however, wasn't complete and there were others which went unnoticed. Dave Chinner has reported the following deadlock for xfs on loop device: : With the recent merge of the loop device changes, I'm now seeing : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073. : : The deadlocked is as follows: : : kloopd1: loop_queue_read_work : xfs_file_iter_read : lock XFS inode XFS_IOLOCK_SHARED (on image file) : page cache read (GFP_KERNEL) : radix tree alloc : memory reclaim : reclaim XFS inodes : log force to unpin inodes : <wait for log IO completion> : : xfs-cil/loop1: <does log force IO work> : xlog_cil_push : xlog_write : <loop issuing log writes> : xlog_state_get_iclog_space() : <blocks due to all log buffers under write io> : <waits for IO completion> : : kloopd1: loop_queue_write_work : xfs_file_write_iter : lock XFS inode XFS_IOLOCK_EXCL (on image file) : <wait for inode to be unlocked> : : i.e. the kloopd, with it's split read and write work queues, has : introduced a dependency through memory reclaim. i.e. that writes : need to be able to progress for reads make progress. : : The problem, fundamentally, is that mpage_readpages() does a : GFP_KERNEL allocation, rather than paying attention to the inode's : mapping gfp mask, which is set to GFP_NOFS. : : The didn't used to happen, because the loop device used to issue : reads through the splice path and that does: : : error = add_to_page_cache_lru(page, mapping, index, : GFP_KERNEL & mapping_gfp_mask(mapping)); This has changed by commit aa4d86163e4 ("block: loop: switch to VFS ITER_BVEC"). This patch changes mpage_readpage{s} to follow gfp mask set for the mapping. There are, however, other places which are doing basically the same. lustre:ll_dir_filler is doing GFP_KERNEL from the function which apparently uses GFP_NOFS for other allocations so let's make this consistent. cifs:readpages_get_pages is called from cifs_readpages and __cifs_readpages_from_fscache called from the same path obeys mapping gfp. ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well regardless it uses mapping_gfp_mask for the page allocation. ext4_mpage_readpages is the called from the page cache allocation path same as read_pages and read_cache_pages As I've noticed in my previous post I cannot say I would be happy about sprinkling mapping_gfp_mask all over the place and it sounds like we should drop gfp_mask argument altogether and use it internally in __add_to_page_cache_locked that would require all the filesystems to use mapping gfp consistently which I am not sure is the case here. From a quick glance it seems that some file system use it all the time while others are selective. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Dave Chinner <david@fromorbit.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Ming Lei <ming.lei@canonical.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vmstat: explicitly schedule per-cpu work on the CPU we need it to run onLinus Torvalds2015-10-151-3/+4
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vmstat code uses "schedule_delayed_work_on()" to do the initial startup of the delayed work on the right CPU, but then once it was started it would use the non-cpu-specific "schedule_delayed_work()" to re-schedule it on that CPU. That just happened to schedule it on the same CPU historically (well, in almost all situations), but the code _requires_ this work to be per-cpu, and should say so explicitly rather than depend on the non-cpu-specific scheduling to schedule on the current CPU. The timer code is being changed to not be as single-minded in always running things on the calling CPU. See also commit 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu") that for now maintains the local CPU guarantees just in case there are other broken users that depended on the accidental behavior. Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Revert "fs: do not prefault sys_write() user buffer pages"Linus Torvalds2015-10-071-18/+16
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 998ef75ddb5709bbea0bf1506cd2717348a3c647. The commit itself does not appear to be buggy per se, but it is exposing a bug in ext4 (and Ted thinks ext3 too, but we solved that by getting rid of it). It's too late in the release cycle to really worry about this, even if Dave Hansen has a patch that may actually fix the underlying ext4 problem. We can (and should) revisit this for the next release. The problem is that moving the prefaulting later now exposes a special case with partially successful writes that isn't handled correctly. And the prefaulting likely isn't normally even that much of a performance issue - it looks like at least one reason Dave saw this in his performance tests is that he also ran them on Skylake that now supports the new SMAP code, which makes the normally very cheap user space prefaulting noticeably more expensive. Bisected-and-acked-by: Ted Ts'o <tytso@mit.edu> Analyzed-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | dmapool: fix overflow condition in pool_find_page()Robin Murphy2015-10-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a DMA pool lies at the very top of the dma_addr_t range (as may happen with an IOMMU involved), the calculated end address of the pool wraps around to zero, and page lookup always fails. Tweak the relevant calculation to be overflow-proof. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Sakari Ailus <sakari.ailus@iki.fi> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | memcg: remove pcp_counter_lockGreg Thelen2015-10-011-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 733a572e66d2 ("memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online") removed the last use of the per memcg pcp_counter_lock but forgot to remove the variable. Kill the vestigial variable. Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | memcg: make mem_cgroup_read_stat() unsignedGreg Thelen2015-10-011-12/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mem_cgroup_read_stat() returns a page count by summing per cpu page counters. The summing is racy wrt. updates, so a transient negative sum is possible. Callers don't want negative values: - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback. This could confuse dirty throttling. - oom reports and memory.stat shouldn't show confusing negative usage. - tree_usage() already avoids negatives. Avoid returning negative page counts from mem_cgroup_read_stat() and convert it to unsigned. [akpm@linux-foundation.org: fix old typo while we're in there] Signed-off-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | memcg: fix dirty page migrationGreg Thelen2015-10-011-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem starts with a file backed dirty page which is charged to a memcg. Then page migration is used to move oldpage to newpage. Migration: - copies the oldpage's data to newpage - clears oldpage.PG_dirty - sets newpage.PG_dirty - uncharges oldpage from memcg - charges newpage to memcg Clearing oldpage.PG_dirty decrements the charged memcg's dirty page count. However, because newpage is not yet charged, setting newpage.PG_dirty does not increment the memcg's dirty page count. After migration completes newpage.PG_dirty is eventually cleared, often in account_page_cleaned(). At this time newpage is charged to a memcg so the memcg's dirty page count is decremented which causes underflow because the count was not previously incremented by migration. This underflow causes balance_dirty_pages() to see a very large unsigned number of dirty memcg pages which leads to aggressive throttling of buffered writes by processes in non root memcg. This issue: - can harm performance of non root memcg buffered writes. - can report too small (even negative) values in memory.stat[(total_)dirty] counters of all memcg, including the root. To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce page_memcg() and set_page_memcg() helpers. Test: 0) setup and enter limited memcg mkdir /sys/fs/cgroup/test echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes echo $$ > /sys/fs/cgroup/test/cgroup.procs 1) buffered writes baseline dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k sync grep ^dirty /sys/fs/cgroup/test/memory.stat 2) buffered writes with compaction antagonist to induce migration yes 1 > /proc/sys/vm/compact_memory & rm -rf /data/tmp/foo dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k kill % sync grep ^dirty /sys/fs/cgroup/test/memory.stat 3) buffered writes without antagonist, should match baseline rm -rf /data/tmp/foo dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k sync grep ^dirty /sys/fs/cgroup/test/memory.stat (speed, dirty residue) unpatched patched 1) 841 MB/s 0 dirty pages 886 MB/s 0 dirty pages 2) 611 MB/s -33427456 dirty pages 793 MB/s 0 dirty pages 3) 114 MB/s -33427456 dirty pages 891 MB/s 0 dirty pages Notice that unpatched baseline performance (1) fell after migration (3): 841 -> 114 MB/s. In the patched kernel, post migration performance matches baseline. Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting") Signed-off-by: Greg Thelen <gthelen@google.com> Reported-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a faultMel Gorman2015-10-011-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SunDong reported the following on https://bugzilla.kernel.org/show_bug.cgi?id=103841 I think I find a linux bug, I have the test cases is constructed. I can stable recurring problems in fedora22(4.0.4) kernel version, arch for x86_64. I construct transparent huge page, when the parent and child process with MAP_SHARE, MAP_PRIVATE way to access the same huge page area, it has the opportunity to lead to huge page copy on write failure, and then it will munmap the child corresponding mmap area, but then the child mmap area with VM_MAYSHARE attributes, child process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags functions (vma - > vm_flags & VM_MAYSHARE). There were a number of problems with the report (e.g. it's hugetlbfs that triggers this, not transparent huge pages) but it was fundamentally correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that looks like this vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000 next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800 prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0 pgoff 0 file ffff88106bdb9800 private_data (null) flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb) ------------ kernel BUG at mm/hugetlb.c:462! SMP Modules linked in: xt_pkttype xt_LOG xt_limit [..] CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1 Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012 set_vma_resv_flags+0x2d/0x30 The VM_BUG_ON is correct because private and shared mappings have different reservation accounting but the warning clearly shows that the VMA is shared. When a private COW fails to allocate a new page then only the process that created the VMA gets the page -- all the children unmap the page. If the children access that data in the future then they get killed. The problem is that the same file is mapped shared and private. During the COW, the allocation fails, the VMAs are traversed to unmap the other private pages but a shared VMA is found and the bug is triggered. This patch identifies such VMAs and skips them. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: SunDong <sund_sky@126.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1)Joonsoo Kim2015-10-011-3/+10
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit description is copied from the original post of this bug: http://comments.gmane.org/gmane.linux.kernel.mm/135349 Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next larger cache size than the size index INDEX_NODE mapping. In kernels 3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size. However, sometimes we can't get the right output we expected via kmalloc_size(INDEX_NODE + 1), causing a BUG(). The mapping table in the latest kernel is like: index = {0, 1, 2 , 3, 4, 5, 6, n} size = {0, 96, 192, 8, 16, 32, 64, 2^n} The mapping table before 3.10 is like this: index = {0 , 1 , 2, 3, 4 , 5 , 6, n} size = {32, 64, 96, 128, 192, 256, 512, 2^(n+3)} The problem on my mips64 machine is as follows: (1) When configured DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC && DEBUG_SPINLOCK, the sizeof(struct kmem_cache_node) will be "150", and the macro INDEX_NODE turns out to be "2": #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node)) (2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8. (3) Then "if(size >= kmalloc_size(INDEX_NODE + 1)" will lead to "size = PAGE_SIZE". (4) Then "if ((size >= (PAGE_SIZE >> 3))" test will be satisfied and "flags |= CFLGS_OFF_SLAB" will be covered. (5) if (flags & CFLGS_OFF_SLAB)" test will be satisfied and will go to "cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", and the result here may be NULL while kernel bootup. (6) Finally,"BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" causes the BUG info as the following shows (may be only mips64 has this problem): This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes the BUG by adding 'size >= 256' check to guarantee that all necessary small sized slabs are initialized regardless sequence of slab size in mapping table. Fixes: e33660165c90 ("slab: Use common kmalloc_index/kmalloc_size...") Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Liuhailong <liu.hailong6@zte.com.cn> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud