summaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | HWPOISON: Be more aggressive at freeing non LRU cachesAndi Kleen2009-12-161-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | shake_page handles more types of page caches than lru_drain_all() - per cpu page allocator pages - per CPU LRU Stops early when the page became free. Used in followon patches. Signed-off-by: Andi Kleen <ak@linux.intel.com>
* | | | Merge branch 'master' of ↵Linus Torvalds2009-12-164-221/+37
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (38 commits) direct I/O fallback sync simplification ocfs: stop using do_sync_mapping_range cleanup blockdev_direct_IO locking make generic_acl slightly more generic sanitize xattr handler prototypes libfs: move EXPORT_SYMBOL for d_alloc_name vfs: force reval of target when following LAST_BIND symlinks (try #7) ima: limit imbalance msg Untangling ima mess, part 3: kill dead code in ima Untangling ima mess, part 2: deal with counters Untangling ima mess, part 1: alloc_file() O_TRUNC open shouldn't fail after file truncation ima: call ima_inode_free ima_inode_free IMA: clean up the IMA counts updating code ima: only insert at inode creation time ima: valid return code from ima_inode_alloc fs: move get_empty_filp() deffinition to internal.h Sanitize exec_permission_lite() Kill cached_lookup() and real_lookup() Kill path_lookup_open() ... Trivial conflicts in fs/direct-io.c
| * | | | direct I/O fallback sync simplificationChristoph Hellwig2009-12-161-14/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the case of direct I/O falling back to buffered I/O we sync data twice currently: once at the end of generic_file_buffered_write using filemap_write_and_wait_range and once a little later in __generic_file_aio_write using do_sync_mapping_range with all flags set. The wait before write of the do_sync_mapping_range call does not make any sense, so just keep the filemap_write_and_wait_range call and move it to the right spot. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | make generic_acl slightly more genericChristoph Hellwig2009-12-163-141/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we cache the ACL pointers in the generic inode all the generic_acl cruft can go away and generic_acl.c can directly implement xattr handlers dealing with the full Posix ACL semantics for in-memory filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | sanitize xattr handler prototypesChristoph Hellwig2009-12-162-67/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a flags argument to struct xattr_handler and pass it to all xattr handler methods. This allows using the same methods for multiple handlers, e.g. for the ACL methods which perform exactly the same action for the access and default ACLs, just using a different underlying attribute. With a little more groundwork it'll also allow sharing the methods for the regular user/trusted/secure handlers in extN, ocfs2 and jffs2 like it's already done for xfs in this patch. Also change the inode argument to the handlers to a dentry to allow using the handlers mechnism for filesystems that require it later, e.g. cifs. [with GFS2 bits updated by Steven Whitehouse <swhiteho@redhat.com>] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | Untangling ima mess, part 1: alloc_file()Al Viro2009-12-161-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are 2 groups of alloc_file() callers: * ones that are followed by ima_counts_get * ones giving non-regular files So let's pull that ima_counts_get() into alloc_file(); it's a no-op in case of non-regular files. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | switch alloc_file() to passing struct pathAl Viro2009-12-161-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ... and have the caller grab both mnt and dentry; kill leak in infiniband, while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | switch shmem_file_setup() to alloc_file()Al Viro2009-12-161-12/+9
| |/ / / | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | memcg: code clean, remove unused variable in mem_cgroup_resize_limit()Bob Liu2009-12-161-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Variable `progress' isn't used in mem_cgroup_resize_limit() any more. Remove it. [akpm@linux-foundation.org: cleanup] Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | memcg: remove memcg_tasklistDaisuke Nishimura2009-12-161-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock caused by race between oom and cpuset_attach) instead of cgroup_mutex to fix a deadlock problem. The cgroup_mutex, which was removed by the commit, in mem_cgroup_out_of_memory() was originally introduced at commit c7ba5c9e (Memory controller: OOM handling). IIUC, the intention of this cgroup_mutex was to prevent task move during select_bad_process() so that situations like below can be avoided. Assume cgroup "foo" has exceeded its limit and is about to trigger oom. 1. Process A, which has been in cgroup "baa" and uses large memory, is just moved to cgroup "foo". Process A can be the candidates for being killed. 2. Process B, which has been in cgroup "foo" and uses large memory, is just moved from cgroup "foo". Process B can be excluded from the candidates for being killed. But these race window exists anyway even if we hold a lock, because __mem_cgroup_try_charge() decides wether it should trigger oom or not outside of the lock. So the original cgroup_mutex in mem_cgroup_out_of_memory and thus current memcg_tasklist has no use. And IMHO, those races are not so critical for users. This patch removes it and make codes simpler. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | memcg: avoid oom-killing innocent task in case of use_hierarchyDaisuke Nishimura2009-12-162-8/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks "curr->use_hierarchy"("curr" is the mem_cgroup the task belongs to). But this check return true(it's false positive) when: <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00. This patch is a fix for this bug. And this patch also fixes the arg for mem_cgroup_print_oom_info(). We should print information of mem_cgroup which the task being killed, not current, belongs to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | memcg: cleanup mem_cgroup_move_parent()Daisuke Nishimura2009-12-161-49/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure. IMHO, charge/uncharge(especially charge) is high cost operation, so we should avoid it as far as possible. This patch tries to delay try_charge in mem_cgroup_move_parent() by re-ordering checks it does. And this patch renames mem_cgroup_move_account() to __mem_cgroup_move_account(), changes the return value of __mem_cgroup_move_account() from int to void, and adds a new wrapper(mem_cgroup_move_account()), which checks whether a @pc is valid for moving account and calls __mem_cgroup_move_account(). This patch removes the last caller of trylock_page_cgroup(), so removes its definition too. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | memcg: add mem_cgroup_cancel_charge()Daisuke Nishimura2009-12-161-20/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are some places calling both res_counter_uncharge() and css_put() to cancel the charge and the refcnt we have got by mem_cgroup_tyr_charge(). This patch introduces mem_cgroup_cancel_charge() and call it in those places. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | memcg: make memcg's file mapped consistent with global VMKAMEZAWA Hiroyuki2009-12-162-14/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE. This makes grep difficult. Replace memcg's MAPPED_FILE with FILE_MAPPED And in global VM, mapped shared memory is accounted into FILE_MAPPED. But memcg doesn't. fix it. Note: page_is_file_cache() just checks SwapBacked or not. So, we need to check PageAnon. Cc: Balbir Singh <balbir@in.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | memcg: coalesce charging via percpu storageKAMEZAWA Hiroyuki2009-12-161-6/+156
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a patch for coalescing access to res_counter at charging by percpu caching. At charge, memcg charges 64pages and remember it in percpu cache. Because it's cache, drain/flush if necessary. This version uses public percpu area. 2 benefits for using public percpu area. 1. Sum of stocked charge in the system is limited to # of cpus not to the number of memcg. This shows better synchonization. 2. drain code for flush/cpuhotplug is very easy (and quick) The most important point of this patch is that we never touch res_counter in fast path. The res_counter is system-wide shared counter which is modified very frequently. We shouldn't touch it as far as we can for avoiding false sharing. On x86-64 8cpu server, I tested overheads of memcg at page fault by running a program which does map/fault/unmap in a loop. Running a task per a cpu by taskset and see sum of the number of page faults in 60secs. [without memcg config] 40156968 page-faults # 0.085 M/sec ( +- 0.046% ) 27.67 cache-miss/faults [root cgroup] 36659599 page-faults # 0.077 M/sec ( +- 0.247% ) 31.58 cache miss/faults [in a child cgroup] 18444157 page-faults # 0.039 M/sec ( +- 0.133% ) 69.96 cache miss/faults [ + coalescing uncharge patch] 27133719 page-faults # 0.057 M/sec ( +- 0.155% ) 47.16 cache miss/faults [ + coalescing uncharge patch + this patch ] 34224709 page-faults # 0.072 M/sec ( +- 0.173% ) 34.69 cache miss/faults Changelog (since Oct/2): - updated comments - replaced get_cpu_var() with __get_cpu_var() if possible. - removed mutex for system-wide drain. adds a counter instead of it. - removed CONFIG_HOTPLUG_CPU Changelog (old): - rebased onto the latest mmotm - moved charge size check before __GFP_WAIT check for avoiding unnecesary - added asynchronous flush routine. - fixed bugs pointed out by Nishimura-san. [akpm@linux-foundation.org: tweak comments] [nishimura@mxp.nes.nec.co.jp: don't do INIT_WORK() repeatedly against the same work_struct] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | memcg: coalesce uncharge during unmap/truncateKAMEZAWA Hiroyuki2009-12-163-6/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In massive parallel enviroment, res_counter can be a performance bottleneck. One strong techinque to reduce lock contention is reducing calls by coalescing some amount of calls into one. Considering charge/uncharge chatacteristic, - charge is done one by one via demand-paging. - uncharge is done by - in chunk at munmap, truncate, exit, execve... - one by one via vmscan/paging. It seems we have a chance to coalesce uncharges for improving scalability at unmap/truncation. This patch is a for coalescing uncharge. For avoiding scattering memcg's structure to functions under /mm, this patch adds memcg batch uncharge information to the task. A reason for per-task batching is for making use of caller's context information. We do batched uncharge (deleyed uncharge) when truncation/unmap occurs but do direct uncharge when uncharge is called by memory reclaim (vmscan.c). The degree of coalescing depends on callers - at invalidate/trucate... pagevec size - at unmap ....ZAP_BLOCK_SIZE (memory itself will be freed in this degree.) Then, we'll not coalescing too much. On x86-64 8cpu server, I tested overheads of memcg at page fault by running a program which does map/fault/unmap in a loop. Running a task per a cpu by taskset and see sum of the number of page faults in 60secs. [without memcg config] 40156968 page-faults # 0.085 M/sec ( +- 0.046% ) 27.67 cache-miss/faults [root cgroup] 36659599 page-faults # 0.077 M/sec ( +- 0.247% ) 31.58 miss/faults [in a child cgroup] 18444157 page-faults # 0.039 M/sec ( +- 0.133% ) 69.96 miss/faults [child with this patch] 27133719 page-faults # 0.057 M/sec ( +- 0.155% ) 47.16 miss/faults We can see some amounts of improvement. (root cgroup doesn't affected by this patch) Another patch for "charge" will follow this and above will be improved more. Changelog(since 2009/10/02): - renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes) - some clean up and commentary/description updates. - added initialize code to copy_process(). (possible bug fix) Changelog(old): - fixed !CONFIG_MEM_CGROUP case. - rebased onto the latest mmotm + softlimit fix patches. - unified patch for callers - added commetns. - make ->do_batch as bool. - removed css_get() at el. We don't need it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | memcg: fix memory.memsw.usage_in_bytes for root cgroupKirill A. Shutemov2009-12-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A memory cgroup has a memory.memsw.usage_in_bytes file. It shows the sum of the usage of pages and swapents in the cgroup. Presently the root cgroup's memsw.usage_in_bytes shows the wrong value - the number of swapents are not added. So take MEM_CGROUP_STAT_SWAPOUT into account. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | oom-kill: fix NUMA constraint check with nodemaskKAMEZAWA Hiroyuki2009-12-162-19/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change "current" behavior. If callers use __GFP_THISNODE it should handle "page allocation failure" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | oom-kill: show virtual size and rss information of the killed processKOSAKI Motohiro2009-12-161-3/+13
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a typical oom analysis scenario, we frequently want to know whether the killed process has a memory leak or not at the first step. This patch adds vsz and rss information to the oom log to help this analysis. To save time for the debugging. example: =================================================================== rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0 Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24 Call Trace: [<ffffffff8132e35b>] ?_spin_unlock+0x2b/0x40 [<ffffffff810f186e>] oom_kill_process+0xbe/0x2b0 (snip) 492283 pages non-shared Out of memory: kill process 2341 (memhog) score 527276 or a child Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB =========================================================================== ^ | here [rientjes@google.com: fix race, add pid & comm to message] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | nommu: fix malloc performance by adding uninitialized flagJie Zhang2009-12-151-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The NOMMU code currently clears all anonymous mmapped memory. While this is what we want in the default case, all memory allocation from userspace under NOMMU has to go through this interface, including malloc() which is allowed to return uninitialized memory. This can easily be a significant performance penalty. So for constrained embedded systems were security is irrelevant, allow people to avoid clearing memory unnecessarily. This also alters the ELF-FDPIC binfmt such that it obtains uninitialised memory for the brk and stack region. Signed-off-by: Jie Zhang <jie.zhang@analog.com> Signed-off-by: Robin Getz <rgetz@blackfin.uclinux.org> Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm hugetlb: add hugepage support to pagemapNaoya Horiguchi2009-12-151-3/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch enables extraction of the pfn of a hugepage from /proc/pid/pagemap in an architecture independent manner. Details ------- My test program (leak_pagemap) works as follows: - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,) - read()/write() something on it, - call page-types with option -p, - munmap() and unlink() the file on hugetlbfs Without my patches ------------------ $ ./leak_pagemap flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 1 0 __________________________________ 0x0000000000000804 1 0 __R________M______________________ referenced,mmap 0x000000000000086c 81 0 __RU_lA____M______________________ referenced,uptodate,lru,active,mmap 0x0000000000005808 5 0 ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked 0x0000000000005868 12 0 ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 1 0 __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 101 0 The output of page-types don't show any hugepage. With my patches --------------- $ ./leak_pagemap flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 1 0 __________________________________ 0x0000000000030000 51100 199 ________________TG________________ compound_tail,huge 0x0000000000028018 100 0 ___UD__________H_G________________ uptodate,dirty,compound_head,huge 0x0000000000000804 1 0 __R________M______________________ referenced,mmap 0x000000000000080c 1 0 __RU_______M______________________ referenced,uptodate,mmap 0x000000000000086c 80 0 __RU_lA____M______________________ referenced,uptodate,lru,active,mmap 0x0000000000005808 4 0 ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked 0x0000000000005868 12 0 ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 1 0 __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 51300 200 The output of page-types shows 51200 pages contributing to hugepages, containing 100 head pages and 51100 tail pages as expected. [akpm@linux-foundation.org: build fix] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm: hugetlb: fix hugepage memory leak in walk_page_range()Naoya Horiguchi2009-12-151-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most callers of pmd_none_or_clear_bad() check whether the target page is in a hugepage or not, but walk_page_range() do not check it. So if we read /proc/pid/pagemap for the hugepage on x86 machine, the hugepage memory is leaked as shown below. This patch fixes it. Details ======= My test program (leak_pagemap) works as follows: - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,) - read()/write() something on it, - call page-types with option -p (walk around the page tables), - munmap() and unlink() the file on hugetlbfs Without my patches ------------------ $ cat /proc/meminfo |grep "HugePage" HugePages_Total: 1000 HugePages_Free: 1000 HugePages_Rsvd: 0 HugePages_Surp: 0 $ ./leak_pagemap [snip output] $ cat /proc/meminfo |grep "HugePage" HugePages_Total: 1000 HugePages_Free: 900 HugePages_Rsvd: 0 HugePages_Surp: 0 $ ls /hugetlbfs/ $ 100 hugepages are accounted as used while there is no file on hugetlbfs. With my patches --------------- $ cat /proc/meminfo |grep "HugePage" HugePages_Total: 1000 HugePages_Free: 1000 HugePages_Rsvd: 0 HugePages_Surp: 0 $ ./leak_pagemap [snip output] $ cat /proc/meminfo |grep "HugePage" HugePages_Total: 1000 HugePages_Free: 1000 HugePages_Rsvd: 0 HugePages_Surp: 0 $ ls /hugetlbfs $ No memory leaks. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm: hugetlb: fix hugepage memory leak in mincore()Naoya Horiguchi2009-12-151-0/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most callers of pmd_none_or_clear_bad() check whether the target page is in a hugepage or not, but mincore() and walk_page_range() do not check it. So if we use mincore() on a hugepage on x86 machine, the hugepage memory is leaked as shown below. This patch fixes it by extending mincore() system call to support hugepages. Details ======= My test program (leak_mincore) works as follows: - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,) - read()/write() something on it, - call mincore() for first ten pages and printf() the values of *vec - munmap() and unlink() the file on hugetlbfs Without my patch ---------------- $ cat /proc/meminfo| grep "HugePage" HugePages_Total: 1000 HugePages_Free: 1000 HugePages_Rsvd: 0 HugePages_Surp: 0 $ ./leak_mincore vec[0] 0 vec[1] 0 vec[2] 0 vec[3] 0 vec[4] 0 vec[5] 0 vec[6] 0 vec[7] 0 vec[8] 0 vec[9] 0 $ cat /proc/meminfo |grep "HugePage" HugePages_Total: 1000 HugePages_Free: 999 HugePages_Rsvd: 0 HugePages_Surp: 0 $ ls /hugetlbfs/ $ Return values in *vec from mincore() are set to 0, while the hugepage should be in memory, and 1 hugepage is still accounted as used while there is no file on hugetlbfs. With my patch ------------- $ cat /proc/meminfo| grep "HugePage" HugePages_Total: 1000 HugePages_Free: 1000 HugePages_Rsvd: 0 HugePages_Surp: 0 $ ./leak_mincore vec[0] 1 vec[1] 1 vec[2] 1 vec[3] 1 vec[4] 1 vec[5] 1 vec[6] 1 vec[7] 1 vec[8] 1 vec[9] 1 $ cat /proc/meminfo |grep "HugePage" HugePages_Total: 1000 HugePages_Free: 1000 HugePages_Rsvd: 0 HugePages_Surp: 0 $ ls /hugetlbfs/ $ Return value in *vec set to 1 and no memory leaks. [akpm@linux-foundation.org: cleanup] [akpm@linux-foundation.org: build fix] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | hugetlb: abort a hugepage pool resize if a signal is pendingMel Gorman2009-12-151-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a user asks for a hugepage pool resize but specified a large number, the machine can begin trashing. In response, they might hit ctrl-c but signals are ignored and the pool resize continues until it fails an allocation. This can take a considerable amount of time so this patch aborts a pool resize if a signal is pending. Suggested by Dave Hansen. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mlock: replace stale comments in munlock_vma_page()Lee Schermerhorn2009-12-151-22/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cleanup stale comments on munlock_vma_page(). Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm: remove unevictable_migrate_page functionLee Schermerhorn2009-12-152-14/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | unevictable_migrate_page() in mm/internal.h is a relic of the since removed UNEVICTABLE_LRU Kconfig option. This patch removes the function and open codes the test in migrate_page_copy(). Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | hugetlb: acquire the i_mmap_lock before walking the prio_tree to unmap a pageMel Gorman2009-12-151-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the owner of a mapping fails COW because a child process is holding a reference, the children VMAs are walked and the page is unmapped. The i_mmap_lock is taken for the unmapping of the page but not the walking of the prio_tree. In theory, that tree could be changing if the lock is not held. This patch takes the i_mmap_lock properly for the duration of the prio_tree walk. [hugh.dickins@tiscali.co.uk: Spotted the problem in the first place] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm: uncached vma support with writenotifyMagnus Damm2009-12-151-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Modify the generic mmap() code to keep the cache attribute in vma->vm_page_prot regardless if writenotify is enabled or not. Without this patch the cache configuration selected by f_op->mmap() is overwritten if writenotify is enabled, making it impossible to keep the vma uncached. Needed by drivers such as drivers/video/sh_mobile_lcdcfb.c which uses deferred io together with uncached memory. Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jaya Kumar <jayakumar.lkml@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | vmscan: simplify codeHuang Shijie2009-12-151-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify the code for shrink_inactive_list(). Signed-off-by: Huang Shijie <shijie8@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | vmscan: do not evict inactive pages when skipping an active list scanRik van Riel2009-12-151-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In AIM7 runs, recent kernels start swapping out anonymous pages well before they should. This is due to shrink_list falling through to shrink_inactive_list if !inactive_anon_is_low(zone, sc), when all we really wanted to do is pre-age some anonymous pages to give them extra time to be referenced while on the inactive list. The obvious fix is to make sure that shrink_list does not fall through to scanning/reclaiming inactive pages when we called it to scan one of the active lists. This change should be safe because the loop in shrink_zone ensures that we will still shrink the anon and file inactive lists whenever we should. [kosaki.motohiro@jp.fujitsu.com: inactive_file_is_low() should be inactive_anon_is_low()] Reported-by: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tomasz Chmielewski <mangoo@wpkg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm/bootmem.c: properly __init-annotate helper functionsJan Beulich2009-12-151-4/+4
| | | | | | | | | | | | | | | | | | | | | Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm: simplify try_to_unmap_one()KOSAKI Motohiro2009-12-151-13/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SWAP_MLOCK mean "We marked the page as PG_MLOCK, please move it to unevictable-lru". So, following code is easy confusable. if (vma->vm_flags & VM_LOCKED) { ret = SWAP_MLOCK; goto out_unmap; } Plus, if the VMA doesn't have VM_LOCKED, We don't need to check the needed of calling mlock_vma_page(). Also, add some commentary to try_to_unmap_one(). Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm: fix section mismatch in memory_hotplug.cRakib Mullick2009-12-151-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __free_pages_bootmem() is a __meminit function - which has been called from put_pages_bootmem thus causes a section mismatch warning. We were warned by the following warning: LD mm/built-in.o WARNING: mm/built-in.o(.text+0x26b22): Section mismatch in reference from the function put_page_bootmem() to the function .meminit.text:__free_pages_bootmem() The function put_page_bootmem() references the function __meminit __free_pages_bootmem(). This is often because put_page_bootmem lacks a __meminit annotation or the annotation of __free_pages_bootmem is wrong. Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | hugetlb: prevent deadlock in __unmap_hugepage_range() when alloc_huge_page() ↵Larry Woodman2009-12-151-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fails hugetlb_fault() takes the mm->page_table_lock spinlock then calls hugetlb_cow(). If the alloc_huge_page() in hugetlb_cow() fails due to an insufficient huge page pool it calls unmap_ref_private() with the mm->page_table_lock held. unmap_ref_private() then calls unmap_hugepage_range() which tries to acquire the mm->page_table_lock. [<ffffffff810928c3>] print_circular_bug_tail+0x80/0x9f [<ffffffff8109280b>] ? check_noncircular+0xb0/0xe8 [<ffffffff810935e0>] __lock_acquire+0x956/0xc0e [<ffffffff81093986>] lock_acquire+0xee/0x12e [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84 [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84 [<ffffffff814c348d>] _spin_lock+0x40/0x89 [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84 [<ffffffff8111afee>] ? alloc_huge_page+0x218/0x318 [<ffffffff8111a7a6>] unmap_hugepage_range+0x3e/0x84 [<ffffffff8111b2d0>] hugetlb_cow+0x1e2/0x3f4 [<ffffffff8111b935>] ? hugetlb_fault+0x453/0x4f6 [<ffffffff8111b962>] hugetlb_fault+0x480/0x4f6 [<ffffffff8111baee>] follow_hugetlb_page+0x116/0x2d9 [<ffffffff814c31a7>] ? _spin_unlock_irq+0x3a/0x5c [<ffffffff81107b4d>] __get_user_pages+0x2a3/0x427 [<ffffffff81107d0f>] get_user_pages+0x3e/0x54 [<ffffffff81040b8b>] get_user_pages_fast+0x170/0x1b5 [<ffffffff81160352>] dio_get_page+0x64/0x14a [<ffffffff8116112a>] __blockdev_direct_IO+0x4b7/0xb31 [<ffffffff8115ef91>] blkdev_direct_IO+0x58/0x6e [<ffffffff8115e0a4>] ? blkdev_get_blocks+0x0/0xb8 [<ffffffff810ed2c5>] generic_file_aio_read+0xdd/0x528 [<ffffffff81219da3>] ? avc_has_perm+0x66/0x8c [<ffffffff81132842>] do_sync_read+0xf5/0x146 [<ffffffff8107da00>] ? autoremove_wake_function+0x0/0x5a [<ffffffff81211857>] ? security_file_permission+0x24/0x3a [<ffffffff81132fd8>] vfs_read+0xb5/0x126 [<ffffffff81133f6b>] ? fget_light+0x5e/0xf8 [<ffffffff81133131>] sys_read+0x54/0x8c [<ffffffff81011e42>] system_call_fastpath+0x16/0x1b This can be fixed by dropping the mm->page_table_lock around the call to unmap_ref_private() if alloc_huge_page() fails, its dropped right below in the normal path anyway. However, earlier in the that function, it's also possible to call into the page allocator with the same spinlock held. What this patch does is drop the spinlock before the page allocator is potentially entered. The check for page allocation failure can be made without the page_table_lock as well as the copy of the huge page. Even if the PTE changed while the spinlock was held, the consequence is that a huge page is copied unnecessarily. This resolves both the double taking of the lock and sleeping with the spinlock held. [mel@csn.ul.ie: Cover also the case where process can sleep with spinlock] Signed-off-by: Larry Woodman <lwooman@redhat.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm: memory_hotplug: make offline_pages() staticAndrew Morton2009-12-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It has no references outside memory_hotplug.c. Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Andi Kleen <andi@firstfloor.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: remove unswappable max_kernel_pagesHugh Dickins2009-12-152-40/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that ksm pages are swappable, and the known holes plugged, remove mention of unswappable kernel pages from KSM documentation and comments. Remove the totalram_pages/4 initialization of max_kernel_pages. In fact, remove max_kernel_pages altogether - we can reinstate it if removal turns out to break someone's script; but if we later want to limit KSM's memory usage, limiting the stable nodes would not be an effective approach. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: memory hotremove migration onlyHugh Dickins2009-12-154-32/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous patch enables page migration of ksm pages, but that soon gets into trouble: not surprising, since we're using the ksm page lock to lock operations on its stable_node, but page migration switches the page whose lock is to be used for that. Another layer of locking would fix it, but do we need that yet? Do we actually need page migration of ksm pages? Yes, memory hotremove needs to offline sections of memory: and since we stopped allocating ksm pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE candidates for migration. But KSM is currently unconscious of NUMA issues, happily merging pages from different NUMA nodes: at present the rule must be, not to use MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of ksm pages does not make sense yet. So, to complete support for ksm swapping we need to make hotremove safe. ksm_memory_callback() take ksm_thread_mutex when MEM_GOING_OFFLINE and release it when MEM_OFFLINE or MEM_CANCEL_OFFLINE. But if mapped pages are freed before migration reaches them, stable_nodes may be left still pointing to struct pages which have been removed from the system: the stable_node needs to identify a page by pfn rather than page pointer, then it can safely prune them when MEM_OFFLINE. And make NUMA migration skip PageKsm pages where it skips PageReserved. But it's only when we reach unmap_and_move() that the page lock is taken and we can be sure that raised pagecount has prevented a PageAnon from being upgraded: so add offlining arg to migrate_pages(), to migrate ksm page when offlining (has sufficient locking) but reject it otherwise. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: rmap_walk to remove_migation_ptesHugh Dickins2009-12-153-67/+162
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A side-effect of making ksm pages swappable is that they have to be placed on the LRUs: which then exposes them to isolate_lru_page() and hence to page migration. Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some consolidation with existing code is possible, but don't attempt that yet (try_to_unmap needs to handle nonlinears, but migration pte removal does not). rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like remove_anon_migration_ptes() which it replaces, avoids calling page_lock_anon_vma(), because that includes a page_mapped() test which fails when all migration ptes are in place. That was valid when NUMA page migration was introduced (holding mmap_sem provided the missing guarantee that anon_vma's slab had not already been destroyed), but I believe not valid in the memory hotremove case added since. For now do the same as before, and consider the best way to fix that unlikely race later on. When fixed, we can probably use rmap_walk() on hwpoisoned ksm pages too: for now, they remain among hwpoison's various exceptions (its PageKsm test comes before the page is locked, but its page_lock_anon_vma fails safely if an anon gets upgraded). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: mem cgroup charge swapin copyHugh Dickins2009-12-151-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | But ksm swapping does require one small change in mem cgroup handling. When do_swap_page()'s call to ksm_might_need_to_copy() does indeed substitute a duplicate page to accommodate a different anon_vma (or a the !PageSwapCache check in mem_cgroup_try_charge_swapin(). That was returning success without charging, on the assumption that pte_same() would fail after, which is not the case here. Originally I proposed that success, so that an unshrinkable mem cgroup at its limit would not fail unnecessarily; but that's a minor point, and there are plenty of other places where we may fail an overallocation which might later prove unnecessary. So just go ahead and do what all the other exceptions do: proceed to charge current mm. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: share anon page without allocatingHugh Dickins2009-12-152-48/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ksm pages were unswappable, it made no sense to include them in mem cgroup accounting; but now that they are swappable (although I see no strict logical connection) the principle of least surprise implies that they should be accounted (with the usual dissatisfaction, that a shared page is accounted to only one of the cgroups using it). This patch was intended to add mem cgroup accounting where necessary; but turned inside out, it now avoids allocating a ksm page, instead upgrading an anon page to ksm - which brings its existing mem cgroup accounting with it. Thus mem cgroups don't appear in the patch at all. This upgrade from PageAnon to PageKsm takes place under page lock (via a somewhat hacky NULL kpage interface), and audit showed only one place which needed to cope with the race - page_referenced() is sometimes used without page lock, so page_lock_anon_vma() needs an ACCESS_ONCE() to be sure of getting anon_vma and flags together (no problem if the page goes ksm an instant after, the integrity of that anon_vma list is unaffected). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: take keyhole reference to pageHugh Dickins2009-12-151-39/+110
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a lamentable flaw in KSM swapping: the stable_node holds a reference to the ksm page, so the page to be freed cannot actually be freed until ksmd works its way around to removing the last rmap_item from its stable_node. Which in some configurations may take minutes: not quite responsive enough for memory reclaim. And we don't want to twist KSM and its locking more tightly into the rest of mm. What a pity. But although the stable_node needs to hold a pointer to the ksm page, does it actually need to raise the reference count of that page? No. It would need to do so if struct pages were ordinary kmalloc'ed objects; but they are more stable than that, and reused in particular ways according to particular rules. Access to stable_node from its pointer in struct page is no problem, so long as we never free a stable_node before the ksm page itself has been freed. Access to struct page from its pointer in stable_node: reintroduce get_ksm_page(), and let that peep out through its keyhole (the stable_node pointer to ksm page), to see if that struct page still holds the right key to open it (the ksm page mapping pointer back to this stable_node). This relies upon the established way in which free_hot_cold_page() sets an anon (including ksm) page->mapping to NULL; and relies upon no other user of a struct page to put something which looks like the original stable_node pointer (with two low bits also set) into page->mapping. It also needs get_page_unless_zero() technique pioneered by speculative pagecache; and uses rcu_read_lock() to keep the guarantees that gives. There are several drivers which put pointers of their own into page-> mapping; but none of those could coincide with our stable_node pointers, since KSM won't free a stable_node until it sees that the page has gone. The only problem case found is the pagetable spinlock USE_SPLIT_PTLOCKS places in struct page (my own abuse): to accommodate GENERIC_LOCKBREAK's break_lock on 32-bit, that spans both page->private and page->mapping. Since break_lock is only 0 or 1, again no confusion for get_ksm_page(). But what of DEBUG_SPINLOCK on 64-bit bigendian? When owner_cpu is 3 (matching PageKsm low bits), it might see 0xdead4ead00000003 in page-> mapping, which might coincide? We could get around that by... but a better answer is to suppress USE_SPLIT_PTLOCKS when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC, to stop bloating sizeof(struct page) in their case - already proposed in an earlier mm/Kconfig patch. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: hold anon_vma in rmap_itemHugh Dickins2009-12-152-64/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For full functionality, page_referenced_one() and try_to_unmap_one() need to know the vma: to pass vma down to arch-dependent flushes, or to observe VM_LOCKED or VM_EXEC. But KSM keeps no record of vma: nor can it, since vmas get split and merged without its knowledge. Instead, note page's anon_vma in its rmap_item when adding to stable tree: all the vmas which might map that page are listed by its anon_vma. page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma, first to find the probable vma, that which matches rmap_item's mm; but if that is not enough to locate all instances, traverse again to try the others. This catches those occasions when fork has duplicated a pte of a ksm page, but ksmd has not yet come around to assign it an rmap_item. But each rmap_item in the stable tree which refers to an anon_vma needs to take a reference to it. Andrea's anon_vma design cleverly avoided a reference count (an anon_vma was free when its list of vmas was empty), but KSM now needs to add that. Is a 32-bit count sufficient? I believe so - the anon_vma is only free when both count is 0 and list is empty. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: let shared pages be swappableHugh Dickins2009-12-154-43/+211
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Initial implementation for swapping out KSM's shared pages: add page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when faced with a PageKsm page. Most of what's needed can be got from the rmap_items listed from the stable_node of the ksm page, without discovering the actual vma: so in this patch just fake up a struct vma for page_referenced_one() or try_to_unmap_one(), then refine that in the next patch. Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been implicit there (being only set with VM_SHARED, already excluded), but let's make it explicit, to help justify the lack of nonlinear unmap. Rely on the page lock to protect against concurrent modifications to that page's node of the stable tree. The awkward part is not swapout but swapin: do_swap_page() and page_add_anon_rmap() now have to allow for new possibilities - perhaps a ksm page still in swapcache, perhaps a swapcache page associated with one location in one anon_vma now needed for another location or anon_vma. (And the vma might even be no longer VM_MERGEABLE when that happens.) ksm_might_need_to_copy() checks for that case, and supplies a duplicate page when necessary, simply leaving it to a subsequent pass of ksmd to rediscover the identity and merge them back into one ksm page. Disappointingly primitive: but the alternative would have to accumulate unswappable info about the swapped out ksm pages, limiting swappability. Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the particular case it was handling, so just use it instead. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: fix mlockfreed to munlockedHugh Dickins2009-12-153-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When KSM merges an mlocked page, it has been forgetting to munlock it: that's been left to free_page_mlock(), which reports it in /proc/vmstat as unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and whinges "Page flag mlocked set for process" in mmotm, whereas mainline is silently forgiving). Call munlock_vma_page() to fix that. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: stable_node point to page and backHugh Dickins2009-12-151-65/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a pointer to the ksm page into struct stable_node, holding a reference to the page while the node exists. Put a pointer to the stable_node into the ksm page's ->mapping. Then we don't need get_ksm_page() while traversing the stable tree: the page to compare against is sure to be present and correct, even if it's no longer visible through any of its existing rmap_items. And we can handle the forked ksm page case more efficiently: no need to memcmp our way through the tree to find its match. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: separate stable_nodeHugh Dickins2009-12-151-79/+101
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Though we still do well to keep rmap_items in the unstable tree without a separate tree_item at the node, for several reasons it becomes awkward to keep rmap_items in the stable tree without a separate stable_node: lack of space in the nicely-sized rmap_item, the need for an anchor as rmap_items are removed, the need for a node even when temporarily no rmap_items are attached to it. So declare struct stable_node (rb_node to place it in the tree and hlist_head for the rmap_items hanging off it), and convert stable tree handling to use it: without yet taking advantage of it. Note how one stable_tree_insert() of a node now has _two_ stable_tree_append()s of the two rmap_items being merged. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: singly-linked rmap_listHugh Dickins2009-12-151-30/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Free up a pointer in struct rmap_item, by making the mm_slot's rmap_list a singly-linked list: we always traverse that list sequentially, and we don't even lose any prefetches (but should consider adding a few later). Name it rmap_list throughout. Do we need to free up that pointer? Not immediately, and in the end, we could continue to avoid it with a union; but having done the conversion, let's keep it this way, since there's no downside, and maybe we'll want more in future (struct rmap_item is a cache-friendly 32 bytes on 32-bit and 64 bytes on 64-bit, so we shall want to avoid expanding it). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: cleanup some function argumentsHugh Dickins2009-12-151-122/+112
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cleanup: make argument names more consistent from cmp_and_merge_page() down to replace_page(), so that it's easier to follow the rmap_item's page and the matching tree_page and the merged kpage through that code. In some places, e.g. break_cow(), pass rmap_item instead of separate mm and address. cmp_and_merge_page() initialize tree_page to NULL, to avoid a "may be used uninitialized" warning seen in one config by Anil SB. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: remove redundancies when merging pageHugh Dickins2009-12-151-20/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no need for replace_page() to calculate a write-protected prot vm_page_prot must already be write-protected for an anonymous page (see mm/memory.c do_anonymous_page() for similar reliance on vm_page_prot). There is no need for try_to_merge_one_page() to get_page and put_page on newpage and oldpage: in every case we already hold a reference to each of them. But some instinct makes me move try_to_merge_one_page()'s unlock_page of oldpage down after replace_page(): that doesn't increase contention on the ksm page, and makes thinking about the transition easier. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ksm: three remove_rmap_item_from_tree cleanupsHugh Dickins2009-12-151-11/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. remove_rmap_item_from_tree() is called as a precaution from various places: don't dirty the rmap_item cacheline unnecessarily, just mask the flags out of the address when they have been set. 2. First get_next_rmap_item() removes an unstable rmap_item from its tree, then shortly afterwards cmp_and_merge_page() removes a stable rmap_item from its tree: it's easier just to do both at once (but definitely keep the BUG_ON(age > 1) which guards against a future omission). 3. When cmp_and_merge_page() moves an rmap_item from unstable to stable tree, it does its own rb_erase() and accounting: that's better expressed by remove_rmap_item_from_tree(). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud