From 63bfd7384b119409685a17d5c58f0b56e5dc03da Mon Sep 17 00:00:00 2001
From: Pekka Enberg <penberg@kernel.org>
Date: Mon, 8 Nov 2010 21:29:07 +0200
Subject: perf_events: Fix perf_counter_mmap() hook in mprotect()

As pointed out by Linus, commit dab5855 ("perf_counter: Add mmap event hooks to
mprotect()") is fundamentally wrong as mprotect_fixup() can free 'vma' due to
merging. Fix the problem by moving perf_event_mmap() hook to
mprotect_fixup().

Note: there's another successful return path from mprotect_fixup() if old
flags equal to new flags. We don't, however, need to call
perf_event_mmap() there because 'perf' already knows the VMA is
executable.

Reported-by: Dave Jones <davej@redhat.com>
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/mprotect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2d1bf7c..4c51338 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -211,6 +211,7 @@ success:
 	mmu_notifier_invalidate_range_end(mm, start, end);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
+	perf_event_mmap(vma);
 	return 0;
 
 fail:
@@ -299,7 +300,6 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
 		if (error)
 			goto out;
-		perf_event_mmap(vma);
 		nstart = tmp;
 
 		if (nstart < prev->vm_end)
-- 
cgit v1.1


From d2e61b8dc99fdb36e0fd176e25365f69afda4ff9 Mon Sep 17 00:00:00 2001
From: Dan Carpenter <error27@gmail.com>
Date: Thu, 11 Nov 2010 14:05:12 -0800
Subject: memcg: null dereference on allocation failure

The original code had a null dereference if alloc_percpu() failed.  This
was introduced in commit 711d3d2c9bc3 ("memcg: cpu hotplug aware percpu
count updates")

Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memcontrol.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

(limited to 'mm')

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a99cfa..2efa8ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4208,15 +4208,17 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 
 	memset(mem, 0, size);
 	mem->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
-	if (!mem->stat) {
-		if (size < PAGE_SIZE)
-			kfree(mem);
-		else
-			vfree(mem);
-		mem = NULL;
-	}
+	if (!mem->stat)
+		goto out_free;
 	spin_lock_init(&mem->pcp_counter_lock);
 	return mem;
+
+out_free:
+	if (size < PAGE_SIZE)
+		kfree(mem);
+	else
+		vfree(mem);
+	return NULL;
 }
 
 /*
-- 
cgit v1.1


From 8d056cb965b8fb7c53c564abf28b1962d1061cd3 Mon Sep 17 00:00:00 2001
From: Dave Hansen <dave@linux.vnet.ibm.com>
Date: Thu, 11 Nov 2010 14:05:15 -0800
Subject: mm/vfs: revalidate page->mapping in do_generic_file_read()

70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
ran into a NULL dereference in here:

	int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
	                                        unsigned long from)
	{
---->		struct inode *inode = page->mapping->host;

It looks like page->mapping was the culprit.  (xmon trace is below).
After closer examination, I realized that do_generic_file_read() does a
find_get_page(), and eventually locks the page before calling
block_is_partially_uptodate().  However, it doesn't revalidate the
page->mapping after the page is locked.  So, there's a small window
between the find_get_page() and ->is_partially_uptodate() where the page
could get truncated and page->mapping cleared.

We _have_ a reference, so it can't get reclaimed, but it certainly
can be truncated.

I think the correct thing is to check page->mapping after the
trylock_page(), and jump out if it got truncated.  This patch has been
running in the test environment for a month or so now, and we have not
seen this bug pop up again.

xmon info:

  1f:mon> e
  cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
      pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
      lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
      sp: c0000002ae36f9f0
     msr: 8000000000009032
     dar: 0
   dsisr: 40000000
    current = 0xc000000378f99e30
    paca    = 0xc000000000f66300
      pid   = 21946, comm = bash
  1f:mon> r
  R00 = 0025c0500000006d   R16 = 0000000000000000
  R01 = c0000002ae36f9f0   R17 = c000000362cd3af0
  R02 = c000000000e8cd80   R18 = ffffffffffffffff
  R03 = c0000000031d0f88   R19 = 0000000000000001
  R04 = c0000002ae36fa68   R20 = c0000003bb97b8a0
  R05 = 0000000000000000   R21 = c0000002ae36fa68
  R06 = 0000000000000000   R22 = 0000000000000000
  R07 = 0000000000000001   R23 = c0000002ae36fbb0
  R08 = 0000000000000002   R24 = 0000000000000000
  R09 = 0000000000000000   R25 = c000000362cd3a80
  R10 = 0000000000000000   R26 = 0000000000000002
  R11 = c0000000001e7b60   R27 = 0000000000000000
  R12 = 0000000042000484   R28 = 0000000000000001
  R13 = c000000000f66300   R29 = c0000003bb97b9b8
  R14 = 0000000000000001   R30 = c000000000e28a08
  R15 = 000000000000ffff   R31 = c0000000031d0f88
  pc  = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
  lr  = c000000000142944 .generic_file_aio_read+0x1e4/0x770
  msr = 8000000000009032   cr  = 22000488
  ctr = c0000000001e7a60   xer = 0000000020000000   trap =  300
  dar = 0000000000000000   dsisr = 40000000
  1f:mon> t
  [link register   ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
  [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
  [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
  [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
  [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
  [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
  --- Exception: c00 (System Call) at 00000080a840bc54
  SP (fffca15df30) is in userspace
  1f:mon> di c0000000001e7a6c
  c0000000001e7a6c  e9290000      ld      r9,0(r9)
  c0000000001e7a70  418200c0      beq     c0000000001e7b30        # .block_is_partially_uptodate+0xd0/0x100
  c0000000001e7a74  e9440008      ld      r10,8(r4)
  c0000000001e7a78  78a80020      clrldi  r8,r5,32
  c0000000001e7a7c  3c000001      lis     r0,1
  c0000000001e7a80  812900a8      lwz     r9,168(r9)
  c0000000001e7a84  39600001      li      r11,1
  c0000000001e7a88  7c080050      subf    r0,r8,r0
  c0000000001e7a8c  7f805040      cmplw   cr7,r0,r10
  c0000000001e7a90  7d6b4830      slw     r11,r11,r9
  c0000000001e7a94  796b0020      clrldi  r11,r11,32
  c0000000001e7a98  419d00a8      bgt     cr7,c0000000001e7b40    # .block_is_partially_uptodate+0xe0/0x100
  c0000000001e7a9c  7fa55840      cmpld   cr7,r5,r11
  c0000000001e7aa0  7d004214      add     r8,r0,r8
  c0000000001e7aa4  79080020      clrldi  r8,r8,32
  c0000000001e7aa8  419c0078      blt     cr7,c0000000001e7b20    # .block_is_partially_uptodate+0xc0/0x100

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <arunabal@in.ibm.com>
Cc: <sbest@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/filemap.c | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'mm')

diff --git a/mm/filemap.c b/mm/filemap.c
index 61ba5e4..4ee2e99 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1029,6 +1029,9 @@ find_page:
 				goto page_not_up_to_date;
 			if (!trylock_page(page))
 				goto page_not_up_to_date;
+			/* Did it get truncated before we got the lock? */
+			if (!page->mapping)
+				goto page_not_up_to_date_locked;
 			if (!mapping->a_ops->is_partially_uptodate(page,
 								desc, offset))
 				goto page_not_up_to_date_locked;
-- 
cgit v1.1


From 1dce071e18b7264457d17c0dec4c7e430bfaee7d Mon Sep 17 00:00:00 2001
From: Shaohua Li <shaohua.li@intel.com>
Date: Thu, 11 Nov 2010 14:05:17 -0800
Subject: vmscan: avoid setting zone congested if no page dirty

nr_dirty and nr_congested are increased only when the page is dirty.  So
if all pages are clean, both them will be zero.  In this case, we should
not mark the zone congested.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b8a6fdc..d31d7ce 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -913,7 +913,7 @@ keep_lumpy:
 	 * back off and wait for congestion to clear because further reclaim
 	 * will encounter the same problem
 	 */
-	if (nr_dirty == nr_congested)
+	if (nr_dirty == nr_congested && nr_dirty != 0)
 		zone_set_flag(zone, ZONE_CONGESTED);
 
 	free_page_list(&free_pages);
-- 
cgit v1.1


From 27d20fddc8af539464fc3ba499d6a830054c3bd6 Mon Sep 17 00:00:00 2001
From: Nick Piggin <npiggin@kernel.dk>
Date: Thu, 11 Nov 2010 14:05:19 -0800
Subject: radix-tree: fix RCU bug

Salman Qazi describes the following radix-tree bug:

In the following case, we get can get a deadlock:

0.  The radix tree contains two items, one has the index 0.
1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
2.  The reader acquires slot(s) for item(s) including the index 0 item.
3.  The non-zero index item is deleted, and as a consequence the other item is
    moved to the root of the tree. The place where it used to be is queued for
    deletion after the readers finish.
3b. The zero item is deleted, removing it from the direct slot, it remains in
    the rcu-delayed indirect node.
4.  The reader looks at the index 0 slot, and finds that the page has 0 ref
    count
5.  The reader looks at it again, hoping that the item will either be freed or
    the ref count will increase. This never happens, as the slot it is looking
    at will never be updated. Also, this slot can never be reclaimed because
    the reader is holding rcu_read_lock and is in an infinite loop.

The fix is to re-use the same "indirect" pointer case that requires a slot
lookup retry into a general "retry the lookup" bit.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Reported-by: Salman Qazi <sqazi@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/filemap.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

(limited to 'mm')

diff --git a/mm/filemap.c b/mm/filemap.c
index 4ee2e99..ea89840 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -644,7 +644,9 @@ repeat:
 	pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
 	if (pagep) {
 		page = radix_tree_deref_slot(pagep);
-		if (unlikely(!page || page == RADIX_TREE_RETRY))
+		if (unlikely(!page))
+			goto out;
+		if (radix_tree_deref_retry(page))
 			goto repeat;
 
 		if (!page_cache_get_speculative(page))
@@ -660,6 +662,7 @@ repeat:
 			goto repeat;
 		}
 	}
+out:
 	rcu_read_unlock();
 
 	return page;
@@ -777,12 +780,11 @@ repeat:
 		page = radix_tree_deref_slot((void **)pages[i]);
 		if (unlikely(!page))
 			continue;
-		/*
-		 * this can only trigger if nr_found == 1, making livelock
-		 * a non issue.
-		 */
-		if (unlikely(page == RADIX_TREE_RETRY))
+		if (radix_tree_deref_retry(page)) {
+			if (ret)
+				start = pages[ret-1]->index;
 			goto restart;
+		}
 
 		if (!page_cache_get_speculative(page))
 			goto repeat;
@@ -830,11 +832,7 @@ repeat:
 		page = radix_tree_deref_slot((void **)pages[i]);
 		if (unlikely(!page))
 			continue;
-		/*
-		 * this can only trigger if nr_found == 1, making livelock
-		 * a non issue.
-		 */
-		if (unlikely(page == RADIX_TREE_RETRY))
+		if (radix_tree_deref_retry(page))
 			goto restart;
 
 		if (page->mapping == NULL || page->index != index)
@@ -887,11 +885,7 @@ repeat:
 		page = radix_tree_deref_slot((void **)pages[i]);
 		if (unlikely(!page))
 			continue;
-		/*
-		 * this can only trigger if nr_found == 1, making livelock
-		 * a non issue.
-		 */
-		if (unlikely(page == RADIX_TREE_RETRY))
+		if (radix_tree_deref_retry(page))
 			goto restart;
 
 		if (!page_cache_get_speculative(page))
-- 
cgit v1.1


From 68cee4f118c21a1c67e5764a91d766661db5b360 Mon Sep 17 00:00:00 2001
From: Pavel Emelyanov <xemul@parallels.com>
Date: Thu, 28 Oct 2010 13:50:37 +0400
Subject: slub: Fix slub_lock down/up imbalance

There are two places, that do not release the slub_lock.

Respective bugs were introduced by sysfs changes ab4d5ed5 (slub: Enable
sysfs support for !CONFIG_SLUB_DEBUG) and 2bce6485 ( slub: Allow removal
of slab caches during boot).

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
---
 mm/slub.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/slub.c b/mm/slub.c
index 8fd5401..981fb73 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3273,9 +3273,9 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size,
 		kfree(n);
 		kfree(s);
 	}
+err:
 	up_write(&slub_lock);
 
-err:
 	if (flags & SLAB_PANIC)
 		panic("Cannot create slabcache %s\n", name);
 	else
@@ -3862,6 +3862,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 			x += sprintf(buf + x, " N%d=%lu",
 					node, nodes[node]);
 #endif
+	up_read(&slub_lock);
 	kfree(nodes);
 	return x + sprintf(buf + x, "\n");
 }
-- 
cgit v1.1


From 04c3496152394d17e3bc2316f9731ee3e8a026bc Mon Sep 17 00:00:00 2001
From: "Steven J. Magnani" <steve@digidescorp.com>
Date: Wed, 24 Nov 2010 12:56:54 -0800
Subject: nommu: yield CPU while disposing VM

Depending on processor speed, page size, and the amount of memory a
process is allowed to amass, cleanup of a large VM may freeze the system
for many seconds.  This can result in a watchdog timeout.

Make sure other tasks receive some service when cleaning up large VMs.

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/nommu.c | 1 +
 1 file changed, 1 insertion(+)

(limited to 'mm')

diff --git a/mm/nommu.c b/mm/nommu.c
index 3613517..27a9ac5 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1717,6 +1717,7 @@ void exit_mmap(struct mm_struct *mm)
 		mm->mmap = vma->vm_next;
 		delete_vma_from_mm(vma);
 		delete_vma(mm, vma);
+		cond_resched();
 	}
 
 	kleave("");
-- 
cgit v1.1


From 112bc2e120a94a511858918d6866a4978f9c500e Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill@shutemov.name>
Date: Wed, 24 Nov 2010 12:56:58 -0800
Subject: memcg: fix false positive VM_BUG on non-SMP

Fix this:

  kernel BUG at mm/memcontrol.c:2155!
  invalid opcode: 0000 [#1]
  last sysfs file:

  Pid: 18, comm: sh Not tainted 2.6.37-rc3 #3 /Bochs
  EIP: 0060:[<c10731b2>] EFLAGS: 00000246 CPU: 0
  EIP is at mem_cgroup_move_account+0xe2/0xf0
  EAX: 00000004 EBX: c6f931d4 ECX: c681c300 EDX: c681c000
  ESI: c681c300 EDI: ffffffea EBP: c681c000 ESP: c46f3e30
   DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
  Process sh (pid: 18, ti=c46f2000 task=c6826e60 task.ti=c46f2000)
  Stack:
   00000155 c681c000 0805f000 c46ee180 c46f3e5c c7058820 c1074d37 00000000
   08060000 c46db9a0 c46ec080 c7058820 0805f000 08060000 c46f3e98 c1074c50
   c106c75e c46f3e98 c46ec080 08060000 0805ffff c46db9a0 c46f3e98 c46e0340
  Call Trace:
   [<c1074d37>] ? mem_cgroup_move_charge_pte_range+0xe7/0x130
   [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130
   [<c106c75e>] ? walk_page_range+0xee/0x1d0
   [<c10725d6>] ? mem_cgroup_move_task+0x66/0x90
   [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130
   [<c1072570>] ? mem_cgroup_move_task+0x0/0x90
   [<c1042616>] ? cgroup_attach_task+0x136/0x200
   [<c1042878>] ? cgroup_tasks_write+0x48/0xc0
   [<c1041e9e>] ? cgroup_file_write+0xde/0x220
   [<c101398d>] ? do_page_fault+0x17d/0x3f0
   [<c108a79d>] ? alloc_fd+0x2d/0xd0
   [<c1041dc0>] ? cgroup_file_write+0x0/0x220
   [<c1077ba2>] ? vfs_write+0x92/0xc0
   [<c1077c81>] ? sys_write+0x41/0x70
   [<c1140e3d>] ? syscall_call+0x7/0xb
  Code: 03 00 74 09 8b 44 24 04 e8 1c f1 ff ff 89 73 04 8d 86 b0 00 00 00 b9 01 00 00 00 89 da 31 ff e8 65 f5 ff ff e9 4d ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 90 8d b4 26 00 00 00 00 83 ec 10 8b 0d f4 e3
  EIP: [<c10731b2>] mem_cgroup_move_account+0xe2/0xf0 SS:ESP 0068:c46f3e30
  ---[ end trace 7daa1582159b6532 ]---

lock_page_cgroup and unlock_page_cgroup are implemented using
bit_spinlock.  bit_spinlock doesn't touch the bit if we are on non-SMP
machine, so we can't use the bit to check whether the lock was taken.

Let's introduce is_page_cgroup_locked based on bit_spin_is_locked instead
of PageCgroupLocked to fix it.

[akpm@linux-foundation.org: s/is_page_cgroup_locked/page_is_cgroup_locked/]
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2efa8ea..62d1880 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2152,7 +2152,7 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
 {
 	VM_BUG_ON(from == to);
 	VM_BUG_ON(PageLRU(pc->page));
-	VM_BUG_ON(!PageCgroupLocked(pc));
+	VM_BUG_ON(!page_is_cgroup_locked(pc));
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
 
-- 
cgit v1.1


From b1dd693e5b9348bd68a80e679e03cf9c0973b01b Mon Sep 17 00:00:00 2001
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Date: Wed, 24 Nov 2010 12:57:06 -0800
Subject: memcg: avoid deadlock between move charge and try_charge()

__mem_cgroup_try_charge() can be called under down_write(&mmap_sem)(e.g.
mlock does it). This means it can cause deadlock if it races with move charge:

Ex.1)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |  down_write(&mmap_sem)
      mc.moving_task = current          |    ..
      mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
        mem_cgroup_count_precharge()    |    prepare_to_wait()
          down_read(&mmap_sem)          |    if (mc.moving_task)
          -> cannot aquire the lock     |    -> true
                                        |      schedule()

Ex.2)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |
      mc.moving_task = current          |
      mem_cgroup_precharge_mc()         |
        mem_cgroup_count_precharge()    |
          down_read(&mmap_sem)          |
          ..                            |
          up_read(&mmap_sem)            |
                                        |  down_write(&mmap_sem)
    mem_cgroup_move_task()              |    ..
      mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
        down_read(&mmap_sem)            |    prepare_to_wait()
        -> cannot aquire the lock       |    if (mc.moving_task)
                                        |    -> true
                                        |      schedule()

To avoid this deadlock, we do all the move charge works (both can_attach() and
attach()) under one mmap_sem section.
And after this patch, we set/clear mc.moving_task outside mc.lock, because we
use the lock only to check mc.from/to.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memcontrol.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

(limited to 'mm')

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 62d1880..26218df 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -278,13 +278,14 @@ enum move_type {
 
 /* "mc" and its members are protected by cgroup_mutex */
 static struct move_charge_struct {
-	spinlock_t	  lock; /* for from, to, moving_task */
+	spinlock_t	  lock; /* for from, to */
 	struct mem_cgroup *from;
 	struct mem_cgroup *to;
 	unsigned long precharge;
 	unsigned long moved_charge;
 	unsigned long moved_swap;
 	struct task_struct *moving_task;	/* a task moving charges */
+	struct mm_struct *mm;
 	wait_queue_head_t waitq;		/* a waitq for other context */
 } mc = {
 	.lock = __SPIN_LOCK_UNLOCKED(mc.lock),
@@ -4631,7 +4632,7 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 	unsigned long precharge;
 	struct vm_area_struct *vma;
 
-	down_read(&mm->mmap_sem);
+	/* We've already held the mmap_sem */
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		struct mm_walk mem_cgroup_count_precharge_walk = {
 			.pmd_entry = mem_cgroup_count_precharge_pte_range,
@@ -4643,7 +4644,6 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 		walk_page_range(vma->vm_start, vma->vm_end,
 					&mem_cgroup_count_precharge_walk);
 	}
-	up_read(&mm->mmap_sem);
 
 	precharge = mc.precharge;
 	mc.precharge = 0;
@@ -4694,11 +4694,16 @@ static void mem_cgroup_clear_mc(void)
 
 		mc.moved_swap = 0;
 	}
+	if (mc.mm) {
+		up_read(&mc.mm->mmap_sem);
+		mmput(mc.mm);
+	}
 	spin_lock(&mc.lock);
 	mc.from = NULL;
 	mc.to = NULL;
-	mc.moving_task = NULL;
 	spin_unlock(&mc.lock);
+	mc.moving_task = NULL;
+	mc.mm = NULL;
 	mem_cgroup_end_move(from);
 	memcg_oom_recover(from);
 	memcg_oom_recover(to);
@@ -4724,12 +4729,21 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 			return 0;
 		/* We move charges only when we move a owner of the mm */
 		if (mm->owner == p) {
+			/*
+			 * We do all the move charge works under one mmap_sem to
+			 * avoid deadlock with down_write(&mmap_sem)
+			 * -> try_charge() -> if (mc.moving_task) -> sleep.
+			 */
+			down_read(&mm->mmap_sem);
+
 			VM_BUG_ON(mc.from);
 			VM_BUG_ON(mc.to);
 			VM_BUG_ON(mc.precharge);
 			VM_BUG_ON(mc.moved_charge);
 			VM_BUG_ON(mc.moved_swap);
 			VM_BUG_ON(mc.moving_task);
+			VM_BUG_ON(mc.mm);
+
 			mem_cgroup_start_move(from);
 			spin_lock(&mc.lock);
 			mc.from = from;
@@ -4737,14 +4751,16 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 			mc.precharge = 0;
 			mc.moved_charge = 0;
 			mc.moved_swap = 0;
-			mc.moving_task = current;
 			spin_unlock(&mc.lock);
+			mc.moving_task = current;
+			mc.mm = mm;
 
 			ret = mem_cgroup_precharge_mc(mm);
 			if (ret)
 				mem_cgroup_clear_mc();
-		}
-		mmput(mm);
+			/* We call up_read() and mmput() in clear_mc(). */
+		} else
+			mmput(mm);
 	}
 	return ret;
 }
@@ -4832,7 +4848,7 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
 	struct vm_area_struct *vma;
 
 	lru_add_drain_all();
-	down_read(&mm->mmap_sem);
+	/* We've already held the mmap_sem */
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		int ret;
 		struct mm_walk mem_cgroup_move_charge_walk = {
@@ -4851,7 +4867,6 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
 			 */
 			break;
 	}
-	up_read(&mm->mmap_sem);
 }
 
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
@@ -4860,17 +4875,11 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct task_struct *p,
 				bool threadgroup)
 {
-	struct mm_struct *mm;
-
-	if (!mc.to)
+	if (!mc.mm)
 		/* no need to move charge */
 		return;
 
-	mm = get_task_mm(p);
-	if (mm) {
-		mem_cgroup_move_charge(mm);
-		mmput(mm);
-	}
+	mem_cgroup_move_charge(mc.mm);
 	mem_cgroup_clear_mc();
 }
 #else	/* !CONFIG_MMU */
-- 
cgit v1.1


From a42c390cfa0c2612459d7226ba11612847ca3a64 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 24 Nov 2010 12:57:08 -0800
Subject: cgroups: make swap accounting default behavior configurable

Swap accounting can be configured by CONFIG_CGROUP_MEM_RES_CTLR_SWAP
configuration option and then it is turned on by default.  There is a boot
option (noswapaccount) which can disable this feature.

This makes it hard for distributors to enable the configuration option as
this feature leads to a bigger memory consumption and this is a no-go for
general purpose distribution kernel.  On the other hand swap accounting
may be very usuful for some workloads.

This patch adds a new configuration option which controls the default
behavior (CGROUP_MEM_RES_CTLR_SWAP_ENABLED).  If the option is selected
then the feature is turned on by default.

It also adds a new boot parameter swapaccount[=1|0] which enhances the
original noswapaccount parameter semantic by means of enable/disable logic
(defaults to 1 if no value is provided to be still consistent with
noswapaccount).

The default behavior is unchanged (if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is
enabled then CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is enabled as well)

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memcontrol.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

(limited to 'mm')

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 26218df..7a22b41 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -61,7 +61,14 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 /* Turned on only when memory cgroup is enabled && really_do_swap_account = 1 */
 int do_swap_account __read_mostly;
-static int really_do_swap_account __initdata = 1; /* for remember boot option*/
+
+/* for remember boot option*/
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED
+static int really_do_swap_account __initdata = 1;
+#else
+static int really_do_swap_account __initdata = 0;
+#endif
+
 #else
 #define do_swap_account		(0)
 #endif
@@ -4920,10 +4927,20 @@ struct cgroup_subsys mem_cgroup_subsys = {
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+static int __init enable_swap_account(char *s)
+{
+	/* consider enabled if no parameter or 1 is given */
+	if (!s || !strcmp(s, "1"))
+		really_do_swap_account = 1;
+	else if (!strcmp(s, "0"))
+		really_do_swap_account = 0;
+	return 1;
+}
+__setup("swapaccount", enable_swap_account);
 
 static int __init disable_swap_account(char *s)
 {
-	really_do_swap_account = 0;
+	enable_swap_account("0");
 	return 1;
 }
 __setup("noswapaccount", disable_swap_account);
-- 
cgit v1.1


From e9959f0f37160e1f5351af828cc981712b5066c1 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Wed, 24 Nov 2010 12:57:09 -0800
Subject: mm/page_alloc.c: fix build_all_zonelist() where percpu_alloc() is
 wrongly called under stop_machine_run()

During memory hotplug, build_allzonelists() may be called under
stop_machine_run().  In this function, setup_zone_pageset() is called.
But it's bug because it will do page allocation under stop_machine_run().

Here is a report from Alok Kataria.

  BUG: sleeping function called from invalid context at kernel/mutex.c:94
  in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
  Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
  Call Trace:
   [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
   [<ffffffff81468245>] mutex_lock+0x24/0x50
   [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
   [<ffffffff81048888>] ? load_balance+0xbe/0x60e
   [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
   [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
   [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
   [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
   [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
   [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
   [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
   [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
   [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
   [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
   [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
   [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
   [<ffffffff81065f29>] kthread+0x7f/0x87
   [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
   [<ffffffff81065eaa>] ? kthread+0x0/0x87
   [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
  Built 5 zonelists in Node order, mobility grouping on.  Total pages: 289456
  Policy zone: Normal

This patch tries to fix the issue by moving setup_zone_pageset() out from
stop_machine_run(). It's obviously not necessary to be called under
stop_machine_run().

[akpm@linux-foundation.org: remove unneeded local]
Reported-by: Alok Kataria <akataria@vmware.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Petr Vandrovec <petr@vmware.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/page_alloc.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

(limited to 'mm')

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07a6544..e409270 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3008,14 +3008,6 @@ static __init_refok int __build_all_zonelists(void *data)
 		build_zonelist_cache(pgdat);
 	}
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-	/* Setup real pagesets for the new zone */
-	if (data) {
-		struct zone *zone = data;
-		setup_zone_pageset(zone);
-	}
-#endif
-
 	/*
 	 * Initialize the boot_pagesets that are going to be used
 	 * for bootstrapping processors. The real pagesets for
@@ -3064,7 +3056,11 @@ void build_all_zonelists(void *data)
 	} else {
 		/* we have to stop all cpus to guarantee there is no user
 		   of zonelist */
-		stop_machine(__build_all_zonelists, data, NULL);
+#ifdef CONFIG_MEMORY_HOTPLUG
+		if (data)
+			setup_zone_pageset((struct zone *)data);
+#endif
+		stop_machine(__build_all_zonelists, NULL, NULL);
 		/* cpuset refresh routine should be here */
 	}
 	vm_total_pages = nr_free_pagecache_pages();
-- 
cgit v1.1


From 5f0af70a25593a9d53b87bc8d31902fb7cc63e40 Mon Sep 17 00:00:00 2001
From: David Sterba <dsterba@suse.cz>
Date: Wed, 24 Nov 2010 12:57:10 -0800
Subject: mm: remove call to find_vma in pagewalk for non-hugetlbfs

Commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in
walk_page_range()") introduces a check if a vma is a hugetlbfs one and
later in 5dc37642 ("mm hugetlb: add hugepage support to pagemap") it is
moved under #ifdef CONFIG_HUGETLB_PAGE but a needless find_vma call is
left behind and its result is not used anywhere else in the function.

The side-effect of caching vma for @addr inside walk->mm is neither
utilized in walk_page_range() nor in called functions.

Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/pagewalk.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

(limited to 'mm')

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 8b1a2ce..38cc58b 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -139,7 +139,6 @@ int walk_page_range(unsigned long addr, unsigned long end,
 	pgd_t *pgd;
 	unsigned long next;
 	int err = 0;
-	struct vm_area_struct *vma;
 
 	if (addr >= end)
 		return err;
@@ -149,15 +148,17 @@ int walk_page_range(unsigned long addr, unsigned long end,
 
 	pgd = pgd_offset(walk->mm, addr);
 	do {
+		struct vm_area_struct *uninitialized_var(vma);
+
 		next = pgd_addr_end(addr, end);
 
+#ifdef CONFIG_HUGETLB_PAGE
 		/*
 		 * handle hugetlb vma individually because pagetable walk for
 		 * the hugetlb page is dependent on the architecture and
 		 * we can't handled it in the same manner as non-huge pages.
 		 */
 		vma = find_vma(walk->mm, addr);
-#ifdef CONFIG_HUGETLB_PAGE
 		if (vma && is_vm_hugetlb_page(vma)) {
 			if (vma->vm_end < next)
 				next = vma->vm_end;
-- 
cgit v1.1