mm: fix up numa read-only thread grouping logic

Dave Chinner reported that commit 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations") slowed down his xfsrepair test enormously. In particular, it was using more system time due to extra TLB flushing. The ultimate reason turns out to be how the change to use the regular page table accessor functions broke the NUMA grouping logic. The old special mknuma/mknonnuma code accessed the page table present bit and the magic NUMA bit directly, while the new code just changes the page protections using PROT_NONE and the regular vma protections. That sounds equivalent, and from a fault standpoint it really is, but a subtle side effect is that the *other* protection bits of the page table entries also change. And the code to decide how to group the NUMA entries together used the writable bit to decide whether a particular page was likely to be shared read-only or not. And with the change to make the NUMA handling use the regular permission setting functions, that writable bit was basically always cleared for private mappings due to COW. So even if the page actually ends up being written to in the end, the NUMA balancing would act as if it was always shared RO. This code is a heuristic anyway, so the fix - at least for now - is to instead check whether the page is dirty rather than writable. The bit doesn't change with protection changes. NOTE! This also adds a FIXME comment to revisit this issue, Not only should we probably re-visit the whole "is this a shared read-only page" heuristic (we might want to take the vma permissions into account and base this more on those than the per-page ones, and also look at whether the particular access that triggers it is a write or not), but the whole COW issue shows that we should think about the NUMA fault handling some more. For example, maybe we should do the early-COW thing that a regular fault does. Or maybe we should accept that while using the same bits as PROTNONE was a good thing (and got rid of the specual NUMA bit), we might still want to just preseve the other protection bits across NUMA faulting. Those are bigger questions, left for later. This just fixes up the heuristic so that it at least approximates working again. More analysis and work needed. Reported-by: Dave Chinner <david@fromorbit.com> Tested-by: Mel Gorman <mgorman@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org>, Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Linus Torvalds <torvalds@linux-foundation.org> 2015-03-12 08:45:46 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2015-03-12 08:45:46 -0700
commit: 53da3bc2ba9e4899f32707b5cd7d18421b943687 (patch)
tree: b082c60c1fefa6383b0b3ae2834746e7debffa00 /mm
parent: cca28a5fda5d8fa69982bdb54341eeeb3eab215a (diff)
download: op-kernel-dev-53da3bc2ba9e4899f32707b5cd7d18421b943687.zip
op-kernel-dev-53da3bc2ba9e4899f32707b5cd7d18421b943687.tar.gz
2 files changed, 12 insertions, 2 deletions
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc00c8c..89b9075 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1295,8 +1295,13 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Avoid grouping on DSO/COW pages in specific and RO pages
 	 * in general, RO pages shouldn't hurt as much anyway since
 	 * they can be in shared cache state.
+	 *
+	 * FIXME! This checks "pmd_dirty()" as an approximation of
+	 * "is this a read-only page", since checking "pmd_write()"
+	 * is even more broken. We haven't actually turned this into
+	 * a writable page, so pmd_write() will always be false.
 	 */
-	if (!pmd_write(pmd))
+	if (!pmd_dirty(pmd))
 		flags |= TNF_NO_GROUP;
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 8068893..411144f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3072,8 +3072,13 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Avoid grouping on DSO/COW pages in specific and RO pages
 	 * in general, RO pages shouldn't hurt as much anyway since
 	 * they can be in shared cache state.
+	 *
+	 * FIXME! This checks "pmd_dirty()" as an approximation of
+	 * "is this a read-only page", since checking "pmd_write()"
+	 * is even more broken. We haven't actually turned this into
+	 * a writable page, so pmd_write() will always be false.
 	 */
-	if (!pte_write(pte))
+	if (!pte_dirty(pte))
 		flags |= TNF_NO_GROUP;
 
 	/*
author	Linus Torvalds <torvalds@linux-foundation.org>	2015-03-12 08:45:46 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2015-03-12 08:45:46 -0700
commit	53da3bc2ba9e4899f32707b5cd7d18421b943687 (patch)
tree	b082c60c1fefa6383b0b3ae2834746e7debffa00 /mm
parent	cca28a5fda5d8fa69982bdb54341eeeb3eab215a (diff)
download	op-kernel-dev-53da3bc2ba9e4899f32707b5cd7d18421b943687.zip op-kernel-dev-53da3bc2ba9e4899f32707b5cd7d18421b943687.tar.gz