mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node

Due to the fact that migrations are driven by the CPU a task is running on there is no point tracking NUMA faults until one task runs on a new node. This patch tracks the first node used by an address space. Until it changes, PTE scanning is disabled and no NUMA hinting faults are trapped. This should help workloads that are short-lived, do not care about NUMA placement or have bound themselves to a single node.

This takes advantage of the logic in "mm: sched: numa: Implement slow start for working set sampling" to delay when the checks are made. This will take advantage of processes that set their CPU and node bindings early in their lifetime. It will also potentially allow any initial load balancing to take place.

Signed-off-by: Mel Gorman <mgorman@suse.de>
author: Mel Gorman <mgorman@suse.de> 2012-11-22 14:40:03 +0000
committer: Mel Gorman <mgorman@suse.de> 2012-12-11 14:42:56 +0000
commit: 5bca23035391928c4c7301835accca3551b96cc2 (patch)
tree: 2feb63abf318e6edfded8bb97b43ca29c3c5b312 /kernel
parent: 3105b86a9fee7d2c2e76edb53bbbc4027599628f (diff)
download: op-kernel-dev-5bca23035391928c4c7301835accca3551b96cc2.zip
op-kernel-dev-5bca23035391928c4c7301835accca3551b96cc2.tar.gz
3 files changed, 24 insertions, 1 deletions
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b20ab7..296ea30 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -821,6 +821,9 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+	mm->first_nid = NUMA_PTE_SCAN_INIT;
+#endif
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a02a20..3e18f61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -861,6 +861,24 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	/*
+	 * We do not care about task placement until a task runs on a node
+	 * other than the first one used by the address space. This is
+	 * largely because migrations are driven by what CPU the task
+	 * is running on. If it's never scheduled on another node, it'll
+	 * not migrate so why bother trapping the fault.
+	 */
+	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
+		mm->first_nid = numa_node_id();
+	if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
+		/* Are we running on a new node yet? */
+		if (numa_node_id() == mm->first_nid &&
+		    !sched_feat_numa(NUMA_FORCE))
+			return;
+
+		mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
+	}
+
+	/*
 	 * Reset the scan period if enough time has gone by. Objective is that
 	 * scanning will be reduced if pages are properly placed. As tasks
 	 * can enter different phases this needs to be re-examined. Lacking
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d2373a3..e7c25ff 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -65,8 +65,10 @@ SCHED_FEAT(LB_MIN, false)
 /*
  * Apply the automatic NUMA scheduling policy. Enabled automatically
  * at runtime if running on a NUMA machine. Can be controlled via
- * numa_balancing=
+ * numa_balancing=. Allow PTE scanning to be forced on UMA machines
+ * for debugging the core machinery.
  */
 #ifdef CONFIG_NUMA_BALANCING
 SCHED_FEAT(NUMA,	false)
+SCHED_FEAT(NUMA_FORCE,	false)
 #endif
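
The gate added to task_numa_work() above behaves like a small three-state machine on mm->first_nid: uninitialised, pinned to the first node seen, and active. The following is a minimal userspace sketch of that logic, not kernel code. The struct mm_sketch type, the numa_scan_allowed() helper, and its current_nid and force parameters are hypothetical stand-ins for mm_struct, numa_node_id() and sched_feat_numa(NUMA_FORCE). The numeric values of the two sentinels are assumptions, since their definitions are not part of the hunks shown here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Assumed sentinel values: the diff uses these names, but their
 * definitions are not shown, so the numbers here are illustrative.
 * They only need to differ from every valid node id.
 */
#define NUMA_PTE_SCAN_INIT   (-1)
#define NUMA_PTE_SCAN_ACTIVE (-2)

/* Hypothetical stand-in for the single mm_struct field the patch uses. */
struct mm_sketch {
	int first_nid;
};

/*
 * Returns true when PTE scanning may proceed, false when the caller
 * should return early, mirroring the check added to task_numa_work().
 * current_nid models numa_node_id(); force models NUMA_FORCE.
 */
static bool numa_scan_allowed(struct mm_sketch *mm, int current_nid, bool force)
{
	assert(current_nid >= 0);	/* node ids are non-negative */

	/* First call for this address space: remember the node. */
	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
		mm->first_nid = current_nid;

	if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
		/* Still on the first node and not forced: keep scanning off. */
		if (current_nid == mm->first_nid && !force)
			return false;

		/* Ran on a new node (or forced): enable scanning for good. */
		mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
	}

	return true;
}
```

In this sketch, a task that only ever calls numa_scan_allowed() with the same node id keeps getting false. The first call with a different node id, or with force set, flips the state to NUMA_PTE_SCAN_ACTIVE, after which every call returns true regardless of the node.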