From 4ffb6335da87b51c17e7ff6495170785f21558dd Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Mon, 8 Oct 2012 16:29:09 -0700
Subject: mm: compaction: update comment in try_to_compact_pages

Allocation success rates have been far lower since 3.4 due to commit
fe2c2a106663 ("vmscan: reclaim at order 0 when compaction is enabled").
This commit was introduced for good reasons and it was known in advance
that the success rates would suffer, but it was justified on the grounds
that the high allocation success rates were achieved by aggressive
reclaim. Success rates are expected to suffer even more in 3.6 due to
commit 7db8889ab05b ("mm: have order > 0 compaction start off where it
left"), which testing has shown to severely reduce allocation success
rates under load - to 0% in one case.

This series aims to improve the allocation success rates without
regressing the benefits of commit fe2c2a106663. The series is based on
latest mmotm and takes into account that the __GFP_NO_KSWAPD flag is
going away.

Patch 1 updates a stale comment seeing as I was in the general area.

Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
        of recent failures.

Patch 3 captures suitable high-order pages freed by compaction to reduce
        races with parallel allocation requests.

Patch 4 fixes the upstream commit [7db8889a: mm: have order > 0
        compaction start off where it left] to enable compaction again.

Patch 5 identifies when compaction is taking too long due to contention
        and aborts.

STRESS-HIGHALLOC
                  3.6-rc1-akpm       full-series
Pass 1          36.00 ( 0.00%)    51.00 (15.00%)
Pass 2          42.00 ( 0.00%)    63.00 (21.00%)
while Rested    86.00 ( 0.00%)    86.00 ( 0.00%)

From
http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
I know that the allocation success rate in 3.3.6 was 78% in comparison
to 36% in the current akpm tree. With the full series applied, the
success rates are up to around 51% with some variability in the results.
This is not as high a success rate but it does not reclaim excessively,
which is a key point.

MMTests Statistics: vmstat
Page Ins                     3050912     3078892
Page Outs                    8033528     8039096
Swap Ins                           0           0
Swap Outs                          0           0

Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
there were 71881 pages swapped out.

Direct pages scanned           70942      122976
Kswapd pages scanned         1366300     1520122
Kswapd pages reclaimed       1366214     1484629
Direct pages reclaimed         70936      105716
Kswapd efficiency                99%         97%
Kswapd velocity             1072.550    1182.615
Direct efficiency                99%         85%
Direct velocity               55.690      95.672

The kswapd velocity changes very little as expected. kswapd velocity is
around the 1000 pages/sec mark whereas in kernel 3.3.6 with the high
allocation success rates it was 8140 pages/second. Direct velocity is
higher as a result of patch 2 of the series but this is expected and is
acceptable. The direct reclaim and kswapd velocities change very little.

If these get accepted for merging then there is a difficulty in how they
should be handled. 7db8889a ("mm: have order > 0 compaction start off
where it left") is broken but it is already in 3.6-rc1 and needs to be
fixed. However, if just patch 4 from this series is applied then Jim
Schutt's workload is known to break again as his workload also requires
patch 5. While it would be preferred to have all these patches in 3.6 to
improve compaction in general, it would at least be acceptable if just
patches 4 and 5 were merged to 3.6 to fix a known problem without
breaking compaction completely. On the face of it, that would force the
__GFP_NO_KSWAPD patches to be merged at the same time but I can do a
version of this series with the __GFP_NO_KSWAPD change reverted and then
rebase it on top of this series. That might be best overall because I
note that the __GFP_NO_KSWAPD patch should have removed
deferred_compaction from page_alloc.c but it didn't, and fixing that
causes collisions with this series.
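
As a cross-check of the efficiency rows in the vmstat table, here is a
minimal sketch. It assumes (the message does not define it) that
"efficiency" means pages reclaimed per 100 pages scanned, truncated to a
whole percentage. reclaim_efficiency() is a hypothetical helper, not a
kernel or MMTests function.

```c
/*
 * Hypothetical helper: reclaim efficiency as a whole percentage.
 * Assumption: efficiency = reclaimed * 100 / scanned with integer
 * truncation; this definition is not stated in the commit message.
 */
static unsigned long reclaim_efficiency(unsigned long reclaimed,
					unsigned long scanned)
{
	/* Nothing scanned means no scanning effort was wasted. */
	if (scanned == 0)
		return 100;

	return reclaimed * 100 / scanned;
}
```

Under that assumption the kswapd counters (1366214 of 1366300 and 1484629
of 1520122) and the direct reclaim counters (70936 of 70942 and 105716 of
122976) give the same whole percentages as the table.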
This patch:

The comment about order applied when the check was order >
PAGE_ALLOC_COSTLY_ORDER which has not been the case since c5a73c3d ("thp:
use compaction for all allocation orders"). Fixing the comment while I'm
in the general area.

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/compaction.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

(limited to 'mm/compaction.c')

diff --git a/mm/compaction.c b/mm/compaction.c
index 7fcd3a5..7168edc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -869,11 +869,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	struct zone *zone;
 	int rc = COMPACT_SKIPPED;
 
-	/*
-	 * Check whether it is worth even starting compaction. The order check is
-	 * made because an assumption is made that the page allocator can satisfy
-	 * the "cheaper" orders without taking special steps
-	 */
+	/* Check if the GFP flags allow compaction */
 	if (!order || !may_enter_fs || !may_perform_io)
 		return rc;
 
-- 
cgit v1.1

From 1fb3f8ca0e9222535a39b884cb67a34628411b9f Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Mon, 8 Oct 2012 16:29:12 -0700
Subject: mm: compaction: capture a suitable high-order page immediately when it is made available

While compaction is migrating pages to free up large contiguous blocks
for allocation it races with other allocation requests that may steal
these blocks or break them up. This patch alters direct compaction to
capture a suitable free page as soon as it becomes available to reduce
this race. It uses similar logic to split_free_page() to ensure that
watermarks are still obeyed.
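
The freelist selection that compact_capture_page() performs in the diff
of this patch can be sketched as a small userspace model. Everything
named model_* or MT_* is a hypothetical stand-in; the model deliberately
leaves out the zone lock, the watermark check and the real struct page
lists, and only shows which (order, migratetype) freelists are eligible.

```c
#define MODEL_MAX_ORDER 11

/* Hypothetical stand-ins for the kernel's migratetype values. */
enum model_migratetype {
	MT_UNMOVABLE,
	MT_RECLAIMABLE,
	MT_MOVABLE,
	MT_PCPTYPES	/* number of types held on per-cpu lists */
};

struct model_zone {
	/* Count of free blocks per order and migratetype. */
	unsigned int nr_free[MODEL_MAX_ORDER][MT_PCPTYPES];
};

/*
 * Capture one free block of at least @order on behalf of an allocation
 * of type @migratetype. Returns the order of the captured block, or -1
 * if no eligible freelist has a block.
 */
static int model_capture(struct model_zone *zone, int order, int migratetype)
{
	int mtype, mtype_low, mtype_high, o;

	if (migratetype == MT_MOVABLE) {
		/* Movable requests may capture from any freelist. */
		mtype_low = 0;
		mtype_high = MT_PCPTYPES;
	} else {
		/* Other types stay on their own list to avoid pollution. */
		mtype_low = migratetype;
		mtype_high = migratetype + 1;
	}

	for (mtype = mtype_low; mtype < mtype_high; mtype++) {
		for (o = order; o < MODEL_MAX_ORDER; o++) {
			if (zone->nr_free[o][mtype] == 0)
				continue;
			zone->nr_free[o][mtype]--;
			return o;
		}
	}

	return -1;
}
```

In this model a reclaimable request ignores a free block sitting on the
unmovable list, while a movable request captures it, which mirrors the
reasoning in the code comment added by the patch.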
Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Reviewed-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 79 insertions(+), 11 deletions(-) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index 7168edc..0fbc6b7 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -91,6 +91,60 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock, return compact_checklock_irqsave(lock, flags, false, cc); } +static void compact_capture_page(struct compact_control *cc) +{ + unsigned long flags; + int mtype, mtype_low, mtype_high; + + if (!cc->page || *cc->page) + return; + + /* + * For MIGRATE_MOVABLE allocations we capture a suitable page ASAP + * regardless of the migratetype of the freelist is is captured from. + * This is fine because the order for a high-order MIGRATE_MOVABLE + * allocation is typically at least a pageblock size and overall + * fragmentation is not impaired. Other allocation types must + * capture pages from their own migratelist because otherwise they + * could pollute other pageblocks like MIGRATE_MOVABLE with + * difficult to move pages and making fragmentation worse overall. 
+ */ + if (cc->migratetype == MIGRATE_MOVABLE) { + mtype_low = 0; + mtype_high = MIGRATE_PCPTYPES; + } else { + mtype_low = cc->migratetype; + mtype_high = cc->migratetype + 1; + } + + /* Speculatively examine the free lists without zone lock */ + for (mtype = mtype_low; mtype < mtype_high; mtype++) { + int order; + for (order = cc->order; order < MAX_ORDER; order++) { + struct page *page; + struct free_area *area; + area = &(cc->zone->free_area[order]); + if (list_empty(&area->free_list[mtype])) + continue; + + /* Take the lock and attempt capture of the page */ + if (!compact_trylock_irqsave(&cc->zone->lock, &flags, cc)) + return; + if (!list_empty(&area->free_list[mtype])) { + page = list_entry(area->free_list[mtype].next, + struct page, lru); + if (capture_free_page(page, cc->order, mtype)) { + spin_unlock_irqrestore(&cc->zone->lock, + flags); + *cc->page = page; + return; + } + } + spin_unlock_irqrestore(&cc->zone->lock, flags); + } + } +} + /* * Isolate free pages onto a private freelist. Caller must hold zone->lock. * If @strict is true, will abort returning 0 on any invalid PFNs or non-free @@ -645,7 +699,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, static int compact_finished(struct zone *zone, struct compact_control *cc) { - unsigned int order; unsigned long watermark; if (fatal_signal_pending(current)) @@ -688,14 +741,22 @@ static int compact_finished(struct zone *zone, return COMPACT_CONTINUE; /* Direct compactor: Is a suitable page free? */ - for (order = cc->order; order < MAX_ORDER; order++) { - /* Job done if page is free of the right migratetype */ - if (!list_empty(&zone->free_area[order].free_list[cc->migratetype])) - return COMPACT_PARTIAL; - - /* Job done if allocation would set block type */ - if (order >= pageblock_order && zone->free_area[order].nr_free) + if (cc->page) { + /* Was a suitable page captured? 
*/ + if (*cc->page) return COMPACT_PARTIAL; + } else { + unsigned int order; + for (order = cc->order; order < MAX_ORDER; order++) { + struct free_area *area = &zone->free_area[cc->order]; + /* Job done if page is free of the right migratetype */ + if (!list_empty(&area->free_list[cc->migratetype])) + return COMPACT_PARTIAL; + + /* Job done if allocation would set block type */ + if (cc->order >= pageblock_order && area->nr_free) + return COMPACT_PARTIAL; + } } return COMPACT_CONTINUE; @@ -817,6 +878,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc) goto out; } } + + /* Capture a page now if it is a suitable size */ + compact_capture_page(cc); } out: @@ -829,7 +893,8 @@ out: static unsigned long compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, - bool sync, bool *contended) + bool sync, bool *contended, + struct page **page) { struct compact_control cc = { .nr_freepages = 0, @@ -839,6 +904,7 @@ static unsigned long compact_zone_order(struct zone *zone, .zone = zone, .sync = sync, .contended = contended, + .page = page, }; INIT_LIST_HEAD(&cc.freepages); INIT_LIST_HEAD(&cc.migratepages); @@ -860,7 +926,7 @@ int sysctl_extfrag_threshold = 500; */ unsigned long try_to_compact_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *nodemask, - bool sync, bool *contended) + bool sync, bool *contended, struct page **page) { enum zone_type high_zoneidx = gfp_zone(gfp_mask); int may_enter_fs = gfp_mask & __GFP_FS; @@ -881,7 +947,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist, int status; status = compact_zone_order(zone, order, gfp_mask, sync, - contended); + contended, page); rc = max(status, rc); /* If a normal allocation would succeed, stop compacting */ @@ -936,6 +1002,7 @@ int compact_pgdat(pg_data_t *pgdat, int order) struct compact_control cc = { .order = order, .sync = false, + .page = NULL, }; return __compact_pgdat(pgdat, &cc); @@ -946,6 +1013,7 @@ static int compact_node(int nid) struct 
compact_control cc = { .order = -1, .sync = true, + .page = NULL, }; return __compact_pgdat(NODE_DATA(nid), &cc); -- cgit v1.1 From d95ea5d18e699515468368415c93ed49b1a3221b Mon Sep 17 00:00:00 2001 From: Bartlomiej Zolnierkiewicz Date: Mon, 8 Oct 2012 16:32:05 -0700 Subject: cma: fix watermark checking * Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok() (from Minchan Kim). * During watermark check decrease available free pages number by free CMA pages number if necessary (unmovable allocations cannot use pages from CMA areas). Signed-off-by: Bartlomiej Zolnierkiewicz Signed-off-by: Kyungmin Park Cc: Marek Szyprowski Cc: Michal Nazarewicz Cc: Minchan Kim Cc: Mel Gorman Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index 0fbc6b7..1f61bcb 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -934,6 +934,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist, struct zoneref *z; struct zone *zone; int rc = COMPACT_SKIPPED; + int alloc_flags = 0; /* Check if the GFP flags allow compaction */ if (!order || !may_enter_fs || !may_perform_io) @@ -941,6 +942,10 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist, count_vm_event(COMPACTSTALL); +#ifdef CONFIG_CMA + if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) + alloc_flags |= ALLOC_CMA; +#endif /* Compact each zone in the list */ for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask) { @@ -951,7 +956,8 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist, rc = max(status, rc); /* If a normal allocation would succeed, stop compacting */ - if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) + if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, + alloc_flags)) break; } -- cgit v1.1 From e64c5237cf6ff474cb2f3f832f48f2b441dd9979 Mon Sep 17 
00:00:00 2001
From: Shaohua Li
Date: Mon, 8 Oct 2012 16:32:27 -0700
Subject: mm: compaction: abort compaction loop if lock is contended or run too long

isolate_migratepages_range() might isolate no pages if, for example,
zone->lru_lock is contended while running asynchronous compaction. In
this case, we should abort compaction; otherwise, compact_zone will run
a useless loop and make zone->lru_lock even more contended.

An additional check is added to ensure that cc.migratepages and
cc.freepages get properly drained when compaction is aborted.

[minchan@kernel.org: Putback pages isolated for migration if aborting]
[akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
[akpm@linux-foundation.org: make compact_zone_order() require non-NULL arg `contended']
[minchan@kernel.org: Putback pages isolated for migration if aborting]
Signed-off-by: Andrea Arcangeli
Signed-off-by: Shaohua Li
Signed-off-by: Mel Gorman
Acked-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/compaction.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

(limited to 'mm/compaction.c')

diff --git a/mm/compaction.c b/mm/compaction.c
index 1f61bcb..0649cc1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -70,8 +70,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 
 		/* async aborts if taking too long or contended */
 		if (!cc->sync) {
-			if (cc->contended)
-				*cc->contended = true;
+			cc->contended = true;
 			return false;
 		}
 
@@ -688,7 +687,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 
 	/* Perform the isolation */
 	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
-	if (!low_pfn)
+	if (!low_pfn || cc->contended)
 		return ISOLATE_ABORT;
 
 	cc->migrate_pfn = low_pfn;
@@ -848,6 +847,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		switch (isolate_migratepages(zone, cc)) {
 		case ISOLATE_ABORT:
 			ret = COMPACT_PARTIAL;
+			putback_lru_pages(&cc->migratepages);
+			cc->nr_migratepages = 0;
 			goto out;
 		case ISOLATE_NONE:
 			continue;
@@ -896,6 +897,7 @@ static unsigned long compact_zone_order(struct zone *zone,
 				 bool sync, bool *contended,
 				 struct page **page)
 {
+	unsigned long ret;
 	struct compact_control cc = {
 		.nr_freepages = 0,
 		.nr_migratepages = 0,
@@ -903,13 +905,18 @@ static unsigned long compact_zone_order(struct zone *zone,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
 		.zone = zone,
 		.sync = sync,
-		.contended = contended,
 		.page = page,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
 
-	return compact_zone(zone, &cc);
+	ret = compact_zone(zone, &cc);
+
+	VM_BUG_ON(!list_empty(&cc.freepages));
+	VM_BUG_ON(!list_empty(&cc.migratepages));
+
+	*contended = cc.contended;
+	return ret;
 }
 
 int sysctl_extfrag_threshold = 500;
-- 
cgit v1.1

From 3cc668f4e30fbd97b3c0574d8cac7a83903c9bc7 Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Mon, 8 Oct 2012 16:32:30 -0700
Subject: mm: compaction: move fatal signal check out of compact_checklock_irqsave

Commit c67fe3752abe ("mm: compaction: Abort async compaction if locks
are contended or taking too long") addressed a lock contention problem
in compaction by introducing compact_checklock_irqsave(), which
effectively aborts async compaction in the event of contention.

To preserve existing behaviour it also moved a fatal_signal_pending()
check into compact_checklock_irqsave() but that is very misleading. It
"hides" the check within a locking function but has nothing to do with
locking as such. It just happens to work in a desirable fashion.

This patch moves the fatal_signal_pending() check to
isolate_migratepages_range() where it belongs. Arguably the same check
should also happen when isolating pages for freeing but it's overkill.
Signed-off-by: Mel Gorman Cc: Rik van Riel Cc: KAMEZAWA Hiroyuki Cc: Shaohua Li Cc: Minchan Kim Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index 0649cc1..78075a2 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -75,8 +75,6 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags, } cond_resched(); - if (fatal_signal_pending(current)) - return false; } if (!locked) @@ -363,7 +361,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc, /* Check if it is ok to still hold the lock */ locked = compact_checklock_irqsave(&zone->lru_lock, &flags, locked, cc); - if (!locked) + if (!locked || fatal_signal_pending(current)) break; /* -- cgit v1.1 From 661c4cb9b829110cb68c18ea05a56be39f75a4d2 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 8 Oct 2012 16:32:31 -0700 Subject: mm: compaction: Update try_to_compact_pages()kerneldoc comment Parameters were added without documentation, tut tut. Signed-off-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index 78075a2..b16dd38 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -926,6 +926,8 @@ int sysctl_extfrag_threshold = 500; * @gfp_mask: The GFP mask of the current allocation * @nodemask: The allowed nodes to allocate from * @sync: Whether migration is synchronous or not + * @contended: Return value that is true if compaction was aborted due to lock contention + * @page: Optionally capture a free page of the requested order during compaction * * This is the main entry point for direct page compaction. 
  */
-- 
cgit v1.1

From 2a1402aa044b55c2d30ab0ed9405693ef06fb07c Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Mon, 8 Oct 2012 16:32:33 -0700
Subject: mm: compaction: acquire the zone->lru_lock as late as possible

Richard Davies and Shaohua Li have both reported lock contention
problems in compaction on the zone and LRU locks as well as significant
amounts of time being spent in compaction. This series aims to reduce
lock contention and scanning rates to reduce that CPU usage. Richard
reported at https://lkml.org/lkml/2012/9/21/91 that this series made a
big difference to a problem he reported in August:

http://marc.info/?l=kvm&m=134511507015614&w=2

Patch 1 defers acquiring the zone->lru_lock as long as possible.

Patch 2 defers acquiring the zone->lock as long as possible.

Patch 3 reverts Rik's "skip-free" patches as the core concept gets
        reimplemented later and the remaining patches are easier to
        understand if this is reverted first.

Patch 4 adds a pageblock-skip bit to the pageblock flags to cache what
        pageblocks should be skipped by the migrate and free scanners.
        This drastically reduces the amount of scanning compaction has
        to do.

Patch 5 reimplements something similar to Rik's idea except it uses the
        pageblock-skip information to decide where the scanners should
        restart from and does not need to wrap around.

I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were

akpm-20120920   3.6-rc6 + linux-next/akpm as of September 20th, 2012
lesslock        Patches 1-6
revert          Patches 1-7
cachefail       Patches 1-8
skipuseless     Patches 1-9

Stress high-order allocation tests looked ok. Success rates are more or
less the same with the full series applied but there is an expectation
that there is less opportunity to race with other allocation requests if
there is less scanning. The times to complete the tests did not vary
that much and are uninteresting, as were the vmstat statistics, so I will
not present them here.
Using ftrace I recorded how much scanning was done by compaction and got
this

                              3.6.0-rc6     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6
                          akpm-20120920      lockless   revert-v2r2     cachefail   skipuseless
Total free scanned            360753976     515414028     565479007      17103281      18916589
Total free isolated             2852429       3597369       4048601        670493        727840
Total free efficiency           0.0079%       0.0070%       0.0072%       0.0392%       0.0385%
Total migrate scanned         247728664     822729112    1004645830      17946827      14118903
Total migrate isolated          2555324       3245937       3437501        616359        658616
Total migrate efficiency        0.0103%       0.0039%       0.0034%       0.0343%       0.0466%

The efficiency is worthless because of the nature of the test and the
number of failures. The really interesting point as far as this patch
series is concerned is the number of pages scanned. Note that reverting
Rik's patches massively increases the number of pages scanned, indicating
that those patches really did make a difference to CPU usage.

However, caching what pageblocks should be skipped has a much higher
impact. With patches 1-8 applied, free page and migrate page scanning
are both reduced by 95% in comparison to the akpm kernel. If the basic
concept of Rik's patches is implemented on top, then scanning by the
free scanner barely changed but migrate scanning was further reduced.
That said, tests on 3.6-rc5 indicated that the last patch had greater
impact than what was measured here so it is a bit variable.

One way or the other, this series has a large impact on the amount of
scanning compaction does when there is a storm of THP allocations.

This patch:

Compaction's migrate scanner acquires the zone->lru_lock when scanning a
range of pages looking for LRU pages to acquire. It does this even if
there are no LRU pages in the range. If multiple processes are
compacting then this can cause severe locking contention.
To make matters worse commit b2eef8c0 ("mm: compaction: minimise the time IRQs are disabled while isolating pages for migration") releases the lru_lock every SWAP_CLUSTER_MAX pages that are scanned. This patch makes two changes to how the migrate scanner acquires the LRU lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if the lock is contended. This reduces the number of times it unnecessarily disables and re-enables IRQs. The second is that it defers acquiring the LRU lock for as long as possible. If there are no LRU pages or the only LRU pages are transhuge then the LRU lock will not be acquired at all which reduces contention on zone->lru_lock. [minchan@kernel.org: augment comment] [akpm@linux-foundation.org: tweak comment text] Signed-off-by: Mel Gorman Acked-by: Rik van Riel Cc: Richard Davies Cc: Shaohua Li Cc: Avi Kivity Acked-by: Rafael Aquini Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 65 +++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 45 insertions(+), 20 deletions(-) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index b16dd38..832c4183 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -50,6 +50,11 @@ static inline bool migrate_async_suitable(int migratetype) return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE; } +static inline bool should_release_lock(spinlock_t *lock) +{ + return need_resched() || spin_is_contended(lock); +} + /* * Compaction requires the taking of some coarse locks that are potentially * very heavily contended. 
Check if the process needs to be scheduled or @@ -62,7 +67,7 @@ static inline bool migrate_async_suitable(int migratetype) static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags, bool locked, struct compact_control *cc) { - if (need_resched() || spin_is_contended(lock)) { + if (should_release_lock(lock)) { if (locked) { spin_unlock_irqrestore(lock, *flags); locked = false; @@ -327,7 +332,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc, isolate_mode_t mode = 0; struct lruvec *lruvec; unsigned long flags; - bool locked; + bool locked = false; /* * Ensure that there are not too many pages isolated from the LRU @@ -347,23 +352,17 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc, /* Time to isolate some pages for migration */ cond_resched(); - spin_lock_irqsave(&zone->lru_lock, flags); - locked = true; for (; low_pfn < end_pfn; low_pfn++) { struct page *page; /* give a chance to irqs before checking need_resched() */ - if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) { - spin_unlock_irqrestore(&zone->lru_lock, flags); - locked = false; + if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) { + if (should_release_lock(&zone->lru_lock)) { + spin_unlock_irqrestore(&zone->lru_lock, flags); + locked = false; + } } - /* Check if it is ok to still hold the lock */ - locked = compact_checklock_irqsave(&zone->lru_lock, &flags, - locked, cc); - if (!locked || fatal_signal_pending(current)) - break; - /* * migrate_pfn does not necessarily start aligned to a * pageblock. 
Ensure that pfn_valid is called when moving @@ -403,21 +402,40 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc, pageblock_nr = low_pfn >> pageblock_order; if (!cc->sync && last_pageblock_nr != pageblock_nr && !migrate_async_suitable(get_pageblock_migratetype(page))) { - low_pfn += pageblock_nr_pages; - low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1; - last_pageblock_nr = pageblock_nr; - continue; + goto next_pageblock; } + /* Check may be lockless but that's ok as we recheck later */ if (!PageLRU(page)) continue; /* - * PageLRU is set, and lru_lock excludes isolation, - * splitting and collapsing (collapsing has already - * happened if PageLRU is set). + * PageLRU is set. lru_lock normally excludes isolation + * splitting and collapsing (collapsing has already happened + * if PageLRU is set) but the lock is not necessarily taken + * here and it is wasteful to take it just to check transhuge. + * Check TransHuge without lock and skip the whole pageblock if + * it's either a transhuge or hugetlbfs page, as calling + * compound_order() without preventing THP from splitting the + * page underneath us may return surprising results. 
*/ if (PageTransHuge(page)) { + if (!locked) + goto next_pageblock; + low_pfn += (1 << compound_order(page)) - 1; + continue; + } + + /* Check if it is ok to still hold the lock */ + locked = compact_checklock_irqsave(&zone->lru_lock, &flags, + locked, cc); + if (!locked || fatal_signal_pending(current)) + break; + + /* Recheck PageLRU and PageTransHuge under lock */ + if (!PageLRU(page)) + continue; + if (PageTransHuge(page)) { low_pfn += (1 << compound_order(page)) - 1; continue; } @@ -444,6 +462,13 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc, ++low_pfn; break; } + + continue; + +next_pageblock: + low_pfn += pageblock_nr_pages; + low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1; + last_pageblock_nr = pageblock_nr; } acct_isolated(zone, locked, cc); -- cgit v1.1 From f40d1e42bb988d2a26e8e111ea4c4c7bac819b7e Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 8 Oct 2012 16:32:36 -0700 Subject: mm: compaction: acquire the zone->lock as late as possible Compaction's free scanner acquires the zone->lock when checking for PageBuddy pages and isolating them. It does this even if there are no PageBuddy pages in the range. This patch defers acquiring the zone lock for as long as possible. In the event there are no free pages in the pageblock then the lock will not be acquired at all which reduces contention on zone->lock. 
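
The "acquire the lock as late as possible" pattern described above can be
sketched in plain C. This is a single-threaded userspace model under
stated assumptions: struct model_lock only counts acquisitions instead of
being a real spinlock, struct model_page has a single buddy flag, and
model_isolate_free_pages() is a hypothetical name, not the kernel's
isolate_freepages_block().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock model: records whether and how often it was taken. */
struct model_lock {
	bool held;
	unsigned int acquisitions;
};

/* Hypothetical page model: buddy == true means the page is free. */
struct model_page {
	bool buddy;
};

/*
 * Scan @n pages and isolate the free ones. The lock is taken only once
 * a candidate is seen, and the candidate is rechecked under the lock.
 * Returns the number of pages isolated.
 */
static size_t model_isolate_free_pages(struct model_lock *lock,
				       struct model_page *pages, size_t n)
{
	size_t i, isolated = 0;
	bool locked = false;

	for (i = 0; i < n; i++) {
		/* Cheap check without the lock; it may be stale. */
		if (!pages[i].buddy)
			continue;

		/* First candidate found: only now pay for the lock. */
		if (!locked) {
			lock->held = true;
			lock->acquisitions++;
			locked = true;
		}

		/* Recheck under the lock before acting on the page. */
		if (!pages[i].buddy)
			continue;

		pages[i].buddy = false;
		isolated++;
	}

	if (locked)
		lock->held = false;

	/* The lock must never leak out of the scan. */
	assert(!lock->held);
	return isolated;
}
```

When the scanned range has no free pages the model never touches the
lock, which is the case the patch targets. The real code additionally
drops the lock when it is contended or a reschedule is needed, which this
sketch does not model.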
Signed-off-by: Mel Gorman Acked-by: Rik van Riel Cc: Richard Davies Cc: Shaohua Li Cc: Avi Kivity Acked-by: Rafael Aquini Acked-by: Minchan Kim Tested-by: Peter Ujfalusi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 140 ++++++++++++++++++++++++++++++-------------------------- 1 file changed, 76 insertions(+), 64 deletions(-) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index 832c4183..bdf6e13 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -93,6 +93,27 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock, return compact_checklock_irqsave(lock, flags, false, cc); } +/* Returns true if the page is within a block suitable for migration to */ +static bool suitable_migration_target(struct page *page) +{ + int migratetype = get_pageblock_migratetype(page); + + /* Don't interfere with memory hot-remove or the min_free_kbytes blocks */ + if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE) + return false; + + /* If the page is a large free page, then allow migration */ + if (PageBuddy(page) && page_order(page) >= pageblock_order) + return true; + + /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */ + if (migrate_async_suitable(migratetype)) + return true; + + /* Otherwise skip the block */ + return false; +} + static void compact_capture_page(struct compact_control *cc) { unsigned long flags; @@ -153,38 +174,56 @@ static void compact_capture_page(struct compact_control *cc) * pages inside of the pageblock (even though it may still end up isolating * some pages). 
  */
-static unsigned long isolate_freepages_block(unsigned long blockpfn,
+static unsigned long isolate_freepages_block(struct compact_control *cc,
+				unsigned long blockpfn,
 				unsigned long end_pfn,
 				struct list_head *freelist,
 				bool strict)
 {
 	int nr_scanned = 0, total_isolated = 0;
 	struct page *cursor;
+	unsigned long nr_strict_required = end_pfn - blockpfn;
+	unsigned long flags;
+	bool locked = false;
 
 	cursor = pfn_to_page(blockpfn);
 
-	/* Isolate free pages. This assumes the block is valid */
+	/* Isolate free pages. */
 	for (; blockpfn < end_pfn; blockpfn++, cursor++) {
 		int isolated, i;
 		struct page *page = cursor;
 
-		if (!pfn_valid_within(blockpfn)) {
-			if (strict)
-				return 0;
-			continue;
-		}
 		nr_scanned++;
+		if (!pfn_valid_within(blockpfn))
+			continue;
+		if (!PageBuddy(page))
+			continue;
+
+		/*
+		 * The zone lock must be held to isolate freepages.
+		 * Unfortunately this is a very coarse lock and can be
+		 * heavily contended if there are parallel allocations
+		 * or parallel compactions. For async compaction do not
+		 * spin on the lock and we acquire the lock as late as
+		 * possible.
+		 */
+		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
+								locked, cc);
+		if (!locked)
+			break;
+
+		/* Recheck this is a suitable migration target under lock */
+		if (!strict && !suitable_migration_target(page))
+			break;
 
-		if (!PageBuddy(page)) {
-			if (strict)
-				return 0;
+		/* Recheck this is a buddy page under lock */
+		if (!PageBuddy(page))
 			continue;
-		}
 
 		/* Found a free page, break it into order-0 pages */
 		isolated = split_free_page(page);
 		if (!isolated && strict)
-			return 0;
+			break;
 		total_isolated += isolated;
 		for (i = 0; i < isolated; i++) {
 			list_add(&page->lru, freelist);
@@ -199,6 +238,18 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
 	}
 
 	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
+
+	/*
+	 * If strict isolation is requested by CMA then check that all the
+	 * pages requested were isolated. If there were any failures, 0 is
+	 * returned and CMA will fail.
+	 */
+	if (strict && nr_strict_required != total_isolated)
+		total_isolated = 0;
+
+	if (locked)
+		spin_unlock_irqrestore(&cc->zone->lock, flags);
+
 	return total_isolated;
 }
 
@@ -218,12 +269,17 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
 unsigned long
 isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
 {
-	unsigned long isolated, pfn, block_end_pfn, flags;
+	unsigned long isolated, pfn, block_end_pfn;
 	struct zone *zone = NULL;
 	LIST_HEAD(freelist);
 
+	/* cc needed for isolate_freepages_block to acquire zone->lock */
+	struct compact_control cc = {
+		.sync = true,
+	};
+
 	if (pfn_valid(start_pfn))
-		zone = page_zone(pfn_to_page(start_pfn));
+		cc.zone = zone = page_zone(pfn_to_page(start_pfn));
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
 		if (!pfn_valid(pfn) || zone != page_zone(pfn_to_page(pfn)))
@@ -236,10 +292,8 @@ isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
 		block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
 		block_end_pfn = min(block_end_pfn, end_pfn);
 
-		spin_lock_irqsave(&zone->lock, flags);
-		isolated = isolate_freepages_block(pfn, block_end_pfn,
+		isolated = isolate_freepages_block(&cc, pfn, block_end_pfn,
 						   &freelist, true);
-		spin_unlock_irqrestore(&zone->lock, flags);
 
 		/*
 		 * In strict mode, isolate_freepages_block() returns 0 if
@@ -483,29 +537,6 @@ next_pageblock:
 
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 #ifdef CONFIG_COMPACTION
-
-/* Returns true if the page is within a block suitable for migration to */
-static bool suitable_migration_target(struct page *page)
-{
-
-	int migratetype = get_pageblock_migratetype(page);
-
-	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
-	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
-		return false;
-
-	/* If the page is a large free page, then allow migration */
-	if (PageBuddy(page) && page_order(page) >= pageblock_order)
-		return true;
-
-	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
-	if (migrate_async_suitable(migratetype))
-		return true;
-
-	/* Otherwise skip the block */
-	return false;
-}
-
 /*
  * Returns the start pfn of the last page block in a zone. This is the starting
  * point for full compaction of a zone. Compaction searches for free pages from
@@ -529,7 +560,6 @@ static void isolate_freepages(struct zone *zone,
 {
 	struct page *page;
 	unsigned long high_pfn, low_pfn, pfn, zone_end_pfn, end_pfn;
-	unsigned long flags;
 	int nr_freepages = cc->nr_freepages;
 	struct list_head *freelist = &cc->freepages;
 
@@ -577,30 +607,12 @@ static void isolate_freepages(struct zone *zone,
 		if (!suitable_migration_target(page))
 			continue;
 
-		/*
-		 * Found a block suitable for isolating free pages from. Now
-		 * we disabled interrupts, double check things are ok and
-		 * isolate the pages. This is to minimise the time IRQs
-		 * are disabled
-		 */
+		/* Found a block suitable for isolating free pages from */
 		isolated = 0;
-
-		/*
-		 * The zone lock must be held to isolate freepages. This
-		 * unfortunately this is a very coarse lock and can be
-		 * heavily contended if there are parallel allocations
-		 * or parallel compactions. For async compaction do not
-		 * spin on the lock
-		 */
-		if (!compact_trylock_irqsave(&zone->lock, &flags, cc))
-			break;
-		if (suitable_migration_target(page)) {
-			end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
-			isolated = isolate_freepages_block(pfn, end_pfn,
-							   freelist, false);
-			nr_freepages += isolated;
-		}
-		spin_unlock_irqrestore(&zone->lock, flags);
+		end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
+		isolated = isolate_freepages_block(cc, pfn, end_pfn,
+						   freelist, false);
+		nr_freepages += isolated;
 
 		/*
 		 * Record the highest PFN we isolated pages from. When next
--
cgit v1.1

From 753341a4b85ff337487b9959c71c529f522004f4 Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Mon, 8 Oct 2012 16:32:40 -0700
Subject: revert "mm: have order > 0 compaction start off where it left"

This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start off
where it left") and commit de74f1cc ("mm: have order > 0 compaction start
near a pageblock with free pages"). These patches were a good idea and
tests confirmed that they massively reduced the amount of scanning but the
implementation is complex and tricky to understand. A later patch will
cache what pageblocks should be skipped and reimplements the concept of
compact_cached_free_pfn on top for both migration and free scanners.

Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Richard Davies
Cc: Shaohua Li
Cc: Avi Kivity
Acked-by: Rafael Aquini
Acked-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/compaction.c | 65 +++++---------------------------------------------------
 1 file changed, 5 insertions(+), 60 deletions(-)

(limited to 'mm/compaction.c')

diff --git a/mm/compaction.c b/mm/compaction.c
index bdf6e13..db76361 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -538,20 +538,6 @@ next_pageblock:
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 #ifdef CONFIG_COMPACTION
 /*
- * Returns the start pfn of the last page block in a zone. This is the starting
- * point for full compaction of a zone. Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
-	unsigned long free_pfn;
-	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	free_pfn &= ~(pageblock_nr_pages-1);
-	return free_pfn;
-}
-
-/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -619,19 +605,8 @@ static void isolate_freepages(struct zone *zone,
 		 * looking for free pages, the search will restart here as
 		 * page migration may have returned some pages to the allocator
 		 */
-		if (isolated) {
+		if (isolated)
 			high_pfn = max(high_pfn, pfn);
-
-			/*
-			 * If the free scanner has wrapped, update
-			 * compact_cached_free_pfn to point to the highest
-			 * pageblock with free pages. This reduces excessive
-			 * scanning of full pageblocks near the end of the
-			 * zone
-			 */
-			if (cc->order > 0 && cc->wrapped)
-				zone->compact_cached_free_pfn = high_pfn;
-		}
 	}
 
 	/* split_free_page does not map the pages */
@@ -639,11 +614,6 @@ static void isolate_freepages(struct zone *zone,
 
 	cc->free_pfn = high_pfn;
 	cc->nr_freepages = nr_freepages;
-
-	/* If compact_cached_free_pfn is reset then set it now */
-	if (cc->order > 0 && !cc->wrapped &&
-			zone->compact_cached_free_pfn == start_free_pfn(zone))
-		zone->compact_cached_free_pfn = high_pfn;
 }
 
 /*
@@ -738,26 +708,8 @@ static int compact_finished(struct zone *zone,
 	if (fatal_signal_pending(current))
 		return COMPACT_PARTIAL;
 
-	/*
-	 * A full (order == -1) compaction run starts at the beginning and
-	 * end of a zone; it completes when the migrate and free scanner meet.
-	 * A partial (order > 0) compaction can start with the free scanner
-	 * at a random point in the zone, and may have to restart.
-	 */
-	if (cc->free_pfn <= cc->migrate_pfn) {
-		if (cc->order > 0 && !cc->wrapped) {
-			/* We started partway through; restart at the end. */
-			unsigned long free_pfn = start_free_pfn(zone);
-			zone->compact_cached_free_pfn = free_pfn;
-			cc->free_pfn = free_pfn;
-			cc->wrapped = 1;
-			return COMPACT_CONTINUE;
-		}
-		return COMPACT_COMPLETE;
-	}
-
-	/* We wrapped around and ended up where we started. */
-	if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn)
+	/* Compaction run completes if the migrate and free scanner meet */
+	if (cc->free_pfn <= cc->migrate_pfn)
 		return COMPACT_COMPLETE;
 
 	/*
@@ -863,15 +815,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 	/* Setup to move all movable pages to the end of the zone */
 	cc->migrate_pfn = zone->zone_start_pfn;
-
-	if (cc->order > 0) {
-		/* Incremental compaction. Start where the last one stopped. */
-		cc->free_pfn = zone->compact_cached_free_pfn;
-		cc->start_free_pfn = cc->free_pfn;
-	} else {
-		/* Order == -1 starts at the end of the zone. */
-		cc->free_pfn = start_free_pfn(zone);
-	}
+	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+	cc->free_pfn &= ~(pageblock_nr_pages-1);
 
 	migrate_prep_local();
 
--
cgit v1.1

From bb13ffeb9f6bfeb301443994dfbf29f91117dfb3 Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Mon, 8 Oct 2012 16:32:41 -0700
Subject: mm: compaction: cache if a pageblock was scanned and no pages were isolated

When compaction was implemented it was known that scanning could
potentially be excessive. The ideal was that a counter be maintained for
each pageblock but maintaining this information would incur a severe
penalty due to a shared writable cache line. It has reached the point
where the scanning costs are a serious problem, particularly on
long-lived systems where a large process starts and allocates a large
number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock is
marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is "when to ignore the cached
information?" If it's ignored too often, the scanning rates will still be
excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
  be discarded if it's at least 5 seconds since the last time the cache
  was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event. Depending solely on multiple allocation failures still
allows excessive scanning when THP allocations are failing in quick
succession due to memory pressure. Waiting until memory pressure is
relieved would cause compaction to continually fail instead of using
reclaim/compaction to try allocate the page. The time-based mechanism is
clumsy but a better option is not obvious.

Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Richard Davies
Cc: Shaohua Li
Cc: Avi Kivity
Acked-by: Rafael Aquini
Cc: Fengguang Wu
Cc: Michal Nazarewicz
Cc: Bartlomiej Zolnierkiewicz
Cc: Kyungmin Park
Cc: Mark Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/compaction.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 108 insertions(+), 17 deletions(-)

(limited to 'mm/compaction.c')

diff --git a/mm/compaction.c b/mm/compaction.c
index db76361..d9dbb97 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,79 @@ static inline bool migrate_async_suitable(int migratetype)
 	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+#ifdef CONFIG_COMPACTION
+/* Returns true if the pageblock should be scanned for pages to isolate. */
+static inline bool isolation_suitable(struct compact_control *cc,
+					struct page *page)
+{
+	if (cc->ignore_skip_hint)
+		return true;
+
+	return !get_pageblock_skip(page);
+}
+
+/*
+ * This function is called to clear all cached information on pageblocks that
+ * should be skipped for page isolation when the migrate and free page scanner
+ * meet.
+ */
+static void reset_isolation_suitable(struct zone *zone)
+{
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	unsigned long pfn;
+
+	/*
+	 * Do not reset more than once every five seconds. If allocations are
+	 * failing sufficiently quickly to allow this to happen then continually
+	 * scanning for compaction is not going to help. The choice of five
+	 * seconds is arbitrary but will mitigate excessive scanning.
+	 */
+	if (time_before(jiffies, zone->compact_blockskip_expire))
+		return;
+	zone->compact_blockskip_expire = jiffies + (HZ * 5);
+
+	/* Walk the zone and mark every pageblock as suitable for isolation */
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+
+		cond_resched();
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (zone != page_zone(page))
+			continue;
+
+		clear_pageblock_skip(page);
+	}
+}
+
+/*
+ * If no pages were isolated then mark this pageblock to be skipped in the
+ * future. The information is later cleared by reset_isolation_suitable().
+ */
+static void update_pageblock_skip(struct page *page, unsigned long nr_isolated)
+{
+	if (!page)
+		return;
+
+	if (!nr_isolated)
+		set_pageblock_skip(page);
+}
+#else
+static inline bool isolation_suitable(struct compact_control *cc,
+					struct page *page)
+{
+	return true;
+}
+
+static void update_pageblock_skip(struct page *page, unsigned long nr_isolated)
+{
+}
+#endif /* CONFIG_COMPACTION */
+
 static inline bool should_release_lock(spinlock_t *lock)
 {
 	return need_resched() || spin_is_contended(lock);
@@ -181,7 +254,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 				bool strict)
 {
 	int nr_scanned = 0, total_isolated = 0;
-	struct page *cursor;
+	struct page *cursor, *valid_page = NULL;
 	unsigned long nr_strict_required = end_pfn - blockpfn;
 	unsigned long flags;
 	bool locked = false;
@@ -196,6 +269,8 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		nr_scanned++;
 		if (!pfn_valid_within(blockpfn))
 			continue;
+		if (!valid_page)
+			valid_page = page;
 		if (!PageBuddy(page))
 			continue;
 
@@ -250,6 +325,10 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	if (locked)
 		spin_unlock_irqrestore(&cc->zone->lock, flags);
 
+	/* Update the pageblock-skip if the whole pageblock was scanned */
+	if (blockpfn == end_pfn)
+		update_pageblock_skip(valid_page, total_isolated);
+
 	return total_isolated;
 }
 
@@ -267,22 +346,14 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
  * a free page).
  */
 unsigned long
-isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
+isolate_freepages_range(struct compact_control *cc,
+			unsigned long start_pfn, unsigned long end_pfn)
 {
 	unsigned long isolated, pfn, block_end_pfn;
-	struct zone *zone = NULL;
 	LIST_HEAD(freelist);
 
-	/* cc needed for isolate_freepages_block to acquire zone->lock */
-	struct compact_control cc = {
-		.sync = true,
-	};
-
-	if (pfn_valid(start_pfn))
-		cc.zone = zone = page_zone(pfn_to_page(start_pfn));
-
 	for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
-		if (!pfn_valid(pfn) || zone != page_zone(pfn_to_page(pfn)))
+		if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
 			break;
 
 		/*
@@ -292,7 +363,7 @@ isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
 		block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
 		block_end_pfn = min(block_end_pfn, end_pfn);
 
-		isolated = isolate_freepages_block(&cc, pfn, block_end_pfn,
+		isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
 						   &freelist, true);
 
 		/*
@@ -387,6 +458,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	struct lruvec *lruvec;
 	unsigned long flags;
 	bool locked = false;
+	struct page *page = NULL, *valid_page = NULL;
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -407,8 +479,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	/* Time to isolate some pages for migration */
 	cond_resched();
 	for (; low_pfn < end_pfn; low_pfn++) {
-		struct page *page;
-
 		/* give a chance to irqs before checking need_resched() */
 		if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
 			if (should_release_lock(&zone->lru_lock)) {
@@ -444,6 +514,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		if (page_zone(page) != zone)
 			continue;
 
+		if (!valid_page)
+			valid_page = page;
+
+		/* If isolation recently failed, do not retry */
+		pageblock_nr = low_pfn >> pageblock_order;
+		if (!isolation_suitable(cc, page))
+			goto next_pageblock;
+
 		/* Skip if free */
 		if (PageBuddy(page))
 			continue;
@@ -453,7 +531,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 * migration is optimistic to see if the minimum amount of work
 		 * satisfies the allocation
 		 */
-		pageblock_nr = low_pfn >> pageblock_order;
 		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
 		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
 			goto next_pageblock;
@@ -530,6 +607,10 @@ next_pageblock:
 	if (locked)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
+	/* Update the pageblock-skip if the whole pageblock was scanned */
+	if (low_pfn == end_pfn)
+		update_pageblock_skip(valid_page, nr_isolated);
+
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
 	return low_pfn;
@@ -593,6 +674,10 @@ static void isolate_freepages(struct zone *zone,
 		if (!suitable_migration_target(page))
 			continue;
 
+		/* If isolation recently failed, do not retry */
+		if (!isolation_suitable(cc, page))
+			continue;
+
 		/* Found a block suitable for isolating free pages from */
 		isolated = 0;
 		end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
@@ -709,8 +794,10 @@ static int compact_finished(struct zone *zone,
 		return COMPACT_PARTIAL;
 
 	/* Compaction run completes if the migrate and free scanner meet */
-	if (cc->free_pfn <= cc->migrate_pfn)
+	if (cc->free_pfn <= cc->migrate_pfn) {
+		reset_isolation_suitable(cc->zone);
 		return COMPACT_COMPLETE;
+	}
 
 	/*
 	 * order == -1 is expected when compacting via
@@ -818,6 +905,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
 	cc->free_pfn &= ~(pageblock_nr_pages-1);
 
+	/* Clear pageblock skip if there are numerous alloc failures */
+	if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT)
+		reset_isolation_suitable(zone);
+
 	migrate_prep_local();
 
 	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
--
cgit v1.1

From c89511ab2f8fe2b47585e60da8af7fd213ec877e Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Mon, 8 Oct 2012 16:32:45 -0700
Subject: mm: compaction: Restart compaction from near where it left off

This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

This patch caches where the migration and free scanner should start from
on subsequent compaction invocations using the pageblock-skip information.
When compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.

Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Richard Davies
Cc: Shaohua Li
Cc: Avi Kivity
Acked-by: Rafael Aquini
Cc: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/compaction.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 48 insertions(+), 10 deletions(-)

(limited to 'mm/compaction.c')

diff --git a/mm/compaction.c b/mm/compaction.c
index d9dbb97..f94cbc0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -80,6 +80,9 @@ static void reset_isolation_suitable(struct zone *zone)
 	 */
 	if (time_before(jiffies, zone->compact_blockskip_expire))
 		return;
+
+	zone->compact_cached_migrate_pfn = start_pfn;
+	zone->compact_cached_free_pfn = end_pfn;
 	zone->compact_blockskip_expire = jiffies + (HZ * 5);
 
 	/* Walk the zone and mark every pageblock as suitable for isolation */
@@ -103,13 +106,29 @@ static void reset_isolation_suitable(struct zone *zone)
  * If no pages were isolated then mark this pageblock to be skipped in the
  * future. The information is later cleared by reset_isolation_suitable().
  */
-static void update_pageblock_skip(struct page *page, unsigned long nr_isolated)
+static void update_pageblock_skip(struct compact_control *cc,
+			struct page *page, unsigned long nr_isolated,
+			bool migrate_scanner)
 {
+	struct zone *zone = cc->zone;
 	if (!page)
 		return;
 
-	if (!nr_isolated)
+	if (!nr_isolated) {
+		unsigned long pfn = page_to_pfn(page);
 		set_pageblock_skip(page);
+
+		/* Update where compaction should restart */
+		if (migrate_scanner) {
+			if (!cc->finished_update_migrate &&
+			    pfn > zone->compact_cached_migrate_pfn)
+				zone->compact_cached_migrate_pfn = pfn;
+		} else {
+			if (!cc->finished_update_free &&
+			    pfn < zone->compact_cached_free_pfn)
+				zone->compact_cached_free_pfn = pfn;
+		}
+	}
 }
 #else
 static inline bool isolation_suitable(struct compact_control *cc,
@@ -118,7 +137,9 @@ static inline bool isolation_suitable(struct compact_control *cc,
 	return true;
 }
 
-static void update_pageblock_skip(struct page *page, unsigned long nr_isolated)
+static void update_pageblock_skip(struct compact_control *cc,
+			struct page *page, unsigned long nr_isolated,
+			bool migrate_scanner)
 {
 }
 #endif /* CONFIG_COMPACTION */
@@ -327,7 +348,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 
 	/* Update the pageblock-skip if the whole pageblock was scanned */
 	if (blockpfn == end_pfn)
-		update_pageblock_skip(valid_page, total_isolated);
+		update_pageblock_skip(cc, valid_page, total_isolated, false);
 
 	return total_isolated;
 }
@@ -533,6 +554,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 */
 		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
 		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
+			cc->finished_update_migrate = true;
 			goto next_pageblock;
 		}
 
@@ -583,6 +605,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		VM_BUG_ON(PageTransCompound(page));
 
 		/* Successfully isolated */
+		cc->finished_update_migrate = true;
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		list_add(&page->lru, migratelist);
 		cc->nr_migratepages++;
@@ -609,7 +632,7 @@ next_pageblock:
 
 	/* Update the pageblock-skip if the whole pageblock was scanned */
 	if (low_pfn == end_pfn)
-		update_pageblock_skip(valid_page, nr_isolated);
+		update_pageblock_skip(cc, valid_page, nr_isolated, true);
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
@@ -690,8 +713,10 @@ static void isolate_freepages(struct zone *zone,
 		 * looking for free pages, the search will restart here as
 		 * page migration may have returned some pages to the allocator
 		 */
-		if (isolated)
+		if (isolated) {
+			cc->finished_update_free = true;
 			high_pfn = max(high_pfn, pfn);
+		}
 	}
 
 	/* split_free_page does not map the pages */
@@ -888,6 +913,8 @@ unsigned long compaction_suitable(struct zone *zone, int order)
 static int compact_zone(struct zone *zone, struct compact_control *cc)
 {
 	int ret;
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
 
 	ret = compaction_suitable(zone, cc->order);
 	switch (ret) {
@@ -900,10 +927,21 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		;
 	}
 
-	/* Setup to move all movable pages to the end of the zone */
-	cc->migrate_pfn = zone->zone_start_pfn;
-	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
-	cc->free_pfn &= ~(pageblock_nr_pages-1);
+	/*
+	 * Setup to move all movable pages to the end of the zone. Used cached
+	 * information on where the scanners should start but check that it
+	 * is initialised by ensuring the values are within zone boundaries.
+	 */
+	cc->migrate_pfn = zone->compact_cached_migrate_pfn;
+	cc->free_pfn = zone->compact_cached_free_pfn;
+	if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn) {
+		cc->free_pfn = end_pfn & ~(pageblock_nr_pages-1);
+		zone->compact_cached_free_pfn = cc->free_pfn;
+	}
+	if (cc->migrate_pfn < start_pfn || cc->migrate_pfn > end_pfn) {
+		cc->migrate_pfn = start_pfn;
+		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
+	}
 
 	/* Clear pageblock skip if there are numerous alloc failures */
 	if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT)
--
cgit v1.1

From 62997027ca5b3d4618198ed8b1aba40b61b1137b Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Mon, 8 Oct 2012 16:32:47 -0700
Subject: mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity

Compaction caches if a pageblock was scanned and no pages were isolated so
that the pageblocks can be skipped in the future to reduce scanning. This
information is not cleared by the page allocator based on activity due to
the impact it would have to the page allocator fast paths. Hence there is
a requirement that something clear the cache or pageblocks will be skipped
forever. Currently the cache is cleared if there were a number of recent
allocation failures and it has not been cleared within the last 5 seconds.
Time-based decisions like this are terrible as they have no relationship
to VM activity and is basically a big hammer.

Unfortunately, accurate heuristics would add cost to some hot paths so
this patch implements a rough heuristic. There are two cases where the
cache is cleared.

1. If a !kswapd process completes a compaction cycle (migrate and free
   scanner meet), the zone is marked compact_blockskip_flush. When kswapd
   goes to sleep, it will clear the cache. This is expected to be the
   common case where the cache is cleared. It does not really matter if
   kswapd happens to be asleep or going to sleep when the flag is set as
   it will be woken on the next allocation request.

2. If there have been multiple failures recently and compaction just
   finished being deferred then a process will clear the cache and start a
   full scan. This situation happens if there are multiple high-order
   allocation requests under heavy memory pressure.

The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless. For allocations that can fail such as THP,
they will simply fail. For requests that cannot fail, they will retry the
allocation. Tests indicated that scanning rates were roughly similar to
when the time-based heuristic was used and the allocation success rates
were similar.

Signed-off-by: Mel Gorman
Cc: Rik van Riel
Cc: Richard Davies
Cc: Shaohua Li
Cc: Avi Kivity
Cc: Rafael Aquini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/compaction.c | 50 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 16 deletions(-)

(limited to 'mm/compaction.c')

diff --git a/mm/compaction.c b/mm/compaction.c
index f94cbc0..d8187f9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -66,24 +66,15 @@ static inline bool isolation_suitable(struct compact_control *cc,
  * should be skipped for page isolation when the migrate and free page scanner
  * meet.
  */
-static void reset_isolation_suitable(struct zone *zone)
+static void __reset_isolation_suitable(struct zone *zone)
 {
 	unsigned long start_pfn = zone->zone_start_pfn;
 	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
 	unsigned long pfn;
 
-	/*
-	 * Do not reset more than once every five seconds. If allocations are
-	 * failing sufficiently quickly to allow this to happen then continually
-	 * scanning for compaction is not going to help. The choice of five
-	 * seconds is arbitrary but will mitigate excessive scanning.
-	 */
-	if (time_before(jiffies, zone->compact_blockskip_expire))
-		return;
-
 	zone->compact_cached_migrate_pfn = start_pfn;
 	zone->compact_cached_free_pfn = end_pfn;
-	zone->compact_blockskip_expire = jiffies + (HZ * 5);
+	zone->compact_blockskip_flush = false;
 
 	/* Walk the zone and mark every pageblock as suitable for isolation */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
@@ -102,9 +93,24 @@ static void reset_isolation_suitable(struct zone *zone)
 	}
 }
 
+void reset_isolation_suitable(pg_data_t *pgdat)
+{
+	int zoneid;
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		struct zone *zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		/* Only flush if a full compaction finished recently */
+		if (zone->compact_blockskip_flush)
+			__reset_isolation_suitable(zone);
+	}
+}
+
 /*
  * If no pages were isolated then mark this pageblock to be skipped in the
- * future. The information is later cleared by reset_isolation_suitable().
+ * future. The information is later cleared by __reset_isolation_suitable().
  */
 static void update_pageblock_skip(struct compact_control *cc,
 			struct page *page, unsigned long nr_isolated,
@@ -820,7 +826,15 @@ static int compact_finished(struct zone *zone,
 
 	/* Compaction run completes if the migrate and free scanner meet */
 	if (cc->free_pfn <= cc->migrate_pfn) {
-		reset_isolation_suitable(cc->zone);
+		/*
+		 * Mark that the PG_migrate_skip information should be cleared
+		 * by kswapd when it goes to sleep. kswapd does not set the
+		 * flag itself as the decision to be clear should be directly
+		 * based on an allocation request.
+		 */
+		if (!current_is_kswapd())
+			zone->compact_blockskip_flush = true;
+
 		return COMPACT_COMPLETE;
 	}
 
@@ -943,9 +957,13 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
 	}
 
-	/* Clear pageblock skip if there are numerous alloc failures */
-	if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT)
-		reset_isolation_suitable(zone);
+	/*
+	 * Clear pageblock skip if there were failures recently and compaction
+	 * is about to be retried after being deferred. kswapd does not do
+	 * this reset as it'll reset the cached information when going to sleep.
+	 */
+	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
+		__reset_isolation_suitable(zone);
 
 	migrate_prep_local();
 
--
cgit v1.1

From e46a28790e594c0876d1a84270926abf75460f61 Mon Sep 17 00:00:00 2001
From: Minchan Kim
Date: Mon, 8 Oct 2012 16:33:48 -0700
Subject: CMA: migrate mlocked pages

Presently CMA cannot migrate mlocked pages so it ends up failing to
allocate contiguous memory space.

This patch makes mlocked pages be migrated out. Of course, it can affect
realtime processes but in CMA usecase, contiguous memory allocation
failing is far worse than access latency to an mlocked page being variable
while CMA is running. If someone wants to make the system realtime, he
shouldn't enable CMA because stalls can still happen at random times.

[akpm@linux-foundation.org: tweak comment text, per Mel]
Signed-off-by: Minchan Kim
Acked-by: Mel Gorman
Cc: Michal Nazarewicz
Cc: Bartlomiej Zolnierkiewicz
Cc: Marek Szyprowski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/compaction.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

(limited to 'mm/compaction.c')

diff --git a/mm/compaction.c b/mm/compaction.c
index d8187f9..2c4ce17 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -461,6 +461,7 @@ static bool too_many_isolated(struct zone *zone)
  * @cc: Compaction control structure.
  * @low_pfn: The first PFN of the range.
  * @end_pfn: The one-past-the-last PFN of the range.
+ * @unevictable: true if it allows to isolate unevictable pages
  *
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). Returns zero if there is a fatal signal
@@ -476,7 +477,7 @@ static bool too_many_isolated(struct zone *zone)
  */
 unsigned long
 isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
-			   unsigned long low_pfn, unsigned long end_pfn)
+		unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
 {
 	unsigned long last_pageblock_nr = 0, pageblock_nr;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
@@ -602,6 +603,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		if (!cc->sync)
 			mode |= ISOLATE_ASYNC_MIGRATE;
 
+		if (unevictable)
+			mode |= ISOLATE_UNEVICTABLE;
+
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 
 		/* Try isolate the page */
@@ -807,7 +811,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	}
 
 	/* Perform the isolation */
-	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
+	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
 	if (!low_pfn || cc->contended)
 		return ISOLATE_ABORT;
 
--
cgit v1.1
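
Note for readers of the series: the pageblock-skip caching introduced in the patches above (isolation_suitable(), update_pageblock_skip() and the reset helpers) can be illustrated with a small userspace model. The sketch below is not kernel code and is not taken from the series. The names toy_zone, TOY_NR_BLOCKS and the toy_* functions are hypothetical, a "zone" is reduced to an array of per-block free-page counts, and only the free scanner is modelled. It assumes a block is marked skipped exactly when a full scan of it isolates nothing, as the commit messages describe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_BLOCKS 8	/* hypothetical number of pageblocks in the toy zone */

/* Hypothetical stand-in for struct zone plus the per-pageblock skip bits. */
struct toy_zone {
	size_t free_pages[TOY_NR_BLOCKS];	/* isolatable free pages per block */
	bool skip[TOY_NR_BLOCKS];		/* models PG_migrate_skip */
};

/* Models isolation_suitable(): honour the skip bit unless told to ignore it. */
static bool toy_isolation_suitable(const struct toy_zone *zone, size_t block,
				   bool ignore_skip_hint)
{
	assert(block < TOY_NR_BLOCKS);
	if (ignore_skip_hint)
		return true;
	return !zone->skip[block];
}

/* Take every free page in the block; mark the block skipped if none found. */
static size_t toy_isolate_block(struct toy_zone *zone, size_t block)
{
	size_t isolated = zone->free_pages[block];

	zone->free_pages[block] = 0;
	if (!isolated)
		zone->skip[block] = true;	/* models update_pageblock_skip() */
	return isolated;
}

/*
 * Walk blocks from the end of the zone towards the start, as the free
 * scanner does. Returns how many blocks were actually scanned and stores
 * the number of pages isolated in *nr_isolated.
 */
static size_t toy_free_scan(struct toy_zone *zone, bool ignore_skip_hint,
			    size_t *nr_isolated)
{
	size_t scanned = 0;
	size_t block = TOY_NR_BLOCKS;

	*nr_isolated = 0;
	while (block-- > 0) {
		if (!toy_isolation_suitable(zone, block, ignore_skip_hint))
			continue;
		scanned++;
		*nr_isolated += toy_isolate_block(zone, block);
	}
	return scanned;
}

/* Models __reset_isolation_suitable(): forget all cached skip hints. */
static void toy_reset_isolation_suitable(struct toy_zone *zone)
{
	for (size_t block = 0; block < TOY_NR_BLOCKS; block++)
		zone->skip[block] = false;
}
```

With a zone that has free pages in only one block, the first scan visits every block, the second visits only the block that previously yielded pages, and later scans visit nothing until the hints are reset or ignored (the CMA case). That is the scanning reduction the series is after, and it also shows why something has to clear the bits eventually.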