From 11f81becca04bb7d2826a9b65bb8d27b0a1bb543 Mon Sep 17 00:00:00 2001
From: Tejun Heo
Date: Fri, 22 May 2015 17:13:15 -0400
Subject: page_writeback: revive cancel_dirty_page() in a restricted form

cancel_dirty_page() had some issues and b9ea25152e56 ("page_writeback:
clean up mess around cancel_dirty_page()") replaced it with
account_page_cleaned() which makes the caller responsible for clearing
the dirty bit; unfortunately, the planned changes for cgroup writeback
support require synchronization between dirty bit manipulation and
stat updates. While we can open-code such synchronization in each
account_page_cleaned() callsite, that's gonna be unnecessarily awkward
and verbose.

This patch revives cancel_dirty_page() but in a more restricted form.
All it does is TestClearPageDirty() followed by account_page_cleaned()
invocation if the page was dirty. This helper covers all
account_page_cleaned() usages except for __delete_from_page_cache()
which is a special case anyway and left alone. As this leaves no
module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
from it.

This patch just revives cancel_dirty_page() as a trivial wrapper to
replace equivalent usages and doesn't introduce any functional
changes.

Signed-off-by: Tejun Heo
Cc: Konstantin Khlebnikov
Signed-off-by: Jens Axboe
---
 mm/page-writeback.c | 27 ++++++++++++++++++++-------
 mm/truncate.c       |  4 +---
 2 files changed, 21 insertions(+), 10 deletions(-)

(limited to 'mm')

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5daf556..227b867 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2112,12 +2112,6 @@ EXPORT_SYMBOL(account_page_dirtied);
 
 /*
  * Helper function for deaccounting dirty page without writeback.
- *
- * Doing this should *normally* only ever be done when a page
- * is truncated, and is not actually mapped anywhere at all. However,
- * fs/buffer.c does this when it notices that somebody has cleaned
- * out all the buffers on a page without actually doing it through
- * the VM. Can you say "ext3 is horribly ugly"? Thought you could.
  */
 void account_page_cleaned(struct page *page, struct address_space *mapping)
 {
@@ -2127,7 +2121,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping)
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	}
 }
-EXPORT_SYMBOL(account_page_cleaned);
 
 /*
  * For address_spaces which do not use buffers. Just tag the page as dirty in
@@ -2266,6 +2259,26 @@ int set_page_dirty_lock(struct page *page)
 EXPORT_SYMBOL(set_page_dirty_lock);
 
 /*
+ * This cancels just the dirty bit on the kernel page itself, it does NOT
+ * actually remove dirty bits on any mmap's that may be around. It also
+ * leaves the page tagged dirty, so any sync activity will still find it on
+ * the dirty lists, and in particular, clear_page_dirty_for_io() will still
+ * look at the dirty bits in the VM.
+ *
+ * Doing this should *normally* only ever be done when a page is truncated,
+ * and is not actually mapped anywhere at all. However, fs/buffer.c does
+ * this when it notices that somebody has cleaned out all the buffers on a
+ * page without actually doing it through the VM. Can you say "ext3 is
+ * horribly ugly"? Thought you could.
+ */
+void cancel_dirty_page(struct page *page)
+{
+	if (TestClearPageDirty(page))
+		account_page_cleaned(page, page_mapping(page));
+}
+EXPORT_SYMBOL(cancel_dirty_page);
+
+/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *
diff --git a/mm/truncate.c b/mm/truncate.c
index 66af903..0c36025 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -116,9 +116,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	 * the VM has canceled the dirty bit (eg ext3 journaling).
 	 * Hence dirty accounting check is placed after invalidation.
 	 */
-	if (TestClearPageDirty(page))
-		account_page_cleaned(page, mapping);
-
+	cancel_dirty_page(page);
 	ClearPageMappedToDisk(page);
 	delete_from_page_cache(page);
 	return 0;
--
cgit v1.1

From c4843a7593a9df3ff5b1806084cefdfa81dd7c79 Mon Sep 17 00:00:00 2001
From: Greg Thelen
Date: Fri, 22 May 2015 17:13:16 -0400
Subject: memcg: add per cgroup dirty page accounting

When modifying PG_Dirty on cached file pages, update the new
MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
per memcg memory.stat cgroupfs file. The most recent past attempt at
this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

The new accounting supports future efforts to add per cgroup dirty
page throttling and writeback. It also helps an administrator break
down a container's memory usage and provides evidence to understand
memcg oom kills (the new dirty count is included in memcg oom kill
messages).

The ability to move page accounting between memcg
(memory.move_charge_at_immigrate) makes this accounting more
complicated than the global counter. The existing
mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
accounting with stat updates.

Typical update operation:
	memcg = mem_cgroup_begin_page_stat(page)
	if (TestSetPageDirty()) {
		[...]
		mem_cgroup_update_page_stat(memcg)
	}
	mem_cgroup_end_page_stat(memcg)

Summary of mem_cgroup_end_page_stat() overhead:
- Without CONFIG_MEMCG it's a no-op
- With CONFIG_MEMCG and no inter memcg task movement, it's just
  rcu_read_lock()
- With CONFIG_MEMCG and inter memcg task movement, it's
  rcu_read_lock() + spin_lock_irqsave()

A memcg parameter is added to several routines because their callers
now grab mem_cgroup_begin_page_stat() which returns the memcg later
needed by mem_cgroup_update_page_stat().
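
The update pattern described above can be modelled in plain userspace C. This is only a rough sketch under stated assumptions: every toy_* name is hypothetical, the "lock" is a simple depth counter rather than RCU plus move_lock, and nothing here is kernel API. It only shows why the dirty-bit test and the counter update sit inside one begin/end bracket, and why the memcg returned by begin is passed down to the accounting step.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel's types. */
struct toy_memcg {
	long nr_dirty;		/* models the MEM_CGROUP_STAT_DIRTY counter */
	int lock_depth;		/* models "inside a page stat transaction" */
};

struct toy_page {
	bool dirty;		/* models the PG_dirty bit */
	struct toy_memcg *memcg;
};

/* Models mem_cgroup_begin_page_stat(): pin the owning memcg, return it. */
static struct toy_memcg *toy_begin_page_stat(struct toy_page *page)
{
	struct toy_memcg *memcg = page->memcg;

	if (memcg)
		memcg->lock_depth++;
	return memcg;
}

/* Models mem_cgroup_end_page_stat(): drop what begin took. */
static void toy_end_page_stat(struct toy_memcg *memcg)
{
	if (memcg)
		memcg->lock_depth--;
}

/* Returns true only when the page goes from clean to dirty. */
static bool toy_set_page_dirty(struct toy_page *page)
{
	struct toy_memcg *memcg = toy_begin_page_stat(page);
	bool newly_dirty = !page->dirty;

	if (newly_dirty) {
		page->dirty = true;
		if (memcg)
			memcg->nr_dirty++;	/* same bracket as the bit flip */
	}
	toy_end_page_stat(memcg);
	return newly_dirty;
}

/* Returns true only when the page was dirty and is now clean. */
static bool toy_cancel_dirty_page(struct toy_page *page)
{
	struct toy_memcg *memcg = toy_begin_page_stat(page);
	bool was_dirty = page->dirty;

	if (was_dirty) {
		page->dirty = false;
		if (memcg)
			memcg->nr_dirty--;
	}
	toy_end_page_stat(memcg);
	return was_dirty;
}
```

Dirtying an already dirty toy page leaves the counter alone, which mirrors the TestSetPageDirty() guard in the pseudocode above.
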

Because mem_cgroup_begin_page_stat() may disable interrupts, some
adjustments are needed:
- move __mark_inode_dirty() from __set_page_dirty() to its caller.
  __mark_inode_dirty() locking does not want interrupts disabled.
- use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
  __delete_from_page_cache(), replace_page_cache_page(),
  invalidate_complete_page2(), and __remove_mapping().

   text    data     bss      dec    hex filename
8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                            +192 text bytes
8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                            +773 text bytes

Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for
all metrics, they're all wall clock or cycle counts. The read and write
fault benchmarks just measure fault time, they do not include I/O time.

* CONFIG_MEMCG not set:
                      baseline                          patched
  kbuild              1m25.030000(+-0.088% 3 samples)   1m25.426667(+-0.120% 3 samples)
  dd write 100 MiB    0.859211561 +-15.10%              0.874162885 +-15.03%
  dd write 200 MiB    1.670653105 +-17.87%              1.669384764 +-11.99%
  dd write 1000 MiB   8.434691190 +-14.15%              8.474733215 +-14.77%
  read fault cycles   254.0(+-0.000% 10 samples)        253.0(+-0.000% 10 samples)
  write fault cycles  2021.2(+-3.070% 10 samples)       1984.5(+-1.036% 10 samples)

* CONFIG_MEMCG=y root_memcg:
                      baseline                          patched
  kbuild              1m25.716667(+-0.105% 3 samples)   1m25.686667(+-0.153% 3 samples)
  dd write 100 MiB    0.855650830 +-14.90%              0.887557919 +-14.90%
  dd write 200 MiB    1.688322953 +-12.72%              1.667682724 +-13.33%
  dd write 1000 MiB   8.418601605 +-14.30%              8.673532299 +-15.00%
  read fault cycles   266.0(+-0.000% 10 samples)        266.0(+-0.000% 10 samples)
  write fault cycles  2051.7(+-1.349% 10 samples)       2049.6(+-1.686% 10 samples)

* CONFIG_MEMCG=y non-root_memcg:
                      baseline                          patched
  kbuild              1m26.120000(+-0.273% 3 samples)   1m25.763333(+-0.127% 3 samples)
  dd write 100 MiB    0.861723964 +-15.25%              0.818129350 +-14.82%
  dd write 200 MiB    1.669887569 +-13.30%              1.698645885 +-13.27%
  dd write 1000 MiB   8.383191730 +-14.65%              8.351742280 +-14.52%
  read fault cycles   265.7(+-0.172% 10 samples)        267.0(+-0.000% 10 samples)
  write fault cycles  2070.6(+-1.512% 10 samples)       2084.4(+-2.148% 10 samples)

As expected anon page faults are not affected by this patch.

tj: Updated to apply on top of the recent cancel_dirty_page() changes.

Signed-off-by: Sha Zhengju
Signed-off-by: Greg Thelen
Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe
---
 mm/filemap.c        | 31 ++++++++++++++++++++++---------
 mm/memcontrol.c     | 24 +++++++++++++++++++++++-
 mm/page-writeback.c | 50 ++++++++++++++++++++++++++++++++++++++++++--------
 mm/rmap.c           |  2 ++
 mm/truncate.c       | 14 ++++++++++----
 mm/vmscan.c         | 17 ++++++++++++-----
 6 files changed, 111 insertions(+), 27 deletions(-)

(limited to 'mm')

diff --git a/mm/filemap.c b/mm/filemap.c
index 6bf5e42..7b1443d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -100,6 +100,7 @@
  * ->tree_lock (page_remove_rmap->set_page_dirty)
  * bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
  * ->inode->i_lock (page_remove_rmap->set_page_dirty)
+ * ->memcg->move_lock (page_remove_rmap->mem_cgroup_begin_page_stat)
  * bdi.wb->list_lock (zap_pte_range->set_page_dirty)
  * ->inode->i_lock (zap_pte_range->set_page_dirty)
  * ->private_lock (zap_pte_range->__set_page_dirty_buffers)
@@ -174,9 +175,11 @@ static void page_cache_tree_delete(struct address_space *mapping,
 /*
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
- * is safe. The caller must hold the mapping's tree_lock.
+ * is safe. The caller must hold the mapping's tree_lock and
+ * mem_cgroup_begin_page_stat().
*/ -void __delete_from_page_cache(struct page *page, void *shadow) +void __delete_from_page_cache(struct page *page, void *shadow, + struct mem_cgroup *memcg) { struct address_space *mapping = page->mapping; @@ -210,7 +213,7 @@ void __delete_from_page_cache(struct page *page, void *shadow) * anyway will be cleared before returning page into buddy allocator. */ if (WARN_ON_ONCE(PageDirty(page))) - account_page_cleaned(page, mapping); + account_page_cleaned(page, mapping, memcg); } /** @@ -224,14 +227,20 @@ void __delete_from_page_cache(struct page *page, void *shadow) void delete_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; + struct mem_cgroup *memcg; + unsigned long flags; + void (*freepage)(struct page *); BUG_ON(!PageLocked(page)); freepage = mapping->a_ops->freepage; - spin_lock_irq(&mapping->tree_lock); - __delete_from_page_cache(page, NULL); - spin_unlock_irq(&mapping->tree_lock); + + memcg = mem_cgroup_begin_page_stat(page); + spin_lock_irqsave(&mapping->tree_lock, flags); + __delete_from_page_cache(page, NULL, memcg); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); if (freepage) freepage(page); @@ -470,6 +479,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) if (!error) { struct address_space *mapping = old->mapping; void (*freepage)(struct page *); + struct mem_cgroup *memcg; + unsigned long flags; pgoff_t offset = old->index; freepage = mapping->a_ops->freepage; @@ -478,15 +489,17 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) new->mapping = mapping; new->index = offset; - spin_lock_irq(&mapping->tree_lock); - __delete_from_page_cache(old, NULL); + memcg = mem_cgroup_begin_page_stat(old); + spin_lock_irqsave(&mapping->tree_lock, flags); + __delete_from_page_cache(old, NULL, memcg); error = radix_tree_insert(&mapping->page_tree, offset, new); BUG_ON(error); mapping->nrpages++; __inc_zone_page_state(new, 
NR_FILE_PAGES); if (PageSwapBacked(new)) __inc_zone_page_state(new, NR_SHMEM); - spin_unlock_irq(&mapping->tree_lock); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); mem_cgroup_migrate(old, new, true); radix_tree_preload_end(); if (freepage) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 14c2f20..c23c1a3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -90,6 +90,7 @@ static const char * const mem_cgroup_stat_names[] = { "rss", "rss_huge", "mapped_file", + "dirty", "writeback", "swap", }; @@ -2011,6 +2012,7 @@ again: return memcg; } +EXPORT_SYMBOL(mem_cgroup_begin_page_stat); /** * mem_cgroup_end_page_stat - finish a page state statistics transaction @@ -2029,6 +2031,7 @@ void mem_cgroup_end_page_stat(struct mem_cgroup *memcg) rcu_read_unlock(); } +EXPORT_SYMBOL(mem_cgroup_end_page_stat); /** * mem_cgroup_update_page_stat - update page state statistics @@ -4746,6 +4749,7 @@ static int mem_cgroup_move_account(struct page *page, { unsigned long flags; int ret; + bool anon; VM_BUG_ON(from == to); VM_BUG_ON_PAGE(PageLRU(page), page); @@ -4771,15 +4775,33 @@ static int mem_cgroup_move_account(struct page *page, if (page->mem_cgroup != from) goto out_unlock; + anon = PageAnon(page); + spin_lock_irqsave(&from->move_lock, flags); - if (!PageAnon(page) && page_mapped(page)) { + if (!anon && page_mapped(page)) { __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], nr_pages); __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], nr_pages); } + /* + * move_lock grabbed above and caller set from->moving_account, so + * mem_cgroup_update_page_stat() will serialize updates to PageDirty. + * So mapping should be stable for dirty pages. 
+ */ + if (!anon && PageDirty(page)) { + struct address_space *mapping = page_mapping(page); + + if (mapping_cap_account_dirty(mapping)) { + __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_DIRTY], + nr_pages); + __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_DIRTY], + nr_pages); + } + } + if (PageWriteback(page)) { __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK], nr_pages); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 227b867..bdeecad 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2090,15 +2090,20 @@ int __set_page_dirty_no_writeback(struct page *page) /* * Helper function for set_page_dirty family. + * + * Caller must hold mem_cgroup_begin_page_stat(). + * * NOTE: This relies on being atomic wrt interrupts. */ -void account_page_dirtied(struct page *page, struct address_space *mapping) +void account_page_dirtied(struct page *page, struct address_space *mapping, + struct mem_cgroup *memcg) { trace_writeback_dirty_page(page, mapping); if (mapping_cap_account_dirty(mapping)) { struct backing_dev_info *bdi = inode_to_bdi(mapping->host); + mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_zone_page_state(page, NR_DIRTIED); __inc_bdi_stat(bdi, BDI_RECLAIMABLE); @@ -2112,10 +2117,14 @@ EXPORT_SYMBOL(account_page_dirtied); /* * Helper function for deaccounting dirty page without writeback. + * + * Caller must hold mem_cgroup_begin_page_stat(). 
*/ -void account_page_cleaned(struct page *page, struct address_space *mapping) +void account_page_cleaned(struct page *page, struct address_space *mapping, + struct mem_cgroup *memcg) { if (mapping_cap_account_dirty(mapping)) { + mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(inode_to_bdi(mapping->host), BDI_RECLAIMABLE); task_io_account_cancelled_write(PAGE_CACHE_SIZE); @@ -2136,26 +2145,34 @@ void account_page_cleaned(struct page *page, struct address_space *mapping) */ int __set_page_dirty_nobuffers(struct page *page) { + struct mem_cgroup *memcg; + + memcg = mem_cgroup_begin_page_stat(page); if (!TestSetPageDirty(page)) { struct address_space *mapping = page_mapping(page); unsigned long flags; - if (!mapping) + if (!mapping) { + mem_cgroup_end_page_stat(memcg); return 1; + } spin_lock_irqsave(&mapping->tree_lock, flags); BUG_ON(page_mapping(page) != mapping); WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); - account_page_dirtied(page, mapping); + account_page_dirtied(page, mapping, memcg); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); + if (mapping->host) { /* !PageAnon && !swapper_space */ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); } return 1; } + mem_cgroup_end_page_stat(memcg); return 0; } EXPORT_SYMBOL(__set_page_dirty_nobuffers); @@ -2273,8 +2290,20 @@ EXPORT_SYMBOL(set_page_dirty_lock); */ void cancel_dirty_page(struct page *page) { - if (TestClearPageDirty(page)) - account_page_cleaned(page, page_mapping(page)); + struct address_space *mapping = page_mapping(page); + + if (mapping_cap_account_dirty(mapping)) { + struct mem_cgroup *memcg; + + memcg = mem_cgroup_begin_page_stat(page); + + if (TestClearPageDirty(page)) + account_page_cleaned(page, mapping, memcg); + + mem_cgroup_end_page_stat(memcg); + } else { + ClearPageDirty(page); + } } 
EXPORT_SYMBOL(cancel_dirty_page); @@ -2295,6 +2324,8 @@ EXPORT_SYMBOL(cancel_dirty_page); int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); + struct mem_cgroup *memcg; + int ret = 0; BUG_ON(!PageLocked(page)); @@ -2334,13 +2365,16 @@ int clear_page_dirty_for_io(struct page *page) * always locked coming in here, so we get the desired * exclusion. */ + memcg = mem_cgroup_begin_page_stat(page); if (TestClearPageDirty(page)) { + mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(inode_to_bdi(mapping->host), BDI_RECLAIMABLE); - return 1; + ret = 1; } - return 0; + mem_cgroup_end_page_stat(memcg); + return ret; } return TestClearPageDirty(page); } diff --git a/mm/rmap.c b/mm/rmap.c index 24dd3f9..8fc556c 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -30,6 +30,8 @@ * swap_lock (in swap_duplicate, swap_info_get) * mmlist_lock (in mmput, drain_mmlist and others) * mapping->private_lock (in __set_page_dirty_buffers) + * mem_cgroup_{begin,end}_page_stat (memcg->move_lock) + * mapping->tree_lock (widely used) * inode->i_lock (in set_page_dirty's __mark_inode_dirty) * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) * sb_lock (within inode_lock in fs/fs-writeback.c) diff --git a/mm/truncate.c b/mm/truncate.c index 0c36025..76e35ad 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -510,19 +510,24 @@ EXPORT_SYMBOL(invalidate_mapping_pages); static int invalidate_complete_page2(struct address_space *mapping, struct page *page) { + struct mem_cgroup *memcg; + unsigned long flags; + if (page->mapping != mapping) return 0; if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL)) return 0; - spin_lock_irq(&mapping->tree_lock); + memcg = mem_cgroup_begin_page_stat(page); + spin_lock_irqsave(&mapping->tree_lock, flags); if (PageDirty(page)) goto failed; BUG_ON(page_has_private(page)); - __delete_from_page_cache(page, NULL); - 
spin_unlock_irq(&mapping->tree_lock); + __delete_from_page_cache(page, NULL, memcg); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); if (mapping->a_ops->freepage) mapping->a_ops->freepage(page); @@ -530,7 +535,8 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) page_cache_release(page); /* pagecache ref */ return 1; failed: - spin_unlock_irq(&mapping->tree_lock); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); return 0; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 5e8eadd..7582f9f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -579,10 +579,14 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, static int __remove_mapping(struct address_space *mapping, struct page *page, bool reclaimed) { + unsigned long flags; + struct mem_cgroup *memcg; + BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); - spin_lock_irq(&mapping->tree_lock); + memcg = mem_cgroup_begin_page_stat(page); + spin_lock_irqsave(&mapping->tree_lock, flags); /* * The non racy check for a busy page. 
* @@ -620,7 +624,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, swp_entry_t swap = { .val = page_private(page) }; mem_cgroup_swapout(page, swap); __delete_from_swap_cache(page); - spin_unlock_irq(&mapping->tree_lock); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); swapcache_free(swap); } else { void (*freepage)(struct page *); @@ -640,8 +645,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, if (reclaimed && page_is_file_cache(page) && !mapping_exiting(mapping)) shadow = workingset_eviction(mapping, page); - __delete_from_page_cache(page, shadow); - spin_unlock_irq(&mapping->tree_lock); + __delete_from_page_cache(page, shadow, memcg); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); if (freepage != NULL) freepage(page); @@ -650,7 +656,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, return 1; cannot_free: - spin_unlock_irq(&mapping->tree_lock); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); return 0; } -- cgit v1.1 From 56161634e4824380a67243a4cf3fa52eb1e5d836 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:20 -0400 Subject: memcg: add mem_cgroup_root_css Add global mem_cgroup_root_css which points to the root memcg css. This will be used by cgroup writeback support. If memcg is disabled, it's defined as ERR_PTR(-EINVAL). 
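
The ERR_PTR(-EINVAL) convention mentioned above can be illustrated with a small userspace model. This is a sketch only: the toy_* helpers are hypothetical re-implementations written for this note, not the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() macros, and the integer-to-pointer casts assume a mainstream flat-address platform.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095	/* assumed size of the reserved error range */

/* Encode a negative errno value in a pointer-sized sentinel. */
static inline void *toy_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from a sentinel pointer. */
static inline long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True when the pointer lies in the top TOY_MAX_ERRNO addresses. */
static inline bool toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_css {
	int id;
};

static struct toy_css toy_root = { 1 };

/*
 * Models the two configurations: a real root css when the controller
 * is built in, an error sentinel when it is compiled out.
 */
static struct toy_css *toy_root_css(bool memcg_enabled)
{
	return memcg_enabled ? &toy_root : toy_err_ptr(-EINVAL);
}
```

A caller in this model checks toy_is_err() before dereferencing, so the disabled configuration fails loudly instead of handing out a NULL that might be mistaken for "no css".
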

Signed-off-by: Tejun Heo
Cc: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Jens Axboe
---
 mm/memcontrol.c | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'mm')

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c23c1a3..b22a92b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -77,6 +77,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
 #define MEM_CGROUP_RECLAIM_RETRIES 5
 
 static struct mem_cgroup *root_mem_cgroup __read_mostly;
+struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
 
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
@@ -4441,6 +4442,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	/* root ? */
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
+		mem_cgroup_root_css = &memcg->css;
 		page_counter_init(&memcg->memory, NULL);
 		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
--
cgit v1.1

From ad7fa852d3d2816d68a138ebc5bc8967aeb7fd86 Mon Sep 17 00:00:00 2001
From: Tejun Heo
Date: Wed, 27 May 2015 20:00:02 -0400
Subject: memcg: implement mem_cgroup_css_from_page()

Implement mem_cgroup_css_from_page() which returns the
cgroup_subsys_state of the memcg associated with a given page on the
default hierarchy. This will be used by cgroup writeback support.

This function assumes that page->mem_cgroup association doesn't change
until the page is released, which is true on the default hierarchy as
long as replace_page_cache_page() is not used. As the only user of
replace_page_cache_page() is FUSE which won't support cgroup writeback
for the time being, this works for now, and replace_page_cache_page()
will soon be updated so that the invariant actually holds.

Note that the RCU protected page->mem_cgroup access is consistent with
other usages across memcg but ultimately incorrect. These unlocked
accesses are missing required barriers. page->mem_cgroup should be
made an RCU pointer and updated and accessed using RCU operations.
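
As a rough userspace analogue of the barrier concern in the last paragraph, the sketch below publishes and reads an owner pointer with C11 release/acquire atomics. It is an illustration under assumptions, not the kernel's RCU API: the toy_* names are hypothetical, release-store stands in loosely for rcu_assign_pointer() and acquire-load for rcu_dereference(), and grace periods are not modelled at all.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the memcg and page structures. */
struct toy_memcg {
	int id;
};

struct toy_page {
	_Atomic(struct toy_memcg *) memcg;	/* models page->mem_cgroup */
};

/*
 * Publish a new owner. The release ordering makes earlier writes to
 * *memcg visible to any reader that observes the new pointer.
 */
static void toy_page_set_memcg(struct toy_page *page, struct toy_memcg *memcg)
{
	atomic_store_explicit(&page->memcg, memcg, memory_order_release);
}

/* Read the current owner; pairs with the release store above. */
static struct toy_memcg *toy_page_memcg(struct toy_page *page)
{
	return atomic_load_explicit(&page->memcg, memory_order_acquire);
}

/* Fall back to a caller-supplied root id when the page has no owner. */
static int toy_page_memcg_id(struct toy_page *page, int root_id)
{
	struct toy_memcg *memcg = toy_page_memcg(page);

	return memcg ? memcg->id : root_id;
}
```

A plain load and store of the same field would compile to similar code on many machines, but would carry no ordering guarantee, which is the gap the paragraph above points at.
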
v4: Instead of triggering WARN, return the root css on the traditional hierarchies. This makes the function a lot easier to deal with especially as there's no light way to synchronize against hierarchy rebinding. v3: s/mem_cgroup_migrate()/mem_cgroup_css_from_page()/ v2: Trigger WARN if the function is used on the traditional hierarchies and add comment about the assumed invariant. Signed-off-by: Tejun Heo Cc: Johannes Weiner Cc: Michal Hocko Signed-off-by: Jens Axboe --- mm/memcontrol.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b22a92b..5c270a0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -598,6 +598,39 @@ struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg) return &memcg->css; } +/** + * mem_cgroup_css_from_page - css of the memcg associated with a page + * @page: page of interest + * + * If memcg is bound to the default hierarchy, css of the memcg associated + * with @page is returned. The returned css remains associated with @page + * until it is released. + * + * If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup + * is returned. + * + * XXX: The above description of behavior on the default hierarchy isn't + * strictly true yet as replace_page_cache_page() can modify the + * association before @page is released even on the default hierarchy; + * however, the current and planned usages don't mix the the two functions + * and replace_page_cache_page() will soon be updated to make the invariant + * actually true. 
+ */
+struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+
+	memcg = page->mem_cgroup;
+
+	if (!memcg || !cgroup_on_dfl(memcg->css.cgroup))
+		memcg = root_mem_cgroup;
+
+	rcu_read_unlock();
+	return &memcg->css;
+}
+
 static struct mem_cgroup_per_zone *
 mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
 {
--
cgit v1.1

From 4452226ea276e74fc3e252c88d9bb7e8f8e44bf0 Mon Sep 17 00:00:00 2001
From: Tejun Heo
Date: Fri, 22 May 2015 17:13:26 -0400
Subject: writeback: move backing_dev_info->state into bdi_writeback

Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->state into wb.

* enum bdi_state is renamed to wb_state and the prefix of all enums is
  changed from BDI_ to WB_.

* Explicit zeroing of bdi->state is removed without adding zeroing of
  wb->state as the whole data structure is zeroed on init anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->state are mechanically replaced with bdi->wb.state
  introducing no behavior changes.
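
The mechanical bdi->state to bdi->wb.state move can be pictured with a small userspace model. Everything below is a hedged sketch: the toy_* names and the enum ordering are assumptions made for illustration, and the bit helpers are plain non-atomic code, unlike the kernel's test_and_set_bit()/test_and_clear_bit().

```c
#include <stdbool.h>

/* Hypothetical flag numbering; the kernel's enum may be ordered differently. */
enum toy_wb_state {
	TOY_WB_async_congested,
	TOY_WB_sync_congested,
	TOY_WB_registered,
};

struct toy_wb {
	unsigned long state;	/* the flag word now lives in the wb */
};

struct toy_bdi {
	struct toy_wb wb;	/* single embedded wb, as in the patch */
};

static int toy_nr_congested[2];

/* Non-atomic model: set the bit and report whether it was already set. */
static bool toy_test_and_set_bit(int bit, unsigned long *word)
{
	unsigned long mask = 1UL << bit;
	bool was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}

/* Non-atomic model: clear the bit and report whether it was set. */
static bool toy_test_and_clear_bit(int bit, unsigned long *word)
{
	unsigned long mask = 1UL << bit;
	bool was_set = (*word & mask) != 0;

	*word &= ~mask;
	return was_set;
}

/* Count a bdi as congested only on the 0 -> 1 transition of its bit. */
static void toy_set_congested(struct toy_bdi *bdi, int sync)
{
	int bit = sync ? TOY_WB_sync_congested : TOY_WB_async_congested;

	if (!toy_test_and_set_bit(bit, &bdi->wb.state))
		toy_nr_congested[sync]++;
}

/* Uncount it only on the 1 -> 0 transition. */
static void toy_clear_congested(struct toy_bdi *bdi, int sync)
{
	int bit = sync ? TOY_WB_sync_congested : TOY_WB_async_congested;

	if (toy_test_and_clear_bit(bit, &bdi->wb.state))
		toy_nr_congested[sync]--;
}
```

Callers still pass the bdi; only the location of the flag word changes, which is why the commit can claim no behavior change.
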
Signed-off-by: Tejun Heo Reviewed-by: Jan Kara Cc: Jens Axboe Cc: Wu Fengguang Cc: drbd-dev@lists.linbit.com Cc: Neil Brown Cc: Alasdair Kergon Cc: Mike Snitzer Signed-off-by: Jens Axboe --- mm/backing-dev.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 6dc4580..b23cf0e 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -96,7 +96,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) nr_io, nr_more_io, nr_dirty_time, - !list_empty(&bdi->bdi_list), bdi->state); + !list_empty(&bdi->bdi_list), bdi->wb.state); #undef K return 0; @@ -280,7 +280,7 @@ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi) timeout = msecs_to_jiffies(dirty_writeback_interval * 10); spin_lock_bh(&bdi->wb_lock); - if (test_bit(BDI_registered, &bdi->state)) + if (test_bit(WB_registered, &bdi->wb.state)) queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout); spin_unlock_bh(&bdi->wb_lock); } @@ -315,7 +315,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent, bdi->dev = dev; bdi_debug_register(bdi, dev_name(dev)); - set_bit(BDI_registered, &bdi->state); + set_bit(WB_registered, &bdi->wb.state); spin_lock_bh(&bdi_lock); list_add_tail_rcu(&bdi->bdi_list, &bdi_list); @@ -339,7 +339,7 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi) { /* Make sure nobody queues further work */ spin_lock_bh(&bdi->wb_lock); - if (!test_and_clear_bit(BDI_registered, &bdi->state)) { + if (!test_and_clear_bit(WB_registered, &bdi->wb.state)) { spin_unlock_bh(&bdi->wb_lock); return; } @@ -492,11 +492,11 @@ static atomic_t nr_bdi_congested[2]; void clear_bdi_congested(struct backing_dev_info *bdi, int sync) { - enum bdi_state bit; + enum wb_state bit; wait_queue_head_t *wqh = &congestion_wqh[sync]; - bit = sync ? BDI_sync_congested : BDI_async_congested; - if (test_and_clear_bit(bit, &bdi->state)) + bit = sync ? 
WB_sync_congested : WB_async_congested; + if (test_and_clear_bit(bit, &bdi->wb.state)) atomic_dec(&nr_bdi_congested[sync]); smp_mb__after_atomic(); if (waitqueue_active(wqh)) @@ -506,10 +506,10 @@ EXPORT_SYMBOL(clear_bdi_congested); void set_bdi_congested(struct backing_dev_info *bdi, int sync) { - enum bdi_state bit; + enum wb_state bit; - bit = sync ? BDI_sync_congested : BDI_async_congested; - if (!test_and_set_bit(bit, &bdi->state)) + bit = sync ? WB_sync_congested : WB_async_congested; + if (!test_and_set_bit(bit, &bdi->wb.state)) atomic_inc(&nr_bdi_congested[sync]); } EXPORT_SYMBOL(set_bdi_congested); -- cgit v1.1 From 93f78d882865cb90020d0f80a9523c99cf46924c Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:27 -0400 Subject: writeback: move backing_dev_info->bdi_stat[] into bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->bdi_stat[] into wb. * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all enums is changed from BDI_ to WB_. * BDI_STAT_BATCH() -> WB_STAT_BATCH() * [__]{add|inc|dec|sum}_wb_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...) * bdi_stat[_error]() -> wb_stat[_error]() * bdi_writeout_inc() -> wb_writeout_inc() * stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and frees stat. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[] introducing no behavior changes. 
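
Moving stat init into bdi_wb_init() means that function now has to unwind on a partial failure. A minimal userspace sketch of that shape follows, assuming hypothetical toy_* names and a test hook (toy_fail_at) that exists only for this illustration. The unwind uses the `while (i--)` form from the old bdi_init() error path, which visits exactly the counters already initialized.

```c
#include <errno.h>
#include <stdlib.h>

#define TOY_NR_WB_STAT_ITEMS 4	/* assumed count, for illustration only */

struct toy_counter {
	long *slot;		/* stands in for per-cpu storage */
};

struct toy_wb {
	struct toy_counter stat[TOY_NR_WB_STAT_ITEMS];
};

/* Test hook: index at which counter init pretends to fail, -1 for never. */
static int toy_fail_at = -1;

static int toy_counter_init(struct toy_counter *counter, int index)
{
	if (index == toy_fail_at)
		return -ENOMEM;
	counter->slot = calloc(1, sizeof(*counter->slot));
	return counter->slot ? 0 : -ENOMEM;
}

static void toy_counter_destroy(struct toy_counter *counter)
{
	free(counter->slot);
	counter->slot = NULL;
}

/* Initialize every counter, or undo the ones already set up and fail. */
static int toy_wb_init(struct toy_wb *wb)
{
	int i, err;

	for (i = 0; i < TOY_NR_WB_STAT_ITEMS; i++) {
		err = toy_counter_init(&wb->stat[i], i);
		if (err) {
			/* i is the failed index; unwind i-1 down to 0 */
			while (i--)
				toy_counter_destroy(&wb->stat[i]);
			return err;
		}
	}
	return 0;
}

/* Counterpart of init: release every counter. */
static void toy_wb_exit(struct toy_wb *wb)
{
	int i;

	for (i = 0; i < TOY_NR_WB_STAT_ITEMS; i++)
		toy_counter_destroy(&wb->stat[i]);
}
```

With this split the caller only needs one toy_wb_exit() call on its own later failure paths, which matches how bdi_init() uses bdi_wb_exit() in the diff.
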

Signed-off-by: Tejun Heo
Reviewed-by: Jan Kara
Cc: Jens Axboe
Cc: Wu Fengguang
Cc: Miklos Szeredi
Cc: Trond Myklebust
Signed-off-by: Jens Axboe
---
 mm/backing-dev.c    | 60 +++++++++++++++++++++++++++++++++--------------------
 mm/page-writeback.c | 55 ++++++++++++++++++++++++------------------------
 2 files changed, 64 insertions(+), 51 deletions(-)

(limited to 'mm')

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b23cf0e..7b1d191 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -84,13 +84,13 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "b_dirty_time: %10lu\n"
 		   "bdi_list: %10u\n"
 		   "state: %10lx\n",
-		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
-		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
+		   (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
+		   (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
 		   K(bdi_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
-		   (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
-		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
+		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
 		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty,
 		   nr_io,
@@ -376,8 +376,10 @@ void bdi_unregister(struct backing_dev_info *bdi)
 }
 EXPORT_SYMBOL(bdi_unregister);
 
-static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 {
+	int i, err;
+
 	memset(wb, 0, sizeof(*wb));
 
 	wb->bdi = bdi;
@@ -388,6 +390,27 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_dirty_time);
 	spin_lock_init(&wb->list_lock);
 	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
+
+	for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
+		err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL);
+		if (err) {
+			while (i--)
+				percpu_counter_destroy(&wb->stat[i]);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+static void bdi_wb_exit(struct bdi_writeback *wb)
+{
+	int i;
+
+	WARN_ON(delayed_work_pending(&wb->dwork));
+
+	for (i = 0; i < NR_WB_STAT_ITEMS; i++)
+		percpu_counter_destroy(&wb->stat[i]);
 }
 
 /*
@@ -397,7 +420,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 
 int bdi_init(struct backing_dev_info *bdi)
 {
-	int i, err;
+	int err;
 
 	bdi->dev = NULL;
 
@@ -408,13 +431,9 @@ int bdi_init(struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&bdi->bdi_list);
 	INIT_LIST_HEAD(&bdi->work_list);
 
-	bdi_wb_init(&bdi->wb, bdi);
-
-	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
-		err = percpu_counter_init(&bdi->bdi_stat[i], 0, GFP_KERNEL);
-		if (err)
-			goto err;
-	}
+	err = bdi_wb_init(&bdi->wb, bdi);
+	if (err)
+		return err;
 
 	bdi->dirty_exceeded = 0;
 
@@ -427,25 +446,20 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->avg_write_bandwidth = INIT_BW;
 
 	err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL);
-
 	if (err) {
-err:
-		while (i--)
-			percpu_counter_destroy(&bdi->bdi_stat[i]);
+		bdi_wb_exit(&bdi->wb);
+		return err;
 	}
 
-	return err;
+	return 0;
 }
 EXPORT_SYMBOL(bdi_init);
 
 void bdi_destroy(struct backing_dev_info *bdi)
 {
-	int i;
-
 	bdi_wb_shutdown(bdi);
 
 	WARN_ON(!list_empty(&bdi->work_list));
-	WARN_ON(delayed_work_pending(&bdi->wb.dwork));
 
 	if (bdi->dev) {
 		bdi_debug_unregister(bdi);
@@ -453,8 +467,8 @@ void bdi_destroy(struct backing_dev_info *bdi)
 		bdi->dev = NULL;
 	}
 
-	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
-		percpu_counter_destroy(&bdi->bdi_stat[i]);
+	bdi_wb_exit(&bdi->wb);
+
 	fprop_local_destroy_percpu(&bdi->completions);
 }
 EXPORT_SYMBOL(bdi_destroy);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bdeecad..dc673a0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -396,11 +396,11 @@ static unsigned long wp_next_time(unsigned long cur_time)
  * Increment the BDI's writeout completion count and the global writeout
  * completion count. Called from test_clear_page_writeback().
*/ -static inline void __bdi_writeout_inc(struct backing_dev_info *bdi) +static inline void __wb_writeout_inc(struct bdi_writeback *wb) { - __inc_bdi_stat(bdi, BDI_WRITTEN); - __fprop_inc_percpu_max(&writeout_completions, &bdi->completions, - bdi->max_prop_frac); + __inc_wb_stat(wb, WB_WRITTEN); + __fprop_inc_percpu_max(&writeout_completions, &wb->bdi->completions, + wb->bdi->max_prop_frac); /* First event after period switching was turned off? */ if (!unlikely(writeout_period_time)) { /* @@ -414,15 +414,15 @@ static inline void __bdi_writeout_inc(struct backing_dev_info *bdi) } } -void bdi_writeout_inc(struct backing_dev_info *bdi) +void wb_writeout_inc(struct bdi_writeback *wb) { unsigned long flags; local_irq_save(flags); - __bdi_writeout_inc(bdi); + __wb_writeout_inc(wb); local_irq_restore(flags); } -EXPORT_SYMBOL_GPL(bdi_writeout_inc); +EXPORT_SYMBOL_GPL(wb_writeout_inc); /* * Obtain an accurate fraction of the BDI's portion. @@ -1130,8 +1130,8 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi, if (elapsed < BANDWIDTH_INTERVAL) return; - dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]); - written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]); + dirtied = percpu_counter_read(&bdi->wb.stat[WB_DIRTIED]); + written = percpu_counter_read(&bdi->wb.stat[WB_WRITTEN]); /* * Skip quiet periods when disk bandwidth is under-utilized. @@ -1288,7 +1288,8 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi, unsigned long *bdi_thresh, unsigned long *bdi_bg_thresh) { - unsigned long bdi_reclaimable; + struct bdi_writeback *wb = &bdi->wb; + unsigned long wb_reclaimable; /* * bdi_thresh is not treated as some limiting factor as @@ -1320,14 +1321,12 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi, * actually dirty; with m+n sitting in the percpu * deltas. 
*/ - if (*bdi_thresh < 2 * bdi_stat_error(bdi)) { - bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE); - *bdi_dirty = bdi_reclaimable + - bdi_stat_sum(bdi, BDI_WRITEBACK); + if (*bdi_thresh < 2 * wb_stat_error(wb)) { + wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); + *bdi_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK); } else { - bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); - *bdi_dirty = bdi_reclaimable + - bdi_stat(bdi, BDI_WRITEBACK); + wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE); + *bdi_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK); } } @@ -1514,9 +1513,9 @@ pause: * In theory 1 page is enough to keep the comsumer-producer * pipe going: the flusher cleans 1 page => the task dirties 1 * more page. However bdi_dirty has accounting errors. So use - * the larger and more IO friendly bdi_stat_error. + * the larger and more IO friendly wb_stat_error. */ - if (bdi_dirty <= bdi_stat_error(bdi)) + if (bdi_dirty <= wb_stat_error(&bdi->wb)) break; if (fatal_signal_pending(current)) @@ -2106,8 +2105,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping, mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_zone_page_state(page, NR_DIRTIED); - __inc_bdi_stat(bdi, BDI_RECLAIMABLE); - __inc_bdi_stat(bdi, BDI_DIRTIED); + __inc_wb_stat(&bdi->wb, WB_RECLAIMABLE); + __inc_wb_stat(&bdi->wb, WB_DIRTIED); task_io_account_write(PAGE_CACHE_SIZE); current->nr_dirtied++; this_cpu_inc(bdp_ratelimits); @@ -2126,7 +2125,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping, if (mapping_cap_account_dirty(mapping)) { mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); - dec_bdi_stat(inode_to_bdi(mapping->host), BDI_RECLAIMABLE); + dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_RECLAIMABLE); task_io_account_cancelled_write(PAGE_CACHE_SIZE); } } @@ -2190,7 +2189,7 @@ void account_page_redirty(struct page *page) if 
(mapping && mapping_cap_account_dirty(mapping)) { current->nr_dirtied--; dec_zone_page_state(page, NR_DIRTIED); - dec_bdi_stat(inode_to_bdi(mapping->host), BDI_DIRTIED); + dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_DIRTIED); } } EXPORT_SYMBOL(account_page_redirty); @@ -2369,8 +2368,8 @@ int clear_page_dirty_for_io(struct page *page) if (TestClearPageDirty(page)) { mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); - dec_bdi_stat(inode_to_bdi(mapping->host), - BDI_RECLAIMABLE); + dec_wb_stat(&inode_to_bdi(mapping->host)->wb, + WB_RECLAIMABLE); ret = 1; } mem_cgroup_end_page_stat(memcg); @@ -2398,8 +2397,8 @@ int test_clear_page_writeback(struct page *page) page_index(page), PAGECACHE_TAG_WRITEBACK); if (bdi_cap_account_writeback(bdi)) { - __dec_bdi_stat(bdi, BDI_WRITEBACK); - __bdi_writeout_inc(bdi); + __dec_wb_stat(&bdi->wb, WB_WRITEBACK); + __wb_writeout_inc(&bdi->wb); } } spin_unlock_irqrestore(&mapping->tree_lock, flags); @@ -2433,7 +2432,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write) page_index(page), PAGECACHE_TAG_WRITEBACK); if (bdi_cap_account_writeback(bdi)) - __inc_bdi_stat(bdi, BDI_WRITEBACK); + __inc_wb_stat(&bdi->wb, WB_WRITEBACK); } if (!PageDirty(page)) radix_tree_tag_clear(&mapping->page_tree, -- cgit v1.1 From a88a341a73be4ef035ca26170c849f002797da27 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:28 -0400 Subject: writeback: move bandwidth related fields from backing_dev_info into bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bandwidth related fields from backing_dev_info into bdi_writeback. 
* The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
  write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
  balanced_dirty_ratelimit, completions and dirty_exceeded.

* writeback_chunk_size() and over_bground_thresh() now take @wb
  instead of @bdi.

* bdi_writeout_fraction(bdi, ...)      -> wb_writeout_fraction(wb, ...)
  bdi_dirty_limit(bdi, ...)            -> wb_dirty_limit(wb, ...)
  bdi_position_ratio(bdi, ...)         -> wb_position_ratio(wb, ...)
  bdi_update_write_bandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
  [__]bdi_update_bandwidth(bdi, ...)   -> [__]wb_update_bandwidth(wb, ...)
  bdi_{max|min}_pause(bdi, ...)        -> wb_{max|min}_pause(wb, ...)
  bdi_dirty_limits(bdi, ...)           -> wb_dirty_limits(wb, ...)

* Init/exits of the relocated fields are moved to bdi_wb_init/exit()
  respectively.  Note that explicit zeroing is dropped in the process
  as wb's are cleared in entirety anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
  introducing no behavior changes.

v2: Typo in description fixed as suggested by Jan.
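For readers who want to see the init/exit pairing described above in isolation, here is a minimal userspace sketch. It is not the kernel code: mock_counter, mock_wb, mock_wb_init(), mock_wb_exit(), MOCK_PAGE_SHIFT and the failure-injection variables are all hypothetical stand-ins, and a 4 KiB page size is assumed. The sketch unwinds with "while (i--)"; the "while (--i)" form visible in the diff would skip index 0 and would not stop at 0 when the very first counter fails, so the loop below should be read as the presumably intended behaviour and not as a copy of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_STAT_ITEMS   4       /* stand-in for NR_WB_STAT_ITEMS */
#define MOCK_PAGE_SHIFT 12      /* assumption: 4 KiB pages */
#define INIT_BW (100UL << (20 - MOCK_PAGE_SHIFT))  /* 100 MB/s in pages */

/* Hypothetical stand-in for a percpu counter; it only tracks liveness. */
struct mock_counter {
	bool live;
};

/* Per-wb state: bandwidth fields, completions and stats live together. */
struct mock_wb {
	unsigned long write_bandwidth;
	unsigned long avg_write_bandwidth;
	unsigned long dirty_ratelimit;
	unsigned long balanced_dirty_ratelimit;
	struct mock_counter completions;
	struct mock_counter stat[NR_STAT_ITEMS];
};

static int live_counters;   /* counters currently initialised */
static int init_calls;      /* init calls made so far */
static int fail_at = -1;    /* fail this init call (0-based), -1 = never */

static int mock_counter_init(struct mock_counter *c)
{
	if (init_calls++ == fail_at)
		return -1;      /* stand-in for -ENOMEM */
	c->live = true;
	live_counters++;
	return 0;
}

static void mock_counter_destroy(struct mock_counter *c)
{
	assert(c->live);        /* destroying an uninitialised counter is a bug */
	c->live = false;
	live_counters--;
}

int mock_wb_init(struct mock_wb *wb)
{
	int i, err;

	/* Clearing the whole struct is why no explicit zeroing is needed. */
	memset(wb, 0, sizeof(*wb));

	wb->balanced_dirty_ratelimit = INIT_BW;
	wb->dirty_ratelimit = INIT_BW;
	wb->write_bandwidth = INIT_BW;
	wb->avg_write_bandwidth = INIT_BW;

	err = mock_counter_init(&wb->completions);
	if (err)
		return err;

	for (i = 0; i < NR_STAT_ITEMS; i++) {
		err = mock_counter_init(&wb->stat[i]);
		if (err) {
			/* Unwind exactly stat[i-1] .. stat[0]. */
			while (i--)
				mock_counter_destroy(&wb->stat[i]);
			mock_counter_destroy(&wb->completions);
			return err;
		}
	}
	return 0;
}

void mock_wb_exit(struct mock_wb *wb)
{
	int i;

	for (i = 0; i < NR_STAT_ITEMS; i++)
		mock_counter_destroy(&wb->stat[i]);
	mock_counter_destroy(&wb->completions);
}

/* Number of counters still live after init fails on call number nth. */
int live_after_failure(int nth)
{
	struct mock_wb wb;
	int err;

	live_counters = 0;
	init_calls = 0;
	fail_at = nth;
	err = mock_wb_init(&wb);
	assert(err != 0);
	return live_counters;
}
```

Calling live_after_failure() with any call index from 0 to NR_STAT_ITEMS should report zero live counters, which is the property the init error path is meant to guarantee; a successful mock_wb_init() followed by mock_wb_exit() should likewise leave nothing behind.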
Signed-off-by: Tejun Heo Reviewed-by: Jan Kara Cc: Jens Axboe Cc: Wu Fengguang Cc: Jaegeuk Kim Cc: Steven Whitehouse Signed-off-by: Jens Axboe --- mm/backing-dev.c | 45 ++++----- mm/page-writeback.c | 262 ++++++++++++++++++++++++++-------------------------- 2 files changed, 152 insertions(+), 155 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 7b1d191..9a6c472 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -66,7 +66,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) spin_unlock(&wb->list_lock); global_dirty_limits(&background_thresh, &dirty_thresh); - bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); + bdi_thresh = wb_dirty_limit(wb, dirty_thresh); #define K(x) ((x) << (PAGE_SHIFT - 10)) seq_printf(m, @@ -91,7 +91,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) K(background_thresh), (unsigned long) K(wb_stat(wb, WB_DIRTIED)), (unsigned long) K(wb_stat(wb, WB_WRITTEN)), - (unsigned long) K(bdi->write_bandwidth), + (unsigned long) K(wb->write_bandwidth), nr_dirty, nr_io, nr_more_io, @@ -376,6 +376,11 @@ void bdi_unregister(struct backing_dev_info *bdi) } EXPORT_SYMBOL(bdi_unregister); +/* + * Initial write bandwidth: 100 MB/s + */ +#define INIT_BW (100 << (20 - PAGE_SHIFT)) + static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) { int i, err; @@ -391,11 +396,22 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) spin_lock_init(&wb->list_lock); INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn); + wb->bw_time_stamp = jiffies; + wb->balanced_dirty_ratelimit = INIT_BW; + wb->dirty_ratelimit = INIT_BW; + wb->write_bandwidth = INIT_BW; + wb->avg_write_bandwidth = INIT_BW; + + err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL); + if (err) + return err; + for (i = 0; i < NR_WB_STAT_ITEMS; i++) { err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL); if (err) { while (--i) percpu_counter_destroy(&wb->stat[i]); + 
fprop_local_destroy_percpu(&wb->completions); return err; } } @@ -411,12 +427,9 @@ static void bdi_wb_exit(struct bdi_writeback *wb) for (i = 0; i < NR_WB_STAT_ITEMS; i++) percpu_counter_destroy(&wb->stat[i]); -} -/* - * Initial write bandwidth: 100 MB/s - */ -#define INIT_BW (100 << (20 - PAGE_SHIFT)) + fprop_local_destroy_percpu(&wb->completions); +} int bdi_init(struct backing_dev_info *bdi) { @@ -435,22 +448,6 @@ int bdi_init(struct backing_dev_info *bdi) if (err) return err; - bdi->dirty_exceeded = 0; - - bdi->bw_time_stamp = jiffies; - bdi->written_stamp = 0; - - bdi->balanced_dirty_ratelimit = INIT_BW; - bdi->dirty_ratelimit = INIT_BW; - bdi->write_bandwidth = INIT_BW; - bdi->avg_write_bandwidth = INIT_BW; - - err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL); - if (err) { - bdi_wb_exit(&bdi->wb); - return err; - } - return 0; } EXPORT_SYMBOL(bdi_init); @@ -468,8 +465,6 @@ void bdi_destroy(struct backing_dev_info *bdi) } bdi_wb_exit(&bdi->wb); - - fprop_local_destroy_percpu(&bdi->completions); } EXPORT_SYMBOL(bdi_destroy); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index dc673a0..cd39ee9 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -399,7 +399,7 @@ static unsigned long wp_next_time(unsigned long cur_time) static inline void __wb_writeout_inc(struct bdi_writeback *wb) { __inc_wb_stat(wb, WB_WRITTEN); - __fprop_inc_percpu_max(&writeout_completions, &wb->bdi->completions, + __fprop_inc_percpu_max(&writeout_completions, &wb->completions, wb->bdi->max_prop_frac); /* First event after period switching was turned off? */ if (!unlikely(writeout_period_time)) { @@ -427,10 +427,10 @@ EXPORT_SYMBOL_GPL(wb_writeout_inc); /* * Obtain an accurate fraction of the BDI's portion. 
*/ -static void bdi_writeout_fraction(struct backing_dev_info *bdi, - long *numerator, long *denominator) +static void wb_writeout_fraction(struct bdi_writeback *wb, + long *numerator, long *denominator) { - fprop_fraction_percpu(&writeout_completions, &bdi->completions, + fprop_fraction_percpu(&writeout_completions, &wb->completions, numerator, denominator); } @@ -516,11 +516,11 @@ static unsigned long hard_dirty_limit(unsigned long thresh) } /** - * bdi_dirty_limit - @bdi's share of dirty throttling threshold - * @bdi: the backing_dev_info to query + * wb_dirty_limit - @wb's share of dirty throttling threshold + * @wb: bdi_writeback to query * @dirty: global dirty limit in pages * - * Returns @bdi's dirty limit in pages. The term "dirty" in the context of + * Returns @wb's dirty limit in pages. The term "dirty" in the context of * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages. * * Note that balance_dirty_pages() will only seriously take it as a hard limit @@ -528,34 +528,35 @@ static unsigned long hard_dirty_limit(unsigned long thresh) * control. For example, when the device is completely stalled due to some error * conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key. * In the other normal situations, it acts more gently by throttling the tasks - * more (rather than completely block them) when the bdi dirty pages go high. + * more (rather than completely block them) when the wb dirty pages go high. * * It allocates high/low dirty limits to fast/slow devices, in order to prevent * - starving fast devices * - piling up dirty pages (that will take long time to sync) on slow devices * - * The bdi's share of dirty limit will be adapting to its throughput and + * The wb's share of dirty limit will be adapting to its throughput and * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set. 
*/ -unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty) +unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty) { - u64 bdi_dirty; + struct backing_dev_info *bdi = wb->bdi; + u64 wb_dirty; long numerator, denominator; /* * Calculate this BDI's share of the dirty ratio. */ - bdi_writeout_fraction(bdi, &numerator, &denominator); + wb_writeout_fraction(wb, &numerator, &denominator); - bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100; - bdi_dirty *= numerator; - do_div(bdi_dirty, denominator); + wb_dirty = (dirty * (100 - bdi_min_ratio)) / 100; + wb_dirty *= numerator; + do_div(wb_dirty, denominator); - bdi_dirty += (dirty * bdi->min_ratio) / 100; - if (bdi_dirty > (dirty * bdi->max_ratio) / 100) - bdi_dirty = dirty * bdi->max_ratio / 100; + wb_dirty += (dirty * bdi->min_ratio) / 100; + if (wb_dirty > (dirty * bdi->max_ratio) / 100) + wb_dirty = dirty * bdi->max_ratio / 100; - return bdi_dirty; + return wb_dirty; } /* @@ -664,14 +665,14 @@ static long long pos_ratio_polynom(unsigned long setpoint, * card's bdi_dirty may rush to many times higher than bdi_setpoint. 
* - the bdi dirty thresh drops quickly due to change of JBOD workload */ -static unsigned long bdi_position_ratio(struct backing_dev_info *bdi, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty) +static unsigned long wb_position_ratio(struct bdi_writeback *wb, + unsigned long thresh, + unsigned long bg_thresh, + unsigned long dirty, + unsigned long bdi_thresh, + unsigned long bdi_dirty) { - unsigned long write_bw = bdi->avg_write_bandwidth; + unsigned long write_bw = wb->avg_write_bandwidth; unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh); unsigned long limit = hard_dirty_limit(thresh); unsigned long x_intercept; @@ -702,12 +703,12 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi, * consume arbitrary amount of RAM because it is accounted in * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty". * - * Here, in bdi_position_ratio(), we calculate pos_ratio based on + * Here, in wb_position_ratio(), we calculate pos_ratio based on * two values: bdi_dirty and bdi_thresh. Let's consider an example: * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global * limits are set by default to 10% and 20% (background and throttle). * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages. - * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is + * wb_dirty_limit(wb, bg_thresh) is about ~4K pages. bdi_setpoint is * about ~6K pages (as the average of background and throttle bdi * limits). The 3rd order polynomial will provide positive feedback if * bdi_dirty is under bdi_setpoint and vice versa. @@ -717,7 +718,7 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi, * much earlier than global "freerun" is reached (~23MB vs. ~2.3GB * in the example above). 
*/ - if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) { + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { long long bdi_pos_ratio; unsigned long bdi_bg_thresh; @@ -842,13 +843,13 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi, return pos_ratio; } -static void bdi_update_write_bandwidth(struct backing_dev_info *bdi, - unsigned long elapsed, - unsigned long written) +static void wb_update_write_bandwidth(struct bdi_writeback *wb, + unsigned long elapsed, + unsigned long written) { const unsigned long period = roundup_pow_of_two(3 * HZ); - unsigned long avg = bdi->avg_write_bandwidth; - unsigned long old = bdi->write_bandwidth; + unsigned long avg = wb->avg_write_bandwidth; + unsigned long old = wb->write_bandwidth; u64 bw; /* @@ -861,14 +862,14 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi, * @written may have decreased due to account_page_redirty(). * Avoid underflowing @bw calculation. */ - bw = written - min(written, bdi->written_stamp); + bw = written - min(written, wb->written_stamp); bw *= HZ; if (unlikely(elapsed > period)) { do_div(bw, elapsed); avg = bw; goto out; } - bw += (u64)bdi->write_bandwidth * (period - elapsed); + bw += (u64)wb->write_bandwidth * (period - elapsed); bw >>= ilog2(period); /* @@ -881,8 +882,8 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi, avg += (old - avg) >> 3; out: - bdi->write_bandwidth = bw; - bdi->avg_write_bandwidth = avg; + wb->write_bandwidth = bw; + wb->avg_write_bandwidth = avg; } /* @@ -947,20 +948,20 @@ static void global_update_bandwidth(unsigned long thresh, * Normal bdi tasks will be curbed at or below it in long term. * Obviously it should be around (write_bw / N) when there are N dd tasks. 
*/ -static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty, - unsigned long dirtied, - unsigned long elapsed) +static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, + unsigned long thresh, + unsigned long bg_thresh, + unsigned long dirty, + unsigned long bdi_thresh, + unsigned long bdi_dirty, + unsigned long dirtied, + unsigned long elapsed) { unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh); unsigned long limit = hard_dirty_limit(thresh); unsigned long setpoint = (freerun + limit) / 2; - unsigned long write_bw = bdi->avg_write_bandwidth; - unsigned long dirty_ratelimit = bdi->dirty_ratelimit; + unsigned long write_bw = wb->avg_write_bandwidth; + unsigned long dirty_ratelimit = wb->dirty_ratelimit; unsigned long dirty_rate; unsigned long task_ratelimit; unsigned long balanced_dirty_ratelimit; @@ -972,10 +973,10 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi, * The dirty rate will match the writeout rate in long term, except * when dirty pages are truncated by userspace or re-dirtied by FS. */ - dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed; + dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed; - pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty, - bdi_thresh, bdi_dirty); + pos_ratio = wb_position_ratio(wb, thresh, bg_thresh, dirty, + bdi_thresh, bdi_dirty); /* * task_ratelimit reflects each dd's dirty rate for the past 200ms. */ @@ -1059,31 +1060,31 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi, /* * For strictlimit case, calculations above were based on bdi counters - * and limits (starting from pos_ratio = bdi_position_ratio() and up to + * and limits (starting from pos_ratio = wb_position_ratio() and up to * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate). 
* Hence, to calculate "step" properly, we have to use bdi_dirty as * "dirty" and bdi_setpoint as "setpoint". * * We rampup dirty_ratelimit forcibly if bdi_dirty is low because * it's possible that bdi_thresh is close to zero due to inactivity - * of backing device (see the implementation of bdi_dirty_limit()). + * of backing device (see the implementation of wb_dirty_limit()). */ - if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) { + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { dirty = bdi_dirty; if (bdi_dirty < 8) setpoint = bdi_dirty + 1; else setpoint = (bdi_thresh + - bdi_dirty_limit(bdi, bg_thresh)) / 2; + wb_dirty_limit(wb, bg_thresh)) / 2; } if (dirty < setpoint) { - x = min3(bdi->balanced_dirty_ratelimit, + x = min3(wb->balanced_dirty_ratelimit, balanced_dirty_ratelimit, task_ratelimit); if (dirty_ratelimit < x) step = x - dirty_ratelimit; } else { - x = max3(bdi->balanced_dirty_ratelimit, + x = max3(wb->balanced_dirty_ratelimit, balanced_dirty_ratelimit, task_ratelimit); if (dirty_ratelimit > x) step = dirty_ratelimit - x; @@ -1105,22 +1106,22 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi, else dirty_ratelimit -= step; - bdi->dirty_ratelimit = max(dirty_ratelimit, 1UL); - bdi->balanced_dirty_ratelimit = balanced_dirty_ratelimit; + wb->dirty_ratelimit = max(dirty_ratelimit, 1UL); + wb->balanced_dirty_ratelimit = balanced_dirty_ratelimit; - trace_bdi_dirty_ratelimit(bdi, dirty_rate, task_ratelimit); + trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit); } -void __bdi_update_bandwidth(struct backing_dev_info *bdi, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty, - unsigned long start_time) +void __wb_update_bandwidth(struct bdi_writeback *wb, + unsigned long thresh, + unsigned long bg_thresh, + unsigned long dirty, + unsigned long bdi_thresh, + unsigned long bdi_dirty, + unsigned long start_time) { unsigned long now = 
jiffies; - unsigned long elapsed = now - bdi->bw_time_stamp; + unsigned long elapsed = now - wb->bw_time_stamp; unsigned long dirtied; unsigned long written; @@ -1130,44 +1131,44 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi, if (elapsed < BANDWIDTH_INTERVAL) return; - dirtied = percpu_counter_read(&bdi->wb.stat[WB_DIRTIED]); - written = percpu_counter_read(&bdi->wb.stat[WB_WRITTEN]); + dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]); + written = percpu_counter_read(&wb->stat[WB_WRITTEN]); /* * Skip quiet periods when disk bandwidth is under-utilized. * (at least 1s idle time between two flusher runs) */ - if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time)) + if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time)) goto snapshot; if (thresh) { global_update_bandwidth(thresh, dirty, now); - bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty, - bdi_thresh, bdi_dirty, - dirtied, elapsed); + wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty, + bdi_thresh, bdi_dirty, + dirtied, elapsed); } - bdi_update_write_bandwidth(bdi, elapsed, written); + wb_update_write_bandwidth(wb, elapsed, written); snapshot: - bdi->dirtied_stamp = dirtied; - bdi->written_stamp = written; - bdi->bw_time_stamp = now; + wb->dirtied_stamp = dirtied; + wb->written_stamp = written; + wb->bw_time_stamp = now; } -static void bdi_update_bandwidth(struct backing_dev_info *bdi, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty, - unsigned long start_time) +static void wb_update_bandwidth(struct bdi_writeback *wb, + unsigned long thresh, + unsigned long bg_thresh, + unsigned long dirty, + unsigned long bdi_thresh, + unsigned long bdi_dirty, + unsigned long start_time) { - if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL)) + if (time_is_after_eq_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL)) return; - spin_lock(&bdi->wb.list_lock); - 
__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty, - bdi_thresh, bdi_dirty, start_time); - spin_unlock(&bdi->wb.list_lock); + spin_lock(&wb->list_lock); + __wb_update_bandwidth(wb, thresh, bg_thresh, dirty, + bdi_thresh, bdi_dirty, start_time); + spin_unlock(&wb->list_lock); } /* @@ -1187,10 +1188,10 @@ static unsigned long dirty_poll_interval(unsigned long dirty, return 1; } -static unsigned long bdi_max_pause(struct backing_dev_info *bdi, - unsigned long bdi_dirty) +static unsigned long wb_max_pause(struct bdi_writeback *wb, + unsigned long bdi_dirty) { - unsigned long bw = bdi->avg_write_bandwidth; + unsigned long bw = wb->avg_write_bandwidth; unsigned long t; /* @@ -1206,14 +1207,14 @@ static unsigned long bdi_max_pause(struct backing_dev_info *bdi, return min_t(unsigned long, t, MAX_PAUSE); } -static long bdi_min_pause(struct backing_dev_info *bdi, - long max_pause, - unsigned long task_ratelimit, - unsigned long dirty_ratelimit, - int *nr_dirtied_pause) +static long wb_min_pause(struct bdi_writeback *wb, + long max_pause, + unsigned long task_ratelimit, + unsigned long dirty_ratelimit, + int *nr_dirtied_pause) { - long hi = ilog2(bdi->avg_write_bandwidth); - long lo = ilog2(bdi->dirty_ratelimit); + long hi = ilog2(wb->avg_write_bandwidth); + long lo = ilog2(wb->dirty_ratelimit); long t; /* target pause */ long pause; /* estimated next pause */ int pages; /* target nr_dirtied_pause */ @@ -1281,14 +1282,13 @@ static long bdi_min_pause(struct backing_dev_info *bdi, return pages >= DIRTY_POLL_THRESH ? 
1 + t / 2 : t; } -static inline void bdi_dirty_limits(struct backing_dev_info *bdi, - unsigned long dirty_thresh, - unsigned long background_thresh, - unsigned long *bdi_dirty, - unsigned long *bdi_thresh, - unsigned long *bdi_bg_thresh) +static inline void wb_dirty_limits(struct bdi_writeback *wb, + unsigned long dirty_thresh, + unsigned long background_thresh, + unsigned long *bdi_dirty, + unsigned long *bdi_thresh, + unsigned long *bdi_bg_thresh) { - struct bdi_writeback *wb = &bdi->wb; unsigned long wb_reclaimable; /* @@ -1301,10 +1301,10 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi, * In this case we don't want to hard throttle the USB key * dirtiers for 100 seconds until bdi_dirty drops under * bdi_thresh. Instead the auxiliary bdi control line in - * bdi_position_ratio() will let the dirtier task progress + * wb_position_ratio() will let the dirtier task progress * at some rate <= (write_bw / 2) for bringing down bdi_dirty. */ - *bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); + *bdi_thresh = wb_dirty_limit(wb, dirty_thresh); if (bdi_bg_thresh) *bdi_bg_thresh = dirty_thresh ? 
div_u64((u64)*bdi_thresh * @@ -1354,6 +1354,7 @@ static void balance_dirty_pages(struct address_space *mapping, unsigned long dirty_ratelimit; unsigned long pos_ratio; struct backing_dev_info *bdi = inode_to_bdi(mapping->host); + struct bdi_writeback *wb = &bdi->wb; bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; unsigned long start_time = jiffies; @@ -1378,8 +1379,8 @@ static void balance_dirty_pages(struct address_space *mapping, global_dirty_limits(&background_thresh, &dirty_thresh); if (unlikely(strictlimit)) { - bdi_dirty_limits(bdi, dirty_thresh, background_thresh, - &bdi_dirty, &bdi_thresh, &bg_thresh); + wb_dirty_limits(wb, dirty_thresh, background_thresh, + &bdi_dirty, &bdi_thresh, &bg_thresh); dirty = bdi_dirty; thresh = bdi_thresh; @@ -1410,28 +1411,28 @@ static void balance_dirty_pages(struct address_space *mapping, bdi_start_background_writeback(bdi); if (!strictlimit) - bdi_dirty_limits(bdi, dirty_thresh, background_thresh, - &bdi_dirty, &bdi_thresh, NULL); + wb_dirty_limits(wb, dirty_thresh, background_thresh, + &bdi_dirty, &bdi_thresh, NULL); dirty_exceeded = (bdi_dirty > bdi_thresh) && ((nr_dirty > dirty_thresh) || strictlimit); - if (dirty_exceeded && !bdi->dirty_exceeded) - bdi->dirty_exceeded = 1; + if (dirty_exceeded && !wb->dirty_exceeded) + wb->dirty_exceeded = 1; - bdi_update_bandwidth(bdi, dirty_thresh, background_thresh, - nr_dirty, bdi_thresh, bdi_dirty, - start_time); + wb_update_bandwidth(wb, dirty_thresh, background_thresh, + nr_dirty, bdi_thresh, bdi_dirty, + start_time); - dirty_ratelimit = bdi->dirty_ratelimit; - pos_ratio = bdi_position_ratio(bdi, dirty_thresh, - background_thresh, nr_dirty, - bdi_thresh, bdi_dirty); + dirty_ratelimit = wb->dirty_ratelimit; + pos_ratio = wb_position_ratio(wb, dirty_thresh, + background_thresh, nr_dirty, + bdi_thresh, bdi_dirty); task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >> RATELIMIT_CALC_SHIFT; - max_pause = bdi_max_pause(bdi, bdi_dirty); - min_pause = bdi_min_pause(bdi, 
max_pause, - task_ratelimit, dirty_ratelimit, - &nr_dirtied_pause); + max_pause = wb_max_pause(wb, bdi_dirty); + min_pause = wb_min_pause(wb, max_pause, + task_ratelimit, dirty_ratelimit, + &nr_dirtied_pause); if (unlikely(task_ratelimit == 0)) { period = max_pause; @@ -1515,15 +1516,15 @@ pause: * more page. However bdi_dirty has accounting errors. So use * the larger and more IO friendly wb_stat_error. */ - if (bdi_dirty <= wb_stat_error(&bdi->wb)) + if (bdi_dirty <= wb_stat_error(wb)) break; if (fatal_signal_pending(current)) break; } - if (!dirty_exceeded && bdi->dirty_exceeded) - bdi->dirty_exceeded = 0; + if (!dirty_exceeded && wb->dirty_exceeded) + wb->dirty_exceeded = 0; if (writeback_in_progress(bdi)) return; @@ -1577,6 +1578,7 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0; void balance_dirty_pages_ratelimited(struct address_space *mapping) { struct backing_dev_info *bdi = inode_to_bdi(mapping->host); + struct bdi_writeback *wb = &bdi->wb; int ratelimit; int *p; @@ -1584,7 +1586,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping) return; ratelimit = current->nr_dirtied_pause; - if (bdi->dirty_exceeded) + if (wb->dirty_exceeded) ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10)); preempt_disable(); -- cgit v1.1 From de1fff37b2781f9caae7bbb7b79fa788a9bb1115 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:29 -0400 Subject: writeback: s/bdi/wb/ in mm/page-writeback.c Writeback operations will now be per wb (bdi_writeback) instead of bdi. Replace the relevant bdi references in symbol names and comments with wb. This patch is purely cosmetic and doesn't make any functional changes. 
Signed-off-by: Tejun Heo Reviewed-by: Jan Kara Cc: Wu Fengguang Cc: Jens Axboe Signed-off-by: Jens Axboe --- mm/page-writeback.c | 270 ++++++++++++++++++++++++++-------------------------- 1 file changed, 134 insertions(+), 136 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index cd39ee9..78ef551 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -595,7 +595,7 @@ static long long pos_ratio_polynom(unsigned long setpoint, * * (o) global/bdi setpoints * - * We want the dirty pages be balanced around the global/bdi setpoints. + * We want the dirty pages be balanced around the global/wb setpoints. * When the number of dirty pages is higher/lower than the setpoint, the * dirty position control ratio (and hence task dirty ratelimit) will be * decreased/increased to bring the dirty pages back to the setpoint. @@ -605,8 +605,8 @@ static long long pos_ratio_polynom(unsigned long setpoint, * if (dirty < setpoint) scale up pos_ratio * if (dirty > setpoint) scale down pos_ratio * - * if (bdi_dirty < bdi_setpoint) scale up pos_ratio - * if (bdi_dirty > bdi_setpoint) scale down pos_ratio + * if (wb_dirty < wb_setpoint) scale up pos_ratio + * if (wb_dirty > wb_setpoint) scale down pos_ratio * * task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT * @@ -631,7 +631,7 @@ static long long pos_ratio_polynom(unsigned long setpoint, * 0 +------------.------------------.----------------------*-------------> * freerun^ setpoint^ limit^ dirty pages * - * (o) bdi control line + * (o) wb control line * * ^ pos_ratio * | @@ -657,27 +657,27 @@ static long long pos_ratio_polynom(unsigned long setpoint, * | . . * | . . 
* 0 +----------------------.-------------------------------.-------------> - * bdi_setpoint^ x_intercept^ + * wb_setpoint^ x_intercept^ * - * The bdi control line won't drop below pos_ratio=1/4, so that bdi_dirty can + * The wb control line won't drop below pos_ratio=1/4, so that wb_dirty can * be smoothly throttled down to normal if it starts high in situations like * - start writing to a slow SD card and a fast disk at the same time. The SD - * card's bdi_dirty may rush to many times higher than bdi_setpoint. - * - the bdi dirty thresh drops quickly due to change of JBOD workload + * card's wb_dirty may rush to many times higher than wb_setpoint. + * - the wb dirty thresh drops quickly due to change of JBOD workload */ static unsigned long wb_position_ratio(struct bdi_writeback *wb, unsigned long thresh, unsigned long bg_thresh, unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty) + unsigned long wb_thresh, + unsigned long wb_dirty) { unsigned long write_bw = wb->avg_write_bandwidth; unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh); unsigned long limit = hard_dirty_limit(thresh); unsigned long x_intercept; unsigned long setpoint; /* dirty pages' target balance point */ - unsigned long bdi_setpoint; + unsigned long wb_setpoint; unsigned long span; long long pos_ratio; /* for scaling up/down the rate limit */ long x; @@ -696,146 +696,145 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb, /* * The strictlimit feature is a tool preventing mistrusted filesystems * from growing a large number of dirty pages before throttling. For - * such filesystems balance_dirty_pages always checks bdi counters - * against bdi limits. Even if global "nr_dirty" is under "freerun". + * such filesystems balance_dirty_pages always checks wb counters + * against wb limits. Even if global "nr_dirty" is under "freerun". * This is especially important for fuse which sets bdi->max_ratio to * 1% by default. 
Without strictlimit feature, fuse writeback may * consume arbitrary amount of RAM because it is accounted in * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty". * * Here, in wb_position_ratio(), we calculate pos_ratio based on - * two values: bdi_dirty and bdi_thresh. Let's consider an example: + * two values: wb_dirty and wb_thresh. Let's consider an example: * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global * limits are set by default to 10% and 20% (background and throttle). - * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages. - * wb_dirty_limit(wb, bg_thresh) is about ~4K pages. bdi_setpoint is - * about ~6K pages (as the average of background and throttle bdi + * Then wb_thresh is 1% of 20% of 16GB. This amounts to ~8K pages. + * wb_dirty_limit(wb, bg_thresh) is about ~4K pages. wb_setpoint is + * about ~6K pages (as the average of background and throttle wb * limits). The 3rd order polynomial will provide positive feedback if - * bdi_dirty is under bdi_setpoint and vice versa. + * wb_dirty is under wb_setpoint and vice versa. * * Note, that we cannot use global counters in these calculations - * because we want to throttle process writing to a strictlimit BDI + * because we want to throttle process writing to a strictlimit wb * much earlier than global "freerun" is reached (~23MB vs. ~2.3GB * in the example above). 
*/ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { - long long bdi_pos_ratio; - unsigned long bdi_bg_thresh; + long long wb_pos_ratio; + unsigned long wb_bg_thresh; - if (bdi_dirty < 8) + if (wb_dirty < 8) return min_t(long long, pos_ratio * 2, 2 << RATELIMIT_CALC_SHIFT); - if (bdi_dirty >= bdi_thresh) + if (wb_dirty >= wb_thresh) return 0; - bdi_bg_thresh = div_u64((u64)bdi_thresh * bg_thresh, thresh); - bdi_setpoint = dirty_freerun_ceiling(bdi_thresh, - bdi_bg_thresh); + wb_bg_thresh = div_u64((u64)wb_thresh * bg_thresh, thresh); + wb_setpoint = dirty_freerun_ceiling(wb_thresh, wb_bg_thresh); - if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh) + if (wb_setpoint == 0 || wb_setpoint == wb_thresh) return 0; - bdi_pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty, - bdi_thresh); + wb_pos_ratio = pos_ratio_polynom(wb_setpoint, wb_dirty, + wb_thresh); /* - * Typically, for strictlimit case, bdi_setpoint << setpoint - * and pos_ratio >> bdi_pos_ratio. In the other words global + * Typically, for strictlimit case, wb_setpoint << setpoint + * and pos_ratio >> wb_pos_ratio. In the other words global * state ("dirty") is not limiting factor and we have to - * make decision based on bdi counters. But there is an + * make decision based on wb counters. But there is an * important case when global pos_ratio should get precedence: * global limits are exceeded (e.g. due to activities on other - * BDIs) while given strictlimit BDI is below limit. + * wb's) while given strictlimit wb is below limit. * - * "pos_ratio * bdi_pos_ratio" would work for the case above, + * "pos_ratio * wb_pos_ratio" would work for the case above, * but it would look too non-natural for the case of all - * activity in the system coming from a single strictlimit BDI + * activity in the system coming from a single strictlimit wb * with bdi->max_ratio == 100%. * * Note that min() below somewhat changes the dynamics of the * control system. 
Normally, pos_ratio value can be well over 3 - * (when globally we are at freerun and bdi is well below bdi + * (when globally we are at freerun and wb is well below wb * setpoint). Now the maximum pos_ratio in the same situation * is 2. We might want to tweak this if we observe the control * system is too slow to adapt. */ - return min(pos_ratio, bdi_pos_ratio); + return min(pos_ratio, wb_pos_ratio); } /* * We have computed basic pos_ratio above based on global situation. If - * the bdi is over/under its share of dirty pages, we want to scale + * the wb is over/under its share of dirty pages, we want to scale * pos_ratio further down/up. That is done by the following mechanism. */ /* - * bdi setpoint + * wb setpoint * - * f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint) + * f(wb_dirty) := 1.0 + k * (wb_dirty - wb_setpoint) * - * x_intercept - bdi_dirty + * x_intercept - wb_dirty * := -------------------------- - * x_intercept - bdi_setpoint + * x_intercept - wb_setpoint * - * The main bdi control line is a linear function that subjects to + * The main wb control line is a linear function that subjects to * - * (1) f(bdi_setpoint) = 1.0 - * (2) k = - 1 / (8 * write_bw) (in single bdi case) - * or equally: x_intercept = bdi_setpoint + 8 * write_bw + * (1) f(wb_setpoint) = 1.0 + * (2) k = - 1 / (8 * write_bw) (in single wb case) + * or equally: x_intercept = wb_setpoint + 8 * write_bw * - * For single bdi case, the dirty pages are observed to fluctuate + * For single wb case, the dirty pages are observed to fluctuate * regularly within range - * [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2] + * [wb_setpoint - write_bw/2, wb_setpoint + write_bw/2] * for various filesystems, where (2) can yield in a reasonable 12.5% * fluctuation range for pos_ratio. * - * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its + * For JBOD case, wb_thresh (not wb_dirty!) 
could fluctuate up to its * own size, so move the slope over accordingly and choose a slope that - * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh. + * yields 100% pos_ratio fluctuation on suddenly doubled wb_thresh. */ - if (unlikely(bdi_thresh > thresh)) - bdi_thresh = thresh; + if (unlikely(wb_thresh > thresh)) + wb_thresh = thresh; /* - * It's very possible that bdi_thresh is close to 0 not because the + * It's very possible that wb_thresh is close to 0 not because the * device is slow, but that it has remained inactive for long time. * Honour such devices a reasonable good (hopefully IO efficient) * threshold, so that the occasional writes won't be blocked and active * writes can rampup the threshold quickly. */ - bdi_thresh = max(bdi_thresh, (limit - dirty) / 8); + wb_thresh = max(wb_thresh, (limit - dirty) / 8); /* - * scale global setpoint to bdi's: - * bdi_setpoint = setpoint * bdi_thresh / thresh + * scale global setpoint to wb's: + * wb_setpoint = setpoint * wb_thresh / thresh */ - x = div_u64((u64)bdi_thresh << 16, thresh + 1); - bdi_setpoint = setpoint * (u64)x >> 16; + x = div_u64((u64)wb_thresh << 16, thresh + 1); + wb_setpoint = setpoint * (u64)x >> 16; /* - * Use span=(8*write_bw) in single bdi case as indicated by - * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case. + * Use span=(8*write_bw) in single wb case as indicated by + * (thresh - wb_thresh ~= 0) and transit to wb_thresh in JBOD case. 
* - * bdi_thresh thresh - bdi_thresh - * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh - * thresh thresh + * wb_thresh thresh - wb_thresh + * span = --------- * (8 * write_bw) + ------------------ * wb_thresh + * thresh thresh */ - span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16; - x_intercept = bdi_setpoint + span; + span = (thresh - wb_thresh + 8 * write_bw) * (u64)x >> 16; + x_intercept = wb_setpoint + span; - if (bdi_dirty < x_intercept - span / 4) { - pos_ratio = div64_u64(pos_ratio * (x_intercept - bdi_dirty), - x_intercept - bdi_setpoint + 1); + if (wb_dirty < x_intercept - span / 4) { + pos_ratio = div64_u64(pos_ratio * (x_intercept - wb_dirty), + x_intercept - wb_setpoint + 1); } else pos_ratio /= 4; /* - * bdi reserve area, safeguard against dirty pool underrun and disk idle + * wb reserve area, safeguard against dirty pool underrun and disk idle * It may push the desired control point of global dirty pages higher * than setpoint. */ - x_intercept = bdi_thresh / 2; - if (bdi_dirty < x_intercept) { - if (bdi_dirty > x_intercept / 8) - pos_ratio = div_u64(pos_ratio * x_intercept, bdi_dirty); + x_intercept = wb_thresh / 2; + if (wb_dirty < x_intercept) { + if (wb_dirty > x_intercept / 8) + pos_ratio = div_u64(pos_ratio * x_intercept, wb_dirty); else pos_ratio *= 8; } @@ -943,17 +942,17 @@ static void global_update_bandwidth(unsigned long thresh, } /* - * Maintain bdi->dirty_ratelimit, the base dirty throttle rate. + * Maintain wb->dirty_ratelimit, the base dirty throttle rate. * - * Normal bdi tasks will be curbed at or below it in long term. + * Normal wb tasks will be curbed at or below it in long term. * Obviously it should be around (write_bw / N) when there are N dd tasks. 
*/ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, unsigned long thresh, unsigned long bg_thresh, unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty, + unsigned long wb_thresh, + unsigned long wb_dirty, unsigned long dirtied, unsigned long elapsed) { @@ -976,7 +975,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed; pos_ratio = wb_position_ratio(wb, thresh, bg_thresh, dirty, - bdi_thresh, bdi_dirty); + wb_thresh, wb_dirty); /* * task_ratelimit reflects each dd's dirty rate for the past 200ms. */ @@ -986,7 +985,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, /* * A linear estimation of the "balanced" throttle rate. The theory is, - * if there are N dd tasks, each throttled at task_ratelimit, the bdi's + * if there are N dd tasks, each throttled at task_ratelimit, the wb's * dirty_rate will be measured to be (N * task_ratelimit). So the below * formula will yield the balanced rate limit (write_bw / N). * @@ -1025,7 +1024,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, /* * We could safely do this and return immediately: * - * bdi->dirty_ratelimit = balanced_dirty_ratelimit; + * wb->dirty_ratelimit = balanced_dirty_ratelimit; * * However to get a more stable dirty_ratelimit, the below elaborated * code makes use of task_ratelimit to filter out singular points and @@ -1059,22 +1058,22 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, step = 0; /* - * For strictlimit case, calculations above were based on bdi counters + * For strictlimit case, calculations above were based on wb counters * and limits (starting from pos_ratio = wb_position_ratio() and up to * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate). - * Hence, to calculate "step" properly, we have to use bdi_dirty as - * "dirty" and bdi_setpoint as "setpoint". 
+ * Hence, to calculate "step" properly, we have to use wb_dirty as + * "dirty" and wb_setpoint as "setpoint". * - * We rampup dirty_ratelimit forcibly if bdi_dirty is low because - * it's possible that bdi_thresh is close to zero due to inactivity + * We rampup dirty_ratelimit forcibly if wb_dirty is low because + * it's possible that wb_thresh is close to zero due to inactivity * of backing device (see the implementation of wb_dirty_limit()). */ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { - dirty = bdi_dirty; - if (bdi_dirty < 8) - setpoint = bdi_dirty + 1; + dirty = wb_dirty; + if (wb_dirty < 8) + setpoint = wb_dirty + 1; else - setpoint = (bdi_thresh + + setpoint = (wb_thresh + wb_dirty_limit(wb, bg_thresh)) / 2; } @@ -1116,8 +1115,8 @@ void __wb_update_bandwidth(struct bdi_writeback *wb, unsigned long thresh, unsigned long bg_thresh, unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty, + unsigned long wb_thresh, + unsigned long wb_dirty, unsigned long start_time) { unsigned long now = jiffies; @@ -1144,7 +1143,7 @@ void __wb_update_bandwidth(struct bdi_writeback *wb, if (thresh) { global_update_bandwidth(thresh, dirty, now); wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty, - bdi_thresh, bdi_dirty, + wb_thresh, wb_dirty, dirtied, elapsed); } wb_update_write_bandwidth(wb, elapsed, written); @@ -1159,15 +1158,15 @@ static void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long thresh, unsigned long bg_thresh, unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty, + unsigned long wb_thresh, + unsigned long wb_dirty, unsigned long start_time) { if (time_is_after_eq_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL)) return; spin_lock(&wb->list_lock); __wb_update_bandwidth(wb, thresh, bg_thresh, dirty, - bdi_thresh, bdi_dirty, start_time); + wb_thresh, wb_dirty, start_time); spin_unlock(&wb->list_lock); } @@ -1189,7 +1188,7 @@ static unsigned long dirty_poll_interval(unsigned long dirty, } 
static unsigned long wb_max_pause(struct bdi_writeback *wb, - unsigned long bdi_dirty) + unsigned long wb_dirty) { unsigned long bw = wb->avg_write_bandwidth; unsigned long t; @@ -1201,7 +1200,7 @@ static unsigned long wb_max_pause(struct bdi_writeback *wb, * * 8 serves as the safety ratio. */ - t = bdi_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8)); + t = wb_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8)); t++; return min_t(unsigned long, t, MAX_PAUSE); @@ -1285,31 +1284,31 @@ static long wb_min_pause(struct bdi_writeback *wb, static inline void wb_dirty_limits(struct bdi_writeback *wb, unsigned long dirty_thresh, unsigned long background_thresh, - unsigned long *bdi_dirty, - unsigned long *bdi_thresh, - unsigned long *bdi_bg_thresh) + unsigned long *wb_dirty, + unsigned long *wb_thresh, + unsigned long *wb_bg_thresh) { unsigned long wb_reclaimable; /* - * bdi_thresh is not treated as some limiting factor as + * wb_thresh is not treated as some limiting factor as * dirty_thresh, due to reasons - * - in JBOD setup, bdi_thresh can fluctuate a lot + * - in JBOD setup, wb_thresh can fluctuate a lot * - in a system with HDD and USB key, the USB key may somehow - * go into state (bdi_dirty >> bdi_thresh) either because - * bdi_dirty starts high, or because bdi_thresh drops low. + * go into state (wb_dirty >> wb_thresh) either because + * wb_dirty starts high, or because wb_thresh drops low. * In this case we don't want to hard throttle the USB key - * dirtiers for 100 seconds until bdi_dirty drops under - * bdi_thresh. Instead the auxiliary bdi control line in + * dirtiers for 100 seconds until wb_dirty drops under + * wb_thresh. Instead the auxiliary wb control line in * wb_position_ratio() will let the dirtier task progress - * at some rate <= (write_bw / 2) for bringing down bdi_dirty. + * at some rate <= (write_bw / 2) for bringing down wb_dirty. 
*/ - *bdi_thresh = wb_dirty_limit(wb, dirty_thresh); + *wb_thresh = wb_dirty_limit(wb, dirty_thresh); - if (bdi_bg_thresh) - *bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh * - background_thresh, - dirty_thresh) : 0; + if (wb_bg_thresh) + *wb_bg_thresh = dirty_thresh ? div_u64((u64)*wb_thresh * + background_thresh, + dirty_thresh) : 0; /* * In order to avoid the stacked BDI deadlock we need @@ -1321,12 +1320,12 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb, * actually dirty; with m+n sitting in the percpu * deltas. */ - if (*bdi_thresh < 2 * wb_stat_error(wb)) { + if (*wb_thresh < 2 * wb_stat_error(wb)) { wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); - *bdi_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK); + *wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK); } else { wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE); - *bdi_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK); + *wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK); } } @@ -1360,9 +1359,9 @@ static void balance_dirty_pages(struct address_space *mapping, for (;;) { unsigned long now = jiffies; - unsigned long uninitialized_var(bdi_thresh); + unsigned long uninitialized_var(wb_thresh); unsigned long thresh; - unsigned long uninitialized_var(bdi_dirty); + unsigned long uninitialized_var(wb_dirty); unsigned long dirty; unsigned long bg_thresh; @@ -1380,10 +1379,10 @@ static void balance_dirty_pages(struct address_space *mapping, if (unlikely(strictlimit)) { wb_dirty_limits(wb, dirty_thresh, background_thresh, - &bdi_dirty, &bdi_thresh, &bg_thresh); + &wb_dirty, &wb_thresh, &bg_thresh); - dirty = bdi_dirty; - thresh = bdi_thresh; + dirty = wb_dirty; + thresh = wb_thresh; } else { dirty = nr_dirty; thresh = dirty_thresh; @@ -1393,10 +1392,10 @@ static void balance_dirty_pages(struct address_space *mapping, /* * Throttle it only when the background writeback cannot * catch-up. 
This avoids (excessively) small writeouts - * when the bdi limits are ramping up in case of !strictlimit. + * when the wb limits are ramping up in case of !strictlimit. * - * In strictlimit case make decision based on the bdi counters - * and limits. Small writeouts when the bdi limits are ramping + * In strictlimit case make decision based on the wb counters + * and limits. Small writeouts when the wb limits are ramping * up are the price we consciously pay for strictlimit-ing. */ if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) { @@ -1412,24 +1411,23 @@ static void balance_dirty_pages(struct address_space *mapping, if (!strictlimit) wb_dirty_limits(wb, dirty_thresh, background_thresh, - &bdi_dirty, &bdi_thresh, NULL); + &wb_dirty, &wb_thresh, NULL); - dirty_exceeded = (bdi_dirty > bdi_thresh) && + dirty_exceeded = (wb_dirty > wb_thresh) && ((nr_dirty > dirty_thresh) || strictlimit); if (dirty_exceeded && !wb->dirty_exceeded) wb->dirty_exceeded = 1; wb_update_bandwidth(wb, dirty_thresh, background_thresh, - nr_dirty, bdi_thresh, bdi_dirty, - start_time); + nr_dirty, wb_thresh, wb_dirty, start_time); dirty_ratelimit = wb->dirty_ratelimit; pos_ratio = wb_position_ratio(wb, dirty_thresh, background_thresh, nr_dirty, - bdi_thresh, bdi_dirty); + wb_thresh, wb_dirty); task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >> RATELIMIT_CALC_SHIFT; - max_pause = wb_max_pause(wb, bdi_dirty); + max_pause = wb_max_pause(wb, wb_dirty); min_pause = wb_min_pause(wb, max_pause, task_ratelimit, dirty_ratelimit, &nr_dirtied_pause); @@ -1455,8 +1453,8 @@ static void balance_dirty_pages(struct address_space *mapping, dirty_thresh, background_thresh, nr_dirty, - bdi_thresh, - bdi_dirty, + wb_thresh, + wb_dirty, dirty_ratelimit, task_ratelimit, pages_dirtied, @@ -1484,8 +1482,8 @@ pause: dirty_thresh, background_thresh, nr_dirty, - bdi_thresh, - bdi_dirty, + wb_thresh, + wb_dirty, dirty_ratelimit, task_ratelimit, pages_dirtied, @@ -1508,15 +1506,15 @@ pause: /* * In the case of an 
unresponding NFS server and the NFS dirty - * pages exceeds dirty_thresh, give the other good bdi's a pipe + * pages exceeds dirty_thresh, give the other good wb's a pipe * to go through, so that tasks on them still remain responsive. * * In theory 1 page is enough to keep the comsumer-producer * pipe going: the flusher cleans 1 page => the task dirties 1 - * more page. However bdi_dirty has accounting errors. So use + * more page. However wb_dirty has accounting errors. So use * the larger and more IO friendly wb_stat_error. */ - if (bdi_dirty <= wb_stat_error(wb)) + if (wb_dirty <= wb_stat_error(wb)) break; if (fatal_signal_pending(current)) -- cgit v1.1 From f0054bb1e1f3be03cc33369df640db97f10f6172 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:30 -0400 Subject: writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->wb_lock and ->worklist into wb. * The lock protects bdi->worklist and bdi->wb.dwork scheduling. While moving, rename it to wb->work_lock as wb->wb_lock is confusing. Also, move wb->dwork downwards so that it's colocated with the new ->work_lock and ->work_list fields. * bdi_writeback_workfn() -> wb_workfn() bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb) bdi_wakeup_thread(bdi) -> wb_wakeup(wb) bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...) __bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...) get_next_work_item(bdi) -> get_next_work_item(wb) * bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb. 
The function contained parts which belong to the containing bdi rather than the wb itself - testing cap_writeback_dirty and bdi_remove_from_list() invocation. Those are moved to bdi_unregister(). * bdi_wb_{init|exit}() are renamed to wb_{init|exit}(). Initializations of the moved bdi->wb_lock and ->work_list are relocated from bdi_init() to wb_init(). * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->state are mechanically replaced with bdi->wb.state introducing no behavior changes. Signed-off-by: Tejun Heo Reviewed-by: Jan Kara Cc: Jens Axboe Cc: Wu Fengguang Signed-off-by: Jens Axboe --- mm/backing-dev.c | 59 +++++++++++++++++++++++++++----------------------------- 1 file changed, 28 insertions(+), 31 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 9a6c472..597f0ce 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -261,7 +261,7 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi) } /* - * This function is used when the first inode for this bdi is marked dirty. It + * This function is used when the first inode for this wb is marked dirty. It * wakes-up the corresponding bdi thread which should then take care of the * periodic background write-out of dirty inodes. Since the write-out would * starts only 'dirty_writeback_interval' centisecs from now anyway, we just @@ -274,15 +274,15 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi) * We have to be careful not to postpone flush work if it is scheduled for * earlier. Thus we use queue_delayed_work(). 
*/ -void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi) +void wb_wakeup_delayed(struct bdi_writeback *wb) { unsigned long timeout; timeout = msecs_to_jiffies(dirty_writeback_interval * 10); - spin_lock_bh(&bdi->wb_lock); - if (test_bit(WB_registered, &bdi->wb.state)) - queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout); - spin_unlock_bh(&bdi->wb_lock); + spin_lock_bh(&wb->work_lock); + if (test_bit(WB_registered, &wb->state)) + queue_delayed_work(bdi_wq, &wb->dwork, timeout); + spin_unlock_bh(&wb->work_lock); } /* @@ -335,28 +335,24 @@ EXPORT_SYMBOL(bdi_register_dev); /* * Remove bdi from the global list and shutdown any threads we have running */ -static void bdi_wb_shutdown(struct backing_dev_info *bdi) +static void wb_shutdown(struct bdi_writeback *wb) { /* Make sure nobody queues further work */ - spin_lock_bh(&bdi->wb_lock); - if (!test_and_clear_bit(WB_registered, &bdi->wb.state)) { - spin_unlock_bh(&bdi->wb_lock); + spin_lock_bh(&wb->work_lock); + if (!test_and_clear_bit(WB_registered, &wb->state)) { + spin_unlock_bh(&wb->work_lock); return; } - spin_unlock_bh(&bdi->wb_lock); + spin_unlock_bh(&wb->work_lock); /* - * Make sure nobody finds us on the bdi_list anymore + * Drain work list and shutdown the delayed_work. !WB_registered + * tells wb_workfn() that @wb is dying and its work_list needs to + * be drained no matter what. */ - bdi_remove_from_list(bdi); - - /* - * Drain work list and shutdown the delayed_work. At this point, - * @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi - * is dying and its work_list needs to be drained no matter what. 
- */ - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0); - flush_delayed_work(&bdi->wb.dwork); + mod_delayed_work(bdi_wq, &wb->dwork, 0); + flush_delayed_work(&wb->dwork); + WARN_ON(!list_empty(&wb->work_list)); } /* @@ -381,7 +377,7 @@ EXPORT_SYMBOL(bdi_unregister); */ #define INIT_BW (100 << (20 - PAGE_SHIFT)) -static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) +static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) { int i, err; @@ -394,7 +390,6 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) INIT_LIST_HEAD(&wb->b_more_io); INIT_LIST_HEAD(&wb->b_dirty_time); spin_lock_init(&wb->list_lock); - INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn); wb->bw_time_stamp = jiffies; wb->balanced_dirty_ratelimit = INIT_BW; @@ -402,6 +397,10 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) wb->write_bandwidth = INIT_BW; wb->avg_write_bandwidth = INIT_BW; + spin_lock_init(&wb->work_lock); + INIT_LIST_HEAD(&wb->work_list); + INIT_DELAYED_WORK(&wb->dwork, wb_workfn); + err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL); if (err) return err; @@ -419,7 +418,7 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) return 0; } -static void bdi_wb_exit(struct bdi_writeback *wb) +static void wb_exit(struct bdi_writeback *wb) { int i; @@ -440,11 +439,9 @@ int bdi_init(struct backing_dev_info *bdi) bdi->min_ratio = 0; bdi->max_ratio = 100; bdi->max_prop_frac = FPROP_FRAC_BASE; - spin_lock_init(&bdi->wb_lock); INIT_LIST_HEAD(&bdi->bdi_list); - INIT_LIST_HEAD(&bdi->work_list); - err = bdi_wb_init(&bdi->wb, bdi); + err = wb_init(&bdi->wb, bdi); if (err) return err; @@ -454,9 +451,9 @@ EXPORT_SYMBOL(bdi_init); void bdi_destroy(struct backing_dev_info *bdi) { - bdi_wb_shutdown(bdi); - - WARN_ON(!list_empty(&bdi->work_list)); + /* make sure nobody finds us on the bdi_list anymore */ + bdi_remove_from_list(bdi); + wb_shutdown(&bdi->wb); if 
(bdi->dev) { bdi_debug_unregister(bdi); @@ -464,7 +461,7 @@ void bdi_destroy(struct backing_dev_info *bdi) bdi->dev = NULL; } - bdi_wb_exit(&bdi->wb); + wb_exit(&bdi->wb); } EXPORT_SYMBOL(bdi_destroy); -- cgit v1.1 From 4610007142823307d930ac890d822633a05ce08c Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:31 -0400 Subject: writeback: reorganize mm/backing-dev.c Move wb_shutdown(), bdi_register(), bdi_register_dev(), bdi_prune_sb(), bdi_remove_from_list() and bdi_unregister() so that init / exit functions are grouped together. This will make updating init / exit paths for cgroup writeback support easier. This is pure source file reorganization. Signed-off-by: Tejun Heo Reviewed-by: Jan Kara Cc: Jens Axboe Cc: Wu Fengguang Signed-off-by: Jens Axboe --- mm/backing-dev.c | 174 +++++++++++++++++++++++++++---------------------------- 1 file changed, 87 insertions(+), 87 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 597f0ce..ff85ecb 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -286,93 +286,6 @@ void wb_wakeup_delayed(struct bdi_writeback *wb) } /* - * Remove bdi from bdi_list, and ensure that it is no longer visible - */ -static void bdi_remove_from_list(struct backing_dev_info *bdi) -{ - spin_lock_bh(&bdi_lock); - list_del_rcu(&bdi->bdi_list); - spin_unlock_bh(&bdi_lock); - - synchronize_rcu_expedited(); -} - -int bdi_register(struct backing_dev_info *bdi, struct device *parent, - const char *fmt, ...) 
-{ - va_list args; - struct device *dev; - - if (bdi->dev) /* The driver needs to use separate queues per device */ - return 0; - - va_start(args, fmt); - dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args); - va_end(args); - if (IS_ERR(dev)) - return PTR_ERR(dev); - - bdi->dev = dev; - - bdi_debug_register(bdi, dev_name(dev)); - set_bit(WB_registered, &bdi->wb.state); - - spin_lock_bh(&bdi_lock); - list_add_tail_rcu(&bdi->bdi_list, &bdi_list); - spin_unlock_bh(&bdi_lock); - - trace_writeback_bdi_register(bdi); - return 0; -} -EXPORT_SYMBOL(bdi_register); - -int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev) -{ - return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev)); -} -EXPORT_SYMBOL(bdi_register_dev); - -/* - * Remove bdi from the global list and shutdown any threads we have running - */ -static void wb_shutdown(struct bdi_writeback *wb) -{ - /* Make sure nobody queues further work */ - spin_lock_bh(&wb->work_lock); - if (!test_and_clear_bit(WB_registered, &wb->state)) { - spin_unlock_bh(&wb->work_lock); - return; - } - spin_unlock_bh(&wb->work_lock); - - /* - * Drain work list and shutdown the delayed_work. !WB_registered - * tells wb_workfn() that @wb is dying and its work_list needs to - * be drained no matter what. - */ - mod_delayed_work(bdi_wq, &wb->dwork, 0); - flush_delayed_work(&wb->dwork); - WARN_ON(!list_empty(&wb->work_list)); -} - -/* - * Called when the device behind @bdi has been removed or ejected. - * - * We can't really do much here except for reducing the dirty ratio at - * the moment. In the future we should be able to set a flag so that - * the filesystem can handle errors at mark_inode_dirty time instead - * of only at writeback time. 
- */ -void bdi_unregister(struct backing_dev_info *bdi) -{ - if (WARN_ON_ONCE(!bdi->dev)) - return; - - bdi_set_min_ratio(bdi, 0); -} -EXPORT_SYMBOL(bdi_unregister); - -/* * Initial write bandwidth: 100 MB/s */ #define INIT_BW (100 << (20 - PAGE_SHIFT)) @@ -418,6 +331,29 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) return 0; } +/* + * Remove bdi from the global list and shutdown any threads we have running + */ +static void wb_shutdown(struct bdi_writeback *wb) +{ + /* Make sure nobody queues further work */ + spin_lock_bh(&wb->work_lock); + if (!test_and_clear_bit(WB_registered, &wb->state)) { + spin_unlock_bh(&wb->work_lock); + return; + } + spin_unlock_bh(&wb->work_lock); + + /* + * Drain work list and shutdown the delayed_work. !WB_registered + * tells wb_workfn() that @wb is dying and its work_list needs to + * be drained no matter what. + */ + mod_delayed_work(bdi_wq, &wb->dwork, 0); + flush_delayed_work(&wb->dwork); + WARN_ON(!list_empty(&wb->work_list)); +} + static void wb_exit(struct bdi_writeback *wb) { int i; @@ -449,6 +385,70 @@ int bdi_init(struct backing_dev_info *bdi) } EXPORT_SYMBOL(bdi_init); +int bdi_register(struct backing_dev_info *bdi, struct device *parent, + const char *fmt, ...) 
+{ + va_list args; + struct device *dev; + + if (bdi->dev) /* The driver needs to use separate queues per device */ + return 0; + + va_start(args, fmt); + dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args); + va_end(args); + if (IS_ERR(dev)) + return PTR_ERR(dev); + + bdi->dev = dev; + + bdi_debug_register(bdi, dev_name(dev)); + set_bit(WB_registered, &bdi->wb.state); + + spin_lock_bh(&bdi_lock); + list_add_tail_rcu(&bdi->bdi_list, &bdi_list); + spin_unlock_bh(&bdi_lock); + + trace_writeback_bdi_register(bdi); + return 0; +} +EXPORT_SYMBOL(bdi_register); + +int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev) +{ + return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev)); +} +EXPORT_SYMBOL(bdi_register_dev); + +/* + * Remove bdi from bdi_list, and ensure that it is no longer visible + */ +static void bdi_remove_from_list(struct backing_dev_info *bdi) +{ + spin_lock_bh(&bdi_lock); + list_del_rcu(&bdi->bdi_list); + spin_unlock_bh(&bdi_lock); + + synchronize_rcu_expedited(); +} + +/* + * Called when the device behind @bdi has been removed or ejected. + * + * We can't really do much here except for reducing the dirty ratio at + * the moment. In the future we should be able to set a flag so that + * the filesystem can handle errors at mark_inode_dirty time instead + * of only at writeback time. 
+ */ +void bdi_unregister(struct backing_dev_info *bdi) +{ + if (WARN_ON_ONCE(!bdi->dev)) + return; + + bdi_set_min_ratio(bdi, 0); +} +EXPORT_SYMBOL(bdi_unregister); + void bdi_destroy(struct backing_dev_info *bdi) { /* make sure nobody finds us on the bdi_list anymore */ -- cgit v1.1 From 66114cad64bf76a155fec1f0fff0de771cf909d5 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:32 -0400 Subject: writeback: separate out include/linux/backing-dev-defs.h With the planned cgroup writeback support, backing-dev related declarations will be more widely used across block and cgroup; unfortunately, including backing-dev.h from include/linux/blkdev.h makes cyclic include dependency quite likely. This patch separates out backing-dev-defs.h which only has the essential definitions and updates blkdev.h to include it. c files which need access to more backing-dev details now include backing-dev.h directly. This takes backing-dev.h off the common include dependency chain making it a lot easier to use it across block and cgroup. v2: fs/fat build failure fixed. Signed-off-by: Tejun Heo Reviewed-by: Jan Kara Cc: Jens Axboe Signed-off-by: Jens Axboe --- mm/madvise.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm') diff --git a/mm/madvise.c b/mm/madvise.c index d551475..64bb8a2 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include -- cgit v1.1 From a212b105b07d75b48b1a166378282e8a77fbf53d Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:33 -0400 Subject: bdi: make inode_to_bdi() inline Now that bdi definitions are moved to backing-dev-defs.h, backing-dev.h can include blkdev.h and inline inode_to_bdi() without worrying about introducing circular include dependency. The function gets called from hot paths and fairly trivial. This patch makes inode_to_bdi() and sb_is_blkdev_sb() that the function calls inline. 
blockdev_superblock and noop_backing_dev_info are EXPORT_GPL'd to allow the inline functions to be used from modules. While at it, make sb_is_blkdev_sb() return bool instead of int. v2: Fixed typo in description as suggested by Jan. Signed-off-by: Tejun Heo Reviewed-by: Jens Axboe Cc: Christoph Hellwig Signed-off-by: Jens Axboe --- mm/backing-dev.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ff85ecb..b0707d1 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -18,6 +18,7 @@ struct backing_dev_info noop_backing_dev_info = { .name = "noop", .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK, }; +EXPORT_SYMBOL_GPL(noop_backing_dev_info); static struct class *bdi_class; -- cgit v1.1 From 8395cd9f813d5d7ed9870e642230acfcfc1e8a0a Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:34 -0400 Subject: writeback: add @gfp to wb_init() wb_init() currently always uses GFP_KERNEL but the planned cgroup writeback support needs using other allocation masks. Add @gfp to wb_init(). This patch doesn't introduce any behavior changes. 
Signed-off-by: Tejun Heo Reviewed-by: Jan Kara Cc: Jens Axboe Signed-off-by: Jens Axboe --- mm/backing-dev.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index b0707d1..805b287 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -291,7 +291,8 @@ void wb_wakeup_delayed(struct bdi_writeback *wb) */ #define INIT_BW (100 << (20 - PAGE_SHIFT)) -static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) +static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi, + gfp_t gfp) { int i, err; @@ -315,12 +316,12 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) INIT_LIST_HEAD(&wb->work_list); INIT_DELAYED_WORK(&wb->dwork, wb_workfn); - err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL); + err = fprop_local_init_percpu(&wb->completions, gfp); if (err) return err; for (i = 0; i < NR_WB_STAT_ITEMS; i++) { - err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL); + err = percpu_counter_init(&wb->stat[i], 0, gfp); if (err) { while (--i) percpu_counter_destroy(&wb->stat[i]); @@ -378,7 +379,7 @@ int bdi_init(struct backing_dev_info *bdi) bdi->max_prop_frac = FPROP_FRAC_BASE; INIT_LIST_HEAD(&bdi->bdi_list); - err = wb_init(&bdi->wb, bdi); + err = wb_init(&bdi->wb, bdi, GFP_KERNEL); if (err) return err; -- cgit v1.1 From 4aa9c692e052cf6db99db62a8fe0543e5c455da7 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:35 -0400 Subject: bdi: separate out congested state into a separate struct Currently, a wb's (bdi_writeback) congestion state is carried in its ->state field; however, cgroup writeback support will require multiple wb's sharing the same congestion state. This patch separates out congestion state into its own struct - struct bdi_writeback_congested. A new field wb field, wb_congested, points to its associated congested struct. The default wb, bdi->wb, always points to bdi->wb_congested. 
While this patch adds a layer of indirection, it doesn't introduce any behavior changes. Signed-off-by: Tejun Heo Signed-off-by: Jens Axboe --- mm/backing-dev.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 805b287..5ec7658 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -383,6 +383,9 @@ int bdi_init(struct backing_dev_info *bdi) if (err) return err; + bdi->wb_congested.state = 0; + bdi->wb.congested = &bdi->wb_congested; + return 0; } EXPORT_SYMBOL(bdi_init); @@ -504,7 +507,7 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync) wait_queue_head_t *wqh = &congestion_wqh[sync]; bit = sync ? WB_sync_congested : WB_async_congested; - if (test_and_clear_bit(bit, &bdi->wb.state)) + if (test_and_clear_bit(bit, &bdi->wb.congested->state)) atomic_dec(&nr_bdi_congested[sync]); smp_mb__after_atomic(); if (waitqueue_active(wqh)) @@ -517,7 +520,7 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync) enum wb_state bit; bit = sync ? WB_sync_congested : WB_async_congested; - if (!test_and_set_bit(bit, &bdi->wb.state)) + if (!test_and_set_bit(bit, &bdi->wb.congested->state)) atomic_inc(&nr_bdi_congested[sync]); } EXPORT_SYMBOL(set_bdi_congested); -- cgit v1.1 From 52ebea749aaed195245701a8f90a23d672c7a933 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:37 -0400 Subject: writeback: make backing_dev_info host cgroup-specific bdi_writebacks For the planned cgroup writeback support, on each bdi (backing_dev_info), each memcg will be served by a separate wb (bdi_writeback). This patch updates bdi so that a bdi can host multiple wbs (bdi_writebacks). On the default hierarchy, blkcg implicitly enables memcg. This allows using memcg's page ownership for attributing writeback IOs, and every memcg - blkcg combination can be served by its own wb by assigning a dedicated wb to each memcg. 
This means that there may be multiple wb's of a bdi mapped to the same blkcg. As congested state is per blkcg - bdi combination, those wb's should share the same congested state. This is achieved by tracking congested state via bdi_writeback_congested structs which are keyed by blkcg. bdi->wb remains unchanged and will keep serving the root cgroup. cgwb's (cgroup wb's) for non-root cgroups are created on-demand or looked up while dirtying an inode according to the memcg of the page being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree by its memcg id. Once an inode is associated with its wb, it can be retrieved using inode_to_wb(). Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being associated with bdi->wb. v3: inode_attach_wb() in account_page_dirtied() moved inside mapping_cap_account_dirty() block where it's known to be !NULL. Also, an unnecessary NULL check before kfree() removed. Both detected by the kbuild bot. v2: Updated so that wb association is per inode and wb is per memcg rather than blkcg. Signed-off-by: Tejun Heo Cc: kbuild test robot Cc: Dan Carpenter Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/backing-dev.c | 397 ++++++++++++++++++++++++++++++++++++++++++++++++++++ mm/memcontrol.c | 19 ++- mm/page-writeback.c | 11 +- 3 files changed, 423 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 5ec7658..4c9386c 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -368,6 +368,401 @@ static void wb_exit(struct bdi_writeback *wb) fprop_local_destroy_percpu(&wb->completions); } +#ifdef CONFIG_CGROUP_WRITEBACK + +#include + +/* + * cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree, + * blkcg->cgwb_list, and memcg->cgwb_list. bdi->cgwb_tree is also RCU + * protected. cgwb_release_wait is used to wait for the completion of cgwb + * releases from bdi destruction path. 
+ */ +static DEFINE_SPINLOCK(cgwb_lock); +static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait); + +/** + * wb_congested_get_create - get or create a wb_congested + * @bdi: associated bdi + * @blkcg_id: ID of the associated blkcg + * @gfp: allocation mask + * + * Look up the wb_congested for @blkcg_id on @bdi. If missing, create one. + * The returned wb_congested has its reference count incremented. Returns + * NULL on failure. + */ +struct bdi_writeback_congested * +wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp) +{ + struct bdi_writeback_congested *new_congested = NULL, *congested; + struct rb_node **node, *parent; + unsigned long flags; + + if (blkcg_id == 1) + return &bdi->wb_congested; +retry: + spin_lock_irqsave(&cgwb_lock, flags); + + node = &bdi->cgwb_congested_tree.rb_node; + parent = NULL; + + while (*node != NULL) { + parent = *node; + congested = container_of(parent, struct bdi_writeback_congested, + rb_node); + if (congested->blkcg_id < blkcg_id) + node = &parent->rb_left; + else if (congested->blkcg_id > blkcg_id) + node = &parent->rb_right; + else + goto found; + } + + if (new_congested) { + /* !found and storage for new one already allocated, insert */ + congested = new_congested; + new_congested = NULL; + rb_link_node(&congested->rb_node, parent, node); + rb_insert_color(&congested->rb_node, &bdi->cgwb_congested_tree); + atomic_inc(&bdi->usage_cnt); + goto found; + } + + spin_unlock_irqrestore(&cgwb_lock, flags); + + /* allocate storage for new one and retry */ + new_congested = kzalloc(sizeof(*new_congested), gfp); + if (!new_congested) + return NULL; + + atomic_set(&new_congested->refcnt, 0); + new_congested->bdi = bdi; + new_congested->blkcg_id = blkcg_id; + goto retry; + +found: + atomic_inc(&congested->refcnt); + spin_unlock_irqrestore(&cgwb_lock, flags); + kfree(new_congested); + return congested; +} + +/** + * wb_congested_put - put a wb_congested + * @congested: wb_congested to put + * + * Put @congested and 
destroy it if the refcnt reaches zero. + */ +void wb_congested_put(struct bdi_writeback_congested *congested) +{ + struct backing_dev_info *bdi = congested->bdi; + unsigned long flags; + + if (congested->blkcg_id == 1) + return; + + local_irq_save(flags); + if (!atomic_dec_and_lock(&congested->refcnt, &cgwb_lock)) { + local_irq_restore(flags); + return; + } + + rb_erase(&congested->rb_node, &congested->bdi->cgwb_congested_tree); + spin_unlock_irqrestore(&cgwb_lock, flags); + kfree(congested); + + if (atomic_dec_and_test(&bdi->usage_cnt)) + wake_up_all(&cgwb_release_wait); +} + +static void cgwb_release_workfn(struct work_struct *work) +{ + struct bdi_writeback *wb = container_of(work, struct bdi_writeback, + release_work); + struct backing_dev_info *bdi = wb->bdi; + + wb_shutdown(wb); + + css_put(wb->memcg_css); + css_put(wb->blkcg_css); + wb_congested_put(wb->congested); + + percpu_ref_exit(&wb->refcnt); + wb_exit(wb); + kfree_rcu(wb, rcu); + + if (atomic_dec_and_test(&bdi->usage_cnt)) + wake_up_all(&cgwb_release_wait); +} + +static void cgwb_release(struct percpu_ref *refcnt) +{ + struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback, + refcnt); + schedule_work(&wb->release_work); +} + +static void cgwb_kill(struct bdi_writeback *wb) +{ + lockdep_assert_held(&cgwb_lock); + + WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id)); + list_del(&wb->memcg_node); + list_del(&wb->blkcg_node); + percpu_ref_kill(&wb->refcnt); +} + +static int cgwb_create(struct backing_dev_info *bdi, + struct cgroup_subsys_state *memcg_css, gfp_t gfp) +{ + struct mem_cgroup *memcg; + struct cgroup_subsys_state *blkcg_css; + struct blkcg *blkcg; + struct list_head *memcg_cgwb_list, *blkcg_cgwb_list; + struct bdi_writeback *wb; + unsigned long flags; + int ret = 0; + + memcg = mem_cgroup_from_css(memcg_css); + blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &blkio_cgrp_subsys); + blkcg = css_to_blkcg(blkcg_css); + memcg_cgwb_list = mem_cgroup_cgwb_list(memcg); + 
blkcg_cgwb_list = &blkcg->cgwb_list; + + /* look up again under lock and discard on blkcg mismatch */ + spin_lock_irqsave(&cgwb_lock, flags); + wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + if (wb && wb->blkcg_css != blkcg_css) { + cgwb_kill(wb); + wb = NULL; + } + spin_unlock_irqrestore(&cgwb_lock, flags); + if (wb) + goto out_put; + + /* need to create a new one */ + wb = kmalloc(sizeof(*wb), gfp); + if (!wb) + return -ENOMEM; + + ret = wb_init(wb, bdi, gfp); + if (ret) + goto err_free; + + ret = percpu_ref_init(&wb->refcnt, cgwb_release, 0, gfp); + if (ret) + goto err_wb_exit; + + wb->congested = wb_congested_get_create(bdi, blkcg_css->id, gfp); + if (!wb->congested) + goto err_ref_exit; + + wb->memcg_css = memcg_css; + wb->blkcg_css = blkcg_css; + INIT_WORK(&wb->release_work, cgwb_release_workfn); + set_bit(WB_registered, &wb->state); + + /* + * The root wb determines the registered state of the whole bdi and + * memcg_cgwb_list and blkcg_cgwb_list's next pointers indicate + * whether they're still online. Don't link @wb if any is dead. + * See wb_memcg_offline() and wb_blkcg_offline(). 
+ */ + ret = -ENODEV; + spin_lock_irqsave(&cgwb_lock, flags); + if (test_bit(WB_registered, &bdi->wb.state) && + blkcg_cgwb_list->next && memcg_cgwb_list->next) { + /* we might have raced another instance of this function */ + ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb); + if (!ret) { + atomic_inc(&bdi->usage_cnt); + list_add(&wb->memcg_node, memcg_cgwb_list); + list_add(&wb->blkcg_node, blkcg_cgwb_list); + css_get(memcg_css); + css_get(blkcg_css); + } + } + spin_unlock_irqrestore(&cgwb_lock, flags); + if (ret) { + if (ret == -EEXIST) + ret = 0; + goto err_put_congested; + } + goto out_put; + +err_put_congested: + wb_congested_put(wb->congested); +err_ref_exit: + percpu_ref_exit(&wb->refcnt); +err_wb_exit: + wb_exit(wb); +err_free: + kfree(wb); +out_put: + css_put(blkcg_css); + return ret; +} + +/** + * wb_get_create - get wb for a given memcg, create if necessary + * @bdi: target bdi + * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref) + * @gfp: allocation mask to use + * + * Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to + * create one. The returned wb has its refcount incremented. + * + * This function uses css_get() on @memcg_css and thus expects its refcnt + * to be positive on invocation. IOW, rcu_read_lock() protection on + * @memcg_css isn't enough. try_get it before calling this function. + * + * A wb is keyed by its associated memcg. As blkcg implicitly enables + * memcg on the default hierarchy, memcg association is guaranteed to be + * more specific (equal or descendant to the associated blkcg) and thus can + * identify both the memcg and blkcg associations. + * + * Because the blkcg associated with a memcg may change as blkcg is enabled + * and disabled closer to root in the hierarchy, each wb keeps track of + * both the memcg and blkcg associated with it and verifies the blkcg on + * each lookup. On mismatch, the existing wb is discarded and a new one is + * created. 
+ */ +struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, + struct cgroup_subsys_state *memcg_css, + gfp_t gfp) +{ + struct bdi_writeback *wb; + + might_sleep_if(gfp & __GFP_WAIT); + + if (!memcg_css->parent) + return &bdi->wb; + + do { + rcu_read_lock(); + wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + if (wb) { + struct cgroup_subsys_state *blkcg_css; + + /* see whether the blkcg association has changed */ + blkcg_css = cgroup_get_e_css(memcg_css->cgroup, + &blkio_cgrp_subsys); + if (unlikely(wb->blkcg_css != blkcg_css || + !wb_tryget(wb))) + wb = NULL; + css_put(blkcg_css); + } + rcu_read_unlock(); + } while (!wb && !cgwb_create(bdi, memcg_css, gfp)); + + return wb; +} + +void __inode_attach_wb(struct inode *inode, struct page *page) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback *wb = NULL; + + if (inode_cgwb_enabled(inode)) { + struct cgroup_subsys_state *memcg_css; + + if (page) { + memcg_css = mem_cgroup_css_from_page(page); + wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + } else { + /* must pin memcg_css, see wb_get_create() */ + memcg_css = task_get_css(current, memory_cgrp_id); + wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + css_put(memcg_css); + } + } + + if (!wb) + wb = &bdi->wb; + + /* + * There may be multiple instances of this function racing to + * update the same inode. Use cmpxchg() to tell the winner. 
+ */ + if (unlikely(cmpxchg(&inode->i_wb, NULL, wb))) + wb_put(wb); +} + +static void cgwb_bdi_init(struct backing_dev_info *bdi) +{ + bdi->wb.memcg_css = mem_cgroup_root_css; + bdi->wb.blkcg_css = blkcg_root_css; + bdi->wb_congested.blkcg_id = 1; + INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC); + bdi->cgwb_congested_tree = RB_ROOT; + atomic_set(&bdi->usage_cnt, 1); +} + +static void cgwb_bdi_destroy(struct backing_dev_info *bdi) +{ + struct radix_tree_iter iter; + void **slot; + + WARN_ON(test_bit(WB_registered, &bdi->wb.state)); + + spin_lock_irq(&cgwb_lock); + radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) + cgwb_kill(*slot); + spin_unlock_irq(&cgwb_lock); + + /* + * All cgwb's and their congested states must be shutdown and + * released before returning. Drain the usage counter to wait for + * all cgwb's and cgwb_congested's ever created on @bdi. + */ + atomic_dec(&bdi->usage_cnt); + wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt)); +} + +/** + * wb_memcg_offline - kill all wb's associated with a memcg being offlined + * @memcg: memcg being offlined + * + * Also prevents creation of any new wb's associated with @memcg. + */ +void wb_memcg_offline(struct mem_cgroup *memcg) +{ + LIST_HEAD(to_destroy); + struct list_head *memcg_cgwb_list = mem_cgroup_cgwb_list(memcg); + struct bdi_writeback *wb, *next; + + spin_lock_irq(&cgwb_lock); + list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node) + cgwb_kill(wb); + memcg_cgwb_list->next = NULL; /* prevent new wb's */ + spin_unlock_irq(&cgwb_lock); +} + +/** + * wb_blkcg_offline - kill all wb's associated with a blkcg being offlined + * @blkcg: blkcg being offlined + * + * Also prevents creation of any new wb's associated with @blkcg. 
+ */ +void wb_blkcg_offline(struct blkcg *blkcg) +{ + LIST_HEAD(to_destroy); + struct bdi_writeback *wb, *next; + + spin_lock_irq(&cgwb_lock); + list_for_each_entry_safe(wb, next, &blkcg->cgwb_list, blkcg_node) + cgwb_kill(wb); + blkcg->cgwb_list.next = NULL; /* prevent new wb's */ + spin_unlock_irq(&cgwb_lock); +} + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static void cgwb_bdi_init(struct backing_dev_info *bdi) { } +static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { } + +#endif /* CONFIG_CGROUP_WRITEBACK */ + int bdi_init(struct backing_dev_info *bdi) { int err; @@ -386,6 +781,7 @@ int bdi_init(struct backing_dev_info *bdi) bdi->wb_congested.state = 0; bdi->wb.congested = &bdi->wb_congested; + cgwb_bdi_init(bdi); return 0; } EXPORT_SYMBOL(bdi_init); @@ -459,6 +855,7 @@ void bdi_destroy(struct backing_dev_info *bdi) /* make sure nobody finds us on the bdi_list anymore */ bdi_remove_from_list(bdi); wb_shutdown(&bdi->wb); + cgwb_bdi_destroy(bdi); if (bdi->dev) { bdi_debug_unregister(bdi); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5c270a0..49e5aa6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -348,6 +348,10 @@ struct mem_cgroup { atomic_t numainfo_updating; #endif +#ifdef CONFIG_CGROUP_WRITEBACK + struct list_head cgwb_list; +#endif + /* List of events which userspace want to receive */ struct list_head event_list; spinlock_t event_list_lock; @@ -4030,6 +4034,15 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg) } #endif +#ifdef CONFIG_CGROUP_WRITEBACK + +struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg) +{ + return &memcg->cgwb_list; +} + +#endif /* CONFIG_CGROUP_WRITEBACK */ + /* * DO NOT USE IN NEW FILES. 
* @@ -4494,7 +4507,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) #ifdef CONFIG_MEMCG_KMEM memcg->kmemcg_id = -1; #endif - +#ifdef CONFIG_CGROUP_WRITEBACK + INIT_LIST_HEAD(&memcg->cgwb_list); +#endif return &memcg->css; free_out: @@ -4582,6 +4597,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) vmpressure_cleanup(&memcg->vmpressure); memcg_deactivate_kmem(memcg); + + wb_memcg_offline(memcg); } static void mem_cgroup_css_free(struct cgroup_subsys_state *css) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 78ef551..9b95cf8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2097,16 +2097,21 @@ int __set_page_dirty_no_writeback(struct page *page) void account_page_dirtied(struct page *page, struct address_space *mapping, struct mem_cgroup *memcg) { + struct inode *inode = mapping->host; + trace_writeback_dirty_page(page, mapping); if (mapping_cap_account_dirty(mapping)) { - struct backing_dev_info *bdi = inode_to_bdi(mapping->host); + struct bdi_writeback *wb; + + inode_attach_wb(inode, page); + wb = inode_to_wb(inode); mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_zone_page_state(page, NR_DIRTIED); - __inc_wb_stat(&bdi->wb, WB_RECLAIMABLE); - __inc_wb_stat(&bdi->wb, WB_DIRTIED); + __inc_wb_stat(wb, WB_RECLAIMABLE); + __inc_wb_stat(wb, WB_DIRTIED); task_io_account_write(PAGE_CACHE_SIZE); current->nr_dirtied++; this_cpu_inc(bdp_ratelimits); -- cgit v1.1 From 910181343774cd5fed95900d9fd2cb4ff7758162 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:39 -0400 Subject: writeback: attribute stats to the matching per-cgroup bdi_writeback Until now, all WB_* stats were accounted against the root wb (bdi_writeback), now that multiple wb (bdi_writeback) support is in place, let's attributes the stats to the respective per-cgroup wb's. As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to visible behavior differences. 
v2: Updated for per-inode wb association. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/page-writeback.c | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 9b95cf8..4d0a9da 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2130,7 +2130,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping, if (mapping_cap_account_dirty(mapping)) { mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); - dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_RECLAIMABLE); + dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE); task_io_account_cancelled_write(PAGE_CACHE_SIZE); } } @@ -2191,10 +2191,13 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers); void account_page_redirty(struct page *page) { struct address_space *mapping = page->mapping; + if (mapping && mapping_cap_account_dirty(mapping)) { + struct bdi_writeback *wb = inode_to_wb(mapping->host); + current->nr_dirtied--; dec_zone_page_state(page, NR_DIRTIED); - dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_DIRTIED); + dec_wb_stat(wb, WB_DIRTIED); } } EXPORT_SYMBOL(account_page_redirty); @@ -2373,8 +2376,7 @@ int clear_page_dirty_for_io(struct page *page) if (TestClearPageDirty(page)) { mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); - dec_wb_stat(&inode_to_bdi(mapping->host)->wb, - WB_RECLAIMABLE); + dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE); ret = 1; } mem_cgroup_end_page_stat(memcg); @@ -2392,7 +2394,8 @@ int test_clear_page_writeback(struct page *page) memcg = mem_cgroup_begin_page_stat(page); if (mapping) { - struct backing_dev_info *bdi = inode_to_bdi(mapping->host); + struct inode *inode = mapping->host; + struct backing_dev_info *bdi = inode_to_bdi(inode); unsigned long flags; spin_lock_irqsave(&mapping->tree_lock, flags); @@ -2402,8 
+2405,10 @@ int test_clear_page_writeback(struct page *page) page_index(page), PAGECACHE_TAG_WRITEBACK); if (bdi_cap_account_writeback(bdi)) { - __dec_wb_stat(&bdi->wb, WB_WRITEBACK); - __wb_writeout_inc(&bdi->wb); + struct bdi_writeback *wb = inode_to_wb(inode); + + __dec_wb_stat(wb, WB_WRITEBACK); + __wb_writeout_inc(wb); } } spin_unlock_irqrestore(&mapping->tree_lock, flags); @@ -2427,7 +2432,8 @@ int __test_set_page_writeback(struct page *page, bool keep_write) memcg = mem_cgroup_begin_page_stat(page); if (mapping) { - struct backing_dev_info *bdi = inode_to_bdi(mapping->host); + struct inode *inode = mapping->host; + struct backing_dev_info *bdi = inode_to_bdi(inode); unsigned long flags; spin_lock_irqsave(&mapping->tree_lock, flags); @@ -2437,7 +2443,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write) page_index(page), PAGECACHE_TAG_WRITEBACK); if (bdi_cap_account_writeback(bdi)) - __inc_wb_stat(&bdi->wb, WB_WRITEBACK); + __inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK); } if (!PageDirty(page)) radix_tree_tag_clear(&mapping->page_tree, -- cgit v1.1 From dfb8ae567835425d27db8acc6c9fc5db88d38e2b Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:40 -0400 Subject: writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback Currently, balance_dirty_pages() always work on bdi->wb. This patch updates it to work on the wb (bdi_writeback) matching memcg and blkcg of the current task as that's what the inode is being dirtied against. balance_dirty_pages_ratelimited() now pins the current wb and passes it to balance_dirty_pages(). As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to visible behavior differences. v2: Updated for per-inode wb association. 
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/page-writeback.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 4d0a9da..e31dea9 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1337,6 +1337,7 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb, * perform some writeout. */ static void balance_dirty_pages(struct address_space *mapping, + struct bdi_writeback *wb, unsigned long pages_dirtied) { unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */ @@ -1352,8 +1353,7 @@ static void balance_dirty_pages(struct address_space *mapping, unsigned long task_ratelimit; unsigned long dirty_ratelimit; unsigned long pos_ratio; - struct backing_dev_info *bdi = inode_to_bdi(mapping->host); - struct bdi_writeback *wb = &bdi->wb; + struct backing_dev_info *bdi = wb->bdi; bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; unsigned long start_time = jiffies; @@ -1575,14 +1575,20 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0; */ void balance_dirty_pages_ratelimited(struct address_space *mapping) { - struct backing_dev_info *bdi = inode_to_bdi(mapping->host); - struct bdi_writeback *wb = &bdi->wb; + struct inode *inode = mapping->host; + struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback *wb = NULL; int ratelimit; int *p; if (!bdi_cap_account_dirty(bdi)) return; + if (inode_cgwb_enabled(inode)) + wb = wb_get_create_current(bdi, GFP_KERNEL); + if (!wb) + wb = &bdi->wb; + ratelimit = current->nr_dirtied_pause; if (wb->dirty_exceeded) ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10)); @@ -1616,7 +1622,9 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping) preempt_enable(); if (unlikely(current->nr_dirtied >= ratelimit)) - balance_dirty_pages(mapping, current->nr_dirtied); + balance_dirty_pages(mapping, wb, current->nr_dirtied); + + wb_put(wb); } 
EXPORT_SYMBOL(balance_dirty_pages_ratelimited); -- cgit v1.1 From ec8a6f2643923ee5b74d24fa8d134240379f436b Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:41 -0400 Subject: writeback: make congestion functions per bdi_writeback Currently, all congestion functions take bdi (backing_dev_info) and always operate on the root wb (bdi->wb) and the congestion state from the block layer is propagated only for the root blkcg. This patch introduces {set|clear}_wb_congested() and wb_congested() which take a bdi_writeback_congested and bdi_writeback respectively. The bdi counterparts are now wrappers invoking the wb based functions on @bdi->wb. While converting clear_bdi_congested() to clear_wb_congested(), the local variable declaration order between @wqh and @bit is swapped for cosmetic reasons. This patch just adds the new wb based functions. The following patches will apply them. v2: Updated for bdi_writeback_congested. Signed-off-by: Tejun Heo Reviewed-by: Jan Kara Cc: Jens Axboe Signed-off-by: Jens Axboe --- mm/backing-dev.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 4c9386c..5029c4a 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -896,31 +896,31 @@ static wait_queue_head_t congestion_wqh[2] = { __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]), __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1]) }; -static atomic_t nr_bdi_congested[2]; +static atomic_t nr_wb_congested[2]; -void clear_bdi_congested(struct backing_dev_info *bdi, int sync) +void clear_wb_congested(struct bdi_writeback_congested *congested, int sync) { - enum wb_state bit; wait_queue_head_t *wqh = &congestion_wqh[sync]; + enum wb_state bit; bit = sync ?
WB_sync_congested : WB_async_congested; - if (test_and_clear_bit(bit, &bdi->wb.congested->state)) - atomic_dec(&nr_bdi_congested[sync]); + if (test_and_clear_bit(bit, &congested->state)) + atomic_dec(&nr_wb_congested[sync]); smp_mb__after_atomic(); if (waitqueue_active(wqh)) wake_up(wqh); } -EXPORT_SYMBOL(clear_bdi_congested); +EXPORT_SYMBOL(clear_wb_congested); -void set_bdi_congested(struct backing_dev_info *bdi, int sync) +void set_wb_congested(struct bdi_writeback_congested *congested, int sync) { enum wb_state bit; bit = sync ? WB_sync_congested : WB_async_congested; - if (!test_and_set_bit(bit, &bdi->wb.congested->state)) - atomic_inc(&nr_bdi_congested[sync]); + if (!test_and_set_bit(bit, &congested->state)) + atomic_inc(&nr_wb_congested[sync]); } -EXPORT_SYMBOL(set_bdi_congested); +EXPORT_SYMBOL(set_wb_congested); /** * congestion_wait - wait for a backing_dev to become uncongested @@ -979,7 +979,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout) * encountered in the current zone, yield if necessary instead * of sleeping on the congestion queue */ - if (atomic_read(&nr_bdi_congested[sync]) == 0 || + if (atomic_read(&nr_wb_congested[sync]) == 0 || !test_bit(ZONE_CONGESTED, &zone->flags)) { cond_resched(); -- cgit v1.1 From 703c270887bb5106c4c46a00cc7477d30d5e04f5 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:44 -0400 Subject: writeback: implement and use inode_congested() In several places, bdi_congested() and its wrappers are used to determine whether more IOs should be issued. With cgroup writeback support, this question can't be answered solely based on the bdi (backing_dev_info). It's dependent on whether the filesystem and bdi support cgroup writeback and the blkcg the inode is associated with. This patch implements inode_congested() and its wrappers which take @inode and determines the congestion state considering cgroup writeback. 
The new functions replace bdi_*congested() calls in places where the query is about specific inode and task. There are several filesystem users which also fit this criteria but they should be updated when each filesystem implements cgroup writeback support. v2: Now that a given inode is associated with only one wb, congestion state can be determined independent from the asking task. Drop @task. Spotted by Vivek. Also, converted to take @inode instead of @mapping and renamed to inode_congested(). Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Vivek Goyal Signed-off-by: Jens Axboe --- mm/fadvise.c | 2 +- mm/readahead.c | 2 +- mm/vmscan.c | 11 +++++------ 3 files changed, 7 insertions(+), 8 deletions(-) (limited to 'mm') diff --git a/mm/fadvise.c b/mm/fadvise.c index 4a3907c..b8a5bc6 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -115,7 +115,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) case POSIX_FADV_NOREUSE: break; case POSIX_FADV_DONTNEED: - if (!bdi_write_congested(bdi)) + if (!inode_write_congested(mapping->host)) __filemap_fdatawrite_range(mapping, offset, endbyte, WB_SYNC_NONE); diff --git a/mm/readahead.c b/mm/readahead.c index 9356758..60cd846 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -541,7 +541,7 @@ page_cache_async_readahead(struct address_space *mapping, /* * Defer asynchronous read-ahead on IO congestion. 
*/ - if (bdi_read_congested(inode_to_bdi(mapping->host))) + if (inode_read_congested(mapping->host)) return; /* do read-ahead */ diff --git a/mm/vmscan.c b/mm/vmscan.c index 7582f9f..f463398 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -452,14 +452,13 @@ static inline int is_page_cache_freeable(struct page *page) return page_count(page) - page_has_private(page) == 2; } -static int may_write_to_queue(struct backing_dev_info *bdi, - struct scan_control *sc) +static int may_write_to_inode(struct inode *inode, struct scan_control *sc) { if (current->flags & PF_SWAPWRITE) return 1; - if (!bdi_write_congested(bdi)) + if (!inode_write_congested(inode)) return 1; - if (bdi == current->backing_dev_info) + if (inode_to_bdi(inode) == current->backing_dev_info) return 1; return 0; } @@ -538,7 +537,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, } if (mapping->a_ops->writepage == NULL) return PAGE_ACTIVATE; - if (!may_write_to_queue(inode_to_bdi(mapping->host), sc)) + if (!may_write_to_inode(mapping->host, sc)) return PAGE_KEEP; if (clear_page_dirty_for_io(page)) { @@ -924,7 +923,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, */ mapping = page_mapping(page); if (((dirty || writeback) && mapping && - bdi_write_congested(inode_to_bdi(mapping->host))) || + inode_write_congested(mapping->host)) || (writeback && PageReclaim(page))) nr_congested++; -- cgit v1.1 From d6c10f1fc8626dc55946f4768ae322b4c57b07dd Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:45 -0400 Subject: writeback: implement WB_has_dirty_io wb_state flag Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback) has any dirty inode by testing all three IO lists on each invocation without actively keeping track. 
For cgroup writeback support, a single bdi will host multiple wb's each of which will host dirty inodes separately and we'll need to make bdi_has_dirty_io(), which currently only represents the root wb, aggregate has_dirty_io from all member wb's, which requires tracking transitions in has_dirty_io state on each wb. This patch introduces inode_wb_list_{move|del}_locked() to consolidate IO list operations leaving queue_io() the only other function which directly manipulates IO lists (via move_expired_inodes()). All three functions are updated to call wb_io_lists_[de]populated() which keep track of whether the wb has dirty inodes or not and record it using the new WB_has_dirty_io flag. inode_wb_list_move_locked()'s return value indicates whether the wb had no dirty inodes before. mark_inode_dirty() is restructured so that the return value of inode_wb_list_move_locked() can be used for deciding whether to wake up the wb. While at it, change {bdi|wb}_has_dirty_io()'s return values to bool. These functions were returning 0 and 1 before. Also, add a comment explaining the synchronization of wb_state flags. v2: Updated to accommodate b_dirty_time.
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/backing-dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 5029c4a..161ddf1 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -256,7 +256,7 @@ static int __init default_bdi_init(void) } subsys_initcall(default_bdi_init); -int bdi_has_dirty_io(struct backing_dev_info *bdi) +bool bdi_has_dirty_io(struct backing_dev_info *bdi) { return wb_has_dirty_io(&bdi->wb); } -- cgit v1.1 From 766a9d6e60578f1ef6de71f89f022084f8bffc82 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:46 -0400 Subject: writeback: implement backing_dev_info->tot_write_bandwidth cgroup writeback support needs to keep track of the sum of avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to distribute write workload. This patch adds bdi->tot_write_bandwidth and updates inode_wb_list_move_locked(), inode_wb_list_del_locked() and wb_update_write_bandwidth() to adjust it as wb's gain and lose dirty inodes and its avg_write_bandwidth gets updated. As the update events are not synchronized with each other, bdi->tot_write_bandwidth is an atomic_long_t. 
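The accounting described above can be modeled in userspace roughly as follows. This is only a sketch: the `model_*` types and functions are hypothetical stand-ins rather than kernel APIs, and C11 atomics take the place of atomic_long_t. It shows the total changing when a wb gains or loses dirty inodes, and only the delta being applied when a dirty wb's average is updated.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for backing_dev_info. */
struct model_bdi {
	atomic_long tot_write_bandwidth; /* sum over wbs that have dirty inodes */
};

/* Hypothetical userspace stand-in for bdi_writeback. */
struct model_wb {
	struct model_bdi *bdi;
	long avg_write_bandwidth;
	bool has_dirty_io;
};

/* The wb gained its first dirty inode: start counting its bandwidth. */
void model_wb_populated(struct model_wb *wb)
{
	if (wb->has_dirty_io)
		return;
	wb->has_dirty_io = true;
	atomic_fetch_add(&wb->bdi->tot_write_bandwidth,
			 wb->avg_write_bandwidth);
}

/* The wb lost its last dirty inode: stop counting its bandwidth. */
void model_wb_depopulated(struct model_wb *wb)
{
	if (!wb->has_dirty_io)
		return;
	wb->has_dirty_io = false;
	atomic_fetch_sub(&wb->bdi->tot_write_bandwidth,
			 wb->avg_write_bandwidth);
}

/* New bandwidth estimate: apply only the delta, and only for a dirty wb. */
void model_wb_update_avg(struct model_wb *wb, long avg)
{
	if (wb->has_dirty_io)
		atomic_fetch_add(&wb->bdi->tot_write_bandwidth,
				 avg - wb->avg_write_bandwidth);
	wb->avg_write_bandwidth = avg;
}
```

Because each call touches the shared total with a single atomic add or subtract, callers in this model need no common lock around the total itself.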
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/page-writeback.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index e31dea9..c95eb24 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -881,6 +881,9 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb, avg += (old - avg) >> 3; out: + if (wb_has_dirty_io(wb)) + atomic_long_add(avg - wb->avg_write_bandwidth, + &wb->bdi->tot_write_bandwidth); wb->write_bandwidth = bw; wb->avg_write_bandwidth = avg; } -- cgit v1.1 From 95a46c65e3c09edb9f17dabf2dc16670cd328739 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:47 -0400 Subject: writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account bdi_has_dirty_io() used to only reflect whether the root wb (bdi_writeback) has dirty inodes. For cgroup writeback support, it needs to take all active wb's into account. If any wb on the bdi has dirty inodes, bdi_has_dirty_io() should return true. To achieve that, as inode_wb_list_{move|del}_locked() now keep track of the dirty state transition of each wb, the number of dirty wbs can be counted in the bdi; however, bdi is already aggregating wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when there are any dirty inodes by ensuring wb->avg_write_bandwidth can't dip below 1. bdi_has_dirty_io() can simply test whether bdi->tot_write_bandwidth is zero or not. While this bumps the value of wb->avg_write_bandwidth to one when it used to be zero, this shouldn't cause any meaningful behavior difference. bdi_has_dirty_io() is made an inline function which tests whether ->tot_write_bandwidth is non-zero. Also, WARN_ON_ONCE()'s on its value are added to inode_wb_list_{move|del}_locked(). 
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/backing-dev.c | 5 ----- mm/page-writeback.c | 10 +++++++--- 2 files changed, 7 insertions(+), 8 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 161ddf1..d2f16fc9 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -256,11 +256,6 @@ static int __init default_bdi_init(void) } subsys_initcall(default_bdi_init); -bool bdi_has_dirty_io(struct backing_dev_info *bdi) -{ - return wb_has_dirty_io(&bdi->wb); -} - /* * This function is used when the first inode for this wb is marked dirty. It * wakes-up the corresponding bdi thread which should then take care of the diff --git a/mm/page-writeback.c b/mm/page-writeback.c index c95eb24..99b8846 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -881,9 +881,13 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb, avg += (old - avg) >> 3; out: - if (wb_has_dirty_io(wb)) - atomic_long_add(avg - wb->avg_write_bandwidth, - &wb->bdi->tot_write_bandwidth); + /* keep avg > 0 to guarantee that tot > 0 if there are dirty wbs */ + avg = max(avg, 1LU); + if (wb_has_dirty_io(wb)) { + long delta = avg - wb->avg_write_bandwidth; + WARN_ON_ONCE(atomic_long_add_return(delta, + &wb->bdi->tot_write_bandwidth) <= 0); + } wb->write_bandwidth = bw; wb->avg_write_bandwidth = avg; } -- cgit v1.1 From 693108a8a6672cec88265d83f7187dc83ba1d6a3 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:49 -0400 Subject: writeback: make bdi->min/max_ratio handling cgroup writeback aware bdi->min/max_ratio are user-configurable per-bdi knobs which regulate dirty limit of each bdi. For cgroup writeback, they need to be further distributed across wb's (bdi_writeback's) belonging to the configured bdi. This patch introduces wb_min_max_ratio() which distributes bdi->min/max_ratio according to a wb's proportion in the total active bandwidth of its bdi. 
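As a worked illustration of that proportional split, the sketch below follows the same arithmetic as this patch's wb_min_max_ratio() in plain userspace C. The function name and the flat parameter list are assumptions made for the example; the kernel version reads these values from the wb and its bdi.

```c
#include <assert.h>

/*
 * Hypothetical helper: scale a bdi's min/max ratio (in percent) by one
 * wb's share of the bdi's total write bandwidth.
 */
void model_wb_min_max_ratio(unsigned long this_bw, unsigned long tot_bw,
			    unsigned long bdi_min, unsigned long bdi_max,
			    unsigned long *minp, unsigned long *maxp)
{
	unsigned long long min = bdi_min;
	unsigned long long max = bdi_max;

	/*
	 * Scale only when this wb is a strict part of the total. This also
	 * keeps the divisions safe, since this_bw < tot_bw implies tot_bw > 0.
	 */
	if (this_bw < tot_bw) {
		if (min)
			min = min * this_bw / tot_bw;
		/* a max of 100% means "no limit" and is left unscaled */
		if (max < 100)
			max = max * this_bw / tot_bw;
	}

	*minp = (unsigned long)min;
	*maxp = (unsigned long)max;
}
```

For example, a wb carrying 25 of 100 units of bandwidth on a bdi configured with min_ratio 10 and max_ratio 50 ends up with a minimum of 2 and a maximum of 12 after integer division.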
v2: Update wb_min_max_ratio() to fix a bug where both min and max were assigned the min value and avoid calculations when possible. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/page-writeback.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 46 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 99b8846..9b55f12 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -155,6 +155,46 @@ static unsigned long writeout_period_time = 0; */ #define VM_COMPLETIONS_PERIOD_LEN (3*HZ) +#ifdef CONFIG_CGROUP_WRITEBACK + +static void wb_min_max_ratio(struct bdi_writeback *wb, + unsigned long *minp, unsigned long *maxp) +{ + unsigned long this_bw = wb->avg_write_bandwidth; + unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth); + unsigned long long min = wb->bdi->min_ratio; + unsigned long long max = wb->bdi->max_ratio; + + /* + * @wb may already be clean by the time control reaches here and + * the total may not include its bw. + */ + if (this_bw < tot_bw) { + if (min) { + min *= this_bw; + do_div(min, tot_bw); + } + if (max < 100) { + max *= this_bw; + do_div(max, tot_bw); + } + } + + *minp = min; + *maxp = max; +} + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static void wb_min_max_ratio(struct bdi_writeback *wb, + unsigned long *minp, unsigned long *maxp) +{ + *minp = wb->bdi->min_ratio; + *maxp = wb->bdi->max_ratio; +} + +#endif /* CONFIG_CGROUP_WRITEBACK */ + /* * In a memory zone, there is a certain amount of pages we consider * available for the page cache, which is essentially the number of @@ -539,9 +579,9 @@ static unsigned long hard_dirty_limit(unsigned long thresh) */ unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty) { - struct backing_dev_info *bdi = wb->bdi; u64 wb_dirty; long numerator, denominator; + unsigned long wb_min_ratio, wb_max_ratio; /* * Calculate this BDI's share of the dirty ratio. 
@@ -552,9 +592,11 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty) wb_dirty *= numerator; do_div(wb_dirty, denominator); - wb_dirty += (dirty * bdi->min_ratio) / 100; - if (wb_dirty > (dirty * bdi->max_ratio) / 100) - wb_dirty = dirty * bdi->max_ratio / 100; + wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio); + + wb_dirty += (dirty * wb_min_ratio) / 100; + if (wb_dirty > (dirty * wb_max_ratio) / 100) + wb_dirty = dirty * wb_max_ratio / 100; return wb_dirty; } -- cgit v1.1 From c00ddad39f512b1a81e25b7892217ce10efab0f1 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:51 -0400 Subject: writeback: remove bdi_start_writeback() bdi_start_writeback() is a thin wrapper on top of __wb_start_writeback() which is used only by laptop_mode_timer_fn(). This patch removes bdi_start_writeback(), renames __wb_start_writeback() to wb_start_writeback() and makes laptop_mode_timer_fn() use it instead. This doesn't cause any functional difference and will ease making laptop_mode_timer_fn() cgroup writeback aware.
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/page-writeback.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 9b55f12..6301af2 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1729,8 +1729,8 @@ void laptop_mode_timer_fn(unsigned long data) * threshold */ if (bdi_has_dirty_io(&q->backing_dev_info)) - bdi_start_writeback(&q->backing_dev_info, nr_pages, - WB_REASON_LAPTOP_TIMER); + wb_start_writeback(&q->backing_dev_info.wb, nr_pages, true, + WB_REASON_LAPTOP_TIMER); } /* -- cgit v1.1 From a06fd6b102286e3b727ed42b8fb37825fa7127a2 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:52 -0400 Subject: writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's For cgroup writeback support, all bdi-wide operations should be distributed to all its wb's (bdi_writeback's). This patch updates laptop_mode_timer_fn() so that it invokes wb_start_writeback() on all wb's rather than just the root one. As the intent is writing out all dirty data, there's no reason to split the number of pages to write. 
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/page-writeback.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 6301af2..682e3a6 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1723,14 +1723,20 @@ void laptop_mode_timer_fn(unsigned long data) struct request_queue *q = (struct request_queue *)data; int nr_pages = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS); + struct bdi_writeback *wb; + struct wb_iter iter; /* * We want to write everything out, not just down to the dirty * threshold */ - if (bdi_has_dirty_io(&q->backing_dev_info)) - wb_start_writeback(&q->backing_dev_info.wb, nr_pages, true, - WB_REASON_LAPTOP_TIMER); + if (!bdi_has_dirty_io(&q->backing_dev_info)) + return; + + bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0) + if (wb_has_dirty_io(wb)) + wb_start_writeback(wb, nr_pages, true, + WB_REASON_LAPTOP_TIMER); } /* -- cgit v1.1 From bc05873dccd27d75d6acdf812c3edfb181f1ba17 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:53 -0400 Subject: writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info writeback_in_progress() currently takes @bdi and returns whether writeback is in progress on its root wb (bdi_writeback). In preparation for cgroup writeback support, make it take wb instead. While at it, make it an inline function. This patch doesn't make any functional difference. 
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/page-writeback.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 682e3a6..e3b5c1d 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1455,7 +1455,7 @@ static void balance_dirty_pages(struct address_space *mapping, break; } - if (unlikely(!writeback_in_progress(bdi))) + if (unlikely(!writeback_in_progress(wb))) bdi_start_background_writeback(bdi); if (!strictlimit) @@ -1573,7 +1573,7 @@ pause: if (!dirty_exceeded && wb->dirty_exceeded) wb->dirty_exceeded = 0; - if (writeback_in_progress(bdi)) + if (writeback_in_progress(wb)) return; /* -- cgit v1.1 From 9ecf4866c018aeb304a7b49216c4d183665becb7 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:54 -0400 Subject: writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info bdi_start_background_writeback() currently takes @bdi and kicks the root wb (bdi_writeback). In preparation for cgroup writeback support, make it take wb instead. This patch doesn't make any functional difference. 
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/page-writeback.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index e3b5c1d..70cf98d 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1456,7 +1456,7 @@ static void balance_dirty_pages(struct address_space *mapping, } if (unlikely(!writeback_in_progress(wb))) - bdi_start_background_writeback(bdi); + wb_start_background_writeback(wb); if (!strictlimit) wb_dirty_limits(wb, dirty_thresh, background_thresh, @@ -1588,7 +1588,7 @@ pause: return; if (nr_reclaimable > background_thresh) - bdi_start_background_writeback(bdi); + wb_start_background_writeback(wb); } static DEFINE_PER_CPU(int, bdp_ratelimits); -- cgit v1.1 From cc395d7f1f7b9c740ab6d367ef1f6eb248595dff Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 17:13:58 -0400 Subject: writeback: implement bdi_wait_for_completion() The completion of a wb_writeback_work can be waited upon by setting its ->done to a struct completion and waiting on it; however, for cgroup writeback support, it's necessary to issue multiple work items to multiple bdi_writebacks and wait for the completion of all. This patch implements wb_completion which can wait for multiple work items and replaces the struct completion with it. It can be defined using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and waited for by wb_wait_for_completion(). Nobody currently issues multiple work items and this patch doesn't introduce any behavior changes.
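One way to wait on several work items with a single object is a reference count, sketched below as a single-threaded userspace model. Everything here is an assumption made for illustration: the `model_*` names are hypothetical, and the scheme of starting the count at one for the waiter is a common pattern rather than something shown in this patch's mm/ diff. A kernel version would need an atomic counter and a wait queue instead of a plain int.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a completion shared by several work items. */
struct model_wb_completion {
	int cnt; /* outstanding work items, plus one for the waiter */
};

/* The waiter holds one reference so the count cannot hit zero early. */
void model_completion_init(struct model_wb_completion *done)
{
	done->cnt = 1;
}

/* Called once for every work item queued against this completion. */
void model_work_issued(struct model_wb_completion *done)
{
	done->cnt++;
}

/* Called once when a queued work item has been fully processed. */
void model_work_finished(struct model_wb_completion *done)
{
	assert(done->cnt > 0);
	done->cnt--;
}

/* True once the waiter and every work item have dropped their references. */
bool model_completion_done(const struct model_wb_completion *done)
{
	return done->cnt == 0;
}

/*
 * The waiter drops its own reference. A real waiter would now sleep until
 * the count reaches zero; this model only reports whether it already has.
 */
bool model_waiter_release(struct model_wb_completion *done)
{
	done->cnt--;
	return model_completion_done(done);
}
```

Holding a reference for the waiter means work items that finish before the waiter starts waiting cannot make the completion look done while more items are still being issued.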
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Signed-off-by: Jens Axboe --- mm/backing-dev.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index d2f16fc9..ad5608d 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -768,6 +768,7 @@ int bdi_init(struct backing_dev_info *bdi) bdi->max_ratio = 100; bdi->max_prop_frac = FPROP_FRAC_BASE; INIT_LIST_HEAD(&bdi->bdi_list); + init_waitqueue_head(&bdi->wb_waitq); err = wb_init(&bdi->wb, bdi, GFP_KERNEL); if (err) -- cgit v1.1 From 733a572e66d2a23c852fdce34dba5bbd40667817 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:18 -0400 Subject: memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online cpu_possible_mask represents the CPUs which are actually possible during that boot instance. For systems which don't support CPU hotplug, this will match cpu_online_mask exactly in most cases. Even for systems which support CPU hotplug, the number of possible CPU slots is highly unlikely to diverge greatly from the number of online CPUs. The only cases where the difference between possible and online caused problems were when the boot code failed to initialize the possible mask and left it fully set at NR_CPUS - 1. As such, most per-cpu constructs allocate for all possible CPUs and often iterate over the possibles, which also has the benefit of avoiding the blocking CPU hotplug synchronization. memcg open codes per-cpu stat counting for mem_cgroup_read_stat() and mem_cgroup_read_events(), which iterates over online CPUs and handles CPU hotplug operations explicitly. This complexity doesn't actually buy anything. Switch to iterating over the possibles and drop the explicit CPU hotplug handling. Eventually, we want to convert memcg to use percpu_counter instead of its own custom implementation which also benefits from quick access w/o summing for cases where larger error margin is acceptable. 
This will allow mem_cgroup_read_stat() to be called from non-sleepable contexts which will be used by cgroup writeback. Signed-off-by: Tejun Heo Cc: Michal Hocko Acked-by: Johannes Weiner Signed-off-by: Jens Axboe --- mm/memcontrol.c | 51 ++------------------------------------------------- 1 file changed, 2 insertions(+), 49 deletions(-) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 49e5aa6..701cbee 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -324,11 +324,6 @@ struct mem_cgroup { * percpu counter. */ struct mem_cgroup_stat_cpu __percpu *stat; - /* - * used when a cpu is offlined or other synchronizations - * See mem_cgroup_read_stat(). - */ - struct mem_cgroup_stat_cpu nocpu_base; spinlock_t pcp_counter_lock; #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) @@ -834,15 +829,8 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg, long val = 0; int cpu; - get_online_cpus(); - for_each_online_cpu(cpu) + for_each_possible_cpu(cpu) val += per_cpu(memcg->stat->count[idx], cpu); -#ifdef CONFIG_HOTPLUG_CPU - spin_lock(&memcg->pcp_counter_lock); - val += memcg->nocpu_base.count[idx]; - spin_unlock(&memcg->pcp_counter_lock); -#endif - put_online_cpus(); return val; } @@ -852,15 +840,8 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg, unsigned long val = 0; int cpu; - get_online_cpus(); - for_each_online_cpu(cpu) + for_each_possible_cpu(cpu) val += per_cpu(memcg->stat->events[idx], cpu); -#ifdef CONFIG_HOTPLUG_CPU - spin_lock(&memcg->pcp_counter_lock); - val += memcg->nocpu_base.events[idx]; - spin_unlock(&memcg->pcp_counter_lock); -#endif - put_online_cpus(); return val; } @@ -2210,37 +2191,12 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) mutex_unlock(&percpu_charge_mutex); } -/* - * This function drains percpu counter value from DEAD cpu and - * move it to local cpu. Note that this function can be preempted. 
- */ -static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu) -{ - int i; - - spin_lock(&memcg->pcp_counter_lock); - for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { - long x = per_cpu(memcg->stat->count[i], cpu); - - per_cpu(memcg->stat->count[i], cpu) = 0; - memcg->nocpu_base.count[i] += x; - } - for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { - unsigned long x = per_cpu(memcg->stat->events[i], cpu); - - per_cpu(memcg->stat->events[i], cpu) = 0; - memcg->nocpu_base.events[i] += x; - } - spin_unlock(&memcg->pcp_counter_lock); -} - static int memcg_cpu_hotplug_callback(struct notifier_block *nb, unsigned long action, void *hcpu) { int cpu = (unsigned long)hcpu; struct memcg_stock_pcp *stock; - struct mem_cgroup *iter; if (action == CPU_ONLINE) return NOTIFY_OK; @@ -2248,9 +2204,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb, if (action != CPU_DEAD && action != CPU_DEAD_FROZEN) return NOTIFY_OK; - for_each_mem_cgroup(iter) - mem_cgroup_drain_pcp_counter(iter, cpu); - stock = &per_cpu(memcg_stock, cpu); drain_stock(stock); return NOTIFY_OK; -- cgit v1.1 From 0d960a383ae7aa791b2833e122ba7519d264cf92 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:19 -0400 Subject: writeback: clean up wb_dirty_limit() The function name wb_dirty_limit(), its argument @dirty and the local variable @wb_dirty are mortally confusing given that the function calculates per-wb threshold value not dirty pages, especially given that @dirty and @wb_dirty are used elsewhere for dirty pages. Let's rename the function to wb_calc_thresh() and wb_dirty to wb_thresh. 
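To make the "threshold, not dirty pages" point concrete, here is a userspace sketch of the calculation the renamed function performs. The function name and flat parameter list are hypothetical; the kernel version obtains the writeout fraction and the ratios from the wb and its bdi rather than taking them as arguments.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the per-wb threshold calculation.
 * thresh:                 global dirty threshold in pages
 * numerator/denominator:  this wb's share of recent writeout completions
 * sum_min_ratio:          sum of min_ratio over all bdis (percent)
 * wb_min_ratio/max_ratio: this wb's floor and ceiling (percent)
 */
unsigned long model_wb_calc_thresh(unsigned long thresh,
				   long numerator, long denominator,
				   unsigned long sum_min_ratio,
				   unsigned long wb_min_ratio,
				   unsigned long wb_max_ratio)
{
	uint64_t wb_thresh;

	/* share out whatever is not reserved by the min ratios */
	wb_thresh = (uint64_t)thresh * (100 - sum_min_ratio) / 100;
	wb_thresh *= (uint64_t)numerator;
	wb_thresh /= (uint64_t)denominator;

	/* add this wb's guaranteed floor, then apply its ceiling */
	wb_thresh += (uint64_t)thresh * wb_min_ratio / 100;
	if (wb_thresh > (uint64_t)thresh * wb_max_ratio / 100)
		wb_thresh = (uint64_t)thresh * wb_max_ratio / 100;

	return (unsigned long)wb_thresh;
}
```

Taking thresh as 20% of 16 GiB in 4 KiB pages (838860 pages) and a max ratio of 1%, a wb doing half of all writeout is capped at 8388 pages, in line with the "~8K pages" figure quoted in the wb_position_ratio() comment.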
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/backing-dev.c | 6 +++--- mm/page-writeback.c | 30 +++++++++++++++--------------- 2 files changed, 18 insertions(+), 18 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ad5608d..9c8b7b5 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -49,7 +49,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) struct bdi_writeback *wb = &bdi->wb; unsigned long background_thresh; unsigned long dirty_thresh; - unsigned long bdi_thresh; + unsigned long wb_thresh; unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time; struct inode *inode; @@ -67,7 +67,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) spin_unlock(&wb->list_lock); global_dirty_limits(&background_thresh, &dirty_thresh); - bdi_thresh = wb_dirty_limit(wb, dirty_thresh); + wb_thresh = wb_calc_thresh(wb, dirty_thresh); #define K(x) ((x) << (PAGE_SHIFT - 10)) seq_printf(m, @@ -87,7 +87,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) "state: %10lx\n", (unsigned long) K(wb_stat(wb, WB_WRITEBACK)), (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)), - K(bdi_thresh), + K(wb_thresh), K(dirty_thresh), K(background_thresh), (unsigned long) K(wb_stat(wb, WB_DIRTIED)), diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 70cf98d..c7745a7 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -556,7 +556,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh) } /** - * wb_dirty_limit - @wb's share of dirty throttling threshold + * wb_calc_thresh - @wb's share of dirty throttling threshold * @wb: bdi_writeback to query * @dirty: global dirty limit in pages * @@ -577,28 +577,28 @@ static unsigned long hard_dirty_limit(unsigned long thresh) * The wb's share of dirty limit will be adapting to its throughput and * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set. 
*/ -unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty) +unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) { - u64 wb_dirty; + u64 wb_thresh; long numerator, denominator; unsigned long wb_min_ratio, wb_max_ratio; /* - * Calculate this BDI's share of the dirty ratio. + * Calculate this BDI's share of the thresh ratio. */ wb_writeout_fraction(wb, &numerator, &denominator); - wb_dirty = (dirty * (100 - bdi_min_ratio)) / 100; - wb_dirty *= numerator; - do_div(wb_dirty, denominator); + wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100; + wb_thresh *= numerator; + do_div(wb_thresh, denominator); wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio); - wb_dirty += (dirty * wb_min_ratio) / 100; - if (wb_dirty > (dirty * wb_max_ratio) / 100) - wb_dirty = dirty * wb_max_ratio / 100; + wb_thresh += (thresh * wb_min_ratio) / 100; + if (wb_thresh > (thresh * wb_max_ratio) / 100) + wb_thresh = thresh * wb_max_ratio / 100; - return wb_dirty; + return wb_thresh; } /* @@ -750,7 +750,7 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb, * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global * limits are set by default to 10% and 20% (background and throttle). * Then wb_thresh is 1% of 20% of 16GB. This amounts to ~8K pages. - * wb_dirty_limit(wb, bg_thresh) is about ~4K pages. wb_setpoint is + * wb_calc_thresh(wb, bg_thresh) is about ~4K pages. wb_setpoint is * about ~6K pages (as the average of background and throttle wb * limits). The 3rd order polynomial will provide positive feedback if * wb_dirty is under wb_setpoint and vice versa. @@ -1115,7 +1115,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, * * We rampup dirty_ratelimit forcibly if wb_dirty is low because * it's possible that wb_thresh is close to zero due to inactivity - * of backing device (see the implementation of wb_dirty_limit()). + * of backing device (see the implementation of wb_calc_thresh()). 
*/ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { dirty = wb_dirty; @@ -1123,7 +1123,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, setpoint = wb_dirty + 1; else setpoint = (wb_thresh + - wb_dirty_limit(wb, bg_thresh)) / 2; + wb_calc_thresh(wb, bg_thresh)) / 2; } if (dirty < setpoint) { @@ -1352,7 +1352,7 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb, * wb_position_ratio() will let the dirtier task progress * at some rate <= (write_bw / 2) for bringing down wb_dirty. */ - *wb_thresh = wb_dirty_limit(wb, dirty_thresh); + *wb_thresh = wb_calc_thresh(wb, dirty_thresh); if (wb_bg_thresh) *wb_bg_thresh = dirty_thresh ? div_u64((u64)*wb_thresh * -- cgit v1.1 From 8a73179956e649df0d4b3250db17734f272d8266 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:20 -0400 Subject: writeback: reorganize [__]wb_update_bandwidth() __wb_update_bandwidth() is called from two places - mm/page-writeback.c::balance_dirty_pages() and fs/fs-writeback.c::wb_writeback(). The latter updates only the write bandwidth while the former also deals with the dirty ratelimit. The two callsites are distinguished by whether @thresh parameter is zero or not, which is cryptic. In addition, the two files define their own different versions of wb_update_bandwidth() on top of __wb_update_bandwidth(), which is confusing to say the least. This patch cleans up [__]wb_update_bandwidth() in the following ways. * __wb_update_bandwidth() now takes explicit @update_ratelimit parameter to gate dirty ratelimit handling. * mm/page-writeback.c::wb_update_bandwidth() is flattened into its caller - balance_dirty_pages(). * fs/fs-writeback.c::wb_update_bandwidth() is moved to mm/page-writeback.c and __wb_update_bandwidth() is made static. * While at it, add a lockdep assertion to __wb_update_bandwidth(). Except for the lockdep addition, this is pure reorganization and doesn't introduce any behavioral changes.
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 45 ++++++++++++++++++++++----------------------- 1 file changed, 22 insertions(+), 23 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index c7745a7..bebdd41 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1160,19 +1160,22 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit); } -void __wb_update_bandwidth(struct bdi_writeback *wb, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long wb_thresh, - unsigned long wb_dirty, - unsigned long start_time) +static void __wb_update_bandwidth(struct bdi_writeback *wb, + unsigned long thresh, + unsigned long bg_thresh, + unsigned long dirty, + unsigned long wb_thresh, + unsigned long wb_dirty, + unsigned long start_time, + bool update_ratelimit) { unsigned long now = jiffies; unsigned long elapsed = now - wb->bw_time_stamp; unsigned long dirtied; unsigned long written; + lockdep_assert_held(&wb->list_lock); + /* * rate-limit, only update once every 200ms. 
*/ @@ -1189,7 +1192,7 @@ void __wb_update_bandwidth(struct bdi_writeback *wb, if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time)) goto snapshot; - if (thresh) { + if (update_ratelimit) { global_update_bandwidth(thresh, dirty, now); wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty, wb_thresh, wb_dirty, @@ -1203,20 +1206,9 @@ snapshot: wb->bw_time_stamp = now; } -static void wb_update_bandwidth(struct bdi_writeback *wb, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long wb_thresh, - unsigned long wb_dirty, - unsigned long start_time) +void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time) { - if (time_is_after_eq_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL)) - return; - spin_lock(&wb->list_lock); - __wb_update_bandwidth(wb, thresh, bg_thresh, dirty, - wb_thresh, wb_dirty, start_time); - spin_unlock(&wb->list_lock); + __wb_update_bandwidth(wb, 0, 0, 0, 0, 0, start_time, false); } /* @@ -1467,8 +1459,15 @@ static void balance_dirty_pages(struct address_space *mapping, if (dirty_exceeded && !wb->dirty_exceeded) wb->dirty_exceeded = 1; - wb_update_bandwidth(wb, dirty_thresh, background_thresh, - nr_dirty, wb_thresh, wb_dirty, start_time); + if (time_is_before_jiffies(wb->bw_time_stamp + + BANDWIDTH_INTERVAL)) { + spin_lock(&wb->list_lock); + __wb_update_bandwidth(wb, dirty_thresh, + background_thresh, nr_dirty, + wb_thresh, wb_dirty, start_time, + true); + spin_unlock(&wb->list_lock); + } dirty_ratelimit = wb->dirty_ratelimit; pos_ratio = wb_position_ratio(wb, dirty_thresh, -- cgit v1.1 From 380c27ca33ebecc9da35aa90c8b3a9154f90aac2 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:21 -0400 Subject: writeback: implement wb_domain Dirtyable memory is distributed to a wb (bdi_writeback) according to the relative bandwidth the wb is writing out in the whole system. 
This distribution is global - each wb is measured against all other wb's and gets a proportionately sized portion of the memory in the whole system. For cgroup writeback, the amount of dirtyable memory is scoped by memcg and thus each wb would need to be measured and controlled in its memcg. IOW, a wb will belong to two writeback domains - the global and memcg domains. Currently, the states that constitute the global writeback domain are scattered across a number of global variables. This patch starts collecting them into struct wb_domain. * fprop_global which serves as the basis for proportional bandwidth measurement and its period timer are moved into struct wb_domain. * global_wb_domain hosts the states for the global domain. * While at it, flatten wb_writeout_fraction() into its callers. This thin wrapper doesn't provide any actual benefits while getting in the way. This is pure reorganization and doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 72 ++++++++++++++++++++--------------------------------- 1 file changed, 27 insertions(+), 45 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index bebdd41..08e1737 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -124,29 +124,7 @@ EXPORT_SYMBOL(laptop_mode); unsigned long global_dirty_limit; -/* - * Scale the writeback cache size proportional to the relative writeout speeds. - * - * We do this by keeping a floating proportion between BDIs, based on page - * writeback completions [end_page_writeback()]. Those devices that write out - * pages fastest will get the larger share, while the slower will get a smaller - * share. - * - * We use page writeout completions because we are interested in getting rid of - * dirty pages. Having them written out is the primary goal.
- * - * We introduce a concept of time, a period over which we measure these events, - * because demand can/will vary over time. The length of this period itself is - * measured in page writeback completions. - * - */ -static struct fprop_global writeout_completions; - -static void writeout_period(unsigned long t); -/* Timer for aging of writeout_completions */ -static struct timer_list writeout_period_timer = - TIMER_DEFERRED_INITIALIZER(writeout_period, 0, 0); -static unsigned long writeout_period_time = 0; +static struct wb_domain global_wb_domain; /* * Length of period for aging writeout fractions of bdis. This is an @@ -433,24 +411,26 @@ static unsigned long wp_next_time(unsigned long cur_time) } /* - * Increment the BDI's writeout completion count and the global writeout + * Increment the wb's writeout completion count and the global writeout * completion count. Called from test_clear_page_writeback(). */ static inline void __wb_writeout_inc(struct bdi_writeback *wb) { + struct wb_domain *dom = &global_wb_domain; + __inc_wb_stat(wb, WB_WRITTEN); - __fprop_inc_percpu_max(&writeout_completions, &wb->completions, + __fprop_inc_percpu_max(&dom->completions, &wb->completions, wb->bdi->max_prop_frac); /* First event after period switching was turned off? */ - if (!unlikely(writeout_period_time)) { + if (!unlikely(dom->period_time)) { /* * We can race with other __bdi_writeout_inc calls here but * it does not cause any harm since the resulting time when * timer will fire and what is in writeout_period_time will be * roughly the same. */ - writeout_period_time = wp_next_time(jiffies); - mod_timer(&writeout_period_timer, writeout_period_time); + dom->period_time = wp_next_time(jiffies); + mod_timer(&dom->period_timer, dom->period_time); } } @@ -465,37 +445,37 @@ void wb_writeout_inc(struct bdi_writeback *wb) EXPORT_SYMBOL_GPL(wb_writeout_inc); /* - * Obtain an accurate fraction of the BDI's portion. 
- */ -static void wb_writeout_fraction(struct bdi_writeback *wb, - long *numerator, long *denominator) -{ - fprop_fraction_percpu(&writeout_completions, &wb->completions, - numerator, denominator); -} - -/* * On idle system, we can be called long after we scheduled because we use * deferred timers so count with missed periods. */ static void writeout_period(unsigned long t) { - int miss_periods = (jiffies - writeout_period_time) / + struct wb_domain *dom = (void *)t; + int miss_periods = (jiffies - dom->period_time) / VM_COMPLETIONS_PERIOD_LEN; - if (fprop_new_period(&writeout_completions, miss_periods + 1)) { - writeout_period_time = wp_next_time(writeout_period_time + + if (fprop_new_period(&dom->completions, miss_periods + 1)) { + dom->period_time = wp_next_time(dom->period_time + miss_periods * VM_COMPLETIONS_PERIOD_LEN); - mod_timer(&writeout_period_timer, writeout_period_time); + mod_timer(&dom->period_timer, dom->period_time); } else { /* * Aging has zeroed all fractions. Stop wasting CPU on period * updates. */ - writeout_period_time = 0; + dom->period_time = 0; } } +int wb_domain_init(struct wb_domain *dom, gfp_t gfp) +{ + memset(dom, 0, sizeof(*dom)); + init_timer_deferrable(&dom->period_timer); + dom->period_timer.function = writeout_period; + dom->period_timer.data = (unsigned long)dom; + return fprop_global_init(&dom->completions, gfp); +} + /* * bdi_min_ratio keeps the sum of the minimum dirty shares of all * registered backing devices, which, for obvious reasons, can not @@ -579,6 +559,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh) */ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) { + struct wb_domain *dom = &global_wb_domain; u64 wb_thresh; long numerator, denominator; unsigned long wb_min_ratio, wb_max_ratio; @@ -586,7 +567,8 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) /* * Calculate this BDI's share of the thresh ratio. 
*/ - wb_writeout_fraction(wb, &numerator, &denominator); + fprop_fraction_percpu(&dom->completions, &wb->completions, + &numerator, &denominator); wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100; wb_thresh *= numerator; @@ -1831,7 +1813,7 @@ void __init page_writeback_init(void) writeback_set_ratelimit(); register_cpu_notifier(&ratelimit_nb); - fprop_global_init(&writeout_completions, GFP_KERNEL); + BUG_ON(wb_domain_init(&global_wb_domain, GFP_KERNEL)); } /** -- cgit v1.1 From dcc25ae76eb7b8ff883eaaab57e30e8f2f085be3 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:22 -0400 Subject: writeback: move global_dirty_limit into wb_domain This patch is a part of the series to define wb_domain which represents a domain that wb's (bdi_writeback's) belong to and are measured against each other in. This will enable IO backpressure propagation for cgroup writeback. global_dirty_limit exists to regulate the global dirty threshold which is a property of the wb_domain. This patch moves hard_dirty_limit, dirty_lock, and update_time into wb_domain. This is pure reorganization and doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 46 +++++++++++++++++++++++----------------------- 1 file changed, 23 insertions(+), 23 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 08e1737..27e60ba 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -122,9 +122,7 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ -unsigned long global_dirty_limit; - -static struct wb_domain global_wb_domain; +struct wb_domain global_wb_domain; /* * Length of period for aging writeout fractions of bdis. 
This is an @@ -470,9 +468,15 @@ static void writeout_period(unsigned long t) int wb_domain_init(struct wb_domain *dom, gfp_t gfp) { memset(dom, 0, sizeof(*dom)); + + spin_lock_init(&dom->lock); + init_timer_deferrable(&dom->period_timer); dom->period_timer.function = writeout_period; dom->period_timer.data = (unsigned long)dom; + + dom->dirty_limit_tstamp = jiffies; + return fprop_global_init(&dom->completions, gfp); } @@ -532,7 +536,9 @@ static unsigned long dirty_freerun_ceiling(unsigned long thresh, static unsigned long hard_dirty_limit(unsigned long thresh) { - return max(thresh, global_dirty_limit); + struct wb_domain *dom = &global_wb_domain; + + return max(thresh, dom->dirty_limit); } /** @@ -916,17 +922,10 @@ out: wb->avg_write_bandwidth = avg; } -/* - * The global dirtyable memory and dirty threshold could be suddenly knocked - * down by a large amount (eg. on the startup of KVM in a swapless system). - * This may throw the system into deep dirty exceeded state and throttle - * heavy/light dirtiers alike. To retain good responsiveness, maintain - * global_dirty_limit for tracking slowly down to the knocked down dirty - * threshold. - */ static void update_dirty_limit(unsigned long thresh, unsigned long dirty) { - unsigned long limit = global_dirty_limit; + struct wb_domain *dom = &global_wb_domain; + unsigned long limit = dom->dirty_limit; /* * Follow up in one step. @@ -939,7 +938,7 @@ static void update_dirty_limit(unsigned long thresh, unsigned long dirty) /* * Follow down slowly. Use the higher one as the target, because thresh * may drop below dirty. This is exactly the reason to introduce - * global_dirty_limit which is guaranteed to lie above the dirty pages. + * dom->dirty_limit which is guaranteed to lie above the dirty pages. 
*/ thresh = max(thresh, dirty); if (limit > thresh) { @@ -948,28 +947,27 @@ static void update_dirty_limit(unsigned long thresh, unsigned long dirty) } return; update: - global_dirty_limit = limit; + dom->dirty_limit = limit; } static void global_update_bandwidth(unsigned long thresh, unsigned long dirty, unsigned long now) { - static DEFINE_SPINLOCK(dirty_lock); - static unsigned long update_time = INITIAL_JIFFIES; + struct wb_domain *dom = &global_wb_domain; /* * check locklessly first to optimize away locking for the most time */ - if (time_before(now, update_time + BANDWIDTH_INTERVAL)) + if (time_before(now, dom->dirty_limit_tstamp + BANDWIDTH_INTERVAL)) return; - spin_lock(&dirty_lock); - if (time_after_eq(now, update_time + BANDWIDTH_INTERVAL)) { + spin_lock(&dom->lock); + if (time_after_eq(now, dom->dirty_limit_tstamp + BANDWIDTH_INTERVAL)) { update_dirty_limit(thresh, dirty); - update_time = now; + dom->dirty_limit_tstamp = now; } - spin_unlock(&dirty_lock); + spin_unlock(&dom->lock); } /* @@ -1761,10 +1759,12 @@ void laptop_sync_completion(void) void writeback_set_ratelimit(void) { + struct wb_domain *dom = &global_wb_domain; unsigned long background_thresh; unsigned long dirty_thresh; + global_dirty_limits(&background_thresh, &dirty_thresh); - global_dirty_limit = dirty_thresh; + dom->dirty_limit = dirty_thresh; ratelimit_pages = dirty_thresh / (num_online_cpus() * 32); if (ratelimit_pages < 16) ratelimit_pages = 16; -- cgit v1.1 From 2bc00aef030f4f75550d5c88062ce1830e40097f Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:23 -0400 Subject: writeback: consolidate dirty throttle parameters into dirty_throttle_control Dirty throttling implemented in balance_dirty_pages() and its subroutines makes use of a number of parameters which are passed around individually. This renders these functions somewhat unwieldy and makes it difficult to add or change the involved parameters. 
Also some functions use different or conflicting naming schemes for the same parameters making the code confusing to follow. This patch consolidates the main parameters into struct dirty_throttle_control so that they can be passed around easily and adding new parameters isn't painful. This also unifies how a given parameter is named and accessed. The drawback of using this type of control structure rather than explicit parameters is that it isn't immediately obvious which function accesses and modifies what; however, it's fairly clear that the benefits outweigh the drawback in this case. GDTC_INIT() macro is provided to ease initializing dirty_throttle_control for the global_wb_domain and balance_dirty_pages() uses a separate pointer to point to its global dirty_throttle_control. This is to make it uniform with memcg domain handling which will be added later. This patch doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 212 +++++++++++++++++++++++++--------------------------- 1 file changed, 101 insertions(+), 111 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 27e60ba..126e3c8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -124,6 +124,20 @@ EXPORT_SYMBOL(laptop_mode); struct wb_domain global_wb_domain; +/* consolidated parameters for balance_dirty_pages() and its subroutines */ +struct dirty_throttle_control { + struct bdi_writeback *wb; + + unsigned long dirty; /* file_dirty + write + nfs */ + unsigned long thresh; /* dirty threshold */ + unsigned long bg_thresh; /* dirty background threshold */ + + unsigned long wb_dirty; /* per-wb counterparts */ + unsigned long wb_thresh; +}; + +#define GDTC_INIT(__wb) .wb = (__wb) + /* * Length of period for aging writeout fractions of bdis. This is an * arbitrarily chosen number. 
The longer the period, the slower fractions will @@ -695,16 +709,13 @@ static long long pos_ratio_polynom(unsigned long setpoint, * card's wb_dirty may rush to many times higher than wb_setpoint. * - the wb dirty thresh drops quickly due to change of JBOD workload */ -static unsigned long wb_position_ratio(struct bdi_writeback *wb, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long wb_thresh, - unsigned long wb_dirty) +static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) { + struct bdi_writeback *wb = dtc->wb; unsigned long write_bw = wb->avg_write_bandwidth; - unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh); - unsigned long limit = hard_dirty_limit(thresh); + unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh); + unsigned long limit = hard_dirty_limit(dtc->thresh); + unsigned long wb_thresh = dtc->wb_thresh; unsigned long x_intercept; unsigned long setpoint; /* dirty pages' target balance point */ unsigned long wb_setpoint; @@ -712,7 +723,7 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb, long long pos_ratio; /* for scaling up/down the rate limit */ long x; - if (unlikely(dirty >= limit)) + if (unlikely(dtc->dirty >= limit)) return 0; /* @@ -721,7 +732,7 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb, * See comment for pos_ratio_polynom(). 
*/ setpoint = (freerun + limit) / 2; - pos_ratio = pos_ratio_polynom(setpoint, dirty, limit); + pos_ratio = pos_ratio_polynom(setpoint, dtc->dirty, limit); /* * The strictlimit feature is a tool preventing mistrusted filesystems @@ -752,20 +763,21 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb, long long wb_pos_ratio; unsigned long wb_bg_thresh; - if (wb_dirty < 8) + if (dtc->wb_dirty < 8) return min_t(long long, pos_ratio * 2, 2 << RATELIMIT_CALC_SHIFT); - if (wb_dirty >= wb_thresh) + if (dtc->wb_dirty >= wb_thresh) return 0; - wb_bg_thresh = div_u64((u64)wb_thresh * bg_thresh, thresh); + wb_bg_thresh = div_u64((u64)wb_thresh * dtc->bg_thresh, + dtc->thresh); wb_setpoint = dirty_freerun_ceiling(wb_thresh, wb_bg_thresh); if (wb_setpoint == 0 || wb_setpoint == wb_thresh) return 0; - wb_pos_ratio = pos_ratio_polynom(wb_setpoint, wb_dirty, + wb_pos_ratio = pos_ratio_polynom(wb_setpoint, dtc->wb_dirty, wb_thresh); /* @@ -823,8 +835,8 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb, * own size, so move the slope over accordingly and choose a slope that * yields 100% pos_ratio fluctuation on suddenly doubled wb_thresh. */ - if (unlikely(wb_thresh > thresh)) - wb_thresh = thresh; + if (unlikely(wb_thresh > dtc->thresh)) + wb_thresh = dtc->thresh; /* * It's very possible that wb_thresh is close to 0 not because the * device is slow, but that it has remained inactive for long time. @@ -832,12 +844,12 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb, * threshold, so that the occasional writes won't be blocked and active * writes can rampup the threshold quickly. 
*/ - wb_thresh = max(wb_thresh, (limit - dirty) / 8); + wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); /* * scale global setpoint to wb's: * wb_setpoint = setpoint * wb_thresh / thresh */ - x = div_u64((u64)wb_thresh << 16, thresh + 1); + x = div_u64((u64)wb_thresh << 16, dtc->thresh + 1); wb_setpoint = setpoint * (u64)x >> 16; /* * Use span=(8*write_bw) in single wb case as indicated by @@ -847,12 +859,12 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb, * span = --------- * (8 * write_bw) + ------------------ * wb_thresh * thresh thresh */ - span = (thresh - wb_thresh + 8 * write_bw) * (u64)x >> 16; + span = (dtc->thresh - wb_thresh + 8 * write_bw) * (u64)x >> 16; x_intercept = wb_setpoint + span; - if (wb_dirty < x_intercept - span / 4) { - pos_ratio = div64_u64(pos_ratio * (x_intercept - wb_dirty), - x_intercept - wb_setpoint + 1); + if (dtc->wb_dirty < x_intercept - span / 4) { + pos_ratio = div64_u64(pos_ratio * (x_intercept - dtc->wb_dirty), + x_intercept - wb_setpoint + 1); } else pos_ratio /= 4; @@ -862,9 +874,10 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb, * than setpoint. */ x_intercept = wb_thresh / 2; - if (wb_dirty < x_intercept) { - if (wb_dirty > x_intercept / 8) - pos_ratio = div_u64(pos_ratio * x_intercept, wb_dirty); + if (dtc->wb_dirty < x_intercept) { + if (dtc->wb_dirty > x_intercept / 8) + pos_ratio = div_u64(pos_ratio * x_intercept, + dtc->wb_dirty); else pos_ratio *= 8; } @@ -922,9 +935,10 @@ out: wb->avg_write_bandwidth = avg; } -static void update_dirty_limit(unsigned long thresh, unsigned long dirty) +static void update_dirty_limit(struct dirty_throttle_control *dtc) { struct wb_domain *dom = &global_wb_domain; + unsigned long thresh = dtc->thresh; unsigned long limit = dom->dirty_limit; /* @@ -940,7 +954,7 @@ static void update_dirty_limit(unsigned long thresh, unsigned long dirty) * may drop below dirty. 
This is exactly the reason to introduce * dom->dirty_limit which is guaranteed to lie above the dirty pages. */ - thresh = max(thresh, dirty); + thresh = max(thresh, dtc->dirty); if (limit > thresh) { limit -= (limit - thresh) >> 5; goto update; @@ -950,8 +964,7 @@ update: dom->dirty_limit = limit; } -static void global_update_bandwidth(unsigned long thresh, - unsigned long dirty, +static void global_update_bandwidth(struct dirty_throttle_control *dtc, unsigned long now) { struct wb_domain *dom = &global_wb_domain; @@ -964,7 +977,7 @@ static void global_update_bandwidth(unsigned long thresh, spin_lock(&dom->lock); if (time_after_eq(now, dom->dirty_limit_tstamp + BANDWIDTH_INTERVAL)) { - update_dirty_limit(thresh, dirty); + update_dirty_limit(dtc); dom->dirty_limit_tstamp = now; } spin_unlock(&dom->lock); @@ -976,17 +989,14 @@ static void global_update_bandwidth(unsigned long thresh, * Normal wb tasks will be curbed at or below it in long term. * Obviously it should be around (write_bw / N) when there are N dd tasks. 
*/ -static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long wb_thresh, - unsigned long wb_dirty, +static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, unsigned long dirtied, unsigned long elapsed) { - unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh); - unsigned long limit = hard_dirty_limit(thresh); + struct bdi_writeback *wb = dtc->wb; + unsigned long dirty = dtc->dirty; + unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh); + unsigned long limit = hard_dirty_limit(dtc->thresh); unsigned long setpoint = (freerun + limit) / 2; unsigned long write_bw = wb->avg_write_bandwidth; unsigned long dirty_ratelimit = wb->dirty_ratelimit; @@ -1003,8 +1013,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, */ dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed; - pos_ratio = wb_position_ratio(wb, thresh, bg_thresh, dirty, - wb_thresh, wb_dirty); + pos_ratio = wb_position_ratio(dtc); /* * task_ratelimit reflects each dd's dirty rate for the past 200ms. */ @@ -1098,12 +1107,12 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, * of backing device (see the implementation of wb_calc_thresh()). 
*/ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { - dirty = wb_dirty; - if (wb_dirty < 8) - setpoint = wb_dirty + 1; + dirty = dtc->wb_dirty; + if (dtc->wb_dirty < 8) + setpoint = dtc->wb_dirty + 1; else - setpoint = (wb_thresh + - wb_calc_thresh(wb, bg_thresh)) / 2; + setpoint = (dtc->wb_thresh + + wb_calc_thresh(wb, dtc->bg_thresh)) / 2; } if (dirty < setpoint) { @@ -1140,15 +1149,11 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb, trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit); } -static void __wb_update_bandwidth(struct bdi_writeback *wb, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long wb_thresh, - unsigned long wb_dirty, +static void __wb_update_bandwidth(struct dirty_throttle_control *dtc, unsigned long start_time, bool update_ratelimit) { + struct bdi_writeback *wb = dtc->wb; unsigned long now = jiffies; unsigned long elapsed = now - wb->bw_time_stamp; unsigned long dirtied; @@ -1173,10 +1178,8 @@ static void __wb_update_bandwidth(struct bdi_writeback *wb, goto snapshot; if (update_ratelimit) { - global_update_bandwidth(thresh, dirty, now); - wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty, - wb_thresh, wb_dirty, - dirtied, elapsed); + global_update_bandwidth(dtc, now); + wb_update_dirty_ratelimit(dtc, dirtied, elapsed); } wb_update_write_bandwidth(wb, elapsed, written); @@ -1188,7 +1191,9 @@ snapshot: void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time) { - __wb_update_bandwidth(wb, 0, 0, 0, 0, 0, start_time, false); + struct dirty_throttle_control gdtc = { GDTC_INIT(wb) }; + + __wb_update_bandwidth(&gdtc, start_time, false); } /* @@ -1302,13 +1307,10 @@ static long wb_min_pause(struct bdi_writeback *wb, return pages >= DIRTY_POLL_THRESH ? 
1 + t / 2 : t; } -static inline void wb_dirty_limits(struct bdi_writeback *wb, - unsigned long dirty_thresh, - unsigned long background_thresh, - unsigned long *wb_dirty, - unsigned long *wb_thresh, +static inline void wb_dirty_limits(struct dirty_throttle_control *dtc, unsigned long *wb_bg_thresh) { + struct bdi_writeback *wb = dtc->wb; unsigned long wb_reclaimable; /* @@ -1324,12 +1326,12 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb, * wb_position_ratio() will let the dirtier task progress * at some rate <= (write_bw / 2) for bringing down wb_dirty. */ - *wb_thresh = wb_calc_thresh(wb, dirty_thresh); + dtc->wb_thresh = wb_calc_thresh(dtc->wb, dtc->thresh); if (wb_bg_thresh) - *wb_bg_thresh = dirty_thresh ? div_u64((u64)*wb_thresh * - background_thresh, - dirty_thresh) : 0; + *wb_bg_thresh = dtc->thresh ? div_u64((u64)dtc->wb_thresh * + dtc->bg_thresh, + dtc->thresh) : 0; /* * In order to avoid the stacked BDI deadlock we need @@ -1341,12 +1343,12 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb, * actually dirty; with m+n sitting in the percpu * deltas. 
*/ - if (*wb_thresh < 2 * wb_stat_error(wb)) { + if (dtc->wb_thresh < 2 * wb_stat_error(wb)) { wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); - *wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK); + dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK); } else { wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE); - *wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK); + dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK); } } @@ -1361,10 +1363,9 @@ static void balance_dirty_pages(struct address_space *mapping, struct bdi_writeback *wb, unsigned long pages_dirtied) { + struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) }; + struct dirty_throttle_control * const gdtc = &gdtc_stor; unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */ - unsigned long nr_dirty; /* = file_dirty + writeback + unstable_nfs */ - unsigned long background_thresh; - unsigned long dirty_thresh; long period; long pause; long max_pause; @@ -1380,11 +1381,7 @@ static void balance_dirty_pages(struct address_space *mapping, for (;;) { unsigned long now = jiffies; - unsigned long uninitialized_var(wb_thresh); - unsigned long thresh; - unsigned long uninitialized_var(wb_dirty); - unsigned long dirty; - unsigned long bg_thresh; + unsigned long dirty, thresh, bg_thresh; /* * Unstable writes are a feature of certain networked @@ -1394,20 +1391,19 @@ static void balance_dirty_pages(struct address_space *mapping, */ nr_reclaimable = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS); - nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); + gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); - global_dirty_limits(&background_thresh, &dirty_thresh); + global_dirty_limits(&gdtc->bg_thresh, &gdtc->thresh); if (unlikely(strictlimit)) { - wb_dirty_limits(wb, dirty_thresh, background_thresh, - &wb_dirty, &wb_thresh, &bg_thresh); + wb_dirty_limits(gdtc, &bg_thresh); - dirty = wb_dirty; - thresh = wb_thresh; + dirty = gdtc->wb_dirty; + 
thresh = gdtc->wb_thresh; } else { - dirty = nr_dirty; - thresh = dirty_thresh; - bg_thresh = background_thresh; + dirty = gdtc->dirty; + thresh = gdtc->thresh; + bg_thresh = gdtc->bg_thresh; } /* @@ -1431,31 +1427,25 @@ static void balance_dirty_pages(struct address_space *mapping, wb_start_background_writeback(wb); if (!strictlimit) - wb_dirty_limits(wb, dirty_thresh, background_thresh, - &wb_dirty, &wb_thresh, NULL); + wb_dirty_limits(gdtc, NULL); - dirty_exceeded = (wb_dirty > wb_thresh) && - ((nr_dirty > dirty_thresh) || strictlimit); + dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && + ((gdtc->dirty > gdtc->thresh) || strictlimit); if (dirty_exceeded && !wb->dirty_exceeded) wb->dirty_exceeded = 1; if (time_is_before_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL)) { spin_lock(&wb->list_lock); - __wb_update_bandwidth(wb, dirty_thresh, - background_thresh, nr_dirty, - wb_thresh, wb_dirty, start_time, - true); + __wb_update_bandwidth(gdtc, start_time, true); spin_unlock(&wb->list_lock); } dirty_ratelimit = wb->dirty_ratelimit; - pos_ratio = wb_position_ratio(wb, dirty_thresh, - background_thresh, nr_dirty, - wb_thresh, wb_dirty); + pos_ratio = wb_position_ratio(gdtc); task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >> RATELIMIT_CALC_SHIFT; - max_pause = wb_max_pause(wb, wb_dirty); + max_pause = wb_max_pause(wb, gdtc->wb_dirty); min_pause = wb_min_pause(wb, max_pause, task_ratelimit, dirty_ratelimit, &nr_dirtied_pause); @@ -1478,11 +1468,11 @@ static void balance_dirty_pages(struct address_space *mapping, */ if (pause < min_pause) { trace_balance_dirty_pages(bdi, - dirty_thresh, - background_thresh, - nr_dirty, - wb_thresh, - wb_dirty, + gdtc->thresh, + gdtc->bg_thresh, + gdtc->dirty, + gdtc->wb_thresh, + gdtc->wb_dirty, dirty_ratelimit, task_ratelimit, pages_dirtied, @@ -1507,11 +1497,11 @@ static void balance_dirty_pages(struct address_space *mapping, pause: trace_balance_dirty_pages(bdi, - dirty_thresh, - background_thresh, - nr_dirty, - wb_thresh, - 
wb_dirty, + gdtc->thresh, + gdtc->bg_thresh, + gdtc->dirty, + gdtc->wb_thresh, + gdtc->wb_dirty, dirty_ratelimit, task_ratelimit, pages_dirtied, @@ -1526,8 +1516,8 @@ pause: current->nr_dirtied_pause = nr_dirtied_pause; /* - * This is typically equal to (nr_dirty < dirty_thresh) and can - * also keep "1000+ dd on a slow USB stick" under control. + * This is typically equal to (dirty < thresh) and can also + * keep "1000+ dd on a slow USB stick" under control. */ if (task_ratelimit) break; @@ -1542,7 +1532,7 @@ pause: * more page. However wb_dirty has accounting errors. So use * the larger and more IO friendly wb_stat_error. */ - if (wb_dirty <= wb_stat_error(wb)) + if (gdtc->wb_dirty <= wb_stat_error(wb)) break; if (fatal_signal_pending(current)) @@ -1566,7 +1556,7 @@ pause: if (laptop_mode) return; - if (nr_reclaimable > background_thresh) + if (nr_reclaimable > gdtc->bg_thresh) wb_start_background_writeback(wb); } -- cgit v1.1 From 970fb01ad3a773b5612a9bba6b366abcefc18eaf Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:24 -0400 Subject: writeback: add dirty_throttle_control->wb_bg_thresh wb_bg_thresh is currently treated as a second-class citizen. It's only used when BDI_CAP_STRICTLIMIT is set and balance_dirty_pages() doesn't calculate it unless the cap is set. When the cap is set, the calculated value is not passed around but instead recalculated whenever it's used. wb_position_ratio() calculates it by scaling wb_thresh proportional to bg_thresh / thresh. wb_update_dirty_ratelimit() uses wb_dirty_limit() on bg_thresh, which should generally lead to a similar result as the proportional scaling but can also be way off in the presence of max/min_ratio settings. Avoiding wb_bg_thresh calculation saves us one u64 multiplication and division when BDI_CAP_STRICTLIMIT is not set. Given that balance_dirty_pages() is already ratelimited, this doesn't justify the incurred extra complexity. 
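For illustration only, here is a minimal user-space sketch of the proportional scaling described in this message (wb_thresh scaled by bg_thresh / thresh). The name sketch_wb_bg_thresh() is hypothetical, uint64_t stands in for the kernel's u64 and plain division for div_u64(); the zero check on thresh follows the guard used in wb_dirty_limits(). It is a sketch under those assumptions, not the kernel code.

```c
#include <stdint.h>

/*
 * Hypothetical stand-alone model: scale a per-wb dirty threshold down to
 * a per-wb background threshold using the ratio bg_thresh / thresh.
 * All values are in pages.
 */
static unsigned long sketch_wb_bg_thresh(unsigned long wb_thresh,
					 unsigned long bg_thresh,
					 unsigned long thresh)
{
	/* A zero threshold would divide by zero; report 0 instead. */
	if (!thresh)
		return 0;

	/* Widen before multiplying so the product does not wrap on 32-bit. */
	return (unsigned long)(((uint64_t)wb_thresh * bg_thresh) / thresh);
}
```

With wb_thresh = 200, bg_thresh = 500 and thresh = 1000, the background threshold keeps the same 1:2 ratio to the per-wb threshold that bg_thresh has to thresh.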
This patch adds wb_bg_thresh to dirty_throttle_control and makes wb_dirty_limits() always calculate it and updates the users to use the pre-calculated value. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 27 +++++++++++---------------- 1 file changed, 11 insertions(+), 16 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 126e3c8..3ec9223 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -134,6 +134,7 @@ struct dirty_throttle_control { unsigned long wb_dirty; /* per-wb counterparts */ unsigned long wb_thresh; + unsigned long wb_bg_thresh; }; #define GDTC_INIT(__wb) .wb = (__wb) @@ -761,7 +762,6 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) */ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { long long wb_pos_ratio; - unsigned long wb_bg_thresh; if (dtc->wb_dirty < 8) return min_t(long long, pos_ratio * 2, @@ -770,9 +770,8 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) if (dtc->wb_dirty >= wb_thresh) return 0; - wb_bg_thresh = div_u64((u64)wb_thresh * dtc->bg_thresh, - dtc->thresh); - wb_setpoint = dirty_freerun_ceiling(wb_thresh, wb_bg_thresh); + wb_setpoint = dirty_freerun_ceiling(wb_thresh, + dtc->wb_bg_thresh); if (wb_setpoint == 0 || wb_setpoint == wb_thresh) return 0; @@ -1104,15 +1103,14 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, * * We rampup dirty_ratelimit forcibly if wb_dirty is low because * it's possible that wb_thresh is close to zero due to inactivity - * of backing device (see the implementation of wb_calc_thresh()). + * of backing device. 
*/ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { dirty = dtc->wb_dirty; if (dtc->wb_dirty < 8) setpoint = dtc->wb_dirty + 1; else - setpoint = (dtc->wb_thresh + - wb_calc_thresh(wb, dtc->bg_thresh)) / 2; + setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; } if (dirty < setpoint) { @@ -1307,8 +1305,7 @@ static long wb_min_pause(struct bdi_writeback *wb, return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t; } -static inline void wb_dirty_limits(struct dirty_throttle_control *dtc, - unsigned long *wb_bg_thresh) +static inline void wb_dirty_limits(struct dirty_throttle_control *dtc) { struct bdi_writeback *wb = dtc->wb; unsigned long wb_reclaimable; @@ -1327,11 +1324,8 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc, * at some rate <= (write_bw / 2) for bringing down wb_dirty. */ dtc->wb_thresh = wb_calc_thresh(dtc->wb, dtc->thresh); - - if (wb_bg_thresh) - *wb_bg_thresh = dtc->thresh ? div_u64((u64)dtc->wb_thresh * - dtc->bg_thresh, - dtc->thresh) : 0; + dtc->wb_bg_thresh = dtc->thresh ? 
+ div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0; /* * In order to avoid the stacked BDI deadlock we need @@ -1396,10 +1390,11 @@ static void balance_dirty_pages(struct address_space *mapping, global_dirty_limits(&gdtc->bg_thresh, &gdtc->thresh); if (unlikely(strictlimit)) { - wb_dirty_limits(gdtc, &bg_thresh); + wb_dirty_limits(gdtc); dirty = gdtc->wb_dirty; thresh = gdtc->wb_thresh; + bg_thresh = gdtc->wb_bg_thresh; } else { dirty = gdtc->dirty; thresh = gdtc->thresh; @@ -1427,7 +1422,7 @@ static void balance_dirty_pages(struct address_space *mapping, wb_start_background_writeback(wb); if (!strictlimit) - wb_dirty_limits(gdtc, NULL); + wb_dirty_limits(gdtc); dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && ((gdtc->dirty > gdtc->thresh) || strictlimit); -- cgit v1.1 From b1cbc6d40c85639d405fee37c7bb688c3bf468d6 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:25 -0400 Subject: writeback: make __wb_calc_thresh() take dirty_throttle_control wb_calc_thresh() calculates wb_thresh by scaling thresh according to the wb's portion in the system-wide write bandwidth. cgroup writeback support would need to calculate wb_thresh against memcg domain too. This patch renames wb_calc_thresh() to __wb_calc_thresh() and makes it take dirty_throttle_control so that the function can later be updated to calculate against different domains according to dirty_throttle_control. wb_calc_thresh() is now a thin wrapper around __wb_calc_thresh(). v2: The original version was incorrectly scaling dtc->dirty instead of dtc->thresh. This was due to the extremely confusing function and variable names. Added a rename patch and fixed this one. 
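As a rough aid for following the arithmetic, the share computation performed by __wb_calc_thresh() can be modelled in user space as below. This is only a sketch: struct share_params and sketch_wb_thresh() are hypothetical names, the completion fraction is passed in directly instead of being read via fprop_fraction_percpu(), and global_min_ratio stands in for the bdi_min_ratio sum used in the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inputs; none of these names are kernel API. */
struct share_params {
	unsigned long thresh;		/* domain-wide dirty threshold, in pages */
	long numerator;			/* wb's share of writeout completions */
	long denominator;
	unsigned long global_min_ratio;	/* sum of all min ratios, in percent */
	unsigned long wb_min_ratio;	/* this wb's min ratio, in percent */
	unsigned long wb_max_ratio;	/* this wb's max ratio, in percent */
};

static unsigned long sketch_wb_thresh(const struct share_params *p)
{
	uint64_t wb_thresh;
	uint64_t cap;

	assert(p->denominator > 0);

	/* Split the part of thresh not reserved by min ratios by throughput. */
	wb_thresh = ((uint64_t)p->thresh * (100 - p->global_min_ratio)) / 100;
	wb_thresh *= (uint64_t)p->numerator;
	wb_thresh /= (uint64_t)p->denominator;

	/* Add this wb's guaranteed minimum, then clamp to its maximum. */
	wb_thresh += ((uint64_t)p->thresh * p->wb_min_ratio) / 100;
	cap = ((uint64_t)p->thresh * p->wb_max_ratio) / 100;
	if (wb_thresh > cap)
		wb_thresh = cap;

	return (unsigned long)wb_thresh;
}
```

The point the v2 note makes is visible here: the value being scaled is the threshold (thresh), not the current number of dirty pages.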
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 3ec9223..2352c69 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -557,9 +557,8 @@ static unsigned long hard_dirty_limit(unsigned long thresh) } /** - * wb_calc_thresh - @wb's share of dirty throttling threshold - * @wb: bdi_writeback to query - * @dirty: global dirty limit in pages + * __wb_calc_thresh - @wb's share of dirty throttling threshold + * @dtc: dirty_throttle_context of interest * * Returns @wb's dirty limit in pages. The term "dirty" in the context of * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages. @@ -578,9 +577,10 @@ static unsigned long hard_dirty_limit(unsigned long thresh) * The wb's share of dirty limit will be adapting to its throughput and * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set. */ -unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) +static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc) { struct wb_domain *dom = &global_wb_domain; + unsigned long thresh = dtc->thresh; u64 wb_thresh; long numerator, denominator; unsigned long wb_min_ratio, wb_max_ratio; @@ -588,14 +588,14 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) /* * Calculate this BDI's share of the thresh ratio. 
*/ - fprop_fraction_percpu(&dom->completions, &wb->completions, + fprop_fraction_percpu(&dom->completions, &dtc->wb->completions, &numerator, &denominator); wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100; wb_thresh *= numerator; do_div(wb_thresh, denominator); - wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio); + wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio); wb_thresh += (thresh * wb_min_ratio) / 100; if (wb_thresh > (thresh * wb_max_ratio) / 100) @@ -604,6 +604,13 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) return wb_thresh; } +unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) +{ + struct dirty_throttle_control gdtc = { GDTC_INIT(wb), + .thresh = thresh }; + return __wb_calc_thresh(&gdtc); +} + /* * setpoint - dirty 3 * f(dirty) := 1.0 + (----------------) @@ -1323,7 +1330,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc) * wb_position_ratio() will let the dirtier task progress * at some rate <= (write_bw / 2) for bringing down wb_dirty. */ - dtc->wb_thresh = wb_calc_thresh(dtc->wb, dtc->thresh); + dtc->wb_thresh = __wb_calc_thresh(dtc); dtc->wb_bg_thresh = dtc->thresh ? div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0; -- cgit v1.1 From daddfa3cb30ebfe322d50af146d830fd435ddb1f Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:26 -0400 Subject: writeback: add dirty_throttle_control->pos_ratio wb_position_ratio() is used to calculate pos_ratio, which is used for two purposes. wb_update_dirty_ratelimit() uses it to adjust wb->[balanced_]dirty_ratelimit gradually and balance_dirty_pages() to immediately adjust dirty_ratelimit right before applying it to determine pause duration. While wb_update_dirty_ratelimit() is separately rate limited from balance_dirty_pages(), on the run where the ratelimit is updated, we end up calculating pos_ratio twice with the same parameters. This patch adds dirty_throttle_control->pos_ratio. 
balance_dirty_pages() calculates it once per run and wb_update_dirty_ratelimit() uses the value stored in dirty_throttle_control. This removes the duplicate calculation and also will help implementing memcg wb_domain. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 36 +++++++++++++++++++++--------------- 1 file changed, 21 insertions(+), 15 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2352c69..fcebae7 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -135,6 +135,8 @@ struct dirty_throttle_control { unsigned long wb_dirty; /* per-wb counterparts */ unsigned long wb_thresh; unsigned long wb_bg_thresh; + + unsigned long pos_ratio; }; #define GDTC_INIT(__wb) .wb = (__wb) @@ -717,7 +719,7 @@ static long long pos_ratio_polynom(unsigned long setpoint, * card's wb_dirty may rush to many times higher than wb_setpoint. * - the wb dirty thresh drops quickly due to change of JBOD workload */ -static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) +static void wb_position_ratio(struct dirty_throttle_control *dtc) { struct bdi_writeback *wb = dtc->wb; unsigned long write_bw = wb->avg_write_bandwidth; @@ -731,8 +733,10 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) long long pos_ratio; /* for scaling up/down the rate limit */ long x; + dtc->pos_ratio = 0; + if (unlikely(dtc->dirty >= limit)) - return 0; + return; /* * global setpoint @@ -770,18 +774,20 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { long long wb_pos_ratio; - if (dtc->wb_dirty < 8) - return min_t(long long, pos_ratio * 2, - 2 << RATELIMIT_CALC_SHIFT); + if (dtc->wb_dirty < 8) { + dtc->pos_ratio = min_t(long long, pos_ratio * 2, + 2 << RATELIMIT_CALC_SHIFT); + return; + } if (dtc->wb_dirty >= wb_thresh) - return 0; + return; 
wb_setpoint = dirty_freerun_ceiling(wb_thresh, dtc->wb_bg_thresh); if (wb_setpoint == 0 || wb_setpoint == wb_thresh) - return 0; + return; wb_pos_ratio = pos_ratio_polynom(wb_setpoint, dtc->wb_dirty, wb_thresh); @@ -807,7 +813,8 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) * is 2. We might want to tweak this if we observe the control * system is too slow to adapt. */ - return min(pos_ratio, wb_pos_ratio); + dtc->pos_ratio = min(pos_ratio, wb_pos_ratio); + return; } /* @@ -888,7 +895,7 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) pos_ratio *= 8; } - return pos_ratio; + dtc->pos_ratio = pos_ratio; } static void wb_update_write_bandwidth(struct bdi_writeback *wb, @@ -1009,7 +1016,6 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, unsigned long dirty_rate; unsigned long task_ratelimit; unsigned long balanced_dirty_ratelimit; - unsigned long pos_ratio; unsigned long step; unsigned long x; @@ -1019,12 +1025,11 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, */ dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed; - pos_ratio = wb_position_ratio(dtc); /* * task_ratelimit reflects each dd's dirty rate for the past 200ms. 
*/ task_ratelimit = (u64)dirty_ratelimit * - pos_ratio >> RATELIMIT_CALC_SHIFT; + dtc->pos_ratio >> RATELIMIT_CALC_SHIFT; task_ratelimit++; /* it helps rampup dirty_ratelimit from tiny values */ /* @@ -1375,7 +1380,6 @@ static void balance_dirty_pages(struct address_space *mapping, bool dirty_exceeded = false; unsigned long task_ratelimit; unsigned long dirty_ratelimit; - unsigned long pos_ratio; struct backing_dev_info *bdi = wb->bdi; bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; unsigned long start_time = jiffies; @@ -1433,6 +1437,9 @@ static void balance_dirty_pages(struct address_space *mapping, dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && ((gdtc->dirty > gdtc->thresh) || strictlimit); + + wb_position_ratio(gdtc); + if (dirty_exceeded && !wb->dirty_exceeded) wb->dirty_exceeded = 1; @@ -1444,8 +1451,7 @@ static void balance_dirty_pages(struct address_space *mapping, } dirty_ratelimit = wb->dirty_ratelimit; - pos_ratio = wb_position_ratio(gdtc); - task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >> + task_ratelimit = ((u64)dirty_ratelimit * gdtc->pos_ratio) >> RATELIMIT_CALC_SHIFT; max_pause = wb_max_pause(wb, gdtc->wb_dirty); min_pause = wb_min_pause(wb, max_pause, -- cgit v1.1 From e9770b3487328b7e28803caf6c809292dd7adbf0 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:27 -0400 Subject: writeback: add dirty_throttle_control->wb_completions wb->completions measures the wb's proportional write bandwidth in global_wb_domain and thus naturally tied to the wb_domain. This patch adds dirty_throttle_control->wb_completions which is initialized to wb->completions by GDTC_INIT() and updates __wb_dirty_limits() to use it instead of dereferencing wb->completions directly. This will allow dirty_throttle_control to represent different wb_domains and the matching wb completions. This patch doesn't introduce any behavioral changes. 
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index fcebae7..5b439fc 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -127,6 +127,7 @@ struct wb_domain global_wb_domain; /* consolidated parameters for balance_dirty_pages() and its subroutines */ struct dirty_throttle_control { struct bdi_writeback *wb; + struct fprop_local_percpu *wb_completions; unsigned long dirty; /* file_dirty + write + nfs */ unsigned long thresh; /* dirty threshold */ @@ -139,7 +140,8 @@ struct dirty_throttle_control { unsigned long pos_ratio; }; -#define GDTC_INIT(__wb) .wb = (__wb) +#define GDTC_INIT(__wb) .wb = (__wb), \ + .wb_completions = &(__wb)->completions /* * Length of period for aging writeout fractions of bdis. This is an @@ -590,7 +592,7 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc) /* * Calculate this BDI's share of the thresh ratio. */ - fprop_fraction_percpu(&dom->completions, &dtc->wb->completions, + fprop_fraction_percpu(&dom->completions, dtc->wb_completions, &numerator, &denominator); wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100; -- cgit v1.1 From e9f07dfd7086a0b7e9ce98bb97b7422861aad40b Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:28 -0400 Subject: writeback: add dirty_throttle_control->dom Currently all dirty throttle operations use global_wb_domain; however, cgroup writeback support requires considering per-memcg wb_domain too. This patch adds dirty_throttle_control->dom and updates functions which are directly using global_wb_domain to use it instead. As this makes global_update_bandwidth() a misnomer, the function is renamed to domain_update_bandwidth(). This patch doesn't introduce any behavioral changes.
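The accessor pattern described above can be sketched in isolation. This is a minimal standalone illustration, not the kernel code: the _sketch names and the SKETCH_CGROUP_WRITEBACK switch are hypothetical stand-ins for wb_domain, dirty_throttle_control, dtc_dom() and CONFIG_CGROUP_WRITEBACK.

```c
/* Hypothetical stand-ins for wb_domain and dirty_throttle_control. */
struct domain_sketch {
	unsigned long dirty_limit;
};

static struct domain_sketch global_domain_sketch;

struct dtc_sketch {
#ifdef SKETCH_CGROUP_WRITEBACK
	struct domain_sketch *dom;	/* only present when the option is on */
#endif
	unsigned long thresh;
};

#ifdef SKETCH_CGROUP_WRITEBACK
/* Global initializer records the global domain explicitly. */
#define GDTC_SKETCH_INIT(t)	{ .dom = &global_domain_sketch, .thresh = (t) }

static struct domain_sketch *dtc_dom_sketch(struct dtc_sketch *dtc)
{
	return dtc->dom;
}
#else
/* Without the option there is no ->dom field to initialize. */
#define GDTC_SKETCH_INIT(t)	{ .thresh = (t) }

static struct domain_sketch *dtc_dom_sketch(struct dtc_sketch *dtc)
{
	(void)dtc;	/* no per-dtc domain: everything uses the global one */
	return &global_domain_sketch;
}
#endif
```

Callers only ever go through the accessor, so the same call sites compile with or without the option, and the extra pointer costs nothing when the option is off.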
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 30 ++++++++++++++++++++++++------ 1 file changed, 24 insertions(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 5b439fc..38d45d8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -126,6 +126,9 @@ struct wb_domain global_wb_domain; /* consolidated parameters for balance_dirty_pages() and its subroutines */ struct dirty_throttle_control { +#ifdef CONFIG_CGROUP_WRITEBACK + struct wb_domain *dom; +#endif struct bdi_writeback *wb; struct fprop_local_percpu *wb_completions; @@ -140,7 +143,7 @@ struct dirty_throttle_control { unsigned long pos_ratio; }; -#define GDTC_INIT(__wb) .wb = (__wb), \ +#define DTC_INIT_COMMON(__wb) .wb = (__wb), \ .wb_completions = &(__wb)->completions /* @@ -152,6 +155,14 @@ struct dirty_throttle_control { #ifdef CONFIG_CGROUP_WRITEBACK +#define GDTC_INIT(__wb) .dom = &global_wb_domain, \ + DTC_INIT_COMMON(__wb) + +static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc) +{ + return dtc->dom; +} + static void wb_min_max_ratio(struct bdi_writeback *wb, unsigned long *minp, unsigned long *maxp) { @@ -181,6 +192,13 @@ static void wb_min_max_ratio(struct bdi_writeback *wb, #else /* CONFIG_CGROUP_WRITEBACK */ +#define GDTC_INIT(__wb) DTC_INIT_COMMON(__wb) + +static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc) +{ + return &global_wb_domain; +} + static void wb_min_max_ratio(struct bdi_writeback *wb, unsigned long *minp, unsigned long *maxp) { @@ -583,7 +601,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh) */ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc) { - struct wb_domain *dom = &global_wb_domain; + struct wb_domain *dom = dtc_dom(dtc); unsigned long thresh = dtc->thresh; u64 wb_thresh; long numerator, denominator; @@ -952,7 +970,7 @@ out: static void update_dirty_limit(struct 
dirty_throttle_control *dtc) { - struct wb_domain *dom = &global_wb_domain; + struct wb_domain *dom = dtc_dom(dtc); unsigned long thresh = dtc->thresh; unsigned long limit = dom->dirty_limit; @@ -979,10 +997,10 @@ update: dom->dirty_limit = limit; } -static void global_update_bandwidth(struct dirty_throttle_control *dtc, +static void domain_update_bandwidth(struct dirty_throttle_control *dtc, unsigned long now) { - struct wb_domain *dom = &global_wb_domain; + struct wb_domain *dom = dtc_dom(dtc); /* * check locklessly first to optimize away locking for the most time @@ -1190,7 +1208,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *dtc, goto snapshot; if (update_ratelimit) { - global_update_bandwidth(dtc, now); + domain_update_bandwidth(dtc, now); wb_update_dirty_ratelimit(dtc, dirtied, elapsed); } wb_update_write_bandwidth(wb, elapsed, written); -- cgit v1.1 From c7981433ef05e67b1b40740b2c40edbd4476b659 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:29 -0400 Subject: writeback: make __wb_writeout_inc() and hard_dirty_limit() take wb_domain as a parameter Currently __wb_writeout_inc() and hard_dirty_limit() assume global_wb_domain; however, cgroup writeback support requires considering per-memcg wb_domain too. This patch separates out the domain-specific part of __wb_writeout_inc() into wb_domain_writeout_inc(), which takes wb_domain as a parameter, and adds the parameter to hard_dirty_limit(). This will allow these two functions to handle per-memcg wb_domains. This patch doesn't introduce any behavioral changes.
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 37 +++++++++++++++++++++---------------- 1 file changed, 21 insertions(+), 16 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 38d45d8..a4d0cee 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -445,17 +445,12 @@ static unsigned long wp_next_time(unsigned long cur_time) return cur_time; } -/* - * Increment the wb's writeout completion count and the global writeout - * completion count. Called from test_clear_page_writeback(). - */ -static inline void __wb_writeout_inc(struct bdi_writeback *wb) +static void wb_domain_writeout_inc(struct wb_domain *dom, + struct fprop_local_percpu *completions, + unsigned int max_prop_frac) { - struct wb_domain *dom = &global_wb_domain; - - __inc_wb_stat(wb, WB_WRITTEN); - __fprop_inc_percpu_max(&dom->completions, &wb->completions, - wb->bdi->max_prop_frac); + __fprop_inc_percpu_max(&dom->completions, completions, + max_prop_frac); /* First event after period switching was turned off? */ if (!unlikely(dom->period_time)) { /* @@ -469,6 +464,17 @@ static inline void __wb_writeout_inc(struct bdi_writeback *wb) } } +/* + * Increment @wb's writeout completion count and the global writeout + * completion count. Called from test_clear_page_writeback(). 
+ */ +static inline void __wb_writeout_inc(struct bdi_writeback *wb) +{ + __inc_wb_stat(wb, WB_WRITTEN); + wb_domain_writeout_inc(&global_wb_domain, &wb->completions, + wb->bdi->max_prop_frac); +} + void wb_writeout_inc(struct bdi_writeback *wb) { unsigned long flags; @@ -571,10 +577,9 @@ static unsigned long dirty_freerun_ceiling(unsigned long thresh, return (thresh + bg_thresh) / 2; } -static unsigned long hard_dirty_limit(unsigned long thresh) +static unsigned long hard_dirty_limit(struct wb_domain *dom, + unsigned long thresh) { - struct wb_domain *dom = &global_wb_domain; - return max(thresh, dom->dirty_limit); } @@ -744,7 +749,7 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc) struct bdi_writeback *wb = dtc->wb; unsigned long write_bw = wb->avg_write_bandwidth; unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh); - unsigned long limit = hard_dirty_limit(dtc->thresh); + unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh); unsigned long wb_thresh = dtc->wb_thresh; unsigned long x_intercept; unsigned long setpoint; /* dirty pages' target balance point */ @@ -1029,7 +1034,7 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, struct bdi_writeback *wb = dtc->wb; unsigned long dirty = dtc->dirty; unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh); - unsigned long limit = hard_dirty_limit(dtc->thresh); + unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh); unsigned long setpoint = (freerun + limit) / 2; unsigned long write_bw = wb->avg_write_bandwidth; unsigned long dirty_ratelimit = wb->dirty_ratelimit; @@ -1681,7 +1686,7 @@ void throttle_vm_writeout(gfp_t gfp_mask) for ( ; ; ) { global_dirty_limits(&background_thresh, &dirty_thresh); - dirty_thresh = hard_dirty_limit(dirty_thresh); + dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh); /* * Boost the allowable dirty threshold a bit for page -- cgit v1.1 From 
9fc3a43e1757ab6de0e8ce83b5d5a83083174e3b Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:30 -0400 Subject: writeback: separate out domain_dirty_limits() global_dirty_limits() calculates thresh and bg_thresh (confusingly called *pdirty and *pbackground in the function) assuming global_wb_domain; however, cgroup writeback support requires considering per-memcg wb_domain too. This patch separates domain_dirty_limits(), which takes a dirty_throttle_control, out of global_dirty_limits(). As the thresh and bg_thresh calculation needs the amount of dirtyable memory in the domain, dirty_throttle_control->avail is added. The new function calculates the two thresholds and stores them directly in the dirty_throttle_control. Also, memcg domains can't follow the vm_dirty_bytes and dirty_background_bytes settings directly. If those are set and domain_dirty_limits() is invoked for a !global domain, the settings are translated to ratios by scaling them against globally available memory. dirty_throttle_control->gdtc is added to enable this when CONFIG_CGROUP_WRITEBACK is enabled. global_dirty_limits() is now a thin wrapper around domain_dirty_limits() and balance_dirty_pages() is updated to use the new function too. This patch doesn't introduce any behavioral changes.
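As a worked illustration of the byte-to-ratio translation described above, here is a minimal userspace sketch. It is not the kernel code: the helper names are hypothetical, a 4096-byte page size is assumed, and global_avail is assumed to be non-zero (global_dirtyable_memory() never returns 0).

```c
#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the example */

/* Round-up division, mirroring what DIV_ROUND_UP() does. */
static unsigned long div_round_up_sketch(unsigned long n, unsigned long d)
{
	return (n + d - 1) / d;
}

/*
 * Translate a byte setting into a percentage of the globally dirtyable
 * pages, capped at 100. global_avail must be non-zero.
 */
static unsigned long bytes_to_ratio_sketch(unsigned long bytes,
					   unsigned long global_avail)
{
	unsigned long ratio;

	ratio = div_round_up_sketch(bytes, SKETCH_PAGE_SIZE) * 100 / global_avail;
	return ratio < 100UL ? ratio : 100UL;
}

/* Apply the translated ratio to the memcg domain's own dirtyable pages. */
static unsigned long memcg_thresh_sketch(unsigned long bytes,
					 unsigned long global_avail,
					 unsigned long memcg_avail)
{
	return bytes_to_ratio_sketch(bytes, global_avail) * memcg_avail / 100;
}
```

With a 1 GiB byte setting (262144 pages) and 1048576 globally dirtyable pages, the ratio comes out as 25; a memcg domain with 100000 dirtyable pages then gets a threshold of 25000 pages rather than the full 262144.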
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 111 ++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 86 insertions(+), 25 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index a4d0cee..c8ac8ce 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -128,10 +128,12 @@ struct wb_domain global_wb_domain; struct dirty_throttle_control { #ifdef CONFIG_CGROUP_WRITEBACK struct wb_domain *dom; + struct dirty_throttle_control *gdtc; /* only set in memcg dtc's */ #endif struct bdi_writeback *wb; struct fprop_local_percpu *wb_completions; + unsigned long avail; /* dirtyable */ unsigned long dirty; /* file_dirty + write + nfs */ unsigned long thresh; /* dirty threshold */ unsigned long bg_thresh; /* dirty background threshold */ @@ -157,12 +159,18 @@ struct dirty_throttle_control { #define GDTC_INIT(__wb) .dom = &global_wb_domain, \ DTC_INIT_COMMON(__wb) +#define GDTC_INIT_NO_WB .dom = &global_wb_domain static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc) { return dtc->dom; } +static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *mdtc) +{ + return mdtc->gdtc; +} + static void wb_min_max_ratio(struct bdi_writeback *wb, unsigned long *minp, unsigned long *maxp) { @@ -193,12 +201,18 @@ static void wb_min_max_ratio(struct bdi_writeback *wb, #else /* CONFIG_CGROUP_WRITEBACK */ #define GDTC_INIT(__wb) DTC_INIT_COMMON(__wb) +#define GDTC_INIT_NO_WB static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc) { return &global_wb_domain; } +static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *mdtc) +{ + return NULL; +} + static void wb_min_max_ratio(struct bdi_writeback *wb, unsigned long *minp, unsigned long *maxp) { @@ -303,42 +317,88 @@ static unsigned long global_dirtyable_memory(void) return x + 1; /* Ensure that we never return 0 */ } -/* - * 
global_dirty_limits - background-writeback and dirty-throttling thresholds +/** + * domain_dirty_limits - calculate thresh and bg_thresh for a wb_domain + * @dtc: dirty_throttle_control of interest * - * Calculate the dirty thresholds based on sysctl parameters - * - vm.dirty_background_ratio or vm.dirty_background_bytes - * - vm.dirty_ratio or vm.dirty_bytes - * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and + * Calculate @dtc->thresh and ->bg_thresh considering + * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller + * must ensure that @dtc->avail is set before calling this function. The + * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and * real-time tasks. */ -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty) -{ - const unsigned long available_memory = global_dirtyable_memory(); - unsigned long background; - unsigned long dirty; +static void domain_dirty_limits(struct dirty_throttle_control *dtc) +{ + const unsigned long available_memory = dtc->avail; + struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc); + unsigned long bytes = vm_dirty_bytes; + unsigned long bg_bytes = dirty_background_bytes; + unsigned long ratio = vm_dirty_ratio; + unsigned long bg_ratio = dirty_background_ratio; + unsigned long thresh; + unsigned long bg_thresh; struct task_struct *tsk; - if (vm_dirty_bytes) - dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE); + /* gdtc is !NULL iff @dtc is for memcg domain */ + if (gdtc) { + unsigned long global_avail = gdtc->avail; + + /* + * The byte settings can't be applied directly to memcg + * domains. Convert them to ratios by scaling against + * globally available memory. 
+ */ + if (bytes) + ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 / + global_avail, 100UL); + if (bg_bytes) + bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 / + global_avail, 100UL); + bytes = bg_bytes = 0; + } + + if (bytes) + thresh = DIV_ROUND_UP(bytes, PAGE_SIZE); else - dirty = (vm_dirty_ratio * available_memory) / 100; + thresh = (ratio * available_memory) / 100; - if (dirty_background_bytes) - background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE); + if (bg_bytes) + bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE); else - background = (dirty_background_ratio * available_memory) / 100; + bg_thresh = (bg_ratio * available_memory) / 100; - if (background >= dirty) - background = dirty / 2; + if (bg_thresh >= thresh) + bg_thresh = thresh / 2; tsk = current; if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { - background += background / 4; - dirty += dirty / 4; + bg_thresh += bg_thresh / 4; + thresh += thresh / 4; } - *pbackground = background; - *pdirty = dirty; - trace_global_dirty_state(background, dirty); + dtc->thresh = thresh; + dtc->bg_thresh = bg_thresh; + + /* we should eventually report the domain in the TP */ + if (!gdtc) + trace_global_dirty_state(bg_thresh, thresh); +} + +/** + * global_dirty_limits - background-writeback and dirty-throttling thresholds + * @pbackground: out parameter for bg_thresh + * @pdirty: out parameter for thresh + * + * Calculate bg_thresh and thresh for global_wb_domain. See + * domain_dirty_limits() for details. 
+ */ +void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty) +{ + struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB }; + + gdtc.avail = global_dirtyable_memory(); + domain_dirty_limits(&gdtc); + + *pbackground = gdtc.bg_thresh; + *pdirty = gdtc.thresh; } /** @@ -1421,9 +1481,10 @@ static void balance_dirty_pages(struct address_space *mapping, */ nr_reclaimable = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS); + gdtc->avail = global_dirtyable_memory(); gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); - global_dirty_limits(&gdtc->bg_thresh, &gdtc->thresh); + domain_dirty_limits(gdtc); if (unlikely(strictlimit)) { wb_dirty_limits(gdtc); -- cgit v1.1 From aa661bbe1e61ce80ca4ae98804f673ede94b0827 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:31 -0400 Subject: writeback: move over_bground_thresh() to mm/page-writeback.c and rename it to wb_over_bg_thresh(). The function is closely tied to the dirty throttling mechanism implemented in page-writeback.c. This relocation will allow future updates necessary for cgroup writeback support. While at it, add function comment. This is pure reorganization and doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index c8ac8ce..9d9a896 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1740,6 +1740,29 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping) } EXPORT_SYMBOL(balance_dirty_pages_ratelimited); +/** + * wb_over_bg_thresh - does @wb need to be written back? + * @wb: bdi_writeback of interest + * + * Determines whether background writeback should keep writing @wb or it's + * clean enough. Returns %true if writeback should continue. 
+ */ +bool wb_over_bg_thresh(struct bdi_writeback *wb) +{ + unsigned long background_thresh, dirty_thresh; + + global_dirty_limits(&background_thresh, &dirty_thresh); + + if (global_page_state(NR_FILE_DIRTY) + + global_page_state(NR_UNSTABLE_NFS) > background_thresh) + return true; + + if (wb_stat(wb, WB_RECLAIMABLE) > wb_calc_thresh(wb, background_thresh)) + return true; + + return false; +} + void throttle_vm_writeout(gfp_t gfp_mask) { unsigned long background_thresh; -- cgit v1.1 From 947e9762a8ddefda38aa21e249e6a4fec215cd12 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:32 -0400 Subject: writeback: update wb_over_bg_thresh() to use wb_domain aware operations wb_over_bg_thresh() currently uses global_dirty_limits() and wb_calc_thresh(), both of which are wrappers around operations which take dirty_throttle_control. For cgroup writeback support, the function will be updated to also consider memcg wb_domains, which requires the context information carried in dirty_throttle_control. This patch updates wb_over_bg_thresh() so that it uses the underlying wb_domain aware operations directly and builds the global dirty_throttle_control in the process. This patch doesn't introduce any behavioral changes.
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/page-writeback.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 9d9a896..a7ba5ce 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1749,15 +1749,22 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited); */ bool wb_over_bg_thresh(struct bdi_writeback *wb) { - unsigned long background_thresh, dirty_thresh; + struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) }; + struct dirty_throttle_control * const gdtc = &gdtc_stor; - global_dirty_limits(&background_thresh, &dirty_thresh); + /* + * Similar to balance_dirty_pages() but ignores pages being written + * as we're trying to decide whether to put more under writeback. + */ + gdtc->avail = global_dirtyable_memory(); + gdtc->dirty = global_page_state(NR_FILE_DIRTY) + + global_page_state(NR_UNSTABLE_NFS); + domain_dirty_limits(gdtc); - if (global_page_state(NR_FILE_DIRTY) + - global_page_state(NR_UNSTABLE_NFS) > background_thresh) + if (gdtc->dirty > gdtc->bg_thresh) return true; - if (wb_stat(wb, WB_RECLAIMABLE) > wb_calc_thresh(wb, background_thresh)) + if (wb_stat(wb, WB_RECLAIMABLE) > __wb_calc_thresh(gdtc)) return true; return false; -- cgit v1.1 From 841710aa6e4acd066ab9fe8c8cb6f4e4e6709d83 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:33 -0400 Subject: writeback: implement memcg wb_domain Dirtyable memory is distributed to a wb (bdi_writeback) according to the relative bandwidth the wb is writing out in the whole system. This distribution is global - each wb is measured against all other wb's and gets the proportionately sized portion of the memory in the whole system. For cgroup writeback, the amount of dirtyable memory is scoped by memcg and thus each wb would need to be measured and controlled in its memcg.
IOW, a wb will belong to two writeback domains - the global and memcg domains. The previous patches laid the groundwork to support the two wb_domains and this patch implements memcg wb_domain. memcg->cgwb_domain is initialized on css online and destroyed on css release, wb->memcg_completions is added, and __wb_writeout_inc() is updated to increment completions against both global and memcg wb_domains. The following patches will update balance_dirty_pages() and its subroutines to actually consider memcg wb_domain for throttling. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/backing-dev.c | 9 ++++++++- mm/memcontrol.c | 39 +++++++++++++++++++++++++++++++++++++++ mm/page-writeback.c | 25 +++++++++++++++++++++++++ 3 files changed, 72 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 9c8b7b5..84ebf7c 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -482,6 +482,7 @@ static void cgwb_release_workfn(struct work_struct *work) css_put(wb->blkcg_css); wb_congested_put(wb->congested); + fprop_local_destroy_percpu(&wb->memcg_completions); percpu_ref_exit(&wb->refcnt); wb_exit(wb); kfree_rcu(wb, rcu); @@ -548,9 +549,13 @@ static int cgwb_create(struct backing_dev_info *bdi, if (ret) goto err_wb_exit; + ret = fprop_local_init_percpu(&wb->memcg_completions, gfp); + if (ret) + goto err_ref_exit; + wb->congested = wb_congested_get_create(bdi, blkcg_css->id, gfp); if (!wb->congested) - goto err_ref_exit; + goto err_fprop_exit; wb->memcg_css = memcg_css; wb->blkcg_css = blkcg_css; @@ -587,6 +592,8 @@ static int cgwb_create(struct backing_dev_info *bdi, err_put_congested: wb_congested_put(wb->congested); +err_fprop_exit: + fprop_local_destroy_percpu(&wb->memcg_completions); err_ref_exit: percpu_ref_exit(&wb->refcnt); err_wb_exit: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 701cbee..ce113dd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ 
-345,6 +345,7 @@ struct mem_cgroup { #ifdef CONFIG_CGROUP_WRITEBACK struct list_head cgwb_list; + struct wb_domain cgwb_domain; #endif /* List of events which userspace want to receive */ @@ -3994,6 +3995,37 @@ struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg) return &memcg->cgwb_list; } +static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) +{ + return wb_domain_init(&memcg->cgwb_domain, gfp); +} + +static void memcg_wb_domain_exit(struct mem_cgroup *memcg) +{ + wb_domain_exit(&memcg->cgwb_domain); +} + +struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); + + if (!memcg->css.parent) + return NULL; + + return &memcg->cgwb_domain; +} + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) +{ + return 0; +} + +static void memcg_wb_domain_exit(struct mem_cgroup *memcg) +{ +} + #endif /* CONFIG_CGROUP_WRITEBACK */ /* @@ -4380,9 +4412,15 @@ static struct mem_cgroup *mem_cgroup_alloc(void) memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu); if (!memcg->stat) goto out_free; + + if (memcg_wb_domain_init(memcg, GFP_KERNEL)) + goto out_free_stat; + spin_lock_init(&memcg->pcp_counter_lock); return memcg; +out_free_stat: + free_percpu(memcg->stat); out_free: kfree(memcg); return NULL; @@ -4409,6 +4447,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) free_mem_cgroup_per_zone_info(memcg, node); free_percpu(memcg->stat); + memcg_wb_domain_exit(memcg); kfree(memcg); } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index a7ba5ce..a146e33 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -171,6 +171,11 @@ static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *m return mdtc->gdtc; } +static struct fprop_local_percpu *wb_memcg_completions(struct bdi_writeback *wb) +{ + return &wb->memcg_completions; +} + static void wb_min_max_ratio(struct bdi_writeback *wb, 
unsigned long *minp, unsigned long *maxp) { @@ -213,6 +218,11 @@ static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *m return NULL; } +static struct fprop_local_percpu *wb_memcg_completions(struct bdi_writeback *wb) +{ + return NULL; +} + static void wb_min_max_ratio(struct bdi_writeback *wb, unsigned long *minp, unsigned long *maxp) { @@ -530,9 +540,16 @@ static void wb_domain_writeout_inc(struct wb_domain *dom, */ static inline void __wb_writeout_inc(struct bdi_writeback *wb) { + struct wb_domain *cgdom; + __inc_wb_stat(wb, WB_WRITTEN); wb_domain_writeout_inc(&global_wb_domain, &wb->completions, wb->bdi->max_prop_frac); + + cgdom = mem_cgroup_wb_domain(wb); + if (cgdom) + wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb), + wb->bdi->max_prop_frac); } void wb_writeout_inc(struct bdi_writeback *wb) @@ -583,6 +600,14 @@ int wb_domain_init(struct wb_domain *dom, gfp_t gfp) return fprop_global_init(&dom->completions, gfp); } +#ifdef CONFIG_CGROUP_WRITEBACK +void wb_domain_exit(struct wb_domain *dom) +{ + del_timer_sync(&dom->period_timer); + fprop_global_destroy(&dom->completions); +} +#endif + /* * bdi_min_ratio keeps the sum of the minimum dirty shares of all * registered backing devices, which, for obvious reasons, can not -- cgit v1.1 From 2529bb3aadc40a93e642f5f3650f63379a964467 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:34 -0400 Subject: writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes The amount of available memory to a memcg wb_domain can change as memcg configuration changes. A domain's ->dirty_limit exists to smooth out sudden drops in dirty threshold; however, when a domain's size actually drops significantly, it hinders the dirty throttling from adjusting to the new configuration leading to unexpected behaviors including unnecessary OOM kills. 
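To illustrate the problem, a minimal standalone sketch follows. hard_dirty_limit() in this series returns max(thresh, dom->dirty_limit), so a stale ->dirty_limit keeps the effective limit high after the threshold has dropped. The _sketch names are hypothetical, and treating the reset as zeroing ->dirty_limit is an assumption made for the example.

```c
/* Hypothetical stand-in for the relevant part of wb_domain. */
struct limit_domain_sketch {
	unsigned long dirty_limit;	/* smoothed limit, lags behind thresh */
};

/* Same shape as hard_dirty_limit(): the larger of thresh and ->dirty_limit. */
static unsigned long hard_limit_sketch(const struct limit_domain_sketch *dom,
				       unsigned long thresh)
{
	return thresh > dom->dirty_limit ? thresh : dom->dirty_limit;
}

/* Assumed reset: drop the smoothed limit so that thresh takes effect at once. */
static void size_changed_sketch(struct limit_domain_sketch *dom)
{
	dom->dirty_limit = 0;
}
```

With ->dirty_limit at 10000 pages and a new threshold of 2000 pages, the effective limit stays at 10000 until the smoothed value catches up; after the reset it is 2000 immediately.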
This patch resolves the issue by adding wb_domain_size_changed() which resets ->dirty_limit[_tstmp] and making memcg call it on configuration changes. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/memcontrol.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ce113dd..c0b0406 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4005,6 +4005,11 @@ static void memcg_wb_domain_exit(struct mem_cgroup *memcg) wb_domain_exit(&memcg->cgwb_domain); } +static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) +{ + wb_domain_size_changed(&memcg->cgwb_domain); +} + struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) { struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); @@ -4026,6 +4031,10 @@ static void memcg_wb_domain_exit(struct mem_cgroup *memcg) { } +static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) +{ +} + #endif /* CONFIG_CGROUP_WRITEBACK */ /* @@ -4624,6 +4633,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) memcg->low = 0; memcg->high = PAGE_COUNTER_MAX; memcg->soft_limit = PAGE_COUNTER_MAX; + memcg_wb_domain_size_changed(memcg); } #ifdef CONFIG_MMU @@ -5361,6 +5371,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, memcg->high = high; + memcg_wb_domain_size_changed(memcg); return nbytes; } @@ -5393,6 +5404,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, if (err) return err; + memcg_wb_domain_size_changed(memcg); return nbytes; } -- cgit v1.1 From c2aa723a6093633ae4ec15b08a4db276643cab3e Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:35 -0400 Subject: writeback: implement memcg writeback domain based throttling While cgroup writeback support now connects memcg and blkcg so that writeback IOs are properly attributed and controlled, the IO back pressure propagation mechanism implemented in 
balance_dirty_pages() and its subroutines wasn't aware of cgroup writeback. Processes belonging to a memcg may have access to only a subset of the total memory available in the system. Not factoring this into dirty throttling rendered it completely ineffective for processes under memcg limits, and memcg ended up building a separate ad-hoc degenerate mechanism directly into vmscan code to limit page dirtying.

The previous patches updated balance_dirty_pages() and its subroutines so that they can deal with multiple wb_domain's (writeback domains) and defined per-memcg wb_domain. Processes belonging to a non-root memcg are bound to two wb_domains, global wb_domain and memcg wb_domain, and should be throttled according to IO pressures from both domains. This patch updates dirty throttling code so that it repeats similar calculations for the two domains - the differences between the two are few and minor - and applies the lower of the two sets of resulting constraints.

wb_over_bg_thresh(), which controls when background writeback terminates, is also updated to consider both global and memcg wb_domains. It returns true if dirty is over bg_thresh for either domain.

This makes the dirty throttling mechanism operational for memcg domains, including writeback-bandwidth-proportional dirty page distribution inside them, but the ad-hoc memcg throttling mechanism in vmscan is still in place. The next patch will rip it out.
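The two-domain rule can be sketched in isolation. This is a simplified model, not the patch itself: `toy_dtc` and the `toy_*` helpers are hypothetical names, and only the fields needed for the illustration are kept. The capping step follows mdtc_cap_avail() and the selection follows the pos_ratio comparison in balance_dirty_pages() from the diff that follows.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct dirty_throttle_control. */
struct toy_dtc {
	unsigned long avail;		/* dirtyable pages in this domain */
	unsigned long dirty;		/* dirty pages in this domain */
	unsigned long pos_ratio;	/* lower value means throttle harder */
};

static unsigned long toy_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* A memcg domain cannot have more available than system-wide clean memory. */
static void toy_cap_avail(struct toy_dtc *mdtc, const struct toy_dtc *gdtc)
{
	unsigned long clean = gdtc->avail - toy_min(gdtc->avail, gdtc->dirty);

	mdtc->avail = toy_min(mdtc->avail, clean);
}

/*
 * Pick the domain whose constraints are stricter. mdtc is NULL when no
 * memcg domain is in effect, in which case the global domain is used.
 */
static struct toy_dtc *toy_select_dtc(struct toy_dtc *gdtc, struct toy_dtc *mdtc)
{
	if (mdtc && mdtc->pos_ratio < gdtc->pos_ratio)
		return mdtc;
	return gdtc;
}
```

Defaulting to the global domain and switching only on a strictly lower pos_ratio keeps the behavior unchanged for processes that are not under a memcg domain.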
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/memcontrol.c | 43 ++++++++++++++ mm/page-writeback.c | 158 ++++++++++++++++++++++++++++++++++++++++++++-------- 2 files changed, 179 insertions(+), 22 deletions(-) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c0b0406..f816d91 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4020,6 +4020,49 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) return &memcg->cgwb_domain; } +/** + * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg + * @wb: bdi_writeback in question + * @pavail: out parameter for number of available pages + * @pdirty: out parameter for number of dirty pages + * @pwriteback: out parameter for number of pages under writeback + * + * Determine the numbers of available, dirty, and writeback pages in @wb's + * memcg. Dirty and writeback are self-explanatory. Available is a bit + * more involved. + * + * A memcg's headroom is "min(max, high) - used". The available memory is + * calculated as the lowest headroom of itself and the ancestors plus the + * number of pages already being used for file pages. Note that this + * doesn't consider the actual amount of available memory in the system. + * The caller should further cap *@pavail accordingly. 
+ */ +void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail, + unsigned long *pdirty, unsigned long *pwriteback) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); + struct mem_cgroup *parent; + unsigned long head_room = PAGE_COUNTER_MAX; + unsigned long file_pages; + + *pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY); + + /* this should eventually include NR_UNSTABLE_NFS */ + *pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK); + + file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) | + (1 << LRU_ACTIVE_FILE)); + while ((parent = parent_mem_cgroup(memcg))) { + unsigned long ceiling = min(memcg->memory.limit, memcg->high); + unsigned long used = page_counter_read(&memcg->memory); + + head_room = min(head_room, ceiling - min(ceiling, used)); + memcg = parent; + } + + *pavail = file_pages + head_room; +} + #else /* CONFIG_CGROUP_WRITEBACK */ static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index a146e33..e890335 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -160,6 +160,14 @@ struct dirty_throttle_control { #define GDTC_INIT(__wb) .dom = &global_wb_domain, \ DTC_INIT_COMMON(__wb) #define GDTC_INIT_NO_WB .dom = &global_wb_domain +#define MDTC_INIT(__wb, __gdtc) .dom = mem_cgroup_wb_domain(__wb), \ + .gdtc = __gdtc, \ + DTC_INIT_COMMON(__wb) + +static bool mdtc_valid(struct dirty_throttle_control *dtc) +{ + return dtc->dom; +} static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc) { @@ -207,6 +215,12 @@ static void wb_min_max_ratio(struct bdi_writeback *wb, #define GDTC_INIT(__wb) DTC_INIT_COMMON(__wb) #define GDTC_INIT_NO_WB +#define MDTC_INIT(__wb, __gdtc) + +static bool mdtc_valid(struct dirty_throttle_control *dtc) +{ + return false; +} static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc) { @@ -668,6 +682,15 @@ static unsigned long hard_dirty_limit(struct wb_domain 
*dom, return max(thresh, dom->dirty_limit); } +/* memory available to a memcg domain is capped by system-wide clean memory */ +static void mdtc_cap_avail(struct dirty_throttle_control *mdtc) +{ + struct dirty_throttle_control *gdtc = mdtc_gdtc(mdtc); + unsigned long clean = gdtc->avail - min(gdtc->avail, gdtc->dirty); + + mdtc->avail = min(mdtc->avail, clean); +} + /** * __wb_calc_thresh - @wb's share of dirty throttling threshold * @dtc: dirty_throttle_context of interest @@ -1269,11 +1292,12 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit); } -static void __wb_update_bandwidth(struct dirty_throttle_control *dtc, +static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc, + struct dirty_throttle_control *mdtc, unsigned long start_time, bool update_ratelimit) { - struct bdi_writeback *wb = dtc->wb; + struct bdi_writeback *wb = gdtc->wb; unsigned long now = jiffies; unsigned long elapsed = now - wb->bw_time_stamp; unsigned long dirtied; @@ -1298,8 +1322,17 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *dtc, goto snapshot; if (update_ratelimit) { - domain_update_bandwidth(dtc, now); - wb_update_dirty_ratelimit(dtc, dirtied, elapsed); + domain_update_bandwidth(gdtc, now); + wb_update_dirty_ratelimit(gdtc, dirtied, elapsed); + + /* + * @mdtc is always NULL if !CGROUP_WRITEBACK but the + * compiler has no way to figure that out. Help it. 
+ */ + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) && mdtc) { + domain_update_bandwidth(mdtc, now); + wb_update_dirty_ratelimit(mdtc, dirtied, elapsed); + } } wb_update_write_bandwidth(wb, elapsed, written); @@ -1313,7 +1346,7 @@ void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time) { struct dirty_throttle_control gdtc = { GDTC_INIT(wb) }; - __wb_update_bandwidth(&gdtc, start_time, false); + __wb_update_bandwidth(&gdtc, NULL, start_time, false); } /* @@ -1480,7 +1513,11 @@ static void balance_dirty_pages(struct address_space *mapping, unsigned long pages_dirtied) { struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) }; + struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) }; struct dirty_throttle_control * const gdtc = &gdtc_stor; + struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? + &mdtc_stor : NULL; + struct dirty_throttle_control *sdtc; unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */ long period; long pause; @@ -1497,6 +1534,7 @@ static void balance_dirty_pages(struct address_space *mapping, for (;;) { unsigned long now = jiffies; unsigned long dirty, thresh, bg_thresh; + unsigned long m_dirty, m_thresh, m_bg_thresh; /* * Unstable writes are a feature of certain networked @@ -1523,6 +1561,32 @@ static void balance_dirty_pages(struct address_space *mapping, bg_thresh = gdtc->bg_thresh; } + if (mdtc) { + unsigned long writeback; + + /* + * If @wb belongs to !root memcg, repeat the same + * basic calculations for the memcg domain. 
+ */ + mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, + &writeback); + mdtc_cap_avail(mdtc); + mdtc->dirty += writeback; + + domain_dirty_limits(mdtc); + + if (unlikely(strictlimit)) { + wb_dirty_limits(mdtc); + m_dirty = mdtc->wb_dirty; + m_thresh = mdtc->wb_thresh; + m_bg_thresh = mdtc->wb_bg_thresh; + } else { + m_dirty = mdtc->dirty; + m_thresh = mdtc->thresh; + m_bg_thresh = mdtc->bg_thresh; + } + } + /* * Throttle it only when the background writeback cannot * catch-up. This avoids (excessively) small writeouts @@ -1531,18 +1595,31 @@ static void balance_dirty_pages(struct address_space *mapping, * In strictlimit case make decision based on the wb counters * and limits. Small writeouts when the wb limits are ramping * up are the price we consciously pay for strictlimit-ing. + * + * If memcg domain is in effect, @dirty should be under + * both global and memcg freerun ceilings. */ - if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) { + if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) && + (!mdtc || + m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) { + unsigned long intv = dirty_poll_interval(dirty, thresh); + unsigned long m_intv = ULONG_MAX; + current->dirty_paused_when = now; current->nr_dirtied = 0; - current->nr_dirtied_pause = - dirty_poll_interval(dirty, thresh); + if (mdtc) + m_intv = dirty_poll_interval(m_dirty, m_thresh); + current->nr_dirtied_pause = min(intv, m_intv); break; } if (unlikely(!writeback_in_progress(wb))) wb_start_background_writeback(wb); + /* + * Calculate global domain's pos_ratio and select the + * global dtc by default. + */ if (!strictlimit) wb_dirty_limits(gdtc); @@ -1550,6 +1627,25 @@ static void balance_dirty_pages(struct address_space *mapping, ((gdtc->dirty > gdtc->thresh) || strictlimit); wb_position_ratio(gdtc); + sdtc = gdtc; + + if (mdtc) { + /* + * If memcg domain is in effect, calculate its + * pos_ratio. @wb should satisfy constraints from + * both global and memcg domains. 
Choose the one + * w/ lower pos_ratio. + */ + if (!strictlimit) + wb_dirty_limits(mdtc); + + dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) && + ((mdtc->dirty > mdtc->thresh) || strictlimit); + + wb_position_ratio(mdtc); + if (mdtc->pos_ratio < gdtc->pos_ratio) + sdtc = mdtc; + } if (dirty_exceeded && !wb->dirty_exceeded) wb->dirty_exceeded = 1; @@ -1557,14 +1653,15 @@ static void balance_dirty_pages(struct address_space *mapping, if (time_is_before_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL)) { spin_lock(&wb->list_lock); - __wb_update_bandwidth(gdtc, start_time, true); + __wb_update_bandwidth(gdtc, mdtc, start_time, true); spin_unlock(&wb->list_lock); } + /* throttle according to the chosen dtc */ dirty_ratelimit = wb->dirty_ratelimit; - task_ratelimit = ((u64)dirty_ratelimit * gdtc->pos_ratio) >> + task_ratelimit = ((u64)dirty_ratelimit * sdtc->pos_ratio) >> RATELIMIT_CALC_SHIFT; - max_pause = wb_max_pause(wb, gdtc->wb_dirty); + max_pause = wb_max_pause(wb, sdtc->wb_dirty); min_pause = wb_min_pause(wb, max_pause, task_ratelimit, dirty_ratelimit, &nr_dirtied_pause); @@ -1587,11 +1684,11 @@ static void balance_dirty_pages(struct address_space *mapping, */ if (pause < min_pause) { trace_balance_dirty_pages(bdi, - gdtc->thresh, - gdtc->bg_thresh, - gdtc->dirty, - gdtc->wb_thresh, - gdtc->wb_dirty, + sdtc->thresh, + sdtc->bg_thresh, + sdtc->dirty, + sdtc->wb_thresh, + sdtc->wb_dirty, dirty_ratelimit, task_ratelimit, pages_dirtied, @@ -1616,11 +1713,11 @@ static void balance_dirty_pages(struct address_space *mapping, pause: trace_balance_dirty_pages(bdi, - gdtc->thresh, - gdtc->bg_thresh, - gdtc->dirty, - gdtc->wb_thresh, - gdtc->wb_dirty, + sdtc->thresh, + sdtc->bg_thresh, + sdtc->dirty, + sdtc->wb_thresh, + sdtc->wb_dirty, dirty_ratelimit, task_ratelimit, pages_dirtied, @@ -1651,7 +1748,7 @@ pause: * more page. However wb_dirty has accounting errors. So use * the larger and more IO friendly wb_stat_error. 
*/ - if (gdtc->wb_dirty <= wb_stat_error(wb)) + if (sdtc->wb_dirty <= wb_stat_error(wb)) break; if (fatal_signal_pending(current)) @@ -1775,7 +1872,10 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited); bool wb_over_bg_thresh(struct bdi_writeback *wb) { struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) }; + struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) }; struct dirty_throttle_control * const gdtc = &gdtc_stor; + struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? + &mdtc_stor : NULL; /* * Similar to balance_dirty_pages() but ignores pages being written @@ -1792,6 +1892,20 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb) if (wb_stat(wb, WB_RECLAIMABLE) > __wb_calc_thresh(gdtc)) return true; + if (mdtc) { + unsigned long writeback; + + mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, &writeback); + mdtc_cap_avail(mdtc); + domain_dirty_limits(mdtc); /* ditto, ignore writeback */ + + if (mdtc->dirty > mdtc->bg_thresh) + return true; + + if (wb_stat(wb, WB_RECLAIMABLE) > __wb_calc_thresh(mdtc)) + return true; + } + return false; } -- cgit v1.1 From 97c9341f727105c29478da19f1687b0e0a917256 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 22 May 2015 18:23:36 -0400 Subject: mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use Because writeback wasn't cgroup aware before, the usual dirty throttling mechanism in balance_dirty_pages() didn't work for processes under memcg limit. The writeback path didn't know how much memory is available or how fast the dirty pages are being written out for a given memcg and balance_dirty_pages() didn't have any measure of IO back pressure for the memcg. To work around the issue, memcg implemented an ad-hoc dirty throttling mechanism in the direct reclaim path by stalling on pages under writeback which are encountered during direct reclaim scan. 
This is rather ugly and crude - none of the configurability, fairness, or bandwidth-proportional distribution of the normal path.

The previous patches implemented proper memcg aware dirty throttling when cgroup writeback is in use, making the ad-hoc mechanism unnecessary. This patch disables direct reclaim stalling for such cases.

Note: I disabled the parts which seemed obvious and it behaves fine while testing but my understanding of this code path is rudimentary and it's quite possible that I got something wrong. Please let me know if I got something wrong or more global_reclaim() sites should be updated.

v2: The original patch removed the direct stalling mechanism which breaks legacy hierarchies. Conditionalize instead of removing.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
Cc: Vladimir Davydov
Signed-off-by: Jens Axboe
---
 mm/vmscan.c | 51 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 41 insertions(+), 10 deletions(-)

(limited to 'mm')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f463398..8cb16eb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -154,11 +154,42 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup;
 }
+
+/**
+ * sane_reclaim - is the usual dirty throttling mechanism operational?
+ * @sc: scan_control in question
+ *
+ * The normal page dirty throttling mechanism in balance_dirty_pages() is
+ * completely broken with the legacy memcg and direct stalling in
+ * shrink_page_list() is used for throttling instead, which lacks all the
+ * niceties such as fairness, adaptive pausing, bandwidth proportional
+ * allocation and configurability.
+ *
+ * This function tests whether the vmscan currently in progress can assume
+ * that the normal dirty throttling mechanism is operational.
+ */ +static bool sane_reclaim(struct scan_control *sc) +{ + struct mem_cgroup *memcg = sc->target_mem_cgroup; + + if (!memcg) + return true; +#ifdef CONFIG_CGROUP_WRITEBACK + if (cgroup_on_dfl(mem_cgroup_css(memcg)->cgroup)) + return true; +#endif + return false; +} #else static bool global_reclaim(struct scan_control *sc) { return true; } + +static bool sane_reclaim(struct scan_control *sc) +{ + return true; +} #endif static unsigned long zone_reclaimable_pages(struct zone *zone) @@ -941,10 +972,10 @@ static unsigned long shrink_page_list(struct list_head *page_list, * note that the LRU is being scanned too quickly and the * caller can stall after page list has been processed. * - * 2) Global reclaim encounters a page, memcg encounters a - * page that is not marked for immediate reclaim or - * the caller does not have __GFP_IO. In this case mark - * the page for immediate reclaim and continue scanning. + * 2) Global or new memcg reclaim encounters a page that is + * not marked for immediate reclaim or the caller does not + * have __GFP_IO. In this case mark the page for immediate + * reclaim and continue scanning. * * __GFP_IO is checked because a loop driver thread might * enter reclaim, and deadlock if it waits on a page for @@ -958,7 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing * may_enter_fs here is liable to OOM on them. * - * 3) memcg encounters a page that is not already marked + * 3) Legacy memcg encounters a page that is not already marked * PageReclaim. 
memcg does not have any dirty pages * throttling so we could easily OOM just because too many * pages are in writeback and there is nothing else to @@ -973,7 +1004,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, goto keep_locked; /* Case 2 above */ - } else if (global_reclaim(sc) || + } else if (sane_reclaim(sc) || !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) { /* * This is slightly racy - end_page_writeback() @@ -1422,7 +1453,7 @@ static int too_many_isolated(struct zone *zone, int file, if (current_is_kswapd()) return 0; - if (!global_reclaim(sc)) + if (!sane_reclaim(sc)) return 0; if (file) { @@ -1614,10 +1645,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, set_bit(ZONE_WRITEBACK, &zone->flags); /* - * memcg will stall in page writeback so only consider forcibly - * stalling for global reclaim + * Legacy memcg will stall in page writeback so avoid forcibly + * stalling here. */ - if (global_reclaim(sc)) { + if (sane_reclaim(sc)) { /* * Tag a zone as congested if all the dirty pages scanned were * backed by a congested BDI and wait_iff_congested will stall. -- cgit v1.1 From 21c6321fbb3a3787af07f1bc031d713a707fb69c Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Thu, 28 May 2015 14:50:49 -0400 Subject: writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() Currently, majority of cgroup writeback support including all the above functions are implemented in include/linux/backing-dev.h and mm/backing-dev.c; however, the portion closely related to writeback logic implemented in include/linux/writeback.h and mm/page-writeback.c will expand to support foreign writeback detection and correction. This patch moves wb[_try]_get() and wb_put() to include/linux/backing-dev-defs.h so that they can be used from writeback.h and inode_{attach|detach}_wb() to writeback.h and page-writeback.c. This is pure reorganization and doesn't introduce any functional changes. 
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/backing-dev.c | 30 ------------------------------ 1 file changed, 30 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 84ebf7c..887d72a8 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -660,36 +660,6 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, return wb; } -void __inode_attach_wb(struct inode *inode, struct page *page) -{ - struct backing_dev_info *bdi = inode_to_bdi(inode); - struct bdi_writeback *wb = NULL; - - if (inode_cgwb_enabled(inode)) { - struct cgroup_subsys_state *memcg_css; - - if (page) { - memcg_css = mem_cgroup_css_from_page(page); - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); - } else { - /* must pin memcg_css, see wb_get_create() */ - memcg_css = task_get_css(current, memory_cgrp_id); - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); - css_put(memcg_css); - } - } - - if (!wb) - wb = &bdi->wb; - - /* - * There may be multiple instances of this function racing to - * update the same inode. Use cmpxchg() to tell the winner. - */ - if (unlikely(cmpxchg(&inode->i_wb, NULL, wb))) - wb_put(wb); -} - static void cgwb_bdi_init(struct backing_dev_info *bdi) { bdi->wb.memcg_css = mem_cgroup_root_css; -- cgit v1.1 From b16b1deb553adcd7b3b7ce3e6d6fd1b923f314da Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Tue, 2 Jun 2015 08:39:48 -0600 Subject: writeback: make writeback_control track the inode being written back Currently, for cgroup writeback, the IO submission paths directly associate the bio's with the blkcg from inode_to_wb_blkcg_css(); however, it'd be necessary to keep more writeback context to implement foreign inode writeback detection. wbc (writeback_control) is the natural fit for the extra context - it persists throughout the writeback of each inode and is passed all the way down to IO submission paths. 
This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and wbc_attach_fdatawrite_inode() which are used to associate wbc with the inode being written back. IO submission paths now use wbc_init_bio() instead of directly associating bio's with blkcg themselves. This leaves inode_to_wb_blkcg_css() w/o any user. The function is removed. wbc currently only tracks the associated wb (bdi_writeback). Future patches will add more for foreign inode detection. The association is established under i_lock which will be depended upon when migrating foreign inodes to other wb's. As currently, once established, inode to wb association never changes, going through wbc when initializing bio's doesn't cause any behavior changes. v2: submit_blk_blkcg() now checks whether the wbc is associated with a wb before dereferencing it. This can happen when pageout() is writing pages directly without going through the usual writeback path. As pageout() path is single-threaded, we don't want it to be blocked behind a slow cgroup and ultimately want it to delegate actual writing to the usual writeback path. 
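The attach/detach bracket and the v2 NULL check can be modeled in a few lines. This is a hedged sketch only: every `toy_*` type and function is hypothetical, the reference count is an assumption about how the association is pinned, and `-1` is an arbitrary marker for "no cgroup association".

```c
#include <stddef.h>

/* Hypothetical stand-ins for bdi_writeback, inode, writeback_control, bio. */
struct toy_wb    { int refcnt; int blkcg_id; };
struct toy_inode { struct toy_wb *wb; };
struct toy_wbc   { struct toy_wb *wb; };
struct toy_bio   { int blkcg_id; };

/* Associate the wbc with the inode's wb for the duration of writeback. */
static void toy_wbc_attach_inode(struct toy_wbc *wbc, struct toy_inode *inode)
{
	wbc->wb = inode->wb;
	if (wbc->wb)
		wbc->wb->refcnt++;	/* assumed: the wbc pins the wb */
}

/* Drop the association once the inode has been written back. */
static void toy_wbc_detach_inode(struct toy_wbc *wbc)
{
	if (wbc->wb)
		wbc->wb->refcnt--;
	wbc->wb = NULL;
}

/*
 * IO submission side: take the cgroup from the wbc rather than from the
 * inode, and tolerate a wbc that was never attached (the pageout() case
 * mentioned in the v2 note).
 */
static void toy_wbc_init_bio(const struct toy_wbc *wbc, struct toy_bio *bio)
{
	bio->blkcg_id = wbc->wb ? wbc->wb->blkcg_id : -1;
}
```

A caller would attach before issuing writes and detach afterwards, in the same order as the __filemap_fdatawrite_range() hunk in this patch.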
Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
Signed-off-by: Jens Axboe
---
 mm/filemap.c | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'mm')

diff --git a/mm/filemap.c b/mm/filemap.c
index 7b1443d..2f065b1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -290,7 +290,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 	if (!mapping_cap_writeback_dirty(mapping))
 		return 0;
 
+	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
 	ret = do_writepages(mapping, &wbc);
+	wbc_detach_inode(&wbc);
 	return ret;
 }
 
--
cgit v1.1

From 682aa8e1a6a1504a4caaa62e6c2c9daae3757210 Mon Sep 17 00:00:00 2001
From: Tejun Heo
Date: Thu, 28 May 2015 14:50:53 -0400
Subject: writeback: implement unlocked_inode_to_wb transaction and use it for stat updates

The mechanism for detecting whether an inode should switch its wb (bdi_writeback) association is now in place. This patch builds the framework for the actual switching.

This patch adds a new inode flag I_WB_SWITCHING, which has two functions. First, the easy one, it ensures that there's only one switching in progress for a given inode. Second, it's used as a mechanism to synchronize wb stat updates.

The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters but track the current number of dirty pages and pages under writeback respectively. As such, when an inode is moved from one wb to another, the inode's portion of those stats has to be transferred together; unfortunately, this is a bit tricky as those stat updates are percpu operations which are performed without holding any lock in some places.

This patch solves the problem in a similar way as memcg. Each such lockless stat update is wrapped in a transaction surrounded by unlocked_inode_to_wb_begin/end(). During normal operation, they map to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted, mapping->tree_lock is grabbed across the transaction.
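The shape of such a transaction can be shown with a single-threaded model. This is a sketch under loud assumptions: nothing here takes a real lock, the `toy_*` names are hypothetical, and plain counters stand in for the RCU read-side section and for mapping->tree_lock so that the control flow is visible.

```c
#include <stdbool.h>

struct toy_wb { long nr_reclaimable; };

struct toy_inode {
	struct toy_wb *wb;
	bool switching;		/* stands in for I_WB_SWITCHING */
	int rcu_depth;		/* stands in for rcu_read_lock() nesting */
	int tree_lock_held;	/* stands in for mapping->tree_lock */
};

/* Begin a transaction: always "RCU", plus the lock if a switch is pending. */
static struct toy_wb *toy_inode_to_wb_begin(struct toy_inode *inode,
					    bool *locked)
{
	inode->rcu_depth++;
	*locked = inode->switching;
	if (*locked)
		inode->tree_lock_held++;
	return inode->wb;
}

/* End a transaction: undo exactly what begin did, in reverse order. */
static void toy_inode_to_wb_end(struct toy_inode *inode, bool locked)
{
	if (locked)
		inode->tree_lock_held--;
	inode->rcu_depth--;
}

/* A stat update wrapped in the transaction, as the callers in the diff do. */
static void toy_account_cleaned(struct toy_inode *inode)
{
	bool locked;
	struct toy_wb *wb = toy_inode_to_wb_begin(inode, &locked);

	wb->nr_reclaimable--;
	toy_inode_to_wb_end(inode, locked);
}
```

Passing `locked` back to the end function keeps the fast path cheap: a caller that never saw the switching flag never touches the lock.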
In turn, the switching path sets I_WB_SWITCHING and waits for a RCU grace period to pass before actually starting to switch, which guarantees that all stat update paths are synchronizing against mapping->tree_lock. This patch still doesn't implement the actual switching. v3: Updated on top of the recent cancel_dirty_page() updates. unlocked_inode_to_wb_begin() now nests inside mem_cgroup_begin_page_stat() to match the locking order. v2: The i_wb access transaction will be used for !stat accesses too. Function names and comments updated accordingly. s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/ s/switch_wb/switch_wbs/ Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen Signed-off-by: Jens Axboe --- mm/filemap.c | 3 ++- mm/page-writeback.c | 27 +++++++++++++++++++++------ 2 files changed, 23 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index 2f065b1..bfc1ab053 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -213,7 +213,8 @@ void __delete_from_page_cache(struct page *page, void *shadow, * anyway will be cleared before returning page into buddy allocator. */ if (WARN_ON_ONCE(PageDirty(page))) - account_page_cleaned(page, mapping, memcg); + account_page_cleaned(page, mapping, memcg, + inode_to_wb(mapping->host)); } /** diff --git a/mm/page-writeback.c b/mm/page-writeback.c index e890335..e1514d5 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2422,12 +2422,12 @@ EXPORT_SYMBOL(account_page_dirtied); * Caller must hold mem_cgroup_begin_page_stat(). 
*/ void account_page_cleaned(struct page *page, struct address_space *mapping, - struct mem_cgroup *memcg) + struct mem_cgroup *memcg, struct bdi_writeback *wb) { if (mapping_cap_account_dirty(mapping)) { mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); - dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE); + dec_wb_stat(wb, WB_RECLAIMABLE); task_io_account_cancelled_write(PAGE_CACHE_SIZE); } } @@ -2490,11 +2490,15 @@ void account_page_redirty(struct page *page) struct address_space *mapping = page->mapping; if (mapping && mapping_cap_account_dirty(mapping)) { - struct bdi_writeback *wb = inode_to_wb(mapping->host); + struct inode *inode = mapping->host; + struct bdi_writeback *wb; + bool locked; + wb = unlocked_inode_to_wb_begin(inode, &locked); current->nr_dirtied--; dec_zone_page_state(page, NR_DIRTIED); dec_wb_stat(wb, WB_DIRTIED); + unlocked_inode_to_wb_end(inode, locked); } } EXPORT_SYMBOL(account_page_redirty); @@ -2597,13 +2601,18 @@ void cancel_dirty_page(struct page *page) struct address_space *mapping = page_mapping(page); if (mapping_cap_account_dirty(mapping)) { + struct inode *inode = mapping->host; + struct bdi_writeback *wb; struct mem_cgroup *memcg; + bool locked; memcg = mem_cgroup_begin_page_stat(page); + wb = unlocked_inode_to_wb_begin(inode, &locked); if (TestClearPageDirty(page)) - account_page_cleaned(page, mapping, memcg); + account_page_cleaned(page, mapping, memcg, wb); + unlocked_inode_to_wb_end(inode, locked); mem_cgroup_end_page_stat(memcg); } else { ClearPageDirty(page); @@ -2628,12 +2637,16 @@ EXPORT_SYMBOL(cancel_dirty_page); int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); - struct mem_cgroup *memcg; int ret = 0; BUG_ON(!PageLocked(page)); if (mapping && mapping_cap_account_dirty(mapping)) { + struct inode *inode = mapping->host; + struct bdi_writeback *wb; + struct mem_cgroup *memcg; + bool locked; + /* * Yes, Virginia, this is 
indeed insane. * @@ -2670,12 +2683,14 @@ int clear_page_dirty_for_io(struct page *page) * exclusion. */ memcg = mem_cgroup_begin_page_stat(page); + wb = unlocked_inode_to_wb_begin(inode, &locked); if (TestClearPageDirty(page)) { mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); - dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE); + dec_wb_stat(wb, WB_RECLAIMABLE); ret = 1; } + unlocked_inode_to_wb_end(inode, locked); mem_cgroup_end_page_stat(memcg); return ret; } -- cgit v1.1 From 5857cd637bc0d60dc7e37af396b01324f199d89b Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 5 Jun 2015 06:20:51 +0900 Subject: bdi: fix wrong error return value in cgwb_create() On wb_congested_get_create() failure, cgwb_create() forgot to set @ret to -ENOMEM ending up returning 0. Fix it so that it returns -ENOMEM. Signed-off-by: Tejun Heo Reported-by: Dan Carpenter Signed-off-by: Jens Axboe --- mm/backing-dev.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 887d72a8..436bb53 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -554,8 +554,10 @@ static int cgwb_create(struct backing_dev_info *bdi, goto err_ref_exit; wb->congested = wb_congested_get_create(bdi, blkcg_css->id, gfp); - if (!wb->congested) + if (!wb->congested) { + ret = -ENOMEM; goto err_fprop_exit; + } wb->memcg_css = memcg_css; wb->blkcg_css = blkcg_css; -- cgit v1.1