summaryrefslogtreecommitdiffstats
path: root/mm/page-writeback.c
Commit message (Collapse)AuthorAgeFilesLines
* radix-tree: delete radix_tree_range_tag_if_tagged()Matthew Wilcox2016-12-141-10/+18
| | | | | | | | | | | | | | | | | | | | | | | | This is an exceptionally complicated function with just one caller (tag_pages_for_writeback). We devote a large portion of the runtime of the test suite to testing this one function which has one caller. By introducing the new function radix_tree_iter_tag_set(), we can eliminate all of the complexity while keeping the performance. The caller can now use a fairly standard radix_tree_for_each() loop, and it doesn't need to worry about tricksy things like 'start' wrapping. The test suite continues to spend a large amount of time investigating this function, but now it's testing the underlying primitives such as radix_tree_iter_resume() and the radix_tree_for_each_tagged() iterator which are also used by other parts of the kernel. Link: http://lkml.kernel.org/r/1480369871-5271-57-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* writeback: track if we're sleeping on progress in balance_dirty_pages()Jens Axboe2016-11-081-0/+1
| | | | | | | | | Note in the bdi_writeback structure whenever a task ends up sleeping waiting for progress. We can use that information in the lower layers to increase the priority of writes. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jan Kara <jack@suse.cz>
* mm: don't use radix tree writeback tags for pages in swap cacheHuang Ying2016-10-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK, etc.) to accelerate finding the pages with a specific tag in the radix tree during inode writeback. But for anonymous pages in the swap cache, there is no inode writeback. So there is no need to find the pages with some writeback tags in the radix tree. It is not necessary to touch radix tree writeback tags for pages in the swap cache. Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is introduced for address spaces which don't need to update the writeback tags. The flag is set for swap caches. It may be used for DAX file systems, etc. With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to ~1.48GBps) in the vm-scalability swap-w-seq test case with 8 processes. The test is done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. The improvement comes from the reduced contention on the swap cache radix tree lock. To test sequential swapping out, the test case uses 8 processes, which sequentially allocate and write to the anonymous pages until RAM and part of the swap device is used up. Details of comparison is as follow, base base+patch ---------------- -------------------------- %stddev %change %stddev \ | \ 2506952 ± 2% +28.1% 3212076 ± 7% vm-scalability.throughput 1207402 ± 7% +22.3% 1476578 ± 6% vmstat.swap.so 10.86 ± 12% -23.4% 8.31 ± 16% perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list 10.82 ± 13% -33.1% 7.24 ± 14% perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg 10.36 ± 11% -100.0% 0.00 ± -1% perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage 10.52 ± 12% -100.0% 0.00 ± -1% perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, vmscan: get rid of throttle_vm_writeoutMichal Hocko2016-10-071-30/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by excessive pageout activity during the reclaim. Too many pages could be put under writeback therefore LRUs would be full of unreclaimable pages until the IO completes and in turn the OOM killer could be invoked. There have been some important changes introduced since then in the reclaim path though. Writers are throttled by balance_dirty_pages when initiating the buffered IO and later during the memory pressure, the direct reclaim is throttled by wait_iff_congested if the node is considered congested by dirty pages on LRUs and the underlying bdi is congested by the queued IO. The kswapd is throttled as well if it encounters pages marked for immediate reclaim or under writeback which signals that that there are too many pages under writeback already. Finally should_reclaim_retry does congestion_wait if the reclaim cannot make any progress and there are too many dirty/writeback pages. Another important aspect is that we do not issue any IO from the direct reclaim context anymore. In a heavy parallel load this could queue a lot of IO which would be very scattered and thus unefficient which would just make the problem worse. This three mechanisms should throttle and keep the amount of IO in a steady state even under heavy IO and memory pressure so yet another throttling point doesn't really seem helpful. Quite contrary, Mikulas Patocka has reported that swap backed by dm-crypt doesn't work properly because the swapout IO cannot make sufficient progress as the writeout path depends on dm_crypt worker which has to allocate memory to perform the encryption. In order to guarantee a forward progress it relies on the mempool allocator. mempool_alloc(), however, prefers to use the underlying (usually page) allocator before it grabs objects from the pool. Such an allocation can dive into the memory reclaim and consequently to throttle_vm_writeout. If there are too many dirty or pages under writeback it will get throttled even though it is in fact a flusher to clear pending pages. kworker/u4:0 D ffff88003df7f438 10488 6 2 0x00000000 Workqueue: kcryptd kcryptd_crypt [dm_crypt] Call Trace: schedule+0x3c/0x90 schedule_timeout+0x1d8/0x360 io_schedule_timeout+0xa4/0x110 congestion_wait+0x86/0x1f0 throttle_vm_writeout+0x44/0xd0 shrink_zone_memcg+0x613/0x720 shrink_zone+0xe0/0x300 do_try_to_free_pages+0x1ad/0x450 try_to_free_pages+0xef/0x300 __alloc_pages_nodemask+0x879/0x1210 alloc_pages_current+0xa1/0x1f0 new_slab+0x2d7/0x6a0 ___slab_alloc+0x3fb/0x5c0 __slab_alloc+0x51/0x90 kmem_cache_alloc+0x27b/0x310 mempool_alloc_slab+0x1d/0x30 mempool_alloc+0x91/0x230 bio_alloc_bioset+0xbd/0x260 kcryptd_crypt+0x114/0x3b0 [dm_crypt] Let's just drop throttle_vm_writeout altogether. It is not very much helpful anymore. I have tried to test a potential writeback IO runaway similar to the one described in the original patch which has introduced that [1]. Small virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a rather slow NFS in a sync mode on the host) with 8 parallel writers each writing 1G worth of data. As soon as the pagecache fills up and the direct reclaim hits then I start anon memory consumer in a loop (allocating 300M and exiting after populating it) in the background to make the memory pressure even stronger as well as to disrupt the steady state for the IO. The direct reclaim is throttled because of the congestion as well as kswapd hitting congestion_wait due to nr_immediate but throttle_vm_writeout doesn't ever trigger the sleep throughout the test. Dirty+writeback are close to nr_dirty_threshold with some fluctuations caused by the anon consumer. [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: NeilBrown <neilb@suse.com> Cc: Ondrej Kozina <okozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/writeback: Convert to hotplug state machineSebastian Andrzej Siewior2016-09-061-19/+7
| | | | | | | | | | | | | | | Install the callbacks via the state machine and let the core invoke the callbacks on the already online CPUs. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@fb.com> Cc: linux-mm@kvack.org Cc: rt@linutronix.de Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20160818125731.27256-6-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* mm: remove reclaim and compaction retry approximationsMel Gorman2016-07-281-0/+5
| | | | | | | | | | | | | | | | | | | If per-zone LRU accounting is available then there is no point approximating whether reclaim and compaction should retry based on pgdat statistics. This is effectively a revert of "mm, vmstat: remove zone and node double accounting by approximating retries" with the difference that inactive/active stats are still available. This preserves the history of why the approximation was retried and why it had to be reverted to handle OOM kills on 32-bit systems. Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, vmscan: remove highmem_file_pagesMel Gorman2016-07-281-8/+4
| | | | | | | | | | | | | | | With the reintroduction of per-zone LRU stats, highmem_file_pages is redundant so remove it. [mgorman@techsingularity.net: wrong stat is being accumulated in highmem_dirtyable_memory] Link: http://lkml.kernel.org/r/20160725092324.GM10438@techsingularity.netLink: http://lkml.kernel.org/r/1469110261-7365-3-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, page_alloc: fix dirtyable highmem calculationMinchan Kim2016-07-281-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I tested vmscale in mmtest in 32bit, I found the benchmark was slow down 0.5 times. base node 1 global-1 User 12.98 16.04 System 147.61 166.42 Elapsed 26.48 38.08 With vmstat, I found IO wait avg is much increased compared to base. The reason was highmem_dirtyable_memory accumulates free pages and highmem_file_pages from HIGHMEM to MOVABLE zones which was wrong. With that, dirth_thresh in throtlle_vm_write is always 0 so that it calls congestion_wait frequently if writeback starts. With this patch, it is much recovered. base node fi 1 global-1 fix User 12.98 16.04 13.78 System 147.61 166.42 143.92 Elapsed 26.48 38.08 29.64 Link: http://lkml.kernel.org/r/1468404004-5085-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, vmstat: remove zone and node double accounting by approximating retriesMel Gorman2016-07-281-8/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The number of LRU pages, dirty pages and writeback pages must be accounted for on both zones and nodes because of the reclaim retry logic, compaction retry logic and highmem calculations all depending on per-zone stats. Many lowmem allocations are immune from OOM kill due to a check in __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit 03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The exception is costly high-order allocations or allocations that cannot fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem allocations then it would fall through to __alloc_pages_direct_compact. This patch will blindly retry reclaim for zone-constrained allocations in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal but without per-zone stats there are not many alternatives. The impact it that zone-constrained allocations may delay before considering the OOM killer. As there is no guarantee enough memory can ever be freed to satisfy compaction, this patch avoids retrying compaction for zone-contrained allocations. In combination, that means that the per-node stats can be used when deciding whether to continue reclaim using a rough approximation. While it is possible this will make the wrong decision on occasion, it will not infinite loop as the number of reclaim attempts is capped by MAX_RECLAIM_RETRIES. The final step is calculating the number of dirtyable highmem pages. As those calculations only care about the global count of file pages in highmem. This patch uses a global counter used instead of per-zone stats as it is sufficient. In combination, this allows the per-zone LRU and dirty state counters to be removed. [mgorman@techsingularity.net: fix acct_highmem_file_pages()] Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.netLink: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Suggested by: Michal Hocko <mhocko@kernel.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move vmscan writes and file write accounting to the nodeMel Gorman2016-07-281-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | As reclaim is now node-based, it follows that page write activity due to page reclaim should also be accounted for on the node. For consistency, also account page writes and page dirtying on a per-node basis. After this patch, there are a few remaining zone counters that may appear strange but are fine. NUMA stats are still per-zone as this is a user-space interface that tools consume. NR_MLOCK, NR_SLAB_*, NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that potentially pin low memory and cannot trivially be reclaimed on demand. This information is still useful for debugging a page allocation failure warning. Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move most file-based accounting to the nodeMel Gorman2016-07-281-25/+22
| | | | | | | | | | | | | | | | | | | | | | | There are now a number of accounting oddities such as mapped file pages being accounted for on the node while the total number of file pages are accounted on the zone. This can be coped with to some extent but it's confusing so this patch moves the relevant file-based accounted. Due to throttling logic in the page allocator for reliable OOM detection, it is still necessary to track dirty and writeback pages on a per-zone basis. [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting] Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, page_alloc: consider dirtyable memory in terms of nodesMel Gorman2016-07-281-30/+61
| | | | | | | | | | | | | | | | | Historically dirty pages were spread among zones but now that LRUs are per-node it is more appropriate to consider dirty pages in a node. Link: http://lkml.kernel.org/r/1467970510-21195-17-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, vmscan: move LRU lists to nodeMel Gorman2016-07-281-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks. Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review. In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate with the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* shmem: add huge pages supportKirill A. Shutemov2016-07-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Here's basic implementation of huge pages support for shmem/tmpfs. It's all pretty streight-forward: - shmem_getpage() allcoates huge page if it can and try to inserd into radix tree with shmem_add_to_page_cache(); - shmem_add_to_page_cache() puts the page onto radix-tree if there's space for it; - shmem_undo_range() removes huge pages, if it fully within range. Partial truncate of huge pages zero out this part of THP. This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour. As we don't really create hole in this case, lseek(SEEK_HOLE) may have inconsistent results depending what pages happened to be allocated. - no need to change shmem_fault: core-mm will map an compound page as huge if VMA is suitable; Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/fs-writeback.c: add a new writeback list for syncDave Chinner2016-07-261-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wait_sb_inodes() currently does a walk of all inodes in the filesystem to find dirty one to wait on during sync. This is highly inefficient and wastes a lot of CPU when there are lots of clean cached inodes that we don't need to wait on. To avoid this "all inode" walk, we need to track inodes that are currently under writeback that we need to wait for. We do this by adding inodes to a writeback list on the sb when the mapping is first tagged as having pages under writeback. wait_sb_inodes() can then walk this list of "inodes under IO" and wait specifically just for the inodes that the current sync(2) needs to wait for. Define a couple helpers to add/remove an inode from the writeback list and call them when the overall mapping is tagged for or cleared from writeback. Update wait_sb_inodes() to walk only the inodes under writeback due to the sync. With this change, filesystem sync times are significantly reduced for fs' with largely populated inode caches and otherwise no other work to do. For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem with a ~10m entry inode cache, sync times are reduced from ~7.3s to less than 0.1s when the filesystem is fully clean. Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* writeback: use higher precision calculation in domain_dirty_limits()Tejun Heo2016-05-301-9/+12
| | | | | | | | | | | | | | | | | | | | | | | | As vm.dirty_[background_]bytes can't be applied verbatim to multiple cgroup writeback domains, they get converted to percentages in domain_dirty_limits() and applied the same way as vm.dirty_[background]ratio. However, if the specified bytes is lower than 1% of available memory, the calculated ratios become zero and the writeback domain gets throttled constantly. Fix it by using per-PAGE_SIZE instead of percentage for ratio calculations. Also, the updated DIV_ROUND_UP() usages now should yield 1/4096 (0.0244%) as the minimum ratio as long as the specified bytes are above zero. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Miao Xie <miaoxie@huawei.com> Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com Cc: stable@vger.kernel.org # v4.2+ Fixes: 9fc3a43e1757 ("writeback: separate out domain_dirty_limits()") Reviewed-by: Jan Kara <jack@suse.cz> Adjusted comment based on Jan's suggestion. Signed-off-by: Jens Axboe <axboe@fb.com>
* MM: increase safety margin provided by PF_LESS_THROTTLENeilBrown2016-05-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When nfsd is exporting a filesystem over NFS which is then NFS-mounted on the local machine there is a risk of deadlock. This happens when there are lots of dirty pages in the NFS filesystem and they cause NFSD to be throttled, either in throttle_vm_writeout() or in balance_dirty_pages(). To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads and it provides a 25% increase to the limits that affect NFSD. Any process writing to an NFS filesystem will be throttled well before the number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD will not deadlock on pages that it needs to write out. At least it shouldn't. All processes are allowed a small excess margin to avoid performing too many calculations: ratelimit_pages. ratelimit_pages is set so that if a thread on every CPU uses the entire margin, the total will only go 3% over the limit, and this is much less than the 25% bonus that PF_LESS_THROTTLE provides, so this margin shouldn't be a problem. But it is. The "total memory" that these 3% and 25% are calculated against are not really total memory but are "global_dirtyable_memory()" which doesn't include anonymous memory, just free memory and page-cache memory. The "ratelimit_pages" number is based on whatever the global_dirtyable_memory was on the last CPU hot-plug, which might not be what you expect, but is probably close to the total freeable memory. The throttle threshold uses the global_dirtable_memory at the moment when the throttling happens, which could be much less than at the last CPU hotplug. So if lots of anonymous memory has been allocated, thus pushing out lots of page-cache pages, then NFSD might end up being throttled due to dirty NFS pages because the "25%" bonus it gets is calculated against a rather small amount of dirtyable memory, while the "3%" margin that other processes are allowed to dirty without penalty is calculated against a much larger number. To remove this possibility of deadlock we need to make sure that the margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin. Simply adding ratelimit_pages isn't enough as that should be multiplied by the number of cpus. So add "global_wb_domain.dirty_limit / 32" as that more accurately reflects the current total over-shoot margin. This ensures that the number of dirty NFS pages never gets so high that nfsd will be throttled waiting for them to be written. Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/writeback: correct dirty page calculation for highmemJoonsoo Kim2016-05-191-2/+6
| | | | | | | | | | | | | | | | | | | | | | ZONE_MOVABLE could be treated as highmem so we need to consider it for accurate calculation of dirty pages. And, in following patches, ZONE_CMA will be introduced and it can be treated as highmem, too. So, instead of manually adding stat of ZONE_MOVABLE, looping all zones and check whether the zone is highmem or not and add stat of the zone which can be treated as highmem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2016-05-061-2/+4
|\ | | | | | | | | | | | | | | | | | | Pull writeback fix from Jens Axboe: "Just a single fix for domain aware writeback, fixing a regression that can cause balance_dirty_pages() to keep looping while not getting any work done" * 'for-linus' of git://git.kernel.dk/linux-block: writeback: Fix performance regression in wb_over_bg_thresh()
| * writeback: Fix performance regression in wb_over_bg_thresh()Howard Cochran2016-05-051-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use wb_domain aware operations") unintentionally changed this function's meaning from "are there more dirty pages than the background writeback threshold" to "are there more dirty pages than the writeback threshold". The background writeback threshold is typically half of the writeback threshold, so this had the effect of raising the number of dirty pages required to cause a writeback worker to perform background writeout. This can cause a very severe performance regression when a BDI uses BDI_CAP_STRICTLIMIT because balance_dirty_pages() and the writeback worker can now disagree on whether writeback should be initiated. For example, in a system having 1GB of RAM, a single spinning disk, and a "pass-through" FUSE filesystem mounted over the disk, application code mmapped a 128MB file on the disk and was randomly dirtying pages in that mapping. Because FUSE uses strictlimit and has a default max_ratio of only 1%, in balance_dirty_pages, thresh is ~200, bg_thresh is ~100, and the dirty_freerun_ceiling is the average of those, ~150. So, it pauses the dirtying processes when we have 151 dirty pages and wakes up a background writeback worker. But the worker tests the wrong threshold (200 instead of 100), so it does not initiate writeback and just returns. Thus, balance_dirty_pages keeps looping, sleeping and then waking up the worker who will do nothing. It remains stuck in this state until the few dirty pages that we have finally expire and we write them back for that reason. Then the whole process repeats, resulting in near-zero throughput through the FUSE BDI. The fix is to call the parameterized variant of wb_calc_thresh, so that the worker will do writeback if the bg_thresh is exceeded which was the behavior before the referenced commit. Fixes: 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use wb_domain aware operations") Signed-off-by: Howard Cochran <hcochran@kernelspring.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v4.2+ Tested-by Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov2016-04-041-6/+6
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove unnecessary uses of lock_page_memcg()Johannes Weiner2016-03-151-2/+0
| | | | | | | | | | | | | | | There are several users that nest lock_page_memcg() inside lock_page() to prevent page->mem_cgroup from changing. But the page lock prevents pages from moving between cgroups, so that is unnecessary overhead. Remove lock_page_memcg() in contexts with locked contexts and fix the debug code in the page stat functions to be okay with the page lock. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: simplify lock_page_memcg()Johannes Weiner2016-03-151-28/+21
| | | | | | | | | | | | | | Now that migration doesn't clear page->mem_cgroup of live pages anymore, it's safe to make lock_page_memcg() and the memcg stat functions take pages, and spare the callers from memcg objects. [akpm@linux-foundation.org: fix warnings] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: generalize locking for the page->mem_cgroup bindingJohannes Weiner2016-03-151-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These patches tag the page cache radix tree eviction entries with the memcg an evicted page belonged to, thus making per-cgroup LRU reclaim work properly and be as adaptive to new cache workingsets as global reclaim already is. This should have been part of the original thrash detection patch series, but was deferred due to the complexity of those patches. This patch (of 5): So far the only sites that needed to exclude charge migration to stabilize page->mem_cgroup have been per-cgroup page statistics, hence the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection will add another site that needs to ensure page->mem_cgroup lifetime. Rename these locking functions to the more generic lock_page_memcg() and unlock_page_memcg(). Since charge migration is a cgroup1 feature only, we might be able to delete it at some point, and these now easy to identify locking sites along with it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page-writeback: fix dirty_ratelimit calculationAndrey Ryabinin2016-03-151-5/+6
| | | | | | | | | | | | | | | | | | | Calculation of dirty_ratelimit sometimes is not correct. E.g. initial values of dirty_ratelimit == INIT_BW and step == 0, lead to the following result: UBSAN: Undefined behaviour in ../mm/page-writeback.c:1286:7 shift exponent 25600 is too large for 64-bit type 'long unsigned int' The fix is straightforward - make step 0 if the shift exponent is too big. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_alloc: generalize the dirty balance reserveJohannes Weiner2016-01-141-2/+12
| | | | | | | | | | | | | | The dirty balance reserve that dirty throttling has to consider is merely memory not available to userspace allocations. There is nothing writeback-specific about it. Generalize the name so that it's reusable outside of that context. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* treewide: Remove old email addressPeter Zijlstra2015-11-231-1/+1
| | | | | | | | | | | | | | | | | There were still a number of references to my old Red Hat email address in the kernel source. Remove these while keeping the Red Hat copyright notices intact. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm/page-writeback.c: initialize m_dirty to avoid compile warningYang Shi2015-11-201-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When building kernel with gcc 5.2, the below warning is raised: mm/page-writeback.c: In function 'balance_dirty_pages.isra.10': mm/page-writeback.c:1545:17: warning: 'm_dirty' may be used uninitialized in this function [-Wmaybe-uninitialized] unsigned long m_dirty, m_thresh, m_bg_thresh; The m_dirty{thresh, bg_thresh} are initialized in the block of "if (mdtc)", so if mdts is null, they won't be initialized before being used. Initialize m_dirty to zero, also initialize m_thresh and m_bg_thresh to keep consistency. They are used later by if condition: !mdtc || m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh) If mdtc is null, dirty_freerun_ceiling will not be called at all, so the initialization will not change any behavior other than just ceasing the compile warning. (akpm: the patch actually reduces .text size by ~20 bytes on gcc-4.x.y) [akpm@linux-foundation.org: add comment] Signed-off-by: Yang Shi <yang.shi@linaro.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* writeback: fix incorrect calculation of available memory for memcg domainsTejun Heo2015-10-121-11/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For memcg domains, the amount of available memory was calculated as min(the amount currently in use + headroom according to memcg, total clean memory) This isn't quite correct as what should be capped by the amount of clean memory is the headroom, not the sum of memory in use and headroom. For example, if a memcg domain has a significant amount of dirty memory, the above can lead to a value which is lower than the current amount in use which doesn't make much sense. In most circumstances, the above leads to a number which is somewhat but not drastically lower. As the amount of memory which can be readily allocated to the memcg domain is capped by the amount of system-wide clean memory which is not already assigned to the memcg itself, the number we want is the amount currently in use + min(headroom according to memcg, clean memory elsewhere in the system) This patch updates mem_cgroup_wb_stats() to return the number of filepages and headroom instead of the calculated available pages. mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the above calculation from file, headroom, dirty and globally clean pages. v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading to build failure when !CGROUP_WRITEBACK. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling") Signed-off-by: Jens Axboe <axboe@fb.com>
* writeback: memcg dirty_throttle_control should be initialized with ↵Tejun Heo2015-10-121-9/+11
| | | | | | | | | | | | | | | | | | wb->memcg_completions MDTC_INIT() is used to initialize dirty_throttle_control for memcg domains. It used DTC_INIT_COMMON() to initialized mdtc->wb and ->wb_completions which is incorrect as DTC_INIT_COMMON() sets the latter to wb->completions instead of wb->memcg_completions. This can lead to wildly incorrect results when calculating the proportion of dirty memory the memcg domain should get. Remove DTC_INIT_COMMON() and update MDTC_INIT() to initialize mdtc->wb_completions to wb->memcg_completions. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling") Signed-off-by: Jens Axboe <axboe@fb.com>
* writeback: bdi_writeback iteration must not skip dying onesTejun Heo2015-10-121-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bdi_for_each_wb() is used in several places to wake up or issue writeback work items to all wb's (bdi_writeback's) on a given bdi. The iteration is performed by walking bdi->cgwb_tree; however, the tree only indexes wb's which are currently active. For example, when a memcg gets associated with a different blkcg, the old wb is removed from the tree so that the new one can be indexed. The old wb starts dying from then on but will linger till all its inodes are drained. As these dying wb's may still host dirty inodes, writeback operations which affect all wb's must include them. bdi_for_each_wb() skipping dying wb's led to sync(2) missing and failing to sync the inodes belonging to those wb's. This patch adds a RCU protected @bdi->wb_list which lists all wb's beloinging to that bdi. wb's are added on creation and removed on release rather than on the start of destruction. bdi_for_each_wb() usages are replaced with list_for_each[_continue]_rcu() iterations over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed. v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs() fixed and unnecessary list head severing in cgwb_bdi_destroy() removed. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com> Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()") Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
* writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback ↵Tejun Heo2015-10-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | iteration laptop_mode_timer_fn() was using bdi_for_each_wb() without the required RCU locking leading to the following warning. WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170() ... Call Trace: <IRQ> [<ffffffff81480cdc>] dump_stack+0x4e/0x82 [<ffffffff81051912>] warn_slowpath_common+0x82/0xc0 [<ffffffff81051a0a>] warn_slowpath_null+0x1a/0x20 [<ffffffff8115f0e6>] laptop_mode_timer_fn+0x106/0x170 [<ffffffff810ca8e3>] call_timer_fn+0xb3/0x2f0 [<ffffffff810cad25>] run_timer_softirq+0x205/0x370 [<ffffffff81056854>] __do_softirq+0xd4/0x460 [<ffffffff81056d69>] irq_exit+0x89/0xa0 [<ffffffff8185a892>] smp_apic_timer_interrupt+0x42/0x50 [<ffffffff81858a44>] apic_timer_interrupt+0x84/0x90 ... Fix it by adding rcu_read_lock() around the iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a06fd6b10228 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's") Reviewed-by: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-4.3/blkcg' of git://git.kernel.dk/linux-blockLinus Torvalds2015-09-101-3/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull blk-cg updates from Jens Axboe: "A bit later in the cycle, but this has been in the block tree for a a while. This is basically four patchsets from Tejun, that improve our buffered cgroup writeback. It was dependent on the other cgroup changes, but they went in earlier in this cycle. Series 1 is set of 5 patches that has cgroup writeback updates: - bdi_writeback iteration fix which could lead to some wb's being skipped or repeated during e.g. sync under memory pressure. - Simplification of wb work wait mechanism. - Writeback tracepoints updated to report cgroup. Series 2 is is a set of updates for the CFQ cgroup writeback handling: cfq has always charged all async IOs to the root cgroup. It didn't have much choice as writeback didn't know about cgroups and there was no way to tell who to blame for a given writeback IO. writeback finally grew support for cgroups and now tags each writeback IO with the appropriate cgroup to charge it against. This patchset updates cfq so that it follows the blkcg each bio is tagged with. Async cfq_queues are now shared across cfq_group, which is per-cgroup, instead of per-request_queue cfq_data. This makes all IOs follow the weight based IO resource distribution implemented by cfq. - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff. - Other misc review points addressed, acks added and rebased. Series 3 is the blkcg policy cleanup patches: This patchset contains assorted cleanups for blkcg_policy methods and blk[c]g_policy_data handling. - alloc/free added for blkg_policy_data. exit dropped. - alloc/free added for blkcg_policy_data. - blk-throttle's async percpu allocation is replaced with direct allocation. - all methods now take blk[c]g_policy_data instead of blkcg_gq or blkcg. And finally, series 4 is a set of patches cleaning up the blkcg stats handling: blkcg's stats have always been somwhat of a mess. This patchset tries to improve the situation a bit. - The following patches added to consolidate blkcg entry point and blkg creation. This is in itself is an improvement and helps colllecting common stats on bio issue. - per-blkg stats now accounted on bio issue rather than request completion so that bio based and request based drivers can behave the same way. The issue was spotted by Vivek. - cfq-iosched implements custom recursive stats and blk-throttle implements custom per-cpu stats. This patchset make blkcg core support both by default. - cfq-iosched and blk-throttle keep track of the same stats multiple times. Unify them" * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits) blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/ blkcg: implement interface for the unified hierarchy blkcg: misc preparations for unified hierarchy interface blkcg: separate out tg_conf_updated() from tg_set_conf() blkcg: move body parsing from blkg_conf_prep() to its callers blkcg: mark existing cftypes as legacy blkcg: rename subsystem name from blkio to io blkcg: refine error codes returned during blkcg configuration blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device() blkcg: reduce stack usage of blkg_rwstat_recursive_sum() blkcg: remove cfqg_stats->sectors blkcg: move io_service_bytes and io_serviced stats into blkcg_gq blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq blkcg: make blkcg_[rw]stat per-cpu blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it blkcg: consolidate blkg creation in blkcg_bio_issue_check() blk-throttle: improve queue bypass handling blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup() blkcg: inline [__]blkg_lookup() ...
| * writeback: update writeback tracepoints to report cgroupTejun Heo2015-08-181-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following tracepoints are updated to report the cgroup used during cgroup writeback. * writeback_write_inode[_start] * writeback_queue * writeback_exec * writeback_start * writeback_written * writeback_wait * writeback_nowork * writeback_wake_background * wbc_writepage * writeback_queue_io * bdi_dirty_ratelimit * balance_dirty_pages * writeback_sb_inodes_requeue * writeback_single_inode[_start] Note that writeback_bdi_register is separated out from writeback_class as reporting cgroup doesn't make sense to it. Tracepoints which take bdi are updated to take bdi_writeback instead. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
* | writeback: fix initial dirty limitRabin Vincent2015-08-071-2/+2
|/ | | | | | | | | | | | The initial value of global_wb_domain.dirty_limit set by writeback_set_ratelimit() is zeroed out by the memset in wb_domain_init(). Signed-off-by: Rabin Vincent <rabin.vincent@axis.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-blockLinus Torvalds2015-06-251-431/+800
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...
| * writeback: implement unlocked_inode_to_wb transaction and use it for stat ↵Tejun Heo2015-06-021-6/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | updates The mechanism for detecting whether an inode should switch its wb (bdi_writeback) association is now in place. This patch build the framework for the actual switching. This patch adds a new inode flag I_WB_SWITCHING, which has two functions. First, the easy one, it ensures that there's only one switching in progress for a give inode. Second, it's used as a mechanism to synchronize wb stat updates. The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters but track the current number of dirty pages and pages under writeback respectively. As such, when an inode is moved from one wb to another, the inode's portion of those stats have to be transferred together; unfortunately, this is a bit tricky as those stat updates are percpu operations which are performed without holding any lock in some places. This patch solves the problem in a similar way as memcg. Each such lockless stat updates are wrapped in transaction surrounded by unlocked_inode_to_wb_begin/end(). During normal operation, they map to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted, mapping->tree_lock is grabbed across the transaction. In turn, the switching path sets I_WB_SWITCHING and waits for a RCU grace period to pass before actually starting to switch, which guarantees that all stat update paths are synchronizing against mapping->tree_lock. This patch still doesn't implement the actual switching. v3: Updated on top of the recent cancel_dirty_page() updates. unlocked_inode_to_wb_begin() now nests inside mem_cgroup_begin_page_stat() to match the locking order. v2: The i_wb access transaction will be used for !stat accesses too. Function names and comments updated accordingly. s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/ s/switch_wb/switch_wbs/ Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: implement memcg writeback domain based throttlingTejun Heo2015-06-021-22/+136
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While cgroup writeback support now connects memcg and blkcg so that writeback IOs are properly attributed and controlled, the IO back pressure propagation mechanism implemented in balance_dirty_pages() and its subroutines wasn't aware of cgroup writeback. Processes belonging to a memcg may have access to only subset of total memory available in the system and not factoring this into dirty throttling rendered it completely ineffective for processes under memcg limits and memcg ended up building a separate ad-hoc degenerate mechanism directly into vmscan code to limit page dirtying. The previous patches updated balance_dirty_pages() and its subroutines so that they can deal with multiple wb_domain's (writeback domains) and defined per-memcg wb_domain. Processes belonging to a non-root memcg are bound to two wb_domains, global wb_domain and memcg wb_domain, and should be throttled according to IO pressures from both domains. This patch updates dirty throttling code so that it repeats similar calculations for the two domains - the differences between the two are few and minor - and applies the lower of the two sets of resulting constraints. wb_over_bg_thresh(), which controls when background writeback terminates, is also updated to consider both global and memcg wb_domains. It returns true if dirty is over bg_thresh for either domain. This makes the dirty throttling mechanism operational for memcg domains including writeback-bandwidth-proportional dirty page distribution inside them but the ad-hoc memcg throttling mechanism in vmscan is still in place. The next patch will rip it out. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: implement memcg wb_domainTejun Heo2015-06-021-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dirtyable memory is distributed to a wb (bdi_writeback) according to the relative bandwidth the wb is writing out in the whole system. This distribution is global - each wb is measured against all other wb's and gets the proportinately sized portion of the memory in the whole system. For cgroup writeback, the amount of dirtyable memory is scoped by memcg and thus each wb would need to be measured and controlled in its memcg. IOW, a wb will belong to two writeback domains - the global and memcg domains. The previous patches laid the groundwork to support the two wb_domains and this patch implements memcg wb_domain. memcg->cgwb_domain is initialized on css online and destroyed on css release, wb->memcg_completions is added, and __wb_writeout_inc() is updated to increment completions against both global and memcg wb_domains. The following patches will update balance_dirty_pages() and its subroutines to actually consider memcg wb_domain for throttling. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: update wb_over_bg_thresh() to use wb_domain aware operationsTejun Heo2015-06-021-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wb_over_bg_thresh() currently uses global_dirty_limits() and wb_dirty_limit() both of which are wrappers around operations which take dirty_throttle_control. For cgroup writeback support, the function will be updated to also consider memcg wb_domains which requires the context information carried in dirty_throttle_control. This patch updates wb_over_bg_thresh() so that it uses the underlying wb_domain aware operations directly and builds the global dirty_throttle_control in the process. This patch doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: move over_bground_thresh() to mm/page-writeback.cTejun Heo2015-06-021-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | and rename it to wb_over_bg_thresh(). The function is closely tied to the dirty throttling mechanism implemented in page-writeback.c. This relocation will allow future updates necessary for cgroup writeback support. While at it, add function comment. This is pure reorganization and doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: separate out domain_dirty_limits()Tejun Heo2015-06-021-25/+86
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | global_dirty_limits() calculates thresh and bg_thresh (confusingly called *pdirty and *pbackground in the function) assuming global_wb_domain; however, cgroup writeback support requires considering per-memcg wb_domain too. This patch separates out domain_dirty_limits() which takes dirty_throttle_control out of global_dirty_limits(). As thresh and bg_thresh calculation needs the amount of dirtyable memory in the domain, dirty_throttle_control->avail is added. The new function calculates the two thresholds and store them directly in the dirty_throttle_control. Also, as memcg domains can't follow vm_dirty_bytes and dirty_background_bytes settings directly. If those are set and domain_dirty_limits() is invoked for a !global domain, the settings are translated to ratios by scaling them against globally available memory. dirty_throttle_control->gdtc is added to enable this when CONFIG_CGROUP_WRITEBACK. global_dirty_limits() is now a thin wrapper around domain_dirty_limits() and balance_dirty_pages() is updated to use the new function too. This patch doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: make __wb_writeout_inc() and hard_dirty_limit() take wb_domaas a ↵Tejun Heo2015-06-021-16/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | parameter Currently __wb_writeout_inc() and hard_dirty_limit() assume global_wb_domain; however, cgroup writeback support requires considering per-memcg wb_domain too. This patch separates out domain-specific part of __wb_writeout_inc() into wb_domain_writeout_inc() which takes wb_domain as a parameter and adds the parameter to hard_dirty_limit(). This will allow these two functions to handle per-memcg wb_domains. This patch doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: add dirty_throttle_control->domTejun Heo2015-06-021-6/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently all dirty throttle operations use global_wb_domain; however, cgroup writeback support requires considering per-memcg wb_domain too. This patch adds dirty_throttle_control->dom and updates functions which are directly using globabl_wb_domain to use it instead. As this makes global_update_bandwidth() a misnomer, the function is renamed to domain_update_bandwidth(). This patch doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: add dirty_throttle_control->wb_completionsTejun Heo2015-06-021-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wb->completions measures the wb's proportional write bandwidth in global_wb_domain and thus naturally tied to the wb_domain. This patch adds dirty_throttle_control->wb_completions which is initialized to wb->completions by GDTC_INIT() and updates __wb_dirty_limits() to use it instead of dereferencing wb->completions directly. This will allow dirty_throttle_control to represent different wb_domains and the matching wb completions. This patch doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: add dirty_throttle_control->pos_ratioTejun Heo2015-06-021-15/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wb_position_ratio() is used to calculate pos_ratio, which is used for two purposes. wb_update_dirty_ratelimit() uses it to adjust wb->[balanced_]dirty_ratelimit gradually and balance_dirty_pages() to immediately adjust dirty_ratelimit right before applying it to determine pause duration. While wb_update_dirty_ratelimit() is separately rate limited from balance_dirty_pages(), on the run where the ratelimit is updated, we end up calculating pos_ratio twice with the same parameters. This patch adds dirty_throttle_control->pos_ratio. balance_dirty_pages() calculates it once per run and wb_update_dirty_ratelimit() uses the value stored in dirty_throttle_control. This removes the duplicate calculation and also will help implementing memcg wb_domain. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: make __wb_calc_thresh() take dirty_throttle_controlTejun Heo2015-06-021-7/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wb_calc_thresh() calculates wb_thresh by scaling thresh according to the wb's portion in the system-wide write bandwidth. cgroup writeback support would need to calculate wb_thresh against memcg domain too. This patch renames wb_calc_thresh() to __wb_calc_thresh() and makes it take dirty_throttle_control so that the function can later be updated to calculate against different domains according to dirty_throttle_control. wb_calc_thresh() is now a thin wrapper around __wb_calc_thresh(). v2: The original version was incorrectly scaling dtc->dirty instead of dtc->thresh. This was due to the extremely confusing function and variable names. Added a rename patch and fixed this one. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: add dirty_throttle_control->wb_bg_threshTejun Heo2015-06-021-16/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wb_bg_thresh is currently treated as a second-class citizen. It's only used when BDI_CAP_STRICTLIMIT is set and balance_dirty_pages() doesn't calculate it unless the cap is set. When the cap is set, the calculated value is not passed around but instead recalculated whenever it's used. wb_position_ratio() calculates it by scaling wb_thresh proportional to bg_thresh / thresh. wb_update_dirty_ratelimit() uses wb_dirty_limit() on bg_thresh, which should generally lead to a similar result as the proportional scaling but can also be way off in the presence of max/min_ratio settings. Avoiding wb_bg_thresh calculation saves us one u64 multiplication and divsion when BDI_CAP_STRICTLIMIT is not set. Given that balance_dirty_pages() is already ratelimited, this doesn't justify the incurred extra complexity. This patch adds wb_bg_thresh to dirty_throttle_control and makes wb_dirty_limits() always calculate it and updates the users to use the pre-calculated value. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: consolidate dirty throttle parameters into dirty_throttle_controlTejun Heo2015-06-021-111/+101
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dirty throttling implemented in balance_dirty_pages() and its subroutines makes use of a number of parameters which are passed around individually. This renders these functions somewhat unwieldy and makes it difficult to add or change the involved parameters. Also some functions use different or conflicting naming schemes for the same parameters making the code confusing to follow. This patch consolidates the main parameters into struct dirty_throttle_control so that they can be passed around easily and adding new paramters isn't painful. This also unifies how a given parameter is named and accessed. The drawback of using this type of control structure rather than explicit paramters is that it isn't immediately obvious which function accesses and modifies what; however, it's fairly clear that the benefits outweigh in this case. GDTC_INIT() macro is provided to ease initializing dirty_throttle_control for the global_wb_domain and balance_dirty_pages() uses a separate pointer to point to its global dirty_throttle_control. This is to make it uniform with memcg domain handling which will be added later. This patch doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * writeback: move global_dirty_limit into wb_domainTejun Heo2015-06-021-23/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is a part of the series to define wb_domain which represents a domain that wb's (bdi_writeback's) belong to and are measured against each other in. This will enable IO backpressure propagation for cgroup writeback. global_dirty_limit exists to regulate the global dirty threshold which is a property of the wb_domain. This patch moves hard_dirty_limit, dirty_lock, and update_time into wb_domain. This is pure reorganization and doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
OpenPOWER on IntegriCloud