Commit message  Author  Age  Files  Lines
* writeback: introduce max-pause and pass-good dirty limits  Wu Fengguang  2011-07-09  2  -0/+54

    The max-pause limit helps to keep the sleep time inside
    balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means
    a per-task rate limit of 8 pages/200ms = 160KB/s when dirty exceeded,
    which normally is enough to stop dirtiers from continuing to push the
    dirty pages high, unless there is a sufficiently large number of slow
    dirtiers (e.g. 500 tasks doing 160KB/s will still sum up to 80MB/s,
    exceeding the write bandwidth of a slow disk and hence accumulating more
    and more dirty pages).

    The pass-good limit helps to let go of the good bdi's in the presence of
    a blocked bdi (e.g. NFS server not responding) or a slow USB disk which
    for some reason builds up a large number of initial dirty pages that
    refuse to go away anytime soon.

    For example, given two bdi's A and B and the initial state

        bdi_thresh_A = dirty_thresh / 2
        bdi_thresh_B = dirty_thresh / 2
        bdi_dirty_A  = dirty_thresh / 2
        bdi_dirty_B  = dirty_thresh / 2

    Then A gets blocked, and after a dozen seconds

        bdi_thresh_A = 0
        bdi_thresh_B = dirty_thresh
        bdi_dirty_A  = dirty_thresh / 2
        bdi_dirty_B  = dirty_thresh / 2

    The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages
    will be effectively throttled by condition (nr_dirty < dirty_thresh).
    This has two problems:

    (1) we lose the protections for light dirtiers
    (2) balance_dirty_pages() effectively becomes IO-less because the
        (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
        for IO, but balance_dirty_pages() loses an important way to break
        out of the loop, which leads to more spread-out throttle delays.

    DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is
    that DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the
    above example, while this patch uses the more conservative value 8 so as
    not to surprise people with more dirty pages than expected.

    The max-pause limit won't noticeably impact the speed at which dirty
    pages are knocked down when there is a sudden drop of global/bdi dirty
    thresholds, because the heavy dirtiers will be throttled below 160KB/s,
    which is slow enough. It does help to avoid long dirty throttle delays
    and especially will make light dirtiers more responsive.

    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
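The rate-limit arithmetic and the two-bdi example in this message can be checked with a small sketch. It assumes 4KB pages, and the shape of `pass_good()` (global dirty count within `dirty_thresh + dirty_thresh / area`, and the bdi below its own threshold) is an assumption inferred from the message, not code taken from the patch.

```python
PAGE_KB = 4           # assumption: 4KB pages
MAX_PAUSE_MS = 200    # MAX_PAUSE from the commit message
PAGES_PER_PAUSE = 8   # pages dirtied per pause, from the commit message


def per_task_rate_kb_s():
    """Per-task dirty rate when every pause lasts MAX_PAUSE."""
    return PAGES_PER_PAUSE * PAGE_KB * 1000 // MAX_PAUSE_MS


def aggregate_rate_mb_s(tasks):
    """Combined rate of `tasks` throttled dirtiers (1MB = 1000KB here)."""
    return tasks * per_task_rate_kb_s() / 1000


def pass_good(nr_dirty, dirty_thresh, bdi_dirty, bdi_thresh, area):
    """Hypothetical pass-good test: let a healthy bdi go while the global
    dirty count stays within dirty_thresh + dirty_thresh / area."""
    return (nr_dirty < dirty_thresh + dirty_thresh // area
            and bdi_dirty < bdi_thresh)


thresh = 1000
# State after A is blocked: A holds thresh/2 dirty pages, B's limit is thresh.
for b_dirty in (500, 700, 900):
    nr_dirty = thresh // 2 + b_dirty
    print(b_dirty,
          pass_good(nr_dirty, thresh, b_dirty, thresh, area=8),
          pass_good(nr_dirty, thresh, b_dirty, thresh, area=2))
```

Under these assumptions, an area of 8 lets B through only while the global count is within 12.5% above the threshold, whereas an area of 2 keeps letting B through as it approaches its own bdi threshold.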
* writeback: introduce smoothed global dirty limit  Wu Fengguang  2011-07-09  3  -3/+79

    The start of a heavyweight application (e.g. KVM) may instantly knock
    down determine_dirtyable_memory() if swap is not enabled or is full.
    global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
    dirty thresholds that are _much_ lower than the global/bdi dirty pages.

    balance_dirty_pages() will then heavily throttle all dirtiers including
    the light ones, until the dirty pages drop below the new dirty
    thresholds. During this _deep_ dirty-exceeded state, the system may
    appear rather unresponsive to the users.

    About "deep" dirty-exceeded: task_dirty_limit() assigns a 1/8 lower
    dirty threshold to heavy dirtiers than to light ones, and the dirty
    pages will be throttled around the heavy dirtiers' dirty threshold and
    reasonably below the light dirtiers' dirty threshold. In this state,
    only the heavy dirtiers will be throttled and the dirty pages are
    carefully controlled to not exceed the light dirtiers' dirty threshold.
    However, if the threshold itself suddenly drops below the number of
    dirty pages, the light dirtiers will get heavily throttled.

    So introduce global_dirty_limit for tracking the global dirty threshold
    with policies

    - follow downwards slowly
    - follow up in one shot

    global_dirty_limit can effectively mask out the impact of a sudden drop
    of dirtyable memory. It will be used in the next patch for two new types
    of dirty limits. Note that the new dirty limits are not going to avoid
    throttling the light dirtiers, but could limit their sleep time to
    200ms.

    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
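The two tracking policies (follow downwards slowly, follow up in one shot) can be sketched as a single update step. This is an illustration only: the function name and the `step_shift` decay rate are assumptions, since the message does not say how slowly the limit follows a drop.

```python
def update_global_dirty_limit(limit, thresh, step_shift=5):
    """Return the new smoothed limit given the current raw threshold.

    step_shift is a hypothetical decay rate: each update closes
    1/2**step_shift of the gap when the threshold has dropped.
    """
    if limit < thresh:
        return thresh                                 # follow up in one shot
    return limit - ((limit - thresh) >> step_shift)   # follow downwards slowly


limit = 2000
for _ in range(5):
    # The raw threshold suddenly halves; the smoothed limit only creeps down.
    limit = update_global_dirty_limit(limit, 1000)
    print(limit)

# The raw threshold recovers; the smoothed limit jumps straight up.
print(update_global_dirty_limit(limit, 3000))
```

With integer shifts the limit stops decaying once the remaining gap is smaller than `2**step_shift`, which is acceptable for a sketch whose only purpose is to show the asymmetry.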
* writeback: consolidate variable names in balance_dirty_pages()  Wu Fengguang  2011-07-09  1  -10/+11

    Introduce

        nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS

    in order to simplify many tests in the following patches.

    balance_dirty_pages() will eventually care only about the dirty sums
    besides nr_writeback.

    Acked-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: show bdi write bandwidth in debugfs  Wu Fengguang  2011-07-09  1  -11/+13

    Add a "BdiWriteBandwidth" entry and indent others in
    /debug/bdi/*/stats.

    Also increase the numeric field width to 10, to keep the possibly huge
    BdiWritten number aligned, at least for desktop systems.

    Impact: this could break user space tools if they are dumb enough to
    depend on the amount of whitespace.

    CC: Theodore Ts'o <tytso@mit.edu>
    CC: Jan Kara <jack@suse.cz>
    CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: bdi write bandwidth estimation  Wu Fengguang  2011-07-09  5  -0/+120

    The estimation value will start from 100MB/s and adapt to the real
    bandwidth in seconds.

    It tries to update the bandwidth only when the disk is fully utilized.
    Any inactive period of more than one second will be skipped.

    The estimated bandwidth will reflect how fast the device can write out
    when _fully utilized_, and won't drop to 0 when it goes idle. The value
    will remain constant at disk idle time. At busy write time, if not
    considering fluctuations, it will also remain high unless it is knocked
    down by possible concurrent reads that compete for the disk time and
    bandwidth with async writes.

    The estimation is not done purely in the flusher because there is no
    guarantee for write_cache_pages() to return timely to update bandwidth.

    The bdi->avg_write_bandwidth smoothing is very effective for filtering
    out sudden spikes, however it may be a little biased in the long term.

    The overheads are low because the bdi bandwidth update only occurs at
    200ms intervals.

    The 200ms update interval is suitable, because it's not possible to get
    the real instantaneous bandwidth at all, due to large fluctuations.

    The NFS commits can be as large as seconds' worth of data. One XFS
    completion may be as large as half a second's worth of data if we are
    going to increase the write chunk to half a second's worth of data. In
    ext4, fluctuations with a time period of around 5 seconds are observed.
    And there is another pattern of irregular periods of up to 20 seconds on
    SSD tests.

    That's why we are not only doing the estimation at 200ms intervals, but
    also averaging them over a period of 3 seconds and then going further to
    do another level of smoothing in avg_write_bandwidth.

    CC: Li Shaohua <shaohua.li@intel.com>
    CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
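The estimation scheme described here (samples no more often than every 200ms, idle gaps over one second skipped, a 3-second averaging period, then a second smoothing level) can be modelled roughly as below. The class, the weighting formula and the 1/8 smoothing factor are assumptions for illustration and are not the patch's arithmetic; the sketch also does not model the "only when fully utilized" condition.

```python
class BandwidthEstimator:
    UPDATE_INTERVAL = 0.2   # seconds, from the commit message
    IDLE_CUTOFF = 1.0       # seconds, from the commit message
    PERIOD = 3.0            # seconds, averaging period from the message

    def __init__(self, initial_mb_s=100.0):
        # Both estimates start from 100MB/s, as the message states.
        self.write_bandwidth = initial_mb_s
        self.avg_write_bandwidth = initial_mb_s

    def update(self, written_mb, elapsed_s):
        """Feed one observation; return True if the estimate was updated."""
        if elapsed_s < self.UPDATE_INTERVAL:
            return False    # rate-limit updates to 200ms intervals
        if elapsed_s > self.IDLE_CUTOFF:
            return False    # skip inactive periods, keep the old value
        sample = written_mb / elapsed_s
        # Assumed weighting: a sample covering elapsed_s seconds replaces
        # that fraction of the 3-second averaging period.
        weight = elapsed_s / self.PERIOD
        self.write_bandwidth += (sample - self.write_bandwidth) * weight
        # Assumed second-level smoothing: move 1/8 of the way each update.
        self.avg_write_bandwidth += (
            self.write_bandwidth - self.avg_write_bandwidth) / 8
        return True


est = BandwidthEstimator()
for _ in range(50):
    est.update(written_mb=10, elapsed_s=0.2)   # a steady 50MB/s device
print(round(est.write_bandwidth, 1), round(est.avg_write_bandwidth, 1))
```

In this model an idle period leaves both values untouched, which matches the stated goal that the estimate should not drop to 0 when the disk goes idle.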
* writeback: account per-bdi accumulated written pages  Jan Kara  2011-07-09  3  -2/+10

    Introduce the BDI_WRITTEN counter. It will be used for estimating the
    bdi's write bandwidth.

    Peter Zijlstra <a.p.zijlstra@chello.nl>:
    Move BDI_WRITTEN accounting into __bdi_writeout_inc().
    This will cover and fix fuse, which only calls bdi_writeout_inc().

    CC: Michael Rubin <mrubin@google.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: make writeback_control.nr_to_write straight  Wu Fengguang  2011-07-09  6  -129/+148

    Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
    and initialize the struct writeback_control there.

    struct writeback_control is basically designed to control writeback of a
    single file, but we keep abusing it for writing multiple files in
    writeback_sb_inodes() and its callers.

    This immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
    work->nr_pages starts to make sense, and instead of saving and restoring
    pages_skipped in writeback_sb_inodes it can always start with a clean
    zero value.

    It also makes a neat IO pattern change: large dirty files are now
    written in the full 4MB writeback chunk size, rather than whatever quota
    remained in wbc->nr_to_write.

    Acked-by: Jan Kara <jack@suse.cz>
    Proposed-by: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()  Wu Fengguang  2011-06-20  1  -3/+4

    This helps prevent tmpfs dirtiers from skewing the per-cpu
    bdp_ratelimits.

    Acked-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: trace event writeback_queue_io  Wu Fengguang  2011-06-08  2  -4/+35

    Note that it adds a little overhead to account the moved/enqueued inodes
    from b_dirty to b_io. The "moved" accounting may later be used to limit
    the number of inodes that can be moved in one shot, in order to keep
    spinlock hold time under control.

    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: trace event writeback_single_inode  Wu Fengguang  2011-06-08  2  -0/+74

    It is valuable to know how the dirty inodes are iterated and their IO
    size.

    "writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"

    - "state" reflects inode->i_state at the end of writeback_single_inode()
    - "index" reflects mapping->writeback_index after the ->writepages() call
    - "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
    - "wrote" is the number of pages actually written

    v2: add trace event writeback_single_inode_requeue as proposed by Dave.

    CC: Dave Chinner <david@fromorbit.com>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: remove .nonblocking and .encountered_congestion  Wu Fengguang  2011-06-08  4  -9/+4

    Remove two unused struct writeback_control fields:

        .encountered_congestion  (completely unused)
        .nonblocking             (never set, checked/shown in XFS, NFS/btrfs)

    The .for_background check in nfs_write_inode() is also removed, as
    .for_background implies WB_SYNC_NONE.

    Reviewed-by: Jan Kara <jack@suse.cz>
    Proposed-by: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: remove writeback_control.more_io  Wu Fengguang  2011-06-08  4  -16/+5

    When wbc.more_io was first introduced, it indicated whether there was at
    least one superblock whose s_more_io contained more IO work. Now with
    the per-bdi writeback, it can be replaced with a simple b_more_io test.

    Acked-by: Jan Kara <jack@suse.cz>
    Acked-by: Mel Gorman <mel@csn.ul.ie>
    Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: skip balance_dirty_pages() for in-memory fs  Wu Fengguang  2011-06-08  1  -6/+4

    This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.

    Notes about the tmpfs/ramfs behavior changes:

    In 2.6.36 and older kernels, tmpfs writes will sleep inside
    balance_dirty_pages() as long as we are over the (dirty+background)/2
    global throttle threshold. This is because both the dirty pages and the
    threshold will be 0 for tmpfs/ramfs. Hence this test will always
    evaluate to TRUE:

        dirty_exceeded =
            (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
            || (nr_reclaimable + nr_writeback >= dirty_thresh);

    For 2.6.37, someone complained that the current logic does not allow
    the users to set vm.dirty_ratio=0. So commit 4cbec4c8b9 changed the test
    to

        dirty_exceeded =
            (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
            || (nr_reclaimable + nr_writeback > dirty_thresh);

    So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
    throttled unless the global dirty threshold is exceeded (which is very
    unlikely to happen; once it does, it will block many tasks).

    I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
    that on a busy writing server, tmpfs write()s may get livelocked! The
    "inadvertent" throttling can hardly help any workload because of its
    "either no throttling, or get throttled to death" property.

    So relative to 2.6.37, this patch won't bring more noticeable changes.

    CC: Hugh Dickins <hughd@google.com>
    Acked-by: Rik van Riel <riel@redhat.com>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
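The difference between the two dirty_exceeded tests quoted in this message comes down to `>=` versus `>` when both sides are zero. A minimal sketch follows; the function names are made up for illustration, and the bdi and global counts are passed in already summed.

```python
def dirty_exceeded_2_6_36(bdi_dirty, bdi_thresh, nr_dirty, dirty_thresh):
    """Test as quoted for 2.6.36 and older: >= comparisons."""
    return bdi_dirty >= bdi_thresh or nr_dirty >= dirty_thresh


def dirty_exceeded_2_6_37(bdi_dirty, bdi_thresh, nr_dirty, dirty_thresh):
    """Test as quoted after commit 4cbec4c8b9: strict > comparisons."""
    return bdi_dirty > bdi_thresh or nr_dirty > dirty_thresh


# tmpfs/ramfs: both the bdi dirty count and the bdi threshold are 0,
# while the global dirty count is comfortably below the global threshold.
tmpfs = dict(bdi_dirty=0, bdi_thresh=0, nr_dirty=100, dirty_thresh=1000)
print(dirty_exceeded_2_6_36(**tmpfs))   # 0 >= 0, so always "exceeded"
print(dirty_exceeded_2_6_37(**tmpfs))   # 0 > 0 is false, so not throttled
```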
* writeback: add bdi_dirty_limit() kernel-doc  Wu Fengguang  2011-06-08  1  -2/+9

    Clarify the bdi_dirty_limit() comment.

    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Acked-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: avoid extra sync work at enqueue time  Wu Fengguang  2011-06-08  2  -16/+3

    This removes writeback_control.wb_start and does more straightforward
    sync livelock prevention by setting .older_than_this to prevent extra
    inodes from being enqueued in the first place.

    Acked-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: elevate queue_io() into wb_writeback()  Wu Fengguang  2011-06-08  1  -17/+12

    Code refactor for more logical code layout.
    No behavior change.

    - remove the mis-named __writeback_inodes_sb()
    - wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
      before calling __writeback_inodes_wb()

    Acked-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: split inode_wb_list_lock into bdi_writeback.list_lock  Christoph Hellwig  2011-06-08  8  -68/+85

    Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contentions to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3

                        vanilla 2.6.39-rc3     2.6.39-rc3 + patch
    class name          inode_wb_list_lock     &(&wb->list_lock)->rlock
    con-bounces         42590                  11383
    contentions         44433                  11657
    waittime-min        0.12                   0.14
    waittime-max        147.74                 151.69
    waittime-total      144127.35              40429.51
    acq-bounces         252274                 90825
    acquisitions        886792                 527918
    holdtime-min        0.08                   0.11
    holdtime-max        121.34                 145.90
    holdtime-total      917211.23              556843.37

    vanilla 2.6.39-rc3:

        ------------------
        inode_wb_list_lock      2  [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
        inode_wb_list_lock     34  [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
        inode_wb_list_lock  12893  [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
        inode_wb_list_lock  10702  [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
        ------------------
        inode_wb_list_lock      2  [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
        inode_wb_list_lock     19  [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
        inode_wb_list_lock   5550  [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
        inode_wb_list_lock   8511  [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:

        ------------------------
        &(&wb->list_lock)->rlock     10  [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
        &(&wb->list_lock)->rlock   1493  [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
        &(&wb->list_lock)->rlock   3652  [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
        &(&wb->list_lock)->rlock   1412  [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
        ------------------------
        &(&wb->list_lock)->rlock      3  [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
        &(&wb->list_lock)->rlock      6  [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
        &(&wb->list_lock)->rlock   2061  [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
        &(&wb->list_lock)->rlock   2629  [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f

    hughd@google.com: fix recursive lock when bdi_lock_two() is called with
    new the same as old
    akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: refill b_io iff empty  Wu Fengguang  2011-06-08  1  -2/+3

    There is no point in carrying different refill policies between
    for_kupdate and other types of work. Use a consistent "refill b_io iff
    empty" policy which can guarantee fairness in an easy to understand way.

    A b_io refill will set up a _fixed_ work set with all currently eligible
    inodes and start a new round of walking through b_io. The "fixed" work
    set means no new inodes will be added to the work set during the walk.
    Only when a complete walk over b_io is done will new inodes that are
    eligible at the time be enqueued and the walk be started over.

    This procedure provides fairness among the inodes because it guarantees
    each inode to be synced once and only once in each round. So all inodes
    will be free from starvation.

    This change relies on wb_writeback() to keep retrying as long as we made
    some progress on cleaning some pages and/or inodes. Without that
    ability, the old logic for background work relied on aggressively
    queuing all eligible inodes into b_io every time. But that's not a
    guarantee.

    The test script below now completes slightly faster:

                      2.6.39-rc3    2.6.39-rc3-dyn-expire+
        ------------------------------------------------
        all elapsed   256.043       252.367
        stddev         24.381        12.530
        tar elapsed    30.097        28.808
        dd  elapsed    13.214        11.782

        #!/bin/zsh

        cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/

        umount /dev/sda7
        mkfs.xfs -f /dev/sda7
        mount /dev/sda7 /fs

        echo 3 > /proc/sys/vm/drop_caches

        tic=$(cat /proc/uptime|cut -d' ' -f2)

        cd /fs
        time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
        time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &

        wait
        sync
        tac=$(cat /proc/uptime|cut -d' ' -f2)
        echo elapsed: $((tac - tic))

    It maintains roughly the same small vs. large file writeout shares, and
    offers large files better chances to be written in nice 4M chunks.

    Analysis from Dave Chinner in great detail:

    Let's say we have lots of inodes with 100 dirty pages being created,
    and one large writeback going on. We expire 8 new inodes for every
    1024 pages we write back.

    With the old code, we do:

        b_more_io (large inode) -> b_io (1l)
        8 newly expired inodes -> b_io (1l, 8s)

        writeback  large inode 1024 pages -> b_more_io

        b_more_io (large inode) -> b_io (8s, 1l)
        8 newly expired inodes -> b_io (8s, 1l, 8s)

        writeback  8 small inodes 800 pages
                   1 large inode 224 pages -> b_more_io

        b_more_io (large inode) -> b_io (8s, 1l)
        8 newly expired inodes -> b_io (8s, 1l, 8s)
        .....

    Your new code:

        b_more_io (large inode) -> b_io (1l)
        8 newly expired inodes -> b_io (1l, 8s)

        writeback  large inode 1024 pages -> b_more_io
        (b_io == 8s)
        writeback  8 small inodes 800 pages

        b_io empty: (1800 pages written)
                b_more_io (large inode) -> b_io (1l)
                14 newly expired inodes -> b_io (1l, 14s)

        writeback  large inode 1024 pages -> b_more_io
        (b_io == 14s)
        writeback  10 small inodes 1000 pages
                   1 small inode 24 pages -> b_more_io (1l, 1s(24))
        writeback  5 small inodes 500 pages
        b_io empty: (2548 pages written)
                b_more_io (large inode) -> b_io (1l, 1s(24))
                20 newly expired inodes -> b_io (1l, 1s(24), 20s)
        ......

    Rough progression of pages written at b_io refill:

    Old code:

        total   large file   % of writeback
        1024    224          21.9% (fixed)

    New code:

        total   large file   % of writeback
        1800    1024         ~55%
        2550    1024         ~40%
        3050    1024         ~33%
        3500    1024         ~29%
        3950    1024         ~26%
        4250    1024         ~24%
        4500    1024         ~22.7%
        4700    1024         ~21.7%
        4800    1024         ~21.3%
        4800    1024         ~21.3% (pretty much steady state from here)

    Ok, so the steady state is reached with a similar percentage of
    writeback to the large file as the existing code. Ok, that's good, but
    providing some evidence that it doesn't change the share of writeback to
    the large file should be in the commit message ;)

    The other advantage to this is that we always write 1024 page chunks to
    the large file, rather than smaller "whatever remains" chunks.

    CC: Jan Kara <jack@suse.cz>
    Acked-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
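The "refill b_io iff empty" policy can be illustrated with a toy queue model. This is a sketch of the fairness property only: inodes are plain strings, every inode is fully cleaned in one visit, and the `arrivals` schedule is a hypothetical input, so it does not reproduce the page counts in Dave Chinner's analysis.

```python
from collections import deque


def writeback_rounds(initial, arrivals):
    """Return the list of inodes visited in each b_io round.

    initial:  inodes eligible before the first refill
    arrivals: one list of newly eligible inodes per inode written
    """
    b_dirty = deque(initial)
    b_io = deque()
    arrivals = iter(arrivals)
    rounds = []
    while b_dirty or b_io:
        if not b_io:                    # refill iff empty
            b_io.extend(b_dirty)        # fixed work set for this round
            b_dirty.clear()
        visited = []
        while b_io:
            visited.append(b_io.popleft())
            # Inodes that become eligible mid-walk wait in b_dirty;
            # they never join the round that is in progress.
            b_dirty.extend(next(arrivals, []))
        rounds.append(visited)
    return rounds


print(writeback_rounds(["a", "b"], [["c"], ["d"], [], []]))
```

Each round visits every inode of its fixed work set exactly once, and newcomers are picked up only at the next refill, which is the once-per-round guarantee the message describes.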
* writeback: the kupdate expire timestamp should be a moving target  Wu Fengguang  2011-06-08  1  -5/+6

    Dynamically compute the dirty expire timestamp at queue_io() time.

    writeback_control.older_than_this used to be determined at entrance to
    the kupdate writeback work. This _static_ timestamp may go stale if the
    kupdate work runs on and on. The flusher may then get stuck with some
    old busy inodes, never considering newly expired inodes thereafter.

    This has two possible problems:

    - It is unfair for a large dirty inode to delay (for a long time) the
      writeback of small dirty inodes.

    - As time goes by, the large and busy dirty inode may contain only
      _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
      delaying the expired dirty pages to the end of LRU lists, triggering
      the evil pageout(). Nevertheless this patch merely addresses part
      of the problem.

    v2: keep policy changes inside wb_writeback() and keep the
    wbc.older_than_this visibility as suggested by Dave.

    CC: Dave Chinner <david@fromorbit.com>
    Acked-by: Jan Kara <jack@suse.cz>
    Acked-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
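The difference between a static and a moving expire timestamp can be shown with two eligibility tests. The 30-second expire interval and the function names are assumptions for illustration; times are plain integers in seconds.

```python
DIRTY_EXPIRE = 30   # seconds; hypothetical expire interval


def eligible_static(dirtied_when, work_start):
    """Old behaviour: the cutoff is fixed when the kupdate work starts."""
    return dirtied_when <= work_start - DIRTY_EXPIRE


def eligible_dynamic(dirtied_when, now):
    """New behaviour: the cutoff is recomputed at each queue_io() call."""
    return dirtied_when <= now - DIRTY_EXPIRE


work_start = 100
dirtied_when = 80     # inode dirtied shortly before the work started
now = 120             # the kupdate work has been running for 20 seconds

print(eligible_static(dirtied_when, work_start))   # cutoff stuck at 70
print(eligible_dynamic(dirtied_when, now))         # cutoff has moved to 90
```

The inode has been dirty for 40 seconds at `now`, yet the static cutoff never considers it for as long as the same kupdate work keeps running.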
* writeback: try more writeback as long as something was written  Wu Fengguang  2011-06-08  1  -8/+8

    writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
    they only populate possibly a subset of eligible inodes into b_io at
    entrance time. When the queued set of inodes are all synced, they just
    return, possibly with all queued inode pages written but still
    wbc.nr_to_write > 0.

    For kupdate and background writeback, there may be more eligible inodes
    sitting in b_dirty when the current set of b_io inodes are completed. So
    it is necessary to try another round of writeback as long as we made
    some progress in this round. When there are no more eligible inodes, no
    more inodes will be enqueued in queue_io(), hence nothing could/will be
    synced and we may safely bail.

    For example, imagine 100 inodes

        i0, i1, i2, ..., i90, i91, ..., i99

    At queue_io() time, i90-i99 happen to be expired and moved to s_io for
    IO. When finished successfully, if their total size is less than
    MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
    quit the background work (w/o this patch) while it's still over the
    background threshold. This will be a fairly normal/frequent case I
    guess.

    Now that we do tagged sync and update inode->dirtied_when after the
    sync, this change won't livelock sync(1). I actually tried to write 1
    page per 1ms with this command

        write-and-fsync -n10000 -S 1000 -c 4096 /fs/test

    and do sync(1) at the same time. The sync completes quickly on ext4,
    xfs, btrfs.

    Acked-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* writeback: introduce writeback_control.inodes_written  Wu Fengguang  2011-06-08  2  -0/+5

    The flusher works on dirty inodes in batches, and may quit prematurely
    if the batch of inodes happens to be metadata-only dirtied: in this case
    wbc->nr_to_write won't be decreased at all, which stands for "no pages
    written" but is also misinterpreted as "no progress".

    So introduce writeback_control.inodes_written to count the inodes that
    get cleaned from the VFS point of view. A non-zero value means there is
    some progress on writeback, in which case more writeback can be tried.

    Acked-by: Jan Kara <jack@suse.cz>
    Acked-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
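The progress test motivated here can be compared with the pages-only test in a few lines. Both helper names are hypothetical; the sketch only shows why a metadata-only batch looks like "no progress" when pages alone are counted.

```python
def made_progress_pages_only(nr_to_write_before, nr_to_write_after):
    """Old notion of progress: some pages were written in this batch."""
    return nr_to_write_after < nr_to_write_before


def made_progress(nr_to_write_before, nr_to_write_after, inodes_written):
    """Progress if pages were written or any inode was cleaned."""
    return nr_to_write_after < nr_to_write_before or inodes_written > 0


# A batch of three metadata-only dirty inodes: no pages are written,
# so nr_to_write is unchanged, but three inodes were cleaned.
print(made_progress_pages_only(1024, 1024))
print(made_progress(1024, 1024, inodes_written=3))
```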
* writeback: update dirtied_when for synced inode to prevent livelock  Wu Fengguang  2011-06-08  1  -0/+9

    Explicitly update .dirtied_when on synced inodes, so that they are no
    longer considered for writeback in the next round.

    It can prevent both of the following livelock schemes:

    - while true; do echo data >> f; done
    - while true; do touch f; done (in theory)

    The exact livelock condition is, during sync(1):

    (1) no new inodes are dirtied
    (2) an inode being actively dirtied

    On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
    When finished, it will be redirty_tail()ed because it's still dirty
    and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
    on condition (1). The sync work will then revisit it on the next
    queue_io() and find it eligible again because its old ->dirtied_when
    predates the sync work start time.

    We'll do more aggressive "keep writeback as long as we wrote something"
    logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
    b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
    no longer be enough to stop sync livelock.

    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
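The livelock condition and its fix can be simulated with a single inode that is redirtied continuously. The timestamps, the `max_rounds` safety cap and the function name are assumptions made for this sketch.

```python
def sync_rounds(update_dirtied_when, max_rounds=10):
    """Count how many times sync revisits one constantly redirtied inode.

    The inode is eligible while its dirtied_when predates the sync start.
    max_rounds caps the simulation so the livelock case terminates.
    """
    sync_start = 100
    now = sync_start
    dirtied_when = 50          # first dirtied before the sync began
    rounds = 0
    while dirtied_when <= sync_start and rounds < max_rounds:
        now += 1               # writing the inode takes some time
        rounds += 1
        # The inode is still dirty afterwards because it is being
        # actively dirtied; only the fix refreshes its timestamp.
        if update_dirtied_when:
            dirtied_when = now
    return rounds


print(sync_rounds(update_dirtied_when=False))   # hits the cap: livelock
print(sync_rounds(update_dirtied_when=True))    # visited once, then skipped
```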
* writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage  Wu Fengguang  2011-06-08  4  -12/+14

    sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
    WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
    do livelock prevention for it, too.

    Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
    using page tagging") is a partial fix in that it only fixed the
    WB_SYNC_ALL phase livelock.

    Although ext4 is tested to no longer livelock with commit f446daaea9,
    it may be due to some "redirty_tail() after pages_skipped" effect which
    is by no means a guarantee for _all_ the file systems.

    Note that writeback_inodes_sb() is called not only by sync(); all
    callers are treated the same because the other callers also need
    livelock prevention.

    Impact: It changes the order in which pages/inodes are synced to disk.
    Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
    until finished with the current inode.

    Acked-by: Jan Kara <jack@suse.cz>
    CC: Dave Chinner <david@fromorbit.com>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* Linux 3.0-rc2  (tag: v3.0-rc2)  Linus Torvalds  2011-06-06  1  -1/+1
|
* mm: fix ENOSPC returned by handle_mm_fault()  Hugh Dickins  2011-06-06  1  -2/+2

    Al Viro observes that in the hugetlb case, handle_mm_fault() may return
    a value of the kind ENOSPC when its caller is expecting a value of the
    kind VM_FAULT_SIGBUS: fix alloc_huge_page()'s failure returns.

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Al Viro <viro@zeniv.linux.org.uk>
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6  Linus Torvalds  2011-06-06  7  -17/+30
|\
    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
      ALSA: usb - turn off de-emphasis in s/pdif for cm6206
      ALSA: asihpi: Use angle brackets for system includes
      ALSA: fm801: add error handling if auto-detect fails
      ALSA: hda - Check pin support EAPD in ad198x_power_eapd_write
      ALSA: hda - Fix HP and Front pins of ad1988/ad1989 in ad198x_power_eapd()
      ALSA: 6fire: Don't leak firmware in error path
      ASoC: Fix wm_hubs input PGA ZC bits
      ASoC: Fix dapm_is_shared_kcontrol so everything isn't shared
| * Merge branch 'fix/asoc' into for-linus  Takashi Iwai  2011-06-06  2  -5/+8
| |\
| | * ASoC: Fix wm_hubs input PGA ZC bits  Mark Brown  2011-05-27  1  -4/+4

        Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
        Acked-by: Liam Girdwood <lrg@ti.com>
| | * ASoC: Fix dapm_is_shared_kcontrol so everything isn't shared  Stephen Warren  2011-05-27  1  -1/+4

        Commit af46800 ("ASoC: Implement mux control sharing") introduced
        function dapm_is_shared_kcontrol.

        When this function returns true, the naming of DAPM controls is
        derived from the kcontrol_new. Otherwise, the name comes from the
        widget (and possibly a widget's naming prefix).

        A bug in the implementation of dapm_is_shared_kcontrol made it
        return 1 in all cases. Hence, that commit caused a change in control
        naming for all controls instead of just shared controls.

        Specifically, a control is always considered shared because it is
        always compared against itself. Solve this by never comparing
        against the widget containing the control being created.

        Equally, controls should never be shared between DAPM contexts; when
        the same codec is instantiated multiple times, the same kcontrol_new
        will be used. However, the control should not be shared between the
        multiple instances.

        I tested that with the Tegra WM8903 driver:
        * Shared is now mostly 0 as expected, and sometimes 1.
        * The expected controls are still generated after this change.

        However, I don't have any systems that have a widget/control naming
        prefix, so I can't test that aspect.

        Thanks to Jarkko Nikula for pointing out how to fix this.

        Reported-by: Liam Girdwood <lrg@ti.com>
        Tested-by: Jarkko Nikula <jhnikula@gmail.com>
        Signed-off-by: Stephen Warren <swarren@nvidia.com>
        Acked-by: Liam Girdwood <lrg@ti.com>
        Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
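The self-comparison bug and the two rules of the fix (skip the owning widget, skip other DAPM contexts) can be modelled with plain objects. The `Widget` class and both function signatures are hypothetical stand-ins and do not mirror the ASoC data structures.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Widget:
    name: str
    dapm: str                       # DAPM context, e.g. one codec instance
    kcontrols: list = field(default_factory=list)


def is_shared_buggy(widgets, kcontrol):
    """Buggy check: the owning widget is in `widgets`, so the control
    always matches itself and is always reported as shared."""
    return any(kcontrol in w.kcontrols for w in widgets)


def is_shared_fixed(widgets, owner, kcontrol):
    """Fixed check: ignore the owning widget and other DAPM contexts."""
    return any(w is not owner
               and w.dapm == owner.dapm
               and kcontrol in w.kcontrols
               for w in widgets)


a = Widget("A", "codec0", ["mux_ctl"])
b = Widget("B", "codec0", ["other_ctl"])
c = Widget("C", "codec0", ["mux_ctl"])      # really shares mux_ctl with A
d = Widget("D", "codec1", ["mux_ctl"])      # second instance of the codec

print(is_shared_buggy([a, b], "mux_ctl"))       # reported shared (wrong)
print(is_shared_fixed([a, b], a, "mux_ctl"))    # not shared
print(is_shared_fixed([a, b, c], a, "mux_ctl")) # shared within codec0
print(is_shared_fixed([a, b, d], a, "mux_ctl")) # other context: not shared
```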
| * | ALSA: usb - turn off de-emphasis in s/pdif for cm6206Eric Lammerts2011-06-031-1/+1
| | | | | | | | | | | | | | | | | | | | | CM6206: Turn off de-emphasis channel status bit in S/PDIF output. Signed-off-by: Eric Lammerts <eric@lammerts.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | ALSA: asihpi: Use angle brackets for system includesJoe Perches2011-06-031-1/+1
| | | | | | | | | | | | | | | | | | | | | Use the normal include style. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | ALSA: fm801: add error handling if auto-detect failsDan Carpenter2011-06-031-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | In the original code if auto detect failed and tea575x_tuner == 4 then we copy bogus information to chip->tea.card. I've changed the autodetect code to cleanup and return -ENODEV on error instead. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | ALSA: hda - Check pin support EAPD in ad198x_power_eapd_writeRaymond Yau2011-06-031-2/+4
| | | | | | | | | | | | | | | | | | | | | Check whether the pin supports EAPD in ad198x_power_eapd_write. Signed-off-by: Raymond Yau <superquad.vortex2@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | ALSA: hda - Fix HP and Front pins of ad1988/ad1989 in ad198x_power_eapd()Takashi Iwai2011-06-031-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | In ad198x_power_eapd(), wrong pin NIDs are used for controlling EAPD for HP and Front outputs of AD1988/AD1989. These are actually same with the ones for AD1984 & co, port-A is 0x11 and port-D 0x12. Reported-by: Raymond Yau <superquad.vortex2@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | ALSA: 6fire: Don't leak firmware in error path  Jesper Juhl  2011-06-02  1  -0/+1
| | |
| | | One of the error paths in
| | | sound/usb/6fire/firmware.c::usb6fire_fw_ezusb_upload() neglects to free
| | | the memory allocated for the firmware before returning, thus leaking the
| | | memory.
| | |
| | | Signed-off-by: Jesper Juhl <jj@chaosbits.net>
| | | Signed-off-by: Takashi Iwai <tiwai@suse.de>
* | | Merge branch 'hwmon-for-linus' of ↵  Linus Torvalds  2011-06-06  2  -23/+22
|\ \ \
| | | | git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging
| | | |
| | | | * 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging:
| | | |   hwmon: (max6642): Better chip detection schema
| | | |   hwmon: (coretemp) Further relax temperature range checks
| | | |   hwmon: (coretemp) Fix TjMax detection for older CPUs
| | | |   hwmon: (coretemp) Relax target temperature range check
| | | |   hwmon: (max6642) Rename temp_fault sysfs attribute to temp2_fault
| * | | hwmon: (max6642): Better chip detection schema  Per Dalén  2011-06-04  1  -2/+16
| | | |
| | | | Improve detection of MAX6642 by reading nonexistent registers (0x04,
| | | | 0x06 and 0xff). Reading those registers returns the previously read
| | | | value.
| | | |
| | | | Signed-off-by: Per Dalen <per.dalen@appeartv.com>
| | | | [guenter.roeck@ericsson.com: added second set of register reads]
| | | | Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>
| * | | hwmon: (coretemp) Further relax temperature range checks  Guenter Roeck  2011-06-01  1  -2/+2
| | | |
| | | | Further relax temperature range checks after reading the
| | | | IA32_TEMPERATURE_TARGET register. If the register returns a value other
| | | | than 0 in bits 16..23, assume that the returned value is correct.
| | | |
| | | | This change applies to both package and core temperature limits.
| | | |
| | | | Cc: Carsten Emde <C.Emde@osadl.org>
| | | | Cc: Fenghua Yu <fenghua.yu@intel.com>
| | | | Cc: Jean Delvare <khali@linux-fr.org>
| | | | Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>
| | | | Acked-by: Fenghua Yu <fenghua.yu@intel.com>
| * | | hwmon: (coretemp) Fix TjMax detection for older CPUs  Guenter Roeck  2011-06-01  1  -17/+2
| | | |
| | | | Commit a321cedb12904114e2ba5041a3673ca24deb09c9 excludes CPU models 0xe,
| | | | 0xf, 0x16, and 0x1a from TjMax temperature adjustment, even though
| | | | several of those CPUs are known to have TjMax other than 100 degrees C,
| | | | and even though the code in adjust_tjmax() explicitly handles those CPUs
| | | | and points to a Web document listing several of the affected CPU IDs.
| | | |
| | | | Reinstate original TjMax adjustment if TjMax cannot be determined using
| | | | the IA32_TEMPERATURE_TARGET register.
| | | |
| | | | https://bugzilla.kernel.org/show_bug.cgi?id=32582
| | | |
| | | | Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>
| | | | Cc: Huaxu Wan <huaxu.wan@linux.intel.com>
| | | | Cc: Carsten Emde <C.Emde@osadl.org>
| | | | Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
| | | | Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
| | | | Cc: Yong Wang <yong.y.wang@linux.intel.com>
| | | | Cc: Rudolf Marek <r.marek@assembler.cz>
| | | | Cc: Fenghua Yu <fenghua.yu@intel.com>
| | | | Tested-by: Jean Delvare <khali@linux-fr.org>
| | | | Acked-by: Jean Delvare <khali@linux-fr.org>
| | | | Acked-by: Fenghua Yu <fenghua.yu@intel.com>
| | | | Cc: <stable@kernel.org> # .35.x .36.x .37.x .38.x .39.x
| * | | hwmon: (coretemp) Relax target temperature range check  Jean Delvare  2011-06-01  1  -1/+1
| | | |
| | | | The current temperature range check of MSR_IA32_TEMPERATURE_TARGET seems
| | | | too strict to me; some TjMax values documented in
| | | | Documentation/hwmon/coretemp wouldn't pass. Relax the check so that all
| | | | the documented values pass.
| | | |
| | | | Signed-off-by: Jean Delvare <khali@linux-fr.org>
| | | | Cc: Carsten Emde <C.Emde@osadl.org>
| | | | Cc: Fenghua Yu <fenghua.yu@intel.com>
| | | | Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>
| * | | hwmon: (max6642) Rename temp_fault sysfs attribute to temp2_fault  Per Dalen  2011-06-01  1  -2/+2
| | | |
| | | | The temp_fault sysfs attribute is wrong; it should be temp2_fault
| | | | instead.
| | | |
| | | | Reported-by: Guenter Roeck <guenter.roeck@ericsson.com>
| | | | Signed-off-by: Per Dalen <per.dalen@appeartv.com>
| | | | Acked-by: Jean Delvare <khali@linux-fr.org>
| | | | Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>
* | | | Merge branch 'for-linus' of git://android.git.kernel.org/kernel/tegra  Linus Torvalds  2011-06-05  2  -2/+5
|\ \ \ \
| | | | |
| | | | | * 'for-linus' of git://android.git.kernel.org/kernel/tegra:
| | | | |   ARM: Tegra: Harmony: Fix conflicting GPIO numbering
| * | | | ARM: Tegra: Harmony: Fix conflicting GPIO numbering  Stephen Warren  2011-06-04  2  -2/+5
| |/ / /
| | | |
| | | | Currently, both the WM8903 and TPS6586x chips attempt to register with
| | | | gpiolib using the same GPIO numbers. This causes the audio driver to
| | | | fail to initialize.
| | | |
| | | | To solve this, add a define to board-harmony.h for the TPS6586x, and
| | | | make board-harmony-power.c use this define, instead of directly
| | | | referencing TEGRA_NR_GPIOS.
| | | |
| | | | This fixes a regression introduced by commit
| | | | 6f168f2fa60f87e85e0df25e87e2372f22f5eb7c
| | | | ("ARM: tegra: harmony: initialize the TPS65862 PMIC").
| | | |
| | | | Signed-off-by: Stephen Warren <swarren@nvidia.com>
| | | | Signed-off-by: Colin Cross <ccross@android.com>
* | | | Merge branch 'for-linus' of ↵  Linus Torvalds  2011-06-05  19  -468/+635
|\ \ \ \
| | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
| | | | |
| | | | | * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
| | | | |   btrfs: fix uninitialized variable warning
| | | | |   btrfs: add helper for fs_info->closing
| | | | |   Btrfs: add mount -o inode_cache
| | | | |   btrfs: scrub: add explicit plugging
| | | | |   btrfs: use btrfs_ino to access inode number
| | | | |   Btrfs: don't save the inode cache if we are deleting this root
| | | | |   btrfs: false BUG_ON when degraded
| | | | |   Btrfs: don't save the inode cache in non-FS roots
| | | | |   Btrfs: make sure we don't overflow the free space cache crc page
| | | | |   Btrfs: fix uninit variable in the delayed inode code
| | | | |   btrfs: scrub: don't reuse bios and pages
| | | | |   Btrfs: leave spinning on lookup and map the leaf
| | | | |   Btrfs: check for duplicate entries in the free space cache
| | | | |   Btrfs: don't try to allocate from a block group that doesn't have enough space
| | | | |   Btrfs: don't always do readahead
| | | | |   Btrfs: try not to sleep as much when doing slow caching
| | | | |   Btrfs: kill BTRFS_I(inode)->block_group
| | | | |   Btrfs: don't look at the extent buffer level 3 times in a row
| | | | |   Btrfs: map the node block when looking for readahead targets
| | | | |   Btrfs: set range_start to the right start in count_range_bits
| | | | |   ...
| * | | | btrfs: fix uninitialized variable warning  David Sterba  2011-06-04  1  -1/+1
| | | | |
| | | | | With Linus' tree, today's linux-next build (powerpc ppc64_defconfig)
| | | | | produced this warning:
| | | | |
| | | | | fs/btrfs/delayed-inode.c: In function 'btrfs_delayed_update_inode':
| | | | | fs/btrfs/delayed-inode.c:1598:6: warning: 'ret' may be used uninitialized in this function
| | | | |
| | | | | Introduced by commit 16cdcec736cd ("btrfs: implement delayed inode items
| | | | | operation").
| | | | |
| | | | | This fixes a bug in btrfs_update_inode(): if the value returned from
| | | | | btrfs_delayed_update_inode is nonzero garbage, inode stat data are not
| | | | | updated and several call paths may hit a BUG_ON or fail with a strange
| | | | | error code.
| | | | |
| | | | | Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
| | | | | Signed-off-by: David Sterba <dsterba@suse.cz>
| * | | | btrfs: add helper for fs_info->closing  David Sterba  2011-06-04  8  -16/+20
| | | | |
| | | | | Wrap checking of the filesystem 'closing' flag and fix a few missing
| | | | | memory barriers.
| | | | |
| | | | | Signed-off-by: David Sterba <dsterba@suse.cz>
| * | | | Btrfs: add mount -o inode_cache  Chris Mason  2011-06-04  4  -1/+34
| | | | |
| | | | | This makes the inode map cache default to off until we fix the overflow
| | | | | problem when the free space crcs don't fit inside a single page.
| | | | |
| | | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * | | | btrfs: scrub: add explicit plugging  Arne Jansen  2011-06-04  1  -3/+4
| | | | |
| | | | | With the removal of the implicit plugging, scrub ends up doing more and
| | | | | smaller I/O than necessary. This patch adds explicit plugging per chunk.
| | | | |
| | | | | Signed-off-by: Arne Jansen <sensille@gmx.net>
| | | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * | | | btrfs: use btrfs_ino to access inode number  David Sterba  2011-06-04  2  -4/+5
| | | | |
| | | | | Commit 4cb5300bc ("Btrfs: add mount -o auto_defrag") accesses the inode
| | | | | number directly, while it should use the helper with the new inode
| | | | | number allocator.
| | | | |
| | | | | Signed-off-by: David Sterba <dsterba@suse.cz>
| | | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * | | | Btrfs: don't save the inode cache if we are deleting this root  Josef Bacik  2011-06-04  1  -0/+5
| | | | |
| | | | | With xfstest 254 I can panic the box every time with the inode number
| | | | | caching stuff on. This is because we clean the inodes out when we delete
| | | | | the subvolume, but then we write out the inode cache, which adds an
| | | | | inode to the subvolume inode tree. When it gets evicted again, the root
| | | | | gets added back on the dead roots list and is deleted again, so we have
| | | | | a double free. To stop this from happening, just return 0 if refs is 0
| | | | | (and we're not the tree root, since the tree root always has refs of 0).
| | | | | With this fix 254 no longer panics. Thanks,
| | | | |
| | | | | Signed-off-by: Josef Bacik <josef@redhat.com>
| | | | | Tested-by: David Sterba <dsterba@suse.cz>
| | | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>