summaryrefslogtreecommitdiffstats
path: root/block
Commit message (Collapse)AuthorAgeFilesLines
* Revert "block: add missing block_bio_complete() tracepoint"Linus Torvalds2013-04-181-0/+1
| | | | | | | | | | | | | | | | | | | | This reverts commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f. Wanlong Gao reports that it causes a kernel panic on his machine several minutes after boot. Reverting it removes the panic. Jens says: "It's not quite clear why that is yet, so I think we should just revert the commit for 3.9 final (which I'm assuming is pretty close). The wifi is crap at the LSF hotel, so sending this email instead of queueing up a revert and pull request." Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Requested-by: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert "loop: cleanup partitions when detaching loop device"Jens Axboe2013-04-081-1/+0
| | | | | | | | | | | | | This reverts commit 8761a3dc1f07b163414e2215a2cadbb4cfe2a107. There are situations where the destruction path is called with the bdev->bd_mutex already held, which then deadlocks in loop_clr_fd(). The normal partition cleanup does a trylock() on the mutex, but it'd be nice to have a more bullet proof method in loop. So punt this more involved fix to the next merge window, and just back out this buggy fix for now. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: avoid using uninitialized value in from queue_var_storeArnd Bergmann2013-04-031-0/+2
| | | | | | | | | | | | | | | | As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions that use a value generated by queue_var_store independent of whether that value was set or not. block/blk-sysfs.c: In function 'queue_store_nonrot': block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized] Unlike most other such warnings, this one is not a false positive, writing any non-number string into the sysfs files indeed has an undefined result, rather than returning an error. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Block: blk-flush: Fixed indent code styleAlice Ferrazzi2013-03-221-1/+1
| | | | | | | Fixed code indent should use tabs where possible. Signed-off-by: Alice Ferrazzi <alice.ferrazzi@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* loop: cleanup partitions when detaching loop devicePhillip Susi2013-03-221-0/+1
| | | | | | | | | | | | | | | Any partitions added by user space to the loop device were being left in place after detaching the loop device. This was because the detach path issued a BLKRRPART to clean up partitions if LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto scanned on attach. Replace this BLKRRPART with code that unconditionally cleans up partitions on detach instead. Signed-off-by: Phillip Susi <psusi@ubuntu.com> Modified by Jens to export delete_partition(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-blockLinus Torvalds2013-02-2811-205/+834
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block IO core bits from Jens Axboe: "Below are the core block IO bits for 3.9. It was delayed a few days since my workstation kept crashing every 2-8h after pulling it into current -git, but turns out it is a bug in the new pstate code (divide by zero, will report separately). In any case, it contains: - The big cfq/blkcg update from Tejun and and Vivek. - Additional block and writeback tracepoints from Tejun. - Improvement of the should sort (based on queues) logic in the plug flushing. - _io() variants of the wait_for_completion() interface, using io_schedule() instead of schedule() to contribute to io wait properly. - Various little fixes. You'll get two trivial merge conflicts, which should be easy enough to fix up" Fix up the trivial conflicts due to hlist traversal cleanups (commit b67bfe0d42ca: "hlist: drop the node parameter from iterators"). * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits) block: remove redundant check to bd_openers() block: use i_size_write() in bd_set_size() cfq: fix lock imbalance with failed allocations drivers/block/swim3.c: fix null pointer dereference block: don't select PERCPU_RWSEM block: account iowait time when waiting for completion of IO request sched: add wait_for_completion_io[_timeout] writeback: add more tracepoints block: add block_{touch|dirty}_buffer tracepoint buffer: make touch_buffer() an exported function block: add @req to bio_{front|back}_merge tracepoints block: add missing block_bio_complete() tracepoint block: Remove should_sort judgement when flush blk_plug block,elevator: use new hashtable implementation cfq-iosched: add hierarchical cfq_group statistics cfq-iosched: collect stats from dead cfqgs cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock block: RCU free request_queue blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() ...
| * cfq: fix lock imbalance with failed allocationsGlauber Costa2013-02-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While stress-running very-small container scenarios with the Kernel Memory Controller, I've run into a lockdep-detected lock imbalance in cfq-iosched.c. I'll apologize beforehand for not posting a backlog: I didn't anticipate it would be so hard to reproduce, so I didn't save my serial output and went directly on debugging. Turns out that it did not happen again in more than 20 runs, making it a quite rare pattern. But here is my analysis: When we are in very low-memory situations, we will arrive at cfq_find_alloc_queue and may not find a queue, having to resort to the oom queue, in an rcu-locked condition: if (!cfqq || cfqq == &cfqd->oom_cfqq) [ ... ] Next, we will release the rcu lock, and try to allocate a queue, retrying if we succeed: rcu_read_unlock(); spin_unlock_irq(cfqd->queue->queue_lock); new_cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask | __GFP_ZERO, cfqd->queue->node); spin_lock_irq(cfqd->queue->queue_lock); if (new_cfqq) goto retry; We are unlocked at this point, but it should be fine, since we will reacquire the rcu_read_lock when we retry. Except of course, that we may not retry: the allocation may very well fail and we'll keep on going through the flow: The next branch is: if (cfqq) { [ ... ] } else cfqq = &cfqd->oom_cfqq; And right before exiting, we'll issue rcu_read_unlock(). Being already unlocked, this is the likely source of our imbalance. Since cfqq is either already NULL or made NULL in the first statement of the outter branch, the only viable alternative here seems to be to return the oom queue right away in case of allocation failure. Please review the following patch and apply if you agree with my analysis. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: don't select PERCPU_RWSEMMikulas Patocka2013-02-221-1/+0
| | | | | | | | | | | | | | | | | | | | The block device doesn't use percpu rw-semaphore anymore, so don't select it for compilation. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: account iowait time when waiting for completion of IO requestVladimir Davydov2013-02-153-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using wait_for_completion() for waiting for a IO request to be executed results in wrong iowait time accounting. For example, a system having the only task doing write() and fdatasync() on a block device can be reported being idle instead of iowaiting as it should because blkdev_issue_flush() calls wait_for_completion() which in turn calls schedule() that does not increment the iowait proc counter and thus does not turn on iowait time accounting. The patch makes block layer use wait_for_completion_io() instead of wait_for_completion() where appropriate to account iowait time correctly. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: add @req to bio_{front|back}_merge tracepointsTejun Heo2013-01-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | bio_{front|back}_merge tracepoints report a bio merging into an existing request but didn't specify which request the bio is being merged into. Add @req to it. This makes it impossible to share the event template with block_bio_queue - split it out. @req isn't used or exported to userland at this point and there is no userland visible behavior change. Later changes will make use of the extra parameter. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: add missing block_bio_complete() tracepointTejun Heo2013-01-141-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bio completion didn't kick block_bio_complete TP. Only dm was explicitly triggering the TP on IO completion. This makes block_bio_complete TP useless for tracers which want to know about bios, and all other bio based drivers skip generating blktrace completion events. This patch makes all bio completions via bio_endio() generate block_bio_complete TP. * Explicit trace_block_bio_complete() invocation removed from dm and the trace point is unexported. * @rq dropped from trace_block_bio_complete(). bios may fly around w/o queue associated. Verifying and accessing the assocaited queue belongs to TP probes. * blktrace now gets both request and bio completions. Make it ignore bio completions if request completion path is happening. This makes all bio based drivers generate blktrace completion events properly and makes the block_bio_complete TP actually useful. v2: With this change, block_bio_complete TP could be invoked on sg commands which have bio's with %NULL bi_bdev. Update TP assignment code to check whether bio->bi_bdev is %NULL before dereferencing. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Namhyung Kim <namhyung@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * Merge branch 'blkcg-cfq-hierarchy' of ↵Jens Axboe2013-01-114-163/+818
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.9/core Tejun writes: Hello, Jens. Please consider pulling from the following branch to receive cfq blkcg hierarchy support. The branch is based on top of v3.8-rc2. git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git blkcg-cfq-hierarchy The patchset was reviewd in the following thread. http://thread.gmane.org/gmane.linux.kernel.cgroups/5571
| | * cfq-iosched: add hierarchical cfq_group statisticsTejun Heo2013-01-091-0/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unfortunately, at this point, there's no way to make the existing statistics hierarchical without creating nasty surprises for the existing users. Just create recursive counterpart of the existing stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * cfq-iosched: collect stats from dead cfqgsTejun Heo2013-01-091-1/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To support hierarchical stats, it's necessary to remember stats from dead children. Add cfqg->dead_stats and make a dying cfqg transfer its stats to the parent's dead-stats. The transfer happens form ->pd_offline_fn() and it is possible that there are some residual IOs completing afterwards. Currently, we lose these stats. Given that cgroup removal isn't a very high frequency operation and the amount of residual IOs on offline are likely to be nil or small, this shouldn't be a big deal and the complexity needed to handle residual IOs - another callback and rather elaborate synchronization to reach and lock the matching q - doesn't seem justified. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()Tejun Heo2013-01-091-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Separate out cfqg_stats_reset() which takes struct cfqg_stats * from cfq_pd_reset_stats() and move the latter to where other pd methods are defined. cfqg_stats_reset() will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lockTejun Heo2013-01-091-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of holding blkcg->lock while walking ->blkg_list and executing prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while executing prfill(). This makes prfill() implementations easier as stats are mostly protected by queue lock. This will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * block: RCU free request_queueTejun Heo2013-01-091-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | RCU free request_queue so that blkcg_gq->q can be dereferenced under RCU lock. This will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()Tejun Heo2013-01-092-0/+142
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge(). The former two collect the [rw]stats designated by the target policy data and offset from the pd's subtree. The latter two add one [rw]stat to another. Note that the recursive sum functions require the queue lock to be held on entry to make blkg online test reliable. This is necessary to properly handle stats of a dying blkg. These will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * blkcg: export __blkg_prfill_rwstat()Tejun Heo2013-01-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat(). Export it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
| | * blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/Tejun Heo2013-01-092-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | Rename blkg_rwstat_sum() to blkg_rwstat_total(). sum will be used for summing up stats from multiple blkgs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->onlineTejun Heo2013-01-092-1/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(), which are invoked as the policy_data gets activated and deactivated while holding both blkcg and q locks. Also, add blkcg_gq->online bool, which is set and cleared as the blkcg_gq gets activated and deactivated. This flag also is toggled while holding both blkcg and q locks. These will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * blkcg: add blkg_policy_data->plidTejun Heo2013-01-092-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | Add pd->plid so that the policy a pd belongs to can be identified easily. This will be used to implement hierarchical blkg_[rw]stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * cfq-iosched: enable full blkcg hierarchy supportTejun Heo2013-01-091-15/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the previous two patches, all cfqg scheduling decisions are based on vfraction and ready for hierarchy support. The only thing which keeps the behavior flat is cfqg_flat_parent() which makes vfraction calculation consider all non-root cfqgs children of the root cfqg. Replace it with cfqg_parent() which returns the real parent. This enables full blkcg hierarchy support for cfq-iosched. For example, consider the following hierarchy. root / \ A:500 B:250 / \ AA:500 AB:1000 For simplicity, let's say all the leaf nodes have active tasks and are on service tree. For each leaf node, vfraction would be AA: (500 / 1500) * (500 / 750) =~ 0.2222 AB: (1000 / 1500) * (500 / 750) =~ 0.4444 B: (250 / 750) =~ 0.3333 and vdisktime will be distributed accordingly. For more detail, please refer to Documentation/block/cfq-iosched.txt. v2: cfq-iosched.txt updated to describe group scheduling as suggested by Vivek. v3: blkio-controller.txt updated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * cfq-iosched: convert cfq_group_slice() to use cfqg->vfractionTejun Heo2013-01-091-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cfq_group_slice() calculates slice by taking a fraction of cfq_target_latency according to the ratio of cfqg->weight against service_tree->total_weight. This currently works only because all cfqgs are treated to be at the same level. To prepare for proper hierarchy support, convert cfq_group_slice() to base the calculation on cfqg->vfraction. As cfqg->vfraction is always a fraction of 1 and represents the fraction allocated to the cfqg with hierarchy considered, the slice can be simply calculated by multiplying cfqg->vfraction to cfq_target_latency (with fixed point shift factored in). As vfraction calculation currently treats all non-root cfqgs as children of the root cfqg, this patch doesn't introduce noticeable behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * cfq-iosched: implement hierarchy-ready cfq_group charge scalingTejun Heo2013-01-091-30/+77
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, cfqg charges are scaled directly according to cfqg->weight. Regardless of the number of active cfqgs or the amount of active weights, a given weight value always scales charge the same way. This works fine as long as all cfqgs are treated equally regardless of their positions in the hierarchy, which is what cfq currently implements. It can't work in hierarchical settings because the interpretation of a given weight value depends on where the weight is located in the hierarchy. This patch reimplements cfqg charge scaling so that it can be used to support hierarchy properly. The scheme is fairly simple and light-weight. * When a cfqg is added to the service tree, v(disktime)weight is calculated. It walks up the tree to root calculating the fraction it has in the hierarchy. At each level, the fraction can be calculated as cfqg->weight / parent->level_weight By compounding these, the global fraction of vdisktime the cfqg has claim to - vfraction - can be determined. * When the cfqg needs to be charged, the charge is scaled inversely proportionally to the vfraction. The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point representation as before; however, the smallest scaling factor is now 1 (ie. 1 << CFQ_SERVICE_SHIFT). This is different from before where 1 was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller scaling factor. While this shifts the global scale of vdisktime a bit, it doesn't change the relative relationships among cfqgs and the scheduling result isn't different. cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending new cfqg to the service tree. The specific value of CFQ_IDLE_DELAY didn't have any relevance to vdisktime before and is unlikely to cause any visible behavior difference now especially as the scale shift isn't that large. As the new scheme now makes proper distinction between cfqg->weight and ->leaf_weight, reverse the weight aliasing for root cfqgs. For root, both weights are now mapped to ->leaf_weight instead of the other way around. Because we're still using cfqg_flat_parent(), this patch shouldn't change the scheduling behavior in any noticeable way. v2: Beefed up comments on vfraction as requested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * cfq-iosched: implement cfq_group->nr_active and ->children_weightTejun Heo2013-01-091-0/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To prepare for blkcg hierarchy support, add cfqg->nr_active and ->children_weight. cfqg->nr_active counts the number of active cfqgs at the cfqg's level and ->children_weight is sum of weights of those cfqgs. The level covers itself (cfqg->leaf_weight) and immediate children. The two values are updated when a cfqg enters and leaves the group service tree. Unless the hierarchy is very deep, the added overhead should be negligible. Currently, the parent is determined using cfqg_flat_parent() which makes the root cfqg the parent of all other cfqgs. This is to make the transition to hierarchy-aware scheduling gradual. Scheduling logic will be converted to use cfqg->children_weight without actually changing the behavior. When everything is ready, blkcg_weight_parent() will be replaced with proper parent function. This patch doesn't introduce any behavior chagne. v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * cfq-iosched: add leaf_weightTejun Heo2013-01-093-9/+130
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cfq blkcg is about to grow proper hierarchy handling, where a child blkg's weight would nest inside the parent's. This makes tasks in a blkg to compete against both tasks in the sibling blkgs and the tasks of child blkgs. We're gonna use the existing weight as the group weight which decides the blkg's weight against its siblings. This patch introduces a new weight - leaf_weight - which decides the weight of a blkg against the child blkgs. It's named leaf_weight because another way to look at it is that each internal blkg nodes have a hidden child leaf node which contains all its tasks and leaf_weight is the weight of the leaf node and handled the same as the weight of the child blkgs. This patch only adds leaf_weight fields and exposes it to userland. The new weight isn't actually used anywhere yet. Note that cfq-iosched currently offcially supports only single level hierarchy and root blkgs compete with the first level blkgs - ie. root weight is basically being used as leaf_weight. For root blkgs, the two weights are kept in sync for backward compatibility. v2: cfqd->root_group->leaf_weight initialization was missing from cfq_init_queue() causing divide by zero when !CONFIG_CFQ_GROUP_SCHED. Fix it. Reported by Fengguang. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@intel.com>
| | * blkcg: make blkcg_gq's hierarchicalTejun Heo2013-01-092-5/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently a child blkg (blkcg_gq) can be created even if its parent doesn't exist. ie. Given a blkg, it's not guaranteed that its ancestors will exist. This makes it difficult to implement proper hierarchy support for blkcg policies. Always create blkgs recursively and make a child blkg hold a reference to its parent. blkg->parent is added so that finding the parent is easy. blkcg_parent() is also added in the process. This change can be visible to userland. e.g. while issuing IO in a nested cgroup didn't affect the ancestors at all, now it will initialize all ancestor blkgs and zero stats for the request_queue will always appear on them. While this is userland visible, this shouldn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * blkcg: cosmetic updates to blkg_create()Tejun Heo2013-01-091-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Rename out_* labels to err_*. * Do ERR_PTR() conversion once in the error return path. This patch is cosmetic and to prepare for the hierarchy support. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * blkcg: reorganize blkg_lookup_create() and friendsTejun Heo2013-01-091-23/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reorganize such that * __blkg_lookup() takes bool param @update_hint to determine whether to update hint. * __blkg_lookup_create() no longer performs lookup before trying to create. Renamed to blkg_create(). * blkg_lookup_create() now performs lookup and then invokes blkg_create() if lookup fails. * root_blkg creation in blkcg_activate_policy() updated accordingly. Note that blkcg_activate_policy() no longer updates lookup hint if root_blkg already exists. Except for the last lookup hint bit which is immaterial, this is pure reorganization and doesn't introduce any visible behavior change. This is to prepare for proper hierarchy support. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * blkcg: fix minor bug in blkg_alloc()Tejun Heo2013-01-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice. The latter test should have been on whether pol->pd_init_fn() exists. This doesn't cause actual problems because both blkcg policies implement pol->pd_init_fn(). Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
| | * cfq-iosched: Print sync-noidle information in blktrace messagesVivek Goyal2013-01-091-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we attach a character "S" or "A" to the cfqq<pid>, to represent whether queues is sync or async. Add one more character "N" to represent whether it is sync-noidle queue or sync queue. So now three different type of queues will look as follows. cfq1234S --> sync queus cfq1234SN --> sync noidle queue cfq1234A --> Async queue Previously S/A classification was being printed only if group scheduling was enabled. This patch also makes sure that this classification is displayed even if group idling is disabled. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| | * cfq-iosched: Get rid of unnecessary local variableVivek Goyal2013-01-091-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use of local varibale "n" seems to be unnecessary. Remove it. This brings it inline with function __cfq_group_st_add(), which is also doing the similar operation of adding a group to a rb tree. No functionality change here. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| | * cfq-iosched: Rename few functions related to selecting workloadVivek Goyal2013-01-091-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | choose_service_tree() selects/sets both wl_class and wl_type. Rename it to choose_wl_class_and_type() to make it very clear. cfq_choose_wl() only selects and sets wl_type. It is easy to confuse it with choose_st(). So rename it to cfq_choose_wl_type() to make it clear what does it do. Just renaming. No functionality change. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| | * cfq-iosched: Rename "service_tree" to "st" at some placesVivek Goyal2013-01-091-41/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At quite a few places we use the keyword "service_tree". At some places, especially local variables, I have abbreviated it to "st". Also at couple of places moved binary operator "+" from beginning of line to end of previous line, as per Tejun's feedback. v2: Reverted most of the service tree name change based on Jeff Moyer's feedback. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| | * cfq-iosched: More renaming to better represent wl_class and wl_typeVivek Goyal2013-01-091-31/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some more renaming. Again making the code uniform w.r.t use of wl_class/class to represent IO class (RT, BE, IDLE) and using wl_type/type to represent subclass (SYNC, SYNC-IDLE, ASYNC). At places this patch shortens the string "workload" to "wl". Renamed "saved_workload" to "saved_wl_type". Renamed "saved_serving_class" to "saved_wl_class". For uniformity with "saved_wl_*" variables, renamed "serving_class" to "serving_wl_class" and renamed "serving_type" to "serving_wl_type". Again, just trying to improve upon code uniformity and improve readability. No functional change. v2: - Restored the usage of keyword "service" based on Jeff Moyer's feedback. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| | * cfq-iosched: Properly name all references to IO classVivek Goyal2013-01-091-33/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently CFQ has three IO classes, RT, BE and IDLE. At many a places we are calling workloads belonging to these classes as "prio". This gets very confusing as one starts to associate it with ioprio. So this patch just does bunch of renaming so that reading code becomes easier. All reference to RT, BE and IDLE workload are done using keyword "class" and all references to subclass, SYNC, SYNC-IDLE, ASYNC are made using keyword "type". This makes me feel much better while I am reading the code. There is no functionality change due to this patch. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | block: Remove should_sort judgement when flush blk_plugJianpeng Ma2013-01-111-12/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 975927b942c932,it add blk_rq_pos to sort rq when flushing. Although this commit was used for the situation which blk_plug handled multi devices on the same time like md device. I think there must be some situations like this but only single device. So remove the should_sort judgement. Because the parameter should_sort is only for this purpose,it can delete should_sort from blk_plug. CC: Shaohua Li <shli@kernel.org> Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block,elevator: use new hashtable implementationSasha Levin2013-01-112-20/+5
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Switch elevator to use the new hashtable implementation. This reduces the amount of generic unrelated code in the elevator. This also removes the dymanic allocation of the hash table. The size of the table is constant so there's no point in paying the price of an extra dereference when accessing it. This patch depends on d9b482c ("hashtable: introduce a small and naive hashtable") which was merged in v3.6. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | hlist: drop the node parameter from iteratorsSasha Levin2013-02-275-12/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | block/partitions: optimize memory allocation in check_partition()Ming Lei2013-02-273-8/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, sizeof(struct parsed_partitions) may be 64KB in 32bit arch, so it is easy to trigger page allocation failure by check_partition, especially in hotplug block device situation(such as, USB mass storage, MMC card, ...), and Felipe Balbi has observed the failure. This patch does below optimizations on the allocation of struct parsed_partitions to try to address the issue: - make parsed_partitions.parts as pointer so that the pointed memory can fit in 32KB buffer, then approximate 32KB memory can be saved - vmalloc the buffer pointed by parsed_partitions.parts because 32KB is still a bit big for kmalloc - given that many devices have the partition count limit, so only allocate disk_max_parts() partitions instead of 256 partitions always Signed-off-by: Ming Lei <ming.lei@canonical.com> Reported-by: Felipe Balbi <balbi@ti.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | block/partitions/mac.c: obey the state->limit constraintMing Lei2013-02-271-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | It isn't necessary to read the information of partitions whose number is equal and more than state->limit since only maximum state->limit partitions will be added inside rescan_partitions(). That is also what other kind of partitions are doing. Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | block/partitions/efi.c: ensure that the GPT header is at least the size of ↵Peter Jones2013-02-271-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the structure. UEFI 2.3.1D will include a change to the spec language mandating that a GPT header must be greater than *or equal to* the size of the defined structure. While verifying that this would work on Linux, I discovered that we're not actually checking the minimum bound at all. The result of this is that when we verify the checksum, it's possible that on a malformed header (with header_size of 0), we won't actually verify any data. [akpm@linux-foundation.org: fix printk warning] Signed-off-by: Peter Jones <pjones@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Stephen Warren <swarren@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | block/partition/msdos: detect AIX formatted disks even without 55aaPhilippe De Muyter2013-02-271-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AIX formatted disks do not always have the MSDOS 55aa signature. This happens e.g. for unbootable AIX disks. Up to now, such disks were not recognized as AIX disks, because of the missing 55aa. Fix that by inverting the two tests. Let's first check for the AIX magic strings, and only if that fails check for the MSDOS magic word. Signed-off-by: Philippe De Muyter <phdm@macqel.be> Cc: Andreas Mohr <andi@lisas.de> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Jens Axboe <axboe@kernel.dk> Cc: Olaf Hering <olh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | block: convert to idr_alloc()Tejun Heo2013-02-272-32/+15
| | | | | | | | | | | | | | | | | | | | Convert to the much saner new idr interface. Both bsg and genhd protect idr w/ mutex making preloading unnecessary. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | block: fix synchronization and limit check in blk_alloc_devt()Tejun Heo2013-02-271-8/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | idr allocation in blk_alloc_devt() wasn't synchronized against lookup and removal, and its limit check was off by one - 1 << MINORBITS is the number of minors allowed, not the maximum allowed minor. Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit checking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | block: fix ext_devt_idr handlingTomas Henzl2013-02-272-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While adding and removing a lot of disks disks and partitions this sometimes shows up: WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted) Hardware name: sysfs: cannot create duplicate filename '/dev/block/259:751' Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan] Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1 Call Trace: warn_slowpath_common+0x87/0xc0 warn_slowpath_fmt+0x46/0x50 sysfs_add_one+0xc9/0x130 sysfs_do_create_link+0x12b/0x170 sysfs_create_link+0x13/0x20 device_add+0x317/0x650 idr_get_new+0x13/0x50 add_partition+0x21c/0x390 rescan_partitions+0x32b/0x470 sd_open+0x81/0x1f0 [sd_mod] __blkdev_get+0x1b6/0x3c0 blkdev_get+0x10/0x20 register_disk+0x155/0x170 add_disk+0xa6/0x160 sd_probe_async+0x13b/0x210 [sd_mod] add_wait_queue+0x46/0x60 async_thread+0x102/0x250 default_wake_function+0x0/0x20 async_thread+0x0/0x250 kthread+0x96/0xa0 child_rip+0xa/0x20 kthread+0x0/0xa0 child_rip+0x0/0x20 This most likely happens because dev_t is freed while the number is still used and idr_get_new() is not protected on every use. The fix adds a mutex where it wasn't before and moves the dev_t free function so it is called after device del. Signed-off-by: Tomas Henzl <thenzl@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | block/genhd.c: apply pm_runtime_set_memalloc_noio on block devicesMing Lei2013-02-231-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Apply the introduced pm_runtime_set_memalloc_noio on block device so that PM core will teach mm to not allocate memory with GFP_IOFS when calling the runtime_resume and runtime_suspend callback for block devices and its ancestors. Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | block: optionally snapshot page contents to provide stable pages during writeDarrick J. Wong2013-02-211-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This provides a band-aid to provide stable page writes on jbd without needing to backport the fixed locking and page writeback bit handling schemes of jbd2. The band-aid works by using bounce buffers to snapshot page contents instead of waiting. For those wondering about the ext3 bandage -- fixing the jbd locking (which was done as part of ext4dev years ago) is a lot of surgery, and setting PG_writeback on data pages when we actually hold the page lock dropped ext3 performance by nearly an order of magnitude. If we're going to migrate iscsi and raid to use stable page writes, the complaints about high latency will likely return. We might as well centralize their page snapshotting thing to one place. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | bdi: allow block devices to say that they require stable page writesDarrick J. Wong2013-02-211-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patchset ("stable page writes, part 2") makes some key modifications to the original 'stable page writes' patchset. First, it provides creators (devices and filesystems) of a backing_dev_info a flag that declares whether or not it is necessary to ensure that page contents cannot change during writeout. It is no longer assumed that this is true of all devices (which was never true anyway). Second, the flag is used to relaxed the wait_on_page_writeback calls so that wait only occurs if the device needs it. Third, it fixes up the remaining disk-backed filesystems to use this improved conditional-wait logic to provide stable page writes on those filesystems. It is hoped that (for people not using checksumming devices, anyway) this patchset will give back unnecessary performance decreases since the original stable page write patchset went into 3.0. Sorry about not fixing it sooner. Complaints were registered by several people about the long write latencies introduced by the original stable page write patchset. Generally speaking, the kernel ought to allocate as little extra memory as possible to facilitate writeout, but for people who simply cannot wait, a second page stability strategy is (re)introduced: snapshotting page contents. The waiting behavior is still the default strategy; to enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be set. This flag is used to bandaid^Henable stable page writeback on ext3[1], and is not used anywhere else. Given that there are already a few storage devices and network FSes that have rolled their own page stability wait/page snapshot code, it would be nice to move towards consolidating all of these. It seems possible that iscsi and raid5 may wish to use the new stable page write support to enable zero-copy writeout. Thank you to Jan Kara for helping fix a couple more filesystems. Per Andrew Morton's request, here are the result of using dbench to measure latencies on ext2: 3.8.0-rc3: Operation Count AvgLat MaxLat ---------------------------------------- WriteX 109347 0.028 59.817 ReadX 347180 0.004 3.391 Flush 15514 29.828 287.283 Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms 3.8.0-rc3 + patches: WriteX 105556 0.029 4.273 ReadX 335004 0.005 4.112 Flush 14982 30.540 298.634 Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms As you can see, for ext2 the maximum write latency decreases from ~60ms on a laptop hard disk to ~4ms. I'm not sure why the flush latencies increase, though I suspect that being able to dirty pages faster gives the flusher more work to do. On ext4, the average write latency decreases as well as all the maximum latencies: 3.8.0-rc3: WriteX 85624 0.152 33.078 ReadX 272090 0.010 61.210 Flush 12129 36.219 168.260 Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms 3.8.0-rc3 + patches: WriteX 86082 0.141 30.928 ReadX 273358 0.010 36.124 Flush 12214 34.800 165.689 Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms XFS seems to exhibit similar latency improvements as ext2: 3.8.0-rc3: WriteX 125739 0.028 104.343 ReadX 399070 0.005 4.115 Flush 17851 25.004 131.390 Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms 3.8.0-rc3 + patches: WriteX 123529 0.028 6.299 ReadX 392434 0.005 4.287 Flush 17549 25.120 188.687 Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms ...and btrfs, just to round things out, also shows some latency decreases: 3.8.0-rc3: WriteX 67122 0.083 82.355 ReadX 212719 0.005 2.828 Flush 9547 47.561 147.418 Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms 3.8.0-rc3 + patches: WriteX 64898 0.101 71.631 ReadX 206673 0.005 7.123 Flush 9190 47.963 219.034 Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms Before this patchset, all filesystems would block, regardless of whether or not it was necessary. ext3 would wait, but still generate occasional checksum errors. The network filesystems were left to do their own thing, so they'd wait too. After this patchset, all the disk filesystems except ext3 and btrfs will wait only if the hardware requires it. ext3 (if necessary) snapshots pages instead of blocking, and btrfs provides its own bdi so the mm will never wait. Network filesystems haven't been touched, so either they provide their own wait code, or they don't block at all. The blocking behavior is back to what it was before 3.0 if you don't have a disk requiring stable page writes. This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same results as -rc3. [1] The alternative fixes to ext3 include fixing the locking order and page bit handling like we did for ext4 (but then why not just use ext4?), or setting PG_writeback so early that ext3 becomes extremely slow. I tried that, but the number of write()s I could initiate dropped by nearly an order of magnitude. That was a bit much even for the author of the stable page series! :) This patch: Creates a per-backing-device flag that tracks whether or not pages must be held immutable during writeout. Eventually it will be used to waive wait_for_page_writeback() if nothing requires stable pages. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud