summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULLJens Axboe2015-06-201-1/+11
| | | | | | | | | Commit 9470e4a693db only covered the initial bug report, there are other spots in CFQ where we need to check that we actually have a valid cfq_group_data structure. Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data") Signed-off-by: Jens Axboe <axboe@fb.com>
* cfq-iosched: fix sysfs oops when attempting to read unconfigured weightsJens Axboe2015-06-191-2/+12
| | | | | | | | | | | | | | If none of the devices in the system are using CFQ, then attempting to read: /sys/fs/cgroup/blkio/blkio.leaf_weight will results in a NULL dereference. Check for a valid cfq_group_data struct before attempting to dereference it. Reported-by: Andrey Wagin <avagin@gmail.com> Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data") Signed-off-by: Jens Axboe <axboe@fb.com>
* cfq-iosched: move group scheduling functions under ifdefJens Axboe2015-06-191-16/+16
| | | | | | | | | | | | | | | | | | If CFQ_GROUP_IOSCHED is not set, the compiler produces the following warning: CC block/cfq-iosched.o linux/block/cfq-iosched.c:469:2: warning: 'cpd_to_cfqgd' defined but not used [-Wunused-function] *cpd_to_cfqgd(struct blkcg_policy_data *cpd) ^ In reality, two other lookup functions aren't used either if CFQ_GROUP_IOSCHED isn't set. Move all three under one of the CFQ_GROUP_IOSCHED sections in the code. Reported-by: Vladimir Zapolskiy <vz@mleia.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* cfq-iosched: fix the setting of IOPS mode on SSDsJens Axboe2015-06-103-1/+18
| | | | | | | | | | | | | | | | A previous commit wanted to make CFQ default to IOPS mode on non-rotational storage, however it did so when the queue was initialized and the non-rotational flag is only set later on in the probe. Add an elevator hook that gets called off the add_disk() path, at that point we know that feature probing has finished, and we can reliably check for the various flags that drivers can set. Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs") Tested-by: Romain Francoise <romain@orebokech.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS fileSteven Rostedt2015-06-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | blktrace.c is currently maintained by Jens Axboe as it is used for debugging the block layer. That file should be added to the MAINTAINERS file under BLOCK LAYER, such that people will know to Cc the BLOCK LAYER maintainer when modifying that file. Currently: $ scripts/get_maintainer.pl -f kernel/trace/blktrace.c Steven Rostedt <rostedt@goodmis.org> (maintainer:TRACING) Ingo Molnar <mingo@redhat.com> (maintainer:TRACING) linux-kernel@vger.kernel.org (open list) After the patch: $ scripts/get_maintainer.pl -f kernel/trace/blktrace.c Jens Axboe <axboe@kernel.dk> (maintainer:BLOCK LAYER) Steven Rostedt <rostedt@goodmis.org> (maintainer:TRACING) Ingo Molnar <mingo@redhat.com> (maintainer:TRACING) linux-kernel@vger.kernel.org (open list) Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* block, cgroup: implement policy-specific per-blkcg dataArianna Avanzini2015-06-073-29/+173
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The block IO (blkio) controller enables the block layer to provide service guarantees in a hierarchical fashion. Specifically, service guarantees are provided by registered request-accounting policies. As of now, a proportional-share and a throttling policy are available. They are implemented, respectively, by the CFQ I/O scheduler and the blk-throttle subsystem. Unfortunately, as for adding new policies, the current implementation of the block IO controller is only halfway ready to allow new policies to be plugged in. This commit provides a solution to make the block IO controller fully ready to handle new policies. In what follows, we first describe briefly the current state, and then list the changes made by this commit. The throttling policy does not need any per-cgroup information to perform its task. In contrast, the proportional share policy uses, for each cgroup, both the weight assigned by the user to the cgroup, and a set of dynamically- computed weights, one for each device. The first, user-defined weight is stored in the blkcg data structure: the block IO controller allocates a private blkcg data structure for each cgroup in the blkio cgroups hierarchy (regardless of which policy is active). In other words, the block IO controller internally mirrors the blkio cgroups with private blkcg data structures. On the other hand, for each cgroup and device, the corresponding dynamically- computed weight is maintained in the following, different way. For each device, the block IO controller keeps a private blkcg_gq structure for each cgroup in blkio. In other words, block IO also keeps one private mirror copy of the blkio cgroups hierarchy for each device, made of blkcg_gq structures. Each blkcg_gq structure keeps per-policy information in a generic array of dynamically-allocated 'dedicated' data structures, one for each registered policy (so currently the array contains two elements). To be inserted into the generic array, each dedicated data structure embeds a generic blkg_policy_data structure. Consider now the array contained in the blkcg_gq structure corresponding to a given pair of cgroup and device: one of the elements of the array contains the dedicated data structure for the proportional-share policy, and this dedicated data structure contains the dynamically-computed weight for that pair of cgroup and device. The generic strategy adopted for storing per-policy data in blkcg_gq structures is already capable of handling new policies, whereas the one adopted with blkcg structures is not, because per-policy data are hard-coded in the blkcg structures themselves (currently only data related to the proportional- share policy). This commit addresses the above issues through the following changes: . It generalizes blkcg structures so that per-policy data are stored in the same way as in blkcg_gq structures. Specifically, it lets also the blkcg structure store per-policy data in a generic array of dynamically-allocated dedicated data structures. We will refer to these data structures as blkcg dedicated data structures, to distinguish them from the dedicated data structures inserted in the generic arrays kept by blkcg_gq structures. To allow blkcg dedicated data structures to be inserted in the generic array inside a blkcg structure, this commit also introduces a new blkcg_policy_data structure, which is the equivalent of blkg_policy_data for blkcg dedicated data structures. . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a cpd_size field and a cpd_init field, to be initialized by the policy with, respectively, the size of the blkcg dedicated data structures, and the address of a constructor function for blkcg dedicated data structures. . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e., the fields related to the proportional-share policy), into a new blkcg dedicated data structure called cfq_group_data. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Make CFQ default to IOPS mode on SSDsTahsin Erdogan2015-06-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CFQ idling causes reduced IOPS throughput on non-rotational disks. Since disk head seeking is not applicable to SSDs, it doesn't really help performance by anticipating future near-by IO requests. By turning off idling (and switching to IOPS mode), we allow other processes to dispatch IO requests down to the driver and so increase IO throughput. Following FIO benchmark results were taken on a cloud SSD offering with idling on and off: Idling iops avg-lat(ms) stddev bw ------------------------------------------------------ On 7054 90.107 38.697 28217KB/s Off 29255 21.836 11.730 117022KB/s fio --name=temp --size=100G --time_based --ioengine=libaio \ --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \ --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \ --filename=/dev/sdb --runtime=10 --iodepth=64 --numjobs=10 And the following is from a local SSD run: Idling iops avg-lat(ms) stddev bw ------------------------------------------------------ On 19320 33.043 14.068 77281KB/s Off 21626 29.465 12.662 86507KB/s fio --name=temp --size=5G --time_based --ioengine=libaio \ --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \ --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \ --filename=/fio_data --runtime=10 --iodepth=64 --numjobs=10 Reviewed-by: Nauman Rafique <nauman@google.com> Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: add blk_set_queue_dying() to blkdev.hJens Axboe2015-06-051-0/+1
| | | | | | | We export this function and NVMe wants to use it, but for some reason it was never added to the block header. Do that. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: Shared tag enhancementsKeith Busch2015-06-014-2/+53
| | | | | | | | | | | | Storage controllers may expose multiple block devices that share hardware resources managed by blk-mq. This patch enhances the shared tags so a low-level driver can access the shared resources not tied to the unshared h/w contexts. This way the LLD can dynamically add and delete disks and request queues without having to track all the request_queue hctx's to iterate outstanding tags. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: don't honor chunk sizes for data-less IOJens Axboe2015-05-291-1/+1
| | | | | | | We don't need to honor chunk sizes for IO that doesn't carry any data. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: only honor SG gap prevention for merges that contain dataJens Axboe2015-05-291-1/+2
| | | | | | | | We can safely merge anything that wont generate an SG list entry, so if the bio is data-less (discard), don't look at potential SG gaps. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: fix returnvar.cocci warningsJulia Lawall2015-05-261-2/+1
| | | | | | | | | | Remove unneeded variable used to store return value. Generated by: scripts/coccinelle/misc/returnvar.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Jens Axboe <axboe@fb.com>
* block, dm: don't copy bios for request clonesChristoph Hellwig2015-05-226-230/+73
| | | | | | | | | | | | | | | | | | | Currently dm-multipath has to clone the bios for every request sent to the lower devices, which wastes cpu cycles and ties down memory. This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio to not complete bios attached to a request, which we set on clone requests similar to bios in a flush sequence. With this change I/O errors on a path failure only get propagated to dm-multipath, which can then either resubmit the I/O or complete the bios on the original request. I've done some basic testing of this on a Linux target with ALUA support, and it survives path failures during I/O nicely. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: remove management of bi_remaining when restoring original bi_end_ioMike Snitzer2015-05-2212-66/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") regressed all existing callers that followed this pattern: 1) saving a bio's original bi_end_io 2) wiring up an intermediate bi_end_io 3) restoring the original bi_end_io from intermediate bi_end_io 4) calling bio_endio() to execute the restored original bi_end_io The regression was due to BIO_CHAIN only ever getting set if bio_inc_remaining() is called. For the above pattern it isn't set until step 3 above (step 2 would've needed to establish BIO_CHAIN). As such the first bio_endio(), in step 2 above, never decremented __bi_remaining before calling the intermediate bi_end_io -- leaving __bi_remaining with the value 1 instead of 0. When bio_inc_remaining() occurred during step 3 it brought it to a value of 2. When the second bio_endio() was called, in step 4 above, it should've called the original bi_end_io but it didn't because there was an extra reference that wasn't dropped (due to atomic operations being optimized away since BIO_CHAIN wasn't set upfront). Fix this issue by removing the __bi_remaining management complexity for all callers that use the above pattern -- bio_chain() is the only interface that _needs_ to be concerned with __bi_remaining. For the above pattern callers just expect the bi_end_io they set to get called! Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls that aren't associated with the bio_chain() interface. Also, the bio_inc_remaining() interface has been moved local to bio.c. Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: replace trylock with mutex_lock in blkdev_reread_part()Ming Lei2015-05-201-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only possible problem of using mutex_lock() instead of trylock is about deadlock. If there aren't any locks held before calling blkdev_reread_part(), deadlock can't be caused by this conversion. If there are locks held before calling blkdev_reread_part(), and if these locks arn't required in open, close handler and I/O path, deadlock shouldn't be caused too. Both user space's ioctl(BLKRRPART) and md_setup_drive() from init/do_mounts_md.c belongs to the 1st case, so the conversion is safe for the two cases. For loop, the previous patches in this pathset has fixed the ABBA lock dependency, so the conversion is OK. For nbd, tx_lock is held when calling the function: - both open and release won't hold the lock - when blkdev_reread_part() is run, I/O thread has been stopped already, so tx_lock won't be acquired in I/O path at that time. - so the conversion won't cause deadlock for nbd For dasd, both dasd_open(), dasd_release() and request function don't acquire any mutex/semphone, so the conversion should be safe. Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: export blkdev_reread_part() and __blkdev_reread_part()Jarod Wilson2015-05-202-3/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch exports blkdev_reread_part() for block drivers, also introduce __blkdev_reread_part(). For some drivers, such as loop, reread of partitions can be run from the release path, and bd_mutex may already be held prior to calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce __blkdev_reread_part for use in such cases. CC: Christoph Hellwig <hch@lst.de> CC: Jens Axboe <axboe@kernel.dk> CC: Tejun Heo <tj@kernel.org> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Markus Pargmann <mpa@pengutronix.de> CC: Stefan Weinhuber <wein@de.ibm.com> CC: Stefan Haberland <stefan.haberland@de.ibm.com> CC: Sebastian Ott <sebott@linux.vnet.ibm.com> CC: Fabian Frederick <fabf@skynet.be> CC: Ming Lei <ming.lei@canonical.com> CC: David Herrmann <dh.herrmann@gmail.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Peter Zijlstra <peterz@infradead.org> CC: nbd-general@lists.sourceforge.net CC: linux-s390@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* suspend: simplify block I/O handlingChristoph Hellwig2015-05-196-155/+122
| | | | | | | | | | | | | | | Stop abusing struct page functionality and the swap end_io handler, and instead add a modified version of the blk-lib.c bio_batch helpers. Also move the block I/O code into swap.c as they are directly tied into each other. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Pavel Machek <pavel@ucw.cz> Tested-by: Ming Lin <mlin@kernel.org> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: collapse bio bit spaceJens Axboe2015-05-191-9/+9
| | | | | | | | Various previous patches removed bits and left holes, collapse them all. Leave the reset start bit where it is, we don't need to change that. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: remove unused BIO_RW_BLOCK and BIO_EOF flagsChristoph Hellwig2015-05-192-4/+0
| | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: remove BIO_EOPNOTSUPPChristoph Hellwig2015-05-197-38/+2
| | | | | | | | | | | Since the big barrier rewrite/removal in 2007 we never fail FLUSH or FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag to help propagating those to the buffer_head layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: use an atomic_t for mq_freeze_depthChristoph Hellwig2015-05-192-15/+11
| | | | | | | | | lockdep gets unhappy about the not disabling irqs when using the queue_lock around it. Instead of trying to fix that up just switch to an atomic_t and get rid of the lock. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: make plug work for mutiple disks and queuesShaohua Li2015-05-083-9/+23
| | | | | | | | | | | | | | | | Last patch makes plug work for multiple queue case. However it only works for single disk case, because it assumes only one request in the plug list. If a task is accessing multiple disks, eg MD/DM, the assumption is wrong. Let blk_attempt_plug_merge() record request from the same queue. V2: use NULL parameter in !mq case. Fix a bug. Add comments in blk_attempt_plug_merge to make it less (hopefully) confusion. Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: do limited block plug for multiple queue caseShaohua Li2015-05-081-23/+59
| | | | | | | | | | | | | | | | plug is still helpful for workload with IO merge, but it can be harmful otherwise especially with multiple hardware queues, as there is (supposed) no lock contention in this case and plug can introduce latency. For multiple queues, we do limited plug, eg plug only if there is request merge. If a request doesn't have merge with following request, the requet will be dispatched immediately. V2: check blk_queue_nomerges() as suggested by Jeff. Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: avoid re-initialize request which is failed in direct dispatchShaohua Li2015-05-081-0/+2
| | | | | | | | | | If we directly issue a request and it fails, we use blk_mq_merge_queue_io(). But we already assigned bio to a request in blk_mq_bio_to_request. blk_mq_merge_queue_io shouldn't run blk_mq_bio_to_request again. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: fix plugging in blk_sq_make_requestJeff Moyer2015-05-081-22/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following appears in blk_sq_make_request: /* * If we have multiple hardware queues, just go directly to * one of those for sync IO. */ We clearly don't have multiple hardware queues, here! This comment was introduced with this commit 07068d5b8e (blk-mq: split make request handler for multi and single queue): We want slightly different behavior from them: - On single queue devices, we currently use the per-process plug for deferred IO and for merging. - On multi queue devices, we don't use the per-process plug, but we want to go straight to hardware for SYNC IO. The old code had this: use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync); and that was converted to: use_plug = !is_flush_fua && !is_sync; which is not equivalent. For the single queue case, that second half of the && expression is always true. So, what I think was actually inteded follows (and this more closely matches what is done in blk_queue_bio). V2: delete the 'likely', which should not be a big deal Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* sched: always use blk_schedule_flush_plug in io_schedule_outShaohua Li2015-05-081-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | block plug callback could sleep, so we introduce a parameter 'from_schedule' and corresponding drivers can use it to destinguish a schedule plug flush or a plug finish. Unfortunately io_schedule_out still uses blk_flush_plug(). This causes below output (Note, I added a might_sleep() in raid1_unplug to make it trigger faster, but the whole thing doesn't matter if I add might_sleep). In raid1/10, this can cause deadlock. This patch makes io_schedule_out always uses blk_schedule_flush_plug. This should only impact drivers (as far as I know, raid 1/10) which are sensitive to the 'from_schedule' parameter. [ 370.817949] ------------[ cut here ]------------ [ 370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90() [ 370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90 [ 370.817971] Modules linked in: raid1 [ 370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G W 4.0.0+ #361 [ 370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014 [ 370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1) [ 370.817985] ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001 [ 370.817988] ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8 [ 370.817990] ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28 [ 370.817993] Call Trace: [ 370.817999] [<ffffffff819dd7af>] dump_stack+0x4f/0x7b [ 370.818002] [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0 [ 370.818004] [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50 [ 370.818006] [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90 [ 370.818008] [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90 [ 370.818010] [<ffffffff810776ef>] __might_sleep+0x7f/0x90 [ 370.818014] [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1] [ 370.818024] [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0 [ 370.818028] [<ffffffff819e3550>] ? bit_wait+0x50/0x50 [ 370.818031] [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140 [ 370.818033] [<ffffffff819e3586>] bit_wait_io+0x36/0x50 [ 370.818034] [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90 [ 370.818041] [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630 [ 370.818043] [<ffffffff819e3550>] ? bit_wait+0x50/0x50 [ 370.818045] [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80 [ 370.818047] [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40 [ 370.818050] [<ffffffff811de744>] __wait_on_buffer+0x44/0x50 [ 370.818053] [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0 [ 370.818058] [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790 [ 370.818062] [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50 [ 370.818064] [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200 [ 370.818066] [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360 [ 370.818068] [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0 [ 370.818070] [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0 [ 370.818072] [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460 [ 370.818074] [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0 [ 370.818076] [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620 [ 370.818079] [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0 [ 370.818081] [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290 [ 370.818085] [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0 [ 370.818088] [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50 [ 370.818094] [<ffffffff81149691>] do_writepages+0x21/0x50 [ 370.818097] [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490 [ 370.818099] [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590 [ 370.818103] [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50 [ 370.818105] [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50 [ 370.818107] [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0 [ 370.818109] [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0 [ 370.818111] [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550 [ 370.818116] [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570 [ 370.818117] [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570 [ 370.818119] [<ffffffff8106c09b>] worker_thread+0x11b/0x470 [ 370.818121] [<ffffffff8106bf80>] ? process_one_work+0x570/0x570 [ 370.818124] [<ffffffff81071868>] kthread+0xf8/0x110 [ 370.818126] [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210 [ 370.818129] [<ffffffff819e9322>] ret_from_fork+0x42/0x70 [ 370.818131] [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210 [ 370.818132] ---[ end trace 7b4deb71e68b6605 ]--- V2: don't change ->in_iowait Cc: NeilBrown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk: clean up plugShaohua Li2015-05-081-12/+12
| | | | | | | | | | | | | Current code looks like inner plug gets flushed with a blk_finish_plug(). Actually it's a nop. All requests/callbacks are added to current->plug, while only outmost plug is assigned to current->plug. So inner plug always has empty request/callback list, which makes blk_flush_plug_list() a nop. This tries to make the code more clear. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* nbd: stop using req->cmdChristoph Hellwig2015-05-052-27/+23
| | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: move PM request support to IDEChristoph Hellwig2015-05-059-51/+70
| | | | | | | | | This removes the request types and hacks from the block code and into the old IDE driver. There is a small amunt of code duplication due to this, but it's not too bad. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: remove REQ_TYPE_PM_SHUTDOWNChristoph Hellwig2015-05-051-1/+0
| | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: move REQ_TYPE_SENSE to the ide driverChristoph Hellwig2015-05-056-10/+10
| | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: move REQ_TYPE_ATA_TASKFILE and REQ_TYPE_ATA_PC to ide.hChristoph Hellwig2015-05-052-9/+9
| | | | | | | | | | These values are only used by the IDE driver, so move them into it by allowing drivers to take cmd_type values after the first private one. Note that we have to turn cmd_type into a plain unsigned integer so that gcc doesn't complain about mismatching enum types. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: rename REQ_TYPE_SPECIAL to REQ_TYPE_DRV_PRIVChristoph Hellwig2015-05-0515-26/+26
| | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* bio: skip atomic inc/dec of ->bi_cnt for most use casesJens Axboe2015-05-056-12/+30
| | | | | | | | | | | | | | | Struct bio has a reference count that controls when it can be freed. Most uses cases is allocating the bio, which then returns with a single reference to it, doing IO, and then dropping that single reference. We can remove this atomic_dec_and_test() in the completion path, if nobody else is holding a reference to the bio. If someone does call bio_get() on the bio, then we flag the bio as now having valid count and that we must properly honor the reference count when it's being put. Tested-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* bio: skip atomic inc/dec of ->bi_remaining for non-chainsJens Axboe2015-05-057-15/+47
| | | | | | | | | | | | | | | | | | | | Struct bio has an atomic ref count for chained bio's, and we use this to know when to end IO on the bio. However, most bio's are not chained, so we don't need to always introduce this atomic operation as part of ending IO. Add a helper to elevate the bi_remaining count, and flag the bio as now actually needing the decrement at end_io time. Rename the field to __bi_remaining to catch any current users of this doing the incrementing manually. For high IOPS workloads, this reduces the overhead of bio_endio() substantially. Tested-by: Robert Elliott <elliott@hp.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds2015-05-055-11/+32
|\ | | | | | | | | | | | | | | | | | | Pull crypto fixes from Herbert Xu: "This fixes a build problem with bcm63xx and yet another fix to the memzero_explicit function to ensure that the memset is not elided" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: hwrng: bcm63xx - Fix driver compilation lib: make memzero_explicit more robust against dead store elimination
| * hwrng: bcm63xx - Fix driver compilationÁlvaro Fernández Rojas2015-05-041-9/+9
| | | | | | | | | | | | | | | | | | | | | | - s/clk_didsable_unprepare/clk_disable_unprepare - s/prov/priv - s/error/ret (bcm63xx_rng_probe) Fixes: 6229c16060fe ("hwrng: bcm63xx - make use of devm_hwrng_register") Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
| * lib: make memzero_explicit more robust against dead store eliminationDaniel Borkmann2015-05-044-2/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 0b053c951829 ("lib: memzero_explicit: use barrier instead of OPTIMIZER_HIDE_VAR"), we made memzero_explicit() more robust in case LTO would decide to inline memzero_explicit() and eventually find out it could be elimiated as dead store. While using barrier() works well for the case of gcc, recent efforts from LLVMLinux people suggest to use llvm as an alternative to gcc, and there, Stephan found in a simple stand-alone user space example that llvm could nevertheless optimize and thus elimitate the memset(). A similar issue has been observed in the referenced llvm bug report, which is regarded as not-a-bug. Based on some experiments, icc is a bit special on its own, while it doesn't seem to eliminate the memset(), it could do so with an own implementation, and then result in similar findings as with llvm. The fix in this patch now works for all three compilers (also tested with more aggressive optimization levels). Arguably, in the current kernel tree it's more of a theoretical issue, but imho, it's better to be pedantic about it. It's clearly visible with gcc/llvm though, with the below code: if we would have used barrier() only here, llvm would have omitted clearing, not so with barrier_data() variant: static inline void memzero_explicit(void *s, size_t count) { memset(s, 0, count); barrier_data(s); } int main(void) { char buff[20]; memzero_explicit(buff, sizeof(buff)); return 0; } $ gcc -O2 test.c $ gdb a.out (gdb) disassemble main Dump of assembler code for function main: 0x0000000000400400 <+0>: lea -0x28(%rsp),%rax 0x0000000000400405 <+5>: movq $0x0,-0x28(%rsp) 0x000000000040040e <+14>: movq $0x0,-0x20(%rsp) 0x0000000000400417 <+23>: movl $0x0,-0x18(%rsp) 0x000000000040041f <+31>: xor %eax,%eax 0x0000000000400421 <+33>: retq End of assembler dump. $ clang -O2 test.c $ gdb a.out (gdb) disassemble main Dump of assembler code for function main: 0x00000000004004f0 <+0>: xorps %xmm0,%xmm0 0x00000000004004f3 <+3>: movaps %xmm0,-0x18(%rsp) 0x00000000004004f8 <+8>: movl $0x0,-0x8(%rsp) 0x0000000000400500 <+16>: lea -0x18(%rsp),%rax 0x0000000000400505 <+21>: xor %eax,%eax 0x0000000000400507 <+23>: retq End of assembler dump. As gcc, clang, but also icc defines __GNUC__, it's sufficient to define this in compiler-gcc.h only to be picked up. For a fallback or otherwise unsupported compiler, we define it as a barrier. Similarly, for ecc which does not support gcc inline asm. Reference: https://llvm.org/bugs/show_bug.cgi?id=15495 Reported-by: Stephan Mueller <smueller@chronox.de> Tested-by: Stephan Mueller <smueller@chronox.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Stephan Mueller <smueller@chronox.de> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: mancha security <mancha1@zoho.com> Cc: Mark Charlebois <charlebm@gmail.com> Cc: Behan Webster <behanw@converseincode.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* | Merge tag 'media/v4.1-3' of ↵Linus Torvalds2015-05-057-17/+40
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab: "Three driver fixes: - fix for omap4, fixing a regression due to a subsystem API that got removed for 4.1 (commit efde234674d9); - fix for one of the formats supported by Marvel ccic driver; - fix rcar_vin driver that, when stopping abnormally, the driver can't return from wait_for_completion" * tag 'media/v4.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: [media] v4l: omap4iss: Replace outdated OMAP4 control pad API with syscon [media] media: soc_camera: rcar_vin: Fix wait_for_completion [media] marvell-ccic: fix Y'CbCr ordering
| * | [media] v4l: omap4iss: Replace outdated OMAP4 control pad API with sysconLaurent Pinchart2015-04-284-5/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The omap4_ctrl_pad_readl and omap4_ctrl_pad_writel functions have been removed by commit efde234674d9 but are still used by the OMAP4 ISS driver, resulting in a compilation breakage: drivers/staging/media/omap4iss/iss_csiphy.c: In function 'omap4iss_csiphy_config': drivers/staging/media/omap4iss/iss_csiphy.c:167:2: error: implicit declaration of function 'omap4_ctrl_pad_writel' [-Werror=implicit-function-declaration] omap4_ctrl_pad_writel(cam_rx_ctrl, Fix the problem by using the syscon API to reaplace the control pad API. Lookup the syscon instance by compatible name for now as the OMAP4 ISS driver doesn't support DT yet. Fixes: efde234674d9 ("ARM: OMAP4+: control: remove support for legacy pad read/write") Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Acked-by: Sakari Alius <sakari.ailus@iki.fi> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
| * | [media] media: soc_camera: rcar_vin: Fix wait_for_completionKoji Matsuoka2015-04-271-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When stopping abnormally, a driver can't return from wait_for_completion. This patch resolved this problem by changing wait_for_completion_timeout from wait_for_completion. Signed-off-by: Koji Matsuoka <koji.matsuoka.xm@renesas.com> Signed-off-by: Yoshihiro Kaneko <ykaneko0929@gmail.com> Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
| * | [media] marvell-ccic: fix Y'CbCr orderingHans Verkuil2015-04-272-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Various formats had their byte ordering implemented incorrectly, and the V4L2_PIX_FMT_UYVY is actually impossible to create, instead you get V4L2_PIX_FMT_YVYU. This was working before commit ad6ac452227b7cb93ac79beec092850d178740b1 ("add new formats support for marvell-ccic driver"). That commit broke the original format support and the OLPC XO-1 laptop showed wrong colors ever since (if you are crazy enough to attempt to run the latest kernel on it, like I did). The email addresses of the authors of that patch are no longer valid, so without a way to reach them and ask them about their test setup I am going with what I can test on the OLPC laptop. If this breaks something for someone on their non-OLPC setup, then contact the linux-media mailinglist. My suspicion however is that that commit went in untested. Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Acked-by: Jonathan Corbet <corbet@lwn.net> Cc: <stable@vger.kernel.org> # for v3.19 and up Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
* | | Linux 4.1-rc2v4.1-rc2Linus Torvalds2015-05-031-1/+1
| | |
* | | Merge tag 'for_linus_stable' of ↵Linus Torvalds2015-05-0313-229/+210
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Some miscellaneous bug fixes and some final on-disk and ABI changes for ext4 encryption which provide better security and performance" * tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix growing of tiny filesystems ext4: move check under lock scope to close a race. ext4: fix data corruption caused by unwritten and delayed extents ext4 crypto: remove duplicated encryption mode definitions ext4 crypto: do not select from EXT4_FS_ENCRYPTION ext4 crypto: add padding to filenames before encrypting ext4 crypto: simplify and speed up filename encryption
| * | | ext4: fix growing of tiny filesystemsJan Kara2015-05-021-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The estimate of necessary transaction credits in ext4_flex_group_add() is too pessimistic. It reserves credit for sb, resize inode, and resize inode dindirect block for each group added in a flex group although they are always the same block and thus it is enough to account them only once. Also the number of modified GDT block is overestimated since we fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block. Make the estimation more precise. That reduces number of requested credits enough that we can grow 20 MB filesystem (which has 1 MB journal, 79 reserved GDT blocks, and flex group size 16 by default). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
| * | | ext4: move check under lock scope to close a race.Davide Italiano2015-05-021-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fallocate() checks that the file is extent-based and returns EOPNOTSUPP in case is not. Other tasks can convert from and to indirect and extent so it's safe to check only after grabbing the inode mutex. Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
| * | | ext4: fix data corruption caused by unwritten and delayed extentsLukas Czerner2015-05-022-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently it is possible to lose whole file system block worth of data when we hit the specific interaction with unwritten and delayed extents in status extent tree. The problem is that when we insert delayed extent into extent status tree the only way to get rid of it is when we write out delayed buffer. However there is a limitation in the extent status tree implementation so that when inserting unwritten extent should there be even a single delayed block the whole unwritten extent would be marked as delayed. At this point, there is no way to get rid of the delayed extents, because there are no delayed buffers to write out. So when a we write into said unwritten extent we will convert it to written, but it still remains delayed. When we try to write into that block later ext4_da_map_blocks() will set the buffer new and delayed and map it to invalid block which causes the rest of the block to be zeroed loosing already written data. For now we can fix this by simply not allowing to set delayed status on written extent in the extent status tree. Also add WARN_ON() to make sure that we notice if this happens in the future. This problem can be easily reproduced by running the following xfs_io. xfs_io -f -c "pwrite -S 0xaa 4096 2048" \ -c "falloc 0 131072" \ -c "pwrite -S 0xbb 65536 2048" \ -c "fsync" /mnt/test/fff echo 3 > /proc/sys/vm/drop_caches xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff This can be theoretically also reproduced by at random by running fsx, but it's not very reliable, though on machines with bigger page size (like ppc) this can be seen more often (especially xfstest generic/127) Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
| * | | ext4 crypto: remove duplicated encryption mode definitionsChanho Park2015-05-021-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes duplicated encryption modes which were already in ext4.h. They were duplicated from commit 3edc18d and commit f542fb. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Michael Halcrow <mhalcrow@google.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Chanho Park <chanho61.park@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4 crypto: do not select from EXT4_FS_ENCRYPTIONHerbert Xu2015-05-021-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a tristate EXT4_ENCRYPTION to do the selections for EXT4_FS_ENCRYPTION because selecting from a bool causes all the selected options to be built-in, even if EXT4 itself is a module. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4 crypto: add padding to filenames before encryptingTheodore Ts'o2015-05-015-8/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This obscures the length of the filenames, to decrease the amount of information leakage. By default, we pad the filenames to the next 4 byte boundaries. This costs nothing, since the directory entries are aligned to 4 byte boundaries anyway. Filenames can also be padded to 8, 16, or 32 bytes, which will consume more directory space. Change-Id: Ibb7a0fb76d2c48e2061240a709358ff40b14f322 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
OpenPOWER on IntegriCloud