* dm raid: fixup documentation for discard support (Heinz Mauelshagen, 2015-05-29; 2 files, -2/+1)

  Remove the comment above parse_raid_params() that claims
  "devices_handle_discard_safely" is a table line argument when it
  actually is a module parameter. Also, backfill the dm-raid target
  version 1.6.0 documentation.

  Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
  Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm thin metadata: remove in-core 'read_only' flag (Mike Snitzer, 2015-05-29; 3 files, -5/+8)

  Leverage the block manager's read_only flag instead of duplicating
  it; access it with the new dm_bm_is_read_only() method.

  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm thin: cleanup schedule_zero() to read more logically (Mike Snitzer, 2015-05-29; 1 file, -9/+7)

  The overwrite has only ever been about optimizing away the need to
  zero a block when the entire block is being overwritten. As such, it
  is only relevant when zeroing is enabled.

  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Signed-off-by: Joe Thornber <ejt@redhat.com>
* dm thin: cleanup overwrite's endio restore to be centralized (Mike Snitzer, 2015-05-29; 1 file, -8/+3)

  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm: factor out a common cleanup_mapped_device() (Mike Snitzer, 2015-05-29; 1 file, -35/+43)

  Introduce a single common method for cleaning up a DM device's
  mapped_device. No functional change, just eliminates duplication of
  delicate mapped_device cleanup code.

  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm: cleanup methods that requeue requests (Mike Snitzer, 2015-05-29; 1 file, -12/+5)

  More often than not, a request that is requeued _is_ mapped (meaning
  the clone request is allocated and clone->q is initialized). Rename
  dm_requeue_unmapped_original_request() to avoid potential confusion
  due to the function name containing "unmapped". Also, remove
  dm_requeue_unmapped_request() since callers can easily call
  dm_requeue_original_request() directly.

  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm: do not allocate any mempools for blk-mq request-based DM (Mike Snitzer, 2015-05-29; 2 files, -33/+40)

  Do not allocate the io_pool mempool for blk-mq request-based DM
  (DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools(). Also refine
  __bind_mempools() to have more precise awareness of which mempools
  each type of DM device uses -- avoids mempool churn when reloading
  DM tables (particularly for DM_TYPE_REQUEST_BASED).

  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* Merge remote-tracking branch 'jens/for-4.2/core' into dm-4.2 (Mike Snitzer, 2015-05-29; 54 files, -709/+573)
  * block: fix returnvar.cocci warnings (Julia Lawall, 2015-05-26; 1 file, -2/+1)

    Remove an unneeded variable used to store a return value.
    Generated by: scripts/coccinelle/misc/returnvar.cocci

    Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
    Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block, dm: don't copy bios for request clones (Christoph Hellwig, 2015-05-22; 6 files, -230/+73)

    Currently dm-multipath has to clone the bios for every request sent
    to the lower devices, which wastes cpu cycles and ties down memory.
    This patch instead adds a new REQ_CLONE flag that instructs
    req_bio_endio to not complete bios attached to a request, which we
    set on clone requests similar to bios in a flush sequence. With this
    change, I/O errors on a path failure only get propagated to
    dm-multipath, which can then either resubmit the I/O or complete the
    bios on the original request.

    I've done some basic testing of this on a Linux target with ALUA
    support, and it survives path failures during I/O nicely.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block: remove management of bi_remaining when restoring original bi_end_io (Mike Snitzer, 2015-05-22; 12 files, -66/+27)

    Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
    non-chains") regressed all existing callers that followed this
    pattern:
     1) saving a bio's original bi_end_io
     2) wiring up an intermediate bi_end_io
     3) restoring the original bi_end_io from the intermediate bi_end_io
     4) calling bio_endio() to execute the restored original bi_end_io

    The regression was due to BIO_CHAIN only ever getting set if
    bio_inc_remaining() is called. For the above pattern it isn't set
    until step 3 above (step 2 would've needed to establish BIO_CHAIN).
    As such, the first bio_endio(), in step 2 above, never decremented
    __bi_remaining before calling the intermediate bi_end_io -- leaving
    __bi_remaining with the value 1 instead of 0. When
    bio_inc_remaining() occurred during step 3 it brought it to a value
    of 2. When the second bio_endio() was called, in step 4 above, it
    should've called the original bi_end_io but it didn't because there
    was an extra reference that wasn't dropped (due to atomic operations
    being optimized away since BIO_CHAIN wasn't set upfront).

    Fix this issue by removing the __bi_remaining management complexity
    for all callers that use the above pattern -- bio_chain() is the
    only interface that _needs_ to be concerned with __bi_remaining.
    For the above pattern, callers just expect the bi_end_io they set to
    get called! Remove bio_endio_nodec() and also remove all
    bio_inc_remaining() calls that aren't associated with the
    bio_chain() interface. Also, the bio_inc_remaining() interface has
    been moved local to bio.c.

    Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
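    The save/wire/restore pattern, and why the flag-gated decrement
    broke it, can be modeled outside the kernel. The following is an
    illustrative userspace sketch, not the kernel code: this struct
    bio, its 'chained' flag, and the endio hooks are simplified
    stand-ins for the real struct bio, BIO_CHAIN, and bi_end_io.

        #include <stdatomic.h>
        #include <stdio.h>

        struct bio {
            atomic_int remaining;          /* models __bi_remaining */
            int chained;                   /* models the BIO_CHAIN flag */
            void (*end_io)(struct bio *);
        };

        static void bio_endio(struct bio *b)
        {
            /* the c4cf5261 optimization: only chained bios pay the atomic dec */
            if (b->chained && atomic_fetch_sub(&b->remaining, 1) != 1)
                return;                    /* work still outstanding */
            if (b->end_io)
                b->end_io(b);
        }

        static void (*saved_end_io)(struct bio *);

        static void intermediate_end_io(struct bio *b)
        {
            /*
             * Step 3: restore the original endio.  The broken pattern also
             * bumped 'remaining' and set 'chained' here, leaving a count
             * that the final bio_endio() decremented but never finished.
             */
            b->end_io = saved_end_io;
            bio_endio(b);                  /* step 4: runs the original */
        }

        static void original_end_io(struct bio *b)
        {
            (void)b;
            puts("original end_io ran");
        }

        int main(void)
        {
            struct bio b = { .end_io = original_end_io };

            atomic_init(&b.remaining, 1);
            saved_end_io = b.end_io;           /* step 1: save */
            b.end_io = intermediate_end_io;    /* step 2: wire intermediate */
            bio_endio(&b);                     /* fixed pattern: original runs */
            return 0;
        }

    With the fixed pattern the restorer leaves the count alone and the
    original endio runs; with the old inc/dec dance the count ended at
    1 and the original endio was silently skipped.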
  * block: replace trylock with mutex_lock in blkdev_reread_part() (Ming Lei, 2015-05-20; 1 file, -2/+7)

    The only possible problem of using mutex_lock() instead of trylock
    is deadlock. If there aren't any locks held before calling
    blkdev_reread_part(), deadlock can't be caused by this conversion.
    If there are locks held before calling blkdev_reread_part(), and if
    these locks aren't required in the open and close handlers or the
    I/O path, deadlock shouldn't be caused either.

    Both user space's ioctl(BLKRRPART) and md_setup_drive() from
    init/do_mounts_md.c belong to the first case, so the conversion is
    safe for those two. For loop, the previous patches in this patchset
    have fixed the ABBA lock dependency, so the conversion is OK. For
    nbd, tx_lock is held when calling the function:
     - both open and release won't hold the lock
     - when blkdev_reread_part() is run, the I/O thread has been stopped
       already, so tx_lock won't be acquired in the I/O path at that time
     - so the conversion won't cause deadlock for nbd
    For dasd, neither dasd_open(), dasd_release() nor the request
    function acquires any mutex/semaphore, so the conversion should be
    safe.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Jarod Wilson <jarod@redhat.com>
    Acked-by: Jarod Wilson <jarod@redhat.com>
    Signed-off-by: Ming Lei <ming.lei@canonical.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block: export blkdev_reread_part() and __blkdev_reread_part() (Jarod Wilson, 2015-05-20; 2 files, -3/+28)

    This patch exports blkdev_reread_part() for block drivers and also
    introduces __blkdev_reread_part(). For some drivers, such as loop,
    a reread of partitions can be run from the release path, and
    bd_mutex may already be held prior to calling
    ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
    __blkdev_reread_part() for use in such cases.

    CC: Christoph Hellwig <hch@lst.de>
    CC: Jens Axboe <axboe@kernel.dk>
    CC: Tejun Heo <tj@kernel.org>
    CC: Alexander Viro <viro@zeniv.linux.org.uk>
    CC: Markus Pargmann <mpa@pengutronix.de>
    CC: Stefan Weinhuber <wein@de.ibm.com>
    CC: Stefan Haberland <stefan.haberland@de.ibm.com>
    CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
    CC: Fabian Frederick <fabf@skynet.be>
    CC: Ming Lei <ming.lei@canonical.com>
    CC: David Herrmann <dh.herrmann@gmail.com>
    CC: Andrew Morton <akpm@linux-foundation.org>
    CC: Peter Zijlstra <peterz@infradead.org>
    CC: nbd-general@lists.sourceforge.net
    CC: linux-s390@vger.kernel.org
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jarod Wilson <jarod@redhat.com>
    Signed-off-by: Ming Lei <ming.lei@canonical.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
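    The double-underscore variant follows the kernel's usual
    locked/unlocked pairing convention. A simplified illustrative
    sketch of the shape, with pthreads standing in for bd_mutex and
    placeholder bodies rather than the actual block-layer code:

        #include <pthread.h>

        static pthread_mutex_t bd_mutex = PTHREAD_MUTEX_INITIALIZER;

        /* Core work; the caller must already hold bd_mutex
         * (e.g. loop's release path, where it was taken earlier). */
        static int __reread_part(void)
        {
            /* ... rescan the partition table here ... */
            return 0;
        }

        /* Convenience wrapper for callers that hold no locks,
         * such as the ioctl(BLKRRPART) path. */
        static int reread_part(void)
        {
            int ret;

            pthread_mutex_lock(&bd_mutex);
            ret = __reread_part();
            pthread_mutex_unlock(&bd_mutex);
            return ret;
        }

    The split lets lock-holding contexts reuse the core logic without
    recursively taking bd_mutex.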
  * suspend: simplify block I/O handling (Christoph Hellwig, 2015-05-19; 6 files, -155/+122)

    Stop abusing struct page functionality and the swap end_io handler,
    and instead add a modified version of the blk-lib.c bio_batch
    helpers. Also move the block I/O code into swap.c, as the two are
    directly tied to each other.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Pavel Machek <pavel@ucw.cz>
    Tested-by: Ming Lin <mlin@kernel.org>
    Acked-by: Pavel Machek <pavel@ucw.cz>
    Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block: collapse bio bit space (Jens Axboe, 2015-05-19; 1 file, -9/+9)

    Various previous patches removed bits and left holes; collapse them
    all. Leave the reset start bit where it is -- we don't need to
    change that.

    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block: remove unused BIO_RW_BLOCK and BIO_EOF flags (Christoph Hellwig, 2015-05-19; 2 files, -4/+0)

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block: remove BIO_EOPNOTSUPP (Christoph Hellwig, 2015-05-19; 7 files, -38/+2)

    Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
    FUA requests, which means we can remove the magic BIO_EOPNOTSUPP
    flag that helped propagate those failures to the buffer_head layer.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block: use an atomic_t for mq_freeze_depth (Christoph Hellwig, 2015-05-19; 2 files, -15/+11)

    lockdep gets unhappy about irqs not being disabled when the
    queue_lock is used around it. Instead of trying to fix that up, just
    switch to an atomic_t and get rid of the lock.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * blk-mq: make plug work for multiple disks and queues (Shaohua Li, 2015-05-08; 3 files, -9/+23)

    The last patch made plugging work for the multiple-queue case, but
    only for a single disk, because it assumed only one request sits in
    the plug list. If a task is accessing multiple disks, e.g. MD/DM,
    that assumption is wrong. Let blk_attempt_plug_merge() record the
    request from the same queue.

    V2: use a NULL parameter in the !mq case. Fix a bug. Add comments in
    blk_attempt_plug_merge() to make it (hopefully) less confusing.

    Cc: Jens Axboe <axboe@fb.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Shaohua Li <shli@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * blk-mq: do limited block plug for multiple queue case (Shaohua Li, 2015-05-08; 1 file, -23/+59)

    A plug is still helpful for workloads with I/O merging, but it can
    be harmful otherwise, especially with multiple hardware queues, as
    there is (supposedly) no lock contention in this case and plugging
    can introduce latency. For multiple queues, do a limited plug, e.g.
    plug only if there is a request merge. If a request doesn't merge
    with a following request, it will be dispatched immediately.

    V2: check blk_queue_nomerges() as suggested by Jeff.

    Cc: Jens Axboe <axboe@fb.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Shaohua Li <shli@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * blk-mq: avoid re-initializing a request that failed direct dispatch (Shaohua Li, 2015-05-08; 1 file, -0/+2)

    If we directly issue a request and it fails, we use
    blk_mq_merge_queue_io(). But we already assigned the bio to the
    request in blk_mq_bio_to_request(); blk_mq_merge_queue_io()
    shouldn't run blk_mq_bio_to_request() again.

    Signed-off-by: Shaohua Li <shli@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * blk-mq: fix plugging in blk_sq_make_request (Jeff Moyer, 2015-05-08; 1 file, -22/+14)

    The following appears in blk_sq_make_request:

        /*
         * If we have multiple hardware queues, just go directly to
         * one of those for sync IO.
         */

    We clearly don't have multiple hardware queues here! This comment
    was introduced with commit 07068d5b8e ("blk-mq: split make request
    handler for multi and single queue"):

        We want slightly different behavior from them:
        - On single queue devices, we currently use the per-process plug
          for deferred IO and for merging.
        - On multi queue devices, we don't use the per-process plug, but
          we want to go straight to hardware for SYNC IO.

    The old code had this:

        use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

    and that was converted to:

        use_plug = !is_flush_fua && !is_sync;

    which is not equivalent. For the single queue case, the second half
    of the && expression is always true. So, what I think was actually
    intended follows (and this more closely matches what is done in
    blk_queue_bio).

    V2: delete the 'likely', which should not be a big deal

    Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * sched: always use blk_schedule_flush_plug in io_schedule_out (Shaohua Li, 2015-05-08; 1 file, -4/+1)

    A block plug callback could sleep, so we introduced a parameter
    'from_schedule' that drivers can use to distinguish a schedule plug
    flush from a plug finish. Unfortunately io_schedule_out still uses
    blk_flush_plug(). This causes the output below. (Note: I added a
    might_sleep() in raid1_unplug to make it trigger faster, but the
    whole thing doesn't matter if I add might_sleep.) In raid1/10, this
    can cause deadlock.

    This patch makes io_schedule_out always use blk_schedule_flush_plug.
    This should only impact drivers (as far as I know, raid 1/10) which
    are sensitive to the 'from_schedule' parameter.

        [ 370.817949] ------------[ cut here ]------------
        [ 370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
        [ 370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90
        [ 370.817971] Modules linked in: raid1
        [ 370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G W 4.0.0+ #361
        [ 370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
        [ 370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
        [ 370.817985] ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
        [ 370.817988] ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
        [ 370.817990] ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
        [ 370.817993] Call Trace:
        [ 370.817999] [<ffffffff819dd7af>] dump_stack+0x4f/0x7b
        [ 370.818002] [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0
        [ 370.818004] [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50
        [ 370.818006] [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
        [ 370.818008] [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
        [ 370.818010] [<ffffffff810776ef>] __might_sleep+0x7f/0x90
        [ 370.818014] [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1]
        [ 370.818024] [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0
        [ 370.818028] [<ffffffff819e3550>] ? bit_wait+0x50/0x50
        [ 370.818031] [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140
        [ 370.818033] [<ffffffff819e3586>] bit_wait_io+0x36/0x50
        [ 370.818034] [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90
        [ 370.818041] [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630
        [ 370.818043] [<ffffffff819e3550>] ? bit_wait+0x50/0x50
        [ 370.818045] [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80
        [ 370.818047] [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40
        [ 370.818050] [<ffffffff811de744>] __wait_on_buffer+0x44/0x50
        [ 370.818053] [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0
        [ 370.818058] [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790
        [ 370.818062] [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50
        [ 370.818064] [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200
        [ 370.818066] [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360
        [ 370.818068] [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0
        [ 370.818070] [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
        [ 370.818072] [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460
        [ 370.818074] [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
        [ 370.818076] [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620
        [ 370.818079] [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0
        [ 370.818081] [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290
        [ 370.818085] [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0
        [ 370.818088] [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50
        [ 370.818094] [<ffffffff81149691>] do_writepages+0x21/0x50
        [ 370.818097] [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490
        [ 370.818099] [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590
        [ 370.818103] [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
        [ 370.818105] [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
        [ 370.818107] [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0
        [ 370.818109] [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0
        [ 370.818111] [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550
        [ 370.818116] [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570
        [ 370.818117] [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570
        [ 370.818119] [<ffffffff8106c09b>] worker_thread+0x11b/0x470
        [ 370.818121] [<ffffffff8106bf80>] ? process_one_work+0x570/0x570
        [ 370.818124] [<ffffffff81071868>] kthread+0xf8/0x110
        [ 370.818126] [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
        [ 370.818129] [<ffffffff819e9322>] ret_from_fork+0x42/0x70
        [ 370.818131] [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
        [ 370.818132] ---[ end trace 7b4deb71e68b6605 ]---

    V2: don't change ->in_iowait

    Cc: NeilBrown <neilb@suse.de>
    Signed-off-by: Shaohua Li <shli@fb.com>
    Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * blk: clean up plug (Shaohua Li, 2015-05-08; 1 file, -12/+12)

    The current code makes it look as if an inner plug gets flushed by
    blk_finish_plug(); actually it's a nop. All requests/callbacks are
    added to current->plug, while only the outermost plug is assigned to
    current->plug. So an inner plug always has an empty request/callback
    list, which makes its blk_flush_plug_list() a nop. This tries to
    make the code clearer.

    Signed-off-by: Shaohua Li <shli@fb.com>
    Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
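    The nesting rule is easy to model in user space. An illustrative
    sketch, not kernel code -- current_plug stands in for the task's
    current->plug pointer:

        #include <stdio.h>

        struct plug { int requests; };

        static struct plug *current_plug;      /* models current->plug */

        static void start_plug(struct plug *p)
        {
            p->requests = 0;
            if (!current_plug)
                current_plug = p;              /* only the outermost wins */
        }

        static void queue_request(void)
        {
            if (current_plug)
                current_plug->requests++;      /* always the outermost list */
        }

        static void finish_plug(struct plug *p)
        {
            if (current_plug != p)
                return;                        /* inner finish is a nop */
            printf("flushing %d requests\n", p->requests);
            current_plug = NULL;
        }

        int main(void)
        {
            struct plug outer, inner;

            start_plug(&outer);
            queue_request();
            start_plug(&inner);     /* nested: not installed */
            queue_request();        /* still lands on the outer plug */
            finish_plug(&inner);    /* nop: inner list is always empty */
            finish_plug(&outer);    /* flushes both requests */
            return 0;
        }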
  * nbd: stop using req->cmd (Christoph Hellwig, 2015-05-05; 2 files, -27/+23)

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block: move PM request support to IDE (Christoph Hellwig, 2015-05-05; 9 files, -51/+70)

    This moves the PM request types and hacks out of the block code and
    into the old IDE driver. There is a small amount of code duplication
    due to this, but it's not too bad.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block: remove REQ_TYPE_PM_SHUTDOWN (Christoph Hellwig, 2015-05-05; 1 file, -1/+0)

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block: move REQ_TYPE_SENSE to the ide driver (Christoph Hellwig, 2015-05-05; 6 files, -10/+10)

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block: move REQ_TYPE_ATA_TASKFILE and REQ_TYPE_ATA_PC to ide.h (Christoph Hellwig, 2015-05-05; 2 files, -9/+9)

    These values are only used by the IDE driver, so move them into it
    by allowing drivers to use cmd_type values after the first private
    one. Note that we have to turn cmd_type into a plain unsigned
    integer so that gcc doesn't complain about mismatching enum types.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * block: rename REQ_TYPE_SPECIAL to REQ_TYPE_DRV_PRIV (Christoph Hellwig, 2015-05-05; 15 files, -26/+26)

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
  * bio: skip atomic inc/dec of ->bi_cnt for most use cases (Jens Axboe, 2015-05-05; 6 files, -12/+30)

    Struct bio has a reference count that controls when it can be freed.
    The most common use case is allocating the bio, which returns with a
    single reference to it, doing I/O, and then dropping that single
    reference. We can remove the atomic_dec_and_test() in the completion
    path if nobody else is holding a reference to the bio. If someone
    does call bio_get() on the bio, we flag the bio as now having a
    valid count, and we must properly honor the reference count when
    it's being put.

    Tested-by: Robert Elliott <elliott@hp.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
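    The idea -- pay for atomics only once a second reference actually
    exists -- can be sketched in user space. An illustrative model, not
    the kernel's bio code; 'counted' stands in for the flag the patch
    adds:

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdlib.h>

        struct bio {
            atomic_int cnt;     /* models bi_cnt */
            bool counted;       /* set once bio_get() has been called */
        };

        static struct bio *bio_alloc(void)
        {
            struct bio *b = calloc(1, sizeof(*b));

            atomic_init(&b->cnt, 1);    /* allocation hands out one ref */
            return b;
        }

        /* Taking an extra reference flips the bio into "counted" mode. */
        static void bio_get(struct bio *b)
        {
            b->counted = true;
            atomic_fetch_add(&b->cnt, 1);
        }

        static void bio_put(struct bio *b)
        {
            /* Fast path: nobody ever called bio_get(), skip the atomic. */
            if (!b->counted) {
                free(b);
                return;
            }
            if (atomic_fetch_sub(&b->cnt, 1) == 1)
                free(b);
        }

    With a single owner the put is a plain free; only after a bio_get()
    does the count have to be honored atomically.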
  * bio: skip atomic inc/dec of ->bi_remaining for non-chains (Jens Axboe, 2015-05-05; 7 files, -15/+47)

    Struct bio has an atomic ref count for chained bios, and we use it
    to know when to end I/O on the bio. However, most bios are not
    chained, so we don't need to always pay for this atomic operation as
    part of ending I/O. Add a helper to elevate the bi_remaining count,
    and flag the bio as now actually needing the decrement at end_io
    time. Rename the field to __bi_remaining to catch any current users
    doing the incrementing manually.

    For high-IOPS workloads, this reduces the overhead of bio_endio()
    substantially.

    Tested-by: Robert Elliott <elliott@hp.com>
    Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* dm: fix casting bug in dm_merge_bvec() (Joe Thornber, 2015-05-29; 1 file, -5/+12)

  dm_merge_bvec() was originally added in f6fccb ("dm: introduce
  merge_bvec_fn"). In that commit a value in sectors is converted to
  bytes using << 9 and then assigned to an int. This code made
  assumptions about the value of BIO_MAX_SECTORS.

  A later commit 148e51 ("dm: improve documentation and code clarity in
  dm_merge_bvec") was meant to have no functional change, but it
  removed the use of BIO_MAX_SECTORS in favor of queue_max_sectors().
  At this point the cast from sector_t to int resulted in a zero value.
  The fallout was that dm_merge_bvec() would only allow a single page
  to be added to a bio.

  This interim fix is minimal for the benefit of stable@ because the
  more comprehensive cleanup of passing a sector_t to all DM targets'
  merge function will impact quite a few DM targets.

  Signed-off-by: Joe Thornber <ejt@redhat.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Cc: stable@vger.kernel.org # 3.19+
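  The truncation is reproducible in isolation. A standalone demo --
  the variable names and the sector count are illustrative, not the
  actual dm.c code:

      #include <stdint.h>
      #include <stdio.h>

      typedef uint64_t sector_t;

      int main(void)
      {
          sector_t max_sectors = 1ULL << 23;    /* an illustrative large value */

          /* sectors -> bytes, then assigned to an int: 2^32 truncates */
          int max_size = max_sectors << 9;
          printf("max_size = %d\n", max_size);  /* 0 on typical targets */
          return 0;
      }

  Once max_size collapses to zero, the merge function refuses every
  addition beyond the first page.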
* dm: fix reload failure of 0 path multipath mapping on blk-mq devices (Junichi Nomura, 2015-05-29; 1 file, -7/+9)

  dm-multipath accepts a 0-path mapping:

      # echo '0 2097152 multipath 0 0 0 0' | dmsetup create newdev

  Such a mapping can be used to release the underlying devices while
  still holding requests in its queue until working paths come back.
  However, once the multipath device is created over blk-mq devices, it
  rejects reloading of a 0-path mapping:

      # echo '0 2097152 multipath 0 0 1 1 queue-length 0 1 1 /dev/sda 1' \
          | dmsetup create mpath1
      # echo '0 2097152 multipath 0 0 0 0' | dmsetup load mpath1
      device-mapper: reload ioctl on mpath1 failed: Invalid argument
      Command failed

  with the following kernel message:

      device-mapper: ioctl: can't change device type after initial table load.

  DM tries to inherit the current table type using dm_table_set_type(),
  but it doesn't work as expected because of an unnecessary check of
  whether the target type is hybrid. The hybrid type is for targets
  that work as either request-based or bio-based, and it is not
  required for blk-mq vs. non-blk-mq checking.

  Fixes: 65803c205983 ("dm table: train hybrid target type detection to select blk-mq if appropriate")
  Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm: fix false warning in free_rq_clone() for unmapped requests (Mike Snitzer, 2015-05-29; 1 file, -5/+3)

  When stacking a request-based DM device on a non-blk-mq device, and
  the device-mapper target cannot map the request (an error target is
  used, a multipath target has all paths down, etc.), the
  WARN_ON_ONCE() in free_rq_clone() triggers when it shouldn't. The
  warning was added by commit aa6df8d ("dm: fix free_rq_clone() NULL
  pointer when requeueing unmapped request"). But free_rq_clone() with
  clone->q == NULL is valid usage for the case where
  dm_kill_unmapped_request() initiates request cleanup.

  Fix this false warning by just removing the WARN_ON -- it only
  generated false positives and was never useful in catching the
  intended case (a clone request being completed without having been
  mapped, e.g. clone->q being NULL).

  Fixes: aa6df8d ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request")
  Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
  Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY (Mike Snitzer, 2015-05-27; 1 file, -4/+6)

  Use BLK_MQ_RQ_QUEUE_BUSY to requeue a blk-mq request directly from
  the DM blk-mq device's .queue_rq. This cleans up the previous
  convoluted handling of request requeueing, which would return
  BLK_MQ_RQ_QUEUE_OK (even though it wasn't) and then run
  blk_mq_requeue_request() followed by blk_mq_kick_requeue_list().

  Also, document that DM blk-mq on top of old request_fn devices cannot
  fail in clone_rq(), since the clone request is preallocated as part
  of the pdu.

  Reported-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path (Mike Snitzer, 2015-05-27; 1 file, -1/+3)

  Otherwise kmemleak reported:

      unreferenced object 0xffff88009b14e2b0 (size 16):
        comm "fio", pid 4274, jiffies 4294978034 (age 1253.210s)
        hex dump (first 16 bytes):
          40 12 f3 99 01 88 ff ff 00 10 00 00 00 00 00 00  @...............
        backtrace:
          [<ffffffff81600029>] kmemleak_alloc+0x49/0xb0
          [<ffffffff811679a8>] kmem_cache_alloc+0xf8/0x160
          [<ffffffff8111c950>] mempool_alloc_slab+0x10/0x20
          [<ffffffff8111cb37>] mempool_alloc+0x57/0x150
          [<ffffffffa04d2b61>] __multipath_map.isra.17+0xe1/0x220 [dm_multipath]
          [<ffffffffa04d2cb5>] multipath_clone_and_map+0x15/0x20 [dm_multipath]
          [<ffffffffa02889b5>] map_request.isra.39+0xd5/0x220 [dm_mod]
          [<ffffffffa028b0e4>] dm_mq_queue_rq+0x134/0x240 [dm_mod]
          [<ffffffff812cccb5>] __blk_mq_run_hw_queue+0x1d5/0x380
          [<ffffffff812ccaa5>] blk_mq_run_hw_queue+0xc5/0x100
          [<ffffffff812ce350>] blk_sq_make_request+0x240/0x300
          [<ffffffff812c0f30>] generic_make_request+0xc0/0x110
          [<ffffffff812c0ff2>] submit_bio+0x72/0x150
          [<ffffffff811c07cb>] do_blockdev_direct_IO+0x1f3b/0x2da0
          [<ffffffff811c166e>] __blockdev_direct_IO+0x3e/0x40
          [<ffffffff8120aa1a>] ext4_direct_IO+0x1aa/0x390

  Fixes: e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices")
  Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Cc: stable@vger.kernel.org # 4.0+
* dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED (Junichi Nomura, 2015-05-27; 1 file, -2/+2)

  When stacking request-based DM on a blk-mq device, request cloning
  and remapping are done in a single call to the target's
  clone_and_map_rq(). The clone is allocated and valid only if
  clone_and_map_rq() returns DM_MAPIO_REMAPPED.

  The "IS_ERR(clone)" check in map_request() does not cover all the
  !DM_MAPIO_REMAPPED cases that are possible (e.g. if the underlying
  devices are not ready or unavailable, clone_and_map_rq() may return
  DM_MAPIO_REQUEUE without ever having established an ERR_PTR). Fix
  this by explicitly checking for a return that is not
  DM_MAPIO_REMAPPED in map_request().

  Without this fix, DM core may call setup_clone() for a NULL clone
  and oops like this:

      BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
      IP: [<ffffffff81227525>] blk_rq_prep_clone+0x7d/0x137
      ...
      CPU: 2 PID: 5793 Comm: kdmwork-253:3 Not tainted 4.0.0-nm #1
      ...
      Call Trace:
       [<ffffffffa01d1c09>] map_tio_request+0xa9/0x258 [dm_mod]
       [<ffffffff81071de9>] kthread_worker_fn+0xfd/0x150
       [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
       [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
       [<ffffffff81071fdd>] kthread+0xe6/0xee
       [<ffffffff81093a59>] ? put_lock_stats+0xe/0x20
       [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b
       [<ffffffff814c2d98>] ret_from_fork+0x58/0x90
       [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b

  Fixes: e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices")
  Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
  Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  Cc: stable@vger.kernel.org # 4.0+
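  The pitfall generalizes: when a function reports success through a
  status code and only conditionally produces a pointer, the caller
  must branch on the status, not on the pointer. A minimal
  illustrative model -- the DM_MAPIO_* names mirror DM's, but this is
  not the DM code:

      #include <stddef.h>
      #include <stdio.h>

      #define DM_MAPIO_REMAPPED 1
      #define DM_MAPIO_REQUEUE  2

      /* Sets *clone only when it returns DM_MAPIO_REMAPPED. */
      static int clone_and_map_rq(int paths_ready, void **clone)
      {
          static int dummy_clone;

          if (!paths_ready)
              return DM_MAPIO_REQUEUE;   /* *clone left untouched */
          *clone = &dummy_clone;
          return DM_MAPIO_REMAPPED;
      }

      static void map_request(int paths_ready)
      {
          void *clone = NULL;
          int r = clone_and_map_rq(paths_ready, &clone);

          /* The fix: gate on the status code, not on the pointer. */
          if (r != DM_MAPIO_REMAPPED) {
              puts("requeue; never touch clone");
              return;
          }
          printf("setup_clone(%p)\n", clone);
      }

      int main(void)
      {
          map_request(0);    /* requeued, clone never dereferenced */
          map_request(1);    /* mapped, clone is valid */
          return 0;
      }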
* dm: run queue on re-queue (Junichi Nomura, 2015-05-26; 1 file, -0/+1)

  Without kicking the queue, a requeued request may stay in the queue
  forever if there is no other I/O activity on the device. The original
  error was introduced in v2.6.39 by commit 7eaceaccab5f ("block:
  remove per-queue plugging"), which replaced conditional plugging with
  a periodic runqueue. Commit 9d1deb83d489 in v4.1-rc1 removed the
  periodic runqueue, and the problem started to manifest.

  Fixes: 9d1deb83d489 ("dm: don't schedule delayed run of the queue if nothing to do")
  Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
  Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* Linux 4.1-rc5 (tag: v4.1-rc5) (Linus Torvalds, 2015-05-24; 1 file, -1/+1)
* Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi (Linus Torvalds, 2015-05-24; 13 files, -77/+71)

  Pull SCSI fixes from James Bottomley:
   "This is a set of five fixes: two MAINTAINERS email updates (urgent
    because the non-avagotech emails will start bouncing), an lpfc
    big-endian oops fix, a 256-byte-sector hang fix (to eliminate
    256-byte sectors), and a storvsc fix for a bug that could cause
    TEST UNIT READY failures on bringup."

  * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    MAINTAINERS: Revise lpfc maintainers for Avago Technologies ownership of Emulex
    MAINTAINERS, be2iscsi: change email domain
    sd: Disable support for 256 byte/sector disks
    lpfc: Fix breakage on big endian kernels
    storvsc: Set the SRB flags correctly when no data transfer is needed
  * MAINTAINERS: Revise lpfc maintainers for Avago Technologies ownership of Emulex (James Smart, 2015-05-18; 1 file, -3/+4)

    The old email addresses will go away very soon; revise with the new
    addresses.

    Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
    Signed-off-by: James Smart <james.smart@avagotech.com>
    Signed-off-by: James Bottomley <JBottomley@Odin.com>
  * MAINTAINERS, be2iscsi: change email domain (Minh Tran, 2015-05-18; 10 files, -38/+40)

    be2iscsi recently changed ownership from Emulex to Avago
    Technologies. This patch changes "Emulex" to "Avago Technologies",
    changes email addresses from "emulex.com" to "avagotech.com", and
    updates the MAINTAINERS entry for the be2iscsi driver.

    Signed-off-by: Minh Tran <minh.tran@avagotech.com>
    Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@avagotech.com>
    Signed-off-by: James Bottomley <JBottomley@Odin.com>
  * sd: Disable support for 256 byte/sector disks (Mark Hounschell, 2015-05-18; 1 file, -14/+5)

    256 bytes-per-sector support has been broken since 2.6.x, and no one
    stepped up to fix it, so disable support for it.

    Signed-off-by: Mark Hounschell <dmarkh@cfl.rr.com>
    Signed-off-by: Hannes Reinecke <hare@suse.de>
    Cc: stable@vger.kernel.org
    Signed-off-by: James Bottomley <JBottomley@Odin.com>
  * lpfc: Fix breakage on big endian kernels (Alexey Kardashevskiy, 2015-05-11; 1 file, -20/+21)

    This reverts 4fbdf9cb; it breaks lpfc on a POWER7 machine running a
    big endian kernel. Without this revert, the kernel enters an
    infinite oops loop. This is the hardware used for verification:

        0005:01:00.0 Fibre Channel [0c04]: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter [10df:f100] (rev 03)
        0005:01:00.1 Fibre Channel [0c04]: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter [10df:f100] (rev 03)

    Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
    Reviewed-by: James Smart <james.smart@emulex.com>
    Signed-off-by: James Bottomley <JBottomley@Odin.com>
  * storvsc: Set the SRB flags correctly when no data transfer is needed (K. Y. Srinivasan, 2015-05-11; 1 file, -2/+1)

    Set the SRB flags correctly when there is no data transfer. Without
    this change some IHV drivers will fail valid commands such as
    TEST_UNIT_READY.

    Cc: <stable@vger.kernel.org>
    Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
    Reviewed-by: Long Li <longli@microsoft.com>
    Signed-off-by: James Bottomley <JBottomley@Odin.com>
* Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds, 2015-05-23; 2 files, -12/+29)

  Pull timer fix from Thomas Gleixner:
   "One more fix from the timer department:
    - Handle division of negative nanosecond values properly on 32bit.
      A recent cleanup wrecked the sign handling of the dividend and
      dropped the check for negative divisors."

  * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ktime: Fix ktime_divns to do signed division
  * ktime: Fix ktime_divns to do signed division (John Stultz, 2015-05-13; 2 files, -12/+29)

    It was noted that the 32bit implementation of ktime_divns() was
    doing unsigned division and didn't properly handle negative values.
    And when a ktime helper was changed to utilize ktime_divns, it
    caused a regression on some IR blasters. See the following bugzilla
    for details:

        https://bugzilla.redhat.com/show_bug.cgi?id=1200353

    This patch fixes the problem in ktime_divns by checking and
    preserving the sign bit, then reapplying it if appropriate after the
    division. It also changes the return type to an s64 to make it more
    obvious this is expected.

    Nicolas also pointed out that negative divisors would cause infinite
    loops on 32bit systems. Negative divisors are unlikely for users of
    this function, but out of caution this patch adds checks for
    negative divisors to both the 32-bit (BUG_ON) and 64-bit (WARN_ON)
    versions to make sure no such use cases creep in.

    [ tglx: Hand a u64 to do_div() to avoid the compiler warning ]

    Fixes: 166afb64511e ("ktime: Sanitize ktime_to_us/ms conversion")
    Reported-and-tested-by: Trevor Cordes <trevor@tecnopolis.ca>
    Signed-off-by: John Stultz <john.stultz@linaro.org>
    Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Josh Boyer <jwboyer@redhat.com>
    Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
    Cc: <stable@vger.kernel.org>
    Link: http://lkml.kernel.org/r/1431118043-23452-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
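    The sign handling the fix adds can be sketched in ordinary C. A
    userspace model, not the kernel's ktime code -- the unsigned 64-bit
    divide stands in for do_div(), and the negative-divisor check
    stands in for the BUG_ON/WARN_ON the patch adds:

        #include <inttypes.h>
        #include <stdio.h>

        static int64_t ktime_divns(int64_t kt, int64_t div)
        {
            int sign = 1;
            uint64_t dividend = (uint64_t)kt;

            if (kt < 0) {                  /* preserve and strip the sign */
                sign = -1;
                dividend = -(uint64_t)kt;
            }
            /* negative divisors would loop forever on 32bit; reject them */
            if (div <= 0)
                return 0;                  /* kernel BUG_ON/WARN_ONs here */

            return sign * (int64_t)(dividend / (uint64_t)div);
        }

        int main(void)
        {
            /* unsigned division alone would return a huge bogus value */
            printf("%" PRId64 "\n", ktime_divns(-1500, 1000));  /* -1 */
            return 0;
        }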
* Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds, 2015-05-23; 1 file, -1/+8)

  Pull irqchip fix from Thomas Gleixner:
   "A fix for a GIC-V3 irqchip regression which prevents some systems
    from booting"

  * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/gicv3-its: ITS table size should not be smaller than PSZ
  * irqchip/gicv3-its: ITS table size should not be smaller than PSZ (Minghuan Lian, 2015-05-20; 1 file, -1/+8)

    When allocating a device table, if the requested allocation is
    smaller than the default granule size of the ITS, we need to round
    up to the default size.

    Signed-off-by: Minghuan Lian <Minghuan.Lian@freescale.com>
    [ stuart: Added comments and massaged changelog ]
    Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
    Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
    Cc: <linux-arm-kernel@lists.infradead.org>
    Cc: <jason@lakedaemon.net>
    Link: http://lkml.kernel.org/r/1432134795-661-1-git-send-email-stuart.yoder@freescale.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
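    The rounding rule itself is one line. An illustrative sketch --
    PSZ_64K here is an assumed example granule, not necessarily the
    value the hardware reports, and the function name is hypothetical:

        #include <stdio.h>

        #define PSZ_64K (64UL * 1024)   /* assumed default ITS granule */

        static unsigned long its_table_alloc_size(unsigned long requested)
        {
            /* Never hand the ITS a table smaller than one granule. */
            if (requested < PSZ_64K)
                return PSZ_64K;
            /* Otherwise round up to the next granule boundary. */
            return (requested + PSZ_64K - 1) & ~(PSZ_64K - 1);
        }

        int main(void)
        {
            printf("%lu\n", its_table_alloc_size(4096));   /* -> 65536 */
            return 0;
        }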