summaryrefslogtreecommitdiffstats
path: root/block
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | elevator: lookup mq vs non-mq elevatorsJens Axboe2017-10-251-23/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an IO scheduler is selected via elevator= and it doesn't match the driver in question wrt blk-mq support, then we fail to boot. The elevator= parameter is deprecated and only supported for non-mq devices. Augment the elevator lookup API so that we pass in if we're looking for an mq capable scheduler or not, so that we only ever return a valid type for the queue in question. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196695 Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()Ilya Dryomov2017-10-251-10/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to permit trying WRITE SAME on older SCSI devices, unless ->no_write_same is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO: $ fallocate -zn -l 1k /dev/sdg fallocate: fallocate failed: Remote I/O error $ fallocate -zn -l 1k /dev/sdg # OK $ fallocate -zn -l 1k /dev/sdg # OK The following calls succeed because sd_done() sets ->no_write_same in response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios. This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is specified. For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if sd_done() has just set ->no_write_same thus indicating lack of offload support. Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") Cc: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: factor out __blkdev_issue_zero_pages()Ilya Dryomov2017-10-251-26/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: move CAP_SYS_ADMIN check in blkdev_roset()Ilya Dryomov2017-10-251-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check for CAP_SYS_ADMIN before calling into the driver, similar to blkdev_flushbuf(). This is safer and can spare a check in the driver. (Currently BLKROSET is overridden by md and rbd, rbd is missing the check. md has the check, but it covers a lot more than BLKROSET.) Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: Invalidate cache on discard v2Dmitry Monakhov2017-10-241-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is reasonable drop page cache on discard, otherwise that pages may be written by writeback second later, so thin provision devices will not be happy. This seems to be a security leak in case of secure discard case. Also add check for queue_discard flag on early stage. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | kyber: fix hang on domain token wait queueOmar Sandoval2017-10-171-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we're getting a domain token, if we fail to get a token on our first attempt, we put the current hardware queue on a wait queue and then try again just in case a token was freed after our initial attempt but before we got on the wait queue. If this second attempt succeeds, we currently leave the hardware queue on the wait queue. Usually this is okay; we'll just run the hardware queue one extra time when another token is freed. However, if the hardware queue doesn't have any other requests waiting, then when it it gets the extra wakeup, it won't have anything to free and therefore won't wake up any other hardware queues. If tokens are limited, then we won't make forward progress and the device will hang. Reported-by: Bin Zha <zhabin.zb@alibaba-inc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: fix Sphinx kernel-doc warningRandy Dunlap2017-10-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sphinx treats symbols that end with '_' as a kind of special documentation indicator, so fix that by adding an ending '*' to it. ../block/bio.c:404: ERROR: Unknown target name: "gfp". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: set request_list for requestShaohua Li2017-10-102-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Legacy queue sets request's request_list, mq doesn't. This makes mq does the same thing, so we can find cgroup of a request. Note, we really only use blkg field of request_list, it's pointless to allocate mempool for request_list in mq case. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blk-stat: delete useless codeShaohua Li2017-10-101-38/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix two issues: - the per-cpu stat flush is unnecessary, nobody uses per-cpu stat except sum it to global stat. We can do the calculation there. The flush just wastes cpu time. - some fields are signed int/s64. I don't see the point. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blk-throttle: fix null pointer dereference while throttling writeback IOsJiufei Xue2017-10-101-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A null pointer dereference can occur when blkcg is removed manually with writeback IOs inflight. This is caused by the following case: Writeback kworker submit the bio and set bio->bi_cg_private to tg in blk_throtl_assoc_bio. Then we remove the block cgroup manually, the blkg and tg would be freed if there is no request inflight. When the submitted bio come back, blk_throtl_bio_endio() fetch the tg which was already freed. Fix this by increasing the refcount of blkg in funcion blk_throtl_assoc_bio() so that the blkg will not be freed until the bio_endio called. Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: Jiufei Xue <jiufei.xjf@alibaba-inc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blkcg: check pol->cpd_free_fn before free cpdweiping zhang2017-10-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | check pol->cpd_free_fn() instead of pol->cpd_alloc_fn() when free cpd. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block, bfq: fix unbalanced decrements of burst sizePaolo Valente2017-10-091-2/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit "block, bfq: decrease burst size when queues in burst exit" introduced the decrement of burst_size on the removal of a bfq_queue from the burst list. Unfortunately, this decrement can happen to be performed even when burst size is already equal to 0, because of unbalanced decrements. A description follows of the cause of these unbalanced decrements, namely a wrong assumption, and of the way how this wrong assumption leads to unbalanced decrements. The wrong assumption is that a bfq_queue can exit only if the process associated with the bfq_queue has exited. This is false, because a bfq_queue, say Q, may exit also as a consequence of a merge with another bfq_queue. In this case, Q exits because the I/O of its associated process has been redirected to another bfq_queue. The decrement unbalance occurs because Q may then be re-created after a split, and added back to the current burst list, *without* incrementing burst_size. burst_size is not incremented because Q is not a new bfq_queue added to the burst list, but a bfq_queue only temporarily removed from the list, and, before the commit "bfq-sq, bfq-mq: decrease burst size when queues in burst exit", burst_size was not decremented when Q was removed. This commit addresses this issue by just checking whether the exiting bfq_queue is a merged bfq_queue, and, in that case, not decrementing burst_size. Unfortunately, this still leaves room for unbalanced decrements, in the following rarer case: on a split, the bfq_queue happens to be inserted into a different burst list than that it was removed from when merged. If this happens, the number of elements in the new burst list becomes higher than burst_size (by one). When the bfq_queue then exits, it is of course not in a merged state any longer, thus burst_size is decremented, which results in an unbalanced decrement. To handle this sporadic, unlucky case in a simple way, this commit also checks that burst_size is larger than 0 before decrementing it. Finally, this commit removes an useless, extra check: the check that the bfq_queue is sync, performed before checking whether the bfq_queue is in the burst list. This extra check is redundant, because only sync bfq_queues can be inserted into the burst list. Fixes: 7cb04004fa37 ("block, bfq: decrease burst size when queues in burst exit") Reported-by: Philip Müller <philm@manjaro.org> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Tested-by: Philip Müller <philm@manjaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block,bfq: Disable writeback throttlingLuca Miccio2017-10-092-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similarly to CFQ, BFQ has its write-throttling heuristics, and it is better not to combine them with further write-throttling heuristics of a different nature. So this commit disables write-back throttling for a device if BFQ is used as I/O scheduler for that device. Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block/bio: Remove null checks before mempool_destroy in bioset_freeTim Hansen2017-10-061-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes redundant checks for null values on bio_pool and bvec_pool. Found using make coccicheck M=block/ on linux-net tree on the next-20170929 tag. Signed-off-by: Tim Hansen <devtimhansen@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: remove unnecessary NULL checks in bioset_integrity_free()Tim Hansen2017-10-061-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mempool_destroy() already checks for a NULL value being passed in, this eliminates duplicate checks. This was caught by running make coccicheck M=block/ on linus' tree on commit 77ede3a014a32746002f7889211f0cecf4803163 (current head as of this patch). Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Tim Hansen <devtimhansen@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: remove QUEUE_FLAG_STACKABLEChristoph Hellwig2017-10-052-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already have a queue_is_rq_based helper to check if a request_queue is request based, so we can remove the flag for it. Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blk-mq: document the need to have STARTED and COMPLETED share a byteJens Axboe2017-10-042-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For memory ordering guarantees on stores, we need to ensure that these two bits share the same byte of storage in the unsigned long. Add a comment as to why, and a BUILD_BUG_ON() to ensure that we don't violate this requirement. Suggested-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blk-mq: attempt to fix atomic flag memory orderingPeter Zijlstra2017-10-042-13/+41
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Attempt to untangle the ordering in blk-mq. The patch introducing the single smp_mb__before_atomic() is obviously broken in that it doesn't clearly specify a pairing barrier and an obtained guarantee. The comment is further misleading in that it hints that the deadline store and the COMPLETE store also need to be ordered, but AFAICT there is no such dependency. However what does appear to be important is the clear happening _after_ the store, and that worked by pure accident. This clarifies blk_mq_start_request() -- we should not get there with STARTING set -- this simplifies the code and makes the barrier usage sane (the old code could be read to allow not having _any_ atomic after the barrier, in which case the barrier hasn't got anything to order). We then also introduce the missing pairing barrier for it. Also down-grade the barrier to smp_wmb(), this is cheaper for PowerPC/ARM and doesn't cost anything extra on x86. And it documents the STARTING vs COMPLETE ordering. Although I've not been entirely successful in reverse engineering the blk-mq state machine so there might still be more funnies around timeout vs requeue. If I got anything wrong, feel free to educate me by adding comments to clarify things ;-) Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Will Deacon <will.deacon@arm.com> Cc: Ming Lei <tom.leiming@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Fixes: 538b75341835 ("blk-mq: request deadline must be visible before marking rq as started") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: move __elv_next_request to blk-core.cChristoph Hellwig2017-10-032-41/+40
| | | | | | | | | | | | | | | | | | | | | No need to have this helper inline in a header. Also drop the __ prefix. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block, bfq: decrease burst size when queues in burst exitPaolo Valente2017-10-031-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If many queues belonging to the same group happen to be created shortly after each other, then the concurrent processes associated with these queues have typically a common goal, and they get it done as soon as possible if not hampered by device idling. Examples are processes spawned by git grep, or by systemd during boot. As for device idling, this mechanism is currently necessary for weight raising to succeed in its goal: privileging I/O. In view of these facts, BFQ does not provide the above queues with either weight raising or device idling. On the other hand, a burst of queue creations may be caused also by the start-up of a complex application. In this case, these queues need usually to be served one after the other, and as quickly as possible, to maximise responsiveness. Therefore, in this case the best strategy is to weight-raise all the queues created during the burst, i.e., the exact opposite of the strategy for the above case. To distinguish between the two cases, BFQ uses an empirical burst-size threshold, found through extensive tests and monitoring of daily usage. Only large bursts, i.e., burst with a size above this threshold, are considered as generated by a high number of parallel processes. In this respect, upstart-based boot proved to be rather hard to detect as generating a large burst of queue creations, because with upstart most of the queues created in a burst exit *before* the next queues in the same burst are created. To address this issue, I changed the burst-detection mechanism so as to not decrease the size of the current burst even if one of the queues in the burst is eliminated. Unfortunately, this missing decrease causes false positives on very fast systems: on the start-up of a complex application, such as libreoffice writer, so many queues are created, served and exited shortly after each other, that a large burst of queue creations is wrongly detected as occurring. These false positives just disappear if the size of a burst is decreased when one of the queues in the burst exits. This commit restores the missing burst-size decrease, relying of the fact that upstart is apparently unlikely to be used on systems running this and future versions of the kernel. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it> Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Tested-by: Mirko Montanari <mirkomontanari91@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block, bfq: let early-merged queues be weight-raised on split tooPaolo Valente2017-10-031-5/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A just-created bfq_queue, say Q, may happen to be merged with another bfq_queue on the very first invocation of the function __bfq_insert_request. In such a case, even if Q would clearly deserve interactive weight raising (as it has just been created), the function bfq_add_request does not make it to be invoked for Q, and thus to activate weight raising for Q. As a consequence, when the state of Q is saved for a possible future restore, after a split of Q from the other bfq_queue(s), such a state happens to be (unjustly) non-weight-raised. Then the bfq_queue will not enjoy any weight raising on the split, even if should still be in an interactive weight-raising period when the split occurs. This commit solves this problem as follows, for a just-created bfq_queue that is being early-merged: it stores directly, in the saved state of the bfq_queue, the weight-raising state that would have been assigned to the bfq_queue if not early-merged. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Angelo Ruocco <angeloruocco90@gmail.com> Tested-by: Mirko Montanari <mirkomontanari91@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block, bfq: check and switch back to interactive wr also on queue splitPaolo Valente2017-10-031-38/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As already explained in the message of commit "block, bfq: fix wrong init of saved start time for weight raising", if a soft real-time weight-raising period happens to be nested in a larger interactive weight-raising period, then BFQ restores the interactive weight raising at the end of the soft real-time weight raising. In particular, BFQ checks whether the latter has ended only on request dispatches. Unfortunately, the above scheme fails to restore interactive weight raising in the following corner case: if a bfq_queue, say Q, 1) Is merged with another bfq_queue while it is in a nested soft real-time weight-raising period. The weight-raising state of Q is then saved, and not considered any longer until a split occurs. 2) Is split from the other bfq_queue(s) at a time instant when its soft real-time weight raising is already finished. On the split, while resuming the previous, soft real-time weight-raised state of the bfq_queue Q, BFQ checks whether the current soft real-time weight-raising period is actually over. If so, BFQ switches weight raising off for Q, *without* checking whether the soft real-time period was actually nested in a non-yet-finished interactive weight-raising period. This commit addresses this issue by adding the above missing check in bfq_queue splits, and restoring interactive weight raising if needed. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Angelo Ruocco <angeloruocco90@gmail.com> Tested-by: Mirko Montanari <mirkomontanari91@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block, bfq: fix wrong init of saved start time for weight raisingPaolo Valente2017-10-031-19/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit fixes a bug that causes bfq to fail to guarantee a high responsiveness on some drives, if there is heavy random read+write I/O in the background. More precisely, such a failure allowed this bug to be found [1], but the bug may well cause other yet unreported anomalies. BFQ raises the weight of the bfq_queues associated with soft real-time applications, to privilege the I/O, and thus reduce latency, for these applications. This mechanism is named soft-real-time weight raising in BFQ. A soft real-time period may happen to be nested into an interactive weight raising period, i.e., it may happen that, when a bfq_queue switches to a soft real-time weight-raised state, the bfq_queue is already being weight-raised because deemed interactive too. In this case, BFQ saves in a special variable wr_start_at_switch_to_srt, the time instant when the interactive weight-raising period started for the bfq_queue, i.e., the time instant when BFQ started to deem the bfq_queue interactive. This value is then used to check whether the interactive weight-raising period would still be in progress when the soft real-time weight-raising period ends. If so, interactive weight raising is restored for the bfq_queue. This restore is useful, in particular, because it prevents bfq_queues from losing their interactive weight raising prematurely, as a consequence of spurious, short-lived soft real-time weight-raising periods caused by wrong detections as soft real-time. If, instead, a bfq_queue switches to soft-real-time weight raising while it *is not* already in an interactive weight-raising period, then the variable wr_start_at_switch_to_srt has no meaning during the following soft real-time weight-raising period. Unfortunately the handling of this case is wrong in BFQ: not only the variable is not flagged somehow as meaningless, but it is also set to the time when the switch to soft real-time weight-raising occurs. This may cause an interactive weight-raising period to be considered mistakenly as still in progress, and thus a spurious interactive weight-raising period to start for the bfq_queue, at the end of the soft-real-time weight-raising period. In particular the spurious interactive weight-raising period will be considered as still in progress, if the soft-real-time weight-raising period does not last very long. The bfq_queue will then be wrongly privileged and, if I/O bound, will unjustly steal bandwidth to truly interactive or soft real-time bfq_queues, harming responsiveness and low latency. This commit fixes this issue by just setting wr_start_at_switch_to_srt to minus infinity (farthest past time instant according to jiffies macros): when the soft-real-time weight-raising period ends, certainly no interactive weight-raising period will be considered as still in progress. [1] Background I/O Type: Random - Background I/O mix: Reads and writes - Application to start: LibreOffice Writer in http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Tested-by: Mirko Montanari <mirkomontanari91@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blk-mq: wire up completion notifier for laptop modeJens Axboe2017-10-031-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For some reason, the laptop mode IO completion notified was never wired up for blk-mq. Ensure that we trigger the callback appropriately, to arm the laptop mode flush timer. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blk-mq-tag: kill unused tag enumsJens Axboe2017-10-011-6/+1
| | | | | | | | | | | | | | | | | | | | | We don't have any notion of a tagging cache anymore, and haven't for a long time. Kill off the unused enums. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blk-mq: remove unused function hctx_allow_mergesweiping zhang2017-09-301-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | since 9bddeb2a5b981 "blk-mq: make per-sw-queue bio merge as default .bio_merge" there is no caller for this function. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blkcg: delete unused APIsShaohua Li2017-09-261-31/+0
| |/ | | | | | | | | | | | | | | Nobody uses the APIs right now. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge branch 'linus' into locking/core, to resolve conflictsIngo Molnar2017-11-0743-0/+43
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: include/linux/compiler-clang.h include/linux/compiler-gcc.h include/linux/compiler-intel.h include/uapi/linux/stddef.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman2017-11-0243-0/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | block, locking/lockdep: Assign a lock_class per gendisk used for ↵Byungchul Park2017-10-262-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wait_for_completion() Darrick posted the following warning and Dave Chinner analyzed it: > ====================================================== > WARNING: possible circular locking dependency detected > 4.14.0-rc1-fixes #1 Tainted: G W > ------------------------------------------------------ > loop0/31693 is trying to acquire lock: > (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs] > > but now in release context of a crosslock acquired at the following: > ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0 > > which lock already depends on the new lock. > > the existing dependency chain (in reverse order) is: > > -> #2 ((complete)&ret.event){+.+.}: > lock_acquire+0xab/0x200 > wait_for_completion_io+0x4e/0x1a0 > submit_bio_wait+0x7f/0xb0 > blkdev_issue_zeroout+0x71/0xa0 > xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs] > xfs_bmapi_write+0x374/0x11f0 [xfs] > xfs_iomap_write_direct+0x2ac/0x430 [xfs] > xfs_file_iomap_begin+0x20d/0xd50 [xfs] > iomap_apply+0x43/0xe0 > dax_iomap_rw+0x89/0xf0 > xfs_file_dax_write+0xcc/0x220 [xfs] > xfs_file_write_iter+0xf0/0x130 [xfs] > __vfs_write+0xd9/0x150 > vfs_write+0xc8/0x1c0 > SyS_write+0x45/0xa0 > entry_SYSCALL_64_fastpath+0x1f/0xbe > > -> #1 (&xfs_nondir_ilock_class){++++}: > lock_acquire+0xab/0x200 > down_write_nested+0x4a/0xb0 > xfs_ilock+0x263/0x330 [xfs] > xfs_setattr_size+0x152/0x370 [xfs] > xfs_vn_setattr+0x6b/0x90 [xfs] > notify_change+0x27d/0x3f0 > do_truncate+0x5b/0x90 > path_openat+0x237/0xa90 > do_filp_open+0x8a/0xf0 > do_sys_open+0x11c/0x1f0 > entry_SYSCALL_64_fastpath+0x1f/0xbe > > -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}: > up_write+0x1c/0x40 > xfs_iunlock+0x1d0/0x310 [xfs] > xfs_file_fallocate+0x8a/0x310 [xfs] > loop_queue_work+0xb7/0x8d0 > kthread_worker_fn+0xb9/0x1f0 > > Chain exists of: > &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event > > Possible unsafe locking scenario by crosslock: > > CPU0 CPU1 > ---- ---- > lock(&xfs_nondir_ilock_class); > lock((complete)&ret.event); > lock(&(&ip->i_mmaplock)->mr_lock); > unlock((complete)&ret.event); > > *** DEADLOCK *** The warning is a false positive, caused by the fact that all wait_for_completion()s in submit_bio_wait() are waiting with the same lock class. However, some bios have nothing to do with others, for example in the case of loop devices, there's no direct connection between the bios of an upper device and the bios of a lower device(=loop device). The safest way to assign different lock classes to different devices is to do it for each gendisk. In other words, this patch assigns a lockdep_map per gendisk and uses it when initializing completion in submit_bio_wait(). Analyzed-by: Dave Chinner <david@fromorbit.com> Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Byungchul Park <byungchul.park@lge.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: amir73il@gmail.com Cc: axboe@kernel.dk Cc: david@fromorbit.com Cc: hch@infradead.org Cc: idryomov@gmail.com Cc: johan@kernel.org Cc: johannes.berg@intel.com Cc: kernel-team@lge.com Cc: linux-block@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-xfs@vger.kernel.org Cc: oleg@redhat.com Cc: tj@kernel.org Link: http://lkml.kernel.org/r/1508921765-15396-10-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | block: Use DECLARE_COMPLETION_ONSTACK() in submit_bio_wait()Christoph Hellwig2017-10-251-14/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify the code by getting rid of the submit_bio_ret structure. (This also helps address a lockdep false positive.) Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: amir73il@gmail.com Cc: axboe@kernel.dk Cc: darrick.wong@oracle.com Cc: david@fromorbit.com Cc: hch@infradead.org Cc: idryomov@gmail.com Cc: johan@kernel.org Cc: johannes.berg@intel.com Cc: kernel-team@lge.com Cc: linux-block@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-xfs@vger.kernel.org Cc: oleg@redhat.com Cc: tj@kernel.org Link: http://lkml.kernel.org/r/1508921765-15396-2-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns ↵Mark Rutland2017-10-251-1/+1
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | bio_copy_user_iov(): don't ignore ->iov_offsetAl Viro2017-10-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | Since "block: support large requests in blk_rq_map_user_iov" we started to call it with partially drained iter; that works fine on the write side, but reads create a copy of iter for completion time. And that needs to take the possibility of ->iov_iter != 0 into account... Cc: stable@vger.kernel.org #v4.5+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | more bio_map_user_iov() leak fixesAl Viro2017-10-101-5/+9
| | | | | | | | | | | | | | | | | | | | we need to take care of failure exit as well - pages already in bio should be dropped by analogue of bio_unmap_pages(), since their refcounts had been bumped only once per reference in bio. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | fix unbalanced page refcounting in bio_map_user_iovVitaly Mayatskikh2017-10-101-0/+8
| | | | | | | | | | | | | | | | | | | | | | bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if IO vector has small consecutive buffers belonging to the same page. bio_add_pc_page merges them into one, but the page reference is never dropped. Cc: stable@vger.kernel.org Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | bsg-lib: fix use-after-free under memory-pressureBenjamin Block2017-10-041-6/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When under memory-pressure it is possible that the mempool which backs the 'struct request_queue' will make use of up to BLKDEV_MIN_RQ count emergency buffers - in case it can't get a regular allocation. These buffers are preallocated and once they are also used, they are re-supplied with old finished requests from the same request_queue (see mempool_free()). The bug is, when re-supplying the emergency pool, the old requests are not again ran through the callback mempool_t->alloc(), and thus also not through the callback bsg_init_rq(). Thus we skip initialization, and while the sense-buffer still should be good, scsi_request->cmd might have become to be an invalid pointer in the meantime. When the request is initialized in bsg.c, and the user's CDB is larger than BLK_MAX_CDB, bsg will replace it with a custom allocated buffer, which is freed when the user's command is finished, thus it dangles afterwards. When next a command is sent by the user that has a smaller/similar CDB as BLK_MAX_CDB, bsg will assume that scsi_request->cmd is backed by scsi_request->__cmd, will not make a custom allocation, and write into undefined memory. Fix this by splitting bsg_init_rq() into two functions: - bsg_init_rq() is changed to only do the allocation of the sense-buffer, which is used to back the bsg job's reply buffer. This pointer should never change during the lifetime of a scsi_request, so it doesn't need re-initialization. - bsg_initialize_rq() is a new function that makes use of 'struct request_queue's initialize_rq_fn callback (which was introduced in v4.12). This is always called before the request is given out via blk_get_request(). This function does the remaining initialization that was previously done in bsg_init_rq(), and will also do it when the request is taken from the emergency-pool of the backing mempool. Fixes: 50b4d485528d ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer") Cc: <stable@vger.kernel.org> # 4.11+ Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq-debugfs: fix device sched directory for default schedulerOmar Sandoval2017-10-031-1/+5
| | | | | | | | | | | | | | | | | | | | | | In blk_mq_debugfs_register(), I remembered to set up the per-hctx sched directories if a default scheduler was already configured by blk_mq_sched_init() from blk_mq_init_allocated_queue(), but I didn't do the same for the device-wide sched directory. Fix it. Fixes: d332ce091813 ("blk-mq-debugfs: allow schedulers to register debugfs attributes") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-throttle: fix possible io stall when upgrade to maxJoseph Qi2017-10-031-2/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a case which will lead to io stall. The case is described as follows. /test1 |-subtest1 /test2 |-subtest2 And subtest1 and subtest2 each has 32 queued bios already. Now upgrade to max. In throtl_upgrade_state, it will try to dispatch bios as follows: 1) tg=subtest1, do nothing; 2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending left, no need to schedule next dispatch; 3) tg=subtest2, do nothing; 4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending left, no need to schedule next dispatch; 5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2 to /; note that test1 and test2 each still has 16 queued bios left; 6) tg=/, try to schedule next dispatch, but since disptime is now (update in tg_update_disptime, wait=0), pending timer is not scheduled in fact; 7) In throtl_upgrade_state it totally dispatches 32 queued bios and with 32 left. test1 and test2 each has 16 queued bios; 8) throtl_pending_timer_fn sees the left over bios, but could do nothing, because throtl_select_dispatch returns 0, and test1/test2 has no pending tg. The blktrace shows the following: 8,32 0 0 2.539007641 0 m N throtl upgrade to max 8,32 0 0 2.539072267 0 m N throtl /test2 dispatch nr_queued=16 read=0 write=16 8,32 7 0 2.539077142 0 m N throtl /test1 dispatch nr_queued=16 read=0 write=16 So force schedule dispatch if there are pending children. Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: fix a crash caused by wrong APIShaohua Li2017-09-251-1/+1
| | | | | | | | | | | part_stat_show takes a part device not a disk, so we should use part_to_disk. Fixes: d62e26b3ffd2("block: pass in queue to inflight accounting") Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blktrace: Fix potential deadlock between delete & sysfs opsWaiman Long2017-09-251-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The lockdep code had reported the following unsafe locking scenario: CPU0 CPU1 ---- ---- lock(s_active#228); lock(&bdev->bd_mutex/1); lock(s_active#228); lock(&bdev->bd_mutex); *** DEADLOCK *** The deadlock may happen when one task (CPU1) is trying to delete a partition in a block device and another task (CPU0) is accessing tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that partition. The s_active isn't an actual lock. It is a reference count (kn->count) on the sysfs (kernfs) file. Removal of a sysfs file, however, require a wait until all the references are gone. The reference count is treated like a rwsem using lockdep instrumentation code. The fact that a thread is in the sysfs callback method or in the ioctl call means there is a reference to the opended sysfs or device file. That should prevent the underlying block structure from being removed. Instead of using bd_mutex in the block_device structure, a new blk_trace_mutex is now added to the request_queue structure to protect access to the blk_trace structure. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Fix typo in patch subject line, and prune a comment detailing how the code used to work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bsg-lib: don't free job in bsg_prepare_jobChristoph Hellwig2017-09-251-1/+0
| | | | | | | | | | The job structure is allocated as part of the request, so we should not free it in the error path of bsg_prepare_job. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: directly insert blk-mq request from blk_insert_cloned_request()Jens Axboe2017-09-113-1/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A NULL pointer crash was reported for the case of having the BFQ IO scheduler attached to the underlying blk-mq paths of a DM multipath device. The crash occured in blk_mq_sched_insert_request()'s call to e->type->ops.mq.insert_requests(). Paolo Valente correctly summarized why the crash occured with: "the call chain (dm_mq_queue_rq -> map_request -> setup_clone -> blk_rq_prep_clone) creates a cloned request without invoking e->type->ops.mq.prepare_request for the target elevator e. The cloned request is therefore not initialized for the scheduler, but it is however inserted into the scheduler by blk_mq_sched_insert_request." All said, a request-based DM multipath device's IO scheduler should be the only one used -- when the original requests are issued to the underlying paths as cloned requests they are inserted directly in the underlying dispatch queue(s) rather than through an additional elevator. But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO schedulers") switched blk_insert_cloned_request() from using blk_mq_insert_request() to blk_mq_sched_insert_request(). Which incorrectly added elevator machinery into a call chain that isn't supposed to have any. To fix this introduce a blk-mq private blk_mq_request_bypass_insert() that blk_insert_cloned_request() calls to insert the request without involving any elevator that may be attached to the cloned request's request_queue. Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers") Cc: stable@vger.kernel.org Reported-by: Bart Van Assche <Bart.VanAssche@wdc.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: fix integer overflow in __blkdev_sectors_to_bio_pages()Mikulas Patocka2017-09-111-2/+2
| | | | | | | | | | Fix possible integer overflow in __blkdev_sectors_to_bio_pages if sector_t is 32-bit. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 615d22a51c04 ("block: Fix __blkdev_issue_zeroout loop") Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: sed-opal: Set MBRDone on S3 resume path if TPER is MBREnabledScott Bauer2017-09-112-0/+33
| | | | | | | | | | | | | | | | | | | Users who are booting off their Opal enabled drives are having issues when they have a shadow MBR set up after s3/resume cycle. When the Drive has a shadow MBR setup the MBRDone flag is set to false upon power loss (S3/S4/S5). When the MBRDone flag is false I/O to LBA 0 -> LBA_END_MBR are remapped to the shadow mbr of the drive. If the drive contains useful data in the 0 -> end_mbr range upon s3 resume the user can never get to that data as the drive will keep remapping it to the MBR. To fix this when we unlock on S3 resume, we need to tell the drive that we're done with the shadow mbr (even though we didnt use it) by setting true to MBRDone. This way the drive will stop the remapping and the user can access their data. Acked-by Jon Derrick: <jonathan.derrick@intel.com> Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge branch 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-blockLinus Torvalds2017-09-094-72/+116
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull followup block layer updates from Jens Axboe: "I ended up splitting the main pull request for this series into two, mainly because of clashes between NVMe fixes that went into 4.13 after the for-4.14 branches were split off. This pull request is mostly NVMe, but not exclusively. In detail, it contains: - Two pull request for NVMe changes from Christoph. Nothing new on the feature front, basically just fixes all over the map for the core bits, transport, rdma, etc. - Series from Bart, cleaning up various bits in the BFQ scheduler. - Series of bcache fixes, which has been lingering for a release or two. Coly sent this in, but patches from various people in this area. - Set of patches for BFQ from Paolo himself, updating both documentation and fixing some corner cases in performance. - Series from Omar, attempting to now get the 4k loop support correct. Our confidence level is higher this time. - Series from Shaohua for loop as well, improving O_DIRECT performance and fixing a use-after-free" * 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block: (74 commits) bcache: initialize dirty stripes in flash_dev_run() loop: set physical block size to logical block size bcache: fix bch_hprint crash and improve output bcache: Update continue_at() documentation bcache: silence static checker warning bcache: fix for gc and write-back race bcache: increase the number of open buckets bcache: Correct return value for sysfs attach errors bcache: correct cache_dirty_target in __update_writeback_rate() bcache: gc does not work when triggering by manual command bcache: Don't reinvent the wheel but use existing llist API bcache: do not subtract sectors_to_gc for bypassed IO bcache: fix sequential large write IO bypass bcache: Fix leak of bdev reference block/loop: remove unused field block/loop: fix use after free bfq: Use icq_to_bic() consistently bfq: Suppress compiler warnings about comparisons bfq: Check kstrtoul() return value bfq: Declare local functions static ...
| * bfq: Use icq_to_bic() consistentlyBart Van Assche2017-09-011-1/+1
| | | | | | | | | | | | | | | | | | | | Some code uses icq_to_bic() to convert an io_cq pointer to a bfq_io_cq pointer while other code uses a direct cast. Convert the code that uses a direct cast such that it uses icq_to_bic(). Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * bfq: Suppress compiler warnings about comparisonsBart Van Assche2017-09-011-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch avoids that the following warnings are reported when building with W=1: block/bfq-iosched.c: In function 'bfq_back_seek_max_store': block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/bfq-iosched.c:4876:1: note: in expansion of macro 'STORE_FUNCTION' STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0); ^~~~~~~~~~~~~~ block/bfq-iosched.c: In function 'bfq_slice_idle_store': block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/bfq-iosched.c:4879:1: note: in expansion of macro 'STORE_FUNCTION' STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2); ^~~~~~~~~~~~~~ block/bfq-iosched.c: In function 'bfq_slice_idle_us_store': block/bfq-iosched.c:4892:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/bfq-iosched.c:4899:1: note: in expansion of macro 'USEC_STORE_FUNCTION' USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0, ^~~~~~~~~~~~~~~~~~~ Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * bfq: Check kstrtoul() return valueBart Van Assche2017-09-011-15/+37
| | | | | | | | | | | | | | | | | | | | Make sysfs writes fail for invalid numbers instead of storing uninitialized data copied from the stack. This patch removes all uninitialized_var() occurrences from the BFQ source code. Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * bfq: Declare local functions staticBart Van Assche2017-09-011-9/+9
| | | | | | | | | | | | Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * bfq: Annotate fall-through in a switch statementBart Van Assche2017-09-011-0/+1
| | | | | | | | | | | | | | | | | | This patch avoids that gcc 7 issues a warning about fall-through when building with W=1. Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
OpenPOWER on IntegriCloud