summaryrefslogtreecommitdiffstats
path: root/block
Commit message (Collapse)AuthorAgeFilesLines
* block: fix use-after-free in sys_ioprio_get()Omar Sandoval2016-07-011-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | get_task_ioprio() accesses the task->io_context without holding the task lock and thus can race with exit_io_context(), leading to a use-after-free. The reproducer below hits this within a few seconds on my 4-core QEMU VM: #define _GNU_SOURCE #include <assert.h> #include <unistd.h> #include <sys/syscall.h> #include <sys/wait.h> int main(int argc, char **argv) { pid_t pid, child; long nproc, i; /* ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); */ syscall(SYS_ioprio_set, 1, 0, 0x6000); nproc = sysconf(_SC_NPROCESSORS_ONLN); for (i = 0; i < nproc; i++) { pid = fork(); assert(pid != -1); if (pid == 0) { for (;;) { pid = fork(); assert(pid != -1); if (pid == 0) { _exit(0); } else { child = wait(NULL); assert(child == pid); } } } pid = fork(); assert(pid != -1); if (pid == 0) { for (;;) { /* ioprio_get(IOPRIO_WHO_PGRP, 0); */ syscall(SYS_ioprio_get, 2, 0); } } } for (;;) { /* ioprio_get(IOPRIO_WHO_PGRP, 0); */ syscall(SYS_ioprio_get, 2, 0); } return 0; } This gets us KASAN dumps like this: [ 35.526914] ================================================================== [ 35.530009] BUG: KASAN: out-of-bounds in get_task_ioprio+0x7b/0x90 at addr ffff880066f34e6c [ 35.530009] Read of size 2 by task ioprio-gpf/363 [ 35.530009] ============================================================================= [ 35.530009] BUG blkdev_ioc (Not tainted): kasan: bad access detected [ 35.530009] ----------------------------------------------------------------------------- [ 35.530009] Disabling lock debugging due to kernel taint [ 35.530009] INFO: Allocated in create_task_io_context+0x2b/0x370 age=0 cpu=0 pid=360 [ 35.530009] ___slab_alloc+0x55d/0x5a0 [ 35.530009] __slab_alloc.isra.20+0x2b/0x40 [ 35.530009] kmem_cache_alloc_node+0x84/0x200 [ 35.530009] create_task_io_context+0x2b/0x370 [ 35.530009] get_task_io_context+0x92/0xb0 [ 35.530009] copy_process.part.8+0x5029/0x5660 [ 35.530009] _do_fork+0x155/0x7e0 [ 35.530009] SyS_clone+0x19/0x20 [ 35.530009] do_syscall_64+0x195/0x3a0 [ 35.530009] return_from_SYSCALL_64+0x0/0x6a [ 35.530009] INFO: Freed in put_io_context+0xe7/0x120 age=0 cpu=0 pid=1060 [ 35.530009] __slab_free+0x27b/0x3d0 [ 35.530009] kmem_cache_free+0x1fb/0x220 [ 35.530009] put_io_context+0xe7/0x120 [ 35.530009] put_io_context_active+0x238/0x380 [ 35.530009] exit_io_context+0x66/0x80 [ 35.530009] do_exit+0x158e/0x2b90 [ 35.530009] do_group_exit+0xe5/0x2b0 [ 35.530009] SyS_exit_group+0x1d/0x20 [ 35.530009] entry_SYSCALL_64_fastpath+0x1a/0xa4 [ 35.530009] INFO: Slab 0xffffea00019bcd00 objects=20 used=4 fp=0xffff880066f34ff0 flags=0x1fffe0000004080 [ 35.530009] INFO: Object 0xffff880066f34e58 @offset=3672 fp=0x0000000000000001 [ 35.530009] ================================================================== Fix it by grabbing the task lock while we poke at the io_context. Cc: stable@vger.kernel.org Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: missing bio_put following submit_bio_waitShaun Tancheff2016-06-071-3/+9
| | | | | | | | | | | | | | | submit_bio_wait() gives the caller an opportunity to examine struct bio and so expects the caller to issue the put_bio() This fixes a memory leak reported by a few people in 4.7-rc2 kmemleak report after 9082e87bfbf8 ("block: remove struct bio_batch") Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com> Tested-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Larry Finger@lwfinger.net Tested-by: David Drysdale <drysdale@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: really fix plug list flushing for nomerge queuesOmar Sandoval2016-06-021-9/+8
| | | | | | | | | | | | | | | Commit 0809e3ac6231 ("block: fix plug list flushing for nomerge queues") updated blk_mq_make_request() to set request_count even when blk_queue_nomerges() returns true. However, blk_mq_make_request() only does limited plugging and doesn't use request_count; blk_sq_make_request() is the one that should have been fixed. Do that and get rid of the unnecessary work in the mq version. Fixes: 0809e3ac6231 ("block: fix plug list flushing for nomerge queues") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2016-05-271-1/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "A set of fixes that wasn't included in the first merge window pull request. This pull request contains: - A set of NVMe fixes from Keith, and one from Nic for the integrity side of it. - Fix from Ming, clearing ->mq_ops if we don't successfully setup a queue for multiqueue. - A set of stability fixes for bcache from Jiri, and also marking bcache as orphaned as it's no longer actively maintained (in mainline, at least)" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: clear q->mq_ops if init fail MAINTAINERS: mark bcache as orphan bcache: bch_gc_thread() is not freezable bcache: bch_allocator_thread() is not freezable bcache: bch_writeback_thread() is not freezable nvme/host: Add missing blk_integrity tag_size + flags assignments NVMe: Add device ID's with stripe quirk NVMe: Short-cut removal on surprise hot-unplug NVMe: Allow user initiated rescan NVMe: Reduce driver log spamming NVMe: Unbind driver on failure NVMe: Delete only created queues NVMe: Allocate queues only for online cpus
| * blk-mq: clear q->mq_ops if init failMing Lin2016-05-261-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_init_queue() calls blk_mq_init_allocated_queue(), but q->mq_ops was not cleared when blk_mq_init_allocated_queue() fails. Then blk_cleanup_queue() calls blk_mq_free_queue() which will crash because: - q->all_q_node is not added to all_q_list yet - q->tag_set is NULL - hctx was not setup yet or already freed Fixed it by clearing q->mq_ops on error path. Signed-off-by: Ming Lin <ming.l@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge tag 'dax-misc-for-4.7' of ↵Linus Torvalds2016-05-261-1/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull misc DAX updates from Vishal Verma: "DAX error handling for 4.7 - Until now, dax has been disabled if media errors were found on any device. This enables the use of DAX in the presence of these errors by making all sector-aligned zeroing go through the driver. - The driver (already) has the ability to clear errors on writes that are sent through the block layer using 'DSMs' defined in ACPI 6.1. Other misc changes: - When mounting DAX filesystems, check to make sure the partition is page aligned. This is a requirement for DAX, and previously, we allowed such unaligned mounts to succeed, but subsequent reads/writes would fail. - Misc/cleanup fixes from Jan that remove unused code from DAX related to zeroing, writeback, and some size checks" * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: fix a comment in dax_zero_page_range and dax_truncate_page dax: for truncate/hole-punch, do zeroing through the driver if possible dax: export a low-level __dax_zero_page_range helper dax: use sb_issue_zerout instead of calling dax_clear_sectors dax: enable dax in the presence of known media errors (badblocks) dax: fallback from pmd to pte on error block: Update blkdev_dax_capable() for consistency xfs: Add alignment check for DAX mount ext2: Add alignment check for DAX mount ext4: Add alignment check for DAX mount block: Add bdev_dax_supported() for dax mount checks block: Add vfs_msg() interface dax: Remove redundant inode size checks dax: Remove pointless writeback from dax_do_io() dax: Remove zeroing from dax_io() dax: Remove dead zeroing code from fault handlers ext2: Avoid DAX zeroing to corrupt data ext2: Fix block zeroing in ext2_get_blocks() for DAX dax: Remove complete_unwritten argument DAX: move RADIX_DAX_ definitions to dax.c
| * | block: Update blkdev_dax_capable() for consistencyToshi Kani2016-05-171-30/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blkdev_dax_capable() is similar to bdev_dax_supported(), but needs to remain as a separate interface for checking dax capability of a raw block device. Rename and relocate blkdev_dax_capable() to keep them maintained consistently, and call bdev_direct_access() for the dax capability check. There is no change in the behavior. Link: https://lkml.org/lkml/2016/5/9/950 Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@fb.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
* | | Merge tag 'libnvdimm-for-4.7' of ↵Linus Torvalds2016-05-231-32/+0
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this update was stabilized before the merge window and appeared in -next. The "device dax" implementation was revised this week in response to review feedback, and to address failures detected by the recently expanded ndctl unit test suite. Not included in this pull request are two dax topic branches (dax error handling, and dax radix-tree locking). These topics were deferred to get a few more days of -next integration testing, and to coordinate a branch baseline with Ted and the ext4 tree. Vishal and Ross will send the error handling and locking topics respectively in the next few days. This branch has received a positive build result from the kbuild robot across 226 configs. Summary: - Device DAX for persistent memory: Device DAX is the device-centric analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped without need of an intervening file system. Device DAX is strict, precise and predictable. Specifically this interface: a) Guarantees fault granularity with respect to a given page size (pte, pmd, or pud) set at configuration time. b) Enforces deterministic behavior by being strict about what fault scenarios are supported. Persistent memory is the first target, but the mechanism is also targeted for exclusive allocations of performance/feature differentiated memory ranges. - Support for the HPE DSM (device specific method) command formats. This enables management of these first generation devices until a unified DSM specification materializes. - Further ACPI 6.1 compliance with support for the common dimm identifier format. - Various fixes and cleanups across the subsystem" * tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits) libnvdimm, dax: fix deletion libnvdimm, dax: fix alignment validation libnvdimm, dax: autodetect support libnvdimm: release ida resources Revert "block: enable dax for raw block devices" /dev/dax, core: file operations and dax-mmap /dev/dax, pmem: direct access to persistent memory libnvdimm: stop requiring a driver ->remove() method libnvdimm, dax: record the specified alignment of a dax-device instance libnvdimm, dax: reserve space to store labels for device-dax libnvdimm, dax: introduce device-dax infrastructure nfit: add sysfs dimm 'family' and 'dsm_mask' attributes tools/testing/nvdimm: ND_CMD_CALL support nfit: disable vendor specific commands nfit: export subsystem ids as attributes nfit: fix format interface code byte order per ACPI6.1 nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism nfit, libnvdimm: clarify "commands" vs "_DSMs" libnvdimm: increase max envelope size for ioctl acpi/nfit: Add sysfs "id" for NVDIMM ID ...
| * | | Revert "block: enable dax for raw block devices"Dan Williams2016-05-201-32/+0
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 5a023cdba50c5f5f2bc351783b3131699deb3937. The functionality is superseded by the new "Device DAX" facility. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jan Kara <jack@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | | block/partitions/ldm.c: use generic UUID libraryAndy Shevchenko2016-05-201-56/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of opencoding let's use generic UUID library functions here. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: "Richard Russon (FlatCap)" <ldm@flatcap.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2016-05-171-2/+2
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (21 commits) gitignore: fix wording mfd: ab8500-debugfs: fix "between" in printk memstick: trivial fix of spelling mistake on management cpupowerutils: bench: fix "average" treewide: Fix typos in printk IB/mlx4: printk fix pinctrl: sirf/atlas7: fix printk spelling serial: mctrl_gpio: Grammar s/lines GPIOs/line GPIOs/, /sets/set/ w1: comment spelling s/minmum/minimum/ Blackfin: comment spelling s/divsor/divisor/ metag: Fix misspellings in comments. ia64: Fix misspellings in comments. hexagon: Fix misspellings in comments. tools/perf: Fix misspellings in comments. cris: Fix misspellings in comments. c6x: Fix misspellings in comments. blackfin: Fix misspelling of 'register' in comment. avr32: Fix misspelling of 'definitions' in comment. treewide: Fix typos in printk Doc: treewide : Fix typos in DocBook/filesystem.xml ...
| * | treewide: Fix typos in printkMasanari Iida2016-04-181-2/+2
| |/ | | | | | | | | | | | | | | | | This patch fix spelling typos found in printk within various part of the kernel sources. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds2016-05-174-37/+20
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: "On top of the core pull request, this is the drivers pull request for this merge window. This contains: - Switch drivers to the new write back cache API, and kill off the flush flags. From me. - Kill the discard support for the STEC pci-e flash driver. It's trivially broken, and apparently unmaintained, so it's safer to just remove it. From Jeff Moyer. - A set of lightnvm updates from the usual suspects (Matias/Javier, and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei Tao. - A set of updates for NVMe: - Turn the controller state management into a proper state machine. From Christoph. - Shuffling of code in preparation for NVMe-over-fabrics, also from Christoph. - Cleanup of the command prep part from Ming Lin. - Rewrite of the discard support from Ming Lin. - Deadlock fix for namespace removal from Ming Lin. - Use the now exported blk-mq tag helper for IO termination. From Sagi. - Various little fixes from Christoph, Guilherme, Keith, Ming Lin, Wang Sheng-Hui. - Convert mtip32xx to use the now exported blk-mq tag iter function, from Keith" * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits) lightnvm: reserved space calculation incorrect lightnvm: rename nr_pages to nr_ppas on nvm_rq lightnvm: add is_cached entry to struct ppa_addr lightnvm: expose gennvm_mark_blk to targets lightnvm: remove mgt targets on mgt removal lightnvm: pass dma address to hardware rather than pointer lightnvm: do not assume sequential lun alloc. nvme/lightnvm: Log using the ctrl named device lightnvm: rename dma helper functions lightnvm: enable metadata to be sent to device lightnvm: do not free unused metadata on rrpc lightnvm: fix out of bound ppa lun id on bb tbl lightnvm: refactor set_bb_tbl for accepting ppa list lightnvm: move responsibility for bad blk mgmt to target lightnvm: make nvm_set_rqd_ppalist() aware of vblks lightnvm: remove struct factory_blks lightnvm: refactor device ops->get_bb_tbl() lightnvm: introduce nvm_for_each_lun_ppa() macro lightnvm: refactor dev->online_target to global nvm_targets lightnvm: rename nvm_targets to nvm_tgt_type ...
| * | block: kill off q->flush_flagsJens Axboe2016-04-133-14/+18
| | | | | | | | | | | | | | | | | | | | | | | | Now that we converted everything to the newer block write cache interface, kill off the queue flush_flags and queueable flush entries. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: kill blk_queue_flush()Jens Axboe2016-04-121-20/+0
| | | | | | | | | | | | | | | | | | | | | | | | We don't have any drivers left using it, so kill it off. Update documentation to use the newer blk_queue_write_cache(). Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * | Merge branch 'for-4.7/core' into for-4.7/driversJens Axboe2016-04-122-0/+65
| |\ \
| * | | blk-mq: Make blk_mq_all_tag_busy_iter staticSagi Grimberg2016-04-121-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No caller outside the blk-mq code so we can settle with it static. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | Merge branch 'for-4.7/core' of git://git.kernel.dk/linux-blockLinus Torvalds2016-05-178-133/+148
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block layer updates from Jens Axboe: "This is the core block IO changes for this merge window. Nothing earth shattering in here, it's mostly just fixes. In detail: - Fix for a long standing issue where wrong ordering in blk-mq caused order_to_size() to spew a warning. From Bart. - Async discard support from Christoph. Basically just splitting our sync interface into a submit + wait part. - Add a cleaner interface for flagging whether a device has a write back cache or not. We've previously overloaded blk_queue_flush() with this, but let's make it more explicit. Drivers cleaned up and updated in the drivers pull request. From me. - Fix for a double check for whether IO accounting is enabled or not. From Michael Callahan. - Fix for the async discard from Mike Snitzer, reinstating the early EOPNOTSUPP return if the device doesn't support discards. - Also from Mike, export bio_inc_remaining() so dm can drop it's private copy of it. - From Ming Lin, add support for passing in an offset for request payloads. - Tag function export from Sagi, which will be used in NVMe in the drivers pull. - Two blktrace related fixes from Shaohua. - Propagate NOMERGE flag when making a request from a bio, also from Shaohua. - An optimization to not parse cgroup paths in blk-throttle, if we don't need to. From Shaohua" * 'for-4.7/core' of git://git.kernel.dk/linux-block: blk-mq: fix undefined behaviour in order_to_size() blk-throttle: don't parse cgroup path if trace isn't enabled blktrace: add missed mask name blktrace: delete garbage for message trace block: make bio_inc_remaining() interface accessible again block: reinstate early return of -EOPNOTSUPP from blkdev_issue_discard block: Minor blk_account_io_start usage cleanup block: add __blkdev_issue_discard block: remove struct bio_batch block: copy NOMERGE flag from bio to request block: add ability to flag write back caching on a device blk-mq: Export tagset iter function block: add offset in blk_add_request_payload() writeback: Fix performance regression in wb_over_bg_thresh()
| * | | | blk-mq: fix undefined behaviour in order_to_size()Bartlomiej Zolnierkiewicz2016-05-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When this_order variable in blk_mq_init_rq_map() becomes zero the code incorrectly decrements the variable and passes the result to order_to_size() helper causing undefined behaviour: UBSAN: Undefined behaviour in block/blk-mq.c:1459:27 shift exponent 4294967295 is too large for 32-bit type 'unsigned int' CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc6-00072-g33656a1 #22 Fix the code by checking this_order variable for not having the zero value first. Reported-by: Meelis Roos <mroos@linux.ee> Fixes: 320ae51feed5 ("blk-mq: new multi-queue block IO queueing mechanism") Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-throttle: don't parse cgroup path if trace isn't enabledShaohua Li2016-05-101-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | if trace isn't enabled, parsing cgroup path just wastes cpu Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | block: make bio_inc_remaining() interface accessible againMike Snitzer2016-05-051-11/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 326e1dbb57 ("block: remove management of bi_remaining when restoring original bi_end_io") made bio_inc_remaining() private to bio.c because the only use-case that made sense was confined to the bio_chain() interface. Since that time DM thinp went on to use bio_chain() in its relatively complex implementation of async discard support. That implementation, even when converted over to use the new async __blkdev_issue_discard() interface, depends on deferred completion of the original discard bio -- which is most appropriately implemented using bio_inc_remaining(). DM thinp foolishly duplicated bio_inc_remaining(), local to dm-thin.c as __bio_inc_remaining(), so re-exporting bio_inc_remaining() allows us to put an end to that foolishness. All said, bio_inc_remaining() should really only be used in conjunction with bio_chain(). It isn't intended for generic bio reference counting. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | block: reinstate early return of -EOPNOTSUPP from blkdev_issue_discardMike Snitzer2016-05-051-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 38f25255330 ("block: add __blkdev_issue_discard") incorrectly disallowed the early return of -EOPNOTSUPP if the device doesn't support discard (or secure discard). This early return of -EOPNOTSUPP has always been part of blkdev_issue_discard() interface so there isn't a good reason to break that behaviour -- especially when it can be easily reinstated. The nuance of allowing early return of -EOPNOTSUPP vs disallowing late return of -EOPNOTSUPP is: if the overall device never advertised support for discards and one is issued to the device it is beneficial to inform the caller that discards are not supported via -EOPNOTSUPP. But if a device advertises discard support it means that at least a subset of the device does have discard support -- but it could be that discards issued to some regions of a stacked device will not be supported. In that case the late return of -EOPNOTSUPP must be disallowed. Fixes: 38f25255330 ("block: add __blkdev_issue_discard") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | block: Minor blk_account_io_start usage cleanupMichael Callahan2016-05-031-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_account_io_start does not need to be wrapped with blk_do_io_stat ais it already checks for that condition. Signed-off-by: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | block: add __blkdev_issue_discardChristoph Hellwig2016-05-021-26/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a version of blkdev_issue_discard which doesn't wait for the I/O to complete, but instead allows the caller to submit the final bio and/or chain it to others. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagig@grimberg.me> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | block: remove struct bio_batchChristoph Hellwig2016-05-021-91/+27
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It can be replaced with a combination of bio_chain and submit_bio_wait. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagig@grimberg.me> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: add ability to flag write back caching on a deviceJens Axboe2016-04-122-0/+65
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an internal helper and flag for setting whether a queue has write back caching, or write through (or none). Add a sysfs file to show this as well, and make it changeable from user space. This will replace the (awkward) blk_queue_flush() interface that drivers currently use to inform the block layer of write cache state and capabilities. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * | blk-mq: Export tagset iter functionSagi Grimberg2016-04-121-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Its useful to iterate on all the active tags in cases where we will need to fail all the queues IO. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> [hch: carefully check for valid tagsets] Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: add offset in blk_add_request_payload()Ming Lin2016-04-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | We could kmalloc() the payload, so need the offset in page. Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | Merge branch 'ovl-fixes' into for-linusAl Viro2016-05-118-29/+36
|\ \ \ | | |/ | |/|
| * | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2016-04-151-3/+10
| |\ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "A few fixes for the current series. This contains: - Two fixes for NVMe: One fixes a reset race that can be triggered by repeated insert/removal of the module. The other fixes an issue on some platforms, where we get probe timeouts since legacy interrupts isn't working. This used not to be a problem since we had the worker thread poll for completions, but since that was killed off, it means those poor souls can't successfully probe their NVMe device. Use a proper IRQ check and probe (msi-x -> msi ->legacy), like most other drivers to work around this. Both from Keith. - A loop corruption issue with offset in iters, from Ming Lei. - A fix for not having the partition stat per cpu ref count initialized before sending out the KOBJ_ADD, which could cause user space to access the counter prior to initialization. Also from Ming Lei. - A fix for using the wrong congestion state, from Kaixu Xia" * 'for-linus' of git://git.kernel.dk/linux-block: block: loop: fix filesystem corruption in case of aio/dio NVMe: Always use MSI/MSI-x interrupts NVMe: Fix reset/remove race writeback: fix the wrong congested state variable definition block: partition: initialize percpuref before sending out KOBJ_ADD
| | * block: partition: initialize percpuref before sending out KOBJ_ADDMing Lei2016-03-291-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The initialization of partition's percpu_ref should have been done before sending out KOBJ_ADD uevent, which may cause userspace to read partition table. So the uninitialized percpu_ref may be accessed in data path. This patch fixes this issue reported by Naveen. Reported-by: Naveen Kaje <nkaje@codeaurora.org> Tested-by: Naveen Kaje <nkaje@codeaurora.org> Fixes: 6c71013ecb7e2(block: partition: convert percpu ref) Cc: <stable@vger.kernel.org> # v4.3+ Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usageKirill A. Shutemov2016-04-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov2016-04-048-24/+24
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | fix the copy vs. map logics in blk_rq_map_user_iov()Al Viro2016-04-081-39/+8
|/ | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2016-03-242-3/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "Final round of fixes for this merge window - some of this has come up after the initial pull request, and some of it was put in a post-merge branch before the merge window. This contains: - Fix for a bad check for an error on dma mapping in the mtip32xx driver, from Alexey Khoroshilov. - A set of fixes for lightnvm, from Javier, Matias, and Wenwei. - An NVMe completion record corruption fix from Marta, ensuring that we read things in the right order. - Two writeback fixes from Tejun, marked for stable@ as well. - A blk-mq sw queue iterator fix from Thomas, fixing an oops for sparse CPU maps. They hit this in the hot plug/unplug rework" * 'for-linus' of git://git.kernel.dk/linux-block: nvme: avoid cqe corruption when update at the same time as read writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list() blk-mq: Use proper cpumask iterator mtip32xx: fix checks for dma mapping errors lightnvm: do not load L2P table if not supported lightnvm: do not reserve lun on l2p loading nvme: lightnvm: return ppa completion status lightnvm: add a bitmap of luns lightnvm: specify target's logical address area null_blk: add lightnvm null_blk device to the nullb_list
| * blk-mq: Use proper cpumask iteratorThomas Gleixner2016-03-202-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | queue_for_each_ctx() iterates over per_cpu variables under the assumption that the possible cpu mask cannot have holes. That's wrong as all cpumasks can have holes. In case there are holes the iteration ends up accessing uninitialized memory and crashing as a result. Replace the macro by a proper for_each_possible_cpu() loop and drop the unused macro blk_ctx_sum() which references queue_for_each_ctx(). Reported-by: Xiong Zhou <jencce.kernel@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge branch 'for-4.6' of ↵Linus Torvalds2016-03-181-0/+24
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata updates from Tejun Heo: - ahci grew runtime power management support so that the controller can be turned off if no devices are attached. - sata_via isn't dead yet. It got hotplug support and more refined workaround for certain WD drives. - Misc cleanups. There's a merge from for-4.5-fixes to avoid confusing conflicts in ahci PCI ID table. * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: ata: ahci_xgene: dereferencing uninitialized pointer in probe AHCI: Remove obsolete Intel Lewisburg SATA RAID device IDs ata: sata_rcar: Use ARCH_RENESAS sata_via: Implement hotplug for VT6421 sata_via: Apply WD workaround only when needed on VT6421 ahci: Add runtime PM support for the host controller ahci: Add functions to manage runtime PM of AHCI ports ahci: Convert driver to use modern PM hooks ahci: Cache host controller version scsi: Drop runtime PM usage count after host is added scsi: Set request queue runtime PM status back to active on resume block: Add blk_set_runtime_active() ata: ahci_mvebu: add support for Armada 3700 variant libata: fix unbalanced spin_lock_irqsave/spin_unlock_irq() in ata_scsi_park_show() libata: support AHCI on OCTEON platform
| * block: Add blk_set_runtime_active()Mika Westerberg2016-02-191-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If block device is left runtime suspended during system suspend, resume hook of the driver typically corrects runtime PM status of the device back to "active" after it is resumed. However, this is not enough as queue's runtime PM status is still "suspended". As long as it is in this state blk_pm_peek_request() returns NULL and thus prevents new requests to be processed. Add new function blk_set_runtime_active() that can be used to force the queue status back to "active" as needed. Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org>
* | Merge branch 'for-4.6/core' of git://git.kernel.dk/linux-blockLinus Torvalds2016-03-185-109/+178
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block updates from Jens Axboe: "Here are the core block changes for this merge window. Not a lot of exciting stuff going on in this round, most of the changes have been on the driver side of things. That pull request is coming next. This pull request contains: - A set of fixes for chained bio handling from Christoph. - A tag bounds check for blk-mq from Hannes, ensuring that we don't do something stupid if a device reports an invalid tag value. - A set of fixes/updates for the CFQ IO scheduler from Jan Kara. - A set of blk-mq fixes from Keith, adding support for dynamic hardware queues, and fixing init of max_dev_sectors for stacking devices. - A fix for the dynamic hw context from Ming. - Enabling of cgroup writeback support on a block device, from Shaohua" * 'for-4.6/core' of git://git.kernel.dk/linux-block: blk-mq: add bounds check on tag-to-rq conversion block: bio_remaining_done() isn't unlikely block: cleanup bio_endio block: factor out chained bio completion block: don't unecessarily clobber bi_error for chained bios block-dev: enable writeback cgroup support blk-mq: Fix NULL pointer updating nr_requests blk-mq: mark request queue as mq asap block: Initialize max_dev_sectors to 0 blk-mq: dynamic h/w context count cfq-iosched: Allow parent cgroup to preempt its child cfq-iosched: Allow sync noidle workloads to preempt each other cfq-iosched: Reorder checks in cfq_should_preempt() cfq-iosched: Don't group_idle if cfqq has big thinktime
| * | blk-mq: add bounds check on tag-to-rq conversionHannes Reinecke2016-03-151-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to check for a valid index before accessing the array element to avoid accessing invalid memory regions. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Modified by Jens to drop the unlikely(), and make the fall through path be having a valid tag. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: bio_remaining_done() isn't unlikelyChristoph Hellwig2016-03-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We use bio chaining during most I/Os these days due to the delayed bio splitting. Additionally XFS will start using it, and there is a pending direct I/O rewrite also making heavy use for it. Don't pretend it's always unlikely, and let the branch predictor do it's job instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: cleanup bio_endioChristoph Hellwig2016-03-141-18/+17
| | | | | | | | | | | | | | | | | | | | | | | | Replace the while loop that unecessarily checks for a NULL bio in the fast path with a simple goto loop. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: factor out chained bio completionChristoph Hellwig2016-03-141-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | Factor common code between bio_chain_endio and bio_endio into a common helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: don't unecessarily clobber bi_error for chained biosChristoph Hellwig2016-03-141-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | Only overwrite the parents bi_error if it was zero. That way a successful bio completion doesn't reset the error pointer. Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: Fix NULL pointer updating nr_requestsKeith Busch2016-03-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | A h/w context's tags are freed if it was not assigned a CPU. Check if the context has tags before updating the depth. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: mark request queue as mq asapMing Lei2016-02-141-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently q->mq_ops is used widely to decide if the queue is mq or not, so we should set the 'flag' asap so that both block core and drivers can get the correct mq info. For example, commit 868f2f0b720(blk-mq: dynamic h/w context count) moves the hctx's initialization before setting q->mq_ops in blk_mq_init_allocated_queue(), then cause blk_alloc_flush_queue() to think the queue is non-mq and don't allocate command size for the per-hctx flush rq. This patches should fix the problem reported by Sasha. Cc: Keith Busch <keith.busch@intel.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Fixes: 868f2f0b720 ("blk-mq: dynamic h/w context count") Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: Initialize max_dev_sectors to 0Keith Busch2016-02-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new queue limit is not used by the majority of block drivers, and should be initialized to 0 for the driver's requested settings to be used. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: dynamic h/w context countKeith Busch2016-02-093-73/+110
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The hardware's provided queue count may change at runtime with resource provisioning. This patch allows a block driver to alter the number of h/w queues available when its resource count changes. The main part is a new blk-mq API to request a new number of h/w queues for a given live tag set. The new API freezes all queues using that set, then adjusts the allocated count prior to remapping these to CPUs. The bulk of the rest just shifts where h/w contexts and all their artifacts are allocated and freed. The number of max h/w contexts is capped to the number of possible cpus since there is no use for more than that. As such, all pre-allocated memory for pointers need to account for the max possible rather than the initial number of queues. A side effect of this is that the blk-mq will proceed successfully as long as it can allocate at least one h/w context. Previously it would fail request queue initialization if less than the requested number was allocated. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | cfq-iosched: Allow parent cgroup to preempt its childJan Kara2016-02-041-1/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we don't allow sync workload of one cgroup to preempt sync workload of any other cgroup. This is because we want to achieve service separation between cgroups. However in cases where cgroup preempting is ancestor of the current cgroup, there is no need of separation and idling introduces unnecessary overhead. This hurts for example the case when workload is isolated within a cgroup but journalling threads are in root cgroup. Simple way to demostrate the issue is using: dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1 on ext4 filesystem on plain SATA drive (mounted with barrier=0 to make difference more visible). When all processes are in the root cgroup, reported throughput is 153.132 MB/sec. When dbench process gets its own blkio cgroup, reported throughput drops to 26.1006 MB/sec. Fix the problem by making check in cfq_should_preempt() more benevolent and allow preemption by ancestor cgroup. This improves the throughput reported by dbench4 to 48.9106 MB/sec. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | cfq-iosched: Allow sync noidle workloads to preempt each otherJan Kara2016-02-041-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The original idea with preemption of sync noidle queues (introduced in commit 718eee0579b8 "cfq-iosched: fairness for sync no-idle queues") was that we service all sync noidle queues together, we don't idle on any of the queues individually and we idle only if there is no sync noidle queue to be served. This intention also matches the original test: if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD && new_cfqq->service_tree == cfqq->service_tree) return true; However since at that time cfqq->service_tree was not set for idling queues, this test was unreliable and was replaced in commit e4a229196a7c "cfq-iosched: fix no-idle preemption logic" by: if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD && cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD && new_cfqq->service_tree->count == 1) return true; That was a reliable test but was actually doing something different - now we preempt sync noidle queue only if the new queue is the only one busy in the service tree. These days cfq queue is kept in service tree even if it is idling and thus the original check would be safe again. But since we actually check that cfq queues are in the same cgroup, of the same priority class and workload type (sync noidle), we know that new_cfqq is fine to preempt cfqq. So just remove the service tree check. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
OpenPOWER on IntegriCloud