| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull block fixes from Jens Axboe:
"Here's a pull request for 4.11-rc, fixing a set of issues mostly
centered around the new scheduling framework. These have been brewing
for a while, but split up into what we absolutely need in 4.11, and
what we can defer until 4.12. These are well tested, on both single
queue and multiqueue setups, and with and without shared tags. They
fix several hangs that have happened in testing.
This is obviously larger than I would have preferred at this point in
time, but I don't think we can shave much off this and still get the
desired results.
In detail, this pull request contains:
- a set of five fixes for NVMe, mostly from Christoph and one from
Roland.
- a series from Bart, fixing issues with dm-mq and SCSI shared tags
and scheduling. Note that one of those patches commit messages may
read like an optimization, but it is in fact an important fix for
queue restarts in particular.
- a series from Omar, most importantly fixing a hang with multiple
hardware queues when we fail to get a driver tag. Another important
fix in there is for resizing hardware queues, which nbd does when
handling multiple sockets for one connection.
- fixing an imbalance in putting the ctx for hctx request allocations
from Minchan"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: Restart a single queue if tag sets are shared
dm rq: Avoid that request processing stalls sporadically
scsi: Avoid that SCSI queues get stuck
blk-mq: Introduce blk_mq_delay_run_hw_queue()
blk-mq: remap queues when adding/removing hardware queues
blk-mq-sched: fix crash in switch error path
blk-mq-sched: set up scheduler tags when bringing up new queues
blk-mq-sched: refactor scheduler initialization
blk-mq: use the right hctx when getting a driver tag fails
nvmet: fix byte swap in nvmet_parse_io_cmd
nvmet: fix byte swap in nvmet_execute_write_zeroes
nvmet: add missing byte swap in nvmet_get_smart_log
nvme: add missing byte swap in nvme_setup_discard
nvme: Correct NVMF enum values to match NVMe-oF rev 1.0
block: do not put mq context in blk_mq_alloc_request_hctx
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
While running the srp-test software I noticed that request
processing stalls sporadically at the beginning of a test, namely
when mkfs is run against a dm-mpath device. Every time when that
happened the following command was sufficient to resume request
processing:
echo run >/sys/kernel/debug/block/dm-0/state
This patch avoids that such request processing stalls occur. The
test I ran is as follows:
while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|\ \
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- two stable fixes for the verity target's FEC support
- a stable fix for raid target's raid1 support (when no bitmap is used)
- a 4.11 cache metadata v2 format fix to properly test blocks are clean
* tag 'dm-4.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm verity fec: fix bufio leaks
dm raid: fix NULL pointer dereference for raid1 without bitmap
dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty
dm verity fec: limit error correction recursion
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Buffers read through dm_bufio_read() were not released in all code paths.
Fixes: a739ff3f543a ("dm verity: add support for forward error correction")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 4257e08 ("dm raid: support to change bitmap region size")
introduced a bitmap resize call during preresume phase. User can create
a DM device with "raid" target configured as raid1 with no metadata
devices to hold superblock/bitmap info. It can be achieved using the
following sequence:
truncate -s 32M /dev/shm/raid-test
LOOP=$(losetup --show -f /dev/shm/raid-test)
dmsetup create raid-test-linear0 --table "0 1024 linear $LOOP 0"
dmsetup create raid-test-linear1 --table "0 1024 linear $LOOP 1024"
dmsetup create raid-test --table "0 1024 raid raid1 1 2048 2 - /dev/mapper/raid-test-linear0 - /dev/mapper/raid-test-linear1"
This results in the following crash:
[ 4029.110216] device-mapper: raid: Ignoring chunk size parameter for RAID 1
[ 4029.110217] device-mapper: raid: Choosing default region size of 4MiB
[ 4029.111349] md/raid1:mdX: active with 2 out of 2 mirrors
[ 4029.114770] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 4029.114802] IP: bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.114816] PGD 0
…
[ 4029.115059] Hardware name: Aquarius Pro P30 S85 BUY-866/B85M-E, BIOS 2304 05/25/2015
[ 4029.115079] task: ffff88015cc29a80 task.stack: ffffc90001a5c000
[ 4029.115097] RIP: 0010:bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.115112] RSP: 0018:ffffc90001a5fb68 EFLAGS: 00010246
[ 4029.115127] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000
[ 4029.115146] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000000
[ 4029.115166] RBP: ffffc90001a5fc28 R08: 0000000800000000 R09: 00000008ffffffff
[ 4029.115185] R10: ffffea0005661600 R11: ffff88015cc29a80 R12: ffff88021231f058
[ 4029.115204] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 4029.115223] FS: 00007fe73a6b4740(0000) GS:ffff88021ea80000(0000) knlGS:0000000000000000
[ 4029.115245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4029.115261] CR2: 0000000000000030 CR3: 0000000159a74000 CR4: 00000000001426e0
[ 4029.115281] Call Trace:
[ 4029.115291] ? raid_iterate_devices+0x63/0x80 [dm_raid]
[ 4029.115309] ? dm_table_all_devices_attribute.isra.23+0x41/0x70 [dm_mod]
[ 4029.115329] ? dm_table_set_restrictions+0x225/0x2d0 [dm_mod]
[ 4029.115346] raid_preresume+0x81/0x2e0 [dm_raid]
[ 4029.115361] dm_table_resume_targets+0x47/0xe0 [dm_mod]
[ 4029.115378] dm_resume+0xa8/0xd0 [dm_mod]
[ 4029.115391] dev_suspend+0x123/0x250 [dm_mod]
[ 4029.115405] ? table_load+0x350/0x350 [dm_mod]
[ 4029.115419] ctl_ioctl+0x1c2/0x490 [dm_mod]
[ 4029.115433] dm_ctl_ioctl+0xe/0x20 [dm_mod]
[ 4029.115447] do_vfs_ioctl+0x8d/0x5a0
[ 4029.115459] ? ____fput+0x9/0x10
[ 4029.115470] ? task_work_run+0x79/0xa0
[ 4029.115481] SyS_ioctl+0x3c/0x70
[ 4029.115493] entry_SYSCALL_64_fastpath+0x13/0x94
The raid_preresume() function incorrectly assumes that the raid_set has
a bitmap enabled if RT_FLAG_RS_BITMAP_LOADED is set. But
RT_FLAG_RS_BITMAP_LOADED is getting set in __load_dirty_region_bitmap()
even if there is no bitmap present (and bitmap_load() happily returns 0
even if a bitmap isn't present). So the only way forward in the
near-term is to check if the bitmap is present by seeing if
mddev->bitmap is not NULL after bitmap_load() has been called.
By doing so the above NULL pointer is avoided.
Fixes: 4257e08 ("dm raid: support to change bitmap region size")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Dmitry Bilunov <kmeaw@yandex-team.ru>
Signed-off-by: Andrey Smetanin <asmetanin@yandex-team.ru>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The dm_bitset_cursor_begin() call was using the incorrect nr_entries.
Also, the last dm_bitset_cursor_next() must be avoided if we're at the
end of the cursor.
Fixes: 7f1b21591a6 ("dm cache metadata: use cursor api in blocks_are_clean_separate_dirty()")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If the hash tree itself is sufficiently corrupt in addition to data blocks,
it's possible for error correction to end up in a deep recursive loop,
which eventually causes a kernel panic. This change limits the
recursion to a reasonable level during a single I/O operation.
Fixes: a739ff3f543a ("dm verity: add support for forward error correction")
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v4.5+
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:
- fix a parity calculation bug of raid5 cache by Song
- fix a potential deadlock issue by me
- fix two endian issues by Jason
- fix a disk limitation issue by Neil
- other small fixes and cleanup
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md/raid1: fix a trivial typo in comments
md/r5cache: fix set_syndrome_sources() for data in cache
md: fix incorrect use of lexx_to_cpu in does_sb_need_changing
md: fix super_offset endianness in super_1_rdev_size_change
md/raid1/10: fix potential deadlock
md: don't impose the MD_SB_DISKS limit on arrays without metadata.
md: move funcs from pers->resize to update_size
md-cluster: remove useless memset from gather_all_resync_info
md-cluster: free md_cluster_info if node leave cluster
md: delete dead code
md/raid10: submit bio directly to replacement disk
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
raid1.c: fix a trivial typo in comments of freeze_array().
Cc: Jack Wang <jack.wang.usish@gmail.com>
Cc: Guoqing Jiang <gqjiang@suse.com>
Cc: John Stoffel <john@stoffel.org>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Zhilong Liu <zlliu@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Before this patch, device InJournal will be included in prexor
(SYNDROME_SRC_WANT_DRAIN) but not in reconstruct (SYNDROME_SRC_WRITTEN). So it
will break parity calculation. With srctype == SYNDROME_SRC_WRITTEN, we need
include both dev with non-null ->written and dev with R5_InJournal. This fixes
logic in 1e6d690(md/r5cache: caching phase of r5cache)
Cc: stable@vger.kernel.org (v4.10+)
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The sb->layout is of type __le32, so we shoud use le32_to_cpu.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The sb->super_offset should be big-endian, but the rdev->sb_start is in
host byte order, so fix this by adding cpu_to_le64.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Neil Brown pointed out a potential deadlock in raid 10 code with
bio_split/chain. The raid1 code could have the same issue, but recent
barrier rework makes it less likely to happen. The deadlock happens in
below sequence:
1. generic_make_request(bio), this will set current->bio_list
2. raid10_make_request will split bio to bio1 and bio2
3. __make_request(bio1), wait_barrer, add underlayer disk bio to
current->bio_list
4. __make_request(bio2), wait_barrer
If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
raise_barrier waits for IO completion from 3. And since raise_barrier
sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
dispatched because raid10_make_request() doesn't finished yet.
The solution is to adjust the IO ordering. Quotes from Neil:
"
It is much safer to:
if (need to split) {
split = bio_split(bio, ...)
bio_chain(...)
make_request_fn(split);
generic_make_request(bio);
} else
make_request_fn(mddev, bio);
This way we first process the initial section of the bio (in 'split')
which will queue some requests to the underlying devices. These
requests will be queued in generic_make_request.
Then we queue the remainder of the bio, which will be added to the end
of the generic_make_request queue.
Then we return.
generic_make_request() will pop the lower-level device requests off the
queue and handle them first. Then it will process the remainder
of the original bio once the first section has been fully processed.
"
Note, this only happens in read path. In write path, the bio is flushed to
underlaying disks either by blk flush (from schedule) or offladed to raid1/10d.
It's queued in current->bio_list.
Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
Suggested-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
These arrays, created with "mdadm --build" don't benefit from a limit.
The default will be used, which is '0' and is interpreted as "don't
impose a limit".
Reported-by: ian_bruce@mail.ru
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
raid1_resize and raid5_resize should also check the
mddev->queue if run underneath dm-raid.
And both set_capacity and revalidate_disk are used in
pers->resize such as raid1, raid10 and raid5. So
move them from personality file to common code.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This memset is not needed. The lvb is already zeroed because
it was recently allocated by lockres_init, which uses kzalloc(),
and read_resync_info() doesn't need it to be zero anyway.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
To avoid memory leak, we need to free the cinfo which
is allocated when node join cluster.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | | |
Nobody is using mddev_check_plugged(), so delete the dead code
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit 57c67df(md/raid10: submit IO from originating thread instead of
md thread) submits bio directly for normal disks but not for replacement
disks. There is no point we shouldn't do this for replacement disks.
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
changed current->bio_list so that it did not contain *all* of the
queued bios, but only those submitted by the currently running
make_request_fn.
There are two places which walk the list and requeue selected bios,
and others that check if the list is empty. These are no longer
correct.
So redefine current->bio_list to point to an array of two lists, which
contain all queued bios, and adjust various code to test or walk both
lists.
Signed-off-by: NeilBrown <neilb@suse.com>
Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|/
|
|
|
|
|
|
|
| |
Link: http://lkml.kernel.org/r/20170226060230.11555-1-standby24x7@gmail.com
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Coly Li <colyli@suse.de>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull sched.h split-up from Ingo Molnar:
"The point of these changes is to significantly reduce the
<linux/sched.h> header footprint, to speed up the kernel build and to
have a cleaner header structure.
After these changes the new <linux/sched.h>'s typical preprocessed
size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
lines), which is around 40% faster to build on typical configs.
Not much changed from the last version (-v2) posted three weeks ago: I
eliminated quirks, backmerged fixes plus I rebased it to an upstream
SHA1 from yesterday that includes most changes queued up in -next plus
all sched.h changes that were pending from Andrew.
I've re-tested the series both on x86 and on cross-arch defconfigs,
and did a bisectability test at a number of random points.
I tried to test as many build configurations as possible, but some
build breakage is probably still left - but it should be mostly
limited to architectures that have no cross-compiler binaries
available on kernel.org, and non-default configurations"
* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
sched/headers: Clean up <linux/sched.h>
sched/headers: Remove #ifdefs from <linux/sched.h>
sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
sched/core: Remove unused prefetch_stack()
sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
sched/headers: Remove <linux/signal.h> from <linux/sched.h>
sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
sched/headers: Remove the runqueue_is_locked() prototype
sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
related APIs from <linux/sched.h> to <linux/sched/task.h>
But first update usage sites with the new header dependency.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
<linux/rculist.h> in <linux/sched.h>
We don't actually need the full rculist.h header in sched.h anymore,
we will be able to include the smaller rcupdate.h header instead.
But first update code that relied on the implicit header inclusion.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
<linux/sched/task_stack.h>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Update the .c files that depend on these APIs.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
<linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
<linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
<linux/sched/clock.h>
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.
Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- a dm-raid stable@ fix for possible corruption when triggering a raid
reshape via lvm2; and an additional small patch ontop to bump version
of the dm-raid target outside of the stable@ fix
- a dm-raid fix for a 'dm-4.11-changes' regression introduced by a
commit that was meant to only cleanup confusing branching.
* tag 'dm-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm raid: bump the target version
dm raid: fix data corruption on reshape request
dm raid: fix raid "check" regression due to improper cleanup in raid_message()
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This version bump reflects that the reshape corruption fix (commit
92a39f6cc "dm raid: fix data corruption on reshape request") is
present.
Done as a separate fix because the above referenced commit is marked for
stable and target version bumps in a stable@ fix are a recipe for the
fix to never get backported to stable@ kernels (because of target
version number conflicts).
Also, move RESUME_STAY_FROZEN_FLAGS up with the reset the the _FLAGS
definitions now that we don't need to worry about stable@ conflicts as a
result of missing context.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The lvm2 sequence to manage dm-raid constructor flags that trigger a
rebuild or a reshape is defined as:
1) load table with flags (e.g. rebuild/delta_disks/data_offset)
2) clear out the flags in lvm2 metadata
3) store the lvm2 metadata, reload the table to reset the flags
previously established during the initial load (1) -- in order to
prevent repeatedly requesting a rebuild or a reshape on activation
Currently, loading an inactive table with rebuild/reshape flags
specified will cause dm-raid to rebuild/reshape on resume and thus start
updating the raid metadata (about the progress). When the second table
reload, to reset the flags, occurs the constructor accesses the volatile
progress state kept in the raid superblocks. Because the active mapping
is still processing the rebuild/reshape, that position will be stale by
the time the device is resumed.
In the reshape case, this causes data corruption by processing already
reshaped stripes again. In the rebuild case, it does _not_ cause data
corruption but instead involves superfluous rebuilds.
Fix by keeping the raid set frozen during the first resume and then
allow the rebuild/reshape during the second resume.
Fixes: 9dbd1aa3a ("dm raid: add reshaping support to the target")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.8+
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
While cleaning up awkward branching in raid_message() a raid set "check"
regression was introduced because "check" needs both MD_RECOVERY_SYNC
and MD_RECOVERY_REQUESTED flags set.
Fix this regression by explicitly setting both flags for the "check"
case (like is also done for the "repair" case, but redundant set_bit()s
are perfectly fine because it adds clarity to what is needed in response
to both messages -- in addition this isn't fast path code).
Fixes: 105db59912 ("dm raid: cleanup awkward branching in raid_message() option processing")
Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
rcu_dereference_key() and user_key_payload() are currently being used in
two different, incompatible ways:
(1) As a wrapper to rcu_dereference() - when only the RCU read lock used
to protect the key.
(2) As a wrapper to rcu_dereference_protected() - when the key semaphor is
used to protect the key and the may be being modified.
Fix this by splitting both of the key wrappers to produce:
(1) RCU accessors for keys when caller has the key semaphore locked:
dereference_key_locked()
user_key_payload_locked()
(2) RCU accessors for keys when caller holds the RCU read lock:
dereference_key_rcu()
user_key_payload_rcu()
This should fix following warning in the NFS idmapper
===============================
[ INFO: suspicious RCU usage. ]
4.10.0 #1 Tainted: G W
-------------------------------
./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 0
1 lock held by mount.nfs/5987:
#0: (rcu_read_lock){......}, at: [<d000000002527abc>] nfs_idmap_get_key+0x15c/0x420 [nfsv4]
stack backtrace:
CPU: 1 PID: 5987 Comm: mount.nfs Tainted: G W 4.10.0 #1
Call Trace:
dump_stack+0xe8/0x154 (unreliable)
lockdep_rcu_suspicious+0x140/0x190
nfs_idmap_get_key+0x380/0x420 [nfsv4]
nfs_map_name_to_uid+0x2a0/0x3b0 [nfsv4]
decode_getfattr_attrs+0xfac/0x16b0 [nfsv4]
decode_getfattr_generic.constprop.106+0xbc/0x150 [nfsv4]
nfs4_xdr_dec_lookup_root+0xac/0xb0 [nfsv4]
rpcauth_unwrap_resp+0xe8/0x140 [sunrpc]
call_decode+0x29c/0x910 [sunrpc]
__rpc_execute+0x140/0x8f0 [sunrpc]
rpc_run_task+0x170/0x200 [sunrpc]
nfs4_call_sync_sequence+0x68/0xa0 [nfsv4]
_nfs4_lookup_root.isra.44+0xd0/0xf0 [nfsv4]
nfs4_lookup_root+0xe0/0x350 [nfsv4]
nfs4_lookup_root_sec+0x70/0xa0 [nfsv4]
nfs4_find_root_sec+0xc4/0x100 [nfsv4]
nfs4_proc_get_rootfh+0x5c/0xf0 [nfsv4]
nfs4_get_rootfh+0x6c/0x190 [nfsv4]
nfs4_server_common_setup+0xc4/0x260 [nfsv4]
nfs4_create_server+0x278/0x3c0 [nfsv4]
nfs4_remote_mount+0x50/0xb0 [nfsv4]
mount_fs+0x74/0x210
vfs_kern_mount+0x78/0x220
nfs_do_root_mount+0xb0/0x140 [nfsv4]
nfs4_try_mount+0x60/0x100 [nfsv4]
nfs_fs_mount+0x5ec/0xda0 [nfs]
mount_fs+0x74/0x210
vfs_kern_mount+0x78/0x220
do_mount+0x254/0xf70
SyS_mount+0x94/0x100
system_call+0x38/0xe0
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull md updates from Shaohua Li:
"Mainly fixes bugs and improves performance:
- Improve scalability for raid1 from Coly
- Improve raid5-cache read performance, disk efficiency and IO
pattern from Song and me
- Fix a race condition of disk hotplug for linear from Coly
- A few cleanup patches from Ming and Byungchul
- Fix a memory leak from Neil
- Fix WRITE SAME IO failure from me
- Add doc for raid5-cache from me"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits)
md/raid1: fix write behind issues introduced by bio_clone_bioset_partial
md/raid1: handle flush request correctly
md/linear: shutup lockdep warnning
md/raid1: fix a use-after-free bug
RAID1: avoid unnecessary spin locks in I/O barrier code
RAID1: a new I/O barrier implementation to remove resync window
md/raid5: Don't reinvent the wheel but use existing llist API
md: fast clone bio in bio_clone_mddev()
md: remove unnecessary check on mddev
md/raid1: use bio_clone_bioset_partial() in case of write behind
md: fail if mddev->bio_set can't be created
block: introduce bio_clone_bioset_partial()
md: disable WRITE SAME if it fails in underlayer disks
md/raid5-cache: exclude reclaiming stripes in reclaim check
md/raid5-cache: stripe reclaim only counts valid stripes
MD: add doc for raid5-cache
Documentation: move MD related doc into a separate dir
md: ensure md devices are freed before module is unloaded.
md/r5cache: improve journal device efficiency
md/r5cache: enable chunk_aligned_read with write back cache
...
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
There are two issues, introduced by commit 8e58e32(md/raid1: use
bio_clone_bioset_partial() in case of write behind):
- bio_clone_bioset_partial() uses bytes instead of sectors as parameters
- in writebehind mode, we return bio if all !writemostly disk bios finish,
which could happen before writemostly disk bios run. So all
writemostly disk bios should have their bvec. Here we just make sure
all bios are cloned instead of fast cloned.
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
I got a warning triggered in align_to_barrier_unit_end. It's a flush
request so sectors == 0. The flush request happens to work well without
the new barrier patch, but we'd better handle it explictly.
Cc: NeilBrown <neilb@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit 03a9e24(md linear: fix a race between linear_add() and
linear_congested()) introduces the warnning.
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit fd76863 (RAID1: a new I/O barrier implementation to remove resync
window) introduces a user-after-free bug.
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When I run a parallel reading performan testing on a md raid1 device with
two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
only 2.7GB/s, this is around 50% of the idea performance number.
The perf reports locking contention happens at allow_barrier() and
wait_barrier() code,
- 41.41% fio [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
+ 89.92% allow_barrier
+ 9.34% __wake_up
- 37.30% fio [kernel.kallsyms] [k] _raw_spin_lock_irq
- _raw_spin_lock_irq
- 100.00% wait_barrier
The reason is, in these I/O barrier related functions,
- raise_barrier()
- lower_barrier()
- wait_barrier()
- allow_barrier()
They always hold conf->resync_lock firstly, even there are only regular
reading I/Os and no resync I/O at all. This is a huge performance penalty.
The solution is a lockless-like algorithm in I/O barrier code, and only
holding conf->resync_lock when it has to.
The original idea is from Hannes Reinecke, and Neil Brown provides
comments to improve it. I continue to work on it, and make the patch into
current form.
In the new simpler raid1 I/O barrier implementation, there are two
wait barrier functions,
- wait_barrier()
Which calls _wait_barrier(), is used for regular write I/O. If there is
resync I/O happening on the same I/O barrier bucket, or the whole
array is frozen, task will wait until no barrier on same barrier bucket,
or the whold array is unfreezed.
- wait_read_barrier()
Since regular read I/O won't interfere with resync I/O (read_balance()
will make sure only uptodate data will be read out), it is unnecessary
to wait for barrier in regular read I/Os, waiting in only necessary
when the whole array is frozen.
The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once conf->
nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
raised in same barrier bucket index and array is not frozen, the regular
I/O doesn't need to hold conf->resync_lock, it can just increase
conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
very similar to _wait_barrier(), the only difference is it only waits when
array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
code almostly gets rid of all spin lock cost.
This patch significantly improves raid1 reading peroformance. From my
testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).
Changelog
V4:
- Change conf->nr_queued[] to atomic_t.
- Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
V3:
- Add smp_mb__after_atomic() as Shaohua and Neil suggested.
- Change conf->nr_queued[] from atomic_t to int.
- Change conf->array_frozen from atomic_t back to int, and use
READ_ONCE(conf->array_frozen) to check value of conf->array_frozen
in _wait_barrier() and wait_read_barrier().
- In _wait_barrier() and wait_read_barrier(), add a call to
wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]),
to fix a deadlock between _wait_barrier()/wait_read_barrier and
freeze_array().
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no racy when checking two
atomic_t variables at same time.
V1:
- Original RFC patch for comments.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
introduces a sliding resync window for raid1 I/O barrier, this idea limits
I/O barriers to happen only inside a slidingresync window, for regular
I/Os out of this resync window they don't need to wait for barrier any
more. On large raid1 device, it helps a lot to improve parallel writing
I/O throughput when there are background resync I/Os performing at
same time.
The idea of sliding resync widow is awesome, but code complexity is a
challenge. Sliding resync window requires several variables to work
collectively, this is complexed and very hard to make it work correctly.
Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
to fix the original resync window patch. This is not the end, any further
related modification may easily introduce more regreassion.
Therefore I decide to implement a much simpler raid1 I/O barrier, by
removing resync window code, I believe life will be much easier.
The brief idea of the simpler barrier is,
- Do not maintain a global unique resync window
- Use multiple hash buckets to reduce I/O barrier conflicts, regular
I/O only has to wait for a resync I/O when both them have same barrier
bucket index, vice versa.
- I/O barrier can be reduced to an acceptable number if there are enough
barrier buckets
Here I explain how the barrier buckets are designed,
- BARRIER_UNIT_SECTOR_SIZE
The whole LBA address space of a raid1 device is divided into multiple
barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
Bio requests won't go across border of barrier unit size, that means
maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
For random I/O 64MB is large enough for both read and write requests,
for sequential I/O considering underlying block layer may merge them
into larger requests, 64MB is still good enough.
Neil also points out that for resync operation, "we want the resync to
move from region to region fairly quickly so that the slowness caused
by having to synchronize with the resync is averaged out over a fairly
small time frame". For full speed resync, 64MB should take less then 1
second. When resync is competing with other I/O, it could take up a few
minutes. Therefore 64MB size is fairly good range for resync.
- BARRIER_BUCKETS_NR
There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
#define BARRIER_BUCKETS_NR_BITS (PAGE_SHIFT - 2)
#define BARRIER_BUCKETS_NR (1<<BARRIER_BUCKETS_NR_BITS)
this patch makes the bellowed members of struct r1conf from integer
to array of integers,
- int nr_pending;
- int nr_waiting;
- int nr_queued;
- int barrier;
+ int *nr_pending;
+ int *nr_waiting;
+ int *nr_queued;
+ int *barrier;
number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
barrier buckets, and each array of integers occupies single memory page.
1024 means for a request which is smaller than the I/O barrier unit size
has ~0.1% chance to wait for resync to pause, which is quite a small
enough fraction. Also requesting single memory page is more friendly to
kernel page allocator than larger memory size.
- I/O barrier bucket is indexed by bio start sector
If multiple I/O requests hit different I/O barrier units, they only need
to compete I/O barrier with other I/Os which hit the same I/O barrier
bucket index with each other. The index of a barrier bucket which a
bio should look for is calculated by sector_to_idx() which is defined
in raid1.h as an inline function,
static inline int sector_to_idx(sector_t sector)
{
return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
BARRIER_BUCKETS_NR_BITS);
}
Here sector_nr is the start sector number of a bio.
- Single bio won't go across boundary of a I/O barrier unit
If a request goes across boundary of barrier unit, it will be split. A
bio may be split in raid1_make_request() or raid1_sync_request(), if
sectors returned by align_to_barrier_unit_end() is smaller than
original bio size.
Comparing to single sliding resync window,
- Currently resync I/O grows linearly, therefore regular and resync I/O
will conflict within a single barrier units. So the I/O behavior is
similar to single sliding resync window.
- But a barrier unit bucket is shared by all barrier units with identical
barrier uinit index, the probability of conflict might be higher
than single sliding resync window, in condition that writing I/Os
always hit barrier units which have identical barrier bucket indexs with
the resync I/Os. This is a very rare condition in real I/O work loads,
I cannot imagine how it could happen in practice.
- Therefore we can achieve a good enough low conflict rate with much
simpler barrier algorithm and implementation.
There are two changes should be noticed,
- In raid1d(), I change the code to decrease conf->nr_pending[idx] into
single loop, it looks like this,
spin_lock_irqsave(&conf->device_lock, flags);
conf->nr_queued[idx]--;
spin_unlock_irqrestore(&conf->device_lock, flags);
This change generates more spin lock operations, but in next patch of
this patch set, it will be replaced by a single line code,
atomic_dec(&conf->nr_queueud[idx]);
So we don't need to worry about spin lock cost here.
- Mainline raid1 code split original raid1_make_request() into
raid1_read_request() and raid1_write_request(). If the original bio
goes across an I/O barrier unit size, this bio will be split before
calling raid1_read_request() or raid1_write_request(), this change
the code logic more simple and clear.
- In this patch wait_barrier() is moved from raid1_make_request() to
raid1_write_request(). In raid_read_request(), original wait_barrier()
is replaced by raid1_read_request().
The differnece is wait_read_barrier() only waits if array is frozen,
using different barrier function in different code path makes the code
more clean and easy to read.
Changelog
V4:
- Add alloc_r1bio() to remove redundant r1bio memory allocation code.
- Fix many typos in patch comments.
- Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
V3:
- Rebase the patch against latest upstream kernel code.
- Many fixes by review comments from Neil,
- Back to use pointers to replace arraries in struct r1conf
- Remove total_barriers from struct r1conf
- Add more patch comments to explain how/why the values of
BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
- Use get_unqueued_pending() to replace get_all_pendings() and
get_all_queued()
- Increase bucket number from 512 to 1024
- Change code comments format by review from Shaohua.
V2:
- Use bio_split() to split the orignal bio if it goes across barrier unit
bounday, to make the code more simple, by suggestion from Shaohua and
Neil.
- Use hash_long() to replace original linear hash, to avoid a possible
confilict between resync I/O and sequential write I/O, by suggestion from
Shaohua.
- Add conf->total_barriers to record barrier depth, which is used to
control number of parallel sync I/O barriers, by suggestion from Shaohua.
- In V1 patch the bellowed barrier buckets related members in r1conf are
allocated in memory page. To make the code more simple, V2 patch moves
the memory space into struct r1conf, like this,
- int nr_pending;
- int nr_waiting;
- int nr_queued;
- int barrier;
+ int nr_pending[BARRIER_BUCKETS_NR];
+ int nr_waiting[BARRIER_BUCKETS_NR];
+ int nr_queued[BARRIER_BUCKETS_NR];
+ int barrier[BARRIER_BUCKETS_NR];
This change is by the suggestion from Shaohua.
- Remove some inrelavent code comments, by suggestion from Guoqing.
- Add a missing wait_barrier() before jumping to retry_write, in
raid1_make_write_request().
V1:
- Original RFC patch for comments
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Although llist provides proper APIs, they are not used. Make them used.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Firstly bio_clone_mddev() is used in raid normal I/O and isn't
in resync I/O path.
Secondly all the direct access to bvec table in raid happens on
resync I/O except for write behind of raid1, in which we still
use bio_clone() for allocating new bvec table.
So this patch replaces bio_clone() with bio_clone_fast()
in bio_clone_mddev().
Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
suggested by Christoph Hellwig.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
mddev is never NULL and neither is ->bio_set, so
remove the check.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Write behind need to replace pages in bio's bvecs, and we have
to clone a fresh bio with new bvec table, so use the introduced
bio_clone_bioset_partial() for it.
For other bio_clone_mddev() cases, we will use fast clone since
they don't need to touch bvec table.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The current behaviour is to fall back to allocate
bio from 'fs_bio_set', that isn't a correct way
because it might cause deadlock.
So this patch simply return failure if mddev->bio_set
can't be created.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This makes md do the same thing as dm for write same IO failure. Please
see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need
this.
We did a little bit different than dm. Instead of disabling writesame in
the first IO error, we disable it till next writesame IO coming after
the first IO error. This way we don't need to clone a bio.
Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581
Suggested-by: NeilBrown <neilb@suse.com>
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
stripes which are being reclaimed are still accounted into cached
stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the
stripes and does unnecessary stripe reclaim. In practice, I saw one
stripe is reclaimed one time. This will cause bad IO pattern. Fixing
this by excluding the reclaing stripes in the check.
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When log space is tight, we try to reclaim stripes from log head. There
are stripes which can't be reclaimed right now if some conditions are
met. We skip such stripes but accidentally count them, which might cause
no stripes are claimed. Fixing this by only counting valid stripes.
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit: cbd199837750 ("md: Fix unfortunate interaction with evms")
change mddev_put() so that it would not destroy an md device while
->ctime was non-zero.
Unfortunately, we didn't make sure to clear ->ctime when unloading
the module, so it is possible for an md device to remain after
module unload. An attempt to open such a device will trigger
an invalid memory reference in:
get_gendisk -> kobj_lookup -> exact_lock -> get_disk
when tring to access disk->fops, which was in the module that has
been removed.
So ensure we clear ->ctime in md_exit(), and explain how that is useful,
as it isn't immediately obvious when looking at the code.
Fixes: cbd199837750 ("md: Fix unfortunate interaction with evms")
Tested-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|