path: root/drivers/md
Log entries: subject (author, date, files changed, lines -/+)
* Merge tag 'for-linus-20180309' of git://git.kernel.dk/linux-block (Linus Torvalds, 2018-03-10, 1 file, -6/+21)

    Pull block fixes from Jens Axboe:

     - a xen-blkfront fix from Bhavesh with a multiqueue fix when
       detaching/re-attaching

     - a few important NVMe fixes, including a revert for a sysfs fix that
       caused some user space confusion

     - two bcache fixes by way of Michael Lyle

     - a loop regression fix, fixing an issue with lost writes on DAX.

    * tag 'for-linus-20180309' of git://git.kernel.dk/linux-block:
      loop: Fix lost writes caused by missing flag
      nvme_fc: rework sqsize handling
      nvme-fabrics: Ignore nr_io_queues option for discovery controllers
      xen-blkfront: move negotiate_mq to cover all cases of new VBDs
      Revert "nvme: create 'slaves' and 'holders' entries for hidden controllers"
      bcache: don't attach backing with duplicate UUID
      bcache: fix crashes in duplicate cache device register
      nvme: pci: pass max vectors as num_possible_cpus() to pci_alloc_irq_vectors
      nvme-pci: Fix EEH failure on ppc
| * bcache: don't attach backing with duplicate UUID (Michael Lyle, 2018-03-05, 1 file, -0/+11)

    This can happen e.g. during disk cloning.

    This is an incomplete fix: it does not catch duplicate UUIDs earlier
    when things are still unattached.  It does not unregister the device.
    Further changes to cope better with this are planned but conflict with
    Coly's ongoing improvements to handling device errors.  In the
    meantime, one can manually stop the device after this has happened.

    Attempts to attach a duplicate device result in:

    [  136.372404] loop: module loaded
    [  136.424461] bcache: register_bdev() registered backing device loop0
    [  136.424464] bcache: bch_cached_dev_attach() Tried to attach loop0 but duplicate UUID already attached

    My test procedure is:

      dd if=/dev/sdb1 of=imgfile bs=1024 count=262144
      losetup -f imgfile

    Signed-off-by: Michael Lyle <mlyle@lyle.org>
    Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * bcache: fix crashes in duplicate cache device register (Tang Junhui, 2018-03-05, 1 file, -6/+10)

    The kernel crashed when registering a duplicate cache device; the call
    trace is below:

    [  417.643790] CPU: 1 PID: 16886 Comm: bcache-register Tainted: G        W  OE 4.15.5-amd64-preempt-sysrq-20171018 #2
    [  417.643861] Hardware name: LENOVO 20ERCTO1WW/20ERCTO1WW, BIOS N1DET41W (1.15 ) 12/31/2015
    [  417.643870] RIP: 0010:bdevname+0x13/0x1e
    [  417.643876] RSP: 0018:ffffa3aa9138fd38 EFLAGS: 00010282
    [  417.643884] RAX: 0000000000000000 RBX: ffff8c8f2f2f8000 RCX: ffffd6701f8c7edf
    [  417.643890] RDX: ffffa3aa9138fd88 RSI: ffffa3aa9138fd88 RDI: 0000000000000000
    [  417.643895] RBP: ffffa3aa9138fde0 R08: ffffa3aa9138fae8 R09: 000000000001850e
    [  417.643901] R10: ffff8c8eed34b271 R11: ffff8c8eed34b250 R12: 0000000000000000
    [  417.643906] R13: ffffd6701f78f940 R14: ffff8c8f38f80000 R15: ffff8c8ea7d90000
    [  417.643913] FS:  00007fde7e66f500(0000) GS:ffff8c8f61440000(0000) knlGS:0000000000000000
    [  417.643919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  417.643925] CR2: 0000000000000314 CR3: 00000007e6fa0001 CR4: 00000000003606e0
    [  417.643931] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [  417.643938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [  417.643946] Call Trace:
    [  417.643978]  register_bcache+0x1117/0x1270 [bcache]
    [  417.643994]  ? slab_pre_alloc_hook+0x15/0x3c
    [  417.644001]  ? slab_post_alloc_hook.isra.44+0xa/0x1a
    [  417.644013]  ? kernfs_fop_write+0xf6/0x138
    [  417.644020]  kernfs_fop_write+0xf6/0x138
    [  417.644031]  __vfs_write+0x31/0xcc
    [  417.644043]  ? current_kernel_time64+0x10/0x36
    [  417.644115]  ? __audit_syscall_entry+0xbf/0xe3
    [  417.644124]  vfs_write+0xa5/0xe2
    [  417.644133]  SyS_write+0x5c/0x9f
    [  417.644144]  do_syscall_64+0x72/0x81
    [  417.644161]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    [  417.644169] RIP: 0033:0x7fde7e1c1974
    [  417.644175] RSP: 002b:00007fff13009a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [  417.644183] RAX: ffffffffffffffda RBX: 0000000001658280 RCX: 00007fde7e1c1974
    [  417.644188] RDX: 000000000000000a RSI: 0000000001658280 RDI: 0000000000000001
    [  417.644193] RBP: 000000000000000a R08: 0000000000000003 R09: 0000000000000077
    [  417.644198] R10: 000000000000089e R11: 0000000000000246 R12: 0000000000000001
    [  417.644203] R13: 000000000000000a R14: 7fffffffffffffff R15: 0000000000000000
    [  417.644213] Code: c7 c2 83 6f ee 98 be 20 00 00 00 48 89 df e8 6c 27 3b 00 48 89 d8 5b c3 0f 1f 44 00 00 48 8b 47 70 48 89 f2 48 8b bf 80 00 00 00 <8b> b0 14 03 00 00 e9 73 ff ff ff 0f 1f 44 00 00 48 8b 47 40 39
    [  417.644302] RIP: bdevname+0x13/0x1e RSP: ffffa3aa9138fd38
    [  417.644306] CR2: 0000000000000314

    When registering a duplicate cache device in register_cache(), after
    register_cache_set() fails, bch_cache_release() is called.  It frees
    bdev, so the subsequent bdevname(bdev, name) caused the kernel crash.

    Since bch_cache_release() will free bdev, this patch makes sure bdev is
    freed if register_cache() fails, and does not free bdev again in
    register_bcache() when register_cache() fails.

    Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
    Reported-by: Marc MERLIN <marc@merlins.org>
    Tested-by: Michael Lyle <mlyle@lyle.org>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | dm table: allow upgrade from bio-based to specialized bio-based variant (Mike Snitzer, 2018-03-06, 1 file, -9/+5)

    In practice this is really only meaningful in the context of the DM
    multipath target (which uses dm_table_set_type() to set the type of
    device DM should create via its "queue_mode" option).

    So this change allows a DM multipath device with "queue_mode bio" to be
    upgraded from DM_TYPE_BIO_BASED to DM_TYPE_NVME_BIO_BASED -- iff the
    underlying device(s) are NVMe.

    DM_TYPE_NVME_BIO_BASED is just a DM core implementation detail that
    allows for NVMe-specific optimizations (e.g. use direct_make_request
    instead of generic_make_request).  If in the future there is no benefit
    or need to distinguish NVMe vs not: then it will be removed.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks (Mike Snitzer, 2018-03-06, 1 file, -37/+29)

    This eliminates the "queue_mode" configuration's "nvme" mode.  There
    wasn't anything NVMe-specific about that mode.  It was named "nvme"
    because it was a short name for the mode.  But the entire point of the
    mode was to optimize the multipath target for underlying devices that
    are _not_ SCSI-based.  Devices that aren't SCSI have no need for the
    various SCSI device handler (scsi_dh) specific code in DM multipath.

    But rather than narrowly define this scsi_dh vs not branching in terms
    of "nvme": invert the logic so that we're just checking whether a
    multipath device is layered on SCSI devices with scsi_dh attached.
    This allows any future storage technology to avoid scsi_dh specific
    code in the multipath target too.

    Fixes: 848b8aefd4 ("dm mpath: optimize NVMe bio-based support")
    Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm table: fix "nvme" test (Mikulas Patocka, 2018-03-06, 1 file, -1/+1)

    The strncmp function should compare 4 bytes.

    Fixes: 22c11858e8002 ("dm: introduce DM_TYPE_NVME_BIO_BASED")
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
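    A minimal standalone sketch of the corrected pattern (the helper name
    is illustrative, not the verbatim dm-table hunk): the length passed to
    strncmp() must cover all four characters of "nvme", otherwise any
    shorter prefix would match as well.

        #include <stdbool.h>
        #include <string.h>

        /* Returns true only if name begins with the full string "nvme". */
        static bool is_nvme_name(const char *name)
        {
                return strncmp(name, "nvme", 4) == 0;
        }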
* | dm raid: fix incorrect sync_ratio when degraded (Jonathan Brassow, 2018-03-06, 1 file, -3/+4)

    Upstream commit 4102d9de6d375 ("dm raid: fix rs_get_progress()
    synchronization state/ratio") in combination with commit 7c29744ecce
    ("dm raid: simplify rs_get_progress()") introduced a regression by
    incorrectly reporting a sync_ratio of 0 for degraded raid sets.  This
    caused lvm2 to fail to repair raid legs automatically.

    Fix by identifying the degraded state by checking the MD_RECOVERY_INTR
    flag and returning mddev->recovery_cp in case it is set.

    MD sets recovery = [ MD_RECOVERY_RECOVER MD_RECOVERY_INTR
    MD_RECOVERY_NEEDED ] when a RAID member fails.  It then shuts down any
    sync thread that is running and leaves us with all MD_RECOVERY_* flags
    cleared.  The bug occurs if a status is requested in the short time it
    takes to shut down any sync thread and clear the flags, because we were
    keying in on the MD_RECOVERY_NEEDED - understanding it to be the
    initial phase of a "recover" sync thread.  However, this is an
    incorrect interpretation if MD_RECOVERY_INTR is also set.

    This also explains why the bug only happened when automatic repair was
    enabled and not a normal 'manual' method.  It is impossible to react
    quick enough to hit the problematic window without it being automated.

    Fix passes automatic repair tests.

    Fixes: 7c29744ecce ("dm raid: simplify rs_get_progress()")
    Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
    Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl (Mike Snitzer, 2018-03-06, 1 file, -15/+20)

    Otherwise an underlying device's teardown (e.g. SCSI) may race with the
    DM ioctl or persistent reservation and result in dereferencing driver
    memory that gets freed when the underlying device's final blkdev_put()
    occurs.

    bdgrab() only increases the refcount for the block_device's inode to
    ensure the block_device struct itself will not be freed, but does not
    guarantee the block_device will remain associated with the gendisk or
    its storage.

    Cc: stable@vger.kernel.org # 4.8+
    Reported-by: David Jeffery <djeffery@redhat.com>
    Suggested-by: David Jeffery <djeffery@redhat.com>
    Reviewed-by: Ben Marzinski <bmarzins@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
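    A schematic of the difference (kernel API of this era; error handling
    trimmed, and the wrapper name is illustrative): blkdev_get_by_dev()
    takes a full reference that keeps the gendisk and its driver usable
    for the duration of the ioctl, whereas bdgrab() only pins the inode.

        #include <linux/fs.h>

        /* Pins the open device until the matching blkdev_put(). */
        static struct block_device *dm_grab_bdev(dev_t dev, fmode_t mode)
        {
                return blkdev_get_by_dev(dev, mode, NULL /* holder */);
        }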
* | dm bufio: avoid false-positive Wmaybe-uninitialized warning (Arnd Bergmann, 2018-03-06, 1 file, -10/+6)

    gcc-6.3 and earlier show a new warning after a seemingly unrelated
    change to the arm64 PAGE_KERNEL definition:

    In file included from drivers/md/dm-bufio.c:14:0:
    drivers/md/dm-bufio.c: In function 'alloc_buffer':
    include/linux/sched/mm.h:182:56: warning: 'noio_flag' may be used uninitialized in this function [-Wmaybe-uninitialized]
       current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
                                                             ^

    The same warning happened earlier on linux-3.18 for MIPS and I did a
    workaround for that, but now it's come back.  gcc-7 and newer are
    apparently smart enough to figure this out, and other architectures
    don't show it, so the best I could come up with is to rework the
    caller slightly in a way that makes it obvious enough to all arm64
    compilers what is happening here.

    Fixes: 41acec624087 ("arm64: kpti: Make use of nG dependent on arm64_kernel_unmapped_at_el0()")
    Link: https://patchwork.kernel.org/patch/9692829/
    Cc: stable@vger.kernel.org
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    [snitzer: moved declarations inside conditional, altered vmalloc return]
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* Merge tag 'for-linus-20180302' of git://git.kernel.dk/linux-block (Linus Torvalds, 2018-03-02, 2 files, -2/+2)

    Pull block fixes from Jens Axboe:
     "A collection of fixes for this series.  This is a little larger than
      usual at this time, but that's mainly because I was out on vacation
      last week.  Nothing in here is major in any way, it's just two weeks
      of fixes.

      This contains:

       - NVMe pull from Keith, with a set of fixes from the usual suspects.

       - mq-deadline zone unlock fix from Damien, fixing an issue with the
         SMR zone locking added for 4.16.

       - two bcache fixes sent in by Michael, with changes from Coly and
         Tang.

       - comment typo fix from Eric for blktrace.

       - return-value error handling fix for nbd, from Gustavo.

       - fix a direct-io case where we don't defer to a completion handler,
         making us sleep from IRQ device completion.  From Jan.

       - a small series from Jan fixing up holes around handling of bdev
         references.

       - small set of regression fixes from Jiufei, mostly fixing problems
         around the gendisk pointer -> partition index change.

       - regression fix from Ming, fixing a boundary issue with the discard
         page cache invalidation.

       - two-patch series from Ming, fixing both a core blk-mq-sched and
         kyber issue around token freeing on a requeue condition"

    * tag 'for-linus-20180302' of git://git.kernel.dk/linux-block: (24 commits)
      block: fix a typo
      block: display the correct diskname for bio
      block: fix the count of PGPGOUT for WRITE_SAME
      mq-deadline: Make sure to always unlock zones
      nvmet: fix PSDT field check in command format
      nvme-multipath: fix sysfs dangerously created links
      nbd: fix return value in error handling path
      bcache: fix kcrashes with fio in RAID5 backend dev
      bcache: correct flash only vols (check all uuids)
      blktrace_api.h: fix comment for struct blk_user_trace_setup
      blockdev: Avoid two active bdev inodes for one device
      genhd: Fix BUG in blkdev_open()
      genhd: Fix use after free in __blkdev_get()
      genhd: Add helper put_disk_and_module()
      genhd: Rename get_disk() to get_disk_and_module()
      genhd: Fix leaked module reference for NVME devices
      direct-io: Fix sleep in atomic due to sync AIO
      nvme-pci: Fix nvme queue cleanup if IRQ setup fails
      block: kyber: fix domain token leak during requeue
      blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch
      ...
| * bcache: fix kcrashes with fio in RAID5 backend dev (Tang Junhui, 2018-02-27, 1 file, -1/+1)

    The kernel crashed when running fio against a bcache device with a
    RAID5 backend; the call trace is below:

    [  440.012034] kernel BUG at block/blk-ioc.c:146!
    [  440.012696] invalid opcode: 0000 [#1] SMP NOPTI
    [  440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
    [  440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16/2015
    [  440.028615] RIP: 0010:put_io_context+0x8b/0x90
    [  440.029246] RSP: 0018:ffffa8c882b43af8 EFLAGS: 00010246
    [  440.029990] RAX: 0000000000000000 RBX: ffffa8c88294fca0 RCX: 00000000000f4240
    [  440.031006] RDX: 0000000000000004 RSI: 0000000000000286 RDI: ffffa8c88294fca0
    [  440.032030] RBP: ffffa8c882b43b10 R08: 0000000000000003 R09: ffff949cb80c1700
    [  440.033206] R10: 0000000000000104 R11: 000000000000b71c R12: 0000000000001000
    [  440.034222] R13: 0000000000000000 R14: ffff949cad84db70 R15: ffff949cb11bd1e0
    [  440.035239] FS:  0000000000000000(0000) GS:ffff949cba280000(0000) knlGS:0000000000000000
    [  440.060190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  440.084967] CR2: 00007ff0493ef000 CR3: 00000002f1e0a002 CR4: 00000000001606e0
    [  440.110498] Call Trace:
    [  440.135443]  bio_disassociate_task+0x1b/0x60
    [  440.160355]  bio_free+0x1b/0x60
    [  440.184666]  bio_put+0x23/0x30
    [  440.208272]  search_free+0x23/0x40 [bcache]
    [  440.231448]  cached_dev_write_complete+0x31/0x70 [bcache]
    [  440.254468]  closure_put+0xb6/0xd0 [bcache]
    [  440.277087]  request_endio+0x30/0x40 [bcache]
    [  440.298703]  bio_endio+0xa1/0x120
    [  440.319644]  handle_stripe+0x418/0x2270 [raid456]
    [  440.340614]  ? load_balance+0x17b/0x9c0
    [  440.360506]  handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
    [  440.380675]  ? __release_stripe+0x15/0x20 [raid456]
    [  440.400132]  raid5d+0x3ed/0x5d0 [raid456]
    [  440.419193]  ? schedule+0x36/0x80
    [  440.437932]  ? schedule_timeout+0x1d2/0x2f0
    [  440.456136]  md_thread+0x122/0x150
    [  440.473687]  ? wait_woken+0x80/0x80
    [  440.491411]  kthread+0x102/0x140
    [  440.508636]  ? find_pers+0x70/0x70
    [  440.524927]  ? kthread_associate_blkcg+0xa0/0xa0
    [  440.541791]  ret_from_fork+0x35/0x40
    [  440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2 48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41
    [  440.610020] RIP: put_io_context+0x8b/0x90 RSP: ffffa8c882b43af8
    [  440.628575] ---[ end trace a1fd79d85643a73e ]--

    The crash happens when a bypass IO comes in; in that scenario
    s->iop.bio points to s->orig_bio.  In search_free(), s->orig_bio is
    finished by calling bio_complete(), and after that s->iop.bio becomes
    invalid, so the kernel crashes when calling bio_put().

    Maybe it is the upper layer's fault, since a bio should not be freed
    before we call bio_put(), but we had better call bio_put() first,
    before calling bio_complete() to notify the upper layer that this bio
    has ended.

    This patch moves bio_complete() after bio_put() to avoid the kernel
    crash.

    [mlyle: fixed commit subject for character limits]

    Reported-by: Matthias Ferdinand <bcache@mfedv.net>
    Tested-by: Matthias Ferdinand <bcache@mfedv.net>
    Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * bcache: correct flash only vols (check all uuids) (Coly Li, 2018-02-27, 1 file, -1/+1)

    Commit 2831231d4c3f ("bcache: reduce cache_set devices iteration by
    devices_max_used") added c->devices_max_used to reduce iteration over
    c->uuids elements; this value is updated in bcache_device_attach().

    But for a flash only volume, when flash_devs_run() is called,
    bcache_device_attach() has not been called yet and c->devices_max_used
    is not updated.  The unexpected result is that the flash only volume
    won't be run by flash_devs_run().

    This patch fixes the issue by iterating over all c->uuids elements in
    flash_devs_run().  c->devices_max_used will be updated properly when
    bcache_device_attach() gets called.

    [mlyle: commit subject edited for character limit]

    Fixes: 2831231d4c3f ("bcache: reduce cache_set devices iteration by devices_max_used")
    Reported-by: Tang Junhui <tang.junhui@zte.com.cn>
    Signed-off-by: Coly Li <colyli@suse.de>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | md/raid1: fix NULL pointer dereference (Yufen Yu, 2018-02-25, 1 file, -0/+11)

    In handle_write_finished(), if r1_bio->bios[m] != NULL, the code
    assumes the corresponding conf->mirrors[m].rdev is also not NULL.  But
    that is not always true: even if some IO holds the replacement rdev
    (i.e. rdev->nr_pending.count > 0), raid1_remove_disk() can still set
    the rdev to NULL.

    That means bios[m] != NULL while mirrors[m].rdev is NULL, resulting in
    a NULL pointer dereference in handle_write_finished and
    sync_request_write.

    This patch fixes BUGs like the following:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000140
    IP: [<ffffffff815bbbbd>] raid1d+0x2bd/0xfc0
    PGD 12ab52067 PUD 12f587067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 2008 Comm: md3_raid1 Not tainted 4.1.44+ #130
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
    Call Trace:
     ? schedule+0x37/0x90
     ? prepare_to_wait_event+0x83/0xf0
     md_thread+0x144/0x150
     ? wake_atomic_t_function+0x70/0x70
     ? md_start_sync+0xf0/0xf0
     kthread+0xd8/0xf0
     ? kthread_worker_fn+0x160/0x160
     ret_from_fork+0x42/0x70
     ? kthread_worker_fn+0x160/0x160

    BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
    IP: sync_request_write+0x9e/0x980
    PGD 800000007c518067 P4D 800000007c518067 PUD 8002b067 PMD 0
    Oops: 0000 [#1] SMP PTI
    CPU: 24 PID: 2549 Comm: md3_raid1 Not tainted 4.15.0+ #118
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
    Call Trace:
     ? sched_clock+0x5/0x10
     ? sched_clock_cpu+0xc/0xb0
     ? flush_pending_writes+0x3a/0xd0
     ? pick_next_task_fair+0x4d5/0x5f0
     ? __switch_to+0xa2/0x430
     raid1d+0x65a/0x870
     ? find_pers+0x70/0x70
     ? find_pers+0x70/0x70
     ? md_thread+0x11c/0x160
     md_thread+0x11c/0x160
     ? finish_wait+0x80/0x80
     kthread+0x111/0x130
     ? kthread_create_worker_on_cpu+0x70/0x70
     ? do_syscall_64+0x6f/0x190
     ? SyS_exit_group+0x10/0x10
     ret_from_fork+0x35/0x40

    Reviewed-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Yufen Yu <yuyufen@huawei.com>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md: fix a potential deadlock of raid5/raid10 reshape (BingJing Chang, 2018-02-25, 3 files, -14/+15)

    There is a potential deadlock if a mount/umount happens when
    raid5_finish_reshape() tries to grow the size of the emulated disk.

    How the deadlock happens?
    1) The raid5 resync thread finished reshape (expanding array).
    2) The mount or umount thread holds the VFS sb->s_umount lock and tries
       to write through critical data into the raid5 emulated block device.
       So it waits for the raid5 kernel thread handling stripes in order to
       finish its I/Os.
    3) In the routine of the raid5 kernel thread, md_check_recovery() will
       be called first in order to reap the raid5 resync thread.  That is,
       raid5_finish_reshape() will be called.  In this function, it will
       try to update conf and call VFS revalidate_disk() to grow the raid5
       emulated block device.  It will try to acquire the VFS sb->s_umount
       lock.
    The raid5 kernel thread cannot continue, so no one can handle the
    mount/umount I/Os (stripes).  Once the write-through I/Os cannot be
    finished, mount/umount will not release the sb->s_umount lock.  The
    deadlock happens.

    The raid5 kernel thread is an emulated block device.  It is responsible
    for handling I/Os (stripes) from upper layers.  The emulated block
    device should not request any I/Os on itself.  That is, it should not
    call VFS layer functions.  (If it did, it would try to acquire VFS
    locks to guarantee the I/O sequence.)  So we have the resync thread to
    send resync I/O requests and to wait for the results.

    For solving this potential deadlock, we can make the size growth of the
    emulated block device the final step of the reshape thread.

    2017/12/29:
    Thanks to Guoqing Jiang <gqjiang@suse.com>, we confirmed that there is
    the same deadlock issue in raid10.  It's reproducible and can be fixed
    by this patch.  For raid10.c, we can remove the similar code to prevent
    deadlock as well, since it has been called before.

    Reported-by: Alex Wu <alexwu@synology.com>
    Reviewed-by: Alex Wu <alexwu@synology.com>
    Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
    Signed-off-by: BingJing Chang <bingjingc@synology.com>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md-cluster: choose correct label when clustered layout is not supported (Lidong Zhong, 2018-02-25, 1 file, -1/+1)

    r10conf is already successfully allocated before checking the layout.

    Signed-off-by: Lidong Zhong <lzhong@suse.com>
    Reviewed-by: Guoqing Jiang <gqjiang@suse.com>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md: raid5: avoid string overflow warning (Arnd Bergmann, 2018-02-21, 1 file, -3/+4)

    gcc warns about a possible overflow of the kmem_cache string, when
    adding four characters to a string of the same length:

    drivers/md/raid5.c: In function 'setup_conf':
    drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
      sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
                                      ^~~~
    drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
      sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    If I'm counting correctly, we need 11 characters for the fixed part of
    the string and 18 characters for a 64-bit pointer (when no gendisk is
    used), so that leaves three characters for conf->level, which should
    always be sufficient.

    This makes the code use snprintf() with the correct length, to make
    the code more robust against changes, and to get the compiler to shut
    up.

    In commit f4be6b43f1ac ("md/raid5: ensure we create a unique name for
    kmem_cache when mddev has no gendisk") from 2010, Neil said that the
    pointer could be removed "shortly" once devices without gendisk are
    disallowed.  I have no idea if that happened, but if it did, that
    should probably be changed as well.

    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
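    The bounded-format pattern the patch switches to, as a standalone
    sketch (the buffer size mirrors the 32-byte cache_name fields; the
    helper is illustrative, not the raid5.c hunk):

        #include <stdio.h>

        #define CACHE_NAME_LEN 32

        /* snprintf() truncates instead of writing past dst, which both
         * avoids the overflow and silences -Wformat-overflow. */
        static void make_alt_name(char dst[CACHE_NAME_LEN], const char *base)
        {
                snprintf(dst, CACHE_NAME_LEN, "%s-alt", base);
        }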
* | raid5-ppl: fix handling flush requests (Artur Paszkiewicz, 2018-02-21, 2 files, -1/+12)

    Add missing bio completion.  Without this any flush request would hang.

    Fixes: 1532d9e87e8b ("raid5-ppl: PPL support for disks with write-back cache enabled")
    Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md raid10: fix NULL dereference in handle_write_completed() (Yufen Yu, 2018-02-19, 1 file, -2/+4)

    In the case of 'recover', an r10bio with R10BIO_WriteError &
    R10BIO_IsRecover will be progressed by handle_write_completed().  This
    function traverses all r10bio->devs[copies].  If devs[m].repl_bio !=
    NULL, it assumes conf->mirrors[dev].replacement is also not NULL.
    However, this is not always true.

    When an rdev of raid10 has a replacement, each r10bio->devs[m].repl_bio
    in conf->r10buf_pool is != NULL.  However, in 'recover', even if the
    corresponding replacement is NULL, r10bio->devs[m].repl_bio is not
    cleared, resulting in a NULL dereference of the replacement.

    This bug was introduced when replacement support for raid10 was added
    in Linux 3.3.

    As NeilBrown suggested:
    Elsewhere the determination of "is this device part of the
    resync/recovery" is made by testing bio->bi_end_io.  If this is
    end_sync_write, then we tried to write here.  If it is NULL, then we
    didn't try to write.

    Fixes: 9ad1aefc8ae8 ("md/raid10: Handle replacement devices during resync.")
    Cc: stable (V3.3+)
    Suggested-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Yufen Yu <yuyufen@huawei.com>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md: only allow remove_and_add_spares when no sync_thread running. (NeilBrown, 2018-02-19, 1 file, -0/+4)

    The locking protocols in md assume that a device will never be removed
    from an array during resync/recovery/reshape.  When that isn't
    happening, rcu or reconfig_mutex is needed to protect an rdev pointer
    while taking a refcount.  When it is happening, that protection isn't
    needed.

    Unfortunately there are cases where remove_and_add_spares() is called
    when recovery might be happening: in state_store(), slot_store() and
    hot_remove_disk().  In each case, this is just an optimization to try
    to expedite removal from the personality so the device can be removed
    from the array.  If resync etc is happening, we just have to wait for
    md_check_recovery() to find a suitable time to call
    remove_and_add_spares().

    This optimization is not essential, so it doesn't matter if it fails.
    So change remove_and_add_spares() to abort early if
    resync/recovery/reshape is happening, unless it is called from
    md_check_recovery() as part of a newly started recovery.  The
    parameter "this" is only NULL when called from md_check_recovery(), so
    when it is NULL, there is no need to abort.

    As this can result in a NULL dereference, the fix is suitable for
    -stable.

    cc: yuyufen <yuyufen@huawei.com>
    Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
    Fixes: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.")
    Cc: stable@vger.kernel.org (v4.8+)
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md: document lifetime of internal rdev pointer. (NeilBrown, 2018-02-18, 3 files, -0/+37)

    The rdev pointer kept in the local 'config' for each of raid1, raid10,
    raid4/5/6 has non-obvious lifetime rules.  Sometimes RCU is needed,
    sometimes a lock, sometimes nothing.

    Add documentation to explain this.

    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md: fix md_write_start() deadlock w/o metadata devices (Heinz Mauelshagen, 2018-02-18, 2 files, -0/+12)

    If no metadata devices are configured on raid1/4/5/6/10 (e.g. via
    dm-raid), md_write_start() unconditionally waits for superblocks to be
    written, thus deadlocking.

    The fix introduces the mddev->has_superblocks bool, defines it in
    md_run() and checks for it in md_write_start() to conditionally avoid
    waiting.  While at it, check for non-existing superblocks in
    md_super_write().

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=198647
    Fixes: cc27b0c78c796 ("md: fix deadlock between mddev_suspend() and md_write_start()")
    Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | MD: Free bioset when md_run fails (Xiao Ni, 2018-02-17, 1 file, -5/+21)

    Signed-off-by: Xiao Ni <xni@redhat.com>
    Acked-by: Guoqing Jiang <gqjiang@suse.com>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | raid10: change the size of resync window for clustered raid (Guoqing Jiang, 2018-02-17, 1 file, -1/+1)

    To align with raid1's resync window, we need to set the resync window
    of raid10 to 32M as well.

    Fixes: 8db87912c9a8 ("md-cluster: Use a small window for raid10 resync")
    Reported-by: Zhilong Liu <zlliu@suse.com>
    Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md-multipath: Use seq_putc() in multipath_status() (Markus Elfring, 2018-02-17, 1 file, -1/+1)

    A single character (closing square bracket) should be put into a
    sequence.  Thus use the corresponding function "seq_putc".

    This issue was detected by using the Coccinelle software.

    Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md/raid1: Fix trailing semicolon (Luis de Bethencourt, 2018-02-17, 1 file, -1/+1)

    The trailing semicolon is an empty statement that does no operation.
    Removing it since it doesn't do anything.

    Signed-off-by: Luis de Bethencourt <luisbg@kernel.org>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* | md/raid5: simplify uninitialization of shrinker (Aliaksei Karaliou, 2018-02-17, 1 file, -3/+1)

    Don't use shrinker.nr_deferred to check whether the shrinker was
    initialized or not.  This check is now integrated into
    unregister_shrinker(), so it is safe to call it against an
    unregistered shrinker.

    Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
    Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
* dm: correctly handle chained bios in dec_pending() (NeilBrown, 2018-02-16, 1 file, -1/+2)

    dec_pending() is given an error status (possibly 0) to be recorded
    against a bio.  It can be called several times on the one 'struct
    dm_io', and it is careful to only assign a non-zero error to
    io->status.  However when it then assigns io->status to
    bio->bi_status, it is not careful and could overwrite a genuine error
    status with 0.

    This can happen when chained bios are in use.  If a bio is chained
    beneath the bio that this dm_io is handling, the child bio might
    complete and set bio->bi_status before the dm_io completes.

    This has been possible since chained bios were introduced in 3.14, and
    has become a lot easier to trigger with commit 18a25da84354 ("dm:
    ensure bio submission follows a depth-first tree walk") as that commit
    caused dm to start using chained bios itself.

    A particular failure mode is that if a bio spans an 'error' target and
    a working target, the 'error' fragment will complete instantly and set
    the ->bi_status, and the other fragment will normally complete a
    little later, and will clear ->bi_status.

    The fix is simply to only assign io_error to bio->bi_status when
    io_error is not zero.

    Reported-and-tested-by: Milan Broz <gmazyland@gmail.com>
    Cc: stable@vger.kernel.org (v3.14+)
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
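    The shape of the fix, as a sketch (kernel bio API; the function name
    follows the commit text, not the verbatim dm.c hunk):

        #include <linux/bio.h>

        /* Propagate the accumulated status only when it is an error, so a
         * success result can never clobber an error a chained child bio
         * already recorded in bio->bi_status. */
        static void dm_finish_bio(struct bio *bio, blk_status_t io_error)
        {
                if (io_error != BLK_STS_OK)
                        bio->bi_status = io_error;
                bio_endio(bio);
        }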
* vfs: do bulk POLL* -> EPOLL* replacement (Linus Torvalds, 2018-02-11, 2 files, -4/+4)

    This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

        for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
            L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
            for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
        done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do.  But the key word here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et
    al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bcache: fix for data collapse after re-attaching an attached device (Tang Junhui, 2018-02-07, 3 files, -7/+11)

    The back-end device sdm has already been attached to a cache set with
    ID f67ebe1f-f8bc-4d73-bfe5-9dc88607f119; then we try to attach it to
    another cache set, and it returns an error:

    [root]# cd /sys/block/sdm/bcache
    [root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach
    -bash: echo: write error: Invalid argument

    After that, execute a command to modify the label of the bcache device:

    [root]# echo data_disk1 > label

    Then we reboot the system.  When the system powers on, the back-end
    device cannot attach to the cache set, and a message shows in the log:

    Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
    bch_cached_dev_attach() couldn't find uuid for sdm in set

    In sysfs_attach(), dc->sb.set_uuid was assigned the value input through
    sysfs, no matter whether bch_cached_dev_attach() succeeded or not.  For
    example, if the back-end device has already been attached to a cache
    set, bch_cached_dev_attach() fails, but dc->sb.set_uuid has been
    changed.  If we then modify the label of the bcache device, it calls
    bch_write_bdev_super(), which writes dc->sb.set_uuid to the super
    block, so we record a wrong cache set ID in the super block.  After the
    system reboots, the cache set can't find the uuid of the back-end
    device, so the bcache device can no longer exist or be used.

    In this patch, we don't assign the cache set ID to dc->sb.set_uuid in
    sysfs_attach() directly; instead we pass it into
    bch_cached_dev_attach(), and assign dc->sb.set_uuid to the cache set
    ID only after the back-end device has attached to the cache set
    successfully.

    Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: return attach error when no cache set exist (Tang Junhui, 2018-02-07, 1 file, -2/+3)

    I attach a back-end device to a cache set while the cache set is not
    registered yet; the back-end device does not attach successfully, yet
    no error is returned:

    [root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach
    [root]#

    In sysfs_attach(), the return value "v" is initialized to "size" at
    the beginning, and if no cache set exists in bch_cache_sets, the "v"
    value does not change any more and is returned to sysfs.  sysfs
    regards it as success, since "size" is a positive number.

    This patch fixes this issue by initializing "v" with "-ENOENT".

    Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
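    The pattern, sketched (function body and comments are illustrative,
    not the sysfs.c hunk): a sysfs store handler that returns the byte
    count only when the lookup actually succeeds.

        #include <linux/errno.h>
        #include <linux/types.h>

        static ssize_t attach_store_sketch(const char *buf, size_t size)
        {
                ssize_t v = -ENOENT;    /* was: ssize_t v = size; */

                /* ... search bch_cache_sets for the given uuid; on a
                 * match, try the attach and set v = size on success ... */

                return v;       /* an empty list now reports -ENOENT */
        }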
* bcache: set writeback_rate_update_seconds in range [1, 60] seconds (Coly Li, 2018-02-07, 3 files, -2/+7)

    dc->writeback_rate_update_seconds can be set via sysfs and its value
    can be set to [1, ULONG_MAX].  It does not make sense to set such a
    large value; 60 seconds is a long enough value, considering that the
    default 5 seconds has worked well for a long time.

    Because dc->writeback_rate_update is a special delayed work, it
    re-arms itself inside the delayed work routine
    update_writeback_rate().  When stopping it by
    cancel_delayed_work_sync(), there should be a timeout to wait and make
    sure the re-armed delayed work is stopped too.  A small max value of
    dc->writeback_rate_update_seconds is also helpful to decide a
    reasonably small timeout.

    This patch limits the sysfs interface to set
    dc->writeback_rate_update_seconds in the range of [1, 60] seconds, and
    replaces the hand-coded numbers by macros.

    Changelog:
    v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
    v1: initial version.

    Signed-off-by: Coly Li <colyli@suse.de>
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
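    A sketch of the clamp (the macro names follow the commit's intent but
    may not match bcache's headers exactly):

        #include <linux/kernel.h>

        #define WRITEBACK_RATE_UPDATE_SECS_MAX      60
        #define WRITEBACK_RATE_UPDATE_SECS_DEFAULT   5

        /* Reject 0 and anything above 60 rather than letting sysfs store
         * an arbitrarily large delayed-work interval. */
        static unsigned int clamp_update_seconds(unsigned long v)
        {
                return clamp_t(unsigned int, v, 1,
                               WRITEBACK_RATE_UPDATE_SECS_MAX);
        }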
* bcache: fix for allocator and register thread race (Tang Junhui, 2018-02-07, 2 files, -4/+18)

    After a long run of random small-IO writes, I rebooted the machine,
    and after power-on bcache got stuck.  The stacks are:

    [root@ceph153 ~]# cat /proc/2510/task/*/stack
    [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
    [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
    [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
    [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
    [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
    [<ffffffff810a631f>] kthread+0xcf/0xe0
    [<ffffffff8164c318>] ret_from_fork+0x58/0x90
    [<ffffffffffffffff>] 0xffffffffffffffff

    [root@ceph153 ~]# cat /proc/2038/task/*/stack
    [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
    [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
    [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
    [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
    [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
    [<ffffffff812f702f>] kobj_attr_store+0xf/0x20
    [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
    [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
    [<ffffffff811e069f>] SyS_write+0x7f/0xe0
    [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1

    The stacks show that the register thread and the allocator thread got
    stuck while registering the cache device.  I rebooted the machine
    several times; the issue always exists on this machine.

    I debugged the code and found the call trace below:

    register_bcache()
       ==>run_cache_set()
          ==>bch_journal_replay()
             ==>bch_btree_insert()
                ==>__bch_btree_map_nodes()
                   ==>btree_insert_fn()
                      ==>btree_split()  // node needs split
                         ==>btree_check_reserve()

    In btree_check_reserve(), it checks whether there are enough buckets
    of RESERVE_BTREE type.  Since the allocator thread has not started
    working yet, no buckets of RESERVE_BTREE type have been allocated, so
    the register thread waits on c->btree_cache_wait and goes to sleep.

    Then the allocator thread is initialized; its call trace is below:

    bch_allocator_thread()
    ==>bch_prio_write()
       ==>bch_journal_meta()
          ==>bch_journal()
             ==>journal_wait_for_write()

    In journal_wait_for_write(), it checks whether the journal is full via
    journal_full(), but the long run of random small-IO writes has
    exhausted the journal buckets (journal.blocks_free == 0).  In order to
    release journal buckets, the allocator calls btree_flush_write() to
    flush keys to btree nodes, and waits on c->journal.wait until btree
    node writing is over or some journal bucket space is already
    available; then the allocator thread goes to sleep.  But in
    btree_flush_write(), since bch_journal_replay() has not finished, no
    btree nodes have a journal (the condition "if
    (btree_current_write(b)->journal)" is never satisfied), so there is no
    btree node to flush, no journal bucket is released, and the allocator
    sleeps forever.

    From the above analysis, we can see that:
    1) The register thread waits for the allocator thread to allocate
       buckets of RESERVE_BTREE type;
    2) The allocator thread waits for the register thread to replay the
       journal, so it can flush btree nodes and get a journal bucket.
    Thus they get stuck waiting for each other.

    Hua Rui provided a patch, which allocates some buckets of
    RESERVE_BTREE type in advance, so the register thread can get a bucket
    when a btree node splits and does not need to wait for the allocator
    thread.  I tested it; it has an effect, and the register thread ran a
    step further, but finally still got stuck.  The reason is that only 8
    buckets of RESERVE_BTREE type were allocated, and in
    bch_journal_replay(), after 2 btree node splits, only 4 buckets of
    RESERVE_BTREE type were left; then btree_check_reserve() was not
    satisfied anymore, so it went to sleep again.  At the same time, the
    allocator thread had not flushed enough btree nodes to release a
    journal bucket, so they all got stuck again.

    So we need to allocate more buckets of RESERVE_BTREE type in advance,
    but how many is enough?  By experience and test, I think it should be
    as many as the journal buckets.  Then I modified the code as in this
    patch, tested it on the machine, and it works.

    This patch is modified based on Hua Rui's patch, and allocates more
    buckets of RESERVE_BTREE type in advance to avoid the register thread
    and the allocator thread waiting for each other.

    [patch v2] ca->sb.njournal_buckets would be 0 the first time after
    cache creation, and no journal exists, so just 8 btree buckets is OK.

    Signed-off-by: Hua Rui <huarui.dev@gmail.com>
    Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: set error_limit correctly (Coly Li, 2018-02-07, 3 files, -3/+4)

    Struct cache uses io_errors for two purposes:
    - Error decay: when cache set error_decay is set, io_errors is used to
      generate a small piece of delay when an I/O error happens.
    - I/O errors counter: in order to generate a big enough value for
      error decay, the I/O errors counter value is stored left-shifted by
      20 bits (a.k.a IO_ERROR_SHIFT).

    In function bch_count_io_errors(), if the I/O errors counter reaches
    the cache set error limit, bch_cache_set_error() will be called to
    retire the whole cache set.  But the current code is problematic when
    checking the error limit; see the following code piece from
    bch_count_io_errors():

     90     if (error) {
     91             char buf[BDEVNAME_SIZE];
     92             unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
     93                                                 &ca->io_errors);
     94             errors >>= IO_ERROR_SHIFT;
     95
     96             if (errors < ca->set->error_limit)
     97                     pr_err("%s: IO error on %s, recovering",
     98                            bdevname(ca->bdev, buf), m);
     99             else
    100                     bch_cache_set_error(ca->set,
    101                                         "%s: too many IO errors %s",
    102                                         bdevname(ca->bdev, buf), m);
    103     }

    At line 94, errors is right-shifted by IO_ERROR_SHIFT bits; now it is
    the real errors counter compared at line 96.  But ca->set->error_limit
    is initialized with an amplified value in bch_cache_set_alloc():

    1545         c->error_limit  = 8 << IO_ERROR_SHIFT;

    It means that by default, bch_cache_set_error() in
    bch_count_io_errors() won't be called to retire the problematic cache
    device until 8<<20 errors have happened.  If the average request size
    is 64KB, it means bcache won't handle a failed device until 512GB of
    data has been requested.  This is too large to be an I/O threshold.
    So I believe the correct error limit should be much less.

    This patch sets the default cache set error limit to 8; then in
    bch_count_io_errors(), when the errors counter reaches 8 (if it is the
    default value), function bch_cache_set_error() will be called to
    retire the whole cache set.  This patch also removes the bit shifting
    when storing or showing the io_error_limit value via the sysfs
    interface.

    Nowadays most SSDs handle internal flash failure automatically by LBA
    address re-indirect mapping.  If an I/O error can be observed by upper
    layer code, it will be a notable error because that SSD can not
    re-indirect-map the problematic LBA address to an available flash
    block.  This situation indicates the whole SSD will fail very soon.
    Therefore setting 8 as the default io error limit value makes sense;
    it is enough for most cache devices.

    Changelog:
    v2: add reviewed-by from Hannes.
    v1: initial version for review.

    Signed-off-by: Coly Li <colyli@suse.de>
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Cc: Junhui Tang <tang.junhui@zte.com.cn>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: properly set task state in bch_writeback_thread() (Coly Li, 2018-02-07, 2 files, -3/+8)

    Kernel thread routine bch_writeback_thread() has the following code
    block:

    447         down_write(&dc->writeback_lock);
    448~450     if (check conditions) {
    451                 up_write(&dc->writeback_lock);
    452                 set_current_state(TASK_INTERRUPTIBLE);
    453
    454                 if (kthread_should_stop())
    455                         return 0;
    456
    457                 schedule();
    458                 continue;
    459         }

    If the condition check is true, its task state is set to
    TASK_INTERRUPTIBLE and schedule() is called to wait for others to wake
    it up.

    There are 2 issues in the current code:
    1, The task state is set to TASK_INTERRUPTIBLE after the condition
       checks; if another process changes the condition and calls
       wake_up_process(dc->writeback_thread), then at line 452 the task
       state is set back to TASK_INTERRUPTIBLE and the writeback kernel
       thread will lose a chance to be woken up.
    2, At line 454, if kthread_should_stop() is true, the writeback kernel
       thread will return to kernel/kthread.c:kthread() with
       TASK_INTERRUPTIBLE and call do_exit().  It is not good to enter
       do_exit() with task state TASK_INTERRUPTIBLE; in the following code
       path might_sleep() is called and a warning message is reported by
       __might_sleep(): "WARNING: do not call blocking ops when
       !TASK_RUNNING; state=1 set at [xxxx]".

    For the first issue, the task state should be set before the condition
    checks.  Indeed, because dc->writeback_lock is required when modifying
    all the conditions, calling set_current_state() inside the code block
    where dc->writeback_lock is held is safe.  But this is quite implicit,
    so I still move set_current_state() before all the condition checks.

    For the second issue, frankly speaking it does not hurt when a kernel
    thread exits with TASK_INTERRUPTIBLE state, but this warning message
    scares users, makes them feel there might be something risky with
    bcache that could hurt their data.  Setting the task state to
    TASK_RUNNING before returning fixes this problem.

    In alloc.c:allocator_wait() there is a similar issue, which is also
    fixed in this patch.

    Changelog:
    v3: merge two similar fixes into one patch
    v2: fix the race issue in v1 patch.
    v1: initial buggy fix.

    Signed-off-by: Coly Li <colyli@suse.de>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Cc: Michael Lyle <mlyle@lyle.org>
    Cc: Junhui Tang <tang.junhui@zte.com.cn>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
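    The corrected ordering, sketched as a loop-body excerpt (the condition
    helper is hypothetical; the real checks run under dc->writeback_lock):

        while (!kthread_should_stop()) {
                /* Publish the sleeping state BEFORE testing the wait
                 * condition, so a wake_up_process() that races between
                 * the check and schedule() is not lost. */
                set_current_state(TASK_INTERRUPTIBLE);

                if (!writeback_needed(dc)) {            /* hypothetical */
                        schedule();
                        continue;
                }
                __set_current_state(TASK_RUNNING);
                /* ... do writeback work ... */
        }
        /* Never exit the thread in TASK_INTERRUPTIBLE. */
        set_current_state(TASK_RUNNING);
        return 0;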
* bcache: fix high CPU occupancy during journal (Tang Junhui, 2018-02-07, 3 files, -15/+36)

    After a long run of small writing I/O, we found the CPU occupancy very
    high and I/O performance reduced by about half:

    [root@ceph151 internal]# top
    top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
    Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  4.3 us, 17.1 sy,  0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
    KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
    KiB Swap: 65667068 total, 65667068 free,        0 used. 25136812 avail Mem

      PID USER  PR NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
     2023 root  20  0       0      0     0 S  55.1  0.0   0:04.42 kworker/11:191
    14126 root  20  0       0      0     0 S  42.9  0.0   0:08.72 kworker/10:3
     9292 root  20  0       0      0     0 S  30.4  0.0   1:10.99 kworker/6:1
     8553 ceph  20  0 4242492 1.805g 18804 S  30.0  2.9 410:07.04 ceph-osd
    12287 root  20  0       0      0     0 S  26.7  0.0   0:28.13 kworker/7:85
    31019 root  20  0       0      0     0 S  26.1  0.0   1:30.79 kworker/22:1
     1787 root  20  0       0      0     0 R  25.7  0.0   5:18.45 kworker/8:7
    32169 root  20  0       0      0     0 S  14.5  0.0   1:01.92 kworker/23:1
    21476 root  20  0       0      0     0 S  13.9  0.0   0:05.09 kworker/1:54
     2204 root  20  0       0      0     0 S  12.5  0.0   1:25.17 kworker/9:10
    16994 root  20  0       0      0     0 S  12.2  0.0   0:06.27 kworker/5:106
    15714 root  20  0       0      0     0 R  10.9  0.0   0:01.85 kworker/19:2
     9661 ceph  20  0 4246876 1.731g 18800 S  10.6  2.8 403:00.80 ceph-osd
    11460 ceph  20  0 4164692 2.206g 18876 S  10.6  3.5 360:27.19 ceph-osd
     9960 root  20  0       0      0     0 S  10.2  0.0   0:02.75 kworker/2:139
    11699 ceph  20  0 4169244 1.920g 18920 S  10.2  3.1 355:23.67 ceph-osd
     6843 ceph  20  0 4197632 1.810g 18900 S   9.6  2.9 380:08.30 ceph-osd

    The kernel workers consumed a lot of CPU, and I found they were
    running journal work: the journal was reclaiming resources and
    flushing btree nodes with surprising frequency.

    Through further analysis, we found that in btree_flush_write() we try
    to get the btree node with the smallest fifo index to flush by
    traversing all the btree nodes in c->bucket_hash.  After getting it,
    since no lock protects it, this btree node may have already been
    written to the cache device by other work; if that occurred, we retry
    the traversal of c->bucket_hash to get another btree node.  When the
    problem occurred, the retry count was very high, and we consumed a lot
    of CPU looking for an appropriate btree node.

    In this patch, we record the 128 btree nodes with the smallest fifo
    indexes in a heap, and pop them one by one when we need to flush a
    btree node.  This greatly reduces the time the loop takes to find an
    appropriate btree node, and also reduces the CPU occupancy.

    [note by mpl: this triggers a checkpatch error because of adjacent,
    pre-existing style violations]

    Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: add journal statistic (Tang Junhui, 2018-02-07, 3 files, -0/+24)

    Sometimes the journal takes up a lot of CPU; we need statistics to
    know what the journal is doing.  So this patch provides some journal
    statistics:
    1) reclaim: how many times the journal tries to reclaim resources,
       usually when the journal buckets and/or the pin are exhausted.
    2) flush_write: how many times the journal tries to flush a btree node
       to the cache device, usually when the journal buckets are
       exhausted.
    3) retry_flush_write: how many times the journal retries flushing the
       next btree node, usually when the previous btree node has been
       flushed by another thread.

    We show these statistics via the sysfs interface.  Through them we can
    fully see the status of the journal module when CPU usage is too high.

    Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-linus-20180204' of git://git.kernel.dk/linux-block (Linus Torvalds, 2018-02-04, 1 file, -3/+2)

    Pull more block updates from Jens Axboe:
     "Most of this is fixes and not new code/features:

       - skd fix from Arnd, fixing a build error dependent on slab
         allocator type.

       - blk-mq scheduler discard merging fixes, one from me and one from
         Keith.  This fixes a segment miscalculation for blk-mq-sched,
         where we mistakenly think two segments are physically contiguous
         even though the request isn't carrying real data.  Also fixes a
         bio-to-rq merge case.

       - Don't re-set a bit on the buffer_head flags, if it's already set.
         This can cause scalability concerns on bigger machines and
         workloads.  From Kemi Wang.

       - Add BLK_STS_DEV_RESOURCE return value to blk-mq, allowing us to
         distinguish between a local (device related) resource starvation
         and a global one.  The latter might happen without IO being in
         flight, so it has to be handled a bit differently.  From Ming"

    * tag 'for-linus-20180204' of git://git.kernel.dk/linux-block:
      block: skd: fix incorrect linux/slab_def.h inclusion
      buffer: Avoid setting buffer bits that are already set
      blk-mq-sched: Enable merging discard bio into request
      blk-mq: fix discard merge with scheduler attached
      blk-mq: introduce BLK_STS_DEV_RESOURCE
| * blk-mq: introduce BLK_STS_DEV_RESOURCE (Ming Lei, 2018-01-30, 1 file, -3/+2)

    This status is returned from a driver to the block layer if a
    device-related resource is unavailable, but the driver can guarantee
    that IO dispatch will be triggered in the future when the resource
    becomes available.

    Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if a
    driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the
    queue after a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.
    BLK_MQ_DELAY_QUEUE is 3 ms because both scsi-mq and nvmefc are using
    that magic value.

    If a driver can make sure there is in-flight IO, it is safe to return
    BLK_STS_DEV_RESOURCE because:

    1) If all in-flight IOs complete before examining SCHED_RESTART in
       blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so the
       queue is run immediately in this case by blk_mq_dispatch_rq_list();

    2) if there is any in-flight IO after/when examining SCHED_RESTART in
       blk_mq_dispatch_rq_list():
       - if SCHED_RESTART isn't set, the queue is run immediately as
         handled in 1)
       - otherwise, this request will be dispatched after any in-flight
         IO is completed, via blk_mq_sched_restart()

    3) if SCHED_RESTART is set concurrently in context because of
       BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above
       two cases and make sure IO hangs can be avoided.

    One invariant is that the queue will be rerun if SCHED_RESTART is set.

    Suggested-by: Jens Axboe <axboe@kernel.dk>
    Tested-by: Laurence Oberman <loberman@redhat.com>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
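    How a driver's .queue_rq might choose between the two codes, as a
    hedged sketch (the resource/in-flight helpers are hypothetical, not
    part of the blk-mq API):

        #include <linux/blk-mq.h>

        static blk_status_t sketch_queue_rq(struct blk_mq_hw_ctx *hctx,
                                            const struct blk_mq_queue_data *bd)
        {
                if (out_of_device_resources(hctx)) {    /* hypothetical */
                        /* BLK_STS_DEV_RESOURCE is only safe when in-flight
                         * IO will rerun the queue on completion; otherwise
                         * let blk-mq rerun it after a delay. */
                        return has_inflight_io(hctx) ?  /* hypothetical */
                                BLK_STS_DEV_RESOURCE : BLK_STS_RESOURCE;
                }
                /* ... map and dispatch bd->rq ... */
                return BLK_STS_OK;
        }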
* | Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm (Linus Torvalds, 2018-01-31, 24 files, -612/+1252)

    Pull device mapper updates from Mike Snitzer:

     - DM core fixes to ensure that bio submission follows a depth-first
       tree walk; this is critical to allow forward progress without the
       need to use the bioset's BIOSET_NEED_RESCUER.

     - Remove DM core's BIOSET_NEED_RESCUER based dm_offload
       infrastructure.

     - DM core cleanups and improvements to make bio-based DM more
       efficient (e.g. reduced memory footprint as well as leveraging
       per-bio-data more).

     - Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that
       leverages the more direct IO submission path in the block layer;
       this mode is used by DM multipath and also optimizes targets like
       DM thin-pool that stack directly on NVMe data device.

     - DM multipath improvements to factor out legacy SCSI-only (e.g.
       scsi_dh) code paths to allow for more optimized support for NVMe
       multipath.

     - A fix for DM multipath path selectors (service-time and
       queue-length) to select paths in a more balanced way; largely
       academic but doesn't hurt.

     - Numerous DM raid target fixes and improvements.

     - Add a new DM "unstriped" target that enables Intel to workaround
       firmware limitations in some NVMe drives that are striped
       internally (this target also works when stacked above the DM
       "striped" target).

     - Various Documentation fixes and improvements.

     - Misc cleanups and fixes across various DM infrastructure and
       targets (e.g. bufio, flakey, log-writes, snapshot).

    * tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
      dm cache: Documentation: update default migration_throttling value
      dm mpath selector: more evenly distribute ties
      dm unstripe: fix target length versus number of stripes size check
      dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
      dm table: fix NVMe bio-based dm_table_determine_type() validation
      dm: various cleanups to md->queue initialization code
      dm mpath: delay the retry of a request if the target responded as busy
      dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
      dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
      dm log writes: fix max length used for kstrndup
      dm: backfill missing calls to mutex_destroy()
      dm snapshot: use mutex instead of rw_semaphore
      dm flakey: check for null arg_name in parse_features()
      dm thin: extend thinpool status format string with omitted fields
      dm thin: fixes in thin-provisioning.txt
      dm thin: document representation of <highest mapped sector> when there is none
      dm thin: fix documentation relative to low water mark threshold
      dm cache: be consistent in specifying sectors and SI units in cache.txt
      dm cache: delete obsoleted paragraph in cache.txt
      dm cache: fix grammar in cache-policies.txt
      ...
| * | dm mpath selector: more evenly distribute tiesKhazhismel Kumykov2018-01-292-6/+6
Move the last used path to the end of the list (least preferred) so
that ties are more evenly distributed. For example, in a case with
three paths where one is slower than the others, the remaining two
would be unevenly used if they tie. This is due to the rotation not
being a truly fair distribution.

Illustrated: paths a, b, c; 'c' has 1 outstanding IO, a and b are 'tied'.

Three possible rotations:
(a, b, c) -> best path 'a'
(b, c, a) -> best path 'b'
(c, a, b) -> best path 'a'
(a, b, c) -> best path 'a'
(b, c, a) -> best path 'b'
(c, a, b) -> best path 'a'
...

So 'a' is used 2x more than 'b', although they should be used evenly.

With this change, the most recently used path is always the least
preferred, removing this bias and resulting in an even distribution:
(a, b, c) -> best path 'a'
(b, c, a) -> best path 'b'
(c, a, b) -> best path 'a'
(c, b, a) -> best path 'b'
...

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
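The essence of the change, as a hedged reconstruction of the selector
hunks (dm-queue-length.c and dm-service-time.c follow this pattern;
the exact variable names may differ from the real diff):

    /* In the selectors' select_path(): after picking the winner,
     * rotate it to the TAIL so tied paths alternate evenly. */
    -	list_move(&best->list, &s->valid_paths);
    +	list_move_tail(&best->list, &s->valid_paths);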
| * | dm unstripe: fix target length versus number of stripes size checkScott Bauer2018-01-291-8/+2
Since the unstripe target takes a target length which is the size of
*one* striped member we're trying to expose, not the total size of
*all* the striped members, the check does not make sense and fails for
some striped setups.

For example, say we have a 4TB striped device, i.e. 3907018496 sectors
per underlying device:

if (sector_div(width, uc->stripes))      : 3907018496 / 2 (num stripes) == 1953509248
tmp_len = width;
if (sector_div(tmp_len, uc->chunk_size)) : 1953509248 / 256 (chunk size) == 7630895.5 (fails)

Fix this by removing the first check, which isn't valid for unstriping.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
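A hedged sketch of the validation that remains after the fix: only the
per-member length has to be chunk-aligned (names follow the unstripe
constructor but are reconstructed here, not copied from the diff):

    /* unstripe_ctr(): ti->len is the size of ONE striped member. */
    sector_t tmp_len = ti->len;                 /* e.g. 3907018496 */

    if (sector_div(tmp_len, uc->chunk_size)) {  /* 3907018496 % 256 == 0: OK */
            ti->error = "Target length not divisible by chunk size";
            return -EINVAL;
    }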
| * | dm thin: fix trailing semicolon in __remap_and_issue_shared_cellLuis de Bethencourt2018-01-291-1/+1
The trailing semicolon forms an empty statement that does nothing, so
remove it.

Signed-off-by: Luis de Bethencourt <luisbg@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | dm table: fix NVMe bio-based dm_table_determine_type() validationMike Snitzer2018-01-291-22/+35
The 'verify_rq_based:' code in dm_table_determine_type() was checking
all devices in the DM table rather than only checking the data devices.
Fix this by using the immutable target's iterate_devices method.

Also, tweak the block of dm_table_determine_type() code that decides
whether to upgrade from DM_TYPE_BIO_BASED to DM_TYPE_NVME_BIO_BASED so
that it makes sure the immutable target doesn't require splitting IOs.

These changes have been verified to allow a "thin-pool" target whose
data device is an NVMe device to be upgraded to
DM_TYPE_NVME_BIO_BASED. Using the thin-pool in NVMe bio-based mode was
verified to pass all the device-mapper-test-suite's
"thin-provisioning" tests.

Also verified that request-based DM multipath (with queue_mode "rq"
and "mq") works as expected using the 'mptest' harness.

Fixes: 22c11858e ("dm: introduce DM_TYPE_NVME_BIO_BASED")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
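A minimal sketch of restricting the scan to the immutable target's
data devices; the device_is_nvme() callback is a hypothetical
placeholder for whatever predicate the patch actually uses:

    /* dm_table_determine_type(), sketch: iterate only the immutable
     * target's data devices instead of every device in the table. */
    struct dm_target *ti = dm_table_get_immutable_target(t);

    if (!ti->type->iterate_devices ||
        !ti->type->iterate_devices(ti, device_is_nvme, NULL))
            return DM_TYPE_BIO_BASED;   /* non-NVMe data device: no upgrade */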
| * | dm: various cleanups to md->queue initialization codeMike Snitzer2018-01-293-22/+12
Also, add dm_sysfs_init() error handling to dm_create().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
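A hedged sketch of the kind of error handling described; the exact
cleanup performed in dm_create() on failure is an assumption here:

    /* dm_create(), sketch: propagate dm_sysfs_init() failure instead
     * of ignoring it. */
    r = dm_sysfs_init(md);
    if (r) {
            free_dev(md);   /* assumed cleanup path */
            return r;
    }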
| * | dm mpath: delay the retry of a request if the target responded as busyMike Snitzer2018-01-292-1/+8
Add DM_ENDIO_DELAY_REQUEUE to allow request-based multipath's
multipath_end_io() to instruct dm-rq.c:dm_done() to delay a requeue.
This is beneficial to do if BLK_STS_RESOURCE is returned from the
target (because the target is busy).

Relative to blk-mq: kick the hw queues via blk_mq_requeue_work(),
indirectly from dm-rq.c:__dm_mq_kick_requeue_list(), after a delay.
For the old .request_fn: use blk_delay_queue().

bio-based multipath doesn't have feature parity with request-based
multipath for retryable error requeues; that is something that'll need
fixing in the future.

Suggested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Bart Van Assche <bart.vanassche@wdc.com>
[as interpreted from Bart's "... patch looks fine to me."]
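A minimal sketch of how dm-rq.c:dm_done() might honor the new return
value (reconstructed; the delay flag matches
dm_requeue_original_request() but the surrounding code is elided):

    /* dm_done(), sketch: map target end_io results to requeue modes. */
    switch (r) {
    case DM_ENDIO_REQUEUE:
            dm_requeue_original_request(tio, false);    /* immediate */
            break;
    case DM_ENDIO_DELAY_REQUEUE:
            dm_requeue_original_request(tio, true);     /* delayed */
            break;
    /* ... other cases elided ... */
    }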
| * | dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIREDMing Lei2018-01-171-3/+2
Avoid using DM_MAPIO_REQUEUE unless absolutely necessary because it
results in dm-rq.c:dm_mq_queue_rq() returning BLK_STS_RESOURCE to
blk-mq -- doing so should only ever be done if the underlying queue is
out of resources.

So switch to returning DM_MAPIO_DELAY_REQUEUE from
multipath_clone_and_map() if either MPATHF_QUEUE_IO or
MPATHF_PG_INIT_REQUIRED is set.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
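A hedged sketch of the switch in multipath_clone_and_map(); the flag
tests follow the commit text, while the surrounding logic is elided:

    /* multipath_clone_and_map(), sketch: path not usable yet. */
    if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
        test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
            return DM_MAPIO_DELAY_REQUEUE;  /* was DM_MAPIO_REQUEUE */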
| * | dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failureMing Lei2018-01-171-1/+13
blk-mq will rerun the queue via RESTART or a dispatch wakeup after one
request is completed, so it is not necessary to wait an arbitrary
amount of time before requeuing -- we should trust blk-mq to do it.

More importantly, we need to return BLK_STS_RESOURCE to blk-mq so that
dequeuing from the I/O scheduler can be stopped; this results in
improved I/O merging.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
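A minimal sketch of the allocation-failure path in
multipath_clone_and_map() (hedged reconstruction; the blk_get_request()
flags and the dying-queue branch are assumptions):

    /* Clone allocation against the chosen path's request queue. */
    clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
    if (IS_ERR(clone)) {
            if (blk_queue_dying(q))
                    /* Path is going away: retry later, elsewhere. */
                    return DM_MAPIO_DELAY_REQUEUE;
            /* Plain resource shortage: blk-mq's RESTART/dispatch
             * wakeup will rerun us, so requeue immediately. */
            return DM_MAPIO_REQUEUE;
    }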
| * | dm log writes: fix max length used for kstrndupMa Shimiao2018-01-171-1/+1
If the source string is longer than max, kstrndup() will allocate
max+1 bytes. So make sure the result will not exceed max.

Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
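A hedged sketch of the one-liner; the buffer and size names are
reconstructed, not taken from the diff:

    /* kstrndup(s, n, gfp) copies up to n bytes and then adds a NUL, so
     * it may allocate n+1 bytes; pass maxsize - 1 to cap the result at
     * maxsize bytes total. */
    block->data = kstrndup(data, maxsize - 1, GFP_KERNEL);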
| * | dm: backfill missing calls to mutex_destroy()Mike Snitzer2018-01-179-2/+27
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
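As an illustration of what "backfilling" means here, a minimal hedged
sketch (the lock name is hypothetical; mutex_destroy() is a debugging
aid that pairs with mutex_init()):

    /* Init path, e.g. device alloc or a target ctr: */
    mutex_init(&md->table_devices_lock);

    /* Matching teardown path, previously missing: */
    mutex_destroy(&md->table_devices_lock);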
| * | dm snapshot: use mutex instead of rw_semaphoreMikulas Patocka2018-01-171-41/+43
The rw_semaphore is acquired for read only in two places, neither of
which is performance-critical. So replace it with a mutex -- which is
more efficient.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
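A hedged sketch of the conversion pattern in dm-snap.c (field and call
sites reconstructed, not the verbatim diff):

    /* struct dm_snapshot: was 'struct rw_semaphore lock;' */
    struct mutex lock;

    /* Read-side call sites: down_read()/up_read() become */
    mutex_lock(&s->lock);
    /* ... critical section ... */
    mutex_unlock(&s->lock);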