summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/scrub.c
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of ↵Linus Torvalds2015-02-191-168/+141
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This pull is mostly cleanups and fixes: - The raid5/6 cleanups from Zhao Lei fixup some long standing warts in the code and add improvements on top of the scrubbing support from 3.19. - Josef has round one of our ENOSPC fixes coming from large btrfs clusters here at FB. - Dave Sterba continues a long series of cleanups (thanks Dave), and Filipe continues hammering on corner cases in fsync and others This all was held up a little trying to track down a use-after-free in btrfs raid5/6. It's not clear yet if this is just made easier to trigger with this pull or if its a new bug from the raid5/6 cleanups. Dave Sterba is the only one to trigger it so far, but he has a consistent way to reproduce, so we'll get it nailed shortly" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits) Btrfs: don't remove extents and xattrs when logging new names Btrfs: fix fsync data loss after adding hard link to inode Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group Btrfs: account for large extents with enospc Btrfs: don't set and clear delalloc for O_DIRECT writes Btrfs: only adjust outstanding_extents when we do a short write btrfs: Fix out-of-space bug Btrfs: scrub, fix sleep in atomic context Btrfs: fix scheduler warning when syncing log Btrfs: Remove unnecessary placeholder in btrfs_err_code btrfs: cleanup init for list in free-space-cache btrfs: delete chunk allocation attemp when setting block group ro btrfs: clear bio reference after submit_one_bio() Btrfs: fix scrub race leading to use-after-free Btrfs: add missing cleanup on sysfs init failure Btrfs: fix race between transaction commit and empty block group removal btrfs: add more checks to btrfs_read_sys_array btrfs: cleanup, rename a few variables in btrfs_read_sys_array btrfs: add checks for sys_chunk_array sizes btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize ...
| * Btrfs: scrub, fix sleep in atomic contextFilipe Manana2015-02-141-16/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | My previous patch "Btrfs: fix scrub race leading to use-after-free" introduced the possibility to sleep in an atomic context, which happens when the scrub_lock mutex is held at the time scrub_pending_bio_dec() is called - this function can be called under an atomic context. Chris ran into this in a debug kernel which gave the following trace: [ 1928.950319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:621 [ 1928.967334] in_atomic(): 1, irqs_disabled(): 0, pid: 149670, name: fsstress [ 1928.981324] INFO: lockdep is turned off. [ 1928.989244] CPU: 24 PID: 149670 Comm: fsstress Tainted: G W 3.19.0-rc7-mason+ #41 [ 1929.006418] Hardware name: ZTSYSTEMS Echo Ridge T4 /A9DRPF-10D, BIOS 1.07 05/10/2012 [ 1929.022207] ffffffff81a22cf8 ffff881076e03b78 ffffffff816b8dd9 ffff881076e03b78 [ 1929.037267] ffff880d8e828710 ffff881076e03ba8 ffffffff810856c4 ffff881076e03bc8 [ 1929.052315] 0000000000000000 000000000000026d ffffffff81a22cf8 ffff881076e03bd8 [ 1929.067381] Call Trace: [ 1929.072344] <IRQ> [<ffffffff816b8dd9>] dump_stack+0x4f/0x6e [ 1929.083968] [<ffffffff810856c4>] ___might_sleep+0x174/0x230 [ 1929.095352] [<ffffffff810857d2>] __might_sleep+0x52/0x90 [ 1929.106223] [<ffffffff816bb68f>] mutex_lock_nested+0x2f/0x3b0 [ 1929.117951] [<ffffffff810ab37d>] ? trace_hardirqs_on+0xd/0x10 [ 1929.129708] [<ffffffffa05dc838>] scrub_pending_bio_dec+0x38/0x70 [btrfs] [ 1929.143370] [<ffffffffa05dd0e0>] scrub_parity_bio_endio+0x50/0x70 [btrfs] [ 1929.157191] [<ffffffff812fa603>] bio_endio+0x53/0xa0 [ 1929.167382] [<ffffffffa05f96bc>] rbio_orig_end_io+0x7c/0xa0 [btrfs] [ 1929.180161] [<ffffffffa05f97ba>] raid_write_parity_end_io+0x5a/0x80 [btrfs] [ 1929.194318] [<ffffffff812fa603>] bio_endio+0x53/0xa0 [ 1929.204496] [<ffffffff8130401b>] blk_update_request+0x1eb/0x450 [ 1929.216569] [<ffffffff81096e58>] ? trigger_load_balance+0x78/0x500 [ 1929.229176] [<ffffffff8144c74d>] scsi_end_request+0x3d/0x1f0 [ 1929.240740] [<ffffffff8144ccac>] scsi_io_completion+0xac/0x5b0 [ 1929.252654] [<ffffffff81441c50>] scsi_finish_command+0xf0/0x150 [ 1929.264725] [<ffffffff8144d317>] scsi_softirq_done+0x147/0x170 [ 1929.276635] [<ffffffff8130ace6>] blk_done_softirq+0x86/0xa0 [ 1929.288014] [<ffffffff8105d92e>] __do_softirq+0xde/0x600 [ 1929.298885] [<ffffffff8105df6d>] irq_exit+0xbd/0xd0 (...) Fix this by using a reference count on the scrub context structure instead of locking the scrub_lock mutex. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: fix scrub race leading to use-after-freeFilipe Manana2015-02-021-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got the following trace: [68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50 [68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122 [68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060 [68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo [68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4 [68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs] [68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000 [68127.807663] RIP: 0010:[<ffffffff8107da31>] [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122 [68127.807663] RSP: 0018:ffff8803f097fbb8 EFLAGS: 00010093 [68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854 [68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001 [68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001 [68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390 [68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00 [68127.807663] FS: 0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000 [68127.807663] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0 [68127.807663] Stack: [68127.807663] ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000 [68127.807663] ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314 [68127.807663] ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50 [68127.807663] Call Trace: [68127.807663] [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55 [68127.807663] [<ffffffff81072948>] ? __wake_up+0x22/0x4b [68127.807663] [<ffffffff81072948>] __wake_up+0x22/0x4b [68127.807663] [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs] [68127.807663] [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs] [68127.807663] [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28 [68127.807663] [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9 [68127.807663] [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs] [68127.807663] [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs] [68127.807663] [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6 [68127.807663] [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf [68127.807663] [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8 [68127.807663] [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219 [68127.807663] [<ffffffff8105cd75>] kthread+0xdb/0xe3 [68127.807663] [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67 [68127.807663] [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0 [68127.807663] [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67 [68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75 [68127.807663] RIP [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122 [68127.807663] RSP <ffff8803f097fbb8> [68127.807663] CR2: ffff8803f8947a50 [68127.807663] ---[ end trace d7045aac00a66cd8 ]--- This is due to a race that can happen in a very tiny time window and is illustrated by the following sequence diagram: CPU 1 CPU 2 btrfs_scrub_dev() scrub_bio_end_io_worker() scrub_pending_bio_dec() atomic_dec(&sctx->bios_in_flight) wait sctx->bios_in_flight == 0 wait sctx->workers_pending == 0 mutex_lock(&fs_info->scrub_lock) (...) mutex_lock(&fs_info->scrub_lock) scrub_free_ctx(sctx) kfree(sctx) wake_up(&sctx->list_wait) __wake_up() spin_lock_irqsave(&sctx->list_wait->lock, flags) Another variation of this scenario that results in the same use-after-free issue is: CPU 1 CPU 2 btrfs_scrub_dev() wait sctx->bios_in_flight == 0 scrub_bio_end_io_worker() scrub_pending_bio_dec() __wake_up(&sctx->list_wait) spin_lock_irqsave(&sctx->list_wait->lock, flags) default_wake_function() wake up task at CPU 2 wait sctx->workers_pending == 0 mutex_lock(&fs_info->scrub_lock) (...) mutex_lock(&fs_info->scrub_lock) scrub_free_ctx(sctx) kfree(sctx) spin_unlock_irqrestore(&sctx->list_wait->lock, flags) Fix this by holding the scrub lock while doing the wakeup. This isn't a recent regression, the issue as been around since the scrub feature was added (2011, commit a2de733c78fa7af51ba9670482fa7d392aa67c57). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Rename all ref_count to refs in structZhao Lei2015-01-211-13/+13
| | | | | | | | | | | | | | | | refs is better than ref_count to record a struct's ref count. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Suggested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simplyZhao Lei2015-01-211-11/+6
| | | | | | | | | | | | | | | | | | | | | | So we can check raid56 with: (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) instead of long: (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: Include map_type in raid_bioZhao Lei2015-01-211-13/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Corrent code use many kinds of "clever" way to determine operation target's raid type, as: raid_map != NULL or raid_map[MAX_NR] == RAID[56]_Q_STRIPE To make code easy to maintenance, this patch put raid type into bbio, and we can always get raid type from bbio with a "stupid" way. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: Simplify scrub_setup_recheck_block()'s argumentZhao Lei2015-01-211-16/+9
| | | | | | | | | | | | | | | | | | | | scrub_setup_recheck_block() have many arguments but most of them can be get from one of them, we can remove them to make code clean. Some other cleanup for that function also included in this patch. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: Combine per-page recover in dev-replace and scrubZhao Lei2015-01-211-72/+48
| | | | | | | | | | | | | | | | | | The code are similar, combine them to make code clean and easy to maintenance. Some lost condition are also completed with benefit of this combination. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: Separate finding-right-mirror and writing-to-target's process in ↵Zhao Lei2015-01-211-27/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | scrub_handle_errored_block() In corrent code, code of finding-right-mirror and writing-to-target are mixed in logic, if we find a right mirror but failed in writing to target, it will treat as "hadn't found right block", and fill the target with sblock_bad. Actually, "failed in writing to target" does not mean "source block is wrong", this patch separate above two condition in logic, and do some cleanup to make code clean. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: Break loop when reach BTRFS_MAX_MIRRORS in scrub_setup_recheck_block()Zhao Lei2015-01-211-1/+1
| | | | | | | | | | | | | | | | | | Use break instead of useless loop should be more suitable in this case. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: Remove noneed force_write in scrub_write_block_to_dev_replaceZhao Lei2015-01-211-12/+7
| | | | | | | | | | | | | | | | | | It is always 1 in this place, because !1 case was already jumped out in previous code. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: Fix a jump typo of nodatasum_case to avoid wrong WARN_ON()Zhao Lei2015-01-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | if (sctx->is_dev_replace && !is_metadata && !have_csum) { ... goto nodatasum_case; } ... nodatasum_case: WARN_ON(sctx->is_dev_replace); In above code, nodatasum_case marker should be moved after WARN_ON(). Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: add ref_count and free function for btrfs_bioZhao Lei2015-01-211-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag to keep btrfs_bio's memory in raid56 recovery implement. 2: free function for bbio will make code clean and flexible, plus forced data type checking in compile. Changelog v1->v2: Rename following by David Sterba's suggestion: put_btrfs_bio() -> btrfs_put_bio() get_btrfs_bio() -> btrfs_get_bio() bbio->ref_count -> bbio->refs Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: Make raid_map array be inlined in btrfs_bio structureZhao Lei2015-01-211-21/+10
| | | | | | | | | | | | | | | | | | It can make code more simple and clear, we need not care about free bbio and raid_map together. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: fix a out-of-bound access of raid_mapZhao Lei2015-01-211-2/+5
| | | | | | | | | | | | | | | | | | | | | | We add the number of stripes on target devices into bbio->num_stripes if we are under device replacement, and we just sort the raid_map of those stripes that not on the target devices, so if when we need real raid_map, we need skip the stripes on the target devices. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: expand btrfs_find_item if found_key is NULLDavid Sterba2015-01-141-2/+6
| | | | | | | | | | | | | | | | | | | | If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper for simple btrfs_search_slot. After we've removed all such callers, passing a NULL key is not valid anymore. Signed-off-by: David Sterba <dsterba@suse.cz>
| * btrfs: cleanup, remove inode_item_info helperDavid Sterba2015-01-141-1/+5
| | | | | | | | | | | | | | It's only a simple wrapper around btrfs_find_item, the locally defined key is not used. Signed-off-by: David Sterba <dsterba@suse.cz>
* | btrfs: fix raid56 scrub failed in xfstests btrfs/072Gui Hecheng2015-01-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The xfstests btrfs/072 reports uncorrectable read errors in dmesg, because scrub forgets to use commit_root for parity scrub routine and scrub attempts to scrub those extents items whose contents are not fully on disk. To fix it, we just add the @search_commit_root flag back. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chris Mason <clm@fb.com>
* | Btrfs: fix incorrect freeing in scrub_stripeTsutomu Itoh2015-01-191-1/+1
|/ | | | | | | | The address that should be freed is not 'ppath' but 'path'. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Reviewed-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs, scrub: uninitialized variable in scrub_extent_for_parity()Dan Carpenter2015-01-021-1/+1
| | | | | | | | | The only way that "ret" is set is when we call scrub_pages_for_parity() so the skip to "if (ret) " test doesn't make sense and causes a static checker warning. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
* Merge branch 'raid56-scrub-replace' of git://github.com/miaoxie/linux-btrfs ↵Chris Mason2014-12-021-32/+771
|\ | | | | | | into for-linus
| * Btrfs, raid56: fix use-after-free problem in the final device replace ↵Miao Xie2014-12-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | procedure on raid56 The commit c404e0dc (Btrfs: fix use-after-free in the finishing procedure of the device replace) fixed a use-after-free problem which happened when removing the source device at the end of device replace, but at that time, btrfs didn't support device replace on raid56, so we didn't fix the problem on the raid56 profile. Currently, we implemented device replace for raid56, so we need kick that problem out before we enable that function for raid56. The fix method is very simple, we just increase the bio per-cpu counter before we submit a raid56 io, and decrease the counter when the raid56 io ends. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
| * Btrfs, replace: write raid56 parity into the replace target deviceMiao Xie2014-12-031-1/+1
| | | | | | | | | | | | | | | | This function reused the code of parity scrub, and we just write the right parity or corrected parity into the target device before the parity scrub end. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
| * Btrfs, raid56: support parity scrub on raid56Miao Xie2014-12-031-14/+595
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The implementation is: - Read and check all the data with checksum in the same stripe. All the data which has checksum is COW data, and we are sure that it is not changed though we don't lock the stripe. because the space of that data just can be reclaimed after the current transction is committed, and then the fs can use it to store the other data, but when doing scrub, we hold the current transaction, that is that data can not be recovered, it is safe that read and check it out of the stripe lock. - Lock the stripe - Read out all the data without checksum and parity The data without checksum and the parity may be changed if we don't lock the stripe, so we need read it in the stripe lock context. - Check the parity - Re-calculate the new parity and write back it if the old parity is not right - Unlock the stripe If we can not read out the data or the data we read is corrupted, we will try to repair it. If the repair fails. we will mark the horizontal sub-stripe(pages on the same horizontal) as corrupted sub-stripe, and we will skip the parity check and repair of that horizontal sub-stripe. And in order to skip the horizontal sub-stripe that has no data, we introduce a bitmap. If there is some data on the horizontal sub-stripe, we will the relative bit to 1, and when we check and repair the parity, we will skip those horizontal sub-stripes that the relative bits is 0. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
| * Btrfs, scrub: repair the common data on RAID5/6 if it is corruptedMiao Xie2014-12-031-18/+176
| | | | | | | | | | | | | | | | | | This patch implement the RAID5/6 common data repair function, the implementation is similar to the scrub on the other RAID such as RAID1, the differentia is that we don't read the data from the mirror, we use the data repair function of RAID5/6. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
* | btrfs: fix dead lock while running replace and defrag concurrentlyGui Hecheng2014-11-201-30/+60
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This can be reproduced by fstests: btrfs/070 The scenario is like the following: replace worker thread defrag thread --------------------- ------------- copy_nocow_pages_worker btrfs_defrag_file copy_nocow_pages_for_inode ... btrfs_writepages |A| lock_extent_bits extent_write_cache_pages |B| lock_page __extent_writepage ... writepage_delalloc find_lock_delalloc_range |B| lock_extent_bits find_or_create_page pagecache_get_page |A| lock_page This leads to an ABBA pattern deadlock. To fix it, o we just change it to an AABB pattern which means to @unlock_extent_bits() before we @lock_page(), and in this way the @extent_read_full_page_nolock() is no longer in an locked context, so change it back to @extent_read_full_page() to regain protection. o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(), the extent may not really point at the physical block we want, so we have to check it before write. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: remove unused members from struct scrub_warningDavid Sterba2014-10-021-15/+2
| | | | Signed-off-by: David Sterba <dsterba@suse.cz>
* Btrfs: modify clean_io_failure and make it suit direct ioMiao Xie2014-09-171-2/+1
| | | | | | | | | | We could not use clean_io_failure in the direct IO path because it got the filesystem information from the page structure, but the page in the direct IO bio didn't have the filesystem information in its structure. So we need modify it and pass all the information it need by parameters. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: modify repair_io_failure and make it suit direct ioMiao Xie2014-09-171-0/+1
| | | | | | | | | | | | | The original code of repair_io_failure was just used for buffered read, because it got some filesystem data from page structure, it is safe for the page in the page cache. But when we do a direct read, the pages in bio are not in the page cache, that is there is no filesystem data in the page structure. In order to implement direct read data repair, we need modify repair_io_failure and pass all filesystem data it need by function parameters. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix wrong disk size when writing super blocksMiao Xie2014-09-171-1/+2
| | | | | | | | | | | | total_size will be changed when resizing a device, and disk_total_size will be changed if resizing is successful. Meanwhile, the on-disk super blocks of the previous transaction might not be updated. Considering the consistency of the metadata in the previous transaction, We should use the size in the previous transaction to check if the super block is beyond the boundary of the device. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix wrong generation check of super block on a seed deviceMiao Xie2014-09-171-1/+5
| | | | | | | | | | | | The super block generation of the seed devices is not the same as the filesystem which sprouted from them because we don't update the super block on the seed devices when we change that new filesystem. So we should not use the generation of that new filesystem to check the super block generation on the seed devices, Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix wrong fsid check of scrubMiao Xie2014-09-171-5/+13
| | | | | | | | | | All the metadata in the seed devices has the same fsid as the fsid of the seed filesystem which is on the seed device, so we should check them by the current filesystem. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: use nodesize everywhere, kill leafsizeDavid Sterba2014-09-171-16/+1
| | | | | | | | | | | | | | | The nodesize and leafsize were never of different values. Unify the usage and make nodesize the one. Cleanup the redundant checks and helpers. Shaves a few bytes from .text: text data bss dec hex filename 852418 24560 23112 900090 dbbfa btrfs.ko.before 851074 24584 23112 898770 db6d2 btrfs.ko.after Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: kill the key type accessor helpersDavid Sterba2014-09-171-1/+1
| | | | | | | | | btrfs_set_key_type and btrfs_key_type are used inconsistently along with open coded variants. Other members of btrfs_key are accessed directly without any helpers anyway. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix task hang under heavy compressed writeLiu Bo2014-08-241-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. Btrfs now migrates to use kernel workqueue, but it introduces this hang problem. Btrfs has a kind of work queued as an ordered way, which means that its ordered_func() must be processed in the way of FIFO, so it usually looks like -- normal_work_helper(arg) work = container_of(arg, struct btrfs_work, normal_work); work->func() <---- (we name it work X) for ordered_work in wq->ordered_list ordered_work->ordered_func() ordered_work->ordered_free() The hang is a rare case, first when we find free space, we get an uncached block group, then we go to read its free space cache inode for free space information, so it will file a readahead request btrfs_readpages() for page that is not in page cache __do_readpage() submit_extent_page() btrfs_submit_bio_hook() btrfs_bio_wq_end_io() submit_bio() end_workqueue_bio() <--(ret by the 1st endio) queue a work(named work Y) for the 2nd also the real endio() So the hang occurs when work Y's work_struct and work X's work_struct happens to share the same address. A bit more explanation, A,B,C -- struct btrfs_work arg -- struct work_struct kthread: worker_thread() pick up a work_struct from @worklist process_one_work(arg) worker->current_work = arg; <-- arg is A->normal_work worker->current_func(arg) normal_work_helper(arg) A = container_of(arg, struct btrfs_work, normal_work); A->func() A->ordered_func() A->ordered_free() <-- A gets freed B->ordered_func() submit_compressed_extents() find_free_extent() load_free_space_inode() ... <-- (the above readhead stack) end_workqueue_bio() btrfs_queue_work(work C) B->ordered_free() As if work A has a high priority in wq->ordered_list and there are more ordered works queued after it, such as B->ordered_func(), its memory could have been freed before normal_work_helper() returns, which means that kernel workqueue code worker_thread() still has worker->current_work pointer to be work A->normal_work's, ie. arg's address. Meanwhile, work C is allocated after work A is freed, work C->normal_work and work A->normal_work are likely to share the same address(I confirmed this with ftrace output, so I'm not just guessing, it's rare though). When another kthread picks up work C->normal_work to process, and finds our kthread is processing it(see find_worker_executing_work()), it'll think work C as a collision and skip then, which ends up nobody processing work C. So the situation is that our kthread is waiting forever on work C. Besides, there're other cases that can lead to deadlock, but the real problem is that all btrfs workqueue shares one work->func, -- normal_work_helper, so this makes each workqueue to have its own helper function, but only a wraper pf normal_work_helper. With this patch, I no long hit the above hang. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: don't write any data into a readonly device when scrubMiao Xie2014-08-191-0/+11
| | | | | | | | | We should not write data into a readonly device especially seed device when doing scrub, skip those devices. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: Skip scrubbing removed chunks to avoid -ENOENT.Qu Wenruo2014-06-191-10/+9
| | | | | | | | | | | | | | | | | When run scrub with balance, sometimes -ENOENT will be returned, since in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but btrfs_lookup_block_group() will search block group in *MEMORY*, so if a chunk is removed but not committed, -ENOENT will be returned. However, there is no need to stop scrubbing since other chunks may be scrubbed without problem. So this patch changes the behavior to skip removed chunks and continue to scrub the rest. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix scrub_print_warning to handle skinny metadata extentsLiu Bo2014-06-091-2/+3
| | | | | | | | | The skinny extents are intepreted incorrectly in scrub_print_warning(), and end up hitting the BUG() in btrfs_extent_inline_ref_size. Reported-by: Konstantinos Skarlatos <k.skarlatos@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: Remove unnecessary check for NULLTobias Klauser2014-06-091-2/+2
| | | | | | | | iput() already checks for the inode being NULL, thus it's unnecessary to check before calling. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Chris Mason <clm@fb.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2014-04-111-19/+89
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull second set of btrfs updates from Chris Mason: "The most important changes here are from Josef, fixing a btrfs regression in 3.14 that can cause corruptions in the extent allocation tree when snapshots are in use. Josef also fixed some deadlocks in send/recv and other assorted races when balance is running" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits) Btrfs: fix compile warnings on on avr32 platform btrfs: allow mounting btrfs subvolumes with different ro/rw options btrfs: export global block reserve size as space_info btrfs: fix crash in remount(thread_pool=) case Btrfs: abort the transaction when we don't find our extent ref Btrfs: fix EINVAL checks in btrfs_clone Btrfs: fix unlock in __start_delalloc_inodes() Btrfs: scrub raid56 stripes in the right way Btrfs: don't compress for a small write Btrfs: more efficient io tree navigation on wait_extent_bit Btrfs: send, build path string only once in send_hole btrfs: filter invalid arg for btrfs resize Btrfs: send, fix data corruption due to incorrect hole detection Btrfs: kmalloc() doesn't return an ERR_PTR Btrfs: fix snapshot vs nocow writting btrfs: Change the expanding write sequence to fix snapshot related bug. btrfs: make device scan less noisy btrfs: fix lockdep warning with reclaim lock inversion Btrfs: hold the commit_root_sem when getting the commit root during send Btrfs: remove transaction from send ...
| * Btrfs: fix compile warnings on on avr32 platformWang Shilong2014-04-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | fs/btrfs/scrub.c: In function 'get_raid56_logic_offset': fs/btrfs/scrub.c:2269: warning: comparison of distinct pointer types lacks a cast fs/btrfs/scrub.c:2269: warning: right shift count >= width of type fs/btrfs/scrub.c:2269: warning: passing argument 1 of '__div64_32' from incompatible pointer type Since @rot is an int type, we should not use do_div(), fix it. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: scrub raid56 stripes in the right wayWang Shilong2014-04-071-19/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Steps to reproduce: # mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5 # mount /dev/sda8 /mnt # btrfs scrub start -BR /mnt # echo $? <--unverified errors make return value be 3 This is because we don't setup right mapping between physical and logical address for raid56, which makes checksum mismatch. But we will find everthing is fine later when rechecking using btrfs_map_block(). This patch fixed the problem by settuping right mappings and we only verify data stripes' checksums. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2014-04-041-35/+62
|\ \ | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs changes from Chris Mason: "This is a pretty long stream of bug fixes and performance fixes. Qu Wenruo has replaced the btrfs async threads with regular kernel workqueues. We'll keep an eye out for performance differences, but it's nice to be using more generic code for this. We still have some corruption fixes and other patches coming in for the merge window, but this batch is tested and ready to go" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits) Btrfs: fix a crash of clone with inline extents's split btrfs: fix uninit variable warning Btrfs: take into account total references when doing backref lookup Btrfs: part 2, fix incremental send's decision to delay a dir move/rename Btrfs: fix incremental send's decision to delay a dir move/rename Btrfs: remove unnecessary inode generation lookup in send Btrfs: fix race when updating existing ref head btrfs: Add trace for btrfs_workqueue alloc/destroy Btrfs: less fs tree lock contention when using autodefrag Btrfs: return EPERM when deleting a default subvolume Btrfs: add missing kfree in btrfs_destroy_workqueue Btrfs: cache extent states in defrag code path Btrfs: fix deadlock with nested trans handles Btrfs: fix possible empty list access when flushing the delalloc inodes Btrfs: split the global ordered extents mutex Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock Btrfs: reclaim delalloc metadata more aggressively Btrfs: remove unnecessary lock in may_commit_transaction() Btrfs: remove the unnecessary flush when preparing the pages Btrfs: just do dirty page flush for the inode with compression before direct IO ...
| * btrfs: Cleanup the "_struct" suffix in btrfs_workequeueQu Wenruo2014-03-101-13/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the "_struct" suffix is mainly used for distinguish the differnt btrfs_work between the original and the newly created one, there is no need using the suffix since all btrfs_workers are changed into btrfs_workqueue. Also this patch fixed some codes whose code style is changed due to the too long "_struct" suffix. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
| * btrfs: Replace fs_info->scrub_* workqueue with btrfs_workqueue.Qu Wenruo2014-03-101-43/+50
| | | | | | | | | | | | | | | | | | Replace the fs_info->scrub_* with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
| * Btrfs: wake up @scrub_pause_wait as much as we canWang Shilong2014-03-101-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | check if @scrubs_running=@scrubs_paused condition inside wait_event() is not an atomic operation which means we may inc/dec @scrub_running/ paused at any time. Let's wake up @scrub_pause_wait as much as we can to let commit transaction blocked less. An example below: Thread1 Thread2 |->scrub_blocked_if_needed() |->scrub_pending_trans_workers_inc |->increase @scrub_paused |->increase @scrub_running |->wake up scrub_pause_wait list |->scrub blocked |->increase @scrub_paused Thread3 is commiting transaction which is blocked at btrfs_scrub_pause(). So after Thread2 increase @scrub_paused, we meet the condition @scrub_paused=@scrub_running, but transaction will be still blocked until another calling to wake up @scrub_pause_wait. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
| * Btrfs: device_replace: fix deadlock for nocow caseWang Shilong2014-03-101-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit cb7ab02156e4 cause a following deadlock found by xfstests,btrfs/011: Thread1 is commiting transaction which is blocked at btrfs_scrub_pause(). Thread2 is calling btrfs_file_aio_write() which has held inode's @i_mutex and commit transaction(blocked because Thread1 is committing transaction). Thread3 is copy_nocow_page worker which will also try to hold inode @i_mutex, so thread3 will wait Thread1 finished. Thread4 is waiting pending workers finished which will wait Thread3 finished. So the problem is like this: Thread1--->Thread4--->Thread3--->Thread2---->Thread1 Deadlock happens! we fix it by letting Thread1 go firstly, which means we won't block transaction commit while we are waiting pending workers finished. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2014-01-301-66/+68
|\ \ | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This is a pretty big pull, and most of these changes have been floating in btrfs-next for a long time. Filipe's properties work is a cool building block for inheriting attributes like compression down on a per inode basis. Jeff Mahoney kicked in code to export filesystem info into sysfs. Otherwise, lots of performance improvements, cleanups and bug fixes. Looks like there are still a few other small pending incrementals, but I wanted to get the bulk of this in first" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits) Btrfs: fix spin_unlock in check_ref_cleanup Btrfs: setup inode location during btrfs_init_inode_locked Btrfs: don't use ram_bytes for uncompressed inline items Btrfs: fix btrfs_search_slot_for_read backwards iteration Btrfs: do not export ulist functions Btrfs: rework ulist with list+rb_tree Btrfs: fix memory leaks on walking backrefs failure Btrfs: fix send file hole detection leading to data corruption Btrfs: add a reschedule point in btrfs_find_all_roots() Btrfs: make send's file extent item search more efficient Btrfs: fix to catch all errors when resolving indirect ref Btrfs: fix protection between walking backrefs and root deletion btrfs: fix warning while merging two adjacent extents Btrfs: fix infinite path build loops in incremental send btrfs: undo sysfs when open_ctree() fails Btrfs: fix snprintf usage by send's gen_unique_name btrfs: fix defrag 32-bit integer overflow btrfs: sysfs: list the NO_HOLES feature btrfs: sysfs: don't show reserved incompat feature btrfs: call permission checks earlier in ioctls and return EPERM ...
| * Btrfs: fix to search previous metadata extent item since skinny metadataWang Shilong2014-01-281-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a bug that using btrfs_previous_item() to search metadata extent item. This is because in btrfs_previous_item(), we need type match, however, since skinny metada was introduced by josef, we may mix this two types. So just use btrfs_previous_item() is not working right. To keep btrfs_previous_item() like normal tree search, i introduce another function btrfs_previous_extent_item(). Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: fix missing skinny metadata check in scrub_stripe()Wang Shilong2014-01-281-1/+4
| | | | | | | | | | | | | | | | | | Check if we support skinny metadata firstly and fix to use right type to search. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
OpenPOWER on IntegriCloud