summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | | VFS: assorted weird filesystems: d_inode() annotationsDavid Howells2015-04-153-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells2015-04-15266-1351/+1350
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | VFS: Cachefiles should perform fs modifications on the top layer onlyDavid Howells2015-04-152-28/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cachefiles should perform fs modifications (eg. vfs_unlink()) on the top layer only and should not attempt to alter the lower layer. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | Merge branch 'for-4.1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2015-04-2410-77/+35
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Bruce Fields: "A quiet cycle this time; this is basically entirely bugfixes. The few that aren't cc'd to stable are cleanup or seemed unlikely to affect anyone much" * 'for-4.1' of git://linux-nfs.org/~bfields/linux: uapi: Remove kernel internal declaration nfsd: fix nsfd startup race triggering BUG_ON nfsd: eliminate NFSD_DEBUG nfsd4: fix READ permission checking nfsd4: disallow SEEK with special stateids nfsd4: disallow ALLOCATE with special stateids nfsd: add NFSEXP_PNFS to the exflags array nfsd: Remove duplicate macro define for max sec label length nfsd: allow setting acls with unenforceable DENYs nfsd: NFSD_FAULT_INJECTION depends on DEBUG_FS nfsd: remove unused status arg to nfsd4_cleanup_open_state nfsd: remove bogus setting of status in nfsd4_process_open2 NFSD: Use correct reply size calculating function NFSD: Using path_equal() for checking two paths
| * | | | | nfsd: fix nsfd startup race triggering BUG_ONGiuseppe Cantavenera2015-04-211-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nfsd triggered a BUG_ON in net_generic(...) when rpc_pipefs_event(...) in fs/nfsd/nfs4recover.c was called before assigning ntfsd_net_id. The following was observed on a MIPS 32-core processor: kernel: Call Trace: kernel: [<ffffffffc00bc5e4>] rpc_pipefs_event+0x7c/0x158 [nfsd] kernel: [<ffffffff8017a2a0>] notifier_call_chain+0x70/0xb8 kernel: [<ffffffff8017a4e4>] __blocking_notifier_call_chain+0x4c/0x70 kernel: [<ffffffff8053aff8>] rpc_fill_super+0xf8/0x1a0 kernel: [<ffffffff8022204c>] mount_ns+0xb4/0xf0 kernel: [<ffffffff80222b48>] mount_fs+0x50/0x1f8 kernel: [<ffffffff8023dc00>] vfs_kern_mount+0x58/0xf0 kernel: [<ffffffff802404ac>] do_mount+0x27c/0xa28 kernel: [<ffffffff80240cf0>] SyS_mount+0x98/0xe8 kernel: [<ffffffff80135d24>] handle_sys64+0x44/0x68 kernel: kernel: Code: 0040f809 00000000 2e020001 <00020336> 3c12c00d 3c02801a de100000 6442eb98 0040f809 kernel: ---[ end trace 7471374335809536 ]--- Fixed this behaviour by calling register_pernet_subsys(&nfsd_net_ops) before registering rpc_pipefs_event(...) with the notifier chain. Signed-off-by: Giuseppe Cantavenera <giuseppe.cantavenera.ext@nokia.com> Signed-off-by: Lorenzo Restelli <lorenzo.restelli.ext@nokia.com> Reviewed-by: Kinlong Mee <kinglongmee@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | nfsd: eliminate NFSD_DEBUGMark Salter2015-04-213-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit f895b252d4edf ("sunrpc: eliminate RPC_DEBUG") introduced use of IS_ENABLED() in a uapi header which leads to a build failure for userspace apps trying to use <linux/nfsd/debug.h>: linux/nfsd/debug.h:18:15: error: missing binary operator before token "(" #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) ^ Since this was only used to define NFSD_DEBUG if CONFIG_SUNRPC_DEBUG is enabled, replace instances of NFSD_DEBUG with CONFIG_SUNRPC_DEBUG. Cc: stable@vger.kernel.org Fixes: f895b252d4edf "sunrpc: eliminate RPC_DEBUG" Signed-off-by: Mark Salter <msalter@redhat.com> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | nfsd4: fix READ permission checkingJ. Bruce Fields2015-04-211-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the case we already have a struct file (derived from a stateid), we still need to do permission-checking; otherwise an unauthorized user could gain access to a file by sniffing or guessing somebody else's stateid. Cc: stable@vger.kernel.org Fixes: dc97618ddda9 "nfsd4: separate splice and readv cases" Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | nfsd4: disallow SEEK with special stateidsJ. Bruce Fields2015-04-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the client uses a special stateid then we'll pass a NULL file to vfs_llseek. Fixes: 24bab491220f " NFSD: Implement SEEK" Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: stable@vger.kernel.org Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | nfsd4: disallow ALLOCATE with special stateidsJ. Bruce Fields2015-04-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vfs_fallocate will hit a NULL dereference if the client tries an ALLOCATE or DEALLOCATE with a special stateid. Fix that. (We also depend on the open to have broken any conflicting leases or delegations for us.) (If it turns out we need to allow special stateid's then we could do a temporary open here in the special-stateid case, as we do for read and write. For now I'm assuming it's not necessary.) Fixes: 95d871f03cae "nfsd: Add ALLOCATE support" Cc: stable@vger.kernel.org Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | nfsd: add NFSEXP_PNFS to the exflags arrayChristoph Hellwig2015-04-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | nfsd: Remove duplicate macro define for max sec label lengthKinglong Mee2015-03-313-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NFS4_MAXLABELLEN has defined for sec label max length, use it directly. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | nfsd: allow setting acls with unenforceable DENYsJ. Bruce Fields2015-03-311-49/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've been refusing ACLs that DENY permissions that we can't effectively deny. (For example, we can't deny permission to read attributes.) Andreas points out that any DENY of Window's "read", "write", or "modify" permissions would trigger this. That would be annoying. So maybe we should be a little less paranoid, and ignore entirely the permissions that are meaningless to us. Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | nfsd: NFSD_FAULT_INJECTION depends on DEBUG_FSChengyu Song2015-03-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NFSD_FAULT_INJECTION depends on DEBUG_FS, otherwise the debugfs_create_* interface may return unexpected error -ENODEV, and cause system crash. Signed-off-by: Chengyu Song <csong84@gatech.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | nfsd: remove unused status arg to nfsd4_cleanup_open_stateJeff Layton2015-03-313-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | nfsd: remove bogus setting of status in nfsd4_process_open2Jeff Layton2015-03-311-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | status is always reset after this (and it doesn't make much sense there anyway). Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | NFSD: Use correct reply size calculating functionKinglong Mee2015-03-311-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ALLOCATE/DEALLOCATE only reply one status value to client, so, using nfsd4_only_status_rsize for reply size calculating. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Reviewed-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | NFSD: Using path_equal() for checking two pathsKinglong Mee2015-03-312-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | | | | | Merge branch 'for-linus-4.1' of ↵Linus Torvalds2015-04-2446-897/+2113
|\ \ \ \ \ \ | | |_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "I've been running these through a longer set of load tests because my commits change the free space cache writeout. It fixes commit stalls on large filesystems (~20T space used and up) that we have been triggering here. We were seeing new writers blocked for 10 seconds or more during commits, which is far from good. Josef and I fixed up ENOSPC aborts when deleting huge files (3T or more), that are triggered because our metadata reservations were not properly accounting for crcs and were not replenishing during the truncate. Also in this series, a number of qgroup fixes from Fujitsu and Dave Sterba collected most of the pending cleanups from the list" * 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (93 commits) btrfs: quota: Update quota tree after qgroup relationship change. btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations. btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota. btrfs: Update btrfs qgroup status item when rescan is done. btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value. btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created btrfs: Check qgroup level in kernel qgroup assign. btrfs: qgroup: allow to remove qgroup which has parent but no child. btrfs: qgroup: return EINVAL if level of parent is not higher than child's. btrfs: qgroup: do a reservation in a higher level. Btrfs: qgroup, Account data space in more proper timings. Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use. Btrfs: qgroup: free reserved in exceeding quota. Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup(). btrfs: qgroup: fix limit args override whole limit struct btrfs: qgroup: update limit info in function btrfs_run_qgroups(). btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item(). btrfs: qgroup: update qgroup in memory at the same time when we update it in btree. btrfs: qgroup: inherit limit info from srcgroup in creating snapshot. btrfs: Support busy loop of write and delete ...
| * | | | | btrfs: quota: Update quota tree after qgroup relationship change.Qu Wenruo2015-04-131-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previous patch modified the in memory struct but it's not written in quota tree until next commit. So user will still get old data using "btrfs qgroup show" after assign/remove. This patch will call btrfs_run_qgroups in assign ioctl so it will be updated to in memory quota trees and user will get up-to-date results. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: quota: Automatically update related qgroups or mark INCONSISTENT ↵Qu Wenruo2015-04-131-57/+126
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | flags when assigning/deleting a qgroup relations. Operation like qgroups assigning/deleting qgroup relations will mostly cause qgroup data inconsistent, since it needs to do the full rescan to determine whether shared extents are exclusive or still shared in parent qgroups. But there are some exceptions, like qgroup with only exclusive extents (qgroup->excl == qgroup->rfer), in that case, we only needs to modify all its parents' excl and rfer. So this patch adds a quick path for such qgroup in qgroup assign/remove routine, and if quick path failed, the qgroup status will be marked INCONSISTENT, and return 1 to info user-land. BTW since the quick path is much the same of qgroup_excl_accounting(), so move the core of it to __qgroup_excl_accounting() and reuse it. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota.Dongsheng Yang2015-04-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | we forgot to clear STATUS_FLAG_ON in quota_disable(), it will cause a problem shown as below: # mount /dev/sdc /mnt # btrfs quota enable /mnt # btrfs quota disable /mnt # btrfs quota rescan /mnt quota rescan started <--- expecting it fail here. # echo $? 0 Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: Update btrfs qgroup status item when rescan is done.Qu Wenruo2015-04-131-1/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update qgroup status when rescan is done. Before this patch, status item is not updated on rescan finish, which causing the RESCAN and INCONSISTENT flags never cleared. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value.Qu Wenruo2015-04-131-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Old qgroup_rescan_leaf() comment indicates ret == 2 as complete and cleared INCONSISTENT flag. This is not true since it will never return 2, and inside it no codes will clear INCONSISTENT flag. The flag clearance is done in btrfs_qgroup_rescan_work(). This caused the bug that INCONSISTENT flag is never cleared. So change the comment and fix the dead judgment. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be createdQu Wenruo2015-04-132-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs will create qgroup on subvolume creation if quota is enabled, but qgroup uses the high bits(currently 16 bits) as level, to build the inheritance. However it is fully possible a subvolume can be created with a subvolumeid larger than 1 << BTRFS_QGROUP_LEVEL_SHIFT, so it will be considered as level 1 and can't be assigned to other qgroup in level 1. This patch will prevent such things so qgroup inheritance will not be screwed up. The downside is very clear, btrfs subvolume number limit will decrease from (u64 max - 256(fisrt free objectid) - 256(last free objectid)) to (u48 max -256(first free objectid)). But we still have near u48(that's 15 digits in dec), so that should not be a huge problem. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: Check qgroup level in kernel qgroup assign.Qu Wenruo2015-04-132-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although we have qgroup level check in btrfs-progs, it's not enough since other programe may still call ioctl directly not using btrfs-progs. For example, systemd. But it's btrfs-progs to be blame since we don't provide a full-function(like subvolume create things) btrfs library with enough check, and only rely on kernel ioctl. So Add level checks in kernel too. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: qgroup: allow to remove qgroup which has parent but no child.Dongsheng Yang2015-04-131-5/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a qgroup has parents but no child, it should be removable in Theory I think. But currently, we can not remove it when it has either parent or child. Example: # btrfs quota enable /mnt # btrfs qgroup create 1/0 /mnt # btrfs qgroup create 2/0 /mnt # btrfs qgroup assign 1/0 2/0 /mnt # btrfs qgroup show -pcre /mnt qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 16384 16384 0 0 --- --- 1/0 0 0 0 0 2/0 --- 2/0 0 0 0 0 --- 1/0 At this time, there is no subvol or qgroup depending on it. Just a qgroup 2/0 is its parent, but 2/0 can work well without 1/0. So I think 1/0 should be removalbe. But: # btrfs qgroup destroy 1/0 /mnt ERROR: unable to destroy quota group: Device or resource busy This patch remove the check of qgroup->parent in removing it, then we can remove a qgroup when it has a parent. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: qgroup: return EINVAL if level of parent is not higher than child's.Dongsheng Yang2015-04-131-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we create a subvol inheriting a qgroup, we need to check the level of them. Otherwise, there is a chance a qgroup can inherit another qgroup at the same level. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: qgroup: do a reservation in a higher level.Dongsheng Yang2015-04-137-121/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two problems in qgroup: a). The PAGE_CACHE is 4K, even when we are writing a data of 1K, qgroup will reserve a 4K size. It will cause the last 3K in a qgroup is not available to user. b). When user is writing a inline data, qgroup will not reserve it, it means this is a window we can exceed the limit of a qgroup. The main idea of this patch is reserving the data size of write_bytes rather than the reserve_bytes. It means qgroup will not care about the data size btrfs will reserve for user, but only care about the data size user is going to write. Then reserve it when user want to write and release it in transaction committed. In this way, qgroup can be released from the complex procedure in btrfs and only do the reserve when user want to write and account when the data is written in commit_transaction(). Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | Btrfs: qgroup, Account data space in more proper timings.Dongsheng Yang2015-04-132-16/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currenly, in data writing, ->reserved is accounted in fill_delalloc(), but ->may_use is released in clear_bit_hook() which is called by btrfs_finish_ordered_io(). That's too late, that said, between fill_delalloc() and btrfs_finish_ordered_io(), the data is doublely accounted by qgroup. It will cause some unexpected -EDQUOT. Example: # btrfs quota enable /root/btrfs-auto-test/ # btrfs subvolume create /root/btrfs-auto-test//sub Create subvolume '/root/btrfs-auto-test/sub' # btrfs qgroup limit 1G /root/btrfs-auto-test//sub dd if=/dev/zero of=/root/btrfs-auto-test//sub/file bs=1024 count=1500000 dd: error writing '/root/btrfs-auto-test//sub/file': Disk quota exceeded 681353+0 records in 681352+0 records out 697704448 bytes (698 MB) copied, 8.15563 s, 85.5 MB/s It's (698 MB) when we got an -EDQUOT, but we limit it by 1G. This patch move the btrfs_qgroup_reserve/free() for data from btrfs_delalloc_reserve/release_metadata() to btrfs_check_data_free_space() and btrfs_free_reserved_data_space(). Then the accounter in qgroup will be updated at the same time with the accounter in space_info updated. In this way, the unexpected -EDQUOT will be killed. Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.Dongsheng Yang2015-04-134-6/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, for pre_alloc or delay_alloc, the bytes will be accounted in space_info by the three guys. space_info->bytes_may_use --- space_info->reserved --- space_info->used. But on the other hand, in qgroup, there are only two counters to account the bytes, qgroup->reserved and qgroup->excl. And qg->reserved accounts bytes in space_info->bytes_may_use and qg->excl accounts bytes in space_info->used. So the bytes in space_info->reserved is not accounted in qgroup. If so, there is a window we can exceed the quota limit when bytes is in space_info->reserved. Example: # btrfs quota enable /mnt # btrfs qgroup limit -e 10M /mnt # for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done # sync # btrfs qgroup show -pcre /mnt qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 20987904 20987904 0 10485760 --- --- qg->excl is 20987904 larger than max_excl 10485760. This patch introduce a new counter named may_use to qgroup, then there are three counters in qgroup to account bytes in space_info as below. space_info->bytes_may_use --- space_info->reserved --- space_info->used. qgroup->may_use --- qgroup->reserved --- qgroup->excl With this patch applied: # btrfs quota enable /mnt # btrfs qgroup limit -e 10M /mnt # for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done fallocate: /mnt/data9: fallocate failed: Disk quota exceeded fallocate: /mnt/data10: fallocate failed: Disk quota exceeded fallocate: /mnt/data11: fallocate failed: Disk quota exceeded fallocate: /mnt/data12: fallocate failed: Disk quota exceeded fallocate: /mnt/data13: fallocate failed: Disk quota exceeded fallocate: /mnt/data14: fallocate failed: Disk quota exceeded fallocate: /mnt/data15: fallocate failed: Disk quota exceeded fallocate: /mnt/data16: fallocate failed: Disk quota exceeded fallocate: /mnt/data17: fallocate failed: Disk quota exceeded fallocate: /mnt/data18: fallocate failed: Disk quota exceeded fallocate: /mnt/data19: fallocate failed: Disk quota exceeded # sync # btrfs qgroup show -pcre /mnt qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 9453568 9453568 0 10485760 --- --- Reported-by: Cyril SCETBON <cyril.scetbon@free.fr> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | Btrfs: qgroup: free reserved in exceeding quota.Dongsheng Yang2015-04-131-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we exceed quota limit in writing, we will free some reserved extent when we need to drop but not free account in qgroup. It means, each time we exceed quota in writing, there will be some remain space in qg->reserved we can not use any more. If things go on like this, the all space will be ate up. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup().Dongsheng Yang2015-04-134-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: qgroup: fix limit args override whole limit structDongsheng Yang2015-04-131-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_limit_group use arg limit to override the old qgroup_limit of corresponding qgroup. However, we should override part of old qgroup_limit according to the bit which has been set in arg limit. Signed-off-by: Fan Chengniang <fancn.fnst@cn.fujitsu.com> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: qgroup: update limit info in function btrfs_run_qgroups().Dongsheng Yang2015-04-131-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we commit_transaction(), qgroups in btree should be updated. But, limit info is not considered currently. It will cause a problem when a qgroup of a snapshot inherit the limit info from srcqgroup, then there is an inconsistency. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item().Dongsheng Yang2015-04-131-27/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cleanup: Change the parameter of update_qgroup_limit_item() to the family of update_qgroup_xxx_item(). Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: qgroup: update qgroup in memory at the same time when we update it in ↵Dongsheng Yang2015-04-131-11/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btree. When we call btrfs_qgroup_inherit() with BTRFS_QGROUP_INHERIT_SET_LIMITS, btrfs will update the limit info of qgroup in btree but forget to update the qgroup in rbtree at the same time. It obviousely will cause an inconsistency. This patch fix it by updating the rbtree at the same time. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: qgroup: inherit limit info from srcgroup in creating snapshot.Dongsheng Yang2015-04-131-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, when we snapshot a subvol, snapshot will not copy the limits from srcqgroup. This patch make the qgroup in snapshot inherit the limit info when create a snapshot. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: Support busy loop of write and deleteZhao Lei2015-04-131-12/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reproduce: while true; do dd if=/dev/zero of=/mnt/btrfs/file count=[75% fs_size] rm /mnt/btrfs/file done Then we can see above loop failed on NO_SPACE. It it long-term problem since very beginning, because delayed-iput after rm are not run. We already have commit_transaction() in alloc_space code, but it is not triggered in above case. This patch trigger commit_transaction() to run delayed-iput and reflash pinned-space to to make write success. It is based on previous fix of delayed-iput in commit_transaction(), need to be applied on top of: btrfs: Fix NO_SPACE bug caused by delayed-iput Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: Fix NO_SPACE bug caused by delayed-iputZhao Lei2015-04-134-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Steps to reproduce: while true; do dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%] rm /btrfs_dir/file sync done And we'll see dd failed because btrfs return NO_SPACE. Reason: Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs() in end to free fs space for next write, but sometimes it hadn't done work on time, because btrfs-cleaner thread get delayed-iputs from list before, but do iput() after next write. This is log: [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs() [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs() [ 2569.087554] comm=sync func=btrfs_commit_transaction() end [ 2569.191081] comm=dd begin [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28 [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888 [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512 ... [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496 [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end Fix: Make btrfs_commit_transaction() wait current running btrfs-cleaner's delayed-iputs() done in end. Test: Use script similar to above(more complex), before patch: 7 failed in 100 * 20 loop. after patch: 0 failed in 100 * 20 loop. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: add WARN_ON() to check is space_info op currentZhao Lei2015-04-131-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | space_info's value calculation is some complex and easy to cause bug, add WARN_ON() to help debug. Changelog v1->v2: Put WARN_ON()s under the ENOSPC_DEBUG mount option. Suggested by: David Sterba <dsterba@suse.cz> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: Set relative data on clear btrfs_block_group_cache->pinnedZhao Lei2015-04-131-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bug1: space_info->bytes_readonly was set to very large(negative) value in btrfs_remove_block_group(). Reason: Current code set block_group_cache->pinned = 0 in btrfs_delete_unused_bgs(), but above space was not counted to space_info->bytes_readonly. Then in btrfs_remove_block_group(): block_group->space_info->bytes_readonly -= block_group->key.offset; We can see following value in trace: btrfs_remove_block_group: pid=2677 comm=btrfs-cleaner WARNING: bytes_readonly=12582912, key.offset=134217728 Bug2: space_info->total_bytes_pinned grow to value larger than fs size. In a 1.2G fs, we can get following trace log: at first: ZL_DEBUG: add_pinned_bytes: pid=2710 comm=sync change total_bytes_pinned flags=1 869793792 + 95944704 = 965738496 after some op: ZL_DEBUG: add_pinned_bytes: pid=2770 comm=sync change total_bytes_pinned flags=1 1780178944 + 95944704 = 1876123648 after some op: ZL_DEBUG: add_pinned_bytes: pid=3193 comm=sync change total_bytes_pinned flags=1 2924568576 + 95551488 = 3020120064 ... Reason: Similar to bug1, we also need to adjust space_info->total_bytes_pinned in above code block. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: Adjust commit-transaction condition to avoid NO_SPACE moreZhao Lei2015-04-131-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have any chance to make a successful write, we should not give up. This patch adjust commit-transaction condition from: pinned >= wanted to left + pinned >= wanted Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: Fix tail space processing in find_free_dev_extent()Zhao Lei2015-04-131-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is another reason for NO_SPACE case. When we found enough free space in loop and saved them to max_hole_start/size before, and tail space contains pending extent, origional innocent max_hole_start/size are reset in retry. As a result, find_free_dev_extent() returns less space than it can, and cause NO_SPACE in user program. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | btrfs: fix condition of commit transactionZhao Lei2015-04-131-7/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Old code bypass commit transaction when we don't have enough pinned space, but another case is there exist freed bgs in current transction, it have possibility to make alloc_chunk success. This patch modify the condition to: if (have_free_bg || have_pinned_space) commit_transaction() Confirmed above action by printk before and after patch. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | Btrfs: fix uninit variable in clone ioctlChris Mason2015-04-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0d97a64e0 creates a new variable but doesn't always set it up. This puts it back to the original method (key.offset + 1) for the cases not covered by Filipe's new logic. Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | Btrfs: fix inode eviction infinite loop after cloning into itFilipe Manana2015-04-131-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we attempt to clone a 0 length region into a file we can end up inserting a range in the inode's extent_io tree with a start offset that is greater then the end offset, which triggers immediately the following warning: [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3914.620886] BTRFS: end < start 4095 4096 (...) [ 3914.638093] Call Trace: [ 3914.638636] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3914.639620] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3914.640789] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3914.642041] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3914.643236] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3914.644441] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3914.645711] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3914.646914] [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33 [ 3914.648058] [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs] [ 3914.650105] [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs] [ 3914.651361] [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs] [ 3914.652761] [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs] [ 3914.654128] [<ffffffff811226dd>] ? might_fault+0x58/0xb5 [ 3914.655320] [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs] (...) [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]--- This later makes the inode eviction handler enter an infinite loop that keeps dumping the following warning over and over: [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3915.119913] BTRFS: end < start 4095 4096 (...) [ 3915.137394] Call Trace: [ 3915.137913] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3915.139154] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3915.140316] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3915.141505] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3915.142709] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3915.143849] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3915.145120] [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs] [ 3915.146352] [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50 [ 3915.147565] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3915.148785] [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33 [ 3915.149931] [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs] [ 3915.151154] [<ffffffff81168904>] evict+0xa0/0x148 [ 3915.152094] [<ffffffff811689e5>] dispose_list+0x39/0x43 [ 3915.153081] [<ffffffff81169564>] evict_inodes+0xdc/0xeb [ 3915.154062] [<ffffffff81154418>] generic_shutdown_super+0x49/0xef [ 3915.155193] [<ffffffff811546d1>] kill_anon_super+0x13/0x1e [ 3915.156274] [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs] (...) [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]--- So just bail out of the clone ioctl if the length of the region to clone is zero, without locking any extent range, in order to prevent this issue (same behaviour as a pwrite with a 0 length for example). This is trivial to reproduce. For example, the steps for the test I just made for fstests: mkfs.btrfs -f SCRATCH_DEV mount SCRATCH_DEV $SCRATCH_MNT touch $SCRATCH_MNT/foo touch $SCRATCH_MNT/bar $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar umount $SCRATCH_MNT A test case for fstests follows soon. CC: <stable@vger.kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | Btrfs: fix inode eviction infinite loop after extent_same ioctlFilipe Manana2015-04-131-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we pass a length of 0 to the extent_same ioctl, we end up locking an extent range with a start offset greater then its end offset (if the destination file's offset is greater than zero). This results in a warning from extent_io.c:insert_state through the following call chain: btrfs_extent_same() btrfs_double_lock() lock_extent_range() lock_extent(inode->io_tree, offset, offset + len - 1) lock_extent_bits() __set_extent_bit() insert_state() --> WARN_ON(end < start) This leads to an infinite loop when evicting the inode. This is the same problem that my previous patch titled "Btrfs: fix inode eviction infinite loop after cloning into it" addressed but for the extent_same ioctl instead of the clone ioctl. CC: <stable@vger.kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | Btrfs: fix range cloning when same inode used as source and destinationFilipe Manana2015-04-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While searching for extents to clone we might find one where we only use a part of it coming from its tail. If our destination inode is the same the source inode, we end up removing the tail part of the extent item and insert after a new one that point to the same extent with an adjusted key file offset and data offset. After this we search for the next extent item in the fs/subvol tree with a key that has an offset incremented by one. But this second search leaves us at the new extent item we inserted previously, and since that extent item has a non-zero data offset, it it can make us call btrfs_drop_extents with an empty range (start == end) which causes the following warning: [23978.537119] WARNING: CPU: 6 PID: 16251 at fs/btrfs/file.c:550 btrfs_drop_extent_cache+0x43/0x385 [btrfs]() (...) [23978.557266] Call Trace: [23978.557978] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [23978.559191] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [23978.560699] [<ffffffffa047f0ea>] ? btrfs_drop_extent_cache+0x43/0x385 [btrfs] [23978.562389] [<ffffffff8104544d>] warn_slowpath_null+0x1a/0x1c [23978.563613] [<ffffffffa047f0ea>] btrfs_drop_extent_cache+0x43/0x385 [btrfs] [23978.565103] [<ffffffff810e3a18>] ? time_hardirqs_off+0x15/0x28 [23978.566294] [<ffffffff81079ff8>] ? trace_hardirqs_off+0xd/0xf [23978.567438] [<ffffffffa047f73d>] __btrfs_drop_extents+0x6b/0x9e1 [btrfs] [23978.568702] [<ffffffff8107c03f>] ? trace_hardirqs_on+0xd/0xf [23978.569763] [<ffffffff811441c0>] ? ____cache_alloc+0x69/0x2eb [23978.570817] [<ffffffff81142269>] ? virt_to_head_page+0x9/0x36 [23978.571872] [<ffffffff81143c15>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb [23978.573466] [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18 [23978.574962] [<ffffffffa0480d07>] btrfs_drop_extents+0x66/0x7f [btrfs] [23978.576179] [<ffffffffa049aa35>] btrfs_clone+0x516/0xaf5 [btrfs] [23978.577311] [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs] [23978.578520] [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs] [23978.580282] [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs] (...) [23978.591887] ---[ end trace 988ec2a653d03ed3 ]--- Then we attempt to insert a new extent item with a key that already exists, which makes btrfs_insert_empty_item return -EEXIST resulting in abortion of the current transaction: [23978.594355] WARNING: CPU: 6 PID: 16251 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() (...) [23978.622589] Call Trace: [23978.623181] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [23978.624359] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [23978.625573] [<ffffffffa044ab6c>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [23978.626971] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [23978.628003] [<ffffffff8108a6c8>] ? vprintk_default+0x1d/0x1f [23978.629138] [<ffffffffa044ab6c>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [23978.630528] [<ffffffffa049ad1b>] btrfs_clone+0x7fc/0xaf5 [btrfs] [23978.631635] [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs] [23978.632886] [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs] [23978.634119] [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs] (...) [23978.647714] ---[ end trace 988ec2a653d03ed4 ]--- This is wrong because we should not process the extent item that we just inserted previously, and instead process the extent item that follows it in the tree For example for the test case I wrote for fstests: bs=$((64 * 1024)) mkfs.btrfs -f -l $bs -O ^no-holes /dev/sdc mount /dev/sdc /mnt xfs_io -f -c "pwrite -S 0xaa $(($bs * 2)) $(($bs * 2))" /mnt/foo $CLONER_PROG -s $((3 * $bs)) -d $((267 * $bs)) -l 0 /mnt/foo /mnt/foo $CLONER_PROG -s $((217 * $bs)) -d $((95 * $bs)) -l 0 /mnt/foo /mnt/foo The second clone call fails with -EEXIST, because when we process the first extent item (offset 262144), we drop part of it (counting from the end) and then insert a new extent item with a key greater then the key we found. The next time we search the tree we search for a key with offset 262144 + 1, which leaves us at the new extent item we have just inserted but we think it refers to an extent that we need to clone. Fix this by ensuring the next search key uses an offset corresponding to the offset of the key we found previously plus the data length of the corresponding extent item. This ensures we skip new extent items that we inserted and works for the case of implicit holes too (NO_HOLES feature). A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | Btrfs: fix use after free when close_ctree frees the orphan_rsvChris Mason2015-04-103-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Near the end of close_ctree, we're calling btrfs_free_block_rsv to free up the orphan rsv. The problem is this call updates the space_info, which has already been freed. This adds a new __ function that directly calls kfree instead of trying to update the space infos. Signed-off-by: Chris Mason <clm@fb.com>
| * | | | | Btrfs: allow block group cache writeout outside critical section in commitChris Mason2015-04-109-37/+341
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We loop through all of the dirty block groups during commit and write the free space cache. In order to make sure the cache is currect, we do this while no other writers are allowed in the commit. If a large number of block groups are dirty, this can introduce long stalls during the final stages of the commit, which can block new procs trying to change the filesystem. This commit changes the block group cache writeout to take appropriate locks and allow it to run earlier in the commit. We'll still have to redo some of the block groups, but it means we can get most of the work out of the way without blocking the entire FS. Signed-off-by: Chris Mason <clm@fb.com>
OpenPOWER on IntegriCloud