summaryrefslogtreecommitdiffstats
path: root/fs/xfs
Commit message (Collapse)AuthorAgeFilesLines
* xfs: flush entire file on dio read/write to cached fileBrian Foster2015-08-191-22/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Filesystems are responsible to manage file coherency between the page cache and direct I/O. The generic dio code flushes dirty pages over the range of a dio to ensure that the dio read or a future buffered read returns the correct data. XFS has generally followed this pattern, though traditionally has flushed and invalidated the range from the start of the I/O all the way to the end of the file. This changed after the following commit: 7d4ea3ce xfs: use ranged writeback and invalidation for direct IO ... as the full file flush was no longer necessary to deal with the strange post-eof delalloc issues that were since fixed. Unfortunately, we have since received complaints about performance degradation due to the increased exclusive iolock cycles (which locks out parallel dio submission) that occur when a file has cached pages. This does not occur on filesystems that use the generic code as it also does not incorporate locking. The exclusive iolock is acquired any time the inode mapping has cached pages, regardless of whether they reside in the range of the I/O or not. If not, the flush/inval calls do no work and the lock was cycled for no reason. Under consideration of the cost of the exclusive iolock, update the dio read and write handlers to flush and invalidate the entire mapping when cached pages exist. In most cases, this increases the cost of the initial flush sequence but eliminates the need for further lock cycles and flushes so long as the workload does not actively mix direct and buffered I/O. This also more closely matches historical behavior and performance characteristics that users have come to expect. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: Fix xfs_attr_leafblock definitionJan Kara2015-08-191-2/+9
| | | | | | | | | | | | | | | | | struct xfs_attr_leafblock contains 'entries' array which is declared with size 1 altough it can in fact contain much more entries. Since this array is followed by further struct members, gcc (at least in version 4.8.3) thinks that the array has the fixed size of 1 element and thus may optimize away all accesses beyond the end of array resulting in non-working code. This problem was only observed with userspace code in xfsprogs, however it's better to be safe in kernel as well and have matching kernel and xfsprogs definitions. cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* libxfs: readahead of dir3 data blocks should use the read verifierDarrick J. Wong2015-08-191-1/+2
| | | | | | | | | | | | | | | In the dir3 data block readahead function, use the regular read verifier to check the block's CRC and spot-check the block contents instead of directly calling only the spot-checking routine. This prevents corrupted directory data blocks from being read into the kernel, which can lead to garbage ls output and directory loops (if say one of the entries contains slashes and other junk). cc: <stable@vger.kernel.org> # 3.12 - 4.2 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: stop holding ILOCK over filldir callbacksDave Chinner2015-08-194-19/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent change to the readdir locking made in 40194ec ("xfs: reinstate the ilock in xfs_readdir") for CXFS directory sanity was probably the wrong thing to do. Deep in the readdir code we can take page faults in the filldir callback, and so taking a page fault while holding an inode ilock creates a new set of locking issues that lockdep warns all over the place about. The locking order for regular inodes w.r.t. page faults is io_lock -> pagefault -> mmap_sem -> ilock. The directory readdir code now triggers ilock -> page fault -> mmap_sem. While we cannot deadlock at this point, it inverts all the locking patterns that lockdep normally sees on XFS inodes, and so triggers lockdep. We worked around this with commit 93a8614 ("xfs: fix directory inode iolock lockdep false positive"), but that then just moved the lockdep warning to deeper in the page fault path and triggered on security inode locks. Fixing the shmem issue there just moved the lockdep reports somewhere else, and now we are getting false positives from filesystem freezing annotations getting confused. Further, if we enter memory reclaim in a readdir path, we now get lockdep warning about potential deadlocks because the ilock is held when we enter reclaim. This, again, is different to a regular file in that we never allow memory reclaim to run while holding the ilock for regular files. Hence lockdep now throws ilock->kmalloc->reclaim->ilock warnings. Basically, the problem is that the ilock is being used to protect the directory data and the inode metadata, whereas for a regular file the iolock protects the data and the ilock protects the metadata. From the VFS perspective, the i_mutex serialises all accesses to the directory data, and so not holding the ilock for readdir doesn't matter. The issue is that CXFS doesn't access directory data via the VFS, so it has no "data serialisaton" mechanism. Hence we need to hold the IOLOCK in the correct places to provide this low level directory data access serialisation. The ilock can then be used just when the extent list needs to be read, just like we do for regular files. The directory modification code can take the iolock exclusive when the ilock is also taken, and this then ensures that readdir is correct excluded while modifications are in progress. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: clean up inode lockdep annotationsDave Chinner2015-08-192-43/+110
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lockdep annotations are a maintenance nightmare. Locking has to be modified to suit the limitations of the annotations, and we're always having to fix the annotations because they are unable to express the complexity of locking heirarchies correctly. So, next up, we've got more issues with lockdep annotations for inode locking w.r.t. XFS_LOCK_PARENT: - lockdep classes are exclusive and can't be ORed together to form new classes. - IOLOCK needs multiple PARENT subclasses to express the changes needed for the readdir locking rework needed to stop the endless flow of lockdep false positives involving readdir calling filldir under the ILOCK. - there are only 8 unique lockdep subclasses available, so we can't create a generic solution. IOWs we need to treat the 3-bit space available to each lock type differently: - IOLOCK uses xfs_lock_two_inodes(), so needs: - at least 2 IOLOCK subclasses - at least 2 IOLOCK_PARENT subclasses - MMAPLOCK uses xfs_lock_two_inodes(), so needs: - at least 2 MMAPLOCK subclasses - ILOCK uses xfs_lock_inodes with up to 5 inodes, so needs: - at least 5 ILOCK subclasses - one ILOCK_PARENT subclass - one RTBITMAP subclass - one RTSUM subclass For the IOLOCK, split the space into two sets of subclasses. For the MMAPLOCK, just use half the space for the one subclass to match the non-parent lock classes of the IOLOCK. For the ILOCK, use 0-4 as the ILOCK subclasses, 5-7 for the remaining individual subclasses. Because they are now all different, modify xfs_lock_inumorder() to handle the nested subclasses, and to assert fail if passed an invalid subclass. Further, annotate xfs_lock_inodes() to assert fail if an invalid combination of lock primitives and inode counts are passed that would result in a lockdep subclass annotation overflow. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: swap leaf buffer into path struct atomically during path shiftBrian Foster2015-08-191-9/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The node directory lookup code uses a state structure that tracks the path of buffers used to search for the hash of a filename through the leaf blocks. When the lookup encounters a block that ends with the requested hash, but the entry has not yet been found, it must shift over to the next block and continue looking for the entry (i.e., duplicate hashes could continue over into the next block). This shift mechanism involves walking back up and down the state structure, replacing buffers at the appropriate btree levels as necessary. When a buffer is replaced, the old buffer is released and the new buffer read into the active slot in the path structure. Because the buffer is read directly into the path slot, a buffer read failure can result in setting a NULL buffer pointer in an active slot. This throws off the state cleanup code in xfs_dir2_node_lookup(), which expects to release a buffer from each active slot. Instead, a BUG occurs due to a NULL pointer dereference: BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8 IP: [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs] ... RIP: 0010:[<ffffffffa0585063>] [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs] ... Call Trace: [<ffffffffa05250c6>] xfs_dir2_node_lookup+0xa6/0x2c0 [xfs] [<ffffffffa0519f7c>] xfs_dir_lookup+0x1ac/0x1c0 [xfs] [<ffffffffa055d0e1>] xfs_lookup+0x91/0x290 [xfs] [<ffffffffa05580b3>] xfs_vn_lookup+0x73/0xb0 [xfs] [<ffffffff8122de8d>] lookup_real+0x1d/0x50 [<ffffffff8123330e>] path_openat+0x91e/0x1490 [<ffffffff81235079>] do_filp_open+0x89/0x100 ... This has been reproduced via a parallel fsstress and filesystem shutdown workload in a loop. The shutdown triggers the read error in the aforementioned codepath and causes the BUG in xfs_dir2_node_lookup(). Update xfs_da3_path_shift() to update the active path slot atomically with respect to the caller when a buffer is replaced. This ensures that the caller always sees the old or new buffer in the slot and prevents the NULL pointer dereference. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: relocate sparse inode mount warningBrian Foster2015-08-192-3/+4
| | | | | | | | | | | | | | | | | | The sparse inodes feature is currently considered experimental. We warn at mount time from xfs_mount_validate_sb(). This function is part of the superblock verifier codepath, however, which means it could be invoked repeatedly on superblock reads or writes. This is currently only noticeable from userspace, where mkfs produces multiple warnings at format time. As mkfs warnings were not the intent of this change, relocate the mount time warning to xfs_fs_fill_super(), which is only invoked once and only in kernel space. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: dquots should be stamped with sb_meta_uuidDave Chinner2015-08-191-1/+1
| | | | | | | | | | Once the sb_uuid is changed, the wrong uuid is stamped into new dquots on disk. Found by inspection, verified by generic/219. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: log recovery needs to validate against sb_meta_uuidDave Chinner2015-08-191-2/+12
| | | | | | | | | | | | | | | | | | | Now that sb_uuid can be changed by the user, we cannot use this to validate the metadata blocks being recovered belong to this filesystem. We must check against the sb_meta_uuid as that will remain unchanged. There is a complication in this code - the superblock itself. We can not check the sb_meta_uuid unconditionally, as that may not be set on disk. Hence we must verify the superblock sb_uuid matches between the log record and the in-core superblock. Found by inspection after the previous two problems were found. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: growfs not aware of sb_meta_uuidDave Chinner2015-08-191-3/+3
| | | | | | | | | | | | | | | | | | | | | Adding this simple change to xfstests:common/rc::_scratch_mkfs_xfs: + if [ $mkfs_status -eq 0 ]; then + xfs_admin -U generate $SCRATCH_DEV > /dev/null + fi triggers all sorts of errors in xfstests. xfs/104 is an example, where growfs fails with a UUID mismatch corruption detected by xfs_agf_write_verify() when trying to write the first new AG headers. Fix this problem by making sure we copy the sb_meta_uuid into new metadata written by growfs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: fix sb_meta_uuid usageDave Chinner2015-08-191-1/+1
| | | | | | | | | | | | | | | After changing the UUID on a v5 filesystem, xfstests fails immediately on a debug kernel with: XFS: Assertion failed: uuid_equal(&ip->i_d.di_uuid, &mp->m_sb.sb_uuid), file: fs/xfs/xfs_inode.c, line: 799 This needs to check against the sb_meta_uuid, not the user visible UUID that was changed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: set XFS_DA_OP_OKNOENT in xfs_attr_getEric Sandeen2015-08-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's entirely possible for userspace to ask for an xattr which does not exist. Normally, there is no problem whatsoever when we ask for such a thing, but when we look at an obfuscated metadump image on a debug kernel with selinux, we trip over this ASSERT in xfs_da3_path_shift(): *result = -ENOENT; /* we're out of our tree */ ASSERT(args->op_flags & XFS_DA_OP_OKNOENT); It (more or less) only shows up in the above scenario, because xfs_metadump obfuscates attr names, but chooses names which keep the same hash value - and xfs_da3_node_lookup_int does: if (((retval == -ENOENT) || (retval == -ENOATTR)) && (blk->hashval == args->hashval)) { error = xfs_da3_path_shift(state, &state->path, 1, 1, &retval); IOWS, we only get down to the xfs_da3_path_shift() ASSERT if we are looking for an xattr which doesn't exist, but we find xattrs on disk which have the same hash, and so might be a hash collision, so we try the path shift. When *that* fails to find what we're looking for, we hit the assert about XFS_DA_OP_OKNOENT. Simply setting XFS_DA_OP_OKNOENT in xfs_attr_get solves this rather corner-case problem with no ill side effects. It's fine for an attr name lookup to fail. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: create new metadata UUID field and incompat flagEric Sandeen2015-07-2918-36/+66
| | | | | | | | | | | | | | | | | | | | | | | This adds a new superblock field, sb_meta_uuid. If set, along with a new incompat flag, the code will use that field on a V5 filesystem to compare to metadata UUIDs, which allows us to change the user- visible UUID at will. Userspace handles the setting and clearing of the incompat flag as appropriate, as the UUID gets changed; i.e. setting the user-visible UUID back to the original UUID (as stored in the new field) will remove the incompatible feature flag. If the incompat flag is not set, this copies the user-visible UUID into into the meta_uuid slot in memory when the superblock is read from disk; the meta_uuid field is not written back to disk in this case. The remainder of this patch simply switches verifiers, initializers, etc to use the new sb_meta_uuid field. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2015-07-041-1/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted VFS fixes and related cleanups (IMO the most interesting in that part are f_path-related things and Eric's descriptor-related stuff). UFS regression fixes (it got broken last cycle). 9P fixes. fs-cache series, DAX patches, Jan's file_remove_suid() work" [ I'd say this is much more than "fixes and related cleanups". The file_table locking rule change by Eric Dumazet is a rather big and fundamental update even if the patch isn't huge. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits) 9p: cope with bogus responses from server in p9_client_{read,write} p9_client_write(): avoid double p9_free_req() 9p: forgetting to cancel request on interrupted zero-copy RPC dax: bdev_direct_access() may sleep block: Add support for DAX reads/writes to block devices dax: Use copy_from_iter_nocache dax: Add block size note to documentation fs/file.c: __fget() and dup2() atomicity rules fs/file.c: don't acquire files->file_lock in fd_install() fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation vfs: avoid creation of inode number 0 in get_next_ino namei: make set_root_rcu() return void make simple_positive() public ufs: use dir_pages instead of ufs_dir_pages() pagemap.h: move dir_pages() over there remove the pointless include of lglock.h fs: cleanup slight list_entry abuse xfs: Correctly lock inode when removing suid and file capabilities fs: Call security_ops->inode_killpriv on truncate fs: Provide function telling whether file_remove_privs() will do anything ...
| * xfs: Correctly lock inode when removing suid and file capabilitiesJan Kara2015-06-231-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently XFS calls file_remove_privs() without holding i_mutex. This is wrong because that function can end up messing with file permissions and file capabilities stored in xattrs for which we need i_mutex held. Fix the problem by grabbing iolock exclusively when we will need to change anything in permissions / xattrs. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * fs: Rename file_remove_suid() to file_remove_privs()Jan Kara2015-06-231-1/+1
| | | | | | | | | | | | | | | | | | | | file_remove_suid() is a misnomer since it removes also file capabilities stored in xattrs and sets S_NOSEC flag. Also should_remove_suid() tells something else than whether file_remove_suid() call is necessary which leads to bugs. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'xfs-for-linus-4.2-rc1' of ↵Linus Torvalds2015-06-3054-835/+1548
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pul xfs updates from Dave Chinner: "There's a couple of small API changes to the core DAX code which required small changes to the ext2 and ext4 code bases, but otherwise everything is within the XFS codebase. This update contains: - A new sparse on-disk inode record format to allow small extents to be used for inode allocation when free space is fragmented. - DAX support. This includes minor changes to the DAX core code to fix problems with lock ordering and bufferhead mapping abuse. - transaction commit interface cleanup - removal of various unnecessary XFS specific type definitions - cleanup and optimisation of freelist preparation before allocation - various minor cleanups - bug fixes for - transaction reservation leaks - incorrect inode logging in unwritten extent conversion - mmap lock vs freeze ordering - remote symlink mishandling - attribute fork removal issues" * tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits) xfs: don't truncate attribute extents if no extents exist xfs: clean up XFS_MIN_FREELIST macros xfs: sanitise error handling in xfs_alloc_fix_freelist xfs: factor out free space extent length check xfs: xfs_alloc_fix_freelist() can use incore perag structures xfs: remove xfs_caddr_t xfs: use void pointers in log validation helpers xfs: return a void pointer from xfs_buf_offset xfs: remove inst_t xfs: remove __psint_t and __psunsigned_t xfs: fix remote symlinks on V5/CRC filesystems xfs: fix xfs_log_done interface xfs: saner xfs_trans_commit interface xfs: remove the flags argument to xfs_trans_cancel xfs: pass a boolean flag to xfs_trans_free_items xfs: switch remaining xfs_trans_dup users to xfs_trans_roll xfs: check min blks for random debug mode sparse allocations xfs: fix sparse inodes 32-bit compile failure xfs: add initial DAX support xfs: add DAX IO path support ...
| * \ Merge branch 'xfs-misc-fixes-for-4.2-3' into for-nextDave Chinner2015-06-2315-91/+77
| |\ \
| | * | xfs: don't truncate attribute extents if no extents existBrian Foster2015-06-231-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The xfs_attr3_root_inactive() call from xfs_attr_inactive() assumes that attribute blocks exist to invalidate. It is possible to have an attribute fork without extents, however. Consider the case where the attribute fork is created towards the beginning of xfs_attr_set() but some part of the subsequent attribute set fails. If an inode in such a state hits xfs_attr_inactive(), it eventually calls xfs_dabuf_map() and possibly xfs_bmapi_read(). The former emits a filesystem corruption warning, returns an error that bubbles back up to xfs_attr_inactive(), and leads to destruction of the in-core attribute fork without an on-disk reset. If the inode happens to make it back through xfs_inactive() in this state (e.g., via a concurrent bulkstat that cycles the inode from the reclaim state and releases it), i_afp might not exist when xfs_bmapi_read() is called and causes a NULL dereference panic. A '-p 2' fsstress run to ENOSPC on a relatively small fs (1GB) reproduces these problems. The behavior is a regression caused by: 6dfe5a0 xfs: xfs_attr_inactive leaves inconsistent attr fork state behind ... which removed logic that avoided the attribute extent truncate when no extents exist. Restore this logic to ensure the attribute fork is destroyed and reset correctly if it exists without any allocated extents. cc: stable@vger.kernel.org # 3.12 to 4.0.x Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: remove xfs_caddr_tChristoph Hellwig2015-06-223-30/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just use char pointers directly instead of the confusing typedef to a pointer type. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: use void pointers in log validation helpersChristoph Hellwig2015-06-222-17/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Compared to char pointers this saves us a lot of casting effort. Also add another local variable to make the code easier to read. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: return a void pointer from xfs_buf_offsetChristoph Hellwig2015-06-226-17/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This avoids all kinds of unessecary casts in an envrionment like Linux where we can assume that pointer arithmetics are support on void pointers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: remove inst_tChristoph Hellwig2015-06-223-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can simply use a void pointer to pass a long return addresses in the debugging helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: remove __psint_t and __psunsigned_tChristoph Hellwig2015-06-224-20/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace uses of __psint_t with the proper uintptr_t and ptrdiff_t types, and remove the defintions of __psint_t and __psunsigned_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: fix remote symlinks on V5/CRC filesystemsEric Sandeen2015-06-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we create a CRC filesystem, mount it, and create a symlink with a path long enough that it can't live in the inode, we get a very strange result upon remount: # ls -l mnt total 4 lrwxrwxrwx. 1 root root 929 Jun 15 16:58 link -> XSLM XSLM is the V5 symlink block header magic (which happens to be followed by a NUL, so the string looks terminated). xfs_readlink_bmap() advanced cur_chunk by the size of the header for CRC filesystems, but never actually used that pointer; it kept reading from bp->b_addr, which is the start of the block, rather than the start of the symlink data after the header. Looks like this problem goes back to v3.10. Fixing this gets us reading the proper link target, again. Cc: stable@vger.kernel.org Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | Merge branch 'xfs-freelist-cleanup' into for-nextDave Chinner2015-06-237-130/+142
| |\ \ \
| | * | | xfs: clean up XFS_MIN_FREELIST macrosDave Chinner2015-06-227-21/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We no longer calculate the minimum freelist size from the on-disk AGF, so we don't need the macros used for this. That means the nested macros can be cleaned up, and turn this into an actual function so the logic is clear and concise. This will make it much easier to add support for the rmap btree when the time comes. This also gets rid of the XFS_AG_MAXLEVELS macro used by these freelist macros as it is simply a wrapper around a single variable. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | xfs: sanitise error handling in xfs_alloc_fix_freelistDave Chinner2015-06-221-56/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The error handling is currently an inconsistent mess as every error condition handles return values and releasing buffers individually. Clean this up by using gotos and a sane error label stack. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | xfs: factor out free space extent length checkDave Chinner2015-06-221-27/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The longest extent length checks in xfs_alloc_fix_freelist() are now essentially identical. Factor them out into a helper function, so we know they are checking exactly the same thing before and after we lock the AGF. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | xfs: xfs_alloc_fix_freelist() can use incore perag structuresDave Chinner2015-06-224-46/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment, xfs_alloc_fix_freelist() uses a mix of per-ag based access and agf buffer based access to freelist and space usage information. However, once the AGF buffer is locked inside this function, it is guaranteed that both the in-memory and on-disk values are identical. xfs_alloc_fix_freelist() doesn't modify the values in the structures directly, so it is a read-only user of the infomration, and hence can use the per-ag structure exclusively for determining what it should do. This opens up an avenue for cleaning up a lot of duplicated logic whose only difference is the structure it gets the data from, and in doing so removes a lot of needless byte swapping overhead when fixing up the free list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | Merge branch 'xfs-commit-cleanup' into for-nextDave Chinner2015-06-0426-385/+193
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: fs/xfs/xfs_attr_inactive.c
| | * | | | xfs: fix xfs_log_done interfaceChristoph Hellwig2015-06-044-52/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of the confusing flags argument pass a boolean flag to indicate if we want to release or regrant a log reservation. Also ensure that xfs_log_done always drop the reference on the log ticket, to both simplify the code and make the logic in xfs_trans_roll easier to understand. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | xfs: saner xfs_trans_commit interfaceChristoph Hellwig2015-06-0424-74/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The flags argument to xfs_trans_commit is not useful for most callers, as a commit of a transaction without a permanent log reservation must pass 0 here, and all callers for a transaction with a permanent log reservation except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES. So remove the flags argument from the public xfs_trans_commit interfaces, and introduce low-level __xfs_trans_commit variant just for xfs_trans_roll that regrants a log reservation instead of releasing it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | xfs: remove the flags argument to xfs_trans_cancelChristoph Hellwig2015-06-0422-171/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and XFS_TRANS_ABORT. Both of them are a direct product of the transaction state, and can be deducted: - any dirty transaction needs XFS_TRANS_ABORT to be properly canceled, and XFS_TRANS_ABORT is a noop for a transaction that is not dirty. - any transaction with a permanent log reservation needs XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent log reservation is invalid. So just remove the flags argument and do the right thing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | xfs: pass a boolean flag to xfs_trans_free_itemsChristoph Hellwig2015-06-043-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The flags value always was 0 or XFS_TRANS_ABORT. Switch to a bool parameter to allow further cleanups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | xfs: switch remaining xfs_trans_dup users to xfs_trans_rollChristoph Hellwig2015-06-044-92/+16
| | |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have three remaining callers of xfs_trans_dup: - xfs_itruncate_extents which open codes xfs_trans_roll - xfs_bmap_finish doesn't have an xfs_inode argument and thus leaves attaching them to it's callers, but otherwise is identical to xfs_trans_roll - xfs_dir_ialloc looks at the log reservations in the old xfs_trans structure instead of the log reservation parameters, but otherwise is identical to xfs_trans_roll. By allowing a NULL xfs_inode argument to xfs_trans_roll we can switch these three remaining users over to xfs_trans_roll and mark xfs_trans_dup static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | Merge branch 'xfs-misc-fixes-for-4.2-2' into for-nextDave Chinner2015-06-042-10/+12
| |\ \ \ \
| | * | | | xfs: check min blks for random debug mode sparse allocationsBrian Foster2015-06-041-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The inode allocator enables random sparse inode chunk allocations in DEBUG mode to facilitate testing. Sparse inode allocations are not always possible, however, depending on the fs geometry. For example, there is no possibility for a sparse inode allocation on filesystems where the block size is large enough to fit one or more inode chunks within a single block. Fix up the DEBUG mode sparse inode allocation logic to trigger random sparse allocations only when the geometry of the fs allows it. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | xfs: fix sparse inodes 32-bit compile failureBrian Foster2015-06-041-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kbuild test robot reports the following compilation failure with a 32-bit kernel configuration: fs/built-in.o: In function `xfs_ifree_cluster': >> xfs_inode.c:(.text+0x17ac84): undefined reference to `__umoddi3' This is due to the use of the modulus operator on a 64-bit variable in the ASSERT() added as part of the following commit: xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() This ASSERT() simply checks that the offset of the inode in a sparse cluster is appropriately aligned. Since the maximum inode record offset is 63 (for a 64 inode record) and the calculated offset here should be something less than that, just use a 32-bit variable to store the offset and call the do_mod() helper. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | | Merge branch 'xfs-dax-support' into for-nextDave Chinner2015-06-047-126/+275
| |\ \ \ \ \
| | * | | | | xfs: add initial DAX supportDave Chinner2015-06-043-14/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add initial DAX support to XFS. To do this we need a new mount option to turn DAX on filesystem, and we need to propagate this into the inode flags whenever an inode is instantiated so that the per-inode checks throughout the code Do The Right Thing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | | xfs: add DAX IO path supportDave Chinner2015-06-041-9/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DAX does not do buffered IO (can't buffer direct access!) and hence all read/write IO is vectored through the direct IO path. Hence we need to add the DAX IO path callouts to the direct IO infrastructure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | | xfs: add DAX truncate supportDave Chinner2015-06-041-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we truncate a DAX file, we need to call through the DAX page truncation path rather than through block_truncate_page() so that mappings and block zeroing are all handled correctly. Otherwise, truncate does not need to change. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | | xfs: add DAX block zeroing supportDave Chinner2015-06-042-22/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add initial support for DAX block zeroing operations to XFS. DAX cannot use buffered IO through the page cache for zeroing, nor do we need to issue IO for uncached block zeroing. In both cases, we can simply call out to the dax block zeroing function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | | xfs: add DAX file operations supportDave Chinner2015-06-043-83/+158
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the initial support for DAX file operations to XFS. This includes the necessary block allocation and mmap page fault hooks for DAX to function. Note that there are changes to the splice interfaces to ensure that for DAX splice avoids direct page cache manipulations and instead takes the DAX IO paths for read/write operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | | xfs: mmap lock needs to be inside freeze protectionDave Chinner2015-06-041-3/+8
| | | |/ / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lock ordering for the new mmap lock needs to be: mmap_sem sb_start_pagefault i_mmap_lock page lock <fault processsing> Right now xfs_vm_page_mkwrite gets this the wrong way around, While technically it cannot deadlock due to the current freeze ordering, it's still a landmine that might explode if we change anything in future. Hence we need to nest the locks correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | | Merge branch 'xfs-sparse-inode' into for-nextDave Chinner2015-06-0116-86/+829
| |\ \ \ \ \ | | | |/ / / | | |/| | |
| | * | | | xfs: enable sparse inode chunks for v5 superblocksBrian Foster2015-05-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable mounting of filesystems with sparse inode support enabled. Add the incompat. feature bit to the *_ALL mask. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster()Brian Foster2015-05-293-20/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_ifree_cluster() is called to mark all in-memory inodes and inode buffers as stale. This occurs after we've removed the inobt records and dropped any references of inobt data. xfs_ifree_cluster() uses the starting inode number to walk the namespace of inodes expected for a single chunk a cluster buffer at a time. The cluster buffer disk addresses are calculated by decoding the sequential inode numbers expected from the chunk. The problem with this approach is that if the inode chunk being removed is a sparse chunk, not all of the buffer addresses that are calculated as part of this sequence may be inode clusters. Attempting to acquire the buffer based on expected inode characterstics (i.e., cluster length) can lead to errors and is generally incorrect. We already use a couple variables to carry requisite state from xfs_difree() to xfs_ifree_cluster(). Rather than add a third, define a new internal structure to carry the existing parameters through these functions. Add an alloc field that represents the physical allocation bitmap of inodes in the chunk being removed. Modify xfs_ifree_cluster() to check each inode against the bitmap and skip the clusters that were never allocated as real inodes on disk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | | | xfs: only free allocated regions of inode chunksBrian Foster2015-05-291-3/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An inode chunk is currently added to the transaction free list based on a simple fsb conversion and hardcoded chunk length. The nature of sparse chunks is such that the physical chunk of inodes on disk may consist of one or more discontiguous parts. Blocks that reside in the holes of the inode chunk are not inodes and could be allocated to any other use or not allocated at all. Refactor the existing xfs_bmap_add_free() call into the xfs_difree_inode_chunk() helper. The new helper uses the existing calculation if a chunk is not sparse. Otherwise, use the inobt record holemask to free the contiguous regions of the chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
OpenPOWER on IntegriCloud