summaryrefslogtreecommitdiffstats
path: root/fs/xfs/xfs_super.c
Commit message (Collapse)AuthorAgeFilesLines
* xfs: remove the extents array from the rmap update done log itemDarrick J. Wong2016-08-031-3/+1
| | | | | | | | | | Nothing ever uses the extent array in the rmap update done redo item, so remove it before it is fixed in the on-disk log format. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: enable the rmap btree functionalityDarrick J. Wong2016-08-031-0/+4
| | | | | | | | | | | | | | Originally-From: Dave Chinner <dchinner@redhat.com> Add the feature flag to the supported matrix so that the kernel can mount and use rmap btree enabled filesystems Signed-off-by: Dave Chinner <dchinner@redhat.com> [darrick.wong@oracle.com: move the experimental tag] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: enable the xfs_defer mechanism to process rmaps to updateDarrick J. Wong2016-08-031-0/+1
| | | | | | | | | | | Connect the xfs_defer mechanism with the pieces that we'll need to handle deferred rmap updates. We'll wire up the existing code to our new deferred mechanism later. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: create rmap update intent log itemsDarrick J. Wong2016-08-031-0/+21
| | | | | | | | | | | | | | | Create rmap update intent/done log items to record redo information in the log. Because we need to roll transactions between updating the bmbt mapping and updating the reverse mapping, we also have to track the status of the metadata updates that will be recorded in the post-roll transactions, just in case we crash before committing the final transaction. This mechanism enables log recovery to finish what was already started. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: rmap btree requires more reserved free spaceDarrick J. Wong2016-08-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Originally-From: Dave Chinner <dchinner@redhat.com> The rmap btree is allocated from the AGFL, which means we have to ensure ENOSPC is reported to userspace before we run out of free space in each AG. The last allocation in an AG can cause a full height rmap btree split, and that means we have to reserve at least this many blocks *in each AG* to be placed on the AGFL at ENOSPC. Update the various space calculation functions to handle this. Also, because the macros are now executing conditional code and are called quite frequently, convert them to functions that initialise variables in the struct xfs_mount, use the new variables everywhere and document the calculations better. [darrick.wong@oracle.com: don't reserve blocks if !rmap] [dchinner@redhat.com: update m_ag_max_usable after growfs] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: change xfs_bmap_{finish,cancel,init,free} -> xfs_defer_*Darrick J. Wong2016-08-031-1/+1
| | | | | | | | | | Drop the compatibility shims that we were using to integrate the new deferred operation mechanism into the existing code. No new code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: enable the xfs_defer mechanism to process extents to freeDarrick J. Wong2016-08-031-0/+2
| | | | | | | | | | | Connect the xfs_defer mechanism with the pieces that we'll need to handle deferred extent freeing. We'll wire up the existing code to our new deferred mechanism later. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* Merge branch 'xfs-4.8-misc-fixes-4' into for-nextDave Chinner2016-07-221-4/+0
|\
| * xfs: remove EXPERIMENTAL tag from sparse inode featureDave Chinner2016-07-221-4/+0
| | | | | | | | | | | | | | | | | | | | Been around for long enough now, hasn't caused any regression test failures in the past 3 months, so it's time to make it a fully supported feature. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | Merge branch 'xfs-4.8-misc-fixes-2' into for-nextDave Chinner2016-06-211-2/+11
|\ \
| * | xfs: convert list of extents to free into a regular listDarrick J. Wong2016-06-211-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In struct xfs_bmap_free, convert the open-coded free extent list to a regular list, then use list_sort to sort it prior to processing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | xfs: cancel eofblocks background trimming on remount read-onlyBrian Foster2016-06-211-0/+8
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The filesystem quiesce sequence performs the operations necessary to drain all background work, push pending transactions through the log infrastructure and wait on I/O resulting from the final AIL push. We have had reports of remount,ro hangs in xfs_log_quiesce() -> xfs_wait_buftarg(), however, and some instrumentation code to detect transaction commits at this point in the quiesce sequence has inculpated the eofblocks background scanner as a cause. While higher level remount code generally prevents user modifications by the time the filesystem has made it to xfs_log_quiesce(), the background scanner may still be alive and can perform pending work at any time. If this occurs between the xfs_log_force() and xfs_wait_buftarg() calls within xfs_log_quiesce(), this can lead to an indefinite lockup in xfs_wait_buftarg(). To prevent this problem, cancel the background eofblocks scan worker during the remount read-only quiesce sequence. This suspends background trimming when a filesystem is remounted read-only. This is only done in the remount path because the freeze codepath has already locked out new transactions by the time the filesystem attempts to quiesce (and thus waiting on an active work item could deadlock). Kick the eofblocks worker to pick up where it left off once an fs is remounted back to read-write. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | xfs: make several functions staticEric Sandeen2016-06-011-1/+1
|/ | | | | | | | | | | | | | | Al Viro noticed that xfs_lock_inodes should be static, and that led to ... a few more. These are just the easy ones, others require moving functions higher in source files, so that's not done here to keep this review simple. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
* Merge tag 'dax-misc-for-4.7' of ↵Linus Torvalds2016-05-261-7/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull misc DAX updates from Vishal Verma: "DAX error handling for 4.7 - Until now, dax has been disabled if media errors were found on any device. This enables the use of DAX in the presence of these errors by making all sector-aligned zeroing go through the driver. - The driver (already) has the ability to clear errors on writes that are sent through the block layer using 'DSMs' defined in ACPI 6.1. Other misc changes: - When mounting DAX filesystems, check to make sure the partition is page aligned. This is a requirement for DAX, and previously, we allowed such unaligned mounts to succeed, but subsequent reads/writes would fail. - Misc/cleanup fixes from Jan that remove unused code from DAX related to zeroing, writeback, and some size checks" * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: fix a comment in dax_zero_page_range and dax_truncate_page dax: for truncate/hole-punch, do zeroing through the driver if possible dax: export a low-level __dax_zero_page_range helper dax: use sb_issue_zerout instead of calling dax_clear_sectors dax: enable dax in the presence of known media errors (badblocks) dax: fallback from pmd to pte on error block: Update blkdev_dax_capable() for consistency xfs: Add alignment check for DAX mount ext2: Add alignment check for DAX mount ext4: Add alignment check for DAX mount block: Add bdev_dax_supported() for dax mount checks block: Add vfs_msg() interface dax: Remove redundant inode size checks dax: Remove pointless writeback from dax_do_io() dax: Remove zeroing from dax_io() dax: Remove dead zeroing code from fault handlers ext2: Avoid DAX zeroing to corrupt data ext2: Fix block zeroing in ext2_get_blocks() for DAX dax: Remove complete_unwritten argument DAX: move RADIX_DAX_ definitions to dax.c
| * xfs: Add alignment check for DAX mountToshi Kani2016-05-171-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a partition is not aligned by 4KB, mount -o dax succeeds, but any read/write access to the filesystem fails, except for metadata update. Call bdev_dax_supported() to perform proper precondition checks which includes this partition alignment check. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
* | Merge tag 'xfs-for-linus-4.7-rc1' of ↵Linus Torvalds2016-05-261-39/+26
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs updates from Dave Chinner: "A pretty average collection of fixes, cleanups and improvements in this request. Summary: - fixes for mount line parsing, sparse warnings, read-only compat feature remount behaviour - allow fast path symlink lookups for inline symlinks. - attribute listing cleanups - writeback goes direct to bios rather than indirecting through bufferheads - transaction allocation cleanup - optimised kmem_realloc - added configurable error handling for metadata write errors, changed default error handling behaviour from "retry forever" to "retry until unmount then fail" - fixed several inode cluster writeback lookup vs reclaim race conditions - fixed inode cluster writeback checking wrong inode after lookup - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe - cleaned up inode reclaim tagging" * tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits) xfs: fix warning in xfs_finish_page_writeback for non-debug builds xfs: move reclaim tagging functions xfs: simplify inode reclaim tagging interfaces xfs: rename variables in xfs_iflush_cluster for clarity xfs: xfs_iflush_cluster has range issues xfs: mark reclaimed inodes invalid earlier xfs: xfs_inode_free() isn't RCU safe xfs: optimise xfs_iext_destroy xfs: skip stale inodes in xfs_iflush_cluster xfs: fix inode validity check in xfs_iflush_cluster xfs: xfs_iflush_cluster fails to abort on error xfs: remove xfs_fs_evict_inode() xfs: add "fail at unmount" error handling configuration xfs: add configuration handlers for specific errors xfs: add configuration of error failure speed xfs: introduce table-based init for error behaviors xfs: add configurable error support to metadata buffers xfs: introduce metadata IO error class xfs: configurable error behavior via sysfs xfs: buffer ->bi_end_io function requires irq-safe lock ...
| * Merge branch 'xfs-4.7-inode-reclaim' into for-nextDave Chinner2016-05-201-21/+7
| |\
| | * xfs: remove xfs_fs_evict_inode()Dave Chinner2016-05-181-21/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Joe Lawrence reported a list_add corruption with 4.6-rc1 when testing some custom md administration code that made it's own block device nodes for the md array. The simple test loop of: for i in {0..100}; do mknod --mode=0600 $tmp/tmp_node b $MAJOR $MINOR mdadm --detail --export $tmp/tmp_node > /dev/null rm -f $tmp/tmp_node done Would produce this warning in bd_acquire() when mdadm opened the device node: list_add double add: new=ffff88043831c7b8, prev=ffff8804380287d8, next=ffff88043831c7b8. And then produce this from bd_forget from kdevtmpfs evicting a block dev inode: list_del corruption. prev->next should be ffff8800bb83eb10, but was ffff88043831c7b8 This is a regression caused by commit c19b3b05 ("xfs: mode di_mode to vfs inode"). The issue is that xfs_inactive() frees the unlinked inode, and the above commit meant that this freeing zeroed the mode in the struct inode. The problem is that after evict() has called ->evict_inode, it expects the i_mode to be intact so that it can call bd_forget() or cd_forget() to drop the reference to the block device inode attached to the XFS inode. In reality, the only thing we do in xfs_fs_evict_inode() that is not generic is call xfs_inactive(). We can move the xfs_inactive() call to xfs_fs_destroy_inode() without any problems at all, and this will leave the VFS inode intact until it is completely done with it. So, remove xfs_fs_evict_inode(), and do the work it used to do in ->destroy_inode instead. cc: <stable@vger.kernel.org> # 4.6 Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | Merge branch 'xfs-4.7-writeback-bio' into for-nextDave Chinner2016-05-201-18/+8
| |\ \
| | * | xfs: optimize bio handling in the buffer writeback pathChristoph Hellwig2016-04-061-18/+8
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements two closely related changes: First it embeds a bio the ioend structure so that we don't have to allocate one separately. Second it uses the block layer bio chaining mechanism to chain additional bios off this first one if needed instead of manually accounting for multiple bio completions in the ioend structure. Together this removes a memory allocation per ioend and greatly simplifies the ioend setup and I/O completion path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | xfs: add missing break in xfs_parseargs()Eryu Guan2016-04-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 2e74af0e1189 ("xfs: convert mount option parsing to tokens") missed a 'break;' in xfs_parseargs() which causes mount to fail with "-o pqnoenforce" option when mounting a v4 filesystem. xfs/050 catches this failure: XFS (vda6): Super block does not support project and group quota together Fixes: 2e74af0e1189 ("xfs: convert mount option parsing to tokens") Signed-off-by: Eryu Guan <guaneryu@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | xfs: disallow rw remount on fs with unknown ro-compat featuresEric Sandeen2016-04-061-0/+10
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Today, a kernel which refuses to mount a filesystem read-write due to unknown ro-compat features can still transition to read-write via the remount path. The old kernel is most likely none the wiser, because it's unaware of the new feature, and isn't using it. However, writing to the filesystem may well corrupt metadata related to that new feature, and moving to a newer kernel which understand the feature will have problems. Right now the only ro-compat feature we have is the free inode btree, which showed up in v3.16. It would be good to push this back to all the active stable kernels, I think, so that if anyone is using newer mkfs (which enables the finobt feature) with older kernel releases, they'll be protected. Cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usageKirill A. Shutemov2016-04-041-2/+2
| | | | | | | | | | | | | | | | | | Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov2016-04-041-2/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'xfs-misc-fixes-4.6-3' into for-nextDave Chinner2016-03-091-0/+3
|\
| * xfs: check sizes of XFS on-disk structures at compile timeDarrick J. Wong2016-03-091-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Check the sizes of XFS on-disk structures when compiling the kernel. Use this to catch inadvertent changes in structure size due to padding and alignment issues, etc. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | xfs: fix up inode32/64 (re)mount handlingEric Sandeen2016-03-021-47/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | inode32/inode64 allocator behavior with respect to mount, remount and growfs is a little tricky. The inode32 mount option should only enable the inode32 allocator heuristics if the filesystem is large enough for 64-bit inodes to exist. Today, it has this behavior on the initial mount, but a remount with inode32 unconditionally changes the allocation heuristics, even for a small fs. Also, an inode32 mounted small filesystem should transition to the inode32 allocator if the filesystem is subsequently grown to a sufficient size. Today that does not happen. This patch consolidates xfs_set_inode32 and xfs_set_inode64 into a single new function, and moves the "is the maximum inode number big enough to matter" test into that function, so it doesn't rely on the caller to get it right - which remount did not do, previously. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | xfs: sanitize remount optionsEric Sandeen2016-03-021-1/+31
| | | | | | | | | | | | | | | | | | | | | | Perform basic sanitization of remount options by passing the option string and a dummy mount structure through xfs_parseargs and returning the result. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | xfs: convert mount option parsing to tokensEric Sandeen2016-03-021-194/+197
|/ | | | | | | | | | | This should be a no-op change, just switch to token parsing like every other respectable filesystem does. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
* kmemcg: account certain kmem allocations to memcgVladimir Davydov2016-01-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* XFS: Use a signed return type for suffix_kstrtoint()Markus Elfring2016-01-041-1/+1
| | | | | | | | | | | | | | The return type "unsigned long" was used by the suffix_kstrtoint() function even though it will eventually return a negative error code. Improve this implementation detail by using the type "int" instead. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* Merge branch 'xfs-misc-fixes-for-4.4-3' into for-nextDave Chinner2015-11-101-3/+4
|\
| * xfs: give all workqueues rescuer threadsChris Mason2015-11-101-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're consistently hitting deadlocks here with XFS on recent kernels. After some digging through the crash files, it looks like everyone in the system is waiting for XFS to reclaim memory. Something like this: PID: 2733434 TASK: ffff8808cd242800 CPU: 19 COMMAND: "java" #0 [ffff880019c53588] __schedule at ffffffff818c4df2 #1 [ffff880019c535d8] schedule at ffffffff818c5517 #2 [ffff880019c535f8] _xfs_log_force_lsn at ffffffff81316348 #3 [ffff880019c53688] xfs_log_force_lsn at ffffffff813164fb #4 [ffff880019c536b8] xfs_iunpin_wait at ffffffff8130835e #5 [ffff880019c53728] xfs_reclaim_inode at ffffffff812fd453 #6 [ffff880019c53778] xfs_reclaim_inodes_ag at ffffffff812fd8c7 #7 [ffff880019c53928] xfs_reclaim_inodes_nr at ffffffff812fe433 #8 [ffff880019c53958] xfs_fs_free_cached_objects at ffffffff8130d3b9 #9 [ffff880019c53968] super_cache_scan at ffffffff811a6f73 #10 [ffff880019c539c8] shrink_slab at ffffffff811460e6 #11 [ffff880019c53aa8] shrink_zone at ffffffff8114a53f #12 [ffff880019c53b48] do_try_to_free_pages at ffffffff8114a8ba #13 [ffff880019c53be8] try_to_free_pages at ffffffff8114ad5a #14 [ffff880019c53c78] __alloc_pages_nodemask at ffffffff8113e1b8 #15 [ffff880019c53d88] alloc_kmem_pages_node at ffffffff8113e671 #16 [ffff880019c53dd8] copy_process at ffffffff8104f781 #17 [ffff880019c53ec8] do_fork at ffffffff8105129c #18 [ffff880019c53f38] sys_clone at ffffffff810515b6 #19 [ffff880019c53f48] stub_clone at ffffffff818c8e4d xfs_log_force_lsn is waiting for logs to get cleaned, which is waiting for IO, which is waiting for workers to complete the IO which is waiting for worker threads that don't exist yet: PID: 2752451 TASK: ffff880bd6bdda00 CPU: 37 COMMAND: "kworker/37:1" #0 [ffff8808d20abbb0] __schedule at ffffffff818c4df2 #1 [ffff8808d20abc00] schedule at ffffffff818c5517 #2 [ffff8808d20abc20] schedule_timeout at ffffffff818c7c6c #3 [ffff8808d20abcc0] wait_for_completion_killable at ffffffff818c6495 #4 [ffff8808d20abd30] kthread_create_on_node at ffffffff8106ec82 #5 [ffff8808d20abdf0] create_worker at ffffffff8106752f #6 [ffff8808d20abe40] worker_thread at ffffffff810699be #7 [ffff8808d20abec0] kthread at ffffffff8106ef59 #8 [ffff8808d20abf50] ret_from_fork at ffffffff818c8ac8 I think we should be using WQ_MEM_RECLAIM to make sure this thread pool makes progress when we're not able to allocate new workers. [dchinner: make all workqueues WQ_MEM_RECLAIM] Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | Merge branch 'xfs-misc-fixes-for-4.4-2' into for-nextDave Chinner2015-11-031-0/+1
|\ \ | |/
| * xfs: don't leak uuid table on rmmodDarrick J. Wong2015-11-031-0/+1
| | | | | | | | | | | | | | | | | | | | Don't leak the UUID table when the module is unloaded. (Found with kmemleak.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | xfs: fix an error code in xfs_fs_fill_super()Dan Carpenter2015-10-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | If alloc_percpu() fails, we accidentally return PTR_ERR(NULL), which means success, but we intended to return -ENOMEM. Fixes: 225e4635580c ('xfs: per-filesystem stats in sysfs') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | xfs: per-filesystem stats counter implementationBill O'Donnell2015-10-121-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch modifies the stats counting macros and the callers to those macros to properly increment, decrement, and add-to the xfs stats counts. The counts for global and per-fs stats are correctly advanced, and cleared by writing a "1" to the corresponding clear file. global counts: /sys/fs/xfs/stats/stats per-fs counts: /sys/fs/xfs/sda*/stats/stats global clear: /sys/fs/xfs/stats/stats_clear per-fs clear: /sys/fs/xfs/sda*/stats/stats_clear [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around stats structures and macros. ] Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | xfs: per-filesystem stats in sysfsBill O'Donnell2015-10-121-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements per-filesystem stats objects in sysfs. It depends on the application of the previous patch series that develops the infrastructure to support both xfs global stats and xfs per-fs stats in sysfs. Stats objects are instantiated when an xfs filesystem is mounted and deleted on unmount. With this patch, the stats directory is created and populated with the familiar stats and stats_clear files. Example: /sys/fs/xfs/sda9/stats/stats /sys/fs/xfs/sda9/stats/stats_clear With this patch, the individual counts within the new per-fs stats file(s) remain at zero. Functions that use the the macros to increment, decrement, and add-to the per-fs stats counts will be covered in a separate new patch to follow this one. Note that the counts within the global stats file (/sys/fs/xfs/stats/stats) advance normally and can be cleared as it was prior to this patch. [dchinner: move setup/teardown to xfs_fs_{fill|put}_super() so it is down before/after any path that uses the per-mount stats. ] Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | xfs: pass xfsstats structures to handlers and macrosBill O'Donnell2015-10-121-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is the next step toward per-fs xfs stats. The patch makes the show and clear routines able to handle any stats structure associated with a kobject. Instead of a single global xfsstats structure, add kobject and a pointer to a per-cpu struct xfsstats. Modify the macros that manipulate the stats accordingly: XFS_STATS_INC, XFS_STATS_DEC, and XFS_STATS_ADD now access xfsstats->xs_stats. The sysfs functions need to get from the kobject back to the xfsstats structure which contains it, and pass the pointer to the ->xs_stats percpu structure into the show & clear routines. Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | xfs: create global stats and stats_clear in sysfsBill O'Donnell2015-10-121-5/+15
|/ | | | | | | | | | | Currently, xfs global stats are in procfs. This patch introduces (replicates) the global stats in sysfs. Additionally a stats_clear file is introduced in sysfs. Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* Merge tag 'xfs-for-linus-4.3' of ↵Linus Torvalds2015-09-071-10/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs updates from Dave Chinner: "There isn't a whole lot to this update - it's mostly bug fixes and they are spread pretty much all over XFS. There are some corruption fixes, some fixes for log recovery, some fixes that prevent unount from hanging, a lockdep annotation rework for inode locking to prevent false positives and the usual random bunch of cleanups and minor improvements. Deatils: - large rework of EFI/EFD lifecycle handling to fix log recovery corruption issues, crashes and unmount hangs - separate metadata UUID on disk to enable changing boot label UUID for v5 filesystems - fixes for gcc miscompilation on certain platforms and optimisation levels - remote attribute allocation and recovery corruption fixes - inode lockdep annotation rework to fix bugs with too many subclasses - directory inode locking changes to prevent lockdep false positives - a handful of minor corruption fixes - various other small cleanups and bug fixes" * tag 'xfs-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (42 commits) xfs: fix error gotos in xfs_setattr_nonsize xfs: add mssing inode cache attempts counter increment xfs: return errors from partial I/O failures to files libxfs: bad magic number should set da block buffer error xfs: fix non-debug build warnings xfs: collapse allocsize and biosize mount option handling xfs: Fix file type directory corruption for btree directories xfs: lockdep annotations throw warnings on non-debug builds xfs: Fix uninitialized return value in xfs_alloc_fix_freelist() xfs: inode lockdep annotations broke non-lockdep build xfs: flush entire file on dio read/write to cached file xfs: Fix xfs_attr_leafblock definition libxfs: readahead of dir3 data blocks should use the read verifier xfs: stop holding ILOCK over filldir callbacks xfs: clean up inode lockdep annotations xfs: swap leaf buffer into path struct atomically during path shift xfs: relocate sparse inode mount warning xfs: dquots should be stamped with sb_meta_uuid xfs: log recovery needs to validate against sb_meta_uuid xfs: growfs not aware of sb_meta_uuid ...
| * xfs: collapse allocsize and biosize mount option handlingEric Sandeen2015-08-251-10/+2
| | | | | | | | | | | | | | | | | | | | | | | | The allocsize and biosize mount options are handled identically, other than allocsize accepting suffixes. suffix_kstrtoint handles bare numbers just fine too, so these can be collapsed. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * xfs: relocate sparse inode mount warningBrian Foster2015-08-191-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sparse inodes feature is currently considered experimental. We warn at mount time from xfs_mount_validate_sb(). This function is part of the superblock verifier codepath, however, which means it could be invoked repeatedly on superblock reads or writes. This is currently only noticeable from userspace, where mkfs produces multiple warnings at format time. As mkfs warnings were not the intent of this change, relocate the mount time warning to xfs_fs_fill_super(), which is only invoked once and only in kernel space. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | fs: create and use seq_show_option for escapingKees Cook2015-09-041-2/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: J. R. Okajima <hooanon05g@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* xfs: add initial DAX supportDave Chinner2015-06-041-2/+23
| | | | | | | | | | | | Add initial DAX support to XFS. To do this we need a new mount option to turn DAX on filesystem, and we need to propagate this into the inode flags whenever an inode is instantiated so that the per-inode checks throughout the code Do The Right Thing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2015-04-261-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
| * VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells2015-04-151-1/+1
| | | | | | | | | | | | | | that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'xfs-misc-fixes-for-4.1-3' into for-nextDave Chinner2015-04-131-0/+6
|\ \ | | | | | | | | | | | | Conflicts: fs/xfs/xfs_iops.c
| * | xfs: disallow ro->rw remount on norecovery mountEric Sandeen2015-04-131-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a bit of a loophole in norecovery mount handling right now: an initial mount must be readonly, but nothing prevents a mount -o remount,rw from producing a writable, unrecovered xfs filesystem. It might be possible to try to perform a log recovery when this is requested, but I'm not sure it's worth the effort. For now, simply disallow this sort of transition. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | | Merge branch 'xfs-misc-fixes-for-4.1-2' into for-nextDave Chinner2015-03-251-1/+1
|\ \ \ | | | | | | | | | | | | | | | | | | | | Conflicts: fs/xfs/libxfs/xfs_bmap.c fs/xfs/xfs_inode.c
OpenPOWER on IntegriCloud