summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseekJosef Bacik2011-07-207-12/+66
| | | | | | | | | | | | This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases we just return -EINVAL, in others we do the normal generic thing, and in others we're simply making sure that the properly due-dilligence is done. For example in NFS/CIFS we need to make sure the file size is update properly for the SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself that is all we have to do. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Ext4: handle SEEK_HOLE/SEEK_DATA genericallyJosef Bacik2011-07-201-0/+21
| | | | | | | | | Since Ext4 has its own lseek we need to make sure it handles SEEK_HOLE/SEEK_DATA. For now just do the same thing that is done in the generic case, somebody else can come along and make it do fancy things later. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Btrfs: implement our own ->llseekJosef Bacik2011-07-202-1/+150
| | | | | | | | | | | In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek. Basically for the normal SEEK_*'s we will just defer to the generic helper, and for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest hole or data. Currently this helper doesn't check for delalloc bytes for prealloc space, so for now treat prealloc as data until that is fixed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: add SEEK_HOLE and SEEK_DATA flagsJosef Bacik2011-07-201-3/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out using fiemap in things like cp cause more problems than it solves, so lets try and give userspace an interface that doesn't suck. We need to match solaris here, and the definitions are *o* If /whence/ is SEEK_HOLE, the offset of the start of the next hole greater than or equal to the supplied offset is returned. The definition of a hole is provided near the end of the DESCRIPTION. *o* If /whence/ is SEEK_DATA, the file pointer is set to the start of the next non-hole file region greater than or equal to the supplied offset. So in the generic case the entire file is data and there is a virtual hole at the end. That means we will just return i_size for SEEK_HOLE and will return the same offset for SEEK_DATA. This is how Solaris does it so we have to do it the same way. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* reiserfs: make reiserfs default to barrier=flushChristoph Hellwig2011-07-201-0/+1
| | | | | | | | | Change the default reiserfs mount option to barrier=flush. Based on a patch from Jeff Mahoney in the SuSE tree. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ext3: make ext3 mount default to barrier=1Christoph Hellwig2011-07-201-0/+2
| | | | | | | | | | | | | This patch turns on barriers by default for ext3. mount -o barrier=0 will turn them off. Based on a patch from Chris Mason in the SuSE tree. Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Mahoney <jeffm@suse.com> Acked-by: Ted Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* don't open-code parent_ino() in assorted ->readdir()Al Viro2011-07-203-3/+3
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* minix_getattr(): don't bother with ->d_parentAl Viro2011-07-201-2/+1
| | | | | | we can find superblock easier, TYVM... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* coda_venus_readdir(): use offsetof()Al Viro2011-07-201-2/+1
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: seq_file - add event counter to simplify poll() supportKay Sievers2011-07-202-3/+3
| | | | | | | | | | | | | | Moving the event counter into the dynamically allocated 'struc seq_file' allows poll() support without the need to allocate its own tracking structure. All current users are switched over to use the new counter. Requested-by: Andrew Morton akpm@linux-foundation.org Acked-by: NeilBrown <neilb@suse.de> Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: move inode_dio_done to the end_io handlerChristoph Hellwig2011-07-204-3/+13
| | | | | | | | | | | | | For filesystems that delay their end_io processing we should keep our i_dio_count until the the processing is done. Enable this by moving the inode_dio_done call to the end_io handler if one exist. Note that the actual move to the workqueue for ext4 and XFS is not done in this patch yet, but left to the filesystem maintainers. At least for XFS it's not needed yet either as XFS has an internal equivalent to i_dio_count. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: simplify the blockdev_direct_IO prototypeChristoph Hellwig2011-07-209-24/+22
| | | | | | | | | | | | | | Simple filesystems always pass inode->i_sb_bdev as the block device argument, and never need a end_io handler. Let's simply things for them and for my grepping activity by dropping these arguments. The only thing not falling into that scheme is ext4, which passes and end_io handler without needing special flags (yet), but given how messy the direct I/O code there is use of __blockdev_direct_IO in one instead of two out of three cases isn't going to make a large difference anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: always maintain i_dio_countChristoph Hellwig2011-07-203-24/+17
| | | | | | | | | | | | | | | | | Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING. This these filesystems to also protect truncate against direct I/O requests by using common code. Right now the only non-DIO_LOCKING filesystem that appears to do so is XFS, which uses an opencoded variant of the i_dio_count scheme. Behaviour doesn't change for filesystems never calling inode_dio_wait. For ext4 behaviour changes when using the dioread_nonlock option, which previously was missing any protection between truncate and direct I/O reads. For ocfs2 that handcrafted i_dio_count manipulations are replaced with the common code now enable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: move inode_dio_wait calls into ->setattrChristoph Hellwig2011-07-2012-3/+24
| | | | | | | | | | Let filesystems handle waiting for direct I/O requests themselves instead of doing it beforehand. This means filesystem-specific locks to prevent new dio referenes from appearing can be held. This is important to allow generalizing i_dio_count to non-DIO_LOCKING filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: kill i_alloc_semChristoph Hellwig2011-07-208-44/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | i_alloc_sem is a rather special rw_semaphore. It's the last one that may be released by a non-owner, and it's write side is always mirrored by real exclusion. It's intended use it to wait for all pending direct I/O requests to finish before starting a truncate. Replace it with a hand-grown construct: - exclusion for truncates is already guaranteed by i_mutex, so it can simply fall way - the reader side is replaced by an i_dio_count member in struct inode that counts the number of pending direct I/O requests. Truncate can't proceed as long as it's non-zero - when i_dio_count reaches non-zero we wake up a pending truncate using wake_up_bit on a new bit in i_flags - new references to i_dio_count can't appear while we are waiting for it to read zero because the direct I/O count always needs i_mutex (or an equivalent like XFS's i_iolock) for starting a new operation. This scheme is much simpler, and saves the space of a spinlock_t and a struct list_head in struct inode (typically 160 bits on a non-debug 64-bit system). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: simplify handling of zero sized reads in __blockdev_direct_IOChristoph Hellwig2011-07-201-2/+5
| | | | | | | | | Reject zero sized reads as soon as we know our I/O length, and don't borther with locks or allocations that might have to be cleaned up otherwise. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ext4: Rewrite ext4_page_mkwrite() to use generic helpersJan Kara2011-07-201-51/+55
| | | | | | | | | | | | | | | | Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This removes the need of using i_alloc_sem to avoid races with truncate which seems to be the wrong locking order according to lock ordering documented in mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to be problematic because we can decide to flush delay-allocated blocks which will acquire s_umount semaphore - again creating unpleasant lock dependency if not directly a deadlock. Also add a check for frozen filesystem so that we don't busyloop in page fault when the filesystem is frozen. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fat: remove i_alloc_sem abuseChristoph Hellwig2011-07-203-2/+7
| | | | | | | | | | | Add a new rw_semaphore to protect bmap against truncate. Previous i_alloc_sem was abused for this, but it's going away in this series. Note that we can't simply use i_mutex, given that the swapon code calls ->bmap under it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* VFS: Fixup kerneldoc for generic_permission()Tobias Klauser2011-07-201-1/+0
| | | | | | | | The flags parameter went away in d749519b444db985e40b897f73ce1898b11f997e Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* xfs: make use of new shrinker callout for the inode cacheDave Chinner2011-07-203-56/+46
| | | | | | | | | | | | Convert the inode reclaim shrinker to use the new per-sb shrinker operations. This allows much bigger reclaim batches to be used, and allows the XFS inode cache to be shrunk in proportion with the VFS dentry and inode caches. This avoids the problem of the VFS caches being shrunk significantly before the XFS inode cache is shrunk resulting in imbalances in the caches during reclaim. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: increase shrinker batch sizeDave Chinner2011-07-201-0/+1
| | | | | | | | | | | | | | | | | | | | | Now that the per-sb shrinker is responsible for shrinking 2 or more caches, increase the batch size to keep econmies of scale for shrinking each cache. Increase the shrinker batch size to 1024 objects. To allow for a large increase in batch size, add a conditional reschedule to prune_icache_sb() so that we don't hold the LRU spin lock for too long. This mirrors the behaviour of the __shrink_dcache_sb(), and allows us to increase the batch size without needing to worry about problems caused by long lock hold times. To ensure that filesystems using the per-sb shrinker callouts don't cause problems, document that the object freeing method must reschedule appropriately inside loops. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* superblock: add filesystem shrinker operationsDave Chinner2011-07-201-12/+33
| | | | | | | | | | | | | | | Now we have a per-superblock shrinker implementation, we can add a filesystem specific callout to it to allow filesystem internal caches to be shrunk by the superblock shrinker. Rather than perpetuate the multipurpose shrinker callback API (i.e. nr_to_scan == 0 meaning "tell me how many objects freeable in the cache), two operations will be added. The first will return the number of objects that are freeable, the second is the actual shrinker call. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* inode: remove iprune_semDave Chinner2011-07-201-21/+0
| | | | | | | | | | Now that we have per-sb shrinkers with a lifecycle that is a subset of the superblock lifecycle and can reliably detect a filesystem being unmounted, there is not longer any race condition for the iprune_sem to protect against. Hence we can remove it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* superblock: introduce per-sb cache shrinker infrastructureDave Chinner2011-07-203-218/+71
| | | | | | | | | | | | | | | | | | | | | | | | With context based shrinkers, we can implement a per-superblock shrinker that shrinks the caches attached to the superblock. We currently have global shrinkers for the inode and dentry caches that split up into per-superblock operations via a coarse proportioning method that does not batch very well. The global shrinkers also have a dependency - dentries pin inodes - so we have to be very careful about how we register the global shrinkers so that the implicit call order is always correct. With a per-sb shrinker callout, we can encode this dependency directly into the per-sb shrinker, hence avoiding the need for strictly ordering shrinker registrations. We also have no need for any proportioning code for the shrinker subsystem already provides this functionality across all shrinkers. Allowing the shrinker to operate on a single superblock at a time means that we do less superblock list traversals and locking and reclaim should batch more effectively. This should result in less CPU overhead for reclaim and potentially faster reclaim of items from each filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* superblock: move pin_sb_for_writeback() to fs/super.cDave Chinner2011-07-203-27/+35
| | | | | | | | | | | | | | | The per-sb shrinker has the same requirement as the writeback threads of ensuring that the superblock is usable and pinned for the time it takes to run the work. Both need to take a passive reference to the sb, take a read lock on the s_umount lock and then only continue if an unmount is not in progress. pin_sb_for_writeback() does this exactly, so move it to fs/super.c and rename it to grab_super_passive() and exporting it via fs/internal.h for all the VFS code to be able to use. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* inode: move to per-sb LRU locksDave Chinner2011-07-202-14/+14
| | | | | | | | | | With the inode LRUs moving to per-sb structures, there is no longer a need for a global inode_lru_lock. The locking can be made more fine-grained by moving to a per-sb LRU lock, isolating the LRU operations of different filesytsems completely from each other. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* inode: Make unused inode LRU per superblockDave Chinner2011-07-202-11/+81
| | | | | | | | | | | | | | | | | | The inode unused list is currently a global LRU. This does not match the other global filesystem cache - the dentry cache - which uses per-superblock LRU lists. Hence we have related filesystem object types using different LRU reclaimation schemes. To enable a per-superblock filesystem cache shrinker, both of these caches need to have per-sb unused object LRU lists. Hence this patch converts the global inode LRU to per-sb LRUs. The patch only does rudimentary per-sb propotioning in the shrinker infrastructure, as this gets removed when the per-sb shrinker callouts are introduced later on. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* inode: convert inode_stat.nr_unused to per-cpu countersDave Chinner2011-07-201-5/+11
| | | | | | | | | Before we split up the inode_lru_lock, the unused inode counter needs to be made independent of the global inode_lru_lock. Convert it to per-cpu counters to do this. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)Al Viro2011-07-2015-94/+39
| | | | | | ... and simplify the living hell out of callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* deuglify squashfs_lookup()Al Viro2011-07-201-4/+1
| | | | | | | | d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL), NULL so no need for that if (inode) ... in there (or ERR_PTR(0), for that matter) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* nfsd4_list_rec_dir(): don't bother with reopening rec_fileAl Viro2011-07-201-31/+21
| | | | | | | just rewind it to the beginning before vfs_readdir() and be done with that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* kill useless checks for sb->s_op == NULLAl Viro2011-07-202-2/+1
| | | | | | never is... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* btrfs: kill magical embedded struct superblockAl Viro2011-07-204-22/+29
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* get rid of pointless checks for dentry->sb == NULLAl Viro2011-07-201-1/+0
| | | | | | it never is... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Make ->d_sb assign-once and always non-NULLAl Viro2011-07-203-39/+47
| | | | | | | | | | | | | | New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name). Allocates dentry, sets its ->d_sb to given superblock and sets ->d_op accordingly. Old d_alloc(NULL, name) callers are converted to that (all of them know what superblock they want). d_alloc() itself is left only for parent != NULl case; uses __d_alloc(), inserts result into the list of parent's children. Note that now ->d_sb is assign-once and never NULL *and* ->d_parent is never NULL either. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* unexport kern_path_parent()Al Viro2011-07-201-1/+0
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* switch vfs_path_lookup() to struct pathAl Viro2011-07-203-21/+20
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* kill lookup_create()Al Viro2011-07-201-36/+18
| | | | | | folded into the only caller (kern_path_create()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* new helpers: kern_path_create/user_path_createAl Viro2011-07-202-116/+87
| | | | | | | combination of kern_path_parent() and lookup_create(). Does *not* expose struct nameidata to caller. Syscalls converted to that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* kill LOOKUP_CONTINUEAl Viro2011-07-201-8/+3
| | | | | | LOOKUP_PARENT is equivalent to it now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* nfs: LOOKUP_{OPEN,CREATE,EXCL} is set only on the last stepAl Viro2011-07-201-4/+2
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* cifs_lookup(): LOOKUP_OPEN is set only on the last componentAl Viro2011-07-201-1/+1
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ceph: LOOKUP_OPEN is set only when it's the last componentAl Viro2011-07-201-1/+0
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* jfs_ci_revalidate() is safe from RCU modeAl Viro2011-07-201-2/+0
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* LOOKUP_CREATE and LOOKUP_RENAME_TARGET can be set only on the last stepAl Viro2011-07-203-12/+6
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* no need to check for LOOKUP_OPEN in ->create() instancesAl Viro2011-07-205-10/+10
| | | | | | | ... it will be set in nd->flag for all cases with non-NULL nd (i.e. when called from do_last()). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* don't pass nameidata to vfs_create() from ecryptfs_create()Al Viro2011-07-201-28/+5
| | | | | | | | | Instead of playing with removal of LOOKUP_OPEN, mangling (and restoring) nd->path, just pass NULL to vfs_create(). The whole point of what's being done there is to suppress any attempts to open file by underlying fs, which is what nd == NULL indicates. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* don't transliterate lower bits of ->intent.open.flags to FMODE_...Al Viro2011-07-206-30/+23
| | | | | | ->create() instances are much happier that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Don't pass nameidata when calling vfs_create() from mknod()Al Viro2011-07-201-1/+1
| | | | | | | All instances can cope with that now (and ceph one actually starts working properly). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fix mknod() on nfs4 (hopefully)Al Viro2011-07-201-12/+12
| | | | | | | | | a) check the right flags in ->create() (LOOKUP_OPEN, not LOOKUP_CREATE) b) default (!LOOKUP_OPEN) open_flags is O_CREAT|O_EXCL|FMODE_READ, not 0 c) lookup_instantiate_filp() should be done only with LOOKUP_OPEN; otherwise we need to issue CLOSE, lest we leak stateid on server. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
OpenPOWER on IntegriCloud