summaryrefslogtreecommitdiffstats
path: root/fs/inode.c
Commit message (Collapse)AuthorAgeFilesLines
* fs: turn iprune_mutex into rwsemNick Piggin2009-09-231-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have had a report of bad memory allocation latency during DVD-RAM (UDF) writing. This is causing the user's desktop session to become unusable. Jan tracked the cause of this down to UDF inode reclaim blocking: gnome-screens D ffff810006d1d598 0 20686 1 ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800 ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580 ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0 Call Trace: [<ffffffff804477f3>] io_schedule+0x63/0xa5 [<ffffffff802c2587>] sync_buffer+0x3b/0x3f [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79 [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77 [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21 [<ffffffff802c442a>] __bread+0x70/0x86 [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394 [<ffffffff802bd205>] write_inode_now+0x7d/0xc4 [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53 [<ffffffff802b39ae>] clear_inode+0xc2/0x11b [<ffffffff802b3ab1>] dispose_list+0x5b/0x102 [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392 [<ffffffff802951fa>] alloc_page_vma+0x176/0x189 [<ffffffff802822d8>] __do_fault+0x10c/0x417 [<ffffffff80284232>] handle_mm_fault+0x466/0x940 [<ffffffff8044b922>] do_page_fault+0x676/0xabf This blocks with iprune_mutex held, which then blocks other reclaimers: X D ffff81009d47c400 0 17285 14831 ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288 ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400 ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740 Call Trace: [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9 [<ffffffff80447e1a>] mutex_lock+0x1e/0x22 [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392 [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6 [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73 [<ffffffff802ad949>] do_select+0x308/0x506 [<ffffffff802adced>] core_sys_select+0x1a6/0x254 [<ffffffff802ae0b7>] sys_select+0xb5/0x157 Now I think the main problem is having the filesystem block (and do IO) in inode reclaim. The problem is that this doesn't get accounted well and penalizes a random allocator with a big latency spike caused by work generated from elsewhere. I think the best idea would be to avoid this. By design if possible, or by deferring the hard work to an asynchronous context. If the latter, then the fs would probably want to throttle creation of new work with queue size of the deferred work, but let's not get into those details. Anyway, the other obvious thing we looked at is the iprune_mutex which is causing the cascading blocking. We could turn this into an rwsem to improve concurrency. It is unreasonable to totally ban all potentially slow or blocking operations in inode reclaim, so I think this is a cheap way to get a small improvement. This doesn't solve the whole problem of course. The process doing inode reclaim will still take the latency hit, and concurrent processes may end up contending on filesystem locks. So fs developers should keep these problems in mind. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jan Kara <jack@ucw.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* const: mark remaining inode_operations as constAlexey Dobriyan2009-09-221-1/+1
| | | | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: make sure data stored into inode is properly seen before unlocking new inodeJan Kara2009-09-221-6/+8
| | | | | | | | | | | | | | | | In theory it could happen that on one CPU we initialize a new inode but clearing of I_NEW | I_LOCK gets reordered before some of the initialization. Thus on another CPU we return not fully uptodate inode from iget_locked(). This seems to fix a corruption issue on ext3 mounted over NFS. [akpm@linux-foundation.org: add some commentary] Signed-off-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: remove bdev->bd_inode_backing_dev_infoJens Axboe2009-09-161-3/+1
| | | | | | | | | | | | | | | It has been unused since it was introduced in: commit 520808bf20e90fdbdb320264ba7dd5cf9d47dcac Author: Andrew Morton <akpm@osdl.org> Date: Fri May 21 00:46:17 2004 -0700 [PATCH] block device layer: separate backing_dev_info infrastructure So lets just kill it. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* vfs: add __destroy_inodeChristoph Hellwig2009-08-071-3/+7
| | | | | | | | | | | | | When we want to tear down an inode that lost the add to the cache race in XFS we must not call into ->destroy_inode because that would delete the inode that won the race from the inode cache radix tree. This patch provides the __destroy_inode helper needed to fix this, the actual fix will be in th next patch. As XFS was the only reason destroy_inode was exported we shift the export to the new __destroy_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
* vfs: fix inode_init_always calling conventionChristoph Hellwig2009-08-071-13/+17
| | | | | | | | | | | | | | | | Currently inode_init_always calls into ->destroy_inode if the additional initialization fails. That's not only counter-intuitive because inode_init_always did not allocate the inode structure, but in case of XFS it's actively harmful as ->destroy_inode might delete the inode from a radix-tree that has never been added. This in turn might end up deleting the inode for the same inum that has been instanciated by another process and cause lots of cause subtile problems. Also in the case of re-initializing a reclaimable inode in XFS it would free an inode we still want to keep alive. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
* Merge branch 'for-linus' of ↵Linus Torvalds2009-06-241-0/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (23 commits) switch xfs to generic acl caching helpers helpers for acl caching + switch to those switch shmem to inode->i_acl switch reiserfs to inode->i_acl switch reiserfs to usual conventions for caching ACLs reiserfs: minimal fix for ACL caching switch nilfs2 to inode->i_acl switch btrfs to inode->i_acl switch jffs2 to inode->i_acl switch jfs to inode->i_acl switch ext4 to inode->i_acl switch ext3 to inode->i_acl switch ext2 to inode->i_acl add caching of ACLs in struct inode fs: Add new pre-allocation ioctls to vfs for compatibility with legacy xfs ioctls cleanup __writeback_single_inode ... and the same for vfsmount id/mount group id Make allocation of anon devices cheaper update Documentation/filesystems/Locking devpts: remove module-related code ...
| * add caching of ACLs in struct inodeAl Viro2009-06-241-0/+10
| | | | | | | | | | | | No helpers, no conversions yet. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | vfs: Set special lockdep map for dirs only if not set by fsJan Kara2009-06-221-6/+11
|/ | | | | | | | | | | | | | | | Some filesystems need to set lockdep map for i_mutex differently for different directories. For example OCFS2 has system directories (for orphan inode tracking and for gathering all system files like journal or quota files into a single place) which have different locking locking rules than standard directories. For a filesystem setting lockdep map is naturaly done when the inode is read but we have to modify unlock_new_inode() not to overwrite the lockdep map the filesystem has set. Acked-by: peterz@infradead.org CC: mingo@redhat.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Joel Becker <joel.becker@oracle.com>
* trivial: fs/inode: Fix typo in file_update_time nanodocWolfram Sang2009-06-121-1/+1
| | | | | | | The advertised flag for not updating the time was wrong. Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* fs: introduce mnt_clone_writenpiggin@suse.de2009-06-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch speeds up lmbench lat_mmap test by about another 2% after the first patch. Before: avg = 462.286 std = 5.46106 After: avg = 453.12 std = 9.58257 (50 runs of each, stddev gives a reasonable confidence) It does this by introducing mnt_clone_write, which avoids some heavyweight operations of mnt_want_write if called on a vfsmount which we know already has a write count; and mnt_want_write_file, which can call mnt_clone_write if the file is open for write. After these two patches, mnt_want_write and mnt_drop_write go from 7% on the profile down to 1.3% (including mnt_clone_write). [AV: mnt_want_write_file() should take file alone and derive mnt from it; not only all callers have that form, but that's the only mnt about which we know that it's already held for write if file is opened for write] Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fsnotify: handle filesystem unmounts with fsnotify marksEric Paris2009-06-111-0/+1
| | | | | | | | | | When an fs is unmounted with an fsnotify mark entry attached to one of its inodes we need to destroy that mark entry and we also (like inotify) send an unmount event. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de>
* fsnotify: add marks to inodes so groups can interpret how to handle those inodesEric Paris2009-06-111-0/+9
| | | | | | | | | | | | | | | | | This patch creates a way for fsnotify groups to attach marks to inodes. These marks have little meaning to the generic fsnotify infrastructure and thus their meaning should be interpreted by the group that attached them to the inode's list. dnotify and inotify will make use of these markings to indicate which inodes are of interest to their respective groups. But this implementation has the useful property that in the future other listeners could actually use the marks for the exact opposite reason, aka to indicate which inodes it had NO interest in. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de>
* integrity: fix IMA inode leakHugh Dickins2009-06-061-0/+1
| | | | | | | | | CONFIG_IMA=y inode activity leaks iint_cache and radix_tree_node objects until the system runs out of memory. Nowhere is calling ima_inode_free() a.k.a. ima_iint_delete(). Fix that by calling it from destroy_inode(). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext3/4 with synchronous writes gets wedged by PostfixAl Viro2009-06-061-6/+25
| | | | | | | | | | | | | | | | | OK, that's probably the easiest way to do that, as much as I don't like it... Since iget() et.al. will not accept I_FREEING (will wait to go away and restart), and since we'd better have serialization between new/free on fs data structures anyway, we can afford simply skipping I_FREEING et.al. in insert_inode_locked(). We do that from new_inode, so it won't race with free_inode in any interesting ways and it won't race with iget (of any origin; nfsd or in case of fs corruption a lookup) since both still will wait for I_LOCK. Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Tested-by: David Watson <dbwatson@ukfsn.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Make checkpatch.pl shut up on fs/inode.cManish Katiyar2009-05-091-46/+35
| | | | | | | | | | Code Quality According To Mingo(tm) has been vastly improved, no code has been damaged^Wchanged^Wdamaged. [commit message rewritten -- AV] Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* splice: add helpers for locking pipe inodeMiklos Szeredi2009-04-151-36/+0
| | | | | | | | | | | | | | | | | | | | There are lots of sequences like this, especially in splice code: if (pipe->inode) mutex_lock(&pipe->inode->i_mutex); /* do something */ if (pipe->inode) mutex_unlock(&pipe->inode->i_mutex); so introduce helpers which do the conditional locking and unlocking. Also replace the inode_double_lock() call with a pipe_double_lock() helper to avoid spreading the use of this functionality beyond the pipe code. This patch is just a cleanup, and should cause no behavioral changes. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2009-03-271-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (37 commits) fs: avoid I_NEW inodes Merge code for single and multiple-instance mounts Remove get_init_pts_sb() Move common mknod_ptmx() calls into caller Parse mount options just once and copy them to super block Unroll essentials of do_remount_sb() into devpts vfs: simple_set_mnt() should return void fs: move bdev code out of buffer.c constify dentry_operations: rest constify dentry_operations: configfs constify dentry_operations: sysfs constify dentry_operations: JFS constify dentry_operations: OCFS2 constify dentry_operations: GFS2 constify dentry_operations: FAT constify dentry_operations: FUSE constify dentry_operations: procfs constify dentry_operations: ecryptfs constify dentry_operations: CIFS constify dentry_operations: AFS ...
| * fs: avoid I_NEW inodesNick Piggin2009-03-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To be on the safe side, it should be less fragile to exclude I_NEW inodes from inode list scans by default (unless there is an important reason to have them). Normally they will get excluded (eg. by zero refcount or writecount etc), however it is a bit fragile for list walkers to know exactly what parts of the inode state is set up and valid to test when in I_NEW. So along these lines, move I_NEW checks upward as well (sometimes taking I_FREEING etc checks with them too -- this shouldn't be a problem should it?) Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'for_linus' of ↵Linus Torvalds2009-03-271-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits) ext2: Zero our b_size in ext2_quota_read() trivial: fix typos/grammar errors in fs/Kconfig quota: Coding style fixes quota: Remove superfluous inlines quota: Remove uppercase aliases for quota functions. nfsd: Use lowercase names of quota functions jfs: Use lowercase names of quota functions udf: Use lowercase names of quota functions ufs: Use lowercase names of quota functions reiserfs: Use lowercase names of quota functions ext4: Use lowercase names of quota functions ext3: Use lowercase names of quota functions ext2: Use lowercase names of quota functions ramfs: Remove quota call vfs: Use lowercase names of quota functions quota: Remove dqbuf_t and other cleanups quota: Remove NODQUOT macro quota: Make global quota locks cacheline aligned quota: Move quota files into separate directory ext4: quota reservation for delayed allocation ...
| * | vfs: Use lowercase names of quota functionsJan Kara2009-03-261-2/+2
| |/ | | | | | | | | | | | | Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: Alexander Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'for-linus' of ↵Linus Torvalds2009-03-261-7/+17
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (71 commits) SELinux: inode_doinit_with_dentry drop no dentry printk SELinux: new permission between tty audit and audit socket SELinux: open perm for sock files smack: fixes for unlabeled host support keys: make procfiles per-user-namespace keys: skip keys from another user namespace keys: consider user namespace in key_permission keys: distinguish per-uid keys in different namespaces integrity: ima iint radix_tree_lookup locking fix TOMOYO: Do not call tomoyo_realpath_init unless registered. integrity: ima scatterlist bug fix smack: fix lots of kernel-doc notation TOMOYO: Don't create securityfs entries unless registered. TOMOYO: Fix exception policy read failure. SELinux: convert the avc cache hash list to an hlist SELinux: code readability with avc_cache SELinux: remove unused av.decided field SELinux: more careful use of avd in avc_has_perm_noaudit SELinux: remove the unused ae.used SELinux: check seqno when updating an avc_node ...
| * \ Merge branch 'master' into nextJames Morris2009-03-241-0/+7
| |\ \ | | |/
| * | Merge branch 'master' into nextJames Morris2009-02-061-6/+68
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: fs/namei.c Manually merged per: diff --cc fs/namei.c index 734f2b5,bbc15c2..0000000 --- a/fs/namei.c +++ b/fs/namei.c @@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char nd->flags |= LOOKUP_CONTINUE; err = exec_permission_lite(inode); if (err == -EAGAIN) - err = vfs_permission(nd, MAY_EXEC); + err = inode_permission(nd->path.dentry->d_inode, + MAY_EXEC); + if (!err) + err = ima_path_check(&nd->path, MAY_EXEC); if (err) break; @@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path *path, int acc flag &= ~O_TRUNC; } - error = vfs_permission(nd, acc_mode); + error = inode_permission(inode, acc_mode); if (error) return error; + - error = ima_path_check(&nd->path, ++ error = ima_path_check(path, + acc_mode & (MAY_READ | MAY_WRITE | MAY_EXEC)); + if (error) + return error; /* * An append-only file must be opened in append mode for writing. */ Signed-off-by: James Morris <jmorris@namei.org>
| * | | integrity: IMA hooksMimi Zohar2009-02-061-7/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch replaces the generic integrity hooks, for which IMA registered itself, with IMA integrity hooks in the appropriate places directly in the fs directory. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
* | | | Allow relatime to update atime once a dayMatthew Garrett2009-03-261-9/+38
| |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow atime to be updated once per day even with relatime. This lets utilities like tmpreaper (which delete files based on last access time) continue working, making relatime a plausible default for distributions. Signed-off-by: Matthew Garrett <mjg@redhat.com> Reviewed-by: Matthew Wilcox <willy@linux.intel.com> Acked-by: Valerie Aurora Henson <vaurora@redhat.com> Acked-by: Alan Cox <alan@redhat.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | fs: new inode i_state corruption fixNick Piggin2009-03-121-0/+7
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was a report of a data corruption http://lkml.org/lkml/2008/11/14/121. There is a script included to reproduce the problem. During testing, I encountered a number of strange things with ext3, so I tried ext2 to attempt to reduce complexity of the problem. I found that fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be cleared, even though instrumentation showed that unlock_new_inode had already been called for that inode. This points to memory scribble, or synchronisation problme. i_state of I_NEW inodes is not protected by inode_lock because other processes are not supposed to touch them until I_LOCK (and I_NEW) is cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify i_state revealed that generic_sync_sb_inodes is picking up new inodes from the inode lists and passing them to __writeback_single_inode without waiting for I_NEW. Subsequently modifying i_state causes corruption. In my case it would look like this: CPU0 CPU1 unlock_new_inode() __sync_single_inode() reg <- inode->i_state reg -> reg & ~(I_LOCK|I_NEW) reg <- inode->i_state reg -> inode->i_state reg -> reg | I_SYNC reg -> inode->i_state Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again. Fix for this is rather than wait for I_NEW inodes, just skip over them: inodes concurrently being created are not subject to data integrity operations, and should not significantly contribute to dirty memory either. After this change, I'm unable to reproduce any of the added warnings or hangs after ~1hour of running. Previously, the new warnings would start immediately and hang would happen in under 5 minutes. I'm also testing on ext3 now, and so far no problems there either. I don't know whether this fixes the problem reported above, but it fixes a real problem for me. Cc: "Jorge Boncompte [DTI2]" <jorge@dti2.net> Reported-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Cc: Jan Kara <jack@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | partial revert of asynchronous inode deleteArjan van de Ven2009-01-091-12/+7
| | | | | | | | | | | | | | let the core of this one bake in -next as well, but leave some of the infrastructure in place. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
* | async: make the final inode deletion an asynchronous eventArjan van de Ven2009-01-071-7/+13
| | | | | | | | | | | | | | this makes "rm -rf" on a (names cached) kernel tree go from 11.6 to 8.6 seconds on an ext3 filesystem Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
* | fs/inode: fix kernel-doc notationRandy Dunlap2009-01-061-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix kernel-doc notation: Warning(linux-2.6.28-git3//fs/inode.c:120): No description found for parameter 'sb' Warning(linux-2.6.28-git3//fs/inode.c:120): No description found for parameter 'inode' Warning(linux-2.6.28-git3//fs/inode.c:588): No description found for parameter 'sb' Warning(linux-2.6.28-git3//fs/inode.c:588): No description found for parameter 'inode' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: remove GFP_HIGHUSER_PAGECACHEHugh Dickins2009-01-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making that harder to track down: remove it, and its out-of-work brothers GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE. Since we're making that improvement to hotremove_migrate_alloc(), I think we can now also remove one of the "o"s from its comment. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | zero i_uid/i_gid on inode allocationAl Viro2009-01-051-0/+2
| | | | | | | | | | | | | | | | | | ... and don't bother in callers. Don't bother with zeroing i_blocks, while we are at it - it's already been zeroed. i_mode is not worth the effort; it has no common default value. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | nfsd/create race fixes, infrastructureAl Viro2008-12-311-0/+59
|/ | | | | | | | | | | | | new helpers - insert_inode_locked() and insert_inode_locked4(). Hash new inode, making sure that there's no such inode in icache already. If there is and it does not end up unhashed (as would happen if we have nfsd trying to resolve a bogus fhandle), fail. Otherwise insert our inode into hash and succeed. In either case have i_state set to new+locked; cleanup ends up being simpler with such calling conventions. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: xfs needs inode_wait to be exportedStephen Rothwell2008-11-101-0/+1
| | | | | | | | | Since wait_on_inode() references it. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* Inode: export symbol destroy_inodeChristoph Hellwig2008-10-301-0/+1
| | | | | | | | | | To make sure we free the security data inodes need to be freed using the proper VFS helper (which we also need to export for this). We mark these inodes bad so we can skip the flush path for them. Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <david@fromorbit.com>
* Inode: Allow external list initialisationDavid Chinner2008-10-301-21/+46
| | | | | | | | | | | | | | | | | | | To allow XFS to combine the XFS and linux inodes into a single structure, we need to drive inode lookup from the XFS inode cache, not the generic inode cache. This means that we need initialise a struct inode from a context outside alloc_inode() as it is no longer used by XFS. After inode allocation and initialisation, we need to add the inode to the superblock list, the in-use list, hash it and do some accounting. This all needs to be done with the inode_lock held and there are already several places in fs/inode.c that do this list manipulation. Factor out the common code, add a locking wrapper and export the function so ti can be called from XFS. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* Inode: Allow external initialisersDavid Chinner2008-10-301-62/+78
| | | | | | | | | | | | | | | | | | To allow XFS to combine the XFS and linux inodes into a single structure, we need to drive inode lookup from the XFS inode cache, not the generic inode cache. This means that we need initialise a struct inode from a context outside alloc_inode() as it is no longer used by XFS. Factor and export the struct inode initialisation code from alloc_inode() to inode_init_always() as a counterpart to inode_init_once(). i.e. we have to call this init function for each inode instantiation (always), as opposed inode_init_once() which is only called on slab object instantiation (once). Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* fs/inode.c: properly init address_space->writeback_indexChris Mason2008-08-151-0/+1
| | | | | | | | | | | | | | | | | | | write_cache_pages() uses i_mapping->writeback_index to pick up where it left off the last time a given inode was found by pdflush or balance_dirty_pages (or anyone else who sets wbc->range_cyclic) alloc_inode() should set it to a sane value so that writeback doesn't start in the middle of a file. It is somewhat difficult to notice the bug since write_cache_pages will loop around to the start of the file and the elevator helps hide the resulting seeks. For whatever reason, Btrfs hits this often. Unpatched, untarring 30 copies of the linux kernel in series runs at 47MB/s on a single sata drive. With this fix, it jumps to 62MB/s. Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* SL*B: drop kmem cache argument from constructorAlexey Dobriyan2008-07-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: spinlock tree_lockNick Piggin2008-07-261-1/+1
| | | | | | | | | | | | | | mapping->tree_lock has no read lockers. convert the lock from an rwlock to a spinlock. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* VFS: fix unused variable warningLinus Torvalds2008-05-061-2/+0
| | | | | | | | | | Commit 33dcdac2df54e66c447ae03f58c95c7251aa5649 ("kill ->put_inode") removed the final use of i_op->put_inode, but left the now totally unused "op" variable in iput(). Get rid of it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] kill ->put_inodeChristoph Hellwig2008-05-061-3/+0
| | | | | | | | | | | | | | | And with that last patch to affs killing the last put_inode instance we can finally, after many years of transition kill this racy and awkward interface. (It's kinda funny that even the description in Documentation/filesystems/vfs.txt was entirely wrong..) Also remove a very misleading comment above the defintion of struct super_operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs/inode.c: use hlist_for_each_entry()Matthias Kaehlcke2008-04-291-4/+2
| | | | | | | | | fs/inode.c: use hlist_for_each_entry() in find_inode() and find_inode_fast() [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] r/o bind mounts: write count for file_update_time()Dave Hansen2008-04-191-1/+5
| | | | | | | Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* [PATCH] r/o bind mounts: write counts for touch_atime()Dave Hansen2008-04-191-25/+20
| | | | | | | | | Remove handling of NULL mnt while we are at it - that can't happen these days. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* iget: remove iget() and the read_inode() super op as being obsoleteDavid Howells2008-02-071-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the old iget() call and the read_inode() superblock operation it uses as these are really obsolete, and the use of read_inode() does not produce proper error handling (no distinction between ENOMEM and EIO when marking an inode bad). Furthermore, this removes the temptation to use iget() to find an inode by number in a filesystem from code outside that filesystem. iget_locked() should be used instead. A new function is added in an earlier patch (iget_failed) that is to be called to mark an inode as bad, unlock it and release it should the get routine fail. Mark iget() and read_inode() as being obsolete and remove references to them from the documentation. Typically a filesystem will be modified such that the read_inode function becomes an internal iget function, for example the following: void thingyfs_read_inode(struct inode *inode) { ... } would be changed into something like: struct inode *thingyfs_iget(struct super_block *sp, unsigned long ino) { struct inode *inode; int ret; inode = iget_locked(sb, ino); if (!inode) return ERR_PTR(-ENOMEM); if (!(inode->i_state & I_NEW)) return inode; ... unlock_new_inode(inode); return inode; error: iget_failed(inode); return ERR_PTR(ret); } and then thingyfs_iget() would be called rather than iget(), for example: ret = -EINVAL; inode = iget(sb, ino); if (!inode || is_bad_inode(inode)) goto error; becomes: inode = thingyfs_iget(sb, ino); if (IS_ERR(inode)) { ret = PTR_ERR(inode); goto error; } Note that is_bad_inode() does not need to be called. The error returned by thingyfs_iget() should render it unnecessary. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext4: Add inode version support in ext4Jean Noel Cordenner2008-01-281-17/+0
| | | | | | | | | | | | | | This patch adds 64-bit inode version support to ext4. The lower 32 bits are stored in the osd1.linux1.l_i_version field while the high 32 bits are stored in the i_version_hi field newly created in the ext4_inode. This field is incremented in case the ext4_inode is large enough. A i_version mount option has been added to enable the feature. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Jean Noel Cordenner <jean-noel.cordenner@bull.net>
* vfs: Add 64 bit i_version supportJean Noel Cordenner2008-01-281-0/+22
| | | | | | | | | | | | | | | The i_version field of the inode is changed to be a 64-bit counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the "ctime" time-stamp). The aim is to fulfill a NFSv4 requirement for rfc3530. This first part concerns the vfs, it converts the 32-bit i_version in the generic inode to a 64-bit, a flag is added in the super block in order to check if the feature is enabled and the i_version is incremented in the vfs. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Jean Noel Cordenner <jean-noel.cordenner@bull.net> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
* introduce I_SYNCJoern Engel2007-10-171-12/+12
| | | | | | | | | | | | | | | | | | | | I_LOCK was used for several unrelated purposes, which caused deadlock situations in certain filesystems as a side effect. One of the purposes now uses the new I_SYNC bit. Also document the various bits and change their order from historical to logical. [bunk@stusta.de: make fs/inode.c:wake_up_inode() static] Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: David Chinner <dgc@sgi.com> Cc: Anton Altaparmakov <aia21@cam.ac.uk> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: remove the unused mempages parameterDenis Cheng2007-10-171-1/+1
| | | | | | | | | | | | | | | | Since the mempages parameter is actually not used, they should be removed. Now there is only files_init use the mempages parameter, files_init(mempages); but I don't think the adaptation to mempages in files_init is really useful; and if files_init also changed to the prototype void (*func)(void), the wrapper vfs_caches_init would also not need the mempages parameter. Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud