op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message	Author	Age	Files	Lines
*	locks: break delegations on rename	J. Bruce Fields	2013-11-09	4	-7/+47
	Cc: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	locks: helper functions for delegation breaking	J. Bruce Fields	2013-11-09	1	-10/+3
	We'll need the same logic for rename and link. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	locks: break delegations on unlink	J. Bruce Fields	2013-11-09	4	-7/+43
	We need to break delegations on any operation that changes the set of links pointing to an inode. Start with unlink.
	Such operations also hold the i_mutex on a parent directory. Breaking a delegation may require waiting for a timeout (by default 90 seconds) in the case of an unresponsive NFS client. To avoid blocking all directory operations, we therefore drop locks before waiting for the delegation. The logic then looks like:
		acquire locks
		...
		test for delegation; if found:
			take reference on inode
			release locks
			wait for delegation break
			drop reference on inode
			retry
	It is possible this could never terminate. (Even if we take precautions to prevent another delegation being acquired on the same inode, we could get a different inode on each retry.) But this seems very unlikely.
	The initial test for a delegation happens after the lock on the target inode is acquired, but the directory inode may have been acquired further up the call stack. We therefore add a "struct inode **" argument to any intervening functions, which we use to pass the inode back up to the caller in the case it needs a delegation synchronously broken.
	Cc: David Howells <dhowells@redhat.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Dustin Kirkland <dustin.kirkland@gazzang.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
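The drop-locks-and-retry pattern described in this message can be modelled in userspace. The following is a minimal sketch only: every `toy_*` name is hypothetical, the "lock" is a plain flag, and nothing here is kernel API. It shows the inner operation handing the delegated inode back through a pointer-to-pointer argument so the outer loop can wait without holding the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct toy_inode {
	int refcount;
	bool has_delegation;
	int breaks_waited;	/* how many times we waited for a break */
};

struct toy_dir {
	bool locked;		/* models the parent directory's i_mutex */
	struct toy_inode *victim;
};

/* Runs with the directory locked. If it finds a delegation it does not
 * block; it passes the inode back to the caller instead. */
static int toy_unlink_locked(struct toy_dir *dir, struct toy_inode **delegated)
{
	assert(dir->locked);
	if (dir->victim->has_delegation) {
		*delegated = dir->victim;
		return -1;	/* caller must break the delegation and retry */
	}
	dir->victim = NULL;	/* the actual unlink */
	return 0;
}

/* Models waiting for the client to give the delegation back. */
static void toy_wait_for_break(struct toy_inode *inode)
{
	inode->has_delegation = false;
	inode->breaks_waited++;
}

static int toy_unlink(struct toy_dir *dir)
{
	for (;;) {
		struct toy_inode *delegated = NULL;
		int err;

		dir->locked = true;			/* acquire locks */
		err = toy_unlink_locked(dir, &delegated);
		if (delegated)
			delegated->refcount++;		/* take reference on inode */
		dir->locked = false;			/* release locks */

		if (!delegated)
			return err;
		toy_wait_for_break(delegated);		/* wait for delegation break */
		delegated->refcount--;			/* drop reference on inode */
		/* loop around: retry */
	}
}
```

In this model the wait always happens with `locked` cleared, which is the property the commit is after: other directory operations are not blocked for the duration of the timeout.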
*	namei: minor vfs_unlink cleanup	J. Bruce Fields	2013-11-09	1	-3/+4
	We'll be using dentry->d_inode in one more place. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	locks: implement delegations	J. Bruce Fields	2013-11-09	1	-10/+45
	Implement NFSv4 delegations at the vfs level using the new FL_DELEG lock type. Note nfsd is the only delegation user and is only using read delegations. Warn on any attempt to set a write delegation for now. We'll come back to that case later. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	locks: introduce new FL_DELEG lock flag	J. Bruce Fields	2013-11-09	2	-2/+2
	For now FL_DELEG is just a synonym for FL_LEASE. So this patch doesn't change behavior. Next we'll modify break_lease to treat FL_DELEG leases differently, to account for the fact that NFSv4 delegations should be broken in more situations than Windows oplocks. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	vfs: take i_mutex on renamed file	J. Bruce Fields	2013-11-09	1	-5/+5
	A read delegation is used by NFSv4 as a guarantee that a client can perform local read opens without informing the server. The open operation takes the last component of the pathname as an argument, thus is also a lookup operation, and giving the client the above guarantee means informing the client before we allow anything that would change the set of names pointing to the inode. Therefore, we need to break delegations on rename, link, and unlink. We also need to prevent new delegations from being acquired while one of these operations is in progress. We could add some completely new locking for that purpose, but it's simpler to use the i_mutex, since that's already taken by all the operations we care about. The single exception is rename. So, modify rename to take the i_mutex on the file that is being renamed. Also fix up lockdep and Documentation/filesystems/directory-locking to reflect the change. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	vfs: rename I_MUTEX_QUOTA now that it's not used for quotas	J. Bruce Fields	2013-11-09	1	-2/+2
	I_MUTEX_QUOTA is now just being used whenever we want to lock two non-directories. So the name isn't right. I_MUTEX_NONDIR2 isn't especially elegant but it's the best I could think of. Also fix some outdated documentation. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	vfs: don't use PARENT/CHILD lock classes for non-directories	J. Bruce Fields	2013-11-09	1	-5/+5
	Reserve I_MUTEX_PARENT and I_MUTEX_CHILD for locking of actual directories. (Also I_MUTEX_QUOTA isn't really a meaningful name for this locking class any more; fixed in a later patch.) Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	vfs: pull ext4's double-i_mutex-locking into common code	J. Bruce Fields	2013-11-09	4	-42/+40
	We want to do this elsewhere as well. Also catch any attempts to use it for directories (where this ordering would conflict with ancestor-first directory ordering in lock_rename). Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dave Chinner <david@fromorbit.com> Acked-by: Jeff Layton <jlayton@redhat.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	exportfs: fix quadratic behavior in filehandle lookup	J. Bruce Fields	2013-11-09	1	-53/+13
	Suppose we're given the filehandle for a directory whose closest ancestor in the dcache is its Nth ancestor. The main loop in reconnect_path searches for an IS_ROOT ancestor of target_dir, reconnects that ancestor to its parent, then recommences the search for an IS_ROOT ancestor from target_dir. This behavior is quadratic in N. And there's really no need to restart the search from target_dir each time: once a directory has been looked up, it won't become IS_ROOT again. So instead of starting from target_dir each time, we can continue where we left off. This simplifies the code and improves performance on very deep directory hierarchies. (I can't think of any reason anyone should need hierarchies a hundred or more deep, but the performance improvement may be valuable if only to limit damage in case of abuse.) Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
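The difference between restarting from target_dir and continuing from the last position can be illustrated by counting parent-pointer hops. This is a sketch under simplifying assumptions, not the reconnect_path code: the directory chain is reduced to integer indices (0 is target_dir, `n` is the Nth ancestor), each reconnect moves the IS_ROOT ancestor up by exactly one level, and both function names are hypothetical.

```c
#include <assert.h>

/* Old behaviour: after every reconnect, search for the IS_ROOT ancestor
 * again starting from target_dir (index 0). */
static unsigned long hops_with_restart(int n)
{
	unsigned long hops = 0;
	int root = 0;	/* index of the current IS_ROOT ancestor */

	assert(n >= 0);
	while (root < n) {
		int pos = 0;		/* restart from target_dir */

		while (pos < root) {	/* walk up to the IS_ROOT ancestor */
			pos++;
			hops++;
		}
		root++;			/* reconnect it; its parent is the new IS_ROOT */
	}
	return hops;
}

/* New behaviour: keep the position across iterations, since a directory
 * that has been looked up will not become IS_ROOT again. */
static unsigned long hops_with_continue(int n)
{
	unsigned long hops = 0;
	int root = 0;
	int pos = 0;	/* preserved between reconnects */

	assert(n >= 0);
	while (root < n) {
		while (pos < root) {
			pos++;
			hops++;
		}
		root++;
	}
	return hops;
}
```

In this model the restart variant performs 0 + 1 + ... + (n - 1) = n(n - 1)/2 hops, while the continuing variant performs n - 1 for n >= 1.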
*	exportfs: better variable name	J. Bruce Fields	2013-11-09	1	-6/+6
	Replace another unhelpful acronym. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	exportfs: move most of reconnect_path to helper function	J. Bruce Fields	2013-11-09	1	-78/+86
	Also replace 3 easily-confused three-letter acronyms by more helpful variable names. Just cleanup, no change in functionality, with one exception: the dentry_connected() check in the "out_reconnected" case will now only check the ancestors of the current dentry instead of checking all the way from target_dir. Since we've already verified connectivity up to this dentry, that should be sufficient. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	exportfs: eliminate unused "noprogress" counter	J. Bruce Fields	2013-11-09	1	-13/+2
	Note this counter is now being set to 0 on every pass through the loop, so it no longer serves any useful purpose. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	exportfs: stop retrying once we race with rename/remove	J. Bruce Fields	2013-11-09	1	-5/+40
	There are two places here where we could race with a rename or remove:
		- We could find the parent, but then be removed or renamed away from that parent directory before finding our name in that directory.
		- We could find the parent, and find our name in that parent, but then be renamed or removed before we look ourselves up by that name in that parent.
	In both cases the concurrent rename or remove will take care of reconnecting the directory that we're currently examining. Our target directory should then also be connected. Check this and clear DISCONNECTED in these cases instead of looping around again.
	Note: we do need to check that this actually happened if we want to be robust in the face of corrupted filesystems: a corrupted filesystem could just return a completely wrong parent, and we want to fail with an error in that case before starting to clear DISCONNECTED on non-DISCONNECTED filesystems.
	Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	exportfs: clear DISCONNECTED on all parents sooner	J. Bruce Fields	2013-11-09	1	-4/+21
	Once we've found any connected parent, we know all our parents are connected--that's true even if there's a concurrent rename. May as well clear them all at once and be done with it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	exportfs: more detailed comment for path_reconnect	J. Bruce Fields	2013-11-09	1	-1/+13
	Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	exportfs: BUG_ON in crazy corner case	Christoph Hellwig	2013-11-09	1	-6/+2
	This would indicate a nasty bug in the dcache and has never triggered in the past 10 years as far as I know. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	dcache: fix outdated DCACHE_NEED_LOOKUP comment	J. Bruce Fields	2013-11-09	1	-2/+2
	The DCACHE_NEED_LOOKUP case referred to here was removed with 39e3c9553f34381a1b664c27b0c696a266a5735e "vfs: remove DCACHE_NEED_LOOKUP". There are only four real_lookup() callers and all of them pass in an unhashed dentry just returned from d_alloc. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	dcache: don't clear DCACHE_DISCONNECTED too early	J. Bruce Fields	2013-11-09	1	-1/+0
	DCACHE_DISCONNECTED should not be cleared until we're sure the dentry is connected all the way up to the root of the filesystem. It shouldn't be cleared as soon as the dentry is connected to a parent. That will cause bugs at least on exportable filesystems. Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	dcache: Don't set DISCONNECTED on "pseudo filesystem" dentries	J. Bruce Fields	2013-11-09	1	-4/+9
	I can't for the life of me see any reason why anyone should care whether a dentry that is never hooked into the dentry cache would need DCACHE_DISCONNECTED set. This originates from 4b936885ab04dc6e0bb0ef35e0e23c1a7364d9e5 "fs: improve scalability of pseudo filesystems", which probably just made the false assumption that DCACHE_DISCONNECTED was meant to be set on anything not connected to a parent somehow. So this is just confusing. Ideally the only uses of DCACHE_DISCONNECTED would be in the filehandle-lookup code, which needs it to ensure dentries are connected into the dentry tree before use. I left d_alloc_pseudo there even though it's now equivalent to __d_alloc(), just on the theory the name is better documentation of its intended use outside dcache.c. Cc: Nick Piggin <npiggin@kernel.dk> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	dcache: use IS_ROOT to decide where dentry is hashed	J. Bruce Fields	2013-11-09	1	-1/+6
	Every hashed dentry is either hashed in the dentry_hashtable, or a superblock's s_anon list.
	__d_drop() assumes it can determine which is the case by checking DCACHE_DISCONNECTED; this is not true.
	It is true that when DCACHE_DISCONNECTED is cleared, the dentry is not only hashed on dentry_hashtable, but is fully connected to its parents back to the root.
	But the converse is not true: fs/exportfs/expfs.c:reconnect_path() attempts to connect a directory (found by filehandle lookup) back to root by ascending to parents and performing lookups one at a time. It does not clear DCACHE_DISCONNECTED until it's done, and that is not at all an atomic process. In particular, it is possible for DCACHE_DISCONNECTED to be set on a dentry which is hashed on the dentry_hashtable.
	Instead, use IS_ROOT() to check which hash chain a dentry is on. This does work:
	Dentries are hashed only by:
		- d_obtain_alias, which adds an IS_ROOT() dentry to sb_anon.
		- __d_rehash, called by _d_rehash: hashes to the dentry's parent, and all callers of _d_rehash appear to have d_parent set to a "real" parent.
		- __d_rehash, called by __d_move: rehashes the moved dentry to hash chain determined by target, and assigns target's d_parent to its d_parent, before dropping the dentry's d_lock.
	Therefore I believe it's safe for a holder of a dentry's d_lock to assume that it is hashed on sb_anon if and only if IS_ROOT(dentry) is true.
	I believe the incorrect assumption about DCACHE_DISCONNECTED was originally introduced by ceb5bdc2d246 "fs: dcache per-bucket dcache hash locking".
	Also add a comment while we're here.
	Cc: Nick Piggin <npiggin@kernel.dk> Acked-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
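The rule argued for in this message (pick the hash list by IS_ROOT, not by DCACHE_DISCONNECTED) can be sketched with a toy dentry. All `toy_*` and `TOY_*` names and the flag value are hypothetical; the only thing taken from the kernel is the idea that a root dentry is its own parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DISCONNECTED 0x1u	/* hypothetical flag value */

struct toy_dentry {
	struct toy_dentry *d_parent;	/* points to itself for a root */
	unsigned int d_flags;
};

enum toy_hash_list { TOY_S_ANON, TOY_DENTRY_HASHTABLE };

/* A dentry is a root when it is its own parent. */
static inline bool toy_is_root(const struct toy_dentry *d)
{
	return d->d_parent == d;
}

/* Decide which list a hashed dentry lives on. The DISCONNECTED flag is
 * deliberately ignored: it may still be set on a dentry that already
 * has a real parent and so sits in the main hash table. */
static inline enum toy_hash_list toy_hashed_on(const struct toy_dentry *d)
{
	assert(d->d_parent != NULL);
	return toy_is_root(d) ? TOY_S_ANON : TOY_DENTRY_HASHTABLE;
}
```

A dentry halfway through reconnection has a real parent but still carries the flag, and this sketch places it on the main hash table.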
*	ocfs2: get rid of impossible checks	Al Viro	2013-11-09	1	-10/+0
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	qnx4: i_sb is never NULL	Al Viro	2013-11-09	1	-4/+0
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	exportfs: fix 32-bit nfsd handling of 64-bit inode numbers	J. Bruce Fields	2013-11-09	1	-2/+16
	Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a 32-bit NFS server exporting a very large XFS filesystem, when the server's cache is cold (so the inodes in question are not in cache). Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: Trevor Cordes <trevor@tecnopolis.ca> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	vfs: split out vfs_getattr_nosec	J. Bruce Fields	2013-11-09	1	-6/+25
	The filehandle lookup code wants this version of getattr. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	iget/iget5: don't bother with ->i_lock until we find a match	Al Viro	2013-11-09	2	-15/+7
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	VFS: Put a small type field into struct dentry::d_flags	David Howells	2013-11-09	2	-62/+75
	Put a type field into struct dentry::d_flags to indicate if the dentry is one of the following types that relate particularly to pathwalk:
		Miss (negative dentry)
		Directory
		"Automount" directory (defective - no i_op->lookup())
		Symlink
		Other (regular, socket, fifo, device)
	The type field is set to one of the first five types on a dentry by calls to __d_instantiate() and d_obtain_alias() from information in the inode (if one is given).
	The type is cleared by dentry_unlink_inode() when it reconstitutes an existing dentry as a negative dentry.
	Accessors provided are:
		d_set_type(dentry, type)
		d_is_directory(dentry)
		d_is_autodir(dentry)
		d_is_symlink(dentry)
		d_is_file(dentry)
		d_is_negative(dentry)
		d_is_positive(dentry)
	A bunch of checks in pathname resolution switched to those.
	Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
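A small type field packed into a flags word, with accessors of the kind listed in this message, might look like the sketch below. The bit position, the numeric type values and all `toy_*`/`TOY_*` names are assumptions made for illustration; the real definitions live in the kernel's dcache header and may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout: a 3-bit type field in bits 20..22. */
#define TOY_ENTRY_TYPE		(7u << 20)	/* mask covering the field */
#define TOY_MISS_TYPE		(0u << 20)	/* negative dentry */
#define TOY_DIRECTORY_TYPE	(1u << 20)
#define TOY_AUTODIR_TYPE	(2u << 20)
#define TOY_SYMLINK_TYPE	(3u << 20)
#define TOY_FILE_TYPE		(4u << 20)	/* regular, socket, fifo, device */

struct toy_dentry {
	unsigned int d_flags;
};

/* Replace the type field and leave every other flag bit alone. */
static inline void toy_set_type(struct toy_dentry *d, unsigned int type)
{
	assert((type & ~TOY_ENTRY_TYPE) == 0);
	d->d_flags = (d->d_flags & ~TOY_ENTRY_TYPE) | type;
}

static inline unsigned int toy_type(const struct toy_dentry *d)
{
	return d->d_flags & TOY_ENTRY_TYPE;
}

static inline bool toy_is_directory(const struct toy_dentry *d)
{
	return toy_type(d) == TOY_DIRECTORY_TYPE;
}

static inline bool toy_is_symlink(const struct toy_dentry *d)
{
	return toy_type(d) == TOY_SYMLINK_TYPE;
}

static inline bool toy_is_negative(const struct toy_dentry *d)
{
	return toy_type(d) == TOY_MISS_TYPE;
}

static inline bool toy_is_positive(const struct toy_dentry *d)
{
	return !toy_is_negative(d);
}
```

Because the "miss" type is encoded as zero in this sketch, clearing the field is the same as marking the dentry negative, which mirrors the clearing step the message attributes to dentry_unlink_inode().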
*	elf{,_fdpic} coredump: get rid of pointless if (siginfo->si_signo)	Al Viro	2013-11-09	2	-37/+31
	we can't get to do_coredump() if that condition isn't satisfied... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	constify do_coredump() argument	Al Viro	2013-11-09	2	-3/+3
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	constify copy_siginfo_to_user{,32}()	Al Viro	2013-11-09	1	-1/+1
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	... and kill anon_inode_getfile_private()	Al Viro	2013-11-09	1	-66/+0
	it's a seriously misguided API, now fortunately without users. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	rework aio migrate pages to use aio fs	Benjamin LaHaise	2013-11-09	1	-6/+57
	Don't abuse anon_inodes.c to host private files needed by aio; we can bloody well declare a mini-fs of our own instead of patching up what anon_inodes can create for us. Tested-by: Benjamin LaHaise <bcrl@kvack.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	take anon inode allocation to libfs.c	Al Viro	2013-11-09	2	-48/+45
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	new helper: dump_align()	Al Viro	2013-11-09	3	-16/+13
	dump_skip to given alignment... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	dump_skip(): dump_seek() replacement taking coredump_params	Al Viro	2013-11-09	4	-40/+20
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	make dump_emit() use vfs_write() instead of banging at ->f_op->write directly	Al Viro	2013-11-09	1	-5/+12
	... and deal with short writes properly - the output might be to pipe, after all; as it is, e.g. no-MMU case of elf_fdpic coredump can write a whole lot more than a page worth of data at one call. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	binfmt_elf: count notes towards coredump limit	Al Viro	2013-11-09	1	-3/+0
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	aout: switch to dump_emit	Al Viro	2013-11-09	1	-4/+3
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	switch elf_coredump_extra_notes_write() to dump_emit()	Al Viro	2013-11-09	1	-4/+3
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	convert the rest of binfmt_elf_fdpic to dump_emit()	Al Viro	2013-11-09	1	-79/+31
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	binfmt_elf: convert writing actual dump pages to dump_emit()	Al Viro	2013-11-09	1	-11/+3
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	switch elf_core_write_extra_data() to dump_emit()	Al Viro	2013-11-09	2	-2/+6
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	switch elf_core_write_extra_phdrs() to dump_emit()	Al Viro	2013-11-09	2	-3/+5
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	new helper: dump_emit()	Al Viro	2013-11-09	2	-37/+37
	dump_write() analog, takes coredump_params instead of file, keeps track of the amount written in cprm->written and checks for cprm->limit. Start using it in binfmt_elf.c... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
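The bookkeeping described for dump_emit() (a running `written` count checked against `limit`), together with the short-write handling mentioned in the later dump_emit() commit in this log, can be sketched in userspace. This is not the kernel function: the `toy_cprm` structure, its in-memory buffer, the `chunk` field used to force short writes, and the choice to check the limit before writing are all assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical dump state. Only `written` and `limit` are named in the
 * commit message; the buffer and chunk size are invented here. */
struct toy_cprm {
	size_t written;
	size_t limit;
	char buf[64];	/* in-memory stand-in for the output file */
	size_t chunk;	/* most bytes accepted per write call */
};

/* Accepts at most `chunk` bytes per call, the way a pipe might. */
static size_t toy_write(struct toy_cprm *c, const char *p, size_t n)
{
	size_t room = sizeof(c->buf) - c->written;

	if (n > c->chunk)
		n = c->chunk;
	if (n > room)
		n = room;
	memcpy(c->buf + c->written, p, n);
	return n;
}

/* Returns true if all `nr` bytes were emitted, false if the limit would
 * be exceeded or the sink stopped accepting data. */
static bool toy_dump_emit(struct toy_cprm *c, const void *addr, size_t nr)
{
	const char *p = addr;

	assert(c->written <= c->limit);
	if (nr > c->limit - c->written)
		return false;
	while (nr > 0) {
		size_t n = toy_write(c, p, nr);

		if (n == 0)
			return false;
		c->written += n;	/* track progress across short writes */
		p += n;
		nr -= n;
	}
	return true;
}
```

With `chunk` set to 3, a 5-byte emit takes two write calls, yet the caller sees a single success and `written` advances by exactly 5.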
*	coda_revalidate_inode(): switch to passing inode...	Al Viro	2013-11-09	3	-4/+3
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	fold __d_shrink() into its only remaining caller	Al Viro	2013-11-09	1	-22/+10
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	get rid of s_files and files_lock	Al Viro	2013-11-09	4	-143/+2
	The only thing we need it for is alt-sysrq-r (emergency remount r/o) and these days we can do just as well without going through the list of files. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	get rid of {lock,unlock}_rcu_walk()	Al Viro	2013-11-09	1	-24/+14
	those have become aliases for rcu_read_{lock,unlock}() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	RCU'd vfsmounts	Al Viro	2013-11-09	4	-82/+133
	* RCU-delayed freeing of vfsmounts
	* vfsmount_lock replaced with a seqlock (mount_lock)
	* sequence number from mount_lock is stored in nameidata->m_seq and used when we exit RCU mode
	* new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its caller knows that vfsmount will have no surviving references.
	* synchronize_rcu() done between unlocking namespace_sem in namespace_unlock() and doing pending mntput().
	* new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence number against seq, then grabs reference to mnt. Then it rechecks mount_lock again to close the race and either returns success or drops the reference it has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can simply decrement the refcount and sod off - aforementioned synchronize_rcu() makes sure that final mntput() won't come until we leave RCU mode. We need that, since we don't want to end up with some lazy pathwalk racing with umount() and stealing the final mntput() from it - caller of umount() may expect it to return only once the fs is shut down and we don't want to break that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do full-blown mntput() in case of mount_lock sequence number mismatch happening just as we'd grabbed the reference, but in those cases we won't be stealing the final mntput() from anything that would care.
	* mntput_no_expire() doesn't lock anything on the fast path now. Incidentally, SMP and UP cases are handled the same way - no ifdefs there.
	* normal pathname resolution does not do any writes to mount_lock. It does, of course, bump the refcounts of vfsmount and dentry in the very end, but that's it.
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
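The check, grab, recheck sequence described for legitimize_mnt() can be walked through with a single-threaded model. This sketch is an assumption-laden illustration, not the kernel helper: it uses a plain global counter in place of the mount_lock seqlock, has no atomics or memory barriers, and every `toy_*` name (including the race hook used to simulate a concurrent umount) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mount {
	int refcount;
	bool sync_umount;	/* models MNT_SYNC_UMOUNT */
};

static unsigned int toy_mount_seq;	/* models the mount_lock sequence */
static int toy_full_mntputs;		/* counts full-blown puts */
static void (*toy_race_hook)(void);	/* runs between grab and recheck */

/* Simulates a writer taking and releasing the seqlock. */
static void toy_bump_seq(void)
{
	toy_mount_seq += 2;
}

/* Models the full put, which in the real code may be the final one. */
static void toy_mntput(struct toy_mount *m)
{
	assert(m->refcount > 0);
	m->refcount--;
	toy_full_mntputs++;
}

static bool toy_legitimize_mnt(struct toy_mount *m, unsigned int seq)
{
	if (toy_mount_seq != seq)	/* check the sequence number */
		return false;
	m->refcount++;			/* grab a reference */
	if (toy_race_hook)
		toy_race_hook();	/* a umount may slip in here */
	if (toy_mount_seq == seq)	/* recheck to close the race */
		return true;
	if (m->sync_umount) {
		m->refcount--;		/* plain decrement is enough */
		return false;
	}
	toy_mntput(m);			/* otherwise do the full put */
	return false;
}
```

The two failure paths after the recheck are what the commit message calls the subtle point: with the sync-umount flag set the model only decrements the count, and without it the reference is released through the full put.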