summaryrefslogtreecommitdiffstats
path: root/fs/pipe.c
Commit message (Collapse)AuthorAgeFilesLines
* vfs: fix subtle use-after-free of pipe_inode_infoLinus Torvalds2013-12-021-20/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pipe code was trying (and failing) to be very careful about freeing the pipe info only after the last access, with a pattern like: spin_lock(&inode->i_lock); if (!--pipe->files) { inode->i_pipe = NULL; kill = 1; } spin_unlock(&inode->i_lock); __pipe_unlock(pipe); if (kill) free_pipe_info(pipe); where the final freeing is done last. HOWEVER. The above is actually broken, because while the freeing is done at the end, if we have two racing processes releasing the pipe inode info, the one that *doesn't* free it will decrement the ->files count, and unlock the inode i_lock, but then still use the "pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)". This is *very* hard to trigger in practice, since the race window is very small, and adding debug options seems to just hide it by slowing things down. Simon originally reported this way back in July as an Oops in kmem_cache_allocate due to a single bit corruption (due to the final "spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different allocation that had re-used the free'd pipe-info), it's taken this long to figure out. Since the 'pipe->files' accesses aren't even protected by the pipe lock (we very much use the inode lock for that), the simple solution is to just drop the pipe lock early. And since there were two users of this pattern, create a helper function for it. Introduced commit ba5bb147330a ("pipe: take allocation and freeing of pipe_inode_info out of ->i_mutex"). Reported-by: Simon Kirby <sim@hostway.ca> Reported-by: Ian Applegate <ia@cloudflare.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org # v3.10+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* aio: don't include aio.h in sched.hKent Overstreet2013-05-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* get rid of the last free_pipe_info() callersAl Viro2013-04-091-12/+6
| | | | | | and rename __free_pipe_info() to free_pipe_info() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* get rid of alloc_pipe_info() argumentAl Viro2013-04-091-3/+3
| | | | | | not used anymore Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* get rid of pipe->inodeAl Viro2013-04-091-3/+2
| | | | | | | | it's used only as a flag to distinguish normal pipes/FIFOs from the internal per-task one used by file-to-file splice. And pipe->files would work just as well for that purpose... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* introduce variants of pipe_lock/pipe_unlock for real pipes/FIFOsAl Viro2013-04-091-15/+25
| | | | | | | fs/pipe.c file_operations methods *know* that pipe is not an internal one; no need to check pipe->inode for those callers. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pipe: set file->private_data to ->i_pipeAl Viro2013-04-091-8/+9
| | | | | | simplify get_pipe_info(), while we are at it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pipe: don't use ->i_mutexAl Viro2013-04-091-2/+3
| | | | | | | now it can be done - put mutex into pipe_inode_info, use it instead of ->i_mutex Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pipe: take allocation and freeing of pipe_inode_info out of ->i_mutexAl Viro2013-04-091-21/+51
| | | | | | | | | | | | | | | | | | * new field - pipe->files; number of struct file over that pipe (all sharing the same inode, of course); protected by inode->i_lock. * pipe_release() decrements pipe->files, clears inode->i_pipe when if the counter has reached 0 (all under ->i_lock) and, in that case, frees pipe after having done pipe_unlock() * fifo_open() starts with grabbing ->i_lock, and either bumps pipe->files if ->i_pipe was non-NULL or allocates a new pipe (dropping and regaining ->i_lock) and rechecks ->i_pipe; if it's still NULL, inserts new pipe there, otherwise bumps ->i_pipe->files and frees the one we'd allocated. At that point we know that ->i_pipe is non-NULL and won't go away, so we can do pipe_lock() on it and proceed as we used to. If we end up failing, decrement pipe->files and if it reaches 0 clear ->i_pipe and free the sucker after pipe_unlock(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pipe: preparation to new locking rulesAl Viro2013-04-091-23/+15
| | | | | | | | * use the fact that file_inode(file)->i_pipe doesn't change while the file is opened - no locks needed to access that. * switch to pipe_lock/pipe_unlock where it's easy to do Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pipe: switch wait_for_partner() and wake_up_partner() to pipe_inode_infoAl Viro2013-04-091-9/+9
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pipe: fold file_operations instances in oneAl Viro2013-04-091-189/+32
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fold fifo.c into pipe.cAl Viro2013-04-091-0/+138
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: fix pipe counter breakageAl Viro2013-03-121-0/+3
| | | | | | | | | | | | | | | | | | If you open a pipe for neither read nor write, the pipe code will not add any usage counters to the pipe, causing the 'struct pipe_inode_info" to be potentially released early. That doesn't normally matter, since you cannot actually use the pipe, but the pipe release code - particularly fasync handling - still expects the actual pipe infrastructure to all be there. And rather than adding NULL pointer checks, let's just disallow this case, the same way we already do for the named pipe ("fifo") case. This is ancient going back to pre-2.4 days, and until trinity, nobody naver noticed. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: Preserve error code in get_empty_filp(), part 2Anatol Pomozov2013-02-221-2/+2
| | | | | | | | | | | | | | | | | | | | | Allocating a file structure in function get_empty_filp() might fail because of several reasons: - not enough memory for file structures - operation is not allowed - user is over its limit Currently the function returns NULL in all cases and we loose the exact reason of the error. All callers of get_empty_filp() assume that the function can fail with ENFILE only. Return error through pointer. Change all callers to preserve this error code. [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit (things remaining here deal with alloc_file()), removed pipe(2) behaviour change] Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* new helper: file_inode(file)Al Viro2013-02-221-8/+8
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pipe(2) - race-free error recoveryAl Viro2012-09-261-9/+22
| | | | | | | don't mess with sys_close() if copy_to_user() fails; just postpone fd_install() until we know it hasn't. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2012-08-011-49/+26
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second vfs pile from Al Viro: "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the deadlock reproduced by xfstests 068), symlink and hardlink restriction patches, plus assorted cleanups and fixes. Note that another fsfreeze deadlock (emergency thaw one) is *not* dealt with - the series by Fernando conflicts a lot with Jan's, breaks userland ABI (FIFREEZE semantics gets changed) and trades the deadlock for massive vfsmount leak; this is going to be handled next cycle. There probably will be another pull request, but that stuff won't be in it." Fix up trivial conflicts due to unrelated changes next to each other in drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) delousing target_core_file a bit Documentation: Correct s_umount state for freeze_fs/unfreeze_fs fs: Remove old freezing mechanism ext2: Implement freezing btrfs: Convert to new freezing mechanism nilfs2: Convert to new freezing mechanism ntfs: Convert to new freezing mechanism fuse: Convert to new freezing mechanism gfs2: Convert to new freezing mechanism ocfs2: Convert to new freezing mechanism xfs: Convert to new freezing code ext4: Convert to new freezing mechanism fs: Protect write paths by sb_start_write - sb_end_write fs: Skip atime update on frozen filesystem fs: Add freezing handling to mnt_want_write() / mnt_drop_write() fs: Improve filesystem freezing handling switch the protection of percpu_counter list to spinlock nfsd: Push mnt_want_write() outside of i_mutex btrfs: Push mnt_want_write() outside of i_mutex fat: Push mnt_want_write() outside of i_mutex ...
| * consolidate pipe file creationAl Viro2012-07-291-49/+26
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | pipe: remove KM_USER0 from commentsCong Wang2012-07-241-1/+1
|/ | | | Signed-off-by: Cong Wang <amwang@redhat.com>
* fs: introduce inode operation ->update_timeJosef Bacik2012-06-011-2/+5
| | | | | | | | | | | | | | | | | | | | Btrfs has to make sure we have space to allocate new blocks in order to modify the inode, so updating time can fail. We've gotten around this by having our own file_update_time but this is kind of a pain, and Christoph has indicated he would like to make xfs do something different with atime updates. So introduce ->update_time, where we will deal with i_version an a/m/c time updates and indicate which changes need to be made. The normal version just does what it has always done, updates the time and marks the inode dirty, and then filesystems can choose to do something different. I've gone through all of the users of file_update_time and made them check for errors with the exception of the fault code since it's complicated and I wasn't quite sure what to do there, also Jan is going to be pushing the file time updates into page_mkwrite for those who have it so that should satisfy btrfs and make it not a big deal to check the file_update_time() return code in the generic fault path. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
* pipe: return -ENOIOCTLCMD instead of -EINVAL on unknown ioctl commandWill Deacon2012-05-301-1/+1
| | | | | | | | | | | | | | | | As described in commit 07d106d0a ("vfs: fix up ENOIOCTLCMD error handling"), drivers should return -ENOIOCTLCMD if they receive an ioctl command which they don't understand. Doing so will result in -ENOTTY being returned to userspace, which matches the behaviour of the compat layer if it fails to translate an ioctl command. This patch fixes the pipe ioctl to return -ENOIOCTLCMD instead of -EINVAL when passed an unknown ioctl command. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pipes: add a "packetized pipe" mode for writingLinus Torvalds2012-04-291-2/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The actual internal pipe implementation is already really about individual packets (called "pipe buffers"), and this simply exposes that as a special packetized mode. When we are in the packetized mode (marked by O_DIRECT as suggested by Alan Cox), a write() on a pipe will not merge the new data with previous writes, so each write will get a pipe buffer of its own. The pipe buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn will tell the reader side to break the read at that boundary (and throw away any partial packet contents that do not fit in the read buffer). End result: as long as you do writes less than PIPE_BUF in size (so that the pipe doesn't have to split them up), you can now treat the pipe as a packet interface, where each read() system call will read one packet at a time. You can just use a sufficiently big read buffer (PIPE_BUF is sufficient, since bigger than that doesn't guarantee atomicity anyway), and the return value of the read() will naturally give you the size of the packet. NOTE! We do not support zero-sized packets, and zero-sized reads and writes to a pipe continue to be no-ops. Also note that big packets will currently be split at write time, but that the size at which that happens is not really specified (except that it's bigger than PIPE_BUF). Currently that limit is the system page size, but we might want to explicitly support bigger packets some day. The main user for this is going to be the autofs packet interface, allowing us to stop having to care so deeply about exact packet sizes (which have had bugs with 32/64-bit compatibility modes). But user space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will fail with an EINVAL on kernels that do not support this interface. Tested-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Miller <davem@davemloft.net> Cc: Ian Kent <raven@themaw.net> Cc: Thomas Meyer <thomas@m3y3r.de> Cc: stable@kernel.org # needed for systemd/autofs interaction fix Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* magic.h: move some FS magic numbers into magic.hMuthu Kumar2012-03-231-0/+1
| | | | | | | | | | | - Move open-coded filesystem magic numbers into magic.h - Rearrange magic.h so that the filesystem-related constants are grouped together. Signed-off-by: Muthukumar R <muthur@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: remove the second argument of k[un]map_atomic()Cong Wang2012-03-201-4/+4
| | | | | Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Cong Wang <amwang@redhat.com>
* pipe: fail cleanly when root tries F_SETPIPE_SZ with big sizeSasha Levin2012-01-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe with size bigger than kmalloc() can alloc it spits out an ugly warning: ------------[ cut here ]------------ WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0() Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4 Call Trace: warn_slowpath_common+0x75/0xb0 warn_slowpath_null+0x15/0x20 __alloc_pages_nodemask+0x5d3/0x7a0 __get_free_pages+0x12/0x50 __kmalloc+0x12b/0x150 pipe_set_size+0x75/0x120 pipe_fcntl+0xf8/0x140 do_fcntl+0x2d4/0x410 sys_fcntl+0x66/0xa0 system_call_fastpath+0x16/0x1b ---[ end trace 432f702e6db7b5ee ]--- Instead, make kcalloc() handle the overflow case and fail quietly. [akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness] Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: pipe.c is really non-modularAl Viro2012-01-031-7/+0
| | | | | | | ... so no exitcalls there. Not much would work if pipe(2) would stop working, after all... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs/pipe.c: add ->statfs callback for pipefsPavel Emelyanov2011-10-311-0/+1
| | | | | | | | | | | | | | | | | | | | | | | Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS. Wire pipfs up so that the statfs succeeds. This is required by checkpoint-restart in the userspace to make it possible to distinguish pipes from fifos. When we dump information about task's open files we use the /proc/pid/fd directoy's symlinks and the fact that opening any of them gives us exactly the same dentry->inode pair as the original process has. Now if a task we're dumping has opened pipe and fifo we need to detect this and act accordingly. Knowing that an fd with type S_ISFIFO resides on a pipefs is the most precise way. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: dont chain pipe/anon/socket on superblock s_inodes listEric Dumazet2011-07-261-1/+1
| | | | | | | | | | | | Workloads using pipes and sockets hit inode_sb_list_lock contention. superblock s_inodes list is needed for quota, dirty, pagecache and fsnotify management. pipe/anon/socket fs are clearly not candidates for these. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* VFS : mount lock scalability for internal mountsTim Chen2011-07-241-1/+1
| | | | | | | | | | | | | | | | | | | For a number of file systems that don't have a mount point (e.g. sockfs and pipefs), they are not marked as long term. Therefore in mntput_no_expire, all locks in vfs_mount lock are taken instead of just local cpu's lock to aggregate reference counts when we release reference to file objects. In fact, only local lock need to have been taken to update ref counts as these file systems are in no danger of going away until we are ready to unregister them. The attached patch marks file systems using kern_mount without mount point as long term. The contentions of vfs_mount lock is now eliminated. Before un-registering such file system, kern_unmount should be called to remove the long term flag and make the mount point ready to be freed. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Fix broken "pipe: use event aware wakeups" optimizationLinus Torvalds2011-01-201-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | Commit e462c448fdc8 ("pipe: use event aware wakeups") optimized the pipe event wakeup calls to avoid wakeups if the events do not match the requested set. However, the optimization was buggy, in that it didn't actually use the correct sets for the events: when we make room for more data to be written, the pipe poll() routine will return both the POLLOUT _and_ POLLWRNORM bits. Similarly for read. And most critically, when a pipe is released, that will potentially result in POLLHUP|POLLERR (depending on whether it was the last reader or writer), not just the regular POLLIN|POLLOUT. This bug showed itself as a hung gnome-screensaver-dialog process, stuck forever (or at least until it was poked by a signal or by being traced) in a poll() system call. Cc: Davide Libenzi <davidel@xmailserver.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sanitize vfsmount refcounting changesAl Viro2011-01-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of splitting refcount between (per-cpu) mnt_count and (SMP-only) mnt_longrefs, make all references contribute to mnt_count again and keep track of how many are longterm ones. Accounting rules for longterm count: * 1 for each fs_struct.root.mnt * 1 for each fs_struct.pwd.mnt * 1 for having non-NULL ->mnt_ns * decrement to 0 happens only under vfsmount lock exclusive That allows nice common case for mntput() - since we can't drop the final reference until after mnt_longterm has reached 0 due to the rules above, mntput() can grab vfsmount lock shared and check mnt_longterm. If it turns out to be non-zero (which is the common case), we know that this is not the final mntput() and can just blindly decrement percpu mnt_count. Otherwise we grab vfsmount lock exclusive and do usual decrement-and-check of percpu mnt_count. For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm(); namespace.c uses the latter in places where we don't already hold vfsmount lock exclusive and opencodes a few remaining spots where we need to manipulate mnt_longterm. Note that we mostly revert the code outside of fs/namespace.c back to what we used to have; in particular, normal code doesn't need to care about two kinds of references, etc. And we get to keep the optimization Nick's variant had bought us... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2011-01-131-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits) fs: add documentation on fallocate hole punching Gfs2: fail if we try to use hole punch Btrfs: fail if we try to use hole punch Ext4: fail if we try to use hole punch Ocfs2: handle hole punching via fallocate properly XFS: handle hole punching via fallocate properly fs: add hole punching to fallocate vfs: pass struct file to do_truncate on O_TRUNC opens (try #2) fix signedness mess in rw_verify_area() on 64bit architectures fs: fix kernel-doc for dcache::prepend_path fs: fix kernel-doc for dcache::d_validate sanitize ecryptfs ->mount() switch afs move internal-only parts of ncpfs headers to fs/ncpfs switch ncpfs switch 9p pass default dentry_operations to mount_pseudo() switch hostfs switch affs switch configfs ...
| * pass default dentry_operations to mount_pseudo()Al Viro2011-01-121-2/+2
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | pipe: use event aware wakeupsDavide Libenzi2011-01-131-5/+5
|/ | | | | | | | | | | | | Send the events the wakeup refers to, so that epoll, and even the new poll code in fs/select.c can avoid wakeups if the events do not match the requested set. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: scale mntget/mntputNick Piggin2011-01-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem that this patch aims to fix is vfsmount refcounting scalability. We need to take a reference on the vfsmount for every successful path lookup, which often go to the same mount point. The fundamental difficulty is that a "simple" reference count can never be made scalable, because any time a reference is dropped, we must check whether that was the last reference. To do that requires communication with all other CPUs that may have taken a reference count. We can make refcounts more scalable in a couple of ways, involving keeping distributed counters, and checking for the global-zero condition less frequently. - check the global sum once every interval (this will delay zero detection for some interval, so it's probably a showstopper for vfsmounts). - keep a local count and only taking the global sum when local reaches 0 (this is difficult for vfsmounts, because we can't hold preempt off for the life of a reference, so a counter would need to be per-thread or tied strongly to a particular CPU which requires more locking). - keep a local difference of increments and decrements, which allows us to sum the total difference and hence find the refcount when summing all CPUs. Then, keep a single integer "long" refcount for slow and long lasting references, and only take the global sum of local counters when the long refcount is 0. This last scheme is what I implemented here. Attached mounts and process root and working directory references are "long" references, and everything else is a short reference. This allows scalable vfsmount references during path walking over mounted subtrees and unattached (lazy umounted) mounts with processes still running in them. This results in one fewer atomic op in the fastpath: mntget is now just a per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock and non-atomic decrement in the common case. However code is otherwise bigger and heavier, so single threaded performance is basically a wash. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
* fs: improve scalability of pseudo filesystemsNick Piggin2011-01-071-1/+1
| | | | | | | | | Regardless of how much we possibly try to scale dcache, there is likely always going to be some fundamental contention when adding or removing children under the same parent. Pseudo filesystems do not seem need to have connected dentries because by definition they are disconnected. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
* fs: dcache reduce branches in lookup pathNick Piggin2011-01-071-1/+1
| | | | | | | | | | | | | | Reduce some branches and memory accesses in dcache lookup by adding dentry flags to indicate common d_ops are set, rather than having to check them. This saves a pointer memory access (dentry->d_op) in common path lookup situations, and saves another pointer load and branch in cases where we have d_op but not the particular operation. Patched with: git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i Signed-off-by: Nick Piggin <npiggin@kernel.dk>
* fs: avoid inode RCU freeing for pseudo fsNick Piggin2011-01-071-1/+5
| | | | | | | Pseudo filesystems that don't put inode on RCU list or reachable by rcu-walk dentries do not need to RCU free their inodes. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
* Un-inline get_pipe_info() helper functionLinus Torvalds2010-11-281-0/+12
| | | | | | | | This avoids some include-file hell, and the function isn't really important enough to be inlined anyway. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Export 'get_pipe_info()' to other usersLinus Torvalds2010-11-281-1/+1
| | | | | | | | | | | | | | | | | | | And in particular, use it in 'pipe_fcntl()'. The other pipe functions do not need to use the 'careful' version, since they are only ever called for things that are already known to be pipes. The normal read/write/ioctl functions are called through the file operations structures, so if a file isn't a pipe, they'd never get called. But pipe_fcntl() is special, and called directly from the generic fcntl code, and needs to use the same careful function that the splice code is using. Cc: Jens Axboe <jaxboe@fusionio.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* convert get_sb_pseudo() usersAl Viro2010-10-291-5/+4
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: do not assign default i_ino in new_inodeChristoph Hellwig2010-10-251-0/+2
| | | | | | | | | | | | | | | Instead of always assigning an increasing inode number in new_inode move the call to assign it into those callers that actually need it. For now callers that need it is estimated conservatively, that is the call is added to all filesystems that do not assign an i_ino by themselves. For a few more filesystems we can avoid assigning any inode number given that they aren't user visible, and for others it could be done lazily when an inode number is actually needed, but that's left for later patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pipe: fix failure to return error code on ->confirm()Nicolas Kaiser2010-10-211-1/+1
| | | | | | | | | The arguments were transposed, we want to assign the error code to 'ret', which is being returned. Signed-off-by: Nicolas Kaiser <nikai@nikai.net> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* pipe: fix check in "set size" fcntlMiklos Szeredi2010-06-101-3/+4
| | | | | | | | | | | As it stands this check compares the number of pages to the page size. This makes no sense and makes the fcntl fail in almost any sane case. Fix it by checking if nr_pages is not zero (it can become zero only if arg is too big and round_pipe_size() overflows). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* pipe: fix pipe buffer resizingMiklos Szeredi2010-06-101-3/+10
| | | | | | | | | | | | | pipe_set_size() needs to copy pipe bufs from the old circular buffer to the new. The current code gets this wrong in multiple ways, resulting in oops. Test program is available here: http://www.kernel.org/pub/linux/kernel/people/mszeredi/piperesize/ Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* pipe: change /proc/sys/fs/pipe-max-pages to byte sized interfaceJens Axboe2010-06-031-12/+42
| | | | | | | | | | | This changes the interface to be based on bytes instead. The API matches that of F_SETPIPE_SZ in that it rounds up the passed in size so that the resulting page array is a power-of-2 in size. The proc file is renamed to /proc/sys/fs/pipe-max-size to reflect this change. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* pipe: change the privilege required for growing a pipe beyond system maxJens Axboe2010-06-031-1/+1
| | | | | | | | Change it to CAP_SYS_RESOURCE, as that more accurately models what we want to control. Suggested-by: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* pipe: adjust minimum pipe size to 1 pageJens Axboe2010-06-031-7/+1
| | | | | | | | We don't need to pages to guarantee the POSIX requirement that upto a page size write must be atomic to an empty pipe. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* Merge branch 'master' into for-linusJens Axboe2010-06-011-5/+16
|\ | | | | | | | | | | | | Conflicts: fs/pipe.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
OpenPOWER on IntegriCloud