path: root/sys/kern/vfs_vnops.c
Commit message | Author | Age | Files | Lines
* Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() to their "write" versions. | attilio | 2013-02-20 | 1 | -2/+2
  Sponsored by: EMC / Isilon storage division
* Switch vm_object lock to be a rwlock. | attilio | 2013-02-20 | 1 | -0/+1
  * VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
  * VM_OBJECT_SLEEP() is introduced as a general purpose primitive to get a sleep operation using a VM_OBJECT_LOCK() as protection
  * The approach must bear with vm_pager.h namespace pollution, so many files require including rwlock.h directly
* vn_io_faults_cnt: | pluknet | 2013-02-15 | 1 | -2/+2
  - use u_long consistently
  - use SYSCTL_ULONG to match the type of variable
  Reviewed by: kib
  MFC after: 1 week
* Simplify code a bit. This is leftover after Giant removal from VFS. | pjd | 2013-01-31 | 1 | -3/+1
* Add flags argument to vfs_write_resume() and remove vfs_write_resume_flags(). | kib | 2013-01-11 | 1 | -9/+2
  Sponsored by: The FreeBSD Foundation
* The process_deferred_inactive() function locks the vnodes of the ufs mount, which means that it must not be called while the snaplock is owned. | kib | 2013-01-01 | 1 | -1/+2
  The vfs_write_resume(9) does call the function as the VFS_SUSP_CLEAN() method, which is too early and falls into the region still protected by snaplock. Add yet another flag for the vfs_write_resume_flags() to avoid calling suspension cleanup handler after the suspend is lifted, and use it in the ffs_snapshot() call to vfs_write_resume.
  Reported and tested by: pho
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks
* Make it possible to atomically resume writes on the mount and account the write start, by adding a variation of the vfs_write_resume(9) which accepts flags. | kib | 2012-12-28 | 1 | -27/+52
  Use the new function to prevent a deadlock between parallel suspension and snapshotting a UFS mount. The ffs_snapshot() code performed vfs_write_resume() followed by vn_start_write() while owning the snaplock. If the suspension intervened between resume and vn_start_write(), the deadlock occurred after the suspending thread tried to lock the snaplock, most typically during the write in the ffs_copyonwrite().
  Reported and tested by: Andreas Longwitz <longwitz@incore.de>
  Reviewed by: mckusick
  MFC after: 2 weeks
  X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC, in HEAD
* - Add NOCAPCHECK flag to namei that allows lookup to work even if the process is in capability mode. | pjd | 2012-11-27 | 1 | -0/+4
  - Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() that will be converted into NOCAPCHECK namei flag.
  This functionality will be used to enable core dumps for sandboxed processes.
  Reviewed by: rwatson
  Obtained from: WHEEL Systems
  MFC after: 2 weeks
* The r241025 fixed the case when a binary, executed from nullfs mount, was still possible to open for write from the lower filesystem. | kib | 2012-11-02 | 1 | -2/+2
  There is a symmetric situation where the binary could already have file descriptors opened for write, but it can be executed from the nullfs overlay.
  Handle the issue by passing one v_writecount reference to the lower vnode if nullfs vnode has non-zero v_writecount. Note that only one write reference can be donated, since nullfs only keeps one use reference on the lower vnode. Always use the lower vnode v_writecount for the checks.
  Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT to manipulate the v_writecount value, which manages a single bypass reference to the lower vnode. Calling the VOPs instead of directly accessing v_writecount provides the fix described in the previous paragraph.
  Tested by: pho
  MFC after: 3 weeks
* Remove the support for using non-mpsafe filesystem modules. | kib | 2012-10-22 | 1 | -57/+7
  In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems.
  The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes.
  Conducted and reviewed by: attilio
  Tested by: pho
* Fix the mis-handling of the VV_TEXT on the nullfs vnodes. | kib | 2012-09-28 | 1 | -1/+1
  If you have a binary on a filesystem which is also mounted over by nullfs, you could execute the binary from the lower filesystem, or from the nullfs mount. When executed from lower filesystem, the lower vnode gets VV_TEXT flag set, and the file cannot be modified while the binary is active. But, if executed as the nullfs alias, only the nullfs vnode gets VV_TEXT set, and you still can open the lower vnode for write.
  Add a set of VOPs for the VV_TEXT query, set and clear operations, which are correctly bypassed to lower vnode.
  Tested by: pho (previous version)
  MFC after: 2 weeks
* vn_write() always expects FOF_OFFSET flag, which is asserted at the beginning, so there is no need to check for it. | pjd | 2012-09-25 | 1 | -4/+3
  Sponsored by: FreeBSD Foundation
  MFC after: 2 weeks
* Reorder the management of advisory locks on open files so that the advisory lock is obtained before the write count is increased during open() and the lock is released after the write count is decreased during close(). | jhb | 2012-07-31 | 1 | -5/+55
  The first change closes a race where an open() that will block with O_SHLOCK or O_EXLOCK can increase the write count while it waits. If the process holding the current lock on the file then tries to call exec() on the file it has locked, it can fail with ETXTBUSY even though the advisory lock is preventing other threads from successfully completing a writable open().
  The second change closes a race where a read-only open() with O_SHLOCK or O_EXLOCK may return successfully while the write count is non-zero due to another descriptor that had the advisory lock and was blocking the open() still being in the process of closing. If the process that completed the open() then attempts to call exec() on the file it locked, it can fail with ETXTBUSY even though the other process that held a write lock has closed the file and released the lock.
  Reviewed by: kib
  MFC after: 1 month
* Extend the KPI to lock and unlock f_offset member of struct file. | kib | 2012-07-02 | 1 | -29/+73
  It now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries().
  Ensure that on 32bit architectures f_offset, which is a 64bit quantity, is always read and written under the mtxpool protection. This fixes an apparently easy to trigger race where parallel lseek()s, or lseek() and read/write, could destroy the file offset.
  The already broken ABI emulations, including iBCS and SysV, are not converted (yet).
  Tested by: pho
  No objections from: jhb
  MFC after: 3 weeks
* Fix locking for f_offset, vn_read() and vn_write() cases only, for now. | kib | 2012-06-21 | 1 | -53/+80
  It seems that the intended locking protocol for the struct file f_offset field was as follows: f_offset should always be changed under the vnode lock (except that fcntl(2) and lseek(2) did not follow the rules). Since read(2) uses shared vnode lock, FOFFSET_LOCKED block is additionally taken to serialize shared vnode lock owners.
  This was broken first by enabling shared lock on writes, then by fadvise changes, which moved f_offset assignment from under vnode lock, and last by vn_io_fault() doing chunked i/o. More, due to uio_offset not yet valid in vn_io_fault(), the range lock for reads was taken on the wrong region.
  Change the locking for f_offset to always use FOFFSET_LOCKED block, which is placed before rangelocks in the lock order.
  Extract foffset_lock() and foffset_unlock() functions which implement FOFFSET_LOCKED lock, and consistently lock f_offset with it in the vn_io_fault() both for reads and writes, even if MNTK_NO_IOPF flag is not set for the vnode mount. Indicate that f_offset is already valid for vn_read() and vn_write() calls from vn_io_fault() with FOF_OFFSET flag, and assert that all callers of vn_read() and vn_write() follow this protocol.
  Extract get_advice() function to calculate the POSIX_FADV_XXX value for the i/o region, and use it where appropriate.
  Reviewed by: jhb
  Tested by: pho
  MFC after: 2 weeks
* Further refine the implementation of POSIX_FADV_NOREUSE. | jhb | 2012-06-19 | 1 | -11/+86
  First, extend the changes in r230782 to better handle the common case of using NOREUSE with sequential reads. A NOREUSE file descriptor will now track the last implicit DONTNEED request it made as a result of a NOREUSE read. If a subsequent NOREUSE read is adjacent to the previous range, it will apply the DONTNEED request to the entire range of both the previous read and the current read. The effect is that each read of a file accessed sequentially will apply the DONTNEED request to the entire range that has been read. This allows NOREUSE to properly handle misaligned reads by flushing each buffer to cache once it has been completely read.
  Second, apply the same changes made to read(2) by r230782 and this change to writes. This provides much better performance in the sequential write case as it allows writes to still be clustered. It also provides much better performance for misaligned writes. It does mean that NOREUSE will be generally ineffective for non-sequential writes as the current implementation relies on a future NOREUSE write's implicit DONTNEED request to flush the dirty buffer from the current write.
  MFC after: 2 weeks
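The range tracking described in this commit message can be sketched in plain C. This is only an illustration under stated assumptions, not the code from the commit: the names `noreuse_state`, `byte_range` and `noreuse_extend` are hypothetical, ranges are inclusive byte offsets, and the handling of an i/o that ends right before the previous range is an assumption that goes beyond what the message states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-descriptor state: the last implicit DONTNEED range. */
struct noreuse_state {
	bool	valid;		/* false until the first NOREUSE i/o completes */
	int64_t	prev_start;	/* first byte of the previous range (inclusive) */
	int64_t	prev_end;	/* last byte of the previous range (inclusive) */
};

/* Inclusive byte range that should receive the DONTNEED request. */
struct byte_range {
	int64_t	start;
	int64_t	end;
};

/*
 * Given the range [start, end] of an i/o that just completed, return the
 * range to pass to the DONTNEED request and remember it for the next call.
 * An i/o adjacent to the previous range is merged with it; anything else
 * starts a fresh range.
 */
struct byte_range
noreuse_extend(struct noreuse_state *st, int64_t start, int64_t end)
{
	struct byte_range r = { start, end };

	if (st->valid) {
		if (st->prev_end + 1 == start)
			r.start = st->prev_start;	/* continues forward */
		else if (st->prev_start == end + 1)
			r.end = st->prev_end;		/* ends right before it (assumed case) */
	}
	st->valid = true;
	st->prev_start = r.start;
	st->prev_end = r.end;
	return (r);
}
```

With this shape, a sequence of sequential misaligned reads keeps widening one range, so a buffer that straddles two reads is covered by the request issued after the second read.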
* Split the second half of vn_open_cred() (after a vnode has been found via a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function and use this function in fhopen() instead of duplicating code from vn_open_cred() directly. | jhb | 2012-06-08 | 1 | -32/+42
  Tested by: pho
  Reviewed by: kib
  MFC after: 2 weeks
* Add a knob to disable vn_io_fault. | kib | 2012-06-03 | 1 | -1/+5
  MFC after: 1 month
* Count and export the number of prefaults that happen. | kib | 2012-06-03 | 1 | -0/+5
  MFC after: 1 month
* vn_io_fault() is a facility to prevent page faults while filesystems perform copyin/copyout of the file data into the usermode buffer. | kib | 2012-05-30 | 1 | -44/+293
  A typical filesystem holds the vnode lock and some buffer locks over the VOP_READ() and VOP_WRITE() operations, and since the page fault handler may need to recurse into VFS to get the page content, a deadlock is possible.
  The facility works by disabling page fault handling for the current thread and attempting to execute i/o while allowing uiomove() to access the usermode mapping of the i/o buffer. If all buffer pages are resident, uiomove() is successful and the request is finished. If EFAULT is returned from uiomove(), the pages backing the i/o buffer are faulted in and held, and the copyin/out is performed using uiomove_fromphys() over the held pages for the second attempt of the VOP call.
  Since pages are held in chunks to prevent large i/o requests from starving the free pages pool, and since the vnode lock is only taken for i/o over the current chunk, the vnode lock no longer protects atomicity of the whole i/o request. Use newly added rangelocks to provide the required atomicity of i/o regarding other i/o and truncations.
  Filesystems need to explicitly opt in to the scheme, by setting the MNTK_NO_IOPF struct mount flag, and optionally by using the vn_io_fault_uiomove(9) helper which takes care of calling uiomove() or converting uio into a request for uiomove_fromphys().
  Reviewed by: bf (comments), mdf, pjd (previous version)
  Tested by: pho
  Tested by: flo, Gustau Pérez <gperez entel upc edu> (previous version)
  MFC after: 2 months
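The control flow of this two-attempt scheme can be sketched as a small userland simulation. It is a sketch under assumptions, not the kernel code: `sketch_io_fault`, the callback type and the demo callbacks are hypothetical names, the callbacks stand in for the VOP call with and without held pages, and the retry here restarts from offset 0 rather than resuming from the point of the fault.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: do i/o on [off, off + len); returns 0 or an errno. */
typedef int (*sketch_io_fn)(size_t off, size_t len, void *ctx);

/*
 * First attempt: run the whole request with page faults disabled ("fast").
 * If that reports EFAULT, redo the request chunk by chunk ("held"), where
 * each chunk models i/o over a bounded set of held pages.
 */
int
sketch_io_fault(size_t total, size_t chunk, sketch_io_fn fast,
    sketch_io_fn held, void *ctx)
{
	size_t off, len;
	int error;

	if (chunk == 0)
		return (EINVAL);
	error = fast(0, total, ctx);
	if (error != EFAULT)
		return (error);		/* success, or an unrelated error */
	for (off = 0; off < total; off += len) {
		len = total - off < chunk ? total - off : chunk;
		error = held(off, len, ctx);
		if (error != 0)
			return (error);
	}
	return (0);
}

/* Demo bookkeeping and callbacks, used only to exercise the sketch. */
struct sketch_stats {
	int	fast_calls;
	int	held_calls;
	size_t	held_bytes;
};

int
sketch_fast_always_faults(size_t off, size_t len, void *ctx)
{
	struct sketch_stats *s = ctx;

	(void)off;
	(void)len;
	s->fast_calls++;
	return (EFAULT);	/* pretend a buffer page was not resident */
}

int
sketch_held_ok(size_t off, size_t len, void *ctx)
{
	struct sketch_stats *s = ctx;

	(void)off;
	s->held_calls++;
	s->held_bytes += len;
	return (0);
}
```

Bounding each retry step by `chunk` mirrors the stated reason for holding pages in chunks: a very large request never pins more than one chunk's worth of pages at a time.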
* Add a vn_bmap_seekhole(9) vnode helper which can be used by any filesystem which supports VOP_BMAP(9) to implement SEEK_HOLE/SEEK_DATA commands for lseek(2). | kib | 2012-05-26 | 1 | -0/+53
  MFC after: 2 weeks
* Add KTR_VFS traces to track modifications to a vnode's writecount. | jhb | 2012-03-08 | 1 | -1/+6
* Fix found places where uio_resid is truncated to int. | kib | 2012-02-21 | 1 | -2/+2
  Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode.
  Discussed with: bde, das (previous versions)
  MFC after: 1 month
* Refine the implementation of POSIX_FADV_NOREUSE for the read(2) case such that instead of using direct I/O it allows read-ahead similar to POSIX_FADV_NORMAL, but invokes VOP_ADVISE(POSIX_FADV_DONTNEED) after the read(2) has completed to purge just-read data. | jhb | 2012-01-30 | 1 | -7/+7
  The write(2) path continues to use direct I/O for POSIX_FADV_NOREUSE for now. Note that NOREUSE works optimally if an application reads and writes full fs blocks.
* Add the posix_fadvise(2) system call. | jhb | 2011-11-04 | 1 | -14/+62
  It is somewhat similar to madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files.
  Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provides hints about data access patterns and is used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor.
  The second type of hints are used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible.
  To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files.
  Reviewed by: jilles, kib, arch@
  Approved by: re (kib)
  MFC after: 1 month
* In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. | kmacy | 2011-09-16 | 1 | -1/+1
  It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal to kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls.
  Reviewed by: rwatson
  Approved by: re (bz)
* Generalize ffs_pages_remove() into vn_pages_remove(). | mm | 2011-08-25 | 1 | -0/+15
  Remove mapped pages for all dataset vnodes in zfs_rezget() using new vn_pages_remove() to fix mmapped files changed by zfs rollback or zfs receive -F.
  PR: kern/160035, kern/156933
  Reviewed by: kib, pjd
  Approved by: re (kib)
  MFC after: 1 week
* Add the fo_chown and fo_chmod methods to struct fileops and use them to implement fchown(2) and fchmod(2) support for several file types that previously lacked it. | kib | 2011-08-16 | 1 | -0/+41
  Add MAC entries for chown/chmod done on posix shared memory and (old) in-kernel posix semaphores.
  Based on the submission by: glebius
  Reviewed by: rwatson
  Approved by: re (bz)
* Use a name instead of a magic number for kern_yield(9) when the priority should not change. | mdf | 2011-05-13 | 1 | -1/+1
  Fetch the td_user_pri under the thread lock. This is probably not necessary but a magic number also seems preferable to knowing the implementation details here.
  Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >
* Based on discussions on the svn-src mailing list, rework r218195: | mdf | 2011-02-08 | 1 | -2/+2
  - entirely eliminate some calls to uio_yield() as being unnecessary, such as in a sysctl handler.
  - move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h
  - add a slightly more generic kern_yield() that can replace the functionality of uio_yield().
  - replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching.
  - fix a logic inversion bug in vlrureclaim(), pointed out by bde@.
  - instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.
* Correct arguments order. | pjd | 2010-06-26 | 1 | -2/+2
* Avoid overflow. | trasz | 2010-05-06 | 1 | -1/+1
  Submitted by: bde@
* Style fixes and removal of unneeded variable. | trasz | 2010-05-06 | 1 | -3/+3
  Submitted by: bde@
* Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize(). | trasz | 2010-05-05 | 1 | -0/+19
  Reviewed by: kib
* vn_stat: take into account va_blocksize when setting st_blksize | avg | 2010-04-03 | 1 | -3/+2
  As currently st_blksize is always PAGE_SIZE, it is playing safe to not use any smaller value. For some cases this might not be optimal, but at least nothing should get broken.
  Generally I don't expect this commit to change much for the following reasons (in case of VREG, VDIR):
  - application I/O and physical I/O are sufficiently decoupled by filesystem code, buffer cache code, cluster and read-ahead logic
  - not all applications use st_blksize as a hint, some use f_iosize, some use fixed block sizes
  I expect writes to the middle of files on ZFS to benefit the most from this change.
  Silence from: fs@
  MFC after: 2 weeks
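The "never smaller than PAGE_SIZE" rule in this message amounts to taking a maximum. A minimal sketch, assuming a 4096-byte page and using the hypothetical names `SKETCH_PAGE_SIZE` and `sketch_st_blksize` (neither comes from the commit):

```c
#include <stdint.h>

/* Assumed page size for this sketch; the real value is platform-specific. */
#define SKETCH_PAGE_SIZE	4096u

/*
 * Pick the block size reported to userland: the filesystem's preferred
 * block size, but never less than one page.
 */
uint32_t
sketch_st_blksize(uint32_t va_blocksize)
{
	return (va_blocksize > SKETCH_PAGE_SIZE ?
	    va_blocksize : SKETCH_PAGE_SIZE);
}
```

A filesystem reporting a 128 KB block size would thus be passed through unchanged, while one reporting 512 bytes would still be presented as one page.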
* Rename st_*timespec fields to st_*tim for POSIX 2008 compliance. | ed | 2010-03-28 | 1 | -4/+4
  A nice thing about POSIX 2008 is that it finally standardizes a way to obtain file access/modification/change times in sub-second precision, namely using struct timespec, which we already have for a very long time. Unfortunately POSIX uses different names.
  This commit adds compatibility macros, so existing code should still build properly. Also change all source code in the kernel to work without any of the compatibility macros. This makes it all less ambiguous.
  I am also renaming st_birthtime to st_birthtim, even though it was a local extension anyway. It seems Cygwin also has a st_birthtim.
* Actually make O_DIRECTORY work. | ed | 2010-03-21 | 1 | -0/+4
  According to POSIX, open() must return ENOTDIR when the path name does not refer to a directory. Change vn_open() to respect this flag. This also simplifies the Linuxolator a bit.
* Don't add VAPPEND if the file is not being opened for writing. | trasz | 2009-12-08 | 1 | -1/+1
  Note that this only affects cases where open(2) is being used improperly - i.e. when the user specifies O_APPEND without O_WRONLY or O_RDWR.
  Reviewed by: rwatson
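The rule stated here, append access only when the file is also opened for writing, can be sketched as a small flag translation. All constants and the function name below are hypothetical values picked for the illustration; they are not the kernel's FREAD/FWRITE/VAPPEND definitions.

```c
/* Hypothetical open-mode bits, chosen only for this sketch. */
enum { SK_FREAD = 0x1, SK_FWRITE = 0x2, SK_O_APPEND = 0x8 };

/* Hypothetical access-mode bits requested from the filesystem. */
enum { SK_VREAD = 0x10, SK_VWRITE = 0x20, SK_VAPPEND = 0x40 };

/* Translate open-mode bits into the access mode to check. */
unsigned
sketch_open_accmode(unsigned fmode)
{
	unsigned accmode = 0;

	if (fmode & SK_FREAD)
		accmode |= SK_VREAD;
	if (fmode & SK_FWRITE)
		accmode |= SK_VWRITE;
	/* Append access is only meaningful together with write access. */
	if ((fmode & SK_O_APPEND) && (fmode & SK_FWRITE))
		accmode |= SK_VAPPEND;
	return (accmode);
}
```

Under this rule a read-only open that carries the append bit asks for read access only, so it cannot end up with append permission and no write permission.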
* Revert r198874, pending further discussion. | trasz | 2009-11-04 | 1 | -1/+1
* Make sure we don't end up with VAPPEND without VWRITE, if someone calls open(2) like this: open(..., O_APPEND). | trasz | 2009-11-04 | 1 | -1/+1
* Add two new fcntls to enable/disable read-ahead: | delphij | 2009-09-28 | 1 | -0/+3
  - F_READAHEAD: specify the amount for sequential access. The amount is specified in bytes and is rounded up to nearest block size.
  - F_RDAHEAD: Darwin compatible version that uses 128KB as the sequential access size.
  A third argument of zero disables the read-ahead behavior.
  Please note that the read-ahead amount is also constrained by the sysctl variable vfs.read_max, which may need to be raised in order to better utilize this feature.
  Thanks Igor Sysoev for proposing the feature and submitting the original version, and kib@ for his valuable comments.
  Submitted by: Igor Sysoev <is rambler-co ru>
  Reviewed by: kib@
  MFC after: 1 month
* Fix mount reference leak when V_XSLEEP is specified to vn_start_write(). | kib | 2009-09-01 | 1 | -1/+1
  Submitted by: tegge
* Make the mnt_writeopcount and mnt_secondary_writes counters, used by the suspension code, not greater than mnt_ref reference counter value. | kib | 2009-08-31 | 1 | -2/+4
  Increment mnt_ref together with write counter in vn_start_write()/vn_start_secondary_write(), releasing in vn_finished_write/vn_finished_secondary_write().
  Since r186197, unmount code requires that no writers occurred after all references are expired. We still could get write counter incremented for freed or reused struct mount, but it seems to be innocent, since corresponding vnode should be referenced and reclaimed then.
  Reported by: pho (last half a year), erwin
  Reviewed by: attilio
  Tested by: pho, erwin
  MFC after: 1 week
* In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point around the sequence that drops vnode lock and then busies the mount point. | kib | 2009-07-02 | 1 | -0/+2
  Not having vlocked node or direct reference to the mp allows for the forced unmount to proceed, making mp unmounted or reused.
  Tested by: pho
  Reviewed by: jeff
  Approved by: re (kensmith)
  MFC after: 2 weeks
* Add another flags argument to vn_open_cred. | kib | 2009-06-21 | 1 | -8/+9
  Use it to specify that some vn_open_cred invocations shall not audit namei path.
  In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by default implementation of vop_vptocnp, and for the open done for core file. vn_fullpath is called from the audit code, and vn_open there needs to disable audit to avoid infinite recursion. Core file is created on return to user mode, that, in particular, happens during syscall return. The creation of the core file is audited by direct calls, and we do not want to overwrite audit information for syscall.
  Reported, reviewed and tested by: rwatson
* Simplify shared vnode locking and extend it to also include fsync. | ps | 2009-06-08 | 1 | -3/+4
  Also, in vop_write, no longer assert for exclusive locks on the vnode.
  Reviewed by: jhb, kmacy, jeffr
* Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. | rwatson | 2009-06-05 | 1 | -2/+0
  Discussed with: pjd
* When checking for shared writes, use the struct mount returned from vn_start_write. | ps | 2009-06-04 | 1 | -2/+1
  Reviewed by: jhb
* Support shared vnode locks for write operations when the offset is provided on filesystems that support it. | ps | 2009-06-04 | 1 | -4/+19
  This really improves mysql + innodb performance on ZFS.
  Reviewed by: jhb, kmacy, jeffr
* Remove the thread argument from the FSD (File-System Dependent) parts of the VFS. | attilio | 2009-05-11 | 1 | -2/+1
  Now all the VFS_* functions and relating parts don't want the context as long as it always refers to curthread.
  In some points, in particular when dealing with VOPs and functions living in the same namespace (eg. vflush) which still need to be converted, pass curthread explicitly in order to retain the old behaviour. Such loose ends will be fixed ASAP.
  While here fix a bug: now, UFS_EXTATTR can be compiled alone without the UFS_EXTATTR_AUTOSTART option.
  VFS KPI is heavily changed by this commit so third-party modules need to be recompiled. Bump __FreeBSD_version in order to signal such situation.