path: root/sys/kern/vfs_syscalls.c
Commit message | Author | Age | Files | Lines
* Remove unused `vfslocked' variable. | ed | 2012-10-22 | 1 | -2/+1
  I have no idea what this `vfslocked' thing means. I wonder how it ended up here.
* Remove the support for using non-mpsafe filesystem modules. | kib | 2012-10-22 | 1 | -230/+47
  In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems.
  The VFS_VERSION is bumped to indicate the interface change, which does not change the interface signatures.
  Conducted and reviewed by: attilio
  Tested by: pho
* Acquire the rangelock for truncate(2) as well. | kib | 2012-10-15 | 1 | -3/+7
  Reported and reviewed by: avg
  Tested by: pho
  MFC after: 1 week
* - Enforce CAP_MKFIFO on mkfifoat(2), not on mknodat(2). Without this change | pjd | 2012-10-01 | 1 | -3/+4
    mkfifoat(2) was not restricted.
  - Introduce CAP_MKNOD and enforce it on mknodat(2).
  Sponsored by: FreeBSD Foundation
  MFC after: 2 weeks
* Require CAP_DELETE on directory descriptor for unlinkat(2). | pjd | 2012-09-25 | 1 | -2/+2
  Sponsored by: FreeBSD Foundation
  MFC after: 2 weeks
* Require CAP_CREATE on directory descriptor for symlinkat(2). | pjd | 2012-09-25 | 1 | -2/+2
  Sponsored by: FreeBSD Foundation
  MFC after: 2 weeks
* Require CAP_CREATE on directory descriptor for linkat(2). | pjd | 2012-09-25 | 1 | -2/+2
  Sponsored by: FreeBSD Foundation
  MFC after: 2 weeks
* O_EXEC flag is not part of the O_ACCMODE mask, check it separately. | pjd | 2012-09-25 | 1 | -15/+13
  If O_EXEC is provided don't require CAP_READ/CAP_WRITE, as O_EXEC is mutually exclusive to O_RDONLY/O_WRONLY/O_RDWR.
  Without this change CAP_FEXECVE capability right is not enforced.
  Sponsored by: FreeBSD Foundation
  MFC after: 3 days
* Reorder the management of advisory locks on open files so that the advisory | jhb | 2012-07-31 | 1 | -38/+6
  lock is obtained before the write count is increased during open() and the lock is released after the write count is decreased during close().
  The first change closes a race where an open() that will block with O_SHLOCK or O_EXLOCK can increase the write count while it waits. If the process holding the current lock on the file then tries to call exec() on the file it has locked, it can fail with ETXTBUSY even though the advisory lock is preventing other threads from successfully completing a writable open().
  The second change closes a race where a read-only open() with O_SHLOCK or O_EXLOCK may return successfully while the write count is non-zero due to another descriptor that had the advisory lock and was blocking the open() still being in the process of closing. If the process that completed the open() then attempts to call exec() on the file it locked, it can fail with ETXTBUSY even though the other process that held a write lock has closed the file and released the lock.
  Reviewed by: kib
  MFC after: 1 month
* Extend the KPI to lock and unlock f_offset member of struct file. It | kib | 2012-07-02 | 1 | -13/+23
  now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries().
  Ensure that on 32bit architectures f_offset, which is a 64bit quantity, is always read and written under the mtxpool protection. This fixes an apparently easy to trigger race where parallel lseek()s, or lseek() and read/write, could destroy the file offset.
  The already broken ABI emulations, including iBCS and SysV, are not converted (yet).
  Tested by: pho
  No objections from: jhb
  MFC after: 3 weeks
* Further refine the implementation of POSIX_FADV_NOREUSE. | jhb | 2012-06-19 | 1 | -0/+2
  First, extend the changes in r230782 to better handle the common case of using NOREUSE with sequential reads. A NOREUSE file descriptor will now track the last implicit DONTNEED request it made as a result of a NOREUSE read. If a subsequent NOREUSE read is adjacent to the previous range, it will apply the DONTNEED request to the entire range of both the previous read and the current read. The effect is that each read of a file accessed sequentially will apply the DONTNEED request to the entire range that has been read. This allows NOREUSE to properly handle misaligned reads by flushing each buffer to cache once it has been completely read.
  Second, apply the same changes made to read(2) by r230782 and this change to writes. This provides much better performance in the sequential write case as it allows writes to still be clustered. It also provides much better performance for misaligned writes. It does mean that NOREUSE will be generally ineffective for non-sequential writes as the current implementation relies on a future NOREUSE write's implicit DONTNEED request to flush the dirty buffer from the current write.
  MFC after: 2 weeks
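The adjacent-range merging described in this commit message can be modeled in user space. What follows is a minimal sketch and not the kernel code: `struct noreuse_state` and `noreuse_extend()` are hypothetical names, and only the forward-sequential case is covered, where a new read starts at the byte right after the previously recorded range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-descriptor record of the last implicit DONTNEED range. */
struct noreuse_state {
	bool	valid;	/* true once a range has been recorded */
	int64_t	start;	/* first byte of the last range (inclusive) */
	int64_t	end;	/* last byte of the last range (inclusive) */
};

/*
 * Given a new read covering [*start, *end], widen it to include the
 * previously recorded range when the two are adjacent, then remember
 * the resulting range for the next call.
 */
void
noreuse_extend(struct noreuse_state *st, int64_t *start, int64_t *end)
{
	assert(*start <= *end);
	if (st->valid && st->end + 1 == *start)
		*start = st->start;
	st->valid = true;
	st->start = *start;
	st->end = *end;
}
```

With this model, two back-to-back reads of bytes 0..99 and 100..199 yield a combined range of 0..199 on the second call, while a later read at an unrelated offset starts a fresh range.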
* Now that dupfdopen() doesn't depend on finstall() being called earlier, | pjd | 2012-06-13 | 1 | -2/+1
  indx will never be -1 on error, as none of dupfdopen(), finstall() and kern_capwrap() modifies it on error, but what is more important none of those functions install and leave file at indx descriptor on error.
  Leave an assert to prove my words.
  MFC after: 1 month
* Allocate descriptor number in dupfdopen() itself instead of depending on | pjd | 2012-06-13 | 1 | -11/+7
  the caller using finstall(). This saves us the filedesc lock/unlock cycle, fhold()/fdrop() cycle and closes a race between finstall() and dupfdopen().
  MFC after: 1 month
* - Remove nfp variable that is not really needed. | pjd | 2012-06-13 | 1 | -10/+12
  - Update comment.
  - Style nits.
  MFC after: 1 month
* Remove duplicated code. | pjd | 2012-06-13 | 1 | -9/+1
  MFC after: 1 month
* Add missing {. | pjd | 2012-06-13 | 1 | -1/+1
  MFC after: 1 month
* Style. | pjd | 2012-06-13 | 1 | -7/+8
  MFC after: 1 month
* There is no need to set td->td_retval[0] to -1 on error. | pjd | 2012-06-13 | 1 | -1/+0
  Confirmed by: jhb
  MFC after: 1 month
* Style fixes and simplifications. | pjd | 2012-06-11 | 1 | -4/+2
  MFC after: 1 month
* Split the second half of vn_open_cred() (after a vnode has been found via | jhb | 2012-06-08 | 1 | -126/+39
  a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function and use this function in fhopen() instead of duplicating code from vn_open_cred() directly.
  Tested by: pho
  Reviewed by: kib
  MFC after: 2 weeks
* Add kern_fhstat(), adjust sys_fhstat() to use it. | gleb | 2012-05-24 | 1 | -11/+23
  Extend kern_getdirentries() to accept uio segflag and optionally return buffer residue.
  Sponsored by: Google Summer of Code 2011
* The value of flags matching VNOVAL can't be supported. Return EOPNOTSUPP | jh | 2012-04-20 | 1 | -0/+4
  from setfflags() in this case. This fixes the return value of chflags(path, -1).
  Discussed with: bde
  MFC after: 2 weeks
* Perform the parameter validation before assigning it to a signed int | pho | 2012-03-09 | 1 | -2/+2
  variable. This fixes the problem seen with readdir(3) fuzzing.
  Submitted by: bde
  MFC after: 1 week
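The ordering issue this commit message describes, checking a value only after it has been stored in a signed int, can be illustrated with a small sketch. `count_to_int()` is a hypothetical helper written for illustration; it is not taken from the commit, and the choice of EINVAL as the error is an assumption.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: validate an unsigned byte count while it is still
 * unsigned, and only then store it in a signed int.  Checking after the
 * assignment would be too late, because an oversized value may already
 * have been converted to a negative number.
 */
int
count_to_int(size_t count, int *out)
{
	if (count > (size_t)INT_MAX)
		return (EINVAL);
	*out = (int)count;
	return (0);
}
```

On rejection the output variable is left untouched, so a caller never sees a wrapped value.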
* Free up allocated memory used by posix_fadvise(2). | pho | 2012-03-08 | 1 | -1/+1
* Add KTR_VFS traces to track modifications to a vnode's writecount. | jhb | 2012-03-08 | 1 | -2/+8
* Fix found places where uio_resid is truncated to int. | kib | 2012-02-21 | 1 | -3/+3
  Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows SSIZE_MAX-sized i/o requests from usermode.
  Discussed with: bde, das (previous versions)
  MFC after: 1 month
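A user-space model of the clamp behaviour may make the sysctl's effect easier to picture. This is a sketch under assumptions: the variable merely stands in for debug.iosize_max_clamp, and `iosize_max()` and `iosize_ok()` are hypothetical helpers, not kernel functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical stand-in for the debug.iosize_max_clamp sysctl (on by default). */
bool iosize_max_clamp = true;

/* Largest single i/o request accepted under the current clamp setting. */
ssize_t
iosize_max(void)
{
	return (iosize_max_clamp ? (ssize_t)INT_MAX : SSIZE_MAX);
}

/* Returns true when a request of `len' bytes fits under the current limit. */
bool
iosize_ok(size_t len)
{
	return (len <= (size_t)iosize_max());
}
```

With the clamp enabled, a request one byte larger than INT_MAX is rejected; with it disabled, the limit rises to SSIZE_MAX.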
* Current implementations of sync(2) and syncer vnode fsync() VOP uses | kib | 2012-02-06 | 1 | -11/+3
  mnt_noasync counter to temporarily remove MNTK_ASYNC mount option, which is needed to guarantee a synchronous completion of the initiated i/o before syscall or VOP return.
  Global removal of MNTK_ASYNC option is harmful because not only i/o started from corresponding thread becomes synchronous, but all i/o is synchronous on the filesystem which is initiated during sync(2) or syncer activity.
  Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local thread flag to disable async i/o for current thread only. Use the opportunity to move DOINGASYNC() macro into sys/vnode.h and consistently use it through places which tested for MNTK_ASYNC.
  Some testing demonstrated 60-70% improvements in run time for the metadata-intensive operations on async-mounted UFS volumes, but still with great deviation due to other reasons.
  Reviewed by: mckusick
  Tested by: scottl
  MFC after: 2 weeks
* Avoid LOR between vfs_busy() lock and covered vnode lock on quotaon(). | kib | 2012-01-08 | 1 | -1/+16
  The vfs_busy() is after covered vnode lock in the global lock order, but since quotaon() does recursive VFS call to open quota file, we usually end up locking covered vnode after mp is busied in sys_quotactl().
  Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon, esp. if vfs_busy cannot succeed due to unmount being performed.
  Reported and tested by: pho
  MFC after: 1 week
* Fire a kevent if necessary after seeking on a regular file. This fixes a | jhb | 2011-12-16 | 1 | -0/+1
  case where a kevent would not fire on a regular file if an application read to EOF and then seeked backwards into the file.
  Reviewed by: kib
  MFC after: 2 weeks
* Document a large number of currently undocumented sysctls. While here | eadler | 2011-12-13 | 1 | -1/+2
  fix some style(9) issues and reduce redundancy.
  PR: kern/155491
  PR: kern/155490
  PR: kern/155489
  Submitted by: Galimov Albert <wtfcrap@mail.ru>
  Approved by: bde
  Reviewed by: jhb
  MFC after: 1 week
* Fix a race between getvnode() dereferencing half-constructed file | kib | 2011-11-24 | 1 | -1/+14
  and dupfdopen().
  Reported and tested by: pho
  MFC after: 3 days
* Improve *access*() parameter name consistency. | ed | 2011-11-19 | 1 | -17/+17
  The current code mixes the use of `flags' and `mode'. This is a bit confusing, since the faccessat() function has a `flag' parameter to store the AT_ flag.
  Make this less confusing by using the same name as used in the POSIX specification -- `amode'.
* - Split out a kern_posix_fadvise() from the posix_fadvise() system call so | jhb | 2011-11-14 | 1 | -26/+31
    it can be used by in-kernel consumers.
  - Make kern_posix_fallocate() public.
  - Use kern_posix_fadvise() and kern_posix_fallocate() to implement the freebsd32 wrappers for the two system calls.
* Add the posix_fadvise(2) system call. It is somewhat similar to | jhb | 2011-11-04 | 1 | -0/+134
  madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files.
  Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provides hints about data access patterns and is used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor.
  The second type of hints are used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible.
  To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files.
  Reviewed by: jilles, kib, arch@
  Approved by: re (kib)
  MFC after: 1 month
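From user space, the two kinds of advice described in this commit message map onto different `advice` values of the standard posix_fadvise(2) call. The sketch below shows one of each; the wrapper names `advise_sequential()` and `advise_dontneed()` are hypothetical and exist only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Access-pattern hint: the whole file behind `fd' will be read
 * sequentially.  A length of 0 means "through the end of the file".
 * posix_fadvise() returns 0 on success or an error number directly;
 * it does not set errno.
 */
int
advise_sequential(int fd)
{
	return (posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL));
}

/* Usage hint: the first `len' bytes will not be needed again soon. */
int
advise_dontneed(int fd, off_t len)
{
	return (posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED));
}
```

Because the interface is advisory, a caller can usually ignore a non-zero return and carry on with ordinary reads.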
* In order to maximize the re-usability of kernel code in user space this | kmacy | 2011-09-16 | 1 | -72/+72
  patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal to kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls.
  Reviewed by: rwatson
  Approved by: re (bz)
* Add the fo_chown and fo_chmod methods to struct fileops and use them | kib | 2011-08-16 | 1 | -40/+20
  to implement fchown(2) and fchmod(2) support for several file types that previously lacked it. Add MAC entries for chown/chmod done on posix shared memory and (old) in-kernel posix semaphores.
  Based on the submission by: glebius
  Reviewed by: rwatson
  Approved by: re (bz)
* When falloc() was broken into separate falloc_noinstall() and finstall(), | rwatson | 2011-08-13 | 1 | -2/+3
  a bug was introduced in kern_openat() such that the error from the vnode open operation was overwritten before it was passed as an argument to dupfdopen(). This broke operations on /dev/{stdin,stdout,stderr}. Fix by preserving the original error number across finstall() so that it is still available.
  Approved by: re (kib)
  Reported by: cognet
* Allow Capsicum capabilities to delegate constrained | jonathan | 2011-08-13 | 1 | -33/+87
  access to file system subtrees to sandboxed processes.
  - Use of absolute paths and '..' are limited in capability mode.
  - Use of absolute paths and '..' are limited when looking up relative to a capability.
  - When a name lookup is performed, identify what operation is to be performed (such as CAP_MKDIR) as well as check for CAP_LOOKUP.
  With these constraints, openat() and friends are now safe in capability mode, and can then be used by code such as the capability-mode runtime linker.
  Approved by: re (bz), mentor (rwatson)
  Sponsored by: Google Inc
* Only call fdclose() on successfully-opened FDs. | jonathan | 2011-08-11 | 1 | -2/+4
  Since kern_openat() now uses falloc_noinstall() and finstall() separately, there are cases where we could get to cleanup code without ever creating a file descriptor. In those cases, we should not call fdclose() on FD -1.
  Approved by: re (kib), mentor (rwatson)
  Sponsored by: Google Inc
* Second-to-last commit implementing Capsicum capabilities in the FreeBSD | rwatson | 2011-08-11 | 1 | -32/+74
  kernel for FreeBSD 9.0:
  Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op.
  Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions.
  In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit.
  Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent.
  Approved by: re (bz)
  Submitted by: jonathan
  Sponsored by: Google Inc
* Add a lock flags argument to the VFS_FHTOVP() file system | rmacklem | 2011-05-22 | 1 | -3/+3
  method, so that callers can indicate the minimum vnode locking requirement. This will allow some file systems to choose to return a LK_SHARED locked vnode when LK_SHARED is specified for the flags argument. This patch only adds the flag. It does not change any file system to use it and all callers specify LK_EXCLUSIVE, so file system semantics are not changed.
  Reviewed by: kib
* Allow VOP_ALLOCATE to be iterative, and have kern_posix_fallocate(9) | mdf | 2011-04-19 | 1 | -22/+37
  drive looping and potentially yielding.
  Requested by: kib
* Fix a copy/paste whitespace error. | mdf | 2011-04-18 | 1 | -3/+3
* Add the posix_fallocate(2) syscall. The default implementation in | mdf | 2011-04-18 | 1 | -0/+80
  vop_stdallocate() is filesystem agnostic and will run as slow as a read/write loop in userspace; however, it serves to correctly implement the functionality for filesystems that do not implement a VOP_ALLOCATE.
  Note that __FreeBSD_version was already bumped today to 900036 for any ports which would like to use this function.
  Also reserve space in the syscall table for posix_fadvise(2).
  Reviewed by: -arch (previous version)
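As a usage illustration for the syscall this commit adds, the sketch below reserves space for a file and then checks its size. `reserve_bytes()` and `file_size()` are hypothetical helper names; only posix_fallocate(2) and fstat(2) are real interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve storage for the first `len' bytes of `fd'.  posix_fallocate()
 * returns an error number directly instead of setting errno.  On success
 * the file is at least `len' bytes long.
 */
int
reserve_bytes(int fd, off_t len)
{
	return (posix_fallocate(fd, 0, len));
}

/* Return the current size of `fd' in bytes, or -1 if fstat() fails. */
off_t
file_size(int fd)
{
	struct stat sb;

	if (fstat(fd, &sb) != 0)
		return (-1);
	return (sb.st_size);
}
```

Calling `reserve_bytes(fd, 8192)` on a freshly created empty file should leave `file_size(fd)` reporting 8192, regardless of whether the filesystem has a native allocation routine or falls back to a generic one.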
* After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9) | kib | 2011-04-01 | 1 | -2/+2
  and remove the falloc() version that lacks flag argument. This is done to reduce the KPI bloat.
  Requested by: jhb
  X-MFC-note: do not
* Add support for executing the FreeBSD 1/i386 a.out binaries on amd64. | kib | 2011-04-01 | 1 | -9/+16
  In particular:
  - implement compat shims for old stat(2) variants and ogetdirentries(2);
  - implement delivery of signals with ancient stack frame layout and corresponding sigreturn(2);
  - implement old getpagesize(2);
  - provide a user-mode trampoline and LDT call gate for lcall $7,$0;
  - port a.out image activator and connect it to the build as a module on amd64.
  The changes are hidden under COMPAT_43.
  MFC after: 1 month
* Add O_CLOEXEC flag to open(2) and fhopen(2). | kib | 2011-03-25 | 1 | -2/+2
  The new function fallocf(9), that is renamed falloc(9) with added flag argument, is provided to facilitate the merge to stable branch.
  Reviewed by: jhb
  MFC after: 1 week
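For readers unfamiliar with the flag, a minimal user-space sketch of O_CLOEXEC follows. The helper names `open_cloexec()` and `has_cloexec()` are hypothetical; the calls they wrap, open(2) and fcntl(2) with F_GETFD, are standard.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `path' read-only with the close-on-exec flag set at open time,
 * rather than with a separate fcntl(F_SETFD) call afterwards.
 */
int
open_cloexec(const char *path)
{
	return (open(path, O_RDONLY | O_CLOEXEC));
}

/* Returns 1 if FD_CLOEXEC is set on `fd', 0 if not, -1 on error. */
int
has_cloexec(int fd)
{
	int flags;

	flags = fcntl(fd, F_GETFD);
	if (flags == -1)
		return (-1);
	return ((flags & FD_CLOEXEC) != 0);
}
```

A descriptor returned by `open_cloexec()` reports the flag as set, whereas one opened without O_CLOEXEC does not.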
* Add an extra comment to the SDT probes definition. This allows us to get | rpaulo | 2010-08-22 | 1 | -2/+2
  use '-' in probe names, matching the probe names in Solaris.[1]
  Add userland SDT probes definitions to sys/sdt.h.
  Sponsored by: The FreeBSD Foundation
  Discussed with: rwaston [1]
* In revoke(), verify that VCHR vnode indeed belongs to devfs. | kib | 2010-07-06 | 1 | -1/+1
  Found and tested by: pho
  MFC after: 1 week
* Handle a case in kern_openat() when vn_open() changes the file type from | kib | 2010-04-13 | 1 | -15/+2
  DTYPE_VNODE. Only acquire locks for O_EXLOCK/O_SHLOCK if file type is still vnode, since we allow for fcntl(2) to process with advisory locks for DTYPE_VNODE only. Another reason is that all fo_close() routines need to check and release locks otherwise.
  For O_TRUNC, call fo_truncate() instead of truncating the vnode.
  Discussed with: rwatson
  MFC after: 2 weeks