summaryrefslogtreecommitdiffstats
path: root/sys/kern/vfs_syscalls.c
Commit message (Collapse)AuthorAgeFilesLines
* fd: replace fd_nfiles with fd_lastfile where appropriatemjg2014-06-221-1/+1
| | | | | | | | | | fd_lastfile is guaranteed to be the biggest open fd, so when the intent is to iterate over active fds or lookup one, there is no point in looking beyond that limit. Few places are left unpatched for now. MFC after: 1 week
* Update kernel inclusions of capability.h to use capsicum.h instead; somerwatson2014-03-161-1/+1
| | | | | | | | further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks
* The auio structure is only initialized when the vnode is symlink,kib2014-03-121-1/+1
| | | | | | | avoid reading from it otherwise. Submitted by: Conrad Meyer <cemeyer@uw.edu> MFC after: 1 week
* The posix_madvise(3) and posix_fadvise(2) should return error onkib2014-01-301-2/+3
| | | | | | | | | failure, same as posix_fallocate(2). Noted by: Bob Bishop <rb@gid.co.uk> Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week
* The posix_fallocate(2) syscall should return error number on error,kib2014-01-231-1/+3
| | | | | | | | | | without modifying errno. Reported and tested by: Gennady Proskurin <gpr@mail.ru> Reviewed by: mdf PR: standards/186028 Sponsored by: The FreeBSD Foundation MFC after: 1 week
* dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINEavg2013-11-261-2/+2
| | | | | | | | In its stead use the Solaris / illumos approach of emulating '-' (dash) in probe names with '__' (two consecutive underscores). Reviewed by: markj MFC after: 3 weeks
* - For kernel compiled only with KDTRACE_HOOKS and not any lock debuggingattilio2013-11-251-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invokation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip
* Correct the logic broken in my last commit.pjd2013-09-051-1/+1
| | | | Reported by: tijl
* Style fixes.pjd2013-09-051-122/+122
|
* Change the cap_rights_t type from uint64_t to a structure that we can extendpjd2013-09-051-50/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t *cap_rights_init(cap_rights_t *rights, ...); void cap_rights_set(cap_rights_t *rights, ...); void cap_rights_clear(cap_rights_t *rights, ...); bool cap_rights_is_set(const cap_rights_t *rights, ...); bool cap_rights_is_valid(const cap_rights_t *rights); void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src); void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src); bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation
* Make the seek a method of the struct fileops.kib2013-08-211-65/+3
| | | | | Tested by: pho Sponsored by: The FreeBSD Foundation
* Specify SDT probe argument types in the probe definition itself rather thanmarkj2013-08-151-6/+2
| | | | | | | | | using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API to allow probes with dynamically-translated types. There is no functional change. MFC after: 2 weeks
* - Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). Thekib2013-05-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | null_hashget() obtains the reference on the nullfs vnode, which must be dropped. - Fix a wart which existed from the introduction of the nullfs caching, do not unlock lower vnode in the nullfs_reclaim_lowervp(). It should be innocent, but now it is also formally safe. Inform the nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on nullfs inode. - Add a callback to the upper filesystems for the lower vnode unlinking. When inactivating a nullfs vnode, check if the lower vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC on the lower vnode, and reclaim upper vnode if so. This allows nullfs to purge cached vnodes for the unlinked lower vnode, avoiding excessive caching. Reported by: G??ran L??wkrantz <goran.lowkrantz@ismobile.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
* - Constify local path variable for chflagsat().pjd2013-03-221-1/+1
| | | | | | - Use correct format characters (%lx) for u_long. This fixes the build broken in r248599.
* Implement chflagsat(2) system call, similar to fchmodat(2), but operates onpjd2013-03-211-15/+50
| | | | | | | file flags. Reviewed by: kib, jilles Sponsored by: The FreeBSD Foundation
* - Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of typepjd2013-03-211-10/+10
| | | | | | | | | | | u_long. Before this change it was of type int for syscalls, but prototypes in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not for lchflags(2)) stated that it was u_long. Now some related functions use u_long type for flags (strtofflags(3), fflagstostr(3)). - Make path argument of type 'const char *' for consistency. Discussed on: arch Sponsored by: The FreeBSD Foundation
* Require CAP_SEEK if both O_APPEND and O_TRUNC flags are absent.pjd2013-03-161-1/+1
| | | | | | | | In other words we don't require CAP_SEEK if either O_APPEND or O_TRUNC flag is given, because O_APPEND doesn't allow to overwrite existing data and O_TRUNC requires CAP_FTRUNCATE already. Sponsored by: The FreeBSD Foundation
* Style: Whitespace fixes.pjd2013-03-161-3/+2
|
* Style: Remove redundant space.pjd2013-03-161-1/+1
|
* MFCattilio2013-03-021-48/+45
|\
| * If the target file already exists, check for the CAP_UNLINKAT capabiity rightpjd2013-03-021-7/+10
| | | | | | | | | | | | | | | | | | | | on the target directory descriptor, but only if this is renameat(2) and real target directory descriptor is given (not AT_FDCWD). Without this fix regular rename(2) fails if the target file already exists. Reported by: Michael Butler <imb@protected-networks.net> Reported by: Larry Rosenman <ler@lerctr.org> Sponsored by: The FreeBSD Foundation
| * Merge Capsicum overhaul:pjd2013-03-021-47/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrived with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrive them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavly modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and eventhough I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. - Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ | PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ | PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE | PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). Added convinient defines: #define CAP_PREAD (CAP_SEEK | CAP_READ) #define CAP_PWRITE (CAP_SEEK | CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ) #define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \ CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \ CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \ CAP_SETSOCKOPT | CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib
| * Reduce lock scope a little.pjd2013-03-011-1/+1
| |
* | Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() toattilio2013-02-201-2/+2
| | | | | | | | | | | | their "write" versions. Sponsored by: EMC / Isilon storage division
* | Switch vm_object lock to be a rwlock.attilio2013-02-201-0/+1
|/ | | | | | | | * VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations * VM_OBJECT_SLEEP() is introduced as a general purpose primitve to get a sleep operation using a VM_OBJECT_LOCK() as protection * The approach must bear with vm_pager.h namespace pollution so many files require including directly rwlock.h
* Style.pjd2013-02-171-9/+5
|
* - Require CAP_FSYNC capability right when opening a file with O_SYNC or O_FSYNCpjd2013-02-171-1/+4
| | | | | | | flags. - While here simplify check for locking flags. Sponsored by: The FreeBSD Foundation
* Stop translating the ERESTART error from the open(2) into EINTR.kib2013-02-071-2/+0
| | | | | | | | | | | | Posix requires that open(2) is restartable for SA_RESTART. For non-posix objects, in particular, devfs nodes, still disable automatic restart of the opens. The open call to a driver could have significant side effects for the hardware. Noted and reviewed by: jilles Discussed with: bde MFC after: 2 weeks
* Now that MPSAFE flag is gone, we can arrange code a bit better.pjd2013-01-311-49/+46
|
* Remove leftover label after Giant removal from VFS.pjd2013-01-311-5/+3
|
* Remove unused `vfslocked' variable.ed2012-10-221-2/+1
| | | | | I have no idea what this `vfslocked' thing means. I wonder how it ended up here.
* Remove the support for using non-mpsafe filesystem modules.kib2012-10-221-230/+47
| | | | | | | | | | | | In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho
* Acquire the rangelock for truncate(2) as well.kib2012-10-151-3/+7
| | | | | | Reported and reviewed by: avg Tested by: pho MFC after: 1 week
* - Enforce CAP_MKFIFO on mkfifoat(2), not on mknodat(2). Without this changepjd2012-10-011-3/+4
| | | | | | | | mkfifoat(2) was not restricted. - Introduce CAP_MKNOD and enforce it on mknodat(2). Sponsored by: FreeBSD Foundation MFC after: 2 weeks
* Require CAP_DELETE on directory descriptor for unlinkat(2).pjd2012-09-251-2/+2
| | | | | Sponsored by: FreeBSD Foundation MFC after: 2 weeks
* Require CAP_CREATE on directory descriptor for symlinkat(2).pjd2012-09-251-2/+2
| | | | | Sponsored by: FreeBSD Foundation MFC after: 2 weeks
* Require CAP_CREATE on directory descriptor for linkat(2).pjd2012-09-251-2/+2
| | | | | Sponsored by: FreeBSD Foundation MFC after: 2 weeks
* O_EXEC flag is not part of the O_ACCMODE mask, check it separately.pjd2012-09-251-15/+13
| | | | | | | | | | If O_EXEC is provided don't require CAP_READ/CAP_WRITE, as O_EXEC is mutually exclusive to O_RDONLY/O_WRONLY/O_RDWR. Without this change CAP_FEXECVE capability right is not enforced. Sponsored by: FreeBSD Foundation MFC after: 3 days
* Reorder the managament of advisory locks on open files so that the advisoryjhb2012-07-311-38/+6
| | | | | | | | | | | | | | | | | | | | | | lock is obtained before the write count is increased during open() and the lock is released after the write count is decreased during close(). The first change closes a race where an open() that will block with O_SHLOCK or O_EXLOCK can increase the write count while it waits. If the process holding the current lock on the file then tries to call exec() on the file it has locked, it can fail with ETXTBUSY even though the advisory lock is preventing other threads from succesfully completeing a writable open(). The second change closes a race where a read-only open() with O_SHLOCK or O_EXLOCK may return successfully while the write count is non-zero due to another descriptor that had the advisory lock and was blocking the open() still being in the process of closing. If the process that completed the open() then attempts to call exec() on the file it locked, it can fail with ETXTBUSY even though the other process that held a write lock has closed the file and released the lock. Reviewed by: kib MFC after: 1 month
* Extend the KPI to lock and unlock f_offset member of struct file. Itkib2012-07-021-13/+23
| | | | | | | | | | | | | | | | | | now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries(). Ensure that on 32bit architectures f_offset, which is 64bit quantity, always read and written under the mtxpool protection. This fixes apparently easy to trigger race when parallel lseek()s or lseek() and read/write could destroy file offset. The already broken ABI emulations, including iBCS and SysV, are not converted (yet). Tested by: pho No objections from: jhb MFC after: 3 weeks
* Further refine the implementation of POSIX_FADV_NOREUSE.jhb2012-06-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | First, extend the changes in r230782 to better handle the common case of using NOREUSE with sequential reads. A NOREUSE file descriptor will now track the last implicit DONTNEED request it made as a result of a NOREUSE read. If a subsequent NOREUSE read is adjacent to the previous range, it will apply the DONTNEED request to the entire range of both the previous read and the current read. The effect is that each read of a file accessed sequentially will apply the DONTNEED request to the entire range that has been read. This allows NOREUSE to properly handle misaligned reads by flushing each buffer to cache once it has been completely read. Second, apply the same changes made to read(2) by r230782 and this change to writes. This provides much better performance in the sequential write case as it allows writes to still be clustered. It also provides much better performance for misaligned writes. It does mean that NOREUSE will be generally ineffective for non-sequential writes as the current implementation relies on a future NOREUSE write's implicit DONTNEED request to flush the dirty buffer from the current write. MFC after: 2 weeks
* Now that dupfdopen() doesn't depend on finstall() being called earlier,pjd2012-06-131-2/+1
| | | | | | | | | | indx will never be -1 on error, as none of dupfdopen(), finstall() and kern_capwrap() modifies it on error, but what is more important none of those functions install and leave file at indx descriptor on error. Leave an assert to prove my words. MFC after: 1 month
* Allocate descriptor number in dupfdopen() itself instead of depending onpjd2012-06-131-11/+7
| | | | | | | | the caller using finstall(). This saves us the filedesc lock/unlock cycle, fhold()/fdrop() cycle and closes a race between finstall() and dupfdopen(). MFC after: 1 month
* - Remove nfp variable that is not really needed.pjd2012-06-131-10/+12
| | | | | | | - Update comment. - Style nits. MFC after: 1 month
* Remove duplicated code.pjd2012-06-131-9/+1
| | | | MFC after: 1 month
* Add missing {.pjd2012-06-131-1/+1
| | | | MFC after: 1 month
* Style.pjd2012-06-131-7/+8
| | | | MFC after: 1 month
* There is no need to set td->td_retval[0] to -1 on error.pjd2012-06-131-1/+0
| | | | | Confirmed by: jhb MFC after: 1 month
* Style fixes and simplifications.pjd2012-06-111-4/+2
| | | | MFC after: 1 month
* Split the second half of vn_open_cred() (after a vnode has been found viajhb2012-06-081-126/+39
| | | | | | | | | | a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function and use this function in fhopen() instead of duplicating code from vn_open_cred() directly. Tested by: pho Reviewed by: kib MFC after: 2 weeks
OpenPOWER on IntegriCloud