path: root/sys/kern/kern_event.c
Commit message | Author | Age | Files | Lines
* MFC r258148,r258149,r258150,r258152,r258153,r258154,r258181,r258182: | pjd | 2013-11-18 | 1 | -4/+11
  r258148:
  Add a note that this file is compiled as part of the kernel and libc.
  Requested by: kib

  r258149:
  Change cap_rights_merge(3) and cap_rights_remove(3) to return pointer to the destination cap_rights_t structure. This already matches manual page.

  r258150:
  Sync return value with actual implementation.

  r258151:
  Style.

  r258152:
  Precisely document capability rights here too (they are already documented in rights(4)).

  r258153:
  The CAP_LINKAT, CAP_MKDIRAT, CAP_MKFIFOAT, CAP_MKNODAT, CAP_RENAMEAT, CAP_SYMLINKAT and CAP_UNLINKAT capability rights make no sense without the CAP_LOOKUP right, so include this right.

  r258154:
  - Move CAP_EXTATTR_* and CAP_ACL_* rights to index 1 to have more room in index 0 for the future.
  - Move CAP_BINDAT and CAP_CONNECTAT rights to index 0 so we can include CAP_LOOKUP right in them.
  - Shuffle the bits around so there are no gaps. This is last chance to do that as all moved rights are not used yet.

  r258181:
  Replace CAP_POLL_EVENT and CAP_POST_EVENT capability rights (which I had a very hard time to fully understand) with much more intuitive rights:

  CAP_EVENT - When set on a descriptor, the descriptor can be monitored with syscalls like select(2), poll(2), kevent(2).
  CAP_KQUEUE_EVENT - When set on a kqueue descriptor, the kevent(2) syscall can be called on this kqueue with the eventlist argument set to non-NULL value; in other words the given kqueue descriptor can be used to monitor other descriptors.
  CAP_KQUEUE_CHANGE - When set on a kqueue descriptor, the kevent(2) syscall can be called on this kqueue with the changelist argument set to non-NULL value; in other words it allows to modify events monitored with the given kqueue descriptor.

  Add alias CAP_KQUEUE, which allows for both CAP_KQUEUE_EVENT and CAP_KQUEUE_CHANGE.
  Add backward compatibility define CAP_POLL_EVENT which is equal to CAP_EVENT.

  r258182:
  Correct right names.

  Sponsored by: The FreeBSD Foundation
  Approved by: re (kib)
* Do not allow negative timeouts for kqueue timers, check for the | kib | 2013-09-26 | 1 | -2/+10
  negative timeout both before and after the conversion to sbintime_t. For periodic kqueue timer, convert zero timeout into 1ms, to avoid interrupt storm on fast event timers.

  Reported and tested by: pho
  Discussed with: mav
  Reviewed by: davide
  Sponsored by: The FreeBSD Foundation
  Approved by: re (marius)
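The timeout rules described in this commit can be sketched in portable C. This is an illustration only, not the kernel code: `sk_sbintime_t`, `sk_timer_to_sbt()` and the conversion arithmetic are hypothetical stand-ins, assuming a 32.32 fixed-point time value and a timeout supplied in milliseconds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32.32 fixed-point time value. */
typedef int64_t sk_sbintime_t;

#define SK_SBT_1S	((sk_sbintime_t)1 << 32)
#define SK_SBT_1MS	(SK_SBT_1S / 1000)

/*
 * Convert a timer period in milliseconds to the fixed-point form.
 * Returns 0 on success or EINVAL for an unusable timeout.
 */
static int
sk_timer_to_sbt(int64_t ms, bool periodic, sk_sbintime_t *out)
{
	sk_sbintime_t sbt;

	/* Check before the conversion: negative timeouts are rejected. */
	if (ms < 0)
		return (EINVAL);
	/* Refuse values whose conversion would overflow. */
	if (ms > INT64_MAX / SK_SBT_1MS)
		return (EINVAL);
	sbt = ms * SK_SBT_1MS;
	/* Check after the conversion as well (defensive in this sketch). */
	if (sbt < 0)
		return (EINVAL);
	/* A periodic timer with a zero period would fire continuously. */
	if (periodic && sbt == 0)
		sbt = SK_SBT_1MS;
	*out = sbt;
	return (0);
}
```

A one-shot timer keeps a zero timeout unchanged in this sketch; only the periodic case is bumped to 1 ms.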
* Pre-acquire the filedesc sx when a possibility exists that the later | kib | 2013-09-22 | 1 | -3/+30
  code could need to remove a kqueue from the filedesc list. Global lock is already locked, which causes sleepable after non-sleepable lock acquisition.

  Reported and tested by: pho
  Reviewed by: jmg
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks
  Approved by: re (gjb)
* Revert r255672, it has some serious flaws, leaking file references etc. | rdivacky | 2013-09-18 | 1 | -71/+52
  Approved by: re (delphij)
* Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue | rdivacky | 2013-09-18 | 1 | -52/+71
  to implement epoll subset of functionality. The kqueue user data are 32bit on i386 which is not enough for epoll user data so this patch overrides kqueue fileops to maintain enough space in struct file.

  Initial patch developed by me in 2007 and then extended and finished by Yuri Victorovich.

  Approved by: re (delphij)
  Sponsored by: Google Summer of Code
  Submitted by: Yuri Victorovich <yuri at rawbw dot com>
  Tested by: Yuri Victorovich <yuri at rawbw dot com>
* Use TAILQ instead of STAILQ for kqueue filedescriptors to ensure constant | kib | 2013-09-13 | 1 | -3/+3
  time removal on kqueue close.

  Reported and tested by: pho
  Reviewed by: jmg
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Approved by: re (delphij)
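The reason a doubly linked tail queue helps here is that an element can be unlinked given only a pointer to it, whereas a singly linked tail queue has to walk the list to find the predecessor. A minimal sketch using the `<sys/queue.h>` macros follows; `struct sk_kq`, `sk_kq_close()` and `sk_kq_count()` are hypothetical names for illustration, not the kernel's structures.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical per-kqueue record kept on a per-process list. */
struct sk_kq {
	int			id;
	TAILQ_ENTRY(sk_kq)	link;
};
TAILQ_HEAD(sk_kqlist, sk_kq);

/* Unlink one entry; TAILQ_REMOVE needs no list walk. */
static void
sk_kq_close(struct sk_kqlist *list, struct sk_kq *kq)
{
	TAILQ_REMOVE(list, kq, link);
}

/* Count the entries that remain on the list. */
static int
sk_kq_count(struct sk_kqlist *list)
{
	struct sk_kq *kq;
	int n = 0;

	TAILQ_FOREACH(kq, list, link)
		n++;
	return (n);
}
```

With `STAILQ`, the equivalent `STAILQ_REMOVE` has to scan from the head until it reaches the element, so closing many kqueues costs time proportional to the list length for each close.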
* Change the cap_rights_t type from uint64_t to a structure that we can extend | pjd | 2013-09-05 | 1 | -3/+9
  in the future in a backward compatible (API and ABI) way.

  The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough.

  The structure definition looks like this:

      struct cap_rights {
          uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
      };

  The initial CAP_RIGHTS_VERSION is 0.

  The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements.

  The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future.

  To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg.

      #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

  We still support aliases that combine few rights, but the rights have to belong to the same array element, eg:

      #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
      #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
      #define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

  There is new API to manage the new cap_rights_t structure:

      cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
      void cap_rights_set(cap_rights_t *rights, ...);
      void cap_rights_clear(cap_rights_t *rights, ...);
      bool cap_rights_is_set(const cap_rights_t *rights, ...);
      bool cap_rights_is_valid(const cap_rights_t *rights);
      void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
      void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
      bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

  Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg:

      cap_rights_t rights;
      cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

  There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg:

      #define cap_rights_set(rights, ...) \
          __cap_rights_set((rights), __VA_ARGS__, 0ULL)
      void __cap_rights_set(cap_rights_t *rights, ...);

  Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1:

      cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

  Providing several rights that belong to the same array element this way is correct, but is not advised. It should only be used for aliases definition.

  This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x.

  Sponsored by: The FreeBSD Foundation
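The bit layout and the self-terminating variadic macro described in this commit message can be modelled in a few lines of portable C. The sketch below is an assumption-laden illustration rather than the FreeBSD implementation: every `sk_`/`SK_` name is hypothetical, the index field is taken to occupy bits 57 to 61 (the five bits below the top two), and an invalid combination is reported with a `false` return where the commit message says the real functions assert.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_RIGHTS_VERSION	0
#define SK_NELEMS		(SK_RIGHTS_VERSION + 2)

/* Bits 57..61 hold a one-hot array index; the low bits hold the right. */
#define SK_RIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))

#define SK_LOOKUP	SK_RIGHT(0, 0x0000000000000400ULL)
#define SK_FCHMOD	SK_RIGHT(0, 0x0000000000002000ULL)
#define SK_PDKILL	SK_RIGHT(1, 0x0000000000000800ULL)

struct sk_rights {
	uint64_t cr_rights[SK_NELEMS];
};

/* Return the array index encoded in a right, or -1 if it is not one-hot. */
static int
sk_right_index(unsigned long long right)
{
	unsigned long long idxbits = (right >> 57) & 0x1f;
	int i;

	for (i = 0; i < 5; i++)
		if (idxbits == (1ULL << i))
			return (i);
	return (-1);
}

/* Variadic worker; the argument list must end with 0ULL. */
static bool
sk_rights_init_impl(struct sk_rights *r, ...)
{
	va_list ap;
	unsigned long long right;
	bool ok = true;
	int i, idx;

	/* Stamp each element with its index bit. */
	for (i = 0; i < SK_NELEMS; i++)
		r->cr_rights[i] = SK_RIGHT(i, 0ULL);
	/* Top two bits of element 0: number of elements minus 2. */
	r->cr_rights[0] |= (unsigned long long)(SK_NELEMS - 2) << 62;

	va_start(ap, r);
	while ((right = va_arg(ap, unsigned long long)) != 0) {
		idx = sk_right_index(right);
		if (idx < 0 || idx >= SK_NELEMS) {
			/* Rights from different elements were OR-ed together. */
			ok = false;
			break;
		}
		r->cr_rights[idx] |= right;
	}
	va_end(ap);
	return (ok);
}

/* The macro appends the terminator so callers never have to. */
#define sk_rights_init(r, ...) \
	sk_rights_init_impl((r), __VA_ARGS__, 0ULL)

/* Test whether every bit of one right (or same-element alias) is present. */
static bool
sk_rights_is_set(const struct sk_rights *r, unsigned long long right)
{
	int idx = sk_right_index(right);

	return (idx >= 0 && idx < SK_NELEMS &&
	    (r->cr_rights[idx] & right) == right);
}
```

Because the index is one-hot, OR-ing two rights from different elements sets two index bits, which is what makes the mistake detectable; an alias built from rights in the same element still carries a single index bit and passes.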
* fix up some comments and a white space issue... | jmg | 2013-08-26 | 1 | -6/+6
  MFC after: 3 days
* Make sendfile() a method in the struct fileops. Currently only | glebius | 2013-08-15 | 1 | -0/+1
  vnode backed file descriptors have this method implemented.

  Reviewed by: kib
  Sponsored by: Nginx, Inc.
  Sponsored by: Netflix
* Some small cleanups to the fixes in r180340: | jhb | 2013-08-13 | 1 | -3/+3
  - Set NOTE_TRACKERR before running filt_proc(). If the knote did not have NOTE_FORK set in fflags when registered, then the TRACKERR event could miss being posted.
  - Don't pass the pid in to filt_proc() for NOTE_FORK events. The special handling for pids is done in knote_fork() directly and no longer in filt_proc().

  MFC after: 2 weeks
* Don't emit a spurious EVFILT_PROC event with no fflags set on process exit | jhb | 2013-08-07 | 1 | -2/+19
  if NOTE_EXIT is not being monitored. The rationale is that a listener should only get an event for exit() if they registered interest via NOTE_EXIT. This matches the behavior on OS X.
  - Don't save the exit status on process exit unless NOTE_EXIT is being monitored.
  - Add an internal EV_DROP flag that requests kqueue_scan() to free the knote without signalling it to userland and use this when a process exits but the fflags in the knote is zero.

  Reviewed by: jmg
  MFC after: 1 month
* Change callout use counter to use C11 atomics. | ed | 2013-06-16 | 1 | -10/+15
  In order to get some coverage of C11 atomics in kernelspace, switch at least one piece of code in kernelspace to use C11 atomics instead of <machine/atomic.h>.

  While there, slightly improve the code by adding an assertion to prevent the use count from going negative.
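A use counter built on C11 atomics, with an assertion guarding against underflow, might look like the sketch below. It uses the standard `<stdatomic.h>` interface in userland; `sk_use_count`, `sk_use_acquire()` and `sk_use_release()` are hypothetical names, and the memory orders shown are one reasonable choice rather than what the commit necessarily used.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical module-wide use counter, zero-initialized. */
static atomic_int sk_use_count;

/* Take a reference; returns the new count. */
static int
sk_use_acquire(void)
{
	return (atomic_fetch_add_explicit(&sk_use_count, 1,
	    memory_order_acquire) + 1);
}

/* Drop a reference; returns the new count. */
static int
sk_use_release(void)
{
	int old;

	old = atomic_fetch_sub_explicit(&sk_use_count, 1,
	    memory_order_release);
	/* The count must never go negative. */
	assert(old > 0);
	return (old - 1);
}
```

Checking the value returned by the fetch-and-subtract, rather than re-reading the counter afterwards, keeps the assertion race-free.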
* Rework overflow checks of r247898 to not let too "intelligent" compiler to | mav | 2013-03-09 | 1 | -3/+4
  optimize it out.

  Submitted by: bde
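The log does not show the code involved, but the general hazard is well known: signed integer overflow is undefined behaviour in C, so a check of the form "add, then see whether the result wrapped" may legally be removed by the optimizer. A common remedy is to compare against the limit before adding. The sketch below illustrates that idiom only; `sk_deadline()` is a hypothetical helper and assumes both inputs are non-negative.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Add an interval to the current time without ever overflowing.
 * Returns 0 to mean "no usable deadline" when the sum would not fit.
 * Both arguments are assumed to be non-negative.
 */
static int64_t
sk_deadline(int64_t now, int64_t interval)
{
	/* Test first: INT64_MAX - now cannot overflow when now >= 0. */
	if (interval > INT64_MAX - now)
		return (0);
	return (now + interval);
}
```

Since no overflowing operation is ever evaluated, the compiler has no undefined behaviour to reason from and must keep the comparison.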
* Fix off-by-one error in nanoseconds validation. | mav | 2013-03-07 | 1 | -1/+1
  Submitted by: bde
* Fix time math overflows and improve zero intervals handling in poll(), | mav | 2013-03-06 | 1 | -5/+10
  select(), nanosleep() and kevent() functions after calloutng changes.

  Reported by: bde
* MFcalloutng: | davide | 2013-03-04 | 1 | -53/+29
  - Rewrite kevent() timeout implementation to allow sub-tick precision.
  - Make the interval timings for EVFILT_TIMER more accurate. This also removes a hack introduced in r238424.

  Sponsored by: Google Summer of Code 2012, iXsystems inc.
  Tested by: flo, marius, ian, markj, Fabian Keil
* Make the interval timings for EVFILT_TIMER more accurate. tvtohz() always | jhb | 2012-07-13 | 1 | -5/+12
  adds an extra tick to account for the current partial clock tick. However, that is not appropriate for a repeating timer when the exact tvtohz() value should be used for subsequent intervals. Fix repeating callouts for EVFILT_TIMER by subtracting 1 tick from the tvtohz() result similar to the fix used in realitexpire() for interval timers.

  While here, update a few comments to note that if the EVFILT_TIMER code were to move out of kern_event.c, it should move to kern_time.c (where the interval timer code it mimics lives) rather than kern_timeout.c.

  MFC after: 1 month
* Update comment. | pjd | 2012-06-14 | 1 | -1/+1
  MFC after: 1 month
* - Add knlist_init_rw_reader() function to kqueue(9). | melifaro | 2012-03-26 | 1 | -0/+42
  Function acquires reader lock if needed. Assert check for reader or writer lock (RA_LOCKED / RA_UNLOCKED)
  - While here, add knlist_init_mtx.9 to MLINKS and fix some style(9) issues

  Reviewed by: glebius
  Approved by: ae(mentor)
  MFC after: 2 weeks
* In order to maximize the re-usability of kernel code in user space this | kmacy | 2011-09-16 | 1 | -2/+2
  patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls.

  Reviewed by: rwatson
  Approved by: re (bz)
* Fix a deficiency in the selinfo interface: | attilio | 2011-08-25 | 1 | -0/+1
  If a selinfo object is recorded (via selrecord()) and then it is quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#.

  That happens because the selinfo interface has no way to drain the waiters before destroying the registered selinfo object. Also this race is quite rare to get in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object.

  Fix this by adding the seldrain() routine which should be called before destroying the selinfo objects (in order to avoid such case), and fix the present cases where it might have already been called. Sometimes, the context is safe enough to prevent this type of race, like it happens in device drivers which install selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the filedescriptors should be already closed, thus there cannot be a race. For this case, mfi(4) device driver can be set as an example, as it implements a full correct logic for preventing this from happening.

  Sponsored by: Sandvine Incorporated
  Reported by: rstone
  Tested by: pluknet
  Reviewed by: jhb, kib
  Approved by: re (bz)
  MFC after: 3 weeks
* Add the fo_chown and fo_chmod methods to struct fileops and use them | kib | 2011-08-16 | 1 | -0/+2
  to implement fchown(2) and fchmod(2) support for several file types that previously lacked it. Add MAC entries for chown/chmod done on posix shared memory and (old) in-kernel posix semaphores.

  Based on the submission by: glebius
  Reviewed by: rwatson
  Approved by: re (bz)
* Rename CAP_*_KEVENT to CAP_*_EVENT. | jonathan | 2011-08-12 | 1 | -3/+3
  Change the names of a couple of capability rights to be less FreeBSD-specific.

  Approved by: re (kib), mentor (rwatson)
  Sponsored by: Google Inc
* Second-to-last commit implementing Capsicum capabilities in the FreeBSD | rwatson | 2011-08-11 | 1 | -3/+4
  kernel for FreeBSD 9.0:

  Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op.

  Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions.

  In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit.

  Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent.

  Approved by: re (bz)
  Submitted by: jonathan
  Sponsored by: Google Inc
* After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9) | kib | 2011-04-01 | 1 | -1/+1
  and remove the falloc() version that lacks flag argument. This is done to reduce the KPI bloat.

  Requested by: jhb
  X-MFC-note: do not
* Defer freeing a kevent list until after dropping kqueue locks. | jhb | 2010-03-30 | 1 | -4/+6
  LOR: 185
  Submitted by: Matthew Fleming @ Isilon
  MFC after: 1 week
* Do not leak process lock when current thread is not allowed to see target. | kib | 2010-02-14 | 1 | -1/+3
  Bumped into by: ed
  MFC after: 3 days
* If a filter has already been added, actually return EEXIST when trying | brooks | 2009-12-31 | 1 | -1/+2
  to add it again.

  MFC after: 1 week
* The devices that supported EVFILT_NETDEV kqueue filters were removed in | brooks | 2009-12-31 | 1 | -1/+1
  r195175. Remove all definitions, documentation, and usage.

  fifo_misc.c: Remove all kqueue tests as fifo_io.c performs all those that would have remained.

  Reviewed by: rwatson
  MFC after: 3 weeks
  X-MFC note: don't change vlan_link_state() function signature
* Postpone dropping fp till both kq_global and kqueue mutexes are | kib | 2009-10-10 | 1 | -3/+3
  unlocked. fdrop() closes file descriptor when reference count goes to zero. Close method for vnodes locks the vnode, resulting in "sleepable after non-sleepable". For pipes, pipe mutex is before kqueue lock, causing LOR.

  Reported and tested by: pho
  MFC after: 2 weeks
* Use correct sizeof() object for klist 'list'. Currently, struct klist | delphij | 2009-09-28 | 1 | -4/+4
  contained only SLIST_HEAD as its member, thus sizeof(struct klist) would equal to sizeof(struct klist *), so this change makes the code more correct in terms of semantics, but should be a no-op to compiler at this time.

  Reported by: MQ <antinvidia at gmail com>
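The point of this commit is that a list head holding a single pointer happens to be the same size as a pointer to the head, so sizing an allocation by the wrong type goes unnoticed. A small sketch with the `<sys/queue.h>` macros shows both the coincidence and the safer `sizeof(*ptr)` idiom; `struct sk_knote`, `struct sk_klist` and `sk_klist_alloc()` are hypothetical names, and the size equality holds on common ABIs rather than being guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical list element and head type. */
struct sk_knote {
	SLIST_ENTRY(sk_knote) kn_link;
};
SLIST_HEAD(sk_klist, sk_knote);

/*
 * Allocate an array of n empty list heads.  Sizing by the pointed-to
 * object stays correct even if the head type later grows.
 */
static struct sk_klist *
sk_klist_alloc(size_t n)
{
	struct sk_klist *list;
	size_t i;

	list = calloc(n, sizeof(*list));
	if (list == NULL)
		return (NULL);
	for (i = 0; i < n; i++)
		SLIST_INIT(&list[i]);
	return (list);
}
```

Writing `sizeof(struct sk_klist *)` in the allocation would give the same number today, which is why the original mistake was a semantic issue and not a runtime bug.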
* Change unsigned foo to u_foo as required by style(9). | rdivacky | 2009-09-22 | 1 | -3/+3
  Requested by: bde
  Approved by: ed (mentor)
* Fix the style of the previous commit. | rdivacky | 2009-09-17 | 1 | -1/+2
  Approved by: ed (mentor, implicit)
* Make these argument/variable unsigned as the defines for them don't fit | rdivacky | 2009-09-17 | 1 | -3/+3
  into signed 32bit integer.

  Approved by: ed (mentor, implicit)
  Approved by: sson
* Add EV_RECEIPT to kevents. | sson | 2009-09-16 | 1 | -1/+1
  EV_RECEIPT is useful for disambiguating error conditions when multiple event structures are passed to kevent(2). The error code is returned in the data field and EV_ERROR is set.

  Approved by: rwatson (co-mentor)
* Add the EV_DISPATCH flag to kevents. | sson | 2009-09-16 | 1 | -2/+4
  When the EV_DISPATCH flag is used the event source will be disabled immediately after the delivery of an event. This is similar to the EV_ONESHOT flag but it doesn't delete the event.

  Approved by: rwatson (co-mentor)
* Add EVFILT_USER to kevents. | sson | 2009-09-16 | 1 | -0/+99
  Add user events support to kernel events which are not associated with any kernel mechanism but are triggered by user level code. This is useful for adding user level events to an event handler that may also be monitoring kernel events.

  Approved by: rwatson (co-mentor)
* Add optional touch event filter hooks to kevents. | sson | 2009-09-16 | 1 | -37/+56
  The touch event filter is called when a kernel event data is possibly updated. There are two hook points. First, during a kevent() system call. Second, when an event has been triggered.

  Approved by: rwatson (co-mentor)
* Use C99 initialization for struct filterops. | rwatson | 2009-09-12 | 1 | -10/+25
  Obtained from: Mac OS X
  Sponsored by: Apple Inc.
  MFC after: 3 weeks
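C99 designated initializers name each member explicitly, so a table of function pointers stays correct when members are added or reordered, and any member left out is zero-initialized. The sketch below shows the style on a hypothetical operations table; `struct sk_filterops` and its members are illustrative stand-ins and should not be read as the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>

struct sk_knote;	/* opaque in this sketch */

/* Hypothetical filter operations table. */
struct sk_filterops {
	int	f_isfd;
	int	(*f_attach)(struct sk_knote *);
	void	(*f_detach)(struct sk_knote *);
	int	(*f_event)(struct sk_knote *, long);
	void	(*f_touch)(struct sk_knote *, long);
};

static int
sk_attach(struct sk_knote *kn)
{
	(void)kn;
	return (0);
}

/* Report the event as active whenever the hint is non-zero. */
static int
sk_event(struct sk_knote *kn, long hint)
{
	(void)kn;
	return (hint != 0);
}

/* Members not named here (f_detach, f_touch) are implicitly NULL. */
static const struct sk_filterops sk_example_filtops = {
	.f_isfd = 1,
	.f_attach = sk_attach,
	.f_event = sk_event,
};
```

With positional initialization, adding an optional hook in the middle of the structure would silently shift every later entry; the designated form avoids that.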
* - Turn the third (islocked) argument of the knote call into flags parameter. | stas | 2009-06-28 | 1 | -6/+18
  Introduce the new flag KNF_NOKQLOCK to allow event callers to be called without KQ_LOCK mtx held.
  - Modify VFS knote calls to always use KNF_NOKQLOCK flag. This is required for ZFS as its getattr implementation may sleep.

  Approved by: re (rwatson)
  Reviewed by: kib
  MFC after: 2 weeks
* Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use | kib | 2009-06-10 | 1 | -14/+29
  vnode interlock to protect the knote fields [1]. The locking assumes that shared vnode lock is held, thus we get exclusive access to knote either by exclusive vnode lock protection, or by shared vnode lock + vnode interlock.

  Do not use kl_locked() method to assert either lock ownership or the fact that curthread does not own the lock. For shared locks, ownership is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared lock not owned by curthread, causing false positives in kqueue subsystem assertions about knlist lock.

  Remove kl_locked method from knlist lock vector, and add two separate assertion methods kl_assert_locked and kl_assert_unlocked, that are supposed to use proper asserts. Change knlist_init accordingly.

  Add convenience function knlist_init_mtx to reduce number of arguments for typical knlist initialization.

  Submitted by: jhb [1]
  Noted by: jhb [2]
  Reviewed by: jhb
  Tested by: rnoland
* Fix a number of style issues in the MALLOC / FREE commit. I've tried to | des | 2008-10-23 | 1 | -3/+2
  be careful not to fix anything that was already broken; the NFSv4 code is particularly bad in this respect.
* Retire the MALLOC and FREE macros. They are an abomination unto style(9). | des | 2008-10-23 | 1 | -6/+5
  MFC after: 3 months
* The kqueue_register() function assumes that it is called from the top of | kib | 2008-07-07 | 1 | -15/+67
  the syscall code and acquires various event subsystem locks as needed. The handling of the NOTE_TRACK for EVFILT_PROC is currently done by calling the kqueue_register() from filt_proc() filter, causing recursive entrance of the kqueue code. This results in the LORs and recursive acquisition of the locks.

  Implement the variant of the knote() function designed to only handle the fork() event. It mostly copies the knote() body, but also handles the NOTE_TRACK, removing the handling from the filt_proc(), where it causes problems described above. The function is called from the fork1() instead of knote().

  When encountering NOTE_TRACK knote, it marks the knote as influx and drops the knlist and kqueue lock. In this context call to kqueue_register is safe from the problems. An error from the kqueue_register() is reported to the observer as NOTE_TRACKERR fflag.

  PR: 108201
  Reviewed by: jhb, Pramod Srinivasan <pramod juniper net> (previous version)
  Discussed with: jmg
  Tested by: pho
  MFC after: 2 weeks
* In r178914 I erroneously put the setting of the KQ_FLUXWAIT flag before | kib | 2008-07-07 | 1 | -2/+1
  KQ_FLUX_WAKEUP(). Since the latter macro clears the KQ_FLUXWAIT, the kqueue_scan() thread may be not woken up. Move the setting of KQ_FLUXWAIT after wakeup to correct the issue.

  Reported and tested by: pho
  MFC after: 3 days
* Kqueue_scan() may sleep when it encounters influx knotes. On the other | kib | 2008-05-10 | 1 | -1/+10
  hand, it may cause other threads to sleep since kqueue_scan() may mark some knotes as influx. This could lead to the deadlock.

  Before kqueue_scan() sleeps, wakeup the threads that are waiting for the influx knotes produced by this thread.

  Tested by: pho (previous version)
  Reviewed by: jmg
  MFC after: 2 weeks
* The kqueue_close() encountering the KN_INFLUX knotes on the kq being | kib | 2008-05-10 | 1 | -4/+11
  closed is the legitimate situation. For instance, filedescriptor with registered events may be closed in parallel with closing the kqueue. Properly handle the case instead of asserting that this cannot happen.

  Reported and tested by: pho
  Reviewed by: jmg
  MFC after: 2 weeks
* - Convert two timeout users to the new callout_reset_curcpu() api. | jeff | 2008-04-02 | 1 | -3/+3
  Sponsored by: Nokia
* In keeping with style(9)'s recommendations on macros, use a ';' | rwatson | 2008-03-16 | 1 | -1/+1
  after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr.

  MFC after: 1 month
  Discussed with: imp, rink
* Make ftruncate a 'struct file' operation rather than a vnode operation. | jhb | 2008-01-07 | 1 | -0/+11
  This makes it possible to support ftruncate() on non-vnode file types in the future.
  - 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on a given file descriptor.
  - ftruncate() moves to kern/sys_generic.c and now just fetches a file object and invokes fo_truncate().
  - The vnode-specific portions of ftruncate() move to vn_truncate() in vfs_vnops.c which implements fo_truncate() for vnode file types.
  - Non-vnode file types return EINVAL in their fo_truncate() method.

  Submitted by: rwatson
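The dispatch pattern described in this commit, where a generic entry point calls a per-file-type method and unsupported types answer EINVAL, can be sketched as follows. All `sk_` names are hypothetical; the real kernel signatures take additional arguments (credentials, thread) that are omitted here for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct sk_file;

/* Hypothetical per-file-type method table with one method. */
struct sk_fileops {
	int	(*fo_truncate)(struct sk_file *, int64_t);
};

struct sk_file {
	const struct sk_fileops	*f_ops;
	int64_t			 f_size;
};

/* Vnode-like file: truncation is supported. */
static int
sk_vn_truncate(struct sk_file *fp, int64_t length)
{
	if (length < 0)
		return (EINVAL);
	fp->f_size = length;
	return (0);
}

/* Kqueue-like file: truncation makes no sense, so refuse it. */
static int
sk_kqueue_truncate(struct sk_file *fp, int64_t length)
{
	(void)fp;
	(void)length;
	return (EINVAL);
}

static const struct sk_fileops sk_vnops = {
	.fo_truncate = sk_vn_truncate,
};
static const struct sk_fileops sk_kqueueops = {
	.fo_truncate = sk_kqueue_truncate,
};

/* Generic entry point: no knowledge of the underlying file type. */
static int
sk_ftruncate(struct sk_file *fp, int64_t length)
{
	return (fp->f_ops->fo_truncate(fp, length));
}
```

Supporting truncation on a new file type then only requires supplying a different `fo_truncate` in that type's table; the generic entry point does not change.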