diff options
Diffstat (limited to 'lib/libc/sys/kse.2')
-rw-r--r-- | lib/libc/sys/kse.2 | 679 |
1 files changed, 679 insertions, 0 deletions
diff --git a/lib/libc/sys/kse.2 b/lib/libc/sys/kse.2 new file mode 100644 index 0000000..41fcc37 --- /dev/null +++ b/lib/libc/sys/kse.2 @@ -0,0 +1,679 @@ +.\" Copyright (c) 2002 Packet Design, LLC. +.\" All rights reserved. +.\" +.\" Subject to the following obligations and disclaimer of warranty, +.\" use and redistribution of this software, in source or object code +.\" forms, with or without modifications are expressly permitted by +.\" Packet Design; provided, however, that: +.\" +.\" (i) Any and all reproductions of the source or object code +.\" must include the copyright notice above and the following +.\" disclaimer of warranties; and +.\" (ii) No rights are granted, in any manner or form, to use +.\" Packet Design trademarks, including the mark "PACKET DESIGN" +.\" on advertising, endorsements, or otherwise except as such +.\" appears in the above copyright notice or in the software. +.\" +.\" THIS SOFTWARE IS BEING PROVIDED BY PACKET DESIGN "AS IS", AND +.\" TO THE MAXIMUM EXTENT PERMITTED BY LAW, PACKET DESIGN MAKES NO +.\" REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, REGARDING +.\" THIS SOFTWARE, INCLUDING WITHOUT LIMITATION, ANY AND ALL IMPLIED +.\" WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, +.\" OR NON-INFRINGEMENT. PACKET DESIGN DOES NOT WARRANT, GUARANTEE, +.\" OR MAKE ANY REPRESENTATIONS REGARDING THE USE OF, OR THE RESULTS +.\" OF THE USE OF THIS SOFTWARE IN TERMS OF ITS CORRECTNESS, ACCURACY, +.\" RELIABILITY OR OTHERWISE. IN NO EVENT SHALL PACKET DESIGN BE +.\" LIABLE FOR ANY DAMAGES RESULTING FROM OR ARISING OUT OF ANY USE +.\" OF THIS SOFTWARE, INCLUDING WITHOUT LIMITATION, ANY DIRECT, +.\" INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, PUNITIVE, OR CONSEQUENTIAL +.\" DAMAGES, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES, LOSS OF +.\" USE, DATA OR PROFITS, HOWEVER CAUSED AND UNDER ANY THEORY OF +.\" LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF +.\" THE USE OF THIS SOFTWARE, EVEN IF PACKET DESIGN IS ADVISED OF +.\" THE POSSIBILITY OF SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd February 13, 2007 +.Dt KSE 2 +.Os +.Sh NAME +.Nm kse +.Nd "kernel support for user threads" +.Sh LIBRARY +.Lb libc +.Sh SYNOPSIS +.In sys/types.h +.In sys/kse.h +.Ft int +.Fn kse_create "struct kse_mailbox *mbx" "int sys-scope" +.Ft int +.Fn kse_exit void +.Ft int +.Fn kse_release "struct timespec *timeout" +.Ft int +.Fn kse_switchin "struct kse_thr_mailbox *tmbx" "int flags" +.Ft int +.Fn kse_thr_interrupt "struct kse_thr_mailbox *tmbx" "int cmd" "long data" +.Ft int +.Fn kse_wakeup "struct kse_mailbox *mbx" +.Sh DESCRIPTION +These system calls implement kernel support for multi-threaded processes. +.\" +.Ss Overview +.\" +Traditionally, user threading has been implemented in one of two ways: +either all threads are managed in user space and the kernel is unaware +of any threading (also known as +.Dq "N to 1" ) , +or else separate processes sharing +a common memory space are created for each thread (also known as +.Dq "N to N" ) . +These approaches have advantages and disadvantages: +.Bl -column "- Cannot utilize multiple CPUs" "+ Can utilize multiple CPUs" +.It Sy "User threading Kernel threading" +.It "+ Lightweight - Heavyweight" +.It "+ User controls scheduling - Kernel controls scheduling" +.It "- Syscalls must be wrapped + No syscall wrapping required" +.It "- Cannot utilize multiple CPUs + Can utilize multiple CPUs" +.El +.Pp +The KSE system is a +hybrid approach that achieves the advantages of both the user and kernel +threading approaches. +The underlying philosophy of the KSE system is to give kernel support +for user threading without taking away any of the user threading library's +ability to make scheduling decisions. +A kernel-to-user upcall mechanism is used to pass control to the user +threading library whenever a scheduling decision needs to be made. +An arbitrarily number of user threads are multiplexed onto a fixed number of +virtual CPUs supplied by the kernel. +This can be thought of as an +.Dq "N to M" +threading scheme. +.Pp +Some general implications of this approach include: +.Bl -bullet +.It +The user process can run multiple threads simultaneously on multi-processor +machines. +The kernel grants the process virtual CPUs to schedule as it +wishes; these may run concurrently on real CPUs. +.It +All operations that block in the kernel become asynchronous, allowing +the user process to schedule another thread when any thread blocks. +.El +.\" +.Ss Definitions +.\" +KSE allows a user process to have multiple +.Sy threads +of execution in existence at the same time, some of which may be blocked +in the kernel while others may be executing or blocked in user space. +A +.Sy "kernel scheduling entity" +(KSE) is a +.Dq "virtual CPU" +granted to the process for the purpose of executing threads. +A thread that is currently executing is always associated with +exactly one KSE, whether executing in user space or in the kernel. +The KSE is said to be +.Sy assigned +to the thread. +KSEs (a user abstraction) are implemented on top +of kernel threads using an 'upcall' entity. +.Pp +The KSE becomes +.Sy unassigned , +and the associated thread is suspended, when the KSE has an associated +.Sy mailbox , +(see below) the thread has an associated +.Sy thread mailbox , +(also see below) and any of the following occurs: +.Bl -bullet +.It +The thread invokes a system call that blocks. +.It +The thread makes any other demand of the kernel that cannot be immediately +satisfied, e.g., touches a page of memory that needs to be fetched from disk, +causing a page fault. +.It +Another thread that was previously blocked in the kernel completes its +work in the kernel (or is +.Sy interrupted ) +and becomes ready to return to user space, and the current thread is returning +to user space. +.It +A signal is delivered to the process, and this KSE is chosen to deliver it. +.El +.Pp +In other words, as soon as there is a scheduling decision to be made, +the KSE becomes unassigned, because the kernel does not presume to know +how the process' other runnable threads should be scheduled. +Unassigned KSEs always return to user space as soon as possible via +the +.Sy upcall +mechanism (described below), allowing the user process to decide how +that KSE should be utilized next. +KSEs always complete as much work as possible in the kernel before +becoming unassigned. +.Pp +Individual KSEs within a process are effectively indistinguishable, +and any KSE in a process may be assigned by the kernel to any runnable +(in the kernel) thread associated with that process. +In practice, the kernel attempts to preserve the affinity between threads +and actual CPUs to optimize cache behavior, but this is invisible to the +user process. +(Affinity is not yet fully implemented.) +.Pp +Each KSE has a unique +.Sy "KSE mailbox" +supplied by the user process. +A mailbox consists of a control structure containing a pointer to an +.Sy "upcall function" +and a user stack. +The KSE invokes this function whenever it becomes unassigned. +The kernel updates this structure with information about threads that have +become runnable and signals that have been delivered before each upcall. +Upcalls may be temporarily blocked by the user thread scheduling code +during critical sections. +.Pp +Each user thread has a unique +.Sy "thread mailbox" +as well. +Threads are referred to using pointers to these mailboxes when communicating +between the kernel and the user thread scheduler. +Each KSE's mailbox contains a pointer to the mailbox of the user thread +that the KSE is currently executing. +This pointer is saved when the thread blocks in the kernel. +.Pp +Whenever a thread blocked in the kernel is ready to return to user space, +it is added to the process's list of +.Sy completed +threads. +This list is presented to the user code at the next upcall as a linked list +of thread mailboxes. +.Pp +There is a kernel-imposed limit on the number of threads in a process +that may be simultaneously blocked in the kernel (this number is not +currently visible to the user). +When this limit is reached, upcalls are blocked and no work is performed +for the process until one of the threads completes (or a signal is +received). +.\" +.Ss Managing KSEs +.\" +To become multi-threaded, a process must first invoke +.Fn kse_create . +The +.Fn kse_create +system call +creates a new KSE (except for the very first invocation; see below). +The KSE will be associated with the mailbox pointed to by +.Fa mbx . +If +.Fa sys_scope +is non-zero, then the new thread will be counted as a system scope +thread. Other things must be done as well to make a system scope thread +so this is not sufficient (yet). +System scope variables are not covered +in detail in this manual page yet, but briefly, they never perform +upcalls and do not return to the user thread scheduler. +Once launched they run autonomously. +The pthreads library knows how to make system +scope threads and users are encouraged to use the library interface. +.Pp +Each process initially has a single KSE executing a single user thread. +Since the KSE does not have an associated mailbox, it must remain assigned +to the thread and does not perform any upcalls. +(It is by definition a system scope thread). +The result is the traditional, unthreaded mode of operation. +Therefore, as a special case, the first call to +.Fn kse_create +by this initial thread with +.Fa sys_scope +equal to zero does not create a new KSE; instead, it simply associates the +current KSE with the supplied KSE mailbox, and no immediate upcall results. +However, an upcall will be triggered the next time the thread blocks and +the required conditions are met. +.Pp +The kernel does not allow more KSEs to exist in a process than the +number of physical CPUs in the system (this number is available as the +.Xr sysctl 3 +variable +.Va hw.ncpu ) . +Having more KSEs than CPUs would not add any value to the user process, +as the additional KSEs would just compete with each other for access to +the real CPUs. +Since the extra KSEs would always be side-lined, the result +to the application would be exactly the same as having fewer KSEs. +There may however be arbitrarily many user threads, and it is up to the +user thread scheduler to handle mapping the application's user threads +onto the available KSEs. +.Pp +The +.Fn kse_exit +system call +causes the KSE assigned to the currently running thread to be destroyed. +If this KSE is the last one in the process, there must be no remaining +threads associated with that process blocked in the kernel. +This system call does not return unless there is an error. +Calling +.Fn kse_exit +from the last thread is the same as calling +.Fn exit . +.Pp +The +.Fn kse_release +system call +is used to +.Dq park +the KSE assigned to the currently running thread when it is not needed, +e.g., when there are more available KSEs than runnable user threads. +The thread converts to an upcall but does not get scheduled until +there is a new reason to do so, e.g., a previously +blocked thread becomes runnable, or the timeout expires. +If successful, +.Fn kse_release +does not return to the caller. +.Pp +The +.Fn kse_switchin +system call can be used by the UTS, when it has selected a new thread, +to switch to the context of that thread. +The use of +.Fn kse_switchin +is machine dependent. +Some platforms do not need a system call to switch to a new context, +while others require its use in particular cases. +.Pp +The +.Fn kse_wakeup +system call +is the opposite of +.Fn kse_release . +It causes the (parked) KSE associated with the mailbox pointed to by +.Fa mbx +to be woken up, causing it to upcall. +If the KSE has already woken up for another reason, this system call has no +effect. +The +.Fa mbx +argument +may be +.Dv NULL +to specify +.Dq "any KSE in the current process" . +.Pp +The +.Fn kse_thr_interrupt +system call +is used to interrupt a currently blocked thread. +The thread must either be blocked in the kernel or assigned to a KSE +(i.e., executing). +The thread is then marked as interrupted. +As soon as the thread invokes an interruptible system call (or immediately +for threads already blocked in one), the thread will be made runnable again, +even though the kernel operation may not have completed. +The effect on the interrupted system call is the same as if it had been +interrupted by a signal; typically this means an error is returned with +.Va errno +set to +.Er EINTR . +.\" +.Ss Signals +.\" +The current implementation creates a special signal thread. +Kernel threads (KSEs) in a process mask all signals, and only the signal +thread waits for signals to be delivered to the process, the signal thread +is responsible +for dispatching signals to user threads. +.Pp +A downside of this is that if a multiplexed thread +calls the +.Fn execve +syscall, its signal mask and pending signals may not be +available in the kernel. +They are stored +in userland and the kernel does not know where to get them, however +.Tn POSIX +requires them to be restored and passed them to new process. +Just setting the mask for the thread before calling +.Fn execve +is only a +close approximation to the problem as it does not re-deliver back to the kernel +any pending signals that the old process may have blocked, and it allows a +window in which new signals may be delivered to the process between the setting +of the mask and the +.Fn execve . +.Pp +For now this problem has been solved by adding a special combined +.Fn kse_thr_interrupt Ns / Ns Fn execve +mode to the +.Fn kse_thr_interrupt +syscall. +The +.Fn kse_thr_interrupt +syscall has a sub command +.Dv KSE_INTR_EXECVE , +that allows it to accept a +.Vt kse_execv_args +structure, and allowing it to adjust the signals and then atomically +convert into an +.Fn execve +call. +Additional pending signals and the correct signal mask can be passed +to the kernel in this way. +The thread library overrides the +.Fn execve +syscall +and translates it into +.Fn kse_intr_interrupt +call, allowing a multiplexed thread +to restore pending signals and the correct signal mask before doing the +.Fn exec . +This solution to the problem may change. +.\" +.Ss KSE Mailboxes +.\" +Each KSE has a unique mailbox for user-kernel communication defined in +.In sys/kse.h . +Some of the fields there are: +.Pp +.Va km_version +describes the version of this structure and must be equal to +.Dv KSE_VER_0 . +.Va km_udata +is an opaque pointer ignored by the kernel. +.Pp +.Va km_func +points to the KSE's upcall function; +it will be invoked using +.Va km_stack , +which must remain valid for the lifetime of the KSE. +.Pp +.Va km_curthread +always points to the thread that is currently assigned to this KSE if any, +or +.Dv NULL +otherwise. +This field is modified by both the kernel and the user process as follows. +.Pp +When +.Va km_curthread +is not +.Dv NULL , +it is assumed to be pointing at the mailbox for the currently executing +thread, and the KSE may be unassigned, e.g., if the thread blocks in the +kernel. +The kernel will then save the contents of +.Va km_curthread +with the blocked thread, set +.Va km_curthread +to +.Dv NULL , +and upcall to invoke +.Fn km_func . +.Pp +When +.Va km_curthread +is +.Dv NULL , +the kernel will never perform any upcalls with this KSE; in other words, +the KSE remains assigned to the thread even if it blocks. +.Va km_curthread +must be +.Dv NULL +while the KSE is executing critical user thread scheduler +code that would be disrupted by an intervening upcall; +in particular, while +.Fn km_func +itself is executing. +.Pp +Before invoking +.Fn km_func +in any upcall, the kernel always sets +.Va km_curthread +to +.Dv NULL . +Once the user thread scheduler has chosen a new thread to run, +it should point +.Va km_curthread +at the thread's mailbox, re-enabling upcalls, and then resume the thread. +.Em Note : +modification of +.Va km_curthread +by the user thread scheduler must be atomic +with the loading of the context of the new thread, to avoid +the situation where the thread context area +may be modified by a blocking async operation, while there +is still valid information to be read out of it. +.Pp +.Va km_completed +points to a linked list of user threads that have completed their work +in the kernel since the last upcall. +The user thread scheduler should put these threads back into its +own runnable queue. +Each thread in a process that completes a kernel operation +(synchronous or asynchronous) that results in an upcall is guaranteed to be +linked into exactly one KSE's +.Va km_completed +list; which KSE in the group, however, is indeterminate. +Furthermore, the completion will be reported in only one upcall. +.Pp +.Va km_sigscaught +contains the list of signals caught by this process since the previous +upcall to any KSE in the process. +As long as there exists one or more KSEs with an associated mailbox in +the user process, signals are delivered this way rather than the +traditional way. +(This has not been implemented and may change.) +.Pp +.Va km_timeofday +is set by the kernel to the current system time before performing +each upcall. +.Pp +.Va km_flags +may contain any of the following bits OR'ed together: +.Bl -tag -width indent +.It Dv KMF_NOUPCALL +Block upcalls from happening. +The thread is in some critical section. +.It Dv KMF_NOCOMPLETED , KMF_DONE , KMF_BOUND +This thread should be considered to be permanently bound to +its KSE, and treated much like a non-threaded process would be. +It is a +.Dq "long term" +version of +.Dv KMF_NOUPCALL +in some ways. +.It Dv KMF_WAITSIGEVENT +Implement characteristics needed for the signal delivery thread. +.El +.\" +.Ss Thread Mailboxes +.\" +Each user thread must have associated with it a unique +.Vt "struct kse_thr_mailbox" +as defined in +.In sys/kse.h . +It includes the following fields. +.Pp +.Va tm_udata +is an opaque pointer ignored by the kernel. +.Pp +.Va tm_context +stores the context for the thread when the thread is blocked in user space. +This field is also updated by the kernel before a completed thread is returned +to the user thread scheduler via +.Va km_completed . +.Pp +.Va tm_next +links the +.Va km_completed +threads together when returned by the kernel with an upcall. +The end of the list is marked with a +.Dv NULL +pointer. +.Pp +.Va tm_uticks +and +.Va tm_sticks +are time counters for user mode and kernel mode execution, respectively. +These counters count ticks of the statistics clock (see +.Xr clocks 7 ) . +While any thread is actively executing in the kernel, the corresponding +.Va tm_sticks +counter is incremented. +While any KSE is executing in user space and that KSE's +.Va km_curthread +pointer is not equal to +.Dv NULL , +the corresponding +.Va tm_uticks +counter is incremented. +.Pp +.Va tm_flags +may contain any of the following bits OR'ed together: +.Bl -tag -width indent +.It Dv TMF_NOUPCALL +Similar to +.Dv KMF_NOUPCALL . +This flag inhibits upcalling for critical sections. +Some architectures require this to be in one place and some in the other. +.El +.Sh RETURN VALUES +The +.Fn kse_create , +.Fn kse_wakeup , +and +.Fn kse_thr_interrupt +system calls +return zero if successful. +The +.Fn kse_exit +and +.Fn kse_release +system calls +do not return if successful. +.Pp +All of these system calls return a non-zero error code in case of an error. +.Sh ERRORS +The +.Fn kse_create +system call +will fail if: +.Bl -tag -width Er +.It Bq Er ENXIO +There are already as many KSEs in the process as hardware processors. +.It Bq Er EAGAIN +The user is not the super user, and the soft resource limit corresponding +to the +.Fa resource +argument +.Dv RLIMIT_NPROC +would be exceeded (see +.Xr getrlimit 2 ) . +.It Bq Er EFAULT +The +.Fa mbx +argument +points to an address which is not a valid part of the process address space. +.El +.Pp +The +.Fn kse_exit +system call +will fail if: +.Bl -tag -width Er +.It Bq Er EDEADLK +The current KSE is the last in its process and there are still one or more +threads associated with the process blocked in the kernel. +.It Bq Er ESRCH +The current KSE has no associated mailbox, i.e., the process is operating +in traditional, unthreaded mode (in this case use +.Xr _exit 2 +to exit the process). +.El +.Pp +The +.Fn kse_release +system call +will fail if: +.Bl -tag -width Er +.It Bq Er ESRCH +The current KSE has no associated mailbox, i.e., the process is operating is +traditional, unthreaded mode. +.El +.Pp +The +.Fn kse_wakeup +system call +will fail if: +.Bl -tag -width Er +.It Bq Er ESRCH +The +.Fa mbx +argument +is not +.Dv NULL +and the mailbox pointed to by +.Fa mbx +is not associated with any KSE in the process. +.It Bq Er ESRCH +The +.Fa mbx +argument +is +.Dv NULL +and the current KSE has no associated mailbox, i.e., the process is operating +in traditional, unthreaded mode. +.El +.Pp +The +.Fn kse_thr_interrupt +system call +will fail if: +.Bl -tag -width Er +.It Bq Er ESRCH +The thread corresponding to +.Fa tmbx +is neither currently assigned to any KSE in the process nor blocked in the +kernel. +.El +.Sh SEE ALSO +.Xr rfork 2 , +.Xr pthread 3 , +.Xr ucontext 3 +.Rs +.%A "Thomas E. Anderson" +.%A "Brian N. Bershad" +.%A "Edward D. Lazowska" +.%A "Henry M. Levy" +.%J "ACM Transactions on Computer Systems" +.%N Issue 1 +.%V Volume 10 +.%D February 1992 +.%I ACM Press +.%P pp. 53-79 +.%T "Scheduler activations: effective kernel support for the user-level management of parallelism" +.Re +.Sh HISTORY +The KSE system calls first appeared in +.Fx 5.0 . +.Sh AUTHORS +KSE was originally implemented by +.An -nosplit +.An "Julian Elischer" Aq julian@FreeBSD.org , +with additional contributions by +.An "Jonathan Mini" Aq mini@FreeBSD.org , +.An "Daniel Eischen" Aq deischen@FreeBSD.org , +and +.An "David Xu" Aq davidxu@FreeBSD.org . +.Pp +This manual page was written by +.An "Archie Cobbs" Aq archie@FreeBSD.org . +.Sh BUGS +The KSE code is +.Ud |