1 files changed, 679 insertions, 0 deletions
diff --git a/lib/libc/sys/kse.2 b/lib/libc/sys/kse.2
new file mode 100644
index 0000000..41fcc37
--- /dev/null
+++ b/lib/libc/sys/kse.2
@@ -0,0 +1,679 @@
+.\" Copyright (c) 2002 Packet Design, LLC.
+.\" All rights reserved.
+.\"
+.\" Subject to the following obligations and disclaimer of warranty,
+.\" use and redistribution of this software, in source or object code
+.\" forms, with or without modifications are expressly permitted by
+.\" Packet Design; provided, however, that:
+.\"
+.\"    (i)  Any and all reproductions of the source or object code
+.\"         must include the copyright notice above and the following
+.\"         disclaimer of warranties; and
+.\"    (ii) No rights are granted, in any manner or form, to use
+.\"         Packet Design trademarks, including the mark "PACKET DESIGN"
+.\"         on advertising, endorsements, or otherwise except as such
+.\"         appears in the above copyright notice or in the software.
+.\"
+.\" THIS SOFTWARE IS BEING PROVIDED BY PACKET DESIGN "AS IS", AND
+.\" TO THE MAXIMUM EXTENT PERMITTED BY LAW, PACKET DESIGN MAKES NO
+.\" REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, REGARDING
+.\" THIS SOFTWARE, INCLUDING WITHOUT LIMITATION, ANY AND ALL IMPLIED
+.\" WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE,
+.\" OR NON-INFRINGEMENT.  PACKET DESIGN DOES NOT WARRANT, GUARANTEE,
+.\" OR MAKE ANY REPRESENTATIONS REGARDING THE USE OF, OR THE RESULTS
+.\" OF THE USE OF THIS SOFTWARE IN TERMS OF ITS CORRECTNESS, ACCURACY,
+.\" RELIABILITY OR OTHERWISE.  IN NO EVENT SHALL PACKET DESIGN BE
+.\" LIABLE FOR ANY DAMAGES RESULTING FROM OR ARISING OUT OF ANY USE
+.\" OF THIS SOFTWARE, INCLUDING WITHOUT LIMITATION, ANY DIRECT,
+.\" INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, PUNITIVE, OR CONSEQUENTIAL
+.\" DAMAGES, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES, LOSS OF
+.\" USE, DATA OR PROFITS, HOWEVER CAUSED AND UNDER ANY THEORY OF
+.\" LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+.\" THE USE OF THIS SOFTWARE, EVEN IF PACKET DESIGN IS ADVISED OF
+.\" THE POSSIBILITY OF SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd February 13, 2007
+.Dt KSE 2
+.Os
+.Sh NAME
+.Nm kse
+.Nd "kernel support for user threads"
+.Sh LIBRARY
+.Lb libc
+.Sh SYNOPSIS
+.In sys/types.h
+.In sys/kse.h
+.Ft int
+.Fn kse_create "struct kse_mailbox *mbx" "int sys_scope"
+.Ft int
+.Fn kse_exit void
+.Ft int
+.Fn kse_release "struct timespec *timeout"
+.Ft int
+.Fn kse_switchin "struct kse_thr_mailbox *tmbx" "int flags"
+.Ft int
+.Fn kse_thr_interrupt "struct kse_thr_mailbox *tmbx" "int cmd" "long data"
+.Ft int
+.Fn kse_wakeup "struct kse_mailbox *mbx"
+.Sh DESCRIPTION
+These system calls implement kernel support for multi-threaded processes.
+.\"
+.Ss Overview
+.\"
+Traditionally, user threading has been implemented in one of two ways:
+either all threads are managed in user space and the kernel is unaware
+of any threading (also known as
+.Dq "N to 1" ) ,
+or else separate processes sharing
+a common memory space are created for each thread (also known as
+.Dq "N to N" ) .
+These approaches have advantages and disadvantages:
+.Bl -column "- Cannot utilize multiple CPUs" "+ Can utilize multiple CPUs"
+.It Sy "User threading	Kernel threading"
+.It "+ Lightweight	- Heavyweight"
+.It "+ User controls scheduling	- Kernel controls scheduling"
+.It "- Syscalls must be wrapped	+ No syscall wrapping required"
+.It "- Cannot utilize multiple CPUs	+ Can utilize multiple CPUs"
+.El
+.Pp
+The KSE system is a
+hybrid approach that achieves the advantages of both the user and kernel
+threading approaches.
+The underlying philosophy of the KSE system is to give kernel support
+for user threading without taking away any of the user threading library's
+ability to make scheduling decisions.
+A kernel-to-user upcall mechanism is used to pass control to the user
+threading library whenever a scheduling decision needs to be made.
+An arbitrary number of user threads are multiplexed onto a fixed number of
+virtual CPUs supplied by the kernel.
+This can be thought of as an
+.Dq "N to M"
+threading scheme.
+.Pp
+Some general implications of this approach include:
+.Bl -bullet
+.It
+The user process can run multiple threads simultaneously on multi-processor
+machines.
+The kernel grants the process virtual CPUs to schedule as it
+wishes; these may run concurrently on real CPUs.
+.It
+All operations that block in the kernel become asynchronous, allowing
+the user process to schedule another thread when any thread blocks.
+.El
+.\"
+.Ss Definitions
+.\"
+KSE allows a user process to have multiple
+.Sy threads
+of execution in existence at the same time, some of which may be blocked
+in the kernel while others may be executing or blocked in user space.
+A
+.Sy "kernel scheduling entity"
+(KSE) is a
+.Dq "virtual CPU"
+granted to the process for the purpose of executing threads.
+A thread that is currently executing is always associated with
+exactly one KSE, whether executing in user space or in the kernel.
+The KSE is said to be
+.Sy assigned
+to the thread.
+KSEs (a user abstraction) are implemented on top
+of kernel threads using an 'upcall' entity.
+.Pp
+The KSE becomes
+.Sy unassigned ,
+and the associated thread is suspended, when the KSE has an associated
+.Sy mailbox
+(see below), the thread has an associated
+.Sy thread mailbox
+(also see below), and any of the following occurs:
+.Bl -bullet
+.It
+The thread invokes a system call that blocks.
+.It
+The thread makes any other demand of the kernel that cannot be immediately
+satisfied, e.g., touches a page of memory that needs to be fetched from disk,
+causing a page fault.
+.It
+Another thread that was previously blocked in the kernel completes its
+work in the kernel (or is
+.Sy interrupted )
+and becomes ready to return to user space, and the current thread is returning
+to user space.
+.It
+A signal is delivered to the process, and this KSE is chosen to deliver it.
+.El
+.Pp
+In other words, as soon as there is a scheduling decision to be made,
+the KSE becomes unassigned, because the kernel does not presume to know
+how the process's other runnable threads should be scheduled.
+Unassigned KSEs always return to user space as soon as possible via
+the
+.Sy upcall
+mechanism (described below), allowing the user process to decide how
+that KSE should be utilized next.
+KSEs always complete as much work as possible in the kernel before
+becoming unassigned.
+.Pp
+Individual KSEs within a process are effectively indistinguishable,
+and any KSE in a process may be assigned by the kernel to any runnable
+(in the kernel) thread associated with that process.
+In practice, the kernel attempts to preserve the affinity between threads
+and actual CPUs to optimize cache behavior, but this is invisible to the
+user process.
+(Affinity is not yet fully implemented.)
+.Pp
+Each KSE has a unique
+.Sy "KSE mailbox"
+supplied by the user process.
+A mailbox consists of a control structure containing a pointer to an
+.Sy "upcall function"
+and a user stack.
+The KSE invokes this function whenever it becomes unassigned.
+The kernel updates this structure with information about threads that have
+become runnable and signals that have been delivered before each upcall.
+Upcalls may be temporarily blocked by the user thread scheduling code
+during critical sections.
+.Pp
+Each user thread has a unique
+.Sy "thread mailbox"
+as well.
+Threads are referred to using pointers to these mailboxes when communicating
+between the kernel and the user thread scheduler.
+Each KSE's mailbox contains a pointer to the mailbox of the user thread
+that the KSE is currently executing.
+This pointer is saved when the thread blocks in the kernel.
+.Pp
+Whenever a thread blocked in the kernel is ready to return to user space,
+it is added to the process's list of
+.Sy completed
+threads.
+This list is presented to the user code at the next upcall as a linked list
+of thread mailboxes.
+.Pp
+There is a kernel-imposed limit on the number of threads in a process
+that may be simultaneously blocked in the kernel (this number is not
+currently visible to the user).
+When this limit is reached, upcalls are blocked and no work is performed
+for the process until one of the threads completes (or a signal is
+received).
+.\"
+.Ss Managing KSEs
+.\"
+To become multi-threaded, a process must first invoke
+.Fn kse_create .
+The
+.Fn kse_create
+system call
+creates a new KSE (except for the very first invocation; see below).
+The KSE will be associated with the mailbox pointed to by
+.Fa mbx .
+If
+.Fa sys_scope
+is non-zero, then the new thread will be counted as a system scope thread.
+This alone is not (yet) sufficient to make a system scope thread;
+other steps are required as well.
+System scope threads are not covered
+in detail in this manual page yet, but briefly, they never perform
+upcalls and do not return to the user thread scheduler.
+Once launched they run autonomously.
+The pthreads library knows how to make system
+scope threads and users are encouraged to use the library interface.
+.Pp
+Each process initially has a single KSE executing a single user thread.
+Since the KSE does not have an associated mailbox, it must remain assigned
+to the thread and does not perform any upcalls.
+(It is by definition a system scope thread.)
+The result is the traditional, unthreaded mode of operation.
+Therefore, as a special case, the first call to
+.Fn kse_create
+by this initial thread with
+.Fa sys_scope
+equal to zero does not create a new KSE; instead, it simply associates the
+current KSE with the supplied KSE mailbox, and no immediate upcall results.
+However, an upcall will be triggered the next time the thread blocks and
+the required conditions are met.
+.Pp
+The kernel does not allow more KSEs to exist in a process than the
+number of physical CPUs in the system (this number is available as the
+.Xr sysctl 3
+variable
+.Va hw.ncpu ) .
+Having more KSEs than CPUs would not add any value to the user process,
+as the additional KSEs would just compete with each other for access to
+the real CPUs.
+Since the extra KSEs would always be side-lined, the result
+to the application would be exactly the same as having fewer KSEs.
+There may however be arbitrarily many user threads, and it is up to the
+user thread scheduler to handle mapping the application's user threads
+onto the available KSEs.
+.Pp
+The
+.Fn kse_exit
+system call
+causes the KSE assigned to the currently running thread to be destroyed.
+If this KSE is the last one in the process, there must be no remaining
+threads associated with that process blocked in the kernel.
+This system call does not return unless there is an error.
+Calling
+.Fn kse_exit
+from the last thread is the same as calling
+.Fn exit .
+.Pp
+The
+.Fn kse_release
+system call
+is used to
+.Dq park
+the KSE assigned to the currently running thread when it is not needed,
+e.g., when there are more available KSEs than runnable user threads.
+The thread converts to an upcall but does not get scheduled until
+there is a new reason to do so, e.g., a previously
+blocked thread becomes runnable, or the timeout expires.
+If successful,
+.Fn kse_release
+does not return to the caller.
+.Pp
+The
+.Fn kse_switchin
+system call can be used by the UTS, when it has selected a new thread,
+to switch to the context of that thread.
+The use of
+.Fn kse_switchin
+is machine dependent.
+Some platforms do not need a system call to switch to a new context,
+while others require its use in particular cases.
+.Pp
+The
+.Fn kse_wakeup
+system call
+is the opposite of
+.Fn kse_release .
+It causes the (parked) KSE associated with the mailbox pointed to by
+.Fa mbx
+to be woken up, causing it to upcall.
+If the KSE has already woken up for another reason, this system call has no
+effect.
+The
+.Fa mbx
+argument
+may be
+.Dv NULL
+to specify
+.Dq "any KSE in the current process" .
+.Pp
+The
+.Fn kse_thr_interrupt
+system call
+is used to interrupt a currently blocked thread.
+The thread must either be blocked in the kernel or assigned to a KSE
+(i.e., executing).
+The thread is then marked as interrupted.
+As soon as the thread invokes an interruptible system call (or immediately
+for threads already blocked in one), the thread will be made runnable again,
+even though the kernel operation may not have completed.
+The effect on the interrupted system call is the same as if it had been
+interrupted by a signal; typically this means an error is returned with
+.Va errno
+set to
+.Er EINTR .
+.\"
+.Ss Signals
+.\"
+The current implementation creates a special signal thread.
+Kernel threads (KSEs) in a process mask all signals, and only the signal
+thread waits for signals to be delivered to the process; the signal thread
+is responsible
+for dispatching signals to user threads.
+.Pp
+A downside of this is that if a multiplexed thread
+calls the
+.Fn execve
+syscall, its signal mask and pending signals may not be
+available in the kernel.
+They are stored
+in userland and the kernel does not know where to get them; however,
+.Tn POSIX
+requires them to be restored and passed to the new process.
+Just setting the mask for the thread before calling
+.Fn execve
+is only a
+close approximation to a solution, as it does not re-deliver back to the kernel
+any pending signals that the old process may have blocked, and it allows a
+window in which new signals may be delivered to the process between the setting
+of the mask and the
+.Fn execve .
+.Pp
+For now this problem has been solved by adding a special combined
+.Fn kse_thr_interrupt Ns / Ns Fn execve
+mode to the
+.Fn kse_thr_interrupt
+syscall.
+The
+.Fn kse_thr_interrupt
+syscall has a subcommand,
+.Dv KSE_INTR_EXECVE ,
+that allows it to accept a
+.Vt kse_execv_args
+structure, adjust the signals, and then atomically
+convert into an
+.Fn execve
+call.
+Additional pending signals and the correct signal mask can be passed
+to the kernel in this way.
+The thread library overrides the
+.Fn execve
+syscall
+and translates it into a
+.Fn kse_thr_interrupt
+call, allowing a multiplexed thread
+to restore pending signals and the correct signal mask before doing the
+.Fn exec .
+This solution to the problem may change.
+.\"
+.Ss KSE Mailboxes
+.\"
+Each KSE has a unique mailbox for user-kernel communication defined in
+.In sys/kse.h .
+Some of the fields there are:
+.Pp
+.Va km_version
+describes the version of this structure and must be equal to
+.Dv KSE_VER_0 .
+.Va km_udata
+is an opaque pointer ignored by the kernel.
+.Pp
+.Va km_func
+points to the KSE's upcall function;
+it will be invoked using
+.Va km_stack ,
+which must remain valid for the lifetime of the KSE.
+.Pp
+.Va km_curthread
+always points to the thread that is currently assigned to this KSE if any,
+or
+.Dv NULL
+otherwise.
+This field is modified by both the kernel and the user process as follows.
+.Pp
+When
+.Va km_curthread
+is not
+.Dv NULL ,
+it is assumed to be pointing at the mailbox for the currently executing
+thread, and the KSE may be unassigned, e.g., if the thread blocks in the
+kernel.
+The kernel will then save the contents of
+.Va km_curthread
+with the blocked thread, set
+.Va km_curthread
+to
+.Dv NULL ,
+and upcall to invoke
+.Fn km_func .
+.Pp
+When
+.Va km_curthread
+is
+.Dv NULL ,
+the kernel will never perform any upcalls with this KSE; in other words,
+the KSE remains assigned to the thread even if it blocks.
+.Va km_curthread
+must be
+.Dv NULL
+while the KSE is executing critical user thread scheduler
+code that would be disrupted by an intervening upcall;
+in particular, while
+.Fn km_func
+itself is executing.
+.Pp
+Before invoking
+.Fn km_func
+in any upcall, the kernel always sets
+.Va km_curthread
+to
+.Dv NULL .
+Once the user thread scheduler has chosen a new thread to run,
+it should point
+.Va km_curthread
+at the thread's mailbox, re-enabling upcalls, and then resume the thread.
+.Em Note :
+modification of
+.Va km_curthread
+by the user thread scheduler must be atomic
+with the loading of the context of the new thread, to avoid
+the situation where the thread context area
+may be modified by a blocking async operation, while there
+is still valid information to be read out of it.
+.Pp
+.Va km_completed
+points to a linked list of user threads that have completed their work
+in the kernel since the last upcall.
+The user thread scheduler should put these threads back into its
+own runnable queue.
+Each thread in a process that completes a kernel operation
+(synchronous or asynchronous) that results in an upcall is guaranteed to be
+linked into exactly one KSE's
+.Va km_completed
+list; which KSE in the group, however, is indeterminate.
+Furthermore, the completion will be reported in only one upcall.
+.Pp
+.Va km_sigscaught
+contains the list of signals caught by this process since the previous
+upcall to any KSE in the process.
+As long as there exist one or more KSEs with an associated mailbox in
+the user process, signals are delivered this way rather than the
+traditional way.
+(This has not been implemented and may change.)
+.Pp
+.Va km_timeofday
+is set by the kernel to the current system time before performing
+each upcall.
+.Pp
+.Va km_flags
+may contain any of the following bits OR'ed together:
+.Bl -tag -width indent
+.It Dv KMF_NOUPCALL
+Block upcalls from happening.
+The thread is in some critical section.
+.It Dv KMF_NOCOMPLETED , KMF_DONE , KMF_BOUND
+This thread should be considered to be permanently bound to
+its KSE, and treated much like a non-threaded process would be.
+It is a
+.Dq "long term"
+version of
+.Dv KMF_NOUPCALL
+in some ways.
+.It Dv KMF_WAITSIGEVENT
+Implement characteristics needed for the signal delivery thread.
+.El
+.\"
+.Ss Thread Mailboxes
+.\"
+Each user thread must have associated with it a unique
+.Vt "struct kse_thr_mailbox"
+as defined in
+.In sys/kse.h .
+It includes the following fields.
+.Pp
+.Va tm_udata
+is an opaque pointer ignored by the kernel.
+.Pp
+.Va tm_context
+stores the context for the thread when the thread is blocked in user space.
+This field is also updated by the kernel before a completed thread is returned
+to the user thread scheduler via
+.Va km_completed .
+.Pp
+.Va tm_next
+links the
+.Va km_completed
+threads together when returned by the kernel with an upcall.
+The end of the list is marked with a
+.Dv NULL
+pointer.
+.Pp
+.Va tm_uticks
+and
+.Va tm_sticks
+are time counters for user mode and kernel mode execution, respectively.
+These counters count ticks of the statistics clock (see
+.Xr clocks 7 ) .
+While any thread is actively executing in the kernel, the corresponding
+.Va tm_sticks
+counter is incremented.
+While any KSE is executing in user space and that KSE's
+.Va km_curthread
+pointer is not equal to
+.Dv NULL ,
+the corresponding
+.Va tm_uticks
+counter is incremented.
+.Pp
+.Va tm_flags
+may contain any of the following bits OR'ed together:
+.Bl -tag -width indent
+.It Dv TMF_NOUPCALL
+Similar to
+.Dv KMF_NOUPCALL .
+This flag inhibits upcalling for critical sections.
+Some architectures require this to be in one place and some in the other.
+.El
+.Sh RETURN VALUES
+The
+.Fn kse_create ,
+.Fn kse_wakeup ,
+and
+.Fn kse_thr_interrupt
+system calls
+return zero if successful.
+The
+.Fn kse_exit
+and
+.Fn kse_release
+system calls
+do not return if successful.
+.Pp
+All of these system calls return a non-zero error code in case of an error.
+.Sh ERRORS
+The
+.Fn kse_create
+system call
+will fail if:
+.Bl -tag -width Er
+.It Bq Er ENXIO
+There are already as many KSEs in the process as hardware processors.
+.It Bq Er EAGAIN
+The user is not the super user, and the soft resource limit corresponding
+to the
+.Fa resource
+argument
+.Dv RLIMIT_NPROC
+would be exceeded (see
+.Xr getrlimit 2 ) .
+.It Bq Er EFAULT
+The
+.Fa mbx
+argument
+points to an address which is not a valid part of the process address space.
+.El
+.Pp
+The
+.Fn kse_exit
+system call
+will fail if:
+.Bl -tag -width Er
+.It Bq Er EDEADLK
+The current KSE is the last in its process and there are still one or more
+threads associated with the process blocked in the kernel.
+.It Bq Er ESRCH
+The current KSE has no associated mailbox, i.e., the process is operating
+in traditional, unthreaded mode (in this case use
+.Xr _exit 2
+to exit the process).
+.El
+.Pp
+The
+.Fn kse_release
+system call
+will fail if:
+.Bl -tag -width Er
+.It Bq Er ESRCH
+The current KSE has no associated mailbox, i.e., the process is operating in
+traditional, unthreaded mode.
+.El
+.Pp
+The
+.Fn kse_wakeup
+system call
+will fail if:
+.Bl -tag -width Er
+.It Bq Er ESRCH
+The
+.Fa mbx
+argument
+is not
+.Dv NULL
+and the mailbox pointed to by
+.Fa mbx
+is not associated with any KSE in the process.
+.It Bq Er ESRCH
+The
+.Fa mbx
+argument
+is
+.Dv NULL
+and the current KSE has no associated mailbox, i.e., the process is operating
+in traditional, unthreaded mode.
+.El
+.Pp
+The
+.Fn kse_thr_interrupt
+system call
+will fail if:
+.Bl -tag -width Er
+.It Bq Er ESRCH
+The thread corresponding to
+.Fa tmbx
+is neither currently assigned to any KSE in the process nor blocked in the
+kernel.
+.El
+.Sh SEE ALSO
+.Xr rfork 2 ,
+.Xr pthread 3 ,
+.Xr ucontext 3
+.Rs
+.%A "Thomas E. Anderson"
+.%A "Brian N. Bershad"
+.%A "Edward D. Lazowska"
+.%A "Henry M. Levy"
+.%J "ACM Transactions on Computer Systems"
+.%N Issue 1
+.%V Volume 10
+.%D February 1992
+.%I ACM Press
+.%P pp. 53-79
+.%T "Scheduler activations: effective kernel support for the user-level management of parallelism"
+.Re
+.Sh HISTORY
+The KSE system calls first appeared in
+.Fx 5.0 .
+.Sh AUTHORS
+KSE was originally implemented by
+.An -nosplit
+.An "Julian Elischer" Aq julian@FreeBSD.org ,
+with additional contributions by
+.An "Jonathan Mini" Aq mini@FreeBSD.org ,
+.An "Daniel Eischen" Aq deischen@FreeBSD.org ,
+and
+.An "David Xu" Aq davidxu@FreeBSD.org .
+.Pp
+This manual page was written by
+.An "Archie Cobbs" Aq archie@FreeBSD.org .
+.Sh BUGS
+The KSE code is
+.Ud