From 12983bcd822f5c13a0f350cc97bc9fb781cae944 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 21 Sep 2016 13:14:35 -0300 Subject: Documentation/adding-syscalls.txt: convert it to ReST markup Convert adding-syscalls.txt to ReST markup and add it to the development-process book: - add extra lines to make Sphinx to correctly parse paragraphs; - use quote blocks for examples; - use monotonic font for dirs, function calls, etc; - mark manpage pages using the right markup; - add cross-reference to SubmittingPatches. Signed-off-by: Mauro Carvalho Chehab --- Documentation/adding-syscalls.txt | 527 ------------------------------------ Documentation/adding-syscals.txt | 542 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 542 insertions(+), 527 deletions(-) delete mode 100644 Documentation/adding-syscalls.txt create mode 100644 Documentation/adding-syscals.txt diff --git a/Documentation/adding-syscalls.txt b/Documentation/adding-syscalls.txt deleted file mode 100644 index bbb31e0..0000000 --- a/Documentation/adding-syscalls.txt +++ /dev/null @@ -1,527 +0,0 @@ -Adding a New System Call -======================== - -This document describes what's involved in adding a new system call to the -Linux kernel, over and above the normal submission advice in -Documentation/SubmittingPatches. - - -System Call Alternatives ------------------------- - -The first thing to consider when adding a new system call is whether one of -the alternatives might be suitable instead. Although system calls are the -most traditional and most obvious interaction points between userspace and the -kernel, there are other possibilities -- choose what fits best for your -interface. - - - If the operations involved can be made to look like a filesystem-like - object, it may make more sense to create a new filesystem or device. This - also makes it easier to encapsulate the new functionality in a kernel module - rather than requiring it to be built into the main kernel. - - If the new functionality involves operations where the kernel notifies - userspace that something has happened, then returning a new file - descriptor for the relevant object allows userspace to use - poll/select/epoll to receive that notification. - - However, operations that don't map to read(2)/write(2)-like operations - have to be implemented as ioctl(2) requests, which can lead to a - somewhat opaque API. - - If you're just exposing runtime system information, a new node in sysfs - (see Documentation/filesystems/sysfs.txt) or the /proc filesystem may be - more appropriate. However, access to these mechanisms requires that the - relevant filesystem is mounted, which might not always be the case (e.g. - in a namespaced/sandboxed/chrooted environment). Avoid adding any API to - debugfs, as this is not considered a 'production' interface to userspace. - - If the operation is specific to a particular file or file descriptor, then - an additional fcntl(2) command option may be more appropriate. However, - fcntl(2) is a multiplexing system call that hides a lot of complexity, so - this option is best for when the new function is closely analogous to - existing fcntl(2) functionality, or the new functionality is very simple - (for example, getting/setting a simple flag related to a file descriptor). - - If the operation is specific to a particular task or process, then an - additional prctl(2) command option may be more appropriate. As with - fcntl(2), this system call is a complicated multiplexor so is best reserved - for near-analogs of existing prctl() commands or getting/setting a simple - flag related to a process. - - -Designing the API: Planning for Extension ------------------------------------------ - -A new system call forms part of the API of the kernel, and has to be supported -indefinitely. As such, it's a very good idea to explicitly discuss the -interface on the kernel mailing list, and it's important to plan for future -extensions of the interface. - -(The syscall table is littered with historical examples where this wasn't done, -together with the corresponding follow-up system calls -- eventfd/eventfd2, -dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2 -- so -learn from the history of the kernel and plan for extensions from the start.) - -For simpler system calls that only take a couple of arguments, the preferred -way to allow for future extensibility is to include a flags argument to the -system call. To make sure that userspace programs can safely use flags -between kernel versions, check whether the flags value holds any unknown -flags, and reject the system call (with EINVAL) if it does: - - if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3)) - return -EINVAL; - -(If no flags values are used yet, check that the flags argument is zero.) - -For more sophisticated system calls that involve a larger number of arguments, -it's preferred to encapsulate the majority of the arguments into a structure -that is passed in by pointer. Such a structure can cope with future extension -by including a size argument in the structure: - - struct xyzzy_params { - u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */ - u32 param_1; - u64 param_2; - u64 param_3; - }; - -As long as any subsequently added field, say param_4, is designed so that a -zero value gives the previous behaviour, then this allows both directions of -version mismatch: - - - To cope with a later userspace program calling an older kernel, the kernel - code should check that any memory beyond the size of the structure that it - expects is zero (effectively checking that param_4 == 0). - - To cope with an older userspace program calling a newer kernel, the kernel - code can zero-extend a smaller instance of the structure (effectively - setting param_4 = 0). - -See perf_event_open(2) and the perf_copy_attr() function (in -kernel/events/core.c) for an example of this approach. - - -Designing the API: Other Considerations ---------------------------------------- - -If your new system call allows userspace to refer to a kernel object, it -should use a file descriptor as the handle for that object -- don't invent a -new type of userspace object handle when the kernel already has mechanisms and -well-defined semantics for using file descriptors. - -If your new xyzzy(2) system call does return a new file descriptor, then the -flags argument should include a value that is equivalent to setting O_CLOEXEC -on the new FD. This makes it possible for userspace to close the timing -window between xyzzy() and calling fcntl(fd, F_SETFD, FD_CLOEXEC), where an -unexpected fork() and execve() in another thread could leak a descriptor to -the exec'ed program. (However, resist the temptation to re-use the actual value -of the O_CLOEXEC constant, as it is architecture-specific and is part of a -numbering space of O_* flags that is fairly full.) - -If your system call returns a new file descriptor, you should also consider -what it means to use the poll(2) family of system calls on that file -descriptor. Making a file descriptor ready for reading or writing is the -normal way for the kernel to indicate to userspace that an event has -occurred on the corresponding kernel object. - -If your new xyzzy(2) system call involves a filename argument: - - int sys_xyzzy(const char __user *path, ..., unsigned int flags); - -you should also consider whether an xyzzyat(2) version is more appropriate: - - int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags); - -This allows more flexibility for how userspace specifies the file in question; -in particular it allows userspace to request the functionality for an -already-opened file descriptor using the AT_EMPTY_PATH flag, effectively giving -an fxyzzy(3) operation for free: - - - xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path,...) - - xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...) - -(For more details on the rationale of the *at() calls, see the openat(2) man -page; for an example of AT_EMPTY_PATH, see the fstatat(2) man page.) - -If your new xyzzy(2) system call involves a parameter describing an offset -within a file, make its type loff_t so that 64-bit offsets can be supported -even on 32-bit architectures. - -If your new xyzzy(2) system call involves privileged functionality, it needs -to be governed by the appropriate Linux capability bit (checked with a call to -capable()), as described in the capabilities(7) man page. Choose an existing -capability bit that governs related functionality, but try to avoid combining -lots of only vaguely related functions together under the same bit, as this -goes against capabilities' purpose of splitting the power of root. In -particular, avoid adding new uses of the already overly-general CAP_SYS_ADMIN -capability. - -If your new xyzzy(2) system call manipulates a process other than the calling -process, it should be restricted (using a call to ptrace_may_access()) so that -only a calling process with the same permissions as the target process, or -with the necessary capabilities, can manipulate the target process. - -Finally, be aware that some non-x86 architectures have an easier time if -system call parameters that are explicitly 64-bit fall on odd-numbered -arguments (i.e. parameter 1, 3, 5), to allow use of contiguous pairs of 32-bit -registers. (This concern does not apply if the arguments are part of a -structure that's passed in by pointer.) - - -Proposing the API ------------------ - -To make new system calls easy to review, it's best to divide up the patchset -into separate chunks. These should include at least the following items as -distinct commits (each of which is described further below): - - - The core implementation of the system call, together with prototypes, - generic numbering, Kconfig changes and fallback stub implementation. - - Wiring up of the new system call for one particular architecture, usually - x86 (including all of x86_64, x86_32 and x32). - - A demonstration of the use of the new system call in userspace via a - selftest in tools/testing/selftests/. - - A draft man-page for the new system call, either as plain text in the - cover letter, or as a patch to the (separate) man-pages repository. - -New system call proposals, like any change to the kernel's API, should always -be cc'ed to linux-api@vger.kernel.org. - - -Generic System Call Implementation ----------------------------------- - -The main entry point for your new xyzzy(2) system call will be called -sys_xyzzy(), but you add this entry point with the appropriate -SYSCALL_DEFINEn() macro rather than explicitly. The 'n' indicates the number -of arguments to the system call, and the macro takes the system call name -followed by the (type, name) pairs for the parameters as arguments. Using -this macro allows metadata about the new system call to be made available for -other tools. - -The new entry point also needs a corresponding function prototype, in -include/linux/syscalls.h, marked as asmlinkage to match the way that system -calls are invoked: - - asmlinkage long sys_xyzzy(...); - -Some architectures (e.g. x86) have their own architecture-specific syscall -tables, but several other architectures share a generic syscall table. Add your -new system call to the generic list by adding an entry to the list in -include/uapi/asm-generic/unistd.h: - - #define __NR_xyzzy 292 - __SYSCALL(__NR_xyzzy, sys_xyzzy) - -Also update the __NR_syscalls count to reflect the additional system call, and -note that if multiple new system calls are added in the same merge window, -your new syscall number may get adjusted to resolve conflicts. - -The file kernel/sys_ni.c provides a fallback stub implementation of each system -call, returning -ENOSYS. Add your new system call here too: - - cond_syscall(sys_xyzzy); - -Your new kernel functionality, and the system call that controls it, should -normally be optional, so add a CONFIG option (typically to init/Kconfig) for -it. As usual for new CONFIG options: - - - Include a description of the new functionality and system call controlled - by the option. - - Make the option depend on EXPERT if it should be hidden from normal users. - - Make any new source files implementing the function dependent on the CONFIG - option in the Makefile (e.g. "obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.c"). - - Double check that the kernel still builds with the new CONFIG option turned - off. - -To summarize, you need a commit that includes: - - - CONFIG option for the new function, normally in init/Kconfig - - SYSCALL_DEFINEn(xyzzy, ...) for the entry point - - corresponding prototype in include/linux/syscalls.h - - generic table entry in include/uapi/asm-generic/unistd.h - - fallback stub in kernel/sys_ni.c - - -x86 System Call Implementation ------------------------------- - -To wire up your new system call for x86 platforms, you need to update the -master syscall tables. Assuming your new system call isn't special in some -way (see below), this involves a "common" entry (for x86_64 and x32) in -arch/x86/entry/syscalls/syscall_64.tbl: - - 333 common xyzzy sys_xyzzy - -and an "i386" entry in arch/x86/entry/syscalls/syscall_32.tbl: - - 380 i386 xyzzy sys_xyzzy - -Again, these numbers are liable to be changed if there are conflicts in the -relevant merge window. - - -Compatibility System Calls (Generic) ------------------------------------- - -For most system calls the same 64-bit implementation can be invoked even when -the userspace program is itself 32-bit; even if the system call's parameters -include an explicit pointer, this is handled transparently. - -However, there are a couple of situations where a compatibility layer is -needed to cope with size differences between 32-bit and 64-bit. - -The first is if the 64-bit kernel also supports 32-bit userspace programs, and -so needs to parse areas of (__user) memory that could hold either 32-bit or -64-bit values. In particular, this is needed whenever a system call argument -is: - - - a pointer to a pointer - - a pointer to a struct containing a pointer (e.g. struct iovec __user *) - - a pointer to a varying sized integral type (time_t, off_t, long, ...) - - a pointer to a struct containing a varying sized integral type. - -The second situation that requires a compatibility layer is if one of the -system call's arguments has a type that is explicitly 64-bit even on a 32-bit -architecture, for example loff_t or __u64. In this case, a value that arrives -at a 64-bit kernel from a 32-bit application will be split into two 32-bit -values, which then need to be re-assembled in the compatibility layer. - -(Note that a system call argument that's a pointer to an explicit 64-bit type -does *not* need a compatibility layer; for example, splice(2)'s arguments of -type loff_t __user * do not trigger the need for a compat_ system call.) - -The compatibility version of the system call is called compat_sys_xyzzy(), and -is added with the COMPAT_SYSCALL_DEFINEn() macro, analogously to -SYSCALL_DEFINEn. This version of the implementation runs as part of a 64-bit -kernel, but expects to receive 32-bit parameter values and does whatever is -needed to deal with them. (Typically, the compat_sys_ version converts the -values to 64-bit versions and either calls on to the sys_ version, or both of -them call a common inner implementation function.) - -The compat entry point also needs a corresponding function prototype, in -include/linux/compat.h, marked as asmlinkage to match the way that system -calls are invoked: - - asmlinkage long compat_sys_xyzzy(...); - -If the system call involves a structure that is laid out differently on 32-bit -and 64-bit systems, say struct xyzzy_args, then the include/linux/compat.h -header file should also include a compat version of the structure (struct -compat_xyzzy_args) where each variable-size field has the appropriate compat_ -type that corresponds to the type in struct xyzzy_args. The -compat_sys_xyzzy() routine can then use this compat_ structure to parse the -arguments from a 32-bit invocation. - -For example, if there are fields: - - struct xyzzy_args { - const char __user *ptr; - __kernel_long_t varying_val; - u64 fixed_val; - /* ... */ - }; - -in struct xyzzy_args, then struct compat_xyzzy_args would have: - - struct compat_xyzzy_args { - compat_uptr_t ptr; - compat_long_t varying_val; - u64 fixed_val; - /* ... */ - }; - -The generic system call list also needs adjusting to allow for the compat -version; the entry in include/uapi/asm-generic/unistd.h should use -__SC_COMP rather than __SYSCALL: - - #define __NR_xyzzy 292 - __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy) - -To summarize, you need: - - - a COMPAT_SYSCALL_DEFINEn(xyzzy, ...) for the compat entry point - - corresponding prototype in include/linux/compat.h - - (if needed) 32-bit mapping struct in include/linux/compat.h - - instance of __SC_COMP not __SYSCALL in include/uapi/asm-generic/unistd.h - - -Compatibility System Calls (x86) --------------------------------- - -To wire up the x86 architecture of a system call with a compatibility version, -the entries in the syscall tables need to be adjusted. - -First, the entry in arch/x86/entry/syscalls/syscall_32.tbl gets an extra -column to indicate that a 32-bit userspace program running on a 64-bit kernel -should hit the compat entry point: - - 380 i386 xyzzy sys_xyzzy compat_sys_xyzzy - -Second, you need to figure out what should happen for the x32 ABI version of -the new system call. There's a choice here: the layout of the arguments -should either match the 64-bit version or the 32-bit version. - -If there's a pointer-to-a-pointer involved, the decision is easy: x32 is -ILP32, so the layout should match the 32-bit version, and the entry in -arch/x86/entry/syscalls/syscall_64.tbl is split so that x32 programs hit the -compatibility wrapper: - - 333 64 xyzzy sys_xyzzy - ... - 555 x32 xyzzy compat_sys_xyzzy - -If no pointers are involved, then it is preferable to re-use the 64-bit system -call for the x32 ABI (and consequently the entry in -arch/x86/entry/syscalls/syscall_64.tbl is unchanged). - -In either case, you should check that the types involved in your argument -layout do indeed map exactly from x32 (-mx32) to either the 32-bit (-m32) or -64-bit (-m64) equivalents. - - -System Calls Returning Elsewhere --------------------------------- - -For most system calls, once the system call is complete the user program -continues exactly where it left off -- at the next instruction, with the -stack the same and most of the registers the same as before the system call, -and with the same virtual memory space. - -However, a few system calls do things differently. They might return to a -different location (rt_sigreturn) or change the memory space (fork/vfork/clone) -or even architecture (execve/execveat) of the program. - -To allow for this, the kernel implementation of the system call may need to -save and restore additional registers to the kernel stack, allowing complete -control of where and how execution continues after the system call. - -This is arch-specific, but typically involves defining assembly entry points -that save/restore additional registers and invoke the real system call entry -point. - -For x86_64, this is implemented as a stub_xyzzy entry point in -arch/x86/entry/entry_64.S, and the entry in the syscall table -(arch/x86/entry/syscalls/syscall_64.tbl) is adjusted to match: - - 333 common xyzzy stub_xyzzy - -The equivalent for 32-bit programs running on a 64-bit kernel is normally -called stub32_xyzzy and implemented in arch/x86/entry/entry_64_compat.S, -with the corresponding syscall table adjustment in -arch/x86/entry/syscalls/syscall_32.tbl: - - 380 i386 xyzzy sys_xyzzy stub32_xyzzy - -If the system call needs a compatibility layer (as in the previous section) -then the stub32_ version needs to call on to the compat_sys_ version of the -system call rather than the native 64-bit version. Also, if the x32 ABI -implementation is not common with the x86_64 version, then its syscall -table will also need to invoke a stub that calls on to the compat_sys_ -version. - -For completeness, it's also nice to set up a mapping so that user-mode Linux -still works -- its syscall table will reference stub_xyzzy, but the UML build -doesn't include arch/x86/entry/entry_64.S implementation (because UML -simulates registers etc). Fixing this is as simple as adding a #define to -arch/x86/um/sys_call_table_64.c: - - #define stub_xyzzy sys_xyzzy - - -Other Details -------------- - -Most of the kernel treats system calls in a generic way, but there is the -occasional exception that may need updating for your particular system call. - -The audit subsystem is one such special case; it includes (arch-specific) -functions that classify some special types of system call -- specifically -file open (open/openat), program execution (execve/exeveat) or socket -multiplexor (socketcall) operations. If your new system call is analogous to -one of these, then the audit system should be updated. - -More generally, if there is an existing system call that is analogous to your -new system call, it's worth doing a kernel-wide grep for the existing system -call to check there are no other special cases. - - -Testing -------- - -A new system call should obviously be tested; it is also useful to provide -reviewers with a demonstration of how user space programs will use the system -call. A good way to combine these aims is to include a simple self-test -program in a new directory under tools/testing/selftests/. - -For a new system call, there will obviously be no libc wrapper function and so -the test will need to invoke it using syscall(); also, if the system call -involves a new userspace-visible structure, the corresponding header will need -to be installed to compile the test. - -Make sure the selftest runs successfully on all supported architectures. For -example, check that it works when compiled as an x86_64 (-m64), x86_32 (-m32) -and x32 (-mx32) ABI program. - -For more extensive and thorough testing of new functionality, you should also -consider adding tests to the Linux Test Project, or to the xfstests project -for filesystem-related changes. - - https://linux-test-project.github.io/ - - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git - - -Man Page --------- - -All new system calls should come with a complete man page, ideally using groff -markup, but plain text will do. If groff is used, it's helpful to include a -pre-rendered ASCII version of the man page in the cover email for the -patchset, for the convenience of reviewers. - -The man page should be cc'ed to linux-man@vger.kernel.org -For more details, see https://www.kernel.org/doc/man-pages/patches.html - -References and Sources ----------------------- - - - LWN article from Michael Kerrisk on use of flags argument in system calls: - https://lwn.net/Articles/585415/ - - LWN article from Michael Kerrisk on how to handle unknown flags in a system - call: https://lwn.net/Articles/588444/ - - LWN article from Jake Edge describing constraints on 64-bit system call - arguments: https://lwn.net/Articles/311630/ - - Pair of LWN articles from David Drysdale that describe the system call - implementation paths in detail for v3.14: - - https://lwn.net/Articles/604287/ - - https://lwn.net/Articles/604515/ - - Architecture-specific requirements for system calls are discussed in the - syscall(2) man-page: - http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES - - Collated emails from Linus Torvalds discussing the problems with ioctl(): - http://yarchive.net/comp/linux/ioctl.html - - "How to not invent kernel interfaces", Arnd Bergmann, - http://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf - - LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN: - https://lwn.net/Articles/486306/ - - Recommendation from Andrew Morton that all related information for a new - system call should come in the same email thread: - https://lkml.org/lkml/2014/7/24/641 - - Recommendation from Michael Kerrisk that a new system call should come with - a man page: https://lkml.org/lkml/2014/6/13/309 - - Suggestion from Thomas Gleixner that x86 wire-up should be in a separate - commit: https://lkml.org/lkml/2014/11/19/254 - - Suggestion from Greg Kroah-Hartman that it's good for new system calls to - come with a man-page & selftest: https://lkml.org/lkml/2014/3/19/710 - - Discussion from Michael Kerrisk of new system call vs. prctl(2) extension: - https://lkml.org/lkml/2014/6/3/411 - - Suggestion from Ingo Molnar that system calls that involve multiple - arguments should encapsulate those arguments in a struct, which includes a - size field for future extensibility: https://lkml.org/lkml/2015/7/30/117 - - Numbering oddities arising from (re-)use of O_* numbering space flags: - - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness - check") - - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc - conflict") - - commit bb458c644a59 ("Safer ABI for O_TMPFILE") - - Discussion from Matthew Wilcox about restrictions on 64-bit arguments: - https://lkml.org/lkml/2008/12/12/187 - - Recommendation from Greg Kroah-Hartman that unknown flags should be - policed: https://lkml.org/lkml/2014/7/17/577 - - Recommendation from Linus Torvalds that x32 system calls should prefer - compatibility with 64-bit versions rather than 32-bit versions: - https://lkml.org/lkml/2011/8/31/244 diff --git a/Documentation/adding-syscals.txt b/Documentation/adding-syscals.txt new file mode 100644 index 0000000..f5b5b1a --- /dev/null +++ b/Documentation/adding-syscals.txt @@ -0,0 +1,542 @@ +Adding a New System Call +======================== + +This document describes what's involved in adding a new system call to the +Linux kernel, over and above the normal submission advice in +:ref:`Documentation/SubmittingPatches `. + + +System Call Alternatives +------------------------ + +The first thing to consider when adding a new system call is whether one of +the alternatives might be suitable instead. Although system calls are the +most traditional and most obvious interaction points between userspace and the +kernel, there are other possibilities -- choose what fits best for your +interface. + + - If the operations involved can be made to look like a filesystem-like + object, it may make more sense to create a new filesystem or device. This + also makes it easier to encapsulate the new functionality in a kernel module + rather than requiring it to be built into the main kernel. + + - If the new functionality involves operations where the kernel notifies + userspace that something has happened, then returning a new file + descriptor for the relevant object allows userspace to use + ``poll``/``select``/``epoll`` to receive that notification. + - However, operations that don't map to + :manpage:`read(2)`/:manpage:`write(2)`-like operations + have to be implemented as :manpage:`ioctl(2)` requests, which can lead + to a somewhat opaque API. + + - If you're just exposing runtime system information, a new node in sysfs + (see ``Documentation/filesystems/sysfs.txt``) or the ``/proc`` filesystem may + be more appropriate. However, access to these mechanisms requires that the + relevant filesystem is mounted, which might not always be the case (e.g. + in a namespaced/sandboxed/chrooted environment). Avoid adding any API to + debugfs, as this is not considered a 'production' interface to userspace. + - If the operation is specific to a particular file or file descriptor, then + an additional :manpage:`fcntl(2)` command option may be more appropriate. However, + :manpage:`fcntl(2)` is a multiplexing system call that hides a lot of complexity, so + this option is best for when the new function is closely analogous to + existing :manpage:`fcntl(2)` functionality, or the new functionality is very simple + (for example, getting/setting a simple flag related to a file descriptor). + - If the operation is specific to a particular task or process, then an + additional :manpage:`prctl(2)` command option may be more appropriate. As + with :manpage:`fcntl(2)`, this system call is a complicated multiplexor so + is best reserved for near-analogs of existing ``prctl()`` commands or + getting/setting a simple flag related to a process. + + +Designing the API: Planning for Extension +----------------------------------------- + +A new system call forms part of the API of the kernel, and has to be supported +indefinitely. As such, it's a very good idea to explicitly discuss the +interface on the kernel mailing list, and it's important to plan for future +extensions of the interface. + +(The syscall table is littered with historical examples where this wasn't done, +together with the corresponding follow-up system calls -- +``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``, +``pipe``/``pipe2``, ``renameat``/``renameat2`` -- so +learn from the history of the kernel and plan for extensions from the start.) + +For simpler system calls that only take a couple of arguments, the preferred +way to allow for future extensibility is to include a flags argument to the +system call. To make sure that userspace programs can safely use flags +between kernel versions, check whether the flags value holds any unknown +flags, and reject the system call (with ``EINVAL``) if it does:: + + if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3)) + return -EINVAL; + +(If no flags values are used yet, check that the flags argument is zero.) + +For more sophisticated system calls that involve a larger number of arguments, +it's preferred to encapsulate the majority of the arguments into a structure +that is passed in by pointer. Such a structure can cope with future extension +by including a size argument in the structure:: + + struct xyzzy_params { + u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */ + u32 param_1; + u64 param_2; + u64 param_3; + }; + +As long as any subsequently added field, say ``param_4``, is designed so that a +zero value gives the previous behaviour, then this allows both directions of +version mismatch: + + - To cope with a later userspace program calling an older kernel, the kernel + code should check that any memory beyond the size of the structure that it + expects is zero (effectively checking that ``param_4 == 0``). + - To cope with an older userspace program calling a newer kernel, the kernel + code can zero-extend a smaller instance of the structure (effectively + setting ``param_4 = 0``). + +See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in +``kernel/events/core.c``) for an example of this approach. + + +Designing the API: Other Considerations +--------------------------------------- + +If your new system call allows userspace to refer to a kernel object, it +should use a file descriptor as the handle for that object -- don't invent a +new type of userspace object handle when the kernel already has mechanisms and +well-defined semantics for using file descriptors. + +If your new :manpage:`xyzzy(2)` system call does return a new file descriptor, +then the flags argument should include a value that is equivalent to setting +``O_CLOEXEC`` on the new FD. This makes it possible for userspace to close +the timing window between ``xyzzy()`` and calling +``fcntl(fd, F_SETFD, FD_CLOEXEC)``, where an unexpected ``fork()`` and +``execve()`` in another thread could leak a descriptor to +the exec'ed program. (However, resist the temptation to re-use the actual value +of the ``O_CLOEXEC`` constant, as it is architecture-specific and is part of a +numbering space of ``O_*`` flags that is fairly full.) + +If your system call returns a new file descriptor, you should also consider +what it means to use the :manpage:`poll(2)` family of system calls on that file +descriptor. Making a file descriptor ready for reading or writing is the +normal way for the kernel to indicate to userspace that an event has +occurred on the corresponding kernel object. + +If your new :manpage:`xyzzy(2)` system call involves a filename argument:: + + int sys_xyzzy(const char __user *path, ..., unsigned int flags); + +you should also consider whether an :manpage:`xyzzyat(2)` version is more appropriate:: + + int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags); + +This allows more flexibility for how userspace specifies the file in question; +in particular it allows userspace to request the functionality for an +already-opened file descriptor using the ``AT_EMPTY_PATH`` flag, effectively +giving an :manpage:`fxyzzy(3)` operation for free:: + + - xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path,...) + - xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...) + +(For more details on the rationale of the \*at() calls, see the +:manpage:`openat(2)` man page; for an example of AT_EMPTY_PATH, see the +:manpage:`fstatat(2)` man page.) + +If your new :manpage:`xyzzy(2)` system call involves a parameter describing an +offset within a file, make its type ``loff_t`` so that 64-bit offsets can be +supported even on 32-bit architectures. + +If your new :manpage:`xyzzy(2)` system call involves privileged functionality, +it needs to be governed by the appropriate Linux capability bit (checked with +a call to ``capable()``), as described in the :manpage:`capabilities(7)` man +page. Choose an existing capability bit that governs related functionality, +but try to avoid combining lots of only vaguely related functions together +under the same bit, as this goes against capabilities' purpose of splitting +the power of root. In particular, avoid adding new uses of the already +overly-general ``CAP_SYS_ADMIN`` capability. + +If your new :manpage:`xyzzy(2)` system call manipulates a process other than +the calling process, it should be restricted (using a call to +``ptrace_may_access()``) so that only a calling process with the same +permissions as the target process, or with the necessary capabilities, can +manipulate the target process. + +Finally, be aware that some non-x86 architectures have an easier time if +system call parameters that are explicitly 64-bit fall on odd-numbered +arguments (i.e. parameter 1, 3, 5), to allow use of contiguous pairs of 32-bit +registers. (This concern does not apply if the arguments are part of a +structure that's passed in by pointer.) + + +Proposing the API +----------------- + +To make new system calls easy to review, it's best to divide up the patchset +into separate chunks. These should include at least the following items as +distinct commits (each of which is described further below): + + - The core implementation of the system call, together with prototypes, + generic numbering, Kconfig changes and fallback stub implementation. + - Wiring up of the new system call for one particular architecture, usually + x86 (including all of x86_64, x86_32 and x32). + - A demonstration of the use of the new system call in userspace via a + selftest in ``tools/testing/selftests/``. + - A draft man-page for the new system call, either as plain text in the + cover letter, or as a patch to the (separate) man-pages repository. + +New system call proposals, like any change to the kernel's API, should always +be cc'ed to linux-api@vger.kernel.org. + + +Generic System Call Implementation +---------------------------------- + +The main entry point for your new :manpage:`xyzzy(2)` system call will be called +``sys_xyzzy()``, but you add this entry point with the appropriate +``SYSCALL_DEFINEn()`` macro rather than explicitly. The 'n' indicates the +number of arguments to the system call, and the macro takes the system call name +followed by the (type, name) pairs for the parameters as arguments. Using +this macro allows metadata about the new system call to be made available for +other tools. + +The new entry point also needs a corresponding function prototype, in +``include/linux/syscalls.h``, marked as asmlinkage to match the way that system +calls are invoked:: + + asmlinkage long sys_xyzzy(...); + +Some architectures (e.g. x86) have their own architecture-specific syscall +tables, but several other architectures share a generic syscall table. Add your +new system call to the generic list by adding an entry to the list in +``include/uapi/asm-generic/unistd.h``:: + + #define __NR_xyzzy 292 + __SYSCALL(__NR_xyzzy, sys_xyzzy) + +Also update the __NR_syscalls count to reflect the additional system call, and +note that if multiple new system calls are added in the same merge window, +your new syscall number may get adjusted to resolve conflicts. + +The file ``kernel/sys_ni.c`` provides a fallback stub implementation of each +system call, returning ``-ENOSYS``. Add your new system call here too:: + + cond_syscall(sys_xyzzy); + +Your new kernel functionality, and the system call that controls it, should +normally be optional, so add a ``CONFIG`` option (typically to +``init/Kconfig``) for it. As usual for new ``CONFIG`` options: + + - Include a description of the new functionality and system call controlled + by the option. + - Make the option depend on EXPERT if it should be hidden from normal users. + - Make any new source files implementing the function dependent on the CONFIG + option in the Makefile (e.g. ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.c``). + - Double check that the kernel still builds with the new CONFIG option turned + off. + +To summarize, you need a commit that includes: + + - ``CONFIG`` option for the new function, normally in ``init/Kconfig`` + - ``SYSCALL_DEFINEn(xyzzy, ...)`` for the entry point + - corresponding prototype in ``include/linux/syscalls.h`` + - generic table entry in ``include/uapi/asm-generic/unistd.h`` + - fallback stub in ``kernel/sys_ni.c`` + + +x86 System Call Implementation +------------------------------ + +To wire up your new system call for x86 platforms, you need to update the +master syscall tables. Assuming your new system call isn't special in some +way (see below), this involves a "common" entry (for x86_64 and x32) in +arch/x86/entry/syscalls/syscall_64.tbl:: + + 333 common xyzzy sys_xyzzy + +and an "i386" entry in ``arch/x86/entry/syscalls/syscall_32.tbl``:: + + 380 i386 xyzzy sys_xyzzy + +Again, these numbers are liable to be changed if there are conflicts in the +relevant merge window. + + +Compatibility System Calls (Generic) +------------------------------------ + +For most system calls the same 64-bit implementation can be invoked even when +the userspace program is itself 32-bit; even if the system call's parameters +include an explicit pointer, this is handled transparently. + +However, there are a couple of situations where a compatibility layer is +needed to cope with size differences between 32-bit and 64-bit. + +The first is if the 64-bit kernel also supports 32-bit userspace programs, and +so needs to parse areas of (``__user``) memory that could hold either 32-bit or +64-bit values. In particular, this is needed whenever a system call argument +is: + + - a pointer to a pointer + - a pointer to a struct containing a pointer (e.g. ``struct iovec __user *``) + - a pointer to a varying sized integral type (``time_t``, ``off_t``, + ``long``, ...) + - a pointer to a struct containing a varying sized integral type. + +The second situation that requires a compatibility layer is if one of the +system call's arguments has a type that is explicitly 64-bit even on a 32-bit +architecture, for example ``loff_t`` or ``__u64``. In this case, a value that +arrives at a 64-bit kernel from a 32-bit application will be split into two +32-bit values, which then need to be re-assembled in the compatibility layer. + +(Note that a system call argument that's a pointer to an explicit 64-bit type +does **not** need a compatibility layer; for example, :manpage:`splice(2)`'s arguments of +type ``loff_t __user *`` do not trigger the need for a ``compat_`` system call.) + +The compatibility version of the system call is called ``compat_sys_xyzzy()``, +and is added with the ``COMPAT_SYSCALL_DEFINEn()`` macro, analogously to +SYSCALL_DEFINEn. This version of the implementation runs as part of a 64-bit +kernel, but expects to receive 32-bit parameter values and does whatever is +needed to deal with them. (Typically, the ``compat_sys_`` version converts the +values to 64-bit versions and either calls on to the ``sys_`` version, or both of +them call a common inner implementation function.) + +The compat entry point also needs a corresponding function prototype, in +``include/linux/compat.h``, marked as asmlinkage to match the way that system +calls are invoked:: + + asmlinkage long compat_sys_xyzzy(...); + +If the system call involves a structure that is laid out differently on 32-bit +and 64-bit systems, say ``struct xyzzy_args``, then the include/linux/compat.h +header file should also include a compat version of the structure (``struct +compat_xyzzy_args``) where each variable-size field has the appropriate +``compat_`` type that corresponds to the type in ``struct xyzzy_args``. The +``compat_sys_xyzzy()`` routine can then use this ``compat_`` structure to +parse the arguments from a 32-bit invocation. + +For example, if there are fields:: + + struct xyzzy_args { + const char __user *ptr; + __kernel_long_t varying_val; + u64 fixed_val; + /* ... */ + }; + +in struct xyzzy_args, then struct compat_xyzzy_args would have:: + + struct compat_xyzzy_args { + compat_uptr_t ptr; + compat_long_t varying_val; + u64 fixed_val; + /* ... */ + }; + +The generic system call list also needs adjusting to allow for the compat +version; the entry in ``include/uapi/asm-generic/unistd.h`` should use +``__SC_COMP`` rather than ``__SYSCALL``:: + + #define __NR_xyzzy 292 + __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy) + +To summarize, you need: + + - a ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` for the compat entry point + - corresponding prototype in ``include/linux/compat.h`` + - (if needed) 32-bit mapping struct in ``include/linux/compat.h`` + - instance of ``__SC_COMP`` not ``__SYSCALL`` in + ``include/uapi/asm-generic/unistd.h`` + + +Compatibility System Calls (x86) +-------------------------------- + +To wire up the x86 architecture of a system call with a compatibility version, +the entries in the syscall tables need to be adjusted. + +First, the entry in ``arch/x86/entry/syscalls/syscall_32.tbl`` gets an extra +column to indicate that a 32-bit userspace program running on a 64-bit kernel +should hit the compat entry point:: + + 380 i386 xyzzy sys_xyzzy compat_sys_xyzzy + +Second, you need to figure out what should happen for the x32 ABI version of +the new system call. There's a choice here: the layout of the arguments +should either match the 64-bit version or the 32-bit version. + +If there's a pointer-to-a-pointer involved, the decision is easy: x32 is +ILP32, so the layout should match the 32-bit version, and the entry in +``arch/x86/entry/syscalls/syscall_64.tbl`` is split so that x32 programs hit +the compatibility wrapper:: + + 333 64 xyzzy sys_xyzzy + ... + 555 x32 xyzzy compat_sys_xyzzy + +If no pointers are involved, then it is preferable to re-use the 64-bit system +call for the x32 ABI (and consequently the entry in +arch/x86/entry/syscalls/syscall_64.tbl is unchanged). + +In either case, you should check that the types involved in your argument +layout do indeed map exactly from x32 (-mx32) to either the 32-bit (-m32) or +64-bit (-m64) equivalents. + + +System Calls Returning Elsewhere +-------------------------------- + +For most system calls, once the system call is complete the user program +continues exactly where it left off -- at the next instruction, with the +stack the same and most of the registers the same as before the system call, +and with the same virtual memory space. + +However, a few system calls do things differently. They might return to a +different location (``rt_sigreturn``) or change the memory space +(``fork``/``vfork``/``clone``) or even architecture (``execve``/``execveat``) +of the program. + +To allow for this, the kernel implementation of the system call may need to +save and restore additional registers to the kernel stack, allowing complete +control of where and how execution continues after the system call. + +This is arch-specific, but typically involves defining assembly entry points +that save/restore additional registers and invoke the real system call entry +point. + +For x86_64, this is implemented as a ``stub_xyzzy`` entry point in +``arch/x86/entry/entry_64.S``, and the entry in the syscall table +(``arch/x86/entry/syscalls/syscall_64.tbl``) is adjusted to match:: + + 333 common xyzzy stub_xyzzy + +The equivalent for 32-bit programs running on a 64-bit kernel is normally +called ``stub32_xyzzy`` and implemented in ``arch/x86/entry/entry_64_compat.S``, +with the corresponding syscall table adjustment in +``arch/x86/entry/syscalls/syscall_32.tbl``:: + + 380 i386 xyzzy sys_xyzzy stub32_xyzzy + +If the system call needs a compatibility layer (as in the previous section) +then the ``stub32_`` version needs to call on to the ``compat_sys_`` version +of the system call rather than the native 64-bit version. Also, if the x32 ABI +implementation is not common with the x86_64 version, then its syscall +table will also need to invoke a stub that calls on to the ``compat_sys_`` +version. + +For completeness, it's also nice to set up a mapping so that user-mode Linux +still works -- its syscall table will reference stub_xyzzy, but the UML build +doesn't include ``arch/x86/entry/entry_64.S`` implementation (because UML +simulates registers etc). Fixing this is as simple as adding a #define to +``arch/x86/um/sys_call_table_64.c``:: + + #define stub_xyzzy sys_xyzzy + + +Other Details +------------- + +Most of the kernel treats system calls in a generic way, but there is the +occasional exception that may need updating for your particular system call. + +The audit subsystem is one such special case; it includes (arch-specific) +functions that classify some special types of system call -- specifically +file open (``open``/``openat``), program execution (``execve``/``exeveat``) or +socket multiplexor (``socketcall``) operations. If your new system call is +analogous to one of these, then the audit system should be updated. + +More generally, if there is an existing system call that is analogous to your +new system call, it's worth doing a kernel-wide grep for the existing system +call to check there are no other special cases. + + +Testing +------- + +A new system call should obviously be tested; it is also useful to provide +reviewers with a demonstration of how user space programs will use the system +call. A good way to combine these aims is to include a simple self-test +program in a new directory under ``tools/testing/selftests/``. + +For a new system call, there will obviously be no libc wrapper function and so +the test will need to invoke it using ``syscall()``; also, if the system call +involves a new userspace-visible structure, the corresponding header will need +to be installed to compile the test. + +Make sure the selftest runs successfully on all supported architectures. For +example, check that it works when compiled as an x86_64 (-m64), x86_32 (-m32) +and x32 (-mx32) ABI program. + +For more extensive and thorough testing of new functionality, you should also +consider adding tests to the Linux Test Project, or to the xfstests project +for filesystem-related changes. + + - https://linux-test-project.github.io/ + - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git + + +Man Page +-------- + +All new system calls should come with a complete man page, ideally using groff +markup, but plain text will do. If groff is used, it's helpful to include a +pre-rendered ASCII version of the man page in the cover email for the +patchset, for the convenience of reviewers. + +The man page should be cc'ed to linux-man@vger.kernel.org +For more details, see https://www.kernel.org/doc/man-pages/patches.html + +References and Sources +---------------------- + + - LWN article from Michael Kerrisk on use of flags argument in system calls: + https://lwn.net/Articles/585415/ + - LWN article from Michael Kerrisk on how to handle unknown flags in a system + call: https://lwn.net/Articles/588444/ + - LWN article from Jake Edge describing constraints on 64-bit system call + arguments: https://lwn.net/Articles/311630/ + - Pair of LWN articles from David Drysdale that describe the system call + implementation paths in detail for v3.14: + + - https://lwn.net/Articles/604287/ + - https://lwn.net/Articles/604515/ + + - Architecture-specific requirements for system calls are discussed in the + :manpage:`syscall(2)` man-page: + http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES + - Collated emails from Linus Torvalds discussing the problems with ``ioctl()``: + http://yarchive.net/comp/linux/ioctl.html + - "How to not invent kernel interfaces", Arnd Bergmann, + http://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf + - LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN: + https://lwn.net/Articles/486306/ + - Recommendation from Andrew Morton that all related information for a new + system call should come in the same email thread: + https://lkml.org/lkml/2014/7/24/641 + - Recommendation from Michael Kerrisk that a new system call should come with + a man page: https://lkml.org/lkml/2014/6/13/309 + - Suggestion from Thomas Gleixner that x86 wire-up should be in a separate + commit: https://lkml.org/lkml/2014/11/19/254 + - Suggestion from Greg Kroah-Hartman that it's good for new system calls to + come with a man-page & selftest: https://lkml.org/lkml/2014/3/19/710 + - Discussion from Michael Kerrisk of new system call vs. :manpage:`prctl(2)` extension: + https://lkml.org/lkml/2014/6/3/411 + - Suggestion from Ingo Molnar that system calls that involve multiple + arguments should encapsulate those arguments in a struct, which includes a + size field for future extensibility: https://lkml.org/lkml/2015/7/30/117 + - Numbering oddities arising from (re-)use of O_* numbering space flags: + + - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness + check") + - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc + conflict") + - commit bb458c644a59 ("Safer ABI for O_TMPFILE") + + - Discussion from Matthew Wilcox about restrictions on 64-bit arguments: + https://lkml.org/lkml/2008/12/12/187 + - Recommendation from Greg Kroah-Hartman that unknown flags should be + policed: https://lkml.org/lkml/2014/7/17/577 + - Recommendation from Linus Torvalds that x32 system calls should prefer + compatibility with 64-bit versions rather than 32-bit versions: + https://lkml.org/lkml/2011/8/31/244 -- cgit v1.1