From 58319057b7847667f0c9585b9de0e8932b0fdb08 Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Fri, 4 Sep 2015 15:42:45 -0700 Subject: capabilities: ambient capabilities Credit where credit is due: this idea comes from Christoph Lameter with a lot of valuable input from Serge Hallyn. This patch is heavily based on Christoph's patch. ===== The status quo ===== On Linux, there are a number of capabilities defined by the kernel. To perform various privileged tasks, processes can wield capabilities that they hold. Each task has four capability masks: effective (pE), permitted (pP), inheritable (pI), and a bounding set (X). When the kernel checks for a capability, it checks pE. The other capability masks serve to modify what capabilities can be in pE. Any task can remove capabilities from pE, pP, or pI at any time. If a task has a capability in pP, it can add that capability to pE and/or pI. If a task has CAP_SETPCAP, then it can add any capability to pI, and it can remove capabilities from X. Tasks are not the only things that can have capabilities; files can also have capabilities. A file can have no capabilty information at all [1]. If a file has capability information, then it has a permitted mask (fP) and an inheritable mask (fI) as well as a single effective bit (fE) [2]. File capabilities modify the capabilities of tasks that execve(2) them. A task that successfully calls execve has its capabilities modified for the file ultimately being excecuted (i.e. the binary itself if that binary is ELF or for the interpreter if the binary is a script.) [3] In the capability evolution rules, for each mask Z, pZ represents the old value and pZ' represents the new value. The rules are: pP' = (X & fP) | (pI & fI) pI' = pI pE' = (fE ? pP' : 0) X is unchanged For setuid binaries, fP, fI, and fE are modified by a moderately complicated set of rules that emulate POSIX behavior. Similarly, if euid == 0 or ruid == 0, then fP, fI, and fE are modified differently (primary, fP and fI usually end up being the full set). For nonroot users executing binaries with neither setuid nor file caps, fI and fP are empty and fE is false. As an extra complication, if you execute a process as nonroot and fE is set, then the "secure exec" rules are in effect: AT_SECURE gets set, LD_PRELOAD doesn't work, etc. This is rather messy. We've learned that making any changes is dangerous, though: if a new kernel version allows an unprivileged program to change its security state in a way that persists cross execution of a setuid program or a program with file caps, this persistent state is surprisingly likely to allow setuid or file-capped programs to be exploited for privilege escalation. ===== The problem ===== Capability inheritance is basically useless. If you aren't root and you execute an ordinary binary, fI is zero, so your capabilities have no effect whatsoever on pP'. This means that you can't usefully execute a helper process or a shell command with elevated capabilities if you aren't root. On current kernels, you can sort of work around this by setting fI to the full set for most or all non-setuid executable files. This causes pP' = pI for nonroot, and inheritance works. No one does this because it's a PITA and it isn't even supported on most filesystems. If you try this, you'll discover that every nonroot program ends up with secure exec rules, breaking many things. This is a problem that has bitten many people who have tried to use capabilities for anything useful. ===== The proposed change ===== This patch adds a fifth capability mask called the ambient mask (pA). pA does what most people expect pI to do. pA obeys the invariant that no bit can ever be set in pA if it is not set in both pP and pI. Dropping a bit from pP or pI drops that bit from pA. This ensures that existing programs that try to drop capabilities still do so, with a complication. Because capability inheritance is so broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and then calling execve effectively drops capabilities. Therefore, setresuid from root to nonroot conditionally clears pA unless SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can re-add bits to pA afterwards. The capability evolution rules are changed: pA' = (file caps or setuid or setgid ? 0 : pA) pP' = (X & fP) | (pI & fI) | pA' pI' = pI pE' = (fE ? pP' : pA') X is unchanged If you are nonroot but you have a capability, you can add it to pA. If you do so, your children get that capability in pA, pP, and pE. For example, you can set pA = CAP_NET_BIND_SERVICE, and your children can automatically bind low-numbered ports. Hallelujah! Unprivileged users can create user namespaces, map themselves to a nonzero uid, and create both privileged (relative to their namespace) and unprivileged process trees. This is currently more or less impossible. Hallelujah! You cannot use pA to try to subvert a setuid, setgid, or file-capped program: if you execute any such program, pA gets cleared and the resulting evolution rules are unchanged by this patch. Users with nonzero pA are unlikely to unintentionally leak that capability. If they run programs that try to drop privileges, dropping privileges will still work. It's worth noting that the degree of paranoia in this patch could possibly be reduced without causing serious problems. Specifically, if we allowed pA to persist across executing non-pA-aware setuid binaries and across setresuid, then, naively, the only capabilities that could leak as a result would be the capabilities in pA, and any attacker *already* has those capabilities. This would make me nervous, though -- setuid binaries that tried to privilege-separate might fail to do so, and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have unexpected side effects. (Whether these unexpected side effects would be exploitable is an open question.) I've therefore taken the more paranoid route. We can revisit this later. An alternative would be to require PR_SET_NO_NEW_PRIVS before setting ambient capabilities. I think that this would be annoying and would make granting otherwise unprivileged users minor ambient capabilities (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than it is with this patch. ===== Footnotes ===== [1] Files that are missing the "security.capability" xattr or that have unrecognized values for that xattr end up with has_cap set to false. The code that does that appears to be complicated for no good reason. [2] The libcap capability mask parsers and formatters are dangerously misleading and the documentation is flat-out wrong. fE is *not* a mask; it's a single bit. This has probably confused every single person who has tried to use file capabilities. [3] Linux very confusingly processes both the script and the interpreter if applicable, for reasons that elude me. The results from thinking about a script's file capabilities and/or setuid bits are mostly discarded. Preliminary userspace code is here, but it needs updating: https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2 Here is a test program that can be used to verify the functionality (from Christoph): /* * Test program for the ambient capabilities. This program spawns a shell * that allows running processes with a defined set of capabilities. * * (C) 2015 Christoph Lameter * Released under: GPL v3 or later. * * * Compile using: * * gcc -o ambient_test ambient_test.o -lcap-ng * * This program must have the following capabilities to run properly: * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE * * A command to equip the binary with the right caps is: * * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test * * * To get a shell with additional caps that can be inherited by other processes: * * ./ambient_test /bin/bash * * * Verifying that it works: * * From the bash spawed by ambient_test run * * cat /proc/$$/status * * and have a look at the capabilities. */ #include #include #include #include #include #include /* * Definitions from the kernel header files. These are going to be removed * when the /usr/include files have these defined. */ #define PR_CAP_AMBIENT 47 #define PR_CAP_AMBIENT_IS_SET 1 #define PR_CAP_AMBIENT_RAISE 2 #define PR_CAP_AMBIENT_LOWER 3 #define PR_CAP_AMBIENT_CLEAR_ALL 4 static void set_ambient_cap(int cap) { int rc; capng_get_caps_process(); rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap); if (rc) { printf("Cannot add inheritable cap\n"); exit(2); } capng_apply(CAPNG_SELECT_CAPS); /* Note the two 0s at the end. Kernel checks for these */ if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) { perror("Cannot set cap"); exit(1); } } int main(int argc, char **argv) { int rc; set_ambient_cap(CAP_NET_RAW); set_ambient_cap(CAP_NET_ADMIN); set_ambient_cap(CAP_SYS_NICE); printf("Ambient_test forking shell\n"); if (execv(argv[1], argv + 1)) perror("Cannot exec"); return 0; } Signed-off-by: Christoph Lameter # Original author Signed-off-by: Andy Lutomirski Acked-by: Serge E. Hallyn Acked-by: Kees Cook Cc: Jonathan Corbet Cc: Aaron Jones Cc: Ted Ts'o Cc: Andrew G. Morgan Cc: Mimi Zohar Cc: Austin S Hemmelgarn Cc: Markku Savela Cc: Jarkko Sakkinen Cc: Michael Kerrisk Cc: James Morris Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/uapi/linux/prctl.h | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9..a8d0759 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,11 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ +/* Control the ambient capability set */ +#define PR_CAP_AMBIENT 47 +# define PR_CAP_AMBIENT_IS_SET 1 +# define PR_CAP_AMBIENT_RAISE 2 +# define PR_CAP_AMBIENT_LOWER 3 +# define PR_CAP_AMBIENT_CLEAR_ALL 4 + #endif /* _LINUX_PRCTL_H */ -- cgit v1.1 From 746bf6d64275be0c65b0631d8a72b16f1454cfa1 Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Fri, 4 Sep 2015 15:42:51 -0700 Subject: capabilities: add a securebit to disable PR_CAP_AMBIENT_RAISE Per Andrew Morgan's request, add a securebit to allow admins to disable PR_CAP_AMBIENT_RAISE. This securebit will prevent processes from adding capabilities to their ambient set. For simplicity, this disables PR_CAP_AMBIENT_RAISE entirely rather than just disabling setting previously cleared bits. Signed-off-by: Andy Lutomirski Acked-by: Andrew G. Morgan Acked-by: Serge Hallyn Cc: Kees Cook Cc: Christoph Lameter Cc: Serge Hallyn Cc: Jonathan Corbet Cc: Aaron Jones Cc: Ted Ts'o Cc: Andrew G. Morgan Cc: Mimi Zohar Cc: Austin S Hemmelgarn Cc: Markku Savela Cc: Jarkko Sakkinen Cc: Michael Kerrisk Cc: James Morris Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/uapi/linux/securebits.h | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h index 985aac9..35ac35c 100644 --- a/include/uapi/linux/securebits.h +++ b/include/uapi/linux/securebits.h @@ -43,9 +43,18 @@ #define SECBIT_KEEP_CAPS (issecure_mask(SECURE_KEEP_CAPS)) #define SECBIT_KEEP_CAPS_LOCKED (issecure_mask(SECURE_KEEP_CAPS_LOCKED)) +/* When set, a process cannot add new capabilities to its ambient set. */ +#define SECURE_NO_CAP_AMBIENT_RAISE 6 +#define SECURE_NO_CAP_AMBIENT_RAISE_LOCKED 7 /* make bit-6 immutable */ + +#define SECBIT_NO_CAP_AMBIENT_RAISE (issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE)) +#define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \ + (issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE_LOCKED)) + #define SECURE_ALL_BITS (issecure_mask(SECURE_NOROOT) | \ issecure_mask(SECURE_NO_SETUID_FIXUP) | \ - issecure_mask(SECURE_KEEP_CAPS)) + issecure_mask(SECURE_KEEP_CAPS) | \ + issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE)) #define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1) #endif /* _UAPI_LINUX_SECUREBITS_H */ -- cgit v1.1 From 1038628d80e96e3a086189172d9be8eb85ecfabf Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Fri, 4 Sep 2015 15:46:04 -0700 Subject: userfaultfd: uAPI Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol. Signed-off-by: Andrea Arcangeli Acked-by: Pavel Emelyanov Cc: Sanidhya Kashyap Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" Cc: Andres Lagar-Cavilla Cc: Dave Hansen Cc: Paolo Bonzini Cc: Rik van Riel Cc: Mel Gorman Cc: Andy Lutomirski Cc: Hugh Dickins Cc: Peter Feiner Cc: "Dr. David Alan Gilbert" Cc: Johannes Weiner Cc: "Huangpeng (Peter)" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/uapi/linux/Kbuild | 1 + include/uapi/linux/userfaultfd.h | 83 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 84 insertions(+) create mode 100644 include/uapi/linux/userfaultfd.h (limited to 'include/uapi') diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild index aafb993..70ff1d9 100644 --- a/include/uapi/linux/Kbuild +++ b/include/uapi/linux/Kbuild @@ -456,3 +456,4 @@ header-y += xfrm.h header-y += xilinx-v4l2-controls.h header-y += zorro.h header-y += zorro_ids.h +header-y += userfaultfd.h diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h new file mode 100644 index 0000000..09c2e2a --- /dev/null +++ b/include/uapi/linux/userfaultfd.h @@ -0,0 +1,83 @@ +/* + * include/linux/userfaultfd.h + * + * Copyright (C) 2007 Davide Libenzi + * Copyright (C) 2015 Red Hat, Inc. + * + */ + +#ifndef _LINUX_USERFAULTFD_H +#define _LINUX_USERFAULTFD_H + +#include + +#define UFFD_API ((__u64)0xAA) +/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */ +#define UFFD_API_BITS (UFFD_BIT_WRITE) +#define UFFD_API_IOCTLS \ + ((__u64)1 << _UFFDIO_REGISTER | \ + (__u64)1 << _UFFDIO_UNREGISTER | \ + (__u64)1 << _UFFDIO_API) +#define UFFD_API_RANGE_IOCTLS \ + ((__u64)1 << _UFFDIO_WAKE) + +/* + * Valid ioctl command number range with this API is from 0x00 to + * 0x3F. UFFDIO_API is the fixed number, everything else can be + * changed by implementing a different UFFD_API. If sticking to the + * same UFFD_API more ioctl can be added and userland will be aware of + * which ioctl the running kernel implements through the ioctl command + * bitmask written by the UFFDIO_API. + */ +#define _UFFDIO_REGISTER (0x00) +#define _UFFDIO_UNREGISTER (0x01) +#define _UFFDIO_WAKE (0x02) +#define _UFFDIO_API (0x3F) + +/* userfaultfd ioctl ids */ +#define UFFDIO 0xAA +#define UFFDIO_API _IOWR(UFFDIO, _UFFDIO_API, \ + struct uffdio_api) +#define UFFDIO_REGISTER _IOWR(UFFDIO, _UFFDIO_REGISTER, \ + struct uffdio_register) +#define UFFDIO_UNREGISTER _IOR(UFFDIO, _UFFDIO_UNREGISTER, \ + struct uffdio_range) +#define UFFDIO_WAKE _IOR(UFFDIO, _UFFDIO_WAKE, \ + struct uffdio_range) + +/* + * Valid bits below PAGE_SHIFT in the userfault address read through + * the read() syscall. + */ +#define UFFD_BIT_WRITE (1<<0) /* this was a write fault, MISSING or WP */ +#define UFFD_BIT_WP (1<<1) /* handle_userfault() reason VM_UFFD_WP */ +#define UFFD_BITS 2 /* two above bits used for UFFD_BIT_* mask */ + +struct uffdio_api { + /* userland asks for an API number */ + __u64 api; + + /* kernel answers below with the available features for the API */ + __u64 bits; + __u64 ioctls; +}; + +struct uffdio_range { + __u64 start; + __u64 len; +}; + +struct uffdio_register { + struct uffdio_range range; +#define UFFDIO_REGISTER_MODE_MISSING ((__u64)1<<0) +#define UFFDIO_REGISTER_MODE_WP ((__u64)1<<1) + __u64 mode; + + /* + * kernel answers which ioctl commands are available for the + * range, keep at the end as the last 8 bytes aren't read. + */ + __u64 ioctls; +}; + +#endif /* _LINUX_USERFAULTFD_H */ -- cgit v1.1 From 3f602d2724b1f7d2d27ddcd7963a040a5890fd16 Mon Sep 17 00:00:00 2001 From: Pavel Emelyanov Date: Fri, 4 Sep 2015 15:46:34 -0700 Subject: userfaultfd: Rename uffd_api.bits into .features This is (seems to be) the minimal thing that is required to unblock standard uffd usage from the non-cooperative one. Now more bits can be added to the features field indicating e.g. UFFD_FEATURE_FORK and others needed for the latter use-case. Signed-off-by: Pavel Emelyanov Signed-off-by: Andrea Arcangeli Cc: Sanidhya Kashyap Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" Cc: Andres Lagar-Cavilla Cc: Dave Hansen Cc: Paolo Bonzini Cc: Rik van Riel Cc: Mel Gorman Cc: Andy Lutomirski Cc: Hugh Dickins Cc: Peter Feiner Cc: "Dr. David Alan Gilbert" Cc: Johannes Weiner Cc: "Huangpeng (Peter)" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/uapi/linux/userfaultfd.h | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 09c2e2a..3302060 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -12,8 +12,8 @@ #include #define UFFD_API ((__u64)0xAA) -/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */ -#define UFFD_API_BITS (UFFD_BIT_WRITE) +/* FIXME: add "|UFFD_FEATURE_WP" to UFFD_API_FEATURES after implementing it */ +#define UFFD_API_FEATURES (UFFD_FEATURE_WRITE_BIT) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -53,12 +53,18 @@ #define UFFD_BIT_WP (1<<1) /* handle_userfault() reason VM_UFFD_WP */ #define UFFD_BITS 2 /* two above bits used for UFFD_BIT_* mask */ +/* + * Features reported in uffdio_api.features field + */ +#define UFFD_FEATURE_WRITE_BIT (1<<0) /* Corresponds to UFFD_BIT_WRITE */ +#define UFFD_FEATURE_WP_BIT (1<<1) /* Corresponds to UFFD_BIT_WP */ + struct uffdio_api { /* userland asks for an API number */ __u64 api; /* kernel answers below with the available features for the API */ - __u64 bits; + __u64 features; __u64 ioctls; }; -- cgit v1.1 From a9b85f9415fd9e529d03299e5335433f614ec1fb Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Fri, 4 Sep 2015 15:46:37 -0700 Subject: userfaultfd: change the read API to return a uffd_msg I had requests to return the full address (not the page aligned one) to userland. It's not entirely clear how the page offset could be relevant because userfaults aren't like SIGBUS that can sigjump to a different place and it actually skip resolving the fault depending on a page offset. There's currently no real way to skip the fault especially because after a UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the kernel without having to return to userland first (not even self modifying code replacing the .text that touched the faulting address would prevent the fault to be repeated). Userland cannot skip repeating the fault even more so if the fault was triggered by a KVM secondary page fault or any get_user_pages or any copy-user inside some syscall which will return to kernel code. The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading to a SIGBUS being raised because the userfault can't wait if it cannot release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for that). Still returning userland a proper structure during the read() on the uffd, can allow to use the current UFFD_API for the future non-cooperative extensions too and it looks cleaner as well. Once we get additional fields there's no point to return the fault address page aligned anymore to reuse the bits below PAGE_SHIFT. The only downside is that the read() syscall will read 32bytes instead of 8bytes but that's not going to be measurable overhead. The total number of new events that can be extended or of new future bits for already shipped events, is limited to 64 by the features field of the uffdio_api structure. If more will be needed a bump of UFFD_API will be required. [akpm@linux-foundation.org: use __packed] Signed-off-by: Andrea Arcangeli Acked-by: Pavel Emelyanov Cc: Sanidhya Kashyap Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" Cc: Andres Lagar-Cavilla Cc: Dave Hansen Cc: Paolo Bonzini Cc: Rik van Riel Cc: Mel Gorman Cc: Andy Lutomirski Cc: Hugh Dickins Cc: Peter Feiner Cc: "Dr. David Alan Gilbert" Cc: Johannes Weiner Cc: "Huangpeng (Peter)" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/uapi/linux/userfaultfd.h | 70 +++++++++++++++++++++++++++++++--------- 1 file changed, 55 insertions(+), 15 deletions(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 3302060..a5f8825 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -11,9 +11,15 @@ #include +#include + #define UFFD_API ((__u64)0xAA) -/* FIXME: add "|UFFD_FEATURE_WP" to UFFD_API_FEATURES after implementing it */ -#define UFFD_API_FEATURES (UFFD_FEATURE_WRITE_BIT) +/* + * After implementing the respective features it will become: + * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \ + * UFFD_FEATURE_EVENT_FORK) + */ +#define UFFD_API_FEATURES (0) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -45,26 +51,60 @@ #define UFFDIO_WAKE _IOR(UFFDIO, _UFFDIO_WAKE, \ struct uffdio_range) -/* - * Valid bits below PAGE_SHIFT in the userfault address read through - * the read() syscall. - */ -#define UFFD_BIT_WRITE (1<<0) /* this was a write fault, MISSING or WP */ -#define UFFD_BIT_WP (1<<1) /* handle_userfault() reason VM_UFFD_WP */ -#define UFFD_BITS 2 /* two above bits used for UFFD_BIT_* mask */ +/* read() structure */ +struct uffd_msg { + __u8 event; + + __u8 reserved1; + __u16 reserved2; + __u32 reserved3; + + union { + struct { + __u64 flags; + __u64 address; + } pagefault; + + struct { + /* unused reserved fields */ + __u64 reserved1; + __u64 reserved2; + __u64 reserved3; + } reserved; + } arg; +} __packed; /* - * Features reported in uffdio_api.features field + * Start at 0x12 and not at 0 to be more strict against bugs. */ -#define UFFD_FEATURE_WRITE_BIT (1<<0) /* Corresponds to UFFD_BIT_WRITE */ -#define UFFD_FEATURE_WP_BIT (1<<1) /* Corresponds to UFFD_BIT_WP */ +#define UFFD_EVENT_PAGEFAULT 0x12 +#if 0 /* not available yet */ +#define UFFD_EVENT_FORK 0x13 +#endif + +/* flags for UFFD_EVENT_PAGEFAULT */ +#define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */ +#define UFFD_PAGEFAULT_FLAG_WP (1<<1) /* If reason is VM_UFFD_WP */ struct uffdio_api { - /* userland asks for an API number */ + /* userland asks for an API number and the features to enable */ __u64 api; - - /* kernel answers below with the available features for the API */ + /* + * Kernel answers below with the all available features for + * the API, this notifies userland of which events and/or + * which flags for each event are enabled in the current + * kernel. + * + * Note: UFFD_EVENT_PAGEFAULT and UFFD_PAGEFAULT_FLAG_WRITE + * are to be considered implicitly always enabled in all kernels as + * long as the uffdio_api.api requested matches UFFD_API. + */ +#if 0 /* not available yet */ +#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) +#define UFFD_FEATURE_EVENT_FORK (1<<1) +#endif __u64 features; + __u64 ioctls; }; -- cgit v1.1 From 1f1c6f075904c241f9e44eb37efa8777141fc938 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Fri, 4 Sep 2015 15:47:01 -0700 Subject: userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE. Signed-off-by: Andrea Arcangeli Acked-by: Pavel Emelyanov Cc: Sanidhya Kashyap Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" Cc: Andres Lagar-Cavilla Cc: Dave Hansen Cc: Paolo Bonzini Cc: Rik van Riel Cc: Mel Gorman Cc: Andy Lutomirski Cc: Hugh Dickins Cc: Peter Feiner Cc: "Dr. David Alan Gilbert" Cc: Johannes Weiner Cc: "Huangpeng (Peter)" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/uapi/linux/userfaultfd.h | 42 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 41 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index a5f8825..df0e09b 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -25,7 +25,9 @@ (__u64)1 << _UFFDIO_UNREGISTER | \ (__u64)1 << _UFFDIO_API) #define UFFD_API_RANGE_IOCTLS \ - ((__u64)1 << _UFFDIO_WAKE) + ((__u64)1 << _UFFDIO_WAKE | \ + (__u64)1 << _UFFDIO_COPY | \ + (__u64)1 << _UFFDIO_ZEROPAGE) /* * Valid ioctl command number range with this API is from 0x00 to @@ -38,6 +40,8 @@ #define _UFFDIO_REGISTER (0x00) #define _UFFDIO_UNREGISTER (0x01) #define _UFFDIO_WAKE (0x02) +#define _UFFDIO_COPY (0x03) +#define _UFFDIO_ZEROPAGE (0x04) #define _UFFDIO_API (0x3F) /* userfaultfd ioctl ids */ @@ -50,6 +54,10 @@ struct uffdio_range) #define UFFDIO_WAKE _IOR(UFFDIO, _UFFDIO_WAKE, \ struct uffdio_range) +#define UFFDIO_COPY _IOWR(UFFDIO, _UFFDIO_COPY, \ + struct uffdio_copy) +#define UFFDIO_ZEROPAGE _IOWR(UFFDIO, _UFFDIO_ZEROPAGE, \ + struct uffdio_zeropage) /* read() structure */ struct uffd_msg { @@ -126,4 +134,36 @@ struct uffdio_register { __u64 ioctls; }; +struct uffdio_copy { + __u64 dst; + __u64 src; + __u64 len; + /* + * There will be a wrprotection flag later that allows to map + * pages wrprotected on the fly. And such a flag will be + * available if the wrprotection ioctl are implemented for the + * range according to the uffdio_register.ioctls. + */ +#define UFFDIO_COPY_MODE_DONTWAKE ((__u64)1<<0) + __u64 mode; + + /* + * "copy" is written by the ioctl and must be at the end: the + * copy_from_user will not read the last 8 bytes. + */ + __s64 copy; +}; + +struct uffdio_zeropage { + struct uffdio_range range; +#define UFFDIO_ZEROPAGE_MODE_DONTWAKE ((__u64)1<<0) + __u64 mode; + + /* + * "zeropage" is written by the ioctl and must be at the end: + * the copy_from_user will not read the last 8 bytes. + */ + __s64 zeropage; +}; + #endif /* _LINUX_USERFAULTFD_H */ -- cgit v1.1