author | Linus Torvalds <torvalds@linux-foundation.org> | 2015-06-22 15:52:04 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2015-06-22 15:52:04 -0700 |
commit | 23b7776290b10297fe2cae0fb5f166a4f2c68121 | |
tree | 73d1e76644a20bc7bff80fbfdb08e8b9a9f28420 | |
parent | 6bc4c3ad3619e1bcb4a6330e030007ace8ca465e | |
parent | 6fab54101923044712baee429ff573f03b99fc47 | |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "The main changes are:

   - lockless wakeup support for futexes and IPC message queues
     (Davidlohr Bueso, Peter Zijlstra)

   - Replace spinlocks with atomics in thread_group_cputimer(), to
     improve scalability (Jason Low)

   - NUMA balancing improvements (Rik van Riel)

   - SCHED_DEADLINE improvements (Wanpeng Li)

   - clean up and reorganize preemption helpers (Frederic Weisbecker)

   - decouple page fault disabling machinery from the preemption
     counter, to improve debuggability and robustness (David
     Hildenbrand)

   - SCHED_DEADLINE documentation updates (Luca Abeni)

   - topology CPU masks cleanups (Bartosz Golaszewski)

   - /proc/sched_debug improvements (Srikar Dronamraju)"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
sched/deadline: Remove needless parameter in dl_runtime_exceeded()
sched: Remove superfluous resetting of the p->dl_throttled flag
sched/deadline: Drop duplicate init_sched_dl_class() declaration
sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
sched/deadline: Make init_sched_dl_class() __init
sched/deadline: Optimize pull_dl_task()
sched/preempt: Add static_key() to preempt_notifiers
sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
sched/debug: Properly format runnable tasks in /proc/sched_debug
sched/numa: Only consider less busy nodes as numa balancing destinations
Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")
sched/fair: Prevent throttling in early pick_next_task_fair()
preempt: Reorganize the notrace definitions a bit
preempt: Use preempt_schedule_context() as the official tracing preemption point
sched: Make preempt_schedule_context() function-tracing safe
x86: Remove cpu_sibling_mask() and cpu_core_mask()
x86: Replace cpu_**_mask() with topology_**_cpumask()
...
138 files changed, 1442 insertions, 972 deletions
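The sched-deadline.txt changes in this merge describe a density-based schedulability test for EDF on a single CPU: the sum of WCET_i / min{D_i, P_i} must not exceed 1, a condition the text calls sufficient but not necessary. As a rough illustration only, the check could be sketched in userspace C as below. The struct and function names (rt_task, total_density, edf_density_test) are hypothetical and are not kernel interfaces; the kernel's actual admission test is based on utilizations, as the documentation itself notes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for a task (WCET, D, P); all fields use the
 * same time unit. This is an illustration, not a kernel structure. */
struct rt_task {
	double wcet;     /* worst case execution time */
	double deadline; /* relative deadline D */
	double period;   /* period or minimum inter-arrival time P */
};

/* Sum of the densities WCET_i / min{D_i, P_i} over all tasks. */
double total_density(const struct rt_task *tasks, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++) {
		double window = tasks[i].deadline < tasks[i].period ?
				tasks[i].deadline : tasks[i].period;

		assert(window > 0.0); /* guard against division by zero */
		sum += tasks[i].wcet / window;
	}
	return sum;
}

/* Sufficient (not necessary) uniprocessor EDF test: returns 1 when the
 * density sum is at most 1, and 0 when the test is inconclusive. */
int edf_density_test(const struct rt_task *tasks, size_t n)
{
	return total_density(tasks, n) <= 1.0;
}
```

Applied to the documentation's own example, Task_1=(50ms,50ms,100ms) and Task_2=(10ms,100ms,100ms), the density sum is about 1.1, so this check does not pass even though the text explains that EDF can schedule both tasks. A return value of 0 therefore means "not proven schedulable", not "unschedulable".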
diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index 0aad6de..12b1b25 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -1,6 +1,6 @@

 Export CPU topology info via sysfs. Items (attributes) are similar
-to /proc/cpuinfo.
+to /proc/cpuinfo output of some architectures:

 1) /sys/devices/system/cpu/cpuX/topology/physical_package_id:

@@ -23,20 +23,35 @@ to /proc/cpuinfo.
 4) /sys/devices/system/cpu/cpuX/topology/thread_siblings:

 	internal kernel map of cpuX's hardware threads within the same
-	core as cpuX
+	core as cpuX.

-5) /sys/devices/system/cpu/cpuX/topology/core_siblings:
+5) /sys/devices/system/cpu/cpuX/topology/thread_siblings_list:
+
+	human-readable list of cpuX's hardware threads within the same
+	core as cpuX.
+
+6) /sys/devices/system/cpu/cpuX/topology/core_siblings:

 	internal kernel map of cpuX's hardware threads within the same
 	physical_package_id.

-6) /sys/devices/system/cpu/cpuX/topology/book_siblings:
+7) /sys/devices/system/cpu/cpuX/topology/core_siblings_list:
+
+	human-readable list of cpuX's hardware threads within the same
+	physical_package_id.
+
+8) /sys/devices/system/cpu/cpuX/topology/book_siblings:

 	internal kernel map of cpuX's hardware threads within the same
 	book_id.

+9) /sys/devices/system/cpu/cpuX/topology/book_siblings_list:
+
+	human-readable list of cpuX's hardware threads within the same
+	book_id.
+
 To implement it in an architecture-neutral way, a new source file,
-drivers/base/topology.c, is to export the 4 or 6 attributes. The two book
+drivers/base/topology.c, is to export the 6 or 9 attributes. The three book
 related sysfs files will only be created if CONFIG_SCHED_BOOK is selected.

 For an architecture to support this feature, it must define some of
@@ -44,20 +59,22 @@ these macros in include/asm-XXX/topology.h:
 #define topology_physical_package_id(cpu)
 #define topology_core_id(cpu)
 #define topology_book_id(cpu)
-#define topology_thread_cpumask(cpu)
+#define topology_sibling_cpumask(cpu)
 #define topology_core_cpumask(cpu)
 #define topology_book_cpumask(cpu)

-The type of **_id is int.
-The type of siblings is (const) struct cpumask *.
+The type of **_id macros is int.
+The type of **_cpumask macros is (const) struct cpumask *. The latter
+correspond with appropriate **_siblings sysfs attributes (except for
+topology_sibling_cpumask() which corresponds with thread_siblings).

 To be consistent on all architectures, include/linux/topology.h
 provides default definitions for any of the above macros that are
 not defined by include/asm-XXX/topology.h:
 1) physical_package_id: -1
 2) core_id: 0
-3) thread_siblings: just the given CPU
-4) core_siblings: just the given CPU
+3) sibling_cpumask: just the given CPU
+4) core_cpumask: just the given CPU

 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
index 21461a0..e114513 100644
--- a/Documentation/scheduler/sched-deadline.txt
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -8,6 +8,10 @@ CONTENTS
  1. Overview
  2. Scheduling algorithm
  3. Scheduling Real-Time Tasks
+   3.1 Definitions
+   3.2 Schedulability Analysis for Uniprocessor Systems
+   3.3 Schedulability Analysis for Multiprocessor Systems
+   3.4 Relationship with SCHED_DEADLINE Parameters
  4. Bandwidth management
    4.1 System-wide settings
    4.2 Task interface
@@ -43,7 +47,7 @@ CONTENTS
  "deadline", to schedule tasks. A SCHED_DEADLINE task should receive
  "runtime" microseconds of execution time every "period" microseconds, and
  these "runtime" microseconds are available within "deadline" microseconds
- from the beginning of the period. In order to implement this behaviour,
+ from the beginning of the period. In order to implement this behavior,
  every time the task wakes up, the scheduler computes a "scheduling deadline"
  consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
  scheduled using EDF[1] on these scheduling deadlines (the task with the
@@ -52,7 +56,7 @@ CONTENTS
  "admission control" strategy (see Section "4. Bandwidth management") is used
  (clearly, if the system is overloaded this guarantee cannot be respected).

- Summing up, the CBS[2,3] algorithms assigns scheduling deadlines to tasks so
+ Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so
  that each task runs for at most its runtime every period, avoiding any
  interference between different tasks (bandwidth isolation), while the EDF[1]
  algorithm selects the task with the earliest scheduling deadline as the one
@@ -63,7 +67,7 @@ CONTENTS
  In more details, the CBS algorithm assigns scheduling deadlines to
  tasks in the following way:

-  - Each SCHED_DEADLINE task is characterised by the "runtime",
+  - Each SCHED_DEADLINE task is characterized by the "runtime",
     "deadline", and "period" parameters;

   - The state of the task is described by a "scheduling deadline", and
@@ -78,7 +82,7 @@ CONTENTS

     then, if the scheduling deadline is smaller than the current time, or
     this condition is verified, the scheduling deadline and the
-    remaining runtime are re-initialised as
+    remaining runtime are re-initialized as

          scheduling deadline = current time + deadline
          remaining runtime = runtime
@@ -126,31 +130,37 @@ CONTENTS
  suited for periodic or sporadic real-time tasks that need guarantees on their
  timing behavior, e.g., multimedia, streaming, control applications, etc.
+3.1 Definitions
+------------------------
+
  A typical real-time task is composed of a repetition of computation phases
  (task instances, or jobs) which are activated on a periodic or sporadic
  fashion.
- Each job J_j (where J_j is the j^th job of the task) is characterised by an
+ Each job J_j (where J_j is the j^th job of the task) is characterized by an
  arrival time r_j (the time when the job starts), an amount of computation
  time c_j needed to finish the job, and a job absolute deadline d_j, which
  is the time within which the job should be finished. The maximum execution
- time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task.
+ time max{c_j} is called "Worst Case Execution Time" (WCET) for the task.
  A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
  sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally,
  d_j = r_j + D, where D is the task's relative deadline.
- The utilisation of a real-time task is defined as the ratio between its
+ Summing up, a real-time task can be described as
+	Task = (WCET, D, P)
+
+ The utilization of a real-time task is defined as the ratio between its
  WCET and its period (or minimum inter-arrival time), and represents
  the fraction of CPU time needed to execute the task.
- If the total utilisation sum_i(WCET_i/P_i) is larger than M (with M equal
+ If the total utilization U=sum(WCET_i/P_i) is larger than M (with M equal
  to the number of CPUs), then the scheduler is unable to respect all the
  deadlines.
- Note that total utilisation is defined as the sum of the utilisations
+ Note that total utilization is defined as the sum of the utilizations
  WCET_i/P_i over all the real-time tasks in the system. When considering
  multiple real-time tasks, the parameters of the i-th task are indicated
  with the "_i" suffix.
- Moreover, if the total utilisation is larger than M, then we risk starving
+ Moreover, if the total utilization is larger than M, then we risk starving
  non- real-time tasks by real-time tasks.
- If, instead, the total utilisation is smaller than M, then non real-time
+ If, instead, the total utilization is smaller than M, then non real-time
  tasks will not be starved and the system might be able to respect all the
  deadlines.
  As a matter of fact, in this case it is possible to provide an upper bound
@@ -159,38 +169,119 @@ CONTENTS
  More precisely, it can be proven that using a global EDF scheduler the
  maximum tardiness of each task is smaller or equal than
 	((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
- where WCET_max = max_i{WCET_i} is the maximum WCET, WCET_min=min_i{WCET_i}
- is the minimum WCET, and U_max = max_i{WCET_i/P_i} is the maximum utilisation.
+ where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
+ is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
+ utilization[12].
+
+3.2 Schedulability Analysis for Uniprocessor Systems
+------------------------

  If M=1 (uniprocessor system), or in case of partitioned scheduling (each
  real-time task is statically assigned to one and only one CPU), it is
  possible to formally check if all the deadlines are respected.
  If D_i = P_i for all tasks, then EDF is able to respect all the deadlines
- of all the tasks executing on a CPU if and only if the total utilisation
+ of all the tasks executing on a CPU if and only if the total utilization
  of the tasks running on such a CPU is smaller or equal than 1.
  If D_i != P_i for some task, then it is possible to define the density of
- a task as C_i/min{D_i,T_i}, and EDF is able to respect all the deadlines
- of all the tasks running on a CPU if the sum sum_i C_i/min{D_i,T_i} of the
- densities of the tasks running on such a CPU is smaller or equal than 1
- (notice that this condition is only sufficient, and not necessary).
+ a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
+ of all the tasks running on a CPU if the sum of the densities of the tasks
+ running on such a CPU is smaller or equal than 1:
+	sum(WCET_i / min{D_i, P_i}) <= 1
+ It is important to notice that this condition is only sufficient, and not
+ necessary: there are task sets that are schedulable, but do not respect the
+ condition. For example, consider the task set {Task_1,Task_2} composed by
+ Task_1=(50ms,50ms,100ms) and Task_2=(10ms,100ms,100ms).
+ EDF is clearly able to schedule the two tasks without missing any deadline
+ (Task_1 is scheduled as soon as it is released, and finishes just in time
+ to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
+ its response time cannot be larger than 50ms + 10ms = 60ms) even if
+	50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1
+ Of course it is possible to test the exact schedulability of tasks with
+ D_i != P_i (checking a condition that is both sufficient and necessary),
+ but this cannot be done by comparing the total utilization or density with
+ a constant. Instead, the so called "processor demand" approach can be used,
+ computing the total amount of CPU time h(t) needed by all the tasks to
+ respect all of their deadlines in a time interval of size t, and comparing
+ such a time with the interval size t. If h(t) is smaller than t (that is,
+ the amount of time needed by the tasks in a time interval of size t is
+ smaller than the size of the interval) for all the possible values of t, then
+ EDF is able to schedule the tasks respecting all of their deadlines. Since
+ performing this check for all possible values of t is impossible, it has been
+ proven[4,5,6] that it is sufficient to perform the test for values of t
+ between 0 and a maximum value L. The cited papers contain all of the
+ mathematical details and explain how to compute h(t) and L.
+ In any case, this kind of analysis is too complex as well as too
+ time-consuming to be performed on-line. Hence, as explained in Section
+ 4 Linux uses an admission test based on the tasks' utilizations.
+
+3.3 Schedulability Analysis for Multiprocessor Systems
+------------------------

  On multiprocessor systems with global EDF scheduling (non partitioned
  systems), a sufficient test for schedulability can not be based on the
- utilisations (it can be shown that task sets with utilisations slightly
- larger than 1 can miss deadlines regardless of the number of CPUs M).
- However, as previously stated, enforcing that the total utilisation is smaller
- than M is enough to guarantee that non real-time tasks are not starved and
- that the tardiness of real-time tasks has an upper bound.
+ utilizations or densities: it can be shown that even if D_i = P_i task
+ sets with utilizations slightly larger than 1 can miss deadlines regardless
+ of the number of CPUs.
+
+ Consider a set {Task_1,...Task_{M+1}} of M+1 tasks on a system with M
+ CPUs, with the first task Task_1=(P,P,P) having period, relative deadline
+ and WCET equal to P. The remaining M tasks Task_i=(e,P-1,P-1) have an
+ arbitrarily small worst case execution time (indicated as "e" here) and a
+ period smaller than the one of the first task. Hence, if all the tasks
+ activate at the same time t, global EDF schedules these M tasks first
+ (because their absolute deadlines are equal to t + P - 1, hence they are
+ smaller than the absolute deadline of Task_1, which is t + P). As a
+ result, Task_1 can be scheduled only at time t + e, and will finish at
+ time t + e + P, after its absolute deadline. The total utilization of the
+ task set is U = M · e / (P - 1) + P / P = M · e / (P - 1) + 1, and for small
+ values of e this can become very close to 1. This is known as "Dhall's
+ effect"[7]. Note: the example in the original paper by Dhall has been
+ slightly simplified here (for example, Dhall more correctly computed
+ lim_{e->0}U).
+
+ More complex schedulability tests for global EDF have been developed in
+ real-time literature[8,9], but they are not based on a simple comparison
+ between total utilization (or density) and a fixed constant. If all tasks
+ have D_i = P_i, a sufficient schedulability condition can be expressed in
+ a simple way:
+	sum(WCET_i / P_i) <= M - (M - 1) · U_max
+ where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
+ M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
+ just confirms the Dhall's effect. A more complete survey of the literature
+ about schedulability tests for multi-processor real-time scheduling can be
+ found in [11].
+
+ As seen, enforcing that the total utilization is smaller than M does not
+ guarantee that global EDF schedules the tasks without missing any deadline
+ (in other words, global EDF is not an optimal scheduling algorithm). However,
+ a total utilization smaller than M is enough to guarantee that non real-time
+ tasks are not starved and that the tardiness of real-time tasks has an upper
+ bound[12] (as previously noted). Different bounds on the maximum tardiness
+ experienced by real-time tasks have been developed in various papers[13,14],
+ but the theoretical result that is important for SCHED_DEADLINE is that if
+ the total utilization is smaller or equal than M then the response times of
+ the tasks are limited.
+
+3.4 Relationship with SCHED_DEADLINE Parameters
+------------------------

- SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that
- the jobs' deadlines of a task are respected. In order to do this, a task
- must be scheduled by setting:
+ Finally, it is important to understand the relationship between the
+ SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
+ deadline and period) and the real-time task parameters (WCET, D, P)
+ described in this section. Note that the tasks' temporal constraints are
+ represented by its absolute deadlines d_j = r_j + D described above, while
+ SCHED_DEADLINE schedules the tasks according to scheduling deadlines (see
+ Section 2).
+ If an admission test is used to guarantee that the scheduling deadlines
+ are respected, then SCHED_DEADLINE can be used to schedule real-time tasks
+ guaranteeing that all the jobs' deadlines of a task are respected.
+ In order to do this, a task must be scheduled by setting:

   - runtime >= WCET
   - deadline = D
   - period <= P

- IOW, if runtime >= WCET and if period is >= P, then the scheduling deadlines
+ IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines
  and the absolute deadlines (d_j) coincide, so a proper admission control
  allows to respect the jobs' absolute deadlines for this task (this is what is
  called "hard schedulability property" and is an extension of Lemma 1 of [2]).
@@ -206,6 +297,39 @@ CONTENTS
      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
      Technical Report. http://disi.unitn.it/~abeni/tr-98-01.pdf
+ 4 - J. Y. Leung and M.L. Merril. A Note on Preemptive Scheduling of
+     Periodic, Real-Time Tasks. Information Processing Letters, vol. 11,
+     no. 3, pp. 115-118, 1980.
+ 5 - S. K. Baruah, A. K. Mok and L. E. Rosier. Preemptively Scheduling
+     Hard-Real-Time Sporadic Tasks on One Processor. Proceedings of the
+     11th IEEE Real-time Systems Symposium, 1990.
+ 6 - S. K. Baruah, L. E. Rosier and R. R. Howell. Algorithms and Complexity
+     Concerning the Preemptive Scheduling of Periodic Real-Time tasks on
+     One Processor. Real-Time Systems Journal, vol. 4, no. 2, pp 301-324,
+     1990.
+ 7 - S. J. Dhall and C. L. Liu. On a real-time scheduling problem. Operations
+     research, vol. 26, no. 1, pp 127-140, 1978.
+ 8 - T. Baker. Multiprocessor EDF and Deadline Monotonic Schedulability
+     Analysis. Proceedings of the 24th IEEE Real-Time Systems Symposium, 2003.
+ 9 - T. Baker. An Analysis of EDF Schedulability on a Multiprocessor.
+     IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 8,
+     pp 760-768, 2005.
+ 10 - J. Goossens, S. Funk and S. Baruah, Priority-Driven Scheduling of
+       Periodic Task Systems on Multiprocessors. Real-Time Systems Journal,
+       vol. 25, no. 2–3, pp. 187–205, 2003.
+ 11 - R. Davis and A. Burns. A Survey of Hard Real-Time Scheduling for
+       Multiprocessor Systems. ACM Computing Surveys, vol. 43, no. 4, 2011.
+       http://www-users.cs.york.ac.uk/~robdavis/papers/MPSurveyv5.0.pdf
+ 12 - U. C. Devi and J. H. Anderson. Tardiness Bounds under Global EDF
+       Scheduling on a Multiprocessor. Real-Time Systems Journal, vol. 32,
+       no. 2, pp 133-189, 2008.
+ 13 - P. Valente and G. Lipari. An Upper Bound to the Lateness of Soft
+       Real-Time Tasks Scheduled by EDF on Multiprocessors. Proceedings of
+       the 26th IEEE Real-Time Systems Symposium, 2005.
+ 14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for
+       Global EDF. Proceedings of the 22nd Euromicro Conference on
+       Real-Time Systems, 2010.
+

 4. Bandwidth management
 =======================
@@ -218,10 +342,10 @@ CONTENTS
  no guarantee can be given on the actual scheduling of the -deadline tasks.

  As already stated in Section 3, a necessary condition to be respected to
- correctly schedule a set of real-time tasks is that the total utilisation
+ correctly schedule a set of real-time tasks is that the total utilization
  is smaller than M. When talking about -deadline tasks, this requires that
  the sum of the ratio between runtime and period for all tasks is smaller
- than M. Notice that the ratio runtime/period is equivalent to the utilisation
+ than M. Notice that the ratio runtime/period is equivalent to the utilization
  of a "traditional" real-time task, and is also often referred to as
  "bandwidth".
  The interface used to control the CPU bandwidth that can be allocated
@@ -251,7 +375,7 @@ CONTENTS
  The system wide settings are configured under the /proc virtual file system.

  For now the -rt knobs are used for -deadline admission control and the
- -deadline runtime is accounted against the -rt runtime. We realise that this
+ -deadline runtime is accounted against the -rt runtime. We realize that this
  isn't entirely desirable; however, it is better to have a small interface for
  now, and be able to change it easily later. The ideal situation (see 5.) is to
  run -rt tasks from a -deadline server; in which case the -rt bandwidth is a
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 9d0ac09..4a905bd 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -23,8 +23,7 @@
 #include <linux/smp.h>
 #include <linux/interrupt.h>
 #include <linux/module.h>
-
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>

 extern void die_if_kernel(char *,struct pt_regs *,long, unsigned long *);

@@ -107,7 +106,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,

 	/* If we're in an interrupt context, or have no user context,
 	   we must not take the fault. */
-	if (!mm || in_atomic())
+	if (!mm || faulthandler_disabled())
 		goto no_context;

 #ifdef CONFIG_ALPHA_LARGE_VMALLOC
diff --git a/arch/arc/include/asm/futex.h b/arch/arc/include/asm/futex.h
index 4dc64dd..05b5aaf 100644
--- a/arch/arc/include/asm/futex.h
+++ b/arch/arc/include/asm/futex.h
@@ -53,7 +53,7 @@ static inline int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int)))
 		return -EFAULT;

-	pagefault_disable(); /* implies preempt_disable() */
+	pagefault_disable();

 	switch (op) {
 	case FUTEX_OP_SET:
@@ -75,7 +75,7 @@ static inline int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
 		ret = -ENOSYS;
 	}

-	pagefault_enable(); /* subsumes preempt_enable() */
+	pagefault_enable();

 	if (!ret) {
 		switch (cmp) {
@@ -104,7 +104,7 @@ static inline int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
 	return ret;
 }

-/* Compare-xchg with preemption disabled.
+/* Compare-xchg with pagefaults disabled.
  * Notes:
  * -Best-Effort: Exchg happens only if compare succeeds.
  * If compare fails, returns; leaving retry/looping to upper layers
@@ -121,7 +121,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 oldval,
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int)))
 		return -EFAULT;

-	pagefault_disable(); /* implies preempt_disable() */
+	pagefault_disable();

 	/* TBD : can use llock/scond */
 	__asm__ __volatile__(
@@ -142,7 +142,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 oldval,
 	: "r"(oldval), "r"(newval), "r"(uaddr), "ir"(-EFAULT)
 	: "cc", "memory");

-	pagefault_enable(); /* subsumes preempt_enable() */
+	pagefault_enable();

 	*uval = val;
 	return val;
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 6a2e006..d948e4e9 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -86,7 +86,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;

 	if (user_mode(regs))
diff --git a/arch/arm/include/asm/futex.h b/arch/arm/include/asm/futex.h
index 4e78065..5eed828 100644
--- a/arch/arm/include/asm/futex.h
+++ b/arch/arm/include/asm/futex.h
@@ -93,6 +93,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
 		return -EFAULT;

+	preempt_disable();
 	__asm__ __volatile__("@futex_atomic_cmpxchg_inatomic\n"
 	"1: " TUSER(ldr) " %1, [%4]\n"
 	" teq %1, %2\n"
@@ -104,6 +105,8 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 	: "cc", "memory");

 	*uval = val;
+	preempt_enable();
+
 	return ret;
 }

@@ -124,7 +127,10 @@ futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
 		return -EFAULT;

-	pagefault_disable(); /* implies preempt_disable() */
+#ifndef CONFIG_SMP
+	preempt_disable();
+#endif
+	pagefault_disable();

 	switch (op) {
 	case FUTEX_OP_SET:
@@ -146,7 +152,10 @@ futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
 		ret = -ENOSYS;
 	}

-	pagefault_enable(); /* subsumes preempt_enable() */
+	pagefault_enable();
+#ifndef CONFIG_SMP
+	preempt_enable();
+#endif

 	if (!ret) {
 		switch (cmp) {
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 2fe85ff..370f7a7 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -18,7 +18,7 @@ extern struct cputopo_arm cpu_topology[NR_CPUS];
 #define topology_physical_package_id(cpu) (cpu_topology[cpu].socket_id)
 #define topology_core_id(cpu) (cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu) (&cpu_topology[cpu].core_sibling)
-#define topology_thread_cpumask(cpu) (&cpu_topology[cpu].thread_sibling)
+#define topology_sibling_cpumask(cpu) (&cpu_topology[cpu].thread_sibling)

 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 6333d9c..0d629b8 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -276,7 +276,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;

 	if (user_mode(regs))
diff --git a/arch/arm/mm/highmem.c b/arch/arm/mm/highmem.c
index b98895d..ee8dfa7 100644
--- a/arch/arm/mm/highmem.c
+++ b/arch/arm/mm/highmem.c
@@ -59,6 +59,7 @@ void *kmap_atomic(struct page *page)
 	void *kmap;
 	int type;

+	preempt_disable();
 	pagefault_disable();
 	if (!PageHighMem(page))
 		return page_address(page);
@@ -121,6 +122,7 @@ void __kunmap_atomic(void *kvaddr)
 		kunmap_high(pte_page(pkmap_page_table[PKMAP_NR(vaddr)]));
 	}
 	pagefault_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__kunmap_atomic);

@@ -130,6 +132,7 @@ void *kmap_atomic_pfn(unsigned long pfn)
 	int idx, type;
 	struct page *page = pfn_to_page(pfn);

+	preempt_disable();
 	pagefault_disable();
 	if (!PageHighMem(page))
 		return page_address(page);
diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
index 5f750dc..74069b3 100644
--- a/arch/arm64/include/asm/futex.h
+++ b/arch/arm64/include/asm/futex.h
@@ -58,7 +58,7 @@ futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
 		return -EFAULT;

-	pagefault_disable(); /* implies preempt_disable() */
+	pagefault_disable();

 	switch (op) {
 	case FUTEX_OP_SET:
@@ -85,7 +85,7 @@ futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
 		ret = -ENOSYS;
 	}

-	pagefault_enable(); /* subsumes preempt_enable() */
+	pagefault_enable();

 	if (!ret) {
 		switch (cmp) {
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index 7ebcd31..225ec35 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -18,7 +18,7 @@ extern struct cpu_topology cpu_topology[NR_CPUS];
 #define topology_physical_package_id(cpu) (cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu) (cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu) (&cpu_topology[cpu].core_sibling)
-#define topology_thread_cpumask(cpu) (&cpu_topology[cpu].thread_sibling)
+#define topology_sibling_cpumask(cpu) (&cpu_topology[cpu].thread_sibling)

 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 96da131..0948d32 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -211,7 +211,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	 * If we're in an interrupt or have no user context, we must not take
 	 * the fault.
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;

 	if (user_mode(regs))
diff --git a/arch/avr32/include/asm/uaccess.h b/arch/avr32/include/asm/uaccess.h
index a46f7cf..68cf638 100644
--- a/arch/avr32/include/asm/uaccess.h
+++ b/arch/avr32/include/asm/uaccess.h
@@ -97,7 +97,8 @@ static inline __kernel_size_t __copy_from_user(void *to,
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -116,7 +117,8 @@ static inline __kernel_size_t __copy_from_user(void *to,
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -136,7 +138,8 @@ static inline __kernel_size_t __copy_from_user(void *to,
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -158,7 +161,8 @@ static inline __kernel_size_t __copy_from_user(void *to,
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index d223a8b..c035339 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -14,11 +14,11 @@
 #include <linux/pagemap.h>
 #include <linux/kdebug.h>
 #include <linux/kprobes.h>
+#include <linux/uaccess.h>

 #include <asm/mmu_context.h>
 #include <asm/sysreg.h>
 #include <asm/tlb.h>
-#include <asm/uaccess.h>

 #ifdef CONFIG_KPROBES
 static inline int notify_page_fault(struct pt_regs *regs, int trap)
@@ -81,7 +81,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs)
 	 * If we're in an interrupt or have no user context, we must
 	 * not take the fault...
 	 */
-	if (in_atomic() || !mm || regs->sr & SYSREG_BIT(GM))
+	if (faulthandler_disabled() || !mm || regs->sr & SYSREG_BIT(GM))
 		goto no_context;

 	local_irq_enable();
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 83f12f2..3066d40 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -8,7 +8,7 @@
 #include <linux/interrupt.h>
 #include <linux/module.h>
 #include <linux/wait.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 #include <arch/system.h>

 extern int find_fixup_code(struct pt_regs *);
@@ -109,11 +109,11 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
 	info.si_code = SEGV_MAPERR;

 	/*
-	 * If we're in an interrupt or "atomic" operation or have no
+	 * If we're in an interrupt, have pagefaults disabled or have no
 	 * user context, we must not take the fault.
 	 */

-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;

 	if (user_mode(regs))
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index ec4917d..61d9976 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -19,9 +19,9 @@
 #include <linux/kernel.h>
 #include <linux/ptrace.h>
 #include <linux/hardirq.h>
+#include <linux/uaccess.h>

 #include <asm/pgtable.h>
-#include <asm/uaccess.h>
 #include <asm/gdb-stub.h>

 /*****************************************************************************/
@@ -78,7 +78,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;

 	if (user_mode(__frame))
diff --git a/arch/frv/mm/highmem.c b/arch/frv/mm/highmem.c
index bed9a9b..785344b 100644
--- a/arch/frv/mm/highmem.c
+++ b/arch/frv/mm/highmem.c
@@ -42,6 +42,7 @@ void *kmap_atomic(struct page *page)
 	unsigned long paddr;
 	int type;

+	preempt_disable();
 	pagefault_disable();
 	type = kmap_atomic_idx_push();
 	paddr = page_to_phys(page);
@@ -85,5 +86,6 @@ void __kunmap_atomic(void *kvaddr)
 	}
 	kmap_atomic_idx_pop();
 	pagefault_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__kunmap_atomic);
diff --git a/arch/hexagon/include/asm/uaccess.h b/arch/hexagon/include/asm/uaccess.h
index e4127e4..f000a38 100644
--- a/arch/hexagon/include/asm/uaccess.h
+++ b/arch/hexagon/include/asm/uaccess.h
@@ -36,7 +36,8 @@
  * @addr: User space pointer to start of block to check
  * @size: Size of block to check
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Checks if a pointer to a block of memory in user space is valid.
  *
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 6437ca2..3ad8f69 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -53,7 +53,7 @@ void build_cpu_to_node_map(void);
 #define topology_physical_package_id(cpu) (cpu_data(cpu)->socket_id)
 #define topology_core_id(cpu) (cpu_data(cpu)->core_id)
 #define topology_core_cpumask(cpu) (&cpu_core_map[cpu])
-#define topology_thread_cpumask(cpu) (&per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_cpumask(cpu) (&per_cpu(cpu_sibling_map, cpu))
 #endif

 extern void arch_fix_phys_package_id(int num, u32 slot);
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index ba5ba7a..70b40d1 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -11,10 +11,10 @@
 #include <linux/kprobes.h>
 #include <linux/kdebug.h>
 #include <linux/prefetch.h>
+#include <linux/uaccess.h>

 #include <asm/pgtable.h>
 #include <asm/processor.h>
-#include <asm/uaccess.h>

 extern int die(char *, struct pt_regs *, long);

@@ -96,7 +96,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	/*
 	 * If we're in an interrupt or have no user context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;

 #ifdef CONFIG_VIRTUAL_MEM_MAP
diff --git a/arch/m32r/include/asm/uaccess.h b/arch/m32r/include/asm/uaccess.h
index 71adff2..cac7014 100644
--- a/arch/m32r/include/asm/uaccess.h
+++ b/arch/m32r/include/asm/uaccess.h
@@ -91,7 +91,8 @@ static inline void set_fs(mm_segment_t s)
  * @addr: User space pointer to start of block to check
  * @size: Size of block to check
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Checks if a pointer to a block of memory in user space is valid.
  *
@@ -155,7 +156,8 @@ extern int fixup_exception(struct pt_regs *regs);
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -175,7 +177,8 @@ extern int fixup_exception(struct pt_regs *regs);
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -194,7 +197,8 @@ extern int fixup_exception(struct pt_regs *regs);
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -274,7 +278,8 @@ do { \
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -568,7 +573,8 @@ unsigned long __generic_copy_from_user(void *, const void __user *, unsigned lon
  * @from: Source address, in kernel space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from kernel space to user space. Caller must check
  * the specified block with access_ok() before calling this function.
@@ -588,7 +594,8 @@ unsigned long __generic_copy_from_user(void *, const void __user *, unsigned lon
  * @from: Source address, in kernel space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from kernel space to user space.
  *
@@ -606,7 +613,8 @@ unsigned long __generic_copy_from_user(void *, const void __user *, unsigned lon
  * @from: Source address, in user space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from user space to kernel space. Caller must check
  * the specified block with access_ok() before calling this function.
@@ -626,7 +634,8 @@ unsigned long __generic_copy_from_user(void *, const void __user *, unsigned lon
  * @from: Source address, in user space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from user space to kernel space.
  *
@@ -677,7 +686,8 @@ unsigned long clear_user(void __user *mem, unsigned long len);
  * strlen_user: - Get the size of a string in user space.
  * @str: The string to measure.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Get the size of a NUL-terminated string in user space.
  *
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index e3d4d48901..8f9875b 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -24,9 +24,9 @@
 #include <linux/vt_kern.h> /* For unblank_screen() */
 #include <linux/highmem.h>
 #include <linux/module.h>
+#include <linux/uaccess.h>

 #include <asm/m32r.h>
-#include <asm/uaccess.h>
 #include <asm/hardirq.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -111,10 +111,10 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	mm = tsk->mm;

 	/*
-	 * If we're in an interrupt or have no user context or are running in an
-	 * atomic region then we must not take the fault..
+	 * If we're in an interrupt or have no user context or have pagefaults
+	 * disabled then we must not take the fault.
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto bad_area_nosemaphore;

 	if (error_code & ACE_USERMODE)
diff --git a/arch/m68k/include/asm/irqflags.h b/arch/m68k/include/asm/irqflags.h
index a823cd7..b594181 100644
--- a/arch/m68k/include/asm/irqflags.h
+++ b/arch/m68k/include/asm/irqflags.h
@@ -2,9 +2,6 @@
 #define _M68K_IRQFLAGS_H

 #include <linux/types.h>
-#ifdef CONFIG_MMU
-#include <linux/preempt_mask.h>
-#endif
 #include <linux/preempt.h>
 #include <asm/thread_info.h>
 #include <asm/entry.h>
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index b2f04ae..6a94cdd 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -10,10 +10,10 @@
 #include <linux/ptrace.h>
 #include <linux/interrupt.h>
 #include <linux/module.h>
+#include <linux/uaccess.h>

 #include <asm/setup.h>
 #include <asm/traps.h>
-#include <asm/uaccess.h>
 #include <asm/pgalloc.h>

 extern void die_if_kernel(char *, struct pt_regs *, long);
@@ -81,7 +81,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;

 	if (user_mode(regs))
diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index 2de5dc6..f57edca 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -105,7 +105,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,

 	mm = tsk->mm;

-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;

 	if (user_mode(regs))
diff --git a/arch/metag/mm/highmem.c b/arch/metag/mm/highmem.c
index d71f621..807f1b1 100644
--- a/arch/metag/mm/highmem.c
+++ b/arch/metag/mm/highmem.c
@@ -43,7 +43,7 @@ void *kmap_atomic(struct page *page)
 	unsigned long vaddr;
 	int type;

-	/* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */
+	preempt_disable();
 	pagefault_disable();
 	if (!PageHighMem(page))
 		return page_address(page);
@@ -82,6 +82,7 @@ void __kunmap_atomic(void *kvaddr)
 	}

 	pagefault_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__kunmap_atomic);

@@ -95,6 +96,7 @@ void *kmap_atomic_pfn(unsigned long pfn)
 	unsigned long vaddr;
 	int type;

+	preempt_disable();
 	pagefault_disable();

 	type = kmap_atomic_idx_push();
diff --git a/arch/microblaze/include/asm/uaccess.h b/arch/microblaze/include/asm/uaccess.h
index 62942fd..331b0d3 100644
--- a/arch/microblaze/include/asm/uaccess.h
+++ b/arch/microblaze/include/asm/uaccess.h
@@ -178,7 +178,8 @@ extern long __user_bad(void);
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -290,7 +291,8 @@ extern long __user_bad(void);
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index d46a5eb..177dfc0 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -107,14 +107,14 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	if ((error_code & 0x13) == 0x13 || (error_code & 0x11) == 0x11)
 		is_write = 0;

-	if (unlikely(in_atomic() || !mm)) {
+	if (unlikely(faulthandler_disabled() || !mm)) {
 		if (kernel_mode(regs))
 			goto bad_area_nosemaphore;

-		/* in_atomic() in user mode is really bad,
+		/* faulthandler_disabled() in user mode is really bad,
 		   as is current->mm == NULL. */
-		pr_emerg("Page fault in user mode with in_atomic(), mm = %p\n",
-			mm);
+		pr_emerg("Page fault in user mode with faulthandler_disabled(), mm = %p\n",
+			mm);
 		pr_emerg("r15 = %lx MSR = %lx\n",
 			regs->r15, regs->msr);
 		die("Weird page fault", regs, SIGSEGV);
diff --git a/arch/microblaze/mm/highmem.c b/arch/microblaze/mm/highmem.c
index 5a92576..2fcc5a5 100644
--- a/arch/microblaze/mm/highmem.c
+++ b/arch/microblaze/mm/highmem.c
@@ -37,7 +37,7 @@ void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 	unsigned long vaddr;
 	int idx, type;

-	/* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */
+	preempt_disable();
 	pagefault_disable();
 	if (!PageHighMem(page))
 		return page_address(page);
@@ -63,6 +63,7 @@ void __kunmap_atomic(void *kvaddr)

 	if (vaddr < __fix_to_virt(FIX_KMAP_END)) {
 		pagefault_enable();
+		preempt_enable();
 		return;
 	}

@@ -84,5 +85,6 @@ void __kunmap_atomic(void *kvaddr)
 #endif
 	kmap_atomic_idx_pop();
 	pagefault_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__kunmap_atomic);
diff --git a/arch/mips/include/asm/topology.h b/arch/mips/include/asm/topology.h
index 3e307ec..7afda41 100644
--- a/arch/mips/include/asm/topology.h
+++ b/arch/mips/include/asm/topology.h
@@ -15,7 +15,7 @@
 #define topology_physical_package_id(cpu) (cpu_data[cpu].package)
 #define topology_core_id(cpu) (cpu_data[cpu].core)
 #define topology_core_cpumask(cpu) (&cpu_core_map[cpu])
-#define topology_thread_cpumask(cpu) (&cpu_sibling_map[cpu])
+#define topology_sibling_cpumask(cpu) (&cpu_sibling_map[cpu])
 #endif

 #endif /* __ASM_TOPOLOGY_H */
diff --git a/arch/mips/include/asm/uaccess.h b/arch/mips/include/asm/uaccess.h
index bf8b324..9722357 100644
--- a/arch/mips/include/asm/uaccess.h
+++ b/arch/mips/include/asm/uaccess.h
@@ -103,7 +103,8 @@ extern u64 __ua_limit;
  * @addr: User space pointer to start of block to check
  * @size: Size of block to check
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Checks if a pointer to a block of memory in user space is valid.
  *
@@ -138,7 +139,8 @@ extern u64 __ua_limit;
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -157,7 +159,8 @@ extern u64 __ua_limit;
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -177,7 +180,8 @@ extern u64 __ua_limit;
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -199,7 +203,8 @@ extern u64 __ua_limit;
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -498,7 +503,8 @@ extern void __put_user_unknown(void);
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -517,7 +523,8 @@ extern void __put_user_unknown(void);
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -537,7 +544,8 @@ extern void __put_user_unknown(void);
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -559,7 +567,8 @@ extern void __put_user_unknown(void);
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -815,7 +824,8 @@ extern size_t __copy_user(void *__to, const void *__from, size_t __n);
  * @from: Source address, in kernel space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from kernel space to user space. Caller must check
  * the specified block with access_ok() before calling this function.
@@ -888,7 +898,8 @@ extern size_t __copy_user_inatomic(void *__to, const void *__from, size_t __n);
  * @from: Source address, in kernel space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from kernel space to user space.
  *
@@ -1075,7 +1086,8 @@ extern size_t __copy_in_user_eva(void *__to, const void *__from, size_t __n);
  * @from: Source address, in user space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from user space to kernel space. Caller must check
  * the specified block with access_ok() before calling this function.
@@ -1107,7 +1119,8 @@ extern size_t __copy_in_user_eva(void *__to, const void *__from, size_t __n);
  * @from: Source address, in user space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from user space to kernel space.
  *
@@ -1329,7 +1342,8 @@ strncpy_from_user(char *__to, const char __user *__from, long __len)
  * strlen_user: - Get the size of a string in user space.
  * @str: The string to measure.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Get the size of a NUL-terminated string in user space.
  *
@@ -1398,7 +1412,8 @@ static inline long __strnlen_user(const char __user *s, long n)
  * strnlen_user: - Get the size of a string in user space.
  * @str: The string to measure.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Get the size of a NUL-terminated string in user space.
  *
diff --git a/arch/mips/kernel/signal-common.h b/arch/mips/kernel/signal-common.h
index 06805e0..0b85f82 100644
--- a/arch/mips/kernel/signal-common.h
+++ b/arch/mips/kernel/signal-common.h
@@ -28,12 +28,7 @@ extern void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
 extern int fpcsr_pending(unsigned int __user *fpcsr);

 /* Make sure we will not lose FPU ownership */
-#ifdef CONFIG_PREEMPT
-#define lock_fpu_owner() preempt_disable()
-#define unlock_fpu_owner() preempt_enable()
-#else
-#define lock_fpu_owner() pagefault_disable()
-#define unlock_fpu_owner() pagefault_enable()
-#endif
+#define lock_fpu_owner() ({ preempt_disable(); pagefault_disable(); })
+#define unlock_fpu_owner() ({ pagefault_enable(); preempt_enable(); })

 #endif /* __SIGNAL_COMMON_H */
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 7ff8637..36c0f26 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -21,10 +21,10 @@
 #include <linux/module.h>
 #include <linux/kprobes.h>
 #include <linux/perf_event.h>
+#include <linux/uaccess.h>

 #include <asm/branch.h>
 #include <asm/mmu_context.h>
-#include <asm/uaccess.h>
 #include <asm/ptrace.h>
 #include <asm/highmem.h> /* For VMALLOC_END */
 #include <linux/kdebug.h>
@@ -94,7 +94,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto bad_area_nosemaphore;

 	if (user_mode(regs))
diff --git a/arch/mips/mm/highmem.c b/arch/mips/mm/highmem.c
index da815d2..11661cb 100644
--- a/arch/mips/mm/highmem.c
+++ b/arch/mips/mm/highmem.c
@@ -47,7 +47,7 @@ void *kmap_atomic(struct page *page)
 	unsigned long vaddr;
 	int idx, type;

-	/* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */
+	preempt_disable();
 	pagefault_disable();
 	if (!PageHighMem(page))
 		return page_address(page);
@@ -72,6 +72,7 @@ void __kunmap_atomic(void *kvaddr)

 	if (vaddr < FIXADDR_START) { // FIXME
 		pagefault_enable();
+		preempt_enable();
 		return;
 	}

@@ -92,6 +93,7 @@ void __kunmap_atomic(void *kvaddr)
 #endif
 	kmap_atomic_idx_pop();
 	pagefault_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__kunmap_atomic);

@@ -104,6 +106,7 @@ void *kmap_atomic_pfn(unsigned long pfn)
 	unsigned long vaddr;
 	int idx, type;

+	preempt_disable();
 	pagefault_disable();

 	type = kmap_atomic_idx_push();
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index faa5c98..198a314 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -90,6 +90,7 @@ static void *__kmap_pgprot(struct page *page, unsigned long addr, pgprot_t prot)

 	BUG_ON(Page_dcache_dirty(page));

+	preempt_disable();
 	pagefault_disable();
 	idx = (addr >> PAGE_SHIFT) & (FIX_N_COLOURS - 1);
 	idx += in_interrupt() ? FIX_N_COLOURS : 0;
@@ -152,6 +153,7 @@ void kunmap_coherent(void)
 	write_c0_entryhi(old_ctx);
 	local_irq_restore(flags);
 	pagefault_enable();
+	preempt_enable();
 }

 void copy_user_highpage(struct page *to, struct page *from,
diff --git a/arch/mn10300/include/asm/highmem.h b/arch/mn10300/include/asm/highmem.h
index 2fbbe4d..1ddea5a 100644
--- a/arch/mn10300/include/asm/highmem.h
+++ b/arch/mn10300/include/asm/highmem.h
@@ -75,6 +75,7 @@ static inline void *kmap_atomic(struct page *page)
 	unsigned long vaddr;
 	int idx, type;

+	preempt_disable();
 	pagefault_disable();
 	if (page < highmem_start_page)
 		return page_address(page);
@@ -98,6 +99,7 @@ static inline void __kunmap_atomic(unsigned long vaddr)

 	if (vaddr < FIXADDR_START) { /* FIXME */
 		pagefault_enable();
+		preempt_enable();
 		return;
 	}

@@ -122,6 +124,7 @@ static inline void __kunmap_atomic(unsigned long vaddr)

 	kmap_atomic_idx_pop();
 	pagefault_enable();
+	preempt_enable();
 }
 #endif /* __KERNEL__ */

diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 0c2cc5d..4a1d181 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -23,8 +23,8 @@
 #include <linux/interrupt.h>
 #include <linux/init.h>
 #include <linux/vt_kern.h> /* For unblank_screen() */
+#include <linux/uaccess.h>

-#include <asm/uaccess.h>
 #include <asm/pgalloc.h>
 #include <asm/hardirq.h>
 #include <asm/cpu-regs.h>
@@ -168,7 +168,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code,
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;

 	if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 0c9b6af..b51878b 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -77,7 +77,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto bad_area_nosemaphore;

 	if (user_mode(regs))
diff --git a/arch/parisc/include/asm/cacheflush.h b/arch/parisc/include/asm/cacheflush.h
index de65f66..ec2df4b 100644
--- a/arch/parisc/include/asm/cacheflush.h
+++ b/arch/parisc/include/asm/cacheflush.h
@@ -142,6 +142,7 @@ static inline void kunmap(struct page *page)

 static inline void *kmap_atomic(struct page *page)
 {
+	preempt_disable();
 	pagefault_disable();
 	return page_address(page);
 }
@@ -150,6 +151,7 @@ static inline void __kunmap_atomic(void *addr)
 {
 	flush_kernel_dcache_page_addr(addr);
 	pagefault_enable();
+	preempt_enable();
 }

 #define kmap_atomic_prot(page, prot) kmap_atomic(page)
diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c
index 47ee620..6548fd1 100644
--- a/arch/parisc/kernel/traps.c
+++ b/arch/parisc/kernel/traps.c
@@ -26,9 +26,9 @@
 #include <linux/console.h>
 #include <linux/bug.h>
 #include <linux/ratelimit.h>
+#include <linux/uaccess.h>

 #include <asm/assembly.h>
-#include <asm/uaccess.h>
 #include <asm/io.h>
 #include <asm/irq.h>
 #include <asm/traps.h>
@@ -800,7 +800,7 @@ void notrace handle_interruption(int code, struct pt_regs *regs)
 	     * unless pagefault_disable() was called before.
 	     */

-	    if (fault_space == 0 && !in_atomic())
+	    if (fault_space == 0 && !faulthandler_disabled())
 	    {
 		pdc_chassis_send_status(PDC_CHASSIS_DIRECT_PANIC);
 		parisc_terminate("Kernel Fault", regs, code, fault_address);
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index e5120e6..15503ad 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -15,8 +15,8 @@
 #include <linux/sched.h>
 #include <linux/interrupt.h>
 #include <linux/module.h>
+#include <linux/uaccess.h>

-#include <asm/uaccess.h>
 #include <asm/traps.h>

 /* Various important other fields */
@@ -207,7 +207,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 	int fault;
 	unsigned int flags;

-	if (in_atomic())
+	if (pagefault_disabled())
 		goto no_context;

 	tsk = current;
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 5f1048e..8b3b46b 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -87,7 +87,7 @@ static inline int prrn_is_enabled(void)
 #include <asm/smp.h>

 #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu))
-#define topology_thread_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu) (cpu_to_core_id(cpu))
 #endif
diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
index 3cf529c..ac93a3b 100644
--- a/arch/powerpc/lib/vmx-helper.c
+++ b/arch/powerpc/lib/vmx-helper.c
@@ -27,11 +27,11 @@ int enter_vmx_usercopy(void)
 	if (in_interrupt())
 		return 0;

-	/* This acts as preempt_disable() as well and will make
-	 * enable_kernel_altivec(). We need to disable page faults
-	 * as they can call schedule and thus make us lose the VMX
-	 * context. So on page faults, we just fail which will cause
-	 * a fallback to the normal non-vmx copy.
+ preempt_disable(); + /* + * We need to disable page faults as they can call schedule and + * thus make us lose the VMX context. So on page faults, we just + * fail which will cause a fallback to the normal non-vmx copy. */ pagefault_disable(); @@ -47,6 +47,7 @@ int enter_vmx_usercopy(void) int exit_vmx_usercopy(void) { pagefault_enable(); + preempt_enable(); return 0; } diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index b396868..6d53597 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -33,13 +33,13 @@ #include <linux/ratelimit.h> #include <linux/context_tracking.h> #include <linux/hugetlb.h> +#include <linux/uaccess.h> #include <asm/firmware.h> #include <asm/page.h> #include <asm/pgtable.h> #include <asm/mmu.h> #include <asm/mmu_context.h> -#include <asm/uaccess.h> #include <asm/tlbflush.h> #include <asm/siginfo.h> #include <asm/debug.h> @@ -272,15 +272,16 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, if (!arch_irq_disabled_regs(regs)) local_irq_enable(); - if (in_atomic() || mm == NULL) { + if (faulthandler_disabled() || mm == NULL) { if (!user_mode(regs)) { rc = SIGSEGV; goto bail; } - /* in_atomic() in user mode is really bad, + /* faulthandler_disabled() in user mode is really bad, as is current->mm == NULL. 
*/ printk(KERN_EMERG "Page fault in user mode with " - "in_atomic() = %d mm = %p\n", in_atomic(), mm); + "faulthandler_disabled() = %d mm = %p\n", + faulthandler_disabled(), mm); printk(KERN_EMERG "NIP = %lx MSR = %lx\n", regs->nip, regs->msr); die("Weird page fault", regs, SIGSEGV); diff --git a/arch/powerpc/mm/highmem.c b/arch/powerpc/mm/highmem.c index e7450bd..e292c8a 100644 --- a/arch/powerpc/mm/highmem.c +++ b/arch/powerpc/mm/highmem.c @@ -34,7 +34,7 @@ void *kmap_atomic_prot(struct page *page, pgprot_t prot) unsigned long vaddr; int idx, type; - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ + preempt_disable(); pagefault_disable(); if (!PageHighMem(page)) return page_address(page); @@ -59,6 +59,7 @@ void __kunmap_atomic(void *kvaddr) if (vaddr < __fix_to_virt(FIX_KMAP_END)) { pagefault_enable(); + preempt_enable(); return; } @@ -82,5 +83,6 @@ void __kunmap_atomic(void *kvaddr) kmap_atomic_idx_pop(); pagefault_enable(); + preempt_enable(); } EXPORT_SYMBOL(__kunmap_atomic); diff --git a/arch/powerpc/mm/tlb_nohash.c b/arch/powerpc/mm/tlb_nohash.c index cbd3d06..723a099 100644 --- a/arch/powerpc/mm/tlb_nohash.c +++ b/arch/powerpc/mm/tlb_nohash.c @@ -217,7 +217,7 @@ static DEFINE_RAW_SPINLOCK(tlbivax_lock); static int mm_is_core_local(struct mm_struct *mm) { return cpumask_subset(mm_cpumask(mm), - topology_thread_cpumask(smp_processor_id())); + topology_sibling_cpumask(smp_processor_id())); } struct tlb_flush_param { diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h index b1453a2..4990f6c 100644 --- a/arch/s390/include/asm/topology.h +++ b/arch/s390/include/asm/topology.h @@ -22,7 +22,8 @@ DECLARE_PER_CPU(struct cpu_topology_s390, cpu_topology); #define topology_physical_package_id(cpu) (per_cpu(cpu_topology, cpu).socket_id) #define topology_thread_id(cpu) (per_cpu(cpu_topology, cpu).thread_id) -#define topology_thread_cpumask(cpu) (&per_cpu(cpu_topology, cpu).thread_mask) +#define 
topology_sibling_cpumask(cpu) \ + (&per_cpu(cpu_topology, cpu).thread_mask) #define topology_core_id(cpu) (per_cpu(cpu_topology, cpu).core_id) #define topology_core_cpumask(cpu) (&per_cpu(cpu_topology, cpu).core_mask) #define topology_book_id(cpu) (per_cpu(cpu_topology, cpu).book_id) diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h index d64a7a6..9dd4cc4 100644 --- a/arch/s390/include/asm/uaccess.h +++ b/arch/s390/include/asm/uaccess.h @@ -98,7 +98,8 @@ static inline unsigned long extable_fixup(const struct exception_table_entry *x) * @from: Source address, in user space. * @n: Number of bytes to copy. * - * Context: User context only. This function may sleep. + * Context: User context only. This function may sleep if pagefaults are + * enabled. * * Copy data from user space to kernel space. Caller must check * the specified block with access_ok() before calling this function. @@ -118,7 +119,8 @@ unsigned long __must_check __copy_from_user(void *to, const void __user *from, * @from: Source address, in kernel space. * @n: Number of bytes to copy. * - * Context: User context only. This function may sleep. + * Context: User context only. This function may sleep if pagefaults are + * enabled. * * Copy data from kernel space to user space. Caller must check * the specified block with access_ok() before calling this function. @@ -264,7 +266,8 @@ int __get_user_bad(void) __attribute__((noreturn)); * @from: Source address, in kernel space. * @n: Number of bytes to copy. * - * Context: User context only. This function may sleep. + * Context: User context only. This function may sleep if pagefaults are + * enabled. * * Copy data from kernel space to user space. * @@ -290,7 +293,8 @@ __compiletime_warning("copy_from_user() buffer size is not provably correct") * @from: Source address, in user space. * @n: Number of bytes to copy. * - * Context: User context only. This function may sleep. + * Context: User context only. 
This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from user space to kernel space.
  *
@@ -348,7 +352,8 @@ static inline unsigned long strnlen_user(const char __user *src, unsigned long n
  * strlen_user: - Get the size of a string in user space.
  * @str: The string to measure.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Get the size of a NUL-terminated string in user space.
  *
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 76515bc..4c8f5d7 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -399,7 +399,7 @@ static inline int do_exception(struct pt_regs *regs, int access)
 	 * user context.
 	 */
 	fault = VM_FAULT_BADCONTEXT;
-	if (unlikely(!user_space_fault(regs) || in_atomic() || !mm))
+	if (unlikely(!user_space_fault(regs) || faulthandler_disabled() || !mm))
 		goto out;

 	address = trans_exc_code & __FAIL_ADDR_MASK;
diff --git a/arch/score/include/asm/uaccess.h b/arch/score/include/asm/uaccess.h
index ab66ddd..20a3591 100644
--- a/arch/score/include/asm/uaccess.h
+++ b/arch/score/include/asm/uaccess.h
@@ -36,7 +36,8 @@
  * @addr: User space pointer to start of block to check
  * @size: Size of block to check
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Checks if a pointer to a block of memory in user space is valid.
  *
@@ -61,7 +62,8 @@
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -79,7 +81,8 @@
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only.
This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -98,7 +101,8 @@
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -119,7 +123,8 @@
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 6860beb..37a6c2e 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -34,6 +34,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/ptrace.h>
+#include <linux/uaccess.h>

 /*
  * This routine handles page faults. It determines the address,
@@ -73,7 +74,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (pagefault_disabled() || !mm)
 		goto bad_area_nosemaphore;

 	if (user_mode(regs))
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index a58fec9..79d8276 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -17,6 +17,7 @@
 #include <linux/kprobes.h>
 #include <linux/perf_event.h>
 #include <linux/kdebug.h>
+#include <linux/uaccess.h>
 #include <asm/io_trapped.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -438,9 +439,9 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,

 	/*
 	 * If we're in an interrupt, have no user context or are running
-	 * in an atomic region then we must not take the fault:
+	 * with pagefaults disabled then we must not take the fault:
 	 */
-	if (unlikely(in_atomic() || !mm)) {
+	if (unlikely(faulthandler_disabled() || !mm)) {
 		bad_area_nosemaphore(regs, error_code, address);
 		return;
 	}
diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h
index d1761df..01d1704 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -41,7 +41,7 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
 #define topology_physical_package_id(cpu) (cpu_data(cpu).proc_id)
 #define topology_core_id(cpu) (cpu_data(cpu).core_id)
 #define topology_core_cpumask(cpu) (&cpu_core_sib_map[cpu])
-#define topology_thread_cpumask(cpu) (&per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_cpumask(cpu) (&per_cpu(cpu_sibling_map, cpu))
 #endif /* CONFIG_SMP */

 extern cpumask_t cpu_core_map[NR_CPUS];
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index 70d8171..c399e7b 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -21,6 +21,7 @@
 #include <linux/perf_event.h>
 #include <linux/interrupt.h>
 #include <linux/kdebug.h>
+#include <linux/uaccess.h>

 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -29,7 +30,6 @@
 #include <asm/setup.h>
 #include <asm/smp.h>
 #include <asm/traps.h>
-#include <asm/uaccess.h>

 #include "mm_32.h"

@@ -196,7 +196,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (pagefault_disabled() || !mm)
 		goto no_context;

 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 4798232..e9268ea 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -22,12 +22,12 @@
 #include <linux/kdebug.h>
 #include <linux/percpu.h>
 #include <linux/context_tracking.h>
+#include <linux/uaccess.h>

 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/openprom.h>
 #include <asm/oplib.h>
-#include <asm/uaccess.h>
 #include <asm/asi.h>
 #include <asm/lsu.h>
 #include <asm/sections.h>
@@ -330,7 +330,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto intr_or_no_mm;

 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/sparc/mm/highmem.c b/arch/sparc/mm/highmem.c
index 449f864..a454ec5 100644
--- a/arch/sparc/mm/highmem.c
+++ b/arch/sparc/mm/highmem.c
@@ -53,7 +53,7 @@ void *kmap_atomic(struct page *page)
 	unsigned long vaddr;
 	long idx, type;

-	/* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */
+	preempt_disable();
 	pagefault_disable();
 	if (!PageHighMem(page))
 		return page_address(page);
@@ -91,6 +91,7 @@ void __kunmap_atomic(void *kvaddr)

 	if (vaddr < FIXADDR_START) { // FIXME
 		pagefault_enable();
+		preempt_enable();
 		return;
 	}

@@ -126,5 +127,6 @@ void __kunmap_atomic(void *kvaddr)

 	kmap_atomic_idx_pop();
 	pagefault_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__kunmap_atomic);
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 559cb744..c5d08b8 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2738,7
+2738,7 @@ void hugetlb_setup(struct pt_regs *regs)
 	struct mm_struct *mm = current->mm;
 	struct tsb_config *tp;

-	if (in_atomic() || !mm) {
+	if (faulthandler_disabled() || !mm) {
 		const struct exception_table_entry *entry;

 		entry = search_exception_tables(regs->tpc);
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index 9383118..76b0d0e 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -55,7 +55,7 @@ static inline const struct cpumask *cpumask_of_node(int node)
 #define topology_physical_package_id(cpu) ((void)(cpu), 0)
 #define topology_core_id(cpu) (cpu)
 #define topology_core_cpumask(cpu) ((void)(cpu), cpu_online_mask)
-#define topology_thread_cpumask(cpu) cpumask_of(cpu)
+#define topology_sibling_cpumask(cpu) cpumask_of(cpu)
 #endif

 #endif /* _ASM_TILE_TOPOLOGY_H */
diff --git a/arch/tile/include/asm/uaccess.h b/arch/tile/include/asm/uaccess.h
index f41cb53..a33276b 100644
--- a/arch/tile/include/asm/uaccess.h
+++ b/arch/tile/include/asm/uaccess.h
@@ -78,7 +78,8 @@ int __range_ok(unsigned long addr, unsigned long size);
  * @addr: User space pointer to start of block to check
  * @size: Size of block to check
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Checks if a pointer to a block of memory in user space is valid.
  *
@@ -192,7 +193,8 @@ extern int __get_user_bad(void)
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -274,7 +276,8 @@ extern int __put_user_bad(void)
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only.
This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -330,7 +333,8 @@ extern int __put_user_bad(void)
  * @from: Source address, in kernel space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from kernel space to user space. Caller must check
  * the specified block with access_ok() before calling this function.
@@ -366,7 +370,8 @@ copy_to_user(void __user *to, const void *from, unsigned long n)
  * @from: Source address, in user space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from user space to kernel space. Caller must check
  * the specified block with access_ok() before calling this function.
@@ -437,7 +442,8 @@ static inline unsigned long __must_check copy_from_user(void *to,
  * @from: Source address, in user space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from user space to user space. Caller must check
  * the specified blocks with access_ok() before calling this function.
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index e83cc99..3f4f58d 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -354,9 +354,9 @@ static int handle_page_fault(struct pt_regs *regs,

 	/*
 	 * If we're in an interrupt, have no user context or are running in an
-	 * atomic region then we must not take the fault.
+	 * region with pagefaults disabled then we must not take the fault.
 	 */
-	if (in_atomic() || !mm) {
+	if (pagefault_disabled() || !mm) {
 		vma = NULL; /* happy compiler */
 		goto bad_area_nosemaphore;
 	}
diff --git a/arch/tile/mm/highmem.c b/arch/tile/mm/highmem.c
index 6aa2f26..fcd5450 100644
--- a/arch/tile/mm/highmem.c
+++ b/arch/tile/mm/highmem.c
@@ -201,7 +201,7 @@ void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 	int idx, type;
 	pte_t *pte;

-	/* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */
+	preempt_disable();
 	pagefault_disable();

 	/* Avoid icache flushes by disallowing atomic executable mappings. */
@@ -259,6 +259,7 @@ void __kunmap_atomic(void *kvaddr)
 	}

 	pagefault_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__kunmap_atomic);

diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 8e4daf4..47ff9b7 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -7,6 +7,7 @@
 #include <linux/sched.h>
 #include <linux/hardirq.h>
 #include <linux/module.h>
+#include <linux/uaccess.h>
 #include <asm/current.h>
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -35,10 +36,10 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 	*code_out = SEGV_MAPERR;

 	/*
-	 * If the fault was during atomic operation, don't take the fault, just
+	 * If the fault was with pagefaults disabled, don't take the fault, just
 	 * fail.
 	 */
-	if (in_atomic())
+	if (faulthandler_disabled())
 		goto out_nosemaphore;

 	if (is_user)
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 0dc922d..afccef552 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -218,7 +218,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;

 	if (user_mode(regs))
diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 8f327184..dca7171 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -99,11 +99,9 @@ static __always_inline bool should_resched(void)
   extern asmlinkage void ___preempt_schedule(void);
 # define __preempt_schedule() asm ("call ___preempt_schedule")
   extern asmlinkage void preempt_schedule(void);
-# ifdef CONFIG_CONTEXT_TRACKING
-    extern asmlinkage void ___preempt_schedule_context(void);
-# define __preempt_schedule_context() asm ("call ___preempt_schedule_context")
-    extern asmlinkage void preempt_schedule_context(void);
-# endif
+  extern asmlinkage void ___preempt_schedule_notrace(void);
+# define __preempt_schedule_notrace() asm ("call ___preempt_schedule_notrace")
+  extern asmlinkage void preempt_schedule_notrace(void);
 #endif

 #endif /* __ASM_PREEMPT_H */
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 17a8dce..222a6a3 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -37,16 +37,6 @@ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);

-static inline struct cpumask *cpu_sibling_mask(int cpu)
-{
-	return per_cpu(cpu_sibling_map, cpu);
-}
-
-static inline struct cpumask *cpu_core_mask(int cpu)
-{
-	return per_cpu(cpu_core_map, cpu);
-}
-
 static inline struct cpumask *cpu_llc_shared_mask(int cpu)
 {
 	return per_cpu(cpu_llc_shared_map, cpu);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 0e8f04f..5a77593f 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -124,7 +124,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);

 #ifdef ENABLE_TOPO_DEFINES
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
-#define
topology_thread_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))
 #endif

 static inline void arch_fix_phys_package_id(int num, u32 slot)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index ace9dec..a8df874 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -74,7 +74,8 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
  * @addr: User space pointer to start of block to check
  * @size: Size of block to check
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Checks if a pointer to a block of memory in user space is valid.
  *
@@ -145,7 +146,8 @@ __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL))
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -240,7 +242,8 @@ extern void __put_user_8(void);
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
@@ -455,7 +458,8 @@ struct __large_struct { unsigned long buf[100]; };
  * @x: Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
@@ -479,7 +483,8 @@ struct __large_struct { unsigned long buf[100]; };
  * @x: Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
diff --git a/arch/x86/include/asm/uaccess_32.h b/arch/x86/include/asm/uaccess_32.h
index 3c03a5d..7c8ad34 100644
--- a/arch/x86/include/asm/uaccess_32.h
+++ b/arch/x86/include/asm/uaccess_32.h
@@ -70,7 +70,8 @@ __copy_to_user_inatomic(void __user *to, const void *from, unsigned long n)
  * @from: Source address, in kernel space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from kernel space to user space. Caller must check
  * the specified block with access_ok() before calling this function.
@@ -117,7 +118,8 @@ __copy_from_user_inatomic(void *to, const void __user *from, unsigned long n)
  * @from: Source address, in user space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from user space to kernel space. Caller must check
  * the specified block with access_ok() before calling this function.
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 19980d9..b9826a9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2576,7 +2576,7 @@ static void intel_pmu_cpu_starting(int cpu)
 	if (!(x86_pmu.flags & PMU_FL_NO_HT_SHARING)) {
 		void **onln = &cpuc->kfree_on_online[X86_PERF_KFREE_SHARED];

-		for_each_cpu(i, topology_thread_cpumask(cpu)) {
+		for_each_cpu(i, topology_sibling_cpumask(cpu)) {
 			struct intel_shared_regs *pc;

 			pc = per_cpu(cpu_hw_events, i).shared_regs;
@@ -2594,7 +2594,7 @@ static void intel_pmu_cpu_starting(int cpu)
 		cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];

 	if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
-		for_each_cpu(i, topology_thread_cpumask(cpu)) {
+		for_each_cpu(i, topology_sibling_cpumask(cpu)) {
 			struct intel_excl_cntrs *c;

 			c = per_cpu(cpu_hw_events, i).excl_cntrs;
@@ -3362,7 +3362,7 @@ static __init int fixup_ht_bug(void)
 	if (!(x86_pmu.flags & PMU_FL_EXCL_ENABLED))
 		return 0;

-	w = cpumask_weight(topology_thread_cpumask(cpu));
+	w = cpumask_weight(topology_sibling_cpumask(cpu));
 	if (w > 1) {
 		pr_info("PMU erratum BJ122, BV98, HSD29 worked around, HT is on\n");
 		return 0;
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index e7d8c76..18ca99f 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -12,7 +12,8 @@ static void show_cpuinfo_core(struct seq_file *m, struct cpuinfo_x86 *c,
 {
 #ifdef CONFIG_SMP
 	seq_printf(m, "physical id\t: %d\n", c->phys_proc_id);
-	seq_printf(m, "siblings\t: %d\n", cpumask_weight(cpu_core_mask(cpu)));
+	seq_printf(m, "siblings\t: %d\n",
+		   cpumask_weight(topology_core_cpumask(cpu)));
 	seq_printf(m, "core id\t\t: %d\n", c->cpu_core_id);
 	seq_printf(m, "cpu cores\t: %d\n", c->booted_cores);
 	seq_printf(m, "apicid\t\t: %d\n", c->apicid);
diff --git a/arch/x86/kernel/i386_ksyms_32.c b/arch/x86/kernel/i386_ksyms_32.c
index 05fd74f..64341aa 100644
--- a/arch/x86/kernel/i386_ksyms_32.c
+++
b/arch/x86/kernel/i386_ksyms_32.c
@@ -40,7 +40,5 @@ EXPORT_SYMBOL(empty_zero_page);

 #ifdef CONFIG_PREEMPT
 EXPORT_SYMBOL(___preempt_schedule);
-#ifdef CONFIG_CONTEXT_TRACKING
-EXPORT_SYMBOL(___preempt_schedule_context);
-#endif
+EXPORT_SYMBOL(___preempt_schedule_notrace);
 #endif
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 6e338e3..c648139 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -445,11 +445,10 @@ static int prefer_mwait_c1_over_halt(const struct cpuinfo_x86 *c)
 }

 /*
- * MONITOR/MWAIT with no hints, used for default default C1 state.
- * This invokes MWAIT with interrutps enabled and no flags,
- * which is backwards compatible with the original MWAIT implementation.
+ * MONITOR/MWAIT with no hints, used for default C1 state. This invokes MWAIT
+ * with interrupts enabled and no flags, which is backwards compatible with the
+ * original MWAIT implementation.
  */
-
 static void mwait_idle(void)
 {
 	if (!current_set_polling_and_test()) {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 50e547e..0e82096 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -314,10 +314,10 @@ topology_sane(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o, const char *name)
 		cpu1, name, cpu2, cpu_to_node(cpu1), cpu_to_node(cpu2));
 }

-#define link_mask(_m, c1, c2) \
+#define link_mask(mfunc, c1, c2) \
 do { \
-	cpumask_set_cpu((c1), cpu_##_m##_mask(c2)); \
-	cpumask_set_cpu((c2), cpu_##_m##_mask(c1)); \
+	cpumask_set_cpu((c1), mfunc(c2)); \
+	cpumask_set_cpu((c2), mfunc(c1)); \
 } while (0)

 static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
@@ -398,9 +398,9 @@ void set_cpu_sibling_map(int cpu)
 	cpumask_set_cpu(cpu, cpu_sibling_setup_mask);

 	if (!has_mp) {
-		cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
+		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
-		cpumask_set_cpu(cpu, cpu_core_mask(cpu));
+		cpumask_set_cpu(cpu,
topology_core_cpumask(cpu));
 		c->booted_cores = 1;
 		return;
 	}
@@ -409,32 +409,34 @@ void set_cpu_sibling_map(int cpu)
 		o = &cpu_data(i);

 		if ((i == cpu) || (has_smt && match_smt(c, o)))
-			link_mask(sibling, cpu, i);
+			link_mask(topology_sibling_cpumask, cpu, i);

 		if ((i == cpu) || (has_mp && match_llc(c, o)))
-			link_mask(llc_shared, cpu, i);
+			link_mask(cpu_llc_shared_mask, cpu, i);

 	}

 	/*
 	 * This needs a separate iteration over the cpus because we rely on all
-	 * cpu_sibling_mask links to be set-up.
+	 * topology_sibling_cpumask links to be set-up.
 	 */
 	for_each_cpu(i, cpu_sibling_setup_mask) {
 		o = &cpu_data(i);

 		if ((i == cpu) || (has_mp && match_die(c, o))) {
-			link_mask(core, cpu, i);
+			link_mask(topology_core_cpumask, cpu, i);

 			/*
 			 * Does this new cpu bringup a new core?
 			 */
-			if (cpumask_weight(cpu_sibling_mask(cpu)) == 1) {
+			if (cpumask_weight(
+			    topology_sibling_cpumask(cpu)) == 1) {
 				/*
 				 * for each core in package, increment
 				 * the booted_cores for this new cpu
 				 */
-				if (cpumask_first(cpu_sibling_mask(i)) == i)
+				if (cpumask_first(
+				    topology_sibling_cpumask(i)) == i)
 					c->booted_cores++;
 				/*
 				 * increment the core count for all
@@ -1009,8 +1011,8 @@ static __init void disable_smp(void)
 		physid_set_mask_of_physid(boot_cpu_physical_apicid, &phys_cpu_present_map);
 	else
 		physid_set_mask_of_physid(0, &phys_cpu_present_map);
-	cpumask_set_cpu(0, cpu_sibling_mask(0));
-	cpumask_set_cpu(0, cpu_core_mask(0));
+	cpumask_set_cpu(0, topology_sibling_cpumask(0));
+	cpumask_set_cpu(0, topology_core_cpumask(0));
 }

 enum {
@@ -1293,22 +1295,22 @@ static void remove_siblinginfo(int cpu)
 	int sibling;
 	struct cpuinfo_x86 *c = &cpu_data(cpu);

-	for_each_cpu(sibling, cpu_core_mask(cpu)) {
-		cpumask_clear_cpu(cpu, cpu_core_mask(sibling));
+	for_each_cpu(sibling, topology_core_cpumask(cpu)) {
+		cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
 		/*/
 		 * last thread sibling in this cpu core going down
 		 */
-		if (cpumask_weight(cpu_sibling_mask(cpu)) == 1)
+		if
(cpumask_weight(topology_sibling_cpumask(cpu)) == 1)
 			cpu_data(sibling).booted_cores--;
 	}

-	for_each_cpu(sibling, cpu_sibling_mask(cpu))
-		cpumask_clear_cpu(cpu, cpu_sibling_mask(sibling));
+	for_each_cpu(sibling, topology_sibling_cpumask(cpu))
+		cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
 	for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
 	cpumask_clear(cpu_llc_shared_mask(cpu));
-	cpumask_clear(cpu_sibling_mask(cpu));
-	cpumask_clear(cpu_core_mask(cpu));
+	cpumask_clear(topology_sibling_cpumask(cpu));
+	cpumask_clear(topology_core_cpumask(cpu));
 	c->phys_proc_id = 0;
 	c->cpu_core_id = 0;
 	cpumask_clear_cpu(cpu, cpu_sibling_setup_mask);
diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c
index 2648848..dd8d079 100644
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -113,7 +113,7 @@ static void check_tsc_warp(unsigned int timeout)
  */
 static inline unsigned int loop_timeout(int cpu)
 {
-	return (cpumask_weight(cpu_core_mask(cpu)) > 1) ? 2 : 20;
+	return (cpumask_weight(topology_core_cpumask(cpu)) > 1) ?
2 : 20;
 }

 /*
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index 37d8fa4..a0695be 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -75,7 +75,5 @@ EXPORT_SYMBOL(native_load_gs_index);

 #ifdef CONFIG_PREEMPT
 EXPORT_SYMBOL(___preempt_schedule);
-#ifdef CONFIG_CONTEXT_TRACKING
-EXPORT_SYMBOL(___preempt_schedule_context);
-#endif
+EXPORT_SYMBOL(___preempt_schedule_notrace);
 #endif
diff --git a/arch/x86/lib/thunk_32.S b/arch/x86/lib/thunk_32.S
index 5eb7150..e407941 100644
--- a/arch/x86/lib/thunk_32.S
+++ b/arch/x86/lib/thunk_32.S
@@ -38,8 +38,6 @@

 #ifdef CONFIG_PREEMPT
 	THUNK ___preempt_schedule, preempt_schedule
-#ifdef CONFIG_CONTEXT_TRACKING
-	THUNK ___preempt_schedule_context, preempt_schedule_context
-#endif
+	THUNK ___preempt_schedule_notrace, preempt_schedule_notrace
 #endif

diff --git a/arch/x86/lib/thunk_64.S b/arch/x86/lib/thunk_64.S
index f89ba4e9..2198902 100644
--- a/arch/x86/lib/thunk_64.S
+++ b/arch/x86/lib/thunk_64.S
@@ -49,9 +49,7 @@

 #ifdef CONFIG_PREEMPT
 	THUNK ___preempt_schedule, preempt_schedule
-#ifdef CONFIG_CONTEXT_TRACKING
-	THUNK ___preempt_schedule_context, preempt_schedule_context
-#endif
+	THUNK ___preempt_schedule_notrace, preempt_schedule_notrace
 #endif

 #if defined(CONFIG_TRACE_IRQFLAGS) \
diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index e2f5e21..91d93b9 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -647,7 +647,8 @@ EXPORT_SYMBOL(__copy_from_user_ll_nocache_nozero);
  * @from: Source address, in kernel space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from kernel space to user space.
  *
@@ -668,7 +669,8 @@ EXPORT_SYMBOL(_copy_to_user);
  * @from: Source address, in user space.
  * @n: Number of bytes to copy.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Copy data from user space to kernel space.
  *
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 181c53b..9dc9098 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -13,6 +13,7 @@
 #include <linux/hugetlb.h> /* hstate_index_to_shift */
 #include <linux/prefetch.h> /* prefetchw */
 #include <linux/context_tracking.h> /* exception_enter(), ... */
+#include <linux/uaccess.h> /* faulthandler_disabled() */

 #include <asm/traps.h> /* dotraplinkage, ... */
 #include <asm/pgalloc.h> /* pgd_*(), ... */
@@ -1126,9 +1127,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,

 	/*
 	 * If we're in an interrupt, have no user context or are running
-	 * in an atomic region then we must not take the fault:
+	 * in a region with pagefaults disabled then we must not take the fault
 	 */
-	if (unlikely(in_atomic() || !mm)) {
+	if (unlikely(faulthandler_disabled() || !mm)) {
 		bad_area_nosemaphore(regs, error_code, address);
 		return;
 	}
diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index 4500142..eecb207a 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -35,7 +35,7 @@ void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 	unsigned long vaddr;
 	int idx, type;

-	/* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */
+	preempt_disable();
 	pagefault_disable();

 	if (!PageHighMem(page))
@@ -100,6 +100,7 @@ void __kunmap_atomic(void *kvaddr)
 #endif

 	pagefault_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__kunmap_atomic);

diff --git a/arch/x86/mm/iomap_32.c b/arch/x86/mm/iomap_32.c
index 9ca35fc..2b7ece0 100644
--- a/arch/x86/mm/iomap_32.c
+++ b/arch/x86/mm/iomap_32.c
@@ -59,6 +59,7 @@ void *kmap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot)
 	unsigned long vaddr;
 	int idx, type;

+	preempt_disable();
 	pagefault_disable();

 	type = kmap_atomic_idx_push();
@@ -117,5 +118,6 @@ iounmap_atomic(void __iomem *kvaddr)
 	}

 	pagefault_enable();
+
preempt_enable();
 }
 EXPORT_SYMBOL_GPL(iounmap_atomic);
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 9e3571a..83a44a3 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -15,10 +15,10 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/hardirq.h>
+#include <linux/uaccess.h>
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
 #include <asm/hardirq.h>
-#include <asm/uaccess.h>
 #include <asm/pgalloc.h>

 DEFINE_PER_CPU(unsigned long, asid_cache) = ASID_USER_FIRST;
@@ -57,7 +57,7 @@ void do_page_fault(struct pt_regs *regs)
 	/* If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm) {
+	if (faulthandler_disabled() || !mm) {
 		bad_page_fault(regs, address, SIGSEGV);
 		return;
 	}
diff --git a/arch/xtensa/mm/highmem.c b/arch/xtensa/mm/highmem.c
index 8cfb71e..184cead 100644
--- a/arch/xtensa/mm/highmem.c
+++ b/arch/xtensa/mm/highmem.c
@@ -42,6 +42,7 @@ void *kmap_atomic(struct page *page)
 	enum fixed_addresses idx;
 	unsigned long vaddr;

+	preempt_disable();
 	pagefault_disable();
 	if (!PageHighMem(page))
 		return page_address(page);
@@ -79,6 +80,7 @@ void __kunmap_atomic(void *kvaddr)
 	}

 	pagefault_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__kunmap_atomic);

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 5f13f4d..1e28ddb 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -24,7 +24,7 @@ static int get_first_sibling(unsigned int cpu)
 {
 	unsigned int ret;

-	ret = cpumask_first(topology_thread_cpumask(cpu));
+	ret = cpumask_first(topology_sibling_cpumask(cpu));
 	if (ret < nr_cpu_ids)
 		return ret;

diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index 6bc9cbc..00b3980 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -105,7 +105,7 @@ static void round_robin_cpu(unsigned int tsk_index)
 	mutex_lock(&round_robin_lock);
 	cpumask_clear(tmp);
 	for_each_cpu(cpu, pad_busy_cpus)
-		cpumask_or(tmp, tmp, topology_thread_cpumask(cpu));
+		cpumask_or(tmp, tmp, topology_sibling_cpumask(cpu));
 	cpumask_andnot(tmp, cpu_online_mask, tmp);
 	/* avoid HT sibilings if possible */
 	if (cpumask_empty(tmp))
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 6491f45..8b7d7f8 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -61,7 +61,7 @@ static DEVICE_ATTR_RO(physical_package_id);
 define_id_show_func(core_id);
 static DEVICE_ATTR_RO(core_id);

-define_siblings_show_func(thread_siblings, thread_cpumask);
+define_siblings_show_func(thread_siblings, sibling_cpumask);
 static DEVICE_ATTR_RO(thread_siblings);
 static DEVICE_ATTR_RO(thread_siblings_list);

diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index b0c18ed..0136dfc 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -699,13 +699,14 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	dmi_check_system(sw_any_bug_dmi_table);
 	if (bios_with_sw_any_bug && !policy_is_shared(policy)) {
 		policy->shared_type = CPUFREQ_SHARED_TYPE_ALL;
-		cpumask_copy(policy->cpus, cpu_core_mask(cpu));
+		cpumask_copy(policy->cpus, topology_core_cpumask(cpu));
 	}

 	if (check_amd_hwpstate_cpu(cpu) && !acpi_pstate_strict) {
 		cpumask_clear(policy->cpus);
 		cpumask_set_cpu(cpu, policy->cpus);
-		cpumask_copy(data->freqdomain_cpus, cpu_sibling_mask(cpu));
+		cpumask_copy(data->freqdomain_cpus,
+			     topology_sibling_cpumask(cpu));
 		policy->shared_type = CPUFREQ_SHARED_TYPE_HW;
 		pr_info_once(PFX "overriding BIOS provided _PSD data\n");
 	}
diff --git a/drivers/cpufreq/p4-clockmod.c b/drivers/cpufreq/p4-clockmod.c
index 529cfd9..5dd95da 100644
--- a/drivers/cpufreq/p4-clockmod.c
+++ b/drivers/cpufreq/p4-clockmod.c
@@ -172,7 +172,7 @@ static int cpufreq_p4_cpu_init(struct cpufreq_policy *policy)
 	unsigned int i;

 #ifdef CONFIG_SMP
-	cpumask_copy(policy->cpus, cpu_sibling_mask(policy->cpu));
+	cpumask_copy(policy->cpus, topology_sibling_cpumask(policy->cpu));
 #endif

 	/* Errata workaround */
diff --git
a/drivers/cpufreq/powernow-k8.c b/drivers/cpufreq/powernow-k8.c index f9ce7e4..5c035d0 100644 --- a/drivers/cpufreq/powernow-k8.c +++ b/drivers/cpufreq/powernow-k8.c @@ -57,13 +57,6 @@ static DEFINE_PER_CPU(struct powernow_k8_data *, powernow_data); static struct cpufreq_driver cpufreq_amd64_driver; -#ifndef CONFIG_SMP -static inline const struct cpumask *cpu_core_mask(int cpu) -{ - return cpumask_of(0); -} -#endif - /* Return a frequency in MHz, given an input fid */ static u32 find_freq_from_fid(u32 fid) { @@ -620,7 +613,7 @@ static int fill_powernow_table(struct powernow_k8_data *data, pr_debug("cfid 0x%x, cvid 0x%x\n", data->currfid, data->currvid); data->powernow_table = powernow_table; - if (cpumask_first(cpu_core_mask(data->cpu)) == data->cpu) + if (cpumask_first(topology_core_cpumask(data->cpu)) == data->cpu) print_basics(data); for (j = 0; j < data->numps; j++) @@ -784,7 +777,7 @@ static int powernow_k8_cpu_init_acpi(struct powernow_k8_data *data) CPUFREQ_TABLE_END; data->powernow_table = powernow_table; - if (cpumask_first(cpu_core_mask(data->cpu)) == data->cpu) + if (cpumask_first(topology_core_cpumask(data->cpu)) == data->cpu) print_basics(data); /* notify BIOS that we exist */ @@ -1090,7 +1083,7 @@ static int powernowk8_cpu_init(struct cpufreq_policy *pol) if (rc != 0) goto err_out_exit_acpi; - cpumask_copy(pol->cpus, cpu_core_mask(pol->cpu)); + cpumask_copy(pol->cpus, topology_core_cpumask(pol->cpu)); data->available_cores = pol->cpus; /* min/max the cpu is capable of */ diff --git a/drivers/cpufreq/speedstep-ich.c b/drivers/cpufreq/speedstep-ich.c index e56d632..37555c6 100644 --- a/drivers/cpufreq/speedstep-ich.c +++ b/drivers/cpufreq/speedstep-ich.c @@ -292,7 +292,7 @@ static int speedstep_cpu_init(struct cpufreq_policy *policy) /* only run on CPU to be set, or on its sibling */ #ifdef CONFIG_SMP - cpumask_copy(policy->cpus, cpu_sibling_mask(policy->cpu)); + cpumask_copy(policy->cpus, topology_sibling_cpumask(policy->cpu)); #endif policy_cpu = 
cpumask_any_and(policy->cpus, cpu_online_mask); diff --git a/drivers/crypto/vmx/aes.c b/drivers/crypto/vmx/aes.c index ab300ea..a9064e3 100644 --- a/drivers/crypto/vmx/aes.c +++ b/drivers/crypto/vmx/aes.c @@ -78,12 +78,14 @@ static int p8_aes_setkey(struct crypto_tfm *tfm, const u8 *key, int ret; struct p8_aes_ctx *ctx = crypto_tfm_ctx(tfm); + preempt_disable(); pagefault_disable(); enable_kernel_altivec(); ret = aes_p8_set_encrypt_key(key, keylen * 8, &ctx->enc_key); ret += aes_p8_set_decrypt_key(key, keylen * 8, &ctx->dec_key); pagefault_enable(); - + preempt_enable(); + ret += crypto_cipher_setkey(ctx->fallback, key, keylen); return ret; } @@ -95,10 +97,12 @@ static void p8_aes_encrypt(struct crypto_tfm *tfm, u8 *dst, const u8 *src) if (in_interrupt()) { crypto_cipher_encrypt_one(ctx->fallback, dst, src); } else { + preempt_disable(); pagefault_disable(); enable_kernel_altivec(); aes_p8_encrypt(src, dst, &ctx->enc_key); pagefault_enable(); + preempt_enable(); } } @@ -109,10 +113,12 @@ static void p8_aes_decrypt(struct crypto_tfm *tfm, u8 *dst, const u8 *src) if (in_interrupt()) { crypto_cipher_decrypt_one(ctx->fallback, dst, src); } else { + preempt_disable(); pagefault_disable(); enable_kernel_altivec(); aes_p8_decrypt(src, dst, &ctx->dec_key); pagefault_enable(); + preempt_enable(); } } diff --git a/drivers/crypto/vmx/aes_cbc.c b/drivers/crypto/vmx/aes_cbc.c index 1a559b7..477284a 100644 --- a/drivers/crypto/vmx/aes_cbc.c +++ b/drivers/crypto/vmx/aes_cbc.c @@ -79,11 +79,13 @@ static int p8_aes_cbc_setkey(struct crypto_tfm *tfm, const u8 *key, int ret; struct p8_aes_cbc_ctx *ctx = crypto_tfm_ctx(tfm); + preempt_disable(); pagefault_disable(); enable_kernel_altivec(); ret = aes_p8_set_encrypt_key(key, keylen * 8, &ctx->enc_key); ret += aes_p8_set_decrypt_key(key, keylen * 8, &ctx->dec_key); pagefault_enable(); + preempt_enable(); ret += crypto_blkcipher_setkey(ctx->fallback, key, keylen); return ret; @@ -106,6 +108,7 @@ static int p8_aes_cbc_encrypt(struct 
blkcipher_desc *desc, if (in_interrupt()) { ret = crypto_blkcipher_encrypt(&fallback_desc, dst, src, nbytes); } else { + preempt_disable(); pagefault_disable(); enable_kernel_altivec(); @@ -119,6 +122,7 @@ static int p8_aes_cbc_encrypt(struct blkcipher_desc *desc, } pagefault_enable(); + preempt_enable(); } return ret; @@ -141,6 +145,7 @@ static int p8_aes_cbc_decrypt(struct blkcipher_desc *desc, if (in_interrupt()) { ret = crypto_blkcipher_decrypt(&fallback_desc, dst, src, nbytes); } else { + preempt_disable(); pagefault_disable(); enable_kernel_altivec(); @@ -154,6 +159,7 @@ static int p8_aes_cbc_decrypt(struct blkcipher_desc *desc, } pagefault_enable(); + preempt_enable(); } return ret; diff --git a/drivers/crypto/vmx/ghash.c b/drivers/crypto/vmx/ghash.c index d0ffe27..f255ec4 100644 --- a/drivers/crypto/vmx/ghash.c +++ b/drivers/crypto/vmx/ghash.c @@ -114,11 +114,13 @@ static int p8_ghash_setkey(struct crypto_shash *tfm, const u8 *key, if (keylen != GHASH_KEY_LEN) return -EINVAL; + preempt_disable(); pagefault_disable(); enable_kernel_altivec(); enable_kernel_fp(); gcm_init_p8(ctx->htable, (const u64 *) key); pagefault_enable(); + preempt_enable(); return crypto_shash_setkey(ctx->fallback, key, keylen); } @@ -140,23 +142,27 @@ static int p8_ghash_update(struct shash_desc *desc, } memcpy(dctx->buffer + dctx->bytes, src, GHASH_DIGEST_SIZE - dctx->bytes); + preempt_disable(); pagefault_disable(); enable_kernel_altivec(); enable_kernel_fp(); gcm_ghash_p8(dctx->shash, ctx->htable, dctx->buffer, GHASH_DIGEST_SIZE); pagefault_enable(); + preempt_enable(); src += GHASH_DIGEST_SIZE - dctx->bytes; srclen -= GHASH_DIGEST_SIZE - dctx->bytes; dctx->bytes = 0; } len = srclen & ~(GHASH_DIGEST_SIZE - 1); if (len) { + preempt_disable(); pagefault_disable(); enable_kernel_altivec(); enable_kernel_fp(); gcm_ghash_p8(dctx->shash, ctx->htable, src, len); pagefault_enable(); + preempt_enable(); src += len; srclen -= len; } @@ -180,12 +186,14 @@ static int p8_ghash_final(struct 
shash_desc *desc, u8 *out) if (dctx->bytes) { for (i = dctx->bytes; i < GHASH_DIGEST_SIZE; i++) dctx->buffer[i] = 0; + preempt_disable(); pagefault_disable(); enable_kernel_altivec(); enable_kernel_fp(); gcm_ghash_p8(dctx->shash, ctx->htable, dctx->buffer, GHASH_DIGEST_SIZE); pagefault_enable(); + preempt_enable(); dctx->bytes = 0; } memcpy(out, dctx->shash, GHASH_DIGEST_SIZE); diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c index a3190e79..cc552a4 100644 --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c @@ -32,6 +32,7 @@ #include "i915_trace.h" #include "intel_drv.h" #include <linux/dma_remapping.h> +#include <linux/uaccess.h> #define __EXEC_OBJECT_HAS_PIN (1<<31) #define __EXEC_OBJECT_HAS_FENCE (1<<30) @@ -465,7 +466,7 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj, } /* We can't wait for rendering with pagefaults disabled */ - if (obj->active && in_atomic()) + if (obj->active && pagefault_disabled()) return -EFAULT; if (use_cpu_reloc(obj)) diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c index ed303ba..3e03379 100644 --- a/drivers/hwmon/coretemp.c +++ b/drivers/hwmon/coretemp.c @@ -63,7 +63,8 @@ MODULE_PARM_DESC(tjmax, "TjMax value in degrees Celsius"); #define TO_ATTR_NO(cpu) (TO_CORE_ID(cpu) + BASE_SYSFS_ATTR_NO) #ifdef CONFIG_SMP -#define for_each_sibling(i, cpu) for_each_cpu(i, cpu_sibling_mask(cpu)) +#define for_each_sibling(i, cpu) \ + for_each_cpu(i, topology_sibling_cpumask(cpu)) #else #define for_each_sibling(i, cpu) for (i = 0; false; ) #endif diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c index 4b00545..65944dd 100644 --- a/drivers/net/ethernet/sfc/efx.c +++ b/drivers/net/ethernet/sfc/efx.c @@ -1304,7 +1304,7 @@ static unsigned int efx_wanted_parallelism(struct efx_nic *efx) if (!cpumask_test_cpu(cpu, thread_mask)) { ++count; cpumask_or(thread_mask, thread_mask, - 
topology_thread_cpumask(cpu)); + topology_sibling_cpumask(cpu)); } } diff --git a/drivers/staging/lustre/lustre/libcfs/linux/linux-cpu.c b/drivers/staging/lustre/lustre/libcfs/linux/linux-cpu.c index cc3ab35..f926224 100644 --- a/drivers/staging/lustre/lustre/libcfs/linux/linux-cpu.c +++ b/drivers/staging/lustre/lustre/libcfs/linux/linux-cpu.c @@ -87,7 +87,7 @@ static void cfs_cpu_core_siblings(int cpu, cpumask_t *mask) /* return cpumask of HTs in the same core */ static void cfs_cpu_ht_siblings(int cpu, cpumask_t *mask) { - cpumask_copy(mask, topology_thread_cpumask(cpu)); + cpumask_copy(mask, topology_sibling_cpumask(cpu)); } static void cfs_node_to_cpumask(int node, cpumask_t *mask) diff --git a/drivers/staging/lustre/lustre/ptlrpc/service.c b/drivers/staging/lustre/lustre/ptlrpc/service.c index 8e61421..344189a 100644 --- a/drivers/staging/lustre/lustre/ptlrpc/service.c +++ b/drivers/staging/lustre/lustre/ptlrpc/service.c @@ -557,7 +557,7 @@ ptlrpc_server_nthreads_check(struct ptlrpc_service *svc, * there are. */ /* weight is # of HTs */ - if (cpumask_weight(topology_thread_cpumask(0)) > 1) { + if (cpumask_weight(topology_sibling_cpumask(0)) > 1) { /* depress thread factor for hyper-thread */ factor = factor - (factor >> 1) + (factor >> 3); } @@ -2768,7 +2768,7 @@ int ptlrpc_hr_init(void) init_waitqueue_head(&ptlrpc_hr.hr_waitq); - weight = cpumask_weight(topology_thread_cpumask(0)); + weight = cpumask_weight(topology_sibling_cpumask(0)); cfs_percpt_for_each(hrp, i, ptlrpc_hr.hr_partitions) { hrp->hrp_cpt = i; diff --git a/include/asm-generic/futex.h b/include/asm-generic/futex.h index b59b5a5..e56272c 100644 --- a/include/asm-generic/futex.h +++ b/include/asm-generic/futex.h @@ -8,8 +8,7 @@ #ifndef CONFIG_SMP /* * The following implementation only for uniprocessor machines. - * For UP, it's relies on the fact that pagefault_disable() also disables - * preemption to ensure mutual exclusion. + * It relies on preempt_disable() ensuring mutual exclusion. 
* */ @@ -38,6 +37,7 @@ futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr) if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) oparg = 1 << oparg; + preempt_disable(); pagefault_disable(); ret = -EFAULT; @@ -72,6 +72,7 @@ futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr) out_pagefault_enable: pagefault_enable(); + preempt_enable(); if (ret == 0) { switch (cmp) { @@ -106,6 +107,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, { u32 val; + preempt_disable(); if (unlikely(get_user(val, uaddr) != 0)) return -EFAULT; @@ -113,6 +115,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, return -EFAULT; *uval = val; + preempt_enable(); return 0; } diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h index eb6f9e6..d0a7a47 100644 --- a/include/asm-generic/preempt.h +++ b/include/asm-generic/preempt.h @@ -79,11 +79,8 @@ static __always_inline bool should_resched(void) #ifdef CONFIG_PREEMPT extern asmlinkage void preempt_schedule(void); #define __preempt_schedule() preempt_schedule() - -#ifdef CONFIG_CONTEXT_TRACKING -extern asmlinkage void preempt_schedule_context(void); -#define __preempt_schedule_context() preempt_schedule_context() -#endif +extern asmlinkage void preempt_schedule_notrace(void); +#define __preempt_schedule_notrace() preempt_schedule_notrace() #endif /* CONFIG_PREEMPT */ #endif /* __ASM_PREEMPT_H */ diff --git a/include/linux/bottom_half.h b/include/linux/bottom_half.h index 86c12c9..8fdcb78 100644 --- a/include/linux/bottom_half.h +++ b/include/linux/bottom_half.h @@ -2,7 +2,6 @@ #define _LINUX_BH_H #include <linux/preempt.h> -#include <linux/preempt_mask.h> #ifdef CONFIG_TRACE_IRQFLAGS extern void __local_bh_disable_ip(unsigned long ip, unsigned int cnt); diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index f4af034..dfd59d6 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -1,7 +1,7 @@ #ifndef LINUX_HARDIRQ_H #define LINUX_HARDIRQ_H -#include 
<linux/preempt_mask.h> +#include <linux/preempt.h> #include <linux/lockdep.h> #include <linux/ftrace_irq.h> #include <linux/vtime.h> diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 9286a46..6aefcd0 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -65,6 +65,7 @@ static inline void kunmap(struct page *page) static inline void *kmap_atomic(struct page *page) { + preempt_disable(); pagefault_disable(); return page_address(page); } @@ -73,6 +74,7 @@ static inline void *kmap_atomic(struct page *page) static inline void __kunmap_atomic(void *addr) { pagefault_enable(); + preempt_enable(); } #define kmap_atomic_pfn(pfn) kmap_atomic(pfn_to_page(pfn)) diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 696d223..bb9b075 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -50,9 +50,8 @@ extern struct fs_struct init_fs; .cpu_timers = INIT_CPU_TIMERS(sig.cpu_timers), \ .rlim = INIT_RLIMITS, \ .cputimer = { \ - .cputime = INIT_CPUTIME, \ - .running = 0, \ - .lock = __RAW_SPIN_LOCK_UNLOCKED(sig.cputimer.lock), \ + .cputime_atomic = INIT_CPUTIME_ATOMIC, \ + .running = 0, \ }, \ .cred_guard_mutex = \ __MUTEX_INITIALIZER(sig.cred_guard_mutex), \ diff --git a/include/linux/io-mapping.h b/include/linux/io-mapping.h index 657fab4..c27dde7 100644 --- a/include/linux/io-mapping.h +++ b/include/linux/io-mapping.h @@ -141,6 +141,7 @@ static inline void __iomem * io_mapping_map_atomic_wc(struct io_mapping *mapping, unsigned long offset) { + preempt_disable(); pagefault_disable(); return ((char __force __iomem *) mapping) + offset; } @@ -149,6 +150,7 @@ static inline void io_mapping_unmap_atomic(void __iomem *vaddr) { pagefault_enable(); + preempt_enable(); } /* Non-atomic map/unmap */ diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 3a5b48e..060dd7b 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -244,7 +244,8 @@ static inline u32 reciprocal_scale(u32 val, u32 ep_ro) #if 
defined(CONFIG_MMU) && \ (defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)) -void might_fault(void); +#define might_fault() __might_fault(__FILE__, __LINE__) +void __might_fault(const char *file, int line); #else static inline void might_fault(void) { } #endif diff --git a/include/linux/lglock.h b/include/linux/lglock.h index 0081f00..c92ebd1 100644 --- a/include/linux/lglock.h +++ b/include/linux/lglock.h @@ -52,10 +52,15 @@ struct lglock { static struct lglock name = { .lock = &name ## _lock } void lg_lock_init(struct lglock *lg, char *name); + void lg_local_lock(struct lglock *lg); void lg_local_unlock(struct lglock *lg); void lg_local_lock_cpu(struct lglock *lg, int cpu); void lg_local_unlock_cpu(struct lglock *lg, int cpu); + +void lg_double_lock(struct lglock *lg, int cpu1, int cpu2); +void lg_double_unlock(struct lglock *lg, int cpu1, int cpu2); + void lg_global_lock(struct lglock *lg); void lg_global_unlock(struct lglock *lg); diff --git a/include/linux/preempt.h b/include/linux/preempt.h index de83b4e..0f1534a 100644 --- a/include/linux/preempt.h +++ b/include/linux/preempt.h @@ -10,13 +10,117 @@ #include <linux/list.h> /* - * We use the MSB mostly because its available; see <linux/preempt_mask.h> for - * the other bits -- can't include that header due to inclusion hell. + * We put the hardirq and softirq counter into the preemption + * counter. The bitmask has the following meaning: + * + * - bits 0-7 are the preemption count (max preemption depth: 256) + * - bits 8-15 are the softirq count (max # of softirqs: 256) + * + * The hardirq count could in theory be the same as the number of + * interrupts in the system, but we run all interrupt handlers with + * interrupts disabled, so we cannot have nesting interrupts. Though + * there are a few palaeontologic drivers which reenable interrupts in + * the handler, so we need more than one bit here. 
+ * + * PREEMPT_MASK: 0x000000ff + * SOFTIRQ_MASK: 0x0000ff00 + * HARDIRQ_MASK: 0x000f0000 + * NMI_MASK: 0x00100000 + * PREEMPT_ACTIVE: 0x00200000 + * PREEMPT_NEED_RESCHED: 0x80000000 */ +#define PREEMPT_BITS 8 +#define SOFTIRQ_BITS 8 +#define HARDIRQ_BITS 4 +#define NMI_BITS 1 + +#define PREEMPT_SHIFT 0 +#define SOFTIRQ_SHIFT (PREEMPT_SHIFT + PREEMPT_BITS) +#define HARDIRQ_SHIFT (SOFTIRQ_SHIFT + SOFTIRQ_BITS) +#define NMI_SHIFT (HARDIRQ_SHIFT + HARDIRQ_BITS) + +#define __IRQ_MASK(x) ((1UL << (x))-1) + +#define PREEMPT_MASK (__IRQ_MASK(PREEMPT_BITS) << PREEMPT_SHIFT) +#define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT) +#define HARDIRQ_MASK (__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT) +#define NMI_MASK (__IRQ_MASK(NMI_BITS) << NMI_SHIFT) + +#define PREEMPT_OFFSET (1UL << PREEMPT_SHIFT) +#define SOFTIRQ_OFFSET (1UL << SOFTIRQ_SHIFT) +#define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT) +#define NMI_OFFSET (1UL << NMI_SHIFT) + +#define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET) + +#define PREEMPT_ACTIVE_BITS 1 +#define PREEMPT_ACTIVE_SHIFT (NMI_SHIFT + NMI_BITS) +#define PREEMPT_ACTIVE (__IRQ_MASK(PREEMPT_ACTIVE_BITS) << PREEMPT_ACTIVE_SHIFT) + +/* We use the MSB mostly because its available */ #define PREEMPT_NEED_RESCHED 0x80000000 +/* preempt_count() and related functions, depends on PREEMPT_NEED_RESCHED */ #include <asm/preempt.h> +#define hardirq_count() (preempt_count() & HARDIRQ_MASK) +#define softirq_count() (preempt_count() & SOFTIRQ_MASK) +#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \ + | NMI_MASK)) + +/* + * Are we doing bottom half or hardware interrupt processing? + * Are we in a softirq context? Interrupt context? + * in_softirq - Are we currently processing softirq or have bh disabled? + * in_serving_softirq - Are we currently processing softirq? 
+ */ +#define in_irq() (hardirq_count()) +#define in_softirq() (softirq_count()) +#define in_interrupt() (irq_count()) +#define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET) + +/* + * Are we in NMI context? + */ +#define in_nmi() (preempt_count() & NMI_MASK) + +#if defined(CONFIG_PREEMPT_COUNT) +# define PREEMPT_DISABLE_OFFSET 1 +#else +# define PREEMPT_DISABLE_OFFSET 0 +#endif + +/* + * The preempt_count offset needed for things like: + * + * spin_lock_bh() + * + * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and + * softirqs, such that unlock sequences of: + * + * spin_unlock(); + * local_bh_enable(); + * + * Work as expected. + */ +#define SOFTIRQ_LOCK_OFFSET (SOFTIRQ_DISABLE_OFFSET + PREEMPT_DISABLE_OFFSET) + +/* + * Are we running in atomic context? WARNING: this macro cannot + * always detect atomic context; in particular, it cannot know about + * held spinlocks in non-preemptible kernels. Thus it should not be + * used in the general case to determine whether sleeping is possible. + * Do not use in_atomic() in driver code. 
+ */ +#define in_atomic() (preempt_count() != 0) + +/* + * Check whether we were atomic before we did preempt_disable(): + * (used by the scheduler) + */ +#define in_atomic_preempt_off() \ + ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_DISABLE_OFFSET) + #if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER) extern void preempt_count_add(int val); extern void preempt_count_sub(int val); @@ -33,6 +137,18 @@ extern void preempt_count_sub(int val); #define preempt_count_inc() preempt_count_add(1) #define preempt_count_dec() preempt_count_sub(1) +#define preempt_active_enter() \ +do { \ + preempt_count_add(PREEMPT_ACTIVE + PREEMPT_DISABLE_OFFSET); \ + barrier(); \ +} while (0) + +#define preempt_active_exit() \ +do { \ + barrier(); \ + preempt_count_sub(PREEMPT_ACTIVE + PREEMPT_DISABLE_OFFSET); \ +} while (0) + #ifdef CONFIG_PREEMPT_COUNT #define preempt_disable() \ @@ -49,6 +165,8 @@ do { \ #define preempt_enable_no_resched() sched_preempt_enable_no_resched() +#define preemptible() (preempt_count() == 0 && !irqs_disabled()) + #ifdef CONFIG_PREEMPT #define preempt_enable() \ do { \ @@ -57,52 +175,46 @@ do { \ __preempt_schedule(); \ } while (0) +#define preempt_enable_notrace() \ +do { \ + barrier(); \ + if (unlikely(__preempt_count_dec_and_test())) \ + __preempt_schedule_notrace(); \ +} while (0) + #define preempt_check_resched() \ do { \ if (should_resched()) \ __preempt_schedule(); \ } while (0) -#else +#else /* !CONFIG_PREEMPT */ #define preempt_enable() \ do { \ barrier(); \ preempt_count_dec(); \ } while (0) -#define preempt_check_resched() do { } while (0) -#endif - -#define preempt_disable_notrace() \ -do { \ - __preempt_count_inc(); \ - barrier(); \ -} while (0) -#define preempt_enable_no_resched_notrace() \ +#define preempt_enable_notrace() \ do { \ barrier(); \ __preempt_count_dec(); \ } while (0) -#ifdef CONFIG_PREEMPT - -#ifndef CONFIG_CONTEXT_TRACKING -#define __preempt_schedule_context() __preempt_schedule() -#endif +#define 
preempt_check_resched() do { } while (0) +#endif /* CONFIG_PREEMPT */ -#define preempt_enable_notrace() \ +#define preempt_disable_notrace() \ do { \ + __preempt_count_inc(); \ barrier(); \ - if (unlikely(__preempt_count_dec_and_test())) \ - __preempt_schedule_context(); \ } while (0) -#else -#define preempt_enable_notrace() \ + +#define preempt_enable_no_resched_notrace() \ do { \ barrier(); \ __preempt_count_dec(); \ } while (0) -#endif #else /* !CONFIG_PREEMPT_COUNT */ @@ -121,6 +233,7 @@ do { \ #define preempt_disable_notrace() barrier() #define preempt_enable_no_resched_notrace() barrier() #define preempt_enable_notrace() barrier() +#define preemptible() 0 #endif /* CONFIG_PREEMPT_COUNT */ diff --git a/include/linux/preempt_mask.h b/include/linux/preempt_mask.h deleted file mode 100644 index dbeec4d..0000000 --- a/include/linux/preempt_mask.h +++ /dev/null @@ -1,117 +0,0 @@ -#ifndef LINUX_PREEMPT_MASK_H -#define LINUX_PREEMPT_MASK_H - -#include <linux/preempt.h> - -/* - * We put the hardirq and softirq counter into the preemption - * counter. The bitmask has the following meaning: - * - * - bits 0-7 are the preemption count (max preemption depth: 256) - * - bits 8-15 are the softirq count (max # of softirqs: 256) - * - * The hardirq count could in theory be the same as the number of - * interrupts in the system, but we run all interrupt handlers with - * interrupts disabled, so we cannot have nesting interrupts. Though - * there are a few palaeontologic drivers which reenable interrupts in - * the handler, so we need more than one bit here. 
- * - * PREEMPT_MASK: 0x000000ff - * SOFTIRQ_MASK: 0x0000ff00 - * HARDIRQ_MASK: 0x000f0000 - * NMI_MASK: 0x00100000 - * PREEMPT_ACTIVE: 0x00200000 - */ -#define PREEMPT_BITS 8 -#define SOFTIRQ_BITS 8 -#define HARDIRQ_BITS 4 -#define NMI_BITS 1 - -#define PREEMPT_SHIFT 0 -#define SOFTIRQ_SHIFT (PREEMPT_SHIFT + PREEMPT_BITS) -#define HARDIRQ_SHIFT (SOFTIRQ_SHIFT + SOFTIRQ_BITS) -#define NMI_SHIFT (HARDIRQ_SHIFT + HARDIRQ_BITS) - -#define __IRQ_MASK(x) ((1UL << (x))-1) - -#define PREEMPT_MASK (__IRQ_MASK(PREEMPT_BITS) << PREEMPT_SHIFT) -#define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT) -#define HARDIRQ_MASK (__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT) -#define NMI_MASK (__IRQ_MASK(NMI_BITS) << NMI_SHIFT) - -#define PREEMPT_OFFSET (1UL << PREEMPT_SHIFT) -#define SOFTIRQ_OFFSET (1UL << SOFTIRQ_SHIFT) -#define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT) -#define NMI_OFFSET (1UL << NMI_SHIFT) - -#define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET) - -#define PREEMPT_ACTIVE_BITS 1 -#define PREEMPT_ACTIVE_SHIFT (NMI_SHIFT + NMI_BITS) -#define PREEMPT_ACTIVE (__IRQ_MASK(PREEMPT_ACTIVE_BITS) << PREEMPT_ACTIVE_SHIFT) - -#define hardirq_count() (preempt_count() & HARDIRQ_MASK) -#define softirq_count() (preempt_count() & SOFTIRQ_MASK) -#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \ - | NMI_MASK)) - -/* - * Are we doing bottom half or hardware interrupt processing? - * Are we in a softirq context? Interrupt context? - * in_softirq - Are we currently processing softirq or have bh disabled? - * in_serving_softirq - Are we currently processing softirq? - */ -#define in_irq() (hardirq_count()) -#define in_softirq() (softirq_count()) -#define in_interrupt() (irq_count()) -#define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET) - -/* - * Are we in NMI context? 
- */ -#define in_nmi() (preempt_count() & NMI_MASK) - -#if defined(CONFIG_PREEMPT_COUNT) -# define PREEMPT_CHECK_OFFSET 1 -#else -# define PREEMPT_CHECK_OFFSET 0 -#endif - -/* - * The preempt_count offset needed for things like: - * - * spin_lock_bh() - * - * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and - * softirqs, such that unlock sequences of: - * - * spin_unlock(); - * local_bh_enable(); - * - * Work as expected. - */ -#define SOFTIRQ_LOCK_OFFSET (SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET) - -/* - * Are we running in atomic context? WARNING: this macro cannot - * always detect atomic context; in particular, it cannot know about - * held spinlocks in non-preemptible kernels. Thus it should not be - * used in the general case to determine whether sleeping is possible. - * Do not use in_atomic() in driver code. - */ -#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0) - -/* - * Check whether we were atomic before we did preempt_disable(): - * (used by the scheduler, *after* releasing the kernel lock) - */ -#define in_atomic_preempt_off() \ - ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET) - -#ifdef CONFIG_PREEMPT_COUNT -# define preemptible() (preempt_count() == 0 && !irqs_disabled()) -#else -# define preemptible() 0 -#endif - -#endif /* LINUX_PREEMPT_MASK_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 659f572..d4193d5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -25,7 +25,7 @@ struct sched_param { #include <linux/errno.h> #include <linux/nodemask.h> #include <linux/mm_types.h> -#include <linux/preempt_mask.h> +#include <linux/preempt.h> #include <asm/page.h> #include <asm/ptrace.h> @@ -174,7 +174,12 @@ extern unsigned long nr_iowait_cpu(int cpu); extern void get_iowait_load(unsigned long *nr_waiters, unsigned long *load); extern void calc_global_load(unsigned long ticks); + +#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) extern void update_cpu_load_nohz(void); 
+#else +static inline void update_cpu_load_nohz(void) { } +#endif extern unsigned long get_parent_ip(unsigned long addr); @@ -214,9 +219,10 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq); #define TASK_WAKEKILL 128 #define TASK_WAKING 256 #define TASK_PARKED 512 -#define TASK_STATE_MAX 1024 +#define TASK_NOLOAD 1024 +#define TASK_STATE_MAX 2048 -#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP" +#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPN" extern char ___assert_task_state[1 - 2*!!( sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)]; @@ -226,6 +232,8 @@ extern char ___assert_task_state[1 - 2*!!( #define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED) #define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED) +#define TASK_IDLE (TASK_UNINTERRUPTIBLE | TASK_NOLOAD) + /* Convenience macros for the sake of wake_up */ #define TASK_NORMAL (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE) #define TASK_ALL (TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED) @@ -241,7 +249,8 @@ extern char ___assert_task_state[1 - 2*!!( ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0) #define task_contributes_to_load(task) \ ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \ - (task->flags & PF_FROZEN) == 0) + (task->flags & PF_FROZEN) == 0 && \ + (task->state & TASK_NOLOAD) == 0) #ifdef CONFIG_DEBUG_ATOMIC_SLEEP @@ -568,6 +577,23 @@ struct task_cputime { .sum_exec_runtime = 0, \ } +/* + * This is the atomic variant of task_cputime, which can be used for + * storing and updating task_cputime statistics without locking. 
+ */ +struct task_cputime_atomic { + atomic64_t utime; + atomic64_t stime; + atomic64_t sum_exec_runtime; +}; + +#define INIT_CPUTIME_ATOMIC \ + (struct task_cputime_atomic) { \ + .utime = ATOMIC64_INIT(0), \ + .stime = ATOMIC64_INIT(0), \ + .sum_exec_runtime = ATOMIC64_INIT(0), \ + } + #ifdef CONFIG_PREEMPT_COUNT #define PREEMPT_DISABLED (1 + PREEMPT_ENABLED) #else @@ -585,18 +611,16 @@ struct task_cputime { /** * struct thread_group_cputimer - thread group interval timer counts - * @cputime: thread group interval timers. + * @cputime_atomic: atomic thread group interval timers. * @running: non-zero when there are timers running and * @cputime receives updates. - * @lock: lock for fields in this struct. * * This structure contains the version of task_cputime, above, that is * used for thread group CPU timer calculations. */ struct thread_group_cputimer { - struct task_cputime cputime; + struct task_cputime_atomic cputime_atomic; int running; - raw_spinlock_t lock; }; #include <linux/rwsem.h> @@ -901,6 +925,50 @@ enum cpu_idle_type { #define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT) /* + * Wake-queues are lists of tasks with a pending wakeup, whose + * callers have already marked the task as woken internally, + * and can thus carry on. A common use case is being able to + * do the wakeups once the corresponding user lock as been + * released. + * + * We hold reference to each task in the list across the wakeup, + * thus guaranteeing that the memory is still valid by the time + * the actual wakeups are performed in wake_up_q(). + * + * One per task suffices, because there's never a need for a task to be + * in two wake queues simultaneously; it is forbidden to abandon a task + * in a wake queue (a call to wake_up_q() _must_ follow), so if a task is + * already in a wake queue, the wakeup will happen soon and the second + * waker can just skip it. + * + * The WAKE_Q macro declares and initializes the list head. 
+ * wake_up_q() does NOT reinitialize the list; it's expected to be + * called near the end of a function, where the fact that the queue is + * not used again will be easy to see by inspection. + * + * Note that this can cause spurious wakeups. schedule() callers + * must ensure the call is done inside a loop, confirming that the + * wakeup condition has in fact occurred. + */ +struct wake_q_node { + struct wake_q_node *next; +}; + +struct wake_q_head { + struct wake_q_node *first; + struct wake_q_node **lastp; +}; + +#define WAKE_Q_TAIL ((struct wake_q_node *) 0x01) + +#define WAKE_Q(name) \ + struct wake_q_head name = { WAKE_Q_TAIL, &name.first } + +extern void wake_q_add(struct wake_q_head *head, + struct task_struct *task); +extern void wake_up_q(struct wake_q_head *head); + +/* * sched-domains (multiprocessor balancing) declarations: */ #ifdef CONFIG_SMP @@ -1335,8 +1403,6 @@ struct task_struct { int rcu_read_lock_nesting; union rcu_special rcu_read_unlock_special; struct list_head rcu_node_entry; -#endif /* #ifdef CONFIG_PREEMPT_RCU */ -#ifdef CONFIG_PREEMPT_RCU struct rcu_node *rcu_blocked_node; #endif /* #ifdef CONFIG_PREEMPT_RCU */ #ifdef CONFIG_TASKS_RCU @@ -1367,7 +1433,7 @@ struct task_struct { int exit_state; int exit_code, exit_signal; int pdeath_signal; /* The signal sent when the parent dies */ - unsigned int jobctl; /* JOBCTL_*, siglock protected */ + unsigned long jobctl; /* JOBCTL_*, siglock protected */ /* Used for emulating ABI behavior of previous Linux versions */ unsigned int personality; @@ -1513,6 +1579,8 @@ struct task_struct { /* Protection of the PI data structures: */ raw_spinlock_t pi_lock; + struct wake_q_node wake_q; + #ifdef CONFIG_RT_MUTEXES /* PI waiters blocked on a rt_mutex held by this task */ struct rb_root pi_waiters; @@ -1726,6 +1794,7 @@ struct task_struct { #ifdef CONFIG_DEBUG_ATOMIC_SLEEP unsigned long task_state_change; #endif + int pagefault_disabled; }; /* Future-safe accessor for struct task_struct's cpus_allowed. 
*/ @@ -2079,22 +2148,22 @@ TASK_PFA_CLEAR(SPREAD_SLAB, spread_slab) #define JOBCTL_TRAPPING_BIT 21 /* switching to TRACED */ #define JOBCTL_LISTENING_BIT 22 /* ptracer is listening for events */ -#define JOBCTL_STOP_DEQUEUED (1 << JOBCTL_STOP_DEQUEUED_BIT) -#define JOBCTL_STOP_PENDING (1 << JOBCTL_STOP_PENDING_BIT) -#define JOBCTL_STOP_CONSUME (1 << JOBCTL_STOP_CONSUME_BIT) -#define JOBCTL_TRAP_STOP (1 << JOBCTL_TRAP_STOP_BIT) -#define JOBCTL_TRAP_NOTIFY (1 << JOBCTL_TRAP_NOTIFY_BIT) -#define JOBCTL_TRAPPING (1 << JOBCTL_TRAPPING_BIT) -#define JOBCTL_LISTENING (1 << JOBCTL_LISTENING_BIT) +#define JOBCTL_STOP_DEQUEUED (1UL << JOBCTL_STOP_DEQUEUED_BIT) +#define JOBCTL_STOP_PENDING (1UL << JOBCTL_STOP_PENDING_BIT) +#define JOBCTL_STOP_CONSUME (1UL << JOBCTL_STOP_CONSUME_BIT) +#define JOBCTL_TRAP_STOP (1UL << JOBCTL_TRAP_STOP_BIT) +#define JOBCTL_TRAP_NOTIFY (1UL << JOBCTL_TRAP_NOTIFY_BIT) +#define JOBCTL_TRAPPING (1UL << JOBCTL_TRAPPING_BIT) +#define JOBCTL_LISTENING (1UL << JOBCTL_LISTENING_BIT) #define JOBCTL_TRAP_MASK (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY) #define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK) extern bool task_set_jobctl_pending(struct task_struct *task, - unsigned int mask); + unsigned long mask); extern void task_clear_jobctl_trapping(struct task_struct *task); extern void task_clear_jobctl_pending(struct task_struct *task, - unsigned int mask); + unsigned long mask); static inline void rcu_copy_process(struct task_struct *p) { @@ -2964,11 +3033,6 @@ static __always_inline bool need_resched(void) void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times); void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times); -static inline void thread_group_cputime_init(struct signal_struct *sig) -{ - raw_spin_lock_init(&sig->cputimer.lock); -} - /* * Reevaluate whether the task has signals pending delivery. * Wake the task if so. 
@@ -3082,13 +3146,13 @@ static inline void mm_update_next_owner(struct mm_struct *mm) static inline unsigned long task_rlimit(const struct task_struct *tsk, unsigned int limit) { - return ACCESS_ONCE(tsk->signal->rlim[limit].rlim_cur); + return READ_ONCE(tsk->signal->rlim[limit].rlim_cur); } static inline unsigned long task_rlimit_max(const struct task_struct *tsk, unsigned int limit) { - return ACCESS_ONCE(tsk->signal->rlim[limit].rlim_max); + return READ_ONCE(tsk->signal->rlim[limit].rlim_max); } static inline unsigned long rlimit(unsigned int limit) diff --git a/include/linux/topology.h b/include/linux/topology.h index 909b6e4..73ddad1 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -191,8 +191,8 @@ static inline int cpu_to_mem(int cpu) #ifndef topology_core_id #define topology_core_id(cpu) ((void)(cpu), 0) #endif -#ifndef topology_thread_cpumask -#define topology_thread_cpumask(cpu) cpumask_of(cpu) +#ifndef topology_sibling_cpumask +#define topology_sibling_cpumask(cpu) cpumask_of(cpu) #endif #ifndef topology_core_cpumask #define topology_core_cpumask(cpu) cpumask_of(cpu) @@ -201,7 +201,7 @@ static inline int cpu_to_mem(int cpu) #ifdef CONFIG_SCHED_SMT static inline const struct cpumask *cpu_smt_mask(int cpu) { - return topology_thread_cpumask(cpu); + return topology_sibling_cpumask(cpu); } #endif diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index ecd3319..ae572c1 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -1,21 +1,30 @@ #ifndef __LINUX_UACCESS_H__ #define __LINUX_UACCESS_H__ -#include <linux/preempt.h> +#include <linux/sched.h> #include <asm/uaccess.h> +static __always_inline void pagefault_disabled_inc(void) +{ + current->pagefault_disabled++; +} + +static __always_inline void pagefault_disabled_dec(void) +{ + current->pagefault_disabled--; + WARN_ON(current->pagefault_disabled < 0); +} + /* - * These routines enable/disable the pagefault handler in that - * it will not take any locks and go 
straight to the fixup table. + * These routines enable/disable the pagefault handler. If disabled, it will + * not take any locks and go straight to the fixup table. * - * They have great resemblance to the preempt_disable/enable calls - * and in fact they are identical; this is because currently there is - * no other way to make the pagefault handlers do this. So we do - * disable preemption but we don't necessarily care about that. + * User access methods will not sleep when called from a pagefault_disabled() + * environment. */ static inline void pagefault_disable(void) { - preempt_count_inc(); + pagefault_disabled_inc(); /* * make sure to have issued the store before a pagefault * can hit. @@ -25,18 +34,31 @@ static inline void pagefault_disable(void) static inline void pagefault_enable(void) { -#ifndef CONFIG_PREEMPT /* * make sure to issue those last loads/stores before enabling * the pagefault handler again. */ barrier(); - preempt_count_dec(); -#else - preempt_enable(); -#endif + pagefault_disabled_dec(); } +/* + * Is the pagefault handler disabled? If so, user access methods will not sleep. + */ +#define pagefault_disabled() (current->pagefault_disabled != 0) + +/* + * The pagefault handler is in general disabled by pagefault_disable() or + * when in irq context (via in_atomic()). + * + * This function should only be used by the fault handlers. Other users should + * stick to pagefault_disabled(). + * Please NEVER use preempt_disable() to disable the fault handler. With + * !CONFIG_PREEMPT_COUNT, this is like a NOP. So the handler won't be disabled. + * in_atomic() will report different values based on !CONFIG_PREEMPT_COUNT. 
+ */ +#define faulthandler_disabled() (pagefault_disabled() || in_atomic()) + #ifndef ARCH_HAS_NOCACHE_UACCESS static inline unsigned long __copy_from_user_inatomic_nocache(void *to, diff --git a/include/linux/wait.h b/include/linux/wait.h index 2db8334..d69ac4e 100644 --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -969,7 +969,7 @@ extern int bit_wait_io_timeout(struct wait_bit_key *); * on that signal. */ static inline int -wait_on_bit(void *word, int bit, unsigned mode) +wait_on_bit(unsigned long *word, int bit, unsigned mode) { might_sleep(); if (!test_bit(bit, word)) @@ -994,7 +994,7 @@ wait_on_bit(void *word, int bit, unsigned mode) * on that signal. */ static inline int -wait_on_bit_io(void *word, int bit, unsigned mode) +wait_on_bit_io(unsigned long *word, int bit, unsigned mode) { might_sleep(); if (!test_bit(bit, word)) @@ -1020,7 +1020,8 @@ wait_on_bit_io(void *word, int bit, unsigned mode) * received a signal and the mode permitted wakeup on that signal. */ static inline int -wait_on_bit_timeout(void *word, int bit, unsigned mode, unsigned long timeout) +wait_on_bit_timeout(unsigned long *word, int bit, unsigned mode, + unsigned long timeout) { might_sleep(); if (!test_bit(bit, word)) @@ -1047,7 +1048,8 @@ wait_on_bit_timeout(void *word, int bit, unsigned mode, unsigned long timeout) * on that signal. */ static inline int -wait_on_bit_action(void *word, int bit, wait_bit_action_f *action, unsigned mode) +wait_on_bit_action(unsigned long *word, int bit, wait_bit_action_f *action, + unsigned mode) { might_sleep(); if (!test_bit(bit, word)) @@ -1075,7 +1077,7 @@ wait_on_bit_action(void *word, int bit, wait_bit_action_f *action, unsigned mode * the @mode allows that signal to wake the process. 
*/ static inline int -wait_on_bit_lock(void *word, int bit, unsigned mode) +wait_on_bit_lock(unsigned long *word, int bit, unsigned mode) { might_sleep(); if (!test_and_set_bit(bit, word)) @@ -1099,7 +1101,7 @@ wait_on_bit_lock(void *word, int bit, unsigned mode) * the @mode allows that signal to wake the process. */ static inline int -wait_on_bit_lock_io(void *word, int bit, unsigned mode) +wait_on_bit_lock_io(unsigned long *word, int bit, unsigned mode) { might_sleep(); if (!test_and_set_bit(bit, word)) @@ -1125,7 +1127,8 @@ wait_on_bit_lock_io(void *word, int bit, unsigned mode) * the @mode allows that signal to wake the process. */ static inline int -wait_on_bit_lock_action(void *word, int bit, wait_bit_action_f *action, unsigned mode) +wait_on_bit_lock_action(unsigned long *word, int bit, wait_bit_action_f *action, + unsigned mode) { might_sleep(); if (!test_and_set_bit(bit, word)) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 30fedaf..d57a575 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -147,7 +147,8 @@ TRACE_EVENT(sched_switch, __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|", { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, - { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R", + { 128, "K" }, { 256, "W" }, { 512, "P" }, + { 1024, "N" }) : "R", __entry->prev_state & TASK_STATE_MAX ? 
"+" : "", __entry->next_comm, __entry->next_pid, __entry->next_prio) ); diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 3aaea7f..a24ba9f 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -47,8 +47,7 @@ #define RECV 1 #define STATE_NONE 0 -#define STATE_PENDING 1 -#define STATE_READY 2 +#define STATE_READY 1 struct posix_msg_tree_node { struct rb_node rb_node; @@ -571,15 +570,12 @@ static int wq_sleep(struct mqueue_inode_info *info, int sr, wq_add(info, sr, ewp); for (;;) { - set_current_state(TASK_INTERRUPTIBLE); + __set_current_state(TASK_INTERRUPTIBLE); spin_unlock(&info->lock); time = schedule_hrtimeout_range_clock(timeout, 0, HRTIMER_MODE_ABS, CLOCK_REALTIME); - while (ewp->state == STATE_PENDING) - cpu_relax(); - if (ewp->state == STATE_READY) { retval = 0; goto out; @@ -907,11 +903,15 @@ out_name: * list of waiting receivers. A sender checks that list before adding the new * message into the message array. If there is a waiting receiver, then it * bypasses the message array and directly hands the message over to the - * receiver. - * The receiver accepts the message and returns without grabbing the queue - * spinlock. Therefore an intermediate STATE_PENDING state and memory barriers - * are necessary. The same algorithm is used for sysv semaphores, see - * ipc/sem.c for more details. + * receiver. The receiver accepts the message and returns without grabbing the + * queue spinlock: + * + * - Set pointer to message. + * - Queue the receiver task for later wakeup (without the info->lock). + * - Update its state to STATE_READY. Now the receiver can continue. + * - Wake up the process after the lock is dropped. Should the process wake up + * before this wakeup (due to a timeout or a signal) it will either see + * STATE_READY and continue or acquire the lock to check the state again. * * The same algorithm is used for senders. 
*/ @@ -919,21 +919,29 @@ out_name: /* pipelined_send() - send a message directly to the task waiting in * sys_mq_timedreceive() (without inserting message into a queue). */ -static inline void pipelined_send(struct mqueue_inode_info *info, +static inline void pipelined_send(struct wake_q_head *wake_q, + struct mqueue_inode_info *info, struct msg_msg *message, struct ext_wait_queue *receiver) { receiver->msg = message; list_del(&receiver->list); - receiver->state = STATE_PENDING; - wake_up_process(receiver->task); - smp_wmb(); + wake_q_add(wake_q, receiver->task); + /* + * Rely on the implicit cmpxchg barrier from wake_q_add such + * that we can ensure that updating receiver->state is the last + * write operation: As once set, the receiver can continue, + * and if we don't have the reference count from the wake_q, + * yet, at that point we can later have a use-after-free + * condition and bogus wakeup. + */ receiver->state = STATE_READY; } /* pipelined_receive() - if there is task waiting in sys_mq_timedsend() * gets its message and put to the queue (we have one free place for sure). 
*/ -static inline void pipelined_receive(struct mqueue_inode_info *info) +static inline void pipelined_receive(struct wake_q_head *wake_q, + struct mqueue_inode_info *info) { struct ext_wait_queue *sender = wq_get_first_waiter(info, SEND); @@ -944,10 +952,9 @@ static inline void pipelined_receive(struct mqueue_inode_info *info) } if (msg_insert(sender->msg, info)) return; + list_del(&sender->list); - sender->state = STATE_PENDING; - wake_up_process(sender->task); - smp_wmb(); + wake_q_add(wake_q, sender->task); sender->state = STATE_READY; } @@ -965,6 +972,7 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr, struct timespec ts; struct posix_msg_tree_node *new_leaf = NULL; int ret = 0; + WAKE_Q(wake_q); if (u_abs_timeout) { int res = prepare_timeout(u_abs_timeout, &expires, &ts); @@ -1049,7 +1057,7 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr, } else { receiver = wq_get_first_waiter(info, RECV); if (receiver) { - pipelined_send(info, msg_ptr, receiver); + pipelined_send(&wake_q, info, msg_ptr, receiver); } else { /* adds message to the queue */ ret = msg_insert(msg_ptr, info); @@ -1062,6 +1070,7 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr, } out_unlock: spin_unlock(&info->lock); + wake_up_q(&wake_q); out_free: if (ret) free_msg(msg_ptr); @@ -1149,14 +1158,17 @@ SYSCALL_DEFINE5(mq_timedreceive, mqd_t, mqdes, char __user *, u_msg_ptr, msg_ptr = wait.msg; } } else { + WAKE_Q(wake_q); + msg_ptr = msg_get(info); inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; /* There is now free space in queue. 
*/ - pipelined_receive(info); + pipelined_receive(&wake_q, info); spin_unlock(&info->lock); + wake_up_q(&wake_q); ret = 0; } if (ret == 0) { diff --git a/kernel/fork.c b/kernel/fork.c index 03c1eaa..0bb88b5 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1091,10 +1091,7 @@ static void posix_cpu_timers_init_group(struct signal_struct *sig) { unsigned long cpu_limit; - /* Thread group counters. */ - thread_group_cputime_init(sig); - - cpu_limit = ACCESS_ONCE(sig->rlim[RLIMIT_CPU].rlim_cur); + cpu_limit = READ_ONCE(sig->rlim[RLIMIT_CPU].rlim_cur); if (cpu_limit != RLIM_INFINITY) { sig->cputime_expires.prof_exp = secs_to_cputime(cpu_limit); sig->cputimer.running = 1; @@ -1396,6 +1393,9 @@ static struct task_struct *copy_process(unsigned long clone_flags, p->hardirq_context = 0; p->softirq_context = 0; #endif + + p->pagefault_disabled = 0; + #ifdef CONFIG_LOCKDEP p->lockdep_depth = 0; /* no locks held yet */ p->curr_chain_key = 0; diff --git a/kernel/futex.c b/kernel/futex.c index 55ca63ad9..aacc706 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1090,9 +1090,11 @@ static void __unqueue_futex(struct futex_q *q) /* * The hash bucket lock must be held when this is called. - * Afterwards, the futex_q must not be accessed. + * Afterwards, the futex_q must not be accessed. Callers + * must ensure to later call wake_up_q() for the actual + * wakeups to occur. */ -static void wake_futex(struct futex_q *q) +static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q) { struct task_struct *p = q->task; @@ -1100,14 +1102,10 @@ static void wake_futex(struct futex_q *q) return; /* - * We set q->lock_ptr = NULL _before_ we wake up the task. If - * a non-futex wake up happens on another CPU then the task - * might exit and p would dereference a non-existing task - * struct. Prevent this by holding a reference on p across the - * wake up. + * Queue the task for later wakeup for after we've released + * the hb->lock. wake_q_add() grabs reference to p. 
*/ - get_task_struct(p); - + wake_q_add(wake_q, p); __unqueue_futex(q); /* * The waiting task can free the futex_q as soon as @@ -1117,9 +1115,6 @@ static void wake_futex(struct futex_q *q) */ smp_wmb(); q->lock_ptr = NULL; - - wake_up_state(p, TASK_NORMAL); - put_task_struct(p); } static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this) @@ -1217,6 +1212,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset) struct futex_q *this, *next; union futex_key key = FUTEX_KEY_INIT; int ret; + WAKE_Q(wake_q); if (!bitset) return -EINVAL; @@ -1244,13 +1240,14 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset) if (!(this->bitset & bitset)) continue; - wake_futex(this); + mark_wake_futex(&wake_q, this); if (++ret >= nr_wake) break; } } spin_unlock(&hb->lock); + wake_up_q(&wake_q); out_put_key: put_futex_key(&key); out: @@ -1269,6 +1266,7 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2, struct futex_hash_bucket *hb1, *hb2; struct futex_q *this, *next; int ret, op_ret; + WAKE_Q(wake_q); retry: ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ); @@ -1320,7 +1318,7 @@ retry_private: ret = -EINVAL; goto out_unlock; } - wake_futex(this); + mark_wake_futex(&wake_q, this); if (++ret >= nr_wake) break; } @@ -1334,7 +1332,7 @@ retry_private: ret = -EINVAL; goto out_unlock; } - wake_futex(this); + mark_wake_futex(&wake_q, this); if (++op_ret >= nr_wake2) break; } @@ -1344,6 +1342,7 @@ retry_private: out_unlock: double_unlock_hb(hb1, hb2); + wake_up_q(&wake_q); out_put_keys: put_futex_key(&key2); out_put_key1: @@ -1503,6 +1502,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, struct futex_pi_state *pi_state = NULL; struct futex_hash_bucket *hb1, *hb2; struct futex_q *this, *next; + WAKE_Q(wake_q); if (requeue_pi) { /* @@ -1679,7 +1679,7 @@ retry_private: * woken by futex_unlock_pi(). 
*/ if (++task_count <= nr_wake && !requeue_pi) { - wake_futex(this); + mark_wake_futex(&wake_q, this); continue; } @@ -1719,6 +1719,7 @@ retry_private: out_unlock: free_pi_state(pi_state); double_unlock_hb(hb1, hb2); + wake_up_q(&wake_q); hb_waiters_dec(hb2); /* diff --git a/kernel/locking/lglock.c b/kernel/locking/lglock.c index 86ae2ae..951cfcd 100644 --- a/kernel/locking/lglock.c +++ b/kernel/locking/lglock.c @@ -60,6 +60,28 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu) } EXPORT_SYMBOL(lg_local_unlock_cpu); +void lg_double_lock(struct lglock *lg, int cpu1, int cpu2) +{ + BUG_ON(cpu1 == cpu2); + + /* lock in cpu order, just like lg_global_lock */ + if (cpu2 < cpu1) + swap(cpu1, cpu2); + + preempt_disable(); + lock_acquire_shared(&lg->lock_dep_map, 0, 0, NULL, _RET_IP_); + arch_spin_lock(per_cpu_ptr(lg->lock, cpu1)); + arch_spin_lock(per_cpu_ptr(lg->lock, cpu2)); +} + +void lg_double_unlock(struct lglock *lg, int cpu1, int cpu2) +{ + lock_release(&lg->lock_dep_map, 1, _RET_IP_); + arch_spin_unlock(per_cpu_ptr(lg->lock, cpu1)); + arch_spin_unlock(per_cpu_ptr(lg->lock, cpu2)); + preempt_enable(); +} + void lg_global_lock(struct lglock *lg) { int i; diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 46be870..6768797 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -11,7 +11,7 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y) CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer endif -obj-y += core.o proc.o clock.o cputime.o +obj-y += core.o loadavg.o clock.o cputime.o obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o obj-y += wait.o completion.o idle.o obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o diff --git a/kernel/sched/auto_group.c b/kernel/sched/auto_group.c index eae160d..750ed60 100644 --- a/kernel/sched/auto_group.c +++ b/kernel/sched/auto_group.c @@ -1,5 +1,3 @@ -#ifdef CONFIG_SCHED_AUTOGROUP - #include "sched.h" #include <linux/proc_fs.h> @@ -141,7 +139,7 @@ autogroup_move_group(struct task_struct *p, struct 
autogroup *ag) p->signal->autogroup = autogroup_kref_get(ag); - if (!ACCESS_ONCE(sysctl_sched_autogroup_enabled)) + if (!READ_ONCE(sysctl_sched_autogroup_enabled)) goto out; for_each_thread(p, t) @@ -249,5 +247,3 @@ int autogroup_path(struct task_group *tg, char *buf, int buflen) return snprintf(buf, buflen, "%s-%ld", "/autogroup", tg->autogroup->id); } #endif /* CONFIG_SCHED_DEBUG */ - -#endif /* CONFIG_SCHED_AUTOGROUP */ diff --git a/kernel/sched/auto_group.h b/kernel/sched/auto_group.h index 8bd0471..890c95f 100644 --- a/kernel/sched/auto_group.h +++ b/kernel/sched/auto_group.h @@ -29,7 +29,7 @@ extern bool task_wants_autogroup(struct task_struct *p, struct task_group *tg); static inline struct task_group * autogroup_task_group(struct task_struct *p, struct task_group *tg) { - int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled); + int enabled = READ_ONCE(sysctl_sched_autogroup_enabled); if (enabled && task_wants_autogroup(p, tg)) return p->signal->autogroup->tg; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index db9b10a..f89ca9b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -511,7 +511,7 @@ static bool set_nr_and_not_polling(struct task_struct *p) static bool set_nr_if_polling(struct task_struct *p) { struct thread_info *ti = task_thread_info(p); - typeof(ti->flags) old, val = ACCESS_ONCE(ti->flags); + typeof(ti->flags) old, val = READ_ONCE(ti->flags); for (;;) { if (!(val & _TIF_POLLING_NRFLAG)) @@ -541,6 +541,52 @@ static bool set_nr_if_polling(struct task_struct *p) #endif #endif +void wake_q_add(struct wake_q_head *head, struct task_struct *task) +{ + struct wake_q_node *node = &task->wake_q; + + /* + * Atomically grab the task, if ->wake_q is !nil already it means + * its already queued (either by us or someone else) and will get the + * wakeup due to that. + * + * This cmpxchg() implies a full barrier, which pairs with the write + * barrier implied by the wakeup in wake_up_list(). 
+ */ + if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL)) + return; + + get_task_struct(task); + + /* + * The head is context local, there can be no concurrency. + */ + *head->lastp = node; + head->lastp = &node->next; +} + +void wake_up_q(struct wake_q_head *head) +{ + struct wake_q_node *node = head->first; + + while (node != WAKE_Q_TAIL) { + struct task_struct *task; + + task = container_of(node, struct task_struct, wake_q); + BUG_ON(!task); + /* task can safely be re-inserted now */ + node = node->next; + task->wake_q.next = NULL; + + /* + * wake_up_process() implies a wmb() to pair with the queueing + * in wake_q_add() so as not to miss wakeups. + */ + wake_up_process(task); + put_task_struct(task); + } +} + /* * resched_curr - mark rq's current task 'to be rescheduled now'. * @@ -2105,12 +2151,15 @@ void wake_up_new_task(struct task_struct *p) #ifdef CONFIG_PREEMPT_NOTIFIERS +static struct static_key preempt_notifier_key = STATIC_KEY_INIT_FALSE; + /** * preempt_notifier_register - tell me when current is being preempted & rescheduled * @notifier: notifier struct to register */ void preempt_notifier_register(struct preempt_notifier *notifier) { + static_key_slow_inc(&preempt_notifier_key); hlist_add_head(&notifier->link, &current->preempt_notifiers); } EXPORT_SYMBOL_GPL(preempt_notifier_register); @@ -2119,15 +2168,16 @@ EXPORT_SYMBOL_GPL(preempt_notifier_register); * preempt_notifier_unregister - no longer interested in preemption notifications * @notifier: notifier struct to unregister * - * This is safe to call from within a preemption notifier. + * This is *not* safe to call from within a preemption notifier. 
 */ void preempt_notifier_unregister(struct preempt_notifier *notifier) { hlist_del(&notifier->link); + static_key_slow_dec(&preempt_notifier_key); } EXPORT_SYMBOL_GPL(preempt_notifier_unregister); -static void fire_sched_in_preempt_notifiers(struct task_struct *curr) +static void __fire_sched_in_preempt_notifiers(struct task_struct *curr) { struct preempt_notifier *notifier; @@ -2135,9 +2185,15 @@ static void fire_sched_in_preempt_notifiers(struct task_struct *curr) notifier->ops->sched_in(notifier, raw_smp_processor_id()); } +static __always_inline void fire_sched_in_preempt_notifiers(struct task_struct *curr) +{ + if (static_key_false(&preempt_notifier_key)) + __fire_sched_in_preempt_notifiers(curr); +} + static void -fire_sched_out_preempt_notifiers(struct task_struct *curr, - struct task_struct *next) +__fire_sched_out_preempt_notifiers(struct task_struct *curr, + struct task_struct *next) { struct preempt_notifier *notifier; @@ -2145,13 +2201,21 @@ fire_sched_out_preempt_notifiers(struct task_struct *curr, notifier->ops->sched_out(notifier, next); } +static __always_inline void +fire_sched_out_preempt_notifiers(struct task_struct *curr, + struct task_struct *next) +{ + if (static_key_false(&preempt_notifier_key)) + __fire_sched_out_preempt_notifiers(curr, next); +} + #else /* !CONFIG_PREEMPT_NOTIFIERS */ -static void fire_sched_in_preempt_notifiers(struct task_struct *curr) +static inline void fire_sched_in_preempt_notifiers(struct task_struct *curr) { } -static void +static inline void fire_sched_out_preempt_notifiers(struct task_struct *curr, struct task_struct *next) { @@ -2397,9 +2461,9 @@ unsigned long nr_iowait_cpu(int cpu) void get_iowait_load(unsigned long *nr_waiters, unsigned long *load) { - struct rq *this = this_rq(); - *nr_waiters = atomic_read(&this->nr_iowait); - *load = this->cpu_load[0]; + struct rq *rq = this_rq(); + *nr_waiters = atomic_read(&rq->nr_iowait); + *load = rq->load.weight; } #ifdef CONFIG_SMP @@ -2497,6 +2561,7 @@ void 
scheduler_tick(void) update_rq_clock(rq); curr->sched_class->task_tick(rq, curr, 0); update_cpu_load_active(rq); + calc_global_load_tick(rq); raw_spin_unlock(&rq->lock); perf_event_task_tick(); @@ -2525,7 +2590,7 @@ void scheduler_tick(void) u64 scheduler_tick_max_deferment(void) { struct rq *rq = this_rq(); - unsigned long next, now = ACCESS_ONCE(jiffies); + unsigned long next, now = READ_ONCE(jiffies); next = rq->last_sched_tick + HZ; @@ -2726,9 +2791,7 @@ again: * - return from syscall or exception to user-space * - return from interrupt-handler to user-space * - * WARNING: all callers must re-check need_resched() afterward and reschedule - * accordingly in case an event triggered the need for rescheduling (such as - * an interrupt waking up a task) while preemption was disabled in __schedule(). + * WARNING: must be called with preemption disabled! */ static void __sched __schedule(void) { @@ -2737,7 +2800,6 @@ static void __sched __schedule(void) struct rq *rq; int cpu; - preempt_disable(); cpu = smp_processor_id(); rq = cpu_rq(cpu); rcu_note_context_switch(); @@ -2801,8 +2863,6 @@ static void __sched __schedule(void) raw_spin_unlock_irq(&rq->lock); post_schedule(rq); - - sched_preempt_enable_no_resched(); } static inline void sched_submit_work(struct task_struct *tsk) @@ -2823,7 +2883,9 @@ asmlinkage __visible void __sched schedule(void) sched_submit_work(tsk); do { + preempt_disable(); __schedule(); + sched_preempt_enable_no_resched(); } while (need_resched()); } EXPORT_SYMBOL(schedule); @@ -2862,15 +2924,14 @@ void __sched schedule_preempt_disabled(void) static void __sched notrace preempt_schedule_common(void) { do { - __preempt_count_add(PREEMPT_ACTIVE); + preempt_active_enter(); __schedule(); - __preempt_count_sub(PREEMPT_ACTIVE); + preempt_active_exit(); /* * Check again in case we missed a preemption opportunity * between schedule and now. 
*/ - barrier(); } while (need_resched()); } @@ -2894,9 +2955,8 @@ asmlinkage __visible void __sched notrace preempt_schedule(void) NOKPROBE_SYMBOL(preempt_schedule); EXPORT_SYMBOL(preempt_schedule); -#ifdef CONFIG_CONTEXT_TRACKING /** - * preempt_schedule_context - preempt_schedule called by tracing + * preempt_schedule_notrace - preempt_schedule called by tracing * * The tracing infrastructure uses preempt_enable_notrace to prevent * recursion and tracing preempt enabling caused by the tracing @@ -2909,7 +2969,7 @@ EXPORT_SYMBOL(preempt_schedule); * instead of preempt_schedule() to exit user context if needed before * calling the scheduler. */ -asmlinkage __visible void __sched notrace preempt_schedule_context(void) +asmlinkage __visible void __sched notrace preempt_schedule_notrace(void) { enum ctx_state prev_ctx; @@ -2917,7 +2977,13 @@ asmlinkage __visible void __sched notrace preempt_schedule_context(void) return; do { - __preempt_count_add(PREEMPT_ACTIVE); + /* + * Use raw __prempt_count() ops that don't call function. + * We can't call functions before disabling preemption which + * disarm preemption tracing recursions. 
+ */ + __preempt_count_add(PREEMPT_ACTIVE + PREEMPT_DISABLE_OFFSET); + barrier(); /* * Needs preempt disabled in case user_exit() is traced * and the tracer calls preempt_enable_notrace() causing @@ -2927,12 +2993,11 @@ asmlinkage __visible void __sched notrace preempt_schedule_context(void) __schedule(); exception_exit(prev_ctx); - __preempt_count_sub(PREEMPT_ACTIVE); barrier(); + __preempt_count_sub(PREEMPT_ACTIVE + PREEMPT_DISABLE_OFFSET); } while (need_resched()); } -EXPORT_SYMBOL_GPL(preempt_schedule_context); -#endif /* CONFIG_CONTEXT_TRACKING */ +EXPORT_SYMBOL_GPL(preempt_schedule_notrace); #endif /* CONFIG_PREEMPT */ @@ -2952,17 +3017,11 @@ asmlinkage __visible void __sched preempt_schedule_irq(void) prev_state = exception_enter(); do { - __preempt_count_add(PREEMPT_ACTIVE); + preempt_active_enter(); local_irq_enable(); __schedule(); local_irq_disable(); - __preempt_count_sub(PREEMPT_ACTIVE); - - /* - * Check again in case we missed a preemption opportunity - * between schedule and now. 
- */ - barrier(); + preempt_active_exit(); } while (need_resched()); exception_exit(prev_state); @@ -3040,7 +3099,6 @@ void rt_mutex_setprio(struct task_struct *p, int prio) if (!dl_prio(p->normal_prio) || (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) { p->dl.dl_boosted = 1; - p->dl.dl_throttled = 0; enqueue_flag = ENQUEUE_REPLENISH; } else p->dl.dl_boosted = 0; @@ -5314,7 +5372,7 @@ static struct notifier_block migration_notifier = { .priority = CPU_PRI_MIGRATION, }; -static void __cpuinit set_cpu_rq_start_time(void) +static void set_cpu_rq_start_time(void) { int cpu = smp_processor_id(); struct rq *rq = cpu_rq(cpu); @@ -7734,11 +7792,11 @@ static long sched_group_rt_runtime(struct task_group *tg) return rt_runtime_us; } -static int sched_group_set_rt_period(struct task_group *tg, long rt_period_us) +static int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us) { u64 rt_runtime, rt_period; - rt_period = (u64)rt_period_us * NSEC_PER_USEC; + rt_period = rt_period_us * NSEC_PER_USEC; rt_runtime = tg->rt_bandwidth.rt_runtime; return tg_set_rt_bandwidth(tg, rt_period, rt_runtime); diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 8394b1e..f5a64ff 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -567,7 +567,7 @@ static void cputime_advance(cputime_t *counter, cputime_t new) { cputime_t old; - while (new > (old = ACCESS_ONCE(*counter))) + while (new > (old = READ_ONCE(*counter))) cmpxchg_cputime(counter, old, new); } diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 5e95145..392e8fb9 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -640,7 +640,7 @@ void init_dl_task_timer(struct sched_dl_entity *dl_se) } static -int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se) +int dl_runtime_exceeded(struct sched_dl_entity *dl_se) { return (dl_se->runtime <= 0); } @@ -684,7 +684,7 @@ static void update_curr_dl(struct rq *rq) sched_rt_avg_update(rq, delta_exec); 
dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec; - if (dl_runtime_exceeded(rq, dl_se)) { + if (dl_runtime_exceeded(dl_se)) { dl_se->dl_throttled = 1; __dequeue_task_dl(rq, curr, 0); if (unlikely(!start_dl_timer(dl_se, curr->dl.dl_boosted))) @@ -995,7 +995,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags) rq = cpu_rq(cpu); rcu_read_lock(); - curr = ACCESS_ONCE(rq->curr); /* unlocked access */ + curr = READ_ONCE(rq->curr); /* unlocked access */ /* * If we are dealing with a -deadline task, we must @@ -1012,7 +1012,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags) (p->nr_cpus_allowed > 1)) { int target = find_later_rq(p); - if (target != -1) + if (target != -1 && + dl_time_before(p->dl.deadline, + cpu_rq(target)->dl.earliest_dl.curr)) cpu = target; } rcu_read_unlock(); @@ -1230,6 +1232,32 @@ next_node: return NULL; } +/* + * Return the earliest pushable rq's task, which is suitable to be executed + * on the CPU, NULL otherwise: + */ +static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu) +{ + struct rb_node *next_node = rq->dl.pushable_dl_tasks_leftmost; + struct task_struct *p = NULL; + + if (!has_pushable_dl_tasks(rq)) + return NULL; + +next_node: + if (next_node) { + p = rb_entry(next_node, struct task_struct, pushable_dl_tasks); + + if (pick_dl_task(rq, p, cpu)) + return p; + + next_node = rb_next(next_node); + goto next_node; + } + + return NULL; +} + static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl); static int find_later_rq(struct task_struct *task) @@ -1333,6 +1361,17 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq) later_rq = cpu_rq(cpu); + if (!dl_time_before(task->dl.deadline, + later_rq->dl.earliest_dl.curr)) { + /* + * Target rq has tasks of equal or earlier deadline, + * retrying does not release any lock and is unlikely + * to yield a different result. + */ + later_rq = NULL; + break; + } + /* Retry if something changed. 
*/ if (double_lock_balance(rq, later_rq)) { if (unlikely(task_rq(task) != rq || @@ -1514,7 +1553,7 @@ static int pull_dl_task(struct rq *this_rq) if (src_rq->dl.dl_nr_running <= 1) goto skip; - p = pick_next_earliest_dl_task(src_rq, this_cpu); + p = pick_earliest_pushable_dl_task(src_rq, this_cpu); /* * We found a task to be pulled if: @@ -1659,7 +1698,7 @@ static void rq_offline_dl(struct rq *rq) cpudl_clear_freecpu(&rq->rd->cpudl, rq->cpu); } -void init_sched_dl_class(void) +void __init init_sched_dl_class(void) { unsigned int i; diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index a245c1f..704683c 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -132,12 +132,14 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p) p->prio); #ifdef CONFIG_SCHEDSTATS SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld", - SPLIT_NS(p->se.vruntime), + SPLIT_NS(p->se.statistics.wait_sum), SPLIT_NS(p->se.sum_exec_runtime), SPLIT_NS(p->se.statistics.sum_sleep_runtime)); #else - SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld", - 0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L); + SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld", + 0LL, 0L, + SPLIT_NS(p->se.sum_exec_runtime), + 0LL, 0L); #endif #ifdef CONFIG_NUMA_BALANCING SEQ_printf(m, " %d", task_node(p)); @@ -156,7 +158,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu) SEQ_printf(m, "\nrunnable tasks:\n" " task PID tree-key switches prio" - " exec-runtime sum-exec sum-sleep\n" + " wait-time sum-exec sum-sleep\n" "------------------------------------------------------" "----------------------------------------------------\n"); @@ -582,6 +584,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m) nr_switches = p->nvcsw + p->nivcsw; #ifdef CONFIG_SCHEDSTATS + PN(se.statistics.sum_sleep_runtime); PN(se.statistics.wait_start); PN(se.statistics.sleep_start); PN(se.statistics.block_start); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 
c2980e8..433061d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -141,9 +141,9 @@ static inline void update_load_set(struct load_weight *lw, unsigned long w) * * This idea comes from the SD scheduler of Con Kolivas: */ -static int get_update_sysctl_factor(void) +static unsigned int get_update_sysctl_factor(void) { - unsigned int cpus = min_t(int, num_online_cpus(), 8); + unsigned int cpus = min_t(unsigned int, num_online_cpus(), 8); unsigned int factor; switch (sysctl_sched_tunable_scaling) { @@ -576,7 +576,7 @@ int sched_proc_update_handler(struct ctl_table *table, int write, loff_t *ppos) { int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); - int factor = get_update_sysctl_factor(); + unsigned int factor = get_update_sysctl_factor(); if (ret || !write) return ret; @@ -834,7 +834,7 @@ static unsigned int task_nr_scan_windows(struct task_struct *p) static unsigned int task_scan_min(struct task_struct *p) { - unsigned int scan_size = ACCESS_ONCE(sysctl_numa_balancing_scan_size); + unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size); unsigned int scan, floor; unsigned int windows = 1; @@ -1198,11 +1198,9 @@ static void task_numa_assign(struct task_numa_env *env, static bool load_too_imbalanced(long src_load, long dst_load, struct task_numa_env *env) { + long imb, old_imb; + long orig_src_load, orig_dst_load; long src_capacity, dst_capacity; - long orig_src_load; - long load_a, load_b; - long moved_load; - long imb; /* * The load is corrected for the CPU capacity available on each node. @@ -1215,39 +1213,30 @@ static bool load_too_imbalanced(long src_load, long dst_load, dst_capacity = env->dst_stats.compute_capacity; /* We care about the slope of the imbalance, not the direction. */ - load_a = dst_load; - load_b = src_load; - if (load_a < load_b) - swap(load_a, load_b); + if (dst_load < src_load) + swap(dst_load, src_load); /* Is the difference below the threshold? 
*/ - imb = load_a * src_capacity * 100 - - load_b * dst_capacity * env->imbalance_pct; + imb = dst_load * src_capacity * 100 - + src_load * dst_capacity * env->imbalance_pct; if (imb <= 0) return false; /* * The imbalance is above the allowed threshold. - * Allow a move that brings us closer to a balanced situation, - * without moving things past the point of balance. + * Compare it with the old imbalance. */ orig_src_load = env->src_stats.load; + orig_dst_load = env->dst_stats.load; - /* - * In a task swap, there will be one load moving from src to dst, - * and another moving back. This is the net sum of both moves. - * A simple task move will always have a positive value. - * Allow the move if it brings the system closer to a balanced - * situation, without crossing over the balance point. - */ - moved_load = orig_src_load - src_load; + if (orig_dst_load < orig_src_load) + swap(orig_dst_load, orig_src_load); - if (moved_load > 0) - /* Moving src -> dst. Did we overshoot balance? */ - return src_load * dst_capacity < dst_load * src_capacity; - else - /* Moving dst -> src. Did we overshoot balance? */ - return dst_load * src_capacity < src_load * dst_capacity; + old_imb = orig_dst_load * src_capacity * 100 - + orig_src_load * dst_capacity * env->imbalance_pct; + + /* Would this change make things worse? */ + return (imb > old_imb); } /* @@ -1409,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env, } } +/* Only move tasks to a NUMA node less busy than the current node. */ +static bool numa_has_capacity(struct task_numa_env *env) +{ + struct numa_stats *src = &env->src_stats; + struct numa_stats *dst = &env->dst_stats; + + if (src->has_free_capacity && !dst->has_free_capacity) + return false; + + /* + * Only consider a task move if the source has a higher load + * than the destination, corrected for CPU capacity on each node. 
+ * + * src->load dst->load + * --------------------- vs --------------------- + * src->compute_capacity dst->compute_capacity + */ + if (src->load * dst->compute_capacity > + dst->load * src->compute_capacity) + return true; + + return false; +} + static int task_numa_migrate(struct task_struct *p) { struct task_numa_env env = { @@ -1463,7 +1476,8 @@ static int task_numa_migrate(struct task_struct *p) update_numa_stats(&env.dst_stats, env.dst_nid); /* Try to find a spot on the preferred nid. */ - task_numa_find_cpu(&env, taskimp, groupimp); + if (numa_has_capacity(&env)) + task_numa_find_cpu(&env, taskimp, groupimp); /* * Look at other nodes in these cases: @@ -1494,7 +1508,8 @@ static int task_numa_migrate(struct task_struct *p) env.dist = dist; env.dst_nid = nid; update_numa_stats(&env.dst_stats, env.dst_nid); - task_numa_find_cpu(&env, taskimp, groupimp); + if (numa_has_capacity(&env)) + task_numa_find_cpu(&env, taskimp, groupimp); } } @@ -1794,7 +1809,12 @@ static void task_numa_placement(struct task_struct *p) u64 runtime, period; spinlock_t *group_lock = NULL; - seq = ACCESS_ONCE(p->mm->numa_scan_seq); + /* + * The p->mm->numa_scan_seq field gets updated without + * exclusive access. 
Use READ_ONCE() here to ensure + * that the field is read in a single access: + */ + seq = READ_ONCE(p->mm->numa_scan_seq); if (p->numa_scan_seq == seq) return; p->numa_scan_seq = seq; @@ -1938,7 +1958,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, } rcu_read_lock(); - tsk = ACCESS_ONCE(cpu_rq(cpu)->curr); + tsk = READ_ONCE(cpu_rq(cpu)->curr); if (!cpupid_match_pid(tsk, cpupid)) goto no_join; @@ -2107,7 +2127,15 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) static void reset_ptenuma_scan(struct task_struct *p) { - ACCESS_ONCE(p->mm->numa_scan_seq)++; + /* + * We only did a read acquisition of the mmap sem, so + * p->mm->numa_scan_seq is written to without exclusive access + * and the update is not guaranteed to be atomic. That's not + * much of an issue though, since this is just used for + * statistical sampling. Use READ_ONCE/WRITE_ONCE, which are not + * expensive, to avoid any form of compiler optimizations: + */ + WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1); p->mm->numa_scan_offset = 0; } @@ -4323,6 +4351,189 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) } #ifdef CONFIG_SMP + +/* + * per rq 'load' arrray crap; XXX kill this. + */ + +/* + * The exact cpuload at various idx values, calculated at every tick would be + * load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load + * + * If a cpu misses updates for n-1 ticks (as it was idle) and update gets called + * on nth tick when cpu may be busy, then we have: + * load = ((2^idx - 1) / 2^idx)^(n-1) * load + * load = (2^idx - 1) / 2^idx) * load + 1 / 2^idx * cur_load + * + * decay_load_missed() below does efficient calculation of + * load = ((2^idx - 1) / 2^idx)^(n-1) * load + * avoiding 0..n-1 loop doing load = ((2^idx - 1) / 2^idx) * load + * + * The calculation is approximated on a 128 point scale. 
+ * degrade_zero_ticks is the number of ticks after which load at any + * particular idx is approximated to be zero. + * degrade_factor is a precomputed table, a row for each load idx. + * Each column corresponds to degradation factor for a power of two ticks, + * based on 128 point scale. + * Example: + * row 2, col 3 (=12) says that the degradation at load idx 2 after + * 8 ticks is 12/128 (which is an approximation of exact factor 3^8/4^8). + * + * With this power of 2 load factors, we can degrade the load n times + * by looking at 1 bits in n and doing as many mult/shift instead of + * n mult/shifts needed by the exact degradation. + */ +#define DEGRADE_SHIFT 7 +static const unsigned char + degrade_zero_ticks[CPU_LOAD_IDX_MAX] = {0, 8, 32, 64, 128}; +static const unsigned char + degrade_factor[CPU_LOAD_IDX_MAX][DEGRADE_SHIFT + 1] = { + {0, 0, 0, 0, 0, 0, 0, 0}, + {64, 32, 8, 0, 0, 0, 0, 0}, + {96, 72, 40, 12, 1, 0, 0}, + {112, 98, 75, 43, 15, 1, 0}, + {120, 112, 98, 76, 45, 16, 2} }; + +/* + * Update cpu_load for any missed ticks, due to tickless idle. The backlog + * would be when CPU is idle and so we just decay the old load without + * adding any new load. + */ +static unsigned long +decay_load_missed(unsigned long load, unsigned long missed_updates, int idx) +{ + int j = 0; + + if (!missed_updates) + return load; + + if (missed_updates >= degrade_zero_ticks[idx]) + return 0; + + if (idx == 1) + return load >> missed_updates; + + while (missed_updates) { + if (missed_updates % 2) + load = (load * degrade_factor[idx][j]) >> DEGRADE_SHIFT; + + missed_updates >>= 1; + j++; + } + return load; +} + +/* + * Update rq->cpu_load[] statistics. This function is usually called every + * scheduler tick (TICK_NSEC). With tickless idle this will not be called + * every tick. We fix it up based on jiffies. 
+ */ +static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, + unsigned long pending_updates) +{ + int i, scale; + + this_rq->nr_load_updates++; + + /* Update our load: */ + this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */ + for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) { + unsigned long old_load, new_load; + + /* scale is effectively 1 << i now, and >> i divides by scale */ + + old_load = this_rq->cpu_load[i]; + old_load = decay_load_missed(old_load, pending_updates - 1, i); + new_load = this_load; + /* + * Round up the averaging division if load is increasing. This + * prevents us from getting stuck on 9 if the load is 10, for + * example. + */ + if (new_load > old_load) + new_load += scale - 1; + + this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i; + } + + sched_avg_update(this_rq); +} + +#ifdef CONFIG_NO_HZ_COMMON +/* + * There is no sane way to deal with nohz on smp when using jiffies because the + * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading + * causing off-by-one errors in observed deltas; {0,2} instead of {1,1}. + * + * Therefore we cannot use the delta approach from the regular tick since that + * would seriously skew the load calculation. However we'll make do for those + * updates happening while idle (nohz_idle_balance) or coming out of idle + * (tick_nohz_idle_exit). + * + * This means we might still be one tick off for nohz periods. + */ + +/* + * Called from nohz_idle_balance() to update the load ratings before doing the + * idle balance. + */ +static void update_idle_cpu_load(struct rq *this_rq) +{ + unsigned long curr_jiffies = READ_ONCE(jiffies); + unsigned long load = this_rq->cfs.runnable_load_avg; + unsigned long pending_updates; + + /* + * bail if there's load or we're actually up-to-date. 
+ */ + if (load || curr_jiffies == this_rq->last_load_update_tick) + return; + + pending_updates = curr_jiffies - this_rq->last_load_update_tick; + this_rq->last_load_update_tick = curr_jiffies; + + __update_cpu_load(this_rq, load, pending_updates); +} + +/* + * Called from tick_nohz_idle_exit() -- try and fix up the ticks we missed. + */ +void update_cpu_load_nohz(void) +{ + struct rq *this_rq = this_rq(); + unsigned long curr_jiffies = READ_ONCE(jiffies); + unsigned long pending_updates; + + if (curr_jiffies == this_rq->last_load_update_tick) + return; + + raw_spin_lock(&this_rq->lock); + pending_updates = curr_jiffies - this_rq->last_load_update_tick; + if (pending_updates) { + this_rq->last_load_update_tick = curr_jiffies; + /* + * We were idle, this means load 0, the current load might be + * !0 due to remote wakeups and the sort. + */ + __update_cpu_load(this_rq, 0, pending_updates); + } + raw_spin_unlock(&this_rq->lock); +} +#endif /* CONFIG_NO_HZ */ + +/* + * Called from scheduler_tick() + */ +void update_cpu_load_active(struct rq *this_rq) +{ + unsigned long load = this_rq->cfs.runnable_load_avg; + /* + * See the mess around update_idle_cpu_load() / update_cpu_load_nohz(). + */ + this_rq->last_load_update_tick = jiffies; + __update_cpu_load(this_rq, load, 1); +} + /* Used instead of source_load when we know the type == 0 */ static unsigned long weighted_cpuload(const int cpu) { @@ -4375,7 +4586,7 @@ static unsigned long capacity_orig_of(int cpu) static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu); - unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running); + unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running); unsigned long load_avg = rq->cfs.runnable_load_avg; if (nr_running) @@ -5126,18 +5337,21 @@ again: * entity, update_curr() will update its vruntime, otherwise * forget we've ever seen it. 
*/ - if (curr && curr->on_rq) - update_curr(cfs_rq); - else - curr = NULL; + if (curr) { + if (curr->on_rq) + update_curr(cfs_rq); + else + curr = NULL; - /* - * This call to check_cfs_rq_runtime() will do the throttle and - * dequeue its entity in the parent(s). Therefore the 'simple' - * nr_running test will indeed be correct. - */ - if (unlikely(check_cfs_rq_runtime(cfs_rq))) - goto simple; + /* + * This call to check_cfs_rq_runtime() will do the + * throttle and dequeue its entity in the parent(s). + * Therefore the 'simple' nr_running test will indeed + * be correct. + */ + if (unlikely(check_cfs_rq_runtime(cfs_rq))) + goto simple; + } se = pick_next_entity(cfs_rq, curr); cfs_rq = group_cfs_rq(se); @@ -5467,10 +5681,15 @@ static int task_hot(struct task_struct *p, struct lb_env *env) } #ifdef CONFIG_NUMA_BALANCING -/* Returns true if the destination node has incurred more faults */ +/* + * Returns true if the destination node is the preferred node. + * Needs to match fbq_classify_rq(): if there is a runnable task + * that is not on its preferred node, we should identify it. + */ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env) { struct numa_group *numa_group = rcu_dereference(p->numa_group); + unsigned long src_faults, dst_faults; int src_nid, dst_nid; if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults || @@ -5484,29 +5703,30 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env) if (src_nid == dst_nid) return false; - if (numa_group) { - /* Task is already in the group's interleave set. */ - if (node_isset(src_nid, numa_group->active_nodes)) - return false; - - /* Task is moving into the group's interleave set. */ - if (node_isset(dst_nid, numa_group->active_nodes)) - return true; - - return group_faults(p, dst_nid) > group_faults(p, src_nid); - } - /* Encourage migration to the preferred node. 
*/ if (dst_nid == p->numa_preferred_nid) return true; - return task_faults(p, dst_nid) > task_faults(p, src_nid); + /* Migrating away from the preferred node is bad. */ + if (src_nid == p->numa_preferred_nid) + return false; + + if (numa_group) { + src_faults = group_faults(p, src_nid); + dst_faults = group_faults(p, dst_nid); + } else { + src_faults = task_faults(p, src_nid); + dst_faults = task_faults(p, dst_nid); + } + + return dst_faults > src_faults; } static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env) { struct numa_group *numa_group = rcu_dereference(p->numa_group); + unsigned long src_faults, dst_faults; int src_nid, dst_nid; if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER)) @@ -5521,23 +5741,23 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env) if (src_nid == dst_nid) return false; - if (numa_group) { - /* Task is moving within/into the group's interleave set. */ - if (node_isset(dst_nid, numa_group->active_nodes)) - return false; + /* Migrating away from the preferred node is bad. */ + if (src_nid == p->numa_preferred_nid) + return true; - /* Task is moving out of the group's interleave set. */ - if (node_isset(src_nid, numa_group->active_nodes)) - return true; + /* Encourage migration to the preferred node. */ + if (dst_nid == p->numa_preferred_nid) + return false; - return group_faults(p, dst_nid) < group_faults(p, src_nid); + if (numa_group) { + src_faults = group_faults(p, src_nid); + dst_faults = group_faults(p, dst_nid); + } else { + src_faults = task_faults(p, src_nid); + dst_faults = task_faults(p, dst_nid); } - /* Migrating away from the preferred node is always bad. 
*/ - if (src_nid == p->numa_preferred_nid) - return true; - - return task_faults(p, dst_nid) < task_faults(p, src_nid); + return dst_faults < src_faults; } #else @@ -6037,8 +6257,8 @@ static unsigned long scale_rt_capacity(int cpu) * Since we're reading these variables without serialization make sure * we read them once before doing sanity checks on them. */ - age_stamp = ACCESS_ONCE(rq->age_stamp); - avg = ACCESS_ONCE(rq->rt_avg); + age_stamp = READ_ONCE(rq->age_stamp); + avg = READ_ONCE(rq->rt_avg); delta = __rq_clock_broken(rq) - age_stamp; if (unlikely(delta < 0)) diff --git a/kernel/sched/proc.c b/kernel/sched/loadavg.c index 8ecd552..ef71590 100644 --- a/kernel/sched/proc.c +++ b/kernel/sched/loadavg.c @@ -1,7 +1,9 @@ /* - * kernel/sched/proc.c + * kernel/sched/loadavg.c * - * Kernel load calculations, forked from sched/core.c + * This file contains the magic bits required to compute the global loadavg + * figure. Its a silly number but people think its important. We go through + * great pains to make it work on big machines and tickless kernels. 
*/ #include <linux/export.h> @@ -81,7 +83,7 @@ long calc_load_fold_active(struct rq *this_rq) long nr_active, delta = 0; nr_active = this_rq->nr_running; - nr_active += (long) this_rq->nr_uninterruptible; + nr_active += (long)this_rq->nr_uninterruptible; if (nr_active != this_rq->calc_load_active) { delta = nr_active - this_rq->calc_load_active; @@ -186,6 +188,7 @@ void calc_load_enter_idle(void) delta = calc_load_fold_active(this_rq); if (delta) { int idx = calc_load_write_idx(); + atomic_long_add(delta, &calc_load_idle[idx]); } } @@ -241,18 +244,20 @@ fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n) { unsigned long result = 1UL << frac_bits; - if (n) for (;;) { - if (n & 1) { - result *= x; - result += 1UL << (frac_bits - 1); - result >>= frac_bits; + if (n) { + for (;;) { + if (n & 1) { + result *= x; + result += 1UL << (frac_bits - 1); + result >>= frac_bits; + } + n >>= 1; + if (!n) + break; + x *= x; + x += 1UL << (frac_bits - 1); + x >>= frac_bits; } - n >>= 1; - if (!n) - break; - x *= x; - x += 1UL << (frac_bits - 1); - x >>= frac_bits; } return result; @@ -285,7 +290,6 @@ static unsigned long calc_load_n(unsigned long load, unsigned long exp, unsigned long active, unsigned int n) { - return calc_load(load, fixed_power_int(exp, FSHIFT, n), active); } @@ -339,6 +343,8 @@ static inline void calc_global_nohz(void) { } /* * calc_load - update the avenrun load estimates 10 ticks after the * CPUs have updated calc_load_tasks. + * + * Called from the global timer code. */ void calc_global_load(unsigned long ticks) { @@ -370,10 +376,10 @@ void calc_global_load(unsigned long ticks) } /* - * Called from update_cpu_load() to periodically update this CPU's + * Called from scheduler_tick() to periodically update this CPU's * active count. 
*/ -static void calc_load_account_active(struct rq *this_rq) +void calc_global_load_tick(struct rq *this_rq) { long delta; @@ -386,199 +392,3 @@ static void calc_load_account_active(struct rq *this_rq) this_rq->calc_load_update += LOAD_FREQ; } - -/* - * End of global load-average stuff - */ - -/* - * The exact cpuload at various idx values, calculated at every tick would be - * load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load - * - * If a cpu misses updates for n-1 ticks (as it was idle) and update gets called - * on nth tick when cpu may be busy, then we have: - * load = ((2^idx - 1) / 2^idx)^(n-1) * load - * load = (2^idx - 1) / 2^idx) * load + 1 / 2^idx * cur_load - * - * decay_load_missed() below does efficient calculation of - * load = ((2^idx - 1) / 2^idx)^(n-1) * load - * avoiding 0..n-1 loop doing load = ((2^idx - 1) / 2^idx) * load - * - * The calculation is approximated on a 128 point scale. - * degrade_zero_ticks is the number of ticks after which load at any - * particular idx is approximated to be zero. - * degrade_factor is a precomputed table, a row for each load idx. - * Each column corresponds to degradation factor for a power of two ticks, - * based on 128 point scale. - * Example: - * row 2, col 3 (=12) says that the degradation at load idx 2 after - * 8 ticks is 12/128 (which is an approximation of exact factor 3^8/4^8). - * - * With this power of 2 load factors, we can degrade the load n times - * by looking at 1 bits in n and doing as many mult/shift instead of - * n mult/shifts needed by the exact degradation. - */ -#define DEGRADE_SHIFT 7 -static const unsigned char - degrade_zero_ticks[CPU_LOAD_IDX_MAX] = {0, 8, 32, 64, 128}; -static const unsigned char - degrade_factor[CPU_LOAD_IDX_MAX][DEGRADE_SHIFT + 1] = { - {0, 0, 0, 0, 0, 0, 0, 0}, - {64, 32, 8, 0, 0, 0, 0, 0}, - {96, 72, 40, 12, 1, 0, 0}, - {112, 98, 75, 43, 15, 1, 0}, - {120, 112, 98, 76, 45, 16, 2} }; - -/* - * Update cpu_load for any missed ticks, due to tickless idle. 
The backlog - * would be when CPU is idle and so we just decay the old load without - * adding any new load. - */ -static unsigned long -decay_load_missed(unsigned long load, unsigned long missed_updates, int idx) -{ - int j = 0; - - if (!missed_updates) - return load; - - if (missed_updates >= degrade_zero_ticks[idx]) - return 0; - - if (idx == 1) - return load >> missed_updates; - - while (missed_updates) { - if (missed_updates % 2) - load = (load * degrade_factor[idx][j]) >> DEGRADE_SHIFT; - - missed_updates >>= 1; - j++; - } - return load; -} - -/* - * Update rq->cpu_load[] statistics. This function is usually called every - * scheduler tick (TICK_NSEC). With tickless idle this will not be called - * every tick. We fix it up based on jiffies. - */ -static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, - unsigned long pending_updates) -{ - int i, scale; - - this_rq->nr_load_updates++; - - /* Update our load: */ - this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */ - for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) { - unsigned long old_load, new_load; - - /* scale is effectively 1 << i now, and >> i divides by scale */ - - old_load = this_rq->cpu_load[i]; - old_load = decay_load_missed(old_load, pending_updates - 1, i); - new_load = this_load; - /* - * Round up the averaging division if load is increasing. This - * prevents us from getting stuck on 9 if the load is 10, for - * example. 
- */ - if (new_load > old_load) - new_load += scale - 1; - - this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i; - } - - sched_avg_update(this_rq); -} - -#ifdef CONFIG_SMP -static inline unsigned long get_rq_runnable_load(struct rq *rq) -{ - return rq->cfs.runnable_load_avg; -} -#else -static inline unsigned long get_rq_runnable_load(struct rq *rq) -{ - return rq->load.weight; -} -#endif - -#ifdef CONFIG_NO_HZ_COMMON -/* - * There is no sane way to deal with nohz on smp when using jiffies because the - * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading - * causing off-by-one errors in observed deltas; {0,2} instead of {1,1}. - * - * Therefore we cannot use the delta approach from the regular tick since that - * would seriously skew the load calculation. However we'll make do for those - * updates happening while idle (nohz_idle_balance) or coming out of idle - * (tick_nohz_idle_exit). - * - * This means we might still be one tick off for nohz periods. - */ - -/* - * Called from nohz_idle_balance() to update the load ratings before doing the - * idle balance. - */ -void update_idle_cpu_load(struct rq *this_rq) -{ - unsigned long curr_jiffies = ACCESS_ONCE(jiffies); - unsigned long load = get_rq_runnable_load(this_rq); - unsigned long pending_updates; - - /* - * bail if there's load or we're actually up-to-date. - */ - if (load || curr_jiffies == this_rq->last_load_update_tick) - return; - - pending_updates = curr_jiffies - this_rq->last_load_update_tick; - this_rq->last_load_update_tick = curr_jiffies; - - __update_cpu_load(this_rq, load, pending_updates); -} - -/* - * Called from tick_nohz_idle_exit() -- try and fix up the ticks we missed. 
- */ -void update_cpu_load_nohz(void) -{ - struct rq *this_rq = this_rq(); - unsigned long curr_jiffies = ACCESS_ONCE(jiffies); - unsigned long pending_updates; - - if (curr_jiffies == this_rq->last_load_update_tick) - return; - - raw_spin_lock(&this_rq->lock); - pending_updates = curr_jiffies - this_rq->last_load_update_tick; - if (pending_updates) { - this_rq->last_load_update_tick = curr_jiffies; - /* - * We were idle, this means load 0, the current load might be - * !0 due to remote wakeups and the sort. - */ - __update_cpu_load(this_rq, 0, pending_updates); - } - raw_spin_unlock(&this_rq->lock); -} -#endif /* CONFIG_NO_HZ */ - -/* - * Called from scheduler_tick() - */ -void update_cpu_load_active(struct rq *this_rq) -{ - unsigned long load = get_rq_runnable_load(this_rq); - /* - * See the mess around update_idle_cpu_load() / update_cpu_load_nohz(). - */ - this_rq->last_load_update_tick = jiffies; - __update_cpu_load(this_rq, load, 1); - - calc_load_account_active(this_rq); -} diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 575da76..560d2fa 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1323,7 +1323,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags) rq = cpu_rq(cpu); rcu_read_lock(); - curr = ACCESS_ONCE(rq->curr); /* unlocked access */ + curr = READ_ONCE(rq->curr); /* unlocked access */ /* * If the current task on @p's runqueue is an RT task, then diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index e0e1299..d62b288 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -26,8 +26,14 @@ extern __read_mostly int scheduler_running; extern unsigned long calc_load_update; extern atomic_long_t calc_load_tasks; +extern void calc_global_load_tick(struct rq *this_rq); extern long calc_load_fold_active(struct rq *this_rq); + +#ifdef CONFIG_SMP extern void update_cpu_load_active(struct rq *this_rq); +#else +static inline void update_cpu_load_active(struct rq *this_rq) { } +#endif /* * Helpers for 
converting nanosecond timing to jiffy resolution @@ -707,7 +713,7 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); static inline u64 __rq_clock_broken(struct rq *rq) { - return ACCESS_ONCE(rq->clock); + return READ_ONCE(rq->clock); } static inline u64 rq_clock(struct rq *rq) @@ -1284,7 +1290,6 @@ extern void update_max_interval(void); extern void init_sched_dl_class(void); extern void init_sched_rt_class(void); extern void init_sched_fair_class(void); -extern void init_sched_dl_class(void); extern void resched_curr(struct rq *rq); extern void resched_cpu(int cpu); @@ -1298,8 +1303,6 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se); unsigned long to_ratio(u64 period, u64 runtime); -extern void update_idle_cpu_load(struct rq *this_rq); - extern void init_task_runnable_average(struct task_struct *p); static inline void add_nr_running(struct rq *rq, unsigned count) diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index 4ab7043..077ebbd 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -174,7 +174,8 @@ static inline bool cputimer_running(struct task_struct *tsk) { struct thread_group_cputimer *cputimer = &tsk->signal->cputimer; - if (!cputimer->running) + /* Check if cputimer isn't running. This is accessed without locking. 
*/ + if (!READ_ONCE(cputimer->running)) return false; /* @@ -215,9 +216,7 @@ static inline void account_group_user_time(struct task_struct *tsk, if (!cputimer_running(tsk)) return; - raw_spin_lock(&cputimer->lock); - cputimer->cputime.utime += cputime; - raw_spin_unlock(&cputimer->lock); + atomic64_add(cputime, &cputimer->cputime_atomic.utime); } /** @@ -238,9 +237,7 @@ static inline void account_group_system_time(struct task_struct *tsk, if (!cputimer_running(tsk)) return; - raw_spin_lock(&cputimer->lock); - cputimer->cputime.stime += cputime; - raw_spin_unlock(&cputimer->lock); + atomic64_add(cputime, &cputimer->cputime_atomic.stime); } /** @@ -261,7 +258,5 @@ static inline void account_group_exec_runtime(struct task_struct *tsk, if (!cputimer_running(tsk)) return; - raw_spin_lock(&cputimer->lock); - cputimer->cputime.sum_exec_runtime += ns; - raw_spin_unlock(&cputimer->lock); + atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime); } diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c index 9bc8232..052e026 100644 --- a/kernel/sched/wait.c +++ b/kernel/sched/wait.c @@ -601,7 +601,7 @@ EXPORT_SYMBOL(bit_wait_io); __sched int bit_wait_timeout(struct wait_bit_key *word) { - unsigned long now = ACCESS_ONCE(jiffies); + unsigned long now = READ_ONCE(jiffies); if (signal_pending_state(current->state, current)) return 1; if (time_after_eq(now, word->timeout)) @@ -613,7 +613,7 @@ EXPORT_SYMBOL_GPL(bit_wait_timeout); __sched int bit_wait_io_timeout(struct wait_bit_key *word) { - unsigned long now = ACCESS_ONCE(jiffies); + unsigned long now = READ_ONCE(jiffies); if (signal_pending_state(current->state, current)) return 1; if (time_after_eq(now, word->timeout)) diff --git a/kernel/signal.c b/kernel/signal.c index d51c5dd..f19833b 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -245,7 +245,7 @@ static inline void print_dropped_signal(int sig) * RETURNS: * %true if @mask is set, %false if made noop because @task was dying. 
*/ -bool task_set_jobctl_pending(struct task_struct *task, unsigned int mask) +bool task_set_jobctl_pending(struct task_struct *task, unsigned long mask) { BUG_ON(mask & ~(JOBCTL_PENDING_MASK | JOBCTL_STOP_CONSUME | JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING)); @@ -297,7 +297,7 @@ void task_clear_jobctl_trapping(struct task_struct *task) * CONTEXT: * Must be called with @task->sighand->siglock held. */ -void task_clear_jobctl_pending(struct task_struct *task, unsigned int mask) +void task_clear_jobctl_pending(struct task_struct *task, unsigned long mask) { BUG_ON(mask & ~JOBCTL_PENDING_MASK); @@ -2000,7 +2000,7 @@ static bool do_signal_stop(int signr) struct signal_struct *sig = current->signal; if (!(current->jobctl & JOBCTL_STOP_PENDING)) { - unsigned int gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME; + unsigned long gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME; struct task_struct *t; /* signr will be recorded in task->jobctl for retries */ diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c index 695f0c6..fd643d8 100644 --- a/kernel/stop_machine.c +++ b/kernel/stop_machine.c @@ -211,25 +211,6 @@ static int multi_cpu_stop(void *data) return err; } -struct irq_cpu_stop_queue_work_info { - int cpu1; - int cpu2; - struct cpu_stop_work *work1; - struct cpu_stop_work *work2; -}; - -/* - * This function is always run with irqs and preemption disabled. - * This guarantees that both work1 and work2 get queued, before - * our local migrate thread gets the chance to preempt us. 
- */
-static void irq_cpu_stop_queue_work(void *arg)
-{
-	struct irq_cpu_stop_queue_work_info *info = arg;
-	cpu_stop_queue_work(info->cpu1, info->work1);
-	cpu_stop_queue_work(info->cpu2, info->work2);
-}
-
 /**
  * stop_two_cpus - stops two cpus
  * @cpu1: the cpu to stop
@@ -245,7 +226,6 @@ int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *
 {
 	struct cpu_stop_done done;
 	struct cpu_stop_work work1, work2;
-	struct irq_cpu_stop_queue_work_info call_args;
 	struct multi_stop_data msdata;
 
 	preempt_disable();
@@ -262,13 +242,6 @@ int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *
 		.done = &done
 	};
 
-	call_args = (struct irq_cpu_stop_queue_work_info){
-		.cpu1 = cpu1,
-		.cpu2 = cpu2,
-		.work1 = &work1,
-		.work2 = &work2,
-	};
-
 	cpu_stop_init_done(&done, 2);
 	set_state(&msdata, MULTI_STOP_PREPARE);
 
@@ -285,16 +258,11 @@ int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *
 		return -ENOENT;
 	}
 
-	lg_local_lock(&stop_cpus_lock);
-	/*
-	 * Queuing needs to be done by the lowest numbered CPU, to ensure
-	 * that works are always queued in the same order on every CPU.
-	 * This prevents deadlocks.
-	 */
-	smp_call_function_single(min(cpu1, cpu2),
-				 &irq_cpu_stop_queue_work,
-				 &call_args, 1);
-	lg_local_unlock(&stop_cpus_lock);
+	lg_double_lock(&stop_cpus_lock, cpu1, cpu2);
+	cpu_stop_queue_work(cpu1, &work1);
+	cpu_stop_queue_work(cpu2, &work2);
+	lg_double_unlock(&stop_cpus_lock, cpu1, cpu2);
+
 	preempt_enable();
 
 	wait_for_completion(&done.completion);
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 0075da7..892e3da 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -196,39 +196,62 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
 	return 0;
 }
 
-static void update_gt_cputime(struct task_cputime *a, struct task_cputime *b)
+/*
+ * Set cputime to sum_cputime if sum_cputime > cputime. Use cmpxchg
+ * to avoid race conditions with concurrent updates to cputime.
+ */
+static inline void __update_gt_cputime(atomic64_t *cputime, u64 sum_cputime)
 {
-	if (b->utime > a->utime)
-		a->utime = b->utime;
+	u64 curr_cputime;
+retry:
+	curr_cputime = atomic64_read(cputime);
+	if (sum_cputime > curr_cputime) {
+		if (atomic64_cmpxchg(cputime, curr_cputime, sum_cputime) != curr_cputime)
+			goto retry;
+	}
+}
 
-	if (b->stime > a->stime)
-		a->stime = b->stime;
+static void update_gt_cputime(struct task_cputime_atomic *cputime_atomic, struct task_cputime *sum)
+{
+	__update_gt_cputime(&cputime_atomic->utime, sum->utime);
+	__update_gt_cputime(&cputime_atomic->stime, sum->stime);
+	__update_gt_cputime(&cputime_atomic->sum_exec_runtime, sum->sum_exec_runtime);
+}
 
-	if (b->sum_exec_runtime > a->sum_exec_runtime)
-		a->sum_exec_runtime = b->sum_exec_runtime;
+/* Sample task_cputime_atomic values in "atomic_timers", store results in "times". */
+static inline void sample_cputime_atomic(struct task_cputime *times,
+					 struct task_cputime_atomic *atomic_times)
+{
+	times->utime = atomic64_read(&atomic_times->utime);
+	times->stime = atomic64_read(&atomic_times->stime);
+	times->sum_exec_runtime = atomic64_read(&atomic_times->sum_exec_runtime);
 }
 
 void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
 {
 	struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;
 	struct task_cputime sum;
-	unsigned long flags;
 
-	if (!cputimer->running) {
+	/* Check if cputimer isn't running. This is accessed without locking. */
+	if (!READ_ONCE(cputimer->running)) {
 		/*
 		 * The POSIX timer interface allows for absolute time expiry
 		 * values through the TIMER_ABSTIME flag, therefore we have
-		 * to synchronize the timer to the clock every time we start
-		 * it.
+		 * to synchronize the timer to the clock every time we start it.
 		 */
 		thread_group_cputime(tsk, &sum);
-		raw_spin_lock_irqsave(&cputimer->lock, flags);
-		cputimer->running = 1;
-		update_gt_cputime(&cputimer->cputime, &sum);
-	} else
-		raw_spin_lock_irqsave(&cputimer->lock, flags);
-	*times = cputimer->cputime;
-	raw_spin_unlock_irqrestore(&cputimer->lock, flags);
+		update_gt_cputime(&cputimer->cputime_atomic, &sum);
+
+		/*
+		 * We're setting cputimer->running without a lock. Ensure
+		 * this only gets written to in one operation. We set
+		 * running after update_gt_cputime() as a small optimization,
+		 * but barriers are not required because update_gt_cputime()
+		 * can handle concurrent updates.
+		 */
+		WRITE_ONCE(cputimer->running, 1);
+	}
+	sample_cputime_atomic(times, &cputimer->cputime_atomic);
 }
 
 /*
@@ -582,7 +605,8 @@ bool posix_cpu_timers_can_stop_tick(struct task_struct *tsk)
 	if (!task_cputime_zero(&tsk->cputime_expires))
 		return false;
 
-	if (tsk->signal->cputimer.running)
+	/* Check if cputimer is running. This is accessed without locking. */
+	if (READ_ONCE(tsk->signal->cputimer.running))
 		return false;
 
 	return true;
@@ -852,10 +876,10 @@ static void check_thread_timers(struct task_struct *tsk,
 	/*
 	 * Check for the special case thread timers.
 	 */
-	soft = ACCESS_ONCE(sig->rlim[RLIMIT_RTTIME].rlim_cur);
+	soft = READ_ONCE(sig->rlim[RLIMIT_RTTIME].rlim_cur);
 	if (soft != RLIM_INFINITY) {
 		unsigned long hard =
-			ACCESS_ONCE(sig->rlim[RLIMIT_RTTIME].rlim_max);
+			READ_ONCE(sig->rlim[RLIMIT_RTTIME].rlim_max);
 
 		if (hard != RLIM_INFINITY &&
 		    tsk->rt.timeout > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
@@ -882,14 +906,12 @@ static void check_thread_timers(struct task_struct *tsk,
 	}
 }
 
-static void stop_process_timers(struct signal_struct *sig)
+static inline void stop_process_timers(struct signal_struct *sig)
 {
 	struct thread_group_cputimer *cputimer = &sig->cputimer;
-	unsigned long flags;
 
-	raw_spin_lock_irqsave(&cputimer->lock, flags);
-	cputimer->running = 0;
-	raw_spin_unlock_irqrestore(&cputimer->lock, flags);
+	/* Turn off cputimer->running. This is done without locking. */
+	WRITE_ONCE(cputimer->running, 0);
 }
 
 static u32 onecputick;
@@ -958,11 +980,11 @@ static void check_process_timers(struct task_struct *tsk,
 			 SIGPROF);
 	check_cpu_itimer(tsk, &sig->it[CPUCLOCK_VIRT], &virt_expires, utime,
 			 SIGVTALRM);
-	soft = ACCESS_ONCE(sig->rlim[RLIMIT_CPU].rlim_cur);
+	soft = READ_ONCE(sig->rlim[RLIMIT_CPU].rlim_cur);
 	if (soft != RLIM_INFINITY) {
 		unsigned long psecs = cputime_to_secs(ptime);
 		unsigned long hard =
-			ACCESS_ONCE(sig->rlim[RLIMIT_CPU].rlim_max);
+			READ_ONCE(sig->rlim[RLIMIT_CPU].rlim_max);
 		cputime_t x;
 		if (psecs >= hard) {
 			/*
@@ -1111,12 +1133,11 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
 	}
 
 	sig = tsk->signal;
-	if (sig->cputimer.running) {
+	/* Check if cputimer is running. This is accessed without locking. */
+	if (READ_ONCE(sig->cputimer.running)) {
 		struct task_cputime group_sample;
 
-		raw_spin_lock(&sig->cputimer.lock);
-		group_sample = sig->cputimer.cputime;
-		raw_spin_unlock(&sig->cputimer.lock);
+		sample_cputime_atomic(&group_sample, &sig->cputimer.cputime_atomic);
 
 		if (task_cputime_expired(&group_sample, &sig->cputime_expires))
 			return 1;
@@ -1157,7 +1178,7 @@ void run_posix_cpu_timers(struct task_struct *tsk)
 	 * If there are any active process wide timers (POSIX 1.b, itimers,
 	 * RLIMIT_CPU) cputimer must be running.
 	 */
-	if (tsk->signal->cputimer.running)
+	if (READ_ONCE(tsk->signal->cputimer.running))
 		check_process_timers(tsk, &firing);
 
 	/*
diff --git a/lib/cpu_rmap.c b/lib/cpu_rmap.c
index 4f134d8..f610b2a 100644
--- a/lib/cpu_rmap.c
+++ b/lib/cpu_rmap.c
@@ -191,7 +191,7 @@ int cpu_rmap_update(struct cpu_rmap *rmap, u16 index,
 	/* Update distances based on topology */
 	for_each_cpu(cpu, update_mask) {
 		if (cpu_rmap_copy_neigh(rmap, cpu,
-					topology_thread_cpumask(cpu), 1))
+					topology_sibling_cpumask(cpu), 1))
 			continue;
 		if (cpu_rmap_copy_neigh(rmap, cpu,
 					topology_core_cpumask(cpu), 2))
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 3d2aa27..061550d 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -33,7 +33,7 @@
 #include <linux/string.h>
 #include <linux/bitops.h>
 #include <linux/rcupdate.h>
-#include <linux/preempt_mask.h>	/* in_interrupt() */
+#include <linux/preempt.h>		/* in_interrupt() */
 
 
 /*
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index fe9a325..3a5f2b3 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -85,7 +85,8 @@ static inline long do_strnlen_user(const char __user *src, unsigned long count,
  * @str: The string to measure.
  * @count: Maximum count (including NUL character)
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ *          enabled.
  *
  * Get the size of a NUL-terminated string in user space.
  *
@@ -121,7 +122,8 @@ EXPORT_SYMBOL(strnlen_user);
  * strlen_user: - Get the size of a user string INCLUDING final NUL.
  * @str: The string to measure.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ *          enabled.
  *
  * Get the size of a NUL-terminated string in user space.
  *
diff --git a/mm/memory.c b/mm/memory.c
index 22e037e..17734c3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3737,7 +3737,7 @@ void print_vma_addr(char *prefix, unsigned long ip)
 }
 
 #if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)
-void might_fault(void)
+void __might_fault(const char *file, int line)
 {
 	/*
 	 * Some code (nfs/sunrpc) uses socket ops on kernel memory while
@@ -3747,21 +3747,15 @@ void might_fault(void)
 	 */
 	if (segment_eq(get_fs(), KERNEL_DS))
 		return;
-
-	/*
-	 * it would be nicer only to annotate paths which are not under
-	 * pagefault_disable, however that requires a larger audit and
-	 * providing helpers like get_user_atomic.
-	 */
-	if (in_atomic())
+	if (pagefault_disabled())
 		return;
-
-	__might_sleep(__FILE__, __LINE__, 0);
-
+	__might_sleep(file, line, 0);
+#if defined(CONFIG_DEBUG_ATOMIC_SLEEP)
 	if (current->mm)
 		might_lock_read(&current->mm->mmap_sem);
+#endif
 }
-EXPORT_SYMBOL(might_fault);
+EXPORT_SYMBOL(__might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) |
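Note on the posix-cpu-timers.c change: __update_gt_cputime() replaces a spinlock-protected "store if greater" with a compare-and-swap retry loop. The following is only a userspace sketch of that pattern, written with C11 <stdatomic.h> rather than the kernel's atomic64_t API. The name update_max_u64() and its boolean return value are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Raise *counter to candidate if candidate is larger; never lower it.
 * Returns true if this call performed the store.
 *
 * Hypothetical userspace analogue of __update_gt_cputime(); the kernel
 * version uses atomic64_read()/atomic64_cmpxchg() and returns void.
 */
static bool update_max_u64(_Atomic uint64_t *counter, uint64_t candidate)
{
	uint64_t seen;

	assert(counter != NULL);
	seen = atomic_load(counter);

	while (candidate > seen) {
		/*
		 * If another thread changed *counter since we read it, the
		 * exchange fails and 'seen' is refreshed with the current
		 * value, so the loop re-checks the comparison and retries.
		 */
		if (atomic_compare_exchange_weak(counter, &seen, candidate))
			return true;
	}
	return false;
}
```

Because the stored value only ever grows, a racing writer can at worst make this call a no-op. That matches the patch comment that update_gt_cputime() "can handle concurrent updates", which is why the sketch needs no lock around the counter.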