path: root/kernel/rcu/tree.h

* rcu: Remove debugfs tracing (Paul E. McKenney, 2017-06-08; 1 file, -27/+0)

    RCU's debugfs tracing used to be the only reasonable low-level debug information available, but ftrace and event tracing have since surpassed the RCU debugfs level of usefulness. This commit therefore removes RCU's debugfs tracing.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Remove nohz_full full-system-idle state machine (Paul E. McKenney, 2017-06-08; 1 file, -16/+0)

    The NO_HZ_FULL_SYSIDLE full-system-idle capability was added in 2013 by commit 0edd1b1784cb ("nohz_full: Add full-system-idle state machine"), but has not been used. This commit therefore removes it. If it turns out to be needed later, this commit can always be reverted.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

* rcu: Move rnp->lock wrappers for SRCU use (Paul E. McKenney, 2017-06-08; 1 file, -53/+0)

    This commit moves the now-generic rnp->lock wrapper macros from kernel/rcu/tree.h to kernel/rcu/rcu.h, thus allowing SRCU to use them.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Convert rnp->lock wrappers to macros for SRCU use (Paul E. McKenney, 2017-06-08; 1 file, -26/+21)

    Use of smp_mb__after_unlock_lock() would allow SRCU to omit a full memory barrier during callback execution, so this commit converts the raw_spin_lock_rcu_node() family of wrappers from inline functions to type-generic macros, allowing them to handle locks in srcu_node structures as well as rcu_node structures.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

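    To see why macros help here: an inline function would have to name one specific node type, whereas a macro works for any structure that has a ->lock field. Roughly (simplified; the actual wrappers live in kernel/rcu/rcu.h):

        /*
         * Type-generic: p may point to an rcu_node or an srcu_node,
         * as long as the structure has a ->lock field.
         */
        #define raw_spin_lock_rcu_node(p)                               \
        do {                                                            \
                raw_spin_lock(&ACCESS_PRIVATE(p, lock));                \
                smp_mb__after_unlock_lock(); /* Full ordering. */       \
        } while (0)
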
* rcu: Use RCU_NOCB_WAKE rather than RCU_NOGP_WAKE (Paul E. McKenney, 2017-06-08; 1 file, -3/+3)

    The RCU_NOGP_WAKE_NOT, RCU_NOGP_WAKE, and RCU_NOGP_WAKE_FORCE flags are used to mediate wakeups for the no-CBs CPU kthreads. The "NOGP" really doesn't make any sense, so this commit does s/NOGP/NOCB/.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Complain if blocking in preemptible RCU read-side critical section (Paul E. McKenney, 2017-06-08; 1 file, -1/+1)

    Although preemptible RCU allows its read-side critical sections to be preempted, general blocking is forbidden. The reason for this is that excessive preemption times can be handled by CONFIG_RCU_BOOST=y, but a voluntarily blocked task doesn't care how high you boost its priority. Because preemptible RCU is a global mechanism, one ill-behaved reader hurts everyone. Hence the prohibition against general blocking in RCU-preempt read-side critical sections. Preemption yes, blocking no. This commit enforces this prohibition.

    There is a special exception for the -rt patchset (which they kindly volunteered to implement): It is OK to block (as opposed to merely being preempted) within an RCU-preempt read-side critical section, but only if the blocking is subject to priority inheritance. This exception permits CONFIG_RCU_BOOST=y to get -rt RCU readers out of trouble.

    Why doesn't this exception also apply to mainline's rt_mutex? Because of the possibility that someone does general blocking while holding an rt_mutex. Yes, the priority boosting will affect the rt_mutex, but it won't help with the task doing general blocking while holding that rt_mutex.

    Reported-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

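    To make the preemption/blocking distinction concrete, here is an illustrative reader (not from the patch) of the kind the new check complains about:

        rcu_read_lock();
        p = rcu_dereference(gp);
        do_something_with(p);   /* Involuntary preemption here is fine. */
        msleep(10);             /* BAD: voluntary blocking inside an
                                 * RCU-preempt read-side critical section;
                                 * priority boosting cannot help here. */
        rcu_read_unlock();
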
* srcu: Debloat the <linux/rcu_segcblist.h> header (Ingo Molnar, 2017-05-02; 1 file, -1/+2)

    Linus noticed that the <linux/rcu_segcblist.h> header has huge inline functions which should not be inline at all. As a first step in cleaning this up, move them all to kernel/rcu/ and only keep an absolute minimum of data type defines in the header:

    before:
        -rw-r--r-- 1 mingo mingo 22284 May  2 10:25 include/linux/rcu_segcblist.h
    after:
        -rw-r--r-- 1 mingo mingo  3180 May  2 10:22 include/linux/rcu_segcblist.h

    More can be done, such as uninlining the large functions, whose inlining is unjustified even if it's an RCU-internal matter.

    Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* srcu: Parallelize callback handling (Paul E. McKenney, 2017-04-21; 1 file, -0/+8)

    Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2]; however, there are workloads that could result in a high volume of concurrent invocations of call_srcu(), which with current SRCU would result in excessive lock contention on the srcu_struct structure's ->queue_lock, which protects SRCU's callback lists. This commit therefore moves SRCU to per-CPU callback lists, thus greatly reducing contention.

    Because a given SRCU instance no longer has a single centralized callback list, starting grace periods and invoking callbacks are both more complex than in the single-list Classic SRCU implementation. Starting grace periods and handling callbacks are now handled using an srcu_node tree that is in some ways similar to the rcu_node trees used by RCU-bh, RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is controlled by exactly the same Kconfig options and boot parameters that control the shape of the rcu_node tree).

    In addition, the old per-CPU srcu_array structure is now named srcu_data and contains an rcu_segcblist structure named ->srcu_cblist for its callbacks (and a spinlock to protect this). The srcu_struct gets an srcu_gp_seq that is used to associate callback segments with the corresponding completion-time grace-period number. These completion-time grace-period numbers are propagated up the srcu_node tree so that the grace-period workqueue handler can determine whether additional grace periods are needed on the one hand, and where to look for callbacks that are ready to be invoked on the other.

    The srcu_barrier() function must now wait on all instances of the per-CPU ->srcu_cblist. Because each ->srcu_cblist is protected by ->lock, srcu_barrier() can remotely add the needed callbacks. In theory, it could also remotely start grace periods, but in practice doing so is complex and racy. And interestingly enough, it is never necessary for srcu_barrier() to start a grace period, because srcu_barrier() only enqueues a callback when a callback is already present--and it turns out that a grace period has to have already been started for this pre-existing callback. Furthermore, it is only the callback that srcu_barrier() needs to wait on, not any particular grace period. Therefore, a new rcu_segcblist_entrain() function enqueues the srcu_barrier() function's callback into the same segment occupied by the last pre-existing callback in the list. The special case where all the pre-existing callbacks are on a different list (because they are in the process of being invoked) is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL segment, relying on the done-callbacks check that takes place after all callbacks are invoked.

    Note that the readers use the same algorithm as before. Note also that there is a separate srcu_idx that tells the readers what counter to increment. This unfortunately cannot be combined with srcu_gp_seq because they need to be incremented at different times.

    This commit introduces some ugly #ifdefs in rcutorture. These will go away when I feel good enough about Tree SRCU to ditch Classic SRCU.

    Some crude performance comparisons, courtesy of a quickly hacked rcuperf asynchronous-grace-period capability:

                Callback Queuing Overhead
                -------------------------
        # CPUS    Classic SRCU    Tree SRCU
        ------    ------------    ---------
             2    0.349 us        0.342 us
            16    31.66 us        0.4 us
            41    ---------       0.417 us

    The times are the 90th percentiles, a statistic that was chosen to reject the overheads of the occasional srcu_barrier() call needed to avoid OOMing the test machine. The rcuperf test hangs when running Classic SRCU at 41 CPUs, hence the line of dashes. Despite the hacks to both the rcuperf code and the statistics, this is a convincing demonstration of Tree SRCU's performance and scalability advantages.

    [1] https://lwn.net/Articles/309030/
    [2] https://patchwork.kernel.org/patch/5108281/

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    [ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]

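    A minimal sketch of the data-structure relationships described above (field names follow the commit message; the layout is simplified relative to the real code):

        /* Per-CPU SRCU state (formerly srcu_array, now srcu_data). */
        struct srcu_data {
                spinlock_t lock;                  /* Protects ->srcu_cblist. */
                struct rcu_segcblist srcu_cblist; /* Per-CPU callback list.  */
        };

        struct srcu_struct {
                struct srcu_node *node;         /* Combining tree (an array
                                                 * of nodes in the real code). */
                struct srcu_data __percpu *sda; /* Per-CPU state.             */
                unsigned long srcu_gp_seq;      /* Grace-period sequence no.  */
                int srcu_idx;                   /* Which counter readers use. */
        };
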
* srcu: Move rcu_node traversal macros to rcu.h (Paul E. McKenney, 2017-04-18; 1 file, -35/+0)

    This commit moves rcu_for_each_node_breadth_first(), rcu_for_each_nonleaf_node_breadth_first(), and rcu_for_each_leaf_node() from kernel/rcu/tree.h to kernel/rcu/rcu.h so that SRCU can access them. This commit is code-movement only.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* srcu: Move combining-tree definitions for SRCU's benefit (Paul E. McKenney, 2017-04-18; 1 file, -70/+1)

    This commit moves the C preprocessor code that defines the default shape of the rcu_node combining tree to a new include/linux/rcu_node_tree.h file as a first step towards enabling SRCU to create its own combining tree, which in turn enables SRCU to implement per-CPU callback handling, thus avoiding contention on the lock currently guarding the single list of callbacks. Note that users of SRCU still need to know the size of the srcu_struct structure, hence include/linux rather than kernel/rcu. This commit is code-movement only.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* srcu: Use rcu_segcblist to track SRCU callbacks (Paul E. McKenney, 2017-04-18; 1 file, -1/+1)

    This commit switches SRCU from custom-built callback queues to the new rcu_segcblist structure. This change associates grace-period sequence numbers with groups of callbacks, which will be needed for efficient processing of per-CPU callbacks.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* srcu: Abstract multi-tail callback list handling (Paul E. McKenney, 2017-04-18; 1 file, -34/+7)

    RCU has only one multi-tail callback list, which is implemented via the nxtlist, nxttail, nxtcompleted, qlen_lazy, and qlen fields in the rcu_data structure, and whose operations are open-coded throughout the Tree RCU implementation. This has been more or less OK in the past, but upcoming callback-list optimizations in SRCU could really use a multi-tail callback list there as well. This commit therefore abstracts the multi-tail callback list handling into a new kernel/rcu/rcu_segcblist.h file, and uses this new API. The simple head-and-tail pointer callback list is also abstracted and applied everywhere except for the NOCB callback-offload lists. (Yes, the plan is to apply them there as well, but this commit is already bigger than would be good.)

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

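    For orientation, a multi-tail ("segmented") callback list is a single linked list with an array of tail pointers marking segment boundaries, roughly as follows (simplified from the new rcu_segcblist.h):

        #define RCU_DONE_TAIL       0 /* GP already done; ready to invoke. */
        #define RCU_WAIT_TAIL       1 /* Waiting on the current GP.        */
        #define RCU_NEXT_READY_TAIL 2 /* Waiting on the next GP.           */
        #define RCU_NEXT_TAIL       3 /* Not yet assigned to any GP.       */
        #define RCU_CBLIST_NSEGS    4

        struct rcu_segcblist {
                struct rcu_head *head;                     /* One list...   */
                struct rcu_head **tails[RCU_CBLIST_NSEGS]; /* ...four tails. */
                unsigned long gp_seq[RCU_CBLIST_NSEGS];    /* GP # per seg.  */
                long len;                                  /* Total queued.  */
        };
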
* rcu: Default RCU_FANOUT_LEAF to 16 unless explicitly changed (Paul E. McKenney, 2017-04-18; 1 file, -5/+1)

    If the RCU_EXPERT Kconfig option is not set (the default), then the RCU_FANOUT_LEAF Kconfig option will not be defined, which will cause the leaf-level rcu_node tree fanout to default to 32 on 32-bit systems and 64 on 64-bit systems. This can result in excessive lock contention. This commit therefore changes the computation of the leaf-level rcu_node tree fanout so that the result will be 16 unless an explicit Kconfig or kernel-boot setting says otherwise.

    Reported-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

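    After this change, the preprocessor logic is essentially the following (a sketch of the shape, not the verbatim source):

        #ifdef CONFIG_RCU_FANOUT_LEAF
        #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF /* Explicit Kconfig wins. */
        #else
        #define RCU_FANOUT_LEAF 16 /* New default, replacing the old 32/64. */
        #endif
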
* rcu: Place guard on rcu_all_qs() and rcu_note_context_switch() actions (Paul E. McKenney, 2017-04-18; 1 file, -1/+2)

    The rcu_all_qs() and rcu_note_context_switch() functions do a series of checks, taking various actions to supply RCU with quiescent states, depending on the outcomes of the various checks. This is a bit much for scheduling fastpaths, so this commit creates a separate ->rcu_urgent_qs field in the rcu_dynticks structure that acts as a global guard for these checks. Thus, in the common case, rcu_all_qs() and rcu_note_context_switch() check the ->rcu_urgent_qs field, find it false, and simply return.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>

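    The resulting fast path is essentially a single per-CPU load, as in this sketch (rcu_do_qs_checks() is a hypothetical stand-in for the pre-existing slow-path checks):

        void rcu_all_qs(void)
        {
                if (!raw_cpu_read(rcu_dynticks.rcu_urgent_qs))
                        return;            /* Common case: nothing urgent. */
                rcu_do_qs_checks();        /* Rare case: run the full set of
                                            * quiescent-state checks.       */
        }
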
* rcu: Eliminate flavor scan in rcu_momentary_dyntick_idle() (Paul E. McKenney, 2017-04-18; 1 file, -2/+1)

    The rcu_momentary_dyntick_idle() function scans the RCU flavors, checking that one of them still needs a quiescent state before doing an expensive atomic operation on the ->dynticks counter. However, this check reduces overhead only after a rare race condition, and increases complexity. This commit therefore removes the scan and the mechanism enabling the scan.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Pull rcu_qs_ctr into rcu_dynticks structure (Paul E. McKenney, 2017-04-18; 1 file, -1/+2)

    The rcu_qs_ctr variable is yet another isolated per-CPU variable, so this commit pulls it into the pre-existing rcu_dynticks per-CPU structure.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Pull rcu_sched_qs_mask into rcu_dynticks structure (Paul E. McKenney, 2017-04-18; 1 file, -0/+1)

    The rcu_sched_qs_mask variable is yet another isolated per-CPU variable, so this commit pulls it into the pre-existing rcu_dynticks per-CPU structure.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Maintain special bits at bottom of ->dynticks counter (Paul E. McKenney, 2017-04-18; 1 file, -0/+1)

    Currently, IPIs are used to force other CPUs to invalidate their TLBs in response to a kernel virtual-memory mapping change. This works, but degrades both battery lifetime (for idle CPUs) and real-time response (for nohz_full CPUs), and in addition results in unnecessary IPIs due to the fact that CPUs executing in usermode are unaffected by stale kernel mappings. It would be better to cause a CPU executing in usermode to wait until it is entering kernel mode to do the flush, first to avoid interrupting usermode tasks and second to handle multiple flush requests with a single flush in the case of a long-running user task.

    This commit therefore reserves a bit at the bottom of the ->dynticks counter, which is checked upon exit from extended quiescent states. If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is invoked, which, if not supplied, is an empty single-pass do-while loop. If this bottom bit is set on -entry- to an extended quiescent state, then a WARN_ON_ONCE() triggers.

    This bottom bit may be set using a new rcu_eqs_special_set() function, which returns true if the bit was set, or false if the CPU turned out to not be in an extended quiescent state. Please note that this function refuses to set the bit for a non-nohz_full CPU when that CPU is executing in usermode because usermode execution is tracked by RCU as a dyntick-idle extended quiescent state only for nohz_full CPUs.

    Reported-by: Andy Lutomirski <luto@amacapital.net>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Reviewed-by: Josh Triplett <josh@joshtriplett.org>

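    A sketch of the bit-reservation scheme (constant names and the standalone atomic_t are illustrative; the actual patch operates on the rcu_dynticks structure's ->dynticks field):

        #define DYNTICK_CTRL_MASK 0x1 /* Reserved "special work" bit.  */
        #define DYNTICK_CTRL_CTR  0x2 /* Counter now advances in twos. */

        /* Request special processing at next EQS exit; returns false if
         * the CPU turns out not to be in an extended quiescent state. */
        static bool eqs_special_set_sketch(atomic_t *dynticks)
        {
                int old, new;

                do {
                        old = atomic_read(dynticks);
                        if (old & DYNTICK_CTRL_CTR)
                                return false;   /* Not in an EQS. */
                        new = old | DYNTICK_CTRL_MASK;
                } while (atomic_cmpxchg(dynticks, old, new) != old);
                return true;
        }
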
* sched/headers: Prepare for the removal of <linux/rtmutex.h> from <linux/sched.h> (Ingo Molnar, 2017-03-02; 1 file, -0/+1)

    Fix up missing #includes in other places that rely on sched.h doing that for them.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

*   Merge branches 'doc.2017.01.15b', 'dyntick.2017.01.23a', 'fixes.2017.01.23a', 'srcu.2017.01.25a' and 'torture.2017.01.15b' into HEAD (Paul E. McKenney, 2017-01-25; 1 file, -13/+2)

    doc.2017.01.15b:     Documentation updates
    dyntick.2017.01.23a: Dyntick tracking consolidation
    fixes.2017.01.23a:   Miscellaneous fixes
    srcu.2017.01.25a:    SRCU rewrite, fixes, and verification
    torture.2017.01.15b: Torture-test updates

* srcu: Force full grace-period ordering (Paul E. McKenney, 2017-01-25; 1 file, -12/+0)

    If a process invokes synchronize_srcu(), is delayed just the right amount of time, and thus does not sleep when waiting for the grace period to complete, there is no ordering between the end of the grace period and the code following the synchronize_srcu(). Similarly, there can be a lack of ordering between the end of the SRCU grace period and callback invocation. This commit adds the necessary ordering.

    Reported-by: Lance Roy <ldr709@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    [ paulmck: Further smp_mb() adjustment per email with Lance Roy. ]

* rcu: Eliminate unused expedited_normal counter (Paul E. McKenney, 2017-01-23; 1 file, -1/+0)

    Expedited grace periods no longer fall back to normal grace periods in response to lock contention, given that expedited grace periods now use the rcu_node tree so as to avoid contention. This commit therefore removes the expedited_normal counter.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Reviewed-by: Josh Triplett <josh@joshtriplett.org>

* rcu: Abstract extended quiescent state determination (Paul E. McKenney, 2017-01-23; 1 file, -0/+2)

    This commit is the fourth step towards full abstraction of all accesses to the ->dynticks counter, implementing previously open-coded checks and comparisons in new rcu_dynticks_in_eqs() and rcu_dynticks_in_eqs_since() functions. This abstraction will ease changes to the ->dynticks counter operation.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Reviewed-by: Josh Triplett <josh@joshtriplett.org>

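    A sketch of the two new helpers (simplified; at this point in the series the low bit of ->dynticks still distinguishes idle from non-idle, with even values denoting an extended quiescent state):

        /* Is the sampled ->dynticks value an extended quiescent state? */
        static bool dynticks_in_eqs_sketch(int snap)
        {
                return !(snap & 0x1);   /* Even value: idle (EQS). */
        }

        /* Has the CPU passed through an EQS since the snapshot was taken? */
        static bool dynticks_in_eqs_since_sketch(atomic_t *dynticks, int snap)
        {
                return snap != atomic_read(dynticks); /* Counter advanced. */
        }
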
* rcu: Make expedited grace periods recheck dyntick idle state (Paul E. McKenney, 2016-11-14; 1 file, -0/+1)

    Expedited grace periods check dyntick-idle state, and avoid sending IPIs to idle CPUs, including those running guest OSes, and, on NOHZ_FULL kernels, nohz_full CPUs. However, the kernel has been observed checking a CPU while it was non-idle, but sending the IPI after it has gone idle. This commit therefore rechecks idle state immediately before sending the IPI, refraining from IPIing CPUs that have since gone idle.

    Reported-by: Rik van Riel <riel@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

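    The shape of the fix, as a hedged sketch (helper names ending in _sketch are illustrative; the real code lives in the expedited machinery and also reports the quiescent state for CPUs found idle):

        for_each_leaf_node_possible_cpu(rnp, cpu) {
                /* Re-sample idle state immediately before the IPI. */
                if (dynticks_in_eqs_sketch(dynticks_snap_sketch(cpu)))
                        continue;       /* Went idle since first check: no IPI. */
                smp_call_function_single(cpu, exp_handler_sketch, NULL, 0);
        }
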
* rcu: Drive expedited grace periods from workqueue (Paul E. McKenney, 2016-08-22; 1 file, -0/+1)

    The current implementation of expedited grace periods has the user task drive the grace period. This works, but has downsides: (1) The user task must awaken tasks piggybacking on this grace period, which can result in latencies rivaling that of the grace period itself, and (2) User tasks can receive signals, which interfere with RCU CPU stall warnings. This commit therefore uses workqueues to drive the grace periods, so that the user task need not do the awakening. A subsequent commit will remove the now-unnecessary code allowing for signals.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Correctly handle sparse possible cpus (Mark Rutland, 2016-06-15; 1 file, -0/+15)

    In many cases in the RCU tree code, we iterate over the set of cpus for a leaf node described by rcu_node::grplo and rcu_node::grphi, checking per-cpu data for each cpu in this range. However, if the set of possible cpus is sparse, some cpus described in this range are not possible, and thus no per-cpu region will have been allocated (or initialised) for them by the generic percpu code.

    Erroneous accesses to a per-cpu area for these !possible cpus may fault or may hit other data depending on the address generated when the erroneous per-cpu offset is applied. In practice, both cases have been observed on arm64 hardware (the former being silent, but detectable with additional patches).

    To avoid issues resulting from this, we must iterate over the set of *possible* cpus for a given leaf node. This patch adds a new helper, for_each_leaf_node_possible_cpu, to enable this. As iteration is often intertwined with rcu_node local bitmask manipulation, a new leaf_node_cpu_bit helper is added to make this simpler and more consistent. The RCU tree code is made to use both of these where appropriate.

    Without this patch, running reboot at a shell can result in an oops like:

        [ 3369.075979] Unable to handle kernel paging request at virtual address ffffff8008b21b4c
        [ 3369.083881] pgd = ffffffc3ecdda000
        [ 3369.087270] [ffffff8008b21b4c] *pgd=00000083eca48003, *pud=00000083eca48003, *pmd=0000000000000000
        [ 3369.096222] Internal error: Oops: 96000007 [#1] PREEMPT SMP
        [ 3369.101781] Modules linked in:
        [ 3369.104825] CPU: 2 PID: 1817 Comm: NetworkManager Tainted: G W 4.6.0+ #3
        [ 3369.121239] task: ffffffc0fa13e000 ti: ffffffc3eb940000 task.ti: ffffffc3eb940000
        [ 3369.128708] PC is at sync_rcu_exp_select_cpus+0x188/0x510
        [ 3369.134094] LR is at sync_rcu_exp_select_cpus+0x104/0x510
        [ 3369.139479] pc : [<ffffff80081109a8>] lr : [<ffffff8008110924>] pstate: 200001c5
        [ 3369.146860] sp : ffffffc3eb9435a0
        [ 3369.150162] x29: ffffffc3eb9435a0 x28: ffffff8008be4f88
        [ 3369.155465] x27: ffffff8008b66c80 x26: ffffffc3eceb2600
        [ 3369.160767] x25: 0000000000000001 x24: ffffff8008be4f88
        [ 3369.166070] x23: ffffff8008b51c3c x22: ffffff8008b66c80
        [ 3369.171371] x21: 0000000000000001 x20: ffffff8008b21b40
        [ 3369.176673] x19: ffffff8008b66c80 x18: 0000000000000000
        [ 3369.181975] x17: 0000007fa951a010 x16: ffffff80086a30f0
        [ 3369.187278] x15: 0000007fa9505590 x14: 0000000000000000
        [ 3369.192580] x13: ffffff8008b51000 x12: ffffffc3eb940000
        [ 3369.197882] x11: 0000000000000006 x10: ffffff8008b51b78
        [ 3369.203184] x9 : 0000000000000001 x8 : ffffff8008be4000
        [ 3369.208486] x7 : ffffff8008b21b40 x6 : 0000000000001003
        [ 3369.213788] x5 : 0000000000000000 x4 : ffffff8008b27280
        [ 3369.219090] x3 : ffffff8008b21b4c x2 : 0000000000000001
        [ 3369.224406] x1 : 0000000000000001 x0 : 0000000000000140
        ...
        [ 3369.972257] [<ffffff80081109a8>] sync_rcu_exp_select_cpus+0x188/0x510
        [ 3369.978685] [<ffffff80081128b4>] synchronize_rcu_expedited+0x64/0xa8
        [ 3369.985026] [<ffffff80086b987c>] synchronize_net+0x24/0x30
        [ 3369.990499] [<ffffff80086ddb54>] dev_deactivate_many+0x28c/0x298
        [ 3369.996493] [<ffffff80086b6bb8>] __dev_close_many+0x60/0xd0
        [ 3370.002052] [<ffffff80086b6d48>] __dev_close+0x28/0x40
        [ 3370.007178] [<ffffff80086bf62c>] __dev_change_flags+0x8c/0x158
        [ 3370.012999] [<ffffff80086bf718>] dev_change_flags+0x20/0x60
        [ 3370.018558] [<ffffff80086cf7f0>] do_setlink+0x288/0x918
        [ 3370.023771] [<ffffff80086d0798>] rtnl_newlink+0x398/0x6a8
        [ 3370.029158] [<ffffff80086cee84>] rtnetlink_rcv_msg+0xe4/0x220
        [ 3370.034891] [<ffffff80086e274c>] netlink_rcv_skb+0xc4/0xf8
        [ 3370.040364] [<ffffff80086ced8c>] rtnetlink_rcv+0x2c/0x40
        [ 3370.045663] [<ffffff80086e1fe8>] netlink_unicast+0x160/0x238
        [ 3370.051309] [<ffffff80086e24b8>] netlink_sendmsg+0x2f0/0x358
        [ 3370.056956] [<ffffff80086a0070>] sock_sendmsg+0x18/0x30
        [ 3370.062168] [<ffffff80086a21cc>] ___sys_sendmsg+0x26c/0x280
        [ 3370.067728] [<ffffff80086a30ac>] __sys_sendmsg+0x44/0x88
        [ 3370.073027] [<ffffff80086a3100>] SyS_sendmsg+0x10/0x20
        [ 3370.078153] [<ffffff8008085e70>] el0_svc_naked+0x24/0x28

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Reported-by: Dennis Chen <dennis.chen@arm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Josh Triplett <josh@joshtriplett.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Steve Capper <steve.capper@arm.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

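    A sketch of the two helpers described above (simplified relative to the actual patch):

        /* Iterate over the *possible* CPUs covered by leaf node rnp. */
        #define for_each_leaf_node_possible_cpu(rnp, cpu)                      \
                for ((cpu) = cpumask_next((rnp)->grplo - 1, cpu_possible_mask);\
                     (cpu) <= (rnp)->grphi;                                    \
                     (cpu) = cpumask_next((cpu), cpu_possible_mask))

        /* Bit for CPU cpu within leaf node rnp's bitmasks. */
        static inline unsigned long leaf_node_cpu_bit(struct rcu_node *rnp,
                                                      int cpu)
        {
                return 1UL << (cpu - rnp->grplo);
        }
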
*   Merge branches 'doc.2016.04.19a', 'exp.2016.03.31d', 'fixes.2016.03.31d' and 'torture.2016.04.21a' into HEAD (Paul E. McKenney, 2016-04-21; 1 file, -10/+10)

    doc.2016.04.19a:     Documentation updates
    exp.2016.03.31d:     Expedited grace-period updates
    fixes.2016.03.31d:   Miscellaneous fixes
    torture.2016.04.21a: Torture-test updates

* rcu: Awaken grace-period kthread if too long since FQS (Paul E. McKenney, 2016-03-31; 1 file, -0/+2)

    Recent kernels can fail to awaken the grace-period kthread for quiescent-state forcing. This commit is a crude hack that does a wakeup if a scheduling-clock interrupt sees that it has been too long since force-quiescent-state (FQS) processing.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Overlap wakeups with next expedited grace period (Paul E. McKenney, 2016-03-31; 1 file, -1/+2)

    The current expedited grace-period implementation makes subsequent grace periods wait on wakeups for the prior grace period. This does not fit the dictionary definition of "expedited", so this commit allows these two phases to overlap. Doing this requires four waitqueues rather than two because tasks can now be waiting on the previous, current, and next grace periods. The fourth waitqueue makes the bit masking work out nicely.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Enforce expedited-GP fairness via funnel wait queue (Paul E. McKenney, 2016-03-31; 1 file, -6/+4)

    The current mutex-based funnel-locking approach used by expedited grace periods is subject to severe unfairness. The problem arises when a few tasks, making a path from leaves to root, all wake up before other tasks do. A new task can then follow this path all the way to the root, which needlessly delays tasks whose grace period is done, but who do not happen to acquire the lock quickly enough.

    This commit avoids this problem by maintaining per-rcu_node wait queues, along with a per-rcu_node counter that tracks the latest grace period sought by an earlier task to visit this node. If that grace period would satisfy the current task, instead of proceeding up the tree, it waits on the current rcu_node structure using a pair of wait queues provided for that purpose. This decouples awakening of old tasks from the arrival of new tasks.

    If the wakeups prove to be a bottleneck, additional kthreads can be brought to bear for that purpose.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Shorten expedited_workdone* to exp_workdone* (Paul E. McKenney, 2016-03-31; 1 file, -3/+3)

    Just a name change to save a few lines and a bit of typing.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Remove expedited GP funnel-lock bypass (Paul E. McKenney, 2016-03-31; 1 file, -1/+0)

    Commit cdacbe1f91264 ("rcu: Add fastpath bypassing funnel locking") turns out to be a pessimization at high load because it forces a tree full of tasks to wait for an expedited grace period that they probably do not need. This commit therefore removes this optimization.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

*   Merge commit 'fixes.2015.02.23a' into core/rcu (Ingo Molnar, 2016-03-15; 1 file, -11/+31)

    Conflicts:
        kernel/rcu/tree.c

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

* RCU: Privatize rcu_node::lock (Boqun Feng, 2016-02-23; 1 file, -11/+31)

    In the patch "rcu: Add transitivity to remaining rcu_node ->lock acquisitions", all locking operations on rcu_node::lock were replaced with the wrappers because of the need for transitivity, which indicates we should never write code using LOCK primitives alone (i.e. without a proper barrier following) on rcu_node::lock outside those wrappers. We could detect this kind of misuse on rcu_node::lock in the future by adding the __private modifier on rcu_node::lock.

    To privatize rcu_node::lock, unlock wrappers are also needed. Replacing spinlock unlocks with these wrappers not only privatizes rcu_node::lock but also makes it easier to figure out critical sections of rcu_node.

    This patch adds the __private modifier to rcu_node::lock and makes every access to it wrapped by ACCESS_PRIVATE(). Besides, unlock wrappers are added and raw_spin_unlock(&rnp->lock) and its friends are replaced with those wrappers.

    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

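    A sketch of the privatization pattern (simplified; __private and ACCESS_PRIVATE() are the sparse annotations from include/linux/compiler.h):

        struct rcu_node {
                raw_spinlock_t __private lock; /* Direct access now flagged
                                                * by sparse. */
                /* ... other fields ... */
        };

        /* Unlock wrapper: the only sanctioned way to release rnp->lock. */
        static inline void raw_spin_unlock_rcu_node(struct rcu_node *rnp)
        {
                raw_spin_unlock(&ACCESS_PRIVATE(rnp, lock));
        }
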
* rcu: Use simple wait queues where possible in rcutree (Paul Gortmaker, 2016-02-25; 1 file, -6/+7)

    As of commit dae6e64d2bcfd ("rcu: Introduce proper blocking to no-CBs kthreads GP waits") the RCU subsystem started making use of wait queues. Here we convert all additions of RCU wait queues to use simple wait queues, since they don't need the extra overhead of the full wait queue features.

    Originally this was done for RT kernels[1], since we would get things like...

        BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
        in_atomic(): 1, irqs_disabled(): 1, pid: 8, name: rcu_preempt
        Pid: 8, comm: rcu_preempt Not tainted
        Call Trace:
         [<ffffffff8106c8d0>] __might_sleep+0xd0/0xf0
         [<ffffffff817d77b4>] rt_spin_lock+0x24/0x50
         [<ffffffff8106fcf6>] __wake_up+0x36/0x70
         [<ffffffff810c4542>] rcu_gp_kthread+0x4d2/0x680
         [<ffffffff8105f910>] ? __init_waitqueue_head+0x50/0x50
         [<ffffffff810c4070>] ? rcu_gp_fqs+0x80/0x80
         [<ffffffff8105eabb>] kthread+0xdb/0xe0
         [<ffffffff8106b912>] ? finish_task_switch+0x52/0x100
         [<ffffffff817e0754>] kernel_thread_helper+0x4/0x10
         [<ffffffff8105e9e0>] ? __init_kthread_worker+0x60/0x60
         [<ffffffff817e0750>] ? gs_change+0xb/0xb

    ...and hence simple wait queues were deployed on RT out of necessity (as simple wait uses a raw lock), but mainline might as well take advantage of the more streamlined support as well.

    [1] This is a carry forward of work from v3.10-rt; the original conversion was by Thomas on an earlier -rt version, and Sebastian extended it to additional post-3.10 added RCU waiters; here I've added a commit log and unified the RCU changes into one, and uprev'd it to match mainline RCU.

    Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: linux-rt-users@vger.kernel.org
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Link: http://lkml.kernel.org/r/1455871601-27484-6-git-send-email-wagi@monom.org
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

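    The conversion is mostly mechanical; roughly, assuming the simple-wait API names of that era:

        struct swait_queue_head gp_wq;          /* was: wait_queue_head_t gp_wq; */

        init_swait_queue_head(&rsp->gp_wq);     /* was: init_waitqueue_head()    */
        swait_event_interruptible(rsp->gp_wq,   /* was: wait_event_interruptible() */
                        READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_INIT);
        swake_up(&rsp->gp_wq);                  /* was: wake_up()                */
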
* rcu: Do not call rcu_nocb_gp_cleanup() while holding rnp->lock (Daniel Wagner, 2016-02-25; 1 file, -1/+2)

    rcu_nocb_gp_cleanup() is called while holding rnp->lock. Currently, this is okay because the wake_up_all() in rcu_nocb_gp_cleanup() will not enable the IRQs. lockdep is happy.

    By switching over to swait this is no longer true. swake_up_all() enables the IRQs while processing the waiters. __do_softirq() can now run and will eventually call rcu_process_callbacks(), which wants to grab rnp->lock.

    Let's move the rcu_nocb_gp_cleanup() call outside the lock before we switch over to swait.

    If we were to hold rnp->lock and use swait, lockdep would report the following:

        =================================
        [ INFO: inconsistent lock state ]
        4.2.0-rc5-00025-g9a73ba0 #136 Not tainted
        ---------------------------------
        inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
        rcu_preempt/8 [HC0[0]:SC0[0]:HE1:SE1] takes:
        (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
        {IN-SOFTIRQ-W} state was registered at:
          [<ffffffff81109b9f>] __lock_acquire+0xd5f/0x21e0
          [<ffffffff8110be0f>] lock_acquire+0xdf/0x2b0
          [<ffffffff81841cc9>] _raw_spin_lock_irqsave+0x59/0xa0
          [<ffffffff81136991>] rcu_process_callbacks+0x141/0x3c0
          [<ffffffff810b1a9d>] __do_softirq+0x14d/0x670
          [<ffffffff810b2214>] irq_exit+0x104/0x110
          [<ffffffff81844e96>] smp_apic_timer_interrupt+0x46/0x60
          [<ffffffff81842e70>] apic_timer_interrupt+0x70/0x80
          [<ffffffff810dba66>] rq_attach_root+0xa6/0x100
          [<ffffffff810dbc2d>] cpu_attach_domain+0x16d/0x650
          [<ffffffff810e4b42>] build_sched_domains+0x942/0xb00
          [<ffffffff821777c2>] sched_init_smp+0x509/0x5c1
          [<ffffffff821551e3>] kernel_init_freeable+0x172/0x28f
          [<ffffffff8182cdce>] kernel_init+0xe/0xe0
          [<ffffffff8184231f>] ret_from_fork+0x3f/0x70
        irq event stamp: 76
        hardirqs last enabled at (75):  [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
        hardirqs last disabled at (76): [<ffffffff8184116f>] _raw_spin_lock_irq+0x1f/0x90
        softirqs last enabled at (0):   [<ffffffff810a8df2>] copy_process.part.26+0x602/0x1cf0
        softirqs last disabled at (0):  [< (null)>] (null)

        other info that might help us debug this:
         Possible unsafe locking scenario:
               CPU0
               ----
          lock(rcu_node_1);
          <Interrupt>
            lock(rcu_node_1);

         *** DEADLOCK ***

        1 lock held by rcu_preempt/8:
         #0: (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0

        stack backtrace:
        CPU: 0 PID: 8 Comm: rcu_preempt Not tainted 4.2.0-rc5-00025-g9a73ba0 #136
        Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014
         0000000000000000 000000006d7e67d8 ffff881fb081fbd8 ffffffff818379e0
         0000000000000000 ffff881fb0812a00 ffff881fb081fc38 ffffffff8110813b
         0000000000000000 0000000000000001 ffff881f00000001 ffffffff8102fa4f
        Call Trace:
         [<ffffffff818379e0>] dump_stack+0x4f/0x7b
         [<ffffffff8110813b>] print_usage_bug+0x1db/0x1e0
         [<ffffffff8102fa4f>] ? save_stack_trace+0x2f/0x50
         [<ffffffff811087ad>] mark_lock+0x66d/0x6e0
         [<ffffffff81107790>] ? check_usage_forwards+0x150/0x150
         [<ffffffff81108898>] mark_held_locks+0x78/0xa0
         [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
         [<ffffffff81108a28>] trace_hardirqs_on_caller+0x168/0x220
         [<ffffffff81108aed>] trace_hardirqs_on+0xd/0x10
         [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
         [<ffffffff810fd1c7>] swake_up_all+0xb7/0xe0
         [<ffffffff811386e1>] rcu_gp_kthread+0xab1/0xeb0
         [<ffffffff811089bf>] ? trace_hardirqs_on_caller+0xff/0x220
         [<ffffffff81841341>] ? _raw_spin_unlock_irq+0x41/0x60
         [<ffffffff81137c30>] ? rcu_barrier+0x20/0x20
         [<ffffffff810d2014>] kthread+0x104/0x120
         [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
         [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260
         [<ffffffff8184231f>] ret_from_fork+0x3f/0x70
         [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260

    Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: linux-rt-users@vger.kernel.org
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Link: http://lkml.kernel.org/r/1455871601-27484-5-git-send-email-wagi@monom.org
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

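    The resulting pattern, sketched (rcu_nocb_gp_get() stands in for whatever hands back the wait queue; consult the actual patch for details):

        raw_spin_lock_irqsave(&rnp->lock, flags);
        /* ...grace-period cleanup that genuinely needs rnp->lock... */
        sq = rcu_nocb_gp_get(rnp);              /* Capture the wait queue. */
        raw_spin_unlock_irqrestore(&rnp->lock, flags);
        rcu_nocb_gp_cleanup(sq);                /* Wakeup runs unlocked, so
                                                 * swake_up_all() may safely
                                                 * enable IRQs.             */
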
*   Merge branches 'doc.2015.12.05a', 'exp.2015.12.07a', 'fixes.2015.12.07a', 'list.2015.12.04b' and 'torture.2015.12.05a' into HEAD (Paul E. McKenney, 2015-12-07; 1 file, -4/+18)

    doc.2015.12.05a:     Documentation updates
    exp.2015.12.07a:     Expedited grace-period updates
    fixes.2015.12.07a:   Miscellaneous fixes
    list.2015.12.04b:    Linked-list updates
    torture.2015.12.05a: Torture-test updates

* rcutorture: Print symbolic name for ->gp_state (Paul E. McKenney, 2015-12-05; 1 file, -0/+12)

    Currently, ->gp_state is printed as an integer, which slows debugging. This commit therefore prints a symbolic name in addition to the integer.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    [ paulmck: Updated to fix relational operator called out by Dan Carpenter. ]
    [ paulmck: More "const", as suggested by Josh Triplett. ]
    Reviewed-by: Josh Triplett <josh@joshtriplett.org>

* rcu: Reduce expedited GP memory contention via per-CPU variables (Paul E. McKenney, 2015-12-04; 1 file, -4/+4)

    Currently, the piggybacked-work checks carried out by sync_exp_work_done() atomically increment a small set of variables (the ->expedited_workdone0, ->expedited_workdone1, ->expedited_workdone2, ->expedited_workdone3 fields in the rcu_state structure), which will form a memory-contention bottleneck given a sufficiently large number of CPUs concurrently invoking either synchronize_rcu_expedited() or synchronize_sched_expedited(). This commit therefore moves these four fields to the per-CPU rcu_data structure, eliminating the memory contention. The show_rcuexp() function also changes to sum up each field in the rcu_data structures.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Clarify role of ->expmaskinitnext (Paul E. McKenney, 2015-12-04; 1 file, -0/+2)

    Analogy with the ->qsmaskinitnext field might lead one to believe that ->expmaskinitnext tracks online CPUs. This belief is incorrect: Any CPU that has ever been online will have its bit set in the ->expmaskinitnext field. This commit therefore adds a comment to make this clear, at least to people who read comments.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Create transitive rnp->lock acquisition functions (Peter Zijlstra, 2015-11-23; 1 file, -0/+39)

    Providing RCU's memory-ordering guarantees requires that the rcu_node tree's locking provide transitive memory ordering, which the Linux kernel's spinlocks currently do not provide unless smp_mb__after_unlock_lock() is used. Having a separate smp_mb__after_unlock_lock() after each and every lock acquisition is error-prone, hard to read, and a bit annoying, so this commit provides wrapper functions that pull in the smp_mb__after_unlock_lock() invocations.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

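    The wrapper shape, sketched (the commit adds _irq, _irqsave, and other variants along the same lines):

        static inline void raw_spin_lock_rcu_node(struct rcu_node *rnp)
        {
                raw_spin_lock(&rnp->lock);
                smp_mb__after_unlock_lock(); /* Provide transitive ordering. */
        }
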
*   Merge branches 'fixes.2015.10.06a' and 'exp.2015.10.07a' into HEAD (Paul E. McKenney, 2015-10-07; 1 file, -20/+28)

    exp.2015.10.07a:   Reduce OS jitter of RCU-sched expedited grace periods.
    fixes.2015.10.06a: Miscellaneous fixes.

* rcu: Add online/offline info to expedited stall warning message (Paul E. McKenney, 2015-10-07; 1 file, -0/+1)

    This commit makes the RCU CPU stall warning message print online/offline indications immediately after the CPU number. An "O" indicates that the CPU is globally offline and a "." that it is globally online; an "o" indicates that RCU believes the CPU to be offline for the current grace period ("." otherwise); and an "N" indicates that RCU believes the CPU will be offline for the next grace period ("." otherwise). So for CPU 10, you would normally see "10-...:" indicating that everything believes that the CPU is online.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Stop silencing lockdep false positive for expedited grace periods (Paul E. McKenney, 2015-10-07; 1 file, -8/+0)

    This reverts commit af859beaaba4 (rcu: Silence lockdep false positive for expedited grace periods). Because synchronize_rcu_expedited() no longer invokes synchronize_sched_expedited(), ->exp_funnel_mutex acquisition is no longer nested, so the false positive no longer happens. This commit therefore removes the extra lockdep data structures, as they are no longer needed.

* rcu: Switch synchronize_sched_expedited() to IPI (Paul E. McKenney, 2015-10-07; 1 file, -3/+0)

    This commit switches synchronize_sched_expedited() from stop_one_cpu_nowait() to smp_call_function_single(), thus moving from an IPI and a pair of context switches to an IPI and a single pass through the scheduler. Of course, if the scheduler actually does decide to switch to a different task, there will still be a pair of context switches, but there would likely have been a pair of context switches anyway, just a bit later.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Make ->cpu_no_qs be a union for aggregate OR (Paul E. McKenney, 2015-09-20; 1 file, -1/+13)

    This commit converts the rcu_data structure's ->cpu_no_qs field to a union. The bytewise side of this union allows individual access to indications as to whether this CPU needs to find a quiescent state for a normal (.norm) and/or expedited (.exp) grace period. The setwise side of the union allows testing whether or not a quiescent state is needed at all, for either type of grace period. For now, only .norm is used. A later commit will introduce the expedited usage.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

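    The union looks essentially like this (comments added here for exposition):

        union rcu_noqs {
                struct {
                        u8 norm; /* QS needed for a normal grace period?    */
                        u8 exp;  /* QS needed for an expedited grace period? */
                } b;             /* Bytewise: individual indications.        */
                u16 s;           /* Setwise: aggregate OR, any QS needed?    */
        };
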
* rcu: Invert passed_quiesce and rename to cpu_no_qs (Paul E. McKenney, 2015-09-20; 1 file, -1/+1)

    This commit inverts the sense of the rcu_data structure's ->passed_quiesce field and renames it to ->cpu_no_qs. This will allow a later commit to use an "aggregate OR" operation to test expedited as well as normal grace periods without added overhead.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Rename qs_pending to core_needs_qs (Paul E. McKenney, 2015-09-20; 1 file, -2/+2)

    An upcoming commit needs to invert the sense of the ->passed_quiesce rcu_data structure field, so this commit is taking this opportunity to clarify things a bit by renaming ->qs_pending to ->core_needs_qs. So if !rdp->core_needs_qs, then this CPU need not concern itself with quiescent states, in particular, it need not acquire its leaf rcu_node structure's ->lock to check. Otherwise, it needs to report the next quiescent state.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Move synchronize_sched_expedited() to combining tree (Paul E. McKenney, 2015-09-20; 1 file, -1/+0)

    Currently, synchronize_sched_expedited() uses a single global counter to track the number of remaining context switches that the current expedited grace period must wait on. This is problematic on large systems, where the resulting memory contention can be pathological. This commit therefore makes synchronize_sched_expedited() instead use the combining tree in the same manner as synchronize_rcu_expedited(), keeping memory contention down to a dull roar.

    This commit creates a temporary function sync_sched_exp_select_cpus() that is very similar to sync_rcu_exp_select_cpus(). A later commit will consolidate these two functions, which becomes possible when synchronize_sched_expedited() switches from stop_one_cpu_nowait() to smp_call_function_single().

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

* rcu: Consolidate tree setup for synchronize_rcu_expedited() (Paul E. McKenney, 2015-09-20; 1 file, -5/+12)

    This commit replaces sync_rcu_preempt_exp_init1() and sync_rcu_preempt_exp_init2() with sync_exp_reset_tree_hotplug() and sync_exp_reset_tree(), which will also be used by synchronize_sched_expedited(), and sync_rcu_exp_select_nodes(), which contains code specific to synchronize_rcu_expedited().

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>