summaryrefslogtreecommitdiffstats
path: root/kernel/rcu/tree_plugin.h
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'locking-core-for-linus' of ↵Linus Torvalds2014-06-031-4/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull core locking updates from Ingo Molnar: "The main changes in this cycle were: - reduced/streamlined smp_mb__*() interface that allows more usecases and makes the existing ones less buggy, especially in rarer architectures - add rwsem implementation comments - bump up lockdep limits" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) rwsem: Add comments to explain the meaning of the rwsem's count field lockdep: Increase static allocations arch: Mass conversion of smp_mb__*() arch,doc: Convert smp_mb__*() arch,xtensa: Convert smp_mb__*() arch,x86: Convert smp_mb__*() arch,tile: Convert smp_mb__*() arch,sparc: Convert smp_mb__*() arch,sh: Convert smp_mb__*() arch,score: Convert smp_mb__*() arch,s390: Convert smp_mb__*() arch,powerpc: Convert smp_mb__*() arch,parisc: Convert smp_mb__*() arch,openrisc: Convert smp_mb__*() arch,mn10300: Convert smp_mb__*() arch,mips: Convert smp_mb__*() arch,metag: Convert smp_mb__*() arch,m68k: Convert smp_mb__*() arch,m32r: Convert smp_mb__*() arch,ia64: Convert smp_mb__*() ...
| * arch: Mass conversion of smp_mb__*()Peter Zijlstra2014-04-181-4/+4
| | | | | | | | | | | | | | | | | | | | | | Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | rcu: Variable name changed in tree_plugin.h and used in tree.cUma Sharma2014-05-141-8/+8
| | | | | | | | | | | | | | | | | | | | | | The variable and struct both having the name "rcu_state" confuses sparse in some situations, so this commit changes the variable to "rcu_state_p" in order to avoid this confusion. This also makes things easier for human readers. Signed-off-by: Uma Sharma <uma.sharma523@gmail.com> [ paulmck: Changed the declaration and several additional uses. ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* | rcu: Replace __this_cpu_ptr() uses with raw_cpu_ptr()Christoph Lameter2014-04-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | __this_cpu_ptr is being phased out. One special case is increment_cpu_stall_ticks(). A per cpu variable is incremented so use raw_cpu_inc(). Cc: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | rcu: Make large and small sysidle systems use same state machinePaul E. McKenney2014-04-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, small systems move back into RCU_SYSIDLE_NOT from RCU_SYSIDLE_SHORT and large systems do not. This works because moving aggressively to RCU_SYSIDLE_NOT affects only performance, not correctness, and on small systems, the performance impact should be negligible. That said, this difference does make RCU a bit more complex, and RCU does not seem to be suffering from any lack of complexity. This commit therefore adjusts small-system operation to match that of large systems, so that the state never moves back to RCU_SYSIDLE_NOT from RCU_SYSIDLE_SHORT. Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | rcu: Bind RCU grace-period kthreads if NO_HZ_FULLPaul E. McKenney2014-04-291-18/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, RCU binds the grace-period kthreads to the timekeeping CPU only if CONFIG_NO_HZ_FULL_SYSIDLE=y. This means that these kthreads must be bound manually when CONFIG_NO_HZ_FULL_SYSIDLE=n and CONFIG_NO_HZ_FULL=y: Otherwise, these kthreads will induce OS jitter on random CPUs. Given that we are trying to reduce the amount of manual tweaking required to make CONFIG_NO_HZ_FULL=y work nicely, this commit makes this binding happen when CONFIG_NO_HZ_FULL=y, even in cases where CONFIG_NO_HZ_FULL_SYSIDLE=n. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | rcu: Merge rcu_sched_force_quiescent_state() with rcu_force_quiescent_state()Andreea-Cristina Bernat2014-04-291-19/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch merges the function rcu_force_quiescent_state() with rcu_sched_force_quiescent_state(), using the rcu_state pointer. Firstly, the rcu_sched_force_quiescent_state() function is deleted from the file kernel/rcu/tree.c. Also, the rcu_force_quiescent_state() function that was calling force_quiescent_state with the argument rcu_preempt_state pointer was deleted as well. The new function that combines the old ones uses the rcu_state pointer and is located after rcu_batches_completed_bh() in kernel/rcu/tree.c. Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | rcu: Consolidate kfree_call_rcu() to use rcu_state pointerAndreea-Cristina Bernat2014-04-291-30/+0
| | | | | | | | | | | | | | | | | | | | | | | | kfree_call_rcu is defined two times. When defined under CONFIG_TREE_PREEMPT_RCU, it uses rcu_preempt_state. Otherwise, it uses rcu_sched_state. This patch uses the rcu_state_pointer to combine the two definitions into one. The resulting function is placed after the closing of the preprocessor conditional CONFIG_TREE_PREEMPT_RCU. Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | rcu: Make callers awaken grace-period kthreadPaul E. McKenney2014-04-291-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rcu_start_gp_advanced() function currently uses irq_work_queue() to defer wakeups of the RCU grace-period kthread. This deferring is necessary to avoid RCU-scheduler deadlocks involving the rcu_node structure's lock, meaning that RCU cannot call any of the scheduler's wake-up functions while holding one of these locks. Unfortunately, the second and subsequent calls to irq_work_queue() are ignored, and the first call will be ignored (aside from queuing the work item) if the scheduler-clock tick is turned off. This is OK for many uses, especially those where irq_work_queue() is called from an interrupt or softirq handler, because in those cases the scheduler-clock-tick state will be re-evaluated, which will turn the scheduler-clock tick back on. On the next tick, any deferred work will then be processed. However, this strategy does not always work for RCU, which can be invoked at process level from idle CPUs. In this case, the tick might never be turned back on, indefinitely defering a grace-period start request. Note that the RCU CPU stall detector cannot see this condition, because there is no RCU grace period in progress. Therefore, we can (and do!) see long tens-of-seconds stalls in grace-period handling. In theory, we could see a full grace-period hang, but rcutorture testing to date has seen only the tens-of-seconds stalls. Event tracing demonstrates that irq_work_queue() is being called repeatedly to no effect during these stalls: The "newreq" event appears repeatedly from a task that is not one of the grace-period kthreads. In theory, irq_work_queue() might be fixed to avoid this sort of issue, but RCU's requirements are unusual and it is quite straightforward to pass wake-up responsibility up through RCU's call chain, so that the wakeup happens when the offending locks are released. This commit therefore makes this change. The rcu_start_gp_advanced(), rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(), __note_gp_changes(), and rcu_start_gp() functions now return a boolean which indicates when a wake-up is needed. A new rcu_gp_kthread_wake() does the wakeup when it is necessary and safe to do so: No self-wakes, no wake-ups if the ->gp_flags field indicates there is no need (as in someone else did the wake-up before we got around to it), and no wake-ups before the grace-period kthread has been created. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | rcu: Update cpu_needs_another_gp() for futures from non-NOCB CPUsPaul E. McKenney2014-04-291-18/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In the old days, the only source of requests for future grace periods was NOCB CPUs. This has changed: CPUs routinely post requests for future grace periods in order to promote power efficiency and reduce OS jitter with minimal impact on grace-period latency. This commit therefore updates cpu_needs_another_gp() to invoke rcu_future_needs_gp() instead of rcu_nocb_needs_gp(). The latter is no longer used, so is now removed. This commit also adds tracing for the irq_work_queue() wakeup case. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | rcu: Fix incorrect notes for codeLiu Ping Fan2014-04-291-1/+1
|/ | | | | | Signed-off-by: Liu Ping Fan <kernelfans@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
*-. Merge branches 'doc.2014.02.24a', 'fixes.2014.02.26a' and 'rt.2014.02.17b' ↵Paul E. McKenney2014-02-261-5/+14
|\ \ | | | | | | | | | | | | | | | | | | | | | into HEAD doc.2014.02.24a: Documentation changes fixes.2014.02.26a: Miscellaneous fixes rt.2014.02.17b: Response-time-related changes
| | * rcu: Optimize RCU_FAST_NO_HZ for RCU_NOCB_CPU_ALLPaul E. McKenney2014-02-171-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If CONFIG_RCU_NOCB_CPU_ALL=y, then no CPU will ever have RCU callbacks because these callbacks will instead be handled by the rcuo kthreads. However, the current version of RCU_FAST_NO_HZ nevertheless checks for RCU callbacks. This commit therefore creates static inline implementations of rcu_prepare_for_idle() and rcu_cleanup_after_idle() that are no-ops when CONFIG_RCU_NOCB_CPU_ALL=y. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
| | * rcu: Optimize rcu_needs_cpu() for RCU_NOCB_CPU_ALLPaul E. McKenney2014-02-171-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If CONFIG_RCU_NOCB_CPU_ALL=y, then rcu_needs_cpu() will always return false, however, the current version nevertheless checks for RCU callbacks. This commit therefore creates a static inline implementation of rcu_needs_cpu() that unconditionally returns false when CONFIG_RCU_NOCB_CPU_ALL=y. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
| | * rcu: Optimize rcu_is_nocb_cpu() for RCU_NOCB_CPU_ALLPaul E. McKenney2014-02-171-0/+2
| |/ |/| | | | | | | | | | | | | | | | | | | If CONFIG_RCU_NOCB_CPU_ALL=y, then rcu_is_nocb_cpu() will always return true, however, the current version nevertheless checks rcu_nocb_mask. This commit therefore creates a static inline implementation of rcu_is_nocb_cpu() that unconditionally returns true when CONFIG_RCU_NOCB_CPU_ALL=y. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
| * rcu: Disambiguate CONFIG_RCU_NOCB_CPUsPaul Bolle2014-02-171-1/+1
| | | | | | | | | | | | | | | | | | This commit fixes a grammar issue in the rcu_nohz_full_cpu() comment header, so that it is clear that the plural is CPUs not Kconfig options. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
| * rcu: Stop tracking FSF's postal addressPaul E. McKenney2014-02-171-2/+2
|/ | | | | | | | | | | | All of the RCU source files have the usual GPL header, which contains a long-obsolete postal address for FSF. To avoid the need to track the FSF office's movements, this commit substitutes the URL where GPL may be found. Reported-by: Greg KH <gregkh@linuxfoundation.org> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds2014-01-201-13/+76
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: - add RCU torture scripts/tooling - static analysis improvements - update RCU documentation - miscellaneous fixes * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) rcu: Remove "extern" from function declarations in kernel/rcu/rcu.h rcu: Remove "extern" from function declarations in include/linux/*rcu*.h rcu/torture: Dynamically allocate SRCU output buffer to avoid overflow rcu: Don't activate RCU core on NO_HZ_FULL CPUs rcu: Warn on allegedly impossible rcu_read_unlock_special() from irq rcu: Add an RCU_INITIALIZER for global RCU-protected pointers rcu: Make rcu_assign_pointer's assignment volatile and type-safe bonding: Use RCU_INIT_POINTER() for better overhead and for sparse rcu: Add comment on evaluate-once properties of rcu_assign_pointer(). rcu: Provide better diagnostics for blocking in RCU callback functions rcu: Improve SRCU's grace-period comments rcu: Fix CONFIG_RCU_FANOUT_EXACT for odd fanout/leaf values rcu: Fix coccinelle warnings rcutorture: Stop tracking FSF's postal address rcutorture: Move checkarg to functions.sh rcutorture: Flag errors and warnings with color coding rcutorture: Record results from repeated runs of the same test scenario rcutorture: Test summary at end of run with less chattiness rcutorture: Update comment in kvm.sh listing typical RCU trace events rcutorture: Add tracing-enabled version of TREE08 ...
| * Merge branch 'rcu/next' of ↵Ingo Molnar2013-12-161-13/+76
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull v3.14 RCU updates from Paul E. McKenney. The main changes: * Update RCU documentation. * Miscellaneous fixes. * Add RCU torture scripts. * Static-analysis improvements. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * rcu: Don't activate RCU core on NO_HZ_FULL CPUsPaul E. McKenney2013-12-121-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever a CPU receives a scheduling-clock interrupt, RCU checks to see if the RCU core needs anything from this CPU. If so, RCU raises RCU_SOFTIRQ to carry out any needed processing. This approach has worked well historically, but it is undesirable on NO_HZ_FULL CPUs. Such CPUs are expected to spend almost all of their time in userspace, so that scheduling-clock interrupts can be disabled while there is only one runnable task on the CPU in question. Unfortunately, raising any softirq has the potential to wake up ksoftirqd, which would provide the second runnable task on that CPU, preventing disabling of scheduling-clock interrupts. What is needed instead is for RCU to leave NO_HZ_FULL CPUs alone, relying on the grace-period kthreads' quiescent-state forcing to do any needed RCU work on behalf of those CPUs. This commit therefore refrains from raising RCU_SOFTIRQ on any NO_HZ_FULL CPUs during any grace periods that have been in effect for less than one second. The one-second limit handles the case where an inappropriate workload is running on a NO_HZ_FULL CPU that features lots of scheduling-clock interrupts, but no idle or userspace time. Reported-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Mike Galbraith <bitbucket@online.de> Toasted-by: Frederic Weisbecker <fweisbec@gmail.com>
| | * rcu: Warn on allegedly impossible rcu_read_unlock_special() from irqLai Jiangshan2013-12-121-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit #10f39bb1b2c1 (rcu: protect __rcu_read_unlock() against scheduler-using irq handlers), it is no longer possible to enter the main body of rcu_read_lock_special() from an NMI, interrupt, or softirq handler. In theory, this implies that the check for "in_irq() || in_serving_softirq()" must always fail, so that in theory this check could be removed entirely. In practice, this commit wraps this condition with a WARN_ON_ONCE(). If this warning never triggers, then the condition will be removed entirely. [ paulmck: And one way of triggering the WARN_ON() is if a scheduling clock interrupt occurs in an RCU read-side critical section, setting RCU_READ_UNLOCK_NEED_QS, which is handled by rcu_read_unlock_special(). Updated this commit to return if only that bit was set. ] Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| | * rcu: Break call_rcu() deadlock involving scheduler and perfPaul E. McKenney2013-12-031-10/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave Jones got the following lockdep splat: > ====================================================== > [ INFO: possible circular locking dependency detected ] > 3.12.0-rc3+ #92 Not tainted > ------------------------------------------------------- > trinity-child2/15191 is trying to acquire lock: > (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50 > > but task is already holding lock: > (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230 > > which lock already depends on the new lock. > > > the existing dependency chain (in reverse order) is: > > -> #3 (&ctx->lock){-.-...}: > [<ffffffff810cc243>] lock_acquire+0x93/0x200 > [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80 > [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0 > [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0 > [<ffffffff81732052>] __schedule+0x1d2/0xa20 > [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0 > [<ffffffff817352b6>] retint_kernel+0x26/0x30 > [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50 > [<ffffffff813f0504>] pty_write+0x54/0x60 > [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0 > [<ffffffff813e5838>] tty_write+0x158/0x2d0 > [<ffffffff811c4850>] vfs_write+0xc0/0x1f0 > [<ffffffff811c52cc>] SyS_write+0x4c/0xa0 > [<ffffffff8173d4e4>] tracesys+0xdd/0xe2 > > -> #2 (&rq->lock){-.-.-.}: > [<ffffffff810cc243>] lock_acquire+0x93/0x200 > [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80 > [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0 > [<ffffffff81054336>] do_fork+0x126/0x460 > [<ffffffff81054696>] kernel_thread+0x26/0x30 > [<ffffffff8171ff93>] rest_init+0x23/0x140 > [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403 > [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c > [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4 > > -> #1 (&p->pi_lock){-.-.-.}: > [<ffffffff810cc243>] lock_acquire+0x93/0x200 > [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90 > [<ffffffff810979d1>] try_to_wake_up+0x31/0x350 > [<ffffffff81097d62>] default_wake_function+0x12/0x20 > [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40 > [<ffffffff8108ea38>] __wake_up_common+0x58/0x90 > [<ffffffff8108ff59>] __wake_up+0x39/0x50 > [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0 > [<ffffffff81111450>] __call_rcu+0x140/0x820 > [<ffffffff81111b8d>] call_rcu+0x1d/0x20 > [<ffffffff81093697>] cpu_attach_domain+0x287/0x360 > [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0 > [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a > [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202 > [<ffffffff817200be>] kernel_init+0xe/0x190 > [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0 > > -> #0 (&rdp->nocb_wq){......}: > [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0 > [<ffffffff810cc243>] lock_acquire+0x93/0x200 > [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90 > [<ffffffff8108ff43>] __wake_up+0x23/0x50 > [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0 > [<ffffffff81111450>] __call_rcu+0x140/0x820 > [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30 > [<ffffffff81149abf>] put_ctx+0x4f/0x70 > [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230 > [<ffffffff81056b8d>] do_exit+0x30d/0xcc0 > [<ffffffff8105893c>] do_group_exit+0x4c/0xc0 > [<ffffffff810589c4>] SyS_exit_group+0x14/0x20 > [<ffffffff8173d4e4>] tracesys+0xdd/0xe2 > > other info that might help us debug this: > > Chain exists of: > &rdp->nocb_wq --> &rq->lock --> &ctx->lock > > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(&ctx->lock); > lock(&rq->lock); > lock(&ctx->lock); > lock(&rdp->nocb_wq); > > *** DEADLOCK *** > > 1 lock held by trinity-child2/15191: > #0: (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230 > > stack backtrace: > CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92 > ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40 > ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0 > ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0 > Call Trace: > [<ffffffff8172a363>] dump_stack+0x4e/0x82 > [<ffffffff81726741>] print_circular_bug+0x200/0x20f > [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0 > [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60 > [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80 > [<ffffffff810cc243>] lock_acquire+0x93/0x200 > [<ffffffff8108ff43>] ? __wake_up+0x23/0x50 > [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90 > [<ffffffff8108ff43>] ? __wake_up+0x23/0x50 > [<ffffffff8108ff43>] __wake_up+0x23/0x50 > [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0 > [<ffffffff81111450>] __call_rcu+0x140/0x820 > [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50 > [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30 > [<ffffffff81149abf>] put_ctx+0x4f/0x70 > [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230 > [<ffffffff81056b8d>] do_exit+0x30d/0xcc0 > [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0 > [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10 > [<ffffffff8105893c>] do_group_exit+0x4c/0xc0 > [<ffffffff810589c4>] SyS_exit_group+0x14/0x20 > [<ffffffff8173d4e4>] tracesys+0xdd/0xe2 The underlying problem is that perf is invoking call_rcu() with the scheduler locks held, but in NOCB mode, call_rcu() will with high probability invoke the scheduler -- which just might want to use its locks. The reason that call_rcu() needs to invoke the scheduler is to wake up the corresponding rcuo callback-offload kthread, which does the job of starting up a grace period and invoking the callbacks afterwards. One solution (championed on a related problem by Lai Jiangshan) is to simply defer the wakeup to some point where scheduler locks are no longer held. Since we don't want to unnecessarily incur the cost of such deferral, the task before us is threefold: 1. Determine when it is likely that a relevant scheduler lock is held. 2. Defer the wakeup in such cases. 3. Ensure that all deferred wakeups eventually happen, preferably sooner rather than later. We use irqs_disabled_flags() as a proxy for relevant scheduler locks being held. This works because the relevant locks are always acquired with interrupts disabled. We may defer more often than needed, but that is at least safe. The wakeup deferral is tracked via a new field in the per-CPU and per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup. This flag is checked by the RCU core processing. The __rcu_pending() function now checks this flag, which causes rcu_check_callbacks() to initiate RCU core processing at each scheduling-clock interrupt where this flag is set. Of course this is not sufficient because scheduling-clock interrupts are often turned off (the things we used to be able to count on!). So the flags are also checked on entry to any state that RCU considers to be idle, which includes both NO_HZ_IDLE idle state and NO_HZ_FULL user-mode-execution state. This approach should allow call_rcu() to be invoked regardless of what locks you might be holding, the key word being "should". Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org>
| | * rcu: Fix and comment ordering around wait_event()Paul E. McKenney2013-12-031-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is all too easy to forget that wait_event() does not necessarily imply a full memory barrier. The case where it does not is where the condition transitions to true just as wait_event() starts execution. This is actually a feature: The standard use of wait_event() involves locking, in which case the locks provide the needed ordering (you hold a lock across the wake_up() and acquire that same lock after wait_event() returns). Given that I did forget that wait_event() does not necessarily imply a full memory barrier in one case, this commit fixes that case. This commit also adds comments calling out the placement of existing memory barriers relied on by wait_event() calls. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* | | Merge tag 'v3.13-rc4' into core/lockingIngo Molnar2013-12-171-2/+2
|\ \ \ | |/ / | | | | | | | | | | | | Merge Linux 3.13-rc4, to refresh this rather old tree with the latest fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | NOHZ: Check for nohz active instead of nohz enabledThomas Gleixner2013-11-191-2/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RCU and the fine grained idle time accounting functions check tick_nohz_enabled. But that variable is merily telling that NOHZ has been enabled in the config and not been disabled on the command line. But it does not tell anything about nohz being active. That's what all this should check for. Matthew reported, that the idle accounting on his old P1 machine showed bogus values, when he enabled NOHZ in the config and did not disable it on the kernel command line. The reason is that his machine uses (refined) jiffies as a clocksource which explains why the "fine" grained accounting went into lala land, because it depends on when the system goes and leaves idle relative to the jiffies increment. Provide a tick_nohz_active indicator and let RCU and the accounting code use this instead of tick_nohz_enable. Reported-and-tested-by: Matthew Whitehead <tedheadster@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: john.stultz@linaro.org Cc: mwhitehe@redhat.com Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1311132052240.30673@ionos.tec.linutronix.de
* | rcu: Apply smp_mb__after_unlock_lock() to preserve grace periodsPaul E. McKenney2013-12-161-0/+13
|/ | | | | | | | | | | | | | | | | | | | | RCU must ensure that there is the equivalent of a full memory barrier between any memory access preceding grace period and any memory access following that same grace period, regardless of which CPU(s) happen to execute the two memory accesses. Therefore, downgrading UNLOCK+LOCK to no longer imply a full memory barrier requires some adjustments to RCU. This commit therefore adds smp_mb__after_unlock_lock() invocations as needed after the RCU lock acquisitions that need to be part of a full-memory-barrier UNLOCK+LOCK. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <linux-arch@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1386799151-2219-7-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* locking: Move the rtmutex code to kernel/locking/Peter Zijlstra2013-11-061-1/+1
| | | | | | | | Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-p9ijt8div0hwldexwfm4nlhj@git.kernel.org [ Fixed build failure in kernel/rcu/tree_plugin.h. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
* rcu: Move RCU-related source code to kernel/rcu directoryPaul E. McKenney2013-10-151-0/+2831
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org>
OpenPOWER on IntegriCloud