summaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
authorIngo Molnar <mingo@kernel.org>2013-06-19 13:48:57 +0200
committerIngo Molnar <mingo@kernel.org>2013-06-19 13:48:57 +0200
commitb1fe9987b78755719e8627d58409174ba00c24de (patch)
tree9ee421e21ffeee79c0383a3f720cc126b0554e0b /Documentation
parent17858ca65eef148d335ffd4cfc09228a1c1cbfb5 (diff)
parentbe77f87c001b770f13fe742becb08b847d9542f1 (diff)
downloadop-kernel-dev-b1fe9987b78755719e8627d58409174ba00c24de.zip
op-kernel-dev-b1fe9987b78755719e8627d58409174ba00c24de.tar.gz
Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney: "The major changes for this series are: 1. Simplify RCU's grace-period and callback processing based on the new numbering for callbacks. These were posted to LKML at https://lkml.org/lkml/2013/5/20/330. 2. Documentation updates. These were posted to LKML at https://lkml.org/lkml/2013/5/20/348. 3. Miscellaneous fixes, including converting a few remaining printk() calls to pr_*(). These were posted to LKML at https://lkml.org/lkml/2013/5/20/324. 4. SRCU-related changes and fixes. These were posted to LKML at https://lkml.org/lkml/2013/5/20/425. 5. Removal of TINY_PREEMPT_RCU in favor of TREE_PREEMPT_RCU for single-CPU low-latency systems. These were posted to LKML at https://lkml.org/lkml/2013/5/20/427." Signed-off-by: Ingo Molnar <mingo@kernel.org>
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/RCU/checklist.txt6
-rw-r--r--Documentation/RCU/torture.txt6
-rw-r--r--Documentation/RCU/trace.txt100
-rw-r--r--Documentation/RCU/whatisRCU.txt22
-rw-r--r--Documentation/kernel-per-CPU-kthreads.txt47
-rw-r--r--Documentation/timers/NO_HZ.txt79
6 files changed, 125 insertions, 135 deletions
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 79e789b8..7703ec7 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -354,12 +354,6 @@ over a rather long period of time, but improvements are always welcome!
using RCU rather than SRCU, because RCU is almost always faster
and easier to use than is SRCU.
- If you need to enter your read-side critical section in a
- hardirq or exception handler, and then exit that same read-side
- critical section in the task that was interrupted, then you need
- to srcu_read_lock_raw() and srcu_read_unlock_raw(), which avoid
- the lockdep checking that would otherwise this practice illegal.
-
Also unlike other forms of RCU, explicit initialization
and cleanup is required via init_srcu_struct() and
cleanup_srcu_struct(). These are passed a "struct srcu_struct"
diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt
index 7dce8a1..d8a5023 100644
--- a/Documentation/RCU/torture.txt
+++ b/Documentation/RCU/torture.txt
@@ -182,12 +182,6 @@ torture_type The type of RCU to test, with string values as follows:
"srcu_expedited": srcu_read_lock(), srcu_read_unlock() and
synchronize_srcu_expedited().
- "srcu_raw": srcu_read_lock_raw(), srcu_read_unlock_raw(),
- and call_srcu().
-
- "srcu_raw_sync": srcu_read_lock_raw(), srcu_read_unlock_raw(),
- and synchronize_srcu().
-
"sched": preempt_disable(), preempt_enable(), and
call_rcu_sched().
diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
index c776968..f3778f8 100644
--- a/Documentation/RCU/trace.txt
+++ b/Documentation/RCU/trace.txt
@@ -530,113 +530,21 @@ o "nos" counts the number of times we balked for other
reasons, e.g., the grace period ended first.
-CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU debugfs Files and Formats
+CONFIG_TINY_RCU debugfs Files and Formats
These implementations of RCU provides a single debugfs file under the
top-level directory RCU, namely rcu/rcudata, which displays fields in
-rcu_bh_ctrlblk, rcu_sched_ctrlblk and, for CONFIG_TINY_PREEMPT_RCU,
-rcu_preempt_ctrlblk.
+rcu_bh_ctrlblk and rcu_sched_ctrlblk.
The output of "cat rcu/rcudata" is as follows:
-rcu_preempt: qlen=24 gp=1097669 g197/p197/c197 tasks=...
- ttb=. btg=no ntb=184 neb=0 nnb=183 j=01f7 bt=0274
- normal balk: nt=1097669 gt=0 bt=371 b=0 ny=25073378 nos=0
- exp balk: bt=0 nos=0
rcu_sched: qlen: 0
rcu_bh: qlen: 0
-This is split into rcu_preempt, rcu_sched, and rcu_bh sections, with the
-rcu_preempt section appearing only in CONFIG_TINY_PREEMPT_RCU builds.
-The last three lines of the rcu_preempt section appear only in
-CONFIG_RCU_BOOST kernel builds. The fields are as follows:
+This is split into rcu_sched and rcu_bh sections. The field is as
+follows:
o "qlen" is the number of RCU callbacks currently waiting either
for an RCU grace period or waiting to be invoked. This is the
only field present for rcu_sched and rcu_bh, due to the
short-circuiting of grace period in those two cases.
-
-o "gp" is the number of grace periods that have completed.
-
-o "g197/p197/c197" displays the grace-period state, with the
- "g" number being the number of grace periods that have started
- (mod 256), the "p" number being the number of grace periods
- that the CPU has responded to (also mod 256), and the "c"
- number being the number of grace periods that have completed
- (once again mode 256).
-
- Why have both "gp" and "g"? Because the data flowing into
- "gp" is only present in a CONFIG_RCU_TRACE kernel.
-
-o "tasks" is a set of bits. The first bit is "T" if there are
- currently tasks that have recently blocked within an RCU
- read-side critical section, the second bit is "N" if any of the
- aforementioned tasks are blocking the current RCU grace period,
- and the third bit is "E" if any of the aforementioned tasks are
- blocking the current expedited grace period. Each bit is "."
- if the corresponding condition does not hold.
-
-o "ttb" is a single bit. It is "B" if any of the blocked tasks
- need to be priority boosted and "." otherwise.
-
-o "btg" indicates whether boosting has been carried out during
- the current grace period, with "exp" indicating that boosting
- is in progress for an expedited grace period, "no" indicating
- that boosting has not yet started for a normal grace period,
- "begun" indicating that boosting has bebug for a normal grace
- period, and "done" indicating that boosting has completed for
- a normal grace period.
-
-o "ntb" is the total number of tasks subjected to RCU priority boosting
- periods since boot.
-
-o "neb" is the number of expedited grace periods that have had
- to resort to RCU priority boosting since boot.
-
-o "nnb" is the number of normal grace periods that have had
- to resort to RCU priority boosting since boot.
-
-o "j" is the low-order 16 bits of the jiffies counter in hexadecimal.
-
-o "bt" is the low-order 16 bits of the value that the jiffies counter
- will have at the next time that boosting is scheduled to begin.
-
-o In the line beginning with "normal balk", the fields are as follows:
-
- o "nt" is the number of times that the system balked from
- boosting because there were no blocked tasks to boost.
- Note that the system will balk from boosting even if the
- grace period is overdue when the currently running task
- is looping within an RCU read-side critical section.
- There is no point in boosting in this case, because
- boosting a running task won't make it run any faster.
-
- o "gt" is the number of times that the system balked
- from boosting because, although there were blocked tasks,
- none of them were preventing the current grace period
- from completing.
-
- o "bt" is the number of times that the system balked
- from boosting because boosting was already in progress.
-
- o "b" is the number of times that the system balked from
- boosting because boosting had already completed for
- the grace period in question.
-
- o "ny" is the number of times that the system balked from
- boosting because it was not yet time to start boosting
- the grace period in question.
-
- o "nos" is the number of times that the system balked from
- boosting for inexplicable ("not otherwise specified")
- reasons. This can actually happen due to races involving
- increments of the jiffies counter.
-
-o In the line beginning with "exp balk", the fields are as follows:
-
- o "bt" is the number of times that the system balked from
- boosting because there were no blocked tasks to boost.
-
- o "nos" is the number of times that the system balked from
- boosting for inexplicable ("not otherwise specified")
- reasons.
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 10df0b8..0f0fb7c 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -842,9 +842,7 @@ SRCU: Critical sections Grace period Barrier
srcu_read_lock synchronize_srcu srcu_barrier
srcu_read_unlock call_srcu
- srcu_read_lock_raw synchronize_srcu_expedited
- srcu_read_unlock_raw
- srcu_dereference
+ srcu_dereference synchronize_srcu_expedited
SRCU: Initialization/cleanup
init_srcu_struct
@@ -865,38 +863,32 @@ list can be helpful:
a. Will readers need to block? If so, you need SRCU.
-b. Is it necessary to start a read-side critical section in a
- hardirq handler or exception handler, and then to complete
- this read-side critical section in the task that was
- interrupted? If so, you need SRCU's srcu_read_lock_raw() and
- srcu_read_unlock_raw() primitives.
-
-c. What about the -rt patchset? If readers would need to block
+b. What about the -rt patchset? If readers would need to block
in an non-rt kernel, you need SRCU. If readers would block
in a -rt kernel, but not in a non-rt kernel, SRCU is not
necessary.
-d. Do you need to treat NMI handlers, hardirq handlers,
+c. Do you need to treat NMI handlers, hardirq handlers,
and code segments with preemption disabled (whether
via preempt_disable(), local_irq_save(), local_bh_disable(),
or some other mechanism) as if they were explicit RCU readers?
If so, RCU-sched is the only choice that will work for you.
-e. Do you need RCU grace periods to complete even in the face
+d. Do you need RCU grace periods to complete even in the face
of softirq monopolization of one or more of the CPUs? For
example, is your code subject to network-based denial-of-service
attacks? If so, you need RCU-bh.
-f. Is your workload too update-intensive for normal use of
+e. Is your workload too update-intensive for normal use of
RCU, but inappropriate for other synchronization mechanisms?
If so, consider SLAB_DESTROY_BY_RCU. But please be careful!
-g. Do you need read-side critical sections that are respected
+f. Do you need read-side critical sections that are respected
even though they are in the middle of the idle loop, during
user-mode execution, or on an offlined CPU? If so, SRCU is the
only choice that will work for you.
-h. Otherwise, use RCU.
+g. Otherwise, use RCU.
Of course, this all assumes that you have determined that RCU is in fact
the right tool for your job.
diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index cbf7ae41..5f39ef5 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -157,6 +157,53 @@ RCU_SOFTIRQ: Do at least one of the following:
calls and by forcing both kernel threads and interrupts
to execute elsewhere.
+Name: kworker/%u:%d%s (cpu, id, priority)
+Purpose: Execute workqueue requests
+To reduce its OS jitter, do any of the following:
+1. Run your workload at a real-time priority, which will allow
+ preempting the kworker daemons.
+2. Do any of the following needed to avoid jitter that your
+ application cannot tolerate:
+ a. Build your kernel with CONFIG_SLUB=y rather than
+ CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
+ use of each CPU's workqueues to run its cache_reap()
+ function.
+ b. Avoid using oprofile, thus avoiding OS jitter from
+ wq_sync_buffer().
+ c. Limit your CPU frequency so that a CPU-frequency
+ governor is not required, possibly enlisting the aid of
+ special heatsinks or other cooling technologies. If done
+ correctly, and if you CPU architecture permits, you should
+ be able to build your kernel with CONFIG_CPU_FREQ=n to
+ avoid the CPU-frequency governor periodically running
+ on each CPU, including cs_dbs_timer() and od_dbs_timer().
+ WARNING: Please check your CPU specifications to
+ make sure that this is safe on your particular system.
+ d. It is not possible to entirely get rid of OS jitter
+ from vmstat_update() on CONFIG_SMP=y systems, but you
+ can decrease its frequency by writing a large value to
+ /proc/sys/vm/stat_interval. The default value is HZ,
+ for an interval of one second. Of course, larger values
+ will make your virtual-memory statistics update more
+ slowly. Of course, you can also run your workload at
+ a real-time priority, thus preempting vmstat_update().
+ e. If running on high-end powerpc servers, build with
+ CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
+ daemon from running on each CPU every second or so.
+ (This will require editing Kconfig files and will defeat
+ this platform's RAS functionality.) This avoids jitter
+ due to the rtas_event_scan() function.
+ WARNING: Please check your CPU specifications to
+ make sure that this is safe on your particular system.
+ f. If running on Cell Processor, build your kernel with
+ CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
+ spu_gov_work().
+ WARNING: Please check your CPU specifications to
+ make sure that this is safe on your particular system.
+ g. If running on PowerMAC, build your kernel with
+ CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
+ avoiding OS jitter from rackmeter_do_timer().
+
Name: rcuc/%u
Purpose: Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.
To reduce its OS jitter, do at least one of the following:
diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt
index 5b53220..8869758 100644
--- a/Documentation/timers/NO_HZ.txt
+++ b/Documentation/timers/NO_HZ.txt
@@ -7,21 +7,59 @@ efficiency and reducing OS jitter. Reducing OS jitter is important for
some types of computationally intensive high-performance computing (HPC)
applications and for real-time applications.
-There are two main contexts in which the number of scheduling-clock
-interrupts can be reduced compared to the old-school approach of sending
-a scheduling-clock interrupt to all CPUs every jiffy whether they need
-it or not (CONFIG_HZ_PERIODIC=y or CONFIG_NO_HZ=n for older kernels):
+There are three main ways of managing scheduling-clock interrupts
+(also known as "scheduling-clock ticks" or simply "ticks"):
-1. Idle CPUs (CONFIG_NO_HZ_IDLE=y or CONFIG_NO_HZ=y for older kernels).
+1. Never omit scheduling-clock ticks (CONFIG_HZ_PERIODIC=y or
+ CONFIG_NO_HZ=n for older kernels). You normally will -not-
+ want to choose this option.
-2. CPUs having only one runnable task (CONFIG_NO_HZ_FULL=y).
+2. Omit scheduling-clock ticks on idle CPUs (CONFIG_NO_HZ_IDLE=y or
+ CONFIG_NO_HZ=y for older kernels). This is the most common
+ approach, and should be the default.
-These two cases are described in the following two sections, followed
+3. Omit scheduling-clock ticks on CPUs that are either idle or that
+ have only one runnable task (CONFIG_NO_HZ_FULL=y). Unless you
+ are running realtime applications or certain types of HPC
+ workloads, you will normally -not- want this option.
+
+These three cases are described in the following three sections, followed
by a third section on RCU-specific considerations and a fourth and final
section listing known issues.
-IDLE CPUs
+NEVER OMIT SCHEDULING-CLOCK TICKS
+
+Very old versions of Linux from the 1990s and the very early 2000s
+are incapable of omitting scheduling-clock ticks. It turns out that
+there are some situations where this old-school approach is still the
+right approach, for example, in heavy workloads with lots of tasks
+that use short bursts of CPU, where there are very frequent idle
+periods, but where these idle periods are also quite short (tens or
+hundreds of microseconds). For these types of workloads, scheduling
+clock interrupts will normally be delivered any way because there
+will frequently be multiple runnable tasks per CPU. In these cases,
+attempting to turn off the scheduling clock interrupt will have no effect
+other than increasing the overhead of switching to and from idle and
+transitioning between user and kernel execution.
+
+This mode of operation can be selected using CONFIG_HZ_PERIODIC=y (or
+CONFIG_NO_HZ=n for older kernels).
+
+However, if you are instead running a light workload with long idle
+periods, failing to omit scheduling-clock interrupts will result in
+excessive power consumption. This is especially bad on battery-powered
+devices, where it results in extremely short battery lifetimes. If you
+are running light workloads, you should therefore read the following
+section.
+
+In addition, if you are running either a real-time workload or an HPC
+workload with short iterations, the scheduling-clock interrupts can
+degrade your applications performance. If this describes your workload,
+you should read the following two sections.
+
+
+OMIT SCHEDULING-CLOCK TICKS FOR IDLE CPUs
If a CPU is idle, there is little point in sending it a scheduling-clock
interrupt. After all, the primary purpose of a scheduling-clock interrupt
@@ -59,10 +97,12 @@ By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling
dyntick-idle mode.
-CPUs WITH ONLY ONE RUNNABLE TASK
+OMIT SCHEDULING-CLOCK TICKS FOR CPUs WITH ONLY ONE RUNNABLE TASK
If a CPU has only one runnable task, there is little point in sending it
a scheduling-clock interrupt because there is no other task to switch to.
+Note that omitting scheduling-clock ticks for CPUs with only one runnable
+task implies also omitting them for idle CPUs.
The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
sending scheduling-clock interrupts to CPUs with a single runnable task,
@@ -238,6 +278,11 @@ o Adaptive-ticks does not do anything unless there is only one
single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER
tasks, even though these interrupts are unnecessary.
+ And even when there are multiple runnable tasks on a given CPU,
+ there is little point in interrupting that CPU until the current
+ running task's timeslice expires, which is almost always way
+ longer than the time of the next scheduling-clock interrupt.
+
Better handling of these sorts of situations is future work.
o A reboot is required to reconfigure both adaptive idle and RCU
@@ -268,6 +313,16 @@ o Unless all CPUs are idle, at least one CPU must keep the
scheduling-clock interrupt going in order to support accurate
timekeeping.
-o If there are adaptive-ticks CPUs, there will be at least one
- CPU keeping the scheduling-clock interrupt going, even if all
- CPUs are otherwise idle.
+o If there might potentially be some adaptive-ticks CPUs, there
+ will be at least one CPU keeping the scheduling-clock interrupt
+ going, even if all CPUs are otherwise idle.
+
+ Better handling of this situation is ongoing work.
+
+o Some process-handling operations still require the occasional
+ scheduling-clock tick. These operations include calculating CPU
+ load, maintaining sched average, computing CFS entity vruntime,
+ computing avenrun, and carrying out load balancing. They are
+ currently accommodated by scheduling-clock tick every second
+ or so. On-going work will eliminate the need even for these
+ infrequent scheduling-clock ticks.
OpenPOWER on IntegriCloud