diff options
Diffstat (limited to 'Documentation/RCU')
-rw-r--r-- | Documentation/RCU/00-INDEX | 22 | ||||
-rw-r--r-- | Documentation/RCU/NMI-RCU.txt | 115 | ||||
-rw-r--r-- | Documentation/RCU/RTFP.txt | 745 | ||||
-rw-r--r-- | Documentation/RCU/UP.txt | 119 | ||||
-rw-r--r-- | Documentation/RCU/arrayRCU.txt | 141 | ||||
-rw-r--r-- | Documentation/RCU/checklist.txt | 300 | ||||
-rw-r--r-- | Documentation/RCU/listRCU.txt | 315 | ||||
-rw-r--r-- | Documentation/RCU/rcu.txt | 138 | ||||
-rw-r--r-- | Documentation/RCU/rcuref.txt | 66 | ||||
-rw-r--r-- | Documentation/RCU/torture.txt | 180 | ||||
-rw-r--r-- | Documentation/RCU/whatisRCU.txt | 936 |
11 files changed, 3077 insertions, 0 deletions
diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX new file mode 100644 index 0000000..461481d --- /dev/null +++ b/Documentation/RCU/00-INDEX @@ -0,0 +1,22 @@ +00-INDEX + - This file +arrayRCU.txt + - Using RCU to Protect Read-Mostly Arrays +checklist.txt + - Review Checklist for RCU Patches +listRCU.txt + - Using RCU to Protect Read-Mostly Linked Lists +NMI-RCU.txt + - Using RCU to Protect Dynamic NMI Handlers +rcuref.txt + - Reference-count design for elements of lists/arrays protected by RCU +rcu.txt + - RCU Concepts +RTFP.txt + - List of RCU papers (bibliography) going back to 1980. +torture.txt + - RCU Torture Test Operation (CONFIG_RCU_TORTURE_TEST) +UP.txt + - RCU on Uniprocessor Systems +whatisRCU.txt + - What is RCU? diff --git a/Documentation/RCU/NMI-RCU.txt b/Documentation/RCU/NMI-RCU.txt new file mode 100644 index 0000000..a6d32e6 --- /dev/null +++ b/Documentation/RCU/NMI-RCU.txt @@ -0,0 +1,115 @@ +Using RCU to Protect Dynamic NMI Handlers + + +Although RCU is usually used to protect read-mostly data structures, +it is possible to use RCU to provide dynamic non-maskable interrupt +handlers, as well as dynamic irq handlers. This document describes +how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer +work in "arch/i386/oprofile/nmi_timer_int.c" and in +"arch/i386/kernel/traps.c". + +The relevant pieces of code are listed below, each followed by a +brief explanation. + + static int dummy_nmi_callback(struct pt_regs *regs, int cpu) + { + return 0; + } + +The dummy_nmi_callback() function is a "dummy" NMI handler that does +nothing, but returns zero, thus saying that it did nothing, allowing +the NMI handler to take the default machine-specific action. + + static nmi_callback_t nmi_callback = dummy_nmi_callback; + +This nmi_callback variable is a global function pointer to the current +NMI handler. + + void do_nmi(struct pt_regs * regs, long error_code) + { + int cpu; + + nmi_enter(); + + cpu = smp_processor_id(); + ++nmi_count(cpu); + + if (!rcu_dereference(nmi_callback)(regs, cpu)) + default_do_nmi(regs); + + nmi_exit(); + } + +The do_nmi() function processes each NMI. It first disables preemption +in the same way that a hardware irq would, then increments the per-CPU +count of NMIs. It then invokes the NMI handler stored in the nmi_callback +function pointer. If this handler returns zero, do_nmi() invokes the +default_do_nmi() function to handle a machine-specific NMI. Finally, +preemption is restored. + +Strictly speaking, rcu_dereference() is not needed, since this code runs +only on i386, which does not need rcu_dereference() anyway. However, +it is a good documentation aid, particularly for anyone attempting to +do something similar on Alpha. + +Quick Quiz: Why might the rcu_dereference() be necessary on Alpha, + given that the code referenced by the pointer is read-only? + + +Back to the discussion of NMI and RCU... + + void set_nmi_callback(nmi_callback_t callback) + { + rcu_assign_pointer(nmi_callback, callback); + } + +The set_nmi_callback() function registers an NMI handler. Note that any +data that is to be used by the callback must be initialized up -before- +the call to set_nmi_callback(). On architectures that do not order +writes, the rcu_assign_pointer() ensures that the NMI handler sees the +initialized values. + + void unset_nmi_callback(void) + { + rcu_assign_pointer(nmi_callback, dummy_nmi_callback); + } + +This function unregisters an NMI handler, restoring the original +dummy_nmi_handler(). However, there may well be an NMI handler +currently executing on some other CPU. We therefore cannot free +up any data structures used by the old NMI handler until execution +of it completes on all other CPUs. + +One way to accomplish this is via synchronize_sched(), perhaps as +follows: + + unset_nmi_callback(); + synchronize_sched(); + kfree(my_nmi_data); + +This works because synchronize_sched() blocks until all CPUs complete +any preemption-disabled segments of code that they were executing. +Since NMI handlers disable preemption, synchronize_sched() is guaranteed +not to return until all ongoing NMI handlers exit. It is therefore safe +to free up the handler's data as soon as synchronize_sched() returns. + +Important note: for this to work, the architecture in question must +invoke irq_enter() and irq_exit() on NMI entry and exit, respectively. + + +Answer to Quick Quiz + + Why might the rcu_dereference() be necessary on Alpha, given + that the code referenced by the pointer is read-only? + + Answer: The caller to set_nmi_callback() might well have + initialized some data that is to be used by the + new NMI handler. In this case, the rcu_dereference() + would be needed, because otherwise a CPU that received + an NMI just after the new handler was set might see + the pointer to the new NMI handler, but the old + pre-initialized version of the handler's data. + + More important, the rcu_dereference() makes it clear + to someone reading the code that the pointer is being + protected by RCU. diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt new file mode 100644 index 0000000..9f711d2 --- /dev/null +++ b/Documentation/RCU/RTFP.txt @@ -0,0 +1,745 @@ +Read the F-ing Papers! + + +This document describes RCU-related publications, and is followed by +the corresponding bibtex entries. A number of the publications may +be found at http://www.rdrop.com/users/paulmck/RCU/. + +The first thing resembling RCU was published in 1980, when Kung and Lehman +[Kung80] recommended use of a garbage collector to defer destruction +of nodes in a parallel binary search tree in order to simplify its +implementation. This works well in environments that have garbage +collectors, but most production garbage collectors incur significant +overhead. + +In 1982, Manber and Ladner [Manber82,Manber84] recommended deferring +destruction until all threads running at that time have terminated, again +for a parallel binary search tree. This approach works well in systems +with short-lived threads, such as the K42 research operating system. +However, Linux has long-lived tasks, so more is needed. + +In 1986, Hennessy, Osisek, and Seigh [Hennessy89] introduced passive +serialization, which is an RCU-like mechanism that relies on the presence +of "quiescent states" in the VM/XA hypervisor that are guaranteed not +to be referencing the data structure. However, this mechanism was not +optimized for modern computer systems, which is not surprising given +that these overheads were not so expensive in the mid-80s. Nonetheless, +passive serialization appears to be the first deferred-destruction +mechanism to be used in production. Furthermore, the relevant patent has +lapsed, so this approach may be used in non-GPL software, if desired. +(In contrast, use of RCU is permitted only in software licensed under +GPL. Sorry!!!) + +In 1990, Pugh [Pugh90] noted that explicitly tracking which threads +were reading a given data structure permitted deferred free to operate +in the presence of non-terminating threads. However, this explicit +tracking imposes significant read-side overhead, which is undesirable +in read-mostly situations. This algorithm does take pains to avoid +write-side contention and parallelize the other write-side overheads by +providing a fine-grained locking design, however, it would be interesting +to see how much of the performance advantage reported in 1990 remains +in 2004. + +At about this same time, Adams [Adams91] described ``chaotic relaxation'', +where the normal barriers between successive iterations of convergent +numerical algorithms are relaxed, so that iteration $n$ might use +data from iteration $n-1$ or even $n-2$. This introduces error, +which typically slows convergence and thus increases the number of +iterations required. However, this increase is sometimes more than made +up for by a reduction in the number of expensive barrier operations, +which are otherwise required to synchronize the threads at the end +of each iteration. Unfortunately, chaotic relaxation requires highly +structured data, such as the matrices used in scientific programs, and +is thus inapplicable to most data structures in operating-system kernels. + +In 1992, Henry (now Alexia) Massalin completed a dissertation advising +parallel programmers to defer processing when feasible to simplify +synchronization. RCU makes extremely heavy use of this advice. + +In 1993, Jacobson [Jacobson93] verbally described what is perhaps the +simplest deferred-free technique: simply waiting a fixed amount of time +before freeing blocks awaiting deferred free. Jacobson did not describe +any write-side changes he might have made in this work using SGI's Irix +kernel. Aju John published a similar technique in 1995 [AjuJohn95]. +This works well if there is a well-defined upper bound on the length of +time that reading threads can hold references, as there might well be in +hard real-time systems. However, if this time is exceeded, perhaps due +to preemption, excessive interrupts, or larger-than-anticipated load, +memory corruption can ensue, with no reasonable means of diagnosis. +Jacobson's technique is therefore inappropriate for use in production +operating-system kernels, except when such kernels can provide hard +real-time response guarantees for all operations. + +Also in 1995, Pu et al. [Pu95a] applied a technique similar to that of Pugh's +read-side-tracking to permit replugging of algorithms within a commercial +Unix operating system. However, this replugging permitted only a single +reader at a time. The following year, this same group of researchers +extended their technique to allow for multiple readers [Cowan96a]. +Their approach requires memory barriers (and thus pipeline stalls), +but reduces memory latency, contention, and locking overheads. + +1995 also saw the first publication of DYNIX/ptx's RCU mechanism +[Slingwine95], which was optimized for modern CPU architectures, +and was successfully applied to a number of situations within the +DYNIX/ptx kernel. The corresponding conference paper appeared in 1998 +[McKenney98]. + +In 1999, the Tornado and K42 groups described their "generations" +mechanism, which quite similar to RCU [Gamsa99]. These operating systems +made pervasive use of RCU in place of "existence locks", which greatly +simplifies locking hierarchies. + +2001 saw the first RCU presentation involving Linux [McKenney01a] +at OLS. The resulting abundance of RCU patches was presented the +following year [McKenney02a], and use of RCU in dcache was first +described that same year [Linder02a]. + +Also in 2002, Michael [Michael02b,Michael02a] presented "hazard-pointer" +techniques that defer the destruction of data structures to simplify +non-blocking synchronization (wait-free synchronization, lock-free +synchronization, and obstruction-free synchronization are all examples of +non-blocking synchronization). In particular, this technique eliminates +locking, reduces contention, reduces memory latency for readers, and +parallelizes pipeline stalls and memory latency for writers. However, +these techniques still impose significant read-side overhead in the +form of memory barriers. Researchers at Sun worked along similar lines +in the same timeframe [HerlihyLM02]. These techniques can be thought +of as inside-out reference counts, where the count is represented by the +number of hazard pointers referencing a given data structure (rather than +the more conventional counter field within the data structure itself). + +By the same token, RCU can be thought of as a "bulk reference count", +where some form of reference counter covers all reference by a given CPU +or thread during a set timeframe. This timeframe is related to, but +not necessarily exactly the same as, an RCU grace period. In classic +RCU, the reference counter is the per-CPU bit in the "bitmask" field, +and each such bit covers all references that might have been made by +the corresponding CPU during the prior grace period. Of course, RCU +can be thought of in other terms as well. + +In 2003, the K42 group described how RCU could be used to create +hot-pluggable implementations of operating-system functions [Appavoo03a]. +Later that year saw a paper describing an RCU implementation of System +V IPC [Arcangeli03], and an introduction to RCU in Linux Journal +[McKenney03a]. + +2004 has seen a Linux-Journal article on use of RCU in dcache +[McKenney04a], a performance comparison of locking to RCU on several +different CPUs [McKenney04b], a dissertation describing use of RCU in a +number of operating-system kernels [PaulEdwardMcKenneyPhD], a paper +describing how to make RCU safe for soft-realtime applications [Sarma04c], +and a paper describing SELinux performance with RCU [JamesMorris04b]. + +2005 brought further adaptation of RCU to realtime use, permitting +preemption of RCU realtime critical sections [PaulMcKenney05a, +PaulMcKenney05b]. + +2006 saw the first best-paper award for an RCU paper [ThomasEHart2006a], +as well as further work on efficient implementations of preemptible +RCU [PaulEMcKenney2006b], but priority-boosting of RCU read-side critical +sections proved elusive. An RCU implementation permitting general +blocking in read-side critical sections appeared [PaulEMcKenney2006c], +Robert Olsson described an RCU-protected trie-hash combination +[RobertOlsson2006a]. + +2007 saw the journal version of the award-winning RCU paper from 2006 +[ThomasEHart2007a], as well as a paper demonstrating use of Promela +and Spin to mechanically verify an optimization to Oleg Nesterov's +QRCU [PaulEMcKenney2007QRCUspin], a design document describing +preemptible RCU [PaulEMcKenney2007PreemptibleRCU], and the three-part +LWN "What is RCU?" series [PaulEMcKenney2007WhatIsRCUFundamentally, +PaulEMcKenney2008WhatIsRCUUsage, and PaulEMcKenney2008WhatIsRCUAPI]. + +Bibtex Entries + +@article{Kung80 +,author="H. T. Kung and Q. Lehman" +,title="Concurrent Maintenance of Binary Search Trees" +,Year="1980" +,Month="September" +,journal="ACM Transactions on Database Systems" +,volume="5" +,number="3" +,pages="354-382" +} + +@techreport{Manber82 +,author="Udi Manber and Richard E. Ladner" +,title="Concurrency Control in a Dynamic Search Structure" +,institution="Department of Computer Science, University of Washington" +,address="Seattle, Washington" +,year="1982" +,number="82-01-01" +,month="January" +,pages="28" +} + +@article{Manber84 +,author="Udi Manber and Richard E. Ladner" +,title="Concurrency Control in a Dynamic Search Structure" +,Year="1984" +,Month="September" +,journal="ACM Transactions on Database Systems" +,volume="9" +,number="3" +,pages="439-455" +} + +@techreport{Hennessy89 +,author="James P. Hennessy and Damian L. Osisek and Joseph W. {Seigh II}" +,title="Passive Serialization in a Multitasking Environment" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1989" +,number="US Patent 4,809,168 (lapsed)" +,month="February" +,pages="11" +} + +@techreport{Pugh90 +,author="William Pugh" +,title="Concurrent Maintenance of Skip Lists" +,institution="Institute of Advanced Computer Science Studies, Department of Computer Science, University of Maryland" +,address="College Park, Maryland" +,year="1990" +,number="CS-TR-2222.1" +,month="June" +} + +@Book{Adams91 +,Author="Gregory R. Adams" +,title="Concurrent Programming, Principles, and Practices" +,Publisher="Benjamin Cummins" +,Year="1991" +} + +@phdthesis{HMassalinPhD +,author="H. Massalin" +,title="Synthesis: An Efficient Implementation of Fundamental Operating +System Services" +,school="Columbia University" +,address="New York, NY" +,year="1992" +,annotation=" + Mondo optimizing compiler. + Wait-free stuff. + Good advice: defer work to avoid synchronization. +" +} + +@unpublished{Jacobson93 +,author="Van Jacobson" +,title="Avoid Read-Side Locking Via Delayed Free" +,year="1993" +,month="September" +,note="Verbal discussion" +} + +@Conference{AjuJohn95 +,Author="Aju John" +,Title="Dynamic vnodes -- Design and Implementation" +,Booktitle="{USENIX Winter 1995}" +,Publisher="USENIX Association" +,Month="January" +,Year="1995" +,pages="11-23" +,Address="New Orleans, LA" +} + +@conference{Pu95a, +Author = "Calton Pu and Tito Autrey and Andrew Black and Charles Consel and +Crispin Cowan and Jon Inouye and Lakshmi Kethana and Jonathan Walpole and +Ke Zhang", +Title = "Optimistic Incremental Specialization: Streamlining a Commercial +Operating System", +Booktitle = "15\textsuperscript{th} ACM Symposium on +Operating Systems Principles (SOSP'95)", +address = "Copper Mountain, CO", +month="December", +year="1995", +pages="314-321", +annotation=" + Uses a replugger, but with a flag to signal when people are + using the resource at hand. Only one reader at a time. +" +} + +@conference{Cowan96a, +Author = "Crispin Cowan and Tito Autrey and Charles Krasic and +Calton Pu and Jonathan Walpole", +Title = "Fast Concurrent Dynamic Linking for an Adaptive Operating System", +Booktitle = "International Conference on Configurable Distributed Systems +(ICCDS'96)", +address = "Annapolis, MD", +month="May", +year="1996", +pages="108", +isbn="0-8186-7395-8", +annotation=" + Uses a replugger, but with a counter to signal when people are + using the resource at hand. Allows multiple readers. +" +} + +@techreport{Slingwine95 +,author="John D. Slingwine and Paul E. McKenney" +,title="Apparatus and Method for Achieving Reduced Overhead Mutual +Exclusion and Maintaining Coherency in a Multiprocessor System +Utilizing Execution History and Thread Monitoring" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1995" +,number="US Patent 5,442,758 (contributed under GPL)" +,month="August" +} + +@techreport{Slingwine97 +,author="John D. Slingwine and Paul E. McKenney" +,title="Method for maintaining data coherency using thread +activity summaries in a multicomputer system" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1997" +,number="US Patent 5,608,893 (contributed under GPL)" +,month="March" +} + +@techreport{Slingwine98 +,author="John D. Slingwine and Paul E. McKenney" +,title="Apparatus and method for achieving reduced overhead +mutual exclusion and maintaining coherency in a multiprocessor +system utilizing execution history and thread monitoring" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1998" +,number="US Patent 5,727,209 (contributed under GPL)" +,month="March" +} + +@Conference{McKenney98 +,Author="Paul E. McKenney and John D. Slingwine" +,Title="Read-Copy Update: Using Execution History to Solve Concurrency +Problems" +,Booktitle="{Parallel and Distributed Computing and Systems}" +,Month="October" +,Year="1998" +,pages="509-518" +,Address="Las Vegas, NV" +} + +@Conference{Gamsa99 +,Author="Ben Gamsa and Orran Krieger and Jonathan Appavoo and Michael Stumm" +,Title="Tornado: Maximizing Locality and Concurrency in a Shared Memory +Multiprocessor Operating System" +,Booktitle="{Proceedings of the 3\textsuperscript{rd} Symposium on +Operating System Design and Implementation}" +,Month="February" +,Year="1999" +,pages="87-100" +,Address="New Orleans, LA" +} + +@techreport{Slingwine01 +,author="John D. Slingwine and Paul E. McKenney" +,title="Apparatus and method for achieving reduced overhead +mutual exclusion and maintaining coherency in a multiprocessor +system utilizing execution history and thread monitoring" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="2001" +,number="US Patent 5,219,690 (contributed under GPL)" +,month="April" +} + +@Conference{McKenney01a +,Author="Paul E. McKenney and Jonathan Appavoo and Andi Kleen and +Orran Krieger and Rusty Russell and Dipankar Sarma and Maneesh Soni" +,Title="Read-Copy Update" +,Booktitle="{Ottawa Linux Symposium}" +,Month="July" +,Year="2001" +,note="Available: +\url{http://www.linuxsymposium.org/2001/abstracts/readcopy.php} +\url{http://www.rdrop.com/users/paulmck/rclock/rclock_OLS.2001.05.01c.pdf} +[Viewed June 23, 2004]" +annotation=" +Described RCU, and presented some patches implementing and using it in +the Linux kernel. +" +} + +@Conference{Linder02a +,Author="Hanna Linder and Dipankar Sarma and Maneesh Soni" +,Title="Scalability of the Directory Entry Cache" +,Booktitle="{Ottawa Linux Symposium}" +,Month="June" +,Year="2002" +,pages="289-300" +} + +@Conference{McKenney02a +,Author="Paul E. McKenney and Dipankar Sarma and +Andrea Arcangeli and Andi Kleen and Orran Krieger and Rusty Russell" +,Title="Read-Copy Update" +,Booktitle="{Ottawa Linux Symposium}" +,Month="June" +,Year="2002" +,pages="338-367" +,note="Available: +\url{http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz} +[Viewed June 23, 2004]" +} + +@conference{Michael02a +,author="Maged M. Michael" +,title="Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic +Reads and Writes" +,Year="2002" +,Month="August" +,booktitle="{Proceedings of the 21\textsuperscript{st} Annual ACM +Symposium on Principles of Distributed Computing}" +,pages="21-30" +,annotation=" + Each thread keeps an array of pointers to items that it is + currently referencing. Sort of an inside-out garbage collection + mechanism, but one that requires the accessing code to explicitly + state its needs. Also requires read-side memory barriers on + most architectures. +" +} + +@conference{Michael02b +,author="Maged M. Michael" +,title="High Performance Dynamic Lock-Free Hash Tables and List-Based Sets" +,Year="2002" +,Month="August" +,booktitle="{Proceedings of the 14\textsuperscript{th} Annual ACM +Symposium on Parallel +Algorithms and Architecture}" +,pages="73-82" +,annotation=" + Like the title says... +" +} + +@InProceedings{HerlihyLM02 +,author={Maurice Herlihy and Victor Luchangco and Mark Moir} +,title="The Repeat Offender Problem: A Mechanism for Supporting Dynamic-Sized, +Lock-Free Data Structures" +,booktitle={Proceedings of 16\textsuperscript{th} International +Symposium on Distributed Computing} +,year=2002 +,month="October" +,pages="339-353" +} + +@article{Appavoo03a +,author="J. Appavoo and K. Hui and C. A. N. Soules and R. W. Wisniewski and +D. M. {Da Silva} and O. Krieger and M. A. Auslander and D. J. Edelsohn and +B. Gamsa and G. R. Ganger and P. McKenney and M. Ostrowski and +B. Rosenburg and M. Stumm and J. Xenidis" +,title="Enabling Autonomic Behavior in Systems Software With Hot Swapping" +,Year="2003" +,Month="January" +,journal="IBM Systems Journal" +,volume="42" +,number="1" +,pages="60-76" +} + +@Conference{Arcangeli03 +,Author="Andrea Arcangeli and Mingming Cao and Paul E. McKenney and +Dipankar Sarma" +,Title="Using Read-Copy Update Techniques for {System V IPC} in the +{Linux} 2.5 Kernel" +,Booktitle="Proceedings of the 2003 USENIX Annual Technical Conference +(FREENIX Track)" +,Publisher="USENIX Association" +,year="2003" +,month="June" +,pages="297-310" +} + +@article{McKenney03a +,author="Paul E. McKenney" +,title="Using {RCU} in the {Linux} 2.5 Kernel" +,Year="2003" +,Month="October" +,journal="Linux Journal" +,volume="1" +,number="114" +,pages="18-26" +} + +@techreport{Friedberg03a +,author="Stuart A. Friedberg" +,title="Lock-Free Wild Card Search Data Structure and Method" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="2003" +,number="US Patent 6,662,184 (contributed under GPL)" +,month="December" +,pages="112" +} + +@article{McKenney04a +,author="Paul E. McKenney and Dipankar Sarma and Maneesh Soni" +,title="Scaling dcache with {RCU}" +,Year="2004" +,Month="January" +,journal="Linux Journal" +,volume="1" +,number="118" +,pages="38-46" +} + +@Conference{McKenney04b +,Author="Paul E. McKenney" +,Title="{RCU} vs. Locking Performance on Different {CPUs}" +,Booktitle="{linux.conf.au}" +,Month="January" +,Year="2004" +,Address="Adelaide, Australia" +,note="Available: +\url{http://www.linux.org.au/conf/2004/abstracts.html#90} +\url{http://www.rdrop.com/users/paulmck/rclock/lockperf.2004.01.17a.pdf} +[Viewed June 23, 2004]" +} + +@phdthesis{PaulEdwardMcKenneyPhD +,author="Paul E. McKenney" +,title="Exploiting Deferred Destruction: +An Analysis of Read-Copy-Update Techniques +in Operating System Kernels" +,school="OGI School of Science and Engineering at +Oregon Health and Sciences University" +,year="2004" +,note="Available: +\url{http://www.rdrop.com/users/paulmck/RCU/RCUdissertation.2004.07.14e1.pdf} +[Viewed October 15, 2004]" +} + +@Conference{Sarma04c +,Author="Dipankar Sarma and Paul E. McKenney" +,Title="Making RCU Safe for Deep Sub-Millisecond Response Realtime Applications" +,Booktitle="Proceedings of the 2004 USENIX Annual Technical Conference +(FREENIX Track)" +,Publisher="USENIX Association" +,year="2004" +,month="June" +,pages="182-191" +} + +@unpublished{JamesMorris04b +,Author="James Morris" +,Title="Recent Developments in {SELinux} Kernel Performance" +,month="December" +,year="2004" +,note="Available: +\url{http://www.livejournal.com/users/james_morris/2153.html} +[Viewed December 10, 2004]" +} + +@unpublished{PaulMcKenney05a +,Author="Paul E. McKenney" +,Title="{[RFC]} {RCU} and {CONFIG\_PREEMPT\_RT} progress" +,month="May" +,year="2005" +,note="Available: +\url{http://lkml.org/lkml/2005/5/9/185} +[Viewed May 13, 2005]" +,annotation=" + First publication of working lock-based deferred free patches + for the CONFIG_PREEMPT_RT environment. +" +} + +@conference{PaulMcKenney05b +,Author="Paul E. McKenney and Dipankar Sarma" +,Title="Towards Hard Realtime Response from the Linux Kernel on SMP Hardware" +,Booktitle="linux.conf.au 2005" +,month="April" +,year="2005" +,address="Canberra, Australia" +,note="Available: +\url{http://www.rdrop.com/users/paulmck/RCU/realtimeRCU.2005.04.23a.pdf} +[Viewed May 13, 2005]" +,annotation=" + Realtime turns into making RCU yet more realtime friendly. +" +} + +@conference{ThomasEHart2006a +,Author="Thomas E. Hart and Paul E. McKenney and Angela Demke Brown" +,Title="Making Lockless Synchronization Fast: Performance Implications +of Memory Reclamation" +,Booktitle="20\textsuperscript{th} {IEEE} International Parallel and +Distributed Processing Symposium" +,month="April" +,year="2006" +,day="25-29" +,address="Rhodes, Greece" +,annotation=" + Compares QSBR (AKA "classic RCU"), HPBR, EBR, and lock-free + reference counting. +" +} + +@Conference{PaulEMcKenney2006b +,Author="Paul E. McKenney and Dipankar Sarma and Ingo Molnar and +Suparna Bhattacharya" +,Title="Extending RCU for Realtime and Embedded Workloads" +,Booktitle="{Ottawa Linux Symposium}" +,Month="July" +,Year="2006" +,pages="v2 123-138" +,note="Available: +\url{http://www.linuxsymposium.org/2006/view_abstract.php?content_key=184} +\url{http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf} +[Viewed January 1, 2007]" +,annotation=" + Described how to improve the -rt implementation of realtime RCU. +" +} + +@unpublished{PaulEMcKenney2006c +,Author="Paul E. McKenney" +,Title="Sleepable {RCU}" +,month="October" +,day="9" +,year="2006" +,note="Available: +\url{http://lwn.net/Articles/202847/} +Revised: +\url{http://www.rdrop.com/users/paulmck/RCU/srcu.2007.01.14a.pdf} +[Viewed August 21, 2006]" +,annotation=" + LWN article introducing SRCU. +" +} + +@unpublished{RobertOlsson2006a +,Author="Robert Olsson and Stefan Nilsson" +,Title="{TRASH}: A dynamic {LC}-trie and hash data structure" +,month="August" +,day="18" +,year="2006" +,note="Available: +\url{http://www.nada.kth.se/~snilsson/public/papers/trash/trash.pdf} +[Viewed February 24, 2007]" +,annotation=" + RCU-protected dynamic trie-hash combination. +" +} + +@unpublished{ThomasEHart2007a +,Author="Thomas E. Hart and Paul E. McKenney and Angela Demke Brown and Jonathan Walpole" +,Title="Performance of memory reclamation for lockless synchronization" +,journal="J. Parallel Distrib. Comput." +,year="2007" +,note="To appear in J. Parallel Distrib. Comput. + \url{doi=10.1016/j.jpdc.2007.04.010}" +,annotation={ + Compares QSBR (AKA "classic RCU"), HPBR, EBR, and lock-free + reference counting. Journal version of ThomasEHart2006a. +} +} + +@unpublished{PaulEMcKenney2007QRCUspin +,Author="Paul E. McKenney" +,Title="Using Promela and Spin to verify parallel algorithms" +,month="August" +,day="1" +,year="2007" +,note="Available: +\url{http://lwn.net/Articles/243851/} +[Viewed September 8, 2007]" +,annotation=" + LWN article describing Promela and spin, and also using Oleg + Nesterov's QRCU as an example (with Paul McKenney's fastpath). +" +} + +@unpublished{PaulEMcKenney2007PreemptibleRCU +,Author="Paul E. McKenney" +,Title="The design of preemptible read-copy-update" +,month="October" +,day="8" +,year="2007" +,note="Available: +\url{http://lwn.net/Articles/253651/} +[Viewed October 25, 2007]" +,annotation=" + LWN article describing the design of preemptible RCU. +" +} + +######################################################################## +# +# "What is RCU?" LWN series. +# + +@unpublished{PaulEMcKenney2007WhatIsRCUFundamentally +,Author="Paul E. McKenney and Jonathan Walpole" +,Title="What is {RCU}, Fundamentally?" +,month="December" +,day="17" +,year="2007" +,note="Available: +\url{http://lwn.net/Articles/262464/} +[Viewed December 27, 2007]" +,annotation=" + Lays out the three basic components of RCU: (1) publish-subscribe, + (2) wait for pre-existing readers to complete, and (2) maintain + multiple versions. +" +} + +@unpublished{PaulEMcKenney2008WhatIsRCUUsage +,Author="Paul E. McKenney" +,Title="What is {RCU}? Part 2: Usage" +,month="January" +,day="4" +,year="2008" +,note="Available: +\url{http://lwn.net/Articles/263130/} +[Viewed January 4, 2008]" +,annotation=" + Lays out six uses of RCU: + 1. RCU is a Reader-Writer Lock Replacement + 2. RCU is a Restricted Reference-Counting Mechanism + 3. RCU is a Bulk Reference-Counting Mechanism + 4. RCU is a Poor Man's Garbage Collector + 5. RCU is a Way of Providing Existence Guarantees + 6. RCU is a Way of Waiting for Things to Finish +" +} + +@unpublished{PaulEMcKenney2008WhatIsRCUAPI +,Author="Paul E. McKenney" +,Title="{RCU} part 3: the {RCU} {API}" +,month="January" +,day="17" +,year="2008" +,note="Available: +\url{http://lwn.net/Articles/264090/} +[Viewed January 10, 2008]" +,annotation=" + Gives an overview of the Linux-kernel RCU API and a brief annotated RCU + bibliography. +" +} + +@article{DinakarGuniguntala2008IBMSysJ +,author="D. Guniguntala and P. E. McKenney and J. Triplett and J. Walpole" +,title="The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with {Linux}" +,Year="2008" +,Month="April" +,journal="IBM Systems Journal" +,volume="47" +,number="2" +,pages="@@-@@" +,annotation=" + RCU, realtime RCU, sleepable RCU, performance. +" +} diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt new file mode 100644 index 0000000..aab4a9e --- /dev/null +++ b/Documentation/RCU/UP.txt @@ -0,0 +1,119 @@ +RCU on Uniprocessor Systems + + +A common misconception is that, on UP systems, the call_rcu() primitive +may immediately invoke its function, and that the synchronize_rcu() +primitive may return immediately. The basis of this misconception +is that since there is only one CPU, it should not be necessary to +wait for anything else to get done, since there are no other CPUs for +anything else to be happening on. Although this approach will -sort- -of- +work a surprising amount of the time, it is a very bad idea in general. +This document presents three examples that demonstrate exactly how bad an +idea this is. + + +Example 1: softirq Suicide + +Suppose that an RCU-based algorithm scans a linked list containing +elements A, B, and C in process context, and can delete elements from +this same list in softirq context. Suppose that the process-context scan +is referencing element B when it is interrupted by softirq processing, +which deletes element B, and then invokes call_rcu() to free element B +after a grace period. + +Now, if call_rcu() were to directly invoke its arguments, then upon return +from softirq, the list scan would find itself referencing a newly freed +element B. This situation can greatly decrease the life expectancy of +your kernel. + +This same problem can occur if call_rcu() is invoked from a hardware +interrupt handler. + + +Example 2: Function-Call Fatality + +Of course, one could avert the suicide described in the preceding example +by having call_rcu() directly invoke its arguments only if it was called +from process context. However, this can fail in a similar manner. + +Suppose that an RCU-based algorithm again scans a linked list containing +elements A, B, and C in process contexts, but that it invokes a function +on each element as it is scanned. Suppose further that this function +deletes element B from the list, then passes it to call_rcu() for deferred +freeing. This may be a bit unconventional, but it is perfectly legal +RCU usage, since call_rcu() must wait for a grace period to elapse. +Therefore, in this case, allowing call_rcu() to immediately invoke +its arguments would cause it to fail to make the fundamental guarantee +underlying RCU, namely that call_rcu() defers invoking its arguments until +all RCU read-side critical sections currently executing have completed. + +Quick Quiz #1: why is it -not- legal to invoke synchronize_rcu() in + this case? + + +Example 3: Death by Deadlock + +Suppose that call_rcu() is invoked while holding a lock, and that the +callback function must acquire this same lock. In this case, if +call_rcu() were to directly invoke the callback, the result would +be self-deadlock. + +In some cases, it would possible to restructure to code so that +the call_rcu() is delayed until after the lock is released. However, +there are cases where this can be quite ugly: + +1. If a number of items need to be passed to call_rcu() within + the same critical section, then the code would need to create + a list of them, then traverse the list once the lock was + released. + +2. In some cases, the lock will be held across some kernel API, + so that delaying the call_rcu() until the lock is released + requires that the data item be passed up via a common API. + It is far better to guarantee that callbacks are invoked + with no locks held than to have to modify such APIs to allow + arbitrary data items to be passed back up through them. + +If call_rcu() directly invokes the callback, painful locking restrictions +or API changes would be required. + +Quick Quiz #2: What locking restriction must RCU callbacks respect? + + +Summary + +Permitting call_rcu() to immediately invoke its arguments or permitting +synchronize_rcu() to immediately return breaks RCU, even on a UP system. +So do not do it! Even on a UP system, the RCU infrastructure -must- +respect grace periods, and -must- invoke callbacks from a known environment +in which no locks are held. + + +Answer to Quick Quiz #1: + Why is it -not- legal to invoke synchronize_rcu() in this case? + + Because the calling function is scanning an RCU-protected linked + list, and is therefore within an RCU read-side critical section. + Therefore, the called function has been invoked within an RCU + read-side critical section, and is not permitted to block. + +Answer to Quick Quiz #2: + What locking restriction must RCU callbacks respect? + + Any lock that is acquired within an RCU callback must be + acquired elsewhere using an _irq variant of the spinlock + primitive. For example, if "mylock" is acquired by an + RCU callback, then a process-context acquisition of this + lock must use something like spin_lock_irqsave() to + acquire the lock. + + If the process-context code were to simply use spin_lock(), + then, since RCU callbacks can be invoked from softirq context, + the callback might be called from a softirq that interrupted + the process-context critical section. This would result in + self-deadlock. + + This restriction might seem gratuitous, since very few RCU + callbacks acquire locks directly. However, a great many RCU + callbacks do acquire locks -indirectly-, for example, via + the kfree() primitive. diff --git a/Documentation/RCU/arrayRCU.txt b/Documentation/RCU/arrayRCU.txt new file mode 100644 index 0000000..453ebe6 --- /dev/null +++ b/Documentation/RCU/arrayRCU.txt @@ -0,0 +1,141 @@ +Using RCU to Protect Read-Mostly Arrays + + +Although RCU is more commonly used to protect linked lists, it can +also be used to protect arrays. Three situations are as follows: + +1. Hash Tables + +2. Static Arrays + +3. Resizeable Arrays + +Each of these situations are discussed below. + + +Situation 1: Hash Tables + +Hash tables are often implemented as an array, where each array entry +has a linked-list hash chain. Each hash chain can be protected by RCU +as described in the listRCU.txt document. This approach also applies +to other array-of-list situations, such as radix trees. + + +Situation 2: Static Arrays + +Static arrays, where the data (rather than a pointer to the data) is +located in each array element, and where the array is never resized, +have not been used with RCU. Rik van Riel recommends using seqlock in +this situation, which would also have minimal read-side overhead as long +as updates are rare. + +Quick Quiz: Why is it so important that updates be rare when + using seqlock? + + +Situation 3: Resizeable Arrays + +Use of RCU for resizeable arrays is demonstrated by the grow_ary() +function used by the System V IPC code. The array is used to map from +semaphore, message-queue, and shared-memory IDs to the data structure +that represents the corresponding IPC construct. The grow_ary() +function does not acquire any locks; instead its caller must hold the +ids->sem semaphore. + +The grow_ary() function, shown below, does some limit checks, allocates a +new ipc_id_ary, copies the old to the new portion of the new, initializes +the remainder of the new, updates the ids->entries pointer to point to +the new array, and invokes ipc_rcu_putref() to free up the old array. +Note that rcu_assign_pointer() is used to update the ids->entries pointer, +which includes any memory barriers required on whatever architecture +you are running on. + + static int grow_ary(struct ipc_ids* ids, int newsize) + { + struct ipc_id_ary* new; + struct ipc_id_ary* old; + int i; + int size = ids->entries->size; + + if(newsize > IPCMNI) + newsize = IPCMNI; + if(newsize <= size) + return newsize; + + new = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *)*newsize + + sizeof(struct ipc_id_ary)); + if(new == NULL) + return size; + new->size = newsize; + memcpy(new->p, ids->entries->p, + sizeof(struct kern_ipc_perm *)*size + + sizeof(struct ipc_id_ary)); + for(i=size;i<newsize;i++) { + new->p[i] = NULL; + } + old = ids->entries; + + /* + * Use rcu_assign_pointer() to make sure the memcpyed + * contents of the new array are visible before the new + * array becomes visible. + */ + rcu_assign_pointer(ids->entries, new); + + ipc_rcu_putref(old); + return newsize; + } + +The ipc_rcu_putref() function decrements the array's reference count +and then, if the reference count has dropped to zero, uses call_rcu() +to free the array after a grace period has elapsed. + +The array is traversed by the ipc_lock() function. This function +indexes into the array under the protection of rcu_read_lock(), +using rcu_dereference() to pick up the pointer to the array so +that it may later safely be dereferenced -- memory barriers are +required on the Alpha CPU. Since the size of the array is stored +with the array itself, there can be no array-size mismatches, so +a simple check suffices. The pointer to the structure corresponding +to the desired IPC object is placed in "out", with NULL indicating +a non-existent entry. After acquiring "out->lock", the "out->deleted" +flag indicates whether the IPC object is in the process of being +deleted, and, if not, the pointer is returned. + + struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id) + { + struct kern_ipc_perm* out; + int lid = id % SEQ_MULTIPLIER; + struct ipc_id_ary* entries; + + rcu_read_lock(); + entries = rcu_dereference(ids->entries); + if(lid >= entries->size) { + rcu_read_unlock(); + return NULL; + } + out = entries->p[lid]; + if(out == NULL) { + rcu_read_unlock(); + return NULL; + } + spin_lock(&out->lock); + + /* ipc_rmid() may have already freed the ID while ipc_lock + * was spinning: here verify that the structure is still valid + */ + if (out->deleted) { + spin_unlock(&out->lock); + rcu_read_unlock(); + return NULL; + } + return out; + } + + +Answer to Quick Quiz: + + The reason that it is important that updates be rare when + using seqlock is that frequent updates can livelock readers. + One way to avoid this problem is to assign a seqlock for + each array entry rather than to the entire array. diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt new file mode 100644 index 0000000..6e25340 --- /dev/null +++ b/Documentation/RCU/checklist.txt @@ -0,0 +1,300 @@ +Review Checklist for RCU Patches + + +This document contains a checklist for producing and reviewing patches +that make use of RCU. Violating any of the rules listed below will +result in the same sorts of problems that leaving out a locking primitive +would cause. This list is based on experiences reviewing such patches +over a rather long period of time, but improvements are always welcome! + +0. Is RCU being applied to a read-mostly situation? If the data + structure is updated more than about 10% of the time, then + you should strongly consider some other approach, unless + detailed performance measurements show that RCU is nonetheless + the right tool for the job. + + Another exception is where performance is not an issue, and RCU + provides a simpler implementation. An example of this situation + is the dynamic NMI code in the Linux 2.6 kernel, at least on + architectures where NMIs are rare. + + Yet another exception is where the low real-time latency of RCU's + read-side primitives is critically important. + +1. Does the update code have proper mutual exclusion? + + RCU does allow -readers- to run (almost) naked, but -writers- must + still use some sort of mutual exclusion, such as: + + a. locking, + b. atomic operations, or + c. restricting updates to a single task. + + If you choose #b, be prepared to describe how you have handled + memory barriers on weakly ordered machines (pretty much all of + them -- even x86 allows reads to be reordered), and be prepared + to explain why this added complexity is worthwhile. If you + choose #c, be prepared to explain how this single task does not + become a major bottleneck on big multiprocessor machines (for + example, if the task is updating information relating to itself + that other tasks can read, there by definition can be no + bottleneck). + +2. Do the RCU read-side critical sections make proper use of + rcu_read_lock() and friends? These primitives are needed + to prevent grace periods from ending prematurely, which + could result in data being unceremoniously freed out from + under your read-side code, which can greatly increase the + actuarial risk of your kernel. + + As a rough rule of thumb, any dereference of an RCU-protected + pointer must be covered by rcu_read_lock() or rcu_read_lock_bh() + or by the appropriate update-side lock. + +3. Does the update code tolerate concurrent accesses? + + The whole point of RCU is to permit readers to run without + any locks or atomic operations. This means that readers will + be running while updates are in progress. There are a number + of ways to handle this concurrency, depending on the situation: + + a. Use the RCU variants of the list and hlist update + primitives to add, remove, and replace elements on an + RCU-protected list. Alternatively, use the RCU-protected + trees that have been added to the Linux kernel. + + This is almost always the best approach. + + b. Proceed as in (a) above, but also maintain per-element + locks (that are acquired by both readers and writers) + that guard per-element state. Of course, fields that + the readers refrain from accessing can be guarded by the + update-side lock. + + This works quite well, also. + + c. Make updates appear atomic to readers. For example, + pointer updates to properly aligned fields will appear + atomic, as will individual atomic primitives. Operations + performed under a lock and sequences of multiple atomic + primitives will -not- appear to be atomic. + + This can work, but is starting to get a bit tricky. + + d. Carefully order the updates and the reads so that + readers see valid data at all phases of the update. + This is often more difficult than it sounds, especially + given modern CPUs' tendency to reorder memory references. + One must usually liberally sprinkle memory barriers + (smp_wmb(), smp_rmb(), smp_mb()) through the code, + making it difficult to understand and to test. + + It is usually better to group the changing data into + a separate structure, so that the change may be made + to appear atomic by updating a pointer to reference + a new structure containing updated values. + +4. Weakly ordered CPUs pose special challenges. Almost all CPUs + are weakly ordered -- even i386 CPUs allow reads to be reordered. + RCU code must take all of the following measures to prevent + memory-corruption problems: + + a. Readers must maintain proper ordering of their memory + accesses. The rcu_dereference() primitive ensures that + the CPU picks up the pointer before it picks up the data + that the pointer points to. This really is necessary + on Alpha CPUs. If you don't believe me, see: + + http://www.openvms.compaq.com/wizard/wiz_2637.html + + The rcu_dereference() primitive is also an excellent + documentation aid, letting the person reading the code + know exactly which pointers are protected by RCU. + + The rcu_dereference() primitive is used by the various + "_rcu()" list-traversal primitives, such as the + list_for_each_entry_rcu(). Note that it is perfectly + legal (if redundant) for update-side code to use + rcu_dereference() and the "_rcu()" list-traversal + primitives. This is particularly useful in code + that is common to readers and updaters. + + b. If the list macros are being used, the list_add_tail_rcu() + and list_add_rcu() primitives must be used in order + to prevent weakly ordered machines from misordering + structure initialization and pointer planting. + Similarly, if the hlist macros are being used, the + hlist_add_head_rcu() primitive is required. + + c. If the list macros are being used, the list_del_rcu() + primitive must be used to keep list_del()'s pointer + poisoning from inflicting toxic effects on concurrent + readers. Similarly, if the hlist macros are being used, + the hlist_del_rcu() primitive is required. + + The list_replace_rcu() primitive may be used to + replace an old structure with a new one in an + RCU-protected list. + + d. Updates must ensure that initialization of a given + structure happens before pointers to that structure are + publicized. Use the rcu_assign_pointer() primitive + when publicizing a pointer to a structure that can + be traversed by an RCU read-side critical section. + +5. If call_rcu(), or a related primitive such as call_rcu_bh() or + call_rcu_sched(), is used, the callback function must be + written to be called from softirq context. In particular, + it cannot block. + +6. Since synchronize_rcu() can block, it cannot be called from + any sort of irq context. Ditto for synchronize_sched() and + synchronize_srcu(). + +7. If the updater uses call_rcu(), then the corresponding readers + must use rcu_read_lock() and rcu_read_unlock(). If the updater + uses call_rcu_bh(), then the corresponding readers must use + rcu_read_lock_bh() and rcu_read_unlock_bh(). If the updater + uses call_rcu_sched(), then the corresponding readers must + disable preemption. Mixing things up will result in confusion + and broken kernels. + + One exception to this rule: rcu_read_lock() and rcu_read_unlock() + may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() + in cases where local bottom halves are already known to be + disabled, for example, in irq or softirq context. Commenting + such cases is a must, of course! And the jury is still out on + whether the increased speed is worth it. + +8. Although synchronize_rcu() is slower than is call_rcu(), it + usually results in simpler code. So, unless update performance + is critically important or the updaters cannot block, + synchronize_rcu() should be used in preference to call_rcu(). + + An especially important property of the synchronize_rcu() + primitive is that it automatically self-limits: if grace periods + are delayed for whatever reason, then the synchronize_rcu() + primitive will correspondingly delay updates. In contrast, + code using call_rcu() should explicitly limit update rate in + cases where grace periods are delayed, as failing to do so can + result in excessive realtime latencies or even OOM conditions. + + Ways of gaining this self-limiting property when using call_rcu() + include: + + a. Keeping a count of the number of data-structure elements + used by the RCU-protected data structure, including those + waiting for a grace period to elapse. Enforce a limit + on this number, stalling updates as needed to allow + previously deferred frees to complete. + + Alternatively, limit only the number awaiting deferred + free rather than the total number of elements. + + b. Limiting update rate. For example, if updates occur only + once per hour, then no explicit rate limiting is required, + unless your system is already badly broken. The dcache + subsystem takes this approach -- updates are guarded + by a global lock, limiting their rate. + + c. Trusted update -- if updates can only be done manually by + superuser or some other trusted user, then it might not + be necessary to automatically limit them. The theory + here is that superuser already has lots of ways to crash + the machine. + + d. Use call_rcu_bh() rather than call_rcu(), in order to take + advantage of call_rcu_bh()'s faster grace periods. + + e. Periodically invoke synchronize_rcu(), permitting a limited + number of updates per grace period. + +9. All RCU list-traversal primitives, which include + rcu_dereference(), list_for_each_entry_rcu(), + list_for_each_continue_rcu(), and list_for_each_safe_rcu(), + must be either within an RCU read-side critical section or + must be protected by appropriate update-side locks. RCU + read-side critical sections are delimited by rcu_read_lock() + and rcu_read_unlock(), or by similar primitives such as + rcu_read_lock_bh() and rcu_read_unlock_bh(). + + The reason that it is permissible to use RCU list-traversal + primitives when the update-side lock is held is that doing so + can be quite helpful in reducing code bloat when common code is + shared between readers and updaters. + +10. Conversely, if you are in an RCU read-side critical section, + and you don't hold the appropriate update-side lock, you -must- + use the "_rcu()" variants of the list macros. Failing to do so + will break Alpha and confuse people reading your code. + +11. Note that synchronize_rcu() -only- guarantees to wait until + all currently executing rcu_read_lock()-protected RCU read-side + critical sections complete. It does -not- necessarily guarantee + that all currently running interrupts, NMIs, preempt_disable() + code, or idle loops will complete. Therefore, if you do not have + rcu_read_lock()-protected read-side critical sections, do -not- + use synchronize_rcu(). + + If you want to wait for some of these other things, you might + instead need to use synchronize_irq() or synchronize_sched(). + +12. Any lock acquired by an RCU callback must be acquired elsewhere + with irq disabled, e.g., via spin_lock_irqsave(). Failing to + disable irq on a given acquisition of that lock will result in + deadlock as soon as the RCU callback happens to interrupt that + acquisition's critical section. + +13. RCU callbacks can be and are executed in parallel. In many cases, + the callback code simply wrappers around kfree(), so that this + is not an issue (or, more accurately, to the extent that it is + an issue, the memory-allocator locking handles it). However, + if the callbacks do manipulate a shared data structure, they + must use whatever locking or other synchronization is required + to safely access and/or modify that data structure. + + RCU callbacks are -usually- executed on the same CPU that executed + the corresponding call_rcu(), call_rcu_bh(), or call_rcu_sched(), + but are by -no- means guaranteed to be. For example, if a given + CPU goes offline while having an RCU callback pending, then that + RCU callback will execute on some surviving CPU. (If this was + not the case, a self-spawning RCU callback would prevent the + victim CPU from ever going offline.) + +14. SRCU (srcu_read_lock(), srcu_read_unlock(), and synchronize_srcu()) + may only be invoked from process context. Unlike other forms of + RCU, it -is- permissible to block in an SRCU read-side critical + section (demarked by srcu_read_lock() and srcu_read_unlock()), + hence the "SRCU": "sleepable RCU". Please note that if you + don't need to sleep in read-side critical sections, you should + be using RCU rather than SRCU, because RCU is almost always + faster and easier to use than is SRCU. + + Also unlike other forms of RCU, explicit initialization + and cleanup is required via init_srcu_struct() and + cleanup_srcu_struct(). These are passed a "struct srcu_struct" + that defines the scope of a given SRCU domain. Once initialized, + the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock() + and synchronize_srcu(). A given synchronize_srcu() waits only + for SRCU read-side critical sections governed by srcu_read_lock() + and srcu_read_unlock() calls that have been passd the same + srcu_struct. This property is what makes sleeping read-side + critical sections tolerable -- a given subsystem delays only + its own updates, not those of other subsystems using SRCU. + Therefore, SRCU is less prone to OOM the system than RCU would + be if RCU's read-side critical sections were permitted to + sleep. + + The ability to sleep in read-side critical sections does not + come for free. First, corresponding srcu_read_lock() and + srcu_read_unlock() calls must be passed the same srcu_struct. + Second, grace-period-detection overhead is amortized only + over those updates sharing a given srcu_struct, rather than + being globally amortized as they are for other forms of RCU. + Therefore, SRCU should be used in preference to rw_semaphore + only in extremely read-intensive situations, or in situations + requiring SRCU's read-side deadlock immunity or low read-side + realtime latency. + + Note that, rcu_assign_pointer() and rcu_dereference() relate to + SRCU just as they do to other forms of RCU. diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt new file mode 100644 index 0000000..1fd1753 --- /dev/null +++ b/Documentation/RCU/listRCU.txt @@ -0,0 +1,315 @@ +Using RCU to Protect Read-Mostly Linked Lists + + +One of the best applications of RCU is to protect read-mostly linked lists +("struct list_head" in list.h). One big advantage of this approach +is that all of the required memory barriers are included for you in +the list macros. This document describes several applications of RCU, +with the best fits first. + + +Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates + +The best applications are cases where, if reader-writer locking were +used, the read-side lock would be dropped before taking any action +based on the results of the search. The most celebrated example is +the routing table. Because the routing table is tracking the state of +equipment outside of the computer, it will at times contain stale data. +Therefore, once the route has been computed, there is no need to hold +the routing table static during transmission of the packet. After all, +you can hold the routing table static all you want, but that won't keep +the external Internet from changing, and it is the state of the external +Internet that really matters. In addition, routing entries are typically +added or deleted, rather than being modified in place. + +A straightforward example of this use of RCU may be found in the +system-call auditing support. For example, a reader-writer locked +implementation of audit_filter_task() might be as follows: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + read_lock(&auditsc_lock); + /* Note: audit_netlink_sem held by caller. */ + list_for_each_entry(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + read_unlock(&auditsc_lock); + return state; + } + } + read_unlock(&auditsc_lock); + return AUDIT_BUILD_CONTEXT; + } + +Here the list is searched under the lock, but the lock is dropped before +the corresponding value is returned. By the time that this value is acted +on, the list may well have been modified. This makes sense, since if +you are turning auditing off, it is OK to audit a few extra system calls. + +This means that RCU can be easily applied to the read side, as follows: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + rcu_read_lock(); + /* Note: audit_netlink_sem held by caller. */ + list_for_each_entry_rcu(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + rcu_read_unlock(); + return state; + } + } + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + +The read_lock() and read_unlock() calls have become rcu_read_lock() +and rcu_read_unlock(), respectively, and the list_for_each_entry() has +become list_for_each_entry_rcu(). The _rcu() list-traversal primitives +insert the read-side memory barriers that are required on DEC Alpha CPUs. + +The changes to the update side are also straightforward. A reader-writer +lock might be used as follows for deletion and insertion: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + write_lock(&auditsc_lock); + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + list_del(&e->list); + write_unlock(&auditsc_lock); + return 0; + } + } + write_unlock(&auditsc_lock); + return -EFAULT; /* No matching rule */ + } + + static inline int audit_add_rule(struct audit_entry *entry, + struct list_head *list) + { + write_lock(&auditsc_lock); + if (entry->rule.flags & AUDIT_PREPEND) { + entry->rule.flags &= ~AUDIT_PREPEND; + list_add(&entry->list, list); + } else { + list_add_tail(&entry->list, list); + } + write_unlock(&auditsc_lock); + return 0; + } + +Following are the RCU equivalents for these two functions: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + /* Do not use the _rcu iterator here, since this is the only + * deletion routine. */ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + list_del_rcu(&e->list); + call_rcu(&e->rcu, audit_free_rule, e); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + + static inline int audit_add_rule(struct audit_entry *entry, + struct list_head *list) + { + if (entry->rule.flags & AUDIT_PREPEND) { + entry->rule.flags &= ~AUDIT_PREPEND; + list_add_rcu(&entry->list, list); + } else { + list_add_tail_rcu(&entry->list, list); + } + return 0; + } + +Normally, the write_lock() and write_unlock() would be replaced by +a spin_lock() and a spin_unlock(), but in this case, all callers hold +audit_netlink_sem, so no additional locking is required. The auditsc_lock +can therefore be eliminated, since use of RCU eliminates the need for +writers to exclude readers. Normally, the write_lock() calls would +be converted into spin_lock() calls. + +The list_del(), list_add(), and list_add_tail() primitives have been +replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu(). +The _rcu() list-manipulation primitives add memory barriers that are +needed on weakly ordered CPUs (most of them!). The list_del_rcu() +primitive omits the pointer poisoning debug-assist code that would +otherwise cause concurrent readers to fail spectacularly. + +So, when readers can tolerate stale data and when entries are either added +or deleted, without in-place modification, it is very easy to use RCU! + + +Example 2: Handling In-Place Updates + +The system-call auditing code does not update auditing rules in place. +However, if it did, reader-writer-locked code to do so might look as +follows (presumably, the field_count is only permitted to decrease, +otherwise, the added fields would need to be filled in): + + static inline int audit_upd_rule(struct audit_rule *rule, + struct list_head *list, + __u32 newaction, + __u32 newfield_count) + { + struct audit_entry *e; + struct audit_newentry *ne; + + write_lock(&auditsc_lock); + /* Note: audit_netlink_sem held by caller. */ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + e->rule.action = newaction; + e->rule.file_count = newfield_count; + write_unlock(&auditsc_lock); + return 0; + } + } + write_unlock(&auditsc_lock); + return -EFAULT; /* No matching rule */ + } + +The RCU version creates a copy, updates the copy, then replaces the old +entry with the newly updated entry. This sequence of actions, allowing +concurrent reads while doing a copy to perform an update, is what gives +RCU ("read-copy update") its name. The RCU code is as follows: + + static inline int audit_upd_rule(struct audit_rule *rule, + struct list_head *list, + __u32 newaction, + __u32 newfield_count) + { + struct audit_entry *e; + struct audit_newentry *ne; + + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + ne = kmalloc(sizeof(*entry), GFP_ATOMIC); + if (ne == NULL) + return -ENOMEM; + audit_copy_rule(&ne->rule, &e->rule); + ne->rule.action = newaction; + ne->rule.file_count = newfield_count; + list_replace_rcu(e, ne); + call_rcu(&e->rcu, audit_free_rule, e); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + +Again, this assumes that the caller holds audit_netlink_sem. Normally, +the reader-writer lock would become a spinlock in this sort of code. + + +Example 3: Eliminating Stale Data + +The auditing examples above tolerate stale data, as do most algorithms +that are tracking external state. Because there is a delay from the +time the external state changes before Linux becomes aware of the change, +additional RCU-induced staleness is normally not a problem. + +However, there are many examples where stale data cannot be tolerated. +One example in the Linux kernel is the System V IPC (see the ipc_lock() +function in ipc/util.c). This code checks a "deleted" flag under a +per-entry spinlock, and, if the "deleted" flag is set, pretends that the +entry does not exist. For this to be helpful, the search function must +return holding the per-entry spinlock, as ipc_lock() does in fact do. + +Quick Quiz: Why does the search function need to return holding the + per-entry lock for this deleted-flag technique to be helpful? + +If the system-call audit module were to ever need to reject stale data, +one way to accomplish this would be to add a "deleted" flag and a "lock" +spinlock to the audit_entry structure, and modify audit_filter_task() +as follows: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + rcu_read_lock(); + list_for_each_entry_rcu(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + spin_lock(&e->lock); + if (e->deleted) { + spin_unlock(&e->lock); + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + rcu_read_unlock(); + return state; + } + } + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + +Note that this example assumes that entries are only added and deleted. +Additional mechanism is required to deal correctly with the +update-in-place performed by audit_upd_rule(). For one thing, +audit_upd_rule() would need additional memory barriers to ensure +that the list_add_rcu() was really executed before the list_del_rcu(). + +The audit_del_rule() function would need to set the "deleted" +flag under the spinlock as follows: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + /* Do not need to use the _rcu iterator here, since this + * is the only deletion routine. */ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + spin_lock(&e->lock); + list_del_rcu(&e->list); + e->deleted = 1; + spin_unlock(&e->lock); + call_rcu(&e->rcu, audit_free_rule, e); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + + +Summary + +Read-mostly list-based data structures that can tolerate stale data are +the most amenable to use of RCU. The simplest case is where entries are +either added or deleted from the data structure (or atomically modified +in place), but non-atomic in-place modifications can be handled by making +a copy, updating the copy, then replacing the original with the copy. +If stale data cannot be tolerated, then a "deleted" flag may be used +in conjunction with a per-entry spinlock in order to allow the search +function to reject newly deleted data. + + +Answer to Quick Quiz + Why does the search function need to return holding the per-entry + lock for this deleted-flag technique to be helpful? + + If the search function drops the per-entry lock before returning, + then the caller will be processing stale data in any case. If it + is really OK to be processing stale data, then you don't need a + "deleted" flag. If processing stale data really is a problem, + then you need to hold the per-entry lock across all of the code + that uses the value that was returned. diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt new file mode 100644 index 0000000..95821a2 --- /dev/null +++ b/Documentation/RCU/rcu.txt @@ -0,0 +1,138 @@ +RCU Concepts + + +The basic idea behind RCU (read-copy update) is to split destructive +operations into two parts, one that prevents anyone from seeing the data +item being destroyed, and one that actually carries out the destruction. +A "grace period" must elapse between the two parts, and this grace period +must be long enough that any readers accessing the item being deleted have +since dropped their references. For example, an RCU-protected deletion +from a linked list would first remove the item from the list, wait for +a grace period to elapse, then free the element. See the listRCU.txt +file for more information on using RCU with linked lists. + + +Frequently Asked Questions + +o Why would anyone want to use RCU? + + The advantage of RCU's two-part approach is that RCU readers need + not acquire any locks, perform any atomic instructions, write to + shared memory, or (on CPUs other than Alpha) execute any memory + barriers. The fact that these operations are quite expensive + on modern CPUs is what gives RCU its performance advantages + in read-mostly situations. The fact that RCU readers need not + acquire locks can also greatly simplify deadlock-avoidance code. + +o How can the updater tell when a grace period has completed + if the RCU readers give no indication when they are done? + + Just as with spinlocks, RCU readers are not permitted to + block, switch to user-mode execution, or enter the idle loop. + Therefore, as soon as a CPU is seen passing through any of these + three states, we know that that CPU has exited any previous RCU + read-side critical sections. So, if we remove an item from a + linked list, and then wait until all CPUs have switched context, + executed in user mode, or executed in the idle loop, we can + safely free up that item. + + Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the + same effect, but require that the readers manipulate CPU-local + counters. These counters allow limited types of blocking + within RCU read-side critical sections. SRCU also uses + CPU-local counters, and permits general blocking within + RCU read-side critical sections. These two variants of + RCU detect grace periods by sampling these counters. + +o If I am running on a uniprocessor kernel, which can only do one + thing at a time, why should I wait for a grace period? + + See the UP.txt file in this directory. + +o How can I see where RCU is currently used in the Linux kernel? + + Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu", + "rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh", + "srcu_read_lock", "srcu_read_unlock", "synchronize_rcu", + "synchronize_net", "synchronize_srcu", and the other RCU + primitives. Or grab one of the cscope databases from: + + http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html + +o What guidelines should I follow when writing code that uses RCU? + + See the checklist.txt file in this directory. + +o Why the name "RCU"? + + "RCU" stands for "read-copy update". The file listRCU.txt has + more information on where this name came from, search for + "read-copy update" to find it. + +o I hear that RCU is patented? What is with that? + + Yes, it is. There are several known patents related to RCU, + search for the string "Patent" in RTFP.txt to find them. + Of these, one was allowed to lapse by the assignee, and the + others have been contributed to the Linux kernel under GPL. + +o I hear that RCU needs work in order to support realtime kernels? + + This work is largely completed. Realtime-friendly RCU can be + enabled via the CONFIG_PREEMPT_RCU kernel configuration parameter. + However, work is in progress for enabling priority boosting of + preempted RCU read-side critical sections.This is needed if you + have CPU-bound realtime threads. + +o Where can I find more information on RCU? + + See the RTFP.txt file in this directory. + Or point your browser at http://www.rdrop.com/users/paulmck/RCU/. + +o What are all these files in this directory? + + + NMI-RCU.txt + + Describes how to use RCU to implement dynamic + NMI handlers, which can be revectored on the fly, + without rebooting. + + RTFP.txt + + List of RCU-related publications and web sites. + + UP.txt + + Discussion of RCU usage in UP kernels. + + arrayRCU.txt + + Describes how to use RCU to protect arrays, with + resizeable arrays whose elements reference other + data structures being of the most interest. + + checklist.txt + + Lists things to check for when inspecting code that + uses RCU. + + listRCU.txt + + Describes how to use RCU to protect linked lists. + This is the simplest and most common use of RCU + in the Linux kernel. + + rcu.txt + + You are reading it! + + rcuref.txt + + Describes how to combine use of reference counts + with RCU. + + whatisRCU.txt + + Overview of how the RCU implementation works. Along + the way, presents a conceptual view of RCU. diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.txt new file mode 100644 index 0000000..4202ad0 --- /dev/null +++ b/Documentation/RCU/rcuref.txt @@ -0,0 +1,66 @@ +Reference-count design for elements of lists/arrays protected by RCU. + +Reference counting on elements of lists which are protected by traditional +reader/writer spinlocks or semaphores are straightforward: + +1. 2. +add() search_and_reference() +{ { + alloc_object read_lock(&list_lock); + ... search_for_element + atomic_set(&el->rc, 1); atomic_inc(&el->rc); + write_lock(&list_lock); ... + add_element read_unlock(&list_lock); + ... ... + write_unlock(&list_lock); } +} + +3. 4. +release_referenced() delete() +{ { + ... write_lock(&list_lock); + atomic_dec(&el->rc, relfunc) ... + ... delete_element +} write_unlock(&list_lock); + ... + if (atomic_dec_and_test(&el->rc)) + kfree(el); + ... + } + +If this list/array is made lock free using RCU as in changing the +write_lock() in add() and delete() to spin_lock() and changing read_lock() +in search_and_reference() to rcu_read_lock(), the atomic_inc() in +search_and_reference() could potentially hold reference to an element which +has already been deleted from the list/array. Use atomic_inc_not_zero() +in this scenario as follows: + +1. 2. +add() search_and_reference() +{ { + alloc_object rcu_read_lock(); + ... search_for_element + atomic_set(&el->rc, 1); if (!atomic_inc_not_zero(&el->rc)) { + spin_lock(&list_lock); rcu_read_unlock(); + return FAIL; + add_element } + ... ... + spin_unlock(&list_lock); rcu_read_unlock(); +} } +3. 4. +release_referenced() delete() +{ { + ... spin_lock(&list_lock); + if (atomic_dec_and_test(&el->rc)) ... + call_rcu(&el->head, el_free); delete_element + ... spin_unlock(&list_lock); +} ... + if (atomic_dec_and_test(&el->rc)) + call_rcu(&el->head, el_free); + ... + } + +Sometimes, a reference to the element needs to be obtained in the +update (write) stream. In such cases, atomic_inc_not_zero() might be +overkill, since we hold the update-side spinlock. One might instead +use atomic_inc() in such cases. diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt new file mode 100644 index 0000000..a342b6e --- /dev/null +++ b/Documentation/RCU/torture.txt @@ -0,0 +1,180 @@ +RCU Torture Test Operation + + +CONFIG_RCU_TORTURE_TEST + +The CONFIG_RCU_TORTURE_TEST config option is available for all RCU +implementations. It creates an rcutorture kernel module that can +be loaded to run a torture test. The test periodically outputs +status messages via printk(), which can be examined via the dmesg +command (perhaps grepping for "torture"). The test is started +when the module is loaded, and stops when the module is unloaded. + +CONFIG_RCU_TORTURE_TEST_RUNNABLE + +It is also possible to specify CONFIG_RCU_TORTURE_TEST=y, which will +result in the tests being loaded into the base kernel. In this case, +the CONFIG_RCU_TORTURE_TEST_RUNNABLE config option is used to specify +whether the RCU torture tests are to be started immediately during +boot or whether the /proc/sys/kernel/rcutorture_runnable file is used +to enable them. This /proc file can be used to repeatedly pause and +restart the tests, regardless of the initial state specified by the +CONFIG_RCU_TORTURE_TEST_RUNNABLE config option. + +You will normally -not- want to start the RCU torture tests during boot +(and thus the default is CONFIG_RCU_TORTURE_TEST_RUNNABLE=n), but doing +this can sometimes be useful in finding boot-time bugs. + + +MODULE PARAMETERS + +This module has the following parameters: + +irqreaders Says to invoke RCU readers from irq level. This is currently + done via timers. Defaults to "1" for variants of RCU that + permit this. (Or, more accurately, variants of RCU that do + -not- permit this know to ignore this variable.) + +nfakewriters This is the number of RCU fake writer threads to run. Fake + writer threads repeatedly use the synchronous "wait for + current readers" function of the interface selected by + torture_type, with a delay between calls to allow for various + different numbers of writers running in parallel. + nfakewriters defaults to 4, which provides enough parallelism + to trigger special cases caused by multiple writers, such as + the synchronize_srcu() early return optimization. + +nreaders This is the number of RCU reading threads supported. + The default is twice the number of CPUs. Why twice? + To properly exercise RCU implementations with preemptible + read-side critical sections. + +shuffle_interval + The number of seconds to keep the test threads affinitied + to a particular subset of the CPUs, defaults to 3 seconds. + Used in conjunction with test_no_idle_hz. + +stat_interval The number of seconds between output of torture + statistics (via printk()). Regardless of the interval, + statistics are printed when the module is unloaded. + Setting the interval to zero causes the statistics to + be printed -only- when the module is unloaded, and this + is the default. + +stutter The length of time to run the test before pausing for this + same period of time. Defaults to "stutter=5", so as + to run and pause for (roughly) five-second intervals. + Specifying "stutter=0" causes the test to run continuously + without pausing, which is the old default behavior. + +test_no_idle_hz Whether or not to test the ability of RCU to operate in + a kernel that disables the scheduling-clock interrupt to + idle CPUs. Boolean parameter, "1" to test, "0" otherwise. + Defaults to omitting this test. + +torture_type The type of RCU to test: "rcu" for the rcu_read_lock() API, + "rcu_sync" for rcu_read_lock() with synchronous reclamation, + "rcu_bh" for the rcu_read_lock_bh() API, "rcu_bh_sync" for + rcu_read_lock_bh() with synchronous reclamation, "srcu" for + the "srcu_read_lock()" API, and "sched" for the use of + preempt_disable() together with synchronize_sched(). + +verbose Enable debug printk()s. Default is disabled. + + +OUTPUT + +The statistics output is as follows: + + rcu-torture: --- Start of test: nreaders=16 stat_interval=0 verbose=0 + rcu-torture: rtc: 0000000000000000 ver: 1916 tfle: 0 rta: 1916 rtaf: 0 rtf: 1915 + rcu-torture: Reader Pipe: 1466408 9747 0 0 0 0 0 0 0 0 0 + rcu-torture: Reader Batch: 1464477 11678 0 0 0 0 0 0 0 0 + rcu-torture: Free-Block Circulation: 1915 1915 1915 1915 1915 1915 1915 1915 1915 1915 0 + rcu-torture: --- End of test + +The command "dmesg | grep torture:" will extract this information on +most systems. On more esoteric configurations, it may be necessary to +use other commands to access the output of the printk()s used by +the RCU torture test. The printk()s use KERN_ALERT, so they should +be evident. ;-) + +The entries are as follows: + +o "rtc": The hexadecimal address of the structure currently visible + to readers. + +o "ver": The number of times since boot that the rcutw writer task + has changed the structure visible to readers. + +o "tfle": If non-zero, indicates that the "torture freelist" + containing structure to be placed into the "rtc" area is empty. + This condition is important, since it can fool you into thinking + that RCU is working when it is not. :-/ + +o "rta": Number of structures allocated from the torture freelist. + +o "rtaf": Number of allocations from the torture freelist that have + failed due to the list being empty. + +o "rtf": Number of frees into the torture freelist. + +o "Reader Pipe": Histogram of "ages" of structures seen by readers. + If any entries past the first two are non-zero, RCU is broken. + And rcutorture prints the error flag string "!!!" to make sure + you notice. The age of a newly allocated structure is zero, + it becomes one when removed from reader visibility, and is + incremented once per grace period subsequently -- and is freed + after passing through (RCU_TORTURE_PIPE_LEN-2) grace periods. + + The output displayed above was taken from a correctly working + RCU. If you want to see what it looks like when broken, break + it yourself. ;-) + +o "Reader Batch": Another histogram of "ages" of structures seen + by readers, but in terms of counter flips (or batches) rather + than in terms of grace periods. The legal number of non-zero + entries is again two. The reason for this separate view is that + it is sometimes easier to get the third entry to show up in the + "Reader Batch" list than in the "Reader Pipe" list. + +o "Free-Block Circulation": Shows the number of torture structures + that have reached a given point in the pipeline. The first element + should closely correspond to the number of structures allocated, + the second to the number that have been removed from reader view, + and all but the last remaining to the corresponding number of + passes through a grace period. The last entry should be zero, + as it is only incremented if a torture structure's counter + somehow gets incremented farther than it should. + +Different implementations of RCU can provide implementation-specific +additional information. For example, SRCU provides the following: + + srcu-torture: rtc: f8cf46a8 ver: 355 tfle: 0 rta: 356 rtaf: 0 rtf: 346 rtmbe: 0 + srcu-torture: Reader Pipe: 559738 939 0 0 0 0 0 0 0 0 0 + srcu-torture: Reader Batch: 560434 243 0 0 0 0 0 0 0 0 + srcu-torture: Free-Block Circulation: 355 354 353 352 351 350 349 348 347 346 0 + srcu-torture: per-CPU(idx=1): 0(0,1) 1(0,1) 2(0,0) 3(0,1) + +The first four lines are similar to those for RCU. The last line shows +the per-CPU counter state. The numbers in parentheses are the values +of the "old" and "current" counters for the corresponding CPU. The +"idx" value maps the "old" and "current" values to the underlying array, +and is useful for debugging. + + +USAGE + +The following script may be used to torture RCU: + + #!/bin/sh + + modprobe rcutorture + sleep 100 + rmmod rcutorture + dmesg | grep torture: + +The output can be manually inspected for the error flag of "!!!". +One could of course create a more elaborate script that automatically +checked for such errors. The "rmmod" command forces a "SUCCESS" or +"FAILURE" indication to be printk()ed. diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt new file mode 100644 index 0000000..9617082 --- /dev/null +++ b/Documentation/RCU/whatisRCU.txt @@ -0,0 +1,936 @@ +Please note that the "What is RCU?" LWN series is an excellent place +to start learning about RCU: + +1. What is RCU, Fundamentally? http://lwn.net/Articles/262464/ +2. What is RCU? Part 2: Usage http://lwn.net/Articles/263130/ +3. RCU part 3: the RCU API http://lwn.net/Articles/264090/ + + +What is RCU? + +RCU is a synchronization mechanism that was added to the Linux kernel +during the 2.5 development effort that is optimized for read-mostly +situations. Although RCU is actually quite simple once you understand it, +getting there can sometimes be a challenge. Part of the problem is that +most of the past descriptions of RCU have been written with the mistaken +assumption that there is "one true way" to describe RCU. Instead, +the experience has been that different people must take different paths +to arrive at an understanding of RCU. This document provides several +different paths, as follows: + +1. RCU OVERVIEW +2. WHAT IS RCU'S CORE API? +3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? +4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? +5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? +6. ANALOGY WITH READER-WRITER LOCKING +7. FULL LIST OF RCU APIs +8. ANSWERS TO QUICK QUIZZES + +People who prefer starting with a conceptual overview should focus on +Section 1, though most readers will profit by reading this section at +some point. People who prefer to start with an API that they can then +experiment with should focus on Section 2. People who prefer to start +with example uses should focus on Sections 3 and 4. People who need to +understand the RCU implementation should focus on Section 5, then dive +into the kernel source code. People who reason best by analogy should +focus on Section 6. Section 7 serves as an index to the docbook API +documentation, and Section 8 is the traditional answer key. + +So, start with the section that makes the most sense to you and your +preferred method of learning. If you need to know everything about +everything, feel free to read the whole thing -- but if you are really +that type of person, you have perused the source code and will therefore +never need this document anyway. ;-) + + +1. RCU OVERVIEW + +The basic idea behind RCU is to split updates into "removal" and +"reclamation" phases. The removal phase removes references to data items +within a data structure (possibly by replacing them with references to +new versions of these data items), and can run concurrently with readers. +The reason that it is safe to run the removal phase concurrently with +readers is the semantics of modern CPUs guarantee that readers will see +either the old or the new version of the data structure rather than a +partially updated reference. The reclamation phase does the work of reclaiming +(e.g., freeing) the data items removed from the data structure during the +removal phase. Because reclaiming data items can disrupt any readers +concurrently referencing those data items, the reclamation phase must +not start until readers no longer hold references to those data items. + +Splitting the update into removal and reclamation phases permits the +updater to perform the removal phase immediately, and to defer the +reclamation phase until all readers active during the removal phase have +completed, either by blocking until they finish or by registering a +callback that is invoked after they finish. Only readers that are active +during the removal phase need be considered, because any reader starting +after the removal phase will be unable to gain a reference to the removed +data items, and therefore cannot be disrupted by the reclamation phase. + +So the typical RCU update sequence goes something like the following: + +a. Remove pointers to a data structure, so that subsequent + readers cannot gain a reference to it. + +b. Wait for all previous readers to complete their RCU read-side + critical sections. + +c. At this point, there cannot be any readers who hold references + to the data structure, so it now may safely be reclaimed + (e.g., kfree()d). + +Step (b) above is the key idea underlying RCU's deferred destruction. +The ability to wait until all readers are done allows RCU readers to +use much lighter-weight synchronization, in some cases, absolutely no +synchronization at all. In contrast, in more conventional lock-based +schemes, readers must use heavy-weight synchronization in order to +prevent an updater from deleting the data structure out from under them. +This is because lock-based updaters typically update data items in place, +and must therefore exclude readers. In contrast, RCU-based updaters +typically take advantage of the fact that writes to single aligned +pointers are atomic on modern CPUs, allowing atomic insertion, removal, +and replacement of data items in a linked structure without disrupting +readers. Concurrent RCU readers can then continue accessing the old +versions, and can dispense with the atomic operations, memory barriers, +and communications cache misses that are so expensive on present-day +SMP computer systems, even in absence of lock contention. + +In the three-step procedure shown above, the updater is performing both +the removal and the reclamation step, but it is often helpful for an +entirely different thread to do the reclamation, as is in fact the case +in the Linux kernel's directory-entry cache (dcache). Even if the same +thread performs both the update step (step (a) above) and the reclamation +step (step (c) above), it is often helpful to think of them separately. +For example, RCU readers and updaters need not communicate at all, +but RCU provides implicit low-overhead communication between readers +and reclaimers, namely, in step (b) above. + +So how the heck can a reclaimer tell when a reader is done, given +that readers are not doing any sort of synchronization operations??? +Read on to learn about how RCU's API makes this easy. + + +2. WHAT IS RCU'S CORE API? + +The core RCU API is quite small: + +a. rcu_read_lock() +b. rcu_read_unlock() +c. synchronize_rcu() / call_rcu() +d. rcu_assign_pointer() +e. rcu_dereference() + +There are many other members of the RCU API, but the rest can be +expressed in terms of these five, though most implementations instead +express synchronize_rcu() in terms of the call_rcu() callback API. + +The five core RCU APIs are described below, the other 18 will be enumerated +later. See the kernel docbook documentation for more info, or look directly +at the function header comments. + +rcu_read_lock() + + void rcu_read_lock(void); + + Used by a reader to inform the reclaimer that the reader is + entering an RCU read-side critical section. It is illegal + to block while in an RCU read-side critical section, though + kernels built with CONFIG_PREEMPT_RCU can preempt RCU read-side + critical sections. Any RCU-protected data structure accessed + during an RCU read-side critical section is guaranteed to remain + unreclaimed for the full duration of that critical section. + Reference counts may be used in conjunction with RCU to maintain + longer-term references to data structures. + +rcu_read_unlock() + + void rcu_read_unlock(void); + + Used by a reader to inform the reclaimer that the reader is + exiting an RCU read-side critical section. Note that RCU + read-side critical sections may be nested and/or overlapping. + +synchronize_rcu() + + void synchronize_rcu(void); + + Marks the end of updater code and the beginning of reclaimer + code. It does this by blocking until all pre-existing RCU + read-side critical sections on all CPUs have completed. + Note that synchronize_rcu() will -not- necessarily wait for + any subsequent RCU read-side critical sections to complete. + For example, consider the following sequence of events: + + CPU 0 CPU 1 CPU 2 + ----------------- ------------------------- --------------- + 1. rcu_read_lock() + 2. enters synchronize_rcu() + 3. rcu_read_lock() + 4. rcu_read_unlock() + 5. exits synchronize_rcu() + 6. rcu_read_unlock() + + To reiterate, synchronize_rcu() waits only for ongoing RCU + read-side critical sections to complete, not necessarily for + any that begin after synchronize_rcu() is invoked. + + Of course, synchronize_rcu() does not necessarily return + -immediately- after the last pre-existing RCU read-side critical + section completes. For one thing, there might well be scheduling + delays. For another thing, many RCU implementations process + requests in batches in order to improve efficiencies, which can + further delay synchronize_rcu(). + + Since synchronize_rcu() is the API that must figure out when + readers are done, its implementation is key to RCU. For RCU + to be useful in all but the most read-intensive situations, + synchronize_rcu()'s overhead must also be quite small. + + The call_rcu() API is a callback form of synchronize_rcu(), + and is described in more detail in a later section. Instead of + blocking, it registers a function and argument which are invoked + after all ongoing RCU read-side critical sections have completed. + This callback variant is particularly useful in situations where + it is illegal to block or where update-side performance is + critically important. + + However, the call_rcu() API should not be used lightly, as use + of the synchronize_rcu() API generally results in simpler code. + In addition, the synchronize_rcu() API has the nice property + of automatically limiting update rate should grace periods + be delayed. This property results in system resilience in face + of denial-of-service attacks. Code using call_rcu() should limit + update rate in order to gain this same sort of resilience. See + checklist.txt for some approaches to limiting the update rate. + +rcu_assign_pointer() + + typeof(p) rcu_assign_pointer(p, typeof(p) v); + + Yes, rcu_assign_pointer() -is- implemented as a macro, though it + would be cool to be able to declare a function in this manner. + (Compiler experts will no doubt disagree.) + + The updater uses this function to assign a new value to an + RCU-protected pointer, in order to safely communicate the change + in value from the updater to the reader. This function returns + the new value, and also executes any memory-barrier instructions + required for a given CPU architecture. + + Perhaps just as important, it serves to document (1) which + pointers are protected by RCU and (2) the point at which a + given structure becomes accessible to other CPUs. That said, + rcu_assign_pointer() is most frequently used indirectly, via + the _rcu list-manipulation primitives such as list_add_rcu(). + +rcu_dereference() + + typeof(p) rcu_dereference(p); + + Like rcu_assign_pointer(), rcu_dereference() must be implemented + as a macro. + + The reader uses rcu_dereference() to fetch an RCU-protected + pointer, which returns a value that may then be safely + dereferenced. Note that rcu_deference() does not actually + dereference the pointer, instead, it protects the pointer for + later dereferencing. It also executes any needed memory-barrier + instructions for a given CPU architecture. Currently, only Alpha + needs memory barriers within rcu_dereference() -- on other CPUs, + it compiles to nothing, not even a compiler directive. + + Common coding practice uses rcu_dereference() to copy an + RCU-protected pointer to a local variable, then dereferences + this local variable, for example as follows: + + p = rcu_dereference(head.next); + return p->data; + + However, in this case, one could just as easily combine these + into one statement: + + return rcu_dereference(head.next)->data; + + If you are going to be fetching multiple fields from the + RCU-protected structure, using the local variable is of + course preferred. Repeated rcu_dereference() calls look + ugly and incur unnecessary overhead on Alpha CPUs. + + Note that the value returned by rcu_dereference() is valid + only within the enclosing RCU read-side critical section. + For example, the following is -not- legal: + + rcu_read_lock(); + p = rcu_dereference(head.next); + rcu_read_unlock(); + x = p->address; + rcu_read_lock(); + y = p->data; + rcu_read_unlock(); + + Holding a reference from one RCU read-side critical section + to another is just as illegal as holding a reference from + one lock-based critical section to another! Similarly, + using a reference outside of the critical section in which + it was acquired is just as illegal as doing so with normal + locking. + + As with rcu_assign_pointer(), an important function of + rcu_dereference() is to document which pointers are protected by + RCU, in particular, flagging a pointer that is subject to changing + at any time, including immediately after the rcu_dereference(). + And, again like rcu_assign_pointer(), rcu_dereference() is + typically used indirectly, via the _rcu list-manipulation + primitives, such as list_for_each_entry_rcu(). + +The following diagram shows how each API communicates among the +reader, updater, and reclaimer. + + + rcu_assign_pointer() + +--------+ + +---------------------->| reader |---------+ + | +--------+ | + | | | + | | | Protect: + | | | rcu_read_lock() + | | | rcu_read_unlock() + | rcu_dereference() | | + +---------+ | | + | updater |<---------------------+ | + +---------+ V + | +-----------+ + +----------------------------------->| reclaimer | + +-----------+ + Defer: + synchronize_rcu() & call_rcu() + + +The RCU infrastructure observes the time sequence of rcu_read_lock(), +rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in +order to determine when (1) synchronize_rcu() invocations may return +to their callers and (2) call_rcu() callbacks may be invoked. Efficient +implementations of the RCU infrastructure make heavy use of batching in +order to amortize their overhead over many uses of the corresponding APIs. + +There are no fewer than three RCU mechanisms in the Linux kernel; the +diagram above shows the first one, which is by far the most commonly used. +The rcu_dereference() and rcu_assign_pointer() primitives are used for +all three mechanisms, but different defer and protect primitives are +used as follows: + + Defer Protect + +a. synchronize_rcu() rcu_read_lock() / rcu_read_unlock() + call_rcu() + +b. call_rcu_bh() rcu_read_lock_bh() / rcu_read_unlock_bh() + +c. synchronize_sched() preempt_disable() / preempt_enable() + local_irq_save() / local_irq_restore() + hardirq enter / hardirq exit + NMI enter / NMI exit + +These three mechanisms are used as follows: + +a. RCU applied to normal data structures. + +b. RCU applied to networking data structures that may be subjected + to remote denial-of-service attacks. + +c. RCU applied to scheduler and interrupt/NMI-handler tasks. + +Again, most uses will be of (a). The (b) and (c) cases are important +for specialized uses, but are relatively uncommon. + + +3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? + +This section shows a simple use of the core RCU API to protect a +global pointer to a dynamically allocated structure. More-typical +uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt. + + struct foo { + int a; + char b; + long c; + }; + DEFINE_SPINLOCK(foo_mutex); + + struct foo *gbl_foo; + + /* + * Create a new struct foo that is the same as the one currently + * pointed to by gbl_foo, except that field "a" is replaced + * with "new_a". Points gbl_foo to the new structure, and + * frees up the old structure after a grace period. + * + * Uses rcu_assign_pointer() to ensure that concurrent readers + * see the initialized version of the new structure. + * + * Uses synchronize_rcu() to ensure that any readers that might + * have references to the old structure complete before freeing + * the old structure. + */ + void foo_update_a(int new_a) + { + struct foo *new_fp; + struct foo *old_fp; + + new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); + spin_lock(&foo_mutex); + old_fp = gbl_foo; + *new_fp = *old_fp; + new_fp->a = new_a; + rcu_assign_pointer(gbl_foo, new_fp); + spin_unlock(&foo_mutex); + synchronize_rcu(); + kfree(old_fp); + } + + /* + * Return the value of field "a" of the current gbl_foo + * structure. Use rcu_read_lock() and rcu_read_unlock() + * to ensure that the structure does not get deleted out + * from under us, and use rcu_dereference() to ensure that + * we see the initialized version of the structure (important + * for DEC Alpha and for people reading the code). + */ + int foo_get_a(void) + { + int retval; + + rcu_read_lock(); + retval = rcu_dereference(gbl_foo)->a; + rcu_read_unlock(); + return retval; + } + +So, to sum up: + +o Use rcu_read_lock() and rcu_read_unlock() to guard RCU + read-side critical sections. + +o Within an RCU read-side critical section, use rcu_dereference() + to dereference RCU-protected pointers. + +o Use some solid scheme (such as locks or semaphores) to + keep concurrent updates from interfering with each other. + +o Use rcu_assign_pointer() to update an RCU-protected pointer. + This primitive protects concurrent readers from the updater, + -not- concurrent updates from each other! You therefore still + need to use locking (or something similar) to keep concurrent + rcu_assign_pointer() primitives from interfering with each other. + +o Use synchronize_rcu() -after- removing a data element from an + RCU-protected data structure, but -before- reclaiming/freeing + the data element, in order to wait for the completion of all + RCU read-side critical sections that might be referencing that + data item. + +See checklist.txt for additional rules to follow when using RCU. +And again, more-typical uses of RCU may be found in listRCU.txt, +arrayRCU.txt, and NMI-RCU.txt. + + +4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? + +In the example above, foo_update_a() blocks until a grace period elapses. +This is quite simple, but in some cases one cannot afford to wait so +long -- there might be other high-priority work to be done. + +In such cases, one uses call_rcu() rather than synchronize_rcu(). +The call_rcu() API is as follows: + + void call_rcu(struct rcu_head * head, + void (*func)(struct rcu_head *head)); + +This function invokes func(head) after a grace period has elapsed. +This invocation might happen from either softirq or process context, +so the function is not permitted to block. The foo struct needs to +have an rcu_head structure added, perhaps as follows: + + struct foo { + int a; + char b; + long c; + struct rcu_head rcu; + }; + +The foo_update_a() function might then be written as follows: + + /* + * Create a new struct foo that is the same as the one currently + * pointed to by gbl_foo, except that field "a" is replaced + * with "new_a". Points gbl_foo to the new structure, and + * frees up the old structure after a grace period. + * + * Uses rcu_assign_pointer() to ensure that concurrent readers + * see the initialized version of the new structure. + * + * Uses call_rcu() to ensure that any readers that might have + * references to the old structure complete before freeing the + * old structure. + */ + void foo_update_a(int new_a) + { + struct foo *new_fp; + struct foo *old_fp; + + new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); + spin_lock(&foo_mutex); + old_fp = gbl_foo; + *new_fp = *old_fp; + new_fp->a = new_a; + rcu_assign_pointer(gbl_foo, new_fp); + spin_unlock(&foo_mutex); + call_rcu(&old_fp->rcu, foo_reclaim); + } + +The foo_reclaim() function might appear as follows: + + void foo_reclaim(struct rcu_head *rp) + { + struct foo *fp = container_of(rp, struct foo, rcu); + + kfree(fp); + } + +The container_of() primitive is a macro that, given a pointer into a +struct, the type of the struct, and the pointed-to field within the +struct, returns a pointer to the beginning of the struct. + +The use of call_rcu() permits the caller of foo_update_a() to +immediately regain control, without needing to worry further about the +old version of the newly updated element. It also clearly shows the +RCU distinction between updater, namely foo_update_a(), and reclaimer, +namely foo_reclaim(). + +The summary of advice is the same as for the previous section, except +that we are now using call_rcu() rather than synchronize_rcu(): + +o Use call_rcu() -after- removing a data element from an + RCU-protected data structure in order to register a callback + function that will be invoked after the completion of all RCU + read-side critical sections that might be referencing that + data item. + +Again, see checklist.txt for additional rules governing the use of RCU. + + +5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? + +One of the nice things about RCU is that it has extremely simple "toy" +implementations that are a good first step towards understanding the +production-quality implementations in the Linux kernel. This section +presents two such "toy" implementations of RCU, one that is implemented +in terms of familiar locking primitives, and another that more closely +resembles "classic" RCU. Both are way too simple for real-world use, +lacking both functionality and performance. However, they are useful +in getting a feel for how RCU works. See kernel/rcupdate.c for a +production-quality implementation, and see: + + http://www.rdrop.com/users/paulmck/RCU + +for papers describing the Linux kernel RCU implementation. The OLS'01 +and OLS'02 papers are a good introduction, and the dissertation provides +more details on the current implementation as of early 2004. + + +5A. "TOY" IMPLEMENTATION #1: LOCKING + +This section presents a "toy" RCU implementation that is based on +familiar locking primitives. Its overhead makes it a non-starter for +real-life use, as does its lack of scalability. It is also unsuitable +for realtime use, since it allows scheduling latency to "bleed" from +one read-side critical section to another. + +However, it is probably the easiest implementation to relate to, so is +a good starting point. + +It is extremely simple: + + static DEFINE_RWLOCK(rcu_gp_mutex); + + void rcu_read_lock(void) + { + read_lock(&rcu_gp_mutex); + } + + void rcu_read_unlock(void) + { + read_unlock(&rcu_gp_mutex); + } + + void synchronize_rcu(void) + { + write_lock(&rcu_gp_mutex); + write_unlock(&rcu_gp_mutex); + } + +[You can ignore rcu_assign_pointer() and rcu_dereference() without +missing much. But here they are anyway. And whatever you do, don't +forget about them when submitting patches making use of RCU!] + + #define rcu_assign_pointer(p, v) ({ \ + smp_wmb(); \ + (p) = (v); \ + }) + + #define rcu_dereference(p) ({ \ + typeof(p) _________p1 = p; \ + smp_read_barrier_depends(); \ + (_________p1); \ + }) + + +The rcu_read_lock() and rcu_read_unlock() primitive read-acquire +and release a global reader-writer lock. The synchronize_rcu() +primitive write-acquires this same lock, then immediately releases +it. This means that once synchronize_rcu() exits, all RCU read-side +critical sections that were in progress before synchronize_rcu() was +called are guaranteed to have completed -- there is no way that +synchronize_rcu() would have been able to write-acquire the lock +otherwise. + +It is possible to nest rcu_read_lock(), since reader-writer locks may +be recursively acquired. Note also that rcu_read_lock() is immune +from deadlock (an important property of RCU). The reason for this is +that the only thing that can block rcu_read_lock() is a synchronize_rcu(). +But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex, +so there can be no deadlock cycle. + +Quick Quiz #1: Why is this argument naive? How could a deadlock + occur when using this algorithm in a real-world Linux + kernel? How could this deadlock be avoided? + + +5B. "TOY" EXAMPLE #2: CLASSIC RCU + +This section presents a "toy" RCU implementation that is based on +"classic RCU". It is also short on performance (but only for updates) and +on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT +kernels. The definitions of rcu_dereference() and rcu_assign_pointer() +are the same as those shown in the preceding section, so they are omitted. + + void rcu_read_lock(void) { } + + void rcu_read_unlock(void) { } + + void synchronize_rcu(void) + { + int cpu; + + for_each_possible_cpu(cpu) + run_on(cpu); + } + +Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing. +This is the great strength of classic RCU in a non-preemptive kernel: +read-side overhead is precisely zero, at least on non-Alpha CPUs. +And there is absolutely no way that rcu_read_lock() can possibly +participate in a deadlock cycle! + +The implementation of synchronize_rcu() simply schedules itself on each +CPU in turn. The run_on() primitive can be implemented straightforwardly +in terms of the sched_setaffinity() primitive. Of course, a somewhat less +"toy" implementation would restore the affinity upon completion rather +than just leaving all tasks running on the last CPU, but when I said +"toy", I meant -toy-! + +So how the heck is this supposed to work??? + +Remember that it is illegal to block while in an RCU read-side critical +section. Therefore, if a given CPU executes a context switch, we know +that it must have completed all preceding RCU read-side critical sections. +Once -all- CPUs have executed a context switch, then -all- preceding +RCU read-side critical sections will have completed. + +So, suppose that we remove a data item from its structure and then invoke +synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed +that there are no RCU read-side critical sections holding a reference +to that data item, so we can safely reclaim it. + +Quick Quiz #2: Give an example where Classic RCU's read-side + overhead is -negative-. + +Quick Quiz #3: If it is illegal to block in an RCU read-side + critical section, what the heck do you do in + PREEMPT_RT, where normal spinlocks can block??? + + +6. ANALOGY WITH READER-WRITER LOCKING + +Although RCU can be used in many different ways, a very common use of +RCU is analogous to reader-writer locking. The following unified +diff shows how closely related RCU and reader-writer locking can be. + + @@ -13,15 +14,15 @@ + struct list_head *lp; + struct el *p; + + - read_lock(); + - list_for_each_entry(p, head, lp) { + + rcu_read_lock(); + + list_for_each_entry_rcu(p, head, lp) { + if (p->key == key) { + *result = p->data; + - read_unlock(); + + rcu_read_unlock(); + return 1; + } + } + - read_unlock(); + + rcu_read_unlock(); + return 0; + } + + @@ -29,15 +30,16 @@ + { + struct el *p; + + - write_lock(&listmutex); + + spin_lock(&listmutex); + list_for_each_entry(p, head, lp) { + if (p->key == key) { + - list_del(&p->list); + - write_unlock(&listmutex); + + list_del_rcu(&p->list); + + spin_unlock(&listmutex); + + synchronize_rcu(); + kfree(p); + return 1; + } + } + - write_unlock(&listmutex); + + spin_unlock(&listmutex); + return 0; + } + +Or, for those who prefer a side-by-side listing: + + 1 struct el { 1 struct el { + 2 struct list_head list; 2 struct list_head list; + 3 long key; 3 long key; + 4 spinlock_t mutex; 4 spinlock_t mutex; + 5 int data; 5 int data; + 6 /* Other data fields */ 6 /* Other data fields */ + 7 }; 7 }; + 8 spinlock_t listmutex; 8 spinlock_t listmutex; + 9 struct el head; 9 struct el head; + + 1 int search(long key, int *result) 1 int search(long key, int *result) + 2 { 2 { + 3 struct list_head *lp; 3 struct list_head *lp; + 4 struct el *p; 4 struct el *p; + 5 5 + 6 read_lock(); 6 rcu_read_lock(); + 7 list_for_each_entry(p, head, lp) { 7 list_for_each_entry_rcu(p, head, lp) { + 8 if (p->key == key) { 8 if (p->key == key) { + 9 *result = p->data; 9 *result = p->data; +10 read_unlock(); 10 rcu_read_unlock(); +11 return 1; 11 return 1; +12 } 12 } +13 } 13 } +14 read_unlock(); 14 rcu_read_unlock(); +15 return 0; 15 return 0; +16 } 16 } + + 1 int delete(long key) 1 int delete(long key) + 2 { 2 { + 3 struct el *p; 3 struct el *p; + 4 4 + 5 write_lock(&listmutex); 5 spin_lock(&listmutex); + 6 list_for_each_entry(p, head, lp) { 6 list_for_each_entry(p, head, lp) { + 7 if (p->key == key) { 7 if (p->key == key) { + 8 list_del(&p->list); 8 list_del_rcu(&p->list); + 9 write_unlock(&listmutex); 9 spin_unlock(&listmutex); + 10 synchronize_rcu(); +10 kfree(p); 11 kfree(p); +11 return 1; 12 return 1; +12 } 13 } +13 } 14 } +14 write_unlock(&listmutex); 15 spin_unlock(&listmutex); +15 return 0; 16 return 0; +16 } 17 } + +Either way, the differences are quite small. Read-side locking moves +to rcu_read_lock() and rcu_read_unlock, update-side locking moves from +a reader-writer lock to a simple spinlock, and a synchronize_rcu() +precedes the kfree(). + +However, there is one potential catch: the read-side and update-side +critical sections can now run concurrently. In many cases, this will +not be a problem, but it is necessary to check carefully regardless. +For example, if multiple independent list updates must be seen as +a single atomic update, converting to RCU will require special care. + +Also, the presence of synchronize_rcu() means that the RCU version of +delete() can now block. If this is a problem, there is a callback-based +mechanism that never blocks, namely call_rcu(), that can be used in +place of synchronize_rcu(). + + +7. FULL LIST OF RCU APIs + +The RCU APIs are documented in docbook-format header comments in the +Linux-kernel source code, but it helps to have a full list of the +APIs, since there does not appear to be a way to categorize them +in docbook. Here is the list, by category. + +RCU pointer/list traversal: + + rcu_dereference + list_for_each_entry_rcu + hlist_for_each_entry_rcu + + list_for_each_continue_rcu (to be deprecated in favor of new + list_for_each_entry_continue_rcu) + +RCU pointer/list update: + + rcu_assign_pointer + list_add_rcu + list_add_tail_rcu + list_del_rcu + list_replace_rcu + hlist_del_rcu + hlist_add_after_rcu + hlist_add_before_rcu + hlist_add_head_rcu + hlist_replace_rcu + list_splice_init_rcu() + +RCU: Critical sections Grace period Barrier + + rcu_read_lock synchronize_net rcu_barrier + rcu_read_unlock synchronize_rcu + call_rcu + + +bh: Critical sections Grace period Barrier + + rcu_read_lock_bh call_rcu_bh rcu_barrier_bh + rcu_read_unlock_bh + + +sched: Critical sections Grace period Barrier + + [preempt_disable] synchronize_sched rcu_barrier_sched + [and friends] call_rcu_sched + + +SRCU: Critical sections Grace period Barrier + + srcu_read_lock synchronize_srcu N/A + srcu_read_unlock + + +See the comment headers in the source code (or the docbook generated +from them) for more information. + + +8. ANSWERS TO QUICK QUIZZES + +Quick Quiz #1: Why is this argument naive? How could a deadlock + occur when using this algorithm in a real-world Linux + kernel? [Referring to the lock-based "toy" RCU + algorithm.] + +Answer: Consider the following sequence of events: + + 1. CPU 0 acquires some unrelated lock, call it + "problematic_lock", disabling irq via + spin_lock_irqsave(). + + 2. CPU 1 enters synchronize_rcu(), write-acquiring + rcu_gp_mutex. + + 3. CPU 0 enters rcu_read_lock(), but must wait + because CPU 1 holds rcu_gp_mutex. + + 4. CPU 1 is interrupted, and the irq handler + attempts to acquire problematic_lock. + + The system is now deadlocked. + + One way to avoid this deadlock is to use an approach like + that of CONFIG_PREEMPT_RT, where all normal spinlocks + become blocking locks, and all irq handlers execute in + the context of special tasks. In this case, in step 4 + above, the irq handler would block, allowing CPU 1 to + release rcu_gp_mutex, avoiding the deadlock. + + Even in the absence of deadlock, this RCU implementation + allows latency to "bleed" from readers to other + readers through synchronize_rcu(). To see this, + consider task A in an RCU read-side critical section + (thus read-holding rcu_gp_mutex), task B blocked + attempting to write-acquire rcu_gp_mutex, and + task C blocked in rcu_read_lock() attempting to + read_acquire rcu_gp_mutex. Task A's RCU read-side + latency is holding up task C, albeit indirectly via + task B. + + Realtime RCU implementations therefore use a counter-based + approach where tasks in RCU read-side critical sections + cannot be blocked by tasks executing synchronize_rcu(). + +Quick Quiz #2: Give an example where Classic RCU's read-side + overhead is -negative-. + +Answer: Imagine a single-CPU system with a non-CONFIG_PREEMPT + kernel where a routing table is used by process-context + code, but can be updated by irq-context code (for example, + by an "ICMP REDIRECT" packet). The usual way of handling + this would be to have the process-context code disable + interrupts while searching the routing table. Use of + RCU allows such interrupt-disabling to be dispensed with. + Thus, without RCU, you pay the cost of disabling interrupts, + and with RCU you don't. + + One can argue that the overhead of RCU in this + case is negative with respect to the single-CPU + interrupt-disabling approach. Others might argue that + the overhead of RCU is merely zero, and that replacing + the positive overhead of the interrupt-disabling scheme + with the zero-overhead RCU scheme does not constitute + negative overhead. + + In real life, of course, things are more complex. But + even the theoretical possibility of negative overhead for + a synchronization primitive is a bit unexpected. ;-) + +Quick Quiz #3: If it is illegal to block in an RCU read-side + critical section, what the heck do you do in + PREEMPT_RT, where normal spinlocks can block??? + +Answer: Just as PREEMPT_RT permits preemption of spinlock + critical sections, it permits preemption of RCU + read-side critical sections. It also permits + spinlocks blocking while in RCU read-side critical + sections. + + Why the apparent inconsistency? Because it is it + possible to use priority boosting to keep the RCU + grace periods short if need be (for example, if running + short of memory). In contrast, if blocking waiting + for (say) network reception, there is no way to know + what should be boosted. Especially given that the + process we need to boost might well be a human being + who just went out for a pizza or something. And although + a computer-operated cattle prod might arouse serious + interest, it might also provoke serious objections. + Besides, how does the computer know what pizza parlor + the human being went to??? + + +ACKNOWLEDGEMENTS + +My thanks to the people who helped make this human-readable, including +Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern. + + +For more information, see http://www.rdrop.com/users/paulmck/RCU. |