summaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
authorIngo Molnar <mingo@elte.hu>2009-04-20 18:08:07 +0200
committerIngo Molnar <mingo@elte.hu>2009-04-20 18:08:12 +0200
commit62d170290979e0bb805d969cca4ea852bdd45260 (patch)
tree837372297501a2d144358b44e7db3f88c5612aa2 /Documentation
parent8b5b94e4e9813cdd77103827f48d58c806ab45c6 (diff)
parentd91dfbb41bb2e9bdbfbd2cc7078ed7436eab027a (diff)
downloadop-kernel-dev-62d170290979e0bb805d969cca4ea852bdd45260.zip
op-kernel-dev-62d170290979e0bb805d969cca4ea852bdd45260.tar.gz
Merge branch 'linus' into x86/urgent
Merge reason: We need the x86/uv updates from upstream, to queue up dependent fix. Signed-off-by: Ingo Molnar <mingo@elte.hu>
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/testing/debugfs-pktcdvd6
-rw-r--r--Documentation/DocBook/Makefile11
-rw-r--r--Documentation/block/biodoc.txt19
-rw-r--r--Documentation/cgroups/cpuacct.txt18
-rw-r--r--Documentation/cgroups/memory.txt55
-rw-r--r--Documentation/cgroups/resource_counter.txt27
-rw-r--r--Documentation/driver-model/platform.txt59
-rw-r--r--Documentation/feature-removal-schedule.txt9
-rw-r--r--Documentation/filesystems/pohmelfs/design_notes.txt5
-rw-r--r--Documentation/filesystems/pohmelfs/info.txt21
-rw-r--r--Documentation/infiniband/ipoib.txt45
-rw-r--r--Documentation/input/rotary-encoder.txt101
-rw-r--r--Documentation/kbuild/makefiles.txt93
-rw-r--r--Documentation/kernel-parameters.txt38
-rw-r--r--Documentation/lguest/.gitignore1
-rw-r--r--Documentation/lguest/lguest.txt11
-rw-r--r--Documentation/networking/bonding.txt2
-rw-r--r--Documentation/powerpc/dts-bindings/fsl/i2c.txt46
-rw-r--r--Documentation/sound/alsa/HD-Audio.txt4
-rw-r--r--Documentation/sparse.txt8
-rw-r--r--Documentation/sysctl/net.txt2
-rw-r--r--Documentation/tomoyo.txt55
-rw-r--r--Documentation/trace/ftrace.txt (renamed from Documentation/ftrace.txt)0
-rw-r--r--Documentation/trace/kmemtrace.txt (renamed from Documentation/vm/kmemtrace.txt)0
-rw-r--r--Documentation/trace/mmiotrace.txt (renamed from Documentation/tracers/mmiotrace.txt)0
-rw-r--r--Documentation/trace/tracepoints.txt (renamed from Documentation/tracepoints.txt)0
-rw-r--r--Documentation/vm/00-INDEX2
-rw-r--r--Documentation/vm/active_mm.txt83
-rw-r--r--Documentation/vm/unevictable-lru.txt1041
29 files changed, 1207 insertions, 555 deletions
diff --git a/Documentation/ABI/testing/debugfs-pktcdvd b/Documentation/ABI/testing/debugfs-pktcdvd
index bf9c16b..cf11736 100644
--- a/Documentation/ABI/testing/debugfs-pktcdvd
+++ b/Documentation/ABI/testing/debugfs-pktcdvd
@@ -1,4 +1,4 @@
-What: /debug/pktcdvd/pktcdvd[0-7]
+What: /sys/kernel/debug/pktcdvd/pktcdvd[0-7]
Date: Oct. 2006
KernelVersion: 2.6.20
Contact: Thomas Maier <balagi@justmail.de>
@@ -10,10 +10,10 @@ debugfs interface
The pktcdvd module (packet writing driver) creates
these files in debugfs:
-/debug/pktcdvd/pktcdvd[0-7]/
+/sys/kernel/debug/pktcdvd/pktcdvd[0-7]/
info (0444) Lots of driver statistics and infos.
Example:
-------
-cat /debug/pktcdvd/pktcdvd0/info
+cat /sys/kernel/debug/pktcdvd/pktcdvd0/info
diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile
index a3a83d3..8918a32 100644
--- a/Documentation/DocBook/Makefile
+++ b/Documentation/DocBook/Makefile
@@ -31,7 +31,7 @@ PS_METHOD = $(prefer-db2x)
###
# The targets that may be used.
-PHONY += xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs
+PHONY += xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs cleandocs
BOOKS := $(addprefix $(obj)/,$(DOCBOOKS))
xmldocs: $(BOOKS)
@@ -213,11 +213,12 @@ silent_gen_xml = :
dochelp:
@echo ' Linux kernel internal documentation in different formats:'
@echo ' htmldocs - HTML'
- @echo ' installmandocs - install man pages generated by mandocs'
- @echo ' mandocs - man pages'
@echo ' pdfdocs - PDF'
@echo ' psdocs - Postscript'
@echo ' xmldocs - XML DocBook'
+ @echo ' mandocs - man pages'
+ @echo ' installmandocs - install man pages generated by mandocs'
+ @echo ' cleandocs - clean all generated DocBook files'
###
# Temporary files left by various tools
@@ -235,6 +236,10 @@ clean-files := $(DOCBOOKS) \
clean-dirs := $(patsubst %.xml,%,$(DOCBOOKS)) man
+cleandocs:
+ $(Q)rm -f $(call objectify, $(clean-files))
+ $(Q)rm -rf $(call objectify, $(clean-dirs))
+
# Declare the contents of the .PHONY variable as phony. We keep that
# information in a variable se we can use it in if_changed and friends.
diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index ecad6ee..6fab97e 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -1040,23 +1040,21 @@ Front merges are handled by the binary trees in AS and deadline schedulers.
iii. Plugging the queue to batch requests in anticipation of opportunities for
merge/sort optimizations
-This is just the same as in 2.4 so far, though per-device unplugging
-support is anticipated for 2.5. Also with a priority-based i/o scheduler,
-such decisions could be based on request priorities.
-
Plugging is an approach that the current i/o scheduling algorithm resorts to so
that it collects up enough requests in the queue to be able to take
advantage of the sorting/merging logic in the elevator. If the
queue is empty when a request comes in, then it plugs the request queue
-(sort of like plugging the bottom of a vessel to get fluid to build up)
+(sort of like plugging the bath tub of a vessel to get fluid to build up)
till it fills up with a few more requests, before starting to service
the requests. This provides an opportunity to merge/sort the requests before
passing them down to the device. There are various conditions when the queue is
unplugged (to open up the flow again), either through a scheduled task or
could be on demand. For example wait_on_buffer sets the unplugging going
-(by running tq_disk) so the read gets satisfied soon. So in the read case,
-the queue gets explicitly unplugged as part of waiting for completion,
-in fact all queues get unplugged as a side-effect.
+through sync_buffer() running blk_run_address_space(mapping). Or the caller
+can do it explicity through blk_unplug(bdev). So in the read case,
+the queue gets explicitly unplugged as part of waiting for completion on that
+buffer. For page driven IO, the address space ->sync_page() takes care of
+doing the blk_run_address_space().
Aside:
This is kind of controversial territory, as it's not clear if plugging is
@@ -1067,11 +1065,6 @@ Aside:
multi-page bios being queued in one shot, we may not need to wait to merge
a big request from the broken up pieces coming by.
- Per-queue granularity unplugging (still a Todo) may help reduce some of the
- concerns with just a single tq_disk flush approach. Something like
- blk_kick_queue() to unplug a specific queue (right away ?)
- or optionally, all queues, is in the plan.
-
4.4 I/O contexts
I/O contexts provide a dynamically allocated per process data area. They may
be used in I/O schedulers, and in the block layer (could be used for IO statis,
diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index bb775fb..8b93094 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -30,3 +30,21 @@ The above steps create a new group g1 and move the current shell
process (bash) into it. CPU time consumed by this bash and its children
can be obtained from g1/cpuacct.usage and the same is accumulated in
/cgroups/cpuacct.usage also.
+
+cpuacct.stat file lists a few statistics which further divide the
+CPU time obtained by the cgroup into user and system times. Currently
+the following statistics are supported:
+
+user: Time spent by tasks of the cgroup in user mode.
+system: Time spent by tasks of the cgroup in kernel mode.
+
+user and system are in USER_HZ unit.
+
+cpuacct controller uses percpu_counter interface to collect user and
+system times. This has two side effects:
+
+- It is theoretically possible to see wrong values for user and system times.
+ This is because percpu_counter_read() on 32bit systems isn't safe
+ against concurrent writes.
+- It is possible to see slightly outdated values for user and system times
+ due to the batch processing nature of percpu_counter.
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index a98a7fe..1a60887 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -6,15 +6,14 @@ used here with the memory controller that is used in hardware.
Salient features
-a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages
+a. Enable control of Anonymous, Page Cache (mapped and unmapped) and
+ Swap Cache memory pages.
b. The infrastructure allows easy addition of other types of memory to control
c. Provides *zero overhead* for non memory controller users
d. Provides a double LRU: global memory pressure causes reclaim from the
global LRU; a cgroup on hitting a limit, reclaims from the per
cgroup LRU
-NOTE: Swap Cache (unmapped) is not accounted now.
-
Benefits and Purpose of the memory controller
The memory controller isolates the memory behaviour of a group of tasks
@@ -290,34 +289,44 @@ will be charged as a new owner of it.
moved to the parent. If you want to avoid that, force_empty will be useful.
5.2 stat file
- memory.stat file includes following statistics (now)
- cache - # of pages from page-cache and shmem.
- rss - # of pages from anonymous memory.
- pgpgin - # of event of charging
- pgpgout - # of event of uncharging
- active_anon - # of pages on active lru of anon, shmem.
- inactive_anon - # of pages on active lru of anon, shmem
- active_file - # of pages on active lru of file-cache
- inactive_file - # of pages on inactive lru of file cache
- unevictable - # of pages cannot be reclaimed.(mlocked etc)
-
- Below is depend on CONFIG_DEBUG_VM.
- inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
- recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
- recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
- recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
- recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
-
- Memo:
+
+memory.stat file includes following statistics
+
+cache - # of bytes of page cache memory.
+rss - # of bytes of anonymous and swap cache memory.
+pgpgin - # of pages paged in (equivalent to # of charging events).
+pgpgout - # of pages paged out (equivalent to # of uncharging events).
+active_anon - # of bytes of anonymous and swap cache memory on active
+ lru list.
+inactive_anon - # of bytes of anonymous memory and swap cache memory on
+ inactive lru list.
+active_file - # of bytes of file-backed memory on active lru list.
+inactive_file - # of bytes of file-backed memory on inactive lru list.
+unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc).
+
+The following additional stats are dependent on CONFIG_DEBUG_VM.
+
+inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
+recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
+recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
+recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
+recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
+
+Memo:
recent_rotated means recent frequency of lru rotation.
recent_scanned means recent # of scans to lru.
showing for better debug please see the code for meanings.
+Note:
+ Only anonymous and swap cache memory is listed as part of 'rss' stat.
+ This should not be confused with the true 'resident set size' or the
+ amount of physical memory used by the cgroup. Per-cgroup rss
+ accounting is not done yet.
5.3 swappiness
Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
- Following cgroup's swapiness can't be changed.
+ Following cgroups' swapiness can't be changed.
- root cgroup (uses /proc/sys/vm/swappiness).
- a cgroup which uses hierarchy and it has child cgroup.
- a cgroup which uses hierarchy and not the root of hierarchy.
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index f196ac1..95b24d7 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -47,13 +47,18 @@ to work with it.
2. Basic accounting routines
- a. void res_counter_init(struct res_counter *rc)
+ a. void res_counter_init(struct res_counter *rc,
+ struct res_counter *rc_parent)
Initializes the resource counter. As usual, should be the first
routine called for a new counter.
- b. int res_counter_charge[_locked]
- (struct res_counter *rc, unsigned long val)
+ The struct res_counter *parent can be used to define a hierarchical
+ child -> parent relationship directly in the res_counter structure,
+ NULL can be used to define no relationship.
+
+ c. int res_counter_charge(struct res_counter *rc, unsigned long val,
+ struct res_counter **limit_fail_at)
When a resource is about to be allocated it has to be accounted
with the appropriate resource counter (controller should determine
@@ -67,15 +72,25 @@ to work with it.
* if the charging is performed first, then it should be uncharged
on error path (if the one is called).
- c. void res_counter_uncharge[_locked]
+ If the charging fails and a hierarchical dependency exists, the
+ limit_fail_at parameter is set to the particular res_counter element
+ where the charging failed.
+
+ d. int res_counter_charge_locked
+ (struct res_counter *rc, unsigned long val)
+
+ The same as res_counter_charge(), but it must not acquire/release the
+ res_counter->lock internally (it must be called with res_counter->lock
+ held).
+
+ e. void res_counter_uncharge[_locked]
(struct res_counter *rc, unsigned long val)
When a resource is released (freed) it should be de-accounted
from the resource counter it was accounted to. This is called
"uncharging".
- The _locked routines imply that the res_counter->lock is taken.
-
+ The _locked routines imply that the res_counter->lock is taken.
2.1 Other accounting routines
diff --git a/Documentation/driver-model/platform.txt b/Documentation/driver-model/platform.txt
index 83009fdc..2e2c2ea 100644
--- a/Documentation/driver-model/platform.txt
+++ b/Documentation/driver-model/platform.txt
@@ -169,3 +169,62 @@ three different ways to find such a match:
be probed later if another device registers. (Which is OK, since
this interface is only for use with non-hotpluggable devices.)
+
+Early Platform Devices and Drivers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The early platform interfaces provide platform data to platform device
+drivers early on during the system boot. The code is built on top of the
+early_param() command line parsing and can be executed very early on.
+
+Example: "earlyprintk" class early serial console in 6 steps
+
+1. Registering early platform device data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The architecture code registers platform device data using the function
+early_platform_add_devices(). In the case of early serial console this
+should be hardware configuration for the serial port. Devices registered
+at this point will later on be matched against early platform drivers.
+
+2. Parsing kernel command line
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The architecture code calls parse_early_param() to parse the kernel
+command line. This will execute all matching early_param() callbacks.
+User specified early platform devices will be registered at this point.
+For the early serial console case the user can specify port on the
+kernel command line as "earlyprintk=serial.0" where "earlyprintk" is
+the class string, "serial" is the name of the platfrom driver and
+0 is the platform device id. If the id is -1 then the dot and the
+id can be omitted.
+
+3. Installing early platform drivers belonging to a certain class
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The architecture code may optionally force registration of all early
+platform drivers belonging to a certain class using the function
+early_platform_driver_register_all(). User specified devices from
+step 2 have priority over these. This step is omitted by the serial
+driver example since the early serial driver code should be disabled
+unless the user has specified port on the kernel command line.
+
+4. Early platform driver registration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Compiled-in platform drivers making use of early_platform_init() are
+automatically registered during step 2 or 3. The serial driver example
+should use early_platform_init("earlyprintk", &platform_driver).
+
+5. Probing of early platform drivers belonging to a certain class
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The architecture code calls early_platform_driver_probe() to match
+registered early platform devices associated with a certain class with
+registered early platform drivers. Matched devices will get probed().
+This step can be executed at any point during the early boot. As soon
+as possible may be good for the serial port case.
+
+6. Inside the early platform driver probe()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver code needs to take special care during early boot, especially
+when it comes to memory allocation and interrupt registration. The code
+in the probe() function can use is_early_platform_device() to check if
+it is called at early platform device or at the regular platform device
+time. The early serial driver performs register_console() at this point.
+
+For further information, see <linux/platform_device.h>.
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 7e2af10..de491a3 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -428,3 +428,12 @@ Why: In 2.6.27, the semantics of /sys/bus/pci/slots was redefined to
After a reasonable transition period, we will remove the legacy
fakephp interface.
Who: Alex Chiang <achiang@hp.com>
+
+---------------------------
+
+What: i2c-voodoo3 driver
+When: October 2009
+Why: Superseded by tdfxfb. I2C/DDC support used to live in a separate
+ driver but this caused driver conflicts.
+Who: Jean Delvare <khali@linux-fr.org>
+ Krzysztof Helt <krzysztof.h1@wp.pl>
diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt
index 6d6db60..dcf8335 100644
--- a/Documentation/filesystems/pohmelfs/design_notes.txt
+++ b/Documentation/filesystems/pohmelfs/design_notes.txt
@@ -56,9 +56,10 @@ workloads and can fully utilize the bandwidth to the servers when doing bulk
data transfers.
POHMELFS clients operate with a working set of servers and are capable of balancing read-only
-operations (like lookups or directory listings) between them.
+operations (like lookups or directory listings) between them according to IO priorities.
Administrators can add or remove servers from the set at run-time via special commands (described
-in Documentation/pohmelfs/info.txt file). Writes are replicated to all servers.
+in Documentation/pohmelfs/info.txt file). Writes are replicated to all servers, which are connected
+with write permission turned on. IO priority and permissions can be changed in run-time.
POHMELFS is capable of full data channel encryption and/or strong crypto hashing.
One can select any kernel supported cipher, encryption mode, hash type and operation mode
diff --git a/Documentation/filesystems/pohmelfs/info.txt b/Documentation/filesystems/pohmelfs/info.txt
index 4e3d501..db2e413 100644
--- a/Documentation/filesystems/pohmelfs/info.txt
+++ b/Documentation/filesystems/pohmelfs/info.txt
@@ -1,6 +1,8 @@
POHMELFS usage information.
-Mount options:
+Mount options.
+All but index, number of crypto threads and maximum IO size can changed via remount.
+
idx=%u
Each mountpoint is associated with a special index via this option.
Administrator can add or remove servers from the given index, so all mounts,
@@ -52,16 +54,27 @@ mcache_timeout=%u
Usage examples.
-Add (or remove if it already exists) server server1.net:1025 into the working set with index $idx
+Add server server1.net:1025 into the working set with index $idx
with appropriate hash algorithm and key file and cipher algorithm, mode and key file:
-$cfg -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key
+$cfg A add -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key
Mount filesystem with given index $idx to /mnt mountpoint.
Client will connect to all servers specified in the working set via previous command:
mount -t pohmel -o idx=$idx q /mnt
-One can add or remove servers from working set after mounting too.
+Change permissions to read-only (-I 1 option, '-I 2' - write-only, 3 - rw):
+$cfg A modify -a server1.net -p 1025 -i $idx -I 1
+
+Change IO priority to 123 (node with the highest priority gets read requests).
+$cfg A modify -a server1.net -p 1025 -i $idx -P 123
+One can check currect status of all connections in the mountstats file:
+# cat /proc/$PID/mountstats
+...
+device none mounted on /mnt with fstype pohmel
+idx addr(:port) socket_type protocol active priority permissions
+0 server1.net:1026 1 6 1 250 1
+0 server2.net:1025 1 6 1 123 3
Server installation.
diff --git a/Documentation/infiniband/ipoib.txt b/Documentation/infiniband/ipoib.txt
index 864ff32..6d40f00 100644
--- a/Documentation/infiniband/ipoib.txt
+++ b/Documentation/infiniband/ipoib.txt
@@ -24,6 +24,49 @@ Partitions and P_Keys
The P_Key for any interface is given by the "pkey" file, and the
main interface for a subinterface is in "parent."
+Datagram vs Connected modes
+
+ The IPoIB driver supports two modes of operation: datagram and
+ connected. The mode is set and read through an interface's
+ /sys/class/net/<intf name>/mode file.
+
+ In datagram mode, the IB UD (Unreliable Datagram) transport is used
+ and so the interface MTU has is equal to the IB L2 MTU minus the
+ IPoIB encapsulation header (4 bytes). For example, in a typical IB
+ fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.
+
+ In connected mode, the IB RC (Reliable Connected) transport is used.
+ Connected mode is to takes advantage of the connected nature of the
+ IB transport and allows an MTU up to the maximal IP packet size of
+ 64K, which reduces the number of IP packets needed for handling
+ large UDP datagrams, TCP segments, etc and increases the performance
+ for large messages.
+
+ In connected mode, the interface's UD QP is still used for multicast
+ and communication with peers that don't support connected mode. In
+ this case, RX emulation of ICMP PMTU packets is used to cause the
+ networking stack to use the smaller UD MTU for these neighbours.
+
+Stateless offloads
+
+ If the IB HW supports IPoIB stateless offloads, IPoIB advertises
+ TCP/IP checksum and/or Large Send (LSO) offloading capability to the
+ network stack.
+
+ Large Receive (LRO) offloading is also implemented and may be turned
+ on/off using ethtool calls. Currently LRO is supported only for
+ checksum offload capable devices.
+
+ Stateless offloads are supported only in datagram mode.
+
+Interrupt moderation
+
+ If the underlying IB device supports CQ event moderation, one can
+ use ethtool to set interrupt mitigation parameters and thus reduce
+ the overhead incurred by handling interrupts. The main code path of
+ IPoIB doesn't use events for TX completion signaling so only RX
+ moderation is supported.
+
Debugging Information
By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
@@ -55,3 +98,5 @@ References
http://ietf.org/rfc/rfc4391.txt
IP over InfiniBand (IPoIB) Architecture (RFC 4392)
http://ietf.org/rfc/rfc4392.txt
+ IP over InfiniBand: Connected Mode (RFC 4755)
+ http://ietf.org/rfc/rfc4755.txt
diff --git a/Documentation/input/rotary-encoder.txt b/Documentation/input/rotary-encoder.txt
new file mode 100644
index 0000000..435102a
--- /dev/null
+++ b/Documentation/input/rotary-encoder.txt
@@ -0,0 +1,101 @@
+rotary-encoder - a generic driver for GPIO connected devices
+Daniel Mack <daniel@caiaq.de>, Feb 2009
+
+0. Function
+-----------
+
+Rotary encoders are devices which are connected to the CPU or other
+peripherals with two wires. The outputs are phase-shifted by 90 degrees
+and by triggering on falling and rising edges, the turn direction can
+be determined.
+
+The phase diagram of these two outputs look like this:
+
+ _____ _____ _____
+ | | | | | |
+ Channel A ____| |_____| |_____| |____
+
+ : : : : : : : : : : : :
+ __ _____ _____ _____
+ | | | | | | |
+ Channel B |_____| |_____| |_____| |__
+
+ : : : : : : : : : : : :
+ Event a b c d a b c d a b c d
+
+ |<-------->|
+ one step
+
+
+For more information, please see
+ http://en.wikipedia.org/wiki/Rotary_encoder
+
+
+1. Events / state machine
+-------------------------
+
+a) Rising edge on channel A, channel B in low state
+ This state is used to recognize a clockwise turn
+
+b) Rising edge on channel B, channel A in high state
+ When entering this state, the encoder is put into 'armed' state,
+ meaning that there it has seen half the way of a one-step transition.
+
+c) Falling edge on channel A, channel B in high state
+ This state is used to recognize a counter-clockwise turn
+
+d) Falling edge on channel B, channel A in low state
+ Parking position. If the encoder enters this state, a full transition
+ should have happend, unless it flipped back on half the way. The
+ 'armed' state tells us about that.
+
+2. Platform requirements
+------------------------
+
+As there is no hardware dependent call in this driver, the platform it is
+used with must support gpiolib. Another requirement is that IRQs must be
+able to fire on both edges.
+
+
+3. Board integration
+--------------------
+
+To use this driver in your system, register a platform_device with the
+name 'rotary-encoder' and associate the IRQs and some specific platform
+data with it.
+
+struct rotary_encoder_platform_data is declared in
+include/linux/rotary-encoder.h and needs to be filled with the number of
+steps the encoder has and can carry information about externally inverted
+signals (because of used invertig buffer or other reasons).
+
+Because GPIO to IRQ mapping is platform specific, this information must
+be given in seperately to the driver. See the example below.
+
+---------<snip>---------
+
+/* board support file example */
+
+#include <linux/input.h>
+#include <linux/rotary_encoder.h>
+
+#define GPIO_ROTARY_A 1
+#define GPIO_ROTARY_B 2
+
+static struct rotary_encoder_platform_data my_rotary_encoder_info = {
+ .steps = 24,
+ .axis = ABS_X,
+ .gpio_a = GPIO_ROTARY_A,
+ .gpio_b = GPIO_ROTARY_B,
+ .inverted_a = 0,
+ .inverted_b = 0,
+};
+
+static struct platform_device rotary_encoder_device = {
+ .name = "rotary-encoder",
+ .id = 0,
+ .dev = {
+ .platform_data = &my_rotary_encoder_info,
+ }
+};
+
diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt
index 51104f9..d76cfd8 100644
--- a/Documentation/kbuild/makefiles.txt
+++ b/Documentation/kbuild/makefiles.txt
@@ -40,10 +40,16 @@ This document describes the Linux kernel Makefiles.
--- 6.7 Custom kbuild commands
--- 6.8 Preprocessing linker scripts
- === 7 Kbuild Variables
- === 8 Makefile language
- === 9 Credits
- === 10 TODO
+ === 7 Kbuild syntax for exported headers
+ --- 7.1 header-y
+ --- 7.2 objhdr-y
+ --- 7.3 destination-y
+ --- 7.4 unifdef-y (deprecated)
+
+ === 8 Kbuild Variables
+ === 9 Makefile language
+ === 10 Credits
+ === 11 TODO
=== 1 Overview
@@ -310,6 +316,16 @@ more details, with real examples.
#arch/m68k/fpsp040/Makefile
ldflags-y := -x
+ subdir-ccflags-y, subdir-asflags-y
+ The two flags listed above are similar to ccflags-y and as-falgs-y.
+ The difference is that the subdir- variants has effect for the kbuild
+ file where tey are present and all subdirectories.
+ Options specified using subdir-* are added to the commandline before
+ the options specified using the non-subdir variants.
+
+ Example:
+ subdir-ccflags-y := -Werror
+
CFLAGS_$@, AFLAGS_$@
CFLAGS_$@ and AFLAGS_$@ only apply to commands in current
@@ -1143,8 +1159,69 @@ When kbuild executes, the following steps are followed (roughly):
The kbuild infrastructure for *lds file are used in several
architecture-specific files.
+=== 7 Kbuild syntax for exported headers
+
+The kernel include a set of headers that is exported to userspace.
+Many headers can be exported as-is but other headers requires a
+minimal pre-processing before they are ready for user-space.
+The pre-processing does:
+- drop kernel specific annotations
+- drop include of compiler.h
+- drop all sections that is kernel internat (guarded by ifdef __KERNEL__)
+
+Each relevant directory contain a file name "Kbuild" which specify the
+headers to be exported.
+See subsequent chapter for the syntax of the Kbuild file.
+
+ --- 7.1 header-y
+
+ header-y specify header files to be exported.
+
+ Example:
+ #include/linux/Kbuild
+ header-y += usb/
+ header-y += aio_abi.h
+
+ The convention is to list one file per line and
+ preferably in alphabetic order.
+
+ header-y also specify which subdirectories to visit.
+ A subdirectory is identified by a trailing '/' which
+ can be seen in the example above for the usb subdirectory.
+
+ Subdirectories are visited before their parent directories.
+
+ --- 7.2 objhdr-y
+
+ objhdr-y specifies generated files to be exported.
+ Generated files are special as they need to be looked
+ up in another directory when doing 'make O=...' builds.
+
+ Example:
+ #include/linux/Kbuild
+ objhdr-y += version.h
+
+ --- 7.3 destination-y
+
+ When an architecture have a set of exported headers that needs to be
+ exported to a different directory destination-y is used.
+ destination-y specify the destination directory for all exported
+ headers in the file where it is present.
+
+ Example:
+ #arch/xtensa/platforms/s6105/include/platform/Kbuild
+ destination-y := include/linux
+
+ In the example above all exported headers in the Kbuild file
+ will be located in the directory "include/linux" when exported.
+
+
+ --- 7.4 unifdef-y (deprecated)
+
+ unifdef-y is deprecated. A direct replacement is header-y.
+
-=== 7 Kbuild Variables
+=== 8 Kbuild Variables
The top Makefile exports the following variables:
@@ -1206,7 +1283,7 @@ The top Makefile exports the following variables:
INSTALL_MOD_STRIP will used as the option(s) to the strip command.
-=== 8 Makefile language
+=== 9 Makefile language
The kernel Makefiles are designed to be run with GNU Make. The Makefiles
use only the documented features of GNU Make, but they do use many
@@ -1225,14 +1302,14 @@ time the left-hand side is used.
There are some cases where "=" is appropriate. Usually, though, ":="
is the right choice.
-=== 9 Credits
+=== 10 Credits
Original version made by Michael Elizabeth Chastain, <mailto:mec@shout.net>
Updates by Kai Germaschewski <kai@tp1.ruhr-uni-bochum.de>
Updates by Sam Ravnborg <sam@ravnborg.org>
Language QA by Jan Engelhardt <jengelh@gmx.de>
-=== 10 TODO
+=== 11 TODO
- Describe how kbuild supports shipped files with _shipped.
- Generating offset header files.
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9e4fe72..90b3924 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -231,6 +231,35 @@ and is between 256 and 4096 characters. It is defined in the file
power state again in power transition.
1 : disable the power state check
+ acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode
+ Format: { level | edge | high | low }
+
+ acpi_serialize [HW,ACPI] force serialization of AML methods
+
+ acpi_skip_timer_override [HW,ACPI]
+ Recognize and ignore IRQ0/pin2 Interrupt Override.
+ For broken nForce2 BIOS resulting in XT-PIC timer.
+
+ acpi_sleep= [HW,ACPI] Sleep options
+ Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig,
+ old_ordering, s4_nonvs }
+ See Documentation/power/video.txt for information on
+ s3_bios and s3_mode.
+ s3_beep is for debugging; it makes the PC's speaker beep
+ as soon as the kernel's real-mode entry point is called.
+ s4_nohwsig prevents ACPI hardware signature from being
+ used during resume from hibernation.
+ old_ordering causes the ACPI 1.0 ordering of the _PTS
+ control method, with respect to putting devices into
+ low power states, to be enforced (the ACPI 2.0 ordering
+ of _PTS is used by default).
+ s4_nonvs prevents the kernel from saving/restoring the
+ ACPI NVS memory during hibernation.
+
+ acpi_use_timer_override [HW,ACPI]
+ Use timer override. For some broken Nvidia NF5 boards
+ that require a timer override, but don't have HPET
+
acpi_enforce_resources= [ACPI]
{ strict | lax | no }
Check for resource conflicts between native drivers
@@ -250,6 +279,9 @@ and is between 256 and 4096 characters. It is defined in the file
ad1848= [HW,OSS]
Format: <io>,<irq>,<dma>,<dma2>,<type>
+ add_efi_memmap [EFI; X86] Include EFI memory map in
+ kernel's map of available physical RAM.
+
advansys= [HW,SCSI]
See header of drivers/scsi/advansys.c.
@@ -1840,6 +1872,12 @@ and is between 256 and 4096 characters. It is defined in the file
autoconfiguration.
Ranges are in pairs (memory base and size).
+ ports= [IP_VS_FTP] IPVS ftp helper module
+ Default is 21.
+ Up to 8 (IP_VS_APP_MAX_PORTS) ports
+ may be specified.
+ Format: <port>,<port>....
+
print-fatal-signals=
[KNL] debug: print fatal signals
print-fatal-signals=1: print segfault info to
diff --git a/Documentation/lguest/.gitignore b/Documentation/lguest/.gitignore
new file mode 100644
index 0000000..115587f
--- /dev/null
+++ b/Documentation/lguest/.gitignore
@@ -0,0 +1 @@
+lguest
diff --git a/Documentation/lguest/lguest.txt b/Documentation/lguest/lguest.txt
index 29510dc..28c7473 100644
--- a/Documentation/lguest/lguest.txt
+++ b/Documentation/lguest/lguest.txt
@@ -3,11 +3,11 @@
/, /` - or, A Young Coder's Illustrated Hypervisor
\\"--\\ http://lguest.ozlabs.org
-Lguest is designed to be a minimal hypervisor for the Linux kernel, for
-Linux developers and users to experiment with virtualization with the
-minimum of complexity. Nonetheless, it should have sufficient
-features to make it useful for specific tasks, and, of course, you are
-encouraged to fork and enhance it (see drivers/lguest/README).
+Lguest is designed to be a minimal 32-bit x86 hypervisor for the Linux kernel,
+for Linux developers and users to experiment with virtualization with the
+minimum of complexity. Nonetheless, it should have sufficient features to
+make it useful for specific tasks, and, of course, you are encouraged to fork
+and enhance it (see drivers/lguest/README).
Features:
@@ -37,6 +37,7 @@ Running Lguest:
"Paravirtualized guest support" = Y
"Lguest guest support" = Y
"High Memory Support" = off/4GB
+ "PAE (Physical Address Extension) Support" = N
"Alignment value to which kernel should be aligned" = 0x100000
(CONFIG_PARAVIRT=y, CONFIG_LGUEST_GUEST=y, CONFIG_HIGHMEM64G=n and
CONFIG_PHYSICAL_ALIGN=0x100000)
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 5ede747..0876275 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -1242,7 +1242,7 @@ monitoring is enabled, and vice-versa.
To add ARP targets:
# echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
# echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target
- NOTE: up to 10 target addresses may be specified.
+ NOTE: up to 16 target addresses may be specified.
To remove an ARP target:
# echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
diff --git a/Documentation/powerpc/dts-bindings/fsl/i2c.txt b/Documentation/powerpc/dts-bindings/fsl/i2c.txt
index d0ab33e..b6d2e21 100644
--- a/Documentation/powerpc/dts-bindings/fsl/i2c.txt
+++ b/Documentation/powerpc/dts-bindings/fsl/i2c.txt
@@ -7,8 +7,10 @@ Required properties :
Recommended properties :
- - compatible : Should be "fsl-i2c" for parts compatible with
- Freescale I2C specifications.
+ - compatible : compatibility list with 2 entries, the first should
+ be "fsl,CHIP-i2c" where CHIP is the name of a compatible processor,
+ e.g. mpc8313, mpc8543, mpc8544, mpc5200 or mpc5200b. The second one
+ should be "fsl-i2c".
- interrupts : <a b> where a is the interrupt number and b is a
field that represents an encoding of the sense and level
information for the interrupt. This should be encoded based on
@@ -16,17 +18,31 @@ Recommended properties :
controller you have.
- interrupt-parent : the phandle for the interrupt controller that
services interrupts for this device.
- - dfsrr : boolean; if defined, indicates that this I2C device has
- a digital filter sampling rate register
- - fsl5200-clocking : boolean; if defined, indicated that this device
- uses the FSL 5200 clocking mechanism.
-
-Example :
- i2c@3000 {
- interrupt-parent = <40000>;
- interrupts = <1b 3>;
- reg = <3000 18>;
- device_type = "i2c";
- compatible = "fsl-i2c";
- dfsrr;
+ - fsl,preserve-clocking : boolean; if defined, the clock settings
+ from the bootloader are preserved (not touched).
+ - clock-frequency : desired I2C bus clock frequency in Hz.
+
+Examples :
+
+ i2c@3d00 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ compatible = "fsl,mpc5200b-i2c","fsl,mpc5200-i2c","fsl-i2c";
+ cell-index = <0>;
+ reg = <0x3d00 0x40>;
+ interrupts = <2 15 0>;
+ interrupt-parent = <&mpc5200_pic>;
+ fsl,preserve-clocking;
};
+
+ i2c@3100 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ cell-index = <1>;
+ compatible = "fsl,mpc8544-i2c", "fsl-i2c";
+ reg = <0x3100 0x100>;
+ interrupts = <43 2>;
+ interrupt-parent = <&mpic>;
+ clock-frequency = <400000>;
+ };
+
diff --git a/Documentation/sound/alsa/HD-Audio.txt b/Documentation/sound/alsa/HD-Audio.txt
index c5948f2..88b7433 100644
--- a/Documentation/sound/alsa/HD-Audio.txt
+++ b/Documentation/sound/alsa/HD-Audio.txt
@@ -169,7 +169,7 @@ PCI SSID look-up.
What `model` option values are available depends on the codec chip.
Check your codec chip from the codec proc file (see "Codec Proc-File"
section below). It will show the vendor/product name of your codec
-chip. Then, see Documentation/sound/alsa/HD-Audio-Modelstxt file,
+chip. Then, see Documentation/sound/alsa/HD-Audio-Models.txt file,
the section of HD-audio driver. You can find a list of codecs
and `model` options belonging to each codec. For example, for Realtek
ALC262 codec chip, pass `model=ultra` for devices that are compatible
@@ -177,7 +177,7 @@ with Samsung Q1 Ultra.
Thus, the first thing you can do for any brand-new, unsupported and
non-working HD-audio hardware is to check HD-audio codec and several
-different `model` option values. If you have a luck, some of them
+different `model` option values. If you have any luck, some of them
might suit with your device well.
Some codecs such as ALC880 have a special model option `model=test`.
diff --git a/Documentation/sparse.txt b/Documentation/sparse.txt
index 42f43fa..34c76a5 100644
--- a/Documentation/sparse.txt
+++ b/Documentation/sparse.txt
@@ -42,6 +42,14 @@ sure that bitwise types don't get mixed up (little-endian vs big-endian
vs cpu-endian vs whatever), and there the constant "0" really _is_
special.
+__bitwise__ - to be used for relatively compact stuff (gfp_t, etc.) that
+is mostly warning-free and is supposed to stay that way. Warnings will
+be generated without __CHECK_ENDIAN__.
+
+__bitwise - noisy stuff; in particular, __le*/__be* are that. We really
+don't want to drown in noise unless we'd explicitly asked for it.
+
+
Getting sparse
~~~~~~~~~~~~~~
diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index a34d55b..df38ef0 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -95,7 +95,7 @@ of struct cmsghdr structures with appended data.
There is only one file in this directory.
unix_dgram_qlen limits the max number of datagrams queued in Unix domain
-socket's buffer. It will not take effect unless PF_UNIX flag is spicified.
+socket's buffer. It will not take effect unless PF_UNIX flag is specified.
3. /proc/sys/net/ipv4 - IPV4 settings
diff --git a/Documentation/tomoyo.txt b/Documentation/tomoyo.txt
new file mode 100644
index 0000000..b3a232c
--- /dev/null
+++ b/Documentation/tomoyo.txt
@@ -0,0 +1,55 @@
+--- What is TOMOYO? ---
+
+TOMOYO is a name-based MAC extension (LSM module) for the Linux kernel.
+
+LiveCD-based tutorials are available at
+http://tomoyo.sourceforge.jp/en/1.6.x/1st-step/ubuntu8.04-live/
+http://tomoyo.sourceforge.jp/en/1.6.x/1st-step/centos5-live/ .
+Though these tutorials use non-LSM version of TOMOYO, they are useful for you
+to know what TOMOYO is.
+
+--- How to enable TOMOYO? ---
+
+Build the kernel with CONFIG_SECURITY_TOMOYO=y and pass "security=tomoyo" on
+kernel's command line.
+
+Please see http://tomoyo.sourceforge.jp/en/2.2.x/ for details.
+
+--- Where is documentation? ---
+
+User <-> Kernel interface documentation is available at
+http://tomoyo.sourceforge.jp/en/2.2.x/policy-reference.html .
+
+Materials we prepared for seminars and symposiums are available at
+http://sourceforge.jp/projects/tomoyo/docs/?category_id=532&language_id=1 .
+Below lists are chosen from three aspects.
+
+What is TOMOYO?
+ TOMOYO Linux Overview
+ http://sourceforge.jp/projects/tomoyo/docs/lca2009-takeda.pdf
+ TOMOYO Linux: pragmatic and manageable security for Linux
+ http://sourceforge.jp/projects/tomoyo/docs/freedomhectaipei-tomoyo.pdf
+ TOMOYO Linux: A Practical Method to Understand and Protect Your Own Linux Box
+ http://sourceforge.jp/projects/tomoyo/docs/PacSec2007-en-no-demo.pdf
+
+What can TOMOYO do?
+ Deep inside TOMOYO Linux
+ http://sourceforge.jp/projects/tomoyo/docs/lca2009-kumaneko.pdf
+ The role of "pathname based access control" in security.
+ http://sourceforge.jp/projects/tomoyo/docs/lfj2008-bof.pdf
+
+History of TOMOYO?
+ Realities of Mainlining
+ http://sourceforge.jp/projects/tomoyo/docs/lfj2008.pdf
+
+--- What is future plan? ---
+
+We believe that inode based security and name based security are complementary
+and both should be used together. But unfortunately, so far, we cannot enable
+multiple LSM modules at the same time. We feel sorry that you have to give up
+SELinux/SMACK/AppArmor etc. when you want to use TOMOYO.
+
+We hope that LSM becomes stackable in future. Meanwhile, you can use non-LSM
+version of TOMOYO, available at http://tomoyo.sourceforge.jp/en/1.6.x/ .
+LSM version of TOMOYO is a subset of non-LSM version of TOMOYO. We are planning
+to port non-LSM version's functionalities to LSM versions.
diff --git a/Documentation/ftrace.txt b/Documentation/trace/ftrace.txt
index fd9a3e6..fd9a3e6 100644
--- a/Documentation/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
diff --git a/Documentation/vm/kmemtrace.txt b/Documentation/trace/kmemtrace.txt
index a956d9b..a956d9b 100644
--- a/Documentation/vm/kmemtrace.txt
+++ b/Documentation/trace/kmemtrace.txt
diff --git a/Documentation/tracers/mmiotrace.txt b/Documentation/trace/mmiotrace.txt
index 5731c67..5731c67 100644
--- a/Documentation/tracers/mmiotrace.txt
+++ b/Documentation/trace/mmiotrace.txt
diff --git a/Documentation/tracepoints.txt b/Documentation/trace/tracepoints.txt
index c0e1cee..c0e1cee 100644
--- a/Documentation/tracepoints.txt
+++ b/Documentation/trace/tracepoints.txt
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 2131b00..2f77ced 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -1,5 +1,7 @@
00-INDEX
- this file.
+active_mm.txt
+ - An explanation from Linus about tsk->active_mm vs tsk->mm.
balance
- various information on memory balancing.
hugetlbpage.txt
diff --git a/Documentation/vm/active_mm.txt b/Documentation/vm/active_mm.txt
new file mode 100644
index 0000000..4ee1f64
--- /dev/null
+++ b/Documentation/vm/active_mm.txt
@@ -0,0 +1,83 @@
+List: linux-kernel
+Subject: Re: active_mm
+From: Linus Torvalds <torvalds () transmeta ! com>
+Date: 1999-07-30 21:36:24
+
+Cc'd to linux-kernel, because I don't write explanations all that often,
+and when I do I feel better about more people reading them.
+
+On Fri, 30 Jul 1999, David Mosberger wrote:
+>
+> Is there a brief description someplace on how "mm" vs. "active_mm" in
+> the task_struct are supposed to be used? (My apologies if this was
+> discussed on the mailing lists---I just returned from vacation and
+> wasn't able to follow linux-kernel for a while).
+
+Basically, the new setup is:
+
+ - we have "real address spaces" and "anonymous address spaces". The
+ difference is that an anonymous address space doesn't care about the
+ user-level page tables at all, so when we do a context switch into an
+ anonymous address space we just leave the previous address space
+ active.
+
+ The obvious use for a "anonymous address space" is any thread that
+ doesn't need any user mappings - all kernel threads basically fall into
+ this category, but even "real" threads can temporarily say that for
+ some amount of time they are not going to be interested in user space,
+ and that the scheduler might as well try to avoid wasting time on
+ switching the VM state around. Currently only the old-style bdflush
+ sync does that.
+
+ - "tsk->mm" points to the "real address space". For an anonymous process,
+ tsk->mm will be NULL, for the logical reason that an anonymous process
+ really doesn't _have_ a real address space at all.
+
+ - however, we obviously need to keep track of which address space we
+ "stole" for such an anonymous user. For that, we have "tsk->active_mm",
+ which shows what the currently active address space is.
+
+ The rule is that for a process with a real address space (ie tsk->mm is
+ non-NULL) the active_mm obviously always has to be the same as the real
+ one.
+
+ For a anonymous process, tsk->mm == NULL, and tsk->active_mm is the
+ "borrowed" mm while the anonymous process is running. When the
+ anonymous process gets scheduled away, the borrowed address space is
+ returned and cleared.
+
+To support all that, the "struct mm_struct" now has two counters: a
+"mm_users" counter that is how many "real address space users" there are,
+and a "mm_count" counter that is the number of "lazy" users (ie anonymous
+users) plus one if there are any real users.
+
+Usually there is at least one real user, but it could be that the real
+user exited on another CPU while a lazy user was still active, so you do
+actually get cases where you have a address space that is _only_ used by
+lazy users. That is often a short-lived state, because once that thread
+gets scheduled away in favour of a real thread, the "zombie" mm gets
+released because "mm_users" becomes zero.
+
+Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any
+more. "init_mm" should be considered just a "lazy context when no other
+context is available", and in fact it is mainly used just at bootup when
+no real VM has yet been created. So code that used to check
+
+ if (current->mm == &init_mm)
+
+should generally just do
+
+ if (!current->mm)
+
+instead (which makes more sense anyway - the test is basically one of "do
+we have a user context", and is generally done by the page fault handler
+and things like that).
+
+Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago,
+because it slightly changes the interfaces to accomodate the alpha (who
+would have thought it, but the alpha actually ends up having one of the
+ugliest context switch codes - unlike the other architectures where the MM
+and register state is separate, the alpha PALcode joins the two, and you
+need to switch both together).
+
+(From http://marc.info/?l=linux-kernel&m=93337278602211&w=2)
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index 0706a72..2d70d0d 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -1,588 +1,691 @@
-
-This document describes the Linux memory management "Unevictable LRU"
-infrastructure and the use of this infrastructure to manage several types
-of "unevictable" pages. The document attempts to provide the overall
-rationale behind this mechanism and the rationale for some of the design
-decisions that drove the implementation. The latter design rationale is
-discussed in the context of an implementation description. Admittedly, one
-can obtain the implementation details--the "what does it do?"--by reading the
-code. One hopes that the descriptions below add value by provide the answer
-to "why does it do that?".
-
-Unevictable LRU Infrastructure:
-
-The Unevictable LRU adds an additional LRU list to track unevictable pages
-and to hide these pages from vmscan. This mechanism is based on a patch by
-Larry Woodman of Red Hat to address several scalability problems with page
+ ==============================
+ UNEVICTABLE LRU INFRASTRUCTURE
+ ==============================
+
+========
+CONTENTS
+========
+
+ (*) The Unevictable LRU
+
+ - The unevictable page list.
+ - Memory control group interaction.
+ - Marking address spaces unevictable.
+ - Detecting Unevictable Pages.
+ - vmscan's handling of unevictable pages.
+
+ (*) mlock()'d pages.
+
+ - History.
+ - Basic management.
+ - mlock()/mlockall() system call handling.
+ - Filtering special vmas.
+ - munlock()/munlockall() system call handling.
+ - Migrating mlocked pages.
+ - mmap(MAP_LOCKED) system call handling.
+ - munmap()/exit()/exec() system call handling.
+ - try_to_unmap().
+ - try_to_munlock() reverse map scan.
+ - Page reclaim in shrink_*_list().
+
+
+============
+INTRODUCTION
+============
+
+This document describes the Linux memory manager's "Unevictable LRU"
+infrastructure and the use of this to manage several types of "unevictable"
+pages.
+
+The document attempts to provide the overall rationale behind this mechanism
+and the rationale for some of the design decisions that drove the
+implementation. The latter design rationale is discussed in the context of an
+implementation description. Admittedly, one can obtain the implementation
+details - the "what does it do?" - by reading the code. One hopes that the
+descriptions below add value by provide the answer to "why does it do that?".
+
+
+===================
+THE UNEVICTABLE LRU
+===================
+
+The Unevictable LRU facility adds an additional LRU list to track unevictable
+pages and to hide these pages from vmscan. This mechanism is based on a patch
+by Larry Woodman of Red Hat to address several scalability problems with page
reclaim in Linux. The problems have been observed at customer sites on large
-memory x86_64 systems. For example, a non-numal x86_64 platform with 128GB
-of main memory will have over 32 million 4k pages in a single zone. When a
-large fraction of these pages are not evictable for any reason [see below],
-vmscan will spend a lot of time scanning the LRU lists looking for the small
-fraction of pages that are evictable. This can result in a situation where
-all cpus are spending 100% of their time in vmscan for hours or days on end,
-with the system completely unresponsive.
-
-The Unevictable LRU infrastructure addresses the following classes of
-unevictable pages:
-
-+ page owned by ramfs
-+ page mapped into SHM_LOCKed shared memory regions
-+ page mapped into VM_LOCKED [mlock()ed] vmas
-
-The infrastructure might be able to handle other conditions that make pages
+memory x86_64 systems.
+
+To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
+main memory will have over 32 million 4k pages in a single zone. When a large
+fraction of these pages are not evictable for any reason [see below], vmscan
+will spend a lot of time scanning the LRU lists looking for the small fraction
+of pages that are evictable. This can result in a situation where all CPUs are
+spending 100% of their time in vmscan for hours or days on end, with the system
+completely unresponsive.
+
+The unevictable list addresses the following classes of unevictable pages:
+
+ (*) Those owned by ramfs.
+
+ (*) Those mapped into SHM_LOCK'd shared memory regions.
+
+ (*) Those mapped into VM_LOCKED [mlock()ed] VMAs.
+
+The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.
-The Unevictable LRU List
+THE UNEVICTABLE PAGE LIST
+-------------------------
The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
-indicate that the page is being managed on the unevictable list. The
-PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active
-flag in that it indicates on which LRU list a page resides when PG_lru is set.
-The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU
-Kconfig option.
+indicate that the page is being managed on the unevictable list.
+
+The PG_unevictable flag is analogous to, and mutually exclusive with, the
+PG_active flag in that it indicates on which LRU list a page resides when
+PG_lru is set. The unevictable list is compile-time configurable based on the
+UNEVICTABLE_LRU Kconfig option.
The Unevictable LRU infrastructure maintains unevictable pages on an additional
LRU list for a few reasons:
-1) We get to "treat unevictable pages just like we treat other pages in the
- system, which means we get to use the same code to manipulate them, the
- same code to isolate them (for migrate, etc.), the same code to keep track
- of the statistics, etc..." [Rik van Riel]
+ (1) We get to "treat unevictable pages just like we treat other pages in the
+ system - which means we get to use the same code to manipulate them, the
+ same code to isolate them (for migrate, etc.), the same code to keep track
+ of the statistics, etc..." [Rik van Riel]
+
+ (2) We want to be able to migrate unevictable pages between nodes for memory
+ defragmentation, workload management and memory hotplug. The linux kernel
+ can only migrate pages that it can successfully isolate from the LRU
+ lists. If we were to maintain pages elsewhere than on an LRU-like list,
+ where they can be found by isolate_lru_page(), we would prevent their
+ migration, unless we reworked migration code to find the unevictable pages
+ itself.
-2) We want to be able to migrate unevictable pages between nodes--for memory
- defragmentation, workload management and memory hotplug. The linux kernel
- can only migrate pages that it can successfully isolate from the lru lists.
- If we were to maintain pages elsewise than on an lru-like list, where they
- can be found by isolate_lru_page(), we would prevent their migration, unless
- we reworked migration code to find the unevictable pages.
+The unevictable list does not differentiate between file-backed and anonymous,
+swap-backed pages. This differentiation is only important while the pages are,
+in fact, evictable.
-The unevictable LRU list does not differentiate between file backed and swap
-backed [anon] pages. This differentiation is only important while the pages
-are, in fact, evictable.
+The unevictable list benefits from the "arrayification" of the per-zone LRU
+lists and statistics originally proposed and posted by Christoph Lameter.
-The unevictable LRU list benefits from the "arrayification" of the per-zone
-LRU lists and statistics originally proposed and posted by Christoph Lameter.
+The unevictable list does not use the LRU pagevec mechanism. Rather,
+unevictable pages are placed directly on the page's zone's unevictable list
+under the zone lru_lock. This allows us to prevent the stranding of pages on
+the unevictable list when one task has the page isolated from the LRU and other
+tasks are changing the "evictability" state of the page.
-The unevictable list does not use the lru pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable
-list under the zone lru_lock. The reason for this is to prevent stranding
-of pages on the unevictable list when one task has the page isolated from the
-lru and other tasks are changing the "evictability" state of the page.
+MEMORY CONTROL GROUP INTERACTION
+--------------------------------
-Unevictable LRU and Memory Controller Interaction
+The unevictable LRU facility interacts with the memory control group [aka
+memory controller; see Documentation/cgroups/memory.txt] by extending the
+lru_list enum.
+
+The memory controller data structure automatically gets a per-zone unevictable
+list as a result of the "arrayification" of the per-zone LRU lists (one per
+lru_list enum element). The memory controller tracks the movement of pages to
+and from the unevictable list.
-The memory controller data structure automatically gets a per zone unevictable
-lru list as a result of the "arrayification" of the per-zone LRU lists. The
-memory controller tracks the movement of pages to and from the unevictable list.
When a memory control group comes under memory pressure, the controller will
not attempt to reclaim pages on the unevictable list. This has a couple of
-effects. Because the pages are "hidden" from reclaim on the unevictable list,
-the reclaim process can be more efficient, dealing only with pages that have
-a chance of being reclaimed. On the other hand, if too many of the pages
-charged to the control group are unevictable, the evictable portion of the
-working set of the tasks in the control group may not fit into the available
-memory. This can cause the control group to thrash or to oom-kill tasks.
-
-
-Unevictable LRU: Detecting Unevictable Pages
-
-The function page_evictable(page, vma) in vmscan.c determines whether a
-page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions,
-page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's
-address space using a wrapper function. Wrapper functions are used to set,
-clear and test the flag to reduce the requirement for #ifdef's throughout the
-source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created.
-This flag remains for the life of the inode.
-
-For shared memory regions, AS_UNEVICTABLE is set when an application
-successfully SHM_LOCKs the region and is removed when the region is
-SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page
-tables for the region as does, for example, mlock(). So, we make no special
-effort to push any pages in the SHM_LOCKed region to the unevictable list.
-Vmscan will do this when/if it encounters the pages during reclaim. On
-SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the
-unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed
-region is destroyed, the pages are also "rescued" from the unevictable list in
-the process of freeing them.
-
-page_evictable() detects mlock()ed pages by testing an additional page flag,
-PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a
-non-NULL vma is supplied, page_evictable() will check whether the vma is
+effects:
+
+ (1) Because the pages are "hidden" from reclaim on the unevictable list, the
+ reclaim process can be more efficient, dealing only with pages that have a
+ chance of being reclaimed.
+
+ (2) On the other hand, if too many of the pages charged to the control group
+ are unevictable, the evictable portion of the working set of the tasks in
+ the control group may not fit into the available memory. This can cause
+ the control group to thrash or to OOM-kill tasks.
+
+
+MARKING ADDRESS SPACES UNEVICTABLE
+----------------------------------
+
+For facilities such as ramfs none of the pages attached to the address space
+may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE
+address space flag is provided, and this can be manipulated by a filesystem
+using a number of wrapper functions:
+
+ (*) void mapping_set_unevictable(struct address_space *mapping);
+
+ Mark the address space as being completely unevictable.
+
+ (*) void mapping_clear_unevictable(struct address_space *mapping);
+
+ Mark the address space as being evictable.
+
+ (*) int mapping_unevictable(struct address_space *mapping);
+
+ Query the address space, and return true if it is completely
+ unevictable.
+
+These are currently used in two places in the kernel:
+
+ (1) By ramfs to mark the address spaces of its inodes when they are created,
+ and this mark remains for the life of the inode.
+
+ (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called.
+
+ Note that SHM_LOCK is not required to page in the locked pages if they're
+ swapped out; the application must touch the pages manually if it wants to
+ ensure they're in memory.
+
+
+DETECTING UNEVICTABLE PAGES
+---------------------------
+
+The function page_evictable() in vmscan.c determines whether a page is
+evictable or not using the query function outlined above [see section "Marking
+address spaces unevictable"] to check the AS_UNEVICTABLE flag.
+
+For address spaces that are so marked after being populated (as SHM regions
+might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate
+the page tables for the region as does, for example, mlock(), nor need it make
+any special effort to push any pages in the SHM_LOCK'd area to the unevictable
+list. Instead, vmscan will do this if and when it encounters the pages during
+a reclamation scan.
+
+On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan
+the pages in the region and "rescue" them from the unevictable list if no other
+condition is keeping them unevictable. If an unevictable region is destroyed,
+the pages are also "rescued" from the unevictable list in the process of
+freeing them.
+
+page_evictable() also checks for mlocked pages by testing an additional page
+flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked,
+and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is
VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and
update the appropriate statistics if the vma is VM_LOCKED. This method allows
efficient "culling" of pages in the fault path that are being faulted in to
-VM_LOCKED vmas.
+VM_LOCKED VMAs.
-Unevictable Pages and Vmscan [shrink_*_list()]
+VMSCAN'S HANDLING OF UNEVICTABLE PAGES
+--------------------------------------
If unevictable pages are culled in the fault path, or moved to the unevictable
-list at mlock() or mmap() time, vmscan will never encounter the pages until
-they have become evictable again, for example, via munlock() and have been
-"rescued" from the unevictable list. However, there may be situations where we
-decide, for the sake of expediency, to leave a unevictable page on one of the
-regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for
-such pages in all of the shrink_{active|inactive|page}_list() functions and
-will "cull" such pages that it encounters--that is, it diverts those pages to
-the unevictable list for the zone being scanned.
-
-There may be situations where a page is mapped into a VM_LOCKED vma, but the
-page is not marked as PageMlocked. Such pages will make it all the way to
+list at mlock() or mmap() time, vmscan will not encounter the pages until they
+have become evictable again (via munlock() for example) and have been "rescued"
+from the unevictable list. However, there may be situations where we decide,
+for the sake of expediency, to leave a unevictable page on one of the regular
+active/inactive LRU lists for vmscan to deal with. vmscan checks for such
+pages in all of the shrink_{active|inactive|page}_list() functions and will
+"cull" such pages that it encounters: that is, it diverts those pages to the
+unevictable list for the zone being scanned.
+
+There may be situations where a page is mapped into a VM_LOCKED VMA, but the
+page is not marked as PG_mlocked. Such pages will make it all the way to
shrink_page_list() where they will be detected when vmscan walks the reverse
-map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list()
-will cull the page at that point.
+map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK,
+shrink_page_list() will cull the page at that point.
-To "cull" an unevictable page, vmscan simply puts the page back on the lru
-list using putback_lru_page()--the inverse operation to isolate_lru_page()--
-after dropping the page lock. Because the condition which makes the page
-unevictable may change once the page is unlocked, putback_lru_page() will
-recheck the unevictable state of a page that it places on the unevictable lru
-list. If the page has become unevictable, putback_lru_page() removes it from
-the list and retries, including the page_unevictable() test. Because such a
-race is a rare event and movement of pages onto the unevictable list should be
-rare, these extra evictabilty checks should not occur in the majority of calls
-to putback_lru_page().
+To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
+using putback_lru_page() - the inverse operation to isolate_lru_page() - after
+dropping the page lock. Because the condition which makes the page unevictable
+may change once the page is unlocked, putback_lru_page() will recheck the
+unevictable state of a page that it places on the unevictable list. If the
+page has become unevictable, putback_lru_page() removes it from the list and
+retries, including the page_unevictable() test. Because such a race is a rare
+event and movement of pages onto the unevictable list should be rare, these
+extra evictabilty checks should not occur in the majority of calls to
+putback_lru_page().
-Mlocked Page: Prior Work
+=============
+MLOCKED PAGES
+=============
-The "Unevictable Mlocked Pages" infrastructure is based on work originally
+The unevictable page list is also useful for mlock(), in addition to ramfs and
+SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
+NOMMU situations, all mappings are effectively mlocked.
+
+
+HISTORY
+-------
+
+The "Unevictable mlocked Pages" infrastructure is based on work originally
posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
-Nick posted his patch as an alternative to a patch posted by Christoph
-Lameter to achieve the same objective--hiding mlocked pages from vmscan.
-In Nick's patch, he used one of the struct page lru list link fields as a count
-of VM_LOCKED vmas that map the page. This use of the link field for a count
-prevented the management of the pages on an LRU list. Thus, mlocked pages were
-not migratable as isolate_lru_page() could not find them and the lru list link
-field was not available to the migration subsystem. Nick resolved this by
-putting mlocked pages back on the lru list before attempting to isolate them,
-thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated
-with the Unevictable LRU work, the count was replaced by walking the reverse
-map to determine whether any VM_LOCKED vmas mapped the page. More on this
-below.
-
-
-Mlocked Pages: Basic Management
-
-Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of
-unevictable pages. When such a page has been "noticed" by the memory
-management subsystem, the page is marked with the PG_mlocked [PageMlocked()]
-flag. A PageMlocked() page will be placed on the unevictable LRU list when
-it is added to the LRU. Pages can be "noticed" by memory management in
-several places:
-
-1) in the mlock()/mlockall() system call handlers.
-2) in the mmap() system call handler when mmap()ing a region with the
- MAP_LOCKED flag, or mmap()ing a region in a task that has called
- mlockall() with the MCL_FUTURE flag. Both of these conditions result
- in the VM_LOCKED flag being set for the vma.
-3) in the fault path, if mlocked pages are "culled" in the fault path,
- and when a VM_LOCKED stack segment is expanded.
-4) as mentioned above, in vmscan:shrink_page_list() when attempting to
- reclaim a page in a VM_LOCKED vma via try_to_unmap().
-
-Mlocked pages become unlocked and rescued from the unevictable list when:
-
-1) mapped in a range unlocked via the munlock()/munlockall() system calls.
-2) munmapped() out of the last VM_LOCKED vma that maps the page, including
- unmapping at task exit.
-3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file.
-4) before a page is COWed in a VM_LOCKED vma.
-
-
-Mlocked Pages: mlock()/mlockall() System Call Handling
+Nick posted his patch as an alternative to a patch posted by Christoph Lameter
+to achieve the same objective: hiding mlocked pages from vmscan.
+
+In Nick's patch, he used one of the struct page LRU list link fields as a count
+of VM_LOCKED VMAs that map the page. This use of the link field for a count
+prevented the management of the pages on an LRU list, and thus mlocked pages
+were not migratable as isolate_lru_page() could not find them, and the LRU list
+link field was not available to the migration subsystem.
+
+Nick resolved this by putting mlocked pages back on the lru list before
+attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When
+Nick's patch was integrated with the Unevictable LRU work, the count was
+replaced by walking the reverse map to determine whether any VM_LOCKED VMAs
+mapped the page. More on this below.
+
+
+BASIC MANAGEMENT
+----------------
+
+mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
+pages. When such a page has been "noticed" by the memory management subsystem,
+the page is marked with the PG_mlocked flag. This can be manipulated using the
+PageMlocked() functions.
+
+A PG_mlocked page will be placed on the unevictable list when it is added to
+the LRU. Such pages can be "noticed" by memory management in several places:
+
+ (1) in the mlock()/mlockall() system call handlers;
+
+ (2) in the mmap() system call handler when mmapping a region with the
+ MAP_LOCKED flag;
+
+ (3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE
+ flag
+
+ (4) in the fault path, if mlocked pages are "culled" in the fault path,
+ and when a VM_LOCKED stack segment is expanded; or
+
+ (5) as mentioned above, in vmscan:shrink_page_list() when attempting to
+ reclaim a page in a VM_LOCKED VMA via try_to_unmap()
+
+all of which result in the VM_LOCKED flag being set for the VMA if it doesn't
+already have it set.
+
+mlocked pages become unlocked and rescued from the unevictable list when:
+
+ (1) mapped in a range unlocked via the munlock()/munlockall() system calls;
+
+ (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including
+ unmapping at task exit;
+
+ (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file;
+ or
+
+ (4) before a page is COW'd in a VM_LOCKED VMA.
+
+
+mlock()/mlockall() SYSTEM CALL HANDLING
+---------------------------------------
Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup()
-for each vma in the range specified by the call. In the case of mlockall(),
+for each VMA in the range specified by the call. In the case of mlockall(),
this is the entire active address space of the task. Note that mlock_fixup()
-is used for both mlock()ing and munlock()ing a range of memory. A call to
-mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED
-is treated as a no-op--mlock_fixup() simply returns.
-
-If the vma passes some filtering described in "Mlocked Pages: Filtering Vmas"
-below, mlock_fixup() will attempt to merge the vma with its neighbors or split
-off a subset of the vma if the range does not cover the entire vma. Once the
-vma has been merged or split or neither, mlock_fixup() will call
-__mlock_vma_pages_range() to fault in the pages via get_user_pages() and
-to mark the pages as mlocked via mlock_vma_page().
-
-Note that the vma being mlocked might be mapped with PROT_NONE. In this case,
-get_user_pages() will be unable to fault in the pages. That's OK. If pages
-do end up getting faulted into this VM_LOCKED vma, we'll handle them in the
+is used for both mlocking and munlocking a range of memory. A call to mlock()
+an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED is
+treated as a no-op, and mlock_fixup() simply returns.
+
+If the VMA passes some filtering as described in "Filtering Special Vmas"
+below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
+off a subset of the VMA if the range does not cover the entire VMA. Once the
+VMA has been merged or split or neither, mlock_fixup() will call
+__mlock_vma_pages_range() to fault in the pages via get_user_pages() and to
+mark the pages as mlocked via mlock_vma_page().
+
+Note that the VMA being mlocked might be mapped with PROT_NONE. In this case,
+get_user_pages() will be unable to fault in the pages. That's okay. If pages
+do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the
fault path or in vmscan.
Also note that a page returned by get_user_pages() could be truncated or
-migrated out from under us, while we're trying to mlock it. To detect
-this, __mlock_vma_pages_range() tests the page_mapping after acquiring
-the page lock. If the page is still associated with its mapping, we'll
-go ahead and call mlock_vma_page(). If the mapping is gone, we just
-unlock the page and move on. Worse case, this results in page mapped
-in a VM_LOCKED vma remaining on a normal LRU list without being
-PageMlocked(). Again, vmscan will detect and cull such pages.
-
-mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will
-TestSetPageMlocked() for each page returned by get_user_pages(). We use
-TestSetPageMlocked() because the page might already be mlocked by another
-task/vma and we don't want to do extra work. We especially do not want to
-count an mlocked page more than once in the statistics. If the page was
-already mlocked, mlock_vma_page() is done.
+migrated out from under us, while we're trying to mlock it. To detect this,
+__mlock_vma_pages_range() checks page_mapping() after acquiring the page lock.
+If the page is still associated with its mapping, we'll go ahead and call
+mlock_vma_page(). If the mapping is gone, we just unlock the page and move on.
+In the worst case, this will result in a page mapped in a VM_LOCKED VMA
+remaining on a normal LRU list without being PageMlocked(). Again, vmscan will
+detect and cull such pages.
+
+mlock_vma_page() will call TestSetPageMlocked() for each page returned by
+get_user_pages(). We use TestSetPageMlocked() because the page might already
+be mlocked by another task/VMA and we don't want to do extra work. We
+especially do not want to count an mlocked page more than once in the
+statistics. If the page was already mlocked, mlock_vma_page() need do nothing
+more.
If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
page from the LRU, as it is likely on the appropriate active or inactive list
-at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will
-putback the page--putback_lru_page()--which will notice that the page is now
-mlocked and divert the page to the zone's unevictable LRU list. If
+at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
+back the page - by calling putback_lru_page() - which will notice that the page
+is now mlocked and divert the page to the zone's unevictable list. If
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
-it later if/when it attempts to reclaim the page.
+it later if and when it attempts to reclaim the page.
-Mlocked Pages: Filtering Special Vmas
+FILTERING SPECIAL VMAS
+----------------------
-mlock_fixup() filters several classes of "special" vmas:
+mlock_fixup() filters several classes of "special" VMAs:
-1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind
+1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely. The pages behind
these mappings are inherently pinned, so we don't need to mark them as
- mlocked. In any case, most of the pages have no struct page in which to
- so mark the page. Because of this, get_user_pages() will fail for these
- vmas, so there is no sense in attempting to visit them.
-
-2) vmas mapping hugetlbfs page are already effectively pinned into memory.
- We don't need nor want to mlock() these pages. However, to preserve the
- prior behavior of mlock()--before the unevictable/mlock changes--
- mlock_fixup() will call make_pages_present() in the hugetlbfs vma range
- to allocate the huge pages and populate the ptes.
-
-3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of
- kernel pages, such as the vdso page, relay channel pages, etc. These pages
+ mlocked. In any case, most of the pages have no struct page in which to so
+ mark the page. Because of this, get_user_pages() will fail for these VMAs,
+ so there is no sense in attempting to visit them.
+
+2) VMAs mapping hugetlbfs page are already effectively pinned into memory. We
+ neither need nor want to mlock() these pages. However, to preserve the
+ prior behavior of mlock() - before the unevictable/mlock changes -
+ mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
+ allocate the huge pages and populate the ptes.
+
+3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of
+ kernel pages, such as the VDSO page, relay channel pages, etc. These pages
are inherently unevictable and are not managed on the LRU lists.
- mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls
+ mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls
make_pages_present() to populate the ptes.
-Note that for all of these special vmas, mlock_fixup() does not set the
+Note that for all of these special VMAs, mlock_fixup() does not set the
VM_LOCKED flag. Therefore, we won't have to deal with them later during
-munlock() or munmap()--for example, at task exit. Neither does mlock_fixup()
-account these vmas against the task's "locked_vm".
-
-Mlocked Pages: Downgrading the Mmap Semaphore.
-
-mlock_fixup() must be called with the mmap semaphore held for write, because
-it may have to merge or split vmas. However, mlocking a large region of
-memory can take a long time--especially if vmscan must reclaim pages to
-satisfy the regions requirements. Faulting in a large region with the mmap
-semaphore held for write can hold off other faults on the address space, in
-the case of a multi-threaded task. It can also hold off scans of the task's
-address space via /proc. While testing under heavy load, it was observed that
-the ps(1) command could be held off for many minutes while a large segment was
-mlock()ed down.
-
-To address this issue, and to make the system more responsive during mlock()ing
-of large segments, mlock_fixup() downgrades the mmap semaphore to read mode
-during the call to __mlock_vma_pages_range(). This works fine. However, the
-callers of mlock_fixup() expect the semaphore to be returned in write mode.
-So, mlock_fixup() "upgrades" the semphore to write mode. Linux does not
-support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore
-and reacquire it in write mode. In a multi-threaded task, it is possible for
-the task memory map to change while the semaphore is dropped. Therefore,
-mlock_fixup() looks up the vma at the range start address after reacquiring
-the semaphore in write mode and verifies that it still covers the original
-range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of
-mlock_fixup() have been changed to deal with this new error condition.
-
-Note: when munlocking a region, all of the pages should already be resident--
-unless we have racing threads mlocking() and munlocking() regions. So,
-unlocking should not have to wait for page allocations nor faults of any kind.
-Therefore mlock_fixup() does not downgrade the semaphore for munlock().
-
-
-Mlocked Pages: munlock()/munlockall() System Call Handling
-
-The munlock() and munlockall() system calls are handled by the same functions--
-do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock
-vs lock operation indicated by an argument. So, these system calls are also
-handled by mlock_fixup(). Again, if called for an already munlock()ed vma,
-mlock_fixup() simply returns. Because of the vma filtering discussed above,
-VM_LOCKED will not be set in any "special" vmas. So, these vmas will be
+munlock(), munmap() or task exit. Neither does mlock_fixup() account these
+VMAs against the task's "locked_vm".
+
+
+munlock()/munlockall() SYSTEM CALL HANDLING
+-------------------------------------------
+
+The munlock() and munlockall() system calls are handled by the same functions -
+do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs
+lock operation indicated by an argument. So, these system calls are also
+handled by mlock_fixup(). Again, if called for an already munlocked VMA,
+mlock_fixup() simply returns. Because of the VMA filtering discussed above,
+VM_LOCKED will not be set in any "special" VMAs. So, these VMAs will be
ignored for munlock.
-If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off
-the specified range. The range is then munlocked via the function
-__mlock_vma_pages_range()--the same function used to mlock a vma range--
+If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
+specified range. The range is then munlocked via the function
+__mlock_vma_pages_range() - the same function used to mlock a VMA range -
passing a flag to indicate that munlock() is being performed.
-Because the vma access protections could have been changed to PROT_NONE after
+Because the VMA access protections could have been changed to PROT_NONE after
faulting in and mlocking pages, get_user_pages() was unreliable for visiting
-these pages for munlocking. Because we don't want to leave pages mlocked(),
+these pages for munlocking. Because we don't want to leave pages mlocked,
get_user_pages() was enhanced to accept a flag to ignore the permissions when
-fetching the pages--all of which should be resident as a result of previous
-mlock()ing.
+fetching the pages - all of which should be resident as a result of previous
+mlocking.
For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling
munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked
-flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page()
-use the Test*PageMlocked() function to handle the case where the page might
-have already been unlocked by another task. If the page was mlocked,
-munlock_vma_page() updates that zone statistics for the number of mlocked
-pages. Note, however, that at this point we haven't checked whether the page
-is mapped by other VM_LOCKED vmas.
-
-We can't call try_to_munlock(), the function that walks the reverse map to check
-for other VM_LOCKED vmas, without first isolating the page from the LRU.
+flag using TestClearPageMlocked(). As with mlock_vma_page(),
+munlock_vma_page() use the Test*PageMlocked() function to handle the case where
+the page might have already been unlocked by another task. If the page was
+mlocked, munlock_vma_page() updates that zone statistics for the number of
+mlocked pages. Note, however, that at this point we haven't checked whether
+the page is mapped by other VM_LOCKED VMAs.
+
+We can't call try_to_munlock(), the function that walks the reverse map to
+check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
-not be on an lru list. [More on these below.] However, the call to
-isolate_lru_page() could fail, in which case we couldn't try_to_munlock().
-So, we go ahead and clear PG_mlocked up front, as this might be the only chance
-we have. If we can successfully isolate the page, we go ahead and
+not be on an LRU list [more on these below]. However, the call to
+isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So,
+we go ahead and clear PG_mlocked up front, as this might be the only chance we
+have. If we can successfully isolate the page, we go ahead and
try_to_munlock(), which will restore the PG_mlocked flag and update the zone
-page statistics if it finds another vma holding the page mlocked. If we fail
+page statistics if it finds another VMA holding the page mlocked. If we fail
to isolate the page, we'll have left a potentially mlocked page on the LRU.
-This is fine, because we'll catch it later when/if vmscan tries to reclaim the
-page. This should be relatively rare.
-
-Mlocked Pages: Migrating Them...
-
-A page that is being migrated has been isolated from the lru lists and is
-held locked across unmapping of the page, updating the page's mapping
-[address_space] entry and copying the contents and state, until the
-page table entry has been replaced with an entry that refers to the new
-page. Linux supports migration of mlocked pages and other unevictable
-pages. This involves simply moving the PageMlocked and PageUnevictable states
-from the old page to the new page.
-
-Note that page migration can race with mlocking or munlocking of the same
-page. This has been discussed from the mlock/munlock perspective in the
-respective sections above. Both processes [migration, m[un]locking], hold
-the page locked. This provides the first level of synchronization. Page
-migration zeros out the page_mapping of the old page before unlocking it,
-so m[un]lock can skip these pages by testing the page mapping under page
-lock.
-
-When completing page migration, we place the new and old pages back onto the
-lru after dropping the page lock. The "unneeded" page--old page on success,
-new page on failure--will be freed when the reference count held by the
-migration process is released. To ensure that we don't strand pages on the
-unevictable list because of a race between munlock and migration, page
-migration uses the putback_lru_page() function to add migrated pages back to
-the lru.
-
-
-Mlocked Pages: mmap(MAP_LOCKED) System Call Handling
+This is fine, because we'll catch it later if and if vmscan tries to reclaim
+the page. This should be relatively rare.
+
+
+MIGRATING MLOCKED PAGES
+-----------------------
+
+A page that is being migrated has been isolated from the LRU lists and is held
+locked across unmapping of the page, updating the page's address space entry
+and copying the contents and state, until the page table entry has been
+replaced with an entry that refers to the new page. Linux supports migration
+of mlocked pages and other unevictable pages. This involves simply moving the
+PG_mlocked and PG_unevictable states from the old page to the new page.
+
+Note that page migration can race with mlocking or munlocking of the same page.
+This has been discussed from the mlock/munlock perspective in the respective
+sections above. Both processes (migration and m[un]locking) hold the page
+locked. This provides the first level of synchronization. Page migration
+zeros out the page_mapping of the old page before unlocking it, so m[un]lock
+can skip these pages by testing the page mapping under page lock.
+
+To complete page migration, we place the new and old pages back onto the LRU
+after dropping the page lock. The "unneeded" page - old page on success, new
+page on failure - will be freed when the reference count held by the migration
+process is released. To ensure that we don't strand pages on the unevictable
+list because of a race between munlock and migration, page migration uses the
+putback_lru_page() function to add migrated pages back to the LRU.
+
+
+mmap(MAP_LOCKED) SYSTEM CALL HANDLING
+-------------------------------------
In addition the the mlock()/mlockall() system calls, an application can request
-that a region of memory be mlocked using the MAP_LOCKED flag with the mmap()
+that a region of memory be mlocked supplying the MAP_LOCKED flag to the mmap()
call. Furthermore, any mmap() call or brk() call that expands the heap by a
task that has previously called mlockall() with the MCL_FUTURE flag will result
-in the newly mapped memory being mlocked. Before the unevictable/mlock changes,
-the kernel simply called make_pages_present() to allocate pages and populate
-the page table.
+in the newly mapped memory being mlocked. Before the unevictable/mlock
+changes, the kernel simply called make_pages_present() to allocate pages and
+populate the page table.
To mlock a range of memory under the unevictable/mlock infrastructure, the
mmap() handler and task address space expansion functions call
mlock_vma_pages_range() specifying the vma and the address range to mlock.
-mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in
-"Mlocked Pages: Filtering Vmas". It will clear the VM_LOCKED flag, which will
-have already been set by the caller, in filtered vmas. Thus these vma's need
-not be visited for munlock when the region is unmapped.
+mlock_vma_pages_range() filters VMAs like mlock_fixup(), as described above in
+"Filtering Special VMAs". It will clear the VM_LOCKED flag, which will have
+already been set by the caller, in filtered VMAs. Thus these VMA's need not be
+visited for munlock when the region is unmapped.
-For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to
+For "normal" VMAs, mlock_vma_pages_range() calls __mlock_vma_pages_range() to
fault/allocate the pages and mlock them. Again, like mlock_fixup(),
mlock_vma_pages_range() downgrades the mmap semaphore to read mode before
-attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore
+attempting to fault/allocate and mlock the pages and "upgrades" the semaphore
back to write mode before returning.
-The callers of mlock_vma_pages_range() will have already added the memory
-range to be mlocked to the task's "locked_vm". To account for filtered vmas,
+The callers of mlock_vma_pages_range() will have already added the memory range
+to be mlocked to the task's "locked_vm". To account for filtered VMAs,
mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the
-callers then subtract a non-negative return value from the task's locked_vm.
-A negative return value represent an error--for example, from get_user_pages()
-attempting to fault in a vma with PROT_NONE access. In this case, we leave
-the memory range accounted as locked_vm, as the protections could be changed
-later and pages allocated into that region.
+callers then subtract a non-negative return value from the task's locked_vm. A
+negative return value represent an error - for example, from get_user_pages()
+attempting to fault in a VMA with PROT_NONE access. In this case, we leave the
+memory range accounted as locked_vm, as the protections could be changed later
+and pages allocated into that region.
-Mlocked Pages: munmap()/exit()/exec() System Call Handling
+munmap()/exit()/exec() SYSTEM CALL HANDLING
+-------------------------------------------
When unmapping an mlocked region of memory, whether by an explicit call to
munmap() or via an internal unmap from exit() or exec() processing, we must
-munlock the pages if we're removing the last VM_LOCKED vma that maps the pages.
+munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing.
To munlock a range of memory under the unevictable/mlock infrastructure, the
-munmap() hander and task address space tear down function call
+munmap() handler and task address space call tear down function
munlock_vma_pages_all(). The name reflects the observation that one always
-specifies the entire vma range when munlock()ing during unmap of a region.
-Because of the vma filtering when mlocking() regions, only "normal" vmas that
+specifies the entire VMA range when munlock()ing during unmap of a region.
+Because of the VMA filtering when mlocking() regions, only "normal" VMAs that
actually contain mlocked pages will be passed to munlock_vma_pages_all().
-munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup()
+munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup()
for the munlock case, calls __munlock_vma_pages_range() to walk the page table
-for the vma's memory range and munlock_vma_page() each resident page mapped by
-the vma. This effectively munlocks the page, only if this is the last
-VM_LOCKED vma that maps the page.
-
+for the VMA's memory range and munlock_vma_page() each resident page mapped by
+the VMA. This effectively munlocks the page, only if this is the last
+VM_LOCKED VMA that maps the page.
-Mlocked Page: try_to_unmap()
-[Note: the code changes represented by this section are really quite small
-compared to the text to describe what happening and why, and to discuss the
-implications.]
+try_to_unmap()
+--------------
-Pages can, of course, be mapped into multiple vmas. Some of these vmas may
+Pages can, of course, be mapped into multiple VMAs. Some of these VMAs may
have VM_LOCKED flag set. It is possible for a page mapped into one or more
-VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one
-of the active or inactive LRU lists. This could happen if, for example, a
-task in the process of munlock()ing the page could not isolate the page from
-the LRU. As a result, vmscan/shrink_page_list() might encounter such a page
-as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To
-handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED
-vmas while it is walking a page's reverse map.
+VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one
+of the active or inactive LRU lists. This could happen if, for example, a task
+in the process of munlocking the page could not isolate the page from the LRU.
+As a result, vmscan/shrink_page_list() might encounter such a page as described
+in section "vmscan's handling of unevictable pages". To handle this situation,
+try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse
+map.
try_to_unmap() is always called, by either vmscan for reclaim or for page
-migration, with the argument page locked and isolated from the LRU. BUG_ON()
-assertions enforce this requirement. Separate functions handle anonymous and
-mapped file pages, as these types of pages have different reverse map
-mechanisms.
-
- try_to_unmap_anon()
-
-To unmap anonymous pages, each vma in the list anchored in the anon_vma must be
-visited--at least until a VM_LOCKED vma is encountered. If the page is being
-unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked
-pages are migratable. However, for reclaim, if the page is mapped into a
-VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap
-semphore of the mm_struct to which the vma belongs in read mode. If this is
-successful, try_to_unmap() will mlock the page via mlock_vma_page()--we
-wouldn't have gotten to try_to_unmap() if the page were already mlocked--and
-will return SWAP_MLOCK, indicating that the page is unevictable. If the
-mmap semaphore cannot be acquired, we are not sure whether the page is really
-unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN.
-
- try_to_unmap_file() -- linear mappings
-
-Unmapping of a mapped file page works the same, except that the scan visits
-all vmas that maps the page's index/page offset in the page's mapping's
-reverse map priority search tree. It must also visit each vma in the page's
-mapping's non-linear list, if the list is non-empty. As for anonymous pages,
-on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will
-attempt to acquire the associated mm_struct's mmap semaphore to mlock the page,
-returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not.
-
- try_to_unmap_file() -- non-linear mappings
-
-If a page's mapping contains a non-empty non-linear mapping vma list, then
-try_to_un{map|lock}() must also visit each vma in that list to determine
-whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit
-all vmas in the non-linear list to ensure that the pages is not/should not be
-mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate.
-However, there is no easy way to determine whether the page is actually mapped
-in a given vma--either for unmapping or testing whether the VM_LOCKED vma
-actually pins the page.
-
-So, try_to_unmap_file() handles non-linear mappings by scanning a certain
-number of pages--a "cluster"--in each non-linear vma associated with the page's
-mapping, for each file mapped page that vmscan tries to unmap. If this happens
-to unmap the page we're trying to unmap, try_to_unmap() will notice this on
-return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it
-will return SWAP_AGAIN, causing vmscan to recirculate this page. We take
-advantage of the cluster scan in try_to_unmap_cluster() as follows:
-
-For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap
-semaphore of the associated mm_struct for read without blocking. If this
-attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will
-retain the mmap semaphore for the scan; otherwise it drops it here. Then,
-for each page in the cluster, if we're holding the mmap semaphore for a locked
-vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This
-call is a no-op if the page is already locked, but will mlock any pages in
-the non-linear mapping that happen to be unlocked. If one of the pages so
-mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will
-return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan
-to cull the page, rather than recirculating it on the inactive list. Again,
-if try_to_unmap_cluster() cannot acquire the vma's mmap sem, it returns
-SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but
-couldn't be mlocked.
-
-
-Mlocked pages: try_to_munlock() Reverse Map Scan
-
-TODO/FIXME: a better name might be page_mlocked()--analogous to the
-page_referenced() reverse map walker.
-
-When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall()
-System Call Handling" above--tries to munlock a page, it needs to
-determine whether or not the page is mapped by any VM_LOCKED vma, without
-actually attempting to unmap all ptes from the page. For this purpose, the
-unevictable/mlock infrastructure introduced a variant of try_to_unmap() called
-try_to_munlock().
+migration, with the argument page locked and isolated from the LRU. Separate
+functions handle anonymous and mapped file pages, as these types of pages have
+different reverse map mechanisms.
+
+ (*) try_to_unmap_anon()
+
+ To unmap anonymous pages, each VMA in the list anchored in the anon_vma
+ must be visited - at least until a VM_LOCKED VMA is encountered. If the
+ page is being unmapped for migration, VM_LOCKED VMAs do not stop the
+ process because mlocked pages are migratable. However, for reclaim, if
+ the page is mapped into a VM_LOCKED VMA, the scan stops.
+
+ try_to_unmap_anon() attempts to acquire in read mode the mmap semphore of
+ the mm_struct to which the VMA belongs. If this is successful, it will
+ mlock the page via mlock_vma_page() - we wouldn't have gotten to
+ try_to_unmap_anon() if the page were already mlocked - and will return
+ SWAP_MLOCK, indicating that the page is unevictable.
+
+ If the mmap semaphore cannot be acquired, we are not sure whether the page
+ is really unevictable or not. In this case, try_to_unmap_anon() will
+ return SWAP_AGAIN.
+
+ (*) try_to_unmap_file() - linear mappings
+
+ Unmapping of a mapped file page works the same as for anonymous mappings,
+ except that the scan visits all VMAs that map the page's index/page offset
+ in the page's mapping's reverse map priority search tree. It also visits
+ each VMA in the page's mapping's non-linear list, if the list is
+ non-empty.
+
+ As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file
+ page, try_to_unmap_file() will attempt to acquire the associated
+ mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this
+ is successful, and SWAP_AGAIN, if not.
+
+ (*) try_to_unmap_file() - non-linear mappings
+
+ If a page's mapping contains a non-empty non-linear mapping VMA list, then
+ try_to_un{map|lock}() must also visit each VMA in that list to determine
+ whether the page is mapped in a VM_LOCKED VMA. Again, the scan must visit
+ all VMAs in the non-linear list to ensure that the pages is not/should not
+ be mlocked.
+
+ If a VM_LOCKED VMA is found in the list, the scan could terminate.
+ However, there is no easy way to determine whether the page is actually
+ mapped in a given VMA - either for unmapping or testing whether the
+ VM_LOCKED VMA actually pins the page.
+
+ try_to_unmap_file() handles non-linear mappings by scanning a certain
+ number of pages - a "cluster" - in each non-linear VMA associated with the
+ page's mapping, for each file mapped page that vmscan tries to unmap. If
+ this happens to unmap the page we're trying to unmap, try_to_unmap() will
+ notice this on return (page_mapcount(page) will be 0) and return
+ SWAP_SUCCESS. Otherwise, it will return SWAP_AGAIN, causing vmscan to
+ recirculate this page. We take advantage of the cluster scan in
+ try_to_unmap_cluster() as follows:
+
+ For each non-linear VMA, try_to_unmap_cluster() attempts to acquire the
+ mmap semaphore of the associated mm_struct for read without blocking.
+
+ If this attempt is successful and the VMA is VM_LOCKED,
+ try_to_unmap_cluster() will retain the mmap semaphore for the scan;
+ otherwise it drops it here.
+
+ Then, for each page in the cluster, if we're holding the mmap semaphore
+ for a locked VMA, try_to_unmap_cluster() calls mlock_vma_page() to
+ mlock the page. This call is a no-op if the page is already locked,
+ but will mlock any pages in the non-linear mapping that happen to be
+ unlocked.
+
+ If one of the pages so mlocked is the page passed in to try_to_unmap(),
+ try_to_unmap_cluster() will return SWAP_MLOCK, rather than the default
+ SWAP_AGAIN. This will allow vmscan to cull the page, rather than
+ recirculating it on the inactive list.
+
+ Again, if try_to_unmap_cluster() cannot acquire the VMA's mmap sem, it
+ returns SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED
+ VMA, but couldn't be mlocked.
+
+
+try_to_munlock() REVERSE MAP SCAN
+---------------------------------
+
+ [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
+ page_referenced() reverse map walker.
+
+When munlock_vma_page() [see section "munlock()/munlockall() System Call
+Handling" above] tries to munlock a page, it needs to determine whether or not
+the page is mapped by any VM_LOCKED VMA without actually attempting to unmap
+all PTEs from the page. For this purpose, the unevictable/mlock infrastructure
+introduced a variant of try_to_unmap() called try_to_munlock().
try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
mapped file pages with an additional argument specifing unlock versus unmap
processing. Again, these functions walk the respective reverse maps looking
-for VM_LOCKED vmas. When such a vma is found for anonymous pages and file
+for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file
pages mapped in linear VMAs, as in the try_to_unmap() case, the functions
attempt to acquire the associated mmap semphore, mlock the page via
mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
pre-clearing of the page's PG_mlocked done by munlock_vma_page.
-If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap
-semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list()
-to recycle the page on the inactive list and hope that it has better luck
-with the page next time.
-
-For file pages mapped into non-linear vmas, the try_to_munlock() logic works
-slightly differently. On encountering a VM_LOCKED non-linear vma that might
-map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking
-the page. munlock_vma_page() will just leave the page unlocked and let
-vmscan deal with it--the usual fallback position.
-
-Note that try_to_munlock()'s reverse map walk must visit every vma in a pages'
-reverse map to determine that a page is NOT mapped into any VM_LOCKED vma.
-However, the scan can terminate when it encounters a VM_LOCKED vma and can
-successfully acquire the vma's mmap semphore for read and mlock the page.
-Although try_to_munlock() can be called many [very many!] times when
-munlock()ing a large region or tearing down a large address space that has been
-mlocked via mlockall(), overall this is a fairly rare event.
-
-Mlocked Page: Page Reclaim in shrink_*_list()
-
-shrink_active_list() culls any obviously unevictable pages--i.e.,
-!page_evictable(page, NULL)--diverting these to the unevictable lru
-list. However, shrink_active_list() only sees unevictable pages that
-made it onto the active/inactive lru lists. Note that these pages do not
-have PageUnevictable set--otherwise, they would be on the unevictable list and
-shrink_active_list would never see them.
+If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap
+semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to
+recycle the page on the inactive list and hope that it has better luck with the
+page next time.
+
+For file pages mapped into non-linear VMAs, the try_to_munlock() logic works
+slightly differently. On encountering a VM_LOCKED non-linear VMA that might
+map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking the
+page. munlock_vma_page() will just leave the page unlocked and let vmscan deal
+with it - the usual fallback position.
+
+Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
+reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
+However, the scan can terminate when it encounters a VM_LOCKED VMA and can
+successfully acquire the VMA's mmap semphore for read and mlock the page.
+Although try_to_munlock() might be called a great many times when munlocking a
+large region or tearing down a large address space that has been mlocked via
+mlockall(), overall this is a fairly rare event.
+
+
+PAGE RECLAIM IN shrink_*_list()
+-------------------------------
+
+shrink_active_list() culls any obviously unevictable pages - i.e.
+!page_evictable(page, NULL) - diverting these to the unevictable list.
+However, shrink_active_list() only sees unevictable pages that made it onto the
+active/inactive lru lists. Note that these pages do not have PageUnevictable
+set - otherwise they would be on the unevictable list and shrink_active_list
+would never see them.
Some examples of these unevictable pages on the LRU lists are:
-1) ramfs pages that have been placed on the lru lists when first allocated.
+ (1) ramfs pages that have been placed on the LRU lists when first allocated.
+
+ (2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to
+ allocate or fault in the pages in the shared memory region. This happens
+ when an application accesses the page the first time after SHM_LOCK'ing
+ the segment.
-2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to
- allocate or fault in the pages in the shared memory region. This happens
- when an application accesses the page the first time after SHM_LOCKing
- the segment.
+ (3) mlocked pages that could not be isolated from the LRU and moved to the
+ unevictable list in mlock_vma_page().
-3) Mlocked pages that could not be isolated from the lru and moved to the
- unevictable list in mlock_vma_page().
+ (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
+ acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
+ munlock_vma_page() was forced to let the page back on to the normal LRU
+ list for vmscan to handle.
-3) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't
- acquire the vma's mmap semaphore to test the flags and set PageMlocked.
- munlock_vma_page() was forced to let the page back on to the normal
- LRU list for vmscan to handle.
+shrink_inactive_list() also diverts any unevictable pages that it finds on the
+inactive lists to the appropriate zone's unevictable list.
-shrink_inactive_list() also culls any unevictable pages that it finds on
-the inactive lists, again diverting them to the appropriate zone's unevictable
-lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became
-SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or
-pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from
-the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice
-the latter, but will pass on to shrink_page_list().
+shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
+after shrink_active_list() had moved them to the inactive list, or pages mapped
+into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
+recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter,
+but will pass on to shrink_page_list().
shrink_page_list() again culls obviously unevictable pages that it could
encounter for similar reason to shrink_inactive_list(). Pages mapped into
-VM_LOCKED vmas but without PG_mlocked set will make it all the way to
+VM_LOCKED VMAs but without PG_mlocked set will make it all the way to
try_to_unmap(). shrink_page_list() will divert them to the unevictable list
when try_to_unmap() returns SWAP_MLOCK, as discussed above.
OpenPOWER on IntegriCloud