From 8d82ffd15e59febf2c597067a777526958b7f769 Mon Sep 17 00:00:00 2001 From: Wolfgang Grandegger Date: Tue, 7 Apr 2009 10:20:56 +0200 Subject: powerpc: Document new FSL I2C bindings and cleanup This patch documents the new bindings for the MPC I2C bus driver. Furthermore, it removes obsolete FSL device related definitions for I2C. Signed-off-by: Wolfgang Grandegger Signed-off-by: Kumar Gala --- Documentation/powerpc/dts-bindings/fsl/i2c.txt | 46 +++++++++++++++++--------- 1 file changed, 31 insertions(+), 15 deletions(-) (limited to 'Documentation') diff --git a/Documentation/powerpc/dts-bindings/fsl/i2c.txt b/Documentation/powerpc/dts-bindings/fsl/i2c.txt index d0ab33e..b6d2e21 100644 --- a/Documentation/powerpc/dts-bindings/fsl/i2c.txt +++ b/Documentation/powerpc/dts-bindings/fsl/i2c.txt @@ -7,8 +7,10 @@ Required properties : Recommended properties : - - compatible : Should be "fsl-i2c" for parts compatible with - Freescale I2C specifications. + - compatible : compatibility list with 2 entries, the first should + be "fsl,CHIP-i2c" where CHIP is the name of a compatible processor, + e.g. mpc8313, mpc8543, mpc8544, mpc5200 or mpc5200b. The second one + should be "fsl-i2c". - interrupts : where a is the interrupt number and b is a field that represents an encoding of the sense and level information for the interrupt. This should be encoded based on @@ -16,17 +18,31 @@ Recommended properties : controller you have. - interrupt-parent : the phandle for the interrupt controller that services interrupts for this device. - - dfsrr : boolean; if defined, indicates that this I2C device has - a digital filter sampling rate register - - fsl5200-clocking : boolean; if defined, indicated that this device - uses the FSL 5200 clocking mechanism. - -Example : - i2c@3000 { - interrupt-parent = <40000>; - interrupts = <1b 3>; - reg = <3000 18>; - device_type = "i2c"; - compatible = "fsl-i2c"; - dfsrr; + - fsl,preserve-clocking : boolean; if defined, the clock settings + from the bootloader are preserved (not touched). + - clock-frequency : desired I2C bus clock frequency in Hz. + +Examples : + + i2c@3d00 { + #address-cells = <1>; + #size-cells = <0>; + compatible = "fsl,mpc5200b-i2c","fsl,mpc5200-i2c","fsl-i2c"; + cell-index = <0>; + reg = <0x3d00 0x40>; + interrupts = <2 15 0>; + interrupt-parent = <&mpc5200_pic>; + fsl,preserve-clocking; }; + + i2c@3100 { + #address-cells = <1>; + #size-cells = <0>; + cell-index = <1>; + compatible = "fsl,mpc8544-i2c", "fsl-i2c"; + reg = <0x3100 0x100>; + interrupts = <43 2>; + interrupt-parent = <&mpic>; + clock-frequency = <400000>; + }; + -- cgit v1.1 From ca8b9950298c84ca528a5943409a727c04ec88f8 Mon Sep 17 00:00:00 2001 From: Li Zefan Date: Mon, 13 Apr 2009 14:39:36 -0700 Subject: Documentation/sysctl/net.txt: fix a typo s/spicified/specified Signed-off-by: Li Zefan Cc: "David S. Miller" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/net.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt index a34d55b..df38ef0 100644 --- a/Documentation/sysctl/net.txt +++ b/Documentation/sysctl/net.txt @@ -95,7 +95,7 @@ of struct cmsghdr structures with appended data. There is only one file in this directory. unix_dgram_qlen limits the max number of datagrams queued in Unix domain -socket's buffer. It will not take effect unless PF_UNIX flag is spicified. +socket's buffer. It will not take effect unless PF_UNIX flag is specified. 3. /proc/sys/net/ipv4 - IPV4 settings -- cgit v1.1 From 5341cfab94ec05b8a45726f9fe15e71c0cd9b915 Mon Sep 17 00:00:00 2001 From: Andrea Righi Date: Mon, 13 Apr 2009 14:39:58 -0700 Subject: res_counter: update documentation After the introduction of resource counters hierarchies (28dbc4b6a01fb579a9441c7b81e3d3413dc452df) the prototypes of res_counter_init() and res_counter_charge() have been changed. Keep the documentation consistent with the actual function prototypes. Signed-off-by: Andrea Righi Cc: Paul Menage Cc: Pavel Emelyanov Cc: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/resource_counter.txt | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt index f196ac1..95b24d7 100644 --- a/Documentation/cgroups/resource_counter.txt +++ b/Documentation/cgroups/resource_counter.txt @@ -47,13 +47,18 @@ to work with it. 2. Basic accounting routines - a. void res_counter_init(struct res_counter *rc) + a. void res_counter_init(struct res_counter *rc, + struct res_counter *rc_parent) Initializes the resource counter. As usual, should be the first routine called for a new counter. - b. int res_counter_charge[_locked] - (struct res_counter *rc, unsigned long val) + The struct res_counter *parent can be used to define a hierarchical + child -> parent relationship directly in the res_counter structure, + NULL can be used to define no relationship. + + c. int res_counter_charge(struct res_counter *rc, unsigned long val, + struct res_counter **limit_fail_at) When a resource is about to be allocated it has to be accounted with the appropriate resource counter (controller should determine @@ -67,15 +72,25 @@ to work with it. * if the charging is performed first, then it should be uncharged on error path (if the one is called). - c. void res_counter_uncharge[_locked] + If the charging fails and a hierarchical dependency exists, the + limit_fail_at parameter is set to the particular res_counter element + where the charging failed. + + d. int res_counter_charge_locked + (struct res_counter *rc, unsigned long val) + + The same as res_counter_charge(), but it must not acquire/release the + res_counter->lock internally (it must be called with res_counter->lock + held). + + e. void res_counter_uncharge[_locked] (struct res_counter *rc, unsigned long val) When a resource is released (freed) it should be de-accounted from the resource counter it was accounted to. This is called "uncharging". - The _locked routines imply that the res_counter->lock is taken. - + The _locked routines imply that the res_counter->lock is taken. 2.1 Other accounting routines -- cgit v1.1 From c24b720188e9a1f83caa5b6d49b4cb5b843256f1 Mon Sep 17 00:00:00 2001 From: David Howells Date: Mon, 13 Apr 2009 14:40:01 -0700 Subject: mm: reformat the Unevictable-LRU documentation Do a bit of reformatting on the Unevictable-LRU documentation. Signed-off-by: David Howells Acked-by: Lee Schermerhorn Cc: Rik van Riel Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/unevictable-lru.txt | 1041 +++++++++++++++++++--------------- 1 file changed, 572 insertions(+), 469 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index 0706a72..2d70d0d 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -1,588 +1,691 @@ - -This document describes the Linux memory management "Unevictable LRU" -infrastructure and the use of this infrastructure to manage several types -of "unevictable" pages. The document attempts to provide the overall -rationale behind this mechanism and the rationale for some of the design -decisions that drove the implementation. The latter design rationale is -discussed in the context of an implementation description. Admittedly, one -can obtain the implementation details--the "what does it do?"--by reading the -code. One hopes that the descriptions below add value by provide the answer -to "why does it do that?". - -Unevictable LRU Infrastructure: - -The Unevictable LRU adds an additional LRU list to track unevictable pages -and to hide these pages from vmscan. This mechanism is based on a patch by -Larry Woodman of Red Hat to address several scalability problems with page + ============================== + UNEVICTABLE LRU INFRASTRUCTURE + ============================== + +======== +CONTENTS +======== + + (*) The Unevictable LRU + + - The unevictable page list. + - Memory control group interaction. + - Marking address spaces unevictable. + - Detecting Unevictable Pages. + - vmscan's handling of unevictable pages. + + (*) mlock()'d pages. + + - History. + - Basic management. + - mlock()/mlockall() system call handling. + - Filtering special vmas. + - munlock()/munlockall() system call handling. + - Migrating mlocked pages. + - mmap(MAP_LOCKED) system call handling. + - munmap()/exit()/exec() system call handling. + - try_to_unmap(). + - try_to_munlock() reverse map scan. + - Page reclaim in shrink_*_list(). + + +============ +INTRODUCTION +============ + +This document describes the Linux memory manager's "Unevictable LRU" +infrastructure and the use of this to manage several types of "unevictable" +pages. + +The document attempts to provide the overall rationale behind this mechanism +and the rationale for some of the design decisions that drove the +implementation. The latter design rationale is discussed in the context of an +implementation description. Admittedly, one can obtain the implementation +details - the "what does it do?" - by reading the code. One hopes that the +descriptions below add value by provide the answer to "why does it do that?". + + +=================== +THE UNEVICTABLE LRU +=================== + +The Unevictable LRU facility adds an additional LRU list to track unevictable +pages and to hide these pages from vmscan. This mechanism is based on a patch +by Larry Woodman of Red Hat to address several scalability problems with page reclaim in Linux. The problems have been observed at customer sites on large -memory x86_64 systems. For example, a non-numal x86_64 platform with 128GB -of main memory will have over 32 million 4k pages in a single zone. When a -large fraction of these pages are not evictable for any reason [see below], -vmscan will spend a lot of time scanning the LRU lists looking for the small -fraction of pages that are evictable. This can result in a situation where -all cpus are spending 100% of their time in vmscan for hours or days on end, -with the system completely unresponsive. - -The Unevictable LRU infrastructure addresses the following classes of -unevictable pages: - -+ page owned by ramfs -+ page mapped into SHM_LOCKed shared memory regions -+ page mapped into VM_LOCKED [mlock()ed] vmas - -The infrastructure might be able to handle other conditions that make pages +memory x86_64 systems. + +To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of +main memory will have over 32 million 4k pages in a single zone. When a large +fraction of these pages are not evictable for any reason [see below], vmscan +will spend a lot of time scanning the LRU lists looking for the small fraction +of pages that are evictable. This can result in a situation where all CPUs are +spending 100% of their time in vmscan for hours or days on end, with the system +completely unresponsive. + +The unevictable list addresses the following classes of unevictable pages: + + (*) Those owned by ramfs. + + (*) Those mapped into SHM_LOCK'd shared memory regions. + + (*) Those mapped into VM_LOCKED [mlock()ed] VMAs. + +The infrastructure may also be able to handle other conditions that make pages unevictable, either by definition or by circumstance, in the future. -The Unevictable LRU List +THE UNEVICTABLE PAGE LIST +------------------------- The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list called the "unevictable" list and an associated page flag, PG_unevictable, to -indicate that the page is being managed on the unevictable list. The -PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active -flag in that it indicates on which LRU list a page resides when PG_lru is set. -The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU -Kconfig option. +indicate that the page is being managed on the unevictable list. + +The PG_unevictable flag is analogous to, and mutually exclusive with, the +PG_active flag in that it indicates on which LRU list a page resides when +PG_lru is set. The unevictable list is compile-time configurable based on the +UNEVICTABLE_LRU Kconfig option. The Unevictable LRU infrastructure maintains unevictable pages on an additional LRU list for a few reasons: -1) We get to "treat unevictable pages just like we treat other pages in the - system, which means we get to use the same code to manipulate them, the - same code to isolate them (for migrate, etc.), the same code to keep track - of the statistics, etc..." [Rik van Riel] + (1) We get to "treat unevictable pages just like we treat other pages in the + system - which means we get to use the same code to manipulate them, the + same code to isolate them (for migrate, etc.), the same code to keep track + of the statistics, etc..." [Rik van Riel] + + (2) We want to be able to migrate unevictable pages between nodes for memory + defragmentation, workload management and memory hotplug. The linux kernel + can only migrate pages that it can successfully isolate from the LRU + lists. If we were to maintain pages elsewhere than on an LRU-like list, + where they can be found by isolate_lru_page(), we would prevent their + migration, unless we reworked migration code to find the unevictable pages + itself. -2) We want to be able to migrate unevictable pages between nodes--for memory - defragmentation, workload management and memory hotplug. The linux kernel - can only migrate pages that it can successfully isolate from the lru lists. - If we were to maintain pages elsewise than on an lru-like list, where they - can be found by isolate_lru_page(), we would prevent their migration, unless - we reworked migration code to find the unevictable pages. +The unevictable list does not differentiate between file-backed and anonymous, +swap-backed pages. This differentiation is only important while the pages are, +in fact, evictable. -The unevictable LRU list does not differentiate between file backed and swap -backed [anon] pages. This differentiation is only important while the pages -are, in fact, evictable. +The unevictable list benefits from the "arrayification" of the per-zone LRU +lists and statistics originally proposed and posted by Christoph Lameter. -The unevictable LRU list benefits from the "arrayification" of the per-zone -LRU lists and statistics originally proposed and posted by Christoph Lameter. +The unevictable list does not use the LRU pagevec mechanism. Rather, +unevictable pages are placed directly on the page's zone's unevictable list +under the zone lru_lock. This allows us to prevent the stranding of pages on +the unevictable list when one task has the page isolated from the LRU and other +tasks are changing the "evictability" state of the page. -The unevictable list does not use the lru pagevec mechanism. Rather, -unevictable pages are placed directly on the page's zone's unevictable -list under the zone lru_lock. The reason for this is to prevent stranding -of pages on the unevictable list when one task has the page isolated from the -lru and other tasks are changing the "evictability" state of the page. +MEMORY CONTROL GROUP INTERACTION +-------------------------------- -Unevictable LRU and Memory Controller Interaction +The unevictable LRU facility interacts with the memory control group [aka +memory controller; see Documentation/cgroups/memory.txt] by extending the +lru_list enum. + +The memory controller data structure automatically gets a per-zone unevictable +list as a result of the "arrayification" of the per-zone LRU lists (one per +lru_list enum element). The memory controller tracks the movement of pages to +and from the unevictable list. -The memory controller data structure automatically gets a per zone unevictable -lru list as a result of the "arrayification" of the per-zone LRU lists. The -memory controller tracks the movement of pages to and from the unevictable list. When a memory control group comes under memory pressure, the controller will not attempt to reclaim pages on the unevictable list. This has a couple of -effects. Because the pages are "hidden" from reclaim on the unevictable list, -the reclaim process can be more efficient, dealing only with pages that have -a chance of being reclaimed. On the other hand, if too many of the pages -charged to the control group are unevictable, the evictable portion of the -working set of the tasks in the control group may not fit into the available -memory. This can cause the control group to thrash or to oom-kill tasks. - - -Unevictable LRU: Detecting Unevictable Pages - -The function page_evictable(page, vma) in vmscan.c determines whether a -page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions, -page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's -address space using a wrapper function. Wrapper functions are used to set, -clear and test the flag to reduce the requirement for #ifdef's throughout the -source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created. -This flag remains for the life of the inode. - -For shared memory regions, AS_UNEVICTABLE is set when an application -successfully SHM_LOCKs the region and is removed when the region is -SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page -tables for the region as does, for example, mlock(). So, we make no special -effort to push any pages in the SHM_LOCKed region to the unevictable list. -Vmscan will do this when/if it encounters the pages during reclaim. On -SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the -unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed -region is destroyed, the pages are also "rescued" from the unevictable list in -the process of freeing them. - -page_evictable() detects mlock()ed pages by testing an additional page flag, -PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a -non-NULL vma is supplied, page_evictable() will check whether the vma is +effects: + + (1) Because the pages are "hidden" from reclaim on the unevictable list, the + reclaim process can be more efficient, dealing only with pages that have a + chance of being reclaimed. + + (2) On the other hand, if too many of the pages charged to the control group + are unevictable, the evictable portion of the working set of the tasks in + the control group may not fit into the available memory. This can cause + the control group to thrash or to OOM-kill tasks. + + +MARKING ADDRESS SPACES UNEVICTABLE +---------------------------------- + +For facilities such as ramfs none of the pages attached to the address space +may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE +address space flag is provided, and this can be manipulated by a filesystem +using a number of wrapper functions: + + (*) void mapping_set_unevictable(struct address_space *mapping); + + Mark the address space as being completely unevictable. + + (*) void mapping_clear_unevictable(struct address_space *mapping); + + Mark the address space as being evictable. + + (*) int mapping_unevictable(struct address_space *mapping); + + Query the address space, and return true if it is completely + unevictable. + +These are currently used in two places in the kernel: + + (1) By ramfs to mark the address spaces of its inodes when they are created, + and this mark remains for the life of the inode. + + (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called. + + Note that SHM_LOCK is not required to page in the locked pages if they're + swapped out; the application must touch the pages manually if it wants to + ensure they're in memory. + + +DETECTING UNEVICTABLE PAGES +--------------------------- + +The function page_evictable() in vmscan.c determines whether a page is +evictable or not using the query function outlined above [see section "Marking +address spaces unevictable"] to check the AS_UNEVICTABLE flag. + +For address spaces that are so marked after being populated (as SHM regions +might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate +the page tables for the region as does, for example, mlock(), nor need it make +any special effort to push any pages in the SHM_LOCK'd area to the unevictable +list. Instead, vmscan will do this if and when it encounters the pages during +a reclamation scan. + +On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan +the pages in the region and "rescue" them from the unevictable list if no other +condition is keeping them unevictable. If an unevictable region is destroyed, +the pages are also "rescued" from the unevictable list in the process of +freeing them. + +page_evictable() also checks for mlocked pages by testing an additional page +flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked, +and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and update the appropriate statistics if the vma is VM_LOCKED. This method allows efficient "culling" of pages in the fault path that are being faulted in to -VM_LOCKED vmas. +VM_LOCKED VMAs. -Unevictable Pages and Vmscan [shrink_*_list()] +VMSCAN'S HANDLING OF UNEVICTABLE PAGES +-------------------------------------- If unevictable pages are culled in the fault path, or moved to the unevictable -list at mlock() or mmap() time, vmscan will never encounter the pages until -they have become evictable again, for example, via munlock() and have been -"rescued" from the unevictable list. However, there may be situations where we -decide, for the sake of expediency, to leave a unevictable page on one of the -regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for -such pages in all of the shrink_{active|inactive|page}_list() functions and -will "cull" such pages that it encounters--that is, it diverts those pages to -the unevictable list for the zone being scanned. - -There may be situations where a page is mapped into a VM_LOCKED vma, but the -page is not marked as PageMlocked. Such pages will make it all the way to +list at mlock() or mmap() time, vmscan will not encounter the pages until they +have become evictable again (via munlock() for example) and have been "rescued" +from the unevictable list. However, there may be situations where we decide, +for the sake of expediency, to leave a unevictable page on one of the regular +active/inactive LRU lists for vmscan to deal with. vmscan checks for such +pages in all of the shrink_{active|inactive|page}_list() functions and will +"cull" such pages that it encounters: that is, it diverts those pages to the +unevictable list for the zone being scanned. + +There may be situations where a page is mapped into a VM_LOCKED VMA, but the +page is not marked as PG_mlocked. Such pages will make it all the way to shrink_page_list() where they will be detected when vmscan walks the reverse -map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list() -will cull the page at that point. +map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, +shrink_page_list() will cull the page at that point. -To "cull" an unevictable page, vmscan simply puts the page back on the lru -list using putback_lru_page()--the inverse operation to isolate_lru_page()-- -after dropping the page lock. Because the condition which makes the page -unevictable may change once the page is unlocked, putback_lru_page() will -recheck the unevictable state of a page that it places on the unevictable lru -list. If the page has become unevictable, putback_lru_page() removes it from -the list and retries, including the page_unevictable() test. Because such a -race is a rare event and movement of pages onto the unevictable list should be -rare, these extra evictabilty checks should not occur in the majority of calls -to putback_lru_page(). +To "cull" an unevictable page, vmscan simply puts the page back on the LRU list +using putback_lru_page() - the inverse operation to isolate_lru_page() - after +dropping the page lock. Because the condition which makes the page unevictable +may change once the page is unlocked, putback_lru_page() will recheck the +unevictable state of a page that it places on the unevictable list. If the +page has become unevictable, putback_lru_page() removes it from the list and +retries, including the page_unevictable() test. Because such a race is a rare +event and movement of pages onto the unevictable list should be rare, these +extra evictabilty checks should not occur in the majority of calls to +putback_lru_page(). -Mlocked Page: Prior Work +============= +MLOCKED PAGES +============= -The "Unevictable Mlocked Pages" infrastructure is based on work originally +The unevictable page list is also useful for mlock(), in addition to ramfs and +SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in +NOMMU situations, all mappings are effectively mlocked. + + +HISTORY +------- + +The "Unevictable mlocked Pages" infrastructure is based on work originally posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". -Nick posted his patch as an alternative to a patch posted by Christoph -Lameter to achieve the same objective--hiding mlocked pages from vmscan. -In Nick's patch, he used one of the struct page lru list link fields as a count -of VM_LOCKED vmas that map the page. This use of the link field for a count -prevented the management of the pages on an LRU list. Thus, mlocked pages were -not migratable as isolate_lru_page() could not find them and the lru list link -field was not available to the migration subsystem. Nick resolved this by -putting mlocked pages back on the lru list before attempting to isolate them, -thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated -with the Unevictable LRU work, the count was replaced by walking the reverse -map to determine whether any VM_LOCKED vmas mapped the page. More on this -below. - - -Mlocked Pages: Basic Management - -Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of -unevictable pages. When such a page has been "noticed" by the memory -management subsystem, the page is marked with the PG_mlocked [PageMlocked()] -flag. A PageMlocked() page will be placed on the unevictable LRU list when -it is added to the LRU. Pages can be "noticed" by memory management in -several places: - -1) in the mlock()/mlockall() system call handlers. -2) in the mmap() system call handler when mmap()ing a region with the - MAP_LOCKED flag, or mmap()ing a region in a task that has called - mlockall() with the MCL_FUTURE flag. Both of these conditions result - in the VM_LOCKED flag being set for the vma. -3) in the fault path, if mlocked pages are "culled" in the fault path, - and when a VM_LOCKED stack segment is expanded. -4) as mentioned above, in vmscan:shrink_page_list() when attempting to - reclaim a page in a VM_LOCKED vma via try_to_unmap(). - -Mlocked pages become unlocked and rescued from the unevictable list when: - -1) mapped in a range unlocked via the munlock()/munlockall() system calls. -2) munmapped() out of the last VM_LOCKED vma that maps the page, including - unmapping at task exit. -3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file. -4) before a page is COWed in a VM_LOCKED vma. - - -Mlocked Pages: mlock()/mlockall() System Call Handling +Nick posted his patch as an alternative to a patch posted by Christoph Lameter +to achieve the same objective: hiding mlocked pages from vmscan. + +In Nick's patch, he used one of the struct page LRU list link fields as a count +of VM_LOCKED VMAs that map the page. This use of the link field for a count +prevented the management of the pages on an LRU list, and thus mlocked pages +were not migratable as isolate_lru_page() could not find them, and the LRU list +link field was not available to the migration subsystem. + +Nick resolved this by putting mlocked pages back on the lru list before +attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When +Nick's patch was integrated with the Unevictable LRU work, the count was +replaced by walking the reverse map to determine whether any VM_LOCKED VMAs +mapped the page. More on this below. + + +BASIC MANAGEMENT +---------------- + +mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable +pages. When such a page has been "noticed" by the memory management subsystem, +the page is marked with the PG_mlocked flag. This can be manipulated using the +PageMlocked() functions. + +A PG_mlocked page will be placed on the unevictable list when it is added to +the LRU. Such pages can be "noticed" by memory management in several places: + + (1) in the mlock()/mlockall() system call handlers; + + (2) in the mmap() system call handler when mmapping a region with the + MAP_LOCKED flag; + + (3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE + flag + + (4) in the fault path, if mlocked pages are "culled" in the fault path, + and when a VM_LOCKED stack segment is expanded; or + + (5) as mentioned above, in vmscan:shrink_page_list() when attempting to + reclaim a page in a VM_LOCKED VMA via try_to_unmap() + +all of which result in the VM_LOCKED flag being set for the VMA if it doesn't +already have it set. + +mlocked pages become unlocked and rescued from the unevictable list when: + + (1) mapped in a range unlocked via the munlock()/munlockall() system calls; + + (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including + unmapping at task exit; + + (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file; + or + + (4) before a page is COW'd in a VM_LOCKED VMA. + + +mlock()/mlockall() SYSTEM CALL HANDLING +--------------------------------------- Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() -for each vma in the range specified by the call. In the case of mlockall(), +for each VMA in the range specified by the call. In the case of mlockall(), this is the entire active address space of the task. Note that mlock_fixup() -is used for both mlock()ing and munlock()ing a range of memory. A call to -mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED -is treated as a no-op--mlock_fixup() simply returns. - -If the vma passes some filtering described in "Mlocked Pages: Filtering Vmas" -below, mlock_fixup() will attempt to merge the vma with its neighbors or split -off a subset of the vma if the range does not cover the entire vma. Once the -vma has been merged or split or neither, mlock_fixup() will call -__mlock_vma_pages_range() to fault in the pages via get_user_pages() and -to mark the pages as mlocked via mlock_vma_page(). - -Note that the vma being mlocked might be mapped with PROT_NONE. In this case, -get_user_pages() will be unable to fault in the pages. That's OK. If pages -do end up getting faulted into this VM_LOCKED vma, we'll handle them in the +is used for both mlocking and munlocking a range of memory. A call to mlock() +an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED is +treated as a no-op, and mlock_fixup() simply returns. + +If the VMA passes some filtering as described in "Filtering Special Vmas" +below, mlock_fixup() will attempt to merge the VMA with its neighbors or split +off a subset of the VMA if the range does not cover the entire VMA. Once the +VMA has been merged or split or neither, mlock_fixup() will call +__mlock_vma_pages_range() to fault in the pages via get_user_pages() and to +mark the pages as mlocked via mlock_vma_page(). + +Note that the VMA being mlocked might be mapped with PROT_NONE. In this case, +get_user_pages() will be unable to fault in the pages. That's okay. If pages +do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the fault path or in vmscan. Also note that a page returned by get_user_pages() could be truncated or -migrated out from under us, while we're trying to mlock it. To detect -this, __mlock_vma_pages_range() tests the page_mapping after acquiring -the page lock. If the page is still associated with its mapping, we'll -go ahead and call mlock_vma_page(). If the mapping is gone, we just -unlock the page and move on. Worse case, this results in page mapped -in a VM_LOCKED vma remaining on a normal LRU list without being -PageMlocked(). Again, vmscan will detect and cull such pages. - -mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will -TestSetPageMlocked() for each page returned by get_user_pages(). We use -TestSetPageMlocked() because the page might already be mlocked by another -task/vma and we don't want to do extra work. We especially do not want to -count an mlocked page more than once in the statistics. If the page was -already mlocked, mlock_vma_page() is done. +migrated out from under us, while we're trying to mlock it. To detect this, +__mlock_vma_pages_range() checks page_mapping() after acquiring the page lock. +If the page is still associated with its mapping, we'll go ahead and call +mlock_vma_page(). If the mapping is gone, we just unlock the page and move on. +In the worst case, this will result in a page mapped in a VM_LOCKED VMA +remaining on a normal LRU list without being PageMlocked(). Again, vmscan will +detect and cull such pages. + +mlock_vma_page() will call TestSetPageMlocked() for each page returned by +get_user_pages(). We use TestSetPageMlocked() because the page might already +be mlocked by another task/VMA and we don't want to do extra work. We +especially do not want to count an mlocked page more than once in the +statistics. If the page was already mlocked, mlock_vma_page() need do nothing +more. If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the page from the LRU, as it is likely on the appropriate active or inactive list -at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will -putback the page--putback_lru_page()--which will notice that the page is now -mlocked and divert the page to the zone's unevictable LRU list. If +at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put +back the page - by calling putback_lru_page() - which will notice that the page +is now mlocked and divert the page to the zone's unevictable list. If mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle -it later if/when it attempts to reclaim the page. +it later if and when it attempts to reclaim the page. -Mlocked Pages: Filtering Special Vmas +FILTERING SPECIAL VMAS +---------------------- -mlock_fixup() filters several classes of "special" vmas: +mlock_fixup() filters several classes of "special" VMAs: -1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind +1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely. The pages behind these mappings are inherently pinned, so we don't need to mark them as - mlocked. In any case, most of the pages have no struct page in which to - so mark the page. Because of this, get_user_pages() will fail for these - vmas, so there is no sense in attempting to visit them. - -2) vmas mapping hugetlbfs page are already effectively pinned into memory. - We don't need nor want to mlock() these pages. However, to preserve the - prior behavior of mlock()--before the unevictable/mlock changes-- - mlock_fixup() will call make_pages_present() in the hugetlbfs vma range - to allocate the huge pages and populate the ptes. - -3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of - kernel pages, such as the vdso page, relay channel pages, etc. These pages + mlocked. In any case, most of the pages have no struct page in which to so + mark the page. Because of this, get_user_pages() will fail for these VMAs, + so there is no sense in attempting to visit them. + +2) VMAs mapping hugetlbfs page are already effectively pinned into memory. We + neither need nor want to mlock() these pages. However, to preserve the + prior behavior of mlock() - before the unevictable/mlock changes - + mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to + allocate the huge pages and populate the ptes. + +3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of + kernel pages, such as the VDSO page, relay channel pages, etc. These pages are inherently unevictable and are not managed on the LRU lists. - mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls + mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls make_pages_present() to populate the ptes. -Note that for all of these special vmas, mlock_fixup() does not set the +Note that for all of these special VMAs, mlock_fixup() does not set the VM_LOCKED flag. Therefore, we won't have to deal with them later during -munlock() or munmap()--for example, at task exit. Neither does mlock_fixup() -account these vmas against the task's "locked_vm". - -Mlocked Pages: Downgrading the Mmap Semaphore. - -mlock_fixup() must be called with the mmap semaphore held for write, because -it may have to merge or split vmas. However, mlocking a large region of -memory can take a long time--especially if vmscan must reclaim pages to -satisfy the regions requirements. Faulting in a large region with the mmap -semaphore held for write can hold off other faults on the address space, in -the case of a multi-threaded task. It can also hold off scans of the task's -address space via /proc. While testing under heavy load, it was observed that -the ps(1) command could be held off for many minutes while a large segment was -mlock()ed down. - -To address this issue, and to make the system more responsive during mlock()ing -of large segments, mlock_fixup() downgrades the mmap semaphore to read mode -during the call to __mlock_vma_pages_range(). This works fine. However, the -callers of mlock_fixup() expect the semaphore to be returned in write mode. -So, mlock_fixup() "upgrades" the semphore to write mode. Linux does not -support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore -and reacquire it in write mode. In a multi-threaded task, it is possible for -the task memory map to change while the semaphore is dropped. Therefore, -mlock_fixup() looks up the vma at the range start address after reacquiring -the semaphore in write mode and verifies that it still covers the original -range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of -mlock_fixup() have been changed to deal with this new error condition. - -Note: when munlocking a region, all of the pages should already be resident-- -unless we have racing threads mlocking() and munlocking() regions. So, -unlocking should not have to wait for page allocations nor faults of any kind. -Therefore mlock_fixup() does not downgrade the semaphore for munlock(). - - -Mlocked Pages: munlock()/munlockall() System Call Handling - -The munlock() and munlockall() system calls are handled by the same functions-- -do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock -vs lock operation indicated by an argument. So, these system calls are also -handled by mlock_fixup(). Again, if called for an already munlock()ed vma, -mlock_fixup() simply returns. Because of the vma filtering discussed above, -VM_LOCKED will not be set in any "special" vmas. So, these vmas will be +munlock(), munmap() or task exit. Neither does mlock_fixup() account these +VMAs against the task's "locked_vm". + + +munlock()/munlockall() SYSTEM CALL HANDLING +------------------------------------------- + +The munlock() and munlockall() system calls are handled by the same functions - +do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs +lock operation indicated by an argument. So, these system calls are also +handled by mlock_fixup(). Again, if called for an already munlocked VMA, +mlock_fixup() simply returns. Because of the VMA filtering discussed above, +VM_LOCKED will not be set in any "special" VMAs. So, these VMAs will be ignored for munlock. -If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off -the specified range. The range is then munlocked via the function -__mlock_vma_pages_range()--the same function used to mlock a vma range-- +If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the +specified range. The range is then munlocked via the function +__mlock_vma_pages_range() - the same function used to mlock a VMA range - passing a flag to indicate that munlock() is being performed. -Because the vma access protections could have been changed to PROT_NONE after +Because the VMA access protections could have been changed to PROT_NONE after faulting in and mlocking pages, get_user_pages() was unreliable for visiting -these pages for munlocking. Because we don't want to leave pages mlocked(), +these pages for munlocking. Because we don't want to leave pages mlocked, get_user_pages() was enhanced to accept a flag to ignore the permissions when -fetching the pages--all of which should be resident as a result of previous -mlock()ing. +fetching the pages - all of which should be resident as a result of previous +mlocking. For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked -flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page() -use the Test*PageMlocked() function to handle the case where the page might -have already been unlocked by another task. If the page was mlocked, -munlock_vma_page() updates that zone statistics for the number of mlocked -pages. Note, however, that at this point we haven't checked whether the page -is mapped by other VM_LOCKED vmas. - -We can't call try_to_munlock(), the function that walks the reverse map to check -for other VM_LOCKED vmas, without first isolating the page from the LRU. +flag using TestClearPageMlocked(). As with mlock_vma_page(), +munlock_vma_page() use the Test*PageMlocked() function to handle the case where +the page might have already been unlocked by another task. If the page was +mlocked, munlock_vma_page() updates that zone statistics for the number of +mlocked pages. Note, however, that at this point we haven't checked whether +the page is mapped by other VM_LOCKED VMAs. + +We can't call try_to_munlock(), the function that walks the reverse map to +check for other VM_LOCKED VMAs, without first isolating the page from the LRU. try_to_munlock() is a variant of try_to_unmap() and thus requires that the page -not be on an lru list. [More on these below.] However, the call to -isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). -So, we go ahead and clear PG_mlocked up front, as this might be the only chance -we have. If we can successfully isolate the page, we go ahead and +not be on an LRU list [more on these below]. However, the call to +isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So, +we go ahead and clear PG_mlocked up front, as this might be the only chance we +have. If we can successfully isolate the page, we go ahead and try_to_munlock(), which will restore the PG_mlocked flag and update the zone -page statistics if it finds another vma holding the page mlocked. If we fail +page statistics if it finds another VMA holding the page mlocked. If we fail to isolate the page, we'll have left a potentially mlocked page on the LRU. -This is fine, because we'll catch it later when/if vmscan tries to reclaim the -page. This should be relatively rare. - -Mlocked Pages: Migrating Them... - -A page that is being migrated has been isolated from the lru lists and is -held locked across unmapping of the page, updating the page's mapping -[address_space] entry and copying the contents and state, until the -page table entry has been replaced with an entry that refers to the new -page. Linux supports migration of mlocked pages and other unevictable -pages. This involves simply moving the PageMlocked and PageUnevictable states -from the old page to the new page. - -Note that page migration can race with mlocking or munlocking of the same -page. This has been discussed from the mlock/munlock perspective in the -respective sections above. Both processes [migration, m[un]locking], hold -the page locked. This provides the first level of synchronization. Page -migration zeros out the page_mapping of the old page before unlocking it, -so m[un]lock can skip these pages by testing the page mapping under page -lock. - -When completing page migration, we place the new and old pages back onto the -lru after dropping the page lock. The "unneeded" page--old page on success, -new page on failure--will be freed when the reference count held by the -migration process is released. To ensure that we don't strand pages on the -unevictable list because of a race between munlock and migration, page -migration uses the putback_lru_page() function to add migrated pages back to -the lru. - - -Mlocked Pages: mmap(MAP_LOCKED) System Call Handling +This is fine, because we'll catch it later if and if vmscan tries to reclaim +the page. This should be relatively rare. + + +MIGRATING MLOCKED PAGES +----------------------- + +A page that is being migrated has been isolated from the LRU lists and is held +locked across unmapping of the page, updating the page's address space entry +and copying the contents and state, until the page table entry has been +replaced with an entry that refers to the new page. Linux supports migration +of mlocked pages and other unevictable pages. This involves simply moving the +PG_mlocked and PG_unevictable states from the old page to the new page. + +Note that page migration can race with mlocking or munlocking of the same page. +This has been discussed from the mlock/munlock perspective in the respective +sections above. Both processes (migration and m[un]locking) hold the page +locked. This provides the first level of synchronization. Page migration +zeros out the page_mapping of the old page before unlocking it, so m[un]lock +can skip these pages by testing the page mapping under page lock. + +To complete page migration, we place the new and old pages back onto the LRU +after dropping the page lock. The "unneeded" page - old page on success, new +page on failure - will be freed when the reference count held by the migration +process is released. To ensure that we don't strand pages on the unevictable +list because of a race between munlock and migration, page migration uses the +putback_lru_page() function to add migrated pages back to the LRU. + + +mmap(MAP_LOCKED) SYSTEM CALL HANDLING +------------------------------------- In addition the the mlock()/mlockall() system calls, an application can request -that a region of memory be mlocked using the MAP_LOCKED flag with the mmap() +that a region of memory be mlocked supplying the MAP_LOCKED flag to the mmap() call. Furthermore, any mmap() call or brk() call that expands the heap by a task that has previously called mlockall() with the MCL_FUTURE flag will result -in the newly mapped memory being mlocked. Before the unevictable/mlock changes, -the kernel simply called make_pages_present() to allocate pages and populate -the page table. +in the newly mapped memory being mlocked. Before the unevictable/mlock +changes, the kernel simply called make_pages_present() to allocate pages and +populate the page table. To mlock a range of memory under the unevictable/mlock infrastructure, the mmap() handler and task address space expansion functions call mlock_vma_pages_range() specifying the vma and the address range to mlock. -mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in -"Mlocked Pages: Filtering Vmas". It will clear the VM_LOCKED flag, which will -have already been set by the caller, in filtered vmas. Thus these vma's need -not be visited for munlock when the region is unmapped. +mlock_vma_pages_range() filters VMAs like mlock_fixup(), as described above in +"Filtering Special VMAs". It will clear the VM_LOCKED flag, which will have +already been set by the caller, in filtered VMAs. Thus these VMA's need not be +visited for munlock when the region is unmapped. -For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to +For "normal" VMAs, mlock_vma_pages_range() calls __mlock_vma_pages_range() to fault/allocate the pages and mlock them. Again, like mlock_fixup(), mlock_vma_pages_range() downgrades the mmap semaphore to read mode before -attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore +attempting to fault/allocate and mlock the pages and "upgrades" the semaphore back to write mode before returning. -The callers of mlock_vma_pages_range() will have already added the memory -range to be mlocked to the task's "locked_vm". To account for filtered vmas, +The callers of mlock_vma_pages_range() will have already added the memory range +to be mlocked to the task's "locked_vm". To account for filtered VMAs, mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the -callers then subtract a non-negative return value from the task's locked_vm. -A negative return value represent an error--for example, from get_user_pages() -attempting to fault in a vma with PROT_NONE access. In this case, we leave -the memory range accounted as locked_vm, as the protections could be changed -later and pages allocated into that region. +callers then subtract a non-negative return value from the task's locked_vm. A +negative return value represent an error - for example, from get_user_pages() +attempting to fault in a VMA with PROT_NONE access. In this case, we leave the +memory range accounted as locked_vm, as the protections could be changed later +and pages allocated into that region. -Mlocked Pages: munmap()/exit()/exec() System Call Handling +munmap()/exit()/exec() SYSTEM CALL HANDLING +------------------------------------------- When unmapping an mlocked region of memory, whether by an explicit call to munmap() or via an internal unmap from exit() or exec() processing, we must -munlock the pages if we're removing the last VM_LOCKED vma that maps the pages. +munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages. Before the unevictable/mlock changes, mlocking did not mark the pages in any way, so unmapping them required no processing. To munlock a range of memory under the unevictable/mlock infrastructure, the -munmap() hander and task address space tear down function call +munmap() handler and task address space call tear down function munlock_vma_pages_all(). The name reflects the observation that one always -specifies the entire vma range when munlock()ing during unmap of a region. -Because of the vma filtering when mlocking() regions, only "normal" vmas that +specifies the entire VMA range when munlock()ing during unmap of a region. +Because of the VMA filtering when mlocking() regions, only "normal" VMAs that actually contain mlocked pages will be passed to munlock_vma_pages_all(). -munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup() +munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup() for the munlock case, calls __munlock_vma_pages_range() to walk the page table -for the vma's memory range and munlock_vma_page() each resident page mapped by -the vma. This effectively munlocks the page, only if this is the last -VM_LOCKED vma that maps the page. - +for the VMA's memory range and munlock_vma_page() each resident page mapped by +the VMA. This effectively munlocks the page, only if this is the last +VM_LOCKED VMA that maps the page. -Mlocked Page: try_to_unmap() -[Note: the code changes represented by this section are really quite small -compared to the text to describe what happening and why, and to discuss the -implications.] +try_to_unmap() +-------------- -Pages can, of course, be mapped into multiple vmas. Some of these vmas may +Pages can, of course, be mapped into multiple VMAs. Some of these VMAs may have VM_LOCKED flag set. It is possible for a page mapped into one or more -VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one -of the active or inactive LRU lists. This could happen if, for example, a -task in the process of munlock()ing the page could not isolate the page from -the LRU. As a result, vmscan/shrink_page_list() might encounter such a page -as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To -handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED -vmas while it is walking a page's reverse map. +VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one +of the active or inactive LRU lists. This could happen if, for example, a task +in the process of munlocking the page could not isolate the page from the LRU. +As a result, vmscan/shrink_page_list() might encounter such a page as described +in section "vmscan's handling of unevictable pages". To handle this situation, +try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse +map. try_to_unmap() is always called, by either vmscan for reclaim or for page -migration, with the argument page locked and isolated from the LRU. BUG_ON() -assertions enforce this requirement. Separate functions handle anonymous and -mapped file pages, as these types of pages have different reverse map -mechanisms. - - try_to_unmap_anon() - -To unmap anonymous pages, each vma in the list anchored in the anon_vma must be -visited--at least until a VM_LOCKED vma is encountered. If the page is being -unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked -pages are migratable. However, for reclaim, if the page is mapped into a -VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap -semphore of the mm_struct to which the vma belongs in read mode. If this is -successful, try_to_unmap() will mlock the page via mlock_vma_page()--we -wouldn't have gotten to try_to_unmap() if the page were already mlocked--and -will return SWAP_MLOCK, indicating that the page is unevictable. If the -mmap semaphore cannot be acquired, we are not sure whether the page is really -unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN. - - try_to_unmap_file() -- linear mappings - -Unmapping of a mapped file page works the same, except that the scan visits -all vmas that maps the page's index/page offset in the page's mapping's -reverse map priority search tree. It must also visit each vma in the page's -mapping's non-linear list, if the list is non-empty. As for anonymous pages, -on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will -attempt to acquire the associated mm_struct's mmap semaphore to mlock the page, -returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not. - - try_to_unmap_file() -- non-linear mappings - -If a page's mapping contains a non-empty non-linear mapping vma list, then -try_to_un{map|lock}() must also visit each vma in that list to determine -whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit -all vmas in the non-linear list to ensure that the pages is not/should not be -mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate. -However, there is no easy way to determine whether the page is actually mapped -in a given vma--either for unmapping or testing whether the VM_LOCKED vma -actually pins the page. - -So, try_to_unmap_file() handles non-linear mappings by scanning a certain -number of pages--a "cluster"--in each non-linear vma associated with the page's -mapping, for each file mapped page that vmscan tries to unmap. If this happens -to unmap the page we're trying to unmap, try_to_unmap() will notice this on -return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it -will return SWAP_AGAIN, causing vmscan to recirculate this page. We take -advantage of the cluster scan in try_to_unmap_cluster() as follows: - -For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap -semaphore of the associated mm_struct for read without blocking. If this -attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will -retain the mmap semaphore for the scan; otherwise it drops it here. Then, -for each page in the cluster, if we're holding the mmap semaphore for a locked -vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This -call is a no-op if the page is already locked, but will mlock any pages in -the non-linear mapping that happen to be unlocked. If one of the pages so -mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will -return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan -to cull the page, rather than recirculating it on the inactive list. Again, -if try_to_unmap_cluster() cannot acquire the vma's mmap sem, it returns -SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but -couldn't be mlocked. - - -Mlocked pages: try_to_munlock() Reverse Map Scan - -TODO/FIXME: a better name might be page_mlocked()--analogous to the -page_referenced() reverse map walker. - -When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() -System Call Handling" above--tries to munlock a page, it needs to -determine whether or not the page is mapped by any VM_LOCKED vma, without -actually attempting to unmap all ptes from the page. For this purpose, the -unevictable/mlock infrastructure introduced a variant of try_to_unmap() called -try_to_munlock(). +migration, with the argument page locked and isolated from the LRU. Separate +functions handle anonymous and mapped file pages, as these types of pages have +different reverse map mechanisms. + + (*) try_to_unmap_anon() + + To unmap anonymous pages, each VMA in the list anchored in the anon_vma + must be visited - at least until a VM_LOCKED VMA is encountered. If the + page is being unmapped for migration, VM_LOCKED VMAs do not stop the + process because mlocked pages are migratable. However, for reclaim, if + the page is mapped into a VM_LOCKED VMA, the scan stops. + + try_to_unmap_anon() attempts to acquire in read mode the mmap semphore of + the mm_struct to which the VMA belongs. If this is successful, it will + mlock the page via mlock_vma_page() - we wouldn't have gotten to + try_to_unmap_anon() if the page were already mlocked - and will return + SWAP_MLOCK, indicating that the page is unevictable. + + If the mmap semaphore cannot be acquired, we are not sure whether the page + is really unevictable or not. In this case, try_to_unmap_anon() will + return SWAP_AGAIN. + + (*) try_to_unmap_file() - linear mappings + + Unmapping of a mapped file page works the same as for anonymous mappings, + except that the scan visits all VMAs that map the page's index/page offset + in the page's mapping's reverse map priority search tree. It also visits + each VMA in the page's mapping's non-linear list, if the list is + non-empty. + + As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file + page, try_to_unmap_file() will attempt to acquire the associated + mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this + is successful, and SWAP_AGAIN, if not. + + (*) try_to_unmap_file() - non-linear mappings + + If a page's mapping contains a non-empty non-linear mapping VMA list, then + try_to_un{map|lock}() must also visit each VMA in that list to determine + whether the page is mapped in a VM_LOCKED VMA. Again, the scan must visit + all VMAs in the non-linear list to ensure that the pages is not/should not + be mlocked. + + If a VM_LOCKED VMA is found in the list, the scan could terminate. + However, there is no easy way to determine whether the page is actually + mapped in a given VMA - either for unmapping or testing whether the + VM_LOCKED VMA actually pins the page. + + try_to_unmap_file() handles non-linear mappings by scanning a certain + number of pages - a "cluster" - in each non-linear VMA associated with the + page's mapping, for each file mapped page that vmscan tries to unmap. If + this happens to unmap the page we're trying to unmap, try_to_unmap() will + notice this on return (page_mapcount(page) will be 0) and return + SWAP_SUCCESS. Otherwise, it will return SWAP_AGAIN, causing vmscan to + recirculate this page. We take advantage of the cluster scan in + try_to_unmap_cluster() as follows: + + For each non-linear VMA, try_to_unmap_cluster() attempts to acquire the + mmap semaphore of the associated mm_struct for read without blocking. + + If this attempt is successful and the VMA is VM_LOCKED, + try_to_unmap_cluster() will retain the mmap semaphore for the scan; + otherwise it drops it here. + + Then, for each page in the cluster, if we're holding the mmap semaphore + for a locked VMA, try_to_unmap_cluster() calls mlock_vma_page() to + mlock the page. This call is a no-op if the page is already locked, + but will mlock any pages in the non-linear mapping that happen to be + unlocked. + + If one of the pages so mlocked is the page passed in to try_to_unmap(), + try_to_unmap_cluster() will return SWAP_MLOCK, rather than the default + SWAP_AGAIN. This will allow vmscan to cull the page, rather than + recirculating it on the inactive list. + + Again, if try_to_unmap_cluster() cannot acquire the VMA's mmap sem, it + returns SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED + VMA, but couldn't be mlocked. + + +try_to_munlock() REVERSE MAP SCAN +--------------------------------- + + [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the + page_referenced() reverse map walker. + +When munlock_vma_page() [see section "munlock()/munlockall() System Call +Handling" above] tries to munlock a page, it needs to determine whether or not +the page is mapped by any VM_LOCKED VMA without actually attempting to unmap +all PTEs from the page. For this purpose, the unevictable/mlock infrastructure +introduced a variant of try_to_unmap() called try_to_munlock(). try_to_munlock() calls the same functions as try_to_unmap() for anonymous and mapped file pages with an additional argument specifing unlock versus unmap processing. Again, these functions walk the respective reverse maps looking -for VM_LOCKED vmas. When such a vma is found for anonymous pages and file +for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file pages mapped in linear VMAs, as in the try_to_unmap() case, the functions attempt to acquire the associated mmap semphore, mlock the page via mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page. -If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap -semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() -to recycle the page on the inactive list and hope that it has better luck -with the page next time. - -For file pages mapped into non-linear vmas, the try_to_munlock() logic works -slightly differently. On encountering a VM_LOCKED non-linear vma that might -map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking -the page. munlock_vma_page() will just leave the page unlocked and let -vmscan deal with it--the usual fallback position. - -Note that try_to_munlock()'s reverse map walk must visit every vma in a pages' -reverse map to determine that a page is NOT mapped into any VM_LOCKED vma. -However, the scan can terminate when it encounters a VM_LOCKED vma and can -successfully acquire the vma's mmap semphore for read and mlock the page. -Although try_to_munlock() can be called many [very many!] times when -munlock()ing a large region or tearing down a large address space that has been -mlocked via mlockall(), overall this is a fairly rare event. - -Mlocked Page: Page Reclaim in shrink_*_list() - -shrink_active_list() culls any obviously unevictable pages--i.e., -!page_evictable(page, NULL)--diverting these to the unevictable lru -list. However, shrink_active_list() only sees unevictable pages that -made it onto the active/inactive lru lists. Note that these pages do not -have PageUnevictable set--otherwise, they would be on the unevictable list and -shrink_active_list would never see them. +If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap +semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to +recycle the page on the inactive list and hope that it has better luck with the +page next time. + +For file pages mapped into non-linear VMAs, the try_to_munlock() logic works +slightly differently. On encountering a VM_LOCKED non-linear VMA that might +map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking the +page. munlock_vma_page() will just leave the page unlocked and let vmscan deal +with it - the usual fallback position. + +Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's +reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA. +However, the scan can terminate when it encounters a VM_LOCKED VMA and can +successfully acquire the VMA's mmap semphore for read and mlock the page. +Although try_to_munlock() might be called a great many times when munlocking a +large region or tearing down a large address space that has been mlocked via +mlockall(), overall this is a fairly rare event. + + +PAGE RECLAIM IN shrink_*_list() +------------------------------- + +shrink_active_list() culls any obviously unevictable pages - i.e. +!page_evictable(page, NULL) - diverting these to the unevictable list. +However, shrink_active_list() only sees unevictable pages that made it onto the +active/inactive lru lists. Note that these pages do not have PageUnevictable +set - otherwise they would be on the unevictable list and shrink_active_list +would never see them. Some examples of these unevictable pages on the LRU lists are: -1) ramfs pages that have been placed on the lru lists when first allocated. + (1) ramfs pages that have been placed on the LRU lists when first allocated. + + (2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to + allocate or fault in the pages in the shared memory region. This happens + when an application accesses the page the first time after SHM_LOCK'ing + the segment. -2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to - allocate or fault in the pages in the shared memory region. This happens - when an application accesses the page the first time after SHM_LOCKing - the segment. + (3) mlocked pages that could not be isolated from the LRU and moved to the + unevictable list in mlock_vma_page(). -3) Mlocked pages that could not be isolated from the lru and moved to the - unevictable list in mlock_vma_page(). + (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't + acquire the VMA's mmap semaphore to test the flags and set PageMlocked. + munlock_vma_page() was forced to let the page back on to the normal LRU + list for vmscan to handle. -3) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't - acquire the vma's mmap semaphore to test the flags and set PageMlocked. - munlock_vma_page() was forced to let the page back on to the normal - LRU list for vmscan to handle. +shrink_inactive_list() also diverts any unevictable pages that it finds on the +inactive lists to the appropriate zone's unevictable list. -shrink_inactive_list() also culls any unevictable pages that it finds on -the inactive lists, again diverting them to the appropriate zone's unevictable -lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became -SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or -pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from -the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice -the latter, but will pass on to shrink_page_list(). +shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd +after shrink_active_list() had moved them to the inactive list, or pages mapped +into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to +recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter, +but will pass on to shrink_page_list(). shrink_page_list() again culls obviously unevictable pages that it could encounter for similar reason to shrink_inactive_list(). Pages mapped into -VM_LOCKED vmas but without PG_mlocked set will make it all the way to +VM_LOCKED VMAs but without PG_mlocked set will make it all the way to try_to_unmap(). shrink_page_list() will divert them to the unevictable list when try_to_unmap() returns SWAP_MLOCK, as discussed above. -- cgit v1.1 From a55ce6dc705c9ed0bb0d4f629dbcaf3b3ced5172 Mon Sep 17 00:00:00 2001 From: Michael Ellerman Date: Mon, 13 Apr 2009 14:40:09 -0700 Subject: mm: add documentation describing what tsk->active_mm means vs tsk->mm I'm sure everyone knows this, but I didn't, so I googled it, and found a nice explanation from Linus. Might be worth sticking in Documentation. Signed-off-by: Michael Ellerman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/00-INDEX | 2 + Documentation/vm/active_mm.txt | 83 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 85 insertions(+) create mode 100644 Documentation/vm/active_mm.txt (limited to 'Documentation') diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX index 2131b00..2f77ced 100644 --- a/Documentation/vm/00-INDEX +++ b/Documentation/vm/00-INDEX @@ -1,5 +1,7 @@ 00-INDEX - this file. +active_mm.txt + - An explanation from Linus about tsk->active_mm vs tsk->mm. balance - various information on memory balancing. hugetlbpage.txt diff --git a/Documentation/vm/active_mm.txt b/Documentation/vm/active_mm.txt new file mode 100644 index 0000000..4ee1f64 --- /dev/null +++ b/Documentation/vm/active_mm.txt @@ -0,0 +1,83 @@ +List: linux-kernel +Subject: Re: active_mm +From: Linus Torvalds +Date: 1999-07-30 21:36:24 + +Cc'd to linux-kernel, because I don't write explanations all that often, +and when I do I feel better about more people reading them. + +On Fri, 30 Jul 1999, David Mosberger wrote: +> +> Is there a brief description someplace on how "mm" vs. "active_mm" in +> the task_struct are supposed to be used? (My apologies if this was +> discussed on the mailing lists---I just returned from vacation and +> wasn't able to follow linux-kernel for a while). + +Basically, the new setup is: + + - we have "real address spaces" and "anonymous address spaces". The + difference is that an anonymous address space doesn't care about the + user-level page tables at all, so when we do a context switch into an + anonymous address space we just leave the previous address space + active. + + The obvious use for a "anonymous address space" is any thread that + doesn't need any user mappings - all kernel threads basically fall into + this category, but even "real" threads can temporarily say that for + some amount of time they are not going to be interested in user space, + and that the scheduler might as well try to avoid wasting time on + switching the VM state around. Currently only the old-style bdflush + sync does that. + + - "tsk->mm" points to the "real address space". For an anonymous process, + tsk->mm will be NULL, for the logical reason that an anonymous process + really doesn't _have_ a real address space at all. + + - however, we obviously need to keep track of which address space we + "stole" for such an anonymous user. For that, we have "tsk->active_mm", + which shows what the currently active address space is. + + The rule is that for a process with a real address space (ie tsk->mm is + non-NULL) the active_mm obviously always has to be the same as the real + one. + + For a anonymous process, tsk->mm == NULL, and tsk->active_mm is the + "borrowed" mm while the anonymous process is running. When the + anonymous process gets scheduled away, the borrowed address space is + returned and cleared. + +To support all that, the "struct mm_struct" now has two counters: a +"mm_users" counter that is how many "real address space users" there are, +and a "mm_count" counter that is the number of "lazy" users (ie anonymous +users) plus one if there are any real users. + +Usually there is at least one real user, but it could be that the real +user exited on another CPU while a lazy user was still active, so you do +actually get cases where you have a address space that is _only_ used by +lazy users. That is often a short-lived state, because once that thread +gets scheduled away in favour of a real thread, the "zombie" mm gets +released because "mm_users" becomes zero. + +Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any +more. "init_mm" should be considered just a "lazy context when no other +context is available", and in fact it is mainly used just at bootup when +no real VM has yet been created. So code that used to check + + if (current->mm == &init_mm) + +should generally just do + + if (!current->mm) + +instead (which makes more sense anyway - the test is basically one of "do +we have a user context", and is generally done by the page fault handler +and things like that). + +Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago, +because it slightly changes the interfaces to accomodate the alpha (who +would have thought it, but the alpha actually ends up having one of the +ugliest context switch codes - unlike the other architectures where the MM +and register state is separate, the alpha PALcode joins the two, and you +need to switch both together). + +(From http://marc.info/?l=linux-kernel&m=93337278602211&w=2) -- cgit v1.1 From c863d835b7cd9a3c08a941d4ae59b8faefa31422 Mon Sep 17 00:00:00 2001 From: Bharata B Rao Date: Mon, 13 Apr 2009 14:40:15 -0700 Subject: memcg: fix documentation The description about various statistics from memory.stat is not accurate and confusing at times. Correct this along with a few other minor cleanups. Signed-off-by: Bharata B Rao Acked-by: Balbir Singh Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 55 +++++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 23 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index a98a7fe..1a60887 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -6,15 +6,14 @@ used here with the memory controller that is used in hardware. Salient features -a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages +a. Enable control of Anonymous, Page Cache (mapped and unmapped) and + Swap Cache memory pages. b. The infrastructure allows easy addition of other types of memory to control c. Provides *zero overhead* for non memory controller users d. Provides a double LRU: global memory pressure causes reclaim from the global LRU; a cgroup on hitting a limit, reclaims from the per cgroup LRU -NOTE: Swap Cache (unmapped) is not accounted now. - Benefits and Purpose of the memory controller The memory controller isolates the memory behaviour of a group of tasks @@ -290,34 +289,44 @@ will be charged as a new owner of it. moved to the parent. If you want to avoid that, force_empty will be useful. 5.2 stat file - memory.stat file includes following statistics (now) - cache - # of pages from page-cache and shmem. - rss - # of pages from anonymous memory. - pgpgin - # of event of charging - pgpgout - # of event of uncharging - active_anon - # of pages on active lru of anon, shmem. - inactive_anon - # of pages on active lru of anon, shmem - active_file - # of pages on active lru of file-cache - inactive_file - # of pages on inactive lru of file cache - unevictable - # of pages cannot be reclaimed.(mlocked etc) - - Below is depend on CONFIG_DEBUG_VM. - inactive_ratio - VM internal parameter. (see mm/page_alloc.c) - recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) - recent_rotated_file - VM internal parameter. (see mm/vmscan.c) - recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) - recent_scanned_file - VM internal parameter. (see mm/vmscan.c) - - Memo: + +memory.stat file includes following statistics + +cache - # of bytes of page cache memory. +rss - # of bytes of anonymous and swap cache memory. +pgpgin - # of pages paged in (equivalent to # of charging events). +pgpgout - # of pages paged out (equivalent to # of uncharging events). +active_anon - # of bytes of anonymous and swap cache memory on active + lru list. +inactive_anon - # of bytes of anonymous memory and swap cache memory on + inactive lru list. +active_file - # of bytes of file-backed memory on active lru list. +inactive_file - # of bytes of file-backed memory on inactive lru list. +unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc). + +The following additional stats are dependent on CONFIG_DEBUG_VM. + +inactive_ratio - VM internal parameter. (see mm/page_alloc.c) +recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) +recent_rotated_file - VM internal parameter. (see mm/vmscan.c) +recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) +recent_scanned_file - VM internal parameter. (see mm/vmscan.c) + +Memo: recent_rotated means recent frequency of lru rotation. recent_scanned means recent # of scans to lru. showing for better debug please see the code for meanings. +Note: + Only anonymous and swap cache memory is listed as part of 'rss' stat. + This should not be confused with the true 'resident set size' or the + amount of physical memory used by the cgroup. Per-cgroup rss + accounting is not done yet. 5.3 swappiness Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. - Following cgroup's swapiness can't be changed. + Following cgroups' swapiness can't be changed. - root cgroup (uses /proc/sys/vm/swappiness). - a cgroup which uses hierarchy and it has child cgroup. - a cgroup which uses hierarchy and not the root of hierarchy. -- cgit v1.1 From bbdba2737443ae7b530a453d8152f2068ca4cf56 Mon Sep 17 00:00:00 2001 From: Shen Feng Date: Mon, 13 Apr 2009 14:40:16 -0700 Subject: doc: use correct debugfs mountpoint Use the default mountpoint of debugfs in the pktcdvd ABI. Signed-off-by: Shen Feng Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/debugfs-pktcdvd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/debugfs-pktcdvd b/Documentation/ABI/testing/debugfs-pktcdvd index bf9c16b..cf11736 100644 --- a/Documentation/ABI/testing/debugfs-pktcdvd +++ b/Documentation/ABI/testing/debugfs-pktcdvd @@ -1,4 +1,4 @@ -What: /debug/pktcdvd/pktcdvd[0-7] +What: /sys/kernel/debug/pktcdvd/pktcdvd[0-7] Date: Oct. 2006 KernelVersion: 2.6.20 Contact: Thomas Maier @@ -10,10 +10,10 @@ debugfs interface The pktcdvd module (packet writing driver) creates these files in debugfs: -/debug/pktcdvd/pktcdvd[0-7]/ +/sys/kernel/debug/pktcdvd/pktcdvd[0-7]/ info (0444) Lots of driver statistics and infos. Example: ------- -cat /debug/pktcdvd/pktcdvd0/info +cat /sys/kernel/debug/pktcdvd/pktcdvd0/info -- cgit v1.1 From 17a7b7b39056a82c5012539311850f202e6c3cd4 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Mon, 13 Apr 2009 11:04:19 +0900 Subject: tomoyo: add Documentation/tomoyo.txt Signed-off-by: Kentaro Takeda Signed-off-by: Tetsuo Handa Signed-off-by: Toshiharu Harada Signed-off-by: James Morris --- Documentation/tomoyo.txt | 55 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+) create mode 100644 Documentation/tomoyo.txt (limited to 'Documentation') diff --git a/Documentation/tomoyo.txt b/Documentation/tomoyo.txt new file mode 100644 index 0000000..b3a232c --- /dev/null +++ b/Documentation/tomoyo.txt @@ -0,0 +1,55 @@ +--- What is TOMOYO? --- + +TOMOYO is a name-based MAC extension (LSM module) for the Linux kernel. + +LiveCD-based tutorials are available at +http://tomoyo.sourceforge.jp/en/1.6.x/1st-step/ubuntu8.04-live/ +http://tomoyo.sourceforge.jp/en/1.6.x/1st-step/centos5-live/ . +Though these tutorials use non-LSM version of TOMOYO, they are useful for you +to know what TOMOYO is. + +--- How to enable TOMOYO? --- + +Build the kernel with CONFIG_SECURITY_TOMOYO=y and pass "security=tomoyo" on +kernel's command line. + +Please see http://tomoyo.sourceforge.jp/en/2.2.x/ for details. + +--- Where is documentation? --- + +User <-> Kernel interface documentation is available at +http://tomoyo.sourceforge.jp/en/2.2.x/policy-reference.html . + +Materials we prepared for seminars and symposiums are available at +http://sourceforge.jp/projects/tomoyo/docs/?category_id=532&language_id=1 . +Below lists are chosen from three aspects. + +What is TOMOYO? + TOMOYO Linux Overview + http://sourceforge.jp/projects/tomoyo/docs/lca2009-takeda.pdf + TOMOYO Linux: pragmatic and manageable security for Linux + http://sourceforge.jp/projects/tomoyo/docs/freedomhectaipei-tomoyo.pdf + TOMOYO Linux: A Practical Method to Understand and Protect Your Own Linux Box + http://sourceforge.jp/projects/tomoyo/docs/PacSec2007-en-no-demo.pdf + +What can TOMOYO do? + Deep inside TOMOYO Linux + http://sourceforge.jp/projects/tomoyo/docs/lca2009-kumaneko.pdf + The role of "pathname based access control" in security. + http://sourceforge.jp/projects/tomoyo/docs/lfj2008-bof.pdf + +History of TOMOYO? + Realities of Mainlining + http://sourceforge.jp/projects/tomoyo/docs/lfj2008.pdf + +--- What is future plan? --- + +We believe that inode based security and name based security are complementary +and both should be used together. But unfortunately, so far, we cannot enable +multiple LSM modules at the same time. We feel sorry that you have to give up +SELinux/SMACK/AppArmor etc. when you want to use TOMOYO. + +We hope that LSM becomes stackable in future. Meanwhile, you can use non-LSM +version of TOMOYO, available at http://tomoyo.sourceforge.jp/en/1.6.x/ . +LSM version of TOMOYO is a subset of non-LSM version of TOMOYO. We are planning +to port non-LSM version's functionalities to LSM versions. -- cgit v1.1 From 6cececfcece2b072d29886ed7140495f3af17153 Mon Sep 17 00:00:00 2001 From: Jaswinder Singh Rajput Date: Tue, 14 Apr 2009 14:03:43 +0530 Subject: x86, documentation: kernel-parameters replace X86-32,X86-64 with X86 X86 is same as X86-32+X86-64 so replace X86-32,X86-64 with X86. Signed-off-by: Jaswinder Singh Rajput LKML-Reference: <1239698023.3033.37.camel@ht.satnam> Signed-off-by: Ingo Molnar --- Documentation/kernel-parameters.txt | 40 ++++++++++++++++++------------------- 1 file changed, 20 insertions(+), 20 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 6172e43..a19f021 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -134,7 +134,7 @@ and is between 256 and 4096 characters. It is defined in the file ./include/asm/setup.h as COMMAND_LINE_SIZE. - acpi= [HW,ACPI,X86-64,i386] + acpi= [HW,ACPI,X86] Advanced Configuration and Power Interface Format: { force | off | ht | strict | noirq | rsdt } force -- enable ACPI if default was off @@ -218,7 +218,7 @@ and is between 256 and 4096 characters. It is defined in the file acpi_osi="!string2" # remove built-in string2 acpi_osi= # disable all strings - acpi_pm_good [X86-32,X86-64] + acpi_pm_good [X86] Override the pmtimer bug detection: force the kernel to assume that this machine's pmtimer latches its value and always returns good values. @@ -459,7 +459,7 @@ and is between 256 and 4096 characters. It is defined in the file Also note the kernel might malfunction if you disable some critical bits. - code_bytes [IA32/X86_64] How many bytes of object code to print + code_bytes [X86] How many bytes of object code to print in an oops report. Range: 0 - 8192 Default: 64 @@ -592,7 +592,7 @@ and is between 256 and 4096 characters. It is defined in the file MTRR settings. This parameter disables that behavior, possibly causing your machine to run very slowly. - disable_timer_pin_1 [i386,x86-64] + disable_timer_pin_1 [X86] Disable PIN 1 of APIC timer Can be useful to work around chipset bugs. @@ -624,7 +624,7 @@ and is between 256 and 4096 characters. It is defined in the file UART at the specified I/O port or MMIO address. The options are the same as for ttyS, above. - earlyprintk= [X86-32,X86-64,SH,BLACKFIN] + earlyprintk= [X86,SH,BLACKFIN] earlyprintk=vga earlyprintk=serial[,ttySn[,baudrate]] earlyprintk=dbgp @@ -659,7 +659,7 @@ and is between 256 and 4096 characters. It is defined in the file See Documentation/block/as-iosched.txt and Documentation/block/deadline-iosched.txt for details. - elfcorehdr= [IA64,PPC,SH,X86-32,X86_64] + elfcorehdr= [IA64,PPC,SH,X86] Specifies physical address of start of kernel core image elf header. Generally kexec loader will pass this option to capture kernel. @@ -938,7 +938,7 @@ and is between 256 and 4096 characters. It is defined in the file See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c. - io_delay= [X86-32,X86-64] I/O delay method + io_delay= [X86] I/O delay method 0x80 Standard port 0x80 based delay 0xed @@ -1000,7 +1000,7 @@ and is between 256 and 4096 characters. It is defined in the file keepinitrd [HW,ARM] - kernelcore=nn[KMG] [KNL,X86-32,IA-64,PPC,X86-64] This parameter + kernelcore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter specifies the amount of memory usable by the kernel for non-movable allocations. The requested amount is spread evenly throughout all nodes in the system. The @@ -1034,7 +1034,7 @@ and is between 256 and 4096 characters. It is defined in the file Configure the RouterBoard 532 series on-chip Ethernet adapter MAC address. - kstack=N [X86-32,X86-64] Print N words from the kernel stack + kstack=N [X86] Print N words from the kernel stack in oops dumps. l2cr= [PPC] @@ -1044,7 +1044,7 @@ and is between 256 and 4096 characters. It is defined in the file lapic [X86-32,APIC] Enable the local APIC even if BIOS disabled it. - lapic_timer_c2_ok [X86-32,x86-64,APIC] trust the local apic timer + lapic_timer_c2_ok [X86,APIC] trust the local apic timer in C2 power state. libata.dma= [LIBATA] DMA control @@ -1229,7 +1229,7 @@ and is between 256 and 4096 characters. It is defined in the file [KNL,SH] Allow user to override the default size for per-device physically contiguous DMA buffers. - memmap=exactmap [KNL,X86-32,X86_64] Enable setting of an exact + memmap=exactmap [KNL,X86] Enable setting of an exact E820 memory map, as specified by the user. Such memmap=exactmap lines can be constructed based on BIOS output or other requirements. See the memmap=nn@ss @@ -1320,7 +1320,7 @@ and is between 256 and 4096 characters. It is defined in the file mousedev.yres= [MOUSE] Vertical screen resolution, used for devices reporting absolute coordinates, such as tablets - movablecore=nn[KMG] [KNL,X86-32,IA-64,PPC,X86-64] This parameter + movablecore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter is similar to kernelcore except it specifies the amount of memory used for migratable allocations. If both kernelcore and movablecore is specified, @@ -1422,7 +1422,7 @@ and is between 256 and 4096 characters. It is defined in the file when a NMI is triggered. Format: [state][,regs][,debounce][,die] - nmi_watchdog= [KNL,BUGS=X86-32,X86-64] Debugging features for SMP kernels + nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels Format: [panic,][num] Valid num: 0,1,2 0 - turn nmi_watchdog off @@ -1475,11 +1475,11 @@ and is between 256 and 4096 characters. It is defined in the file nodsp [SH] Disable hardware DSP at boot time. - noefi [X86-32,X86-64] Disable EFI runtime services support. + noefi [X86] Disable EFI runtime services support. noexec [IA-64] - noexec [X86-32,X86-64] + noexec [X86] On X86-32 available only on PAE configured kernels. noexec=on: enable non-executable mappings (default) noexec=off: disable non-executable mappings @@ -1525,7 +1525,7 @@ and is between 256 and 4096 characters. It is defined in the file noirqdebug [X86-32] Disables the code which attempts to detect and disable unhandled interrupt sources. - no_timer_check [X86-32,X86_64,APIC] Disables the code which tests for + no_timer_check [X86,APIC] Disables the code which tests for broken timer IRQ sources. noisapnp [ISAPNP] Disables ISA PnP code. @@ -1689,7 +1689,7 @@ and is between 256 and 4096 characters. It is defined in the file disable the use of PCIE advanced error reporting. nodomains [PCI] Disable support for multiple PCI root domains (aka PCI segments, in ACPI-speak). - nommconf [X86-32,X86_64] Disable use of MMCONFIG for PCI + nommconf [X86] Disable use of MMCONFIG for PCI Configuration nomsi [MSI] If the PCI_MSI kernel config parameter is enabled, this kernel boot option can be used to @@ -2380,7 +2380,7 @@ and is between 256 and 4096 characters. It is defined in the file reported either. unknown_nmi_panic - [X86-32,X86-64] + [X86] Set unknown_nmi_panic=1 early on boot. usbcore.autosuspend= @@ -2447,12 +2447,12 @@ and is between 256 and 4096 characters. It is defined in the file medium is write-protected). Example: quirks=0419:aaf5:rl,0421:0433:rc - vdso= [X86-32,SH,x86-64] + vdso= [X86,SH] vdso=2: enable compat VDSO (default with COMPAT_VDSO) vdso=1: enable VDSO (default) vdso=0: disable VDSO mapping - vdso32= [X86-32,X86-64] + vdso32= [X86] vdso32=2: enable compat VDSO (default with COMPAT_VDSO) vdso32=1: enable 32-bit VDSO (default) vdso32=0: disable 32-bit VDSO mapping -- cgit v1.1 From 329007ce25d56fc7113df7b4828d607806d8bc21 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Wed, 8 Apr 2009 11:38:50 +0200 Subject: block: update biodoc.txt on plugging We do per-device plugging, get rid of any references to tq_disk as that has been dead since 2.6.5 or so. Signed-off-by: Jens Axboe --- Documentation/block/biodoc.txt | 19 ++++++------------- 1 file changed, 6 insertions(+), 13 deletions(-) (limited to 'Documentation') diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index ecad6ee..6fab97e 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -1040,23 +1040,21 @@ Front merges are handled by the binary trees in AS and deadline schedulers. iii. Plugging the queue to batch requests in anticipation of opportunities for merge/sort optimizations -This is just the same as in 2.4 so far, though per-device unplugging -support is anticipated for 2.5. Also with a priority-based i/o scheduler, -such decisions could be based on request priorities. - Plugging is an approach that the current i/o scheduling algorithm resorts to so that it collects up enough requests in the queue to be able to take advantage of the sorting/merging logic in the elevator. If the queue is empty when a request comes in, then it plugs the request queue -(sort of like plugging the bottom of a vessel to get fluid to build up) +(sort of like plugging the bath tub of a vessel to get fluid to build up) till it fills up with a few more requests, before starting to service the requests. This provides an opportunity to merge/sort the requests before passing them down to the device. There are various conditions when the queue is unplugged (to open up the flow again), either through a scheduled task or could be on demand. For example wait_on_buffer sets the unplugging going -(by running tq_disk) so the read gets satisfied soon. So in the read case, -the queue gets explicitly unplugged as part of waiting for completion, -in fact all queues get unplugged as a side-effect. +through sync_buffer() running blk_run_address_space(mapping). Or the caller +can do it explicity through blk_unplug(bdev). So in the read case, +the queue gets explicitly unplugged as part of waiting for completion on that +buffer. For page driven IO, the address space ->sync_page() takes care of +doing the blk_run_address_space(). Aside: This is kind of controversial territory, as it's not clear if plugging is @@ -1067,11 +1065,6 @@ Aside: multi-page bios being queued in one shot, we may not need to wait to merge a big request from the broken up pieces coming by. - Per-queue granularity unplugging (still a Todo) may help reduce some of the - concerns with just a single tq_disk flush approach. Something like - blk_kick_queue() to unplug a specific queue (right away ?) - or optionally, all queues, is in the plan. - 4.4 I/O contexts I/O contexts provide a dynamically allocated per process data area. They may be used in I/O schedulers, and in the block layer (could be used for IO statis, -- cgit v1.1 From 83b2086ce2a1458168dc8b9d624060b2d7a82d4c Mon Sep 17 00:00:00 2001 From: Justin Mattock Date: Tue, 14 Apr 2009 14:31:21 -0700 Subject: ALSA: add missing definitions(letters) to HD-Audio.txt impact: Add missing definitions(letters). Signed-off-by: Justin P. Mattock Signed-off-by: Takashi Iwai --- Documentation/sound/alsa/HD-Audio.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sound/alsa/HD-Audio.txt b/Documentation/sound/alsa/HD-Audio.txt index c5948f2..88b7433 100644 --- a/Documentation/sound/alsa/HD-Audio.txt +++ b/Documentation/sound/alsa/HD-Audio.txt @@ -169,7 +169,7 @@ PCI SSID look-up. What `model` option values are available depends on the codec chip. Check your codec chip from the codec proc file (see "Codec Proc-File" section below). It will show the vendor/product name of your codec -chip. Then, see Documentation/sound/alsa/HD-Audio-Modelstxt file, +chip. Then, see Documentation/sound/alsa/HD-Audio-Models.txt file, the section of HD-audio driver. You can find a list of codecs and `model` options belonging to each codec. For example, for Realtek ALC262 codec chip, pass `model=ultra` for devices that are compatible @@ -177,7 +177,7 @@ with Samsung Q1 Ultra. Thus, the first thing you can do for any brand-new, unsupported and non-working HD-audio hardware is to check HD-audio codec and several -different `model` option values. If you have a luck, some of them +different `model` option values. If you have any luck, some of them might suit with your device well. Some codecs such as ALC880 have a special model option `model=test`. -- cgit v1.1 From efcc2da3fd148c9acb7d7cf1d9800e0649f950fc Mon Sep 17 00:00:00 2001 From: Stefan Roese Date: Thu, 16 Apr 2009 15:11:54 -0600 Subject: powerpc/of-device-tree: Factor MTD physmap bindings out of booting-without-of It's easier to find bindings descriptions in separate files. So factor out the MTD physmap bindings into a separate file to not clutter booting-without-of.txt more. Signed-off-by: Stefan Roese Signed-off-by: Grant Likely --- Documentation/powerpc/booting-without-of.txt | 89 +++------------------- Documentation/powerpc/dts-bindings/mtd-physmap.txt | 63 +++++++++++++++ 2 files changed, 75 insertions(+), 77 deletions(-) create mode 100644 Documentation/powerpc/dts-bindings/mtd-physmap.txt (limited to 'Documentation') diff --git a/Documentation/powerpc/booting-without-of.txt b/Documentation/powerpc/booting-without-of.txt index 0ab0230..d16b7a1 100644 --- a/Documentation/powerpc/booting-without-of.txt +++ b/Documentation/powerpc/booting-without-of.txt @@ -43,12 +43,11 @@ Table of Contents 2) Representing devices without a current OF specification a) PHY nodes b) Interrupt controllers - c) CFI or JEDEC memory-mapped NOR flash - d) 4xx/Axon EMAC ethernet nodes - e) Xilinx IP cores - f) USB EHCI controllers - g) MDIO on GPIOs - h) SPI busses + c) 4xx/Axon EMAC ethernet nodes + d) Xilinx IP cores + e) USB EHCI controllers + f) MDIO on GPIOs + g) SPI busses VII - Marvell Discovery mv64[345]6x System Controller chips 1) The /system-controller node @@ -999,7 +998,7 @@ compatibility. translation of SOC addresses for memory mapped SOC registers. - bus-frequency: Contains the bus frequency for the SOC node. Typically, the value of this field is filled in by the boot - loader. + loader. Recommended properties: @@ -1287,71 +1286,7 @@ platforms are moved over to use the flattened-device-tree model. device_type = "open-pic"; }; - c) CFI or JEDEC memory-mapped NOR flash - - Flash chips (Memory Technology Devices) are often used for solid state - file systems on embedded devices. - - - compatible : should contain the specific model of flash chip(s) - used, if known, followed by either "cfi-flash" or "jedec-flash" - - reg : Address range of the flash chip - - bank-width : Width (in bytes) of the flash bank. Equal to the - device width times the number of interleaved chips. - - device-width : (optional) Width of a single flash chip. If - omitted, assumed to be equal to 'bank-width'. - - #address-cells, #size-cells : Must be present if the flash has - sub-nodes representing partitions (see below). In this case - both #address-cells and #size-cells must be equal to 1. - - For JEDEC compatible devices, the following additional properties - are defined: - - - vendor-id : Contains the flash chip's vendor id (1 byte). - - device-id : Contains the flash chip's device id (1 byte). - - In addition to the information on the flash bank itself, the - device tree may optionally contain additional information - describing partitions of the flash address space. This can be - used on platforms which have strong conventions about which - portions of the flash are used for what purposes, but which don't - use an on-flash partition table such as RedBoot. - - Each partition is represented as a sub-node of the flash device. - Each node's name represents the name of the corresponding - partition of the flash device. - - Flash partitions - - reg : The partition's offset and size within the flash bank. - - label : (optional) The label / name for this flash partition. - If omitted, the label is taken from the node name (excluding - the unit address). - - read-only : (optional) This parameter, if present, is a hint to - Linux that this flash partition should only be mounted - read-only. This is usually used for flash partitions - containing early-boot firmware images or data which should not - be clobbered. - - Example: - - flash@ff000000 { - compatible = "amd,am29lv128ml", "cfi-flash"; - reg = ; - bank-width = <4>; - device-width = <1>; - #address-cells = <1>; - #size-cells = <1>; - fs@0 { - label = "fs"; - reg = <0 f80000>; - }; - firmware@f80000 { - label ="firmware"; - reg = ; - read-only; - }; - }; - - d) 4xx/Axon EMAC ethernet nodes + c) 4xx/Axon EMAC ethernet nodes The EMAC ethernet controller in IBM and AMCC 4xx chips, and also the Axon bridge. To operate this needs to interact with a ths @@ -1499,7 +1434,7 @@ platforms are moved over to use the flattened-device-tree model. available. For Axon: 0x0000012a - e) Xilinx IP cores + d) Xilinx IP cores The Xilinx EDK toolchain ships with a set of IP cores (devices) for use in Xilinx Spartan and Virtex FPGAs. The devices cover the whole range @@ -1761,7 +1696,7 @@ platforms are moved over to use the flattened-device-tree model. listed above, nodes for these devices should include a phy-handle property, and may include other common network device properties like local-mac-address. - + iv) Xilinx Uartlite Xilinx uartlite devices are simple fixed speed serial ports. @@ -1793,7 +1728,7 @@ platforms are moved over to use the flattened-device-tree model. - reg-offset : A value of 3 is required - reg-shift : A value of 2 is required - f) USB EHCI controllers + e) USB EHCI controllers Required properties: - compatible : should be "usb-ehci". @@ -1819,7 +1754,7 @@ platforms are moved over to use the flattened-device-tree model. big-endian; }; - g) MDIO on GPIOs + f) MDIO on GPIOs Currently defined compatibles: - virtual,gpio-mdio @@ -1839,7 +1774,7 @@ platforms are moved over to use the flattened-device-tree model. &qe_pio_c 6>; }; - h) SPI (Serial Peripheral Interface) busses + g) SPI (Serial Peripheral Interface) busses SPI busses can be described with a node for the SPI master device and a set of child nodes for each SPI slave on the bus. For this diff --git a/Documentation/powerpc/dts-bindings/mtd-physmap.txt b/Documentation/powerpc/dts-bindings/mtd-physmap.txt new file mode 100644 index 0000000..cd474f9 --- /dev/null +++ b/Documentation/powerpc/dts-bindings/mtd-physmap.txt @@ -0,0 +1,63 @@ +CFI or JEDEC memory-mapped NOR flash + +Flash chips (Memory Technology Devices) are often used for solid state +file systems on embedded devices. + + - compatible : should contain the specific model of flash chip(s) + used, if known, followed by either "cfi-flash" or "jedec-flash" + - reg : Address range of the flash chip + - bank-width : Width (in bytes) of the flash bank. Equal to the + device width times the number of interleaved chips. + - device-width : (optional) Width of a single flash chip. If + omitted, assumed to be equal to 'bank-width'. + - #address-cells, #size-cells : Must be present if the flash has + sub-nodes representing partitions (see below). In this case + both #address-cells and #size-cells must be equal to 1. + +For JEDEC compatible devices, the following additional properties +are defined: + + - vendor-id : Contains the flash chip's vendor id (1 byte). + - device-id : Contains the flash chip's device id (1 byte). + +In addition to the information on the flash bank itself, the +device tree may optionally contain additional information +describing partitions of the flash address space. This can be +used on platforms which have strong conventions about which +portions of the flash are used for what purposes, but which don't +use an on-flash partition table such as RedBoot. + +Each partition is represented as a sub-node of the flash device. +Each node's name represents the name of the corresponding +partition of the flash device. + +Flash partitions + - reg : The partition's offset and size within the flash bank. + - label : (optional) The label / name for this flash partition. + If omitted, the label is taken from the node name (excluding + the unit address). + - read-only : (optional) This parameter, if present, is a hint to + Linux that this flash partition should only be mounted + read-only. This is usually used for flash partitions + containing early-boot firmware images or data which should not + be clobbered. + +Example: + + flash@ff000000 { + compatible = "amd,am29lv128ml", "cfi-flash"; + reg = ; + bank-width = <4>; + device-width = <1>; + #address-cells = <1>; + #size-cells = <1>; + fs@0 { + label = "fs"; + reg = <0 f80000>; + }; + firmware@f80000 { + label ="firmware"; + reg = ; + read-only; + }; + }; -- cgit v1.1 From c5a88dd90cf243a17c4a8c10e1ed973192ea5825 Mon Sep 17 00:00:00 2001 From: Stefan Roese Date: Thu, 16 Apr 2009 15:11:54 -0600 Subject: powerpc/device-tree: Document MTD nodes with multiple "reg" tuples Add binding for mtd nodes with multiple reg tuples. Multiple reg tuples are used when the flash region covers multiple devices of the same type, but not necessarily the same size. Signed-off-by: Stefan Roese Signed-off-by: Grant Likely --- Documentation/powerpc/dts-bindings/mtd-physmap.txt | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/powerpc/dts-bindings/mtd-physmap.txt b/Documentation/powerpc/dts-bindings/mtd-physmap.txt index cd474f9..667c9bd 100644 --- a/Documentation/powerpc/dts-bindings/mtd-physmap.txt +++ b/Documentation/powerpc/dts-bindings/mtd-physmap.txt @@ -5,7 +5,9 @@ file systems on embedded devices. - compatible : should contain the specific model of flash chip(s) used, if known, followed by either "cfi-flash" or "jedec-flash" - - reg : Address range of the flash chip + - reg : Address range(s) of the flash chip(s) + It's possible to (optionally) define multiple "reg" tuples so that + non-identical NOR chips can be described in one flash node. - bank-width : Width (in bytes) of the flash bank. Equal to the device width times the number of interleaved chips. - device-width : (optional) Width of a single flash chip. If @@ -61,3 +63,18 @@ Example: read-only; }; }; + +Here an example with multiple "reg" tuples: + + flash@f0000000,0 { + #address-cells = <1>; + #size-cells = <1>; + compatible = "intel,PC48F4400P0VB", "cfi-flash"; + reg = <0 0x00000000 0x02000000 + 0 0x02000000 0x02000000>; + bank-width = <2>; + partition@0 { + label = "test-part1"; + reg = <0 0x04000000>; + }; + }; -- cgit v1.1 From 13977091a988fb0d21821c2221ddc920eba36b79 Mon Sep 17 00:00:00 2001 From: Magnus Damm Date: Mon, 30 Mar 2009 14:37:25 -0700 Subject: Driver Core: early platform driver V3 of the early platform driver implementation. Platform drivers are great for embedded platforms because we can separate driver configuration from the actual driver. So base addresses, interrupts and other configuration can be kept with the processor or board code, and the platform driver can be reused by many different platforms. For early devices we have nothing today. For instance, to configure early timers and early serial ports we cannot use platform devices. This because the setup order during boot. Timers are needed before the platform driver core code is available. The same goes for early printk support. Early in this case means before initcalls. These early drivers today have their configuration either hard coded or they receive it using some special configuration method. This is working quite well, but if we want to support both regular kernel modules and early devices then we need to have two ways of configuring the same driver. A single way would be better. The early platform driver patch is basically a set of functions that allow drivers to register themselves and architecture code to locate them and probe. Registration happens through early_param(). The time for the probe is decided by the architecture code. See Documentation/driver-model/platform.txt for more details. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Magnus Damm Cc: Paul Mundt Cc: Kay Sievers Cc: David Brownell Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman --- Documentation/driver-model/platform.txt | 59 +++++++++++++++++++++++++++++++++ 1 file changed, 59 insertions(+) (limited to 'Documentation') diff --git a/Documentation/driver-model/platform.txt b/Documentation/driver-model/platform.txt index 83009fdc..2e2c2ea 100644 --- a/Documentation/driver-model/platform.txt +++ b/Documentation/driver-model/platform.txt @@ -169,3 +169,62 @@ three different ways to find such a match: be probed later if another device registers. (Which is OK, since this interface is only for use with non-hotpluggable devices.) + +Early Platform Devices and Drivers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The early platform interfaces provide platform data to platform device +drivers early on during the system boot. The code is built on top of the +early_param() command line parsing and can be executed very early on. + +Example: "earlyprintk" class early serial console in 6 steps + +1. Registering early platform device data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code registers platform device data using the function +early_platform_add_devices(). In the case of early serial console this +should be hardware configuration for the serial port. Devices registered +at this point will later on be matched against early platform drivers. + +2. Parsing kernel command line +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code calls parse_early_param() to parse the kernel +command line. This will execute all matching early_param() callbacks. +User specified early platform devices will be registered at this point. +For the early serial console case the user can specify port on the +kernel command line as "earlyprintk=serial.0" where "earlyprintk" is +the class string, "serial" is the name of the platfrom driver and +0 is the platform device id. If the id is -1 then the dot and the +id can be omitted. + +3. Installing early platform drivers belonging to a certain class +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code may optionally force registration of all early +platform drivers belonging to a certain class using the function +early_platform_driver_register_all(). User specified devices from +step 2 have priority over these. This step is omitted by the serial +driver example since the early serial driver code should be disabled +unless the user has specified port on the kernel command line. + +4. Early platform driver registration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Compiled-in platform drivers making use of early_platform_init() are +automatically registered during step 2 or 3. The serial driver example +should use early_platform_init("earlyprintk", &platform_driver). + +5. Probing of early platform drivers belonging to a certain class +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code calls early_platform_driver_probe() to match +registered early platform devices associated with a certain class with +registered early platform drivers. Matched devices will get probed(). +This step can be executed at any point during the early boot. As soon +as possible may be good for the serial port case. + +6. Inside the early platform driver probe() +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The driver code needs to take special care during early boot, especially +when it comes to memory allocation and interrupt registration. The code +in the probe() function can use is_early_platform_device() to check if +it is called at early platform device or at the regular platform device +time. The early serial driver performs register_console() at this point. + +For further information, see . -- cgit v1.1 From e0ca87391694dfacd01465d5c01c579c3b8b63e0 Mon Sep 17 00:00:00 2001 From: Evgeniy Polyakov Date: Fri, 27 Mar 2009 15:04:29 +0300 Subject: Staging: Pohmelfs: Added IO permissions and priorities. Signed-off-by: Evgeniy Polyakov Signed-off-by: Greg Kroah-Hartman --- Documentation/filesystems/pohmelfs/design_notes.txt | 5 +++-- Documentation/filesystems/pohmelfs/info.txt | 21 +++++++++++++++++---- 2 files changed, 20 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt index 6d6db60..dcf8335 100644 --- a/Documentation/filesystems/pohmelfs/design_notes.txt +++ b/Documentation/filesystems/pohmelfs/design_notes.txt @@ -56,9 +56,10 @@ workloads and can fully utilize the bandwidth to the servers when doing bulk data transfers. POHMELFS clients operate with a working set of servers and are capable of balancing read-only -operations (like lookups or directory listings) between them. +operations (like lookups or directory listings) between them according to IO priorities. Administrators can add or remove servers from the set at run-time via special commands (described -in Documentation/pohmelfs/info.txt file). Writes are replicated to all servers. +in Documentation/pohmelfs/info.txt file). Writes are replicated to all servers, which are connected +with write permission turned on. IO priority and permissions can be changed in run-time. POHMELFS is capable of full data channel encryption and/or strong crypto hashing. One can select any kernel supported cipher, encryption mode, hash type and operation mode diff --git a/Documentation/filesystems/pohmelfs/info.txt b/Documentation/filesystems/pohmelfs/info.txt index 4e3d501..db2e413 100644 --- a/Documentation/filesystems/pohmelfs/info.txt +++ b/Documentation/filesystems/pohmelfs/info.txt @@ -1,6 +1,8 @@ POHMELFS usage information. -Mount options: +Mount options. +All but index, number of crypto threads and maximum IO size can changed via remount. + idx=%u Each mountpoint is associated with a special index via this option. Administrator can add or remove servers from the given index, so all mounts, @@ -52,16 +54,27 @@ mcache_timeout=%u Usage examples. -Add (or remove if it already exists) server server1.net:1025 into the working set with index $idx +Add server server1.net:1025 into the working set with index $idx with appropriate hash algorithm and key file and cipher algorithm, mode and key file: -$cfg -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key +$cfg A add -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key Mount filesystem with given index $idx to /mnt mountpoint. Client will connect to all servers specified in the working set via previous command: mount -t pohmel -o idx=$idx q /mnt -One can add or remove servers from working set after mounting too. +Change permissions to read-only (-I 1 option, '-I 2' - write-only, 3 - rw): +$cfg A modify -a server1.net -p 1025 -i $idx -I 1 + +Change IO priority to 123 (node with the highest priority gets read requests). +$cfg A modify -a server1.net -p 1025 -i $idx -P 123 +One can check currect status of all connections in the mountstats file: +# cat /proc/$PID/mountstats +... +device none mounted on /mnt with fstype pohmel +idx addr(:port) socket_type protocol active priority permissions +0 server1.net:1026 1 6 1 250 1 +0 server2.net:1025 1 6 1 123 3 Server installation. -- cgit v1.1 From b57f7e7b836d271902b8b7b1ec8cf9312dc5d228 Mon Sep 17 00:00:00 2001 From: Henrique de Moraes Holschuh Date: Tue, 14 Apr 2009 02:44:14 +0000 Subject: thinkpad-acpi: bump up version to 0.23 Plenty of high-profile changes, so it deserves a new version number. Features added since 0.22: * Restrict unsafe LEDs * New race-less brightness control strategy for IBM ThinkPads * Disclose TGID of driver access from userspace (debug) * Warn when deprecated functions are used Other changes: * Better debug messages in some subdrivers * Removed "hotkey disable" support, since it breaks the driver * Dropped "ibm-acpi" alias Signed-off-by: Henrique de Moraes Holschuh Signed-off-by: Len Brown --- Documentation/laptops/thinkpad-acpi.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index 3d76507..e7e9a690 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -1,7 +1,7 @@ ThinkPad ACPI Extras Driver - Version 0.22 - November 23rd, 2008 + Version 0.23 + April 10th, 2009 Borislav Deianov Henrique de Moraes Holschuh -- cgit v1.1 From 8b5b94e4e9813cdd77103827f48d58c806ab45c6 Mon Sep 17 00:00:00 2001 From: Weidong Han Date: Fri, 17 Apr 2009 16:42:12 +0800 Subject: docs, x86: add nox2apic back to kernel-parameters.txt "nox2apic" was removed from kernel-parameters.txt by mistake, when entries were sorted in alpha order (commit 0cb55ad2). But this early parameter is still there, add it back to kernel-parameters.txt. [ Impact: add boot parameter description ] Signed-off-by: Suresh Siddha Signed-off-by: Weidong Han Cc: Randy Dunlap Cc: iommu@lists.linux-foundation.org Cc: dwmw2@infradead.org Cc: allen.m.kay@intel.com Cc: fenghua.yu@intel.com LKML-Reference: <1239957736-6161-2-git-send-email-weidong.han@intel.com> Signed-off-by: Ingo Molnar --- Documentation/kernel-parameters.txt | 2 ++ 1 file changed, 2 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index a19f021..9e4fe72 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1588,6 +1588,8 @@ and is between 256 and 4096 characters. It is defined in the file nowb [ARM] + nox2apic [X86-64,APIC] Do not enable x2APIC mode. + nptcg= [IA64] Override max number of concurrent global TLB purges which is reported from either PAL_VM_SUMMARY or SAL PALO. -- cgit v1.1 From 4af94f39004a0d1a074e7573457e238a29557848 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 17 Apr 2009 18:30:28 -0700 Subject: doc: fix kernel-parameters.txt mistaken deletions Re-add missing kernel-parameters documentation that was accidentally deleted in commit 0cb55ad2. Thanks to Ingo and Weidong Han for the heads-up on this. Signed-off-by: Randy Dunlap cc: Ingo Molnar cc: Len Brown Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 38 +++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index a19f021..600cdd7 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -231,6 +231,35 @@ and is between 256 and 4096 characters. It is defined in the file power state again in power transition. 1 : disable the power state check + acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode + Format: { level | edge | high | low } + + acpi_serialize [HW,ACPI] force serialization of AML methods + + acpi_skip_timer_override [HW,ACPI] + Recognize and ignore IRQ0/pin2 Interrupt Override. + For broken nForce2 BIOS resulting in XT-PIC timer. + + acpi_sleep= [HW,ACPI] Sleep options + Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, + old_ordering, s4_nonvs } + See Documentation/power/video.txt for information on + s3_bios and s3_mode. + s3_beep is for debugging; it makes the PC's speaker beep + as soon as the kernel's real-mode entry point is called. + s4_nohwsig prevents ACPI hardware signature from being + used during resume from hibernation. + old_ordering causes the ACPI 1.0 ordering of the _PTS + control method, with respect to putting devices into + low power states, to be enforced (the ACPI 2.0 ordering + of _PTS is used by default). + s4_nonvs prevents the kernel from saving/restoring the + ACPI NVS memory during hibernation. + + acpi_use_timer_override [HW,ACPI] + Use timer override. For some broken Nvidia NF5 boards + that require a timer override, but don't have HPET + acpi_enforce_resources= [ACPI] { strict | lax | no } Check for resource conflicts between native drivers @@ -250,6 +279,9 @@ and is between 256 and 4096 characters. It is defined in the file ad1848= [HW,OSS] Format: ,,,, + add_efi_memmap [EFI; X86] Include EFI memory map in + kernel's map of available physical RAM. + advansys= [HW,SCSI] See header of drivers/scsi/advansys.c. @@ -1838,6 +1870,12 @@ and is between 256 and 4096 characters. It is defined in the file autoconfiguration. Ranges are in pairs (memory base and size). + ports= [IP_VS_FTP] IPVS ftp helper module + Default is 21. + Up to 8 (IP_VS_APP_MAX_PORTS) ports + may be specified. + Format: ,.... + print-fatal-signals= [KNL] debug: print fatal signals print-fatal-signals=1: print segfault info to -- cgit v1.1 From 720097d895956c0bf15b8440f7845159a04b74da Mon Sep 17 00:00:00 2001 From: Sam Ravnborg Date: Sun, 19 Apr 2009 11:04:26 +0200 Subject: kbuild: introduce subdir-ccflags-y Following patch introduce support for setting options to gcc that has effect for current directory and all subdirectories. The typical use case are an architecture or a subsystem that decide to cover all files with -Werror. Today alpha, mips and sparc uses -Werror in almost all their Makefile- with subdir-ccflag-y it is now simpler to do so as only the top-level directories needs to be covered. Likewise if we decide to cover a full subsystem such as net/ with -Werror this is done by adding a single line to net/Makefile. Signed-off-by: Sam Ravnborg Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Thomas Gleixner --- Documentation/kbuild/makefiles.txt | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt index d4b0567..d76cfd8 100644 --- a/Documentation/kbuild/makefiles.txt +++ b/Documentation/kbuild/makefiles.txt @@ -316,6 +316,16 @@ more details, with real examples. #arch/m68k/fpsp040/Makefile ldflags-y := -x + subdir-ccflags-y, subdir-asflags-y + The two flags listed above are similar to ccflags-y and as-falgs-y. + The difference is that the subdir- variants has effect for the kbuild + file where tey are present and all subdirectories. + Options specified using subdir-* are added to the commandline before + the options specified using the non-subdir variants. + + Example: + subdir-ccflags-y := -Werror + CFLAGS_$@, AFLAGS_$@ CFLAGS_$@ and AFLAGS_$@ only apply to commands in current -- cgit v1.1 From 9536c26b31ae34ba6371a1b8f406028e9756f913 Mon Sep 17 00:00:00 2001 From: Matt Kraai Date: Thu, 16 Apr 2009 23:46:20 -0700 Subject: lguest: tell git to ignore Documentation/lguest/lguest This is the example lguest launcher binary. Signed-off-by: Matt Kraai Signed-off-by: Rusty Russell --- Documentation/lguest/.gitignore | 1 + 1 file changed, 1 insertion(+) create mode 100644 Documentation/lguest/.gitignore (limited to 'Documentation') diff --git a/Documentation/lguest/.gitignore b/Documentation/lguest/.gitignore new file mode 100644 index 0000000..115587f --- /dev/null +++ b/Documentation/lguest/.gitignore @@ -0,0 +1 @@ +lguest -- cgit v1.1 From 38cfe968040250b89c3554a17219a9fda45b9665 Mon Sep 17 00:00:00 2001 From: Rusty Russell Date: Sun, 19 Apr 2009 23:14:02 -0600 Subject: lguest: document 32-bit and PAE requirements Robert noted that we don't actually document that lguest is 32-bit only, nor that PAE must be off (CONFIG_PAE is now prompted for if HIGHMEM is set to "off). Signed-off-by: Rusty Russell Cc: lguest@ozlabs.org Cc: "Robert P. J. Day" --- Documentation/lguest/lguest.txt | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/lguest/lguest.txt b/Documentation/lguest/lguest.txt index 29510dc..28c7473 100644 --- a/Documentation/lguest/lguest.txt +++ b/Documentation/lguest/lguest.txt @@ -3,11 +3,11 @@ /, /` - or, A Young Coder's Illustrated Hypervisor \\"--\\ http://lguest.ozlabs.org -Lguest is designed to be a minimal hypervisor for the Linux kernel, for -Linux developers and users to experiment with virtualization with the -minimum of complexity. Nonetheless, it should have sufficient -features to make it useful for specific tasks, and, of course, you are -encouraged to fork and enhance it (see drivers/lguest/README). +Lguest is designed to be a minimal 32-bit x86 hypervisor for the Linux kernel, +for Linux developers and users to experiment with virtualization with the +minimum of complexity. Nonetheless, it should have sufficient features to +make it useful for specific tasks, and, of course, you are encouraged to fork +and enhance it (see drivers/lguest/README). Features: @@ -37,6 +37,7 @@ Running Lguest: "Paravirtualized guest support" = Y "Lguest guest support" = Y "High Memory Support" = off/4GB + "PAE (Physical Address Extension) Support" = N "Alignment value to which kernel should be aligned" = 0x100000 (CONFIG_PARAVIRT=y, CONFIG_LGUEST_GUEST=y, CONFIG_HIGHMEM64G=n and CONFIG_PHYSICAL_ALIGN=0x100000) -- cgit v1.1 From 66672fefaa91802fec51c3fe0cc55bc9baea5a2d Mon Sep 17 00:00:00 2001 From: Adrian McMenamin Date: Mon, 20 Apr 2009 18:38:28 -0700 Subject: Documentation/filesystems: remove out of date reference to BKL being held Documentation/filesystems/vfs.txt incorrectly states that the kernel is locked during the call to statfs (Documentation/filesystems/Locking correctly says it is not). This patch removes the offending sentence. remove reference to BKL being held in statfs Signed-off-by: Adrian McMenamin Signed-off-by: Randy Dunlap Cc: Alexander Viro Signed-off-by: Al Viro --- Documentation/filesystems/vfs.txt | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index deeeed0..f49eecf 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -277,8 +277,7 @@ or bottom half). unfreeze_fs: called when VFS is unlocking a filesystem and making it writable again. - statfs: called when the VFS needs to get filesystem statistics. This - is called with the kernel lock held + statfs: called when the VFS needs to get filesystem statistics. remount_fs: called when the filesystem is remounted. This is called with the kernel lock held -- cgit v1.1 From 88bea188b85f9cefefbbd56b8a48d0f798409177 Mon Sep 17 00:00:00 2001 From: Len Brown Date: Tue, 21 Apr 2009 00:35:47 -0400 Subject: ACPI: add /sys/firmware/acpi/interrupts/sci_not counter This counter may prove useful in debugging some spurious interrupt issues seen in the field. Signed-off-by: Len Brown --- Documentation/ABI/testing/sysfs-firmware-acpi | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-firmware-acpi b/Documentation/ABI/testing/sysfs-firmware-acpi index e8ffc70..4f9ba3c 100644 --- a/Documentation/ABI/testing/sysfs-firmware-acpi +++ b/Documentation/ABI/testing/sysfs-firmware-acpi @@ -69,9 +69,13 @@ Description: gpe1F: 0 invalid gpe_all: 1192 sci: 1194 + sci_not: 0 - sci - The total number of times the ACPI SCI - has claimed an interrupt. + sci - The number of times the ACPI SCI + has been called and claimed an interrupt. + + sci_not - The number of times the ACPI SCI + has been called and NOT claimed an interrupt. gpe_all - count of SCI caused by GPEs. -- cgit v1.1 From 6e538aaf50ae782a890cbc02c27950448d8193e1 Mon Sep 17 00:00:00 2001 From: David Brownell Date: Tue, 21 Apr 2009 12:24:49 -0700 Subject: spi: documentation: emphasise spi_master.setup() semantics This is a doc-only patch which I hope will reduce the number of spi_master controller driver patches starting out with a common implementation bug. (As in: almost every spi_master driver I see starts out with its version of this bug. Sigh.) It just re-emphasizes that the setup() method may be called for one device while a transfer is active on another ... which means that most driver implementations shouldn't touch any registers. Signed-off-by: David Brownell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/spi/spi-summary | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'Documentation') diff --git a/Documentation/spi/spi-summary b/Documentation/spi/spi-summary index 0f5122e..4a02d25 100644 --- a/Documentation/spi/spi-summary +++ b/Documentation/spi/spi-summary @@ -511,10 +511,16 @@ SPI MASTER METHODS This sets up the device clock rate, SPI mode, and word sizes. Drivers may change the defaults provided by board_info, and then call spi_setup(spi) to invoke this routine. It may sleep. + Unless each SPI slave has its own configuration registers, don't change them right away ... otherwise drivers could corrupt I/O that's in progress for other SPI devices. + ** BUG ALERT: for some reason the first version of + ** many spi_master drivers seems to get this wrong. + ** When you code setup(), ASSUME that the controller + ** is actively processing transfers for another device. + master->transfer(struct spi_device *spi, struct spi_message *message) This must not sleep. Its responsibility is arrange that the transfer happens and its complete() callback is issued. The two -- cgit v1.1 From cffb2fafb726c898fec1c5ae33717741f94fda83 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 10 Apr 2009 15:17:50 -0700 Subject: docbooks: add/fix PCI kernel-doc Add drivers/pci/*.c source files to DocBook/kernel-api.tmpl and update those pci/*.c source files that need kernel-doc fixes. Signed-off-by: Randy Dunlap Signed-off-by: Jesse Barnes --- Documentation/DocBook/kernel-api.tmpl | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index d6ac5d6..44b3def 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -190,16 +190,20 @@ X!Ekernel/module.c !Edrivers/pci/pci.c !Edrivers/pci/pci-driver.c !Edrivers/pci/remove.c -!Edrivers/pci/pci-acpi.c !Edrivers/pci/search.c !Edrivers/pci/msi.c !Edrivers/pci/bus.c +!Edrivers/pci/access.c +!Edrivers/pci/irq.c +!Edrivers/pci/htirq.c !Edrivers/pci/probe.c +!Edrivers/pci/slot.c !Edrivers/pci/rom.c !Edrivers/pci/iov.c +!Idrivers/pci/pci-sysfs.c PCI Hotplug Support Library !Edrivers/pci/hotplug/pci_hotplug_core.c -- cgit v1.1 From 91ac033d8377552d3654501a105ab55bf546940e Mon Sep 17 00:00:00 2001 From: Marc Dionne Date: Thu, 23 Apr 2009 11:21:55 +0100 Subject: CacheFiles: Fix the documentation to use the correct credential pointer names Adjust the CacheFiles documentation to use the correct names of the credential pointers in task_struct. The documentation was using names from the old versions of the credentials patches. Signed-off-by: Marc Dionne Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- Documentation/filesystems/caching/cachefiles.txt | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.txt index c78a49b..748a1ae 100644 --- a/Documentation/filesystems/caching/cachefiles.txt +++ b/Documentation/filesystems/caching/cachefiles.txt @@ -407,7 +407,7 @@ A NOTE ON SECURITY ================== CacheFiles makes use of the split security in the task_struct. It allocates -its own task_security structure, and redirects current->act_as to point to it +its own task_security structure, and redirects current->cred to point to it when it acts on behalf of another process, in that process's context. The reason it does this is that it calls vfs_mkdir() and suchlike rather than @@ -429,9 +429,9 @@ This means it may lose signals or ptrace events for example, and affects what the process looks like in /proc. So CacheFiles makes use of a logical split in the security between the -objective security (task->sec) and the subjective security (task->act_as). The -objective security holds the intrinsic security properties of a process and is -never overridden. This is what appears in /proc, and is what is used when a +objective security (task->real_cred) and the subjective security (task->cred). +The objective security holds the intrinsic security properties of a process and +is never overridden. This is what appears in /proc, and is what is used when a process is the target of an operation by some other process (SIGKILL for example). -- cgit v1.1 From 992d7ced75322307035a0e94074eb7188612a680 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Fri, 24 Apr 2009 23:10:06 +0800 Subject: locking: Documentation: lockdep-design.txt, fix note of state bits From source code of get_usage_char(), the previous note is not correct, so fix it. static char get_usage_char(struct lock_class *class, enum lock_usage_bit bit) { char c = '.'; if (class->usage_mask & lock_flag(bit + 2))/*LOCK_ENABLED_##STATE*/ c = '+'; if (class->usage_mask & lock_flag(bit)) {/*LOCK_USED_IN_##STATE*/ c = '-'; if (class->usage_mask & lock_flag(bit + 2)) c = '?'; } return c; } note: 1) The 'bit' parameter always is passed as LOCK_USED_IN_##STATE or LOCK_USED_IN_##STATE_READ , from get_usage_chars(). Signed-off-by: Ming Lei LKML-Reference: <1240585806-5744-1-git-send-email-tom.leiming@gmail.com> Signed-off-by: Ingo Molnar --- Documentation/lockdep-design.txt | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/lockdep-design.txt b/Documentation/lockdep-design.txt index 938ea22..e20d913 100644 --- a/Documentation/lockdep-design.txt +++ b/Documentation/lockdep-design.txt @@ -54,9 +54,9 @@ locking error messages, inside curlies. A contrived example: The bit position indicates STATE, STATE-read, for each of the states listed above, and the character displayed in each indicates: - '.' acquired while irqs disabled - '+' acquired in irq context - '-' acquired with irqs enabled + '.' acquired while irqs disabled and not in irq context + '-' acquired in irq context + '+' acquired with irqs enabled '?' acquired in irq context with irqs enabled. Unused mutexes cannot be part of the cause of an error. -- cgit v1.1 From 3d4f16348b77efbf81b7fa186a18a0eb815b6b84 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Mon, 27 Apr 2009 12:00:27 -0700 Subject: Revert "linux.conf.au 2009: Tuz" This reverts commit 8032b526d1a3bd91ad633dd3a3b5fdbc47ad54f1. Hey, it was only meant to be a single release. Now they can all die as far as I'm concerned. [ Just kidding. They're cute and cuddly. Except when they have horrible nasty facial diseases. Oh, and I guess they're not actually that cuddly even when disease-free. ] Signed-off-by: Linus Torvalds --- Documentation/logo.gif | Bin 0 -> 16335 bytes Documentation/logo.svg | 2911 ------------------------------------------------ Documentation/logo.txt | 15 +- 3 files changed, 12 insertions(+), 2914 deletions(-) create mode 100644 Documentation/logo.gif delete mode 100644 Documentation/logo.svg (limited to 'Documentation') diff --git a/Documentation/logo.gif b/Documentation/logo.gif new file mode 100644 index 0000000..2eae75f Binary files /dev/null and b/Documentation/logo.gif differ diff --git a/Documentation/logo.svg b/Documentation/logo.svg deleted file mode 100644 index cb9e485..0000000 --- a/Documentation/logo.svg +++ /dev/null @@ -1,2911 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - image/svg+xml - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - diff --git a/Documentation/logo.txt b/Documentation/logo.txt index a2e6244..296f0f7 100644 --- a/Documentation/logo.txt +++ b/Documentation/logo.txt @@ -1,4 +1,13 @@ -Tux is taking a three month sabbatical to work as a barber, so Tuz is -standing in. He's taken pains to ensure you'll hardly notice. +This is the full-colour version of the currently unofficial Linux logo +("currently unofficial" just means that there has been no paperwork and +that I have not really announced it yet). It was created by Larry Ewing, +and is freely usable as long as you acknowledge Larry as the original +artist. + +Note that there are black-and-white versions of this available that +scale down to smaller sizes and are better for letterheads or whatever +you want to use it for: for the full range of logos take a look at +Larry's web-page: + + http://www.isc.tamu.edu/~lewing/linux/ -Image by Andrew McGown and Josh Bush. Image is licensed CC BY-SA. -- cgit v1.1 From 4c57e379f4e318847dd06f60c7e1ff4d8bde1bdb Mon Sep 17 00:00:00 2001 From: Henrik Rydberg Date: Tue, 28 Apr 2009 07:07:07 -0700 Subject: Input: bcm5974 - add documentation for the driver This patch adds documentation for the bcm5974 to Documentation/input/. Signed-off-by: Henrik Rydberg Signed-off-by: Dmitry Torokhov --- Documentation/input/bcm5974.txt | 65 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 65 insertions(+) create mode 100644 Documentation/input/bcm5974.txt (limited to 'Documentation') diff --git a/Documentation/input/bcm5974.txt b/Documentation/input/bcm5974.txt new file mode 100644 index 0000000..5e22dcf --- /dev/null +++ b/Documentation/input/bcm5974.txt @@ -0,0 +1,65 @@ +BCM5974 Driver (bcm5974) +------------------------ + Copyright (C) 2008-2009 Henrik Rydberg + +The USB initialization and package decoding was made by Scott Shawcroft as +part of the touchd user-space driver project: + Copyright (C) 2008 Scott Shawcroft (scott.shawcroft@gmail.com) + +The BCM5974 driver is based on the appletouch driver: + Copyright (C) 2001-2004 Greg Kroah-Hartman (greg@kroah.com) + Copyright (C) 2005 Johannes Berg (johannes@sipsolutions.net) + Copyright (C) 2005 Stelian Pop (stelian@popies.net) + Copyright (C) 2005 Frank Arnold (frank@scirocco-5v-turbo.de) + Copyright (C) 2005 Peter Osterlund (petero2@telia.com) + Copyright (C) 2005 Michael Hanselmann (linux-kernel@hansmi.ch) + Copyright (C) 2006 Nicolas Boichat (nicolas@boichat.ch) + +This driver adds support for the multi-touch trackpad on the new Apple +Macbook Air and Macbook Pro laptops. It replaces the appletouch driver on +those computers, and integrates well with the synaptics driver of the Xorg +system. + +Known to work on Macbook Air, Macbook Pro Penryn and the new unibody +Macbook 5 and Macbook Pro 5. + +Usage +----- + +The driver loads automatically for the supported usb device ids, and +becomes available both as an event device (/dev/input/event*) and as a +mouse via the mousedev driver (/dev/input/mice). + +USB Race +-------- + +The Apple multi-touch trackpads report both mouse and keyboard events via +different interfaces of the same usb device. This creates a race condition +with the HID driver, which, if not told otherwise, will find the standard +HID mouse and keyboard, and claim the whole device. To remedy, the usb +product id must be listed in the mouse_ignore list of the hid driver. + +Debug output +------------ + +To ease the development for new hardware version, verbose packet output can +be switched on with the debug kernel module parameter. The range [1-9] +yields different levels of verbosity. Example (as root): + +echo -n 9 > /sys/module/bcm5974/parameters/debug + +tail -f /var/log/debug + +echo -n 0 > /sys/module/bcm5974/parameters/debug + +Trivia +------ + +The driver was developed at the ubuntu forums in June 2008 [1], and now has +a more permanent home at bitmath.org [2]. + +Links +----- + +[1] http://ubuntuforums.org/showthread.php?t=840040 +[2] http://http://bitmath.org/code/ -- cgit v1.1 From eacaad01b4e67336b5b3f4db6dc15ef92c64b47d Mon Sep 17 00:00:00 2001 From: Henrik Rydberg Date: Tue, 28 Apr 2009 07:49:21 -0700 Subject: Input: document the multi-touch (MT) protocol This patchs adds documentation for the multi-touch protocol to Documentation/input/. [randy.dunlap@oracle.com: grammar fixes] Signed-off-by: Henrik Rydberg Signed-off-by: Dmitry Torokhov --- Documentation/input/multi-touch-protocol.txt | 140 +++++++++++++++++++++++++++ 1 file changed, 140 insertions(+) create mode 100644 Documentation/input/multi-touch-protocol.txt (limited to 'Documentation') diff --git a/Documentation/input/multi-touch-protocol.txt b/Documentation/input/multi-touch-protocol.txt new file mode 100644 index 0000000..9f09557 --- /dev/null +++ b/Documentation/input/multi-touch-protocol.txt @@ -0,0 +1,140 @@ +Multi-touch (MT) Protocol +------------------------- + Copyright (C) 2009 Henrik Rydberg + + +Introduction +------------ + +In order to utilize the full power of the new multi-touch devices, a way to +report detailed finger data to user space is needed. This document +describes the multi-touch (MT) protocol which allows kernel drivers to +report details for an arbitrary number of fingers. + + +Usage +----- + +Anonymous finger details are sent sequentially as separate packets of ABS +events. Only the ABS_MT events are recognized as part of a finger +packet. The end of a packet is marked by calling the input_mt_sync() +function, which generates a SYN_MT_REPORT event. The end of multi-touch +transfer is marked by calling the usual input_sync() function. + +A set of ABS_MT events with the desired properties is defined. The events +are divided into categories, to allow for partial implementation. The +minimum set consists of ABS_MT_TOUCH_MAJOR, ABS_MT_POSITION_X and +ABS_MT_POSITION_Y, which allows for multiple fingers to be tracked. If the +device supports it, the ABS_MT_WIDTH_MAJOR may be used to provide the size +of the approaching finger. Anisotropy and direction may be specified with +ABS_MT_TOUCH_MINOR, ABS_MT_WIDTH_MINOR and ABS_MT_ORIENTATION. Devices with +more granular information may specify general shapes as blobs, i.e., as a +sequence of rectangular shapes grouped together by an +ABS_MT_BLOB_ID. Finally, the ABS_MT_TOOL_TYPE may be used to specify +whether the touching tool is a finger or a pen or something else. + + +Event Semantics +--------------- + +The word "contact" is used to describe a tool which is in direct contact +with the surface. A finger, a pen or a rubber all classify as contacts. + +ABS_MT_TOUCH_MAJOR + +The length of the major axis of the contact. The length should be given in +surface units. If the surface has an X times Y resolution, the largest +possible value of ABS_MT_TOUCH_MAJOR is sqrt(X^2 + Y^2), the diagonal. + +ABS_MT_TOUCH_MINOR + +The length, in surface units, of the minor axis of the contact. If the +contact is circular, this event can be omitted. + +ABS_MT_WIDTH_MAJOR + +The length, in surface units, of the major axis of the approaching +tool. This should be understood as the size of the tool itself. The +orientation of the contact and the approaching tool are assumed to be the +same. + +ABS_MT_WIDTH_MINOR + +The length, in surface units, of the minor axis of the approaching +tool. Omit if circular. + +The above four values can be used to derive additional information about +the contact. The ratio ABS_MT_TOUCH_MAJOR / ABS_MT_WIDTH_MAJOR approximates +the notion of pressure. The fingers of the hand and the palm all have +different characteristic widths [1]. + +ABS_MT_ORIENTATION + +The orientation of the ellipse. The value should describe half a revolution +clockwise around the touch center. The scale of the value is arbitrary, but +zero should be returned for an ellipse aligned along the Y axis of the +surface. As an example, an index finger placed straight onto the axis could +return zero orientation, something negative when twisted to the left, and +something positive when twisted to the right. This value can be omitted if +the touching object is circular, or if the information is not available in +the kernel driver. + +ABS_MT_POSITION_X + +The surface X coordinate of the center of the touching ellipse. + +ABS_MT_POSITION_Y + +The surface Y coordinate of the center of the touching ellipse. + +ABS_MT_TOOL_TYPE + +The type of approaching tool. A lot of kernel drivers cannot distinguish +between different tool types, such as a finger or a pen. In such cases, the +event should be omitted. The protocol currently supports MT_TOOL_FINGER and +MT_TOOL_PEN [2]. + +ABS_MT_BLOB_ID + +The BLOB_ID groups several packets together into one arbitrarily shaped +contact. This is a low-level anonymous grouping, and should not be confused +with the high-level contactID, explained below. Most kernel drivers will +not have this capability, and can safely omit the event. + + +Finger Tracking +--------------- + +The kernel driver should generate an arbitrary enumeration of the set of +anonymous contacts currently on the surface. The order in which the packets +appear in the event stream is not important. + +The process of finger tracking, i.e., to assign a unique contactID to each +initiated contact on the surface, is left to user space; preferably the +multi-touch X driver [3]. In that driver, the contactID stays the same and +unique until the contact vanishes (when the finger leaves the surface). The +problem of assigning a set of anonymous fingers to a set of identified +fingers is a euclidian bipartite matching problem at each event update, and +relies on a sufficiently rapid update rate. + +Notes +----- + +In order to stay compatible with existing applications, the data +reported in a finger packet must not be recognized as single-touch +events. In addition, all finger data must bypass input filtering, +since subsequent events of the same type refer to different fingers. + +The first kernel driver to utilize the MT protocol is the bcm5974 driver, +where examples can be found. + +[1] With the extension ABS_MT_APPROACH_X and ABS_MT_APPROACH_Y, the +difference between the contact position and the approaching tool position +could be used to derive tilt. +[2] The list can of course be extended. +[3] The multi-touch X driver is currently in the prototyping stage. At the +time of writing (April 2009), the MT protocol is not yet merged, and the +prototype implements finger matching, basic mouse support and two-finger +scrolling. The project aims at improving the quality of current multi-touch +functionality available in the synaptics X driver, and in addition +implement more advanced gestures. -- cgit v1.1 From 64e3da109404eed27f825ee3eb1fe465ded47979 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 17 Apr 2009 18:28:55 -0700 Subject: docs: also clean index.html Missed index.html in "make cleandocs", so add it. Signed-off-by: Randy Dunlap Signed-off-by: Sam Ravnborg --- Documentation/DocBook/Makefile | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile index 8918a32..b1eb661 100644 --- a/Documentation/DocBook/Makefile +++ b/Documentation/DocBook/Makefile @@ -143,7 +143,8 @@ quiet_cmd_db2pdf = PDF $@ $(call cmd,db2pdf) -main_idx = Documentation/DocBook/index.html +index = index.html +main_idx = Documentation/DocBook/$(index) build_main_index = rm -rf $(main_idx) && \ echo '

Linux Kernel HTML Documentation

' >> $(main_idx) && \ echo '

Kernel Version: $(KERNELVERSION)

' >> $(main_idx) && \ @@ -232,7 +233,7 @@ clean-files := $(DOCBOOKS) \ $(patsubst %.xml, %.pdf, $(DOCBOOKS)) \ $(patsubst %.xml, %.html, $(DOCBOOKS)) \ $(patsubst %.xml, %.9, $(DOCBOOKS)) \ - $(C-procfs-example) + $(C-procfs-example) $(index) clean-dirs := $(patsubst %.xml,%,$(DOCBOOKS)) man -- cgit v1.1 From b827e496c893de0c0f142abfaeb8730a2fd6b37f Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Thu, 30 Apr 2009 15:08:16 -0700 Subject: mm: close page_mkwrite races Change page_mkwrite to allow implementations to return with the page locked, and also change it's callers (in page fault paths) to hold the lock until the page is marked dirty. This allows the filesystem to have full control of page dirtying events coming from the VM. Rather than simply hold the page locked over the page_mkwrite call, we call page_mkwrite with the page unlocked and allow callers to return with it locked, so filesystems can avoid LOR conditions with page lock. The problem with the current scheme is this: a filesystem that wants to associate some metadata with a page as long as the page is dirty, will perform this manipulation in its ->page_mkwrite. It currently then must return with the page unlocked and may not hold any other locks (according to existing page_mkwrite convention). In this window, the VM could write out the page, clearing page-dirty. The filesystem has no good way to detect that a dirty pte is about to be attached, so it will happily write out the page, at which point, the filesystem may manipulate the metadata to reflect that the page is no longer dirty. It is not always possible to perform the required metadata manipulation in ->set_page_dirty, because that function cannot block or fail. The filesystem may need to allocate some data structure, for example. And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite is allowed to fail, so we must not allow any window where the page could be written to if page_mkwrite does fail. This solution of holding the page locked over the 3 critical operations (page_mkwrite, setting the pte dirty, and finally setting the page dirty) closes out races nicely, preventing page cleaning for writeout being initiated in that window. This provides the filesystem with a strong synchronisation against the VM here. - Sage needs this race closed for ceph filesystem. - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913). - I need it for fsblock. - I suspect other filesystems may need it too (eg. btrfs). - I have converted buffer.c to the new locking. Even simple block allocation under dirty pages might be susceptible to i_size changing under partial page at the end of file (we also have a buffer.c-side problem here, but it cannot be fixed properly without this patch). - Other filesystems (eg. NFS, maybe btrfs) will need to change their page_mkwrite functions themselves. [ This also moves page_mkwrite another step closer to fault, which should eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a filesystem calldown and page lock/unlock cycle in __do_fault. ] [akpm@linux-foundation.org: fix derefs of NULL ->mapping] Cc: Sage Weil Cc: Trond Myklebust Signed-off-by: Nick Piggin Cc: Valdis Kletnieks Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/Locking | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 76efe5b..3120f8d 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -512,16 +512,24 @@ locking rules: BKL mmap_sem PageLocked(page) open: no yes close: no yes -fault: no yes -page_mkwrite: no yes no +fault: no yes can return with page locked +page_mkwrite: no yes can return with page locked access: no yes - ->page_mkwrite() is called when a previously read-only page is -about to become writeable. The file system is responsible for -protecting against truncate races. Once appropriate action has been -taking to lock out truncate, the page range should be verified to be -within i_size. The page mapping should also be checked that it is not -NULL. + ->fault() is called when a previously not present pte is about +to be faulted in. The filesystem must find and return the page associated +with the passed in "pgoff" in the vm_fault structure. If it is possible that +the page may be truncated and/or invalidated, then the filesystem must lock +the page, then ensure it is not already truncated (the page lock will block +subsequent truncate), and then return with VM_FAULT_LOCKED, and the page +locked. The VM will unlock the page. + + ->page_mkwrite() is called when a previously read-only pte is +about to become writeable. The filesystem again must ensure that there are +no truncate/invalidate races, and then return with the page locked. If +the page has been truncated, the filesystem should not look up a new page +like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which +will cause the VM to retry the fault. ->access() is called when get_user_pages() fails in acces_process_vm(), typically used to debug a process through -- cgit v1.1 From 52dc5aec9fe2eb591f1490278ae767448860118b Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Thu, 30 Apr 2009 15:08:53 -0700 Subject: kernel-doc: restrict syntax for private: and public: scripts/kernel-doc can (incorrectly) delete struct members that are surrounded by /* ... */ /* ... */ if there is a /* private: */ comment in there somewhere also. Fix that by making the "/* private:" only allow whitespace between /* and "private:", not anything/everything in the world. This fixes some erroneous kernel-doc warnings that popped up while processing include/linux/usb/composite.h. Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-doc-nano-HOWTO.txt | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-doc-nano-HOWTO.txt b/Documentation/kernel-doc-nano-HOWTO.txt index 026ec7d..4d04572 100644 --- a/Documentation/kernel-doc-nano-HOWTO.txt +++ b/Documentation/kernel-doc-nano-HOWTO.txt @@ -269,7 +269,10 @@ Use the argument mechanism to document members or constants. Inside a struct description, you can use the "private:" and "public:" comment tags. Structure fields that are inside a "private:" area -are not listed in the generated output documentation. +are not listed in the generated output documentation. The "private:" +and "public:" tags must begin immediately following a "/*" comment +marker. They may optionally include comments between the ":" and the +ending "*/" marker. Example: @@ -283,7 +286,7 @@ Example: struct my_struct { int a; int b; -/* private: */ +/* private: internal use only */ int c; }; -- cgit v1.1 From 9e4a5bda89034502fb144331e71a0efdfd5fae97 Mon Sep 17 00:00:00 2001 From: Andrea Righi Date: Thu, 30 Apr 2009 15:08:57 -0700 Subject: mm: prevent divide error for small values of vm_dirty_bytes Avoid setting less than two pages for vm_dirty_bytes: this is necessary to avoid potential division by 0 (like the following) in get_dirty_limits(). [ 49.951610] divide error: 0000 [#1] PREEMPT SMP [ 49.952195] last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/uevent [ 49.952195] CPU 1 [ 49.952195] Modules linked in: pcspkr [ 49.952195] Pid: 3064, comm: dd Not tainted 2.6.30-rc3 #1 [ 49.952195] RIP: 0010:[] [] get_dirty_limits+0xe9/0x2c0 [ 49.952195] RSP: 0018:ffff88001de03a98 EFLAGS: 00010202 [ 49.952195] RAX: 00000000000000c0 RBX: ffff88001de03b80 RCX: 28f5c28f5c28f5c3 [ 49.952195] RDX: 0000000000000000 RSI: 00000000000000c0 RDI: 0000000000000000 [ 49.952195] RBP: ffff88001de03ae8 R08: 0000000000000000 R09: 0000000000000000 [ 49.952195] R10: ffff88001ddda9a0 R11: 0000000000000001 R12: 0000000000000001 [ 49.952195] R13: ffff88001fbc8218 R14: ffff88001de03b70 R15: ffff88001de03b78 [ 49.952195] FS: 00007fe9a435b6f0(0000) GS:ffff8800025d9000(0000) knlGS:0000000000000000 [ 49.952195] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 49.952195] CR2: 00007fe9a39ab000 CR3: 000000001de38000 CR4: 00000000000006e0 [ 49.952195] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 49.952195] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 49.952195] Process dd (pid: 3064, threadinfo ffff88001de02000, task ffff88001ddda250) [ 49.952195] Stack: [ 49.952195] ffff88001fa0de00 ffff88001f2dbd70 ffff88001f9fe800 000080b900000000 [ 49.952195] 00000000000000c0 ffff8800027a6100 0000000000000400 ffff88001fbc8218 [ 49.952195] 0000000000000000 0000000000000600 ffff88001de03bb8 ffffffff802d3ed7 [ 49.952195] Call Trace: [ 49.952195] [] balance_dirty_pages_ratelimited_nr+0x1d7/0x3f0 [ 49.952195] [] ? ext3_writeback_write_end+0x9e/0x120 [ 49.952195] [] generic_file_buffered_write+0x12f/0x330 [ 49.952195] [] __generic_file_aio_write_nolock+0x26d/0x460 [ 49.952195] [] ? generic_file_aio_write+0x52/0xd0 [ 49.952195] [] generic_file_aio_write+0x69/0xd0 [ 49.952195] [] ext3_file_write+0x26/0xc0 [ 49.952195] [] do_sync_write+0xf1/0x140 [ 49.952195] [] ? get_lock_stats+0x2a/0x60 [ 49.952195] [] ? autoremove_wake_function+0x0/0x40 [ 49.952195] [] vfs_write+0xcb/0x190 [ 49.952195] [] sys_write+0x50/0x90 [ 49.952195] [] system_call_fastpath+0x16/0x1b [ 49.952195] Code: 00 00 00 2b 05 09 1c 17 01 48 89 c6 49 0f af f4 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 d0 48 89 f0 <48> f7 f7 41 8b 95 ac 01 00 00 48 89 c7 49 0f af d4 48 c1 ea 02 [ 49.952195] RIP [] get_dirty_limits+0xe9/0x2c0 [ 49.952195] RSP [ 50.096523] ---[ end trace 008d7aa02f244d7b ]--- Signed-off-by: Andrea Righi Cc: Peter Zijlstra Cc: David Rientjes Cc: Dave Chinner Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'Documentation') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 97c4b32..b716d33 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -90,6 +90,10 @@ will itself start writeback. If dirty_bytes is written, dirty_ratio becomes a function of its value (dirty_bytes / the amount of dirtyable system memory). +Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any +value lower than this limit will be ignored and the old configuration will be +retained. + ============================================================== dirty_expire_centisecs -- cgit v1.1 From 429aa0fca0df702fc9c81d799175a7d920398827 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 6 May 2009 16:02:51 -0700 Subject: doc: hashdist defaults on for 64bit kernel boot parameter `hashdist' now defaults on for all 64bit NUMA. Signed-off-by: Hugh Dickins Acked-by: Mel Gorman Acked-by: David S. Miller Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 90b3924..52f5ce4 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -775,7 +775,7 @@ and is between 256 and 4096 characters. It is defined in the file hashdist= [KNL,NUMA] Large hashes allocated during boot are distributed across NUMA nodes. Defaults on - for IA-64, off otherwise. + for 64bit NUMA, off otherwise. Format: 0 | 1 (for off | on) hcl= [IA-64] SGI's Hardware Graph compatibility layer -- cgit v1.1 From ca1eda2d75b855f434b1d5458534332ffad92d65 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Wed, 6 May 2009 16:02:58 -0700 Subject: doc: small kernel-parameters updates Change last "i386" to X86-32 as is used throughout the rest of the file. Change combination of X86-32,X86-64 to just X86, as is done throughout the rest of the file. Add a note that hyphens and underscores are equivalent in parameter names, with examples. Signed-off-by: Randy Dunlap Cc: Jan Engelhardt Cc: Christopher Sylvain Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 52f5ce4..e87bdbf 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -17,6 +17,12 @@ are specified on the kernel command line with the module name plus usbcore.blinkenlights=1 +Hyphens (dashes) and underscores are equivalent in parameter names, so + log_buf_len=1M print-fatal-signals=1 +can also be entered as + log-buf-len=1M print_fatal_signals=1 + + This document may not be entirely up to date and comprehensive. The command "modinfo -p ${modulename}" shows a current list of all parameters of a loadable module. Loadable modules, after being loaded into the running kernel, also @@ -345,7 +351,7 @@ and is between 256 and 4096 characters. It is defined in the file not play well with APC CPU idle - disable it if you have APC and your system crashes randomly. - apic= [APIC,i386] Advanced Programmable Interrupt Controller + apic= [APIC,X86-32] Advanced Programmable Interrupt Controller Change the output verbosity whilst booting Format: { quiet (default) | verbose | debug } Change the amount of debugging information output @@ -702,7 +708,7 @@ and is between 256 and 4096 characters. It is defined in the file to discrete, to make X server driver able to add WB entry later. This parameter enables that. - enable_timer_pin_1 [i386,x86-64] + enable_timer_pin_1 [X86] Enable PIN 1 of APIC timer Can be useful to work around chipset bugs (in particular on some ATI chipsets). -- cgit v1.1 From 441ee4cb874622bc9a11c7b022a38919b04c105f Mon Sep 17 00:00:00 2001 From: Henrik Austad Date: Mon, 20 Apr 2009 18:42:38 -0700 Subject: Doc/sysfs-rules: Swap the order of the words so the sentence makes more sense Signed-off-by: Henrik Austad Signed-off-by: Randy Dunlap Cc: Greg Kroah-Hartman Signed-off-by: Greg Kroah-Hartman --- Documentation/sysfs-rules.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/sysfs-rules.txt b/Documentation/sysfs-rules.txt index 6049a2a..5d8bc2c 100644 --- a/Documentation/sysfs-rules.txt +++ b/Documentation/sysfs-rules.txt @@ -113,7 +113,7 @@ versions of the sysfs interface. "devices" directory at /sys/subsystem//devices. If /sys/subsystem exists, /sys/bus, /sys/class and /sys/block can be - ignored. If it does not exist, you have always to scan all three + ignored. If it does not exist, you always have to scan all three places, as the kernel is free to move a subsystem from one place to the other, as long as the devices are still reachable by the same subsystem name. -- cgit v1.1 From cd17cbfda004fe5f406c01b318c6378d9895896f Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Fri, 15 May 2009 11:32:24 +0200 Subject: Revert "mm: add /proc controls for pdflush threads" This reverts commit fafd688e4c0c34da0f3de909881117d374e4c7af. Work is progressing to switch away from pdflush as the process backing for flushing out dirty data. So it seems pointless to add more knobs to control pdflush threads. The original author of the patch did not have any specific use cases for adding the knobs, so we can easily revert this before 2.6.30 to avoid having to maintain this API forever. Signed-off-by: Jens Axboe --- Documentation/sysctl/vm.txt | 28 ---------------------------- 1 file changed, 28 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index b716d33..c302ddf 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -39,8 +39,6 @@ Currently, these files are in /proc/sys/vm: - nr_hugepages - nr_overcommit_hugepages - nr_pdflush_threads -- nr_pdflush_threads_min -- nr_pdflush_threads_max - nr_trim_pages (only if CONFIG_MMU=n) - numa_zonelist_order - oom_dump_tasks @@ -469,32 +467,6 @@ The default value is 0. ============================================================== -nr_pdflush_threads_min - -This value controls the minimum number of pdflush threads. - -At boot time, the kernel will create and maintain 'nr_pdflush_threads_min' -threads for the kernel's lifetime. - -The default value is 2. The minimum value you can specify is 1, and -the maximum value is the current setting of 'nr_pdflush_threads_max'. - -See 'nr_pdflush_threads_max' below for more information. - -============================================================== - -nr_pdflush_threads_max - -This value controls the maximum number of pdflush threads that can be -created. The pdflush algorithm will create a new pdflush thread (up to -this maximum) if no pdflush threads have been available for >= 1 second. - -The default value is 8. The minimum value you can specify is the -current value of 'nr_pdflush_threads_min' and the -maximum is 1000. - -============================================================== - overcommit_memory: This value contains a flag that enables memory overcommitment. -- cgit v1.1 From d34a792da969a00b0f653c512414411760f55a20 Mon Sep 17 00:00:00 2001 From: Frank Rowand Date: Fri, 15 May 2009 07:56:25 -0500 Subject: kgdb: gdb documentation fix gdb command "set remote debug 1" is not valid, change to correct command. Signed-off-by: Frank Rowand Signed-off-by: Jason Wessel --- Documentation/DocBook/kgdb.tmpl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/kgdb.tmpl b/Documentation/DocBook/kgdb.tmpl index 372dec2..5cff41a 100644 --- a/Documentation/DocBook/kgdb.tmpl +++ b/Documentation/DocBook/kgdb.tmpl @@ -281,7 +281,7 @@ seriously wrong while debugging, it will most often be the case that you want to enable gdb to be verbose about its target communications. You do this prior to issuing the target - remote command by typing in: set remote debug 1 + remote command by typing in: set debug remote 1 -- cgit v1.1 From 705efc3b03cbee449e4d83b230423894152f7982 Mon Sep 17 00:00:00 2001 From: Wang Tinggong Date: Thu, 14 May 2009 22:49:36 +0000 Subject: Doc: fixed descriptions on /proc/sys/net/core/* and /proc/sys/net/unix/* Signed-off-by: Wang Tinggong Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index ec5de02..b121c5d 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -1266,13 +1266,22 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max sctp_wmem - vector of 3 INTEGERs: min, default, max See tcp_wmem for a description. -UNDOCUMENTED: /proc/sys/net/core/* - dev_weight FIXME +dev_weight - INTEGER + The maximum number of packets that kernel can handle on a NAPI + interrupt, it's a Per-CPU variable. + + Default: 64 /proc/sys/net/unix/* - max_dgram_qlen FIXME +max_dgram_qlen - INTEGER + The maximum length of dgram socket receive queue + + Default: 10 + + +UNDOCUMENTED: /proc/sys/net/irda/* fast_poll_increase FIXME -- cgit v1.1