From 4bb9998d388b48dc0a4128bd1f4d85f09ec3b705 Mon Sep 17 00:00:00 2001 From: Masanari Iida Date: Mon, 16 Nov 2015 20:46:28 +0900 Subject: Doc: f2fs: Fix typos in Documentation/filesystems/f2fs.txt This patch fix some typos in Documentation/filesystems/f2fs.txt Signed-off-by: Masanari Iida Acked-by: Randy Dunlap Signed-off-by: Jaegeuk Kim --- Documentation/filesystems/f2fs.txt | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt index b102b43..ad10494 100644 --- a/Documentation/filesystems/f2fs.txt +++ b/Documentation/filesystems/f2fs.txt @@ -102,7 +102,7 @@ background_gc=%s Turn on/off cleaning operations, namely garbage collection, triggered in background when I/O subsystem is idle. If background_gc=on, it will turn on the garbage collection and if background_gc=off, garbage collection - will be truned off. If background_gc=sync, it will turn + will be turned off. If background_gc=sync, it will turn on synchronous garbage collection running in background. Default value for this option is on. So garbage collection is on by default. @@ -145,7 +145,7 @@ extent_cache Enable an extent cache based on rb-tree, it can cache as many as extent which map between contiguous logical address and physical address per inode, resulting in increasing the cache hit ratio. Set by default. -noextent_cache Diable an extent cache based on rb-tree explicitly, see +noextent_cache Disable an extent cache based on rb-tree explicitly, see the above extent_cache mount option. noinline_data Disable the inline data feature, inline data feature is enabled by default. @@ -192,7 +192,7 @@ Files in /sys/fs/f2fs/ policy for garbage collection. Setting gc_idle = 0 (default) will disable this option. Setting gc_idle = 1 will select the Cost Benefit approach - & setting gc_idle = 2 will select the greedy aproach. + & setting gc_idle = 2 will select the greedy approach. 
reclaim_segments This parameter controls the number of prefree segments to be reclaimed. If the number of prefree @@ -298,7 +298,7 @@ The dump.f2fs shows the information of specific inode and dumps SSA and SIT to file. Each file is dump_ssa and dump_sit. The dump.f2fs is used to debug on-disk data structures of the f2fs filesystem. -It shows on-disk inode information reconized by a given inode number, and is +It shows on-disk inode information recognized by a given inode number, and is able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and ./dump_sit respectively. -- cgit v1.1 From 21fc61c73c3903c4c312d0802da01ec2b323d174 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Tue, 17 Nov 2015 01:07:57 -0500 Subject: don't put symlink bodies in pagecache into highmem kmap() in page_follow_link_light() needed to go - allowing to hold an arbitrary number of kmaps for long is a great way to deadlocking the system. new helper (inode_nohighmem(inode)) needs to be used for pagecache symlinks inodes; done for all in-tree cases. page_follow_link_light() instrumented to yell about anything missed. Signed-off-by: Al Viro --- Documentation/filesystems/porting | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index f24d1b8..3eb7c35 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -504,3 +504,8 @@ in your dentry operations instead. [mandatory] __fd_install() & fd_install() can now sleep. Callers should not hold a spinlock or other resources that do not allow a schedule. +-- +[mandatory] + any symlink that might use page_follow_link_light/page_put_link() must + have inode_nohighmem(inode) called before anything might start playing with + its pagecache. 
-- cgit v1.1 From 6b2553918d8b4e6de9853fd6315bec7271a2e592 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Tue, 17 Nov 2015 10:20:54 -0500 Subject: replace ->follow_link() with new method that could stay in RCU mode new method: ->get_link(); replacement of ->follow_link(). The differences are: * inode and dentry are passed separately * might be called both in RCU and non-RCU mode; the former is indicated by passing it a NULL dentry. * when called that way it isn't allowed to block and should return ERR_PTR(-ECHILD) if it needs to be called in non-RCU mode. It's a flagday change - the old method is gone, all in-tree instances converted. Conversion isn't hard; said that, so far very few instances do not immediately bail out when called in RCU mode. That'll change in the next commits. Signed-off-by: Al Viro --- Documentation/filesystems/Locking | 4 ++-- Documentation/filesystems/porting | 6 ++++++ 2 files changed, 8 insertions(+), 2 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 06d4434..4fba54b 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -50,7 +50,7 @@ prototypes: int (*rename2) (struct inode *, struct dentry *, struct inode *, struct dentry *, unsigned int); int (*readlink) (struct dentry *, char __user *,int); - const char *(*follow_link) (struct dentry *, void **); + const char *(*get_link) (struct dentry *, struct inode *, void **); void (*put_link) (struct inode *, void *); void (*truncate) (struct inode *); int (*permission) (struct inode *, int, unsigned int); @@ -83,7 +83,7 @@ rmdir: yes (both) (see below) rename: yes (all) (see below) rename2: yes (all) (see below) readlink: no -follow_link: no +get_link: no put_link: no setattr: yes permission: no (may not block if called in rcu-walk mode) diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 3eb7c35..cf92a8c 100644 --- 
a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -509,3 +509,9 @@ in your dentry operations instead. any symlink that might use page_follow_link_light/page_put_link() must have inode_nohighmem(inode) called before anything might start playing with its pagecache. +-- +[mandatory] + ->follow_link() is replaced with ->get_link(); same API, except that + * ->get_link() gets inode as a separate argument + * ->get_link() may be called in RCU mode - in that case NULL + dentry is passed -- cgit v1.1 From 7acccdbc4d0bf78a613008e92a3e6c32e37ef26f Mon Sep 17 00:00:00 2001 From: Masanari Iida Date: Thu, 10 Dec 2015 00:59:29 +0900 Subject: Doc: treewide: Fix grammar "a" to "an" This patch fix some grammar mistake. Signed-off-by: Masanari Iida Signed-off-by: Jonathan Corbet --- Documentation/filesystems/sharedsubtree.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.txt index 32a173d..e3f4c77 100644 --- a/Documentation/filesystems/sharedsubtree.txt +++ b/Documentation/filesystems/sharedsubtree.txt @@ -664,7 +664,7 @@ replicas continue to be exactly same. if one rbind mounts a tree within the same subtree 'n' times the number of mounts created is an exponential function of 'n'. Having unbindable mount can help prune the unneeded bind - mounts. Here is a example. + mounts. Here is an example. step 1: let's say the root tree has just two directories with -- cgit v1.1 From 343f40f0a70eb7cee9cc8d6fcfbb3917252a5245 Mon Sep 17 00:00:00 2001 From: Chao Yu Date: Wed, 16 Dec 2015 13:12:16 +0800 Subject: f2fs: introduce new option for controlling data flush Add a new option 'data_flush' to enable data flush functionality. 
Signed-off-by: Chao Yu Signed-off-by: Jaegeuk Kim --- Documentation/filesystems/f2fs.txt | 2 ++ 1 file changed, 2 insertions(+) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt index ad10494..e1c9f08 100644 --- a/Documentation/filesystems/f2fs.txt +++ b/Documentation/filesystems/f2fs.txt @@ -149,6 +149,8 @@ noextent_cache Disable an extent cache based on rb-tree explicitly, see the above extent_cache mount option. noinline_data Disable the inline data feature, inline data feature is enabled by default. +data_flush Enable data flushing before checkpoint in order to + persist data of regular and symlink. ================================================================================ DEBUGFS ENTRIES -- cgit v1.1 From fceef393a538134f03b778c5d2519e670269342f Mon Sep 17 00:00:00 2001 From: Al Viro Date: Tue, 29 Dec 2015 15:58:39 -0500 Subject: switch ->get_link() to delayed_call, kill ->put_link() Signed-off-by: Al Viro --- Documentation/filesystems/Locking | 2 -- Documentation/filesystems/porting | 6 ++++++ Documentation/filesystems/vfs.txt | 21 ++++++++++----------- 3 files changed, 16 insertions(+), 13 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 4fba54b..619af9b 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -51,7 +51,6 @@ prototypes: struct inode *, struct dentry *, unsigned int); int (*readlink) (struct dentry *, char __user *,int); const char *(*get_link) (struct dentry *, struct inode *, void **); - void (*put_link) (struct inode *, void *); void (*truncate) (struct inode *); int (*permission) (struct inode *, int, unsigned int); int (*get_acl)(struct inode *, int); @@ -84,7 +83,6 @@ rename: yes (all) (see below) rename2: yes (all) (see below) readlink: no get_link: no -put_link: no setattr: yes permission: no (may not block if called in rcu-walk 
mode) get_acl: no diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index cf92a8c..0f88e60 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -515,3 +515,9 @@ in your dentry operations instead. * ->get_link() gets inode as a separate argument * ->get_link() may be called in RCU mode - in that case NULL dentry is passed +-- +[mandatory] + ->get_link() gets struct delayed_call *done now, and should do + set_delayed_call() where it used to set *cookie. + ->put_link() is gone - just give the destructor to set_delayed_call() + in ->get_link(). diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 8c6f07a..b02a7d5 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -350,8 +350,8 @@ struct inode_operations { int (*rename2) (struct inode *, struct dentry *, struct inode *, struct dentry *, unsigned int); int (*readlink) (struct dentry *, char __user *,int); - const char *(*follow_link) (struct dentry *, void **); - void (*put_link) (struct inode *, void *); + const char *(*get_link) (struct dentry *, struct inode *, + struct delayed_call *); int (*permission) (struct inode *, int); int (*get_acl)(struct inode *, int); int (*setattr) (struct dentry *, struct iattr *); @@ -434,20 +434,19 @@ otherwise noted. readlink: called by the readlink(2) system call. Only required if you want to support reading symbolic links - follow_link: called by the VFS to follow a symbolic link to the + get_link: called by the VFS to follow a symbolic link to the inode it points to. Only required if you want to support symbolic links. This method returns the symlink body to traverse (and possibly resets the current position with nd_jump_link()). 
If the body won't go away until the inode is gone, nothing else is needed; if it needs to be otherwise - pinned, the data needed to release whatever we'd grabbed - is to be stored in void * variable passed by address to - follow_link() instance. - - put_link: called by the VFS to release resources allocated by - follow_link(). The cookie stored by follow_link() is passed - to this method as the last parameter; only called when - cookie isn't NULL. + pinned, arrange for its release by having get_link(..., ..., done) + do set_delayed_call(done, destructor, argument). + In that case destructor(argument) will be called once VFS is + done with the body you've returned. + May be called in RCU mode; that is indicated by NULL dentry + argument. If request can't be handled without leaving RCU mode, + have it return ERR_PTR(-ECHILD). permission: called by the VFS to check for access rights on a POSIX-like filesystem. -- cgit v1.1 From 03607ace807b414eab46323c794b6fb8fcc2d48c Mon Sep 17 00:00:00 2001 From: Pantelis Antoniou Date: Thu, 22 Oct 2015 23:30:04 +0300 Subject: configfs: implement binary attributes ConfigFS lacked binary attributes up until now. This patch introduces support for binary attributes in a somewhat similar manner of sysfs binary attributes albeit with changes that fit the configfs usage model. Problems that configfs binary attributes fix are everything that requires a binary blob as part of the configuration of a resource, such as bitstream loading for FPGAs, DTBs for dynamically created devices etc. Look at Documentation/filesystems/configfs/configfs.txt for internals and howto use them. This patch is against linux-next as of today that contains Christoph's configfs rework. 
Signed-off-by: Pantelis Antoniou [hch: folded a fix from Geert Uytterhoeven ] [hch: a few tiny updates based on review feedback] Signed-off-by: Christoph Hellwig --- Documentation/filesystems/configfs/configfs.txt | 57 +++++++++++++++++++++---- 1 file changed, 48 insertions(+), 9 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs/configfs.txt index af68efd..e5fe521 100644 --- a/Documentation/filesystems/configfs/configfs.txt +++ b/Documentation/filesystems/configfs/configfs.txt @@ -51,15 +51,27 @@ configfs tree is always there, whether mounted on /config or not. An item is created via mkdir(2). The item's attributes will also appear at this time. readdir(3) can determine what the attributes are, read(2) can query their default values, and write(2) can store new -values. Like sysfs, attributes should be ASCII text files, preferably -with only one value per file. The same efficiency caveats from sysfs -apply. Don't mix more than one attribute in one attribute file. - -Like sysfs, configfs expects write(2) to store the entire buffer at -once. When writing to configfs attributes, userspace processes should -first read the entire file, modify the portions they wish to change, and -then write the entire buffer back. Attribute files have a maximum size -of one page (PAGE_SIZE, 4096 on i386). +values. Don't mix more than one attribute in one attribute file. + +There are two types of configfs attributes: + +* Normal attributes, which similar to sysfs attributes, are small ASCII text +files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably +only one value per file should be used, and the same caveats from sysfs apply. +Configfs expects write(2) to store the entire buffer at once. 
When writing to +normal configfs attributes, userspace processes should first read the entire +file, modify the portions they wish to change, and then write the entire +buffer back. + +* Binary attributes, which are somewhat similar to sysfs binary attributes, +but with a few slight changes to semantics. The PAGE_SIZE limitation does not +apply, but the whole binary item must fit in a single kernel vmalloc'ed buffer. +The write(2) calls from user space are buffered, and the attribute's +write_bin_attribute method will be invoked on the final close, therefore it is +imperative for user-space to check the return code of close(2) in order to +verify that the operation finished successfully. +To avoid a malicious user OOMing the kernel, there's a per-binary attribute +maximum buffer value. When an item needs to be destroyed, remove it with rmdir(2). An item cannot be destroyed if any other item has a link to it (via @@ -171,6 +183,7 @@ among other things. For that, it needs a type. struct configfs_item_operations *ct_item_ops; struct configfs_group_operations *ct_group_ops; struct configfs_attribute **ct_attrs; + struct configfs_bin_attribute **ct_bin_attrs; }; The most basic function of a config_item_type is to define what @@ -201,6 +214,32 @@ be called whenever userspace asks for a read(2) on the attribute. If an attribute is writable and provides a ->store method, that method will be be called whenever userspace asks for a write(2) on the attribute. +[struct configfs_bin_attribute] + + struct configfs_bin_attribute { + struct configfs_attribute cb_attr; + void *cb_private; + size_t cb_max_size; + }; + +The binary attribute is used when one needs a binary blob to +appear as the contents of a file in the item's configfs directory. +To do so, add the binary attribute to the NULL-terminated array +config_item_type->ct_bin_attrs, and when the item appears in configfs, the +attribute file will appear with the configfs_bin_attribute->cb_attr.ca_name +filename.
configfs_bin_attribute->cb_attr.ca_mode specifies the file +permissions. +The cb_private member is provided for use by the driver, while the +cb_max_size member specifies the maximum amount of vmalloc buffer +to be used. + +If a binary attribute is readable and the config_item provides a +ct_item_ops->read_bin_attribute() method, that method will be called +whenever userspace asks for a read(2) on the attribute. The converse +will happen for write(2). The reads/writes are buffered so only a +single read/write will occur; the attribute's methods need not concern themselves +with it. + [struct config_group] A config_item cannot live in a vacuum. The only way one can be created -- cgit v1.1 From ceec86ec06177f9eca684a489ddf630432023284 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 13 Jan 2016 16:47:56 +0900 Subject: Documentation: update libhugetlbfs site url The site for libhugetlbfs has moved from sourceforge to github. This commit updates the old url. Signed-off-by: SeongJae Park Acked-by: Mike Kravetz Signed-off-by: Jonathan Corbet --- Documentation/filesystems/proc.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 402ab99..5d4f28a 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -807,7 +807,7 @@ by migrate-type and finishes with details on how many page blocks of each type exist. If min_free_kbytes has been tuned correctly (recommendations made by hugeadm -from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can +from libhugetlbfs https://github.com/libhugetlbfs/libhugetlbfs/), one can make an estimate of the likely number of huge pages that can be allocated at a given point in time. All the "Movable" blocks should be allocatable unless memory has been mlock()'d.
Some of the Reclaimable blocks should -- cgit v1.1 From e8ecde25f5e08f89b61d86c32bbb56b405e90c32 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Thu, 14 Jan 2016 17:52:59 -0500 Subject: Make sure that highmem pages are not added to symlink page cache inode_nohighmem() is sufficient to make sure that page_get_link() won't try to allocate a highmem page. Moreover, it is sufficient to make sure that page_symlink/__page_symlink won't do the same thing. However, any filesystem that manually preseeds the symlink's page cache upon symlink(2) needs to make sure that the page it inserts there won't be a highmem one. Fortunately, only nfs and shmem have run afoul of that... Signed-off-by: Al Viro --- Documentation/filesystems/porting | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 0f88e60..f1b87d8 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -508,7 +508,11 @@ in your dentry operations instead. [mandatory] any symlink that might use page_follow_link_light/page_put_link() must have inode_nohighmem(inode) called before anything might start playing with - its pagecache. + its pagecache. No highmem pages should end up in the pagecache of such + symlinks. That includes any preseeding that might be done during symlink + creation. __page_symlink() will honour the mapping gfp flags, so once + you've done inode_nohighmem() it's safe to use, but if you allocate and + insert the page manually, make sure to use the right gfp flags. 
-- [mandatory] ->follow_link() is replaced with ->get_link(); same API, except that -- cgit v1.1 From bf9683d6990589390b5178dafe8fd06808869293 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 14 Jan 2016 15:19:14 -0800 Subject: mm, documentation: clarify /proc/pid/status VmSwap limitations for shmem This series is based on Jerome Marchand's [1] so let me quote the first paragraph from there: There are several shortcomings with the accounting of shared memory (sysV shm, shared anonymous mapping, mapping to a tmpfs file). The values in /proc/<pid>/status and statm don't allow distinguishing between shmem memory and a shared mapping to a regular file, even though their implications on memory usage are quite different: at reclaim, file mapping can be dropped or written back on disk while shmem needs a place in swap. As for shmem pages that are swapped-out or in swap cache, they aren't accounted at all. My original motivation is that a customer found it confusing (IMHO rightfully) that e.g. top output for process swap usage is unreliable with respect to swapped out shmem pages, which are not accounted for. The fundamental difference between private anonymous and shmem pages is that the latter have their PTEs converted to pte_none, and not swapents. As such, they are not accounted to the number of swapents visible e.g. in /proc/pid/status VmSwap row. It might be theoretically possible to use swapents when swapping out shmem (without extra cost, as one has to change all mappers anyway), and on swap in only convert the swapent for the faulting process, leaving swapents in other processes until they also fault (so again no extra cost). But I don't know how many assumptions this would break, and it would be too disruptive a change for a relatively small benefit. Instead, my approach is to document the limitation of VmSwap, and provide means to determine the swap usage for shmem areas for those who are interested and willing to pay the price, using /proc/pid/smaps.
Because outside of ipcs, I don't think it's currently possible to determine the usage at all. The previous patchset [1] did introduce new shmem-specific fields into smaps output, and functions to determine the values. I take a simpler approach, noting that smaps output already has a "Swap: X kB" line, where currently X == 0 always for shmem areas. I think we can just consider this a bug and provide the proper value by consulting the radix tree, as e.g. mincore_page() does. In the patch changelog I explain why this is also not perfect (and cannot be without swapents), but still arguably much better than showing a 0. The last two patches are adapted from Jerome's patchset and provide a VmRSS breakdown to RssAnon, RssFile and RssShmem in /proc/pid/status. Hugh noted that this is a welcome addition, and I agree that it might help e.g. debugging process memory usage at the albeit non-zero, but still rather low, cost of an extra per-mm counter and some page flag checks. [1] http://lwn.net/Articles/611966/ This patch (of 6): The documentation for /proc/pid/status does not mention that the value of VmSwap counts only swapped out anonymous private pages, and not swapped out pages of the underlying shmem objects (for shmem mappings). This is not obvious, so document this limitation.
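To make the limitation concrete, here is a minimal userspace sketch (not part of this patch) that sums the "Swap:" lines of smaps-style text and compares the total with VmSwap. It assumes a kernel where smaps reports swapped-out shmem pages, as this series proposes; the helper names and the sample excerpt are hypothetical, and the result is only a rough estimate of the shmem swap visible through one process's mappings.

```python
import re

# Matches "Swap:  N kB" but not "SwapPss:  N kB".
_SWAP_LINE = re.compile(r"^Swap:\s+(\d+)\s+kB$")

# Hypothetical smaps excerpt (made-up values): one private anonymous
# mapping and one shared mapping of a tmpfs file.
SAMPLE_SMAPS = """\
7f0000000000-7f0000100000 rw-p 00000000 00:00 0
Rss:                 256 kB
Swap:                128 kB
SwapPss:             128 kB
7f0000200000-7f0000400000 rw-s 00000000 00:05 1234 /dev/shm/example
Rss:                1024 kB
Swap:                512 kB
SwapPss:               0 kB
"""


def smaps_swap_kb(text):
    """Sum every 'Swap: N kB' line in smaps-style text, in kB."""
    total = 0
    for line in text.splitlines():
        m = _SWAP_LINE.match(line.strip())
        if m:
            total += int(m.group(1))
    return total


def shmem_swap_estimate_kb(smaps_text, vmswap_kb):
    """Swap seen in smaps that VmSwap (anonymous private only) leaves out."""
    return smaps_swap_kb(smaps_text) - vmswap_kb


if __name__ == "__main__":
    # VmSwap would be read from /proc/<pid>/status; 128 kB is assumed here.
    print(shmem_swap_estimate_kb(SAMPLE_SMAPS, 128))
```

On a live system the two files are read at different moments, so the difference can be slightly off even when nothing else is wrong.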
Signed-off-by: Vlastimil Babka Acked-by: Konstantin Khlebnikov Acked-by: Michal Hocko Acked-by: Jerome Marchand Acked-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 402ab99..9f13b6e 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -238,7 +238,8 @@ Table 1-2: Contents of the status files (as of 4.1) VmLib size of shared library code VmPTE size of page table entries VmPMD size of second level page tables - VmSwap size of swap usage (the number of referred swapents) + VmSwap amount of swap used by anonymous private data + (shmem swap usage is not included) HugetlbPages size of hugetlb memory portions Threads number of threads SigQ number of signals queued/max. number for queue -- cgit v1.1 From c261e7d94f0dd33a34b6cf98686e8b9699b62340 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 14 Jan 2016 15:19:17 -0800 Subject: mm, proc: account for shmem swap in /proc/pid/smaps Currently, /proc/pid/smaps will always show "Swap: 0 kB" for shmem-backed mappings, even if the mapped portion does contain pages that were swapped out. This is because unlike private anonymous mappings, shmem does not change pte to swap entry, but pte_none when swapping the page out. In the smaps page walk, such page thus looks like it was never faulted in. This patch changes smaps_pte_entry() to determine the swap status for such pte_none entries for shmem mappings, similarly to how mincore_page() does it. Swapped out shmem pages are thus accounted for. For private mappings of tmpfs files that COWed some of the pages, swaped out status of the original shmem pages is naturally ignored. 
If some of the private copies were also swapped out, they are accounted via their page table swap entries, so the resulting reported swap usage is then a sum of both swapped out private copies, and swapped out shmem pages that were not COWed. No double accounting can thus happen. The accounting is arguably still not as precise as for private anonymous mappings, since now we will also count pages that the process in question never accessed, but another process populated them and then let them become swapped out. I believe it is still less confusing and subtle than not showing any swap usage by shmem mappings at all. A swapped-out counter might be of interest to users who would like to prevent future swapins during performance-critical operations and pre-fault them at their convenience. Especially for larger swapped out regions the cost of swapin is much higher than a fresh page allocation. So a differentiation between pte_none vs. swapped out is important for those usecases. One downside of this patch is that it makes /proc/pid/smaps more expensive for shmem mappings, as we consult the radix tree for each pte_none entry, so the overall complexity is O(n*log(n)). I have measured this on a process that creates a 2GB mapping and dirties single pages with a stride of 2MB, and timed how long it takes to cat /proc/pid/smaps of this process 100 times. Private anonymous mapping: real 0m0.949s user 0m0.116s sys 0m0.348s Mapping of a /dev/shm/file: real 0m3.831s user 0m0.180s sys 0m3.212s The difference is rather substantial, so the next patch will reduce the cost for shared or read-only mappings. In a less controlled experiment, I've gathered pids of processes on my desktop that have either '/dev/shm/*' or 'SYSV*' in smaps. This included the Chrome browser and some KDE processes. Again, I've run cat /proc/pid/smaps on each 100 times.
Before this patch: real 0m9.050s user 0m0.518s sys 0m8.066s After this patch: real 0m9.221s user 0m0.541s sys 0m8.187s This suggests low impact on average systems. Note that this patch doesn't attempt to adjust the SwapPss field for shmem mappings, which would need extra work to determine who else could have the pages mapped. Thus the value stays zero except for COWed swapped out pages in a shmem mapping, which are accounted as usual. Signed-off-by: Vlastimil Babka Acked-by: Konstantin Khlebnikov Acked-by: Jerome Marchand Acked-by: Michal Hocko Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 9f13b6e..fdeb5b3 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -460,7 +460,10 @@ and a page is modified, the file page is replaced by a private anonymous copy. hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field. "Swap" shows how much would-be-anonymous memory is also used, but out on swap. -"SwapPss" shows proportional swap share of this mapping. +For shmem mappings, "Swap" includes also the size of the mapped (and not +replaced by copy-on-write) part of the underlying shmem object out on swap. +"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this +does not take into account swapped out page of underlying shmem objects. "Locked" indicates whether the mapping is locked in memory or not. "VmFlags" field deserves a separate description. 
This member represents the kernel -- cgit v1.1 From 8cee852ec53fb530f10ccabf1596734209ae336b Mon Sep 17 00:00:00 2001 From: Jerome Marchand Date: Thu, 14 Jan 2016 15:19:29 -0800 Subject: mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status There are several shortcomings with the accounting of shared memory (SysV shm, shared anonymous mapping, mapping of a tmpfs file). The values in /proc/<pid>/status and <...>/statm don't allow distinguishing between shmem memory and a shared mapping to a regular file, even though their implications on memory usage are quite different: during reclaim, file mapping can be dropped or written back on disk, while shmem needs a place in swap. Also, to distinguish the memory occupied by anonymous and file mappings, one has to read the /proc/pid/statm file, which has a field for the file mappings (again, including shmem) and total memory occupied by these mappings (i.e. equivalent to VmRSS in the <...>/status file). Getting the value for anonymous mappings only is thus not exactly user-friendly (the statm file is intended to be rather efficiently machine-readable). To address both of these shortcomings, this patch adds a breakdown of VmRSS in /proc/<pid>/status via new fields RssAnon, RssFile and RssShmem, making use of the previous preparatory patch. These fields tell the user the memory occupied by private anonymous pages, mapped regular files and shmem, respectively. Other existing fields in /status and /statm files are left without change. The /statm file can be extended in the future, if there's a need for that.
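As a rough illustration of how userspace might consume the new fields (a sketch, not part of this patch), the Python below parses status-style text and checks that VmRSS is the sum of RssAnon, RssFile and RssShmem. The helper names parse_status_kb and rss_breakdown are hypothetical; the sample values are the ones from the example output quoted in this changelog.

```python
import re

# Matches lines of the form "Name:   N kB"; other status lines are skipped.
_KB_LINE = re.compile(r"^(\w+):\s+(\d+)\s+kB$")

# Excerpt in /proc/<pid>/status format, values taken from this changelog.
SAMPLE_STATUS = """\
VmHWM:      5108 kB
VmRSS:      5108 kB
RssAnon:      92 kB
RssFile:    1324 kB
RssShmem:   3692 kB
VmSwap:        0 kB
"""


def parse_status_kb(text):
    """Return {field: value_in_kB} for every 'Name: N kB' line in text."""
    fields = {}
    for line in text.splitlines():
        m = _KB_LINE.match(line.strip())
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def rss_breakdown(fields):
    """Return (anon, file, shmem) in kB.

    Raises KeyError when the Rss* fields are absent, e.g. on a kernel
    without this patch.
    """
    return fields["RssAnon"], fields["RssFile"], fields["RssShmem"]


if __name__ == "__main__":
    fields = parse_status_kb(SAMPLE_STATUS)
    anon, file_kb, shmem = rss_breakdown(fields)
    print(anon + file_kb + shmem == fields["VmRSS"])
```

To try it on a live process, pass the contents of /proc/self/status to parse_status_kb() instead of the sample text.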

Example (part of) /proc/pid/status output including the new Rss* fields:

  VmPeak:        2001008 kB
  VmSize:        2001004 kB
  VmLck:               0 kB
  VmPin:               0 kB
  VmHWM:            5108 kB
  VmRSS:            5108 kB
  RssAnon:            92 kB
  RssFile:          1324 kB
  RssShmem:         3692 kB
  VmData:            192 kB
  VmStk:             136 kB
  VmExe:               4 kB
  VmLib:            1784 kB
  VmPTE:            3928 kB
  VmPMD:              20 kB
  VmSwap:              0 kB
  HugetlbPages:        0 kB

[vbabka@suse.cz: forward-porting, tweak changelog]
Signed-off-by: Jerome Marchand
Signed-off-by: Vlastimil Babka
Acked-by: Konstantin Khlebnikov
Acked-by: Michal Hocko
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/filesystems/proc.txt | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

(limited to 'Documentation/filesystems')

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index fdeb5b3..ffcd495 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -169,6 +169,9 @@ read the file /proc/PID/status:
   VmLck:         0 kB
   VmHWM:       476 kB
   VmRSS:       476 kB
+  RssAnon:     352 kB
+  RssFile:     120 kB
+  RssShmem:      4 kB
   VmData:      156 kB
   VmStk:        88 kB
   VmExe:        68 kB
@@ -231,7 +234,12 @@ Table 1-2: Contents of the status files (as of 4.1)
  VmSize                      total program size
  VmLck                       locked memory size
  VmHWM                       peak resident set size ("high water mark")
- VmRSS                       size of memory portions
+ VmRSS                       size of memory portions. It contains the three
+                             following parts (VmRSS = RssAnon + RssFile + RssShmem)
+ RssAnon                     size of resident anonymous memory
+ RssFile                     size of resident file mappings
+ RssShmem                    size of resident shmem memory (includes SysV shm,
+                             mapping of tmpfs and shared anonymous mappings)
  VmData                      size of data, stack, and text segments
  VmStk                       size of data, stack, and text segments
  VmExe                       size of text segment
@@ -266,7 +274,8 @@ Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
  Field    Content
  size     total program size (pages)            (same as VmSize in status)
  resident size of memory portions (pages)       (same as VmRSS in status)
- shared   number of pages that are shared       (i.e. backed by a file)
+ shared   number of pages that are shared       (i.e. backed by a file, same
+                                                as RssFile+RssShmem in status)
  trs      number of pages that are 'code'       (not including libs; broken,
                                                 includes data segment)
  lrs      number of pages of library            (always 0 on 2.6)
--
cgit v1.1

From 0bc126d460453736c0e03d9da7ae0e9d4fcf86b3 Mon Sep 17 00:00:00 2001
From: Rodrigo Freire
Date: Thu, 14 Jan 2016 15:21:58 -0800
Subject: Documentation/filesystems: describe the shared memory usage/accounting

Shared memory accounting support has been present in the kernel since
commit 4b02108ac1b3 ("mm: oom analysis: add shmem vmstat") and in userland
free(1) since 2014. This patch updates the Documentation to reflect this
change.
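
For illustration only, a minimal Python sketch of reading the system-wide
Shmem counter that this patch documents. The helper name meminfo_kb is
hypothetical, the sample text reuses values from the proc.txt example, and
the sketch assumes the "Name: value kB" layout shown there.

```python
# Hypothetical helper (not from the patch): return one counter from
# /proc/meminfo-style text in kB, or None when the field is absent
# (for example on a kernel that does not report it).
def meminfo_kb(text, name):
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key.strip() == name:
            return int(rest.split()[0])
    return None

# Sample values copied from the proc.txt example in the diff.
SAMPLE = """\
Mapped:           280372 kB
Shmem:               644 kB
Slab:             284364 kB
"""

if __name__ == "__main__":
    print(meminfo_kb(SAMPLE, "Shmem"))
    # On Linux, the live value could be read the same way:
    #   with open("/proc/meminfo") as f:
    #       print(meminfo_kb(f.read(), "Shmem"))
```

Returning None for a missing field, rather than raising, lets a caller tell
"counter not reported" apart from a counter that is reported as zero.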

Signed-off-by: Rodrigo Freire
Acked-by: Vlastimil Babka
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/filesystems/proc.txt  | 2 ++
 Documentation/filesystems/tmpfs.txt | 8 ++++----
 2 files changed, 6 insertions(+), 4 deletions(-)

(limited to 'Documentation/filesystems')

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index ffcd495..e95aa1c 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -855,6 +855,7 @@ Dirty: 968 kB
 Writeback:           0 kB
 AnonPages:      861800 kB
 Mapped:         280372 kB
+Shmem:             644 kB
 Slab:           284364 kB
 SReclaimable:   159856 kB
 SUnreclaim:     124508 kB
@@ -911,6 +912,7 @@ MemAvailable: An estimate of how much memory is available for starting new
    AnonPages: Non-file backed pages mapped into userspace page tables
 AnonHugePages: Non-file backed huge pages mapped into userspace page tables
       Mapped: files which have been mmaped, such as libraries
+       Shmem: Total memory used by shared memory (shmem) and tmpfs
         Slab: in-kernel data structures cache
 SReclaimable: Part of Slab, that might be reclaimed, such as caches
   SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
index 98ef551..d392e15 100644
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -17,10 +17,10 @@ RAM, where you have to create an ordinary filesystem on top. Ramdisks
 cannot swap and you do not have the possibility to resize them.

 Since tmpfs lives completely in the page cache and on swap, all tmpfs
-pages currently in memory will show up as cached. It will not show up
-as shared or something like that. Further on you can check the actual
-RAM+swap use of a tmpfs instance with df(1) and du(1).
-
+pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
+free(1). Notice that these counters also include shared memory
+(shmem, see ipcs(1)). The most reliable way to get the count is
+using df(1) and du(1).

 tmpfs has the following uses:

--
cgit v1.1

From 28016128d37a46d89ac5d9a450709284148989d6 Mon Sep 17 00:00:00 2001
From: Namjae Jeon
Date: Wed, 20 Jan 2016 14:59:49 -0800
Subject: Documentation/filesystems/vfat.txt: update the limitation for fat fallocate

Update the limitation for fat fallocate.

Signed-off-by: Namjae Jeon
Signed-off-by: Amit Sahrawat
Cc: OGAWA Hirofumi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/filesystems/vfat.txt | 10 ++++++++++
 1 file changed, 10 insertions(+)

(limited to 'Documentation/filesystems')

diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt
index ce1126a..223c321 100644
--- a/Documentation/filesystems/vfat.txt
+++ b/Documentation/filesystems/vfat.txt
@@ -180,6 +180,16 @@ dos1xfloppy -- If set, use a fallback default BIOS Parameter Block

  : 0,1,yes,no,true,false

+LIMITATION
+---------------------------------------------------------------------
+* The fallocated region of file is discarded at umount/evict time
+  when using fallocate with FALLOC_FL_KEEP_SIZE.
+  So, User should assume that fallocated region can be discarded at
+  last close if there is memory pressure resulting in eviction of
+  the inode from the memory. As a result, for any dependency on
+  the fallocated region, user should make sure to recheck fallocate
+  after reopening the file.
+
 TODO
 ----------------------------------------------------------------------
 * Need to get rid of the raw scanning stuff. Instead, always use
--
cgit v1.1

From 65376df582174ffcec9e6471bf5b0dd79ba05e4a Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Tue, 2 Feb 2016 16:57:29 -0800
Subject: proc: revert /proc/<pid>/maps [stack:TID] annotation

Commit b76437579d13 ("procfs: mark thread stack correctly in
proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps.
Finding the task of a stack VMA requires walking the entire thread list,
turning this into quadratic behavior: a thousand threads means a thousand
stacks, so the rendering of /proc/<pid>/maps needs to look at a million
combinations.

The cost is not in proportion to the usefulness as described in the patch.

Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
/proc/<pid>/numa_maps) usable again for higher thread counts.

The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as
identifying the stack VMA there is an O(1) operation.

Siddhesh said:
 "The end users needed a way to identify thread stacks programmatically and
  there wasn't a way to do that. I'm afraid I no longer remember (or have
  access to the resources that would aid my memory since I changed
  employers) the details of their requirement. However, I did do this on my
  own time because I thought it was an interesting project for me and nobody
  really gave any feedback then as to its utility, so as far as I am
  concerned you could roll back the main thread maps information since the
  information is available in the thread-specific files"

Signed-off-by: Johannes Weiner
Cc: "Kirill A. Shutemov"
Cc: Siddhesh Poyarekar
Cc: Shaohua Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/filesystems/proc.txt | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

(limited to 'Documentation/filesystems')

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index fde9fd0..eaebf27 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -356,7 +356,7 @@ address perms offset dev inode pathname
 a7cb1000-a7cb2000 ---p 00000000 00:00 0
 a7cb2000-a7eb2000 rw-p 00000000 00:00 0
 a7eb2000-a7eb3000 ---p 00000000 00:00 0
-a7eb3000-a7ed5000 rw-p 00000000 00:00 0          [stack:1001]
+a7eb3000-a7ed5000 rw-p 00000000 00:00 0
 a7ed5000-a8008000 r-xp 00000000 03:00 4222       /lib/libc.so.6
 a8008000-a800a000 r--p 00133000 03:00 4222       /lib/libc.so.6
 a800a000-a800b000 rw-p 00135000 03:00 4222       /lib/libc.so.6
@@ -388,7 +388,6 @@ is not associated with a file:

  [heap]                   = the heap of the program
  [stack]                  = the stack of the main process
- [stack:1001]             = the stack of the thread with tid 1001
  [vdso]                   = the "virtual dynamic shared object",
                             the kernel system call handler

@@ -396,10 +395,8 @@ is not associated with a file:
 The /proc/PID/task/TID/maps is a view of the virtual memory from the viewpoint
 of the individual tasks of a process. In this file you will see a mapping marked
-as [stack] if that task sees it as a stack. This is a key difference from the
-content of /proc/PID/maps, where you will see all mappings that are being used
-as stack by all of those tasks. Hence, for the example above, the task-level
-map, i.e. /proc/PID/task/TID/maps for thread 1001 will look like this:
+as [stack] if that task sees it as a stack. Hence, for the example above, the
+task-level map, i.e. /proc/PID/task/TID/maps for thread 1001 will look like this:

 08048000-08049000 r-xp 00000000 03:00 8312       /opt/test
 08049000-0804a000 rw-p 00001000 03:00 8312       /opt/test
--
cgit v1.1

From 30bdbb78009e67767983085e302bec6d97afc679 Mon Sep 17 00:00:00 2001
From: Konstantin Khlebnikov
Date: Tue, 2 Feb 2016 16:57:46 -0800
Subject: mm: polish virtual memory accounting

* add VM_STACK as alias for VM_GROWSUP/DOWN depending on architecture
* always account VMAs with flag VM_STACK as stack (as it was before)
* cleanup classifying helpers
* update comments and documentation

Signed-off-by: Konstantin Khlebnikov
Tested-by: Sudip Mukherjee
Cc: Cyrill Gorcunov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/filesystems/proc.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'Documentation/filesystems')

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index eaebf27..843b045b 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -240,8 +240,8 @@ Table 1-2: Contents of the status files (as of 4.1)
  RssFile                     size of resident file mappings
  RssShmem                    size of resident shmem memory (includes SysV shm,
                              mapping of tmpfs and shared anonymous mappings)
- VmData                      size of data, stack, and text segments
- VmStk                       size of data, stack, and text segments
+ VmData                      size of private data segments
+ VmStk                       size of stack segments
  VmExe                       size of text segment
  VmLib                       size of shared library code
  VmPTE                       size of page table entries
--
cgit v1.1

From ed8b0de5a33d2a2557dce7f9429dca8cb5bc5879 Mon Sep 17 00:00:00 2001
From: Peter Jones
Date: Mon, 8 Feb 2016 14:48:15 -0500
Subject: efi: Make efivarfs entries immutable by default

"rm -rf" is bricking some people's laptops because of variables being used
to store non-reinitializable firmware driver data that's required to POST
the hardware.

These are 100% bugs, and they need to be fixed, but in the mean time it
shouldn't be easy to *accidentally* brick machines.

We have to have delete working, and picking which variables do and don't
work for deletion is quite intractable, so instead make everything
immutable by default (except for a whitelist), and make tools that aren't
quite so broad-spectrum unset the immutable flag.

Signed-off-by: Peter Jones
Tested-by: Lee, Chun-Yi
Acked-by: Matthew Garrett
Signed-off-by: Matt Fleming
---
 Documentation/filesystems/efivarfs.txt | 7 +++++++
 1 file changed, 7 insertions(+)

(limited to 'Documentation/filesystems')

diff --git a/Documentation/filesystems/efivarfs.txt b/Documentation/filesystems/efivarfs.txt
index c477af0..686a64b 100644
--- a/Documentation/filesystems/efivarfs.txt
+++ b/Documentation/filesystems/efivarfs.txt
@@ -14,3 +14,10 @@ filesystem.
 efivarfs is typically mounted like this,

 	mount -t efivarfs none /sys/firmware/efi/efivars
+
+Due to the presence of numerous firmware bugs where removing non-standard
+UEFI variables causes the system firmware to fail to POST, efivarfs
+files that are not well-known standardized variables are created
+as immutable files. This doesn't prevent removal - "chattr -i" will work -
+but it does prevent this kind of failure from being accomplished
+accidentally.
--
cgit v1.1