diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2015-04-15 16:39:15 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2015-04-15 16:39:15 -0700 |
commit | eea3a00264cf243a28e4331566ce67b86059339d (patch) | |
tree | 487f16389e0dfa32e9caa7604d1274a7dcda8f04 | |
parent | e7c82412433a8039616c7314533a0a1c025d99bf (diff) | |
parent | e693d73c20ffdb06840c9378f367bad849ac0d5d (diff) | |
download | op-kernel-dev-eea3a00264cf243a28e4331566ce67b86059339d.zip op-kernel-dev-eea3a00264cf243a28e4331566ce67b86059339d.tar.gz |
Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:
- the rest of MM
- various misc bits
- add ability to run /sbin/reboot at reboot time
- printk/vsprintf changes
- fiddle with seq_printf() return value
* akpm: (114 commits)
parisc: remove use of seq_printf return value
lru_cache: remove use of seq_printf return value
tracing: remove use of seq_printf return value
cgroup: remove use of seq_printf return value
proc: remove use of seq_printf return value
s390: remove use of seq_printf return value
cris fasttimer: remove use of seq_printf return value
cris: remove use of seq_printf return value
openrisc: remove use of seq_printf return value
ARM: plat-pxa: remove use of seq_printf return value
nios2: cpuinfo: remove use of seq_printf return value
microblaze: mb: remove use of seq_printf return value
ipc: remove use of seq_printf return value
rtc: remove use of seq_printf return value
power: wakeup: remove use of seq_printf return value
x86: mtrr: if: remove use of seq_printf return value
linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
MAINTAINERS: CREDITS: remove Stefano Brivio from B43
.mailmap: add Ricardo Ribalda
CREDITS: add Ricardo Ribalda Delgado
...
136 files changed, 3279 insertions, 1814 deletions
@@ -100,6 +100,7 @@ Rajesh Shah <rajesh.shah@intel.com> Ralf Baechle <ralf@linux-mips.org> Ralf Wildenhues <Ralf.Wildenhues@gmx.de> RĂ©mi Denis-Courmont <rdenis@simphalempin.com> +Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com> Rudolf Marek <R.Marek@sh.cvut.cz> Rui Saraiva <rmps@joel.ist.utl.pt> Sachin P Sant <ssant@in.ibm.com> @@ -508,6 +508,10 @@ E: paul@paulbristow.net W: http://paulbristow.net/linux/idefloppy.html D: Maintainer of IDE/ATAPI floppy driver +N: Stefano Brivio +E: stefano.brivio@polimi.it +D: Broadcom B43 driver + N: Dominik Brodowski E: linux@brodo.de W: http://www.brodo.de/ @@ -3008,6 +3012,19 @@ W: http://www.qsl.net/dl1bke/ D: Generic Z8530 driver, AX.25 DAMA slave implementation D: Several AX.25 hacks +N: Ricardo Ribalda Delgado +E: ricardo.ribalda@gmail.com +W: http://ribalda.com +D: PLX USB338x driver +D: PCA9634 driver +D: Option GTM671WFS +D: Fintek F81216A +D: Various kernel hacks +S: Qtechnology A/S +S: Valby Langgade 142 +S: 2500 Valby +S: Denmark + N: Francois-Rene Rideau E: fare@tunes.org W: http://www.tunes.org/~fare diff --git a/Documentation/ABI/obsolete/sysfs-block-zram b/Documentation/ABI/obsolete/sysfs-block-zram new file mode 100644 index 0000000..720ea92 --- /dev/null +++ b/Documentation/ABI/obsolete/sysfs-block-zram @@ -0,0 +1,119 @@ +What: /sys/block/zram<id>/num_reads +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The num_reads file is read-only and specifies the number of + reads (failed or successful) done on this device. + Now accessible via zram<id>/stat node. + +What: /sys/block/zram<id>/num_writes +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The num_writes file is read-only and specifies the number of + writes (failed or successful) done on this device. + Now accessible via zram<id>/stat node. + +What: /sys/block/zram<id>/invalid_io +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The invalid_io file is read-only and specifies the number of + non-page-size-aligned I/O requests issued to this device. + Now accessible via zram<id>/io_stat node. + +What: /sys/block/zram<id>/failed_reads +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The failed_reads file is read-only and specifies the number of + failed reads happened on this device. + Now accessible via zram<id>/io_stat node. + +What: /sys/block/zram<id>/failed_writes +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The failed_writes file is read-only and specifies the number of + failed writes happened on this device. + Now accessible via zram<id>/io_stat node. + +What: /sys/block/zram<id>/notify_free +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The notify_free file is read-only. Depending on device usage + scenario it may account a) the number of pages freed because + of swap slot free notifications or b) the number of pages freed + because of REQ_DISCARD requests sent by bio. The former ones + are sent to a swap block device when a swap slot is freed, which + implies that this disk is being used as a swap disk. The latter + ones are sent by filesystem mounted with discard option, + whenever some data blocks are getting discarded. + Now accessible via zram<id>/io_stat node. + +What: /sys/block/zram<id>/zero_pages +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The zero_pages file is read-only and specifies number of zero + filled pages written to this disk. No memory is allocated for + such pages. + Now accessible via zram<id>/mm_stat node. + +What: /sys/block/zram<id>/orig_data_size +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The orig_data_size file is read-only and specifies uncompressed + size of data stored in this disk. This excludes zero-filled + pages (zero_pages) since no memory is allocated for them. + Unit: bytes + Now accessible via zram<id>/mm_stat node. + +What: /sys/block/zram<id>/compr_data_size +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The compr_data_size file is read-only and specifies compressed + size of data stored in this disk. So, compression ratio can be + calculated using orig_data_size and this statistic. + Unit: bytes + Now accessible via zram<id>/mm_stat node. + +What: /sys/block/zram<id>/mem_used_total +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The mem_used_total file is read-only and specifies the amount + of memory, including allocator fragmentation and metadata + overhead, allocated for this disk. So, allocator space + efficiency can be calculated using compr_data_size and this + statistic. + Unit: bytes + Now accessible via zram<id>/mm_stat node. + +What: /sys/block/zram<id>/mem_used_max +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The mem_used_max file is read/write and specifies the amount + of maximum memory zram have consumed to store compressed data. + For resetting the value, you should write "0". Otherwise, + you could see -EINVAL. + Unit: bytes + Downgraded to write-only node: so it's possible to set new + value only; its current value is stored in zram<id>/mm_stat + node. + +What: /sys/block/zram<id>/mem_limit +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The mem_limit file is read/write and specifies the maximum + amount of memory ZRAM can use to store the compressed data. + The limit could be changed in run time and "0" means disable + the limit. No limit is the initial state. Unit: bytes + Downgraded to write-only node: so it's possible to set new + value only; its current value is stored in zram<id>/mm_stat + node. diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index a6148ea..2e69e83 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -141,3 +141,28 @@ Description: amount of memory ZRAM can use to store the compressed data. The limit could be changed in run time and "0" means disable the limit. No limit is the initial state. Unit: bytes + +What: /sys/block/zram<id>/compact +Date: August 2015 +Contact: Minchan Kim <minchan@kernel.org> +Description: + The compact file is write-only and trigger compaction for + allocator zrm uses. The allocator moves some objects so that + it could free fragment space. + +What: /sys/block/zram<id>/io_stat +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The io_stat file is read-only and accumulates device's I/O + statistics not accounted by block layer. For example, + failed_reads, failed_writes, etc. File format is similar to + block layer statistics file format. + +What: /sys/block/zram<id>/mm_stat +Date: August 2015 +Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> +Description: + The mm_stat file is read-only and represents device's mm + statistics (orig_data_size, compr_data_size, etc.) in a format + similar to block layer statistics file format. diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 7fcf9c6..48a183e 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -98,20 +98,79 @@ size of the disk when not in use so a huge zram is wasteful. mount /dev/zram1 /tmp 7) Stats: - Per-device statistics are exported as various nodes under - /sys/block/zram<id>/ - disksize - num_reads - num_writes - failed_reads - failed_writes - invalid_io - notify_free - zero_pages - orig_data_size - compr_data_size - mem_used_total - mem_used_max +Per-device statistics are exported as various nodes under /sys/block/zram<id>/ + +A brief description of exported device attritbutes. For more details please +read Documentation/ABI/testing/sysfs-block-zram. + +Name access description +---- ------ ----------- +disksize RW show and set the device's disk size +initstate RO shows the initialization state of the device +reset WO trigger device reset +num_reads RO the number of reads +failed_reads RO the number of failed reads +num_write RO the number of writes +failed_writes RO the number of failed writes +invalid_io RO the number of non-page-size-aligned I/O requests +max_comp_streams RW the number of possible concurrent compress operations +comp_algorithm RW show and change the compression algorithm +notify_free RO the number of notifications to free pages (either + slot free notifications or REQ_DISCARD requests) +zero_pages RO the number of zero filled pages written to this disk +orig_data_size RO uncompressed size of data stored in this disk +compr_data_size RO compressed size of data stored in this disk +mem_used_total RO the amount of memory allocated for this disk +mem_used_max RW the maximum amount memory zram have consumed to + store compressed data +mem_limit RW the maximum amount of memory ZRAM can use to store + the compressed data +num_migrated RO the number of objects migrated migrated by compaction + + +WARNING +======= +per-stat sysfs attributes are considered to be deprecated. +The basic strategy is: +-- the existing RW nodes will be downgraded to WO nodes (in linux 4.11) +-- deprecated RO sysfs nodes will eventually be removed (in linux 4.11) + +The list of deprecated attributes can be found here: +Documentation/ABI/obsolete/sysfs-block-zram + +Basically, every attribute that has its own read accessible sysfs node +(e.g. num_reads) *AND* is accessible via one of the stat files (zram<id>/stat +or zram<id>/io_stat or zram<id>/mm_stat) is considered to be deprecated. + +User space is advised to use the following files to read the device statistics. + +File /sys/block/zram<id>/stat + +Represents block layer statistics. Read Documentation/block/stat.txt for +details. + +File /sys/block/zram<id>/io_stat + +The stat file represents device's I/O statistics not accounted by block +layer and, thus, not available in zram<id>/stat file. It consists of a +single line of text and contains the following stats separated by +whitespace: + failed_reads + failed_writes + invalid_io + notify_free + +File /sys/block/zram<id>/mm_stat + +The stat file represents device's mm statistics. It consists of a single +line of text and contains the following stats separated by whitespace: + orig_data_size + compr_data_size + mem_used_total + mem_limit + mem_used_max + zero_pages + num_migrated 8) Deactivate: swapoff /dev/zram0 diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index c3cd627..7c3f187d 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -523,6 +523,7 @@ prototypes: void (*close)(struct vm_area_struct*); int (*fault)(struct vm_area_struct*, struct vm_fault *); int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); + int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *); int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); locking rules: @@ -532,6 +533,7 @@ close: yes fault: yes can return with page locked map_pages: yes page_mkwrite: yes can return with page locked +pfn_mkwrite: yes access: yes ->fault() is called when a previously not present pte is about @@ -558,6 +560,12 @@ the page has been truncated, the filesystem should not look up a new page like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to retry the fault. + ->pfn_mkwrite() is the same as page_mkwrite but when the pte is +VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is +VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior +after this call is to make the pte read-write, unless pfn_mkwrite returns +an error. + ->access() is called when get_user_pages() fails in access_process_vm(), typically used to debug a process through /proc/pid/mem or ptrace. This function is needed only for diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt index 5a615c1..cb6a596 100644 --- a/Documentation/printk-formats.txt +++ b/Documentation/printk-formats.txt @@ -8,6 +8,21 @@ If variable is of Type, use printk format specifier: unsigned long long %llu or %llx size_t %zu or %zx ssize_t %zd or %zx + s32 %d or %x + u32 %u or %x + s64 %lld or %llx + u64 %llu or %llx + +If <type> is dependent on a config option for its size (e.g., sector_t, +blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a +format specifier of its largest possible type and explicitly cast to it. +Example: + + printk("test: sector number/total blocks: %llu/%llu\n", + (unsigned long long)sector, (unsigned long long)blockcount); + +Reminder: sizeof() result is of type size_t. + Raw pointer value SHOULD be printed with %p. The kernel supports the following extended format specifiers for pointer types: @@ -54,6 +69,7 @@ Struct Resources: For printing struct resources. The 'R' and 'r' specifiers result in a printed resource with ('R') or without ('r') a decoded flags member. + Passed by reference. Physical addresses types phys_addr_t: @@ -132,6 +148,8 @@ MAC/FDDI addresses: specifier to use reversed byte order suitable for visual interpretation of Bluetooth addresses which are in the little endian order. + Passed by reference. + IPv4 addresses: %pI4 1.2.3.4 @@ -146,6 +164,8 @@ IPv4 addresses: host, network, big or little endian order addresses respectively. Where no specifier is provided the default network/big endian order is used. + Passed by reference. + IPv6 addresses: %pI6 0001:0002:0003:0004:0005:0006:0007:0008 @@ -160,6 +180,8 @@ IPv6 addresses: print a compressed IPv6 address as described by http://tools.ietf.org/html/rfc5952 + Passed by reference. + IPv4/IPv6 addresses (generic, with port, flowinfo, scope): %pIS 1.2.3.4 or 0001:0002:0003:0004:0005:0006:0007:0008 @@ -186,6 +208,8 @@ IPv4/IPv6 addresses (generic, with port, flowinfo, scope): specifiers can be used as well and are ignored in case of an IPv6 address. + Passed by reference. + Further examples: %pISfc 1.2.3.4 or [1:2:3:4:5:6:7:8]/123456789 @@ -207,6 +231,8 @@ UUID/GUID addresses: Where no additional specifiers are used the default little endian order with lower case hex characters will be printed. + Passed by reference. + dentry names: %pd{,2,3,4} %pD{,2,3,4} @@ -216,6 +242,8 @@ dentry names: equivalent of %s dentry->d_name.name we used to use, %pd<n> prints n last components. %pD does the same thing for struct file. + Passed by reference. + struct va_format: %pV @@ -231,23 +259,20 @@ struct va_format: Do not use this feature without some mechanism to verify the correctness of the format string and va_list arguments. -u64 SHOULD be printed with %llu/%llx: - - printk("%llu", u64_var); + Passed by reference. -s64 SHOULD be printed with %lld/%llx: +struct clk: - printk("%lld", s64_var); + %pC pll1 + %pCn pll1 + %pCr 1560000000 -If <type> is dependent on a config option for its size (e.g., sector_t, -blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a -format specifier of its largest possible type and explicitly cast to it. -Example: + For printing struct clk structures. '%pC' and '%pCn' print the name + (Common Clock Framework) or address (legacy clock framework) of the + structure; '%pCr' prints the current clock rate. - printk("test: sector number/total blocks: %llu/%llu\n", - (unsigned long long)sector, (unsigned long long)blockcount); + Passed by reference. -Reminder: sizeof() result is of type size_t. Thank you for your cooperation and attention. diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 902b457..9832ec5 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -21,6 +21,7 @@ Currently, these files are in /proc/sys/vm: - admin_reserve_kbytes - block_dump - compact_memory +- compact_unevictable_allowed - dirty_background_bytes - dirty_background_ratio - dirty_bytes @@ -106,6 +107,16 @@ huge pages although processes will also directly compact memory as required. ============================================================== +compact_unevictable_allowed + +Available only when CONFIG_COMPACTION is set. When set to 1, compaction is +allowed to examine the unevictable lru (mlocked pages) for pages to compact. +This should be used on systems where stalls for minor page faults are an +acceptable trade for large contiguous free memory. Set to 0 to prevent +compaction from moving pages that are unevictable. Default value is 1. + +============================================================== + dirty_background_bytes Contains the amount of dirty memory at which the background kernel diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index f2d3a10..030977f 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -267,21 +267,34 @@ call, then it is required that system administrator mount a file system of type hugetlbfs: mount -t hugetlbfs \ - -o uid=<value>,gid=<value>,mode=<value>,size=<value>,nr_inodes=<value> \ - none /mnt/huge + -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\ + min_size=<value>,nr_inodes=<value> none /mnt/huge This command mounts a (pseudo) filesystem of type hugetlbfs on the directory /mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid options sets the owner and group of the root of the file system. By default the uid and gid of the current process are taken. The mode option sets the mode of root of file system to value & 01777. This value is given in octal. -By default the value 0755 is picked. The size option sets the maximum value of -memory (huge pages) allowed for that filesystem (/mnt/huge). The size is -rounded down to HPAGE_SIZE. The option nr_inodes sets the maximum number of -inodes that /mnt/huge can use. If the size or nr_inodes option is not -provided on command line then no limits are set. For size and nr_inodes -options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For -example, size=2K has the same meaning as size=2048. +By default the value 0755 is picked. If the paltform supports multiple huge +page sizes, the pagesize option can be used to specify the huge page size and +associated pool. pagesize is specified in bytes. If pagesize is not specified +the paltform's default huge page size and associated pool will be used. The +size option sets the maximum value of memory (huge pages) allowed for that +filesystem (/mnt/huge). The size option can be specified in bytes, or as a +percentage of the specified huge page pool (nr_hugepages). The size is +rounded down to HPAGE_SIZE boundary. The min_size option sets the minimum +value of memory (huge pages) allowed for the filesystem. min_size can be +specified in the same way as size, either bytes or a percentage of the +huge page pool. At mount time, the number of huge pages specified by +min_size are reserved for use by the filesystem. If there are not enough +free huge pages available, the mount will fail. As huge pages are allocated +to the filesystem and freed, the reserve count is adjusted so that the sum +of allocated and reserved huge pages is always at least min_size. The option +nr_inodes sets the maximum number of inodes that /mnt/huge can use. If the +size, min_size or nr_inodes option is not provided on command line then +no limits are set. For pagesize, size, min_size and nr_inodes options, you +can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For example, size=2K +has the same meaning as size=2048. While read system calls are supported on files that reside on hugetlb file systems, write system calls are not. @@ -289,15 +302,23 @@ file systems, write system calls are not. Regular chown, chgrp, and chmod commands (with right permissions) could be used to change the file attributes on hugetlbfs. -Also, it is important to note that no such mount command is required if the +Also, it is important to note that no such mount command is required if applications are going to use only shmat/shmget system calls or mmap with -MAP_HUGETLB. Users who wish to use hugetlb page via shared memory segment -should be a member of a supplementary group and system admin needs to -configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for -same or different applications to use any combination of mmaps and shm* -calls, though the mount of filesystem will be required for using mmap calls -without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see -map_hugetlb.c. +MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see map_hugetlb +below. + +Users who wish to use hugetlb memory via shared memory segment should be a +member of a supplementary group and system admin needs to configure that gid +into /proc/sys/vm/hugetlb_shm_group. It is possible for same or different +applications to use any combination of mmaps and shm* calls, though the mount of +filesystem will be required for using mmap calls without MAP_HUGETLB. + +Syscalls that operate on memory backed by hugetlb pages only have their lengths +aligned to the native page size of the processor; they will normally fail with +errno set to EINVAL or exclude hugetlb pages that extend beyond the length if +not hugepage aligned. For example, munmap(2) will fail if memory is backed by +a hugetlb page and the length is smaller than the hugepage size. + Examples ======== diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index 86cb462..3be0bfc 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -22,6 +22,7 @@ CONTENTS - Filtering special vmas. - munlock()/munlockall() system call handling. - Migrating mlocked pages. + - Compacting mlocked pages. - mmap(MAP_LOCKED) system call handling. - munmap()/exit()/exec() system call handling. - try_to_unmap(). @@ -450,6 +451,17 @@ list because of a race between munlock and migration, page migration uses the putback_lru_page() function to add migrated pages back to the LRU. +COMPACTING MLOCKED PAGES +------------------------ + +The unevictable LRU can be scanned for compactable regions and the default +behavior is to do so. /proc/sys/vm/compact_unevictable_allowed controls +this behavior (see Documentation/sysctl/vm.txt). Once scanning of the +unevictable LRU is enabled, the work of compaction is mostly handled by +the page migration code and the same work flow as described in MIGRATING +MLOCKED PAGES will apply. + + mmap(MAP_LOCKED) SYSTEM CALL HANDLING ------------------------------------- diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt new file mode 100644 index 0000000..64ed63c --- /dev/null +++ b/Documentation/vm/zsmalloc.txt @@ -0,0 +1,70 @@ +zsmalloc +-------- + +This allocator is designed for use with zram. Thus, the allocator is +supposed to work well under low memory conditions. In particular, it +never attempts higher order page allocation which is very likely to +fail under memory pressure. On the other hand, if we just use single +(0-order) pages, it would suffer from very high fragmentation -- +any object of size PAGE_SIZE/2 or larger would occupy an entire page. +This was one of the major issues with its predecessor (xvmalloc). + +To overcome these issues, zsmalloc allocates a bunch of 0-order pages +and links them together using various 'struct page' fields. These linked +pages act as a single higher-order page i.e. an object can span 0-order +page boundaries. The code refers to these linked pages as a single entity +called zspage. + +For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE +since this satisfies the requirements of all its current users (in the +worst case, page is incompressible and is thus stored "as-is" i.e. in +uncompressed form). For allocation requests larger than this size, failure +is returned (see zs_malloc). + +Additionally, zs_malloc() does not return a dereferenceable pointer. +Instead, it returns an opaque handle (unsigned long) which encodes actual +location of the allocated object. The reason for this indirection is that +zsmalloc does not keep zspages permanently mapped since that would cause +issues on 32-bit systems where the VA region for kernel space mappings +is very small. So, before using the allocating memory, the object has to +be mapped using zs_map_object() to get a usable pointer and subsequently +unmapped using zs_unmap_object(). + +stat +---- + +With CONFIG_ZSMALLOC_STAT, we could see zsmalloc internal information via +/sys/kernel/debug/zsmalloc/<user name>. Here is a sample of stat output: + +# cat /sys/kernel/debug/zsmalloc/zram0/classes + + class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage + .. + .. + 9 176 0 1 186 129 8 4 + 10 192 1 0 2880 2872 135 3 + 11 208 0 1 819 795 42 2 + 12 224 0 1 219 159 12 4 + .. + .. + + +class: index +size: object size zspage stores +almost_empty: the number of ZS_ALMOST_EMPTY zspages(see below) +almost_full: the number of ZS_ALMOST_FULL zspages(see below) +obj_allocated: the number of objects allocated +obj_used: the number of objects allocated to the user +pages_used: the number of pages allocated for the class +pages_per_zspage: the number of 0-order pages to make a zspage + +We assign a zspage to ZS_ALMOST_EMPTY fullness group when: + n <= N / f, where +n = number of allocated objects +N = total number of objects zspage can store +f = fullness_threshold_frac(ie, 4 at the moment) + +Similarly, we assign zspage to: + ZS_ALMOST_FULL when n > N / f + ZS_EMPTY when n == 0 + ZS_FULL when n == N diff --git a/MAINTAINERS b/MAINTAINERS index d158405c..56a432d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -625,16 +625,16 @@ F: drivers/iommu/amd_iommu*.[ch] F: include/linux/amd-iommu.h AMD KFD -M: Oded Gabbay <oded.gabbay@amd.com> -L: dri-devel@lists.freedesktop.org -T: git git://people.freedesktop.org/~gabbayo/linux.git -S: Supported -F: drivers/gpu/drm/amd/amdkfd/ +M: Oded Gabbay <oded.gabbay@amd.com> +L: dri-devel@lists.freedesktop.org +T: git git://people.freedesktop.org/~gabbayo/linux.git +S: Supported +F: drivers/gpu/drm/amd/amdkfd/ F: drivers/gpu/drm/amd/include/cik_structs.h F: drivers/gpu/drm/amd/include/kgd_kfd_interface.h -F: drivers/gpu/drm/radeon/radeon_kfd.c -F: drivers/gpu/drm/radeon/radeon_kfd.h -F: include/uapi/linux/kfd_ioctl.h +F: drivers/gpu/drm/radeon/radeon_kfd.c +F: drivers/gpu/drm/radeon/radeon_kfd.h +F: include/uapi/linux/kfd_ioctl.h AMD MICROCODE UPDATE SUPPORT M: Borislav Petkov <bp@alien8.de> @@ -1915,16 +1915,14 @@ S: Maintained F: drivers/media/radio/radio-aztech* B43 WIRELESS DRIVER -M: Stefano Brivio <stefano.brivio@polimi.it> L: linux-wireless@vger.kernel.org L: b43-dev@lists.infradead.org W: http://wireless.kernel.org/en/users/Drivers/b43 -S: Maintained +S: Odd Fixes F: drivers/net/wireless/b43/ B43LEGACY WIRELESS DRIVER M: Larry Finger <Larry.Finger@lwfinger.net> -M: Stefano Brivio <stefano.brivio@polimi.it> L: linux-wireless@vger.kernel.org L: b43-dev@lists.infradead.org W: http://wireless.kernel.org/en/users/Drivers/b43 @@ -1967,10 +1965,10 @@ F: Documentation/filesystems/befs.txt F: fs/befs/ BECKHOFF CX5020 ETHERCAT MASTER DRIVER -M: Dariusz Marcinkiewicz <reksio@newterm.pl> -L: netdev@vger.kernel.org -S: Maintained -F: drivers/net/ethernet/ec_bhf.c +M: Dariusz Marcinkiewicz <reksio@newterm.pl> +L: netdev@vger.kernel.org +S: Maintained +F: drivers/net/ethernet/ec_bhf.c BFS FILE SYSTEM M: "Tigran A. Aivazian" <tigran@aivazian.fsnet.co.uk> @@ -2896,11 +2894,11 @@ S: Supported F: drivers/net/ethernet/chelsio/cxgb3/ CXGB3 ISCSI DRIVER (CXGB3I) -M: Karen Xie <kxie@chelsio.com> -L: linux-scsi@vger.kernel.org -W: http://www.chelsio.com -S: Supported -F: drivers/scsi/cxgbi/cxgb3i +M: Karen Xie <kxie@chelsio.com> +L: linux-scsi@vger.kernel.org +W: http://www.chelsio.com +S: Supported +F: drivers/scsi/cxgbi/cxgb3i CXGB3 IWARP RNIC DRIVER (IW_CXGB3) M: Steve Wise <swise@chelsio.com> @@ -2917,11 +2915,11 @@ S: Supported F: drivers/net/ethernet/chelsio/cxgb4/ CXGB4 ISCSI DRIVER (CXGB4I) -M: Karen Xie <kxie@chelsio.com> -L: linux-scsi@vger.kernel.org -W: http://www.chelsio.com -S: Supported -F: drivers/scsi/cxgbi/cxgb4i +M: Karen Xie <kxie@chelsio.com> +L: linux-scsi@vger.kernel.org +W: http://www.chelsio.com +S: Supported +F: drivers/scsi/cxgbi/cxgb4i CXGB4 IWARP RNIC DRIVER (IW_CXGB4) M: Steve Wise <swise@chelsio.com> @@ -5223,7 +5221,7 @@ F: arch/x86/kernel/tboot.c INTEL WIRELESS WIMAX CONNECTION 2400 M: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com> M: linux-wimax@intel.com -L: wimax@linuxwimax.org (subscribers-only) +L: wimax@linuxwimax.org (subscribers-only) S: Supported W: http://linuxwimax.org F: Documentation/wimax/README.i2400m @@ -5926,7 +5924,7 @@ F: arch/powerpc/platforms/512x/ F: arch/powerpc/platforms/52xx/ LINUX FOR POWERPC EMBEDDED PPC4XX -M: Alistair Popple <alistair@popple.id.au> +M: Alistair Popple <alistair@popple.id.au> M: Matt Porter <mporter@kernel.crashing.org> W: http://www.penguinppc.org/ L: linuxppc-dev@lists.ozlabs.org @@ -6399,7 +6397,7 @@ S: Supported F: drivers/watchdog/mena21_wdt.c MEN CHAMELEON BUS (mcb) -M: Johannes Thumshirn <johannes.thumshirn@men.de> +M: Johannes Thumshirn <johannes.thumshirn@men.de> S: Supported F: drivers/mcb/ F: include/linux/mcb.h @@ -7955,10 +7953,10 @@ L: rtc-linux@googlegroups.com S: Maintained QAT DRIVER -M: Tadeusz Struk <tadeusz.struk@intel.com> -L: qat-linux@intel.com -S: Supported -F: drivers/crypto/qat/ +M: Tadeusz Struk <tadeusz.struk@intel.com> +L: qat-linux@intel.com +S: Supported +F: drivers/crypto/qat/ QIB DRIVER M: Mike Marciniszyn <infinipath@intel.com> @@ -10129,11 +10127,11 @@ F: include/linux/cdrom.h F: include/uapi/linux/cdrom.h UNISYS S-PAR DRIVERS -M: Benjamin Romer <benjamin.romer@unisys.com> -M: David Kershner <david.kershner@unisys.com> -L: sparmaintainer@unisys.com (Unisys internal) -S: Supported -F: drivers/staging/unisys/ +M: Benjamin Romer <benjamin.romer@unisys.com> +M: David Kershner <david.kershner@unisys.com> +L: sparmaintainer@unisys.com (Unisys internal) +S: Supported +F: drivers/staging/unisys/ UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER M: Vinayak Holikatti <vinholikatti@gmail.com> @@ -10690,7 +10688,7 @@ F: drivers/media/rc/winbond-cir.c WIMAX STACK M: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com> M: linux-wimax@intel.com -L: wimax@linuxwimax.org (subscribers-only) +L: wimax@linuxwimax.org (subscribers-only) S: Supported W: http://linuxwimax.org F: Documentation/wimax/README.wimax @@ -10981,6 +10979,7 @@ L: linux-mm@kvack.org S: Maintained F: mm/zsmalloc.c F: include/linux/zsmalloc.h +F: Documentation/vm/zsmalloc.txt ZSWAP COMPRESSED SWAP CACHING M: Seth Jennings <sjennings@variantweb.net> diff --git a/arch/arm/plat-pxa/dma.c b/arch/arm/plat-pxa/dma.c index 054fc5a..d92f07f 100644 --- a/arch/arm/plat-pxa/dma.c +++ b/arch/arm/plat-pxa/dma.c @@ -51,19 +51,19 @@ static struct dentry *dbgfs_root, *dbgfs_state, **dbgfs_chan; static int dbg_show_requester_chan(struct seq_file *s, void *p) { - int pos = 0; int chan = (int)s->private; int i; u32 drcmr; - pos += seq_printf(s, "DMA channel %d requesters list :\n", chan); + seq_printf(s, "DMA channel %d requesters list :\n", chan); for (i = 0; i < DMA_MAX_REQUESTERS; i++) { drcmr = DRCMR(i); if ((drcmr & DRCMR_CHLNUM) == chan) - pos += seq_printf(s, "\tRequester %d (MAPVLD=%d)\n", i, - !!(drcmr & DRCMR_MAPVLD)); + seq_printf(s, "\tRequester %d (MAPVLD=%d)\n", + i, !!(drcmr & DRCMR_MAPVLD)); } - return pos; + + return 0; } static inline int dbg_burst_from_dcmd(u32 dcmd) @@ -83,7 +83,6 @@ static int is_phys_valid(unsigned long addr) static int dbg_show_descriptors(struct seq_file *s, void *p) { - int pos = 0; int chan = (int)s->private; int i, max_show = 20, burst, width; u32 dcmd; @@ -94,44 +93,43 @@ static int dbg_show_descriptors(struct seq_file *s, void *p) spin_lock_irqsave(&dma_channels[chan].lock, flags); phys_desc = DDADR(chan); - pos += seq_printf(s, "DMA channel %d descriptors :\n", chan); - pos += seq_printf(s, "[%03d] First descriptor unknown\n", 0); + seq_printf(s, "DMA channel %d descriptors :\n", chan); + seq_printf(s, "[%03d] First descriptor unknown\n", 0); for (i = 1; i < max_show && is_phys_valid(phys_desc); i++) { desc = phys_to_virt(phys_desc); dcmd = desc->dcmd; burst = dbg_burst_from_dcmd(dcmd); width = (1 << ((dcmd >> 14) & 0x3)) >> 1; - pos += seq_printf(s, "[%03d] Desc at %08lx(virt %p)\n", - i, phys_desc, desc); - pos += seq_printf(s, "\tDDADR = %08x\n", desc->ddadr); - pos += seq_printf(s, "\tDSADR = %08x\n", desc->dsadr); - pos += seq_printf(s, "\tDTADR = %08x\n", desc->dtadr); - pos += seq_printf(s, "\tDCMD = %08x (%s%s%s%s%s%s%sburst=%d" - " width=%d len=%d)\n", - dcmd, - DCMD_STR(INCSRCADDR), DCMD_STR(INCTRGADDR), - DCMD_STR(FLOWSRC), DCMD_STR(FLOWTRG), - DCMD_STR(STARTIRQEN), DCMD_STR(ENDIRQEN), - DCMD_STR(ENDIAN), burst, width, - dcmd & DCMD_LENGTH); + seq_printf(s, "[%03d] Desc at %08lx(virt %p)\n", + i, phys_desc, desc); + seq_printf(s, "\tDDADR = %08x\n", desc->ddadr); + seq_printf(s, "\tDSADR = %08x\n", desc->dsadr); + seq_printf(s, "\tDTADR = %08x\n", desc->dtadr); + seq_printf(s, "\tDCMD = %08x (%s%s%s%s%s%s%sburst=%d width=%d len=%d)\n", + dcmd, + DCMD_STR(INCSRCADDR), DCMD_STR(INCTRGADDR), + DCMD_STR(FLOWSRC), DCMD_STR(FLOWTRG), + DCMD_STR(STARTIRQEN), DCMD_STR(ENDIRQEN), + DCMD_STR(ENDIAN), burst, width, + dcmd & DCMD_LENGTH); phys_desc = desc->ddadr; } if (i == max_show) - pos += seq_printf(s, "[%03d] Desc at %08lx ... max display reached\n", - i, phys_desc); + seq_printf(s, "[%03d] Desc at %08lx ... max display reached\n", + i, phys_desc); else - pos += seq_printf(s, "[%03d] Desc at %08lx is %s\n", - i, phys_desc, phys_desc == DDADR_STOP ? - "DDADR_STOP" : "invalid"); + seq_printf(s, "[%03d] Desc at %08lx is %s\n", + i, phys_desc, phys_desc == DDADR_STOP ? + "DDADR_STOP" : "invalid"); spin_unlock_irqrestore(&dma_channels[chan].lock, flags); - return pos; + + return 0; } static int dbg_show_chan_state(struct seq_file *s, void *p) { - int pos = 0; int chan = (int)s->private; u32 dcsr, dcmd; int burst, width; @@ -142,42 +140,39 @@ static int dbg_show_chan_state(struct seq_file *s, void *p) burst = dbg_burst_from_dcmd(dcmd); width = (1 << ((dcmd >> 14) & 0x3)) >> 1; - pos += seq_printf(s, "DMA channel %d\n", chan); - pos += seq_printf(s, "\tPriority : %s\n", - str_prio[dma_channels[chan].prio]); - pos += seq_printf(s, "\tUnaligned transfer bit: %s\n", - DALGN & (1 << chan) ? "yes" : "no"); - pos += seq_printf(s, "\tDCSR = %08x (%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s)\n", - dcsr, DCSR_STR(RUN), DCSR_STR(NODESC), - DCSR_STR(STOPIRQEN), DCSR_STR(EORIRQEN), - DCSR_STR(EORJMPEN), DCSR_STR(EORSTOPEN), - DCSR_STR(SETCMPST), DCSR_STR(CLRCMPST), - DCSR_STR(CMPST), DCSR_STR(EORINTR), DCSR_STR(REQPEND), - DCSR_STR(STOPSTATE), DCSR_STR(ENDINTR), - DCSR_STR(STARTINTR), DCSR_STR(BUSERR)); - - pos += seq_printf(s, "\tDCMD = %08x (%s%s%s%s%s%s%sburst=%d width=%d" - " len=%d)\n", - dcmd, - DCMD_STR(INCSRCADDR), DCMD_STR(INCTRGADDR), - DCMD_STR(FLOWSRC), DCMD_STR(FLOWTRG), - DCMD_STR(STARTIRQEN), DCMD_STR(ENDIRQEN), - DCMD_STR(ENDIAN), burst, width, dcmd & DCMD_LENGTH); - pos += seq_printf(s, "\tDSADR = %08x\n", DSADR(chan)); - pos += seq_printf(s, "\tDTADR = %08x\n", DTADR(chan)); - pos += seq_printf(s, "\tDDADR = %08x\n", DDADR(chan)); - return pos; + seq_printf(s, "DMA channel %d\n", chan); + seq_printf(s, "\tPriority : %s\n", str_prio[dma_channels[chan].prio]); + seq_printf(s, "\tUnaligned transfer bit: %s\n", + DALGN & (1 << chan) ? "yes" : "no"); + seq_printf(s, "\tDCSR = %08x (%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s)\n", + dcsr, DCSR_STR(RUN), DCSR_STR(NODESC), + DCSR_STR(STOPIRQEN), DCSR_STR(EORIRQEN), + DCSR_STR(EORJMPEN), DCSR_STR(EORSTOPEN), + DCSR_STR(SETCMPST), DCSR_STR(CLRCMPST), + DCSR_STR(CMPST), DCSR_STR(EORINTR), DCSR_STR(REQPEND), + DCSR_STR(STOPSTATE), DCSR_STR(ENDINTR), + DCSR_STR(STARTINTR), DCSR_STR(BUSERR)); + + seq_printf(s, "\tDCMD = %08x (%s%s%s%s%s%s%sburst=%d width=%d len=%d)\n", + dcmd, + DCMD_STR(INCSRCADDR), DCMD_STR(INCTRGADDR), + DCMD_STR(FLOWSRC), DCMD_STR(FLOWTRG), + DCMD_STR(STARTIRQEN), DCMD_STR(ENDIRQEN), + DCMD_STR(ENDIAN), burst, width, dcmd & DCMD_LENGTH); + seq_printf(s, "\tDSADR = %08x\n", DSADR(chan)); + seq_printf(s, "\tDTADR = %08x\n", DTADR(chan)); + seq_printf(s, "\tDDADR = %08x\n", DDADR(chan)); + + return 0; } static int dbg_show_state(struct seq_file *s, void *p) { - int pos = 0; - /* basic device status */ - pos += seq_printf(s, "DMA engine status\n"); - pos += seq_printf(s, "\tChannel number: %d\n", num_dma_channels); + seq_puts(s, "DMA engine status\n"); + seq_printf(s, "\tChannel number: %d\n", num_dma_channels); - return pos; + return 0; } #define DBGFS_FUNC_DECL(name) \ diff --git a/arch/cris/arch-v10/kernel/fasttimer.c b/arch/cris/arch-v10/kernel/fasttimer.c index 48a59af..e929873 100644 --- a/arch/cris/arch-v10/kernel/fasttimer.c +++ b/arch/cris/arch-v10/kernel/fasttimer.c @@ -527,7 +527,8 @@ static int proc_fasttimer_show(struct seq_file *m, void *v) i = debug_log_cnt; while (i != end_i || debug_log_cnt_wrapped) { - if (seq_printf(m, debug_log_string[i], debug_log_value[i]) < 0) + seq_printf(m, debug_log_string[i], debug_log_value[i]); + if (seq_has_overflowed(m)) return 0; i = (i+1) % DEBUG_LOG_MAX; } @@ -542,24 +543,22 @@ static int proc_fasttimer_show(struct seq_file *m, void *v) int cur = (fast_timers_started - i - 1) % NUM_TIMER_STATS; #if 1 //ndef FAST_TIMER_LOG - seq_printf(m, "div: %i freq: %i delay: %i" - "\n", + seq_printf(m, "div: %i freq: %i delay: %i\n", timer_div_settings[cur], timer_freq_settings[cur], timer_delay_settings[cur]); #endif #ifdef FAST_TIMER_LOG t = &timer_started_log[cur]; - if (seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu " - "d: %6li us data: 0x%08lX" - "\n", - t->name, - (unsigned long)t->tv_set.tv_jiff, - (unsigned long)t->tv_set.tv_usec, - (unsigned long)t->tv_expires.tv_jiff, - (unsigned long)t->tv_expires.tv_usec, - t->delay_us, - t->data) < 0) + seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu d: %6li us data: 0x%08lX\n", + t->name, + (unsigned long)t->tv_set.tv_jiff, + (unsigned long)t->tv_set.tv_usec, + (unsigned long)t->tv_expires.tv_jiff, + (unsigned long)t->tv_expires.tv_usec, + t->delay_us, + t->data); + if (seq_has_overflowed(m)) return 0; #endif } @@ -571,16 +570,15 @@ static int proc_fasttimer_show(struct seq_file *m, void *v) seq_printf(m, "Timers added: %i\n", fast_timers_added); for (i = 0; i < num_to_show; i++) { t = &timer_added_log[(fast_timers_added - i - 1) % NUM_TIMER_STATS]; - if (seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu " - "d: %6li us data: 0x%08lX" - "\n", - t->name, - (unsigned long)t->tv_set.tv_jiff, - (unsigned long)t->tv_set.tv_usec, - (unsigned long)t->tv_expires.tv_jiff, - (unsigned long)t->tv_expires.tv_usec, - t->delay_us, - t->data) < 0) + seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu d: %6li us data: 0x%08lX\n", + t->name, + (unsigned long)t->tv_set.tv_jiff, + (unsigned long)t->tv_set.tv_usec, + (unsigned long)t->tv_expires.tv_jiff, + (unsigned long)t->tv_expires.tv_usec, + t->delay_us, + t->data); + if (seq_has_overflowed(m)) return 0; } seq_putc(m, '\n'); @@ -590,16 +588,15 @@ static int proc_fasttimer_show(struct seq_file *m, void *v) seq_printf(m, "Timers expired: %i\n", fast_timers_expired); for (i = 0; i < num_to_show; i++) { t = &timer_expired_log[(fast_timers_expired - i - 1) % NUM_TIMER_STATS]; - if (seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu " - "d: %6li us data: 0x%08lX" - "\n", - t->name, - (unsigned long)t->tv_set.tv_jiff, - (unsigned long)t->tv_set.tv_usec, - (unsigned long)t->tv_expires.tv_jiff, - (unsigned long)t->tv_expires.tv_usec, - t->delay_us, - t->data) < 0) + seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu d: %6li us data: 0x%08lX\n", + t->name, + (unsigned long)t->tv_set.tv_jiff, + (unsigned long)t->tv_set.tv_usec, + (unsigned long)t->tv_expires.tv_jiff, + (unsigned long)t->tv_expires.tv_usec, + t->delay_us, + t->data); + if (seq_has_overflowed(m)) return 0; } seq_putc(m, '\n'); @@ -611,19 +608,15 @@ static int proc_fasttimer_show(struct seq_file *m, void *v) while (t) { nextt = t->next; local_irq_restore(flags); - if (seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu " - "d: %6li us data: 0x%08lX" -/* " func: 0x%08lX" */ - "\n", - t->name, - (unsigned long)t->tv_set.tv_jiff, - (unsigned long)t->tv_set.tv_usec, - (unsigned long)t->tv_expires.tv_jiff, - (unsigned long)t->tv_expires.tv_usec, - t->delay_us, - t->data -/* , t->function */ - ) < 0) + seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu d: %6li us data: 0x%08lX\n", + t->name, + (unsigned long)t->tv_set.tv_jiff, + (unsigned long)t->tv_set.tv_usec, + (unsigned long)t->tv_expires.tv_jiff, + (unsigned long)t->tv_expires.tv_usec, + t->delay_us, + t->data); + if (seq_has_overflowed(m)) return 0; local_irq_save(flags); if (t->next != nextt) diff --git a/arch/cris/arch-v10/kernel/setup.c b/arch/cris/arch-v10/kernel/setup.c index 4f96d71..7ab31f1 100644 --- a/arch/cris/arch-v10/kernel/setup.c +++ b/arch/cris/arch-v10/kernel/setup.c @@ -63,35 +63,37 @@ int show_cpuinfo(struct seq_file *m, void *v) else info = &cpu_info[revision]; - return seq_printf(m, - "processor\t: 0\n" - "cpu\t\t: CRIS\n" - "cpu revision\t: %lu\n" - "cpu model\t: %s\n" - "cache size\t: %d kB\n" - "fpu\t\t: %s\n" - "mmu\t\t: %s\n" - "mmu DMA bug\t: %s\n" - "ethernet\t: %s Mbps\n" - "token ring\t: %s\n" - "scsi\t\t: %s\n" - "ata\t\t: %s\n" - "usb\t\t: %s\n" - "bogomips\t: %lu.%02lu\n", + seq_printf(m, + "processor\t: 0\n" + "cpu\t\t: CRIS\n" + "cpu revision\t: %lu\n" + "cpu model\t: %s\n" + "cache size\t: %d kB\n" + "fpu\t\t: %s\n" + "mmu\t\t: %s\n" + "mmu DMA bug\t: %s\n" + "ethernet\t: %s Mbps\n" + "token ring\t: %s\n" + "scsi\t\t: %s\n" + "ata\t\t: %s\n" + "usb\t\t: %s\n" + "bogomips\t: %lu.%02lu\n", - revision, - info->model, - info->cache, - info->flags & HAS_FPU ? "yes" : "no", - info->flags & HAS_MMU ? "yes" : "no", - info->flags & HAS_MMU_BUG ? "yes" : "no", - info->flags & HAS_ETHERNET100 ? "10/100" : "10", - info->flags & HAS_TOKENRING ? "4/16 Mbps" : "no", - info->flags & HAS_SCSI ? "yes" : "no", - info->flags & HAS_ATA ? "yes" : "no", - info->flags & HAS_USB ? "yes" : "no", - (loops_per_jiffy * HZ + 500) / 500000, - ((loops_per_jiffy * HZ + 500) / 5000) % 100); + revision, + info->model, + info->cache, + info->flags & HAS_FPU ? "yes" : "no", + info->flags & HAS_MMU ? "yes" : "no", + info->flags & HAS_MMU_BUG ? "yes" : "no", + info->flags & HAS_ETHERNET100 ? "10/100" : "10", + info->flags & HAS_TOKENRING ? "4/16 Mbps" : "no", + info->flags & HAS_SCSI ? "yes" : "no", + info->flags & HAS_ATA ? "yes" : "no", + info->flags & HAS_USB ? "yes" : "no", + (loops_per_jiffy * HZ + 500) / 500000, + ((loops_per_jiffy * HZ + 500) / 5000) % 100); + + return 0; } #endif /* CONFIG_PROC_FS */ diff --git a/arch/cris/arch-v32/kernel/fasttimer.c b/arch/cris/arch-v32/kernel/fasttimer.c index b130c2c..5c84dbb 100644 --- a/arch/cris/arch-v32/kernel/fasttimer.c +++ b/arch/cris/arch-v32/kernel/fasttimer.c @@ -501,7 +501,8 @@ static int proc_fasttimer_show(struct seq_file *m, void *v) i = debug_log_cnt; while ((i != end_i || debug_log_cnt_wrapped)) { - if (seq_printf(m, debug_log_string[i], debug_log_value[i]) < 0) + seq_printf(m, debug_log_string[i], debug_log_value[i]); + if (seq_has_overflowed(m)) return 0; i = (i+1) % DEBUG_LOG_MAX; } @@ -516,23 +517,21 @@ static int proc_fasttimer_show(struct seq_file *m, void *v) int cur = (fast_timers_started - i - 1) % NUM_TIMER_STATS; #if 1 //ndef FAST_TIMER_LOG - seq_printf(m, "div: %i delay: %i" - "\n", + seq_printf(m, "div: %i delay: %i\n", timer_div_settings[cur], timer_delay_settings[cur]); #endif #ifdef FAST_TIMER_LOG t = &timer_started_log[cur]; - if (seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu " - "d: %6li us data: 0x%08lX" - "\n", - t->name, - (unsigned long)t->tv_set.tv_jiff, - (unsigned long)t->tv_set.tv_usec, - (unsigned long)t->tv_expires.tv_jiff, - (unsigned long)t->tv_expires.tv_usec, - t->delay_us, - t->data) < 0) + seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu d: %6li us data: 0x%08lX\n", + t->name, + (unsigned long)t->tv_set.tv_jiff, + (unsigned long)t->tv_set.tv_usec, + (unsigned long)t->tv_expires.tv_jiff, + (unsigned long)t->tv_expires.tv_usec, + t->delay_us, + t->data); + if (seq_has_overflowed(m)) return 0; #endif } @@ -544,16 +543,15 @@ static int proc_fasttimer_show(struct seq_file *m, void *v) seq_printf(m, "Timers added: %i\n", fast_timers_added); for (i = 0; i < num_to_show; i++) { t = &timer_added_log[(fast_timers_added - i - 1) % NUM_TIMER_STATS]; - if (seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu " - "d: %6li us data: 0x%08lX" - "\n", - t->name, - (unsigned long)t->tv_set.tv_jiff, - (unsigned long)t->tv_set.tv_usec, - (unsigned long)t->tv_expires.tv_jiff, - (unsigned long)t->tv_expires.tv_usec, - t->delay_us, - t->data) < 0) + seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu d: %6li us data: 0x%08lX\n", + t->name, + (unsigned long)t->tv_set.tv_jiff, + (unsigned long)t->tv_set.tv_usec, + (unsigned long)t->tv_expires.tv_jiff, + (unsigned long)t->tv_expires.tv_usec, + t->delay_us, + t->data); + if (seq_has_overflowed(m)) return 0; } seq_putc(m, '\n'); @@ -563,16 +561,15 @@ static int proc_fasttimer_show(struct seq_file *m, void *v) seq_printf(m, "Timers expired: %i\n", fast_timers_expired); for (i = 0; i < num_to_show; i++){ t = &timer_expired_log[(fast_timers_expired - i - 1) % NUM_TIMER_STATS]; - if (seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu " - "d: %6li us data: 0x%08lX" - "\n", - t->name, - (unsigned long)t->tv_set.tv_jiff, - (unsigned long)t->tv_set.tv_usec, - (unsigned long)t->tv_expires.tv_jiff, - (unsigned long)t->tv_expires.tv_usec, - t->delay_us, - t->data) < 0) + seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu d: %6li us data: 0x%08lX\n", + t->name, + (unsigned long)t->tv_set.tv_jiff, + (unsigned long)t->tv_set.tv_usec, + (unsigned long)t->tv_expires.tv_jiff, + (unsigned long)t->tv_expires.tv_usec, + t->delay_us, + t->data); + if (seq_has_overflowed(m)) return 0; } seq_putc(m, '\n'); @@ -584,19 +581,15 @@ static int proc_fasttimer_show(struct seq_file *m, void *v) while (t != NULL){ nextt = t->next; local_irq_restore(flags); - if (seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu " - "d: %6li us data: 0x%08lX" -/* " func: 0x%08lX" */ - "\n", - t->name, - (unsigned long)t->tv_set.tv_jiff, - (unsigned long)t->tv_set.tv_usec, - (unsigned long)t->tv_expires.tv_jiff, - (unsigned long)t->tv_expires.tv_usec, - t->delay_us, - t->data -/* , t->function */ - ) < 0) + seq_printf(m, "%-14s s: %6lu.%06lu e: %6lu.%06lu d: %6li us data: 0x%08lX\n", + t->name, + (unsigned long)t->tv_set.tv_jiff, + (unsigned long)t->tv_set.tv_usec, + (unsigned long)t->tv_expires.tv_jiff, + (unsigned long)t->tv_expires.tv_usec, + t->delay_us, + t->data); + if (seq_has_overflowed(m)) return 0; local_irq_save(flags); if (t->next != nextt) diff --git a/arch/cris/arch-v32/kernel/setup.c b/arch/cris/arch-v32/kernel/setup.c index 61e10ae..81715c6 100644 --- a/arch/cris/arch-v32/kernel/setup.c +++ b/arch/cris/arch-v32/kernel/setup.c @@ -77,36 +77,38 @@ int show_cpuinfo(struct seq_file *m, void *v) } } - return seq_printf(m, - "processor\t: %d\n" - "cpu\t\t: CRIS\n" - "cpu revision\t: %lu\n" - "cpu model\t: %s\n" - "cache size\t: %d KB\n" - "fpu\t\t: %s\n" - "mmu\t\t: %s\n" - "mmu DMA bug\t: %s\n" - "ethernet\t: %s Mbps\n" - "token ring\t: %s\n" - "scsi\t\t: %s\n" - "ata\t\t: %s\n" - "usb\t\t: %s\n" - "bogomips\t: %lu.%02lu\n\n", - - cpu, - revision, - info->cpu_model, - info->cache_size, - info->flags & HAS_FPU ? "yes" : "no", - info->flags & HAS_MMU ? "yes" : "no", - info->flags & HAS_MMU_BUG ? "yes" : "no", - info->flags & HAS_ETHERNET100 ? "10/100" : "10", - info->flags & HAS_TOKENRING ? "4/16 Mbps" : "no", - info->flags & HAS_SCSI ? "yes" : "no", - info->flags & HAS_ATA ? "yes" : "no", - info->flags & HAS_USB ? "yes" : "no", - (loops_per_jiffy * HZ + 500) / 500000, - ((loops_per_jiffy * HZ + 500) / 5000) % 100); + seq_printf(m, + "processor\t: %d\n" + "cpu\t\t: CRIS\n" + "cpu revision\t: %lu\n" + "cpu model\t: %s\n" + "cache size\t: %d KB\n" + "fpu\t\t: %s\n" + "mmu\t\t: %s\n" + "mmu DMA bug\t: %s\n" + "ethernet\t: %s Mbps\n" + "token ring\t: %s\n" + "scsi\t\t: %s\n" + "ata\t\t: %s\n" + "usb\t\t: %s\n" + "bogomips\t: %lu.%02lu\n\n", + + cpu, + revision, + info->cpu_model, + info->cache_size, + info->flags & HAS_FPU ? "yes" : "no", + info->flags & HAS_MMU ? "yes" : "no", + info->flags & HAS_MMU_BUG ? "yes" : "no", + info->flags & HAS_ETHERNET100 ? "10/100" : "10", + info->flags & HAS_TOKENRING ? "4/16 Mbps" : "no", + info->flags & HAS_SCSI ? "yes" : "no", + info->flags & HAS_ATA ? "yes" : "no", + info->flags & HAS_USB ? "yes" : "no", + (loops_per_jiffy * HZ + 500) / 500000, + ((loops_per_jiffy * HZ + 500) / 5000) % 100); + + return 0; } #endif /* CONFIG_PROC_FS */ diff --git a/arch/microblaze/kernel/cpu/mb.c b/arch/microblaze/kernel/cpu/mb.c index 7b5dca7..9581d194 100644 --- a/arch/microblaze/kernel/cpu/mb.c +++ b/arch/microblaze/kernel/cpu/mb.c @@ -27,7 +27,6 @@ static int show_cpuinfo(struct seq_file *m, void *v) { - int count = 0; char *fpga_family = "Unknown"; char *cpu_ver = "Unknown"; int i; @@ -48,91 +47,89 @@ static int show_cpuinfo(struct seq_file *m, void *v) } } - count = seq_printf(m, - "CPU-Family: MicroBlaze\n" - "FPGA-Arch: %s\n" - "CPU-Ver: %s, %s endian\n" - "CPU-MHz: %d.%02d\n" - "BogoMips: %lu.%02lu\n", - fpga_family, - cpu_ver, - cpuinfo.endian ? "little" : "big", - cpuinfo.cpu_clock_freq / - 1000000, - cpuinfo.cpu_clock_freq % - 1000000, - loops_per_jiffy / (500000 / HZ), - (loops_per_jiffy / (5000 / HZ)) % 100); - - count += seq_printf(m, - "HW:\n Shift:\t\t%s\n" - " MSR:\t\t%s\n" - " PCMP:\t\t%s\n" - " DIV:\t\t%s\n", - (cpuinfo.use_instr & PVR0_USE_BARREL_MASK) ? "yes" : "no", - (cpuinfo.use_instr & PVR2_USE_MSR_INSTR) ? "yes" : "no", - (cpuinfo.use_instr & PVR2_USE_PCMP_INSTR) ? "yes" : "no", - (cpuinfo.use_instr & PVR0_USE_DIV_MASK) ? "yes" : "no"); - - count += seq_printf(m, - " MMU:\t\t%x\n", - cpuinfo.mmu); - - count += seq_printf(m, - " MUL:\t\t%s\n" - " FPU:\t\t%s\n", - (cpuinfo.use_mult & PVR2_USE_MUL64_MASK) ? "v2" : - (cpuinfo.use_mult & PVR0_USE_HW_MUL_MASK) ? "v1" : "no", - (cpuinfo.use_fpu & PVR2_USE_FPU2_MASK) ? "v2" : - (cpuinfo.use_fpu & PVR0_USE_FPU_MASK) ? "v1" : "no"); - - count += seq_printf(m, - " Exc:\t\t%s%s%s%s%s%s%s%s\n", - (cpuinfo.use_exc & PVR2_OPCODE_0x0_ILL_MASK) ? "op0x0 " : "", - (cpuinfo.use_exc & PVR2_UNALIGNED_EXC_MASK) ? "unal " : "", - (cpuinfo.use_exc & PVR2_ILL_OPCODE_EXC_MASK) ? "ill " : "", - (cpuinfo.use_exc & PVR2_IOPB_BUS_EXC_MASK) ? "iopb " : "", - (cpuinfo.use_exc & PVR2_DOPB_BUS_EXC_MASK) ? "dopb " : "", - (cpuinfo.use_exc & PVR2_DIV_ZERO_EXC_MASK) ? "zero " : "", - (cpuinfo.use_exc & PVR2_FPU_EXC_MASK) ? "fpu " : "", - (cpuinfo.use_exc & PVR2_USE_FSL_EXC) ? "fsl " : ""); - - count += seq_printf(m, - "Stream-insns:\t%sprivileged\n", - cpuinfo.mmu_privins ? "un" : ""); + seq_printf(m, + "CPU-Family: MicroBlaze\n" + "FPGA-Arch: %s\n" + "CPU-Ver: %s, %s endian\n" + "CPU-MHz: %d.%02d\n" + "BogoMips: %lu.%02lu\n", + fpga_family, + cpu_ver, + cpuinfo.endian ? "little" : "big", + cpuinfo.cpu_clock_freq / 1000000, + cpuinfo.cpu_clock_freq % 1000000, + loops_per_jiffy / (500000 / HZ), + (loops_per_jiffy / (5000 / HZ)) % 100); + + seq_printf(m, + "HW:\n Shift:\t\t%s\n" + " MSR:\t\t%s\n" + " PCMP:\t\t%s\n" + " DIV:\t\t%s\n", + (cpuinfo.use_instr & PVR0_USE_BARREL_MASK) ? "yes" : "no", + (cpuinfo.use_instr & PVR2_USE_MSR_INSTR) ? "yes" : "no", + (cpuinfo.use_instr & PVR2_USE_PCMP_INSTR) ? "yes" : "no", + (cpuinfo.use_instr & PVR0_USE_DIV_MASK) ? "yes" : "no"); + + seq_printf(m, " MMU:\t\t%x\n", cpuinfo.mmu); + + seq_printf(m, + " MUL:\t\t%s\n" + " FPU:\t\t%s\n", + (cpuinfo.use_mult & PVR2_USE_MUL64_MASK) ? "v2" : + (cpuinfo.use_mult & PVR0_USE_HW_MUL_MASK) ? "v1" : "no", + (cpuinfo.use_fpu & PVR2_USE_FPU2_MASK) ? "v2" : + (cpuinfo.use_fpu & PVR0_USE_FPU_MASK) ? "v1" : "no"); + + seq_printf(m, + " Exc:\t\t%s%s%s%s%s%s%s%s\n", + (cpuinfo.use_exc & PVR2_OPCODE_0x0_ILL_MASK) ? "op0x0 " : "", + (cpuinfo.use_exc & PVR2_UNALIGNED_EXC_MASK) ? "unal " : "", + (cpuinfo.use_exc & PVR2_ILL_OPCODE_EXC_MASK) ? "ill " : "", + (cpuinfo.use_exc & PVR2_IOPB_BUS_EXC_MASK) ? "iopb " : "", + (cpuinfo.use_exc & PVR2_DOPB_BUS_EXC_MASK) ? "dopb " : "", + (cpuinfo.use_exc & PVR2_DIV_ZERO_EXC_MASK) ? "zero " : "", + (cpuinfo.use_exc & PVR2_FPU_EXC_MASK) ? "fpu " : "", + (cpuinfo.use_exc & PVR2_USE_FSL_EXC) ? "fsl " : ""); + + seq_printf(m, + "Stream-insns:\t%sprivileged\n", + cpuinfo.mmu_privins ? "un" : ""); if (cpuinfo.use_icache) - count += seq_printf(m, - "Icache:\t\t%ukB\tline length:\t%dB\n", - cpuinfo.icache_size >> 10, - cpuinfo.icache_line_length); + seq_printf(m, + "Icache:\t\t%ukB\tline length:\t%dB\n", + cpuinfo.icache_size >> 10, + cpuinfo.icache_line_length); else - count += seq_printf(m, "Icache:\t\tno\n"); + seq_puts(m, "Icache:\t\tno\n"); if (cpuinfo.use_dcache) { - count += seq_printf(m, - "Dcache:\t\t%ukB\tline length:\t%dB\n", - cpuinfo.dcache_size >> 10, - cpuinfo.dcache_line_length); - seq_printf(m, "Dcache-Policy:\t"); + seq_printf(m, + "Dcache:\t\t%ukB\tline length:\t%dB\n", + cpuinfo.dcache_size >> 10, + cpuinfo.dcache_line_length); + seq_puts(m, "Dcache-Policy:\t"); if (cpuinfo.dcache_wb) - count += seq_printf(m, "write-back\n"); + seq_puts(m, "write-back\n"); else - count += seq_printf(m, "write-through\n"); - } else - count += seq_printf(m, "Dcache:\t\tno\n"); + seq_puts(m, "write-through\n"); + } else { + seq_puts(m, "Dcache:\t\tno\n"); + } + + seq_printf(m, + "HW-Debug:\t%s\n", + cpuinfo.hw_debug ? "yes" : "no"); - count += seq_printf(m, - "HW-Debug:\t%s\n", - cpuinfo.hw_debug ? "yes" : "no"); + seq_printf(m, + "PVR-USR1:\t%02x\n" + "PVR-USR2:\t%08x\n", + cpuinfo.pvr_user1, + cpuinfo.pvr_user2); - count += seq_printf(m, - "PVR-USR1:\t%02x\n" - "PVR-USR2:\t%08x\n", - cpuinfo.pvr_user1, - cpuinfo.pvr_user2); + seq_printf(m, "Page size:\t%lu\n", PAGE_SIZE); - count += seq_printf(m, "Page size:\t%lu\n", PAGE_SIZE); return 0; } diff --git a/arch/nios2/kernel/cpuinfo.c b/arch/nios2/kernel/cpuinfo.c index a223691d..1d96de0 100644 --- a/arch/nios2/kernel/cpuinfo.c +++ b/arch/nios2/kernel/cpuinfo.c @@ -126,47 +126,46 @@ void __init setup_cpuinfo(void) */ static int show_cpuinfo(struct seq_file *m, void *v) { - int count = 0; const u32 clockfreq = cpuinfo.cpu_clock_freq; - count = seq_printf(m, - "CPU:\t\tNios II/%s\n" - "MMU:\t\t%s\n" - "FPU:\t\tnone\n" - "Clocking:\t%u.%02u MHz\n" - "BogoMips:\t%lu.%02lu\n" - "Calibration:\t%lu loops\n", - cpuinfo.cpu_impl, - cpuinfo.mmu ? "present" : "none", - clockfreq / 1000000, (clockfreq / 100000) % 10, - (loops_per_jiffy * HZ) / 500000, - ((loops_per_jiffy * HZ) / 5000) % 100, - (loops_per_jiffy * HZ)); - - count += seq_printf(m, - "HW:\n" - " MUL:\t\t%s\n" - " MULX:\t\t%s\n" - " DIV:\t\t%s\n", - cpuinfo.has_mul ? "yes" : "no", - cpuinfo.has_mulx ? "yes" : "no", - cpuinfo.has_div ? "yes" : "no"); - - count += seq_printf(m, - "Icache:\t\t%ukB, line length: %u\n", - cpuinfo.icache_size >> 10, - cpuinfo.icache_line_size); - - count += seq_printf(m, - "Dcache:\t\t%ukB, line length: %u\n", - cpuinfo.dcache_size >> 10, - cpuinfo.dcache_line_size); - - count += seq_printf(m, - "TLB:\t\t%u ways, %u entries, %u PID bits\n", - cpuinfo.tlb_num_ways, - cpuinfo.tlb_num_entries, - cpuinfo.tlb_pid_num_bits); + seq_printf(m, + "CPU:\t\tNios II/%s\n" + "MMU:\t\t%s\n" + "FPU:\t\tnone\n" + "Clocking:\t%u.%02u MHz\n" + "BogoMips:\t%lu.%02lu\n" + "Calibration:\t%lu loops\n", + cpuinfo.cpu_impl, + cpuinfo.mmu ? "present" : "none", + clockfreq / 1000000, (clockfreq / 100000) % 10, + (loops_per_jiffy * HZ) / 500000, + ((loops_per_jiffy * HZ) / 5000) % 100, + (loops_per_jiffy * HZ)); + + seq_printf(m, + "HW:\n" + " MUL:\t\t%s\n" + " MULX:\t\t%s\n" + " DIV:\t\t%s\n", + cpuinfo.has_mul ? "yes" : "no", + cpuinfo.has_mulx ? "yes" : "no", + cpuinfo.has_div ? "yes" : "no"); + + seq_printf(m, + "Icache:\t\t%ukB, line length: %u\n", + cpuinfo.icache_size >> 10, + cpuinfo.icache_line_size); + + seq_printf(m, + "Dcache:\t\t%ukB, line length: %u\n", + cpuinfo.dcache_size >> 10, + cpuinfo.dcache_line_size); + + seq_printf(m, + "TLB:\t\t%u ways, %u entries, %u PID bits\n", + cpuinfo.tlb_num_ways, + cpuinfo.tlb_num_entries, + cpuinfo.tlb_pid_num_bits); return 0; } diff --git a/arch/openrisc/kernel/setup.c b/arch/openrisc/kernel/setup.c index 4fc7ccc..b4ed8b3 100644 --- a/arch/openrisc/kernel/setup.c +++ b/arch/openrisc/kernel/setup.c @@ -329,30 +329,32 @@ static int show_cpuinfo(struct seq_file *m, void *v) version = (vr & SPR_VR_VER) >> 24; revision = vr & SPR_VR_REV; - return seq_printf(m, - "cpu\t\t: OpenRISC-%x\n" - "revision\t: %d\n" - "frequency\t: %ld\n" - "dcache size\t: %d bytes\n" - "dcache block size\t: %d bytes\n" - "icache size\t: %d bytes\n" - "icache block size\t: %d bytes\n" - "immu\t\t: %d entries, %lu ways\n" - "dmmu\t\t: %d entries, %lu ways\n" - "bogomips\t: %lu.%02lu\n", - version, - revision, - loops_per_jiffy * HZ, - cpuinfo.dcache_size, - cpuinfo.dcache_block_size, - cpuinfo.icache_size, - cpuinfo.icache_block_size, - 1 << ((mfspr(SPR_DMMUCFGR) & SPR_DMMUCFGR_NTS) >> 2), - 1 + (mfspr(SPR_DMMUCFGR) & SPR_DMMUCFGR_NTW), - 1 << ((mfspr(SPR_IMMUCFGR) & SPR_IMMUCFGR_NTS) >> 2), - 1 + (mfspr(SPR_IMMUCFGR) & SPR_IMMUCFGR_NTW), - (loops_per_jiffy * HZ) / 500000, - ((loops_per_jiffy * HZ) / 5000) % 100); + seq_printf(m, + "cpu\t\t: OpenRISC-%x\n" + "revision\t: %d\n" + "frequency\t: %ld\n" + "dcache size\t: %d bytes\n" + "dcache block size\t: %d bytes\n" + "icache size\t: %d bytes\n" + "icache block size\t: %d bytes\n" + "immu\t\t: %d entries, %lu ways\n" + "dmmu\t\t: %d entries, %lu ways\n" + "bogomips\t: %lu.%02lu\n", + version, + revision, + loops_per_jiffy * HZ, + cpuinfo.dcache_size, + cpuinfo.dcache_block_size, + cpuinfo.icache_size, + cpuinfo.icache_block_size, + 1 << ((mfspr(SPR_DMMUCFGR) & SPR_DMMUCFGR_NTS) >> 2), + 1 + (mfspr(SPR_DMMUCFGR) & SPR_DMMUCFGR_NTW), + 1 << ((mfspr(SPR_IMMUCFGR) & SPR_IMMUCFGR_NTS) >> 2), + 1 + (mfspr(SPR_IMMUCFGR) & SPR_IMMUCFGR_NTW), + (loops_per_jiffy * HZ) / 500000, + ((loops_per_jiffy * HZ) / 5000) % 100); + + return 0; } static void *c_start(struct seq_file *m, loff_t * pos) diff --git a/arch/powerpc/platforms/powernv/opal-power.c b/arch/powerpc/platforms/powernv/opal-power.c index 48bf5b0..ac46c2c 100644 --- a/arch/powerpc/platforms/powernv/opal-power.c +++ b/arch/powerpc/platforms/powernv/opal-power.c @@ -29,8 +29,9 @@ static int opal_power_control_event(struct notifier_block *nb, switch (type) { case SOFT_REBOOT: - /* Fall through. The service processor is responsible for - * bringing the machine back up */ + pr_info("OPAL: reboot requested\n"); + orderly_reboot(); + break; case SOFT_OFF: pr_info("OPAL: poweroff requested\n"); orderly_poweroff(true); diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index a5ced5c..de2726a 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -328,6 +328,7 @@ config COMPAT select COMPAT_BINFMT_ELF if BINFMT_ELF select ARCH_WANT_OLD_COMPAT_IPC select COMPAT_OLD_SIGACTION + depends on MULTIUSER help Select this option if you want to enable your system kernel to handle system-calls from ELF binaries for 31 bit ESA. This option diff --git a/arch/s390/pci/pci_debug.c b/arch/s390/pci/pci_debug.c index 3229a2e..c22d440 100644 --- a/arch/s390/pci/pci_debug.c +++ b/arch/s390/pci/pci_debug.c @@ -45,8 +45,10 @@ static int pci_perf_show(struct seq_file *m, void *v) if (!zdev) return 0; - if (!zdev->fmb) - return seq_printf(m, "FMB statistics disabled\n"); + if (!zdev->fmb) { + seq_puts(m, "FMB statistics disabled\n"); + return 0; + } /* header */ seq_printf(m, "FMB @ %p\n", zdev->fmb); diff --git a/arch/x86/kernel/cpu/mtrr/if.c b/arch/x86/kernel/cpu/mtrr/if.c index a041e09..d76f13d 100644 --- a/arch/x86/kernel/cpu/mtrr/if.c +++ b/arch/x86/kernel/cpu/mtrr/if.c @@ -404,11 +404,10 @@ static const struct file_operations mtrr_fops = { static int mtrr_seq_show(struct seq_file *seq, void *offset) { char factor; - int i, max, len; + int i, max; mtrr_type type; unsigned long base, size; - len = 0; max = num_var_ranges; for (i = 0; i < max; i++) { mtrr_if->get(i, &base, &size, &type); @@ -425,11 +424,10 @@ static int mtrr_seq_show(struct seq_file *seq, void *offset) size >>= 20 - PAGE_SHIFT; } /* Base can be > 32bit */ - len += seq_printf(seq, "reg%02i: base=0x%06lx000 " - "(%5luMB), size=%5lu%cB, count=%d: %s\n", - i, base, base >> (20 - PAGE_SHIFT), size, - factor, mtrr_usage_table[i], - mtrr_attrib_to_str(type)); + seq_printf(seq, "reg%02i: base=0x%06lx000 (%5luMB), size=%5lu%cB, count=%d: %s\n", + i, base, base >> (20 - PAGE_SHIFT), + size, factor, + mtrr_usage_table[i], mtrr_attrib_to_str(type)); } return 0; } diff --git a/drivers/base/power/wakeup.c b/drivers/base/power/wakeup.c index aab7158..7726200 100644 --- a/drivers/base/power/wakeup.c +++ b/drivers/base/power/wakeup.c @@ -843,7 +843,6 @@ static int print_wakeup_source_stats(struct seq_file *m, unsigned long active_count; ktime_t active_time; ktime_t prevent_sleep_time; - int ret; spin_lock_irqsave(&ws->lock, flags); @@ -866,17 +865,16 @@ static int print_wakeup_source_stats(struct seq_file *m, active_time = ktime_set(0, 0); } - ret = seq_printf(m, "%-12s\t%lu\t\t%lu\t\t%lu\t\t%lu\t\t" - "%lld\t\t%lld\t\t%lld\t\t%lld\t\t%lld\n", - ws->name, active_count, ws->event_count, - ws->wakeup_count, ws->expire_count, - ktime_to_ms(active_time), ktime_to_ms(total_time), - ktime_to_ms(max_time), ktime_to_ms(ws->last_time), - ktime_to_ms(prevent_sleep_time)); + seq_printf(m, "%-12s\t%lu\t\t%lu\t\t%lu\t\t%lu\t\t%lld\t\t%lld\t\t%lld\t\t%lld\t\t%lld\n", + ws->name, active_count, ws->event_count, + ws->wakeup_count, ws->expire_count, + ktime_to_ms(active_time), ktime_to_ms(total_time), + ktime_to_ms(max_time), ktime_to_ms(ws->last_time), + ktime_to_ms(prevent_sleep_time)); spin_unlock_irqrestore(&ws->lock, flags); - return ret; + return 0; } /** diff --git a/drivers/block/paride/pg.c b/drivers/block/paride/pg.c index 2ce3dfd..876d0c3 100644 --- a/drivers/block/paride/pg.c +++ b/drivers/block/paride/pg.c @@ -137,7 +137,7 @@ */ -static bool verbose = 0; +static int verbose; static int major = PG_MAJOR; static char *name = PG_NAME; static int disable = 0; @@ -168,7 +168,7 @@ enum {D_PRT, D_PRO, D_UNI, D_MOD, D_SLV, D_DLY}; #include <asm/uaccess.h> -module_param(verbose, bool, 0644); +module_param(verbose, int, 0644); module_param(major, int, 0); module_param(name, charp, 0); module_param_array(drive0, int, NULL, 0); diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 871bd35..c94386a 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -43,11 +43,22 @@ static const char *default_compressor = "lzo"; /* Module params (documentation at end) */ static unsigned int num_devices = 1; +static inline void deprecated_attr_warn(const char *name) +{ + pr_warn_once("%d (%s) Attribute %s (and others) will be removed. %s\n", + task_pid_nr(current), + current->comm, + name, + "See zram documentation."); +} + #define ZRAM_ATTR_RO(name) \ static ssize_t name##_show(struct device *d, \ struct device_attribute *attr, char *b) \ { \ struct zram *zram = dev_to_zram(d); \ + \ + deprecated_attr_warn(__stringify(name)); \ return scnprintf(b, PAGE_SIZE, "%llu\n", \ (u64)atomic64_read(&zram->stats.name)); \ } \ @@ -89,6 +100,7 @@ static ssize_t orig_data_size_show(struct device *dev, { struct zram *zram = dev_to_zram(dev); + deprecated_attr_warn("orig_data_size"); return scnprintf(buf, PAGE_SIZE, "%llu\n", (u64)(atomic64_read(&zram->stats.pages_stored)) << PAGE_SHIFT); } @@ -99,6 +111,7 @@ static ssize_t mem_used_total_show(struct device *dev, u64 val = 0; struct zram *zram = dev_to_zram(dev); + deprecated_attr_warn("mem_used_total"); down_read(&zram->init_lock); if (init_done(zram)) { struct zram_meta *meta = zram->meta; @@ -128,6 +141,7 @@ static ssize_t mem_limit_show(struct device *dev, u64 val; struct zram *zram = dev_to_zram(dev); + deprecated_attr_warn("mem_limit"); down_read(&zram->init_lock); val = zram->limit_pages; up_read(&zram->init_lock); @@ -159,6 +173,7 @@ static ssize_t mem_used_max_show(struct device *dev, u64 val = 0; struct zram *zram = dev_to_zram(dev); + deprecated_attr_warn("mem_used_max"); down_read(&zram->init_lock); if (init_done(zram)) val = atomic_long_read(&zram->stats.max_used_pages); @@ -670,8 +685,12 @@ out: static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index, int offset, int rw) { + unsigned long start_time = jiffies; int ret; + generic_start_io_acct(rw, bvec->bv_len >> SECTOR_SHIFT, + &zram->disk->part0); + if (rw == READ) { atomic64_inc(&zram->stats.num_reads); ret = zram_bvec_read(zram, bvec, index, offset); @@ -680,6 +699,8 @@ static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index, ret = zram_bvec_write(zram, bvec, index, offset); } + generic_end_io_acct(rw, &zram->disk->part0, start_time); + if (unlikely(ret)) { if (rw == READ) atomic64_inc(&zram->stats.failed_reads); @@ -1027,6 +1048,55 @@ static DEVICE_ATTR_RW(mem_used_max); static DEVICE_ATTR_RW(max_comp_streams); static DEVICE_ATTR_RW(comp_algorithm); +static ssize_t io_stat_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct zram *zram = dev_to_zram(dev); + ssize_t ret; + + down_read(&zram->init_lock); + ret = scnprintf(buf, PAGE_SIZE, + "%8llu %8llu %8llu %8llu\n", + (u64)atomic64_read(&zram->stats.failed_reads), + (u64)atomic64_read(&zram->stats.failed_writes), + (u64)atomic64_read(&zram->stats.invalid_io), + (u64)atomic64_read(&zram->stats.notify_free)); + up_read(&zram->init_lock); + + return ret; +} + +static ssize_t mm_stat_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct zram *zram = dev_to_zram(dev); + u64 orig_size, mem_used = 0; + long max_used; + ssize_t ret; + + down_read(&zram->init_lock); + if (init_done(zram)) + mem_used = zs_get_total_pages(zram->meta->mem_pool); + + orig_size = atomic64_read(&zram->stats.pages_stored); + max_used = atomic_long_read(&zram->stats.max_used_pages); + + ret = scnprintf(buf, PAGE_SIZE, + "%8llu %8llu %8llu %8lu %8ld %8llu %8llu\n", + orig_size << PAGE_SHIFT, + (u64)atomic64_read(&zram->stats.compr_data_size), + mem_used << PAGE_SHIFT, + zram->limit_pages << PAGE_SHIFT, + max_used << PAGE_SHIFT, + (u64)atomic64_read(&zram->stats.zero_pages), + (u64)atomic64_read(&zram->stats.num_migrated)); + up_read(&zram->init_lock); + + return ret; +} + +static DEVICE_ATTR_RO(io_stat); +static DEVICE_ATTR_RO(mm_stat); ZRAM_ATTR_RO(num_reads); ZRAM_ATTR_RO(num_writes); ZRAM_ATTR_RO(failed_reads); @@ -1054,6 +1124,8 @@ static struct attribute *zram_disk_attrs[] = { &dev_attr_mem_used_max.attr, &dev_attr_max_comp_streams.attr, &dev_attr_comp_algorithm.attr, + &dev_attr_io_stat.attr, + &dev_attr_mm_stat.attr, NULL, }; @@ -1082,6 +1154,7 @@ static int create_device(struct zram *zram, int device_id) if (!zram->disk) { pr_warn("Error allocating disk structure for device %d\n", device_id); + ret = -ENOMEM; goto out_free_queue; } diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index 17056e5..570c598 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -84,6 +84,7 @@ struct zram_stats { atomic64_t compr_data_size; /* compressed size of pages stored */ atomic64_t num_reads; /* failed + successful */ atomic64_t num_writes; /* --do-- */ + atomic64_t num_migrated; /* no. of migrated object */ atomic64_t failed_reads; /* can happen when memory is too low */ atomic64_t failed_writes; /* can happen when memory is too low */ atomic64_t invalid_io; /* non-page-aligned I/O requests */ diff --git a/drivers/parisc/ccio-dma.c b/drivers/parisc/ccio-dma.c index 8b490d7..6bc1680 100644 --- a/drivers/parisc/ccio-dma.c +++ b/drivers/parisc/ccio-dma.c @@ -1021,7 +1021,6 @@ static struct hppa_dma_ops ccio_ops = { #ifdef CONFIG_PROC_FS static int ccio_proc_info(struct seq_file *m, void *p) { - int len = 0; struct ioc *ioc = ioc_list; while (ioc != NULL) { @@ -1031,22 +1030,22 @@ static int ccio_proc_info(struct seq_file *m, void *p) int j; #endif - len += seq_printf(m, "%s\n", ioc->name); + seq_printf(m, "%s\n", ioc->name); - len += seq_printf(m, "Cujo 2.0 bug : %s\n", - (ioc->cujo20_bug ? "yes" : "no")); + seq_printf(m, "Cujo 2.0 bug : %s\n", + (ioc->cujo20_bug ? "yes" : "no")); - len += seq_printf(m, "IO PDIR size : %d bytes (%d entries)\n", - total_pages * 8, total_pages); + seq_printf(m, "IO PDIR size : %d bytes (%d entries)\n", + total_pages * 8, total_pages); #ifdef CCIO_COLLECT_STATS - len += seq_printf(m, "IO PDIR entries : %ld free %ld used (%d%%)\n", - total_pages - ioc->used_pages, ioc->used_pages, - (int)(ioc->used_pages * 100 / total_pages)); + seq_printf(m, "IO PDIR entries : %ld free %ld used (%d%%)\n", + total_pages - ioc->used_pages, ioc->used_pages, + (int)(ioc->used_pages * 100 / total_pages)); #endif - len += seq_printf(m, "Resource bitmap : %d bytes (%d pages)\n", - ioc->res_size, total_pages); + seq_printf(m, "Resource bitmap : %d bytes (%d pages)\n", + ioc->res_size, total_pages); #ifdef CCIO_COLLECT_STATS min = max = ioc->avg_search[0]; @@ -1058,26 +1057,26 @@ static int ccio_proc_info(struct seq_file *m, void *p) min = ioc->avg_search[j]; } avg /= CCIO_SEARCH_SAMPLE; - len += seq_printf(m, " Bitmap search : %ld/%ld/%ld (min/avg/max CPU Cycles)\n", - min, avg, max); + seq_printf(m, " Bitmap search : %ld/%ld/%ld (min/avg/max CPU Cycles)\n", + min, avg, max); - len += seq_printf(m, "pci_map_single(): %8ld calls %8ld pages (avg %d/1000)\n", - ioc->msingle_calls, ioc->msingle_pages, - (int)((ioc->msingle_pages * 1000)/ioc->msingle_calls)); + seq_printf(m, "pci_map_single(): %8ld calls %8ld pages (avg %d/1000)\n", + ioc->msingle_calls, ioc->msingle_pages, + (int)((ioc->msingle_pages * 1000)/ioc->msingle_calls)); /* KLUGE - unmap_sg calls unmap_single for each mapped page */ min = ioc->usingle_calls - ioc->usg_calls; max = ioc->usingle_pages - ioc->usg_pages; - len += seq_printf(m, "pci_unmap_single: %8ld calls %8ld pages (avg %d/1000)\n", - min, max, (int)((max * 1000)/min)); + seq_printf(m, "pci_unmap_single: %8ld calls %8ld pages (avg %d/1000)\n", + min, max, (int)((max * 1000)/min)); - len += seq_printf(m, "pci_map_sg() : %8ld calls %8ld pages (avg %d/1000)\n", - ioc->msg_calls, ioc->msg_pages, - (int)((ioc->msg_pages * 1000)/ioc->msg_calls)); + seq_printf(m, "pci_map_sg() : %8ld calls %8ld pages (avg %d/1000)\n", + ioc->msg_calls, ioc->msg_pages, + (int)((ioc->msg_pages * 1000)/ioc->msg_calls)); - len += seq_printf(m, "pci_unmap_sg() : %8ld calls %8ld pages (avg %d/1000)\n\n\n", - ioc->usg_calls, ioc->usg_pages, - (int)((ioc->usg_pages * 1000)/ioc->usg_calls)); + seq_printf(m, "pci_unmap_sg() : %8ld calls %8ld pages (avg %d/1000)\n\n\n", + ioc->usg_calls, ioc->usg_pages, + (int)((ioc->usg_pages * 1000)/ioc->usg_calls)); #endif /* CCIO_COLLECT_STATS */ ioc = ioc->next; @@ -1101,7 +1100,6 @@ static const struct file_operations ccio_proc_info_fops = { static int ccio_proc_bitmap_info(struct seq_file *m, void *p) { - int len = 0; struct ioc *ioc = ioc_list; while (ioc != NULL) { @@ -1110,11 +1108,11 @@ static int ccio_proc_bitmap_info(struct seq_file *m, void *p) for (j = 0; j < (ioc->res_size / sizeof(u32)); j++) { if ((j & 7) == 0) - len += seq_puts(m, "\n "); - len += seq_printf(m, "%08x", *res_ptr); + seq_puts(m, "\n "); + seq_printf(m, "%08x", *res_ptr); res_ptr++; } - len += seq_puts(m, "\n\n"); + seq_puts(m, "\n\n"); ioc = ioc->next; break; /* XXX - remove me */ } diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c index 1ff1b67..f074712 100644 --- a/drivers/parisc/sba_iommu.c +++ b/drivers/parisc/sba_iommu.c @@ -1774,37 +1774,35 @@ static int sba_proc_info(struct seq_file *m, void *p) #ifdef SBA_COLLECT_STATS unsigned long avg = 0, min, max; #endif - int i, len = 0; - - len += seq_printf(m, "%s rev %d.%d\n", - sba_dev->name, - (sba_dev->hw_rev & 0x7) + 1, - (sba_dev->hw_rev & 0x18) >> 3 - ); - len += seq_printf(m, "IO PDIR size : %d bytes (%d entries)\n", - (int) ((ioc->res_size << 3) * sizeof(u64)), /* 8 bits/byte */ - total_pages); - - len += seq_printf(m, "Resource bitmap : %d bytes (%d pages)\n", - ioc->res_size, ioc->res_size << 3); /* 8 bits per byte */ - - len += seq_printf(m, "LMMIO_BASE/MASK/ROUTE %08x %08x %08x\n", - READ_REG32(sba_dev->sba_hpa + LMMIO_DIST_BASE), - READ_REG32(sba_dev->sba_hpa + LMMIO_DIST_MASK), - READ_REG32(sba_dev->sba_hpa + LMMIO_DIST_ROUTE) - ); + int i; + + seq_printf(m, "%s rev %d.%d\n", + sba_dev->name, + (sba_dev->hw_rev & 0x7) + 1, + (sba_dev->hw_rev & 0x18) >> 3); + seq_printf(m, "IO PDIR size : %d bytes (%d entries)\n", + (int)((ioc->res_size << 3) * sizeof(u64)), /* 8 bits/byte */ + total_pages); + + seq_printf(m, "Resource bitmap : %d bytes (%d pages)\n", + ioc->res_size, ioc->res_size << 3); /* 8 bits per byte */ + + seq_printf(m, "LMMIO_BASE/MASK/ROUTE %08x %08x %08x\n", + READ_REG32(sba_dev->sba_hpa + LMMIO_DIST_BASE), + READ_REG32(sba_dev->sba_hpa + LMMIO_DIST_MASK), + READ_REG32(sba_dev->sba_hpa + LMMIO_DIST_ROUTE)); for (i=0; i<4; i++) - len += seq_printf(m, "DIR%d_BASE/MASK/ROUTE %08x %08x %08x\n", i, - READ_REG32(sba_dev->sba_hpa + LMMIO_DIRECT0_BASE + i*0x18), - READ_REG32(sba_dev->sba_hpa + LMMIO_DIRECT0_MASK + i*0x18), - READ_REG32(sba_dev->sba_hpa + LMMIO_DIRECT0_ROUTE + i*0x18) - ); + seq_printf(m, "DIR%d_BASE/MASK/ROUTE %08x %08x %08x\n", + i, + READ_REG32(sba_dev->sba_hpa + LMMIO_DIRECT0_BASE + i*0x18), + READ_REG32(sba_dev->sba_hpa + LMMIO_DIRECT0_MASK + i*0x18), + READ_REG32(sba_dev->sba_hpa + LMMIO_DIRECT0_ROUTE + i*0x18)); #ifdef SBA_COLLECT_STATS - len += seq_printf(m, "IO PDIR entries : %ld free %ld used (%d%%)\n", - total_pages - ioc->used_pages, ioc->used_pages, - (int) (ioc->used_pages * 100 / total_pages)); + seq_printf(m, "IO PDIR entries : %ld free %ld used (%d%%)\n", + total_pages - ioc->used_pages, ioc->used_pages, + (int)(ioc->used_pages * 100 / total_pages)); min = max = ioc->avg_search[0]; for (i = 0; i < SBA_SEARCH_SAMPLE; i++) { @@ -1813,26 +1811,26 @@ static int sba_proc_info(struct seq_file *m, void *p) if (ioc->avg_search[i] < min) min = ioc->avg_search[i]; } avg /= SBA_SEARCH_SAMPLE; - len += seq_printf(m, " Bitmap search : %ld/%ld/%ld (min/avg/max CPU Cycles)\n", - min, avg, max); + seq_printf(m, " Bitmap search : %ld/%ld/%ld (min/avg/max CPU Cycles)\n", + min, avg, max); - len += seq_printf(m, "pci_map_single(): %12ld calls %12ld pages (avg %d/1000)\n", - ioc->msingle_calls, ioc->msingle_pages, - (int) ((ioc->msingle_pages * 1000)/ioc->msingle_calls)); + seq_printf(m, "pci_map_single(): %12ld calls %12ld pages (avg %d/1000)\n", + ioc->msingle_calls, ioc->msingle_pages, + (int)((ioc->msingle_pages * 1000)/ioc->msingle_calls)); /* KLUGE - unmap_sg calls unmap_single for each mapped page */ min = ioc->usingle_calls; max = ioc->usingle_pages - ioc->usg_pages; - len += seq_printf(m, "pci_unmap_single: %12ld calls %12ld pages (avg %d/1000)\n", - min, max, (int) ((max * 1000)/min)); + seq_printf(m, "pci_unmap_single: %12ld calls %12ld pages (avg %d/1000)\n", + min, max, (int)((max * 1000)/min)); - len += seq_printf(m, "pci_map_sg() : %12ld calls %12ld pages (avg %d/1000)\n", - ioc->msg_calls, ioc->msg_pages, - (int) ((ioc->msg_pages * 1000)/ioc->msg_calls)); + seq_printf(m, "pci_map_sg() : %12ld calls %12ld pages (avg %d/1000)\n", + ioc->msg_calls, ioc->msg_pages, + (int)((ioc->msg_pages * 1000)/ioc->msg_calls)); - len += seq_printf(m, "pci_unmap_sg() : %12ld calls %12ld pages (avg %d/1000)\n", - ioc->usg_calls, ioc->usg_pages, - (int) ((ioc->usg_pages * 1000)/ioc->usg_calls)); + seq_printf(m, "pci_unmap_sg() : %12ld calls %12ld pages (avg %d/1000)\n", + ioc->usg_calls, ioc->usg_pages, + (int)((ioc->usg_pages * 1000)/ioc->usg_calls)); #endif return 0; @@ -1858,14 +1856,14 @@ sba_proc_bitmap_info(struct seq_file *m, void *p) struct sba_device *sba_dev = sba_list; struct ioc *ioc = &sba_dev->ioc[0]; /* FIXME: Multi-IOC support! */ unsigned int *res_ptr = (unsigned int *)ioc->res_map; - int i, len = 0; + int i; for (i = 0; i < (ioc->res_size/sizeof(unsigned int)); ++i, ++res_ptr) { if ((i & 7) == 0) - len += seq_printf(m, "\n "); - len += seq_printf(m, " %08x", *res_ptr); + seq_puts(m, "\n "); + seq_printf(m, " %08x", *res_ptr); } - len += seq_printf(m, "\n"); + seq_putc(m, '\n'); return 0; } diff --git a/drivers/rtc/rtc-cmos.c b/drivers/rtc/rtc-cmos.c index 5b2e761..87647f4 100644 --- a/drivers/rtc/rtc-cmos.c +++ b/drivers/rtc/rtc-cmos.c @@ -459,23 +459,25 @@ static int cmos_procfs(struct device *dev, struct seq_file *seq) /* NOTE: at least ICH6 reports battery status using a different * (non-RTC) bit; and SQWE is ignored on many current systems. */ - return seq_printf(seq, - "periodic_IRQ\t: %s\n" - "update_IRQ\t: %s\n" - "HPET_emulated\t: %s\n" - // "square_wave\t: %s\n" - "BCD\t\t: %s\n" - "DST_enable\t: %s\n" - "periodic_freq\t: %d\n" - "batt_status\t: %s\n", - (rtc_control & RTC_PIE) ? "yes" : "no", - (rtc_control & RTC_UIE) ? "yes" : "no", - is_hpet_enabled() ? "yes" : "no", - // (rtc_control & RTC_SQWE) ? "yes" : "no", - (rtc_control & RTC_DM_BINARY) ? "no" : "yes", - (rtc_control & RTC_DST_EN) ? "yes" : "no", - cmos->rtc->irq_freq, - (valid & RTC_VRT) ? "okay" : "dead"); + seq_printf(seq, + "periodic_IRQ\t: %s\n" + "update_IRQ\t: %s\n" + "HPET_emulated\t: %s\n" + // "square_wave\t: %s\n" + "BCD\t\t: %s\n" + "DST_enable\t: %s\n" + "periodic_freq\t: %d\n" + "batt_status\t: %s\n", + (rtc_control & RTC_PIE) ? "yes" : "no", + (rtc_control & RTC_UIE) ? "yes" : "no", + is_hpet_enabled() ? "yes" : "no", + // (rtc_control & RTC_SQWE) ? "yes" : "no", + (rtc_control & RTC_DM_BINARY) ? "no" : "yes", + (rtc_control & RTC_DST_EN) ? "yes" : "no", + cmos->rtc->irq_freq, + (valid & RTC_VRT) ? "okay" : "dead"); + + return 0; } #else diff --git a/drivers/rtc/rtc-ds1305.c b/drivers/rtc/rtc-ds1305.c index 129add77..12b0715 100644 --- a/drivers/rtc/rtc-ds1305.c +++ b/drivers/rtc/rtc-ds1305.c @@ -434,9 +434,9 @@ static int ds1305_proc(struct device *dev, struct seq_file *seq) } done: - return seq_printf(seq, - "trickle_charge\t: %s%s\n", - diodes, resistors); + seq_printf(seq, "trickle_charge\t: %s%s\n", diodes, resistors); + + return 0; } #else diff --git a/drivers/rtc/rtc-mrst.c b/drivers/rtc/rtc-mrst.c index 3a6fd3a..548ea6f 100644 --- a/drivers/rtc/rtc-mrst.c +++ b/drivers/rtc/rtc-mrst.c @@ -277,13 +277,15 @@ static int mrst_procfs(struct device *dev, struct seq_file *seq) valid = vrtc_cmos_read(RTC_VALID); spin_unlock_irq(&rtc_lock); - return seq_printf(seq, - "periodic_IRQ\t: %s\n" - "alarm\t\t: %s\n" - "BCD\t\t: no\n" - "periodic_freq\t: daily (not adjustable)\n", - (rtc_control & RTC_PIE) ? "on" : "off", - (rtc_control & RTC_AIE) ? "on" : "off"); + seq_printf(seq, + "periodic_IRQ\t: %s\n" + "alarm\t\t: %s\n" + "BCD\t\t: no\n" + "periodic_freq\t: daily (not adjustable)\n", + (rtc_control & RTC_PIE) ? "on" : "off", + (rtc_control & RTC_AIE) ? "on" : "off"); + + return 0; } #else diff --git a/drivers/rtc/rtc-tegra.c b/drivers/rtc/rtc-tegra.c index d948277..60232bd 100644 --- a/drivers/rtc/rtc-tegra.c +++ b/drivers/rtc/rtc-tegra.c @@ -261,7 +261,9 @@ static int tegra_rtc_proc(struct device *dev, struct seq_file *seq) if (!dev || !dev->driver) return 0; - return seq_printf(seq, "name\t\t: %s\n", dev_name(dev)); + seq_printf(seq, "name\t\t: %s\n", dev_name(dev)); + + return 0; } static irqreturn_t tegra_rtc_irq_handler(int irq, void *data) diff --git a/drivers/s390/cio/blacklist.c b/drivers/s390/cio/blacklist.c index b3f791b..20314aa 100644 --- a/drivers/s390/cio/blacklist.c +++ b/drivers/s390/cio/blacklist.c @@ -330,18 +330,20 @@ cio_ignore_proc_seq_show(struct seq_file *s, void *it) if (!iter->in_range) { /* First device in range. */ if ((iter->devno == __MAX_SUBCHANNEL) || - !is_blacklisted(iter->ssid, iter->devno + 1)) + !is_blacklisted(iter->ssid, iter->devno + 1)) { /* Singular device. */ - return seq_printf(s, "0.%x.%04x\n", - iter->ssid, iter->devno); + seq_printf(s, "0.%x.%04x\n", iter->ssid, iter->devno); + return 0; + } iter->in_range = 1; - return seq_printf(s, "0.%x.%04x-", iter->ssid, iter->devno); + seq_printf(s, "0.%x.%04x-", iter->ssid, iter->devno); + return 0; } if ((iter->devno == __MAX_SUBCHANNEL) || !is_blacklisted(iter->ssid, iter->devno + 1)) { /* Last device in range. */ iter->in_range = 0; - return seq_printf(s, "0.%x.%04x\n", iter->ssid, iter->devno); + seq_printf(s, "0.%x.%04x\n", iter->ssid, iter->devno); } return 0; } diff --git a/drivers/sbus/char/bbc_envctrl.c b/drivers/sbus/char/bbc_envctrl.c index 0787b97..228c782 100644 --- a/drivers/sbus/char/bbc_envctrl.c +++ b/drivers/sbus/char/bbc_envctrl.c @@ -160,8 +160,7 @@ static void do_envctrl_shutdown(struct bbc_cpu_temperature *tp) printk(KERN_CRIT "kenvctrld: Shutting down the system now.\n"); shutting_down = 1; - if (orderly_poweroff(true) < 0) - printk(KERN_CRIT "envctrl: shutdown execution failed\n"); + orderly_poweroff(true); } #define WARN_INTERVAL (30 * HZ) diff --git a/drivers/sbus/char/envctrl.c b/drivers/sbus/char/envctrl.c index e244cf3..5609b60 100644 --- a/drivers/sbus/char/envctrl.c +++ b/drivers/sbus/char/envctrl.c @@ -970,18 +970,13 @@ static struct i2c_child_t *envctrl_get_i2c_child(unsigned char mon_type) static void envctrl_do_shutdown(void) { static int inprog = 0; - int ret; if (inprog != 0) return; inprog = 1; printk(KERN_CRIT "kenvctrld: WARNING: Shutting down the system now.\n"); - ret = orderly_poweroff(true); - if (ret < 0) { - printk(KERN_CRIT "kenvctrld: WARNING: system shutdown failed!\n"); - inprog = 0; /* unlikely to succeed, but we could try again */ - } + orderly_poweroff(true); } static struct task_struct *kenvctrld_task; diff --git a/drivers/staging/lustre/lustre/Kconfig b/drivers/staging/lustre/lustre/Kconfig index 6725467..62c7bba 100644 --- a/drivers/staging/lustre/lustre/Kconfig +++ b/drivers/staging/lustre/lustre/Kconfig @@ -10,6 +10,7 @@ config LUSTRE_FS select CRYPTO_SHA1 select CRYPTO_SHA256 select CRYPTO_SHA512 + depends on MULTIUSER help This option enables Lustre file system client support. Choose Y here if you want to access a Lustre file system cluster. To compile @@ -464,6 +464,23 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, EXPORT_SYMBOL_GPL(dax_fault); /** + * dax_pfn_mkwrite - handle first write to DAX page + * @vma: The virtual memory area where the fault occurred + * @vmf: The description of the fault + * + */ +int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct super_block *sb = file_inode(vma->vm_file)->i_sb; + + sb_start_pagefault(sb); + file_update_time(vma->vm_file); + sb_end_pagefault(sb); + return VM_FAULT_NOPAGE; +} +EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); + +/** * dax_zero_page_range - zero a range within a page of a DAX file * @inode: The file being truncated * @from: The file offset that is being truncated to diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index 678f9ab..8d15feb 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -793,7 +793,6 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync); extern const struct inode_operations ext2_file_inode_operations; extern const struct file_operations ext2_file_operations; -extern const struct file_operations ext2_dax_file_operations; /* inode.c */ extern const struct address_space_operations ext2_aops; diff --git a/fs/ext2/file.c b/fs/ext2/file.c index ef04fdb..3a0a6c6 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -39,6 +39,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) static const struct vm_operations_struct ext2_dax_vm_ops = { .fault = ext2_dax_fault, .page_mkwrite = ext2_dax_mkwrite, + .pfn_mkwrite = dax_pfn_mkwrite, }; static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma) @@ -106,22 +107,6 @@ const struct file_operations ext2_file_operations = { .splice_write = iter_file_splice_write, }; -#ifdef CONFIG_FS_DAX -const struct file_operations ext2_dax_file_operations = { - .llseek = generic_file_llseek, - .read_iter = generic_file_read_iter, - .write_iter = generic_file_write_iter, - .unlocked_ioctl = ext2_ioctl, -#ifdef CONFIG_COMPAT - .compat_ioctl = ext2_compat_ioctl, -#endif - .mmap = ext2_file_mmap, - .open = dquot_file_open, - .release = ext2_release_file, - .fsync = ext2_fsync, -}; -#endif - const struct inode_operations ext2_file_inode_operations = { #ifdef CONFIG_EXT2_FS_XATTR .setxattr = generic_setxattr, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index df9d6af..b29eb67 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1388,10 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) { - inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_dax_file_operations; - } else if (test_opt(inode->i_sb, NOBH)) { + if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; } else { diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 148f6e3..ce42293 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -104,10 +104,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode return PTR_ERR(inode); inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) { - inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_dax_file_operations; - } else if (test_opt(inode->i_sb, NOBH)) { + if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; } else { @@ -125,10 +122,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) return PTR_ERR(inode); inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) { - inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_dax_file_operations; - } else if (test_opt(inode->i_sb, NOBH)) { + if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; } else { diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index f63c3d5..8a3981e 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2593,7 +2593,6 @@ extern const struct file_operations ext4_dir_operations; /* file.c */ extern const struct inode_operations ext4_file_inode_operations; extern const struct file_operations ext4_file_operations; -extern const struct file_operations ext4_dax_file_operations; extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin); /* inline.c */ diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 9ad0303..7a6defc 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -206,6 +206,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) static const struct vm_operations_struct ext4_dax_vm_ops = { .fault = ext4_dax_fault, .page_mkwrite = ext4_dax_mkwrite, + .pfn_mkwrite = dax_pfn_mkwrite, }; #else #define ext4_dax_vm_ops ext4_file_vm_ops @@ -622,24 +623,6 @@ const struct file_operations ext4_file_operations = { .fallocate = ext4_fallocate, }; -#ifdef CONFIG_FS_DAX -const struct file_operations ext4_dax_file_operations = { - .llseek = ext4_llseek, - .read_iter = generic_file_read_iter, - .write_iter = ext4_file_write_iter, - .unlocked_ioctl = ext4_ioctl, -#ifdef CONFIG_COMPAT - .compat_ioctl = ext4_compat_ioctl, -#endif - .mmap = ext4_file_mmap, - .open = ext4_file_open, - .release = ext4_release_file, - .fsync = ext4_sync_file, - /* Splice not yet supported with DAX */ - .fallocate = ext4_fallocate, -}; -#endif - const struct inode_operations ext4_file_inode_operations = { .setattr = ext4_setattr, .getattr = ext4_getattr, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index a3f4513..035b7a0 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4090,10 +4090,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = &ext4_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) - inode->i_fop = &ext4_dax_file_operations; - else - inode->i_fop = &ext4_file_operations; + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); } else if (S_ISDIR(inode->i_mode)) { inode->i_op = &ext4_dir_inode_operations; diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 28fe71a..2291923 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2235,10 +2235,7 @@ retry: err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) - inode->i_fop = &ext4_dax_file_operations; - else - inode->i_fop = &ext4_file_operations; + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); err = ext4_add_nondir(handle, dentry, inode); if (!err && IS_DIRSYNC(dir)) @@ -2302,10 +2299,7 @@ retry: err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; - if (test_opt(inode->i_sb, DAX)) - inode->i_fop = &ext4_dax_file_operations; - else - inode->i_fop = &ext4_file_operations; + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); d_tmpfile(dentry, inode); err = ext4_orphan_add(handle, inode); diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 45e3490..2640d88 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -48,9 +48,10 @@ struct hugetlbfs_config { kuid_t uid; kgid_t gid; umode_t mode; - long nr_blocks; + long max_hpages; long nr_inodes; struct hstate *hstate; + long min_hpages; }; struct hugetlbfs_inode_info { @@ -68,7 +69,7 @@ int sysctl_hugetlb_shm_group; enum { Opt_size, Opt_nr_inodes, Opt_mode, Opt_uid, Opt_gid, - Opt_pagesize, + Opt_pagesize, Opt_min_size, Opt_err, }; @@ -79,6 +80,7 @@ static const match_table_t tokens = { {Opt_uid, "uid=%u"}, {Opt_gid, "gid=%u"}, {Opt_pagesize, "pagesize=%s"}, + {Opt_min_size, "min_size=%s"}, {Opt_err, NULL}, }; @@ -729,14 +731,38 @@ static const struct super_operations hugetlbfs_ops = { .show_options = generic_show_options, }; +enum { NO_SIZE, SIZE_STD, SIZE_PERCENT }; + +/* + * Convert size option passed from command line to number of huge pages + * in the pool specified by hstate. Size option could be in bytes + * (val_type == SIZE_STD) or percentage of the pool (val_type == SIZE_PERCENT). + */ +static long long +hugetlbfs_size_to_hpages(struct hstate *h, unsigned long long size_opt, + int val_type) +{ + if (val_type == NO_SIZE) + return -1; + + if (val_type == SIZE_PERCENT) { + size_opt <<= huge_page_shift(h); + size_opt *= h->max_huge_pages; + do_div(size_opt, 100); + } + + size_opt >>= huge_page_shift(h); + return size_opt; +} + static int hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig) { char *p, *rest; substring_t args[MAX_OPT_ARGS]; int option; - unsigned long long size = 0; - enum { NO_SIZE, SIZE_STD, SIZE_PERCENT } setsize = NO_SIZE; + unsigned long long max_size_opt = 0, min_size_opt = 0; + int max_val_type = NO_SIZE, min_val_type = NO_SIZE; if (!options) return 0; @@ -774,10 +800,10 @@ hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig) /* memparse() will accept a K/M/G without a digit */ if (!isdigit(*args[0].from)) goto bad_val; - size = memparse(args[0].from, &rest); - setsize = SIZE_STD; + max_size_opt = memparse(args[0].from, &rest); + max_val_type = SIZE_STD; if (*rest == '%') - setsize = SIZE_PERCENT; + max_val_type = SIZE_PERCENT; break; } @@ -800,6 +826,17 @@ hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig) break; } + case Opt_min_size: { + /* memparse() will accept a K/M/G without a digit */ + if (!isdigit(*args[0].from)) + goto bad_val; + min_size_opt = memparse(args[0].from, &rest); + min_val_type = SIZE_STD; + if (*rest == '%') + min_val_type = SIZE_PERCENT; + break; + } + default: pr_err("Bad mount option: \"%s\"\n", p); return -EINVAL; @@ -807,15 +844,22 @@ hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig) } } - /* Do size after hstate is set up */ - if (setsize > NO_SIZE) { - struct hstate *h = pconfig->hstate; - if (setsize == SIZE_PERCENT) { - size <<= huge_page_shift(h); - size *= h->max_huge_pages; - do_div(size, 100); - } - pconfig->nr_blocks = (size >> huge_page_shift(h)); + /* + * Use huge page pool size (in hstate) to convert the size + * options to number of huge pages. If NO_SIZE, -1 is returned. + */ + pconfig->max_hpages = hugetlbfs_size_to_hpages(pconfig->hstate, + max_size_opt, max_val_type); + pconfig->min_hpages = hugetlbfs_size_to_hpages(pconfig->hstate, + min_size_opt, min_val_type); + + /* + * If max_size was specified, then min_size must be smaller + */ + if (max_val_type > NO_SIZE && + pconfig->min_hpages > pconfig->max_hpages) { + pr_err("minimum size can not be greater than maximum size\n"); + return -EINVAL; } return 0; @@ -834,12 +878,13 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent) save_mount_options(sb, data); - config.nr_blocks = -1; /* No limit on size by default */ + config.max_hpages = -1; /* No limit on size by default */ config.nr_inodes = -1; /* No limit on number of inodes by default */ config.uid = current_fsuid(); config.gid = current_fsgid(); config.mode = 0755; config.hstate = &default_hstate; + config.min_hpages = -1; /* No default minimum size */ ret = hugetlbfs_parse_options(data, &config); if (ret) return ret; @@ -853,8 +898,15 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent) sbinfo->max_inodes = config.nr_inodes; sbinfo->free_inodes = config.nr_inodes; sbinfo->spool = NULL; - if (config.nr_blocks != -1) { - sbinfo->spool = hugepage_new_subpool(config.nr_blocks); + /* + * Allocate and initialize subpool if maximum or minimum size is + * specified. Any needed reservations (for minimim size) are taken + * taken when the subpool is created. + */ + if (config.max_hpages != -1 || config.min_hpages != -1) { + sbinfo->spool = hugepage_new_subpool(config.hstate, + config.max_hpages, + config.min_hpages); if (!sbinfo->spool) goto out_free; } diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index 49ba7ff..16a0922 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -183,30 +183,23 @@ static inline void remove_metapage(struct page *page, struct metapage *mp) #endif -static void init_once(void *foo) -{ - struct metapage *mp = (struct metapage *)foo; - - mp->lid = 0; - mp->lsn = 0; - mp->flag = 0; - mp->data = NULL; - mp->clsn = 0; - mp->log = NULL; - set_bit(META_free, &mp->flag); - init_waitqueue_head(&mp->wait); -} - static inline struct metapage *alloc_metapage(gfp_t gfp_mask) { - return mempool_alloc(metapage_mempool, gfp_mask); + struct metapage *mp = mempool_alloc(metapage_mempool, gfp_mask); + + if (mp) { + mp->lid = 0; + mp->lsn = 0; + mp->data = NULL; + mp->clsn = 0; + mp->log = NULL; + init_waitqueue_head(&mp->wait); + } + return mp; } static inline void free_metapage(struct metapage *mp) { - mp->flag = 0; - set_bit(META_free, &mp->flag); - mempool_free(mp, metapage_mempool); } @@ -216,7 +209,7 @@ int __init metapage_init(void) * Allocate the metapage structures */ metapage_cache = kmem_cache_create("jfs_mp", sizeof(struct metapage), - 0, 0, init_once); + 0, 0, NULL); if (metapage_cache == NULL) return -ENOMEM; diff --git a/fs/jfs/jfs_metapage.h b/fs/jfs/jfs_metapage.h index a78beda..337e9e5 100644 --- a/fs/jfs/jfs_metapage.h +++ b/fs/jfs/jfs_metapage.h @@ -48,7 +48,6 @@ struct metapage { /* metapage flag */ #define META_locked 0 -#define META_free 1 #define META_dirty 2 #define META_sync 3 #define META_discard 4 diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig index c7abc10..f31fd0d 100644 --- a/fs/nfs/Kconfig +++ b/fs/nfs/Kconfig @@ -1,6 +1,6 @@ config NFS_FS tristate "NFS client support" - depends on INET && FILE_LOCKING + depends on INET && FILE_LOCKING && MULTIUSER select LOCKD select SUNRPC select NFS_ACL_SUPPORT if NFS_V3_ACL diff --git a/fs/nfsd/Kconfig b/fs/nfsd/Kconfig index 683bf71..fc2d108 100644 --- a/fs/nfsd/Kconfig +++ b/fs/nfsd/Kconfig @@ -6,6 +6,7 @@ config NFSD select SUNRPC select EXPORTFS select NFS_ACL_SUPPORT if NFSD_V2_ACL + depends on MULTIUSER help Choose Y here if you want to allow other computers to access files residing on this system using Sun's Network File System diff --git a/fs/proc/array.c b/fs/proc/array.c index 1295a00..fd02a9e 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -99,8 +99,8 @@ static inline void task_name(struct seq_file *m, struct task_struct *p) buf = m->buf + m->count; /* Ignore error for now */ - string_escape_str(tcomm, &buf, m->size - m->count, - ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\"); + buf += string_escape_str(tcomm, buf, m->size - m->count, + ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\"); m->count = buf - m->buf; seq_putc(m, '\n'); @@ -188,6 +188,24 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns, from_kgid_munged(user_ns, GROUP_AT(group_info, g))); put_cred(cred); +#ifdef CONFIG_PID_NS + seq_puts(m, "\nNStgid:"); + for (g = ns->level; g <= pid->level; g++) + seq_printf(m, "\t%d", + task_tgid_nr_ns(p, pid->numbers[g].ns)); + seq_puts(m, "\nNSpid:"); + for (g = ns->level; g <= pid->level; g++) + seq_printf(m, "\t%d", + task_pid_nr_ns(p, pid->numbers[g].ns)); + seq_puts(m, "\nNSpgid:"); + for (g = ns->level; g <= pid->level; g++) + seq_printf(m, "\t%d", + task_pgrp_nr_ns(p, pid->numbers[g].ns)); + seq_puts(m, "\nNSsid:"); + for (g = ns->level; g <= pid->level; g++) + seq_printf(m, "\t%d", + task_session_nr_ns(p, pid->numbers[g].ns)); +#endif seq_putc(m, '\n'); } @@ -614,7 +632,9 @@ static int children_seq_show(struct seq_file *seq, void *v) pid_t pid; pid = pid_nr_ns(v, inode->i_sb->s_fs_info); - return seq_printf(seq, "%d ", pid); + seq_printf(seq, "%d ", pid); + + return 0; } static void *children_seq_start(struct seq_file *seq, loff_t *pos) diff --git a/fs/proc/base.c b/fs/proc/base.c index 3f3d7ae..7a3b82f 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -238,13 +238,15 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns, wchan = get_wchan(task); - if (lookup_symbol_name(wchan, symname) < 0) + if (lookup_symbol_name(wchan, symname) < 0) { if (!ptrace_may_access(task, PTRACE_MODE_READ)) return 0; - else - return seq_printf(m, "%lu", wchan); - else - return seq_printf(m, "%s", symname); + seq_printf(m, "%lu", wchan); + } else { + seq_printf(m, "%s", symname); + } + + return 0; } #endif /* CONFIG_KALLSYMS */ @@ -309,10 +311,12 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns, static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { - return seq_printf(m, "%llu %llu %lu\n", - (unsigned long long)task->se.sum_exec_runtime, - (unsigned long long)task->sched_info.run_delay, - task->sched_info.pcount); + seq_printf(m, "%llu %llu %lu\n", + (unsigned long long)task->se.sum_exec_runtime, + (unsigned long long)task->sched_info.run_delay, + task->sched_info.pcount); + + return 0; } #endif @@ -387,7 +391,9 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns, points = oom_badness(task, NULL, NULL, totalpages) * 1000 / totalpages; read_unlock(&tasklist_lock); - return seq_printf(m, "%lu\n", points); + seq_printf(m, "%lu\n", points); + + return 0; } struct limit_names { @@ -432,15 +438,15 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns, * print the file header */ seq_printf(m, "%-25s %-20s %-20s %-10s\n", - "Limit", "Soft Limit", "Hard Limit", "Units"); + "Limit", "Soft Limit", "Hard Limit", "Units"); for (i = 0; i < RLIM_NLIMITS; i++) { if (rlim[i].rlim_cur == RLIM_INFINITY) seq_printf(m, "%-25s %-20s ", - lnames[i].name, "unlimited"); + lnames[i].name, "unlimited"); else seq_printf(m, "%-25s %-20lu ", - lnames[i].name, rlim[i].rlim_cur); + lnames[i].name, rlim[i].rlim_cur); if (rlim[i].rlim_max == RLIM_INFINITY) seq_printf(m, "%-20s ", "unlimited"); @@ -462,7 +468,9 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns, { long nr; unsigned long args[6], sp, pc; - int res = lock_trace(task); + int res; + + res = lock_trace(task); if (res) return res; @@ -477,7 +485,8 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns, args[0], args[1], args[2], args[3], args[4], args[5], sp, pc); unlock_trace(task); - return res; + + return 0; } #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */ @@ -2002,12 +2011,13 @@ static int show_timer(struct seq_file *m, void *v) notify = timer->it_sigev_notify; seq_printf(m, "ID: %d\n", timer->it_id); - seq_printf(m, "signal: %d/%p\n", timer->sigq->info.si_signo, - timer->sigq->info.si_value.sival_ptr); + seq_printf(m, "signal: %d/%p\n", + timer->sigq->info.si_signo, + timer->sigq->info.si_value.sival_ptr); seq_printf(m, "notify: %s/%s.%d\n", - nstr[notify & ~SIGEV_THREAD_ID], - (notify & SIGEV_THREAD_ID) ? "tid" : "pid", - pid_nr_ns(timer->it_pid, tp->ns)); + nstr[notify & ~SIGEV_THREAD_ID], + (notify & SIGEV_THREAD_ID) ? "tid" : "pid", + pid_nr_ns(timer->it_pid, tp->ns)); seq_printf(m, "ClockID: %d\n", timer->it_clock); return 0; @@ -2352,21 +2362,23 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh unlock_task_sighand(task, &flags); } - result = seq_printf(m, - "rchar: %llu\n" - "wchar: %llu\n" - "syscr: %llu\n" - "syscw: %llu\n" - "read_bytes: %llu\n" - "write_bytes: %llu\n" - "cancelled_write_bytes: %llu\n", - (unsigned long long)acct.rchar, - (unsigned long long)acct.wchar, - (unsigned long long)acct.syscr, - (unsigned long long)acct.syscw, - (unsigned long long)acct.read_bytes, - (unsigned long long)acct.write_bytes, - (unsigned long long)acct.cancelled_write_bytes); + seq_printf(m, + "rchar: %llu\n" + "wchar: %llu\n" + "syscr: %llu\n" + "syscw: %llu\n" + "read_bytes: %llu\n" + "write_bytes: %llu\n" + "cancelled_write_bytes: %llu\n", + (unsigned long long)acct.rchar, + (unsigned long long)acct.wchar, + (unsigned long long)acct.syscr, + (unsigned long long)acct.syscw, + (unsigned long long)acct.read_bytes, + (unsigned long long)acct.write_bytes, + (unsigned long long)acct.cancelled_write_bytes); + result = 0; + out_unlock: mutex_unlock(&task->signal->cred_guard_mutex); return result; diff --git a/fs/splice.c b/fs/splice.c index 41cbb16..476024b 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -523,6 +523,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos, loff_t isize, left; int ret; + if (IS_DAX(in->f_mapping->host)) + return default_file_splice_read(in, ppos, pipe, len, flags); + isize = i_size_read(in->f_mapping->host); if (unlikely(*ppos >= isize)) return 0; diff --git a/include/linux/a.out.h b/include/linux/a.out.h index 220f143..ee88416 100644 --- a/include/linux/a.out.h +++ b/include/linux/a.out.h @@ -4,44 +4,6 @@ #include <uapi/linux/a.out.h> #ifndef __ASSEMBLY__ -#if defined (M_OLDSUN2) -#else -#endif -#if defined (M_68010) -#else -#endif -#if defined (M_68020) -#else -#endif -#if defined (M_SPARC) -#else -#endif -#if !defined (N_MAGIC) -#endif -#if !defined (N_BADMAG) -#endif -#if !defined (N_TXTOFF) -#endif -#if !defined (N_DATOFF) -#endif -#if !defined (N_TRELOFF) -#endif -#if !defined (N_DRELOFF) -#endif -#if !defined (N_SYMOFF) -#endif -#if !defined (N_STROFF) -#endif -#if !defined (N_TXTADDR) -#endif -#if defined(vax) || defined(hp300) || defined(pyr) -#endif -#ifdef sony -#endif /* Sony. */ -#ifdef is68k -#endif -#if defined(m68k) && defined(PORTAR) -#endif #ifdef linux #include <asm/page.h> #if defined(__i386__) || defined(__mc68000__) @@ -51,34 +13,5 @@ #endif #endif #endif -#ifndef N_DATADDR -#endif -#if !defined (N_BSSADDR) -#endif -#if !defined (N_NLIST_DECLARED) -#endif /* no N_NLIST_DECLARED. */ -#if !defined (N_UNDF) -#endif -#if !defined (N_ABS) -#endif -#if !defined (N_TEXT) -#endif -#if !defined (N_DATA) -#endif -#if !defined (N_BSS) -#endif -#if !defined (N_FN) -#endif -#if !defined (N_EXT) -#endif -#if !defined (N_TYPE) -#endif -#if !defined (N_STAB) -#endif -#if !defined (N_RELOCATION_INFO_DECLARED) -#ifdef NS32K -#else -#endif -#endif /* no N_RELOCATION_INFO_DECLARED. */ #endif /*__ASSEMBLY__ */ #endif /* __A_OUT_GNU_H__ */ diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h index dbfbf49..be4fa5d 100644 --- a/include/linux/bitmap.h +++ b/include/linux/bitmap.h @@ -172,12 +172,8 @@ extern unsigned int bitmap_ord_to_pos(const unsigned long *bitmap, unsigned int extern int bitmap_print_to_pagebuf(bool list, char *buf, const unsigned long *maskp, int nmaskbits); -#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) % BITS_PER_LONG)) -#define BITMAP_LAST_WORD_MASK(nbits) \ -( \ - ((nbits) % BITS_PER_LONG) ? \ - (1UL<<((nbits) % BITS_PER_LONG))-1 : ~0UL \ -) +#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1))) +#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1))) #define small_const_nbits(nbits) \ (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG) diff --git a/include/linux/capability.h b/include/linux/capability.h index aa93e5e..af9f0b9 100644 --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -205,6 +205,7 @@ static inline kernel_cap_t cap_raise_nfsd_set(const kernel_cap_t a, cap_intersect(permitted, __cap_nfsd_set)); } +#ifdef CONFIG_MULTIUSER extern bool has_capability(struct task_struct *t, int cap); extern bool has_ns_capability(struct task_struct *t, struct user_namespace *ns, int cap); @@ -213,6 +214,34 @@ extern bool has_ns_capability_noaudit(struct task_struct *t, struct user_namespace *ns, int cap); extern bool capable(int cap); extern bool ns_capable(struct user_namespace *ns, int cap); +#else +static inline bool has_capability(struct task_struct *t, int cap) +{ + return true; +} +static inline bool has_ns_capability(struct task_struct *t, + struct user_namespace *ns, int cap) +{ + return true; +} +static inline bool has_capability_noaudit(struct task_struct *t, int cap) +{ + return true; +} +static inline bool has_ns_capability_noaudit(struct task_struct *t, + struct user_namespace *ns, int cap) +{ + return true; +} +static inline bool capable(int cap) +{ + return true; +} +static inline bool ns_capable(struct user_namespace *ns, int cap) +{ + return true; +} +#endif /* CONFIG_MULTIUSER */ extern bool capable_wrt_inode_uidgid(const struct inode *inode, int cap); extern bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap); diff --git a/include/linux/compaction.h b/include/linux/compaction.h index a014559..aa8f61c 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -34,6 +34,7 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write, extern int sysctl_extfrag_threshold; extern int sysctl_extfrag_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos); +extern int sysctl_compact_unevictable_allowed; extern int fragmentation_index(struct zone *zone, unsigned int order); extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order, diff --git a/include/linux/cred.h b/include/linux/cred.h index 2fb2ca2..8b6c083 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -62,9 +62,27 @@ do { \ groups_free(group_info); \ } while (0) -extern struct group_info *groups_alloc(int); extern struct group_info init_groups; +#ifdef CONFIG_MULTIUSER +extern struct group_info *groups_alloc(int); extern void groups_free(struct group_info *); + +extern int in_group_p(kgid_t); +extern int in_egroup_p(kgid_t); +#else +static inline void groups_free(struct group_info *group_info) +{ +} + +static inline int in_group_p(kgid_t grp) +{ + return 1; +} +static inline int in_egroup_p(kgid_t grp) +{ + return 1; +} +#endif extern int set_current_groups(struct group_info *); extern void set_groups(struct cred *, struct group_info *); extern int groups_search(const struct group_info *, kgid_t); @@ -74,9 +92,6 @@ extern bool may_setgroups(void); #define GROUP_AT(gi, i) \ ((gi)->blocks[(i) / NGROUPS_PER_BLOCK][(i) % NGROUPS_PER_BLOCK]) -extern int in_group_p(kgid_t); -extern int in_egroup_p(kgid_t); - /* * The security context of a task * diff --git a/include/linux/fs.h b/include/linux/fs.h index 90a1207..f4fc607 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2615,6 +2615,7 @@ int dax_clear_blocks(struct inode *, sector_t block, long size); int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); int dax_truncate_page(struct inode *, loff_t from, get_block_t); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); +int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *); #define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb) #ifdef CONFIG_BLOCK @@ -2679,7 +2680,6 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes); loff_t inode_get_bytes(struct inode *inode); void inode_set_bytes(struct inode *inode, loff_t bytes); -extern int vfs_readdir(struct file *, filldir_t, void *); extern int iterate_dir(struct file *, struct dir_context *); extern int vfs_stat(const char __user *, struct kstat *); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 7b57850..2050261 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -22,7 +22,13 @@ struct mmu_gather; struct hugepage_subpool { spinlock_t lock; long count; - long max_hpages, used_hpages; + long max_hpages; /* Maximum huge pages or -1 if no maximum. */ + long used_hpages; /* Used count against maximum, includes */ + /* both alloced and reserved pages. */ + struct hstate *hstate; + long min_hpages; /* Minimum huge pages or -1 if no minimum. */ + long rsv_hpages; /* Pages reserved against global pool to */ + /* sasitfy minimum size. */ }; struct resv_map { @@ -38,11 +44,10 @@ extern int hugetlb_max_hstate __read_mostly; #define for_each_hstate(h) \ for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++) -struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); +struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages, + long min_hpages); void hugepage_put_subpool(struct hugepage_subpool *spool); -int PageHuge(struct page *page); - void reset_vma_resv_huge_pages(struct vm_area_struct *vma); int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); @@ -79,7 +84,6 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed); int dequeue_hwpoisoned_huge_page(struct page *page); bool isolate_huge_page(struct page *page, struct list_head *list); void putback_active_hugepage(struct page *page); -bool is_hugepage_active(struct page *page); void free_huge_page(struct page *page); #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE @@ -109,11 +113,6 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, #else /* !CONFIG_HUGETLB_PAGE */ -static inline int PageHuge(struct page *page) -{ - return 0; -} - static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma) { } @@ -152,7 +151,6 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list) return false; } #define putback_active_hugepage(p) do {} while (0) -#define is_hugepage_active(x) false static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma, unsigned long address, unsigned long end, pgprot_t newprot) diff --git a/include/linux/ioport.h b/include/linux/ioport.h index 2c525022..388e3ae 100644 --- a/include/linux/ioport.h +++ b/include/linux/ioport.h @@ -196,10 +196,8 @@ extern struct resource * __request_region(struct resource *, /* Compatibility cruft */ #define release_region(start,n) __release_region(&ioport_resource, (start), (n)) -#define check_mem_region(start,n) __check_region(&iomem_resource, (start), (n)) #define release_mem_region(start,n) __release_region(&iomem_resource, (start), (n)) -extern int __check_region(struct resource *, resource_size_t, resource_size_t); extern void __release_region(struct resource *, resource_size_t, resource_size_t); #ifdef CONFIG_MEMORY_HOTREMOVE @@ -207,12 +205,6 @@ extern int release_mem_region_adjustable(struct resource *, resource_size_t, resource_size_t); #endif -static inline int __deprecated check_region(resource_size_t s, - resource_size_t n) -{ - return __check_region(&ioport_resource, s, n); -} - /* Wrappers for managed devices */ struct device; diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 5bb0744..5486d77 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -44,6 +44,7 @@ void kasan_poison_object_data(struct kmem_cache *cache, void *object); void kasan_kmalloc_large(const void *ptr, size_t size); void kasan_kfree_large(const void *ptr); +void kasan_kfree(void *ptr); void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size); void kasan_krealloc(const void *object, size_t new_size); @@ -71,6 +72,7 @@ static inline void kasan_poison_object_data(struct kmem_cache *cache, static inline void kasan_kmalloc_large(void *ptr, size_t size) {} static inline void kasan_kfree_large(const void *ptr) {} +static inline void kasan_kfree(void *ptr) {} static inline void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size) {} static inline void kasan_krealloc(const void *object, size_t new_size) {} diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 3be6bb1..7ae216a 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -35,18 +35,6 @@ static inline void ksm_exit(struct mm_struct *mm) __ksm_exit(mm); } -/* - * A KSM page is one of those write-protected "shared pages" or "merged pages" - * which KSM maps into multiple mms, wherever identical anonymous page content - * is found in VM_MERGEABLE vmas. It's a PageAnon page, pointing not to any - * anon_vma, but to that page's node of the stable tree. - */ -static inline int PageKsm(struct page *page) -{ - return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) == - (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); -} - static inline struct stable_node *page_stable_node(struct page *page) { return PageKsm(page) ? page_rmapping(page) : NULL; @@ -87,11 +75,6 @@ static inline void ksm_exit(struct mm_struct *mm) { } -static inline int PageKsm(struct page *page) -{ - return 0; -} - #ifdef CONFIG_MMU static inline int ksm_madvise(struct vm_area_struct *vma, unsigned long start, unsigned long end, int advice, unsigned long *vm_flags) diff --git a/include/linux/mempool.h b/include/linux/mempool.h index b19b302..69b6951 100644 --- a/include/linux/mempool.h +++ b/include/linux/mempool.h @@ -36,7 +36,8 @@ extern void mempool_free(void *element, mempool_t *pool); /* * A mempool_alloc_t and mempool_free_t that get the memory from - * a slab that is passed in through pool_data. + * a slab cache that is passed in through pool_data. + * Note: the slab cache may not have a ctor function. */ void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data); void mempool_free_slab(void *element, void *pool_data); diff --git a/include/linux/mm.h b/include/linux/mm.h index 6571dd78..8b08607 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -251,6 +251,9 @@ struct vm_operations_struct { * writable, if an error is returned it will cause a SIGBUS */ int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ + int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf); + /* called by access_process_vm when get_user_pages() fails, typically * for use by special VMAs that can switch between memory and hardware */ @@ -494,18 +497,9 @@ static inline int page_count(struct page *page) return atomic_read(&compound_head(page)->_count); } -#ifdef CONFIG_HUGETLB_PAGE -extern int PageHeadHuge(struct page *page_head); -#else /* CONFIG_HUGETLB_PAGE */ -static inline int PageHeadHuge(struct page *page_head) -{ - return 0; -} -#endif /* CONFIG_HUGETLB_PAGE */ - static inline bool __compound_tail_refcounted(struct page *page) { - return !PageSlab(page) && !PageHeadHuge(page); + return PageAnon(page) && !PageSlab(page) && !PageHeadHuge(page); } /* @@ -571,53 +565,6 @@ static inline void init_page_count(struct page *page) atomic_set(&page->_count, 1); } -/* - * PageBuddy() indicate that the page is free and in the buddy system - * (see mm/page_alloc.c). - * - * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to - * -2 so that an underflow of the page_mapcount() won't be mistaken - * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very - * efficiently by most CPU architectures. - */ -#define PAGE_BUDDY_MAPCOUNT_VALUE (-128) - -static inline int PageBuddy(struct page *page) -{ - return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE; -} - -static inline void __SetPageBuddy(struct page *page) -{ - VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); - atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE); -} - -static inline void __ClearPageBuddy(struct page *page) -{ - VM_BUG_ON_PAGE(!PageBuddy(page), page); - atomic_set(&page->_mapcount, -1); -} - -#define PAGE_BALLOON_MAPCOUNT_VALUE (-256) - -static inline int PageBalloon(struct page *page) -{ - return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE; -} - -static inline void __SetPageBalloon(struct page *page) -{ - VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); - atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE); -} - -static inline void __ClearPageBalloon(struct page *page) -{ - VM_BUG_ON_PAGE(!PageBalloon(page), page); - atomic_set(&page->_mapcount, -1); -} - void put_page(struct page *page); void put_pages_list(struct list_head *pages); @@ -1006,34 +953,10 @@ void page_address_init(void); #define page_address_init() do { } while(0) #endif -/* - * On an anonymous page mapped into a user virtual memory area, - * page->mapping points to its anon_vma, not to a struct address_space; - * with the PAGE_MAPPING_ANON bit set to distinguish it. See rmap.h. - * - * On an anonymous page in a VM_MERGEABLE area, if CONFIG_KSM is enabled, - * the PAGE_MAPPING_KSM bit may be set along with the PAGE_MAPPING_ANON bit; - * and then page->mapping points, not to an anon_vma, but to a private - * structure which KSM associates with that merged page. See ksm.h. - * - * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used. - * - * Please note that, confusingly, "page_mapping" refers to the inode - * address_space which maps the page from disk; whereas "page_mapped" - * refers to user virtual address space into which the page is mapped. - */ -#define PAGE_MAPPING_ANON 1 -#define PAGE_MAPPING_KSM 2 -#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM) - +extern void *page_rmapping(struct page *page); +extern struct anon_vma *page_anon_vma(struct page *page); extern struct address_space *page_mapping(struct page *page); -/* Neutral page->mapping pointer to address_space or anon_vma or other */ -static inline void *page_rmapping(struct page *page) -{ - return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); -} - extern struct address_space *__page_file_mapping(struct page *); static inline @@ -1045,11 +968,6 @@ struct address_space *page_file_mapping(struct page *page) return page->mapping; } -static inline int PageAnon(struct page *page) -{ - return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0; -} - /* * Return the pagecache index of the passed page. Regular pagecache pages * use ->index whereas swapcache pages use ->private @@ -1975,10 +1893,10 @@ extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info); static inline unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info) { - if (!(info->flags & VM_UNMAPPED_AREA_TOPDOWN)) - return unmapped_area(info); - else + if (info->flags & VM_UNMAPPED_AREA_TOPDOWN) return unmapped_area_topdown(info); + else + return unmapped_area(info); } /* truncate.c */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 2782df4..54d74f6 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -842,16 +842,16 @@ static inline int populated_zone(struct zone *zone) extern int movable_zone; +#ifdef CONFIG_HIGHMEM static inline int zone_movable_is_highmem(void) { -#if defined(CONFIG_HIGHMEM) && defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) +#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP return movable_zone == ZONE_HIGHMEM; -#elif defined(CONFIG_HIGHMEM) - return (ZONE_MOVABLE - 1) == ZONE_HIGHMEM; #else - return 0; + return (ZONE_MOVABLE - 1) == ZONE_HIGHMEM; #endif } +#endif static inline int is_highmem_idx(enum zone_type idx) { diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index c851ff9..f34e040 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -289,6 +289,47 @@ PAGEFLAG_FALSE(HWPoison) #define __PG_HWPOISON 0 #endif +/* + * On an anonymous page mapped into a user virtual memory area, + * page->mapping points to its anon_vma, not to a struct address_space; + * with the PAGE_MAPPING_ANON bit set to distinguish it. See rmap.h. + * + * On an anonymous page in a VM_MERGEABLE area, if CONFIG_KSM is enabled, + * the PAGE_MAPPING_KSM bit may be set along with the PAGE_MAPPING_ANON bit; + * and then page->mapping points, not to an anon_vma, but to a private + * structure which KSM associates with that merged page. See ksm.h. + * + * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used. + * + * Please note that, confusingly, "page_mapping" refers to the inode + * address_space which maps the page from disk; whereas "page_mapped" + * refers to user virtual address space into which the page is mapped. + */ +#define PAGE_MAPPING_ANON 1 +#define PAGE_MAPPING_KSM 2 +#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM) + +static inline int PageAnon(struct page *page) +{ + return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0; +} + +#ifdef CONFIG_KSM +/* + * A KSM page is one of those write-protected "shared pages" or "merged pages" + * which KSM maps into multiple mms, wherever identical anonymous page content + * is found in VM_MERGEABLE vmas. It's a PageAnon page, pointing not to any + * anon_vma, but to that page's node of the stable tree. + */ +static inline int PageKsm(struct page *page) +{ + return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) == + (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); +} +#else +TESTPAGEFLAG_FALSE(Ksm) +#endif + u64 stable_page_flags(struct page *page); static inline int PageUptodate(struct page *page) @@ -426,6 +467,21 @@ static inline void ClearPageCompound(struct page *page) #endif /* !PAGEFLAGS_EXTENDED */ +#ifdef CONFIG_HUGETLB_PAGE +int PageHuge(struct page *page); +int PageHeadHuge(struct page *page); +bool page_huge_active(struct page *page); +#else +TESTPAGEFLAG_FALSE(Huge) +TESTPAGEFLAG_FALSE(HeadHuge) + +static inline bool page_huge_active(struct page *page) +{ + return 0; +} +#endif + + #ifdef CONFIG_TRANSPARENT_HUGEPAGE /* * PageHuge() only returns true for hugetlbfs pages, but not for @@ -480,6 +536,53 @@ static inline int PageTransTail(struct page *page) #endif /* + * PageBuddy() indicate that the page is free and in the buddy system + * (see mm/page_alloc.c). + * + * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to + * -2 so that an underflow of the page_mapcount() won't be mistaken + * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very + * efficiently by most CPU architectures. + */ +#define PAGE_BUDDY_MAPCOUNT_VALUE (-128) + +static inline int PageBuddy(struct page *page) +{ + return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE; +} + +static inline void __SetPageBuddy(struct page *page) +{ + VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); + atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE); +} + +static inline void __ClearPageBuddy(struct page *page) +{ + VM_BUG_ON_PAGE(!PageBuddy(page), page); + atomic_set(&page->_mapcount, -1); +} + +#define PAGE_BALLOON_MAPCOUNT_VALUE (-256) + +static inline int PageBalloon(struct page *page) +{ + return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE; +} + +static inline void __SetPageBalloon(struct page *page) +{ + VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); + atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE); +} + +static inline void __ClearPageBalloon(struct page *page) +{ + VM_BUG_ON_PAGE(!PageBalloon(page), page); + atomic_set(&page->_mapcount, -1); +} + +/* * If network-based swap is enabled, sl*b must keep track of whether pages * were allocated from pfmemalloc reserves. */ diff --git a/include/linux/printk.h b/include/linux/printk.h index baa3f97..9b30871 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -255,6 +255,11 @@ extern asmlinkage void dump_stack(void) __cold; printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__) #define pr_info(fmt, ...) \ printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__) +/* + * Like KERN_CONT, pr_cont() should only be used when continuing + * a line with no newline ('\n') enclosed. Otherwise it defaults + * back to KERN_DEFAULT. + */ #define pr_cont(fmt, ...) \ printk(KERN_CONT fmt, ##__VA_ARGS__) diff --git a/include/linux/reboot.h b/include/linux/reboot.h index 67fc8fc..a7ff409 100644 --- a/include/linux/reboot.h +++ b/include/linux/reboot.h @@ -70,7 +70,8 @@ void ctrl_alt_del(void); #define POWEROFF_CMD_PATH_LEN 256 extern char poweroff_cmd[POWEROFF_CMD_PATH_LEN]; -extern int orderly_poweroff(bool force); +extern void orderly_poweroff(bool force); +extern void orderly_reboot(void); /* * Emergency restart, callable from an interrupt handler. diff --git a/include/linux/rmap.h b/include/linux/rmap.h index c4c559a..c89c53a 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -105,14 +105,6 @@ static inline void put_anon_vma(struct anon_vma *anon_vma) __put_anon_vma(anon_vma); } -static inline struct anon_vma *page_anon_vma(struct page *page) -{ - if (((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) != - PAGE_MAPPING_ANON) - return NULL; - return page_rmapping(page); -} - static inline void vma_lock_anon_vma(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h index 6575718..0991913 100644 --- a/include/linux/string_helpers.h +++ b/include/linux/string_helpers.h @@ -47,22 +47,22 @@ static inline int string_unescape_any_inplace(char *buf) #define ESCAPE_ANY_NP (ESCAPE_ANY | ESCAPE_NP) #define ESCAPE_HEX 0x20 -int string_escape_mem(const char *src, size_t isz, char **dst, size_t osz, +int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, unsigned int flags, const char *esc); static inline int string_escape_mem_any_np(const char *src, size_t isz, - char **dst, size_t osz, const char *esc) + char *dst, size_t osz, const char *esc) { return string_escape_mem(src, isz, dst, osz, ESCAPE_ANY_NP, esc); } -static inline int string_escape_str(const char *src, char **dst, size_t sz, +static inline int string_escape_str(const char *src, char *dst, size_t sz, unsigned int flags, const char *esc) { return string_escape_mem(src, strlen(src), dst, sz, flags, esc); } -static inline int string_escape_str_any_np(const char *src, char **dst, +static inline int string_escape_str_any_np(const char *src, char *dst, size_t sz, const char *esc) { return string_escape_str(src, dst, sz, ESCAPE_ANY_NP, esc); diff --git a/include/linux/swap.h b/include/linux/swap.h index 7067eca..cee108c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -307,7 +307,7 @@ extern void lru_add_drain(void); extern void lru_add_drain_cpu(int cpu); extern void lru_add_drain_all(void); extern void rotate_reclaimable_page(struct page *page); -extern void deactivate_page(struct page *page); +extern void deactivate_file_page(struct page *page); extern void swap_setup(void); extern void add_page_to_unevictable_list(struct page *page); diff --git a/include/linux/types.h b/include/linux/types.h index 6747247..59698be 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -146,12 +146,6 @@ typedef u64 dma_addr_t; typedef u32 dma_addr_t; #endif /* dma_addr_t */ -#ifdef __CHECKER__ -#else -#endif -#ifdef __CHECK_ENDIAN__ -#else -#endif typedef unsigned __bitwise__ gfp_t; typedef unsigned __bitwise__ fmode_t; typedef unsigned __bitwise__ oom_flags_t; diff --git a/include/linux/uidgid.h b/include/linux/uidgid.h index 2d1f9b6..0ee05da 100644 --- a/include/linux/uidgid.h +++ b/include/linux/uidgid.h @@ -29,6 +29,7 @@ typedef struct { #define KUIDT_INIT(value) (kuid_t){ value } #define KGIDT_INIT(value) (kgid_t){ value } +#ifdef CONFIG_MULTIUSER static inline uid_t __kuid_val(kuid_t uid) { return uid.val; @@ -38,6 +39,17 @@ static inline gid_t __kgid_val(kgid_t gid) { return gid.val; } +#else +static inline uid_t __kuid_val(kuid_t uid) +{ + return 0; +} + +static inline gid_t __kgid_val(kgid_t gid) +{ + return 0; +} +#endif #define GLOBAL_ROOT_UID KUIDT_INIT(0) #define GLOBAL_ROOT_GID KGIDT_INIT(0) diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h index 3283c6a..1338190 100644 --- a/include/linux/zsmalloc.h +++ b/include/linux/zsmalloc.h @@ -47,5 +47,6 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, void zs_unmap_object(struct zs_pool *pool, unsigned long handle); unsigned long zs_get_total_pages(struct zs_pool *pool); +unsigned long zs_compact(struct zs_pool *pool); #endif diff --git a/include/trace/events/cma.h b/include/trace/events/cma.h new file mode 100644 index 0000000..d7cd961 --- /dev/null +++ b/include/trace/events/cma.h @@ -0,0 +1,66 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM cma + +#if !defined(_TRACE_CMA_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_CMA_H + +#include <linux/types.h> +#include <linux/tracepoint.h> + +TRACE_EVENT(cma_alloc, + + TP_PROTO(unsigned long pfn, const struct page *page, + unsigned int count, unsigned int align), + + TP_ARGS(pfn, page, count, align), + + TP_STRUCT__entry( + __field(unsigned long, pfn) + __field(const struct page *, page) + __field(unsigned int, count) + __field(unsigned int, align) + ), + + TP_fast_assign( + __entry->pfn = pfn; + __entry->page = page; + __entry->count = count; + __entry->align = align; + ), + + TP_printk("pfn=%lx page=%p count=%u align=%u", + __entry->pfn, + __entry->page, + __entry->count, + __entry->align) +); + +TRACE_EVENT(cma_release, + + TP_PROTO(unsigned long pfn, const struct page *page, + unsigned int count), + + TP_ARGS(pfn, page, count), + + TP_STRUCT__entry( + __field(unsigned long, pfn) + __field(const struct page *, page) + __field(unsigned int, count) + ), + + TP_fast_assign( + __entry->pfn = pfn; + __entry->page = page; + __entry->count = count; + ), + + TP_printk("pfn=%lx page=%p count=%u", + __entry->pfn, + __entry->page, + __entry->count) +); + +#endif /* _TRACE_CMA_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> diff --git a/init/Kconfig b/init/Kconfig index a905b73..3b9df1a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -394,6 +394,7 @@ endchoice config BSD_PROCESS_ACCT bool "BSD Process Accounting" + depends on MULTIUSER help If you say Y here, a user level program will be able to instruct the kernel (via a special system call) to write process accounting @@ -420,6 +421,7 @@ config BSD_PROCESS_ACCT_V3 config TASKSTATS bool "Export task/process statistics through netlink" depends on NET + depends on MULTIUSER default n help Export selected statistics for tasks/processes through the @@ -1160,6 +1162,7 @@ config CHECKPOINT_RESTORE menuconfig NAMESPACES bool "Namespaces support" if EXPERT + depends on MULTIUSER default !EXPERT help Provides the way to make tasks work with different objects using @@ -1356,11 +1359,25 @@ menuconfig EXPERT config UID16 bool "Enable 16-bit UID system calls" if EXPERT - depends on HAVE_UID16 + depends on HAVE_UID16 && MULTIUSER default y help This enables the legacy 16-bit UID syscall wrappers. +config MULTIUSER + bool "Multiple users, groups and capabilities support" if EXPERT + default y + help + This option enables support for non-root users, groups and + capabilities. + + If you say N here, all processes will run with UID 0, GID 0, and all + possible capabilities. Saying N here also compiles out support for + system calls related to UIDs, GIDs, and capabilities, such as setuid, + setgid, and capset. + + If unsure, say Y here. + config SGETMASK_SYSCALL bool "sgetmask/ssetmask syscalls support" if EXPERT def_bool PARISC || MN10300 || BLACKFIN || M68K || PPC || MIPS || X86 || SPARC || CRIS || MICROBLAZE || SUPERH @@ -1015,22 +1015,24 @@ static int sysvipc_msg_proc_show(struct seq_file *s, void *it) struct user_namespace *user_ns = seq_user_ns(s); struct msg_queue *msq = it; - return seq_printf(s, - "%10d %10d %4o %10lu %10lu %5u %5u %5u %5u %5u %5u %10lu %10lu %10lu\n", - msq->q_perm.key, - msq->q_perm.id, - msq->q_perm.mode, - msq->q_cbytes, - msq->q_qnum, - msq->q_lspid, - msq->q_lrpid, - from_kuid_munged(user_ns, msq->q_perm.uid), - from_kgid_munged(user_ns, msq->q_perm.gid), - from_kuid_munged(user_ns, msq->q_perm.cuid), - from_kgid_munged(user_ns, msq->q_perm.cgid), - msq->q_stime, - msq->q_rtime, - msq->q_ctime); + seq_printf(s, + "%10d %10d %4o %10lu %10lu %5u %5u %5u %5u %5u %5u %10lu %10lu %10lu\n", + msq->q_perm.key, + msq->q_perm.id, + msq->q_perm.mode, + msq->q_cbytes, + msq->q_qnum, + msq->q_lspid, + msq->q_lrpid, + from_kuid_munged(user_ns, msq->q_perm.uid), + from_kgid_munged(user_ns, msq->q_perm.gid), + from_kuid_munged(user_ns, msq->q_perm.cuid), + from_kgid_munged(user_ns, msq->q_perm.cgid), + msq->q_stime, + msq->q_rtime, + msq->q_ctime); + + return 0; } #endif @@ -2170,17 +2170,19 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) sem_otime = get_semotime(sma); - return seq_printf(s, - "%10d %10d %4o %10u %5u %5u %5u %5u %10lu %10lu\n", - sma->sem_perm.key, - sma->sem_perm.id, - sma->sem_perm.mode, - sma->sem_nsems, - from_kuid_munged(user_ns, sma->sem_perm.uid), - from_kgid_munged(user_ns, sma->sem_perm.gid), - from_kuid_munged(user_ns, sma->sem_perm.cuid), - from_kgid_munged(user_ns, sma->sem_perm.cgid), - sem_otime, - sma->sem_ctime); + seq_printf(s, + "%10d %10d %4o %10u %5u %5u %5u %5u %10lu %10lu\n", + sma->sem_perm.key, + sma->sem_perm.id, + sma->sem_perm.mode, + sma->sem_nsems, + from_kuid_munged(user_ns, sma->sem_perm.uid), + from_kgid_munged(user_ns, sma->sem_perm.gid), + from_kuid_munged(user_ns, sma->sem_perm.cuid), + from_kgid_munged(user_ns, sma->sem_perm.cgid), + sem_otime, + sma->sem_ctime); + + return 0; } #endif @@ -1342,25 +1342,27 @@ static int sysvipc_shm_proc_show(struct seq_file *s, void *it) #define SIZE_SPEC "%21lu" #endif - return seq_printf(s, - "%10d %10d %4o " SIZE_SPEC " %5u %5u " - "%5lu %5u %5u %5u %5u %10lu %10lu %10lu " - SIZE_SPEC " " SIZE_SPEC "\n", - shp->shm_perm.key, - shp->shm_perm.id, - shp->shm_perm.mode, - shp->shm_segsz, - shp->shm_cprid, - shp->shm_lprid, - shp->shm_nattch, - from_kuid_munged(user_ns, shp->shm_perm.uid), - from_kgid_munged(user_ns, shp->shm_perm.gid), - from_kuid_munged(user_ns, shp->shm_perm.cuid), - from_kgid_munged(user_ns, shp->shm_perm.cgid), - shp->shm_atim, - shp->shm_dtim, - shp->shm_ctim, - rss * PAGE_SIZE, - swp * PAGE_SIZE); + seq_printf(s, + "%10d %10d %4o " SIZE_SPEC " %5u %5u " + "%5lu %5u %5u %5u %5u %10lu %10lu %10lu " + SIZE_SPEC " " SIZE_SPEC "\n", + shp->shm_perm.key, + shp->shm_perm.id, + shp->shm_perm.mode, + shp->shm_segsz, + shp->shm_cprid, + shp->shm_lprid, + shp->shm_nattch, + from_kuid_munged(user_ns, shp->shm_perm.uid), + from_kgid_munged(user_ns, shp->shm_perm.gid), + from_kuid_munged(user_ns, shp->shm_perm.cuid), + from_kgid_munged(user_ns, shp->shm_perm.cgid), + shp->shm_atim, + shp->shm_dtim, + shp->shm_ctim, + rss * PAGE_SIZE, + swp * PAGE_SIZE); + + return 0; } #endif @@ -837,8 +837,10 @@ static int sysvipc_proc_show(struct seq_file *s, void *it) struct ipc_proc_iter *iter = s->private; struct ipc_proc_iface *iface = iter->iface; - if (it == SEQ_START_TOKEN) - return seq_puts(s, iface->header); + if (it == SEQ_START_TOKEN) { + seq_puts(s, iface->header); + return 0; + } return iface->show(s, it); } diff --git a/kernel/Makefile b/kernel/Makefile index 1408b33..0f8f8b0 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -9,7 +9,9 @@ obj-y = fork.o exec_domain.o panic.o \ extable.o params.o \ kthread.o sys_ni.o nsproxy.o \ notifier.o ksysfs.o cred.o reboot.o \ - async.o range.o groups.o smpboot.o + async.o range.o smpboot.o + +obj-$(CONFIG_MULTIUSER) += groups.o ifdef CONFIG_FUNCTION_TRACER # Do not trace debug files and internal ftrace files diff --git a/kernel/capability.c b/kernel/capability.c index 989f5bf..45432b5 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -35,6 +35,7 @@ static int __init file_caps_disable(char *str) } __setup("no_file_caps", file_caps_disable); +#ifdef CONFIG_MULTIUSER /* * More recent versions of libcap are available from: * @@ -386,6 +387,24 @@ bool ns_capable(struct user_namespace *ns, int cap) } EXPORT_SYMBOL(ns_capable); + +/** + * capable - Determine if the current task has a superior capability in effect + * @cap: The capability to be tested for + * + * Return true if the current task has the given superior capability currently + * available for use, false if not. + * + * This sets PF_SUPERPRIV on the task if the capability is available on the + * assumption that it's about to be used. + */ +bool capable(int cap) +{ + return ns_capable(&init_user_ns, cap); +} +EXPORT_SYMBOL(capable); +#endif /* CONFIG_MULTIUSER */ + /** * file_ns_capable - Determine if the file's opener had a capability in effect * @file: The file we want to check @@ -412,22 +431,6 @@ bool file_ns_capable(const struct file *file, struct user_namespace *ns, EXPORT_SYMBOL(file_ns_capable); /** - * capable - Determine if the current task has a superior capability in effect - * @cap: The capability to be tested for - * - * Return true if the current task has the given superior capability currently - * available for use, false if not. - * - * This sets PF_SUPERPRIV on the task if the capability is available on the - * assumption that it's about to be used. - */ -bool capable(int cap) -{ - return ns_capable(&init_user_ns, cap); -} -EXPORT_SYMBOL(capable); - -/** * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped * @inode: The inode in question * @cap: The capability in question diff --git a/kernel/cgroup.c b/kernel/cgroup.c index a220fdb..469dd54 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -4196,7 +4196,9 @@ static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos) static int cgroup_pidlist_show(struct seq_file *s, void *v) { - return seq_printf(s, "%d\n", *(int *)v); + seq_printf(s, "%d\n", *(int *)v); + + return 0; } static u64 cgroup_read_notify_on_release(struct cgroup_subsys_state *css, @@ -5451,7 +5453,7 @@ struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry, struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss) { WARN_ON_ONCE(!rcu_read_lock_held()); - return idr_find(&ss->css_idr, id); + return id > 0 ? idr_find(&ss->css_idr, id) : NULL; } #ifdef CONFIG_CGROUP_DEBUG diff --git a/kernel/cred.c b/kernel/cred.c index e0573a4..ec1c076 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -29,6 +29,9 @@ static struct kmem_cache *cred_jar; +/* init to 2 - one for init_task, one to ensure it is never freed */ +struct group_info init_groups = { .usage = ATOMIC_INIT(2) }; + /* * The initial credentials for the initial task */ diff --git a/kernel/groups.c b/kernel/groups.c index 664411f..74d431d 100644 --- a/kernel/groups.c +++ b/kernel/groups.c @@ -9,9 +9,6 @@ #include <linux/user_namespace.h> #include <asm/uaccess.h> -/* init to 2 - one for init_task, one to ensure it is never freed */ -struct group_info init_groups = { .usage = ATOMIC_INIT(2) }; - struct group_info *groups_alloc(int gidsetsize) { struct group_info *group_info; diff --git a/kernel/hung_task.c b/kernel/hung_task.c index 06db124..e0f90c2 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -169,7 +169,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) return; rcu_read_lock(); - do_each_thread(g, t) { + for_each_process_thread(g, t) { if (!max_count--) goto unlock; if (!--batch_count) { @@ -180,7 +180,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */ if (t->state == TASK_UNINTERRUPTIBLE) check_hung_task(t, timeout); - } while_each_thread(g, t); + } unlock: rcu_read_unlock(); } diff --git a/kernel/reboot.c b/kernel/reboot.c index 5925f5a..d20c85d 100644 --- a/kernel/reboot.c +++ b/kernel/reboot.c @@ -387,8 +387,9 @@ void ctrl_alt_del(void) } char poweroff_cmd[POWEROFF_CMD_PATH_LEN] = "/sbin/poweroff"; +static const char reboot_cmd[] = "/sbin/reboot"; -static int __orderly_poweroff(bool force) +static int run_cmd(const char *cmd) { char **argv; static char *envp[] = { @@ -397,8 +398,7 @@ static int __orderly_poweroff(bool force) NULL }; int ret; - - argv = argv_split(GFP_KERNEL, poweroff_cmd, NULL); + argv = argv_split(GFP_KERNEL, cmd, NULL); if (argv) { ret = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC); argv_free(argv); @@ -406,8 +406,33 @@ static int __orderly_poweroff(bool force) ret = -ENOMEM; } + return ret; +} + +static int __orderly_reboot(void) +{ + int ret; + + ret = run_cmd(reboot_cmd); + + if (ret) { + pr_warn("Failed to start orderly reboot: forcing the issue\n"); + emergency_sync(); + kernel_restart(NULL); + } + + return ret; +} + +static int __orderly_poweroff(bool force) +{ + int ret; + + ret = run_cmd(poweroff_cmd); + if (ret && force) { pr_warn("Failed to start orderly shutdown: forcing the issue\n"); + /* * I guess this should try to kick off some daemon to sync and * poweroff asap. Or not even bother syncing if we're doing an @@ -436,15 +461,33 @@ static DECLARE_WORK(poweroff_work, poweroff_work_func); * This may be called from any context to trigger a system shutdown. * If the orderly shutdown fails, it will force an immediate shutdown. */ -int orderly_poweroff(bool force) +void orderly_poweroff(bool force) { if (force) /* do not override the pending "true" */ poweroff_force = true; schedule_work(&poweroff_work); - return 0; } EXPORT_SYMBOL_GPL(orderly_poweroff); +static void reboot_work_func(struct work_struct *work) +{ + __orderly_reboot(); +} + +static DECLARE_WORK(reboot_work, reboot_work_func); + +/** + * orderly_reboot - Trigger an orderly system reboot + * + * This may be called from any context to trigger a system reboot. + * If the orderly reboot fails, it will force an immediate reboot. + */ +void orderly_reboot(void) +{ + schedule_work(&reboot_work); +} +EXPORT_SYMBOL_GPL(orderly_reboot); + static int __init reboot_setup(char *str) { for (;;) { diff --git a/kernel/resource.c b/kernel/resource.c index 19f2357..90552aa 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -1034,8 +1034,6 @@ resource_size_t resource_alignment(struct resource *res) * * request_region creates a new busy region. * - * check_region returns non-zero if the area is already busy. - * * release_region releases a matching busy region. */ @@ -1098,36 +1096,6 @@ struct resource * __request_region(struct resource *parent, EXPORT_SYMBOL(__request_region); /** - * __check_region - check if a resource region is busy or free - * @parent: parent resource descriptor - * @start: resource start address - * @n: resource region size - * - * Returns 0 if the region is free at the moment it is checked, - * returns %-EBUSY if the region is busy. - * - * NOTE: - * This function is deprecated because its use is racy. - * Even if it returns 0, a subsequent call to request_region() - * may fail because another driver etc. just allocated the region. - * Do NOT use it. It will be removed from the kernel. - */ -int __check_region(struct resource *parent, resource_size_t start, - resource_size_t n) -{ - struct resource * res; - - res = __request_region(parent, start, n, "check-region", 0); - if (!res) - return -EBUSY; - - release_resource(res); - free_resource(res); - return 0; -} -EXPORT_SYMBOL(__check_region); - -/** * __release_region - release a previously reserved resource region * @parent: parent resource descriptor * @start: resource start address diff --git a/kernel/sys.c b/kernel/sys.c index a03d9cd..3be3449 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -325,6 +325,7 @@ out_unlock: * SMP: There are not races, the GIDs are checked only by filesystem * operations (as far as semantic preservation is concerned). */ +#ifdef CONFIG_MULTIUSER SYSCALL_DEFINE2(setregid, gid_t, rgid, gid_t, egid) { struct user_namespace *ns = current_user_ns(); @@ -815,6 +816,7 @@ change_okay: commit_creds(new); return old_fsgid; } +#endif /* CONFIG_MULTIUSER */ /** * sys_getpid - return the thread group id of the current process diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 5adcb0a..7995ef5 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -159,6 +159,20 @@ cond_syscall(sys_uselib); cond_syscall(sys_fadvise64); cond_syscall(sys_fadvise64_64); cond_syscall(sys_madvise); +cond_syscall(sys_setuid); +cond_syscall(sys_setregid); +cond_syscall(sys_setgid); +cond_syscall(sys_setreuid); +cond_syscall(sys_setresuid); +cond_syscall(sys_getresuid); +cond_syscall(sys_setresgid); +cond_syscall(sys_getresgid); +cond_syscall(sys_setgroups); +cond_syscall(sys_getgroups); +cond_syscall(sys_setfsuid); +cond_syscall(sys_setfsgid); +cond_syscall(sys_capget); +cond_syscall(sys_capset); /* arch-specific weak syscall entries */ cond_syscall(sys_pciconfig_read); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 8c0eabd..42b7fc2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1335,6 +1335,15 @@ static struct ctl_table vm_table[] = { .extra1 = &min_extfrag_threshold, .extra2 = &max_extfrag_threshold, }, + { + .procname = "compact_unevictable_allowed", + .data = &sysctl_compact_unevictable_allowed, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + .extra1 = &zero, + .extra2 = &one, + }, #endif /* CONFIG_COMPACTION */ { diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c index c3e4fcf..3f34496 100644 --- a/kernel/trace/trace_stack.c +++ b/kernel/trace/trace_stack.c @@ -327,11 +327,11 @@ static void t_stop(struct seq_file *m, void *p) local_irq_enable(); } -static int trace_lookup_stack(struct seq_file *m, long i) +static void trace_lookup_stack(struct seq_file *m, long i) { unsigned long addr = stack_dump_trace[i]; - return seq_printf(m, "%pS\n", (void *)addr); + seq_printf(m, "%pS\n", (void *)addr); } static void print_disabled(struct seq_file *m) diff --git a/lib/lru_cache.c b/lib/lru_cache.c index 852c81e..028f5d9 100644 --- a/lib/lru_cache.c +++ b/lib/lru_cache.c @@ -247,10 +247,11 @@ size_t lc_seq_printf_stats(struct seq_file *seq, struct lru_cache *lc) * progress) and "changed", when this in fact lead to an successful * update of the cache. */ - return seq_printf(seq, "\t%s: used:%u/%u " - "hits:%lu misses:%lu starving:%lu locked:%lu changed:%lu\n", - lc->name, lc->used, lc->nr_elements, - lc->hits, lc->misses, lc->starving, lc->locked, lc->changed); + seq_printf(seq, "\t%s: used:%u/%u hits:%lu misses:%lu starving:%lu locked:%lu changed:%lu\n", + lc->name, lc->used, lc->nr_elements, + lc->hits, lc->misses, lc->starving, lc->locked, lc->changed); + + return 0; } static struct hlist_head *lc_hash_slot(struct lru_cache *lc, unsigned int enr) diff --git a/lib/string_helpers.c b/lib/string_helpers.c index 8f8c441..1826c74 100644 --- a/lib/string_helpers.c +++ b/lib/string_helpers.c @@ -239,29 +239,21 @@ int string_unescape(char *src, char *dst, size_t size, unsigned int flags) } EXPORT_SYMBOL(string_unescape); -static int escape_passthrough(unsigned char c, char **dst, size_t *osz) +static bool escape_passthrough(unsigned char c, char **dst, char *end) { char *out = *dst; - if (*osz < 1) - return -ENOMEM; - - *out++ = c; - - *dst = out; - *osz -= 1; - - return 1; + if (out < end) + *out = c; + *dst = out + 1; + return true; } -static int escape_space(unsigned char c, char **dst, size_t *osz) +static bool escape_space(unsigned char c, char **dst, char *end) { char *out = *dst; unsigned char to; - if (*osz < 2) - return -ENOMEM; - switch (c) { case '\n': to = 'n'; @@ -279,26 +271,25 @@ static int escape_space(unsigned char c, char **dst, size_t *osz) to = 'f'; break; default: - return 0; + return false; } - *out++ = '\\'; - *out++ = to; + if (out < end) + *out = '\\'; + ++out; + if (out < end) + *out = to; + ++out; *dst = out; - *osz -= 2; - - return 1; + return true; } -static int escape_special(unsigned char c, char **dst, size_t *osz) +static bool escape_special(unsigned char c, char **dst, char *end) { char *out = *dst; unsigned char to; - if (*osz < 2) - return -ENOMEM; - switch (c) { case '\\': to = '\\'; @@ -310,71 +301,78 @@ static int escape_special(unsigned char c, char **dst, size_t *osz) to = 'e'; break; default: - return 0; + return false; } - *out++ = '\\'; - *out++ = to; + if (out < end) + *out = '\\'; + ++out; + if (out < end) + *out = to; + ++out; *dst = out; - *osz -= 2; - - return 1; + return true; } -static int escape_null(unsigned char c, char **dst, size_t *osz) +static bool escape_null(unsigned char c, char **dst, char *end) { char *out = *dst; - if (*osz < 2) - return -ENOMEM; - if (c) - return 0; + return false; - *out++ = '\\'; - *out++ = '0'; + if (out < end) + *out = '\\'; + ++out; + if (out < end) + *out = '0'; + ++out; *dst = out; - *osz -= 2; - - return 1; + return true; } -static int escape_octal(unsigned char c, char **dst, size_t *osz) +static bool escape_octal(unsigned char c, char **dst, char *end) { char *out = *dst; - if (*osz < 4) - return -ENOMEM; - - *out++ = '\\'; - *out++ = ((c >> 6) & 0x07) + '0'; - *out++ = ((c >> 3) & 0x07) + '0'; - *out++ = ((c >> 0) & 0x07) + '0'; + if (out < end) + *out = '\\'; + ++out; + if (out < end) + *out = ((c >> 6) & 0x07) + '0'; + ++out; + if (out < end) + *out = ((c >> 3) & 0x07) + '0'; + ++out; + if (out < end) + *out = ((c >> 0) & 0x07) + '0'; + ++out; *dst = out; - *osz -= 4; - - return 1; + return true; } -static int escape_hex(unsigned char c, char **dst, size_t *osz) +static bool escape_hex(unsigned char c, char **dst, char *end) { char *out = *dst; - if (*osz < 4) - return -ENOMEM; - - *out++ = '\\'; - *out++ = 'x'; - *out++ = hex_asc_hi(c); - *out++ = hex_asc_lo(c); + if (out < end) + *out = '\\'; + ++out; + if (out < end) + *out = 'x'; + ++out; + if (out < end) + *out = hex_asc_hi(c); + ++out; + if (out < end) + *out = hex_asc_lo(c); + ++out; *dst = out; - *osz -= 4; - - return 1; + return true; } /** @@ -426,19 +424,17 @@ static int escape_hex(unsigned char c, char **dst, size_t *osz) * it if needs. * * Return: - * The amount of the characters processed to the destination buffer, or - * %-ENOMEM if the size of buffer is not enough to put an escaped character is - * returned. - * - * Even in the case of error @dst pointer will be updated to point to the byte - * after the last processed character. + * The total size of the escaped output that would be generated for + * the given input and flags. To check whether the output was + * truncated, compare the return value to osz. There is room left in + * dst for a '\0' terminator if and only if ret < osz. */ -int string_escape_mem(const char *src, size_t isz, char **dst, size_t osz, +int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, unsigned int flags, const char *esc) { - char *out = *dst, *p = out; + char *p = dst; + char *end = p + osz; bool is_dict = esc && *esc; - int ret = 0; while (isz--) { unsigned char c = *src++; @@ -458,55 +454,26 @@ int string_escape_mem(const char *src, size_t isz, char **dst, size_t osz, (is_dict && !strchr(esc, c))) { /* do nothing */ } else { - if (flags & ESCAPE_SPACE) { - ret = escape_space(c, &p, &osz); - if (ret < 0) - break; - if (ret > 0) - continue; - } - - if (flags & ESCAPE_SPECIAL) { - ret = escape_special(c, &p, &osz); - if (ret < 0) - break; - if (ret > 0) - continue; - } - - if (flags & ESCAPE_NULL) { - ret = escape_null(c, &p, &osz); - if (ret < 0) - break; - if (ret > 0) - continue; - } + if (flags & ESCAPE_SPACE && escape_space(c, &p, end)) + continue; + + if (flags & ESCAPE_SPECIAL && escape_special(c, &p, end)) + continue; + + if (flags & ESCAPE_NULL && escape_null(c, &p, end)) + continue; /* ESCAPE_OCTAL and ESCAPE_HEX always go last */ - if (flags & ESCAPE_OCTAL) { - ret = escape_octal(c, &p, &osz); - if (ret < 0) - break; + if (flags & ESCAPE_OCTAL && escape_octal(c, &p, end)) continue; - } - if (flags & ESCAPE_HEX) { - ret = escape_hex(c, &p, &osz); - if (ret < 0) - break; + + if (flags & ESCAPE_HEX && escape_hex(c, &p, end)) continue; - } } - ret = escape_passthrough(c, &p, &osz); - if (ret < 0) - break; + escape_passthrough(c, &p, end); } - *dst = p; - - if (ret < 0) - return ret; - - return p - out; + return p - dst; } EXPORT_SYMBOL(string_escape_mem); diff --git a/lib/test-hexdump.c b/lib/test-hexdump.c index daf29a39..9846ff7 100644 --- a/lib/test-hexdump.c +++ b/lib/test-hexdump.c @@ -18,26 +18,26 @@ static const unsigned char data_b[] = { static const unsigned char data_a[] = ".2.{....p..$}.4...1.....L...C..."; -static const char *test_data_1_le[] __initconst = { +static const char * const test_data_1_le[] __initconst = { "be", "32", "db", "7b", "0a", "18", "93", "b2", "70", "ba", "c4", "24", "7d", "83", "34", "9b", "a6", "9c", "31", "ad", "9c", "0f", "ac", "e9", "4c", "d1", "19", "99", "43", "b1", "af", "0c", }; -static const char *test_data_2_le[] __initconst = { +static const char *test_data_2_le[] __initdata = { "32be", "7bdb", "180a", "b293", "ba70", "24c4", "837d", "9b34", "9ca6", "ad31", "0f9c", "e9ac", "d14c", "9919", "b143", "0caf", }; -static const char *test_data_4_le[] __initconst = { +static const char *test_data_4_le[] __initdata = { "7bdb32be", "b293180a", "24c4ba70", "9b34837d", "ad319ca6", "e9ac0f9c", "9919d14c", "0cafb143", }; -static const char *test_data_8_le[] __initconst = { +static const char *test_data_8_le[] __initdata = { "b293180a7bdb32be", "9b34837d24c4ba70", "e9ac0f9cad319ca6", "0cafb1439919d14c", }; diff --git a/lib/test-string_helpers.c b/lib/test-string_helpers.c index ab0d30e..8e376ef 100644 --- a/lib/test-string_helpers.c +++ b/lib/test-string_helpers.c @@ -260,16 +260,28 @@ static __init const char *test_string_find_match(const struct test_string_2 *s2, return NULL; } +static __init void +test_string_escape_overflow(const char *in, int p, unsigned int flags, const char *esc, + int q_test, const char *name) +{ + int q_real; + + q_real = string_escape_mem(in, p, NULL, 0, flags, esc); + if (q_real != q_test) + pr_warn("Test '%s' failed: flags = %u, osz = 0, expected %d, got %d\n", + name, flags, q_test, q_real); +} + static __init void test_string_escape(const char *name, const struct test_string_2 *s2, unsigned int flags, const char *esc) { - int q_real = 512; - char *out_test = kmalloc(q_real, GFP_KERNEL); - char *out_real = kmalloc(q_real, GFP_KERNEL); + size_t out_size = 512; + char *out_test = kmalloc(out_size, GFP_KERNEL); + char *out_real = kmalloc(out_size, GFP_KERNEL); char *in = kmalloc(256, GFP_KERNEL); - char *buf = out_real; int p = 0, q_test = 0; + int q_real; if (!out_test || !out_real || !in) goto out; @@ -301,29 +313,19 @@ static __init void test_string_escape(const char *name, q_test += len; } - q_real = string_escape_mem(in, p, &buf, q_real, flags, esc); + q_real = string_escape_mem(in, p, out_real, out_size, flags, esc); test_string_check_buf(name, flags, in, p, out_real, q_real, out_test, q_test); + + test_string_escape_overflow(in, p, flags, esc, q_test, name); + out: kfree(in); kfree(out_real); kfree(out_test); } -static __init void test_string_escape_nomem(void) -{ - char *in = "\eb \\C\007\"\x90\r]"; - char out[64], *buf = out; - int rc = -ENOMEM, ret; - - ret = string_escape_str_any_np(in, &buf, strlen(in), NULL); - if (ret == rc) - return; - - pr_err("Test 'escape nomem' failed: got %d instead of %d\n", ret, rc); -} - static int __init test_string_helpers_init(void) { unsigned int i; @@ -342,8 +344,6 @@ static int __init test_string_helpers_init(void) for (i = 0; i < (ESCAPE_ANY_NP | ESCAPE_HEX) + 1; i++) test_string_escape("escape 1", escape1, i, TEST_STRING_2_DICT_1); - test_string_escape_nomem(); - return -EINVAL; } module_init(test_string_helpers_init); diff --git a/lib/vsprintf.c b/lib/vsprintf.c index b235c96..3a1e084 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -17,6 +17,7 @@ */ #include <stdarg.h> +#include <linux/clk-provider.h> #include <linux/module.h> /* for KSYM_SYMBOL_LEN */ #include <linux/types.h> #include <linux/string.h> @@ -340,11 +341,11 @@ int num_to_str(char *buf, int size, unsigned long long num) return len; } -#define ZEROPAD 1 /* pad with zero */ -#define SIGN 2 /* unsigned/signed long */ +#define SIGN 1 /* unsigned/signed, must be 1 */ +#define LEFT 2 /* left justified */ #define PLUS 4 /* show plus */ #define SPACE 8 /* space if plus */ -#define LEFT 16 /* left justified */ +#define ZEROPAD 16 /* pad with zero, must be 16 == '0' - ' ' */ #define SMALL 32 /* use lowercase in hex (must be 32 == 0x20) */ #define SPECIAL 64 /* prefix hex with "0x", octal with "0" */ @@ -383,10 +384,7 @@ static noinline_for_stack char *number(char *buf, char *end, unsigned long long num, struct printf_spec spec) { - /* we are called with base 8, 10 or 16, only, thus don't need "G..." */ - static const char digits[16] = "0123456789ABCDEF"; /* "GHIJKLMNOPQRSTUVWXYZ"; */ - - char tmp[66]; + char tmp[3 * sizeof(num)]; char sign; char locase; int need_pfx = ((spec.flags & SPECIAL) && spec.base != 10); @@ -422,12 +420,7 @@ char *number(char *buf, char *end, unsigned long long num, /* generate full string in tmp[], in reverse order */ i = 0; if (num < spec.base) - tmp[i++] = digits[num] | locase; - /* Generic code, for any base: - else do { - tmp[i++] = (digits[do_div(num,base)] | locase); - } while (num != 0); - */ + tmp[i++] = hex_asc_upper[num] | locase; else if (spec.base != 10) { /* 8 or 16 */ int mask = spec.base - 1; int shift = 3; @@ -435,7 +428,7 @@ char *number(char *buf, char *end, unsigned long long num, if (spec.base == 16) shift = 4; do { - tmp[i++] = (digits[((unsigned char)num) & mask] | locase); + tmp[i++] = (hex_asc_upper[((unsigned char)num) & mask] | locase); num >>= shift; } while (num); } else { /* base 10 */ @@ -447,7 +440,7 @@ char *number(char *buf, char *end, unsigned long long num, spec.precision = i; /* leading space padding */ spec.field_width -= spec.precision; - if (!(spec.flags & (ZEROPAD+LEFT))) { + if (!(spec.flags & (ZEROPAD | LEFT))) { while (--spec.field_width >= 0) { if (buf < end) *buf = ' '; @@ -475,7 +468,8 @@ char *number(char *buf, char *end, unsigned long long num, } /* zero or space padding */ if (!(spec.flags & LEFT)) { - char c = (spec.flags & ZEROPAD) ? '0' : ' '; + char c = ' ' + (spec.flags & ZEROPAD); + BUILD_BUG_ON(' ' + ZEROPAD != '0'); while (--spec.field_width >= 0) { if (buf < end) *buf = c; @@ -783,11 +777,19 @@ char *hex_string(char *buf, char *end, u8 *addr, struct printf_spec spec, if (spec.field_width > 0) len = min_t(int, spec.field_width, 64); - for (i = 0; i < len && buf < end - 1; i++) { - buf = hex_byte_pack(buf, addr[i]); + for (i = 0; i < len; ++i) { + if (buf < end) + *buf = hex_asc_hi(addr[i]); + ++buf; + if (buf < end) + *buf = hex_asc_lo(addr[i]); + ++buf; - if (buf < end && separator && i != len - 1) - *buf++ = separator; + if (separator && i != len - 1) { + if (buf < end) + *buf = separator; + ++buf; + } } return buf; @@ -1233,8 +1235,12 @@ char *escaped_string(char *buf, char *end, u8 *addr, struct printf_spec spec, len = spec.field_width < 0 ? 1 : spec.field_width; - /* Ignore the error. We print as many characters as we can */ - string_escape_mem(addr, len, &buf, end - buf, flags, NULL); + /* + * string_escape_mem() writes as many characters as it can to + * the given buffer, and returns the total size of the output + * had the buffer been big enough. + */ + buf += string_escape_mem(addr, len, buf, buf < end ? end - buf : 0, flags, NULL); return buf; } @@ -1322,6 +1328,30 @@ char *address_val(char *buf, char *end, const void *addr, return number(buf, end, num, spec); } +static noinline_for_stack +char *clock(char *buf, char *end, struct clk *clk, struct printf_spec spec, + const char *fmt) +{ + if (!IS_ENABLED(CONFIG_HAVE_CLK) || !clk) + return string(buf, end, NULL, spec); + + switch (fmt[1]) { + case 'r': + return number(buf, end, clk_get_rate(clk), spec); + + case 'n': + default: +#ifdef CONFIG_COMMON_CLK + return string(buf, end, __clk_get_name(clk), spec); +#else + spec.base = 16; + spec.field_width = sizeof(unsigned long) * 2 + 2; + spec.flags |= SPECIAL | SMALL | ZEROPAD; + return number(buf, end, (unsigned long)clk, spec); +#endif + } +} + int kptr_restrict __read_mostly; /* @@ -1404,6 +1434,11 @@ int kptr_restrict __read_mostly; * (default assumed to be phys_addr_t, passed by reference) * - 'd[234]' For a dentry name (optionally 2-4 last components) * - 'D[234]' Same as 'd' but for a struct file + * - 'C' For a clock, it prints the name (Common Clock Framework) or address + * (legacy clock framework) of the clock + * - 'Cn' For a clock, it prints the name (Common Clock Framework) or address + * (legacy clock framework) of the clock + * - 'Cr' For a clock, it prints the current rate of the clock * * Note: The difference between 'S' and 'F' is that on ia64 and ppc64 * function pointers are really function descriptors, which contain a @@ -1548,6 +1583,8 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr, return address_val(buf, end, ptr, spec, fmt); case 'd': return dentry_name(buf, end, ptr, spec, fmt); + case 'C': + return clock(buf, end, ptr, spec, fmt); case 'D': return dentry_name(buf, end, ((const struct file *)ptr)->f_path.dentry, @@ -1738,29 +1775,21 @@ qualifier: if (spec->qualifier == 'L') spec->type = FORMAT_TYPE_LONG_LONG; else if (spec->qualifier == 'l') { - if (spec->flags & SIGN) - spec->type = FORMAT_TYPE_LONG; - else - spec->type = FORMAT_TYPE_ULONG; + BUILD_BUG_ON(FORMAT_TYPE_ULONG + SIGN != FORMAT_TYPE_LONG); + spec->type = FORMAT_TYPE_ULONG + (spec->flags & SIGN); } else if (_tolower(spec->qualifier) == 'z') { spec->type = FORMAT_TYPE_SIZE_T; } else if (spec->qualifier == 't') { spec->type = FORMAT_TYPE_PTRDIFF; } else if (spec->qualifier == 'H') { - if (spec->flags & SIGN) - spec->type = FORMAT_TYPE_BYTE; - else - spec->type = FORMAT_TYPE_UBYTE; + BUILD_BUG_ON(FORMAT_TYPE_UBYTE + SIGN != FORMAT_TYPE_BYTE); + spec->type = FORMAT_TYPE_UBYTE + (spec->flags & SIGN); } else if (spec->qualifier == 'h') { - if (spec->flags & SIGN) - spec->type = FORMAT_TYPE_SHORT; - else - spec->type = FORMAT_TYPE_USHORT; + BUILD_BUG_ON(FORMAT_TYPE_USHORT + SIGN != FORMAT_TYPE_SHORT); + spec->type = FORMAT_TYPE_USHORT + (spec->flags & SIGN); } else { - if (spec->flags & SIGN) - spec->type = FORMAT_TYPE_INT; - else - spec->type = FORMAT_TYPE_UINT; + BUILD_BUG_ON(FORMAT_TYPE_UINT + SIGN != FORMAT_TYPE_INT); + spec->type = FORMAT_TYPE_UINT + (spec->flags & SIGN); } return ++fmt - start; @@ -1800,6 +1829,11 @@ qualifier: * %*pE[achnops] print an escaped buffer * %*ph[CDN] a variable-length hex string with a separator (supports up to 64 * bytes of the input) + * %pC output the name (Common Clock Framework) or address (legacy clock + * framework) of a clock + * %pCn output the name (Common Clock Framework) or address (legacy clock + * framework) of a clock + * %pCr output the current rate of a clock * %n is ignored * * ** Please update Documentation/printk-formats.txt when making changes ** @@ -23,6 +23,7 @@ # define DEBUG #endif #endif +#define CREATE_TRACE_POINTS #include <linux/memblock.h> #include <linux/err.h> @@ -34,6 +35,7 @@ #include <linux/cma.h> #include <linux/highmem.h> #include <linux/io.h> +#include <trace/events/cma.h> #include "cma.h" @@ -414,6 +416,8 @@ struct page *cma_alloc(struct cma *cma, unsigned int count, unsigned int align) start = bitmap_no + mask + 1; } + trace_cma_alloc(page ? pfn : -1UL, page, count, align); + pr_debug("%s(): returned %p\n", __func__, page); return page; } @@ -446,6 +450,7 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count) free_contig_range(pfn, count); cma_clear_bitmap(cma, pfn, count); + trace_cma_release(pfn, pages, count); return true; } diff --git a/mm/cma_debug.c b/mm/cma_debug.c index 0b37753..7621ee3 100644 --- a/mm/cma_debug.c +++ b/mm/cma_debug.c @@ -30,9 +30,44 @@ static int cma_debugfs_get(void *data, u64 *val) return 0; } - DEFINE_SIMPLE_ATTRIBUTE(cma_debugfs_fops, cma_debugfs_get, NULL, "%llu\n"); +static int cma_used_get(void *data, u64 *val) +{ + struct cma *cma = data; + unsigned long used; + + mutex_lock(&cma->lock); + /* pages counter is smaller than sizeof(int) */ + used = bitmap_weight(cma->bitmap, (int)cma->count); + mutex_unlock(&cma->lock); + *val = (u64)used << cma->order_per_bit; + + return 0; +} +DEFINE_SIMPLE_ATTRIBUTE(cma_used_fops, cma_used_get, NULL, "%llu\n"); + +static int cma_maxchunk_get(void *data, u64 *val) +{ + struct cma *cma = data; + unsigned long maxchunk = 0; + unsigned long start, end = 0; + + mutex_lock(&cma->lock); + for (;;) { + start = find_next_zero_bit(cma->bitmap, cma->count, end); + if (start >= cma->count) + break; + end = find_next_bit(cma->bitmap, cma->count, start); + maxchunk = max(end - start, maxchunk); + } + mutex_unlock(&cma->lock); + *val = (u64)maxchunk << cma->order_per_bit; + + return 0; +} +DEFINE_SIMPLE_ATTRIBUTE(cma_maxchunk_fops, cma_maxchunk_get, NULL, "%llu\n"); + static void cma_add_to_cma_mem_list(struct cma *cma, struct cma_mem *mem) { spin_lock(&cma->mem_head_lock); @@ -91,7 +126,6 @@ static int cma_free_write(void *data, u64 val) return cma_free_mem(cma, pages); } - DEFINE_SIMPLE_ATTRIBUTE(cma_free_fops, NULL, cma_free_write, "%llu\n"); static int cma_alloc_mem(struct cma *cma, int count) @@ -124,7 +158,6 @@ static int cma_alloc_write(void *data, u64 val) return cma_alloc_mem(cma, pages); } - DEFINE_SIMPLE_ATTRIBUTE(cma_alloc_fops, NULL, cma_alloc_write, "%llu\n"); static void cma_debugfs_add_one(struct cma *cma, int idx) @@ -149,6 +182,8 @@ static void cma_debugfs_add_one(struct cma *cma, int idx) &cma->count, &cma_debugfs_fops); debugfs_create_file("order_per_bit", S_IRUGO, tmp, &cma->order_per_bit, &cma_debugfs_fops); + debugfs_create_file("used", S_IRUGO, tmp, cma, &cma_used_fops); + debugfs_create_file("maxchunk", S_IRUGO, tmp, cma, &cma_maxchunk_fops); u32s = DIV_ROUND_UP(cma_bitmap_maxno(cma), BITS_PER_BYTE * sizeof(u32)); debugfs_create_u32_array("bitmap", S_IRUGO, tmp, (u32*)cma->bitmap, u32s); diff --git a/mm/compaction.c b/mm/compaction.c index a18201a..018f08d 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -391,28 +391,6 @@ static inline bool compact_should_abort(struct compact_control *cc) return false; } -/* Returns true if the page is within a block suitable for migration to */ -static bool suitable_migration_target(struct page *page) -{ - /* If the page is a large free page, then disallow migration */ - if (PageBuddy(page)) { - /* - * We are checking page_order without zone->lock taken. But - * the only small danger is that we skip a potentially suitable - * pageblock, so it's not worth to check order for valid range. - */ - if (page_order_unsafe(page) >= pageblock_order) - return false; - } - - /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */ - if (migrate_async_suitable(get_pageblock_migratetype(page))) - return true; - - /* Otherwise skip the block */ - return false; -} - /* * Isolate free pages onto a private freelist. If @strict is true, will abort * returning 0 on any invalid PFNs or non-free pages inside of the pageblock @@ -896,6 +874,29 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn, #endif /* CONFIG_COMPACTION || CONFIG_CMA */ #ifdef CONFIG_COMPACTION + +/* Returns true if the page is within a block suitable for migration to */ +static bool suitable_migration_target(struct page *page) +{ + /* If the page is a large free page, then disallow migration */ + if (PageBuddy(page)) { + /* + * We are checking page_order without zone->lock taken. But + * the only small danger is that we skip a potentially suitable + * pageblock, so it's not worth to check order for valid range. + */ + if (page_order_unsafe(page) >= pageblock_order) + return false; + } + + /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */ + if (migrate_async_suitable(get_pageblock_migratetype(page))) + return true; + + /* Otherwise skip the block */ + return false; +} + /* * Based on information in the current compact_control, find blocks * suitable for isolating free pages from and then isolate them. @@ -1047,6 +1048,12 @@ typedef enum { } isolate_migrate_t; /* + * Allow userspace to control policy on scanning the unevictable LRU for + * compactable pages. + */ +int sysctl_compact_unevictable_allowed __read_mostly = 1; + +/* * Isolate all pages that can be migrated from the first suitable block, * starting at the block pointed to by the migrate scanner pfn within * compact_control. @@ -1057,6 +1064,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, unsigned long low_pfn, end_pfn; struct page *page; const isolate_mode_t isolate_mode = + (sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) | (cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0); /* @@ -1598,6 +1606,14 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc) INIT_LIST_HEAD(&cc->freepages); INIT_LIST_HEAD(&cc->migratepages); + /* + * When called via /proc/sys/vm/compact_memory + * this makes sure we compact the whole zone regardless of + * cached scanner positions. + */ + if (cc->order == -1) + __reset_isolation_suitable(zone); + if (cc->order == -1 || !compaction_deferred(zone, cc->order)) compact_zone(zone, cc); @@ -1019,7 +1019,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, * * for an example see gup_get_pte in arch/x86/mm/gup.c */ - pte_t pte = ACCESS_ONCE(*ptep); + pte_t pte = READ_ONCE(*ptep); struct page *page; /* @@ -1309,7 +1309,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, local_irq_save(flags); pgdp = pgd_offset(mm, addr); do { - pgd_t pgd = ACCESS_ONCE(*pgdp); + pgd_t pgd = READ_ONCE(*pgdp); next = pgd_addr_end(addr, end); if (pgd_none(pgd)) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 3afb5cb..078832c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -67,6 +67,7 @@ static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1; static int khugepaged(void *none); static int khugepaged_slab_init(void); +static void khugepaged_slab_exit(void); #define MM_SLOTS_HASH_BITS 10 static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); @@ -109,9 +110,6 @@ static int set_recommended_min_free_kbytes(void) int nr_zones = 0; unsigned long recommended_min; - if (!khugepaged_enabled()) - return 0; - for_each_populated_zone(zone) nr_zones++; @@ -143,9 +141,8 @@ static int set_recommended_min_free_kbytes(void) setup_per_zone_wmarks(); return 0; } -late_initcall(set_recommended_min_free_kbytes); -static int start_khugepaged(void) +static int start_stop_khugepaged(void) { int err = 0; if (khugepaged_enabled()) { @@ -156,6 +153,7 @@ static int start_khugepaged(void) pr_err("khugepaged: kthread_run(khugepaged) failed\n"); err = PTR_ERR(khugepaged_thread); khugepaged_thread = NULL; + goto fail; } if (!list_empty(&khugepaged_scan.mm_head)) @@ -166,7 +164,7 @@ static int start_khugepaged(void) kthread_stop(khugepaged_thread); khugepaged_thread = NULL; } - +fail: return err; } @@ -183,7 +181,7 @@ static struct page *get_huge_zero_page(void) struct page *zero_page; retry: if (likely(atomic_inc_not_zero(&huge_zero_refcount))) - return ACCESS_ONCE(huge_zero_page); + return READ_ONCE(huge_zero_page); zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE, HPAGE_PMD_ORDER); @@ -202,7 +200,7 @@ retry: /* We take additional reference here. It will be put back by shrinker */ atomic_set(&huge_zero_refcount, 2); preempt_enable(); - return ACCESS_ONCE(huge_zero_page); + return READ_ONCE(huge_zero_page); } static void put_huge_zero_page(void) @@ -300,7 +298,7 @@ static ssize_t enabled_store(struct kobject *kobj, int err; mutex_lock(&khugepaged_mutex); - err = start_khugepaged(); + err = start_stop_khugepaged(); mutex_unlock(&khugepaged_mutex); if (err) @@ -634,27 +632,38 @@ static int __init hugepage_init(void) err = hugepage_init_sysfs(&hugepage_kobj); if (err) - return err; + goto err_sysfs; err = khugepaged_slab_init(); if (err) - goto out; + goto err_slab; - register_shrinker(&huge_zero_page_shrinker); + err = register_shrinker(&huge_zero_page_shrinker); + if (err) + goto err_hzp_shrinker; /* * By default disable transparent hugepages on smaller systems, * where the extra memory used could hurt more than TLB overhead * is likely to save. The admin can still enable it through /sys. */ - if (totalram_pages < (512 << (20 - PAGE_SHIFT))) + if (totalram_pages < (512 << (20 - PAGE_SHIFT))) { transparent_hugepage_flags = 0; + return 0; + } - start_khugepaged(); + err = start_stop_khugepaged(); + if (err) + goto err_khugepaged; return 0; -out: +err_khugepaged: + unregister_shrinker(&huge_zero_page_shrinker); +err_hzp_shrinker: + khugepaged_slab_exit(); +err_slab: hugepage_exit_sysfs(hugepage_kobj); +err_sysfs: return err; } subsys_initcall(hugepage_init); @@ -708,7 +717,7 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot) static int __do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd, - struct page *page) + struct page *page, gfp_t gfp) { struct mem_cgroup *memcg; pgtable_t pgtable; @@ -716,7 +725,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm, VM_BUG_ON_PAGE(!PageCompound(page), page); - if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg)) + if (mem_cgroup_try_charge(page, mm, gfp, &memcg)) return VM_FAULT_OOM; pgtable = pte_alloc_one(mm, haddr); @@ -822,7 +831,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, count_vm_event(THP_FAULT_FALLBACK); return VM_FAULT_FALLBACK; } - if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) { + if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, gfp))) { put_page(page); count_vm_event(THP_FAULT_FALLBACK); return VM_FAULT_FALLBACK; @@ -1080,6 +1089,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long haddr; unsigned long mmun_start; /* For mmu_notifiers */ unsigned long mmun_end; /* For mmu_notifiers */ + gfp_t huge_gfp; /* for allocation and charge */ ptl = pmd_lockptr(mm, pmd); VM_BUG_ON_VMA(!vma->anon_vma, vma); @@ -1106,10 +1116,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, alloc: if (transparent_hugepage_enabled(vma) && !transparent_hugepage_debug_cow()) { - gfp_t gfp; - - gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0); - new_page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER); + huge_gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0); + new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER); } else new_page = NULL; @@ -1130,8 +1138,7 @@ alloc: goto out; } - if (unlikely(mem_cgroup_try_charge(new_page, mm, - GFP_TRANSHUGE, &memcg))) { + if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp, &memcg))) { put_page(new_page); if (page) { split_huge_page(page); @@ -1976,6 +1983,11 @@ static int __init khugepaged_slab_init(void) return 0; } +static void __init khugepaged_slab_exit(void) +{ + kmem_cache_destroy(mm_slot_cache); +} + static inline struct mm_slot *alloc_mm_slot(void) { if (!mm_slot_cache) /* initialization failed */ @@ -2323,19 +2335,13 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait) return true; } -static struct page -*khugepaged_alloc_page(struct page **hpage, struct mm_struct *mm, +static struct page * +khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int node) { - gfp_t flags; - VM_BUG_ON_PAGE(*hpage, *hpage); - /* Only allocate from the target node */ - flags = alloc_hugepage_gfpmask(khugepaged_defrag(), __GFP_OTHER_NODE) | - __GFP_THISNODE; - /* * Before allocating the hugepage, release the mmap_sem read lock. * The allocation can take potentially a long time if it involves @@ -2344,7 +2350,7 @@ static struct page */ up_read(&mm->mmap_sem); - *hpage = alloc_pages_exact_node(node, flags, HPAGE_PMD_ORDER); + *hpage = alloc_pages_exact_node(node, gfp, HPAGE_PMD_ORDER); if (unlikely(!*hpage)) { count_vm_event(THP_COLLAPSE_ALLOC_FAILED); *hpage = ERR_PTR(-ENOMEM); @@ -2397,13 +2403,14 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait) return true; } -static struct page -*khugepaged_alloc_page(struct page **hpage, struct mm_struct *mm, +static struct page * +khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int node) { up_read(&mm->mmap_sem); VM_BUG_ON(!*hpage); + return *hpage; } #endif @@ -2438,16 +2445,21 @@ static void collapse_huge_page(struct mm_struct *mm, struct mem_cgroup *memcg; unsigned long mmun_start; /* For mmu_notifiers */ unsigned long mmun_end; /* For mmu_notifiers */ + gfp_t gfp; VM_BUG_ON(address & ~HPAGE_PMD_MASK); + /* Only allocate from the target node */ + gfp = alloc_hugepage_gfpmask(khugepaged_defrag(), __GFP_OTHER_NODE) | + __GFP_THISNODE; + /* release the mmap_sem read lock. */ - new_page = khugepaged_alloc_page(hpage, mm, vma, address, node); + new_page = khugepaged_alloc_page(hpage, gfp, mm, vma, address, node); if (!new_page) return; if (unlikely(mem_cgroup_try_charge(new_page, mm, - GFP_TRANSHUGE, &memcg))) + gfp, &memcg))) return; /* diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8874c8a..271e443 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -61,6 +61,9 @@ DEFINE_SPINLOCK(hugetlb_lock); static int num_fault_mutexes; static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp; +/* Forward declaration */ +static int hugetlb_acct_memory(struct hstate *h, long delta); + static inline void unlock_or_release_subpool(struct hugepage_subpool *spool) { bool free = (spool->count == 0) && (spool->used_hpages == 0); @@ -68,23 +71,36 @@ static inline void unlock_or_release_subpool(struct hugepage_subpool *spool) spin_unlock(&spool->lock); /* If no pages are used, and no other handles to the subpool - * remain, free the subpool the subpool remain */ - if (free) + * remain, give up any reservations mased on minimum size and + * free the subpool */ + if (free) { + if (spool->min_hpages != -1) + hugetlb_acct_memory(spool->hstate, + -spool->min_hpages); kfree(spool); + } } -struct hugepage_subpool *hugepage_new_subpool(long nr_blocks) +struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages, + long min_hpages) { struct hugepage_subpool *spool; - spool = kmalloc(sizeof(*spool), GFP_KERNEL); + spool = kzalloc(sizeof(*spool), GFP_KERNEL); if (!spool) return NULL; spin_lock_init(&spool->lock); spool->count = 1; - spool->max_hpages = nr_blocks; - spool->used_hpages = 0; + spool->max_hpages = max_hpages; + spool->hstate = h; + spool->min_hpages = min_hpages; + + if (min_hpages != -1 && hugetlb_acct_memory(h, min_hpages)) { + kfree(spool); + return NULL; + } + spool->rsv_hpages = min_hpages; return spool; } @@ -97,36 +113,89 @@ void hugepage_put_subpool(struct hugepage_subpool *spool) unlock_or_release_subpool(spool); } -static int hugepage_subpool_get_pages(struct hugepage_subpool *spool, +/* + * Subpool accounting for allocating and reserving pages. + * Return -ENOMEM if there are not enough resources to satisfy the + * the request. Otherwise, return the number of pages by which the + * global pools must be adjusted (upward). The returned value may + * only be different than the passed value (delta) in the case where + * a subpool minimum size must be manitained. + */ +static long hugepage_subpool_get_pages(struct hugepage_subpool *spool, long delta) { - int ret = 0; + long ret = delta; if (!spool) - return 0; + return ret; spin_lock(&spool->lock); - if ((spool->used_hpages + delta) <= spool->max_hpages) { - spool->used_hpages += delta; - } else { - ret = -ENOMEM; + + if (spool->max_hpages != -1) { /* maximum size accounting */ + if ((spool->used_hpages + delta) <= spool->max_hpages) + spool->used_hpages += delta; + else { + ret = -ENOMEM; + goto unlock_ret; + } + } + + if (spool->min_hpages != -1) { /* minimum size accounting */ + if (delta > spool->rsv_hpages) { + /* + * Asking for more reserves than those already taken on + * behalf of subpool. Return difference. + */ + ret = delta - spool->rsv_hpages; + spool->rsv_hpages = 0; + } else { + ret = 0; /* reserves already accounted for */ + spool->rsv_hpages -= delta; + } } - spin_unlock(&spool->lock); +unlock_ret: + spin_unlock(&spool->lock); return ret; } -static void hugepage_subpool_put_pages(struct hugepage_subpool *spool, +/* + * Subpool accounting for freeing and unreserving pages. + * Return the number of global page reservations that must be dropped. + * The return value may only be different than the passed value (delta) + * in the case where a subpool minimum size must be maintained. + */ +static long hugepage_subpool_put_pages(struct hugepage_subpool *spool, long delta) { + long ret = delta; + if (!spool) - return; + return delta; spin_lock(&spool->lock); - spool->used_hpages -= delta; - /* If hugetlbfs_put_super couldn't free spool due to - * an outstanding quota reference, free it now. */ + + if (spool->max_hpages != -1) /* maximum size accounting */ + spool->used_hpages -= delta; + + if (spool->min_hpages != -1) { /* minimum size accounting */ + if (spool->rsv_hpages + delta <= spool->min_hpages) + ret = 0; + else + ret = spool->rsv_hpages + delta - spool->min_hpages; + + spool->rsv_hpages += delta; + if (spool->rsv_hpages > spool->min_hpages) + spool->rsv_hpages = spool->min_hpages; + } + + /* + * If hugetlbfs_put_super couldn't free spool due to an outstanding + * quota reference, free it now. + */ unlock_or_release_subpool(spool); + + return ret; } static inline struct hugepage_subpool *subpool_inode(struct inode *inode) @@ -855,6 +924,31 @@ struct hstate *size_to_hstate(unsigned long size) return NULL; } +/* + * Test to determine whether the hugepage is "active/in-use" (i.e. being linked + * to hstate->hugepage_activelist.) + * + * This function can be called for tail pages, but never returns true for them. + */ +bool page_huge_active(struct page *page) +{ + VM_BUG_ON_PAGE(!PageHuge(page), page); + return PageHead(page) && PagePrivate(&page[1]); +} + +/* never called for tail page */ +static void set_page_huge_active(struct page *page) +{ + VM_BUG_ON_PAGE(!PageHeadHuge(page), page); + SetPagePrivate(&page[1]); +} + +static void clear_page_huge_active(struct page *page) +{ + VM_BUG_ON_PAGE(!PageHeadHuge(page), page); + ClearPagePrivate(&page[1]); +} + void free_huge_page(struct page *page) { /* @@ -874,7 +968,16 @@ void free_huge_page(struct page *page) restore_reserve = PagePrivate(page); ClearPagePrivate(page); + /* + * A return code of zero implies that the subpool will be under its + * minimum size if the reservation is not restored after page is free. + * Therefore, force restore_reserve operation. + */ + if (hugepage_subpool_put_pages(spool, 1) == 0) + restore_reserve = true; + spin_lock(&hugetlb_lock); + clear_page_huge_active(page); hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h), page); if (restore_reserve) @@ -891,7 +994,6 @@ void free_huge_page(struct page *page) enqueue_huge_page(h, page); } spin_unlock(&hugetlb_lock); - hugepage_subpool_put_pages(spool, 1); } static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) @@ -1386,7 +1488,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, if (chg < 0) return ERR_PTR(-ENOMEM); if (chg || avoid_reserve) - if (hugepage_subpool_get_pages(spool, 1)) + if (hugepage_subpool_get_pages(spool, 1) < 0) return ERR_PTR(-ENOSPC); ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg); @@ -2454,6 +2556,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) struct resv_map *resv = vma_resv_map(vma); struct hugepage_subpool *spool = subpool_vma(vma); unsigned long reserve, start, end; + long gbl_reserve; if (!resv || !is_vma_resv_set(vma, HPAGE_RESV_OWNER)) return; @@ -2466,8 +2569,12 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) kref_put(&resv->refs, resv_map_release); if (reserve) { - hugetlb_acct_memory(h, -reserve); - hugepage_subpool_put_pages(spool, reserve); + /* + * Decrement reserve counts. The global reserve count may be + * adjusted if the subpool has a minimum size. + */ + gbl_reserve = hugepage_subpool_put_pages(spool, reserve); + hugetlb_acct_memory(h, -gbl_reserve); } } @@ -2891,6 +2998,7 @@ retry_avoidcopy: copy_user_huge_page(new_page, old_page, address, vma, pages_per_huge_page(h)); __SetPageUptodate(new_page); + set_page_huge_active(new_page); mmun_start = address & huge_page_mask(h); mmun_end = mmun_start + huge_page_size(h); @@ -3003,6 +3111,7 @@ retry: } clear_huge_page(page, address, pages_per_huge_page(h)); __SetPageUptodate(page); + set_page_huge_active(page); if (vma->vm_flags & VM_MAYSHARE) { int err; @@ -3447,6 +3556,7 @@ int hugetlb_reserve_pages(struct inode *inode, struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; + long gbl_reserve; /* * Only apply hugepage reservation if asked. At fault time, an @@ -3483,8 +3593,13 @@ int hugetlb_reserve_pages(struct inode *inode, goto out_err; } - /* There must be enough pages in the subpool for the mapping */ - if (hugepage_subpool_get_pages(spool, chg)) { + /* + * There must be enough pages in the subpool for the mapping. If + * the subpool has a minimum size, there may be some global + * reservations already in place (gbl_reserve). + */ + gbl_reserve = hugepage_subpool_get_pages(spool, chg); + if (gbl_reserve < 0) { ret = -ENOSPC; goto out_err; } @@ -3493,9 +3608,10 @@ int hugetlb_reserve_pages(struct inode *inode, * Check enough hugepages are available for the reservation. * Hand the pages back to the subpool if there are not */ - ret = hugetlb_acct_memory(h, chg); + ret = hugetlb_acct_memory(h, gbl_reserve); if (ret < 0) { - hugepage_subpool_put_pages(spool, chg); + /* put back original number of pages, chg */ + (void)hugepage_subpool_put_pages(spool, chg); goto out_err; } @@ -3525,6 +3641,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed) struct resv_map *resv_map = inode_resv_map(inode); long chg = 0; struct hugepage_subpool *spool = subpool_inode(inode); + long gbl_reserve; if (resv_map) chg = region_truncate(resv_map, offset); @@ -3532,8 +3649,12 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed) inode->i_blocks -= (blocks_per_huge_page(h) * freed); spin_unlock(&inode->i_lock); - hugepage_subpool_put_pages(spool, (chg - freed)); - hugetlb_acct_memory(h, -(chg - freed)); + /* + * If the subpool has a minimum size, the number of global + * reservations to be released may be adjusted. + */ + gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed)); + hugetlb_acct_memory(h, -gbl_reserve); } #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE @@ -3775,20 +3896,6 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address, #ifdef CONFIG_MEMORY_FAILURE -/* Should be called in hugetlb_lock */ -static int is_hugepage_on_freelist(struct page *hpage) -{ - struct page *page; - struct page *tmp; - struct hstate *h = page_hstate(hpage); - int nid = page_to_nid(hpage); - - list_for_each_entry_safe(page, tmp, &h->hugepage_freelists[nid], lru) - if (page == hpage) - return 1; - return 0; -} - /* * This function is called from memory failure code. * Assume the caller holds page lock of the head page. @@ -3800,7 +3907,11 @@ int dequeue_hwpoisoned_huge_page(struct page *hpage) int ret = -EBUSY; spin_lock(&hugetlb_lock); - if (is_hugepage_on_freelist(hpage)) { + /* + * Just checking !page_huge_active is not enough, because that could be + * an isolated/hwpoisoned hugepage (which have >0 refcount). + */ + if (!page_huge_active(hpage) && !page_count(hpage)) { /* * Hwpoisoned hugepage isn't linked to activelist or freelist, * but dangling hpage->lru can trigger list-debug warnings @@ -3820,42 +3931,27 @@ int dequeue_hwpoisoned_huge_page(struct page *hpage) bool isolate_huge_page(struct page *page, struct list_head *list) { + bool ret = true; + VM_BUG_ON_PAGE(!PageHead(page), page); - if (!get_page_unless_zero(page)) - return false; spin_lock(&hugetlb_lock); + if (!page_huge_active(page) || !get_page_unless_zero(page)) { + ret = false; + goto unlock; + } + clear_page_huge_active(page); list_move_tail(&page->lru, list); +unlock: spin_unlock(&hugetlb_lock); - return true; + return ret; } void putback_active_hugepage(struct page *page) { VM_BUG_ON_PAGE(!PageHead(page), page); spin_lock(&hugetlb_lock); + set_page_huge_active(page); list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist); spin_unlock(&hugetlb_lock); put_page(page); } - -bool is_hugepage_active(struct page *page) -{ - VM_BUG_ON_PAGE(!PageHuge(page), page); - /* - * This function can be called for a tail page because the caller, - * scan_movable_pages, scans through a given pfn-range which typically - * covers one memory block. In systems using gigantic hugepage (1GB - * for x86_64,) a hugepage is larger than a memory block, and we don't - * support migrating such large hugepages for now, so return false - * when called for tail pages. - */ - if (PageTail(page)) - return false; - /* - * Refcount of a hwpoisoned hugepages is 1, but they are not active, - * so we should return false for them. - */ - if (unlikely(PageHWPoison(page))) - return false; - return page_count(page) > 0; -} diff --git a/mm/internal.h b/mm/internal.h index edaab69..a25e359 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -224,13 +224,13 @@ static inline unsigned long page_order(struct page *page) * PageBuddy() should be checked first by the caller to minimize race window, * and invalid values must be handled gracefully. * - * ACCESS_ONCE is used so that if the caller assigns the result into a local + * READ_ONCE is used so that if the caller assigns the result into a local * variable and e.g. tests it for valid range before using, the compiler cannot * decide to remove the variable and inline the page_private(page) multiple * times, potentially observing different values in the tests and the actual * use of the result. */ -#define page_order_unsafe(page) ACCESS_ONCE(page_private(page)) +#define page_order_unsafe(page) READ_ONCE(page_private(page)) static inline bool is_cow_mapping(vm_flags_t flags) { diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c index 936d816..6c513a6 100644 --- a/mm/kasan/kasan.c +++ b/mm/kasan/kasan.c @@ -389,6 +389,19 @@ void kasan_krealloc(const void *object, size_t size) kasan_kmalloc(page->slab_cache, object, size); } +void kasan_kfree(void *ptr) +{ + struct page *page; + + page = virt_to_head_page(ptr); + + if (unlikely(!PageSlab(page))) + kasan_poison_shadow(ptr, PAGE_SIZE << compound_order(page), + KASAN_FREE_PAGE); + else + kasan_slab_free(page->slab_cache, ptr); +} + void kasan_kfree_large(const void *ptr) { struct page *page = virt_to_page(ptr); @@ -542,7 +542,7 @@ static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it) expected_mapping = (void *)stable_node + (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); again: - kpfn = ACCESS_ONCE(stable_node->kpfn); + kpfn = READ_ONCE(stable_node->kpfn); page = pfn_to_page(kpfn); /* @@ -551,7 +551,7 @@ again: * but on Alpha we need to be more careful. */ smp_read_barrier_depends(); - if (ACCESS_ONCE(page->mapping) != expected_mapping) + if (READ_ONCE(page->mapping) != expected_mapping) goto stale; /* @@ -577,14 +577,14 @@ again: cpu_relax(); } - if (ACCESS_ONCE(page->mapping) != expected_mapping) { + if (READ_ONCE(page->mapping) != expected_mapping) { put_page(page); goto stale; } if (lock_it) { lock_page(page); - if (ACCESS_ONCE(page->mapping) != expected_mapping) { + if (READ_ONCE(page->mapping) != expected_mapping) { unlock_page(page); put_page(page); goto stale; @@ -600,7 +600,7 @@ stale: * before checking whether node->kpfn has been changed. */ smp_rmb(); - if (ACCESS_ONCE(stable_node->kpfn) != kpfn) + if (READ_ONCE(stable_node->kpfn) != kpfn) goto again; remove_node_from_stable_tree(stable_node); return NULL; diff --git a/mm/memblock.c b/mm/memblock.c index 3f37a0b..9318b56 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -580,10 +580,24 @@ int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size, return memblock_add_range(&memblock.memory, base, size, nid, 0); } +static int __init_memblock memblock_add_region(phys_addr_t base, + phys_addr_t size, + int nid, + unsigned long flags) +{ + struct memblock_type *_rgn = &memblock.memory; + + memblock_dbg("memblock_add: [%#016llx-%#016llx] flags %#02lx %pF\n", + (unsigned long long)base, + (unsigned long long)base + size - 1, + flags, (void *)_RET_IP_); + + return memblock_add_range(_rgn, base, size, nid, flags); +} + int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size) { - return memblock_add_range(&memblock.memory, base, size, - MAX_NUMNODES, 0); + return memblock_add_region(base, size, MAX_NUMNODES, 0); } /** diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c3f09b2..14c2f20 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -259,11 +259,6 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg); * page cache and RSS per cgroup. We would eventually like to provide * statistics based on the statistics developed by Rik Van Riel for clock-pro, * to help the administrator determine what knobs to tune. - * - * TODO: Add a water mark for the memory controller. Reclaim will begin when - * we hit the water mark. May be even add a low water mark, such that - * no reclaim occurs from a cgroup at it's low water mark, this is - * a feature that will be implemented much later in the future. */ struct mem_cgroup { struct cgroup_subsys_state css; @@ -460,6 +455,12 @@ static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) return memcg->css.id; } +/* + * A helper function to get mem_cgroup from ID. must be called under + * rcu_read_lock(). The caller is responsible for calling + * css_tryget_online() if the mem_cgroup is used for charging. (dropping + * refcnt from swap can be called against removed memcg.) + */ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) { struct cgroup_subsys_state *css; @@ -673,7 +674,7 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz, static unsigned long soft_limit_excess(struct mem_cgroup *memcg) { unsigned long nr_pages = page_counter_read(&memcg->memory); - unsigned long soft_limit = ACCESS_ONCE(memcg->soft_limit); + unsigned long soft_limit = READ_ONCE(memcg->soft_limit); unsigned long excess = 0; if (nr_pages > soft_limit) @@ -1041,7 +1042,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, goto out_unlock; do { - pos = ACCESS_ONCE(iter->position); + pos = READ_ONCE(iter->position); /* * A racing update may change the position and * put the last reference, hence css_tryget(), @@ -1358,13 +1359,13 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) unsigned long limit; count = page_counter_read(&memcg->memory); - limit = ACCESS_ONCE(memcg->memory.limit); + limit = READ_ONCE(memcg->memory.limit); if (count < limit) margin = limit - count; if (do_swap_account) { count = page_counter_read(&memcg->memsw); - limit = ACCESS_ONCE(memcg->memsw.limit); + limit = READ_ONCE(memcg->memsw.limit); if (count <= limit) margin = min(margin, limit - count); } @@ -2349,20 +2350,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) } /* - * A helper function to get mem_cgroup from ID. must be called under - * rcu_read_lock(). The caller is responsible for calling - * css_tryget_online() if the mem_cgroup is used for charging. (dropping - * refcnt from swap can be called against removed memcg.) - */ -static struct mem_cgroup *mem_cgroup_lookup(unsigned short id) -{ - /* ID 0 is unused ID */ - if (!id) - return NULL; - return mem_cgroup_from_id(id); -} - -/* * try_get_mem_cgroup_from_page - look up page's memcg association * @page: the page * @@ -2388,7 +2375,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page) ent.val = page_private(page); id = lookup_swap_cgroup_id(ent); rcu_read_lock(); - memcg = mem_cgroup_lookup(id); + memcg = mem_cgroup_from_id(id); if (memcg && !css_tryget_online(&memcg->css)) memcg = NULL; rcu_read_unlock(); @@ -2650,7 +2637,7 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep) return cachep; memcg = get_mem_cgroup_from_mm(current->mm); - kmemcg_id = ACCESS_ONCE(memcg->kmemcg_id); + kmemcg_id = READ_ONCE(memcg->kmemcg_id); if (kmemcg_id < 0) goto out; @@ -5020,7 +5007,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css, * tunable will only affect upcoming migrations, not the current one. * So we need to save it, and keep it going. */ - move_flags = ACCESS_ONCE(memcg->move_charge_at_immigrate); + move_flags = READ_ONCE(memcg->move_charge_at_immigrate); if (move_flags) { struct mm_struct *mm; struct mem_cgroup *from = mem_cgroup_from_task(p); @@ -5254,7 +5241,7 @@ static u64 memory_current_read(struct cgroup_subsys_state *css, static int memory_low_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); - unsigned long low = ACCESS_ONCE(memcg->low); + unsigned long low = READ_ONCE(memcg->low); if (low == PAGE_COUNTER_MAX) seq_puts(m, "max\n"); @@ -5284,7 +5271,7 @@ static ssize_t memory_low_write(struct kernfs_open_file *of, static int memory_high_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); - unsigned long high = ACCESS_ONCE(memcg->high); + unsigned long high = READ_ONCE(memcg->high); if (high == PAGE_COUNTER_MAX) seq_puts(m, "max\n"); @@ -5314,7 +5301,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, static int memory_max_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); - unsigned long max = ACCESS_ONCE(memcg->memory.limit); + unsigned long max = READ_ONCE(memcg->memory.limit); if (max == PAGE_COUNTER_MAX) seq_puts(m, "max\n"); @@ -5869,7 +5856,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry) id = swap_cgroup_record(entry, 0); rcu_read_lock(); - memcg = mem_cgroup_lookup(id); + memcg = mem_cgroup_from_id(id); if (memcg) { if (!mem_cgroup_is_root(memcg)) page_counter_uncharge(&memcg->memsw, 1); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index d487f8d..d9359b7 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -521,6 +521,52 @@ static const char *action_name[] = { [RECOVERED] = "Recovered", }; +enum action_page_type { + MSG_KERNEL, + MSG_KERNEL_HIGH_ORDER, + MSG_SLAB, + MSG_DIFFERENT_COMPOUND, + MSG_POISONED_HUGE, + MSG_HUGE, + MSG_FREE_HUGE, + MSG_UNMAP_FAILED, + MSG_DIRTY_SWAPCACHE, + MSG_CLEAN_SWAPCACHE, + MSG_DIRTY_MLOCKED_LRU, + MSG_CLEAN_MLOCKED_LRU, + MSG_DIRTY_UNEVICTABLE_LRU, + MSG_CLEAN_UNEVICTABLE_LRU, + MSG_DIRTY_LRU, + MSG_CLEAN_LRU, + MSG_TRUNCATED_LRU, + MSG_BUDDY, + MSG_BUDDY_2ND, + MSG_UNKNOWN, +}; + +static const char * const action_page_types[] = { + [MSG_KERNEL] = "reserved kernel page", + [MSG_KERNEL_HIGH_ORDER] = "high-order kernel page", + [MSG_SLAB] = "kernel slab page", + [MSG_DIFFERENT_COMPOUND] = "different compound page after locking", + [MSG_POISONED_HUGE] = "huge page already hardware poisoned", + [MSG_HUGE] = "huge page", + [MSG_FREE_HUGE] = "free huge page", + [MSG_UNMAP_FAILED] = "unmapping failed page", + [MSG_DIRTY_SWAPCACHE] = "dirty swapcache page", + [MSG_CLEAN_SWAPCACHE] = "clean swapcache page", + [MSG_DIRTY_MLOCKED_LRU] = "dirty mlocked LRU page", + [MSG_CLEAN_MLOCKED_LRU] = "clean mlocked LRU page", + [MSG_DIRTY_UNEVICTABLE_LRU] = "dirty unevictable LRU page", + [MSG_CLEAN_UNEVICTABLE_LRU] = "clean unevictable LRU page", + [MSG_DIRTY_LRU] = "dirty LRU page", + [MSG_CLEAN_LRU] = "clean LRU page", + [MSG_TRUNCATED_LRU] = "already truncated LRU page", + [MSG_BUDDY] = "free buddy page", + [MSG_BUDDY_2ND] = "free buddy page (2nd try)", + [MSG_UNKNOWN] = "unknown page", +}; + /* * XXX: It is possible that a page is isolated from LRU cache, * and then kept in swap cache or failed to remove from page cache. @@ -777,10 +823,10 @@ static int me_huge_page(struct page *p, unsigned long pfn) static struct page_state { unsigned long mask; unsigned long res; - char *msg; + enum action_page_type type; int (*action)(struct page *p, unsigned long pfn); } error_states[] = { - { reserved, reserved, "reserved kernel", me_kernel }, + { reserved, reserved, MSG_KERNEL, me_kernel }, /* * free pages are specially detected outside this table: * PG_buddy pages only make a small fraction of all free pages. @@ -791,31 +837,31 @@ static struct page_state { * currently unused objects without touching them. But just * treat it as standard kernel for now. */ - { slab, slab, "kernel slab", me_kernel }, + { slab, slab, MSG_SLAB, me_kernel }, #ifdef CONFIG_PAGEFLAGS_EXTENDED - { head, head, "huge", me_huge_page }, - { tail, tail, "huge", me_huge_page }, + { head, head, MSG_HUGE, me_huge_page }, + { tail, tail, MSG_HUGE, me_huge_page }, #else - { compound, compound, "huge", me_huge_page }, + { compound, compound, MSG_HUGE, me_huge_page }, #endif - { sc|dirty, sc|dirty, "dirty swapcache", me_swapcache_dirty }, - { sc|dirty, sc, "clean swapcache", me_swapcache_clean }, + { sc|dirty, sc|dirty, MSG_DIRTY_SWAPCACHE, me_swapcache_dirty }, + { sc|dirty, sc, MSG_CLEAN_SWAPCACHE, me_swapcache_clean }, - { mlock|dirty, mlock|dirty, "dirty mlocked LRU", me_pagecache_dirty }, - { mlock|dirty, mlock, "clean mlocked LRU", me_pagecache_clean }, + { mlock|dirty, mlock|dirty, MSG_DIRTY_MLOCKED_LRU, me_pagecache_dirty }, + { mlock|dirty, mlock, MSG_CLEAN_MLOCKED_LRU, me_pagecache_clean }, - { unevict|dirty, unevict|dirty, "dirty unevictable LRU", me_pagecache_dirty }, - { unevict|dirty, unevict, "clean unevictable LRU", me_pagecache_clean }, + { unevict|dirty, unevict|dirty, MSG_DIRTY_UNEVICTABLE_LRU, me_pagecache_dirty }, + { unevict|dirty, unevict, MSG_CLEAN_UNEVICTABLE_LRU, me_pagecache_clean }, - { lru|dirty, lru|dirty, "dirty LRU", me_pagecache_dirty }, - { lru|dirty, lru, "clean LRU", me_pagecache_clean }, + { lru|dirty, lru|dirty, MSG_DIRTY_LRU, me_pagecache_dirty }, + { lru|dirty, lru, MSG_CLEAN_LRU, me_pagecache_clean }, /* * Catchall entry: must be at end. */ - { 0, 0, "unknown page state", me_unknown }, + { 0, 0, MSG_UNKNOWN, me_unknown }, }; #undef dirty @@ -835,10 +881,10 @@ static struct page_state { * "Dirty/Clean" indication is not 100% accurate due to the possibility of * setting PG_dirty outside page lock. See also comment above set_page_dirty(). */ -static void action_result(unsigned long pfn, char *msg, int result) +static void action_result(unsigned long pfn, enum action_page_type type, int result) { - pr_err("MCE %#lx: %s page recovery: %s\n", - pfn, msg, action_name[result]); + pr_err("MCE %#lx: recovery action for %s: %s\n", + pfn, action_page_types[type], action_name[result]); } static int page_action(struct page_state *ps, struct page *p, @@ -854,11 +900,11 @@ static int page_action(struct page_state *ps, struct page *p, count--; if (count != 0) { printk(KERN_ERR - "MCE %#lx: %s page still referenced by %d users\n", - pfn, ps->msg, count); + "MCE %#lx: %s still referenced by %d users\n", + pfn, action_page_types[ps->type], count); result = FAILED; } - action_result(pfn, ps->msg, result); + action_result(pfn, ps->type, result); /* Could do more checks here if page looks ok */ /* @@ -1106,7 +1152,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags) if (!(flags & MF_COUNT_INCREASED) && !get_page_unless_zero(hpage)) { if (is_free_buddy_page(p)) { - action_result(pfn, "free buddy", DELAYED); + action_result(pfn, MSG_BUDDY, DELAYED); return 0; } else if (PageHuge(hpage)) { /* @@ -1123,12 +1169,12 @@ int memory_failure(unsigned long pfn, int trapno, int flags) } set_page_hwpoison_huge_page(hpage); res = dequeue_hwpoisoned_huge_page(hpage); - action_result(pfn, "free huge", + action_result(pfn, MSG_FREE_HUGE, res ? IGNORED : DELAYED); unlock_page(hpage); return res; } else { - action_result(pfn, "high order kernel", IGNORED); + action_result(pfn, MSG_KERNEL_HIGH_ORDER, IGNORED); return -EBUSY; } } @@ -1150,9 +1196,10 @@ int memory_failure(unsigned long pfn, int trapno, int flags) */ if (is_free_buddy_page(p)) { if (flags & MF_COUNT_INCREASED) - action_result(pfn, "free buddy", DELAYED); + action_result(pfn, MSG_BUDDY, DELAYED); else - action_result(pfn, "free buddy, 2nd try", DELAYED); + action_result(pfn, MSG_BUDDY_2ND, + DELAYED); return 0; } } @@ -1165,7 +1212,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags) * If this happens just bail out. */ if (compound_head(p) != hpage) { - action_result(pfn, "different compound page after locking", IGNORED); + action_result(pfn, MSG_DIFFERENT_COMPOUND, IGNORED); res = -EBUSY; goto out; } @@ -1205,8 +1252,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags) * on the head page to show that the hugepage is hwpoisoned */ if (PageHuge(p) && PageTail(p) && TestSetPageHWPoison(hpage)) { - action_result(pfn, "hugepage already hardware poisoned", - IGNORED); + action_result(pfn, MSG_POISONED_HUGE, IGNORED); unlock_page(hpage); put_page(hpage); return 0; @@ -1235,7 +1281,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags) */ if (hwpoison_user_mappings(p, pfn, trapno, flags, &hpage) != SWAP_SUCCESS) { - action_result(pfn, "unmapping failed", IGNORED); + action_result(pfn, MSG_UNMAP_FAILED, IGNORED); res = -EBUSY; goto out; } @@ -1244,7 +1290,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags) * Torn down by someone else? */ if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) { - action_result(pfn, "already truncated LRU", IGNORED); + action_result(pfn, MSG_TRUNCATED_LRU, IGNORED); res = -EBUSY; goto out; } @@ -1540,8 +1586,18 @@ static int soft_offline_huge_page(struct page *page, int flags) } unlock_page(hpage); - /* Keep page count to indicate a given hugepage is isolated. */ - list_move(&hpage->lru, &pagelist); + ret = isolate_huge_page(hpage, &pagelist); + if (ret) { + /* + * get_any_page() and isolate_huge_page() takes a refcount each, + * so need to drop one here. + */ + put_page(hpage); + } else { + pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn); + return -EBUSY; + } + ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, MIGRATE_SYNC, MR_MEMORY_FAILURE); if (ret) { diff --git a/mm/memory.c b/mm/memory.c index ac20b2a..22e037e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -690,12 +690,11 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, /* * Choose text because data symbols depend on CONFIG_KALLSYMS_ALL=y */ - if (vma->vm_ops) - printk(KERN_ALERT "vma->vm_ops->fault: %pSR\n", - vma->vm_ops->fault); - if (vma->vm_file) - printk(KERN_ALERT "vma->vm_file->f_op->mmap: %pSR\n", - vma->vm_file->f_op->mmap); + pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n", + vma->vm_file, + vma->vm_ops ? vma->vm_ops->fault : NULL, + vma->vm_file ? vma->vm_file->f_op->mmap : NULL, + mapping ? mapping->a_ops->readpage : NULL); dump_stack(); add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); } @@ -2181,6 +2180,42 @@ oom: return VM_FAULT_OOM; } +/* + * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED + * mapping + */ +static int wp_pfn_shared(struct mm_struct *mm, + struct vm_area_struct *vma, unsigned long address, + pte_t *page_table, spinlock_t *ptl, pte_t orig_pte, + pmd_t *pmd) +{ + if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { + struct vm_fault vmf = { + .page = NULL, + .pgoff = linear_page_index(vma, address), + .virtual_address = (void __user *)(address & PAGE_MASK), + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE, + }; + int ret; + + pte_unmap_unlock(page_table, ptl); + ret = vma->vm_ops->pfn_mkwrite(vma, &vmf); + if (ret & VM_FAULT_ERROR) + return ret; + page_table = pte_offset_map_lock(mm, pmd, address, &ptl); + /* + * We might have raced with another page fault while we + * released the pte_offset_map_lock. + */ + if (!pte_same(*page_table, orig_pte)) { + pte_unmap_unlock(page_table, ptl); + return 0; + } + } + return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte, + NULL, 0, 0); +} + static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *page_table, pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte, @@ -2259,13 +2294,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, * VM_PFNMAP VMA. * * We should not cow pages in a shared writeable mapping. - * Just mark the pages writable as we can't do any dirty - * accounting on raw pfn maps. + * Just mark the pages writable and/or call ops->pfn_mkwrite. */ if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)) - return wp_page_reuse(mm, vma, address, page_table, ptl, - orig_pte, old_page, 0, 0); + return wp_pfn_shared(mm, vma, address, page_table, ptl, + orig_pte, pmd); pte_unmap_unlock(page_table, ptl); return wp_page_copy(mm, vma, address, page_table, pmd, @@ -2845,7 +2879,7 @@ static void do_fault_around(struct vm_area_struct *vma, unsigned long address, struct vm_fault vmf; int off; - nr_pages = ACCESS_ONCE(fault_around_bytes) >> PAGE_SHIFT; + nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT; mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK; start_addr = max(address & mask, vma->vm_start); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index e2e8014..457bde5 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1373,7 +1373,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end) if (PageLRU(page)) return pfn; if (PageHuge(page)) { - if (is_hugepage_active(page)) + if (page_huge_active(page)) return pfn; else pfn = round_up(pfn + 1, diff --git a/mm/mempool.c b/mm/mempool.c index 949970d..2cc08de 100644 --- a/mm/mempool.c +++ b/mm/mempool.c @@ -6,26 +6,138 @@ * extreme VM load. * * started by Ingo Molnar, Copyright (C) 2001 + * debugging by David Rientjes, Copyright (C) 2015 */ #include <linux/mm.h> #include <linux/slab.h> +#include <linux/highmem.h> +#include <linux/kasan.h> #include <linux/kmemleak.h> #include <linux/export.h> #include <linux/mempool.h> #include <linux/blkdev.h> #include <linux/writeback.h> +#include "slab.h" + +#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB_DEBUG_ON) +static void poison_error(mempool_t *pool, void *element, size_t size, + size_t byte) +{ + const int nr = pool->curr_nr; + const int start = max_t(int, byte - (BITS_PER_LONG / 8), 0); + const int end = min_t(int, byte + (BITS_PER_LONG / 8), size); + int i; + + pr_err("BUG: mempool element poison mismatch\n"); + pr_err("Mempool %p size %zu\n", pool, size); + pr_err(" nr=%d @ %p: %s0x", nr, element, start > 0 ? "... " : ""); + for (i = start; i < end; i++) + pr_cont("%x ", *(u8 *)(element + i)); + pr_cont("%s\n", end < size ? "..." : ""); + dump_stack(); +} + +static void __check_element(mempool_t *pool, void *element, size_t size) +{ + u8 *obj = element; + size_t i; + + for (i = 0; i < size; i++) { + u8 exp = (i < size - 1) ? POISON_FREE : POISON_END; + + if (obj[i] != exp) { + poison_error(pool, element, size, i); + return; + } + } + memset(obj, POISON_INUSE, size); +} + +static void check_element(mempool_t *pool, void *element) +{ + /* Mempools backed by slab allocator */ + if (pool->free == mempool_free_slab || pool->free == mempool_kfree) + __check_element(pool, element, ksize(element)); + + /* Mempools backed by page allocator */ + if (pool->free == mempool_free_pages) { + int order = (int)(long)pool->pool_data; + void *addr = kmap_atomic((struct page *)element); + + __check_element(pool, addr, 1UL << (PAGE_SHIFT + order)); + kunmap_atomic(addr); + } +} + +static void __poison_element(void *element, size_t size) +{ + u8 *obj = element; + + memset(obj, POISON_FREE, size - 1); + obj[size - 1] = POISON_END; +} + +static void poison_element(mempool_t *pool, void *element) +{ + /* Mempools backed by slab allocator */ + if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc) + __poison_element(element, ksize(element)); + + /* Mempools backed by page allocator */ + if (pool->alloc == mempool_alloc_pages) { + int order = (int)(long)pool->pool_data; + void *addr = kmap_atomic((struct page *)element); + + __poison_element(addr, 1UL << (PAGE_SHIFT + order)); + kunmap_atomic(addr); + } +} +#else /* CONFIG_DEBUG_SLAB || CONFIG_SLUB_DEBUG_ON */ +static inline void check_element(mempool_t *pool, void *element) +{ +} +static inline void poison_element(mempool_t *pool, void *element) +{ +} +#endif /* CONFIG_DEBUG_SLAB || CONFIG_SLUB_DEBUG_ON */ + +static void kasan_poison_element(mempool_t *pool, void *element) +{ + if (pool->alloc == mempool_alloc_slab) + kasan_slab_free(pool->pool_data, element); + if (pool->alloc == mempool_kmalloc) + kasan_kfree(element); + if (pool->alloc == mempool_alloc_pages) + kasan_free_pages(element, (unsigned long)pool->pool_data); +} + +static void kasan_unpoison_element(mempool_t *pool, void *element) +{ + if (pool->alloc == mempool_alloc_slab) + kasan_slab_alloc(pool->pool_data, element); + if (pool->alloc == mempool_kmalloc) + kasan_krealloc(element, (size_t)pool->pool_data); + if (pool->alloc == mempool_alloc_pages) + kasan_alloc_pages(element, (unsigned long)pool->pool_data); +} static void add_element(mempool_t *pool, void *element) { BUG_ON(pool->curr_nr >= pool->min_nr); + poison_element(pool, element); + kasan_poison_element(pool, element); pool->elements[pool->curr_nr++] = element; } static void *remove_element(mempool_t *pool) { - BUG_ON(pool->curr_nr <= 0); - return pool->elements[--pool->curr_nr]; + void *element = pool->elements[--pool->curr_nr]; + + BUG_ON(pool->curr_nr < 0); + check_element(pool, element); + kasan_unpoison_element(pool, element); + return element; } /** @@ -334,6 +446,7 @@ EXPORT_SYMBOL(mempool_free); void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data) { struct kmem_cache *mem = pool_data; + VM_BUG_ON(mem->ctor); return kmem_cache_alloc(mem, gfp_mask); } EXPORT_SYMBOL(mempool_alloc_slab); diff --git a/mm/migrate.c b/mm/migrate.c index a65ff72..f53838f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -537,7 +537,8 @@ void migrate_page_copy(struct page *newpage, struct page *page) * Please do not reorder this without considering how mm/ksm.c's * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache(). */ - ClearPageSwapCache(page); + if (PageSwapCache(page)) + ClearPageSwapCache(page); ClearPagePrivate(page); set_page_private(page, 0); @@ -1133,7 +1133,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct * * by another page fault trying to merge _that_. But that's ok: if it * is being set up, that automatically means that it will be a singleton * acceptable for merging, so we can do all of this optimistically. But - * we do that ACCESS_ONCE() to make sure that we never re-load the pointer. + * we do that READ_ONCE() to make sure that we never re-load the pointer. * * IOW: that the "list_is_singular()" test on the anon_vma_chain only * matters for the 'stable anon_vma' case (ie the thing we want to avoid @@ -1147,7 +1147,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct * static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b) { if (anon_vma_compatible(a, b)) { - struct anon_vma *anon_vma = ACCESS_ONCE(old->anon_vma); + struct anon_vma *anon_vma = READ_ONCE(old->anon_vma); if (anon_vma && list_is_singular(&old->anon_vma_chain)) return anon_vma; @@ -1551,11 +1551,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr, /* Clear old maps */ error = -ENOMEM; -munmap_back: - if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) { + while (find_vma_links(mm, addr, addr + len, &prev, &rb_link, + &rb_parent)) { if (do_munmap(mm, addr, len)) return -ENOMEM; - goto munmap_back; } /* @@ -1571,7 +1570,8 @@ munmap_back: /* * Can we just expand an old mapping? */ - vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL); + vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, + NULL); if (vma) goto out; @@ -2100,7 +2100,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns actual_size = size; if (size && (vma->vm_flags & (VM_GROWSUP | VM_GROWSDOWN))) actual_size -= PAGE_SIZE; - if (actual_size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur)) + if (actual_size > READ_ONCE(rlim[RLIMIT_STACK].rlim_cur)) return -ENOMEM; /* mlock limit tests */ @@ -2108,7 +2108,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns unsigned long locked; unsigned long limit; locked = mm->locked_vm + grow; - limit = ACCESS_ONCE(rlim[RLIMIT_MEMLOCK].rlim_cur); + limit = READ_ONCE(rlim[RLIMIT_MEMLOCK].rlim_cur); limit >>= PAGE_SHIFT; if (locked > limit && !capable(CAP_IPC_LOCK)) return -ENOMEM; @@ -2739,11 +2739,10 @@ static unsigned long do_brk(unsigned long addr, unsigned long len) /* * Clear old maps. this also does some error checking for us */ - munmap_back: - if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) { + while (find_vma_links(mm, addr, addr + len, &prev, &rb_link, + &rb_parent)) { if (do_munmap(mm, addr, len)) return -ENOMEM; - goto munmap_back; } /* Check against address space limits *after* clearing old maps... */ diff --git a/mm/mremap.c b/mm/mremap.c index 2dc44b1..034e2d3 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -345,25 +345,25 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr, struct vm_area_struct *vma = find_vma(mm, addr); if (!vma || vma->vm_start > addr) - goto Efault; + return ERR_PTR(-EFAULT); if (is_vm_hugetlb_page(vma)) - goto Einval; + return ERR_PTR(-EINVAL); /* We can't remap across vm area boundaries */ if (old_len > vma->vm_end - addr) - goto Efault; + return ERR_PTR(-EFAULT); /* Need to be careful about a growing mapping */ if (new_len > old_len) { unsigned long pgoff; if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) - goto Efault; + return ERR_PTR(-EFAULT); pgoff = (addr - vma->vm_start) >> PAGE_SHIFT; pgoff += vma->vm_pgoff; if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) - goto Einval; + return ERR_PTR(-EINVAL); } if (vma->vm_flags & VM_LOCKED) { @@ -372,29 +372,20 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr, lock_limit = rlimit(RLIMIT_MEMLOCK); locked += new_len - old_len; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) - goto Eagain; + return ERR_PTR(-EAGAIN); } if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT)) - goto Enomem; + return ERR_PTR(-ENOMEM); if (vma->vm_flags & VM_ACCOUNT) { unsigned long charged = (new_len - old_len) >> PAGE_SHIFT; if (security_vm_enough_memory_mm(mm, charged)) - goto Efault; + return ERR_PTR(-ENOMEM); *p = charged; } return vma; - -Efault: /* very odd choice for most of the cases, but... */ - return ERR_PTR(-EFAULT); -Einval: - return ERR_PTR(-EINVAL); -Enomem: - return ERR_PTR(-ENOMEM); -Eagain: - return ERR_PTR(-EAGAIN); } static unsigned long mremap_to(unsigned long addr, unsigned long old_len, diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 52628c8..2b665da 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -408,7 +408,7 @@ bool oom_killer_disabled __read_mostly; static DECLARE_RWSEM(oom_sem); /** - * mark_tsk_oom_victim - marks the given taks as OOM victim. + * mark_tsk_oom_victim - marks the given task as OOM victim. * @tsk: task to mark * * Has to be called with oom_sem taken for read and never after diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 0372411..5daf556 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2228,7 +2228,8 @@ int set_page_dirty(struct page *page) * it will confuse readahead and make it restart the size rampup * process. But it's a trivial problem. */ - ClearPageReclaim(page); + if (PageReclaim(page)) + ClearPageReclaim(page); #ifdef CONFIG_BLOCK if (!spd) spd = __set_page_dirty_buffers; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1b84950..ebffa0e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1371,7 +1371,7 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) int to_drain, batch; local_irq_save(flags); - batch = ACCESS_ONCE(pcp->batch); + batch = READ_ONCE(pcp->batch); to_drain = min(pcp->count, batch); if (to_drain > 0) { free_pcppages_bulk(zone, to_drain, pcp); @@ -1570,7 +1570,7 @@ void free_hot_cold_page(struct page *page, bool cold) list_add_tail(&page->lru, &pcp->lists[migratetype]); pcp->count++; if (pcp->count >= pcp->high) { - unsigned long batch = ACCESS_ONCE(pcp->batch); + unsigned long batch = READ_ONCE(pcp->batch); free_pcppages_bulk(zone, batch, pcp); pcp->count -= batch; } @@ -6207,7 +6207,7 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags, mask <<= (BITS_PER_LONG - bitidx - 1); flags <<= (BITS_PER_LONG - bitidx - 1); - word = ACCESS_ONCE(bitmap[word_bitidx]); + word = READ_ONCE(bitmap[word_bitidx]); for (;;) { old_word = cmpxchg(&bitmap[word_bitidx], word, (word & ~mask) | flags); if (word == old_word) @@ -456,7 +456,7 @@ struct anon_vma *page_get_anon_vma(struct page *page) unsigned long anon_mapping; rcu_read_lock(); - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping); + anon_mapping = (unsigned long)READ_ONCE(page->mapping); if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) goto out; if (!page_mapped(page)) @@ -500,14 +500,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page) unsigned long anon_mapping; rcu_read_lock(); - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping); + anon_mapping = (unsigned long)READ_ONCE(page->mapping); if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) goto out; if (!page_mapped(page)) goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); - root_anon_vma = ACCESS_ONCE(anon_vma->root); + root_anon_vma = READ_ONCE(anon_vma->root); if (down_read_trylock(&root_anon_vma->rwsem)) { /* * If the page is still mapped, then this anon_vma is still @@ -4277,7 +4277,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s, int node; struct page *page; - page = ACCESS_ONCE(c->page); + page = READ_ONCE(c->page); if (!page) continue; @@ -4292,7 +4292,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s, total += x; nodes[node] += x; - page = ACCESS_ONCE(c->partial); + page = READ_ONCE(c->partial); if (page) { node = page_to_nid(page); if (flags & SO_TOTAL) @@ -31,6 +31,7 @@ #include <linux/memcontrol.h> #include <linux/gfp.h> #include <linux/uio.h> +#include <linux/hugetlb.h> #include "internal.h" @@ -42,7 +43,7 @@ int page_cluster; static DEFINE_PER_CPU(struct pagevec, lru_add_pvec); static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs); -static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs); +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs); /* * This path almost never happens for VM activity - pages are normally @@ -75,7 +76,14 @@ static void __put_compound_page(struct page *page) { compound_page_dtor *dtor; - __page_cache_release(page); + /* + * __page_cache_release() is supposed to be called for thp, not for + * hugetlb. This is because hugetlb page does never have PageLRU set + * (it's never listed to any LRU lists) and no memcg routines should + * be called for hugetlb (it has a separate hugetlb_cgroup.) + */ + if (!PageHuge(page)) + __page_cache_release(page); dtor = get_compound_page_dtor(page); (*dtor)(page); } @@ -743,7 +751,7 @@ void lru_cache_add_active_or_unevictable(struct page *page, * be write it out by flusher threads as this is much more effective * than the single-page writeout from reclaim. */ -static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec, +static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec, void *arg) { int lru, file; @@ -811,36 +819,36 @@ void lru_add_drain_cpu(int cpu) local_irq_restore(flags); } - pvec = &per_cpu(lru_deactivate_pvecs, cpu); + pvec = &per_cpu(lru_deactivate_file_pvecs, cpu); if (pagevec_count(pvec)) - pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); + pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL); activate_page_drain(cpu); } /** - * deactivate_page - forcefully deactivate a page + * deactivate_file_page - forcefully deactivate a file page * @page: page to deactivate * * This function hints the VM that @page is a good reclaim candidate, * for example if its invalidation fails due to the page being dirty * or under writeback. */ -void deactivate_page(struct page *page) +void deactivate_file_page(struct page *page) { /* - * In a workload with many unevictable page such as mprotect, unevictable - * page deactivation for accelerating reclaim is pointless. + * In a workload with many unevictable page such as mprotect, + * unevictable page deactivation for accelerating reclaim is pointless. */ if (PageUnevictable(page)) return; if (likely(get_page_unless_zero(page))) { - struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs); + struct pagevec *pvec = &get_cpu_var(lru_deactivate_file_pvecs); if (!pagevec_add(pvec, page)) - pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); - put_cpu_var(lru_deactivate_pvecs); + pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL); + put_cpu_var(lru_deactivate_file_pvecs); } } @@ -872,7 +880,7 @@ void lru_add_drain_all(void) if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) || pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) || - pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) || + pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) || need_activate_page_drain(cpu)) { INIT_WORK(work, lru_add_drain_per_cpu); schedule_work_on(cpu, work); diff --git a/mm/swap_state.c b/mm/swap_state.c index 405923f..8bc8e66 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -390,7 +390,7 @@ static unsigned long swapin_nr_pages(unsigned long offset) unsigned int pages, max_pages, last_ra; static atomic_t last_readahead_pages; - max_pages = 1 << ACCESS_ONCE(page_cluster); + max_pages = 1 << READ_ONCE(page_cluster); if (max_pages <= 1) return 1; diff --git a/mm/swapfile.c b/mm/swapfile.c index 63f55cc..a7e7210 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1312,7 +1312,7 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si, else continue; } - count = ACCESS_ONCE(si->swap_map[i]); + count = READ_ONCE(si->swap_map[i]); if (count && swap_count(count) != SWAP_MAP_BAD) break; } diff --git a/mm/truncate.c b/mm/truncate.c index 7a9d8a3..66af903 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -490,7 +490,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping, * of interest and try to speed up its reclaim. */ if (!ret) - deactivate_page(page); + deactivate_file_page(page); count += ret; } pagevec_remove_exceptionals(&pvec); @@ -325,9 +325,37 @@ void kvfree(const void *addr) } EXPORT_SYMBOL(kvfree); +static inline void *__page_rmapping(struct page *page) +{ + unsigned long mapping; + + mapping = (unsigned long)page->mapping; + mapping &= ~PAGE_MAPPING_FLAGS; + + return (void *)mapping; +} + +/* Neutral page->mapping pointer to address_space or anon_vma or other */ +void *page_rmapping(struct page *page) +{ + page = compound_head(page); + return __page_rmapping(page); +} + +struct anon_vma *page_anon_vma(struct page *page) +{ + unsigned long mapping; + + page = compound_head(page); + mapping = (unsigned long)page->mapping; + if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) + return NULL; + return __page_rmapping(page); +} + struct address_space *page_mapping(struct page *page) { - struct address_space *mapping = page->mapping; + unsigned long mapping; /* This happens if someone calls flush_dcache_page on slab page */ if (unlikely(PageSlab(page))) @@ -337,10 +365,13 @@ struct address_space *page_mapping(struct page *page) swp_entry_t entry; entry.val = page_private(page); - mapping = swap_address_space(entry); - } else if ((unsigned long)mapping & PAGE_MAPPING_ANON) - mapping = NULL; - return mapping; + return swap_address_space(entry); + } + + mapping = (unsigned long)page->mapping; + if (mapping & PAGE_MAPPING_FLAGS) + return NULL; + return page->mapping; } int overcommit_ratio_handler(struct ctl_table *table, int write, diff --git a/mm/vmalloc.c b/mm/vmalloc.c index a5bbdd3..2faaa29 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -765,7 +765,7 @@ struct vmap_block { spinlock_t lock; struct vmap_area *va; unsigned long free, dirty; - DECLARE_BITMAP(dirty_map, VMAP_BBMAP_BITS); + unsigned long dirty_min, dirty_max; /*< dirty range */ struct list_head free_list; struct rcu_head rcu_head; struct list_head purge; @@ -796,13 +796,31 @@ static unsigned long addr_to_vb_idx(unsigned long addr) return addr; } -static struct vmap_block *new_vmap_block(gfp_t gfp_mask) +static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off) +{ + unsigned long addr; + + addr = va_start + (pages_off << PAGE_SHIFT); + BUG_ON(addr_to_vb_idx(addr) != addr_to_vb_idx(va_start)); + return (void *)addr; +} + +/** + * new_vmap_block - allocates new vmap_block and occupies 2^order pages in this + * block. Of course pages number can't exceed VMAP_BBMAP_BITS + * @order: how many 2^order pages should be occupied in newly allocated block + * @gfp_mask: flags for the page level allocator + * + * Returns: virtual address in a newly allocated block or ERR_PTR(-errno) + */ +static void *new_vmap_block(unsigned int order, gfp_t gfp_mask) { struct vmap_block_queue *vbq; struct vmap_block *vb; struct vmap_area *va; unsigned long vb_idx; int node, err; + void *vaddr; node = numa_node_id(); @@ -826,11 +844,15 @@ static struct vmap_block *new_vmap_block(gfp_t gfp_mask) return ERR_PTR(err); } + vaddr = vmap_block_vaddr(va->va_start, 0); spin_lock_init(&vb->lock); vb->va = va; - vb->free = VMAP_BBMAP_BITS; + /* At least something should be left free */ + BUG_ON(VMAP_BBMAP_BITS <= (1UL << order)); + vb->free = VMAP_BBMAP_BITS - (1UL << order); vb->dirty = 0; - bitmap_zero(vb->dirty_map, VMAP_BBMAP_BITS); + vb->dirty_min = VMAP_BBMAP_BITS; + vb->dirty_max = 0; INIT_LIST_HEAD(&vb->free_list); vb_idx = addr_to_vb_idx(va->va_start); @@ -842,11 +864,11 @@ static struct vmap_block *new_vmap_block(gfp_t gfp_mask) vbq = &get_cpu_var(vmap_block_queue); spin_lock(&vbq->lock); - list_add_rcu(&vb->free_list, &vbq->free); + list_add_tail_rcu(&vb->free_list, &vbq->free); spin_unlock(&vbq->lock); put_cpu_var(vmap_block_queue); - return vb; + return vaddr; } static void free_vmap_block(struct vmap_block *vb) @@ -881,7 +903,8 @@ static void purge_fragmented_blocks(int cpu) if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) { vb->free = 0; /* prevent further allocs after releasing lock */ vb->dirty = VMAP_BBMAP_BITS; /* prevent purging it again */ - bitmap_fill(vb->dirty_map, VMAP_BBMAP_BITS); + vb->dirty_min = 0; + vb->dirty_max = VMAP_BBMAP_BITS; spin_lock(&vbq->lock); list_del_rcu(&vb->free_list); spin_unlock(&vbq->lock); @@ -910,7 +933,7 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask) { struct vmap_block_queue *vbq; struct vmap_block *vb; - unsigned long addr = 0; + void *vaddr = NULL; unsigned int order; BUG_ON(size & ~PAGE_MASK); @@ -925,43 +948,38 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask) } order = get_order(size); -again: rcu_read_lock(); vbq = &get_cpu_var(vmap_block_queue); list_for_each_entry_rcu(vb, &vbq->free, free_list) { - int i; + unsigned long pages_off; spin_lock(&vb->lock); - if (vb->free < 1UL << order) - goto next; + if (vb->free < (1UL << order)) { + spin_unlock(&vb->lock); + continue; + } - i = VMAP_BBMAP_BITS - vb->free; - addr = vb->va->va_start + (i << PAGE_SHIFT); - BUG_ON(addr_to_vb_idx(addr) != - addr_to_vb_idx(vb->va->va_start)); + pages_off = VMAP_BBMAP_BITS - vb->free; + vaddr = vmap_block_vaddr(vb->va->va_start, pages_off); vb->free -= 1UL << order; if (vb->free == 0) { spin_lock(&vbq->lock); list_del_rcu(&vb->free_list); spin_unlock(&vbq->lock); } + spin_unlock(&vb->lock); break; -next: - spin_unlock(&vb->lock); } put_cpu_var(vmap_block_queue); rcu_read_unlock(); - if (!addr) { - vb = new_vmap_block(gfp_mask); - if (IS_ERR(vb)) - return vb; - goto again; - } + /* Allocate new block if nothing was found */ + if (!vaddr) + vaddr = new_vmap_block(order, gfp_mask); - return (void *)addr; + return vaddr; } static void vb_free(const void *addr, unsigned long size) @@ -979,6 +997,7 @@ static void vb_free(const void *addr, unsigned long size) order = get_order(size); offset = (unsigned long)addr & (VMAP_BLOCK_SIZE - 1); + offset >>= PAGE_SHIFT; vb_idx = addr_to_vb_idx((unsigned long)addr); rcu_read_lock(); @@ -989,7 +1008,10 @@ static void vb_free(const void *addr, unsigned long size) vunmap_page_range((unsigned long)addr, (unsigned long)addr + size); spin_lock(&vb->lock); - BUG_ON(bitmap_allocate_region(vb->dirty_map, offset >> PAGE_SHIFT, order)); + + /* Expand dirty range */ + vb->dirty_min = min(vb->dirty_min, offset); + vb->dirty_max = max(vb->dirty_max, offset + (1UL << order)); vb->dirty += 1UL << order; if (vb->dirty == VMAP_BBMAP_BITS) { @@ -1028,25 +1050,18 @@ void vm_unmap_aliases(void) rcu_read_lock(); list_for_each_entry_rcu(vb, &vbq->free, free_list) { - int i, j; - spin_lock(&vb->lock); - i = find_first_bit(vb->dirty_map, VMAP_BBMAP_BITS); - if (i < VMAP_BBMAP_BITS) { + if (vb->dirty) { + unsigned long va_start = vb->va->va_start; unsigned long s, e; - j = find_last_bit(vb->dirty_map, - VMAP_BBMAP_BITS); - j = j + 1; /* need exclusive index */ + s = va_start + (vb->dirty_min << PAGE_SHIFT); + e = va_start + (vb->dirty_max << PAGE_SHIFT); - s = vb->va->va_start + (i << PAGE_SHIFT); - e = vb->va->va_start + (j << PAGE_SHIFT); - flush = 1; + start = min(s, start); + end = max(e, end); - if (s < start) - start = s; - if (e > end) - end = e; + flush = 1; } spin_unlock(&vb->lock); } diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 0dec1fa..08bd7a3 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -12,35 +12,6 @@ */ /* - * This allocator is designed for use with zram. Thus, the allocator is - * supposed to work well under low memory conditions. In particular, it - * never attempts higher order page allocation which is very likely to - * fail under memory pressure. On the other hand, if we just use single - * (0-order) pages, it would suffer from very high fragmentation -- - * any object of size PAGE_SIZE/2 or larger would occupy an entire page. - * This was one of the major issues with its predecessor (xvmalloc). - * - * To overcome these issues, zsmalloc allocates a bunch of 0-order pages - * and links them together using various 'struct page' fields. These linked - * pages act as a single higher-order page i.e. an object can span 0-order - * page boundaries. The code refers to these linked pages as a single entity - * called zspage. - * - * For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE - * since this satisfies the requirements of all its current users (in the - * worst case, page is incompressible and is thus stored "as-is" i.e. in - * uncompressed form). For allocation requests larger than this size, failure - * is returned (see zs_malloc). - * - * Additionally, zs_malloc() does not return a dereferenceable pointer. - * Instead, it returns an opaque handle (unsigned long) which encodes actual - * location of the allocated object. The reason for this indirection is that - * zsmalloc does not keep zspages permanently mapped since that would cause - * issues on 32-bit systems where the VA region for kernel space mappings - * is very small. So, before using the allocating memory, the object has to - * be mapped using zs_map_object() to get a usable pointer and subsequently - * unmapped using zs_unmap_object(). - * * Following is how we use various fields and flags of underlying * struct page(s) to form a zspage. * @@ -57,6 +28,8 @@ * * page->private (union with page->first_page): refers to the * component page after the first page + * If the page is first_page for huge object, it stores handle. + * Look at size_class->huge. * page->freelist: points to the first free object in zspage. * Free objects are linked together using in-place * metadata. @@ -78,6 +51,7 @@ #include <linux/module.h> #include <linux/kernel.h> +#include <linux/sched.h> #include <linux/bitops.h> #include <linux/errno.h> #include <linux/highmem.h> @@ -110,6 +84,8 @@ #define ZS_MAX_ZSPAGE_ORDER 2 #define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER) +#define ZS_HANDLE_SIZE (sizeof(unsigned long)) + /* * Object location (<PFN>, <obj_idx>) is encoded as * as single (unsigned long) handle value. @@ -133,13 +109,33 @@ #endif #endif #define _PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT) -#define OBJ_INDEX_BITS (BITS_PER_LONG - _PFN_BITS) + +/* + * Memory for allocating for handle keeps object position by + * encoding <page, obj_idx> and the encoded value has a room + * in least bit(ie, look at obj_to_location). + * We use the bit to synchronize between object access by + * user and migration. + */ +#define HANDLE_PIN_BIT 0 + +/* + * Head in allocated object should have OBJ_ALLOCATED_TAG + * to identify the object was allocated or not. + * It's okay to add the status bit in the least bit because + * header keeps handle which is 4byte-aligned address so we + * have room for two bit at least. + */ +#define OBJ_ALLOCATED_TAG 1 +#define OBJ_TAG_BITS 1 +#define OBJ_INDEX_BITS (BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS) #define OBJ_INDEX_MASK ((_AC(1, UL) << OBJ_INDEX_BITS) - 1) #define MAX(a, b) ((a) >= (b) ? (a) : (b)) /* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */ #define ZS_MIN_ALLOC_SIZE \ MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS)) +/* each chunk includes extra space to keep handle */ #define ZS_MAX_ALLOC_SIZE PAGE_SIZE /* @@ -172,6 +168,8 @@ enum fullness_group { enum zs_stat_type { OBJ_ALLOCATED, OBJ_USED, + CLASS_ALMOST_FULL, + CLASS_ALMOST_EMPTY, NR_ZS_STAT_TYPE, }; @@ -216,6 +214,8 @@ struct size_class { /* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */ int pages_per_zspage; + /* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */ + bool huge; #ifdef CONFIG_ZSMALLOC_STAT struct zs_size_stat stats; @@ -233,14 +233,24 @@ struct size_class { * This must be power of 2 and less than or equal to ZS_ALIGN */ struct link_free { - /* Handle of next free chunk (encodes <PFN, obj_idx>) */ - void *next; + union { + /* + * Position of next free chunk (encodes <PFN, obj_idx>) + * It's valid for non-allocated object + */ + void *next; + /* + * Handle of allocated object. + */ + unsigned long handle; + }; }; struct zs_pool { char *name; struct size_class **size_class; + struct kmem_cache *handle_cachep; gfp_t flags; /* allocation flags used when growing pool */ atomic_long_t pages_allocated; @@ -267,8 +277,37 @@ struct mapping_area { #endif char *vm_addr; /* address of kmap_atomic()'ed pages */ enum zs_mapmode vm_mm; /* mapping mode */ + bool huge; }; +static int create_handle_cache(struct zs_pool *pool) +{ + pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE, + 0, 0, NULL); + return pool->handle_cachep ? 0 : 1; +} + +static void destroy_handle_cache(struct zs_pool *pool) +{ + kmem_cache_destroy(pool->handle_cachep); +} + +static unsigned long alloc_handle(struct zs_pool *pool) +{ + return (unsigned long)kmem_cache_alloc(pool->handle_cachep, + pool->flags & ~__GFP_HIGHMEM); +} + +static void free_handle(struct zs_pool *pool, unsigned long handle) +{ + kmem_cache_free(pool->handle_cachep, (void *)handle); +} + +static void record_obj(unsigned long handle, unsigned long obj) +{ + *(unsigned long *)handle = obj; +} + /* zpool driver */ #ifdef CONFIG_ZPOOL @@ -346,6 +385,11 @@ static struct zpool_driver zs_zpool_driver = { MODULE_ALIAS("zpool-zsmalloc"); #endif /* CONFIG_ZPOOL */ +static unsigned int get_maxobj_per_zspage(int size, int pages_per_zspage) +{ + return pages_per_zspage * PAGE_SIZE / size; +} + /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */ static DEFINE_PER_CPU(struct mapping_area, zs_map_area); @@ -396,9 +440,182 @@ static int get_size_class_index(int size) idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE, ZS_SIZE_CLASS_DELTA); - return idx; + return min(zs_size_classes - 1, idx); +} + +#ifdef CONFIG_ZSMALLOC_STAT + +static inline void zs_stat_inc(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ + class->stats.objs[type] += cnt; +} + +static inline void zs_stat_dec(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ + class->stats.objs[type] -= cnt; +} + +static inline unsigned long zs_stat_get(struct size_class *class, + enum zs_stat_type type) +{ + return class->stats.objs[type]; +} + +static int __init zs_stat_init(void) +{ + if (!debugfs_initialized()) + return -ENODEV; + + zs_stat_root = debugfs_create_dir("zsmalloc", NULL); + if (!zs_stat_root) + return -ENOMEM; + + return 0; +} + +static void __exit zs_stat_exit(void) +{ + debugfs_remove_recursive(zs_stat_root); +} + +static int zs_stats_size_show(struct seq_file *s, void *v) +{ + int i; + struct zs_pool *pool = s->private; + struct size_class *class; + int objs_per_zspage; + unsigned long class_almost_full, class_almost_empty; + unsigned long obj_allocated, obj_used, pages_used; + unsigned long total_class_almost_full = 0, total_class_almost_empty = 0; + unsigned long total_objs = 0, total_used_objs = 0, total_pages = 0; + + seq_printf(s, " %5s %5s %11s %12s %13s %10s %10s %16s\n", + "class", "size", "almost_full", "almost_empty", + "obj_allocated", "obj_used", "pages_used", + "pages_per_zspage"); + + for (i = 0; i < zs_size_classes; i++) { + class = pool->size_class[i]; + + if (class->index != i) + continue; + + spin_lock(&class->lock); + class_almost_full = zs_stat_get(class, CLASS_ALMOST_FULL); + class_almost_empty = zs_stat_get(class, CLASS_ALMOST_EMPTY); + obj_allocated = zs_stat_get(class, OBJ_ALLOCATED); + obj_used = zs_stat_get(class, OBJ_USED); + spin_unlock(&class->lock); + + objs_per_zspage = get_maxobj_per_zspage(class->size, + class->pages_per_zspage); + pages_used = obj_allocated / objs_per_zspage * + class->pages_per_zspage; + + seq_printf(s, " %5u %5u %11lu %12lu %13lu %10lu %10lu %16d\n", + i, class->size, class_almost_full, class_almost_empty, + obj_allocated, obj_used, pages_used, + class->pages_per_zspage); + + total_class_almost_full += class_almost_full; + total_class_almost_empty += class_almost_empty; + total_objs += obj_allocated; + total_used_objs += obj_used; + total_pages += pages_used; + } + + seq_puts(s, "\n"); + seq_printf(s, " %5s %5s %11lu %12lu %13lu %10lu %10lu\n", + "Total", "", total_class_almost_full, + total_class_almost_empty, total_objs, + total_used_objs, total_pages); + + return 0; +} + +static int zs_stats_size_open(struct inode *inode, struct file *file) +{ + return single_open(file, zs_stats_size_show, inode->i_private); +} + +static const struct file_operations zs_stat_size_ops = { + .open = zs_stats_size_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static int zs_pool_stat_create(char *name, struct zs_pool *pool) +{ + struct dentry *entry; + + if (!zs_stat_root) + return -ENODEV; + + entry = debugfs_create_dir(name, zs_stat_root); + if (!entry) { + pr_warn("debugfs dir <%s> creation failed\n", name); + return -ENOMEM; + } + pool->stat_dentry = entry; + + entry = debugfs_create_file("classes", S_IFREG | S_IRUGO, + pool->stat_dentry, pool, &zs_stat_size_ops); + if (!entry) { + pr_warn("%s: debugfs file entry <%s> creation failed\n", + name, "classes"); + return -ENOMEM; + } + + return 0; +} + +static void zs_pool_stat_destroy(struct zs_pool *pool) +{ + debugfs_remove_recursive(pool->stat_dentry); +} + +#else /* CONFIG_ZSMALLOC_STAT */ + +static inline void zs_stat_inc(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ +} + +static inline void zs_stat_dec(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ +} + +static inline unsigned long zs_stat_get(struct size_class *class, + enum zs_stat_type type) +{ + return 0; +} + +static int __init zs_stat_init(void) +{ + return 0; +} + +static void __exit zs_stat_exit(void) +{ +} + +static inline int zs_pool_stat_create(char *name, struct zs_pool *pool) +{ + return 0; +} + +static inline void zs_pool_stat_destroy(struct zs_pool *pool) +{ } +#endif + + /* * For each size class, zspages are divided into different groups * depending on how "full" they are. This was done so that we could @@ -419,7 +636,7 @@ static enum fullness_group get_fullness_group(struct page *page) fg = ZS_EMPTY; else if (inuse == max_objects) fg = ZS_FULL; - else if (inuse <= max_objects / fullness_threshold_frac) + else if (inuse <= 3 * max_objects / fullness_threshold_frac) fg = ZS_ALMOST_EMPTY; else fg = ZS_ALMOST_FULL; @@ -448,6 +665,8 @@ static void insert_zspage(struct page *page, struct size_class *class, list_add_tail(&page->lru, &(*head)->lru); *head = page; + zs_stat_inc(class, fullness == ZS_ALMOST_EMPTY ? + CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1); } /* @@ -473,6 +692,8 @@ static void remove_zspage(struct page *page, struct size_class *class, struct page, lru); list_del_init(&page->lru); + zs_stat_dec(class, fullness == ZS_ALMOST_EMPTY ? + CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1); } /* @@ -484,11 +705,10 @@ static void remove_zspage(struct page *page, struct size_class *class, * page from the freelist of the old fullness group to that of the new * fullness group. */ -static enum fullness_group fix_fullness_group(struct zs_pool *pool, +static enum fullness_group fix_fullness_group(struct size_class *class, struct page *page) { int class_idx; - struct size_class *class; enum fullness_group currfg, newfg; BUG_ON(!is_first_page(page)); @@ -498,7 +718,6 @@ static enum fullness_group fix_fullness_group(struct zs_pool *pool, if (newfg == currfg) goto out; - class = pool->size_class[class_idx]; remove_zspage(page, class, currfg); insert_zspage(page, class, newfg); set_zspage_mapping(page, class_idx, newfg); @@ -512,7 +731,8 @@ out: * to form a zspage for each size class. This is important * to reduce wastage due to unusable space left at end of * each zspage which is given as: - * wastage = Zp - Zp % size_class + * wastage = Zp % class_size + * usage = Zp - wastage * where Zp = zspage size = k * PAGE_SIZE where k = 1, 2, ... * * For example, for size class of 3/8 * PAGE_SIZE, we should @@ -571,35 +791,50 @@ static struct page *get_next_page(struct page *page) /* * Encode <page, obj_idx> as a single handle value. - * On hardware platforms with physical memory starting at 0x0 the pfn - * could be 0 so we ensure that the handle will never be 0 by adjusting the - * encoded obj_idx value before encoding. + * We use the least bit of handle for tagging. */ -static void *obj_location_to_handle(struct page *page, unsigned long obj_idx) +static void *location_to_obj(struct page *page, unsigned long obj_idx) { - unsigned long handle; + unsigned long obj; if (!page) { BUG_ON(obj_idx); return NULL; } - handle = page_to_pfn(page) << OBJ_INDEX_BITS; - handle |= ((obj_idx + 1) & OBJ_INDEX_MASK); + obj = page_to_pfn(page) << OBJ_INDEX_BITS; + obj |= ((obj_idx) & OBJ_INDEX_MASK); + obj <<= OBJ_TAG_BITS; - return (void *)handle; + return (void *)obj; } /* * Decode <page, obj_idx> pair from the given object handle. We adjust the * decoded obj_idx back to its original value since it was adjusted in - * obj_location_to_handle(). + * location_to_obj(). */ -static void obj_handle_to_location(unsigned long handle, struct page **page, +static void obj_to_location(unsigned long obj, struct page **page, unsigned long *obj_idx) { - *page = pfn_to_page(handle >> OBJ_INDEX_BITS); - *obj_idx = (handle & OBJ_INDEX_MASK) - 1; + obj >>= OBJ_TAG_BITS; + *page = pfn_to_page(obj >> OBJ_INDEX_BITS); + *obj_idx = (obj & OBJ_INDEX_MASK); +} + +static unsigned long handle_to_obj(unsigned long handle) +{ + return *(unsigned long *)handle; +} + +static unsigned long obj_to_head(struct size_class *class, struct page *page, + void *obj) +{ + if (class->huge) { + VM_BUG_ON(!is_first_page(page)); + return *(unsigned long *)page_private(page); + } else + return *(unsigned long *)obj; } static unsigned long obj_idx_to_offset(struct page *page, @@ -613,6 +848,25 @@ static unsigned long obj_idx_to_offset(struct page *page, return off + obj_idx * class_size; } +static inline int trypin_tag(unsigned long handle) +{ + unsigned long *ptr = (unsigned long *)handle; + + return !test_and_set_bit_lock(HANDLE_PIN_BIT, ptr); +} + +static void pin_tag(unsigned long handle) +{ + while (!trypin_tag(handle)); +} + +static void unpin_tag(unsigned long handle) +{ + unsigned long *ptr = (unsigned long *)handle; + + clear_bit_unlock(HANDLE_PIN_BIT, ptr); +} + static void reset_page(struct page *page) { clear_bit(PG_private, &page->flags); @@ -674,7 +928,7 @@ static void init_zspage(struct page *first_page, struct size_class *class) link = (struct link_free *)vaddr + off / sizeof(*link); while ((off += class->size) < PAGE_SIZE) { - link->next = obj_location_to_handle(page, i++); + link->next = location_to_obj(page, i++); link += class->size / sizeof(*link); } @@ -684,7 +938,7 @@ static void init_zspage(struct page *first_page, struct size_class *class) * page (if present) */ next_page = get_next_page(page); - link->next = obj_location_to_handle(next_page, 0); + link->next = location_to_obj(next_page, 0); kunmap_atomic(vaddr); page = next_page; off %= PAGE_SIZE; @@ -738,7 +992,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags) init_zspage(first_page, class); - first_page->freelist = obj_location_to_handle(first_page, 0); + first_page->freelist = location_to_obj(first_page, 0); /* Maximum number of objects we can store in this zspage */ first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size; @@ -860,12 +1114,19 @@ static void __zs_unmap_object(struct mapping_area *area, { int sizes[2]; void *addr; - char *buf = area->vm_buf; + char *buf; /* no write fastpath */ if (area->vm_mm == ZS_MM_RO) goto out; + buf = area->vm_buf; + if (!area->huge) { + buf = buf + ZS_HANDLE_SIZE; + size -= ZS_HANDLE_SIZE; + off += ZS_HANDLE_SIZE; + } + sizes[0] = PAGE_SIZE - off; sizes[1] = size - sizes[0]; @@ -952,11 +1213,6 @@ static void init_zs_size_classes(void) zs_size_classes = nr; } -static unsigned int get_maxobj_per_zspage(int size, int pages_per_zspage) -{ - return pages_per_zspage * PAGE_SIZE / size; -} - static bool can_merge(struct size_class *prev, int size, int pages_per_zspage) { if (prev->pages_per_zspage != pages_per_zspage) @@ -969,166 +1225,13 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage) return true; } -#ifdef CONFIG_ZSMALLOC_STAT - -static inline void zs_stat_inc(struct size_class *class, - enum zs_stat_type type, unsigned long cnt) -{ - class->stats.objs[type] += cnt; -} - -static inline void zs_stat_dec(struct size_class *class, - enum zs_stat_type type, unsigned long cnt) -{ - class->stats.objs[type] -= cnt; -} - -static inline unsigned long zs_stat_get(struct size_class *class, - enum zs_stat_type type) -{ - return class->stats.objs[type]; -} - -static int __init zs_stat_init(void) -{ - if (!debugfs_initialized()) - return -ENODEV; - - zs_stat_root = debugfs_create_dir("zsmalloc", NULL); - if (!zs_stat_root) - return -ENOMEM; - - return 0; -} - -static void __exit zs_stat_exit(void) -{ - debugfs_remove_recursive(zs_stat_root); -} - -static int zs_stats_size_show(struct seq_file *s, void *v) +static bool zspage_full(struct page *page) { - int i; - struct zs_pool *pool = s->private; - struct size_class *class; - int objs_per_zspage; - unsigned long obj_allocated, obj_used, pages_used; - unsigned long total_objs = 0, total_used_objs = 0, total_pages = 0; - - seq_printf(s, " %5s %5s %13s %10s %10s\n", "class", "size", - "obj_allocated", "obj_used", "pages_used"); - - for (i = 0; i < zs_size_classes; i++) { - class = pool->size_class[i]; - - if (class->index != i) - continue; - - spin_lock(&class->lock); - obj_allocated = zs_stat_get(class, OBJ_ALLOCATED); - obj_used = zs_stat_get(class, OBJ_USED); - spin_unlock(&class->lock); - - objs_per_zspage = get_maxobj_per_zspage(class->size, - class->pages_per_zspage); - pages_used = obj_allocated / objs_per_zspage * - class->pages_per_zspage; - - seq_printf(s, " %5u %5u %10lu %10lu %10lu\n", i, - class->size, obj_allocated, obj_used, pages_used); - - total_objs += obj_allocated; - total_used_objs += obj_used; - total_pages += pages_used; - } - - seq_puts(s, "\n"); - seq_printf(s, " %5s %5s %10lu %10lu %10lu\n", "Total", "", - total_objs, total_used_objs, total_pages); - - return 0; -} - -static int zs_stats_size_open(struct inode *inode, struct file *file) -{ - return single_open(file, zs_stats_size_show, inode->i_private); -} - -static const struct file_operations zs_stat_size_ops = { - .open = zs_stats_size_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; - -static int zs_pool_stat_create(char *name, struct zs_pool *pool) -{ - struct dentry *entry; - - if (!zs_stat_root) - return -ENODEV; - - entry = debugfs_create_dir(name, zs_stat_root); - if (!entry) { - pr_warn("debugfs dir <%s> creation failed\n", name); - return -ENOMEM; - } - pool->stat_dentry = entry; - - entry = debugfs_create_file("obj_in_classes", S_IFREG | S_IRUGO, - pool->stat_dentry, pool, &zs_stat_size_ops); - if (!entry) { - pr_warn("%s: debugfs file entry <%s> creation failed\n", - name, "obj_in_classes"); - return -ENOMEM; - } - - return 0; -} - -static void zs_pool_stat_destroy(struct zs_pool *pool) -{ - debugfs_remove_recursive(pool->stat_dentry); -} - -#else /* CONFIG_ZSMALLOC_STAT */ - -static inline void zs_stat_inc(struct size_class *class, - enum zs_stat_type type, unsigned long cnt) -{ -} - -static inline void zs_stat_dec(struct size_class *class, - enum zs_stat_type type, unsigned long cnt) -{ -} - -static inline unsigned long zs_stat_get(struct size_class *class, - enum zs_stat_type type) -{ - return 0; -} - -static int __init zs_stat_init(void) -{ - return 0; -} - -static void __exit zs_stat_exit(void) -{ -} - -static inline int zs_pool_stat_create(char *name, struct zs_pool *pool) -{ - return 0; -} + BUG_ON(!is_first_page(page)); -static inline void zs_pool_stat_destroy(struct zs_pool *pool) -{ + return page->inuse == page->objects; } -#endif - unsigned long zs_get_total_pages(struct zs_pool *pool) { return atomic_long_read(&pool->pages_allocated); @@ -1153,13 +1256,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, enum zs_mapmode mm) { struct page *page; - unsigned long obj_idx, off; + unsigned long obj, obj_idx, off; unsigned int class_idx; enum fullness_group fg; struct size_class *class; struct mapping_area *area; struct page *pages[2]; + void *ret; BUG_ON(!handle); @@ -1170,7 +1274,11 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, */ BUG_ON(in_interrupt()); - obj_handle_to_location(handle, &page, &obj_idx); + /* From now on, migration cannot move the object */ + pin_tag(handle); + + obj = handle_to_obj(handle); + obj_to_location(obj, &page, &obj_idx); get_zspage_mapping(get_first_page(page), &class_idx, &fg); class = pool->size_class[class_idx]; off = obj_idx_to_offset(page, obj_idx, class->size); @@ -1180,7 +1288,8 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, if (off + class->size <= PAGE_SIZE) { /* this object is contained entirely within a page */ area->vm_addr = kmap_atomic(page); - return area->vm_addr + off; + ret = area->vm_addr + off; + goto out; } /* this object spans two pages */ @@ -1188,14 +1297,19 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, pages[1] = get_next_page(page); BUG_ON(!pages[1]); - return __zs_map_object(area, pages, off, class->size); + ret = __zs_map_object(area, pages, off, class->size); +out: + if (!class->huge) + ret += ZS_HANDLE_SIZE; + + return ret; } EXPORT_SYMBOL_GPL(zs_map_object); void zs_unmap_object(struct zs_pool *pool, unsigned long handle) { struct page *page; - unsigned long obj_idx, off; + unsigned long obj, obj_idx, off; unsigned int class_idx; enum fullness_group fg; @@ -1204,7 +1318,8 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle) BUG_ON(!handle); - obj_handle_to_location(handle, &page, &obj_idx); + obj = handle_to_obj(handle); + obj_to_location(obj, &page, &obj_idx); get_zspage_mapping(get_first_page(page), &class_idx, &fg); class = pool->size_class[class_idx]; off = obj_idx_to_offset(page, obj_idx, class->size); @@ -1222,9 +1337,42 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle) __zs_unmap_object(area, pages, off, class->size); } put_cpu_var(zs_map_area); + unpin_tag(handle); } EXPORT_SYMBOL_GPL(zs_unmap_object); +static unsigned long obj_malloc(struct page *first_page, + struct size_class *class, unsigned long handle) +{ + unsigned long obj; + struct link_free *link; + + struct page *m_page; + unsigned long m_objidx, m_offset; + void *vaddr; + + handle |= OBJ_ALLOCATED_TAG; + obj = (unsigned long)first_page->freelist; + obj_to_location(obj, &m_page, &m_objidx); + m_offset = obj_idx_to_offset(m_page, m_objidx, class->size); + + vaddr = kmap_atomic(m_page); + link = (struct link_free *)vaddr + m_offset / sizeof(*link); + first_page->freelist = link->next; + if (!class->huge) + /* record handle in the header of allocated chunk */ + link->handle = handle; + else + /* record handle in first_page->private */ + set_page_private(first_page, handle); + kunmap_atomic(vaddr); + first_page->inuse++; + zs_stat_inc(class, OBJ_USED, 1); + + return obj; +} + + /** * zs_malloc - Allocate block of given size from pool. * @pool: pool to allocate from @@ -1236,17 +1384,19 @@ EXPORT_SYMBOL_GPL(zs_unmap_object); */ unsigned long zs_malloc(struct zs_pool *pool, size_t size) { - unsigned long obj; - struct link_free *link; + unsigned long handle, obj; struct size_class *class; - void *vaddr; - - struct page *first_page, *m_page; - unsigned long m_objidx, m_offset; + struct page *first_page; if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE)) return 0; + handle = alloc_handle(pool); + if (!handle) + return 0; + + /* extra space in chunk to keep the handle */ + size += ZS_HANDLE_SIZE; class = pool->size_class[get_size_class_index(size)]; spin_lock(&class->lock); @@ -1255,8 +1405,10 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size) if (!first_page) { spin_unlock(&class->lock); first_page = alloc_zspage(class, pool->flags); - if (unlikely(!first_page)) + if (unlikely(!first_page)) { + free_handle(pool, handle); return 0; + } set_zspage_mapping(first_page, class->index, ZS_EMPTY); atomic_long_add(class->pages_per_zspage, @@ -1267,73 +1419,360 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size) class->size, class->pages_per_zspage)); } - obj = (unsigned long)first_page->freelist; - obj_handle_to_location(obj, &m_page, &m_objidx); - m_offset = obj_idx_to_offset(m_page, m_objidx, class->size); - - vaddr = kmap_atomic(m_page); - link = (struct link_free *)vaddr + m_offset / sizeof(*link); - first_page->freelist = link->next; - memset(link, POISON_INUSE, sizeof(*link)); - kunmap_atomic(vaddr); - - first_page->inuse++; - zs_stat_inc(class, OBJ_USED, 1); + obj = obj_malloc(first_page, class, handle); /* Now move the zspage to another fullness group, if required */ - fix_fullness_group(pool, first_page); + fix_fullness_group(class, first_page); + record_obj(handle, obj); spin_unlock(&class->lock); - return obj; + return handle; } EXPORT_SYMBOL_GPL(zs_malloc); -void zs_free(struct zs_pool *pool, unsigned long obj) +static void obj_free(struct zs_pool *pool, struct size_class *class, + unsigned long obj) { struct link_free *link; struct page *first_page, *f_page; unsigned long f_objidx, f_offset; void *vaddr; - int class_idx; - struct size_class *class; enum fullness_group fullness; - if (unlikely(!obj)) - return; + BUG_ON(!obj); - obj_handle_to_location(obj, &f_page, &f_objidx); + obj &= ~OBJ_ALLOCATED_TAG; + obj_to_location(obj, &f_page, &f_objidx); first_page = get_first_page(f_page); get_zspage_mapping(first_page, &class_idx, &fullness); - class = pool->size_class[class_idx]; f_offset = obj_idx_to_offset(f_page, f_objidx, class->size); - spin_lock(&class->lock); + vaddr = kmap_atomic(f_page); /* Insert this object in containing zspage's freelist */ - vaddr = kmap_atomic(f_page); link = (struct link_free *)(vaddr + f_offset); link->next = first_page->freelist; + if (class->huge) + set_page_private(first_page, 0); kunmap_atomic(vaddr); first_page->freelist = (void *)obj; - first_page->inuse--; - fullness = fix_fullness_group(pool, first_page); - zs_stat_dec(class, OBJ_USED, 1); - if (fullness == ZS_EMPTY) +} + +void zs_free(struct zs_pool *pool, unsigned long handle) +{ + struct page *first_page, *f_page; + unsigned long obj, f_objidx; + int class_idx; + struct size_class *class; + enum fullness_group fullness; + + if (unlikely(!handle)) + return; + + pin_tag(handle); + obj = handle_to_obj(handle); + obj_to_location(obj, &f_page, &f_objidx); + first_page = get_first_page(f_page); + + get_zspage_mapping(first_page, &class_idx, &fullness); + class = pool->size_class[class_idx]; + + spin_lock(&class->lock); + obj_free(pool, class, obj); + fullness = fix_fullness_group(class, first_page); + if (fullness == ZS_EMPTY) { zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage( class->size, class->pages_per_zspage)); - + atomic_long_sub(class->pages_per_zspage, + &pool->pages_allocated); + free_zspage(first_page); + } spin_unlock(&class->lock); + unpin_tag(handle); + + free_handle(pool, handle); +} +EXPORT_SYMBOL_GPL(zs_free); + +static void zs_object_copy(unsigned long src, unsigned long dst, + struct size_class *class) +{ + struct page *s_page, *d_page; + unsigned long s_objidx, d_objidx; + unsigned long s_off, d_off; + void *s_addr, *d_addr; + int s_size, d_size, size; + int written = 0; + + s_size = d_size = class->size; + + obj_to_location(src, &s_page, &s_objidx); + obj_to_location(dst, &d_page, &d_objidx); + + s_off = obj_idx_to_offset(s_page, s_objidx, class->size); + d_off = obj_idx_to_offset(d_page, d_objidx, class->size); + + if (s_off + class->size > PAGE_SIZE) + s_size = PAGE_SIZE - s_off; + + if (d_off + class->size > PAGE_SIZE) + d_size = PAGE_SIZE - d_off; + + s_addr = kmap_atomic(s_page); + d_addr = kmap_atomic(d_page); + + while (1) { + size = min(s_size, d_size); + memcpy(d_addr + d_off, s_addr + s_off, size); + written += size; + + if (written == class->size) + break; + + s_off += size; + s_size -= size; + d_off += size; + d_size -= size; + + if (s_off >= PAGE_SIZE) { + kunmap_atomic(d_addr); + kunmap_atomic(s_addr); + s_page = get_next_page(s_page); + BUG_ON(!s_page); + s_addr = kmap_atomic(s_page); + d_addr = kmap_atomic(d_page); + s_size = class->size - written; + s_off = 0; + } + + if (d_off >= PAGE_SIZE) { + kunmap_atomic(d_addr); + d_page = get_next_page(d_page); + BUG_ON(!d_page); + d_addr = kmap_atomic(d_page); + d_size = class->size - written; + d_off = 0; + } + } + + kunmap_atomic(d_addr); + kunmap_atomic(s_addr); +} + +/* + * Find alloced object in zspage from index object and + * return handle. + */ +static unsigned long find_alloced_obj(struct page *page, int index, + struct size_class *class) +{ + unsigned long head; + int offset = 0; + unsigned long handle = 0; + void *addr = kmap_atomic(page); + + if (!is_first_page(page)) + offset = page->index; + offset += class->size * index; + + while (offset < PAGE_SIZE) { + head = obj_to_head(class, page, addr + offset); + if (head & OBJ_ALLOCATED_TAG) { + handle = head & ~OBJ_ALLOCATED_TAG; + if (trypin_tag(handle)) + break; + handle = 0; + } + + offset += class->size; + index++; + } + + kunmap_atomic(addr); + return handle; +} + +struct zs_compact_control { + /* Source page for migration which could be a subpage of zspage. */ + struct page *s_page; + /* Destination page for migration which should be a first page + * of zspage. */ + struct page *d_page; + /* Starting object index within @s_page which used for live object + * in the subpage. */ + int index; + /* how many of objects are migrated */ + int nr_migrated; +}; + +static int migrate_zspage(struct zs_pool *pool, struct size_class *class, + struct zs_compact_control *cc) +{ + unsigned long used_obj, free_obj; + unsigned long handle; + struct page *s_page = cc->s_page; + struct page *d_page = cc->d_page; + unsigned long index = cc->index; + int nr_migrated = 0; + int ret = 0; + + while (1) { + handle = find_alloced_obj(s_page, index, class); + if (!handle) { + s_page = get_next_page(s_page); + if (!s_page) + break; + index = 0; + continue; + } + + /* Stop if there is no more space */ + if (zspage_full(d_page)) { + unpin_tag(handle); + ret = -ENOMEM; + break; + } + + used_obj = handle_to_obj(handle); + free_obj = obj_malloc(d_page, class, handle); + zs_object_copy(used_obj, free_obj, class); + index++; + record_obj(handle, free_obj); + unpin_tag(handle); + obj_free(pool, class, used_obj); + nr_migrated++; + } + + /* Remember last position in this iteration */ + cc->s_page = s_page; + cc->index = index; + cc->nr_migrated = nr_migrated; + + return ret; +} + +static struct page *alloc_target_page(struct size_class *class) +{ + int i; + struct page *page; + + for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) { + page = class->fullness_list[i]; + if (page) { + remove_zspage(page, class, i); + break; + } + } + + return page; +} + +static void putback_zspage(struct zs_pool *pool, struct size_class *class, + struct page *first_page) +{ + enum fullness_group fullness; + + BUG_ON(!is_first_page(first_page)); + + fullness = get_fullness_group(first_page); + insert_zspage(first_page, class, fullness); + set_zspage_mapping(first_page, class->index, fullness); if (fullness == ZS_EMPTY) { + zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage( + class->size, class->pages_per_zspage)); atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated); + free_zspage(first_page); } } -EXPORT_SYMBOL_GPL(zs_free); + +static struct page *isolate_source_page(struct size_class *class) +{ + struct page *page; + + page = class->fullness_list[ZS_ALMOST_EMPTY]; + if (page) + remove_zspage(page, class, ZS_ALMOST_EMPTY); + + return page; +} + +static unsigned long __zs_compact(struct zs_pool *pool, + struct size_class *class) +{ + int nr_to_migrate; + struct zs_compact_control cc; + struct page *src_page; + struct page *dst_page = NULL; + unsigned long nr_total_migrated = 0; + + spin_lock(&class->lock); + while ((src_page = isolate_source_page(class))) { + + BUG_ON(!is_first_page(src_page)); + + /* The goal is to migrate all live objects in source page */ + nr_to_migrate = src_page->inuse; + cc.index = 0; + cc.s_page = src_page; + + while ((dst_page = alloc_target_page(class))) { + cc.d_page = dst_page; + /* + * If there is no more space in dst_page, try to + * allocate another zspage. + */ + if (!migrate_zspage(pool, class, &cc)) + break; + + putback_zspage(pool, class, dst_page); + nr_total_migrated += cc.nr_migrated; + nr_to_migrate -= cc.nr_migrated; + } + + /* Stop if we couldn't find slot */ + if (dst_page == NULL) + break; + + putback_zspage(pool, class, dst_page); + putback_zspage(pool, class, src_page); + spin_unlock(&class->lock); + nr_total_migrated += cc.nr_migrated; + cond_resched(); + spin_lock(&class->lock); + } + + if (src_page) + putback_zspage(pool, class, src_page); + + spin_unlock(&class->lock); + + return nr_total_migrated; +} + +unsigned long zs_compact(struct zs_pool *pool) +{ + int i; + unsigned long nr_migrated = 0; + struct size_class *class; + + for (i = zs_size_classes - 1; i >= 0; i--) { + class = pool->size_class[i]; + if (!class) + continue; + if (class->index != i) + continue; + nr_migrated += __zs_compact(pool, class); + } + + return nr_migrated; +} +EXPORT_SYMBOL_GPL(zs_compact); /** * zs_create_pool - Creates an allocation pool to work from. @@ -1355,20 +1794,20 @@ struct zs_pool *zs_create_pool(char *name, gfp_t flags) if (!pool) return NULL; - pool->name = kstrdup(name, GFP_KERNEL); - if (!pool->name) { - kfree(pool); - return NULL; - } - pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *), GFP_KERNEL); if (!pool->size_class) { - kfree(pool->name); kfree(pool); return NULL; } + pool->name = kstrdup(name, GFP_KERNEL); + if (!pool->name) + goto err; + + if (create_handle_cache(pool)) + goto err; + /* * Iterate reversly, because, size of size_class that we want to use * for merging should be larger or equal to current size. @@ -1406,6 +1845,9 @@ struct zs_pool *zs_create_pool(char *name, gfp_t flags) class->size = size; class->index = i; class->pages_per_zspage = pages_per_zspage; + if (pages_per_zspage == 1 && + get_maxobj_per_zspage(size, pages_per_zspage) == 1) + class->huge = true; spin_lock_init(&class->lock); pool->size_class[i] = class; @@ -1450,6 +1892,7 @@ void zs_destroy_pool(struct zs_pool *pool) kfree(class); } + destroy_handle_cache(pool); kfree(pool->size_class); kfree(pool->name); kfree(pool); diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig index fb78117..9068e72 100644 --- a/net/sunrpc/Kconfig +++ b/net/sunrpc/Kconfig @@ -1,9 +1,11 @@ config SUNRPC tristate + depends on MULTIUSER config SUNRPC_GSS tristate select OID_REGISTRY + depends on MULTIUSER config SUNRPC_BACKCHANNEL bool diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c index 5199bb1..2928aff 100644 --- a/net/sunrpc/cache.c +++ b/net/sunrpc/cache.c @@ -1072,10 +1072,12 @@ void qword_add(char **bpp, int *lp, char *str) if (len < 0) return; - ret = string_escape_str(str, &bp, len, ESCAPE_OCTAL, "\\ \n\t"); - if (ret < 0 || ret == len) + ret = string_escape_str(str, bp, len, ESCAPE_OCTAL, "\\ \n\t"); + if (ret >= len) { + bp += len; len = -1; - else { + } else { + bp += ret; len -= ret; *bp++ = ' '; len--; diff --git a/security/Kconfig b/security/Kconfig index beb86b5..bf4ec4647 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -21,6 +21,7 @@ config SECURITY_DMESG_RESTRICT config SECURITY bool "Enable different security models" depends on SYSFS + depends on MULTIUSER help This allows you to choose different security modules to be configured into your kernel. diff --git a/tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c b/tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c index 3d8e5b0..4900367 100644 --- a/tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c +++ b/tools/testing/selftests/powerpc/mm/hugetlb_vs_thp_test.c @@ -21,9 +21,13 @@ static int test_body(void) * Typically the mmap will fail because no huge pages are * allocated on the system. But if there are huge pages * allocated the mmap will succeed. That's fine too, we just - * munmap here before continuing. + * munmap here before continuing. munmap() length of + * MAP_HUGETLB memory must be hugepage aligned. */ - munmap(addr, SIZE); + if (munmap(addr, SIZE)) { + perror("munmap"); + return 1; + } } p = mmap(addr, SIZE, PROT_READ | PROT_WRITE, diff --git a/tools/testing/selftests/vm/hugetlbfstest.c b/tools/testing/selftests/vm/hugetlbfstest.c index ea40ff8..02e1072 100644 --- a/tools/testing/selftests/vm/hugetlbfstest.c +++ b/tools/testing/selftests/vm/hugetlbfstest.c @@ -34,6 +34,7 @@ static void do_mmap(int fd, int extra_flags, int unmap) int *p; int flags = MAP_PRIVATE | MAP_POPULATE | extra_flags; u64 before, after; + int ret; before = read_rss(); p = mmap(NULL, length, PROT_READ | PROT_WRITE, flags, fd, 0); @@ -44,7 +45,8 @@ static void do_mmap(int fd, int extra_flags, int unmap) !"rss didn't grow as expected"); if (!unmap) return; - munmap(p, length); + ret = munmap(p, length); + assert(!ret || !"munmap returned an unexpected error"); after = read_rss(); assert(llabs(after - before) < 0x40000 || !"rss didn't shrink as expected"); diff --git a/tools/testing/selftests/vm/map_hugetlb.c b/tools/testing/selftests/vm/map_hugetlb.c index ac56639..addcd6f 100644 --- a/tools/testing/selftests/vm/map_hugetlb.c +++ b/tools/testing/selftests/vm/map_hugetlb.c @@ -73,7 +73,11 @@ int main(void) write_bytes(addr); ret = read_bytes(addr); - munmap(addr, LENGTH); + /* munmap() length of MAP_HUGETLB memory must be hugepage aligned */ + if (munmap(addr, LENGTH)) { + perror("munmap"); + exit(1); + } return ret; } |