diff options
author | Timothy Pearson <tpearson@raptorengineering.com> | 2017-08-23 14:45:25 -0500 |
---|---|---|
committer | Timothy Pearson <tpearson@raptorengineering.com> | 2017-08-23 14:45:25 -0500 |
commit | fcbb27b0ec6dcbc5a5108cb8fb19eae64593d204 (patch) | |
tree | 22962a4387943edc841c72a4e636a068c66d58fd /Documentation/sysctl | |
download | ast2050-linux-kernel-fcbb27b0ec6dcbc5a5108cb8fb19eae64593d204.zip ast2050-linux-kernel-fcbb27b0ec6dcbc5a5108cb8fb19eae64593d204.tar.gz |
Initial import of modified Linux 2.6.28 tree
Original upstream URL:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git | branch linux-2.6.28.y
Diffstat (limited to 'Documentation/sysctl')
-rw-r--r-- | Documentation/sysctl/00-INDEX | 16 | ||||
-rw-r--r-- | Documentation/sysctl/README | 75 | ||||
-rw-r--r-- | Documentation/sysctl/abi.txt | 54 | ||||
-rw-r--r-- | Documentation/sysctl/ctl_unnumbered.txt | 22 | ||||
-rw-r--r-- | Documentation/sysctl/fs.txt | 180 | ||||
-rw-r--r-- | Documentation/sysctl/kernel.txt | 383 | ||||
-rw-r--r-- | Documentation/sysctl/sunrpc.txt | 20 | ||||
-rw-r--r-- | Documentation/sysctl/vm.txt | 349 |
8 files changed, 1099 insertions, 0 deletions
diff --git a/Documentation/sysctl/00-INDEX b/Documentation/sysctl/00-INDEX new file mode 100644 index 0000000..a20a906 --- /dev/null +++ b/Documentation/sysctl/00-INDEX @@ -0,0 +1,16 @@ +00-INDEX + - this file. +README + - general information about /proc/sys/ sysctl files. +abi.txt + - documentation for /proc/sys/abi/*. +ctl_unnumbered.txt + - explanation of why one should not add new binary sysctl numbers. +fs.txt + - documentation for /proc/sys/fs/*. +kernel.txt + - documentation for /proc/sys/kernel/*. +sunrpc.txt + - documentation for /proc/sys/sunrpc/*. +vm.txt + - documentation for /proc/sys/vm/*. diff --git a/Documentation/sysctl/README b/Documentation/sysctl/README new file mode 100644 index 0000000..8c3306e --- /dev/null +++ b/Documentation/sysctl/README @@ -0,0 +1,75 @@ +Documentation for /proc/sys/ kernel version 2.2.10 + (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> + +'Why', I hear you ask, 'would anyone even _want_ documentation +for them sysctl files? If anybody really needs it, it's all in +the source...' + +Well, this documentation is written because some people either +don't know they need to tweak something, or because they don't +have the time or knowledge to read the source code. + +Furthermore, the programmers who built sysctl have built it to +be actually used, not just for the fun of programming it :-) + +============================================================== + +Legal blurb: + +As usual, there are two main things to consider: +1. you get what you pay for +2. it's free + +The consequences are that I won't guarantee the correctness of +this document, and if you come to me complaining about how you +screwed up your system because of wrong documentation, I won't +feel sorry for you. I might even laugh at you... + +But of course, if you _do_ manage to screw up your system using +only the sysctl options used in this file, I'd like to hear of +it. Not only to have a great laugh, but also to make sure that +you're the last RTFMing person to screw up. + +In short, e-mail your suggestions, corrections and / or horror +stories to: <riel@nl.linux.org> + +Rik van Riel. + +============================================================== + +Introduction: + +Sysctl is a means of configuring certain aspects of the kernel +at run-time, and the /proc/sys/ directory is there so that you +don't even need special tools to do it! +In fact, there are only four things needed to use these config +facilities: +- a running Linux system +- root access +- common sense (this is especially hard to come by these days) +- knowledge of what all those values mean + +As a quick 'ls /proc/sys' will show, the directory consists of +several (arch-dependent?) subdirs. Each subdir is mainly about +one part of the kernel, so you can do configuration on a piece +by piece basis, or just some 'thematic frobbing'. + +The subdirs are about: +abi/ execution domains & personalities +debug/ <empty> +dev/ device specific information (eg dev/cdrom/info) +fs/ specific filesystems + filehandle, inode, dentry and quota tuning + binfmt_misc <Documentation/binfmt_misc.txt> +kernel/ global kernel info / tuning + miscellaneous stuff +net/ networking stuff, for documentation look in: + <Documentation/networking/> +proc/ <empty> +sunrpc/ SUN Remote Procedure Call (NFS) +vm/ memory management tuning + buffer and cache management + +These are the subdirs I have on my system. There might be more +or other subdirs in another setup. If you see another dir, I'd +really like to hear about it :-) diff --git a/Documentation/sysctl/abi.txt b/Documentation/sysctl/abi.txt new file mode 100644 index 0000000..63f4ebc --- /dev/null +++ b/Documentation/sysctl/abi.txt @@ -0,0 +1,54 @@ +Documentation for /proc/sys/abi/* kernel version 2.6.0.test2 + (c) 2003, Fabian Frederick <ffrederick@users.sourceforge.net> + +For general info : README. + +============================================================== + +This path is binary emulation relevant aka personality types aka abi. +When a process is executed, it's linked to an exec_domain whose +personality is defined using values available from /proc/sys/abi. +You can find further details about abi in include/linux/personality.h. + +Here are the files featuring in 2.6 kernel : + +- defhandler_coff +- defhandler_elf +- defhandler_lcall7 +- defhandler_libcso +- fake_utsname +- trace + +=========================================================== +defhandler_coff: +defined value : +PER_SCOSVR3 +0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE + +=========================================================== +defhandler_elf: +defined value : +PER_LINUX +0 + +=========================================================== +defhandler_lcall7: +defined value : +PER_SVR4 +0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, + +=========================================================== +defhandler_libsco: +defined value: +PER_SVR4 +0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, + +=========================================================== +fake_utsname: +Unused + +=========================================================== +trace: +Unused + +=========================================================== diff --git a/Documentation/sysctl/ctl_unnumbered.txt b/Documentation/sysctl/ctl_unnumbered.txt new file mode 100644 index 0000000..23003a8 --- /dev/null +++ b/Documentation/sysctl/ctl_unnumbered.txt @@ -0,0 +1,22 @@ + +Except for a few extremely rare exceptions user space applications do not use +the binary sysctl interface. Instead everyone uses /proc/sys/... with +readable ascii names. + +Recently the kernel has started supporting setting the binary sysctl value to +CTL_UNNUMBERED so we no longer need to assign a binary sysctl path to allow +sysctls to show up in /proc/sys. + +Assigning binary sysctl numbers is an endless source of conflicts in sysctl.h, +breaking of the user space ABI (because of those conflicts), and maintenance +problems. A complete pass through all of the sysctl users revealed multiple +instances where the sysctl binary interface was broken and had gone undetected +for years. + +So please do not add new binary sysctl numbers. They are unneeded and +problematic. + +If you really need a new binary sysctl number please first merge your sysctl +into the kernel and then as a separate patch allocate a binary sysctl number. + +(ebiederm@xmission.com, June 2007) diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt new file mode 100644 index 0000000..f992543 --- /dev/null +++ b/Documentation/sysctl/fs.txt @@ -0,0 +1,180 @@ +Documentation for /proc/sys/fs/* kernel version 2.2.10 + (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> + +For general info and legal blurb, please look in README. + +============================================================== + +This file contains documentation for the sysctl files in +/proc/sys/fs/ and is valid for Linux kernel version 2.2. + +The files in this directory can be used to tune and monitor +miscellaneous and general things in the operation of the Linux +kernel. Since some of the files _can_ be used to screw up your +system, it is advisable to read both documentation and source +before actually making adjustments. + +Currently, these files are in /proc/sys/fs: +- dentry-state +- dquot-max +- dquot-nr +- file-max +- file-nr +- inode-max +- inode-nr +- inode-state +- nr_open +- overflowuid +- overflowgid +- suid_dumpable +- super-max +- super-nr + +Documentation for the files in /proc/sys/fs/binfmt_misc is +in Documentation/binfmt_misc.txt. + +============================================================== + +dentry-state: + +From linux/fs/dentry.c: +-------------------------------------------------------------- +struct { + int nr_dentry; + int nr_unused; + int age_limit; /* age in seconds */ + int want_pages; /* pages requested by system */ + int dummy[2]; +} dentry_stat = {0, 0, 45, 0,}; +-------------------------------------------------------------- + +Dentries are dynamically allocated and deallocated, and +nr_dentry seems to be 0 all the time. Hence it's safe to +assume that only nr_unused, age_limit and want_pages are +used. Nr_unused seems to be exactly what its name says. +Age_limit is the age in seconds after which dcache entries +can be reclaimed when memory is short and want_pages is +nonzero when shrink_dcache_pages() has been called and the +dcache isn't pruned yet. + +============================================================== + +dquot-max & dquot-nr: + +The file dquot-max shows the maximum number of cached disk +quota entries. + +The file dquot-nr shows the number of allocated disk quota +entries and the number of free disk quota entries. + +If the number of free cached disk quotas is very low and +you have some awesome number of simultaneous system users, +you might want to raise the limit. + +============================================================== + +file-max & file-nr: + +The kernel allocates file handles dynamically, but as yet it +doesn't free them again. + +The value in file-max denotes the maximum number of file- +handles that the Linux kernel will allocate. When you get lots +of error messages about running out of file handles, you might +want to increase this limit. + +The three values in file-nr denote the number of allocated +file handles, the number of unused file handles and the maximum +number of file handles. When the allocated file handles come +close to the maximum, but the number of unused file handles is +significantly greater than 0, you've encountered a peak in your +usage of file handles and you don't need to increase the maximum. + +============================================================== + +nr_open: + +This denotes the maximum number of file-handles a process can +allocate. Default value is 1024*1024 (1048576) which should be +enough for most machines. Actual limit depends on RLIMIT_NOFILE +resource limit. + +============================================================== + +inode-max, inode-nr & inode-state: + +As with file handles, the kernel allocates the inode structures +dynamically, but can't free them yet. + +The value in inode-max denotes the maximum number of inode +handlers. This value should be 3-4 times larger than the value +in file-max, since stdin, stdout and network sockets also +need an inode struct to handle them. When you regularly run +out of inodes, you need to increase this value. + +The file inode-nr contains the first two items from +inode-state, so we'll skip to that file... + +Inode-state contains three actual numbers and four dummies. +The actual numbers are, in order of appearance, nr_inodes, +nr_free_inodes and preshrink. + +Nr_inodes stands for the number of inodes the system has +allocated, this can be slightly more than inode-max because +Linux allocates them one pageful at a time. + +Nr_free_inodes represents the number of free inodes (?) and +preshrink is nonzero when the nr_inodes > inode-max and the +system needs to prune the inode list instead of allocating +more. + +============================================================== + +overflowgid & overflowuid: + +Some filesystems only support 16-bit UIDs and GIDs, although in Linux +UIDs and GIDs are 32 bits. When one of these filesystems is mounted +with writes enabled, any UID or GID that would exceed 65535 is translated +to a fixed value before being written to disk. + +These sysctls allow you to change the value of the fixed UID and GID. +The default is 65534. + +============================================================== + +suid_dumpable: + +This value can be used to query and set the core dump mode for setuid +or otherwise protected/tainted binaries. The modes are + +0 - (default) - traditional behaviour. Any process which has changed + privilege levels or is execute only will not be dumped +1 - (debug) - all processes dump core when possible. The core dump is + owned by the current user and no security is applied. This is + intended for system debugging situations only. Ptrace is unchecked. +2 - (suidsafe) - any binary which normally would not be dumped is dumped + readable by root only. This allows the end user to remove + such a dump but not access it directly. For security reasons + core dumps in this mode will not overwrite one another or + other files. This mode is appropriate when administrators are + attempting to debug problems in a normal environment. + +============================================================== + +super-max & super-nr: + +These numbers control the maximum number of superblocks, and +thus the maximum number of mounted filesystems the kernel +can have. You only need to increase super-max if you need to +mount more filesystems than the current value in super-max +allows you to. + +============================================================== + +aio-nr & aio-max-nr: + +aio-nr shows the current system-wide number of asynchronous io +requests. aio-max-nr allows you to change the maximum value +aio-nr can grow to. + +============================================================== diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt new file mode 100644 index 0000000..a4ccdd1 --- /dev/null +++ b/Documentation/sysctl/kernel.txt @@ -0,0 +1,383 @@ +Documentation for /proc/sys/kernel/* kernel version 2.2.10 + (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> + +For general info and legal blurb, please look in README. + +============================================================== + +This file contains documentation for the sysctl files in +/proc/sys/kernel/ and is valid for Linux kernel version 2.2. + +The files in this directory can be used to tune and monitor +miscellaneous and general things in the operation of the Linux +kernel. Since some of the files _can_ be used to screw up your +system, it is advisable to read both documentation and source +before actually making adjustments. + +Currently, these files might (depending on your configuration) +show up in /proc/sys/kernel: +- acpi_video_flags +- acct +- core_pattern +- core_uses_pid +- ctrl-alt-del +- dentry-state +- domainname +- hostname +- hotplug +- java-appletviewer [ binfmt_java, obsolete ] +- java-interpreter [ binfmt_java, obsolete ] +- kstack_depth_to_print [ X86 only ] +- l2cr [ PPC only ] +- modprobe ==> Documentation/debugging-modules.txt +- msgmax +- msgmnb +- msgmni +- osrelease +- ostype +- overflowgid +- overflowuid +- panic +- pid_max +- powersave-nap [ PPC only ] +- printk +- randomize_va_space +- real-root-dev ==> Documentation/initrd.txt +- reboot-cmd [ SPARC only ] +- rtsig-max +- rtsig-nr +- sem +- sg-big-buff [ generic SCSI device (sg) ] +- shmall +- shmmax [ sysv ipc ] +- shmmni +- stop-a [ SPARC only ] +- sysrq ==> Documentation/sysrq.txt +- tainted +- threads-max +- version + +============================================================== + +acpi_video_flags: + +flags + +See Doc*/kernel/power/video.txt, it allows mode of video boot to be +set during run time. + +============================================================== + +acct: + +highwater lowwater frequency + +If BSD-style process accounting is enabled these values control +its behaviour. If free space on filesystem where the log lives +goes below <lowwater>% accounting suspends. If free space gets +above <highwater>% accounting resumes. <Frequency> determines +how often do we check the amount of free space (value is in +seconds). Default: +4 2 30 +That is, suspend accounting if there left <= 2% free; resume it +if we got >=4%; consider information about amount of free space +valid for 30 seconds. + +============================================================== + +core_pattern: + +core_pattern is used to specify a core dumpfile pattern name. +. max length 128 characters; default value is "core" +. core_pattern is used as a pattern template for the output filename; + certain string patterns (beginning with '%') are substituted with + their actual values. +. backward compatibility with core_uses_pid: + If core_pattern does not include "%p" (default does not) + and core_uses_pid is set, then .PID will be appended to + the filename. +. corename format specifiers: + %<NUL> '%' is dropped + %% output one '%' + %p pid + %u uid + %g gid + %s signal number + %t UNIX time of dump + %h hostname + %e executable filename + %<OTHER> both are dropped +. If the first character of the pattern is a '|', the kernel will treat + the rest of the pattern as a command to run. The core dump will be + written to the standard input of that program instead of to a file. + +============================================================== + +core_uses_pid: + +The default coredump filename is "core". By setting +core_uses_pid to 1, the coredump filename becomes core.PID. +If core_pattern does not include "%p" (default does not) +and core_uses_pid is set, then .PID will be appended to +the filename. + +============================================================== + +ctrl-alt-del: + +When the value in this file is 0, ctrl-alt-del is trapped and +sent to the init(1) program to handle a graceful restart. +When, however, the value is > 0, Linux's reaction to a Vulcan +Nerve Pinch (tm) will be an immediate reboot, without even +syncing its dirty buffers. + +Note: when a program (like dosemu) has the keyboard in 'raw' +mode, the ctrl-alt-del is intercepted by the program before it +ever reaches the kernel tty layer, and it's up to the program +to decide what to do with it. + +============================================================== + +domainname & hostname: + +These files can be used to set the NIS/YP domainname and the +hostname of your box in exactly the same way as the commands +domainname and hostname, i.e.: +# echo "darkstar" > /proc/sys/kernel/hostname +# echo "mydomain" > /proc/sys/kernel/domainname +has the same effect as +# hostname "darkstar" +# domainname "mydomain" + +Note, however, that the classic darkstar.frop.org has the +hostname "darkstar" and DNS (Internet Domain Name Server) +domainname "frop.org", not to be confused with the NIS (Network +Information Service) or YP (Yellow Pages) domainname. These two +domain names are in general different. For a detailed discussion +see the hostname(1) man page. + +============================================================== + +hotplug: + +Path for the hotplug policy agent. +Default value is "/sbin/hotplug". + +============================================================== + +l2cr: (PPC only) + +This flag controls the L2 cache of G3 processor boards. If +0, the cache is disabled. Enabled if nonzero. + +============================================================== + +kstack_depth_to_print: (X86 only) + +Controls the number of words to print when dumping the raw +kernel stack. + +============================================================== + +osrelease, ostype & version: + +# cat osrelease +2.1.88 +# cat ostype +Linux +# cat version +#5 Wed Feb 25 21:49:24 MET 1998 + +The files osrelease and ostype should be clear enough. Version +needs a little more clarification however. The '#5' means that +this is the fifth kernel built from this source base and the +date behind it indicates the time the kernel was built. +The only way to tune these values is to rebuild the kernel :-) + +============================================================== + +overflowgid & overflowuid: + +if your architecture did not always support 32-bit UIDs (i.e. arm, i386, +m68k, sh, and sparc32), a fixed UID and GID will be returned to +applications that use the old 16-bit UID/GID system calls, if the actual +UID or GID would exceed 65535. + +These sysctls allow you to change the value of the fixed UID and GID. +The default is 65534. + +============================================================== + +panic: + +The value in this file represents the number of seconds the +kernel waits before rebooting on a panic. When you use the +software watchdog, the recommended setting is 60. + +============================================================== + +panic_on_oops: + +Controls the kernel's behaviour when an oops or BUG is encountered. + +0: try to continue operation + +1: panic immediately. If the `panic' sysctl is also non-zero then the + machine will be rebooted. + +============================================================== + +pid_max: + +PID allocation wrap value. When the kernel's next PID value +reaches this value, it wraps back to a minimum PID value. +PIDs of value pid_max or larger are not allocated. + +============================================================== + +powersave-nap: (PPC only) + +If set, Linux-PPC will use the 'nap' mode of powersaving, +otherwise the 'doze' mode will be used. + +============================================================== + +printk: + +The four values in printk denote: console_loglevel, +default_message_loglevel, minimum_console_loglevel and +default_console_loglevel respectively. + +These values influence printk() behavior when printing or +logging error messages. See 'man 2 syslog' for more info on +the different loglevels. + +- console_loglevel: messages with a higher priority than + this will be printed to the console +- default_message_level: messages without an explicit priority + will be printed with this priority +- minimum_console_loglevel: minimum (highest) value to which + console_loglevel can be set +- default_console_loglevel: default value for console_loglevel + +============================================================== + +printk_ratelimit: + +Some warning messages are rate limited. printk_ratelimit specifies +the minimum length of time between these messages (in jiffies), by +default we allow one every 5 seconds. + +A value of 0 will disable rate limiting. + +============================================================== + +printk_ratelimit_burst: + +While long term we enforce one message per printk_ratelimit +seconds, we do allow a burst of messages to pass through. +printk_ratelimit_burst specifies the number of messages we can +send before ratelimiting kicks in. + +============================================================== + +randomize-va-space: + +This option can be used to select the type of process address +space randomization that is used in the system, for architectures +that support this feature. + +0 - Turn the process address space randomization off by default. + +1 - Make the addresses of mmap base, stack and VDSO page randomized. + This, among other things, implies that shared libraries will be + loaded to random addresses. Also for PIE-linked binaries, the location + of code start is randomized. + + With heap randomization, the situation is a little bit more + complicated. + There a few legacy applications out there (such as some ancient + versions of libc.so.5 from 1996) that assume that brk area starts + just after the end of the code+bss. These applications break when + start of the brk area is randomized. There are however no known + non-legacy applications that would be broken this way, so for most + systems it is safe to choose full randomization. However there is + a CONFIG_COMPAT_BRK option for systems with ancient and/or broken + binaries, that makes heap non-randomized, but keeps all other + parts of process address space randomized if randomize_va_space + sysctl is turned on. + +============================================================== + +reboot-cmd: (Sparc only) + +??? This seems to be a way to give an argument to the Sparc +ROM/Flash boot loader. Maybe to tell it what to do after +rebooting. ??? + +============================================================== + +rtsig-max & rtsig-nr: + +The file rtsig-max can be used to tune the maximum number +of POSIX realtime (queued) signals that can be outstanding +in the system. + +rtsig-nr shows the number of RT signals currently queued. + +============================================================== + +sg-big-buff: + +This file shows the size of the generic SCSI (sg) buffer. +You can't tune it just yet, but you could change it on +compile time by editing include/scsi/sg.h and changing +the value of SG_BIG_BUFF. + +There shouldn't be any reason to change this value. If +you can come up with one, you probably know what you +are doing anyway :) + +============================================================== + +shmmax: + +This value can be used to query and set the run time limit +on the maximum shared memory segment size that can be created. +Shared memory segments up to 1Gb are now supported in the +kernel. This value defaults to SHMMAX. + +============================================================== + +softlockup_thresh: + +This value can be used to lower the softlockup tolerance threshold. The +default threshold is 60 seconds. If a cpu is locked up for 60 seconds, +the kernel complains. Valid values are 1-60 seconds. Setting this +tunable to zero will disable the softlockup detection altogether. + +============================================================== + +tainted: + +Non-zero if the kernel has been tainted. Numeric values, which +can be ORed together: + + 1 - A module with a non-GPL license has been loaded, this + includes modules with no license. + Set by modutils >= 2.4.9 and module-init-tools. + 2 - A module was force loaded by insmod -f. + Set by modutils >= 2.4.9 and module-init-tools. + 4 - Unsafe SMP processors: SMP with CPUs not designed for SMP. + 8 - A module was forcibly unloaded from the system by rmmod -f. + 16 - A hardware machine check error occurred on the system. + 32 - A bad page was discovered on the system. + 64 - The user has asked that the system be marked "tainted". This + could be because they are running software that directly modifies + the hardware, or for other reasons. + 128 - The system has died. + 256 - The ACPI DSDT has been overridden with one supplied by the user + instead of using the one provided by the hardware. + 512 - A kernel warning has occurred. +1024 - A module from drivers/staging was loaded. + diff --git a/Documentation/sysctl/sunrpc.txt b/Documentation/sysctl/sunrpc.txt new file mode 100644 index 0000000..ae1ecac --- /dev/null +++ b/Documentation/sysctl/sunrpc.txt @@ -0,0 +1,20 @@ +Documentation for /proc/sys/sunrpc/* kernel version 2.2.10 + (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> + +For general info and legal blurb, please look in README. + +============================================================== + +This file contains the documentation for the sysctl files in +/proc/sys/sunrpc and is valid for Linux kernel version 2.2. + +The files in this directory can be used to (re)set the debug +flags of the SUN Remote Procedure Call (RPC) subsystem in +the Linux kernel. This stuff is used for NFS, KNFSD and +maybe a few other things as well. + +The files in there are used to control the debugging flags: +rpc_debug, nfs_debug, nfsd_debug and nlm_debug. + +These flags are for kernel hackers only. You should read the +source code in net/sunrpc/ for more information. diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt new file mode 100644 index 0000000..d79eeda --- /dev/null +++ b/Documentation/sysctl/vm.txt @@ -0,0 +1,349 @@ +Documentation for /proc/sys/vm/* kernel version 2.2.10 + (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> + +For general info and legal blurb, please look in README. + +============================================================== + +This file contains the documentation for the sysctl files in +/proc/sys/vm and is valid for Linux kernel version 2.2. + +The files in this directory can be used to tune the operation +of the virtual memory (VM) subsystem of the Linux kernel and +the writeout of dirty data to disk. + +Default values and initialization routines for most of these +files can be found in mm/swap.c. + +Currently, these files are in /proc/sys/vm: +- overcommit_memory +- page-cluster +- dirty_ratio +- dirty_background_ratio +- dirty_expire_centisecs +- dirty_writeback_centisecs +- highmem_is_dirtyable (only if CONFIG_HIGHMEM set) +- max_map_count +- min_free_kbytes +- laptop_mode +- block_dump +- drop-caches +- zone_reclaim_mode +- min_unmapped_ratio +- min_slab_ratio +- panic_on_oom +- oom_dump_tasks +- oom_kill_allocating_task +- mmap_min_address +- numa_zonelist_order +- nr_hugepages +- nr_overcommit_hugepages + +============================================================== + +dirty_ratio, dirty_background_ratio, dirty_expire_centisecs, +dirty_writeback_centisecs, highmem_is_dirtyable, +vfs_cache_pressure, laptop_mode, block_dump, swap_token_timeout, +drop-caches, hugepages_treat_as_movable: + +See Documentation/filesystems/proc.txt + +============================================================== + +overcommit_memory: + +This value contains a flag that enables memory overcommitment. + +When this flag is 0, the kernel attempts to estimate the amount +of free memory left when userspace requests more memory. + +When this flag is 1, the kernel pretends there is always enough +memory until it actually runs out. + +When this flag is 2, the kernel uses a "never overcommit" +policy that attempts to prevent any overcommit of memory. + +This feature can be very useful because there are a lot of +programs that malloc() huge amounts of memory "just-in-case" +and don't use much of it. + +The default value is 0. + +See Documentation/vm/overcommit-accounting and +security/commoncap.c::cap_vm_enough_memory() for more information. + +============================================================== + +overcommit_ratio: + +When overcommit_memory is set to 2, the committed address +space is not permitted to exceed swap plus this percentage +of physical RAM. See above. + +============================================================== + +page-cluster: + +The Linux VM subsystem avoids excessive disk seeks by reading +multiple pages on a page fault. The number of pages it reads +is dependent on the amount of memory in your machine. + +The number of pages the kernel reads in at once is equal to +2 ^ page-cluster. Values above 2 ^ 5 don't make much sense +for swap because we only cluster swap data in 32-page groups. + +============================================================== + +max_map_count: + +This file contains the maximum number of memory map areas a process +may have. Memory map areas are used as a side-effect of calling +malloc, directly by mmap and mprotect, and also when loading shared +libraries. + +While most applications need less than a thousand maps, certain +programs, particularly malloc debuggers, may consume lots of them, +e.g., up to one or two maps per allocation. + +The default value is 65536. + +============================================================== + +min_free_kbytes: + +This is used to force the Linux VM to keep a minimum number +of kilobytes free. The VM uses this number to compute a pages_min +value for each lowmem zone in the system. Each lowmem zone gets +a number of reserved free pages based proportionally on its size. + +Some minimal amount of memory is needed to satisfy PF_MEMALLOC +allocations; if you set this to lower than 1024KB, your system will +become subtly broken, and prone to deadlock under high loads. + +Setting this too high will OOM your machine instantly. + +============================================================== + +percpu_pagelist_fraction + +This is the fraction of pages at most (high mark pcp->high) in each zone that +are allocated for each per cpu page list. The min value for this is 8. It +means that we don't allow more than 1/8th of pages in each zone to be +allocated in any single per_cpu_pagelist. This entry only changes the value +of hot per cpu pagelists. User can specify a number like 100 to allocate +1/100th of each zone to each per cpu page list. + +The batch value of each per cpu pagelist is also updated as a result. It is +set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) + +The initial value is zero. Kernel does not use this value at boot time to set +the high water marks for each per cpu page list. + +=============================================================== + +zone_reclaim_mode: + +Zone_reclaim_mode allows someone to set more or less aggressive approaches to +reclaim memory when a zone runs out of memory. If it is set to zero then no +zone reclaim occurs. Allocations will be satisfied from other zones / nodes +in the system. + +This is value ORed together of + +1 = Zone reclaim on +2 = Zone reclaim writes dirty pages out +4 = Zone reclaim swaps pages + +zone_reclaim_mode is set during bootup to 1 if it is determined that pages +from remote zones will cause a measurable performance reduction. The +page allocator will then reclaim easily reusable pages (those page +cache pages that are currently not used) before allocating off node pages. + +It may be beneficial to switch off zone reclaim if the system is +used for a file server and all of memory should be used for caching files +from disk. In that case the caching effect is more important than +data locality. + +Allowing zone reclaim to write out pages stops processes that are +writing large amounts of data from dirtying pages on other nodes. Zone +reclaim will write out dirty pages if a zone fills up and so effectively +throttle the process. This may decrease the performance of a single process +since it cannot use all of system memory to buffer the outgoing writes +anymore but it preserve the memory on other nodes so that the performance +of other processes running on other nodes will not be affected. + +Allowing regular swap effectively restricts allocations to the local +node unless explicitly overridden by memory policies or cpuset +configurations. + +============================================================= + +min_unmapped_ratio: + +This is available only on NUMA kernels. + +A percentage of the total pages in each zone. Zone reclaim will only +occur if more than this percentage of pages are file backed and unmapped. +This is to insure that a minimal amount of local pages is still available for +file I/O even if the node is overallocated. + +The default is 1 percent. + +============================================================= + +min_slab_ratio: + +This is available only on NUMA kernels. + +A percentage of the total pages in each zone. On Zone reclaim +(fallback from the local zone occurs) slabs will be reclaimed if more +than this percentage of pages in a zone are reclaimable slab pages. +This insures that the slab growth stays under control even in NUMA +systems that rarely perform global reclaim. + +The default is 5 percent. + +Note that slab reclaim is triggered in a per zone / node fashion. +The process of reclaiming slab memory is currently not node specific +and may not be fast. + +============================================================= + +panic_on_oom + +This enables or disables panic on out-of-memory feature. + +If this is set to 0, the kernel will kill some rogue process, +called oom_killer. Usually, oom_killer can kill rogue processes and +system will survive. + +If this is set to 1, the kernel panics when out-of-memory happens. +However, if a process limits using nodes by mempolicy/cpusets, +and those nodes become memory exhaustion status, one process +may be killed by oom-killer. No panic occurs in this case. +Because other nodes' memory may be free. This means system total status +may be not fatal yet. + +If this is set to 2, the kernel panics compulsorily even on the +above-mentioned. + +The default value is 0. +1 and 2 are for failover of clustering. Please select either +according to your policy of failover. + +============================================================= + +oom_dump_tasks + +Enables a system-wide task dump (excluding kernel threads) to be +produced when the kernel performs an OOM-killing and includes such +information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and +name. This is helpful to determine why the OOM killer was invoked +and to identify the rogue task that caused it. + +If this is set to zero, this information is suppressed. On very +large systems with thousands of tasks it may not be feasible to dump +the memory state information for each one. Such systems should not +be forced to incur a performance penalty in OOM conditions when the +information may not be desired. + +If this is set to non-zero, this information is shown whenever the +OOM killer actually kills a memory-hogging task. + +The default value is 0. + +============================================================= + +oom_kill_allocating_task + +This enables or disables killing the OOM-triggering task in +out-of-memory situations. + +If this is set to zero, the OOM killer will scan through the entire +tasklist and select a task based on heuristics to kill. This normally +selects a rogue memory-hogging task that frees up a large amount of +memory when killed. + +If this is set to non-zero, the OOM killer simply kills the task that +triggered the out-of-memory condition. This avoids the expensive +tasklist scan. + +If panic_on_oom is selected, it takes precedence over whatever value +is used in oom_kill_allocating_task. + +The default value is 0. + +============================================================== + +mmap_min_addr + +This file indicates the amount of address space which a user process will +be restricted from mmaping. Since kernel null dereference bugs could +accidentally operate based on the information in the first couple of pages +of memory userspace processes should not be allowed to write to them. By +default this value is set to 0 and no protections will be enforced by the +security module. Setting this value to something like 64k will allow the +vast majority of applications to work correctly and provide defense in depth +against future potential kernel bugs. + +============================================================== + +numa_zonelist_order + +This sysctl is only for NUMA. +'where the memory is allocated from' is controlled by zonelists. +(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation. + you may be able to read ZONE_DMA as ZONE_DMA32...) + +In non-NUMA case, a zonelist for GFP_KERNEL is ordered as following. +ZONE_NORMAL -> ZONE_DMA +This means that a memory allocation request for GFP_KERNEL will +get memory from ZONE_DMA only when ZONE_NORMAL is not available. + +In NUMA case, you can think of following 2 types of order. +Assume 2 node NUMA and below is zonelist of Node(0)'s GFP_KERNEL + +(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL +(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA. + +Type(A) offers the best locality for processes on Node(0), but ZONE_DMA +will be used before ZONE_NORMAL exhaustion. This increases possibility of +out-of-memory(OOM) of ZONE_DMA because ZONE_DMA is tend to be small. + +Type(B) cannot offer the best locality but is more robust against OOM of +the DMA zone. + +Type(A) is called as "Node" order. Type (B) is "Zone" order. + +"Node order" orders the zonelists by node, then by zone within each node. +Specify "[Nn]ode" for zone order + +"Zone Order" orders the zonelists by zone type, then by node within each +zone. Specify "[Zz]one"for zode order. + +Specify "[Dd]efault" to request automatic configuration. Autoconfiguration +will select "node" order in following case. +(1) if the DMA zone does not exist or +(2) if the DMA zone comprises greater than 50% of the available memory or +(3) if any node's DMA zone comprises greater than 60% of its local memory and + the amount of local memory is big enough. + +Otherwise, "zone" order will be selected. Default order is recommended unless +this is causing problems for your system/application. + +============================================================== + +nr_hugepages + +Change the minimum size of the hugepage pool. + +See Documentation/vm/hugetlbpage.txt + +============================================================== + +nr_overcommit_hugepages + +Change the maximum size of the hugepage pool. The maximum is +nr_hugepages + nr_overcommit_hugepages. + +See Documentation/vm/hugetlbpage.txt |