summaryrefslogtreecommitdiffstats
path: root/sys/vm/vm_pageout.c
Commit message (Collapse)AuthorAgeFilesLines
* MFC r320319alc2018-02-211-2/+1
| | | | | | | | | | | | | Increase the pageout cluster size to 32 pages. Decouple the pageout cluster size from the size of the hash table entry used by the swap pager for mapping (object, pindex) to a block on the swap device(s), and keep the size of a hash table entry at its current size. Eliminate a pointless macro. (cherry picked from commit 90fed17dafd94f9a34e74086f35e7e8a540e00a7)
* MFS r320605, r320610: MFC r303052, r309017 (by alc):markj2017-07-051-1/+1
| | | | | | | Omit v_cache_count when computing the number of free pages, since its value is always 0. Approved by: re (gjb, kib)
* MFC r308474, r308691, r309203, r309365, r309703, r309898, r310720,markj2017-05-231-142/+551
| | | | | r308489, r308706: Add PQ_LAUNDRY and remove PG_CACHED pages.
* MFC r314272: call vm_lowmem hook in uma_reclaim_workeravg2017-03-041-1/+1
|
* MFC r313730: try to fix RACCT_RSS accountingavg2017-02-271-2/+5
|
* MFC r306712alc2016-10-301-6/+7
| | | | | | Make the page daemon's notion of what kind of pass is being performed by vm_pageout_scan() local to vm_pageout_worker(). There is no reason to store the pass in the NUMA domain structure.
* MFC r306706alc2016-10-301-16/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | Change vm_pageout_scan() to return a value indicating whether the free page target was met. Previously, vm_pageout_worker() itself checked the length of the free page queues to determine whether vm_pageout_scan(pass >= 1)'s inactive queue scan freed enough pages to meet the free page target. Specifically, vm_pageout_worker() used vm_paging_needed(). The trouble with vm_paging_needed() is that it compares the length of the free page queues to the wakeup threshold for the page daemon, which is much lower than the free page target. Consequently, vm_pageout_worker() could conclude that the inactive queue scan succeeded in meeting its free page target when in fact it did not; and rather than immediately triggering an all-out laundering pass over the inactive queue, vm_pageout_worker() would go back to sleep waiting for the free page count to fall below the page daemon wakeup threshold again, at which point it will perform another limited (pass == 1) scan over the inactive queue. Changing vm_pageout_worker() to use vm_page_count_target() instead of vm_paging_needed() won't work because any page allocations that happen concurrently with the inactive queue scan will result in the free page count being below the target at the end of a successful scan. Instead, having vm_pageout_scan() return a value indicating success or failure is the most straightforward fix.
* MFC r303747,303982alc2016-08-271-60/+35
| | | | | | | | | | Correct errors and clean up the comments on the active queue scan. Eliminate some unnecessary blank lines. Clean up the comments and code style in and around vm_pageout_cluster(). In particular, fix factual, grammatical, and spelling errors in various comments, and remove comments that are out of place in this function.
* MFC r303244, r303399markj2016-08-141-10/+10
| | | | De-pluralize "queues" in the pagedaemon code.
* MFC r303773alc2016-08-101-1/+1
| | | | | | Correct a spelling error. Approved by: re (kib)
* MFC r303492alc2016-08-051-1/+0
| | | | | | | Remove a probe declaration that has been unused since r292469, when vm_pageout_grow_cache() was replaced. Approved by: re (gjb)
* MFC r303356 and r303465alc2016-08-011-15/+12
| | | | | | | Remove any mention of cache (PG_CACHE) pages from the comments in vm_pageout_scan(). That function has not cached pages since r284376. Approved by: re (kib)
* Fix a LOR between vnode locks and allproc_lock.kib2016-06-221-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is an order between covered vnode lock and allproc_lock, which is established by calling mountcheckdirs() while owning the covered vnode lock. mountcheckdirs() iterates over the processes, protected by allproc_lock. This order is needed and seems to be not avoidable. On the other hand, various VM daemons also need to iterate over all processes, and they lock and unlock user maps. Since unlock of the user map may trigger processing of the deferred map entries, it causes vnode locking to occur. Or, when vmspace is freed, dropping references on the vnode-backed object also lock vnodes. We get reverted order comparing with the mount/unmount order. For VM daemons, there is no need to own allproc_lock while we operate on vmspaces. If the process is held, it serves as the marker for allproc list, which allows to continue the iteration. Add _PHOLD_LITE() macro, similar to _PHOLD(), but not causing swap-in of the kernel stacks. It is used instead of _PHOLD() in vm code, since e.g. calling faultin() in OOM conditions only exaggerates the problem. Modernize comment describing PHOLD. Reported by: lists@yamagi.org Tested by: pho (previous version) Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 week Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D6679
* The flag "vm_pages_needed" has long served two distinct purposes: (1) toalc2016-05-271-34/+52
| | | | | | | | | | | | | | | | indicate that threads are waiting for free pages to become available and (2) to indicate whether a wakeup call has been sent to the page daemon. The trouble is that a single flag cannot really serve both purposes, because we have two distinct targets for when to wakeup threads waiting for free pages versus when the page daemon has completed its work. In particular, the flag will be cleared by vm_page_free() before the page daemon has met its target, and this can lead to the OOM killer being invoked prematurely. To address this problem, a new flag "vm_pageout_wanted" is introduced. Discussed with: jeff Reviewed by: kib, markj Tested by: markj Sponsored by: EMC / Isilon Storage Division
* sys/vm: minor spelling fixes in comments.pfg2016-05-021-2/+2
| | | | No functional change.
* Add more fine-grained kernel options for NUMA support.jhb2016-04-091-2/+2
| | | | | | | | | | | | | VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in the virtual memory system. DEVICE_NUMA is used to enable affinity reporting for devices such as bus_get_domain(). MAXMEMDOM must still be set to a value greater than for any NUMA support to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is enabled and the system supports NUMA. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D5782
* Introduce a new mechanism for relocating virtual pages to a new physicalalc2015-12-191-166/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | address and use this mechanism when: 1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical memory allocator's free page lists. This replaces the long-standing approach of scanning the inactive and inactive queues, converting clean pages into PG_CACHED pages and laundering dirty pages. In contrast, the new mechanism does not use PG_CACHED pages nor does it trigger a large number of I/O operations. 2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find free pages in the physical memory allocator's free page lists that are covered by the direct map. Tested by: adrian 3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable free pages in the physical memory allocator's free page lists. In the coming months, I expect that this new mechanism will be applied in other places. For example, balloon drivers should use relocation to minimize fragmentation of the guest physical address space. Make vm_phys_alloc_contig() a little smarter (and more efficient in some cases). Specifically, use vm_phys_segs[] earlier to avoid scanning free page lists that can't possibly contain suitable pages. Reviewed by: kib, markj Glanced at: jhb Discussed with: jeff Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4444
* In vm_pageout_grow_cache(), do not re-try the inactive queue whenkib2015-11-271-5/+6
| | | | | | | | | | | | | active queue scan initiated write. Re-trying from the inactive queue when doing active scan makes the loop never end if number of domains is greater than 1 and inactive or active scan cannot reach the target. Reported and tested by: Andrew Gallatin <gallatin@netflix.com> Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
* Remove unneeded includes of opt_kdtrace.h.markj2015-11-221-1/+1
| | | | | As of r258541, KDTRACE_HOOKS is defined in opt_global.h, so opt_kdtrace.h is not needed when defining SDT(9) probes.
* Rework the test which raises OOM condition. Right now, the codekib2015-11-161-14/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | checks for the swap space consumption plus checks that the amount of the free pages exceeds some limit, in case pagedeamon did not coped with the page shortage in one of the late passes. This is wrong because it does not account for the presence of the reclamaible pages in the queues which are not selectable for reclaim immediately. E.g., on the swap-less systems, large active queue easily triggered OOM. Instead, only raise OOM when pagedaemon is unable to produce a free page in several back-to-back passes. Track the failed passes per pagedaemon thread. The number of passes to trigger OOM was selected empirically and tested both on small (32M-64M i386 VM) and large (32G amd64) configurations. If the specifics of the load require tuning, sysctl vm.pageout_oom_seq sets the number of back-to-back passes which must fail before OOM is raised. Each pass takes 1/2 of seconds. Less the value, more sensible the pagedaemon is to the page shortage. In future, some heuristic to calculate the value of the tunable might be designed based on the system configuration and load. But before it can be done, the i/o system must be fixed to reliably time-out pagedaemon writes, even if waiting for the memory to proceed. Then, code can account for the in-flight page-outs and postpone OOM until all of them finished, which should reduce the need in tuning. Right now, ignoring the in-flight writes and the counter allows to break deadlocks due to write path doing sleepable memory allocations. Reported by: Dmitry Sivachenko, bde, many others Tested by: pho, bde, tuexen (arm) Reviewed by: alc Discussed with: bde, imp Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
* Do not use vmspace_resident_count() for the OOM process selection.kib2015-11-161-3/+63
| | | | | | | | | | | | | | | | | | Residency count track the number of pte entries installed into the current pmap, which does not reflect the consumption of the physical memory by the address map. Due to several mechanisms like pv entries reclamation, copy on write etc. the resident pte entries count may be much less than the amount of physical memory kept by the process. Provide the OOM-specific vm_pageout_oom_pagecount() function which estimates the amount of reclamaible memory which could be stolen if the process is killed. Reported and tested by: pho Reviewed by: alc Comments text by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
* VM daemon works in parallel with the pagedaemon threads, and, amongkib2015-11-161-1/+2
| | | | | | | | | | | | | | | | | other actions, swaps out kernel stacks of the processes. On the other hand, currentl OOM logic which selects a process to kill in the critical condition, skips process with swapped-out thread. Under some loads, this results in the big(gest) process being ignored by OOM. Do not skip a process which has inhibited thread due to the swap-out, in the OOM selection loop. Note that killing such process requires the thread stack page-in, but sometimes this is the only way to recover. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
* Ensure that deactivated pages that are not expected to be reused aremarkj2015-11-081-0/+3
| | | | | | | | | reclaimed in FIFO order by the pagedaemon. Previously we would enqueue such pages at the head of the inactive queue, yielding a LIFO reclaim order. Reviewed by: alc MFC after: 2 weeks Sponsored by: EMC / Isilon Storage Division
* Only marker is guaranteed to be present on the queue after the relockkib2015-10-181-5/+17
| | | | | | | | | | | | | | | in vm_pageout_fallback_object_lock() and vm_pageout_page_lock(). The check for the m->queue == queue assumes that the page does belong to a queue. Modify the 'unchanged' calculation bu dereferencing the marker tailq pointers, which is known to belong to the queue. Since for a page m linked to the queue, m->queue must be equal to the queue index, assert this instead of checking. In collaboration with: alc Sponsored by: The FreeBSD Foundation (kib) MFC after: 2 weeks
* Revert r289302, invalid pages can be queued, e.g. by vfs_vmio_unwire().kib2015-10-151-5/+5
| | | | | | Found by: alc Tested by: pho Sponsored by: The FreeBSD Foundation
* Invalid pages should not appear on the inactive queue. Change thekib2015-10-141-5/+5
| | | | | | | | check into an assertion. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation
* Reduce the scope of a variable to the only file where it is used.alc2015-10-031-1/+0
|
* Correct a non-fatal error in vm_pageout_worker(). vm_pageout_worker()alc2015-09-201-6/+11
| | | | | | | | | | | | | | | | should not assume that vm_pages_needed will remain set while it sleeps. Other threads can clear vm_pages_needed by performing a sufficient number of vm_page_free() calls, e.g., process termination. The effect of this error was that vm_pageout_worker() would free and/or launder pages when, in fact, there was no shortage of free pages. Rewrite a nearby comment to describe all of the possible cases and not just the most common case. The problem being that the comment made the most common case seem like the only case. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division
* To simplify upcoming changes to the inactive queue scan, change the codealc2015-09-081-18/+8
| | | | | | | | so that there is only one place where pages are freed and only one place where pages are moved to the tail of the queue. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
* Eliminate pointless requeueing of pages from terminated objects. Thesealc2015-09-051-23/+25
| | | | | | | | | | | | pages will have left the inactive queue before the page daemon performs its next scan. Also, ignore references to pages from terminated objects. This allows the clean pages to be freed a little sooner. Move some comments to their proper place, i.e., next to the code that they describe, and update other nearby comments. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
* Handle held pages earlier in the inactive queue scan.alc2015-09-011-31/+33
| | | | | Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
* In vm_pageout_scan(), simplify the logic for determining if a page can bealc2015-08-271-15/+15
| | | | | | | | paged out and apply some nearby style fixes. In collaboration with: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation, EMC / Isilon Storage Division
* Testing whether a page is dirty does not require the page lock. Moreover,alc2015-08-251-6/+10
| | | | | | | | it may involve a pmap operation that iterates over the page's PV list, so unnecessarily holding the page lock is undesirable. MFC after: 1 week Sponsored by: EMC / Isilon Storage Division
* Prevent ticks rollover from preventing vm_lowmem eventrstone2015-08-201-3/+4
| | | | | | | | | | | | | | | | Currently vm_pageout_scan() uses a ticks-based scheme to rate-limit the number of times that the vm_lowmem event will happen. However if no events happen for long enough for ticks to roll over, this leaves us in a long window in which vm_lowmem events will not happen. Replace the use of ticks with time_t to prevent rollover from ever being an issue. Reviewed by: ian MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3439
* As another piece of PG_CACHE page elimination, remove an LRU-defeating callalc2015-08-161-5/+0
| | | | | | | | | | to vm_page_try_to_cache() from vm_pageout_flush(). Other changes, most recently r286814, have made this call unnecessary. Reviewed by: kib Discussed with: jeff Tested by: pho Sponsored by: EMC / Isilon Storage Division
* The intention of r254304 was to scan the active queue continuously.alc2015-07-081-15/+19
| | | | | | | | | | | | | | | | | | However, I've observed the active queue scan stopping when there are frequent free page shortages and the inactive queue is steadily refilled by other mechanisms, such as the sequential access heuristic in vm_fault() or madvise(2). To remedy this problem, record the time of the last active queue scan, and always scan a number of pages proportional to the time since the last scan, regardless of whether that last scan was a timeout-triggered ("pass == 0") or free-page-shortage-triggered ("pass > 0") scan. Also, on a timeout-triggered scan, allow a full scan of the active queue when the system is short of inactive pages. Reviewed by: kib MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division
* Avoid pmap_is_modified() on pages that can't be mapped.alc2015-06-211-3/+5
| | | | | MFC after: 1 week Sponsored by: EMC / Isilon Storage Division
* Invalid pages do not need neither update of the activation count norkib2015-06-141-15/+15
| | | | | | | | | | | | they coould be dirty. Move the handling if the invalid pages in the inactive scan earlier. Remove some code duplication in the scan by introducing the 'drop_page' label, which centralizes the object and the page unlock. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
* As the next step in eliminating PG_CACHE pages, free rather than cachealc2015-06-141-3/+3
| | | | | | | | | | | | | pages in vm_pageout_scan(). The reactivation rate of cache pages created by vm_pageout_scan() is extremely low; typically no more than 0.5% to 2.25% of the pages are ever reactivated. At the same time, caching pages is more expensive than freeing them. For example, in a test with PostgreSQL, this change reduced the amount of time spent in the inactive queue scan by 1/6. Differential Revision: https://reviews.freebsd.org/D2805 Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
* Implement lockless resource limits.mjg2015-06-101-1/+1
| | | | | | | | | | Use the same scheme implemented to manage credentials. Code needing to look at process's credentials (as opposed to thred's) is provided with *_proc variants of relevant functions. Places which possibly had to take the proc lock anyway still use the proc pointer to access limits.
* The vmem callback to reclaim kmem arena address space on low orkib2015-05-091-1/+6
| | | | | | | | | | | | | | | | | | | fragmented conditions currently just wakes up the pagedaemon. The kmem arena is significantly smaller then the total available physical memory, which means that there are loads where kmem arena space could be exhausted, while there is a lot of pages available still. The woken up pagedaemon sees vm_pages_needed != 0, verifies the condition vm_paging_needed() which is false, clears the pass and returns back to sleep, not calling neither uma_reclaim() nor lowmem handler. To handle low kmem arena conditions, create additional pagedaemon thread which calls uma_reclaim() directly. The thread sleeps on the dedicated channel and kmem_reclaim() wakes the thread in addition to the pagedaemon. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
* Add kern.racct.enable tunable and RACCT_DISABLED config option.trasz2015-04-291-25/+34
| | | | | | | | | | | The point of this is to be able to add RACCT (with RACCT_DISABLED) to GENERIC, to avoid having to rebuild the kernel to use rctl(8). Differential Revision: https://reviews.freebsd.org/D2369 Reviewed by: kib@ MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation
* - Simplify vm_pageout_scan() by introducing a new vm_pageout_clean()jeff2015-04-071-120/+123
| | | | | | | | | | | function that does the locking and validation associated with cleaning a page. This moves 150 lines of code into its own function. - Rename vm_pageout_clean() to vm_pageout_cluster() to define what it really does; clustering nearby pages for pageout optimization. Reviewd by: alc, kib, kmacy Tested by: pho (earlier version) Sponsored by: EMC / Isilon
* - Eliminate pagequeue locking in the dirty code in vm_pageout_scan().jeff2015-03-281-13/+10
| | | | | | | | - Use a more precise series of tests to see if the page changed while we were locking the vnode. Reviewed by: alc Sponsored by: EMC / Isilon
* Add vm.panic_on_oom sysctl, which enables those who would rather panic thanwill2015-01-241-0/+8
| | | | | | | | | | | | | | | | | | | | | | kill a process, when the system runs out of memory. Defaults to off. Usually, this is most useful when the OOM condition is due to mismanagement of memory, on a system where the applications in question don't respond well to being killed. In theory, if the system is properly managed, it shouldn't be possible to hit this condition. If it does, the panic can be more desirable for some users (since it can be a good means of finding the root cause) rather than killing the largest process and continuing on its merry way. As kib@ mentions in the differential, there is also protect(1), which uses procctl(PROC_SPROTECT) to ensure that some processes are immune. However, a panic approach is still useful in some environments. This is primarily intended as a development/debugging tool. Differential Revision: D1627 Reviewed by: kib MFC after: 1 week
* Avoid calling vmspace_free() while owning the process lock. Freeingkib2015-01-241-10/+16
| | | | | | | | | | | | | | of an vm space may require obtaining sleepable locks. Hold the process to keep the pointer valid, and change trylock to lock, since there is no longer two process locks owned simultaneously in vm_pageout_oom(). Note that after the process lock is dropped, process might exec, and no longer qualify as the owner of biggest vm space. In collaboration with: rstone Sponsored by: The FreeBSD Foundation MFC after: 1 week
* Refactor ZFS ARC reclaim checks and limitssmh2014-10-031-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | Remove previously added kmem methods in favour of defines which allow diff minimisation between upstream code base. Rebalance ARC free target to be vm_pageout_wakeup_thresh by default which eliminates issue where ARC gets minimised instead of balancing with VM pageout. The restores the target point prior to r270759. Bring in missing upstream only changes which move unused code to further eliminate code differences. Add additional DTRACE probe to aid monitoring of ARC behaviour. Enable upstream i386 code paths on platforms which don't define UMA_MD_SMALL_ALLOC. Fix mixture of byte an page values in arc_memory_throttle i386 code path value assignment of available_memory. PR: 187594 Review: D702 Reviewed by: avg MFC after: 1 week X-MFC-With: r270759 & r270861 Sponsored by: Multiplay
* Fix ticks wrap issue of lowmem test in vm_pageout_scansmh2014-09-241-1/+1
| | | | | | Reviewed by: jhb (D818) MFC after: 3 days Sponsored by: Multiplay
* Refactor ZFS ARC reclaim logic to be more VM cooperativesmh2014-08-281-7/+18
| | | | | | | | | | | | | | | | | | | | | | | | Prior to this change we triggered ARC reclaim when kmem usage passed 3/4 of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC). This could lead large amounts of unused RAM e.g. on a 192GB machine with ARC the only major RAM consumer, 40GB of RAM would remain unused. The old method has also been seen to result in extreme RAM usage under certain loads, causing poor performance and stalls. We now trigger ARC reclaim when the number of free pages drops below the value defined by the new sysctl vfs.zfs.arc_free_target, which defaults to the value of vm.v_free_target. Credit to Karl Denninger for the original patch on which this update was based. PR: 191510 and 187594 Tested by: dteske MFC after: 1 week Relnotes: yes Sponsored by: Multiplay
* Back in the days when the kernel was single threaded, testingalc2014-08-261-14/+17
| | | | | | | | | | | | | | | | | | | | | "vm_paging_target() > 0" was a reasonable way of determining if the inactive queue scan met its target. However, now that other threads can be allocating pages while the inactive queue scan is running, it's an unreliable method. The effect of it being unreliable is that we can start swapping out processes when we didn't intend to. This issue has existed since the kernel was multithreaded, but the changes to the inactive queue target in 10.0-RELEASE have made its effects visible. This change introduces a more direct method for determining if the inactive queue scan met its target that is not affected by the actions of other threads. Reported by: Steve Polyack Tested by: pho, Steve Polyack (an earlier version) MFC after: 1 week Sponsored by: EMC / Isilon Storage Division
OpenPOWER on IntegriCloud