path: root/sys/vm
* jhb, 2010-10-21 (2 files, -6/+2):
  - Make 'vm_refcnt' volatile so that compilers won't be tempted to treat
    its value as a loop invariant. Currently this is a no-op because
    'atomic_cmpset_int()' clobbers all memory on current architectures.
  - Use atomic_fetchadd_int() instead of an atomic_cmpset_int() loop to
    drop a reference in vmspace_free().
  Reviewed by: alc
  MFC after: 1 month
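  A minimal sketch of the resulting reference drop; 'vmspace_dofree' is
  assumed here as the name of the internal release path:

      /* atomic_fetchadd_int() returns the value held before the add, */
      /* so seeing 1 means this call dropped the last reference. */
      if (atomic_fetchadd_int(&vm->vm_refcnt, -1) == 1)
          vmspace_dofree(vm);   /* hypothetical release helper */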
* PG_BUSY -> VPO_BUSY, PG_WANTED -> VPO_WANTED in manual pages and comments
  (avg, 2010-10-20; 1 file, -2/+2)
  Reviewed by: alc
  MFC after: 4 days
* uma_zfree(zone, NULL) should do nothing, to match free(9) (mdf,
  2010-10-19; 1 file, -0/+4)
  Noticed by: Ron Steinke <rsteinke at isilon dot com>
  MFC after: 3 days
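  A sketch of the added guard; its exact placement in the real free path
  may differ:

      void
      uma_zfree_arg(uma_zone_t zone, void *item, void *udata)
      {
          /* Match free(9): freeing NULL is a harmless no-op. */
          if (item == NULL)
              return;
          /* ... normal per-CPU cache free path ... */
      }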
* Change uma_zone_set_max to return the effective value of "nitems" after
  rounding (lstewart, 2010-10-16; 2 files, -4/+7)
  The same value can also be obtained with uma_zone_get_max, but this
  change avoids a caller having to make two back-to-back calls.
  Sponsored by: FreeBSD Foundation
  Reviewed by: gnn, jhb
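  An illustrative caller after this change ('my_zone' is a placeholder);
  a single call now yields the rounded limit:

      int eff;

      eff = uma_zone_set_max(my_zone, 1000);  /* returns effective limit */
      if (eff != 1000)
          printf("zone limit rounded up to %d items\n", eff);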
* lstewart, 2010-10-16 (2 files, -4/+35):
  - Simplify implementation of uma_zone_get_max.
  - Add uma_zone_get_cur which returns the current approximate occupancy
    of a zone. This is useful for providing stats via sysctl amongst
    other things.
  Sponsored by: FreeBSD Foundation
  Reviewed by: gnn, jhb
  MFC after: 2 weeks
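  A hedged sketch of the sysctl use case mentioned above; the handler
  name and 'my_zone' are placeholders:

      static int
      sysctl_my_zone_items(SYSCTL_HANDLER_ARGS)
      {
          int cur;

          cur = uma_zone_get_cur(my_zone);  /* approximate occupancy */
          return (sysctl_handle_int(oidp, &cur, 0, req));
      }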
* If vm_map_find() is asked to allocate a superpage-aligned region of
  virtual addresses that is greater than a superpage in size but not a
  multiple of the superpage size, then vm_map_find() is not always
  expanding the kernel pmap to support the last few small pages being
  allocated (alc, 2010-10-04; 1 file, -14/+8)
  These failures are not commonplace, so this was first noticed by someone
  porting FreeBSD to a new architecture. Previously, we grew the kernel
  page table in vm_map_findspace() when we found the first available
  virtual address. This works most of the time because we always grow the
  kernel pmap or page table by an amount that is a multiple of the
  superpage size. Now, instead, we defer the call to pmap_growkernel()
  until we are committed to a range of virtual addresses in
  vm_map_insert(). In general, there is another reason to prefer calling
  pmap_growkernel() in vm_map_insert(). It makes it possible for someone
  to do the equivalent of an mmap(MAP_FIXED) on the kernel map.
  Reported by: Svatopluk Kraus
  Reviewed by: kib@
  MFC after: 3 weeks
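  In outline, the deferred growth amounts to something like the following
  inside vm_map_insert(); this is a sketch, not the verbatim diff:

      /* Grow the kernel page table only once the range is committed. */
      if (map == kernel_map && end > kernel_vm_end)
          pmap_growkernel(end);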
* Replace an XXX comment with the appropriate code (mdf, 2010-09-20;
  1 file, -5/+1)
  Submitted by: alc
* Allow a POSIX shared memory object that is opened for read but not for
  write to nonetheless be mapped PROT_WRITE and MAP_PRIVATE, i.e.,
  copy-on-write (alc, 2010-09-19; 1 file, -1/+2)
  (This is a regression in the new implementation of POSIX shared memory
  objects that is used by HEAD and RELENG_8. This bug does not exist in
  RELENG_7's user-level, file-based implementation.)
  PR: 150260
  MFC after: 3 weeks
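  The now-working user-level case looks like this (error handling
  trimmed; the object name and length are placeholders):

      #include <sys/mman.h>
      #include <fcntl.h>

      int
      main(void)
      {
          int fd = shm_open("/example", O_RDONLY, 0);
          char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
              MAP_PRIVATE, fd, 0);

          if (p != MAP_FAILED)
              p[0] = 1;  /* copy-on-write: the object is untouched */
          return (0);
      }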
* Make refinements to r212824 (alc, 2010-09-19; 2 files, -20/+27)
  In particular, don't make vm_map_unlock_nodefer() part of the
  synchronization interface for maps. Add comments to
  vm_map_unlock_and_wait() and vm_map_wakeup() describing how they should
  be used. In particular, describe the deferred deallocations issue with
  vm_map_unlock_and_wait(). Redo the implementation of
  vm_map_unlock_and_wait() so that it passes along the caller's file and
  line information, just like the other map locking primitives.
  Reviewed by: kib
  X-MFC after: r212824
* Adopt the deferring of object deallocation for the deleted map entries
  on map unlock to the lock downgrade and later read unlock operation
  (kib, 2010-09-18; 2 files, -21/+48)
  System map entries cannot be backed by OBJT_VNODE objects, no need to
  defer deallocation for them. Map entries from user maps do not require
  the owner map for deallocation, and can be accumulated in the
  thread-local list for freeing when a user map is unlocked.
  Move the collection of entries for deferred reclamation into
  vm_map_delete(). Create helper vm_map_process_deferred(), that is
  called from locations where processing is feasible. Do not process
  deferred entries in vm_map_unlock_and_wait() since map_sleep_mtx is
  held.
  Reviewed by: alc, rstone (previous versions)
  Tested by: pho
  MFC after: 2 weeks
* Re-add r212370 now that the LOR in powerpc64 has been resolved (mdf,
  2010-09-16; 3 files, -73/+17):
  Add a drain function for struct sysctl_req, and use it for a variety of
  handlers, some of which had to do awkward things to get a large enough
  SBUF_FIXEDLEN buffer.
  Note that some sysctl handlers were explicitly outputting a trailing
  NUL byte. This behaviour was preserved, though it should not be
  necessary.
  Reviewed by: phk (original patch)
* Revert r212370, as it causes a LOR on powerpc (mdf, 2010-09-13; 3 files,
  -17/+73)
  powerpc does a few unexpected things in copyout(9) and so wiring the
  user buffer is not sufficient to perform a copyout(9) while holding a
  random mutex.
  Requested by: nwhitehorn
* Add a drain function for struct sysctl_req, and use it for a variety of
  handlers, some of which had to do awkward things to get a large enough
  FIXEDLEN buffer (mdf, 2010-09-09; 3 files, -73/+17)
  Note that some sysctl handlers were explicitly outputting a trailing
  NUL byte. This behaviour was preserved, though it should not be
  necessary.
  Reviewed by: phk
* On architectures with non-tree-based page tables like PowerPC, every
  page in a range must be checked when calling pmap_remove() (nwhitehorn,
  2010-09-09; 1 file, -2/+5)
  Calling pmap_remove() from vm_pageout_map_deactivate_pages() with the
  entire range of the map could result in attempting to demap an
  extraordinary number of pages (> 10^15), so iterate through each map
  entry and unmap each of them individually.
  MFC after: 6 weeks
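  Sketched, the per-entry loop looks like this (2010-era vm_map entry
  linkage; illustrative only):

      vm_map_entry_t entry;

      for (entry = map->header.next; entry != &map->header;
          entry = entry->next)
          pmap_remove(vm_map_pmap(map), entry->start, entry->end);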
* Fix a typo in r212281: uintptr -> uintptr_t (rstone, 2010-09-07; 1 file,
  -1/+1)
  Pointy hat to: rstone
  Approved by: emaste (mentor)
  MFC after: 2 weeks
* In munmap() downgrade the vm_map_lock to a read lock before taking a
  read lock on the pmc-sx lock (rstone, 2010-09-07; 1 file, -3/+11)
  This prevents a deadlock with pmc_log_process_mappings, which has an
  exclusive lock on pmc-sx and tries to get a read lock on a vm_map.
  Downgrading the vm_map_lock in munmap allows pmc_log_process_mappings
  to continue, preventing the deadlock.
  Without this change I could cause a deadlock on a multicore 8.1-RELEASE
  system by having one thread constantly mmap'ing and then munmap'ing a
  PROT_EXEC mapping in a loop while I repeatedly invoked and stopped
  pmcstat in system-wide sampling mode.
  Reviewed by: fabient
  Approved by: emaste (mentor)
  MFC after: 2 weeks
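  The rough shape of the fix (the pmc notification itself is elided):

      vm_map_lock(map);
      /* ... validate the range and vm_map_delete() it ... */
      vm_map_lock_downgrade(map);
      /* Holding only a read lock here lets pmc_log_process_mappings, */
      /* which also takes a map read lock, make progress. */
      vm_map_unlock_read(map);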
* vm_page.c: include opt_msgbuf.h for MSGBUF_SIZE use in vm_page_startup
  (avg, 2010-09-03; 1 file, -0/+1)
  vm_page_startup uses the MSGBUF_SIZE value for adding msgbuf pages to
  the minidump. If opt_msgbuf.h is not included and MSGBUF_SIZE is
  overridden in the kernel config, then not all msgbuf pages will be
  dumped. And most importantly, struct msgbuf itself will not be
  included. Thus the dump would look corrupted/incomplete to tools like
  kgdb, dmesg, etc. that try to access struct msgbuf as one of the first
  things they do when working on a crash dump.
  MFC after: 5 days
* Have memguard(9) crash with an easier-to-debug message on double-free
  (mdf, 2010-08-31; 1 file, -1/+5)
  Reviewed by: zml
  MFC after: 3 weeks
* The realloc case for memguard(9) will copy too many bytes when
  reallocating to a smaller-sized allocation; fix this issue (mdf,
  2010-08-31; 2 files, -0/+27)
  Noticed by: alc
  Reviewed by: alc
  Approved by: zml (mentor)
  MFC after: 3 weeks
* Add the MAP_PREFAULT_READ option to mmap(2) (alc, 2010-08-28; 1 file,
  -2/+3)
  Reviewed by: jhb, kib
* Add uma_zone_get_max() to obtain the effective limit after a call to
  uma_zone_set_max() (andre, 2010-08-16; 2 files, -0/+30)
  The UMA zone limit is not exactly set to the value supplied but rounded
  up to completely fill the backing store increment (a page normally).
  This can lead to surprising situations where the number of elements
  allocated from UMA is higher than the supplied limit value. The new get
  function reads back the effective value so that the supplied limit
  value can be adjusted to the real limit.
  Reviewed by: jeffr
  MFC after: 1 week
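  Usage sketch ('my_zone' and 'eff' are placeholders):

      int eff;

      uma_zone_set_max(my_zone, 1000);
      eff = uma_zone_get_max(my_zone);  /* rounded up, e.g. to fill a page */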
* Fix compile (mdf, 2010-08-12; 2 files, -0/+10)
  It seemed better to have memguard.c include opt_vm.h in case future
  compile-time knobs were added that it wants to use. Also add include
  guards and forward declarations to vm/memguard.h.
  Approved by: zml (mentor)
  MFC after: 1 month
* Rework memguard(9) to reserve significantly more KVA to detect
  use-after-free over a longer time (mdf, 2010-08-11; 4 files, -270/+333)
  Also release the backing pages of a guarded allocation at free(9) time
  to reduce the overhead of using memguard(9). Allow setting and varying
  the malloc type at run-time. Add knobs to allow:
  - randomly guarding memory
  - adding un-backed KVA guard pages to detect underflow and overflow
  - a lower limit on the size of allocations that are guarded
  Reviewed by: alc
  Reviewed by: brueffer, Ulrich Spörlein <uqs spoerlein net> (man page)
  Silence from: -arch
  Approved by: zml (mentor)
  MFC after: 1 month
* Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that the
  created cdev will never be destroyed (kib, 2010-08-06; 2 files, -13/+14)
  Propagate the flag to devfs vnodes as VV_ETERNALDEV. Use the flags to
  avoid acquiring devmtx and taking a thread reference on such nodes.
  In collaboration with: pho
  MFC after: 1 month
* Very rough first cut at NUMA support for the physical page allocator
  (jhb, 2010-07-27; 2 files, -7/+149)
  For now it uses a very dumb first-touch allocation policy. This will
  change in the future.
  - Each architecture indicates the maximum number of supported memory
    domains via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
  - Each cpu now has a PCPU_GET(domain) member to indicate the memory
    domain a CPU belongs to. Domain values are dense and numbered from 0.
  - When a platform supports multiple domains, the default freelist
    (VM_FREELIST_DEFAULT) is split up into N freelists, one for each
    domain. The MD code is required to populate an array of mem_affinity
    structures (see the sketch after this entry). Each entry in the array
    defines a range of memory (start and end) and a domain for the range.
    Multiple entries may be present for a single domain. The list is
    terminated by an entry where all fields are zero. This array of
    structures is used to split up phys_avail[] regions that fall in
    VM_FREELIST_DEFAULT into per-domain freelists.
  - Each memory domain has a separate lookup-array of freelists that is
    used when fulfilling a physical memory allocation. Right now the
    per-domain freelists are listed in a round-robin order for each
    domain. In the future a table such as the ACPI SLIT table may be used
    to order the per-domain lookup lists based on the penalty for each
    memory domain relative to a specific domain. The lookup lists may be
    examined via a new vm.phys.lookup_lists sysctl.
  - The first-touch policy is implemented by using PCPU_GET(domain) to
    pick a lookup list when allocating memory.
  Reviewed by: alc
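  A hypothetical MD mem_affinity table for a two-domain machine, in the
  shape described above (the address ranges are invented):

      /* Terminated by an all-zero entry; a domain may appear in */
      /* multiple ranges. */
      struct mem_affinity mem_affinity[] = {
          { 0x00000000, 0x7fffffff, 0 },   /* low 2GB  -> domain 0 */
          { 0x80000000, 0xffffffff, 1 },   /* high 2GB -> domain 1 */
          { 0, 0, 0 }
      };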
* Fix commented out resource limit check in mlockall(2) (trasz,
  2010-07-27; 1 file, -2/+1)
  It's still racy, but at least less misleading.
* Introduce exec_alloc_args() (alc, 2010-07-27; 1 file, -3/+1)
  The objective being to encapsulate the details of the string buffer
  allocation in one place. Eliminate the portion of the string buffer
  that was dedicated to storing the interpreter name. The pointer to the
  interpreter name can simply be made to point to the appropriate
  argument string.
  Reviewed by: kib
* Change the order in which the file name, arguments, environment, and
  shell command are stored in exec*()'s demand-paged string buffer (alc,
  2010-07-25; 1 file, -1/+3)
  For a "buildworld" on an 8GB amd64 multiprocessor, the new order
  reduces the number of global TLB shootdowns by 31%. It also eliminates
  about 330k page faults on the kernel address space.
  Change exec_shell_imgact() to use "args->begin_argv" consistently as
  the start of the argument and environment strings. Previously, it
  would sometimes use "args->buf", which is the start of the overall
  buffer, but no longer the start of the argument and environment
  strings.
  While I'm here, eliminate unnecessary passing of "&length" to
  copystr(), where we don't actually care about the length of the copied
  string.
  Clean up the initialization of the exec map. In particular, use the
  correct size for an entry, and express that size in the same way that
  is used when an entry is allocated. The old size was one page too
  large. (This discrepancy originated in 2004 when I rewrote
  exec_map_first_page() to use sf_buf_alloc() instead of the exec map
  for mapping the first page of the executable.)
  Reviewed by: kib
* Redo the page table page allocation on MIPS, as suggested by alc@
  (jchandra, 2010-07-21; 4 files, -74/+154)
  The UMA zone based allocation is replaced by a scheme that creates a
  new free page list for the KSEG0 region, and a new function in sys/vm
  that allocates pages from a specific free page list. This also fixes a
  race condition introduced by the UMA based page table page allocation
  code: dropping the page queue and pmap locks before the call to
  uma_zfree, and re-acquiring them afterwards, will introduce a race
  condition (noted by alc@).
  The changes are:
  - Revert the earlier changes in MIPS pmap.c that added a UMA zone for
    page table pages.
  - Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory
    that is not directly mapped (in 32bit kernel). Normal page
    allocations will first try the HIGHMEM freelist and then the
    default (direct mapped) freelist.
  - Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
    order, int req)' to vm/vm_page.c to allocate a page from a specified
    freelist (see the sketch after this entry). The MIPS page table
    pages will be allocated using this function from the freelist
    containing direct mapped pages.
  - Move the page initialization code from vm_phys_alloc_contig() to a
    new function vm_page_alloc_init(), and use this function to
    initialize pages in vm_page_alloc_freelist() too.
  - Split the function vm_phys_alloc_pages(int pool, int order) to create
    vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
    this function from both vm_page_alloc_freelist() and
    vm_phys_alloc_pages().
  Reviewed by: alc
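  Per the signature above, a MIPS page table page allocation becomes
  roughly this (the freelist index and flag combination are assumptions):

      vm_page_t m;

      /* Allocate only from the direct-mapped (KSEG0) freelist. */
      m = vm_page_alloc_freelist(VM_FREELIST_DEFAULT, 0,
          VM_ALLOC_NORMAL | VM_ALLOC_WIRED);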
* Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc() (alc,
  2010-07-09; 2 files, -7/+3)
  Consequently, the maintenance of vm_pageout_deficit can be localized to
  just two places: vm_page_alloc() and vm_pageout_scan(). This change
  also corrects an off-by-one error in the maintenance of
  vm_pageout_deficit. Historically, the buffer cache functions,
  allocbuf() and vm_hold_load_pages(), have not taken into account that
  vm_page_alloc() already increments vm_pageout_deficit by one.
  Reviewed by: kib
* Make VM_ALLOC_RETRY flag mandatory for vm_page_grab() (kib, 2010-07-08;
  2 files, -14/+13)
  Assert that the flag is always provided, and unconditionally retry
  after sleep for the busy page or failed allocation. The intent is to
  remove VM_ALLOC_RETRY eventually.
  Proposed and reviewed by: alc
* Add the ability for the allocflag argument of vm_page_grab() to specify
  the increment of vm_pageout_deficit when sleeping due to page shortage
  (kib, 2010-07-05; 2 files, -2/+13)
  Then, in allocbuf(), the code to allocate pages when extending a vmio
  buffer can be replaced by a call to vm_page_grab().
  Suggested and reviewed by: alc
  MFC after: 2 weeks
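  This lets the allocbuf() page-extension code collapse to approximately
  the following ('obj', 'pindex', and 'deficit' are placeholders; the
  flag mix is an assumption):

      m = vm_page_grab(obj, pindex, VM_ALLOC_NOBUSY | VM_ALLOC_RETRY |
          VM_ALLOC_WIRED | VM_ALLOC_COUNT(deficit));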
* Several cleanups for r209686 (kib, 2010-07-04; 1 file, -13/+6):
  - remove unused defines;
  - remove unused curgeneration argument for
    vm_object_page_collect_flush();
  - always assert that vm_object_page_clean() is called for OBJT_VNODE;
  - move vm_page_find_least() into the for() statement's initial clause.
  Submitted by: alc
* Reimplement vm_object_page_clean(), using the fact that the vm object
  memq is ordered by page index (kib, 2010-07-04; 3 files, -191/+73)
  This greatly simplifies the implementation, since we no longer need to
  mark the pages with VPO_CLEANCHK to denote the progress. It is enough
  to remember the current position by index before dropping the object
  lock.
  Remove VPO_CLEANCHK and VM_PAGER_IGNORE_CLEANCHK as unused.
  Garbage-collect the vm.msync_flush_flags sysctl.
  Suggested and reviewed by: alc
  Tested by: pho
* Introduce a helper function vm_page_find_least() (kib, 2010-07-04;
  4 files, -21/+29)
  Use it in several places, which previously inlined the function.
  Reviewed by: alc
  Tested by: pho
  MFC after: 1 week
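  Functionally (ignoring any faster lookup path the kernel may use), the
  helper behaves like:

      vm_page_t
      vm_page_find_least(vm_object_t object, vm_pindex_t pindex)
      {
          vm_page_t m;

          /* memq is sorted by pindex; the first hit is the least. */
          TAILQ_FOREACH(m, &object->memq, listq)
              if (m->pindex >= pindex)
                  return (m);
          return (NULL);
      }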
* Improve the comment and man page for vm_page_alloc() (alc, 2010-07-03;
  1 file, -2/+7)
  Specifically, document one of the optional flags; clarify which of the
  flags are optional (and which are not), and remove mention of a
  restriction on the reclamation of cached pages that no longer holds
  since version 7.
  MFC after: 1 week
* Push down the acquisition of the page queues lock into
  vm_pageout_page_stats() (alc, 2010-07-02; 1 file, -3/+2)
  In particular, avoid acquiring the page queues lock unless iterating
  over the active queue.
* Use vm_page_prev() instead of vm_page_lookup() in the implementation of
  vm_fault()'s automatic delete-behind heuristic (alc, 2010-07-02;
  1 file, -10/+12)
  vm_page_prev() is typically faster.
* With the demise of page coloring, the page queue macros no longer serve
  any useful purpose; eliminate them (alc, 2010-07-02; 4 files, -29/+16)
  Reviewed by: kib
* Simplify entry to vm_pageout_clean(); expect the page to be locked
  (alc, 2010-06-30; 1 file, -8/+4)
  Previously, the caller unlocked the page, and vm_pageout_clean()
  immediately reacquired the page lock. Also, assert rather than test
  that the page is neither busy nor held. Since vm_pageout_clean() is
  called with the object and page locked, the page can't have changed
  state since the caller verified that the page is neither busy nor held.
* Introduce vm_page_next() and vm_page_prev(), and use them in
  vm_pageout_clean() (alc, 2010-06-21; 3 files, -13/+46)
  When iterating over a range of pages, these functions can be cheaper
  than vm_page_lookup() because their implementation takes advantage of
  the vm_object's memq being ordered.
  Reviewed by: kib@
  MFC after: 3 weeks
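  The trick, sketched for vm_page_next() (vm_page_prev() mirrors it):
  because the memq is ordered, the neighbour is one list hop plus an
  index check instead of a full lookup:

      vm_page_t
      vm_page_next(vm_page_t m)
      {
          vm_page_t next;

          if ((next = TAILQ_NEXT(m, listq)) != NULL &&
              next->pindex != m->pindex + 1)
              next = NULL;  /* a hole follows m; no resident successor */
          return (next);
      }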
* Add a new column to the output of vmstat -z to indicate the number of
  times the system was forced to sleep when requesting a new allocation
  (sbruno, 2010-06-15; 3 files, -10/+19)
  Expand the debugger hook, db_show_uma, to display these results as
  well. This has proven to be very useful in out of memory situations
  when it is not known why systems have become sluggish or fail in odd
  ways.
  Reviewed by: rwatson alc
  Approved by: scottl (mentor) peter
  Obtained from: Yahoo Inc.
* Eliminate checks for a page having a NULL object in vm_pageout_scan()
  and vm_pageout_page_stats() (alc, 2010-06-14; 2 files, -42/+53)
  These checks were recently introduced by the first page locking commit,
  r207410, but they are not needed. At the same time, eliminate some
  redundant accesses to the page's object field. (These accesses should
  have been eliminated by r207410.)
  Make the assertion in vm_page_flag_set() stricter. Specifically, only
  managed pages should have PG_WRITEABLE set.
  Add a comment documenting an assertion to vm_page_flag_clear().
  It has long been the case that fictitious pages have their wire count
  permanently set to one. Add comments to vm_page_wire() and
  vm_page_unwire() documenting this. Add assertions to these functions
  as well.
  Update the comment describing vm_page_unwire(). Much of the old comment
  had little to do with vm_page_unwire(), but a lot to do with
  _vm_page_deactivate(). Move relevant parts of the old comment to
  _vm_page_deactivate().
  Only pages that belong to an object can be paged out. Therefore, it is
  pointless for vm_page_unwire() to acquire the page queues lock and
  enqueue such pages in one of the paging queues. Generally speaking,
  such pages are immediately freed after the call to vm_page_unwire().
  Previously, it was the call to vm_page_free() that reacquired the page
  queues lock and removed these pages from the paging queues. Now, we
  will never acquire the page queues lock for this case. (It is also
  worth noting that since both vm_page_unwire() and vm_page_free()
  occurred with the page locked, the page daemon never saw the page with
  its object field set to NULL.)
  Change the panic with vm_page_unwire() to provide a more precise
  message.
  Reviewed by: kib@
* Update several places that iterate over CPUs to use CPU_FOREACH()
  (jhb, 2010-06-11; 1 file, -9/+3)
* Reduce the scope of the page queues lock and the number of
  PG_REFERENCED changes in vm_pageout_object_deactivate_pages() (alc,
  2010-06-10; 3 files, -45/+42)
  Simplify this function's inner loop using TAILQ_FOREACH(), and shorten
  some of its overly long lines. Update a stale comment.
  Assert that PG_REFERENCED may be cleared only if the object containing
  the page is locked. Add a comment documenting this.
  Assert that a caller to vm_page_requeue() holds the page queues lock,
  and assert that the page is on a page queue.
  Push down the page queues lock into pmap_ts_referenced() and
  pmap_page_exists_quick(). (As of now, there are no longer any pmap
  functions that expect to be called with the page queues lock held.)
  Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever
  be passed an unmanaged page. Assert this rather than returning "0" and
  "FALSE" respectively.
  ARM: Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH().
  Push down the page queues lock inside of pmap_clearbit(), simplifying
  pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write().
  Additionally, this allows for avoiding the acquisition of the page
  queues lock in some cases.
  PowerPC/AIM: moea*_page_exists_quick() and moea*_page_wired_mappings()
  will never be called before pmap initialization is complete. Therefore,
  the check for moea_initialized can be eliminated. Push down the page
  queues lock inside of moea*_clear_bit(), simplifying
  moea*_clear_modify() and moea*_clear_reference(). The last parameter to
  moea*_clear_bit() is never used. Eliminate it.
  PowerPC/BookE: Simplify mmu_booke_page_exists_quick()'s control flow.
  Reviewed by: kib@
* Make vm_contig_grow_cache() extern, and use it when
  vm_phys_alloc_contig() fails to allocate MIPS page table pages
  (jchandra, 2010-06-04; 2 files, -9/+13)
  The current usage of VM_WAIT in case of vm_phys_alloc_contig() failure
  is not correct, because: "There is no guarantee that any of the
  available free (or cached) pages after the VM_WAIT will fall within
  the range of suitable physical addresses. Every time this function
  sleeps and a single page is freed (or cached) by someone else, this
  function will be reawakened. With a little bad luck, you could spin
  indefinitely."
  We also add low and high parameters to vm_contig_grow_cache() and
  vm_contig_launder() so that we restrict vm_contig_launder() to the
  range of pages we are interested in.
  Reported by: alc
  Reviewed by: alc
  Approved by: rrs (mentor)
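  The resulting retry pattern on the MIPS side looks roughly like this
  ('low' and 'high' bound the usable physical range; the loop itself is
  schematic):

      for (i = 0; i < tries; i++) {
          m = vm_phys_alloc_contig(1, low, high, PAGE_SIZE, 0);
          if (m != NULL)
              break;
          /* Launder and cache pages only within [low, high). */
          vm_contig_grow_cache(i, low, high);
      }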
* Do not leak the vm page lock in vm_contig_launder();
  vm_pageout_page_lock() always returns with the page locked (kib,
  2010-06-03; 1 file, -1/+3)
  Submitted by: alc
  Pointy hat to: kib
* Add assertion and comment in vm_page_flag_set() describing the
  expectations when the PG_WRITEABLE flag is set (kib, 2010-06-03;
  1 file, -0/+8)
  Reviewed by: alc
* Maintain the pretense that we support 32KB pages for the sake of the
  ia64 LINT build (alc, 2010-06-03; 1 file, -1/+1)
* Minimize the use of the page queues lock for synchronizing access to
  the page's dirty field (alc, 2010-06-02; 2 files, -12/+47)
  With the exception of one case, access to this field is now
  synchronized by the object lock.