Commit log: sys/vm
Each entry gives the commit message, followed by [author, date; files changed, lines deleted/added].
* Eliminate the acquisition and release of Giant around the call to
  pmap_mincore() in mincore(2).  Either pmap locking exists (alpha, amd64,
  i386, ia64) or pmap_mincore() is unimplemented (arm, powerpc, sparc64).
  [alc, 2004-08-02; 1 file, -2/+0]
* Add a "how" argument to uma_zone constructors and initialization
  functions so that they know whether the allocation is supposed to be able
  to sleep or not.
  * Allow uma_zone constructors and initialization functions to return
    either success or error.  Almost all of the ones in the tree currently
    return success unconditionally, but mbuf is a notable exception: the
    packet zone constructor wants to be able to fail if it cannot
    suballocate an mbuf cluster, and the mbuf allocators want to be able to
    fail in general in a MAC kernel if the MAC mbuf initializer fails.  This
    fixes the panics people are seeing when they run out of memory for mbuf
    clusters.
  * Allow debug.nosleepwithlocks on WITNESS to be disabled, without
    changing the default.
  Both bmilekic and jeff have reviewed the changes made to make failable
  zone allocations work.
  [green, 2004-08-02; 6 files, -86/+167]
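The interface change above is easiest to see in miniature.  The sketch below is a plain-C userland model of a constructor that takes a "how" argument and may fail; the names (pkt_ctor, HOW_NOWAIT, the 2048-byte cluster size) are invented for illustration and are not the kernel's uma(9) definitions.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define HOW_WAITOK  0x01    /* allocation may sleep (models M_WAITOK) */
#define HOW_NOWAIT  0x02    /* allocation must not sleep (models M_NOWAIT) */

struct packet {
    void *cluster;          /* suballocated buffer that may be unavailable */
};

/* Constructors now return success or an error instead of returning void. */
static int
pkt_ctor(void *mem, int size, void *arg, int how)
{
    struct packet *p = mem;

    (void)size;
    (void)arg;
    (void)how;              /* a real constructor would choose a blocking or
                               non-blocking suballocation based on this */
    p->cluster = malloc(2048);
    if (p->cluster == NULL)
        return (ENOMEM);    /* constructor may fail; caller must cope */
    return (0);
}

int
main(void)
{
    struct packet pkt;

    if (pkt_ctor(&pkt, sizeof(pkt), NULL, HOW_NOWAIT) != 0) {
        fprintf(stderr, "packet construction failed without sleeping\n");
        return (1);
    }
    free(pkt.cluster);
    return (0);
}

The point of the change is visible in main(): the caller checks the constructor's return value instead of assuming success, which is what lets the packet zone report exhaustion instead of panicking.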
* Push down the acquisition and release of Giant into pmap_protect() on
  those architectures without pmap locking.  Eliminate the acquisition and
  release of Giant from vm_map_protect().  (Translation: mprotect(2) runs
  to completion without touching Giant on alpha, amd64, i386, and ia64.)
  [alc, 2004-07-30; 1 file, -2/+0]
* Giant is no longer required by vm_waitproc() and vmspace_exitfree().
  Eliminate its acquisition and release around vm_waitproc() in kern_wait().
  [alc, 2004-07-30; 1 file, -1/+0]
* Fix a memory leak in the device pager which is exposed by the NVIDIA
  OpenGL driver.
  Submitted by: nvidia (possibly also tegge)
  [dfr, 2004-07-30; 1 file, -13/+41]
* Fix handling of msync(2) for character special files.
  Submitted by: nvidia
  [dfr, 2004-07-30; 1 file, -1/+3]
* Get rid of another lockmgr(9) consumer by using sx locks for the user
  maps.  We always acquire the sx lock exclusively here, but we can't use a
  mutex because we want to be able to sleep while holding the lock.  This
  is completely equivalent to what we were doing with the lockmgr(9) locks
  before.
  Approved by: alc
  [mux, 2004-07-30; 2 files, -27/+19]
* Advance the state of pmap locking on alpha, amd64, and i386.
  - Enable recursion on the page queues lock.  This allows calls to
    vm_page_alloc(VM_ALLOC_NORMAL) and UMA's obj_alloc() with the page
    queues lock held.  Such calls are made to allocate page table pages and
    pv entries.
  - The previous change enables a partial reversion of vm/vm_page.c
    revision 1.216, i.e., the call to vm_page_alloc() by vm_page_cowfault()
    now specifies VM_ALLOC_NORMAL rather than VM_ALLOC_INTERRUPT.
  - Add partial locking to pmap_copy().  (As a side-effect, pmap_copy()
    should now be faster on i386 SMP because it no longer generates IPIs
    for TLB shootdown on the other processors.)
  - Complete the locking of pmap_enter() and pmap_enter_quick().  (As of
    now, all changes to a user-level pmap on alpha, amd64, and i386 are
    performed with appropriate locking.)
  [alc, 2004-07-29; 1 file, -6/+3]
* Rework the way slab header storage space is calculated in UMA.
  - zone_large_init() stays pretty much the same.
  - zone_small_init() will try to stash the slab header in the slab page
    being allocated if the amount of calculated wasted space is less than
    UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and the regular case).
    If the amount of wasted space is >= UMA_MAX_WASTE, then UMA_ZONE_OFFPAGE
    will be set and the slab header will be allocated separately for better
    use of space.
  - uma_startup() calculates the maximum ipers required in offpage slabs
    (so that the offpage slab header zone(s) can be sized accordingly).
    The algorithm used to calculate this replaces the old calculation
    (which only happened to work coincidentally).  We now iterate over
    possible object sizes, starting from the smallest one, until we
    determine that the wasted space calculated in zone_small_init() might
    end up being greater than UMA_MAX_WASTE, at which point we use the
    found object size to compute the maximum possible ipers.  This works
    because:
    - Wasted space versus object size is a see-saw function with local
      minima all equal to zero and local maxima growing in direct
      proportion to object size.  This implies that for objects up to or
      equal to a certain object size, the see-saw remains entirely below
      UMA_MAX_WASTE, so for those object sizes it is impossible to ever go
      OFFPAGE for slab headers.
    - ipers (items per slab) versus object size is an inversely
      proportional function which falls off very quickly (very large for
      small object sizes).
    - To determine the maximum ipers we'll ever need from OFFPAGE slab
      headers, we first find the largest object size for which we are
      guaranteed never to go offpage, and use it to compute ipers (as
      though we were offpage).  Since the only object sizes allowed to go
      offpage are bigger than that object size, and since ipers versus
      object size is inversely proportional (and monotonically decreasing),
      we are guaranteed that the ipers computed is always >= what we will
      ever need in offpage slab headers.
  - Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
    padded) size of each freelist index so that offset calculations are
    fixed.  This might fix weird data corruption problems and certainly
    allows ARM to now boot to at least single-user (via simulator).
  Tested on i386 UP by me.  Tested on sparc64 SMP by fenner.  Tested on ARM
  simulator to single-user by cognet.
  [bmilekic, 2004-07-29; 2 files, -52/+176]
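The see-saw behaviour of wasted space versus object size is easy to reproduce outside the kernel.  The program below is a standalone sketch of the calculation described above; the slab size, header size, and UMA_MAX_WASTE threshold are assumed values for illustration, not the constants from uma_int.h.

#include <stdio.h>

#define SLAB_SIZE 4096              /* one page per slab (assumed) */
#define HDR_SIZE  64                /* embedded slab header size (assumed) */
#define MAX_WASTE (SLAB_SIZE / 10)  /* threshold before going OFFPAGE (assumed) */

int
main(void)
{
    int objsize, ipers, wasted;

    for (objsize = 16; objsize <= 1024; objsize += 16) {
        /* Items per slab when the header is kept on the slab page. */
        ipers = (SLAB_SIZE - HDR_SIZE) / objsize;
        /* The "see-saw": leftover space oscillates, with local maxima
           growing in proportion to objsize. */
        wasted = SLAB_SIZE - HDR_SIZE - ipers * objsize;
        printf("objsize %4d  ipers %3d  wasted %4d%s\n", objsize, ipers,
            wasted, wasted >= MAX_WASTE ? "  -> OFFPAGE header" : "");
    }
    return (0);
}

Running it shows that no object size at or below a certain point can ever exceed the waste threshold, which is the property the new uma_startup() sizing argument relies on.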
* Correct a very old error in both vm_object_madvise() (originating in
  vm/vm_object.c revision 1.88) and vm_object_sync() (originating in
  vm/vm_map.c revision 1.36): when descending a chain of backing objects,
  both use the wrong object's backing offset.  Consequently, both may
  operate on the wrong pages.
  Quoting Matt, "This could be responsible for all of the sporadic madvise
  oddness that has been reported over the years."
  Reviewed by: Matt Dillon
  [alc, 2004-07-28; 1 file, -2/+2]
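The bug is about which object's backing offset gets added while walking down a shadow chain.  The sketch below models the correct accumulation in userland; the structure and field names are invented for the example and are not the kernel's vm_object.

#include <stdio.h>

struct obj {
    struct obj *backing_object;
    long        backing_object_offset;  /* our offset into the backer */
};

static long
translate(struct obj *o, long offset)
{
    while (o->backing_object != NULL) {
        /* Use o's own backing offset before stepping down the chain. */
        offset += o->backing_object_offset;
        o = o->backing_object;
    }
    return (offset);
}

int
main(void)
{
    struct obj bottom = { NULL, 0 };
    struct obj middle = { &bottom, 0x2000 };
    struct obj top    = { &middle, 0x1000 };

    /* Offset 0x100 in 'top' corresponds to 0x3100 in 'bottom'. */
    printf("0x%lx\n", translate(&top, 0x100));
    return (0);
}

Adding the backing offset of any object other than the one currently being left behind lands you on the wrong pages of the bottom object, which is the error the commit corrects.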
* Use atomic ops for updating the vmspace's refcnt and exitingcnt.
  - Push down Giant into shmexit().  (Giant is acquired only if the vmspace
    contains shm segments.)
  - Eliminate the acquisition of Giant from proc_rwmem().
  - Reduce the scope of Giant in exit1(), uncovering the destruction of the
    address space.
  [alc, 2004-07-27; 2 files, -8/+13]
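As a rough illustration of moving the reference counts to atomic operations, the model below uses C11 atomics in place of the kernel's atomic(9) primitives; the struct and field names are stand-ins for the real vmspace fields.

#include <stdatomic.h>
#include <stdio.h>

struct vmspace_model {
    atomic_int vm_refcnt;       /* updated without holding a lock */
    atomic_int vm_exitingcnt;
};

int
main(void)
{
    struct vmspace_model vm;

    atomic_init(&vm.vm_refcnt, 1);
    atomic_init(&vm.vm_exitingcnt, 0);

    atomic_fetch_add(&vm.vm_refcnt, 1);             /* take a reference */
    if (atomic_fetch_sub(&vm.vm_refcnt, 1) == 1)    /* drop a reference */
        printf("last reference gone; vmspace would be freed\n");
    else
        printf("refcnt is now %d\n", atomic_load(&vm.vm_refcnt));
    return (0);
}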
* For years, kmem_alloc_pageable() has been misused.  Now that the last of
  these misuses has been corrected, remove it before new ones appear, such
  as arm/arm/pmap.c revision 1.8.
  [alc, 2004-07-25; 2 files, -25/+0]
* Remove spl calls.
  [alc, 2004-07-25; 1 file, -11/+0]
* Make the code and comments for vm_object_coalesce() consistent.
  [alc, 2004-07-25; 3 files, -9/+6]
* Simplify vmspace initialization.  The bcopy() of fields from the old
  vmspace to the new vmspace in vmspace_exec() is mostly wasted effort.
  With one exception, vm_swrss, the copied fields are immediately
  overwritten.  Instead, initialize these fields to zero in
  vmspace_alloc(), eliminating a bcopy() from vmspace_exec() and a bzero()
  from vmspace_fork().
  [alc, 2004-07-24; 2 files, -14/+8]
* Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of
  kmem_alloc_pageable().  The difference between these is that an errant
  memory access to the zone will be detected sooner with
  kmem_alloc_nofault().
  The following changes serve to eliminate the following lock-order
  reversal reported by witness:
    1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
    2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
    3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931
  There is no potential deadlock in this case.  However, witness is unable
  to recognize this because vm objects used by UMA have the same type as
  ordinary vm objects.  To remedy this, we make the following changes:
  - Add a mutex type argument to VM_OBJECT_LOCK_INIT().
  - Use the mutex type argument to assign distinct types to special vm
    objects such as the kernel object, kmem object, and UMA objects.
  - Define a static swap zone object for use by UMA.  (Only static objects
    are assigned a special mutex type.)
  [alc, 2004-07-22; 4 files, -13/+10]
* Fix a race in vm_page_sleep_if_busy().  Due to vm_object locking being
  incomplete, it currently has to know how to drop and pick back up the
  vm_object's mutex if it has to sleep and drop the page queue mutex.  The
  problem with this is that if the page is busy, the page can be freed and
  the object can disappear while we are sleeping.  When trying to lock
  m->object, we'd get a stale or NULL pointer and crash.  The object is now
  cached, but this makes the assumption that the object is referenced in
  some manner and will not itself disappear while it is unlocked.  Since
  this only happens if the object is locked, I had to remove an assumption
  earlier in contigmalloc() that reversed the order of locking the object
  and doing vm_page_sleep_if_busy(), not the normal order.
  [green, 2004-07-21; 1 file, -4/+12]
* Semi-gratuitous change.  Move two refcount operations to their own lines
  rather than be buried inside an if (expression).  And now that the if
  expression is the same in both exit paths, use the same ordering.
  [peter, 2004-07-21; 1 file, -2/+4]
* Move the initialization and teardown of pmaps to the vmspace zone's init
  and fini handlers.  Our vm system removes all userland mappings at exit
  prior to calling pmap_release.  It just so happens that we might as well
  reuse the pmap for the next process since the userland slate has already
  been wiped clean.
  However, there is a functional benefit to this as well.  For platforms
  that share userland and kernel context in the same pmap, it means that
  the kernel portion of a pmap remains valid after the vmspace has been
  freed (process exit) and while it is in uma's cache.  This is significant
  for i386 SMP systems with kernel context borrowing because it avoids a
  LOT of IPIs from the pmap_lazyfix() cleanup in the usual case.
  Tested on: amd64, i386, sparc64, alpha
  Glanced at by: alc
  [peter, 2004-07-21; 1 file, -3/+2]
* Remove extraneous locks on the VM free page queue mutex; it is not meant
  to be recursed upon, and could cause a deadlock inside the new
  contigmalloc (vm.old_contigmalloc=0) code.
  Submitted by: alc
  [green, 2004-07-19; 1 file, -2/+0]
* Eliminate the pte object from the pmap.  Instead, page table pages are
  allocated as "no object" pages.  Similar changes were made to the amd64
  and i386 pmap last year.  The primary reason being that maintaining a pte
  object leads to lock order violations.  A secondary reason being that the
  pte object is redundant, i.e., the page table itself can be used to
  lookup page table pages.  (Historical note: The pte object predates our
  ability to allocate "no object" pages.  Thus, the pte object was a
  necessary evil.)
  - Unconditionally check the vm object lock's status in vm_page_remove().
    Previously, this assertion could not be made on Alpha due to its use of
    a pte object.
  [alc, 2004-07-19; 1 file, -2/+0]
* Since breakage of malloc(9)/uma_zalloc(9) is totally non-optional in
  GENERIC/for WITNESS users, make sure the sysctl to disable the behavior
  is read-only and always enabled.
  [green, 2004-07-19; 1 file, -0/+6]
* Reimplement contigmalloc(9) with an algorithm which stands a greatly
  improved chance of working despite pressure from running programs.
  Instead of trying to throw a bunch of pages out to swap and hope for the
  best, only a range that can potentially fulfill contigmalloc(9)'s request
  will have its contents paged out (potentially, not forcibly) at a time.
  The new contigmalloc operation still operates in three passes, but it
  could potentially be tuned to more or less.  The first pass only looks at
  pages in the cache and free pages, so they would be thrown out without
  having to block.  If this is not enough, the subsequent passes page out
  any unwired memory.  To combat memory pressure refragmenting the section
  of memory being laundered, each page is removed from the system's free
  memory queue once it has been freed so that blocking later doesn't cause
  the memory laundered so far to get reallocated.
  The page-out operations are now blocking, as it would make little sense
  to try to push out a page, then get its status immediately afterward to
  remove it from the available free pages queue, if it's unlikely to have
  been freed.  Another change is that if KVA allocation fails, the
  allocated memory segment will be freed and not leaked.
  There is a sysctl/tunable, defaulting to on, which causes the old
  contigmalloc() algorithm to be used.  Nonetheless, I have been using
  vm.old_contigmalloc=0 for over a month.  It is safe to switch at run-time
  to see the difference it makes.
  A new interface has been used which does not require mapping the
  allocated pages into KVA: vm_page.h functions vm_page_alloc_contig() and
  vm_page_release_contig().  These are what vm.old_contigmalloc=0 uses
  internally, so the sysctl/tunable does not affect their operation.
  When using the contigmalloc(9) and contigfree(9) interfaces, memory is
  now tracked with malloc(9) stats.  Several functions have been exported
  from kern_malloc.c to allow other subsystems to use these statistics, as
  well.  This invalidates the BUGS section of the contigmalloc(9) manpage.
  [green, 2004-07-19; 2 files, -36/+273]
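The first, non-blocking pass described above amounts to scanning for a long enough run of pages that are already free or cached.  The following is a userland sketch of that scan; the page-state enum and the first_pass() helper are invented for the illustration, while the real code uses vm_page_alloc_contig() and vm_page_release_contig() as noted in the entry.

#include <stdio.h>

enum pstate { P_FREE, P_CACHE, P_ACTIVE, P_WIRED };

/* Return the start index of 'npages' contiguous free/cache pages, or -1. */
static int
first_pass(const enum pstate *pages, int total, int npages)
{
    int run = 0;

    for (int i = 0; i < total; i++) {
        run = (pages[i] == P_FREE || pages[i] == P_CACHE) ? run + 1 : 0;
        if (run == npages)
            return (i - npages + 1);
    }
    return (-1);    /* later passes may page out unwired memory */
}

int
main(void)
{
    enum pstate mem[] = { P_ACTIVE, P_FREE, P_FREE, P_CACHE, P_FREE,
        P_WIRED, P_FREE };
    int start = first_pass(mem, 7, 3);

    printf("run of 3 pages starts at index %d\n", start);  /* prints 1 */
    return (0);
}

Only if no such run exists does the real allocator move on to the blocking passes that launder pages out of a candidate range.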
* Remove the GIANT_REQUIRED preceding pmap_remove() in
  vm_pageout_map_deactivate_pages().
  [alc, 2004-07-18; 1 file, -1/+0]
* Push down the acquisition and release of the page queues lock into
  pmap_protect() and pmap_remove().  In general, they require the lock in
  order to modify a page's pv list or flags.  In some cases, however,
  pmap_protect() can avoid acquiring the lock.
  [alc, 2004-07-15; 2 files, -10/+0]
* Remove an unused and unimplemented sysctl.  (For the record, it was
  marked as unimplemented in revision 1.129 nearly six years ago.)
  [alc, 2004-07-12; 1 file, -10/+1]
* Increase the scope of the page queues lock in vm_page_alloc() to cover a
  diagnostic check that accesses the cache queue count.
  [alc, 2004-07-10; 1 file, -1/+1]
* Micro-optimize vmspace for 64-bit architectures: colocate vm_refcnt and
  vm_exitingcnt so that alignment does not result in wasted space.
  [alc, 2004-07-06; 1 file, -1/+1]
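A quick way to see why colocating the two counters matters on LP64 machines: a 4-byte int sandwiched between 8-byte pointers drags in 4 bytes of padding, whereas pairing the two ints lets them share one aligned slot.  The structs below are toy layouts written for this demonstration, not the real vmspace.

#include <stdio.h>

struct spread {          /* counters separated by pointer-sized fields */
    int   refcnt;
    void *map;           /* 8-byte alignment forces 4 bytes of padding on LP64 */
    int   exitingcnt;
    void *pmap;          /* another 4 bytes of padding on LP64 */
};

struct colocated {       /* the two ints share one 8-byte slot */
    int   refcnt;
    int   exitingcnt;
    void *map;
    void *pmap;
};

int
main(void)
{
    printf("spread: %zu bytes, colocated: %zu bytes\n",
        sizeof(struct spread), sizeof(struct colocated));
    return (0);
}

On a typical LP64 system this prints 32 versus 24 bytes; on 32-bit systems the layouts are the same size, which is why the entry calls it a 64-bit micro-optimization.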
* Properly brucify a string by outdenting it.
  [bms, 2004-07-06; 1 file, -2/+2]
* Introduce the debug.nosleepwithlocks sysctl, 0 by default.  If set to 1
  and WITNESS is not built, then force all M_WAITOK allocations to M_NOWAIT
  behavior (transparently).  This is to be used temporarily if weird
  deadlocks are reported, because we still have code paths that perform
  M_WAITOK allocations with lock(s) held, which can lead to deadlock.  If
  WITNESS is compiled, then the sysctl is ignored and we ask witness to
  tell us whether we have locks held, converting to M_NOWAIT behavior only
  if it tells us that we do.
  Note this removes the previous mbuf.h inclusion as well (only needed by
  the last revision), and cleans up unneeded [artificial] comparisons to
  just the mbuf zones.  The problem described above has nothing to do with
  previous mbuf wait behavior; it is a general problem.
  [bmilekic, 2004-07-04; 1 file, -11/+9]
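The transparent conversion described above boils down to rewriting the allocation flags before the allocator sees them.  This is a hedged userland model; the flag values and the adjust_flags() helper are inventions for the sketch, not malloc(9) internals.

#include <stdio.h>

#define M_NOWAIT_MODEL  0x0001
#define M_WAITOK_MODEL  0x0002

static int nosleepwithlocks = 0;    /* models the debug.nosleepwithlocks knob */

static int
adjust_flags(int flags)
{
    /* With the knob on (and no WITNESS to consult), never sleep. */
    if (nosleepwithlocks && (flags & M_WAITOK_MODEL))
        flags = (flags & ~M_WAITOK_MODEL) | M_NOWAIT_MODEL;
    return (flags);
}

int
main(void)
{
    nosleepwithlocks = 1;
    printf("M_WAITOK request becomes flags 0x%x\n",
        adjust_flags(M_WAITOK_MODEL));
    return (0);
}

The caller then has to tolerate a NULL return, which is why the entry stresses that this is a temporary debugging aid rather than a permanent behavior change.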
* Reextend the M_WAITOK-disabling hack to all three of the mbuf-related
  zones, and do it by direct comparison of uma_zone_t instead of strcmp.
  The mbuf subsystem used to provide M_TRYWAIT/M_DONTWAIT semantics, but
  this is mostly no longer the case.  M_WAITOK has taken over the spot
  M_TRYWAIT used to have, and for mbuf things, still may return NULL if the
  code path is incorrectly holding a mutex going into mbuf allocation
  functions.  The M_WAITOK/M_NOWAIT semantics are absolute; though it may
  deadlock the system to try to malloc or uma_zalloc something with a mutex
  held and M_WAITOK specified, it is absolutely required to not return NULL
  and will result in instability and/or security breaches otherwise.  There
  is still room to add the WITNESS_WARN() to all cases so that we are
  notified of the possibility of deadlocks, but it cannot change the value
  of the "badness" variable and allow allocation to actually fail except
  for the specialized cases which used to be M_TRYWAIT.
  [green, 2004-07-04; 1 file, -2/+4]
* Limit mbuma damage.  Suddenly ALL allocations with M_WAITOK are subject
  to failing -- that is, allocations via malloc(M_WAITOK) that are required
  to never fail -- if WITNESS is not defined.  While everyone should be
  running WITNESS, in any case, zone "Mbuf" allocations are really the only
  ones that should be screwed with by this hack.
  This hack is crashing people, and would continue to do so with or without
  WITNESS.  Things shouldn't be allocating with M_WAITOK with locks held,
  but it's not okay just to always remove M_WAITOK when !WITNESS.
  Reported by: Bernd Walter <ticso@cicely5.cicely.de>
  [green, 2004-07-03; 1 file, -4/+8]
* Implement preemption of kernel threads natively in the scheduler rather
  than as one-off hacks in various other parts of the kernel:
  - Add a function maybe_preempt() that is called from sched_add() to
    determine if a thread about to be added to a run queue should be
    preempted to directly.  If it is not safe to preempt or if the new
    thread does not have a high enough priority, then the function returns
    false and sched_add() adds the thread to the run queue.  If the thread
    should be preempted to but the current thread is in a nested critical
    section, then the flag TDF_OWEPREEMPT is set and the thread is added to
    the run queue.  Otherwise, mi_switch() is called immediately and the
    thread is never added to the run queue since it is switched to
    directly.  When exiting an outermost critical section, if TDF_OWEPREEMPT
    is set, then clear it and call mi_switch() to perform the deferred
    preemption.
  - Remove explicit preemption from ithread_schedule() as calling
    setrunqueue() now does all the correct work.  This also removes the
    do_switch argument from ithread_schedule().
  - Do not use the manual preemption code in mtx_unlock if the architecture
    supports native preemption.
  - Don't call mi_switch() in a loop during shutdown to give ithreads a
    chance to run if the architecture supports native preemption, since the
    ithreads will just preempt DELAY().
  - Don't call mi_switch() from the page zeroing idle thread for
    architectures that support native preemption, as it is unnecessary.
  - Native preemption is enabled on the same archs that supported ithread
    preemption, namely alpha, i386, and amd64.
  This change should largely be a NOP for the default case as committed,
  except that we will do fewer context switches in a few cases and will
  avoid the run queues completely when preempting.
  Approved by: scottl (with his re@ hat)
  [jhb, 2004-07-02; 1 file, -0/+2]
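The decision made by maybe_preempt() can be summarized as a small decision tree: preempt immediately if safe and worthwhile, defer with an owed-preemption flag if inside a critical section, otherwise fall back to the run queue.  The model below is a sketch under those assumptions; the types and the helper name are placeholders, not the scheduler's actual code.

#include <stdbool.h>
#include <stdio.h>

struct model_thread {
    int  priority;       /* lower value = higher priority (assumed) */
    int  critnest;       /* nested critical-section depth */
    bool owepreempt;     /* models TDF_OWEPREEMPT */
};

/* Returns true if we would switch immediately; false means enqueue. */
static bool
maybe_preempt_model(struct model_thread *curthread, int newpri)
{
    if (newpri >= curthread->priority)
        return (false);               /* new thread not important enough */
    if (curthread->critnest > 0) {
        curthread->owepreempt = true; /* defer until the section exits */
        return (false);
    }
    /* Here the real code calls mi_switch() and runs the thread directly. */
    return (true);
}

int
main(void)
{
    struct model_thread td = { .priority = 100, .critnest = 1 };

    if (!maybe_preempt_model(&td, 50))
        printf("deferred: owepreempt=%d\n", td.owepreempt);
    return (0);
}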
* Change mi_switch() and sched_switch() to accept an optional thread to
  switch to.  If a non-NULL thread pointer is passed in, then the CPU will
  switch to that thread directly rather than calling choosethread() to pick
  one.
  - Make sched_switch() aware of idle threads and know to do
    TD_SET_CAN_RUN() instead of sticking them on the run queue, rather than
    requiring all callers of mi_switch() to know to do this if they can be
    called from an idlethread.
  - Move constants for arguments to mi_switch() and thread_single() out of
    the middle of the function prototypes and up above into their own
    section.
  [jhb, 2004-07-02; 1 file, -1/+1]
* Don't use a variable to point to the user area that we only use once.
  Just use p2->p_uarea directly instead.
  - Remove an old and mostly bogus assertion regarding p2->p_sigacts.
  - Use the RANGEOF macro ala fork1() to clean up the bzero/bcopy of
    p_stats.
  [jhb, 2004-07-02; 1 file, -15/+10]
* Initialize result->backing_object_offset before linking result onto the
  list of vm objects shadowing source in vm_object_shadow().  This closes a
  race where vm_object_collapse() could be called with a partially
  uninitialized object argument, causing symptoms that looked like hardware
  problems, e.g. signal 6, 10, 11, or a /bin/sh busy-waiting for a
  nonexistent child process.
  [tegge, 2004-06-28; 1 file, -5/+5]
* Use the MIN() macro rather than the ulmin() inline, and fix a stray tab
  that snuck in with my last commit.
  Submitted by: green
  [gallatin, 2004-06-28; 1 file, -3/+3]
* Fix alpha - the use of min() on longs was losing the high bits and
  returning wrong answers, leading to strange values in
  vm2->vm_{s,t,d}size.
  [gallatin, 2004-06-28; 1 file, -3/+3]
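The truncation is easy to demonstrate: a min() that takes unsigned int arguments drops the high bits of a 64-bit value, while a type-preserving MIN() macro does not.  In the demo below, umin() is a stand-in written for the example to model that behavior.

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))   /* type-preserving macro */

/* Stand-in for a min() that only handles unsigned int. */
static unsigned int
umin(unsigned int a, unsigned int b)
{
    return (a < b ? a : b);
}

int
main(void)
{
    long long big = 0x100000003LL;  /* high bits matter */
    long long small = 7;

    printf("umin gives %lld (wrong: high bits were dropped)\n",
        (long long)umin(big, small));
    printf("MIN  gives %lld (right)\n", (long long)MIN(big, small));
    return (0);
}

The first call silently compares 3 against 7, which is the kind of wrong answer that produced the strange vmspace size values mentioned above.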
* Update a stale comment.  The heuristic to swap processes out based on the
  number of pages already paged out was broken in rev 1.10 and removed in
  rev 1.11.
  [das, 2004-06-27; 1 file, -2/+3]
* Remove an unused field from the vmspace structure.
  [alc, 2004-06-26; 1 file, -2/+1]
* Correct the tracking of various bits of the process's vmspace and vm_map
  when not propagated on fork (due to minherit(2)).  Consistency checks
  otherwise fail when the vm_map is freed and it appears not to have been
  emptied completely, causing an INVARIANTS panic in vm_map_zdtor().
  PR: kern/68017
  Submitted by: Mark W. Krentel <krentel@dreamscape.com>
  Reviewed by: alc
  [green, 2004-06-24; 1 file, -3/+44]
* Call vm_pageout_page_stats() with the page queues lock held.
  [alc, 2004-06-24; 1 file, -3/+2]
* Remove spl calls.
  [alc, 2004-06-24; 1 file, -14/+1]
* Make uma_mtx MTX_RECURSE.  Here's why:
  The general UMA lock is a recursion-allowed lock because there is a code
  path where, while we're still configured to use startup_alloc() for
  backend page allocations, we may end up in uma_reclaim() which calls
  zone_foreach(zone_drain), which grabs uma_mtx, only to later call into
  startup_alloc() because while freeing we needed to allocate a bucket.
  Since startup_alloc() also takes uma_mtx, we need to be able to recurse
  on it.
  This exact explanation is also added as a comment above mtx_init().
  Trace showing recursion reported by: Peter Holm <peter-at-holm.cc>
  [bmilekic, 2004-06-23; 1 file, -1/+11]
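For a userland feel of what a recursion-allowed lock buys here, the sketch below uses a POSIX recursive mutex as a stand-in for uma_mtx (in the kernel this property is expressed by passing MTX_RECURSE to mtx_init()).  The variable and function names are invented for the example.

#define _XOPEN_SOURCE 700   /* for PTHREAD_MUTEX_RECURSIVE on some libcs */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t uma_mtx_model;

static void
drain(void)
{
    /* Re-acquiring the lock we already hold must not deadlock. */
    pthread_mutex_lock(&uma_mtx_model);
    printf("recursed on the lock without deadlocking\n");
    pthread_mutex_unlock(&uma_mtx_model);
}

int
main(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&uma_mtx_model, &attr);

    pthread_mutex_lock(&uma_mtx_model);   /* outer acquisition (uma_reclaim path) */
    drain();                              /* inner, recursive acquisition */
    pthread_mutex_unlock(&uma_mtx_model);
    return (0);
}

Build with -lpthread; with a default (non-recursive) mutex the inner lock in drain() would deadlock, which is the situation the commit avoids.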
* In swap_pager_getpages(), bp->b_dev can be NULL, particularly for the
  case of NFS-mounted swap, so do not try to dereference it.
  While we're here, brucify the printf() call which happens when we time
  out on acquisition of vm_page_queue_mtx.
  PR: kern/67898
  Submitted by: bde (style)
  [bms, 2004-06-23; 1 file, -6/+4]
* Remove spl() calls.  Update comments to reflect the removal of spl()
  calls.  Remove '\n' from panic() format strings.  Remove some blank
  lines.
  [alc, 2004-06-19; 1 file, -53/+8]
* Second half of the dev_t cleanup.
  The big lines are:
    NODEV      -> NULL
    NOUDEV     -> NODEV
    udev_t     -> dev_t
    udev2dev() -> findcdev()
  Various minor adjustments, including handling of userland access to
  kernel space struct cdev etc.
  [phk, 2004-06-17; 2 files, -7/+7]
* Do not preset PG_BUSY on VM_ALLOC_NOOBJ pages.  Such pages are not
  accessible through an object.  Thus, PG_BUSY serves no purpose.
  [alc, 2004-06-17; 1 file, -0/+2]
* Do the dreaded s/dev_t/struct cdev */.
  Bump __FreeBSD_version accordingly.
  [phk, 2004-06-16; 2 files, -3/+3]
* Nice is a property of a process as a whole.  I mistakenly moved it to the
  ksegroup when breaking up the process structure.  Put it back in the proc
  structure.
  [julian, 2004-06-16; 2 files, -5/+2]