path: root/sys/vm
Commit message  (author, date, files changed, lines -/+)
* - Eliminate the pte object from the pmap. Instead, page table pages are  (alc, 2004-07-19, 1 file, -2/+0)
    allocated as "no object" pages. Similar changes were made to the amd64 and
    i386 pmap last year. The primary reason is that maintaining a pte object
    leads to lock order violations. A secondary reason is that the pte object
    is redundant, i.e., the page table itself can be used to look up page
    table pages. (Historical note: The pte object predates our ability to
    allocate "no object" pages. Thus, the pte object was a necessary evil.)
  - Unconditionally check the vm object lock's status in vm_page_remove().
    Previously, this assertion could not be made on Alpha due to its use of a
    pte object.
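    A minimal sketch of the kind of "no object" page-table-page allocation this
    entry describes; the function name and surrounding logic are hypothetical,
    only vm_page_alloc() and its VM_ALLOC_* flags are taken from the VM code:

        #include <sys/param.h>
        #include <vm/vm.h>
        #include <vm/vm_page.h>
        #include <vm/pmap.h>

        /* Allocate a page table page with no backing VM object. */
        static vm_page_t
        pmap_alloc_pt_page(vm_pindex_t ptepindex)
        {
                vm_page_t m;

                /*
                 * VM_ALLOC_NOOBJ: no (object, pindex) pair is recorded; the
                 * page table itself is used to look the page up again later.
                 */
                m = vm_page_alloc(NULL, ptepindex,
                    VM_ALLOC_NOOBJ | VM_ALLOC_WIRED | VM_ALLOC_ZERO);
                if (m == NULL)
                        return (NULL);          /* caller retries or fails */
                if ((m->flags & PG_ZERO) == 0)
                        pmap_zero_page(m);      /* VM_ALLOC_ZERO is only a hint */
                return (m);
        }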
* Since breakage of malloc(9)/uma_zalloc(9) is totally non-optional in  (green, 2004-07-19, 1 file, -0/+6)
    GENERIC/for WITNESS users, make sure the sysctl to disable the behavior is
    read-only and always enabled.
* Reimplement contigmalloc(9) with an algorithm which stands a greatly-  (green, 2004-07-19, 2 files, -36/+273)
    improved chance of working despite pressure from running programs. Instead
    of trying to throw a bunch of pages out to swap and hope for the best,
    only a range that can potentially fulfill contigmalloc(9)'s request will
    have its contents paged out (potentially, not forcibly) at a time.
    The new contigmalloc operation still operates in three passes, but it
    could potentially be tuned to more or less. The first pass only looks at
    pages in the cache and free pages, so they would be thrown out without
    having to block. If this is not enough, the subsequent passes page out any
    unwired memory. To combat memory pressure refragmenting the section of
    memory being laundered, each page is removed from the system's free memory
    queue once it has been freed so that blocking later doesn't cause the
    memory laundered so far to get reallocated. The page-out operations are
    now blocking, as it would make little sense to try to push out a page,
    then get its status immediately afterward to remove it from the available
    free pages queue, if it's unlikely to have been freed. Another change is
    that if KVA allocation fails, the allocated memory segment will be freed
    and not leaked.
    There is a sysctl/tunable, defaulting to on, which causes the old
    contigmalloc() algorithm to be used. Nonetheless, I have been using
    vm.old_contigmalloc=0 for over a month. It is safe to switch at run-time
    to see the difference it makes. A new interface has been used which does
    not require mapping the allocated pages into KVA: the vm_page.h functions
    vm_page_alloc_contig() and vm_page_release_contig(). These are what
    vm.old_contigmalloc=0 uses internally, so the sysctl/tunable does not
    affect their operation.
    When using the contigmalloc(9) and contigfree(9) interfaces, memory is now
    tracked with malloc(9) stats. Several functions have been exported from
    kern_malloc.c to allow other subsystems to use these statistics as well.
    This invalidates the BUGS section of the contigmalloc(9) manpage.
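    A minimal consumer-side sketch of the contigmalloc(9)/contigfree(9)
    interface discussed above, assuming a hypothetical driver and malloc type;
    the signatures follow the contigmalloc(9) manpage and should be checked
    there:

        #include <sys/param.h>
        #include <sys/systm.h>
        #include <sys/kernel.h>
        #include <sys/malloc.h>

        MALLOC_DEFINE(M_EXAMPLEDEV, "exampledev", "example contiguous buffer");

        void *
        exampledev_alloc_dma_buffer(void)
        {
                void *buf;

                /*
                 * 256KB, physically contiguous, page-aligned, anywhere below
                 * 4GB, no boundary-crossing restriction.  With M_NOWAIT the
                 * allocation can still fail under fragmentation.
                 */
                buf = contigmalloc(256 * 1024, M_EXAMPLEDEV, M_NOWAIT,
                    0ul, 0xfffffffful, PAGE_SIZE, 0ul);
                return (buf);
        }

        void
        exampledev_free_dma_buffer(void *buf)
        {
                if (buf != NULL)
                        contigfree(buf, 256 * 1024, M_EXAMPLEDEV);
        }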
* Remove the GIANT_REQUIRED preceding pmap_remove() in  (alc, 2004-07-18, 1 file, -1/+0)
    vm_pageout_map_deactivate_pages().
* Push down the acquisition and release of the page queues lock into  (alc, 2004-07-15, 2 files, -10/+0)
    pmap_protect() and pmap_remove(). In general, they require the lock in
    order to modify a page's pv list or flags. In some cases, however,
    pmap_protect() can avoid acquiring the lock.
* Remove an unused and unimplemented sysctl. (For the record, it was marked  (alc, 2004-07-12, 1 file, -10/+1)
    as unimplemented in revision 1.129 nearly six years ago.)
* Increase the scope of the page queues lock in vm_page_alloc() to cover  (alc, 2004-07-10, 1 file, -1/+1)
    a diagnostic check that accesses the cache queue count.
* Micro-optimize vmspace for 64-bit architectures: Colocate vm_refcnt and  (alc, 2004-07-06, 1 file, -1/+1)
    vm_exitingcnt so that alignment does not result in wasted space.
* Properly brucify a string by outdenting it.  (bms, 2004-07-06, 1 file, -2/+2)
* Introduce debug.nosleepwithlocks sysctl, 0 by default. If set to 1  (bmilekic, 2004-07-04, 1 file, -11/+9)
    and WITNESS is not built, then force all M_WAITOK allocations to M_NOWAIT
    behavior (transparently). This is to be used temporarily if weird
    deadlocks are reported because we still have code paths that perform
    M_WAITOK allocations with lock(s) held, which can lead to deadlock. If
    WITNESS is compiled in, then the sysctl is ignored and we ask witness to
    tell us whether we have locks held, converting to M_NOWAIT behavior only
    if it tells us that we do.
    Note this removes the previous mbuf.h inclusion as well (only needed by
    the last revision), and cleans up unneeded [artificial] comparisons to
    just the mbuf zones. The problem described above has nothing to do with
    previous mbuf wait behavior; it is a general problem.
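    A hedged sketch of the demotion this entry describes, not the actual
    kern_malloc.c code; the variable, helper name, and CTLFLAG choice here are
    illustrative only:

        #include <sys/param.h>
        #include <sys/kernel.h>
        #include <sys/malloc.h>
        #include <sys/sysctl.h>

        SYSCTL_DECL(_debug);

        static int nosleepwithlocks = 0;
        SYSCTL_INT(_debug, OID_AUTO, nosleepwithlocks, CTLFLAG_RW,
            &nosleepwithlocks, 0,
            "Convert M_WAITOK allocations to M_NOWAIT");

        /* Demote a caller's wait flag before it reaches the backend allocator. */
        static int
        demote_malloc_flags(int flags)
        {
                if (nosleepwithlocks && (flags & M_WAITOK)) {
                        flags &= ~M_WAITOK;
                        flags |= M_NOWAIT;
                }
                return (flags);
        }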
* Reextend the M_WAITOK-disabling-hack to all three of the mbuf-related  (green, 2004-07-04, 1 file, -2/+4)
    zones, and do it by direct comparison of uma_zone_t instead of strcmp.
    The mbuf subsystem used to provide M_TRYWAIT/M_DONTWAIT semantics, but
    this is mostly no longer the case. M_WAITOK has taken over the spot
    M_TRYWAIT used to have, and for mbuf things it may still return NULL if
    the code path is incorrectly holding a mutex going into mbuf allocation
    functions. The M_WAITOK/M_NOWAIT semantics are absolute; though it may
    deadlock the system to try to malloc or uma_zalloc something with a mutex
    held and M_WAITOK specified, it is absolutely required to not return NULL
    and will result in instability and/or security breaches otherwise. There
    is still room to add the WITNESS_WARN() to all cases so that we are
    notified of the possibility of deadlocks, but it cannot change the value
    of the "badness" variable and allow allocation to actually fail except
    for the specialized cases which used to be M_TRYWAIT.
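    The caller-side contract those semantics imply, as a minimal sketch
    (malloc type and function names are hypothetical): M_NOWAIT may return
    NULL and must be checked; M_WAITOK may sleep and, for malloc(9) and
    uma_zalloc(9) proper, must not return NULL, so it must never be used with
    a mutex held.

        #include <sys/param.h>
        #include <sys/kernel.h>
        #include <sys/malloc.h>

        MALLOC_DEFINE(M_EXAMPLE, "example", "example allocations");

        void *
        alloc_fast_path(unsigned long len)
        {
                void *p;

                p = malloc(len, M_EXAMPLE, M_NOWAIT);  /* may fail */
                if (p == NULL)
                        return (NULL);                  /* caller handles failure */
                return (p);
        }

        void *
        alloc_sleep_ok(unsigned long len)
        {
                /* No locks may be held here; the allocator may sleep. */
                return (malloc(len, M_EXAMPLE, M_WAITOK));
        }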
* Limit mbuma damage. Suddenly ALL allocations with M_WAITOK are subject  (green, 2004-07-03, 1 file, -4/+8)
    to failing -- that is, allocations via malloc(M_WAITOK) that are required
    to never fail -- if WITNESS is not defined. While everyone should be
    running WITNESS, in any case, zone "Mbuf" allocations are really the only
    ones that should be screwed with by this hack. This hack is crashing
    people, and would continue to do so with or without WITNESS. Things
    shouldn't be allocating with M_WAITOK with locks held, but it's not okay
    just to always remove M_WAITOK when !WITNESS.
    Reported by: Bernd Walter <ticso@cicely5.cicely.de>
* Implement preemption of kernel threads natively in the scheduler rather  (jhb, 2004-07-02, 1 file, -0/+2)
    than as one-off hacks in various other parts of the kernel:
    - Add a function maybe_preempt() that is called from sched_add() to
      determine if a thread about to be added to a run queue should be
      preempted to directly. If it is not safe to preempt or if the new thread
      does not have a high enough priority, then the function returns false
      and sched_add() adds the thread to the run queue. If the thread should
      be preempted to but the current thread is in a nested critical section,
      then the flag TDF_OWEPREEMPT is set and the thread is added to the run
      queue. Otherwise, mi_switch() is called immediately and the thread is
      never added to the run queue since it is switched to directly. When
      exiting an outermost critical section, if TDF_OWEPREEMPT is set, then
      clear it and call mi_switch() to perform the deferred preemption.
    - Remove explicit preemption from ithread_schedule() as calling
      setrunqueue() now does all the correct work. This also removes the
      do_switch argument from ithread_schedule().
    - Do not use the manual preemption code in mtx_unlock if the architecture
      supports native preemption.
    - Don't call mi_switch() in a loop during shutdown to give ithreads a
      chance to run if the architecture supports native preemption since the
      ithreads will just preempt DELAY().
    - Don't call mi_switch() from the page zeroing idle thread for
      architectures that support native preemption as it is unnecessary.
    - Native preemption is enabled on the same archs that supported ithread
      preemption, namely alpha, i386, and amd64.
    This change should largely be a NOP for the default case as committed
    except that we will do fewer context switches in a few cases and will
    avoid the run queues completely when preempting.
    Approved by: scottl (with his re@ hat)
* - Change mi_switch() and sched_switch() to accept an optional thread to  (jhb, 2004-07-02, 1 file, -1/+1)
    switch to. If a non-NULL thread pointer is passed in, then the CPU will
    switch to that thread directly rather than calling choosethread() to pick
    a thread to switch to.
  - Make sched_switch() aware of idle threads and know to do
    TD_SET_CAN_RUN() instead of sticking them on the run queue, rather than
    requiring all callers of mi_switch() to know to do this if they can be
    called from an idle thread.
  - Move constants for arguments to mi_switch() and thread_single() out of
    the middle of the function prototypes and up above into their own
    section.
* - Don't use a variable to point to the user area that we only use once.  (jhb, 2004-07-02, 1 file, -15/+10)
    Just use p2->p_uarea directly instead.
  - Remove an old and mostly bogus assertion regarding p2->p_sigacts.
  - Use RANGEOF macro ala fork1() to clean up bzero/bcopy of p_stats.
* Initialize result->backing_object_offset before linking result onto the list of  (tegge, 2004-06-28, 1 file, -5/+5)
    vm objects shadowing source in vm_object_shadow(). This closes a race
    where vm_object_collapse() could be called with a partially uninitialized
    object argument, causing symptoms that looked like hardware problems,
    e.g. signal 6, 10, 11, or a /bin/sh busy-waiting for a nonexistent child
    process.
* Use MIN() macro rather than ulmin() inline, and fix stray tab  (gallatin, 2004-06-28, 1 file, -3/+3)
    that snuck in with my last commit.
    Submitted by: green
* Fix alpha - the use of min() on longs was losing the high bits and  (gallatin, 2004-06-28, 1 file, -3/+3)
    returning wrong answers, leading to strange values in
    vm2->vm_{s,t,d}size.
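    A short illustration of the truncation problem: the kernel's min() takes
    and returns u_int, so long arguments are silently cut to their low 32
    bits, while the MIN() macro is type-generic (values below are
    illustrative only):

        #include <sys/param.h>          /* MIN() */
        #include <sys/libkern.h>        /* min(), ulmin() */

        static void
        min_example(void)
        {
                long a = 0x100000001L;  /* needs more than 32 bits */
                long b = 0x100000002L;
                long r1, r2, r3;

                r1 = min(a, b);         /* compares the truncated low 32 bits
                                           and returns u_int: yields 1 */
                r2 = MIN(a, b);         /* type-generic: yields 0x100000001 */
                r3 = ulmin(a, b);       /* u_long version also keeps the bits */
                (void)r1; (void)r2; (void)r3;
        }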
* Update a stale comment. The heuristic to swap processes out based on  (das, 2004-06-27, 1 file, -2/+3)
    the number of pages already paged out was broken in rev 1.10 and removed
    in rev 1.11.
* Remove an unused field from the vmspace structure.  (alc, 2004-06-26, 1 file, -2/+1)
* Correct the tracking of various bits of the process's vmspace and vm_map  (green, 2004-06-24, 1 file, -3/+44)
    when not propagated on fork (due to minherit(2)). Consistency checks
    otherwise fail when the vm_map is freed and it appears to have not been
    emptied completely, causing an INVARIANTS panic in vm_map_zdtor().
    PR: kern/68017
    Submitted by: Mark W. Krentel <krentel@dreamscape.com>
    Reviewed by: alc
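    A minimal userland sketch of the minherit(2) case this fix is about:
    marking a mapping as not inherited across fork(2), so the child never
    sees the range (error handling kept deliberately brief):

        #include <sys/mman.h>
        #include <stdlib.h>
        #include <unistd.h>

        int
        main(void)
        {
                size_t len = 4 * 4096;
                void *p;

                p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_ANON | MAP_PRIVATE, -1, 0);
                if (p == MAP_FAILED)
                        return (1);
                /* Children created by fork() will not inherit this range. */
                if (minherit(p, len, INHERIT_NONE) != 0)
                        return (1);
                if (fork() == 0)
                        _exit(0);       /* child: range was not propagated */
                return (0);
        }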
* Call vm_pageout_page_stats() with the page queues lock held.  (alc, 2004-06-24, 1 file, -3/+2)
* Remove spl calls.  (alc, 2004-06-24, 1 file, -14/+1)
* Make uma_mtx MTX_RECURSE. Here's why:  (bmilekic, 2004-06-23, 1 file, -1/+11)
    The general UMA lock is a recursion-allowed lock because there is a code
    path where, while we're still configured to use startup_alloc() for
    backend page allocations, we may end up in uma_reclaim() which calls
    zone_foreach(zone_drain), which grabs uma_mtx, only to later call into
    startup_alloc() because while freeing we needed to allocate a bucket.
    Since startup_alloc() also takes uma_mtx, we need to be able to recurse
    on it. This exact explanation is also added as a comment above mtx_init().
    Trace showing recursion reported by: Peter Holm <peter-at-holm.cc>
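    A minimal sketch of declaring a recursion-capable mutex of the kind
    described above; the lock name and variable here are illustrative, not
    the actual uma_mtx:

        #include <sys/param.h>
        #include <sys/lock.h>
        #include <sys/mutex.h>

        static struct mtx example_mtx;

        static void
        example_lock_init(void)
        {
                /*
                 * MTX_RECURSE lets the owning thread re-acquire the lock,
                 * which is needed when a code path can re-enter through the
                 * allocator while already holding it.
                 */
                mtx_init(&example_mtx, "example lock", NULL,
                    MTX_DEF | MTX_RECURSE);
        }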
* In swap_pager_getpages(), bp->b_dev can be NULL, particularly for the  (bms, 2004-06-23, 1 file, -6/+4)
    case of NFS mounted swap, so do not try to dereference it. While we're
    here, brucify the printf() call which happens when we time out on
    acquisition of vm_page_queue_mtx.
    PR: kern/67898
    Submitted by: bde (style)
* Remove spl() calls. Update comments to reflect the removal of spl() calls.  (alc, 2004-06-19, 1 file, -53/+8)
    Remove '\n' from panic() format strings. Remove some blank lines.
* Second half of the dev_t cleanup.  (phk, 2004-06-17, 2 files, -7/+7)
    The big lines are:
        NODEV -> NULL
        NOUDEV -> NODEV
        udev_t -> dev_t
        udev2dev() -> findcdev()
    Various minor adjustments including handling of userland access to kernel
    space struct cdev etc.
* Do not preset PG_BUSY on VM_ALLOC_NOOBJ pages. Such pages are not  (alc, 2004-06-17, 1 file, -0/+2)
    accessible through an object. Thus, PG_BUSY serves no purpose.
* Do the dreaded s/dev_t/struct cdev */  (phk, 2004-06-16, 2 files, -3/+3)
    Bump __FreeBSD_version accordingly.
* Nice is a property of a process as a whole.  (julian, 2004-06-16, 2 files, -5/+2)
    I mistakenly moved it to the ksegroup when breaking up the process
    structure. Put it back in the proc structure.
* Make contigmalloc() more reliable:  (green, 2004-06-15, 1 file, -6/+25)
    1. Remove a race whereby contigmalloc() would deadlock against the
       running processes in the system if they kept reinstantiating the
       memory on the active and inactive page queues that it was trying to
       flush out. The process doing the contigmalloc() would sit in "swwrt"
       forever and the swap pager would be going at full force, but never get
       anywhere. Instead of doing it until the queues are empty, launder for
       as many iterations as there are pages in the queue.
    2. Do all laundering to swap synchronously; previously, the vnode
       laundering was synchronous and the swap laundering was not.
    3. Increase the number of launder-or-allocate passes to three, from two,
       while failing without bothering to do all the laundering on the third
       pass if allocation was not possible. This effectively gives exactly
       two chances to launder enough contiguous memory, helpful with high
       memory churn where a lot of memory from one pass to the next (and
       during a single laundering loop) becomes dirtied again.
    I can now reliably hot-plug hardware requiring a 256KB contigmalloc()
    without having the kldload/cbb ithread sit around failing to make
    progress, while running a busy X session. Previously, it took killing X
    to get contigmalloc() to get further (that is, quiescing the system), and
    even then contigmalloc() returned failure.
* Deorbit COMPAT_SUNOS.  (phk, 2004-06-11, 1 file, -2/+2)
    We inherited this from the sparc32 port of BSD4.4-Lite1. We have neither
    a sparc32 port nor a SunOS4.x compatibility desire these days.
* Backout previous change; I think Julian has a better solution which  (bmilekic, 2004-06-09, 1 file, -1/+1)
    does not require type-stable refcnts here.
* Make the slabrefzone, the zone from which we allocated slabs with  (bmilekic, 2004-06-09, 1 file, -1/+2)
    internal reference counters, UMA_ZONE_NOFREE. This way, those slabs (with
    their ref counts) will be effectively type-stable, so using a trick like
    this on the refcount is no longer dangerous:

        MEXT_REM_REF(m);
        if (atomic_cmpset_int(m->m_ext.ref_cnt, 0, 1)) {
                if (m->m_ext.ext_type == EXT_PACKET) {
                        uma_zfree(zone_pack, m);
                        return;
                } else if (m->m_ext.ext_type == EXT_CLUSTER) {
                        uma_zfree(zone_clust, m->m_ext.ext_buf);
                        m->m_ext.ext_buf = NULL;
                } else {
                        (*(m->m_ext.ext_free))(m->m_ext.ext_buf,
                            m->m_ext.ext_args);
                        if (m->m_ext.ext_type != EXT_EXTREF)
                                free(m->m_ext.ref_cnt, M_MBUF);
                }
        }
        uma_zfree(zone_mbuf, m);

    Previously, a second thread hitting the above cmpset might actually read
    the refcnt AFTER it has already been freed. A very rare occurrence. Now
    we'll know that it won't be freed, though.
    Spotted by: julian, pjd
* Remove references to L1 in the comments; according to Alan they are  (netchild, 2004-06-07, 1 file, -2/+2)
    historical leftovers.
    Approved by: alc
* Update stale comments regarding page coloring.  (alc, 2004-06-05, 1 file, -10/+10)
* Move the definitions of SWAPBLK_NONE and SWAPBLK_MASK from vm_page.h to  (alc, 2004-06-04, 1 file, -8/+0)
    blist.h, enabling the removal of numerous #includes from subr_blist.c.
    (subr_blist.c and swap_pager.c are the only users of these definitions.)
* Fix a comment above uma_zsecond_create(), describing its arguments.  (bmilekic, 2004-06-01, 1 file, -3/+3)
    It doesn't take 'align' and 'flags' but 'master' instead, which is a
    reference to the Master Zone, containing the backing Keg.
    Pointed out by: Tim Robbins (tjr)
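    A hedged sketch contrasting the two prototypes being described; the zone
    names and sizes are illustrative, and <vm/uma.h> remains the
    authoritative reference for the exact declarations:

        #include <sys/param.h>
        #include <vm/uma.h>

        static uma_zone_t example_master, example_secondary;

        static void
        example_zone_init(void)
        {
                /*
                 * Primary zone: supplies item size, alignment and flags, and
                 * gets its own backing Keg.
                 */
                example_master = uma_zcreate("example_master", 256,
                    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);

                /*
                 * Secondary zone: no 'align'/'flags' arguments; the final
                 * argument is the master zone whose Keg it stacks on.
                 */
                example_secondary = uma_zsecond_create("example_secondary",
                    NULL, NULL, NULL, NULL, example_master);
        }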
* Bring in mbuma to replace mballoc.  (bmilekic, 2004-05-31, 5 files, -355/+832)
    mbuma is an Mbuf & Cluster allocator built on top of a number of
    extensions to the UMA framework, all included herein.
    Extensions to UMA worth noting:
    - Better layering between slab <-> zone caches; introduce Keg structure
      which splits off slab cache away from the zone structure and allows
      multiple zones to be stacked on top of a single Keg (single type of
      slab cache); perhaps we should look into defining a subset API on top
      of the Keg for special use by malloc(9), for example.
    - UMA_ZONE_REFCNT zones can now be added, and reference counters
      automagically allocated for them within the end of the associated slab
      structures. uma_find_refcnt() does a kextract to fetch the slab struct
      reference from the underlying page, and looks up the corresponding
      refcnt.
    mbuma things worth noting:
    - Integrates mbuf & cluster allocations with extended UMA and provides
      caches for commonly-allocated items; defines several zones (two
      primary, one secondary) and two kegs.
    - Changes up certain code paths that always used to do m_get() +
      m_clget() to instead just use m_getcl() and try to take advantage of
      the newly defined secondary Packet zone.
    - netstat(1) and systat(1) quickly hacked up to do basic stat reporting,
      but additional stats work needs to be done once some other details
      within UMA have been taken care of and it becomes clearer how stats
      will work within the modified framework.
    From the user perspective, one implication is that the NMBCLUSTERS
    compile-time option is no longer used. The maximum number of clusters is
    still capped off according to maxusers, but it can be made unlimited by
    setting the kern.ipc.nmbclusters boot-time tunable to zero. Work should
    be done to write an appropriate sysctl handler allowing dynamic tuning of
    kern.ipc.nmbclusters at runtime.
    Additional things worth noting/known issues (READ):
    - One report of the 'ips' (ServeRAID) driver acting really slow in
      conjunction with mbuma. Need more data. Latest report is that ips is
      equally sucking with and without mbuma.
    - Giant leak in NFS code sometimes occurs; can't reproduce but currently
      analyzing. brueffer is able to reproduce, but THIS IS NOT an
      mbuma-specific problem and currently occurs even WITHOUT mbuma.
    - Issues in network locking: there is at least one code path in the rip
      code where one or more locks are acquired and we end up in m_prepend()
      with M_WAITOK, which causes WITNESS to whine from within UMA. Current
      temporary solution: force all UMA allocations to be M_NOWAIT from
      within UMA for now to avoid deadlocks unless WITNESS is defined and we
      can determine with certainty that we're not holding any locks when
      we're M_WAITOK.
    - I've seen at least one weird socketbuffer empty-but-mbuf-still-attached
      panic. I don't believe this to be related to mbuma, but please keep
      your eyes open, turn on debugging, and capture crash dumps.
    This change removes more code than it adds.
    A paper is available detailing the change and considering various
    performance issues; it was presented at BSDCan2004:
    http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
    Please read the paper for Future Work and implementation details, as well
    as credits.
    Testing and Debugging: rwatson, brueffer, Ketrien I. Saihr-Kesenchedra, ...
    Reviewed by: Lots of people (for different parts)
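    A minimal sketch of the code-path change described above: a single
    m_getcl() call (which can hit the secondary Packet zone) in place of the
    old m_get() + m_clget() pair. The function name is hypothetical, and the
    M_DONTWAIT/MT_DATA spellings are those of the era; check mbuf(9) for the
    current ones:

        #include <sys/param.h>
        #include <sys/mbuf.h>

        struct mbuf *
        get_packet_mbuf(void)
        {
                struct mbuf *m;

                /*
                 * Old style (two allocations, two cache lookups):
                 *     m = m_get(M_DONTWAIT, MT_DATA);
                 *     if (m != NULL)
                 *             m_clget(m, M_DONTWAIT);
                 */
                m = m_getcl(M_DONTWAIT, MT_DATA, 0);
                if (m == NULL)
                        return (NULL);          /* allocation can fail */
                return (m);
        }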
* Remove a stale comment: PG_DIRTY and PG_FILLED were removed in  (alc, 2004-05-30, 1 file, -2/+0)
    revisions 1.17 and 1.12 respectively.
* Correct typo: vm_page_list_find() has been called vm_pageq_find() for quite a  (hmp, 2004-05-30, 1 file, -2/+2)
    long time, i.e., since the cleanup of the VM Page-queues code done two
    years ago.
    Reviewed by: Alan Cox <alc at freebsd.org>,
                 Matthew Dillon <dillon at backplane.com>
* MFS: vm_map.c rev 1.187.2.27 through 1.187.2.29, fix MS_INVALIDATE  (des, 2004-05-25, 1 file, -1/+5)
    semantics but provide a sysctl knob for reverting to old ones.
* Back out previous commit; it went to the wrong file.  (des, 2004-05-25, 1 file, -8/+1)
* MFS: rev 1.187.2.27 through 1.187.2.29, fix MS_INVALIDATE semantics but  (des, 2004-05-25, 1 file, -1/+8)
    provide a sysctl knob for reverting to old ones.
* Correct two error cases in vm_map_unwire():  (alc, 2004-05-25, 1 file, -4/+5)
    1. Contrary to the Single Unix Specification, our implementation of
       munlock(2) when performed on an unwired virtual address range has
       returned an error. Correct this. Note, however, that the behavior of
       "system" unwiring is unchanged; only "user" unwiring is changed. If
       "system" unwiring is performed on an unwired virtual address range, an
       error is still returned.
    2. Performing an errant "system" unwiring on a virtual address range that
       was "user" (i.e., mlock(2)) but not "system" wired would incorrectly
       undo the "user" wiring instead of returning an error. Correct this.
    Discussed with: green@
    Reviewed by: tegge@
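    A small userland illustration of the first fix: munlock(2) on a range
    that was never mlock(2)ed now succeeds, as the Single Unix Specification
    requires (sketch only; error handling is minimal):

        #include <sys/mman.h>
        #include <stdlib.h>

        int
        main(void)
        {
                size_t len = 4096;
                void *p;

                p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_ANON | MAP_PRIVATE, -1, 0);
                if (p == MAP_FAILED)
                        return (1);
                /*
                 * The range was never wired with mlock(2); previously this
                 * returned an error, now the "user" unwiring is accepted.
                 */
                if (munlock(p, len) != 0)
                        return (1);
                munmap(p, len);
                return (0);
        }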
* To date, unwiring a fictitious page has produced a panic. The reason  (alc, 2004-05-22, 4 files, -18/+29)
    being that PHYS_TO_VM_PAGE() returns the wrong vm_page for fictitious
    pages but unwiring uses PHYS_TO_VM_PAGE(). The resulting panic reported
    an unexpected wired count. Rather than attempting to fix
    PHYS_TO_VM_PAGE(), this fix takes advantage of the properties of
    fictitious pages. Specifically, fictitious pages will never be completely
    unwired. Therefore, we can keep a fictitious page's wired count forever
    set to one and thereby avoid the use of PHYS_TO_VM_PAGE() when we know
    that we're working with a fictitious page, just not which one.
    In collaboration with: green@, tegge@
    PR: kern/29915
* Restructure vm_page_select_cache() so that adding assertions is easy.  (alc, 2004-05-12, 1 file, -10/+15)
    Some of the conditions that caused vm_page_select_cache() to deactivate a
    page were wrong. For example, deactivating an unmanaged or wired page is
    a nop. Thus, if vm_page_select_cache() had ever encountered an unmanaged
    or wired page, it would have looped forever. Now, we assert that the page
    is neither unmanaged nor wired.
* Cache queue pages are not mapped. Thus, the pmap_remove_all() by  (alc, 2004-05-12, 1 file, -1/+0)
    vm_pageout_scan()'s loop for freeing cache queue pages is unnecessary.
* To handle orphaned character device vnodes properly in mmap(), check that  (tjr, 2004-05-11, 1 file, -1/+1)
    v_mount is non-null before dereferencing it. If it's null, behave as if
    MNT_NOEXEC was not set on the mount that originally contained it.
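    A hedged sketch of the check being described (not the exact vm_mmap()
    code); the helper name is hypothetical:

        #include <sys/param.h>
        #include <sys/mount.h>
        #include <sys/vnode.h>

        static int
        mount_noexec_p(struct vnode *vp)
        {
                /*
                 * vp->v_mount can be NULL for an orphaned character device
                 * vnode; treat that case as if MNT_NOEXEC were not set.
                 */
                if (vp->v_mount != NULL &&
                    (vp->v_mount->mnt_flag & MNT_NOEXEC) != 0)
                        return (1);
                return (0);
        }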
* Cache queue pages are not mapped. Thus, the pmap_remove_all() by  (alc, 2004-05-09, 1 file, -1/+0)
    vm_page_alloc() is unnecessary.