path: root/sys/kern/kern_malloc.c
Commit history (each entry lists the commit message, author, date, files changed, and -removed/+added line counts)
* Reimplement contigmalloc(9) with an algorithm which stands a greatly
  improved chance of working despite pressure from running programs.
  (green, 2004-07-19, 1 file, -27/+47)

  Instead of trying to throw a bunch of pages out to swap and hope for
  the best, only a range that can potentially fulfill contigmalloc(9)'s
  request will have its contents paged out (potentially, not forcibly)
  at a time.

  The new contigmalloc operation still operates in three passes, but it
  could potentially be tuned to more or less. The first pass only looks
  at pages in the cache and free pages, so they would be thrown out
  without having to block. If this is not enough, the subsequent passes
  page out any unwired memory. To combat memory pressure refragmenting
  the section of memory being laundered, each page is removed from the
  system's free memory queue once it has been freed so that blocking
  later doesn't cause the memory laundered so far to get reallocated.

  The page-out operations are now blocking, as it would make little
  sense to try to push out a page, then get its status immediately
  afterward to remove it from the available free pages queue, if it's
  unlikely to have been freed. Another change is that if KVA allocation
  fails, the allocated memory segment will be freed and not leaked.

  There is a sysctl/tunable, defaulting to on, which causes the old
  contigmalloc() algorithm to be used. Nonetheless, I have been using
  vm.old_contigmalloc=0 for over a month. It is safe to switch at
  run-time to see the difference it makes.

  A new interface has been used which does not require mapping the
  allocated pages into KVA: the vm_page.h functions
  vm_page_alloc_contig() and vm_page_release_contig(). These are what
  vm.old_contigmalloc=0 uses internally, so the sysctl/tunable does not
  affect their operation.

  When using the contigmalloc(9) and contigfree(9) interfaces, memory
  is now tracked with malloc(9) stats. Several functions have been
  exported from kern_malloc.c to allow other subsystems to use these
  statistics as well. This invalidates the BUGS section of the
  contigmalloc(9) manpage.
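  For context, a minimal sketch of using the contigmalloc(9) and
  contigfree(9) interfaces this commit reworks; the malloc type and the
  DMA-style constraint values are illustrative, not taken from the
  commit:

      #include <sys/param.h>
      #include <sys/systm.h>
      #include <sys/malloc.h>

      static int
      example_dma_alloc(void)
      {
              void *buf;

              /* 64KB, physically contiguous, page-aligned, below 4GB;
               * the constraint values are only an example. */
              buf = contigmalloc(65536, M_DEVBUF, M_WAITOK,
                  0,                  /* low physical address */
                  0xffffffffUL,       /* high physical address */
                  PAGE_SIZE,          /* alignment */
                  0);                 /* boundary: 0 = no restriction */
              if (buf == NULL)
                      return (ENOMEM);
              /* ... use buf ... */
              contigfree(buf, 65536, M_DEVBUF);
              return (0);
      }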
* Update for the KDB framework. (marcel, 2004-07-10, 1 file, -2/+3)

  o Make debugging code conditional upon KDB instead of DDB.
  o Call kdb_enter() instead of Debugger().
  o Call kdb_backtrace() instead of db_print_backtrace() or backtrace().

  kern_mutex.c:
  o Replace checks for db_active with checks for kdb_active and make
    them unconditional.

  kern_shutdown.c:
  o s/DDB_UNATTENDED/KDB_UNATTENDED/g
  o s/DDB_TRACE/KDB_TRACE/g
  o Save the TID of the thread doing the kernel dump so the debugger
    knows which thread to select as the current when debugging the
    kernel core file.
  o Clear kdb_active instead of db_active and do so unconditionally.
  o Remove backtrace() implementation.

  kern_synch.c:
  o Call kdb_reenter() instead of db_error().
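  The conversion pattern this implies, sketched with a hypothetical
  caller; the single-argument kdb_enter() form matches this era, but
  treat the surrounding function as illustrative:

      #include "opt_kdb.h"
      #include <sys/kdb.h>

      static void
      example_debug_hook(void)
      {
      #ifdef KDB                         /* was: #ifdef DDB */
              kdb_backtrace();           /* was: db_print_backtrace() */
              kdb_enter("example hook"); /* was: Debugger("example hook") */
      #endif
      }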
* Bring in mbuma to replace mballoc.
  (bmilekic, 2004-05-31, 1 file, -17/+10)

  mbuma is an mbuf & cluster allocator built on top of a number of
  extensions to the UMA framework, all included herein.

  Extensions to UMA worth noting:

  - Better layering between slab <-> zone caches; introduce a Keg
    structure which splits the slab cache off from the zone structure
    and allows multiple zones to be stacked on top of a single Keg
    (single type of slab cache); perhaps we should look into defining a
    subset API on top of the Keg for special use by malloc(9), for
    example.
  - UMA_ZONE_REFCNT zones can now be added, and reference counters are
    automagically allocated for them within the end of the associated
    slab structures. uma_find_refcnt() does a kextract to fetch the
    slab struct reference from the underlying page and look up the
    corresponding refcnt.

  mbuma things worth noting:

  - Integrates mbuf & cluster allocations with extended UMA and
    provides caches for commonly-allocated items; defines several zones
    (two primary, one secondary) and two kegs.
  - Changes up certain code paths that always used to do m_get() +
    m_clget() to instead just use m_getcl() and try to take advantage
    of the newly defined secondary Packet zone.
  - netstat(1) and systat(1) are quickly hacked up to do basic stat
    reporting, but additional stats work needs to be done once some
    other details within UMA have been taken care of and it becomes
    clearer how stats will work within the modified framework.

  From the user perspective, one implication is that the NMBCLUSTERS
  compile-time option is no longer used. The maximum number of clusters
  is still capped off according to maxusers, but it can be made
  unlimited by setting the kern.ipc.nmbclusters boot-time tunable to
  zero. Work should be done to write an appropriate sysctl handler
  allowing dynamic tuning of kern.ipc.nmbclusters at runtime.

  Additional things worth noting/known issues (READ):

  - One report of the 'ips' (ServeRAID) driver acting really slow in
    conjunction with mbuma. Need more data. Latest report is that ips
    is equally sucking with and without mbuma.
  - A Giant leak in the NFS code sometimes occurs; can't reproduce but
    currently analyzing. brueffer is able to reproduce, but THIS IS NOT
    an mbuma-specific problem and currently occurs even WITHOUT mbuma.
  - Issues in network locking: there is at least one code path in the
    rip code where one or more locks are acquired and we end up in
    m_prepend() with M_WAITOK, which causes WITNESS to whine from
    within UMA. Current temporary solution: force all UMA allocations
    to be M_NOWAIT from within UMA for now to avoid deadlocks, unless
    WITNESS is defined and we can determine with certainty that we're
    not holding any locks when we're M_WAITOK.
  - I've seen at least one weird socketbuffer empty-but-mbuf-still-
    attached panic. I don't believe this to be related to mbuma, but
    please keep your eyes open, turn on debugging, and capture crash
    dumps.

  This change removes more code than it adds.

  A paper detailing the change and considering various performance
  issues is available; it was presented at BSDCan2004:
  http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
  Please read the paper for Future Work and implementation details, as
  well as credits.

  Testing and Debugging: rwatson, brueffer, Ketrien I. Saihr-Kesenchedra, ...
  Reviewed by: Lots of people (for different parts)
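  A minimal sketch of the m_get() + m_clget() to m_getcl() conversion
  described above; the wrapper function and error handling are
  illustrative:

      #include <sys/param.h>
      #include <sys/mbuf.h>

      static struct mbuf *
      example_get_packet(void)
      {
              struct mbuf *m;

              /* Before: two steps, allocate an mbuf, then attach a
               * cluster to it, checking M_EXT to see if that worked. */
              m = m_get(M_NOWAIT, MT_DATA);
              if (m != NULL) {
                      m_clget(m, M_NOWAIT);
                      if ((m->m_flags & M_EXT) == 0) {
                              m_free(m);
                              m = NULL;
                      }
              }
              if (m != NULL)
                      m_freem(m);

              /* After: one call, served from the secondary Packet zone. */
              return (m_getcl(M_NOWAIT, MT_DATA, 0));
      }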
* Remove the advertising clause from the University of California
  Regents' license, per letter dated July 22, 1999.
  (imp, 2004-04-05, 1 file, -4/+0)

  Approved by: core
* Rename the kern.vm.kmem.size tunable to the more logical vm.kmem_size.
  (des, 2004-01-27, 1 file, -1/+7)

  To assure backward compatibility (conditional on !BURN_BRIDGES), look
  it up by its old name first, and log a warning (but accept the
  setting) if it was found. If both the old and new name are defined,
  the new name takes precedence.

  Also export vm.kmem_size as a read-only sysctl variable; I find it
  hard to tune a parameter when I don't know its default value,
  especially when that default value is computed at boot time.
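  A sketch of the backward-compatible lookup this describes, assuming
  the TUNABLE_ULONG_FETCH(9) and SYSCTL_ULONG(9) interfaces; the
  wrapper function is hypothetical:

      #include <sys/kernel.h>
      #include <sys/sysctl.h>

      static u_long vm_kmem_size;

      static void
      kmem_size_fetch(void)
      {
      #ifndef BURN_BRIDGES
              /* Accept the old name, but warn that it is obsolete. */
              if (TUNABLE_ULONG_FETCH("kern.vm.kmem.size", &vm_kmem_size))
                      printf("kern.vm.kmem.size is obsolete, "
                          "use vm.kmem_size instead\n");
      #endif
              /* The new name, if also set, takes precedence. */
              TUNABLE_ULONG_FETCH("vm.kmem_size", &vm_kmem_size);
      }

      /* Read-only export so the computed default is visible. */
      SYSCTL_ULONG(_vm, OID_AUTO, kmem_size, CTLFLAG_RD,
          &vm_kmem_size, 0, "Size of kernel memory");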
* Only use UMA to cache malloc requests up to PAGE_SIZE.
  (jeff, 2003-09-19, 1 file, -1/+12)

  Values larger than this are requested very infrequently and waste
  memory when we cache spares.
* Revert stuff which accidentally ended up in the previous commit.
  (phk, 2003-07-22, 1 file, -6/+3)
* Don't attempt to inline the large functions mb_alloc() and mb_free();
  doing so more than doubles the text size of this file.
  (phk, 2003-07-22, 1 file, -3/+6)

  GCC has wisely ignored us on this previously.
* Add init_param3() to subr_param. (silby, 2003-07-11, 1 file, -0/+5)

  This function is called immediately after the kernel map has been
  sized, and is the optimal place for the autosizing of memory
  allocations which occur within the kernel map.

  Suggested by: bde
* Don't overflow when calculating vm_kmem_size.
  (ps, 2003-06-11, 1 file, -4/+4)

  This fixes kmem_map-too-small panics on PAE machines which have odd
  > 4GB sizes (4.5GB would render 20MB of KVA for kmem_map instead of
  200MB).

  Submitted by: John Cagle <john.cagle@hp.com>, jeff
  Reviewed by: jeff, peter, scottl, lots of USENIX folks
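  The class of bug being fixed, sketched with illustrative values and
  types; the committed expression differs:

      #include <sys/types.h>

      static void
      example_kmem_scale(void)
      {
              /* ~4.5GB worth of 4KB pages: the page count fits in 32
               * bits, but the byte count does not. */
              u_int pages = 1179648;
              u_int bad;
              uint64_t good;

              /* Broken: the multiply wraps in 32 bits, yielding a far
               * smaller value than intended. */
              bad = pages * 4096 / 3;

              /* Fixed: promote to 64 bits before scaling. */
              good = (uint64_t)pages * 4096 / 3;

              (void)bad;
              (void)good;
      }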
* Use __FBSDID(). (obrien, 2003-06-11, 1 file, -1/+3)
* Don't pass a NULL pointer to memset if we are compiled with
  DIAGNOSTIC. (phk, 2003-05-12, 1 file, -4/+3)

  Approved by: re/rwatson
* Add two KASSERTs which trigger if free(9) would drag the "memuse"
  statistic for a malloc bucket under zero.
  (phk, 2003-05-05, 1 file, -0/+6)

  This typically happens if you malloc(9) from one bucket and free to
  another.
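  What such an assertion looks like; the field names follow the
  malloc_type statistics of this era, but treat the exact expression as
  illustrative (it assumes free(9)'s local variables):

      /* In free(9), before the statistics are decremented: */
      KASSERT(type->ks_memuse >= size,
          ("free(9): memuse underflow for %s (freed to wrong type?)",
          type->ks_shortdesc));
      type->ks_memuse -= size;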
* Update the "last malloc failure timestamp" also for simulatedphk2003-04-251-0/+1
| | | | malloc errors.
* Permit debug.malloc.failure_rate to be specified using a tunable so
  that the feature can be enabled during the boot process.
  (rwatson, 2003-03-26, 1 file, -0/+1)

  Note the continued limitation that FreeBSD fails so rapidly with this
  setting enabled that it's hard to narrow down particular failures for
  correction; we really need per-malloc-type failure rates.
* Add a new kernel option, MALLOC_MAKE_FAILURES, which compiles in a
  debugging feature causing M_NOWAIT allocations to fail at a specified
  rate. (rwatson, 2003-03-26, 1 file, -0/+26)

  This can be useful for detecting poor handling of M_NOWAIT: the most
  frequent problems I've bumped into are unconditional dereference of
  the pointer even though it's NULL, and hangs as a result of a lost
  event where memory for the event couldn't be allocated.

  Two sysctls are added:

  debug.malloc.failure_rate
      How often to generate a failure: if set to 0 (default), this
      feature is disabled. Otherwise, the frequency of failures -- I've
      been using 10 (one in ten mallocs fails), but other popular
      settings might be much lower or much higher.

  debug.malloc.failure_count
      Number of times a coerced malloc failure has occurred as a result
      of this feature. Useful for tracking what might have happened and
      whether failures are being generated.

  Useful possible additions: tying the failure rate to malloc type, and
  printfs indicating the thread that experienced the coerced failure.

  Reviewed by: jeffr, jhb
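  A minimal sketch of this style of failure injection; the counter
  names and the exact placement inside malloc(9) are illustrative, and
  the fragment assumes malloc(9)'s flags argument is in scope:

      #ifdef MALLOC_MAKE_FAILURES
              /* Backed by the two sysctls described above. */
              static int malloc_failure_rate;
              static int malloc_failure_count;
              static int malloc_nowait_count;   /* running counter */

              /* Early in malloc(9): */
              if ((flags & M_NOWAIT) != 0 && malloc_failure_rate != 0) {
                      atomic_add_int(&malloc_nowait_count, 1);
                      if ((malloc_nowait_count % malloc_failure_rate) == 0) {
                              atomic_add_int(&malloc_failure_count, 1);
                              return (NULL);    /* coerced failure */
                      }
              }
      #endif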
* PHCC[1]: (phk, 2003-03-10, 1 file, -2/+2)

  I had commented the #ifdef INVARIANTS checks out to make sure I ran
  this code in all kernels, and forgot to comment the #ifdefs back in
  before I committed.

  Spotted by: bmilekic

  [1] PHCC = Pointy Hat Correction Commit
* Make malloc and mbuf allocation mode flags nonoverlapping.
  (phk, 2003-03-10, 1 file, -1/+18)

  Under INVARIANTS, whine if we get incompatible flags.

  Submitted by: imp
* o Allow "buckets" in mb_alloc to be differently sized (according tobmilekic2003-02-201-2/+1
| | | | | | | | | | | | | | | | compile-time constants). That is, a "bucket" now is not necessarily a page-worth of mbufs or clusters, but it is MBUF_BUCK_SZ, CLUS_BUCK_SZ worth of mbufs, clusters. o Rename {mbuf,clust}_limit to {mbuf,clust}_hiwm and introduce {mbuf,clust}_lowm, which currently has no effect but will be used to set the low watermarks. o Fix netstat so that it can deal with the differently-sized buckets and teach it about the low watermarks too. o Make sure the per-cpu stats for an absent CPU has mb_active set to 0, explicitly. o Get rid of the allocate refcounts from mbuf map mess. Instead, just malloc() the refcounts in one shot from mbuf_init() o Clean up / update comments in subr_mbuf.c
* Back out M_* changes, per decision of the TRB.
  (imp, 2003-02-19, 1 file, -4/+4)

  Approved by: trb
* Under #ifdef DIAGNOSTIC, fill malloc(9) allocations which do not have
  M_ZERO specified with 0x70. (malloc_flags=J for the kernel :-)
  (phk, 2003-02-01, 1 file, -0/+8)
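  The junk-fill pattern this adds, sketched; the variable names and the
  placement inside malloc(9) are illustrative:

      #ifdef DIAGNOSTIC
              /* Poison fresh allocations so use-before-initialize bugs
               * stand out, mirroring userland malloc's J flag. */
              if (va != NULL && (flags & M_ZERO) == 0)
                      memset(va, 0x70, size);
      #endif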
* Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
  (alfred, 2003-01-21, 1 file, -4/+4)

  Merge M_NOWAIT/M_DONTWAIT into a single flag, M_NOWAIT.
* Introduce malloc_last_fail(), which returns the number of seconds
  since malloc(9) failed last time. (phk, 2002-11-01, 1 file, -0/+16)

  This is intended to help code adjust memory usage to the current
  circumstances. A typical use could be:

      if (malloc_last_fail() < 60)
              reduce_cache_by_one();
* Split UMA_ZFLAG_OFFPAGE into UMA_ZFLAG_OFFPAGE and UMA_ZFLAG_HASH.
  (jeff, 2002-09-18, 1 file, -33/+9)

  - Remove all instances of the mallochash.
  - Stash the slab pointer in the vm page's object pointer when
    allocating from the kmem_obj.
  - Use the overloaded object pointer to find slabs for malloced memory.
* Replace the bandaid introduced in revision 1.110 with a better
  solution. (robert, 2002-05-31, 1 file, -1/+3)

  Also add braces for a ``for'' statement containing a single
  multi-line statement.
* Add a bandaid so that sysctl kern.malloc works on sparc64.
  (jake, 2002-05-20, 1 file, -1/+1)
* Fix the td_intr_nesting_level check to work ok if a flag like M_ZERO
  is passed in with M_WAITOK to malloc().
  (jhb, 2002-05-20, 1 file, -3/+1)
* Hide a pointer to the malloc_type bucket at the end of the freed
  memory. (jeff, 2002-05-02, 1 file, -1/+19)

  If this memory is modified after it has been freed, we can now report
  its previous owner.
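  The trailer technique in miniature; the variable names are
  illustrative, and the fragment assumes free(9)'s addr, size, and type
  are in scope:

      /* On free(9): record the owning type at the tail of the item. */
      struct malloc_type **mtpp;

      mtpp = (struct malloc_type **)
          ((char *)addr + size - sizeof(struct malloc_type *));
      *mtpp = type;

      /* Later, if a use-after-free check trips, *mtpp names the
       * previous owner for the diagnostic message. */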
* malloc/free(9) no longer require Giant.
  (jeff, 2002-05-02, 1 file, -1/+21)

  Use the malloc_mtx to protect the mallochash. The mallochash is going
  to go away as soon as I introduce the kfree/kmalloc api and partially
  overhaul the malloc wrapper. This can't happen until all users of the
  malloc api that expect memory to be aligned on the size of the
  allocation are fixed.
* Remove the temporary alignment check in free().
  (jeff, 2002-05-02, 1 file, -6/+0)

  Implement the following checks on freed memory in the bucket path:

  - Slab membership
  - Alignment
  - Duplicate free

  This previously was only done if we skipped the buckets. This code
  will slow down INVARIANTS a bit, but it is SMP safe. The checks were
  moved out of the normal path and into hooks supplied in uma_dbg.
* Convert longs to u_longs in stats. (jeff, 2002-04-30, 1 file, -1/+1)

  This will hold off wraparounds for a while longer.
* Add a new UMA debugging facility. (jeff, 2002-04-30, 1 file, -2/+8)

  This will overwrite freed memory with 0xdeadc0de and then check for
  it just before memory is handed off as part of a new request. This
  will catch any post-free/pre-alloc modification of memory, as well as
  introduce errors for anything that tries to dereference it as a
  pointer.

  This code takes the form of special init, fini, ctor and dtor
  routines that are specifically used by malloc. It is in a separate
  file because additional debugging aids will want to live here as
  well.
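  The trash-and-check idea in miniature; the routine names echo UMA's
  debugging hooks, but treat the signatures and details as illustrative:

      #include <sys/types.h>
      #include <sys/systm.h>

      #define UMA_DBG_FREED   0xdeadc0de      /* poison word */

      /* dtor: scribble the pattern over an item as it is freed. */
      static void
      trash_dtor(void *mem, int size, void *arg)
      {
              uint32_t *p = mem;
              int i, n = size / sizeof(uint32_t);

              for (i = 0; i < n; i++)
                      p[i] = UMA_DBG_FREED;
      }

      /* ctor: verify the pattern survived before handing the item out. */
      static void
      trash_ctor(void *mem, int size, void *arg)
      {
              uint32_t *p = mem;
              int i, n = size / sizeof(uint32_t);

              for (i = 0; i < n; i++)
                      if (p[i] != UMA_DBG_FREED)
                              panic("UMA: item modified after free "
                                  "%p (%p)", mem, &p[i]);
      }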
* Move the implementation of M_ZERO into UMA so that it can be passed
  to uma_zalloc and friends. (jeff, 2002-04-30, 1 file, -3/+0)

  Remove this functionality from the malloc wrapper. Document this
  change in uma.h and adjust variable names in uma_core.
* Re-add the 16384 bucket also. (rwatson, 2002-04-29, 1 file, -0/+1)

  Submitted by: green
* Revert a portion of kern_malloc.c:1.99, which (in addition to adding
  malloc profiling) also modified the set of pre-defined buckets for
  the memory allocator. (rwatson, 2002-04-29, 1 file, -1/+3)

  For reasons unknown to me, this resulted in extensive memory
  corruption in the kernel, in particular on SMP boxes, so I'm
  committing this work-around until Jeff gets a chance to debug it
  properly. David Wolfskill pointed me at this commit as the one that
  might be a problem; I've been running this code on two dual-processor
  burn-in boxes for about 12 hours now, and the rate of panics due to
  memory corruption has dropped to zero (from one every five minutes).

  Hopefully not treading on the toes of: jeff
* Add a basic sanity check on pointers passed to free(9).
  (phk, 2002-04-23, 1 file, -0/+10)

  Should be improved by: jeff
* Finish adding support code for sysctl kern.mprof.
  (jeff, 2002-04-15, 1 file, -11/+68)

  This dumps some malloc information related to bucket size efficiency.
  Three things are printed on each row:

      Size       is the size the user actually asked for, rounded to
                 16 bytes.
      Requests   is the number of times this size was asked for.
      Real Size  is the size we actually handed out.

  At the end, the total memory used and total waste are displayed.
  Currently my system displays about 33% wasted memory.

  The intent of this code is to gather statistics for tuning the malloc
  bucket sizes. It is not intended to be run with INVARIANTS, and it is
  not entirely MP safe. It can be enabled via 'options MALLOC_PROFILE',
  which was committed earlier.
* Remove malloc_type's ks_limit. (jeff, 2002-04-15, 1 file, -84/+135)

  Updated the kmemzones logic such that the ks_size bitmap can be used
  as an index into it to report the size of the zone used.

  Create the kern.malloc sysctl, which replaces the kvm mechanism to
  report similar data. This will provide an easy place for statistics
  aggregation if malloc_type statistics become per-cpu data.

  Add some code ifdef'd under MALLOC_PROFILING to facilitate a tool for
  sizing the malloc buckets.
* Change callers of mtx_init() to pass in an appropriate lock type
  name. (jhb, 2002-04-04, 1 file, -1/+1)

  In most cases NULL is passed, but in some cases, such as network
  driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone
  locks, a name is used.

  Tested on: i386, alpha, sparc64
* Remove __P. (alfred, 2002-03-19, 1 file, -1/+1)
* This is the first part of the new kernel memory allocator.
  (jeff, 2002-03-19, 1 file, -299/+122)

  This replaces malloc(9) and vm_zone with a slab-like allocator.

  Reviewed by: arch@
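  For orientation, the basic shape of the zone API the new allocator
  exposes; the zone name and item type here are hypothetical:

      #include <vm/uma.h>

      static uma_zone_t myobj_zone;

      /* Create a cache of fixed-size items backed by the slab layer. */
      myobj_zone = uma_zcreate("myobj", sizeof(struct myobj),
          NULL, NULL, NULL, NULL,   /* ctor, dtor, init, fini */
          UMA_ALIGN_PTR, 0);

      struct myobj *o = uma_zalloc(myobj_zone, M_WAITOK);
      /* ... use o ... */
      uma_zfree(myobj_zone, o);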
* Add realloc() and reallocf(), and make free(NULL, ...) acceptable.
  (archie, 2002-03-13, 1 file, -0/+74)

  Reviewed by: alfred
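  The idiom reallocf() enables, sketched; the malloc type, sizes, and
  wrapper function are illustrative:

      #include <sys/param.h>
      #include <sys/malloc.h>

      static int
      example_grow(void)
      {
              void *buf, *tmp;

              buf = malloc(128, M_TEMP, M_NOWAIT);
              if (buf == NULL)
                      return (ENOMEM);

              /* With plain realloc(), failure must not leak buf: */
              tmp = realloc(buf, 256, M_TEMP, M_NOWAIT);
              if (tmp == NULL) {
                      free(buf, M_TEMP);  /* caller must remember this */
                      return (ENOMEM);
              }
              buf = tmp;

              /* reallocf() frees the old buffer on failure itself,
               * collapsing the dance into one call: */
              buf = reallocf(buf, 512, M_TEMP, M_NOWAIT);
              if (buf == NULL)
                      return (ENOMEM);

              free(buf, M_TEMP);
              return (0);
      }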
* KSE Milestone 2. (julian, 2001-09-12, 1 file, -1/+1)

  Note: ALL MODULES MUST BE RECOMPILED.

  Make the kernel aware that there are smaller units of scheduling than
  the process (but only allow one thread per process at this time).
  This is functionally equivalent to the previous -current, except that
  there is a thread associated with each process.

  Sorry john! (your next MFC will be a doosie!)

  Reviewed by: peter@freebsd.org, dillon@freebsd.org
  X-MFC after: ha ha ha ha
* Remove asleep(), await(), and M_ASLEEP.
  (jhb, 2001-08-10, 1 file, -9/+0)

  Callers of asleep() and await() have been converted to calling
  tsleep(). The only caller outside of M_ASLEEP was the ata driver,
  which called both asleep() and await() with spl raised, so there was
  no need for the asleep() and await() pair. M_ASLEEP was unused.

  Reviewed by: jasone, peter
* Rename the mb_init() mbuf subsystem initialization routine to
  mbuf_init(), in order to avoid a namespace collision with
  subr_mchain.c's mb_init(). (bmilekic, 2001-08-03, 1 file, -1/+1)

  This wasn't "fatal," as the mbuf initialization routine mb_init() was
  local to subr_mbuf.c, which in turn didn't pull in subr_mchain.c's
  mb_init() declaration, but it should definitely be changed now before
  it creates headaches.
* Remove some code that appears to have endian problems with
  INVARIANTS. (jake, 2001-08-03, 1 file, -5/+0)

  This is #if BIG_ENDIAN, but is only necessary if malloc types are
  shorts, not struct malloc_type * like they are now.
* Introduce numerous SMP-friendly changes to the mbuf allocator.
  (bmilekic, 2001-06-22, 1 file, -1/+8)

  Namely, introduce a modified allocation mechanism for mbufs and mbuf
  clusters: one which can scale under SMP and which offers the
  possibility of resource reclamation to be implemented in the future.
  Notable advantages:

  o Reduce contention for SMP by offering per-CPU pools and locks.
  o Better use of data cache due to per-CPU pools.
  o Much less code cache pollution due to excessively large allocation
    macros.
  o Framework for `grouping' objects from the same page together so as
    to be able to possibly free wired-down pages back to the system if
    they are no longer needed by the network stacks.

  Additional things changed with this addition:

  - Moved some mbuf-specific declarations and initializations from
    sys/conf/param.c into mbuf-specific code where they belong.
  - m_getclr() has been renamed to m_get_clrd() because the old name is
    really confusing. m_getclr() HAS been preserved, though, and is
    defined to the new name. No tree sweep has been done "to change the
    interface," as the old name will continue to be supported and is
    not deprecated. The change was merely done because m_getclr()
    sounds too much like "m_get a cluster."
  - TEMPORARILY disabled mbtypes statistics displaying in netstat(1)
    and systat(1) (see TODO below).
  - Fixed systat(1) to display the number of "free mbufs" based on the
    new per-CPU stat structures.
  - Fixed netstat(1) to display new per-CPU stats based on
    sysctl-exported per-CPU stat structures. All infos are fetched via
    sysctl.

  TODO (in order of priority):

  - Re-enable mbtypes statistics in both netstat(1) and systat(1) after
    introducing an SMP-friendly way to collect the mbtypes stats under
    the already introduced per-CPU locks (i.e. hopefully don't use
    atomic() - it seems too costly for a mere stat update, especially
    when other locks are already present).
  - Optionally have systat(1) display not only "total free mbufs" but
    also "total free mbufs per CPU pool."
  - Fix minor length-fetching issues in netstat(1) related to the
    recently re-enabled option to read mbuf stats from a core file.
  - Move reference counters, at least for mbuf clusters, into an unused
    portion of the cluster itself, to save space and the need to
    allocate a counter.
  - Look into introducing resource freeing, possibly from a kproc.

  Reviewed by (in parts): jlemon, jake, silby, terry
  Tested by: jlemon (Intel & Alpha), mjacob (Intel & Alpha)
  Preliminary performance measurements: jlemon (and me, obviously)
  URL: http://people.freebsd.org/~bmilekic/mb_alloc/
* "Fix" the previous initial attempt at fixing TUNABLE_INT(). This timepeter2001-06-081-1/+1
| | | | | | | around, use a common function for looking up and extracting the tunables from the kernel environment. This saves duplicating the same function over and over again. This way typically has an overhead of 8 bytes + the path string, versus about 26 bytes + the path string.
* Back out part of my previous commit. (peter, 2001-06-07, 1 file, -1/+1)

  This was a last minute change and I botched testing. This is a
  perfect example of how NOT to do this sort of thing. :-(
* Make the TUNABLE_*() macros look and behave more consistently like
  the SYSCTL_*() macros. (peter, 2001-06-06, 1 file, -7/+6)

  TUNABLE_INT_DECL() was an odd name because it didn't actually declare
  the int, which is what the name suggests it would do.