| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
| |
support for it. Note that while sparc64 also supports the static TLS
model and thus tls_model("initial-exec"), using the default model
turned out to yield slightly better buildstone performance.
|
|
|
|
|
|
|
|
|
|
|
| |
number of host CPUs and osreldate.
This eliminates the last sysctl(2) calls from the dynamically linked image
startup.
No objections from: kan
Tested by: marius (sparc64)
MFC after: 1 month
|
|
|
|
| |
Obtained from: projects/ppc64
|
|
|
|
| |
insert/remove speed by ~30%.
|
|
|
|
| |
like ia64, leave it empty (default model).
|
|
|
|
|
|
| |
the static TLS model, which is fundamentally different from the dynamic
TLS model. The consequence was data corruption. Limit the attribute to
i386 and amd64.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Fix a race in chunk_dealloc_dss().
* Check for allocation failure before zeroing memory in base_calloc().
Merge enhancements from a divergent version of jemalloc:
* Convert thread-specific caching from magazines to an algorithm that is
more tunable, and implement incremental GC.
* Add support for medium size classes, [4KiB..32KiB], 2KiB apart by
default.
* Add dirty page tracking for pages within active small/medium object
runs. This allows malloc to track precisely which pages are in active
use, which makes dirty page purging more effective.
* Base maximum dirty page count on proportion of active memory.
* Use optional zeroing in arena_chunk_alloc() to avoid needless zeroing
of chunks. This is useful in the context of DSS allocation, since a
long-lived application may commonly recycle chunks.
* Increase the default chunk size from 1MiB to 4MiB.
Remove feature:
* Remove the dynamic rebalancing code, since thread caching reduces its
utility.
|
|
|
|
|
|
|
| |
deallocate.
Submitted by: Ryan Stone (rysto32 at gmail dot com)
Approved by: jasone
|
|
|
|
| |
initialization of ssize_invs.
|
|
|
|
|
|
| |
as intended.
PR: standards/138307
|
|
|
|
|
|
| |
in order to distinguish it from free(NULL), which is logged as (0, 0, 0).
Reviewed by: jhb
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
a large page size that is greater than malloc(3)'s default chunk size but
less than or equal to 4 MB, then increase the chunk size to match the large
page size.
Most often, using a chunk size that is less than the large page size is not
a problem. However, consider a long-running application that allocates and
frees significant amounts of memory. In particular, it frees enough memory
at times that some of that memory is munmap()ed. Up until the first
munmap(), a 1MB chunk size is just fine; it's not a problem for the virtual
memory system. Two adjacent 1MB chunks that are aligned on a 2MB boundary
will be promoted automatically to a superpage even though they were
allocated at different times. The trouble begins with the munmap(),
releasing a 1MB chunk will trigger the demotion of the containing superpage,
leaving behind a half-used 2MB reservation. Now comes the real problem.
Unfortunately, when the application needs to allocate more memory, and it
recycles the previously munmap()ed address range, the implementation of
mmap() won't be able to reuse the reservation. Basically, the coalescing
rules in the virtual memory system don't allow this new range to combine
with its neighbor. The effect being that superpage promotion will not
reoccur for this range of addresses until both 1MB chunks are freed at some
point in the future.
Reviewed by: jasone
MFC after: 3 weeks
|
|
|
|
|
|
|
| |
according to the 'V' option.
PR: standards/138307
MFC after: 1 week
|
|
|
|
| |
Reported by: kib
|
|
|
|
|
|
| |
spinning is avoided due to running on a single-CPU system.
Reported by: stefanf
|
|
|
|
| |
Reported by: davidxu
|
|
|
|
|
|
|
| |
potential extreme contention in the kernel for multi-threaded applications
on SMP systems.
Reported by: kris
|
|
|
|
|
|
| |
page size and using sysconf(3).
Suggested by: marcel
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This caching allows for completely lock-free allocation/deallocation in the
steady state, at the expense of likely increased memory use and
fragmentation.
Reduce the default number of arenas to 2*ncpus, since thread-specific
caching typically reduces arena contention.
Modify size class spacing to include ranges of 2^n-spaced, quantum-spaced,
cacheline-spaced, and subpage-spaced size classes. The advantages are:
fewer size classes, reduced false cacheline sharing, and reduced internal
fragmentation for allocations that are slightly over 512, 1024, etc.
Increase RUN_MAX_SMALL, in order to limit fragmentation for the
subpage-spaced size classes.
Add a size-->bin lookup table for small sizes to simplify translating sizes
to size classes. Include a hard-coded constant table that is used unless
custom size class spacing is specified at run time.
Add the ability to disable tiny size classes at compile time via
MALLOC_TINY.
|
|
|
|
|
|
| |
preemption while busy-waiting.
Submitted by: Mike Schuster <schuster@adobe.com>
|
|
|
|
|
|
| |
detect whether the integer division table is large enough to handle the
divisor. Before this change, the last two table elements were never used,
thus causing the slow path to be used for those divisors.
|
|
|
|
|
| |
Found by: LLVM/Clang Static Checker
Approved by: jasone
|
|
|
|
|
|
|
| |
the chunk map instead of red-black trees where possible. Remove the
red-black trees and node objects that are obsoleted by this change. The
net result is a ~1-2% memory savings, and a substantial allocation speed
improvement.
|
|
|
|
|
|
| |
Fix bit vector initialization for run headers.
Submitted by: [1] Mike Schuster <schuster@adobe.com>
|
|
|
|
|
|
|
| |
This substantially improves worst case allocation performance, since
O(lg n) tree search can be used instead of O(n) tree iteration.
Use rb_wrap() instead of directly calling rb_*() macros.
|
|
|
|
| |
Approved by: imp
|
|
|
|
| |
signed increment argument, but the size is an unsigned integer.
|
|
|
|
|
|
|
|
|
| |
color bit in the least significant bit of the right child pointer, in
order to reduce red-black tree linkage overhead by ~2X as compared to
sys/tree.h.
Use the new red-black tree implementation in malloc, which drops
memory usage by ~0.5 or ~1%, for 32- and 64-bit systems, respectively.
|
|
|
|
| |
deallocation.
|
|
|
|
|
|
|
|
| |
reallocation, when junk filling is enabled. Junk filling must occur
prior to shrinking, since any deallocated trailing pages are immediately
available for use by other threads.
Reported by: Mats Palmgren <mats.palmgren@bredband.net>
|
|
|
|
|
|
|
|
| |
allocation patterns, number of CPUs, and MALLOC_OPTIONS settings indicate
that lazy deallocation has the potential to worsen throughput dramatically.
Performance degradation occurs when multiple threads try to clear the lazy
free cache simultaneously. Various experiments to avoid this bottleneck
failed to completely solve this problem, while adding yet more complexity.
|
|
|
|
|
|
|
|
| |
arena_dalloc_lazy_hard() was split out of arena_dalloc_lazy() in revision
1.162.
Reduce thundering herd problems in lazy deallocation by randomly varying
how many probes a thread does before taking the slow path.
|
|
|
|
|
|
|
|
|
|
|
| |
assumptions about whether bits are set at various times. This makes
adding other flags safe.
Reorganize functions in order to inline i{m,c,p,s,re}alloc(). This
allows the entire fast-path call chains for malloc() and free() to be
inlined. [1]
Suggested by: [1] Stuart Parmenter <stuart@mozilla.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
threshold, according to the 'F' MALLOC_OPTIONS flag. This obsoletes the
'H' flag.
Try to realloc() large objects in place. This substantially speeds up
incremental large reallocations in the common case.
Fix a bug in arena_ralloc() that caused relocation of sub-page objects
even if the old and new sizes were in the same size class.
Maintain trees of runs and simplify the per-chunk page map. This allows
logarithmic-time searching for sufficiently large runs in
arena_run_alloc(), whereas the previous algorithm required linear time
in the worst case.
Break various large functions into smaller sub-functions, and inline
only the functions that are in the fast path for small object
allocation/deallocation.
Remove an unnecessary check in base_pages_alloc_mmap().
Avoid integer division in choose_arena() for the NO_TLS case on
single-CPU systems.
|
|
|
|
|
|
|
|
|
| |
default. This has the disadvantage of rendering the datasize resource
limit irrelevant, but without this change, legitimate uses of more
memory than will fit in the data segment are thwarted by default.
Fix chunk_alloc_mmap() to work correctly if initial mapping is not
chunk-aligned and mapping extension fails.
|
|
|
|
|
|
|
|
| |
Clean up DSS-related locking and protect all pertinent variables with
dss_mtx (remove dss_chunks_mtx). This fixes race conditions that could
cause chunk leaks.
Reported by: [1] kris
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a long-standing bug, but until recent changes it was difficult
to trigger, and even then its impact was non-catastrophic, with the
exception of revision 1.157.
Optimize chunk_alloc_mmap() to avoid the need for unmapping pages in the
common case. Thanks go to Kris Kennaway for a patch that inspired this
change.
Do not maintain a record of previously mmap'ed chunk address ranges.
The original intent was to avoid the extra system call overhead in
chunk_alloc_mmap(), which is no longer a concern. This also allows some
simplifications for the tree of unused DSS chunks.
Introduce huge_mtx and dss_chunks_mtx to replace chunks_mtx. There was
no compelling reason to use the same mutex for these disjoint purposes.
Avoid memset() for huge allocations when possible.
Maintain two trees instead of one for tracking unused DSS address
ranges. This allows scalable allocation of multi-chunk huge objects in
the DSS. Previously, multi-chunk huge allocation requests failed if the
DSS could not be extended.
|
| |
|
|
|
|
|
|
|
|
|
| |
order to support re-use of multi-chunk unused regions within the DSS for
huge allocations. This generalization is important to correct function
when mmap-based allocation is disabled.
Avoid zeroing re-used memory in the DSS unless it really needs to be
zeroed.
|
|
|
|
| |
Reported by: kris
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
memory is acquired from the system via sbrk(2) and/or mmap(2). By default,
use sbrk(2) only, in order to support traditional use of resource limits.
Additionally, when both options are enabled, prefer the data segment to
anonymous mappings, in order to coexist better with large file mappings
in applications on 32-bit platforms. This change has the potential to
increase memory fragmentation due to the linear nature of the data
segment, but from a performance perspective this is mitigated by the use
of madvise(2). [1]
Add the ability to interpret integer prefixes in MALLOC_OPTIONS
processing. For example, MALLOC_OPTIONS=lllllllll can now be specified as
MALLOC_OPTIONS=9l.
Reported by: [1] rwatson
Design review: [1] alc, peter, rwatson
|
|
|
|
|
|
|
|
|
|
| |
calculating run sizes. Use of the floating point unit was a potential
pessimization to context switching for applications that do not otherwise
use floating point math. [1]
Reformat cpp macro-related comments to improve consistency.
Submitted by: das
|
|
|
|
|
|
|
|
|
| |
deallocation and dynamic load balancing via the MALLOC_LAZY_FREE and
MALLOC_BALANCE knobs. This is a non-functional change, since these
features are still enabled when possible.
Clean up a few things that more pedantic compiler settings would cause
complaints over.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
contention. The intent is to dynamically adjust to load imbalances, which
can cause severe contention.
Use pthread mutexes where possible instead of libc "spinlocks" (they aren't
actually spin locks). Conceptually, this change is meant only to support
the dynamic load balancing code by enabling the use of spin locks, but it
has the added apparent benefit of substantially improving performance due to
reduced context switches when there is moderate arena lock contention.
Proper tuning parameter configuration for this change is a finicky business,
and it is very much machine-dependent. One seemingly promising solution
would be to run a tuning program during operating system installation that
computes appropriate settings for load balancing. (The pthreads adaptive
spin locks should probably be similarly tuned.)
|
|
|
|
|
|
|
|
|
|
|
| |
vector of slots for lazily freed objects. For each deallocation, before
doing the hard work of locking the arena and deallocating, try several times
to randomly insert the object into the vector using atomic operations.
This approach is particularly effective at reducing contention for
multi-threaded applications that use the producer-consumer model, wherein
one producer thread allocates objects, then multiple consumer threads
deallocate those objects.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
allocations. [1]
Fix calculation of the number of arenas when 'n' is specified via
MALLOC_OPTIONS.
Clean up various style inconsistencies.
Obtained from: [1] NetBSD
|
|
|
|
|
|
| |
and zero filling was broken in a way that could cause memory corruption.
Update comments.
|