A little paper about phkmalloc.

author: phk <phk@FreeBSD.org> 1996-04-13 08:30:21 +0000
committer: phk <phk@FreeBSD.org> 1996-04-13 08:30:21 +0000
commit: 26030597389c8d6a6c8c1f0c7601fd9bdf0584f5 (patch)
tree: 1fd09c21d5269ca254cb617520ccaf0b1cff2e86 /share/doc/papers/malloc/implementation.ms
parent: 71735d6644bce690478b4cd10efed8a56497202d (diff)
download: FreeBSD-src-26030597389c8d6a6c8c1f0c7601fd9bdf0584f5.zip
FreeBSD-src-26030597389c8d6a6c8c1f0c7601fd9bdf0584f5.tar.gz
1 files changed, 222 insertions, 0 deletions
diff --git a/share/doc/papers/malloc/implementation.ms b/share/doc/papers/malloc/implementation.ms
new file mode 100644
index 0000000..3aba2c4
--- /dev/null
+++ b/share/doc/papers/malloc/implementation.ms
@@ -0,0 +1,222 @@
+.\"
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@login.dknet.dk> wrote this file.  As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return.   Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $Id$
+.\"
+.ds RH Implementation
+.NH
+Implementation
+.PP
+A new malloc(3) implementation was written to meet the goals,
+and to the extent possible to address the shortcomings listed previously.
+.PP
+The source is 1218 lines of C code, and can be found in FreeBSD 2.2
+(and probably later versions as well) as src/lib/libc/stdlib/malloc.c.
+.PP
+The main data structure is the
+.I page-directory
+which contains a
+.B void*
+for each page we have control over.
+The value can be one of:
+.IP
+.B MALLOC_NOT_MINE
+Another part of the code may call brk(2) to get a piece of the cake.
+Consequently we cannot rely on the memory we get from the kernel to
+be one consequtive piece of memory and therefore we need a way to
+mark such pages as "untouchable".
+.IP
+.B MALLOC_FREE
+This is a free page.
+.IP
+.B MALLOC_FIRST
+This is the first page in a (multi-)page allocation.
+.IP
+.B MALLOC_FOLLOW
+This is a subsequent page in a multi-page allocation.
+.IP
+.B
+struct pginfo*
+.R
+A pointer to a structure describing a partitioned page.
+.PP
+In addition there exist a linked list of small data structures that
+describe the free space as runs of free pages.
+.PP
+Notice that these structures are not part of the free pages themselves,
+but rather allocated with malloc so that the free pages themselves
+are never referenced while they are free.
+.PP
+When a request for storage comes in, it will be treated as a ``page''
+allocation if it is bigger than half a page.
+The freelist will be searched and the first run of free pages that
+can satisfy the request is used.  The first page gets set to
+.B MALLOC_FIRST
+status, if more than that one page is needed the rest of them gets
+.B MALLOC_FOLLOW
+status in the page-directory.
+.PP
+If there were no pages on the free-list, brk(2) will be called, and
+the pages will get added to the page-directory with status
+.B MALLOC_FREE
+and the search restarts.
+.PP
+Freeing a number of pages is done by changing their state in the 
+page directory to MALLOC_FREE, and then traverse the free-pages list to
+find the right place for this run of pages, possibly collapsing
+with the two neighbouring runs into one run and, if it is possible,
+release some memory back to the kernel by calling brk(2).
+.PP
+If the request is less than or equal to half of a page, its size will be
+rounded up to the nearest power of two before being processed
+and if the request is less than some minimum size, it is rounded up to
+that size.
+.PP
+These sub-page allocations are served from pages which are split up
+into some number of equal size chunks.
+For each of these pages a
+.B
+struct pginfo
+.R
+describes the size of the chunks on this page, how many there are,
+how many are free and so on.
+The description consist of a bitmap of used chunks, and various counters
+and numbers used to keep track of the stuff in the page.
+.PP
+For each size of sub-page allocation, the pginfo structures for the
+pages that have free chunks in them form a list.
+The head of these lists are stored in predetermined slots at
+the beginning of the page directory to make access fast.
+.PP
+To allocate a chunk of some size, the head of the list for the
+corresponding size is examined, and a free chunk found, the number
+of free chunks on that page is decreased by one and if zero the
+pginfo structure is unlinked from the list.
+.PP
+To free a chunk, the page is derived from the pointer, the page table
+for that page contains a pointer to the pginfo structure, where the
+free bit is set for the chunk, the number of free chunks increased by
+one, and if equal to one, the pginfo structure is linked into the
+proper place on the list for this size of chunks.
+If the count increases to match the number of chunks on the page, the
+pginfo structure is unlinked from the list and free(3)'ed and the 
+actual page itself is free(3)'ed too.
+.PP
+To be 100% correct performance-wise these lists should be ordered
+according to the recent number of accesses to that page.  This 
+information is not available and it would essentially mean a reordering
+of the list on every memory reference to keep it up-to-date.
+Instead they are ordered according to the address of the pages.
+Interestingly enough, in practice this comes out to almost the same 
+thing performance wise.
+.PP
+It's not that surprising after all, it's the difference between
+following the crowd or actively directing where it can go, in both
+ways you can end up in the middle of it all.
+.PP
+The sideffect of this compromise is that it also uses less storage,
+and the list never has to be reordered, all the ordering happens when
+pages are added or deleted.
+.PP
+It is an interesting twist to the implementation that the
+.B
+struct pginfo
+.R
+Is allocated with malloc.
+That is, "as with malloc" to be painfully correct.
+The code knows the special case where the first (couple) of allocations on
+the page is actually the pginfo structure and deals with it accordingly.
+This avoids some silly "chicken and egg" issues.
+.ds RH Bells and whistles.
+.NH
+Bells and whistles.
+.PP
+brk(2) is actually not a very fast system call when you ask for storage.
+This is mainly because of the need by the kernel to zero the pages before
+handing them over, so therefore this implementation does not release 
+back heap-pages, until there is a large chunk to release back to the kernel.
+Chances are pretty good that we will need it again pretty soon anyway.
+Since these pages are not accessed at all, they will soon be paged out
+and don't affect anything but swap-space usage.
+.PP
+The page directory is actually kept in a mmap(2)'ed piece of
+anonymous memory.  This avoids some rather silly cases that
+we would otherwise have to be handled when the page directory
+has to be extended.
+.PP
+One particular nice feature is that all pointers passed to free(3)
+and realloc(3) can be checked conclusively for validity:
+First the pointer is masked to find the page.  The page directory
+is then examined, it must contain either MALLOC_FIRST, in which
+case the pointer must point exactly at the page, or it can contain
+a struct pginfo*, in which case the pointer must point to a one of
+the chunks described by that structure.
+Warnings will be printed on stderr and nothing will be done with
+the pointer in case it is found to be invalid.
+.PP
+An environment variable
+.B MALLOC_OPTIONS
+allows the user some control over the behaviour of malloc.
+Some of the more interesting options are:
+.IP
+.B Abort
+If malloc fails to allocate storage, core-dump the process with
+a message rather than expect it handle this correctly.
+It's amazing how few programs actually handle this condition correctly,
+and consequently the havoc they can create is the more creative or
+destructive.
+.IP
+.B Realloc
+Always do a free and malloc when realloc(3) is called.  The default
+is to leave things alone if the size of the allocation is still in
+the same size-class.
+For programs doing garbage collect using realloc(3) this make the
+heap collapse faster.  Since the malloc will reallocate from the 
+lowest available address.
+.IP
+.B Junk
+will explicitly fill the allocated area with a particular value
+to try to detect if programs rely on it being zero.
+.IP
+.B Zero
+will explicitly zero out the allocated chunk of memory, while any
+space after the allocation in the chunk will be filled with the
+junk value to try to catch out of the chunk references.
+.ds RH The road not taken.
+.NH
+The road yet not taken.
+.PP
+A couple of avenues were explored that could be interesting in some
+set of circumstances.
+.PP
+Using mmap(2) instead of brk(2) was actually slower, since brk(2)
+knows a lot of the things that mmap has to find out first.
+.PP
+A system call where we could tell the kernel that "we don't
+need the contents of this page anymore" would allow us to
+return the pages on the free list to the kernel and to instruct
+the kernel that it doesn't need to page it out nor in.
+It would save some page-out events, and the page-in would be replaced
+by a zero-fill page.
+This is, according to the VM goods in the FreeBSD camp, "easy",
+and it will probably be attempted at some point in the future.
+.PP
+In general there is little room for further improvement of the
+time-overhead of the malloc, further improvements will have to
+be in the area of improving paging behaviour.
+.PP
+It is still under consideration to add a feature such that
+if realloc is called with two zero arguments, the internal
+allocations will be reallocated to perform a garbage collect.
+This could be used in certain types of programs to collapse
+the memory use, but so far it doesn't seem to be worth the effort.
+.PP
+Malloc/Free can be a significant point of contention in multi-threaded
+programs.  Low-grain locking of the data-structures inside the 
+implementation should be implemented to avoid excessive spin-waiting.
+
author	phk <phk@FreeBSD.org>	1996-04-13 08:30:21 +0000
committer	phk <phk@FreeBSD.org>	1996-04-13 08:30:21 +0000
commit	26030597389c8d6a6c8c1f0c7601fd9bdf0584f5 (patch)
tree	1fd09c21d5269ca254cb617520ccaf0b1cff2e86 /share/doc/papers/malloc/implementation.ms
parent	71735d6644bce690478b4cd10efed8a56497202d (diff)
download	FreeBSD-src-26030597389c8d6a6c8c1f0c7601fd9bdf0584f5.zip FreeBSD-src-26030597389c8d6a6c8c1f0c7601fd9bdf0584f5.tar.gz