diff options
author | phk <phk@FreeBSD.org> | 1996-04-13 08:30:21 +0000 |
---|---|---|
committer | phk <phk@FreeBSD.org> | 1996-04-13 08:30:21 +0000 |
commit | 26030597389c8d6a6c8c1f0c7601fd9bdf0584f5 (patch) | |
tree | 1fd09c21d5269ca254cb617520ccaf0b1cff2e86 /share/doc/papers/malloc/implementation.ms | |
parent | 71735d6644bce690478b4cd10efed8a56497202d (diff) | |
download | FreeBSD-src-26030597389c8d6a6c8c1f0c7601fd9bdf0584f5.zip FreeBSD-src-26030597389c8d6a6c8c1f0c7601fd9bdf0584f5.tar.gz |
A little paper about phkmalloc.
Diffstat (limited to 'share/doc/papers/malloc/implementation.ms')
-rw-r--r-- | share/doc/papers/malloc/implementation.ms | 222 |
1 files changed, 222 insertions, 0 deletions
diff --git a/share/doc/papers/malloc/implementation.ms b/share/doc/papers/malloc/implementation.ms new file mode 100644 index 0000000..3aba2c4 --- /dev/null +++ b/share/doc/papers/malloc/implementation.ms @@ -0,0 +1,222 @@ +.\" +.\" ---------------------------------------------------------------------------- +.\" "THE BEER-WARE LICENSE" (Revision 42): +.\" <phk@login.dknet.dk> wrote this file. As long as you retain this notice you +.\" can do whatever you want with this stuff. If we meet some day, and you think +.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp +.\" ---------------------------------------------------------------------------- +.\" +.\" $Id$ +.\" +.ds RH Implementation +.NH +Implementation +.PP +A new malloc(3) implementation was written to meet the goals, +and to the extent possible to address the shortcomings listed previously. +.PP +The source is 1218 lines of C code, and can be found in FreeBSD 2.2 +(and probably later versions as well) as src/lib/libc/stdlib/malloc.c. +.PP +The main data structure is the +.I page-directory +which contains a +.B void* +for each page we have control over. +The value can be one of: +.IP +.B MALLOC_NOT_MINE +Another part of the code may call brk(2) to get a piece of the cake. +Consequently we cannot rely on the memory we get from the kernel to +be one consequtive piece of memory and therefore we need a way to +mark such pages as "untouchable". +.IP +.B MALLOC_FREE +This is a free page. +.IP +.B MALLOC_FIRST +This is the first page in a (multi-)page allocation. +.IP +.B MALLOC_FOLLOW +This is a subsequent page in a multi-page allocation. +.IP +.B +struct pginfo* +.R +A pointer to a structure describing a partitioned page. +.PP +In addition there exist a linked list of small data structures that +describe the free space as runs of free pages. +.PP +Notice that these structures are not part of the free pages themselves, +but rather allocated with malloc so that the free pages themselves +are never referenced while they are free. +.PP +When a request for storage comes in, it will be treated as a ``page'' +allocation if it is bigger than half a page. +The freelist will be searched and the first run of free pages that +can satisfy the request is used. The first page gets set to +.B MALLOC_FIRST +status, if more than that one page is needed the rest of them gets +.B MALLOC_FOLLOW +status in the page-directory. +.PP +If there were no pages on the free-list, brk(2) will be called, and +the pages will get added to the page-directory with status +.B MALLOC_FREE +and the search restarts. +.PP +Freeing a number of pages is done by changing their state in the +page directory to MALLOC_FREE, and then traverse the free-pages list to +find the right place for this run of pages, possibly collapsing +with the two neighbouring runs into one run and, if it is possible, +release some memory back to the kernel by calling brk(2). +.PP +If the request is less than or equal to half of a page, its size will be +rounded up to the nearest power of two before being processed +and if the request is less than some minimum size, it is rounded up to +that size. +.PP +These sub-page allocations are served from pages which are split up +into some number of equal size chunks. +For each of these pages a +.B +struct pginfo +.R +describes the size of the chunks on this page, how many there are, +how many are free and so on. +The description consist of a bitmap of used chunks, and various counters +and numbers used to keep track of the stuff in the page. +.PP +For each size of sub-page allocation, the pginfo structures for the +pages that have free chunks in them form a list. +The head of these lists are stored in predetermined slots at +the beginning of the page directory to make access fast. +.PP +To allocate a chunk of some size, the head of the list for the +corresponding size is examined, and a free chunk found, the number +of free chunks on that page is decreased by one and if zero the +pginfo structure is unlinked from the list. +.PP +To free a chunk, the page is derived from the pointer, the page table +for that page contains a pointer to the pginfo structure, where the +free bit is set for the chunk, the number of free chunks increased by +one, and if equal to one, the pginfo structure is linked into the +proper place on the list for this size of chunks. +If the count increases to match the number of chunks on the page, the +pginfo structure is unlinked from the list and free(3)'ed and the +actual page itself is free(3)'ed too. +.PP +To be 100% correct performance-wise these lists should be ordered +according to the recent number of accesses to that page. This +information is not available and it would essentially mean a reordering +of the list on every memory reference to keep it up-to-date. +Instead they are ordered according to the address of the pages. +Interestingly enough, in practice this comes out to almost the same +thing performance wise. +.PP +It's not that surprising after all, it's the difference between +following the crowd or actively directing where it can go, in both +ways you can end up in the middle of it all. +.PP +The sideffect of this compromise is that it also uses less storage, +and the list never has to be reordered, all the ordering happens when +pages are added or deleted. +.PP +It is an interesting twist to the implementation that the +.B +struct pginfo +.R +Is allocated with malloc. +That is, "as with malloc" to be painfully correct. +The code knows the special case where the first (couple) of allocations on +the page is actually the pginfo structure and deals with it accordingly. +This avoids some silly "chicken and egg" issues. +.ds RH Bells and whistles. +.NH +Bells and whistles. +.PP +brk(2) is actually not a very fast system call when you ask for storage. +This is mainly because of the need by the kernel to zero the pages before +handing them over, so therefore this implementation does not release +back heap-pages, until there is a large chunk to release back to the kernel. +Chances are pretty good that we will need it again pretty soon anyway. +Since these pages are not accessed at all, they will soon be paged out +and don't affect anything but swap-space usage. +.PP +The page directory is actually kept in a mmap(2)'ed piece of +anonymous memory. This avoids some rather silly cases that +we would otherwise have to be handled when the page directory +has to be extended. +.PP +One particular nice feature is that all pointers passed to free(3) +and realloc(3) can be checked conclusively for validity: +First the pointer is masked to find the page. The page directory +is then examined, it must contain either MALLOC_FIRST, in which +case the pointer must point exactly at the page, or it can contain +a struct pginfo*, in which case the pointer must point to a one of +the chunks described by that structure. +Warnings will be printed on stderr and nothing will be done with +the pointer in case it is found to be invalid. +.PP +An environment variable +.B MALLOC_OPTIONS +allows the user some control over the behaviour of malloc. +Some of the more interesting options are: +.IP +.B Abort +If malloc fails to allocate storage, core-dump the process with +a message rather than expect it handle this correctly. +It's amazing how few programs actually handle this condition correctly, +and consequently the havoc they can create is the more creative or +destructive. +.IP +.B Realloc +Always do a free and malloc when realloc(3) is called. The default +is to leave things alone if the size of the allocation is still in +the same size-class. +For programs doing garbage collect using realloc(3) this make the +heap collapse faster. Since the malloc will reallocate from the +lowest available address. +.IP +.B Junk +will explicitly fill the allocated area with a particular value +to try to detect if programs rely on it being zero. +.IP +.B Zero +will explicitly zero out the allocated chunk of memory, while any +space after the allocation in the chunk will be filled with the +junk value to try to catch out of the chunk references. +.ds RH The road not taken. +.NH +The road yet not taken. +.PP +A couple of avenues were explored that could be interesting in some +set of circumstances. +.PP +Using mmap(2) instead of brk(2) was actually slower, since brk(2) +knows a lot of the things that mmap has to find out first. +.PP +A system call where we could tell the kernel that "we don't +need the contents of this page anymore" would allow us to +return the pages on the free list to the kernel and to instruct +the kernel that it doesn't need to page it out nor in. +It would save some page-out events, and the page-in would be replaced +by a zero-fill page. +This is, according to the VM goods in the FreeBSD camp, "easy", +and it will probably be attempted at some point in the future. +.PP +In general there is little room for further improvement of the +time-overhead of the malloc, further improvements will have to +be in the area of improving paging behaviour. +.PP +It is still under consideration to add a feature such that +if realloc is called with two zero arguments, the internal +allocations will be reallocated to perform a garbage collect. +This could be used in certain types of programs to collapse +the memory use, but so far it doesn't seem to be worth the effort. +.PP +Malloc/Free can be a significant point of contention in multi-threaded +programs. Low-grain locking of the data-structures inside the +implementation should be implemented to avoid excessive spin-waiting. + |