Diffstat (limited to 'share/doc/papers/memfs/1.t')
-rw-r--r-- | share/doc/papers/memfs/1.t | 392
1 file changed, 392 insertions, 0 deletions
diff --git a/share/doc/papers/memfs/1.t b/share/doc/papers/memfs/1.t
new file mode 100644
index 0000000..a065844
--- /dev/null
+++ b/share/doc/papers/memfs/1.t
@@ -0,0 +1,392 @@
+.\" Copyright (c) 1990 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\"    must display the following acknowledgement:
+.\"	This product includes software developed by the University of
+.\"	California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\"    may be used to endorse or promote products derived from this software
+.\"    without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\"	@(#)1.t	5.1 (Berkeley) 4/16/91
+.\"
+.nr PS 11
+.nr VS 13
+.SH
+Introduction
+.PP
+This paper describes the motivation for and implementation of
+a memory-based filesystem.
+Memory-based filesystems have existed for a long time;
+they have generally been marketed as RAM disks or sometimes
+as software packages that use the machine's general purpose memory.
+.[
+white
+.]
+.PP
+A RAM disk is designed to appear like any other disk peripheral
+connected to a machine.
+It is normally interfaced to the processor through the I/O bus
+and is accessed through a device driver similar or sometimes identical
+to the device driver used for a normal magnetic disk.
+The device driver sends requests for blocks of data to the device
+and the requested data is then DMA'ed to or from the requested block.
+Instead of storing its data on a rotating magnetic disk,
+the RAM disk stores its data in a large array of random access memory
+or bubble memory.
+Thus, the latency of accessing the RAM disk is nearly zero
+compared to the 15-50 milliseconds of latency incurred when
+accessing rotating magnetic media.
+RAM disks also have the benefit of being able to transfer data at
+the maximum DMA rate of the system,
+while disks are typically limited by the rate that the data passes
+under the disk head.
+.PP
+Software packages simulating RAM disks operate by allocating
+a fixed partition of the system memory.
+The software then provides a device driver interface similar
+to the one described for hardware RAM disks,
+except that it uses memory-to-memory copy instead of DMA to move
+the data between the RAM disk and the system buffers,
+or it maps the contents of the RAM disk into the system buffers.
+Because the memory used by the RAM disk is not available for
+other purposes, software RAM-disk solutions are used primarily
+on machines with limited addressing capabilities such as PC's
+that do not have an effective way of using the extra memory anyway.
+.PP
+Most software RAM disks lose their contents when the system is powered
+down or rebooted.
+The contents can be saved by using battery-backed-up memory,
+by storing critical filesystem data structures in the filesystem,
+and by running a consistency check program after each reboot.
+These precautions increase the hardware cost
+and potentially slow down the disk.
+Thus, RAM-disk filesystems are not typically
+designed to survive power failures;
+because of their volatility, their usefulness is limited to transient
+or easily recreated information such as might be found in
+.PN /tmp .
+Their primary benefit is that they have higher throughput
+than disk-based filesystems.
+.[
+smith
+.]
+This improved throughput is particularly useful for utilities that
+make heavy use of temporary files, such as compilers.
+On fast processors, nearly half of the elapsed time for a compilation
+is spent waiting for synchronous operations required for file
+creation and deletion.
+The use of the memory-based filesystem nearly eliminates this waiting time.
+.PP
+Using dedicated memory to exclusively support a RAM disk
+is a poor use of resources.
+The overall throughput of the system can be improved
+by using the memory where the access rate is highest.
+These needs may shift between supporting process virtual address spaces
+and caching frequently used disk blocks.
+If the memory is dedicated to the filesystem,
+it is better used in a buffer cache.
+The buffer cache permits faster access to the data
+because it requires only a single memory-to-memory copy
+from the kernel to the user process.
+The use of memory in a RAM-disk configuration may require two
+memory-to-memory copies, one from the RAM disk
+to the buffer cache,
+then another copy from the buffer cache to the user process.
+.PP
+The new work being presented in this paper is building a prototype
+RAM-disk filesystem in pageable memory instead of dedicated memory.
+The goal is to provide the speed benefits of a RAM disk
+without paying the performance penalty inherent in dedicating
+part of the physical memory on the machine to the RAM disk.
+Because the filesystem is built in pageable memory,
+it competes with other processes for the available memory.
+When memory runs short, the paging system pushes its
+least-recently-used pages to backing store.
+Being pageable also allows the filesystem to be much larger than
+would be practical if it were limited by the amount of physical
+memory that could be dedicated to that purpose.
+We typically operate our
+.PN /tmp
+with 30 to 60 megabytes of space,
+which is larger than the amount of memory on the machine.
+This configuration allows small files to be accessed quickly,
+while still allowing
+.PN /tmp
+to be used for big files,
+although at a speed more typical of normal, disk-based filesystems.
+.PP
+An alternative to building a memory-based filesystem would be to have
+a filesystem that never did operations synchronously and never
+flushed its dirty buffers to disk.
+However, we believe that such a filesystem would either
+use a disproportionately large percentage of the buffer
+cache space, to the detriment of other filesystems,
+or would require the paging system to flush its dirty pages.
+Waiting for other filesystems to push dirty pages
+subjects them to delays while waiting for the pages to be written.
+We await the results of others trying this approach.
+.[
+Ohta
+.]
+.SH
+Implementation
+.PP
+The current implementation took less time to write than did this paper.
+It consists of 560 lines of kernel code (1.7K text + data)
+and some minor modifications to the program that builds
+disk-based filesystems, \fInewfs\fP.
+A condensed version of the kernel code for the
+memory-based filesystem is reproduced in Appendix 1.
+.PP
+A filesystem is created by invoking the modified \fInewfs\fP, with
+an option telling it to create a memory-based filesystem.
+It allocates a section of virtual address space of the requested
+size and builds a filesystem in that memory
+instead of on a disk partition.
+Once the filesystem is built,
+\fInewfs\fP does a \fImount\fP system call specifying a filesystem type of
+.SM MFS
+(Memory File System).
+The auxiliary data parameter to the mount call specifies a pointer
+to the base of the memory in which it has built the filesystem.
+(The auxiliary data parameter used by the local filesystem, \fIufs\fP,
+specifies the block device containing the filesystem.)
+.PP
+The mount system call allocates and initializes a mount table
+entry and then calls the filesystem-specific mount routine.
+The filesystem-specific routine is responsible for doing
+the mount and initializing the filesystem-specific
+portion of the mount table entry.
+The memory-based filesystem-specific mount routine,
+.RN mfs_mount ,
+is shown in Appendix 1.
+It allocates a block-device vnode to represent the memory disk device.
+In the private area of this vnode it stores the base address of
+the filesystem and the process identifier of the \fInewfs\fP process
+for later reference when doing I/O.
+It also initializes an I/O list that it
+uses to record outstanding I/O requests.
+It can then call the \fIufs\fP filesystem mount routine,
+passing the special block-device vnode that it has created
+instead of the usual disk block-device vnode.
+The mount proceeds just as any other local mount, except that
+requests to read from the block device are vectored through
+.RN mfs_strategy
+(described below) instead of the usual
+.RN spec_strategy
+block device I/O function.
+When the mount is completed,
+.RN mfs_mount
+does not return as most other filesystem mount functions do;
+instead, it sleeps in the kernel awaiting I/O requests.
+Each time an I/O request is posted for the filesystem,
+a wakeup is issued for the corresponding \fInewfs\fP process.
+When awakened, the process checks for requests on its buffer list.
+A read request is serviced by copying data from the section of the
+\fInewfs\fP address space corresponding to the requested disk block
+to the kernel buffer.
+Similarly, a write request is serviced by copying data to the section of the
+\fInewfs\fP address space corresponding to the requested disk block
+from the kernel buffer.
+When all the requests have been serviced, the \fInewfs\fP
+process returns to sleep to await more requests.
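+.PP
+The following condensed sketch suggests the shape of this
+I/O-servicing loop.
+It is illustrative only, not the actual kernel code
+(which is reproduced in Appendix 1);
+the \fImfsnode\fP field names, the unmount sentinel value, and the
+historic
+.RN sleep ,
+.RN copyin ,
+.RN copyout ,
+and
+.RN biodone
+calling conventions are assumptions in the style of the 4.3BSD kernel.
+.DS
+struct mfsnode {
+	caddr_t	mfs_baseoff;		/* base of filesystem in newfs's space */
+	long	mfs_pid;		/* pid of the newfs process */
+	struct	buf *mfs_buflist;	/* list of outstanding I/O requests */
+};
+
+/* Tail of mfs_mount: loops in the context of the newfs process. */
+struct buf *bp;
+caddr_t base;
+
+for (;;) {
+	if ((bp = mfsp->mfs_buflist) == NULL) {
+		sleep((caddr_t)vp, PWAIT);	/* await an I/O request */
+		continue;
+	}
+	if (bp == (struct buf *)-1)	/* sentinel set at unmount */
+		return (0);		/* lets the newfs process exit */
+	mfsp->mfs_buflist = bp->av_forw;	/* dequeue the request */
+	base = mfsp->mfs_baseoff + bp->b_blkno * DEV_BSIZE;
+	if (bp->b_flags & B_READ)	/* newfs space to kernel buffer */
+		copyin(base, bp->b_un.b_addr, (u_int)bp->b_bcount);
+	else				/* kernel buffer to newfs space */
+		copyout(bp->b_un.b_addr, base, (u_int)bp->b_bcount);
+	biodone(bp);			/* mark the request complete */
+}
+.DE
+Because the loop runs in the context of the \fInewfs\fP process,
+a simple copyin or copyout suffices to move data between the kernel
+buffer and the filesystem image.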
+.PP
+Once mounted,
+all operations on files in the memory-based filesystem are handled
+by the \fIufs\fP filesystem code until they get to the point where the
+filesystem needs to do I/O on the device.
+Here, the filesystem encounters the second piece of the
+memory-based filesystem.
+Instead of calling the special-device strategy routine,
+it calls the memory-based strategy routine,
+.RN mfs_strategy .
+Usually,
+the request is serviced by linking the buffer onto the
+I/O list for the memory-based filesystem
+vnode and sending a wakeup to the \fInewfs\fP process.
+This wakeup results in a context switch to the \fInewfs\fP
+process, which does a copyin or copyout as described above.
+The strategy routine must be careful to check whether
+the I/O request is coming from the \fInewfs\fP process itself, however.
+Such requests happen during mount and unmount operations,
+when the kernel is reading and writing the superblock.
+Here,
+.RN mfs_strategy
+must do the I/O itself to avoid deadlock.
+.PP
+The final piece of kernel code to support the
+memory-based filesystem is the close routine.
+After the filesystem has been successfully unmounted,
+the device close routine is called.
+For a memory-based filesystem, the device close routine is
+.RN mfs_close .
+This routine flushes any pending I/O requests,
+then sets the I/O list head to a special value
+that is recognized by the I/O servicing loop in
+.RN mfs_mount
+as an indication that the filesystem is unmounted.
+The
+.RN mfs_mount
+routine exits, in turn causing the \fInewfs\fP process
+to exit, resulting in the filesystem vanishing in a cloud of dirty pages.
+.PP
+The paging of the filesystem does not require any additional
+code beyond that already in the kernel to support virtual memory.
+The \fInewfs\fP process competes with other processes on an equal basis
+for the machine's available memory.
+Data pages of the filesystem that have not yet been used
+are zero-fill-on-demand pages that do not occupy memory,
+although they currently allocate space in backing store.
+As long as memory is plentiful, the entire contents of the filesystem
+remain memory resident.
+When memory runs short, the oldest pages of \fInewfs\fP will be
+pushed to backing store as part of the normal paging activity.
+The pages that are pushed usually hold the contents of
+files that have been created in the memory-based filesystem
+but have not been recently accessed (or have been deleted).
+.[
+leffler
+.]
+.SH
+Performance
+.PP
+The performance of the current memory-based filesystem is determined by
+the memory-to-memory copy speed of the processor.
+Empirically, we find that the throughput is about 45% of this
+memory-to-memory copy speed.
+The basic set of steps for each block written is:
+.IP 1)
+memory-to-memory copy from the user process doing the write to a kernel buffer
+.IP 2)
+context switch to the \fInewfs\fP process
+.IP 3)
+memory-to-memory copy from the kernel buffer to the \fInewfs\fP address space
+.IP 4)
+context switch back to the writing process
+.LP
+Thus, each write requires at least two memory-to-memory copies,
+accounting for about 90% of the
+.SM CPU
+time.
+The remaining 10% is consumed in the context switches and
+the filesystem allocation and block location code.
+The actual context-switch count is only about half
+that of the worst case outlined above because
+read-ahead and write-behind allow multiple blocks
+to be handled with each context switch.
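+.PP
+The context switch charged to each write above is triggered in
+.RN mfs_strategy .
+The following sketch is a hypothetical condensation of that routine
+as described in the Implementation section, not the actual code
+(see Appendix 1); the
+.RN VTOMFS
+and
+.RN mfs_doio
+helpers, and the use of the user structure \fIu\fP to identify the
+current process, are assumptions in the style of the 4.3BSD kernel.
+.DS
+int
+mfs_strategy(bp)
+	struct buf *bp;
+{
+	struct vnode *vp = bp->b_vp;		/* MFS block-device vnode */
+	struct mfsnode *mfsp = VTOMFS(vp);
+
+	if (mfsp->mfs_pid == u.u_procp->p_pid) {
+		/*
+		 * The request comes from the newfs process itself
+		 * (superblock I/O during mount or unmount), so the
+		 * copy must be done here to avoid deadlock.
+		 */
+		mfs_doio(bp, mfsp->mfs_baseoff);
+	} else {
+		/*
+		 * Queue the buffer and wake the sleeping newfs
+		 * process, which services it as shown earlier.
+		 */
+		bp->av_forw = mfsp->mfs_buflist;
+		mfsp->mfs_buflist = bp;
+		wakeup((caddr_t)vp);
+	}
+	return (0);
+}
+.DE
+The queue-and-wakeup path is what produces the context switch
+counted in the accounting above;
+the direct path avoids the deadlock that would occur if the
+\fInewfs\fP process queued a request and then waited for itself
+to service it.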
+.PP
+On the six-\c
+.SM "MIPS CCI"
+Power 6/32 machine,
+the raw reading and writing speed is only about twice that of
+a regular disk-based filesystem.
+However, for processes that create and delete many files,
+the speedup is considerably greater.
+The reason for the speedup is that the filesystem
+must do two synchronous operations to create a file:
+first writing the allocated inode to disk, then creating the
+directory entry.
+Deleting a file similarly requires at least two synchronous
+operations.
+Here, the low latency of the memory-based filesystem is
+noticeable compared to the disk-based filesystem,
+as a synchronous operation can be done with
+just two context switches instead of incurring the disk latency.
+.SH
+Future Work
+.PP
+The most obvious shortcoming of the current implementation
+is that filesystem blocks are copied twice, once between the \fInewfs\fP
+process' address space and the kernel buffer cache,
+and once between the kernel buffer and the requesting process.
+These copies are done in different process contexts, necessitating
+two context switches per group of I/O requests.
+These problems arise because of the current inability of the kernel
+to do page-in operations
+for an address space other than that of the currently running process,
+and the current inconvenience of mapping process-owned pages into the kernel
+buffer cache.
+Both of these problems are expected to be solved in the next version
+of the virtual memory system,
+and thus we chose not to address them in the current implementation.
+With the new version of the virtual memory system, we expect to use
+any part of physical memory as part of the buffer cache,
+even though it will not be entirely addressable at once within the kernel.
+In that system, the implementation of a memory-based filesystem
+that avoids the double copy and context switches will be much easier.
+.PP
+Ideally, part of the kernel's address space would reside in pageable memory.
+Once such a facility is available, it would be most efficient to
+build a memory-based filesystem within the kernel.
+One potential problem with such a scheme is that many kernels
+are limited to a small address space (usually a few megabytes).
+This restriction limits the size of the memory-based
+filesystem that such a machine can support.
+On such a machine, the kernel can describe a memory-based filesystem
+that is larger than its address space and use a ``window''
+to map the larger filesystem address space into its limited address space.
+The window would maintain a cache of recently accessed pages.
+The problem with this scheme is that if the working set of
+active pages is greater than the size of the window, then
+much time is spent remapping pages and invalidating
+translation buffers.
+Alternatively, a separate address space could be constructed for each
+memory-based filesystem as in the current implementation,
+and the memory-resident pages of that address space could be mapped
+and accessed exactly as other cached pages are.
+.PP
+The current system uses the existing local filesystem structures
+and code to implement the memory-based filesystem.
+The major advantages of this approach are the sharing of code
+and its simplicity.
+There are several disadvantages, however.
+One is that the size of the filesystem is fixed at mount time.
+This means that a fixed number of inodes (files) and data blocks
+can be supported.
+Currently, this approach requires enough swap space for the entire
+filesystem, and prevents expansion and contraction of the filesystem on demand.
+The current design also prevents the filesystem from taking advantage
+of its memory-resident character.
+It would be interesting to explore other filesystem implementations
+that would be less expensive to execute and that would make better
+use of the space.
+For example, the current filesystem structure is optimized for magnetic
+disks.
+It includes replicated control structures, ``cylinder groups''
+with separate allocation maps and control structures,
+and data structures that optimize rotational layout of files.
+None of this is useful in a memory-based filesystem (at least when the
+backing store for the filesystem is dynamically allocated and not
+contiguous on a single disk type).
+On the other hand,
+directories could be implemented using dynamically allocated
+memory organized as linked lists or trees rather than as files stored
+in ``disk'' blocks.
+Allocation and location of pages for file data might use virtual memory
+primitives and data structures rather than direct and indirect blocks.
+A reimplementation along these lines will be considered when the virtual
+memory system in the current system has been replaced.
+.[
+$LIST$
+.]