summaryrefslogtreecommitdiffstats
path: root/share/doc/smm/05.fastfs/3.t
diff options
context:
space:
mode:
Diffstat (limited to 'share/doc/smm/05.fastfs/3.t')
-rw-r--r--share/doc/smm/05.fastfs/3.t594
1 files changed, 594 insertions, 0 deletions
diff --git a/share/doc/smm/05.fastfs/3.t b/share/doc/smm/05.fastfs/3.t
new file mode 100644
index 0000000..0ab1196
--- /dev/null
+++ b/share/doc/smm/05.fastfs/3.t
@@ -0,0 +1,594 @@
+.\" Copyright (c) 1986, 1993
+.\" The Regents of the University of California. All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)3.t 8.1 (Berkeley) 6/8/93
+.\"
+.ds RH New file system
+.NH
+New file system organization
+.PP
+In the new file system organization (as in the
+old file system organization),
+each disk drive contains one or more file systems.
+A file system is described by its super-block,
+located at the beginning of the file system's disk partition.
+Because the super-block contains critical data,
+it is replicated to protect against catastrophic loss.
+This is done when the file system is created;
+since the super-block data does not change,
+the copies need not be referenced unless a head crash
+or other hard disk error causes the default super-block
+to be unusable.
+.PP
+To insure that it is possible to create files as large as
+$2 sup 32$ bytes with only two levels of indirection,
+the minimum size of a file system block is 4096 bytes.
+The size of file system blocks can be any power of two
+greater than or equal to 4096.
+The block size of a file system is recorded in the
+file system's super-block
+so it is possible for file systems with different block sizes
+to be simultaneously accessible on the same system.
+The block size must be decided at the time that
+the file system is created;
+it cannot be subsequently changed without rebuilding the file system.
+.PP
+The new file system organization divides a disk partition
+into one or more areas called
+.I "cylinder groups".
+A cylinder group is comprised of one or more consecutive
+cylinders on a disk.
+Associated with each cylinder group is some bookkeeping information
+that includes a redundant copy of the super-block,
+space for inodes,
+a bit map describing available blocks in the cylinder group,
+and summary information describing the usage of data blocks
+within the cylinder group.
+The bit map of available blocks in the cylinder group replaces
+the traditional file system's free list.
+For each cylinder group a static number of inodes
+is allocated at file system creation time.
+The default policy is to allocate one inode for each 2048
+bytes of space in the cylinder group, expecting this
+to be far more than will ever be needed.
+.PP
+All the cylinder group bookkeeping information could be
+placed at the beginning of each cylinder group.
+However if this approach were used,
+all the redundant information would be on the top platter.
+A single hardware failure that destroyed the top platter
+could cause the loss of all redundant copies of the super-block.
+Thus the cylinder group bookkeeping information
+begins at a varying offset from the beginning of the cylinder group.
+The offset for each successive cylinder group is calculated to be
+about one track further from the beginning of the cylinder group
+than the preceding cylinder group.
+In this way the redundant
+information spirals down into the pack so that any single track, cylinder,
+or platter can be lost without losing all copies of the super-block.
+Except for the first cylinder group,
+the space between the beginning of the cylinder group
+and the beginning of the cylinder group information
+is used for data blocks.\(dg
+.FS
+\(dg While it appears that the first cylinder group could be laid
+out with its super-block at the ``known'' location,
+this would not work for file systems
+with blocks sizes of 16 kilobytes or greater.
+This is because of a requirement that the first 8 kilobytes of the disk
+be reserved for a bootstrap program and a separate requirement that
+the cylinder group information begin on a file system block boundary.
+To start the cylinder group on a file system block boundary,
+file systems with block sizes larger than 8 kilobytes
+would have to leave an empty space between the end of
+the boot block and the beginning of the cylinder group.
+Without knowing the size of the file system blocks,
+the system would not know what roundup function to use
+to find the beginning of the first cylinder group.
+.FE
+.NH 2
+Optimizing storage utilization
+.PP
+Data is laid out so that larger blocks can be transferred
+in a single disk transaction, greatly increasing file system throughput.
+As an example, consider a file in the new file system
+composed of 4096 byte data blocks.
+In the old file system this file would be composed of 1024 byte blocks.
+By increasing the block size, disk accesses in the new file
+system may transfer up to four times as much information per
+disk transaction.
+In large files, several
+4096 byte blocks may be allocated from the same cylinder so that
+even larger data transfers are possible before requiring a seek.
+.PP
+The main problem with
+larger blocks is that most UNIX
+file systems are composed of many small files.
+A uniformly large block size wastes space.
+Table 1 shows the effect of file system
+block size on the amount of wasted space in the file system.
+The files measured to obtain these figures reside on
+one of our time sharing
+systems that has roughly 1.2 gigabytes of on-line storage.
+The measurements are based on the active user file systems containing
+about 920 megabytes of formatted space.
+.KF
+.DS B
+.TS
+box;
+l|l|l
+a|n|l.
+Space used % waste Organization
+_
+775.2 Mb 0.0 Data only, no separation between files
+807.8 Mb 4.2 Data only, each file starts on 512 byte boundary
+828.7 Mb 6.9 Data + inodes, 512 byte block UNIX file system
+866.5 Mb 11.8 Data + inodes, 1024 byte block UNIX file system
+948.5 Mb 22.4 Data + inodes, 2048 byte block UNIX file system
+1128.3 Mb 45.6 Data + inodes, 4096 byte block UNIX file system
+.TE
+Table 1 \- Amount of wasted space as a function of block size.
+.DE
+.KE
+The space wasted is calculated to be the percentage of space
+on the disk not containing user data.
+As the block size on the disk
+increases, the waste rises quickly, to an intolerable
+45.6% waste with 4096 byte file system blocks.
+.PP
+To be able to use large blocks without undue waste,
+small files must be stored in a more efficient way.
+The new file system accomplishes this goal by allowing the division
+of a single file system block into one or more
+.I "fragments".
+The file system fragment size is specified
+at the time that the file system is created;
+each file system block can optionally be broken into
+2, 4, or 8 fragments, each of which is addressable.
+The lower bound on the size of these fragments is constrained
+by the disk sector size,
+typically 512 bytes.
+The block map associated with each cylinder group
+records the space available in a cylinder group
+at the fragment level;
+to determine if a block is available, aligned fragments are examined.
+Figure 1 shows a piece of a map from a 4096/1024 file system.
+.KF
+.DS B
+.TS
+box;
+l|c c c c.
+Bits in map XXXX XXOO OOXX OOOO
+Fragment numbers 0-3 4-7 8-11 12-15
+Block numbers 0 1 2 3
+.TE
+Figure 1 \- Example layout of blocks and fragments in a 4096/1024 file system.
+.DE
+.KE
+Each bit in the map records the status of a fragment;
+an ``X'' shows that the fragment is in use,
+while a ``O'' shows that the fragment is available for allocation.
+In this example,
+fragments 0\-5, 10, and 11 are in use,
+while fragments 6\-9, and 12\-15 are free.
+Fragments of adjoining blocks cannot be used as a full block,
+even if they are large enough.
+In this example,
+fragments 6\-9 cannot be allocated as a full block;
+only fragments 12\-15 can be coalesced into a full block.
+.PP
+On a file system with a block size of 4096 bytes
+and a fragment size of 1024 bytes,
+a file is represented by zero or more 4096 byte blocks of data,
+and possibly a single fragmented block.
+If a file system block must be fragmented to obtain
+space for a small amount of data,
+the remaining fragments of the block are made
+available for allocation to other files.
+As an example consider an 11000 byte file stored on
+a 4096/1024 byte file system.
+This file would uses two full size blocks and one
+three fragment portion of another block.
+If no block with three aligned fragments is
+available at the time the file is created,
+a full size block is split yielding the necessary
+fragments and a single unused fragment.
+This remaining fragment can be allocated to another file as needed.
+.PP
+Space is allocated to a file when a program does a \fIwrite\fP
+system call.
+Each time data is written to a file, the system checks to see if
+the size of the file has increased*.
+.FS
+* A program may be overwriting data in the middle of an existing file
+in which case space would already have been allocated.
+.FE
+If the file needs to be expanded to hold the new data,
+one of three conditions exists:
+.IP 1)
+There is enough space left in an already allocated
+block or fragment to hold the new data.
+The new data is written into the available space.
+.IP 2)
+The file contains no fragmented blocks (and the last
+block in the file
+contains insufficient space to hold the new data).
+If space exists in a block already allocated,
+the space is filled with new data.
+If the remainder of the new data contains more than
+a full block of data, a full block is allocated and
+the first full block of new data is written there.
+This process is repeated until less than a full block
+of new data remains.
+If the remaining new data to be written will
+fit in less than a full block,
+a block with the necessary fragments is located,
+otherwise a full block is located.
+The remaining new data is written into the located space.
+.IP 3)
+The file contains one or more fragments (and the
+fragments contain insufficient space to hold the new data).
+If the size of the new data plus the size of the data
+already in the fragments exceeds the size of a full block,
+a new block is allocated.
+The contents of the fragments are copied
+to the beginning of the block
+and the remainder of the block is filled with new data.
+The process then continues as in (2) above.
+Otherwise, if the new data to be written will
+fit in less than a full block,
+a block with the necessary fragments is located,
+otherwise a full block is located.
+The contents of the existing fragments
+appended with the new data
+are written into the allocated space.
+.PP
+The problem with expanding a file one fragment at a
+a time is that data may be copied many times as a
+fragmented block expands to a full block.
+Fragment reallocation can be minimized
+if the user program writes a full block at a time,
+except for a partial block at the end of the file.
+Since file systems with different block sizes may reside on
+the same system,
+the file system interface has been extended to provide
+application programs the optimal size for a read or write.
+For files the optimal size is the block size of the file system
+on which the file is being accessed.
+For other objects, such as pipes and sockets,
+the optimal size is the underlying buffer size.
+This feature is used by the Standard
+Input/Output Library,
+a package used by most user programs.
+This feature is also used by
+certain system utilities such as archivers and loaders
+that do their own input and output management
+and need the highest possible file system bandwidth.
+.PP
+The amount of wasted space in the 4096/1024 byte new file system
+organization is empirically observed to be about the same as in the
+1024 byte old file system organization.
+A file system with 4096 byte blocks and 512 byte fragments
+has about the same amount of wasted space as the 512 byte
+block UNIX file system.
+The new file system uses less space
+than the 512 byte or 1024 byte
+file systems for indexing information for
+large files and the same amount of space
+for small files.
+These savings are offset by the need to use
+more space for keeping track of available free blocks.
+The net result is about the same disk utilization
+when a new file system's fragment size
+equals an old file system's block size.
+.PP
+In order for the layout policies to be effective,
+a file system cannot be kept completely full.
+For each file system there is a parameter, termed
+the free space reserve, that
+gives the minimum acceptable percentage of file system
+blocks that should be free.
+If the number of free blocks drops below this level
+only the system administrator can continue to allocate blocks.
+The value of this parameter may be changed at any time,
+even when the file system is mounted and active.
+The transfer rates that appear in section 4 were measured on file
+systems kept less than 90% full (a reserve of 10%).
+If the number of free blocks falls to zero,
+the file system throughput tends to be cut in half,
+because of the inability of the file system to localize
+blocks in a file.
+If a file system's performance degrades because
+of overfilling, it may be restored by removing
+files until the amount of free space once again
+reaches the minimum acceptable level.
+Access rates for files created during periods of little
+free space may be restored by moving their data once enough
+space is available.
+The free space reserve must be added to the
+percentage of waste when comparing the organizations given
+in Table 1.
+Thus, the percentage of waste in
+an old 1024 byte UNIX file system is roughly
+comparable to a new 4096/512 byte file system
+with the free space reserve set at 5%.
+(Compare 11.8% wasted with the old file system
+to 6.9% waste + 5% reserved space in the
+new file system.)
+.NH 2
+File system parameterization
+.PP
+Except for the initial creation of the free list,
+the old file system ignores the parameters of the underlying hardware.
+It has no information about either the physical characteristics
+of the mass storage device,
+or the hardware that interacts with it.
+A goal of the new file system is to parameterize the
+processor capabilities and
+mass storage characteristics
+so that blocks can be allocated in an
+optimum configuration-dependent way.
+Parameters used include the speed of the processor,
+the hardware support for mass storage transfers,
+and the characteristics of the mass storage devices.
+Disk technology is constantly improving and
+a given installation can have several different disk technologies
+running on a single processor.
+Each file system is parameterized so that it can be
+adapted to the characteristics of the disk on which
+it is placed.
+.PP
+For mass storage devices such as disks,
+the new file system tries to allocate new blocks
+on the same cylinder as the previous block in the same file.
+Optimally, these new blocks will also be
+rotationally well positioned.
+The distance between ``rotationally optimal'' blocks varies greatly;
+it can be a consecutive block
+or a rotationally delayed block
+depending on system characteristics.
+On a processor with an input/output channel that does not require
+any processor intervention between mass storage transfer requests,
+two consecutive disk blocks can often be accessed
+without suffering lost time because of an intervening disk revolution.
+For processors without input/output channels,
+the main processor must field an interrupt and
+prepare for a new disk transfer.
+The expected time to service this interrupt and
+schedule a new disk transfer depends on the
+speed of the main processor.
+.PP
+The physical characteristics of each disk include
+the number of blocks per track and the rate at which
+the disk spins.
+The allocation routines use this information to calculate
+the number of milliseconds required to skip over a block.
+The characteristics of the processor include
+the expected time to service an interrupt and schedule a
+new disk transfer.
+Given a block allocated to a file,
+the allocation routines calculate the number of blocks to
+skip over so that the next block in the file will
+come into position under the disk head in the expected
+amount of time that it takes to start a new
+disk transfer operation.
+For programs that sequentially access large amounts of data,
+this strategy minimizes the amount of time spent waiting for
+the disk to position itself.
+.PP
+To ease the calculation of finding rotationally optimal blocks,
+the cylinder group summary information includes
+a count of the available blocks in a cylinder
+group at different rotational positions.
+Eight rotational positions are distinguished,
+so the resolution of the
+summary information is 2 milliseconds for a typical 3600
+revolution per minute drive.
+The super-block contains a vector of lists called
+.I "rotational layout tables".
+The vector is indexed by rotational position.
+Each component of the vector
+lists the index into the block map for every data block contained
+in its rotational position.
+When looking for an allocatable block,
+the system first looks through the summary counts for a rotational
+position with a non-zero block count.
+It then uses the index of the rotational position to find the appropriate
+list to use to index through
+only the relevant parts of the block map to find a free block.
+.PP
+The parameter that defines the
+minimum number of milliseconds between the completion of a data
+transfer and the initiation of
+another data transfer on the same cylinder
+can be changed at any time,
+even when the file system is mounted and active.
+If a file system is parameterized to lay out blocks with
+a rotational separation of 2 milliseconds,
+and the disk pack is then moved to a system that has a
+processor requiring 4 milliseconds to schedule a disk operation,
+the throughput will drop precipitously because of lost disk revolutions
+on nearly every block.
+If the eventual target machine is known,
+the file system can be parameterized for it
+even though it is initially created on a different processor.
+Even if the move is not known in advance,
+the rotational layout delay can be reconfigured after the disk is moved
+so that all further allocation is done based on the
+characteristics of the new host.
+.NH 2
+Layout policies
+.PP
+The file system layout policies are divided into two distinct parts.
+At the top level are global policies that use file system
+wide summary information to make decisions regarding
+the placement of new inodes and data blocks.
+These routines are responsible for deciding the
+placement of new directories and files.
+They also calculate rotationally optimal block layouts,
+and decide when to force a long seek to a new cylinder group
+because there are insufficient blocks left
+in the current cylinder group to do reasonable layouts.
+Below the global policy routines are
+the local allocation routines that use a locally optimal scheme to
+lay out data blocks.
+.PP
+Two methods for improving file system performance are to increase
+the locality of reference to minimize seek latency
+as described by [Trivedi80], and
+to improve the layout of data to make larger transfers possible
+as described by [Nevalainen77].
+The global layout policies try to improve performance
+by clustering related information.
+They cannot attempt to localize all data references,
+but must also try to spread unrelated data
+among different cylinder groups.
+If too much localization is attempted,
+the local cylinder group may run out of space
+forcing the data to be scattered to non-local cylinder groups.
+Taken to an extreme,
+total localization can result in a single huge cluster of data
+resembling the old file system.
+The global policies try to balance the two conflicting
+goals of localizing data that is concurrently accessed
+while spreading out unrelated data.
+.PP
+One allocatable resource is inodes.
+Inodes are used to describe both files and directories.
+Inodes of files in the same directory are frequently accessed together.
+For example, the ``list directory'' command often accesses
+the inode for each file in a directory.
+The layout policy tries to place all the inodes of
+files in a directory in the same cylinder group.
+To ensure that files are distributed throughout the disk,
+a different policy is used for directory allocation.
+A new directory is placed in a cylinder group that has a greater
+than average number of free inodes,
+and the smallest number of directories already in it.
+The intent of this policy is to allow the inode clustering policy
+to succeed most of the time.
+The allocation of inodes within a cylinder group is done using a
+next free strategy.
+Although this allocates the inodes randomly within a cylinder group,
+all the inodes for a particular cylinder group can be read with
+8 to 16 disk transfers.
+(At most 16 disk transfers are required because a cylinder
+group may have no more than 2048 inodes.)
+This puts a small and constant upper bound on the number of
+disk transfers required to access the inodes
+for all the files in a directory.
+In contrast, the old file system typically requires
+one disk transfer to fetch the inode for each file in a directory.
+.PP
+The other major resource is data blocks.
+Since data blocks for a file are typically accessed together,
+the policy routines try to place all data
+blocks for a file in the same cylinder group,
+preferably at rotationally optimal positions in the same cylinder.
+The problem with allocating all the data blocks
+in the same cylinder group is that large files will
+quickly use up available space in the cylinder group,
+forcing a spill over to other areas.
+Further, using all the space in a cylinder group
+causes future allocations for any file in the cylinder group
+to also spill to other areas.
+Ideally none of the cylinder groups should ever become completely full.
+The heuristic solution chosen is to
+redirect block allocation
+to a different cylinder group
+when a file exceeds 48 kilobytes,
+and at every megabyte thereafter.*
+.FS
+* The first spill over point at 48 kilobytes is the point
+at which a file on a 4096 byte block file system first
+requires a single indirect block. This appears to be
+a natural first point at which to redirect block allocation.
+The other spillover points are chosen with the intent of
+forcing block allocation to be redirected when a
+file has used about 25% of the data blocks in a cylinder group.
+In observing the new file system in day to day use, the heuristics appear
+to work well in minimizing the number of completely filled
+cylinder groups.
+.FE
+The newly chosen cylinder group is selected from those cylinder
+groups that have a greater than average number of free blocks left.
+Although big files tend to be spread out over the disk,
+a megabyte of data is typically accessible before
+a long seek must be performed,
+and the cost of one long seek per megabyte is small.
+.PP
+The global policy routines call local allocation routines with
+requests for specific blocks.
+The local allocation routines will
+always allocate the requested block
+if it is free, otherwise it
+allocates a free block of the requested size that is
+rotationally closest to the requested block.
+If the global layout policies had complete information,
+they could always request unused blocks and
+the allocation routines would be reduced to simple bookkeeping.
+However, maintaining complete information is costly;
+thus the implementation of the global layout policy
+uses heuristics that employ only partial information.
+.PP
+If a requested block is not available, the local allocator uses
+a four level allocation strategy:
+.IP 1)
+Use the next available block rotationally closest
+to the requested block on the same cylinder. It is assumed
+here that head switching time is zero. On disk
+controllers where this is not the case, it may be possible
+to incorporate the time required to switch between disk platters
+when constructing the rotational layout tables. This, however,
+has not yet been tried.
+.IP 2)
+If there are no blocks available on the same cylinder,
+use a block within the same cylinder group.
+.IP 3)
+If that cylinder group is entirely full,
+quadratically hash the cylinder group number to choose
+another cylinder group to look for a free block.
+.IP 4)
+Finally if the hash fails, apply an exhaustive search
+to all cylinder groups.
+.PP
+Quadratic hash is used because of its speed in finding
+unused slots in nearly full hash tables [Knuth75].
+File systems that are parameterized to maintain at least
+10% free space rarely use this strategy.
+File systems that are run without maintaining any free
+space typically have so few free blocks that almost any
+allocation is random;
+the most important characteristic of
+the strategy used under such conditions is that the strategy be fast.
+.ds RH Performance
+.sp 2
+.ne 1i
OpenPOWER on IntegriCloud