From 982b5386e50eb8776d8e40b66b5ea1dcde7e36e4 Mon Sep 17 00:00:00 2001 From: grog Date: Wed, 24 Mar 1999 09:18:33 +0000 Subject: Add a tutorial-like section "How to set up Vinum" --- sbin/vinum/vinum.8 | 704 +++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 628 insertions(+), 76 deletions(-) diff --git a/sbin/vinum/vinum.8 b/sbin/vinum/vinum.8 index b52a4dc..80c640c 100644 --- a/sbin/vinum/vinum.8 +++ b/sbin/vinum/vinum.8 @@ -2,7 +2,6 @@ .\" .Dd 15 January 1999 .Dt vinum 8 -.Os FreeBSD .Sh NAME .Nm vinum .Nd Logical Volume Manager control program @@ -15,11 +14,11 @@ .Fl f .Ar description-file .in +1i -Create a volume as described in +Create a volume as described in .Ar description-file .in .\" XXX remove this -.Cd attach Ar plex Ar volume +.Cd attach Ar plex Ar volume .Op Nm rename .Cd attach Ar subdisk Ar plex Ar [offset] .Op Nm rename @@ -121,7 +120,7 @@ Write a copy of the current configuration to .in .Cd makedev .in +1i -Remake the device nodes in +Remake the device nodes in .Ar /dev/vinum . .in .Cd quit @@ -168,7 +167,7 @@ configuration. Reset statistisc counters for the specified objects, or for all objects if none are specified. .in -.Cd rm +.Cd rm .Op Fl f .Op Fl r .Ar volume | plex | subdisk @@ -219,7 +218,7 @@ for more information about the volume manager. is designed either for interactive use, when started without a command, or to execute a single command if the command is supplied as arguments to .Nm vinum . -In interactive mode, +In interactive mode, .Nm maintains a command line history. .Ss OPTIONS @@ -228,7 +227,7 @@ commands may optionally be followed by an option. Any of the following options may be specified with any command, but in some cases they do not make any difference: cases, the options are ignored. For example, the .Nm stop -command ignores the +command ignores the .Fl v and .Fl V @@ -237,7 +236,7 @@ options. .It Cd -v The .Nm -v -option can be used with any command to request more detailed information. 
+option can be used with any command to request more detailed information. .It Cd -V The .Nm -V @@ -261,7 +260,9 @@ cause a panic. .It Cd -r The .Nm -r -(``recursive'') option is used by the list commands to display information not +.if t (``recursive'') +.if n ("recursive") +option is used by the list commands to display information not only about the specified objects, but also about subordinate objects. For example, in conjnction with the .Nm lv @@ -302,7 +303,7 @@ is specified, .Nm renames the object (and in the case of a plex, any subordinate subdisks) to fit in with the default -.Nm +.Nm naming convention. .Pp A number of considerations apply to attaching subdisks: @@ -348,9 +349,9 @@ of no longer wanted .Nm drives is to reset the configuration with the .Nm resetconfig -command. In some cases, however, it may be necessary to create new data on +command. In some cases, however, it may be necessary to create new data on .Nm -drives which can no longer be started. In this case, use the +drives which can no longer be started. In this case, use the .Nm create Fl f command. .It Nm debug @@ -359,8 +360,8 @@ command. .Ar debug is used to enter the remote kernel debugger. It is only activated if .Nm -is built with the -.Ar VINUMDEBUG +is built with the +.Ar VINUMDEBUG option. This option will stop the execution of the operating system until the kernel debugger is exited. If remote debugging is set and there is no remote connection for a kernel debugger, it will be necessary to reset the system and @@ -378,11 +379,11 @@ The bit mask is composed of the following values: Show buffer information during requests .It DEBUG_NUMOUTPUT (2) .br -Show the value of +Show the value of .Dv vp->v_numoutput. .It DEBUG_RESID (4) .br -Go into debugger in +Go into debugger in .Fd complete_rqe. .It DEBUG_LASTREQS (8) .br @@ -404,11 +405,11 @@ when the .Nm debug command is issued. 
.El -.It Nm detach Op Fl f +.It Nm detach Op Fl f .Ar plex .if n .sp -1v .if t .sp -.6v -.It Nm detach Op Fl f +.It Nm detach Op Fl f .Ar subdisk .sp .Nm @@ -419,8 +420,11 @@ the operation will fail unless the .Fl f option is specified. If the object is named after the object above it (for example, subdisk vol1.p7.s0 attached to plex vol1.p7), the name will be changed -by prepending the text ``ex-'' (for example, ex-vol1.p7.s0). If necessary, the -name will be truncated in the process. +by prepending the text +.if t ``ex-'' +.if n "ex-" +(for example, ex-vol1.p7.s0). If necessary, the name will be truncated in the +process. .Pp .Nm detach does not reduce the number of subdisks in a striped or RAID-5 plex. Instead, @@ -433,7 +437,7 @@ command. .Ar info displays information about .Nm -memory usage. This is intended primarily for debugging. With the +memory usage. This is intended primarily for debugging. With the .Fl v option, it will give detailed information about the memory areas in use. .Pp @@ -466,7 +470,7 @@ Time Event Buf Dev Offset Bytes SD 14:40:00.685547 4DN Write 0xf2361f40 0x427 0x104109 8192 19 0 0 0 .Ed .Pp -The +The .Ar Buf field always contains the address of the user buffer header. This can be used to identify the requests associated with a user request, though this is not 100% @@ -481,7 +485,7 @@ The field contains information related to the sequence of events in the request chain. The digit .Ar 1 -to +to .Ar 6 indicates the approximate sequence of events, and the two-letter abbreviation is a mnemonic for the location @@ -505,10 +509,10 @@ information. In the following requests, .Ar Dev is the device number of the associated disk partition, -.Ar Offset +.Ar Offset is the offset from the beginning of the partition, .Ar SD -is the subdisk index in +is the subdisk index in .Dv vinum_conf , .Ar SDoff is the offset from the beginning of the subdisk, @@ -520,29 +524,29 @@ is the offset of the associated group request, where applicable. 
(request) shows one of possibly several low-level .Nm requests which are launched to satisfy the high-level request. This information -is also logged in +is also logged in .Fd launch_requests. .It 4DN -(done) is called from +(done) is called from .Fd complete_rqe, showing the completion of a request. This completion should match a request launched either at stage .Ar 4DN -from -.Fd launch_requests, -or from +from +.Fd launch_requests, +or from .Fd complete_raid5_write at stage .Ar 5RD or .Ar 6RP . .It 5RD -(RAID-5 data) is called from +(RAID-5 data) is called from .Fd complete_raid5_write and represents the data written to a RAID-5 data stripe after calculating parity. .It 6RP -(RAID-5 parity) is called from +(RAID-5 parity) is called from .Fd complete_raid5_write and represents the data written to a RAID-5 parity stripe after calculating parity. @@ -556,9 +560,9 @@ initializes a plex by writing zeroes to all its subdisks. This is the only way to ensure consistent data in a plex. You must perform this initialization before using a RAID-5 plex. It is also recommended for other new plexes. .Pp -.Nm +.Nm initializes all subdisks of a plex in parallel. Since this operation can take a -long time, it is performed in the background. +long time, it is performed in the background. .Nm prints a console message when the initialization is complete. .It Nm label @@ -570,7 +574,7 @@ command writes a .Ar ufs style volume label on a volume. It is a simple alternative to an appropriate call to -.Ar disklabel . +.Ar disklabel . This is needed because some .Ar ufs commands still read the disk to find the label instead of using the correct @@ -641,8 +645,8 @@ information for the subdisks and (for a volume) plexes subordinate to the objects. The commands .Ar lv , .Ar lp , -.Ar ls -and +.Ar ls +and .Ar ld commands list only volumes, plexes, subdisks and drives respectively. This is particularly useful when used without parameters. @@ -673,7 +677,7 @@ entering the character. 
.It Nm printconfig Pa file
Write a copy of the current configuration to
-.Pa file 
+.Pa file
in a format that can be used to recreate the
.Nm
configuration. Unlike the configuration saved on disk, it includes definitions
@@ -684,7 +688,7 @@ of the drives.
The
.Nm read
command scans the specified disks for
-.Nm 
+.Nm
partitions containing previously created configuration information. It reads
the configuration in order from the most recently updated to least recently
updated configuration.
@@ -767,14 +771,14 @@ maintains a number of statistical counters for each object. See the header file
.Fi vinumvar.h
for more information.
.\" XXX put it in here when it's finalized
-Use the 
+Use the
.Nm resetstats
command to reset these counters. In conjunction with the
.Fl r
-option, 
+option,
.Nm
also resets the counters of subordinate objects.
-.It Nm rm 
+.It Nm rm
.Op Fl f
.Op Fl r
.Ar volume | plex | subdisk
@@ -844,10 +848,14 @@ configuration). Option bit 4 can be useful for error recovery.
.Op volume | plex | subdisk
.Pp
.Nm start
-starts one or more
+starts (brings into the
+.Ar up
+state) one or more
+.Nm
+objects.
+.Pp
+If no object names are specified,
.Nm
-objects. If no object names are specified,
-.Nm
scans the disks known to the system for
.Nm
drives and then reads in the configuration as described under the
@@ -873,7 +881,67 @@ saves.
.Pp
If object names are specified,
.Nm
-starts them.
+starts them. Normally this operation is only of use with subdisks. The action
+depends on the current state of the object:
+.Bl -bullet
+.It
+If the
+object is already in the
+.Ar up
+state,
+.Nm
+does nothing.
+.It
+If the object is a subdisk in the
+.Ar down
+or
+.Ar reborn
+states,
+.Nm
+changes it to the
+.Ar up
+state.
+.It
+If the object is a subdisk in the
+.Ar empty
+state, the change depends on the subdisk.
If it is part of a plex which is part
+of a volume which contains other plexes,
+.Nm
+places the subdisk in the
+.Ar reviving
+state and attempts to copy the data from the volume. When the operation
+completes, the subdisk is set into the
+.Ar up
+state. If it is part of a plex which is part of a volume which contains no
+other plexes, or if it is not part of a plex,
+.Nm
+brings it into the
+.Ar up
+state immediately.
+.It
+If the object is a subdisk in the
+.Ar reviving
+state,
+.Nm
+continues the
+.Ar revive
+operation offline. When the operation completes, the subdisk is set into the
+.Ar up
+state.
+.El
+.Pp
+When a subdisk comes into the
+.Ar up
+state,
+.Nm
+automatically checks the state of any plex and volume to which it may belong and
+changes their state where appropriate.
+.Pp
+If the object is a volume or a plex,
+.Nm start
+currently has no effect: it checks the state of the subordinate subdisks (and
+plexes in the case of a volume) and sets the state of the object accordingly.
+In a later version, this operation will cause the subdisks to be started as
+well.
.Pp
To start a plex in a multi-plex volume, the data must be copied from another
plex in the volume. Since this frequently takes a long time, it is done in the
@@ -893,12 +961,12 @@ This can only be done if no objects are active,
In particular, the flag does not override this requirement. This command can only work if
.Nm
has been loaded as a kld, since it is not possible to unload a statically
-configured driver, 
+configured driver,
.\" XXX why?
-and it must be issued at a command prompt: the command 
+and it must be issued at a command prompt: the command
.Nm vinum stop
will not work.
-.Nm 
+.Nm
.Nm stop
will fail if
.Nm
@@ -914,10 +982,10 @@ and
.Fl f
flags must be specified. This command does not remove the objects from the
configuration. They can be accessed again after a
-.Nm start 
+.Nm start
command.
.Pp
-By default, 
+By default,
.Nm
does not stop active objects.
For example, you cannot stop a plex which is attached to an active volume, and you cannot stop a volume which is open. The @@ -948,7 +1016,7 @@ The configuration file can contain the following entries: .Pp .Bl -hang -width 4n .It Nm volume -.Ar name +.Ar name .Op options .Pp Define a volume with name @@ -960,7 +1028,7 @@ Options are: .It Nm plex Ar plexname Add the specified plex to the volume. If .Ar plexname -is specified as +is specified as .Ar * , .Nm will look for the definition of the plex as the next possible entry in the @@ -970,7 +1038,7 @@ Define a .Ar read policy for the volume. .Ar policy -may be either +may be either .Nm round or .Nm prefer Ar plexname . @@ -982,6 +1050,7 @@ in \fIround-robin\fR\| fashion. A .Ar prefer read policy reads from the specified plex every time. .It Nm setupstate +.Pp When creating a multi-plex volume, assume that the contents of all the plexes are consistent. This is normally not the case, and correctly you should use the .Nm init @@ -1014,7 +1083,7 @@ when naming a plex or subdisk. .Pp Specify the organization of the plex. .Ar organization -can be one of +can be one of .Ar concat , .Ar striped or @@ -1022,7 +1091,7 @@ or For .Ar striped and -.Ar raid5 +.Ar raid5 plexes, the parameter .Ar stripesize must be specified, while for @@ -1088,9 +1157,13 @@ bytes of free space on the drive. .sp .It Nm length Ar length Specify the length of the subdisk. This keyword must be specified. There is no -default. +default, but the value 0 may be specified to mean +.if t ``use the largest available contiguous free area on the drive''. +.if n "use the largest available contiguous free area on the drive". +If the drive is empty, this means that the entire drive will be used for the +subdisk. .Nm length -may be shortened to +may be shortened to .Nm len . .sp .It Nm plex Ar plex @@ -1164,9 +1237,9 @@ volume vol5 .Ss DRIVE LAYOUT CONSIDERATIONS .Nm drives are currently BSD disk partitions. 
They must be of type -.Ar vinum +.Ar vinum in order to avoid overwriting file systems. For compatibility reasons, -.Nm +.Nm currently accepts partitions of type .Ar unused , but the next release will not allow this kind of partition. @@ -1190,20 +1263,20 @@ partition layout as shown by g: 1900741 2325984 vinum 0 0 0 # (Cyl. 1626*- 2955*) .Ed .sp -In this example, partition +In this example, partition .Nm g may be used as a .Nm -partition. Partitions +partition. Partitions .Nm a , -.Nm e +.Nm e and .Nm f may be used as .Nm UFS file systems or .Nm ccd -partitions. Partition +partitions. Partition .Nm b is a swap partition, and partition .Nm c @@ -1212,6 +1285,481 @@ represents the whole disk and should not be used for any other purpose. .Nm uses the first 265 sectors on each partition for configuration information, so the maximum size of a subdisk is 265 sectors smaller than the drive. +.Sh HOW TO SET UP VINUM +This section gives practical advice about how to implement a +.Nm +system. +.Ss Where to put the data +The first choice you need to make is where to put the data. You need dedicated +disk partitions for +.Nm vinum . +See the example under DRIVE LAYOUT CONSIDERATIONS above. Choose partition type +.Nm +unless your version of +.Xr disklabel 8 +does not understand this partition type, in which case you will need to use +partition type +.Nm unused +until you update your version of +.Xr disklabel 8 . +Use the compatibility partition (for example, +.Pa /dev/da0g ) +rather than the true partition name (such as +.Pa /dev/da0s1g ). +.Nm +currently uses the compatibility partition only for the +.Nm start +command, so this way you can avoid problems. +.Ss Designing volumes +The way you set up +.Nm +volumes depends on your intentions. There are a number of possibilities: +.Bl -enum +.It +You may want to join up a number of small disks to make a reasonable sized file +system. 
For example, if you had five small drives and wanted to use all the
+space for a single volume, you might write a configuration file like:
+.Bd -literal -offset 4n
+drive d1 device /dev/da2e
+drive d2 device /dev/da3e
+drive d3 device /dev/da4e
+drive d4 device /dev/da5e
+drive d5 device /dev/da6e
+volume bigger
+ plex org concat
+ sd length 0 drive d1
+ sd length 0 drive d2
+ sd length 0 drive d3
+ sd length 0 drive d4
+ sd length 0 drive d5
+.Ed
+.Pp
+In this case, you specify the length of the subdisks as 0, which means
+.if t ``use the largest area of free space that you can find on the drive''.
+.if n "use the largest area of free space that you can find on the drive".
+If the subdisk is the only subdisk on the drive, it will use all available
+space.
+.It
+You want to set up
+.Nm
+to obtain additional resilience against disk failures. You have the choice of
+RAID-1, also called
+.if t ``mirroring'', or RAID-5, also called ``parity''.
+.if n "mirroring", or RAID-5, also called "parity".
+.Pp
+To set up mirroring, create multiple plexes in a volume. For example, to create
+a mirrored volume of 2 GB, you might create the following configuration file:
+.Bd -literal -offset 4n
+drive d1 device /dev/da2e
+drive d2 device /dev/da3e
+volume mirror
+ plex org concat
+ sd length 2g drive d1
+ plex org concat
+ sd length 2g drive d2
+.Ed
+.Pp
+When creating mirrored drives, it is important to ensure that the data from each
+plex is on a different physical disk so that
+.Nm
+can access the complete address space of the volume even if a drive fails.
+Note that each plex requires as much storage as the complete volume: in this
+example, the volume has a size of 2 GB, but each plex (and each subdisk)
+requires 2 GB, so the total disk storage requirement is 4 GB.
+.Pp
+To set up RAID-5, create a single plex of type
+.Ar raid5 .
+For example, to create an equivalent resilient volume of 2 GB, you might use the
+following configuration file:
+.Bd -literal -offset 4n
+drive d1 device /dev/da2e
+drive d2 device /dev/da3e
+drive d3 device /dev/da4e
+drive d4 device /dev/da5e
+drive d5 device /dev/da6e
+volume raid
+ plex org raid5 512k
+ sd length 512m drive d1
+ sd length 512m drive d2
+ sd length 512m drive d3
+ sd length 512m drive d4
+ sd length 512m drive d5
+.Ed
+.Pp
+RAID-5 plexes require at least three subdisks; the equivalent of one subdisk is
+used for storing parity information and is lost for data storage. The more
+disks you use, the greater the proportion of the disk storage that can be used
+for data. In this example, the total storage usage is 2.5 GB, compared to 4 GB
+for a mirrored configuration. If you were to use the minimum of only three
+disks, you would require 3 GB to store the information, for example:
+.Bd -literal -offset 4n
+drive d1 device /dev/da2e
+drive d2 device /dev/da3e
+drive d3 device /dev/da4e
+volume raid
+ plex org raid5 512k
+ sd length 1g drive d1
+ sd length 1g drive d2
+ sd length 1g drive d3
+.Ed
+.Pp
+As with creating mirrored drives, it is important to ensure that the data from
+each subdisk is on a different physical disk so that
+.Nm
+can access the complete address space of the volume even if a drive fails.
+.It
+You want to set up
+.Nm
+to allow more concurrent access to a file system. In many cases, access to a
+file system is limited by the speed of the disk. By spreading the volume across
+multiple disks, you can increase the throughput in multi-access environments.
+This technique shows little or no performance improvement in single-access
+environments.
+.Nm
+uses a technique called
+.if t ``striping'',
+.if n "striping",
+or sometimes RAID-0, to increase this concurrency of access. The name RAID-0 is
+misleading: striping does not provide any redundancy or additional reliability.
+In fact, it decreases the reliability, since the failure of a single disk will
+render the volume useless, and the more disks you have, the more likely it is
+that one of them will fail.
+.Pp
+To implement striping, use a
+.Ar striped
+plex:
+.Bd -literal -offset 4n
+drive d1 device /dev/da2e
+drive d2 device /dev/da3e
+drive d3 device /dev/da4e
+drive d4 device /dev/da5e
+volume raid
+ plex org striped 512k
+ sd length 512m drive d1
+ sd length 512m drive d2
+ sd length 512m drive d3
+ sd length 512m drive d4
+.Ed
+.Pp
+A striped plex must have at least two subdisks, but the increase in performance
+is greater if you have a larger number of disks.
+.It
+You may want to have the best of both worlds and have both resilience and
+performance. This is sometimes called RAID-10 (a combination of RAID-1 and
+RAID-0), though again this name is misleading. With
+.Nm
+you can do this with the following configuration file:
+.Bd -literal -offset 4n
+drive d1 device /dev/da2e
+drive d2 device /dev/da3e
+drive d3 device /dev/da4e
+drive d4 device /dev/da5e
+volume raid
+ plex org striped 512k
+ sd length 512m drive d1
+ sd length 512m drive d2
+ sd length 512m drive d3
+ sd length 512m drive d4
+ plex org striped 512k
+ sd length 512m drive d4
+ sd length 512m drive d3
+ sd length 512m drive d2
+ sd length 512m drive d1
+.Ed
+.Pp
+Here the plexes are striped, increasing performance, and there are two of them,
+increasing reliability. Note that this example shows the subdisks of the second
+plex in reverse order from the first plex. This is for performance reasons and
+will be discussed below.
+.El
+.Ss Creating the volumes
+Once you have created your configuration files, start
+.Nm
+and create the volumes.
In this example, the configuration is in the file +.Pa configfile : +.Bd -literal + # vinum create -v configfile + 1: drive d1 device /dev/da2e + 2: drive d2 device /dev/da3e + 3: volume mirror + 4: plex org concat + 5: sd length 2g drive d1 + 6: plex org concat + 7: sd length 2g drive d2 + Configuration summary + + Drives: 2 (4 configured) + Volumes: 1 (4 configured) + Plexes: 2 (8 configured) + Subdisks: 2 (16 configured) + + Drive d1: Device /dev/da2e + Created on vinum.lemis.com at Tue Mar 23 12:30:31 1999 + Config last updated Tue Mar 23 14:30:32 1999 + Size: 60105216000 bytes (57320 MB) + Used: 2147619328 bytes (2048 MB) + Available: 57957596672 bytes (55272 MB) + State: up + Last error: none + Drive d2: Device /dev/da3e + Created on vinum.lemis.com at Tue Mar 23 12:30:32 1999 + Config last updated Tue Mar 23 14:30:33 1999 + Size: 60105216000 bytes (57320 MB) + Used: 2147619328 bytes (2048 MB) + Available: 57957596672 bytes (55272 MB) + State: up + Last error: none + + Volume mirror: Size: 2147483648 bytes (2048 MB) + State: up + Flags: + 2 plexes + Read policy: round robin + + Plex mirror.p0: Size: 2147483648 bytes (2048 MB) + Subdisks: 1 + State: up + Organization: concat + Part of volume mirror + Plex mirror.p1: Size: 2147483648 bytes (2048 MB) + Subdisks: 1 + State: up + Organization: concat + Part of volume mirror + + Subdisk mirror.p0.s0: + Size: 2147483648 bytes (2048 MB) + State: up + Plex mirror.p0 at offset 0 + + Subdisk mirror.p1.s0: + Size: 2147483648 bytes (2048 MB) + State: up + Plex mirror.p1 at offset 0 +.Ed +.Pp +The +.Fl v +flag tells +.Nm +to list the file as it configures. Subsequently it lists the current +configuration in the same format as the +.Nm list +command. +.Ss Creating more volumes +Once you have created the +.Nm +volumes, +.Nm +keeps track of them in its internal configuration files. You do not need to +create them again. 
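Configuration files like the examples in this section can also be generated by a
short script rather than typed in by hand. The following sh sketch is an
editor's illustration only: the device names are placeholders for the partitions
on your own system, and vinum itself is not invoked. It writes the five-drive
concatenated configuration shown earlier under Designing volumes:

```shell
# Generate a vinum config file for a concatenated volume across several
# drives.  Device names below are placeholders; adjust for your system.
conf=/tmp/bigger.conf
devices="/dev/da2e /dev/da3e /dev/da4e /dev/da5e /dev/da6e"

{
    n=1
    for dev in $devices; do
        echo "drive d$n device $dev"
        n=$((n + 1))
    done
    echo "volume bigger"
    echo " plex org concat"
    n=1
    for dev in $devices; do
        # length 0: use the largest contiguous free area on the drive
        echo " sd length 0 drive d$n"
        n=$((n + 1))
    done
} > "$conf"

cat "$conf"
```

The generated file would then be passed to vinum in the usual way, for example
with vinum create -v /tmp/bigger.conf.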
In particular, if you run the
+.Nm create
+command again, you will create additional objects:
+.Bd -literal
+.if t .ps -2
+ # vinum create sampleconfig
+ Configuration summary
+
+ Drives: 2 (4 configured)
+ Volumes: 1 (4 configured)
+ Plexes: 4 (8 configured)
+ Subdisks: 4 (16 configured)
+
+ D d1 State: up Device /dev/da2e Avail: 53224/57320 MB (92%)
+ D d2 State: up Device /dev/da3e Avail: 53224/57320 MB (92%)
+
+ V mirror State: up Plexes: 4 Size: 2048 MB
+
+ P mirror.p0 C State: up Subdisks: 1 Size: 2048 MB
+ P mirror.p1 C State: up Subdisks: 1 Size: 2048 MB
+ P mirror.p2 C State: up Subdisks: 1 Size: 2048 MB
+ P mirror.p3 C State: up Subdisks: 1 Size: 2048 MB
+
+ S mirror.p0.s0 State: up PO: 0 B Size: 2048 MB
+ S mirror.p1.s0 State: up PO: 0 B Size: 2048 MB
+ S mirror.p2.s0 State: up PO: 0 B Size: 2048 MB
+ S mirror.p3.s0 State: up PO: 0 B Size: 2048 MB
+.if t .ps
+.Ed
+.Pp
+As this example shows, re-running the
+.Nm create
+command has created two new plexes, each with a new subdisk. If you want to add
+other volumes, create new configuration files for them. They do not need to
+reference the drives that
+.Nm
+already knows about. For example, to create a volume
+.Pa raid
+on the four drives
+.Pa /dev/da1e ,
+.Pa /dev/da2e ,
+.Pa /dev/da3e
+and
+.Pa /dev/da4e ,
+you only need to mention the other two:
+.Bd -literal
+ drive d3 device /dev/da1e
+ drive d4 device /dev/da4e
+ volume raid
+ plex org raid5 512k
+ sd size 2g drive d1
+ sd size 2g drive d2
+ sd size 2g drive d3
+ sd size 2g drive d4
+.Ed
+
+.Ss Performance considerations
+A number of misconceptions exist about how to set up a RAID array for best
+performance. In particular, most systems use far too small a stripe size. The
+following discussion applies to all RAID systems, not just to
+.Nm vinum .
+.Pp
+The FreeBSD block I/O system issues requests of between 0.5 kB and 60 kB; a
+typical mix is somewhere around 8 kB.
You can't stop any striping system from +breaking a request into two physical requests, and if you do it wrong it can be +broken into several. This will result in a significant drop in performance: the +decrease in transfer time per disk is offset by the order of magnitude greater +increase in latency. +.Pp +With modern disk sizes and the FreeBSD block I/O system, you can expect to have +a reasonably small number of fragmented requests with a stripe size between 256 +kB and 512 kB; with correct RAID implementations there is no obvious reason not +to increase the size to 2 or 4 MB on a large disk. +.Pp +The easiest way to consider the impact of any transfer in a multi-access system +is to look at it from the point of view of the potential bottleneck, the disk +subsystem: how much total disk time does the transfer use? Since just about +everything is cached, the time relationship between the request and its +completion is not so important: the important parameter is the total time that +the request keeps the disks active, the time when the disks are not available to +perform other transfers. As a result, it doesn't really matter if the transfers +are happening at the same time or different times. In practical terms, the time +we're looking at is the sum of the total latency (positioning time and +rotational latency, or the time it takes for the data to arrive under the disk +heads) and the total transfer time. For a given transfer to disks of the same +speed, the transfer time depends only on the total size of the transfer. +.Pp +Consider a typical news article or web page of 24 kB, which will probably be +read in a single I/O. Take disks with a transfer rate of 6 MB/s and an average +positioning time of 8 ms, and a file system with 4 kB blocks. Since it's 24 kB, +we don't have to worry about fragments, so the file will start on a 4 kB +boundary. 
The number of transfers required depends on where the block starts:
+it's (S + F - 1) / S, where S is the stripe size in file system blocks, and F is
+the file size in file system blocks.
+.Pp
+.Bl -enum
+.It
+Stripe size of 4 kB. You'll have 6 transfers. Total subsystem load: 48 ms
+latency, 2 ms transfer, 50 ms total.
+.It
+Stripe size of 8 kB. On average, you'll have 3.5 transfers. Total subsystem
+load: 28 ms latency, 2 ms transfer, 30 ms total.
+.It
+Stripe size of 16 kB. On average, you'll have 2.25 transfers. Total subsystem
+load: 18 ms latency, 2 ms transfer, 20 ms total.
+.It
+Stripe size of 256 kB. On average, you'll have 1.08 transfers. Total subsystem
+load: 8.6 ms latency, 2 ms transfer, 10.6 ms total.
+.It
+Stripe size of 4 MB. On average, you'll have 1.005 transfers. Total subsystem
+load: 8.04 ms latency, 2 ms transfer, 10.04 ms total.
+.El
+.Pp
+It appears that some hardware RAID systems have problems with large stripes:
+they appear to always transfer a complete stripe to or from disk, so that a
+large stripe size will have an adverse effect on performance.
+.Nm
+does not suffer from this problem: it optimizes all disk transfers and does not
+transfer unneeded data.
+.Pp
+Note that no well-known benchmark program tests true multi-access conditions
+(more than 100 concurrent users), so it is difficult to demonstrate the validity
+of these statements.
+.Pp
+Given these considerations, the following factors affect the performance of a
+.Nm
+volume:
+.Bl -bullet
+.It
+Striping improves performance for multiple access only, since it increases the
+chance of individual requests being on different drives.
+.It
+Concatenating UFS file systems across multiple drives can also improve
+performance for multiple file access, since UFS divides a file system into
+cylinder groups and attempts to keep files in a single cylinder group. In
+general, it is not as effective as striping.
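As a cross-check, the per-stripe-size figures discussed above follow directly
from the (S + F - 1) / S formula. The following sh sketch is an editor's
illustration, not part of vinum; the 24 kB file, 4 kB file system blocks and
8 ms positioning time are the assumptions stated in the text. By this formula
the 4 MB stripe works out to roughly 1.005 transfers:

```shell
# xfers STRIPE_KB prints the average number of transfers needed for a
# 24 kB file with 4 kB file system blocks, using (S + F - 1) / S.
xfers() {
    awk -v s="$1" 'BEGIN {
        S = s / 4                   # stripe size in file system blocks
        F = 24 / 4                  # file size in file system blocks
        printf "%.4f", (S + F - 1) / S
    }'
}

for kb in 4 8 16 256 4096; do
    n=$(xfers "$kb")
    # each transfer costs 8 ms positioning time (the text's assumption)
    lat=$(awk -v n="$n" 'BEGIN { printf "%.2f", n * 8 }')
    printf "%4s kB stripe: %s transfers, %s ms latency\n" "$kb" "$n" "$lat"
done
```

The output reproduces the pattern in the enumeration: latency falls steeply as
the stripe grows past the typical request size, then flattens out.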
+.It +Mirroring can improve multi-access performance for reads, since by default +.Nm +issues consecutive reads to consecutive plexes. +.It +Mirroring decreases performance for all writes, whether multi-access or single +access, since the data must be written to both plexes. This explains the +subdisk layout in the example of a mirroring configuration above: if the +corresponding subdisk in each plex is on a different physical disk, the write +commands can be issued in parallel, whereas if they are on the same physical +disk, they will be performed sequentially. +.It +RAID-5 reads have essentially the same considerations as striped reads, unless +the striped plex is part of a mirrored volume, in which case the performance of +the mirrored volume will be better. +.It +RAID-5 writes are approximately 25% of the speed of striped writes: to perform +the write, +.Nm +must first read the data block and the corresponding parity block, perform some +calculations and write back the parity block and the data block, four times as +many transfers as for writing a striped plex. On the other hand, this is offset +by the cost of mirroring, so writes to a volume with a single RAID-5 plex are +approximately half the speed of writes to a correctly configured volume with two +striped plexes. +.It +When the +.Nm +configuration changes (for example, adding or removing objects, or the change of +state of one of the objects), +.Nm +writes up to 128 kB of updated configuration to each drive. The larger the +number of drives, the longer this takes. +.El +.Ss Creating file systems on Vinum volumes +You do not need to run +.Nm disklabel +before creating a file system on a +.Nm +volume. Just run +.Nm newfs +against the raw device. Use the +.Fl v +option to state that the device is not divided into partitions. 
For example, to
+create a file system on volume
+.Pa mirror ,
+enter the following command:
+.Bd -literal -offset 4n
+# newfs -v /dev/vinum/rmirror
+.Ed
+.Pp
+Note the name
+.Pa rmirror ,
+indicating the raw device.
+.Ss Other considerations
+A number of other considerations apply to
+.Nm
+configuration:
+.Bl -bullet
+.It
+There is no advantage in creating multiple drives on a single disk. Each drive
+uses 132.5 kB of data for label and configuration information, and performance
+will suffer when the configuration changes. Use appropriately sized subdisks
+instead.
+.It
+It is possible to increase the size of a concatenated
+.Nm
+plex, but currently the size of striped and RAID-5 plexes cannot be increased.
+Currently the size of an existing UFS file system also cannot be increased, but
+it is planned to make both plexes and file systems extensible.
+.El
.Sh GOTCHAS
The following points are not bugs, and they have good reasons for existing, but
they have shown to cause confusion. Each is discussed in the appropriate
@@ -1220,16 +1768,19 @@ section above.
.It
.Nm
will not create a device on UFS partitions. Instead, it will return an error
-message ``wrong partition type''. The partition type should be
+message
+.if t ``wrong partition type''.
+.if n "wrong partition type".
+The partition type should be
.Ar vinum ,
though currently partitions of type
-.Ar unused 
+.Ar unused
are also accepted.
.It
When you create a volume with multiple plexes,
-.Nm 
+.Nm
does not automatically initialize the plexes. This means that the contents are
-not known, but they are certainly not consistent. As a result, by default 
+not known, but they are certainly not consistent. As a result, by default
.Nm
sets the state of all newly-created plexes except the first to
.Ar stale .
@@ -1237,13 +1788,13 @@ In order to synchronize them with the first plex, you must .Nm start their subdisks, which causes .Nm -to copy the data from a plex which is in the +to copy the data from a plex which is in the .Ar up state. Depending on the size of the subdisks involved, this can take a long time. .Pp In practice, people aren't too interested in what was in the plex when it was -created, and other volume managers cheat by setting them +created, and other volume managers cheat by setting them .Ar up anyway. .Nm @@ -1267,7 +1818,7 @@ Some of the commands currently supported by are not really needed. For reasons which I don't understand, however, I find that users frequently try the .Nm label -and +and .Nm resetconfig commands, though especially .Nm resetconfig @@ -1284,7 +1835,7 @@ state, with the .Nm stop or .Nm stop Ar -f -commands. If that works, you should then be able to start it. If you find +commands. If that works, you should then be able to start it. If you find that this is the only way to get out of a position where easier methods fail, please report the situation. .It @@ -1314,7 +1865,7 @@ objects. .br .Ar /dev/vinum/control - control device for -.Nm vinum +.Nm vinum .br .Ar /dev/vinum/plex - directory containing device nodes for @@ -1328,10 +1879,11 @@ subdisks. .Sh SEE ALSO .Xr vinum 4 , .Xr disklabel 8 , -.Nm http://www.lemis.com/vinum.html , -.Nm http://www.lemis.com/vinum-debugging.html . +.Xr newfs 8 , +.Pa http://www.lemis.com/vinum.html , +.Pa http://www.lemis.com/vinum-debugging.html . .Sh AUTHOR -Greg Lehey +Greg Lehey .Pa . .Sh HISTORY The -- cgit v1.1