author | dillon <dillon@FreeBSD.org> | 2001-05-27 23:14:27 +0000 |
---|---|---|
committer | dillon <dillon@FreeBSD.org> | 2001-05-27 23:14:27 +0000 |
commit | 154e92fce6e2492cbf37af0695f1589c91459049 (patch) | |
tree | 695f2f5c34276a3e81c37647fcdc95ff3c1fa687 | |
parent | 0727598a0e32df4c431ce3b099be9c4695eb6f16 (diff) | |
Add two new manual pages related to general firewall and tuning issues
Reviewed by: hackers
-rw-r--r-- | share/man/man7/Makefile | 2 |
-rw-r--r-- | share/man/man7/firewall.7 | 375 |
-rw-r--r-- | share/man/man7/tuning.7 | 478 |
3 files changed, 854 insertions, 1 deletions
diff --git a/share/man/man7/Makefile b/share/man/man7/Makefile index 6c6ecc0..777d267 100644 --- a/share/man/man7/Makefile +++ b/share/man/man7/Makefile @@ -3,7 +3,7 @@ #MISSING: eqnchar.7 ms.7 term.7 MAN= ascii.7 build.7 clocks.7 environ.7 hier.7 hostname.7 intro.7 mailaddr.7 \ - operator.7 ports.7 security.7 \ + operator.7 ports.7 security.7 tuning.7 firewall.7 \ style.perl.7 MLINKS= intro.7 miscellaneous.7 diff --git a/share/man/man7/firewall.7 b/share/man/man7/firewall.7 new file mode 100644 index 0000000..043af5a --- /dev/null +++ b/share/man/man7/firewall.7 @@ -0,0 +1,375 @@ +.\" Copyright (c) 2001, Matthew Dillon. Terms and conditions are those of +.\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in +.\" the source tree. +.\" +.\" $FreeBSD$ +.\" +.Dd May 26, 2001 +.Dt FIREWALL 7 +.Os FreeBSD +.Sh NAME +.Nm firewall +.Nd simple firewalls under FreeBSD +.Sh FIREWALL BASICS +A Firewall is most commonly used to protect an internal network +from an outside network by preventing the outside network from +making arbitrary connections into the internal network. Firewalls +are also used to prevent outside entities from spoofing internal +IP addresses and to isolate services such as NFS or SMBFS (Windows +file sharing) within LAN segments. +.Pp +The +.Fx +firewalling system also has the capability to limit bandwidth using +.Xr dummynet 4 . +This feature can be useful when you need to guarentee a certain +amount of bandwidth for a critical purpose. For example, if you +are doing video conferencing over the internet via your +office T1 (1.5 MBits), you may wish to bandwidth-limit all other +T1 traffic to 1 MBit in order to reserve at least 0.5 MBits +for your video conferencing connections. Similarly if you are +running a popular web or ftp site from a colocation facility +you might want to limit bandwidth to prevent excessive band +width charges from your provider. 
+.Pp +Finally, +.Fx +firewalls may be used to divert packets or change the next-hop +address for packets to help route them to the correct destination. +Packet diversion is most often used to support NAT (network +address translation), which allows an internal network using +a private IP space to make connections to the outside for browsing +or other purposes. +.Pp +Constructing a firewall may appear to be trivial, but most people +get them wrong. The most common mistake is to create an exclusive +firewall rather then an inclusive firewall. An exclusive firewall +allows all packets through except for those matching a set of rules. +An inclusive firewall allows only packets matching the rulset +through. Inclusive firewalls are much, much safer then exclusive +firewalls but a tad more difficult to build properly. The +second most common mistake is to blackhole everything except the +particular port you want to let through. TCP/IP needs to be able +to get certain types of ICMP errors to function properly - for +example, to implement MTU discovery. Also, a number of common +system daemons make reverse connections to the +.Sy auth +service in an attempt to authenticate the user making a connection. +Auth is rather dangerous but the proper implementation is to return +a TCP reset for the connection attempt rather then simply blackholing +the packet. We cover these and other quirks involved with constructing +a firewall in the sample firewall section below. +.Sh IPFW KERNEL CONFIGURATION +To use the ip firewall features of +.Fx +you must create a custom kernel with the +.Sy IPFIREWALL +option set. The kernel defaults its firewall to deny all +packets by default, which means that if you do not load in +a permissive ruleset via +.Em /etc/rc.conf , +rebooting into your new kernel will take the network offline +and will prevent you from being able to access it if you +are not sitting at the console. 
It is also quite common to +update a kernel to a new release and reboot before updating +the binaries. This can result in an incompatibility between +the +.Xr ipfw 8 +program and the kernel which prevents it from running in the +boot sequence, also resulting in an inaccessible machine. +Because of these problems the +.Sy IPFIREWALL_DEFAULT_TO_ACCEPT +kernel option is also available which changes the default firewall +to pass through all packets. Note, however, that this is a very +dangerous option to set because it means your firewall is disabled +during booting. You should use this option while getting up to +speed with +.Fx +firewalling, but get rid of it once you understand how it all works +to close the loophole. There is a third option called +.Sy IPDIVERT +which allows you to use the firewall to divert packets to a user program +and is necessary if you wish to use +.Xr natd 8 +to give private internal networks access to the outside world. +If you want to be able to limit the bandwidth used by certain types of +traffic, the +.Sy DUMMYNET +option must be used to enable +.Em ipfw pipe +rules. +.Pp +.Sh SAMPLE IPFW-BASED FIREWALL +Here is an example ipfw-based firewall taken from a machine with three +interface cards. fxp0 is connected to the 'exposed' LAN. Machines +on this LAN are dual-homed with both internal 10. IP addresses and +internet-routed IP addresses. In our example, 192.100.5.x represents +the internet-routed IP block while 10.x.x.x represents the internal +networks. While it isn't relevant to the example, 10.0.1.x is +assigned as the internal address block for the LAN on fxp0, 10.0.2.x +for the LAN on fxp1, and 10.0.3.x for the LAN on fxp2. +.Pp +In this example we want to isolate all three LANs from the internet +as well as isolate them from each other, and we want to give all +internal addresses access to the internet through a NAT gateway running +on this machine. 
To make the NAT gateway work, the firewall machine +is given two internet-exposed addresses on fxp0 in addition to an +internal 10. address on fxp0: one exposed address (not shown) +represents the machine's official address, and the second exposed +address (192.100.5.5 in our example) represents the NAT gateway +rendezvous IP. We make the example more complex by giving the machines +on the exposed LAN internal 10.0.0.x addresses as well as exposed +addresses. The idea here is that you can bind internal services +to internal addresses even on exposed machines and still protect +those services from the internet. The only services you run on +exposed IP addresses would be the ones you wish to expose to the +internet. +.Pp +It is important to note that the 10.0.0.x network in our example +is not protected by our firewall. You must make sure that your +internet router protects this network from outside spoofing. +Also, in our example, we pretty much give the exposed hosts free +reign on our internal network when operating services through +internal IP addresses (10.0.0.x). This is somewhat of security +risk... what if an exposed host is compromised? To remove the +risk and force everything coming in via LAN0 to go through +the firewall, remove rules 01010 and 01011. +.Pp +Finally, note that the use of internal addresses represents a +big piece of our firewall protection mechanism. With proper +spoofing safeguards in place, nothing outside can directly +access an internal (LAN1 or LAN2) host. +.Bd -literal +# /etc/rc.conf +# +firewall_enable="YES" +firewall_type="/etc/ipfw.conf" + +# temporary port binding range let +# through the firewall. +# +# NOTE: heavily loaded services running through the firewall may require +# a larger port range for local-size binding. 4000-10000 or 4000-30000 +# might be a better choice. +ip_portrange_first=4000 +ip_portrange_last=5000 +... 
+.Ed +.Pp +.Bd -literal +# /etc/ipfw.conf +# +# FIREWALL: the firewall machine / nat gateway +# LAN0 10.0.0.X and 192.100.5.X (dual homed) +# LAN1 10.0.1.X +# LAN2 10.0.2.X +# sw: ethernet switch (unmanaged) +# +# 192.100.5.x represents IP addresses exposed to the internet +# (i.e. internet routeable). 10.x.x.x represent internal IPs +# (not exposed) +# +# [LAN1] +# ^ +# | +# FIREWALL -->[LAN2] +# | +# [LAN0] +# | +# +--> exposed host A +# +--> exposed host B +# +--> exposed host C +# | +# INTERNET (secondary firewall) +# ROUTER +# | +# [internet] +# +# NOT SHOWN: The INTERNET ROUTER must contain rules to disallow +# all packets with source IP addresses in the 10. block in order +# to protect the dual-homed 10.0.0.x block. Exposed hosts are +# not otherwise protected in this example - they should only bind +# exposed services to exposed IPs but can safely bind internal +# services to internal IPs. +# +# The NAT gateway works by taking packets sent from internal +# IP addresses to external IP addresses and routing them to natd, which +# is listening on port 8668. This is handled by rule 00300. Data coming +# back to natd from the outside world must also be routed to natd using +# rule 00301. To make the example interesting, we note that we do +# NOT have to run internal requests to exposed hosts through natd +# (rule 00290) because those exposed hosts know about our +# 10. network. This can reduce the load on natd. Also note that we +# of course do not have to route internal<->internal traffic through +# natd since those hosts know how to route our 10. internal network. +# The natd command we run from /etc/rc.local is shown below. See +# also the in-kernel version of natd, ipnat. 
+# +# natd -s -u -a 208.161.114.67 +# +# +add 00290 skipto 1000 ip from 10.0.0.0/8 to 192.100.5.0/24 +add 00300 divert 8668 ip from 10.0.0.0/8 to not 10.0.0.0/8 +add 00301 divert 8668 ip from not 10.0.0.0/8 to 192.100.5.5 + +# Short cut the rules to avoid running high bandwidths through +# the entire rule set. Allow established tcp connections through, +# and shortcut all outgoing packets under the assumption that +# we need only firewall incoming packets. +# +# Allowing established tcp connections through creates a small +# hole but may be necessary to avoid overloading your firewall. +# If you are worried, you can move the rule to after the spoof +# checks. +# +add 01000 allow tcp from any to any established +add 01001 allow all from any to any out via fxp0 +add 01001 allow all from any to any out via fxp1 +add 01001 allow all from any to any out via fxp2 + +# Spoof protection. This depends on how well you trust your +# internal networks. Packets received via fxp1 MUST come from +# 10.0.1.x. Packets received via fxp2 MUST come from 10.0.2.x. +# Packets received via fxp0 cannot come from the LAN1 or LAN2 +# blocks. We can't protect 10.0.0.x here, the internet router +# must do that for us. +# +add 01500 deny all from not 10.0.1.0/24 in via fxp1 +add 01500 deny all from not 10.0.2.0/24 in via fxp2 +add 01501 deny all from 10.0.1.0/24 in via fxp0 +add 01501 deny all from 10.0.2.0/24 in via fxp0 + +# In this example rule set there are no restrictions between +# internal hosts, even those on the exposed LAN (as long as +# they use an internal IP address). This represents a +# potential security hole (what if an exposed host is +# compromised?). If you want full restrictions to apply +# between the three LANs, firewalling them off from each +# other for added security, remove these two rules. +# +# If you want to isolate LAN1 and LAN2, but still want +# to give exposed hosts free reign with each other, get +# rid of rule 01010 and keep rule 01011. 
+# +# (commented out, uncomment for less restrictive firewall) +#add 01010 allow all from 10.0.0.0/8 to 10.0.0.0/8 +#add 01011 allow all from 192.100.5.0/24 to 192.100.5.0/24 +# + +# SPECIFIC SERVICES ALLOWED FROM SPECIFIC LANS +# +# If using a more restrictive firewall, allow specific LANs +# access to specific services running on the firewall itself. +# In this case we assume LAN1 needs access to filesharing running +# on the firewall. If using a less restrictive firewall +# (allowing rule 01010), you don't need these rules. +# +add 01012 allow tcp from 10.0.1.0/8 to 10.0.1.1 139 +add 01012 allow udp from 10.0.1.0/8 to 10.0.1.1 137,138 + +# GENERAL SERVICES ALLOWED TO CROSS INTERNAL AND EXPOSED LANS +# +# We allow specific UDP services through: DNS lookups, ntalk, and ntp. +# Note that internal services are protected by virtue of having +# spoof-proof internal IP addresses (10. net), so these rules +# really only apply to services bound to exposed IPs. We have +# to allow UDP fragments or larger fragmented UDP packets will +# not survive the firewall. +# +# If we want to expose high-numbered temporary service ports +# for things like DNS lookup responses we can use a port range, +# in this example 4000-65535, and we set to /etc/rc.conf variables +# on all exposed machines to make sure they bind temporary ports +# to the exposed port range (see rc.conf example above) +# +add 02000 allow udp from any to any 4000-65535,domain,ntalk,ntp +add 02500 allow udp from any to any frag + +# Allow similar services for TCP. Again, these only apply to +# services bound to exposed addresses. NOTE: we allow 'auth' +# through but do not actually run an identd server on any exposed +# port. This allows the machine being authed to respond with a +# TCP RESET. Throwing the packet away would result in delays +# when connecting to remote services that do reverse ident lookups. 
+# +# Note that we do not allow tcp fragments through, and that we do +# not allow fragments in general (except for UDP fragments). We +# expect the TCP mtu discovery protocol to work properly so there +# should be no TCP fragments. +# +add 03000 allow tcp from any to any http,https +add 03000 allow tcp from any to any 4000-65535,ssh,smtp,domain,ntalk +add 03000 allow tcp from any to any auth,pop3,ftp,ftp-data + +# It is important to allow certain ICMP types through: +# +# 0 Echo Reply +# 3 Destination Unreachable +# 4 Source Quench (typically not allowed) +# 5 Redirect (typically not allowed - can be dangerous!) +# 8 Echo +# 11 Time Exceeded +# 12 Parameter Problem +# 13 Timestamp +# 14 Timestamp Reply +# +# Sometimes people need to allow ICMP REDIRECT packets, which is +# type 5, but if you allow it make sure that your internet router +# disallows it. + +add 04000 allow icmp from any to any icmptypes 0,5,8,11,12,13,14 + +# log any remaining fragments that get through. Might be useful, +# otherwise don't bother. Have a final deny rule as a safety to +# guarentee that your firewall is inclusive no matter how the kernel +# is configured. +# +add 05000 deny log ip from any to any frag +add 06000 deny all from any to any +.Ed +.Sh PORT BINDING INTERNAL AND EXTERNAL SERVICES +We've mentioned multi-homing hosts and binding services to internal or +external addresses but we haven't really explained it. When you have a +host with multiple IP addresses assigned to it, you can bind services run +on that host to specific IPs or interfaces rather then all IPs. Take +the firewall machine for example: With three interfaces +and two exposed IP addresses +on one of those interfaces, the firewall machine is known by 5 different +IP addresses (10.0.0.1, 10.0.1.1, 10.0.2.1, 192.100.5.5, and say +192.100.5.1). 
If the firewall is providing file sharing services to the +windows LAN segment (say it is LAN1), you can use samba's 'bind interfaces' +directive to specifically bind it to just the LAN1 IP address. That +way the file sharing services will not be made available to other LAN +segments. The same goes for NFS. If LAN2 has your UNIX engineering +workstations, you can tell nfsd to bind specifically to 10.0.2.1. You +can specify how to bind virtually every service on the machine and you +can use a light +.Xr jail 8 +to indirectly bind services that do not otherwise give you the option. +.Sh SEE ALSO +.Pp +.Xr config 8 , +.Xr dummynet 4 , +.Xr ipfw 8 , +.Xr ipnat 1 , +.Xr ipnat 5 , +.Xr jail 8 , +.Xr natd 8 , +.Xr nfsd 8 , +.Xr rc.conf 5 , +.Xr samba 7 [ /usr/ports/net/samba ] +.Xr smb.conf 5 [ /usr/ports/net/samba ] +.Sh ADDITIONAL READING +.Pp +.Xr ipf 5 , +.Xr ipf 8 , +.Xr ipfstat 8 +.Sh HISTORY +The +.Nm +manual page was originally written by +.An Matthew Dillon +and first appeared +in +.Fx 4.3 , +May 2001. diff --git a/share/man/man7/tuning.7 b/share/man/man7/tuning.7 new file mode 100644 index 0000000..8c7630e --- /dev/null +++ b/share/man/man7/tuning.7 @@ -0,0 +1,478 @@ +.\" Copyright (c) 2001, Matthew Dillon. Terms and conditions are those of +.\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in +.\" the source tree. +.\" +.\" $FreeBSD$ +.\" +.Dd May 25, 2001 +.Dt TUNING 7 +.Os FreeBSD +.Sh NAME +.Nm tuning +.Nd performance tuning under FreeBSD +.Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP +.Pp +When using +.Xr disklabel 8 +to lay out your filesystems on a hard disk it is important to remember +that hard drives can transfer data much more quickly from outer tracks +then they can from inner tracks. To take advantage of this you should +try to pack your smaller filesystems and swap closer to the outer tracks, +follow with the larger filesystems, and end with the largest filesystems. 
+It is also important to size system standard filesystems such that you +will not be forced to resize them later as you scale the machine up. +I usually create, in order, a 128M root, 1G swap, 128M /var, 128M /var/tmp, +3G /usr, and use any remaining space for /home. +.Pp +You should typically size your swap space to approximately 2x main memory. +If you do not have a lot of ram, though, you will generally want a lot +more swap. It is not recommended that you configure any less than +256M of swap on a system and you should keep in mind future memory +expansion when sizing the swap partition. +The kernel's VM paging algorithms are tuned to perform best when there is +at least 2x swap versus main memory. Configuring too little swap can lead +to inefficiencies in the VM page scanning code as well as create issues +later on if you add more memory to your machine. Finally, on larger systems +with multiple SCSI disks (or multiple IDE disks operating on different +controllers), we strongly recommend that you configure swap on each drive +(up to four drives). The swap partitions on the drives should be +approximately the same size. The kernel can handle arbitrary sizes but +internal data structures scale to 4 times the largest swap partition. Keeping +the swap partitions near the same size will allow the kernel to optimally +stripe swap space across the N disks. Don't worry about overdoing it a +little, swap space is the saving grace of +.Ux +and even if you don't normally use much swap, it can give you more time to +recover from a runaway program before being forced to reboot. +.Pp +How you size your +.Em /var +partition depends heavily on what you intend to use the machine for. This +partition is primarily used to hold mailboxes, the print spool, and log +files. Some people even make +.Em /var/log +its own partition (but except for extreme cases it isn't worth the waste +of a partition id). 
If your machine is intended to act as a mail +or print server, +or you are running a heavily visited web server, you should consider +creating a much larger partition - perhaps a gig or more. It is very easy +to underestimate log file storage requirements. +.Pp +Sizing +.Em /var/tmp +depends on the kind of temporary file usage you think you will need. 128M is +the minimum we recommend. Also note that you usually want to make +.Em /tmp +a softlink to +.Em /var/tmp . +Dedicating a partition for temporary file storage is important for +two reasons: First, it reduces the possibility of filesystem corruption +in a crash, and second it reduces the chance of a runaway process that +fills up [/var]/tmp from blowing up more critical subsystems (mail, +logging, etc). Filling up [/var]/tmp is a very common problem to have. +.Pp +In the old days there were differences between /tmp and /var/tmp, +but the introduction of /var (and /var/tmp) led to massive confusion +by program writers so today programs halfhazardly use one or the +other and thus no real distinction can be made between the two. So +it makes sense to have just one temporary directory. You can do the +softlink either way. The one thing you do not want to do is leave /tmp +on the root partition where it might cause root to fill up or possibly +corrupt root in a crash/reboot situation. +.Pp +The +.Em /usr +partition holds the bulk of the files required to support the system and +a subdirectory within it called +.Em /usr/local +holds the bulk of the files installed from the +.Xr ports 7 +hierarchy. If you do not use ports all that much and do not intend to keep +system source (/usr/src) on the machine, you can get away with +a 1 gigabyte /usr partition. However, if you install a lot of ports +(especially window managers and linux-emulated binaries), we recommend +at least a 2 gigabyte /usr and if you also intend to keep system source +on the machine, we recommend a 3 gigabyte /usr. 
Do not underestimate the +amount of space you will need in this partition, it can creep up and +surprise you! +.Pp +The +.Em /home +partition is typically used to hold user-specific data. I usually size it +to the remainder of the disk. +.Pp +Why partition at all? Why not create one big +.Em / +partition and be done with it? Then I don't have to worry about undersizing +things! Well, there are several reasons this isn't a good idea. First, +each partition has different operational characteristics and separating them +allows the filesystem to tune itself to those characteristics. For example, +the root and /usr partitions are read-mostly, with very little writing, while +a lot of reading and writing could occur in /var and /var/tmp. By properly +partitioning your system, fragmentation introduced in the smaller more +heavily write-loaded partitions will not bleed over into the mostly-read +partitions. Additionally, keeping the write-loaded partitions closer to +the edge of the disk (i.e. before the really big partitions instead of after +in the partition table) will increase I/O performance in the partitions +where you need it the most. Now it is true that you might also need I/O +performance in the larger partitions, but they are so large that shifting +them more towards the edge of the disk will not lead to a significnat +performance improvement whereas moving /var to the edge can have a huge impact. +Finally, there are safety concerns. Having a small neat root partition that +is essentially read-only gives it a greater chance of surviving a bad crash +intact. +.Pp +Properly partitioning your system also allows you to tune +.Xr newfs 8 , +and +.Xr tunefs 8 +parameters. Tuning +.Fn newfs +requires more experience but can lead to significant improvements in +performance. There are three parameters that are relatively safe to +tune: +.Em blocksize , +.Em bytes/inode , +and +.Em cylinders/group . +.Pp +.Fx +performs best when using 8K or 16K filesystem block sizes. 
The default +filesystem block size is 8K. For larger partitions it is usually a good +idea to use a 16K block size. This also requires you to specify a larger +fragment size. We recommend always using a fragment size that is 1/8 +the block size (less testing has been done on other fragment size factors). +The +.Fn newfs +options for this would be +.Em newfs -f 2048 -b 16384 ... +Using a larger block size can cause fragmentation of the buffer cache and +lead to lower performance. +.Pp +If a large partition is intended to be used to hold fewer, larger files, such +as a database files, you can increase the +.Em bytes/inode +ratio which reduces the number if inodes (maximum number of files and +directories that can be created) for that partition. Decreasing the number +of inodes in a filesystem can greatly reduce +.Xr fsck 8 +recovery times after a crash. Do not use this option +unless you are actually storing large files on the partition, because if you +overcompensate you can wind up with a filesystem that has lots of free +space remaining but cannot accomodate any more files. Using +32768, 65536, or 262144 bytes/inode is recommended. You can go higher but +it will have only incremental effects on fsck recovery times. For +example, +.Em newfs -i 32768 ... +.Pp +Finally, increasing the +.Em cylinders/group +ratio has the effect of packing the inodes closer together. This can increase +directory performance and also decrease fsck times. If you use this option +at all, we recommend maxing it out. Use +.Em newfs -c 999 +and newfs will error out and tell you what the maximum is, then use that. +.Pp +.Xr tunefs 8 +may be used to further tune a filesystem. This command can be run in +single-user mode without having to reformat the filesystem. However, this +is possibly the most abused program in the system. Many people attempt to +increase available filesystem space by setting the min-free percentage to 0. 
+This can lead to severe filesystem fragmentation and we do not recommend +that you do this. Really the only tunefs option worthwhile here is turning on +.Em softupdates +with +.Em tunefs -n enable /filesystem. +(Note: In 5.x softupdates can be turned on using the -U option to newfs). +Softupdates drastically improves meta-data performance, mainly file +creation and deletion. We recommend turning softupdates on on all of your +filesystems. There are two downsides to softupdates that you should be +aware of: First, softupdates guarentees filesystem consistency in the +case of a crash but could very easily be several seconds (even a minute!) +behind updating the physical disk. If you crash you may lose more work +then otherwise. Secondly, softupdates delays the freeing of filesystem +blocks. If you have a filesystem (such as the root filesystem) which is +close to full, doing a major update of it, e.g. +.Em make installworld, +can run it out of space and cause the update to fail. +.Sh STRIPING DISKS +In larger systems you can stripe partitions from several drives together +to create a much larger overall partition. Striping can also improve +the performance of a filesystem by splitting I/O operations across two +or more disks. The +.Xr vinum 8 +and +.Xr ccd 4 +utilities may be used to create simple striped filesystems. Generally +speaking, striping smaller partitions such as the root and /var/tmp, +or essentially read-only partitions such as /usr is a complete waste of +time. You should only stripe partitions that require serious I/O performance... +typically /var, /home, or custom partitions used to hold databases and web +pages. Choosing the proper stripe size is also +important. Filesystems tend to store meta-data on power-of-2 boundries +and you usually want to reduce seeking rather then increase seeking. 
This +means you want to use a large off-center stripe size such as 1152 sectors +so sequential I/O does not seek both disks and so meta-data is distributed +across both disks rather then concentrated on a single disk. If +you really need to get sophisticated, we recommend using a real hardware +raid controller from the list of +.Fx +supported controllers. +.Sh SYSCTL TUNING +.Pp +There are several hundred +.Xr sysctl 8 +variables in the system, including many that appear to be candidates for +tuning but actually aren't. In this document we will only cover the ones +that have the greatest effect on the system. +.Pp +The +.Em kern.ipc.shm_use_phys +sysctl defaults to 0 (off) and may be set to 0 (off) or 1 (on). Setting +this parameter to 1 will cause all SysV shared memory segments to be +mapped to unpageable physical ram. This feature only has an effect if you +are either (A) mapping small amounts of shared memory across many (hundreds) +of processes, or (B) mapping large amounts of shared memory across any +number of processes. This feature allows the kernel to remove a great deal +of internal memory management page-tracking overhead at the cost of wiring +the shared memory into core, making it unswappable. +.Pp +The +.Em vfs.vmiodirenable +sysctl defaults to 0 (off) (though soon it will default to 1) and may be +set to 0 (off) or 1 (on). This parameter controls how directories are cached +by the system. Most directories are small and use but a single fragment +(typically 1K) in the filesystem and even less (typically 512 bytes) in +the buffer cache. However, when operating in the default mode the buffer +cache will only cache a fixed number of directories even if you have a huge +amount of memory. Turning on this sysctl allows the buffer cache to use +the VM Page Cache to cache the directories. The advantage is that all of +memory is now available for caching directories. 
The disadvantage is that +the minimum in-core memory used to cache a directory is the physical page +size (typically 4K) rather then 512 bytes. We recommend turning this option +on if you are running any services which manipulate large numbers of files. +Such services can include web caches, large mail systems, and news systems. +Turning on this option will generally not reduce performance even with the +wasted memory but you should experiment to find out. +.Pp +There are various buffer-cache and VM page cache related sysctls. We do +not recommend messing around with these at all. As of +.Fx 4.3 , +the VM system does an extremely good job tuning itself. +.Pp +The +.Em net.inet.tcp.sendspace +and +.Em net.inet.tcp.recvspace +sysctls are of particular interest if you are running network intensive +applications. This controls the amount of send and receive buffer space +allowed for any given TCP connection. The default is 16K. You can often +improve bandwidth utilization by increasing the default at the cost of +eating up more kernel memory for each connection. We do not recommend +increasing the defaults if you are serving hundreds or thousands of +simultanious connections because it is possible to quickly run the system +out of memory due to stalled connections building up. But if you need +high bandwidth over a fewer number of connections, especially if you have +gigabit ethernet, increasing these defaults can make a huge difference. +You can adjust the buffer size for incoming and outgoing data separately. +For example, if your machine is primarily doing web serving you may want +to decrease the recvspace in order to be able to increase the sendspace +without eating too much kernel memory. Note that the route table, see +.Xr route 8 , +can be used to introduce route-specific send and receive buffer size +defaults. 
As an additional mangagement tool you can use pipes in your +firewall rules, see +.Xr ipfw 8 , +to limit the bandwidth going to or from particular IP blocks or ports. +For example, if you have a T1 you might want to limit your web traffic +to 70% of the T1's bandwidth in order to leave the remainder available +for mail and interactive use. Normally a heavily loaded web server +will not introduce significant latencies into other services even if +the network link is maxed out, but enforcing a limit can smooth things +out and lead to longer term stability. Many people also enforce artificial +bandwidth limitations in order to ensure that they are not charged for +using too much bandwidth. +.Pp +We recommend that you turn on (set to 1) and leave on the +.Em net.inet.tcp.always_keepalive +control. The default is usually off. This introduces a small amount of +additional network bandwidth but guarentees that dead tcp connections +will eventually be recognized and cleared. Dead tcp connections are a +particular problem on systems accesed by users operating over dialups, +because users often disconnect their modems without properly closing active +connections. +.Pp +The +.Em kern.ipc.somaxconn +sysctl limits the size of the listen queue for accepting new tcp connections. +The default value of 128 is typically too low for robust handling of new +connections in a heavily loaded web server environment. For such environments, +we recommend increasing this value to 1024 or higher. The service daemon +may itself limit the listen queue size (e.g. sendmail, apache) but will +often have a directive in its configuration file to adjust the queue size up. +Larger listen queue also do a better job of fending of denial of service +attacks. +.Sh KERNEL CONFIG TUNING +.Pp +There are a number of kernel options that you may have to fiddle with in +a large scale system. In order to change these options you need to be +able to compile a new kernel from source. 
The +.Xr config 8 +manual page and the handbook are good starting points for learning how to +do this. Generally the first thing you do when creating your own custom +kernel is to strip out all the drivers and services you don't use. Removing +things like +.Em INET6 +and drivers you don't have will reduce the size of your kernel, sometimes +by a megabyte or more, leaving more memory available for applications. +.Pp +The +.Em maxusers +kernel option defaults to an incredibly low value. For most modern machines, +you probably want to increase this value to 64, 128, or 256. We do not +recommend going above 256 unless you need a huge number of file descriptors. +Network buffers are also affected but can be controlled with a separate +kernel option. Do not increase maxusers just to get more network mbufs. +.Pp +.Em NMBCLUSTERS +may be adjusted to increase the number of network mbufs the system is +willing to allocate. Each cluster represents approximately 2K of memory, +so a value of 1024 represents 2M of kernel memory reserved for network +buffers. You can do a simple calculation to figure out how many you need. +If you have a web server which maxes out at 1000 simultaneous connections, +and each connection eats a 16K receive and 16K send buffer, you need +approximately 32MB worth of network buffers to deal with it. A good rule of +thumb is to multiply by 2, so 32MBx2 = 64MB/2K = 32768. So for this case +you would want to set NMBCLUSTERS to 32768. We recommend values between +1024 and 4096 for machines with moderate amounts of memory, and between 4096 +and 32768 for machines with greater amounts of memory. Under no circumstances +should you specify an arbitrarily high value for this parameter, as it could +lead to a boot-time crash. The -m option to +.Xr netstat 1 +may be used to observe network cluster use. +.Pp +More and more programs are using the +.Fn sendfile +system call to transmit files over the network. 
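The NMBCLUSTERS sizing rule given in the preceding paragraph (total buffer space, doubled, divided by the 2K cluster size) can be sketched as a short calculation. This is an illustration only: the function name is hypothetical, and rounding the buffer total up to a whole number of megabytes before doubling is our assumption, made so that the result matches the page's "approximately 32MB" figure.

```python
import math

KIB = 1024
MIB = 1024 * KIB
CLUSTER_BYTES = 2 * KIB  # each mbuf cluster is approximately 2K, per the text


def nmbclusters_for(connections, sendspace=16 * KIB, recvspace=16 * KIB,
                    safety_factor=2):
    """Estimate an NMBCLUSTERS value using the rule of thumb from the text."""
    total_bytes = connections * (sendspace + recvspace)
    # Assumption (ours): round up to whole MiB, mirroring "approximately 32MB".
    total_mib = math.ceil(total_bytes / MIB)
    return total_mib * MIB * safety_factor // CLUSTER_BYTES


if __name__ == "__main__":
    # 1000 connections with 16K send and 16K receive buffers.
    print(nmbclusters_for(1000))  # prints 32768
```

The result is only a starting point; the text's recommended ranges and the warning against arbitrarily high values still apply.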
The +.Em NSFBUFS +kernel parameter controls the number of filesystem buffers +.Fn sendfile +is allowed to use to perform its work. This parameter nominally scales +with +.Em maxusers +so you should not need to mess with this parameter except under extreme +circumstances. +.Pp +.Em SCSI_DELAY +and +.Em IDE_DELAY +may be used to reduce system boot times. The defaults are fairly high and +can be responsible for 15+ seconds of delay in the boot process. Reducing +SCSI_DELAY to 5 seconds usually works (especially with modern drives). +Reducing IDE_DELAY also works but you have to be a little more careful. +.Pp +There are a number of +.Em XXX_CPU +options that can be commented out. If you only want the kernel to run +on a Pentium class cpu, you can easily remove +.Em I386_CPU +and +.Em I486_CPU , +but only remove +.Em I586_CPU +if you are sure your cpu is being recognized as a Pentium II or better. +Some clones may be recognized as a Pentium or even a 486 and not be able +to boot without those options. If it works, great! The operating system +will be able to make better use of higher-end cpu features for mmu, task switching, +timebase, and even device operations. Additionally, higher-end cpus support +4MB MMU pages which the kernel uses to map the kernel itself into memory, +which increases its efficiency under heavy syscall loads. +.Sh IDE WRITE CACHING +As of +.Fx 4.3 , +IDE write caching is turned off by default. This will reduce write bandwidth +to IDE disks but is considered necessary due to serious data consistency +issues introduced by hard drive vendors. Basically the problem is that +IDE drives lie about when a write completes. With IDE write caching turned +on, IDE hard drives will not only write data to disk out of order, they +will sometimes delay some of the blocks indefinitely when under heavy disk +loads. A crash or power failure can result in serious filesystem +corruption. So our default is to be safe. 
If you are willing to risk +filesystem corruption, you can return to the old behavior by setting the +hw.ata.wc +kernel variable back to 1. This must be done from the boot loader at boot +time. Please see +.Xr ata 4 +and +.Xr loader 8 . +.Pp +There is a new experimental feature for IDE hard drives called hw.ata.tags +(you also set this in the bootloader) which allows write caching to be safely +turned on. This brings SCSI tagging features to IDE drives. As of this +writing only IBM DPTA and DTLA drives support the feature. +.Sh CPU, MEMORY, DISK, NETWORK +The type of tuning you do depends heavily on where your system begins to +bottleneck as load increases. If your system runs out of cpu (idle times +are perpetually 0%) then you need to consider upgrading the cpu or moving to +an SMP motherboard (multiple cpus), or perhaps you need to revisit the +programs that are causing the load and try to optimize them. If your system +is paging to swap a lot you need to consider adding more memory. If your +system is saturating the disk you typically see high cpu idle times and +total disk saturation. +.Xr systat 1 +can be used to monitor this. There are many solutions to saturated disks: +increasing memory for caching, mirroring disks, distributing operations across +several machines, and so forth. If disk performance is an issue and you +are using IDE drives, switching to SCSI can help a great deal. While modern +IDE drives compare with SCSI in raw sequential bandwidth, the moment you +start seeking around the disk SCSI drives usually win. +.Pp +Finally, you might run out of network suds. The first line of defense for +improving network performance is to make sure you are using switches instead +of hubs, especially now that switches are almost as cheap. Hubs +have severe problems under heavy loads due to collision backoff and one bad +host can severely degrade the entire LAN. Second, optimize the network path +as much as possible. 
For example, in +.Xr firewall 7 +we describe a firewall protecting internal hosts with a topology where +the externally visible hosts are not routed through it. Use 100BaseT rather +than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs. +Most bottlenecks occur at the WAN link (e.g. modem, T1, DSL, whatever). +If expanding the link is not an option it may be possible to use ipfw's +.Sy DUMMYNET +feature to implement peak shaving or other forms of traffic shaping to +prevent the overloaded service (such as web services) from affecting other +services (such as email), or vice versa. In home installations this could +be used to give interactive traffic (your browser, ssh logins) priority +over services you export from your box (web services, email). +.Sh SEE ALSO +.Pp +.Xr ata 4 , +.Xr boot 8 , +.Xr ccd 4 , +.Xr config 8 , +.Xr disklabel 8 , +.Xr firewall 7 , +.Xr fsck 8 , +.Xr hier 7 , +.Xr ifconfig 8 , +.Xr ipfw 8 , +.Xr loader 8 , +.Xr login.conf 5 , +.Xr netstat 1 , +.Xr newfs 8 , +.Xr ports 7 , +.Xr route 8 , +.Xr sysctl 8 , +.Xr systat 1 , +.Xr tunefs 8 , +.Xr vinum 8 +.Sh HISTORY +The +.Nm +manual page was originally written by +.An Matthew Dillon +and first appeared +in +.Fx 4.3 , +May 2001. |