Add two new manual pages related to general firewall and tuning issues

Reviewed by: hackers
author: dillon <dillon@FreeBSD.org> 2001-05-27 23:14:27 +0000
committer: dillon <dillon@FreeBSD.org> 2001-05-27 23:14:27 +0000
commit: 154e92fce6e2492cbf37af0695f1589c91459049 (patch)
tree: 695f2f5c34276a3e81c37647fcdc95ff3c1fa687
parent: 0727598a0e32df4c431ce3b099be9c4695eb6f16 (diff)
3 files changed, 854 insertions, 1 deletions
diff --git a/share/man/man7/Makefile b/share/man/man7/Makefile
index 6c6ecc0..777d267 100644
--- a/share/man/man7/Makefile
+++ b/share/man/man7/Makefile
@@ -3,7 +3,7 @@
 
 #MISSING: eqnchar.7 ms.7 term.7
 MAN=	ascii.7 build.7 clocks.7 environ.7 hier.7 hostname.7 intro.7 mailaddr.7 \
-	operator.7 ports.7 security.7 \
+	operator.7 ports.7 security.7 tuning.7 firewall.7 \
 	style.perl.7
 MLINKS=	intro.7 miscellaneous.7
 
diff --git a/share/man/man7/firewall.7 b/share/man/man7/firewall.7
new file mode 100644
index 0000000..043af5a
--- /dev/null
+++ b/share/man/man7/firewall.7
@@ -0,0 +1,375 @@
+.\" Copyright (c) 2001, Matthew Dillon.  Terms and conditions are those of
+.\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in
+.\" the source tree.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd May 26, 2001
+.Dt FIREWALL 7
+.Os FreeBSD
+.Sh NAME
+.Nm firewall
+.Nd simple firewalls under FreeBSD
+.Sh FIREWALL BASICS
+A firewall is most commonly used to protect an internal network
+from an outside network by preventing the outside network from
+making arbitrary connections into the internal network.  Firewalls
+are also used to prevent outside entities from spoofing internal
+IP addresses and to isolate services such as NFS or SMBFS (Windows
+file sharing) within LAN segments.
+.Pp
+The
+.Fx
+firewalling system also has the capability to limit bandwidth using
+.Xr dummynet 4 .
+This feature can be useful when you need to guarantee a certain
+amount of bandwidth for a critical purpose.  For example, if you
+are doing video conferencing over the internet via your
+office T1 (1.5 MBits), you may wish to bandwidth-limit all other
+T1 traffic to 1 MBit in order to reserve at least 0.5 MBits
+for your video conferencing connections.  Similarly, if you are
+running a popular web or ftp site from a colocation facility
+you might want to limit bandwidth to prevent excessive
+bandwidth charges from your provider.
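+.Pp
+As a rough sketch of the T1 example, and not a tested configuration,
+pipe rules in an
+.Xr ipfw 8
+rules file might look like the following.  The interface name fxp0, the
+pipe and rule numbers, and the conferencing host 192.100.5.10 are all
+assumptions made for illustration.
+.Bd -literal
+# hypothetical values throughout: leave the conferencing host
+# unshaped and push all other traffic leaving via the T1-facing
+# interface through a 1 Mbit/s pipe
+pipe 1 config bw 1Mbit/s
+add 00100 allow ip from 192.100.5.10 to any out via fxp0
+add 00110 pipe 1 ip from any to any out via fxp0
+.Ed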
+.Pp
+Finally,
+.Fx
+firewalls may be used to divert packets or change the next-hop
+address for packets to help route them to the correct destination.
+Packet diversion is most often used to support NAT (network
+address translation), which allows an internal network using
+a private IP space to make connections to the outside for browsing
+or other purposes.
+.Pp
+Constructing a firewall may appear to be trivial, but most people
+get them wrong.  The most common mistake is to create an exclusive
+firewall rather than an inclusive firewall.  An exclusive firewall
+allows all packets through except for those matching a set of rules.
+An inclusive firewall allows only packets matching the ruleset
+through.  Inclusive firewalls are much, much safer than exclusive
+firewalls but a tad more difficult to build properly.  The
+second most common mistake is to blackhole everything except the
+particular port you want to let through.  TCP/IP needs to be able 
+to get certain types of ICMP errors to function properly - for
+example, to implement MTU discovery.  Also, a number of common
+system daemons make reverse connections to the
+.Sy auth
+service in an attempt to authenticate the user making a connection.
+Auth is rather dangerous but the proper implementation is to return
+a TCP reset for the connection attempt rather than simply blackholing
+the packet.  We cover these and other quirks involved with constructing
+a firewall in the sample firewall section below.
+.Sh IPFW KERNEL CONFIGURATION
+To use the ip firewall features of
+.Fx
+you must create a custom kernel with the
+.Sy IPFIREWALL
+option set.  By default the kernel's firewall denies all
+packets, which means that if you do not load in
+a permissive ruleset via
+.Em /etc/rc.conf ,
+rebooting into your new kernel will take the network offline
+and will prevent you from being able to access it if you
+are not sitting at the console.  It is also quite common to
+update a kernel to a new release and reboot before updating
+the binaries.  This can result in an incompatibility between
+the
+.Xr ipfw 8
+program and the kernel which prevents it from running in the
+boot sequence, also resulting in an inaccessible machine.
+Because of these problems the
+.Sy IPFIREWALL_DEFAULT_TO_ACCEPT
+kernel option is also available which changes the default firewall
+to pass through all packets.  Note, however, that this is a very
+dangerous option to set because it means your firewall is disabled
+during booting.  You should use this option while getting up to
+speed with
+.Fx
+firewalling, but get rid of it once you understand how it all works 
+to close the loophole.  There is a third option called
+.Sy IPDIVERT
+which allows you to use the firewall to divert packets to a user program
+and is necessary if you wish to use
+.Xr natd 8
+to give private internal networks access to the outside world. 
+If you want to be able to limit the bandwidth used by certain types of
+traffic, the
+.Sy DUMMYNET
+option must be used to enable
+.Em ipfw pipe
+rules.
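+.Pp
+For illustration, the options discussed in this section would appear in
+a custom kernel configuration file roughly as follows.  This is a sketch:
+include only the options you actually need, and treat
+.Sy IPFIREWALL_DEFAULT_TO_ACCEPT
+as a temporary aid while you are learning.
+.Bd -literal
+# sketch only; trim to the features you use
+options	IPFIREWALL			# enable ipfw
+options	IPFIREWALL_DEFAULT_TO_ACCEPT	# temporary: default is to pass
+options	IPDIVERT			# divert sockets, needed by natd
+options	DUMMYNET			# ipfw pipe rules
+.Ed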
+.Pp
+.Sh SAMPLE IPFW-BASED FIREWALL
+Here is an example ipfw-based firewall taken from a machine with three
+interface cards.  fxp0 is connected to the 'exposed' LAN.  Machines
+on this LAN are dual-homed with both internal 10. IP addresses and
+internet-routed IP addresses.  In our example, 192.100.5.x represents
+the internet-routed IP block while 10.x.x.x represents the internal
+networks.  While it isn't relevant to the example, 10.0.0.x is
+assigned as the internal address block for the LAN on fxp0, 10.0.1.x
+for the LAN on fxp1, and 10.0.2.x for the LAN on fxp2.
+.Pp
+In this example we want to isolate all three LANs from the internet
+as well as isolate them from each other, and we want to give all 
+internal addresses access to the internet through a NAT gateway running
+on this machine.  To make the NAT gateway work, the firewall machine
+is given two internet-exposed addresses on fxp0 in addition to an
+internal 10. address on fxp0: one exposed address (not shown) 
+represents the machine's official address, and the second exposed
+address (192.100.5.5 in our example) represents the NAT gateway
+rendezvous IP.  We make the example more complex by giving the machines
+on the exposed LAN internal 10.0.0.x addresses as well as exposed 
+addresses.  The idea here is that you can bind internal services
+to internal addresses even on exposed machines and still protect
+those services from the internet.  The only services you run on
+exposed IP addresses would be the ones you wish to expose to the
+internet.
+.Pp
+It is important to note that the 10.0.0.x network in our example
+is not protected by our firewall.  You must make sure that your
+internet router protects this network from outside spoofing.  
+Also, in our example, we pretty much give the exposed hosts free
+rein on our internal network when operating services through
+internal IP addresses (10.0.0.x).  This is somewhat of a security
+risk... what if an exposed host is compromised?  To remove the
+risk and force everything coming in via LAN0 to go through
+the firewall, remove rules 01010 and 01011.
+.Pp
+Finally, note that the use of internal addresses represents a
+big piece of our firewall protection mechanism.  With proper
+spoofing safeguards in place, nothing outside can directly
+access an internal (LAN1 or LAN2) host.
+.Bd -literal
+# /etc/rc.conf
+#
+firewall_enable="YES"
+firewall_type="/etc/ipfw.conf"
+
+# temporary port binding range let
+# through the firewall.
+# 
+# NOTE: heavily loaded services running through the firewall may require
+# a larger port range for local-size binding.  4000-10000 or 4000-30000
+# might be a better choice.
+ip_portrange_first=4000
+ip_portrange_last=5000
+...
+.Ed
+.Pp
+.Bd -literal
+# /etc/ipfw.conf
+#
+# FIREWALL: the firewall machine / nat gateway
+# LAN0	    10.0.0.X and 192.100.5.X (dual homed)
+# LAN1	    10.0.1.X 
+# LAN2	    10.0.2.X
+# sw:	    ethernet switch (unmanaged)
+#
+# 192.100.5.x represents IP addresses exposed to the internet
+# (i.e. internet routable).  10.x.x.x represents internal IPs
+# (not exposed)
+#
+#   [LAN1]
+#      ^
+#      |
+#   FIREWALL -->[LAN2]
+#      |
+#   [LAN0]
+#      |
+#      +--> exposed host A
+#      +--> exposed host B
+#      +--> exposed host C
+#      |
+#   INTERNET (secondary firewall)
+#    ROUTER
+#      |
+#    [internet]
+#
+# NOT SHOWN:  The INTERNET ROUTER must contain rules to disallow
+# all packets with source IP addresses in the 10. block in order
+# to protect the dual-homed 10.0.0.x block.  Exposed hosts are
+# not otherwise protected in this example - they should only bind 
+# exposed services to exposed IPs but can safely bind internal
+# services to internal IPs.
+#
+# The NAT gateway works by taking packets sent from internal
+# IP addresses to external IP addresses and routing them to natd, which
+# is listening on port 8668.   This is handled by rule 00300.  Data coming
+# back to natd from the outside world must also be routed to natd using
+# rule 00301.  To make the example interesting, we note that we do
+# NOT have to run internal requests to exposed hosts through natd
+# (rule 00290) because those exposed hosts know about our
+# 10. network.  This can reduce the load on natd.  Also note that we
+# of course do not have to route internal<->internal traffic through
+# natd since those hosts know how to route our 10. internal network.
+# The natd command we run from /etc/rc.local is shown below.  See
+# also the in-kernel version of natd, ipnat.
+#
+#	natd -s -u -a 192.100.5.5
+#
+#
+add 00290 skipto 1000 ip from 10.0.0.0/8 to 192.100.5.0/24
+add 00300 divert 8668 ip from 10.0.0.0/8 to not 10.0.0.0/8
+add 00301 divert 8668 ip from not 10.0.0.0/8 to 192.100.5.5
+
+# Short cut the rules to avoid running high bandwidths through
+# the entire rule set.  Allow established tcp connections through,
+# and shortcut all outgoing packets under the assumption that
+# we need only firewall incoming packets.
+#
+# Allowing established tcp connections through creates a small
+# hole but may be necessary to avoid overloading your firewall.
+# If you are worried, you can move the rule to after the spoof
+# checks.
+#
+add 01000 allow tcp from any to any established
+add 01001 allow all from any to any out via fxp0
+add 01001 allow all from any to any out via fxp1
+add 01001 allow all from any to any out via fxp2
+
+# Spoof protection.  This depends on how well you trust your
+# internal networks.  Packets received via fxp1 MUST come from
+# 10.0.1.x.  Packets received via fxp2 MUST come from 10.0.2.x.
+# Packets received via fxp0 cannot come from the LAN1 or LAN2
+# blocks.  We can't protect 10.0.0.x here, the internet router
+# must do that for us.
+#
+add 01500 deny all from not 10.0.1.0/24 in via fxp1
+add 01500 deny all from not 10.0.2.0/24 in via fxp2
+add 01501 deny all from 10.0.1.0/24 in via fxp0
+add 01501 deny all from 10.0.2.0/24 in via fxp0
+
+# In this example rule set there are no restrictions between
+# internal hosts, even those on the exposed LAN (as long as
+# they use an internal IP address).  This represents a
+# potential security hole (what if an exposed host is 
+# compromised?).  If you want full restrictions to apply
+# between the three LANs, firewalling them off from each
+# other for added security, remove these two rules.
+#
+# If you want to isolate LAN1 and LAN2, but still want
+# to give exposed hosts free rein with each other, get
+# rid of rule 01010 and keep rule 01011.
+#
+# (commented out, uncomment for less restrictive firewall)
+#add 01010 allow all from 10.0.0.0/8 to 10.0.0.0/8
+#add 01011 allow all from 192.100.5.0/24 to 192.100.5.0/24
+#
+
+# SPECIFIC SERVICES ALLOWED FROM SPECIFIC LANS
+#
+# If using a more restrictive firewall, allow specific LANs
+# access to specific services running on the firewall itself.
+# In this case we assume LAN1 needs access to filesharing running
+# on the firewall.  If using a less restrictive firewall
+# (allowing rule 01010), you don't need these rules.
+#
+add 01012 allow tcp from 10.0.1.0/24 to 10.0.1.1 139
+add 01012 allow udp from 10.0.1.0/24 to 10.0.1.1 137,138
+
+# GENERAL SERVICES ALLOWED TO CROSS INTERNAL AND EXPOSED LANS
+#
+# We allow specific UDP services through: DNS lookups, ntalk, and ntp.
+# Note that internal services are protected by virtue of having
+# spoof-proof internal IP addresses (10. net), so these rules
+# really only apply to services bound to exposed IPs.  We have
+# to allow UDP fragments or larger fragmented UDP packets will
+# not survive the firewall.
+#
+# If we want to expose high-numbered temporary service ports
+# for things like DNS lookup responses we can use a port range,
+# in this example 4000-65535, and we set two /etc/rc.conf variables
+# on all exposed machines to make sure they bind temporary ports
+# to the exposed port range (see rc.conf example above)
+#
+add 02000 allow udp from any to any 4000-65535,domain,ntalk,ntp
+add 02500 allow udp from any to any frag
+
+# Allow similar services for TCP.  Again, these only apply to
+# services bound to exposed addresses.  NOTE: we allow 'auth'
+# through but do not actually run an identd server on any exposed
+# port.  This allows the machine being authed to respond with a
+# TCP RESET.  Throwing the packet away would result in delays
+# when connecting to remote services that do reverse ident lookups.
+#
+# Note that we do not allow tcp fragments through, and that we do
+# not allow fragments in general (except for UDP fragments).  We
+# expect the TCP MTU discovery protocol to work properly so there
+# should be no TCP fragments.
+#
+add 03000 allow tcp from any to any http,https
+add 03000 allow tcp from any to any 4000-65535,ssh,smtp,domain,ntalk
+add 03000 allow tcp from any to any auth,pop3,ftp,ftp-data
+
+# It is important to allow certain ICMP types through:
+#
+#	0	Echo Reply
+#	3	Destination Unreachable
+#	4	Source Quench (typically not allowed)
+#	5	Redirect (typically not allowed - can be dangerous!)
+#	8	Echo
+#	11	Time Exceeded
+#	12	Parameter Problem
+#	13	Timestamp
+#	14	Timestamp Reply
+#
+# Sometimes people need to allow ICMP REDIRECT packets, which is
+# type 5, but if you allow it make sure that your internet router
+# disallows it.
+
+add 04000 allow icmp from any to any icmptypes 0,3,5,8,11,12,13,14
+
+# log any remaining fragments that get through.  Might be useful,
+# otherwise don't bother.  Have a final deny rule as a safety to
+# guarantee that your firewall is inclusive no matter how the kernel
+# is configured.
+#
+add 05000 deny log ip from any to any frag
+add 06000 deny all from any to any
+.Ed
+.Sh PORT BINDING INTERNAL AND EXTERNAL SERVICES
+We've mentioned multi-homing hosts and binding services to internal or 
+external addresses but we haven't really explained it.  When you have a 
+host with multiple IP addresses assigned to it, you can bind services run 
+on that host to specific IPs or interfaces rather than all IPs.  Take
+the firewall machine for example:  With three interfaces
+and two exposed IP addresses 
+on one of those interfaces, the firewall machine is known by 5 different
+IP addresses (10.0.0.1, 10.0.1.1, 10.0.2.1, 192.100.5.5, and say
+192.100.5.1).  If the firewall is providing file sharing services to the
+Windows LAN segment (say it is LAN1), you can use Samba's 'interfaces' and
+'bind interfaces only' directives to bind it to just the LAN1 IP address.  That
+way the file sharing services will not be made available to other LAN
+segments.  The same goes for NFS.  If LAN2 has your UNIX engineering
+workstations, you can tell nfsd to bind specifically to 10.0.2.1.  You
+can specify how to bind virtually every service on the machine and you
+can use a light
+.Xr jail 8
+to indirectly bind services that do not otherwise give you the option.
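+.Pp
+A minimal sketch of such bindings follows, assuming Samba's
+.Em interfaces
+and
+.Em bind interfaces only
+settings and the
+.Fl h
+option to
+.Xr nfsd 8 ;
+check the documentation for the versions you actually run.
+.Bd -literal
+# smb.conf (assumed fragment): serve file sharing on LAN1 only
+[global]
+    interfaces = 10.0.1.1/24
+    bind interfaces only = yes
+
+# /etc/rc.conf (assumed fragment): bind nfsd to the LAN2 address
+nfs_server_flags="-u -t -n 4 -h 10.0.2.1"
+.Ed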
+.Sh SEE ALSO
+.Pp
+.Xr config 8 ,
+.Xr dummynet 4 ,
+.Xr ipfw 8 ,
+.Xr ipnat 1 ,
+.Xr ipnat 5 ,
+.Xr jail 8 ,
+.Xr natd 8 ,
+.Xr nfsd 8 ,
+.Xr rc.conf 5 ,
+.Xr samba 7 [ /usr/ports/net/samba ]
+.Xr smb.conf 5 [ /usr/ports/net/samba ]
+.Sh ADDITIONAL READING
+.Pp
+.Xr ipf 5 ,
+.Xr ipf 8 ,
+.Xr ipfstat 8
+.Sh HISTORY
+The
+.Nm
+manual page was originally written by
+.An Matthew Dillon
+and first appeared 
+in
+.Fx 4.3 ,
+May 2001.
diff --git a/share/man/man7/tuning.7 b/share/man/man7/tuning.7
new file mode 100644
index 0000000..8c7630e
--- /dev/null
+++ b/share/man/man7/tuning.7
@@ -0,0 +1,478 @@
+.\" Copyright (c) 2001, Matthew Dillon.  Terms and conditions are those of
+.\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in
+.\" the source tree.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd May 25, 2001
+.Dt TUNING 7
+.Os FreeBSD
+.Sh NAME
+.Nm tuning
+.Nd performance tuning under FreeBSD
+.Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP
+.Pp
+When using
+.Xr disklabel 8
+to lay out your filesystems on a hard disk it is important to remember
+that hard drives can transfer data much more quickly from outer tracks
+than they can from inner tracks.  To take advantage of this you should
+try to pack your smaller filesystems and swap closer to the outer tracks,
+follow with the larger filesystems, and end with the largest filesystems.
+It is also important to size system standard filesystems such that you
+will not be forced to resize them later as you scale the machine up.
+I usually create, in order, a 128M root, 1G swap, 128M /var, 128M /var/tmp,
+3G /usr, and use any remaining space for /home.
+.Pp
+You should typically size your swap space to approximately 2x main memory.
+If you do not have a lot of RAM, though, you will generally want a lot
+more swap.  It is not recommended that you configure any less than
+256M of swap on a system and you should keep in mind future memory
+expansion when sizing the swap partition.
+The kernel's VM paging algorithms are tuned to perform best when there is
+at least 2x swap versus main memory.  Configuring too little swap can lead
+to inefficiencies in the VM page scanning code as well as create issues
+later on if you add more memory to your machine.  Finally, on larger systems
+with multiple SCSI disks (or multiple IDE disks operating on different
+controllers), we strongly recommend that you configure swap on each drive
+(up to four drives).  The swap partitions on the drives should be
+approximately the same size.  The kernel can handle arbitrary sizes but 
+internal data structures scale to 4 times the largest swap partition.  Keeping
+the swap partitions near the same size will allow the kernel to optimally
+stripe swap space across the N disks.  Don't worry about overdoing it a
+little; swap space is the saving grace of
+.Ux
+and even if you don't normally use much swap, it can give you more time to
+recover from a runaway program before being forced to reboot.
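+.Pp
+As a sketch, swap spread across two drives could be listed in
+.Em /etc/fstab
+as shown below.  The device names are assumptions for illustration;
+substitute the swap partitions you created with
+.Xr disklabel 8 .
+.Bd -literal
+# hypothetical devices: one similarly sized swap partition per drive
+# Device	Mountpoint	FStype	Options	Dump	Pass#
+/dev/da0s1b	none		swap	sw	0	0
+/dev/da1s1b	none		swap	sw	0	0
+.Ed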
+.Pp
+How you size your
+.Em /var
+partition depends heavily on what you intend to use the machine for.  This
+partition is primarily used to hold mailboxes, the print spool, and log
+files.  Some people even make
+.Em /var/log
+its own partition (but except for extreme cases it isn't worth the waste
+of a partition id).  If your machine is intended to act as a mail
+or print server,
+or you are running a heavily visited web server, you should consider
+creating a much larger partition - perhaps a gig or more.  It is very easy
+to underestimate log file storage requirements. 
+.Pp
+Sizing
+.Em /var/tmp
+depends on the kind of temporary file usage you think you will need.  128M is
+the minimum we recommend.  Also note that you usually want to make
+.Em /tmp
+a softlink to
+.Em /var/tmp .
+Dedicating a partition for temporary file storage is important for
+two reasons:  First, it reduces the possibility of filesystem corruption
+in a crash, and second it reduces the chance of a runaway process that
+fills up [/var]/tmp from blowing up more critical subsystems (mail,
+logging, etc).  Filling up [/var]/tmp is a very common problem to have.
+.Pp
+In the old days there were differences between /tmp and /var/tmp,
+but the introduction of /var (and /var/tmp) led to massive confusion
+by program writers so today programs haphazardly use one or the
+other and thus no real distinction can be made between the two.  So
+it makes sense to have just one temporary directory.  You can do the 
+softlink either way.  The one thing you do not want to do is leave /tmp
+on the root partition where it might cause root to fill up or possibly
+corrupt root in a crash/reboot situation.
+.Pp
+The
+.Em /usr
+partition holds the bulk of the files required to support the system and
+a subdirectory within it called
+.Em /usr/local
+holds the bulk of the files installed from the
+.Xr ports 7
+hierarchy.  If you do not use ports all that much and do not intend to keep
+system source (/usr/src) on the machine, you can get away with
+a 1 gigabyte /usr partition.  However, if you install a lot of ports
+(especially window managers and Linux-emulated binaries), we recommend
+at least a 2 gigabyte /usr and if you also intend to keep system source
+on the machine, we recommend a 3 gigabyte /usr.  Do not underestimate the
+amount of space you will need in this partition; it can creep up and
+surprise you!
+.Pp
+The
+.Em /home
+partition is typically used to hold user-specific data.  I usually size it
+to the remainder of the disk.
+.Pp
+Why partition at all?  Why not create one big
+.Em /
+partition and be done with it?  Then I don't have to worry about undersizing
+things!  Well, there are several reasons this isn't a good idea.  First,
+each partition has different operational characteristics and separating them
+allows the filesystem to tune itself to those characteristics.  For example,
+the root and /usr partitions are read-mostly, with very little writing, while
+a lot of reading and writing could occur in /var and /var/tmp.  By properly
+partitioning your system, fragmentation introduced in the smaller more
+heavily write-loaded partitions will not bleed over into the mostly-read
+partitions.  Additionally, keeping the write-loaded partitions closer to
+the edge of the disk (i.e. before the really big partitions instead of after
+in the partition table) will increase I/O performance in the partitions 
+where you need it the most.  Now it is true that you might also need I/O
+performance in the larger partitions, but they are so large that shifting
+them more towards the edge of the disk will not lead to a significant
+performance improvement whereas moving /var to the edge can have a huge impact.
+Finally, there are safety concerns.  Having a small neat root partition that
+is essentially read-only gives it a greater chance of surviving a bad crash
+intact.
+.Pp
+Properly partitioning your system also allows you to tune
+.Xr newfs 8
+and
+.Xr tunefs 8
+parameters.  Tuning
+.Xr newfs 8
+requires more experience but can lead to significant improvements in 
+performance.  There are three parameters that are relatively safe to
+tune:
+.Em blocksize ,
+.Em bytes/inode ,
+and
+.Em cylinders/group .
+.Pp
+.Fx
+performs best when using 8K or 16K filesystem block sizes.  The default
+filesystem block size is 8K.  For larger partitions it is usually a good
+idea to use a 16K block size.  This also requires you to specify a larger
+fragment size.  We recommend always using a fragment size that is 1/8
+the block size (less testing has been done on other fragment size factors).
+The
+.Xr newfs 8
+options for this would be
+.Em newfs -f 2048 -b 16384 ...
+Using a block size larger than 16K can cause fragmentation of the buffer
+cache and lead to lower performance.
+.Pp
+If a large partition is intended to be used to hold fewer, larger files, such
+as database files, you can increase the
+.Em bytes/inode
+ratio which reduces the number of inodes (maximum number of files and
+directories that can be created) for that partition.  Decreasing the number
+of inodes in a filesystem can greatly reduce
+.Xr fsck 8
+recovery times after a crash.  Do not use this option
+unless you are actually storing large files on the partition, because if you
+overcompensate you can wind up with a filesystem that has lots of free
+space remaining but cannot accommodate any more files.  Using
+32768, 65536, or 262144 bytes/inode is recommended.  You can go higher but
+it will have only incremental effects on fsck recovery times.  For
+example, 
+.Em newfs -i 32768 ...
+.Pp
+Finally, increasing the
+.Em cylinders/group
+ratio has the effect of packing the inodes closer together.  This can increase
+directory performance and also decrease fsck times.  If you use this option
+at all, we recommend maxing it out.  Use
+.Em newfs -c 999
+and newfs will error out and tell you what the maximum is, then use that.
+.Pp
+.Xr tunefs 8
+may be used to further tune a filesystem.  This command can be run in
+single-user mode without having to reformat the filesystem.  However, this
+is possibly the most abused program in the system.  Many people attempt to 
+increase available filesystem space by setting the min-free percentage to 0.
+This can lead to severe filesystem fragmentation and we do not recommend
+that you do this.  Really the only tunefs option worthwhile here is turning on
+.Em softupdates
+with
+.Em tunefs -n enable /filesystem .
+(Note: In 5.x softupdates can be turned on using the -U option to newfs).
+Softupdates drastically improves meta-data performance, mainly file
+creation and deletion.  We recommend turning softupdates on for all of your
+filesystems.  There are two downsides to softupdates that you should be
+aware of:  First, softupdates guarantees filesystem consistency in the
+case of a crash but could very easily be several seconds (even a minute!)
+behind updating the physical disk.  If you crash you may lose more work
+than otherwise.  Secondly, softupdates delays the freeing of filesystem
+blocks.  If you have a filesystem (such as the root filesystem) which is 
+close to full, doing a major update of it, e.g.
+.Em make installworld ,
+can run it out of space and cause the update to fail.
+.Sh STRIPING DISKS
+In larger systems you can stripe partitions from several drives together
+to create a much larger overall partition.  Striping can also improve
+the performance of a filesystem by splitting I/O operations across two
+or more disks.  The
+.Xr vinum 8 
+and
+.Xr ccd 4
+utilities may be used to create simple striped filesystems.  Generally
+speaking, striping smaller partitions such as the root and /var/tmp,
+or essentially read-only partitions such as /usr is a complete waste of
+time.  You should only stripe partitions that require serious I/O performance...
+typically /var, /home, or custom partitions used to hold databases and web
+pages.  Choosing the proper stripe size is also 
+important.  Filesystems tend to store meta-data on power-of-2 boundaries
+and you usually want to reduce seeking rather than increase seeking.  This
+means you want to use a large off-center stripe size such as 1152 sectors
+so sequential I/O does not seek both disks and so meta-data is distributed
+across both disks rather than concentrated on a single disk.  If
+you really need to get sophisticated, we recommend using a real hardware
+RAID controller from the list of
+.Fx
+supported controllers.
+.Sh SYSCTL TUNING
+.Pp
+There are several hundred
+.Xr sysctl 8
+variables in the system, including many that appear to be candidates for
+tuning but actually aren't.  In this document we will only cover the ones
+that have the greatest effect on the system.
+.Pp
+The
+.Em kern.ipc.shm_use_phys
+sysctl defaults to 0 (off) and may be set to 0 (off) or 1 (on).  Setting
+this parameter to 1 will cause all SysV shared memory segments to be
+mapped to unpageable physical RAM.  This feature only has an effect if you
+are either (A) mapping small amounts of shared memory across many (hundreds)
+of processes, or (B) mapping large amounts of shared memory across any
+number of processes.  This feature allows the kernel to remove a great deal
+of internal memory management page-tracking overhead at the cost of wiring
+the shared memory into core, making it unswappable.
+.Pp
+The
+.Em vfs.vmiodirenable
+sysctl defaults to 0 (off) (though soon it will default to 1) and may be
+set to 0 (off) or 1 (on).  This parameter controls how directories are cached
+by the system.  Most directories are small and use but a single fragment
+(typically 1K) in the filesystem and even less (typically 512 bytes) in
+the buffer cache.  However, when operating in the default mode the buffer
+cache will only cache a fixed number of directories even if you have a huge
+amount of memory.  Turning on this sysctl allows the buffer cache to use
+the VM Page Cache to cache the directories.  The advantage is that all of
+memory is now available for caching directories.  The disadvantage is that
+the minimum in-core memory used to cache a directory is the physical page
+size (typically 4K) rather than 512 bytes.  We recommend turning this option
+on if you are running any services which manipulate large numbers of files.
+Such services can include web caches, large mail systems, and news systems.
+Turning on this option will generally not reduce performance even with the
+wasted memory but you should experiment to find out.
+.Pp
+There are various buffer-cache and VM page cache related sysctls.  We do
+not recommend messing around with these at all.  As of
+.Fx 4.3 ,
+the VM system does an extremely good job tuning itself.
+.Pp
+The
+.Em net.inet.tcp.sendspace
+and
+.Em net.inet.tcp.recvspace
+sysctls are of particular interest if you are running network intensive
+applications.  These control the amount of send and receive buffer space
+allowed for any given TCP connection.  The default is 16K.  You can often
+improve bandwidth utilization by increasing the default at the cost of 
+eating up more kernel memory for each connection.  We do not recommend
+increasing the defaults if you are serving hundreds or thousands of
+simultaneous connections because it is possible to quickly run the system
+out of memory due to stalled connections building up.  But if you need
+high bandwidth over fewer connections, especially if you have
+gigabit ethernet, increasing these defaults can make a huge difference.
+You can adjust the buffer size for incoming and outgoing data separately.
+For example, if your machine is primarily doing web serving you may want
+to decrease the recvspace in order to be able to increase the sendspace
+without eating too much kernel memory.  Note that the route table, see
+.Xr route 8 ,
+can be used to introduce route-specific send and receive buffer size
+defaults.  As an additional management tool you can use pipes in your
+firewall rules, see
+.Xr ipfw 8 ,
+to limit the bandwidth going to or from particular IP blocks or ports.
+For example, if you have a T1 you might want to limit your web traffic
+to 70% of the T1's bandwidth in order to leave the remainder available
+for mail and interactive use.   Normally a heavily loaded web server
+will not introduce significant latencies into other services even if 
+the network link is maxed out, but enforcing a limit can smooth things
+out and lead to longer term stability.  Many people also enforce artificial
+bandwidth limitations in order to ensure that they are not charged for
+using too much bandwidth.
+.Pp
+We recommend that you turn on (set to 1) and leave on the 
+.Em net.inet.tcp.always_keepalive
+control.  The default is usually off.  This consumes a small amount of
+additional network bandwidth but guarantees that dead tcp connections
+will eventually be recognized and cleared.  Dead tcp connections are a
+particular problem on systems accessed by users operating over dialups,
+because users often disconnect their modems without properly closing active
+connections.
+.Pp
+The
+.Em kern.ipc.somaxconn
+sysctl limits the size of the listen queue for accepting new tcp connections.
+The default value of 128 is typically too low for robust handling of new
+connections in a heavily loaded web server environment.  For such environments,
+we recommend increasing this value to 1024 or higher.  The service daemon
+may itself limit the listen queue size (e.g. sendmail, apache) but will
+often have a directive in its configuration file to adjust the queue size up.
+Larger listen queues also do a better job of fending off denial of service
+attacks.
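+.Pp
+For instance, the two settings recommended above could be applied at
+run time as follows; 1024 is simply the lower bound suggested in the text,
+and placing the same name=value pairs in
+.Pa /etc/sysctl.conf
+is one way to have them survive a reboot:
+.Bd -literal -offset indent
+sysctl -w net.inet.tcp.always_keepalive=1
+sysctl -w kern.ipc.somaxconn=1024
+.Ed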
+.Sh KERNEL CONFIG TUNING
+.Pp
+There are a number of kernel options that you may have to fiddle with in
+a large scale system.  In order to change these options you need to be
+able to compile a new kernel from source.  The
+.Xr config 8
+manual page and the handbook are good starting points for learning how to
+do this.  Generally the first thing you do when creating your own custom
+kernel is to strip out all the drivers and services you don't use.  Removing
+things like
+.Em INET6
+and drivers you don't have will reduce the size of your kernel, sometimes
+by a megabyte or more, leaving more memory available for applications.
+.Pp
+The
+.Em maxusers
+kernel option defaults to an incredibly low value.  For most modern machines,
+you probably want to increase this value to 64, 128, or 256.  We do not 
+recommend going above 256 unless you need a huge number of file descriptors.
+Network buffers are also affected but can be controlled with a separate
+kernel option.  Do not increase maxusers just to get more network mbufs.
+.Pp
+.Em NMBCLUSTERS
+may be adjusted to increase the number of network mbufs the system is
+willing to allocate.  Each cluster represents approximately 2K of memory,
+so a value of 1024 represents 2M of kernel memory reserved for network
+buffers.  You can do a simple calculation to figure out how many you need.
+If you have a web server which maxes out at 1000 simultaneous connections,
+and each connection eats a 16K receive and 16K send buffer, you need
+approximately 32MB worth of network buffers to deal with it.  A good rule of
+thumb is to multiply by 2, so 32MB x 2 = 64MB, and 64MB / 2K = 32768.  So for
+this case you would want to set NMBCLUSTERS to 32768.  We recommend values
+between 1024 and 4096 for machines with moderate amounts of memory, and between
+4096 and 32768 for machines with greater amounts of memory.  Under no
+circumstances should you specify an arbitrarily high value for this parameter;
+it could lead to a boot-time crash.  The -m option to
+.Xr netstat 1
+may be used to observe network cluster use.
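+.Pp
+Spelled out, the estimate above is 1000 connections x (16K + 16K), or
+roughly 32MB, doubled to 64MB and divided by the 2K cluster size.
+A sketch of the matching kernel configuration entries, assuming a
+.Em maxusers
+value picked from the range suggested earlier, might look like:
+.Bd -literal -offset indent
+maxusers	128
+options 	NMBCLUSTERS=32768
+.Ed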
+.Pp
+More and more programs are using the
+.Fn sendfile
+system call to transmit files over the network.  The
+.Em NSFBUFS
+kernel parameter controls the number of filesystem buffers
+.Fn sendfile
+is allowed to use to perform its work.  This parameter nominally scales
+with
+.Em maxusers
+so you should not need to mess with this parameter except under extreme
+circumstances.
+.Pp
+.Em SCSI_DELAY
+and
+.Em IDE_DELAY
+may be used to reduce system boot times.  The defaults are fairly high and
+can be responsible for 15+ seconds of delay in the boot process.  Reducing
+SCSI_DELAY to 5 seconds usually works (especially with modern drives).
+Reducing IDE_DELAY also works but you have to be a little more careful.
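+.Pp
+As an illustration, and assuming the delay is expressed in milliseconds
+as it is in the stock GENERIC configuration file, a 5 second SCSI probe
+delay would be written as:
+.Bd -literal -offset indent
+options 	SCSI_DELAY=5000
+.Ed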
+.Pp
+There are a number of
+.Em XXX_CPU
+options that can be commented out.  If you only want the kernel to run
+on a Pentium class cpu, you can easily remove
+.Em I386_CPU
+and
+.Em I486_CPU ,
+but only remove
+.Em I586_CPU
+if you are sure your cpu is being recognized as a Pentium II or better.
+Some clones may be recognized as a Pentium or even a 486 and not be able
+to boot without those options.  If it works, great!  The operating system
+will be able to make better use of higher-end cpu features for mmu, task switching,
+timebase, and even device operations.  Additionally, higher-end cpus support
+4MB MMU pages which the kernel uses to map the kernel itself into memory,
+which increases its efficiency under heavy syscall loads.
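+.Pp
+For example, a configuration file for a machine known to be detected as
+a Pentium II or better might carry cpu lines along these lines (a sketch;
+check how your cpu is identified at boot before removing any of them):
+.Bd -literal -offset indent
+#cpu		I386_CPU
+#cpu		I486_CPU
+#cpu		I586_CPU
+cpu		I686_CPU
+.Ed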
+.Sh IDE WRITE CACHING
+As of
+.Fx 4.3 ,
+IDE write caching is turned off by default.  This will reduce write bandwidth
+to IDE disks but is considered necessary due to serious data consistency
+issues introduced by hard drive vendors.  Basically the problem is that 
+IDE drives lie about when a write completes.  With IDE write caching turned
+on, IDE hard drives will not only write data to disk out of order, they
+will sometimes delay some of the blocks indefinitely when under heavy disk
+loads.  A crash or power failure can result in serious filesystem
+corruption.  So our default is to be safe.  If you are willing to risk
+filesystem corruption, you can return to the old behavior by setting the
+hw.ata.wc
+kernel variable back to 1.  This must be done from the boot loader at boot
+time.  Please see
+.Xr ata 4
+and
+.Xr loader 8 .
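+.Pp
+As a sketch, the variable can be set for a single boot from the loader
+prompt:
+.Bd -literal -offset indent
+set hw.ata.wc=1
+boot
+.Ed
+.Pp
+or for every boot by adding the following line to
+.Pa /boot/loader.conf :
+.Bd -literal -offset indent
+hw.ata.wc="1"
+.Ed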
+.Pp
+There is a new experimental feature for IDE hard drives called hw.ata.tags
+(you also set this in the bootloader) which allows write caching to be safely
+turned on.  This brings SCSI tagging features to IDE drives.  As of this
+writing only IBM DPTA and DTLA drives support the feature.
+.Sh CPU, MEMORY, DISK, NETWORK
+The type of tuning you do depends heavily on where your system begins to
+bottleneck as load increases.  If your system runs out of cpu (idle times
+are perpetually 0%) then you need to consider upgrading the cpu or moving to
+an SMP motherboard (multiple cpus), or perhaps you need to revisit the
+programs that are causing the load and try to optimize them.  If your system
+is paging to swap a lot you need to consider adding more memory.  If your
+system is saturating the disk you typically see high cpu idle times and
+total disk saturation.
+.Xr systat 1
+can be used to monitor this.  There are many solutions to saturated disks:
+increasing memory for caching, mirroring disks, distributing operations across
+several machines, and so forth.  If disk performance is an issue and you
+are using IDE drives, switching to SCSI can help a great deal.  While modern
+IDE drives compare with SCSI in raw sequential bandwidth, the moment you
+start seeking around the disk SCSI drives usually win.
+.Pp
+Finally, you might run out of network suds.  The first line of defense for
+improving network performance is to make sure you are using switches instead
+of hubs, especially these days where switches are almost as cheap.  Hubs
+have severe problems under heavy loads due to collision backoff and one bad
+host can severely degrade the entire LAN.  Second, optimize the network path
+as much as possible.  For example, in 
+.Xr firewall 7
+we describe a firewall protecting internal hosts with a topology where
+the externally visible hosts are not routed through it.  Use 100BaseT rather
+than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs.
+Most bottlenecks occur at the WAN link (e.g. modem, T1, DSL, whatever).
+If expanding the link is not an option it may be possible to use ipfw's
+.Sy DUMMYNET
+feature to implement peak shaving or other forms of traffic shaping to
+prevent the overloaded service (such as web services) from affecting other
+services (such as email), or vice versa.  In home installations this could
+be used to give interactive traffic (your browser, ssh logins) priority
+over services you export from your box (web services, email).
+.Sh SEE ALSO
+.Pp
+.Xr ata 4 ,
+.Xr boot 8 ,
+.Xr ccd 4 ,
+.Xr config 8 ,
+.Xr disklabel 8 ,
+.Xr firewall 7 ,
+.Xr fsck 8 ,
+.Xr hier 7 ,
+.Xr ifconfig 8 ,
+.Xr ipfw 8 ,
+.Xr loader 8 ,
+.Xr login.conf 5 ,
+.Xr netstat 1 ,
+.Xr newfs 8 ,
+.Xr ports 7 ,
+.Xr route 8 ,
+.Xr sysctl 8 ,
+.Xr systat 1 ,
+.Xr tunefs 8 ,
+.Xr vinum 8
+.Sh HISTORY
+The
+.Nm
+manual page was originally written by
+.An Matthew Dillon
+and first appeared 
+in
+.Fx 4.3 ,
+May 2001.