path: root/share/doc/papers/sysperf/3.t
Diffstat (limited to 'share/doc/papers/sysperf/3.t')
-rw-r--r--  share/doc/papers/sysperf/3.t | 694
1 files changed, 694 insertions, 0 deletions
diff --git a/share/doc/papers/sysperf/3.t b/share/doc/papers/sysperf/3.t
new file mode 100644
index 0000000..832ad42
--- /dev/null
+++ b/share/doc/papers/sysperf/3.t
@@ -0,0 +1,694 @@
+.\" Copyright (c) 1985 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)3.t 5.1 (Berkeley) 4/17/91
+.\"
+.ds RH Results of our observations
+.NH
+Results of our observations
+.PP
+When 4.2BSD was first installed on several large timesharing systems
+the degradation in performance was significant.
+Informal measurements showed 4.2BSD providing 80% of the throughput
+of 4.1BSD (based on load averages observed under a normal timesharing load).
+Many of the initial problems found were caused by programs that were
+not part of 4.1BSD. Using the techniques described in the previous
+section, together with standard process profiling, several problems were identified.
+Later work concentrated on the operation of the kernel itself.
+In this section we discuss the problems uncovered; in the next
+section we describe the changes made to the system.
+.NH 2
+User programs
+.PP
+.NH 3
+Mail system
+.PP
+The mail system was the first culprit identified as a major
+contributor to the degradation in system performance.
+At Lucasfilm the mail system is heavily used
+on one machine, a VAX-11/780 with eight megabytes of memory.\**
+.FS
+\** During part of these observations the machine had only four
+megabytes of memory.
+.FE
+Message
+traffic is usually between users on the same machine and ranges from
+person-to-person telephone messages to per-organization distribution
+lists. After conversion to 4.2BSD, it was
+immediately noticed that mail to distribution lists of 20 or more people
+caused the system load to jump by anywhere from 3 to 6 points.
+The number of processes spawned by the \fIsendmail\fP program and
+the messages sent from \fIsendmail\fP to the system logging
+process, \fIsyslog\fP, generated significant load both from their
+execution and their interference with basic system operation. The
+number of context switches and disk transfers often doubled while
+\fIsendmail\fP operated; the system call rate jumped dramatically.
+System accounting information consistently
+showed \fIsendmail\fP as the top cpu user on the system.
+.NH 3
+Network servers
+.PP
+The network services provided in 4.2BSD add new capabilities to the system,
+but are not without cost. The system uses one daemon process to accept
+requests for each network service provided. The presence of many
+such daemons increases the numbers of active processes and files,
+and requires a larger configuration to support the same number of users.
+The overhead of the routing and status updates can consume
+several percent of the cpu.
+Remote logins and shells incur more overhead
+than their local equivalents.
+For example, a remote login uses three processes and a
+pseudo-terminal handler in addition to the local hardware terminal
+handler. When using a screen editor, sending and echoing a single
+character involves four processes on two machines.
+The additional processes, context switching, network traffic, and
+terminal handler overhead can roughly triple the load presented by one
+local terminal user.
+.NH 2
+System overhead
+.PP
+To measure the costs of various functions in the kernel,
+a profiling system was run for a 17 hour
+period on one of our general timesharing machines.
+While this is not as reproducible as a synthetic workload,
+it certainly represents a realistic test.
+This test was run on several occasions over a three month period.
+Despite the long period of time that elapsed
+between the test runs, the shapes of the profiles,
+as measured by the number of times each system call
+entry point was called, were remarkably similar.
+.PP
+These profiles turned up several bottlenecks that are
+discussed in the next section.
+Several of these were new to 4.2BSD,
+but most were caused by overloading of mechanisms
+which worked acceptably well in previous BSD systems.
+The general conclusion from our measurements was that
+the proportion of time spent in the kernel had increased from
+45% system / 55% user in 4.1BSD to 57% system / 43% user
+in 4.2BSD.
+.NH 3
+Micro-operation benchmarks
+.PP
+To compare certain basic system operations
+between 4.1BSD and 4.2BSD a suite of benchmark
+programs was constructed and run on a VAX-11/750 with 4.5 megabytes
+of physical memory and two disks on a MASSBUS controller.
+Tests were run with the machine operating in single user mode
+under both 4.1BSD and 4.2BSD. Paging was localized to the drive
+where the root file system was located.
+.PP
+The benchmark programs were modeled after the Kashtan benchmarks,
+[Kashtan80], with identical sources compiled under each system.
+The programs and their intended purpose are described briefly
+before the presentation of the results. The benchmark scripts
+were run twice with the results shown as the average of
+the two runs.
+The source code for each program and the shell scripts used during
+the benchmarks are included in the Appendix.
+.PP
+The set of tests shown in Table 1 was concerned with
+system operations other than paging. The intent of most
+benchmarks is clear. The result of running \fIsignocsw\fP is
+deducted from the \fIcsw\fP benchmark to calculate the context
+switch overhead. The \fIexec\fP tests use two different jobs to gauge
+the cost of overlaying a larger program with a smaller one
+and vice versa. The
+``null job'' and ``big job'' differ solely in the size of their data
+segments, 1 kilobyte versus 256 kilobytes. In both cases the
+text segment of the parent is larger than that of the child.\**
+.FS
+\** These tests should also have measured the cost of expanding the
+text segment; unfortunately time did not permit running additional tests.
+.FE
+All programs were compiled into the default load format that causes
+the text segment to be demand paged out of the file system and shared
+between processes.
+.KF
+.DS L
+.TS
+center box;
+l | l.
+Test Description
+_
+syscall perform 100,000 \fIgetpid\fP system calls
+csw perform 10,000 context switches using signals
+signocsw send 10,000 signals to yourself
+pipeself4 send 10,000 4-byte messages to yourself
+pipeself512 send 10,000 512-byte messages to yourself
+pipediscard4 send 10,000 4-byte messages to child who discards
+pipediscard512 send 10,000 512-byte messages to child who discards
+pipeback4 exchange 10,000 4-byte messages with child
+pipeback512 exchange 10,000 512-byte messages with child
+forks0 fork-exit-wait 1,000 times
+forks1k sbrk(1024), fault page, fork-exit-wait 1,000 times
+forks100k sbrk(102400), fault pages, fork-exit-wait 1,000 times
+vforks0 vfork-exit-wait 1,000 times
+vforks1k sbrk(1024), fault page, vfork-exit-wait 1,000 times
+vforks100k sbrk(102400), fault pages, vfork-exit-wait 1,000 times
+execs0null fork-exec ``null job''-exit-wait 1,000 times
+execs0null (1K env) execs0null above, with 1K environment added
+execs1knull sbrk(1024), fault page, fork-exec ``null job''-exit-wait 1,000 times
+execs1knull (1K env) execs1knull above, with 1K environment added
+execs100knull sbrk(102400), fault pages, fork-exec ``null job''-exit-wait 1,000 times
+vexecs0null vfork-exec ``null job''-exit-wait 1,000 times
+vexecs1knull sbrk(1024), fault page, vfork-exec ``null job''-exit-wait 1,000 times
+vexecs100knull sbrk(102400), fault pages, vfork-exec ``null job''-exit-wait 1,000 times
+execs0big fork-exec ``big job''-exit-wait 1,000 times
+execs1kbig sbrk(1024), fault page, fork-exec ``big job''-exit-wait 1,000 times
+execs100kbig sbrk(102400), fault pages, fork-exec ``big job''-exit-wait 1,000 times
+vexecs0big vfork-exec ``big job''-exit-wait 1,000 times
+vexecs1kbig sbrk(1024), fault pages, vfork-exec ``big job''-exit-wait 1,000 times
+vexecs100kbig sbrk(102400), fault pages, vfork-exec ``big job''-exit-wait 1,000 times
+.TE
+.ce
+Table 1. Kernel Benchmark programs.
+.DE
+.KE
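+.PP
+To make the flavor of these programs concrete, the fragment below
+sketches the style of the \fIsyscall\fP and \fIpipeself4\fP tests
+from Table 1.
+It is an illustrative reconstruction only;
+the sources actually used for the measurements are reproduced in the Appendix,
+and each program was timed externally rather than reporting its own results.
+.DS L
+#include <unistd.h>
+
+int
+main(void)
+{
+	int i, fd[2];
+	char buf[4] = "abc";	/* a 4-byte message, including the terminator */
+
+	/* syscall: enter the kernel via a trivial system call 100,000 times */
+	for (i = 0; i < 100000; i++)
+		(void) getpid();
+
+	/* pipeself4: send 10,000 4-byte messages to yourself through a pipe */
+	if (pipe(fd) < 0)
+		return (1);
+	for (i = 0; i < 10000; i++) {
+		(void) write(fd[1], buf, sizeof(buf));
+		(void) read(fd[0], buf, sizeof(buf));
+	}
+	return (0);
+}
+.DE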
+.PP
+The results of these tests are shown in Table 2. If the 4.1BSD results
+are scaled to reflect their being run on a VAX-11/750, they
+correspond closely to those found in [Joy80].\**
+.FS
+\** We assume that a VAX-11/750 runs at 60% of the speed of a VAX-11/780
+(not considering floating point operations).
+.FE
+.KF
+.DS L
+.TS
+center box;
+c s s s s s s s s s
+c || c s s || c s s || c s s
+c || c s s || c s s || c s s
+c || c | c | c || c | c | c || c | c | c
+l || n | n | n || n | n | n || n | n | n.
+Berkeley Software Distribution UNIX Systems
+_
+Test Elapsed Time User Time System Time
+\^ _ _ _
+\^ 4.1 4.2 4.3 4.1 4.2 4.3 4.1 4.2 4.3
+=
+syscall 28.0 29.0 23.0 4.5 5.3 3.5 23.9 23.7 20.4
+csw 45.0 60.0 45.0 3.5 4.3 3.3 19.5 25.4 19.0
+signocsw 16.5 23.0 16.0 1.9 3.0 1.1 14.6 20.1 15.2
+pipeself4 21.5 29.0 26.0 1.1 1.1 0.8 20.1 28.0 25.6
+pipeself512 47.5 59.0 55.0 1.2 1.2 1.0 46.1 58.3 54.2
+pipediscard4 32.0 42.0 36.0 3.2 3.7 3.0 15.5 18.8 15.6
+pipediscard512 61.0 76.0 69.0 3.1 2.1 2.0 29.7 36.4 33.2
+pipeback4 57.0 75.0 66.0 2.9 3.2 3.3 25.1 34.2 29.7
+pipeback512 110.0 138.0 125.0 3.1 3.4 2.2 52.2 65.7 57.7
+forks0 37.5 41.0 22.0 0.5 0.3 0.3 34.5 37.6 21.5
+forks1k 40.0 43.0 22.0 0.4 0.3 0.3 36.0 38.8 21.6
+forks100k 217.5 223.0 176.0 0.7 0.6 0.4 214.3 218.4 175.2
+vforks0 34.5 37.0 22.0 0.5 0.6 0.5 27.3 28.5 17.9
+vforks1k 35.0 37.0 22.0 0.6 0.8 0.5 27.2 28.6 17.9
+vforks100k 35.0 37.0 22.0 0.6 0.8 0.6 27.6 28.9 17.9
+execs0null 97.5 92.0 66.0 3.8 2.4 0.6 68.7 82.5 48.6
+execs0null (1K env) 197.0 229.0 75.0 4.1 2.6 0.9 167.8 212.3 62.6
+execs1knull 99.0 100.0 66.0 4.1 1.9 0.6 70.5 86.8 48.7
+execs1knull (1K env) 199.0 230.0 75.0 4.2 2.6 0.7 170.4 214.9 62.7
+execs100knull 283.5 278.0 216.0 4.8 2.8 1.1 251.9 269.3 202.0
+vexecs0null 100.0 92.0 66.0 5.1 2.7 1.1 63.7 76.8 45.1
+vexecs1knull 100.0 91.0 66.0 5.2 2.8 1.1 63.2 77.1 45.1
+vexecs100knull 100.0 92.0 66.0 5.1 3.0 1.1 64.0 77.7 45.6
+execs0big 129.0 201.0 101.0 4.0 3.0 1.0 102.6 153.5 92.7
+execs1kbig 130.0 202.0 101.0 3.7 3.0 1.0 104.7 155.5 93.0
+execs100kbig 318.0 385.0 263.0 4.8 3.1 1.1 286.6 339.1 247.9
+vexecs0big 128.0 200.0 101.0 4.6 3.5 1.6 98.5 149.6 90.4
+vexecs1kbig 125.0 200.0 101.0 4.7 3.5 1.3 98.9 149.3 88.6
+vexecs100kbig 126.0 200.0 101.0 4.2 3.4 1.3 99.5 151.0 89.0
+.TE
+.ce
+Table 2. Kernel Benchmark results (all times in seconds).
+.DE
+.KE
+.PP
+In studying the measurements we found that the basic system call
+and context switch overhead did not change significantly
+between 4.1BSD and 4.2BSD. The \fIsignocsw\fP results were caused by
+the changes to the \fIsignal\fP interface, resulting
+in an additional subroutine invocation for each call, not
+to mention additional complexity in the system's implementation.
+.PP
+The times for the use of pipes are significantly higher under
+4.2BSD because of their implementation on top of the interprocess
+communication facilities. Under 4.1BSD pipes were implemented
+without the complexity of the socket data structures and with
+simpler code. Further, while not obviously a factor here,
+4.2BSD pipes have less system buffer space provided them than
+4.1BSD pipes.
+.PP
+The \fIexec\fP tests shown in Table 2 were performed with 34 bytes of
+environment information under 4.1BSD and 40 bytes under 4.2BSD.
+To figure the cost of passing data through the environment,
+the execs0null and execs1knull tests were rerun with
+1065 additional bytes of data. The results are shown in Table 3.
+.KF
+.DS L
+.TS
+center box;
+c || c s || c s || c s
+c || c s || c s || c s
+c || c | c || c | c || c | c
+l || n | n || n | n || n | n.
+Test Real User System
+\^ _ _ _
+\^ 4.1 4.2 4.1 4.2 4.1 4.2
+=
+execs0null 197.0 229.0 4.1 2.6 167.8 212.3
+execs1knull 199.0 230.0 4.2 2.6 170.4 214.9
+.TE
+.ce
+Table 3. Benchmark results with ``large'' environment (all times in seconds).
+.DE
+.KE
+These results show that passing argument data through the environment
+is significantly slower under 4.2BSD than under 4.1BSD:
+roughly 121 microseconds per byte versus 93 microseconds per byte. Even using
+this factor to adjust the basic overhead of an \fIexec\fP system
+call, this facility is more costly under 4.2BSD than under 4.1BSD.
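+.PP
+The per-byte figures follow from Tables 2 and 3.
+Under 4.2BSD the large environment added 212.3 \- 82.5 = 129.8 seconds
+of system time over 1,000 \fIexec\fPs, about 130 milliseconds per call;
+spread over the 1065 additional bytes this is roughly 121 microseconds
+per byte. The corresponding 4.1BSD increase, 167.8 \- 68.7 = 99.1 seconds,
+works out to about 93 microseconds per byte.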
+.NH 3
+Path name translation
+.PP
+The single most expensive function performed by the kernel
+is path name translation.
+This has been true in almost every UNIX kernel [Mosher80];
+we find that our general time sharing systems do about
+500,000 name translations per day.
+.PP
+Name translations became more expensive in 4.2BSD for several reasons.
+The single most expensive addition was the symbolic link.
+Symbolic links
+have the effect of increasing the average number of components
+in path names to be translated.
+As an insidious example,
+consider the system manager who decides to make /tmp
+a symbolic link to /usr/tmp.
+A name such as /tmp/tmp1234, which previously required two component
+translations,
+now requires four component translations plus the cost of reading
+the contents of the symbolic link.
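+.PP
+The added cost is easy to observe from user level.
+The program below is a hypothetical sketch, not one of the benchmarks
+in the Appendix: it simply translates the path named on its command line
+10,000 times. Timing it with \fItime\fP\|(1) once on /usr/tmp/tmp1234 and
+once on /tmp/tmp1234 (with /tmp a symbolic link) exposes the extra
+component translations and the cost of reading the link.
+.DS L
+#include <sys/types.h>
+#include <sys/stat.h>
+
+int
+main(int argc, char *argv[])
+{
+	struct stat sb;
+	int i;
+
+	/* each stat() forces a full path name translation in the kernel */
+	for (i = 0; i < 10000; i++)
+		(void) stat(argv[1], &sb);
+	return (0);
+}
+.DE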
+.PP
+The new directory format also changes the characteristics of
+name translation.
+The more complex format requires more computation to determine
+where to place new entries in a directory.
+Conversely, the additional information allows the system to examine
+only active entries when searching,
+so directories that once grew large
+but currently have few active entries can be searched quickly.
+The new format also stores the length of each name so that
+costly string comparisons are only done on names that are the
+same length as the name being sought.
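+.PP
+The idea behind the search can be sketched as follows; the structure
+only approximates the 4.2BSD on-disk directory entry and the code is
+illustrative, not the kernel's.
+.DS L
+#include <string.h>
+
+struct dirsketch {
+	unsigned long	d_ino;		/* inode number; 0 marks a free entry */
+	unsigned short	d_reclen;	/* length of this record */
+	unsigned short	d_namlen;	/* length of the stored name */
+	char		d_name[256];	/* the name itself */
+};
+
+/* return 1 if this entry matches the component being looked up */
+int
+dirmatch(struct dirsketch *dp, const char *name, int namlen)
+{
+	if (dp->d_ino == 0)		/* skip deleted (inactive) entries */
+		return (0);
+	if (dp->d_namlen != namlen)	/* cheap length check first */
+		return (0);
+	return (strcmp(dp->d_name, name) == 0);	/* costly compare only now */
+}
+
+int
+main(void)
+{
+	struct dirsketch d;
+
+	d.d_ino = 2;
+	d.d_namlen = 3;
+	strcpy(d.d_name, "tmp");
+	return (dirmatch(&d, "tmp", 3) ? 0 : 1);
+}
+.DE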
+.PP
+The net effect of the changes is that the average time to
+translate a path name in 4.2BSD is 24.2 milliseconds,
+representing 40% of the time processing system calls,
+that is 19% of the total cycles in the kernel,
+or 11% of all cycles executed on the machine.
+The times are shown in Table 4. We have no comparable times
+for \fInamei\fP under 4.1 though they are certain to
+be significantly less.
+.KF
+.DS L
+.TS
+center box;
+l r r.
+part time % of kernel
+_
+self 14.3 ms/call 11.3%
+child 9.9 ms/call 7.9%
+_
+total 24.2 ms/call 19.2%
+.TE
+.ce
+Table 4. Call times for \fInamei\fP in 4.2BSD.
+.DE
+.KE
+.NH 3
+Clock processing
+.PP
+Nearly 25% of the time spent in the kernel is spent in the clock
+processing routines.
+(This is a clear indication that to avoid sampling bias when profiling the
+kernel with our tools
+we need to drive them from an independent clock.)
+These routines are responsible for implementing timeouts,
+scheduling the processor,
+maintaining kernel statistics,
+and tending various hardware operations such as
+draining the terminal input silos.
+Only minimal work is done in the hardware clock interrupt
+routine (at high priority); the rest is performed (at a lower priority)
+in a software interrupt handler scheduled by the hardware interrupt
+handler.
+In the worst case, with a clock rate of 100 Hz
+and with every hardware interrupt scheduling a software
+interrupt, the processor must field 200 interrupts per second.
+The overhead of simply trapping and returning
+is 3% of the machine cycles;
+figuring out that there is nothing to do
+requires an additional 2%.
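+.PP
+The division of labor can be sketched schematically; the names below are
+made up for illustration and this is user-level code, not the 4.2BSD
+clock routines.
+.DS L
+static int softclock_pending;	/* stands in for the posted software interrupt */
+static int interrupts_fielded;
+
+/* lower priority work: timeouts, statistics, terminal silo draining */
+static void
+softclock_sketch(void)
+{
+	interrupts_fielded++;
+	/* run expired timeouts, recompute priorities, drain silos ... */
+}
+
+/* high priority hardware clock tick, 100 times per second */
+static void
+hardclock_sketch(void)
+{
+	interrupts_fielded++;
+	/* minimal bookkeeping only, then defer the remaining work */
+	softclock_pending = 1;	/* in the kernel: post a software interrupt */
+}
+
+int
+main(void)
+{
+	int tick;
+
+	/* worst case at 100 Hz: every hard tick also posts a soft tick,
+	   so 200 interrupts are fielded per simulated second */
+	for (tick = 0; tick < 100; tick++) {
+		hardclock_sketch();
+		if (softclock_pending) {
+			softclock_pending = 0;
+			softclock_sketch();
+		}
+	}
+	return (interrupts_fielded == 200 ? 0 : 1);
+}
+.DE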
+.NH 3
+Terminal multiplexors
+.PP
+The terminal multiplexors supported by 4.2BSD have programmable receiver
+silos that may be used in two ways.
+With the silo disabled, each character received causes an interrupt
+to the processor.
+Enabling the receiver silo allows the silo to fill before
+generating an interrupt, allowing multiple characters to be read
+for each interrupt.
+At low rates of input, received characters will not be processed
+for some time unless the silo is emptied periodically.
+The 4.2BSD kernel uses the input silos of each terminal multiplexor,
+and empties each silo on each clock interrupt.
+This allows high input rates without the cost of per-character interrupts
+while assuring low latency.
+However, as character input rates on most machines are usually
+low (about 25 characters per second),
+this can result in excessive overhead.
+At the current clock rate of 100 Hz, a machine with 5 terminal multiplexors
+configured makes 500 calls to the receiver interrupt routines per second.
+In addition, to achieve acceptable input latency
+for flow control, each clock interrupt must schedule
+a software interrupt to run the silo draining routines.\**
+.FS
+\** It is not possible to check the input silos at
+the time of the actual clock interrupt without modifying the terminal
+line disciplines, as the input queues may not be in a consistent state.
+This implies that the worst case estimate for clock processing
+is the basic overhead for clock processing.
+.FE
+.NH 3
+Process table management
+.PP
+In 4.2BSD there are numerous places in the kernel where a linear search
+of the process table is performed:
+.IP \(bu 3
+in \fIexit\fP to locate and wakeup a process's parent;
+.IP \(bu 3
+in \fIwait\fP when searching for \fB\s-2ZOMBIE\s+2\fP and
+\fB\s-2STOPPED\s+2\fP processes;
+.IP \(bu 3
+in \fIfork\fP when allocating a new process table slot and
+counting the number of processes already created by a user;
+.IP \(bu 3
+in \fInewproc\fP, to verify
+that a process id assigned to a new process is not currently
+in use;
+.IP \(bu 3
+in \fIkill\fP and \fIgsignal\fP to locate all processes to
+which a signal should be delivered;
+.IP \(bu 3
+in \fIschedcpu\fP when adjusting the process priorities every
+second; and
+.IP \(bu 3
+in \fIsched\fP when locating a process to swap out and/or swap
+in.
+.LP
+These linear searches can incur significant overhead. The rule
+for calculating the size of the process table is:
+.ce
+nproc = 20 + 8 * maxusers
+.sp
+so a 48 user system will have a 404 slot process table.
+With the addition of network services in 4.2BSD, as many as a dozen
+server processes may be maintained simply to await incoming requests.
+These servers are normally created at boot time which causes them
+to be allocated slots near the beginning of the process table. This
+means that process table searches under 4.2BSD are likely to take
+significantly longer than under 4.1BSD. System profiling shows
+that as much as 20% of the time spent in the kernel on a loaded
+system (a VAX-11/780) can be spent in \fIschedcpu\fP; on average,
+5-10% of the kernel time goes to this routine.
+The other searches of the proc table are similarly affected.
+This shows the system can no longer tolerate using linear searches of
+the process table.
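+.PP
+A schematic sketch of such a search is shown below; the types and names
+are illustrative and do not correspond to the 4.2BSD proc structure.
+.DS L
+#define MAXUSERS	48
+#define NPROC		(20 + 8 * MAXUSERS)	/* 404 slots for 48 users */
+
+struct procsketch {
+	int	p_pid;		/* 0 means the slot is free */
+	int	p_stat;
+};
+
+struct procsketch proctab[NPROC];
+
+/* find a process by id: the worst case walks all 404 slots */
+struct procsketch *
+pfind_sketch(int pid)
+{
+	struct procsketch *p;
+
+	for (p = proctab; p < &proctab[NPROC]; p++)
+		if (p->p_pid == pid)
+			return (p);
+	return ((struct procsketch *)0);
+}
+
+int
+main(void)
+{
+	/* boot-time servers occupy the early slots, so searches for
+	   ordinary user processes scan most of the table */
+	proctab[NPROC - 1].p_pid = 1234;
+	return (pfind_sketch(1234) != (struct procsketch *)0 ? 0 : 1);
+}
+.DE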
+.NH 3
+File system buffer cache
+.PP
+The trace facilities described in section 2.3 were used
+to gather statistics on the performance of the buffer cache.
+We were interested in measuring the effectiveness of the
+cache and the read-ahead policies.
+With the file system block size in 4.2BSD four to
+eight times that of a 4.1BSD file system, we were concerned
+that large amounts of read-ahead might be performed without
+being used. Also, we were interested in seeing if the
+rules used to size the buffer cache at boot time were severely
+affecting the overall cache operation.
+.PP
+The tracing package was run over a three hour period during
+a peak mid-afternoon period on a VAX 11/780 with four megabytes
+of physical memory.
+This resulted in a buffer cache containing 400 kilobytes of memory
+spread among 50 to 200 buffers
+(the actual number of buffers depends on the size mix of
+disk blocks being read at any given time).
+The pertinent configuration information is shown in Table 5.
+.KF
+.DS L
+.TS
+center box;
+l l l l.
+Controller Drive Device File System
+_
+DEC MASSBUS DEC RP06 hp0d /usr
+ hp0b swap
+Emulex SC780 Fujitsu Eagle hp1a /usr/spool/news
+ hp1b swap
+ hp1e /usr/src
+ hp1d /u0 (users)
+ Fujitsu Eagle hp2a /tmp
+ hp2b swap
+ hp2d /u1 (users)
+ Fujitsu Eagle hp3a /
+.TE
+.ce
+Table 5. Active file systems during buffer cache tests.
+.DE
+.KE
+.PP
+During the test period the load average ranged from 2 to 13
+with an average of 5.
+The system had no idle time, 43% user time, and 57% system time.
+The system averaged 90 interrupts per second
+(excluding the system clock interrupts),
+220 system calls per second,
+and 50 context switches per second (40 voluntary, 10 involuntary).
+.PP
+The active virtual memory (the sum of the address space sizes of
+all jobs that have run in the previous twenty seconds)
+over the period ranged from 2 to 6 megabytes with an average
+of 3.5 megabytes.
+There was no swapping, though the page daemon was inspecting
+about 25 pages per second.
+.PP
+On average 250 requests to read disk blocks were initiated
+per second.
+These include read requests for file blocks made by user
+programs as well as requests initiated by the system.
+System reads include requests for indexing information to determine
+where a file's next data block resides,
+file system layout maps to allocate new data blocks,
+and requests for directory contents needed to do path name translations.
+.PP
+On average, an 85% cache hit rate was observed for read requests.
+Thus only 37 disk reads were initiated per second.
+In addition, 5 read-ahead requests were made each second
+filling about 20% of the buffer pool.
+Despite the policies to rapidly reuse read-ahead buffers
+that remain unclaimed, more than 90% of the read-ahead
+buffers were used.
+.PP
+These measurements showed that the buffer cache was working
+effectively. Independent tests have also shown that the size
+of the buffer cache may be reduced significantly on memory-poor
+systems without severe effects;
+we have not yet tested this hypothesis [Shannon83].
+.NH 3
+Network subsystem
+.PP
+The overhead associated with the
+network facilities found in 4.2BSD is often
+difficult to gauge without profiling the system.
+This is because most input processing is performed
+in modules scheduled with software interrupts.
+As a result, the system time spent performing protocol
+processing is rarely attributed to the processes that
+really receive the data. Since the protocols supported
+by 4.2BSD can involve significant overhead this was a serious
+concern. Results from a profiled kernel show an average
+of 5% of the system time is spent
+performing network input and timer processing in our environment
+(a 3Mb/s Ethernet with most traffic using TCP).
+This figure can vary significantly depending on
+the network hardware used, the average message
+size, and whether packet reassembly is required at the network
+layer. On one machine that we profiled over a 17 hour
+period (our gateway to the ARPANET),
+206,000 input messages accounted for 2.4% of the system time,
+while another 0.6% of the system time was spent performing
+protocol timer processing.
+This machine was configured with an ACC LH/DH IMP interface
+and a DMA 3Mb/s Ethernet controller.
+.PP
+The performance of TCP over slower long-haul networks
+was degraded substantially by two problems.
+The first problem was a bug that prevented round-trip timing measurements
+from being made, thus increasing retransmissions unnecessarily.
+The second was a problem with the maximum segment size chosen by TCP,
+that was well-tuned for Ethernet, but was poorly chosen for
+the ARPANET, where it causes packet fragmentation. (The maximum
+segment size was actually negotiated upwards to a value that
+resulted in excessive fragmentation.)
+.PP
+When benchmarked in Ethernet environments the main memory buffer management
+of the network subsystem presented some performance anomalies.
+The overhead of processing small ``mbufs'' severely affected throughput for a
+substantial range of message sizes.
+Although most system utilities made use of the throughput-optimal
+1024 byte size, user processes faced large degradations for some
+arbitrary sizes. This was especially true for TCP/IP transmissions [Cabrera84,
+Cabrera85].
+.NH 3
+Virtual memory subsystem
+.PP
+We ran a set of tests intended to exercise the virtual
+memory system under both 4.1BSD and 4.2BSD.
+The tests are described in Table 6.
+The test programs dynamically allocated
+a 7.3 Megabyte array (using \fIsbrk\fP\|(2)) and then referenced
+pages in the array sequentially, in a purely random
+fashion, or such that the distance between
+successive pages accessed was randomly selected from a Gaussian
+distribution. In the last case, successive runs were made with
+increasing standard deviations.
+.KF
+.DS L
+.TS
+center box;
+l | l.
+Test Description
+_
+seqpage sequentially touch pages, 10 iterations
+seqpage-v as above, but first make \fIvadvise\fP\|(2) call
+randpage touch random page 30,000 times
+randpage-v as above, but first make \fIvadvise\fP call
+gausspage.1 30,000 Gaussian accesses, standard deviation of 1
+gausspage.10 as above, standard deviation of 10
+gausspage.30 as above, standard deviation of 30
+gausspage.40 as above, standard deviation of 40
+gausspage.50 as above, standard deviation of 50
+gausspage.60 as above, standard deviation of 60
+gausspage.80 as above, standard deviation of 80
+gausspage.inf as above, standard deviation of 10,000
+.TE
+.ce
+Table 6. Paging benchmark programs.
+.DE
+.KE
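+.PP
+The shape of these programs can be sketched as follows.
+This is a hypothetical reconstruction of the \fIgausspage\fP style of test;
+the deviate generator in particular is an assumption, and the sources
+actually used appear in the Appendix.
+.DS L
+#include <stdlib.h>
+#include <unistd.h>
+
+#define ARRAYSIZE	(7300 * 1024L)	/* 7.3 megabyte array */
+#define PAGESIZE	1024L		/* 1 kilobyte pages */
+#define NPAGES		(ARRAYSIZE / PAGESIZE)
+
+int
+main(int argc, char *argv[])
+{
+	char *base;
+	double stddev;
+	long i, page;
+
+	stddev = (argc > 1) ? atof(argv[1]) : 10.0;
+	if ((base = sbrk(ARRAYSIZE)) == (char *)-1)	/* grow the data segment */
+		return (1);
+
+	/* gausspage: 30,000 references whose stride is roughly Gaussian
+	   (sum of 12 uniform deviates), clamped to the array */
+	page = NPAGES / 2;
+	for (i = 0; i < 30000; i++) {
+		double u = 0.0;
+		int j;
+
+		for (j = 0; j < 12; j++)
+			u += drand48();
+		page += (long)((u - 6.0) * stddev);
+		if (page < 0)
+			page = 0;
+		if (page >= NPAGES)
+			page = NPAGES - 1;
+		base[page * PAGESIZE]++;	/* touch (and possibly fault) the page */
+	}
+	return (0);
+}
+.DE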
+.PP
+The results in Table 7 show how the additional
+memory requirements
+of 4.2BSD can generate more work for the paging system.
+Under 4.1BSD,
+the system used 0.5 of the 4.5 megabytes of physical memory
+on the test machine;
+under 4.2BSD it used nearly 1 megabyte of physical memory.\**
+.FS
+\** The 4.1BSD system used for testing was really a 4.1a
+system configured
+with networking facilities and code to support
+remote file access. The
+4.2BSD system also included the remote file access code.
+Since both
+systems would be larger than similarly configured ``vanilla''
+4.1BSD or 4.2BSD systems, we consider our conclusions to still be valid.
+.FE
+This resulted in more page faults and, hence, more system time.
+To establish a common ground on which to compare the paging
+routines of each system, we compare instead the average page fault
+service times for those test runs that had a statistically significant
+number of random page faults. These figures, shown in Table 8, show
+no significant difference between the two systems in
+the area of page fault servicing. We currently have
+no explanation for the results of the sequential
+paging tests.
+.KF
+.DS L
+.TS
+center box;
+l || c s || c s || c s || c s
+l || c s || c s || c s || c s
+l || c | c || c | c || c | c || c | c
+l || n | n || n | n || n | n || n | n.
+Test Real User System Page Faults
+\^ _ _ _ _
+\^ 4.1 4.2 4.1 4.2 4.1 4.2 4.1 4.2
+=
+seqpage 959 1126 16.7 12.8 197.0 213.0 17132 17113
+seqpage-v 579 812 3.8 5.3 216.0 237.7 8394 8351
+randpage 571 569 6.7 7.6 64.0 77.2 8085 9776
+randpage-v 572 562 6.1 7.3 62.2 77.5 8126 9852
+gausspage.1 25 24 23.6 23.8 0.8 0.8 8 8
+gausspage.10 26 26 22.7 23.0 3.2 3.6 2 2
+gausspage.30 34 33 25.0 24.8 8.6 8.9 2 2
+gausspage.40 42 81 23.9 25.0 11.5 13.6 3 260
+gausspage.50 113 175 24.2 26.2 19.6 26.3 784 1851
+gausspage.60 191 234 27.6 26.7 27.4 36.0 2067 3177
+gausspage.80 312 329 28.0 27.9 41.5 52.0 3933 5105
+gausspage.inf 619 621 82.9 85.6 68.3 81.5 8046 9650
+.TE
+.ce
+Table 7. Paging benchmark results (all times in seconds).
+.DE
+.KE
+.KF
+.DS L
+.TS
+center box;
+c || c s || c s
+c || c s || c s
+c || c | c || c | c
+l || n | n || n | n.
+Test Page Faults PFST
+\^ _ _
+\^ 4.1 4.2 4.1 4.2
+=
+randpage 8085 9776 791 789
+randpage-v 8126 9852 765 786
+gausspage.inf 8046 9650 848 844
+.TE
+.ce
+Table 8. Page fault service times (all times in microseconds).
+.DE
+.KE