Diffstat (limited to 'share/doc/papers/nqnfs/nqnfs.me')
-rw-r--r-- | share/doc/papers/nqnfs/nqnfs.me | 2007 |
1 files changed, 2007 insertions, 0 deletions
diff --git a/share/doc/papers/nqnfs/nqnfs.me b/share/doc/papers/nqnfs/nqnfs.me new file mode 100644 index 0000000..ce9003e --- /dev/null +++ b/share/doc/papers/nqnfs/nqnfs.me @@ -0,0 +1,2007 @@ +.\" Copyright (c) 1993 The Usenix Association. All rights reserved. +.\" +.\" This document is derived from software contributed to Berkeley by +.\" Rick Macklem at The University of Guelph with the permission of +.\" the Usenix Association. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)nqnfs.me 8.1 (Berkeley) 4/20/94 +.\" +.lp +.nr PS 12 +.ps 12 +Reprinted with permission from the "Proceedings of the Winter 1994 Usenix +Conference", January 1994, San Francisco, CA, Copyright The Usenix +Association. +.nr PS 14 +.ps 14 +.sp +.ce +\fBNot Quite NFS, Soft Cache Consistency for NFS\fR +.nr PS 12 +.ps 12 +.sp +.ce +\fIRick Macklem\fR +.ce +\fIUniversity of Guelph\fR +.sp +.nr PS 12 +.ps 12 +.ce +\fBAbstract\fR +.nr PS 10 +.ps 10 +.pp +There are some constraints inherent in the NFS\(tm\(mo protocol +that result in performance limitations +for high performance +workstation environments. +This paper discusses an NFS-like protocol named Not Quite NFS (NQNFS), +designed to address some of these limitations. +This protocol provides full cache consistency during normal +operation, while permitting more effective client-side caching in an +effort to improve performance. +There are also a variety of minor protocol changes, in order to resolve +various NFS issues. +The emphasis is on observed performance of a +preliminary implementation of the protocol, in order to show +how well this design works +and to suggest possible areas for further improvement. 
+.sh 1 "Introduction"
+.pp
+It has been observed that
+overall workstation performance has not been scaling with
+processor speed and that file system I/O is a limiting factor [Ousterhout90].
+Ousterhout
+notes
+that a principal challenge for operating system developers is the
+decoupling of system calls from their underlying I/O operations, in order
+to improve average system call response times.
+For distributed file systems, every synchronous Remote Procedure Call (RPC)
+takes a minimum of a few milliseconds and, as such, is analogous to an
+underlying I/O operation.
+This suggests that client caching with a very good
+hit ratio for read-type operations, along with asynchronous writing,
+is required in order to avoid delays waiting for RPC replies.
+However, the NFS protocol requires that the server be stateless\**
+.(f
+\**To function correctly, the server must not depend on any state that may
+be lost due to a crash.
+.)f
+and does not provide any explicit mechanism for client cache
+consistency, putting
+constraints on how the client may cache data.
+This paper describes an NFS-like protocol that includes a cache consistency
+component designed to enhance client caching performance. It does provide
+full consistency under normal operation, but without requiring that hard
+state information be maintained on the server.
+Design tradeoffs were made in favour of simplicity and
+high performance over cache consistency under abnormal conditions.
+The protocol design uses a variation of Leases [Gray89]
+to provide state on the server that does not need to be recovered after a
+crash.
+.pp
+The protocol also includes changes designed to address other limitations
+of NFS in a modern workstation environment.
+The use of TCP transport is optionally available to avoid
+the pitfalls of Sun RPC over UDP transport when running across an
+internetwork [Nowicki89].
+Kerberos [Steiner88] support is available
+to do proper user authentication, in order to provide improved security and
+arbitrary client-to-server user ID mappings.
+There are also a variety of other changes to accommodate large file systems,
+such as 64bit file sizes and offsets, as well as lifting the 8Kbyte I/O size
+limit.
+The remainder of this paper gives an overview of the protocol, highlighting
+performance-related components, followed by an evaluation of the resultant
+performance for the 4.4BSD implementation.
+.sh 1 "Distributed File Systems and Caching"
+.pp
+Clients using distributed file systems cache recently-used data in order
+to reduce the number of synchronous server operations, and therefore improve
+average response times for system calls.
+Unfortunately, maintaining consistency between these caches is a problem
+whenever write sharing occurs; that is, when a process on a client writes
+to a file and one or more processes on other client(s) read the file.
+If the writer closes the file before any reader(s) open the file for reading,
+this is called sequential write sharing. Both the Andrew ITC file system
+[Howard88] and NFS [Sandberg85] maintain consistency for sequential write
+sharing by requiring the writer to push all the writes through to the
+server on close and having readers check to see if the file has been
+modified upon open. If the file has been modified, the client throws away
+all cached data for that file, as it is now stale.
+NFS implementations typically detect file modification by checking a cached
+copy of the file's modification time; since this cached value is often
+several seconds out of date and only has a resolution of one second, an NFS
+client often uses stale cached data for some time after the file has
+been updated on the server.
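+.pp
+As a concrete illustration, the following minimal C sketch shows the shape
+of this open-time check; the structure and function names are hypothetical
+and are not taken from any particular NFS implementation.
+.sp
+.nf
+#include <sys/types.h>
+
+struct nfsnode {
+    time_t cached_mtime;  /* mod time seen when data was cached */
+    /* ... cached blocks, attributes, attribute cache timestamp ... */
+};
+
+extern void cache_purge(struct nfsnode *);  /* hypothetical helper */
+
+/*
+ * On open(), compare the file's current modification time (fresh from
+ * the server, or from an attribute cache that may be several seconds
+ * stale) against the one recorded when the data was cached.  Note that
+ * the one second resolution means two updates within the same second
+ * can be missed.
+ */
+int
+nfs_open_check(struct nfsnode *np, time_t server_mtime)
+{
+    if (server_mtime != np->cached_mtime) {
+        cache_purge(np);                /* cached data is stale */
+        np->cached_mtime = server_mtime;
+        return (1);
+    }
+    return (0);                         /* cached data may still be used */
+}
+.fi
+.sp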
+.pp
+A more difficult case is concurrent write sharing, where write operations
+are intermixed with read operations.
+Consistency for this case, often referred to as "full cache consistency,"
+requires that a reader always receives the most recently written data.
+Neither NFS nor the Andrew ITC file system maintains consistency for this
+case.
+The simplest mechanism for maintaining full cache consistency is the one
+used by Sprite [Nelson88], which disables all client caching of the
+file whenever concurrent write sharing might occur.
+There are other mechanisms described in the literature [Kent87a,
+Burrows88], but they appeared to be too elaborate for incorporation
+into NQNFS (for example, Kent's requires specialized hardware).
+NQNFS differs from Sprite in the way it
+detects write sharing. The Sprite server maintains a list of files currently
+open by the various clients and detects write sharing when a file open
+request for writing is received and the file is already open for reading
+(or vice versa).
+This list of open files is hard state information that must be recovered
+after a server crash, which is a significant problem in its own
+right [Mogul93, Welch90].
+.pp
+The approach used by NQNFS is a variant of the Leases mechanism [Gray89].
+In this model, the server issues to a client a promise, referred to as a
+"lease," that the client may cache a specific object without fear of
+conflict.
+A lease has a limited duration and must be renewed by the client if it
+wishes to continue to cache the object.
+In NQNFS, clients hold short-term (up to one minute) leases on files
+for reading or writing.
+The leases are analogous to entries in the open file list, except that
+they expire after the lease term unless renewed by the client.
+As such, one minute after issuing the last lease there are no current
+leases and therefore no lease records to be recovered after a crash, hence
+the term "soft server state."
+.pp
+A related design consideration is the way client writing is done.
+Synchronous writing requires that all writes be pushed through to the server
+during the write system call.
+This is the simplest variant, from a consistency point of view, since the
+server always has the most recently written data. It also permits any write
+errors, such as "file system out of space," to be propagated back to the
+client's process via the write system call return.
+Unfortunately, this approach limits the client write rate, based on server
+write performance and client/server RPC round trip time (RTT).
+.pp
+An alternative to this is delayed writing, where the write system call
+returns as soon as the data is cached on the client and the data is written
+to the server sometime later.
+This permits client writing to occur at the rate of local storage access,
+up to the size of the local cache.
+Also, for cases where file truncation/deletion occurs shortly after writing,
+the write to the server may be avoided entirely, since the data has already
+been deleted, reducing server write load.
+There are some obvious drawbacks to this approach.
+For any Sprite-like system to maintain
+full consistency, the server must "callback" to the client to cause the
+delayed writes to be written back to the server when write sharing is about
+to occur.
+There are also problems with the propagation of errors
+back to the client process that issued the write system call.
+The reason for this is that
+the system call has already returned without reporting an error and the
+process may also have already terminated.
+As well, there is a risk of the loss of recently written data if the client
+crashes before the data is written back to the server.
+.pp
+A compromise between these two alternatives is asynchronous writing, where
+the write to the server is initiated during the write system call but the
+write system call returns before the write completes.
+This approach minimizes the risk of data loss due to a client crash, but
+negates the possibility of reducing server write load by throwing writes
+away when a file is truncated or deleted.
+.pp
+NFS implementations usually do a mix of asynchronous and delayed writing
+but push all writes to the server upon close, in order to maintain open/close
+consistency.
+Pushing the delayed writes on close
+negates much of the performance advantage of delayed writing, since the
+delays that were avoided in the write system calls are observed in the close
+system call.
+Akin to Sprite, the NQNFS protocol does delayed writing in an effort to
+achieve good client performance and uses a callback mechanism to maintain
+full cache consistency.
+.sh 1 "Related Work"
+.pp
+There has been a great deal of effort put into improving the performance and
+consistency of the NFS protocol. This work falls into two categories:
+implementation enhancements for the NFS protocol and modifications to the
+protocol itself.
+.pp
+The work on implementation enhancements has attacked two problem areas:
+NFS server write performance and RPC transport problems.
+Server write performance is a major problem for NFS, in part due to the
+requirement to push all writes to the server upon close and in part due
+to the fact that, for writes, all data and meta-data must be committed to
+non-volatile storage before the server replies to the write RPC.
+The Prestoserve\(tm\(dg
+[Moran90]
+system uses non-volatile RAM as a buffer for recently written data on the
+server, so that the write RPC replies can be returned to the client before
+the data is written to the disk surface.
+Write gathering [Juszczak94] is a software technique used on the server
+where a write RPC request is delayed for a short time in the hope that
+another contiguous write request will arrive, so that they can be merged
+into one write operation.
+Since the replies to all of the merged writes are not returned to the client
+until the write operation is completed, this delay does not violate the
+protocol.
+When write operations are merged, the number of disk writes can be reduced,
+improving server write performance.
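+.pp
+To make the write gathering idea concrete, the following minimal C sketch
+shows one possible shape for it; the structures, names and the
+send_write_reply() helper are hypothetical, and do not correspond to the
+4.4BSD nfsd sources or to the implementation in [Juszczak94].
+.sp
+.nf
+#include <sys/types.h>
+#include <sys/uio.h>
+#include <unistd.h>
+
+#define MAXGATHER 8     /* max write requests merged into one disk op */
+
+struct wreq {
+    off_t        offset;   /* starting offset of this write */
+    size_t       len;      /* length of the write data */
+    char        *data;     /* the write data itself */
+    struct wreq *next;     /* later writes gathered behind this one */
+};
+
+extern void send_write_reply(struct wreq *); /* hypothetical helper */
+
+/*
+ * A newly arrived write RPC is held briefly instead of being serviced
+ * at once.  If it continues exactly where the held chain ends, chain
+ * it on, so that a single disk write will cover the whole chain.
+ */
+int
+wgather_merge(struct wreq *held, struct wreq *req, off_t chain_end)
+{
+    if (req->offset != chain_end)
+        return (0);        /* not contiguous; service separately */
+    while (held->next != NULL)
+        held = held->next;
+    held->next = req;
+    return (1);
+}
+
+/*
+ * When the gathering delay expires, do one disk write for the whole
+ * chain and only then reply to each merged request, so the delay never
+ * violates the commit-before-reply requirement of the protocol.
+ */
+void
+wgather_service(int fd, struct wreq *head)
+{
+    struct iovec iov[MAXGATHER];
+    struct wreq *r;
+    int n = 0;
+
+    for (r = head; r != NULL && n < MAXGATHER; r = r->next) {
+        iov[n].iov_base = r->data;
+        iov[n].iov_len = r->len;
+        n++;
+    }
+    (void)pwritev(fd, iov, n, head->offset);  /* one disk operation */
+    for (r = head; r != NULL; r = r->next)
+        send_write_reply(r);   /* reply only after the data is stable */
+}
+.fi
+.sp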
+Although either of the above techniques reduces write RPC response time for
+the server, it cannot be reduced to zero, and so any client-side caching
+mechanism that reduces write RPC load or client dependence on server RPC
+response time should still improve overall performance.
+Good client-side caching should be complementary to these server techniques,
+although client performance improvements as a result of caching may be less
+dramatic when these techniques are used.
+.pp
+In NFS, each Sun RPC request is packaged in a UDP datagram for transmission
+to the server. A timer is started, and if a timeout occurs before the
+corresponding RPC reply is received, the RPC request is retransmitted.
+There are two problems with this model.
+First, when a retransmit timeout occurs, the RPC may be redone, instead of
+simply retransmitting the RPC request message to the server. A recent-request
+cache can be used on the server to minimize the negative impact of redoing
+RPCs [Juszczak89].
+The second problem is that a large UDP datagram, such as a read request or
+write reply, must be fragmented by IP, and if any one IP fragment is lost in
+transit, the entire UDP datagram is lost [Kent87]. Since entire requests and
+replies are packaged in a single UDP datagram, this puts an upper bound on
+the read/write data size (8 kbytes).
+.pp
+Adjusting the retransmit timeout interval dynamically and applying a
+congestion window on outstanding requests has been shown to be of some help
+[Nowicki89] with the retransmission problem.
+An alternative to this is to use TCP transport to deliver the RPC messages
+reliably [Macklem90], and one of the performance results in this paper
+shows the effect of doing so.
+.pp
+Srinivasan and Mogul [Srinivasan89] enhanced the NFS protocol to use the
+Sprite cache consistency algorithm in an effort to improve performance and
+to provide full client cache consistency.
+This experimental implementation demonstrated significantly better
+performance than NFS, but suffered from a lack of crash recovery support.
+The NQNFS protocol design borrowed heavily from this work, but differed
+from the Sprite algorithm by using Leases instead of file open state
+to detect write sharing.
+The decision to use Leases was made primarily to avoid the crash recovery
+problem.
+More recent work by the Sprite group [Baker91] and Mogul [Mogul93] has
+addressed the crash recovery problem, making this design tradeoff more
+questionable now.
+.pp
+Sun has recently updated the NFS protocol to Version 3 [SUN93], making some
+changes similar to those in NQNFS to address various issues. The Version 3
+protocol uses 64bit file sizes and offsets, and provides a Readdir_and_Lookup
+RPC and an Access RPC.
+It also provides cache hints, to permit a client to determine
+whether a file modification is the result of that client's write or some
+other client's write.
+It would be possible to add either Spritely NFS or NQNFS support for cache
+consistency to the NFS Version 3 protocol.
+.sh 1 "NQNFS Consistency Protocol and Recovery"
+.pp
+The NQNFS cache consistency protocol uses a somewhat Sprite-like [Nelson88]
+mechanism, but is based on Leases [Gray89] instead of hard server state
+information about open files.
+The basic principle is that the server disables client caching of files
+whenever concurrent write sharing could occur, by performing a
+server-to-client callback,
+forcing the client to flush its caches and to do all subsequent I/O on the
+file with synchronous RPCs.
+A Sprite server maintains a record of the open state of files for
+all clients and uses this to determine when concurrent write sharing might
+occur.
+This \fIopen state\fR information might also be referred to as an
+infinite-term lease for the file, with explicit lease cancellation.
+NQNFS, on the other hand, uses a short-term lease that expires due to timeout
+after a maximum of one minute, unless explicitly renewed by the client.
+The fundamental difference is that an NQNFS client must keep renewing
+a lease to use cached data, whereas a Sprite client assumes the data is valid
+until canceled by the server
+or the file is closed.
+Using leases permits the server to remain "stateless," since the soft
+state information, which consists of the set of current leases, is
+moot after one minute, when all the leases expire.
+.pp
+Whenever a client wishes to access a file's data, it must hold one of
+three types of lease: read-caching, write-caching or non-caching.
+The latter type requires that all file operations be done synchronously with
+the server via the appropriate RPCs.
+.pp
+A read-caching lease allows for client data caching, but no modifications
+may be done.
+It may, however, be shared between multiple clients. Diagram 1 shows a
+typical read-caching scenario. The vertical solid black lines depict the
+lease records.
+Note that the time lines are not drawn to scale, since a client/server
+interaction will normally take less than one hundred milliseconds, whereas
+the normal lease duration is thirty seconds.
+Every lease includes a \fImodrev\fR value, which changes upon every
+modification of the file. It may be used to check whether data cached on the
+client is still current.
+.pp
+A write-caching lease permits delayed write caching,
+but requires that all data be pushed to the server when the lease expires
+or is terminated by an eviction callback.
+When a write-caching lease has almost expired, the client will attempt to
+extend the lease if the file is still open, but is required to push the
+delayed writes to the server if renewal fails (as depicted by diagram 2).
+The writes may not arrive at the server until after the write lease has
+expired on the client, but this does not result in a consistency problem,
+so long as the write lease is still valid on the server.
+Note that, in diagram 2, the lease record on the server remains current after
+the expiry time, due to the conditions mentioned in section 5.
+If a write RPC arrives at the server after the write lease has expired there,
+this could be considered an error, since consistency could be lost, but it is
+not handled as such by NQNFS.
+.pp
+Diagram 3 depicts how read and write leases are replaced by a non-caching
+lease when there is the potential for write sharing.
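+.pp
+The per-file lease state the server maintains can be pictured with the
+following minimal C sketch; the type and field names are illustrative only,
+and are not taken from the 4.4BSD sources.
+.sp
+.nf
+#include <sys/types.h>
+
+enum lease_type {
+    LEASE_NONCACHING,  /* all I/O via synchronous RPCs */
+    LEASE_READ,        /* client caching, sharable among clients */
+    LEASE_WRITE        /* one client may delay writes */
+};
+
+struct lease {
+    enum lease_type type;
+    time_t          expiry;   /* at most one minute away */
+    u_int64_t       modrev;   /* changes on every file modification */
+    /* ... the client(s) currently holding the lease ... */
+};
+
+/*
+ * A client may use data cached under a read-caching lease only while
+ * the lease is valid and the modrev recorded with the cached data
+ * still matches the file's current modrev.
+ */
+int
+read_cache_usable(const struct lease *l, u_int64_t cached_modrev,
+    time_t now)
+{
+    return (l->type == LEASE_READ &&
+        now < l->expiry &&
+        l->modrev == cached_modrev);
+}
+.fi
+.sp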
+.(z +.sp +.PS +.ps +.ps 50 +line from 0.738,5.388 to 1.238,5.388 +.ps +.ps 10 +dashwid = 0.050i +line dashed from 1.488,10.075 to 1.488,5.450 +line dashed from 2.987,10.075 to 2.987,5.450 +line dashed from 4.487,10.075 to 4.487,5.450 +.ps +.ps 50 +line from 4.487,7.013 to 4.487,5.950 +line from 2.987,7.700 to 2.987,5.950 to 2.987,6.075 +line from 1.488,7.513 to 1.488,5.950 +line from 2.987,9.700 to 2.987,8.325 +line from 1.488,9.450 to 1.488,8.325 +.ps +.ps 10 +line from 2.987,6.450 to 4.487,6.200 +line from 4.385,6.192 to 4.487,6.200 to 4.393,6.241 +line from 4.487,6.888 to 2.987,6.575 +line from 3.080,6.620 to 2.987,6.575 to 3.090,6.571 +line from 2.987,7.263 to 4.487,7.013 +line from 4.385,7.004 to 4.487,7.013 to 4.393,7.054 +line from 4.487,7.638 to 2.987,7.388 +line from 3.082,7.429 to 2.987,7.388 to 3.090,7.379 +line from 2.987,6.888 to 1.488,6.575 +line from 1.580,6.620 to 1.488,6.575 to 1.590,6.571 +line from 1.488,7.200 to 2.987,6.950 +line from 2.885,6.942 to 2.987,6.950 to 2.893,6.991 +line from 2.987,7.700 to 1.488,7.513 +line from 1.584,7.550 to 1.488,7.513 to 1.590,7.500 +line from 1.488,8.012 to 2.987,7.763 +line from 2.885,7.754 to 2.987,7.763 to 2.893,7.804 +line from 2.987,9.012 to 1.488,8.825 +line from 1.584,8.862 to 1.488,8.825 to 1.590,8.813 +line from 1.488,9.325 to 2.987,9.137 +line from 2.885,9.125 to 2.987,9.137 to 2.891,9.175 +line from 2.987,9.637 to 1.488,9.450 +line from 1.584,9.487 to 1.488,9.450 to 1.590,9.438 +line from 1.488,9.887 to 2.987,9.700 +line from 2.885,9.688 to 2.987,9.700 to 2.891,9.737 +.ps +.ps 12 +.ft +.ft R +"Lease valid on machine" at 1.363,5.296 ljust +"with same modrev" at 1.675,7.421 ljust +"miss)" at 2.612,9.233 ljust +"(cache" at 2.300,9.358 ljust +.ps +.ps 14 +"Diagram #1: Read Caching Leases" at 0.738,5.114 ljust +"Client B" at 4.112,10.176 ljust +"Server" at 2.612,10.176 ljust +"Client A" at 0.925,10.176 ljust +.ps +.ps 12 +"from cache" at 4.675,6.546 ljust +"Read syscalls" at 4.675,6.796 ljust +"Reply" at 3.737,6.108 ljust +"(cache miss)" at 3.675,6.421 ljust +"Read req" at 3.737,6.608 ljust +"to lease" at 3.112,6.796 ljust +"Client B added" at 3.112,6.983 ljust +"Reply" at 3.237,7.296 ljust +"Read + lease req" at 3.175,7.671 ljust +"Read syscall" at 4.675,7.608 ljust +"Reply" at 1.675,6.796 ljust +"miss)" at 2.487,7.108 ljust +"Read req (cache" at 1.675,7.233 ljust +"from cache" at 0.425,6.296 ljust +"Read syscalls" at 0.425,6.546 ljust +"cache" at 0.425,6.858 ljust +"so can still" at 0.425,7.108 ljust +"Modrev same" at 0.425,7.358 ljust +"Reply" at 1.675,7.671 ljust +"Get lease req" at 1.675,8.108 ljust +"Read syscall" at 0.425,7.983 ljust +"Lease times out" at 0.425,8.296 ljust +"from cache" at 0.425,9.046 ljust +"Read syscalls" at 0.425,9.296 ljust +"for Client A" at 3.112,9.296 ljust +"Read caching lease" at 3.112,9.483 ljust +"Reply" at 1.675,8.983 ljust +"Read req" at 1.675,9.358 ljust +"Reply" at 1.675,9.608 ljust +"Read + lease req" at 1.675,9.921 ljust +"Read syscall" at 0.425,9.921 ljust +.ps +.ft +.PE +.sp +.)z +.(z +.sp +.PS +.ps +.ps 50 +line from 1.175,5.700 to 1.300,5.700 +line from 0.738,5.700 to 1.175,5.700 +line from 2.987,6.638 to 2.987,6.075 +.ps +.ps 10 +dashwid = 0.050i +line dashed from 2.987,6.575 to 2.987,5.950 +line dashed from 1.488,6.575 to 1.488,5.888 +.ps +.ps 50 +line from 2.987,9.762 to 2.987,6.638 +line from 1.488,9.450 to 1.488,7.700 +.ps +.ps 10 +line from 2.987,6.763 to 1.488,6.575 +line from 1.584,6.612 to 1.488,6.575 to 1.590,6.563 +line from 1.488,7.013 to 2.987,6.825 +line from 2.885,6.813 
to 2.987,6.825 to 2.891,6.862 +line from 2.987,7.325 to 1.488,7.075 +line from 1.582,7.116 to 1.488,7.075 to 1.590,7.067 +line from 1.488,7.700 to 2.987,7.388 +line from 2.885,7.383 to 2.987,7.388 to 2.895,7.432 +line from 2.987,8.575 to 1.488,8.325 +line from 1.582,8.366 to 1.488,8.325 to 1.590,8.317 +line from 1.488,8.887 to 2.987,8.637 +line from 2.885,8.629 to 2.987,8.637 to 2.893,8.679 +line from 2.987,9.637 to 1.488,9.450 +line from 1.584,9.487 to 1.488,9.450 to 1.590,9.438 +line from 1.488,9.887 to 2.987,9.762 +line from 2.886,9.746 to 2.987,9.762 to 2.890,9.796 +line dashed from 2.987,10.012 to 2.987,6.513 +line dashed from 1.488,10.012 to 1.488,6.513 +.ps +.ps 12 +.ft +.ft R +"write" at 4.237,5.921 ljust +"Lease valid on machine" at 1.425,5.733 ljust +.ps +.ps 14 +"Diagram #2: Write Caching Lease" at 0.738,5.551 ljust +"Server" at 2.675,10.114 ljust +"Client A" at 1.113,10.114 ljust +.ps +.ps 12 +"seconds after last" at 3.112,5.921 ljust +"Expires write_slack" at 3.112,6.108 ljust +"due to write activity" at 3.112,6.608 ljust +"Expiry delayed" at 3.112,6.796 ljust +"Lease times out" at 3.112,7.233 ljust +"Lease renewed" at 3.175,8.546 ljust +"Lease for client A" at 3.175,9.358 ljust +"Write caching" at 3.175,9.608 ljust +"Reply" at 1.675,6.733 ljust +"Write req" at 1.988,7.046 ljust +"Reply" at 1.675,7.233 ljust +"Write req" at 1.675,7.796 ljust +"Lease expires" at 0.487,7.733 ljust +"Close syscall" at 0.487,8.108 ljust +"lease granted" at 1.675,8.546 ljust +"Get write lease" at 1.675,8.921 ljust +"before expiry" at 0.487,8.608 ljust +"Lease renewal" at 0.487,8.796 ljust +"syscalls" at 0.487,9.046 ljust +"Delayed write" at 0.487,9.233 ljust +"lease granted" at 1.675,9.608 ljust +"Get write lease req" at 1.675,9.921 ljust +"Write syscall" at 0.487,9.858 ljust +.ps +.ft +.PE +.sp +.)z +.(z +.sp +.PS +.ps +.ps 50 +line from 0.613,2.638 to 1.238,2.638 +line from 1.488,4.075 to 1.488,3.638 +line from 2.987,4.013 to 2.987,3.575 +line from 4.487,4.013 to 4.487,3.575 +.ps +.ps 10 +line from 2.987,3.888 to 4.487,3.700 +line from 4.385,3.688 to 4.487,3.700 to 4.391,3.737 +line from 4.487,4.138 to 2.987,3.950 +line from 3.084,3.987 to 2.987,3.950 to 3.090,3.938 +line from 2.987,4.763 to 4.487,4.450 +line from 4.385,4.446 to 4.487,4.450 to 4.395,4.495 +.ps +.ps 50 +line from 4.487,4.438 to 4.487,4.013 +.ps +.ps 10 +line from 4.487,5.138 to 2.987,4.888 +line from 3.082,4.929 to 2.987,4.888 to 3.090,4.879 +.ps +.ps 50 +line from 4.487,6.513 to 4.487,5.513 +line from 4.487,6.513 to 4.487,6.513 to 4.487,5.513 +line from 2.987,5.450 to 2.987,5.200 +line from 1.488,5.075 to 1.488,4.075 +line from 2.987,5.263 to 2.987,4.013 +line from 2.987,7.700 to 2.987,5.325 +line from 4.487,7.575 to 4.487,6.513 +line from 1.488,8.512 to 1.488,8.075 +line from 2.987,8.637 to 2.987,8.075 +line from 2.987,9.637 to 2.987,8.825 +line from 1.488,9.450 to 1.488,8.950 +.ps +.ps 10 +line from 2.987,4.450 to 1.488,4.263 +line from 1.584,4.300 to 1.488,4.263 to 1.590,4.250 +line from 1.488,4.888 to 2.987,4.575 +line from 2.885,4.571 to 2.987,4.575 to 2.895,4.620 +line from 2.987,5.263 to 1.488,5.075 +line from 1.584,5.112 to 1.488,5.075 to 1.590,5.063 +line from 4.487,5.513 to 2.987,5.325 +line from 3.084,5.362 to 2.987,5.325 to 3.090,5.313 +line from 2.987,5.700 to 4.487,5.575 +line from 4.386,5.558 to 4.487,5.575 to 4.390,5.608 +line from 4.487,6.013 to 2.987,5.825 +line from 3.084,5.862 to 2.987,5.825 to 3.090,5.813 +line from 2.987,6.200 to 4.487,6.075 +line from 4.386,6.058 to 4.487,6.075 to 4.390,6.108 +line from 
4.487,6.450 to 2.987,6.263 +line from 3.084,6.300 to 2.987,6.263 to 3.090,6.250 +line from 2.987,6.700 to 4.487,6.513 +line from 4.385,6.500 to 4.487,6.513 to 4.391,6.550 +line from 1.488,6.950 to 2.987,6.763 +line from 2.885,6.750 to 2.987,6.763 to 2.891,6.800 +line from 2.987,7.700 to 4.487,7.575 +line from 4.386,7.558 to 4.487,7.575 to 4.390,7.608 +line from 4.487,7.950 to 2.987,7.763 +line from 3.084,7.800 to 2.987,7.763 to 3.090,7.750 +line from 2.987,8.637 to 1.488,8.512 +line from 1.585,8.546 to 1.488,8.512 to 1.589,8.496 +line from 1.488,8.887 to 2.987,8.700 +line from 2.885,8.688 to 2.987,8.700 to 2.891,8.737 +line from 2.987,9.637 to 1.488,9.450 +line from 1.584,9.487 to 1.488,9.450 to 1.590,9.438 +line from 1.488,9.950 to 2.987,9.762 +line from 2.885,9.750 to 2.987,9.762 to 2.891,9.800 +dashwid = 0.050i +line dashed from 4.487,10.137 to 4.487,2.825 +line dashed from 2.987,10.137 to 2.987,2.825 +line dashed from 1.488,10.137 to 1.488,2.825 +.ps +.ps 12 +.ft +.ft R +"(not cached)" at 4.612,3.858 ljust +.ps +.ps 14 +"Diagram #3: Write sharing case" at 0.613,2.239 ljust +.ps +.ps 12 +"Write syscall" at 4.675,7.546 ljust +"Read syscall" at 0.550,9.921 ljust +.ps +.ps 14 +"Lease valid on machine" at 1.363,2.551 ljust +.ps +.ps 12 +"(can still cache)" at 1.675,8.171 ljust +"Reply" at 3.800,3.858 ljust +"Write" at 3.175,4.046 ljust +"writes" at 4.612,4.046 ljust +"synchronous" at 4.612,4.233 ljust +"write syscall" at 4.675,5.108 ljust +"non-caching lease" at 3.175,4.296 ljust +"Reply " at 3.175,4.483 ljust +"req" at 3.175,4.983 ljust +"Get write lease" at 3.175,5.108 ljust +"Vacated msg" at 3.175,5.483 ljust +"to the server" at 4.675,5.858 ljust +"being flushed to" at 4.675,6.046 ljust +"Delayed writes" at 4.675,6.233 ljust +.ps +.ps 16 +"Server" at 2.675,10.182 ljust +"Client B" at 3.925,10.182 ljust +"Client A" at 0.863,10.182 ljust +.ps +.ps 12 +"(not cached)" at 0.550,4.733 ljust +"Read data" at 0.550,4.921 ljust +"Reply data" at 1.675,4.421 ljust +"Read request" at 1.675,4.921 ljust +"lease" at 1.675,5.233 ljust +"Reply non-caching" at 1.675,5.421 ljust +"Reply" at 3.737,5.733 ljust +"Write" at 3.175,5.983 ljust +"Reply" at 3.737,6.171 ljust +"Write" at 3.175,6.421 ljust +"Eviction Notice" at 3.175,6.796 ljust +"Get read lease" at 1.675,7.046 ljust +"Read syscall" at 0.550,6.983 ljust +"being cached" at 4.675,7.171 ljust +"Delayed writes" at 4.675,7.358 ljust +"lease" at 3.175,7.233 ljust +"Reply write caching" at 3.175,7.421 ljust +"Get write lease" at 3.175,7.983 ljust +"Write syscall" at 4.675,7.983 ljust +"with same modrev" at 1.675,8.358 ljust +"Lease" at 0.550,8.171 ljust +"Renewed" at 0.550,8.358 ljust +"Reply" at 1.675,8.608 ljust +"Get Lease Request" at 1.675,8.983 ljust +"Read syscall" at 0.550,8.733 ljust +"from cache" at 0.550,9.108 ljust +"Read syscall" at 0.550,9.296 ljust +"Reply " at 1.675,9.671 ljust +"plus lease" at 2.050,9.983 ljust +"Read Request" at 1.675,10.108 ljust +.ps +.ft +.PE +.sp +.)z +A write-caching lease is not used in the Stanford V Distributed System [Gray89], +since synchronous writing is always used. A side effect of this change +is that the five to ten second lease duration recommended by Gray was found +to be insufficient to achieve good performance for the write-caching lease. +Experimentation showed that thirty seconds was about optimal for cases where +the client and server are connected to the same local area network, so +thirty seconds is the default lease duration for NQNFS. 
+A maximum of twice that value is permitted, since Gray showed that for some
+network topologies, a larger lease duration functions better.
+Although there is an explicit get_lease RPC defined for the protocol,
+most lease requests are piggybacked onto the other RPCs to minimize the
+additional overhead introduced by leasing.
+.sh 2 "Rationale"
+.pp
+Leasing was chosen over hard server state information for the following
+reasons:
+.ip 1.
+The server needs to maintain state information only about the current
+client leases.
+Since at most one lease is allocated for each RPC and the leases expire
+after their lease term,
+the upper bound on the number of current leases is the product of the
+lease term and the server RPC rate.
+In practice, it has been observed that less than 10% of RPCs request new
+leases and, since most leases have a term of thirty seconds, the following
+rule of thumb should estimate the number of server lease records:
+.sp
+.nf
+  Number of Server Lease Records \(eq 0.1 * 30 * RPC rate
+.fi
+.sp
+Since each lease record occupies 64 bytes of server memory, storing the lease
+records should not be a serious problem.
+For example, a server handling 100 RPCs per second would hold roughly
+0.1 * 30 * 100 = 300 lease records, occupying about 19Kbytes of memory.
+If a server has exhausted lease storage, it can simply wait a few seconds
+for a lease to expire and free up a record.
+On the other hand, a Sprite-like server must store records for all files
+currently open by all clients, which can require significant storage for
+a large, heavily loaded server.
+In [Mogul93], it is proposed that a mechanism vaguely similar to paging could
+be used to deal with this for Spritely NFS, but this
+appears to introduce a fair amount of complexity and may limit the
+usefulness of open records for storing other state information, such
+as file locks.
+.ip 2.
+After a server crashes, it would normally need to recover the lease records
+for the outstanding leases; however, if it simply waits until all leases have
+expired, there is no state to recover.
+The server must wait for the maximum lease duration of one minute, and it
+must serve all outstanding write requests resulting from terminated
+write-caching leases before issuing new leases. The one minute delay can be
+overlapped with file system consistency checking (e.g. fsck).
+Because no state must be recovered, a lease-based server, like an NFS server,
+avoids the problem of state recovery after a crash.
+.sp
+There can, however, be problems during crash recovery
+because of a potentially large number of write backs due to terminated
+write-caching leases.
+One of these problems is a "recovery storm" [Baker91], which could occur when
+the server is overloaded by the number of write RPC requests.
+The NQNFS protocol deals with this by replying
+with a return status code called
+try_again_later to all
+RPC requests (except write) until the write requests subside.
+At this time, there has not been sufficient testing of server crash
+recovery while under heavy server load to determine whether the
+try_again_later reply is a sufficient solution to the problem.
+The other problem is that consistency will be lost if other RPCs are
+performed before all of the write backs for terminated write-caching leases
+have completed.
+This is handled by performing only write RPCs until
+no write RPC requests arrive
+for write_slack seconds, where write_slack is set to several times
+the client timeout retransmit interval,
+at which time it is assumed all clients have had an opportunity to send their
+writes to the server.
+.ip 3.
+Another advantage of leasing is that, since leases are required at times when
+other I/O operations occur,
+lease requests can almost always be piggybacked on other RPCs, avoiding some
+of the overhead associated with the explicit open and close RPCs required by
+a Sprite-like system.
+Compared with Sprite cache consistency,
+this can result in a significantly lower RPC load (see table #1).
+.sh 1 "Limitations of the NQNFS Protocol"
+.pp
+There is a serious risk when leasing is used for delayed write
+caching.
+If the server is simply too busy to service a lease renewal before a
+write-caching lease terminates, the client will not be able to push the
+write data to the server before the lease has terminated, resulting in
+inconsistency.
+Note that the danger of inconsistency occurs when the server assumes that
+a write-caching lease has terminated before the client has
+had the opportunity to write the data back to the server.
+In an effort to avoid this problem, the NQNFS server does not assume that
+a write-caching lease has terminated until three conditions are met:
+.sp
+.(l
+1 - clock time > (expiry time + clock skew)
+2 - there is at least one server daemon (nfsd) waiting for an RPC request
+3 - no write RPCs received for leased file within write_slack after the
+    corrected expiry time
+.)l
+.lp
+The first condition ensures that the lease has expired on the client.
+The clock_skew, by default three seconds, must be
+set to a value larger than the maximum time-of-day clock error that is likely
+to occur during the maximum lease duration.
+The second condition attempts to ensure that the client
+is not waiting for replies to any writes that are still queued for service by
+an nfsd. The third condition tries to guarantee that the client has
+transmitted all write requests to the server, since write_slack is set to
+several times the client's timeout retransmit interval.
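+.pp
+Put as code, the server's termination test looks roughly like the following
+minimal C sketch; the names are illustrative and are not those of the 4.4BSD
+implementation.
+.sp
+.nf
+#include <sys/types.h>
+
+#define CLOCK_SKEW 3    /* seconds; bounds client clock error */
+
+extern int nfsd_idle_count;  /* nfsds waiting for an RPC request */
+
+struct wlease {
+    time_t expiry;      /* lease term granted to the client */
+    time_t last_write;  /* when the last write RPC arrived */
+};
+
+int
+write_lease_terminated(const struct wlease *l, time_t now,
+    time_t write_slack)
+{
+    time_t corrected = l->expiry + CLOCK_SKEW;
+    time_t quiet_since;
+
+    if (now <= corrected)
+        return (0);     /* 1: may still be valid on the client */
+    if (nfsd_idle_count < 1)
+        return (0);     /* 2: writes may still be queued for service */
+    quiet_since = (l->last_write > corrected) ?
+        l->last_write : corrected;
+    if (now < quiet_since + write_slack)
+        return (0);     /* 3: recent write activity; keep waiting */
+    return (1);
+}
+.fi
+.sp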
+.pp
+There are also certain file system semantics that are problematic for both
+NFS and NQNFS, due to the
+lack of state information maintained by the
+server. If a file is unlinked on one client while open on another, it will
+be removed from the file server, resulting in failed file accesses on the
+client that has the file open.
+If the file system on the server is out of space or the client user's disk
+quota has been exceeded, a delayed write can fail long after the write system
+call was successfully completed.
+With NFS, this error will be detected by the close system call, since
+the delayed writes are pushed upon close. With NQNFS, however, the delayed
+write RPC may not occur until after the close system call, possibly even
+after the process has exited.
+Therefore,
+if a process must check for write errors,
+a system call such as \fIfsync\fR must be used.
+.pp
+Another problem occurs when a process on one client is
+running an executable file
+and a process on another client starts to write to the file. The read lease
+on the first client is terminated by the server, but the client has no
+recourse but to terminate the process, since the process is already in
+progress on the old executable.
+.pp
+The NQNFS protocol does not support file locking, since a file lock would
+have to involve hard state information that is recovered after a crash.
+.sh 1 "Other NQNFS Protocol Features"
+.pp
+NQNFS also includes a variety of minor modifications to the NFS protocol, in
+an attempt to address various limitations.
+The protocol uses 64bit file sizes and offsets in order to handle large
+files.
+TCP transport may be used as an alternative to UDP
+for cases where UDP does not perform well.
+Transport mechanisms
+such as TCP also permit the use of much larger read/write data sizes,
+which might improve performance in certain environments.
+.pp
+The NQNFS protocol replaces the Readdir RPC with a Readdir_and_Lookup
+RPC that returns the file handle and attributes for each file in the
+directory, as well as the name and file id number.
+This additional information may then be loaded into the lookup and
+file-attribute caches on the client.
+Thus, for cases such as "ls -l", the \fIstat\fR system calls can be performed
+locally without doing any lookup or getattr RPCs.
+Another addition is the Access RPC, which checks file
+accessibility against the server. This is necessary since, in some cases, the
+client user ID is mapped to a different user on the server, so doing the
+access check locally on the client using file attributes and client
+credentials is not correct.
+One case where this becomes necessary is when the NQNFS mount point is using
+Kerberos authentication, where the Kerberos authentication ticket is
+translated to credentials on the server that are mapped to the client-side
+user ID.
+For further details on the protocol, see [Macklem93].
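+.pp
+The flavour of a Readdir_and_Lookup reply entry, and of how a client might
+use it to prime its caches, is sketched below in C; the types and the two
+cache-priming helpers are hypothetical, not the actual protocol encoding.
+.sp
+.nf
+#include <limits.h>
+#include <sys/types.h>
+
+struct filehandle { u_char data[32]; };  /* opaque server handle */
+struct fattr { u_int64_t size; time_t mtime; /* ... */ };
+
+struct rdirlookup_entry {
+    u_int64_t         fileid;   /* as a Readdir would return */
+    char              name[NAME_MAX + 1];
+    struct filehandle fh;       /* as a Lookup would return */
+    struct fattr      attrs;    /* as a Getattr would return */
+};
+
+/* hypothetical client cache interfaces */
+extern void namecache_enter(const char *, const struct filehandle *);
+extern void attrcache_enter(const struct filehandle *,
+    const struct fattr *);
+
+/*
+ * After one Readdir_and_Lookup RPC, every entry primes both caches,
+ * so a subsequent "ls -l" needs no per-file Lookup or Getattr RPCs.
+ */
+void
+prime_caches(const struct rdirlookup_entry *e, int n)
+{
+    int i;
+
+    for (i = 0; i < n; i++) {
+        namecache_enter(e[i].name, &e[i].fh);
+        attrcache_enter(&e[i].fh, &e[i].attrs);
+    }
+}
+.fi
+.sp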
+.sh 1 "Performance"
+.pp
+In order to evaluate the effectiveness of the NQNFS protocol,
+a benchmark was used that was
+designed to typify
+real work on a client workstation.
+Benchmarks, such as Laddis [Wittle93], that perform server load
+characterization are not appropriate for this work, since it is primarily
+client caching efficiency that needs to be evaluated.
+Since these tests measure overall client system performance and
+not just the performance of the file system,
+each sequence of runs was performed on identical hardware and operating
+systems, in order to factor out the system
+components affecting performance other than the file system protocol.
+.pp
+The equipment used for all the benchmarks consists of members of the
+DECstation\(tm\(dg
+family of workstations using the MIPS\(tm\(sc RISC architecture.
+The operating system running on these systems was a pre-release version of
+4.4BSD Unix\(tm\(dd.
+For all benchmarks, the file server was a DECstation 2100 (10 MIPS) with
+8Mbytes of memory and a local RZ23 SCSI disk (27msec average access time).
+The clients ranged in speed from DECstation 2100s
+to a DECstation 5000/25, and always ran with six block I/O daemons
+and a 4Mbyte buffer cache, except for the test runs where the
+buffer cache size was the independent variable.
+In all cases, /tmp was mounted on the local SCSI disk\**, all machines were
+attached to the same uncongested Ethernet, and all ran in single user mode
+during the benchmarks.
+.(f
+\**Testing using the 4.4BSD MFS [McKusick90] resulted in slightly degraded
+performance, probably since the machines only had 16Mbytes of memory, and so
+paging increased.
+.)f
+Unless noted otherwise, test runs used UDP RPC transport
+and the results given are the average values of four runs.
+.pp
+The benchmark used is the Modified Andrew Benchmark (MAB)
+[Ousterhout90],
+which is a slightly modified version of the benchmark used to characterize
+performance of the Andrew ITC file system [Howard88].
+The MAB was set up with the executable binaries in the remote-mounted file
+system and the final load step was commented out, due to a linkage problem
+during testing under 4.4BSD.
+Therefore, these results are not directly comparable to other reported MAB
+results.
+The MAB is made up of five distinct phases:
+.sp
+.ip "1." 10
+Make five directories (no significant cost)
+.ip "2." 10
+Copy a file system subtree to a working directory
+.ip "3." 10
+Get file attributes (stat) of all the working files
+.ip "4." 10
+Search for strings (grep) in the files
+.ip "5." 10
+Compile a library of C sources and archive them
+.lp
+Of the five phases, the fifth is by far the largest and is the one affected
+most by client caching mechanisms.
+The results for phase #1 are invariant over all
+the caching mechanisms.
+.sh 2 "Buffer Cache Size Tests"
+.pp
+The first experiment was done to see what effect changing the size of the
+buffer cache would have on client performance. A single DECstation 5000/25
+was used to do a series of runs of MAB with different buffer cache sizes
+for four variations of the file system protocol. The four variations are
+as follows:
+.ip "Case 1:" 10
+NFS - The NFS protocol as implemented in 4.4BSD
+.ip "Case 2:" 10
+Leases - The NQNFS protocol using leases for cache consistency
+.ip "Case 3:" 10
+Leases, Rdirlookup - The NQNFS protocol using leases for cache consistency
+and with the readdir RPC replaced by Readdir_and_Lookup
+.ip "Case 4:" 10
+Leases, Attrib leases, Rdirlookup - The NQNFS protocol using leases for
+cache consistency, with the readdir
+RPC replaced by Readdir_and_Lookup,
+and requiring a valid lease not only for file-data access, but also for
+file-attribute access.
+.lp
+As can be seen in figure 1, the buffer cache achieves about optimal
+performance for the range of two to ten megabytes in size. At eleven
+megabytes in size, the system pages heavily and the runs did not
+complete in a reasonable time. Even at 64Kbytes, the buffer cache improves
+performance over no buffer cache by a significant margin of 136-148 seconds
+versus 239 seconds.
+This may be due, in part, to the fact that the Compile Phase of the MAB
+uses a rather small working set of file data.
+All variants of NQNFS achieve about
+the same performance, running around 30% faster than NFS, with a slightly
+larger difference for large buffer cache sizes.
+Based on these results, all remaining tests were run with the buffer cache
+size set to 4Mbytes.
+Although I do not know what causes the local peak in the curves between 0.5
+and 2 megabytes, there is some indication that contention for buffer cache
+blocks between the update process (which pushes delayed writes to the server
+every thirty seconds) and the I/O system calls may be involved.
+.(z +.PS +.ps +.ps 10 +dashwid = 0.050i +line dashed from 0.900,7.888 to 4.787,7.888 +line dashed from 0.900,7.888 to 0.900,10.262 +line from 0.900,7.888 to 0.963,7.888 +line from 4.787,7.888 to 4.725,7.888 +line from 0.900,8.188 to 0.963,8.188 +line from 4.787,8.188 to 4.725,8.188 +line from 0.900,8.488 to 0.963,8.488 +line from 4.787,8.488 to 4.725,8.488 +line from 0.900,8.775 to 0.963,8.775 +line from 4.787,8.775 to 4.725,8.775 +line from 0.900,9.075 to 0.963,9.075 +line from 4.787,9.075 to 4.725,9.075 +line from 0.900,9.375 to 0.963,9.375 +line from 4.787,9.375 to 4.725,9.375 +line from 0.900,9.675 to 0.963,9.675 +line from 4.787,9.675 to 4.725,9.675 +line from 0.900,9.963 to 0.963,9.963 +line from 4.787,9.963 to 4.725,9.963 +line from 0.900,10.262 to 0.963,10.262 +line from 4.787,10.262 to 4.725,10.262 +line from 0.900,7.888 to 0.900,7.950 +line from 0.900,10.262 to 0.900,10.200 +line from 1.613,7.888 to 1.613,7.950 +line from 1.613,10.262 to 1.613,10.200 +line from 2.312,7.888 to 2.312,7.950 +line from 2.312,10.262 to 2.312,10.200 +line from 3.025,7.888 to 3.025,7.950 +line from 3.025,10.262 to 3.025,10.200 +line from 3.725,7.888 to 3.725,7.950 +line from 3.725,10.262 to 3.725,10.200 +line from 4.438,7.888 to 4.438,7.950 +line from 4.438,10.262 to 4.438,10.200 +line from 0.900,7.888 to 4.787,7.888 +line from 4.787,7.888 to 4.787,10.262 +line from 4.787,10.262 to 0.900,10.262 +line from 0.900,10.262 to 0.900,7.888 +line from 3.800,8.775 to 4.025,8.775 +line from 0.925,10.088 to 0.925,10.088 +line from 0.925,10.088 to 0.938,9.812 +line from 0.938,9.812 to 0.988,9.825 +line from 0.988,9.825 to 1.075,9.838 +line from 1.075,9.838 to 1.163,9.938 +line from 1.163,9.938 to 1.250,9.838 +line from 1.250,9.838 to 1.613,9.825 +line from 1.613,9.825 to 2.312,9.750 +line from 2.312,9.750 to 3.025,9.713 +line from 3.025,9.713 to 3.725,9.850 +line from 3.725,9.850 to 4.438,9.875 +dashwid = 0.037i +line dotted from 3.800,8.625 to 4.025,8.625 +line dotted from 0.925,9.912 to 0.925,9.912 +line dotted from 0.925,9.912 to 0.938,9.887 +line dotted from 0.938,9.887 to 0.988,9.713 +line dotted from 0.988,9.713 to 1.075,9.562 +line dotted from 1.075,9.562 to 1.163,9.562 +line dotted from 1.163,9.562 to 1.250,9.562 +line dotted from 1.250,9.562 to 1.613,9.675 +line dotted from 1.613,9.675 to 2.312,9.363 +line dotted from 2.312,9.363 to 3.025,9.375 +line dotted from 3.025,9.375 to 3.725,9.387 +line dotted from 3.725,9.387 to 4.438,9.450 +line dashed from 3.800,8.475 to 4.025,8.475 +line dashed from 0.925,10.000 to 0.925,10.000 +line dashed from 0.925,10.000 to 0.938,9.787 +line dashed from 0.938,9.787 to 0.988,9.650 +line dashed from 0.988,9.650 to 1.075,9.537 +line dashed from 1.075,9.537 to 1.163,9.613 +line dashed from 1.163,9.613 to 1.250,9.800 +line dashed from 1.250,9.800 to 1.613,9.488 +line dashed from 1.613,9.488 to 2.312,9.375 +line dashed from 2.312,9.375 to 3.025,9.363 +line dashed from 3.025,9.363 to 3.725,9.325 +line dashed from 3.725,9.325 to 4.438,9.438 +dashwid = 0.075i +line dotted from 3.800,8.325 to 4.025,8.325 +line dotted from 0.925,9.963 to 0.925,9.963 +line dotted from 0.925,9.963 to 0.938,9.750 +line dotted from 0.938,9.750 to 0.988,9.662 +line dotted from 0.988,9.662 to 1.075,9.613 +line dotted from 1.075,9.613 to 1.163,9.613 +line dotted from 1.163,9.613 to 1.250,9.700 +line dotted from 1.250,9.700 to 1.613,9.438 +line dotted from 1.613,9.438 to 2.312,9.463 +line dotted from 2.312,9.463 to 3.025,9.312 +line dotted from 3.025,9.312 to 3.725,9.387 +line dotted from 3.725,9.387 to 
4.438,9.425 +.ps +.ps -1 +.ft +.ft I +"0" at 0.825,7.810 rjust +"20" at 0.825,8.110 rjust +"40" at 0.825,8.410 rjust +"60" at 0.825,8.697 rjust +"80" at 0.825,8.997 rjust +"100" at 0.825,9.297 rjust +"120" at 0.825,9.597 rjust +"140" at 0.825,9.885 rjust +"160" at 0.825,10.185 rjust +"0" at 0.900,7.660 +"2" at 1.613,7.660 +"4" at 2.312,7.660 +"6" at 3.025,7.660 +"8" at 3.725,7.660 +"10" at 4.438,7.660 +"Time (sec)" at 0.150,8.997 +"Buffer Cache Size (MBytes)" at 2.837,7.510 +"Figure #1: MAB Phase 5 (compile)" at 2.837,10.335 +"NFS" at 3.725,8.697 rjust +"Leases" at 3.725,8.547 rjust +"Leases, Rdirlookup" at 3.725,8.397 rjust +"Leases, Attrib leases, Rdirlookup" at 3.725,8.247 rjust +.ps +.ft +.PE +.)z +.sh 2 "Multiple Client Load Tests" +.pp +During preliminary runs of the MAB, it was observed that the server RPC +counts were reduced significantly by NQNFS as compared to NFS (table 1). +(Spritely NFS and Ultrix\(tm4.3/NFS numbers were taken from [Mogul93] +and are not directly comparable, due to numerous differences in the +experimental setup including deletion of the load step from phase 5.) +This suggests +that the NQNFS protocol might scale better with +respect to the number of clients accessing the server. +The experiment described in this section +ran the MAB on from one to ten clients concurrently, to observe the +effects of heavier server load. +The clients were started at roughly the same time by pressing all the +<return> keys together and, although not synchronized beyond that point, +all clients would finish the test run within about two seconds of each +other. +This was not a realistic load of N active clients, but it did +result in a reproducible increasing client load on the server. +The results for the four variants +are plotted in figures 2-5. +.(z +.ps -1 +.R +.TS +box, center; +c s s s s s s s +c c c c c c c c +l | n n n n n n n. +Table #1: MAB RPC Counts +RPC Getattr Read Write Lookup Other GetLease/Open-Close Total +_ +BSD/NQNFS 277 139 306 575 294 127 1718 +BSD/NFS 1210 506 451 489 238 0 2894 +Spritely NFS 259 836 192 535 306 1467 3595 +Ultrix4.3/NFS 1225 1186 476 810 305 0 4002 +.TE +.ps +.)z +.pp +For the MAB benchmark, the NQNFS protocol reduces the RPC counts significantly, +but with a minimum of extra overhead (the GetLease/Open-Close count). 
+.(z +.PS +.ps +.ps 10 +dashwid = 0.050i +line dashed from 0.900,7.888 to 4.787,7.888 +line dashed from 0.900,7.888 to 0.900,10.262 +line from 0.900,7.888 to 0.963,7.888 +line from 4.787,7.888 to 4.725,7.888 +line from 0.900,8.225 to 0.963,8.225 +line from 4.787,8.225 to 4.725,8.225 +line from 0.900,8.562 to 0.963,8.562 +line from 4.787,8.562 to 4.725,8.562 +line from 0.900,8.900 to 0.963,8.900 +line from 4.787,8.900 to 4.725,8.900 +line from 0.900,9.250 to 0.963,9.250 +line from 4.787,9.250 to 4.725,9.250 +line from 0.900,9.588 to 0.963,9.588 +line from 4.787,9.588 to 4.725,9.588 +line from 0.900,9.925 to 0.963,9.925 +line from 4.787,9.925 to 4.725,9.925 +line from 0.900,10.262 to 0.963,10.262 +line from 4.787,10.262 to 4.725,10.262 +line from 0.900,7.888 to 0.900,7.950 +line from 0.900,10.262 to 0.900,10.200 +line from 1.613,7.888 to 1.613,7.950 +line from 1.613,10.262 to 1.613,10.200 +line from 2.312,7.888 to 2.312,7.950 +line from 2.312,10.262 to 2.312,10.200 +line from 3.025,7.888 to 3.025,7.950 +line from 3.025,10.262 to 3.025,10.200 +line from 3.725,7.888 to 3.725,7.950 +line from 3.725,10.262 to 3.725,10.200 +line from 4.438,7.888 to 4.438,7.950 +line from 4.438,10.262 to 4.438,10.200 +line from 0.900,7.888 to 4.787,7.888 +line from 4.787,7.888 to 4.787,10.262 +line from 4.787,10.262 to 0.900,10.262 +line from 0.900,10.262 to 0.900,7.888 +line from 3.800,8.900 to 4.025,8.900 +line from 1.250,8.325 to 1.250,8.325 +line from 1.250,8.325 to 1.613,8.500 +line from 1.613,8.500 to 2.312,8.825 +line from 2.312,8.825 to 3.025,9.175 +line from 3.025,9.175 to 3.725,9.613 +line from 3.725,9.613 to 4.438,10.012 +dashwid = 0.037i +line dotted from 3.800,8.750 to 4.025,8.750 +line dotted from 1.250,8.275 to 1.250,8.275 +line dotted from 1.250,8.275 to 1.613,8.412 +line dotted from 1.613,8.412 to 2.312,8.562 +line dotted from 2.312,8.562 to 3.025,9.088 +line dotted from 3.025,9.088 to 3.725,9.375 +line dotted from 3.725,9.375 to 4.438,10.000 +line dashed from 3.800,8.600 to 4.025,8.600 +line dashed from 1.250,8.250 to 1.250,8.250 +line dashed from 1.250,8.250 to 1.613,8.438 +line dashed from 1.613,8.438 to 2.312,8.637 +line dashed from 2.312,8.637 to 3.025,9.088 +line dashed from 3.025,9.088 to 3.725,9.525 +line dashed from 3.725,9.525 to 4.438,10.075 +dashwid = 0.075i +line dotted from 3.800,8.450 to 4.025,8.450 +line dotted from 1.250,8.262 to 1.250,8.262 +line dotted from 1.250,8.262 to 1.613,8.425 +line dotted from 1.613,8.425 to 2.312,8.613 +line dotted from 2.312,8.613 to 3.025,9.137 +line dotted from 3.025,9.137 to 3.725,9.512 +line dotted from 3.725,9.512 to 4.438,9.988 +.ps +.ps -1 +.ft +.ft I +"0" at 0.825,7.810 rjust +"20" at 0.825,8.147 rjust +"40" at 0.825,8.485 rjust +"60" at 0.825,8.822 rjust +"80" at 0.825,9.172 rjust +"100" at 0.825,9.510 rjust +"120" at 0.825,9.847 rjust +"140" at 0.825,10.185 rjust +"0" at 0.900,7.660 +"2" at 1.613,7.660 +"4" at 2.312,7.660 +"6" at 3.025,7.660 +"8" at 3.725,7.660 +"10" at 4.438,7.660 +"Time (sec)" at 0.150,8.997 +"Number of Clients" at 2.837,7.510 +"Figure #2: MAB Phase 2 (copying)" at 2.837,10.335 +"NFS" at 3.725,8.822 rjust +"Leases" at 3.725,8.672 rjust +"Leases, Rdirlookup" at 3.725,8.522 rjust +"Leases, Attrib leases, Rdirlookup" at 3.725,8.372 rjust +.ps +.ft +.PE +.)z +.(z +.PS +.ps +.ps 10 +dashwid = 0.050i +line dashed from 0.900,7.888 to 4.787,7.888 +line dashed from 0.900,7.888 to 0.900,10.262 +line from 0.900,7.888 to 0.963,7.888 +line from 4.787,7.888 to 4.725,7.888 +line from 0.900,8.188 to 0.963,8.188 +line from 4.787,8.188 to 
4.725,8.188
.PE
.\" (plot commands omitted; curves summarized in the caption)
.ce 2
\fIFigure #3: MAB Phase 3 (stat/find), Time (sec) vs. Number of Clients\fR
\fI(NFS; Leases; Leases, Rdirlookup; Leases, Attrib leases, Rdirlookup)\fR
.)z
.(z
.ce 2
\fIFigure #4: MAB Phase 4 (grep/wc/find), Time (sec) vs. Number of Clients\fR
\fI(NFS; Leases; Leases, Rdirlookup; Leases, Attrib leases, Rdirlookup)\fR
.)z
.(z
.ce 2
\fIFigure #5: MAB Phase 5 (compile), Time (sec) vs. Number of Clients\fR
\fI(NFS; Leases; Leases, Rdirlookup; Leases, Attrib leases, Rdirlookup)\fR
.)z
.pp
In figure 2, where a subtree of seventy small files is copied,
the difference between the protocol variants is minimal,
with the NQNFS variants performing slightly better.
For this case, the Readdir_and_Lookup RPC is a slight hindrance under heavy
load, possibly because it results in larger directory blocks in the buffer
cache.
.pp
In figure 3, for the phase that gets file attributes for a large number
of files, the leasing variants take about 50% longer, indicating that
there are performance problems in this area.
For the case where valid
current leases are required for every file when attributes are returned,
performance is significantly worse than when the attributes are allowed
to be stale by a few seconds on the client.
I have not been able to explain the oscillation in the curves for the
Lease cases.
.pp
For the string searching phase depicted in figure 4, the leasing variants
that do not require valid leases for files when attributes are returned
appear to scale better with server load than NFS.
However, the effect appears to be
negligible until the server load is fairly heavy.
.pp
Most of the time in the MAB benchmark is spent in the compilation phase,
and this is where the differences between caching methods are most
pronounced.
In figure 5 it can be seen that any protocol variant using Leases performs
about a factor of two better than NFS
at a load of ten clients. This indicates that the use of NQNFS may
allow servers to handle significantly more clients for this type of
workload.
.pp
Table 2 summarizes the MAB run times for all phases for the single client
DECstation 5000/25. The \fILeases\fR case refers to using leases, whereas
the \fILeases, Rdirl\fR case uses the Readdir_and_Lookup RPC as well and
the \fIBCache Only\fR case uses leases, but only the buffer cache and not
the attribute or name caches.
The \fINo Caching\fR case does not do any client side caching, performing
all system calls via synchronous RPCs to the server.
The % improvement column is computed relative to the NFS elapsed total;
for example, the Leases case gives (197 \- 165) / 197, or 16%.
.(z
.ps -1
.R
.TS
box, center;
c s s s s s s
c c c c c c c c
l | n n n n n n n.
Table #2: Single DECstation 5000/25 Client Elapsed Times (sec)
Phase	1	2	3	4	5	Total	% Improvement
_
No Caching	6	35	41	40	258	380	-93
NFS	5	24	15	20	133	197	0
BCache Only	5	20	24	23	116	188	5
Leases, Rdirl	5	20	21	20	105	171	13
Leases	5	19	21	21	99	165	16
.TE
.ps
.)z
.sh 2 "Processor Speed Tests"
.pp
An important goal of client-side file system caching is to decouple the
I/O system calls from the underlying distributed file system, so that the
client's system performance might scale with processor speed. In order
to test this, a series of MAB runs were performed on three
DECstations that are similar except for processor speed.
In addition to the four protocol variants used for the above tests, runs
were done with the client caches turned off, to get
worst case performance numbers for caching mechanisms with a 100% miss
rate. The CPU utilization
was measured as an indicator of how much the processor was blocking on
I/O system calls. Note that since the systems were running in single user
mode and were otherwise quiescent, almost all CPU activity was directly
related to the MAB run.
The results are presented in
table 3.
The CPU time is simply the product of the CPU utilization and the
elapsed running time and, as such, is an optimistic bound on the
performance achievable with an ideal client caching scheme that never
blocks for I/O.
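.pp
As a concrete illustration of that product, the following C fragment
(a sketch written for this discussion, not part of the benchmark
software) reproduces the CPU time entry for the NFS row of the
DECstation 5000/25 column in table 3 below:
.(l
/*
 * CPU time bound = elapsed time * CPU utilization.
 * Values are from the NFS / DECstation 5000/25 entry in table 3.
 */
#include <stdio.h>

int
main(void)
{
	double elapsed = 133.0;	/* elapsed run time (sec) */
	double util = 0.71;	/* measured CPU utilization (71%) */

	/* prints "CPU time: 94 sec", matching the table entry */
	printf("CPU time: %.0f sec\en", elapsed * util);
	return (0);
}
.)l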
.(z
.ps -1
.R
.TS
box, center;
c s s s s s s s s s
c c s s c s s c s s
c c c c c c c c c c
c c c c c c c c c c
l | n n n n n n n n n.
Table #3: MAB Phase 5 (compile)
	DS2100 (10.5 MIPS)	DS3100 (14.0 MIPS)	DS5000/25 (26.7 MIPS)
	Elapsed	CPU	CPU	Elapsed	CPU	CPU	Elapsed	CPU	CPU
	time	Util(%)	time	time	Util(%)	time	time	Util(%)	time
_
Leases	143	89	127	113	87	98	99	89	88
Leases, Rdirl	150	89	134	110	91	100	105	88	92
BCache Only	169	85	144	129	78	101	116	75	87
NFS	172	77	132	135	74	100	133	71	94
No Caching	330	47	155	256	41	105	258	39	101
.TE
.ps
.)z
As can be seen in the table, any caching mechanism achieves significantly
better performance than when caching is disabled, roughly doubling the CPU
utilization with a corresponding reduction in run time. For NFS, the CPU
utilization drops as CPU speed increases, which suggests that it is not
scaling with CPU speed. For the NQNFS variants, the CPU utilization
remains just below 90%, which suggests that the caching mechanism is
working well and scaling within this CPU range.
Note that for this benchmark, the ratios of CPU times for
the DECstation 3100 and DECstation 5000/25 are quite different from what
the Dhrystone MIPS ratings would suggest.
.pp
Overall, the results seem encouraging, although it remains to be seen
whether or not the caching provided by NQNFS can continue to scale with
CPU performance.
There is a good indication that NQNFS permits a server to scale
to more clients than does NFS, at least for workloads akin to the MAB
compile phase.
A more difficult question is what happens if the server becomes much
faster at doing write RPCs, as a result of some technology such as
Prestoserve or write gathering.
Since a significant part of the difference between NFS and NQNFS is
the synchronous writing, it is difficult to predict how much a server
capable of fast write RPCs will negate the performance improvements of
NQNFS.
At the very least, table 1 indicates that the write RPC load on the
server has decreased by approximately 30%, and this reduced write load
should still result in some improvement.
.pp
Indications are that the Readdir_and_Lookup RPC has not improved
performance for these tests and may in fact be degrading performance
slightly.
The results in figure 3 indicate some problems, possibly with handling
of the attribute cache. It seems logical that the Readdir_and_Lookup RPC
should permit priming of the attribute cache, improving its hit rate,
but the results run counter to that.
.sh 2 "Internetwork Delay Tests"
.pp
The next experiment explored how the different protocol
variants might perform over internetworks with larger RPC RTTs. The
server was moved to a separate Ethernet, using a MicroVAXII\(tm as an
IP router to the other Ethernet. The 4.3Reno BSD Unix system running on
the MicroVAXII was modified to delay forwarded IP packets by a tunable N
millisecond delay. The implementation was rather crude and did not try
to simulate a distribution of delay times, nor was it programmed to drop
packets at a given rate, but it served as a simple emulation of a long,
fat network\** [Jacobson88].
.(f
\**Long fat networks refer to network interconnections with
a Bandwidth X RTT product > 10\u5\d bits.
.)f
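.pp
The idea behind the delaying router can be pictured with the following C
sketch (hypothetical code written for this paper, not the actual 4.3Reno
modification): instead of forwarding a packet immediately, the router
stamps it with a due time N milliseconds in the future, queues it, and
sends it when a periodic timeout finds it due.
.(l
#include <stdio.h>

#define DELAY_MS 40		/* tunable forwarding delay */
#define QLEN 16

struct pkt {
	int id;			/* stands in for an mbuf chain */
	long due_ms;		/* time at which to forward */
};

static struct pkt q[QLEN];
static int qhead, qtail;

/* Instead of forwarding at once, stamp the packet and enqueue it. */
static void
delay_enqueue(int id, long now_ms)
{
	q[qtail].id = id;
	q[qtail].due_ms = now_ms + DELAY_MS;
	qtail = (qtail + 1) % QLEN;
}

/* Run from a periodic timeout; forwards every packet now due. */
static void
delay_drain(long now_ms)
{
	while (qhead != qtail && q[qhead].due_ms <= now_ms) {
		printf("forward packet %d at %ld ms\en",
		    q[qhead].id, now_ms);
		qhead = (qhead + 1) % QLEN;
	}
}

int
main(void)
{
	long t;

	delay_enqueue(1, 0);
	delay_enqueue(2, 10);
	for (t = 0; t <= 60; t += 10)	/* simulated clock */
		delay_drain(t);
	return (0);
}
.)l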
The MAB was run using both UDP and TCP RPC transports
for a variety of RTT delays from five to two hundred milliseconds,
to observe the effects of RTT delay on RPC transport.
It was found that, due to the high variability between runs, four runs
did not suffice, so eight runs were done at each value.
The results in figure 6 and table 4 are the averages for the eight runs.
.(z
.ce 2
\fIFigure #6: MAB Phase 5 (compile), Time (sec) vs. Round Trip Delay (msec)\fR
\fI(Leases,UDP; Leases,TCP; NFS,UDP; NFS,TCP)\fR
.)z
.(z
.ps -1
.R
.TS
box, center;
c s s s s s s s s
c c s c s c s c s
c c c c c c c c c
c c c c c c c c c
l | n n n n n n n n.
Table #4: MAB Phase 5 (compile) for Internetwork Delays
	NFS,UDP	NFS,TCP	Leases,UDP	Leases,TCP
Delay	Elapsed	Standard	Elapsed	Standard	Elapsed	Standard	Elapsed	Standard
(msec)	time (sec)	Deviation	time (sec)	Deviation	time (sec)	Deviation	time (sec)	Deviation
_
5	139	2.9	139	2.4	112	7.0	108	6.0
40	175	5.1	208	44.5	150	23.8	139	4.3
80	207	3.9	213	4.7	180	7.7	210	52.9
120	276	29.3	273	17.1	221	7.7	238	5.8
160	304	7.2	328	77.1	275	21.5	274	10.1
200	372	35.0	506	235.1	338	25.2	379	69.2
.TE
.ps
.)z
.pp
I found these results somewhat surprising, since I had assumed that
stability across an internetwork connection would be a function of the
RPC transport protocol.
Looking at the standard deviations observed between the eight runs,
there is an indication that the NQNFS protocol plays a larger role in
maintaining stability than the underlying RPC transport protocol does.
It appears that NFS over TCP transport
is the least stable variant tested.
It should be noted that the TCP implementation used was roughly that of
the 4.3BSD Tahoe release and that the 4.4BSD TCP implementation was far
less stable and would fail intermittently, due to a bug I was not able
to isolate.
It would appear that some of the recent enhancements to the 4.4BSD TCP
implementation have a detrimental effect on the performance of
RPC-type traffic loads, which intermix small and large
data transfers in both directions.
Clearly, more exploration of this area is needed before any conclusions
can be drawn beyond the fact that, over a local area network, TCP
transport provides performance comparable to UDP.
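.pp
For reference, the table entries were derived as in the following C
sketch (illustrative code with made-up run times, not the measurement
scripts actually used; whether the original used the n or n \- 1
divisor is not recorded, so the sketch assumes the sample form with
n \- 1): each cell is the mean of eight runs plus the standard
deviation of those runs.
.(l
#include <math.h>
#include <stdio.h>

#define NRUNS 8

int
main(void)
{
	/* hypothetical elapsed times (sec) for eight MAB runs */
	double x[NRUNS] = { 139, 143, 135, 141, 138, 140, 137, 139 };
	double mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < NRUNS; i++)
		mean += x[i];
	mean /= NRUNS;
	for (i = 0; i < NRUNS; i++)
		var += (x[i] - mean) * (x[i] - mean);
	var /= NRUNS - 1;	/* sample variance */
	printf("mean %.0f sec, std dev %.1f\en", mean, sqrt(var));
	return (0);
}
.)l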
.sh 1 "Lessons Learned"
.pp
Evaluating the performance of a distributed file system is fraught with
difficulties, due to the many software and hardware factors involved.
The limited benchmarking presented here took a considerable amount of
time, and the results gained by the exercise only give indications of
what the performance might be for a few scenarios.
.pp
The IP router with delay introduction proved to be a valuable tool for
protocol debugging\**,
.(f
\**It exposed two bugs in the 4.4BSD networking code, one a problem in
the Lance chip driver for the DECstation and the other a TCP window
sizing problem that I was not able to isolate.
.)f
and may be useful for a more extensive study of performance over
internetworks if enhanced to do a better job of simulating internetwork
delay and packet loss.
.pp
The Leases mechanism provided a simple model for the provision of cache
consistency and did seem to improve performance for various scenarios.
Unfortunately, it does not provide the server state information that is
required for file system semantics, such as locking, that many software
systems demand.
In production environments on my campus, the need for file locking and
the correct generation of the ETXTBSY error code
are far more important than full cache consistency, and leasing
does not satisfy these needs.
Another file system semantic that requires hard server state is the
delay of file removal until the last close system call. Although
Spritely NFS did not support this semantic either, it is logical that
the open file state maintained by that system would facilitate the
implementation of this semantic more easily than would the Leases
mechanism.
.sh 1 "Further Work"
.pp
The current implementation uses a fixed, moderate sized buffer cache
designed for the local UFS [McKusick84] file system.
The results in figure 1 suggest that this is adequate so long as the
cache is of an appropriate size.
However, a mechanism permitting the cache to vary in size
has been shown to outperform fixed sized buffer caches [Nelson90], and
could be beneficial. It could also be useful to allow the buffer cache
to grow very large by making use of local backing store for cases where
server performance is limited.
A very large buffer cache would in turn permit experimentation with
much larger read/write data sizes, facilitating bulk data transfers
across long fat networks, such as will characterize the Internet of the
near future.
A careful redesign of the buffer cache mechanism to provide
support for these features would probably be the next implementation
step.
.pp
The results in figure 3 indicate that the mechanics of caching file
attributes and maintaining the attribute cache's consistency need to
be looked at further.
There also needs to be more work done on the interaction between the
Readdir_and_Lookup RPC and the name and attribute caches, in an effort
to reduce Getattr and Lookup RPC loads, as sketched below.
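.pp
The kind of priming the Readdir_and_Lookup RPC is intended to allow can
be sketched as follows. This is hypothetical C code written for this
paper; none of the structure or function names are taken from the
actual implementation. The point is only that a single reply can
pre-load the name and attribute caches for every entry read, so that
later stat calls on those files need no Getattr or Lookup RPCs.
.(l
#include <stdio.h>

struct attrs {
	long size;		/* file size */
	long mtime;		/* modification time */
};

struct rdl_entry {		/* one Readdir_and_Lookup entry */
	const char *name;
	int fileid;		/* file handle stand-in */
	struct attrs attr;
	long lease_expiry;	/* attribute lease expiry time */
};

/* Stub caches; the real client keeps name and attribute caches. */
static void
name_cache_enter(const char *name, int fileid)
{
	printf("name cache: %s -> %d\en", name, fileid);
}

static void
attr_cache_enter(int fileid, const struct attrs *a, long expiry)
{
	printf("attr cache: %d size %ld, valid to %ld\en",
	    fileid, a->size, expiry);
}

/* Prime both caches from a single reply's worth of entries. */
static void
prime_caches(const struct rdl_entry *e, int n, long now)
{
	int i;

	for (i = 0; i < n; i++) {
		name_cache_enter(e[i].name, e[i].fileid);
		if (e[i].lease_expiry > now)
			attr_cache_enter(e[i].fileid, &e[i].attr,
			    e[i].lease_expiry);
	}
}

int
main(void)
{
	struct rdl_entry reply[2] = {
		{ "a.c", 101, { 1024, 1000 }, 60 },
		{ "b.c", 102, { 2048, 1010 }, 60 },
	};

	prime_caches(reply, 2, 30);
	return (0);
}
.)l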
.pp
The NQNFS protocol has never been used in a production environment, and
doing so would provide needed insight into how well the protocol
satisfies the needs of real workstation environments.
It is hoped that the distribution of the implementation in 4.4BSD will
facilitate use of the protocol in production environments elsewhere.
.pp
The big question that needs to be resolved is whether Leases are an
adequate mechanism for cache consistency or whether hard server state
is required.
Given the work presented here and in the papers related to Sprite and
Spritely NFS, there are clear indications that a cache consistency
algorithm can improve both performance and file system semantics.
As yet, however, it is unclear what the best approach to maintaining
consistency is.
It would appear that hard state information is required for file locking
and other mechanisms and, if so, it seems appropriate to use it for
cache consistency as well.
.sh 1 "Acknowledgements"
.pp
I would like to thank the members of the CSRG at the University of
California, Berkeley for their continued support over the years. Without
their encouragement and assistance this software would never have been
implemented.
Prof. Jim Linders and Prof. Tom Wilson here at the University of Guelph
helped proofread this paper, and Jeffrey Mogul provided a great deal of
assistance, helping to turn my gibberish into something at least
moderately readable.
.sh 1 "References"
.ip [Baker91] 15
Mary Baker and John Ousterhout, Availability in the Sprite Distributed
File System, In \fIOperating System Review\fR, (25)2, pg. 95-98,
April 1991.
.ip [Baker91a] 15
Mary Baker, private communication, May 1991.
.ip [Burrows88] 15
Michael Burrows, Efficient Data Sharing, Technical Report #153,
Computer Laboratory, University of Cambridge, Dec. 1988.
.ip [Gray89] 15
Cary G. Gray and David R. Cheriton, Leases: An Efficient Fault-Tolerant
Mechanism for Distributed File Cache Consistency, In \fIProc. of the
Twelfth ACM Symposium on Operating Systems Principles\fR, Litchfield
Park, AZ, Dec. 1989.
.ip [Howard88] 15
John H. Howard, Michael L. Kazar, Sherri G. Menees, David A. Nichols,
M. Satyanarayanan, Robert N. Sidebotham and Michael J. West,
Scale and Performance in a Distributed File System, \fIACM Trans. on
Computer Systems\fR, (6)1, pg. 51-81, Feb. 1988.
.ip [Jacobson88] 15
Van Jacobson and R. Braden, \fITCP Extensions for Long-Delay Paths\fR,
ARPANET Working Group Requests for Comment, DDN Network Information
Center, SRI International, Menlo Park, CA, October 1988, RFC-1072.
.ip [Jacobson89] 15
Van Jacobson, Sun NFS Performance Problems, \fIPrivate Communication,\fR
November, 1989.
.ip [Juszczak89] 15
Chet Juszczak, Improving the Performance and Correctness of an NFS
Server, In \fIProc. Winter 1989 USENIX Conference,\fR pg. 53-63, San
Diego, CA, January 1989.
.ip [Juszczak94] 15
Chet Juszczak, Improving the Write Performance of an NFS Server,
to appear in \fIProc. Winter 1994 USENIX Conference,\fR San Francisco,
CA, January 1994.
.ip [Kazar88] 15
Michael L. Kazar, Synchronization and Caching Issues in the Andrew File
System, In \fIProc. Winter 1988 USENIX Conference,\fR pg. 27-36, Dallas,
TX, February 1988.
.ip [Kent87] 15
Christopher A. Kent and Jeffrey C. Mogul, \fIFragmentation Considered
Harmful\fR, Research Report 87/3, Digital Equipment Corporation Western
Research Laboratory, Dec. 1987.
.ip [Kent87a] 15
Christopher A. Kent, \fICache Coherence in Distributed Systems\fR,
Research Report 87/4, Digital Equipment Corporation Western Research
Laboratory, April 1987.
.ip [Macklem90] 15
Rick Macklem, Lessons Learned Tuning the 4.3BSD Reno Implementation of
the NFS Protocol,
In \fIProc. Winter 1991 USENIX Conference,\fR pg. 53-64, Dallas, TX,
January 1991.
.ip [Macklem93] 15
Rick Macklem, The 4.4BSD NFS Implementation,
In \fIThe System Manager's Manual\fR, 4.4 Berkeley Software Distribution,
University of California, Berkeley, June 1993.
.ip [McKusick84] 15
Marshall K. McKusick, William N. Joy, Samuel J. Leffler and Robert S.
Fabry, A Fast File System for UNIX, \fIACM Transactions on Computer
Systems\fR, (2)3, pg. 181-197, August 1984.
.ip [McKusick90] 15
Marshall K. McKusick, Michael J. Karels and Keith Bostic, A Pageable
Memory Based Filesystem,
In \fIProc. Summer 1990 USENIX Conference,\fR pg. 137-143, Anaheim, CA,
June 1990.
.ip [Mogul93] 15
Jeffrey C. Mogul, Recovery in Spritely NFS,
Research Report 93/2, Digital Equipment Corporation Western Research
Laboratory, June 1993.
.ip [Moran90] 15
Joseph Moran, Russel Sandberg, Don Coleman, Jonathan Kepecs and Bob
Lyon, Breaking Through the NFS Performance Barrier,
In \fIProc. Spring 1990 EUUG Conference,\fR pg. 199-206, Munich, FRG,
April 1990.
.ip [Nelson88] 15
Michael N. Nelson, Brent B. Welch, and John K. Ousterhout, Caching in
the Sprite Network File System, \fIACM Transactions on Computer
Systems\fR, (6)1, pg. 134-154, February 1988.
.ip [Nelson90] 15
Michael N. Nelson, \fIVirtual Memory vs. The File System\fR, Research
Report 90/4, Digital Equipment Corporation Western Research Laboratory,
March 1990.
.ip [Nowicki89] 15
Bill Nowicki, Transport Issues in the Network File System, In \fIComputer
Communication Review\fR, pg. 16-20, March 1989.
.ip [Ousterhout90] 15
John K. Ousterhout, Why Aren't Operating Systems Getting Faster As Fast
as Hardware? In \fIProc. Summer 1990 USENIX Conference\fR, pg. 247-256,
Anaheim, CA, June 1990.
.ip [Sandberg85] 15
Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon,
Design and Implementation of the Sun Network Filesystem, In \fIProc.
Summer 1985 USENIX Conference\fR, pg. 119-130, Portland, OR, June 1985.
.ip [Srinivasan89] 15
V. Srinivasan and Jeffrey C. Mogul, Spritely NFS: Experiments with
Cache-Consistency Protocols,
In \fIProc. of the Twelfth ACM Symposium on Operating Systems
Principles\fR, Litchfield Park, AZ, Dec. 1989.
.ip [Steiner88] 15
J. G. Steiner, B. C. Neuman and J. I. Schiller, Kerberos: An
Authentication Service for Open Network Systems,
In \fIProc. Winter 1988 USENIX Conference,\fR pg. 191-202, Dallas, TX,
February 1988.
.ip [SUN89] 15
Sun Microsystems Inc., \fINFS: Network File System Protocol
Specification\fR, ARPANET Working Group Requests for Comment, DDN
Network Information Center, SRI International, Menlo Park, CA, March
1989, RFC-1094.
.ip [SUN93] 15
Sun Microsystems Inc., \fINFS: Network File System Version 3 Protocol
Specification\fR, Sun Microsystems Inc., Mountain View, CA, June 1993.
.ip [Wittle93] 15
Mark Wittle and Bruce E. Keith, LADDIS: The Next Generation in NFS File
Server Benchmarking,
In \fIProc. Summer 1993 USENIX Conference,\fR pg. 111-128, Cincinnati,
OH, June 1993.
.(f
\(mo
NFS is believed to be a trademark of Sun Microsystems, Inc.
.)f
.(f
\(dg
Prestoserve is a trademark of Legato Systems, Inc.
.)f
.(f
\(sc
MIPS is a trademark of Silicon Graphics, Inc.
.)f
.(f
\(dg
DECstation, MicroVAXII and Ultrix are trademarks of Digital Equipment
Corp.
.)f