summaryrefslogtreecommitdiffstats
path: root/sys/nfs/nfs_vnops.c
Commit message (Collapse)AuthorAgeFilesLines
* Fix at least one source of the continued 'NFS append race'. close()dillon2000-01-051-3/+21
| | | | | | | | | | | | | | | was calling nfs_flush() and then clearing the NMODIFIED bit. This is not legal since there might still be dirty buffers after the nfs_flush (for example, pending commits). The clearing of this bit in turn prevented a necessary vinvalbuf() from occuring leaving left over dirty buffers even after truncating the file in a new operation. The fix is to simply not clear NMODIFIED. Also added a sysctl vfs.nfs.nfsv3_commit_on_close which, if set to 1, will cause close() to do a stage 1 write AND a stage 2 commit synchronously. By default only the stage 1 write is done synchronously. Reviewed by: Alfred Perlstein <bright@wintelcom.net>
* Introduce NDFREE (and remove VOP_ABORTOP)eivind1999-12-151-22/+0
|
* Synopsis of problem being fixed: Dan Nelson originally reported thatdillon1999-12-121-26/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blocks of zeros could wind up in a file written to over NFS by a client. The problem only occurs a few times per several gigabytes of data. This problem turned out to be bug #3 below. bug #1: B_CLUSTEROK must be cleared when an NFS buffer is reverted from stage 2 (ready for commit rpc) to stage 1 (ready for write). Reversions can occur when a dirty NFS buffer is redirtied with new data. Otherwise the VFS/BIO system may end up thinking that a stage 1 NFS buffer is clusterable. Stage 1 NFS buffers are not clusterable. bug #2: B_CLUSTEROK was inappropriately set for a 'short' NFS buffer (short buffers only occur near the EOF of the file). Change to only set when the buffer is a full biosize (usually 8K). This bug has no effect but should be fixed in -current anyway. It need not be backported. bug #3: B_NEEDCOMMIT was inappropriately set in nfs_flush() (which is typically only called by the update daemon). nfs_flush() does a multi-pass loop but due to the lack of vnode locking it is possible for new buffers to be added to the dirtyblkhd list while a flush operation is going on. This may result in nfs_flush() setting B_NEEDCOMMIT on a buffer which has *NOT* yet gone through its stage 1 write, causing only the commit rpc to be made and thus causing the contents of the buffer to be thrown away (never sent to the server). The patch also contains some cleanup, which only applies to the commit into -current. Reviewed by: dg, julian Originally Reported by: Dan Nelson <dnelson@emsphone.com>
* The symlink implementation could improperly return a NULL vp along withdillon1999-11-301-3/+33
| | | | | | | | | | | | | | | | | | | | a 0 error code. The problem occured with NFSv2 mounts and also with any NFSv3 mount returning an EEXIST error (which is translated to 0 prior to return). The reply to the rpc only contains the file handle for the no-error case under NFSv3. The error case under NFSv3 and all cases under NFSv2 do *not* return the file handle. The fix is to do a secondary lookup to obtain the file handle and thus be able to generate a return vnode for the situations where the rpc reply does not contain the required information. The bug was originally introduced when VOP_SYMLINK semantics were changed for -CURRENT. The NFS symlink implementation was not properly modified to go along with the change despite the fact that three people reviewed the code. It took four attempts to get the current fix correct with five people. Is NFS obfuscated? Ha! Reviewed by: Alfred Perlstein <bright@wintelcom.net> Testing and Discussion: "Viren R.Shah" <viren@rstcorp.com>, Eivind Eklund <eivind@FreeBSD.ORG>, Ian Dowse <iedowse@maths.tcd.ie>
* Remap the error EEXISTS => 0 *before* using error to determine if we shouldeivind1999-11-271-7/+9
| | | | return a vp.
* Fix VOP_MKNOD for loss of WILLRELE. I don't know how I could have missedeivind1999-11-201-7/+1
| | | | | | this in the first place :-( Noticed by: bde
* Remove WILLRELE from VOP_SYMLINKeivind1999-11-131-1/+3
| | | | | | Note: Previous commit to these files (except coda_vnops and devfs_vnops) that claimed to remove WILLRELE from VOP_RENAME actually removed it from VOP_MKNOD.
* Move NFS access cache hits/misses into nfsstats structure sodillon1999-10-251-6/+7
| | | | /usr/bin/nfsstat can get to it easily.
* Before we start to mess with the VFS name-cache clean things up a little bit:phk1999-10-031-4/+0
| | | | Isolate the namecache in its own file, and give it a dedicated malloc type.
* Asynchronized client-side nfs_commit. NFS commit operations weredillon1999-09-171-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | previously issued synchronously even if async daemons (nfsiod's) were available. The commit has been moved from the strategy code to the doio code in order to asynchronize it. Removed use of lastr in preparation for removal of vnode->v_lastr. It has been replaced with seqcount, which is already supported by the system and, in fact, gives us a better heuristic for sequential detection then lastr ever did. Made major performance improvements to the server side commit. The server previously fsync'd the entire file for each commit rpc. The server now bawrite()s only those buffers related to the offset/size specified in the commit rpc. Note that we do not commit the meta-data yet. This works still needs to be done. Note that a further optimization can be done (and has not yet been done) on the client: we can merge multiple potential commit rpc's into a single rpc with a greater file offset/size range and greatly reduce rpc traffic. Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
* $Id$ -> $FreeBSD$peter1999-08-281-1/+1
|
* Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,phk1999-08-081-2/+2
| | | | | | a few lines into <sys/vnode.h>. Add a few fields to struct specinfo, paving the way for the fun part.
* As described by the submitter:msmith1999-07-311-34/+55
| | | | | | | | | | | | | | | | | | | | I did some tcpdumping the other day and noticed that GETATTR calls were frequently followed by an ACCESS call to the same file. The attached patch changes nfs_getattr to fill the access cache as a side effect. This is accomplished by calling ACCESS rather than GETATTR. This implies a modest overhead of 4 bytes in the request and 8 bytes in the response compared to doing a vanilla GETATTR. ... [The patch comprises two parts] The first is the "real" patch, the second counts misses and hits rather than fills and hits. The difference is subtle but important because both nfs_getattr and nfs_access now fill the cache. It also changes the default value of nfsaccess_cache_timeout to better match the attribute cache. IMHO, file timestamps change much more frequently than protection bits. Submitted by: Bjoern Groenvall <bg@sics.se> Reviewed by: dillon (partially)
* Close PR #12651: the hash calculation routine has changed in otherwpaul1999-07-301-2/+2
| | | | parts of the kernel but was not updated in nfs_readdirplusrpc().
* Fix two bugs in nfs_readdirplus(). The first is that in some cases,wpaul1999-07-301-3/+6
| | | | | | | | | | | | | | | vnodes are locked and never unlocked, which leads to processes starting to wedge up after doing a mount -o nfsv3,tcp,rdirplus foo:/fs /fs; ls /fs. The second is that sometimes cnp is accessed without having been properly initialized: cnp->cn_nameptr points to an earlier name while "len" contains the length of a current name of different size. This leads to an attempt to dereference *(cn->cn_nameptr + len) which will sometimes cause a page fault and a panic. With these two fixes, client side readdirplus works correctly with FreeBSD, IRIX 6.5.4 and Solaris 2.5.1 and 2.6 servers. Submitted by: Matthew Dillon <dillon@backplane.com>
* Fix warning. va_fsid is udev_t, which is int32_t. No need to use %lx.peter1999-07-011-2/+2
|
* Submitted by: Conrad Minshall <conrad@apple.com>julian1999-06-301-1/+6
| | | | | | | | | | Reviewed by: Matthew Dillon <dillon@apollo.backplane.com> The following ugly hack to the exit path of nfs_readlinkrpc() circumvents an Auspex bug: for symlinks longer than 112 (0x70) they return a 1024 byte xdr string - the correct data with many nulls appended. Without this fix namei returns ENAMETOOLONG, at least it does on our source base and on FreeBSD 3.0. Note we do not (and should not) rely upon their null padding.
* Fix a KASSERT() that was negated and lead to:peter1999-06-281-2/+2
| | | | | nfs_strategy: buffer 0xxxxx not locked when you attempted to write and had INVARIANTS turned on.
* Convert buffer locking from using the B_BUSY and B_WANTED flags to usingmckusick1999-06-261-22/+28
| | | | | | | lockmgr locks. This commit should be functionally equivalent to the old semantics. That is, all buffer locking is done with LK_EXCLUSIVE requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will be done in future commits.
* Add a vnode argument to VOP_BWRITE to get rid of the last vnodemckusick1999-06-161-2/+2
| | | | | operator special case. Delete special case code from vnode_if.sh, vnode_if.src, umap_vnops.c, and null_vnops.c.
* Various changes lifted from the OpenBSD cvs tree:peter1999-06-051-6/+6
| | | | | | | | | | | | | | | txdr_hyper and fxdr_hyper tweaks to avoid excessive CPU order knowledge. nfs_serv.c: don't call nfsm_adj() with negative values, windows clients could crash servers when doing a readdir of a large directory. nfs_socket.c: Use IP_PORTRANGE to get a priviliged port without a spin loop trying to bind(). Don't clobber a mbuf pointer or we get panics on a NFS3ERR_JUKEBOX error from a server when reusing a freed mbuf. nfs_subs.c: Don't loose st_blocks on NFSv2 mounts when > 2GB. Obtained from: OpenBSD
* Divorce "dev_t" from the "major|minor" bitmap, which is now calledphk1999-05-111-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | udev_t in the kernel but still called dev_t in userland. Provide functions to manipulate both types: major() umajor() minor() uminor() makedev() umakedev() dev2udev() udev2dev() For now they're functions, they will become in-line functions after one of the next two steps in this process. Return major/minor/makedev to macro-hood for userland. Register a name in cdevsw[] for the "filedescriptor" driver. In the kernel the udev_t appears in places where we have the major/minor number combination, (ie: a potential device: we may not have the driver nor the device), like in inodes, vattr, cdevsw registration and so on, whereas the dev_t appears where we carry around a reference to a actual device. In the future the cdevsw and the aliased-from vnode will be hung directly from the dev_t, along with up to two softc pointers for the device driver and a few houskeeping bits. This will essentially replace the current "alias" check code (same buck, bigger bang). A little stunt has been provided to try to catch places where the wrong type is being used (dev_t vs udev_t), if you see something not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if it makes a difference. If it does, please try to track it down (many hands make light work) or at least try to reproduce it as simply as possible, and describe how to do that. Without DEVT_FASCIST I belive this patch is a no-op. Stylistic/posixoid comments about the userland view of the <sys/*.h> files welcome now, from userland they now contain the end result. Next planned step: make all dev_t's refer to the same devsw[] which means convert BLK's to CHR's at the perimeter of the vnodes and other places where they enter the game (bootdev, mknod, sysctl).
* remove b_proc from struct buf, it's (now) unused.phk1999-05-061-5/+6
| | | | Reviewed by: dillon, bde
* Add sufficient braces to keep egcs happy about potentially ambiguouspeter1999-05-061-2/+3
| | | | if/else nesting.
* All directory accesses must be made with NFS_DIRBLKSIZE chunks to avoidalc1999-05-031-3/+3
| | | | | | | | | | confusing the directory read cookie cache. The nfs_access implementation for v2 mounts attempts to read from the directory if root is the user so that root can't access cached files when the server remaps root to some other user. Submitted by: Doug Rabson <dfr@nlsystems.com> Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
* The VFS/BIO subsystem contained a number of hacks in order to optimizealc1999-05-021-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
* Reviewed by: Many at differnt times in differnt parts,julian1999-03-121-30/+51
| | | | | | | | | | | | | | | | | | | | | | including alan, john, me, luoqi, and kirk Submitted by: Matt Dillon <dillon@frebsd.org> This change implements a relatively sophisticated fix to getnewbuf(). There were two problems with getnewbuf(). First, the writerecursion can lead to a system stack overflow when you have NFS and/or VN devices in the system. Second, the free/dirty buffer accounting was completely broken. Not only did the nfs routines blow it trying to manually account for the buffer state, but the accounting that was done did not work well with the purpose of their existance: figuring out when getnewbuf() needs to sleep. The meat of the change is to kern/vfs_bio.c. The remaining diffs are all minor except for NFS, which includes both the fixes for bp interaction AND fixes for a 'biodone(): buffer already done' lockup. Sys/buf.h also contains a chaining structure which is not used by this patchset but is used by other patches that are coming soon. This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF. (sorry for the missing T matt)
* * Change sysctl from using linker_set to construct its tree using SLISTs.dfr1999-02-161-1/+3
| | | | | | | | | | This makes it possible to change the sysctl tree at runtime. * Change KLD to find and register any sysctl nodes contained in the loaded file and to unregister them when the file is unloaded. Reviewed by: Archie Cobbs <archie@whistle.com>, Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)
* General additional cleanup of VOP API for NFS ops - mainly NFS ignoringdillon1999-02-131-4/+13
| | | | | | | | the API for freeing up cnp's. This cleanup should not effect nominal operation one way or the other since NFS VOPs just happen to be called with flags that match what it actually does to the NAMEI components it gets. Still, if an NFS error occured, there was probably some memory leakage of NAMEI components with certain NFS VOP ops.
* PR: kern/9970dillon1999-02-131-2/+1
| | | | Remove incorrect vput() in nfs_link()
* Flush delayed-write data out prior to issuing a rename rpc. This appearsdillon1999-02-061-1/+14
| | | | | | to fix the problem w/ NFSV3 whereby a make installworld would get into high-network-bandwidth situations continuously trying to retry nfs writes that fail with a 'stale file handle' error.
* Fix nasty bug in nfs_access(). A conditional was if (a = b) instead ofdillon1999-01-271-2/+2
| | | | if (a == b).
* Fix warnings in preparation for adding -Wall -Wcast-qual to thedillon1999-01-271-6/+6
| | | | kernel compile
* This is a rather large commit that encompasses the new swapper,dillon1999-01-211-1/+4
| | | | | | | | | | changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues. Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
* Remove two cases of unused variable sp3.eivind1999-01-121-3/+1
|
* Fix for creating files on a Solaris 7 server with NFSv3 (the request wasdfr1998-12-251-79/+15
| | | | | | slightly garbled but older servers seemed to understand it). Reviewed by: David O'Brien <obrien@nuxi.ucdavis.edu>
* Reimplement the NFS ACCESS RPC cache as an "accelerator" rather than a truemsmith1998-11-151-42/+12
| | | | | | | | | | | | | | | | | | | | cache. If the cached result lets us say "yes", then go with that. If we're not sure, or we think the answer might be "no", go to the wire to be certain. This avoids all of the possible false veto cases, and allows us to key the cached value with just the UID for which the cached value holds, reducing the bloat of the nfsnode structure from 104 bytes to just 12 bytes. Since the "yes" case is by far the most common, this should still provide a substantial performance improvement. Also default the cache to on, with a conservative timeout (2 seconds). This improves performance if NFS is loaded as a KLD module, as there's not (yet) code to parse an option out of the module arguments to set it, and sysctl doesn't work (yet) for OIDs in modules. The 'accelerator' mode was suggested by Bjoern Groenvall (bg@sics.se) Feedback on this would be appreciated as testing has been necessarily limited by Comdex, and it would be valuable to have this in 2.2.8.
* Avoid a null pointer reference if the target of an NFS rename has beenmsmith1998-11-131-7/+10
| | | | | | sillrenamed, or if the source vnode doesn't have an associated nfsnode. Bug report from Andrew Gallatin <gallatin@cs.duke.edu>
* Implement NFS ACCESS RPC result caching.msmith1998-11-131-23/+94
| | | | | | | | | | | | | | | | | | This yields startling performance increases for NFS clients for many access profiles, due to the fact that ACCESS results are persistently cached in the namecache in many cases. Note that the code is somewhat conservative in that it requires an exact credential match for a cache hit. This bloats the nfsnode structure by sizeof(struct ucred) (96 bytes). Any less conservative approach opens the possibility for a false veto in eg. setuid applications. Alternative suggestions would be welcomed. The cache is normally disabled, to activate set the sysctl variable vfs.nfs.access_cache_timeout to a nonzero value. This is the time in seconds that a cached entry will be considered valid; useful values appear to be 2-10 seconds. Performance of the cache can be monitored with the vfs.nfs.access_cache_hits and vfs.nfs.access_cache_hits variables.
* Remove [apparently] bogus casts to u_long for the vnode_pager_setsize()peter1998-11-091-4/+4
| | | | | | | second argument. np_size is a 64 bit int, so is the second arg. This might have caused needless 2G/4G file size problems. I believe it was Bruce who queried this.
* Use TAILQ macros for clean/dirty block list processing. Set b_xflagspeter1998-10-311-8/+8
| | | | rather than abusing the list next pointer with a magic number.
* In nfs_link(), check for a cross-device mount *before* lookingmckusick1998-09-291-2/+3
| | | | | in the v_data field. Obtained from: Charles Hannum, via Frank van der Linden <frank@wins.uva.nl>
* Missing vput when cross-device link error is detected in nfs_link.mckusick1998-09-291-0/+1
|
* During truncation, have to notify the VM about the new sizemckusick1998-09-291-3/+5
| | | | | of the NFS file *before* doing the nfs_vinvalbuf operation. Otherwise some invalid data may show up in an mmap.
* Protect all modifications to v_numoutput with splbio().dfr1998-08-131-2/+2
|
* VOP_STRATEGY grows an (struct vnode *) argumentjulian1998-07-041-2/+2
| | | | | | as the value in b_vp is often not really what you want. (and needs to be frobbed). more cleanups will follow this. Reviewed by: Bruce Evans <bde@freebsd.org>
* readlink() returns EINVAL rather than EPERM if called on a non-symlink.peter1998-06-011-2/+2
|
* For the on-the-wire protocol, u_long -> u_int32_t; long -> int32_t;peter1998-05-311-96/+102
| | | | | | | int -> int32_t; u_short -> u_int16_t. Also, use mode_t instead of u_short for storing modes (mode_t is a u_int16_t). Obtained from: NetBSD
* xdr encode -1 properly.peter1998-05-311-2/+2
| | | | Obtained from: NetBSD
* Fully fill in nfsv2 write rpc requests rather than leaving garbage.peter1998-05-311-4/+11
| | | | Obtained from: NetBSD
OpenPOWER on IntegriCloud