summaryrefslogtreecommitdiffstats
path: root/sys/kern/vfs_subr.c
Commit message (Collapse)AuthorAgeFilesLines
* In vn_isdisk(), check whether vp->v_rdev is NULL. If it is, thenchris2000-03-181-0/+5
| | | | | | | | return ENXIO (Device not configured). Without this, vn_isdisk() could (and did in the case of lstat() under fdesc) pass a NULL pointer to devsw(), which caused a page fault. Reviewed by: alfred
* Eliminate the undocumented, experimental, non-delivering and highlyphk2000-03-161-6/+0
| | | | dangerous MAX_PERF option.
* Don't try so hard to make the lower 16 bits of fsids unique. It tendedbde2000-03-141-22/+13
| | | | | | | | to recycle full fsids after only 16 mount/unmount's. This is probably too often for exported fsids. Now we recycle the full fsids only after 2^16 mount/ umount's and only ensure uniqueness in the lower 16 bits if there have been <= 256 calls to vfs_getnewfsid() since the system started.
* Try harder to make the lower 16 bits of fsids unique. The vfs typebde2000-03-121-15/+25
| | | | | | | | | | | number was packed very wastefully, giving perfect non-uniqeness in the lower 16 bits of fsids for filesystems with the same vfs type. This made linux_stat() return perfectly non-unique (broken) 16-bit st_dev's for nfs mount points, and effectively reduced mntid_base to 8 bits so that the vfs_getnewfsid() looped endlessly when there are already 256 mounted filesystems with the required vfs type. Approved by: jkh
* Do refcounting of open devices (more) correctly.sos2000-02-071-0/+16
| | | | count_dev funtion by phk.
* Remove static qualifier from vgonel, as it is needed by the Arla folkrwatson2000-02-021-2/+1
| | | | | | | | outside of vfs_subr.c. Submitted by: Assar Westerlund <assar@sics.se> Reviewed by: rwatson Approved by: jkh
* This patch fixes a locking bug that can result in deadlock ifrwatson2000-01-291-2/+17
| | | | | | | | | | | | | | | | | the codepath is followed. From the PR: vclean calls vrele leading to deadlock (if usecount > 0) vclean() calls vrele() if v_usecount of the node was higher than one. But before calling it, it sets the VXLOCK flag, which will make vn_lock called from vrele dead-lock. PR: kern/15117 Submitted by: Assar Westerlund <assar@stacken.kth.se> Reviewed by: rwatson Obtained from: NetBSD
* Give vn_isdisk() a second argument where it can return a suitable errno.phk2000-01-101-6/+18
| | | | Suggested by: bde
* Remove the P_BUFEXHAUST flag from the syncer process (leavingmckusick2000-01-101-2/+0
| | | | | | | | | | | | it only on the buf_daemon process). The problem is that when the syncer process starts running the worklist, it wants to delete lots of files. It does this by VFS_VGET'ing the vnodes, clearing the blocks in them and bdwrite'ing the buffer. It can process close to a thousand files per second which generates a large number of dirty buffers. So, giving it special priviledge at the buffer trough leads to trouble as the buf_daemon does occationally need a free buffer to proceed and if the syncer has used every last one up, we are toast.
* Change NDFREE() from a macro to a function for the time being; the macroeivind2000-01-081-0/+34
| | | | | | | version caused intolerable bloat (30k). I'm likely to revisit this with an attempt at a smarter macro. Bloat noticed by: bde
* Introduce a mechanism to suspend/resume system processes. Suspend syncerluoqi2000-01-071-7/+14
| | | | and bufdaemon prior to disk sync during system shutdown.
* Enhance reassignbuf(). When a buffer cannot be time-optimally inserteddillon2000-01-051-2/+19
| | | | | | | | | | | | | | | | | | | into vnode dirtyblkhd we append it to the list instead of prepend it to the list in order to maintain a 'forward' locality of reference, which is arguably better then 'reverse'. The original algorithm did things this way to but at a huge time cost. Enhance the append interlock for NFS writes to handle intr/soft mounts better. Fix the hysteresis for NFS async daemon I/O requests to reduce the number of unnecessary context switches. Modify handling of NFS mount options. Any given user option that is too high now defaults to the kernel maximum for that option rather then the kernel default for that option. Reviewed by: Alfred Perlstein <bright@wintelcom.net>
* Prettyness police: Identify flags in b_xflags with BX_ to distinguishmckusick1999-12-221-17/+19
| | | | them from flags in b_flags which are prefixed with B_
* Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC todillon1999-12-121-1/+1
| | | | | | | | | | | | | | | | | madvise(). This feature prevents the update daemon from gratuitously flushing dirty pages associated with a mapped file-backed region of memory. The system pager will still page the memory as necessary and the VM system will still be fully coherent with the filesystem. Modifications made by other means to the same area of memory, for example by write(), are unaffected. The feature works on a page-granularity basis. MAP_NOSYNC allows one to use mmap() to share memory between processes without incuring any significant filesystem overhead, putting it in the same performance category as SysV Shared memory and anonymous memory. Reviewed by: julian, alc, dg
* Lock reporting and assertion changes.eivind1999-12-111-3/+3
| | | | | | | | | | | | | | | * lockstatus() and VOP_ISLOCKED() gets a new process argument and a new return value: LK_EXCLOTHER, when the lock is held exclusively by another process. * The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them * Extend the vnode_if.src format to allow more exact specification than locked/unlocked. This commit should not do any semantic changes unless you are using DEBUG_VFS_LOCKS. Discussed with: grog, mch, peter, phk Reviewed by: peter
* Remove vfs_getrootfsid() function (a temporary hack added a few monthsdillon1999-11-291-17/+0
| | | | | ago to make BOOTP work again). It is no longer required by BOOTP and no longer used.
* Convert various pieces of code to use vn_isdisk() rather than checkingphk1999-11-221-3/+4
| | | | | | | | for vp->v_type == VBLK. In ccd: we don't need to call VOP_GETATTR to find the type of a vnode. Reviewed by: sos
* struct mountlist and struct mount.mnt_list have no business beingphk1999-11-201-14/+14
| | | | | | | | | | a CIRCLEQ. Change them to TAILQ_HEAD and TAILQ_ENTRY respectively. This removes ugly mp != (void*)&mountlist comparisons. Requested by: phk Submitted by: Jake Burkholder jake@checker.org PR: 14967
* Commit the remaining part of PR14914:phk1999-11-161-19/+18
| | | | | | | | | | | Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY structures for list operations. This patch makes all list operations in sys/kern use the queue(3) macros, rather than directly accessing the *Q_{HEAD,ENTRY} structures. Reviewed by: phk Submitted by: Jake Burkholder <jake@checker.org> PR: 14914
* Next step in the device cleanup process.phk1999-11-091-1/+1
| | | | | | | | Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code. Unify spec_open() for bdev and cdev cases. Remove the disabled bdev specific read/write code.
* useracc() the prequel:phk1999-10-291-1/+0
| | | | | | | | | | | Merge the contents (less some trivial bordering the silly comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs. This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE} as argument.
* Trim unused options (or #ifdef for undoc options).peter1999-10-111-1/+0
| | | | Submitted by: phk
* Move the buffered read/write code out of spec_{read|write} and intophk1999-10-041-3/+0
| | | | | | | | | | | two new functions spec_buf{read|write}. Add sysctl vfs.bdev_buffered which defaults to 1 == true. This sysctl can be used to experimentally turn buffered behaviour for bdevs off. I should not be changed while any blockdevices are open. Remove the misplaced sysctl vfs.enable_userblk_io. No other changes in behaviour.
* Remove v_maxio from struct vnode.phk1999-09-291-1/+1
| | | | | | Replace it with mnt_iosize_max in struct mount. Nits from: bde
* Final commit to remove vnode->v_lastr. vm_fault now handles readdillon1999-09-211-1/+0
| | | | | | | | | | | | | | | clustering issues (replacing code that used to be in ufs/ufs/ufs_readwrite.c). vm_fault also now uses the new VM page counter inlines. This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr for VM, and fp->f_nextread and fp->f_seqcount (which have been in the tree for a while). Determination of the I/O strategy (sequential, random, and so forth) is now handled on a descriptor-by-descriptor basis for base I/O calls, and on a memory-region-by-memory-region and process-by-process basis for VM faults. Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
* Initialize vp->v_maxio to its default in getnetvnode() rather thanphk1999-09-201-1/+1
| | | | four different places in vfs_cluster.c
* Fix BOOTP root FS mounts. Also cleanup vfs_getnewfsid() and collapsedillon1999-09-191-21/+37
| | | | | | | | | | addaliasu() into addalias() (no operational change) and clarify comments relating to a trick that vclean() uses. The fix to BOOTP is yet another hack. Actually, rootfsid handling is already a major hack. The whole thing needs to be cleaned up. Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
* Add vfs.enable_userblk_io sysctl to control whether user reads and writesdillon1999-09-171-0/+3
| | | | | | | | | | | | | | | | | | to buffered block devices are allowed. The default is to be backwards compatible, i.e. reads and writes are allowed. The idea is for a larger crowd to start running with this disabled and see what problems, if any, crop up, and then to change the default to off and see if any problems crop up in the next 6 months prior to potentially removing support entirely. There are still a few people, Julian and myself included, who believe the buffered block device access from usermode to be useful. Remove use of vnode->v_lastr from buffered block device I/O in preparation for removal of vnode->v_lastr field, replacing it with the already existing seqcount metric to detect sequential operation. Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
* Add dev_t freeing code. Controlled by sysctl debug.free_devt, defaultphk1999-08-291-1/+2
| | | | is off.
* remove unused variables.phk1999-08-281-1/+1
|
* $Id$ -> $FreeBSD$peter1999-08-281-1/+1
|
* Simplify the handling of VCHR and VBLK vnodes using the new dev_t:phk1999-08-261-218/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | Make the alias list a SLIST. Drop the "fast recycling" optimization of vnodes (including the returning of a prexisting but stale vnode from checkalias). It doesn't buy us anything now that we don't hardlimit vnodes anymore. Rename checkalias2() and checkalias() to addalias() and addaliasu() - which takes dev_t and udev_t arg respectively. Make the revoke syscalls use vcount() instead of VALIASED. Remove VALIASED flag, we don't need it now and it is faster to traverse the much shorter lists than to maintain the flag. vfs_mountedon() can check the dev_t directly, all the vnodes point to the same one. Print the devicename in specfs/vprint(). Remove a couple of stale LFS vnode flags. Remove unimplemented/unused LK_DRAINED;
* Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness.phk1999-08-251-9/+20
|
* Make DEVFS use PHK's specinfo struct as the source of dev_t and devsw.julian1999-08-251-3/+2
| | | | | | | | In lookup() however it's the other way around as we need to supply the dev_t for the vnode, so devfs still has a copy of it stashed away. Sourcing it from the vnode in the vnops however is useful as it makes a lot of the code almost the same as that in specfs.
* Support full-precision file timestamps. Until now, only the secondsjdp1999-08-221-1/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | have been maintained, and that is still the default. A new sysctl variable "vfs.timestamp_precision" can be used to enable higher levels of precision: 0 = seconds only; nanoseconds zeroed (default). 1 = seconds and nanoseconds, accurate within 1/HZ. 2 = seconds and nanoseconds, truncated to microseconds. >=3 = seconds and nanoseconds, maximum precision. Level 1 uses getnanotime(), which is fast but can be wrong by up to 1/HZ. Level 2 uses microtime(). It might be desirable for consistency with utimes() and friends, which take timeval structures rather than timespecs. Level 3 uses nanotime() for the higest precision. I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with "cpio -pdu". There was almost negligible difference in the system times -- much less than 1%, and less than the variation among multiple runs at the same level. Bruce Evans dreamed up a torture test involving 1-byte reads with intervening fstat() calls, but the cpio test seems more realistic to me. This feature is currently implemented only for the UFS (FFS and MFS) filesystems. But I think it should be easy to support it in the others as well. An earlier version of this was reviewed by Bruce. He's not to blame for any breakage I've introduced since then. Reviewed by: bde (an earlier version of the code)
* The bdevsw() and cdevsw() are now identical, so kill the former.phk1999-08-131-2/+2
|
* s/v_specinfo/v_rdev/phk1999-08-131-4/+4
|
* Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,phk1999-08-081-4/+2
| | | | | | a few lines into <sys/vnode.h>. Add a few fields to struct specinfo, paving the way for the fun part.
* Add sysctl and support code to allow directories to be VMIO'd. The defaultalc1999-07-261-3/+3
| | | | | | setting for the sysctl is OFF, which is the historical operation. Submitted by: dillon
* Now a dev_t is a pointer to struct specinfo which is shared by all specdevphk1999-07-201-51/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vnodes referencing this device. Details: cdevsw->d_parms has been removed, the specinfo is available now (== dev_t) and the driver should modify it directly when applicable, and the only driver doing so, does so: vn.c. I am not sure the logic in checking for "<" was right before, and it looks even less so now. An intial pool of 50 struct specinfo are depleted during early boot, after that malloc had better work. It is likely that fewer than 50 would do. Hashing is done from udev_t to dev_t with a prime number remainder hash, experiments show no better hash available for decent cost (MD5 is only marginally better) The prime number used should not be close to a power of two, we use 83 for now. Add new checkalias2() to get around the loss of info from dev2udev() in bdevvp(); The aliased vnodes are hung on a list straight of the dev_t, and speclisth[SPECSZ] is unused. The sharing of struct specinfo means that the v_specnext moves into the vnode which grows by 4 bytes. Don't use a VBLK dev_t which doesn't make sense in MFS, now we hang a dummy cdevsw on B/Cmaj 253 so that things look sane. Storage overhead from all of this is O(50k). Bump __FreeBSD_version to 400009 The next step will add the stuff needed so device-drivers can start to hang things from struct specinfo
* [click] Now all dev_t's in the kernel have their char device major.phk1999-07-191-2/+4
| | | | | | | | | | Only know casualy of this is swapinfo/pstat which should be fixes the right way: Store the actual pathname in the kernel like mount does. [Volounteers sought for this task] The road map from here is roughly: expand struct specinfo into struct based dev_t. Add dev_t registration facilities for device drivers and start to use them.
* Introduce the vn_todev(struct vnode*) function, which returns the dev_tphk1999-07-181-1/+13
| | | | corresponding to a VBLK or VCHR node, or NODEV.
* Fix 2nd arg to udev2dev().phk1999-07-171-2/+2
|
* I have not one single time remembered the name of this function correctlyphk1999-07-171-4/+4
| | | | so obviously I gave it the wrong name. s/umakedev/makeudev/g
* Correct a couple of spelling errors in comments.kris1999-07-121-3/+3
|
* These changes appear to give us benefits with both small (32MB) andmckusick1999-07-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | large (1G) memory machine configurations. I was able to run 'dbench 32' on a 32MB system without bring the machine to a grinding halt. * buffer cache hash table now dynamically allocated. This will have no effect on memory consumption for smaller systems and will help scale the buffer cache for larger systems. * minor enhancement to pmap_clearbit(). I noticed that all the calls to it used constant arguments. Making it an inline allows the constants to propogate to deeper inlines and should produce better code. * removal of inherent vfs_ioopt support through the emplacement of appropriate #ifdef's, with John's permission. If we do not find a use for it by the end of the year we will remove it entirely. * removal of getnewbufloops* counters & sysctl's - no longer necessary for debugging, getnewbuf() is now optimal. * buffer hash table functions removed from sys/buf.h and localized to vfs_bio.c * VFS_BIO_NEED_DIRTYFLUSH flag and support code added ( bwillwrite() ), allowing processes to block when too many dirty buffers are present in the system. * removal of a softdep test in bdwrite() that is no longer necessary now that bdwrite() no longer attempts to flush dirty buffers. * slight optimization added to bqrelse() - there is no reason to test for available buffer space on B_DELWRI buffers. * addition of reverse-scanning code to vfs_bio_awrite(). vfs_bio_awrite() will attempt to locate clusterable areas in both the forward and reverse direction relative to the offset of the buffer passed to it. This will probably not make much of a difference now, but I believe we will start to rely on it heavily in the future if we decide to shift some of the burden of the clustering closer to the actual I/O initiation. * Removal of the newbufcnt and lastnewbuf counters that Kirk added. They do not fix any race conditions that haven't already been fixed by the gbincore() test done after the only call to getnewbuf(). getnewbuf() is a static, so there is no chance of it being misused by other modules. ( Unless Kirk can think of a specific thing that this code fixes. I went through it very carefully and didn't see anything ). * removal of VOP_ISLOCKED() check in flushbufqueues(). I do not think this check is necessary, the buffer should flush properly whether the vnode is locked or not. ( yes? ). * removal of extra arguments passed to getnewbuf() that are not necessary. * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in vfs_cluster.c * vn_write() now calls bwillwrite() *PRIOR* to locking the vnode, which should greatly aid flushing operations in heavy load situations - both the pageout and update daemons will be able to operate more efficiently. * removal of b_usecount. We may add it back in later but for now it is useless. Prior implementations of the buffer cache never had enough buffers for it to be useful, and current implementations which make more buffers available might not benefit relative to the amount of sophistication required to implement a b_usecount. Straight LRU should work just as well, especially when most things are VMIO backed. I expect that (even though John will not like this assumption) directories will become VMIO backed some point soon. Submitted by: Matthew Dillon <dillon@backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
* The buffer queue mechanism has been reformulated. Instead of havingmckusick1999-07-041-10/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN, QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean and dirty buffers have been separated. Empty buffers with KVM assignments have been separated from truely empty buffers. getnewbuf() has been rewritten and now operates in a 100% optimal fashion. That is, it is able to find precisely the right kind of buffer it needs to allocate a new buffer, defragment KVM, or to free-up an existing buffer when the buffer cache is full (which is a steady-state situation for the buffer cache). Buffer flushing has been reorganized. Previously buffers were flushed in the context of whatever process hit the conditions forcing buffer flushing to occur. This resulted in processes blocking on conditions unrelated to what they were doing. This also resulted in inappropriate VFS stacking chains due to multiple processes getting stuck trying to flush dirty buffers or due to a single process getting into a situation where it might attempt to flush buffers recursively - a situation that was only partially fixed in prior commits. We have added a new daemon called the buf_daemon which is responsible for flushing dirty buffers when the number of dirty buffers exceeds the vfs.hidirtybuffers limit. This daemon attempts to dynamically adjust the rate at which dirty buffers are flushed such that getnewbuf() calls (almost) never block. The number of nbufs and amount of buffer space is now scaled past the 8MB limit that was previously imposed for systems with over 64MB of memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed somewhat. The number of physical buffers has been increased with the intention that we will manage physical I/O differently in the future. reassignbuf previously attempted to keep the dirtyblkhd list sorted which could result in non-deterministic operation under certain conditions, such as when a large number of dirty buffers are being managed. This algorithm has been changed. reassignbuf now keeps buffers locally sorted if it can do so cheaply, and otherwise gives up and adds buffers to the head of the dirtyblkhd list. The new algorithm is deterministic but not perfect. The new algorithm greatly reduces problems that previously occured when write_behind was turned off in the system. The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive P_BUFEXHAUST bit. This bit allows processes working with filesystem buffers to use available emergency reserves. Normal processes do not set this bit and are not allowed to dig into emergency reserves. The purpose of this bit is to avoid low-memory deadlocks. A small race condition was fixed in getpbuf() in vm/vm_pager.c. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
* Make sure that stat(2) and friends always return a valid st_dev field.phk1999-07-021-4/+5
| | | | | | Pseudo-FS need not fill in the va_fsid anymore, the syscall code will use the first half of the fsid, which now looks like a udev_t with major 255.
* Slight reorganization of kernel thread/process creation. Instead of usingpeter1999-07-011-3/+4
| | | | | | | | | | | | | | | SYSINIT_KT() etc (which is a static, compile-time procedure), use a NetBSD-style kthread_create() interface. kproc_start is still available as a SYSINIT() hook. This allowed simplification of chunks of the sysinit code in the process. This kthread_create() is our old kproc_start internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work the same as the NetBSD one. One thing I'd like to do shortly is get rid of nfsiod as a user initiated process. It makes sense for the nfs client code to create them on the fly as needed up to a user settable limit. This means that nfsiod doesn't need to be in /sbin and is always "available". This is a fair bit easier to do outside of the SYSINIT_KT() framework.
* Convert buffer locking from using the B_BUSY and B_WANTED flags to usingmckusick1999-06-261-28/+23
| | | | | | | lockmgr locks. This commit should be functionally equivalent to the old semantics. That is, all buffer locking is done with LK_EXCLUSIVE requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will be done in future commits.
OpenPOWER on IntegriCloud