summaryrefslogtreecommitdiffstats
path: root/sys/kern/vfs_bio.c
Commit message (Collapse)AuthorAgeFilesLines
* When the number of dirty buffers rises too high, the buf_daemon runsmckusick2002-10-181-3/+17
| | | | | | | | | | | | | | to help clean up. After selecting a potential buffer to write, this patch has it acquire a lock on the vnode that owns the buffer before trying to write it. The vnode lock is necessary to avoid a race with some other process holding the vnode locked and trying to flush its dirty buffers. In particular, if the vnode in question is a snapshot file, then the race can lead to a deadlock. To avoid slowing down the buf_daemon, it does a non-blocking lock request when trying to lock the vnode. If it fails to get the lock it skips over the buffer and continues down its queue looking for buffers to flush. Sponsored by: DARPA & NAI Labs.
* Remove unused includes.phk2002-09-281-4/+4
| | | | | Clarify the intention of a while(); Move a local variable to avoid potential name-confusion.
* Be consistent about "static" functions: if the function is markedphk2002-09-281-1/+1
| | | | | | static in its prototype, mark it static at the definition too. Inspired by: FlexeLint warning #512
* Correctly order VI_UNLOCK(), local variables and block comment.phk2002-09-281-4/+4
|
* Make biowait() check bio_error before the BIO_ERROR flag, to properyphk2002-09-261-2/+2
| | | | | | catch internal GEOM use of bio_error. Sponsored by: DARPA & NAI Labs.
* - Lock accesses to v_numoutput.jeff2002-09-251-0/+16
| | | | - Lock calls to gbincore.
* s/Danglish/English/phk2002-09-151-4/+5
| | | | | | | Some style issues. Change the timeout to be hz/10 instead of hz. Brucification by: bde.
* Un-inline the non-trivial "trivial" bio* functions.phk2002-09-141-0/+39
| | | | Untangle devstat_end_transaction_bio()
* Remove all use of vnode->v_tag, replacing with appropriate substitutes.njl2002-09-141-1/+2
| | | | | | | | | | | | v_tag is now const char * and should only be used for debugging. Additionally: 1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK 2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP. Suggested by: phk Reviewed by: bde, rwatson (earlier version)
* Oops, broke the build there. Uninline biodone() now that it is non-trivial.phk2002-09-131-0/+28
| | | | | | | | | Introduce biowait() function. Currently there is a race condition and the mitigation is a timeout/retry. It is not obvious what kind of locking (if any) is suitable for BIO_DONE, since the majority of users take are of this themselves, and only a few places actually rely on the wakeup. Sponsored by: DARPA & NAI Labs.
* Change hw.physmem and hw.usermem to unsigned long like they used to bepeter2002-08-301-1/+1
| | | | | | | | | | | | | in the original hardwired sysctl implementation. The buf size calculator still overflows an integer on machines with large KVA (eg: ia64) where the number of pages does not fit into an int. Use 'long' there. Change Maxmem and physmem and related variables to 'long', mostly for completeness. Machines are not likely to overflow 'int' pages in the near term, but then again, 640K ought to be enough for anybody. This comes for free on 32 bit machines, so why not?
* Replace various spelling with FALLTHROUGH which is lint()ablecharnier2002-08-251-2/+2
|
* - Replace v_flag with v_iflag and v_vflagjeff2002-08-041-3/+5
| | | | | | | | | | | | | | | - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS
* o Convert two instances of vm_page_sleep_busy() to vm_page_sleep_if_busy()alc2002-08-031-7/+6
| | | | with appropriate page queue locking.
* o Acquire the page queues lock before calling vm_page_io_finish().alc2002-08-011-2/+4
| | | | o Assert that the page queues lock is held in vm_page_io_finish().
* o Replace vm_page_sleep_busy() with vm_page_sleep_if_busy()alc2002-07-301-7/+4
| | | | in vfs_busy_pages().
* o Use vm_page_alloc(... | VM_ALLOC_WIRED) in place of vm_page_wire().alc2002-07-191-4/+3
|
* Add support to UFS2 to provide storage for extended attributes.mckusick2002-07-191-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As this code is not actually used by any of the existing interfaces, it seems unlikely to break anything (famous last words). The internal kernel interface to manipulate these attributes is invoked using two new IO_ flags: IO_NORMAL and IO_EXT. These flags may be specified in the ioflags word of VOP_READ, VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that you want to do I/O to the normal data part of the file and IO_EXT means that you want to do I/O to the extended attributes part of the file. IO_NORMAL and IO_EXT are mutually exclusive for VOP_READ and VOP_WRITE, but may be specified individually or together in the case of VOP_TRUNCATE. For example, when removing a file, VOP_TRUNCATE is called with both IO_NORMAL and IO_EXT set. For backward compatibility, if neither IO_NORMAL nor IO_EXT is set, then IO_NORMAL is assumed. Note that the BA_ and IO_ flags have been `merged' so that they may both be used in the same flags word. This merger is possible by assigning the IO_ flags to the low sixteen bits and the BA_ flags the high sixteen bits. This works because the high sixteen bits of the IO_ word is reserved for read-ahead and help with write clustering so will never be used for flags. This merge lets us get away from code of the form: if (ioflags & IO_SYNC) flags |= BA_SYNC; For the future, I have considered adding a new field to the vattr structure, va_extsize. This addition could then be exported through the stat structure to allow applications to find out the size of the extended attribute storage and also would provide a more standard interface for truncating them (via VOP_SETATTR rather than VOP_TRUNCATE). I am also contemplating adding a pathconf parameter (for concreteness, lets call it _PC_MAX_EXTSIZE) which would let an application determine the maximum size of the extended atribute storage. Sponsored by: DARPA & NAI Labs.
* o Lock page queue accesses by vm_page_wire().alc2002-07-141-0/+6
|
* o Lock some page queue accesses, in particular, those by vm_page_unwire().alc2002-07-131-1/+6
|
* Replace the global buffer hash table with per-vnode splay trees using adillon2002-07-101-7/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | methodology similar to the vm_map_entry splay and the VM splay that Alan Cox is working on. Extensive testing has appeared to have shown no increase in overhead. Disadvantages Dirties more cache lines during lookups. Not as fast as a hash table lookup (but still N log N and optimal when there is locality of reference). Advantages vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem syncer operate more efficiently. I get to rip out all the old hacks (some of which were mine) that tried to keep the v_dirtyblkhd tailq sorted. The per-vnode splay tree should be easier to lock / SMPng pushdown on vnodes will be easier. This commit along with another that Alan is working on for the VM page global hash table will allow me to implement ranged fsync(), optimize server-side nfs commit rpcs, and implement partial syncs by the filesystem syncer (aka filesystem syncer would detect that someone is trying to get the vnode lock, remembers its place, and skip to the next vnode). Note that the buffer cache splay is somewhat more complex then other splays due to special handling of background bitmap writes (multiple buffers with the same lblkno in the same vnode), and B_INVAL discontinuities between the old hash table and the existence of the buffer on the v_cleanblkhd list. Suggested by: alc
* Fixed some printf format errors (one new one reported by gcc and 3 nearbybde2002-07-081-2/+2
| | | | old ones not reported by gcc). This helps unbreak LINT.
* Add two asserts that prove & document getblk and geteblk's behavior ofjeff2002-07-071-0/+2
| | | | returning locked bufs.
* Fix a mistake in my last commit. Don't grab an extra reference to the objectjeff2002-07-061-1/+0
| | | | in bp->b_object.
* Fixup uses of GETVOBJECT.jeff2002-07-061-15/+10
| | | | | | | | | | | - Cache a pointer to the vnode's object in the buf. - Hold a reference to that object in addition to the vnode's reference just to be consistent. - Cleanup code that got the object indirectly through the vp and VOP calls. This fixes at least one case where we were calling GETVOBJECT without a lock. It also avoids an expensive layered call at the cost of another pointer in struct buf.
* More 64 bits platforms warning fixes.mux2002-06-231-4/+4
| | | | Reviewed by: rwatson
* Fix a bug in vfs_bio_clrbuf(). The single-page-clrbuf optimization wasdillon2002-06-221-5/+10
| | | | | | | | improperly clearing more then just the invalid portions of the page. (This bug is not known to have been triggered by anything). Submitted by: tegge MFC after: 7 days
* This commit adds basic support for the UFS2 filesystem. The UFS2mckusick2002-06-211-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | filesystem expands the inode to 256 bytes to make space for 64-bit block pointers. It also adds a file-creation time field, an ability to use jumbo blocks per inode to allow extent like pointer density, and space for extended attributes (up to twice the filesystem block size worth of attributes, e.g., on a 16K filesystem, there is space for 32K of attributes). UFS2 fully supports and runs existing UFS1 filesystems. New filesystems built using newfs can be built in either UFS1 or UFS2 format using the -O option. In this commit UFS1 is the default format, so if you want to build UFS2 format filesystems, you must specify -O 2. This default will be changed to UFS2 when UFS2 proves itself to be stable. In this commit the boot code for reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c) as there is insufficient space in the boot block. Once the size of the boot block is increased, this code can be defined. Things to note: the definition of SBSIZE has changed to SBLOCKSIZE. The header file <ufs/ufs/dinode.h> must be included before <ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and ufs_lbn_t. Still TODO: Verify that the first level bootstraps work for all the architectures. Convert the utility ffsinfo to understand UFS2 and test growfs. Add support for the extended attribute storage. Update soft updates to ensure integrity of extended attribute storage. Switch the current extended attribute interfaces to use the extended attribute storage. Add the extent like functionality (framework is there, but is currently never used). Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@freebsd.org>
* Use "bwrbg" as description when we sleep for background writing,phk2002-06-061-1/+1
| | | | "biord" was misleading in every possible way.
* Remove a six year old undocumented #ifdef : NO_B_MALLOC.phk2002-05-041-12/+0
|
* Change callers of mtx_init() to pass in an appropriate lock type name. Injhb2002-04-041-1/+1
| | | | | | | most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64
* brelse() was improperly clearing B_DELWRI in the B_DELWRI|B_INVAL casedillon2002-04-031-5/+2
| | | | | | | | | | | without removing the buffer from the vnode's dirty buffer list, which can result in a panic in NFS. Replaced the code with a call to bundirty() which deals with it properly. PR: kern/36108, kern/36174 Submitted by: various people Special mention: to Danny Schales <dan@coes.LaTech.edu> for providing a core dump that helped me track this down. MFC after: 1 day
* Remove __P.alfred2002-03-191-2/+2
|
* Fixed some printf format errors (hopefully all of the remaining daddr64_tbde2002-03-191-8/+10
| | | | | | ones for GENERIC, and all others on the same line as those). Reformat the printfs if necessary to avoid new long lones or old format printf errors.
* Convert all pmap_kenter/pmap_kremove pairs in MI code to use pmap_qenter/jake2002-03-171-2/+2
| | | | | | | | | | | | | | | pmap_qremove. pmap_kenter is not safe to use in MI code because it is not guaranteed to flush the mapping from the tlb on all cpus. If the process in question is preempted and migrates cpus between the call to pmap_kenter and pmap_kremove, the original cpu will be left with stale mappings in its tlb. This is currently not a problem for i386 because we do not use PG_G on SMP, and thus all mappings are flushed from the tlb on context switches, not just user mappings. This is not the case on all architectures, and if PG_G is to be used with SMP on i386 it will be a problem. This was committed by peter earlier as part of his fine grained tlb shootdown work for i386, which was backed out for other reasons. Reviewed by: peter
* Introduce the new 64-bit size disk block, daddr64_t. Changemckusick2002-03-151-2/+2
| | | | | | | | | | | | the bio and buffer structures to have daddr64_t bio_pblkno, b_blkno, and b_lblkno fields which allows access to disks larger than a Terabyte in size. This change also requires that the VOP_BMAP vnode operation accept and return daddr64_t blocks. This delta should not affect system operation in any way. It merely sets up the necessary interfaces to allow the development of disk drivers that work with these larger disk block addresses. It also allows for the development of UFS2 which will use 64-bit block addresses.
* Giant pushdown for read/write/pread/pwrite syscalls.alfred2002-03-151-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | kern/kern_descrip.c: Aquire Giant in fdrop_locked when file refcount hits zero, this removes the requirement for the caller to own Giant for the most part. kern/kern_ktrace.c: Aquire Giant in ktrgenio, simplifies locking in upper read/write syscalls. kern/vfs_bio.c: Aquire Giant in bwillwrite if needed. kern/sys_generic.c Giant pushdown, remove Giant for: read, pread, write and pwrite. readv and writev aren't done yet because of the possible malloc calls for iov to uio processing. kern/sys_socket.c Grab giant in the socket fo_read/write functions. kern/vfs_vnops.c Grab giant in the vnode fo_read/write functions.
* * Move bswlist declaration and initialization from kern/vfs_bio.c toeivind2002-03-051-4/+15
| | | | | | vm/vm_pager.c, which is the only place it is used. * Make the QUEUE_* definitions and bufqueues local to vfs_bio.c. * constify buf_wmesg.
* Document all functions, global and static variables, and sysctls.eivind2002-03-051-69/+138
| | | | | | | | Includes some minor whitespace changes, and re-ordering to be able to document properly (e.g, grouping of variables and the SYSCTL macro calls for them, where the documentation has been added.) Reviewed by: phk (but all errors are mine)
* Back out all the pmap related stuff I've touched over the last few days.peter2002-02-271-2/+2
| | | | | | There is some unresolved badness that has been eluding me, particularly affecting uniprocessor kernels. Turning off PG_G helped (which is a bad sign) but didn't solve it entirely. Userland programs still crashed.
* Jake further reduced IPI shootdowns on sparc64 in loops by using rangedpeter2002-02-271-2/+2
| | | | | | | | shootdowns in a couple of key places. Do the same for i386. This also hides some physical addresses from higher levels and has it use the generic vm_page_t's instead. This will help for PAE down the road. Obtained from: jake (MI code, suggestions for MD part)
* GC: BIO_ORDERED, various infrastructure dealing with BIO_ORDERED.phk2002-02-221-2/+0
|
* Replace bowrite() with BUF_WRITE in ufs.phk2002-02-221-16/+0
| | | | | | | | | Remove bowrite(), it is now unused. This is the first step in getting entirely rid of BIO_ORDERED which is a generally accepted evil thing. Approved by: mckusick
* GC P_BUFEXHAUST leftovers, we've had a new mechanism to avoid bufferdillon2002-01-311-1/+0
| | | | | | cache lockups for over a year now. MFC after: 0 days
* This fixes a large number of bugs in our NFS client side code. A recentdillon2001-12-141-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit by Kirk also fixed a softupdates bug that could easily be triggered by server side NFS. * An edge case with shared R+W mmap()'s and truncate whereby the system would inappropriately clear the dirty bits on still-dirty data. (applicable to all filesystems) THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING. see vm/vm_page.c line 1641 * The straddle case for VM pages and buffer cache buffers when truncating. (applicable to NFS client side) * Possible SMP database corruption due to vm_pager_unmap_page() not clearing the TLB for the other cpu's. (applicable to NFS client side but could effect all filesystems). Note: not considered serious since the corruption occurs beyond the file EOF. * When flusing a dirty buffer due to B_CACHE getting cleared, we were accidently setting B_CACHE again (that is, bwrite() sets B_CACHE), when we really want it to stay clear after the write is complete. This resulted in a corrupt buffer. (applicable to all filesystems but probably only triggered by NFS) * We have to call vtruncbuf() when ftruncate()ing to remove any buffer cache buffers. This is still tentitive, I may be able to remove it due to the second bug fix. (applicable to NFS client side) * vnode_pager_setsize() race against nfs_vinvalbuf()... we have to set n_size before calling nfs_vinvalbuf or the NFS code may recursively vnode_pager_setsize() to the original value before the truncate. This is what was causing the user mmap bus faults in the nfs tester program. (applicable to NFS client side) * Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made by Kirk). Testing program written by: Avadis Tevanian, Jr. Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step') MFC after: 1 week
* The nbuf calculation was assuming that PAGE_SIZE = 4096 bytes, which isdillon2001-12-081-6/+12
| | | | | | | bogus. The calculation has been adjusted to use units of kilobytes. Noticed by: Chad David <davidc@acns.ab.ca> MFC after: 1 week
* Placemark an interrupt race in -current which is currently protected bydillon2001-11-081-0/+4
| | | | | | | Giant. -stable will get spl*() fixes for the race. Reported by: Rob Anderson <rob@isilon.com> MFC after: 0 days
* Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blockingdillon2001-11-051-4/+9
| | | | | | | | | | | | | in wdrain during a write. This flag needs to be used in devices whos strategy routines turn-around and issue another high level I/O, such as when MD turns around and issues a VOP_WRITE to vnode backing store, in order to avoid deadlocking the dirty buffer draining code. Remove a vprintf() warning from MD when the backing vnode is found to be in-use. The syncer of buf_daemon could be flushing the backing vnode at the time of an MD operation so the warning is not correct. MFC after: 1 week
* Documentationdillon2001-10-211-6/+3
| | | | MFC after: 1 day
* Change the kernel's ucred API as follows:jhb2001-10-111-10/+4
| | | | | | | | - crhold() returns a reference to the ucred whose refcount it bumps. - crcopy() now simply copies the credentials from one credential to another and has no return value. - a new crshared() primitive is added which returns true if a ucred's refcount is > 1 and false (0) otherwise.
OpenPOWER on IntegriCloud