path: root/sys/kern/vfs_bio.c
Commit log (newest first); each entry is headed [author date | files changed | lines -removed/+added].
* [alc 2008-07-20 | 1 file | -13/+0]
  Eliminate dead code. (The commit message for revision 1.287 explains why
  this code is dead.)
* [attilio 2008-03-28 | 1 file | -8/+6]
  b_waiters cannot be adequately protected by the interlock because the
  interlock is dropped after the call to lockmgr(), so just revert this
  approach using something similar to the previous one: BUF_LOCKWAITERS()
  just checks whether there are waiters (not their actual number) and is
  based on the newly introduced lockmgr_waiters(), which reports whether
  the lockmgr has waiters or not. The name was chosen to differ from the
  old lockwaiters() so the two are not confused.
  This commit enriches the KPI, so a __FreeBSD_version bump and manpage
  update will happen soon. 'struct buf' also changes, so the kernel ABI is
  disturbed.
  Bug found by: jeff
  Approved by: jeff, kib
* [jeff 2008-03-22 | 1 file | -4/+5]
  - Complete part of the unfinished bufobj work by consistently using
    BO_LOCK/UNLOCK/MTX when manipulating the bufobj. (A sketch of the
    locking idiom follows this entry.)
  - Create a new lock in the bufobj to lock bufobj fields independently.
    This leaves the vnode interlock as an 'identity' lock while the bufobj
    is an io lock. The bufobj lock is ordered before the vnode interlock
    and also before the mnt ilock.
  - Exploit this new lock order to simplify softdep_check_suspend().
  - A few sync related functions are marked with a new XXX to note that we
    may not properly interlock against a non-zero bv_cnt when attempting
    to sync all vnodes on a mountlist. I do not believe this race is
    important. If I'm wrong, this will make these locations easier to find.
  Reviewed by: kib (earlier diff)
  Tested by: kris, pho (earlier diff)
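  A minimal sketch of the consistent bufobj locking idiom this entry
  describes; the field update shown is illustrative, not the committed
  diff:

      struct bufobj *bo = &vp->v_bufobj;

      BO_LOCK(bo);                    /* bufobj io lock, not the vnode interlock */
      bo->bo_flag |= BO_ONWORKLST;    /* illustrative bufobj field update */
      BO_UNLOCK(bo);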
* [kib 2008-03-21 | 1 file | -12/+10]
  Reduce contention on the vnode interlock by not acquiring the BO_LOCK
  around the check for BV_BKGRDINPROG in brelse() and bqrelse(). See the
  comment for the explanation of why this is safe.
  Tested by: pho
  Submitted by: jeff
* [jeff 2008-03-21 | 1 file | -30/+34]
  - Reduce contention on the global bdonelock and bpinlock by using a pool
    mutex to protect these sleep/wakeup/counter races. This still is
    preferable to bloating each bio with a mtx.
* [rwatson 2008-03-16 | 1 file | -1/+1]
  In keeping with style(9)'s recommendations on macros, use a ';' after
  each SYSINIT() macro invocation. This makes a number of lightweight C
  parsers much happier with the FreeBSD kernel source, including cflow's
  prcc and lxr. (See the sketch after this entry.)
  MFC after: 1 month
  Discussed with: imp, rink
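  A minimal sketch of the style(9) rule in question; the "foo" subsystem
  and its init function are hypothetical:

      #include <sys/param.h>
      #include <sys/kernel.h>
      #include <sys/systm.h>

      static void
      foo_init(void *arg __unused)
      {
              printf("foo initialized\n");
      }
      /* The trailing ';' after the macro invocation is what keeps
       * lightweight C parsers happy. */
      SYSINIT(foo_init_sysinit, SI_SUB_KLD, SI_ORDER_ANY, foo_init, NULL);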
* [attilio 2008-03-01 | 1 file | -6/+8]
  - Handle the buffer lock waiters count directly in the buffer cache
    instead of relying on the lockmgr support [1]:
    * bump the waiters count only if the interlock is held
    * let brelvp() return the waiters count
    * rely on brelvp() instead of BUF_LOCKWAITERS() to check the number
      of waiters
  - Remove namespace pollution introduced recently by lockmgr.h including
    lock.h: include lock.h directly in the consumers and make it mandatory
    for using lockmgr.
  - Modify the flags accepted by lockinit():
    * introduce LK_NOPROFILE, which disables lock profiling for the
      specified lockmgr
    * introduce LK_QUIET, which disables ktr tracing for the specified
      lockmgr [2]
    * disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there, so that they
      can only be used on a per-instance basis
  - Remove BUF_LOCKWAITERS() and lockwaiters(), as they are no longer used.
  This patch breaks the KPI, so __FreeBSD_version will be bumped and the
  manpages updated by further commits. Additionally, the 'struct buf'
  changes result in a disturbed ABI as well.
  [2] Really, there is currently no ktr tracing in the lockmgr, but it
  will be added soon.
  [1] Submitted by: kib
  Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
* [attilio 2008-02-13 | 1 file | -13/+11]
  - Add real assertions to the lockmgr locking primitives. A couple of
    notes on this:
    * WITNESS support, when enabled, is only used for shared locks in
      order to avoid problems with the "disowned" locks
    * KA_HELD and KA_UNHELD exist in the lockmgr namespace only in order
      to assert that a generic thread (not curthread) owns or does not
      own the lock. Really, this kind of check is bogus, but it seems
      very widespread in the consumers' code. So, for the moment, we
      cater to this untrusted behaviour, until the consumers are fixed
      and the options can be removed (hopefully during the 8.0-CURRENT
      lifecycle)
    * Implementing KA_HELD and KA_UNHELD (not supported natively by
      WITNESS) made necessary the introduction of LA_MASKASSERT, which
      specifies the range for the default lock assertion flags
    * In other respects, lockmgr_assert() follows exactly what other
      locking primitives offer for this operation.
  - Build real assertions for buffer cache locks on top of
    lockmgr_assert(). They can be used with the BUF_ASSERT_*(bp) paradigm
    (see the sketch after this entry).
  - Add checks at lock destruction time and use a cookie for verifying
    lock integrity at any operation.
  - Redefine BUF_LOCKFREE() so that it does not use a direct assert but
    relies on the aforementioned destruction-time check.
  The KPI is evidently broken, so a __FreeBSD_version bump and manpage
  update are necessary and will be committed soon.
  Side note: lockmgr_assert() will soon be used to implement real
  assertions in the vnode namespace, replacing the legacy and still bogus
  "VOP_ISLOCKED()" way.
  Tested by: kris (earlier version)
  Reviewed by: jhb
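  A hedged sketch of the BUF_ASSERT_*(bp) paradigm mentioned above; the
  surrounding function is hypothetical:

      static void
      example_flush(struct buf *bp)
      {
              BUF_ASSERT_LOCKED(bp);  /* caller must hold the buffer lock */
              /* ... manipulate bp safely here ... */
      }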
* [attilio 2008-01-19 | 1 file | -16/+14]
  - Introduce the function lockmgr_recursed(), which returns true if the
    lockmgr lkp, when held in exclusive mode, is recursed.
  - Introduce the function BUF_RECURSED(), which does the same for bufobj
    locks, built on top of lockmgr_recursed().
  - Introduce the function BUF_ISLOCKED(), which works like its
    counterpart VOP_ISLOCKED(9), showing the state of the lockmgr linked
    with the bufobj.
  BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of the
  bogus BUF_REFCNT() in a more explicit and SMP-compliant way (see the
  sketch after this entry). This allows us to axe BUF_REFCNT() and leave
  the function lockcount() totally unused in our stock kernel. Further
  commits will axe lockcount() as well, as part of the lockmgr() cleanup.
  The KPI is, obviously, broken, so further commits will update the
  manpages and FreeBSD version.
  Tested by: kris (on UFS and NFS)
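  A hedged sketch of the conversion this entry describes; the KASSERT
  messages are illustrative:

      /* Before: infer lock state from the bogus reference count. */
      KASSERT(BUF_REFCNT(bp) > 0, ("buffer not busy"));

      /* After: ask the lock itself. */
      KASSERT(BUF_ISLOCKED(bp), ("buffer not busy"));
      if (BUF_RECURSED(bp))
              panic("recursed buffer lock not expected here");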
* [attilio 2008-01-13 | 1 file | -2/+1]
  VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
  conjunction with a 'thread' argument that is always curthread. Remove
  the useless extra argument and pass curthread explicitly to lower-layer
  functions when necessary.
  The KPI is broken by this change, which should affect several ports, so
  a version bump and manpage update will be committed later.
  Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
* [attilio 2008-01-10 | 1 file | -1/+1]
  vn_lock() is currently only used with 'curthread' passed as argument.
  Remove this argument and pass curthread directly to the underlying
  VOP_LOCK1() VFS method. This change makes the code cleaner and, in
  particular, removes an annoying dependency, helping the upcoming
  lockmgr() cleanup.
  The KPI is, obviously, changed. The manpage and FreeBSD_version will be
  updated through further commits.
  As a side note, upcoming commits will address a similar cleanup of the
  VFS methods, in particular vop_lock1 and vop_unlock.
  Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale
  <whyx dot it at gmail dot com>
* [imp 2007-12-30 | 1 file | -11/+5]
  Rather than not redirtying the bp when we get ENXIO, only redirty it
  when the error is EIO. This catches a much larger class of errors that
  are unlikely to succeed if retried. (A sketch of the resulting
  disposition follows this entry.)
  Submitted by: bde
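  A hedged sketch of the resulting error disposition in brelse()-style
  handling, not the literal committed diff:

      if (bp->b_ioflags & BIO_ERROR) {
              if (bp->b_error == EIO) {
                      /* Transient I/O error: keep the data and retry. */
                      bdirty(bp);
              } else {
                      /* ENXIO and friends: retrying will not help. */
                      bp->b_flags |= B_INVAL;
              }
      }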
* [imp 2007-12-27 | 1 file | -0/+4]
  A partial solution to some of the 'pull the umass device with a mounted
  FS' problems. These are more along the lines of 'avoiding an avoidable
  panic' than a complete solution to removable devices. We now close the
  barn door after the horse has gotten loose and been hit by a truck, as
  it were. The barn no longer catches fire in this case, but the horse is
  still dead :-).
  The vfs_bio.c fix causes us not to put a failed write back into the
  dirty pool if the error returned was ENXIO. In that case, the buffer is
  treated like any other clean buffer being returned. ENXIO means the
  device isn't there anymore and will never be there again, so retrying
  is futile.
  The vfs_mount.c fix treats ENXIO as success for unmounting a file
  system. If the device is gone, retrying later won't help and we'll
  never be able to unmount the device.
  These two are part of a larger patch set submitted by the author. The
  other patches will be forthcoming. I added comments to these two
  patches.
  Submitted by: Henrik Gulbrandsen
  Reviewed by: phk@
  PR: usb/46176 (partial)
* [alc 2007-12-02 | 1 file | -5/+5]
  Eliminate vfs_page_set_valid()'s unused argument.
* [julian 2007-10-20 | 1 file | -1/+1]
  Rename the kthread_xxx (e.g. kthread_create()) calls to kproc_xxx, as
  they actually make whole processes. This makes way for us to add REAL
  kthread_create() and friends that actually make threads. It turns out
  that most of these calls actually end up being moved back to the thread
  version when it's added, but we need to make this cosmetic change
  first. I'd LOVE to do this rename in 7.0 so that we can eventually MFC
  the new kthread_xxx() calls.
* [ru 2007-09-26 | 1 file | -1/+1]
  Fix the description of the formula used to autosize the number of
  buffers in the buffer cache.
  Approved by: re (kensmith)
* [alc 2007-09-25 | 1 file | -10/+4]
  Change the management of cached pages (PQ_CACHE) in two fundamental
  ways:
  (1) Cached pages are no longer kept in the object's resident page splay
      tree and memq. Instead, they are kept in a separate per-object
      splay tree of cached pages. However, access to this new per-object
      splay tree is synchronized by the _free_ page queues lock, not to
      be confused with the heavily contended page queues lock.
      Consequently, a cached page can be reclaimed by vm_page_alloc(9)
      without acquiring the object's lock or the page queues lock.
      This solves a problem independently reported by tegge@ and Isilon.
      Specifically, they observed the page daemon consuming a great deal
      of CPU time because of pages bouncing back and forth between the
      cache queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The
      source of this problem turned out to be a deadlock avoidance
      strategy employed when selecting a cached page to reclaim in
      vm_page_select_cache(). However, the root cause was really that
      reclaiming a cached page required the acquisition of an object lock
      while the page queues lock was already held. Thus, this change
      addresses the problem at its root by eliminating the need to
      acquire the object's lock.
      Moreover, keeping cached pages in the object's primary splay tree
      and memq was, in effect, optimizing for the uncommon case. Cached
      pages are reclaimed far, far more often than they are reactivated.
      Instead, this change makes reclamation cheaper, especially in terms
      of synchronization overhead, and reactivation more expensive,
      because reactivated pages will have to be reentered into the
      object's primary splay tree and memq.
  (2) Cached pages are now stored alongside free pages in the physical
      memory allocator's buddy queues, increasing the likelihood that
      large allocations of contiguous physical memory (i.e., superpages)
      will succeed.
  Finally, as a result of this change, long-standing restrictions on when
  and where a cached page can be reclaimed and returned by
  vm_page_alloc(9) are eliminated. Specifically, calls to
  vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
  return a formerly cached page. Consequently, a call to malloc(9)
  specifying M_NOWAIT is less likely to fail.
  Discussed with: many over the course of the summer, including jeff@,
  Justin Husted @ Isilon, peter@, tegge@
  Tested by: an earlier version by kris@
  Approved by: re (kensmith)
* [marcel 2007-06-09 | 1 file | -0/+7]
  Work around an integer overflow in the expression `3 * maxbufspace / 4`
  when maxbufspace is larger than INT_MAX / 3. The overflow causes a hard
  hang on ia64 when physical memory is sufficiently large (8GB). (See the
  sketch after this entry.)
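  A hedged sketch of one way to avoid this overflow, not necessarily the
  committed fix; the hibufspace destination is an assumption drawn from
  the surrounding code. Dividing before multiplying keeps the
  intermediate value in range, at the cost of a tiny rounding difference:

      /* Overflows once maxbufspace > INT_MAX / 3 (int arithmetic): */
      hibufspace = 3 * maxbufspace / 4;

      /* Safe rearrangement -- divide first: */
      hibufspace = maxbufspace / 4 * 3;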
* [delphij 2007-06-08 | 1 file | -2/+2]
  In getblk(), before gbincore(), use BO_LOCK directly when locking the
  bufobj, rather than VI_LOCK, matching what was done in revision 1.453.
* [jeff 2007-06-01 | 1 file | -3/+3]
  - Move rusage from being per-process in struct pstats to per-thread in
    td_ru. This removes the requirement for per-process synchronization
    in statclock() and mi_switch(). This was previously supported by
    sched_lock, which is going away. All modifications to rusage are now
    done in the context of the owning thread; reads proceed without
    locks.
  - Aggregate an exiting thread's rusage in thread_exit() so that it is
    not lost.
  - Provide a new routine, rufetch(), to fetch an aggregate of all rusage
    structures from all threads in a process. This routine must be used
    in any place requiring a rusage from a process prior to its exit. The
    exited process's rusage is still available via p_ru. (See the sketch
    after this entry.)
  - Aggregate tick statistics only on demand via rufetch() or when a
    thread exits. Tick statistics are kept in the thread and protected by
    sched_lock until it exits.
  Initial patch by: attilio
  Reviewed by: attilio, bde (some objections), arch (mostly silent)
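  A hedged sketch of the rufetch() usage described above; p is assumed to
  be a valid struct proc pointer, and locking around it is elided:

      struct rusage ru;

      rufetch(p, &ru);  /* sums rusage across all live threads of p */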
* [attilio 2007-05-31 | 1 file | -4/+2]
  Revert the VMCNT_* operations introduction. Probably, a general
  approach is not the best solution here, so we should solve the
  sched_lock protection problems separately.
  Requested by: alc
  Approved by: jeff (mentor)
* [jeff 2007-05-18 | 1 file | -2/+4]
  - Define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
    vmcnts. This can be used to abstract away pcpu details, but it also
    changes all counters to use atomics now. This means sched_lock is no
    longer responsible for protecting counts in the switch routines.
  Contributed by: Attilio Rao <attilio@FreeBSD.org>
* [kib 2007-04-24 | 1 file | -2/+4]
  Disable nesting of BOP_BDFLUSH(). The VOP_FSYNC() call in bdwrite()
  could result in bdwrite() being reentered, thus causing infinite
  recursion. (See the sketch after this entry.)
  Reported and tested by: Peter Holm
  Reviewed by: tegge
  MFC after: 2 weeks
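  A hedged sketch of a reentry guard of the kind this entry describes,
  breaking the bdwrite() -> BOP_BDFLUSH() -> bdwrite() cycle with a
  per-thread flag; the TDP_INBDFLUSH spelling and the BO_BDFLUSH call
  form are assumptions on my part:

      struct thread *td = curthread;

      if ((td->td_pflags & TDP_INBDFLUSH) == 0) {
              td->td_pflags |= TDP_INBDFLUSH;
              BO_BDFLUSH(bp->b_bufobj, bp);  /* may call back into bdwrite() */
              td->td_pflags &= ~TDP_INBDFLUSH;
      }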
* [wkoszek 2007-03-29 | 1 file | -6/+2]
  vm_map_delete should be used only internally, by the VM subsystem.
  Replace it with vm_map_remove, which not only embeds an additional
  check, but also takes care of locking.
  Reviewed by: alc
  Approved by: alc, cognet (mentor)
* [kris 2007-03-25 | 1 file | -1/+1]
  Correct a comment typo.
* [julian 2007-03-08 | 1 file | -4/+4]
  Instead of doing comparisons using the pcpu area to see if a thread is
  an idle thread, just see if it has the IDLETD flag set. That flag will
  probably move to the pflags word, as it's permanent and never changes
  for the life of the system, so it doesn't need locking.
* [delphij 2007-02-22 | 1 file | -5/+5]
  Use LIST_EMPTY() instead of the unrolled version (LIST_FIRST() [!=]=
  NULL). (See the sketch after this entry.)
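  A minimal sketch of the queue(3) idiom swap; b_dep is one of the lists
  checked this way in vfs_bio.c, and the call made on the non-empty
  branch is illustrative:

      /* Before: open-coded emptiness check. */
      if (LIST_FIRST(&bp->b_dep) != NULL)
              buf_deallocate(bp);

      /* After: the idiomatic form. */
      if (!LIST_EMPTY(&bp->b_dep))
              buf_deallocate(bp);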
* [kib 2007-01-23 | 1 file | -42/+51]
  Cylinder group bitmaps and blocks containing inodes for a snapshot file
  are after snaplock, while other ffs device buffers are before snaplock
  in the global lock order. By itself, this could cause deadlock when
  bdwrite() tries to flush dirty buffers on a snapshotted ffs. If, during
  the flush, COW activity for the snapshot needs to allocate a block and
  ffs_alloccg() selects the cylinder group that is being written by
  bdwrite(), then the kernel would panic due to recursive buffer lock
  acquisition.
  Avoid dealing with buffers in bdwrite() that are from the other side of
  the snaplock divisor in the lock order than the buffer being written.
  Add a new BOP, bop_bdwrite(), to do dirty buffer flushing for the same
  vnode in bdwrite(). The default implementation, bufbdflush(), refactors
  the code from bdwrite(). For ffs device buffers, a specialized
  implementation is used.
  Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs
  changes)
  Tested by: Peter Holm
  X-MFC after: 3 weeks (if ever: it changes ABI)
* [kib 2006-12-20 | 1 file | -2/+9]
  Since rev. 1.514, iodone on an async buffer may happen before the code
  checks the vnode v_flag. For cluster buffers this would result in
  dereferencing a NULL b_vp. To prevent the panic, cache the relevant
  vnode flag before calling bstrategy.
  Reported by: Peter Holm, kris
  Tested by: Peter Holm
  Reviewed by: tegge
  Pointy hat to: kib
* [kib 2006-12-14 | 1 file | -1/+2]
  Resolve two deadlocks that could be caused by a busy md device backed
  by a vnode. Allow the md thread and the thread that owns the lock on
  the vnode backing the md device to do the write even when
  runningbufspace is exhausted.
  Tested by: Peter Holm
  Reviewed by: tegge
  MFC after: 2 weeks
* [alc 2006-10-29 | 1 file | -6/+16]
  Refactor vfs_setdirty(), creating vfs_setdirty_locked_object(). Call
  vfs_setdirty_locked_object() from vfs_busy_pages() instead of
  vfs_setdirty(), thereby eliminating a second acquisition and release of
  the same vm object lock.
* [alc 2006-10-28 | 1 file | -2/+8]
  In bufdone_finish(), restrict the acquisition and release of the page
  queues lock to BIO_READ operations. Recent changes to the
  implementation of the per-page flags have eliminated the need for the
  page queues lock in the other cases.
* [alc 2006-10-22 | 1 file | -3/+3]
  Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
  busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm-object
  lock instead of the global page queues lock.
* [tegge 2006-10-02 | 1 file | -0/+11]
  If the buffer lock has waiters after the buffer has changed identity,
  then getnewbuf() needs to drop the buffer in order to wake waiters that
  might sleep on the buffer in the context of the old identity.
* [alc 2006-08-09 | 1 file | -2/+0]
  Introduce a field to struct vm_page for storing flags that are
  synchronized by the lock on the object containing the page. Transition
  PG_WANTED and PG_SWAPINPROG to use the new field, eliminating the need
  for holding the page queues lock when setting or clearing these flags.
  Rename PG_WANTED and PG_SWAPINPROG to VPO_WANTED and VPO_SWAPINPROG,
  respectively.
  Eliminate the assertion that the page queues lock is held in
  vm_page_io_finish(). Eliminate the acquisition and release of the page
  queues lock around calls to vm_page_io_finish() in kern_sendfile() and
  vfs_unbusy_pages().
* [alc 2006-08-08 | 1 file | -1/+1]
  Reduce the scope of the page queues lock in vfs_busy_pages() now that
  vm_page_sleep_if_busy() no longer requires the caller to hold the page
  queues lock.
* [alc 2006-07-21 | 1 file | -7/+1]
  Eliminate OBJ_WRITEABLE. It hasn't been used in a long time.
* [jeff 2006-04-04 | 1 file | -1/+2]
  - Properly check against B_DELWRI and B_NEEDSGIANT. This check was
    incorrectly written and caused some !NEEDSGIANT buffers to be put in
    the NEEDSGIANT queue.
  Sponsored by: Isilon Systems, Inc.
* [jeff 2006-03-31 | 1 file | -16/+35]
  - Add the B_NEEDSGIANT flag, which is only set if the vnode that owns a
    buf requires Giant. It is set in bgetvp and cleared in brelvp. (See
    the sketch after this entry.)
  - Create QUEUE_DIRTY_GIANT for dirty buffers that require Giant.
  - In the buf daemon, only grab Giant when processing QUEUE_DIRTY_GIANT
    and only if we think there are buffers in that queue.
  Sponsored by: Isilon Systems, Inc.
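  A hedged sketch of the tagging described above; the VFS_NEEDSGIANT()
  predicate on the vnode's mount point is my assumption about how the
  Giant requirement is detected:

      /* In bgetvp(), while associating bp with vp: */
      if (VFS_NEEDSGIANT(vp->v_mount))
              bp->b_flags |= B_NEEDSGIANT;

      /* In brelvp(), while dissociating: */
      bp->b_flags &= ~B_NEEDSGIANT;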
* Destroy "bip" bio in error case.pjd2006-03-221-0/+1
| | | | | | Found by: Coverity Prevent analysis tool Coverity ID: 795 MFC after: 3 days
* [tegge 2006-02-02 | 1 file | -0/+5]
  In low memory situations, non-VMIO buffers didn't release pages back to
  the system when brelse() was called with B_RELBUF set on the buffer.
  This could be a problem when the system was low on memory, had many
  buffers on QUEUE_EMPTYKVA, and started to traverse directories. For
  each getnewbuf(), pages were allocated from the system, driving the
  free reserve downwards. For each brelse(), the system put the buffer on
  QUEUE_CLEAN with B_INVAL set.
  This commit changes the semantics of B_RELBUF to also free pages from
  non-VMIO buffers.
  Reviewed by: alc
* [alc 2006-01-23 | 1 file | -1/+0]
  Remove an unnecessary call to pmap_remove_all(). The given page is not
  mapped because its contents are invalid.
  Reviewed by: tegge
* [tegge 2006-01-16 | 1 file | -2/+4]
  Set the flag in needsbuffer while still holding bqlock to avoid a lost
  wakeup.
* [netchild 2005-12-31 | 1 file | -2/+2]
  MI changes:
  - Provide an interface (macros) to the page coloring part of the VM
    system. This allows trying different coloring algorithms without the
    need to touch every file. [1]
  - Make the page queue tuning values readable: sysctl
    vm.stats.pagequeue.
  - Autotune the page coloring values based upon the cache size instead
    of options in the kernel config (disabling page coloring via a kernel
    option is still possible).
  MD changes:
  - Cache size detection: only IA32 and AMD64 (untested) contain cache
    size detection code; every other arch just comes with a dummy
    function (this results in the use of default values, as was the case
    without the autotuning of the page coloring).
  - Print some more info on Intel CPUs (like we do on AMD and Transmeta
    CPUs).
  Note to AMD owners (IA32 and AMD64): please run "sysctl
  vm.stats.pagequeue" and report whether the cache* values are zero
  (= bug in the cache detection code) or not.
  Based upon work by: Chad David <davidc@acns.ab.ca> [1]
  Reviewed by: alc, arch (in 2004)
  Discussed with: alc, Chad David, arch (in 2004)
* [rodrigc 2005-12-07 | 1 file | -32/+130]
  Changes imported from the XFS for FreeBSD project:
  - Add fields to struct buf (needed by XFS):
    - three private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
    - b_pin_count, a pin count for the buffer
  - Add a new B_MANAGED flag.
  - Add a breada() function to initiate asynchronous I/O on read-ahead
    blocks (see the sketch after this entry).
  - Add bufdone_finish(), bpin(), and bunpin_wait() functions.
  Patches provided by: kan
  Reviewed by: phk
  Silence on: arch@
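  A hedged sketch of driving read-ahead with the new breada(); the block
  numbers and sizes are illustrative:

      daddr_t rablkno[2] = { blkno + 1, blkno + 2 };  /* blocks to read ahead */
      int rabsize[2] = { bsize, bsize };              /* their sizes */

      breada(vp, rablkno, rabsize, 2, NOCRED);  /* fire-and-forget async reads */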
* [rwatson 2005-10-31 | 1 file | -1/+1]
  Normalize a significant number of kernel malloc type names:
  - Prefer '_' to ' ', as it results in more easily parsed results in
    memory monitoring tools such as vmstat.
  - Remove punctuation that is incompatible with using memory type names
    as file names, such as '/' characters.
  - Disambiguate some collisions by adding subsystem prefixes to some
    memory types.
  - Generally prefer lower case to upper case.
  - If the same type is defined in multiple architecture directories,
    attempt to use the same name in additional cases.
  Not all instances were caught in this change, so more work is required
  to finish this conversion. Similar changes are required for UMA zone
  names.
* [tegge 2005-10-09 | 1 file | -2/+1]
  Release a clean buffer with the wrong size and no dependencies also in
  the non-VMIO case.
* [truckman 2005-09-30 | 1 file | -1/+1]
  Un-staticize waitrunningbufspace() and call it before returning from
  ffs_copyonwrite() if any async writes were launched. Restore the
  thread's previous TDP_NORUNNINGBUF state before returning from
  ffs_copyonwrite().
* [truckman 2005-09-30 | 1 file | -3/+3]
  Un-staticize runningbufwakeup() and staticize updateproc.
  Add a new private thread flag to indicate that the thread should not
  sleep if runningbufspace is too large. Set this flag on the bufdaemon
  and syncer threads so that they skip the waitrunningbufspace() call in
  bufwrite() rather than checking the proc pointer against the known proc
  pointers for these two threads (see the sketch after this entry). A way
  of preventing these threads from being starved for I/O but still
  placing limits on their outstanding I/O would be desirable.
  Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
  blocking on the runningbufspace check while holding snaplk. This
  prevents snaplk from being held for an arbitrarily long period of time
  if runningbufspace is high and greatly reduces the contention for
  snaplk. The disadvantage is that ffs_copyonwrite() can start a large
  amount of I/O if there are a large number of snapshots, which could
  cause a deadlock in other parts of the code.
  Call runningbufwakeup() in ffs_copyonwrite() to decrement
  runningbufspace before attempting to grab snaplk so that I/O requests
  waiting on snaplk are not counted in runningbufspace as being
  in-progress. Increment runningbufspace again before actually launching
  the original I/O request.
  Prior to the above two changes, the system could deadlock if enough I/O
  requests were blocked by snaplk to prevent runningbufspace from falling
  below lorunningspace and one of the bawrite() calls in
  ffs_copyonwrite() blocked in waitrunningbufspace() while holding
  snaplk. See <http://www.holm.cc/stress/log/cons143.html>
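  A hedged sketch of the bufwrite()-side check this entry describes; the
  flag spelling follows the TDP_NORUNNINGBUF name used in the previous
  entry:

      /* Threads marked TDP_NORUNNINGBUF skip the runningbufspace throttle. */
      if ((curthread->td_pflags & TDP_NORUNNINGBUF) == 0)
              waitrunningbufspace();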
* [peadar 2005-09-29 | 1 file | -3/+5]
  Close a race in biodone(), whereby the bio_done field of the passed bio
  may have been freed and reassigned by the wakeup before being tested
  after releasing the bdonelock.
  There's a non-zero chance this is the cause of a few of the crashes
  knocking around, with biodone() sitting in the stack backtrace.
  Reviewed by: phk@