path: root/sys/kern/vfs_subr.c
Commit message | Author | Age | Files | Lines
* In preparation for 10.3-RELEASE, temporarily revert the MFC of r291244 | marius | 2016-02-23 | 1 | -242/+80
    done as part of r292895 on stable/10, as that change causes hangs with
    ZFS and the cause, on at least amd64, is so far not understood.

    Discussed with: kib
    For further information see:
    https://lists.freebsd.org/pipermail/freebsd-stable/2016-February/084045.html
    PR: 207281
    Approved by: re (gjb)
* Hide the "unmount of /dev failed (BUSY)" warning at shutdown or reboot, | trasz | 2016-01-12 | 1 | -1/+1
    introduced with r293742, just like it was hidden before that commit.

    This is a direct commit to 10-STABLE; this special case is not needed
    in 11-CURRENT, because devfs supports forced unmounts there. The forced
    unmount could be MFC-ed, but there are some LORs at shutdown, and I have
    a weird feeling about it.

    Sponsored by: The FreeBSD Foundation
* MFC r287107: | trasz | 2016-01-12 | 1 | -28/+29
    Make vfs_unmountall() unmount /dev after /, not before. The only
    reason this didn't result in an unclean shutdown is that devfs ignores
    MNT_FORCE flag.

    Sponsored by: The FreeBSD Foundation
    Differential Revision: https://reviews.freebsd.org/D3467
* MFC of 291244, 291380, 291459, 291460, 291671, and 291743: | mckusick | 2015-12-30 | 1 | -124/+351
    This MFC includes changes to better manage the vnode freelist and to
    streamline the allocation and freeing of vnodes. Note that to maintain
    the KPI the VI_AGE flag is left defined in sys/vnode.h though its use is
    dropped as described in 291380. To maintain KBI the
    vfs.vlru_alloc_cache_src sysctl variable remains though it no longer has
    any effect as described in 291244.

    MFC of 291244:
    Move the comment about resident pages preventing vnode from leaving
    active list, into the header comment for vdrop(), which is the function
    that decides whether to leave the vnode on the list. Note that dirty
    page write-out in vinactive() is asynchronous.
    Discussed with: alc
    Sponsored by: The FreeBSD Foundation

    MFC of 291380:
    Remove VI_AGE vnode iflag, it is unused.
    Noted by: bde
    Sponsored by: The FreeBSD Foundation

    MFC of 291459:
    For performance reasons, it is useful to have a single string used as
    the name of a filesystem when setting it as the first parameter to the
    getnewvnode() function. Most filesystems call getnewvnode from just one
    place so can use a literal string as the first parameter. However, NFS
    calls getnewvnode from two places, so we create a global constant string
    that can be used by the two instances. This change also collapses two
    instances of getnewvnode() in the UFS filesystem to a single call.
    Reviewed by: kib
    Tested by: Peter Holm

    MFC of 291460:
    As the kernel allocates and frees vnodes, it fully initializes them on
    every allocation and fully releases them on every free. These are not
    trivial costs: it starts by zeroing a large structure then initializes a
    mutex, a lock manager lock, an rw lock, four lists, and six pointers.
    And looking at vfs.vnodes_created, these operations are being done
    millions of times an hour on a busy machine.

    As a performance optimization, this code update uses the uma_init and
    uma_fini routines to do these initializations and cleanups only as the
    vnodes enter and leave the vnode_zone. With this change the
    initializations are only done kern.maxvnodes times at system startup and
    then only rarely again. The frees are done only if the vnode_zone
    shrinks which never happens in practice. For those curious about the
    avoided work, look at the vnode_init() and vnode_fini() functions in
    kern/vfs_subr.c to see the code that has been removed from the main
    vnode allocation/free path.
    Reviewed by: kib
    Tested by: Peter Holm

    MFC of 291671:
    We need to zero out the union of pointers in a freed vnode structure.
    Fix from: Mateusz Guzik
    Tested by: Jason Unovitch

    MFC of 291743:
    We need to zero out the clustering variables in a freed vnode structure.
    For completeness add a VNASSERT that there are no threads waiting on a
    range lock (this was previously checked on every vnode free).
    Reported by: Rick Macklem
    Fix from: Mateusz Guzik
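The idea behind the r291460 change (pay the expensive setup once, when an object enters the cache, instead of on every allocation) can be illustrated outside the kernel. The following is a minimal user-space sketch, not the UMA API: `struct obj`, `struct obj_cache`, `obj_init()`, `obj_alloc()` and `obj_free()` are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct vnode; not the kernel structure. */
struct obj {
	int lock_ready;		/* stands in for mutex/lock/list setup */
	int data;		/* per-use state, reset on every allocation */
	struct obj *next;	/* free-list linkage */
};

/* Hypothetical cache; plays the role a zone plays in the commit text. */
struct obj_cache {
	struct obj *free;	/* objects that are already initialized */
	unsigned long inits;	/* how often the expensive init ran */
};

/* Expensive one-time setup, run only when memory enters the cache. */
static void
obj_init(struct obj_cache *c, struct obj *o)
{
	memset(o, 0, sizeof(*o));
	o->lock_ready = 1;
	c->inits++;
}

/* Fast path: reuse an initialized object and reset only per-use state. */
static struct obj *
obj_alloc(struct obj_cache *c)
{
	struct obj *o = c->free;

	if (o != NULL) {
		c->free = o->next;
	} else {
		o = malloc(sizeof(*o));
		if (o == NULL)
			return (NULL);
		obj_init(c, o);
	}
	o->data = 0;
	o->next = NULL;
	return (o);
}

/* Freeing keeps the object initialized and parks it on the free list. */
static void
obj_free(struct obj_cache *c, struct obj *o)
{
	o->next = c->free;
	c->free = o;
}
```

Allocating, freeing and allocating again returns the same object without running `obj_init()` a second time, which is the saving the commit message describes.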
* MFC r291379: | kib | 2015-12-04 | 1 | -5/+11
    Move the comment about resident pages preventing vnode from leaving
    active list, into the header comment for vdrop().
* MFC r273118 (by mjg) | smh | 2015-11-05 | 1 | -3/+6
    Don't take devmtx unnecessarily in vn_isdisk.

    Sponsored by: Multiplay
* MFC r287033: | trasz | 2015-10-18 | 1 | -4/+0
    After r286237 it should be fine to call vgone(9) on a busy GEOM vnode;
    remove KASSERT that would prevent forced devfs unmount from working.

    Sponsored by: The FreeBSD Foundation
* MFC r286281: | trasz | 2015-10-18 | 1 | -1/+1
    Mark vgonel() as static.

    Sponsored by: The FreeBSD Foundation
* MFC r288276: | markj | 2015-09-30 | 1 | -2/+2
    Fix argument ordering in vn_printf().
* MFC of 281677: | mckusick | 2015-09-22 | 1 | -2/+19
    More accurately collect name-cache statistics in sysctl functions
    sysctl_debug_hashstat_nchash() and sysctl_debug_hashstat_rawnchash().
    These changes are in preparation for allowing changes in the size of
    the vnode hash tables driven by increases and decreases in the maximum
    number of vnodes in the system.
    Reviewed by: kib@
    Phabric: D2265

    MFC of 287497:
    Track changes to kern.maxvnodes and appropriately increase or decrease
    the size of the name cache hash table (mapping file names to vnodes)
    and the vnode hash table (mapping mount point and inode number to
    vnode). An appropriate locking strategy is the key to changing hash
    table sizes while they are in active use.
    Reviewed by: kib
    Tested by: Peter Holm
    Differential Revision: https://reviews.freebsd.org/D2265
* MFC r285384: | kib | 2015-08-07 | 1 | -3/+3
    Do not allow creation of the dirty buffers for the dead buffer objects.
* MFC r284495: | kib | 2015-07-01 | 1 | -19/+27
    Keep a vnode which is freed but still owing inactivation, on the active
    list. This closes a race where such vnode is not msync-ed until reboot.
* MFC r283602: | kib | 2015-06-10 | 1 | -1/+2
    Prevent dounmount() from acting on the freed (although type-stable)
    memory by changing the interface to require the mount point to be
    referenced.

    MFC r283629:
    Add missed {}.
* MFC: r281562 | rmacklem | 2015-04-30 | 1 | -0/+1
    File systems that do not use the buffer cache (such as ZFS) must use
    VOP_FSYNC() to perform the NFS server's Commit operation. This patch
    adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file
    systems that use the buffer cache. If this flag is not set, the NFS
    server always does a VOP_FSYNC(). This should be ok for old file system
    modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is
    correct, although it might not be optimal for file systems that use the
    buffer cache.
* MFC 278760: | jhb | 2015-03-31 | 1 | -1/+13
    Add two new counters for vnode life cycle events:
    - vfs.recycles counts the number of vnodes forcefully recycled to avoid
      exceeding kern.maxvnodes.
    - vfs.vnodes_created counts the number of vnodes created by successful
      calls to getnewvnode().
* MFC 277712: | jhb | 2015-03-10 | 1 | -1/+1
    Change the default VFS timestamp precision from seconds to microseconds.
* MFC r279362: | kib | 2015-03-06 | 1 | -0/+1
    The VNASSERT in vflush() FORCECLOSE case is trying to panic early to
    prevent errors from yanking devices out from under filesystems. Only
    care about special vnodes on devfs; special nodes on other kinds of
    filesystems do not have special properties.
* MFC r278891: | ngie | 2015-03-01 | 1 | -0/+1
    Add the mnt_lockref field to the ddb(4) 'show mount' command

    Differential Revision: https://reviews.freebsd.org/D1688
    Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
    Sponsored by: EMC / Isilon Storage Division
* MFC r275743: | kib | 2014-12-20 | 1 | -10/+24
    Put the buffer cleanup code after inactivation.
* MFC r275620: | kib | 2014-12-15 | 1 | -2/+21
    Add functions syncer_suspend() and syncer_resume().

    MFC r275637:
    Remove local variable for real.
* MFC r269457: | kib | 2014-08-17 | 1 | -10/+21
    Remove Giant acquisition from the mount and unmount paths.
* MFC r269244: | kib | 2014-08-05 | 1 | -16/+25
    Remove one-time use macros which check for the vnode lifecycle.
* MFC r267392: | mav | 2014-06-22 | 1 | -0/+30
    Implement simple direct-mapped cache for popular filesystem identifiers
    to avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while
    traversing through the list of mount points.

    This change significantly improves NFS server scalability, since it had
    to do this translation for every request, and the global lock becomes
    quite congested.

    This code is more optimized for a relatively small number of mount
    points. On systems with hundreds of active mount points this simple
    cache may have many collisions. But the original traversal code in that
    case should also behave much worse, so we are not losing much.
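A direct-mapped cache of the kind r267392 describes can be sketched in user space as follows. This is an illustration under stated assumptions rather than the kernel code: `struct mount_entry`, `FSID_CACHE_SIZE`, `fsid_slot()` and `lookup_fsid()` are hypothetical names, the identifier is reduced to a single 64-bit value, and the locking and the invalidation a real implementation needs at unmount time are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	FSID_CACHE_SIZE	256	/* assumed size; must be a power of two */

/* Hypothetical stand-in for struct mount on a singly linked mount list. */
struct mount_entry {
	uint64_t fsid;
	struct mount_entry *next;
};

/* One slot per hash value: a newer entry simply overwrites an older one. */
static struct mount_entry *fsid_cache[FSID_CACHE_SIZE];

/* Map an identifier to its single possible cache slot. */
static unsigned
fsid_slot(uint64_t fsid)
{
	return ((unsigned)((fsid ^ (fsid >> 32)) & (FSID_CACHE_SIZE - 1)));
}

/*
 * Look up fsid. A cache hit avoids the list walk (and, in a real kernel,
 * the global list lock). *walks counts how often the slow path ran.
 */
static struct mount_entry *
lookup_fsid(struct mount_entry *list, uint64_t fsid, unsigned long *walks)
{
	unsigned slot = fsid_slot(fsid);
	struct mount_entry *mp = fsid_cache[slot];

	if (mp != NULL && mp->fsid == fsid)
		return (mp);			/* hit */

	(*walks)++;				/* miss: traverse the list */
	for (mp = list; mp != NULL; mp = mp->next) {
		if (mp->fsid == fsid) {
			fsid_cache[slot] = mp;	/* remember for next time */
			return (mp);
		}
	}
	return (NULL);
}
```

Two identifiers that map to the same slot evict each other on every lookup, which is the collision behaviour the commit message warns about for systems with hundreds of mount points.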
* MFC r267362: | mav | 2014-06-22 | 1 | -4/+1
    Remove unneeded mountlist_mtx acquisition from sync_fsync().

    All struct mount fields accessed by sync_fsync() are protected by
    MNT_MTX.
* MFC r267232, r267239: | mav | 2014-06-22 | 1 | -9/+11
    Use atomics to modify numvnodes variable.

    This mostly avoids lock usage in getnewvnode_[drop_]reserve(), which
    reduces the number of global vnode_free_list_mtx mutex acquisitions
    from 4 to 2 per NFS request on ZFS, improving SMP scalability.
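Updating a shared counter with atomic operations instead of taking a mutex, as r267232 does for numvnodes, looks roughly like this in portable C11. The sketch is not the kernel code: `numvnodes_sketch`, `maxvnodes_sketch`, `reserve_one()` and `release_one()` are hypothetical names, and the limit check is a simplification chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_long numvnodes_sketch;		/* hypothetical counter */
static const long maxvnodes_sketch = 4;		/* assumed limit */

/*
 * Try to account for one more object without taking a lock.
 * Returns false, leaving the counter unchanged, if the limit is reached.
 */
static bool
reserve_one(void)
{
	long old = atomic_fetch_add(&numvnodes_sketch, 1);

	if (old >= maxvnodes_sketch) {
		/* Over the limit: undo the optimistic increment. */
		atomic_fetch_sub(&numvnodes_sketch, 1);
		return (false);
	}
	return (true);
}

/* Give back one previously reserved unit. */
static void
release_one(void)
{
	atomic_fetch_sub(&numvnodes_sketch, 1);
}
```

The counter can briefly overshoot the limit between the increment and the undo, so this pattern suits accounting that tolerates a transient excess.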
* Do not flush buffers when the v_object of the passed vnode does not | kib | 2013-10-09 | 1 | -0/+2
    really belong to it. Such vnodes, with the pointers to other vnodes'
    v_objects, are typically instantiated by the bypass filesystems.
    Invalidating mappings of other vnode pages and the pages is wrong,
    since reclamation of the upper vnode does not imply that lower vnode is
    reclaimed too.

    One of the consequences of the improper reclamation was destruction of
    the wired mappings of the lower vnode pages, triggering miscellaneous
    assertions in the VM system.

    Reported by: John Marshall <john.marshall@riverwillow.com.au>
    Tested by: John Marshall <john.marshall@riverwillow.com.au>, pho
    Sponsored by: The FreeBSD Foundation
    MFC after: 1 week
    Approved by: re (gjb)
* When printing the vnode information from ddb, print the lengths of the | kib | 2013-10-01 | 1 | -2/+5
    dirty and clean buffer queues.

    Sponsored by: The FreeBSD Foundation
    MFC after: 1 week
    Approved by: re (gjb)
* For vunref(), try to upgrade the vnode lock if the function was called | kib | 2013-09-29 | 1 | -2/+4
    with the vnode shared-locked. If upgrade succeeded, the inactivation
    can be done immediately, instead of being postponed.

    Tested by: pho
    Sponsored by: The FreeBSD Foundation
    MFC after: 1 week
    Approved by: re (glebius)
* Acquire a hold reference on the vnode when a knote is instantiated. | kib | 2013-09-26 | 1 | -0/+2
    Otherwise, knote keeps a pointer to a vnode which could become invalid
    at any time.

    Reported by: many
    Tested by: Patrick Lamaiziere <patfbsd@davenulle.org>
    Discussed with: jmg
    Sponsored by: The FreeBSD Foundation
    MFC after: 1 week
    Approved by: re (marius)
* In r114945 the line 'nmp = TAILQ_NEXT(mp, mnt_list);' was duplicated. | pjd | 2013-08-17 | 1 | -6/+3
    Instead of just removing the duplicate, convert the loop to
    TAILQ_FOREACH().
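The conversion this commit describes can be shown on a small user-space list built with the `<sys/queue.h>` macros. The sketch below is illustrative only: `struct mnt`, `count_manual()` and `count_foreach()` are hypothetical names, and both functions just count entries so that the two loop shapes can be compared.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct mount linked through mnt_list. */
struct mnt {
	int id;
	TAILQ_ENTRY(mnt) mnt_list;
};
TAILQ_HEAD(mntlist, mnt);

/* Hand-rolled loop: the next pointer is fetched explicitly each pass. */
static int
count_manual(struct mntlist *head)
{
	struct mnt *mp, *nmp;
	int n = 0;

	for (mp = TAILQ_FIRST(head); mp != NULL; mp = nmp) {
		nmp = TAILQ_NEXT(mp, mnt_list);
		n++;
	}
	return (n);
}

/* Same traversal with TAILQ_FOREACH(): no separate next pointer to manage. */
static int
count_foreach(struct mntlist *head)
{
	struct mnt *mp;
	int n = 0;

	TAILQ_FOREACH(mp, head, mnt_list)
		n++;
	return (n);
}
```

With the macro there is no explicit `TAILQ_NEXT()` line in the loop body, so it cannot be duplicated by accident. The macro form is only appropriate when the loop body does not remove the current element.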
* When creation of the v_pollinfo raced and our instance of vpollinfo | kib | 2013-07-28 | 1 | -4/+11
    must be destroyed, knlist_clear() and seldrain() calls could be
    avoided, since vpollinfo was not used. Moreover, the knlist_clear()
    calling protocol requires the knlist locked, which is not true at the
    call site.

    Split the destruction into the helper destroy_vpollinfo_free(), and
    call it when raced, instead of destroy_vpollinfo().

    Reported and tested by: pho
    Sponsored by: The FreeBSD Foundation
    MFC after: 3 days
* Clear the vnode knotes before destroying vpollinfo. | kib | 2013-07-17 | 1 | -0/+2
    Reported and tested by: Patrick Lamaiziere <patfbsd@davenulle.org>
    Sponsored by: The FreeBSD Foundation
    MFC after: 2 weeks
* Be more generous when donating the current thread time to the owner of | kib | 2013-06-03 | 1 | -1/+1
    the vnode lock while iterating over the free vnode list. Instead of
    yielding, pause for 1 tick. The change is reported to help in some
    virtualized environments.

    Submitted by: Roger Pau Monné <roger.pau@citrix.com>
    Discussed with: jilles
    Tested by: pho
    MFC after: 2 weeks
* - Convert the bufobj lock to rwlock. | jeff | 2013-05-31 | 1 | -21/+11
    - Use a shared bufobj lock in getblk() and inmem().
    - Convert softdep's lk to rwlock to match the bufobj lock.
    - Move INFREECNT to b_flags and protect it with the buf lock.
    - Remove unnecessary locking around bremfree() and BKGRDINPROG.

    Sponsored by: EMC / Isilon Storage Division
    Discussed with: mckusick, kib, mdf
* - Add a new general purpose path-compressed radix trie which can be used | jeff | 2013-05-12 | 1 | -112/+55
      with any structure containing a uint64_t index. The tree code
      auto-generates type safe wrappers.
    - Eliminate the buf splay and replace it with pctrie. This is not only
      significantly faster with large files but also allows for the
      possibility of shared locking.

    Reviewed by: alc, attilio
    Sponsored by: EMC / Isilon Storage Division
* - Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The | kib | 2013-05-11 | 1 | -7/+18
      null_hashget() obtains the reference on the nullfs vnode, which must
      be dropped.
    - Fix a wart which existed from the introduction of the nullfs caching,
      do not unlock lower vnode in the nullfs_reclaim_lowervp(). It should
      be innocent, but now it is also formally safe. Inform the
      nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on
      nullfs inode.
    - Add a callback to the upper filesystems for the lower vnode
      unlinking. When inactivating a nullfs vnode, check if the lower vnode
      was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC on the
      lower vnode, and reclaim upper vnode if so. This allows nullfs to
      purge cached vnodes for the unlinked lower vnode, avoiding excessive
      caching.

    Reported by: Göran Löwkrantz <goran.lowkrantz@ismobile.com>
    Tested by: pho
    Sponsored by: The FreeBSD Foundation
    MFC after: 2 weeks
* Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE | marcel | 2013-05-09 | 1 | -1/+1
    locks. To support this, VNODE locks are created with the LK_IS_VNODE
    flag. This flag is propagated down using the LO_IS_VNODE flag.

    Note that WITNESS still records the LOR. Only the printing and the
    optional entering into the kernel debugger is bypassed with the
    WITNESS_NO_VNODE option.
* Add missing vdrop() in error case. | mdf | 2013-05-04 | 1 | -0/+1
    Submitted by: Fahad (mohd.fahadullah@isilon.com)
    MFC after: 1 week
* Allow the vnode to be unlocked for the weird case of | rmacklem | 2013-04-16 | 1 | -1/+1
    LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a usecount on a
    vnode during NFSv4 recovery from an expired lease.

    Reported and tested by: pho
    MFC after: 2 weeks
* Prepare to replace the buf splay with a trie: | jeff | 2013-04-06 | 1 | -19/+9
    - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf
      lists. No consumers need to find them there and it complicates the
      tree. These flags are all FFS specific and could be moved out of the
      buf cache.
    - Use pbgetvp() and pbrelvp() to associate the background and journal
      bufs with the vp. Not only is this much cheaper it makes more sense
      for these transient bufs.
    - Fix the assertions in pbget* and pbrel*. It's not safe to check list
      pointers which were never initialized. Use the BX flags instead. We
      also check B_PAGING in reassignbuf() so this should cover all cases.

    Discussed with: kib, mckusick, attilio
    Sponsored by: EMC / Isilon Storage Division
* Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() to | attilio | 2013-02-20 | 1 | -10/+10
    their "write" versions.

    Sponsored by: EMC / Isilon storage division
* Switch vm_object lock to be a rwlock. | attilio | 2013-02-20 | 1 | -0/+1
    * VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
    * VM_OBJECT_SLEEP() is introduced as a general purpose primitive to get
      a sleep operation using a VM_OBJECT_LOCK() as protection
    * The approach must bear with vm_pager.h namespace pollution so many
      files require including directly rwlock.h
* Add a trivial comment to record the proper commit log for r245407: | kib | 2013-01-14 | 1 | -0/+1
    Set the v_hash for a new vnode in the getnewvnode() to the value
    calculated based on the vnode structure address. Filesystems using
    vfs_hash_insert() override the v_hash using the standard formula of
    (inode_number + mnt_hashseed). For other filesystems, the
    initialization allows the vfs_hash_index() to provide useful hash too.

    Suggested, reviewed and tested by: peter
    Sponsored by: The FreeBSD Foundation
    MFC after: 5 days
* diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c | kib | 2013-01-14 | 1 | -0/+13
    index 7c243b6..0bdaf36 100644
    --- a/sys/kern/vfs_subr.c
    +++ b/sys/kern/vfs_subr.c
    @@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
     #define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt)
     #define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt)
     
    +static int vnsz2log;
     
     /*
      * Initialize the vnode management data structures.
    @@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
     static void
     vntblinit(void *dummy __unused)
     {
    +	u_int i;
     	int physvnodes, virtvnodes;
     
     	/*
    @@ -332,6 +334,9 @@ vntblinit(void *dummy __unused)
     	syncer_maxdelay = syncer_mask + 1;
     	mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF);
     	cv_init(&sync_wakeup, "syncer");
    +	for (i = 1; i <= sizeof(struct vnode); i <<= 1)
    +		vnsz2log++;
    +	vnsz2log--;
     }
     SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL);
     
    @@ -1067,6 +1072,14 @@ alloc:
     	}
     	rangelock_init(&vp->v_rl);
     
    +	/*
    +	 * For the filesystems which do not use vfs_hash_insert(),
    +	 * still initialize v_hash to have vfs_hash_index() useful.
    +	 * E.g., nullfs uses vfs_hash_index() on the lower vnode for
    +	 * its own hashing.
    +	 */
    +	vp->v_hash = (uintptr_t)vp >> vnsz2log;
    +
     	*vpp = vp;
     	return (0);
     }
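The arithmetic in this diff is easy to check in user space. The loop counts how many powers of two are no larger than the structure size and then subtracts one, which gives floor(log2(size)). Shifting an object's address right by that amount discards low-order bits that carry little information for objects of that size. The sketch below reuses the loop shape from the diff; `size_to_log2()` and `addr_hash()` are hypothetical helper names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * floor(log2(size)), computed with the same loop shape as the diff.
 * Assumes 1 <= size and size small enough that i never overflows,
 * which holds for any realistic structure size.
 */
static int
size_to_log2(size_t size)
{
	size_t i;
	int log = 0;

	for (i = 1; i <= size; i <<= 1)
		log++;
	return (log - 1);
}

/* Hash an address by dropping its low 'shift' bits. */
static uintptr_t
addr_hash(const void *p, int shift)
{
	return ((uintptr_t)p >> shift);
}
```

For a 500-byte structure the loop visits 1, 2, 4, ..., 256 (nine values), so the result is 8, and two addresses exactly 512 bytes apart hash to values that differ by one when shifted by 9.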
* Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1 | attilio | 2012-12-26 | 1 | -8/+1
    case. There is no point in optimizing further the code and use a TRUE
    literal for a path that does heavyweight stuff anyway (like lock acq),
    at the price of obfuscated code.

    Use the appropriate check where necessary and remove a macro.

    Sponsored by: EMC / Isilon storage division
    MFC after: 3 days
* Fixup r218424: uio_yield() was scaling directly to userland priority. | attilio | 2012-12-21 | 1 | -4/+4
    When kern_yield() was introduced with the possibility to specify a new
    priority, the behaviour changed by not lowering priority at all in the
    consumers, making the yielding mechanism highly ineffective for high
    priority kthreads like bufdaemon, syncer, vlrudaemon, etc. There is no
    evidence that consumers could bear such a change in semantics, and this
    situation could finally lead to bugs similar to the ones fixed in
    r244240. Re-specify userland pri for kthreads involved.

    Tested by: pho
    Reviewed by: kib, mdf
    MFC after: 1 week
* When mnt_vnode_next_active iterator cannot lock the next vnode and | kib | 2012-12-15 | 1 | -55/+51
    yields, specify the user priority for the yield. Otherwise, a
    higher-priority (kernel) thread could fall into the priority-inversion
    with the thread owning the mutex lock.

    On single-processor machines or UP kernels, do not loop adaptively when
    the next vnode cannot be locked, instead yield unconditionally.

    Restructure the iteration initializer and the iterator to remove code
    duplication. Put the code to fetch and lock a vnode next to the current
    marker, into the mnt_vnode_next_active() function, and use it instead
    of repeating the loop.

    Reported by: hrs, rmacklem
    Tested by: pho
    Sponsored by: The FreeBSD Foundation
    MFC after: 3 days
* Do not yield while owning a mutex. The Giant reacquire in the | kib | 2012-12-10 | 1 | -16/+18
    kern_yield() is problematic then.

    The owned mutex is the mount interlock, and it is in fact not needed to
    guarantee the stability of the mount list of active vnodes, so fix the
    issue by only taking the mount interlock for MNT_REF and MNT_REL
    operations.

    While there, augment the unconditional yield by some amount of
    spinning [1].

    Reported and tested by: pho
    Reviewed by: attilio
    Submitted by: attilio [1]
    MFC after: 3 days
* The vnode_free_list_mtx is required unconditionally when iterating | kib | 2012-12-03 | 1 | -4/+28
    over the active list. The mount interlock is not enough to guarantee
    the validity of the tailq link pointers. The __mnt_vnode_next_active()
    and __mnt_vnode_first_active() active list iterator helper functions
    did not provide the necessary stability for the list, allowing the
    iterators to pick garbage.

    This was uncovered after the r243599 made the active list iterators
    non-nop.

    Since a vnode interlock is before the vnode_free_list_mtx, obtain the
    vnode ilock in the non-blocking manner when under vnode_free_list_mtx,
    and restart iteration after the yield if the lock attempt failed.

    Assert that a vnode found on the list is active, and assert that the
    helpers return the vnode with interlock owned.

    Reported and tested by: pho
    MFC after: 1 week
* Take first active vnode correctly. | davidxu | 2012-11-27 | 1 | -1/+1
    Reviewed by: kib
    MFC after: 3 days