path: root/sys/kern/vfs_bio.c
Commit message (author, age, files, lines -deleted/+added)
* MFC r285384: (kib, 2015-08-07, 1 file, -0/+6)
  Do not allow creation of dirty buffers for dead buffer objects.
* MFC r284887: (kib, 2015-07-11, 1 file, -1/+21)
  Handle errors from the background write of cylinder group blocks.
  MFC r284927: Simplify code.
  Approved by: re (gjb)
* MFC r284719: (kib, 2015-06-30, 1 file, -7/+9)
  Only take previous buffer queue lock (olock) when needed for REMFREE in
  binsfree().
* MFC: r281960 (rmacklem, 2015-05-14, 1 file, -6/+8)
  MAXBSIZE defines both the largest UFS block size and the largest size for
  a buffer in the buffer cache. This patch defines a new constant,
  MAXBCACHEBUF, which is the largest size for a buffer in the buffer cache.
  Having a separate constant allows MAXBCACHEBUF to be set larger than
  MAXBSIZE on a per-architecture basis, so that NFS can do larger
  reads/writes on those architectures. It modifies sys/param.h so that
  BKVASIZE can also be set on a per-architecture basis. A couple of cases
  where NFS used MAXBSIZE instead of NFS_MAXBSIZE are fixed as well.
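  The split itself is small in code terms. A minimal sketch, assuming the
  default stays equal to MAXBSIZE unless an architecture's machine/param.h
  predefines it:

    /*
     * Sketch of the MAXBCACHEBUF split (the default shown is an
     * assumption; an architecture can predefine the constant in
     * <machine/param.h> so that NFS can use buffer cache buffers
     * larger than the largest UFS block).
     */
    #ifndef MAXBCACHEBUF
    #define MAXBCACHEBUF    MAXBSIZE
    #endif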
* MFC r282085: (kib, 2015-05-04, 1 file, -29/+64)
  Partially revert r255986: do not call VOP_FSYNC() when helping bufdaemon
  in getnewbuf(); use buf_flush() instead. The difference is that bufdaemon
  uses TRYLOCK to get buffer locks, which allows calls to getnewbuf() while
  another buffer is locked.
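  The TRYLOCK distinction matters because the helper may already hold a
  buffer lock. A sketch of the non-blocking pattern inside a flush loop;
  the loop context and flag handling are illustrative:

    /*
     * Inside a queue-walking loop: never sleep for a buffer lock here,
     * since the caller of getnewbuf() may itself hold a buffer locked.
     */
    if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL) != 0)
        continue;               /* contended; try the next buffer */
    if ((bp->b_flags & B_DELWRI) != 0)
        bwrite(bp);             /* bwrite() releases the lock when done */
    else
        BUF_UNLOCK(bp);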
* MFC r275619: (kib, 2014-12-15, 1 file, -3/+7)
  Check bo_bufobj->bo_object for NULL and cache the value in a local
  variable to avoid a NULL dereference in getnewbuf_reuse_bp(). The vnode
  owning the buffer is not locked there.
* MFC r273638: (mav, 2014-10-28, 1 file, -1/+3)
  Revert the somewhat hackish geom_disk optimization, committed as part of
  r256880, and the following commit r273143, which was supposed to work
  around the introduced issue with a quite innocent-looking change. While
  there is no clear understanding why, r273143 is implicated in data
  corruption in some environments under high I/O load. I personally do not
  see any problem in that commit; possibly it is just a trigger for some
  other bug elsewhere, but better safe than sorry for now.
  Requested by: scottl@
* MFC r273143: Remove setting of the BIO_DONE flag for BIOs that have a
  done() method. (mav, 2014-10-19, 1 file, -3/+1)
  This fixes a use-after-free caused by geom_disk completing the same BIO
  twice (to save an extra allocation) and having BIO_DONE set after the
  first completion.
* MFC: r272718 (jkim, 2014-10-15, 1 file, -3/+6)
  Make kern.nswbuf tunable from loader.
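  Registering such a knob is typically a one-liner. A sketch, assuming the
  existing nswbuf variable is simply hooked into the tunables machinery:

    /* Let loader.conf set the number of swap buffers before autotuning. */
    TUNABLE_INT("kern.nswbuf", &nswbuf);

  After that, a line such as kern.nswbuf="512" in /boot/loader.conf takes
  effect at boot.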
* MFC of 269533 (by mckusick): (mckusick, 2014-08-18, 1 file, -1/+2)
  Add support for multi-threading of soft updates. Replace the single soft
  updates thread with a thread per FFS-filesystem mount point. The threads
  are associated with the bufdaemon process.
  Reviewed by: kib
  Tested by: Peter Holm and Scott Long
  MFC after: 2 weeks
  Sponsored by: Netflix

  MFC of 269853 (by kib):
  Revision r269457 removed the Giant around the mount and unmount code, but
  r269533, which was tested before r269457 was committed, implicitly relied
  on Giant to protect the manipulations of the softdepmounts list. Use the
  softdep global lock consistently to protect the list structure now.
  Insert the new struct mount_softdeps into softdepmounts only after it is
  sufficiently initialized, to prevent softdep_speedup() from accessing
  bare memory. Similarly, remove the struct mount_softdeps of the
  unmounted filesystem from the tailq before destroying the structure's
  rwlock.
  Reported and tested by: pho
  Reviewed by: mckusick
  Sponsored by: The FreeBSD Foundation
* MFC r267255: (kib, 2014-06-22, 1 file, -27/+42)
  Change the nblock mutex to rwlock.
  MFC r267264: Devolatile as needed.
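  A sketch of the mechanical part of such a conversion; the flag update is
  illustrative and the exact converted sites are not shown:

    static struct rwlock nblock;        /* was: static struct mtx nblock */

    rw_init(&nblock, "needsbuffer lock");
    rw_wlock(&nblock);                  /* writers: was mtx_lock(&nblock) */
    needsbuffer |= VFS_BIO_NEED_ANY;
    rw_wunlock(&nblock);

  The "devolatile" follow-up relates to needsbuffer being declared
  volatile, which the rwlock-based accessors need cast away.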
* MFC r267226: (kib, 2014-06-15, 1 file, -8/+0)
  Initialize the pbuf counter for directio using SYSINIT. Mark
  ffs_rawread.c as requiring both the ffs and directio options to be
  compiled into the kernel. Add ffs_rawread.c to the list of the ufs.ko
  module's sources.
* MFC Alexander Motin's GEOM direct dispatch work: (scottl, 2014-01-07,
  1 file, -10/+5)
  r256603: Introduce a new function, devstat_end_transaction_bio_bt(),
  adding a new argument to specify the present time. Use this function to
  move binuptime() out of the lock, substantially reducing lock congestion
  when a slow timecounter is used.
  r256606: Move g_io_deliver() out of the lock, as required for direct
  dispatch. Move g_destroy_bio() out too, to reduce the lock scope even
  more.
  r256607: Fix passing an uninitialized bio_resid argument to g_trace().
  r256610: Add unmapped I/O support to GEOM RAID.
  r256830: Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodone() when
  unmapping a temporarily mapped buffer. That fixes a double unmap if
  biodone() is called twice for the same BIO (but with different done
  methods).
  r256880: Merge the GEOM direct dispatch changes from the
  projects/camlock branch. When safety requirements are met, this allows
  I/O requests to bypass the GEOM g_up/g_down threads and execute directly
  in the caller context. That avoids CPU bottlenecks in the g_up/g_down
  threads, plus several context switches per I/O.
  r259247: Fix a bug introduced in r256607. We have to recalculate
  bp_resid here, since the sizes of the original and completed requests
  may differ due to end of media.
  Testing of the stable/10 merge was done by Netflix, but all of the
  credit goes to Alexander and iX Systems.
  Submitted by: mav
  Sponsored by: iX Systems
* MFC r256614: (mav, 2014-01-05, 1 file, -10/+9)
  - Take the BIO lock in biodone() only when no completion callback is
    set, and we therefore need to wake up a thread waiting in biowait().
  - Remove the msleep() timeout from biowait(). It was added 11 years ago,
    when no locks were used, and it should not be needed any more.
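  A sketch of the resulting dispatch; the pooled-mutex lookup mirrors how
  the kernel pairs a lock with a bio, but treat the details as
  illustrative:

    static void
    biodone_sketch(struct bio *bp)
    {
        struct mtx *mtxp;

        if (bp->bio_done != NULL) {
            /* A callback consumer never sleeps in biowait():
             * no lock and no wakeup are needed. */
            bp->bio_done(bp);
            return;
        }
        mtxp = mtx_pool_find(mtxpool_sleep, bp);
        mtx_lock(mtxp);
        bp->bio_flags |= BIO_DONE;
        wakeup(bp);                 /* releases the biowait() sleeper */
        mtx_unlock(mtxp);
    }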
* (kib, 2013-10-09, 1 file, -1/+2)
  The device vnodes are often unlocked when bread() or bwrite() is called.
  This probably should be fixed eventually, but for now it is not needed
  to try to flush such vnodes from the buffer allocation context.
  Reported and tested by: pho
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Approved by: re (gjb)
* (kib, 2013-10-02, 1 file, -63/+41)
  When helping the bufdaemon from the buffer allocation context, there is
  no sense in walking the whole dirty buffer queue. We are only interested
  in, and can operate on, the buffers owned by the current vnode [1].
  Instead of calling the generic queue flush routine, do VOP_FSYNC() if
  possible.

  Holding the dirty buffer queue lock in the bufdaemon, without dropping
  it, can cause starvation of buffer writes from other threads. This is
  especially easy to reproduce on big-memory machines, where large files
  are written, causing almost all dirty buffers to accumulate in several
  big files whose vnodes are locked by writers. The bufdaemon cannot flush
  any buffer, but iterates over the whole dirty queue continuously. Since
  the dirty queue mutex is not dropped, bufdone() in the g_up thread is
  starved, usually deadlocking the machine [2]. Mitigate this by dropping
  the queue lock after the vnode is locked, allowing other queue lock
  contenders to make progress.
  Discussed with: Jeff [1]
  Reported by: pho [2]
  Tested by: pho
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks
  Approved by: re (hrs)
* Reimplement r255797 using LK_TRYUPGRADE. (kib, 2013-09-29, 1 file, -2/+14)
  The r255797 was:
  Increase the chance of the buffer write from the bufdaemon helper context
  to succeed. If the locked vnode which owns the buffer to be written is
  shared locked, try the non-blocking upgrade of the lock to exclusive.
  PR: kern/178997
  Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Approved by: re (glebius)
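  The key property of LK_TRYUPGRADE is that a failed upgrade leaves the
  shared lock held, unlike LK_UPGRADE | LK_NOWAIT (see the revert below).
  A sketch of the intended use, with the surrounding bufdaemon-helper
  logic and error handling elided:

    /*
     * If the vnode is only shared-locked, try a non-blocking upgrade;
     * on failure the shared lock is still held and the buffer is
     * simply skipped.
     */
    if (VOP_ISLOCKED(vp) == LK_SHARED && vn_lock(vp, LK_TRYUPGRADE) != 0)
        return (EAGAIN);        /* hypothetical error handling */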
* Revert r255797: the LK_UPGRADE | LK_NOWAIT combination drops the lock on
  failure. (kib, 2013-09-22, 1 file, -14/+2)
  Approved by: re (marius, implicit)
* (kib, 2013-09-22, 1 file, -2/+14)
  Increase the chance of the buffer write from the bufdaemon helper context
  to succeed. If the locked vnode which owns the buffer to be written is
  shared locked, try the non-blocking upgrade of the lock to exclusive.
  PR: kern/178997
  Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Approved by: re (marius)
* (kib, 2013-09-08, 1 file, -0/+6)
  Drain the xbusy state in two places which potentially do
  pmap_remove_all(). Not doing the drain allows pmap_enter() to proceed in
  parallel, voiding the effects of pmap_remove_all(). The race results in
  an invalidated page mapped wired by usermode.
  Reported and tested by: pho
  Reviewed by: alc
  Sponsored by: The FreeBSD Foundation
  Approved by: re (glebius)
* (kib, 2013-09-05, 1 file, -2/+2)
  The vm_pageout_flush() function sbusies pages in the passed page run.
  After that, the pager put method is called, usually translated to
  VOP_WRITE(). For the filesystems which use the buffer cache, bufwrite()
  sbusies the buffer pages again, waiting for the xbusy state to drain.
  The latter is done in vfs_drain_busy_pages(), which is called with the
  buffer pages already sbusied (by vm_pageout_flush()). Since
  vfs_drain_busy_pages() can only wait for one page at a time, and during
  the wait the object lock is dropped, previous pages in the buffer must
  be protected from other threads busying them. Until now, this was done
  by xbusying the pages, which is incompatible with the sbusy state in the
  new implementation of busy. Switch to sbusy.
  Reported and tested by: pho
  Sponsored by: The FreeBSD Foundation
* (kib, 2013-08-22, 1 file, -2/+1)
  Both cluster_rbuild() and cluster_wbuild() sometimes set the pages shared
  busy without first draining the hard busy state. Previously this went
  unnoticed, since the VPO_BUSY and m->busy fields were distinct and
  vm_page_io_start() did not verify that the passed page had the VPO_BUSY
  flag cleared, but such a page state is wrong. The new implementation is
  stricter and catches this case. Drain the busy state as needed, before
  calling vm_page_sbusy().
  Tested by: pho, jkim
  Sponsored by: The FreeBSD Foundation
* Remove the deprecated VM_ALLOC_RETRY flag for vm_page_grab(9). (kib,
  2013-08-22, 1 file, -1/+1)
  The flag was mandatory since r209792, where vm_page_grab(9) was changed
  to only support the alloc retry semantic.
  Suggested and reviewed by: alc
  Sponsored by: The FreeBSD Foundation
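  A sketch of the call-site change; object, pindex and the companion flag
  are placeholders:

    /* Before: the by-then meaningless flag still had to be passed. */
    m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL | VM_ALLOC_RETRY);
    /* After: r209792 made retry the only semantic, so the flag is gone. */
    m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL);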
* The soft and hard busy mechanisms rely on the vm object lock to work.
  (attilio, 2013-08-09, 1 file, -35/+32)
  Unify the two concepts into a real, minimal sxlock where the shared
  acquisition represents the soft busy and the exclusive acquisition
  represents the hard busy.
  The old VPO_WANTED mechanism becomes the hard path for this new lock,
  and it becomes per-page rather than per-object. The vm_object lock
  becomes an interlock for this functionality: it can be held in either
  read or write mode. However, if the vm_object lock is held in read mode
  while acquiring or releasing the busy state, the thread owner cannot
  make any assumption about the busy state unless it is also busying it.
  Also:
  - Add a new flag to directly share busy pages while vm_page_alloc and
    vm_page_grab are being executed. This will be very helpful once these
    functions happen under a read object lock.
  - Move the swapping sleep into its own per-object flag.
  The KPI is heavily changed, which is why the version is bumped. It is
  very likely that some VM ports users will need to change their own code.
  Sponsored by: EMC / Isilon storage division
  Discussed with: alc
  Reviewed by: jeff, kib
  Tested by: gavin, bapt (older version)
  Tested by: pho, scottl
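  A sketch of the resulting per-page protocol, with the object lock acting
  as the interlock described above; error paths are omitted:

    VM_OBJECT_WLOCK(object);
    vm_page_sbusy(m);           /* soft busy == shared acquisition */
    VM_OBJECT_WUNLOCK(object);

    /* ... inspect the page or start I/O on it ... */

    vm_page_sunbusy(m);         /* release; vm_page_xbusy()/xunbusy()
                                   are the exclusive (hard busy) pair */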
* Replace kernel virtual address space allocation with vmem. This provides
  transparent layering and better fragmentation. (jeff, 2013-08-07,
  1 file, -1/+1)
  - Normalize functions that allocate memory to use kmem_*
  - Those that allocate address space are named kva_*
  - Those that operate on maps are named kmap_*
  - Implement recursive allocation handling for kmem_arena in vmem.
  Reviewed by: alc
  Tested by: pho
  Sponsored by: EMC / Isilon Storage Division
* Assert that runningbufspace does not underflow. (kib, 2013-07-13,
  1 file, -3/+5)
  Sponsored by: The FreeBSD Foundation
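  A sketch of the assertion pattern matching this description; the local
  variable names are illustrative:

    /* Subtract and fetch the previous value atomically, then verify
     * that the counter could not have gone negative. */
    space = atomic_fetchadd_long(&runningbufspace, -bspace);
    KASSERT(space >= bspace,
        ("runningbufspace underflow %ld %ld", space, bspace));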
* There is no need to count waiters for the runningbufspace. (kib,
  2013-07-13, 1 file, -1/+1)
  Sponsored by: The FreeBSD Foundation
* (kib, 2013-07-11, 1 file, -1/+2)
  Do not invalidate a page of the B_NOCACHE buffer, or of a buffer after
  an I/O error, if any user-wired mappings exist. Doing the invalidation
  destroys the user wiring. This change is a temporary measure to close
  the bug; the more proper fix is to always delegate the invalidation of
  the page to the upper layers.
  Reported and tested by: pho
  Reviewed by: alc
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks
* Make kassert_printf use __printflike. (alfred, 2013-07-07, 1 file, -2/+2)
  Fix associated errors/warnings while I'm here.
  Requested by: avg
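  For reference, __printflike(fmtarg, firstvararg) annotates a declaration
  so the compiler cross-checks the format string against its arguments; a
  sketch of the annotated prototype:

    /* Argument 1 is the format string; varargs start at argument 2. */
    static void kassert_printf(const char *fmt, ...) __printflike(1, 2);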
* (jeff, 2013-06-28, 1 file, -15/+8)
  - Add a general purpose resource allocator, vmem, from NetBSD. It was
    originally inspired by the Solaris vmem detailed in the proceedings of
    USENIX 2001. The NetBSD version was heavily refactored for bugs and
    simplicity.
  - Use this resource allocator to allocate the buffer and transient maps.
    Buffer cache defrags are reduced by 25% when used by filesystems with
    mixed block sizes. Ultimately this may permit dynamic buffer cache
    sizing on low-KVA machines.
  Discussed with: alc, kib, attilio
  Tested by: pho
  Sponsored by: EMC / Isilon Storage Division
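  A minimal sketch of the vmem KPI as used for a KVA arena; the arena
  name, base, and sizes are placeholders:

    vmem_t *arena;
    vmem_addr_t addr;
    int error;

    /* Create an arena managing [base, base + size) in PAGE_SIZE quanta. */
    arena = vmem_create("buffer arena", base, size, PAGE_SIZE, 0, M_WAITOK);
    error = vmem_alloc(arena, bufsize, M_BESTFIT | M_WAITOK, &addr);
    if (error == 0) {
        /* ... use the KVA range starting at addr ... */
        vmem_free(arena, addr, bufsize);
    }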
* (jeff, 2013-06-05, 1 file, -264/+305)
  - Consolidate duplicate code into support functions.
  - Split the bqlock into bqclean and bqdirty locks.
  - Only acquire the wakeup synchronization locks when we cross a
    threshold requiring them.
  - Restructure the way flushbufqueues() targets work so they are more
    SMP-friendly and sane.
  Reviewed by: kib
  Discussed with: mckusick, attilio
  Sponsored by: EMC / Isilon Storage Division
* (kib, 2013-06-03, 1 file, -1/+2)
  When auto-sizing the buffer cache, limit the amount of physical memory
  used as the estimate of size to 32GB. This provides around 100K buffer
  headers and corresponding KVA for the buffer map at the peak. Sizing the
  cache larger is not useful, and also wastes and exhausts KVA on large
  machines.
  Reported and tested by: bdrewery
  Sponsored by: The FreeBSD Foundation
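  The clamp itself is simple; a sketch, assuming the estimate is carried
  in kilobytes as in kern_vfs_bio_buffer_alloc():

    /* Cap the physical memory estimate at 32GB (expressed in KB) so the
     * buffer cache, and thus its KVA, stops growing on larger machines. */
    if (physmem_est > 32 * 1024 * 1024)
        physmem_est = 32 * 1024 * 1024;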
* Reduce the scope of the VM object locking in brelse(). (alc, 2013-06-02,
  1 file, -2/+4)
  In my tests, this change reduced the total number of VM object lock
  acquisitions by brelse() by 74%.
  Sponsored by: EMC / Isilon Storage Division
* (jeff, 2013-05-31, 1 file, -95/+46)
  - Convert the bufobj lock to rwlock.
  - Use a shared bufobj lock in getblk() and inmem().
  - Convert softdep's lk to rwlock to match the bufobj lock.
  - Move INFREECNT to b_flags and protect it with the buf lock.
  - Remove unnecessary locking around bremfree() and BKGRDINPROG.
  Sponsored by: EMC / Isilon Storage Division
  Discussed with: mckusick, kib, mdf
* vm_object locking is not needed there as pages are already wired.
  (attilio, 2013-05-21, 1 file, -2/+0)
  Sponsored by: EMC / Isilon storage division
  Submitted by: alc
* Use readlocking now that assertions on vm_page_lookup() are relaxed.
  (attilio, 2013-05-17, 1 file, -3/+3)
  Sponsored by: EMC / Isilon storage division
  Reviewed by: alc
  Tested by: flo, pho
* (kib, 2013-03-27, 1 file, -15/+26)
  Add a dev_strategy_csw() function, which is similar to dev_strategy()
  but assumes that a thread reference was already obtained on the passed
  device. Use the function from physio() to avoid an extra pair of dev_mtx
  lock and unlock operations. Note that physio() is always used as the
  cdevsw method, or is called from a cdevsw method, and the caller already
  owns the reference.
  dev_strategy() is left in place to keep the KPI intact, but it is now
  implemented as a wrapper around dev_strategy_csw().
  Do some style cleanup in physio().
  Requested and reviewed by: kan (previous version)
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks
* On i386, double the default size of the bio transient map. (kib,
  2013-03-27, 1 file, -6/+13)
  With the maxbcache size fixed, the auto-tuned transient map is too small
  for real-world load on i386.
  Tested by: David Wolfskill
  Sponsored by: The FreeBSD Foundation
* Only size and create the bio_transient_map when unmapped buffers are
  enabled. (kib, 2013-03-21, 1 file, -1/+1)
  Now, disabling the unmapped buffers should result in a kernel memory map
  identical to pre-r248550.
  Sponsored by: The FreeBSD Foundation
* (kib, 2013-03-20, 1 file, -2/+7)
  In bufwrite(), a dirty buffer is moved to the clean queue before the
  bufobj counter of writes in progress is incremented. Another thread
  inspecting the bufobj would consider it clean. For regular vnodes, the
  vnode lock is typically held both by the thread performing the
  bufwrite() and another thread doing syncing, which prevents the
  situation. Writes to VCHR vnodes, on the other hand, are done without
  holding the vnode lock. Increment the write ref counter for the buffer
  object before calling bundirty().
  Sponsored by: The FreeBSD Foundation
  Tested by: pho
  MFC after: 2 weeks
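  The fix is an ordering change; a sketch of the corrected sequence in
  bufwrite():

    /*
     * Publish the in-flight write before the buffer leaves the dirty
     * queue, so a thread inspecting the bufobj never sees it clean
     * with no write in progress.
     */
    bufobj_wref(bp->b_bufobj);      /* count the write in progress first */
    bundirty(bp);                   /* then move buf to the clean queue */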
* Do not remap usermode pages into KVA for physio. (kib, 2013-03-19,
  1 file, -7/+17)
  Sponsored by: The FreeBSD Foundation
  Tested by: pho
* (kib, 2013-03-19, 1 file, -0/+26)
  Add a helper function, vfs_bio_bzero_buf(), to zero a portion of the
  buffer, transparently handling mapped or unmapped buffers. Its intent is
  to replace the use of bzero(bp->b_data) in cases where the buffer might
  be unmapped, to avoid unneeded upgrades.
  Sponsored by: The FreeBSD Foundation
  Tested by: pho
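  A sketch of the substitution at a typical call site; base and size are
  placeholders:

    /* Before: touches b_data, which forces a mapping for unmapped bufs. */
    bzero(bp->b_data + base, size);
    /* After: zeroes via b_pages when the buffer has no KVA mapping. */
    vfs_bio_bzero_buf(bp, base, size);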
* (kib, 2013-03-19, 1 file, -262/+701)
  Implement the concept of unmapped VMIO buffers, i.e. buffers which do
  not map their b_pages pages into buffer_map KVA. The use of unmapped
  buffers eliminates the need to perform TLB shootdowns for mappings on
  buffer creation and reuse, greatly reducing the number of IPIs for
  shootdown on big-SMP machines and eliminating up to 25-30% of the system
  time on I/O-intensive workloads.

  An unmapped buffer must be explicitly requested with the GB_UNMAPPED
  flag by the consumer. For an unmapped buffer, no KVA reservation is
  performed at all. The consumer might request an unmapped buffer which
  does have a KVA reservation, to manually map it without recursing into
  the buffer cache and blocking, with the GB_KVAALLOC flag. When a mapped
  buffer is requested and an unmapped buffer already exists, the cache
  performs an upgrade, possibly reusing the KVA reservation.

  An unmapped buffer is translated into an unmapped bio in
  g_vfs_strategy(). An unmapped bio carries a pointer to the vm_page_t
  array, an offset and a length instead of the data pointer. The provider
  which processes the bio should explicitly declare its readiness to
  accept unmapped bios; otherwise the g_down geom thread performs a
  transient upgrade of the bio request by mapping the pages into the new
  bio_transient_map KVA submap.

  The bio_transient_map submap claims up to 10% of the buffer map, and the
  total buffer_map + bio_transient_map KVA usage stays the same. Still, it
  can be manually tuned with the kern.bio_transient_maxcnt tunable, in
  units of transient mappings. Eventually, the bio_transient_map could be
  removed once all geom classes and drivers can accept unmapped I/O
  requests.

  Unmapped support can be turned off with the vfs.unmapped_buf_allowed
  tunable; disabling it makes buffer (or cluster) creation requests ignore
  the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are only enabled
  by default on the architectures where pmap_copy_page() was implemented
  and tested.

  In the rework, filesystem metadata is no longer subject to the
  maxbufspace limit. Since the metadata buffers are always mapped, the
  buffers still have to fit into the buffer map, which provides a
  reasonable (but practically unreachable) upper bound on it. The
  non-metadata buffer allocations, both mapped and unmapped, are accounted
  against maxbufspace as before. Effectively, this means that maxbufspace
  is enforced on mapped and unmapped buffers separately. The pre-patch
  bufspace limiting code did not work, because buffer_map fragmentation
  does not allow the limit to be reached.

  By Jeff Roberson's request, the getnewbuf() function was split into
  smaller single-purpose functions.
  Sponsored by: The FreeBSD Foundation
  Discussed with: jeff (previous version)
  Tested by: pho, scottl (previous version), jhb, bf
  MFC after: 2 weeks
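  A sketch of a consumer requesting an unmapped buffer; the flag test
  reflects the era's B_UNMAPPED marking, and error handling and the actual
  page-level access are elided:

    bp = getblk(vp, lblkno, bsize, 0, 0, GB_UNMAPPED);
    if ((bp->b_flags & B_UNMAPPED) != 0) {
        /* No KVA: operate on bp->b_pages[] directly (for example via
         * pmap_zero_page_area()) or hand the pages to an unmapped bio. */
    }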
* Some style fixes. (kib, 2013-03-14, 1 file, -7/+5)
  Sponsored by: The FreeBSD Foundation
* (kib, 2013-03-14, 1 file, -2/+3)
  Add currently unused flag argument to the cluster_read(), cluster_write()
  and cluster_wbuild() functions. The flags to be allowed are a subset of
  the GB_* flags for getblk().
  Sponsored by: The FreeBSD Foundation
  Tested by: pho
* (kib, 2013-03-14, 1 file, -13/+15)
  Rewrite vfs_bio_clrbuf(9) to not access b_data for B_VMIO buffers
  directly; instead use pmap_zero_page_area(9) for each page region being
  zeroed.
  Sponsored by: The FreeBSD Foundation
  Tested by: pho
  MFC after: 2 weeks
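  A sketch of the page-by-page zeroing described above; the real function
  only clears the invalid portions of each page:

    for (i = 0; i < bp->b_npages; i++)
        /* Zero through the page itself, never through bp->b_data. */
        pmap_zero_page_area(bp->b_pages[i], 0, PAGE_SIZE);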
* MFC (attilio, 2013-02-27, 1 file, -6/+13)
* Hide the details of the assertion for VM_OBJECT_LOCK operations.
  (attilio, 2013-02-21, 1 file, -5/+5)
  Rename the current VM_OBJECT_LOCK_ASSERT(foo, RA_WLOCKED) into
  VM_OBJECT_ASSERT_WLOCKED(foo).
  Sponsored by: EMC / Isilon storage division
  Requested by: alc
* Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() to
  their "write" versions. (attilio, 2013-02-20, 1 file, -23/+23)
  Sponsored by: EMC / Isilon storage division
* Switch the vm_object lock to be a rwlock. (attilio, 2013-02-20, 1 file,
  -5/+6)
  * VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations.
  * VM_OBJECT_SLEEP() is introduced as a general purpose primitive to get
    a sleep operation using a VM_OBJECT_LOCK() as protection.
  * The approach must bear with vm_pager.h namespace pollution, so many
    files require including rwlock.h directly.
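  A sketch of the lock usage after the switch; the write forms replace the
  old mutex operations one for one, while the read forms allow concurrent
  readers (the specific object manipulations shown are illustrative):

    VM_OBJECT_WLOCK(object);        /* was: VM_OBJECT_LOCK(object) */
    vm_object_pip_add(object, 1);   /* illustrative modification */
    VM_OBJECT_WUNLOCK(object);

    VM_OBJECT_RLOCK(object);        /* new: concurrent readers allowed */
    m = vm_page_lookup(object, pindex);
    VM_OBJECT_RUNLOCK(object);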