path: root/sys/kern/vfs_cluster.c
Commit message (author, date, files changed, lines -removed/+added)
* MFC r298819: (bdrewery, 2016-06-27, 1 file, -1/+1)
  sys/kern: spelling fixes in comments.
* MFC r283735: (kib, 2015-06-05, 1 file, -2/+0)
  Remove several write-only variables.
* When allocating a pbuf for the cluster write, do not sleep waiting (kib, 2013-08-27, 1 file, -1/+3)
  for an available pbuf when the passed vnode is backing md(4). Other i/o
  directed to the same md device might already hold pbufs, and then we
  could deadlock since only our progress can free a pbuf needed for wakeup.

  Obtained from: projects/vm6
  Reminded and tested by: pho
  MFC after: 1 week
* Fix a whitespace. (jkim, 2013-08-23, 1 file, -1/+1)
* Both cluster_rbuild() and cluster_wbuild() sometimes set the pages (kib, 2013-08-22, 1 file, -9/+26)
  shared busy without first draining the hard busy state. Previously this
  went unnoticed since the VPO_BUSY and m->busy fields were distinct, and
  vm_page_io_start() did not verify that the passed page has the VPO_BUSY
  flag cleared, but such a page state is wrong. The new implementation is
  stricter and caught this case.

  Drain the busy state as needed, before calling vm_page_sbusy().

  Tested by: pho, jkim
  Sponsored by: The FreeBSD Foundation
* The soft and hard busy mechanisms rely on the vm object lock to work. (attilio, 2013-08-09, 1 file, -3/+3)
  Unify the two concepts into a real, minimal sxlock where the shared
  acquisition represents the soft busy and the exclusive acquisition
  represents the hard busy. The old VPO_WANTED mechanism becomes the
  hard-path for this new lock and it becomes per-page rather than
  per-object. The vm_object lock becomes an interlock for this
  functionality: it can be held in either read or write mode. However, if
  the vm_object lock is held in read mode while acquiring or releasing the
  busy state, the thread owner cannot make any assumption on the busy
  state unless it is also busying it.

  Also:
  - Add a new flag to directly shared busy pages while vm_page_alloc and
    vm_page_grab are being executed. This will be very helpful once these
    functions happen under a read object lock.
  - Move the swapping sleep into its own per-object flag

  The KPI is heavily changed; this is why the version is bumped. It is
  very likely that some VM ports users will need to change their own code.

  Sponsored by: EMC / Isilon storage division
  Discussed with: alc
  Reviewed by: jeff, kib
  Tested by: gavin, bapt (older version)
  Tested by: pho, scottl
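The shared/exclusive mapping described in the commit above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel's page busy implementation: the `toy_*` names and the bit layout are hypothetical, and the sleeping path (the role the message assigns to the old VPO_WANTED mechanism) is left out, so only try-style acquisition is shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: bit 0 = exclusive (hard busy), rest = shared count. */
#define TOY_XBUSY      1u
#define TOY_SBUSY_ONE  2u

/* Soft busy (shared): succeeds unless a hard-busy holder exists. */
bool
toy_trysbusy(atomic_uint *state)
{
	unsigned old = atomic_load(state);

	do {
		if (old & TOY_XBUSY)
			return (false);
	} while (!atomic_compare_exchange_weak(state, &old,
	    old + TOY_SBUSY_ONE));
	return (true);
}

/* Drop one shared (soft busy) reference. */
void
toy_sunbusy(atomic_uint *state)
{
	unsigned old = atomic_fetch_sub(state, TOY_SBUSY_ONE);

	assert(old >= TOY_SBUSY_ONE && (old & TOY_XBUSY) == 0);
	(void)old;
}

/* Hard busy (exclusive): succeeds only when nobody holds the page busy. */
bool
toy_tryxbusy(atomic_uint *state)
{
	unsigned expected = 0;

	return (atomic_compare_exchange_strong(state, &expected, TOY_XBUSY));
}

/* Release the exclusive (hard busy) state. */
void
toy_xunbusy(atomic_uint *state)
{
	assert(atomic_load(state) == TOY_XBUSY);
	atomic_store(state, 0);
}
```

In this model a caller that wants the shared state while the exclusive state is held simply fails; the follow-up commit's "drain the hard busy state before calling vm_page_sbusy()" corresponds to waiting until `toy_trysbusy()` can succeed.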
* - Convert the bufobj lock to rwlock. (jeff, 2013-05-31, 1 file, -8/+7)
  - Use a shared bufobj lock in getblk() and inmem().
  - Convert softdep's lk to rwlock to match the bufobj lock.
  - Move INFREECNT to b_flags and protect it with the buf lock.
  - Remove unnecessary locking around bremfree() and BKGRDINPROG.

  Sponsored by: EMC / Isilon Storage Division
  Discussed with: mckusick, kib, mdf
* Add a sysctl vfs.read_min to complement the existing vfs.read_max. It (scottl, 2013-05-07, 1 file, -0/+12)
  defaults to 1, meaning that it's off.

  When read-ahead is enabled on a file, the vfs cluster code deliberately
  breaks a read into 2 I/O transactions; one to satisfy the actual read,
  and one to perform read-ahead. This makes sense in low-latency
  circumstances, but often produces unbalanced i/o transactions that
  penalize disks. By setting vfs.read_min, we can tell the algorithm to
  fetch a larger transaction than what we asked for, achieving the same
  effect as the read-ahead but without the doubled, unbalanced transaction
  and the slightly lower latency. This significantly helps our workloads
  with video streaming.

  Submitted by: emax
  Reviewed by: kib
  Obtained from: Netflix
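The trade-off described above (two unbalanced transactions versus one larger one) can be sketched with a toy planner. This is not the code in vfs_cluster.c: `toy_plan_read()`, its parameters, and the choice to size the single transaction as the larger of the request and `read_min` are assumptions made purely for illustration.

```c
#include <assert.h>

struct toy_io_plan {
	int ntrans;     /* number of I/O transactions issued */
	int blocks[2];  /* size of each transaction, in blocks */
};

/*
 * req:      blocks the caller actually asked for
 * ra:       blocks the read-ahead heuristic would like to prefetch
 * read_min: toy stand-in for vfs.read_min (1 means "off")
 */
struct toy_io_plan
toy_plan_read(int req, int ra, int read_min)
{
	struct toy_io_plan p = { 0, { 0, 0 } };

	assert(req > 0 && ra >= 0 && read_min >= 1);
	if (read_min > 1) {
		/* Assumed policy: one larger transaction, no separate read-ahead. */
		p.ntrans = 1;
		p.blocks[0] = req > read_min ? req : read_min;
		return (p);
	}
	/* read_min off: the actual read, then a second read-ahead transaction. */
	p.ntrans = 1;
	p.blocks[0] = req;
	if (ra > 0) {
		p.ntrans = 2;
		p.blocks[1] = ra;
	}
	return (p);
}
```

With `read_min` left at 1, a one-block read with seven blocks of read-ahead yields a 1-block and a 7-block transaction; with `read_min` set to 8 the same request becomes a single 8-block transaction in this model.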
* Implement the concept of the unmapped VMIO buffers, i.e. buffers which (kib, 2013-03-19, 1 file, -41/+64)
  do not map the b_pages pages into buffer_map KVA. The use of the
  unmapped buffers eliminates the need to perform TLB shootdown for
  mapping on the buffer creation and reuse, greatly reducing the amount of
  IPIs for shootdown on big-SMP machines and eliminating up to 25-30% of
  the system time on i/o intensive workloads.

  The unmapped buffer should be explicitly requested by the GB_UNMAPPED
  flag by the consumer. For unmapped buffer, no KVA reservation is
  performed at all. The consumer might request unmapped buffer which does
  have a KVA reserve, to manually map it without recursing into buffer
  cache and blocking, with the GB_KVAALLOC flag. When the mapped buffer is
  requested and unmapped buffer already exists, the cache performs an
  upgrade, possibly reusing the KVA reservation.

  Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
  Unmapped bio carry a pointer to the vm_page_t array, offset and length
  instead of the data pointer. The provider which processes the bio should
  explicitly specify a readiness to accept unmapped bio, otherwise the
  g_down geom thread performs the transient upgrade of the bio request by
  mapping the pages into the new bio_transient_map KVA submap.

  The bio_transient_map submap claims up to 10% of the buffer map, and the
  total buffer_map + bio_transient_map KVA usage stays the same. Still, it
  could be manually tuned by the kern.bio_transient_maxcnt tunable, in the
  units of the transient mappings. Eventually, the bio_transient_map could
  be removed after all geom classes and drivers can accept unmapped i/o
  requests.

  Unmapped support can be turned off by the vfs.unmapped_buf_allowed
  tunable, disabling which makes the buffer (or cluster) creation requests
  ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are only
  enabled by default on the architectures where pmap_copy_page() was
  implemented and tested.

  In the rework, filesystem metadata is no longer subject to the
  maxbufspace limit. Since the metadata buffers are always mapped, the
  buffers still have to fit into the buffer map, which provides a
  reasonable (but practically unreachable) upper bound on it. The
  non-metadata buffer allocations, both mapped and unmapped, are accounted
  against maxbufspace, as before. Effectively, this means that the
  maxbufspace is enforced on mapped and unmapped buffers separately. The
  pre-patch bufspace limiting code did not work, because buffer_map
  fragmentation does not allow the limit to be reached.

  By Jeff Roberson request, the getnewbuf() function was split into
  smaller single-purpose functions.

  Sponsored by: The FreeBSD Foundation
  Discussed with: jeff (previous version)
  Tested by: pho, scottl (previous version), jhb, bf
  MFC after: 2 weeks
* Some style fixes. (kib, 2013-03-14, 1 file, -1/+1)
  Sponsored by: The FreeBSD Foundation
* Add currently unused flag argument to the cluster_read(), (kib, 2013-03-14, 1 file, -16/+8)
  cluster_write() and cluster_wbuild() functions. The flags to be allowed
  are a subset of the GB_* flags for getblk().

  Sponsored by: The FreeBSD Foundation
  Tested by: pho
* Hide the details for the assertion for VM_OBJECT_LOCK operations. (attilio, 2013-02-21, 1 file, -3/+2)
  Rename current VM_OBJECT_LOCK_ASSERT(foo, RA_WLOCKED) into
  VM_OBJECT_ASSERT_WLOCKED(foo)

  Sponsored by: EMC / Isilon storage division
  Requested by: alc
* Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() to (attilio, 2013-02-20, 1 file, -9/+9)
  their "write" versions.

  Sponsored by: EMC / Isilon storage division
* Switch vm_object lock to be a rwlock. (attilio, 2013-02-20, 1 file, -2/+3)
  * VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
  * VM_OBJECT_SLEEP() is introduced as a general purpose primitive to get
    a sleep operation using a VM_OBJECT_LOCK() as protection
  * The approach must bear with vm_pager.h namespace pollution so many
    files require including directly rwlock.h
* Add barrier write capability to the VFS buffer interface. A barrier (mckusick, 2013-02-16, 1 file, -3/+9)
  write is a disk write request that tells the disk that the buffer being
  written must be committed to the media along with any writes that
  preceded it before any future blocks may be written to the drive.

  Barrier writes are provided by adding the functions bbarrierwrite
  (bwrite with barrier) and babarrierwrite (bawrite with barrier).

  Following a bbarrierwrite the client knows that the requested buffer is
  on the media. It does not ensure that buffers written before that buffer
  are on the media. It only ensures that buffers written before that
  buffer will get to the media before any buffers written after that
  buffer. A flush command must be sent to the disk to ensure that all
  earlier written buffers are on the media.

  Reviewed by: kib
  Tested by: Peter Holm
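The ordering guarantee stated above can be written down as a small checker. This is a sketch of the rule as the commit message words it, not of bbarrierwrite() or any driver code: `toy_order_respects_barriers()` and its array-based representation of issue and completion order are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Writes are numbered 0..n-1 in the order they were issued.
 * is_barrier[i] is true if write i was issued as a barrier write.
 * completion[k] is the issue number of the k-th write to reach the media.
 *
 * Returns true if, for every barrier b, each write issued at or before b
 * reached the media before any write issued after b.
 */
bool
toy_order_respects_barriers(const bool *is_barrier, const int *completion,
    size_t n)
{
	for (size_t b = 0; b < n; b++) {
		bool seen_later = false;

		if (!is_barrier[b])
			continue;
		for (size_t k = 0; k < n; k++) {
			assert(completion[k] >= 0 && (size_t)completion[k] < n);
			if ((size_t)completion[k] > b)
				seen_later = true;
			else if (seen_later)
				return (false);	/* earlier write overtaken */
		}
	}
	return (true);
}
```

Note that the model lets writes on the same side of a barrier complete in any order relative to each other, which matches the statement that a barrier orders earlier writes against later ones but does not by itself say when the earlier ones are on the media.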
* Correct a KASSERT message. (alc, 2012-08-15, 1 file, -1/+1)
  Submitted by: bde
* Unbreak detection of the async mode for clustered writes after r231075. (kib, 2012-02-08, 1 file, -1/+1)
  Submitted by: bde
  MFC after: 12 days
* The hardware has caught up; improvements are now observed even at 128, (ivoras, 2011-03-16, 1 file, -1/+1)
  but stay conservative and bump read_max to "only" 64 (it will probably
  be a good idea to increase this to 128 after the next major release).
* Bumping the read-ahead count once more, to value equivalent to 512 KiB on (ivoras, 2010-08-09, 1 file, -1/+1)
  most systems, based on benchmark results on a low-end fibre channel SAN
  under VMWare:

  vfs.read_max              read performance
  8 (historical default)    83 MB/s
  16 (recent bump)          131 MB/s
  32 (this version)         152 MB/s
  64                        157 MB/s
  (results are +/- 3 MB/s)

  As read-ahead is heuristic, based on past IO requests, it shouldn't be
  problematic. The new default is still smaller than in other OSes.
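The relation between the vfs.read_max count and the byte figure quoted above is simple arithmetic, sketched here. The 16 KiB block size is an inference from the message itself (32 blocks described as 512 KiB "on most systems"), not a constant taken from the kernel, and `toy_readahead_kib()` is a hypothetical helper.

```c
#include <assert.h>

/* Assumed block size, inferred from "32 blocks is about 512 KiB". */
#define TOY_BLOCK_KIB 16

/* Read-ahead window in KiB for a given vfs.read_max block count. */
int
toy_readahead_kib(int read_max)
{
	assert(read_max >= 0);
	return (read_max * TOY_BLOCK_KIB);
}
```

Under that assumption the historical default of 8 corresponds to 128 KiB, 16 to 256 KiB, 32 to 512 KiB, and 64 to 1024 KiB; file systems with a different block size would scale accordingly.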
* To help with sequential read UFS performance on modern systems, increase (ivoras, 2010-08-07, 1 file, -1/+1)
  the vfs.read_max default. For most systems this means going from 128 KiB
  to 256 KiB, which is still very conservative and lower than what most
  other operating systems use, but as a sane default should not interfere
  much with existing systems.

  For systems with RAID volumes and/or virtualization environments, where
  read performance is very important, increasing this sysctl tunable to 32
  or even more will demonstrably yield additional performance benefits.

  If MAXPHYS ever gets bumped up, it will probably be a good idea to slave
  read_max to it.
* Remove a stale comment. The very same revision (r85511) that introduced (alc, 2009-06-30, 1 file, -3/+0)
  this comment also implemented the proposed change to the code.

  Approved by: re (kib)
* Correct a long-standing performance bug in cluster_rbuild(). Specifically, (alc, 2009-06-27, 1 file, -4/+15)
  in the case of a file system with a block size that is less than the
  page size, cluster_rbuild() looks at too many of the page's valid bits.
  Consequently, it may terminate prematurely, resulting in poor
  performance.

  Reported by: bde
  Reviewed by: tegge
  Approved by: re (kib)
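The class of bug described above can be shown with a toy valid-bit mask. This is an illustration only: the `toy_*` helpers, the 4096-byte page, and the one-bit-per-512-bytes granularity are assumptions chosen for the example, and the code is not the check that cluster_rbuild() performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE   4096u
#define TOY_DEV_BSHIFT  9u	/* assumed: one valid bit per 512-byte chunk */

/* Mask of valid bits covering [base, base + size) within one page. */
uint8_t
toy_page_bits(unsigned base, unsigned size)
{
	unsigned first, last;

	assert(size > 0 && base + size <= TOY_PAGE_SIZE);
	first = base >> TOY_DEV_BSHIFT;
	last = (base + size - 1) >> TOY_DEV_BSHIFT;
	return ((uint8_t)((2u << last) - (1u << first)));
}

/* Over-broad test: looks at every valid bit of the page. */
bool
toy_page_any_valid(uint8_t valid)
{
	return (valid != 0);
}

/* Narrow test: looks only at the bits backing the block in question. */
bool
toy_block_any_valid(uint8_t valid, unsigned base, unsigned size)
{
	return ((valid & toy_page_bits(base, size)) != 0);
}
```

With 2048-byte blocks, a page whose first half is valid has mask 0x0f. The over-broad test reports "valid" for the second block too, which would end a cluster early, while the narrow test correctly reports that the second block still needs to be read.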
* Eliminate unnecessary obfuscation when testing a page's valid bits. (alc, 2009-06-07, 1 file, -4/+2)
* - Complete part of the unfinished bufobj work by consistently using (jeff, 2008-03-22, 1 file, -13/+18)
    BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
  - Create a new lock in the bufobj to lock bufobj fields independently.
    This leaves the vnode interlock as an 'identity' lock while the bufobj
    is an io lock. The bufobj lock is ordered before the vnode interlock
    and also before the mnt ilock.
  - Exploit this new lock order to simplify softdep_check_suspend().
  - A few sync related functions are marked with a new XXX to note that we
    may not properly interlock against a non-zero bv_cnt when attempting
    to sync all vnodes on a mountlist. I do not believe this race is
    important. If I'm wrong this will make these locations easier to find.

  Reviewed by: kib (earlier diff)
  Tested by: kris, pho (earlier diff)
* - Move rusage from being per-process in struct pstats to per-thread in (jeff, 2007-06-01, 1 file, -2/+2)
    td_ru. This removes the requirement for per-process synchronization in
    statclock() and mi_switch(). This was previously supported by
    sched_lock which is going away. All modifications to rusage are now
    done in the context of the owning thread. Reads proceed without locks.
  - Aggregate exiting threads' rusage in thread_exit() such that the
    exiting thread's rusage is not lost.
  - Provide a new routine, rufetch() to fetch an aggregate of all rusage
    structures from all threads in a process. This routine must be used in
    any place requiring a rusage from a process prior to its exit. The
    exited process's rusage is still available via p_ru.
  - Aggregate tick statistics only on demand via rufetch() or when a
    thread exits. Tick statistics are kept in the thread and protected by
    sched_lock until it exits.

  Initial patch by: attilio
  Reviewed by: attilio, bde (some objections), arch (mostly silent)
* Change these descriptions of memory types used in malloc(9), as their (wkoszek, 2007-03-05, 1 file, -1/+1)
  current, rather long strings make output from vmstat -m look unpleasant.

  Approved by: cognet (mentor)
* Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's (alc, 2006-10-22, 1 file, -1/+1)
  busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
  lock instead of the global page queues lock.
* Add mnt_noasync counter to better handle interleaved calls to nmount(), (tegge, 2006-09-26, 1 file, -1/+1)
  sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag
  which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
  check that flag instead of MNT_ASYNC before initiating async io.
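The counter-plus-derived-flag scheme described above can be modelled in a few lines. This is a user-space sketch under stated assumptions: `struct toy_mount`, the `TOY_*` flag values, and the enter/exit helper names are hypothetical and do not reproduce struct mount, the real flag values, or the kernel's locking.

```c
#include <assert.h>

#define TOY_MNT_ASYNC   0x1u	/* toy value: async requested by the user */
#define TOY_MNTK_ASYNC  0x1u	/* toy value: async i/o currently allowed */

struct toy_mount {
	unsigned flag;		/* plays the role of mnt_flag */
	unsigned kern_flag;	/* plays the role of mnt_kern_flag */
	int	 noasync;	/* plays the role of mnt_noasync */
};

/* Recompute the derived flag: set only if async is requested and not blocked. */
void
toy_update_async(struct toy_mount *mp)
{
	if ((mp->flag & TOY_MNT_ASYNC) != 0 && mp->noasync == 0)
		mp->kern_flag |= TOY_MNTK_ASYNC;
	else
		mp->kern_flag &= ~TOY_MNTK_ASYNC;
}

/* A sync-like operation begins: temporarily forbid async i/o. */
void
toy_noasync_enter(struct toy_mount *mp)
{
	mp->noasync++;
	toy_update_async(mp);
}

/* The operation ends: async i/o resumes once the last blocker is gone. */
void
toy_noasync_exit(struct toy_mount *mp)
{
	assert(mp->noasync > 0);
	mp->noasync--;
	toy_update_async(mp);
}
```

Because the blockers only touch the counter and the derived flag, the user's `TOY_MNT_ASYNC` request survives any interleaving of enter and exit calls, which is the property the commit message describes as not losing MNT_ASYNC.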
* Remove unused leaked debug function prototype. (tegge, 2006-03-21, 1 file, -1/+0)
* Let snapshots make a copy of old contents for all buffers taking part in a (tegge, 2006-03-19, 1 file, -5/+1)
  cluster instead of just the first buffer. Delay buf_start() calls until
  snapshots have a copy of old content.

  PR: kern/93942
* Changes imported from XFS for FreeBSD project: (rodrigc, 2005-12-07, 1 file, -0/+15)
  - add fields to struct buf (needed by XFS)
    - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
    - b_pin_count, count of pinned buffer
  - add new B_MANAGED flag
  - add breada() function to initiate asynchronous I/O on read-ahead
    blocks.
  - add bufdone_finish(), bpin(), bunpin_wait() functions

  Patches provided by: kan
  Reviewed by: phk
  Silence on: arch@
* Normalize a significant number of kernel malloc type names: (rwatson, 2005-10-31, 1 file, -1/+1)
  - Prefer '_' to ' ', as it results in more easily parsed results in
    memory monitoring tools such as vmstat.
  - Remove punctuation that is incompatible with using memory type names
    as file names, such as '/' characters.
  - Disambiguate some collisions by adding subsystem prefixes to some
    memory types.
  - Generally prefer lower case to upper case.
  - If the same type is defined in multiple architecture directories,
    attempt to use the same name in additional cases.

  Not all instances were caught in this change, so more work is required
  to finish this conversion. Similar changes are required for UMA zone
  names.
* Only set B_RAM (Read ahead mark) on an incore buffer if we can lock it. (ups, 2005-10-24, 1 file, -3/+8)
  This fixes a race condition caused by the unlocked write access to the
  b_flags field.

  MFC after: 3 days
* Do not use vm_pager_init() to initialize vnode_pbuf_freecnt variable. (kan, 2005-08-13, 1 file, -6/+0)
  vm_pager_init() is run before the required nswbuf variable has been set
  to the correct value. This caused the system to run with a single pbuf
  available for vnode_pager. Handle both cluster_pbuf_freecnt and
  vnode_pbuf_freecnt variable in the same way.

  Reported by: ade
  Obtained from: alc
  MFC after: 2 days
* Revert revision 1.164: pmap_qremove() does not require protection by (alc, 2005-05-14, 1 file, -2/+0)
  VM_LOCK_GIANT.

  Discussed with: jeff
* - Remove spls and comments relating to them. (jeff, 2005-05-01, 1 file, -26/+2)
* - Call VM_LOCK_GIANT in cluster_callback() to protect some pmap calls. VFS (jeff, 2005-04-30, 1 file, -0/+2)
    will not be acquiring Giant before calling this function anymore.

  Sponsored by: Isilon Systems, Inc.
* make cluster_callback() static (phk, 2005-02-10, 1 file, -1/+2)
* - Remove GIANT_REQUIRED where giant is no longer required. (jeff, 2005-01-24, 1 file, -6/+0)
  Sponsored By: Isilon Systems, Inc.
* Eliminate (now) unnecessary acquisition and release of the global page (alc, 2004-12-29, 1 file, -4/+0)
  queues lock.
* Don't manually set b_bufobj, pbgetvp() does this for us. (phk, 2004-11-15, 1 file, -1/+0)
* Explicitly call pbrelvp() (phk, 2004-11-15, 1 file, -0/+1)
* Retire b_magic now, we have the bufobj containing the same hint. (phk, 2004-11-04, 1 file, -1/+0)
* Lock bp->b_bufobj->b_object instead of bp->b_object (phk, 2004-10-28, 1 file, -8/+8)
* Avoid using bp->b_vp when we already have the vnode by other means. (phk, 2004-10-27, 1 file, -6/+5)
* Synchronize access to the vm page's PG_BUSY flag using the containing vm (alc, 2004-10-27, 1 file, -4/+4)
  object's lock. In the same place, eliminate unnecessary checks for a
  NULL vm object pointer.
* Move the buffer method vector (buf->b_op) to the bufobj. (phk, 2004-10-24, 1 file, -5/+2)
  Extend it with a strategy method.
  Add bufstrategy() which does the usual VOP_SPECSTRATEGY/VOP_STRATEGY
  song and dance.
  Rename ibwrite to bufwrite().
  Move the two NFS buf_ops to more sensible places, add bufstrategy to
  them.
  Add inlines for bwrite() and bstrategy() which call through
  buf->b_bufobj->b_ops->b_{write,strategy}().
  Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with
  bstrategy().
* Add b_bufobj to struct buf which eventually will eliminate the need for b_vp. (phk, 2004-10-22, 1 file, -4/+5)
  Initialize b_bufobj for all buffers.
  Make incore() and gbincore() take a bufobj instead of a vnode.
  Make inmem() local to vfs_bio.c
  Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj),
  also VI_MTX() to BO_MTX().
  Make buf_vlist_add() take a bufobj instead of a vnode.
  Eliminate other uses of bp->b_vp where bp->b_bufobj will do.
  Various minor polishing: remove "register", turn panic into KASSERT,
  use new function declarations, TAILQ_FOREACH_SAFE() etc.
* Move the VI_BWAIT flag into a new bo_flag element of bufobj and call it BO_WWAIT (phk, 2004-10-21, 1 file, -3/+1)
  Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
  count on a bufobj. Bufobj_wdrop() replaces vwakeup().
  Use these functions in all relevant places except in ffs_softdep.c,
  where the use of interlocked_sleep() makes this impossible.
  Rename b_vnbufs to b_bobufs now that we touch all the relevant files
  anyway.
* Give cluster_write() an explicit vnode argument. (phk, 2004-09-27, 1 file, -6/+1)
  In the future a struct buf will not automatically point out a vnode for
  us.