path: root/sys/kern/vfs_cluster.c
Commit message (Author, Age; Files, Lines)

* sys/kern: spelling fixes in comments. (pfg, 2016-04-29; 1 file, -1/+1)

    No functional change.

* kern: for pointers, replace 0 with NULL. (pfg, 2016-04-15; 1 file, -1/+1)

    These are mostly cosmetic; no functional change. Found with
    devel/coccinelle.

* Add four new RCTL resources - readbps, readiops, writebps and
  writeiops (trasz, 2016-04-07; 1 file, -0/+15)

    These are for limiting disk (actually filesystem) IO. Note that in
    some cases these limits are not quite precise. It's ok, as long as
    it's within some reasonable bounds. Testing - and review of the
    code, in particular the VFS and VM parts - is very welcome.

    MFC after: 1 month
    Relnotes: yes
    Sponsored by: The FreeBSD Foundation
    Differential Revision: https://reviews.freebsd.org/D5080

* The bread() function was inconsistent about whether it would return
  a buffer pointer on error (mckusick, 2016-01-27; 1 file, -4/+14)

    For some errors it would return a buffer pointer, and for other
    errors it would not. The cluster_read() function was similarly
    inconsistent.

    Clients of these functions were inconsistent in handling errors.
    Some would assume that no buffer was returned after an error and
    would thus lose buffers under certain error conditions. Others
    would assume that brelse() should always be called after an error
    and would thus panic the system under certain error conditions.

    To correct both of these problems with minimal code churn, bread()
    and cluster_read() now always free the buffer when returning an
    error, thus ensuring that buffers will never be lost. The brelse()
    routine checks for being passed a NULL buffer pointer and silently
    returns, to avoid panics. Thus both approaches to handling error
    returns from bread() and cluster_read() will work correctly.

    Future code should be written assuming that bread() and
    cluster_read() will never return a buffer with an error, and so
    should not attempt to brelse() the buffer when an error is
    returned.

    Reviewed by: kib

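    A minimal sketch of the resulting calling convention (the caller
    and its arguments here are hypothetical, not the committed code):

        /*
         * After this change: on error, bread() has already released
         * the buffer, so the caller must not brelse() it.
         */
        static int
        read_one_block(struct vnode *vp, daddr_t lbn, int size)
        {
                struct buf *bp;
                int error;

                error = bread(vp, lbn, size, NOCRED, &bp);
                if (error != 0)
                        return (error); /* no buffer to release */
                /* ... consume bp->b_data ... */
                brelse(bp);
                return (0);
        }
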
* Refactor unmapped buffer address handling. (jeff, 2015-07-23;
  1 file, -7/+3)

    - Use pointer assignment rather than a combination of pointers and
      flags to switch buffers between unmapped and mapped. This
      eliminates multiple flags and generally simplifies the logic.
    - Eliminate b_saveaddr, since it is only used with pager bufs,
      which have their b_data re-initialized on each allocation.
    - Gather up some convenience routines in the buffer cache for
      manipulating buf space and buf malloc space.
    - Add an inline, buf_mapped(), to standardize checks around
      unmapped buffers.

    In collaboration with: mlaier
    Reviewed by: kib
    Tested by: pho (many small revisions ago)
    Sponsored by: EMC / Isilon Storage Division

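    A sketch of the idea behind buf_mapped(), assuming (per this
    change) that an unmapped buffer's b_data points at a shared
    unmapped_buf sentinel rather than at KVA:

        static inline bool
        buf_mapped(struct buf *bp)
        {
                /* Unmapped buffers carry no KVA behind b_data. */
                return (bp->b_data != unmapped_buf);
        }

        /* Callers branch once instead of testing flag combinations: */
        if (buf_mapped(bp))
                memset(bp->b_data, 0, bp->b_bcount);
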
* Remove several write-only variables, all reported by the gcc 4.9
  buildkernel run (kib, 2015-05-29; 1 file, -2/+0)

    Some of them were write-only under some kernel options, e.g.
    variables keeping values only used by CTR() macros. Eliminating
    the warnings in those cases too, by removing local cached values
    used only for a single access, costs nothing in code readability
    or correctness.

    Review: https://reviews.freebsd.org/D2665
    Reviewed by: rodrigc
    Looked at by: bjk
    Sponsored by: The FreeBSD Foundation
    MFC after: 1 week

* When allocating a pbuf for a cluster write, do not sleep waiting for
  one if the passed vnode backs md(4) (kib, 2013-08-27; 1 file, -1/+3)

    Other i/o directed to the same md device might already hold pbufs,
    and then we could deadlock, since only our progress can free a
    pbuf needed for the wakeup.

    Obtained from: projects/vm6
    Reminded and tested by: pho
    MFC after: 1 week

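    A sketch of the avoidance pattern described above (assuming VV_MD
    marks a vnode backing an md(4) device; the surrounding code is
    hypothetical, not the committed diff):

        struct buf *tbp;

        if ((vp->v_vflag & VV_MD) == 0)
                tbp = getpbuf(&cluster_pbuf_freecnt);  /* may sleep */
        else
                tbp = trypbuf(&cluster_pbuf_freecnt);  /* never sleeps */
        if (tbp == NULL)
                return;         /* fall back to non-clustered writes */
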
* Fix a whitespace. (jkim, 2013-08-23; 1 file, -1/+1)

* Both cluster_rbuild() and cluster_wbuild() sometimes set the pages
  shared busy without first draining the hard busy state (kib,
  2013-08-22; 1 file, -9/+26)

    Previously this went unnoticed, since the VPO_BUSY flag and the
    m->busy field were distinct, and vm_page_io_start() did not verify
    that the passed page had the VPO_BUSY flag cleared, even though
    such a page state is wrong. The new implementation is stricter and
    caught this case.

    Drain the busy state as needed, before calling vm_page_sbusy().

    Tested by: pho, jkim
    Sponsored by: The FreeBSD Foundation

* The soft and hard busy mechanisms rely on the vm object lock to work
  (attilio, 2013-08-09; 1 file, -3/+3)

    Unify the two concepts into a real, minimal sxlock, where shared
    acquisition represents the soft busy and exclusive acquisition
    represents the hard busy. The old VPO_WANTED mechanism becomes the
    hard path for this new lock, and it becomes per-page rather than
    per-object. The vm_object lock becomes an interlock for this
    functionality: it can be held in both read and write mode.
    However, if the vm_object lock is held in read mode while
    acquiring or releasing the busy state, the thread owner cannot
    make any assumption on the busy state unless it is also busying
    it.

    Also:
    - Add a new flag to directly share busy pages while vm_page_alloc
      and vm_page_grab are being executed. This will be very helpful
      once these functions happen under a read object lock.
    - Move the swapping sleep into its own per-object flag.

    The KPI is heavily changed; this is why the version is bumped. It
    is very likely that some VM port users will need to change their
    own code.

    Sponsored by: EMC / Isilon storage division
    Discussed with: alc
    Reviewed by: jeff, kib
    Tested by: gavin, bapt (older version)
    Tested by: pho, scottl

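    A hedged sketch of the new primitives' intent (function names per
    this commit's description; m is a vm_page_t and the pairing shown
    is purely illustrative):

        vm_page_sbusy(m);       /* shared busy: the old soft busy */
        /* ... page identity is stable while i/o is in flight ... */
        vm_page_sunbusy(m);

        vm_page_xbusy(m);       /* exclusive busy: the old hard busy */
        /* ... page contents may be modified ... */
        vm_page_xunbusy(m);
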
* Convert the bufobj lock to rwlock. (jeff, 2013-05-31; 1 file, -8/+7)

    - Use a shared bufobj lock in getblk() and inmem().
    - Convert softdep's lk to rwlock to match the bufobj lock.
    - Move INFREECNT to b_flags and protect it with the buf lock.
    - Remove unnecessary locking around bremfree() and BKGRDINPROG.

    Sponsored by: EMC / Isilon Storage Division
    Discussed with: mckusick, kib, mdf

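    For example, a buffer lookup that previously needed the exclusive
    bufobj lock can now run under the shared lock (a sketch assuming
    the BO_RLOCK/BO_RUNLOCK macros that come with the rwlock
    conversion; bo and blkno are hypothetical):

        struct buf *bp;

        BO_RLOCK(bo);
        bp = gbincore(bo, blkno);       /* read-only buffer lookup */
        BO_RUNLOCK(bo);
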
* Add a sysctl vfs.read_min to complement the existing vfs.read_max
  (scottl, 2013-05-07; 1 file, -0/+12)

    It defaults to 1, meaning that it's off. When read-ahead is
    enabled on a file, the vfs cluster code deliberately breaks a read
    into 2 I/O transactions: one to satisfy the actual read, and one
    to perform read-ahead. This makes sense in low-latency
    circumstances, but often produces unbalanced i/o transactions that
    penalize disks. By setting vfs.read_min, we can tell the algorithm
    to fetch a larger transaction than what was asked for, achieving
    the same effect as the read-ahead but without the doubled,
    unbalanced transaction, at the cost of slightly higher latency.
    This significantly helps our workloads with video streaming.

    Submitted by: emax
    Reviewed by: kib
    Obtained from: Netflix

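    An illustrative sketch of the sizing decision, not the committed
    logic (all names are hypothetical; sizes are in file system
    blocks):

        static daddr_t
        choose_read_size(daddr_t requested, daddr_t read_min,
            daddr_t read_max)
        {
                /* Widen small reads into one larger transaction. */
                if (read_min > 1 && requested < read_min)
                        requested = read_min;
                /* Never exceed the existing read_max cap. */
                return (requested < read_max ? requested : read_max);
        }
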
* Implement the concept of unmapped VMIO buffers, i.e. buffers that do
  not map the b_pages pages into buffer_map KVA (kib, 2013-03-19;
  1 file, -41/+64)

    The use of unmapped buffers eliminates the need to perform TLB
    shootdown for the mapping on buffer creation and reuse, greatly
    reducing the number of IPIs for shootdown on big-SMP machines and
    eliminating up to 25-30% of the system time on i/o intensive
    workloads.

    An unmapped buffer must be explicitly requested with the
    GB_UNMAPPED flag by the consumer. For an unmapped buffer, no KVA
    reservation is performed at all. The consumer might request an
    unmapped buffer which does have a KVA reservation, to manually map
    it without recursing into the buffer cache and blocking, with the
    GB_KVAALLOC flag. When a mapped buffer is requested and an
    unmapped buffer already exists, the cache performs an upgrade,
    possibly reusing the KVA reservation.

    An unmapped buffer is translated into an unmapped bio in
    g_vfs_strategy(). An unmapped bio carries a pointer to the
    vm_page_t array, offset and length instead of the data pointer.
    The provider which processes the bio should explicitly specify a
    readiness to accept unmapped bios; otherwise the g_down geom
    thread performs a transient upgrade of the bio request by mapping
    the pages into the new bio_transient_map KVA submap. The
    bio_transient_map submap claims up to 10% of the buffer map, and
    the total buffer_map + bio_transient_map KVA usage stays the same.
    Still, it can be manually tuned by the kern.bio_transient_maxcnt
    tunable, in units of transient mappings. Eventually,
    bio_transient_map could be removed after all geom classes and
    drivers can accept unmapped i/o requests.

    Unmapped support can be turned off by the vfs.unmapped_buf_allowed
    tunable; disabling it makes buffer (or cluster) creation requests
    ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are
    only enabled by default on the architectures where
    pmap_copy_page() has been implemented and tested.

    In the rework, filesystem metadata is no longer subject to the
    maxbufspace limit. Since metadata buffers are always mapped, the
    buffers still have to fit into the buffer map, which provides a
    reasonable (but practically unreachable) upper bound on it.
    Non-metadata buffer allocations, both mapped and unmapped, are
    accounted against maxbufspace, as before. Effectively, this means
    that maxbufspace is enforced on mapped and unmapped buffers
    separately. The pre-patch bufspace limiting code did not work,
    because buffer_map fragmentation does not allow the limit to be
    reached.

    At Jeff Roberson's request, the getnewbuf() function was split
    into smaller single-purpose functions.

    Sponsored by: The FreeBSD Foundation
    Discussed with: jeff (previous version)
    Tested by: pho, scottl (previous version), jhb, bf
    MFC after: 2 weeks

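    A consumer-side sketch (the caller is hypothetical): request a
    buffer without a KVA mapping. At the time of this commit the
    B_UNMAPPED flag marked such buffers; the 2015 refactoring above
    replaced that test with buf_mapped():

        struct buf *bp;

        bp = getblk(vp, lbn, size, 0, 0, GB_UNMAPPED);
        if ((bp->b_flags & B_UNMAPPED) != 0) {
                /* No b_data; i/o must work on bp->b_pages[]. */
        }
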
* Some style fixes. (kib, 2013-03-14; 1 file, -1/+1)

    Sponsored by: The FreeBSD Foundation

* Add a currently unused flag argument to the cluster_read(),
  cluster_write() and cluster_wbuild() functions (kib, 2013-03-14;
  1 file, -16/+8)

    The flags to be allowed are a subset of the GB_* flags for
    getblk().

    Sponsored by: The FreeBSD Foundation
    Tested by: pho

* Hide the details of the assertion for VM_OBJECT_LOCK operations
  (attilio, 2013-02-21; 1 file, -3/+2)

    Rename the current VM_OBJECT_LOCK_ASSERT(foo, RA_WLOCKED) into
    VM_OBJECT_ASSERT_WLOCKED(foo).

    Sponsored by: EMC / Isilon storage division
    Requested by: alc

* Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK()
  to their "write" versions (attilio, 2013-02-20; 1 file, -9/+9)

    Sponsored by: EMC / Isilon storage division

* Switch the vm_object lock to be a rwlock (attilio, 2013-02-20;
  1 file, -2/+3)

    * VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write
      operations.
    * VM_OBJECT_SLEEP() is introduced as a general purpose primitive
      to get a sleep operation using a VM_OBJECT_LOCK() as protection.
    * The approach must bear with vm_pager.h namespace pollution, so
      many files require including rwlock.h directly.

* Add barrier write capability to the VFS buffer interface (mckusick,
  2013-02-16; 1 file, -3/+9)

    A barrier write is a disk write request that tells the disk that
    the buffer being written must be committed to the media, along
    with any writes that preceded it, before any future blocks may be
    written to the drive.

    Barrier writes are provided by adding the functions bbarrierwrite
    (bwrite with barrier) and babarrierwrite (bawrite with barrier).
    Following a bbarrierwrite, the client knows that the requested
    buffer is on the media. It does not ensure that buffers written
    before that buffer are on the media. It only ensures that buffers
    written before that buffer will get to the media before any
    buffers written after that buffer. A flush command must be sent to
    the disk to ensure that all earlier written buffers are on the
    media.

    Reviewed by: kib
    Tested by: Peter Holm

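    A usage sketch (the journaling-style caller and the buffers bp and
    nbp are hypothetical; the two functions are those added here, with
    return types assumed to mirror bwrite/bawrite):

        int error;

        /* Synchronous: returns once bp itself is on the media. */
        error = bbarrierwrite(bp);
        if (error != 0)
                return (error);

        /* Asynchronous: ordered after earlier writes, not waited on. */
        babarrierwrite(nbp);

    Note the caveat above: a barrier orders writes, it does not flush
    the drive's cache.
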
* Correct a KASSERT message. (alc, 2012-08-15; 1 file, -1/+1)

    Submitted by: bde

* Unbreak detection of the async mode for clustered writes after
  r231075 (kib, 2012-02-08; 1 file, -1/+1)

    Submitted by: bde
    MFC after: 12 days

* The hardware has caught up; improvements are now observed even at
  128 (ivoras, 2011-03-16; 1 file, -1/+1)

    Stay conservative, though, and bump read_max to "only" 64 (it will
    probably be a good idea to increase this to 128 after the next
    major release).

* Bumping the read-ahead count once more, to a value equivalent to
  512 KiB on most systems (ivoras, 2010-08-09; 1 file, -1/+1)

    Based on benchmark results on a low-end fibre channel SAN under
    VMWare:

        vfs.read_max              read performance
        8 (historical default)    83 MB/s
        16 (recent bump)          131 MB/s
        32 (this version)         152 MB/s
        64                        157 MB/s

    (results are +/- 3 MB/s)

    As read-ahead is heuristic, based on past IO requests, it
    shouldn't be problematic. The new default is still smaller than in
    other OSes.

* To help with sequential read UFS performance on modern systems,
  increase the vfs.read_max default (ivoras, 2010-08-07; 1 file,
  -1/+1)

    For most systems this means going from 128 KiB to 256 KiB, which
    is still very conservative and lower than what most other
    operating systems use, but as a sane default it should not
    interfere much with existing systems. For systems with RAID
    volumes and/or virtualization environments, where read performance
    is very important, increasing this sysctl tunable to 32 or even
    more will demonstrably yield additional performance benefits.

    If MAXPHYS ever gets bumped up, it will probably be a good idea to
    slave read_max to it.

* Remove a stale comment (alc, 2009-06-30; 1 file, -3/+0)

    The very same revision (r85511) that introduced this comment also
    implemented the proposed change to the code.

    Approved by: re (kib)

* Correct a long-standing performance bug in cluster_rbuild() (alc,
  2009-06-27; 1 file, -4/+15)

    Specifically, in the case of a file system with a block size that
    is less than the page size, cluster_rbuild() looks at too many of
    the page's valid bits. Consequently, it may terminate prematurely,
    resulting in poor performance.

    Reported by: bde
    Reviewed by: tegge
    Approved by: re (kib)

* Eliminate unnecessary obfuscation when testing a page's valid bits.
  (alc, 2009-06-07; 1 file, -4/+2)

* Complete part of the unfinished bufobj work by consistently using
  BO_LOCK/UNLOCK/MTX when manipulating the bufobj (jeff, 2008-03-22;
  1 file, -13/+18)

    - Create a new lock in the bufobj to lock bufobj fields
      independently. This leaves the vnode interlock as an 'identity'
      lock while the bufobj is an io lock. The bufobj lock is ordered
      before the vnode interlock and also before the mnt ilock.
    - Exploit this new lock order to simplify softdep_check_suspend().
    - A few sync related functions are marked with a new XXX to note
      that we may not properly interlock against a non-zero bv_cnt
      when attempting to sync all vnodes on a mountlist. I do not
      believe this race is important. If I'm wrong, this will make
      these locations easier to find.

    Reviewed by: kib (earlier diff)
    Tested by: kris, pho (earlier diff)

* Move rusage from being per-process in struct pstats to per-thread in
  td_ru (jeff, 2007-06-01; 1 file, -2/+2)

    - This removes the requirement for per-process synchronization in
      statclock() and mi_switch(). This was previously supported by
      sched_lock, which is going away. All modifications to rusage are
      now done in the context of the owning thread; reads proceed
      without locks.
    - Aggregate exiting threads' rusage in thread_exit() so that the
      exiting thread's rusage is not lost.
    - Provide a new routine, rufetch(), to fetch an aggregate of all
      rusage structures from all threads in a process. This routine
      must be used in any place requiring a rusage from a process
      prior to its exit. The exited process's rusage is still
      available via p_ru.
    - Aggregate tick statistics only on demand, via rufetch() or when
      a thread exits. Tick statistics are kept in the thread and
      protected by sched_lock until it exits.

    Initial patch by: attilio
    Reviewed by: attilio, bde (some objections), arch (mostly silent)

* Change these descriptions of memory types used in malloc(9)
  (wkoszek, 2007-03-05; 1 file, -1/+1)

    Their current, rather long strings make output from vmstat -m look
    unpleasant.

    Approved by: cognet (mentor)

* Replace PG_BUSY with VPO_BUSY (alc, 2006-10-22; 1 file, -1/+1)

    In other words, changes to the page's busy flag, i.e. VPO_BUSY,
    are now synchronized by the per-vm-object lock instead of the
    global page queues lock.

* Add a mnt_noasync counter to better handle interleaved calls to
  nmount(), sync() and sync_fsync() without losing MNT_ASYNC (tegge,
  2006-09-26; 1 file, -1/+1)

    Add an MNTK_ASYNC flag, which is set only when MNT_ASYNC is set
    and mnt_noasync is zero, and check that flag instead of MNT_ASYNC
    before initiating async io.

* Remove unused leaked debug function prototype. (tegge, 2006-03-21;
  1 file, -1/+0)

* Let snapshots make a copy of old contents for all buffers taking
  part in a cluster instead of just the first buffer (tegge,
  2006-03-19; 1 file, -5/+1)

    Delay buf_start() calls until snapshots have a copy of the old
    content.

    PR: kern/93942

* Changes imported from XFS for FreeBSD project (rodrigc, 2005-12-07;
  1 file, -0/+15)

    - add fields to struct buf (needed by XFS):
      - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
      - b_pin_count, a count of pins on the buffer
    - add new B_MANAGED flag
    - add breada() function to initiate asynchronous I/O on read-ahead
      blocks
    - add bufdone_finish(), bpin(), bunpin_wait() functions

    Patches provided by: kan
    Reviewed by: phk
    Silence on: arch@

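    A sketch of breada() use as described above (caller details such
    as lbn and size are hypothetical): start asynchronous read-ahead
    for the two blocks following the one just read.

        daddr_t rablkno[2] = { lbn + 1, lbn + 2 };
        int rabsize[2] = { size, size };

        /* Fire-and-forget: the read-ahead completes via biodone(). */
        breada(vp, rablkno, rabsize, 2, NOCRED);
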
* Normalize a significant number of kernel malloc type names (rwatson,
  2005-10-31; 1 file, -1/+1)

    - Prefer '_' to ' ', as it results in more easily parsed results
      in memory monitoring tools such as vmstat.
    - Remove punctuation that is incompatible with using memory type
      names as file names, such as '/' characters.
    - Disambiguate some collisions by adding subsystem prefixes to
      some memory types.
    - Generally prefer lower case to upper case.
    - If the same type is defined in multiple architecture
      directories, attempt to use the same name in additional cases.

    Not all instances were caught in this change, so more work is
    required to finish this conversion. Similar changes are required
    for UMA zone names.

* Only set B_RAM (read-ahead mark) on an incore buffer if we can lock
  it (ups, 2005-10-24; 1 file, -3/+8)

    This fixes a race condition caused by unlocked write access to the
    b_flags field.

    MFC after: 3 days

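    The fix's pattern, sketched (the buffer name rbp is hypothetical):
    set the flag only when a non-sleeping lock attempt succeeds, so
    b_flags is never written while another thread owns the buffer.

        if (BUF_LOCK(rbp, LK_EXCLUSIVE | LK_NOWAIT, NULL) == 0) {
                rbp->b_flags |= B_RAM;
                BUF_UNLOCK(rbp);
        }
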
* Do not use vm_pager_init() to initialize the vnode_pbuf_freecnt
  variable (kan, 2005-08-13; 1 file, -6/+0)

    vm_pager_init() runs before the required nswbuf variable has been
    set to its correct value. This caused the system to run with a
    single pbuf available for the vnode pager. Handle both the
    cluster_pbuf_freecnt and vnode_pbuf_freecnt variables in the same
    way.

    Reported by: ade
    Obtained from: alc
    MFC after: 2 days

* Revert revision 1.164: pmap_qremove() does not require protection by
  VM_LOCK_GIANT (alc, 2005-05-14; 1 file, -2/+0)

    Discussed with: jeff

* Remove spls and comments relating to them. (jeff, 2005-05-01;
  1 file, -26/+2)

* Call VM_LOCK_GIANT in cluster_callback() to protect some pmap calls
  (jeff, 2005-04-30; 1 file, -0/+2)

    VFS will not be acquiring Giant before calling this function
    anymore.

    Sponsored by: Isilon Systems, Inc.

* make cluster_callback() static (phk, 2005-02-10; 1 file, -1/+2)

* Remove GIANT_REQUIRED where Giant is no longer required. (jeff,
  2005-01-24; 1 file, -6/+0)

    Sponsored By: Isilon Systems, Inc.

* Eliminate (now) unnecessary acquisition and release of the global
  page queues lock. (alc, 2004-12-29; 1 file, -4/+0)

* Don't manually set b_bufobj; pbgetvp() does this for us. (phk,
  2004-11-15; 1 file, -1/+0)

* Explicitly call pbrelvp(). (phk, 2004-11-15; 1 file, -0/+1)

* Retire b_magic now; we have the bufobj containing the same hint.
  (phk, 2004-11-04; 1 file, -1/+0)

* Lock bp->b_bufobj->b_object instead of bp->b_object. (phk,
  2004-10-28; 1 file, -8/+8)

* Avoid using bp->b_vp when we already have the vnode by other means.
  (phk, 2004-10-27; 1 file, -6/+5)

* Synchronize access to the vm page's PG_BUSY flag using the
  containing vm object's lock (alc, 2004-10-27; 1 file, -4/+4)

    In the same place, eliminate unnecessary checks for a NULL vm
    object pointer.