path: root/sys/vm
Commit message | Author | Age | Files | Lines

* Correct the recovery logic in vm_page_alloc_contig:  [attilio | 2013-08-11 | 1 file | -4/+2]
  What is really needed in this code snippet is that all the pages that
  are already fully inserted get fully freed, while for the others the
  object removal itself might be skipped, hence the object might be set
  to NULL.

  Sponsored by: EMC / Isilon storage division
  Reported by: alc, kib
  Reviewed by: alc

* Different consumers of the struct vm_page abuse the pageq member to keep  [kib | 2013-08-10 | 9 files | -80/+73]
  additional information when the page is guaranteed not to belong to a
  paging queue.  Usually, this results in a lot of type casts which make
  reasoning about the code's correctness harder.  Sometimes m->object is
  used instead of pageq, which could cause real and confusing bugs if a
  non-NULL m->object is leaked.  See r141955 and r253140 for examples.

  Change the pageq member into a union containing explicitly-typed
  members.  Use them instead of type-punning or abusing m->object in the
  x86 pmaps, uma and vm_page_alloc_contig().

  Requested and reviewed by: alc
  Sponsored by: The FreeBSD Foundation

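  A minimal sketch of the union approach described above; the member
  names and layout are illustrative only, not the actual struct vm_page
  definition in vm_page.h:

      /*
       * Illustrative only: an explicitly-typed union replacing the
       * overloaded pageq member, so each consumer has a properly
       * typed field instead of casting.
       */
      union {
              TAILQ_ENTRY(vm_page) q;   /* page queue linkage */
              SLIST_ENTRY(vm_page) s;   /* private free lists (pmap/uma) */
              struct {
                      u_long v[2];      /* scratch space for other users */
              } u;
      } plinks;
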
* Remove unused definition for CTL_VM_NAMES.  [zont | 2013-08-09 | 1 file | -15/+0]

  Suggested by: bde

* Revert the addition of VPO_BUSY and instead update vm_page_replace() to  [jhb | 2013-08-09 | 2 files | -7/+6]
  properly unbusy the page.

  Submitted by: alc

* Add missing 'VPO_BUSY' from r254141 to fix kernel build break.  [obrien | 2013-08-09 | 1 file | -0/+1]

* On all architectures, avoid preallocating the physical memory  [attilio | 2013-08-09 | 10 files | -116/+408]
  for nodes used in vm_radix.  On architectures supporting direct
  mapping, also avoid preallocating the KVA for such nodes.

  In order to do so, allow the operations derived from vm_radix_insert()
  to fail, and handle all the resulting failures.

  On the vm_radix side, introduce a new function called
  vm_radix_replace(), which can replace an already present leaf node
  with a new one, and take into account the possibility that the
  operations on the radix trie can recurse during a vm_radix_insert()
  allocation.  This means that if the operations in vm_radix_insert()
  recursed, vm_radix_insert() will start from scratch again.

  Sponsored by: EMC / Isilon storage division
  Reviewed by: alc (older version)
  Reviewed by: jeff
  Tested by: pho, scottl

* The soft and hard busy mechanisms rely on the vm object lock to work.  [attilio | 2013-08-09 | 10 files | -233/+421]
  Unify the two concepts into a real, minimal, sxlock where the shared
  acquisition represents the soft busy and the exclusive acquisition
  represents the hard busy.

  The old VPO_WANTED mechanism becomes the hard path for this new lock,
  and it becomes per-page rather than per-object.  The vm_object lock
  becomes an interlock for this functionality: it can be held in both
  read or write mode.  However, if the vm_object lock is held in read
  mode while acquiring or releasing the busy state, the thread owner
  cannot make any assumption on the busy state unless it is also
  busying it.

  Also:
  - Add a new flag to directly share-busy pages while vm_page_alloc
    and vm_page_grab are being executed.  This will be very helpful
    once these functions happen under a read object lock.
  - Move the swapping sleep into its own per-object flag.

  The KPI is heavily changed, which is why the version is bumped.  It is
  very likely that some VM ports users will need to change their own
  code.

  Sponsored by: EMC / Isilon storage division
  Discussed with: alc
  Reviewed by: jeff, kib
  Tested by: gavin, bapt (older version)
  Tested by: pho, scottl

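  A hedged sketch of how a consumer might use the unified busy lock;
  the function names follow the KPI this change describes, but the
  surrounding code is illustrative only:

      /*
       * Illustrative consumer: shared busy to read page contents,
       * with the object lock dropped while the busy state pins
       * the page.  Not actual FreeBSD code.
       */
      VM_OBJECT_WLOCK(object);
      m = vm_page_lookup(object, pindex);
      vm_page_sbusy(m);             /* shared acquisition: soft busy */
      VM_OBJECT_WUNLOCK(object);    /* busy state pins the page */
      /* ... read the page contents ... */
      vm_page_sunbusy(m);
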
* Split the pagequeues per NUMA domains, and split the pagedaemon process  [kib | 2013-08-07 | 6 files | -177/+414]
  into threads, each processing a queue in a single domain.  The
  structure of the pagedaemons and queues is kept intact; most of the
  changes come from the need for the code to find the owning page queue
  for a given page, calculated from the segment containing the page.

  The tie between NUMA domain and pagedaemon thread/pagequeue split is
  rather arbitrary: the multithreaded daemon could be allowed on
  single-domain machines, or one domain might be split into several
  page domains, to further increase concurrency.

  Right now, each pagedaemon thread tries to reach the global target,
  precalculated at the start of the pass.  This is not optimal, since
  it could cause excessive page deactivation and freeing.  The code
  should be changed to re-check the global page deficit state in the
  loop after some number of iterations.

  The pagedaemons reach a quorum before starting the OOM, since one
  thread's inability to meet the target is normal for split queues.
  Only when all pagedaemons fail to produce enough reusable pages is
  the OOM started by a single selected thread.

  Launder is modified to take into account the segment layout with
  regard to the region for which cleaning is performed.

  Based on the preliminary patch by jeff, sponsored by EMC / Isilon
  Storage Division.

  Reviewed by: alc
  Tested by: pho
  Sponsored by: The FreeBSD Foundation

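  A minimal sketch of the page-to-queue lookup described above; the
  helper name and the per-domain array are assumptions based on the
  description, not verified source:

      /*
       * Illustrative: find the per-domain pagequeue state owning a
       * page from the physical segment containing it (names are
       * assumptions).
       */
      static struct vm_domain *
      vm_page_domain(vm_page_t m)
      {
              struct vm_phys_seg *seg;

              seg = &vm_phys_segs[m->segind];
              return (&vm_dom[seg->domain]);
      }
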
* Replace kernel virtual address space allocation with vmem.  This provides  [jeff | 2013-08-07 | 12 files | -367/+278]
  transparent layering and better fragmentation.

  - Normalize functions that allocate memory to use kmem_*
  - Those that allocate address space are named kva_*
  - Those that operate on maps are named kmap_*
  - Implement recursive allocation handling for kmem_arena in vmem.

  Reviewed by: alc
  Tested by: pho
  Sponsored by: EMC / Isilon Storage Division

* Fill in the description fields for M_FICT_PAGES.  [markj | 2013-08-07 | 1 file | -1/+1]

  Reviewed by: kib
  MFC after: 3 days

* Revert r253939:  [attilio | 2013-08-05 | 4 files | -18/+16]
  We cannot busy a page before doing pagefaults.  In fact, it can
  deadlock against the vnode lock, as it tries to vget().  Other
  functions right now have the opposite lock ordering, like
  vm_object_sync(), which acquires the vnode lock first and then sleeps
  on the busy mechanism.

  Before this patch is reinserted we need to break this ordering.

  Sponsored by: EMC / Isilon storage division
  Reported by: kib

* The page hold mechanism is fast but it has a couple of fallouts:  [attilio | 2013-08-04 | 4 files | -16/+18]
  - It does not let pages respect the LRU policy
  - It bloats the active/inactive queues with a few pages

  Try to avoid it as much as possible, with the long-term target of
  completely removing it.  Use the soft-busy mechanism to protect page
  content accesses during short-term operations (like
  uiomove_fromphys()).

  After this change only vm_fault_quick_hold_pages() is still using the
  hold mechanism for page content access.  There is an additional
  complexity there, as the quick path cannot immediately access the
  page object to busy the page, and the slow path cannot busy more
  than one page at a time (to avoid deadlocks).

  Fixing this primitive will allow the complete removal of the page
  hold mechanism.

  Sponsored by: EMC / Isilon storage division
  Discussed with: alc
  Reviewed by: jeff
  Tested by: pho

* Unbreak sysctl ABI changes introduced in r253662  [zont | 2013-07-29 | 1 file | -2/+4]

  Requested by: bde

* Improve page LRU quality and simplify the logic.  [jeff | 2013-07-26 | 1 file | -71/+57]

  - Don't short-circuit aging tests for unmapped objects.  This biases
    against unmapped file pages and transient mappings.
  - Always honor PGA_REFERENCED.  We can now use this after soft
    busying to lazily restart the LRU.
  - Don't transition directly from active to cached, bypassing the
    inactive queue.  This frees recently used data much too early.
  - Rename actcount to act_delta to be more consistent with its use
    and meaning.

  Reviewed by: kib, alc
  Sponsored by: EMC / Isilon Storage Division

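  A simplified sketch of the aging step the renamed act_delta feeds,
  assuming the usual pagedaemon constants; this is a paraphrase of the
  scan loop, not the actual diff:

      /*
       * Paraphrased aging step: act_delta accumulates recent
       * references and drives act_count up or down.
       */
      act_delta = pmap_ts_referenced(m);
      if ((m->aflags & PGA_REFERENCED) != 0) {
              vm_page_aflag_clear(m, PGA_REFERENCED);
              act_delta++;
      }
      if (act_delta != 0)
              m->act_count = min(m->act_count + ACT_ADVANCE + act_delta,
                  ACT_MAX);
      else if (m->act_count > 0)
              m->act_count -= min(m->act_count, ACT_DECLINE);
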
* Remove define and documentation for vm_pageout_algorithm missed in r253587  [zont | 2013-07-26 | 1 file | -4/+2]

* Clear the entire map structure, including locks, so that the  [kientzle | 2013-07-25 | 1 file | -2/+1]
  locks don't accidentally appear to have been already initialized.

  In particular, this fixes a consistent kernel crash on armv6 with:
      panic: lock "vm map (user)" 0xc09cc050 already initialized
  that appeared with r251709.

  PR: arm/180820

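  A minimal sketch of the kind of fix described, assuming a UMA-style
  zone-init helper; the function name is hypothetical and the body is
  not the exact diff:

      /*
       * Illustrative: zero the whole map so stale memory cannot look
       * like an already-initialized lock.
       */
      static int
      vm_map_zinit_sketch(void *mem, int size, int flags)
      {
              vm_map_t map = mem;

              bzero(map, sizeof(*map));  /* was: selected fields only */
              map->nentries = 0;
              map->size = 0;
              return (0);
      }
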
* Rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST  [avg | 2013-07-24 | 2 files | -18/+4]

  Also directly call swapper() at the end of mi_startup instead of
  relying on swapper being the last thing in the sysinits order.

  Rationale:
  - "RUN_SCHEDULER" was misleading; scheduling already takes place at
    that stage
  - "scheduler" was misleading; the function swaps in the swapped-out
    processes
  - another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
    invoked, depending on its relative order with scheduler; this was
    not obvious and the bug actually used to exist

  Reviewed by: kib (earlier version)
  MFC after: 14 days

* Since r251709 a slab no longer uses 8-bit indices to manage items;  [glebius | 2013-07-24 | 1 file | -8/+0]
  thus remove a stale comment.

  Reviewed by: jeff

* Remove the long obsolete 'vm_pageout_algorithm' experiment.  [jeff | 2013-07-24 | 1 file | -9/+2]

  Discussed with: alc
  Sponsored by: EMC / Isilon Storage Division

* Correct a stale comment.  We don't have vclean() anymore.  The work is  [jeff | 2013-07-23 | 1 file | -5/+0]
  done by vgonel(), and destroy_vobject() should only be called once
  from VOP_INACTIVE().

  Sponsored by: EMC / Isilon Storage Division

* Revert r249590 and, in case mp_ncpus isn't initialized, use MAXCPU.  This  [glebius | 2013-07-23 | 1 file | -2/+3]
  allows us to init the counter zone at an early stage of boot.

  Reviewed by: kib
  Tested by: Lytochkin Boris <lytboris gmail.com>

* Fix previous commit when option RACCT is not used.  [jlh | 2013-07-22 | 1 file | -0/+2]

  MFC after: 7 days

* Fix a panic in the racct code when munlock(2) is called with incorrect values.  [jlh | 2013-07-22 | 1 file | -1/+4]

  The racct code in sys_munlock() assumed that the boundaries provided
  by userland were correct as long as vm_map_unwire() returned
  successfully.  However the latter contains its own logic and
  sometimes manages to do something out of those boundaries, even if
  they are buggy.  This change makes the racct code use the accounting
  done by the vm layer, as is done in other places such as vm_mlock().

  Despite fixing the panic, Alan Cox pointed out that this code is
  still racy: two simultaneous callers will produce incorrect values.

  Reviewed by: alc
  MFC after: 7 days

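  A simplified sketch of the fix described above: after a successful
  unwire, the RACCT_MEMLOCK usage is recomputed from the pmap's wired
  count rather than from the caller's range.  Flag and function names
  follow vm_map(9)/racct(9), but the snippet is a paraphrase:

      error = vm_map_unwire(map, start, end,
          VM_MAP_WIRE_USER | VM_MAP_WIRE_NOHOLES);
      #ifdef RACCT
      if (error == KERN_SUCCESS) {
              PROC_LOCK(td->td_proc);
              /* Account from the vm layer instead of trusting the
                 user-supplied start/end (paraphrased). */
              racct_set(td->td_proc, RACCT_MEMLOCK,
                  ptoa(pmap_wired_count(vm_map_pmap(map))));
              PROC_UNLOCK(td->td_proc);
      }
      #endif
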
* Be more aggressive in using superpages in all mappings of objects:  [jhb | 2013-07-19 | 3 files | -8/+20]

  - Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for
    vm_map_find() that will try to alter the alignment of a mapping to
    match any existing superpage mappings of the object being mapped.
    If no suitable address range is found with the necessary alignment,
    vm_map_find() will fall back to using the simple first-fit strategy
    (VMFS_ANY_SPACE).
  - Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl
    to use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE.

  Reviewed by: alc (earlier version)
  MFC after: 2 weeks

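  A hedged sketch of a caller opting into the new placement policy; the
  argument list follows the vm_map_find() prototype of that era from
  memory, so treat it as an approximation:

      /*
       * Illustrative caller: ask for superpage-friendly placement;
       * vm_map_find() itself degrades to VMFS_ANY_SPACE when no
       * suitably aligned range exists.
       */
      error = vm_map_find(map, object, foff, &addr, size,
          VMFS_OPTIMAL_SPACE, prot, maxprot, cow);
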
* When the swap pager allocates metadata in the pagedaemon context, allow it  [kib | 2013-07-11 | 1 file | -1/+2]
  to drain the reserve.  This was broken in r243040, causing deadlock.
  Note that the VM_WAIT call in case of uma_zalloc() failure from the
  pagedaemon would only wait for v_pageout_free_min anyway.

  Reported and tested by: pho
  Reviewed by: alc
  Sponsored by: The FreeBSD Foundation

* vm_fault() should not be allowed to proceed on a map entry which  [kib | 2013-07-11 | 1 file | -0/+13]
  is being wired now.  The entry's wired count is changed to non-zero
  in advance, before the map lock is dropped.  This makes vm_fault()
  perceive the entry as wired, and breaks the fragment which moves the
  wire count from the shadowed page to the upper page, causing the code
  to unwire a non-wired page.

  On the other hand, the vm_fault() calls from vm_fault_wire() should
  be allowed to proceed, so only drain MAP_ENTRY_IN_TRANSITION from
  vm_fault() when wiring_thread is not current.

  Reported and tested by: pho
  Reviewed by: alc
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks

* mlockall() or VM_MAP_WIRE_HOLESOK does not interact properly with  [kib | 2013-07-11 | 2 files | -11/+52]
  parallel creation of the map entries, e.g. by mmap() or stack
  growing.  It also breaks when another entry is wired in parallel.

  vm_map_wire() iterates over the map entries in the region and assumes
  that the map entries it finds were marked as in transition before,
  and that any entry marked as in transition was marked by the current
  invocation of vm_map_wire().  This is not true for new entries in the
  holes.

  Add the thread owner of the MAP_ENTRY_IN_TRANSITION flag to struct
  vm_map_entry.  In vm_map_wire() and vm_map_unwire(), only process
  the entries whose transition owner is the current thread.

  Reported and tested by: pho
  Reviewed by: alc
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks

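  A minimal sketch of the ownership check described above, inside a
  hypothetical entry-scan loop; the field and flag names follow the
  description:

      /*
       * Illustrative: skip in-transition entries marked by some
       * other thread rather than assuming they belong to this
       * invocation.
       */
      if ((entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0 &&
          entry->wiring_thread != curthread) {
              /* A concurrent wiring owns this entry; wait for it
                 and rescan instead of processing it (simplified). */
              continue;
      }
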
* Never remove user-wired pages from an object when doing  [kib | 2013-07-11 | 2 files | -9/+11]
  msync(MS_INVALIDATE).  vm_fault_copy_entry() requires that the object
  range which corresponds to a user-wired vm_map_entry is always fully
  populated.

  Add the OBJPR_NOTWIRED flag for vm_object_page_remove() to request
  the preserving behaviour; use it when calling vm_object_page_remove()
  from vm_object_sync().

  Reported and tested by: pho
  Reviewed by: alc
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks

* In the vm_page_set_invalid() function, do not assert that the page is  [kib | 2013-07-11 | 1 file | -2/+0]
  not busy, since its only caller, brelse(), can legitimately call it
  on a busy page.  This happens for VOP_PUTPAGES() on filesystems that
  use buffers and whose VOP_WRITE() method marked the buffer containing
  the page as non-cacheable.

  Reported and tested by: pho
  Reviewed by: alc
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks

* Fix typo in comment.  [kib | 2013-07-09 | 1 file | -1/+1]

  MFC after: 3 days

* vm_phys_fictitious_reg_range() was losing the 'memattr' because it would be  [neel | 2013-07-03 | 2 files | -1/+1]
  reset by pmap_page_init() right after being initialized in
  vm_page_initfake().  The statement above is with reference to the
  amd64 implementation of pmap_page_init().

  Fix this by calling pmap_page_init() in vm_page_initfake() before
  changing the 'memattr'.

  Reviewed by: kib
  MFC after: 2 weeks

* Remove a spurious keg lock acquisition.  [davide | 2013-06-28 | 1 file | -1/+1]

* Add a general purpose resource allocator, vmem, from NetBSD.  It was  [jeff | 2013-06-28 | 7 files | -40/+29]
  originally inspired by the Solaris vmem detailed in the proceedings
  of USENIX 2001.  The NetBSD version was heavily refactored for bugs
  and simplicity.

  - Use this resource allocator to allocate the buffer and transient
    maps.  Buffer cache defrags are reduced by 25% when used by
    filesystems with mixed block sizes.  Ultimately this may permit
    dynamic buffer cache sizing on low-KVA machines.

  Discussed with: alc, kib, attilio
  Tested by: pho
  Sponsored by: EMC / Isilon Storage Division

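  A hedged sketch of basic vmem(9) usage as imported here; the arena
  name and sizes are made up, and the prototypes are quoted from
  memory:

      #include <sys/vmem.h>

      vmem_t *arena;
      vmem_addr_t addr;
      int error;

      /* Create an arena managing 1024 pages of address space,
         allocated in page-sized quanta. */
      arena = vmem_create("example_arena", 0, 1024 * PAGE_SIZE,
          PAGE_SIZE, 0, M_WAITOK);

      /* Allocate and free a 16-page range. */
      error = vmem_alloc(arena, 16 * PAGE_SIZE,
          M_BESTFIT | M_WAITOK, &addr);
      if (error == 0)
              vmem_free(arena, addr, 16 * PAGE_SIZE);
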
* Resolve bucket recursion issues by passing a cookie with zone flags  [jeff | 2013-06-26 | 3 files | -43/+86]
  through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().

  - Make some smaller buckets for large zones to further reduce memory
    waste.
  - Implement uma_zone_reserve().  This holds aside a number of items
    only for callers who specify M_USE_RESERVE.  Buckets will never be
    filled from reserve allocations.

  Sponsored by: EMC / Isilon Storage Division

* Typo in comment.  [glebius | 2013-06-24 | 1 file | -1/+1]

* Add a per-zone lock for zones without kegs.  [jeff | 2013-06-20 | 4 files | -105/+104]

  - Be more explicit about zone vs. keg locking.  This functionally
    changes almost nothing.
  - Add a size parameter to uma_zcache_create() so we can size the
    buckets.
  - Pass the zone to bucket_alloc() so it can modify allocation flags
    as appropriate.
  - Fix a bug in zone_alloc_bucket() where I missed an address-of
    operator in a failure case.  (Found by pho)

  Sponsored by: EMC / Isilon Storage Division

* Persist the caller's flags in the bucket allocation flags so we don't  [jeff | 2013-06-19 | 1 file | -1/+1]
  lose an M_NOVM when we recurse into a bucket allocation.

  Sponsored by: EMC / Isilon Storage Division

* Fix a bug that allowed a tracing process (e.g. gdb) to write  [des | 2013-06-18 | 1 file | -0/+6]
  to a memory-mapped file in the traced process's address space even if
  neither the traced process nor the tracing process had write access
  to that file.

  Security: CVE-2013-2171
  Security: FreeBSD-SA-13:06.mmap
  Approved by: so

* Refine UMA bucket allocation to reduce space consumption and improve  [jeff | 2013-06-18 | 2 files | -309/+264]
  performance.

  - Always free to the alloc bucket if there is space.  This gives LIFO
    allocation order to improve hot-cache performance.  This also
    allows for zones with a single bucket per-cpu rather than a pair if
    the entire working set fits in one bucket.
  - Enable per-cpu caches of buckets.  To prevent recursive bucket
    allocation one bucket zone still has per-cpu caches disabled.
  - Pick the initial bucket size based on a table-driven maximum size
    per-bucket rather than the number of items per-page.  This gives
    more sane initial sizes.
  - Only grow the bucket size when we face contention on the zone lock;
    this causes bucket sizes to grow more slowly.
  - Adjust the number of items per-bucket to account for the header
    space.  This packs the buckets more efficiently per-page while
    making them not quite powers of two.
  - Eliminate the per-zone free bucket list.  Always return buckets
    back to the bucket zone.  This ensures that as zones grow into
    larger bucket sizes they eventually discard the smaller sizes.  It
    persists fewer buckets in the system.  The locking is slightly
    trickier.
  - Only switch buckets in zalloc, not zfree; this eliminates
    pathological cases where we ping-pong between two buckets.
  - Ensure that the thread that fills a new bucket gets to allocate
    from it to give a better upper bound on allocation time.

  Sponsored by: EMC / Isilon Storage Division

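  A simplified sketch of the first bullet (free to the alloc bucket
  when it has room); the per-cpu cache field names follow UMA's
  internals from memory, so treat them as approximate:

      /*
       * Paraphrased free path: prefer the allocation bucket so the
       * next zalloc gets a cache-hot item (LIFO order).
       */
      bucket = cache->uc_allocbucket;
      if (bucket != NULL && bucket->ub_cnt < bucket->ub_entries) {
              bucket->ub_bucket[bucket->ub_cnt++] = item;
              return;
      }
      /* Otherwise fall back to the free bucket / zone layer. */
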
* Add a new UMA API: uma_zcache_create().  This makes a zone without any  [jeff | 2013-06-17 | 3 files | -216/+293]
  backing memory that is only a container for per-cpu caches of
  arbitrary pointer items.  These zones have no kegs.

  - Convert the regular keg-based allocator to use the new
    import/release functions.
  - Move some stats to be atomics since they would require excessive
    zone locking/unlocking with the new import/release paradigm.  Make
    zone_free_item simpler now that callers can manage more stats.
  - Check for these cache-only zones in the public APIs and debugging
    code by checking zone_first_keg() against NULL.

  Sponsored by: EMC / Isilon Storage Division

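  A hedged sketch of the import/release idea behind cache-only zones;
  the callback shapes are from memory and the backend_get/backend_put
  helpers are hypothetical:

      /*
       * Illustrative import/release pair: the zone refills and
       * drains its per-cpu caches in batches through these
       * callbacks instead of a keg.
       */
      static int
      example_import(void *arg, void **store, int count, int flags)
      {
              int i;

              for (i = 0; i < count; i++) {
                      store[i] = backend_get(arg, flags);
                      if (store[i] == NULL)
                              break;
              }
              return (i);   /* number of items actually imported */
      }

      static void
      example_release(void *arg, void **store, int count)
      {
              int i;

              for (i = 0; i < count; i++)
                      backend_put(arg, store[i]);
      }
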
* Convert the slab free item list from a linked array of indices to a  [jeff | 2013-06-13 | 3 files | -295/+149]
  bitmap using sys/bitset.  This is much simpler, has lower space
  overhead and is cheaper in most cases.

  - Use a second bitmap for invariants asserts and improve the quality
    of the asserts as well as the number of erroneous conditions that
    we will catch.
  - Drastically simplify sizing code.  Special case refcnt zones since
    they will be going away.
  - Update stale comments.

  Sponsored by: EMC / Isilon Storage Division

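  A minimal sketch of a bitset-backed free map in the style of
  sys/bitset; SLAB_ITEMS and the variable names are made up for
  illustration:

      #include <sys/bitset.h>

      #define SLAB_ITEMS      256       /* hypothetical slab size */
      BITSET_DEFINE(slabbits, SLAB_ITEMS);

      struct slabbits free_map;
      int idx;

      BIT_FILL(SLAB_ITEMS, &free_map);  /* every item starts free */

      idx = BIT_FFS(SLAB_ITEMS, &free_map); /* 1-based, 0 if none */
      if (idx != 0) {
              BIT_CLR(SLAB_ITEMS, idx - 1, &free_map); /* allocate */
              /* ... use item idx - 1 ... */
              BIT_SET(SLAB_ITEMS, idx - 1, &free_map); /* free it */
      }
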
* Revise the interface between vm_object_madvise() and vm_page_dontneed() so  [alc | 2013-06-10 | 3 files | -31/+30]
  that pointless calls to pmap_is_modified() can be easily avoided when
  performing madvise(..., MADV_FREE).

  Sponsored by: EMC / Isilon Storage Division

* Make the sys_mlock() function just a wrapper around the vm_mlock() function,  [glebius | 2013-06-08 | 2 files | -5/+11]
  which does all the work.

  Reviewed by: kib, jilles
  Sponsored by: Nginx, Inc.

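  A minimal sketch of the wrapper split, assuming a vm_mlock() that
  takes the proc, credentials and range; the exact parameter list is an
  assumption:

      /*
       * Illustrative wrapper: the syscall entry point delegates
       * all the work to the vm layer.
       */
      int
      sys_mlock(struct thread *td, struct mlock_args *uap)
      {

              return (vm_mlock(td->td_proc, td->td_ucred, uap->addr,
                  uap->len));
      }
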
* Complete r251452:  [attilio | 2013-06-06 | 2 files | -4/+7]
  Avoid busying/unbusying a page in cases where there is no need to
  drop the vm_obj lock, most notably when the page is fully valid after
  vm_page_grab().

  Sponsored by: EMC / Isilon storage division
  Reviewed by: alc

* In vm_object_split(), busy and consequently unbusy the pages only when  [attilio | 2013-06-04 | 1 file | -3/+4]
  swap_pager_copy() is invoked; otherwise there is no reason to do so.
  This eliminates the need to busy pages most of the time.

  Sponsored by: EMC / Isilon storage division
  Reviewed by: alc

* Update a comment.  [alc | 2013-06-04 | 1 file | -2/+2]

* Relax the object locking in vm_pageout_map_deactivate_pages() and  [alc | 2013-06-04 | 1 file | -11/+11]
  vm_pageout_object_deactivate_pages().  A read lock suffices.

  Sponsored by: EMC / Isilon Storage Division

* Remove irrelevant comments.  [kib | 2013-06-03 | 1 file | -7/+0]

  Discussed with: alc
  MFC after: 3 days

* Require that the page lock is held, instead of the object lock, when  [alc | 2013-06-03 | 2 files | -7/+16]
  clearing the page's PGA_REFERENCED flag.  Since we are typically
  manipulating the page's act_count field when we are clearing its
  PGA_REFERENCED flag, the page lock is already held everywhere that we
  clear the PGA_REFERENCED flag.  So, in fact, this revision only
  changes some comments and an assertion.  Nonetheless, it will enable
  later changes to object locking in the pageout code.

  Introduce vm_page_assert_locked(), which completely hides the
  implementation details of the page lock from the caller, and use it
  in vm_page_aflag_clear().  (The existing vm_page_lock_assert() could
  not be used in vm_page_aflag_clear().)  Over the coming weeks, I
  expect that we'll either eliminate or replace the various uses of
  vm_page_lock_assert() with vm_page_assert_locked().

  Reviewed by: attilio
  Sponsored by: EMC / Isilon Storage Division

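  One plausible minimal shape for the new assertion, assuming it simply
  wraps the existing helper with MA_OWNED; the real definition may
  differ:

      /*
       * Illustrative: callers assert ownership without knowing how
       * the page lock is implemented.
       */
      void
      vm_page_assert_locked(vm_page_t m)
      {

              vm_page_lock_assert(m, MA_OWNED);
      }
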
* Now that access to the page's "act_count" field is synchronized by the page  [alc | 2013-06-01 | 1 file | -1/+0]
  lock instead of the object lock, there is no reason for
  vm_page_activate() to assert that the object is locked for either
  read or write access.  (The "VPO_UNMANAGED" flag never changes after
  page allocation.)

  Sponsored by: EMC / Isilon Storage Division