summaryrefslogtreecommitdiffstats
path: root/sys/vm/vm_mmap.c
Commit message (Collapse)AuthorAgeFilesLines
* - Fix locked memory accounting for maps with MAP_WIREFUTURE flag.zont2012-12-181-11/+33
| | | | | | | | - Add sysctl vm.old_mlock which may turn such accounting off. Reviewed by: avg, trasz Approved by: kib (mentor) MFC after: 1 week
* Remove the support for using non-mpsafe filesystem modules.kib2012-10-221-9/+3
| | | | | | | | | | | | In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho
* - Simplify VM code by using vmspace_wired_count() for counting wiredzont2012-09-051-4/+3
| | | | | | | | memory of a process. Reviewed by: avg Approved by: kib (mentor) MFC after: 2 weeks
* For old mmap syscall, when executing on amd64 or ia64, enforce thekib2012-08-141-0/+7
| | | | | | | | | | | | PROT_EXEC if prot is non-zero, process is 32bit and kern.elf32.i386_read_exec syscal is enabled. This workaround is needed for old i386 a.out binaries, where dynamic linker did not specified PROT_EXEC for mapping of the text. The kern.elf32.i386_read_exec MIB name looks weird for a.out binaries, but I reused the existing knob which already has the needed semantic. MFC after: 1 week
* Adjust the r205536, by allowing a non-zero offset for anonymouskib2012-08-141-5/+17
| | | | | | | | mappings for a.out binaries. Apparently, a.out ld.so from FreeBSD 1.1.5.1 can issue such requests. Reported and tested by: Dan Plassche <dplassche@gmail.com> MFC after: 1 week
* When MAP_STACK mapping is created, the map entry is created only tokib2012-04-211-2/+4
| | | | | | | | | | | | cover the initial stack size. For MCL_WIREFUTURE maps, the subsequent call to vm_map_wire() to wire the whole stack region fails due to VM_MAP_WIRE_NOHOLES flag. Use the VM_MAP_WIRE_HOLESOK to only wire mapped part of the stack. Reported and tested by: Sushanth Rai <sushanth_rai yahoo com> Reviewed by: alc MFC after: 1 week
* Fix mincore(2) so that it reports PG_CACHED pages as resident.alc2012-04-081-0/+3
| | | | MFC after: 2 weeks
* In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flagkib2012-03-171-0/+2
| | | | | | | | | | | | | | | | | | if the filesystem performed short write and we are skipping the page due to this. Propogate write error from the pager back to the callers of vm_pageout_flush(). Report the failure to write a page from the requested range as the FALSE return value from vm_object_page_clean(), and propagate it back to msync(2) to return EIO to usermode. While there, convert the clearobjflags variable in the vm_object_page_clean() and arguments of the helper functions to boolean. PR: kern/165927 Reviewed by: alc MFC after: 2 weeks
* Eliminate stale incorrect ARGSUSED comments.alc2012-03-021-3/+0
| | | | Submitted by: bde
* Simplify vm_mmap()'s control flow.alc2012-02-251-16/+19
| | | | | | | | Add a comment describing what vm_mmap_to_errno() does. Reviewed by: kib MFC after: 3 weeks X-MFC after: r232071
* Place the if() at the right location, to activate the v_writecountkib2012-02-241-4/+4
| | | | | | | | | accounting for shared writeable mappings for all filesystems, not only for the bypass layers. Submitted by: alc Pointy hat to: kib MFC after: 20 days
* Account the writeable shared mappings backed by file in the vnodekib2012-02-231-11/+39
| | | | | | | | | | | | | | | | | | | | | | | | v_writecount. Keep the amount of the virtual address space used by the mappings in the new vm_object un_pager.vnp.writemappings counter. The vnode v_writecount is incremented when writemappings gets non-zero value, and decremented when writemappings is returned to zero. Writeable shared vnode-backed mappings are accounted for in vm_mmap(), and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on the created map entry. During deferred map entry deallocation, vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECOUNT and decrements writemappings for the vm object. Now, the writeable mount cannot be demoted to read-only while writeable shared mappings of the vnodes from the mount point exist. Also, execve(2) fails for such files with ETXTBUSY, as it should be. Noted by: tegge Reviewed by: tegge (long time ago, early version), alc Tested by: pho MFC after: 3 weeks
* When vm_mmap() is used to map a vm object into a kernel vm_map, italc2012-02-161-10/+10
| | | | | | | makes no sense to check the size of the kernel vm_map against the user-level resource limits for the calling process. Reviewed by: kib
* Close a race due to dropping of the map lock between creating map entrykib2012-02-111-7/+3
| | | | | | | | | for a shared mapping and marking the entry for inheritance. Other thread might execute vmspace_fork() in between (e.g. by fork(2)), resulting in the mapping becoming private. Noted and reviewed by: alc MFC after: 1 week
* In order to maximize the re-usability of kernel code in user space thiskmacy2011-09-161-15/+15
| | | | | | | | | | | | | patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)
* Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomickib2011-09-061-4/+4
| | | | | | | | | | | | | | | | | flags field. Updates to the atomic flags are performed using the atomic ops on the containing word, do not require any vm lock to be held, and are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9) functions are provided to modify afalgs. Document the changes to flags field to only require the page lock. Introduce vm_page_reference(9) function to provide a stable KPI and KBI for filesystems like tmpfs and zfs which need to mark a page as referenced. Reviewed by: alc, attilio Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64) Approved by: re (bz)
* Second-to-last commit implementing Capsicum capabilities in the FreeBSDrwatson2011-08-111-4/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc
* Extract the code to translate VM error into errno, into an exportedkib2011-07-101-0/+7
| | | | | | | | | | function vm_mmap_to_errno(). It is useful for the drivers that implement mmap(2)-like functionality, to be able to return error codes consistent with mmap(2). Sponsored by: The FreeBSD Foundation No objections from: alc MFC after: 1 week
* Style.kib2011-07-101-1/+1
| | | | MFC after: 3 days
* All the racct_*() calls need to happen with the proc locked. Fixing thistrasz2011-07-061-0/+12
| | | | | | won't happen before 9.0. This commit adds "#ifdef RACCT" around all the "PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order to avoid useless locking/unlocking in kernels built without "options RACCT".
* Add accounting for most of the memory-related resources.trasz2011-04-051-3/+42
| | | | | Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)
* Remove sysctl vm.max_proc_mmap used to protect from KVA space exhaustion.pluknet2011-02-241-39/+0
| | | | | | | | | | | | | | As it was pointed out by Alan Cox, that no longer serves its purpose with the modern UMA allocator compared to the old one used in 4.x days. The removal of sysctl eliminates max_proc_mmap type overflow leading to the broken mmap(2) seen with large amount of physical memory on arches with factually unbound KVA space (such as amd64). It was found that slightly less than 256GB of physmem was enough to trigger the overflow. Reviewed by: alc, kib Approved by: avg (mentor) MFC after: 2 months
* Fix comment intentation.trasz2010-12-041-8/+8
|
* Do not use __FreeBSD_version prefix for the special osrel version.kib2010-11-141-1/+1
| | | | | | | | The ports/Mk/bsd.port.mk uses sys/param.h to fetch osrel, and cannot grok several constants with the prefix. Reported and tested by: swell.k gmail com MFC after: 1 week
* Use symbolic names instead of hardcoding values for magic p_osrel constants.kib2010-11-141-1/+1
| | | | MFC after: 1 week
* Allow a POSIX shared memory object that is opened for read but not foralc2010-09-191-1/+2
| | | | | | | | | | | | write to nonetheless be mapped PROT_WRITE and MAP_PRIVATE, i.e., copy-on-write. (This is a regression in the new implementation of POSIX shared memory objects that is used by HEAD and RELENG_8. This bug does not exist in RELENG_7's user-level, file-based implementation.) PR: 150260 MFC after: 3 weeks
* Fix a typo in r212281. uintptr -> uintptr_trstone2010-09-071-1/+1
| | | | | | | Pointy hat to: rstone Approved by: emaste (mentor) MFC after: 2 weeks
* In munmap() downgrade the vm_map_lock to a read lock before taking a readrstone2010-09-071-3/+11
| | | | | | | | | | | | | | | | lock on the pmc-sx lock. This prevents a deadlock with pmc_log_process_mappings, which has an exclusive lock on pmc-sx and tries to get a read lock on a vm_map. Downgrading the vm_map_lock in munmap allows pmc_log_process_mappings to continue, preventing the deadlock. Without this change I could cause a deadlock on a multicore 8.1-RELEASE system by having one thread constantly mmap'ing and then munmap'ing a PROT_EXEC mapping in a loop while I repeatedly invoked and stopped pmcstat in system-wide sampling mode. Reviewed by: fabient Approved by: emaste (mentor) MFC after: 2 weeks
* Add the MAP_PREFAULT_READ option to mmap(2).alc2010-08-281-2/+3
| | | | Reviewed by: jhb, kib
* Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that createdkib2010-08-061-7/+7
| | | | | | | | | cdev will never be destroyed. Propagate the flag to devfs vnodes as VV_ETERNVALDEV. Use the flags to avoid acquiring devmtx and taking a thread reference on such nodes. In collaboration with: pho MFC after: 1 month
* Fix commented out resource limit check in mlockall(2). It's still racy,trasz2010-07-271-2/+1
| | | | but at least less misleading.
* Push down page queues lock acquisition in pmap_enter_object() andalc2010-05-261-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pmap_is_referenced(). Eliminate the corresponding page queues lock acquisitions from vm_map_pmap_enter() and mincore(), respectively. In mincore(), this allows some additional cases to complete without ever acquiring the page queues lock. Assert that the page is managed in pmap_is_referenced(). On powerpc/aim, push down the page queues lock acquisition from moea*_is_modified() and moea*_is_referenced() into moea*_query_bit(). Again, this will allow some additional cases to complete without ever acquiring the page queues lock. Reorder a few statements in vm_page_dontneed() so that a race can't lead to an old reference persisting. This scenario is described in detail by a comment. Correct a spelling error in vm_page_dontneed(). Assert that the object is locked in vm_page_clear_dirty(), and restrict the page queues lock assertion to just those cases in which the page is currently writeable. Add object locking to vnode_pager_generic_putpages(). This was the one and only place where vm_page_clear_dirty() was being called without the object being locked. Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call to vm_page_clear_dirty(). Change vnode_pager_generic_putpages() to the modern-style of function definition. Also, change the name of one of the parameters to follow virtual memory system naming conventions. Reviewed by: kib
* Roughly half of a typical pmap_mincore() implementation is machine-alc2010-05-241-26/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | independent code. Move this code into mincore(), and eliminate the page queues lock from pmap_mincore(). Push down the page queues lock into pmap_clear_modify(), pmap_clear_reference(), and pmap_is_modified(). Assert that these functions are never passed an unmanaged page. Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m: Contrary to what the comment says, pmap_mincore() is not simply an optimization. Without a complete pmap_mincore() implementation, mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED because only the pmap can provide this information. Eliminate the page queues lock from vfs_setdirty_locked_object(), vm_pageout_clean(), vm_object_page_collect_flush(), and vm_object_page_clean(). Generally speaking, these are all accesses to the page's dirty field, which are synchronized by the containing vm object's lock. Reduce the scope of the page queues lock in vm_object_madvise() and vm_page_dontneed(). Reviewed by: kib (an earlier version)
* On Alan's advice, rather than do a wholesale conversion on a singlekmacy2010-04-301-0/+2
| | | | | | | | | | | | architecture from page queue lock to a hashed array of page locks (based on a patch by Jeff Roberson), I've implemented page lock support in the MI code and have only moved vm_page's hold_count out from under page queue mutex to page lock. This changes pmap_extract_and_hold on all pmaps. Supported by: Bitgravity Inc. Discussed with: alc, jeffr, and kib
* Resurrect pmap_is_referenced() and use it in mincore(). Essentially,alc2010-04-241-3/+1
| | | | | | | | | | | | | | | | | pmap_ts_referenced() is not always appropriate for checking whether or not pages have been referenced because it clears any reference bits that it encounters. For example, in mincore(), clearing the reference bits has two negative consequences. First, it throws off the activity count calculations performed by the page daemon. Specifically, a page on which mincore() has called pmap_ts_referenced() looks less active to the page daemon than it should. Consequently, the page could be deactivated prematurely by the page daemon. Arguably, this problem could be fixed by having mincore() duplicate the activity count calculation on the page. However, there is a second problem for which that is not a solution. In order to clear a reference on a 4KB page, it may be necessary to demote a 2/4MB page mapping. Thus, a mincore() by one process can have the side effect of demoting a superpage mapping within another process!
* Reject attempts to create a MAP_ANON mapping with a non-zero offset.jhb2010-03-231-2/+1
| | | | | | PR: kern/71258 Submitted by: Alexander Best MFC after: 2 weeks
* Back out the functional parts from r197537. After r197711, affecting allbz2009-10-021-15/+0
| | | | user mappings, mmap no longer needs special treatment.
* Do not allow mmap with the MAP_FIXED argument to map at address zero.simon2009-09-271-1/+18
| | | | | | | | | | | | | | | | | This is done to make it harder to exploit kernel NULL pointer security vulnerabilities. While this of course does not fix vulnerabilities, it does mitigate their impact. Note that this may break some applications, most likely emulators or similar, which for one reason or another require mapping memory at zero. This restriction can be disabled with the security.bsd.mmap_zero sysctl variable. Discussed with: rwatson, bz Tested by: bz (Wine), simon (VirtualBox) Submitted by: jhb
* Old (a.out) rtld attempts to mmap zero-length region, e.g. when bsskib2009-09-201-1/+3
| | | | | | | | | | | of the linked object is zero-length. More old code assumes that mmap of zero length returns success. For a.out and pre-8 ELF binaries, allow the mmap of zero length. Reported by: tegge Reviewed by: tegge, alc, jhb MFC after: 3 days
* - Change mmap() to fail requests with EINVAL that pass a length of 0. Thisjhb2009-07-141-1/+1
| | | | | | | | | | | | behavior is mandated by POSIX. - Do not fail requests that pass a length greater than SSIZE_MAX (such as > 2GB on 32-bit platforms). The 'len' parameter is actually an unsigned 'size_t' so negative values don't really make sense. Submitted by: Alexander Best alexbestms at math.uni-muenster.de Reviewed by: alc Approved by: re (kib) MFC after: 1 week
* Implement global and per-uid accounting of the anonymous memory. Addkib2009-06-231-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved for the uid. The accounting information (charge) is associated with either map entry, or vm object backing the entry, assuming the object is the first one in the shadow chain and entry does not require COW. Charge is moved from entry to object on allocation of the object, e.g. during the mmap, assuming the object is allocated, or on the first page fault on the entry. It moves back to the entry on forks due to COW setup. The per-entry granularity of accounting makes the charge process fair for processes that change uid during lifetime, and decrements charge for proper uid when region is unmapped. The interface of vm_pager_allocate(9) is extended by adding struct ucred *, that is used to charge appropriate uid when allocation if performed by kernel, e.g. md(4). Several syscalls, among them is fork(2), may now return ENOMEM when global or per-uid limits are enforced. In collaboration with: pho Reviewed by: alc Approved by: re (kensmith)
* Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERICrwatson2009-06-051-1/+0
| | | | | | | | and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
* Add an extension to the character device interface that allows characterjhb2009-06-011-58/+47
| | | | | | | | | | | | | | | | | | | | device drivers to use arbitrary VM objects to satisfy individual mmap() requests. - A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is added to cdevsw. This function is called for each mmap() request. If it returns ENODEV, then the mmap() request will fall back to using the device's device pager object and d_mmap(). Otherwise, the method can return a VM object to satisfy this entire mmap() request via *object. It can also modify the starting offset into this object via *foff. This allows device drivers to use the file offset as a cookie to identify specific VM objects. - vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when mapping V_CHR vnodes. This avoids duplicating all the cdev mmap handling code and simplifies some of vm_mmap_vnode(). - D_VERSION has been bumped to D_VERSION_02. Older device drivers using D_VERSION_01 are still supported. MFC after: 1 month
* Retire VM_PROT_READ_IS_EXEC. It was intended to be a micro-optimization,alc2009-04-041-12/+0
| | | | | | | | | | | | | | | | | | but I see no benefit from it today. VM_PROT_READ_IS_EXEC was only intended for use on processors that do not distinguish between read and execute permission. On an mmap(2) or mprotect(2), it automatically added execute permission if the caller specified permissions included read permission. The hope was that this would reduce the number of vm map entries needed to implement an address space because there would be fewer neighboring vm map entries that differed only in the presence or absence of VM_PROT_EXECUTE. (See vm/vm_mmap.c revision 1.56.) Today, I don't see any real applications that benefit from VM_PROT_READ_IS_EXEC. In any case, vm map entries are now organized as a self-adjusting binary search tree instead of an ordered list. So, the need for coalescing vm map entries is not as great as it once was.
* Revert the addition of the freelist argument for the vm_map_delete()kib2009-02-241-4/+1
| | | | | | | | | function, done in r188334. Instead, collect the entries that shall be freed, in the deferred_freelist member of the map. Automatically purge the deferred freelist when map is unlocked. Tested by: pho Reviewed by: alc
* Do not call vm_object_deallocate() from vm_map_delete(), because wekib2009-02-081-1/+4
| | | | | | | | | | hold the map lock there, and might need the vnode lock for OBJT_VNODE objects. Postpone object deallocation until caller of vm_map_delete() drops the map lock. Link the map entries to be freed into the freelist, that is released by the new helper function vm_map_entry_free_freelist(). Reviewed by: tegge, alc Tested by: pho
* Now that vfs_markatime() no longer requires an exclusive lock due tojhb2009-01-211-2/+2
| | | | | | the VOP_MARKATIME() changes, use a shared vnode lock for mmap(). Submitted by: ups
* Update mmap() comment: no more block devices, so no more block devicerwatson2008-10-221-4/+0
| | | | | | cache coherency questions. MFC after: 3 days
* Allow the d_mmap driver methods to use cdevpriv KPI during verificationkib2008-09-201-0/+2
| | | | | | | | phase of establishing mapping. Discussed with: rwatson, jhb, rnoland Tested by: rnoland MFC after: 3 days
* Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed threadattilio2008-08-281-4/+5
| | | | | | was always curthread and totally unuseful. Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
OpenPOWER on IntegriCloud