path: root/sys/dev/md
Commit message | Author | Age | Files | Lines
* Only assert the length of the passed bio in mdstart_vnode() when the bio is unmapped, so we must map the bio pages into a pbuf. This works around GEOM classes that do not follow the MAXPHYS limit on the i/o size, since such classes do not know about unmapped bios either.
  Reported by: Paolo Pinto <paolo.pinto@netasq.com>
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Author: kib | Age: 2013-12-10 | Files: 1 | Lines: -2/+2
* Change comment to match code.
  Discussed with: thompsa
  Sponsored by: The FreeBSD Foundation
  Author: trasz | Age: 2013-12-04 | Files: 1 | Lines: -4/+4
* Add a "null" backend to mdconfig(8). It does exactly what the name suggests and is somewhat useful for benchmarking.
  MFC after: 1 month
  No objections from: kib
  Sponsored by: The FreeBSD Foundation
  Author: trasz | Age: 2013-12-04 | Files: 1 | Lines: -0/+39
* Merge GEOM direct dispatch changes from the projects/camlock branch.
  When the safety requirements are met, I/O requests can be executed directly in the caller's context instead of being passed to the GEOM g_up/g_down threads. This avoids CPU bottlenecks in the g_up/g_down threads and saves several context switches per I/O.
  The safety requirements now defined are:
  - the caller should not hold any locks and should be reentrant;
  - the callee should not depend on GEOM dual-threaded concurrency semantics;
  - on the way down, if the request is unmapped while the callee doesn't support it, the context should be sleepable;
  - kernel thread stack usage should be below 50%.
  To keep compatibility with GEOM classes that do not meet the above requirements, new provider and consumer flags were added:
  - G_CF_DIRECT_SEND -- consumer code meets caller requirements (request);
  - G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done);
  - G_PF_DIRECT_SEND -- provider code meets caller requirements (done);
  - G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request).
  A capable GEOM class can set them, allowing direct dispatch in cases where it is safe. If any of the requirements is not met, the request is queued to the g_up or g_down thread, same as before.
  These GEOM classes were reviewed and updated to support direct dispatch: CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE, VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, and all classes based on the g_slice KPI (LABEL, MAP, FLASHMAP, etc).
  To declare direct completion capability, the disk(9) KPI got a new flag equivalent to G_PF_DIRECT_SEND: DISKFLAG_DIRECT_COMPLETION. The da(4) and ada(4) disk drivers now set it, thanks to earlier CAM locking work.
  This change more than doubles peak block storage performance on systems with many CPUs. Together with the earlier CAM locking changes, it reaches more than 1 million IOPS (512-byte raw reads from 16 SATA SSDs on 4 HBAs to 256 user-level threads).
  Sponsored by: iXsystems, Inc.
  MFC after: 2 months
  Author: mav | Age: 2013-10-22 | Files: 1 | Lines: -1/+9
* Give the page allocations initiated by the swap-backed md(4) a higher priority. If the write is requested by a system daemon, sleeping there would starve resources and cause a deadlock.
  Reported and tested by: pho
  Sponsored by: The FreeBSD Foundation
  Author: kib | Age: 2013-08-30 | Files: 1 | Lines: -1/+1
* Remove the deprecated VM_ALLOC_RETRY flag for vm_page_grab(9). The flag has been mandatory since r209792, where vm_page_grab(9) was changed to support only the alloc-retry semantic.
  Suggested and reviewed by: alc
  Sponsored by: The FreeBSD Foundation
  Author: kib | Age: 2013-08-22 | Files: 1 | Lines: -2/+1
* The soft and hard busy mechanisms rely on the vm object lock to work. Unify the two concepts into a real, minimal sxlock, where shared acquisition represents the soft busy and exclusive acquisition represents the hard busy. The old VPO_WANTED mechanism becomes the hard path for this new lock, and it becomes per-page rather than per-object.
  The vm_object lock becomes an interlock for this functionality: it can be held in either read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the owning thread cannot make any assumption about the busy state unless it is also busying it.
  Also:
  - Add a new flag to directly share-busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions run under a read object lock.
  - Move the swapping sleep into its own per-object flag.
  The KPI is heavily changed, which is why the version is bumped. It is very likely that some VM ports users will need to change their own code.
  Sponsored by: EMC / Isilon storage division
  Discussed with: alc
  Reviewed by: jeff, kib
  Tested by: gavin, bapt (older version)
  Tested by: pho, scottl
  Author: attilio | Age: 2013-08-09 | Files: 1 | Lines: -4/+4
* Fix data corruption on the swap-backed md.
  Assign the rv variable a success code if the pager was not asked for the page. Reusing the error code from the previously processed page caused a valid page to be zeroed when, e.g., the previous page was not available in the pager.
  Reported by: lstewart
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Author: kib | Age: 2013-05-24 | Files: 1 | Lines: -1/+7
* Do not declare that the preloaded md(4) supports unmapped bio requests; it does not.
  Reported by: <mh@kernel32.de>
  Sponsored by: The FreeBSD Foundation
  Author: kib | Age: 2013-04-02 | Files: 1 | Lines: -1/+9
* Support unmapped i/o for md(4).
  The vnode-backed md(4) has to map the unmapped bio, because the VOP_READ() and VOP_WRITE() interfaces do not allow passing unmapped requests to the filesystem. The vnode-backed md(4) uses pbufs instead of relying on the bio_transient_map, to avoid the usual md deadlock.
  Sponsored by: The FreeBSD Foundation
  Tested by: pho, scottl
  Author: kib | Age: 2013-03-19 | Files: 1 | Lines: -40/+204
* Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() to their "write" versions.
  Sponsored by: EMC / Isilon storage division
  Author: attilio | Age: 2013-02-20 | Files: 1 | Lines: -8/+8
* Switch the vm_object lock to be an rwlock.
  - VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations.
  - VM_OBJECT_SLEEP() is introduced as a general-purpose primitive to get a sleep operation using a VM_OBJECT_LOCK() as protection.
  - The approach has to cope with vm_pager.h namespace pollution, so many files need to include rwlock.h directly.
  Author: attilio | Age: 2013-02-20 | Files: 1 | Lines: -0/+1
* Print the correct unit number when attaching preloaded memory disks. Retire the now-unused mdunits variable.
  Author: jh | Age: 2012-11-21 | Files: 1 | Lines: -6/+7
* Disallow attaching preloaded memory disks via ioctl.
  - The feature is dangerous because the kernel code didn't check the validity of the memory address provided from user space.
  - It seems that mdconfig(8) never really supported attaching preloaded memory disks.
  - Preloaded memory disks are automatically attached during md(4) initialization, so there shouldn't be much use for the feature.
  PR: kern/169683
  Discussed on: freebsd-hackers
  Author: jh | Age: 2012-11-21 | Files: 1 | Lines: -23/+6
* Zero the newly allocated md(4) swap-backed page to prevent random kernel memory leakage to userspace.
  In typical use, when a filesystem is put on the md disk, the change only costs the CPU time and memory bandwidth spent zeroing the page, since filesystems make sure that the user never sees unwritten content. But if the md disk is used as a raw device by userspace, the garbage is exposed.
  Reported by: Paul Schenkeveld <freebsd@psconsult.nl>
  MFC after: 2 weeks
  Author: kib | Age: 2012-11-08 | Files: 1 | Lines: -0/+9
* Add an MD_ROOT_FSTYPE kernel option. The option specifies the file system part of the MD_ROOT mount string. Hardcoding the file system type as "ufs" is too restrictive.
  Author: marcel | Age: 2012-11-03 | Files: 1 | Lines: -1/+5
* Remove the support for using non-mpsafe filesystem modules.
  In particular, do not lock Giant conditionally when calling into the filesystem module, and remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems.
  The VFS_VERSION is bumped to indicate the interface change, even though the interface signatures do not change.
  Conducted and reviewed by: attilio
  Tested by: pho
  Author: kib | Age: 2012-10-22 | Files: 1 | Lines: -15/+3
* After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason to pull in vm_param.h was removed. The other big dependency of vm_page.h on vm_param.h is the PA_LOCK* definitions, which are only needed by in-kernel code, because modules use KBI-safe functions to lock the pages.
  Stop including vm_param.h into vm_page.h. Include vm_param.h explicitly in the kernel code that needs it.
  Suggested and reviewed by: alc
  MFC after: 2 weeks
  Author: kib | Age: 2012-08-05 | Files: 1 | Lines: -2/+1
* Remove a verbose, unused, commented-out debugging printf.
  MFC after: 1 week
  Reviewed by: alc
  Author: kib | Age: 2012-08-04 | Files: 1 | Lines: -6/+0
* Disallow a sectorsize larger than MAXPHYS and a mediasize smaller than the sectorsize.
  PR: 169947
  Submitted by: Filip Palian (original version)
  Reviewed by: kib
  Author: jh | Age: 2012-08-02 | Files: 1 | Lines: -6/+12
* Make it possible to resize md(4) devices.
  Reviewed by: kib
  Sponsored by: FreeBSD Foundation
  Author: trasz | Age: 2012-07-07 | Files: 1 | Lines: -1/+71
* Document a large number of currently undocumented sysctls. While here, fix some style(9) issues and reduce redundancy.
  PR: kern/155491
  PR: kern/155490
  PR: kern/155489
  Submitted by: Galimov Albert <wtfcrap@mail.ru>
  Approved by: bde
  Reviewed by: jhb
  MFC after: 1 week
  Author: eadler | Age: 2011-12-13 | Files: 1 | Lines: -2/+4
* Add information about the MD_READONLY and MD_COMPRESS flags to the configuration dump.
  MFC after: 1 week
  Author: ae | Age: 2011-10-31 | Files: 1 | Lines: -0/+5
* Include sys/sbuf.h directly.
  Author: ae | Age: 2011-07-11 | Files: 1 | Lines: -0/+1
* Move ZERO_REGION_SIZE to a machine-dependent file, since on many architectures (i386, for example) the virtual memory space may be constrained enough that 2MB is a large chunk. Use 64K for architectures other than amd64 and ia64, with special handling for sparc64 due to differing hardware.
  Also commit the comment changes to kmem_init_zero_region() that I missed due to not saving the file. (Darn the unfamiliar development environment.)
  Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you see fit.
  Requested by: alc
  MFC after: 1 week
  MFC with: r221853
  Author: mdf | Age: 2011-05-13 | Files: 1 | Lines: -0/+2
* Use a globally visible region of zeros for both /dev/zero and the md device. There are likely other kernel uses of a "blob of zeros" that can be converted.
  Reviewed by: alc
  MFC after: 1 week
  Author: mdf | Age: 2011-05-13 | Files: 1 | Lines: -5/+3
* Implement BIO_DELETE for vnode devices by simply overwriting the deleted sectors with all zeroes. The zeroes come from a static buffer; null(4) uses a dynamic buffer for the same purpose (for /dev/zero). It might be a good idea to have a static, shared, read-only all-zeroes page somewhere in the kernel that md(4), null(4) and any other code that needs zeroes could use.
  Reviewed by: kib
  MFC after: 3 weeks
  Author: des | Age: 2011-04-29 | Files: 1 | Lines: -0/+42
* Use the preload_fetch_addr() and preload_fetch_size() convenience functions, and only create the MD device when we have a non-zero pointer and size.
  Sponsored by: Juniper Networks
  Author: marcel | Age: 2011-02-09 | Files: 1 | Lines: -10/+9
* Add support for BIO_DELETE on swap-backed md(4). When a BIO_DELETE covers the whole page, free the page. Otherwise, clear the region and mark it clean. Not marking the page dirty could reinstantiate cleared data, but this is allowed by the BIO_DELETE specification and saves an unneeded write to swap.
  Reviewed by: alc
  Tested by: pho
  MFC after: 2 weeks
  Author: kib | Age: 2011-01-27 | Files: 1 | Lines: -6/+10
* A bio shall not be accessed after g_io_deliver(9).
  Reported and tested by: pho
  Reviewed by: ae, phk
  MFC after: 1 week
  Author: kib | Age: 2011-01-25 | Files: 1 | Lines: -1/+1
* Add missing ().
  Noted by: alc
  MFC after: 3 days
  Author: kib | Age: 2011-01-19 | Files: 1 | Lines: -2/+2
* There is no point in calling vm_object_set_writeable_dirty() on an object that is definitively known to be swap-backed, since its only effects are on vnode-backed objects.
  Reviewed by: kib
  Author: alc | Age: 2011-01-19 | Files: 1 | Lines: -1/+0
* Add reporting of the GEOM::candelete BIO_GETATTR attribute for md(4) and geom_disk(4). A non-zero value of the attribute means that the device supports BIO_DELETE.
  Suggested and reviewed by: pjd
  Tested by: pho
  MFC after: 1 week
  Author: kib | Age: 2010-12-29 | Files: 1 | Lines: -2/+3
* Add the sysctl vm.md_malloc_wait; a non-zero value switches malloc-backed md(4) to using M_WAITOK malloc calls. M_NOWAIT allocations may fail when enough memory could be freed, just not immediately. E.g. SU UFS becomes quite unhappy when a metadata write returns an error, which is what a failed malloc() call would cause.
  Reported and tested by: pho
  MFC after: 1 week
  Author: kib | Age: 2010-12-29 | Files: 1 | Lines: -3/+8
* Allow the MDIOCATTACH ioctl operation to originate from within the kernel. To protect against malicious software, we demand that the file name be at a particular location (i.e. appended to the mdio structure) for the request to be treated as in-kernel.
  Author: marcel | Age: 2010-10-18 | Files: 1 | Lines: -8/+16
* Remove some extra white space. Wrap the g_md_dumpconf() prototype to 80 columns.
  Author: jh | Age: 2010-07-26 | Files: 1 | Lines: -9/+7
* Convert md(4) to use alloc_unr(9) and alloc_unr_specific(9) for unit number allocation. The old approach had some problems; for example, it allowed an overflow to occur in the unit number calculation.
  PR: kern/122288
  Author: jh | Age: 2010-07-22 | Files: 1 | Lines: -12/+20
* Calculate nshift only once.
  Also noted by: avg
  MFC after: 1 week
  Author: kib | Age: 2010-07-06 | Files: 1 | Lines: -4/+6
* Eliminate unnecessary page queues locking.
  Author: alc | Age: 2010-06-15 | Files: 1 | Lines: -3/+1
* Lock the page around vm_page_activate() and vm_page_deactivate() calls where the lock was missing. The wrapped fragments now protect wire_count with the page lock.
  Reviewed by: alc
  Author: kib | Age: 2010-05-03 | Files: 1 | Lines: -0/+2
* Fix panic on invalid 'mdconfig -at preload' usage.
  PR: kern/80136
  Author: trasz | Age: 2010-02-27 | Files: 1 | Lines: -0/+2
* (S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument. Fix some wrong usages.
  Note: this does not affect generated binaries, as this argument is not used.
  PR: 137213
  Submitted by: Eygene Ryabinkin (initial version)
  MFC after: 1 month
  Author: antoine | Age: 2009-12-28 | Files: 1 | Lines: -1/+1
* Implement global and per-uid accounting of anonymous memory. Add the rlimit RLIMIT_SWAP, which limits the amount of swap that may be reserved for the uid.
  The accounting information (charge) is associated with either the map entry or the vm object backing the entry, assuming the object is the first one in the shadow chain and the entry does not require COW. The charge is moved from the entry to the object on allocation of the object, e.g. during mmap, assuming the object is allocated, or on the first page fault on the entry. It moves back to the entry on forks due to COW setup.
  The per-entry granularity of accounting makes the charge process fair for processes that change uid during their lifetime, and decrements the charge for the proper uid when a region is unmapped.
  The interface of vm_pager_allocate(9) is extended by adding a struct ucred *, which is used to charge the appropriate uid when the allocation is performed by the kernel, e.g. by md(4).
  Several syscalls, among them fork(2), may now return ENOMEM when global or per-uid limits are enforced.
  In collaboration with: pho
  Reviewed by: alc
  Approved by: re (kensmith)
  Author: kib | Age: 2009-06-23 | Files: 1 | Lines: -4/+4
* Add cpu_flush_dcache() for use after non-DMA-based I/O so that a possible future I-cache coherency operation can succeed. On ARM, for example, the L1 cache can be (and is) virtually mapped, which means that any I/O that uses temporary mappings will not see the I-cache made coherent. On ia64 a similar behaviour has been observed. By flushing the D-cache, execution of binaries backed by md(4) and/or NFS works reliably.
  For Book-E (powerpc), execution over NFS exhibits SIGILL once in a while as well, though cpu_flush_dcache() hasn't been implemented there yet.
  Doing an explicit D-cache flush as part of the non-DMA-based I/O read operation eliminates the need to do it as part of the I-cache coherency operation itself, and thus avoids pessimizing the DMA-based I/O read operations, for which the D-cache is already flushed/invalidated. It also allows future optimizations whereby the bcopy() followed by the D-cache flush can be integrated into a single operation, which could be implemented using on-chip DMA engines, bypassing the D-cache altogether.
  Author: marcel | Age: 2009-05-18 | Files: 1 | Lines: -3/+6
* Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a filesystem supports additional operations using shared vnode locks. Currently this is used to enable shared locks for open() and close() of read-only file descriptors.
  - When an ISOPEN namei() request is performed with LOCKSHARED, use a shared vnode lock for the leaf vnode only if the mount point has the extended shared flag set.
  - Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but not O_CREAT.
  - Use a shared vnode lock around VOP_CLOSE() if the file was opened with O_RDONLY and the mountpoint has the extended shared flag set.
  - Adjust md(4) to upgrade the vnode lock on the vnode it gets back from vn_open(), since it may now only have a shared vnode lock.
  - Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS, since FIFOs require exclusive vnode locks for their open() and close() routines. (My recent MPSAFE patches for UDF and cd9660 already included this change.)
  - Enable extended shared operations on UFS, cd9660, and UDF.
  Submitted by: ups
  Reviewed by: pjd (ZFS bits)
  MFC after: 1 month
  Author: jhb | Age: 2009-03-11 | Files: 1 | Lines: -10/+20
* Remove unnecessary page queues locking around vm_page_wakeup(). (This change is applicable to RELENG_7 but not RELENG_6.)
  MFC after: 1 week
  Author: alc | Age: 2009-02-22 | Files: 1 | Lines: -7/+1
* Add the possibility to specify "-o force" with "mdconfig -du".
  Reviewed by: scottl
  Approved by: rwatson (mentor)
  Sponsored by: FreeBSD Foundation
  Author: trasz | Age: 2009-01-10 | Files: 1 | Lines: -2/+4
* Fix forced mdconfig -du. E.g. the following would previously result in a panic:
    mdconfig -af blah.img -o force
    mount /dev/md0 /mnt
    mdconfig -du 0
  Reviewed by: scottl
  Approved by: rwatson (mentor)
  Sponsored by: FreeBSD Foundation
  Author: trasz | Age: 2008-12-16 | Files: 1 | Lines: -1/+4
* Decontextualize the couplet VOP_GETATTR / VOP_SETATTR, as the passed thread was always curthread and totally useless.
  Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
  Author: attilio | Age: 2008-08-28 | Files: 1 | Lines: -1/+1
* Remove the distinction between device minor and unit numbers.
  Even though we got rid of device major numbers some time ago, device drivers still need to provide unique device minor numbers to make_dev(). These numbers are only used inside the kernel. They are not related to the device major and minor numbers visible in devfs, which are actually based on the inode number of the device. It would eventually be nice to remove minor numbers entirely, but we don't want to be too aggressive here.
  Because bits 8-15 of the device number field (si_drv0) are still reserved for the major number, there is no 1:1 mapping of device minor and unit numbers. Because this is now unused, remove the restrictions on these numbers. The MAXMAJOR definition was actually used for two purposes: it converted both the userspace and kernelspace device numbers to their major/minor pair, which is why it is now named UMINORMASK.
  minor2unit() and unit2minor() have now become useless. Both minor() and dev2unit() now serve the same purpose. We should eventually remove some of them, or at least turn them into macros. If devfs became completely unaware of minor numbers, we could consider using si_drv0 directly, just like si_drv1 and si_drv2.
  Approved by: philip (mentor)
  Author: ed | Age: 2008-05-29 | Files: 1 | Lines: -1/+2
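The 2013-10-22 GEOM direct dispatch commit lists the conditions under which a request going down may run in the caller's context. The following userspace sketch models that decision under stated assumptions. `struct dispatch_ctx`, `can_dispatch_down_directly()` and the flag bit values are hypothetical. The flag names come from the commit, but the real GEOM code defines its own values and may check more state than is shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names from the commit; the bit values here are hypothetical. */
#define G_CF_DIRECT_SEND     0x01u  /* consumer meets caller requirements */
#define G_CF_DIRECT_RECEIVE  0x02u  /* consumer meets callee requirements */
#define G_PF_DIRECT_SEND     0x01u  /* provider meets caller requirements */
#define G_PF_DIRECT_RECEIVE  0x02u  /* provider meets callee requirements */

/* Hypothetical snapshot of the state relevant to one downward request. */
struct dispatch_ctx {
    unsigned cons_flags;         /* flags of the issuing consumer */
    unsigned prov_flags;         /* flags of the target provider */
    bool holds_locks;            /* caller currently holds locks */
    bool request_unmapped;       /* bio carries unmapped pages */
    bool callee_takes_unmapped;  /* provider accepts unmapped bios */
    bool can_sleep;              /* caller context is sleepable */
    size_t stack_used;           /* kernel stack bytes in use */
    size_t stack_size;           /* total kernel stack bytes */
};

/*
 * Returns true if the request may be executed directly; false means
 * it would be queued to the g_down thread as before.
 */
static bool
can_dispatch_down_directly(const struct dispatch_ctx *c)
{
    assert(c != NULL);
    /* On the way down, the consumer sends and the provider receives. */
    if ((c->cons_flags & G_CF_DIRECT_SEND) == 0 ||
        (c->prov_flags & G_PF_DIRECT_RECEIVE) == 0)
        return (false);
    if (c->holds_locks)
        return (false);
    /* Unmapped request to a callee that needs mapping: must sleep. */
    if (c->request_unmapped && !c->callee_takes_unmapped && !c->can_sleep)
        return (false);
    /* Keep kernel stack usage below 50%. */
    if (c->stack_used * 2 >= c->stack_size)
        return (false);
    return (true);
}
```

Each early return corresponds to one requirement from the commit's list, so a single unmet condition falls back to the queued path.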
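The 2013-05-24 swap-backed md fix concerns a status variable carried across loop iterations. Below is a userspace model of the bug and the fix, not the md(4) source. `struct page_model`, `pager_get()` and `read_pages()` are hypothetical stand-ins for the VM page, the pager call, and the per-page loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum pager_rv { PAGER_OK, PAGER_FAIL, PAGER_ERROR };

/* Hypothetical model of one page touched by a read request. */
struct page_model {
    bool valid;     /* contents already valid in memory */
    bool in_pager;  /* the pager holds a copy of this page */
    bool zeroed;    /* the loop zero-filled this page */
};

/* Hypothetical pager call: fills the page or reports it absent. */
static enum pager_rv
pager_get(struct page_model *p)
{
    if (!p->in_pager)
        return (PAGER_FAIL);
    p->valid = true;
    return (PAGER_OK);
}

static void
read_pages(struct page_model *pages, size_t n)
{
    enum pager_rv rv;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct page_model *p = &pages[i];

        if (p->valid) {
            /*
             * The pager was not asked, so record success explicitly.
             * Without this, rv would keep the previous page's
             * PAGER_FAIL and this valid page would be zeroed below.
             */
            rv = PAGER_OK;
        } else
            rv = pager_get(p);
        if (rv == PAGER_FAIL) {
            /* Not in the pager: zero-fill and mark valid. */
            p->zeroed = true;
            p->valid = true;
        }
    }
}
```

Removing the `rv = PAGER_OK` branch reproduces the reported corruption: a valid page that follows a page missing from the pager gets zeroed.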
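The 2012-08-02 commit rejects two geometries. Here is a minimal sketch of those checks. The function name `md_check_geometry()` is hypothetical. The 128 KiB value for MAXPHYS and the use of EINVAL are assumptions for illustration, and the kernel's own definitions are authoritative.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumption: 128 KiB; the kernel's MAXPHYS is authoritative. */
#define SKETCH_MAXPHYS (128u * 1024u)

/* Hypothetical validation of a requested md geometry. */
static int
md_check_geometry(uint64_t mediasize, uint32_t sectorsize)
{
    /* A sector larger than the maximum single transfer is rejected. */
    if (sectorsize > SKETCH_MAXPHYS)
        return (EINVAL);
    /* The device must hold at least one whole sector. */
    if (mediasize < sectorsize)
        return (EINVAL);
    return (0);
}
```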
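The 2011-01-27 commit splits BIO_DELETE on swap-backed md(4) into a whole-page case and a partial-page case. This sketch shows the per-page decision. `md_delete_in_page()` and `enum delete_action` are hypothetical, and a 4 KiB page size is assumed. Actually freeing the page is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 4 KiB pages; the real PAGE_SIZE is machine-dependent. */
#define SKETCH_PAGE_SIZE 4096u

enum delete_action {
    DELETE_FREE_PAGE,   /* whole page covered: caller frees it */
    DELETE_ZERO_RANGE   /* partial: range zeroed in place */
};

/* Hypothetical handler for the part of a BIO_DELETE inside one page. */
static enum delete_action
md_delete_in_page(unsigned char *page, size_t offs, size_t len)
{
    assert(offs <= SKETCH_PAGE_SIZE && len <= SKETCH_PAGE_SIZE - offs);
    if (offs == 0 && len == SKETCH_PAGE_SIZE)
        return (DELETE_FREE_PAGE);
    memset(page + offs, 0, len);
    /*
     * The page is deliberately not marked dirty, so no write to swap
     * is scheduled; older swap contents may reappear later, which
     * the BIO_DELETE semantics permit.
     */
    return (DELETE_ZERO_RANGE);
}
```

The trade-off is the one the commit describes. Skipping the dirty bit saves a write to swap, at the cost of possibly resurrecting data that a delete is allowed to leave behind.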
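The 2010-10-18 commit treats an MDIOCATTACH request as in-kernel only when the file name is appended to the mdio structure. A minimal sketch of that location test follows. `struct md_ioctl_sketch` is a hypothetical, trimmed-down stand-in, not the real `struct md_ioctl`. Both helper functions are illustrative as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for the md ioctl structure. */
struct md_ioctl_sketch {
    unsigned md_unit;
    char *md_file;   /* path of the backing file */
};

/* True when the path sits immediately after the structure. */
static bool
md_file_is_appended(const struct md_ioctl_sketch *mdio)
{
    return (mdio->md_file == (const char *)(mdio + 1));
}

/* Builds a request as an in-kernel caller would: path appended. */
static struct md_ioctl_sketch *
make_appended_request(const char *path)
{
    struct md_ioctl_sketch *mdio;
    size_t len;

    assert(path != NULL);
    len = strlen(path) + 1;
    mdio = malloc(sizeof(*mdio) + len);
    if (mdio == NULL)
        return (NULL);
    memset(mdio, 0, sizeof(*mdio));
    mdio->md_file = (char *)(mdio + 1);
    memcpy(mdio->md_file, path, len);
    return (mdio);
}
```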
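The 2010-07-22 commit moves md(4) unit numbers to alloc_unr(9) and alloc_unr_specific(9). To illustrate the interface shape (lowest free unit, or a caller-chosen one, with -1 on failure), here is a simple bitmap allocator. It is a hypothetical stand-in, not the unr allocator's actual data structure. The pool size is arbitrary. All range checks happen before any indexing, so no unit-number arithmetic can overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical pool size; the real md(4) limit differs. */
#define SKETCH_MAX_UNITS 64

struct unit_pool {
    bool used[SKETCH_MAX_UNITS];
};

static void
unit_pool_init(struct unit_pool *p)
{
    memset(p, 0, sizeof(*p));
}

/* Lowest free unit, or -1 when the pool is exhausted. */
static int
unit_alloc(struct unit_pool *p)
{
    for (int u = 0; u < SKETCH_MAX_UNITS; u++) {
        if (!p->used[u]) {
            p->used[u] = true;
            return (u);
        }
    }
    return (-1);
}

/* A caller-chosen unit, or -1 if out of range or already taken. */
static int
unit_alloc_specific(struct unit_pool *p, int u)
{
    if (u < 0 || u >= SKETCH_MAX_UNITS || p->used[u])
        return (-1);
    p->used[u] = true;
    return (u);
}

static void
unit_free(struct unit_pool *p, int u)
{
    assert(u >= 0 && u < SKETCH_MAX_UNITS && p->used[u]);
    p->used[u] = false;
}
```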
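The 2009-06-23 commit enforces both a global and a per-uid (RLIMIT_SWAP) limit on reserved swap, failing with ENOMEM when either would be exceeded. The sketch below models only that limit check. It does not cover moving charges between map entries and objects. `struct swap_acct`, its helpers and the tiny fixed uid table are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_UIDS 4   /* hypothetical: tiny fixed uid table */

struct swap_acct {
    uint64_t global_limit;
    uint64_t global_reserved;
    uint64_t uid_limit[SKETCH_MAX_UIDS];    /* per-uid cap, like RLIMIT_SWAP */
    uint64_t uid_reserved[SKETCH_MAX_UIDS];
};

static void
swap_acct_init(struct swap_acct *a, uint64_t global_limit)
{
    memset(a, 0, sizeof(*a));
    a->global_limit = global_limit;
}

/*
 * Charge bytes to uid, or return ENOMEM if either limit would be
 * exceeded.  Assumes limits are never lowered below current
 * reservations, so the "remaining room" subtractions cannot wrap.
 */
static int
swap_reserve(struct swap_acct *a, unsigned uid, uint64_t bytes)
{
    assert(uid < SKETCH_MAX_UIDS);
    if (bytes > a->global_limit - a->global_reserved)
        return (ENOMEM);
    if (bytes > a->uid_limit[uid] - a->uid_reserved[uid])
        return (ENOMEM);
    a->global_reserved += bytes;
    a->uid_reserved[uid] += bytes;
    return (0);
}

/* Return a previous charge, e.g. when a region is unmapped. */
static void
swap_release(struct swap_acct *a, unsigned uid, uint64_t bytes)
{
    assert(uid < SKETCH_MAX_UIDS);
    assert(a->uid_reserved[uid] >= bytes && a->global_reserved >= bytes);
    a->global_reserved -= bytes;
    a->uid_reserved[uid] -= bytes;
}
```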
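The 2008-05-29 commit explains that bits 8-15 of the device number were reserved for the major number, so unit and minor numbers did not map 1:1. The following illustrates such a mapping, assuming 32-bit numbers. `unit_to_minor()` and `minor_to_unit()` are hypothetical names in the spirit of the retired unit2minor() and minor2unit(), not copies of them. The low byte of the unit stays in place and the remaining bits skip over the reserved byte, so round-tripping works for units below 2^24.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 8-15 of the minor number are reserved (for the major). */
#define SKETCH_RESERVED_MASK 0x0000ff00u

static uint32_t
unit_to_minor(uint32_t unit)
{
    /* Keep the low byte; shift the rest above the reserved byte. */
    return ((unit & 0xffu) | ((unit << 8) & ~0xffffu));
}

static uint32_t
minor_to_unit(uint32_t minor)
{
    /* Inverse: drop the reserved byte and close the gap. */
    return ((minor & 0xffu) | ((minor >> 8) & ~0xffu));
}
```

Removing the restriction, as the commit does, makes this translation unnecessary.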
OpenPOWER on IntegriCloud