path: root/sys/kern
Commit message | Author | Age | Files | Lines
* Revert r251590. It unexpectedly broke the build and there were some questions on locking. As part of commit-bit grooming, I'd like Steve to handle this, but can't leave things broken in the mean time.
  Author: marcel | Age: 2013-06-10 | Files: 1 | Lines: -2/+0
* Add vfs_mounted and vfs_unmounted events so that components can be informed about mount and unmount events. This is used by Juniper to implement a more optimal implementation of NetBSD's veriexec.
  Submitted by: stevek@juniper.net
  Obtained from: Juniper Networks, Inc
  Author: marcel | Age: 2013-06-09 | Files: 1 | Lines: -0/+2
* aio_mlock() added:
  - Regen for r251526.
  - Bump __FreeBSD_version.
  Author: glebius | Age: 2013-06-08 | Files: 3 | Lines: -2/+26
* Add new system call - aio_mlock(). The name speaks for itself. It allows the mlock(2) operation, which can consume a lot of time, to be performed under control of aio(4).
  Reviewed by: kib, jilles
  Sponsored by: Nginx, Inc.
  Author: glebius | Age: 2013-06-08 | Files: 2 | Lines: -4/+46
* Separate LIO_SYNC processing into a separate function aio_process_sync(), and rename aio_process() into aio_process_rw().
  Reviewed by: kib
  Sponsored by: Nginx, Inc.
  Author: glebius | Age: 2013-06-08 | Files: 1 | Lines: -19/+41
* Do not compare the existing mask of a cpuset with a new mask when changing the mask of a cpuset. Also, change the cpuset's mask before updating the masks of all children. Previously changing a cpuset's mask first required setting the mask to a super-set of both the old and new masks and then changing it a second time to the new mask.
  Author: jhb | Age: 2013-06-06 | Files: 1 | Lines: -8/+11
* Don't busy the page unless we are likely to release the object lock.
  Reviewed by: kib
  Sponsored by: EMC / Isilon Storage Division
  Author: alc | Age: 2013-06-06 | Files: 1 | Lines: -2/+4
* - Consolidate duplicate code into support functions.
  - Split the bqlock into bqclean and bqdirty locks.
  - Only acquire the wakeup synchronization locks when we cross a threshold requiring them.
  - Restructure the way flushbufqueues() targets work so they are more smp friendly and sane.
  Reviewed by: kib
  Discussed with: mckusick, attilio
  Sponsored by: EMC / Isilon Storage Division
  M vfs_bio.c
  Author: jeff | Age: 2013-06-05 | Files: 1 | Lines: -264/+305
* Improve r250890, so that we stop processing of a message with zero descriptors as early as possible, and assert that number of descriptors is positive in unp_freerights().
  Reviewed by: mjg, pjd, jilles
  Author: glebius | Age: 2013-06-04 | Files: 1 | Lines: -8/+7
* - Fix a couple of inverted panic messages for shared/exclusive mismatches of a lock within a single thread.
  - Fix handling of interlocks in WITNESS by properly requiring the interlock to be held exactly once if it is specified.
  Author: jhb | Age: 2013-06-03 | Files: 2 | Lines: -10/+26
* - Handle the recursed/not recursed flags with RA_RLOCKED in rw_assert().
  - Tweak a panic message.
  Author: jhb | Age: 2013-06-03 | Files: 1 | Lines: -4/+6
* Be more generous when donating the current thread time to the owner of the vnode lock while iterating over the free vnode list. Instead of yielding, pause for 1 tick. The change is reported to help in some virtualized environments.
  Submitted by: Roger Pau Monné <roger.pau@citrix.com>
  Discussed with: jilles
  Tested by: pho
  MFC after: 2 weeks
  Author: kib | Age: 2013-06-03 | Files: 1 | Lines: -1/+1
* Do not map the shared page COW. If the process wired its address space, fork(2) would cause shadowing of the physical object and copying of the shared page into private copy, effectively preventing updates for the exported timehands structure and stopping the clock. Specify the maximum allowed permissions for the page to be read and execute, preventing write from the user mode.
  Reported and tested by: <huanghwh@yahoo.com>
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks
  Author: kib | Age: 2013-06-03 | Files: 1 | Lines: -2/+3
* When auto-sizing the buffer cache, limit the amount of physical memory used as the estimation of size, to 32GB. This provides around 100K of buffer headers and corresponding KVA for buffer map at the peak. Sizing the cache larger is not useful, also resulting in the wasting and exhausting of KVA for large machines.
  Reported and tested by: bdrewery
  Sponsored by: The FreeBSD Foundation
  Author: kib | Age: 2013-06-03 | Files: 1 | Lines: -1/+2
* Reduce the scope of the VM object locking in brelse(). In my tests, this change reduced the total number of VM object lock acquisitions by brelse() by 74%.
  Sponsored by: EMC / Isilon Storage Division
  Author: alc | Age: 2013-06-02 | Files: 1 | Lines: -2/+4
* Move an assertion to the right spot; only bus_dmamap_load_mbuf(9) requires a pkthdr being present but that's not the case for either _bus_dmamap_load_mbuf_sg() or bus_dmamap_load_mbuf_sg(9).
  Reported by: sbruno
  MFC after: 1 week
  Author: marius | Age: 2013-06-01 | Files: 1 | Lines: -2/+2
* Style fixes to vn_ioctl().
  Suggested by: bde
  Author: jhb | Age: 2013-05-31 | Files: 1 | Lines: -14/+15
* - Convert the bufobj lock to rwlock.
  - Use a shared bufobj lock in getblk() and inmem().
  - Convert softdep's lk to rwlock to match the bufobj lock.
  - Move INFREECNT to b_flags and protect it with the buf lock.
  - Remove unnecessary locking around bremfree() and BKGRDINPROG.
  Sponsored by: EMC / Isilon Storage Division
  Discussed with: mckusick, kib, mdf
  Author: jeff | Age: 2013-05-31 | Files: 4 | Lines: -125/+65
* Initialising the new fibnum field to a known value turns out to be a GOOD IDEA (TM). Apparently MOST users set this (e.g. tcp and friends) but there are a few users that just assume that it is a sensible value but then go on to read it. These include SCTP, pf and the FLOWTABLE option (and maybe others).
  Author: julian | Age: 2013-05-24 | Files: 1 | Lines: -0/+3
* Ensure alq's shutdown_pre_sync event handler is deregistered on module unload to avoid a dangling pointer and eventual panic on system shutdown.
  Reported by: Ali <comnetboy at gmail.com>
  Tested by: Ali <comnetboy at gmail.com>
  MFC after: 1 week
  Author: lstewart | Age: 2013-05-24 | Files: 1 | Lines: -2/+5
* Use proper malloc type for ioctls white-list.
  Reported by: pho
  Tested by: pho
  Author: pjd | Age: 2013-05-23 | Files: 1 | Lines: -5/+7
* Increase the (arbitrary) limit for the number of packets per tick from 1k to 20k
  The previous value was good 10 years ago, but not anymore now. More importantly, lots of good surprises: polling is incredibly effective under virtualization, and not only prevents livelock but also saves most of the VM exit overhead in receive mode. Using polling, a FreeBSD instance under qemu-kvm remains perfectly responsive even when bombed with 10 Mpps over an emulated e1000, and happily processes 1.7 Mpps through ipfw. Note that some incompatibilities still remain: e.g. polling is not (yet) compatible with netmap, and seems to freeze the guest when kern.polling.idle_poll=1
  MFC after: 3 days
  Author: luigi | Age: 2013-05-22 | Files: 1 | Lines: -2/+1
* passing fd over unix socket: fix a corner case where caller wants to pass no descriptors. Previously the kernel would leak memory and try to free a potentially arbitrary pointer.
  Reviewed by: pjd
  Author: mjg | Age: 2013-05-21 | Files: 1 | Lines: -1/+8
* vm_object locking is not needed there as pages are already wired.
  Sponsored by: EMC / Isilon storage division
  Submitted by: alc
  Author: attilio | Age: 2013-05-21 | Files: 1 | Lines: -2/+0
* Regenerate.
  Author: kib | Age: 2013-05-21 | Files: 3 | Lines: -4/+4
* Fix the wait6(2) on 32bit architectures and for the compat32, by using the right type for the argument in syscalls.master. Also fix the posix_fallocate(2) and posix_fadvise(2) compat32 syscalls on the architectures which require padding of the 64bit argument.
  Noted and reviewed by: jhb
  Pointy hat to: kib
  MFC after: 1 week
  Author: kib | Age: 2013-05-21 | Files: 1 | Lines: -1/+1
* Style nits.
  Author: pjd | Age: 2013-05-19 | Files: 1 | Lines: -6/+5
* Use SDT_PROBE1() instead of SDT_PROBE().
  Author: pjd | Age: 2013-05-19 | Files: 1 | Lines: -7/+4
* Refine the "nojail" rc keyword, adding "nojailvnet" for files that don't apply to most jails but do apply to vnet jails. This includes adding a new sysctl "security.jail.vnet" to identify vnet jails.
  PR: conf/149050
  Submitted by: mdodd
  MFC after: 3 days
  Author: jamie | Age: 2013-05-19 | Files: 1 | Lines: -0/+20
* Use readlocking now that assertions on vm_page_lookup() are relaxed.
  Sponsored by: EMC / Isilon storage division
  Reviewed by: alc
  Tested by: flo, pho
  Author: attilio | Age: 2013-05-17 | Files: 1 | Lines: -3/+3
* A library function shall not set errno to 0.
  Reviewed by: mdf
  Author: jh | Age: 2013-05-16 | Files: 1 | Lines: -2/+3
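The rule in this commit message can be illustrated with a small userland sketch. It is not the code touched by the commit; `parse_long()` is a hypothetical helper that needs to clear errno internally to detect a strtol() range error, and therefore saves the caller's errno and restores it on success.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal long from a string.
 * Returns 0 on success and leaves errno exactly as the caller had it.
 * Returns -1 on failure with errno set to ERANGE or EINVAL.
 */
static int
parse_long(const char *s, long *out)
{
	char *end;
	long v;
	int saved_errno = errno;	/* remember the caller's value */

	assert(s != NULL && out != NULL);

	errno = 0;			/* needed to detect ERANGE from strtol() */
	v = strtol(s, &end, 10);
	if (errno != 0)
		return (-1);		/* errno was set by strtol() */
	if (end == s || *end != '\0') {
		errno = EINVAL;		/* no digits, or trailing garbage */
		return (-1);
	}

	errno = saved_errno;		/* success: do not leak our errno = 0 */
	*out = v;
	return (0);
}
```

A caller that had errno set to, say, ENOENT before a successful `parse_long("42", &v)` still sees ENOENT afterwards, which is the behaviour the rule asks for.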
* - Add a new general purpose path-compressed radix trie which can be used with any structure containing a uint64_t index. The tree code auto-generates type safe wrappers.
  - Eliminate the buf splay and replace it with pctrie. This is not only significantly faster with large files but also allows for the possibility of shared locking.
  Reviewed by: alc, attilio
  Sponsored by: EMC / Isilon Storage Division
  Author: jeff | Age: 2013-05-12 | Files: 2 | Lines: -112/+760
* - Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The null_hashget() obtains the reference on the nullfs vnode, which must be dropped.
  - Fix a wart which existed from the introduction of the nullfs caching, do not unlock lower vnode in the nullfs_reclaim_lowervp(). It should be innocent, but now it is also formally safe. Inform the nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on nullfs inode.
  - Add a callback to the upper filesystems for the lower vnode unlinking. When inactivating a nullfs vnode, check if the lower vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC on the lower vnode, and reclaim upper vnode if so. This allows nullfs to purge cached vnodes for the unlinked lower vnode, avoiding excessive caching.
  Reported by: Göran Löwkrantz <goran.lowkrantz@ismobile.com>
  Tested by: pho
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks
  Author: kib | Age: 2013-05-11 | Files: 2 | Lines: -7/+20
* Fix a bunch of typos.
  PR: misc/174625
  Submitted by: Jeremy Chadwick <jdc@koitsu.org>
  Author: eadler | Age: 2013-05-10 | Files: 1 | Lines: -1/+1
* Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE locks. To support this, VNODE locks are created with the LK_IS_VNODE flag. This flag is propagated down using the LO_IS_VNODE flag. Note that WITNESS still records the LOR. Only the printing and the optional entering into the kernel debugger is bypassed with the WITNESS_NO_VNODE option.
  Author: marcel | Age: 2013-05-09 | Files: 3 | Lines: -2/+16
* Item 1 in r248830 causes earlier exits from the sendfile(2), before all requested data was sent. The reason is that xfsize <= 0 condition must not be tested at all if space == loopbytes. Otherwise, the done is set to 1, and sendfile(2) is aborted too early. Instead of moving the condition to exiting the inner loop after the xfersize check, directly check for the completed transfer before the testing of the available space in the socket buffer, and revert item 1 of r248830. It is arguably another bug to sleep waiting for socket buffer space (or return EAGAIN for non-blocking socket) if all bytes are already transferred.
  Reported by: pho
  Discussed with: scottl, gibbs
  Tested by: scottl (stable/9 backport), pho
  Author: kib | Age: 2013-05-09 | Files: 1 | Lines: -18/+29
* When the accept queue is full print the number of already pending new connections instead of by how many we're over the limit, which is always 1.
  Noticed by: jmallet
  MFC after: 1 week
  Author: andre | Age: 2013-05-08 | Files: 1 | Lines: -1/+1
* Add a sysctl vfs.read_min to complement the existing vfs.read_max. It defaults to 1, meaning that it's off. When read-ahead is enabled on a file, the vfs cluster code deliberately breaks a read into 2 I/O transactions; one to satisfy the actual read, and one to perform read-ahead. This makes sense in low-latency circumstances, but often produces unbalanced i/o transactions that penalize disks. By setting vfs.read_min, we can tell the algorithm to fetch a larger transaction than what we asked for, achieving the same effect as the read-ahead but without the doubled, unbalanced transaction and the slightly lower latency. This significantly helps our workloads with video streaming.
  Submitted by: emax
  Reviewed by: kib
  Obtained from: Netflix
  Author: scottl | Age: 2013-05-07 | Files: 1 | Lines: -0/+12
* Back out r249318, r249320 and r249327 due to a heisenbug most likely related to a race condition in the ipi_hash_lock with the exact cause currently unknown but under investigation.
  Author: andre | Age: 2013-05-06 | Files: 1 | Lines: -2/+2
* Add missing vdrop() in error case.
  Submitted by: Fahad (mohd.fahadullah@isilon.com)
  MFC after: 1 week
  Author: mdf | Age: 2013-05-04 | Files: 1 | Lines: -0/+1
* Similar to 233760 and 236717, export some more useful info about the kernel-based POSIX semaphore descriptors to userland via procstat(1) and fstat(1):
  - Change sem file descriptors to track the pathname they are associated with and add a ksem_info() method to copy the path out to a caller-supplied buffer.
  - Use the fo_stat() method of shared memory objects and ksem_info() to export the path, mode, and value of a semaphore via struct kinfo_file.
  - Add a struct semstat to the libprocstat(3) interface along with a procstat_get_sem_info() to export the mode and value of a semaphore.
  - Teach fstat about semaphores and to display their path, mode, and value.
  MFC after: 2 weeks
  Author: jhb | Age: 2013-05-03 | Files: 2 | Lines: -1/+49
* Fix FIONREAD on regular files. The computed result was being ignored and it was being passed down to VOP_IOCTL() where it promptly resulted in ENOTTY due to a missing else for the past 8 years. While here, use a shared vnode lock while fetching the current file's size.
  MFC after: 1 week
  Author: jhb | Age: 2013-05-03 | Files: 1 | Lines: -3/+2
* Regenerate files for pipe2().
  Author: jilles | Age: 2013-05-01 | Files: 3 | Lines: -2/+30
* Add pipe2() system call.
  The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and O_NONBLOCK (on both sides) as part of the function. If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p). If the pointer is not valid, behaviour differs: pipe2() writes into the array from the kernel like socketpair() does, while pipe() writes into the array from an architecture-specific assembler wrapper.
  Reviewed by: kan, kib
  Author: jilles | Age: 2013-05-01 | Files: 3 | Lines: -0/+20
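A minimal userland sketch of the call described above. The two helper names are hypothetical and exist only for illustration; the `_GNU_SOURCE` define is an assumption needed to expose pipe2() on glibc and is not something the commit mentions.

```c
#define _GNU_SOURCE		/* exposes pipe2() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a pipe whose two ends are both
 * close-on-exec and non-blocking, using a single system call.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
make_cloexec_nonblock_pipe(int fds[2])
{
	assert(fds != NULL);
	return (pipe2(fds, O_CLOEXEC | O_NONBLOCK));
}

/*
 * Hypothetical helper: returns 1 if fd has both FD_CLOEXEC and
 * O_NONBLOCK set, 0 if it does not, and -1 if fcntl() fails.
 */
static int
fd_has_cloexec_nonblock(int fd)
{
	int fdflags = fcntl(fd, F_GETFD);
	int flflags = fcntl(fd, F_GETFL);

	if (fdflags == -1 || flflags == -1)
		return (-1);
	return ((fdflags & FD_CLOEXEC) != 0 && (flflags & O_NONBLOCK) != 0);
}
```

With plain pipe(), the same result takes extra fcntl() calls per descriptor, leaving a window in which another thread could fork and exec while the descriptors are not yet close-on-exec.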
* Regenerate files for accept4().
  Author: jilles | Age: 2013-05-01 | Files: 3 | Lines: -2/+38
* Add accept4() system call.
  The accept4() function, compared to accept(), allows setting the new file descriptor atomically close-on-exec and explicitly controlling the non-blocking status on the new socket. (Note that the latter point means that accept() is not equivalent to any form of accept4().)
  The linuxulator's accept4 implementation leaves a race window where the new file descriptor is not close-on-exec because it calls sys_accept(). This implementation leaves no such race window (by using falloc() flags). The linuxulator could be fixed and simplified by using the new code.
  Like accept(), accept4() is async-signal-safe, a cancellation point and permitted in capability mode.
  Author: jilles | Age: 2013-05-01 | Files: 3 | Lines: -25/+63
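The userland side of accept4() can be sketched as below. `accept4_sets_flags()` is a hypothetical helper written for this illustration; it uses a loopback TCP socket on an ephemeral port purely as a convenient way to obtain a pending connection, and the `_GNU_SOURCE` define is an assumption needed on glibc.

```c
#define _GNU_SOURCE		/* exposes accept4() on glibc */
#include <arpa/inet.h>
#include <assert.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: accept one loopback connection with accept4() and
 * check the flags on the accepted descriptor.
 * Returns 1 if both FD_CLOEXEC and O_NONBLOCK are set, 0 if not,
 * and -1 if any step fails.
 */
static int
accept4_sets_flags(void)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int lfd, cfd = -1, afd = -1, result = -1;

	lfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd == -1)
		return (-1);

	/* Bind to 127.0.0.1 on a kernel-chosen port and read the port back. */
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = 0;
	if (bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(lfd, 1) == -1 ||
	    getsockname(lfd, (struct sockaddr *)&sin, &len) == -1)
		goto out;
	assert(len == sizeof(sin));

	/* Queue one connection so that accept4() does not block. */
	cfd = socket(AF_INET, SOCK_STREAM, 0);
	if (cfd == -1 ||
	    connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		goto out;

	/* Both flags are applied as part of the accept itself. */
	afd = accept4(lfd, NULL, NULL, SOCK_CLOEXEC | SOCK_NONBLOCK);
	if (afd != -1) {
		int fdflags = fcntl(afd, F_GETFD);
		int flflags = fcntl(afd, F_GETFL);

		if (fdflags != -1 && flflags != -1)
			result = (fdflags & FD_CLOEXEC) != 0 &&
			    (flflags & O_NONBLOCK) != 0;
	}
out:
	if (afd != -1)
		close(afd);
	if (cfd != -1)
		close(cfd);
	close(lfd);
	return (result);
}
```

Because the flags are passed to accept4() directly, there is no point at which the accepted descriptor exists without close-on-exec, which is the race the message describes for an accept() followed by fcntl().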
* Introduce a constant, ELF_NOTE_ROUNDSIZE, which evidently declares our intention to use 4-byte padding for elf notes.
  MFC after: 3 weeks
  Author: trociny | Age: 2013-05-01 | Files: 1 | Lines: -9/+10
* socket: Make shutdown() wake up a blocked accept().
  A blocking accept (and some other operations) waits on &so->so_timeo. Once it wakes up, it will detect the SBS_CANTRCVMORE bit. The error from accept() is [ECONNABORTED] which is not the nicest one -- the thread calling accept() needs to know out-of-band what is happening.
  A spurious wakeup on so->so_timeo appears harmless (sleep retried) except when lingering on close (SO_LINGER, and in that case there is no descriptor to call shutdown() on) so this should be fairly safe.
  A shutdown() already woke up a blocked accept() for TCP sockets, but not for Unix domain sockets. This fix is generic for all domains.
  This patch was sent to -hackers@ and -net@ on April 5.
  MFC after: 2 weeks
  Author: jilles | Age: 2013-04-30 | Files: 1 | Lines: -0/+2
* Eliminate the layering violation in the kern_sendfile(). When querying the file size, use VOP_GETATTR() instead of accessing vnode vm_object un_pager.vnp.vnp_size. Take the shared vnode lock earlier to cover the added VOP_GETATTR() call and, as a consequence, the whole internal sendfile loop. Reduce vm object lock scope to not protect the local calculations. Note that this is the last misuse of the vnp_size in the tree, the others were removed from the ELF image activator by r230246.
  Reviewed by: alc
  Tested by: pho, bf (previous version)
  MFC after: 1 week
  Author: kib | Age: 2013-04-28 | Files: 1 | Lines: -14/+16
* Base the calculation of maxmbufmem in part on kmem_map size instead of kernel_map size to prevent kernel memory exhaustion by mbufs and a subsequent panic on physical page allocation failure.
  On architectures without a direct map all mbuf memory (except for jumbo mbufs larger than PAGE_SIZE) comes from kmem_map. It is the limiting factor hence. For architectures with a direct map using the size of kmem_map is a good proxy of available kernel memory as well. If it is much smaller the mbuf limit may be sub-optimal but remains reasonable, while avoiding panics under exhaustion.
  The overall mbuf memory limit calculation may be reconsidered again later, however due to the many different mbuf sizes and different backing KVM maps it is a tricky subject.
  Found by: pho's new network stress test
  Pointed out by: alc (kmem_map instead of kernel_map)
  Tested by: pho
  Author: andre | Age: 2013-04-24 | Files: 1 | Lines: -1/+1