summaryrefslogtreecommitdiffstats
path: root/sys/kern
Commit message (Collapse)AuthorAgeFilesLines
* Add internal 'mac_policy_count' counter to the MAC Framework, which is arwatson2009-06-025-50/+12
| | | | | | | | | | | | | | | | | | count of the number of registered policies. Rather than unconditionally locking sockets before passing them into MAC, lock them in the MAC entry points only if mac_policy_count is non-zero. This avoids locking overhead for a number of socket system calls when no policies are registered, eliminating measurable overhead for the MAC Framework for the socket subsystem when there are no active policies. Possibly socket locks should be acquired by policies if they are required for socket labels, which would further avoid locking overhead when there are policies but they don't require labeling of sockets, or possibly don't even implement socket controls. Obtained from: TrustedBSD Project
* Remove unneeded include.rwatson2009-06-021-2/+0
| | | | MFC after: 3 days
* Handle lock recursion differenty by always checking against LO_RECURSABLEattilio2009-06-023-17/+20
| | | | | | instead the lock own flag itself. Tested by: pho
* Correct a boundary case error in the management of a page's dirty bits byalc2009-06-021-4/+17
| | | | | | | | shm_dotruncate() and vnode_pager_setsize(). Specifically, if the length of a shared memory object or a file is truncated such that the length modulo the page size is between 1 and 511, then all of the page's dirty bits were cleared. Now, a dirty bit is cleared only if the corresponding block is truncated in its entirety.
* - Use an acquire barrier to increment f_count in fget_unlocked andjeff2009-06-021-2/+6
| | | | | | remove the volatile cast. Describe the reason in detail in a comment. Discussed with: bde, jhb
* Add an extension to the character device interface that allows characterjhb2009-06-011-2/+27
| | | | | | | | | | | | | | | | | | | | device drivers to use arbitrary VM objects to satisfy individual mmap() requests. - A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is added to cdevsw. This function is called for each mmap() request. If it returns ENODEV, then the mmap() request will fall back to using the device's device pager object and d_mmap(). Otherwise, the method can return a VM object to satisfy this entire mmap() request via *object. It can also modify the starting offset into this object via *foff. This allows device drivers to use the file offset as a cookie to identify specific VM objects. - vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when mapping V_CHR vnodes. This avoids duplicating all the cdev mmap handling code and simplifies some of vm_mmap_vnode(). - D_VERSION has been bumped to D_VERSION_02. Older device drivers using D_VERSION_01 are still supported. MFC after: 1 month
* Rework socket upcalls to close some races with setup/teardown of upcalls.jhb2009-06-013-12/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Each socket upcall is now invoked with the appropriate socket buffer locked. It is not permissible to call soisconnected() with this lock held; however, so socket upcalls now return an integer value. The two possible values are SU_OK and SU_ISCONNECTED. If an upcall returns SU_ISCONNECTED, then the soisconnected() will be invoked on the socket after the socket buffer lock is dropped. - A new API is provided for setting and clearing socket upcalls. The API consists of soupcall_set() and soupcall_clear(). - To simplify locking, each socket buffer now has a separate upcall. - When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from the receive socket buffer automatically. Note that a SO_SND upcall should never return SU_ISCONNECTED. - All this means that accept filters should now return SU_ISCONNECTED instead of calling soisconnected() directly. They also no longer need to explicitly clear the upcall on the new socket. - The HTTP accept filter still uses soupcall_set() to manage its internal state machine, but other accept filters no longer have any explicit knowlege of socket upcall internals aside from their return value. - The various RPC client upcalls currently drop the socket buffer lock while invoking soreceive() as a temporary band-aid. The plan for the future is to add a new flag to allow soreceive() to be called with the socket buffer locked. - The AIO callback for socket I/O is now also invoked with the socket buffer locked. Previously sowakeup() would drop the socket buffer lock only to call aio_swake() which immediately re-acquired the socket buffer lock for the duration of the function call. Discussed with: rwatson, rmacklem
* Add a simple API to manage scatter/gather lists of phyiscal addresses.jhb2009-06-011-0/+656
| | | | | | | | | | | | | Each list describes a logical memory object that is backed by one or more physical address ranges. To minimize locking, the sglist objects themselves are immutable once they are shared. These objects may be used in the future to facilitate I/O requests using physically-addressed buffers. For the immediate future I plan to use them to implement a new type of VM object and pager. Reviewed by: jeff, scottl MFC after: 1 month
* Add a flags field to struct ucred, and export that via kinfo_proc,rwatson2009-06-011-0/+1
| | | | | | | | consuming one of its spare fields. The cr_flags field is currently unused, but will be used for features, including capability mode and pay-as-you-go audit. Discussed with: jhb, sson
* Regenerate generated syscall files following changes to struct sysent inrwatson2009-06-011-509/+509
| | | | r193234.
* Add 'sy_flags', a currently unused per-syscall entry flags field that willrwatson2009-06-011-14/+21
| | | | | | | see future use in 9-CURRENT and 8-STABLE for features such as the capability-mode enable flag and pay-as-you-audit. Discussed with: jhb, sson
* Reimplement the netisr framework in order to support parallel netisrrwatson2009-06-011-11/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | threads: - Support up to one netisr thread per CPU, each processings its own workstream, or set of per-protocol queues. Threads may be bound to specific CPUs, or allowed to migrate, based on a global policy. In the future it would be desirable to support topology-centric policies, such as "one netisr per package". - Allow each protocol to advertise an ordering policy, which can currently be one of: NETISR_POLICY_SOURCE: packets must maintain ordering with respect to an implicit or explicit source (such as an interface or socket). NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work, as well as allowing protocols to provide a flow generation function for mbufs without flow identifers (m2flow). Falls back on NETISR_POLICY_SOURCE if now flow ID is available. NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for each packet handled by netisr (m2cpuid). - Provide utility functions for querying the number of workstreams being used, as well as a mapping function from workstream to CPU ID, which protocols may use in work placement decisions. - Add explicit interfaces to get and set per-protocol queue limits, and get and clear drop counters, which query data or apply changes across all workstreams. - Add a more extensible netisr registration interface, in which protocols declare 'struct netisr_handler' structures for each registered NETISR_ type. These include name, handler function, optional mbuf to flow ID function, optional mbuf to CPU ID function, queue limit, and ordering policy. Padding is present to allow these to be expanded in the future. If no queue limit is declared, then a default is used. - Queue limits are now per-workstream, and raised from the previous IFQ_MAXLEN default of 50 to 256. - All protocols are updated to use the new registration interface, and with the exception of netnatm, default queue limits. Most protocols register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use NETISR_POLICY_FLOW, and will therefore take advantage of driver- generated flow IDs if present. - Formalize a non-packet based interface between interface polling and the netisr, rather than having polling pretend to be two protocols. Provide two explicit hooks in the netisr worker for start and end events for runs: netisr_poll() and netisr_pollmore(), as well as a function, netisr_sched_poll(), to allow the polling code to schedule netisr execution. DEVICE_POLLING still embeds single-netisr assumptions in its implementation, so for now if it is compiled into the kernel, a single and un-bound netisr thread is enforced regardless of tunable configuration. In the default configuration, the new netisr implementation maintains the same basic assumptions as the previous implementation: a single, un-bound worker thread processes all deferred work, and direct dispatch is enabled by default wherever possible. Performance measurement shows a marginal performance improvement over the old implementation due to the use of batched dequeue. An rmlock is used to synchronize use and registration/unregistration using the framework; currently, synchronized use is disabled (replicating current netisr policy) due to a measurable 3%-6% hit in ping-pong micro-benchmarking. It will be enabled once further rmlock optimization has taken place. However, in practice, netisrs are rarely registered or unregistered at runtime. A new man page for netisr will follow, but since one doesn't currently exist, it hasn't been updated. This change is not appropriate for MFC, although the polling shutdown handler should be merged to 7-STABLE. Bump __FreeBSD_version. Reviewed by: bz
* Eliminate a comment describing code that was deleted over eight years ago.alc2009-06-011-14/+6
| | | | Move another comment to its proper place. Fix a typo in a third comment.
* sys/boot/common.crodrigc2009-06-011-21/+109
| | | | | | | | | | | | | | | | | | | | | ================= Extend the loader to parse the root file system mount options in /etc/fstab, and set a new loader variable vfs.root.mountfrom.options with these options. The root mount options must be a comma-delimited string, as specified in /etc/fstab. Only set the vfs.root.mountfrom.options variable if it has not been set in the environment. sys/kern/vfs_mount.c ==================== When mounting the root file system, pass the mount options specified in vfs.root.mountfrom.options, but filter out "rw" and "noro", since the initial mount of the root file system must be done as "ro". While we are here, try to add a few hints to the mountroot prompt to give users and idea what might of gone wrong during mounting of the root file system. Reviewed by: jhb (an earlier patch)
* nfs_write() can use the recently introduced vfs_bio_set_valid() instead ofalc2009-05-311-41/+0
| | | | | | vfs_bio_set_validclean(), thereby avoiding the page queues lock. Garbage collect vfs_bio_set_validclean(). Nothing uses it any longer.
* Unbreak the build. Add missed probes.kib2009-05-311-6/+12
| | | | | Reviewed by: rwatson Pointy hat to: me
* Eliminate code duplication in vn_fullpath1() around the cache lookupskib2009-05-311-85/+75
| | | | | | | | | | | | | | and calls to vn_vptocnp() by moving more of the common code to vn_vptocnp(). Rename vn_vptocnp() to vn_vptocnp_locked() to signify that cache is locked around the call. Do not track buffer position by both the pointer and offset, use only buflen to record the start of the free space. Export vn_vptocnp() for external consumers as a wrapper around vn_vptocnp_locked() that locks the cache and handles hold counts. Tested by: pho
* Split native socketpair() syscall onto kern_socketpair() which shoulddchagin2009-05-311-24/+29
| | | | | | | be used by kernel consumers and socketpair() itself. Approved by: kib (mentor) MFC after: 1 month
* Introduce an interm userland-kernel API for creating vnets andzec2009-05-312-28/+382
| | | | | | | | | | | | | | | | | | | | | | assigning ifnets from one vnet to another. Deletion of vnets is not yet supported. The interface is implemented as an ioctl extension so that no syscalls had to be introduced. This should be acceptable given that the new interface will be used for a short / interim period only, until the new jail management framwork gains the capability of managing vnets. This method for managing vimages / vnets has been in use for the past 7 years without any observable issues. The userland tool to be used in conjunction with the interim API can be found in p4: //depot/projects/vimage-commit2/src/usr.sbin/vimage/... and will most probably never get commited to svn. While here, bump copyright notices in kern_vimage.c and vimage.h to cover work done in year 2009. Approved by: julian (mentor) Discussed with: bz, rwatson
* Provide a new CPU device driver ivar to report the nominal speed of thenwhitehorn2009-05-311-12/+21
| | | | | | CPU, if available. This is meant to solve the issue of cpufreq misreporting speeds on CPUs that boot in a reduced power mode and have only relative speed control.
* Remove the now invalid (and possibly unused) debug.mpsafevfsattilio2009-05-301-9/+0
| | | | | | | sysctl/tunable. Reviewed by: emaste Sponsored by: Sandvine Incorporated
* Add VOP_ACCESSX, which can be used to query for newly added V*trasz2009-05-303-0/+74
| | | | | | | | permissions, such as VWRITE_ACL. For a filsystems that don't implement it, there is a default implementation, which works as a wrapper around VOP_ACCESS. Reviewed by: rwatson@
* Place hostnames and similar information fully under the prison system.jamie2009-05-294-106/+247
| | | | | | | | | | | | | | | | | The system hostname is now stored in prison0, and the global variable "hostname" has been removed, as has the hostname_mtx mutex. Jails may have their own host information, or they may inherit it from the parent/system. The proper way to read the hostname is via getcredhostname(), which will copy either the hostname associated with the passed cred, or the system hostname if you pass NULL. The system hostname can still be accessed directly (and without locking) at prison0.pr_host, but that should be avoided where possible. The "similar information" referred to is domainname, hostid, and hostuuid, which have also become prison parameters and had their associated global variables removed. Approved by: bz (mentor)
* Modify vm_hold_load_pages() to allocate pages using VM_ALLOC_NOOBJ ratheralc2009-05-291-13/+5
| | | | | than using the kernel object. This allows the elimination of page queues locking from vm_hold_free_pages().
* Minor style tweak.rwatson2009-05-291-1/+2
|
* Since sched_pin() and sched_unpin() are already inlined, don't manuallyrwatson2009-05-291-2/+2
| | | | inline in rmlocks.
* Remove extra cpu_spinwait() invocations. This should really only be usedjhb2009-05-292-11/+0
| | | | | | | in tight spin loops, not in these edge cases where we restart a much larger loop only a few times. Reviewed by: attilio
* Tweak a few comments on adaptive spinning.jhb2009-05-292-6/+15
|
* Make the rmlock(9) interface a bit more like the rwlock(9) interface:rwatson2009-05-292-7/+27
| | | | | | | | | | | | | | - Add rm_init_flags() and accept extended options only for that variation. - Add a flags space specifically for rm_init_flags(), rather than borrowing the lock_init() flag space. - Define flag RM_RECURSE to use instead of LO_RECURSABLE. - Define flag RM_NOWITNESS to allow an rmlock to be exempt from WITNESS checking; this wasn't possible previously as rm_init() always passed LO_WITNESS when initializing an rmlock's struct lock. - Add RM_SYSINIT_FLAGS(). - Rename embedded mutex in rmlocks to make it more obvious what it is. - Update consumers. - Update man page.
* Let vfs_lookup() return ENOTDIR if the path has a trailing slash anddes2009-05-291-8/+12
| | | | | | | | | | | | | | | | | the last component is a symlink to something that isn't a directory. We introduce a new namei flag, TRAILINGSLASH, which is set by lookup() if the last component is followed by a slash. The trailing slash is then stripped, as before. If the final component is a symlink, lookup() will return to namei(), which will expand the symlink and call lookup() with the new path. When all symlinks have been resolved, lookup() checks if the TRAILINGSLASH flag is set, and if it is, and the vnode it ended up with is not a directory, it returns ENOTDIR. PR: kern/21768 Submitted by: Eygene Ryabinkin <rea-fbsd@codelabs.ru> MFC after: 3 weeks
* Fix misleading comment.des2009-05-291-1/+1
| | | | MFC after: 1 week
* The patch for r193011 was partially rejected when applied, complete it.attilio2009-05-291-2/+4
|
* Last minute TTY API change: remove mutex argument from tty_alloc().ed2009-05-292-3/+10
| | | | | | | | | | I don't want people to override the mutex when allocating a TTY. It has to be there, to keep drivers like syscons happy. So I'm creating a tty_alloc_mutex() which can be used in those cases. tty_alloc_mutex() should eventually be removed. The advantage of this approach, is that we can just remove a function, without breaking the regular API in the future.
* Reverse the logic for ADAPTIVE_SX option and enable it by default.attilio2009-05-291-21/+50
| | | | | | | | | | | | | | | | | | | | | | Introduce for this operation the reverse NO_ADAPTIVE_SX option. The flag SX_ADAPTIVESPIN to be passed to sx_init_flags(9) gets suppressed and the new flag, offering the reversed logic, SX_NOADAPTIVE is added. Additively implements adaptive spininning for sx held in shared mode. The spinning limit can be handled through sysctls in order to be tuned while the code doesn't reach the release, after which time they should be dropped probabilly. This change has made been necessary by recent benchmarks where it does improve concurrency of workloads in presence of high contention (ie. ZFS). KPI breakage is documented by __FreeBSD_version bumping, manpage and UPDATING updates. Requested by: jeff, kmacy Reviewed by: jeff Tested by: pho
* fail(9) support:zml2009-05-272-3/+588
| | | | | | | | Add support for kernel fault injection using KFAIL_POINT_* macros and fail_point_* infrastructure. Add example fail point in vfs_bio.c to simulate VM buf pressure. Approved by: dfr (mentor)
* Add hierarchical jails. A jail may further virtualize its environmentjamie2009-05-2717-706/+1787
| | | | | | | | | | | | | | | | | | | | | | by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)
* Add the ksyms(4) pseudo driver. The ksyms driver allows a process tosson2009-05-263-0/+78
| | | | | | | | | | | | | | get a quick snapshot of the kernel's symbol table including the symbols from any loaded modules (the symbols are all merged into one symbol table). Unlike like other implementations, this ksyms driver maps memory in the process memory space to store the snapshot at the time /dev/ksyms is opened. It also checks to see if the process has already a snapshot open and won't allow it to open /dev/ksyms it again until it closes first. This prevents kernel and process memory from being exhausted. Note that /dev/ksyms is used by the lockstat(1) command. Reviewed by: gallatin kib (freebsd-arch) Approved by: gnn (mentor)
* Add the OpenSolaris dtrace lockstat provider. The lockstat providersson2009-05-266-29/+339
| | | | | | | | | | adds probes for mutexes, reader/writer and shared/exclusive locks to gather contention statistics and other locking information for dtrace scripts, the lockstat(1M) command and other potential consumers. Reviewed by: attilio jhb jb Approved by: gnn (mentor)
* Get rid of M_TEMP.ed2009-05-261-2/+2
|
* Add missing socket options.pjd2009-05-261-0/+8
|
* The advisory lock may be activated or activated and removed during thekib2009-05-241-2/+15
| | | | | | | | | | | | | | | sleep waiting for conditions when the lock may be granted. To prevent lf_setlock() from accessing possibly freed memory, add reference counting to the struct lockf_entry. Bump refcount around the sleep. Make lf_free_lock() return non-zero when structure was freed, and use this after the sleep to return EINTR to the caller. The error code might need a clarification, but we cannot return success to usermode, since the lock is not owned anymore. Reviewed by: dfr Tested by: pho MFC after: 1 month
* In lf_purgelocks(), assert that state->ls_pending is empty after wekib2009-05-241-1/+3
| | | | | | | | weeded out threads, and clean ls_active instead of ls_pending. Reviewed by: dfr Tested by: pho MFC after: 1 month
* In lf_advlockasync(), recheck for doomed vnode after the state->ls_lockkib2009-05-241-2/+17
| | | | | | | | | | | is acquired. In the lf_purgelocks(), assert that vnode is doomed and set *statep to NULL before clearing ls_pending list. Otherwise, we allow for the thread executing lf_advlockasync() to put new pending entry after state->ls_lock is dropped in lf_purgelocks(). Reviewed by: dfr Tested by: pho MFC after: 1 month
* Block when initially opening a TTY multiple times.ed2009-05-241-5/+11
| | | | | | | | | | | | | | | In the original MPSAFE TTY code, I changed the behaviour by returning EBUSY. I thought this made more sense, because it's basically a race to see who gets the TTY first. It turns out this is not a good change, because it also causes EBUSY to be returned when another process is closing the TTY. This can happen during startup, when /etc/rc (or one of its children) is still busy draining its data and /sbin/init is attempting to open the TTY to spawn a getty. Reported by: bz Tested by: bz
* Replace the while statement with the if for clarity. The loop bodykib2009-05-241-1/+1
| | | | | | | | cannot be executed more then once. Reviewed by: dfr Tested by: pho MFC after: 1 month
* V_irtualize the if_clone framework, thus allowing for clonable ifnetszec2009-05-231-0/+5
| | | | | | | | | | | | | | | | | | | | | | | to optionally have overlapping unit numbers if attached in different vnets. At this stage if_loop is the only clonable ifnet class that has been extended to allow for such overlapping allocation of unit numbers, i.e. in each vnet it is possible to have a lo0 interface. Other clonable ifnet classes remain to operate with traditional semantics, i.e. each instance of a clonable ifnet will be assigned a globally unique unit number, regardless in which vnet such an ifnet becomes instantiated. While here, garbage collect unused _lo_list field in struct vnet_net, as well as improve indentation for #defines in sys/net/vnet.h. The layout of struct vnet_net has changed, therefore bump __FreeBSD_version. This change has no functional impact on nooptions VIMAGE kernel builds. Reviewed by: bz, brooks Approved by: julian (mentor)
* Delay an error message until the variable it uses gets initialized.jamie2009-05-231-8/+6
| | | | | | | Found with: Coverity Prevent(tm) CID: 4316 Reported by: trasz Approved by: bz (mentor)
* Introduce the if_vmove() function, which will be used in the futurezec2009-05-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for reassigning ifnets from one vnet to another. if_vmove() works by calling a restricted subset of actions normally executed by if_detach() on an ifnet in the current vnet, and then switches to the target vnet and executes an appropriate subset of if_attach() actions there. if_attach() and if_detach() have become wrapper functions around if_attach_internal() and if_detach_internal(), where the later variants have an additional argument, a flag indicating whether a full attach or detach sequence is to be executed, or only a restricted subset suitable for moving an ifnet from one vnet to another. Hence, if_vmove() will not call if_detach() and if_attach() directly, but will call the if_detach_internal() and if_attach_internal() variants instead, with the vmove flag set. While here, staticize ifnet_setbyindex() since it is not referenced from outside of sys/net/if.c. Also rename ifccnt field in struct vimage to ifcnt, and do some minor whitespace garbage collection where appropriate. This change should have no functional impact on nooptions VIMAGE kernel builds. Reviewed by: bz, rwatson, brooks? Approved by: julian (mentor)
* Make 'struct acl' larger, as required to support NFSv4 ACLs. Providetrasz2009-05-222-7/+142
| | | | | | compatibility interfaces in both kernel and libc. Reviewed by: rwatson
* Enable secure TTY input buffer flushing by default.ed2009-05-211-1/+1
| | | | | | | I'm leaving the sysctl there. If people really notice a slowdown, they can revert to the old behaviour. Discussed with: kib
OpenPOWER on IntegriCloud