path: root/sys/kern
Commit message | Author | Age | Files | Lines
* Move zero copy statistics structure before sosend_copyin(). | rwatson | 2005-11-28 | 1 | -15/+15
  MFC after: 1 month
  Reported by: tinderbox, sam
* When checking to see if a process has exceeded its time limit, flag the | jhb | 2005-11-28 | 1 | -2/+2
  process as over the limit when its time is >= the limit rather than > the limit. Technically, if p->p_rux.rux_runtime.sec == p->p_pcpulimit and p->p_rux.rux_runtime.frac == 0, the process hasn't exceeded the limit yet. However, having the fraction exactly equal to 0 is rather rare, and it is not worth the overhead to handle that edge case. With just the > comparison, the process would have to exceed its limit by almost a second before it was killed.
  PR: kern/83192
  Submitted by: Maciej Zawadzinski mzawadzinski at gmail dot com
  Reviewed by: bde
  MFC after: 1 week
* Break out functionality in sosend() responsible for building mbuf | rwatson | 2005-11-28 | 1 | -141/+170
  chains and copying in mbufs from the body of the send logic, creating a new function sosend_copyin(). This change makes sosend() almost readable, and will allow the same logic to be used by tailored socket send routines.
  MFC after: 1 month
  Reviewed by: andre, glebius
* Fix a stupid compiler warning, remove a redundant line. | davidxu | 2005-11-27 | 1 | -1/+1
* Change filesystem name from mqueue to mqueuefs for style consistency. | davidxu | 2005-11-27 | 1 | -2/+2
  Suggested by: rwatson
* Regen. | davidxu | 2005-11-27 | 2 | -14/+26
* Don't use OpenBSD syscall numbers, instead, use new syscall numbers | davidxu | 2005-11-27 | 1 | -16/+22
  for POSIX message queue.
  Suggested by: rwatson
* Add several aliases for existing clockid_t names to indicate that the | rwatson | 2005-11-27 | 1 | -2/+26
  application wishes to request high precision time stamps be returned:
    Alias                      Existing
    CLOCK_REALTIME_PRECISE     CLOCK_REALTIME
    CLOCK_MONOTONIC_PRECISE    CLOCK_MONOTONIC
    CLOCK_UPTIME_PRECISE       CLOCK_UPTIME
  Add experimental low-precision clockid_t names corresponding to these clocks, but implemented using cached timestamps in kernel rather than a full time counter query. This offers a minimum update rate of 1/HZ, but in practice will often be more frequent due to the frequency of time stamping in the kernel:
    New clockid_t name         Approximates existing clockid_t
    CLOCK_REALTIME_FAST        CLOCK_REALTIME
    CLOCK_MONOTONIC_FAST       CLOCK_MONOTONIC
    CLOCK_UPTIME_FAST          CLOCK_UPTIME
  Add one additional new clockid_t, CLOCK_SECOND, which returns the current second without performing a full time counter query or cache lookup overhead to make sure the cached timestamp is stable. This is intended to support very low granularity consumers, such as time(3).
  The names, visibility, and implementation of the above are subject to change, and will not be MFC'd any time soon. The goal is to expose lower quality time measurement to applications willing to sacrifice accuracy in performance critical paths, such as when taking time stamps for the purpose of rescheduling select() and poll() timeouts. Future changes might include retrofitting the time counter infrastructure to allow the "fast" time query mechanisms to use a different time counter, rather than a cached time counter (i.e., TSC).
  NOTE: With different underlying time mechanisms exposed, using different time query mechanisms in the same application may result in relative non-monotonicity or the appearance of clock stalling for a single clockid_t, as a cached time stamp queried after a precision time stamp lookup may be "before" the time returned by the earlier live time counter query.
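A consumer of the low-precision clocks might look roughly like the sketch below. It assumes CLOCK_MONOTONIC_FAST is visible as a macro in <time.h> where the platform provides it and falls back to CLOCK_MONOTONIC elsewhere; COARSE_CLOCK, coarse_now() and elapsed_ms() are illustrative names, not part of any system API. Following the NOTE in the commit, both readings used for an interval come from the same clockid_t.

```c
#if !defined(__FreeBSD__) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime() under strict -std= modes */
#endif
#include <assert.h>
#include <time.h>

/* Prefer the cached, low-precision clock where the platform defines it;
 * otherwise fall back to the standard monotonic clock. */
#ifdef CLOCK_MONOTONIC_FAST
#define COARSE_CLOCK CLOCK_MONOTONIC_FAST
#else
#define COARSE_CLOCK CLOCK_MONOTONIC
#endif

/* Read a cheap timestamp, e.g. for rescheduling a poll() timeout.
 * Returns 0 on success and -1 on failure, as clock_gettime() does. */
static int
coarse_now(struct timespec *ts)
{
    assert(ts != NULL);
    return clock_gettime(COARSE_CLOCK, ts);
}

/* Milliseconds from 'start' to 'end'. Both readings should come from
 * coarse_now() so that they share one clock source. */
static long
elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    assert(start != NULL && end != NULL);
    return (long)(end->tv_sec - start->tv_sec) * 1000L +
        (end->tv_nsec - start->tv_nsec) / 1000000L;
}
```

Mixing a reading from a precise clock with one from a cached clock in elapsed_ms() is exactly the case the NOTE warns about, since the cached value may lag the live counter.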
* Regen. | davidxu | 2005-11-26 | 2 | -14/+14
* Bring in experimental kernel support for POSIX message queue. | davidxu | 2005-11-26 | 2 | -6/+2374
* In nmount() and vfs_donmount(), do not strcmp() the options in the iovec | rodrigc | 2005-11-23 | 1 | -36/+46
  directly. We need to copyin() the strings in the iovec before we can strcmp() them. Also, when we want to send the errmsg back to userspace, we need to copyout()/copystr() the string.
  Add a small helper function vfs_getopt_pos() which takes in the name of an option, and returns the array index of the name in the iovec, or -1 if not found. This allows us to locate an option in the iovec without actually manipulating the iovec members directly via strcmp().
  Noticed by: kris on sparc64
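The lookup that the helper performs can be sketched in user space as follows. This is not the kernel's vfs_getopt_pos(); opt_pos() is a hypothetical name, and the sketch assumes the option vector alternates name and value entries and that the strings are already NUL-terminated in the caller's address space (the kernel has to copyin() them first, which is the point of the commit).

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/* Option vectors in the nmount() style alternate name, value, name,
 * value, ...  Return the array index of the entry whose name matches,
 * or -1 if the name is not present. Only name slots are examined. */
static int
opt_pos(const struct iovec *iov, unsigned int iovcnt, const char *name)
{
    unsigned int i;

    assert(iov != NULL && name != NULL);
    for (i = 0; i + 1 < iovcnt; i += 2) {
        if (iov[i].iov_base != NULL &&
            strcmp((const char *)iov[i].iov_base, name) == 0)
            return (int)i;
    }
    return -1;
}
```

Given the index, a caller can reach the matching value at index + 1 without comparing strings again.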
* Fix a bug in the loop in sonewconn that makes room on the incomplete | jdp | 2005-11-22 | 2 | -2/+2
  connection queue for a new connection. It was removing connections from the wrong list.
  Submitted by: Paul Mikesell
  Sponsored by: Isilon Systems
  MFC after: 1 week
* Fix bug introduced in revision 1.186: | marcel | 2005-11-19 | 1 | -3/+8
  When all file systems have a time stamp of zero, which is the case for example when the root file system is on a read-only medium, we ended up not calling inittodr() at all.
  A potential uncleanliness existed as well. If multiple file systems had a non-zero time stamp, we would call inittodr() multiple times. While this should not be harmful, it's definitely not ideal.
  Fix both issues by iterating over the mounted file systems to find the largest time stamp and call inittodr() exactly once with that time stamp. This could of course be a zero time stamp if none of the mounted file systems have a non-zero time stamp. In that case the annoying errors mentioned in the commit log for revision 1.186 still haven't been avoided. The bottom line is that inittodr() should not complain when it gets a time base of zero. At the time of this commit only alpha seems to have that problem.
  Reported by: Dario Freni (saturnero at freesbie dot org)
  MFC after: 1 week
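The "find the largest time stamp, then initialize once" shape can be illustrated with a small user-space sketch. The array of stamps stands in for the list of mounted file systems, and largest_timestamp() is a hypothetical name; the real code walks the kernel's mount list.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Return the largest time stamp among n mounted file systems, or 0 when
 * there are none or every stamp is zero (e.g. read-only media). */
static time_t
largest_timestamp(const time_t *stamps, size_t n)
{
    time_t best = 0;
    size_t i;

    assert(stamps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (stamps[i] > best)
            best = stamps[i];
    }
    return best;
}
```

A caller would then invoke the clock initialization routine exactly once with the returned value, including when that value is zero.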
* Parse more mount options in vfs_donmount(), before vfs_domount() | rodrigc | 2005-11-19 | 1 | -0/+42
  is called. It looks like there are lots of different mount flags checked in vfs_domount(), so we need to do the parsing for these particular mount flags earlier on. The new flags parsed are: async, force, multilabel, noasync, noatime, noclusterr, noclusterw, noexec, nosuid, nosymfollow, snapshot, suiddir, sync, union.
  Existing code which uses mount() to mount UFS filesystems is not affected, but new code which uses nmount() to mount UFS filesystems should behave better.
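Translating option names into flag bits before the main mount logic runs might look like the table-driven sketch below. The XMNT_* values, the opt_flag table and apply_option() are all hypothetical and chosen for illustration; the real MNT_* constants live in <sys/mount.h>, have different values, and only a handful of the options listed in the commit are shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits for illustration only. */
#define XMNT_ASYNC   0x01u
#define XMNT_NOATIME 0x02u
#define XMNT_NOEXEC  0x04u
#define XMNT_NOSUID  0x08u

struct opt_flag {
    const char  *name;
    unsigned int set;    /* bits turned on by this option */
    unsigned int clear;  /* bits turned off by this option */
};

static const struct opt_flag opt_table[] = {
    { "async",   XMNT_ASYNC,   0 },
    { "noasync", 0,            XMNT_ASYNC },
    { "noatime", XMNT_NOATIME, 0 },
    { "noexec",  XMNT_NOEXEC,  0 },
    { "nosuid",  XMNT_NOSUID,  0 },
};

/* Return 'flags' updated for one option name. Unknown names leave the
 * flags untouched so that file-system-specific options pass through. */
static unsigned int
apply_option(unsigned int flags, const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(opt_table) / sizeof(opt_table[0]); i++) {
        if (strcmp(opt_table[i].name, name) == 0)
            return (flags | opt_table[i].set) & ~opt_table[i].clear;
    }
    return flags;
}
```

Keeping the mapping in one table makes paired options such as async and noasync easy to audit side by side.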
* Add CLOCK_UPTIME to clock_gettime(2) reporting the current | andre | 2005-11-18 | 1 | -0/+2
  uptime measured in SI seconds.
  Sponsored by: TCP/IP Optimization Fundraise 2005
* In vfs_nmount(), check to see if "update" mount option was passed | rodrigc | 2005-11-18 | 1 | -0/+9
  in, and if so, set MNT_UPDATE filesystem flag. vfs_nmount() calls vfs_domount(), and there is special logic inside vfs_domount() if MNT_UPDATE is set. This is very important when we want to do an update mount of the root filesystem, using nmount().
* Prefer NULL to 0. | yongari | 2005-11-17 | 1 | -25/+29
  Add missing lock/unlock in sysctl handler.
  Protect against NULL pointer access when resource allocation fails.
  style(9)
  Reviewed by: scottl
  MFC after: 1 week
* Add a new sysctl, kern.elf[32|64].can_exec_dyn. When set to 1, one can | cognet | 2005-11-14 | 1 | -1/+7
  execute an ET_DYN binary (shared object). This does not make much sense, but some linux scripts expect to be able to execute /lib/ld-linux.so.2 (ldd comes to mind). The sysctl defaults to 0.
  MFC after: 3 days
* In ktr_getrequest(), acquire ktrace_mtx earlier -- while the race | rwatson | 2005-11-14 | 1 | -2/+3
  currently present is minor and offers no real semantic issues, it also doesn't make sense since an earlier lockless check has already occurred. Also hold the mutex longer, over a manipulation of per-process ktrace state, which requires synchronization.
  MFC after: 1 month
  Pointed out by: jhb
* Moderate rewrite of kernel ktrace code to attempt to generally improve | rwatson | 2005-11-13 | 6 | -92/+201
  reliability when tracing fast-moving processes or writing traces to slow file systems by avoiding unbounded queueing and dropped records. Record loss was previously possible when the global pool of records became depleted as a result of record generation outstripping record commit, which occurred quickly in many common situations.
  These changes partially restore the 4.x model of committing ktrace records at the point of trace generation (synchronous), but maintain the 5.x deferred record commit behavior (asynchronous) for situations where entering VFS and sleeping is not possible (i.e., in the scheduler). Records are now queued per-process as opposed to globally, with processes responsible for committing records from their own context as required.
  - Eliminate the ktrace worker thread and global record queue, as they are no longer used. Keep the global free record list, as records are still used.
  - Add a per-process record queue, which will hold any asynchronously generated records, such as from context switches. This replaces the global queue as the place to submit asynchronous records to.
  - When a record is committed asynchronously, simply queue it to the process.
  - When a record is committed synchronously, first drain any pending per-process records in order to maintain ordering as best we can. Currently ordering between competing threads is provided via a global ktrace_sx, but a per-process flag or lock may be desirable in the future.
  - When a process returns to user space following a system call, trap, signal delivery, etc, flush any pending records.
  - When a process exits, flush any pending records.
  - Assert on process tear-down that there are no pending records.
  - Slightly abstract the notion of being "in ktrace", which is used to prevent the recursive generation of records, as well as generating traces for ktrace events.
  Future work here might look at changing the set of events marked for synchronous and asynchronous record generation, re-balancing queue depth, timeliness of commit to disk, and so on. I.e., performing a drain every (n) records.
  MFC after: 1 month
  Discussed with: jhb
  Requested by: Marc Olzheim <marcolz at stack dot nl>
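The ordering rule in the list above (queue asynchronous records per process, and drain that queue before committing a synchronous record) can be modelled in a few lines of user-space C. Everything here is a hypothetical simplification: records are plain ints, the "disk" is an array, and the fixed queue depth and the -1 return on a full queue are assumptions of this sketch rather than behavior described by the commit.

```c
#include <assert.h>
#include <stddef.h>

#define QCAP   8    /* hypothetical per-process queue depth */
#define LOGCAP 32   /* hypothetical size of the committed log */

struct proc_trace {
    int    pending[QCAP];  /* records queued asynchronously */
    size_t npending;
    int    log[LOGCAP];    /* records committed, in commit order */
    size_t nlog;
};

/* Write one record to the log (stands in for the VFS write). */
static void
commit(struct proc_trace *p, int rec)
{
    assert(p->nlog < LOGCAP);
    p->log[p->nlog++] = rec;
}

/* Flush every queued record in FIFO order. */
static void
drain(struct proc_trace *p)
{
    size_t i;

    for (i = 0; i < p->npending; i++)
        commit(p, p->pending[i]);
    p->npending = 0;
}

/* Asynchronous path: the caller cannot sleep, so only queue the record.
 * Returns 0 on success, -1 if the bounded queue is full (assumption). */
static int
submit_async(struct proc_trace *p, int rec)
{
    if (p->npending == QCAP)
        return -1;
    p->pending[p->npending++] = rec;
    return 0;
}

/* Synchronous path: drain older queued records first so that the log
 * keeps generation order, then commit the new record. */
static void
submit_sync(struct proc_trace *p, int rec)
{
    drain(p);
    commit(p, rec);
}
```

The same drain() would also be the natural hook for the "return to user space" and "process exit" flush points mentioned in the list.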
* style(9) cleanups. | rodrigc | 2005-11-12 | 1 | -16/+17
  Spotted by: njl, bde
* Significant refactoring of the accounting code to improve locking and VFS | rwatson | 2005-11-12 | 1 | -108/+93
  happiness, as well as correct other bugs:
  - Replace notion of current and saved accounting credential/vnode with a single credential/vnode and an acct_suspended flag. This simplifies the accounting logic substantially.
  - Replace acct_mtx with acct_sx, a sleepable lock held exclusively during reconfiguration and space polling, but shared during log entry generation. This avoids holding a mutex over sleepable VFS operations.
  - Hold the sx lock over the duration of the I/O so that the vnode I/O cannot occur after vnode close, which could occur previously if accounting was disabled as a process exited.
  - Write the accounting log entry with Giant conditionally acquired based on the file system where the log is stored. Previously, the accounting code relied on the caller acquiring Giant.
  - Acquire Giant conditionally in the accounting callout based on the file system where the accounting log is stored. Run the callout MPSAFE.
  - Expose acct_suspended via a read-only sysctl so it is possible to programmatically determine whether accounting is suspended or not without attempting to parse logs.
  - Check both acct_vp and acct_suspended lock-free before entering the accounting sx lock in acct().
  - When accounting is disabled due to a VBAD vnode (i.e., forcible unmount), generate a log message indicating accounting has been disabled.
  - Correct a long-standing bug in how free space is calculated and compared to the required space: generate and compare signed results, not unsigned results, or negative free space will cause accounting to not be suspended when required, or worse, incorrectly resumed once negative free space is reached.
  MFC after: 2 weeks
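The signed-versus-unsigned pitfall in the last item can be shown with a small sketch. The function names and the percentage-threshold form of the comparison are assumptions made for illustration, not the accounting code itself; the point is only that a negative free-block count turns into a very large number once it is treated as unsigned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Buggy form: the available-block count is converted to unsigned, so a
 * negative value wraps to a huge positive one and the file system
 * appears to have plenty of room. */
static bool
space_ok_unsigned(int64_t bavail, uint64_t blocks, unsigned int pct)
{
    assert(pct <= 100);
    return (uint64_t)bavail > blocks * pct / 100;
}

/* Fixed form: compute the threshold, then compare as signed values. */
static bool
space_ok_signed(int64_t bavail, uint64_t blocks, unsigned int pct)
{
    assert(pct <= 100);
    return bavail > (int64_t)(blocks * pct / 100);
}
```

With 1000 total blocks, a 4% threshold and -5 blocks available, the unsigned form reports that space is fine while the signed form correctly reports that it is not.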
* Make sure only one signal is removed by the debugger. | davidxu | 2005-11-12 | 1 | -1/+2
* Correct a number of serious and closely related bugs in the UNIX domain | rwatson | 2005-11-10 | 1 | -50/+45
  socket file descriptor garbage collection code, which is intended to detect and clear cycles of orphaned file descriptors that are "in-flight" in a socket when that socket is closed before they are received. The algorithm present was both run at poor times (resulting in recursion and reentrance), and also buggy in the presence of parallelism. In order to fix these problems, make the following changes:
  - When there are in-flight sockets and a UNIX domain socket is destroyed, asynchronously schedule the garbage collector, rather than running it synchronously in the current context. This avoids lock order issues when the garbage collection code reenters the UNIX domain socket code, avoiding lock order reversals, deadlocks, etc. Run the code asynchronously in a task queue.
  - In the garbage collector, when skipping file descriptors that have entered a closing state (i.e., have f_count == 0), re-test the FDEFER flag, and decrement unp_defer. As file descriptors can now transition to a closed state, while the garbage collector is running, it is no longer the case that unp_defer will remain an accurate count of deferred sockets in the mark portion of the GC algorithm. Otherwise, the garbage collector will loop waiting for unp_defer to reach zero, which it will never do as it is skipping file descriptors that were marked in an earlier pass, but now closed.
  - Acquire the UNIX domain socket subsystem lock in unp_discard() when modifying the unp_rights counter, or a read/write race is risked with other threads also manipulating the counter.
  While here:
  - Remove #if 0'd code regarding acquiring the socket buffer sleep lock in the garbage collector, this is not required as we are able to use the socket buffer receive lock to protect scanning the receive buffer for in-flight file descriptors on the socket buffer.
  - Annotate that the description of the garbage collector implementation is increasingly inaccurate and needs to be updated.
  - Add counters of the number of deferred garbage collections and recycled file descriptors. This will be removed and is here temporarily for debugging purposes.
  With these changes in place, the unp_passfd regression test now appears to be passed consistently on UP and SMP systems for extended runs, whereas before it hung quickly or panicked, depending on which bug was triggered.
  Reported by: Philip Kizer <pckizer at nostrum dot com>
  MFC after: 2 weeks
* Add the f_msgcount field to the set of struct file fields printed in show | rwatson | 2005-11-10 | 1 | -4/+5
  files.
  MFC after: 1 week
* Expand the set of details printed for each file descriptor to include its | rwatson | 2005-11-10 | 1 | -5/+5
  garbage collection flags. Reformat generally to make this fit and leave some room for future expansion.
  MFC after: 1 week
* Add a DDB "show files" command to list the current open file list, some | rwatson | 2005-11-10 | 1 | -0/+73
  state about each open file, and identify the first process in the process table that references the file. This is helpful in debugging leaks of file descriptors.
  MFC after: 1 week
* This is a workaround for a complicated issue involving VFS cookies and devfs. | dwhite | 2005-11-09 | 1 | -0/+4
  The PR and patch have the details. The ultimate fix requires architectural changes and clarifications to the VFS API, but this will prevent the system from panicking when someone does "ls /dev" while running in a shell under the linuxulator.
  This issue affects HEAD and RELENG_6 only.
  PR: 88249
  Submitted by: "Devon H. O'Dell" <dodell@ixsystems.com>
  MFC after: 3 days
* Fix typo in recent comment tweak. | rwatson | 2005-11-09 | 1 | -1/+1
  Submitted by: jkim
  MFC after: 1 week
* In closef(), remove the assumption that there is a thread associated | rwatson | 2005-11-09 | 1 | -2/+6
  with the file descriptor. When a file descriptor is closed as a result of garbage collecting a UNIX domain socket, the file descriptor will not have any associated thread, so the logic to identify advisory locks held by that thread is not appropriate. Check the thread for NULL to avoid this scenario. Expand an existing comment to say a bit more about this.
  MFC after: 1 week
* General consensus is that it would be even better to run this in a | imp | 2005-11-09 | 1 | -1/+1
  thread context. While it doesn't matter too much at the moment, in the future we could be back in the same boat if/when more restrictions are placed (or enforced) in a SWI.
  Suggested by: njl, bde, jhb, scottl
* Use intptr_t casts to convert void * <--> int to make 64-bit archs happy. | jhb | 2005-11-09 | 1 | -2/+2
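The cast pattern named in the commit above looks roughly like this in isolation. The helper names are hypothetical; the sketch assumes, as holds on mainstream platforms, that an int converted to intptr_t and then to void * converts back to the same value.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Stash a small integer in a void * cookie. Going through intptr_t
 * avoids a size-mismatch warning where pointers are wider than int. */
static void *
int_to_cookie(int value)
{
    return (void *)(intptr_t)value;
}

/* Recover the integer. The assertion documents the assumption that the
 * cookie was produced by int_to_cookie() and so fits in an int. */
static int
cookie_to_int(void *cookie)
{
    intptr_t v = (intptr_t)cookie;

    assert(v >= INT_MIN && v <= INT_MAX);
    return (int)v;
}
```

Casting directly between int and void * compiles on 32-bit targets but draws "cast to pointer from integer of different size" style warnings where pointers are 64 bits wide, which is what the intermediate intptr_t avoids.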
* Use sparse initializers for "struct domain" and "struct protosw", | ru | 2005-11-09 | 1 | -18/+24
  so they are easier to follow for the human being.
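The readability gain from sparse (designated) initializers can be seen on a much-reduced, hypothetical structure. struct proto_entry below is not the kernel's struct protosw; its fields and values are invented for the comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for a protocol switch entry. */
struct proto_entry {
    int          type;
    int          protocol;
    unsigned int flags;
    int        (*init)(void);
    const char  *name;
};

static int
demo_init(void)
{
    return 0;
}

/* Positional form: the reader must count fields to see what is what. */
static const struct proto_entry positional = { 1, 6, 0x4u, demo_init, "demo" };

/* Designated form: fields are named, order is free, and any omitted
 * member (here 'flags') is zero-initialized. */
static const struct proto_entry designated = {
    .name     = "demo",
    .type     = 1,
    .protocol = 6,
    .init     = demo_init,
};

/* Sanity check a caller might run before registering an entry. */
static void
check_entry(const struct proto_entry *e)
{
    assert(e != NULL && e->name != NULL && e->init != NULL);
}
```

The designated form also survives the addition or reordering of structure members, whereas the positional form silently assigns values to the wrong fields.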
* WIFxxx macros require an int type but p_xstat is short, convert it | davidxu | 2005-11-09 | 1 | -2/+3
  to int before using the macros.
  Bug reported by: Pyun YongHyeon pyunyh at gmail dot com
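A user-space analogue of the fix is to widen the 16-bit value into an int variable before handing it to the wait-status macros. The function name is hypothetical, and the example values in the note below assume the common encoding in which the exit code occupies bits 8 to 15 of the status word.

```c
#include <assert.h>
#include <sys/wait.h>

/* Return the exit code encoded in a 16-bit status word, or -1 if the
 * process did not exit normally. The value is copied into an int first
 * because the W* macros are specified for int arguments. */
static int
exit_code_from_xstat(short xstat)
{
    int status = xstat;

    if (WIFEXITED(status)) {
        assert(WEXITSTATUS(status) >= 0);
        return WEXITSTATUS(status);
    }
    return -1;
}
```

On systems using that encoding, a status word of 3 << 8 yields exit code 3, and a status word of 9 (terminated by signal 9) yields -1.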
* Kick off the suspend sequence from the keyboard in a SWI rather than | imp | 2005-11-09 | 1 | -2/+13
  in the hardware interrupt context (even if it is likely just an ithread). We don't document that suspend/resume routines are run from such a context and some of the things that happen in those routines aren't interrupt safe. Since there's no real need to run from that context, this restores assumptions that suspend routines have made.
  This fixes Thierry Herbelot's 'Trying to sleep while sleeping is prohibited' problem.
* Clarify panic message, I parsed the old one 'trying to sleep while sleeping' | imp | 2005-11-09 | 1 | -1/+1
* For nmount(), allow a text string error message to be propagated back | rodrigc | 2005-11-09 | 1 | -2/+37
  to user-space if a parameter named "errmsg" is passed into the iovec. Used in conjunction with vfs_mount_error(), more useful error messages than errno can be passed back to userspace when mounting a filesystem fails.
  Discussed with: phk, pjd
* In aio_waitcomplete, do not return EAGAIN if no other threads | davidxu | 2005-11-08 | 1 | -1/+1
  have started aio; instead, initialize the aio management structure if that hasn't been done. The reason for adjusting this behavior is to make it friendlier to threaded programs. Consider two threads: one submits aio_write, and another just calls aio_waitcomplete to wait for any I/O to complete and to recycle the aio requests. Before the submitter does any I/O, the recycler wants to wait in the kernel. This also fixes an inconsistency with other aio syscalls.
* Make sure pending SIGCHLD is removed from previous parent when process | davidxu | 2005-11-08 | 1 | -1/+10
  is attached or detached.
* Various and sundry cleanups: | jhb | 2005-11-08 | 1 | -80/+84
  - Use curthread for calls to knlist_delete() and add a big comment explaining why as well as appropriate assertions.
  - Use TAILQ_FOREACH and TAILQ_FOREACH_SAFE instead of handrolling them.
  - Use fget() family of functions to lookup file objects instead of grovelling around in file descriptor tables.
  - Destroy the aio_freeproc mutex if we are unloaded.
  Tested on: i386
* Giant clean up for exit(2) | csjp | 2005-11-08 | 1 | -7/+7
  - Change unconditional acquisition of Giant to only pick up Giant if the vnode for the controlling tty resides on a non-mpsafe file system.
  - Pick up Giant around executable vnode reference counting operations only if the executable resides on a non-mpsafe file system.
  - If this process is being traced, pick up Giant for trace file reference count operations only if it resides on a non-mpsafe file system.
  Discussed with: jhb
  Tested by: kris
* Add support for queueing SIGCHLD, the same as other UNIX systems do. | davidxu | 2005-11-08 | 4 | -13/+131
  For each child process whose status has been changed, a SIGCHLD instance is queued. If the signal is still pending and the process changes status several times, the signal information is updated to reflect the latest process status. If wait() returns because the status of a child process is available, the pending SIGCHLD signal associated with the child process is discarded. Any other pending SIGCHLD signals remain pending.
  The signal information is allocated at the same time as the proc structure, so the signal can still be sent to the process even if the process signal queue is full or there is a memory shortage.
  There is a boot-time tunable, kern.sigqueue.queue_sigchild, which controls the behavior: setting it to zero disables the SIGCHLD queueing feature. The tunable will be removed once the feature has proven stable enough.
  Tested on: i386 (SMP and UP)
* Add utility function to propagate mount errors as text string messages. | rodrigc | 2005-11-08 | 1 | -0/+21
  Discussed with: phk
* Fix panic string in last revision. | glebius | 2005-11-06 | 1 | -1/+1
* Free only those mbuf+clusters back to the packet zone that were allocated | andre | 2005-11-05 | 2 | -2/+5
  from there. All others get broken up and free'd individually to the mbuf and cluster zones.
  The packet zone is a secondary zone to the mbuf zone. There is currently a limitation in UMA which prevents decreasing the packet zone stock when the mbuf and cluster zone are drained and all their members are part of packets. When this is fixed this change may be reverted.
* Fix a logic error introduced with mandatory mbuf cluster refcounting and | andre | 2005-11-04 | 2 | -6/+7
  freeing of mbufs+clusters back to the packet zone.
* Fix name compatibility problem with the POSIX standard: the sigval_ptr and | davidxu | 2005-11-04 | 3 | -4/+4
  sigval_int really should be sival_ptr and sival_int. Also, sigev_notify_function accepts a union sigval value, not a pointer.
* Add stoppcbs[] arrays on Alpha and sparc64 and have each CPU save its | jhb | 2005-11-03 | 1 | -1/+1
  current context in the IPI_STOP handler so that we can get accurate stack traces of threads on other CPUs on these two archs like we do now on i386 and amd64.
  Tested on: alpha, sparc64
* Fix 'show allpcpu' ddb command on non-x86. CPU IDs are in the range 0 .. | jhb | 2005-11-03 | 1 | -1/+1
  mp_maxid, not 0 .. mp_maxid - 1. The result was that the highest numbered CPU was skipped on Alpha and sparc64.
  MFC after: 1 week
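The off-by-one described above comes down to the loop bound. The sketch below uses a hypothetical counting function to contrast the two bounds; it stands in for the loop that prints per-CPU data and is not the ddb code itself.

```c
#include <assert.h>

/* Count the CPU IDs a loop visits when valid IDs run from 0 to max_id
 * inclusive. 'inclusive' selects the corrected loop bound. */
static int
cpus_visited(int max_id, int inclusive)
{
    int id, n = 0;

    assert(max_id >= 0);
    if (inclusive) {
        for (id = 0; id <= max_id; id++)   /* fixed: visits max_id too */
            n++;
    } else {
        for (id = 0; id < max_id; id++)    /* buggy: skips the last CPU */
            n++;
    }
    return n;
}
```

With a highest CPU ID of 3, the inclusive bound visits all four CPUs while the exclusive bound visits only three, which matches the symptom of the highest numbered CPU being skipped.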
* Detect memory leaks when memory type is being destroyed. | pjd | 2005-11-03 | 1 | -0/+21
  This is very helpful for detecting kernel modules memory leaks on unload.
  Discussed and reviewed by: rwatson