path: root/sys/kern/kern_fork.c
Commit message  [author, date, files changed, lines -/+]
...
* Remove 'MPSAFE' annotations from the comments above most system calls: all  [rwatson, 2007-03-04, 1 file, -9/+0]
    system calls now enter without Giant held, and then in some cases,
    acquire Giant explicitly.  Remove a number of other MPSAFE annotations
    in the credential code and tweak one or two other adjacent comments.
* Use pause() rather than tsleep() on explicit global dummy variables.  [jhb, 2007-02-27, 1 file, -3/+1]
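    A minimal sketch of the idiom being replaced (pause(9) and tsleep(9)
    are the real interfaces; the dummy variable and timeout values are
    illustrative):

        /* Before: sleep on a dummy global that nothing ever wakes. */
        static int dummy;

        tsleep(&dummy, PPAUSE, "-", hz / 2);

        /* After: pause() expresses an unconditional timed wait, with no
           wait channel to declare or manage. */
        pause("-", hz / 2);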
* Close race conditions between fork() and [sg]etpriority()'s  [delphij, 2007-02-26, 1 file, -5/+14]
    PRIO_USER case, possibly also other places that dereference p_ucred.

    In the past, we inserted a new process into the allproc list right
    after PID allocation, and released the allproc_lock sx.  Because most
    of the content in the new proc's structure was not yet initialized,
    this could lead to undefined results if we did not handle PRS_NEW
    with care.

    The problem with the PRS_NEW state is that it does not provide
    fine-grained information about how much initialization has been done
    for a new process.  By definition, after a PRIO_USER setpriority(),
    all processes that belong to the given user should have their nice
    value set to the specified value.  Therefore, if the p_{start,end}copy
    section has already been copied for a PRS_NEW process, we cannot
    safely ignore it, because p_nice is in this area.  On the other hand,
    we should be careful with PRS_NEW processes because we do not allow
    non-root users to lower their nice values, and without a successful
    copy of the copy section, we can get stale values inherited from the
    uninitialized area of the process structure.

    This commit tries to close the race condition by grabbing the proc
    mutex *before* we release the allproc_lock xlock, and doing the copy
    as well as the zeroing immediately after the allproc_lock xunlock.
    This guarantees that the new process has its copy and zero sections,
    as well as its user credential information, initialized.

    In the getpriority() case, instead of grabbing PROC_LOCK for a
    PRS_NEW process, we just skip the process in question, because it
    does not affect the final result of the call: the p_nice value will
    be copied from its parent, and we will see it during the allproc
    traversal.

    Other potential solutions are still under evaluation.

    Discussed with:  davidxu, jhb, rwatson
    PR:              kern/108071
    MFC after:       2 weeks
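    A condensed sketch of the ordering fix, loosely following fork1() in
    kern_fork.c of this era (surrounding code elided; the lock names and
    the __rangeof() copy/zero idiom are the real ones):

        sx_xlock(&allproc_lock);
        p2->p_pid = trypid;
        p2->p_state = PRS_NEW;          /* protect against others */
        LIST_INSERT_HEAD(&allproc, p2, p_list);
        LIST_INSERT_HEAD(PIDHASH(p2->p_pid), p2, p_hash);

        /* Grab the proc locks *before* dropping allproc_lock ... */
        PROC_LOCK(p2);
        PROC_LOCK(p1);
        sx_xunlock(&allproc_lock);

        /* ... and zero/copy immediately, so a PRIO_USER traversal that
           honors PRS_NEW sees initialized p_nice and p_ucred. */
        bzero(&p2->p_startzero,
            __rangeof(struct proc, p_startzero, p_endzero));
        bcopy(&p1->p_startcopy, &p2->p_startcopy,
            __rangeof(struct proc, p_startcopy, p_endcopy));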
* - Remove setrunqueue and replace it with direct calls to sched_add().  [jeff, 2007-01-23, 1 file, -1/+1]
      setrunqueue() was mostly empty.  The few asserts and thread state
      setting were moved to the individual schedulers.  sched_add() was
      chosen to displace it for naming consistency reasons.
    - Remove adjustrunqueue; it was 4 lines of code that was ifdef'd to be
      different on all three schedulers where it was only called in one
      place each.
    - Remove the long ifdef'd out remrunqueue code.
    - Remove the now redundant ts_state.  Inspect the thread state
      directly.
    - Don't set TSF_* flags from kern_switch.c; we were only doing this to
      support a feature in one scheduler.
    - Change sched_choose() to return a thread rather than a td_sched.
      Also, rely on the schedulers to return the idlethread.  This
      simplifies the logic in choosethread().  Aside from the run queue
      links, kern_switch.c mostly does not care about the contents of
      td_sched.
      Discussed with:  julian
    - Move the idle thread loop into the per-scheduler area.  ULE wants to
      do something different from the other schedulers.
      Suggested by:    jhb
      Tested on:       x86/amd64, sched_{4BSD, ULE, CORE}
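    For callers, the change looks roughly like this (SRQ_BORING is one of
    the real placement hints; the wrapper function is hypothetical):

        static void
        make_runnable(struct thread *td)
        {
                mtx_assert(&sched_lock, MA_OWNED);
                /* was: setrunqueue(td, SRQ_BORING); */
                sched_add(td, SRQ_BORING);
        }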
* Threading cleanup.. part 2 of several.  [julian, 2006-12-06, 1 file, -18/+0]
    Make part of John Birrell's KSE patch permanent.  Specifically,
    remove:

      Any reference to the ksegrp structure.  This feature was never fully
      utilised and made things overly complicated.

      All code in the scheduler that tried to make threaded programs fair
      to unthreaded programs.  Libpthread processes will already do this
      to some extent and libthr processes already disable it.

    Also:
      Since this makes such a big change to the scheduler(s), take the
      opportunity to rename some structures and elements that had to be
      moved anyhow.  This makes the code a lot more readable.

    The ULE scheduler compiles again but I have no idea if it works.

    The 4bsd scheduler still requires a little cleaning and some functions
    that now do ALMOST nothing will go away, but I thought I'd do that as
    a separate commit.

    Tested by:  David Xu and Dan Eischen using libthr and libpthread
* Sweep kernel replacing suser(9) calls with priv(9) calls, assigning  [rwatson, 2006-11-06, 1 file, -2/+6]
    specific privilege names to a broad range of privileges.  These may
    require some future tweaking.

    Sponsored by:   nCircle Network Security, Inc.
    Obtained from:  TrustedBSD Project
    Discussed on:   arch@
    Reviewed (at least in part) by:  mlaier, jmg, pjd, bde, ceri,
        Alex Lyashkov <umka at sevcity dot net>,
        Skip Ford <skip dot ford at verizon dot net>,
        Antoine Brodin <antoine dot brodin at laposte dot net>
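    A sketch of the shape of the sweep (PRIV_MAXPROC is the named
    privilege fork1() came to use; treat the exact constant as an
    assumption):

        /* Before: one undifferentiated superuser check. */
        if (suser(td) != 0)
                return (EPERM);

        /* After: a specific, named privilege. */
        error = priv_check(td, PRIV_MAXPROC);
        if (error)
                return (error);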
* Make KSE a kernel option, turned on by default in all GENERIC  [jb, 2006-10-26, 1 file, -0/+12]
    kernel configs except sun4v (which doesn't process signals properly
    with KSE).

    Reviewed by:  davidxu@
* Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h  [rwatson, 2006-10-22, 1 file, -1/+1]
    begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
    contains the userspace and user<->kernel API and definitions, with all
    in-kernel interfaces moved to mac_framework.h, which is now included
    across most of the kernel instead.

    This change is the first step in a larger cleanup and sweep of MAC
    Framework interfaces in the kernel, and will not be MFC'd.

    Obtained from:  TrustedBSD Project
    Sponsored by:   SPARTA
* - Change process_exec function handlers prototype to include struct  [netchild, 2006-08-15, 1 file, -0/+2]
      image_params arg.
    - Change struct image_params to include struct sysentvec pointer and
      initialize it.
    - Change all consumers of process_exit/process_exec eventhandlers to
      the new prototypes (includes splitting up into distinct exec/exit
      functions).
    - Add an eventhandler to userret.

    Sponsored by:        Google SoC 2006
    Submitted by:        rdivacky
    Parts suggested by:  jhb (on hackers@)
* Don't lock each of the processes while looking for a pid.  The allproc and  [jhb, 2006-08-01, 1 file, -5/+1]
    proctree locks that we already hold provide sufficient protection.
* - Use suser_cred(9) instead of checking cr_ruid directly.  [pjd, 2006-06-27, 1 file, -7/+10]
    - For privileged processes, save two mutex operations.

    We may want to consider whether it is a good idea to use
    SUSER_ALLOWJAIL here, but for now I didn't want to change the original
    behaviour.

    Reviewed by:  rwatson
* Fix a race between file operations and rfork(RFCFDG) by parking  [davidxu, 2006-03-15, 1 file, -0/+17]
    all other threads at the user boundary; the race can crash the kernel
    under stress testing.

    Reviewed by:  jhb
    MFC after:    3 days
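    A condensed sketch of the fix, loosely following fork1() of this era
    (error handling trimmed; SINGLE_BOUNDARY and thread_single(9) are the
    real interfaces):

        if (((p1->p_flag & (P_HADTHREADS | P_SYSTEM)) == P_HADTHREADS) &&
            (flags & (RFCFDG | RFFDG))) {
                PROC_LOCK(p1);
                if (thread_single(SINGLE_BOUNDARY)) {
                        PROC_UNLOCK(p1);
                        return (ERESTART);
                }
                PROC_UNLOCK(p1);
                /* ... other threads are parked at the user boundary;
                   now safe to replace the filedesc table ... */
        }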
* Simplify system time accounting for profiling.  [phk, 2006-02-08, 1 file, -1/+1]
    Rename struct thread's td_sticks to td_pticks; we will shortly need
    the other name for a more appropriately named use.  Reduce it from
    uint64_t to u_int.

    Clear td_pticks whenever we enter the kernel instead of recording its
    value as a reference for userret().  Use the absolute value of
    td->td_pticks in userret() and eliminate the third argument.
* We don't need the proc lock to check P_KTHREAD on curthread since it is  [jhb, 2006-02-06, 1 file, -3/+0]
    only set before the kthread starts executing and is never cleared.
* Audit the args to rfork(), and the child PID for all fork system calls.  [wsalamon, 2006-02-06, 1 file, -0/+2]
    Obtained from:  TrustedBSD Project
    Approved by:    rwatson (mentor)
* Hook up audit to fork() and exit() events.  These changes manage the  [rwatson, 2006-02-02, 1 file, -1/+11]
    audit state on processes, not auditing of these events.

    Much work by:   wsalamon
    Obtained from:  TrustedBSD Project
* Moderate rewrite of kernel ktrace code to attempt to generally improve  [rwatson, 2005-11-13, 1 file, -0/+1]
    reliability when tracing fast-moving processes or writing traces to
    slow file systems, by avoiding unbounded queueing and dropped records.
    Record loss was previously possible when the global pool of records
    became depleted as a result of record generation outstripping record
    commit, which occurred quickly in many common situations.

    These changes partially restore the 4.x model of committing ktrace
    records at the point of trace generation (synchronous), but maintain
    the 5.x deferred record commit behavior (asynchronous) for situations
    where entering VFS and sleeping is not possible (i.e., in the
    scheduler).  Records are now queued per-process as opposed to
    globally, with processes responsible for committing records from
    their own context as required.

    - Eliminate the ktrace worker thread and global record queue, as they
      are no longer used.  Keep the global free record list, as records
      are still used.
    - Add a per-process record queue, which will hold any asynchronously
      generated records, such as from context switches.  This replaces
      the global queue as the place to submit asynchronous records to.
    - When a record is committed asynchronously, simply queue it to the
      process.
    - When a record is committed synchronously, first drain any pending
      per-process records in order to maintain ordering as best we can.
      Currently ordering between competing threads is provided via a
      global ktrace_sx, but a per-process flag or lock may be desirable
      in the future.
    - When a process returns to user space following a system call, trap,
      signal delivery, etc, flush any pending records.
    - When a process exits, flush any pending records.
    - Assert on process tear-down that there are no pending records.
    - Slightly abstract the notion of being "in ktrace", which is used to
      prevent the recursive generation of records, as well as generating
      traces for ktrace events.

    Future work here might look at changing the set of events marked for
    synchronous and asynchronous record generation, re-balancing queue
    depth, timeliness of commit to disk, and so on, i.e., performing a
    drain every (n) records.

    MFC after:       1 month
    Discussed with:  jhb
    Requested by:    Marc Olzheim <marcolz at stack dot nl>
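    A rough sketch of the queueing model described above (the p_ktr queue,
    ktrace_mtx, and ktrace_sx names follow kern_ktrace.c of this era; the
    function bodies are simplified illustrations, not the real code):

        /* Asynchronous commit: just queue the record to the process. */
        static void
        ktr_enqueuerequest(struct thread *td, struct ktr_request *req)
        {
                mtx_lock(&ktrace_mtx);
                STAILQ_INSERT_TAIL(&td->td_proc->p_ktr, req, ktr_list);
                mtx_unlock(&ktrace_mtx);
        }

        /* Synchronous commit path: drain pending records first so that
           ordering is preserved as well as possible. */
        static void
        ktr_drain(struct thread *td)
        {
                struct ktr_request *queued;

                sx_assert(&ktrace_sx, SX_XLOCKED);
                while ((queued = STAILQ_FIRST(&td->td_proc->p_ktr)) != NULL) {
                        STAILQ_REMOVE_HEAD(&td->td_proc->p_ktr, ktr_list);
                        ktr_writerequest(td, queued);
                        ktr_freerequest(queued);
                }
        }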
* Fix the recent panics/LORs/hangs created by my kqueue commit by:  [ssouhlal, 2005-07-01, 1 file, -1/+1]
    - Introducing the possibility of using locks different than mutexes
      for the knlist locking.  In order to do this, we add three arguments
      to knlist_init() to specify the functions to use to lock, unlock and
      check if the lock is owned.  If these arguments are NULL, we assume
      mtx_lock, mtx_unlock and mtx_owned, respectively.
    - Using the vnode lock for the knlist locking, when doing kqueue
      operations on a vnode.  This way, we don't have to lock the vnode
      while holding a mutex, in filt_vfsread.

    Reviewed by:   jmg
    Approved by:   re (scottl), scottl (mentor override)
    Pointyhat to:  ssouhlal
    Will be happy: everyone
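    A sketch of the resulting calls (vfs_knllock/vfs_knlunlock/
    vfs_knllocked are the real vnode helpers from vfs_subr.c; the exact
    knlist_init() prototype here is from memory of this era and should be
    checked against sys/event.h):

        /* Mutex-backed list: NULL callbacks select the mtx_* defaults. */
        knlist_init(&knl, &some_mtx, NULL, NULL, NULL);

        /* Vnode-backed list: the vnode lock protects the knlist. */
        knlist_init(&vp->v_pollinfo->vpi_selinfo.si_note, vp,
            vfs_knllock, vfs_knlunlock, vfs_knllocked);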
* Inherit signal mask for child process in fork1(); RELENG_4 and other  [davidxu, 2005-04-20, 1 file, -0/+1]
    *BSDs have this behaviour, and it is also required by POSIX.

    PR:            kern/80130
    Submitted by:  Kostik Belousov <konstantin.belousov at zoral dot com dot ua>
* Divorce critical sections from spinlocks.  Critical sections as denoted by  [jhb, 2005-04-04, 1 file, -2/+0]
    critical_enter() and critical_exit() are now solely a mechanism for
    deferring kernel preemptions.  They no longer have any effect on
    interrupts.  This means that standalone critical sections are now very
    cheap, as they are simply unlocked integer increments and decrements
    for the common case.

    Spin mutexes now use a separate KPI implemented in MD code:
    spinlock_enter() and spinlock_exit().  This KPI is responsible for
    providing whatever MD guarantees are needed to ensure that a thread
    holding a spin lock won't be preempted by any other code that will try
    to lock the same lock.  For now all archs continue to block interrupts
    in a "spinlock section" as they did formerly in all critical sections.

    Note that I've also taken this opportunity to push a few things into
    MD code rather than MI.  For example, critical_fork_exit() no longer
    exists.  Instead, MD code ensures that new threads have the correct
    state when they are created.  Also, we no longer try to fix up the
    idlethreads for APs in MI code.  Instead, each arch sets the initial
    curthread and adjusts the state of the idle thread it borrows in order
    to perform the initial context switch.

    This change is largely a big NOP, but the cleaner separation it
    provides will allow for more efficient alternative locking schemes in
    other parts of the kernel (bare critical sections rather than per-CPU
    spin mutexes for per-CPU data, for example).

    Reviewed by:  grehan, cognet, arch@, others
    Tested on:    i386, alpha, sparc64, powerpc, arm, possibly more
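    A minimal sketch contrasting the two KPIs (the function names are the
    real ones; the bodies are illustrative):

        static void
        touch_pcpu_counter(void)
        {
                critical_enter();       /* cheap: only defers preemption */
                /* ... per-CPU work that must not migrate CPUs ... */
                critical_exit();
        }

        static void
        touch_intr_shared_state(void)
        {
                spinlock_enter();       /* MD: for now also blocks interrupts */
                /* ... state an interrupt handler might also touch ... */
                spinlock_exit();
        }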
* /* -> /*- for copyright notices, minor format tweaks as necessary  [imp, 2005-01-06, 1 file, -1/+1]
* Add new function fdunshare() which encapsulates the necessary light magic  [phk, 2004-12-14, 1 file, -12/+2]
    for ensuring that a process' filedesc is not shared with anybody.

    Use it in the two places which previously had private implementations.

    This collects all fd_refcnt handling in kern_descrip.c.
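    Roughly what fdunshare() encapsulates, based on the description above
    and the fdcopy()/fdfree() interfaces of this era (treat this as an
    approximation of the real kern_descrip.c code):

        void
        fdunshare(struct proc *p, struct thread *td)
        {
                struct filedesc *newfd;

                FILEDESC_LOCK_FAST(p->p_fd);
                if (p->p_fd->fd_refcnt > 1) {
                        FILEDESC_UNLOCK_FAST(p->p_fd);
                        newfd = fdcopy(p->p_fd);
                        fdfree(td);     /* drop the shared table */
                        p->p_fd = newfd;
                } else
                        FILEDESC_UNLOCK_FAST(p->p_fd);
        }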
* Don't include sys/user.h merely for its side-effect of recursively  [das, 2004-11-27, 1 file, -1/+1]
    including other headers.
* Remove local definitions of RANGEOF() and use __rangeof() instead.  [das, 2004-11-20, 1 file, -9/+6]
    Also remove a few bogus casts.
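    For reference, __rangeof() lives in <sys/cdefs.h> and measures the
    byte span between two members of a structure (definition quoted from
    memory; check the header):

        #define __rangeof(type, start, end) \
                (__offsetof(type, end) - __offsetof(type, start))

        /* Typical fork1() use: operate on whole marked regions of
           struct proc without casts or local RANGEOF() macros. */
        bzero(&p2->p_startzero,
            __rangeof(struct proc, p_startzero, p_endzero));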
* Malloc p_stats instead of putting it in the U area.  We should consider  [das, 2004-11-20, 1 file, -1/+3]
    simply embedding it in struct proc.

    Reviewed by:  arch@
* Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST.  [phk, 2004-11-13, 1 file, -5/+5]
    Use this in all the places where sleeping with the lock held is not an
    issue.

    The distinction will become significant once we finalize the exact
    lock-type to use for this kind of case.
* Use more intuitive pointer for fdinit() and fdcopy().  [phk, 2004-11-08, 1 file, -5/+3]
    Change fdcopy() to take unlocked filedesc.
* Allow fdinit() to be called with a NULL fdp argument so we can use  [phk, 2004-11-07, 1 file, -4/+0]
    it when setting up init.

    Make fdinit() lock the fdp argument as needed.
* Back out rev 1.240; it is unnecessary.  In particular,  [das, 2004-10-06, 1 file, -8/+3]
    p1 == curthread, so _PHOLD(p1) will not have to block to swap in p1.

    Noticed by:  jhb
* Avoid calling _PHOLD(p1) with p2's lock held, since _PHOLD()  [das, 2004-10-01, 1 file, -3/+8]
    may block to swap in p1.  Instead, call _PHOLD earlier, at a point
    where the only lock held happens to be p1's.
* Make some of these conditions apply equally to both threading systems.  [julian, 2004-09-13, 1 file, -3/+3]
* Refactor a bunch of scheduler code to give basically the same behaviour  [julian, 2004-09-05, 1 file, -11/+2]
    but with slightly cleaned up interfaces.

    The KSE structure has become the same as the "per thread scheduler
    private data" structure.  In order to not make the diffs too great,
    one is #defined as the other at this time.

    The KSE (or td_sched) structure is now allocated per thread and has no
    allocation code of its own.

    Concurrency for a KSEGRP is now kept track of via a simple pair of
    counters rather than using KSE structures as tokens.

    Since the KSE structure is different in each scheduler, kern_switch.c
    is now included at the end of each scheduler.  Nothing outside the
    scheduler knows the contents of the KSE (aka td_sched) structure.

    The fields in the ksegrp structure that are to do with the scheduler's
    queueing mechanisms are now moved to the kg_sched structure (the
    per-ksegrp scheduler private data structure).  In other words, how the
    scheduler queues and keeps track of threads is no-one's business
    except the scheduler's.  This should allow people to write
    experimental schedulers with completely different internal
    structuring.

    A scheduler call sched_set_concurrency(kg, N) has been added that
    notifies the scheduler that no more than N threads from that ksegrp
    should be allowed to be concurrently scheduled.  This is also used to
    enforce 'fairness' at this time, so that a ksegrp with 10000 threads
    cannot swamp the run queue and force out a process with 1 thread,
    since the current code will not set the concurrency above NCPU, and
    both schedulers will not allow more than that many onto the system run
    queue at a time.  Each scheduler should eventually develop its own
    methods to do this now that they are effectively separated.

    Rejig libthr's kernel interface to follow the same code paths as
    libkse for scope-system threads.  This has slightly hurt libthr's
    performance but I will work to recover as much of it as I can.

    Thread exit code has been cleaned up greatly.  exit and exec code now
    transitions a process back to 'standard non-threaded mode' before
    taking the next step.

    Reviewed by:  scottl, peter
    MFC after:    1 week
* Push Giant deep into vm_forkproc(), acquiring it only if the process has  [alc, 2004-09-03, 1 file, -16/+12]
    mapped System V shared memory segments (see shmfork_myhook()) or
    requires the allocation of an ldt (see vm_fault_wire()).
* Give setrunqueue() and sched_add() more of a clue as to  [julian, 2004-09-01, 1 file, -1/+1]
    where they are coming from and what is expected from them.

    MFC after:  2 days
* Remove sched_free_thread() which was only used  [julian, 2004-08-31, 1 file, -3/+0]
    in diagnostics.  It has outlived its usefulness and has started
    causing panics for people who turn on DIAGNOSTIC, in what is otherwise
    good code.

    MFC after:  2 days
* Add locking to the kqueue subsystem.  This also makes the kqueue subsystem  [jmg, 2004-08-15, 1 file, -1/+2]
    a more complete subsystem, and removes the knowledge of how things are
    implemented from the drivers.  Include locking around filter ops, so a
    module like aio will know when not to be unloaded if there are
    outstanding knotes using its filter ops.

    Currently, it uses MTX_DUPOK even though it is not always safe to
    acquire duplicate locks.  Witness currently doesn't support the
    ability to discover if a dup lock is ok (in some cases).

    Reviewed by:  green, rwatson (both earlier versions)
* Increase the amount of data exported by KTR in the KTR_RUNQ setting.  [julian, 2004-08-09, 1 file, -2/+2]
    This extra data is needed to really follow what is going on in the
    threaded case.
* Move the schedlock owner state update following the context  [bmilekic, 2004-07-27, 1 file, -12/+14]
    switch in fork_exit() to before anything else is done (but keep
    schedlock for the deadthread check).  This means one less nasty bug if
    ever in the future whatever might have been called before the update
    played with schedlock or critical sections.

    Discussed with:  tjr
* In revision 1.228, I accidentally broke the "total number of processes in  [cperciva, 2004-07-26, 1 file, -1/+2]
    the system" resource limit code: When checking if the caller has
    superuser privileges, we should be checking the *real* user, not the
    *effective* user.  (In general, resource limiting is done based on the
    real user, in order to avoid resource-exhaustion-by-setuid-program
    attacks.)

    Now that a SUSER_RUID flag to suser_cred exists, use it here to return
    this code to its correct behaviour.

    Pointed out by:  rwatson
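    A sketch of the corrected check in fork1() (the shape follows
    kern_fork.c of this era; the "-10" headroom is the superuser
    reservation described a few entries below):

        /* The real uid, not the effective uid, decides whether the
           caller may dip into the last ten process slots. */
        if ((nprocs >= maxproc - 10 &&
            suser_cred(td->td_ucred, SUSER_RUID) != 0) ||
            nprocs >= maxproc) {
                error = EAGAIN;
                goto fail;
        }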
* When calling scheduler entrypoints for creating new threads and processes,  [julian, 2004-07-18, 1 file, -1/+1]
    specify "us" as the thread, not the process/ksegrp/kse.  You can
    always find the others from the thread, but the converse is not true.
    Theoretically this would lead to runtime being allocated to the wrong
    entity in some cases, though it is not clear how often this actually
    happened.  (It would only affect threaded processes and would probably
    be pretty benign, but it WAS a bug.)

    Reviewed by:  peter
* Fix compilation.  [phk, 2004-07-13, 1 file, -1/+1]
* Replace "uid != 0" with "suser(td->td_ucred) != 0" when checking if we'vecperciva2004-07-131-1/+2
| | | | | hit the maximum number of processes. The last ten processes are reserved for the *non-jailed* superuser.
* Allocate TIDs in thread_init() and deallocate them in thread_fini().  [marcel, 2004-06-26, 1 file, -1/+0]
    The overhead of unconditionally allocating TIDs (and likewise,
    unconditionally deallocating them) is amortized across multiple thread
    creations by the way UMA makes it possible to have type-stable
    storage.

    Previously the cost was kept down by having threads created as part of
    a fork operation use the process' PID as the TID.  While this had some
    nice properties, it also introduced complexity in the way TIDs were
    allocated.  Most importantly, by using the type-stable storage that
    UMA gives us, this was also unnecessary.

    This change affects how core dumps are created and in particular how
    the PRSTATUS notes are dumped.  Since we don't have a thread with a
    TID equalling the PID, we now need a different way to preserve the old
    and previous behavior.  We do this by having the given thread (i.e.
    the thread passed to the core dump code in td) dump its state first
    and fill in pr_pid with the actual PID.  All other threads will have
    pr_pid contain their TIDs.  The upshot of all this is that the
    debugger will now likely select the right LWP (=TID) as the initial
    thread.

    Credits to:  julian@ for spotting how we can utilize UMA.
    Thanks to:   all who provided julian@ with test results.
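    A sketch of why type-stable storage makes this cheap: UMA's init/fini
    hooks run only when items enter or leave the zone's slabs, not on
    every thread_alloc()/thread_free().  (tid_alloc()/tid_free() follow
    later kern_thread.c naming; the 2004 code used its own batching
    allocator, and the init hook's exact signature varied by era.)

        static int
        thread_init(void *mem, int size, int flags)
        {
                struct thread *td = mem;

                td->td_tid = tid_alloc();   /* survives ctor/dtor cycles */
                return (0);
        }

        static void
        thread_fini(void *mem, int size)
        {
                struct thread *td = mem;

                tid_free(td->td_tid);
        }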
* Remove advertising clause from University of California Regent's license,  [imp, 2004-04-05, 1 file, -4/+0]
    per letter dated July 22, 1999.

    Approved by:  core
* Assign thread IDs to kernel threads.  The purpose of the thread ID (tid)  [marcel, 2004-04-03, 1 file, -0/+1]
    is twofold:
    1. When a 1:1 or M:N threaded process dumps core, we need to put the
       register state of each of its kernel threads in the core file.
       This can only be done by differentiating the pid field in the
       respective note.  For this we need the tid.
    2. When thread support is present for remote debugging the kernel
       with gdb(1), threads need to be identified by an integer due to
       limitations in the remote protocol.  This requires having a tid.

    To minimize the impact of having thread IDs, threads that are created
    as part of a fork (i.e. the initial thread in a process) will inherit
    the process ID (i.e. tid=pid).  Subsequent threads will have IDs
    larger than PID_MAX to avoid interference with the pid allocation
    algorithm.  The assignment of tids is handled by thread_new_tid().

    The thread ID allocation algorithm has been written with 3 assumptions
    in mind:
    1. IDs need to be created as fast as possible,
    2. Reuse of IDs may happen instantaneously,
    3. Someone else will write a better algorithm.
* Make the process_exit eventhandler run without Giant.  Add Giant hooks  [peter, 2004-03-14, 1 file, -2/+0]
    in the two consumers that need it: processes using AIO and netncp.
    Update docs.  Say that process_exec is called with Giant, but not to
    depend on it.  All our consumers can handle it without Giant.
* Move the process_fork event out from under Giant.  This one is easy,  [peter, 2004-03-14, 1 file, -1/+3]
    since there are no consumers in the tree.  Document this.
* Push Giant down a little further:  [peter, 2004-03-13, 1 file, -3/+0]
    - no longer serialize on Giant for thread_single*() and family in
      fork, exit and exec
    - thread_wait() is mpsafe, assert no Giant
    - reduce scope of Giant in exit to not cover thread_wait and just do
      vm_waitproc()
    - assert that the thread_single() family are not called with Giant
    - remove the DROP/PICKUP_GIANT macros from the thread_single() family
    - assert that thread_suspend_check() is not called with Giant
    - remove the manual drop_giant hack in thread_suspend_check since we
      know it isn't held
    - remove the DROP/PICKUP_GIANT macros from the thread_suspend_check()
      family
    - mark kse_create() mpsafe
* Make sure we hold the filedesc lock when calling fdinit when RFCFDG is set  [jmg, 2004-03-10, 1 file, -0/+4]
    on a call to rfork.

    Submitted by:      Brian Buchanan
    Semi-Reviewed by:  rwatson
* Move a vref call outside of proc locks and Giant.  By virtue of the fact  [peter, 2004-03-08, 1 file, -5/+4]
    that we (p1) are currently running, we hold a reference on p_textvp,
    which means the vnode cannot go away.  p2 cannot run yet (and hence
    cannot exit), so this should be safe to do at this point.  As a bonus,
    it removes a block of under-Giant code that was there to support the
    vref.