summaryrefslogtreecommitdiffstats
path: root/sys/kern/kern_fork.c
Commit message (Collapse)AuthorAgeFilesLines
* Add a sysctl kern.pid_max, which limits the maximum pid the system iskib2012-08-151-4/+4
| | | | | | | allowed to allocate, and corresponding tunable with the same name. Note that existing processes with higher pids are left intact. MFC after: 1 week
* The falloc() function obtains two references to newly created 'fp'.pjd2012-06-191-2/+6
| | | | | | | | | | | | On success we have to drop one after procdesc_finit() and on failure we have to close allocated slot with fdclose(), which also drops one reference for us and drop the remaining reference with fdrop(). Without this change closing process descriptor didn't result in killing pdfork(2)ed child. Reviewed by: rwatson MFC after: 1 month
* Stop treating td_sigmask specially for the purposes of new threadkib2012-05-261-1/+0
| | | | | | | | | creation. Move it into the copied region of the struct thread. Update some comments. Requested by: bde X-MFC after: never
* Fix panic with RACCT that could occur in low memory (or out of swap)trasz2012-05-221-1/+1
| | | | | | | | situations, due to fork1() calling racct_proc_exit() without calling racct_proc_fork() first. Submitted by: Mateusz Guzik <mjguzik at gmail dot com> (earlier version) Reviewed by: Mateusz Guzik <mjguzik at gmail dot com>
* Currently, the debugger attached to the process executing vfork() doeskib2012-02-271-8/+4
| | | | | | | | | | not get syscall exit notification until the child performed exec of exit. Swap the order of doing ptracestop() and waiting for P_PPWAIT clearing, by postponing the wait into syscallret after ptracestop() notification is done. Reported, tested and reviewed by: Dmitry Mikulin <dmitrym juniper net> MFC after: 2 weeks
* Allow the parent to gather the exit status of the children reparentedkib2012-02-231-0/+1
| | | | | | | | | | to the debugger. When reparenting for debugging, keep the child in the new orphan list of old parent. When looping over the children in kern_wait(), iterate over both children list and orphan list to search for the process by pid. Submitted by: Dmitry Mikulin <dmitrym juniper.net> MFC after: 2 weeks
* Mark the automatically attached child with PL_FLAG_CHILD in structkib2012-02-101-0/+2
| | | | | | | lwpinfo flags, for PT_FOLLOWFORK auto-attachment. In collaboration with: Dmitry Mikulin <dmitrym juniper net> MFC after: 1 week
* Move some code inside the racct_proc_fork(); it spares a few lock operationstrasz2011-10-031-11/+0
| | | | | | and it's more logical this way. MFC after: 3 days
* Fix another bug introduced in r225641, which caused rctl to access certaintrasz2011-10-031-0/+1
| | | | | | fields in 'struct proc' before they got initialized in do_fork(). MFC after: 3 days
* Fix error handling bug that would prevent MAC structures from gettingtrasz2011-09-171-20/+18
| | | | | | freed properly if resource limit got exceeded. Approved by: re (kib)
* Fix long-standing thinko regarding maxproc accounting. Basically,trasz2011-09-171-21/+3
| | | | | | | | | | we were accounting the newly created process to its parent instead of the child itself. This caused problems later, when the child changed its credentials - the per-uid, per-jail etc counters were not properly updated, because the maxproc counter in the child process was 0. Approved by: re (kib)
* In order to maximize the re-usability of kernel code in user space thiskmacy2011-09-161-5/+5
| | | | | | | | | | | | | patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)
* Add experimental support for process descriptorsjonathan2011-08-181-6/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | A "process descriptor" file descriptor is used to manage processes without using the PID namespace. This is required for Capsicum's Capability Mode, where the PID namespace is unavailable. New system calls pdfork(2) and pdkill(2) offer the functional equivalents of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote process for debugging purposes. The currently-unimplemented pdwait(2) will, in the future, allow querying rusage/exit status. In the interim, poll(2) may be used to check (and wait for) process termination. When a process is referenced by a process descriptor, it does not issue SIGCHLD to the parent, making it suitable for use in libraries---a common scenario when using library compartmentalisation from within large applications (such as web browsers). Some observers may note a similarity to Mach task ports; process descriptors provide a subset of this behaviour, but in a UNIX style. This feature is enabled by "options PROCDESC", but as with several other Capsicum kernel features, is not enabled by default in GENERIC 9.0. Reviewed by: jhb, kib Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc
* Implement an RFTSIGZMB flag to rfork(2) to specify a signal that iskib2011-07-121-1/+16
| | | | | | | | delivered to parent when the child exists. Submitted by: Petr Salinger <Petr.Salinger seznam cz> (Debian/kFreeBSD) MFC after: 1 week X-MFC-note: bump __FreeBSD_version
* All the racct_*() calls need to happen with the proc locked. Fixing thistrasz2011-07-061-0/+6
| | | | | | won't happen before 9.0. This commit adds "#ifdef RACCT" around all the "PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order to avoid useless locking/unlocking in kernels built without "options RACCT".
* Enable accounting for RACCT_NPROC and RACCT_NTHR.trasz2011-03-311-0/+20
| | | | | Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)
* Add racct. It's an API to keep per-process, per-jail, per-loginclasstrasz2011-03-291-0/+17
| | | | | | | | | and per-loginclass resource accounting information, to be used by the new resource limits code. It's connected to the build, but the code that actually calls the new functions will come later. Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)
* Fix some locking nits with the p_state field of struct proc:jhb2011-03-241-7/+4
| | | | | | | | | | | | | | | | | | - Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL in fork to honor the locking requirements. While here, expand the scope of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously the code was locking the new child process (p2) after it had locked the parent process (p1). However, when locking two processes, the safe order is to lock the child first, then the parent. - Fix various places that were checking p_state against PRS_NEW without having the process locked to use PROC_LOCK(). Every place was already locking the process, just after the PRS_NEW check. - Remove or reduce the use of PROC_SLOCK() for places that were checking p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading the current state. - Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once. MFC after: 1 week
* Extend struct sysvec with new method sv_schedtail, which is used for andchagin2011-03-081-1/+3
| | | | | | | | | | | | | | | explicit process at fork trampoline path instead of eventhadler(schedtail) invocation for each child process. Remove eventhandler(schedtail) code and change linux ABI to use newly added sysvec method. While here replace explicit comparing of module sysentvec structure with the newly created process sysentvec to detect the linux ABI. Discussed with: kib MFC after: 2 Week
* Introduce preliminary support of the show description of the ABI ofdchagin2011-02-251-4/+4
| | | | | | | traced process by adding two new events which records value of process sv_flags to the trace file at process creation/execing/exiting time. MFC after: 1 Month.
* Allow debugger to specify that children of the traced process should bekib2011-01-251-10/+63
| | | | | | | | automatically traced. Extend the ptrace(PL_LWPINFO) to report that child just forked. Reviewed by: davidxu, jhb MFC after: 2 weeks
* - Move sched_fork() later in fork() after the various sections of the newjhb2011-01-061-6/+7
| | | | | | | | | | | | | | thread and proc have been copied and zeroed from the old thread and proc. Otherwise attempts to modify thread or process data in sched_fork() could be undone. - Don't copy td_{base,}_user_pri from the old thread to the new thread in sched_fork_thread() in ULE. This is already done courtesy the bcopy() of the thread copy region. - Always initialize the real priority (td_priority) of new threads to the new thread's base priority (td_base_pri) to avoid bogusly inheriting a borrowed priority from the parent thread. MFC after: 2 weeks
* Finishing touches to fork1() - ANSIfy missed function definition, style(9)trasz2011-01-021-27/+20
| | | | | fixes, removal of few comments that didn't really make sense and addition of fork_findpid() locking requirements.
* Refactor fork1() to make it easier to follow. No functional changes.trasz2010-12-101-191/+220
| | | | | Reviewed by: kib (earlier version) Tested by: pho
* MFp4:davidxu2010-12-091-0/+1
| | | | | | | | | It is possible a lower priority thread lending priority to higher priority thread, in old code, it is ignored, however the lending should always be recorded, add field td_lend_user_pri to fix the problem, if a thread does not have borrowed priority, its value is PRI_MAX. MFC after: 1 week
* Add a KASSERT to make it obvious when fork_norfproc() is to be called,trasz2010-12-061-1/+3
| | | | | | | | and set *procp to NULL in all cases. Previously, it was not being set in the ERESTART case. This is effectively no-op, since its value is ignored by callers in the error case. Reviewed by: kib@
* Fix style bug introduced by previous commit.trasz2010-12-061-1/+1
|
* Improve readability by factoring out the !RFPROC case. While here,trasz2010-12-061-59/+57
| | | | | | turn K&R function definitions into ANSI. No functional changes. Reviewed by: kib@
* - When disabling ktracing on a process, free any pending requests thatjhb2010-10-211-15/+1
| | | | | | | | | | | | | | | may be left. This fixes a memory leak that can occur when tracing is disabled on a process via disabling tracing of a specific file (or if an I/O error occurs with the tracefile) if the process's next system call is exit(). The trace disabling code clears p_traceflag, so exit1() doesn't do any KTRACE-related cleanup leading to the leak. I chose to make the free'ing of pending records synchronous rather than patching exit1(). - Move KTRACE-specific logic out of kern_(exec|exit|fork).c and into kern_ktrace.c instead. Make ktrace_mtx private to kern_ktrace.c as a result. MFC after: 1 month
* Create a global thread hash table to speed up thread lookup, usedavidxu2010-10-091-1/+1
| | | | | | | | | | rwlock to protect the table. In old code, thread lookup is done with process lock held, to find a thread, kernel has to iterate through process and thread list, this is quite inefficient. With this change, test shows in extreme case performance is dramatically improved. Earlier patch was reviewed by: jhb, julian
* Fix two bugs in DTrace:rpaulo2010-09-091-9/+15
| | | | | | | * when the process exits, remove the associated USDT probes * when the process forks, duplicate the USDT probes. Sponsored by: The FreeBSD Foundation
* Add an extra comment to the SDT probes definition. This allows us to getrpaulo2010-08-221-1/+1
| | | | | | | | | use '-' in probe names, matching the probe names in Solaris.[1] Add userland SDT probes definitions to sys/sdt.h. Sponsored by: The FreeBSD Foundation Discussed with: rwaston [1]
* Reintroduce the r196640, after fixing the problem with my testing.kib2009-09-011-10/+15
| | | | | | | | | | | | | | | | | | | | | | | | | Remove the altkstacks, instead instantiate threads with kernel stack allocated with the right size from the start. For the thread that has kernel stack cached, verify that requested stack size is equial to the actual, and reallocate the stack if sizes differ [1]. This fixes the bug introduced by r173361 that was committed several days after r173004 and consisted of kthread_add(9) ignoring the non-default kernel stack size. Also, r173361 removed the caching of the kernel stacks for a non-first thread in the process. Introduce separate kernel stack cache that keeps some limited amount of preallocated kernel stacks to lower the latency of thread allocation. Add vm_lowmem handler to prune the cache on low memory condition. This way, system with reasonable amount of the threads get lower latency of thread creation, while still not exhausting significant portion of KVA for unused kstacks. Submitted by: peter [1] Discussed with: jhb, julian, peter Reviewed by: jhb Tested by: pho (and retested according to new test scenarious) MFC after: 1 week
* Reverse r196640 and r196644 for now.kib2009-08-291-15/+10
|
* Dispose the kernel stack of the proper thread.kib2009-08-291-1/+1
| | | | | Submitted by: alc MFC after: 1 week
* Remove the altkstacks, instead instantiate threads with kernel stackkib2009-08-291-10/+15
| | | | | | | | | | | | | | | | | | | | | | | | allocated with the right size from the start. For the thread that has kernel stack cached, verify that requested stack size is equial to the actual, and reallocate the stack if sizes differ [1]. This fixes the bug introduced by r173361 that was committed several days after r173004 and consisted of kthread_add(9) ignoring the non-default kernel stack size. Also, r173361 removed the caching of the kernel stacks for a non-first thread in the process. Introduce separate kernel stack cache that keeps some limited amount of preallocated kernel stacks to lower the latency of thread allocation. Add vm_lowmem handler to prune the cache on low memory condition. This way, system with reasonable amount of the threads get lower latency of thread creation, while still not exhausting significant portion of KVA for unused kstacks. Submitted by: peter [1] Discussed with: jhb, julian, peter Reviewed by: jhb Tested by: pho MFC after: 1 week
* Remove the interim vimage containers, struct vimage and struct procg,jamie2009-07-171-4/+0
| | | | | | and the ioctl-based interface that supported them. Approved by: re (kib), bz (mentor)
* Replace AUDIT_ARG() with variable argument macros with a set more morerwatson2009-06-271-2/+2
| | | | | | | | | | | | | | specific macros for each audit argument type. This makes it easier to follow call-graphs, especially for automated analysis tools (such as fxr). In MFC, we should leave the existing AUDIT_ARG() macros as they may be used by third-party kernel modules. Suggested by: brooks Approved by: re (kib) Obtained from: TrustedBSD Project MFC after: 1 week
* Implement global and per-uid accounting of the anonymous memory. Addkib2009-06-231-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved for the uid. The accounting information (charge) is associated with either map entry, or vm object backing the entry, assuming the object is the first one in the shadow chain and entry does not require COW. Charge is moved from entry to object on allocation of the object, e.g. during the mmap, assuming the object is allocated, or on the first page fault on the entry. It moves back to the entry on forks due to COW setup. The per-entry granularity of accounting makes the charge process fair for processes that change uid during lifetime, and decrements charge for proper uid when region is unmapped. The interface of vm_pager_allocate(9) is extended by adding struct ucred *, that is used to charge appropriate uid when allocation if performed by kernel, e.g. md(4). Several syscalls, among them is fork(2), may now return ENOMEM when global or per-uid limits are enforced. In collaboration with: pho Reviewed by: alc Approved by: re (kensmith)
* Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Usekib2009-06-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | vnode interlock to protect the knote fields [1]. The locking assumes that shared vnode lock is held, thus we get exclusive access to knote either by exclusive vnode lock protection, or by shared vnode lock + vnode interlock. Do not use kl_locked() method to assert either lock ownership or the fact that curthread does not own the lock. For shared locks, ownership is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared lock not owned by curthread, causing false positives in kqueue subsystem assertions about knlist lock. Remove kl_locked method from knlist lock vector, and add two separate assertion methods kl_assert_locked and kl_assert_unlocked, that are supposed to use proper asserts. Change knlist_init accordingly. Add convenience function knlist_init_mtx to reduce number of arguments for typical knlist initialization. Submitted by: jhb [1] Noted by: jhb [2] Reviewed by: jhb Tested by: rnoland
* Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERICrwatson2009-06-051-1/+0
| | | | | | | | and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
* Add hierarchical jails. A jail may further virtualize its environmentjamie2009-05-271-4/+3
| | | | | | | | | | | | | | | | | | | | | | by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)
* Introduce a new virtualization container, provisionally named vprocg, to holdzec2009-05-081-0/+3
| | | | | | | | | | | | | | | | | | | | | | virtualized instances of hostname and domainname, as well as a new top-level virtualization struct vimage, which holds pointers to struct vnet and struct vprocg. Struct vprocg is likely to become replaced in the near future with a new jail management API import. As a consequence of this change, change struct ucred to point to a struct vimage, instead of directly pointing to a vnet. Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage branch. Permit kldload / kldunload operations to be executed only from the default vimage context. This change should have no functional impact on nooptions VIMAGE kernel builds. Reviewed by: bz Approved by: julian (mentor)
* Change the curvnet variable from a global const struct vnet *,zec2009-05-051-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | previously always pointing to the default vnet context, to a dynamically changing thread-local one. The currvnet context should be set on entry to networking code via CURVNET_SET() macros, and reverted to previous state via CURVNET_RESTORE(). Recursions on curvnet are permitted, though strongly discuouraged. This change should have no functional impact on nooptions VIMAGE kernel builds, where CURVNET_* macros expand to whitespace. The curthread->td_vnet (aka curvnet) variable's purpose is to be an indicator of the vnet context in which the current network-related operation takes place, in case we cannot deduce the current vnet context from any other source, such as by looking at mbuf's m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so far curvnet has turned out to be an invaluable consistency checking aid: it helps to catch cases when sockets, ifnets or any other vnet-aware structures may have leaked from one vnet to another. The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros was a result of an empirical iterative process, whith an aim to reduce recursions on CURVNET_SET() to a minimum, while still reducing the scope of CURVNET_SET() to networking only operations - the alternative would be calling CURVNET_SET() on each system call entry. In general, curvnet has to be set in three typicall cases: when processing socket-related requests from userspace or from within the kernel; when processing inbound traffic flowing from device drivers to upper layers of the networking stack, and when executing timer-driven networking functions. This change also introduces a DDB subcommand to show the list of all vnet instances. Approved by: julian (mentor)
* Several threads in a process may do vfork() simultaneously. Then, allkib2008-12-051-1/+1
| | | | | | | | | | | | | | | | | | | parent threads sleep on the parent' struct proc until corresponding child releases the vmspace. Each sleep is interlocked with proc mutex of the child, that triggers assertion in the sleepq_add(). The assertion requires that at any time, all simultaneous sleepers for the channel use the same interlock. Silent the assertion by using conditional variable allocated in the child. Broadcast the variable event on exec() and exit(). Since struct proc * sleep wait channel is overloaded for several unrelated events, I was unable to remove wakeups from the places where cv_broadcast() is added, except exec(). Reported and tested by: ganbold Suggested and reviewed by: jhb MFC after: 2 week
* MFp4:bz2008-11-291-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
* - Forward port flush of page table updates on context switch or userretkmacy2008-10-191-2/+7
| | | | - Forward port vfork XEN hack
* Do the pargs_hold() on the copy of the pointer to the p_args of thekib2008-07-231-1/+1
| | | | | | | | | | | | | | | child process immediately after bulk bcopy() without dropping the process lock. Since process is not single-threaded when forking, dropping and reacquiring the lock allows an other thread to change the process title of the parent in between, and results in hold being done on the invalid pointer. The problem manifested itself as the double free of the old p_args. Reported by: kris Reviewed by: jhb MFC after: 1 week
* The kqueue_register() function assumes that it is called from the top ofkib2008-07-071-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | the syscall code and acquires various event subsystem locks as needed. The handling of the NOTE_TRACK for EVFILT_PROC is currently done by calling the kqueue_register() from filt_proc() filter, causing recursive entrance of the kqueue code. This results in the LORs and recursive acquisition of the locks. Implement the variant of the knote() function designed to only handle the fork() event. It mostly copies the knote() body, but also handles the NOTE_TRACK, removing the handling from the filt_proc(), where it causes problems described above. The function is called from the fork1() instead of knote(). When encountering NOTE_TRACK knote, it marks the knote as influx and drops the knlist and kqueue lock. In this context call to kqueue_register is safe from the problems. An error from the kqueue_register() is reported to the observer as NOTE_TRACKERR fflag. PR: 108201 Reviewed by: jhb, Pramod Srinivasan <pramod juniper net> (previous version) Discussed with: jmg Tested by: pho MFC after: 2 weeks
* Add DTrace 'proc' provider probes using the Statically Defined Tracejb2008-05-241-0/+23
| | | | (sdt) mechanism.
OpenPOWER on IntegriCloud