summaryrefslogtreecommitdiffstats
path: root/sys/kern/kern_exec.c
Commit message (Collapse)AuthorAgeFilesLines
* Add some checks to ensure that Capsicum is behaving correctly, and add somejonathan2011-06-301-0/+24
| | | | | | | more explicit comments about what's going on and what future maintainers need to do when e.g. adding a new operation to a sys_machdep.c. Approved by: mentor(rwatson), re(bz)
* Introduce preliminary support of the show description of the ABI ofdchagin2011-02-251-0/+6
| | | | | | | traced process by adding two new events which records value of process sv_flags to the trace file at process creation/execing/exiting time. MFC after: 1 Month.
* Create shared (readonly) page. Each ABI may specify the use of page bykib2011-01-081-4/+85
| | | | | | | | | | | | | setting SV_SHP flag and providing pointer to the vm object and mapping address. Provide simple allocator to carve space in the page, tailored to put the code with alignment restrictions. Enable shared page use for amd64, both native and 32bit FreeBSD binaries. Page is private mapped at the top of the user address space, moving a start of the stack one page down. Move signal trampoline code from the top of the stack to the shared page. Reviewed by: alc
* - When disabling ktracing on a process, free any pending requests thatjhb2010-10-211-10/+2
| | | | | | | | | | | | | | | may be left. This fixes a memory leak that can occur when tracing is disabled on a process via disabling tracing of a specific file (or if an I/O error occurs with the tracefile) if the process's next system call is exit(). The trace disabling code clears p_traceflag, so exit1() doesn't do any KTRACE-related cleanup leading to the leak. I chose to make the free'ing of pending records synchronous rather than patching exit1(). - Move KTRACE-specific logic out of kern_(exec|exit|fork).c and into kern_ktrace.c instead. Make ktrace_mtx private to kern_ktrace.c as a result. MFC after: 1 month
* execve(2) has a special check for file permissions: a file must have atjh2010-08-301-8/+8
| | | | | | | | | | | | | | | least one execute bit set, otherwise execve(2) will return EACCES even for an user with PRIV_VFS_EXEC privilege. Add the check also to vaccess(9), vaccess_acl_nfs4(9) and vaccess_acl_posix1e(9). This makes access(2) to better agree with execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check to zfs_freebsd_access() too. There may be other file systems which are not using vaccess*() functions and need to be handled separately. PR: kern/125009 Reviewed by: bde, trasz Approved by: pjd (ZFS part)
* Add an extra comment to the SDT probes definition. This allows us to getrpaulo2010-08-221-3/+3
| | | | | | | | | use '-' in probe names, matching the probe names in Solaris.[1] Add userland SDT probes definitions to sys/sdt.h. Sponsored by: The FreeBSD Foundation Discussed with: rwaston [1]
* Supply some useful information to the started image using ELF aux vectors.kib2010-08-171-3/+28
| | | | | | | | In particular, provide pagesize and pagesizes array, the canary value for SSP use, number of host CPUs and osreldate. Tested by: marius (sparc64) MFC after: 1 month
* The interpreter name should no longer be treated as a buffer that can bealc2010-07-281-0/+4
| | | | | | overwritten. (This change should have been included in r210545.) Submitted by: kib
* Introduce exec_alloc_args(). The objective being to encapsulate thealc2010-07-271-13/+23
| | | | | | | | | | details of the string buffer allocation in one place. Eliminate the portion of the string buffer that was dedicated to storing the interpreter name. The pointer to the interpreter name can simply be made to point to the appropriate argument string. Reviewed by: kib
* Change the order in which the file name, arguments, environment, andalc2010-07-251-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | shell command are stored in exec*()'s demand-paged string buffer. For a "buildworld" on an 8GB amd64 multiprocessor, the new order reduces the number of global TLB shootdowns by 31%. It also eliminates about 330k page faults on the kernel address space. Change exec_shell_imgact() to use "args->begin_argv" consistently as the start of the argument and environment strings. Previously, it would sometimes use "args->buf", which is the start of the overall buffer, but no longer the start of the argument and environment strings. While I'm here, eliminate unnecessary passing of "&length" to copystr(), where we don't actually care about the length of the copied string. Clean up the initialization of the exec map. In particular, use the correct size for an entry, and express that size in the same way that is used when an entry is allocated. The old size was one page too large. (This discrepancy originated in 2004 when I rewrote exec_map_first_page() to use sf_buf_alloc() instead of the exec map for mapping the first page of the executable.) Reviewed by: kib
* Eliminate a little bit of duplicated code.alc2010-07-231-3/+2
|
* Accidentally committed an older version of this comment rather than thejhb2010-07-091-4/+4
| | | | final one.
* Refine a comment.jhb2010-07-091-5/+4
| | | | Reviewed by: bde
* Use vm_page_next() instead of vm_page_lookup() in exec_map_first_page()alc2010-07-021-1/+1
| | | | because vm_page_next() is faster.
* Tweak the in-kernel API for sending signals to threads:jhb2010-06-291-2/+2
| | | | | | | | | | - Rename tdsignal() to tdsendsignal() and make it private to kern_sig.c. - Add tdsignal() and tdksignal() routines that mirror psignal() and pksignal() except that they accept a thread as an argument instead of a process. They send a signal to a specific thread rather than to an individual process. Reviewed by: kib
* Reorganize syscall entry and leave handling.kib2010-05-231-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extend struct sysvec with three new elements: sv_fetch_syscall_args - the method to fetch syscall arguments from usermode into struct syscall_args. The structure is machine-depended (this might be reconsidered after all architectures are converted). sv_set_syscall_retval - the method to set a return value for usermode from the syscall. It is a generalization of cpu_set_syscall_retval(9) to allow ABIs to override the way to set a return value. sv_syscallnames - the table of syscall names. Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding the call to cpu_set_syscall_retval(). The new functions syscallenter(9) and syscallret(9) are provided that use sv_*syscall* pointers and contain the common repeated code from the syscall() implementations for the architecture-specific syscall trap handlers. Syscallenter() fetches arguments, calls syscall implementation from ABI sysent table, and set up return frame. The end of syscall bookkeeping is done by syscallret(). Take advantage of single place for MI syscall handling code and implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the thread is stopped at syscall entry or return point respectively. The EXEC flag augments SCX and notifies debugger that the process address space was changed by one of exec(2)-family syscalls. The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are changed to use syscallenter()/syscallret(). MIPS and arm are not converted and use the mostly unchanged syscall() implementation. Reviewed by: jhb, marcel, marius, nwhitehorn, stas Tested by: marcel (ia64), marius (sparc64), nwhitehorn (powerpc), stas (mips) MFC after: 1 month
* Eliminate page queues locking around most calls to vm_page_free().alc2010-05-061-3/+1
|
* Acquire the page lock around all remaining calls to vm_page_free() onalc2010-05-051-0/+2
| | | | | | | | | | | | | managed pages that didn't already have that lock held. (Freeing an unmanaged page, such as the various pmaps use, doesn't require the page lock.) This allows a change in vm_page_remove()'s locking requirements. It now expects the page lock to be held instead of the page queues lock. Consequently, the page queues lock is no longer required at all by callers to vm_page_rename(). Discussed with: kib
* On Alan's advice, rather than do a wholesale conversion on a singlekmacy2010-04-301-4/+4
| | | | | | | | | | | | architecture from page queue lock to a hashed array of page locks (based on a patch by Jeff Roberson), I've implemented page lock support in the MI code and have only moved vm_page's hold_count out from under page queue mutex to page lock. This changes pmap_extract_and_hold on all pmaps. Supported by: Bitgravity Inc. Discussed with: alc, jeffr, and kib
* Add the ELF relocation base to struct image_params. This will benwhitehorn2010-03-251-0/+1
| | | | | required to correctly relocate the executable entry point's function descriptor on powerpc64.
* Change the arguments of exec_setregs() so that it receives a pointernwhitehorn2010-03-251-4/+3
| | | | | | | | to the image_params struct instead of several members of that struct individually. This makes it easier to expand its arguments in the future without touching all platforms. Reviewed by: jhb
* The nargvstr and nenvstr properties of arginfo are ints, not longs,nwhitehorn2010-03-241-2/+2
| | | | | | | | so should be copied to userspace with suword32() instead of suword(). This alleviates problems on 64-bit big-endian architectures, and is a no-op on all 32-bit architectures. Tested on: amd64, sparc64, powerpc64
* - Fix several off-by-one errors when using MAXCOMLEN. The p_comm[] andjhb2009-10-231-13/+7
| | | | | | | | | | | | | td_name[] arrays are actually MAXCOMLEN + 1 in size and a few places that created shadow copies of these arrays were just using MAXCOMLEN. - Prefer using sizeof() of an array type to explicit constants for the array length in a few places. - Ensure that all of p_comm[] and td_name[] is always zero'd during execve() to guard against any possible information leaks. Previously trailing garbage in p_comm[] could be leaked to userland in ktrace record headers via td_name[]. Reviewed by: bde
* Add a mitigation feature that will prevent user mappings atbz2009-10-021-3/+12
| | | | | | | | | | | | | | | | | virtual address 0, limiting the ability to convert a kernel NULL pointer dereference into a privilege escalation attack. If the sysctl is set to 0 a newly started process will not be able to map anything in the address range of the first page (0 to PAGE_SIZE). This is the default. Already running processes are not affected by this. You can either change the sysctl or the tunable from loader in case you need to map at a virtual address of 0, for example when running any of the extinct species of a set of a.out binaries, vm86 emulation, .. In that case set security.bsd.map_at_zero="1". Superseeds: r197537 In collaboration with: jhb, kib, alc
* Unlock the image vnode around the call of pmc PMC_FN_PROCESS_EXEC hook.kib2009-09-091-0/+2
| | | | | | | | | The hook calls vn_fullpath(9), that should not be executed with a vnode lock held. Reported by: Bruce Cran <bruce cran org uk> Tested by: pho MFC after: 3 days
* Fix some LORs between vnode locks and filedescriptor table locks.jhb2009-07-311-1/+1
| | | | | | | | | | - Don't grab the filedesc lock just to read fd_cmask. - Drop vnode locks earlier when mounting the root filesystem and before sanitizing stdin/out/err file descriptors during execve(). Submitted by: kib Approved by: re (rwatson) MFC after: 1 week
* Rework vnode argument auditing to follow the same structure, in orderrwatson2009-07-281-1/+1
| | | | | | | | | | to avoid exposing ARG_ macros/flag values outside of the audit code in order to name which one of two possible vnodes will be audited for a system call. Approved by: re (kib) Obtained from: TrustedBSD Project MFC after: 1 month
* Replace AUDIT_ARG() with variable argument macros with a set more morerwatson2009-06-271-4/+4
| | | | | | | | | | | | | | specific macros for each audit argument type. This makes it easier to follow call-graphs, especially for automated analysis tools (such as fxr). In MFC, we should leave the existing AUDIT_ARG() macros as they may be used by third-party kernel modules. Suggested by: brooks Approved by: re (kib) Obtained from: TrustedBSD Project MFC after: 1 week
* Rework the credential code to support larger values of NGROUPS andbrooks2009-06-191-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024 and 1023 respectively. (Previously they were equal, but under a close reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it is the number of supplemental groups, not total number of groups.) The bulk of the change consists of converting the struct ucred member cr_groups from a static array to a pointer. Do the equivalent in kinfo_proc. Introduce new interfaces crcopysafe() and crsetgroups() for duplicating a process credential before modifying it and for setting group lists respectively. Both interfaces take care for the details of allocating groups array. crsetgroups() takes care of truncating the group list to the current maximum (NGROUPS) if necessary. In the future, crsetgroups() may be responsible for insuring invariants such as sorting the supplemental groups to allow groupmember() to be implemented as a binary search. Because we can not change struct xucred without breaking application ABIs, we leave it alone and introduce a new XU_NGROUPS value which is always 16 and is to be used or NGRPS as appropriate for things such as NFS which need to use no more than 16 groups. When feasible, truncate the group list rather than generating an error. Minor changes: - Reduce the number of hand rolled versions of groupmember(). - Do not assign to both cr_gid and cr_groups[0]. - Modify ipfw to cache ucreds instead of part of their contents since they are immutable once referenced by more than one entity. Submitted by: Isilon Systems (initial implementation) X-MFC after: never PR: bin/113398 kern/133867
* Eliminate unnecessary obfuscation when testing a page's valid bits.alc2009-06-071-1/+1
|
* If vm_pager_get_pages() returns VM_PAGER_OK, then there is no need to checkalc2009-06-061-2/+1
| | | | | the page's valid bits. The page is guaranteed to be fully valid. (For the record, this is documented in vm/vm_pager.h's comments.)
* Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERICrwatson2009-06-051-1/+0
| | | | | | | | and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
* Supply AT_EXECPATH auxinfo entry to the interpreter, both for native andkib2009-03-171-2/+29
| | | | | | | compat32 binaries. Tested by: pho Reviewed by: kan
* Remove unneeded pointer `ndp'.ed2009-02-261-11/+10
| | | | | | | Inside do_execve(), we have a pointer `ndp', which always points to `&nd'. I can imagine a primitive (non-optimizing) compiler to really reserve space for such a pointer, so just remove the variable and use `&nd' directly.
* Remove even more unneeded variable assignments.ed2009-02-261-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kern_time.c: - Unused variable `p'. kern_thr.c: - Variable `error' is always caught immediately, so no reason to initialize it. There is no way that error != 0 at the end of create_thread(). kern_sig.c: - Unused variable `code'. kern_synch.c: - `rval' is always assigned in all different cases. kern_rwlock.c: - `v' is always overwritten with RW_UNLOCKED further on. kern_malloc.c: - `size' is always initialized with the proper value before being used. kern_exit.c: - `error' is always caught and returned immediately. abort2() never returns a non-zero value. kern_exec.c: - `len' is always assigned inside the if-statement right below it. tty_info.c: - `td' is always overwritten by FOREACH_THREAD_IN_PROC(). Found by: LLVM's scan-build
* Several threads in a process may do vfork() simultaneously. Then, allkib2008-12-051-1/+1
| | | | | | | | | | | | | | | | | | | parent threads sleep on the parent' struct proc until corresponding child releases the vmspace. Each sleep is interlocked with proc mutex of the child, that triggers assertion in the sleepq_add(). The assertion requires that at any time, all simultaneous sleepers for the channel use the same interlock. Silent the assertion by using conditional variable allocated in the child. Broadcast the variable event on exec() and exit(). Since struct proc * sleep wait channel is overloaded for several unrelated events, I was unable to remove wakeups from the places where cv_broadcast() is added, except exec(). Reported and tested by: ganbold Suggested and reviewed by: jhb MFC after: 2 week
* Merge latest DTrace changes from Perforce.rodrigc2008-11-051-5/+4
| | | | Approved by: jb
* Remove VSVTX, VSGID and VSUID. This should be a no-op,trasz2008-09-101-4/+5
| | | | | | as VSVTX == S_ISVTX, VSGID == S_ISGID and VSUID == S_ISUID. Approved by: rwatson (mentor)
* Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed threadattilio2008-08-281-2/+2
| | | | | | was always curthread and totally unuseful. Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
* More fully audit fexecve(2) and its arguments.rwatson2008-08-251-0/+2
| | | | | Obtained from: TrustedBSD Project Sponsored by: Google, Inc.
* Introduce two related changes to the TrustedBSD MAC Framework:rwatson2008-08-231-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | (1) Abstract interpreter vnode labeling in execve(2) and mac_execve(2) so that the general exec code isn't aware of the details of allocating, copying, and freeing labels, rather, simply passes in a void pointer to start and stop functions that will be used by the framework. This change will be MFC'd. (2) Introduce a new flags field to the MAC_POLICY_SET(9) interface allowing policies to declare which types of objects require label allocation, initialization, and destruction, and define a set of flags covering various supported object types (MPC_OBJECT_PROC, MPC_OBJECT_VNODE, MPC_OBJECT_INPCB, ...). This change reduces the overhead of compiling the MAC Framework into the kernel if policies aren't loaded, or if policies require labels on only a small number or even no object types. Each time a policy is loaded or unloaded, we recalculate a mask of labeled object types across all policies present in the system. Eliminate MAC_ALWAYS_LABEL_MBUF option as it is no longer required. MFC after: 1 week ((1) only) Reviewed by: csjp Obtained from: TrustedBSD Project Sponsored by: Apple, Inc.
* Reduce the scope of the vnode lock such that it does not covercsjp2008-08-121-1/+5
| | | | | | | | | | | | | the various copyouts associated with initializing the process's argv/env data in userspace. It is possible that these copyout operations can fault under memory pressure, possibly resulting in dead locks. This is believed to be safe since none of the copyout_strings() operations need to interact with the vnode here. Submitted by: Zhouyi Zhou PR: kern/111260 Discussed with: kib MFC after: 3 weeks
* Call pargs_drop() unconditionally in do_execve(), the function correctlykib2008-07-251-4/+2
| | | | | | | handles the NULL argument. Make pargs_free() static. MFC after: 1 week
* Pair the VOP_OPEN call from do_execve() with the reciprocal VOP_CLOSE.kib2008-07-171-0/+9
| | | | | | | | This was unnoticed because local filesystems usually do nothing non-trivial in the close vop. Reported and tested by: Rick Macklem MFC after: 2 weeks
* Add DTrace 'proc' provider probes using the Statically Defined Tracejb2008-05-241-0/+32
| | | | (sdt) mechanism.
* Implement the fexecve(2) syscall.kib2008-03-311-28/+77
| | | | | | | Based on the submission by rdivacky, sponsored by Google Summer of Code 2007 Reviewed by: rwatson, rdivacky Tested by: pho
* Remove kernel support for M:N threading.jeff2008-03-121-1/+1
| | | | | | | | While the KSE project was quite successful in bringing threading to FreeBSD, the M:N approach taken by the kse library was never developed to its full potential. Backwards compatibility will be provided via libmap.conf for dynamically linked binaries and static binaries will be broken.
* VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used inattilio2008-01-131-3/+3
| | | | | | | | | | | conjuction with 'thread' argument passing which is always curthread. Remove the unuseful extra-argument and pass explicitly curthread to lower layer functions, when necessary. KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
* vn_lock() is currently only used with the 'curthread' passed as argument.attilio2008-01-101-3/+3
| | | | | | | | | | | | | | | | Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modify makes the code cleaner and in particular remove an annoying dependence helping next lockmgr() cleanup. KPI results, obviously, changed. Manpage and FreeBSD_version will be updated through further commits. As a side note, would be valuable to say that next commits will address a similar cleanup about VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>
* Add the superpage reservation system. This is "part 2 of 2" of thealc2007-12-291-0/+7
| | | | | | | | | | | | machine-independent support for superpages. (The earlier part was the rewrite of the physical memory allocator.) The remainder of the code required for superpages support is machine-dependent and will be added to the various pmap implementations at a later date. Initially, I am only supporting one large page size per architecture. Moreover, I am only enabling the reservation system on amd64. (In an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0 in amd64/include/vmparam.h or your kernel configuration file.)
OpenPOWER on IntegriCloud