path: root/sys/kern
Commit message | Author | Age | Files | Lines
* Add a sysctl kern.pid_max, which limits the maximum pid the system iskib2012-08-155-7/+45
| | | | | | | allowed to allocate, and corresponding tunable with the same name. Note that existing processes with higher pids are left intact. MFC after: 1 week
* Revert r239178 and implement two new functions, namelyhselasky2012-08-152-25/+33
| | | | | | | | | | | "device_free_softc()" and "device_claim_softc()", to allow USB serial drivers to refcount the softc. These functions are used to grab the softc from auto-free and to free the softc back to the correct malloc type, respectively. Discussed with: jhb MFC after: 2 weeks
* Don't include opt_ddb.h & <ddb/ddb.h> twice.obrien2012-08-151-2/+0
|
* Reserve room for the terminating NUL when setting or getting kerneljh2012-08-141-6/+6
| | | | | environment variables. KENV_MNAMELEN and KENV_MVALLEN do not include space for the terminating NUL.
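The off-by-one described above can be illustrated in userland. This is a minimal sketch, not the kernel change itself: `VALUE_MAX_LEN` and `copy_value()` are hypothetical stand-ins for a limit such as KENV_MVALLEN and the code that uses it, and the value 8 is chosen only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a limit such as KENV_MVALLEN: it counts
 * characters only and excludes the terminating NUL.
 */
#define VALUE_MAX_LEN 8

/*
 * Copy src into dst, which must hold VALUE_MAX_LEN + 1 bytes.
 * Returns 0 on success, -1 if src is longer than the limit.
 */
static int
copy_value(char dst[VALUE_MAX_LEN + 1], const char *src)
{
	size_t len;

	assert(dst != NULL && src != NULL);
	len = strlen(src);
	if (len > VALUE_MAX_LEN)
		return (-1);
	memcpy(dst, src, len + 1);	/* len characters plus the NUL */
	return (0);
}
```

A buffer declared as `char buf[VALUE_MAX_LEN]` would be one byte short for a value of maximum length, which is the class of bug the commit message describes.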
* Some style fixes inspired by @bde.davidxu2012-08-111-12/+12
|
* Some more minor tunings inspired by bde@.mav2012-08-112-25/+31
|
* Allow idle threads to steal second threads from other cores on systems withmav2012-08-111-6/+0
| | | | | | | | | | | | 8 or more cores to improve utilization. None of my tests on a 2xXeon (2x6x2) system showed any slowdown from the mentioned "excess thrashing". At the same time, in a pbzip2 test with more threads than CPUs I see up to 10% speedup with SMT disabled and up to 5% with SMT enabled. Thinking about thrashing, I tried to limit that stealing to within the same last level cache, but got only worse results. The present code anyway prefers to steal threads from topologically closer cores. Sponsored by: iXsystems, Inc.
* tvtohz will print out an error message if a negative value is givendavidxu2012-08-111-9/+13
| | | | | | to it, avoid this problem by detecting timeout earlier. Reported by: pho
* Some minor tunings/cleanups inspired by bde@ after previous commits:mav2012-08-102-44/+69
| | | | | | | | | | - remove extra dynamic variable initializations; - restore (4BSD) and implement (ULE) hogticks variable setting; - make sched_rr_interval() more tolerant to options; - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more user-friendly wrapper for sched_slice; - tune some sysctl descriptions; - make some style fixes.
* sched_rr_interval() seems always returned period in hz ticks, but samemav2012-08-101-1/+1
| | | | always it was used as rate. Fix use side units to period in hz ticks.
* Add new device method to free the automaticallyhselasky2012-08-102-4/+25
| | | | | | | | | | | | | | | allocated softc structure which is returned by device_get_softc(). This method can be used to easily implement softc refcounting. This can be desirable when the softc has memory references which are controlled by userspace handles for example. This solves the problem of blocking the caller of device_detach() for a non-deterministic time. Discussed with: kib, ed MFC after: 2 weeks
* Rework r220198 change (by fabient). I believe it solves the problem frommav2012-08-092-9/+14
| | | | | | | | | | | | | | | | | | | | | the wrong direction. Before it, if preemption and end of time slice happened at the same time, the thread was put at the head of the queue as for preemption only. It could cause a single thread to run for an indefinitely long time. r220198 handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that causes a delayed context switch every time preemption happens, even when not needed. Solve the problem by introducing a scheduler-specific thread flag TDF_SLICEEND, set when the thread's time slice is over and it should be put at the tail of the queue. Using the SW_PREEMPT flag for that purpose, as it was before, is just not informative enough to work correctly. In my tests this reduces run time deviation by 2-3 times (improves fairness) in cases when several threads share one CPU. Reviewed by: fabient MFC after: 2 months Sponsored by: iXsystems, Inc.
* SCHED_4BSD scheduling quantum mechanism appears to be broken for some time.mav2012-08-091-33/+36
| | | | | | | | | | | With switchticks variable being reset each time thread preempted (that is done regularly by interrupt threads) scheduling quantum may never expire. It was not noticed in time because several other factors still regularly trigger context switches. Handle the problem by replacing that mechanism with its equivalent from SCHED_ULE called time slice. It is effectively the same, just measured in context of stathz instead of hz. Some unification is probably not bad.
* Always initialize pl_event.kib2012-08-081-0/+1
| | | | | Submitted by: Andrey Zonov <andrey@zonov.org> MFC after: 3 days
* Do not add handler to event handlers list until ithread is created.kan2012-08-061-25/+27
| | | | | | | | | | | | In rare event when fast and ithread interrupts share the same vector and the fast handler was registered first, we can end up trying to schedule the ithread that is not created yet. The kernel built with INVARIANTS then triggers an assertion. Change the order to create the ithread first and only then add the handler that needs it to the interrupt event handlers list. Reviewed by: jhb
* After the PHYS_TO_VM_PAGE() function was de-inlined, the main reasonkib2012-08-053-1/+3
| | | | | | | | | | | | | to pull vm_param.h was removed. The other big dependency of vm_page.h on vm_param.h is the PA_LOCK* definitions, which are only needed for in-kernel code, because modules use KBI-safe functions to lock the pages. Stop including vm_param.h into vm_page.h. Include vm_param.h explicitly for the kernel code which needs it. Suggested and reviewed by: alc MFC after: 2 weeks
* Partially MFcalloutng r238425 (by davide):mav2012-08-041-11/+6
| | | | | | | | | Fix an issue related to old periodic timers. The code in kern_clocksource.c uses interrupt to keep track of time, and this time may not match with binuptime(). In order to address such incoherency, switch periodic timers to binuptime(). Except further calloutng it is needed for already present cyclic subsystem.
* Partially MFcalloutng r236894 (by davide):mav2012-08-041-20/+20
| | | | | | ... While here, Bruce Evans told me that "unsigned int" is spelled "u_int" in KNF, so replace it where needed.
* Microoptimize time math. Since our event periods are always below onemav2012-08-031-12/+14
| | | | | second, we can skip adding the integer parts by using bintime_addx() instead of bintime_add(). Profiling shows a 15% reduction in handleevents() time.
* Reorder the management of advisory locks on open files so that the advisoryjhb2012-07-312-43/+61
| | | | | | | | | | | | | | | | | | | | | | lock is obtained before the write count is increased during open() and the lock is released after the write count is decreased during close(). The first change closes a race where an open() that will block with O_SHLOCK or O_EXLOCK can increase the write count while it waits. If the process holding the current lock on the file then tries to call exec() on the file it has locked, it can fail with ETXTBUSY even though the advisory lock is preventing other threads from successfully completing a writable open(). The second change closes a race where a read-only open() with O_SHLOCK or O_EXLOCK may return successfully while the write count is non-zero due to another descriptor that had the advisory lock and was blocking the open() still being in the process of closing. If the process that completed the open() then attempts to call exec() on the file it locked, it can fail with ETXTBUSY even though the other process that held a write lock has closed the file and released the lock. Reviewed by: kib MFC after: 1 month
* I am comparing current pipe code with the one in 8.3-STABLE r236165,davidxu2012-07-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | I found 8.3 is a historical BSD version using a socket to implement the FIFO pipe; it uses a per-file seqcount to compare with the writer generation stored in the per-pipe object. The concept is that after all writers are gone, the pipe enters the next generation, and all old readers that have not closed the pipe should get the indication that the pipe is disconnected; the result is they should get EPIPE, SIGPIPE or POLLHUP in poll(). But a newcomer should not know that previous writers were gone; it should treat it as a fresh session. I am trying to bring the FIFO pipe back to the historical behavior. It is still unclear whether a single EOF flag can represent SBS_CANTSENDMORE and SBS_CANTRCVMORE, which the socket-based version is using, but I have run the poll regression test in the tools directory, and the output is now the same as the one on 8.3-STABLE. I think the output "not ok 18 FIFO state 6b: poll result 0 expected 1. expected POLLHUP; got 0" might be bogus, because a newcomer should not know that old writers were gone. I got the same behavior on Linux. Our implementation always returns POLLIN for a disconnected pipe even when it should return POLLHUP, but I think it is not wise to remove POLLIN for compatibility reasons; this is our historical behavior. Regression test: /usr/src/tools/regression/poll
* When a thread is blocked in direct write state, it only sets PIPE_DIRECTWdavidxu2012-07-311-3/+3
| | | | | | | | | | | | | | | | | | | | | flag but not PIPE_WANTW, but the FIFO pipe code does not understand this internal state. When a FIFO peer reader closes the pipe, it wants to notify the writer; it checks PIPE_WANTW and, if not set, skips calling wakeup(), so the blocked writer never notices the case. In general, the writer should return from the syscall with the EPIPE error code and may get a SIGPIPE signal. Setting PIPE_WANTW fixes the problem, or you can turn off direct write, which should fix the problem too. This bug was found by PR/170203. Another bug in the FIFO pipe code is that when a peer closes the pipe, the other end, which is blocked in select() or poll(), is not notified, because the code fails to call pipeselwakeup(). A third problem was found in the poll regression test: the existing code cannot pass the 6b, 6c and 6d tests, but FreeBSD-4 works. This commit does not fix that problem; I still need to study more to find the cause. PR: 170203 Tested by: Garrett Copper <yanegomi at gmail dot com>
* Until now KTR_ENTRIES, which defines the size of circular buffer used indavide2012-07-301-2/+2
| | | | | | | | | ktr(4), was constrained to be a power of two. Remove this constraint and update sys/conf/NOTES accordingly. Reviewed by: jhb Approved by: gnn (mentor) Sponsored by: Google Summer of Code 2012
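One common reason for a power-of-two constraint on a circular buffer is that the index is wrapped with a bit mask. The sketch below contrasts that with a modulo wrap, which works for any size; whether ktr(4) used exactly this form is an assumption, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Advance an index with a mask: only correct when size is a power of two. */
static size_t
next_masked(size_t idx, size_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return ((idx + 1) & (size - 1));
}

/* Advance an index with a modulo: correct for any non-zero size. */
static size_t
next_modulo(size_t idx, size_t size)
{
	assert(size != 0);
	return ((idx + 1) % size);
}
```

The modulo form costs a division on most hardware, which is the usual trade-off for lifting the size restriction.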
* Add F_DUP2FD_CLOEXEC. Apparently Solaris 11 already did this.kib2012-07-271-0/+8
| | | | | | Submitted by: Jukka A. Ukkonen <jau iki fi> PR: standards/169962 MFC after: 1 week
* Cosmetics: define FREEBSD32_MINUSER and AOUT32_MINUSER for structkib2012-07-221-1/+2
| | | | | | | sysentvec .sv_minuser. Also improve style. Submitted by: Oliver Pinter <oliver.pinter@gmail.com> MFC after: 1 week
* (Incomplete) fixes for symbols visibility issues and style in fcntl.h.kib2012-07-211-1/+1
| | | | | | | | | | | | | | | | | Append '__' prefix to the tag of struct oflock, and put it under BSD namespace. Structure is needed both by libc and kernel, thus cannot be hidden under #ifdef _KERNEL. Move a set of non-standard F_* and O_* constants into BSD namespace. SUSv4 explicitly allows an implementation to pollute F_* and O_* names after fcntl.h is included, but it costs us nothing to adhere to the specification if exact POSIX compliance level is requested by user code. Change some spaces after #define to tabs. Noted by and discussed with: bde MFC after: 1 week
* Remove line which was accidentally kept in r238614.kib2012-07-191-1/+0
| | | | | | Submitted by: pjd Pointy hat to: kib MFC after: 1 week
* Fix several reads beyond the mapped first page of the binary in thekib2012-07-191-9/+18
| | | | | | | | | ELF parser. Specifically, do not allow the note reader and the interpreter path comparison in the brandelf code to read past the end of the page. This may happen if a specially crafted ELF image is activated. Submitted by: Lukasz Wojcik <lukasz.wojcik zoho com> MFC after: 3 days
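The kind of check described above can be sketched as a bounded comparison. This is a userland illustration under stated assumptions, not the committed code: `IMG_PAGE_SIZE` and `interp_matches()` are hypothetical names, and the page size of 4096 is only an example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical size of the mapped first page of the image. */
#define IMG_PAGE_SIZE 4096

/*
 * Return 1 if the bytes at page[off..] equal the string expected,
 * including its terminating NUL, and the whole comparison stays inside
 * the page. Return 0 otherwise.
 */
static int
interp_matches(const unsigned char *page, size_t off, const char *expected)
{
	size_t len;

	assert(page != NULL && expected != NULL);
	len = strlen(expected) + 1;	/* compare the NUL as well */
	/* Reject offsets or lengths that would read past the page. */
	if (off > IMG_PAGE_SIZE || len > IMG_PAGE_SIZE - off)
		return (0);
	return (memcmp(page + off, expected, len) == 0);
}
```

The bounds test is written as `len > IMG_PAGE_SIZE - off` rather than `off + len > IMG_PAGE_SIZE` so that a hostile offset cannot overflow the addition.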
* Implement F_DUPFD_CLOEXEC command for fcntl(2), specified by SUSv4.kib2012-07-191-0/+11
| | | | | | PR: standards/169962 Submitted by: Jukka A. Ukkonen <jau iki fi> MFC after: 1 week
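A short usage sketch of the command from the caller's side. `F_DUPFD_CLOEXEC` and `FD_CLOEXEC` are the standard names; the wrapper functions `dup_cloexec()` and `is_cloexec()` are hypothetical helpers written for this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Duplicate fd onto the lowest free descriptor >= minfd, with FD_CLOEXEC
 * set on the new descriptor in the same call. Returns the new descriptor,
 * or -1 on error.
 */
static int
dup_cloexec(int fd, int minfd)
{
	assert(minfd >= 0);
	return (fcntl(fd, F_DUPFD_CLOEXEC, minfd));
}

/* Return 1 if FD_CLOEXEC is set on fd, 0 if not, -1 on error. */
static int
is_cloexec(int fd)
{
	int flags;

	flags = fcntl(fd, F_GETFD);
	if (flags == -1)
		return (-1);
	return ((flags & FD_CLOEXEC) != 0);
}
```

Compared with F_DUPFD followed by a separate F_SETFD, the single call leaves no window in which another thread could fork and exec with the descriptor still inheritable.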
* Add support for walltimestamp in DTrace.gnn2012-07-161-0/+20
| | | | | Submitted by: Fabian Keil MFC after: 2 weeks
* - Add support for displaying process stack memory regions.pgj2012-07-161-0/+4
| | | | | Approved by: rwatson MFC after: 3 days
* Fix a bug with memguard(9) on 32-bit architectures without amdf2012-07-151-1/+1
| | | | | | | | | | | | VM_KMEM_MAX_SIZE. The code was not taking into account the size of the kernel_map, which the kmem_map is allocated from, so it could produce a sub-map size too large to fit. The simplest solution is to ignore VM_KMEM_MAX entirely and base the memguard map's size off the kernel_map's size, since this is always relevant and always smaller. Found by: Justin Hibbits
* Make the interval timings for EVFILT_TIMER more accurate. tvtohz() alwaysjhb2012-07-131-5/+12
| | | | | | | | | | | | | | adds an extra tick to account for the current partial clock tick. However, that is not appropriate for a repeating timer when the exact tvtohz() value should be used for subsequent intervals. Fix repeating callouts for EVFILT_TIMER by subtracting 1 tick from the tvtohz() result similar to the fix used in realitexpire() for interval timers. While here, update a few comments to note that if the EVFILT_TIMER code were to move out of kern_event.c, it should move to kern_time.c (where the interval timer code it mimics lives) rather than kern_timeout.c. MFC after: 1 month
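The tick arithmetic can be modelled as follows. This is a sketch of the behaviour the message describes, not a copy of tvtohz(): the tick length `TICK_US` and both function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical clock: 1000 ticks per second, so one tick is 1000 us. */
#define TICK_US 1000

/*
 * Model of a conversion that rounds up to whole ticks and then adds one
 * extra tick to cover the partially elapsed current tick.
 */
static int64_t
first_timeout_ticks(int64_t interval_us)
{
	assert(interval_us >= 0);
	return ((interval_us + TICK_US - 1) / TICK_US + 1);
}

/* Repeating intervals do not need the extra tick, so drop it. */
static int64_t
repeat_timeout_ticks(int64_t interval_us)
{
	return (first_timeout_ticks(interval_us) - 1);
}
```

With these assumptions a 10 ms repeating timer is rearmed every 10 ticks instead of 11, so the period no longer drifts by one tick per expiry.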
* Fix build for kernels with dtrace hooks.kib2012-07-111-0/+4
| | | | MFC after: 1 month
* Initial commit of an I/O provider for DTrace on FreeBSD.gnn2012-07-112-0/+286
| | | | | | | | | | | | These probes are most useful when looking into the structures they provide, which are listed in io.d. For example: dtrace -n 'io:genunix::start { printf("%d\n", args[0]->bio_bcount); }' Note that the I/O systems in FreeBSD and Solaris/Illumos are sufficiently different that there is not a 1:1 mapping from scripts that work with one to the other. MFC after: 1 month
* Always clear p_xthread if current thread no longer needs it, in theory, ifdavidxu2012-07-101-2/+3
| | | | | | | debugger exited without calling ptrace(PT_DETACH), there is a time window in which p_xthread may be pointing to a non-existing thread; in practice, this is not a problem because the child process will soon be killed by the parent process.
* If you have pressed CTRL+Z and a process is suspended, then you use gdbdavidxu2012-07-091-4/+4
| | | | | | | | | | | | to attach to the process, it is surprising that the process is resumed without inputting any gdb commands, however ptrace manual said: The tracing process will see the newly-traced process stop and may then control it as if it had been traced all along. But the current code does not work in this way, unless traced process received a signal later, it will continue to run as a background task. To fix this problem, just send signal SIGSTOP to the traced process after we resumed it, this works like that you are attaching to a running process, it is not perfect but better than nothing.
* Follow-up commit to r238220:mjg2012-07-091-8/+24
| | | | | | | | | | | | | Pass only FEXEC (instead of FREAD|FEXEC) in fgetvp_exec. _fget has to check for !FWRITE anyway and may as well know about FREAD. Make _fget code a bit more readable by converting permission checking from if() to switch(). Assert that correct permission flags are passed. In collaboration with: kib Approved by: trasz (mentor) MFC after: 6 days X-MFC: with r238220
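The switch-based permission check might look roughly like the sketch below. The flag values and the name `check_access()` are hypothetical; only the rules themselves (read or exec access is enough for execution, write access is refused with EBADF, unexpected flags are asserted against) are taken from this and the preceding commit message.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values for illustration; the kernel's may differ. */
#define X_FREAD  0x1
#define X_FWRITE 0x2
#define X_FEXEC  0x4

/*
 * Check whether a file opened with open_flags may be used for the single
 * access mode in wanted. Returns 0 if allowed, EBADF otherwise.
 */
static int
check_access(int open_flags, int wanted)
{
	switch (wanted) {
	case X_FREAD:
	case X_FWRITE:
		if ((open_flags & wanted) == 0)
			return (EBADF);
		break;
	case X_FEXEC:
		/* Execution needs read or exec access and no write access. */
		if ((open_flags & (X_FREAD | X_FEXEC)) == 0 ||
		    (open_flags & X_FWRITE) != 0)
			return (EBADF);
		break;
	case 0:
		/* No particular access requested. */
		break;
	default:
		assert(0 && "unexpected access flags");
	}
	return (0);
}
```

Putting each mode in its own case keeps the FEXEC rule, which involves three flags, separate from the simple single-bit tests.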
* Unbreak handling of descriptors opened with O_EXEC by fexecve(2).mjg2012-07-082-4/+13
| | | | | | | | | | | While here return EBADF for descriptors opened for writing (previously it was ETXTBSY). Add fgetvp_exec function which performs appropriate checks. PR: kern/169651 In collaboration with: kib Approved by: trasz (mentor) MFC after: 1 week
* Fix KASSERT message.trociny2012-07-031-1/+1
| | | | MFC after: 3 days
* Extend the KPI to lock and unlock f_offset member of struct file. Itkib2012-07-023-51/+108
| | | | | | | | | | | | | | | | | | now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries(). Ensure that on 32bit architectures f_offset, which is a 64bit quantity, is always read and written under the mtxpool protection. This fixes an apparently easy to trigger race where parallel lseek()s, or lseek() and read/write, could destroy the file offset. The already broken ABI emulations, including iBCS and SysV, are not converted (yet). Tested by: pho No objections from: jhb MFC after: 3 weeks
* Honor db_pager_quit in 'show uma' and 'show malloc'.jhb2012-07-021-0/+4
| | | | MFC after: 1 month
* Remove an old hack I noticed years ago, but never committed.imp2012-06-282-7/+7
|
* Add new pmap layer locks to the predefined lock order. Change the namesalc2012-06-271-5/+8
| | | | of a few existing VM locks to follow a consistent naming scheme.
* Correct sizeof usagekevlo2012-06-251-1/+1
| | | | Obtained from: DragonFly
* Move the code dealing with shared page into a dedicatedkib2012-06-232-192/+240
| | | | | | kern_sharedpage.c source file from kern_exec.c. MFC after: 29 days
* Stop updating the struct vdso_timehands from event handler executed inkib2012-06-232-95/+64
| | | | | | | | | | | | | | | | | | | | | | the scheduled task from tc_windup(). Do it directly from tc_windup in interrupt context [1]. Establish the permanent mapping of the shared page into the kernel address space, avoiding the potential need to sleep waiting for allocation of sf buffer during vdso_timehands update. As a consequence, shared_page_write_start() and shared_page_write_end() functions are not needed anymore. Guess and memorize the pointers to native host and compat32 sysentvec during initialization, to avoid the need to get shared_page_alloc_sx lock during the update. In tc_fill_vdso_timehands(), do not loop waiting for timehands generation to stabilize, since vdso_timehands is written in the same interrupt context which wrote timehands. Requested by: mav [1] MFC after: 29 days
* Implement mechanism to export some kernel timekeeping data tokib2012-06-224-0/+230
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | usermode, using shared page. The structures and functions have vdso prefix, to indicate the intended location of the code in some future. The versioned per-algorithm data is exported in the format of struct vdso_timehands, which mostly repeats the content of in-kernel struct timehands. Usermode reading of the structure can be lockless. Compatibility export for 32bit processes on 64bit host is also provided. Kernel also provides usermode with indication about currently used timecounter, so that libc can fall back to syscall if configured timecounter is unknown to usermode code. The shared data updates are initiated both from the tc_windup(), where a fast task is queued to do the update, and from sysctl handlers which change timecounter. A manual override switch kern.timecounter.fast_gettime allows turning off the mechanism. Only x86 architectures export the real algorithm data, and there, only for tsc timecounter. HPET counters page could be exported as well, but I prefer to not further glue the kernel and libc ABI there until proper vdso-based solution is developed. Minimal stubs necessary for non-x86 architectures to still compile are provided. Discussed with: bde Reviewed by: jhb Tested by: flo MFC after: 1 month
* Enhance the shared page chunk allocator.kib2012-06-221-14/+63
| | | | | | | | | | | | | | | | Do not rely on the busy state of the page from which we allocate the chunk, to protect allocator state. Use statically allocated sx lock instead. Provide more flexible KPI. In particular, allow allocating a chunk without providing initial data, and allow writes into an existing allocation. Allow getting an sf buf which temporarily maps the chunk, to allow sequential updates to shared page content without unmapping in between. Reviewed by: jhb Tested by: flo MFC after: 1 month
* Fix locking for f_offset, vn_read() and vn_write() cases only, for now.kib2012-06-211-53/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It seems that intended locking protocol for struct file f_offset field was as follows: f_offset should always be changed under the vnode lock (except fcntl(2) and lseek(2) did not follow the rules). Since read(2) uses shared vnode lock, FOFFSET_LOCKED block is additionally taken to serialize shared vnode lock owners. This was broken first by enabling shared lock on writes, then by fadvise changes, which moved f_offset assignment from under vnode lock, and last by vn_io_fault() doing chunked i/o. More, due to uio_offset not yet valid in vn_io_fault(), the range lock for reads was taken on the wrong region. Change the locking for f_offset to always use FOFFSET_LOCKED block, which is placed before rangelocks in the lock order. Extract foffset_lock() and foffset_unlock() functions which implement FOFFSET_LOCKED lock, and consistently lock f_offset with it in the vn_io_fault() both for reads and writes, even if MNTK_NO_IOPF flag is not set for the vnode mount. Indicate that f_offset is already valid for vn_read() and vn_write() calls from vn_io_fault() with FOF_OFFSET flag, and assert that all callers of vn_read() and vn_write() follow this protocol. Extract get_advice() function to calculate the POSIX_FADV_XXX value for the i/o region, and use it where appropriate. Reviewed by: jhb Tested by: pho MFC after: 2 weeks