path: root/sys/kern
Commit message [author, age, files, lines]
* IFp4 @208451: [pjd, 2012-11-30, files: 1, lines: -11/+8]
| | | | | | | | | | | | | | | | | | | Fix path handling for *at() syscalls. Before the change, the directory descriptor was totally ignored, so the relative path argument was appended to the current working directory path and not to the path provided by the descriptor; thus wrong paths were stored in audit logs. Now that we use the directory descriptor in vfs_lookup, move the AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2() calls to the place where we hold the file descriptor table lock, so we are sure paths will be resolved according to the same directory in the audit record and in the actual operation. Sponsored by: FreeBSD Foundation (auditdistd) Reviewed by: rwatson MFC after: 2 weeks
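The path rule described above can be illustrated with a small user-space sketch. It is not the kernel code: `audit_path_at` and its parameters are hypothetical names, and it only shows which base directory a relative path should be joined to.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not a kernel function: build the path that should
 * be recorded for an *at() call.  dirfd_path is the path behind the
 * directory descriptor, or NULL when no descriptor was given and the
 * current working directory applies.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
int
audit_path_at(const char *dirfd_path, const char *cwd, const char *path,
    char *out, size_t outlen)
{
	const char *base;
	int n;

	if (path[0] == '/') {
		/* Absolute paths ignore the base directory. */
		n = snprintf(out, outlen, "%s", path);
	} else {
		/* Relative paths are joined to the descriptor's directory. */
		base = (dirfd_path != NULL) ? dirfd_path : cwd;
		n = snprintf(out, outlen, "%s/%s", base, path);
	}
	return ((n < 0 || (size_t)n >= outlen) ? -1 : 0);
}
```

The bug described in the message corresponds to always using `cwd` as the base, even when a directory descriptor was supplied.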
* IFp4 @208450: [pjd, 2012-11-30, files: 1, lines: -1/+0]
| | | | | | | | Remove redundant call to AUDIT_ARG_UPATH1(). Path will be remembered by the following NDINIT(AUDITVNODE1) call. Sponsored by: FreeBSD Foundation (auditdistd) MFC after: 2 weeks
* Using a long is the wrong type to represent the realmem and maxmbufmem [andre, 2012-11-29, files: 1, lines: -4/+4]
| | | | | | | | | variable as they may overflow on i386/PAE and i386 with > 2GB RAM. Use 64bit quad_t instead. It has broader kernel infrastructure support with TUNABLE_QUAD_FETCH() and qmin/qmax() than other available types. Pointed out by: alc, bde
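A minimal sketch of the overflow in question, assuming a 4096-byte page and a 32-bit long as on i386. The function names and the page-size macro are made up for illustration and do not appear in the commit.

```c
#include <stdint.h>

#define	SKETCH_PAGE_SIZE	4096	/* assumed page size for this sketch */

/* Byte count of physical memory, computed in 64 bits so it cannot wrap. */
int64_t
realmem_bytes(uint32_t physpages)
{
	return ((int64_t)physpages * SKETCH_PAGE_SIZE);
}

/* Nonzero if the value would not fit in a 32-bit signed long. */
int
overflows_long32(int64_t bytes)
{
	return (bytes > INT32_MAX);
}
```

With 786432 pages (3GB) the byte count already exceeds what a 32-bit signed long can hold, which is why a 64-bit type is the safer choice.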
* Complete r243631 by applying the remainder of kern_mbuf.c that got [andre, 2012-11-27, files: 1, lines: -16/+18]
| | | | | | | lost while merging into the commit tree. MFC after: 1 month X-MFC-with: r243631
* Fix r243627 by testing against the head socket instead of the socket [andre, 2012-11-27, files: 1, lines: -1/+1]
| | | | | | | just created. MFC after: 1 week X-MFC-with: r243627
* Base the mbuf related limits on the available physical memory or [andre, 2012-11-27, files: 3, lines: -24/+86]
| kernel memory, whichever is lower.
|
| The overall mbuf related memory limit must be set so that mbufs
| (and clusters of various sizes) can't exhaust physical RAM or KVM.
|
| The limit is set to half of the physical RAM or KVM (whichever is
| lower) as the baseline. In any normal scenario we want to leave
| at least half of the physmem/kvm for other kernel functions and
| userspace to prevent it from swapping too easily.
|
| Via a tunable kern.maxmbufmem the limit can be upped to at most 3/4
| of physmem/kvm.
|
| At the same time divorce maxfiles from maxusers and set maxfiles to
| physpages / 8 with a floor based on maxusers. This way busy servers
| can make use of the significantly increased mbuf limits with a much
| larger number of open sockets.
|
| Tidy up ordering in init_param2() and check up on some users of
| those values calculated here.
|
| Out of the overall mbuf memory limit, 2K clusters and 4K (page size)
| clusters get 1/4 each because these are the most heavily used mbuf
| sizes. 2K clusters are used for MTU 1500 ethernet inbound packets.
| 4K clusters are used whenever possible for sends on sockets and thus
| outbound packets.
|
| The larger cluster sizes of 9K and 16K are limited to 1/6 each of
| the overall mbuf memory limit. When jumbo MTUs are used these large
| clusters will end up only on the inbound path. They are not used on
| outbound, there it's still 4K. Yes, that will stay that way because
| otherwise we run into lots of complications in the stack. And it
| really isn't a problem, so don't make a scene.
|
| Normal mbufs (256B) weren't limited at all previously. This was
| problematic as there are certain places in the kernel that on
| allocation failure of clusters try to piece together their packet
| from smaller mbufs.
|
| The mbuf limit is the number of all other mbuf sizes together plus
| some more to allow for standalone mbufs (ACK for example) and to
| send off a copy of a cluster. Unfortunately there isn't a way to
| set an overall limit for all mbuf memory together as UMA doesn't
| support such a limit.
|
| NB: Every cluster also has an mbuf associated with it.
|
| Two examples of the revised mbuf sizing limits:
|
| 1GB KVM:
|  512MB limit for mbufs
|  419,430 mbufs
|  65,536 2K mbuf clusters
|  32,768 4K mbuf clusters
|  9,709 9K mbuf clusters
|  5,461 16K mbuf clusters
|
| 16GB RAM:
|  8GB limit for mbufs
|  33,554,432 mbufs
|  1,048,576 2K mbuf clusters
|  524,288 4K mbuf clusters
|  155,344 9K mbuf clusters
|  87,381 16K mbuf clusters
|
| These defaults should be sufficient for even the most demanding
| network loads.
|
| MFC after: 1 month
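The cluster shares described in this message can be checked with a short calculation. The sketch below is a user-space illustration, not the kernel's init code; the struct, function, and macro names are placeholders. It reproduces the cluster counts in the two examples (1/4 of the limit each for 2K and 4K clusters, 1/6 each for 9K and 16K). The plain mbuf count is left out because the message does not state an exact formula for it.

```c
#include <stdint.h>

/* Buffer sizes in bytes; the macro names are placeholders for this sketch. */
#define	SZ_2K	2048
#define	SZ_4K	4096
#define	SZ_9K	(9 * 1024)
#define	SZ_16K	(16 * 1024)

struct cluster_limits {
	int64_t n2k, n4k, n9k, n16k;
};

/* Baseline limit: half of physical RAM or KVM, whichever is lower. */
int64_t
mbuf_mem_baseline(int64_t physmem, int64_t kvm)
{
	return ((physmem < kvm ? physmem : kvm) / 2);
}

/* Split an overall mbuf memory limit (in bytes) into cluster counts. */
struct cluster_limits
cluster_limits_for(int64_t limit)
{
	struct cluster_limits cl;

	cl.n2k = limit / SZ_2K / 4;	/* 1/4 of the limit */
	cl.n4k = limit / SZ_4K / 4;	/* 1/4 of the limit */
	cl.n9k = limit / SZ_9K / 6;	/* 1/6 of the limit */
	cl.n16k = limit / SZ_16K / 6;	/* 1/6 of the limit */
	return (cl);
}
```

For a 512MB limit this gives 65,536 / 32,768 / 9,709 / 5,461 clusters, and for an 8GB limit 1,048,576 / 524,288 / 155,344 / 87,381, matching the figures quoted in the message.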
* Fix a race on listen socket teardown where while draining the [andre, 2012-11-27, files: 1, lines: -4/+21]
| | | | | | | | | | | | accept queues a new socket/connection may be added to the queue due to a race on the ACCEPT_LOCK. The submitted patch is slightly changed in comments, teardown and locking order and extended with KASSERT's. Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com> Found by: His team. MFC after: 1 week
* Add kern.capmode_coredump sysctl/tunable to allow processes in capability mode [pjd, 2012-11-27, files: 1, lines: -2/+13]
| | | | | | | | to dump core. Reviewed by: rwatson Obtained from: WHEEL Systems MFC after: 2 weeks
* - Add NOCAPCHECK flag to namei that allows lookup to work even if the process [pjd, 2012-11-27, files: 2, lines: -1/+5]
| | | | | | | | | | | | is in capability mode. - Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() that will be converted into the NOCAPCHECK namei flag. This functionality will be used to enable core dumps for sandboxed processes. Reviewed by: rwatson Obtained from: WHEEL Systems MFC after: 2 weeks
* Regenerate after r243610. [pjd, 2012-11-27, files: 1, lines: -1/+1]
|
* Allow kill(2) to be used in capability mode, but a process can send a signal only [pjd, 2012-11-27, files: 2, lines: -0/+13]
| | | | | | | | | to itself. For example, abort(3) at first tries to do kill(getpid(), SIGABRT), which was failing in capability mode, so the code was falling back to exit(1). Reviewed by: rwatson Obtained from: WHEEL Systems MFC after: 2 weeks
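The rule above amounts to a single check on the target pid. The following is a user-space sketch only: `cap_kill_check` and `SKETCH_EDENIED` are hypothetical names, and the error value is a placeholder rather than the errno the kernel actually returns.

```c
#include <sys/types.h>

#define	SKETCH_EDENIED	1	/* placeholder error value for this sketch */

/*
 * Decide whether a kill(2) request may proceed.  In capability mode the
 * only permitted target is the calling process itself.
 */
int
cap_kill_check(int in_capability_mode, pid_t self, pid_t target)
{
	if (in_capability_mode && target != self)
		return (SKETCH_EDENIED);
	return (0);
}
```

Under this rule the kill(getpid(), SIGABRT) call made by abort(3) passes the check, while signalling any other process from a sandbox is refused.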
* Allow to modify kern.sugid_coredump and kern.corefile from loader.conf. [pjd, 2012-11-27, files: 1, lines: -0/+2]
| | | | Obtained from: WHEEL Systems
* More style fixes. [pjd, 2012-11-27, files: 1, lines: -4/+4]
|
* Style fixes (mostly whitespaces). [pjd, 2012-11-27, files: 1, lines: -35/+39]
|
* Take first active vnode correctly. [davidxu, 2012-11-27, files: 1, lines: -1/+1]
| | | | | Reviewed by: kib MFC after: 3 days
* Look for zombie process only if we were given process id. [pjd, 2012-11-25, files: 1, lines: -5/+6]
| | | | | | Reviewed by: kib MFC after: 2 weeks X-MFC-after-or-with: 243142
* remove stop_scheduler_on_panic knob [avg, 2012-11-25, files: 1, lines: -36/+16]
| | | | | | | | | | | | There have not been any complaints about the default behavior, so there is no need to keep a knob that enables the worse alternative. Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu spinlock-like logic can be dropped, because only a single CPU is supposed to win the stop_cpus_hard(other_cpus) race and proceed past that call. MFC after: 1 month
* assert_vop_locked: make the assertion race-free and more efficient [avg, 2012-11-24, files: 1, lines: -3/+6]
| | | | | | this is really a minor improvement for the sake of correctness MFC after: 6 days
* remove vop_lookup_pre and vop_lookup_post [avg, 2012-11-22, files: 2, lines: -12/+0]
| | | | | Suggested by: kib MFC after: 5 days
* Schedule garbage collection run for the in-flight rights passed over [kib, 2012-11-20, files: 1, lines: -3/+3]
| | | | | | | | | | | | | | | | | | | | | the unix domain sockets to the next tick, coalescing the serial calls until the collection fires. The thought is that more work for the collector could arise in the near future, allowing it to clean more and not spend too much CPU on repeated collection when there is no garbage. Currently the collection task is fired immediately upon unix domain socket close if there are any rights in flight, which caused excessive CPU usage and overly long blocking of the threads waiting for unp_list_lock and unp_link_rwlock in write mode. Robert noted that it would be nice if we could find some heuristic by which we decide whether to run GC a bit more quickly. E.g., if the number of UNIX domain sockets is close to its resource limit, but not quite. Reported and tested by: Markus Gebert <markus.gebert@hostpoint.ch> Reviewed by: rwatson MFC after: 2 weeks
* Add a special meaning to the negative ticks argument for [kib, 2012-11-20, files: 1, lines: -2/+6]
| | | | | | | | | | | | | | | | taskqueue_enqueue_timeout(). Do not rearm the callout if it is already armed and the ticks argument is negative. Otherwise rearm it to fire abs(ticks) ticks in the future. The intended use is to call taskqueue_enqueue_timeout() for the given timeout_task with the same negative ticks argument. As a result, the task is scheduled to execute no later than abs(ticks) ticks in the future, and subsequent enqueues are coalesced until the already scheduled task is finished. Reviewed by: rwatson Tested by: Markus Gebert <markus.gebert@hostpoint.ch> MFC after: 2 weeks
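The negative-ticks rule can be modelled in a few lines. This is a toy user-space model, not the taskqueue implementation: `struct sketch_timeout_task` and `sketch_enqueue_timeout` are invented names that only mirror the arming decision described above.

```c
#include <stdlib.h>

/* Toy stand-in for a timeout task; the fields are assumptions. */
struct sketch_timeout_task {
	int armed;	/* nonzero while the callout is pending */
	int fire_in;	/* ticks until the callout fires */
};

/* Returns 1 if the callout was (re)armed, 0 if the request was coalesced. */
int
sketch_enqueue_timeout(struct sketch_timeout_task *t, int ticks)
{
	if (ticks < 0 && t->armed)
		return (0);		/* keep the existing deadline */
	t->armed = 1;
	t->fire_in = abs(ticks);	/* arm or rearm abs(ticks) ahead */
	return (1);
}
```

Repeated calls with the same negative value therefore never push the deadline further out, which is what lets a burst of enqueues collapse into one run of the task.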
* insmntque() is always called with the lock held in exclusive mode, [attilio, 2012-11-19, files: 1, lines: -16/+8]
| | | | | | | | | | then: - assume the lock is held in exclusive mode and remove a moot check about the lock acquisition. - in the destructor remove !MPSAFE specific chunk. Reviewed by: kib MFC after: 2 weeks
* assert_vop_locked should treat LK_EXCLOTHER as the not locked case [avg, 2012-11-19, files: 1, lines: -1/+2]
| | | | | | | | ... from a perspective of the current thread. Spotted by: mjg Discussed with: kib MFC after: 18 days
* vnode_if: fix locking protocol description for lookup and cachedlookup [avg, 2012-11-19, files: 2, lines: -26/+2]
| | | | | | | | | Also remove the checks from vop_lookup_pre and vop_lookup_post, which are now completely redundant (before this change they were partially redundant). Discussed with: kib MFC after: 10 days
* Fix possible fp reference leak in posix_openpt [mjg, 2012-11-18, files: 1, lines: -0/+1]
| | | | | | Reviewed by: ed Approved by: trasz (mentor) MFC after: 3 days
* Update comment. [glebius, 2012-11-16, files: 1, lines: -1/+2]
|
* In pget(9), if PGET_NOTWEXIT flag is not specified, also search the [kib, 2012-11-16, files: 1, lines: -21/+46]
| | | | | | | | | | | | zombie list for the pid. This allows several kern.proc sysctls to report useful information for zombies. Hold the allproc_lock around all searches instead of relocking it. Remove private pfind_locked() from the new nfs client code. Requested and reviewed by: pjd Tested by: pho MFC after: 3 weeks
* Restore the proper handling of the pid 0 for waitpid(2). [kib, 2012-11-16, files: 1, lines: -4/+9]
| | | | | | | Fix the style around. Reported and reviewed by: bde (previous version) MFC after: 28 days
* Style fixes for r242958. [kib, 2012-11-16, files: 1, lines: -8/+6]
| | | | | Reported and reviewed by: bde MFC after: 28 days
* Improve KASSERT messages in racct, to make it clear which resource [trasz, 2012-11-15, files: 1, lines: -12/+17]
| | | | | | caused the problem. Submitted by: mjg
* Fix kassert that's not really valid for %CPU accounting. The problem [trasz, 2012-11-15, files: 1, lines: -2/+3]
| | | | | | | | here is a race between decaying the resource usage in containers and updating per-process usage; basically, the former may cause per-container usage to get smaller than per-process usage. Submitted by: Rudo Tomori
* Fix bug in r242852 that prevented CPU from becoming idle if kernel built [mav, 2012-11-15, files: 1, lines: -1/+3]
| | | | without SMP support.
* - Implement run-time expansion of the KTR buffer via sysctl. [jeff, 2012-11-15, files: 3, lines: -48/+143]
| | | | | | | | | | - Implement a function to ensure that all preempted threads have switched back out at least once. Use this to make sure there are no stale references to the old ktr_buf or the lock profiling buffers before updating them. Reviewed by: marius (sparc64 parts), attilio (earlier patch) Sponsored by: EMC / Isilon Storage Division
* Style fix [bapt, 2012-11-14, files: 1, lines: -1/+1]
| | | | MFC after: 1 day
* return ERANGE if the buffer is too small to contain the login as documented in [bapt, 2012-11-14, files: 1, lines: -0/+2]
| | | | | | | the manpage Reviewed by: cognet, kib MFC after: 1 month
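A bounds check of the kind described here might look like the following. This is a user-space sketch with a hypothetical helper name, `copy_login`; it is not the syscall's actual code and only shows the documented ERANGE behaviour.

```c
#include <errno.h>
#include <string.h>

/*
 * Copy a login name into a caller-supplied buffer.  Returns 0 on
 * success, or ERANGE if the buffer cannot hold the name plus its
 * terminating NUL.
 */
int
copy_login(char *buf, size_t buflen, const char *login)
{
	size_t need = strlen(login) + 1;	/* include the NUL */

	if (buflen < need)
		return (ERANGE);
	memcpy(buf, login, need);
	return (0);
}
```

Checking before copying means a short buffer is reported as an error instead of yielding a silently truncated name.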
* enterpgrp: get rid of pgrp2 variable and use KASSERT directly on pgfind result. [mjg, 2012-11-13, files: 1, lines: -5/+1]
| | | | | | | pgrp2 was used only for debugging, but pgrp2 = pgfind(..) was present in compiled code even for kernels without INVARIANTS Approved by: trasz (mentor) MFC after: 1 week
* Regen [kib, 2012-11-13, files: 3, lines: -4/+48]
|
* Add the wait6(2) system call. It takes POSIX waitid()-like process [kib, 2012-11-13, files: 3, lines: -40/+274]
| | | | | | | | | | | | | | | | | | | | | designator to select a process which is waited for. The system call optionally returns siginfo_t which would be otherwise provided to SIGCHLD handler, as well as extended structure accounting for child and cumulative grandchild resource usage. Allow to get the current rusage information for non-exited processes as well, similar to Solaris. The explicit WEXITED flag is required to wait for exited processes, allowing for more fine-grained control of the events the waiter is interested in. Fix the handling of siginfo for WNOWAIT option for all wait*(2) family, by not removing the queued signal state. PR: standards/170346 Submitted by: "Jukka A. Ukkonen" <jau@iki.fi> MFC after: 1 month
* Don't divide by zero. [trasz, 2012-11-13, files: 1, lines: -6/+12]
| | | | Tested by: swills
* Several optimizations to sched_idletd(): [mav, 2012-11-10, files: 1, lines: -18/+35]
| | | | | | | | | | | | | | | | - Do not try to steal load from other CPUs if there were no context switches on this CPU (i.e. it was idle all the time and woke up just for bus mastering or TLB shootdown). If the current CPU was idle, then it is quite unlikely that some other CPU has load to steal. Under a high I/O rate, when TLB shootdowns cause numerous CPU wakeups, on a 24-CPU system the load stealing code may consume up to 25% of all CPU time without giving any benefit. - Change the code that implements spinning for load to restart the spin in case of a context switch. The previous code periodically called cpu_idle() even under a high interrupt/context switch rate. - Raise the spinning threshold to 10KHz, where it gives at least some effect that may be worth the consumed power. Reviewed by: jeff@
* Allow maxusers to scale on machines with large address space. [alfred, 2012-11-10, files: 2, lines: -12/+21]
| | | | | | | | | | | | | | | | | | Some hooks are added to clamp down maxusers and nmbclusters for small address space systems. VM_MAX_AUTOTUNE_MAXUSERS - the max maxusers that will be autotuned based on physical memory. VM_MAX_AUTOTUNE_NMBCLUSTERS - max nmbclusters based on physical memory. These are set to the old values on i386 to preserve the clamping that was being done to all arches. Another macro VM_AUTOTUNE_NMBCLUSTERS is provided to allow an override for the calculation on a MD basis. Currently no arch defines this. Reviewed by: peter MFC after: 2 weeks
* Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag. [attilio, 2012-11-09, files: 2, lines: -2/+0]
| | | | | Porters should refer to __FreeBSD_version 1000021 for this change as it may have happened at the same timeframe.
* Make r242655 build on sparc64. While at it, make vm_{max,min}_kernel_address [marius, 2012-11-08, files: 1, lines: -2/+4]
| | | | vm_offset_t as they should be.
* - Change ULE to use dynamic slice sizes for the timeshare queue in order [jeff, 2012-11-08, files: 1, lines: -10/+48]
| | | | | | | | | | to further reduce latency for threads in this queue. This should help as threads transition from realtime to timeshare. The latency is bound to a max of sched_slice until we have more than sched_slice / 6 threads runnable. Then the min slice is allotted to all threads and latency becomes (nthreads - 1) * min_slice. Discussed with: mav
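One way to read the slice rule above is sketched below. It rests on assumptions the message does not spell out: that the full slice is divided evenly among runnable threads, that the minimum slice is sched_slice / 6, and that the minimum applies once six or more threads are runnable. The function and macro names are invented for this sketch.

```c
#define	SKETCH_MIN_DIVISOR	6	/* assumed divisor for the minimum slice */

/* Slice length in ticks for a queue with `load` runnable threads. */
int
sketch_slice(int sched_slice, int load)
{
	int min_slice = sched_slice / SKETCH_MIN_DIVISOR;

	if (load <= 1)
		return (sched_slice);	/* a lone thread gets the full slice */
	if (load >= SKETCH_MIN_DIVISOR)
		return (min_slice);	/* never go below the minimum */
	return (sched_slice / load);	/* share one full slice evenly */
}

/* Wait before a thread runs again once every thread gets the minimum slice. */
int
sketch_latency_at_min(int sched_slice, int nthreads)
{
	return ((nthreads - 1) * (sched_slice / SKETCH_MIN_DIVISOR));
}
```

Under these assumptions a full round through the queue stays within sched_slice while the load is small, and grows as (nthreads - 1) * min_slice after the minimum slice takes over.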
* Fix typo; s/ouput/output [kevlo, 2012-11-07, files: 1, lines: -1/+1]
|
* export VM_MIN_KERNEL_ADDRESS and VM_MAX_KERNEL_ADDRESS via sysctl. [alfred, 2012-11-06, files: 1, lines: -0/+8]
| | | | | On several platforms they are determined by too many nested #defines to be easily discernible. This will aid in development of auto-tuning.
* A clarification to the behaviour of the active vnode list management [kib, 2012-11-05, files: 1, lines: -0/+3]
| | | | | | | regarding the vnode page cleaning. In collaboration with: pho MFC after: 1 week
* Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command. [kib, 2012-11-04, files: 1, lines: -0/+5]
| | | | MFC after: 3 weeks
* Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command. [kib, 2012-11-04, files: 1, lines: -3/+9]
| | | | MFC after: 3 days
* Order the enumeration of the MNT_ flags to be the same as the order of [kib, 2012-11-04, files: 1, lines: -2/+2]
| | | | | | their definitions. MFC after: 3 days