path: root/sys/kern/sched_ule.c
Commit message | Author | Age | Files | Lines
* Author: jhb | Age: 2015-10-01 | Files: 1 | Lines: -0/+4
  MFC 286256:
  kgdb uses td_oncpu to determine if a thread is running and should use a pcb from stoppcbs[] rather than the thread's PCB. However, exited threads retained td_oncpu from the last time they ran, and newborn threads had their CPU fields cleared to zero during fork and thread creation since they are in the set of fields zeroed when threads are set up. To fix, explicitly update the CPU fields for exiting threads in sched_throw() to reflect the switch out and reset the CPU fields for new threads in sched_fork_thread() to NOCPU.
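Note (illustrative sketch, not part of the commit): the fix described above can be modelled in a few lines of userland C. The structure `toy_thread`, the `toy_*` functions and the value chosen for `NOCPU` are assumptions made for this example; the kernel's own definitions may differ.

```c
#include <assert.h>
#include <string.h>

#define NOCPU (-1)  /* placeholder for "not on any CPU"; the kernel's value may differ */

/* Hypothetical, trimmed-down thread structure for illustration only. */
struct toy_thread {
    int td_oncpu;   /* CPU the thread is running on, or NOCPU */
    int td_lastcpu; /* CPU the thread last ran on, or NOCPU */
};

/* Models thread setup: zeroing leaves both fields at 0, which is a valid CPU id. */
void toy_thread_zero(struct toy_thread *td)
{
    assert(td != NULL);
    memset(td, 0, sizeof(*td));
}

/* Models the fork-time fix: a newborn thread has never run anywhere. */
void toy_sched_fork_thread(struct toy_thread *child)
{
    child->td_oncpu = NOCPU;
    child->td_lastcpu = NOCPU;
}

/* Models the exit-time fix: record the switch out of the exiting thread. */
void toy_sched_throw(struct toy_thread *td)
{
    td->td_lastcpu = td->td_oncpu;
    td->td_oncpu = NOCPU;
}

/* The kind of check a debugger might make before choosing a register set. */
int toy_is_running(const struct toy_thread *td)
{
    return td->td_oncpu != NOCPU;
}
```

In this model a freshly zeroed thread looks as if it were running on CPU 0 until `toy_sched_fork_thread()` resets it, which mirrors the confusion the commit message attributes to kgdb.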
* Author: mav | Age: 2014-09-17 | Files: 1 | Lines: -0/+14
  MFC r271604, r271616:
  Add a couple of memory barriers to order tdq_cpu_idle and tdq_load accesses. This change fixes transient performance drops in some of my benchmarks, which vanished as soon as I tried to collect any stats from the scheduler. It looks like reordered access to those variables sometimes caused loss of IPI_PREEMPT, which delayed thread execution until some later interrupt.
  Approved by: re (marius)
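Note (illustrative sketch, not part of the commit): the ordering problem described above has the shape of a classic store-then-load handshake. The example below uses C11 atomics rather than kernel primitives; `toy_tdq` and both functions are hypothetical names, and the real change may place its barriers differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU queue state; field names echo the commit message. */
struct toy_tdq {
    atomic_int load;     /* number of runnable threads queued */
    atomic_int cpu_idle; /* nonzero while the CPU is about to sleep */
};

/*
 * Enqueue side: publish the new load, then look at cpu_idle.
 * Returns true when the caller should send a wakeup interrupt.
 */
bool toy_enqueue_needs_ipi(struct toy_tdq *q)
{
    assert(q != NULL);
    atomic_fetch_add_explicit(&q->load, 1, memory_order_relaxed);
    /* Keep the load update ordered before the cpu_idle read. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&q->cpu_idle, memory_order_relaxed) != 0;
}

/*
 * Idle side: announce idleness, then re-check the load.
 * Returns true when it is safe to go to sleep.
 */
bool toy_idle_may_sleep(struct toy_tdq *q)
{
    assert(q != NULL);
    atomic_store_explicit(&q->cpu_idle, 1, memory_order_relaxed);
    /* Keep the cpu_idle store ordered before the load read. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&q->load, memory_order_relaxed) != 0) {
        atomic_store_explicit(&q->cpu_idle, 0, memory_order_relaxed);
        return false;
    }
    return true;
}
```

With both fences in place, at least one side observes the other's write, so either the idle CPU notices the new load or the enqueuing CPU notices the idle flag and sends the wakeup. Without them, both reads could see stale values and the wakeup would be lost.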
* Author: mav | Age: 2014-09-06 | Files: 1 | Lines: -1/+2
  MFC r270423:
  Restore the pre-r239157 handling of sched_yield(), in which the thread's time slice was aborted, allowing other threads to run. Without this change the thread is just rescheduled again, as illustrated by the provided test tool.
  PR: 192926
  Submitted by: eric@vangyzen.net
  Approved by: re (marius)
* Author: kib | Age: 2014-07-10 | Files: 1 | Lines: -6/+8
  MFC r268211:
  Micro-manage clang to get the expected inlining for cpu_search().
* Author: kib | Age: 2014-06-15 | Files: 1 | Lines: -2/+0
  MFC r267227:
  Remove write-only local variable.
* Author: ian | Age: 2014-05-17 | Files: 1 | Lines: -4/+0
  MFC 261357, 261358, 261421:
  Enable SCHED_ULE for ppc book-e.
  Add driver for the ADT7460/ADT7467 fan controller found in later PowerBooks and iBooks.
* Author: avg | Age: 2014-01-17 | Files: 1 | Lines: -16/+16
  MFC r258622: dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
* Author: dim | Age: 2013-12-28 | Files: 1 | Lines: -24/+0
  MFC r259875:
  In sys/kern/sched_ule.c, remove static function sched_both(), which is unused since r232207.
* Author: jhb | Age: 2013-12-24 | Files: 1 | Lines: -1/+1
  MFC 258869:
  Fix an off-by-one error in r228960. The maximum priority delta provided by SCHED_PRI_TICKS should be SCHED_PRI_RANGE - 1 so that the resulting priority value (before nice adjustment) is between SCHED_PRI_MIN and SCHED_PRI_MAX, inclusive.
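Note (illustrative sketch, not part of the commit): the off-by-one is easiest to see with concrete numbers. The bounds below are invented for the example, and `toy_pri_from_ticks()` is a hypothetical helper rather than the kernel's SCHED_PRI_TICKS macro.

```c
#include <assert.h>

/* Hypothetical bounds; the kernel derives its own from the priority ranges. */
enum {
    TOY_PRI_MIN = 100,
    TOY_PRI_MAX = 139,
    TOY_PRI_RANGE = TOY_PRI_MAX - TOY_PRI_MIN + 1 /* 40 distinct values */
};

/*
 * Clamp a tick-derived delta to [0, RANGE - 1] so that MIN + delta
 * never exceeds MAX. Clamping to RANGE instead would allow MAX + 1.
 */
int toy_pri_from_ticks(int tick_delta)
{
    assert(tick_delta >= 0);
    if (tick_delta > TOY_PRI_RANGE - 1)
        tick_delta = TOY_PRI_RANGE - 1;
    return TOY_PRI_MIN + tick_delta;
}
```

A range of 40 values starting at 100 ends at 139, so the largest usable delta is 39, which is the "RANGE - 1" the commit message refers to.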
* Author: mav | Age: 2013-09-07 | Files: 1 | Lines: -2/+10
  Micro-optimize cpu_search(), allowing the compiler to use a more efficient inline ffsl() implementation, when it is available, instead of homegrown iteration.
  On a dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
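Note (illustrative sketch, not part of the commit): the difference between bit-by-bit iteration and a find-first-set primitive can be shown on a plain `unsigned long` mask. The example uses the GCC/Clang builtin `__builtin_ffsl` as a stand-in for ffsl(); the function names are hypothetical and the kernel operates on its own CPU set type.

```c
#include <assert.h>

/* Homegrown iteration: test every bit position until one is set. */
int toy_first_cpu_loop(unsigned long mask)
{
    for (int cpu = 0; cpu < (int)(sizeof(mask) * 8); cpu++)
        if (mask & (1UL << cpu))
            return cpu;
    return -1;
}

/* Find-first-set: returns a 1-based bit index, or 0 for an empty mask. */
int toy_first_cpu_ffs(unsigned long mask)
{
    return __builtin_ffsl((long)mask) - 1;
}

/* Visit only the set bits by clearing the lowest one on each pass. */
int toy_count_cpus(unsigned long mask)
{
    int n = 0;

    while (mask != 0) {
        int cpu = __builtin_ffsl((long)mask) - 1;
        assert(cpu >= 0);
        mask &= ~(1UL << cpu);
        n++;
    }
    return n;
}
```

The second and third functions do work proportional to the number of set bits rather than to the width of the mask, which is the kind of saving the commit message reports.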
* Author: gnn | Age: 2013-04-15 | Files: 1 | Lines: -1/+1
  Point args[0] not at the thread that is ending but at the one that is starting. This is in line with practice in OpenSolaris. Note that this change is only in ULE and not in the 4BSD scheduler. Once this change settles in (MFC timeout has expired) we'll try it out on 4BSD as well.
  PR: 177706
  Submitted by: Tiwei Bie
  MFC after: 1 month
* Author: mav | Age: 2012-11-15 | Files: 1 | Lines: -1/+3
  Fix bug in r242852 that prevented the CPU from becoming idle if the kernel was built without SMP support.
* Author: mav | Age: 2012-11-10 | Files: 1 | Lines: -18/+35
  Several optimizations to sched_idletd():
  - Do not try to steal load from other CPUs if there were no context switches on this CPU (i.e. it was idle all the time and woke up just for bus mastering or TLB shootdown). If the current CPU was idle, then it is quite unlikely that some other CPU has load to steal. Under high I/O rate, when TLB shootdowns cause numerous CPU wakeups, on a 24-CPU system the load stealing code may consume up to 25% of all CPU time without giving any benefit.
  - Change the code that implements spinning for load to restart the spin in case of a context switch. The previous code periodically called cpu_idle() even under a high interrupt/context switch rate.
  - Raise the spinning threshold to 10KHz, where it gives at least some effect that may be worth the consumed power.
  Reviewed by: jeff@
* Author: jeff | Age: 2012-11-08 | Files: 1 | Lines: -10/+48
  Change ULE to use dynamic slice sizes for the timeshare queue in order to further reduce latency for threads in this queue. This should help as threads transition from realtime to timeshare. The latency is bound to a max of sched_slice until we have more than sched_slice / 6 threads runnable. Then the min slice is allotted to all threads and latency becomes (nthreads - 1) * min_slice.
  Discussed with: mav
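Note (illustrative sketch, not part of the commit): the arithmetic described above can be tried out with small numbers. The constants, the function names and the exact point at which the minimum slice takes over are assumptions of this example (here the minimum applies once an equal share would fall below it); the kernel's own computation may differ.

```c
#include <assert.h>

/* Hypothetical tick counts chosen to make the arithmetic easy to follow. */
enum {
    TOY_SLICE = 12,    /* full slice, in ticks */
    TOY_SLICE_MIN = 2  /* smallest slice ever handed out */
};

/* Slice given to each of nthreads runnable timeshare threads. */
int toy_slice_for(int nthreads)
{
    assert(nthreads >= 0);
    if (nthreads <= 1)
        return TOY_SLICE;
    int share = TOY_SLICE / nthreads;
    return share < TOY_SLICE_MIN ? TOY_SLICE_MIN : share;
}

/* Worst-case wait before a thread runs again: all others run one slice first. */
int toy_latency(int nthreads)
{
    if (nthreads <= 1)
        return 0;
    return (nthreads - 1) * toy_slice_for(nthreads);
}
```

With these numbers, three threads get 4 ticks each and wait at most 8 ticks, still under the 12-tick full slice. Ten threads all get the 2-tick minimum, and the wait grows to (10 - 1) * 2 = 18 ticks.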
* Author: attilio | Age: 2012-10-31 | Files: 1 | Lines: -3/+2
  Rework the known mutexes that benefit from staying on their own cache line to use struct mtx_padalign instead of manual frobbing.
  The sole exceptions are the nvme and sfxge drivers, where the author redefined CACHE_LINE_SIZE manually, so they need to be analyzed and dealt with separately.
  Reviewed by: jimharris, alc
* Author: attilio | Age: 2012-10-30 | Files: 1 | Lines: -2/+0
  tdq_lock_pair() already does spinlock_enter() so migration is not possible in sched_balance_pair(). Remove redundant sched_pin().
  Reviewed by: marius, jeff
* Author: jimharris | Age: 2012-10-24 | Files: 1 | Lines: -1/+6
  Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.
  This enables CPU searches (which read tdq_load) to operate independently of any contention on the spinlock. Some scheduler-intensive workloads running on an 8C single-socket SNB Xeon show considerable improvement with this change (2-3% perf improvement, 5-6% decrease in CPU util).
  Sponsored by: Intel
  Reviewed by: jeff
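Note (illustrative sketch, not part of the commit): false sharing and its cure can be demonstrated with structure layout alone. The 64-byte line size, the structure names and the use of C11 `alignas` are assumptions of this example; the kernel uses its own padding mechanism and per-architecture constants.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define TOY_CACHE_LINE 64 /* assumed line size for this example */

/* Unpadded layout: lock, load and cpu_idle all sit within a few bytes. */
struct toy_tdq_packed {
    int lock;
    int load;
    int cpu_idle;
};

/* Padded layout: the lock gets a line to itself, the read-mostly fields share the next. */
struct toy_tdq_padded {
    alignas(TOY_CACHE_LINE) int lock;
    alignas(TOY_CACHE_LINE) int load;
    int cpu_idle;
};

static_assert(sizeof(struct toy_tdq_padded) >= 2 * TOY_CACHE_LINE,
    "padded layout should span at least two cache lines");

/* Cache line index of a field, counted from the start of the structure. */
size_t toy_line_of(size_t offset)
{
    return offset / TOY_CACHE_LINE;
}
```

In the packed layout a CPU spinning on `lock` keeps invalidating the line that other CPUs read `load` from. In the padded layout the two fields fall on different lines, so readers of `load` are not disturbed by lock contention.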
* Author: eadler | Age: 2012-10-22 | Files: 1 | Lines: -1/+1
  Remove duplicate semicolons where possible.
  Approved by: cperciva
  MFC after: 1 week
* Author: avg | Age: 2012-09-14 | Files: 1 | Lines: -1/+1
  sched_ule: fix inverted condition in reporting of priority lending via ktr
  Reviewed by: kan
  MFC after: 1 week
* Author: jhb | Age: 2012-08-22 | Files: 1 | Lines: -0/+1
  Mark the idle threads as non-sleepable and also assert that an idle thread never blocks on a turnstile.
* Author: mav | Age: 2012-08-11 | Files: 1 | Lines: -18/+22
  Some more minor tunings inspired by bde@.
* Author: mav | Age: 2012-08-11 | Files: 1 | Lines: -6/+0
  Allow idle threads to steal second threads from other cores on systems with 8 or more cores to improve utilization. None of my tests on a 2xXeon (2x6x2) system showed any slowdown from the mentioned "excess thrashing". At the same time, in the pbzip2 test with more threads than CPUs I see up to 10% speedup with SMT disabled and up to 5% with SMT enabled.
  Thinking about thrashing, I tried to limit that stealing to within the same last level cache, but got only worse results. The present code prefers to steal threads from topologically closer cores anyway.
  Sponsored by: iXsystems, Inc.
* Author: mav | Age: 2012-08-10 | Files: 1 | Lines: -30/+40
  Some minor tunings/cleanups inspired by bde@ after previous commits:
  - remove extra dynamic variable initializations;
  - restore (4BSD) and implement (ULE) hogticks variable setting;
  - make sched_rr_interval() more tolerant to options;
  - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more user-friendly wrapper for sched_slice;
  - tune some sysctl descriptions;
  - make some style fixes.
* Author: mav | Age: 2012-08-09 | Files: 1 | Lines: -5/+8
  Rework r220198 change (by fabient). I believe it solves the problem from the wrong direction. Before it, if preemption and end of time slice happened at the same time, the thread was put at the head of the queue, as for preemption alone. That could cause a single thread to run for an indefinitely long time. r220198 handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that causes a delayed context switch every time preemption happens, even when not needed.
  Solve the problem by introducing the scheduler-specific thread flag TDF_SLICEEND, set when a thread's time slice is over and it should be put at the tail of the queue. Using the SW_PREEMPT flag for that purpose, as was done before, is just not informative enough to work correctly.
  In my tests this reduces run time deviation by 2-3 times (improves fairness) in cases when several threads share one CPU.
  Reviewed by: fabient
  MFC after: 2 months
  Sponsored by: iXsystems, Inc.
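Note (illustrative sketch, not part of the commit): the head-versus-tail decision described above reduces to a small predicate. The flag values and the function `toy_requeue_at_head()` are hypothetical; only the flag names are borrowed from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SW_PREEMPT   0x1 /* the switch was caused by preemption */
#define TOY_TDF_SLICEEND 0x2 /* the thread has used up its time slice */

/*
 * Decide where a switched-out thread goes on the run queue.
 * Returns true for the head (the thread keeps its turn) and false
 * for the tail. The slice-end flag is consumed by the decision.
 */
bool toy_requeue_at_head(int sw_flags, int *td_flags)
{
    assert(td_flags != NULL);
    bool slice_over = (*td_flags & TOY_TDF_SLICEEND) != 0;

    *td_flags &= ~TOY_TDF_SLICEEND;
    return (sw_flags & TOY_SW_PREEMPT) != 0 && !slice_over;
}
```

A preempted thread with slice left returns to the head of the queue, while one whose slice ended at the same moment goes to the tail, which a preemption flag alone cannot express.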
* Author: raj | Age: 2012-05-27 | Files: 1 | Lines: -1/+1
  Let us manage differences of Book-E PowerPC variations, i.e. vendor / implementation specific vs. the common architecture definition.
  Bring PPC4XX defines (PSL, SPR, TLB). Note the new definitions under BOOKE_PPC4XX are not used in the code yet.
  This change set is not supposed to affect existing E500 support, it's just another reorg step before bringing support for E500mc, E5500 and PPC465.
  Obtained from: AppliedMicro, Freescale, Semihalf
* Author: rstone | Age: 2012-05-15 | Files: 1 | Lines: -1/+35
  Implement the DTrace sched provider. This implementation aims to be compatible with the sched provider implemented by Solaris and its open-source derivatives. Full documentation of the sched provider can be found on Oracle's DTrace wiki pages.
  Note that for compatibility with scripts originally written for Solaris, several probes are defined that will never fire. These probes are defined to fire when Solaris-specific features perform certain actions. As these features are not present in FreeBSD, the probes can never fire.
  Also, I have added two probes that are not defined in Solaris, lend-pri and load-change. These probes have been added to make it possible to collect schedgraph data with DTrace.
  Finally, a few probes are defined in Solaris to take a cpuinfo_t * argument. As it was not immediately clear to me how to translate that to FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.
  Sponsored by: Sandvine Incorporated
  MFC after: 2 weeks
* Author: mav | Age: 2012-04-09 | Files: 1 | Lines: -24/+28
  Microoptimize cpu_search().
  According to profiling, it now takes 6% of CPU time on hackbench, with its millions of context switches per second, instead of 8% before.
* Author: mav | Age: 2012-03-13 | Files: 1 | Lines: -46/+22
  Rewrite thread CPU usage percentage math to not depend on periodic calls with HZ rate through the sched_tick() calls from hardclock().
  Potentially it can be used to improve precision, but for now it just removes one more reason to call hardclock() for every HZ tick on every active CPU. SCHED_4BSD never used sched_tick(), but keep it in place for now, as at least SCHED_FBFS, which exists in out-of-tree patches, depends on it.
  MFC after: 1 month
* Author: mav | Age: 2012-03-09 | Files: 1 | Lines: -1/+3
  Make kern.sched.idlespinthresh default value adaptive depending on HZ. Otherwise with HZ above 8000 the CPU may never skip timer ticks on idle.
* Author: jhb | Age: 2012-03-08 | Files: 1 | Lines: -0/+11
  Add a new sched_clear_name() method to the scheduler interface to clear the cached name used for KTR_SCHED traces when a thread's name changes. This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's name changes, most commonly via execve().
  MFC after: 2 weeks
* Author: mav | Age: 2012-03-03 | Files: 1 | Lines: -5/+7
  Fix a bug in r232207, where cpu_search() could prefer the CPU group with the best load even when it had no CPU matching the given limitations. It caused kernel panics in some cases when a thread was bound to specific CPUs with cpuset(1).
* Author: mav | Age: 2012-02-27 | Files: 1 | Lines: -148/+182
  Rework CPU load balancing in SCHED_ULE:
  - In sched_pickcpu() be more careful taking the previous CPU on SMT systems. Do it only if all other logical CPUs of that physical one are idle, to avoid extra resource sharing.
  - In sched_pickcpu() change the general logic of CPU selection. First look for an idle CPU sharing the last level cache with the previously used one, skipping SMT CPU groups. If none is found, search all CPUs for the least loaded one where the thread with its priority can run now. If none is found, search just for the least loaded CPU.
  - Make cpu_search() compare lowest/highest CPU load when comparing CPU groups with equal load. That allows it to differentiate 1+1 and 2+0 loads.
  - Make cpu_search() prefer the specified (previous) CPU or group if load is equal. This improves cache affinity for more complicated topologies.
  - Randomize CPU selection if the above factors are equal. The previous code tended to prefer CPUs with lower IDs, causing unneeded collisions.
  - Rework the periodic balancer in sched_balance_group(). With cpu_search() more intelligent now, make the balancing process flat, removing recursion over the topology tree. That fixes the double swap problem and makes load distribution more even and predictable.
  All together this gives a 10-15% performance improvement in many tests on CPUs with SMT, such as Core i7, when the number of threads is less than the number of logical CPUs. In some tests it also has a positive effect on systems without SMT.
  Reviewed by: jeff
  Tested by: flo, hackers@
  MFC after: 1 month
  Sponsored by: iXsystems, Inc.
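Note (illustrative sketch, not part of the commit): two of the selection rules listed above, least load first and the previous CPU as tie-breaker, can be expressed over a flat array. `toy_pick_cpu()` and its parameters are hypothetical. The sketch ignores topology, priorities and the randomization step, so it is far simpler than the scheduler's own search.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick the least loaded CPU among those set in 'allowed'.
 * On equal load, prefer 'prev' (the CPU the thread last ran on).
 * Returns -1 when no CPU is allowed, rather than a CPU outside the mask.
 */
int toy_pick_cpu(const int *load, int ncpu, unsigned long allowed, int prev)
{
    assert(load != NULL && ncpu > 0 && ncpu <= (int)(sizeof(allowed) * 8));
    int best = -1;

    for (int cpu = 0; cpu < ncpu; cpu++) {
        if ((allowed & (1UL << cpu)) == 0)
            continue;
        if (best < 0 || load[cpu] < load[best] ||
            (load[cpu] == load[best] && cpu == prev))
            best = cpu;
    }
    return best;
}
```

With loads {2, 1, 1, 3}, CPUs 1 and 2 tie. The function returns CPU 2 when the thread last ran there and CPU 1 otherwise, which is the cache-affinity preference the commit message describes.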
* Author: jhb | Age: 2012-01-03 | Files: 1 | Lines: -2/+2
  Some small fixes to CPU accounting for threads:
  - Only initialize the per-cpu switchticks and switchtime in sched_throw() for the very first context switch on APs during boot. This avoids a small gap between the middle of thread_exit() and sched_throw() where time is not accounted to any thread.
  - In thread_exit(), update the timestamp bookkeeping to track the changes to mi_switch() introduced by td_rux so that the code once again matches the comment claiming it is mimicking mi_switch(). Specifically, only update the per-thread stats directly and depend on ruxagg() to update p_rux rather than adjusting p_rux directly. While here, move the timestamp bookkeeping as late in the function as possible.
  Reviewed by: bde, kib
  MFC after: 1 week
* Author: jhb | Age: 2011-12-29 | Files: 1 | Lines: -1/+2
  Cap the priority calculated from the current thread's running tick count at SCHED_PRI_RANGE to prevent overflows in the priority value. This can happen due to irregularities with clock interrupts under certain virtualization environments.
  Tested by: Larry Rosenman ler lerctr org
  MFC after: 2 weeks
* Author: avg | Age: 2011-12-19 | Files: 1 | Lines: -2/+2
  ule: ensure that batch timeshare threads are scheduled fairly
  With the previous code, if the range of priorities for timeshare batch threads was greater than RQ_NQS, then the threads with low priorities in the part of the range above RQ_NQS would be scheduled to the run-queues as if they had high priorities at the beginning of the range. In other words, threads with a nice level of +N could be scheduled as if they had a nice level of -M.
  Reported by: George Mitchell <george@m5p.com>
  Reviewed by: jhb
  Tested by: George Mitchell <george@m5p.com> (earlier version)
  MFC after: 1 week
* Author: marius | Age: 2011-10-06 | Files: 1 | Lines: -4/+9
  - Currently, sched_balance_pair() may cause a CPU to send an IPI_PREEMPT to itself, which sparc64 hardware doesn't support. One way to solve this would be to directly call sched_preempt() instead of issuing a self-IPI. However, quoting jhb@:
    "On the other hand, you can probably just skip the IPI entirely if we are going to send it to the current CPU. Presumably, once this routine finishes, the current CPU will exit softclock (or will do so "soon") and will then pick the next thread to run based on the adjustments made in this routine, so there's no need to IPI the CPU running this routine anyway. I think this is the better solution. Right now what is probably happening on other platforms is as soon as this routine finishes the CPU processes its self-IPI and causes mi_switch() which will just switch back to the softclock thread it is already running."
  - With r226054 and the above change in place, sparc64 is no longer incompatible with ULE and vice versa. However, powerpc/E500 still is.
  Submitted by: jhb
  Reviewed by: jeff
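Note (illustrative sketch, not part of the commit): skipping the self-IPI amounts to a single comparison before the send. The counter `toy_ipi_sent` and both functions are hypothetical stand-ins for a real inter-processor interrupt path.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts the interrupts that would have been sent; stands in for hardware. */
int toy_ipi_sent;

void toy_send_ipi(int cpu)
{
    (void)cpu;
    toy_ipi_sent++;
}

/*
 * Ask 'target' to reschedule. When the target is the CPU running this
 * code, nothing is sent: it will pick its next thread on its own once
 * the current routine returns.
 */
bool toy_notify_cpu(int target, int self)
{
    assert(target >= 0 && self >= 0);
    if (target == self)
        return false;
    toy_send_ipi(target);
    return true;
}
```

This follows the reasoning quoted from jhb@: an interrupt to the current CPU would only cause a switch back to the thread that is already running.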
* Author: delphij | Age: 2011-08-26 | Files: 1 | Lines: -2/+2
  Fix format strings for KTR_STATE in 4BSD and ULE schedulers.
  Submitted by: Ivan Klymenko <fidaj@ukr.net>
  PR: kern/159904, kern/159905
  MFC after: 2 weeks
  Approved by: re (kib)
* Author: attilio | Age: 2011-07-19 | Files: 1 | Lines: -1/+1
  Remove explicit MAXCPU usage from sys/pcpu.h avoiding a namespace pollution. That is a step further in the direction of building correct policies for userland and modules on how to deal with the number of maxcpus at runtime.
  Reported by: jhb
  Reviewed and tested by: pluknet
  Approved by: re (kib)
* Author: attilio | Age: 2011-05-05 | Files: 1 | Lines: -4/+5
  Commit the support for removing cpumask_t and replacing it directly with cpuset_t objects. That is going to offer the underlying support for a simple bump of MAXCPU and then support for a number of cpus > 32 (as it is today).
  Right now, cpumask_t is an int, 32 bits on all our supported architectures. cpuset_t, on the other hand, is implemented as an array of longs, and is easily extensible by definition.
  The architectures touched by this commit are the following:
  - amd64
  - i386
  - pc98
  - arm
  - ia64
  - XEN
  while the others are still missing. Userland is believed to be fully converted with the changes contained here.
  Some technical notes:
  - This commit may be considered an ABI nop for all the architectures different from amd64 and ia64 (and sparc64 in the future)
  - per-cpu members, which are now converted to cpuset_t, need to be accessed avoiding migration, because the size of cpuset_t should be considered unknown
  - the size of cpuset_t objects differs between kernel and userland (this is primarily done in order to leave some more space in userland to cope with KBI extensions). If you need to access kernel cpuset_t from userland please refer to the example in this patch on how to do that correctly (kgdb may be a good source, for example).
  - Support for other architectures is going to be added soon
  - Only MAXCPU for amd64 is bumped now
  The patch has been tested by sbruno and Nicholas Esborn on opteron 4 x 12 pack CPUs. More testing on big SMP is expected to come soon. pluknet tested the patch with his 8-way systems on both amd64 and i386.
  Tested by: pluknet, sbruno, gianni, Nicholas Esborn
  Reviewed by: jeff, jhb, sbruno
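Note (illustrative sketch, not part of the commit): the reason an array of longs scales past 32 CPUs while a single int does not can be shown with a small bit set. `toy_cpuset_t`, `TOY_MAXCPU` and the helper functions are hypothetical and only loosely modelled on the idea described above; they are not the kernel's cpuset_t API.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAXCPU 256 /* arbitrary limit for this example */
#define TOY_BITS   (sizeof(long) * CHAR_BIT)
#define TOY_WORDS  ((TOY_MAXCPU + TOY_BITS - 1) / TOY_BITS)

/* A CPU set stored as an array of longs, one bit per CPU. */
typedef struct {
    unsigned long bits[TOY_WORDS];
} toy_cpuset_t;

void toy_cpu_zero(toy_cpuset_t *s)
{
    memset(s, 0, sizeof(*s));
}

void toy_cpu_set(int cpu, toy_cpuset_t *s)
{
    assert(cpu >= 0 && cpu < TOY_MAXCPU);
    s->bits[cpu / TOY_BITS] |= 1UL << (cpu % TOY_BITS);
}

bool toy_cpu_isset(int cpu, const toy_cpuset_t *s)
{
    assert(cpu >= 0 && cpu < TOY_MAXCPU);
    return ((s->bits[cpu / TOY_BITS] >> (cpu % TOY_BITS)) & 1UL) != 0;
}
```

Raising the CPU limit in this model only changes `TOY_MAXCPU` and therefore the array length, whereas a 32-bit mask has no room for CPU 40 at all. Because the object's size depends on that limit, code should not assume it fits in a register, which is the concern behind the technical notes above.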
* Author: fabient | Age: 2011-03-31 | Files: 1 | Lines: -1/+2
  Clearing the flag when preempting will let the preempted thread run for too much time. This can end in a scheduler deadlock with ping-pong between two threads.
  One sample of this is:
  - device lapic (to have a preemption point on critical_exit())
  - options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
  - running a cpu intensive task (that does not enter the kernel)
  - only one CPU on SMP or no SMP.
  As requested by jhb@, 4BSD has received the same type of fix instead of propagating the flag to the new thread.
  Reviewed by: jhb, jeff
  MFC after: 1 month
* Author: jhb | Age: 2011-01-14 | Files: 1 | Lines: -4/+12
  Rework realtime priority support:
  - Move the realtime priority range up above kernel sleep priorities and just below interrupt thread priorities.
  - Contract the interrupt and kernel sleep priority ranges a bit so that the timesharing priority band can be increased. The new timeshare range is now slightly larger than the old realtime + timeshare ranges.
  - Change the ULE scheduler to no longer use realtime priorities for interactive threads. Instead, the larger timeshare range is now split into separate subranges for interactive and non-interactive ("batch") threads. The end result is that interactive threads and non-interactive threads still use the same priority ranges as before, but realtime threads now have a separate, dedicated priority range.
  - Do not modify the priority of non-timeshare threads in sched_sleep() or via cv_broadcastpri(). Realtime and idle priority threads will no longer have their priorities affected by sleeping in the kernel.
  Reviewed by: jeff
* Author: jhb | Age: 2011-01-13 | Files: 1 | Lines: -16/+25
  Introduce two new helper macros to define the priority ranges used for interactive timeshare threads (PRI_*_INTERACTIVE) and non-interactive timeshare threads (PRI_*_BATCH) and use these instead of PRI_*_REALTIME and PRI_*_TIMESHARE. No functional change.
  Reviewed by: jeff
* Author: jhb | Age: 2011-01-11 | Files: 1 | Lines: -2/+2
  Always use PRI_BASE() when checking the base type of a thread's priority class.
  MFC after: 2 weeks
* Author: jhb | Age: 2011-01-10 | Files: 1 | Lines: -3/+3
  Fix two harmless off-by-one errors.
  Reviewed by: jeff
  MFC after: 2 weeks
* Author: jhb | Age: 2011-01-06 | Files: 1 | Lines: -3/+5
  - Move sched_fork() later in fork() after the various sections of the new thread and proc have been copied and zeroed from the old thread and proc. Otherwise attempts to modify thread or process data in sched_fork() could be undone.
  - Don't copy td_{base,}_user_pri from the old thread to the new thread in sched_fork_thread() in ULE. This is already done courtesy the bcopy() of the thread copy region.
  - Always initialize the real priority (td_priority) of new threads to the new thread's base priority (td_base_pri) to avoid bogusly inheriting a borrowed priority from the parent thread.
  MFC after: 2 weeks
* Author: davidxu | Age: 2010-12-29 | Files: 1 | Lines: -17/+5
  - Follow r216313: sched_unlend_user_prio is no longer needed, always use sched_lend_user_prio to set lent priority.
  - Improve the pthread priority-inherit mutex: when a contender's priority is lowered, repropagate priorities. This may cause the mutex owner's priority to be lowered; in the old code, the mutex owner's priority was rise-only.
* Author: davidxu | Age: 2010-12-09 | Files: 1 | Lines: -10/+11
  MFp4:
  It is possible for a lower priority thread to lend priority to a higher priority thread. The old code ignored this, but the lending should always be recorded. Add the field td_lend_user_pri to fix the problem; if a thread does not have a borrowed priority, its value is PRI_MAX.
  MFC after: 1 week
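Note (illustrative sketch, not part of the commit): recording the lent priority separately from the base priority, as described above, can be modelled as follows. The structure, the function names and the value 255 for the weakest priority are assumptions of this example. As in FreeBSD, a numerically smaller value is treated as the stronger priority.

```c
#include <assert.h>

#define TOY_PRI_MAX 255 /* numerically largest means weakest; placeholder value */

/* Hypothetical subset of a thread's user-priority fields. */
struct toy_td {
    int base_user_pri; /* the thread's own user priority */
    int lend_user_pri; /* priority lent by others, TOY_PRI_MAX if none */
};

void toy_td_init(struct toy_td *td, int base)
{
    assert(base >= 0 && base <= TOY_PRI_MAX);
    td->base_user_pri = base;
    td->lend_user_pri = TOY_PRI_MAX; /* nothing borrowed yet */
}

/* Always record the lent priority, even when it is weaker than the base. */
void toy_lend_user_prio(struct toy_td *td, int prio)
{
    td->lend_user_pri = prio;
}

/* Effective user priority: the stronger (numerically smaller) of the two. */
int toy_user_pri(const struct toy_td *td)
{
    return td->lend_user_pri < td->base_user_pri ?
        td->lend_user_pri : td->base_user_pri;
}
```

Because the lent value is stored even when it does not currently win, a later change to the base priority can be resolved correctly without having lost track of the loan.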
* Author: trasz | Age: 2010-11-13 | Files: 1 | Lines: -4/+0
  Remove unused variables.
* Author: attilio | Age: 2010-11-10 | Files: 1 | Lines: -2/+2
  Fix typos.
  Submitted by: gianni
  MFC after: 3 days
* Author: davidxu | Age: 2010-11-01 | Files: 1 | Lines: -9/+0
  Use integer for size of cpuset, as it won't be bigger than INT_MAX. This is requested by bge. Also move the sysctl into file kern_cpuset.c, because it should always be there; it is independent of the thread scheduler.