path: root/sys/kern/sched_ule.c
Commit history, most recent first. Each entry gives the commit message followed by (author, date, files changed, lines removed/added).
* Fix bug in r242852 that prevented CPU from becoming idle if kernel built without SMP support. (mav, 2012-11-15, 1 file, -1/+3)
* Several optimizations to sched_idletd(): (mav, 2012-11-10, 1 file, -18/+35)
  - Do not try to steal load from other CPUs if there were no context switches on this CPU (i.e. it was idle all the time and woke up just for bus mastering or a TLB shootdown). If the current CPU was idle, it is quite unlikely that some other CPU has load to steal. Under high I/O rates, when TLB shootdowns cause numerous CPU wakeups, on a 24-CPU system the load-stealing code may consume up to 25% of all CPU time without giving any benefit.
  - Change the code that implements spinning for load to restart the spin on a context switch. The previous code periodically called cpu_idle() even under high interrupt/context switch rates.
  - Raise the spinning threshold to 10KHz, where it gives at least some effect that may be worth the consumed power.
  Reviewed by: jeff@
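A stand-alone sketch of the first optimization's idea, assuming a hypothetical switch counter and helper names (not the actual sched_ule.c symbols):

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Sketch: only attempt to steal load from other CPUs if this CPU has
     * context-switched since it last went idle.  A wakeup caused only by
     * bus mastering or a TLB shootdown leaves the counter unchanged, so
     * the potentially expensive steal pass is skipped.
     */
    static unsigned idle_snapshot;          /* switch count taken when going idle */

    static bool
    should_try_steal(unsigned cur_switchcnt)
    {
            return (cur_switchcnt != idle_snapshot);
    }

    static void
    going_idle(unsigned cur_switchcnt)
    {
            idle_snapshot = cur_switchcnt;
    }

    int
    main(void)
    {
            going_idle(100);
            printf("%d\n", should_try_steal(100));  /* 0: woke up, nothing ran here */
            printf("%d\n", should_try_steal(103));  /* 1: real switches happened */
            return (0);
    }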
* - Change ULE to use dynamic slice sizes for the timeshare queue in order to further reduce latency for threads in this queue. (jeff, 2012-11-08, 1 file, -10/+48)
  This should help as threads transition from realtime to timeshare. The latency is bound to a max of sched_slice until we have more than sched_slice / 6 threads runnable. Then the min slice is allotted to all threads and latency becomes (nthreads - 1) * min_slice.
  Discussed with: mav
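A stand-alone sketch of the dynamic-slice idea with illustrative constants (not the actual sched_slice values); the dynamic_slice() helper is hypothetical:

    #include <stdio.h>

    #define SCHED_SLICE_SKETCH 12   /* base slice, in ticks (illustrative) */
    #define MIN_DIVISOR         6   /* min slice = base / 6 (illustrative) */

    /*
     * Sketch: divide the base slice among the runnable timeshare threads,
     * but never below a minimum, so worst-case queueing latency is about
     * (nthreads - 1) * min_slice once the minimum is reached.
     */
    static int
    dynamic_slice(int nthreads)
    {
            int min_slice = SCHED_SLICE_SKETCH / MIN_DIVISOR;
            int slice;

            if (nthreads <= 1)
                    return (SCHED_SLICE_SKETCH);
            slice = SCHED_SLICE_SKETCH / nthreads;
            return (slice < min_slice ? min_slice : slice);
    }

    int
    main(void)
    {
            for (int n = 1; n <= 12; n++)
                    printf("runnable=%2d slice=%2d ticks\n", n, dynamic_slice(n));
            return (0);
    }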
* Rework the known mutexes to benefit from staying on their own cache line by using struct mtx_padalign instead of manual frobbing. (attilio, 2012-10-31, 1 file, -3/+2)
  The sole exceptions are the nvme and sfxge drivers, where the author redefined CACHE_LINE_SIZE manually, so they need to be analyzed and dealt with separately.
  Reviewed by: jimharris, alc
* tdq_lock_pair() already does spinlock_enter() so migration is not possible in sched_balance_pair(). Remove redundant sched_pin(). (attilio, 2012-10-30, 1 file, -2/+0)
  Reviewed by: marius, jeff
* Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle. (jimharris, 2012-10-24, 1 file, -1/+6)
  This enables CPU searches (which read tdq_load) to operate independently of any contention on the spinlock. Some scheduler-intensive workloads running on an 8C single-socket SNB Xeon show considerable improvement with this change (2-3% perf improvement, 5-6% decrease in CPU util).
  Sponsored by: Intel
  Reviewed by: jeff
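A compilable sketch of the padding idea; the 64-byte line size and the field layout are assumptions, not the real struct tdq:

    #include <assert.h>
    #include <stdalign.h>
    #include <stddef.h>

    #define CACHE_LINE 64   /* assumed cache line size */

    /*
     * Sketch: give the queue lock its own cache line so contention on the
     * lock does not invalidate the line holding tdq_load, which other CPUs
     * read lock-free while searching for the least loaded CPU.
     */
    struct tdq_sketch {
            alignas(CACHE_LINE) struct {
                    int     lock_word;
                    char    pad[CACHE_LINE - sizeof(int)];
            } tdq_lock;                     /* padded out to a full line */
            volatile int tdq_load;          /* read without taking the lock */
            volatile int tdq_cpu_idle;
    };

    /* The load counter must start past the lock's cache line. */
    static_assert(offsetof(struct tdq_sketch, tdq_load) >= CACHE_LINE,
        "lock and load share a cache line");

    int
    main(void)
    {
            return (0);
    }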
* remove duplicate semicolons where possible. (eadler, 2012-10-22, 1 file, -1/+1)
  Approved by: cperciva
  MFC after: 1 week
* sched_ule: fix inverted condition in reporting of priority lending via ktr. (avg, 2012-09-14, 1 file, -1/+1)
  Reviewed by: kan
  MFC after: 1 week
* Mark the idle threads as non-sleepable and also assert that an idle thread never blocks on a turnstile. (jhb, 2012-08-22, 1 file, -0/+1)
* Some more minor tunings inspired by bde@. (mav, 2012-08-11, 1 file, -18/+22)
* Allow idle threads to steal second threads from other cores on systems with 8 or more cores to improve utilization. (mav, 2012-08-11, 1 file, -6/+0)
  None of my tests on a 2xXeon (2x6x2) system showed any slowdown from the mentioned "excess thrashing". At the same time, in a pbzip2 test with more threads than CPUs I see up to 10% speedup with SMT disabled and up to 5% with SMT enabled. Thinking about thrashing, I tried to limit the stealing to within the same last-level cache, but got only worse results. The present code in any case prefers to steal threads from topologically closer cores.
  Sponsored by: iXsystems, Inc.
* Some minor tunings/cleanups inspired by bde@ after previous commits: (mav, 2012-08-10, 1 file, -30/+40)
  - remove extra dynamic variable initializations;
  - restore (4BSD) and implement (ULE) hogticks variable setting;
  - make sched_rr_interval() more tolerant to options;
  - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more user-friendly wrapper for sched_slice;
  - tune some sysctl descriptions;
  - make some style fixes.
* Rework the r220198 change (by fabient). I believe it solves the problem from the wrong direction. (mav, 2012-08-09, 1 file, -5/+8)
  Before it, if preemption and the end of the time slice happened at the same time, the thread was put to the head of the queue as for preemption only. That could cause a single thread to run for an indefinitely long time. r220198 handles it by not clearing TDF_NEEDRESCHED in case of preemption, but that causes a delayed context switch every time preemption happens, even when not needed.
  Solve the problem by introducing a scheduler-specific thread flag TDF_SLICEEND, set when the thread's time slice is over and it should be put to the tail of the queue. Using the SW_PREEMPT flag for that purpose, as was done before, was just not informative enough to work correctly.
  In my tests this reduces run time deviation by 2-3 times (improves fairness) in cases when several threads share one CPU.
  Reviewed by: fabient
  MFC after: 2 months
  Sponsored by: iXsystems, Inc.
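A stand-alone sketch of the requeue decision the new flag enables; the flag values and the requeue_position() helper are illustrative, not the kernel definitions:

    #include <stdio.h>

    #define SWT_PREEMPT   0x01      /* switched out because of preemption */
    #define TDF_SLICEEND  0x02      /* sketch of the new flag: slice is over */

    enum queue_pos { QUEUE_HEAD, QUEUE_TAIL };

    /*
     * Sketch: a thread whose slice expired goes to the tail of its run
     * queue so that siblings get their turn; a thread that was merely
     * preempted keeps its place at the head.
     */
    static enum queue_pos
    requeue_position(int flags)
    {
            if (flags & TDF_SLICEEND)
                    return (QUEUE_TAIL);
            if (flags & SWT_PREEMPT)
                    return (QUEUE_HEAD);
            return (QUEUE_TAIL);
    }

    int
    main(void)
    {
            printf("%d\n", requeue_position(SWT_PREEMPT));                /* 0: head */
            printf("%d\n", requeue_position(SWT_PREEMPT | TDF_SLICEEND)); /* 1: tail */
            return (0);
    }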
* Let us manage differences between Book-E PowerPC variations, i.e. vendor/implementation-specific vs. the common architecture definition. (raj, 2012-05-27, 1 file, -1/+1)
  Bring in PPC4XX defines (PSL, SPR, TLB). Note the new definitions under BOOKE_PPC4XX are not used in the code yet. This change set is not supposed to affect existing E500 support; it's just another reorg step before bringing in support for E500mc, E5500 and PPC465.
  Obtained from: AppliedMicro, Freescale, Semihalf
* Implement the DTrace sched provider. (rstone, 2012-05-15, 1 file, -1/+35)
  This implementation aims to be compatible with the sched provider implemented by Solaris and its open-source derivatives. Full documentation of the sched provider can be found on Oracle's DTrace wiki pages.
  Note that for compatibility with scripts originally written for Solaris, several probes are defined that will never fire. These probes are defined to fire when Solaris-specific features perform certain actions. As these features are not present in FreeBSD, the probes can never fire.
  Also, I have added two probes that are not defined in Solaris, lend-pri and load-change. These probes have been added to make it possible to collect schedgraph data with DTrace.
  Finally, a few probes are defined in Solaris to take a cpuinfo_t * argument. As it was not immediately clear to me how to translate that to FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.
  Sponsored by: Sandvine Incorporated
  MFC after: 2 weeks
* Microoptimize cpu_search(). (mav, 2012-04-09, 1 file, -24/+28)
  According to profiling, it now makes cpu_search() take 6% of CPU time on hackbench, with its million context switches per second, instead of 8% before.
* Rewrite thread CPU usage percentage math to not depend on periodic calls at HZ rate through the sched_tick() calls from hardclock(). (mav, 2012-03-13, 1 file, -46/+22)
  Potentially this can be used to improve precision, but for now it just removes one more reason to call hardclock() for every HZ tick on every active CPU. SCHED_4BSD never used sched_tick(), but keep it in place for now, as at least SCHED_FBFS, which exists in out-of-tree patches, depends on it.
  MFC after: 1 month
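A stand-alone sketch of on-demand %CPU math under the assumption of a simple windowed counter; the fields are illustrative, not ULE's actual ts_ticks bookkeeping:

    #include <stdio.h>

    /*
     * Sketch: record when the averaging window started and how many ticks
     * the thread actually ran; derive the percentage from the elapsed span
     * only when somebody asks, instead of decaying a counter from every
     * periodic hardclock() tick.
     */
    struct usage_sketch {
            unsigned first_tick;    /* window start, in scheduler ticks */
            unsigned run_ticks;     /* ticks during which the thread ran */
    };

    static int
    pctcpu(const struct usage_sketch *u, unsigned now)
    {
            unsigned span = now - u->first_tick;

            if (span == 0)
                    return (0);
            return ((int)(100ULL * u->run_ticks / span));
    }

    int
    main(void)
    {
            struct usage_sketch u = { 1000, 250 };

            printf("%d%%\n", pctcpu(&u, 2000));     /* ran 250 of 1000 ticks: prints 25% */
            return (0);
    }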
* Make the kern.sched.idlespinthresh default value adaptive depending on HZ. (mav, 2012-03-09, 1 file, -1/+3)
  Otherwise, with HZ above 8000, the CPU may never skip timer ticks when idle.
* Add a new sched_clear_name() method to the scheduler interface to clear the cached name used for KTR_SCHED traces when a thread's name changes. (jhb, 2012-03-08, 1 file, -0/+11)
  This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's name changes, most commonly via execve().
  MFC after: 2 weeks
* Fix a bug in r232207, where cpu_search() could prefer a CPU group with the best load but no CPU matching the given limitations. (mav, 2012-03-03, 1 file, -5/+7)
  It caused kernel panics in some cases when a thread was bound to specific CPUs with cpuset(1).
* Rework CPU load balancing in SCHED_ULE: (mav, 2012-02-27, 1 file, -148/+182)
  - In sched_pickcpu() be more careful taking the previous CPU on SMT systems. Do it only if all other logical CPUs of that physical one are idle, to avoid extra resource sharing.
  - In sched_pickcpu() change the general logic of CPU selection. First look for an idle CPU sharing the last-level cache with the previously used one, skipping SMT CPU groups. If none is found, search all CPUs for the least loaded one where the thread with its priority can run now. If none is found, search just for the least loaded CPU.
  - Make cpu_search() compare lowest/highest CPU load when comparing CPU groups with equal load. That allows differentiating 1+1 and 2+0 loads.
  - Make cpu_search() prefer the specified (previous) CPU or group if load is equal. This improves cache affinity for more complicated topologies.
  - Randomize CPU selection if the above factors are equal. The previous code tended to prefer CPUs with lower IDs, causing unneeded collisions.
  - Rework the periodic balancer in sched_balance_group(). With cpu_search() more intelligent now, make the balancing process flat, removing recursion over the topology tree. That fixes the double swap problem and makes load distribution more even and predictable.
  All together this gives a 10-15% performance improvement in many tests on CPUs with SMT, such as Core i7, when the number of threads is less than the number of logical CPUs. In some tests it also gives a positive effect on systems without SMT.
  Reviewed by: jeff
  Tested by: flo, hackers@
  MFC after: 1 month
  Sponsored by: iXsystems, Inc.
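A stand-alone sketch of the selection order described in the second item above; the flat arrays, struct fields and pick_cpu() helper are illustrative stand-ins, since the real cpu_search()/sched_pickcpu() work over the topology tree and handle SMT groups:

    #include <limits.h>
    #include <stdio.h>

    #define NCPU 8

    struct cpu_state {
            int load;       /* runnable threads on this CPU */
            int llc_id;     /* last-level-cache domain */
            int top_pri;    /* best runnable priority (lower = stronger) */
    };

    static int
    pick_cpu(const struct cpu_state cpu[NCPU], int prev, int td_pri)
    {
            int i, best = -1, best_load = INT_MAX;

            /* 1. An idle CPU sharing the last-level cache with the previous one. */
            for (i = 0; i < NCPU; i++)
                    if (cpu[i].load == 0 && cpu[i].llc_id == cpu[prev].llc_id)
                            return (i);
            /* 2. The least loaded CPU where this thread could run right now. */
            for (i = 0; i < NCPU; i++)
                    if (td_pri < cpu[i].top_pri && cpu[i].load < best_load) {
                            best = i;
                            best_load = cpu[i].load;
                    }
            if (best != -1)
                    return (best);
            /* 3. Fall back to the least loaded CPU overall. */
            for (i = 0; i < NCPU; i++)
                    if (cpu[i].load < best_load) {
                            best = i;
                            best_load = cpu[i].load;
                    }
            return (best);
    }

    int
    main(void)
    {
            struct cpu_state cpus[NCPU] = {
                    {2, 0, 50}, {1, 0, 80}, {3, 1, 40}, {0, 1, 200},
                    {2, 2, 60}, {2, 2, 90}, {1, 3, 70}, {4, 3, 30},
            };

            printf("picked CPU %d\n", pick_cpu(cpus, 2, 100));  /* 3: idle, same LLC as CPU 2 */
            return (0);
    }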
* Some small fixes to CPU accounting for threads: (jhb, 2012-01-03, 1 file, -2/+2)
  - Only initialize the per-cpu switchticks and switchtime in sched_throw() for the very first context switch on APs during boot. This avoids a small gap between the middle of thread_exit() and sched_throw() where time is not accounted to any thread.
  - In thread_exit(), update the timestamp bookkeeping to track the changes to mi_switch() introduced by td_rux so that the code once again matches the comment claiming it is mimicking mi_switch(). Specifically, only update the per-thread stats directly and depend on ruxagg() to update p_rux rather than adjusting p_rux directly. While here, move the timestamp bookkeeping as late in the function as possible.
  Reviewed by: bde, kib
  MFC after: 1 week
* Cap the priority calculated from the current thread's running tick count at SCHED_PRI_RANGE to prevent overflows in the priority value. (jhb, 2011-12-29, 1 file, -1/+2)
  This can happen due to irregularities with clock interrupts under certain virtualization environments.
  Tested by: Larry Rosenman ler lerctr org
  MFC after: 2 weeks
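A tiny sketch of the capping idea with an assumed range constant (the real SCHED_PRI_RANGE is derived from the ULE priority layout):

    #include <stdio.h>

    #define PRI_RANGE_SKETCH 64             /* stand-in for SCHED_PRI_RANGE */

    /*
     * Sketch: clamp the ticks-derived score before it is folded into the
     * priority, so a runaway tick count (e.g. from erratic virtualized
     * clock interrupts) cannot overflow the priority value.
     */
    static int
    ticks_pri_offset(int tick_score)
    {
            if (tick_score > PRI_RANGE_SKETCH - 1)
                    tick_score = PRI_RANGE_SKETCH - 1;
            return (tick_score);
    }

    int
    main(void)
    {
            printf("%d %d\n", ticks_pri_offset(10), ticks_pri_offset(100000));  /* 10 63 */
            return (0);
    }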
* ule: ensure that batch timeshare threads are scheduled fairly. (avg, 2011-12-19, 1 file, -2/+2)
  With the previous code, if the range of priorities for timeshare batch threads was greater than RQ_NQS, then the threads with low priorities in the part of the range above RQ_NQS would be scheduled to the run-queues as if they had high priorities at the beginning of the range. In other words, threads with a nice level of +N could be scheduled as if they had a nice level of -M.
  Reported by: George Mitchell <george@m5p.com>
  Reviewed by: jhb
  Tested by: George Mitchell <george@m5p.com> (earlier version)
  MFC after: 1 week
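An illustration of the wrap-around hazard described above, not of the actual two-line fix; the queue count and both mapping helpers are made up for the example:

    #include <stdio.h>

    #define NQUEUES 64      /* stand-in for RQ_NQS */

    /* Buggy mapping: relative priorities past the queue count wrap to the
     * front, so the weakest threads land in the strongest queues. */
    static int
    map_wrap(int rel_pri)
    {
            return (rel_pri % NQUEUES);
    }

    /* A monotonic mapping never lets a weaker priority land ahead of a
     * stronger one; clamping is the simplest way to show the property. */
    static int
    map_clamp(int rel_pri)
    {
            return (rel_pri < NQUEUES ? rel_pri : NQUEUES - 1);
    }

    int
    main(void)
    {
            printf("wrap:  rel pri 70 -> queue %d (ahead of rel pri 10!)\n", map_wrap(70));
            printf("clamp: rel pri 70 -> queue %d\n", map_clamp(70));
            return (0);
    }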
* - Currently, sched_balance_pair() may cause a CPU to send an IPI_PREEMPT to itself, which sparc64 hardware doesn't support. (marius, 2011-10-06, 1 file, -4/+9)
  One way to solve this would be to directly call sched_preempt() instead of issuing a self-IPI. However, quoting jhb@: "On the other hand, you can probably just skip the IPI entirely if we are going to send it to the current CPU. Presumably, once this routine finishes, the current CPU will exit softlock (or will do so "soon") and will then pick the next thread to run based on the adjustments made in this routine, so there's no need to IPI the CPU running this routine anyway. I think this is the better solution. Right now what is probably happening on other platforms is as soon as this routine finishes the CPU processes its self-IPI and causes mi_switch() which will just switch back to the softclock thread it is already running."
  - With r226054 and the above change in place, sparc64 is now no longer incompatible with ULE and vice versa. However, powerpc/E500 still is.
  Submitted by: jhb
  Reviewed by: jeff
* Fix format strings for KTR_STATE in the 4BSD and ULE schedulers. (delphij, 2011-08-26, 1 file, -2/+2)
  Submitted by: Ivan Klymenko <fidaj@ukr.net>
  PR: kern/159904, kern/159905
  MFC after: 2 weeks
  Approved by: re (kib)
* Remove explicit MAXCPU usage from sys/pcpu.h, avoiding namespace pollution. (attilio, 2011-07-19, 1 file, -1/+1)
  This is a further step toward building correct policies for userland and modules on how to deal with the number of maxcpus at runtime.
  Reported by: jhb
  Reviewed and tested by: pluknet
  Approved by: re (kib)
* Commit the support for removing cpumask_t and replacing it directly with cpuset_t objects. (attilio, 2011-05-05, 1 file, -4/+5)
  This is going to offer the underlying support for a simple bump of MAXCPU and then support for a number of CPUs > 32 (as it is today). Right now, cpumask_t is an int, 32 bits on all our supported architectures. cpuset_t, on the other side, is implemented as an array of longs, and is easily extendible by definition.
  The architectures touched by this commit are the following:
  - amd64
  - i386
  - pc98
  - arm
  - ia64
  - XEN
  while the others are still missing. Userland is believed to be fully converted with the changes contained here.
  Some technical notes:
  - This commit may be considered an ABI nop for all the architectures different from amd64 and ia64 (and sparc64 in the future).
  - Per-cpu members, which are now converted to cpuset_t, need to be accessed avoiding migration, because the size of cpuset_t should be considered unknown.
  - The size of cpuset_t objects differs between kernel and userland (this is primarily done in order to leave some more space in userland to cope with KBI extensions). If you need to access a kernel cpuset_t from userland, please refer to the example in this patch on how to do that correctly (kgdb may be a good source, for example).
  - Support for other architectures is going to be added soon.
  - Only MAXCPU for amd64 is bumped now.
  The patch has been tested by sbruno and Nicholas Esborn on Opteron 4 x 12 pack CPUs. More testing on big SMP is expected to come soon. pluknet tested the patch with his 8-ways on both amd64 and i386.
  Tested by: pluknet, sbruno, gianni, Nicholas Esborn
  Reviewed by: jeff, jhb, sbruno
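A stand-alone sketch of the "array of longs" representation and why it scales by changing a single constant; the names are illustrative, not the kernel's cpuset macros:

    #include <limits.h>
    #include <stdio.h>

    #define MAXCPU_SKETCH 256                       /* grows by editing one constant */
    #define LONG_BITS     (sizeof(long) * CHAR_BIT)

    /* Sketch of a cpuset as an array of longs rather than a single int mask. */
    typedef struct {
            long bits[(MAXCPU_SKETCH + LONG_BITS - 1) / LONG_BITS];
    } cpuset_sketch_t;

    static void
    cpu_set_bit(cpuset_sketch_t *s, int cpu)
    {
            s->bits[cpu / LONG_BITS] |= 1UL << (cpu % LONG_BITS);
    }

    static int
    cpu_is_set(const cpuset_sketch_t *s, int cpu)
    {
            return ((s->bits[cpu / LONG_BITS] >> (cpu % LONG_BITS)) & 1);
    }

    int
    main(void)
    {
            cpuset_sketch_t set = { { 0 } };

            cpu_set_bit(&set, 200);                 /* impossible with a 32-bit mask */
            printf("%d %d\n", cpu_is_set(&set, 200), cpu_is_set(&set, 3));  /* 1 0 */
            return (0);
    }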
* Clearing the flag when preempting will let the preempted thread run for too much time. This can end in a scheduler deadlock with ping-pong between two threads. (fabient, 2011-03-31, 1 file, -1/+2)
  One example of this is:
  - device lapic (to have a preemption point on critical_exit())
  - options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
  - running a cpu intensive task (that does not enter the kernel)
  - only one CPU on SMP or no SMP.
  As requested by jhb@, 4BSD has received the same type of fix instead of propagating the flag to the new thread.
  Reviewed by: jhb, jeff
  MFC after: 1 month
* Rework realtime priority support: (jhb, 2011-01-14, 1 file, -4/+12)
  - Move the realtime priority range up above kernel sleep priorities and just below interrupt thread priorities.
  - Contract the interrupt and kernel sleep priority ranges a bit so that the timesharing priority band can be increased. The new timeshare range is now slightly larger than the old realtime + timeshare ranges.
  - Change the ULE scheduler to no longer use realtime priorities for interactive threads. Instead, the larger timeshare range is now split into separate subranges for interactive and non-interactive ("batch") threads. The end result is that interactive threads and non-interactive threads still use the same priority ranges as before, but realtime threads now have a separate, dedicated priority range.
  - Do not modify the priority of non-timeshare threads in sched_sleep() or via cv_broadcastpri(). Realtime and idle priority threads will no longer have their priorities affected by sleeping in the kernel.
  Reviewed by: jeff
* Introduce two new helper macros to define the priority ranges used for interactive timeshare threads (PRI_*_INTERACTIVE) and non-interactive timeshare threads (PRI_*_BATCH) and use these instead of PRI_*_REALTIME and PRI_*_TIMESHARE. (jhb, 2011-01-13, 1 file, -16/+25)
  No functional change.
  Reviewed by: jeff
* Always use PRI_BASE() when checking the base type of a thread's priority class. (jhb, 2011-01-11, 1 file, -2/+2)
  MFC after: 2 weeks
* Fix two harmless off-by-one errors. (jhb, 2011-01-10, 1 file, -3/+3)
  Reviewed by: jeff
  MFC after: 2 weeks
* - Move sched_fork() later in fork(), after the various sections of the new thread and proc have been copied and zeroed from the old thread and proc. Otherwise attempts to modify thread or process data in sched_fork() could be undone. (jhb, 2011-01-06, 1 file, -3/+5)
  - Don't copy td_{base,}_user_pri from the old thread to the new thread in sched_fork_thread() in ULE. This is already done courtesy of the bcopy() of the thread copy region.
  - Always initialize the real priority (td_priority) of new threads to the new thread's base priority (td_base_pri) to avoid bogusly inheriting a borrowed priority from the parent thread.
  MFC after: 2 weeks
* - Following r216313, sched_unlend_user_prio is no longer needed; always use sched_lend_user_prio to set the lent priority. (davidxu, 2010-12-29, 1 file, -17/+5)
  - Improve the pthread priority-inherit mutex: when a contender's priority is lowered, repropagate priorities; this may cause the mutex owner's priority to be lowered. In the old code, the mutex owner's priority was raise-only.
* MFp4: (davidxu, 2010-12-09, 1 file, -10/+11)
  It is possible for a lower-priority thread to lend priority to a higher-priority thread; in the old code this was ignored, but the lending should always be recorded. Add the field td_lend_user_pri to fix the problem; if a thread has no borrowed priority, its value is PRI_MAX.
  MFC after: 1 week
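A compilable sketch of the recording rule, with stand-in names and a stand-in PRI_MAX (lower numeric value means stronger priority, as in the kernel):

    #include <stdio.h>

    #define PRI_MAX_SKETCH 255      /* stand-in for "nothing borrowed" */

    struct thread_sketch {
            int base_user_pri;      /* thread's own user priority */
            int lend_user_pri;      /* always recorded; PRI_MAX if none */
    };

    /* The effective user priority is the stronger (numerically lower) of
     * the thread's own priority and whatever has been lent to it. */
    static int
    effective_user_pri(const struct thread_sketch *td)
    {
            return (td->lend_user_pri < td->base_user_pri ?
                td->lend_user_pri : td->base_user_pri);
    }

    int
    main(void)
    {
            struct thread_sketch td = { 120, PRI_MAX_SKETCH };

            printf("%d\n", effective_user_pri(&td));    /* 120: nothing lent */
            td.lend_user_pri = 90;                      /* stronger priority lent */
            printf("%d\n", effective_user_pri(&td));    /* 90 */
            td.lend_user_pri = 200;                     /* weaker lender: recorded, no effect */
            printf("%d\n", effective_user_pri(&td));    /* 120 */
            return (0);
    }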
* Remove unused variables. (trasz, 2010-11-13, 1 file, -4/+0)
* Fix typos. (attilio, 2010-11-10, 1 file, -2/+2)
  Submitted by: gianni
  MFC after: 3 days
* Use an integer for the size of the cpuset, as it won't be bigger than INT_MAX. (davidxu, 2010-11-01, 1 file, -9/+0)
  This was requested by bde. Also move the sysctl into kern_cpuset.c, because it should always be there; it is independent of the thread scheduler.
* Add sysctl kern.sched.cpusetsize to export the size of the kernel cpuset, and add the sysconf() key _SC_CPUSET_SIZE to get the sysctl value. (davidxu, 2010-10-29, 1 file, -0/+11)
  Submitted by: gcooper
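A userland usage sketch for FreeBSD reading the sysctl named in this commit; it assumes the value is an int (per the follow-up commit above) and that the sysctl keeps this name on the running system:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            int setsize;
            size_t len = sizeof(setsize);

            /* Read the kernel cpuset size exported by this commit. */
            if (sysctlbyname("kern.sched.cpusetsize", &setsize, &len, NULL, 0) == -1) {
                    perror("sysctlbyname(kern.sched.cpusetsize)");
                    return (1);
            }
            printf("kern.sched.cpusetsize = %d\n", setsize);

            /* The same value is also meant to be reachable via the new sysconf() key. */
            printf("sysconf(_SC_CPUSET_SIZE) = %ld\n", sysconf(_SC_CPUSET_SIZE));
            return (0);
    }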
* Comment nit, set TDF_NEEDRESCHED after the comment describing why it is done rather than before. (jhb, 2010-09-21, 1 file, -1/+1)
  MFC after: 1 week
* kern.sched.topology_spec sysctl: use a step of 1 for group level numbering. (avg, 2010-09-18, 1 file, -1/+1)
  This is just a cosmetic change for prettier output. The 'indent' variable/parameter serves two purposes: it specifies the whitespace indentation level and also implies the cpu group level/depth. It would have been better to split those two uses, but for now just a simple change.
  MFC after: 1 week
* Refactor timer management code with priority to one-shot operation mode. (mav, 2010-09-13, 1 file, -4/+4)
  The main goal of this is to generate timer interrupts only when there is some work to do. When the CPU is busy, interrupts are generated at the full rate of hz + stathz to fulfill scheduler and timekeeping requirements. But when the CPU is idle, only the minimum set of interrupts needed to handle scheduled callouts (down to 8 interrupts per second per CPU now) is executed. This allows a significant increase in idle CPU sleep time, increasing the effect of static power-saving technologies. It should also reduce host CPU load on virtualized systems when the guest system is idle.
  There is a set of tunables, also available as writable sysctls, allowing control of the desired event timer subsystem behavior:
  - kern.eventtimer.timer - allows choosing the event timer hardware to use. On x86 there are up to 4 different kinds of timers. Depending on whether the chosen timer is per-CPU, the behavior of the other options differs slightly.
  - kern.eventtimer.periodic - allows choosing between periodic and one-shot operation mode. In periodic mode, the current timer hardware is taken as the only source of time for time events. This mode is quite similar to the previous kernel behavior. One-shot mode instead uses the currently selected time counter hardware to schedule all needed events one by one and programs the timer to generate an interrupt at exactly the specified time. The default value depends on the chosen timer's capabilities, but one-shot mode is preferred, unless the other is forced by the user or the hardware.
  - kern.eventtimer.singlemul - in periodic mode specifies how many times higher the timer frequency should be, so as not to strictly alias hardclock() and statclock() events. Default values are 2 and 4, but they could be reduced to 1 if extra interrupts are unwanted.
  - kern.eventtimer.idletick - makes each CPU receive every timer interrupt independently of whether it is busy or not. By default this option is disabled. If the chosen timer is per-CPU and runs in periodic mode, this option has no effect - all interrupts are generated.
  As this patch modifies cpu_idle() on some platforms, I have also refactored the x86 one. It now makes use of the MONITOR/MWAIT instructions (if supported) under high sleep/wakeup rates, as a fast alternative to other methods. It allows the SMP scheduler to wake up sleeping CPUs much faster without using an IPI, significantly increasing performance on some highly task-switching loads.
  Tested by: many (on i386, amd64, sparc64 and powerpc)
  H/W donated by: Gheorghe Ardelean
  Sponsored by: iXsystems, Inc.
* Do not IPI a CPU that is already spinning for load. (mav, 2010-09-10, 1 file, -4/+11)
  It doubles the effect of spinning (compared to MWAIT) on some heavily switching test loads.
* Fix UP build. (mdf, 2010-09-02, 1 file, -0/+2)
  MFC after: 2 weeks
* Fix a bug with sched_affinity() where it checks td_pinned of another thread in a racy manner, which can lead to attempting to migrate a thread that is pinned to a CPU. (mdf, 2010-09-01, 1 file, -11/+13)
  Instead, have sched_switch() determine which CPU a thread should run on if the current one is not allowed. KASSERT in sched_bind() that the thread is not yet pinned to a CPU. KASSERT in sched_switch() that only migratable threads or those moving due to a sched_bind() are changing CPUs.
  sched_affinity code came from jhb@.
  MFC after: 2 weeks
* Remove unused KTRACE includes. (jhb, 2010-08-19, 1 file, -4/+0)
* Add a new ipi_cpu() function to the MI IPI API that can be used to send an IPI to a specific CPU by its cpuid. (jhb, 2010-08-06, 1 file, -3/+3)
  Replace calls to ipi_selected() that constructed a mask for a single CPU with calls to ipi_cpu() instead. This will matter more in the future when we transition from cpumask_t to cpuset_t for CPU masks, in which case building a CPU mask is more expensive.
  Submitted by: peter, sbruno
  Reviewed by: rookie
  Obtained from: Yahoo! (x86)
  MFC after: 1 month
* A cosmetic change - don't output empty <flags>. (ivoras, 2010-07-15, 1 file, -2/+2)
* Update several places that iterate over CPUs to use CPU_FOREACH(). (jhb, 2010-06-11, 1 file, -4/+2)