path: root/sys/kern/sched_ule.c
Commit message | Author | Age | Files | Lines
* Get rid of struct proc p_sched and struct thread td_sched pointers.
  p_sched is unused. The struct td_sched is always co-allocated with the struct thread, except for thread0. Avoid the useless indirection; instead calculate the td_sched location using simple pointer arithmetic in td_get_sched(9). For thread0, which is statically allocated, create a structure to emulate the layout of the dynamic allocation.
  Reviewed by: jhb (previous version)
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D6711
  Author: kib; Age: 2016-06-05; Files: 1; Lines: -46/+48
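A minimal user-space sketch of the co-allocation idea in the entry above, assuming heavily reduced stand-in structures. The names `thread_alloc`, `thread_alloc_sketch`, `td_get_sched_sketch`, and `thread0_storage_sketch` are hypothetical, and the kernel's real layout and helper may differ:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct thread {
	int td_tid;
};

struct td_sched {
	int ts_cpu;
};

/* One allocation holds the thread followed by its scheduler data. */
struct thread_alloc {
	struct thread td;
	struct td_sched ts;
};

/* Statically allocated analogue of thread0 with the same layout. */
struct thread_alloc thread0_storage_sketch;

/* Locate the scheduler data from the thread pointer, with no stored pointer. */
struct td_sched *
td_get_sched_sketch(struct thread *td)
{
	assert(td != NULL);
	return ((struct td_sched *)((char *)td +
	    offsetof(struct thread_alloc, ts)));
}

/* Allocate a thread together with its scheduler data; the caller frees it. */
struct thread *
thread_alloc_sketch(int tid)
{
	struct thread_alloc *ta = calloc(1, sizeof(*ta));

	if (ta == NULL)
		return (NULL);
	ta->td.td_tid = tid;
	return (&ta->td);
}
```

The static `thread0_storage_sketch` object shows why a wrapper structure is needed for a statically allocated thread: the same offset arithmetic then works for both kinds of allocation.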
* The struct thread td_estcpu member is only used by the 4BSD scheduler.
  Move it to the struct td_sched for 4BSD, removing always present field, otherwise unused for ULE. New scheduler method sched_estcpu() returns the estimation for kinfo_proc consumption. As before, it always returns 0 for ULE.
  Remove sched_tick() scheduler method, unused both by 4BSD and ULE.
  Update locking comment for the 4BSD struct td_sched, copying it from the same comment for ULE.
  Spell MAXPRI as PRI_MAX_TIMESHARE in the 4BSD comment.
  Based on some notes from, and reviewed by: bde
  Sponsored by: The FreeBSD Foundation
  Author: kib; Age: 2016-04-17; Files: 1; Lines: -5/+3
* Summary: Add the interactivity equations to the header comment for our interactivity calculation routine.
  Suggested by: rwatson
  Author: gnn; Age: 2015-08-26; Files: 1; Lines: -0/+15
* kgdb uses td_oncpu to determine if a thread is running and should use a pcb from stoppcbs[] rather than the thread's PCB.
  However, exited threads retained td_oncpu from the last time they ran, and newborn threads had their CPU fields cleared to zero during fork and thread creation since they are in the set of fields zeroed when threads are set up. To fix, explicitly update the CPU fields for exiting threads in sched_throw() to reflect the switch out and reset the CPU fields for new threads in sched_fork_thread() to NOCPU.
  Reviewed by: kib
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D3193
  Author: jhb; Age: 2015-08-03; Files: 1; Lines: -0/+4
* Change the mb() use in the sched_ule tdq_notify() and sched_idletd() to more C11-ish atomic_thread_fence_seq_cst().
  Note that on PowerPC, which currently uses lwsync for mb(), the change actually fixes the missed store/load barrier, intended by r271604 [*].
  Reviewed by: alc
  Noted by: alc [*]
  Sponsored by: The FreeBSD Foundation
  MFC after: 3 weeks
  Author: kib; Age: 2015-07-10; Files: 1; Lines: -2/+2
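The store/load ordering that the entry above relies on can be sketched in portable C11, where `atomic_thread_fence(memory_order_seq_cst)` plays the role of the kernel's atomic_thread_fence_seq_cst(). The variables and function names here are hypothetical stand-ins for tdq_cpu_idle, tdq_load, and the two call sites, not the kernel code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the per-CPU tdq_cpu_idle and tdq_load fields. */
atomic_int cpu_idle_sketch;
atomic_int load_sketch;

/* Idle side: announce idleness, fence, then check for newly queued work. */
bool
idle_may_sleep_sketch(void)
{
	atomic_store_explicit(&cpu_idle_sketch, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* store/load barrier */
	return (atomic_load_explicit(&load_sketch, memory_order_relaxed) == 0);
}

/* Notify side: publish the load, fence, then check whether the CPU is idle. */
bool
notify_needs_wakeup_sketch(void)
{
	atomic_fetch_add_explicit(&load_sketch, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* store/load barrier */
	return (atomic_load_explicit(&cpu_idle_sketch,
	    memory_order_relaxed) != 0);
}
```

With a full fence between each side's store and its subsequent load, at least one side observes the other's store, so a wakeup is not lost. A weaker barrier that does not order stores before later loads would not give that guarantee.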
* Relocate sched_random() within the SMP section.
  Place sched_random nearer to where it's first used: moving the code nearer to where it is used makes the code easier to read and we can reduce the initial "#ifdef SMP" island.
  Reword the comment a little and clean up some whitespace while here.
  Author: pfg; Age: 2015-07-07; Files: 1; Lines: -20/+18
* Use sbuf_new_for_sysctl() instead of plain sbuf_new() to ensure sysctl string returned to userland is NUL-terminated.
  PR: 195668
  Author: ian; Age: 2015-03-14; Files: 1; Lines: -3/+2
* Put back Andy's void for gcc happiness.
  Submitted by: jchandra@
  Author: imp; Age: 2015-02-27; Files: 1; Lines: -1/+1
* Make sched_random() return an unsigned number, and use uint32_t consistently.
  This also matches the per-cpu pointer declaration anyway. This changes the tweak we give to the load from -32..31 to be 0..31 which seems more in line with the rest of the code (- rnd and the -= 64). It should also provide the randomness we need, and may fix a signedness bug in the old code (it isn't clear that the effect was intentional as opposed to sloppy, and the right shift of a signed value is undefined to boot).
  This restores sched_balance() behavior when it used random().
  Differential Revision: https://reviews.freebsd.org/D1981
  Author: imp; Age: 2015-02-27; Files: 1; Lines: -11/+13
* Fix sched_ule on sparc64, gcc complains sched_random is not a correct prototype.
  Sponsored by: The FreeBSD Foundation
  Author: andrew; Age: 2015-02-27; Files: 1; Lines: -1/+1
* sched_random is only called for SMP, only define it there.
  Sponsored by: The FreeBSD Foundation
  Author: andrew; Age: 2015-02-27; Files: 1; Lines: -1/+2
* Create sched_rand() and move the LCG code into that. Call this when we need randomness in ULE. This removes random() call from the rebalance interval code.
  Submitted by: Harrison Grundy
  Differential Revision: https://reviews.freebsd.org/D1968
  Author: imp; Age: 2015-02-27; Files: 1; Lines: -9/+22
* Update the ULE scheduler + thread and kinfo structs to use int for cpuid rather than u_char.
  To try and play nice with the ABI, the u_char CPU ID values are clamped at 254. The new fields now contain the full CPU ID, or -1 for no cpu.
  Differential Revision: D955
  Reviewed by: jhb, kib
  Sponsored by: Norse Corp, Inc.
  Author: adrian; Age: 2014-10-18; Files: 1; Lines: -1/+1
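The clamping described above can be illustrated with a small helper that maps a full `int` CPU ID onto a legacy one-byte field. All names are hypothetical, and the value 255 for "no CPU" in the old field is an assumption of this sketch. Only the clamp at 254 and the new -1 value come from the commit message:

```c
#include <assert.h>

#define NOCPU_SKETCH      (-1)	/* new "no CPU" value in the int field */
#define NOCPU_OLD_SKETCH  255	/* assumed "no CPU" value in the old field */
#define MAXCPU_OLD_SKETCH 254	/* largest CPU ID the old field reports */

/* Convert a full CPU ID to the value stored in the legacy u_char field. */
unsigned char
cpuid_to_old_sketch(int cpu)
{
	assert(cpu >= NOCPU_SKETCH);
	if (cpu == NOCPU_SKETCH)
		return (NOCPU_OLD_SKETCH);
	if (cpu > MAXCPU_OLD_SKETCH)
		return (MAXCPU_OLD_SKETCH);	/* clamp rather than wrap */
	return ((unsigned char)cpu);
}
```

Clamping keeps old consumers from seeing a wrapped, misleading CPU number on machines with more than 255 CPUs.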
* Rephrase r271616 comments.
  Submitted by: alc
  MFC after: 1 month
  Author: mav; Age: 2014-09-17; Files: 1; Lines: -2/+2
* Add comments describing r271604 change.
  MFC after: 3 days
  Author: mav; Age: 2014-09-15; Files: 1; Lines: -0/+12
* Add a couple of memory barriers to serialize tdq_cpu_idle and tdq_load accesses.
  This change fixes transient performance drops in some of my benchmarks, vanishing as soon as I am trying to collect any stats from the scheduler. It looks like reordered access to those variables sometimes caused loss of IPI_PREEMPT, which delayed thread execution until some later interrupt.
  MFC after: 3 days
  Author: mav; Age: 2014-09-14; Files: 1; Lines: -0/+2
* Restore pre-r239157 handling of sched_yield(), when thread time slice was aborted, allowing other threads to run.
  Without this change the thread is just rescheduled again, as illustrated by the provided test tool.
  PR: 192926
  Submitted by: eric@vangyzen.net
  MFC after: 2 weeks
  Author: mav; Age: 2014-08-23; Files: 1; Lines: -1/+2
* Micro-manage clang to get the expected inlining for cpu_search().
  Mark cpu_search_lowest/cpu_search_highest/cpu_search_both as noinline, while cpu_search() gets always_inline. With the attributes set, cpu_search() is inlined in wrappers, and if()s with constant conditionals are optimized.
  On some tests on many-core machine, the hwpmc reported samples for cpu_search*() are reduced from 25% total to 9%.
  Submitted by: "Rang, Anton" <anton.rang@isilon.com>
  MFC after: 1 week
  Author: kib; Age: 2014-07-03; Files: 1; Lines: -6/+8
* Remove write-only local variable.
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Author: kib; Age: 2014-06-08; Files: 1; Lines: -2/+0
* Fix GENERIC build.
  Author: attilio; Age: 2014-03-19; Files: 1; Lines: -0/+1
* - Make runq_steal_from more aggressive. Previously it would examine only a single priority queue. If that queue had a thread or threads which could not be migrated we would fail to steal load. This could cause starvation in situations where cores are idle.
  Submitted by: Doug Kilpatrick <dkilpatrick@isilon.com>
  Tested by: pho
  Reviewed by: mav
  Sponsored by: EMC / Isilon Storage Division
  Author: jeff; Age: 2014-03-08; Files: 1; Lines: -16/+11
* ULE works on Book-E since r258002, so remove statements to the contrary.
  Author: nwhitehorn; Age: 2014-02-01; Files: 1; Lines: -4/+0
* In sys/kern/sched_ule.c, remove static function sched_both(), which is unused since r232207.
  MFC after: 3 days
  Author: dim; Age: 2013-12-25; Files: 1; Lines: -24/+0
* Fix an off-by-one error in r228960. The maximum priority delta provided by SCHED_PRI_TICKS should be SCHED_PRI_RANGE - 1 so that the resulting priority value (before nice adjustment) is between SCHED_PRI_MIN and SCHED_PRI_MAX, inclusive.
  Submitted by: kib
  Reported by: pho
  MFC after: 1 week
  Author: jhb; Age: 2013-12-03; Files: 1; Lines: -1/+1
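The off-by-one above is easy to reproduce in a simplified model. This sketch scales a tick count linearly into a priority band. The constants and the function are hypothetical, and the linear scaling is a simplification of whatever SCHED_PRI_TICKS actually computes:

```c
#include <assert.h>

/* Hypothetical inclusive priority band with 20 distinct values. */
#define PRI_MIN_SKETCH   100
#define PRI_MAX_SKETCH   119
#define PRI_RANGE_SKETCH (PRI_MAX_SKETCH - PRI_MIN_SKETCH + 1)

/* Scale ticks in [0, max_ticks] to a priority using the given maximum delta. */
int
pri_from_ticks_sketch(int ticks, int max_ticks, int max_delta)
{
	assert(max_ticks > 0 && ticks >= 0 && ticks <= max_ticks);
	return (PRI_MIN_SKETCH + ticks * max_delta / max_ticks);
}
```

Passing `PRI_RANGE_SKETCH` as the maximum delta yields 120 for a fully loaded thread, one past the band. Passing `PRI_RANGE_SKETCH - 1` yields 119, which stays inside the inclusive range.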
* dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
  In its stead use the Solaris / illumos approach of emulating '-' (dash) in probe names with '__' (two consecutive underscores).
  Reviewed by: markj
  MFC after: 3 weeks
  Author: avg; Age: 2013-11-26; Files: 1; Lines: -16/+16
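The naming convention above amounts to a simple string mapping. The following sketch shows it as a standalone function. The function name is hypothetical and this is not the SDT implementation, only an illustration of turning each `__` pair into a dash:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy "in" to "out", replacing each pair of consecutive underscores
 * with a single dash. Output is truncated to fit and NUL-terminated.
 */
void
probe_name_sketch(const char *in, char *out, size_t outsz)
{
	size_t len, o = 0;

	assert(in != NULL && out != NULL && outsz > 0);
	len = strlen(in);
	for (size_t i = 0; i < len && o + 1 < outsz; i++) {
		if (in[i] == '_' && in[i + 1] == '_') {
			out[o++] = '-';
			i++;		/* skip the second underscore */
		} else {
			out[o++] = in[i];
		}
	}
	out[o] = '\0';
}
```

For example, an identifier such as `lend__pri` maps to the probe name `lend-pri`, while a single underscore is left untouched.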
* - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invocation for unlocking.
  - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0].
  [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1].
  Sponsored by: EMC / Isilon storage division
  Discussed with: rstone
  [0] Reported by: rstone
  [1] Discussed with: philip
  Author: attilio; Age: 2013-11-25; Files: 1; Lines: -1/+0
* Micro-optimize cpu_search(), allowing compiler to use more efficient inline ffsl() implementation, when it is available, instead of homegrown iteration.
  On dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
  Author: mav; Age: 2013-09-07; Files: 1; Lines: -2/+10
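A minimal sketch of iterating over a CPU bitmask with a find-first-set primitive instead of testing every bit position, as the entry above describes. It uses the GCC/Clang builtin `__builtin_ffsl` as a stand-in for ffsl(). The function name, the single-word mask, and the load array are hypothetical simplifications:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the CPU with the lowest load among those set in "mask",
 * or -1 if the mask is empty. Only set bits are visited.
 */
int
lowest_loaded_sketch(unsigned long mask, const int *load)
{
	int best = -1;

	assert(load != NULL);
	while (mask != 0) {
		int cpu = __builtin_ffsl((long)mask) - 1; /* lowest set bit */

		mask &= mask - 1;	/* clear that bit */
		if (best == -1 || load[cpu] < load[best])
			best = cpu;
	}
	return (best);
}
```

The loop runs once per candidate CPU rather than once per bit in the word, which is where a sparse mask saves time.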
* Point args[0] not at the thread that is ending but at the one that is starting. This is in line with practice in OpenSolaris. Note that this change is only in ULE and not in the 4BSD scheduler. Once this change settles in (MFC timeout has expired) we'll try it out on 4BSD as well.
  PR: 177706
  Submitted by: Tiwei Bie
  MFC after: 1 month
  Author: gnn; Age: 2013-04-15; Files: 1; Lines: -1/+1
* Fix bug in r242852 that prevented CPU from becoming idle if the kernel was built without SMP support.
  Author: mav; Age: 2012-11-15; Files: 1; Lines: -1/+3
* Several optimizations to sched_idletd():
  - Do not try to steal load from other CPUs if there were no context switches on this CPU (i.e. it was idle all the time and woke up just for bus mastering or TLB shootdown). If current CPU was idle, then it is quite unlikely that some other CPU has load to steal. Under high I/O rate, when TLB shootdowns cause numerous CPU wakeups, on 24-CPU system load stealing code may consume up to 25% of all CPU time without giving any benefits.
  - Change code that implements spinning for load to restart spin in case of context switch. Previous code periodically called cpu_idle() even under high interrupt/context switch rate.
  - Raise spinning threshold to 10KHz, where it gives at least some effect that may be worth the consumed power.
  Reviewed by: jeff@
  Author: mav; Age: 2012-11-10; Files: 1; Lines: -18/+35
* - Change ULE to use dynamic slice sizes for the timeshare queue in order to further reduce latency for threads in this queue. This should help as threads transition from realtime to timeshare. The latency is bound to a max of sched_slice until we have more than sched_slice / 6 threads runnable. Then the min slice is allotted to all threads and latency becomes (nthreads - 1) * min_slice.
  Discussed with: mav
  Author: jeff; Age: 2012-11-08; Files: 1; Lines: -10/+48
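One way to read the dynamic slice rule above is as a piecewise function of the number of runnable threads. The sketch below is an interpretation rather than the kernel code: the function name, the divisor constant of 6 applied to the load, and the exact thresholds are assumptions made for illustration:

```c
#include <assert.h>

#define SLICE_MIN_DIVISOR_SKETCH 6	/* assumed from the "/ 6" in the log */

/*
 * Pick a time slice for a queue with "load" runnable threads.
 * A lone thread gets the full slice; as load grows the slice shrinks
 * so a full round still fits in roughly sched_slice ticks, down to
 * a floor of sched_slice / 6.
 */
int
slice_for_load_sketch(int load, int sched_slice)
{
	int slice_min;

	assert(sched_slice > 0 && load >= 0);
	slice_min = sched_slice / SLICE_MIN_DIVISOR_SKETCH;
	if (load <= 1)
		return (sched_slice);
	if (load >= SLICE_MIN_DIVISOR_SKETCH)
		return (slice_min);
	return (sched_slice / load);
}
```

Under this model, once the floor is reached a thread waits at most (nthreads - 1) * min_slice before it runs again, matching the latency expression in the commit message.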
* Rework the known mutexes to benefit from staying on their own cache line by using struct mtx_padalign instead of manual frobbing.
  The sole exceptions are the nvme and sfxge drivers, where the author redefined CACHE_LINE_SIZE manually, so they need to be analyzed and dealt with separately.
  Reviewed by: jimharris, alc
  Author: attilio; Age: 2012-10-31; Files: 1; Lines: -3/+2
* tdq_lock_pair() already does spinlock_enter() so migration is not possible in sched_balance_pair(). Remove redundant sched_pin().
  Reviewed by: marius, jeff
  Author: attilio; Age: 2012-10-30; Files: 1; Lines: -2/+0
* Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.
  This enables CPU searches (which read tdq_load) to operate independently of any contention on the spinlock. Some scheduler-intensive workloads running on an 8C single-socket SNB Xeon show considerable improvement with this change (2-3% perf improvement, 5-6% decrease in CPU util).
  Sponsored by: Intel
  Reviewed by: jeff
  Author: jimharris; Age: 2012-10-24; Files: 1; Lines: -1/+6
* remove duplicate semicolons where possible.
  Approved by: cperciva
  MFC after: 1 week
  Author: eadler; Age: 2012-10-22; Files: 1; Lines: -1/+1
* sched_ule: fix inverted condition in reporting of priority lending via ktr
  Reviewed by: kan
  MFC after: 1 week
  Author: avg; Age: 2012-09-14; Files: 1; Lines: -1/+1
* Mark the idle threads as non-sleepable and also assert that an idle thread never blocks on a turnstile.
  Author: jhb; Age: 2012-08-22; Files: 1; Lines: -0/+1
* Some more minor tunings inspired by bde@.
  Author: mav; Age: 2012-08-11; Files: 1; Lines: -18/+22
* Allow idle threads to steal second threads from other cores on systems with 8 or more cores to improve utilization.
  None of my tests on a 2xXeon (2x6x2) system showed any slowdown from the mentioned "excess thrashing". At the same time, in a pbzip2 test with more threads than CPUs I see up to 10% speedup with SMT disabled and up to 5% with SMT enabled.
  Thinking about thrashing, I tried to limit that stealing to within the same last level cache, but got only worse results. The present code prefers to steal threads from topologically closer cores anyway.
  Sponsored by: iXsystems, Inc.
  Author: mav; Age: 2012-08-11; Files: 1; Lines: -6/+0
* Some minor tunings/cleanups inspired by bde@ after previous commits:
  - remove extra dynamic variable initializations;
  - restore (4BSD) and implement (ULE) hogticks variable setting;
  - make sched_rr_interval() more tolerant to options;
  - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more user-friendly wrapper for sched_slice;
  - tune some sysctl descriptions;
  - make some style fixes.
  Author: mav; Age: 2012-08-10; Files: 1; Lines: -30/+40
* Rework r220198 change (by fabient). I believe it solves the problem from the wrong direction.
  Before it, if preemption and end of time slice happened at the same time, the thread was put at the head of the queue as for preemption only. It could cause a single thread to run for an indefinitely long time. r220198 handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that causes a delayed context switch every time preemption happens, even when not needed.
  Solve the problem by introducing a scheduler-specific thread flag TDF_SLICEEND, set when the thread's time slice is over and it should be put at the tail of the queue. Using the SW_PREEMPT flag for that purpose, as before, is just not informative enough to work correctly.
  In my tests this reduces run time deviation by 2-3 times (improves fairness) in cases when several threads share one CPU.
  Reviewed by: fabient
  MFC after: 2 months
  Sponsored by: iXsystems, Inc.
  Author: mav; Age: 2012-08-09; Files: 1; Lines: -5/+8
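The queueing decision described above can be sketched as a small pure function. The flag values, the enum, and the function name are hypothetical stand-ins. Only the rule itself (a preempted thread goes back to the head of its queue unless its slice has also ended) is taken from the commit message:

```c
/* Hypothetical flag bits standing in for SW_PREEMPT and TDF_SLICEEND. */
#define SW_PREEMPT_SKETCH   0x1
#define TDF_SLICEEND_SKETCH 0x2

enum requeue_pos_sketch { REQUEUE_HEAD, REQUEUE_TAIL };

/*
 * Decide where a switched-out thread re-enters its run queue.
 * Preemption alone keeps its place at the head; an expired slice
 * always sends it to the tail, even if it was also preempted.
 */
enum requeue_pos_sketch
requeue_position_sketch(int sw_flags, int td_flags)
{
	if ((sw_flags & SW_PREEMPT_SKETCH) != 0 &&
	    (td_flags & TDF_SLICEEND_SKETCH) == 0)
		return (REQUEUE_HEAD);
	return (REQUEUE_TAIL);
}
```

Tracking slice expiry in its own flag is what lets the two events be told apart when they coincide, which the switch flag alone could not express.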
* Let us manage differences of Book-E PowerPC variations i.e. vendor / implementation specific vs. the common architecture definition.
  Bring PPC4XX defines (PSL, SPR, TLB). Note the new definitions under BOOKE_PPC4XX are not used in the code yet.
  This change set is not supposed to affect existing E500 support, it's just another reorg step before bringing support for E500mc, E5500 and PPC465.
  Obtained from: AppliedMicro, Freescale, Semihalf
  Author: raj; Age: 2012-05-27; Files: 1; Lines: -1/+1
* Implement the DTrace sched provider. This implementation aims to be compatible with the sched provider implemented by Solaris and its open-source derivatives. Full documentation of the sched provider can be found on Oracle's DTrace wiki pages.
  Note that for compatibility with scripts originally written for Solaris, several probes are defined that will never fire. These probes are defined to fire when Solaris-specific features perform certain actions. As these features are not present in FreeBSD, the probes can never fire.
  Also, I have added two probes that are not defined in Solaris, lend-pri and load-change. These probes have been added to make it possible to collect schedgraph data with DTrace.
  Finally, a few probes are defined in Solaris to take a cpuinfo_t * argument. As it was not immediately clear to me how to translate that to FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.
  Sponsored by: Sandvine Incorporated
  MFC after: 2 weeks
  Author: rstone; Age: 2012-05-15; Files: 1; Lines: -1/+35
* Microoptimize cpu_search().
  According to profiling, it makes it take 6% of CPU time on hackbench with its millions of context switches per second, instead of 8% before.
  Author: mav; Age: 2012-04-09; Files: 1; Lines: -24/+28
* Rewrite thread CPU usage percentage math to not depend on periodic calls with HZ rate through the sched_tick() calls from hardclock().
  Potentially it can be used to improve precision, but now it is just minus one more reason to call hardclock() for every HZ tick on every active CPU. SCHED_4BSD never used sched_tick(), but keep it in place for now, as at least SCHED_FBFS existing in patches out of the tree depends on it.
  MFC after: 1 month
  Author: mav; Age: 2012-03-13; Files: 1; Lines: -46/+22
* Make kern.sched.idlespinthresh default value adaptive depending on HZ.
  Otherwise with HZ above 8000 CPU may never skip timer ticks on idle.
  Author: mav; Age: 2012-03-09; Files: 1; Lines: -1/+3
* Add a new sched_clear_name() method to the scheduler interface to clear the cached name used for KTR_SCHED traces when a thread's name changes. This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's name changes, most commonly via execve().
  MFC after: 2 weeks
  Author: jhb; Age: 2012-03-08; Files: 1; Lines: -0/+11
* Fix bug of r232207, when cpu_search() could prefer CPU group with best load, but with no CPU matching given limitations. It caused kernel panics in some cases when thread was bound to specific CPUs with cpuset(1).
  Author: mav; Age: 2012-03-03; Files: 1; Lines: -5/+7
* Rework CPU load balancing in SCHED_ULE:
  - In sched_pickcpu() be more careful taking previous CPU on SMT systems. Do it only if all other logical CPUs of that physical one are idle to avoid extra resource sharing.
  - In sched_pickcpu() change general logic of CPU selection. First look for idle CPU, sharing last level cache with previously used one, skipping SMT CPU groups. If none found, search all CPUs for the least loaded one, where the thread with its priority can run now. If none found, search just for the least loaded CPU.
  - Make cpu_search() compare lowest/highest CPU load when comparing CPU groups with equal load. That allows differentiating 1+1 and 2+0 loads.
  - Make cpu_search() prefer the specified (previous) CPU or group if load is equal. This improves cache affinity for more complicated topologies.
  - Randomize CPU selection if above factors are equal. Previous code tended to prefer CPUs with lower IDs, causing unneeded collisions.
  - Rework periodic balancer in sched_balance_group(). With cpu_search() more intelligent now, make the balancing process flat, removing recursion over the topology tree. That fixes the double swap problem and makes load distribution more even and predictable.
  All together this gives 10-15% performance improvement in many tests on CPUs with SMT, such as Core i7, when the number of threads is less than the number of logical CPUs. In some tests it also gives a positive effect on systems without SMT.
  Reviewed by: jeff
  Tested by: flo, hackers@
  MFC after: 1 month
  Sponsored by: iXsystems, Inc.
  Author: mav; Age: 2012-02-27; Files: 1; Lines: -148/+182
* Some small fixes to CPU accounting for threads:
  - Only initialize the per-cpu switchticks and switchtime in sched_throw() for the very first context switch on APs during boot. This avoids a small gap between the middle of thread_exit() and sched_throw() where time is not accounted to any thread.
  - In thread_exit(), update the timestamp bookkeeping to track the changes to mi_switch() introduced by td_rux so that the code once again matches the comment claiming it is mimicking mi_switch(). Specifically, only update the per-thread stats directly and depend on ruxagg() to update p_rux rather than adjusting p_rux directly. While here, move the timestamp bookkeeping as late in the function as possible.
  Reviewed by: bde, kib
  MFC after: 1 week
  Author: jhb; Age: 2012-01-03; Files: 1; Lines: -2/+2