path: root/sys/kern/sched_ule.c
Commit log (newest first); each entry below shows the author, date, and diffstat (files changed, -deleted/+added lines), followed by the full commit message.
...
* jeff, 2008-03-12 (1 file, -1/+9):
  - Pass the priority argument from *sleep() into sleepq and down into
    sched_sleep().  This removes extra thread_lock() acquisition and allows
    the scheduler to decide what to do with the static boost.
  - Change the priority arguments to cv_* to match sleepq/msleep/etc. where
    0 means no priority change.  Catch -1 in cv_broadcastpri() and convert
    it to 0 for now.
  - Set a flag when sleeping in a way that is compatible with swapping since
    direct priority comparisons are meaningless now.
  - Add a sysctl to ULE, kern.sched.static_boost, that defaults to on which
    controls the boost behavior.  Turning it off gives better performance in
    some workloads but needs more investigation.
  - While we're modifying sleepq, change signal and broadcast to both return
    with the lock held as the lock was held on enter.
  Reviewed by: jhb, peter
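  For illustration only, a minimal sketch of the sleep-priority convention this
  commit describes: a prio argument of 0 means "no priority change", and the
  static boost is applied only while the kern.sched.static_boost switch is on.
  The types, field names, and function name below are invented stand-ins, not
  the sched_ule.c code.

    /* Invented stand-ins; not the kernel's struct thread or sched_sleep(). */
    struct toy_thread {
        int td_priority;                /* lower numeric value = more important */
    };

    static int static_boost = 1;        /* models the kern.sched.static_boost sysctl */

    static void
    sched_sleep_sketch(struct toy_thread *td, int prio)
    {
        /* prio == 0 means "no priority change", matching the cv_* and msleep() callers. */
        if (prio != 0 && static_boost)
            td->td_priority = prio;     /* apply the static boost for this sleep */
    }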
* jeff, 2008-03-10 (1 file, -25/+10):
  - Fix the invalid priority panics people are seeing by forcing
    tdq_runq_add to select the runq rather than hoping we set it properly
    when we adjusted the priority.  This involves the same number of
    branches as before so should perform identically without the extra
    fragility.
  Tested by: bz
  Reviewed by: bz
* jeff, 2008-03-10 (1 file, -0/+1):
  - Don't rely on a side effect of sched_prio() to set the initial ts_runq
    for thread0.  Set it directly in sched_setup().  This fixes traps on
    boot seen on some machines.
  Reported by: phk
* jeff, 2008-03-10 (1 file, -52/+52):
  Reduce ULE context switch time by over 25%.
  - Only calculate timeshare priorities once per tick or when a thread is
    woken from sleeping.
  - Keep the ts_runq pointer valid after all priority changes.
  - Call tdq_runq_add() directly from sched_switch() without passing in via
    tdq_add().  We don't need to adjust loads or runqs anymore.
  - Sort tdq and ts_sched according to utilization to improve cache behavior.
  Sponsored by: Nokia
* jeff, 2008-03-10 (1 file, -62/+72):
  - Add an implementation of sched_preempt() that avoids excessive IPIs.
  - Normalize the preemption/IPI setting code by introducing
    sched_shouldpreempt() so the logic is identical and not repeated between
    tdq_notify() and sched_setpreempt().
  - In tdq_notify() don't set NEEDRESCHED, as we may not actually own the
    thread lock; this could have caused us to lose td_flags settings.
  - Garbage collect some tunables that are no longer relevant.
* jeff, 2008-03-02 (1 file, -497/+439):
  Add support for the new cpu topology api:
  - When searching for affinity, search backwards in the tree from the last
    cpu we ran on while the thread still has affinity for the group.  This
    can take advantage of knowledge of shared L2 or L3 caches among a group
    of cores.
  - When searching for the least loaded cpu, find it via the least loaded
    path through the tree.  This load balances system bus links, individual
    cache levels, and hyper-threaded/SMT cores.
  - Make the periodic balancer recursively balance the highest and lowest
    loaded cpu across each link.
  Add support for cpusets:
  - Convert the cpuset to a simple native cpumask_t while the kernel still
    only supports cpumask.
  - Pass the derived cpumask down through the cpu_search functions to
    restrict the result cpus.
  - Make the various steal functions resilient to failure since all threads
    cannot run on all cpus any longer.
  General improvements:
  - Precisely track the lowest priority thread on every runq with
    tdq_setlowpri().  Before it was more advisory, but this ended up having
    pathological behaviors.
  - Remove many #ifdef SMP conditions to simplify the code.
  - Get rid of the old cumbersome tdq_group.  This is more naturally
    expressed via the cpu_group tree.
  Sponsored by: Nokia
  Testing by: kris
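  A toy sketch of the "least loaded path through the tree" idea described
  above.  The cpu_group_sketch type, its fields, and the recursion are
  invented for illustration; the kernel's cpu_search code also carries a
  cpumask and more bookkeeping.

    /* Toy topology node; the real struct cpu_group carries a cpumask, child
     * pointers, the cache-share level, and more. */
    struct cpu_group_sketch {
        int load;                               /* aggregate load of this group */
        int cpu;                                /* leaf cpu id (unused for inner nodes) */
        int nchildren;
        const struct cpu_group_sketch *child[4];
    };

    /* Descend toward the least loaded child at every level, so sibling cache
     * domains and SMT pairs are balanced before individual cpus are compared. */
    static int
    least_loaded_cpu_sketch(const struct cpu_group_sketch *cg)
    {
        if (cg->nchildren == 0)
            return (cg->cpu);
        const struct cpu_group_sketch *best = cg->child[0];
        for (int i = 1; i < cg->nchildren; i++)
            if (cg->child[i]->load < best->load)
                best = cg->child[i];
        return (least_loaded_cpu_sketch(best));
    }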
* jeff, 2008-03-02 (1 file, -64/+1):
  - Replace the old smp cpu topology specification with a new, more flexible
    tree structure that encodes the level of cache sharing and other
    properties.
  - Provide several convenience functions for creating one and two level cpu
    trees as well as a default flat topology.  The system now always has
    some topology.
  - On i386 and amd64 create a separate level in the hierarchy for HTT and
    multi-core cpus.  This will allow the scheduler to intelligently load
    balance non-uniform cores.  Presently we don't detect what level of the
    cache hierarchy is shared at each level in the topology.
  - Add a mechanism for testing common topologies that have more information
    than the MD code is able to provide via the kern.smp.topology tunable.
    This should be considered a debugging tool only and not a stable api.
  Sponsored by: Nokia
* jeff, 2008-03-02 (1 file, -0/+5):
  - Add a new sched_affinity() api to be used in the upcoming cpuset
    implementation.
  - Add empty implementations of sched_affinity() to 4BSD and ULE.
  Sponsored by: Nokia
* jeff, 2008-01-23 (1 file, -7/+9):
  - sched_prio() should only adjust tdq_lowpri if the thread is running or
    on a run-queue.  If the priority is numerically raised, only change
    lowpri if we're certain it will be correct.  Some slop is allowed;
    however, previously we could erroneously raise lowpri for an idle cpu
    that a thread had recently run on, which led to errors in load balancing
    decisions.
* jeff, 2008-01-15 (1 file, -4/+5):
  - When executing the 'tryself' branch in sched_pickcpu(), look at the
    lowest priority on the queue for the current cpu vs curthread's
    priority.  In the case that curthread is waking up many threads of a
    lower priority, as would happen with a turnstile_broadcast() or wakeup()
    of many threads, this prevents them from all ending up on the current
    cpu.
  - In sched_add() make the relationship between a scheduled ithread and the
    current cpu advisory rather than strict.  Only give the ithread affinity
    for the current cpu if it's actually being scheduled from a hardware
    interrupt.  This prevents it from migrating when it simply blocks on a
    lock.
  Sponsored by: Nokia
* jeff, 2008-01-05 (1 file, -10/+9):
  - Restore timeslicing code for all but SCHED_FIFO priority classes.
  Reported by: Peter Jeremy <peterjeremy@optushome.com.au>
* wkoszek, 2007-12-21 (1 file, -17/+17):
  Make SCHED_ULE buildable with gcc3.
  Reviewed by: cognet (mentor), jeffr
  Approved by: cognet (mentor), jeffr
* jeff, 2007-12-15 (1 file, -0/+6):
  - Re-implement lock profiling in such a way that it no longer breaks the
    ABI when enabled.  There is no longer an embedded lock_profile_object in
    each lock.  Instead a list of lock_profile_objects is kept per-thread
    for each lock it may own.  The cnt_hold statistic is now always 0 to
    facilitate this.
  - Support shared locking by tracking individual lock instances and
    statistics in the per-thread per-instance lock_profile_object.
  - Make the lock profiling hash table a per-cpu singly linked list with a
    per-cpu static lock_prof allocator.  This removes the need for an array
    of spinlocks and reduces cache contention between cores.
  - Use a separate hash for spinlocks and other locks so that only a
    critical_enter() is required and not a spinlock_enter() to modify the
    per-cpu tables.
  - Count time spent spinning in the lock statistics.
  - Remove the LOCK_PROFILE_SHARED option as it is always supported now.
  - Specifically drop and release the scheduler locks in both schedulers
    since we track owners now.
  In collaboration with: Kip Macy
  Sponsored by: Nokia
* davidxu, 2007-12-11 (1 file, -8/+4):
  Fix LOR of thread lock and umtx's priority propagation mutex due to the
  reworking of scheduler lock.
  MFC after: 3 days
* julian, 2007-11-14 (1 file, -9/+9):
  Generally we are interested in what thread did something, as opposed to
  what process.  Since threads by default have the name of the process
  unless overwritten with more useful information, just print the thread
  name instead.
* grehan, 2007-10-23 (1 file, -1/+1):
  Cut over to ULE on PowerPC.
  kern/sched_ule.c
  - Add __powerpc__ to the list of supported architectures
  powerpc/conf/GENERIC
  - Swap SCHED_4BSD with SCHED_ULE
  powerpc/powerpc/genassym.c
  - Export TD_LOCK field of thread struct
  powerpc/powerpc/swtch.S
  - Handle new 3rd parameter to cpu_switch() by updating the old thread's
    lock.  Note: uniprocessor-only, will require modification for MP
    support.
  powerpc/powerpc/vm_machdep.c
  - Set 3rd param of cpu_switch to mutex of old thread's lock, making the
    call a no-op.
  Reviewed by: marcel, jeffr (slightly older version)
* sam, 2007-10-16 (1 file, -1/+1):
  ULE works fine on arm; allow it to be used.
  Reviewed by: jeff, cognet, imp
  MFC after: 1 week
* jeff, 2007-10-08 (1 file, -8/+14):
  - Bail out of tdq_idled if !smp_started or idle stealing is disabled.
    This fixes a bug on UP machines with SMP kernels where the idle thread
    constantly switches after trying to steal work from the local cpu.
  - Make the idle stealing code more robust against self selection.
  - Prefer to steal from the cpu with the highest load that has at least one
    transferable thread.  Before we selected the cpu with the highest
    transferable count which excludes bound threads.
  Collaborated with: csjp
  Approved by: re
* jeff, 2007-10-08 (1 file, -2/+0):
  - Restore historical sched_yield() behavior by changing sched_relinquish()
    to simply switch rather than lowering priority and switching.  This
    allows threads of equal priority to run but not lesser priority.
  Discussed with: davidxu
  Reported by: NIIMI Satoshi <sa2c@sa2c.net>
  Approved by: re
* jeff, 2007-10-02 (1 file, -4/+6):
  - Reassign the thread queue lock to newtd prior to switching.  Assigning
    after the switch leads to a race where the outgoing thread still owns
    the local queue lock while another cpu may switch it in.  This race is
    only possible on machines where cpu_switch can take significantly longer
    on different cpus which in practice means HTT machines with unfair
    thread scheduling algorithms.
  Found by: kris (of course)
  Approved by: re
* jeff, 2007-10-02 (1 file, -55/+86):
  - Move the rebalancer back into hardclock to prevent potential softclock
    starvation caused by unbalanced interrupt loads.
  - Change the rebalancer to work on stathz ticks but retain randomization.
  - Simplify locking in tdq_idled() to use the tdq_lock_pair() rather than
    complex sequences of locks to avoid deadlock.
  Reported by: kris
  Approved by: re
* jeff, 2007-09-27 (1 file, -2/+10):
  - Honor the PREEMPTION and FULL_PREEMPTION flags by setting the default
    value for kern.sched.preempt_thresh appropriately.  It can still be
    adjusted at runtime.  ULE will still use IPI_PREEMPT in certain
    migration situations.
  - Assert that we're not trying to compile ULE on an unsupported
    architecture.  To date, I believe only i386 and amd64 have implemented
    the third cpu switch argument required.
  Approved by: re
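  A sketch of the shape this commit describes: the default preempt_thresh
  follows the PREEMPTION/FULL_PREEMPTION kernel options and remains a runtime
  tunable.  The specific values used below are assumptions, not the committed
  defaults.

    /* Default follows the kernel options; the values here are placeholders only. */
    #if defined(FULL_PREEMPTION)
    static int preempt_thresh = 255;    /* preempt for any higher-priority thread */
    #elif defined(PREEMPTION)
    static int preempt_thresh = 80;     /* e.g. kernel-priority threads and above */
    #else
    static int preempt_thresh = 0;      /* never preempt */
    #endif
    /* Still adjustable at runtime through the kern.sched.preempt_thresh sysctl. */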
* jeff, 2007-09-24 (1 file, -1/+1):
  - Bound the interactivity score so that it cannot become negative.
  Approved by: re
* jeff, 2007-09-22 (1 file, -5/+13):
  - Improve grammar.  s/it's/its/.
  - Improve the long-term load balancer by always IPIing exactly once.
    Previously the delay after rebalancing could cause problems with uneven
    workloads.
  - Allow nice to have a linear effect on the interactivity score.  This
    allows negatively niced programs to stay interactive longer.  It may be
    useful with very expensive Xorg servers under high loads.  In general it
    should not be necessary to alter the nice level to improve interactive
    response.  We may also want to consider never allowing positively niced
    processes to become interactive at all.
  - Initialize ccpu to 0 rather than 0.0.  The decimal point was leftover
    from when the code was copied from 4bsd.  ccpu is 0 in ULE because ULE
    only exports weighted cpu values.
  Reported by: Steve Kargl (load balancing problem)
  Approved by: re
* jeff, 2007-09-21 (1 file, -7/+5):
  - Redefine p_swtime and td_slptime as p_swtick and td_slptick.  This
    changes the units from seconds to the value of 'ticks' when swapped
    in/out.  ULE does not have a periodic timer that scans all threads in
    the system and as such maintaining a per-second counter is difficult.
  - Change computations requiring the unit in seconds to subtract ticks and
    divide by hz.  This does make the wraparound condition hz times more
    frequent but this is still in the range of several months to years and
    the adverse effects are minimal.
  Approved by: re
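  The "subtract ticks and divide by hz" computation mentioned above, as a
  minimal sketch; ticks and hz stand in for the kernel globals of the same
  names, and the surrounding types are invented for illustration.

    static int ticks;            /* stands in for the kernel's tick counter  */
    static int hz = 1000;        /* stands in for the kernel's hz            */

    struct toy_thread {
        int td_slptick;          /* value of 'ticks' recorded at sleep time  */
    };

    static int
    seconds_asleep(const struct toy_thread *td)
    {
        return ((ticks - td->td_slptick) / hz);   /* subtract ticks, divide by hz */
    }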
* jeff, 2007-09-17 (1 file, -2/+2):
  - Move all of the PS_ flags into either p_flag or td_flags.
  - p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
    previously the sched_lock.  These bugs have existed for some time.
  - Allow swapout to try each thread in a process individually and then
    swapin the whole process if any of these fail.  This allows us to move
    most scheduler related swap flags into td_flags.
  - Keep ki_sflag for backwards compat but change all in source tools to use
    the new and more correct location of P_INMEM.
  Reported by: pho
  Reviewed by: attilio, kib
  Approved by: re (kensmith)
* jeff, 2007-08-20 (1 file, -0/+6):
  - Set steal_thresh to log2(ncpus).  This improves idle-time load balancing
    on 2cpu machines by reducing it to 1 by default.  This improves loaded
    operation on 8cpu machines by increasing it to 3 where the extra idle
    time is not as critical.
  Approved by: re
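  A self-contained illustration of steal_thresh = log2(ncpus): floor(log2(2))
  is 1 and floor(log2(8)) is 3, matching the 2-cpu and 8-cpu cases in the
  commit message.  In the kernel this can come from fls(mp_ncpus) - 1; the
  helper below just keeps the example standalone.

    #include <stdio.h>

    /* floor(log2(n)) for n >= 1. */
    static int
    ilog2(unsigned n)
    {
        int r = 0;

        while (n >>= 1)
            r++;
        return (r);
    }

    int
    main(void)
    {
        /* 2 cpus -> 1 and 8 cpus -> 3, as in the commit message. */
        for (unsigned ncpus = 1; ncpus <= 16; ncpus <<= 1)
            printf("%2u cpus -> steal_thresh %d\n", ncpus, ilog2(ncpus));
        return (0);
    }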
* jeff, 2007-08-04 (1 file, -1/+0):
  - Fix one line that erroneously crept in my last commit.
  Approved by: re
* jeff, 2007-08-03 (1 file, -114/+200):
  - Share scheduler locks between hyper-threaded cores to protect the
    tdq_group structure.  Hyper-threaded cores won't really benefit from
    separate locks anyway.
  - Separate out the migration case from sched_switch to simplify the main
    switch code.  We only migrate here if called via sched_bind().
  - When preempted place the preempted thread back in the same queue at the
    head.
  - Improve the cpu group and topology infrastructure.
  Tested by: many on current@
  Approved by: re
* jeff, 2007-07-19 (1 file, -47/+29):
  - Refine the load balancer to improve buildkernel times on dual core
    machines.
  - Leave the long-term load balancer running by default once per second.
  - Enable stealing load from the idle thread only when the remote processor
    has more than two transferable tasks.  Setting this to one further
    improves buildworld.  Setting it higher improves mysql.
  - Remove the bogus pick_zero option.  I had not intended to commit this.
  - Entirely disallow migration for threads with SRQ_YIELDING set.  This
    balances out the extra migration allowed for with the load balancers.
    It also makes pick_pri perform better as I had anticipated.
  Tested by: Dmitry Morozovsky <marck@rinet.ru>
  Approved by: re
* jeff, 2007-07-19 (1 file, -7/+25):
  - When newtd is specified to sched_switch() it was not being initialized
    properly.  We have to temporarily unlock the TDQ lock so we can lock the
    thread and add it to the run queue.  This is used only for KSE.
  - When we add a thread from the tdq_move() via sched_balance() we need to
    ipi the target if it's sitting in the idle thread or it'll never run.
  Reported by: Rene Landan
  Approved by: re
* jeff, 2007-07-17 (1 file, -548/+916):
  ULE 3.0: Fine grain scheduler locking and affinity improvements.  This has
  been in development for over 6 months as SCHED_SMP.
  - Implement one spin lock per thread-queue.  Threads assigned to a
    run-queue point to this lock via td_lock.
  - Improve the facility for assigning threads to CPUs now that sched_lock
    contention no longer dominates scheduling decisions on larger SMP
    machines.
  - Re-write idle time stealing in an attempt to make it less damaging to
    general performance.  This is still disabled by default.  See
    kern.sched.steal_idle.
  - Call the long-term load balancer from a callout rather than
    sched_clock() so there are no locks held.  This is disabled by default.
    See kern.sched.balance.
  - Parameterize many scheduling decisions via sysctls.  Try to document
    these via sysctl descriptions.
  - General structural and naming cleanups.
  - Document each function with comments.
  Tested by: current@ amd64, x86, UP, SMP.
  Approved by: re
* jeff, 2007-06-15 (1 file, -8/+3):
  - Fix an off by one error in sched_pri_range.
  - In tdq_choose() only assert that a thread does not have too high a
    priority (low value) for the queue we removed it from.  This will catch
    bugs in priority elevation.  It's not a serious error for the thread to
    have too low a priority as we don't change queues in this case as an
    optimization.
  Reported by: kris
* jeff, 2007-06-12 (1 file, -15/+4):
  - Move some common code out of sched_fork_exit() and back into fork_exit().
* jeff, 2007-06-06 (1 file, -1/+1):
  - Placing the 'volatile' on the right side of the * in the td_lock
    declaration removes the need for __DEVOLATILE().
  Pointed out by: tegge
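  The distinction the commit relies on, shown directly: the qualifier binds to
  what precedes it (or, at the start of the declaration, to the pointed-to
  type), so moving volatile to the right of the * qualifies the pointer itself
  rather than the struct mtx it points to.

    struct mtx;                  /* opaque; only pointers are used here */

    volatile struct mtx *a;      /* pointer to volatile struct mtx (old form) */
    struct mtx *volatile b;      /* volatile pointer to plain struct mtx      */

    /* 'b' is the form the commit wants: loads and stores of the pointer itself
     * are volatile, but the object it points to is an ordinary struct mtx, so
     * it can be handed to code expecting 'struct mtx *' with no cast. */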
* jeff, 2007-06-05 (1 file, -1/+1):
  - Better fix for previous error; use DEVOLATILE on the td_lock pointer, as
    it can actually sometimes be something other than sched_lock even on
    schedulers which rely on a global scheduler lock.
  Tested by: kan
* jeff, 2007-06-05 (1 file, -1/+1):
  - Pass &sched_lock as the third argument to cpu_switch() as this will
    always be the correct lock and we don't get volatile warnings this way.
  Pointed out by: kan
* jeff, 2007-06-05 (1 file, -1/+2):
  - Define TDQ_ID() for the !SMP case.
  - Default pick_pri to off.  It is not faster in most cases.
* jeff, 2007-06-04 (1 file, -39/+136):
  Commit 1/14 of sched_lock decomposition.
  - Move all scheduler locking into the schedulers utilizing a technique
    similar to solaris's container locking.
  - A per-process spinlock is now used to protect the queue of threads,
    thread count, suspension count, p_sflags, and other process related
    scheduling fields.
  - The new thread lock is actually a pointer to a spinlock for the
    container that the thread is currently owned by.  The container may be
    a turnstile, sleepqueue, or run queue.
  - thread_lock() is now used to protect access to thread related scheduling
    fields.  thread_unlock() unlocks the lock and thread_set_lock()
    implements the transition from one lock to another.
  - A new "blocked_lock" is used in cases where it is not safe to hold the
    actual thread's lock yet we must prevent access to the thread.
  - sched_throw() and sched_fork_exit() are introduced to allow the
    schedulers to fix-up locking at these points.
  - Add some minor infrastructure for optionally exporting scheduler
    statistics that were invaluable in solving performance problems with
    this patch.  Generally these statistics allow you to differentiate
    between different causes of context switches.
  Tested by: kris, current@
  Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
  Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
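  A rough sketch of the td_lock idea described above: the thread's lock is a
  pointer to whichever container lock currently owns it, and acquiring it
  means chasing that pointer until it is stable.  The toy spinlock and thread
  types below are illustrative, not the kernel's struct mtx or the real
  thread_lock() implementation.

    #include <stdatomic.h>

    /* Toy types standing in for struct mtx and struct thread. */
    struct spinlock { atomic_flag busy; };
    struct toy_thread { struct spinlock *_Atomic td_lock; };

    static void spin_lock(struct spinlock *l)   { while (atomic_flag_test_and_set(&l->busy)) ; }
    static void spin_unlock(struct spinlock *l) { atomic_flag_clear(&l->busy); }

    /* Acquire whatever container lock the thread currently points at, retrying
     * if the pointer moved (run queue -> turnstile -> sleepqueue -> ...)
     * between reading it and acquiring it. */
    static struct spinlock *
    thread_lock_sketch(struct toy_thread *td)
    {
        for (;;) {
            struct spinlock *l = atomic_load(&td->td_lock);
            spin_lock(l);
            if (l == atomic_load(&td->td_lock))
                return (l);      /* pointer is stable: we own the thread's lock */
            spin_unlock(l);      /* the thread migrated; chase the new lock */
        }
    }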
* kmacy, 2007-04-20 (1 file, -2/+1):
  Schedule the ithread on the same cpu as the interrupt.
  Tested by: kmacy
  Submitted by: jeffr
* jeff, 2007-03-17 (1 file, -1/+5):
  - Handle the case where slptime == runtime.
  Submitted by: Antoine Brodin
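  A hedged sketch of the general shape of an interactivity score with the
  slptime == runtime case handled explicitly, as this commit does.  The scale,
  constant names, and exact scaling are assumptions; only the three-way
  comparison is the point.

    #define INTERACT_MAX_SKETCH  100                      /* illustrative scale */
    #define INTERACT_HALF_SKETCH (INTERACT_MAX_SKETCH / 2)

    /* Lower score = more interactive.  Sleeping more than running pulls the
     * score below HALF, running more than sleeping pushes it above HALF, and
     * the final branch is the slptime == runtime case handled by this commit. */
    static int
    interact_score_sketch(unsigned long long runtime, unsigned long long slptime)
    {
        if (runtime > slptime)
            return (INTERACT_HALF_SKETCH + (INTERACT_HALF_SKETCH -
                (int)(slptime * INTERACT_HALF_SKETCH / runtime)));
        if (slptime > runtime)
            return ((int)(runtime * INTERACT_HALF_SKETCH / slptime));
        return (runtime != 0 ? INTERACT_HALF_SKETCH : 0);
    }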
* jeff, 2007-03-17 (1 file, -1/+1):
  - Cast the intermediate value in priority computation back down to
    unsigned char.  Weirdly, casting the 1 constant to u_char still produces
    a signed integer result that is then used in the % computation.  This
    avoids that mess altogether and causes a 0 pri to turn into 255 % 64 as
    we expect.
  Reported by: kkenn (about 4 times, thanks)
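  A standalone demonstration of the promotion problem described above: even
  with u_char operands the subtraction is performed in signed int, so 0 - 1 is
  -1 and -1 % 64 is -1, while casting the intermediate value back to u_char
  gives 255 and 255 % 64 == 63.

    #include <stdio.h>

    int
    main(void)
    {
        unsigned char pri = 0;

        /* Integer promotion turns both u_char operands into signed int, so the
         * subtraction yields -1 and the signed modulo yields -1: a bad index. */
        printf("promoted: %d\n", (pri - (unsigned char)1) % 64);

        /* Casting the intermediate value back down to u_char gives 255, and
         * 255 % 64 == 63: the wrap-around the run queue code expects. */
        printf("as u_char: %d\n", (unsigned char)(pri - 1) % 64);
        return (0);
    }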
* julian, 2007-03-08 (1 file, -1/+1):
  Instead of doing comparisons using the pcpu area to see if a thread is an
  idle thread, just see if it has the IDLETD flag set.  That flag will
  probably move to the pflags word as it's permanent and never changes for
  the life of the system so it doesn't need locking.
* kmacy, 2007-02-26 (1 file, -1/+1):
  General LOCK_PROFILING cleanup.
  - Only collect timestamps when a lock is contested; this reduces the
    overhead of collecting profiles from 20x to 5x.
  - Remove unused function from subr_lock.c.
  - Generalize cnt_hold and cnt_lock statistics to be kept for all locks.
  - NOTE: rwlock profiling generates invalid statistics (and most likely
    always has); someone familiar with that should review.
* jeff, 2007-02-08 (1 file, -4/+4):
  - Change types for recent runq additions to u_char rather than int.
  - Fix these types in ULE as well.  This fixes bugs in priority index
    calculations in certain edge cases.  (int)-1 % 64 != (uint)-1 % 64.
  Reported by: kkenn using pho's stress2.
* jeff, 2007-01-25 (1 file, -11/+23):
  - Implement much more intelligent ipi sending.  This algorithm tries to
    minimize IPIs and rescheduling when scheduling like tasks while keeping
    latency low for important threads.
    1) An idle thread is running.
    2) The current thread is worse than realtime and the new thread is
       better than realtime.  Realtime to realtime doesn't preempt.
    3) The new thread's priority is less than the threshold.
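  The three cases above, restated as a sketch.  The priority cutoffs and the
  should_preempt_sketch() name are illustrative assumptions; lower numeric
  priority is better, as in the kernel.

    #include <stdbool.h>

    #define PRI_IDLE_SKETCH     224     /* illustrative cutoffs; lower = better */
    #define PRI_REALTIME_SKETCH  48
    static int ipi_thresh_sketch = 64;  /* stands in for the runtime threshold  */

    /* Should putting a thread of priority 'pri' on a cpu currently running at
     * priority 'cpri' trigger a preemption/IPI?  Mirrors the three cases above. */
    static bool
    should_preempt_sketch(int pri, int cpri)
    {
        if (cpri >= PRI_IDLE_SKETCH)            /* 1) an idle thread is running */
            return (true);
        if (pri <= PRI_REALTIME_SKETCH &&
            cpri > PRI_REALTIME_SKETCH)         /* 2) realtime beats non-realtime */
            return (true);
        if (pri < ipi_thresh_sketch)            /* 3) below the threshold */
            return (true);
        return (false);
    }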
* jeff, 2007-01-25 (1 file, -21/+28):
  - Get rid of the unused DIDRUN flag.  This was really only present to
    support sched_4bsd.
  - Rename the KTR level for non schedgraph parsed events.  They take event
    space from things we'd like to graph.
  - Reset our slice value after we sleep.  The slice is simply there to
    prevent starvation among equal priorities.  A thread which had almost
    exhausted its slice and then slept doesn't need to be rescheduled a tick
    after it wakes up.
  - Set the maximum slice value to a more conservative 100ms now that it is
    more accurately enforced.
* jeff, 2007-01-24 (1 file, -5/+6):
  - With a sleep time over 2097 seconds hzticks and slptime could end up
    negative.  Use unsigned integers for sleep and run time so this doesn't
    disturb sched_interact_score().  This should fix the invalid interactive
    priority panics reported by several users.
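  Why 2097 seconds: assuming hz = 1000 and a 1024x fixed-point scale on the
  tick counter, INT_MAX / (1000 * 1024) is about 2097, so one second past that
  the scaled value no longer fits in a signed 32-bit counter and shows up as
  negative.  The scale factor here is an assumption used only to reproduce the
  arithmetic.

    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        long long hz = 1000;            /* assumed clock rate                */
        long long scale = 1 << 10;      /* assumed 1024x fixed-point scaling */

        /* INT_MAX / (1000 * 1024) ~= 2097 seconds. */
        printf("limit: %lld seconds\n", INT_MAX / (hz * scale));

        /* One second past the limit: on typical two's-complement systems the
         * value truncated to a signed 32-bit int comes out negative, while
         * the unsigned view stays non-negative. */
        long long t = 2098 * hz * scale;
        printf("as int32: %d  as uint32: %u\n", (int)t, (unsigned)t);
        return (0);
    }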
* jeff, 2007-01-23 (1 file, -39/+90):
  - Catch up to setrunqueue/choosethread/etc. api changes.
  - Define our own maybe_preempt() as sched_preempt().  We want to be able
    to preempt idlethread in all cases.
  - Define our idlethread to require preemption to exit.
  - Get the cpu estimation tick from sched_tick() so we don't have to worry
    about errors from a sampling interval that differs from the time domain.
    This was the source of sched_priority prints/panics and inaccurate
    pctcpu display in top.
* jeff, 2007-01-20 (1 file, -1/+1):
  - Disable the long-term load balancer.  I believe that steal_busy works
    better and gives more predictable results.