path: root/sys/kern/sched_ule.c
Commit message | Author | Date | Files | Lines

* - Fix a race in sched_switch() of sched_4bsd. | attilio | 2010-01-23 | 1 | -21/+6
      In the case of the thread being on a sleepqueue or a turnstile, the
      sched_lock was acquired (without the aid of the td_lock interface) and
      td_lock was dropped. This was going to break the locking rules for
      other threads trying to access the thread (via the td_lock interface)
      and modify its flags (allowed as long as the container lock differs
      from the one used in sched_switch()). In order to prevent this
      situation, td_lock now gets blocked while sched_lock is acquired
      there. [0]
    - Merge ULE's internal function thread_block_switch() into the global
      thread_lock_block() and make the former's semantics the default for
      thread_lock_block(). This means that thread_lock_block() will not
      disable interrupts when called (and consequently thread_unlock_block()
      will not re-enable them); this should be done manually when necessary.
      Note, however, that ULE's thread_unblock_switch() is not removed
      because it reflects a semantic difference in ULE (td_lock may not
      necessarily still be blocked_lock when it is called). While
      asymmetric, it describes a notable difference in semantics that is
      good to keep in mind.

    [0] Reported by:  Kohji Okuno <okuno dot kohji at jp dot panasonic dot com>
    Tested by:        Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
    MFC:              2 weeks

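The td_lock blocking pattern this commit relies on can be summarized with a minimal sketch. This is not the actual kern_mutex.c code; the sketch_ helper names are hypothetical and only approximate the thread_lock_block()/thread_lock_unblock() pair the message refers to.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>

/*
 * While a thread is in mid-switch, its td_lock pointer is parked on a
 * spin mutex that is held forever ("blocked_lock"), so any other CPU
 * calling thread_lock() on it spins until the new container lock is
 * published with a releasing store.
 */
extern struct mtx blocked_lock;		/* permanently held spin lock */

static struct mtx *
sketch_thread_lock_block(struct thread *td)
{
	struct mtx *lock;

	lock = td->td_lock;		/* caller already owns this lock */
	td->td_lock = &blocked_lock;	/* park td_lock on blocked_lock */
	mtx_unlock_spin(lock);		/* release the old container lock */
	return (lock);
}

static void
sketch_thread_lock_unblock(struct thread *td, struct mtx *new)
{
	/* Publish the new container lock; spinning waiters may proceed. */
	atomic_store_rel_ptr((volatile uintptr_t *)&td->td_lock,
	    (uintptr_t)new);
}
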
* Allow swap out of the kernel stack for the thread with priority greater | kib | 2009-12-31 | 1 | -1/+1
    or equal to PSOCK, not less than or equal. A higher priority has a
    lesser numerical value.
    The existing test did not allow swapout of a thread waiting for an
    advisory lock, for an exiting child, or sleeping on a timeout. On the
    other hand, high-priority waiters of VFS/VM events could be swapped out.

    Tested by:    pho
    Reviewed by:  jhb
    MFC after:    1 week

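Because a numerically smaller td_priority means a higher priority in FreeBSD, the comparison direction is easy to get backwards. The following before/after is only an illustration of the test described above; the function and its surroundings are hypothetical, not the actual one-line change.

#include <sys/param.h>
#include <sys/priority.h>
#include <sys/proc.h>

/* Hypothetical helper: may this sleeping thread's stack be swapped out? */
static int
sketch_stack_swappable(struct thread *td)
{
	/*
	 * Buggy form: "td->td_priority <= PSOCK" selects the *more*
	 * important threads (smaller value == higher priority), which is
	 * exactly the set that should stay resident.
	 */
	/* return (td->td_priority <= PSOCK); */

	/* Fixed form: threads at PSOCK or numerically above may be swapped. */
	return (td->td_priority >= PSOCK);
}
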
* Don't forget to use `void' for sched_balance(). It has no arguments. | ed | 2009-12-28 | 1 | -1/+1

* Make ULE process usage (%CPU) accounting usable again by keeping track | ivoras | 2009-11-24 | 1 | -1/+4
    of the last tick we incremented on.

    Submitted by:  matthew.fleming/at/isilon.com, is/at/rambler-co.ru
    Reviewed by:   jeff (who thinks there should be a better way in the future)
    Approved by:   gnn (mentor)
    MFC after:     3 weeks

* Split P_NOLOAD into a per-thread flag (TDF_NOLOAD). | attilio | 2009-11-03 | 1 | -2/+2
    This improvement aims to avoid further cache misses in scheduler-specific
    functions that need to keep track of average thread running time, and to
    avoid further locking in the places that set this flag.

    Reported by:  jeff (originally), kris (currently)
    Reviewed by:  jhb
    Tested by:    Giuseppe Cocomazzi <sbudella at email dot it>

* Fix a sign bug in the handling of nice priorities when computing the | jhb | 2009-10-15 | 1 | -1/+1
    interactive score for a thread.

    Submitted by:  Taku YAMAMOTO taku of tackymt.homeip.net
    Reviewed by:   jeff
    MFC after:     3 days

* Fix sched_switch_migrate(): | attilio | 2009-09-15 | 1 | -11/+11
    - In 8.x and above the run-queue locks are no longer shared even in the
      HTT case, so remove the special case.
    - The deadlock explained in the removed comment here is still possible
      even with different locks, with the contribution of tdq_lock_pair().
      An explanation follows.
      (Hypothesis: a thread needs to migrate to another CPU; thread1 is
      doing sched_switch_migrate() and thread2 is the one handling the
      sched_switch() request, or in other words thread1 is the thread that
      needs to migrate and thread2 is the thread that is going to be
      preempted, most likely an idle thread. Also, 'old' refers to the
      context (in terms of run-queue and CPU) thread1 is leaving and 'new'
      refers to the context thread1 is going into. Finally, thread3 is doing
      tdq_idletd() or sched_balance() and ultimately doing tdq_lock_pair().)

      * thread1 blocks its td_lock. Now td_lock is 'blocked'
      * thread1 drops its old runqueue lock
      * thread1 acquires the new runqueue lock
      * thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
        through tdq_notify() to the new CPU
      * thread1 drops the new lock
      * thread3, scanning the runqueues, locks the old lock
      * thread2 receives the IPI_PREEMPT and does thread_lock() with td_lock
        pointing to the new runqueue
      * thread3 wants to acquire the new runqueue lock, but it can't because
        it is held by thread2, so it spins
      * thread1 wants to acquire the old lock, but as long as it is held by
        thread3 it can't
      * thread2, going further, at some point wants to switch in thread1,
        but it will wait forever because thread1->td_lock is in the blocked
        state

      This deadlock has manifested mostly on 7.x and has been reported
      several times on the mailing lists under the heading 'spinlock held
      too long'. Many thanks to des@ for having worked hard on producing
      suitable textdumps and to Jeff for help on the comment wording.

    Reviewed by:  jeff
    Reported by:  des, others
    Tested by:    des, Giovanni Trematerra
                  <giovanni dot trematerra at gmail dot com> (STABLE_7 based version)

* - Use cpuset_t and the CPU_ macros in place of cpumask_t so that ULE | jeff | 2009-06-23 | 1 | -19/+19
      supports arbitrary numbers of cpus rather than being limited by
      cpumask_t to the number of bits in a long.

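For readers unfamiliar with the API, the difference looks roughly like the sketch below. The variable names are made up, but cpuset_t and CPU_ZERO/CPU_SET/CPU_ISSET/CPU_CLR are the real <sys/cpuset.h> primitives the commit refers to.

#include <sys/param.h>
#include <sys/cpuset.h>

static void
sketch_mask_usage(void)
{
	unsigned long oldmask;	/* old style: one bit per CPU in a long */
	cpuset_t newmask;	/* new style: scales past the word size */

	oldmask = 1UL << 2;		/* mark CPU 2; limited to word size */

	CPU_ZERO(&newmask);		/* clear every bit */
	CPU_SET(2, &newmask);		/* mark CPU 2 */
	if (CPU_ISSET(2, &newmask))
		CPU_CLR(2, &newmask);	/* and unmark it again */
	(void)oldmask;
}
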
* - Fix non-SMP build by encapsulating idle spin logic in a macro. | jeff | 2009-04-29 | 1 | -2/+8

    Pointy hat to:  me

* - Fix the FBSDID line. | jeff | 2009-04-29 | 1 | -1/+1

* - Remove the bogus idle thread state code. This may have a race in it | jeff | 2009-04-29 | 1 | -28/+12
      and it only optimized out an ipi or mwait in very few cases.
    - Skip the adaptive idle code when running on SMT or HTT cores. This
      just wastes cpu time that could be used on a busy thread on the same
      core.
    - Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive. Re-use
      CG_FLAG_THREAD to mean SMT or HTT.

    Sponsored by:  Nokia

* - Fix an error that occurs when mp_ncpu is an odd number. steal_thresh | jeff | 2009-03-14 | 1 | -4/+9
      is calculated as 0 which causes errors elsewhere.
      Submitted by:  KOIE Hidetaka <koie@suri.co.jp>
    - When sched_affinity() is called with a thread that is not curthread we
      need to handle the ON_RUNQ() case by adding the thread to the correct
      run queue.
      Submitted by:  Justin Teller <justin.teller@gmail.com>

    MFC after:  1 week

* - Use __XSTRING where I want the define to be expanded. This resulted in | jeff | 2009-01-25 | 1 | -2/+2
      sizeof("MAXCPU") being used to calculate a string length rather than
      something more reasonable such as sizeof("32"). This shouldn't have
      caused any ill effect until we run on machines with 1000000 or more
      cpus.

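The distinction is between stringizing the macro name and stringizing its expansion. A small sketch of the two behaviors; the MAXCPU value here is illustrative only.

#include <sys/cdefs.h>		/* __STRING() and __XSTRING() */

#define	MAXCPU	32		/* illustrative value only */

/*
 * __STRING(MAXCPU) stringizes the *name*:           "MAXCPU" -> sizeof is 7
 * __XSTRING(MAXCPU) expands first, then stringizes: "32"     -> sizeof is 3
 */
static const char raw[] = __STRING(MAXCPU);		/* "MAXCPU" */
static const char expanded[] = __XSTRING(MAXCPU);	/* "32" */
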
* - Implement generic macros for producing KTR records that are compatible | jeff | 2009-01-17 | 1 | -18/+59
      with src/tools/sched/schedgraph.py. This allows developers to quickly
      create a graphical view of ktr data for any resource in the system.
    - Add sched_tdname() and the pcpu field 'name' for quickly and uniformly
      identifying records associated with a thread or cpu.
    - Reimplement the KTR_SCHED traces using the new generic facility.

    Obtained from:   attilio
    Discussed with:  jhb
    Sponsored by:    Nokia

* Add missing newlines to flags tags of CPU topology, for prettier | ivoras | 2008-12-23 | 1 | -2/+2
    output.

    Reviewed by:  jeff (original version)
    Approved by:  gnn (mentor) (original version)

* When checking to see if another CPU is running its idle thread, examine | jhb | 2008-11-18 | 1 | -4/+4
    the thread running on the other CPU instead of the thread being placed
    on the run queue.

    Reported by:  Ravi Murty @ Intel
    Reviewed by:  jeff

* Increase the initial sbuf size for CPU topology dump to something more | ivoras | 2008-11-02 | 1 | -1/+1
    usable for newer CPUs. The new value allows 2 x quad core configuration
    dumps to fit within the initial buffer without reallocations.

    Approved by:     gnn (mentor) (older version)
    Pointed out by:  rdivacky

* Introduce a new sysctl, kern.sched.topology_spec, that returns an XML | ivoras | 2008-10-29 | 1 | -1/+87
    dump of detected ULE CPU topology. This dump can be used to check the
    topology detection and for general system information.

    An example of CPU topology dump is:
    kern.sched.topology_spec: <groups>
     <group level="1" cache-level="0">
      <cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
      <flags></flags>
      <children>
       <group level="2" cache-level="0">
        <cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
        <flags></flags>
       </group>
       <group level="2" cache-level="0">
        <cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
        <flags></flags>
       </group>
      </children>
     </group>
    </groups>

    Reviewed by:  jeff
    Approved by:  gnn (mentor)

* - Check whether we've recorded this tick in ts_ticks on another cpu in | jeff | 2008-07-19 | 1 | -0/+6
      sched_tick() to prevent multiple increments for one tick. This pushes
      the value out of range and breaks priority calculation.

    Reviewed by:   kib
    Found by:      pho/nokia
    Sponsored by:  Nokia
    MFC after:     3 days

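The guard amounts to remembering the last hardclock tick that was already accounted and returning early when it repeats. A rough sketch of the idea; the structure, field names, and function body are hypothetical and do not reproduce the actual sched_ule.c change.

#include <sys/param.h>
#include <sys/kernel.h>		/* extern volatile int ticks; */

struct sketch_td_sched {
	int	ts_ticks;	/* accumulated run-time ticks */
	int	ts_incrtick;	/* hypothetical: last tick accounted */
};

static void
sketch_sched_tick(struct sketch_td_sched *ts)
{
	/*
	 * If this hardclock tick was already counted (possibly on another
	 * CPU), do nothing; otherwise account it exactly once.
	 */
	if (ts->ts_incrtick == ticks)
		return;
	ts->ts_ticks += 1;
	ts->ts_incrtick = ticks;
}
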
* Add the vtime (virtual time) hooks for DTrace. | jb | 2008-05-25 | 1 | -0/+18

* - Add an integer argument to idle to indicate how likely we are to wake | jeff | 2008-04-25 | 1 | -5/+9
      from idle over the next tick.
    - Add a new MD routine, cpu_wake_idle(), to wake up idle threads that
      are suspended in cpu-specific states. This function can fail and
      cause the scheduler to fall back to another mechanism (ipi).
    - Implement support for mwait in cpu_idle() on i386/amd64 machines that
      support it. mwait is a higher performance way to synchronize cpus as
      compared to hlt & ipis.
    - Allow selecting the idle routine by name via sysctl machdep.idle.
      This replaces machdep.cpu_idle_hlt. Only idle routines supported by
      the current machine are permitted.

    Sponsored by:  Nokia

* - Add a metric to describe how busy a processor has been over the last | jeff | 2008-04-17 | 1 | -7/+71
      two ticks by counting the number of switches and the load when
      sched_clock() is called.
    - If the busy metric exceeds a threshold allow the idle thread to spin
      waiting for new work for a brief period to avoid using IPIs. This
      reduces the cost on the sender and receiver as well as reducing wakeup
      latency considerably when it works.

    Sponsored by:  Nokia

* - Make SCHED_STATS more generic by adding a wrapper to create the | jeff | 2008-04-17 | 1 | -8/+30
      variables and sysctl nodes.
    - In reset walk the children of kern_sched_stats and reset the counters
      via the oid_arg1 pointer. This allows us to add arbitrary counters to
      the tree and still reset them properly.
    - Define a set of switch types to be passed with flags to mi_switch().
      These types are named SWT_*. These types correspond to SCHED_STATS
      counters and are automatically handled in this way.
    - Make the new SWT_ types more specific than the older switch stats.
      There are now stats for idle switches, remote idle wakeups, remote
      preemption, ithreads idling, etc.
    - Add switch statistics for ULE's pickcpu algorithm. These stats include
      how much migration there is, how often affinity was successful, how
      often threads were migrated to the local cpu on wakeup, etc.

    Sponsored by:  Nokia

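The pattern of declaring a counter together with its sysctl node in one macro, so a generic reset handler can later walk the parent's children and zero each counter through oid_arg1, can be sketched as follows. The macro and node names are illustrative, not the actual SCHED_STATS machinery.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

/* Illustrative parent node for the counters. */
SYSCTL_NODE(_kern, OID_AUTO, sched_stats_sketch, CTLFLAG_RW, 0,
    "illustrative scheduler statistics");

/*
 * One macro creates both the counter variable and its read-only sysctl
 * leaf, so adding a new statistic is a single line.
 */
#define	SKETCH_SCHED_STAT_DEFINE(name, descr)				\
	static unsigned long sched_stat_##name;				\
	SYSCTL_ULONG(_kern_sched_stats_sketch, OID_AUTO, name,		\
	    CTLFLAG_RD, &sched_stat_##name, 0, descr)

SKETCH_SCHED_STAT_DEFINE(idle_switch, "switches to the idle thread");
SKETCH_SCHED_STAT_DEFINE(remote_wakeup, "wakeups delivered to a remote cpu");
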
* Support and switch to the ULE scheduler: | marcel | 2008-04-15 | 1 | -1/+1
    o Implement IPI_PREEMPT,
    o Set td_lock for the thread being switched out,
    o For ULE & SMP, loop while td_lock points to blocked_lock for the
      thread being switched in,
    o Enable ULE by default in GENERIC and SKI,

* - Allow static_boost to specify no boost with '0', traditional kernel | jeff | 2008-04-04 | 1 | -2/+6
      fixed pri boost with '1', or any priority less than the current
      thread's priority with a value greater than two. Default the boost to
      PRI_MIN_TIMESHARE to prevent regular user-space threads from starving
      threads in the kernel. This prevents these user-threads from also
      being scheduled as if they are high fixed-priority kernel threads.
    - Restore the setting of lowpri in tdq_choose(). It has to be either
      here or in sched_switch(). I accidentally removed it from both places.

    Tested by:  kris

* - Don't check for the ITHD pri class in tdq_load_add and rem. 4BSD doesn't | jeff | 2008-04-04 | 1 | -12/+6
      do this either. Simply check P_NOLOAD. It'd be nice if this was in a
      thread flag so we didn't have an extra cache miss every time we add
      and remove a thread from the run-queue.

* - Restore runq to manipulating threads directly by putting runq links and | jeff | 2008-03-20 | 1 | -130/+117
      rqindex back in struct thread.
    - Compile kern_switch.c independently again and stop #include'ing it
      from schedulers.
    - Remove the ts_thread backpointers and convert most code to go from
      struct thread to struct td_sched.
    - Cleanup the ts_flags #define garbage that was causing us to sometimes
      do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD.
    - Export the kern.sched sysctl node in sysctl.h

* - ULE and 4BSD share only one line of code from sched_newthread() so implement | jeff | 2008-03-20 | 1 | -6/+5
      the required pieces in sched_fork_thread(). The td_sched pointer is
      already set up by thread_init anyway.

* - Remove some dead code and comments related to KSE. | jeff | 2008-03-19 | 1 | -56/+16
    - Don't set tdq_lowpri on every switch, it should be precisely
      maintained now.
    - Add some comments to sched_thread_priority().

* - Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from | jeff | 2008-03-19 | 1 | -2/+1
      requiring the per-process spinlock to only requiring the process lock.
    - Reflect these changes in the proc.h documentation and consumers
      throughout the kernel. This is a substantial reduction in locking cost
      for these fields and was made possible by recent changes to threading
      support.

* In keeping with style(9)'s recommendations on macros, use a ';' | rwatson | 2008-03-16 | 1 | -2/+3
    after each SYSINIT() macro invocation. This makes a number of
    lightweight C parsers much happier with the FreeBSD kernel source,
    including cflow's prcc and lxr.

    MFC after:       1 month
    Discussed with:  imp, rink

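The change is purely syntactic; a small illustrative (hypothetical) invocation shows the difference.

#include <sys/param.h>
#include <sys/kernel.h>

static void
sketch_init(void *arg __unused)
{
	/* hypothetical initialization work */
}

/* Before: no trailing semicolon, which confuses lightweight C parsers. */
/* SYSINIT(sketch, SI_SUB_RUN_QUEUE, SI_ORDER_ANY, sketch_init, NULL)   */

/* After: terminated like an ordinary declaration. */
SYSINIT(sketch, SI_SUB_RUN_QUEUE, SI_ORDER_ANY, sketch_init, NULL);
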
* Make the function prototype for cpu_search() match the declaration so that | jhb | 2008-03-14 | 1 | -2/+2
    this still compiles with gcc3.

* Remove kernel support for M:N threading. | jeff | 2008-03-12 | 1 | -9/+0
    While the KSE project was quite successful in bringing threading to
    FreeBSD, the M:N approach taken by the kse library was never developed
    to its full potential. Backwards compatibility will be provided via
    libmap.conf for dynamically linked binaries and static binaries will be
    broken.

* - Pass the priority argument from *sleep() into sleepq and down into | jeff | 2008-03-12 | 1 | -1/+9
      sched_sleep(). This removes extra thread_lock() acquisition and allows
      the scheduler to decide what to do with the static boost.
    - Change the priority arguments to cv_* to match sleepq/msleep/etc.
      where 0 means no priority change. Catch -1 in cv_broadcastpri() and
      convert it to 0 for now.
    - Set a flag when sleeping in a way that is compatible with swapping
      since direct priority comparisons are meaningless now.
    - Add a sysctl to ule, kern.sched.static_boost, that defaults to on
      which controls the boost behavior. Turning it off gives better
      performance in some workloads but needs more investigation.
    - While we're modifying sleepq, change signal and broadcast to both
      return with the lock held as the lock was held on enter.

    Reviewed by:  jhb, peter

* - Fix the invalid priority panics people are seeing by forcing | jeff | 2008-03-10 | 1 | -25/+10
      tdq_runq_add to select the runq rather than hoping we set it properly
      when we adjusted the priority. This involves the same number of
      branches as before so should perform identically without the extra
      fragility.

    Tested by:    bz
    Reviewed by:  bz

* - Don't rely on a side effect of sched_prio() to set the initial ts_runq | jeff | 2008-03-10 | 1 | -0/+1
      for thread0. Set it directly in sched_setup(). This fixes traps on
      boot seen on some machines.

    Reported by:  phk

* Reduce ULE context switch time by over 25%. | jeff | 2008-03-10 | 1 | -52/+52
    - Only calculate timeshare priorities once per tick or when a thread is
      woken from sleeping.
    - Keep the ts_runq pointer valid after all priority changes.
    - Call tdq_runq_add() directly from sched_switch() without passing in
      via tdq_add(). We don't need to adjust loads or runqs anymore.
    - Sort tdq and ts_sched according to utilization to improve cache
      behavior.

    Sponsored by:  Nokia

* - Add an implementation of sched_preempt() that avoids excessive IPIs. | jeff | 2008-03-10 | 1 | -62/+72
    - Normalize the preemption/ipi setting code by introducing
      sched_shouldpreempt() so the logic is identical and not repeated
      between tdq_notify() and sched_setpreempt().
    - In tdq_notify() don't set NEEDRESCHED as we may not actually own the
      thread lock; this could have caused us to lose td_flags settings.
    - Garbage collect some tunables that are no longer relevant.

* Add support for the new cpu topology api: | jeff | 2008-03-02 | 1 | -497/+439
    - When searching for affinity, search backwards in the tree from the
      last cpu we ran on while the thread still has affinity for the group.
      This can take advantage of knowledge of shared L2 or L3 caches among a
      group of cores.
    - When searching for the least loaded cpu, find the least loaded cpu via
      the least loaded path through the tree. This load balances system bus
      links, individual cache levels, and hyper-threaded/SMT cores.
    - Make the periodic balancer recursively balance the highest and lowest
      loaded cpu across each link.

    Add support for cpusets:
    - Convert the cpuset to a simple native cpumask_t while the kernel still
      only supports cpumask.
    - Pass the derived cpumask down through the cpu_search functions to
      restrict the result cpus.
    - Make the various steal functions resilient to failure since all
      threads can not run on all cpus any longer.

    General improvements:
    - Precisely track the lowest priority thread on every runq with
      tdq_setlowpri(). Before it was more advisory but this ended up having
      pathological behaviors.
    - Remove many #ifdef SMP conditions to simplify the code.
    - Get rid of the old cumbersome tdq_group. This is more naturally
      expressed via the cpu_group tree.

    Sponsored by:  Nokia
    Testing by:    kris

* - Replace the old smp cpu topology specification with a new, more flexible | jeff | 2008-03-02 | 1 | -64/+1
      tree structure that encodes the level of cache sharing and other
      properties.
    - Provide several convenience functions for creating one and two level
      cpu trees as well as a default flat topology. The system now always
      has some topology.
    - On i386 and amd64 create a separate level in the hierarchy for HTT and
      multi-core cpus. This will allow the scheduler to intelligently load
      balance non-uniform cores. Presently we don't detect what level of
      the cache hierarchy is shared at each level in the topology.
    - Add a mechanism for testing common topologies that have more
      information than the MD code is able to provide via the
      kern.smp.topology tunable. This should be considered a debugging tool
      only and not a stable api.

    Sponsored by:  Nokia

* - Add a new sched_affinity() api to be used in the upcoming cpuset | jeff | 2008-03-02 | 1 | -0/+5
      implementation.
    - Add empty implementations of sched_affinity() to 4BSD and ULE.

    Sponsored by:  Nokia

* - sched_prio() should only adjust tdq_lowpri if the thread is running or on | jeff | 2008-01-23 | 1 | -7/+9
      a run-queue. If the priority is numerically raised only change lowpri
      if we're certain it will be correct. Some slop is allowed; however,
      previously we could erroneously raise lowpri for an idle cpu that a
      thread had recently run on, which led to errors in load balancing
      decisions.

* - When executing the 'tryself' branch in sched_pickcpu() look at the | jeff | 2008-01-15 | 1 | -4/+5
      lowest priority on the queue for the current cpu vs curthread's
      priority. In the case that curthread is waking up many threads of a
      lower priority, as would happen with a turnstile_broadcast() or
      wakeup() of many threads, this prevents them from all ending up on the
      current cpu.
    - In sched_add() make the relationship between a scheduled ithread and
      the current cpu advisory rather than strict. Only give the ithread
      affinity for the current cpu if it's actually being scheduled from a
      hardware interrupt. This prevents it from migrating when it simply
      blocks on a lock.

    Sponsored by:  Nokia

* - Restore timeslicing code for all but SCHED_FIFO priority classes. | jeff | 2008-01-05 | 1 | -10/+9

    Reported by:  Peter Jeremy <peterjeremy@optushome.com.au>

* Make SCHED_ULE buildable with gcc3. | wkoszek | 2007-12-21 | 1 | -17/+17

    Reviewed by:  cognet (mentor), jeffr
    Approved by:  cognet (mentor), jeffr

* - Re-implement lock profiling in such a way that it no longer breaks | jeff | 2007-12-15 | 1 | -0/+6
      the ABI when enabled. There is no longer an embedded
      lock_profile_object in each lock. Instead a list of
      lock_profile_objects is kept per-thread for each lock it may own. The
      cnt_hold statistic is now always 0 to facilitate this.
    - Support shared locking by tracking individual lock instances and
      statistics in the per-thread per-instance lock_profile_object.
    - Make the lock profiling hash table a per-cpu singly linked list with a
      per-cpu static lock_prof allocator. This removes the need for an array
      of spinlocks and reduces cache contention between cores.
    - Use a separate hash for spinlocks and other locks so that only a
      critical_enter() is required and not a spinlock_enter() to modify the
      per-cpu tables.
    - Count time spent spinning in the lock statistics.
    - Remove the LOCK_PROFILE_SHARED option as it is always supported now.
    - Specifically drop and release the scheduler locks in both schedulers
      since we track owners now.

    In collaboration with:  Kip Macy
    Sponsored by:           Nokia

* Fix LOR of thread lock and umtx's priority propagation mutex due | davidxu | 2007-12-11 | 1 | -8/+4
    to the reworking of scheduler lock.

    MFC: after 3 days

* Generally we are interested in what thread did something as | julian | 2007-11-14 | 1 | -9/+9
    opposed to what process. Since threads by default have the name of the
    process unless overwritten with more useful information, just print the
    thread name instead.

* Cut over to ULE on PowerPC | grehan | 2007-10-23 | 1 | -1/+1
    kern/sched_ule.c
    - Add __powerpc__ to the list of supported architectures

    powerpc/conf/GENERIC
    - Swap SCHED_4BSD with SCHED_ULE

    powerpc/powerpc/genassym.c
    - Export TD_LOCK field of thread struct

    powerpc/powerpc/swtch.S
    - Handle new 3rd parameter to cpu_switch() by updating the old thread's
      lock. Note: uniprocessor-only, will require modification for MP
      support.

    powerpc/powerpc/vm_machdep.c
    - Set 3rd param of cpu_switch to mutex of old thread's lock, making the
      call a no-op.

    Reviewed by:  marcel, jeffr (slightly older version)

* ULE works fine on arm; allow it to be used | sam | 2007-10-16 | 1 | -1/+1

    Reviewed by:  jeff, cognet, imp
    MFC after:    1 week
