path: root/sys/kern/sched_ule.c
Commit message  [author, date, files changed, lines -/+]
* Fix UP build.  [mdf, 2010-09-02, 1 file, -0/+2]
    MFC after: 2 weeks
* Fix a bug with sched_affinity() where it checks td_pinned of another  [mdf, 2010-09-01, 1 file, -11/+13]
    thread in a racy manner, which can lead to attempting to migrate a
    thread that is pinned to a CPU. Instead, have sched_switch() determine
    which CPU a thread should run on if the current one is not allowed.

    KASSERT in sched_bind() that the thread is not yet pinned to a CPU.
    KASSERT in sched_switch() that only migratable threads or those moving
    due to a sched_bind() are changing CPUs.

    sched_affinity code came from jhb@.

    MFC after: 2 weeks
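
    (Aside: a minimal userland sketch of the assertion described above,
    with assert() standing in for the kernel's KASSERT() and a
    stripped-down struct thread; these are illustrative stand-ins, not
    the committed code.)

    #include <assert.h>
    #include <stdio.h>

    struct thread {
            int     td_pinned;      /* pin count; 0 means migratable */
            int     td_oncpu;       /* CPU the thread runs on */
    };

    static void
    sched_bind(struct thread *td, int cpu)
    {
            /* The commit asserts the thread is not yet pinned. */
            assert(td->td_pinned == 0);
            td->td_pinned = 1;
            td->td_oncpu = cpu;
    }

    int
    main(void)
    {
            struct thread td = { 0, -1 };

            sched_bind(&td, 2);
            printf("pinned to cpu%d\n", td.td_oncpu);
            return (0);
    }
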
* Remove unused KTRACE includes.  [jhb, 2010-08-19, 1 file, -4/+0]
* Add a new ipi_cpu() function to the MI IPI API that can be used to send an  [jhb, 2010-08-06, 1 file, -3/+3]
    IPI to a specific CPU by its cpuid. Replace calls to ipi_selected()
    that constructed a mask for a single CPU with calls to ipi_cpu()
    instead. This will matter more in the future when we transition from
    cpumask_t to cpuset_t for CPU masks, in which case building a CPU mask
    is more expensive.

    Submitted by: peter, sbruno
    Reviewed by: rookie
    Obtained from: Yahoo! (x86)
    MFC after: 1 month
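
    (Aside: a toy model of the call-site change — ipi_selected() takes a
    CPU mask, so single-CPU callers built a one-bit mask, while ipi_cpu()
    takes the cpuid directly. The function bodies below just trace; they
    are not the MD implementations.)

    #include <stdio.h>

    typedef unsigned long cpumask_t;
    #define IPI_PREEMPT     1

    static void
    ipi_selected(cpumask_t cpus, int ipi)
    {
            printf("ipi %d -> mask %#lx\n", ipi, cpus);
    }

    static void
    ipi_cpu(int cpu, int ipi)
    {
            printf("ipi %d -> cpu%d\n", ipi, cpu);
    }

    int
    main(void)
    {
            int cpu = 3;

            ipi_selected(1UL << cpu, IPI_PREEMPT);  /* old: one-bit mask */
            ipi_cpu(cpu, IPI_PREEMPT);              /* new: cpuid directly */
            return (0);
    }
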
* A cosmetic change - don't output empty <flags>.  [ivoras, 2010-07-15, 1 file, -2/+2]
* Update several places that iterate over CPUs to use CPU_FOREACH().  [jhb, 2010-06-11, 1 file, -4/+2]
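
    (Aside: a self-contained toy of the CPU_FOREACH() pattern; mp_maxid
    and the absent-CPU test are local stand-ins for the kernel's.)

    #include <stdio.h>

    static int mp_maxid = 7;
    static int cpu_absent[] = { 0, 0, 1, 0, 0, 1, 0, 0 };

    #define CPU_FOREACH(i)                                  \
            for ((i) = 0; (i) <= mp_maxid; (i)++)           \
                    if (!cpu_absent[(i)])

    int
    main(void)
    {
            int i;

            /* Replaces open-coded loops that skipped absent CPUs by hand. */
            CPU_FOREACH(i)
                    printf("cpu%d present\n", i);
            return (0);
    }
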
* Unconfuse THREAD and SMT flags.  [ivoras, 2010-06-10, 1 file, -1/+3]
* Cosmetic change to XML - less ugly newlines.  [ivoras, 2010-06-10, 1 file, -2/+2]
* Assert that the thread lock is held in sched_pctcpu() instead of  [jhb, 2010-06-03, 1 file, -2/+1]
    recursively acquiring it. All of the current callers already hold
    the lock.

    MFC after: 1 month
* Assert that the thread passed to sched_bind() and sched_unbind() is  [jhb, 2010-05-21, 1 file, -0/+2]
    curthread, as those routines are only supported for curthread
    currently.

    MFC after: 1 month
* This pushes all of JC's patches that I have in place. I  [rrs, 2010-05-16, 1 file, -1/+1]
    am now able to run 32 cores OK, but I will still hang on buildworld
    with an NFS problem. I suspect I am missing a patch for the netlogic
    rge driver.

    JC, check and see if I am missing anything except your core-mask
    changes.

    Obtained from: JC
* - Fix a race in sched_switch() of sched_4bsd.  [attilio, 2010-01-23, 1 file, -21/+6]
    In the case of the thread being on a sleepqueue or a turnstile, the
    sched_lock was acquired (without the aid of the td_lock interface)
    and the td_lock was dropped. This was going to break locking rules
    on other threads willing to access the thread (via the td_lock
    interface) and modify its flags (allowed as long as the container
    lock differed from the one used in sched_switch). In order to
    prevent this situation, while sched_lock is acquired there the
    td_lock gets blocked. [0]
  - Merge ULE's internal function thread_block_switch() into the global
    thread_lock_block() and make the former's semantic the default for
    thread_lock_block(). This means that thread_lock_block() will not
    disable interrupts when called (and consequently
    thread_unlock_block() will not re-enable them when called). This
    should be done manually when necessary. Note, however, that ULE's
    thread_unblock_switch() is not reaped because it does reflect a
    difference in semantic present in ULE (the td_lock may not
    necessarily still be blocked_lock when calling this). While
    asymmetric, it does describe a remarkable difference in semantic
    that is good to keep in mind.

    [0] Reported by: Kohji Okuno <okuno dot kohji at jp dot panasonic dot com>
    Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
    MFC: 2 weeks
* Allow swap out of the kernel stack for the thread with priority greater  [kib, 2009-12-31, 1 file, -1/+1]
    than or equal to PSOCK, not less than or equal. A higher priority has
    a lesser numerical value.

    The existing test did not allow for swapout of a thread waiting for
    an advisory lock, for an exiting child, or sleeping on a timeout. On
    the other hand, high-priority waiters on VFS/VM events could be
    swapped out.

    Tested by: pho
    Reviewed by: jhb
    MFC after: 1 week
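
    (Aside: a toy illustrating the inverted sense of the comparison — a
    numerically smaller priority is more important. The PSOCK value used
    below is illustrative, not the kernel's.)

    #include <stdio.h>

    #define PSOCK   96      /* illustrative value only */

    static int
    swappable(int pri)
    {
            return (pri >= PSOCK);  /* fixed test: PSOCK or less important */
    }

    int
    main(void)
    {
            printf("pri  80: %s\n", swappable(80) ? "may swap" : "resident");
            printf("pri 120: %s\n", swappable(120) ? "may swap" : "resident");
            return (0);
    }
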
* Don't forget to use `void' for sched_balance(). It has no arguments.  [ed, 2009-12-28, 1 file, -1/+1]
* Make ULE process usage (%CPU) accounting usable again by keeping track  [ivoras, 2009-11-24, 1 file, -1/+4]
    of the last tick we incremented on.

    Submitted by: matthew.fleming/at/isilon.com, is/at/rambler-co.ru
    Reviewed by: jeff (who thinks there should be a better way in the future)
    Approved by: gnn (mentor)
    MFC after: 3 weeks
* Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).  [attilio, 2009-11-03, 1 file, -2/+2]
    This improvement aims to avoid further cache misses in
    scheduler-specific functions that need to keep track of average
    thread running time, and further locking in places setting this flag.

    Reported by: jeff (originally), kris (currently)
    Reviewed by: jhb
    Tested by: Giuseppe Cocomazzi <sbudella at email dot it>
* Fix a sign bug in the handling of nice priorities when computing the  [jhb, 2009-10-15, 1 file, -1/+1]
    interactive score for a thread.

    Submitted by: Taku YAMAMOTO taku of tackymt.homeip.net
    Reviewed by: jeff
    MFC after: 3 days
* Fix sched_switch_migrate():  [attilio, 2009-09-15, 1 file, -11/+11]
    - In 8.x and above the run-queue locks are no longer shared even in
      the HTT case, so remove the special case.
    - The deadlock explained in the removed comment here is still
      possible even with different locks, with the contribution of
      tdq_lock_pair(). An explanation is here:

      (Hypothesis: a thread needs to migrate to another CPU, thread1 is
      doing sched_switch_migrate() and thread2 is the one handling the
      sched_switch() request — in other words, thread1 is the thread that
      needs to migrate and thread2 is a thread that is going to be
      preempted, most likely an idle thread. Also, 'old' refers to the
      context (in terms of run-queue and CPU) thread1 is leaving and
      'new' refers to the context thread1 is going into. Finally, thread3
      is doing tdq_idletd() or sched_balance() and definitely doing
      tdq_lock_pair().)

      * thread1 blocks its td_lock. Now td_lock is 'blocked'.
      * thread1 drops its old run-queue lock.
      * thread1 acquires the new run-queue lock.
      * thread1 adds itself to the new run queue and sends an IPI_PREEMPT
        through tdq_notify() to the new CPU.
      * thread1 drops the new lock.
      * thread3, scanning the run queues, locks the old lock.
      * thread2 receives the IPI_PREEMPT and does thread_lock() with
        td_lock pointing to the new run queue.
      * thread3 wants to acquire the new run-queue lock, but it can't
        because it is held by thread2, so it spins.
      * thread1 wants to acquire the old lock, but as long as it is held
        by thread3 it can't.
      * thread2, going further, at some point wants to switch in thread1,
        but it will wait forever because thread1->td_lock is in the
        blocked state.

      This deadlock has manifested mostly on 7.x and been reported
      several times on mailing lists under the subject 'spinlock held
      too long'. Many thanks to des@ for having worked hard on producing
      suitable textdumps and to Jeff for help on the comment wording.

    Reviewed by: jeff
    Reported by: des, others
    Tested by: des, Giovanni Trematerra
        <giovanni dot trematerra at gmail dot com> (STABLE_7 based version)
* - Use cpuset_t and the CPU_ macros in place of cpumask_t so that ULE  [jeff, 2009-06-23, 1 file, -19/+19]
    supports arbitrary numbers of cpus rather than being limited by
    cpumask_t to the number of bits in a long.
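
    (Aside: a toy model of why this matters — a cpumask_t the width of a
    long caps at 64 CPUs, while a bit array does not. The macro names
    echo the kernel's CPU_ macros but this is a local sketch.)

    #include <stdio.h>
    #include <string.h>

    #define MAXCPU          256
    #define _NLONGBITS      (8 * sizeof(unsigned long))

    typedef struct {
            unsigned long   bits[MAXCPU / _NLONGBITS];
    } cpuset_t;

    #define CPU_ZERO(s)     memset((s), 0, sizeof(*(s)))
    #define CPU_SET(n, s)   \
            ((s)->bits[(n) / _NLONGBITS] |= 1UL << ((n) % _NLONGBITS))
    #define CPU_ISSET(n, s) \
            (((s)->bits[(n) / _NLONGBITS] >> ((n) % _NLONGBITS)) & 1UL)

    int
    main(void)
    {
            cpuset_t set;

            CPU_ZERO(&set);
            CPU_SET(200, &set);     /* out of reach for a 64-bit cpumask_t */
            printf("cpu200 set: %d\n", (int)CPU_ISSET(200, &set));
            return (0);
    }
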
* - Fix non-SMP build by encapsulating idle spin logic in a macro.  [jeff, 2009-04-29, 1 file, -2/+8]
    Pointy hat to: me
* - Fix the FBSDID line.  [jeff, 2009-04-29, 1 file, -1/+1]
* - Remove the bogus idle thread state code. This may have a race in it  [jeff, 2009-04-29, 1 file, -28/+12]
    and it only optimized out an ipi or mwait in very few cases.
  - Skip the adaptive idle code when running on SMT or HTT cores. This
    just wastes cpu time that could be used on a busy thread on the
    same core.
  - Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive. Re-use
    CG_FLAG_THREAD to mean SMT or HTT.

    Sponsored by: Nokia
* - Fix an error that occurs when mp_ncpu is an odd number. steal_thresh  [jeff, 2009-03-14, 1 file, -4/+9]
    is calculated as 0, which causes errors elsewhere.
    Submitted by: KOIE Hidetaka <koie@suri.co.jp>
  - When sched_affinity() is called with a thread that is not curthread
    we need to handle the ON_RUNQ() case by adding the thread to the
    correct run queue.
    Submitted by: Justin Teller <justin.teller@gmail.com>

    MFC after: 1 week
* - Use __XSTRING where I want the define to be expanded. This resulted in  [jeff, 2009-01-25, 1 file, -2/+2]
    sizeof("MAXCPU") being used to calculate a string length rather than
    something more reasonable such as sizeof("32"). This shouldn't have
    caused any ill effect until we run on machines with 1000000 or more
    cpus.
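
    (Aside: the two-level stringification trick behind __XSTRING, using
    local copies of the <sys/cdefs.h> definitions for illustration.)

    #include <stdio.h>

    #define __STRING(x)     #x              /* stringify, no expansion */
    #define __XSTRING(x)    __STRING(x)     /* expand first, then stringify */

    #define MAXCPU  32

    int
    main(void)
    {
            /* Without expansion: the macro's name, length 7 with the NUL. */
            printf("%s %zu\n", __STRING(MAXCPU), sizeof(__STRING(MAXCPU)));
            /* With expansion: its value, length 3 with the NUL. */
            printf("%s %zu\n", __XSTRING(MAXCPU), sizeof(__XSTRING(MAXCPU)));
            return (0);
    }
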
* - Implement generic macros for producing KTR records that are compatible  [jeff, 2009-01-17, 1 file, -18/+59]
    with src/tools/sched/schedgraph.py. This allows developers to quickly
    create a graphical view of ktr data for any resource in the system.
  - Add sched_tdname() and the pcpu field 'name' for quickly and
    uniformly identifying records associated with a thread or cpu.
  - Reimplement the KTR_SCHED traces using the new generic facility.

    Obtained from: attilio
    Discussed with: jhb
    Sponsored by: Nokia
* Add missing newlines to flags tags of CPU topology, for prettier  [ivoras, 2008-12-23, 1 file, -2/+2]
    output.

    Reviewed by: jeff (original version)
    Approved by: gnn (mentor) (original version)
* When checking to see if another CPU is running its idle thread, examine  [jhb, 2008-11-18, 1 file, -4/+4]
    the thread running on the other CPU instead of the thread being
    placed on the run queue.

    Reported by: Ravi Murty @ Intel
    Reviewed by: jeff
* Increase the initial sbuf size for CPU topology dump to something more  [ivoras, 2008-11-02, 1 file, -1/+1]
    usable for newer CPUs. The new value allows 2 x quad core
    configuration dumps to fit within the initial buffer without
    reallocations.

    Approved by: gnn (mentor) (older version)
    Pointed out by: rdivacky
* Introduce a new sysctl, kern.sched.topology_spec, that returns an XML  [ivoras, 2008-10-29, 1 file, -1/+87]
    dump of detected ULE CPU topology. This dump can be used to check the
    topology detection and for general system information.

    An example of CPU topology dump is:

    kern.sched.topology_spec: <groups>
     <group level="1" cache-level="0">
      <cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
      <flags></flags>
      <children>
       <group level="2" cache-level="0">
        <cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
        <flags></flags>
       </group>
       <group level="2" cache-level="0">
        <cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
        <flags></flags>
       </group>
      </children>
     </group>
    </groups>

    Reviewed by: jeff
    Approved by: gnn (mentor)
* - Check whether we've recorded this tick in ts_ticks on another cpu in  [jeff, 2008-07-19, 1 file, -0/+6]
    sched_tick() to prevent multiple increments for one tick. This pushes
    the value out of range and breaks priority calculation.

    Reviewed by: kib
    Found by: pho/nokia
    Sponsored by: Nokia
    MFC after: 3 days
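
    (Aside: a toy model of the guard — remember the tick last accounted
    and ignore a second call for the same tick. Field names below are
    suggested by the commit text, not the exact committed ones.)

    #include <stdio.h>

    static int ticks;                       /* global tick counter */
    static struct {
            int     ts_ticks;               /* accumulated cpu ticks */
            int     ts_incrtick;            /* last tick incremented on */
    } ts;

    static void
    sched_tick(void)
    {
            if (ts.ts_incrtick == ticks)    /* already counted this tick */
                    return;
            ts.ts_ticks++;
            ts.ts_incrtick = ticks;
    }

    int
    main(void)
    {
            ticks = 100;
            sched_tick();
            sched_tick();           /* duplicate for the same tick: no-op */
            ticks = 101;
            sched_tick();
            printf("ts_ticks = %d\n", ts.ts_ticks); /* prints 2, not 3 */
            return (0);
    }
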
* Add the vtime (virtual time) hooks for DTrace.  [jb, 2008-05-25, 1 file, -0/+18]
* - Add an integer argument to idle to indicate how likely we are to wake  [jeff, 2008-04-25, 1 file, -5/+9]
    from idle over the next tick.
  - Add a new MD routine, cpu_wake_idle(), to wake up idle threads that
    are suspended in cpu-specific states. This function can fail and
    cause the scheduler to fall back to another mechanism (ipi).
  - Implement support for mwait in cpu_idle() on i386/amd64 machines
    that support it. mwait is a higher performance way to synchronize
    cpus as compared to hlt & ipis.
  - Allow selecting the idle routine by name via sysctl machdep.idle.
    This replaces machdep.cpu_idle_hlt. Only idle routines supported by
    the current machine are permitted.

    Sponsored by: Nokia
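
    (Aside: a toy model of name-based idle-routine selection in the
    spirit of the machdep.idle sysctl; the table and names below are
    illustrative only, not the MD code.)

    #include <stdio.h>
    #include <string.h>

    static void idle_hlt(void)   { printf("hlt\n"); }
    static void idle_mwait(void) { printf("mwait\n"); }

    static struct {
            const char      *name;
            void            (*fn)(void);
    } idle_tbl[] = {
            { "hlt",   idle_hlt },
            { "mwait", idle_mwait },
            { NULL,    NULL },
    };

    static void (*cpu_idle_fn)(void) = idle_hlt;

    static int
    idle_sysctl(const char *name)
    {
            int i;

            for (i = 0; idle_tbl[i].name != NULL; i++) {
                    if (strcmp(idle_tbl[i].name, name) == 0) {
                            cpu_idle_fn = idle_tbl[i].fn;
                            return (0);
                    }
            }
            return (-1);    /* not supported on this machine */
    }

    int
    main(void)
    {
            idle_sysctl("mwait");
            cpu_idle_fn();
            return (0);
    }
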
* - Add a metric to describe how busy a processor has been over the last  [jeff, 2008-04-17, 1 file, -7/+71]
    two ticks by counting the number of switches and the load when
    sched_clock() is called.
  - If the busy metric exceeds a threshold, allow the idle thread to
    spin waiting for new work for a brief period to avoid using IPIs.
    This reduces the cost on the sender and receiver as well as reducing
    wakeup latency considerably when it works.

    Sponsored by: Nokia
* - Make SCHED_STATS more generic by adding a wrapper to create the  [jeff, 2008-04-17, 1 file, -8/+30]
    variables and sysctl nodes.
  - In reset, walk the children of kern_sched_stats and reset the
    counters via the oid_arg1 pointer. This allows us to add arbitrary
    counters to the tree and still reset them properly.
  - Define a set of switch types to be passed with flags to mi_switch().
    These types are named SWT_*. These types correspond to SCHED_STATS
    counters and are automatically handled in this way.
  - Make the new SWT_ types more specific than the older switch stats.
    There are now stats for idle switches, remote idle wakeups, remote
    preemption, ithreads idling, etc.
  - Add switch statistics for ULE's pickcpu algorithm. These stats
    include how much migration there is, how often affinity was
    successful, how often threads were migrated to the local cpu on
    wakeup, etc.

    Sponsored by: Nokia
* Support and switch to the ULE scheduler:  [marcel, 2008-04-15, 1 file, -1/+1]
    o Implement IPI_PREEMPT,
    o Set td_lock for the thread being switched out,
    o For ULE & SMP, loop while td_lock points to blocked_lock for the
      thread being switched in,
    o Enable ULE by default in GENERIC and SKI,
* - Allow static_boost to specify no boost with '0', traditional kernel  [jeff, 2008-04-04, 1 file, -2/+6]
    fixed pri boost with '1', or any priority less than the current
    thread's priority with a value greater than two. Default the boost
    to PRI_MIN_TIMESHARE to prevent regular user-space threads from
    starving threads in the kernel. This prevents these user-threads
    from also being scheduled as if they are high fixed-priority kernel
    threads.
  - Restore the setting of lowpri in tdq_choose(). It has to be either
    here or in sched_switch(). I accidentally removed it from both
    places.

    Tested by: kris
* - Don't check for the ITHD pri class in tdq_load_add and rem. 4BSD doesn't  [jeff, 2008-04-04, 1 file, -12/+6]
    do this either. Simply check P_NOLOAD. It'd be nice if this was in a
    thread flag so we didn't have an extra cache miss every time we add
    and remove a thread from the run-queue.
* - Restore runq to manipulating threads directly by putting runq links and  [jeff, 2008-03-20, 1 file, -130/+117]
    rqindex back in struct thread.
  - Compile kern_switch.c independently again and stop #include'ing it
    from schedulers.
  - Remove the ts_thread backpointers and convert most code to go from
    struct thread to struct td_sched.
  - Cleanup the ts_flags #define garbage that was causing us to
    sometimes do things that expanded to
    td->td_sched->ts_thread->td_flags in 4BSD.
  - Export the kern.sched sysctl node in sysctl.h
* - ULE and 4BSD share only one line of code from sched_newthread() so implement  [jeff, 2008-03-20, 1 file, -6/+5]
    the required pieces in sched_fork_thread(). The td_sched pointer is
    already set up by thread_init anyway.
* - Remove some dead code and comments related to KSE.  [jeff, 2008-03-19, 1 file, -56/+16]
  - Don't set tdq_lowpri on every switch; it should be precisely
    maintained now.
  - Add some comments to sched_thread_priority().
* - Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from  [jeff, 2008-03-19, 1 file, -2/+1]
    requiring the per-process spinlock to only requiring the process
    lock.
  - Reflect these changes in the proc.h documentation and consumers
    throughout the kernel. This is a substantial reduction in locking
    cost for these fields and was made possible by recent changes to
    threading support.
* In keeping with style(9)'s recommendations on macros, use a ';'  [rwatson, 2008-03-16, 1 file, -2/+3]
    after each SYSINIT() macro invocation. This makes a number of
    lightweight C parsers much happier with the FreeBSD kernel source,
    including cflow's prcc and lxr.

    MFC after: 1 month
    Discussed with: imp, rink
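
    (Aside: a toy declaration-generating macro in the SYSINIT(9) style —
    the macro body omits the trailing ';' so each use site supplies it,
    which is what keeps simple C parsers happy. REGISTER_INIT is a
    hypothetical stand-in, not a kernel macro.)

    #include <stdio.h>

    #define REGISTER_INIT(name, fn)                         \
            static void (*name##_initfn)(void) = (fn)

    static void
    mymod_init(void)
    {
            printf("mymod initialized\n");
    }

    REGISTER_INIT(mymod, mymod_init);       /* note the ';' */

    int
    main(void)
    {
            mymod_initfn();
            return (0);
    }
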
* Make the function prototype for cpu_search() match the declaration so that  [jhb, 2008-03-14, 1 file, -2/+2]
    this still compiles with gcc3.
* Remove kernel support for M:N threading.  [jeff, 2008-03-12, 1 file, -9/+0]
    While the KSE project was quite successful in bringing threading to
    FreeBSD, the M:N approach taken by the kse library was never
    developed to its full potential. Backwards compatibility will be
    provided via libmap.conf for dynamically linked binaries, and static
    binaries will be broken.
* - Pass the priority argument from *sleep() into sleepq and down into  [jeff, 2008-03-12, 1 file, -1/+9]
    sched_sleep(). This removes an extra thread_lock() acquisition and
    allows the scheduler to decide what to do with the static boost.
  - Change the priority arguments to cv_* to match sleepq/msleep/etc.,
    where 0 means no priority change. Catch -1 in cv_broadcastpri() and
    convert it to 0 for now.
  - Set a flag when sleeping in a way that is compatible with swapping,
    since direct priority comparisons are meaningless now.
  - Add a sysctl to ule, kern.sched.static_boost, that defaults to on,
    which controls the boost behavior. Turning it off gives better
    performance in some workloads but needs more investigation.
  - While we're modifying sleepq, change signal and broadcast to both
    return with the lock held, as the lock was held on enter.

    Reviewed by: jhb, peter
* - Fix the invalid priority panics people are seeing by forcing  [jeff, 2008-03-10, 1 file, -25/+10]
    tdq_runq_add to select the runq rather than hoping we set it
    properly when we adjusted the priority. This involves the same
    number of branches as before, so it should perform identically
    without the extra fragility.

    Tested by: bz
    Reviewed by: bz
* - Don't rely on a side effect of sched_prio() to set the initial ts_runq  [jeff, 2008-03-10, 1 file, -0/+1]
    for thread0. Set it directly in sched_setup(). This fixes traps on
    boot seen on some machines.

    Reported by: phk
* Reduce ULE context switch time by over 25%.  [jeff, 2008-03-10, 1 file, -52/+52]
    - Only calculate timeshare priorities once per tick or when a thread
      is woken from sleeping.
    - Keep the ts_runq pointer valid after all priority changes.
    - Call tdq_runq_add() directly from sched_switch() without passing
      in via tdq_add(). We don't need to adjust loads or runqs anymore.
    - Sort tdq and ts_sched according to utilization to improve cache
      behavior.

    Sponsored by: Nokia
* - Add an implementation of sched_preempt() that avoids excessive IPIs.  [jeff, 2008-03-10, 1 file, -62/+72]
  - Normalize the preemption/ipi setting code by introducing
    sched_shouldpreempt() so the logic is identical and not repeated
    between tdq_notify() and sched_setpreempt().
  - In tdq_notify() don't set NEEDRESCHED, as we may not actually own
    the thread lock; this could have caused us to lose td_flags
    settings.
  - Garbage collect some tunables that are no longer relevant.
* Add support for the new cpu topology api:  [jeff, 2008-03-02, 1 file, -497/+439]
    - When searching for affinity, search backwards in the tree from the
      last cpu we ran on while the thread still has affinity for the
      group. This can take advantage of knowledge of shared L2 or L3
      caches among a group of cores.
    - When searching for the least loaded cpu, find the least loaded cpu
      via the least loaded path through the tree. This load balances
      system bus links, individual cache levels, and
      hyper-threaded/SMT cores.
    - Make the periodic balancer recursively balance the highest and
      lowest loaded cpu across each link.

    Add support for cpusets:
    - Convert the cpuset to a simple native cpumask_t while the kernel
      still only supports cpumask.
    - Pass the derived cpumask down through the cpu_search functions to
      restrict the result cpus.
    - Make the various steal functions resilient to failure since all
      threads can no longer run on all cpus.

    General improvements:
    - Precisely track the lowest priority thread on every runq with
      tdq_setlowpri(). Before, it was more advisory, but this ended up
      having pathological behaviors.
    - Remove many #ifdef SMP conditions to simplify the code.
    - Get rid of the old cumbersome tdq_group. This is more naturally
      expressed via the cpu_group tree.

    Sponsored by: Nokia
    Testing by: kris