path: root/sys/kern/sched_ule.c
...
* Cap the priority calculated from the current thread's running tick count (jhb, 2011-12-29, -1/+2)
  at SCHED_PRI_RANGE to prevent overflows in the priority value. This can
  happen due to irregularities with clock interrupts under certain
  virtualization environments.
  Tested by: Larry Rosenman ler lerctr org
  MFC after: 2 weeks
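  The overflow guard amounts to a simple clamp. A minimal sketch, assuming
  an illustrative SCHED_PRI_RANGE and tick-to-priority scaling (the real
  constants and formula live in sched_ule.c):

    #define SCHED_PRI_RANGE 40  /* illustrative, not the kernel's value */

    /* Hypothetical helper: scale a running tick count into the timeshare
     * priority range, clamping so a runaway tick count (e.g. under an
     * erratic virtualized clock) cannot push the priority out of range. */
    static int
    pri_from_ticks(unsigned ticks, unsigned window_ticks)
    {
            unsigned pri;

            pri = ticks * SCHED_PRI_RANGE / (window_ticks + 1);
            if (pri >= SCHED_PRI_RANGE)
                    pri = SCHED_PRI_RANGE - 1;  /* the cap added here */
            return ((int)pri);
    }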
* ule: ensure that batch timeshare threads are scheduled fairly (avg, 2011-12-19, -2/+2)
  With the previous code, if the range of priorities for timeshare batch
  threads was greater than RQ_NQS, then the threads with low priorities
  in the part of the range above RQ_NQS would be scheduled to the
  run-queues as if they had high priorities at the beginning of the
  range. In other words, threads with a nice level of +N could be
  scheduled as if they had a nice level of -M.
  Reported by: George Mitchell <george@m5p.com>
  Reviewed by: jhb
  Tested by: George Mitchell <george@m5p.com> (earlier version)
  MFC after: 1 week
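  The wraparound is easy to model: a queue index computed modulo RQ_NQS
  maps the tail of the priority range back onto its head. A sketch of the
  failure and one possible remedy, with illustrative names and bounds
  (the committed fix differs in detail):

    #define RQ_NQS 64   /* number of run queues, as in sys/runq.h */

    /* Hypothetical mapping from a batch priority to a run-queue index;
     * 'min' and 'max' bound the batch range and are illustrative. */
    static int
    batch_runq_index(int pri, int min, int max)
    {
            int span = max - min + 1;
            int off = pri - min;

            if (span <= RQ_NQS)
                    return (off);   /* fits: direct mapping */
            /* Naive 'off % RQ_NQS' wraps nice +N onto the head of the
             * range, as described above; compressing preserves order. */
            return (off * RQ_NQS / span);
    }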
* - Currently, sched_balance_pair() may cause a CPU to send an IPI_PREEMPT to (marius, 2011-10-06, -4/+9)
  itself, which sparc64 hardware doesn't support. One way to solve this
  would be to directly call sched_preempt() instead of issuing a
  self-IPI. However, quoting jhb@: "On the other hand, you can probably
  just skip the IPI entirely if we are going to send it to the current
  CPU. Presumably, once this routine finishes, the current CPU will exit
  softlock (or will do so "soon") and will then pick the next thread to
  run based on the adjustments made in this routine, so there's no need
  to IPI the CPU running this routine anyway. I think this is the better
  solution. Right now what is probably happening on other platforms is as
  soon as this routine finishes the CPU processes its self-IPI and causes
  mi_switch() which will just switch back to the softclock thread it is
  already running."
  - With r226054 and the above change in place, sparc64 is now no longer
    incompatible with ULE and vice versa. However, powerpc/E500 still is.
  Submitted by: jhb
  Reviewed by: jeff
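  The suggested shape of the fix is a one-line guard at the notification
  site. A sketch under the assumption that the caller knows its own CPU
  id; the stubs stand in for the kernel's MI IPI API (see the ipi_cpu()
  entry of 2010-08-06 below) and are not the committed diff:

    #define IPI_PREEMPT 1
    static void ipi_cpu(int cpu, int ipi) { (void)cpu; (void)ipi; }

    static void
    notify_cpu(int target, int self)
    {
            /* Skip the IPI when the target is the CPU running the
             * balancer: it will pick the next thread itself once
             * softclock processing finishes, and sparc64 cannot send
             * a self-IPI anyway. */
            if (target == self)
                    return;
            ipi_cpu(target, IPI_PREEMPT);
    }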
* Fix format strings for KTR_STATE in 4BSD and ULE schedulers. (delphij, 2011-08-26, -2/+2)
  Submitted by: Ivan Klymenko <fidaj@ukr.net>
  PR: kern/159904, kern/159905
  MFC after: 2 weeks
  Approved by: re (kib)
* Remove explicit MAXCPU usage from sys/pcpu.h avoiding a namespace (attilio, 2011-07-19, -1/+1)
  pollution. That is a step further in the direction of building correct
  policies for userland and modules on how to deal with the number of
  maxcpus at runtime.
  Reported by: jhb
  Reviewed and tested by: pluknet
  Approved by: re (kib)
* Commit the support for removing cpumask_t and replacing it directly with (attilio, 2011-05-05, -4/+5)
  cpuset_t objects. That is going to offer the underlying support for a
  simple bump of MAXCPU and then support for a number of cpus > 32 (as it
  is today).
  Right now, cpumask_t is an int, 32 bits on all our supported
  architectures. cpuset_t, on the other side, is implemented as an array
  of longs, and easily extendable by definition.
  The architectures touched by this commit are the following:
  - amd64
  - i386
  - pc98
  - arm
  - ia64
  - XEN
  while the others are still missing. Userland is believed to be fully
  converted with the changes contained here.
  Some technical notes:
  - This commit may be considered an ABI nop for all the architectures
    different from amd64 and ia64 (and sparc64 in the future)
  - per-cpu members, which are now converted to cpuset_t, need to be
    accessed avoiding migration, because the size of cpuset_t should be
    considered unknown
  - the size of cpuset_t objects differs between kernel and userland
    (this is primarily done in order to leave some more space in userland
    to cope with KBI extensions). If you need to access kernel cpuset_t
    from userland, please refer to the example in this patch on how to do
    that correctly (kgdb may be a good source, for example).
  - Support for other architectures is going to be added soon
  - Only MAXCPU for amd64 is bumped now
  The patch has been tested by sbruno and Nicholas Esborn on Opteron
  4 x 12 pack CPUs. More testing on big SMP is expected to come soon.
  pluknet tested the patch with his 8-ways on both amd64 and i386.
  Tested by: pluknet, sbruno, gianni, Nicholas Esborn
  Reviewed by: jeff, jhb, sbruno
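  The structural difference is the whole point: an int mask tops out at
  32 CPUs, while an array of longs grows with a constant. A minimal
  userland model of the cpuset_t shape (constants and helper names are
  illustrative, not the kernel's):

    #include <limits.h>

    #define MAXCPU          256     /* illustrative bump */
    #define BITS_PER_WORD   (sizeof(long) * CHAR_BIT)
    #define NWORDS          ((MAXCPU + BITS_PER_WORD - 1) / BITS_PER_WORD)

    typedef struct {
            long    bits[NWORDS];   /* grows automatically with MAXCPU */
    } cpuset_model_t;

    static void
    cpu_set(cpuset_model_t *s, int cpu)
    {
            s->bits[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
    }

    static int
    cpu_isset(const cpuset_model_t *s, int cpu)
    {
            return ((s->bits[cpu / BITS_PER_WORD] >>
                (cpu % BITS_PER_WORD)) & 1);
    }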
* Clearing the flag when preempting will let the preempted thread run (fabient, 2011-03-31, -1/+2)
  too much time. This can end in a scheduler deadlock with ping-pong
  between two threads. One sample of this is:
  - device lapic (to have a preemption point on critical_exit())
  - options DEVICE_POLLING with HZ>1499 (to have lapic freq =
    hardclock freq)
  - running a cpu intensive task (that does not enter the kernel)
  - only one CPU on SMP or no SMP.
  As requested by jhb@, 4BSD has received the same type of fix instead of
  propagating the flag to the new thread.
  Reviewed by: jhb, jeff
  MFC after: 1 month
* Rework realtime priority support: (jhb, 2011-01-14, -4/+12)
  - Move the realtime priority range up above kernel sleep priorities and
    just below interrupt thread priorities.
  - Contract the interrupt and kernel sleep priority ranges a bit so that
    the timesharing priority band can be increased. The new timeshare
    range is now slightly larger than the old realtime + timeshare
    ranges.
  - Change the ULE scheduler to no longer use realtime priorities for
    interactive threads. Instead, the larger timeshare range is now split
    into separate subranges for interactive and non-interactive ("batch")
    threads. The end result is that interactive threads and
    non-interactive threads still use the same priority ranges as before,
    but realtime threads now have a separate, dedicated priority range.
  - Do not modify the priority of non-timeshare threads in sched_sleep()
    or via cv_broadcastpri(). Realtime and idle priority threads will no
    longer have their priorities affected by sleeping in the kernel.
  Reviewed by: jeff
* Introduce two new helper macros to define the priority ranges used for (jhb, 2011-01-13, -16/+25)
  interactive timeshare threads (PRI_*_INTERACTIVE) and non-interactive
  timeshare threads (PRI_*_BATCH) and use these instead of
  PRI_*_REALTIME and PRI_*_TIMESHARE. No functional change.
  Reviewed by: jeff
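  Together with the rework above, the macros carve the timeshare band
  into two contiguous subranges. A sketch of the idea with made-up
  bounds; only the subdivision, not the numbers, reflects the commits
  (the real values live in sys/priority.h and sched_ule.c):

    /* Hypothetical priority layout. */
    #define PRI_MIN_TIMESHARE   88
    #define PRI_MAX_TIMESHARE   223
    #define PRI_TIMESHARE_RANGE (PRI_MAX_TIMESHARE - PRI_MIN_TIMESHARE + 1)

    /* Interactive threads take the numerically lower (more important)
     * half of the timeshare band... */
    #define PRI_MIN_INTERACTIVE PRI_MIN_TIMESHARE
    #define PRI_MAX_INTERACTIVE \
            (PRI_MIN_TIMESHARE + PRI_TIMESHARE_RANGE / 2 - 1)
    /* ...and batch threads the rest; realtime now has its own separate
     * band outside this range. */
    #define PRI_MIN_BATCH   (PRI_MIN_TIMESHARE + PRI_TIMESHARE_RANGE / 2)
    #define PRI_MAX_BATCH   PRI_MAX_TIMESHARE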
* Always use PRI_BASE() when checking the base type of a thread's priority (jhb, 2011-01-11, -2/+2)
  class.
  MFC after: 2 weeks
* Fix two harmless off-by-one errors. (jhb, 2011-01-10, -3/+3)
  Reviewed by: jeff
  MFC after: 2 weeks
* - Move sched_fork() later in fork() after the various sections of the new (jhb, 2011-01-06, -3/+5)
  thread and proc have been copied and zeroed from the old thread and
  proc. Otherwise attempts to modify thread or process data in
  sched_fork() could be undone.
  - Don't copy td_user_pri and td_base_user_pri from the old thread to
    the new thread in sched_fork_thread() in ULE. This is already done
    courtesy of the bcopy() of the thread copy region.
  - Always initialize the real priority (td_priority) of new threads to
    the new thread's base priority (td_base_pri) to avoid bogusly
    inheriting a borrowed priority from the parent thread.
  MFC after: 2 weeks
* - Follow r216313, the sched_unlend_user_prio is no longer needed, always (davidxu, 2010-12-29, -17/+5)
  use sched_lend_user_prio to set lent priority.
  - Improve the pthread priority-inherit mutex: when a contender's
    priority is lowered, repropagate priorities; this may cause the mutex
    owner's priority to be lowered. In the old code, the mutex owner's
    priority was raise-only.
* MFp4: (davidxu, 2010-12-09, -10/+11)
  It is possible for a lower priority thread to lend priority to a higher
  priority thread; in the old code this was ignored, but the lending
  should always be recorded. Add field td_lend_user_pri to fix the
  problem; if a thread does not have borrowed priority, its value is
  PRI_MAX.
  MFC after: 1 week
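  With every lend recorded, the effective user priority is simply the
  minimum of the base and the lent value. A small model of that invariant
  (the field names follow the commit; everything else is illustrative):

    #define PRI_MAX 255     /* illustrative "nothing borrowed" sentinel */

    struct thread_model {
            int     td_base_user_pri;  /* thread's own user priority */
            int     td_lend_user_pri;  /* lent priority, PRI_MAX if none */
    };

    /* Effective user priority: the lower value wins, since lower
     * numbers mean higher importance. */
    static int
    user_pri(const struct thread_model *td)
    {
            return (td->td_lend_user_pri < td->td_base_user_pri ?
                td->td_lend_user_pri : td->td_base_user_pri);
    }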
* Remove unused variables. (trasz, 2010-11-13, -4/+0)
* Fix typos. (attilio, 2010-11-10, -2/+2)
  Submitted by: gianni
  MFC after: 3 days
* Use integer for size of cpuset, as it won't be bigger than INT_MAX. (davidxu, 2010-11-01, -9/+0)
  This was requested by bde. Also move the sysctl into file
  kern_cpuset.c, because it should always be there; it is independent of
  the thread scheduler.
* Add sysctl kern.sched.cpusetsize to export the size of kernel cpuset, (davidxu, 2010-10-29, -0/+11)
  also add sysconf() key _SC_CPUSET_SIZE to get the sysctl value.
  Submitted by: gcooper
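  Both interfaces report the same value; a userland consumer would
  typically size its buffers from sysconf(). A minimal usage sketch
  (_SC_CPUSET_SIZE is the FreeBSD-specific key from this commit; the
  error handling and the bits-per-byte arithmetic are illustrative):

    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            /* Size, in bytes, of the kernel's cpuset_t -- also exported
             * as the kern.sched.cpusetsize sysctl per the entry above. */
            long sz = sysconf(_SC_CPUSET_SIZE);

            if (sz == -1) {
                    perror("sysconf(_SC_CPUSET_SIZE)");
                    return (1);
            }
            printf("kernel cpuset size: %ld bytes (up to %ld CPUs)\n",
                sz, sz * 8);
            return (0);
    }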
* Comment nit, set TDF_NEEDRESCHED after the comment describing why it is (jhb, 2010-09-21, -1/+1)
  done rather than before.
  MFC after: 1 week
* kern.sched.topology_spec sysctl: use step of 1 for group level numbering (avg, 2010-09-18, -1/+1)
  This is just a cosmetic change for prettier output. The 'indent'
  variable/parameter serves two purposes: it specifies the whitespace
  indentation level and also implies the cpu group level/depth. It would
  have been better to split those two uses, but for now just a simple
  change.
  MFC after: 1 week
* Refactor timer management code with priority to one-shot operation mode. (mav, 2010-09-13, -4/+4)
  The main goal of this is to generate timer interrupts only when there
  is some work to do. When the CPU is busy, interrupts are generated at
  the full rate of hz + stathz to fulfill scheduler and timekeeping
  requirements. But when the CPU is idle, only the minimum set of
  interrupts (down to 8 interrupts per second per CPU now) needed to
  handle scheduled callouts is executed. This allows a significant
  increase in idle CPU sleep time, increasing the effect of static
  power-saving technologies. It should also reduce host CPU load on
  virtualized systems when the guest system is idle.
  There is a set of tunables, also available as writable sysctls, that
  control the desired event timer subsystem behavior:
  - kern.eventtimer.timer - chooses the event timer hardware to use. On
    x86 there are up to 4 different kinds of timers. Depending on whether
    the chosen timer is per-CPU, the behavior of the other options
    differs slightly.
  - kern.eventtimer.periodic - chooses between periodic and one-shot
    operation mode. In periodic mode, the current timer hardware is taken
    as the only source of time for time events. This mode is quite
    similar to the previous kernel behavior. One-shot mode instead uses
    the currently selected time counter hardware to schedule all needed
    events one by one and programs the timer to generate an interrupt at
    exactly the specified time. The default value depends on the chosen
    timer's capabilities, but one-shot mode is preferred unless another
    is forced by the user or hardware.
  - kern.eventtimer.singlemul - in periodic mode specifies how many times
    higher the timer frequency should be, so as not to strictly alias
    hardclock() and statclock() events. Default values are 2 and 4, but
    they can be reduced to 1 if extra interrupts are unwanted.
  - kern.eventtimer.idletick - makes each CPU receive every timer
    interrupt independently of whether it is busy or not. By default this
    option is disabled. If the chosen timer is per-CPU and runs in
    periodic mode, this option has no effect - all interrupts are
    generated anyway.
  Since this patch modifies cpu_idle() on some platforms, I have also
  refactored the x86 one. Now it makes use of MONITOR/MWAIT instructions
  (if supported) under a high sleep/wakeup rate, as a fast alternative to
  other methods. It allows the SMP scheduler to wake up sleeping CPUs
  much faster without using an IPI, significantly increasing performance
  on some highly task-switching loads.
  Tested by: many (on i386, amd64, sparc64 and powerpc)
  H/W donated by: Gheorghe Ardelean
  Sponsored by: iXsystems, Inc.
* Do not IPI a CPU that is already spinning for load. It doubles the (mav, 2010-09-10, -4/+11)
  effect of spinning (compared to MWAIT) on some heavily switching test
  loads.
* Fix UP build. (mdf, 2010-09-02, -0/+2)
  MFC after: 2 weeks
* Fix a bug with sched_affinity() where it checks td_pinned of another (mdf, 2010-09-01, -11/+13)
  thread in a racy manner, which can lead to attempting to migrate a
  thread that is pinned to a CPU. Instead, have sched_switch() determine
  which CPU a thread should run on if the current one is not allowed.
  KASSERT in sched_bind() that the thread is not yet pinned to a CPU.
  KASSERT in sched_switch() that only migratable threads or those moving
  due to a sched_bind() are changing CPUs.
  sched_affinity code came from jhb@.
  MFC after: 2 weeks
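  The first KASSERT is easy to model in isolation: binding must start
  from an unpinned thread, or a racy double-bind slips through. A
  userland sketch (the td_pinned field name mimics the kernel; the rest
  is illustrative):

    #include <assert.h>

    struct thread_model {
            int     td_pinned;      /* >0 while pinned to its CPU */
            int     td_boundcpu;    /* hypothetical: CPU bound to */
    };

    static void
    sched_bind_model(struct thread_model *td, int cpu)
    {
            /* The assertion the commit adds: never bind an
             * already-pinned thread; callers must unbind first. */
            assert(td->td_pinned == 0);
            td->td_pinned = 1;
            td->td_boundcpu = cpu;
    }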
* Remove unused KTRACE includes. (jhb, 2010-08-19, -4/+0)
* Add a new ipi_cpu() function to the MI IPI API that can be used to send an (jhb, 2010-08-06, -3/+3)
  IPI to a specific CPU by its cpuid. Replace calls to ipi_selected()
  that constructed a mask for a single CPU with calls to ipi_cpu()
  instead. This will matter more in the future when we transition from
  cpumask_t to cpuset_t for CPU masks, in which case building a CPU mask
  is more expensive.
  Submitted by: peter, sbruno
  Reviewed by: rookie
  Obtained from: Yahoo! (x86)
  MFC after: 1 month
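  The conversion is mechanical at each call site. A schematic
  before/after; ipi_selected() and ipi_cpu() are the real API names from
  the commit, while IPI_PREEMPT's value and the wrapper are illustrative
  stubs:

    void ipi_selected(unsigned long mask, int ipi);  /* old MI API */
    void ipi_cpu(int cpu, int ipi);                  /* new MI API */
    #define IPI_PREEMPT 1

    static void
    preempt_remote(int cpu)
    {
            /* Before: build a one-bit cpumask_t just to hit one CPU:
             *   ipi_selected(1UL << cpu, IPI_PREEMPT);
             * After: address the CPU directly by id -- cheap even once
             * masks become multi-word cpuset_t. */
            ipi_cpu(cpu, IPI_PREEMPT);
    }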
* A cosmetic change - don't output empty <flags>. (ivoras, 2010-07-15, -2/+2)
* Update several places that iterate over CPUs to use CPU_FOREACH(). (jhb, 2010-06-11, -4/+2)
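  CPU_FOREACH() hides the open-coded skip over absent CPU ids that every
  call site used to repeat. A self-contained userland model of the macro
  and the conversion (mp_maxid/CPU_ABSENT() mirror the kernel names; the
  values are illustrative):

    #define MP_MAXID 7
    static int cpu_absent(int cpu) { return (cpu == 3); } /* a hole */

    /* Model of the kernel macro: iterate all ids, skip absent ones. */
    #define CPU_FOREACH(i)                          \
            for ((i) = 0; (i) <= MP_MAXID; (i)++)   \
                    if (!cpu_absent(i))

    static int
    count_cpus(void)
    {
            int cpu, n = 0;

            CPU_FOREACH(cpu)        /* absent ids skipped by the macro */
                    n++;
            return (n);
    }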
* Unconfuse THREAD and SMT flags. (ivoras, 2010-06-10, -1/+3)
* Cosmetic change to XML - less ugly newlines. (ivoras, 2010-06-10, -2/+2)
* Assert that the thread lock is held in sched_pctcpu() instead of (jhb, 2010-06-03, -2/+1)
  recursively acquiring it. All of the current callers already hold the
  lock.
  MFC after: 1 month
* Assert that the thread passed to sched_bind() and sched_unbind() is (jhb, 2010-05-21, -0/+2)
  curthread, as those routines are only supported for curthread
  currently.
  MFC after: 1 month
* This pushes all of JC's patches that I have in place. I (rrs, 2010-05-16, -1/+1)
  am now able to run 32 cores ok, but I will still hang on buildworld
  with an NFS problem. I suspect I am missing a patch for the netlogic
  rge driver. JC, check and see if I am missing anything except your
  core-mask changes.
  Obtained from: JC
* - Fix a race in sched_switch() of sched_4bsd. (attilio, 2010-01-23, -21/+6)
  In the case of the thread being on a sleepqueue or a turnstile, the
  sched_lock was acquired (without the aid of the td_lock interface) and
  the td_lock was dropped. This was going to break locking rules for
  other threads wanting to access the thread (via the td_lock interface)
  and modify its flags (allowed as long as the container lock was
  different from the one used in sched_switch). In order to prevent this
  situation, while sched_lock is acquired there, the td_lock gets
  blocked. [0]
  - Merge ULE's internal function thread_block_switch() into the global
    thread_lock_block() and make the former's semantics the default for
    thread_lock_block(). This means that thread_lock_block() will not
    disable interrupts when called (and consequently
    thread_unlock_block() will not re-enable them when called). This
    should be done manually when necessary. Note, however, that ULE's
    thread_unblock_switch() is not reaped because it reflects a
    difference in semantics specific to ULE (the td_lock may not
    necessarily still be blocked_lock when calling it). While asymmetric,
    it does describe a remarkable difference in semantics that is good to
    keep in mind. [0]
  Reported by: Kohji Okuno <okuno dot kohji at jp dot panasonic dot com>
  Tested by: Giovanni Trematerra
  <giovanni dot trematerra at gmail dot com>
  MFC: 2 weeks
* Allow swap out of the kernel stack for the thread with priority greater (kib, 2009-12-31, -1/+1)
  than or equal to PSOCK, not less than or equal. A higher priority has a
  lesser numerical value.
  The existing test did not allow swapout of a thread waiting for an
  advisory lock, for an exiting child, or sleeping on a timeout. On the
  other hand, high-priority waiters for VFS/VM events could be swapped
  out.
  Tested by: pho
  Reviewed by: jhb
  MFC after: 1 week
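  Because priorities are inverted numerically, the corrected test is a
  '>=' rather than a '<='. A tiny model (the PSOCK value here is
  illustrative; the real one is in sys/priority.h):

    #define PSOCK 96    /* illustrative */

    /* May this thread's kernel stack be swapped out? A numerically
     * larger td_priority means a *less* important thread, so the ones
     * safe to evict are at PSOCK importance or below. */
    static int
    stack_swappable(int td_priority)
    {
            /* The old, inverted test was: td_priority <= PSOCK */
            return (td_priority >= PSOCK);
    }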
* Don't forget to use `void' for sched_balance(). It has no arguments. (ed, 2009-12-28, -1/+1)
* Make ULE process usage (%CPU) accounting usable again by keeping track (ivoras, 2009-11-24, -1/+4)
  of the last tick we incremented on.
  Submitted by: matthew.fleming/at/isilon.com, is/at/rambler-co.ru
  Reviewed by: jeff (who thinks there should be a better way in the
  future)
  Approved by: gnn (mentor)
  MFC after: 3 weeks
* Split P_NOLOAD into a per-thread flag (TDF_NOLOAD). (attilio, 2009-11-03, -2/+2)
  This improvement aims to avoid further cache misses in
  scheduler-specific functions that need to keep track of average thread
  running time, and to avoid further locking in places setting this flag.
  Reported by: jeff (originally), kris (currently)
  Reviewed by: jhb
  Tested by: Giuseppe Cocomazzi <sbudella at email dot it>
* Fix a sign bug in the handling of nice priorities when computing the (jhb, 2009-10-15, -1/+1)
  interactive score for a thread.
  Submitted by: Taku YAMAMOTO taku of tackymt.homeip.net
  Reviewed by: jeff
  MFC after: 3 days
* Fix sched_switch_migrate(): (attilio, 2009-09-15, -11/+11)
  - In 8.x and above the run-queue locks are no longer shared even in the
    HTT case, so remove the special case.
  - The deadlock explained in the removed comment here is still possible
    even with different locks, with the contribution of tdq_lock_pair().
    An explanation is here: (hypothesis: a thread needs to migrate to
    another CPU, thread1 is doing sched_switch_migrate() and thread2 is
    the one handling the sched_switch() request or, in other words,
    thread1 is the thread that needs to migrate and thread2 is a thread
    that is going to be preempted, most likely an idle thread. Also,
    'old' refers to the context (in terms of run-queue and CPU) thread1
    is leaving and 'new' refers to the context thread1 is going into.
    Finally, thread3 is doing tdq_idled() or sched_balance() and
    definitively doing tdq_lock_pair())
    * thread1 blocks its td_lock. Now td_lock is 'blocked'
    * thread1 drops its old runqueue lock
    * thread1 acquires the new runqueue lock
    * thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
      through tdq_notify() to the new CPU
    * thread1 drops the new lock
    * thread3, scanning the runqueues, locks the old lock
    * thread2 received the IPI_PREEMPT and does thread_lock() with
      td_lock pointing to the new runqueue
    * thread3 wants to acquire the new runqueue lock, but it can't
      because it is held by thread2, so it spins
    * thread1 wants to acquire the old lock, but as long as it is held
      by thread3 it can't
    * thread2, going further, at some point wants to switch in thread1,
      but it will wait forever because thread1->td_lock is in the
      blocked state
  This deadlock has been manifested mostly on 7.x and reported several
  times on mailing lists under the heading 'spinlock held too long'.
  Many thanks to des@ for having worked hard on producing suitable
  textdumps and to Jeff for help on the comment wording.
  Reviewed by: jeff
  Reported by: des, others
  Tested by: des, Giovanni Trematerra
  <giovanni dot trematerra at gmail dot com> (STABLE_7 based version)
* - Use cpuset_t and the CPU_ macros in place of cpumask_t so that ULE (jeff, 2009-06-23, -19/+19)
  supports arbitrary numbers of cpus rather than being limited by
  cpumask_t to the number of bits in a long.
* - Fix non-SMP build by encapsulating idle spin logic in a macro. (jeff, 2009-04-29, -2/+8)
  Pointy hat to: me
* - Fix the FBSDID line. (jeff, 2009-04-29, -1/+1)
* - Remove the bogus idle thread state code. This may have a race in it (jeff, 2009-04-29, -28/+12)
  and it only optimized out an ipi or mwait in very few cases.
  - Skip the adaptive idle code when running on SMT or HTT cores. This
    just wastes cpu time that could be used on a busy thread on the same
    core.
  - Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive. Re-use
    CG_FLAG_THREAD to mean SMT or HTT.
  Sponsored by: Nokia
* - Fix an error that occurs when mp_ncpus is an odd number. steal_thresh (jeff, 2009-03-14, -4/+9)
  is calculated as 0, which causes errors elsewhere.
  Submitted by: KOIE Hidetaka <koie@suri.co.jp>
  - When sched_affinity() is called with a thread that is not curthread,
    we need to handle the ON_RUNQ() case by adding the thread to the
    correct run queue.
  Submitted by: Justin Teller <justin.teller@gmail.com>
  MFC after: 1 week
* - Use __XSTRING where I want the define to be expanded. This resulted in (jeff, 2009-01-25, -2/+2)
  sizeof("MAXCPU") being used to calculate a string length rather than
  something more reasonable such as sizeof("32"). This shouldn't have
  caused any ill effect until we run on machines with 1000000 or more
  cpus.
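  The classic two-level stringification trick explains the bug: a
  one-level stringify yields the macro's name, while __XSTRING-style
  expansion yields its value. A standalone illustration (the macro names
  here are generic stand-ins for the sys/cdefs.h ones):

    #include <stdio.h>

    #define MAXCPU      32
    #define STRING(x)   #x          /* stringify without expanding */
    #define XSTRING(x)  STRING(x)   /* expand first, then stringify */

    int
    main(void)
    {
            /* sizeof("MAXCPU") == 7 is the bug; sizeof("32") == 3. */
            printf("STRING:  \"%s\" (sizeof %zu)\n",
                STRING(MAXCPU), sizeof(STRING(MAXCPU)));
            printf("XSTRING: \"%s\" (sizeof %zu)\n",
                XSTRING(MAXCPU), sizeof(XSTRING(MAXCPU)));
            return (0);
    }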
* - Implement generic macros for producing KTR records that are compatible (jeff, 2009-01-17, -18/+59)
  with src/tools/sched/schedgraph.py. This allows developers to quickly
  create a graphical view of ktr data for any resource in the system.
  - Add sched_tdname() and the pcpu field 'name' for quickly and
    uniformly identifying records associated with a thread or cpu.
  - Reimplement the KTR_SCHED traces using the new generic facility.
  Obtained from: attilio
  Discussed with: jhb
  Sponsored by: Nokia
* Add missing newlines to flags tags of CPU topology, for prettier (ivoras, 2008-12-23, -2/+2)
  output.
  Reviewed by: jeff (original version)
  Approved by: gnn (mentor) (original version)
* When checking to see if another CPU is running its idle thread, examine (jhb, 2008-11-18, -4/+4)
  the thread running on the other CPU instead of the thread being placed
  on the run queue.
  Reported by: Ravi Murty @ Intel
  Reviewed by: jeff
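  The distinction is between the candidate thread and the remote CPU's
  current thread. A schematic of the corrected check (pc_curthread
  mirrors a real FreeBSD pcpu field name; pc_idlethread and the
  surrounding types are illustrative):

    struct thread_model;

    struct pcpu_model {
            struct thread_model *pc_curthread;   /* running right now */
            struct thread_model *pc_idlethread;  /* that CPU's idler */
    };

    /* Is the remote CPU idle? Ask about *its* current thread -- not
     * about the thread being placed on the run queue, which was the
     * bug this commit fixes. */
    static int
    cpu_idle_p(const struct pcpu_model *pc)
    {
            return (pc->pc_curthread == pc->pc_idlethread);
    }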
* Increase the initial sbuf size for CPU topology dump to something more (ivoras, 2008-11-02, -1/+1)
  usable for newer CPUs. The new value allows 2 x quad core configuration
  dumps to fit within the initial buffer without reallocations.
  Approved by: gnn (mentor) (older version)
  Pointed out by: rdivacky