summaryrefslogtreecommitdiffstats
path: root/sys/kern/sched_4bsd.c
Commit message (Collapse)AuthorAgeFilesLines
* MFC r312426: fix a thread preemption regression in schedulers introducedavg2017-01-231-2/+2
| | | | in r270423
* MFC 308564: Don't place threads on the run queue after waking up other CPUs.jhb2016-12-021-49/+13
| | | | | | | | | | | | | | The other CPU might resume and see a still-empty runq and go back to sleep before sched_add() adds the thread to the runq. This results in a lost wakeup and a potential hang if the system is otherwise completely idle. The race originated due to a micro-optimization (my fault) in 4BSD in that it avoided putting a thread on the run queue if the scheduler was going to preempt to the new thread. To avoid complexity while fixing this race, just drop this optimization. 4BSD now always sets the "owepreempt" flag when a preemption is warranted and defers the actual preemption to the thread_unlock of the caller the same as ULE.
* MFC 303503: Don't treat NOCPU as a valid CPU to CPU_ISSET.jhb2016-08-091-1/+1
| | | | | If a thread is created bound to a cpuset it might already be bound before its very first timeslice, and td_lastcpu will be NOCPU in that case.
* MFC r298819:bdrewery2016-06-271-1/+1
| | | | sys/kern: spelling fixes in comments.
* MFC 286256:jhb2015-10-011-0/+4
| | | | | | | | | | | kgdb uses td_oncpu to determine if a thread is running and should use a pcb from stoppcbs[] rather than the thread's PCB. However, exited threads retained td_oncpu from the last time they ran, and newborn threads had their CPU fields cleared to zero during fork and thread creation since they are in the set of fields zeroed when threads are setup. To fix, explicitly update the CPU fields for exiting threads in sched_throw() to reflect the switch out and reset the CPU fields for new threads in sched_fork_thread() to NOCPU.
* MFC r282213:trasz2015-06-211-1/+1
| | | | | | | | | | | | | | | | | | Add kern.racct.enable tunable and RACCT_DISABLED config option. The point of this is to be able to add RACCT (with RACCT_DISABLED) to GENERIC, to avoid having to rebuild the kernel to use rctl(8). MFC r282901: Build GENERIC with RACCT/RCTL support by default. Note that it still needs to be enabled by adding "kern.racct.enable=1" to /boot/loader.conf. Note those two are MFC-ed together, because the latter one changes the name of RACCT_DISABLED option to RACCT_DEFAULT_TO_DISABLED. Should have committed the renaming separately... Relnotes: yes Sponsored by: The FreeBSD Foundation
* MFC r270423:mav2014-09-061-1/+2
| | | | | | | | | | Restore pre-r239157 handling of sched_yield(), when thread time slice was aborted, allowing other threads to run. Without this change thread is just rescheduled again, that was illustrated by provided test tool. PR: 192926 Submitted by: eric@vangyzen.net Approved by: re (marius)
* MFC r260043:markj2014-02-171-1/+1
| | | | | The arguments to sched:::off-cpu are the thread and associated process of the thread selected to run, not the currently running thread.
* MFC r258622: dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINEavg2014-01-171-16/+16
|
* rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LASTavg2013-07-241-1/+1
| | | | | | | | | | | | | | | | Also directly call swapper() at the end of mi_startup instead of relying on swapper being the last thing in sysinits order. Rationale: - "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage - "scheduler" was misleading, the function swaps in the swapped out processes - another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be invoked depending on its relative order with scheduler; this was not obvious and the bug actually used to exist Reviewed by: kib (ealier version) MFC after: 14 days
* Add CPU percentage limit enforcement to RCTL. The resouce name is "pcpu".trasz2012-10-261-0/+34
| | | | It was implemented by Rudolf Tomori during Google Summer of Code 2012.
* Mark the idle threads as non-sleepable and also assert that an idlejhb2012-08-221-0/+1
| | | | thread never blocks on a turnstile.
* Some more minor tunings inspired by bde@.mav2012-08-111-7/+9
|
* Some minor tunings/cleanups inspired by bde@ after previous commits:mav2012-08-101-14/+29
| | | | | | | | | | - remove extra dynamic variable initializations; - restore (4BSD) and implement (ULE) hogticks variable setting; - make sched_rr_interval() more tolerant to options; - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more user-friendly wrapper for sched_slice; - tune some sysctl descriptions; - make some style fixes.
* Rework r220198 change (by fabient). I believe it solves the problem frommav2012-08-091-4/+6
| | | | | | | | | | | | | | | | | | | | | the wrong direction. Before it, if preemption and end of time slice happen same time, thread was put to the head of the queue as for only preemption. It could cause single thread to run for indefinitely long time. r220198 handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that causes delayed context switch every time preemption happens, even when not needed. Solve problem by introducing scheduler-specifoc thread flag TDF_SLICEEND, set when thread's time slice is over and it should be put to the tail of queue. Using SW_PREEMPT flag for that purpose as it was before just not enough informative to work correctly. On my tests this by 2-3 times reduces run time deviation (improves fairness) in cases when several threads share one CPU. Reviewed by: fabient MFC after: 2 months Sponsored by: iXsystems, Inc.
* SCHED_4BSD scheduling quantum mechanism appears to be broken for some time.mav2012-08-091-33/+36
| | | | | | | | | | | With switchticks variable being reset each time thread preempted (that is done regularly by interrupt threads) scheduling quantum may never expire. It was not noticed in time because several other factors still regularly trigger context switches. Handle the problem by replacing that mechanism with its equivalent from SCHED_ULE called time slice. It is effectively the same, just measured in context of stathz instead of hz. Some unification is probably not bad.
* Fix typo in function name SDT_PROBE4 and unbreak 4BSD UP.pluknet2012-05-151-1/+1
|
* Implement the DTrace sched provider. This implementation aims to berstone2012-05-151-1/+38
| | | | | | | | | | | | | | | | | | | | | | compatible with the sched provider implemented by Solaris and its open- source derivatives. Full documentation of the sched provider can be found on Oracle's DTrace wiki pages. Note that for compatibility with scripts originally written for Solaris, serveral probes are defined that will never fire. These probes are defined to fire when Solaris-specific features perform certain actions. As these features are not present in FreeBSD, the probes can never fire. Also, I have added a two probes that are not defined in Solaris, lend-pri and load-change. These probes have been added to make it possible to collect schedgraph data with DTrace. Finally, a few probes are defined in Solaris to take a cpuinfo_t * argument. As it was not immediately clear to me how to translate that to FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *. Sponsored by: Sandvine Incorporated MFC after: 2 weeks
* Add a new sched_clear_name() method to the scheduler interface to clearjhb2012-03-081-0/+11
| | | | | | | | the cached name used for KTR_SCHED traces when a thread's name changes. This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's name changes, most commonly via execve(). MFC after: 2 weeks
* Some small fixes to CPU accounting for threads:jhb2012-01-031-2/+2
| | | | | | | | | | | | | | | | - Only initialize the per-cpu switchticks and switchtime in sched_throw() for the very first context switch on APs during boot. This avoids a small gap between the middle of thread_exit() and sched_throw() where time is not accounted to any thread. - In thread_exit(), update the timestamp bookkeeping to track the changes to mi_switch() introduced by td_rux so that the code once again matches the comment claiming it is mimicing mi_switch(). Specifically, only update the per-thread stats directly and depend on ruxagg() to update p_rux rather than adjusting p_rux directly. While here, move the timestamp bookkeeping as late in the function as possible. Reviewed by: bde, kib MFC after: 1 week
* Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.ed2011-11-071-1/+2
| | | | | | The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
* Fix format strings for KTR_STATE in 4BSD ad ULE schedulers.delphij2011-08-261-2/+2
| | | | | | | Submitted by: Ivan Klymenko <fidaj@ukr.net> PR: kern/159904, kern/159905 MFC after: 2 weeks Approved by: re (kib)
* With retirement of cpumask_t and usage of cpuset_t for representing aattilio2011-07-041-33/+24
| | | | | | | | | | | | | | | mask of CPUs, pc_other_cpus and pc_cpumask become highly inefficient. Remove them and replace their usage with custom pc_cpuid magic (as, atm, pc_cpumask can be easilly represented by (1 << pc_cpuid) and pc_other_cpus by (all_cpus & ~(1 << pc_cpuid))). This change is not targeted for MFC because of struct pcpu members removal and dependency by cpumask_t retirement. MD review by: marcel, marius, alc Tested by: pluknet MD testing by: marcel, marius, gonzo, andreast
* MFCattilio2011-05-311-2/+2
|
* Commit the support for removing cpumask_t and replacing it directly withattilio2011-05-051-22/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cpuset_t objects. That is going to offer the underlying support for a simple bump of MAXCPU and then support for number of cpus > 32 (as it is today). Right now, cpumask_t is an int, 32 bits on all our supported architecture. cpumask_t on the other side is implemented as an array of longs, and easilly extendible by definition. The architectures touched by this commit are the following: - amd64 - i386 - pc98 - arm - ia64 - XEN while the others are still missing. Userland is believed to be fully converted with the changes contained here. Some technical notes: - This commit may be considered an ABI nop for all the architectures different from amd64 and ia64 (and sparc64 in the future) - per-cpu members, which are now converted to cpuset_t, needs to be accessed avoiding migration, because the size of cpuset_t should be considered unknown - size of cpuset_t objects is different from kernel and userland (this is primirally done in order to leave some more space in userland to cope with KBI extensions). If you need to access kernel cpuset_t from the userland please refer to example in this patch on how to do that correctly (kgdb may be a good source, for example). - Support for other architectures is going to be added soon - Only MAXCPU for amd64 is bumped now The patch has been tested by sbruno and Nicholas Esborn on opteron 4 x 12 pack CPUs. More testing on big SMP is expected to came soon. pluknet tested the patch with his 8-ways on both amd64 and i386. Tested by: pluknet, sbruno, gianni, Nicholas Esborn Reviewed by: jeff, jhb, sbruno
* - Remove the following sysctl:attilio2011-04-301-31/+7
| | | | | | | | | | | | | | kern.sched.ipiwakeup.onecpu kern.sched.ipiwakeup.htt2 Because they are absolutely obsolete. Probabilly the whole wakeup forward mechanism should be revisited for a better fitting in modern hw. - As map2 variable is no longer used rename map3 to map2 - Fix a string by making more informative the msg and removing the arguments passing Approved by: julian
* idle_cpus_mask is just used in the SMP case and within sched_4BSD.attilio2011-04-301-0/+2
| | | | Declare appropriately.
* If the 4BSD scheduler tries to schedule a thread that has been pinned orrstone2011-04-261-19/+21
| | | | | | | | | | bound to an AP before SMP has started, the system will panic when we try to touch per-CPU state for that AP because that state has not been initialized yet. Fix this in the same way as ULE: place all threads in the global run queue before SMP has started. Reviewed by: jhb MFC after: 1 month
* Fix several places to ignore processes that are not yet fully constructed.jhb2011-04-061-0/+4
| | | | MFC after: 1 week
* Clearing the flag when preempting will let the preempted thread runfabient2011-03-311-6/+2
| | | | | | | | | | | | | | | | | too much time. This can finish in a scheduler deadlock with ping-pong between two threads. One sample of this is: - device lapic (to have a preemption point on critical_exit()) - options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq) - running a cpu intensive task (that does not enter the kernel) - only one CPU on SMP or no SMP. As requested by jhb@ 4BSD have received the same type of fix instead of propagating the flag to the new thread. Reviewed by: jhb, jeff MFC after: 1 month
* Rework realtime priority support:jhb2011-01-141-1/+1
| | | | | | | | | | | | | | | | | | | - Move the realtime priority range up above kernel sleep priorities and just below interrupt thread priorities. - Contract the interrupt and kernel sleep priority ranges a bit so that the timesharing priority band can be increased. The new timeshare range is now slightly larger than the old realtime + timeshare ranges. - Change the ULE scheduler to no longer use realtime priorities for interactive threads. Instead, the larger timeshare range is now split into separate subranges for interactive and non-interactive ("batch") threads. The end result is that interactive threads and non-interactive threads still use the same priority ranges as before, but realtime threads now have a separate, dedicated priority range. - Do not modify the priority of non-timeshare threads in sched_sleep() or via cv_broadcastpri(). Realtime and idle priority threads will no longer have their priorities affected by sleeping in the kernel. Reviewed by: jeff
* One more sysctl(9) type-safety that I missed before.mdf2011-01-131-1/+1
|
* - Move sched_fork() later in fork() after the various sections of the newjhb2011-01-061-0/+1
| | | | | | | | | | | | | | thread and proc have been copied and zeroed from the old thread and proc. Otherwise attempts to modify thread or process data in sched_fork() could be undone. - Don't copy td_{base,}_user_pri from the old thread to the new thread in sched_fork_thread() in ULE. This is already done courtesy the bcopy() of the thread copy region. - Always initialize the real priority (td_priority) of new threads to the new thread's base priority (td_base_pri) to avoid bogusly inheriting a borrowed priority from the parent thread. MFC after: 2 weeks
* - Follow r216313, the sched_unlend_user_prio is no longer needed, alwaysdavidxu2010-12-291-17/+5
| | | | | | | use sched_lend_user_prio to set lent priority. - Improve pthread priority-inherit mutex, when a contender's priority is lowered, repropagete priorities, this may cause mutex owner's priority to be lowerd, in old code, mutex owner's priority is rise-only.
* MFp4:davidxu2010-12-091-13/+10
| | | | | | | | | It is possible a lower priority thread lending priority to higher priority thread, in old code, it is ignored, however the lending should always be recorded, add field td_lend_user_pri to fix the problem, if a thread does not have borrowed priority, its value is PRI_MAX. MFC after: 1 week
* After some off-list discussion, revert a number of changes to thedim2010-11-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various people working on the affected files. A better long-term solution is still being considered. This reversal may give some modules empty set_pcpu or set_vnet sections, but these are harmless. Changes reverted: ------------------------------------------------------------------------ r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines Instead of unconditionally emitting .globl's for the __start_set_xxx and __stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu sections are actually defined. ------------------------------------------------------------------------ r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree. ------------------------------------------------------------------------ r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
* Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughoutdim2010-11-141-1/+1
| | | | the tree.
* Refactor timer management code with priority to one-shot operation mode.mav2010-09-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The main goal of this is to generate timer interrupts only when there is some work to do. When CPU is busy interrupts are generating at full rate of hz + stathz to fullfill scheduler and timekeeping requirements. But when CPU is idle, only minimum set of interrupts (down to 8 interrupts per second per CPU now), needed to handle scheduled callouts is executed. This allows significantly increase idle CPU sleep time, increasing effect of static power-saving technologies. Also it should reduce host CPU load on virtualized systems, when guest system is idle. There is set of tunables, also available as writable sysctls, allowing to control wanted event timer subsystem behavior: kern.eventtimer.timer - allows to choose event timer hardware to use. On x86 there is up to 4 different kinds of timers. Depending on whether chosen timer is per-CPU, behavior of other options slightly differs. kern.eventtimer.periodic - allows to choose periodic and one-shot operation mode. In periodic mode, current timer hardware taken as the only source of time for time events. This mode is quite alike to previous kernel behavior. One-shot mode instead uses currently selected time counter hardware to schedule all needed events one by one and program timer to generate interrupt exactly in specified time. Default value depends of chosen timer capabilities, but one-shot mode is preferred, until other is forced by user or hardware. kern.eventtimer.singlemul - in periodic mode specifies how much times higher timer frequency should be, to not strictly alias hardclock() and statclock() events. Default values are 2 and 4, but could be reduced to 1 if extra interrupts are unwanted. kern.eventtimer.idletick - makes each CPU to receive every timer interrupt independently of whether they busy or not. By default this options is disabled. If chosen timer is per-CPU and runs in periodic mode, this option has no effect - all interrupts are generating. As soon as this patch modifies cpu_idle() on some platforms, I have also refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions (if supported) under high sleep/wakeup rate, as fast alternative to other methods. It allows SMP scheduler to wake up sleeping CPUs much faster without using IPI, significantly increasing performance on some highly task-switching loads. Tested by: many (on i386, amd64, sparc64 and powerc) H/W donated by: Gheorghe Ardelean Sponsored by: iXsystems, Inc.
* Merge some SCHED_ULE features to SCHED_4BSD:mav2010-09-111-4/+28
| | | | | | | | | | - Teach SCHED_4BSD to inform cpu_idle() about high sleep/wakeup rate to choose optimized handler. In case of x86 it is MONITOR/MWAIT. Also it will be needed to bypass forthcoming idle tick skipping logic to not consume resources on events rescheduling when it won't give any benefits. - Teach SCHED_4BSD to wake up idle CPUs without using IPI. In case of x86, when MONITOR/MWAIT is active, it require just single memory write. This doubles performance on some heavily switching test loads.
* Add a new ipi_cpu() function to the MI IPI API that can be used to send anjhb2010-08-061-4/+4
| | | | | | | | | | | | IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that constructed a mask for a single CPU with calls to ipi_cpu() instead. This will matter more in the future when we transition from cpumask_t to cpuset_t for CPU masks in which case building a CPU mask is more expensive. Submitted by: peter, sbruno Reviewed by: rookie Obtained from: Yahoo! (x86) MFC after: 1 month
* Update several places that iterate over CPUs to use CPU_FOREACH().jhb2010-06-111-6/+2
|
* Assert that the thread lock is held in sched_pctcpu() instead ofjhb2010-06-031-0/+1
| | | | | | | recursively acquiring it. All of the current callers already hold the lock. MFC after: 1 month
* Assert that the thread passed to sched_bind() and sched_unbind() isjhb2010-05-211-3/+3
| | | | | | curthread as those routines are only supported for curthread currently. MFC after: 1 month
* Split out an invariant in order to better check that newtd, whenattilio2010-01-241-2/+4
| | | | | | | | | provided, must be on a runqueue. Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com> MFC: 2 weeks X-MFC: r202889
* - Fix a race in sched_switch() of sched_4bsd.attilio2010-01-231-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In the case of the thread being on a sleepqueue or a turnstile, the sched_lock was acquired (without the aid of the td_lock interface) and the td_lock was dropped. This was going to break locking rules on other threads willing to access to the thread (via the td_lock interface) and modify his flags (allowed as long as the container lock was different by the one used in sched_switch). In order to prevent this situation, while sched_lock is acquired there the td_lock gets blocked. [0] - Merge the ULE's internal function thread_block_switch() into the global thread_lock_block() and make the former semantic as the default for thread_lock_block(). This means that thread_lock_block() will not disable interrupts when called (and consequently thread_unlock_block() will not re-enabled them when called). This should be done manually when necessary. Note, however, that ULE's thread_unblock_switch() is not reaped because it does reflect a difference in semantic due in ULE (the td_lock may not be necessarilly still blocked_lock when calling this). While asymmetric, it does describe a remarkable difference in semantic that is good to keep in mind. [0] Reported by: Kohji Okuno <okuno dot kohji at jp dot panasonic dot com> Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com> MFC: 2 weeks
* - Fix a bug in sched_4bsd where the timestamp for the sleeping operationattilio2010-01-081-1/+1
| | | | | | | | | | | | | | is not cleaned up on the wakeup but reset. This is harmless mostly because td_slptick (and ki_slptime from userland) should be analyzed only with the assumption that the thread is actually sleeping (thus while the td_slptick is correctly set) but without this invariant the number is nomore consistent. - Move td_slptick from u_int to int in order to follow 'ticks' signedness and wrap up accordingly [0] [0] Submitted by: emaste Sponsored by: Sandvine Incorporated MFC 1 week
* Allow swap out of the kernel stack for the thread with priority greaterkib2009-12-311-1/+1
| | | | | | | | | | | | | or equial then PSOCK, not less or equial. Higher priority has lesser numerical value. Existing test does not allow for swapout of the thread waiting for advisory lock, for exiting child or sleeping for timeout. On the other hand, high-priority waiters of VFS/VM events can be swapped out. Tested by: pho Reviewed by: jhb MFC after: 1 week
* Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).attilio2009-11-031-8/+8
| | | | | | | | | | This improvements aims for avoiding further cache-misses in scheduler specific functions which need to keep track of average thread running time and further locking in places setting for this flag. Reported by: jeff (originally), kris (currently) Reviewed by: jhb Tested by: Giuseppe Cocomazzi <sbudella at email dot it>
* - Use __XSTRING where I want the define to be expanded. This resulted injeff2009-01-251-1/+1
| | | | | | | sizeof("MAXCPU") being used to calculate a string length rather than something more reasonable such as sizeof("32"). This shouldn't have caused any ill effect until we run on machines with 1000000 or more cpus.
* - Implement generic macros for producing KTR records that are compatiblejeff2009-01-171-18/+54
| | | | | | | | | | | | with src/tools/sched/schedgraph.py. This allows developers to quickly create a graphical view of ktr data for any resource in the system. - Add sched_tdname() and the pcpu field 'name' for quickly and uniformly identifying records associated with a thread or cpu. - Reimplement the KTR_SCHED traces using the new generic facility. Obtained from: attilio Discussed with: jhb Sponsored by: Nokia
OpenPOWER on IntegriCloud