op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	sched: Cure load average vs NO_HZ woes	Peter Zijlstra	2010-04-23	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Chase reported that due to us decrementing calc_load_task prematurely (before the next LOAD_FREQ sample), the load average could be scewed by as much as the number of CPUs in the machine. This patch, based on Chase's patch, cures the problem by keeping the delta of the CPU going into NO_HZ idle separately and folding that in on the next LOAD_FREQ update. This restores the balance and we get strict LOAD_FREQ period samples. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1271934490.1776.343.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: Add enqueue/dequeue flags	Peter Zijlstra	2010-04-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In order to reduce the dependency on TASK_WAKING rework the enqueue interface to support a proper flags field. Replace the int wakeup, bool head arguments with an int flags argument and create the following flags: ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task, ENQUEUE_WAKING - the enqueue has relative vruntime due to having sched_class::task_waking() called, ENQUEUE_HEAD - the waking task should be places on the head of the priority queue (where appropriate). For symmetry also convert sched_class::dequeue() to a flags scheme. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: Fix TASK_WAKING vs fork deadlock	Peter Zijlstra	2010-04-02	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Oleg noticed a few races with the TASK_WAKING usage on fork. - since TASK_WAKING is basically a spinlock, it should be IRQ safe - since we set TASK_WAKING () without holding rq->lock it could be there still is a rq->lock holder, thereby not actually providing full serialization. () in fact we clear PF_STARTING, which in effect enables TASK_WAKING. Cure the second issue by not setting TASK_WAKING in sched_fork(), but only temporarily in wake_up_new_task() while calling select_task_rq(). Cure the first by holding rq->lock around the select_task_rq() call, this will disable IRQs, this however requires that we push down the rq->lock release into select_task_rq_fair()'s cgroup stuff. Because select_task_rq_fair() still needs to drop the rq->lock we cannot fully get rid of TASK_WAKING. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: Remove the sched_class load_balance methods	Peter Zijlstra	2010-01-21	1	-21/+0
\| \| \| \| \| \| \| \|	Take out the sched_class methods for load-balancing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: Don't expose local functions	H Hartley Sweeten	2010-01-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	kernel/sched: don't expose local functions The get_rr_interval_* functions are all class methods of struct sched_class. They are not exported so make them static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <201001132021.53253.hartleys@visionengravers.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: Restore printk sanity	Peter Zijlstra	2009-12-20	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Revert the braindead pr_* crap. (Commit 663997d "sched: Use pr_fmt() and pr_<level>()") It's dumb and causes stupid "sched: " strings all over the place. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Mike Galbraith <efault@gmx.de> Cc: Joe Perches <joe@perches.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <1261315437.4314.6.camel@laptop> [ i dont mind the pr_*() patterns that much - but Peter dislikes them with a vengence. ] [ - v2: remove spurious diffstat from changelog :-/ ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	Merge branch 'linus' into sched/urgent	Ingo Molnar	2009-12-16	1	-2/+2
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Conflicts: kernel/sched_idletask.c Merge reason: resolve the conflicts, pick up latest changes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
\| *	sched: Convert rq->lock to raw_spinlock	Thomas Gleixner	2009-12-14	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>
* \|	sched: Use pr_fmt() and pr_<level>()	Joe Perches	2009-12-13	1	-1/+1
\|/ \| \| \| \| \| \| \| \| \| \| \|	- Convert printk(KERN_<level> to pr_<level> (not KERN_DEBUG) - Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt - Coalesce long format strings - Add missing \n to "ERROR: !SD_LOAD_BALANCE domain has parent" Signed-off-by: Joe Perches <joe@perches.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1260655047.2637.7.camel@Joe-Laptop.home> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: Protect sched_rr_get_param() access to task->sched_class	Thomas Gleixner	2009-12-09	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	sched_rr_get_param calls task->sched_class->get_rr_interval(task) without protection against a concurrent sched_setscheduler() call which modifies task->sched_class. Serialize the access with task_rq_lock(task) and hand the rq pointer into get_rr_interval() as it's needed at least in the sched_fair implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <alpine.LFD.2.00.0912090930120.3089@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: Simplify sys_sched_rr_get_interval() system call	Peter Williams	2009-09-21	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \|	By removing the need for it to know details of scheduling classes. This allows PlugSched to define orthogonal scheduling classes. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <06d1b89ee15a0eef82d7.1253496713@mudlark.pw.nest> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: Rename sync arguments	Peter Zijlstra	2009-09-15	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	In order to extend the functions to have more than 1 flag (sync), rename the argument to flags, and explicitly define a WF_ space for individual flags. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: Rename select_task_rq() argument	Peter Zijlstra	2009-09-15	1	-1/+1
\| \| \| \| \| \| \| \| \|	In order to be able to rename the sync argument, we need to rename the current flag argument. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: Hook sched_balance_self() into sched_class::select_task_rq()	Peter Zijlstra	2009-09-15	1	-1/+1
\| \| \| \| \| \| \| \| \|	Rather ugly patch to fully place the sched_balance_self() code inside the fair class. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched, timers: move calc_load() to scheduler	Thomas Gleixner	2009-05-15	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Dimitri Sivanich noticed that xtime_lock is held write locked across calc_load() which iterates over all online CPUs. That can cause long latencies for xtime_lock readers on large SMP systems. The load average calculation is an rough estimate anyway so there is no real need to protect the readers vs. the update. It's not a problem when the avenrun array is updated while a reader copies the values. Instead of iterating over all online CPUs let the scheduler_tick code update the number of active tasks shortly before the avenrun update happens. The avenrun update itself is handled by the CPU which calls do_timer(). [ Impact: reduce xtime_lock write locked section ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org>
*	sched: add CONFIG_SMP consistency	Li Zefan	2008-10-22	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	a patch from Henrik Austad did this: >> Do not declare select_task_rq as part of sched_class when CONFIG_SMP is >> not set. Peter observed: > While a proper cleanup, could you do it by re-arranging the methods so > as to not create an additional ifdef? Do not declare select_task_rq and some other methods as part of sched_class when CONFIG_SMP is not set. Also gather those methods to avoid CONFIG_SMP mess. Idea-by: Henrik Austad <henrik.austad@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Henrik Austad <henrik@austad.us> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: wakeup preempt when small overlap	Peter Zijlstra	2008-09-22	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Lin Ming reported a 10% OLTP regression against 2.6.27-rc4. The difference seems to come from different preemption agressiveness, which affects the cache footprint of the workload and its effective cache trashing. Aggresively preempt a task if its avg overlap is very small, this should avoid the task going to sleep and find it still running when we schedule back to it - saving a wakeup. Reported-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: make rt_sched_class, idle_sched_class static	Harvey Harrison	2008-05-05	1	-1/+1
\| \| \| \| \| \| \| \| \|	The C files are included directly in sched.c, so they are effectively static. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: high-res preemption tick	Peter Zijlstra	2008-01-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use HR-timers (when available) to deliver an accurate preemption tick. The regular scheduler tick that runs at 1/HZ can be too coarse when nice level are used. The fairness system will still keep the cpu utilisation 'fair' by then delaying the task that got an excessive amount of CPU time but try to minimize this by delivering preemption points spot-on. The average frequency of this extra interrupt is sched_latency / nr_latency. Which need not be higher than 1/HZ, its just that the distribution within the sched_latency period is important. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: RT-balance, add new methods to sched_class	Steven Rostedt	2008-01-25	1	-0/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Dmitry Adamushko found that the current implementation of the RT balancing code left out changes to the sched_setscheduler and rt_mutex_setprio. This patch addresses this issue by adding methods to the schedule classes to handle being switched out of (switched_from) and being switched into (switched_to) a sched_class. Also a method for changing of priorities is also added (prio_changed). This patch also removes some duplicate logic between rt_mutex_setprio and sched_setscheduler. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: de-SCHED_OTHER-ize the RT path	Gregory Haskins	2008-01-25	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current wake-up code path tries to determine if it can optimize the wake-up to "this_cpu" by computing load calculations. The problem is that these calculations are only relevant to SCHED_OTHER tasks where load is king. For RT tasks, priority is king. So the load calculation is completely wasted bandwidth. Therefore, we create a new sched_class interface to help with pre-wakeup routing decisions and move the load calculation as a function of CFS task's class. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: isolate SMP balancing code a bit more	Peter Williams	2007-10-24	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At the moment, a lot of load balancing code that is irrelevant to non SMP systems gets included during non SMP builds. This patch addresses this issue and reduces the binary size on non SMP systems: text data bss dec hex filename 10983 28 1192 12203 2fab sched.o.before 10739 28 1192 11959 2eb7 sched.o.after Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: reduce balance-tasks overhead	Peter Williams	2007-10-24	1	-3/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At the moment, balance_tasks() provides low level functionality for both move_tasks() and move_one_task() (indirectly) via the load_balance() function (in the sched_class interface) which also provides dual functionality. This dual functionality complicates the interfaces and internal mechanisms and makes the run time overhead of operations that are called with two run queue locks held. This patch addresses this issue and reduces the overhead of these operations. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: mark scheduling classes as const	Ingo Molnar	2007-10-15	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \|	mark scheduling classes as const. The speeds up the code a bit and shrinks it: text data bss dec hex filename 40027 4018 292 44337 ad31 sched.o.before 40190 3842 292 44324 ad24 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
*	sched: revert recent removal of set_curr_task()	Srivatsa Vaddagiri	2007-10-15	1	-0/+5
\| \| \| \| \| \| \| \| \| \|	Revert removal of set_curr_task. Use put_prev_task/set_curr_task when changing groups/policies Signed-off-by: Srivatsa Vaddagiri < vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
*	sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()	Dmitry Adamushko	2007-10-15	1	-5/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	rework enqueue/dequeue_entity() to get rid of sched_class::set_curr_task(). This simplifies sched_setscheduler(), rt_mutex_setprio() and sched_move_tasks(). text data bss dec hex filename 24330 2734 20 27084 69cc sched.o.before 24233 2730 20 26983 6967 sched.o.after Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
*	sched: group-scheduler core	Srivatsa Vaddagiri	2007-10-15	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \|	Add interface to control cpu bandwidth allocation to task-groups. (not yet configurable, due to missing CONFIG_CONTAINERS) Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
*	sched: remove the 'u64 now' parameter from ->put_prev_task()	Ingo Molnar	2007-08-09	1	-1/+1
\| \| \| \| \| \| \| \|	remove the 'u64 now' parameter from ->put_prev_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: remove the 'u64 now' parameter from ->pick_next_task()	Ingo Molnar	2007-08-09	1	-1/+1
\| \| \| \| \| \| \| \|	remove the 'u64 now' parameter from ->pick_next_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: remove the 'u64 now' parameter from ->dequeue_task()	Ingo Molnar	2007-08-09	1	-1/+1
\| \| \| \| \| \| \| \|	remove the 'u64 now' parameter from ->dequeue_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: fix bug in balance_tasks()	Peter Williams	2007-08-09	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are two problems with balance_tasks() and how it used: 1. The variables best_prio and best_prio_seen (inherited from the old move_tasks()) were only required to handle problems caused by the active/expired arrays, the order in which they were processed and the possibility that the task with the highest priority could be on either. These issues are no longer present and the extra overhead associated with their use is unnecessary (and possibly wrong). 2. In the absence of CONFIG_FAIR_GROUP_SCHED being set, the same this_best_prio variable needs to be used by all scheduling classes or there is a risk of moving too much load. E.g. if the highest priority task on this at the beginning is a fairly low priority task and the rt class migrates a task (during its turn) then that moved task becomes the new highest priority task on this_rq but when the sched_fair class initializes its copy of this_best_prio it will get the priority of the original highest priority task as, due to the run queue locks being held, the reschedule triggered by pull_task() will not have taken place. This could result in inappropriate overriding of skip_for_load and excessive load being moved. The attached patch addresses these problems by deleting all reference to best_prio and best_prio_seen and making this_best_prio a reference parameter to the various functions involved. load_balance_fair() has also been modified so that this_best_prio is only reset (in the loop) if CONFIG_FAIR_GROUP_SCHED is set. This should preserve the effect of helping spread groups' higher priority tasks around the available CPUs while improving system performance when CONFIG_FAIR_GROUP_SCHED isn't set. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: simplify move_tasks()	Peter Williams	2007-08-09	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The move_tasks() function is currently multiplexed with two distinct capabilities: 1. attempt to move a specified amount of weighted load from one run queue to another; and 2. attempt to move a specified number of tasks from one run queue to another. The first of these capabilities is used in two places, load_balance() and load_balance_idle(), and in both of these cases the return value of move_tasks() is used purely to decide if tasks/load were moved and no notice of the actual number of tasks moved is taken. The second capability is used in exactly one place, active_load_balance(), to attempt to move exactly one task and, as before, the return value is only used as an indicator of success or failure. This multiplexing of sched_task() was introduced, by me, as part of the smpnice patches and was motivated by the fact that the alternative, one function to move specified load and one to move a single task, would have led to two functions of roughly the same complexity as the old move_tasks() (or the new balance_tasks()). However, the new modular design of the new CFS scheduler allows a simpler solution to be adopted and this patch addresses that solution by: 1. adding a new function, move_one_task(), to be used by active_load_balance(); and 2. making move_tasks() a single purpose function that tries to move a specified weighted load and returns 1 for success and 0 for failure. One of the consequences of these changes is that neither move_one_task() or the new move_tasks() care how many tasks sched_class.load_balance() moves and this enables its interface to be simplified by returning the amount of load moved as its result and removing the load_moved pointer from the argument list. This helps simplify the new move_tasks() and slightly reduces the amount of work done in each of sched_class.load_balance()'s implementations. Further simplification, e.g. changes to balance_tasks(), are possible but (slightly) complicated by the special needs of load_balance_fair() so I've left them to a later patch (if this one gets accepted). NB Since move_tasks() gets called with two run queue locks held even small reductions in overhead are worthwhile. [ mingo@elte.hu ] this change also reduces code size nicely: text data bss dec hex filename 39216 3618 24 42858 a76a sched.o.before 39173 3618 24 42815 a73f sched.o.after Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: cfs core, kernel/sched_idletask.c	Ingo Molnar	2007-07-09	1	-0/+71
	add kernel/sched_idletask.c - which implements the idle thread scheduling class. This further simplifies sched.c (under CFS), for example a number of 'if (p == rq->idle)' type of special-cases can be removed from sched.c, and schedule() gets simpler too. Signed-off-by: Ingo Molnar <mingo@elte.hu>