summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | | kgdb: Do not access xtime directlyThomas Gleixner2010-07-291-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The xtime cleanup missed the kgdb access to xtime. Fix it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | clocksource: Add __clocksource_updatefreq_hz/khz methodsJohn Stultz2010-07-271-5/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To properly handle clocksources that change frequencies at the clocksource->enable() point, this patch adds a method that will update the clocksource's mult/shift and max_idle_ns values. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-12-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | timekeeping: Make xtime and wall_to_monotonic staticJohn Stultz2010-07-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch makes xtime and wall_to_monotonic static, as planned in Documentation/feature-removal-schedule.txt. This will allow for further cleanups to the timekeeping core. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-10-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | hrtimer: Cleanup direct access to wall_to_monotonicJohn Stultz2010-07-272-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provides an accessor function to replace hrtimer.c's direct access of wall_to_monotonic. This will allow wall_to_monotonic to be made static as planned in Documentation/feature-removal-schedule.txt Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-9-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | timkeeping: Fix update_vsyscall to provide wall_to_monotonic offsetJohn Stultz2010-07-271-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | update_vsyscall() did not provide the wall_to_monotoinc offset, so arch specific implementations tend to reference wall_to_monotonic directly. This limits future cleanups in the timekeeping core, so this patch fixes the update_vsyscall interface to provide wall_to_monotonic, allowing wall_to_monotonic to be made static as planned in Documentation/feature-removal-schedule.txt Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Anton Blanchard <anton@samba.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Tony Luck <tony.luck@intel.com> LKML-Reference: <1279068988-21864-7-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | time: Kill off CONFIG_GENERIC_TIMEJohn Stultz2010-07-275-73/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that all arches have been converted over to use generic time via clocksources or arch_gettimeoffset(), we can remove the GENERIC_TIME config option and simplify the generic code. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-4-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | time: Implement timespec_addJohn Stultz2010-07-271-3/+3
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After accidentally misusing timespec_add_safe, I wanted to make sure we don't accidently trip over that issue again, so I created a simple timespec_add() function which we can use to replace the instances of timespec_add_safe() that don't want the overflow detection. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-3-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | | Merge branch 'timers-for-linus' of ↵Linus Torvalds2010-08-063-12/+35
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: Documentation: Add timers/timers-howto.txt timer: Added usleep_range timer Revert "timer: Added usleep[_range] timer" clockevents: Remove the per cpu tick skew posix_timer: Move copy_to_user(created_timer_id) down in timer_create() timer: Added usleep[_range] timer timers: Document meaning of deferrable timer
| * | | | timer: Added usleep_range timerPatrick Pannuto2010-08-041-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | usleep_range is a finer precision implementations of msleep and is designed to be a drop-in replacement for udelay where a precise sleep / busy-wait is unnecessary. Since an easy interface to hrtimers could lead to an undesired proliferation of interrupts, we provide only a "range" API, forcing the caller to think about an acceptable tolerance on both ends and hopefully avoiding introducing another interrupt. INTRO As discussed here ( http://lkml.org/lkml/2007/8/3/250 ), msleep(1) is not precise enough for many drivers (yes, sleep precision is an unfair notion, but consistently sleeping for ~an order of magnitude greater than requested is worth fixing). This patch adds a usleep API so that udelay does not have to be used. Obviously not every udelay can be replaced (those in atomic contexts or being used for simple bitbanging come to mind), but there are many, many examples of mydriver_write(...) /* Wait for hardware to latch */ udelay(100) in various drivers where a busy-wait loop is neither beneficial nor necessary, but msleep simply does not provide enough precision and people are using a busy-wait loop instead. CONCERNS FROM THE RFC Why is udelay a problem / necessary? Most callers of udelay are in device/ driver initialization code, which is serial... As I see it, there is only benefit to sleeping over a delay; the notion of "refactoring" areas that use udelay was presented, but I see usleep as the refactoring. Consider i2c, if the bus is busy, you need to wait a bit (say 100us) before trying again, your current options are: * udelay(100) * msleep(1) <-- As noted above, actually as high as ~20ms on some platforms, so not really an option * Manually set up an hrtimer to try again in 100us (which is what usleep does anyway...) People choose the udelay route because it is EASY; we need to provide a better easy route. Device / driver / boot code is *currently* serial, but every few months someone makes noise about parallelizing boot, and IMHO, a little forward-thinking now is one less thing to worry about if/when that ever happens udelay's could be preempted Sure, but if udelay plans on looping 1000 times, and it gets preempted on loop 200, whenever it's scheduled again, it is going to do the next 800 loops. Is the interruptible case needed? Probably not, but I see usleep as a very logical parallel to msleep, so it made sense to include the "full" API. Processors are getting faster (albeit not as quickly as they are becoming more parallel), so if someone wanted to be interruptible for a few usecs, why not let them? If this is a contentious point, I'm happy to remove it. OTHER THOUGHTS I believe there is also value in exposing the usleep_range option; it gives the scheduler a lot more flexibility and allows the programmer to express his intent much more clearly; it's something I would hope future driver writers will take advantage of. To get the results in the NUMBERS section below, I literally s/udelay/usleep the kernel tree; I had to go in and undo the changes to the USB drivers, but everything else booted successfully; I find that extremely telling in and of itself -- many people are using a delay API where a sleep will suit them just fine. SOME ATTEMPTS AT NUMBERS It turns out that calculating quantifiable benefit on this is challenging, so instead I will simply present the current state of things, and I hope this to be sufficient: How many udelay calls are there in 2.6.35-rc5? udealy(ARG) >= | COUNT 1000 | 319 500 | 414 100 | 1146 20 | 1832 I am working on Android, so that is my focus for this. The following table is a modified usleep that simply printk's the amount of time requested to sleep; these tests were run on a kernel with udelay >= 20 --> usleep "boot" is power-on to lock screen "power collapse" is when the power button is pushed and the device suspends "resume" is when the power button is pushed and the lock screen is displayed (no touchscreen events or anything, just turning on the display) "use device" is from the unlock swipe to clicking around a bit; there is no sd card in this phone, so fail loading music, video, camera ACTION | TOTAL NUMBER OF USLEEP CALLS | NET TIME (us) boot | 22 | 1250 power-collapse | 9 | 1200 resume | 5 | 500 use device | 59 | 7700 The most interesting category to me is the "use device" field; 7700us of busy-wait time that could be put towards better responsiveness, or at the least less power usage. Signed-off-by: Patrick Pannuto <ppannuto@codeaurora.org> Cc: apw@canonical.com Cc: corbet@lwn.net Cc: arjan@linux.intel.com Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | Revert "timer: Added usleep[_range] timer"Thomas Gleixner2010-08-041-22/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 22b8f15c2f7130bb0386f548428df2ffd4e81903 to merge an advanced version. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | clockevents: Remove the per cpu tick skewArjan van de Ven2010-08-021-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically, Linux has tried to make the regular timer tick on the various CPUs not happen at the same time, to avoid contention on xtime_lock. Nowadays, with the tickless kernel, this contention no longer happens since time keeping and updating are done differently. In addition, this skew is actually hurting power consumption in a measurable way on many-core systems. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20100727210210.58d3118c@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | posix_timer: Move copy_to_user(created_timer_id) down in timer_create()Andrey Vagin2010-07-231-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to Oleg Nesterov: We can move copy_to_user(created_timer_id) down after "if (timer_event_spec)" block too. (but before CLOCK_DISPATCH(), of course). Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Andrey Vagin <avagin@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | timer: Added usleep[_range] timerPatrick Pannuto2010-07-231-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | usleep[_range] are finer precision implementations of msleep and are designed to be drop-in replacements for udelay where a precise sleep / busy-wait is unnecessary. They also allow an easy interface to specify slack when a precise (ish) wakeup is unnecessary to help minimize wakeups Signed-off-by: Patrick Pannuto <ppannuto@codeaurora.org> Cc: akinobu.mita@gmail.com Cc: sboyd@codeaurora.org Acked-by: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <4C44CDD2.1070708@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | timers: Document meaning of deferrable timerJ. Bruce Fields2010-07-231-2/+7
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Steal some text from 6e453a67510 "Add support for deferrable timers". A reader shouldn't have to dig through the git logs for the basic description of a deferrable timer. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: johnstul@us.ibm.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6Linus Torvalds2010-08-062-2/+13
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (28 commits) driver core: device_rename's new_name can be const sysfs: Remove owner field from sysfs struct attribute powerpc/pci: Remove owner field from attribute initialization in PCI bridge init regulator: Remove owner field from attribute initialization in regulator core driver leds: Remove owner field from attribute initialization in bd2802 driver scsi: Remove owner field from attribute initialization in ARCMSR driver scsi: Remove owner field from attribute initialization in LPFC driver cgroupfs: create /sys/fs/cgroup to mount cgroupfs on Driver core: Add BUS_NOTIFY_BIND_DRIVER driver core: fix memory leak on one error path in bus_register() debugfs: no longer needs to depend on SYSFS sysfs: Fix one more signature discrepancy between sysfs implementation and docs. sysfs: fix discrepancies between implementation and documentation dcdbas: remove a redundant smi_data_buf_free in dcdbas_exit dmi-id: fix a memory leak in dmi_id_init error path sysfs: sysfs_chmod_file's attr can be const firmware: Update hotplug script Driver core: move platform device creation helpers to .init.text (if MODULE=n) Driver core: reduce duplicated code for platform_device creation Driver core: use kmemdup in platform_device_add_resources ...
| * | | | cgroupfs: create /sys/fs/cgroup to mount cgroupfs onGreg KH2010-08-051-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We really shouldn't be asking userspace to create new root filesystems. So follow along with all of the other in-kernel filesystems, and provide a mount point in sysfs. For cgroupfs, this should be in /sys/fs/cgroup/ This change provides that mount point when the cgroup filesystem is registered in the kernel. Acked-by: Paul Menage <menage@google.com> Acked-by: Dhaval Giani <dhaval.giani@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Lennart Poettering <lennart@poettering.net> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
| * | | | hotplug: Support kernel/hotplug sysctl variable when !CONFIG_NETIan Abbott2010-08-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel/hotplug sysctl variable (/proc/sys/kernel/hotplug file) was made conditional on CONFIG_NET by commit f743ca5e10f4145e0b3e6d11b9b46171e16af7ce (applied in 2.6.18) to fix problems with undefined references in 2.6.16 when CONFIG_HOTPLUG=y && !CONFIG_NET, but this restriction is no longer needed. This patch makes the kernel/hotplug sysctl variable depend only on CONFIG_HOTPLUG. Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Acked-by: Randy Dunlap <randy.dunlap@oracle.COM> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* | | | | Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2010-08-0620-376/+800
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits) sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug sched: No need for bootmem special cases sched: Revert nohz_ratelimit() for now sched: Reduce update_group_power() calls sched: Update rq->clock for nohz balanced cpus sched: Fix spelling of sibling sched, cpuset: Drop __cpuexit from cpu hotplug callbacks sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check() sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand() sched: thread_group_cputime: Simplify, document the "alive" check sched: Remove the obsolete exit_state/signal hacks sched: task_tick_rt: Remove the obsolete ->signal != NULL check sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value lockless sched: Fix comments to make them DocBook happy sched: Fix fix_small_capacity powerpc: Exclude arch_sd_sibiling_asym_packing() on UP powerpc: Enable asymmetric SMT scheduling on POWER7 sched: Add asymmetric group packing option for sibling domain sched: Fix capacity calculations for SMT4 sched: Change nohz idle load balancing logic to push model ...
| * \ \ \ \ Merge branch 'sched/urgent' into sched/coreIngo Molnar2010-08-052-11/+1
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: include/linux/sched.h Merge reason: Add the leftover .35 urgent bits, fix the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * | | | | sched: Revert nohz_ratelimit() for nowPeter Zijlstra2010-07-172-11/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Norbert reported that nohz_ratelimit() causes his laptop to burn about 4W (40%) extra. For now back out the change and see if we can adjust the power management code to make better decisions. Reported-by: Norbert Preining <preining@logic.at> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | sched: Use correct macro to display sched_child_runs_first in /proc/sched_debugJosh Hunt2010-07-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sched_child_runs_first value in /proc/sched_debug is currently displayed using a macro meant to split ns time values. This patch uses the correct macro to display it as a plain decimal value. Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: peterz@infradead.org Cc: juhlenko@akamai.com Cc: efault@gmx.de LKML-Reference: <1279567876-25418-1-git-send-email-johunt@akamai.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | Merge branch 'linus' into sched/coreIngo Molnar2010-07-2113-59/+87
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge reason: Move from the -rc3 to the almost-rc6 base. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: No need for bootmem special casesPekka Enberg2010-07-173-19/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As of commit dcce284 ("mm: Extend gfp masking to the page allocator") and commit 7e85ee0 ("slab,slub: don't enable interrupts during early boot"), the slab allocator makes sure we don't attempt to sleep during boot. Therefore, remove bootmem special cases from the scheduler and use plain GFP_KERNEL instead. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1279225102-2572-1-git-send-email-penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: Reduce update_group_power() callsPeter Zijlstra2010-07-171-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we update cpu_power() too often, update_group_power() only updates the local group's cpu_power but it gets called for all groups. Furthermore, CPU_NEWLY_IDLE invocations will result in all cpus calling it, even though a slow update of cpu_power is sufficient. Therefore move the update under 'idle != CPU_NEWLY_IDLE && local_group' to reduce superfluous invocations. Reported-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <1278612989.1900.176.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: Update rq->clock for nohz balanced cpusSuresh Siddha2010-07-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Suresh spotted that we don't update the rq->clock in the nohz load-balancer path. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1278626014.2834.74.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: Fix spelling of siblingMichael Neuling2010-06-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No logic changes, only spelling. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: linuxppc-dev@ozlabs.org Cc: David Howells <dhowells@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <15249.1277776921@neuling.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched, cpuset: Drop __cpuexit from cpu hotplug callbacksTejun Heo2010-06-222-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 3a101d05 (sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining) added hotplug notifiers marked with __cpuexit; however, ia64 drops text in __cpuexit during link unlike x86. This means that functions which are referenced during init but used only for cpu hot unplugging afterwards shouldn't be marked with __cpuexit. Drop __cpuexit from those functions. Reported-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <4C1FDF5B.1040301@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check()Oleg Nesterov2010-06-181-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fastpath_timer_check()->thread_group_cputimer() is racy and unneeded. It is racy because another thread can clear ->running before thread_group_cputimer() takes cputimer->lock. In this case thread_group_cputimer() will set ->running = true again and call thread_group_cputime(). But since we do not hold tasklist or siglock, we can race with fork/exit and copy the wrong results into cputimer->cputime. It is unneeded because if ->running == true we can just use the numbers in cputimer->cputime we already have. Change fastpath_timer_check() to copy cputimer->cputime into the local variable under cputimer->lock. We do not re-check ->running under cputimer->lock, run_posix_cpu_timers() does this check later. Note: we can add more optimizations on top of this change. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100611180446.GA13025@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()Oleg Nesterov2010-06-181-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | run_posix_cpu_timers() doesn't work if current has already passed exit_notify(). This was needed to prevent the races with do_wait(). Since ea6d290c ->signal is always valid and can't go away. We can remove the "tsk->exit_state == 0" in fastpath_timer_check() and convert run_posix_cpu_timers() to use lock_task_sighand(). Note: it makes sense to take group_leader's sighand instead, the sub-thread still uses CPU after release_task(). But we need more changes to do this. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100610231018.GA25942@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: thread_group_cputime: Simplify, document the "alive" checkOleg Nesterov2010-06-181-14/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | thread_group_cputime() looks as if it is rcu-safe, but in fact this was wrong until ea6d290c which pins task->signal to task_struct. It checks ->sighand != NULL under rcu, but this can't help if ->signal can go away. Fortunately the caller either holds ->siglock, or it is fastpath_timer_check() which uses current and checks exit_state == 0. - Since ea6d290c commit tsk->signal is stable, we can read it first and avoid the initialization from INIT_CPUTIME. - Even if tsk->signal is always valid, we still have to check it is safe to use next_thread() under rcu_read_lock(). Currently the code checks ->sighand != NULL, change it to use pid_alive() which is commonly used to ensure the task wasn't unhashed before we take rcu_read_lock(). Add the comment to explain this check. - Change the main loop to use the while_each_thread() helper. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100610230956.GA25921@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: Remove the obsolete exit_state/signal hacksOleg Nesterov2010-06-181-24/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | account_group_xxx() functions check ->exit_state to ensure that current->signal is valid and can't go away. This is not needed since ea6d290c, task->signal is pinned to task_struct. The comment and another hack in account_group_exec_runtime() refers to task_rq_unlock_wait() which was already removed by b7b8ff63. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100610230952.GA25914@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: task_tick_rt: Remove the obsolete ->signal != NULL checkOleg Nesterov2010-06-181-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the obsolete ->signal != NULL check in watchdog(). Since ea6d290c ->signal can't be NULL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100610230948.GA25911@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value locklessOleg Nesterov2010-06-181-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __sched_setscheduler() takes lock_task_sighand() to access task->signal. This is not needed since ea6d290c, ->signal can't go away. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100610230944.GA25903@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: Fix comments to make them DocBook happyMichael Neuling2010-06-181-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Docbook fails in sched_fair.c due to comments added in the asymmetric packing patch series. This fixes these errors. No code changes. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <24737.1276135581@neuling.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: Fix fix_small_capacityMichael Neuling2010-06-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The CPU power test is the wrong way around in fix_small_capacity. This was due to a small changes made in the posted patch on lkml to what was was taken upstream. This patch fixes asymmetric packing for POWER7. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <12629.1276124617@neuling.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | Merge commit 'v2.6.35-rc3' into sched/coreIngo Molnar2010-06-181-1/+4
| |\ \ \ \ \ \ \ | | | |_|_|_|/ / | | |/| | | | | | | | | | | | | Merge reason: Update to the latest -rc.
| * | | | | | | sched: Add asymmetric group packing option for sibling domainMichael Neuling2010-06-091-17/+122
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check to see if the group is packed in a sched doman. This is primarily intended to used at the sibling level. Some cores like POWER7 prefer to use lower numbered SMT threads. In the case of POWER7, it can move to lower SMT modes only when higher threads are idle. When in lower SMT modes, the threads will perform better since they share less core resources. Hence when we have idle threads, we want them to be the higher ones. This adds a hook into f_b_g() called check_asym_packing() to check the packing. This packing function is run on idle threads. It checks to see if the busiest CPU in this domain (core in the P7 case) has a higher CPU number than what where the packing function is being run on. If it is, calculate the imbalance and return the higher busier thread as the busiest group to f_b_g(). Here we are assuming a lower CPU number will be equivalent to a lower SMT thread number. It also creates a new SD_ASYM_PACKING flag to enable this feature at any scheduler domain level. It also creates an arch hook to enable this feature at the sibling level. The default function doesn't enable this feature. Based heavily on patch from Peter Zijlstra. Fixes from Srivatsa Vaddagiri. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20100608045702.2936CCC897@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: Fix capacity calculations for SMT4Srivatsa Vaddagiri2010-06-091-10/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Handle cpu capacity being reported as 0 on cores with more number of hardware threads. For example on a Power7 core with 4 hardware threads, core power is 1177 and thus power of each hardware thread is 1177/4 = 294. This low power can lead to capacity for each hardware thread being calculated as 0, which leads to tasks bouncing within the core madly! Fix this by reporting capacity for hardware threads as 1, provided their power is not scaled down significantly because of frequency scaling or real-time tasks usage of cpu. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20100608045702.21D03CC895@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: Change nohz idle load balancing logic to push modelVenkatesh Pallipadi2010-06-095-153/+234
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the new push model, all idle CPUs indeed go into nohz mode. There is still the concept of idle load balancer (performing the load balancing on behalf of all the idle cpu's in the system). Busy CPU kicks the nohz balancer when any of the nohz CPUs need idle load balancing. The kickee CPU does the idle load balancing on behalf of all idle CPUs instead of the normal idle balance. This addresses the below two problems with the current nohz ilb logic: * the idle load balancer continued to have periodic ticks during idle and wokeup frequently, even though it did not have any rebalancing to do on behalf of any of the idle CPUs. * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this periodic wakeup can result in a periodic additional interrupt on a CPU doing the timer broadcast. Also currently we are migrating the unpinned timers from an idle to the cpu doing idle load balancing (when all the cpus in the system are idle, there is no idle load balancing cpu and timers get added to the same idle cpu where the request was made. So the existing optimization works only on semi idle system). And In semi idle system, we no longer have periodic ticks on the idle load balancer CPU. Using that cpu will add more delays to the timers than intended (as that cpu's timer base may not be uptodate wrt jiffies etc). This was causing mysterious slowdowns during boot etc. For now, in the semi idle case, use the nearest busy cpu for migrating timers from an idle cpu. This is good for power-savings anyway. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: Avoid side-effect of tickless idle on update_cpu_loadVenkatesh Pallipadi2010-06-092-6/+99
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tickless idle has a negative side effect on update_cpu_load(), which in turn can affect load balancing behavior. update_cpu_load() is supposed to be called every tick, to keep track of various load indicies. With tickless idle, there are no scheduler ticks called on the idle CPUs. Idle CPUs may still do load balancing (with idle_load_balance CPU) using the stale cpu_load. It will also cause problems when all CPUs go idle for a while and become active again. In this case loads would not degrade as expected. This is how rq->nr_load_updates change looks like under different conditions: <cpu_num> <nr_load_updates change> All CPUS idle for 10 seconds (HZ=1000) 0 1621 10 496 11 139 12 875 13 1672 14 12 15 21 1 1472 2 2426 3 1161 4 2108 5 1525 6 701 7 249 8 766 9 1967 One CPU busy rest idle for 10 seconds 0 10003 10 601 11 95 12 966 13 1597 14 114 15 98 1 3457 2 93 3 6679 4 1425 5 1479 6 595 7 193 8 633 9 1687 All CPUs busy for 10 seconds 0 10026 10 10026 11 10026 12 10026 13 10025 14 10025 15 10025 1 10026 2 10026 3 10026 4 10026 5 10026 6 10026 7 10026 8 10026 9 10026 That is update_cpu_load works properly only when all CPUs are busy. If all are idle, all the CPUs get way lower updates. And when few CPUs are busy and rest are idle, only busy and ilb CPU does proper updates and rest of the idle CPUs will do lower updates. The patch keeps track of when a last update was done and fixes up the load avg based on current time. On one of my test system SPECjbb with warehouse 1..numcpus, patch improves throughput numbers by ~1% (average of 6 runs). On another test system (with different domain hierarchy) there is no noticable change in perf. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <AANLkTilLtDWQsAUrIxJ6s04WTgmw9GuOODc5AOrYsaR5@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched: Simplify the reacquire_kernel_lock() logicOleg Nesterov2010-06-091-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Contrary to what 6d558c3a says, there is no need to reload prev = rq->curr after the context switch. You always schedule back to where you came from, prev must be equal to current even if cpu/rq was changed. - This also means reacquire_kernel_lock() can use prev instead of current. - No need to reassign switch_count if reacquire_kernel_lock() reports need_resched(), we can just move the initial assignment down, under the "need_resched_nonpreemptible:" label. - Try to update the comment after context_switch(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100519125711.GA30199@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | sched_clock: Add local_clock() API and improve documentationPeter Zijlstra2010-06-096-16/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For people who otherwise get to write: cpu_clock(smp_processor_id()), there is now: local_clock(). Also, as per suggestion from Andrew, provide some documentation on the various clock interfaces, and minimize the unsigned long long vs u64 mess. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jens Axboe <jaxboe@fusionio.com> LKML-Reference: <1275052414.1645.52.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | Merge branch 'sched-wq' of ↵Ingo Molnar2010-06-085-80/+170
| |\ \ \ \ \ \ \ | | | |_|_|_|_|/ | | |/| | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into sched/core
| | * | | | | | sched: add hooks for workqueueTejun Heo2010-06-083-3/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Concurrency managed workqueue needs to know when workers are going to sleep and waking up. Using these two hooks, cmwq keeps track of the current concurrency level and throttles execution of new works if it's too high and wakes up another worker from the sleep hook if it becomes too low. This patch introduces PF_WQ_WORKER to identify workqueue workers and adds the following two hooks. * wq_worker_waking_up(): called when a worker is woken up. * wq_worker_sleeping(): called when a worker is going to sleep and may return a pointer to a local task which should be woken up. The returned task is woken up using try_to_wake_up_local() which is simplified ttwu which is called under rq lock and can only wake up local tasks. Both hooks are currently defined as noop in kernel/workqueue_sched.h. Later cmwq implementation will replace them with proper implementation. These hooks are hard coded as they'll always be enabled. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Ingo Molnar <mingo@elte.hu>
| | * | | | | | sched: refactor try_to_wake_up()Tejun Heo2010-06-081-34/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up(). The factoring out doesn't affect try_to_wake_up() much code-generation-wise. Depending on configuration options, it ends up generating the same object code as before or slightly different one due to different register assignment. This is to help future implementation of try_to_wake_up_local(). Mike Galbraith suggested rename to ttwu_post_activation() from ttwu_woken_up() and comment update in try_to_wake_up(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Ingo Molnar <mingo@elte.hu>
| | * | | | | | sched: adjust when cpu_active and cpuset configurations are updated during ↵Tejun Heo2010-06-083-42/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cpu on/offlining Currently, when a cpu goes down, cpu_active is cleared before CPU_DOWN_PREPARE starts and cpuset configuration is updated from a default priority cpu notifier. When a cpu is coming up, it's set before CPU_ONLINE but cpuset configuration again is updated from the same cpu notifier. For cpu notifiers, this presents an inconsistent state. Threads which a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be migrated to other cpus because the cpu is no more inactive. Fix it by updating cpu_active in the highest priority cpu notifier and cpuset configuration in the second highest when a cpu is coming up. Down path is updated similarly. This guarantees that all other cpu notifiers see consistent cpu_active and cpuset configuration. cpuset_track_online_cpus() notifier is converted to cpuset_update_active_cpus() which just updates the configuration and now called from cpuset_cpu_[in]active() notifiers registered from sched_init_smp(). If cpuset is disabled, cpuset_update_active_cpus() degenerates into partition_sched_domains() making separate notifier for !CONFIG_CPUSETS unnecessary. This problem is triggered by cmwq. During CPU_DOWN_PREPARE, hotplug callback creates a kthread and kthread_bind()s it to the target cpu, and the thread is expected to run on that cpu. * Ingo's test discovered __cpuinit/exit markups were incorrect. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Menage <menage@google.com>
| | * | | | | | sched: define and use CPU_PRI_* enums for cpu notifier prioritiesTejun Heo2010-06-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of hardcoding priority 10 and 20 in sched and perf, collect them into CPU_PRI_* enums. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
* | | | | | | | Merge branch 'perf-core-for-linus' of ↵Linus Torvalds2010-08-0633-2925/+1451
|\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (162 commits) tracing/kprobes: unregister_trace_probe needs to be called under mutex perf: expose event__process function perf events: Fix mmap offset determination perf, powerpc: fsl_emb: Restore setting perf_sample_data.period perf, powerpc: Convert the FSL driver to use local64_t perf tools: Don't keep unreferenced maps when unmaps are detected perf session: Invalidate last_match when removing threads from rb_tree perf session: Free the ref_reloc_sym memory at the right place x86,mmiotrace: Add support for tracing STOS instruction perf, sched migration: Librarize task states and event headers helpers perf, sched migration: Librarize the GUI class perf, sched migration: Make the GUI class client agnostic perf, sched migration: Make it vertically scrollable perf, sched migration: Parameterize cpu height and spacing perf, sched migration: Fix key bindings perf, sched migration: Ignore unhandled task states perf, sched migration: Handle ignored migrate out events perf: New migration tool overview tracing: Drop cpparg() macro perf: Use tracepoint_synchronize_unregister() to flush any pending tracepoint call ... Fix up trivial conflicts in Makefile and drivers/cpufreq/cpufreq.c
| * \ \ \ \ \ \ \ Merge branch 'perf/core' of ↵Ingo Molnar2010-08-051-0/+3
| |\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 into perf/core
| | * | | | | | | | tracing/kprobes: unregister_trace_probe needs to be called under mutexSrikar Dronamraju2010-08-041-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Comment in unregister_trace_probe() says probe_lock will be held when it gets called. However there is a case where it might called without the probe_lock being held. Also since we are traversing the probe_list and deleting an element from the probe_list, probe_lock should be held. This was first pointed in uprobes traceevent review by Frederic Weisbecker here. (http://lkml.org/lkml/2010/5/12/106) Cc: Ingo Molnar <mingo@elte.hu> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100630084548.GA10325@linux.vnet.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
OpenPOWER on IntegriCloud