op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	futex: Prevent stale futex owner when interrupted/timeout	Thomas Gleixner	2008-01-08	1	-10/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Roland Westrelin did a great analysis of a long standing thinko in the return path of futex_lock_pi. While we fixed the lock steal case long ago, which was easy to trigger, we never had a test case which exposed this problem and stupidly never thought about the reverse lock stealing scenario and the return to user space with a stale state. When a blocked tasks returns from rt_mutex_timed_locked without holding the rt_mutex (due to a signal or timeout) and at the same time the task holding the futex is releasing the futex and assigning the ownership of the futex to the returning task, then it might happen that a third task acquires the rt_mutex before the final rt_mutex_trylock() of the returning task happens under the futex hash bucket lock. The returning task returns to user space with ETIMEOUT or EINTR, but the user space futex value is assigned to this task. The task which acquired the rt_mutex fixes the user space futex value right after the hash bucket lock has been released by the returning task, but for a short period of time the user space value is wrong. Detailed description is available at: https://bugzilla.redhat.com/show_bug.cgi?id=400541 The fix for this is the same as we do when the rt_mutex was acquired by a higher priority task via lock stealing from the designated new owner. In that case we already fix the user space value and the internal pi_state up before we return. This mechanism can be used to fixup the above corner case as well. When the returning task, which failed to acquire the rt_mutex, notices that it is the designated owner of the futex, then it fixes up the stale user space value and the pi_state, before returning to user space. This happens with the futex hash bucket lock held, so the task which acquired the rt_mutex is guaranteed to be blocked on the hash bucket lock. We can access the rt_mutex owner, which gives us the pid of the new owner, safely here as the owner is not able to modify (release) it while waiting on the hash bucket lock. Rename the "curr" argument of fixup_pi_state_owner() to "newowner" to avoid confusion with current and add the check for the stale state into the failure path of rt_mutex_trylock() in the return path of unlock_futex_pi(). If the situation is detected use fixup_pi_state_owner() to assign everything to the owner of the rt_mutex. Pointed-out-and-tested-by: Roland Westrelin <roland.westrelin@sun.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	vmcoreinfo: add the array length of "free_list" for filtering free pages	Ken'ichi Ohmichi	2008-01-08	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds the array length of "free_area.free_list" to the vmcoreinfo data so that makedumpfile (dump filtering command) can exclude all free pages in linux-2.6.24. makedumpfile creates a small dumpfile by excluding unnecessary pages for the analysis. To distinguish unnecessary pages, makedumpfile gets the vmcoreinfo data which has the minimum debugging information only for dump filtering. In 2.6.24-rc1 or later, the free_area.free_list is an array which has one list for each migrate types instead of a single list. makedumpfile needs the array length of "free_area.free_list" and the vmcoreinfo data should contain it. Signed-off-by: Huang Ying <ying.huang@intel.com> Tested-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Acked-by: Simon Horman <horms@verge.net.au> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	acct: real_parent ppid	Roland McGrath	2008-01-07	1	-1/+1
\| \| \| \| \| \| \| \| \|	The ac_ppid field reported in process accounting records should match what getppid() would have returned to that process, regardless of whether a debugger is attached. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched	Linus Torvalds	2008-01-03	1	-4/+4
\|\ \| \| \| \| \| \| \| \|	* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: fix gcc warnings
\| *	sched: fix gcc warnings	Ingo Molnar	2007-12-30	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Meelis Roos reported these warnings on sparc64: CC kernel/sched.o In file included from kernel/sched.c:879: kernel/sched_debug.c: In function 'nsec_high': kernel/sched_debug.c:38: warning: comparison of distinct pointer types lacks a cast the debug check in do_div() is over-eager here, because the long long is always positive in these places. Mark this by casting them to unsigned long long. no change in code output: text data bss dec hex filename 51471 6582 376 58429 e43d sched.o.before 51471 6582 376 58429 e43d sched.o.after md5: 7f7729c111f185bf3ccea4d542abc049 sched.o.before.asm 7f7729c111f185bf3ccea4d542abc049 sched.o.after.asm Signed-off-by: Ingo Molnar <mingo@elte.hu>
* \|	Fix kernel/ptrace.c compile problem (missing "may_attach()")	Linus Torvalds	2008-01-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The previous commit missed one use of "may_attach()" that had been renamed to __ptrace_may_attach(). Tssk, tssk, Al. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* \|	restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace pid	Al Viro	2008-01-02	1	-2/+2
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Contents of /proc/*/maps is sensitive and may become sensitive after open() (e.g. if target originally shares our ->mm and later does exec on suid-root binary). Check at read() (actually, ->start() of iterator) time that mm_struct we'd grabbed and locked is - still the ->mm of target - equal to reader's ->mm or the target is ptracable by reader. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	[SERIAL]: Fix section mismatches in Sun serial console drivers.	David S. Miller	2007-12-29	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We're exporting an __init function, oops :-) The core issue here is that add_preferred_console() is marked as __init, this makes it impossible to invoke this thing from a driver probe routine which is what the Sparc serial drivers need to do. There is no harm in dropping the __init marker. This code will actually work properly when invoked from a modular driver, except that init will probably not pick up the console change without some other support code. Then we can drop the __init from sunserial_console_match() and we're no longer exporting an __init function to modules. Signed-off-by: David S. Miller <davem@davemloft.net>
*	Modules: fix memory leak of module names	Greg Kroah-Hartman	2007-12-22	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Due to the change in kobject name handling, the module kobject needs to have a null release function to ensure that the name it previously set will be properly cleaned up. All of this wierdness goes away in 2.6.25 with the rework of the kobject name and cleanup logic, but this is required for 2.6.24. Thanks to Alexey Dobriyan for finding the problem, and to Kay Sievers for pointing out the simple way to fix it after I tried many complex ways. Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
*	debug: add end-of-oops marker	Arjan van de Ven	2007-12-20	1	-0/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Right now it's nearly impossible for parsers that collect kernel crashes from logs or emails (such as www.kerneloops.org) to detect the end-of-oops condition. In addition, it's not currently possible to detect whether or not 2 oopses that look alike are actually the same oops reported twice, or are truly two unique oopses. This patch adds an end-of-oops marker, and makes the end marker include a very simple 64-bit random ID to be able to detect duplicate reports. Normally, this ID is calculated as a late_initcall() (in the hope that at that time there is enough entropy to get a unique enough ID); however for early oopses the oops_exit() function needs to generate the ID on the fly. We do this all at the _end_ of an oops printout, so this does not impact our ability to get the most important portions of a crash out to the console first. [ Sidenote: the already existing oopses-since-bootup counter we print during crashes serves as the differentiator between multiple oopses that trigger during the same bootup. ] Tested on 32-bit and 64-bit x86. Artificially injected very early crashes as well, as expected they result in this constant ID after multiple bootups: ---[ end trace ca143223eefdc828 ]--- ---[ end trace ca143223eefdc828 ]--- because the random pools are still all zero. But it all still works fine and causes no additional problems (which is the main goal of instrumentation code). Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: rt: account the cpu time during the tick	Peter Zijlstra	2007-12-20	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Realtime tasks would not account their runtime during ticks. Which would lead to: struct sched_param param = { .sched_priority = 10 }; pthread_setschedparam(pthread_self(), SCHED_FIFO, &param); while (1) ; Not showing up in top. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
*	Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86	Linus Torvalds	2007-12-18	3	-44/+25
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86: x86: fix "Kernel panic - not syncing: IO-APIC + timer doesn't work!" genirq: revert lazy irq disable for simple irqs x86: also define AT_VECTOR_SIZE_ARCH x86: kprobes bugfix x86: jprobe bugfix timer: kernel/timer.c section fixes genirq: add unlocked version of set_irq_handler() clockevents: fix reprogramming decision in oneshot broadcast oprofile: op_model_athlon.c support for AMD family 10h barcelona performance counters
\| *	genirq: revert lazy irq disable for simple irqs	Steven Rostedt	2007-12-18	1	-7/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In commit 76d2160147f43f982dfe881404cfde9fd0a9da21 lazy irq disabling was implemented, and the simple irq handler had a masking set to it. Remy Bohmer discovered that some devices in the ARM architecture would trigger the mask, but never unmask it. His patch to do the unmasking was questioned by Russell King about masking simple irqs to begin with. Looking further, it was discovered that the problems Remy was seeing was due to improper use of the simple handler by devices, and he later submitted patches to fix those. But the issue that was uncovered was that the simple handler should never mask. This patch reverts the masking in the simple handler. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
\| *	timer: kernel/timer.c section fixes	Adrian Bunk	2007-12-18	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes the following section mismatches with CONFIG_HOTPLUG=n, CONFIG_HOTPLUG_CPU=y: ... WARNING: vmlinux.o(.text+0x41cd3): Section mismatch: reference to .init.data:tvec_base_done.22610 (between 'timer_cpu_notify' and 'run_timer_softirq') WARNING: vmlinux.o(.text+0x41d67): Section mismatch: reference to .init.data:tvec_base_done.22610 (between 'timer_cpu_notify' and 'run_timer_softirq') ... Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
\| *	clockevents: fix reprogramming decision in oneshot broadcast	Thomas Gleixner	2007-12-18	1	-35/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Resolve the following regression of a choppy, almost unusable laptop: http://lkml.org/lkml/2007/12/7/299 http://bugzilla.kernel.org/show_bug.cgi?id=9525 A previous version of the code did the reprogramming of the broadcast device in the return from idle code. This was removed, but the logic in tick_handle_oneshot_broadcast() was kept the same. When a broadcast interrupt happens we signal the expiry to all CPUs which have an expired event. If none of the CPUs has an expired event, which can happen in dyntick mode, then we reprogram the broadcast device. We do not reprogram otherwise, but this is only correct if all CPUs, which are in the idle broadcast state have been woken up. The code ignores, that there might be pending not yet expired events on other CPUs, which are in the idle broadcast state. So the delivery of those events can be delayed for quite a time. Change the tick_handle_oneshot_broadcast() function to check for CPUs, which are in broadcast state and are not woken up by the current event, and enforce the rearming of the broadcast device for those CPUs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* \|	sched: do not hurt SCHED_BATCH on wakeup	Ingo Molnar	2007-12-18	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	measurements by Yanmin Zhang have shown that SCHED_BATCH tasks benefit if they run the same place_entity() logic as SCHED_OTHER tasks - so uniformize behavior in this area. Signed-off-by: Ingo Molnar <mingo@elte.hu>
* \|	sched: touch softlockup watchdog after idling	Ingo Molnar	2007-12-18	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	touch softlockup watchdog after idling. Signed-off-by: Ingo Molnar <mingo@elte.hu>
* \|	sched: sysctl, proc_dointvec_minmax() expects int values for	Eric Dumazet	2007-12-18	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	min_sched_granularity_ns, max_sched_granularity_ns, min_wakeup_granularity_ns and max_wakeup_granularity_ns are declared "unsigned long". This is incorrect since proc_dointvec_minmax() expects plain "int" guard values. This bug only triggers on big endian 64 bit arches. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* \|	sched: mark rwsem functions as __sched for wchan/profiling	Livio Soares	2007-12-18	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This following commit http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=fdf8cb0909b531f9ae8f9b9d7e4eb35ba3505f07 un-inlined a low-level rwsem function, but did not mark it as __sched. The result is that it now shows up as thread wchan (which also affects /proc/profile stats). The following simple patch fixes this by properly marking rwsem_down_failed_common() as a __sched function. Also in this patch, which is up for discussion, marks down_read() and down_write() proper as __sched. For profiling, it is pretty much useless to know that a semaphore is beig help - it is necessary to know _which_ one. By going up another frame on the stack, the information becomes much more useful. In summary, the below change to lib/rwsem.c should be applied; the changes to kernel/rwsem.c could be applied if other kernel hackers agree with my proposal that down_read()/down_write() in the profile is not enough. [ akpm@linux-foundation.org: build fix ] Signed-off-by: Livio Soares <livio@eecg.toronto.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* \|	sched: fix crash on ia64, introduce task_current()	Dmitry Adamushko	2007-12-18	1	-6/+11
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some services (e.g. sched_setscheduler(), rt_mutex_setprio() and sched_move_task()) must handle a given task differently in case it's the 'rq->curr' task on its run-queue. The task_running() interface is not suitable for determining such tasks for platforms with one of the following options: #define __ARCH_WANT_UNLOCKED_CTXSW #define __ARCH_WANT_INTERRUPTS_ON_CTXSW Due to the fact that it makes use of 'p->oncpu == 1' as a criterion but such a task is not necessarily 'rq->curr'. The detailed explanation is available here: https://lists.linux-foundation.org/pipermail/containers/2007-December/009262.html Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
*	sysctl: fix ax25 checks	Eric W. Biederman	2007-12-17	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix: sysctl table check failed: /net/ax25/ax0/ax25_default_mode .3.9.1.2 Unknown sysctl binary path Pid: 2936, comm: kissattach Not tainted 2.6.24-rc5 #1 [<c012ca6a>] set_fail+0x3b/0x43 [<c012ce7a>] sysctl_check_table+0x408/0x456 [<c012ce8e>] sysctl_check_table+0x41c/0x456 [<c012ce8e>] sysctl_check_table+0x41c/0x456 ... Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Bernard Pidoux <pidoux@ccr.jussieu.fr> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Revert "hugetlb: Add hugetlb_dynamic_pool sysctl"	Nishanth Aravamudan	2007-12-17	1	-8/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This reverts commit 54f9f80d6543fb7b157d3b11e2e7911dc1379790 ("hugetlb: Add hugetlb_dynamic_pool sysctl") Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool sysctl is not needed, as its semantics can be expressed by 0 in the overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl (pool enabled). (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change) Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	hugetlb: introduce nr_overcommit_hugepages sysctl	Nishanth Aravamudan	2007-12-17	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	hugetlb: introduce nr_overcommit_hugepages sysctl While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I became convinced that having a boolean sysctl was insufficient: 1) To support per-node control of hugepages, I have previously submitted patches to add a sysfs attribute related to nr_hugepages. However, with a boolean global value and per-mount quota enforcement constraining the dynamic pool, adding corresponding control of the dynamic pool on a per-node basis seems inconsistent to me. 2) Administration of the hugetlb dynamic pool with multiple hugetlbfs mount points is, arguably, more arduous than it needs to be. Each quota would need to be set separately, and the sum would need to be monitored. To ease the administration, and to help make the way for per-node control of the static & dynamic hugepage pool, I added a separate sysctl, nr_overcommit_hugepages. This value serves as a high watermark for the overall hugepage pool, while nr_hugepages serves as a low watermark. The boolean sysctl can then be removed, as the condition nr_overcommit_hugepages > 0 indicates the same administrative setting as hugetlb_dynamic_pool == 1 Quotas still serve as local enforcement of the size of the pool on a per-mount basis. A few caveats: 1) There is a race whereby the global surplus huge page counter is incremented before a hugepage has allocated. Another process could then try grow the pool, and fail to convert a surplus huge page to a normal huge page and instead allocate a fresh huge page. I believe this is benign, as no memory is leaked (the actual pages are still tracked correctly) and the counters won't go out of sync. 2) Shrinking the static pool while a surplus is in effect will allow the number of surplus huge pages to exceed the overcommit value. As long as this condition holds, however, no more surplus huge pages will be allowed on the system until one of the two sysctls are increased sufficiently, or the surplus huge pages go out of use and are freed. Successfully tested on x86_64 with the current libhugetlbfs snapshot, modified to use the new sysctl. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86	Linus Torvalds	2007-12-07	2	-0/+13
\|\ \| \| \| \| \| \| \| \| \| \| \| \|	* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86: ACPI: move timer broadcast before busmaster disable clockevents: warn once when program_event() is called with negative expiry hrtimers: avoid overflow for large relative timeouts
\| *	clockevents: warn once when program_event() is called with negative expiry	Thomas Gleixner	2007-12-07	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The hrtimer problem with large relative timeouts resulting in a negative expiry time went unnoticed as there is no check in the clockevents_program_event() code. Put a check there with a WARN_ONCE to avoid such problems in the future. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
\| *	hrtimers: avoid overflow for large relative timeouts	Thomas Gleixner	2007-12-07	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Relative hrtimers with a large timeout value might end up as negative timer values, when the current time is added in hrtimer_start(). This in turn is causing the clockevents_set_next() function to set an huge timeout and sleep for quite a long time when we have a clock source which is capable of long sleeps like HPET. With PIT this almost goes unnoticed as the maximum delta is ~27ms. The non-hrt/nohz code sorts this out in the next timer interrupt, so we never noticed that problem which has been there since the first day of hrtimers. This bug became more apparent in 2.6.24 which activates HPET on more hardware. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* \|	sched: enable early use of sched_clock()	Ingo Molnar	2007-12-07	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	some platforms have sched_clock() implementations that cannot be called very early during wakeup. If it's called it might hang or crash in hard to debug ways. So only call update_rq_clock() [which calls sched_clock()] if sched_init() has already been called. (rq->idle is NULL before the scheduler is initialized.) Signed-off-by: Ingo Molnar <mingo@elte.hu>
* \|	lockdep: make cli/sti annotation warnings clearer	Ingo Molnar	2007-12-07	1	-4/+9
\|/ \| \| \| \| \| \|	make cli/sti annotation warnings easier to interpret. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
*	Tiny clean-up of OPROFILE/KPROBES configuration	Linus Torvalds	2007-12-06	1	-4/+4
\| \| \| \| \| \| \|	Make the Kconfig.instrumentation file a bit easier on the eyes, and use the new ARCH_SUPPORTS_OPROFILE for x86[-64]. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Fix oprofile configuration breakage	Ralf Baechle	2007-12-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The cleanup 09cadedbdc01f1a4bea1f427d4fb4642eaa19da9 broke the oprofile configuration for MIPS by allowing oprofile support to be built for kernel models where oprofile doesn't have a chance in hell to work. Just a dependecy list on a number of architectures is - surprise - broken and should as per past discussions probably in most considered to be broken in most cases. So I introduce a dependency for the oprofile configuration on ARCH_SUPPORTS_OPROFILE. Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched	Linus Torvalds	2007-12-05	3	-89/+99
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: futex: correctly return -EFAULT not -EINVAL lockdep: in_range() fix lockdep: fix debug_show_all_locks() sched: style cleanups futex: fix for futex_wait signal stack corruption
\| *	futex: correctly return -EFAULT not -EINVAL	Thomas Gleixner	2007-12-05	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	return -EFAULT not -EINVAL. Found by review. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
\| *	lockdep: in_range() fix	Oleg Nesterov	2007-12-05	1	-12/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Torsten Kaiser wrote: \| static inline int in_range(const void start, const void addr, const void end) \| { \| return addr >= start && addr <= end; \| } \| This will return true, if addr is in the range of start (including) \| to end (including). \| \| But debug_check_no_locks_freed() seems does: \| const void mem_to = mem_from + mem_len \| -> mem_to is the last byte of the freed range, that fits in_range \| lock_from = (void )hlock->instance; \| -> first byte of the lock \| lock_to = (void )(hlock->instance + 1); \| -> first byte of the next lock, not last byte of the lock that is being checked! \| \| The test is: \| if (!in_range(mem_from, lock_from, mem_to) && \| !in_range(mem_from, lock_to, mem_to)) \| continue; \| So it tests, if the first byte of the lock is in the range that is freed ->OK \| And if the first byte of the next lock is in the range that is freed \| -> Not OK. We can also simplify in_range checks, we need only 2 comparisons, not 4. If the lock is not in memory range, it should be either at the left of range or at the right. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
\| *	lockdep: fix debug_show_all_locks()	Ingo Molnar	2007-12-05	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	fix the oops that can be seen in: http://bugzilla.kernel.org/attachment.cgi?id=13828&action=view it is not safe to print the locks of running tasks. (even with this fix we have a small race - but this is a debug function after all.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
\| *	sched: style cleanups	Ingo Molnar	2007-12-05	1	-64/+68
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	style cleanup of various changes that were done recently. no code changed: text data bss dec hex filename 23680 2542 28 26250 668a sched.o.before 23680 2542 28 26250 668a sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>
\| *	futex: fix for futex_wait signal stack corruption	Steven Rostedt	2007-12-05	1	-12/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	David Holmes found a bug in the -rt tree with respect to pthread_cond_timedwait. After trying his test program on the latest git from mainline, I found the bug was there too. The bug he was seeing that his test program showed, was that if one were to do a "Ctrl-Z" on a process that was in the pthread_cond_timedwait, and then did a "bg" on that process, it would return with a "-ETIMEDOUT" but early. That is, the timer would go off early. Looking into this, I found the source of the problem. And it is a rather nasty bug at that. Here's the relevant code from kernel/futex.c: (not in order in the file) [...] smlinkage long sys_futex(u32 __user uaddr, int op, u32 val, struct timespec __user utime, u32 __user uaddr2, u32 val3) { struct timespec ts; ktime_t t, tp = NULL; u32 val2 = 0; int cmd = op & FUTEX_CMD_MASK; if (utime && (cmd == FUTEX_WAIT \|\| cmd == FUTEX_LOCK_PI)) { if (copy_from_user(&ts, utime, sizeof(ts)) != 0) return -EFAULT; if (!timespec_valid(&ts)) return -EINVAL; t = timespec_to_ktime(ts); if (cmd == FUTEX_WAIT) t = ktime_add(ktime_get(), t); tp = &t; } [...] return do_futex(uaddr, op, val, tp, uaddr2, val2, val3); } [...] long do_futex(u32 __user uaddr, int op, u32 val, ktime_t timeout, u32 __user uaddr2, u32 val2, u32 val3) { int ret; int cmd = op & FUTEX_CMD_MASK; struct rw_semaphore fshared = NULL; if (!(op & FUTEX_PRIVATE_FLAG)) fshared = &current->mm->mmap_sem; switch (cmd) { case FUTEX_WAIT: ret = futex_wait(uaddr, fshared, val, timeout); [...] static int futex_wait(u32 __user uaddr, struct rw_semaphore fshared, u32 val, ktime_t abs_time) { [...] struct restart_block restart; restart = &current_thread_info()->restart_block; restart->fn = futex_wait_restart; restart->arg0 = (unsigned long)uaddr; restart->arg1 = (unsigned long)val; restart->arg2 = (unsigned long)abs_time; restart->arg3 = 0; if (fshared) restart->arg3 \|= ARG3_SHARED; return -ERESTART_RESTARTBLOCK; [...] static long futex_wait_restart(struct restart_block restart) { u32 __user uaddr = (u32 __user )restart->arg0; u32 val = (u32)restart->arg1; ktime_t abs_time = (ktime_t )restart->arg2; struct rw_semaphore fshared = NULL; restart->fn = do_no_restart_syscall; if (restart->arg3 & ARG3_SHARED) fshared = &current->mm->mmap_sem; return (long)futex_wait(uaddr, fshared, val, abs_time); } So when the futex_wait is interrupt by a signal we break out of the hrtimer code and set up or return from signal. This code does not return back to userspace, so we set up a RESTARTBLOCK. The bug here is that we save the "abs_time" which is a pointer to the stack variable "ktime_t t" from sys_futex. This returns and unwinds the stack before we get to call our signal. On return from the signal we go to futex_wait_restart, where we update all the parameters for futex_wait and call it. But here we have a problem where abs_time is no longer valid. I verified this with print statements, and sure enough, what abs_time was set to ends up being garbage when we get to futex_wait_restart. The solution I did to solve this (with input from Linus Torvalds) was to add unions to the restart_block to allow system calls to use the restart with specific parameters. This way the futex code now saves the time in a 64bit value in the restart block instead of storing it on the stack. Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64 in thread_info.h, when there's a #ifdef __KERNEL__ just below that. Not sure what that is there for. If this turns out to be a problem, I've tested this with using "unsigned int" for u32 and "unsigned long long" for u64 and it worked just the same. I'm using u32 and u64 just to be consistent with what the futex code uses. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
* \|	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6	Linus Torvalds	2007-12-05	1	-1/+1
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6: [SPARC64]: Update defconfig. [SPARC]: Add missing of_node_put [SPARC64]: check for possible NULL pointer dereference [SPARC]: Add missing "space" [SPARC64]: Add missing "space" [SPARC64]: Add missing pci_dev_put [SYSCTL_CHECK]: Fix typo in KERN_SPARC_SCONS_PWROFF entry string. [SPARC64]: Missing mdesc_release() in ldc_init().
\| * \|	[SYSCTL_CHECK]: Fix typo in KERN_SPARC_SCONS_PWROFF entry string.	David S. Miller	2007-12-05	1	-1/+1
\| \|/ \| \| \| \| \| \| \| \| \| \|	Based upon a report by Mikael Pettersson. Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	Avoid potential NULL dereference in unregister_sysctl_table	Pavel Emelyanov	2007-12-05	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	register_sysctl_table() can return NULL sometimes, e.g. when kmalloc() returns NULL or when sysctl check fails. I've also noticed, that many (most?) code in the kernel doesn't check for the return value from register_sysctl_table() and later simply calls the unregister_sysctl_table() with potentially NULL argument. This is unlikely on a common kernel configuration, but in case we're dealing with modules and/or fault-injection support, there's a slight possibility of an OOPS. Changing all the users to check for return code from the registering does not look like a good solution - there are too many code doing this and failure in sysctl tables registration is not a good reason to abort module loading (in most of the cases). So I think, that we can just have this check in unregister_sysctl_table just to avoid accidental OOPS-es (actually, the unregister_sysctl_table() did exactly this, before the start_unregistering() appeared). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* \|	fix clone(CLONE_NEWPID)	Eric W. Biederman	2007-12-05	1	-15/+6
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we are complicating the code in copy_process, the clone ABI, and if we fix the bugs sys_setsid itself, with an unnecessary open coded version of sys_setsid. So just simplify everything and don't special case the session and pgrp of the initial process in a pid namespace. Having this special case actually presents to user space the classic linux startup conditions with session == pgrp == 0 for /sbin/init. We already handle sending signals to processes in a child pid namespace. We need to handle sending signals to processes in a parent pid namespace for cases like SIGCHILD and SIGIO. This makes nothing extra visible inside a pid namespace. So this extra special case appears to have no redeeming merits. Further removing this special case increases the flexibility of how we can use pid namespaces, by not requiring the initial process in a pid namespace to be a daemon. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	sched: default to more agressive yield for SCHED_BATCH tasks	Ingo Molnar	2007-12-04	1	-3/+4
\| \| \| \| \| \| \| \|	do more agressive yield for SCHED_BATCH tuned tasks: they are all about throughput anyway. This allows a gentler migration path for any apps that relied on stronger yield. Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	sched: fix crash in sys_sched_rr_get_interval()	Ingo Molnar	2007-12-04	1	-5/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Luiz Fernando N. Capitulino reported that sched_rr_get_interval() crashes for SCHED_OTHER tasks that are on an idle runqueue. The fix is to return a 0 timeslice for tasks that are on an idle runqueue. (and which are not running, obviously) this also shrinks the code a bit: text data bss dec hex filename 47903 3934 336 52173 cbcd sched.o.before 47885 3934 336 52155 cbbb sched.o.after Reported-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched	Linus Torvalds	2007-12-03	3	-26/+136
\|\ \| \| \| \| \| \| \| \|	* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: cpu accounting controller (V2)
\| *	sched: cpu accounting controller (V2)	Srivatsa Vaddagiri	2007-12-02	3	-26/+136
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit cfb5285660aad4931b2ebbfa902ea48a37dfffa1 removed a useful feature for us, which provided a cpu accounting resource controller. This feature would be useful if someone wants to group tasks only for accounting purpose and doesnt really want to exercise any control over their cpu consumption. The patch below reintroduces the feature. It is based on Paul Menage's original patch (Commit 62d0df64065e7c135d0002f069444fbdfc64768f), with these differences: - Removed load average information. I felt it needs more thought (esp to deal with SMP and virtualized platforms) and can be added for 2.6.25 after more discussions. - Convert group cpu usage to be nanosecond accurate (as rest of the cfs stats are) and invoke cpuacct_charge() from the respective scheduler classes - Make accounting scalable on SMP systems by splitting the usage counter to be per-cpu - Move the code from kernel/cpu_acct.c to kernel/sched.c (since the code is not big enough to warrant a new file and also this rightly needs to live inside the scheduler. Also things like accessing rq->lock while reading cpu usage becomes easier if the code lived in kernel/sched.c) The patch also modifies the cpu controller not to provide the same accounting information. Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran some simple tests like cpuspin (spin on the cpu), ran several tasks in the same group and timed them. Compared their time stamps with cpuacct.usage. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* \|	uml: add !UML dependencies	Al Viro	2007-12-03	1	-2/+2
\|/ \| \| \| \| \| \| \| \| \| \|	The previous commit ("uml: keep UML Kconfig in sync with x86") is not enough, unfortunately. If we go that way, we need to add dependencies on !UML for several options. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Dike <jdike@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	wait_task_stopped(): pass correct exit_code to wait_noreap_copyout()	Scott James Remnant	2007-11-29	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In wait_task_stopped() exit_code already contains the right value for the si_status member of siginfo, and this is simply set in the non WNOWAIT case. If you call waitid() with a stopped or traced process, you'll get the signal in siginfo.si_status as expected -- however if you call waitid(WNOWAIT) at the same time, you'll get the signal << 8 \| 0x7f Pass it unchanged to wait_noreap_copyout(); we would only need to shift it and add 0x7f if we were returning it in the user status field and that isn't used for any function that permits WNOWAIT. Signed-off-by: Scott James Remnant <scott@ubuntu.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	FRV: fix the extern declaration of kallsyms_num_syms	David Howells	2007-11-29	1	-1/+6
\| \| \| \| \| \| \| \| \| \|	Fix the extern declaration of kallsyms_num_syms to indicate that the symbol does not reside in the small-data storage space, and so may not be accessed relative to the small data base register. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Isolate the UTS namespace's domainname and hostname back	Pavel Emelyanov	2007-11-29	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 7d69a1f4a72b18876c99c697692b78339d491568 ("remove CONFIG_UTS_NS and CONFIG_IPC_NS") by Cedric Le Goater accidentally removed the code that prevented the uts->hostname and uts->domainname values from being overwritten from another namespace. In other words, setting hostname/domainname via sysfs (echo xxx > /proc/sys/kernel/(host\|domain)name) cased the new value to be set in init UTS namespace only. Return the isolation back. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	wait_task_stopped(): don't use task_pid_nr_ns() lockless	Oleg Nesterov	2007-11-29	1	-5/+4
\| \| \| \| \| \| \| \| \| \| \|	wait_task_stopped(WNOWAIT) does task_pid_nr_ns() without tasklist/rcu lock, we can read an already freed memory. Use the cached pid_t value. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Looks-good-to: Roland McGrath <roland@redhat.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	sched: clean up kernel/sched_stat.h	Ingo Molnar	2007-11-28	1	-1/+2
\| \| \| \| \| \|	clean up kernel/sched_stat.h. Signed-off-by: Ingo Molnar <mingo@elte.hu>