op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message	Author	Age	Files	Lines
*	[PATCH] kill wall_jiffies	Atsushi Nemoto	2006-10-01	1	-5/+1
	With 2.6.18-rc4-mm2, wall_jiffies will now always be the same as jiffies, so we can kill wall_jiffies completely. This is just a cleanup and logically should not change any real behavior except for one thing: the RTC updating code in (old) ppc and xtensa uses a condition "jiffies - wall_jiffies == 1". This condition is never met, so I suppose it is just a bug. I remove only that condition instead of killing the whole "if" block. [heiko.carstens@de.ibm.com: s390 build fix and cleanup] Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Hirokazu Takata <takata.hirokazu@renesas.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] kernel/time/ntp.c: possible cleanups	Adrian Bunk	2006-10-01	1	-20/+20
	This patch contains the following possible cleanups:
	- make the following needlessly global function static:
	  - ntp_update_frequency()
	- make the following needlessly global variables static:
	  - time_state
	  - time_offset
	  - time_constant
	  - time_reftime
	- remove the following read-only global variable:
	  - time_precision
	Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ntp: convert to the NTP4 reference model	Roman Zippel	2006-10-01	1	-32/+19
	This converts the kernel ntp model into a model which matches the nanokernel reference implementations. The previous patches already increased the resolution and precision of the computations, so that this conversion becomes quite simple.
	<linux@horizon.com> explains:
	The original NTP kernel interface was defined in units of microseconds. That's what Linux implements. As computers have gotten faster and can now split microseconds easily, a new kernel interface using nanosecond units was defined ("the nanokernel", confusing as that name is to OS hackers), and there's an STA_NANO bit in the adjtimex() status field to tell the application which units it's using.
	The current ntpd supports both, but Linux loses some possible timing resolution because of quantization effects, and the ntpd hackers would really like to be able to drop the backwards compatibility code.
	Ulrich Windl has been maintaining a patch set to do the conversion for years, but it's hard to keep in sync.
	Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ntp: convert time_freq to nsec value	Roman Zippel	2006-10-01	1	-14/+22
	This converts time_freq to a scaled nsec value and adds around 6 bits of extra resolution. This pushes time_freq to its 32-bit limits, so the calculations have to be done in 64-bit. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ntp: remove time_tolerance	Roman Zippel	2006-10-01	1	-5/+4
	time_tolerance isn't changed at all in the kernel, so simply remove it. This simplifies the next patch, as it avoids a number of conversions. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ntp: add time_adjust to tick length	Roman Zippel	2006-10-01	2	-55/+18
	This folds update_ntp_one_tick() into second_overflow() and adds time_adjust to the tick length, which makes time_next_adjust unnecessary. This slightly changes the adjtime() behaviour: instead of applying the adjustment to the next tick, it is applied to the next second. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ntp: prescale time_offset	Roman Zippel	2006-10-01	1	-48/+16
	This converts time_offset into a scaled per tick value. This now completely avoids the crude compensation in second_overflow(). Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ntp: add time_freq to tick length	Roman Zippel	2006-10-01	1	-5/+3
	This adds the frequency part to ntp_update_frequency(). Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ntp: add time_adj to tick length	Roman Zippel	2006-10-01	1	-4/+2
	This makes time_adj local to second_overflow() and integrates it into the tick length instead of adding it every time. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ntp: add ntp_update_frequency	Roman Zippel	2006-10-01	2	-13/+48
	This introduces ntp_update_frequency() and deinlines ntp_clear() (as it's not performance critical). ntp_update_frequency() calculates the base tick length using tick_usec and adds a base adjustment, in case the frequency doesn't divide evenly by HZ. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] NTP: Move all the NTP related code to ntp.c	john stultz	2006-10-01	4	-384/+391
	Move all the NTP related code to ntp.c [akpm@osdl.org: cleanups, build fix] Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] Directed yield: cpu_relax variants for spinlocks and rw-locks	Martin Schwidefsky	2006-10-01	1	-2/+2
	On systems running with virtual cpus there is optimization potential in regard to spinlocks and rw-locks. If the virtual cpu that has taken a lock is known to a cpu that wants to acquire the same lock, it is beneficial to yield the timeslice of the virtual cpu in favour of the cpu that has the lock (directed yield).
	With CONFIG_PREEMPT="n" this can be implemented by the architecture without common code changes. Powerpc already does this.
	With CONFIG_PREEMPT="y" the lock loops are coded with _raw_spin_trylock, _raw_read_trylock and _raw_write_trylock in kernel/spinlock.c. If the lock could not be taken, cpu_relax is called. A directed yield is not possible because cpu_relax doesn't know anything about the lock. To be able to yield the lock in favour of the current lock holder, variants of cpu_relax for spinlocks and rw-locks are needed. The new _raw_spin_relax, _raw_read_relax and _raw_write_relax primitives differ from cpu_relax in that they have an argument: a pointer to the lock structure.
	Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] CodingStyle cleanup for kernel/sys.c	Cal Peake	2006-10-01	1	-50/+30
	Fix up kernel/sys.c to be consistent with CodingStyle and the rest of the file. Signed-off-by: Cal Peake <cp@absolutedigital.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] maximum latency tracking infrastructure	Arjan van de Ven	2006-10-01	2	-1/+280
	Add infrastructure to track "maximum allowable latency" for power saving policies.
	The reason for adding this infrastructure is that power management in the idle loop needs to make a tradeoff between latency and power savings (deeper power save modes have a longer latency to running code again). The code that today makes this tradeoff just does a rather simple algorithm; however this is not good enough: there are devices and use cases that require a lower latency than the higher power saving states provide. An example would be audio playback, but another example is the ipw2100 wireless driver that right now has a very direct and ugly acpi hook to disable some higher power states randomly when it gets certain types of error.
	The proposed solution is to have an interface where drivers can
	* announce the maximum latency (in microseconds) that they can deal with
	* modify this latency
	* give up their constraint
	and a function where the code that decides on power saving strategy can query the current global desired maximum.
	This patch has a user of each side: on the consumer side, ACPI is patched to use this, on the producer side the ipw2100 driver is patched.
	A generic maximum latency of 2 timer ticks is also registered (more and you lose accurate time tracking after all).
	While the existing users of the patch are x86 specific, the infrastructure is not. I'd like to ask the arch maintainers of other architectures if the infrastructure is generic enough for their use (assuming the architecture has such a tradeoff as concept at all), and the sound/multimedia driver owners to look at the driver facing API to see if this is something they can use.
	[akpm@osdl.org: cleanups] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jesse Barnes <jesse.barnes@intel.com> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
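The announce/modify/give-up/query interface described in this message can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the names `latency_announce`, `latency_modify`, `latency_remove` and `latency_current_max`, the fixed-size table and the string identifiers are all hypothetical and are not taken from the patch; the real code also needs locking, which is omitted here.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MAX_CONSTRAINTS 8	/* hypothetical table size */

/* One registered "I can tolerate at most N microseconds" request. */
struct latency_constraint {
	const char *owner;	/* hypothetical identifier, e.g. a driver name */
	int usecs;		/* maximum latency this owner can deal with */
	int in_use;
};

static struct latency_constraint table[MAX_CONSTRAINTS];

static struct latency_constraint *find_owner(const char *owner)
{
	for (int i = 0; i < MAX_CONSTRAINTS; i++)
		if (table[i].in_use && strcmp(table[i].owner, owner) == 0)
			return &table[i];
	return NULL;
}

/* Producer side: announce a constraint. Returns 0, or -1 if the table is full. */
int latency_announce(const char *owner, int usecs)
{
	assert(owner != NULL && usecs >= 0);
	for (int i = 0; i < MAX_CONSTRAINTS; i++) {
		if (!table[i].in_use) {
			table[i].owner = owner;
			table[i].usecs = usecs;
			table[i].in_use = 1;
			return 0;
		}
	}
	return -1;
}

/* Producer side: change an existing constraint. Returns -1 if none is registered. */
int latency_modify(const char *owner, int usecs)
{
	struct latency_constraint *c = find_owner(owner);

	if (c == NULL)
		return -1;
	c->usecs = usecs;
	return 0;
}

/* Producer side: give up the constraint (no-op if none is registered). */
void latency_remove(const char *owner)
{
	struct latency_constraint *c = find_owner(owner);

	if (c != NULL)
		c->in_use = 0;
}

/* Consumer side: the global maximum is the tightest (smallest) constraint. */
int latency_current_max(void)
{
	int max = INT_MAX;	/* INT_MAX means "nobody has asked for anything" */

	for (int i = 0; i < MAX_CONSTRAINTS; i++)
		if (table[i].in_use && table[i].usecs < max)
			max = table[i].usecs;
	return max;
}
```

In this model an idle-loop policy would call `latency_current_max()` and skip any power state whose wake-up latency exceeds the returned value.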
*	[PATCH] BLOCK: Make it possible to disable the block layer [try #6]	David Howells	2006-09-30	1	-0/+5
	Make it possible to disable the block layer. Not all embedded devices require it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require the block layer to be present.
	This patch does the following:
	 (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev support.
	 (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls an item that uses the block layer. This includes:
	     (*) Block I/O tracing.
	     (*) Disk partition code.
	     (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
	     (*) The SCSI layer. As far as I can tell, even SCSI chardevs use the block layer to do scheduling. Some drivers that use SCSI facilities - such as USB storage - end up disabled indirectly from this.
	     (*) Various block-based device drivers, such as IDE and the old CDROM drivers.
	     (*) MTD blockdev handling and FTL.
	     (*) JFFS - which uses set_bdev_super(), something it could avoid doing by taking a leaf out of JFFS2's book.
	 (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is, however, still used in places, and so is still available.
	 (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and parts of linux/fs.h.
	 (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
	 (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
	 (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK is not enabled.
	 (*) fs/no-block.c is created to hold out-of-line stubs and things that are required when CONFIG_BLOCK is not set:
	     (*) Default blockdev file operations (to give error ENODEV on opening).
	 (*) Makes some /proc changes:
	     (*) /proc/devices does not list any blockdevs.
	     (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
	 (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
	 (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if given command other than Q_SYNC or if a special device is specified.
	 (*) In init/do_mounts.c, no reference is made to the blockdev routines if CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
	 (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return error ENOSYS by way of cond_syscall if so).
	 (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if CONFIG_BLOCK is not set, since they can't then happen.
	Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	[PATCH] BLOCK: Move extern declarations out of fs/*.c into header files [try #6]	David Howells	2006-09-30	1	-0/+2
	Create a new header file, fs/internal.h, for common definitions local to the sources in the fs/ directory. Move extern definitions that should be in header files from fs/*.c to fs/internal.h or other main header files where they span directories. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	[PATCH] BLOCK: Remove duplicate declaration of exit_io_context() [try #6]	David Howells	2006-09-30	1	-0/+1
	Remove the duplicate declaration of exit_io_context() from linux/sched.h. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	[PATCH] Define vsyscall cache as blob to make clearer that user space shouldn't use it	Andi Kleen	2006-09-30	1	-4/+4
	Signed-off-by: Andi Kleen <ak@suse.de>
*	[PATCH] x86: Clean up x86 NMI sysctls	Andi Kleen	2006-09-30	2	-8/+4
	Use prototypes in headers. Don't define panic_on_unrecovered_nmi for all architectures. Cc: dzickus@redhat.com Signed-off-by: Andi Kleen <ak@suse.de>
*	[PATCH] cpuset: fix obscure attach_task vs exiting race	Paul Jackson	2006-09-29	1	-1/+6
	Fix obscure race condition in kernel/cpuset.c attach_task() code.
	There is basically zero chance of anyone accidentally being harmed by this race. It requires a special 'micro-stress' load and special timing loop hacks in the kernel to hit in less than an hour, and even then you'd have to hit it hundreds or thousands of times, followed by some unusual and senseless cpuset configuration requests, including removing the top cpuset, to cause any visibly harmful effects. One could, with perhaps a few days or weeks of such effort, get the reference count on the top cpuset below zero, and manage to crash the kernel by asking to remove the top cpuset. I found it by code inspection.
	The race was introduced when 'the_top_cpuset_hack' was introduced, and one piece of code was not updated. An old check for a possibly null task cpuset pointer needed to be changed to a check for a task marked PF_EXITING. The pointer can't be null anymore, thanks to the_top_cpuset_hack (documented in kernel/cpuset.c). But the task could have gone into PF_EXITING state after it was found in the task_list scan. If a task is PF_EXITING in this code, it is possible that its task->cpuset pointer is pointing to the top cpuset due to the_top_cpuset_hack, rather than because the top_cpuset was that task's last valid cpuset. In that case, the wrong cpuset reference counter would be decremented.
	The fix is trivial. Instead of failing the system call if the task's cpuset pointer is null here, fail it if the task is in PF_EXITING state.
	The code for 'the_top_cpuset_hack' that changes an exiting task's cpuset to the top_cpuset is done without locking, so could happen at any time. But it is done during the exit handling, after the PF_EXITING flag is set. So if we verify that a task is still not PF_EXITING after we copy out its cpuset pointer (into 'oldcs', below), we know that 'oldcs' is not one of these hack references to the top_cpuset.
	Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] lockdep core: improve the lock-chain-hash	Ingo Molnar	2006-09-29	1	-2/+2
	With CONFIG_DEBUG_LOCK_ALLOC turned off i was getting sporadic failures in the locking self-test:
	------------>
	| Locking API testsuite:
	----------------------------------------------------------------------------
	                                 | spin |wlock |rlock |mutex | wsem | rsem |
	  --------------------------------------------------------------------------
	                     A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
	                 A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
	             A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
	             A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
	         A-B-B-C-C-D-D-A deadlock:  ok  |FAILED|  ok  |  ok  |  ok  |  ok  |
	         A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
	         A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |FAILED|
	after much debugging it turned out to be caused by accidental chain-hash key collisions. The current hash is:
		#define iterate_chain_key(key1, key2) \
			(((key1) << MAX_LOCKDEP_KEYS_BITS/2) ^ \
			((key1) >> (64-MAX_LOCKDEP_KEYS_BITS/2)) ^ \
			(key2))
	where MAX_LOCKDEP_KEYS_BITS is 11. This hash is pretty good as it will shift by 5 bits in every iteration, where every new ID 'mixed' into the hash would have up to 11 bits. But because there was a 6 bits overlap between subsequent IDs and their high bits tended to be similar, there was a chance for accidental chain-hash collision for a low number of locks held.
	the solution is to shift by 11 bits:
		#define iterate_chain_key(key1, key2) \
			(((key1) << MAX_LOCKDEP_KEYS_BITS) ^ \
			((key1) >> (64-MAX_LOCKDEP_KEYS_BITS)) ^ \
			(key2))
	This keeps the hash perfect up to 5 locks held, but even above that the hash is still good because 11 bits is a relative prime to the total 64 bits, so a complete match will only occur after 64 held locks (which doesnt happen in Linux). Even after 5 locks held, entropy of the 5 IDs mixed into the hash is already good enough so that overlap doesnt generate a colliding hash ID.
	with this change the false positives went away.
	Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
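The difference between the two iterate_chain_key() variants quoted in this message can be reproduced in userspace. The sketch below rewrites both macros as functions over `uint64_t` and folds a list of lock IDs into a chain key starting from 0. The helper names (`iterate_old`, `iterate_new`, `chain_key`), the starting value and the sample IDs are illustrative assumptions, not lockdep code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LOCKDEP_KEYS_BITS 11

/* Old mixing step: rotate left by 11/2 = 5 bits, then XOR in the new ID. */
uint64_t iterate_old(uint64_t key1, uint64_t key2)
{
	return (key1 << (MAX_LOCKDEP_KEYS_BITS / 2)) ^
	       (key1 >> (64 - MAX_LOCKDEP_KEYS_BITS / 2)) ^
	       key2;
}

/* New mixing step: rotate left by the full 11 bits, then XOR in the new ID. */
uint64_t iterate_new(uint64_t key1, uint64_t key2)
{
	return (key1 << MAX_LOCKDEP_KEYS_BITS) ^
	       (key1 >> (64 - MAX_LOCKDEP_KEYS_BITS)) ^
	       key2;
}

typedef uint64_t (*mix_fn)(uint64_t, uint64_t);

/* Fold a sequence of lock IDs (each below 2^11) into one chain key. */
uint64_t chain_key(mix_fn mix, const uint16_t *ids, size_t n)
{
	uint64_t key = 0;

	for (size_t i = 0; i < n; i++) {
		assert(ids[i] < (1u << MAX_LOCKDEP_KEYS_BITS));
		key = mix(key, ids[i]);
	}
	return key;
}
```

With the old step, the hand-picked chains {1, 0} and {0, 32} both hash to 32, because an ID shifted by 5 bits lands on bits that a later 11-bit ID can also set. With the new step the same chains give 2048 and 32, since up to five 11-bit IDs occupy disjoint bit ranges of the 64-bit key.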
*	[PATCH] audit/accounting: tty locking	Alan Cox	2006-09-29	2	-1/+10
	Add tty locking around the audit and accounting code. The whole current->signal-> locking is all deeply strange but it's for someone else to sort out. Add rather than replace the lock for acct.c. Signed-off-by: Alan Cox <alan@redhat.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] stop_machine.c copyright	Rusty Russell	2006-09-29	1	-0/+3
	I had to look back: this code was extracted from the module.c code in 2005. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] /sys/modules: allow full length section names	Ian S. Nelson	2006-09-29	1	-6/+20
	I've been using systemtap for some debugging and I noticed that it can't probe a lot of modules. Turns out it's kind of silly: the sections section of /sys/module is limited to 32-byte filenames and many of the actual sections are a bit longer than that. [akpm@osdl.org: rewrite to use dynamic allocation] Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] cpuset: hotunplug cpus and mems in all cpusets	Paul Jackson	2006-09-29	1	-17/+70
	The cpuset code handling hot unplug of CPUs or Memory Nodes was incorrect - it could remove a CPU or Node from the top cpuset, while leaving it still in some child cpusets.
	One basic rule of cpusets is that each cpuset's cpus and mems are subsets of its parent's. The cpuset hot unplug code violated this rule.
	So the cpuset hotunplug handler must walk down the tree, removing any removed CPU or Node from all cpusets.
	However, it is not allowed to make a cpuset's cpus or mems become empty. They can only transition from empty to non-empty, not back.
	So if the last CPU or Node would be removed from a cpuset by the above walk, we scan back up the cpuset hierarchy, finding the nearest ancestor that still has something online, and copy its CPU or Memory placement.
	Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Nathan Lynch <ntl@pobox.com> Cc: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
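The walk described in this message (strip the unplugged CPU from every cpuset, and if a cpuset would become empty, copy the placement of its nearest non-empty ancestor) can be sketched with plain bitmasks. This is a simplified model, not the kernel's implementation: `struct cs`, `cs_hotunplug` and the parents-first array are hypothetical, only the CPU mask is modelled, and locking is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a cpuset: one CPU bitmask plus a parent link. */
struct cs {
	unsigned long cpus;	/* bit n set: CPU n is in this cpuset */
	struct cs *parent;	/* NULL for the top cpuset */
};

/*
 * Apply a hot unplug to every cpuset. 'sets' must list parents before
 * their children (top cpuset first); 'online' is the mask of CPUs that
 * are still online after the unplug.
 */
void cs_hotunplug(struct cs **sets, size_t n, unsigned long online)
{
	assert(sets != NULL);

	for (size_t i = 0; i < n; i++) {
		struct cs *c = sets[i];
		unsigned long old = c->cpus;

		c->cpus &= online;
		if (old != 0 && c->cpus == 0) {
			/*
			 * The cpuset would become empty. Ancestors were
			 * already trimmed, so borrow the placement of the
			 * nearest one that still has something online.
			 */
			for (struct cs *p = c->parent; p != NULL; p = p->parent) {
				if (p->cpus != 0) {
					c->cpus = p->cpus;
					break;
				}
			}
		}
	}
}
```

Processing parents first means every borrowed mask is already restricted to online CPUs, so the subset rule (child within parent) still holds after the walk in this model.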
*	[PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map	Paul Jackson	2006-09-29	1	-3/+25
	Change the list of memory nodes allowed to tasks in the top (root) cpuset to dynamically track which memory nodes are online, using a call to a cpuset hook from the memory hotplug code. Make this top 'mems' file read-only.
	On systems that have cpusets configured in their kernel, but that aren't actively using cpusets (for some distros, this covers the majority of systems) all tasks end up in the top cpuset.
	If that system does support memory hotplug, then these tasks cannot make use of memory nodes that are added after system boot, because the memory nodes are not allowed in the top cpuset. This is a surprising regression over earlier kernels that didn't have cpusets enabled.
	One key motivation for this change is to remain consistent with the behaviour for the top_cpuset's 'cpus', which is also read-only, and which automatically tracks the cpu_online_map.
	This change also has the minor benefit that it fixes a long standing, little noticed, minor bug in cpusets. The cpuset performance tweak to short circuit the cpuset_zone_allowed() check on systems with just a single cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply changing the 'mems' of the top_cpuset had no effect, even though the change (the write system call) appeared to succeed. With the following change, that write to the 'mems' file fails with -EACCES, and the 'mems' file stubbornly refuses to be changed via user space writes. Thus no one should be misled into thinking they've changed the top_cpuset's 'mems' when in effect they haven't.
	In order to keep the behaviour of cpusets consistent between systems actively making use of them and systems not using them, this patch changes the behaviour of the 'mems' file in the top (root) cpuset, making it read only, and making it automatically track the value of node_online_map. Thus tasks in the top cpuset will have automatic use of hot plugged memory nodes allowed by their cpuset.
	[akpm@osdl.org: build fix] [bunk@stusta.de: build fix] Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] introduce TASK_DEAD state	Oleg Nesterov	2006-09-29	2	-5/+5
	I am not sure about this patch, I am asking Ingo to take a decision. task_struct->state == EXIT_DEAD is a very special case; to avoid confusion it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should live only in ->exit_state as documented in sched.h. Note that this state is not visible to user-space, get_task_state() masks off unsuitable states. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] kill PF_DEAD flag	Oleg Nesterov	2006-09-29	2	-12/+10
	After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we don't need PF_DEAD any longer. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] set EXIT_DEAD state in do_exit(), not in schedule()	Oleg Nesterov	2006-09-29	2	-3/+1
	schedule() checks PF_DEAD on every context switch and sets ->state = EXIT_DEAD to ensure that the exiting task will be deactivated. Note that this EXIT_DEAD is in fact a "random" value, we can use any bit except normal TASK_XXX values. It is better to set this state in do_exit() along with PF_DEAD flag and remove that check in schedule(). We are safe wrt concurrent try_to_wake_up() (for example ptrace, tkill), it can not change task's ->state: the 'state' argument of try_to_wake_up() can't have EXIT_DEAD bit. And in case when try_to_wake_up() sees a stale value of ->state == TASK_RUNNING it will do nothing. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] sys_get_robust_list(): don't take tasklist_lock	Oleg Nesterov	2006-09-29	1	-3/+3
	Use rcu locks for find_task_by_pid(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] futex_find_get_task(): don't take tasklist_lock	Oleg Nesterov	2006-09-29	1	-2/+2
	It is ok to do find_task_by_pid() + get_task_struct() under rcu_read_lock(), we can drop tasklist_lock. Note that testing of ->exit_state is racy with or without tasklist anyway. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] copy_process: cosmetic ->ioprio tweak	Oleg Nesterov	2006-09-29	1	-6/+3
	copy_process:
		// holds tasklist_lock + ->siglock
		/*
		 * inherit ioprio
		 */
		p->ioprio = current->ioprio;
	Why? ->ioprio was already copied in dup_task_struct(). I guess this is needed to ensure that the child can't escape sys_ioprio_set(IOPRIO_WHO_{PGRP,USER}), yes?
	In that case we don't need ->siglock held, and the comment should be updated.
	Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Jens Axboe <axboe@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] reparent_to_init(): use has_rt_policy()	Oleg Nesterov	2006-09-29	1	-3/+1
	Remove open-coded has_rt_policy(), no changes in kernel/exit.o Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] sched_setscheduler: fix? policy checks	Oleg Nesterov	2006-09-29	1	-20/+18
	I am not sure this patch is correct: I can't understand what the current code does, and I don't know what it was supposed to do.
	The comment says:
		 * can't change policy, except between SCHED_NORMAL
		 * and SCHED_BATCH:
	The code:
		if (((policy != SCHED_NORMAL && p->policy != SCHED_BATCH) &&
			(policy != SCHED_BATCH && p->policy != SCHED_NORMAL)) &&
	But this is equivalent to:
		if ( (is_rt_policy(policy) && has_rt_policy(p)) &&
	which means something different. We can't _decrease_ the current ->rt_priority with such a check (if rlim[RLIMIT_RTPRIO] == 0).
	Probably, it was supposed to be:
		if (	!(policy == SCHED_NORMAL && p->policy == SCHED_BATCH) &&
			!(policy == SCHED_BATCH && p->policy == SCHED_NORMAL)
	this matches the comment, but strange: it doesn't allow to _drop_ the realtime priority when rlim[RLIMIT_RTPRIO] == 0.
	I think the right check would be:
		/* can't set/change rt policy */
		if (is_rt_policy(policy) && policy != p->policy && !rlim_rtprio)
			return -EPERM;
	Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
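The equivalence claimed in this message can be checked by brute force over the four scheduling policies that existed at the time. The sketch below is a userspace model: the `P_*` constants are stand-ins for the kernel's SCHED_* values, `cur` plays the role of p->policy, and the function names other than is_rt_policy() are made up for illustration.

```c
#include <assert.h>

/* Stand-ins for SCHED_NORMAL, SCHED_FIFO, SCHED_RR, SCHED_BATCH (illustrative). */
enum policy { P_NORMAL = 0, P_FIFO = 1, P_RR = 2, P_BATCH = 3, P_COUNT = 4 };

/* NORMAL and BATCH are the non-realtime policies; everything else is rt. */
int is_rt_policy(int policy)
{
	return policy != P_NORMAL && policy != P_BATCH;
}

/* The condition as quoted from the old code. */
int old_check(int policy, int cur)
{
	return (policy != P_NORMAL && cur != P_BATCH) &&
	       (policy != P_BATCH && cur != P_NORMAL);
}

/* The condition that would match the comment. */
int comment_check(int policy, int cur)
{
	return !(policy == P_NORMAL && cur == P_BATCH) &&
	       !(policy == P_BATCH && cur == P_NORMAL);
}

/* The check proposed at the end of the message; non-zero means -EPERM. */
int proposed_denied(int policy, int cur, int rlim_rtprio)
{
	assert(policy >= 0 && policy < P_COUNT && cur >= 0 && cur < P_COUNT);
	return is_rt_policy(policy) && policy != cur && !rlim_rtprio;
}

/* Count (policy, cur) pairs where old_check() differs from "both are rt". */
int count_mismatches(void)
{
	int bad = 0;

	for (int policy = 0; policy < P_COUNT; policy++)
		for (int cur = 0; cur < P_COUNT; cur++)
			if (old_check(policy, cur) !=
			    (is_rt_policy(policy) && is_rt_policy(cur)))
				bad++;
	return bad;
}
```

`count_mismatches()` returning 0 confirms that the old condition only fires when both the requested and the current policy are realtime. In this model, `comment_check(P_NORMAL, P_FIFO)` is true, which is the "can't drop realtime" problem the message points out, while `proposed_denied(P_NORMAL, P_FIFO, 0)` is false.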
*	[PATCH] introduce is_rt_policy() helper	Oleg Nesterov	2006-09-29	1	-3/+2
	Imho, makes the code a bit easier to read. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] do_sched_setscheduler(): don't take tasklist_lock	Oleg Nesterov	2006-09-29	1	-10/+18
	Use rcu locks instead. sched_setscheduler() now takes ->siglock before reading ->signal->rlim[]. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] kill extraneous printk in kernel_restart()	Cal Peake	2006-09-29	1	-1/+0
	Get rid of an extraneous printk in kernel_restart(). Signed-off-by: Cal Peake <cp@absolutedigital.net> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] Fix ____call_usermodehelper errors being silently ignored	Björn Steinbrink	2006-09-29	1	-1/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If ____call_usermodehelper fails, we're not interested in the child process' exit value, but the real error, so let's stop wait_for_helper from overwriting it in that case. Issue discovered by Benedikt Böhm while working on a Linux-VServer usermode helper. Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] exit: fix crash case	Alan Cox	2006-09-29	1	-1/+2
\| \| \| \| \| \| \| \| \|	If we are going to BUG() not panic() here then we should cover the case of the BUG being compiled out Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] kexec warning fix	Roland McGrath	2006-09-29	1	-2/+4
\| \| \| \| \| \| \| \| \|	This fixes a couple of compiler warnings, and adds paranoia checks as well. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] simplify update_times (avoid jiffies/jiffies_64 aliasing problem)	Atsushi Nemoto	2006-09-29	1	-13/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pass ticks to do_timer() and update_times(), and adjust the x86_64 and s390 timer interrupt handlers for this change. Currently update_times() calculates ticks by "jiffies - wall_jiffies", but callers of do_timer() should know how many ticks to update. Passing ticks gets rid of this redundant calculation. There is also another redundancy pointed out by Martin Schwidefsky. This cleanup makes the barrier added by 5aee405c662ca644980c184774277fc6d0769a84 needless, so this patch removes it. As a bonus, this cleanup makes wall_jiffies easy to remove, since wall_jiffies is now always in sync with jiffies. (This patch does not actually remove wall_jiffies. That would be another cleanup patch.) Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Acked-by: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Hirokazu Takata <takata.hirokazu@renesas.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Acked-by: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
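The change described in this message can be pictured with a simplified user-space model. This is only a sketch under stated assumptions: `jiffies_sim` and `wall_jiffies_sim` are hypothetical stand-ins for the kernel counters, the `_old`/`_new` suffixes are illustrative names, and everything else the real functions do (wall time update, load calculation, locking) is left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's jiffies and wall_jiffies. */
static unsigned long jiffies_sim;
static unsigned long wall_jiffies_sim;

/* Before: the tick count is re-derived from the two counters. */
static void update_times_old(void)
{
	unsigned long ticks = jiffies_sim - wall_jiffies_sim;

	if (ticks)
		wall_jiffies_sim += ticks;
}

static void do_timer_old(void)
{
	jiffies_sim++;
	update_times_old();
}

/* After: the caller, which already knows the tick count, passes it in. */
static void update_times_new(unsigned long ticks)
{
	wall_jiffies_sim += ticks;
}

static void do_timer_new(unsigned long ticks)
{
	jiffies_sim += ticks;
	update_times_new(ticks);
	/* Both counters advance together, so wall_jiffies is redundant. */
	assert(wall_jiffies_sim == jiffies_sim);
}
```

In this model a handler that has accumulated several ticks would call `do_timer_new(n)` once instead of deriving the count from the counter difference.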
*	[PATCH] __dequeue_signal() cleanup	Roland McGrath	2006-09-29	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \|	This tightens up __dequeue_signal a little. It also avoids doing recalc_sigpending twice in a row, instead doing it once in dequeue_signal. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] has_stopped_jobs() cleanup	Roland McGrath	2006-09-29	1	-11/+0
\| \| \| \| \| \| \| \| \| \| \|	This check has been obsolete since the introduction of TASK_TRACED. Now TASK_STOPPED always means job control stop. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] posix-timers: Fix the flags handling in posix_cpu_nsleep()	Toyo Abe	2006-09-29	1	-26/+58
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a posix_cpu_nsleep() sleep is interrupted by a signal more than twice, it incorrectly reports the sleep time remaining to the user. Because posix_cpu_nsleep() doesn't report back to the user when it's called from restart function due to the wrong flags handling. This patch, which applies after previous one, moves the nanosleep() function from posix_cpu_nsleep() to do_cpu_nanosleep() and cleans up the flags handling appropriately. Signed-off-by: Toyo Abe <toyoa@mvista.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] posix-timers: Fix clock_nanosleep() doesn't return the remaining time in compatibility mode	Toyo Abe	2006-09-29	4	-15/+76
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The clock_nanosleep() function does not return the time remaining when the sleep is interrupted by a signal. This patch creates a new call out, compat_clock_nanosleep_restart(), which handles returning the remaining time after a sleep is interrupted. This patch revives clock_nanosleep_restart(). It is now accessed via the new call out. The compat_clock_nanosleep_restart() is used for compatibility access. Since this is implemented in compatibility mode the normal path is virtually unaffected - no real performance impact. Signed-off-by: Toyo Abe <toyoa@mvista.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] check return value of cpu_callback	Akinobu Mita	2006-09-29	4	-4/+11
\| \| \| \| \| \| \| \| \| \| \| \| \|	Spawning ksoftirqd, migration, or watchdog, and calling init_timers_cpu() may fail when memory is low. If that happens in initcalls, a kernel NULL pointer dereference happens later. This patch makes the crash happen immediately in such cases. It seems a bit better than getting a kernel NULL pointer dereference later. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] memory ordering in __kfifo primitives	Paul E. McKenney	2006-09-29	1	-0/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Both __kfifo_put() and __kfifo_get() have header comments stating that if there is but one concurrent reader and one concurrent writer, locking is not necessary. This is almost the case, but a couple of memory barriers are needed. Another option would be to change the header comments to remove the bit about locking not being needed, and to change those callers who currently don't use locking to add the required locking. The attachment analyzes this approach, but the patch below seems simpler. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Cc: Stelian Pop <stelian@popies.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
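The ordering requirement described in this message can be illustrated with a user-space analogue. The sketch below is not the kfifo code: it uses C11 acquire/release atomics in place of kernel memory barriers, and `struct ring`, `ring_put()`, `ring_get()` and `RING_SIZE` are hypothetical names. It assumes exactly one producer thread and one consumer thread.

```c
#include <assert.h>
#include <stdatomic.h>

#define RING_SIZE 16u
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "size must be a power of two");

struct ring {
	unsigned char buf[RING_SIZE];
	atomic_uint in;		/* advanced only by the producer */
	atomic_uint out;	/* advanced only by the consumer */
};

/* Producer side: store up to len bytes, return how many were stored. */
static unsigned int ring_put(struct ring *r, const unsigned char *src,
			     unsigned int len)
{
	unsigned int in = atomic_load_explicit(&r->in, memory_order_relaxed);
	/* Acquire: observe the consumer's index before reusing its slots. */
	unsigned int out = atomic_load_explicit(&r->out, memory_order_acquire);
	unsigned int space = RING_SIZE - (in - out);

	if (len > space)
		len = space;
	for (unsigned int i = 0; i < len; i++)
		r->buf[(in + i) & (RING_SIZE - 1)] = src[i];
	/* Release: the data must be visible before the new 'in' index. */
	atomic_store_explicit(&r->in, in + len, memory_order_release);
	return len;
}

/* Consumer side: fetch up to len bytes, return how many were fetched. */
static unsigned int ring_get(struct ring *r, unsigned char *dst,
			     unsigned int len)
{
	unsigned int out = atomic_load_explicit(&r->out, memory_order_relaxed);
	/* Acquire: observe the producer's index before reading its data. */
	unsigned int in = atomic_load_explicit(&r->in, memory_order_acquire);
	unsigned int avail = in - out;

	if (len > avail)
		len = avail;
	for (unsigned int i = 0; i < len; i++)
		dst[i] = r->buf[(out + i) & (RING_SIZE - 1)];
	/* Release: finish reading before the slots are handed back. */
	atomic_store_explicit(&r->out, out + len, memory_order_release);
	return len;
}
```

Without the release/acquire pairs, a reader could see the advanced index before the bytes it covers, which is the kind of gap the message says the header comments overlooked.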
*	[PATCH] lockdep: print kernel version	Dave Jones	2006-09-29	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \|	Lets do the same thing we do for oopses - print out the version in the report. It's an extra line of output though. We could tack it on the end of the INFO: lines, but that screws up Ingo's pretty output. Signed-off-by: Dave Jones <davej@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] pidspace: is_init()	Sukadev Bhattiprolu	2006-09-29	6	-5/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
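A minimal sketch of the kind of helper this message describes, assuming it simply wraps the existing pid comparison as the text says. `struct task` is a hypothetical stand-in for the kernel's task structure and `may_be_killed()` is an illustrative call site, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's task structure. */
struct task {
	int pid;
};

/* One place to express "is this the init process?". */
static inline bool is_init(const struct task *tsk)
{
	return tsk->pid == 1;
}

/* Illustrative call site that previously open-coded "tsk->pid == 1". */
static bool may_be_killed(const struct task *tsk)
{
	assert(tsk);
	return !is_init(tsk);	/* init must not die */
}
```

Centralizing the test means that a later change in how init is identified, for example with multiple pid spaces, would only need to touch the helper.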
*	[PATCH] Fix unserialized task->files changing	Kirill Korotaev	2006-09-29	1	-0/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix a race on put_files_struct() on exec with proc. Restoring files on current on the error path may lead to proc having a pointer to an already kfree-d files_struct. The ->files changes in exit.c and kthread.c are safe, as exit_files() does everything under the lock. Found during OpenVZ stress testing. [akpm@osdl.org: add export] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>