op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	i386: remove module.h include from termios.h	Thomas Gleixner	2007-10-11	1	-1/+0
\| \| \| \|	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
*	i386: remove bogus comment about memory barrier	Nick Piggin	2007-09-29	1	-5/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The comment being removed by this patch is incorrect and misleading. In the following situation: 1. load ... 2. store 1 -> X 3. wmb 4. rmb 5. load a <- Y 6. store ... 4 will only ensure ordering of 1 with 5. 3 will only ensure ordering of 2 with 6. Further, a CPU with strictly in-order stores will still only provide that 2 and 6 are ordered (effectively, it is the same as a weakly ordered CPU with wmb after every store). In all cases, 5 may still be executed before 2 is visible to other CPUs! The additional piece of the puzzle that mb() provides is the store/load ordering, which fundamentally cannot be achieved with any combination of rmb()s and wmb()s. This can be an unexpected result if one expected any sort of global ordering guarantee to barriers (eg. that the barriers themselves are sequentially consistent with other types of barriers). However sfence or lfence barriers need only provide an ordering partial ordering of memory operations -- Consider that wmb may be implemented as nothing more than inserting a special barrier entry in the store queue, or, in the case of x86, it can be a noop as the store queue is in order. And an rmb may be implemented as a directive to prevent subsequent loads only so long as their are no previous outstanding loads (while there could be stores still in store queues). I can actually see the occasional load/store being reordered around lfence on my core2. That doesn't prove my above assertions, but it does show the comment is wrong (unless my program is -- can send it out by request). So: mb() and smp_mb() always have and always will require a full mfence or lock prefixed instruction on x86. And we should remove this comment. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Paul McKenney <paulmck@us.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Pull bugzilla-1641 into release branch	Len Brown	2007-08-24	1	-1/+0
\|\
\| *	ACPI: boot correctly with "nosmp" or "maxcpus=0"	Len Brown	2007-08-21	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In MPS mode, "nosmp" and "maxcpus=0" boot a UP kernel with IOAPIC disabled. However, in ACPI mode, these parameters didn't completely disable the IO APIC initialization code and boot failed. init/main.c: Disable the IO_APIC if "nosmp" or "maxcpus=0" undefine disable_ioapic_setup() when it doesn't apply. i386: delete ioapic_setup(), it was a duplicate of parse_noapic() delete undefinition of disable_ioapic_setup() x86_64: rename disable_ioapic_setup() to parse_noapic() to match i386 define disable_ioapic_setup() in header to match i386 http://bugzilla.kernel.org/show_bug.cgi?id=1641 Acked-by: Andi Kleen <ak@suse.de> Signed-off-by: Len Brown <len.brown@intel.com>
* \|	PCI: Document pci_iomap()	Rolf Eike Beer	2007-08-22	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This useful interface is hardly mentioned anywhere in the in-tree documentation. Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* \|	x86_64: Fix to keep watchdog disabled by default for i386/x86_64	Daniel Gollub	2007-08-18	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fixed wrong expression which enabled watchdogs even if nmi_watchdog kernel parameter wasn't set. This regression got slightly introduced with commit b7471c6da94d30d3deadc55986cc38d1ff57f9ca. Introduced NMI_DISABLED (-1) which allows to switch the value of NMI_DEFAULT without breaking the APIC NMI watchdog code (again). Fixes: https://bugzilla.novell.com/show_bug.cgi?id=298084 http://bugzilla.kernel.org/show_bug.cgi?id=7839 And likely some more nmi_watchdog=0 related issues. Signed-off-by: Daniel Gollub <dgollub@suse.de> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* \|	i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()	Satyam Sharma	2007-08-18	2	-2/+4
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use cpu_relax() in the busy loops, as atomic_read() doesn't automatically imply volatility for i386 and x86_64. x86_64 doesn't have this issue because it open-codes the while loop in smpboot.c:smp_callin() itself that already uses cpu_relax(). For i386, however, smpboot.c:smp_callin() calls wait_for_init_deassert() which is buggy for mach-default and mach-es7000 cases. [ I test-built a kernel -- smp_callin() itself got inlined in its only callsite, smpboot.c:start_secondary() -- and the relevant piece of code disassembles to the following: 0xc1019704 <start_secondary+12>: mov 0xc144c4c8,%eax 0xc1019709 <start_secondary+17>: test %eax,%eax 0xc101970b <start_secondary+19>: je 0xc1019709 <start_secondary+17> init_deasserted (at 0xc144c4c8) gets fetched into %eax only once and then we loop over the test of the stale value in the register only, so these look like real bugs to me. With the fix below, this becomes: 0xc1019706 <start_secondary+14>: pause 0xc1019708 <start_secondary+16>: cmpl $0x0,0xc144c4c8 0xc101970f <start_secondary+23>: je 0xc1019706 <start_secondary+14> which looks nice and healthy. ] Thanks to Heiko Carstens for noticing this. Signed-off-by: Satyam Sharma <satyam@infradead.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: Use global flag to disable broken local apic timer on AMD CPUs.	Andi Kleen	2007-08-11	2	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The Averatec 2370 and some other Turion laptop BIOS seems to program the ENABLE_C1E MSR inconsistently between cores. This confuses the lapic use heuristics because when C1E is enabled anywhere it seems to affect the complete chip. Use a global flag instead of a per cpu flag to handle this. If any CPU has C1E enabled disabled lapic use. Thanks to Cal Peake for debugging. Cc: tglx@linutronix.de Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: Make patching more robust, fix paravirt issue	Andi Kleen	2007-08-11	1	-6/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 19d36ccdc34f5ed444f8a6af0cbfdb6790eb1177 "x86: Fix alternatives and kprobes to remap write-protected kernel text" uses code which is being patched for patching. In particular, paravirt_ops does patching in two stages: first it calls paravirt_ops.patch, then it fills any remaining instructions with nop_out(). nop_out calls text_poke() which calls lookup_address() which calls pgd_val() (aka paravirt_ops.pgd_val): that call site is one of the places we patch. If we always do patching as one single call to text_poke(), we only need make sure we're not patching the memcpy in text_poke itself. This means the prototype to paravirt_ops.patch needs to change, to marshal the new code into a buffer rather than patching in place as it does now. It also means all patching goes through text_poke(), which is known to be safe (apply_alternatives is also changed to make a single patch). AK: fix compilation on x86-64 (bad rusty!) AK: fix boot on x86-64 (sigh) AK: merged with other patches Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	finish i386 and x86-64 sysdata conversion	Muli Ben-Yehuda	2007-08-11	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch finishes the i386 and x86-64 ->sysdata conversion and hopefully also fixes Riku's and Andy's observed bugs. It is based on Yinghai Lu's and Andy Whitcroft's patches (thanks!) with some changes: - introduce pci_scan_bus_with_sysdata() and use it instead of pci_scan_bus() where appropriate. pci_scan_bus_with_sysdata() will allocate the sysdata structure and then call pci_scan_bus(). - always allocate pci_sysdata dynamically. The whole point of this sysdata work is to make it easy to do root-bus specific things (e.g., support PCI domains and IOMMU's). I dislike using a default struct pci_sysdata in some places and a dynamically allocated pci_sysdata elsewhere - the potential for someone indavertantly changing the default structure is too high. - this patch only makes the minimal changes necessary, i.e., the NUMA node is always initialized to -1. Patches to do the right thing with regards to the NUMA node can build on top of this (either add a 'node' parameter to pci_scan_bus_with_sysdata() or just update the node when it becomes known). The patch was compile tested with various configurations (e.g., NUMAQ, VISWS) and run-time tested on i386 and x86-64. Unfortunately none of my machines exhibited the bugs so caveat emptor. Andy, could you please see if this fixes the NUMA issues you've seen? Riku, does this fix "pci=noacpi" on your laptop? Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Andi Kleen <ak@suse.de> Cc: Chuck Ebbert <cebbert@redhat.com> Cc: <riku.seppala@kymp.net> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Jeff Garzik <jeff@garzik.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	revert "x86, serial: convert legacy COM ports to platform devices"	Andrew Morton	2007-07-31	1	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Revert 7e92b4fc345f5b6f57585fbe5ffdb0f24d7c9b26. It broke Sébastien Dugué's machine and Jeff said (persuasively) This seems like it will break decades-long-working stuff, in favor of breaking new ground in our favorite area, "trusting the BIOS." It's just not worth it for serial ports, IMO. Serial ports are something that just shouldn't break at this late stage in the game. My new Intel platform boxes don't even have serial ports, so I question the value of messing with serial port probing even more... because... just wait a year, and your box won't have a serial port either! :) I certainly don't object to the use of platform devices (or isa_driver), but the probe change seems questionable. That's sorta analagous to rewriting the floppy driver probe routine. Sure you could do it... but why risk all that damage and go through debugging all over again? It seems clear from this report that we cannot, should not, trust BIOS for something (a) so simple and (b) that has been working for over a decade. Much discussion ensued and we've decided to have another go at all of this. Cc: Sébastien Dugué <sebastien.dugue@bull.net> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Len Brown <lenb@kernel.org> Cc: Adam Belay <ambx1@neo.rr.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Jeff Garzik <jeff@garzik.org> Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com> Cc: Sascha Sommer <saschasommer@freenet.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	remove unused TIF_NOTIFY_RESUME flag	Stephane Eranian	2007-07-31	1	-10/+8
\| \| \| \| \| \| \| \| \| \| \| \|	Remove unused TIF_NOTIFY_RESUME flag for all processor architectures. The flag was not used excecpt on IA-64 where the patch replaces it with TIF_PERFMON_WORK. Signed-off-by: stephane eranian <eranian@hpl.hp.com> Cc: <linux-arch@vger.kernel.org> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION	Rafael J. Wysocki	2007-07-29	1	-1/+1
\| \| \| \| \| \| \| \| \|	Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION to avoid confusion (among other things, with CONFIG_SUSPEND introduced in the next patch). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	[x86 setup] Make struct ist_info cross-architecture, and use in setup code	H. Peter Anvin	2007-07-25	2	-6/+9
\| \| \| \| \| \| \| \|	Make "struct ist_info" valid on both i386 and x86-64, and use the structure by name in the setup code. Additionally, "Intel SpeedStep IST" is redundant, refer to it as IST consistently. Signed-off-by: H. Peter Anvin <hpa@zytor.com>
*	[x86 setup] Fix typos in struct efi_info	H. Peter Anvin	2007-07-25	1	-2/+2
\| \| \| \| \| \|	Fix missing letters in the structure members of struct efi_info. Signed-off-by: H. Peter Anvin <hpa@zytor.com>
*	ACPI: Kconfig: remove CONFIG_ACPI_SLEEP from source	Len Brown	2007-07-25	2	-15/+10
\| \| \| \| \| \| \| \| \| \|	As it was a synonym for (CONFIG_ACPI && CONFIG_X86), the ifdefs for it were more clutter than they were worth. For ia64, just add a few stubs in anticipation of future S3 or S4 support. Signed-off-by: Len Brown <len.brown@intel.com>
*	i386: Use patchable lock prefix in set_64bit	Andi Kleen	2007-07-22	1	-1/+1
\| \| \| \| \| \| \| \|	Previously lock was unconditionally used, but shouldn't be needed on UP systems. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	x86: Replace NSC/Cyrix specific chipset access macros by inlined functions.	Juergen Beisert	2007-07-22	2	-11/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Due to index register access ordering problems, when using macros a line like this fails (and does nothing): setCx86(CX86_CCR2, getCx86(CX86_CCR2) \| 0x88); With inlined functions this line will work as expected. Note about a side effect: Seems on Geode GX1 based systems the "suspend on halt power saving feature" was never enabled due to this wrong macro expansion. With inlined functions it will be enabled, but this will stop the TSC when the CPU runs into a HLT instruction. Kernel output something like this: Clocksource tsc unstable (delta = -472746897 ns) This is the 3rd version of this patch. - Adding missed arch/i386/kernel/cpu/mtrr/state.c Thanks to Andres Salomon - Adding some big fat comments into the new header file Suggested by Andi Kleen AK: fixed x86-64 compilation Signed-off-by: Juergen Beisert <juergen@kreuzholzen.de> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	x86: Stop MCEs and NMIs during code patching	Andi Kleen	2007-07-22	2	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a machine check or NMI occurs while multiple byte code is patched the CPU could theoretically see an inconsistent instruction and crash. Prevent this by temporarily disabling MCEs and returning early in the NMI handler. Based on discussion with Mathieu Desnoyers. Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	x86: Fix alternatives and kprobes to remap write-protected kernel text	Andi Kleen	2007-07-22	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reenable kprobes and alternative patching when the kernel text is write protected by DEBUG_RODATA Add a general utility function to change write protected text. The new function remaps the code using vmap to write it and takes care of CPU synchronization. It also does CLFLUSH to make icache recovery faster. There are some limitations on when the function can be used, see the comment. This is a newer version that also changes the paravirt_ops code. text_poke also supports multi byte patching now. Contains bug fixes from Zach Amsden and suggestions from Mathieu Desnoyers. Cc: Jan Beulich <jbeulich@novell.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Zach Amsden <zach@vmware.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	x86-64: introduce struct pci_sysdata to facilitate sharing of ->sysdata	Muli Ben-Yehuda	2007-07-21	2	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch introduces struct pci_sysdata to x86 and x86-64, and converts the existing two users (NUMA, Calgary) to use it. This lays the groundwork for having other users of sysdata, such as the PCI domains work. The Calgary bits are tested, the NUMA bits just look ok. Signed-off-by: Jeff Garzik <jeff@garzik.org> Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: basic infrastructure support for AMD geode-class machines	Andres Salomon	2007-07-21	1	-0/+159
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This builds upon the existing geode infrastructure, but adds southbridge support, some GPIO functions, and a header file (asm-i386/geode.h) with some useful GX/LX detection tests. The majority of this code was written by Jordan Crouse. Signed-off-by: Jordan Crouse <jordan.crouse@amd.com> Signed-off-by: Andres Salomon <dilinger@debian.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Brownell <david-b@pacbell.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: move PIT function declarations and constants to correct header file	Thomas Gleixner	2007-07-21	3	-6/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	setup_pit_timer is declared in asm-i386/timer.h. Move it to the pit header file, so it can be used by x86_64 as well. Move also the PIT constants. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: replace hard-coded constant with appropriate macro from kernel.h	Robert P. J. Day	2007-07-21	1	-1/+1
\| \| \| \| \| \| \|	Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: add cpu_relax() to cmos_lock()	Andreas Mohr	2007-07-21	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add cpu_relax() to cmos_lock() inline function for faster operation on SMT CPUs and less power consumption on others in case of lock contention (which probably doesn't happen too often, so admittedly this patch is not too exciting). [akpm@linux-foundation.org: Include the header file for cpu_relax()] Signed-off-by: Andreas Mohr <andi@lisas.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: do not restore reserved memory after hibernation	Rafael J. Wysocki	2007-07-21	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On some systems the ACPI NVS area is located in the first 1 MB of RAM and it is overwritten by the i386 code during the restore after hibernation. This confuses the ACPI platform firmware that doesn't update the AC adapter status appropriately as a result (http://bugzilla.kernel.org/show_bug.cgi?id=7995). The solution is to register the reserved memory in the first 1 MB as 'nosave', so that swsusp doesn't touch it during the restore. Also, this has been done on x86_64 for a long time now, so this patch makes the i386 restore code behave like the x86_64 one. [akpm@linux-foundation.org: build fix] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	x86: remove support for the Rise CPU	Adrian Bunk	2007-07-21	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The Rise CPUs were only very short-lived, and there are no reports of anyone both owning one and running Linux on it. Googling for the printk string "CPU: Rise iDragon" didn't find any dmesg available online. If it turns out that against all expectations there are actually users reverting this patch would be easy. This patch will make the kernel images smaller by a few bytes for all i386 users. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: add reference to the arguments	Andrew Morton	2007-07-21	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \|	Prevent stuff like this: mm/vmalloc.c: In function 'unmap_kernel_range': mm/vmalloc.c:75: warning: unused variable 'start' Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	x86: PM_TRACE support	Nigel Cunningham	2007-07-21	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \|	Signed-off-by: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: fix machine rebooting	Truxton Fulton	2007-07-21	1	-1/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 59f4e7d572980a521b7bdba74ab71b21f5995538 fixed machine rebooting on Truxton's machine (when no keyboard was present). But it broke it on Lee's machine. The patch reinstates the old (pre-59f4e7d572980a521b7bdba74ab71b21f5995538) code and if that doesn't work out, try the new, post-59f4e7d572980a521b7bdba74ab71b21f5995538 code instead. Cc: Lee Garrett <lee-in-berlin@web.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: minor nx handling adjustment	Jan Beulich	2007-07-21	1	-1/+0
\| \| \| \| \| \| \| \|	Constrain __supported_pte_mask and NX handling to just the PAE kernel. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	x86: share hpet.h with i386	Thomas Gleixner	2007-07-21	1	-78/+48
\| \| \| \| \| \| \| \| \| \| \| \| \|	hpet.h in asm-i386 and asm-x86_64 contain tons of duplicated stuff. Consolidate into one shared header file. AK: Fix i386 compilation with !X86_IO_APIC Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: remove pit_interrupt_hook	Chris Wright	2007-07-21	3	-13/+2
\| \| \| \| \| \| \| \| \| \|	Remove pit_interrupt_hook as it adds just an extra layer. Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: Move all simple string operations out of line	Andi Kleen	2007-07-21	1	-230/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The compiler generally generates reasonable inline code for the simple cases and for the rest it's better for code size for them to be out of line. Also there they can be potentially optimized more in the future. In fact they probably should be in a .S file because they're all pure assembly, but that's for another day. Also some code style cleanup on them while I was on it (this seems to be the last untouched really early Linux code) This saves ~12k text for a defconfig kernel with gcc 4.1. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	NTP: move the cmos update code into ntp.c	Thomas Gleixner	2007-07-21	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	i386 and sparc64 have the identical code to update the cmos clock. Move it into kernel/time/ntp.c as there are other architectures coming along with the same requirements. [akpm@linux-foundation.org: build fixes] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Cc: David Miller <davem@davemloft.net> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	i386: Allow KVM on i386 nonpae	Avi Kivity	2007-07-19	2	-10/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, CONFIG_X86_CMPXCHG64 both enables boot-time checking of the cmpxchg64b feature and enables compilation of the set_64bit() family. Since the option is dependent on PAE, and since KVM depends on set_64bit(), this effectively disables KVM on i386 nopae. Simplify by removing the config option altogether: the boot check is made dependent on CONFIG_X86_PAE directly, and the set_64bit() family is exposed without constraints. It is up to users to check for the feature flag (KVM does not as virtualiation extensions imply its existence). Signed-off-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	[PATCH] sched: sched_cacheflush is now unused	Ralf Baechle	2007-07-19	1	-9/+0
\| \| \| \| \| \| \| \|	Since Ingo's recent scheduler rewrite which was merged as commit 0437e109e1841607f2988891eaa36c531c6aa6ac sched_cacheflush is unused. Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
*	lguest: the host code	Rusty Russell	2007-07-19	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is the code for the "lg.ko" module, which allows lguest guests to be launched. [akpm@linux-foundation.org: update for futex-new-private-futexes] [akpm@linux-foundation.org: build fix] [jmorris@namei.org: lguest: use hrtimers] [akpm@linux-foundation.org: x86_64 build fix] Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	arch: personality independent stack top	Peter Zijlstra	2007-07-19	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	New arch macro STACK_TOP_MAX it gives the larges valid stack address for the architecture in question. It differs from STACK_TOP in that it will not distinguish between personalities but will always return the largest possible address. This is used to create the initial stack on execve, which we will move down to the proper location once the binfmt code has figured out where that is. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ollie Wild <aaw@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	define new percpu interface for shared data	Fenghua Yu	2007-07-19	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	per cpu data section contains two types of data. One set which is exclusively accessed by the local cpu and the other set which is per cpu, but also shared by remote cpus. In the current kernel, these two sets are not clearely separated out. This can potentially cause the same data cacheline shared between the two sets of data, which will result in unnecessary bouncing of the cacheline between cpus. One way to fix the problem is to cacheline align the remotely accessed per cpu data, both at the beginning and at the end. Because of the padding at both ends, this will likely cause some memory wastage and also the interface to achieve this is not clean. This patch: Moves the remotely accessed per cpu data (which is currently marked as ____cacheline_aligned_in_smp) into a different section, where all the data elements are cacheline aligned. And as such, this differentiates the local only data and remotely accessed data cleanly. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: <linux-arch@vger.kernel.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	jprobes: remove JPROBE_ENTRY()	Michael Ellerman	2007-07-19	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	AFAICT now that jprobe.entry is a void *, JPROBE_ENTRY doesn't do anything useful - so remove it .. I've left a do-nothing version so that out-of-tree jprobes code will still compile without modifications. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Merge branch 'for_linus' of ↵	Linus Torvalds	2007-07-18	1	-1/+2
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: extent macros cleanup Fix compilation with EXT_DEBUG, also fix leXX_to_cpu conversions. ext4: remove extra IS_RDONLY() check ext4: Use is_power_of_2() Use zero_user_page() in ext4 where possible ext4: Remove 65000 subdirectory limit ext4: Expand extra_inodes space per the s_{want,min}_extra_isize fields ext4: Add nanosecond timestamps jbd2: Move jbd2-debug file to debugfs jbd2: Fix CONFIG_JBD_DEBUG ifdef to be CONFIG_JBD2_DEBUG ext4: Set the journal JBD2_FEATURE_INCOMPAT_64BIT on large devices ext4: Make extents code sanely handle on-disk corruption ext4: copy i_flags to inode flags on write ext4: Enable extents by default Change on-disk format to support 2^15 uninitialized extents write support for preallocated blocks fallocate support in ext4 sys_fallocate() implementation on i386, x86_64 and powerpc
\| *	sys_fallocate() implementation on i386, x86_64 and powerpc	Amit Arora	2007-07-17	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called ->fallocate(). Applications can use this feature to avoid fragmentation to certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the the system becomes full. Currently, glibc provides an interface called posix_fallocate() which can be used for similar cause. Though this has the advantage of working on all file systems, but it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and incase the kernel/filesystem does not implement it, it should fall back to the current implementation of writing zeroes to the new blocks. ToDos: 1. Implementation on other architectures (other than i386, x86_64, and ppc). Patches for s390(x) and ia64 are already available from previous posts, but it was decided that they should be added later once fallocate is in the mainline. Hence not including those patches in this take. 2. Changes to glibc, a) to support fallocate() system call b) to make posix_fallocate() and posix_fallocate64() call fallocate() Signed-off-by: Amit Arora <aarora@in.ibm.com>
* \|	xen: add the Xenbus sysfs and virtual device hotplug driver	Jeremy Fitzhardinge	2007-07-18	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This communicates with the machine control software via a registry residing in a controlling virtual machine. This allows dynamic creation, destruction and modification of virtual device configurations (network devices, block devices and CPUS, to name some examples). [ Greg, would you mind giving this a review? Thanks -J ] Signed-off-by: Ian Pratt <ian.pratt@xensource.com> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Greg KH <greg@kroah.com>
* \|	xen: Core Xen implementation	Jeremy Fitzhardinge	2007-07-18	2	-0/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch is a rollup of all the core pieces of the Xen implementation, including: - booting and setup - pagetable setup - privileged instructions - segmentation - interrupt flags - upcalls - multicall batching BOOTING AND SETUP The vmlinux image is decorated with ELF notes which tell the Xen domain builder what the kernel's requirements are; the domain builder then constructs the address space accordingly and starts the kernel. Xen has its own entrypoint for the kernel (contained in an ELF note). The ELF notes are set up by xen-head.S, which is included into head.S. In principle it could be linked separately, but it seems to provoke lots of binutils bugs. Because the domain builder starts the kernel in a fairly sane state (32-bit protected mode, paging enabled, flat segments set up), there's not a lot of setup needed before starting the kernel proper. The main steps are: 1. Install the Xen paravirt_ops, which is simply a matter of a structure assignment. 2. Set init_mm to use the Xen-supplied pagetables (analogous to the head.S generated pagetables in a native boot). 3. Reserve address space for Xen, since it takes a chunk at the top of the address space for its own use. 4. Call start_kernel() PAGETABLE SETUP Once we hit the main kernel boot sequence, it will end up calling back via paravirt_ops to set up various pieces of Xen specific state. One of the critical things which requires a bit of extra care is the construction of the initial init_mm pagetable. Because Xen places tight constraints on pagetables (an active pagetable must always be valid, and must always be mapped read-only to the guest domain), we need to be careful when constructing the new pagetable to keep these constraints in mind. It turns out that the easiest way to do this is use the initial Xen-provided pagetable as a template, and then just insert new mappings for memory where a mapping doesn't already exist. This means that during pagetable setup, it uses a special version of xen_set_pte which ignores any attempt to remap a read-only page as read-write (since Xen will map its own initial pagetable as RO), but lets other changes to the ptes happen, so that things like NX are set properly. PRIVILEGED INSTRUCTIONS AND SEGMENTATION When the kernel runs under Xen, it runs in ring 1 rather than ring 0. This means that it is more privileged than user-mode in ring 3, but it still can't run privileged instructions directly. Non-performance critical instructions are dealt with by taking a privilege exception and trapping into the hypervisor and emulating the instruction, but more performance-critical instructions have their own specific paravirt_ops. In many cases we can avoid having to do any hypercalls for these instructions, or the Xen implementation is quite different from the normal native version. The privileged instructions fall into the broad classes of: Segmentation: setting up the GDT and the GDT entries, LDT, TLS and so on. Xen doesn't allow the GDT to be directly modified; all GDT updates are done via hypercalls where the new entries can be validated. This is important because Xen uses segment limits to prevent the guest kernel from damaging the hypervisor itself. Traps and exceptions: Xen uses a special format for trap entrypoints, so when the kernel wants to set an IDT entry, it needs to be converted to the form Xen expects. Xen sets int 0x80 up specially so that the trap goes straight from userspace into the guest kernel without going via the hypervisor. sysenter isn't supported. Kernel stack: The esp0 entry is extracted from the tss and provided to Xen. TLB operations: the various TLB calls are mapped into corresponding Xen hypercalls. Control registers: all the control registers are privileged. The most important is cr3, which points to the base of the current pagetable, and we handle it specially. Another instruction we treat specially is CPUID, even though its not privileged. We want to control what CPU features are visible to the rest of the kernel, and so CPUID ends up going into a paravirt_op. Xen implements this mainly to disable the ACPI and APIC subsystems. INTERRUPT FLAGS Xen maintains its own separate flag for masking events, which is contained within the per-cpu vcpu_info structure. Because the guest kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely ignored (and must be, because even if a guest domain disables interrupts for itself, it can't disable them overall). (A note on terminology: "events" and interrupts are effectively synonymous. However, rather than using an "enable flag", Xen uses a "mask flag", which blocks event delivery when it is non-zero.) There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which are implemented to manage the Xen event mask state. The only thing worth noting is that when events are unmasked, we need to explicitly see if there's a pending event and call into the hypervisor to make sure it gets delivered. UPCALLS Xen needs a couple of upcall (or callback) functions to be implemented by each guest. One is the event upcalls, which is how events (interrupts, effectively) are delivered to the guests. The other is the failsafe callback, which is used to report errors in either reloading a segment register, or caused by iret. These are implemented in i386/kernel/entry.S so they can jump into the normal iret_exc path when necessary. MULTICALL BATCHING Xen provides a multicall mechanism, which allows multiple hypercalls to be issued at once in order to mitigate the cost of trapping into the hypervisor. This is particularly useful for context switches, since the 4-5 hypercalls they would normally need (reload cr3, update TLS, maybe update LDT) can be reduced to one. This patch implements a generic batching mechanism for hypercalls, which gets used in many places in the Xen code. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Ian Pratt <ian.pratt@xensource.com> Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Cc: Adrian Bunk <bunk@stusta.de>
* \|	xen: Add Xen interface header files	Jeremy Fitzhardinge	2007-07-18	3	-0/+655
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add Xen interface header files. These are taken fairly directly from the Xen tree, but somewhat rearranged to suit the kernel's conventions. Define macros and inline functions for doing hypercalls into the hypervisor. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Ian Pratt <ian.pratt@xensource.com> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
* \|	Add a sched_clock paravirt_op	Jeremy Fitzhardinge	2007-07-18	3	-4/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The tsc-based get_scheduled_cycles interface is not a good match for Xen's runstate accounting, which reports everything in nanoseconds. This patch replaces this interface with a sched_clock interface, which matches both Xen and VMI's requirements. In order to do this, we: 1. replace get_scheduled_cycles with sched_clock 2. hoist cycles_2_ns into a common header 3. update vmi accordingly One thing to note: because sched_clock is implemented as a weak function in kernel/sched.c, we must define a real function in order to override this weak binding. This means the usual paravirt_ops technique of using an inline function won't work in this case. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Zachary Amsden <zach@vmware.com> Cc: Dan Hecht <dhecht@vmware.com> Cc: john stultz <johnstul@us.ibm.com>
* \|	paravirt: helper to disable all IO space	Jeremy Fitzhardinge	2007-07-18	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a virtual environment, device drivers such as legacy IDE will waste quite a lot of time probing for their devices which will never appear. This helper function allows a paravirt implementation to lay claim to the whole iomem and ioport space, thereby disabling all device drivers trying to claim IO resources. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Rusty Russell <rusty@rustcorp.com.au>
* \|	paravirt: make siblingmap functions visible	Jeremy Fitzhardinge	2007-07-18	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \|	Paravirt implementations need to set the sibling map on new cpus. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
* \|	paravirt: unstatic smp_store_cpu_info	Jeremy Fitzhardinge	2007-07-18	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \|	Paravirt implementations need to store cpu info when bringing up cpus. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>