* [PATCH] timers fixes/improvements  Oleg Nesterov  2005-06-23  (2 files, -192/+166)
    This patch tries to solve the following problems:

    1. del_timer_sync() is racy. The timer can be fired again after
       del_timer_sync() has checked all CPUs and before it rechecks
       timer_pending().

    2. It has scalability problems: all CPUs are scanned to determine whether
       the timer is running on that CPU. With this patch del_timer_sync() is
       O(1) and no slower than plain del_timer(pending_timer), unless it has
       to actually wait for completion of the currently running timer. The
       only restriction is that a recurring timer should not use
       add_timer_on().

    3. The timers are not serialized with respect to themselves. If CPU 0
       does mod_timer(jiffies+1) while the timer is currently running on
       CPU 1, it is quite possible that a local interrupt on CPU 0 will start
       that timer before it has finished on CPU 1.

    4. The timer locking is suboptimal. __mod_timer() takes 3 locks at once
       and still requires wmb() in del_timer/run_timers. The new
       implementation takes 2 locks sequentially and does not need memory
       barriers.

    Currently ->base != NULL means that the timer is pending. In that case
    ->base.lock is used to lock the timer. __mod_timer() also takes
    timer->lock because ->base can be == NULL.

    This patch uses timer->entry.next != NULL as the indication that the
    timer is pending. So it does __list_del(), entry->next = NULL instead of
    list_del() when the timer is deleted.

    The ->base field is used for hashed locking only; it is initialized in
    init_timer(), which sets ->base = per_cpu(tvec_bases). When
    tvec_bases.lock is locked, it means that all timers which are tied to
    this base via timer->base are locked, and the base itself is locked too.

    So __run_timers/migrate_timers can safely modify all timers which could
    be found on the ->tvX lists (pending timers). When the timer's base is
    locked, and the timer has been removed from the ->entry list (which means
    that __run_timers/migrate_timers can't see this timer), it is possible to
    set timer->base = NULL and drop the lock: the timer remains locked.

    This patch adds a lock_timer_base() helper, which waits for
    ->base != NULL, locks the ->base, and checks that it is still the same.

    __mod_timer() schedules the timer on the local CPU and changes its base.
    However, it does not lock both the old and new bases at once. It locks
    the timer via lock_timer_base(), deletes the timer, sets ->base = NULL,
    and unlocks the old base. Then __mod_timer() locks new_base, sets
    ->base = new_base, and adds this timer. This simplifies the code, because
    an AB-BA deadlock is not possible. __mod_timer() also ensures that the
    timer's base is not changed while the timer's handler is running on the
    old base.

    __run_timers() and del_timer() do not change ->base anymore; they only
    clear the pending flag. So del_timer_sync() can test
    timer->base->running_timer == timer to detect whether it is running or
    not.

    We don't need timer_list->lock anymore; this patch kills it.

    We also don't need barriers. del_timer() and __run_timers() used
    smp_wmb() before clearing the timer's pending flag. It was needed because
    __mod_timer() did not lock old_base if the timer was not pending, so
    __mod_timer()->list_add() could race with del_timer()->list_del(). With
    this patch these functions are serialized through base->lock.

    One problem: TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this
    patch adds a global

        struct timer_base_s {
                spinlock_t lock;
                struct timer_list *running_timer;
        } __init_timer_base;

    which is used by TIMER_INITIALIZER. The corresponding fields in the
    tvec_t_base_s struct are replaced by a struct timer_base_s t_base member.

    It is indeed ugly. But this can't have scalability problems. The global
    __init_timer_base.lock is used only when __mod_timer() is called for the
    first time AND the timer was compile-time initialized. After that the
    timer migrates to the local CPU.

    Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
    Acked-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
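    The lock_timer_base() idea described above can be sketched as follows.
    This is a minimal reconstruction from the commit text, not a quote of the
    patch; the timer_base_s type and its lock field are assumed to match the
    description:

        static struct timer_base_s *lock_timer_base(struct timer_list *timer,
                                                    unsigned long *flags)
        {
                struct timer_base_s *base;

                for (;;) {
                        base = timer->base;
                        if (base != NULL) {
                                spin_lock_irqsave(&base->lock, *flags);
                                if (base == timer->base)
                                        return base;    /* base unchanged: timer is now locked */
                                /* the timer migrated while we waited for the lock */
                                spin_unlock_irqrestore(&base->lock, *flags);
                        }
                        /* base == NULL: __mod_timer() is migrating the timer, spin */
                        cpu_relax();
                }
        }

    A NULL ->base is only ever transient (it is set while the migrating
    __mod_timer() still holds the timer effectively locked), so the loop ends
    as soon as the migration completes.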
* [PATCH] blk: unplug later  Nick Piggin  2005-06-23  (1 file, -1/+1)
    get_request_wait() needn't unplug the device immediately.

    Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Jens Axboe <axboe@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] blk: branch hints  Nick Piggin  2005-06-23  (1 file, -6/+6)
    Sprinkle around a few branch hints in the block layer.

    Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Jens Axboe <axboe@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] blk: no memory barrier  Nick Piggin  2005-06-23  (1 file, -1/+0)
    This memory barrier is not needed because the waitqueue will only get
    waiters on it in the following situations:

    - rq->count has exceeded the threshold - however all manipulations of
      ->count are performed under the runqueue lock, and so we will correctly
      pick up any waiter.

    - Memory allocation for the request fails. In this case, there is no
      additional help provided by the memory barrier. We are guaranteed to
      eventually wake up waiters because the request allocation mempool
      guarantees that if the mem allocation for a request fails, there must
      be some requests in flight. They will wake up waiters when they are
      retired.

    Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Jens Axboe <axboe@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] blk: cleanup generic tag support error messages  Tejun Heo  2005-06-23  (1 file, -5/+13)
    Add KERN_ERR and __FUNCTION__ to generic tag error messages, and add a
    comment in blk_queue_end_tag() which explains the silent failure path.

    Signed-off-by: Tejun Heo <htejun@gmail.com>
    Acked-by: Jens Axboe <axboe@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
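    For illustration, the style the messages move to looks roughly like the
    line below; the exact message text and the tag variable are hypothetical,
    only the KERN_ERR level and the __FUNCTION__ prefix are what the patch
    adds:

        printk(KERN_ERR "%s: tag %d is missing\n", __FUNCTION__, tag);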
* [PATCH] blk: remove BLK_TAGS_{PER_LONG|MASK}  Tejun Heo  2005-06-23  (2 files, -5/+2)
    Replace BLK_TAGS_PER_LONG with BITS_PER_LONG and remove the unused
    BLK_TAGS_MASK.

    Signed-off-by: Tejun Heo <htejun@gmail.com>
    Acked-by: Jens Axboe <axboe@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] blk: remove blk_queue_tag->real_max_depth optimization  Tejun Heo  2005-06-23  (2 files, -26/+10)
    blk_queue_tag->real_max_depth was used to optimize out unnecessary
    allocations/frees on tag resize. However, the whole thing was very
    broken: tag_map was never allocated to real_max_depth, resulting in
    access beyond the end of the map, and bits in [max_depth..real_max_depth]
    were set when initializing a map and copied when resizing, resulting in
    pre-occupied tags. As the gain of the optimization is very small (well,
    almost nil), remove the whole thing.

    Signed-off-by: Tejun Heo <htejun@gmail.com>
    Acked-by: Jens Axboe <axboe@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] blk: use find_first_zero_bit() in blk_queue_start_tag()  Tejun Heo  2005-06-23  (1 file, -9/+4)
    blk_queue_start_tag() hand-coded the search for the first zero bit in the
    tag map. Replace it with find_first_zero_bit(). With this patch,
    blk_queue_start_tag() doesn't need to fill the remainder of the tag map
    with ones, which allows it to work properly with the following
    remove_real_max_depth patch.

    Signed-off-by: Tejun Heo <htejun@gmail.com>
    Acked-by: Jens Axboe <axboe@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
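    A hedged sketch of what the tag allocation looks like with
    find_first_zero_bit(); the surrounding blk_queue_start_tag() code and the
    bqt/rq variable names are assumed from context, not quoted from the
    patch:

        /* find a free tag in [0, max_depth) and claim it */
        tag = find_first_zero_bit(bqt->tag_map, bqt->max_depth);
        if (tag >= bqt->max_depth)
                return 1;               /* queue is out of tags */

        __set_bit(tag, bqt->tag_map);   /* mark the tag busy */
        rq->tag = tag;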
* [PATCH] ptrace_h8300: condition bugfix  Domen Puncer  2005-06-23  (1 file, -2/+2)
    Assignment doesn't make much sense here, as the condition would always be
    true.

    Signed-off-by: Domen Puncer <domen@coderock.org>
    Signed-off-by: Yoshinori Sato <ysato@users.sourceforge.jp>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] xen: x86_64: use more usermode macro  Vincent Hanquez  2005-06-23  (3 files, -12/+12)
    Make use of the user_mode macro where possible. This is useful for Xen
    because it will then only need to redefine the macro as a hypervisor
    call.

    Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
    Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
    Cc: Andi Kleen <ak@muc.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] xen: x86_64: Add macro for debugreg  Vincent Hanquez  2005-06-23  (3 files, -3/+11)
    Add two macros to set and get debugreg on x86_64. This is useful for Xen
    because it will then only need to redefine each macro as a hypervisor
    call.

    Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
    Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
    Cc: Andi Kleen <ak@muc.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] xen: x86: Use more usermode macro  Vincent Hanquez  2005-06-23  (3 files, -8/+8)
    Use the user_mode macro where possible.

    Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
    Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] xen: x86: Rename usermode macro  Vincent Hanquez  2005-06-23  (6 files, -5/+7)
    Rename user_mode to user_mode_vm and add a user_mode macro similar to the
    x86-64 one. This is useful for Xen because the Linux Xen kernel does not
    run at the same privilege level as a vanilla Linux kernel, and with this
    we just need to redefine user_mode().

    Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
    Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
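    A sketch of the resulting pair of i386 macros; the definitions are
    reconstructed from the description above rather than quoted from the
    patch, and assume the pt_regs field names xcs/eflags and the VM_MASK
    vm86 flag:

        /* user_mode(): CS privilege bits only; this is what Xen redefines */
        #define user_mode(regs)     (3 & (regs)->xcs)

        /* user_mode_vm(): the old semantics, also counting vm86 mode */
        #define user_mode_vm(regs)  ((VM_MASK & (regs)->eflags) || user_mode(regs))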
* [PATCH] xen: x86: Use new macro for debugreg  Vincent Hanquez  2005-06-23  (5 files, -19/+17)
    Make use of the two new macros, set_debugreg and get_debugreg.

    Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
    Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] xen: x86: add macro for debugreg  Vincent Hanquez  2005-06-23  (1 file, -5/+9)
    Add two macros to set and get debugreg on x86. This is useful for Xen
    because it will then only need to redefine each macro as a hypervisor
    call.

    Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
    Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
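    For illustration, accessor macros of this kind typically look like the
    inline-asm sketch below (a reconstruction, not the patch text); Xen can
    then swap just these two definitions for hypervisor calls without
    touching any caller:

        #define get_debugreg(var, register)                     \
                __asm__("movl %%db" #register ", %0"            \
                        : "=r" (var))

        #define set_debugreg(value, register)                   \
                __asm__("movl %0, %%db" #register               \
                        : /* no output */                       \
                        : "r" (value))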
* [PATCH] x86_64: avoid wasting IRQs  Natalie Protasevich  2005-06-23  (1 file, -1/+20)
    I suggest changing the way IRQs are handed out to PCI devices.

    Currently, each I/O APIC pin gets associated with an IRQ, no matter if
    the pin is used or not. It is expected that each pin can potentially be
    engaged by a device inserted into the corresponding PCI slot. However,
    this imposes a severe limitation on systems whose designs employ many
    I/O APICs while only utilizing a couple of lines of each, such as the
    P64H2 chipset. It is used in ES7000, and currently there is no way to
    boot the system with more than 9 I/O APICs.

    The simple change below allows booting a system with, say, 64 (or more)
    I/O APICs, each providing 1 slot, which is otherwise impossible because
    of the IRQ gaps created for unused lines on each I/O APIC. It does not
    resolve the problem of the number of devices exceeding the number of
    possible IRQs, but it eases the tension for IRQs on any large system with
    a potentially large number of devices.

    I only implemented this for the ACPI boot, since if the system is this
    big and using newer chipsets it is probably (better be!) an ACPI-based
    system :). The change is completely "mechanical" and does not alter any
    internal structures or the interrupt model/implementation. The patch
    works for both the i386 and x86_64 archs. It works with MSIs just fine,
    and should not interfere with implementations like shared vectors, when
    they get worked out and incorporated.

    To illustrate, below is the interrupt distribution for a 2-cell ES7000
    with 20 I/O APICs, and an Ethernet card in the last slot, which should be
    eth1 and which was not configured because its IRQ exceeded the allowable
    number (it actually turned out huge: 480!):

        zorro-tb2:~ # cat /proc/interrupts
                  CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
          0:     65716  30012  30007  30002  30009  30010  30010  30010  IO-APIC-edge   timer
          4:       373      0    725    280      0      0      0      0  IO-APIC-edge   serial
          8:         0      0      0      0      0      0      0      0  IO-APIC-edge   rtc
          9:         0      0      0      0      0      0      0      0  IO-APIC-level  acpi
         14:        39      3      0      0      0      0      0      0  IO-APIC-edge   ide0
         16:       108     13      0      0      0      0      0      0  IO-APIC-level  uhci_hcd:usb1
         18:         0      0      0      0      0      0      0      0  IO-APIC-level  uhci_hcd:usb3
         19:        15      0      0      0      0      0      0      0  IO-APIC-level  uhci_hcd:usb2
         23:         3      0      0      0      0      0      0      0  IO-APIC-level  ehci_hcd:usb4
         96:      4240    397     18      0      0      0      0      0  IO-APIC-level  aic7xxx
         97:        15      0      0      0      0      0      0      0  IO-APIC-level  aic7xxx
        192:       847      0      0      0      0      0      0      0  IO-APIC-level  eth0
        NMI:         0      0      0      0      0      0      0      0
        LOC:    273423 274528 272829 274228 274092 273761 273827 273694
        ERR:         7
        MIS:         0

    Even though the system doesn't have that many devices, some don't get
    enabled only because of the IRQ numbering model. This is the IRQ picture
    after the patch was applied:

        zorro-tb2:~ # cat /proc/interrupts
                  CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
          0:     44169  10004  10004  10001  10004  10003  10004   6135  IO-APIC-edge   timer
          4:       345      0      0      0      0    244      0      0  IO-APIC-edge   serial
          8:         0      0      0      0      0      0      0      0  IO-APIC-edge   rtc
          9:         0      0      0      0      0      0      0      0  IO-APIC-level  acpi
         14:        39      0      3      0      0      0      0      0  IO-APIC-edge   ide0
         17:      4425      0      9      0      0      0      0      0  IO-APIC-level  aic7xxx
         18:        15      0      0      0      0      0      0      0  IO-APIC-level  aic7xxx, uhci_hcd:usb3
         21:       231      0      0      0      0      0      0      0  IO-APIC-level  uhci_hcd:usb1
         22:        26      0      0      0      0      0      0      0  IO-APIC-level  uhci_hcd:usb2
         23:         3      0      0      0      0      0      0      0  IO-APIC-level  ehci_hcd:usb4
         24:       348      0      0      0      0      0      0      0  IO-APIC-level  eth0
         25:         6    192      0      0      0      0      0      0  IO-APIC-level  eth1
        NMI:         0      0      0      0      0      0      0      0
        LOC:    107981 107636 108899 108698 108489 108326 108331 108254
        ERR:         7
        MIS:         0

    Not only do we see the card in the last I/O APIC, but we are not even
    close to using up the available IRQs, since we didn't waste any.

    Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
    Acked-by: Andi Kleen <ak@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] eliminate duplicate rdpmc definition  Jan Beulich  2005-06-23  (1 file, -5/+0)
    Eliminate the duplicate definition of rdpmc in x86-64's mtrr.h.

    Signed-off-by: Jan Beulich <jbeulich@novell.com>
    Cc: Andi Kleen <ak@muc.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] x86_64: never block forced SIGSEGV  Roland McGrath  2005-06-23  (2 files, -18/+23)
    This is the x86_64 version of the signal fix I just posted for i386.

    This problem was first noticed on PPC and has already been fixed there.
    But the exact same issue applies to other platforms in the same way. The
    signal blocking for sa_mask and the handled signal takes place after the
    handler setup. When the stack is bogus, the handler setup forces a
    SIGSEGV. But then this will be blocked, and returning to user mode will
    fault again and iterate.

    This patch fixes the problem by checking whether signal handler setup
    failed, and not doing the signal-blocking if so. This copies what was
    done in the ppc code. I think all architectures' signal handler setup
    code follows this pattern and needs the change.

    Signed-off-by: Roland McGrath <roland@redhat.com>
    Cc: Andi Kleen <ak@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
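    The shape of the fix, as a hedged sketch of handle_signal(): the
    frame_setup_ok flag stands in for whatever the architecture's frame-setup
    routine actually reports (the exact return convention is not quoted
    here); the point is that the sa_mask/signal blocking becomes conditional
    on the setup having succeeded:

        if (frame_setup_ok) {
                /* only block further signals if the handler frame was built */
                spin_lock_irq(&current->sighand->siglock);
                sigorsets(&current->blocked, &current->blocked, &ka->sa.sa_mask);
                sigaddset(&current->blocked, sig);
                recalc_sigpending();
                spin_unlock_irq(&current->sighand->siglock);
        }
        /* otherwise the forced SIGSEGV stays deliverable */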
* [PATCH] x86_64: fix hpet for systems that don't support legacy replacement  john stultz  2005-06-23  (1 file, -14/+28)
    Currently the x86-64 HPET code assumes the entire HPET implementation
    from the spec is present. This breaks on boxes that do not implement the
    optional legacy timer replacement functionality portion of the spec.

    This patch fixes that issue, allowing x86-64 systems that cannot use the
    HPET for the timer interrupt and RTC to still use the HPET as a time
    source. I've tested this patch on systems without an HPET, with an HPET
    but without legacy timer replacement, and with an HPET with legacy timer
    replacement.

    This version adds a minor check to cap the HPET counter value in
    gettimeoffset_hpet to avoid possible time inconsistencies. Please ignore
    the A2 version I sent to you earlier.

    Acked-by: Andi Kleen <ak@muc.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] x86_64: i8259.c iso99 structure initialization  Alexander Nyberg  2005-06-23  (1 file, -8/+7)
    Cc: Andi Kleen <ak@muc.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] mtrr size-and-base debugging  Andrew Morton  2005-06-23  (1 file, -8/+15)
    Consolidate the mtrr sanity checking, add a dump_stack().

    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] x86: cpu_khz type fix  Andrew Morton  2005-06-23  (7 files, -9/+11)
    x86_64's cpu_khz is unsigned int and there is no reason why x86 needs to
    use unsigned long. So make cpu_khz unsigned int on x86 as well.

    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Remove i386_ksyms.c, almost.  Alexey Dobriyan  2005-06-23  (20 files, -162/+101)
      * EXPORT_SYMBOLs moved to other files
      * #include <linux/config.h>, <linux/module.h> where needed
      * #includes in i386_ksyms.c cleaned up
      * After the copy-paste, preprocessor directives made redundant by the
        Makefile rules were removed:

            #ifdef CONFIG_FOO
            EXPORT_SYMBOL(foo);
            #endif

            obj-$(CONFIG_FOO) += foo.o

      * Tiny reformat to fit in 80 columns

    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] x86: #include asm/uaccess.h in asm/checksum.h  Alexey Dobriyan  2005-06-23  (1 file, -0/+2)
    csum_and_copy_to_user is static inline and uses VERIFY_WRITE. This patch
    allows asm/uaccess.h to be removed from i386_ksyms.c without dependency
    surprises.

    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] VIA 82C586B IRQ routing fix  Aleksey Gorelov  2005-06-23  (1 file, -0/+22)
    According to the VIA 82C586B datasheet (still available from
    http://gkernel.sourceforge.net/specs/via/586b.pdf.bz2), this chip needs a
    special PIRQ mapping.

    Signed-off-by: Karsten Keil <kkeil@suse.de>
    Signed-off-by: Aleksey Gorelov <aleksey_gorelov@phoenix.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] x86: avoid wasting IRQs for PCI devices  Natalie Protasevich  2005-06-23  (1 file, -1/+25)
    I have submitted the patch for x86_64; this is the submission for i386.

    The patch changes the way IRQs are handed out to PCI devices. Currently,
    each I/O APIC pin gets associated with an IRQ, no matter if the pin is
    used or not. This imposes a severe limitation on systems whose designs
    employ many I/O APICs while only utilizing a couple of lines of each,
    such as the P64H2 chipset. It is used in ES7000, and currently there is
    no way to boot the system with more than 9 I/O APICs.

    The simple change below allows booting a system with, say, 64 (or more)
    I/O APICs, each providing 1 slot, which is otherwise impossible because
    of the IRQ gaps created for unused lines on each I/O APIC. It does not
    resolve the problem of the number of devices exceeding the number of
    possible IRQs, but it eases the tension for IRQs on any large system with
    a potentially large number of devices.

    Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ia64: Selectable Timer Interrupt Frequency  Christoph Lameter  2005-06-23  (2 files, -1/+3)
    Allow a selectable timer interrupt frequency of 100, 250 and 1000 HZ.
    Reducing the timer frequency may have important performance benefits on
    large systems.

    Signed-off-by: Christoph Lameter <clameter@sgi.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] i386: Selectable Frequency of the Timer Interrupt  Christoph Lameter  2005-06-23  (5 files, -3/+57)
    Make the timer frequency selectable. The timer interrupt may cause bus
    and memory contention in large NUMA systems since the interrupt occurs
    on each processor HZ times per second.

    Signed-off-by: Christoph Lameter <christoph@lameter.com>
    Signed-off-by: Shai Fultheim <shai@scalex86.org>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] allow early printk to use more than 25 lines  Jan Beulich  2005-06-23  (1 file, -3/+10)
    Allow the early printk code to take advantage of the full size of the
    screen, not just the first 25 lines.

    Signed-off-by: Jan Beulich <jbeulich@novell.com>
    Acked-by: Andi Kleen <ak@muc.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] adjust i386 watchdog tick calculation  Jan Beulich  2005-06-23  (1 file, -9/+15)
    Get the i386 watchdog tick calculation into a state where it can also be
    used on CPUs with frequencies beyond 4 GHz, and consolidate the
    calculation into a single place (for potential future adjustments).

    Signed-off-by: Jan Beulich <jbeulich@novell.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Do not enforce unique IO_APIC_ID check for xAPIC systems (i386)  Natalie Protasevich  2005-06-23  (10 files, -19/+10)
    This patch is per Andi's request to remove NO_IOAPIC_CHECK from genapic
    and use heuristics to avoid the unique I/O APIC ID check on systems that
    don't need it. The patch disables the unique I/O APIC ID check for
    Xeon-based and other platforms that don't use the serial APIC bus for
    interrupt delivery. Andi stated that AMD systems don't need unique
    IO_APIC_IDs either.

    Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
    Cc: Andi Kleen <ak@muc.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] i386: never block forced SIGSEGV  Roland McGrath  2005-06-23  (1 file, -11/+16)
    This problem was first noticed on PPC and has already been fixed there.
    But the exact same issue applies to other platforms in the same way. The
    signal blocking for sa_mask and the handled signal takes place after the
    handler setup. When the stack is bogus, the handler setup forces a
    SIGSEGV. But then this will be blocked, and returning to user mode will
    fault again and iterate.

    This patch fixes the problem by checking whether signal handler setup
    failed, and not doing the signal-blocking if so. This copies what was
    done in the ppc code. I think all architectures' signal handler setup
    code follows this pattern and needs the change.

    Signed-off-by: Roland McGrath <roland@redhat.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NUMA aware block device control structure allocation  Christoph Lameter  2005-06-23  (11 files, -30/+77)
    Patch to allocate the control structures for ide devices on the node of
    the device itself (for NUMA systems). The patch depends on the Slab API
    change patch by Manfred and me (in mm) and the pcidev_to_node patch that
    I posted today. Does some realignment too.

    Signed-off-by: Justin M. Forbes <jmforbes@linuxtx.org>
    Signed-off-by: Christoph Lameter <christoph@lameter.com>
    Signed-off-by: Pravin Shelar <pravin@calsoftinc.com>
    Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] x86/x86_64: pcibus_to_node  Christoph Lameter  2005-06-23  (5 files, -34/+17)
    Define pcibus_to_node to be able to figure out which NUMA node contains a
    given PCI device. This defines pcibus_to_node(bus) in
    include/linux/topology.h and adjusts the macros for i386 and x86_64 that
    already provided a way to determine the cpumask of a PCI device.

    x86_64 was changed to not build an array of cpumasks anymore. Instead, an
    array of nodes is built, which can be used to generate the cpumask via
    node_to_cpumask.

    Signed-off-by: Christoph Lameter <christoph@lameter.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
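    One plausible shape for the generic fallback this introduces, shown as a
    hedged sketch (the asm-generic/topology.h details are assumed, not quoted
    from the patch): architectures without NUMA knowledge report "no node"
    and fall back to all CPUs.

        #ifndef pcibus_to_node
        #define pcibus_to_node(bus)     (-1)
        #endif

        #ifndef pcibus_to_cpumask
        #define pcibus_to_cpumask(bus)                                  \
                (pcibus_to_node(bus) == -1 ?                            \
                        CPU_MASK_ALL :                                  \
                        node_to_cpumask(pcibus_to_node(bus)))
        #endif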
* [PATCH] ppc64: pcibus_to_node fix  Christoph Lameter  2005-06-23  (1 file, -3/+1)
    asm-generic/topology.h must also be included if CONFIG_NUMA is set, in
    order to provide the fallback pcibus_to_node function.

    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] m32r: build fix for asm-m32r/topology.h  Hirokazu Takata  2005-06-23  (1 file, -43/+1)
    Use asm-generic/topology.h to fix yet another pcibus_to_node() build
    error.

    Cc: Christoph Lameter <clameter@engr.sgi.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Platform SMIs and their interference with tsc based delay calibration  Venkatesh Pallipadi  2005-06-23  (10 files, -0/+130)
    Issue: the current tsc-based delay calibration can produce significant
    errors in the loops_per_jiffy count when platform events like SMIs
    (System Management Interrupts, which are non-maskable) are present. This
    could lead to a kernel panic(). The issue is becoming more visible with
    the 2.6 kernel (as the default HZ is 1000) and on platforms with higher
    SMI handling latencies. During boot, SMIs are mostly used by the BIOS
    (for things like legacy keyboard emulation).

    Description: the pseudocode for the current delay calibration with
    tsc-based delay looks like this:

        (0) estimate a value for loops_per_jiffy
        (1) while (the loops_per_jiffy estimate is not accurate enough)
        (2)     wait for a jiffy transition (jiffy1)
        (3)     note down the current tsc (tsc1)
        (4)     loop until tsc becomes tsc1 + loops_per_jiffy
        (5)     check whether jiffies changed since jiffy1 and refine the
                loops_per_jiffy estimate

    Consider the following cases.

    Case 1: if SMIs happen between (2) and (3) above, we can end up with a
    loops_per_jiffy value that is too low. This results in shortened delays,
    and the kernel can panic() during boot (mostly at IOAPIC timer
    initialization, timer_irq_works(), as we don't get enough timer
    interrupts in the specified interval).

    Case 2: if SMIs happen between (3) and (4) above, we can end up with a
    loops_per_jiffy value that is too high. And with the current i386 code, a
    too-high lpj value (greater than 17M) can result in an overflow in
    delay.c:__const_udelay(), again producing a shorter delay and a panic().

    Solution: the patch below makes the calibration routine aware of
    asynchronous events like SMIs. We increase the delay calibration time and
    also identify any significant errors (greater than 12.5%) in the
    calibration, notifying the user. The patch changes both the i386 and
    x86-64 architectures to use this new and improved
    calibrate_delay_direct() routine.

    Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
    Signed-off-by: Adrian Bunk <bunk@stusta.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] use ${CROSS_COMPILE}installkernel in arch/*/boot/install.sh  Ian Campbell  2005-06-23  (6 files, -12/+12)
    The attached patch causes the various arch-specific install.sh scripts to
    look for ${CROSS_COMPILE}installkernel rather than just installkernel (in
    both /sbin/ and ~/bin/, where the script already did this). This allows
    you to have e.g. arm-linux-installkernel as a handy way to install on
    your cross target. It also prevents the script picking up on the host
    /sbin/installkernel, which causes the script to fall through and do the
    install itself (which is what I actually use myself, with $INSTALL_PATH
    set).

    I don't believe it causes back-compatibility problems, since calling the
    host installkernel was never likely to work or be what you wanted when
    cross compiling anyway. If $CROSS_COMPILE isn't set then nothing changes.

    I only use ARM and i386 myself, but I figured it couldn't hurt to do the
    whole lot. I've cc'd those who I hope are the arch maintainers for the
    files I've touched.

    Signed-off-by: Ian Campbell <icampbell@arcom.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] biarch compiler support for i386  H. Peter Anvin  2005-06-23  (1 file, -0/+7)
    This allows the i386 architecture to be built on a system with a biarch
    compiler that defaults to x86-64, merely by specifying ARCH=i386. As
    previously discussed, this uses the equivalent logic to the ppc port.

    Signed-Off-By: H. Peter Anvin <hpa@zytor.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] add page_state info to show_mem  Martin J. Bligh  2005-06-23  (1 file, -0/+8)
    This helps a lot when debugging out-of-memory stuff - it's especially
    useful to see if all the memory is sucked into slab, etc.

    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] add x86-64 specific support for sparsemem  Matt Tolentino  2005-06-23  (4 files, -12/+51)
    This patch adds the necessary support for sparsemem such that x86-64
    kernels may use sparsemem as an alternative to discontigmem for NUMA
    kernels. Note that this does not preclude continuing to build NUMA
    kernels using discontigmem; it merely allows the option to build NUMA
    kernels with sparsemem.

    Interestingly, the use of sparsemem in lieu of discontigmem in NUMA
    kernels results in reduced text size for otherwise equivalent kernels, as
    shown in the example builds below:

           text    data     bss     dec    hex filename
        2371036  765884 1237108 4374028 42be0c vmlinux.discontig
        2366549  776484 1302772 4445805 43d66d vmlinux.sparse

    Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
    Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] reorganize x86-64 NUMA and DISCONTIGMEM config options  Matt Tolentino  2005-06-23  (9 files, -24/+25)
    In order to use the alternative sparsemem implementation for NUMA
    kernels, we need to reorganize the config options. This patch effectively
    abstracts out the CONFIG_DISCONTIGMEM options to CONFIG_NUMA in most
    cases. Thus, the discontigmem implementation may be employed as always,
    but the sparsemem implementation may be used alternatively.

    Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
    Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] add x86-64 Kconfig options for sparsemem  Matt Tolentino  2005-06-23  (1 file, -0/+19)
    Add the requisite arch-specific Kconfig options to enable the use of the
    sparsemem implementation for NUMA kernels on x86-64.

    Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
    Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] remove direct ref to contig_page_data for x86-64  Matt Tolentino  2005-06-23  (2 files, -5/+1)
    This patch pulls out all remaining direct references to contig_page_data
    from arch/x86-64, thus saving an ifdef in one case.

    Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
    Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ppc64: sparsemem memory model  Andy Whitcroft  2005-06-23  (7 files, -21/+68)
    Provide the architecture specific implementation of SPARSEMEM for PPC64
    systems.

    Signed-off-by: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
    Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> (in part)
    Signed-off-by: Martin Bligh <mbligh@aracnet.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ppc64: add memory present  Andy Whitcroft  2005-06-23  (2 files, -2/+5)
    Provide hooks for PPC64 to allow memory models to be informed of
    installed memory areas. This allows SPARSEMEM to instantiate mem_map for
    the populated areas.

    Signed-off-by: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
    Signed-off-by: Martin Bligh <mbligh@aracnet.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ppc64: add early_pfn_to_nid  Andy Whitcroft  2005-06-23  (2 files, -0/+9)
    Provide an implementation of early_pfn_to_nid for PPC64. This is used by
    memory models to determine the node from which to take allocations before
    the memory allocators are fully initialised.

    Signed-off-by: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
    Signed-off-by: Martin Bligh <mbligh@aracnet.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sparsemem hotplug base  Andy Whitcroft  2005-06-23  (3 files, -27/+125)
    Make sparse's initialization accessible at runtime. This allows sparse
    mappings to be created after boot in a hotplug situation. This patch is
    separated from the previous one just to give an indication of how much of
    the sparse infrastructure is *just* for hotplug memory.

    The section_mem_map doesn't really store a pointer. It stores something
    that is convenient to do some math against to get a pointer. It isn't
    valid to just do *section_mem_map, so I don't think it should be stored
    as a pointer.

    There are a couple of things I'd like to store about a section. First of
    all, the fact that it is !NULL does not mean that it is present. There
    could be such a combination where section_mem_map *is* NULL, but the math
    gets you properly to a real mem_map. So, I don't think that check is
    safe.

    Since we're storing 32-bit-aligned structures, we have a few bits in the
    bottom of the pointer to play with. Use one bit to encode whether there's
    really a mem_map there, and the other one to tell whether there's a valid
    section there. We need to distinguish between the two because sometimes
    there's a gap between when a section is discovered to be present and when
    we can get the mem_map for it.

    Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
    Signed-off-by: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Jack Steiner <steiner@sgi.com>
    Signed-off-by: Bob Picco <bob.picco@hp.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
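    The low-bit encoding described above can be sketched as below. The macro
    and helper names follow what this scheme later looks like in
    include/linux/mmzone.h, but they are shown here only as an illustration
    of the idea, not as a quote of this patch:

        #define SECTION_MARKED_PRESENT  (1UL << 0)   /* section exists */
        #define SECTION_HAS_MEM_MAP     (1UL << 1)   /* mem_map is instantiated */
        #define SECTION_MAP_LAST_BIT    (1UL << 2)
        #define SECTION_MAP_MASK        (~(SECTION_MAP_LAST_BIT - 1))

        struct mem_section {
                unsigned long section_mem_map;  /* encoded pointer + flag bits */
        };

        /* strip the flag bits to recover the encoded mem_map value */
        static inline struct page *__section_mem_map_addr(struct mem_section *section)
        {
                return (struct page *)(section->section_mem_map & SECTION_MAP_MASK);
        }

        static inline int present_section(struct mem_section *section)
        {
                return section && (section->section_mem_map & SECTION_MARKED_PRESENT);
        }

        static inline int valid_section(struct mem_section *section)
        {
                return section && (section->section_mem_map & SECTION_HAS_MEM_MAP);
        }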
* [PATCH] sparsemem swiss cheese numa layouts  Andy Whitcroft  2005-06-23  (3 files, -0/+20)
    The part of the sparsemem patch which modifies memmap_init_zone() has
    recently become a problem. It changes behavior so that there is a call to
    pfn_to_page() for each individual page inside of a node's range:
    node_start_pfn through node_end_pfn. It used to simply do this once, at
    the beginning of the node, but having sparsemem's non-contiguous
    mem_map[]s inside of a node made it necessary to change.

    Mike Kravetz recently wrote a patch which made the NUMA code accept some
    new kinds of layouts. The system's memory was laid out like this, with
    node 0's memory in two pieces: one before and one after node 1's memory:

        Node 0: +++++     +++++
        Node 1:      +++++

    Previous behavior before Mike's patch was to assign nodes like this:

        Node 0: 00000     XXXXX
        Node 1:      11111

    where the 'X' areas were simply thrown away. The new behavior was to make
    the pg_data_t span node 0 across all of its areas, including areas that
    are really node 1's:

        Node 0: 000000000000000
        Node 1:      11111

    This wastes a little bit of mem_map space, but ends up being OK, and more
    fully utilizes the system's memory.

    memmap_init_zone() initializes all of the "struct page"s for node 0, even
    for the "hole", but those never get used, because there is no
    pfn_to_page() that resolves to those pages. However, only calling
    pfn_to_page() once, memmap_init_zone() always uses the pages that were
    allocated for node0->node_mem_map, because:

        struct page *start = pfn_to_page(start_pfn);
        // effectively start = &node->node_mem_map[0]
        for (page = start; page < (start + size); page++) {
                init_page_here();...
                page++;
        }

    Slow, and wasteful, but generally harmless.

    But, modify that to call pfn_to_page() for each loop iteration (like
    sparsemem does):

        for (pfn = start_pfn; pfn < (start_pfn + size); pfn++) {
                page = pfn_to_page(pfn);
        }

    and you end up trying to initialize node 1's pages too early, along with
    bogus data from node 0. This patch checks for those weird layouts and
    declines to touch the pages, making the more frequent pfn_to_page()
    calls OK to do.

    Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
    Signed-off-by: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sparsemem memory model for i386  Andy Whitcroft  2005-06-23  (9 files, -81/+135)
    Provide the architecture specific implementation for SPARSEMEM for i386
    SMP and NUMA systems.

    Signed-off-by: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
    Signed-off-by: Martin Bligh <mbligh@aracnet.com>
    Signed-off-by: Adrian Bunk <bunk@stusta.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>