From 39c715b71740c4a78ba4769fb54826929bac03cb Mon Sep 17 00:00:00 2001
From: Ingo Molnar
Date: Tue, 21 Jun 2005 17:14:34 -0700
Subject: [PATCH] smp_processor_id() cleanup

This patch implements a number of smp_processor_id() cleanup ideas that
Arjan van de Ven and I came up with.

The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
spaghetti was hard to follow both on the implementation side and on the
usage side.  Some of the complexity arose from picking the wrong names,
some of the complexity comes from the fact that not all architectures
defined __smp_processor_id.

In the new code, there are two externally visible symbols:

 - smp_processor_id(): debug variant.

 - raw_smp_processor_id(): nondebug variant.  Replaces all existing uses
   of _smp_processor_id() and __smp_processor_id().  Defined by every SMP
   architecture in include/asm-*/smp.h.

There is one new internal symbol, dependent on DEBUG_PREEMPT:

 - debug_smp_processor_id(): internal debug variant, mapped to
   smp_processor_id().

Also, I moved debug_smp_processor_id() from lib/kernel_lock.c into a new
lib/smp_processor_id.c file.  All related comments got updated and/or
clarified.

I have build/boot tested the following 8 .config combinations on x86:

 {SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}

I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT.  (Other
architectures are untested, but should work just fine.)

Signed-off-by: Ingo Molnar
Signed-off-by: Arjan van de Ven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/module.c       | 2 +-
 kernel/power/smp.c    | 4 ++--
 kernel/sched.c        | 4 ++--
 kernel/stop_machine.c | 4 ++--
 4 files changed, 7 insertions(+), 7 deletions(-)
(limited to 'kernel')

diff --git a/kernel/module.c b/kernel/module.c
index 83b3d37..a566745 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -379,7 +379,7 @@ static void module_unload_init(struct module *mod)
 	for (i = 0; i < NR_CPUS; i++)
 		local_set(&mod->ref[i].count, 0);
 	/* Hold reference count during initialization. */
-	local_set(&mod->ref[_smp_processor_id()].count, 1);
+	local_set(&mod->ref[raw_smp_processor_id()].count, 1);
 	/* Backwards compatibility macros put refcount during init. */
 	mod->waiter = current;
 }

diff --git a/kernel/power/smp.c b/kernel/power/smp.c
index cba3584b..457c230 100644
--- a/kernel/power/smp.c
+++ b/kernel/power/smp.c
@@ -48,11 +48,11 @@ void disable_nonboot_cpus(void)
 {
 	oldmask = current->cpus_allowed;
 	set_cpus_allowed(current, cpumask_of_cpu(0));
-	printk("Freezing CPUs (at %d)", _smp_processor_id());
+	printk("Freezing CPUs (at %d)", raw_smp_processor_id());
 	current->state = TASK_INTERRUPTIBLE;
 	schedule_timeout(HZ);
 	printk("...");
-	BUG_ON(_smp_processor_id() != 0);
+	BUG_ON(raw_smp_processor_id() != 0);

 	/* FIXME: for this to work, all the CPUs must be running
 	 * "idle" thread (or we deadlock). Is that guaranteed?
 	 */

diff --git a/kernel/sched.c b/kernel/sched.c
index f12a0c8..deca041 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3814,7 +3814,7 @@ EXPORT_SYMBOL(yield);
  */
 void __sched io_schedule(void)
 {
-	struct runqueue *rq = &per_cpu(runqueues, _smp_processor_id());
+	struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());

 	atomic_inc(&rq->nr_iowait);
 	schedule();
@@ -3825,7 +3825,7 @@ EXPORT_SYMBOL(io_schedule);

 long __sched io_schedule_timeout(long timeout)
 {
-	struct runqueue *rq = &per_cpu(runqueues, _smp_processor_id());
+	struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());
 	long ret;

 	atomic_inc(&rq->nr_iowait);

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 6116b25..84a9d18 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -100,7 +100,7 @@ static int stop_machine(void)
 	stopmachine_state = STOPMACHINE_WAIT;

 	for_each_online_cpu(i) {
-		if (i == _smp_processor_id())
+		if (i == raw_smp_processor_id())
 			continue;
 		ret = kernel_thread(stopmachine, (void *)(long)i,CLONE_KERNEL);
 		if (ret < 0)
@@ -182,7 +182,7 @@ struct task_struct *__stop_machine_run(int (*fn)(void *), void *data,

 	/* If they don't care which CPU fn runs on, bind to any online one. */
 	if (cpu == NR_CPUS)
-		cpu = _smp_processor_id();
+		cpu = raw_smp_processor_id();

 	p = kthread_create(do_stop, &smdata, "kstopmachine");
 	if (!IS_ERR(p)) {
-- cgit v1.1
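For readers skimming the patch, the resulting two-level scheme looks roughly
like the sketch below.  This is an illustration of the design described in
the changelog, not the kernel's exact code (the real debug variant lives in
lib/smp_processor_id.c and does additional context checks and rate-limited
reporting):

    /*
     * Illustrative sketch only.  raw_smp_processor_id() is the
     * arch-defined, check-free lookup; the debug variant warns when
     * called from preemptible context, where the returned CPU number
     * may already be stale by the time it is used.
     */
    #ifdef CONFIG_DEBUG_PREEMPT
    # define smp_processor_id() debug_smp_processor_id()
    #else
    # define smp_processor_id() raw_smp_processor_id()
    #endif

    unsigned int debug_smp_processor_id(void)
    {
            unsigned int this_cpu = raw_smp_processor_id();

            /* Preemption disabled: the value cannot change under us. */
            if (likely(preempt_count()))
                    return this_cpu;

            /* Bound to one CPU: migration cannot change the value. */
            if (cpus_equal(current->cpus_allowed, cpumask_of_cpu(this_cpu)))
                    return this_cpu;

            printk(KERN_ERR
                   "BUG: using smp_processor_id() in preemptible code\n");
            return this_cpu;
    }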
From 753ee728964e5afb80c17659cc6c3a6fd0a42fe0 Mon Sep 17 00:00:00 2001
From: Martin Hicks
Date: Tue, 21 Jun 2005 17:14:41 -0700
Subject: [PATCH] VM: early zone reclaim

This is the core of the (much simplified) early reclaim.  The goal of
this patch is to reclaim some easily-freed pages from a zone before
falling back onto another zone.

One of the major uses of this is NUMA machines.  With the default
allocator behavior the allocator would look for memory in another zone,
which might be off-node, before trying to reclaim from the current zone.

This adds a zone tuneable to enable early zone reclaim.  It is selected
on a per-zone basis and is turned on/off via syscall.

Adding some extra throttling on the reclaim was also required (patch
4/4).  Without it the machine would grind to a crawl when doing a
"make -j" kernel build.  Even with this patch the system time is higher
on average, but it seems tolerable.

Here are some numbers for kernbench runs on a 2-node, 4-CPU, 8GB RAM
Altix in the "make -j" run:

                        wall  user  sys  %cpu  ctx sw.  sleeps
                        ----  ----  ---  ----  -------  ------
No patch                1009  1384  847   258   298170  504402
w/patch, no reclaim      880  1376  667   288   254064  396745
w/patch & reclaim       1079  1385  926   252   291625  548873

These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot.  Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to see that with reclaim
the benchmark still finishes in a reasonable amount of time.

I also looked at the NUMA hit/miss stats for the "make -j" runs, and
the reclaim doesn't make any difference when the machine is thrashing
away.

Doing a "make -j8" on a single node that is filled with page cache
pages takes 700 seconds with reclaim turned on and 735 seconds without
reclaim (due to remote memory accesses).

The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c

Signed-off-by: Martin Hicks
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/sys_ni.c | 1 +
 1 file changed, 1 insertion(+)
(limited to 'kernel')

diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0dda70e..6f15bea 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -77,6 +77,7 @@ cond_syscall(sys_request_key);
 cond_syscall(sys_keyctl);
 cond_syscall(compat_sys_keyctl);
 cond_syscall(compat_sys_socketcall);
+cond_syscall(sys_set_zone_reclaim);

 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);
-- cgit v1.1
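The kernel/ side of this patch is a single cond_syscall() line because the
kernel builds optional syscalls with a weak-alias fallback: if nothing in
the build defines sys_set_zone_reclaim, calls to it resolve to
sys_ni_syscall() and return -ENOSYS.  A sketch of the idiom (the real
cond_syscall macro is defined per architecture; this shows the common
weak-alias form, not text from this patch):

    /* All optional syscalls fall back to this when not compiled in. */
    asmlinkage long sys_ni_syscall(void)
    {
            return -ENOSYS;
    }

    /*
     * cond_syscall(x) emits a weak alias: a real (strong) definition of
     * sys_set_zone_reclaim silently takes precedence when present.
     */
    #define cond_syscall(x) \
            asm(".weak\t" #x "\n\t.set\t" #x ",sys_ni_syscall")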
From 1363c3cd8603a913a27e2995dccbd70d5312d8e6 Mon Sep 17 00:00:00 2001
From: Wolfgang Wander
Date: Tue, 21 Jun 2005 17:14:49 -0700
Subject: [PATCH] Avoiding mmap fragmentation

Ingo recently introduced a great speedup for allocating new mmaps using
the free_area_cache pointer which boosts the specweb SSL benchmark by
4-5% and causes huge performance increases in thread creation.

The downside of this patch is that it does lead to fragmentation in the
mmap-ed areas (visible via /proc/self/maps), such that some applications
that work fine under 2.4 kernels quickly run out of memory on any 2.6
kernel.

The problem is twofold:

1) the free_area_cache is used to continue a search for memory where the
   last search ended.  Before the change, new areas were always searched
   from the base address on.

   So now new small areas are cluttering holes of all sizes throughout
   the whole mmap-able region, whereas before small holes tended to close
   holes near the base, leaving holes far from the base large and
   available for larger requests.

2) the free_area_cache also is set to the location of the last munmap-ed
   area, so in scenarios where we allocate e.g. five regions of 1K each,
   then free regions 4, 2 and 3 in this order, the next request for 1K
   will be placed in the position of the old region 3, whereas before we
   appended it to the still active region 1, placing it at the location
   of the old region 2.  Before we had one free region of 2K, now we only
   get two free regions of 1K -> fragmentation.

The patch addresses these issues by introducing yet another cache
descriptor, cached_hole_size, that contains the largest known hole size
below the current free_area_cache.  If a new request comes in, its size
is compared against the cached_hole_size, and if the request can be
filled with a hole below free_area_cache the search is started from the
base instead.

The results look promising: whereas 2.6.12-rc4 fragments quickly and my
(earlier posted) leakme.c test program terminates after 50000+
iterations with 96 distinct and fragmented maps in /proc/self/maps, it
performs nicely (as expected) with thread creation: Ingo's test_str02
with 20000 threads requires 0.7s system time.

Taking out Ingo's patch (un-patch available per request) by basically
deleting all mentions of free_area_cache from the kernel and starting
the search for new memory always at the respective bases, we observe:
leakme terminates successfully with 11 distinct, hardly fragmented areas
in /proc/self/maps, but thread creation is grindingly slow: 30+s(!)
system time for Ingo's test_str02 with 20000 threads.

Now - drumroll ;-) - the appended patch works fine with leakme: it ends
with only 7 distinct areas in /proc/self/maps, and thread creation also
seems sufficiently fast with 0.71s for 20000 threads.

Signed-off-by: Wolfgang Wander
Credit-to: "Richard Purdie"
Signed-off-by: Ken Chen
Acked-by: Ingo Molnar (partly)
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/fork.c | 2 ++
 1 file changed, 2 insertions(+)
(limited to 'kernel')

diff --git a/kernel/fork.c b/kernel/fork.c
index f42a17f..876b31c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -194,6 +194,7 @@ static inline int dup_mmap(struct mm_struct * mm, struct mm_struct * oldmm)
 	mm->mmap = NULL;
 	mm->mmap_cache = NULL;
 	mm->free_area_cache = oldmm->mmap_base;
+	mm->cached_hole_size = ~0UL;
 	mm->map_count = 0;
 	set_mm_counter(mm, rss, 0);
 	set_mm_counter(mm, anon_rss, 0);
@@ -322,6 +323,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm)
 	mm->ioctx_list = NULL;
 	mm->default_kioctx = (struct kioctx)INIT_KIOCTX(mm->default_kioctx, *mm);
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
+	mm->cached_hole_size = ~0UL;

 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
-- cgit v1.1
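In outline, the search policy the changelog describes behaves like the
sketch below.  This is illustrative and heavily simplified:
find_hole_above() is a hypothetical stand-in for the vma walk inside the
real arch_get_unmapped_area(), and the bookkeeping that updates
cached_hole_size during the walk is omitted:

    /* Sketch of the cached_hole_size policy (illustrative only). */
    unsigned long get_unmapped_area_sketch(struct mm_struct *mm,
                                           unsigned long len)
    {
            unsigned long start, addr;

            if (len <= mm->cached_hole_size) {
                    /*
                     * A hole below free_area_cache is known to be big
                     * enough: restart from the base, so small requests
                     * keep filling holes near the base.
                     */
                    mm->cached_hole_size = 0;
                    start = TASK_UNMAPPED_BASE;
            } else {
                    /* No hole below fits: continue where we left off. */
                    start = mm->free_area_cache;
            }

            addr = find_hole_above(mm, start, len); /* hypothetical helper */
            mm->free_area_cache = addr + len;       /* resume point */
            return addr;
    }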
From 45918e1a8bfcabc1cb4570b8df276655020eac45 Mon Sep 17 00:00:00 2001
From: Hugh Dickins
Date: Tue, 21 Jun 2005 17:15:08 -0700
Subject: [PATCH] dup_mmap: update comment on new vma

Remove part of the comment on linking the new vma in dup_mmap: since
anon_vma rmap came in, try_to_unmap_one knows the vma without needing
find_vma.  But add a comment to note that here the vma is inserted
without holding mmap_sem.

Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/fork.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
(limited to 'kernel')

diff --git a/kernel/fork.c b/kernel/fork.c
index 876b31c..a28d11e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -250,8 +250,9 @@ static inline int dup_mmap(struct mm_struct * mm, struct mm_struct * oldmm)

 		/*
 		 * Link in the new vma and copy the page table entries:
-		 * link in first so that swapoff can see swap entries,
-		 * and try_to_unmap_one's find_vma find the new vma.
+		 * link in first so that swapoff can see swap entries.
+		 * Note that, exceptionally, here the vma is inserted
+		 * without holding mm->mmap_sem.
 		 */
 		spin_lock(&mm->page_table_lock);
 		*pprev = tmp;
-- cgit v1.1

From dbce706e2550253c5ab6043f4f5dfde0cd02470f Mon Sep 17 00:00:00 2001
From: Paolo 'Blaisorblade' Giarrusso
Date: Tue, 21 Jun 2005 17:16:19 -0700
Subject: [PATCH] uml: add and use generic hw_controller_type->release

With Chris Wedgwood

Currently UML must explicitly call the UML-specific
free_irq_by_irq_and_dev() for each free_irq call it makes.  This is
needed because ->shutdown and/or ->disable are only called when the last
"action" for that irq is removed.

Instead, for UML shared IRQs (UML IRQs are very often, if not always,
shared), for each dev_id some setup is done which must be cleared on the
release of that fd.  For instance, for each open console a new instance
(i.e. new dev_id) of the same IRQ is requested().

Specifically, an fd is stored in an array (pollfds), which is then read
by a host thread and passed to poll().  Each event registered by poll()
triggers an interrupt.  So, for each free_irq() we must remove the
corresponding host fd from the table, which we do via this ->release()
method.

In this patch we add an appropriate hook for this, and remove all
explicit calls to the said procedure by pointing the hook at it; this is
safe to do, since the hook runs at exactly the point where the explicit
call used to be needed.  Also some cosmetic improvements are included.

This is heavily based on some work by Chris Wedgwood, which however
didn't get merged, for something I'd call a "misunderstanding" (the need
for this patch wasn't cleanly explained, thus adding the generic hook
was felt to be undesirable).

Signed-off-by: Paolo 'Blaisorblade' Giarrusso
CC: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/irq/manage.c | 4 ++++
 1 file changed, 4 insertions(+)
(limited to 'kernel')

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5202e4c..5fde817 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -255,6 +255,10 @@ void free_irq(unsigned int irq, void *dev_id)

 		/* Found it - now remove it from the list of entries */
 		*pp = action->next;
+
+		if (desc->handler->release)
+			desc->handler->release(irq, dev_id);
+
 		if (!desc->action) {
 			desc->status |= IRQ_DISABLED;
 			if (desc->handler->shutdown)
-- cgit v1.1
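For context, the consumer side of the new hook would look roughly like this
on UML.  A sketch, not UML's actual source: the structure tag and the
omitted methods are assumptions, and free_irq_by_irq_and_dev() is the
pre-existing teardown routine named in the changelog:

    /* Sketch: wiring the new hook up to the existing teardown routine. */
    static void um_release_irq(unsigned int irq, void *dev_id)
    {
            /* Remove this dev_id's host fd from the pollfds table. */
            free_irq_by_irq_and_dev(irq, dev_id);
    }

    static struct hw_interrupt_type normal_irq_type = {
            .typename = "SIGIO",
            /* .startup, .shutdown, .enable, .disable, .ack, .end ... */
            .release  = um_release_irq,     /* invoked from free_irq() */
    };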
From b77d6adc922b8bbf8b16b67f567958c42962cf88 Mon Sep 17 00:00:00 2001
From: Paolo 'Blaisorblade' Giarrusso
Date: Tue, 21 Jun 2005 17:16:24 -0700
Subject: [PATCH] uml: make hw_controller_type->release exist only for archs needing it

With Chris Wedgwood

As suggested by Chris, we can make the just-added ->release method
conditional on UML only (better: on archs requesting it, i.e. only UML
currently), so that other archs don't get this unneeded crud, and if UML
won't need it any more we can kill this.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso
CC: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/irq/manage.c | 4 ++++
 1 file changed, 4 insertions(+)
(limited to 'kernel')

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5fde817..ac67009 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -6,6 +6,7 @@
  * This file contains driver APIs to the irq subsystem.
  */

+#include
 #include
 #include
 #include
@@ -256,8 +257,11 @@ void free_irq(unsigned int irq, void *dev_id)

 		/* Found it - now remove it from the list of entries */
 		*pp = action->next;
+		/* Currently used only by UML, might disappear one day.*/
+#ifdef CONFIG_IRQ_RELEASE_METHOD
 		if (desc->handler->release)
 			desc->handler->release(irq, dev_id);
+#endif

 		if (!desc->action) {
 			desc->status |= IRQ_DISABLED;
-- cgit v1.1
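The hunk above guards only the call site; presumably the ->release field
itself is compiled out the same way in include/linux/irq.h, which lies
outside this kernel/-limited listing.  A reconstruction of the idea (the
field list is approximated from the 2.6.12-era hw_interrupt_type, not
quoted from the patch):

    struct hw_interrupt_type {
            const char *typename;
            unsigned int (*startup)(unsigned int irq);
            void (*shutdown)(unsigned int irq);
            void (*enable)(unsigned int irq);
            void (*disable)(unsigned int irq);
            void (*ack)(unsigned int irq);
            void (*end)(unsigned int irq);
            void (*set_affinity)(unsigned int irq, cpumask_t dest);
    #ifdef CONFIG_IRQ_RELEASE_METHOD
            /* Only UML selects this; see the comment in free_irq(). */
            void (*release)(unsigned int irq, void *dev_id);
    #endif
    };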