summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* cpuset,mm: fix no node to alloc memory when changing cpuset's memsMiao Xie2010-05-2511-20/+148
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before applying this patch, cpuset updates task->mems_allowed and mempolicy by setting all new bits in the nodemask first, and clearing all old unallowed bits later. But in the way, the allocator may find that there is no node to alloc memory. The reason is that cpuset rebinds the task's mempolicy, it cleans the nodes which the allocater can alloc pages on, for example: (mpol: mempolicy) task1 task1's mpol task2 alloc page 1 alloc on node0? NO 1 1 change mems from 1 to 0 1 rebind task1's mpol 0-1 set new bits 0 clear disallowed bits alloc on node1? NO 0 ... can't alloc page goto oom This patch fixes this problem by expanding the nodes range first(set newly allowed bits) and shrink it lazily(clear newly disallowed bits). So we use a variable to tell the write-side task that read-side task is reading nodemask, and the write-side task clears newly disallowed nodes after read-side task ends the current memory allocation. [akpm@linux-foundation.org: fix spello] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: restructure rebinding-mempolicy functionsMiao Xie2010-05-253-24/+119
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nick Piggin reported that the allocator may see an empty nodemask when changing cpuset's mems[1]. It happens only on the kernel that do not do atomic nodemask_t stores. (MAX_NUMNODES > BITS_PER_LONG) But I found that there is also a problem on the kernel that can do atomic nodemask_t stores. The problem is that the allocator can't find a node to alloc page when changing cpuset's mems though there is a lot of free memory. The reason is like this: (mpol: mempolicy) task1 task1's mpol task2 alloc page 1 alloc on node0? NO 1 1 change mems from 1 to 0 1 rebind task1's mpol 0-1 set new bits 0 clear disallowed bits alloc on node1? NO 0 ... can't alloc page goto oom I can use the attached program reproduce it by the following step: # mkdir /dev/cpuset # mount -t cpuset cpuset /dev/cpuset # mkdir /dev/cpuset/1 # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems # echo $$ > /dev/cpuset/1/tasks # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> & <nr_tasks> = max(nr_cpus - 1, 1) # killall -s SIGUSR1 cpuset_mem_hog # ./change_mems.sh several hours later, oom will happen though there is a lot of free memory. This patchset fixes this problem by expanding the nodes range first(set newly allowed bits) and shrink it lazily(clear newly disallowed bits). So we use a variable to tell the write-side task that read-side task is reading nodemask, and the write-side task clears newly disallowed nodes after read-side task ends the current memory allocation. This patch: In order to fix no node to alloc memory, when we want to update mempolicy and mems_allowed, we expand the set of nodes first (set all the newly nodes) and shrink the set of nodes lazily(clean disallowed nodes), But the mempolicy's rebind functions may breaks the expanding. So we restructure the mempolicy's rebind functions and split the rebind work to two steps, just like the update of cpuset's mems: The 1st step: expand the set of the mempolicy's nodes. The 2nd step: shrink the set of the mempolicy's nodes. It is used when there is no real lock to protect the mempolicy in the read-side. Otherwise we can do rebind work at once. In order to implement it, we define enum mpol_rebind_step { MPOL_REBIND_ONCE, MPOL_REBIND_STEP1, MPOL_REBIND_STEP2, MPOL_REBIND_NSTEP, }; If the mempolicy needn't be updated by two steps, we can pass MPOL_REBIND_ONCE to the rebind functions. Or we can pass MPOL_REBIND_STEP1 to do the first step of the rebind work and pass MPOL_REBIND_STEP2 to do the second step work. Besides that, it maybe long time between these two step and we have to release the lock that protects mempolicy and mems_allowed. If we hold the lock once again, we must check whether the current mempolicy is under the rebinding (the first step has been done) or not, because the task may alloc a new mempolicy when we don't hold the lock. So we defined the following flag to identify it: #define MPOL_F_REBINDING (1 << 2) The new functions will be used in the next patch. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: document cpuset interaction with tmpfs mpol mount optionLee Schermerhorn2010-05-251-1/+9
| | | | | | | | | | | | | | | | | Update Documentation/filesystems/tmpfs.txt to describe the interaction of tmpfs mount option memory policy with tasks' cpuset mems_allowed. Note: the mount(8) man page [in the util-linux-ng package] requires similiar updates. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: factor mpol_shared_policy_init() return pathsLee Schermerhorn2010-05-251-10/+6
| | | | | | | | | | | | | | Factor out duplicate put/frees in mpol_shared_policy_init() to a common return path. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: rename policy_types and cleanup initializationLee Schermerhorn2010-05-251-6/+12
| | | | | | | | | | | | | | | | Rename 'policy_types[]' to 'policy_modes[]' to better match the array contents. Use designated intializer syntax for policy_modes[]. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: lose unnecessary loop variable in mpol_parse_str()Lee Schermerhorn2010-05-251-6/+4
| | | | | | | | | | | | | | | We don't really need the extra variable 'i' in mpol_parse_str(). The only use is as the the loop variable. Then, it's assigned to 'mode'. Just use mode, and loose the 'uninitialized_var()' macro. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: don't call mpol_set_nodemask() when no_contextLee Schermerhorn2010-05-251-5/+4
| | | | | | | | | | | | | | | | | | No need to call mpol_set_nodemask() when we have no context for the mempolicy. This can occur when we're parsing a tmpfs 'mpol' mount option. Just save the raw nodemask in the mempolicy's w.user_nodemask member for use when a tmpfs/shmem file is created. mpol_shared_policy_init() will "contextualize" the policy for the new file based on the creating task's context. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: remove redundant checkBob Liu2010-05-251-11/+5
| | | | | | | | | | | | | | Lee's patch "mempolicy: use MPOL_PREFERRED for system-wide default policy" has made the MPOL_DEFAULT only used in the memory policy APIs. So, no need to check in __mpol_equal also. Also get rid of mpol_match_intent() and move its logic directly into __mpol_equal(). Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: remove case MPOL_INTERLEAVE from policy_zonelist()Bob Liu2010-05-251-3/+1
| | | | | | | | | | | | In policy_zonelist() mode MPOL_INTERLEAVE shouldn't happen, so fall through to BUG() instead of break to return. I also fixed the comment. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: remove redundant codeBob Liu2010-05-251-4/+1
| | | | | | | | | | | | | | | | 1. In funtion is_valid_nodemask(), varibable k will be inited to 0 in the following loop, needn't init to policy_zone anymore. 2. (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES) has already defined to MPOL_MODE_FLAGS in mempolicy.h. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove return value of putback_lru_pages()Minchan Kim2010-05-252-8/+3
| | | | | | | | | | | | | | | | putback_lru_page() never can fail. So it doesn't matter count of "the number of pages put back". In addition, users of this functions don't use return value. Let's remove unnecessary code. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* shmem: remove redundant codeHuang Shijie2010-05-251-2/+0
| | | | | | | | | | | prep_new_page() will call set_page_private(page, 0) to initialise the page, so the code is redundant. Signed-off-by: Huang Shijie <shijie8@gmail.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sparsemem: on no vmemmap path put mem_map on node high tooYinghai Lu2010-05-251-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to put mem_map high when virtual memmap is not used. before this patch free mem pfn range on first node: [ 0.000000] 19 - 1f [ 0.000000] 28 40 - 80 95 [ 0.000000] 702 740 - 1000 1000 [ 0.000000] 347c - 347e [ 0.000000] 34e7 3500 - 3b80 3b8b [ 0.000000] 73b8b 73bc0 - 73c00 73c00 [ 0.000000] 73ddd - 73e00 [ 0.000000] 73fdd - 74000 [ 0.000000] 741dd - 74200 [ 0.000000] 743dd - 74400 [ 0.000000] 745dd - 74600 [ 0.000000] 747dd - 74800 [ 0.000000] 749dd - 74a00 [ 0.000000] 74bdd - 74c00 [ 0.000000] 74ddd - 74e00 [ 0.000000] 74fdd - 75000 [ 0.000000] 751dd - 75200 [ 0.000000] 753dd - 75400 [ 0.000000] 755dd - 75600 [ 0.000000] 757dd - 75800 [ 0.000000] 759dd - 75a00 [ 0.000000] 79bdd 79c00 - 7d540 7d550 [ 0.000000] 7f745 - 7f750 [ 0.000000] 10000b 100040 - 2080000 2080000 so only 79c00 - 7d540 are major free block under 4g... after this patch, we will get [ 0.000000] 19 - 1f [ 0.000000] 28 40 - 80 95 [ 0.000000] 702 740 - 1000 1000 [ 0.000000] 347c - 347e [ 0.000000] 34e7 3500 - 3600 3600 [ 0.000000] 37dd - 3800 [ 0.000000] 39dd - 3a00 [ 0.000000] 3bdd - 3c00 [ 0.000000] 3ddd - 3e00 [ 0.000000] 3fdd - 4000 [ 0.000000] 41dd - 4200 [ 0.000000] 43dd - 4400 [ 0.000000] 45dd - 4600 [ 0.000000] 47dd - 4800 [ 0.000000] 49dd - 4a00 [ 0.000000] 4bdd - 4c00 [ 0.000000] 4ddd - 4e00 [ 0.000000] 4fdd - 5000 [ 0.000000] 51dd - 5200 [ 0.000000] 53dd - 5400 [ 0.000000] 95dd 9600 - 7d540 7d550 [ 0.000000] 7f745 - 7f750 [ 0.000000] 17000b 170040 - 2080000 2080000 we will have 9600 - 7d540 for major free block... sparse-vmemmap path already used __alloc_bootmem_node_high() Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page allocator: reduce fragmentation in buddy allocator by adding buddies ↵Corrado Zoccolo2010-05-251-5/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | that are merging to the tail of the free lists In order to reduce fragmentation, this patch classifies freed pages in two groups according to their probability of being part of a high order merge. Pages belonging to a compound whose next-highest buddy is free are more likely to be part of a high order merge in the near future, so they will be added at the tail of the freelist. The remaining pages are put at the front of the freelist. In this way, the pages that are more likely to cause a big merge are kept free longer. Consequently there is a tendency to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation. This heuristic was tested on three machines, x86, x86-64 and ppc64 with 3GB of RAM in each machine. The tests were kernbench, netperf, sysbench and STREAM for performance and a high-order stress test for huge page allocations. KernBench X86 Elapsed mean 374.77 ( 0.00%) 375.10 (-0.09%) User mean 649.53 ( 0.00%) 650.44 (-0.14%) System mean 54.75 ( 0.00%) 54.18 ( 1.05%) CPU mean 187.75 ( 0.00%) 187.25 ( 0.27%) KernBench X86-64 Elapsed mean 94.45 ( 0.00%) 94.01 ( 0.47%) User mean 323.27 ( 0.00%) 322.66 ( 0.19%) System mean 36.71 ( 0.00%) 36.50 ( 0.57%) CPU mean 380.75 ( 0.00%) 381.75 (-0.26%) KernBench PPC64 Elapsed mean 173.45 ( 0.00%) 173.74 (-0.17%) User mean 587.99 ( 0.00%) 587.95 ( 0.01%) System mean 60.60 ( 0.00%) 60.57 ( 0.05%) CPU mean 373.50 ( 0.00%) 372.75 ( 0.20%) Nothing notable for kernbench. NetPerf UDP X86 64 42.68 ( 0.00%) 42.77 ( 0.21%) 128 85.62 ( 0.00%) 85.32 (-0.35%) 256 170.01 ( 0.00%) 168.76 (-0.74%) 1024 655.68 ( 0.00%) 652.33 (-0.51%) 2048 1262.39 ( 0.00%) 1248.61 (-1.10%) 3312 1958.41 ( 0.00%) 1944.61 (-0.71%) 4096 2345.63 ( 0.00%) 2318.83 (-1.16%) 8192 4132.90 ( 0.00%) 4089.50 (-1.06%) 16384 6770.88 ( 0.00%) 6642.05 (-1.94%)* NetPerf UDP X86-64 64 148.82 ( 0.00%) 154.92 ( 3.94%) 128 298.96 ( 0.00%) 312.95 ( 4.47%) 256 583.67 ( 0.00%) 626.39 ( 6.82%) 1024 2293.18 ( 0.00%) 2371.10 ( 3.29%) 2048 4274.16 ( 0.00%) 4396.83 ( 2.79%) 3312 6356.94 ( 0.00%) 6571.35 ( 3.26%) 4096 7422.68 ( 0.00%) 7635.42 ( 2.79%)* 8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%) 16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)* 1.64% 2.73% NetPerf UDP PPC64 64 49.98 ( 0.00%) 50.25 ( 0.54%) 128 98.66 ( 0.00%) 100.95 ( 2.27%) 256 197.33 ( 0.00%) 191.03 (-3.30%) 1024 761.98 ( 0.00%) 785.07 ( 2.94%) 2048 1493.50 ( 0.00%) 1510.85 ( 1.15%) 3312 2303.95 ( 0.00%) 2271.72 (-1.42%) 4096 2774.56 ( 0.00%) 2773.06 (-0.05%) 8192 4918.31 ( 0.00%) 4793.59 (-2.60%) 16384 7497.98 ( 0.00%) 7749.52 ( 3.25%) The tests are run to have confidence limits within 1%. Results marked with a * were not confident although in this case, it's only outside by small amounts. Even with some results that were not confident, the netperf UDP results were generally positive. NetPerf TCP X86 64 652.25 ( 0.00%)* 648.12 (-0.64%)* 23.80% 22.82% 128 1229.98 ( 0.00%)* 1220.56 (-0.77%)* 21.03% 18.90% 256 2105.88 ( 0.00%) 1872.03 (-12.49%)* 1.00% 16.46% 1024 3476.46 ( 0.00%)* 3548.28 ( 2.02%)* 13.37% 11.39% 2048 4023.44 ( 0.00%)* 4231.45 ( 4.92%)* 9.76% 12.48% 3312 4348.88 ( 0.00%)* 4396.96 ( 1.09%)* 6.49% 8.75% 4096 4726.56 ( 0.00%)* 4877.71 ( 3.10%)* 9.85% 8.50% 8192 4732.28 ( 0.00%)* 5777.77 (18.10%)* 9.13% 13.04% 16384 5543.05 ( 0.00%)* 5906.24 ( 6.15%)* 7.73% 8.68% NETPERF TCP X86-64 netperf-tcp-vanilla-netperf netperf-tcp tcp-vanilla pgalloc-delay 64 1895.87 ( 0.00%)* 1775.07 (-6.81%)* 5.79% 4.78% 128 3571.03 ( 0.00%)* 3342.20 (-6.85%)* 3.68% 6.06% 256 5097.21 ( 0.00%)* 4859.43 (-4.89%)* 3.02% 2.10% 1024 8919.10 ( 0.00%)* 8892.49 (-0.30%)* 5.89% 6.55% 2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)* 7.08% 7.44% 3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)* 6.87% 7.33% 4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)* 6.86% 8.18% 8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)* 7.49% 5.55% 16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)* 7.36% 6.49% NETPERF TCP PPC64 netperf-tcp-vanilla-netperf netperf-tcp tcp-vanilla pgalloc-delay 64 594.17 ( 0.00%) 596.04 ( 0.31%)* 1.00% 2.29% 128 1064.87 ( 0.00%)* 1074.77 ( 0.92%)* 1.30% 1.40% 256 1852.46 ( 0.00%)* 1856.95 ( 0.24%) 1.25% 1.00% 1024 3839.46 ( 0.00%)* 3813.05 (-0.69%) 1.02% 1.00% 2048 4885.04 ( 0.00%)* 4881.97 (-0.06%)* 1.15% 1.04% 3312 5506.90 ( 0.00%) 5459.72 (-0.86%) 4096 6449.19 ( 0.00%) 6345.46 (-1.63%) 8192 7501.17 ( 0.00%) 7508.79 ( 0.10%) 16384 9618.65 ( 0.00%) 9490.10 (-1.35%) There was a distinct lack of confidence in the X86* figures so I included what the devation was where the results were not confident. Many of the results, whether gains or losses were within the standard deviation so no solid conclusion can be reached on performance impact. Looking at the figures, only the X86-64 ones look suspicious with a few losses that were outside the noise. However, the results were so unstable that without knowing why they vary so much, a solid conclusion cannot be reached. SYSBENCH X86 sysbench-vanilla pgalloc-delay 1 7722.85 ( 0.00%) 7756.79 ( 0.44%) 2 14901.11 ( 0.00%) 13683.44 (-8.90%) 3 15171.71 ( 0.00%) 14888.25 (-1.90%) 4 14966.98 ( 0.00%) 15029.67 ( 0.42%) 5 14370.47 ( 0.00%) 14865.00 ( 3.33%) 6 14870.33 ( 0.00%) 14845.57 (-0.17%) 7 14429.45 ( 0.00%) 14520.85 ( 0.63%) 8 14354.35 ( 0.00%) 14362.31 ( 0.06%) SYSBENCH X86-64 1 17448.70 ( 0.00%) 17484.41 ( 0.20%) 2 34276.39 ( 0.00%) 34251.00 (-0.07%) 3 50805.25 ( 0.00%) 50854.80 ( 0.10%) 4 66667.10 ( 0.00%) 66174.69 (-0.74%) 5 66003.91 ( 0.00%) 65685.25 (-0.49%) 6 64981.90 ( 0.00%) 65125.60 ( 0.22%) 7 64933.16 ( 0.00%) 64379.23 (-0.86%) 8 63353.30 ( 0.00%) 63281.22 (-0.11%) 9 63511.84 ( 0.00%) 63570.37 ( 0.09%) 10 62708.27 ( 0.00%) 63166.25 ( 0.73%) 11 62092.81 ( 0.00%) 61787.75 (-0.49%) 12 61330.11 ( 0.00%) 61036.34 (-0.48%) 13 61438.37 ( 0.00%) 61994.47 ( 0.90%) 14 62304.48 ( 0.00%) 62064.90 (-0.39%) 15 63296.48 ( 0.00%) 62875.16 (-0.67%) 16 63951.76 ( 0.00%) 63769.09 (-0.29%) SYSBENCH PPC64 -sysbench-pgalloc-delay-sysbench sysbench-vanilla pgalloc-delay 1 7645.08 ( 0.00%) 7467.43 (-2.38%) 2 14856.67 ( 0.00%) 14558.73 (-2.05%) 3 21952.31 ( 0.00%) 21683.64 (-1.24%) 4 27946.09 ( 0.00%) 28623.29 ( 2.37%) 5 28045.11 ( 0.00%) 28143.69 ( 0.35%) 6 27477.10 ( 0.00%) 27337.45 (-0.51%) 7 26489.17 ( 0.00%) 26590.06 ( 0.38%) 8 26642.91 ( 0.00%) 25274.33 (-5.41%) 9 25137.27 ( 0.00%) 24810.06 (-1.32%) 10 24451.99 ( 0.00%) 24275.85 (-0.73%) 11 23262.20 ( 0.00%) 23674.88 ( 1.74%) 12 24234.81 ( 0.00%) 23640.89 (-2.51%) 13 24577.75 ( 0.00%) 24433.50 (-0.59%) 14 25640.19 ( 0.00%) 25116.52 (-2.08%) 15 26188.84 ( 0.00%) 26181.36 (-0.03%) 16 26782.37 ( 0.00%) 26255.99 (-2.00%) Again, there is little to conclude here. While there are a few losses, the results vary by +/- 8% in some cases. They are the results of most concern as there are some large losses but it's also within the variance typically seen between kernel releases. The STREAM results varied so little and are so verbose that I didn't include them here. The final test stressed how many huge pages can be allocated. The absolute number of huge pages allocated are the same with or without the page. However, the "unusability free space index" which is a measure of external fragmentation was slightly lower (lower is better) throughout the lifetime of the system. I also measured the latency of how long it took to successfully allocate a huge page. The latency was slightly lower and on X86 and PPC64, more huge pages were allocated almost immediately from the free lists. The improvement is slight but there. [mel@csn.ul.ie: Tested, reworked for less branches] [czoccolo@gmail.com: fix oops by checking pfn_valid_within()] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tmpfs: insert tmpfs cache pages to inactive list at firstKOSAKI Motohiro2010-05-252-12/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Shaohua Li reported parallel file copy on tmpfs can lead to OOM killer. This is regression of caused by commit 9ff473b9a7 ("vmscan: evict streaming IO first"). Wow, It is 2 years old patch! Currently, tmpfs file cache is inserted active list at first. This means that the insertion doesn't only increase numbers of pages in anon LRU, but it also reduces anon scanning ratio. Therefore, vmscan will get totally confused. It scans almost only file LRU even though the system has plenty unused tmpfs pages. Historically, lru_cache_add_active_anon() was used for two reasons. 1) Intend to priotize shmem page rather than regular file cache. 2) Intend to avoid reclaim priority inversion of used once pages. But we've lost both motivation because (1) Now we have separate anon and file LRU list. then, to insert active list doesn't help such priotize. (2) In past, one pte access bit will cause page activation. then to insert inactive list with pte access bit mean higher priority than to insert active list. Its priority inversion may lead to uninteded lru chun. but it was already solved by commit 645747462 (vmscan: detect mapped file pages used only once). (Thanks Hannes, you are great!) Thus, now we can use lru_cache_add_anon() instead. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* xtensa: includecheck fix: vectors.SJaswinder Singh Rajput2010-05-251-2/+0
| | | | | | | | | | | | fix the following 'make includecheck' warnings: arch/xtensa/kernel/vectors.S: asm/processor.h is included more than once. arch/xtensa/kernel/vectors.S: asm/ptrace.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* xtensa: convert to asm-generic/hardirq.hChristoph Hellwig2010-05-252-21/+3
| | | | | | | | | Also remove lots of unused irq_cpustat fields. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* xtensa: set ARCH_KMALLOC_MINALIGNFUJITA Tomonori2010-05-251-0/+1
| | | | | | | | | | | | | Architectures that handle DMA-non-coherent memory need to set ARCH_KMALLOC_MINALIGN to make sure that kmalloc'ed buffer is DMA-safe: the buffer doesn't share a cache with the others. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Chris Zankel <chris@zankel.net> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide-2.6Linus Torvalds2010-05-244-8/+8
|\ | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide-2.6: cmd640: fix kernel oops in test_irq() method pdc202xx_old: ignore "FIFO empty" bit in test_irq() method pdc202xx_old: wire test_irq() method for PDC2026x IDE: pass IRQ flags to the IDE core ide: fix comment typo in ide.h
| * cmd640: fix kernel oops in test_irq() methodSergei Shtylyov2010-05-111-4/+2
| | | | | | | | | | | | | | | | | | When implementing the test_iqr() method, I forgot that this driver is not an ordinary PCI driver and also needs to support VLB variant of the chip. Moreover, 'hwif->dev' should be NULL, potentially causing oops in pci_read_config_byte(). Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * pdc202xx_old: ignore "FIFO empty" bit in test_irq() methodSergei Shtylyov2010-04-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | The driver takes into account not only the interrupt status bit but also "FIFO empty" bit in its test_irq() method. This actually is a superfluous check since for the DMA commands calling the dma_test_irq() method further in the interrupt handler makes sure FIFO is emptied. Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * pdc202xx_old: wire test_irq() method for PDC2026xSergei Shtylyov2010-04-221-0/+1
| | | | | | | | | | | | | | | | | | In the commit e0321fbe6d34b4bb514fb6daff9e0859e5d76001 (pdc202xx_old: implement test_irq() method (take 2)) I forgot to modify 'pdc2026x_port_ops'... :-/ Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * IDE: pass IRQ flags to the IDE coreYegor Yefremov2010-04-151-0/+1
| | | | | | | | | | | | | | This enables shared IRQs and other features to be used with platform devices Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ide: fix comment typo in ide.hSergei Shtylyov2010-04-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Fix typo in the comment to the 'dma_mode' field of the 'struct ide_drive_s' introduced by the commit 3fccaa192b9501e79a57e02e62b6bf420d2b461e (ide: add drive->dma_mode field). Whilt at it, convert spaces to a tab in the declaration of the neighbouring 'dn' field... Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'for-linus' of ↵Linus Torvalds2010-05-2448-3241/+1938
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vapier/blackfin * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/vapier/blackfin: (30 commits) Blackfin: SMP: fix continuation lines Blackfin: acvilon: fix timeout usage for I2C Blackfin: fix typo in BF537 IRQ comment Blackfin: unify duplicate MEM_MT48LC32M8A2_75 kconfig options Blackfin: set ARCH_KMALLOC_MINALIGN Blackfin: use atomic kmalloc in L1 alloc so it too can be atomic Blackfin: another year of changes (update copyright in boot log) Blackfin: optimize strncpy a bit Blackfin: isram: clean up ITEST_COMMAND macro and improve the selftests Blackfin: move string functions to normal lib/ assembly Blackfin: SIC: cut down on IAR MMR reads a bit Blackfin: bf537-minotaur: fix build errors due to header changes Blackfin: kgdb: pass up the CC register instead of a 0 stub Blackfin: handle HW errors in the new "FAULT" printing code Blackfin: show the whole accumulator in the pseudo DBG insn Blackfin: support all possible registers in the pseudo instructions Blackfin: add support for the DBG (debug output) pseudo insn Blackfin: change the BUG opcode to an unused 16-bit opcode Blackfin: allow NMI watchdog to be used w/RETN as a scratch reg Blackfin: add support for the DBGA (debug assert) pseudo insn ...
| * | Blackfin: SMP: fix continuation linesJoe Perches2010-05-221-2/+2
| | | | | | | | | | | | | | | Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: acvilon: fix timeout usage for I2CWolfram Sang2010-05-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | The timeout value is in jiffies, so it should be using HZ, not a plain number. As '10000' is ambiguous, 1HZ is used as conservative default. Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Valentin Yakovenkov <yakovenkov@gmail.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: fix typo in BF537 IRQ commentMichael Hennerich2010-05-221-1/+1
| | | | | | | | | | | | | | | Signed-off-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: unify duplicate MEM_MT48LC32M8A2_75 kconfig optionsMike Frysinger2010-05-221-6/+1
| | | | | | | | | | | | | | | Reported-by: Christoph Egger <siccegge@cs.fau.de> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: set ARCH_KMALLOC_MINALIGNFUJITA Tomonori2010-05-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Architectures that handle DMA-non-coherent memory need to set ARCH_KMALLOC_MINALIGN to make sure that kmalloc'ed buffer is DMA-safe: the buffer doesn't share a cache with the others. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: use atomic kmalloc in L1 alloc so it too can be atomicMike Frysinger2010-05-221-1/+2
| | | | | | | | | | | | | | | | | | | | | Some drivers allocate L1 SRAM in atomic contexts, so make sure these functions also use GFP_ATOMIC to avoid BUG()'s. Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: another year of changes (update copyright in boot log)Mike Frysinger2010-05-221-2/+2
| | | | | | | | | | | | Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: optimize strncpy a bitRobin Getz2010-05-222-13/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a little strncpy optimization which can easily cut boot time by 20%. When the kernel is booting with initramfs, it builds up the filesystem from a cpio archive by calling strncpy_from_user() via fs/namei.c's do_getname() on every file in the archive (which can be lots) with a length of PATH_MAX (1024). This causes the dest of the strncpy to be padded with many NUL bytes. This optimization mostly causes these NUL bytes to be padded with a call to memset() which is already optimized for filling memory quickly, but the hardware loop helps a little bit as well. Boot time measured with 'loglevel=0' so UART speed doesn't get in the way. Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: isram: clean up ITEST_COMMAND macro and improve the selftestsMike Frysinger2010-05-221-43/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The IADDR2DTEST() macro had some duplicated logic with bit 11 and some incorrect comments, so scrub all of that. In order to verify these aren't a problem (and won't be in the future), extend the self tests to operate on as much L1 SRAM as possible. Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: move string functions to normal lib/ assemblyRobin Getz2010-05-2211-184/+226
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since 'extern inline' doesn't work correctly in the context of the Linux kernel (too many overriding defines), move the string functions to normal lib/ assembly files (like the existing mem funcs). This avoids the forced inline all over the kernel and allows us to place them constantly in L1. This also avoids some module failures when gcc inserts calls to string functions but the kernel build system doesn't fully consult the library archives. Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: SIC: cut down on IAR MMR reads a bitMike Frysinger2010-05-221-15/+18
| | | | | | | | | | | | | | | | | | | | | | | | Tweak the for loops that operate on the SIC IAR system MMRs to avoid re-reading them multiple times in a row. System MMRs are a little slower to access, so avoid the penalty when possible. Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: bf537-minotaur: fix build errors due to header changesMike Frysinger2010-05-221-1/+2
| | | | | | | | | | | | Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: kgdb: pass up the CC register instead of a 0 stubMike Frysinger2010-05-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | While the CC pseudo register can be deduced from the ASTAT register, make sure we set its value correctly instead of always stubbing it out as 0. GDB itself looks at this pseudo register instead of ASTAT, so we have to supply the right value. Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: handle HW errors in the new "FAULT" printing codeRobin Getz2010-05-221-0/+9
| | | | | | | | | | | | | | | Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: show the whole accumulator in the pseudo DBG insnRobin Getz2010-05-221-6/+19
| | | | | | | | | | | | | | | | | | | | | | | | Rather than print just part of the accumulator register, show the whole 40 bits. This matches the simulator behavior better. Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: support all possible registers in the pseudo instructionsRobin Getz2010-05-221-6/+61
| | | | | | | | | | | | | | | | | | | | | Rather than decoding just the common R/P registers, handle all of them. Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: add support for the DBG (debug output) pseudo insnRobin Getz2010-05-223-18/+71
| | | | | | | | | | | | | | | | | | | | | | | | Another pseudo insn used by Blackfin simulators. Also factor some now common register lookup code out of the DBGA handlers. Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: change the BUG opcode to an unused 16-bit opcodeRobin Getz2010-05-221-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | The current BUG opcode includes the bit that flags the insn as a 32bit opcode, but it wasn't declaring it as 32bits. So pick an unused 16bit. URL: http://blackfin.uclinux.org/gf/tracker/5973 Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: allow NMI watchdog to be used w/RETN as a scratch regGraf Yang2010-05-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NMIs are not safe to return from because many anomaly workarounds are implemented by disabling interrupts. The NMI obviously violates this assumption. Since the NMI watchdog never returns, we don't have to worry about it clobbering RETN when it is being used as a scratch register with the exception stack. Signed-off-by: Graf Yang <graf.yang@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: add support for the DBGA (debug assert) pseudo insnRobin Getz2010-05-225-0/+115
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A few pseudo debug insns exist to make testing of simulators easier. Since these don't actually exist in the hardware, we have to have the exception handler take care of emulating these. This allows sim test cases to be executed unmodified under Linux and thus simplify debugging greatly. Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: SMP: add flush_tlb_kernel_range stubGraf Yang2010-05-211-0/+1
| | | | | | | | | | | | | | | | | | | | | Newer code in mm/page_alloc.c requires this function now in arches. Signed-off-by: Graf Yang <graf.yang@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: make hardware trace output a little more usefulRobin Getz2010-05-214-67/+446
| | | | | | | | | | | | | | | | | | | | | | | | Decode the vast majority of insns that appear in the trace buffer to get a better idea of what's going on at a glance. Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: print out the faulting insn in the trace outputRobin Getz2010-05-211-2/+17
| | | | | | | | | | | | | | | | | | | | | | | | Print out the faulting instruction so when people send traces as part of bug reports, we have a better idea of what is going on. Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: remove CONFIG_DEBUG_VERBOSE from trace.cRobin Getz2010-05-215-123/+120
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that the split traps code has moved all the verbose output to the trace.c file, we can unify all the CONFIG_DEBUG_VERBOSE handling. This gets rid of much of the crappy ifdef forest and enables usage of normal pr_xxx functions so checkpatch stops complaining. Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
| * | Blackfin: split kernel/traps.cRobin Getz2010-05-217-836/+887
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current kernel/traps.c file has grown a bit unwieldy as more debugging functionality has been added over time, so split it up into more logical files. There should be no functional changes here, just minor whitespace tweaking. This should make future extensions easier to manage. Signed-off-by: Robin Getz <robin.getz@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>
OpenPOWER on IntegriCloud