op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	kdump: add is_vmcore_usable() and vmcore_unusable()	Simon Horman	2008-10-20	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The usage of elfcorehdr_addr has changed recently such that being set to ELFCORE_ADDR_MAX is used by is_kdump_kernel() to indicate if the code is executing in a kernel executed as a crash kernel. However, arch/ia64/kernel/setup.c:reserve_elfcorehdr will rest elfcorehdr_addr to ELFCORE_ADDR_MAX on error, which means any subsequent calls to is_kdump_kernel() will return 0, even though they should return 1. Ok, at this point in time there are no subsequent calls, but I think its fair to say that there is ample scope for error or at the very least confusion. This patch add an extra state, ELFCORE_ADDR_ERR, which indicates that elfcorehdr_addr was passed on the command line, and thus execution is taking place in a crashdump kernel, but vmcore can't be used for some reason. This is tested for using is_vmcore_usable() and set using vmcore_unusable(). A subsequent patch makes use of this new code. To summarise, the states that elfcorehdr_addr can now be in are as follows: ELFCORE_ADDR_MAX: not a crashdump kernel ELFCORE_ADDR_ERR: crashdump kernel but vmcore is unusable any other value: crash dump kernel and vmcore is usable Signed-off-by: Simon Horman <horms@verge.net.au> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	kdump: make elfcorehdr_addr independent of CONFIG_PROC_VMCORE	Vivek Goyal	2008-10-20	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	o elfcorehdr_addr is used by not only the code under CONFIG_PROC_VMCORE but also by the code which is not inside CONFIG_PROC_VMCORE. For example, is_kdump_kernel() is used by powerpc code to determine if kernel is booting after a panic then use previous kernel's TCE table. So even if CONFIG_PROC_VMCORE is not set in second kernel, one should be able to correctly determine that we are booting after a panic and setup calgary iommu accordingly. o So remove the assumption that elfcorehdr_addr is under CONFIG_PROC_VMCORE. o Move definition of elfcorehdr_addr to arch dependent crash files. (Unfortunately crash dump does not have an arch independent file otherwise that would have been the best place). o kexec.c is not the right place as one can Have CRASH_DUMP enabled in second kernel without KEXEC being enabled. o I don't see sh setup code parsing the command line for elfcorehdr_addr. I am wondering how does vmcore interface work on sh. Anyway, I am atleast defining elfcoredhr_addr so that compilation is not broken on sh. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Simon Horman <horms@verge.net.au> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	add CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS	Roland McGrath	2008-10-20	1	-0/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds a kconfig option to change the /proc/PID/coredump_filter default. Fedora has been carrying a trivial patch to change the hard-wired value for this default, since Fedora 8. The default default can't change safely because there are old GDB versions out there (all before 6.7) that are confused by the core dump files created by the MMF_DUMP_ELF_HEADERS setting. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	coredump: format_corename: don't append .%pid if multi-threaded	Oleg Nesterov	2008-10-20	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the coredumping is multi-threaded, format_corename() appends .%pid to the corename. This was needed before the proper multi-thread core dump support, now all the threads in the mm go into a single unified core file. Remove this special case, it is not even documented and we have "%p" and core_uses_pid. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Roland McGrath <roland@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: La Monte Yarroll <piggy@laurelnetworks.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	seq_file: add seq_cpumask_list(), seq_nodemask_list()	Lai Jiangshan	2008-10-20	1	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \|	seq_cpumask_list(), seq_nodemask_list() are very like seq_cpumask(), seq_nodemask(), but they print human readable string. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	seq_file: don't call bitmap_scnprintf_len()	Lai Jiangshan	2008-10-20	1	-7/+8
\| \| \| \| \| \| \| \| \| \| \| \|	"m->count + len < m->size" is true commonly, so bitmap_scnprintf() is commonly called. this fix saves a call to bitmap_scnprintf_len(). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	hfsplus: fix possible deadlock when handling corrupted extents	Eric Sesterhenn	2008-10-20	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	A corrupted extent for the extent file itself may try to get an impossible extent, causing a deadlock if I see it correctly. Check the inode number after the first_blocks checks and fail if it's the extent file, as according to the spec the extent file should have no extent for itself. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	hfsplus: missing O_LARGEFILE check	Alan Cox	2008-10-20	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \|	hfsplus: O_LARGEFILE checking is missing Addresses http://bugzilla.kernel.org/show_bug.cgi?id=8490 From: Alan Cox <alan@redhat.com> Reported-by: didier <did447@gmail.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ext3: avoid printk floods in the face of directory corruption	Eric Sandeen	2008-10-20	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A very large directory with many read failures (either due to storage problems, or due to invalid size & blocks from corruption) will generate a printk storm as the filesystem continues to try to read all the blocks. This flood of messages can tie up the box until it is complete - which may be a very long time, especially for very large corrupted values. This is fixed by only reporting the corruption once each time we try to read the directory. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Eugene Teo <eugeneteo@kernel.sg> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ext3: truncate block allocated on a failed ext3_write_begin	Aneesh Kumar K.V	2008-10-20	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \|	For blocksize < pagesize we need to remove blocks that got allocated in block_write_begin() if we fail with ENOSPC for later blocks. block_write_begin() internally does this if it allocated page locally. This makes sure we don't have blocks outside inode.i_size during ENOSPC. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ext3: fix ext3_dx_readdir hash collision handling	Eugene Dashevsky	2008-10-20	1	-5/+15
\| \| \| \| \| \| \| \| \| \| \| \|	This fixes a bug where readdir() would return a directory entry twice if there was a hash collision in an hash tree indexed directory. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Eugene Dashevsky <eugene@ibrix.com> Signed-off-by: Mike Snitzer <msnitzer@ibrix.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	jbd: ordered data integrity fix	Hidehiro Kawai	2008-10-20	1	-3/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In ordered mode, if a file data buffer being dirtied exists in the committing transaction, we write the buffer to the disk, move it from the committing transaction to the running transaction, then dirty it. But we don't have to remove the buffer from the committing transaction when the buffer couldn't be written out, otherwise it would miss the error and the committing transaction would not abort. This patch adds an error check before removing the buffer from the committing transaction. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ext3: add an option to control error handling on file data	Hidehiro Kawai	2008-10-20	2	-0/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the journal doesn't abort when it gets an IO error in file data blocks, the file data corruption will spread silently. Because most of applications and commands do buffered writes without fsync(), they don't notice the IO error. It's scary for mission critical systems. On the other hand, if the journal aborts whenever it gets an IO error in file data blocks, the system will easily become inoperable. So this patch introduces a filesystem option to determine whether it aborts the journal or just call printk() when it gets an IO error in file data. If you mount a ext3 fs with data_err=abort option, it aborts on file data write error. If you mount it with data_err=ignore, it doesn't abort, just call printk(). data_err=ignore is the default. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ext3: fix ext3 block reservation early ENOSPC issue	Mingming Cao	2008-10-20	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We could run into ENOSPC error on ext3, even when there is free blocks on the filesystem. The problem is triggered in the case the goal block group has 0 free blocks , and the rest block groups are skipped due to the check of "free_blocks < windowsz/2". Current code could fall back to non reservation allocation to prevent early ENOSPC after examing all the block groups with reservation on , but this code was bypassed if the reservation window is turned off already, which is true in this case. This patch fixed two issues: 1) We don't need to turn off block reservation if the goal block group has 0 free blocks left and continue search for the rest of block groups. Current code the intention is to turn off the block reservation if the goal allocation group has a few (some) free blocks left (not enough for make the desired reservation window),to try to allocation in the goal block group, to get better locality. But if the goal blocks have 0 free blocks, it should leave the block reservation on, and continues search for the next block groups,rather than turn off block reservation completely. 2) we don't need to check the window size if the block reservation is off. The problem was originally found and fixed in ext4. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ext3: don't try to resize if there are no reserved gdt blocks left	Josef Bacik	2008-10-20	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When trying to resize a ext3 fs and you run out of reserved gdt blocks, you get an error that doesn't actually tell you what went wrong, it just says that the gdb it picked is not correct, which is the case since you don't have any reserved gdt blocks left. This patch adds a check to make sure you have reserved gdt blocks to use, and if not prints out a more relevant error. Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: <linux-ext4@vger.kernel.org> Cc: Andreas Dilger <adilger@sun.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	jbd: don't dirty original metadata buffer on abort	Hidehiro Kawai	2008-10-20	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, original metadata buffers are dirtied when they are unfiled whether the journal has aborted or not. Eventually these buffers will be written-back to the filesystem by pdflush. This means some metadata buffers are written to the filesystem without journaling if the journal aborts. So if both journal abort and system crash happen at the same time, the filesystem would become inconsistent state. Additionally, replaying journaled metadata can overwrite the latest metadata on the filesystem partly. Because, if the journal aborts, journaled metadata are preserved and replayed during the next mount not to lose uncheckpointed metadata. This would also break the consistency of the filesystem. This patch prevents original metadata buffers from being dirtied on abort by clearing BH_JBDDirty flag from those buffers. Thus, no metadata buffers are written to the filesystem without journaling. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	jbd: abort when failed to log metadata buffers	Hidehiro Kawai	2008-10-20	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we failed to write metadata buffers to the journal space and succeeded to write the commit record, stale data can be written back to the filesystem as metadata in the recovery phase. To avoid this, when we failed to write out metadata buffers, abort the journal before writing the commit record. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	coredump_filter: add hugepage dumping	KOSAKI Motohiro	2008-10-20	1	-2/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Presently hugepage's vma has a VM_RESERVED flag in order not to be swapped. But a VM_RESERVED vma isn't core dumped because this flag is often used for some kernel vmas (e.g. vmalloc, sound related). Thus hugepages are never dumped and it can't be debugged easily. Many developers want hugepages to be included into core-dump. However, We can't read generic VM_RESERVED area because this area is often IO mapping area. then these area reading may change device state. it is definitly undesiable side-effect. So adding a hugepage specific bit to the coredump filter is better. It will be able to hugepage core dumping and doesn't cause any side-effect to any i/o devices. In additional, libhugetlb use hugetlb private mapping pages as anonymous page. Then, hugepage private mapping pages should be core dumped by default. Then, /proc/[pid]/core_dump_filter has two new bits. - bit 5 mean hugetlb private mapping pages are dumped or not. (default: yes) - bit 6 mean hugetlb shared mapping pages are dumped or not. (default: no) I tested by following method. % ulimit -c unlimited % ./crash_hugepage 50 % ./crash_hugepage 50 -p % ls -lh % gdb ./crash_hugepage core % % echo 0x43 > /proc/self/coredump_filter % ./crash_hugepage 50 % ./crash_hugepage 50 -p % ls -lh % gdb ./crash_hugepage core #include <stdlib.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> #include <string.h> #include "hugetlbfs.h" int main(int argc, char** argv){ char* p; int ch; int mmap_flags = MAP_SHARED; int fd; int nr_pages; while((ch = getopt(argc, argv, "p")) != -1) { switch (ch) { case 'p': mmap_flags &= ~MAP_SHARED; mmap_flags \|= MAP_PRIVATE; break; default: /* nothing/ break; } } argc -= optind; argv += optind; if (argc == 0){ printf("need # of pages\n"); exit(1); } nr_pages = atoi(argv[0]); if (nr_pages < 2) { printf("nr_pages must >2\n"); exit(1); } fd = hugetlbfs_unlinked_fd(); p = mmap(NULL, nr_pages gethugepagesize(), PROT_READ\|PROT_WRITE, mmap_flags, fd, 0); sleep(2); (p + gethugepagesize()) = 1; / COW / sleep(2); / crash! / (int*)0 = 1; return 0; } Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: William Irwin <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs: buffer lock use lock bitops	Nick Piggin	2008-10-20	1	-2/+1
\| \| \| \| \| \| \| \| \|	trylock_buffer and unlock_buffer open and close a critical section. Hence, we can use the lock bitops to get the desired memory ordering. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	vmstat: mlocked pages statistics	Nick Piggin	2008-10-20	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add NR_MLOCK zone page state, which provides a (conservative) count of mlocked pages (actually, the number of mlocked pages moved off the LRU). Reworked by lts to fit in with the modified mlock page support in the Reclaim Scalability series. [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo] [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Ramfs and Ram Disk pages are unevictable	Lee Schermerhorn	2008-10-20	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Christoph Lameter pointed out that ram disk pages also clutter the LRU lists. When vmscan finds them dirty and tries to clean them, the ram disk writeback function just redirties the page so that it goes back onto the active list. Round and round she goes... With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no longer the case, as ram disk pages are no longer maintained on the lru. [This makes them unmigratable for defrag or memory hot remove, but that can be addressed by a separate patch series.] However, the ramfs pages behave like ram disk pages used to, so: Define new address_space flag [shares address_space flags member with mapping's gfp mask] to indicate that the address space contains all unevictable pages. This will provide for efficient testing of ramfs pages in page_evictable(). Also provide wrapper functions to set/test the unevictable state to minimize #ifdefs in ramfs driver and any other users of this facility. Set the unevictable state on address_space structures for new ramfs inodes. Test the unevictable state in page_evictable() to cull unevictable pages. These changes depend on [CONFIG_]UNEVICTABLE_LRU. [riel@redhat.com: undo the brd.c part] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Debugged-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Unevictable LRU Page Statistics	Lee Schermerhorn	2008-10-20	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Report unevictable pages per zone and system wide. Kosaki Motohiro added support for memory controller unevictable statistics. [riel@redhat.com: fix printk in show_free_areas()] [akpm@linux-foundation.org: fix units in /proc/vmstats] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	vmscan: split LRU lists into anon & file sets	Rik van Riel	2008-10-20	5	-39/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Split the LRU lists in two, one set for pages that are backed by real file systems ("file") and one for pages that are backed by memory and swap ("anon"). The latter includes tmpfs. The advantage of doing this is that the VM will not have to scan over lots of anonymous pages (which we generally do not want to swap out), just to find the page cache pages that it should evict. This patch has the infrastructure and a basic policy to balance how much we scan the anon lists and how much we scan the file lists. The big policy changes are in separate patches. [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset] [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru] [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page] [hugh@veritas.com: memcg swapbacked pages active] [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED] [akpm@linux-foundation.org: fix /proc/vmstat units] [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration] [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo] [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Merge branch 'for_linus' of ↵	Linus Torvalds	2008-10-17	11	-313/+280
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Remove automatic enabling of the HUGE_FILE feature flag ext4: Replace hackish ext4_mb_poll_new_transaction with commit callback ext4: Update Documentation/filesystems/ext4.txt ext4: Remove unused mount options: nomballoc, mballoc, nocheck ext4: Remove compile warnings when building w/o CONFIG_PROC_FS ext4: Add missing newlines to printk messages ext4: Fix file fragmentation during large file write. vfs: Add no_nrwrite_index_update writeback control flag vfs: Remove the range_cont writeback mode. ext4: Use tag dirty lookup during mpage_da_submit_io ext4: let the block device know when unused blocks can be discarded ext4: Don't reuse released data blocks until transaction commits ext4: Use an rbtree for tracking blocks freed during transaction. ext4: Do mballoc init before doing filesystem recovery ext4: Free ext4_prealloc_space using kmem_cache_free ext4: Fix Kconfig typo for ext4dev ext4: Remove an old reference to ext4dev in Makefile comment
\| *	ext4: Remove automatic enabling of the HUGE_FILE feature flag	Theodore Ts'o	2008-10-16	2	-88/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the HUGE_FILE feature flag is not set, don't allow the creation of large files, instead of automatically enabling the feature flag. Recent versions of mke2fs will set the HUGE_FILE flag automatically anyway for ext4 filesystems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
\| *	ext4: Replace hackish ext4_mb_poll_new_transaction with commit callback	Theodore Ts'o	2008-10-16	5	-75/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The multiblock allocator needs to be able to release blocks (and issue a blkdev discard request) when the transaction which freed those blocks is committed. Previously this was done via a polling mechanism when blocks are allocated or freed. A much better way of doing things is to create a jbd2 callback function and attaching the list of blocks to be freed directly to the transaction structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
\| *	ext4: Remove unused mount options: nomballoc, mballoc, nocheck	Theodore Ts'o	2008-10-17	2	-10/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	These mount options don't actually do anything any more, so remove them. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
\| *	ext4: Remove compile warnings when building w/o CONFIG_PROC_FS	Manish Katiyar	2008-10-17	1	-1/+6
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
\| *	ext4: Add missing newlines to printk messages	Eric Sesterhenn	2008-10-17	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are some newlines missing in ext4_check_descriptors, which cause the printk level to be printed out when the next printk call is made: [ 778.847265] EXT4-fs: ext4_check_descriptors: Block bitmap for group 0 not in group (block 1509949442)!<3>EXT4-fs: group descriptors corrupted! [ 802.646630] EXT4-fs: ext4_check_descriptors: Inode bitmap for group 0 not in group (block 9043971)!<3>EXT4-fs: group descriptors corrupted! Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
\| *	ext4: Fix file fragmentation during large file write.	Aneesh Kumar K.V	2008-10-16	1	-34/+57
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The range_cyclic writeback mode uses the address_space writeback_index as the start index for writeback. With delayed allocation we were updating writeback_index wrongly resulting in highly fragmented file. This patch reduces the number of extents reduced from 4000 to 27 for a 3GB file. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
\| *	ext4: Use tag dirty lookup during mpage_da_submit_io	Aneesh Kumar K.V	2008-10-14	1	-17/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This enables us to drop the range_cont writeback mode use from ext4_da_writepages. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
\| *	ext4: let the block device know when unused blocks can be discarded	Theodore Ts'o	2008-10-16	2	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Let the block device know when unused blocks can be discarded, using the new sb_issue_discard() interface. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
\| *	ext4: Don't reuse released data blocks until transaction commits	Aneesh Kumar K.V	2008-10-10	1	-2/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We need to make sure we don't reuse the data blocks released during the transaction untill the transaction commits. We force this mode only for ordered and journalled mode. Writeback mode already don't provided data consistency. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
\| *	ext4: Use an rbtree for tracking blocks freed during transaction.	Aneesh Kumar K.V	2008-10-16	2	-77/+133
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With this patch we track the block freed during a transaction using red-black tree. We also make sure contiguous blocks freed are collected in one node in the tree. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
\| *	ext4: Do mballoc init before doing filesystem recovery	Aneesh Kumar K.V	2008-10-10	1	-15/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	During filesystem recovery we may be doing a truncate which expects some of the mballoc data structures to be initialized. So do ext4_mb_init before recovery. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
\| *	ext4: Free ext4_prealloc_space using kmem_cache_free	Aneesh Kumar K.V	2008-10-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We should use kmem_cache_free to free memory allocated via kmem_cache_alloc Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
\| *	ext4: Fix Kconfig typo for ext4dev	Manish Katiyar	2008-10-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Looks like there is one more instance where ext4dev should be changed to ext4 because the module name will be "ext4" unless EXT4DEV_COMPAT is selected. Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
\| *	ext4: Remove an old reference to ext4dev in Makefile comment	Martin Michlmayr	2008-10-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove an old reference to ext4dev. Signed-off-by: Martin Michlmayr <tbm@cyrius.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* \|	block: fix current kernel-doc warnings	Randy Dunlap	2008-10-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix block kernel-doc warnings: Warning(linux-2.6.27-git4//fs/block_dev.c:1272): No description found for parameter 'path' Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'cpu' Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'part' Warning(/var/linsrc/linux-2.6.27-git4//block/genhd.c:544): No description found for parameter 'partno' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* \|	block: add partition attribute for partition number	Tejun Heo	2008-10-17	1	-0/+10
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With extended devt, finding out the partition number becomes a bit more challenging as subtracting the minor number from that of the parent device doesn't work anymore. The only thing left is parsing the partition name which is brittle and not exactly universal (some have '-' between the device name and partition number while others don't). This patch introduced partition attribute which contains the partition number of the device. This should make finding partitions and its index easier. This problem and solution were suggested by H. Peter Anvin. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6	Linus Torvalds	2008-10-16	14	-211/+321
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (53 commits) NFS: Fix a resolution problem with nfs_inode->cache_change_attribute NFS: Fix the resolution problem with nfs_inode_attrs_need_update() NFS: Changes to inode->i_nlinks must set the NFS_INO_INVALID_ATTR flag RPC/RDMA: ensure connection attempt is complete before signalling. RPC/RDMA: correct the reconnect timer backoff RPC/RDMA: optionally emit useful transport info upon connect/disconnect. RPC/RDMA: reformat a debug printk to keep lines together. RPC/RDMA: harden connection logic against missing/late rdma_cm upcalls. RPC/RDMA: fix connect/reconnect resource leak. RPC/RDMA: return a consistent error, when connect fails. RPC/RDMA: adhere to protocol for unpadded client trailing write chunks. RPC/RDMA: avoid an oops due to disconnect racing with async upcalls. RPC/RDMA: maintain the RPC task bytes-sent statistic. RPC/RDMA: suppress retransmit on RPC/RDMA clients. RPC/RDMA: fix connection IRD/ORD setting RPC/RDMA: support FRMR client memory registration. RPC/RDMA: check selected memory registration mode at runtime. RPC/RDMA: add data types and new FRMR memory registration enum. RPC/RDMA: refactor the inline memory registration code. NFS: fix nfs_parse_ip_address() corner case ...
\| *	Merge branch 'next'	Trond Myklebust	2008-10-15	14	-211/+321
\| \|\
\| \| *	NFS: Fix a resolution problem with nfs_inode->cache_change_attribute	Trond Myklebust	2008-10-14	2	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The cache_change_attribute is used to decide whether or not a directory has changed, in which case we may need to look it up again. Again, the use of 'jiffies' leads to an issue of resolution. Once again, the fix is to change nfs_inode->cache_change_attribute, and just make it a simple counter. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
\| \| *	NFS: Fix the resolution problem with nfs_inode_attrs_need_update()	Trond Myklebust	2008-10-14	2	-11/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It appears that 'jiffies' timestamps do not have high enough resolution for nfs_inode_attrs_need_update(). One problem is that a GETATTR can be launched within < 1 jiffy of the last operation that updated the attribute. Another problem is that RPC calls can take < 1 jiffy to execute. We can fix this by switching the variables to use a simple global counter that gets incremented every time we start another GETATTR call. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
\| \| *	NFS: Changes to inode->i_nlinks must set the NFS_INO_INVALID_ATTR flag	Trond Myklebust	2008-10-14	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \|	Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
\| \| *	NFS: fix nfs_parse_ip_address() corner case	Chuck Lever	2008-10-10	1	-11/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Bruce observed that nfs_parse_ip_address() will successfully parse an IPv6 address that looks like this: "::1%" A scope delimiter is present, but there is no scope ID following it. This is harmless, as it would simply set the scope ID to zero. However, in some cases we would like to flag this as an improperly formed address. We are now also careful to reject addresses where garbage follows the address (up to the length of the string), instead of ignoring the non-address characters; and where the scope ID is nonsense (not a valid device name, but also not numeric). Before, both of these cases would result in a harmless zero scope ID. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
\| \| *	NFS: Cleanup nfs_set_port	J. Bruce Fields	2008-10-10	1	-10/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
\| \| *	NFS: Fix attribute updates	Trond Myklebust	2008-10-09	1	-9/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This fixes a regression seen when running the Connectathon testsuite against an ext3 filesystem. The reason was that the inode was constantly being marked as 'just updated' by the jiffy wraparound test. This again meant that newer GETATTR calls were failing to pass the nfs_inode_attrs_need_update() test unless the changes caused a ctime update on the server, since they were perceived as having been started before the latest inode update. Given that nfs_inode_attrs_need_update() already checks for wraparound of nfsi->last_updated, we can drop the buggy "protection" in nfs_update_inode(). Also make a slight micro-optimisation of nfs_inode_attrs_need_update(): we are more often going to see time_after(fattr->time_start, nfsi->last_updated) be true, rather than seeing an update of ctime/size, so put that test first to ensure that we optimise away the ctime/size tests. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
\| \| *	NFS: Don't use range_cyclic for data integrity syncs	Trond Myklebust	2008-10-07	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is more efficient to write linearly starting from the beginning of the file. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
\| \| *	NFS: Client mounts hang when exported directory do not exist	Steve Dickson	2008-10-07	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes a regression that was introduced by the string based mounts. nfs_mount() statically returns -EACCES for every error returned by the remote mounted. This is incorrect because -EACCES is an non-fatal error to the mount.nfs command. This error causes mount.nfs to retry the mount even in the case when the exported directory does not exist. This patch maps the errors returned by the remote mountd into valid errno values, exactly how it was done pre-string based mounts. By returning the correct errno enables mount.nfs to do the right thing. Signed-off-by: Steve Dickson <steved@redhat.com> [Trond.Myklebust@netapp.com: nfs_stat_to_errno() now correctly returns negative errors, so remove the sign change.] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>