op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	[PATCH] ext3 sequential read regression fix	Suparna Bhattacharya	2006-09-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ext3-get-blocks support caused ~20% degrade in Sequential read performance (tiobench). Problem is with marking the buffer boundary so IO can be submitted right away. Here is the patch to fix it. 2.6.18-rc6: ----------- # ./iotest 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 75.2726 seconds, 57.1 MB/s real 1m15.285s user 0m0.276s sys 0m3.884s 2.6.18-rc6 + fix: ----------------- [root@elm3a241 ~]# ./iotest 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 62.9356 seconds, 68.2 MB/s The boundary block check in ext3_get_blocks_handle needs to be adjusted against the count of blocks mapped in this call, now that it can map more than one block. Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com> Tested-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] knfsd: Make ext3 reject filehandles referring to invalid inode number	NeilBrown	2006-09-16	1	-0/+42
\| \| \| \| \| \| \| \| \| \| \|	Inodes earlier than the 'first' inode (e.g. journal, resize) should be rejected early - except the root inode. Also inode numbers that are too big should be rejected early. [akpm@osdl.org: cleanup] Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3_getblk() should handle HOLE correctly	Badari Pulavarty	2006-09-08	1	-4/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It has been reported that ext3_getblk() is not doing the right thing and triggering following WARN(): BUG: warning at fs/ext3/inode.c:1016/ext3_getblk() <c01c5140> ext3_getblk+0x98/0x2a6 <c03b2806> md_wakeup_thread+0x26/0x2a <c01c536d> ext3_bread+0x1f/0x88 <c01cedf9> ext3_quota_read+0x136/0x1ae <c018b683> v1_read_dqblk+0x61/0xac <c0188f32> dquot_acquire+0xf6/0x107 <c01ceaba> ext3_acquire_dquot+0x46/0x68 <c01897d4> dqget+0x155/0x1e7 <c018a97b> dquot_transfer+0x3e0/0x3e9 <c016fe52> dput+0x23/0x13e <c01c7986> ext3_setattr+0xc3/0x240 <c0120f66> current_fs_time+0x52/0x6a <c017320e> notify_change+0x2bd/0x30d <c0159246> chown_common+0x9c/0xc5 <c02a222c> strncpy_from_user+0x3b/0x68 <c0167fe6> do_path_lookup+0xdf/0x266 <c016841b> __user_walk_fd+0x44/0x5a <c01592b9> sys_chown+0x4a/0x55 <c015a43c> vfs_write+0xe7/0x13c <c01695d4> sys_mkdir+0x1f/0x23 <c0102a97> syscall_call+0x7/0xb Looking at the code, it looks like it's not handle HOLE correctly. It ends up returning -EIO. Here is the patch to fix it. If we really want to be paranoid, we can allow return values 0 (HOLE), 1 (we asked for one block) and return -EIO for more than 1 block. But I really don't see a reason for doing it - all we need is the block# here. (doesn't matter how many blocks are mapped). ext3_get_blocks_handle() returns number of blocks it mapped. It returns 0 in case of HOLE. ext3_getblk() should handle HOLE properly (currently its dumping warning stack and returning -EIO). Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3 filesystem bogus ENOSPC with reservation fix	Mingming Cao	2006-08-27	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	To handle the earlier bogus ENOSPC error caused by filesystem full of block reservation, current code falls back to non block reservation, starts to allocate block(s) from the goal allocation block group as if there is no block reservation. Current code needs to re-load the corresponding block group descriptor for the initial goal block group in this case. The patch fixes this. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3 -nobh option causes oops	Badari Pulavarty	2006-07-31	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \|	For files other than IFREG, nobh option doesn't make sense. Modifications to them are journalled and needs buffer heads to do that. Without this patch, we get kernel oops in page_buffers(). Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: avoid triggering ext3_error on bad NFS file handle	Neil Brown	2006-07-31	2	-8/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The inode number out of an NFS file handle gets passed eventually to ext3_get_inode_block() without any checking. If ext3_get_inode_block() allows it to trigger an error, then bad filehandles can have unpleasant effect - ext3_error() will usually cause a forced read-only remount, or a panic if `errors=panic' was used. So remove the call to ext3_error there and put a matching check in ext3/namei.c where inode numbers are read off storage. [akpm@osdl.org: fix off-by-one error] Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: <stable@kernel.org> Cc: "Stephen C. Tweedie" <sct@redhat.com> Cc: Eric Sandeen <esandeen@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] Remove leftover ext3 acl declarations	Andreas Gruenbacher	2006-07-10	1	-3/+0
\| \| \| \| \| \| \| \|	These functions no longer exist; remove their declarations. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] lockdep: annotate the quota code	Arjan van de Ven	2006-07-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The quota code plays interesting games with the lock ordering; to quote Jan: \| i_mutex of inode containing quota file is acquired after all other \| quota locks. i_mutex of all other inodes is acquired before quota \| locks. Quota code makes sure (by resetting inode operations and \| setting special flag on inode) that noone tries to enter quota code \| while holding i_mutex on a quota file... The good news is that all of this special case i_mutex grabbing happens in the (per filesystem) low level quota write function. For this special case we need a new I_MUTEX_* nesting level, since this just entirely outside any of the regular VFS locking rules for i_mutex. I trust Jan on his blue eyes that this is not ever going to deadlock; and based on that the patch below is what it takes to inform lockdep of these very interesting new locking rules. The new locking rule for the I_MUTEX_QUOTA nesting level is that this is the deepest possible level of nesting for i_mutex, and that this only should be used in quota write (and possibly read) function of filesystems. This makes the lock ordering of the I_MUTEX_* levels: I_MUTEX_PARENT -> I_MUTEX_CHILD -> I_MUTEX_NORMAL -> I_MUTEX_QUOTA Has no effect on non-lockdep kernels. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	Remove obsolete #include <linux/config.h>	Jörn Engel	2006-06-30	4	-4/+0
\| \| \| \| \|	Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
*	[PATCH] mark address_space_operations const	Christoph Hellwig	2006-06-28	1	-3/+3
\| \| \| \| \| \| \| \| \| \|	Same as with already do with the file operations: keep them in .rodata and prevents people from doing runtime patching. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: Add "-o bh" option	Badari Pulavarty	2006-06-26	1	-1/+5
\| \| \| \| \| \| \| \| \|	This patch adds "-o bh" option to force use of buffer_heads. This option is needed when we make "nobh" as default - and if we run into problems. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: cleanup dead code in ext3_add_entry()	Johann Lombardi	2006-06-25	1	-3/+1
\| \| \| \| \| \| \| \| \|	The variables nlen and rlen are defined/initialized but not used in ext3_add_entry(). Signed-off-by: Johann Lombardi <johann.lombardi@bull.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3_fsblk_t: the rest of in-kernel filesystem blocks conversion	Mingming Cao	2006-06-25	6	-82/+82
\| \| \| \| \| \| \| \| \| \|	Convert the ext3 in-kernel filesystem blocks to ext3_fsblk_t. Convert the rest of all unsigned long type in-kernel filesystem blocks to ext3_fsblk_t, and replace the printk format string respondingly. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3_fsblk_t: filesystem, group blocks and bug fixes	Mingming Cao	2006-06-25	6	-141/+158
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some of the in-kernel ext3 block variable type are treated as signed 4 bytes int type, thus limited ext3 filesystem to 8TB (4kblock size based). While trying to fix them, it seems quite confusing in the ext3 code where some blocks are filesystem-wide blocks, some are group relative offsets that need to be signed value (as -1 has special meaning). So it seem saner to define two types of physical blocks: one is filesystem wide blocks, another is group-relative blocks. The following patches clarify these two types of blocks in the ext3 code, and fix the type bugs which limit current 32 bit ext3 filesystem limit to 8TB. With this series of patches and the percpu counter data type changes in the mm tree, we are able to extend exts filesystem limit to 16TB. This work is also a pre-request for the recent >32 bit ext3 work, and makes the kernel to able to address 48 bit ext3 block a lot easier: Simply redefine ext3_fsblk_t from unsigned long to sector_t and redefine the format string for ext3 filesystem block corresponding. Two RFC with a series patches have been posted to ext2-devel list and have been reviewed and discussed: http://marc.theaimsgroup.com/?l=ext2-devel&m=114722190816690&w=2 http://marc.theaimsgroup.com/?l=ext2-devel&m=114784919525942&w=2 Patches are tested on both 32 bit machine and 64 bit machine, <8TB ext3 and >8TB ext3 filesystem(with the latest to be released e2fsprogs-1.39). Tests includes overnight fsx, tiobench, dbench and fsstress. This patch: Defines ext3_fsblk_t and ext3_grpblk_t, and the printk format string for filesystem wide blocks. This patch classifies all block group relative blocks, and ext3_fsblk_t blocks occurs in the same function where used to be confusing before. Also include kernel bug fixes for filesystem wide in-kernel block variables. There are some fileystem wide blocks are treated as int/unsigned int type in the kernel currently, especially in ext3 block allocation and reservation code. This patch fixed those bugs by converting those variables to ext3_fsblk_t(unsigned long) type. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: remove inconsistent space before exclamation point in mount code	Theodore Ts'o	2006-06-25	1	-1/+1
\| \| \| \| \| \| \| \|	This was reported as Debian bug #336604. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] Avoid disk sector_t overflow for >2TB ext3 filesystem	Mingming Cao	2006-06-25	2	-0/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If ext3 filesystem is larger than 2TB, and sector_t is a u32 (i.e. CONFIG_LBD not defined in the kernel), the calculation of the disk sector will overflow. Add check at ext3_fill_super() and ext3_group_extend() to prevent mount/remount/resize >2TB ext3 filesystem if sector_t size is 4 bytes. Verified this patch on a 32 bit platform without CONFIG_LBD defined (sector_t is 32 bits long), mount refuse to mount a 10TB ext3. Signed-off-by: Mingming Cao<cmm@us.ibm.com> Acked-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] percpu counter data type changes to suppport more than 2**31 ext3 ↵	Mingming Cao	2006-06-23	1	-17/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	free blocks counter The percpu counter data type are changed in this set of patches to support more users like ext3 who need more than 32 bit to store the free blocks total in the filesystem. - Generic perpcu counters data type changes. The size of the global counter and local counter were explictly specified using s64 and s32. The global counter is changed from long to s64, while the local counter is changed from long to s32, so we could avoid doing 64 bit update in most cases. - Users of the percpu counters are updated to make use of the new percpu_counter_init() routine now taking an additional parameter to allow users to pass the initial value of the global counter. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3_clear_inode(): avoid kfree(NULL)	Andrew Morton	2006-06-23	1	-11/+12
\| \| \| \| \| \| \| \| \| \|	Steven Rostedt <rostedt@goodmis.org> points out that `rsv' here is usually NULL, so we should avoid calling kfree(). Also, fix up some nearby whitespace damage. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] VFS: Permit filesystem to perform statfs with a known root dentry	David Howells	2006-06-23	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Give the statfs superblock operation a dentry pointer rather than a superblock pointer. This complements the get_sb() patch. That reduced the significance of sb->s_root, allowing NFS to place a fake root there. However, NFS does require a dentry to use as a target for the statfs operation. This permits the root in the vfsmount to be used instead. linux/mount.h has been added where necessary to make allyesconfig build successfully. Interest has also been expressed for use with the FUSE and XFS filesystems. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Nathan Scott <nathans@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] VFS: Permit filesystem to override root dentry on mount	David Howells	2006-06-23	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Extend the get_sb() filesystem operation to take an extra argument that permits the VFS to pass in the target vfsmount that defines the mountpoint. The filesystem is then required to manually set the superblock and root dentry pointers. For most filesystems, this should be done with simple_set_mnt() which will set the superblock pointer and then set the root dentry to the superblock's s_root (as per the old default behaviour). The get_sb() op now returns an integer as there's now no need to return the superblock pointer. This patch permits a superblock to be implicitly shared amongst several mount points, such as can be done with NFS to avoid potential inode aliasing. In such a case, simple_set_mnt() would not be called, and instead the mnt_root and mnt_sb would be set directly. The patch also makes the following changes: () the get_sb_() convenience functions in the core kernel now take a vfsmount pointer argument and return an integer, so most filesystems have to change very little. () If one of the convenience function is not used, then get_sb() should normally call simple_set_mnt() to instantiate the vfsmount. This will always return 0, and so can be tail-called from get_sb(). () generic_shutdown_super() now calls shrink_dcache_sb() to clean up the dcache upon superblock destruction rather than shrink_dcache_anon(). This is required because the superblock may now have multiple trees that aren't actually bound to s_root, but that still need to be cleaned up. The currently called functions assume that the whole tree is rooted at s_root, and that anonymous dentries are not the roots of trees which results in dentries being left unculled. However, with the way NFS superblock sharing are currently set to be implemented, these assumptions are violated: the root of the filesystem is simply a dummy dentry and inode (the real inode for '/' may well be inaccessible), and all the vfsmounts are rooted on anonymous[] dentries with child trees. [] Anonymous until discovered from another tree. () The documentation has been adjusted, including the additional bit of changing ext2_ into foo_* in the documentation. [akpm@osdl.org: convert ipath_fs, do other stuff] Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Nathan Scott <nathans@sgi.com> Cc: Roland Dreier <rolandd@cisco.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	Merge git://git.infradead.org/~dwmw2/rbtree-2.6	Linus Torvalds	2006-06-20	1	-1/+1
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* git://git.infradead.org/~dwmw2/rbtree-2.6: [RBTREE] Switch rb_colour() et al to en_US spelling of 'color' for consistency Update UML kernel/physmem.c to use rb_parent() accessor macro [RBTREE] Update hrtimers to use rb_parent() accessor macro. [RBTREE] Add explicit alignment to sizeof(long) for struct rb_node. [RBTREE] Merge colour and parent fields of struct rb_node. [RBTREE] Remove dead code in rb_erase() [RBTREE] Update JFFS2 to use rb_parent() accessor macro. [RBTREE] Update eventpoll.c to use rb_parent() accessor macro. [RBTREE] Update key.c to use rb_parent() accessor macro. [RBTREE] Update ext3 to use rb_parent() accessor macro. [RBTREE] Change rbtree off-tree marking in I/O schedulers. [RBTREE] Add accessor macros for colour and parent fields of rb_node
\| *	[RBTREE] Update ext3 to use rb_parent() accessor macro.	David Woodhouse	2006-04-21	1	-1/+1
\| \| \| \| \| \| \| \|	Signed-off-by: David Woodhouse <dwmw2@infradead.org>
* \|	[PATCH] ext3 resize: fix double unlock_super()	Andrew Morton	2006-05-31	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	From: Andrew Morton <akpm@osdl.org> Spotted by Jan Capek <jca@sysgo.com> Cc: "Stephen C. Tweedie" <sct@redhat.com> Cc: Andreas Dilger <adilger@clusterfs.com> Cc: Jan Capek <jca@sysgo.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* \|	[PATCH] ext3: multile block allocate little endian fixes	Mingming Cao	2006-05-03	1	-5/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some places in ext3 multiple block allocation code (in 2.6.17-rc3) don't handle the little endian well. This was resulting in wrong block numbers being assigned to in-memory block variables and then stored on disk eventually. The following patch has been verified to fix an ext3 filesystem failure when run ltp test on a 64 bit machine. Signed-off-by; Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* \|	[PATCH] protect ext3 ioctl modifying append_only, immutable, etc. with i_mutex	Al Viro	2006-04-26	1	-4/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	All modifications of ->i_flags in inodes that might be visible to somebody else must be under ->i_mutex. That patch fixes ext3 ioctl() setting S_APPEND and friends. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* \|	[PATCH] forgotten ->b_data in memcpy() call in ext3/resize.c (oopsable)	Al Viro	2006-04-26	1	-1/+1
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \|	sbi->s_group_desc is an array of pointers to buffer_head. memcpy() of buffer size from address of buffer_head is a bad idea - it will generate junk in any case, may oops if buffer_head is close to the end of slab page and next page is not mapped and isn't what was intended there. IOW, ->b_data is missing in that call. Fortunately, result doesn't go into the primary on-disk data structures, so only backup ones get crap written to them; that had allowed this bug to remain unnoticed until now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: Fix missed mutex unlock	Ananiev, Leonid I	2006-04-17	1	-0/+1
\| \| \| \| \| \| \| \|	Missed unlock_super()call is added in error condition code path. Signed-off-by: Leonid Ananiev <leonid.i.ananiev@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
*	[PATCH] ext3: Fix missed mutex unlock	Ananiev, Leonid I	2006-04-11	1	-0/+1
\| \| \| \| \| \| \| \| \|	Missed unlock_super()call is added in error condition code path. Signed-off-by: Leonid Ananiev <leonid.i.ananiev@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] Introduce sys_splice() system call	Jens Axboe	2006-03-30	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds support for the sys_splice system call. Using a pipe as a transport, it can connect to files or sockets (latter as output only). From the splice.c comments: "splice": joining two ropes together by interweaving their strands. This is the "extended pipe" functionality, where a pipe is used as an arbitrary in-memory buffer. Think of a pipe as a small kernel buffer that you can use to transfer data from one end to the other. The traditional unix read/write is extended with a "splice()" operation that transfers data buffers to or from a pipe buffer. Named by Larry McVoy, original implementation from Linus, extended by Jens to support splicing to files and fixing the initial implementation bugs. Signed-off-by: Jens Axboe <axboe@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] Make most file operations structs in fs/ const	Arjan van de Ven	2006-03-28	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a conversion to make the various file_operations structs in fs/ const. Basically a regexp job, with a few manual fixups The goal is both to increase correctness (harder to accidentally write to shared datastructures) and reducing the false sharing of cachelines with things that get dirty in .data (while .rodata is nicely read only and thus cache clean) Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: "nobh" writeback support for filesystems blocksize < pagesize	Badari Pulavarty	2006-03-26	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \| \|	There is no valid reason why we can't support "nobh" option for filesystems with blocksize != PAGESIZE. This patch lets them use "nobh" option for writeback mode for blocksize < pagesize. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: multi-block get_block()	Badari Pulavarty	2006-03-26	1	-12/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	Mingming Cao recently added multi-block allocation support for ext3, currently used only by DIO. I added support to map multiple blocks for mpage_readpages(). This patch add support for ext3_get_block() to deal with multi-block mapping. Basically it renames ext3_direct_io_get_blocks() as ext3_get_block(). Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: cleanups and WARN_ON()	Andrew Morton	2006-03-26	1	-125/+114
\| \| \| \| \| \| \| \| \| \| \| \| \|	- Clean up a few little layout things and comments. - Add a WARN_ON to a case which I was wondering about. - Tune up some inlines. Cc: Mingming Cao <cmm@us.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] remove ->get_blocks() support	Badari Pulavarty	2006-03-26	1	-10/+2
\| \| \| \| \| \| \| \| \| \|	Now that get_block() can handle mapping multiple disk blocks, no need to have ->get_blocks(). This patch removes fs specific ->get_blocks() added for DIO and makes it users use get_block() instead. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3_get_blocks: Adjust reservation window size for mblocks	Mingming Cao	2006-03-26	1	-1/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Optimize the block reservation and the multiple block allocation: with the knowledge of the total number of blocks ahead, set or adjust the reservation window size properly (based on the number of blocks needed) before block allocation happens: if there isn't any reservation yet, make sure the reservation window equals to or greater than the number of blocks needed, before create an reservation window; if a reservation window is already exists, try to extends the window size to match the number of blocks to allocate. This could increase the possibility of completing multiple blocks allocation in a single request, as blocks are only allocated in the range of the inode's reservation window. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3_get_blocks: Adjust accounting info in ext3_new_blocks()	Mingming Cao	2006-03-26	1	-12/+19
\| \| \| \| \| \| \| \| \|	Update accounting information (quota, boundary checks, free blocks number etc) in ext3_new_blocks(). Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3_get_blocks: support multiple blocks allocation in ext3_new_block()	Mingming Cao	2006-03-26	1	-10/+36
\| \| \| \| \| \| \| \| \| \| \| \|	Change ext3_try_to_allocate() (called via ext3_new_blocks()) to try to allocate the requested number of blocks on a best effort basis: After allocated the first block, it will always attempt to allocate the next few(up to the requested size and not beyond the reservation window) adjacent blocks at the same time. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3_get_blocks: multiple block allocation	Mingming Cao	2006-03-26	1	-77/+193
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for multiple block allocation in ext3-get-blocks(). Look up the disk block mapping and count the total number of blocks to allocate, then pass it to ext3_new_block(), where the real block allocation is performed. Once multiple blocks are allocated, prepare the branch with those just allocated blocks info and finally splice the whole branch into the block mapping tree. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3_get_blocks: Mapping multiple blocks at a once	Mingming Cao	2006-03-26	2	-31/+79
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently ext3_get_block() only maps or allocates one block at a time. This is quite inefficient for sequential IO workload. I have posted a early implements a simply multiple block map and allocation with current ext3. The basic idea is allocating the 1st block in the existing way, and attempting to allocate the next adjacent blocks on a best effort basis. More description about the implementation could be found here: http://marc.theaimsgroup.com/?l=ext2-devel&m=112162230003522&w=2 The following the latest version of the patch: break the original patch into 5 patches, re-worked some logicals, and fixed some bugs. The break ups are: [patch 1] Adding map multiple blocks at a time in ext3_get_blocks() [patch 2] Extend ext3_get_blocks() to support multiple block allocation [patch 3] Implement multiple block allocation in ext3-try-to-allocate (called via ext3_new_block()). [patch 4] Proper accounting updates in ext3_new_blocks() [patch 5] Adjust reservation window size properly (by the given number of blocks to allocate) before block allocation to increase the possibility of allocating multiple blocks in a single call. Tests done so far includes fsx,tiobench and dbench. The following numbers collected from Direct IO tests (1G file creation/read) shows the system time have been greatly reduced (more than 50% on my 8 cpu system) with the patches. 1G file DIO write: 2.6.15 2.6.15+patches real 0m31.275s 0m31.161s user 0m0.000s 0m0.000s sys 0m3.384s 0m0.564s 1G file DIO read: 2.6.15 2.6.15+patches real 0m30.733s 0m30.624s user 0m0.000s 0m0.004s sys 0m0.748s 0m0.380s Some previous test we did on buffered IO with using multiple blocks allocation and delayed allocation shows noticeable improvement on throughput and system time. This patch: Add support of mapping multiple blocks in one call. This is useful for DIO reads and re-writes (where blocks are already allocated), also is in line with Christoph's proposal of using getblocks() in mpage_readpage() or mpage_readpages(). Signed-off-by: Mingming Cao <cmm@us.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] Make address_space_operations->invalidatepage return void	NeilBrown	2006-03-26	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The return value of this function is never used, so let's be honest and declare it as void. Some places where invalidatepage returned 0, I have inserted comments suggesting a BUG_ON. [akpm@osdl.org: JBD BUG fix] [akpm@osdl.org: rework for git-nfs] [akpm@osdl.org: don't go BUG in block_invalidate_page()] Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: Fix debug logging-only compilation error	Kirk True	2006-03-25	1	-3/+3
\| \| \| \| \| \| \| \| \|	When EXT3FS_DEBUG is #define-d, the compile breaks due to #include file issues. Signed-off-by: Kirk True <kernel@kirkandsheila.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: Properly report backup block present in a group	Glauber de Oliveira Costa	2006-03-24	1	-7/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In filesystems with the meta block group flag on, ext3_bg_num_gdb() fails to report the correct number of blocks used to store the group descriptor backups in a given group. It happens because meta_bg follows a different logic from the original ext3 backup placement in groups multiples of 3, 5 and 7. Signed-off-by: Glauber de Oliveira Costa <glommer@br.ibm.com> Cc: Andreas Dilger <adilger@clusterfs.com> Cc: "Stephen C. Tweedie" <sct@redhat.com> Cc: Alex Tomas <alex@clusterfs.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] fast ext3_statfs	Alex Tomas	2006-03-24	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \|	Under I/O load it may take up to a dozen seconds to read all group descriptors. This is what ext3_statfs() does. At the same time, we already maintain global numbers of free inodes/blocks. Why don't we use them instead of group reading and summing? Cc: Ravikiran G Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] cpuset memory spread: slab cache format	Paul Jackson	2006-03-24	1	-1/+2
\| \| \| \| \| \| \| \| \| \|	Rewrap the overly long source code lines resulting from the previous patch's addition of the slab cache flag SLAB_MEM_SPREAD. This patch contains only formatting changes, and no function change. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] cpuset memory spread: slab cache filesystems	Paul Jackson	2006-03-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Mark file system inode and similar slab caches subject to SLAB_MEM_SPREAD memory spreading. If a slab cache is marked SLAB_MEM_SPREAD, then anytime that a task that's in a cpuset with the 'memory_spread_slab' option enabled goes to allocate from such a slab cache, the allocations are spread evenly over all the memory nodes (task->mems_allowed) allowed to that task, instead of favoring allocation on the node local to the current cpu. The following inode and similar caches are marked SLAB_MEM_SPREAD: file cache ==== ===== fs/adfs/super.c adfs_inode_cache fs/affs/super.c affs_inode_cache fs/befs/linuxvfs.c befs_inode_cache fs/bfs/inode.c bfs_inode_cache fs/block_dev.c bdev_cache fs/cifs/cifsfs.c cifs_inode_cache fs/coda/inode.c coda_inode_cache fs/dquot.c dquot fs/efs/super.c efs_inode_cache fs/ext2/super.c ext2_inode_cache fs/ext2/xattr.c (fs/mbcache.c) ext2_xattr fs/ext3/super.c ext3_inode_cache fs/ext3/xattr.c (fs/mbcache.c) ext3_xattr fs/fat/cache.c fat_cache fs/fat/inode.c fat_inode_cache fs/freevxfs/vxfs_super.c vxfs_inode fs/hpfs/super.c hpfs_inode_cache fs/isofs/inode.c isofs_inode_cache fs/jffs/inode-v23.c jffs_fm fs/jffs2/super.c jffs2_i fs/jfs/super.c jfs_ip fs/minix/inode.c minix_inode_cache fs/ncpfs/inode.c ncp_inode_cache fs/nfs/direct.c nfs_direct_cache fs/nfs/inode.c nfs_inode_cache fs/ntfs/super.c ntfs_big_inode_cache_name fs/ntfs/super.c ntfs_inode_cache fs/ocfs2/dlm/dlmfs.c dlmfs_inode_cache fs/ocfs2/super.c ocfs2_inode_cache fs/proc/inode.c proc_inode_cache fs/qnx4/inode.c qnx4_inode_cache fs/reiserfs/super.c reiser_inode_cache fs/romfs/inode.c romfs_inode_cache fs/smbfs/inode.c smb_inode_cache fs/sysv/inode.c sysv_inode_cache fs/udf/super.c udf_inode_cache fs/ufs/super.c ufs_inode_cache net/socket.c sock_inode_cache net/sunrpc/rpc_pipe.c rpc_inode_cache The choice of which slab caches to so mark was quite simple. I marked those already marked SLAB_RECLAIM_ACCOUNT, except for fs/xfs, dentry_cache, inode_cache, and buffer_head, which were marked in a previous patch. Even though SLAB_RECLAIM_ACCOUNT is for a different purpose, it marks the same potentially large file system i/o related slab caches as we need for memory spreading. Given that the rule now becomes "wherever you would have used a SLAB_RECLAIM_ACCOUNT slab cache flag before (usually the inode cache), use the SLAB_MEM_SPREAD flag too", this should be easy enough to maintain. Future file system writers will just copy one of the existing file system slab cache setups and tend to get it right without thinking. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] convert ext3's truncate_sem to a mutex	Arjan van de Ven	2006-03-23	4	-12/+12
\| \| \| \| \| \| \| \| \|	ext3's truncate_sem is always released in the same function it's taken and it otherwise is a mutex as well.. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] sem2mutex: quota	Ingo Molnar	2006-03-23	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	Semaphore to mutex conversion. The conversion was generated via scripts, and the result was validated automatically via a script as well. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3_readdir: use generic readahead	Andrew Morton	2006-03-23	2	-29/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Linus points out that ext3_readdir's readahead only cuts in when ext3_readdir() is operating at the very start of the directory. So for large directories we end up performing no readahead at all and we suck. So take it all out and use the core VM's page_cache_readahead(). This means that ext3 directory reads will use all of readahead's dynamic sizing goop. Note that we're using the directory's filp->f_ra to hold the readahead state, but readahead is actually being performed against the underlying blockdev's address_space. Fortunately the readahead code is all set up to handle this. Tested with printk. It works. I was struggling to find a real workload which actually cared. (The patch also exports page_cache_readahead() to GPL modules) Cc: "Stephen C. Tweedie" <sct@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: fix nobh mode for chattr +j inodes	Badari Pulavarty	2006-03-11	1	-9/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	One can do "chattr +j" on a file to change its journalling mode. Fix writeback mode with "nobh" handling for it. Even though, we mount ext3 filesystem in writeback mode with "nobh" option, some one can do "chattr +j" on a single file to force it to do journalled mode. In order to do journaling, ext3_block_truncate_page() need to fallback to default case of creating buffers and adding them to transaction etc. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: ext3_symlink should use GFP_NOFS allocations inside	Kirill Korotaev	2006-03-11	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes illegal __GFP_FS allocation inside ext3 transaction in ext3_symlink(). Such allocation may re-enter ext3 code from try_to_free_pages. But JBD/ext3 code keeps a pointer to current journal handle in task_struct and, hence, is not reentrable. This bug led to "Assertion failure in journal_dirty_metadata()" messages. http://bugzilla.openvz.org/show_bug.cgi?id=115 Signed-off-by: Andrey Savochkin <saw@saw.sw.com.sg> Signed-off-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>