op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	ext4: Make non-journal fsync work properly	Frank Mayhar	2009-09-09	1	-14/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Teach ext4_write_inode() and ext4_do_update_inode() about non-journal mode: If we're not using a journal, ext4_write_inode() now calls ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc()) with a new "do_sync" parameter. If that parameter is nonzero _and_ we're not using a journal, ext4_do_update_inode() calls sync_dirty_buffer() instead of ext4_handle_dirty_metadata(). This problem was found in power-fail testing, checking the amount of loss of files and blocks after a power failure when using fsync() and when not using fsync(). It turned out that using fsync() was actually worse than not doing so, possibly because it increased the likelihood that the inodes would remain unflushed and would therefore be lost at the power failure. Signed-off-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: print more sysadmin-friendly message in check_block_validity()	Theodore Ts'o	2009-09-08	1	-8/+8
\| \| \| \| \| \| \| \| \| \|	Drop the WARN_ON(1), as the stack trace is not appropriate: it is triggered by file system corruption, and it misleads users into thinking there is a kernel bug. In addition, change the message displayed by ext4_error() to make it clear that this is a file system corruption problem. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Take page lock before looking at attached buffer_heads flags	Aneesh Kumar K.V	2009-09-09	1	-2/+11
\| \| \| \| \| \| \| \|	In order to check whether the buffer_heads are mapped we need to hold page lock. Otherwise a reclaim can cleanup the attached buffer_heads. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Add new tracepoint: trace_ext4_da_write_pages()	Theodore Ts'o	2009-08-31	1	-12/+1
\| \| \| \| \| \| \|	Add a new tracepoint which shows the pages that will be written using write_cache_pages() by ext4_da_writepages(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Restore wbc->range_start in ext4_da_writepages()	Theodore Ts'o	2009-08-31	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	To solve a lock inversion problem, we implement part of the range_cyclic algorithm in ext4_da_writepages(). (See commit 2acf2c26 for more details.) As part of that change wbc->range_start was modified by ext4's writepages function, which causes its callers to get confused since they aren't expecting the filesystem to modify it. The simplest fix is to save and restore wbc->range_start in ext4_da_writepages. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
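The save-and-restore fix described above can be sketched in userspace C. This is only an illustration: `demo_wbc`, `demo_write_chunk()` and `demo_writepages()` are hypothetical stand-ins, not the kernel's writeback_control or the actual ext4 code.

```c
#include <stdint.h>

/* Hypothetical stand-in for the caller-owned writeback control block. */
struct demo_wbc {
    int64_t range_start;   /* callers expect this to be unchanged on return */
    int64_t range_end;
};

/* Hypothetical helper: "writes" one chunk and advances range_start,
 * mimicking a writepages loop that walks the range itself. */
static int demo_write_chunk(struct demo_wbc *wbc, int64_t chunk)
{
    wbc->range_start += chunk;
    return 0;
}

int demo_writepages(struct demo_wbc *wbc)
{
    int64_t saved_start = wbc->range_start;   /* save the caller's value */
    int ret = 0;

    while (ret == 0 && wbc->range_start < wbc->range_end)
        ret = demo_write_chunk(wbc, 4096);

    wbc->range_start = saved_start;           /* restore before returning */
    return ret;
}
```

The caller sees the same range_start before and after the call, regardless of how far the loop advanced it internally.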
*	ext4: Fix possible deadlock between ext4_truncate() and ext4_get_blocks()	Jan Kara	2009-08-17	1	-4/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	During truncate we are sometimes forced to start a new transaction as the amount of blocks to be journaled is both quite large and hard to predict. So far we restarted a transaction while holding i_data_sem and that violates lock ordering because i_data_sem ranks below a transaction start (and it can lead to a real deadlock with ext4_get_blocks() mapping blocks in some page while having a transaction open). We fix the problem by dropping the i_data_sem before restarting the transaction and acquire it afterwards. It's slightly subtle that this works: 1) By the time ext4_truncate() is called, all the page cache for the truncated part of the file is dropped so get_block() should not be called on it (we only have to invalidate extent cache after we reacquire i_data_sem because some extent from not-truncated part could extend also into the part we are going to truncate). 2) Writes, migrate or defrag hold i_mutex so they are stopped for all the time of the truncate. This bug has been found and analyzed by Theodore Tso <tytso@mit.edu>. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: remove redundant test on unsigned	Roel Kluin	2009-08-10	1	-3/+1
\| \| \| \| \| \| \|	unsigned i_block cannot be less than 0. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
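A minimal sketch of why such a test is redundant, using a hypothetical range check (`demo_block_in_range()` is not a kernel function):

```c
#include <stdbool.h>

/* Hypothetical range check on an unsigned block index. */
bool demo_block_in_range(unsigned long i_block, unsigned long limit)
{
    /* A test such as `i_block < 0` is always false for an unsigned type,
     * so only the upper bound needs checking. */
    return i_block < limit;
}
```

A value that a caller intended as -1 arrives as the largest unsigned value, so the upper-bound check alone rejects it.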
*	ext4: More buffer head reference leaks	Curt Wohlgemuth	2009-07-17	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After the patch I posted last week regarding buffer head ref leaks in no-journal mode, I looked at all the code that uses buffer heads and searched for more potential leaks. The patch below fixes the issues I found; these can occur even when a journal is present. The change to inode.c fixes a double release if ext4_journal_get_create_access() fails. The changes to namei.c are more complicated. add_dirent_to_buf() will release the input buffer head EXCEPT when it returns -ENOSPC. There are some callers of this routine that don't always do the brelse() in the event that -ENOSPC is returned. Unfortunately, to put this fix into ext4_add_entry() required capturing the return value of make_indexed_dir() and add_dirent_to_buf(). Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Fix buffer head reference leak in no-journal mode	Curt Wohlgemuth	2009-07-13	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We found a problem with buffer head reference leaks when using an ext4 partition without a journal. In particular, calls to ext4_forget() would not do a brelse() on the input buffer head, which prevents the pages they belong to from being reclaimed. Further investigation showed that all places where ext4_journal_forget() and ext4_journal_revoke() are called are subject to the same problem. The patch below changes __ext4_journal_forget/__ext4_journal_revoke to do an explicit release of the buffer head when the journal handle isn't valid. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Move __ext4_journalled_writepage() to avoid forward declaration	Aneesh Kumar K.V	2009-06-14	1	-58/+54
\| \| \| \| \| \| \|	In addition, fix two unused variable warnings. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Fix mmap/truncate race when blocksize < pagesize && !nodellaoc	Aneesh Kumar K.V	2009-06-14	1	-177/+57
\| \| \| \| \| \| \| \| \| \|	This patch fixes the mmap/truncate race that was fixed for delayed allocation by merging ext4_{journalled,normal,da}_writepage() into ext4_writepage(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Fix mmap/truncate race when blocksize < pagesize && delayed allocation	Aneesh Kumar K.V	2009-06-14	1	-15/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is possible to see buffer_heads which are not mapped in the writepage callback in the following scenario (where the fs blocksize is 1k and the page size is 4k): 1) truncate(f, 1024) 2) mmap(f, 0, 4096) 3) a[0] = 'a' 4) truncate(f, 4096) 5) writepage(...) Now if we get a writepage callback immediately after (4) and before an attempt to write at any other offset via the mmap address (which implies we are yet to get a pagefault and do a get_block), what we would have is a dirty page with the first block allocated and the other three buffer_heads unmapped. In the above case the writepage should go ahead and try to write the first block and clear the page_dirty flag. Further attempts to write to the page will again create a fault and result in allocating blocks and marking the page dirty. If we don't write to any other offset via the mmap address we would still have written the first block to the disk and the rest of the space will be considered a hole. So to address this, we change all of the places where we look for delayed, unmapped, or unwritten buffer heads, and only check for delayed or unwritten buffer heads instead. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
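Steps 1 to 4 of the scenario above can be driven from userspace roughly as follows. This is a sketch under assumptions: the temporary path and the function name `demo_mmap_truncate_sequence()` are made up for illustration, and step 5 (the writepage callback) happens inside the kernel, so whether the race window is actually hit depends on the filesystem block size, page size and writeback timing.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Runs steps 1-4 on a temporary file; returns 0 on success, -1 on error. */
int demo_mmap_truncate_sequence(void)
{
    char path[] = "/tmp/mmap_trunc_XXXXXX";   /* hypothetical location */
    int fd = mkstemp(path);
    int rc = -1;

    if (fd < 0)
        return -1;
    unlink(path);                             /* file vanishes once closed */

    if (ftruncate(fd, 1024) == 0) {           /* 1) truncate(f, 1024) */
        char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);    /* 2) mmap(f, 0, 4096) */
        if (a != MAP_FAILED) {
            a[0] = 'a';                       /* 3) a[0] = 'a' */
            if (ftruncate(fd, 4096) == 0)     /* 4) truncate(f, 4096) */
                rc = 0;
            munmap(a, 4096);
        }
    }
    close(fd);
    return rc;
}
```

To exercise the case described in the commit message, the file would need to live on an ext4 filesystem whose block size is smaller than the page size.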
*	ext4: Don't look at buffer_heads outside i_size.	Aneesh Kumar K.V	2009-06-04	1	-12/+16
\| \| \| \| \| \| \| \| \| \| \|	Buffer heads outside i_size will be unmapped. So when we are doing "walk_page_buffers" limit ourselves to i_size. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Josef Bacik <jbacik@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Fix truncation of symlinks after failed write	Jan Kara	2009-07-13	1	-13/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Contents of long symlinks are written via standard write methods. So when the write fails, we add the inode to the orphan list. But symlinks don't have a .truncate method defined, so nobody properly removes them from the on-disk orphan list. Fix this by calling ext4_truncate() directly instead of calling vmtruncate() (which is saner anyway since we don't need anything vmtruncate() does apart from calling .truncate in these paths). We also add the inode to the orphan list only if ext4_can_truncate() is true (currently, it can be false for symlinks when there are no blocks allocated) - otherwise orphan list processing will complain and ext4_truncate() will not remove the inode from the on-disk orphan list. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Fix potential reclaim deadlock when truncating partial block	Theodore Ts'o	2009-07-05	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The ext4_block_truncate_page() function previously called grab_cache_page(), which called find_or_create_page() with the __GFP_FS flag potentially set. This could cause a deadlock if the system is low on memory and it attempts a memory reclaim, which could potentially call back into ext4. So we need to call find_or_create_page() directly, and remove the __GFP_FS flag to avoid this potential deadlock. Thanks to Roland Dreier for reporting a lockdep warning which showed this problem.
[20786.363249] =================================
[20786.363257] [ INFO: inconsistent lock state ]
[20786.363265] 2.6.31-2-generic #14~rbd4gitd960eea9
[20786.363270] ---------------------------------
[20786.363276] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[20786.363285] http/8397 [HC0[0]:SC0[0]:HE1:SE1] takes:
[20786.363291] (jbd2_handle){+.+.?.}, at: [<ffffffff812008bb>] jbd2_journal_start+0xdb/0x150
[20786.363314] {IN-RECLAIM_FS-W} state was registered at:
[20786.363320] [<ffffffff8108bef6>] mark_irqflags+0xc6/0x1a0
[20786.363334] [<ffffffff8108d347>] __lock_acquire+0x287/0x430
[20786.363345] [<ffffffff8108d595>] lock_acquire+0xa5/0x150
[20786.363355] [<ffffffff812008da>] jbd2_journal_start+0xfa/0x150
[20786.363365] [<ffffffff811d98a8>] ext4_journal_start_sb+0x58/0x90
[20786.363377] [<ffffffff811cce85>] ext4_delete_inode+0xc5/0x2c0
[20786.363389] [<ffffffff81146fa3>] generic_delete_inode+0xd3/0x1a0
[20786.363401] [<ffffffff81147095>] generic_drop_inode+0x25/0x30
[20786.363411] [<ffffffff81145ce2>] iput+0x62/0x70
[20786.363420] [<ffffffff81142878>] dentry_iput+0x98/0x110
[20786.363429] [<ffffffff81142a00>] d_kill+0x50/0x80
[20786.363438] [<ffffffff811444c5>] dput+0x95/0x180
[20786.363447] [<ffffffff8120de4b>] ecryptfs_d_release+0x2b/0x70
[20786.363459] [<ffffffff81142978>] d_free+0x28/0x60
[20786.363468] [<ffffffff81142a18>] d_kill+0x68/0x80
[20786.363477] [<ffffffff81142ad3>] prune_one_dentry+0xa3/0xc0
[20786.363487] [<ffffffff81142d61>] __shrink_dcache_sb+0x271/0x290
[20786.363497] [<ffffffff81142e89>] prune_dcache+0x109/0x1b0
[20786.363506] [<ffffffff81142f6f>] shrink_dcache_memory+0x3f/0x50
[20786.363516] [<ffffffff810f6d3d>] shrink_slab+0x12d/0x190
[20786.363527] [<ffffffff810f97d7>] balance_pgdat+0x4d7/0x640
[20786.363537] [<ffffffff810f9a57>] kswapd+0x117/0x170
[20786.363546] [<ffffffff810773ce>] kthread+0x9e/0xb0
[20786.363558] [<ffffffff8101430a>] child_rip+0xa/0x20
[20786.363569] [<ffffffffffffffff>] 0xffffffffffffffff
[20786.363598] irq event stamp: 15997
[20786.363603] hardirqs last enabled at (15997): [<ffffffff81125f9d>] kmem_cache_alloc+0xfd/0x1a0
[20786.363617] hardirqs last disabled at (15996): [<ffffffff81125f01>] kmem_cache_alloc+0x61/0x1a0
[20786.363628] softirqs last enabled at (15966): [<ffffffff810631ea>] __do_softirq+0x14a/0x220
[20786.363641] softirqs last disabled at (15861): [<ffffffff8101440c>] call_softirq+0x1c/0x30
[20786.363651]
[20786.363653] other info that might help us debug this:
[20786.363660] 3 locks held by http/8397:
[20786.363665] #0: (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [<ffffffff8112ed24>] do_truncate+0x64/0x90
[20786.363685] #1: (&sb->s_type->i_alloc_sem_key#5){+++++.}, at: [<ffffffff81147f90>] notify_change+0x250/0x350
[20786.363707] #2: (jbd2_handle){+.+.?.}, at: [<ffffffff812008bb>] jbd2_journal_start+0xdb/0x150
[20786.363724]
[20786.363726] stack backtrace:
[20786.363734] Pid: 8397, comm: http Tainted: G C 2.6.31-2-generic #14~rbd4gitd960eea9
[20786.363741] Call Trace:
[20786.363752] [<ffffffff8108ad7c>] print_usage_bug+0x18c/0x1a0
[20786.363763] [<ffffffff8108b0c0>] ? check_usage_backwards+0x0/0xb0
[20786.363773] [<ffffffff8108bad2>] mark_lock_irq+0xf2/0x280
[20786.363783] [<ffffffff8108bd97>] mark_lock+0x137/0x1d0
[20786.363793] [<ffffffff8108c03c>] mark_held_locks+0x6c/0xa0
[20786.363803] [<ffffffff8108c11f>] lockdep_trace_alloc+0xaf/0xe0
[20786.363813] [<ffffffff810efbac>] __alloc_pages_nodemask+0x7c/0x180
[20786.363824] [<ffffffff810e9411>] ? find_get_page+0x91/0xf0
[20786.363835] [<ffffffff8111d3b7>] alloc_pages_current+0x87/0xd0
[20786.363845] [<ffffffff810e9827>] __page_cache_alloc+0x67/0x70
[20786.363856] [<ffffffff810eb7df>] find_or_create_page+0x4f/0xb0
[20786.363867] [<ffffffff811cb3be>] ext4_block_truncate_page+0x3e/0x460
[20786.363876] [<ffffffff812008da>] ? jbd2_journal_start+0xfa/0x150
[20786.363885] [<ffffffff812008bb>] ? jbd2_journal_start+0xdb/0x150
[20786.363895] [<ffffffff811c6415>] ? ext4_meta_trans_blocks+0x75/0xf0
[20786.363905] [<ffffffff811e8d8b>] ext4_ext_truncate+0x1bb/0x1e0
[20786.363916] [<ffffffff811072c5>] ? unmap_mapping_range+0x75/0x290
[20786.363926] [<ffffffff811ccc28>] ext4_truncate+0x498/0x630
[20786.363938] [<ffffffff8129b4ce>] ? _raw_spin_unlock+0x5e/0xb0
[20786.363947] [<ffffffff81107306>] ? unmap_mapping_range+0xb6/0x290
[20786.363957] [<ffffffff8108c3ad>] ? trace_hardirqs_on+0xd/0x10
[20786.363966] [<ffffffff811ffe58>] ? jbd2_journal_stop+0x1f8/0x2e0
[20786.363976] [<ffffffff81107690>] vmtruncate+0xb0/0x110
[20786.363986] [<ffffffff81147c05>] inode_setattr+0x35/0x170
[20786.363995] [<ffffffff811c9906>] ext4_setattr+0x186/0x370
[20786.364005] [<ffffffff81147eab>] notify_change+0x16b/0x350
[20786.364014] [<ffffffff8112ed30>] do_truncate+0x70/0x90
[20786.364021] [<ffffffff8112f48b>] T.657+0xeb/0x110
[20786.364021] [<ffffffff8112f4be>] sys_ftruncate+0xe/0x10
[20786.364021] [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
Reported-by: Roland Dreier <roland@digitalvampire.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	switch ext4 to inode->i_acl	Al Viro	2009-06-24	1	-4/+0
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	ext4: Don't update ctime for non-extent-mapped inodes	Theodore Ts'o	2009-06-15	1	-5/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The VFS handles updating ctime, so we don't need to update the inode's ctime in ext4_splice_branch() to update the direct or indirect blocks. This was harmless when we did this in ext3, but in ext4, thanks to delayed allocation, updating the ctime in ext4_splice_branch() can cause the ctime to mysteriously jump when the blocks are finally allocated. Thanks to Björn Steinbrink for pointing out this problem on the git mailing list. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Fix up whitespace issues in fs/ext4/inode.c	Theodore Ts'o	2009-06-14	1	-97/+103
\| \| \| \| \| \|	This is a pure cleanup patch. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: move the abort flag from s_mount_opts to s_mount_flags	Theodore Ts'o	2009-06-13	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	We're running out of space in the mount options word, and EXT4_MOUNT_ABORT isn't really a mount option, but a run-time flag. So move it to become EXT4_MF_FS_ABORTED in s_mount_flags. Also remove bogus ext2_fs.h / ext4.h simultaneous #include protection, which can never happen. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: change s_mount_opt to be an unsigned int	Theodore Ts'o	2009-06-13	1	-1/+1
\| \| \| \| \| \| \| \|	We can only fit 32 options in s_mount_opt because an unsigned long is 32-bits on a x86 machine. So use an unsigned int to save space on 64-bit platforms. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: convert instrumentation from markers to tracepoints	Theodore Ts'o	2009-06-17	1	-55/+14
\| \| \| \|	Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Don't treat a truncation of a zero-length file as replace-via-truncate	Theodore Ts'o	2009-06-09	1	-1/+2
\| \| \| \| \| \| \| \|	If a non-existent file is opened via O_WRONLY\|O_CREAT\|O_TRUNC, there's no need to treat this as a true file truncation, so we shouldn't activate the replace-via-truncate heuristic. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: truncate the file properly if we fail to copy data from userspace	Aneesh Kumar K.V	2009-06-05	1	-26/+102
\| \| \| \| \| \| \| \| \| \| \| \| \|	In generic_perform_write, if we fail to copy the user data we don't update inode->i_size. We should truncate the file in that case so that we don't have blocks allocated outside inode->i_size. Add the inode to the orphan list in the same transaction as the block allocation. This ensures that if we crash in between, the recovery would do the truncate. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> CC: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Avoid leaking blocks after a block allocation failure	Aneesh Kumar K.V	2009-06-05	1	-2/+22
\| \| \| \| \| \| \| \| \| \| \|	We should add the inode to the orphan list in the same transaction as the block allocation. This ensures that if we crash after a failed block allocation and before we do a vmtruncate, we don't leak blocks (i.e., blocks marked as used in the bitmap but not claimed by the inode). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> CC: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()	Jan Kara	2009-06-09	1	-20/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle(). This seems to be a relict from some old days and setting disksize in this function does not make much sense. Currently it was set only by ext4_getblk(). Since the parameter has some effect only if create == 1, it is easy to check by grepping through the sources that the three callers which end up calling ext4_getblk() with create == 1 (ext4_append, ext4_quota_write, ext4_mkdir) do the right thing and set disksize themselves. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: remove unused function __ext4_write_dirty_metadata	Theodore Ts'o	2009-05-25	1	-19/+0
\| \| \| \| \| \| \| \| \| \|	The __ext4_write_dirty_metadata() function was introduced by commit 0390131b, "ext4: Allow ext4 to run without a journal", but nothing ever used the function, either then or since. So let's remove it and save a bit of space. Cc: Frank Mayhar <fmayhar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Add a comprehensive block validity check to ext4_get_blocks()	Theodore Ts'o	2009-05-17	1	-9/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To catch filesystem bugs or corruption which could lead to the filesystem getting severely damaged, this patch adds a facility for tracking all of the filesystem metadata blocks by contiguous regions in a red-black tree. This allows quick searching of the tree to locate extents which might overlap with filesystem metadata blocks. This facility is also used by the multi-block allocator to assure that it is not allocating blocks out of the system zone, as well as by the routines used when reading indirect blocks and extents information from disk to make sure their contents are valid. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_state	Theodore Ts'o	2009-05-14	1	-25/+31
\| \| \| \| \| \| \| \| \| \| \| \| \|	The ext4_get_blocks() function was depending on the value of bh_result->b_state as an input parameter to decide whether or not to update the delalloc accounting statistics by calling ext4_da_update_reserve_space(). We now use a separate flag, EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE, to request this update, so that all callers of ext4_get_blocks() can clear map_bh.b_state before calling ext4_get_blocks() without worrying about any consistency issues. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks()	Theodore Ts'o	2009-05-14	1	-67/+43
\| \| \| \| \| \| \| \|	The static function ext4_da_get_block_write() was only used by mpage_da_map_blocks(). So to simplify the code, merge that function into mpage_da_map_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Add BUG_ON debugging checks to noalloc_get_block_write()	Theodore Ts'o	2009-05-12	1	-0/+3
\| \| \| \| \| \| \| \| \|	Enforce that noalloc_get_block_write() is only called to map one block at a time, and that it always is successful in finding a mapping for a given inode's logical block number if it is called with create == 1. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Add documentation to the ext4_get_block functions	Theodore Ts'o	2009-05-14	1	-31/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds more documentation to various internal functions in fs/ext4/inode.c, most notably ext4_ind_get_blocks(), ext4_da_get_block_write(), ext4_da_get_block_prep(), ext4_normal_get_block_write(). In addition, the static function ext4_normal_get_block_write() has been renamed noalloc_get_block_write(), since it is used in many places far beyond ext4_normal_writepage(). Plenty of warnings have been added to the noalloc_get_block_write() function, since the way it is used is amazingly fragile. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Define a new set of flags for ext4_get_blocks()	Theodore Ts'o	2009-05-14	1	-27/+30
\| \| \| \| \| \| \| \| \| \|	The functions ext4_get_blocks(), ext4_ext_get_blocks(), and ext4_ind_get_blocks() used an ad-hoc set of integer variables as boolean flags passed in as arguments. Use a single flags parameter and a standard set of bitfield flags instead. This saves space on the call stack, and it also makes the code a bit more understandable. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
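The shape of that change, several integer arguments collapsed into one flags word, can be sketched as follows. The `DEMO_*` names and values are hypothetical and chosen only for illustration; the real flag definitions live in the ext4 headers.

```c
#include <stdbool.h>

/* Hypothetical flag bits, one per former boolean argument. */
#define DEMO_FLAG_CREATE   0x1u
#define DEMO_FLAG_UNINIT   0x2u
#define DEMO_FLAG_RESERVE  0x4u

/* Before the change a call might have looked like
 *     demo_get_blocks(..., int create, int uninit, int reserve);
 * afterwards a single `unsigned int flags` carries all three. */
struct demo_request {
    bool create;
    bool uninit;
    bool reserve;
};

struct demo_request demo_decode_flags(unsigned int flags)
{
    struct demo_request r;

    r.create  = (flags & DEMO_FLAG_CREATE)  != 0;
    r.uninit  = (flags & DEMO_FLAG_UNINIT)  != 0;
    r.reserve = (flags & DEMO_FLAG_RESERVE) != 0;
    return r;
}
```

A call site then reads `demo_decode_flags(DEMO_FLAG_CREATE | DEMO_FLAG_RESERVE)`, which names each option instead of relying on argument position.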
*	ext4: Rename ext4_get_blocks_wrap() to be ext4_get_blocks()	Theodore Ts'o	2009-05-14	1	-18/+17
\| \| \| \| \| \| \| \|	Another function rename for clarity's sake. The _wrap suffix simply confuses people, and didn't add much for people trying to follow the code paths. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Rename ext4_get_blocks_handle() to be ext4_ind_get_blocks()	Theodore Ts'o	2009-05-12	1	-6/+6
\| \| \| \| \| \| \| \|	The static function ext4_get_blocks_handle() is badly named. Of course it takes a handle. Since its counterpart for extent-based file is ext4_ext_get_blocks(), rename it to be ext4_ind_get_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Simplify function signature for ext4_da_get_block_write()	Theodore Ts'o	2009-05-12	1	-3/+3
\| \| \| \| \| \| \| \|	The function ext4_da_get_block_write() is called in exactly one place, and the last argument, create, is always 1. Remove it to simplify the code slightly. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Avoid races caused by on-line resizing and SMP memory reordering	Theodore Ts'o	2009-05-01	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ext4's on-line resizing adds a new block group and then, only at the last step, adjusts s_groups_count. However, it's possible on SMP systems that another CPU could see the updated s_groups_count and not see the newly initialized data structures for the just-added block group. For this reason, it's important to insert an SMP read barrier after reading s_groups_count and before reading (for example) the new block group descriptors allowed by the increased value of s_groups_count. Unfortunately, we rather blatantly violate this locking protocol documented in fs/ext4/resize.c. Fortunately, (1) on-line resizes happen relatively rarely, and (2) it seems rare that the filesystem code will immediately try to use a just-added block group before any memory ordering issues resolve themselves. So apparently problems here are relatively hard to hit, since ext3 has been vulnerable to the same issue for years with no one apparently complaining. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
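The publish-then-read ordering described above can be illustrated with a userspace analogue. This sketch uses C11 release/acquire atomics rather than the kernel's own barrier primitives, and every name in it (`demo_sb`, `demo_add_group()`, `demo_read_last_group()`) is hypothetical.

```c
#include <stdatomic.h>

#define DEMO_MAX_GROUPS 8

struct demo_sb {
    unsigned int group_desc[DEMO_MAX_GROUPS]; /* per-group data */
    atomic_uint groups_count;                 /* published last */
};

void demo_sb_init(struct demo_sb *sb)
{
    atomic_init(&sb->groups_count, 0);
}

/* Resizer: initialise the new group first, then publish the larger count.
 * The release store keeps the descriptor write from being reordered after it. */
int demo_add_group(struct demo_sb *sb, unsigned int desc)
{
    unsigned int n = atomic_load_explicit(&sb->groups_count,
                                          memory_order_relaxed);
    if (n >= DEMO_MAX_GROUPS)
        return -1;
    sb->group_desc[n] = desc;
    atomic_store_explicit(&sb->groups_count, n + 1, memory_order_release);
    return 0;
}

/* Reader: the acquire load plays the role of "read barrier after reading
 * the count and before reading the descriptors it permits". */
int demo_read_last_group(struct demo_sb *sb, unsigned int *out)
{
    unsigned int n = atomic_load_explicit(&sb->groups_count,
                                          memory_order_acquire);
    if (n == 0)
        return -1;
    *out = sb->group_desc[n - 1];
    return 0;
}
```

The sketch assumes a single resizer at a time; concurrent resizers would need additional locking.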
*	ext4: Mark the unwritten buffer_head as mapped during write_begin	Aneesh Kumar K.V	2009-05-12	1	-30/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Setting BH_Unwritten buffer_heads as BH_Mapped avoids multiple (unnecessary) calls to get_block() during the call to the write(2) system call. Setting BH_Unwritten buffer heads as BH_Mapped requires that the writepages() functions can handle BH_Unwritten buffer_heads. After this commit, things work as follows: ext4_ext_get_block() returns unmapped, unwritten, buffer head when called with create = 0 for prealloc space. This makes sure we handle the read path and non-delayed allocation case correctly. Even though the buffer head is marked unmapped we have valid b_blocknr and b_bdev values in the buffer_head. ext4_da_get_block_prep() called for block reservation will now return mapped, unwritten, new buffer_head for prealloc space. This avoids multiple calls to get_block() for write to same offset. By making such buffers as BH_New, we also assure that sub-block zeroing of buffered writes happens correctly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Properly initialize the buffer_head state	Aneesh Kumar K.V	2009-05-13	1	-1/+14
\| \| \| \| \| \| \| \| \| \| \|	These struct buffer_heads are allocated on the stack (and hence are initialized with stack garbage). They are only used to call a get_blocks() function, so that's mostly OK, but b_state must be initialized to be 0 so we don't have any unexpected BH_* flags set by accident, such as BH_Unwritten or BH_Delay. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
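A minimal sketch of the initialisation issue, using a hypothetical stand-in struct rather than the kernel's struct buffer_head:

```c
/* Hypothetical stand-in with only the fields this sketch needs. */
struct demo_buffer_head {
    unsigned long b_state;        /* flag bits */
    unsigned long long b_blocknr; /* physical block number */
    unsigned long b_size;
};

#define DEMO_BH_MAPPED (1ul << 0)

/* Hypothetical mapper: like a get_blocks() routine, it only ORs flags
 * into b_state and never clears what was already there. */
static void demo_get_block(struct demo_buffer_head *bh)
{
    bh->b_blocknr = 100;
    bh->b_state |= DEMO_BH_MAPPED;
}

unsigned long demo_lookup_state(void)
{
    struct demo_buffer_head bh;   /* on the stack: contents indeterminate */

    bh.b_state = 0;               /* without this, stale bits could read as flags */
    bh.b_size = 4096;
    demo_get_block(&bh);
    return bh.b_state;
}
```

Because the mapper only sets bits, the caller is the one place that can guarantee a clean starting state.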
*	ext4: Clear the unwritten buffer_head flag after the extent is initialized	Aneesh Kumar K.V	2009-05-14	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The BH_Unwritten flag indicates that the buffer is allocated on disk but has not been written; that is, the disk was part of a persistent preallocation area. That flag should only be set when a get_blocks() function is looking up an inode's logical to physical block mapping. When ext4_get_blocks_wrap() is called with create=1, the uninitialized extent is converted into an initialized one, so the BH_Unwritten flag is no longer appropriate. Hence, we need to make sure the BH_Unwritten is not left set, since the combination of BH_Mapped and BH_Unwritten is not allowed; among other things, it will cause ext4's get_block() to be called over and over again during the write_begin phase of write(2). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Use a fake block number for delayed new buffer_head	Aneesh Kumar K.V	2009-05-12	1	-1/+5
\| \| \| \| \| \| \| \| \|	Use a very large unsigned number (~0xffff) as the fake block number for the delayed new buffer. The VFS should never try to write out this number, but if it does, this will make it obvious. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
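For a sense of scale, the value can be computed in userspace. This sketch assumes a 64-bit block number type (uint64_t here, standing in for whatever type the kernel uses); `demo_invalid_block()` is a hypothetical name.

```c
#include <stdint.h>

/* Complement of 0xffff in a 64-bit unsigned type: every bit set except
 * the low 16, i.e. 0xFFFFFFFFFFFF0000 under the 64-bit assumption. */
uint64_t demo_invalid_block(void)
{
    return ~((uint64_t)0xffff);
}
```

The cast before the complement matters: `~0xffff` on a plain int yields a negative int, and the result then depends on how it is converted to the wider type.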
*	ext4: Fix sub-block zeroing for writes into preallocated extents	Aneesh Kumar K.V	2009-05-13	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	We need to mark the buffer_head mapping preallocated space as new during write_begin. Otherwise we don't zero out the page cache content properly for a partial write. This will cause file corruption with preallocation. Now that we mark the buffer_head new we also need to have a valid buffer_head blocknr so that unmap_underlying_metadata() unmaps the correct block. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Do not try to validate extents on special files	Theodore Ts'o	2009-04-24	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The EXTENTS_FL flag should never be set on special files, but if it is, don't bother trying to validate that the extents tree is valid, since only files, directories, and non-fast symlinks will ever have an extent data structure. We perhaps should flag the filesystem as being corrupted if we see a special file (named pipes, device nodes, Unix domain sockets, etc.) with the EXTENTS_FL flag, but e2fsck doesn't currently check this case, so we'll just ignore this for now, since it's harmless. Without this fix, a special device with the extents flag is flagged as an error by the kernel, so it is impossible to access or delete the inode, but e2fsck doesn't see it as a problem, leading to confused/frustrated users. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Ignore i_file_acl_high unless EXT4_FEATURE_INCOMPAT_64BIT is present	Theodore Ts'o	2009-04-24	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \|	Don't try to look at i_file_acl_high unless the INCOMPAT_64BIT feature bit is set. The field is normally zero, but older versions of e2fsck didn't automatically check to make sure of this, so in the spirit of "be liberal in what you accept", don't look at i_file_acl_high unless we are using a 64-bit filesystem. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
*	ext4: Fix softlockup caused by illegal i_file_acl value in on-disk inode	Theodore Ts'o	2009-04-24	1	-1/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the block containing external extended attributes (which is stored in i_file_acl and i_file_acl_high) is larger than the on-disk filesystem, the process which tried to access the extended attributes will endlessly issue kernel printks complaining that "__find_get_block_slow() failed", locking up that CPU until the system is forcibly rebooted. So when we read in the inode, make sure the i_file_acl value is legal, and if not, flag the filesystem as being corrupted. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
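The two i_file_acl rules described in this and the preceding entry (ignore the high part unless the 64-bit feature is set, then reject a block number that lies outside the filesystem) can be sketched like this. The function names and the exact validity rule are assumptions for illustration, not the kernel's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Combine the 32-bit low part and the 16-bit high part of the block
 * number; the high part is only honoured on a 64-bit filesystem. */
uint64_t demo_file_acl(uint32_t lo, uint16_t hi, bool has_64bit)
{
    uint64_t block = lo;

    if (has_64bit)
        block |= (uint64_t)hi << 32;
    return block;
}

/* Assumed rule: 0 means "no external attribute block"; any other value
 * must fall inside [first_data_block, blocks_count). */
bool demo_file_acl_valid(uint64_t block, uint64_t first_data_block,
                         uint64_t blocks_count)
{
    if (block == 0)
        return true;
    return block >= first_data_block && block < blocks_count;
}
```

Checking once when the inode is read means later attribute lookups never have to deal with an out-of-range block.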
*	ext4: Fix big-endian problem in __ext4_check_blockref()	Thiemo Nagel	2009-04-07	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \|	Commit fe2c8191 introduced a regression on big-endian system, because the checks to make sure block references in non-extent inodes are valid failed to use le32_to_cpu(). Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Tested-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
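The class of bug is easy to reproduce outside the kernel: on-disk fields are little-endian, so a check that compares the raw bytes only works on little-endian hosts. A portable sketch, with hypothetical names and an assumed validity rule:

```c
#include <stdint.h>

/* Decode a little-endian 32-bit on-disk value byte by byte, so the
 * result does not depend on the host's endianness. */
uint32_t demo_le32_decode(const unsigned char b[4])
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Assumed rule: 0 means "no block"; otherwise the decoded reference
 * must be below the filesystem's block count. */
int demo_blockref_valid(const unsigned char raw[4], uint32_t blocks_count)
{
    uint32_t blk = demo_le32_decode(raw);

    return blk == 0 || blk < blocks_count;
}
```

Comparing the undecoded value would make small block numbers look huge on a big-endian machine, which matches the regression described above.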
*	Merge branch 'for_linus' of ↵	Linus Torvalds	2009-04-01	1	-161/+263
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits) ext4: Regularize mount options ext4: fix locking typo in mballoc which could cause soft lockup hangs ext4: fix typo which causes a memory leak on error path jbd2: Update locking coments ext4: Rename pa_linear to pa_type ext4: add checks of block references for non-extent inodes ext4: Check for an valid i_mode when reading the inode from disk ext4: Use WRITE_SYNC for commits which are caused by fsync() ext4: Add auto_da_alloc mount option ext4: Use struct flex_groups to calculate get_orlov_stats() ext4: Use atomic_t's in struct flex_groups ext4: remove /proc tuning knobs ext4: Add sysfs support ext4: Track lifetime disk writes ext4: Fix discard of inode prealloc space with delayed allocation. ext4: Automatically allocate delay allocated blocks on rename ext4: Automatically allocate delay allocated blocks on close ext4: add EXT4_IOC_ALLOC_DA_BLKS ioctl ext4: Simplify delalloc code by removing mpage_da_writepages() ext4: Save stack space by removing fake buffer heads ...
\| *	ext4: add checks of block references for non-extent inodes	Thiemo Nagel	2009-03-31	1	-7/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Check block references in the inode and indirect blocks for non-extent inodes to make sure they are valid, and flag an error if they are invalid. Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
\| *	ext4: Check for an valid i_mode when reading the inode from disk	Theodore Ts'o	2009-03-26	1	-1/+9
\| \| \| \| \| \| \| \|	Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
\| *	ext4: Add auto_da_alloc mount option	Theodore Ts'o	2009-03-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a mount option which allows the user to disable the automatic allocation of blocks whose allocation was deferred by delayed allocation, which otherwise happens when the file was originally truncated or when the file is renamed over an existing file. This feature is intended to save users from the effects of naive application writers, but it reduces the effectiveness of the delayed allocation code. This mount option disables this safety feature, which may be desirable for production systems where the risk of unclean shutdowns or unexpected system crashes is low. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
\| *	ext4: remove /proc tuning knobs	Theodore Ts'o	2009-03-31	1	-6/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove tuning knobs in /proc/fs/ext4/<dev>/* since they have been replaced by knobs in sysfs at /sys/fs/ext4/<dev>/*. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>