path: root/fs/ext4
Commit message | Author | Age | Files | Lines
* ext4: Declare seq_operations and file_operations structures as const | Tobias Klauser | 2009-09-05 | 1 | -4/+4
| | | | | Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Add new tracepoint: trace_ext4_da_write_pages() | Theodore Ts'o | 2009-08-31 | 2 | -12/+16
| | | | | | | Add a new tracepoint which shows the pages that will be written using write_cache_pages() by ext4_da_writepages(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Restore wbc->range_start in ext4_da_writepages() | Theodore Ts'o | 2009-08-31 | 1 | -0/+2
| | | | | | | | | | | | | To solve a lock inversion problem, we implement part of the range_cyclic algorithm in ext4_da_writepages(). (See commit 2acf2c26 for more details.) As part of that change wbc->range_start was modified by ext4's writepages function, which causes its callers to get confused since they aren't expecting the filesystem to modify it. The simplest fix is to save and restore wbc->range_start in ext4_da_writepages. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
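The save-and-restore fix described in the commit message above can be sketched in user space roughly as follows. This is only an illustration, not the kernel patch: `struct wbc_sketch` and `writepages_sketch()` are hypothetical stand-ins for `struct writeback_control` and `ext4_da_writepages()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct writeback_control. */
struct wbc_sketch {
    long long range_start;
    long long range_end;
};

/*
 * Sketch of a writepages-style function that moves range_start
 * internally (as a range_cyclic walk would) but must not leak that
 * change to its caller.
 */
int writepages_sketch(struct wbc_sketch *wbc, long long resume_from)
{
    long long saved_start;

    assert(wbc != NULL);
    saved_start = wbc->range_start;   /* save the caller's value */

    wbc->range_start = resume_from;   /* internal modification */
    /* ... the actual page writeback would happen here ... */

    wbc->range_start = saved_start;   /* restore before returning */
    return 0;
}
```

A caller that sets `range_start` before the call sees the same value afterwards, which is the property the commit restores.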
* ext4: Limit number of links that can be created by ext4_link() | Theodore Ts'o | 2009-08-29 | 1 | -1/+1
| | | | | | | | In ext4_link we need to check using EXT4_LINK_MAX, and not EXT4_DIR_LINK_MAX(), since ext4_link() is creating hard links of regular files, and not directories. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Allow rename to create more than EXT4_LINK_MAX subdirectories | Aneesh Kumar K.V | 2009-08-28 | 1 | -1/+1
| | | | | | | | Use EXT4_DIR_LINK_MAX so that rename() can move a directory into new parent directory without running into the EXT4_LINK_MAX limit. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: fix extent sanity checking code with AGGRESSIVE_TEST | Theodore Ts'o | 2009-08-28 | 1 | -26/+34
| The extents sanity-checking code depends on the ext4_ext_space_*() functions returning the maximum allowable size for eh_max; however, when the debugging #ifdef AGGRESSIVE_TEST is enabled to test the extent tree handling code, this prevents a normally created ext4 filesystem from being mounted, with the errors:
|
| Aug 26 15:43:50 bsd086 kernel: [ 96.070277] EXT4-fs error (device sda8): ext4_ext_check_inode: bad header/extent in inode #8: too large eh_max - magic f30a, entries 1, max 4(3), depth 0(0)
| Aug 26 15:43:50 bsd086 kernel: [ 96.070526] EXT4-fs (sda8): no journal found
|
| Bug reported by Akira Fujita.
|
| Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: use ext4_grpblk_t more extensively | Eric Sandeen | 2009-08-25 | 3 | -16/+19
| | | | | | | | | | | | | | | | | unsigned short is potentially too small to track blocks within a group; today it is safe due to restrictions in e2fsprogs but we have _lo / _hi bits for group blocks with the intent to go up to 32 bits, so clean this up now. There are many more places where we use unsigned/int/unsigned int to contain a group block but this should at least fix all the short types. I added a few comments to the struct ext4_group_info definition as well. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: use variables not types in sizeofs() for allocations | Eric Sandeen | 2009-08-25 | 1 | -3/+4
| | | | | | | | | Precursor to changing some types; to keep things in sync, it seems better to allocate/memset based on the size of the variables we are using rather than on some disconnected basic type like "unsigned short" Signed-off-by: Eric Sandeen <sandeen@redhat.com>
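The idea of sizing allocations from the variable rather than from a hard-coded type can be illustrated with a small user-space sketch. The names `counter_t` and `alloc_counters()` are hypothetical and do not come from the ext4 sources.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical element type; it could later be widened without touching the allocation code. */
typedef unsigned int counter_t;

/* Allocate and zero an array of n counters, sized from the variable itself. */
counter_t *alloc_counters(size_t n)
{
    counter_t *counters;

    assert(n > 0);
    /* sizeof(*counters) follows the variable's type automatically. */
    counters = malloc(n * sizeof(*counters));
    if (counters != NULL)
        memset(counters, 0, n * sizeof(*counters));
    return counters;
}
```

If `counter_t` changes from `unsigned short` to a wider type, both the `malloc()` and the `memset()` stay in sync with no further edits.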
* ext4: Add missing unlock_new_inode() call in extent migration code | Aneesh Kumar K.V | 2009-08-25 | 1 | -1/+1
| We need to unlock the new inode before iput. This patch fixes the following warning when calling chattr +e to migrate a file to use extents. It also fixes problems when e4defrag attempts to defragment an inode.
|
| [ 470.400044] ------------[ cut here ]------------
| [ 470.400065] WARNING: at fs/inode.c:1210 generic_delete_inode+0x65/0x16a()
| [ 470.400072] Hardware name: N/A
| .....
| ...
| [ 470.400353] Pid: 4451, comm: chattr Not tainted 2.6.31-rc7-red-debug #4
| [ 470.400359] Call Trace:
| [ 470.400372] [<ffffffff81037771>] warn_slowpath_common+0x77/0x8f
| [ 470.400385] [<ffffffff81037798>] warn_slowpath_null+0xf/0x11
| [ 470.400395] [<ffffffff810b7f28>] generic_delete_inode+0x65/0x16a
| [ 470.400405] [<ffffffff810b8044>] generic_drop_inode+0x17/0x1bd
| [ 470.400413] [<ffffffff810b7083>] iput+0x61/0x65
| [ 470.400455] [<ffffffffa003b229>] ext4_ext_migrate+0x5eb/0x66a [ext4]
| [ 470.400492] [<ffffffffa002b1f8>] ext4_ioctl+0x340/0x756 [ext4]
| [ 470.400507] [<ffffffff810b1a91>] vfs_ioctl+0x1d/0x82
| [ 470.400517] [<ffffffff810b1ff0>] do_vfs_ioctl+0x483/0x4c9
| [ 470.400527] [<ffffffff81059c30>] ? trace_hardirqs_on+0xd/0xf
| [ 470.400537] [<ffffffff810b2087>] sys_ioctl+0x51/0x74
| [ 470.400549] [<ffffffff8100ba6b>] system_call_fastpath+0x16/0x1b
| [ 470.400557] ---[ end trace ab85723542352dac ]---
|
| Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
| Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Add feature set check helper for mount & remount paths | Eric Sandeen | 2009-08-18 | 1 | -42/+49
| | | | | | | | | | | | | | | | | | | | | A user reported that although his root ext4 filesystem was mounting fine, other filesystems would not mount, with the: "Filesystem with huge files cannot be mounted RDWR without CONFIG_LBDAF" error on his 32-bit box built without CONFIG_LBDAF. This is because the test at mount time for this situation was not being re-checked on remount, and the normal boot process makes an ro->rw transition, so this was being missed. Refactor to make a common helper function to test the filesystem features against the type of mount request (RO vs. RW) so that we stay consistent. Addresses Red-Hat-Bugzilla: #517650 Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* simplify some logic in ext4_mb_normalize_request | Eric Sandeen | 2009-08-17 | 1 | -9/+4
| | | | | | | | | | | | | | | | | | | | While reading through some of the mballoc code it seems that a couple spots in the size normalization function could be streamlined. The test for non-overlapping PAs can be or'd for the start & end conditions, and the tests for adjacent PAs can be else-if'd - it's essentially independently testing: if (A + B <= C) ... if (A > C) ... These cannot both be true so it seems like the else-if might be slightly more efficient and/or informative. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
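The else-if argument in the commit message above rests on the two conditions being mutually exclusive. A minimal sketch, using a hypothetical helper `classify_range()` rather than the actual mballoc code, and assuming `start + len` does not overflow:

```c
#include <assert.h>

/*
 * Hypothetical helper: classify where the range [start, start + len)
 * lies relative to a boundary block.
 */
int classify_range(unsigned long start, unsigned long len, unsigned long boundary)
{
    assert(start + len >= start);   /* guard the no-overflow assumption */

    if (start + len <= boundary)
        return -1;                  /* range ends at or before the boundary */
    else if (start > boundary)
        return 1;                   /* range starts after the boundary */
    return 0;                       /* boundary falls inside the range */
}
```

Because `start + len <= boundary` implies `start <= boundary`, the second test can never succeed once the first has, so chaining them with else-if skips a comparison without changing the result.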
* ext4: open-code ext4_mb_update_group_info | Eric Sandeen | 2009-08-17 | 3 | -12/+1
| | | | | | | | | | | | | | ext4_mb_update_group_info is only called in one place, and it's extremely simple. There's no reason to have it in a separate function in a separate file as far as I can tell, it just obfuscates what's really going on. Perhaps it was intended to keep the grp->bb_* manipulation local to mballoc.c but we're already accessing other grp-> fields in balloc.c directly so this seems ok. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: reject too-large filesystems on 32-bit kernels | Eric Sandeen | 2009-08-17 | 1 | -3/+10
| ext4 will happily mount a > 16T filesystem on a 32-bit box, but this is not safe; writes to the block device will wrap past 16T and the page cache can't index past 16T (2^32 index * 4k pages).
|
| Adding another test to the existing "too many sectors" test should do the trick.
|
| Add a comment, a relevant return value, and fix the reference to the CONFIG_LBD(AF) option as well.
|
| Signed-off-by: Eric Sandeen <sandeen@redhat.com>
| Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
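The 16T figure in the commit message above follows from 2^32 page-cache indexes multiplied by 4 KiB pages. A small sketch of that arithmetic, with hypothetical helper names that are not taken from the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Largest byte range addressable with an index_bits-wide page index. */
uint64_t page_cache_limit_bytes(unsigned int index_bits, uint64_t page_size)
{
    assert(index_bits < 64);
    return ((uint64_t)1 << index_bits) * page_size;
}

/* Returns 1 if a filesystem of fs_bytes cannot be indexed safely, 0 otherwise. */
int fs_too_large(uint64_t fs_bytes, unsigned int index_bits, uint64_t page_size)
{
    return fs_bytes > page_cache_limit_bytes(index_bits, page_size);
}
```

With a 32-bit index and 4096-byte pages the limit is 2^44 bytes, i.e. 16 TiB; anything larger would be rejected by a check of this shape.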
* ext4: Fix possible deadlock between ext4_truncate() and ext4_get_blocks() | Jan Kara | 2009-08-17 | 3 | -7/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | During truncate we are sometimes forced to start a new transaction as the amount of blocks to be journaled is both quite large and hard to predict. So far we restarted a transaction while holding i_data_sem and that violates lock ordering because i_data_sem ranks below a transaction start (and it can lead to a real deadlock with ext4_get_blocks() mapping blocks in some page while having a transaction open). We fix the problem by dropping the i_data_sem before restarting the transaction and acquire it afterwards. It's slightly subtle that this works: 1) By the time ext4_truncate() is called, all the page cache for the truncated part of the file is dropped so get_block() should not be called on it (we only have to invalidate extent cache after we reacquire i_data_sem because some extent from not-truncated part could extend also into the part we are going to truncate). 2) Writes, migrate or defrag hold i_mutex so they are stopped for all the time of the truncate. This bug has been found and analyzed by Theodore Tso <tytso@mit.edu>. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Show unwritten extent flag in ext4_ext_show_leaf() | Mingming | 2009-09-18 | 2 | -11/+23
| ext4_ext_show_leaf() will display the leaf extents when extent debugging is enabled. Printing out the unwritten bit is useful for debugging unwritten extents: it lets us tell unwritten extents from written ones after the unwritten extents are split or converted.
|
| Signed-off-by: Mingming Cao <cmm@us.ibm.com>
* ext4: Compile warning fix when EXT_DEBUG enabled | Mingming | 2009-09-01 | 1 | -4/+4
| When EXT_DEBUG is enabled I received the following compile warnings on PPC64:
|
| CC [M] fs/ext4/inode.o
| CC [M] fs/ext4/extents.o
| fs/ext4/extents.c: In function ‘ext4_ext_rm_leaf’:
| fs/ext4/extents.c:2097: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘ext4_lblk_t’
| fs/ext4/extents.c: In function ‘ext4_ext_get_blocks’:
| fs/ext4/extents.c:2789: warning: format ‘%u’ expects type ‘unsigned int’, but argument 4 has type ‘long unsigned int’
| fs/ext4/extents.c:2852: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘ext4_lblk_t’
| fs/ext4/extents.c:2953: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 4 has type ‘unsigned int’
| CC [M] fs/ext4/migrate.o
|
| This patch fixes the compile warnings.
|
| Signed-off-by: Mingming Cao <cmm@us.ibm.com>
| Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
| Index: linux-2.6.31-rc4/fs/ext4/extents.c
| ===================================================================
* ext4: Avoid group preallocation for closed files | Theodore Ts'o | 2009-09-18 | 2 | -2/+38
| Currently the group preallocation code tries to find a large free chunk (512 blocks) from which to do per-cpu group allocation for small files. The problem with this scheme is that it leaves the filesystem horribly fragmented. In the worst case, if the filesystem is unmounted and remounted (after a system shutdown, for example) we forget the fact that we were using a particular (now partially filled) 512 block extent. So the next time we try to allocate space for a small file, we will find *another* completely free 512 block chunk to allocate small files. Given that there are 32,768 blocks in a block group, after 64 iterations of "mount, write one 4k file in a directory, unmount", the block group will have 64 files, each separated by 511 blocks, and the block group will no longer have any completely free 512-block chunks for group preallocation space.
|
| So if we try to allocate blocks for a file that has been closed, such that we know the final size of the file, and the filesystem is not busy, avoid using group preallocation.
|
| Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
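The "64 iterations" figure in the commit message above is simply 32,768 blocks per group divided by the 512-block chunk size. A small sketch of that worst-case arithmetic, using hypothetical helper names that are not part of mballoc:

```c
#include <assert.h>

/* Number of whole chunk_blocks-sized chunks that fit in one block group. */
unsigned int chunks_per_group(unsigned int blocks_per_group, unsigned int chunk_blocks)
{
    assert(chunk_blocks > 0);
    return blocks_per_group / chunk_blocks;
}

/*
 * Completely free chunks left after `cycles` mount/write/unmount cycles,
 * assuming the worst case where each cycle dirties one previously free chunk.
 */
unsigned int free_chunks_after(unsigned int blocks_per_group,
                               unsigned int chunk_blocks,
                               unsigned int cycles)
{
    unsigned int total = chunks_per_group(blocks_per_group, chunk_blocks);

    return cycles >= total ? 0 : total - cycles;
}
```

Under these assumptions a group with 32,768 blocks holds 64 such chunks, and none remain completely free after 64 cycles.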
* ext4: Fix bugs in mballoc's stream allocation mode | Theodore Ts'o | 2009-08-09 | 2 | -13/+12
| The logic around sbi->s_mb_last_group and sbi->s_mb_last_start was all screwed up. These fields were getting set unconditionally all the time, even when stream allocation had not taken place, and they were being used even when the file was smaller than s_mb_stream_request, which is when the allocation should _not_ be doing stream allocation.
|
| Fix this by determining whether or not stream allocation should take place once, in ext4_mb_group_or_file(), and setting a flag which gets used in ext4_mb_regular_allocator() and ext4_mb_use_best_found(). This simplifies the code and ensures that we are consistently using (or not using) the stream allocation logic.
|
| Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Display the mballoc flags in mb_history in hex instead of decimal | Theodore Ts'o | 2009-08-09 | 2 | -13/+13
| | | | | | | Displaying the flags in base 16 makes it easier to see which flags have been set. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Add configurable run-time mballoc debugging | Theodore Ts'o | 2009-09-18 | 3 | -26/+80
| | | | | | | Allow mballoc debugging to be enabled at run-time instead of just at compile time. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: fix journal ref count in move_extent_par_page | Peng Tao | 2009-08-10 | 1 | -0/+1
| move_extent_par_page calls a_ops->write_begin() to increase the journal handle's reference count. However, if either mext_replace_branches() or ext4_get_block fails, the increased reference count isn't decreased. This will cause a later attempt to umount the fs to hang forever. The patch addresses the issue by calling ext4_journal_stop() if page is not NULL (which means a_ops->write_end() isn't invoked).
|
| Signed-off-by: Peng Tao <bergwolf@gmail.com>
| Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: remove redundant test on unsigned | Roel Kluin | 2009-08-10 | 1 | -3/+1
| | | | | | | unsigned i_block cannot be less than 0. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: fix build warning when EXT4FS_DEBUG is on | Peng Tao | 2009-07-27 | 1 | -1/+1
| | | | | | | | | | | | | When compiling with EXT4FS_DEBUG on, gcc will complain with following warnings: linux-2.6/fs/ext4/ialloc.c: In function ‘ext4_count_free_inodes’: linux-2.6/fs/ext4/ialloc.c:1192: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘ext4_group_t’ So add a type cast to suppress it. Signed-off-by: Peng Tao <bergwolf@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
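The kind of cast that silences such a format warning can be sketched in user space as below. The typedef `group_t_sketch` and the function `format_group()` are hypothetical; the sketch assumes, as another entry in this log states, that ext4_group_t is an unsigned int.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in, assuming the group type is an unsigned int. */
typedef unsigned int group_t_sketch;

/* Format a group number with %lu; the explicit cast makes the argument match. */
int format_group(char *buf, size_t len, group_t_sketch group)
{
    assert(buf != NULL && len > 0);
    /* %lu expects unsigned long, so widen the value explicitly. */
    return snprintf(buf, len, "group %lu", (unsigned long)group);
}
```

The alternative is to change the format specifier to match the type; a cast has the advantage of staying valid if the underlying type is widened later.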
* ext4: Fix compile warnings with MB_DEBUG | Akira Fujita | 2009-07-05 | 1 | -4/+4
| When MB_DEBUG is enabled, we get some compile warnings because ext4_group_t is unsigned int. This patch fixes them.
|
| Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
| Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Remove unnecessary semicolons in mballoc.c | Joe Perches | 2009-07-05 | 1 | -1/+1
| | | | | Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: More buffer head reference leaks | Curt Wohlgemuth | 2009-07-17 | 2 | -5/+14
| | | | | | | | | | | | | | | | | | | | | | After the patch I posted last week regarding buffer head ref leaks in no-journal mode, I looked at all the code that uses buffer heads and searched for more potential leaks. The patch below fixes the issues I found; these can occur even when a journal is present. The change to inode.c fixes a double release if ext4_journal_get_create_access() fails. The changes to namei.c are more complicated. add_dirent_to_buf() will release the input buffer head EXCEPT when it returns -ENOSPC. There are some callers of this routine that don't always do the brelse() in the event that -ENOSPC is returned. Unfortunately, to put this fix into ext4_add_entry() required capturing the return value of make_indexed_dir() and add_dirent_to_buf(). Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Avoid null pointer dereference when decoding EROFS w/o a journal | Theodore Ts'o | 2009-07-27 | 1 | -1/+2
| | | | | | | | We need to check to make sure a journal is present before checking the journal flags in ext4_decode_error(). Signed-off-by: Eric Sesterhenn <eric.sesterhenn@lsexperts.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Fix typo in ext4/Kconfig | Manish Katiyar | 2009-07-27 | 1 | -1/+1
| | | | | Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Fix memory leak fix when mounting an ext4 filesystem | Aneesh Kumar K.V | 2009-07-17 | 1 | -19/+0
| | | | | | | | | | | | | | | | | The allocation of the ext4_group_info array was moved to a new function ext4_mb_add_group_info() in commit 5f21b0e6 so that online resize would use a common (and correct) codepath. Unfortunately, the call to the new ext4_mb_add_group_info() function was added without removing the code which originally allocated the array. This caused a memory leak each time an ext4 filesystem was mounted. The fix is simple; remove the code that did the original allocation, since it is no longer needed. Reported-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Remove syncing logic from ext4_file_write | Jan Kara | 2009-09-14 | 1 | -51/+2
| | | | | | | | | The syncing is now properly handled by generic_file_aio_write() so no special ext4 code is needed. CC: linux-ext4@vger.kernel.org CC: tytso@mit.edu Signed-off-by: Jan Kara <jack@suse.cz>
* ext[234]: move over to 'check_acl' permission model | Linus Torvalds | 2009-09-08 | 4 | -12/+6
| | | | | | | | | | Don't implement per-filesystem 'extX_permission()' functions that have to be called for every path component operation, and instead just expose the actual ACL checking so that the VFS layer can now do it for us. Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 | Linus Torvalds | 2009-07-13 | 8 | -301/+180
|\
| | * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
| |   jbd2: fix race between write_metadata_buffer and get_write_access
| |   ext4: Fix ext4_mb_initialize_context() to initialize all fields
| |   ext4: fix null handler of ioctls in no journal mode
| |   ext4: Fix buffer head reference leak in no-journal mode
| |   ext4: Move __ext4_journalled_writepage() to avoid forward declaration
| |   ext4: Fix mmap/truncate race when blocksize < pagesize && !nodellaoc
| |   ext4: Fix mmap/truncate race when blocksize < pagesize && delayed allocation
| |   ext4: Don't look at buffer_heads outside i_size.
| |   ext4: Fix goal inum check in the inode allocator
| |   ext4: fix no journal corruption with locale-gen
| |   ext4: Calculate required journal credits for inserting an extent properly
| |   ext4: Fix truncation of symlinks after failed write
| |   jbd2: Fix a race between checkpointing code and journal_get_write_access()
| |   ext4: Use rcu_barrier() on module unload.
| |   ext4: naturally align struct ext4_allocation_request
| |   ext4: mark several more functions in mballoc.c as noinline
| |   ext4: Fix potential reclaim deadlock when truncating partial block
| |   jbd2: Remove GFP_ATOMIC kmalloc from inside spinlock critical region
| |   ext4: Fix type warning on 64-bit platforms in tracing events header
| * ext4: Fix ext4_mb_initialize_context() to initialize all fields | Theodore Ts'o | 2009-07-13 | 1 | -18/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Pavel Roskin pointed out that kmemcheck indicated that ext4_mb_store_history() was accessing uninitialized values of ac->ac_tail and ac->ac_buddy leading to garbage in the mballoc history. Fix this by initializing the entire structure to all zeros first. Also, two fields were getting doubly initialized by the caller of ext4_mb_initialize_context, so remove them for efficiency's sake. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: fix null handler of ioctls in no journal mode | Peng Tao | 2009-07-13 | 1 | -8/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The EXT4_IOC_GROUP_ADD and EXT4_IOC_GROUP_EXTEND ioctls should not flush the journal in no_journal mode. Otherwise, running resize2fs on a mounted no_journal partition triggers the following error messages: BUG: unable to handle kernel NULL pointer dereference at 00000014 IP: [<c039d282>] _spin_lock+0x8/0x19 *pde = 00000000 Oops: 0002 [#1] SMP Signed-off-by: Peng Tao <bergwolf@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: Fix buffer head reference leak in no-journal mode | Curt Wohlgemuth | 2009-07-13 | 3 | -4/+8
| | We found a problem with buffer head reference leaks when using an ext4 partition without a journal. In particular, calls to ext4_forget() would not do a brelse() on the input buffer head, which prevents the pages they belong to from being reclaimed.
| |
| | Further investigation showed that all places where ext4_journal_forget() and ext4_journal_revoke() are called are subject to the same problem. This patch changes __ext4_journal_forget/__ext4_journal_revoke to do an explicit release of the buffer head when the journal handle isn't valid.
| |
| | Signed-off-by: Curt Wohlgemuth <curtw@google.com>
| | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: Move __ext4_journalled_writepage() to avoid forward declaration | Aneesh Kumar K.V | 2009-06-14 | 1 | -58/+54
| | | | | | | | | | | | | | In addition, fix two unused variable warnings. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: Fix mmap/truncate race when blocksize < pagesize && !nodellaoc | Aneesh Kumar K.V | 2009-06-14 | 1 | -177/+57
| | | | | | | | | | | | | | | | | | | | This patch fixes the mmap/truncate race that was fixed for delayed allocation by merging ext4_{journalled,normal,da}_writepage() into ext4_writepage(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: Fix mmap/truncate race when blocksize < pagesize && delayed allocation | Aneesh Kumar K.V | 2009-06-14 | 1 | -15/+8
| | It is possible to see buffer_heads which are not mapped in the writepage callback in the following scenario (where the fs blocksize is 1k and the page size is 4k):
| |
| | 1) truncate(f, 1024)
| | 2) mmap(f, 0, 4096)
| | 3) a[0] = 'a'
| | 4) truncate(f, 4096)
| | 5) writepage(...)
| |
| | Now if we get a writepage callback immediately after (4) and before any attempt to write at another offset through the mmap address (which implies we have yet to take a page fault and do a get_block), the dirty page will have its first block allocated and the other three buffer_heads unmapped.
| |
| | In the above case the writepage should go ahead and try to write the first block and clear the page_dirty flag. Further attempts to write to the page will again create a fault and result in allocating blocks and marking the page dirty. If we don't write to any other offset via the mmap address we would still have written the first block to the disk, and the rest of the space will be considered a hole.
| |
| | So to address this, we change all of the places where we look for delayed, unmapped, or unwritten buffer heads, and only check for delayed or unwritten buffer heads instead.
| |
| | Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
| | Acked-by: Jan Kara <jack@suse.cz>
| | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: Don't look at buffer_heads outside i_size. | Aneesh Kumar K.V | 2009-06-04 | 1 | -12/+16
| | | | | | | | | | | | | | | | | | | | | | Buffer heads outside i_size will be unmapped. So when we are doing "walk_page_buffers" limit ourself to i_size. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Josef Bacik <jbacik@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> ----
| * ext4: Fix goal inum check in the inode allocator | Johann Lombardi | 2009-07-05 | 1 | -1/+1
| | The goal inode is specified by inode number, which belongs to [1; s_inodes_count].
| |
| | Signed-off-by: Johann Lombardi <johann@sun.com>
| | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: fix no journal corruption with locale-gen | Theodore Ts'o | 2009-07-08 | 1 | -2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If there is no journal, ext4_should_writeback_data() should return TRUE. This will fix ext4_set_aops() to set ext4_da_ops in the case of delayed allocation; otherwise ext4_journaled_aops gets used by default, which doesn't handle delayed allocation properly. The advantage of using ext4_should_writeback_data() approach is that it should handle nobh better as well. Thanks to Curt Wohlgemuth for investigating this problem, and Aneesh Kumar for suggesting this approach. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: Calculate required journal credits for inserting an extent properly | Aneesh Kumar K.V | 2009-07-05 | 1 | -0/+1
| | | | | | | | | | | | | | | | | | | | When we have space in the extent tree leaf node we should be able to insert the extent with much less journal credits. The code was doing proper calculation but missed a return statement. Reported-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: Fix truncation of symlinks after failed write | Jan Kara | 2009-07-13 | 1 | -13/+13
| | Contents of long symlinks are written via standard write methods. So when the write fails, we add the inode to the orphan list. But symlinks don't have a .truncate method defined, so nobody properly removes them from the on-disk orphan list.
| |
| | Fix this by calling ext4_truncate() directly instead of calling vmtruncate() (which is saner anyway since we don't need anything vmtruncate() does apart from calling .truncate in these paths). We also add the inode to the orphan list only if ext4_can_truncate() is true (currently, it can be false for symlinks when there are no blocks allocated) - otherwise orphan list processing will complain and ext4_truncate() will not remove the inode from the on-disk orphan list.
| |
| | Signed-off-by: Jan Kara <jack@suse.cz>
| | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: Use rcu_barrier() on module unload. | Jesper Dangaard Brouer | 2009-07-05 | 1 | -1/+5
| | The ext4 module uses call_rcu(), so it should call rcu_barrier() on module unload.
| |
| | Objects from the kmem cache ext4_pspace_cachep are sometimes freed from call_rcu() callbacks. Thus, we must wait for those callbacks to complete before doing kmem_cache_destroy().
| |
| | Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
| | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: naturally align struct ext4_allocation_request | Eric Sandeen | 2009-07-13 | 1 | -7/+7
| | As Ted noted, the ext4_allocation_request isn't well aligned. Looking at it with pahole we're wasting space on 64-bit arches:
| |
| | struct ext4_allocation_request {
| |     struct inode * inode;    /*  0  8 */
| |     ext4_lblk_t    logical;  /*  8  4 */
| |
| |     /* XXX 4 bytes hole, try to pack */
| |
| |     ext4_fsblk_t   goal;     /* 16  8 */
| |     ext4_lblk_t    lleft;    /* 24  4 */
| |
| |     /* XXX 4 bytes hole, try to pack */
| |
| |     ext4_fsblk_t   pleft;    /* 32  8 */
| |     ext4_lblk_t    lright;   /* 40  4 */
| |
| |     /* XXX 4 bytes hole, try to pack */
| |
| |     ext4_fsblk_t   pright;   /* 48  8 */
| |     unsigned int   len;      /* 56  4 */
| |     unsigned int   flags;    /* 60  4 */
| |     /* --- cacheline 1 boundary (64 bytes) --- */
| |
| |     /* size: 64, cachelines: 1, members: 9 */
| |     /* sum members: 52, holes: 3, sum holes: 12 */
| | };
| |
| | Grouping 32-bit members together closes these holes and shrinks the structure by 12 bytes, which is important since ext4 can get on the hairy edge of stack overruns.
| |
| | Signed-off-by: Eric Sandeen <sandeen@redhat.com>
| | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
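The effect of grouping the 32-bit members can be reproduced with a small user-space sketch. Both structs below are hypothetical: they use fixed-width integers in place of the kernel types (a 64-bit integer stands in for the inode pointer), and the grouped order is just one possible arrangement, not necessarily the one the patch chose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original-style ordering: 32-bit and 64-bit members alternate. */
struct req_original {
    uint64_t inode;    /* stand-in for the struct inode pointer */
    uint32_t logical;
    uint64_t goal;
    uint32_t lleft;
    uint64_t pleft;
    uint32_t lright;
    uint64_t pright;
    uint32_t len;
    uint32_t flags;
};

/* Grouped ordering: the 32-bit members sit next to each other. */
struct req_grouped {
    uint64_t inode;
    uint32_t len;
    uint32_t logical;
    uint32_t lleft;
    uint32_t lright;
    uint64_t goal;
    uint64_t pleft;
    uint64_t pright;
    uint32_t flags;
};

/* Bytes saved by the grouped layout on the ABI this is compiled for. */
size_t bytes_saved(void)
{
    assert(sizeof(struct req_original) >= sizeof(struct req_grouped));
    return sizeof(struct req_original) - sizeof(struct req_grouped);
}
```

On a typical LP64 ABI where 64-bit members are 8-byte aligned, this sketch comes out at 64 and 56 bytes: the three internal holes close, while the grouped layout still ends with 4 bytes of tail padding. The exact figures depend on the ABI.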
| * ext4: mark several more functions in mballoc.c as noinline | Eric Sandeen | 2009-07-05 | 1 | -8/+16
| | Ted noticed a stack-deep callchain through writepages->ext4_mb_regular_allocator->ext4_mb_init_cache->submit_bh ...
| |
| | With all the static functions in mballoc.c, gcc helpfully inlines for us, and we get something like this:
| |
| | ext4_mb_regular_allocator (232 bytes stack)
| | ext4_mb_init_cache (232 bytes stack)
| | submit_bh (starts 464 deeper)
| |
| | The 2 ext4 functions here get several others inlined; by telling gcc not to inline them, we can save stack space for when we head off into submit_bh land and associated block layer callchains.
| |
| | The following noinlined functions are only called once, so this won't impact any other callchains:
| |
| | ext4_mb_regular_allocator (104) (was 232)
| | ext4_mb_find_by_goal (56) (noinlined)
| | ext4_mb_init_group (24) (noinlined)
| | ext4_mb_init_cache (136) (was 232)
| | ext4_mb_generate_buddy (88) (noinlined)
| | ext4_mb_generate_from_pa (40) (noinlined)
| | submit_bh
| | ext4_mb_simple_scan_group (24) (noinlined)
| | ext4_mb_scan_aligned (56) (noinlined)
| | ext4_mb_complex_scan_group (40) (noinlined)
| | ext4_mb_try_best_found (24) (noinlined)
| |
| | Now when we head off into submit_bh() we're only 264 bytes deeper in stack than when we entered ext4_mb_regular_allocator() (vs. 464 bytes before). Every 200 bytes helps. :)
| |
| | Signed-off-by: Eric Sandeen <sandeen@redhat.com>
| | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * ext4: Fix potential reclaim deadlock when truncating partial blockTheodore Ts'o2009-07-051-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ext4_block_truncate_page() function previously called grab_cache_page(), which called find_or_create_page() with the __GFP_FS flag potentially set. This could cause a deadlock if the system is low on memory and it attempts a memory reclaim, which could potentially call back into ext4. So we need to call find_or_create_page() directly, and remove the __GFP_FP flag to avoid this potential deadlock. Thanks to Roland Dreier for reporting a lockdep warning which showed this problem. [20786.363249] ================================= [20786.363257] [ INFO: inconsistent lock state ] [20786.363265] 2.6.31-2-generic #14~rbd4gitd960eea9 [20786.363270] --------------------------------- [20786.363276] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage. 
[20786.363285] http/8397 [HC0[0]:SC0[0]:HE1:SE1] takes: [20786.363291] (jbd2_handle){+.+.?.}, at: [<ffffffff812008bb>] jbd2_journal_start+0xdb/0x150 [20786.363314] {IN-RECLAIM_FS-W} state was registered at: [20786.363320] [<ffffffff8108bef6>] mark_irqflags+0xc6/0x1a0 [20786.363334] [<ffffffff8108d347>] __lock_acquire+0x287/0x430 [20786.363345] [<ffffffff8108d595>] lock_acquire+0xa5/0x150 [20786.363355] [<ffffffff812008da>] jbd2_journal_start+0xfa/0x150 [20786.363365] [<ffffffff811d98a8>] ext4_journal_start_sb+0x58/0x90 [20786.363377] [<ffffffff811cce85>] ext4_delete_inode+0xc5/0x2c0 [20786.363389] [<ffffffff81146fa3>] generic_delete_inode+0xd3/0x1a0 [20786.363401] [<ffffffff81147095>] generic_drop_inode+0x25/0x30 [20786.363411] [<ffffffff81145ce2>] iput+0x62/0x70 [20786.363420] [<ffffffff81142878>] dentry_iput+0x98/0x110 [20786.363429] [<ffffffff81142a00>] d_kill+0x50/0x80 [20786.363438] [<ffffffff811444c5>] dput+0x95/0x180 [20786.363447] [<ffffffff8120de4b>] ecryptfs_d_release+0x2b/0x70 [20786.363459] [<ffffffff81142978>] d_free+0x28/0x60 [20786.363468] [<ffffffff81142a18>] d_kill+0x68/0x80 [20786.363477] [<ffffffff81142ad3>] prune_one_dentry+0xa3/0xc0 [20786.363487] [<ffffffff81142d61>] __shrink_dcache_sb+0x271/0x290 [20786.363497] [<ffffffff81142e89>] prune_dcache+0x109/0x1b0 [20786.363506] [<ffffffff81142f6f>] shrink_dcache_memory+0x3f/0x50 [20786.363516] [<ffffffff810f6d3d>] shrink_slab+0x12d/0x190 [20786.363527] [<ffffffff810f97d7>] balance_pgdat+0x4d7/0x640 [20786.363537] [<ffffffff810f9a57>] kswapd+0x117/0x170 [20786.363546] [<ffffffff810773ce>] kthread+0x9e/0xb0 [20786.363558] [<ffffffff8101430a>] child_rip+0xa/0x20 [20786.363569] [<ffffffffffffffff>] 0xffffffffffffffff [20786.363598] irq event stamp: 15997 [20786.363603] hardirqs last enabled at (15997): [<ffffffff81125f9d>] kmem_cache_alloc+0xfd/0x1a0 [20786.363617] hardirqs last disabled at (15996): [<ffffffff81125f01>] kmem_cache_alloc+0x61/0x1a0 [20786.363628] softirqs last enabled at (15966): 
[<ffffffff810631ea>] __do_softirq+0x14a/0x220 [20786.363641] softirqs last disabled at (15861): [<ffffffff8101440c>] call_softirq+0x1c/0x30 [20786.363651] [20786.363653] other info that might help us debug this: [20786.363660] 3 locks held by http/8397: [20786.363665] #0: (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [<ffffffff8112ed24>] do_truncate+0x64/0x90 [20786.363685] #1: (&sb->s_type->i_alloc_sem_key#5){+++++.}, at: [<ffffffff81147f90>] notify_change+0x250/0x350 [20786.363707] #2: (jbd2_handle){+.+.?.}, at: [<ffffffff812008bb>] jbd2_journal_start+0xdb/0x150 [20786.363724] [20786.363726] stack backtrace: [20786.363734] Pid: 8397, comm: http Tainted: G C 2.6.31-2-generic #14~rbd4gitd960eea9 [20786.363741] Call Trace: [20786.363752] [<ffffffff8108ad7c>] print_usage_bug+0x18c/0x1a0 [20786.363763] [<ffffffff8108b0c0>] ? check_usage_backwards+0x0/0xb0 [20786.363773] [<ffffffff8108bad2>] mark_lock_irq+0xf2/0x280 [20786.363783] [<ffffffff8108bd97>] mark_lock+0x137/0x1d0 [20786.363793] [<ffffffff8108c03c>] mark_held_locks+0x6c/0xa0 [20786.363803] [<ffffffff8108c11f>] lockdep_trace_alloc+0xaf/0xe0 [20786.363813] [<ffffffff810efbac>] __alloc_pages_nodemask+0x7c/0x180 [20786.363824] [<ffffffff810e9411>] ? find_get_page+0x91/0xf0 [20786.363835] [<ffffffff8111d3b7>] alloc_pages_current+0x87/0xd0 [20786.363845] [<ffffffff810e9827>] __page_cache_alloc+0x67/0x70 [20786.363856] [<ffffffff810eb7df>] find_or_create_page+0x4f/0xb0 [20786.363867] [<ffffffff811cb3be>] ext4_block_truncate_page+0x3e/0x460 [20786.363876] [<ffffffff812008da>] ? jbd2_journal_start+0xfa/0x150 [20786.363885] [<ffffffff812008bb>] ? jbd2_journal_start+0xdb/0x150 [20786.363895] [<ffffffff811c6415>] ? ext4_meta_trans_blocks+0x75/0xf0 [20786.363905] [<ffffffff811e8d8b>] ext4_ext_truncate+0x1bb/0x1e0 [20786.363916] [<ffffffff811072c5>] ? unmap_mapping_range+0x75/0x290 [20786.363926] [<ffffffff811ccc28>] ext4_truncate+0x498/0x630 [20786.363938] [<ffffffff8129b4ce>] ? 
_raw_spin_unlock+0x5e/0xb0 [20786.363947] [<ffffffff81107306>] ? unmap_mapping_range+0xb6/0x290 [20786.363957] [<ffffffff8108c3ad>] ? trace_hardirqs_on+0xd/0x10 [20786.363966] [<ffffffff811ffe58>] ? jbd2_journal_stop+0x1f8/0x2e0 [20786.363976] [<ffffffff81107690>] vmtruncate+0xb0/0x110 [20786.363986] [<ffffffff81147c05>] inode_setattr+0x35/0x170 [20786.363995] [<ffffffff811c9906>] ext4_setattr+0x186/0x370 [20786.364005] [<ffffffff81147eab>] notify_change+0x16b/0x350 [20786.364014] [<ffffffff8112ed30>] do_truncate+0x70/0x90 [20786.364021] [<ffffffff8112f48b>] T.657+0xeb/0x110 [20786.364021] [<ffffffff8112f4be>] sys_ftruncate+0xe/0x10 [20786.364021] [<ffffffff81013132>] system_call_fastpath+0x16/0x1b Reported-by: Roland Dreier <roland@digitalvampire.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
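The fix described in the commit message above comes down to clearing one bit in the allocation mask before the page-cache allocation is made. Inside the kernel this would presumably look like passing `mapping_gfp_mask(mapping) & ~__GFP_FS` to find_or_create_page(). The userspace sketch below only models that masking step: the `MODEL_GFP_*` values, `gfp_without_fs()` and `reclaim_may_enter_fs()` are hypothetical names invented for illustration and are not the kernel's definitions.

```c
/* Userspace model only: the flag values below are made up for this sketch
 * and do not match the kernel's real GFP bit assignments. */
typedef unsigned int gfp_t;

#define MODEL_GFP_WAIT 0x10u  /* allocation may sleep */
#define MODEL_GFP_IO   0x40u  /* allocation may start block I/O */
#define MODEL_GFP_FS   0x80u  /* reclaim may call back into the filesystem */
#define MODEL_GFP_USER (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)

/* Clear the FS bit so that reclaim triggered by this allocation cannot
 * re-enter filesystem code while a journal handle is held. */
static inline gfp_t gfp_without_fs(gfp_t mapping_mask)
{
    return mapping_mask & ~MODEL_GFP_FS;
}

/* Model of the reclaim-side check: nonzero if reclaim is allowed to
 * call into the filesystem for an allocation made with this mask. */
static inline int reclaim_may_enter_fs(gfp_t mask)
{
    return (mask & MODEL_GFP_FS) != 0;
}
```

With the FS bit cleared, the other bits of the mapping's mask are left alone, so the allocation can still sleep and start I/O; only the recursion into the filesystem is ruled out.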
* | headers: smp_lock.h reduxAlexey Dobriyan2009-07-121-1/+0
|/ | | | | | | | | | | | | * Remove smp_lock.h from files which don't need it (including some headers!) * Add smp_lock.h to files which do need it * Make smp_lock.h include conditional in hardirq.h. It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT. This will make hardirq.h inclusion cheaper for every PREEMPT=n config (which includes allmodconfig/allyesconfig, BTW). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* helpers for acl caching + switch to thoseAl Viro2009-06-241-56/+9
| | | | | | | | | helpers: get_cached_acl(inode, type), set_cached_acl(inode, type, acl), forget_cached_acl(inode, type). ubifs/xattr.c needed includes reordered, the rest is a plain switchover. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
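The three helpers named above suggest a simple lookup pattern: consult the per-inode cache, fall back to the on-disk copy on a miss, then store the result. The userspace sketch below models that pattern under stated assumptions. Every `model_*` name, the two-slot layout, and the "not cached" sentinel are hypothetical stand-ins; the real helpers operate on `struct inode` and `struct posix_acl` and also handle locking and reference counting, which this sketch omits.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; illustration only. */
enum model_acl_type { MODEL_ACL_ACCESS = 0, MODEL_ACL_DEFAULT = 1 };

struct model_acl { int entries; };

/* Sentinel meaning "nothing cached yet". It is distinct from NULL,
 * which here means "cached, and the inode has no ACL of this type". */
static struct model_acl model_not_cached_marker;
#define MODEL_ACL_NOT_CACHED (&model_not_cached_marker)

struct model_inode {
    struct model_acl *cached[2]; /* one slot per ACL type */
    int backend_reads;           /* counts simulated disk reads */
};

static inline void model_inode_init(struct model_inode *inode)
{
    inode->cached[MODEL_ACL_ACCESS] = MODEL_ACL_NOT_CACHED;
    inode->cached[MODEL_ACL_DEFAULT] = MODEL_ACL_NOT_CACHED;
    inode->backend_reads = 0;
}

static inline struct model_acl *
model_get_cached_acl(struct model_inode *inode, enum model_acl_type type)
{
    return inode->cached[type];
}

static inline void
model_set_cached_acl(struct model_inode *inode, enum model_acl_type type,
                     struct model_acl *acl)
{
    inode->cached[type] = acl;
}

static inline void
model_forget_cached_acl(struct model_inode *inode, enum model_acl_type type)
{
    inode->cached[type] = MODEL_ACL_NOT_CACHED;
}

/* Simulated backing store: an access ACL exists, a default ACL does not. */
static struct model_acl model_stored_acl = { 3 };

static inline struct model_acl *
model_read_acl_from_disk(struct model_inode *inode, enum model_acl_type type)
{
    inode->backend_reads++;
    return type == MODEL_ACL_ACCESS ? &model_stored_acl : NULL;
}

/* Lookup through the cache: only a miss reaches the backing store. */
static inline struct model_acl *
model_get_acl(struct model_inode *inode, enum model_acl_type type)
{
    struct model_acl *acl = model_get_cached_acl(inode, type);

    if (acl != MODEL_ACL_NOT_CACHED)
        return acl;
    acl = model_read_acl_from_disk(inode, type);
    model_set_cached_acl(inode, type, acl);
    return acl;
}
```

Keeping the sentinel separate from NULL lets a negative result ("this inode has no ACL") be cached as well, so repeated lookups on ACL-less inodes do not go back to disk.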
* switch ext4 to inode->i_aclAl Viro2009-06-245-40/+10
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>