summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osdLinus Torvalds2012-05-284-2/+230
|\ | | | | | | | | | | | | | | | | | | | | | | Pull exofs updates from Boaz Harrosh: "Just a couple of patches. The first is a BUG fix destined for stable which missed the 3.4-rc7 Kernel. The second is just a fixture addition so exofs is able to be better exported as a cluster file system via pNFS." * 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: Add SYSFS info for autologin/pNFS export exofs: Fix CRASH on very early IO errors.
| * exofs: Add SYSFS info for autologin/pNFS exportSachin Bhamare2012-05-214-1/+229
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce sysfs infrastructure for exofs cluster filesystem. Each OSD target shows up as below in the sysfs hierarchy: /sys/fs/exofs/<osdname>_<partition_id>/devX Where <osdname>_<partition_id> is the unique identification of a Superblock. Where devX: 0 <= X < device_table_size. They are ordered in device-table order as specified to the mkfs.exofs command Each OSD device devX has following attributes : osdname - ReadOnly systemid - ReadOnly uri - Read/Write It is up to user-mode to update devX/uri for support of autologin. These sysfs information are used both for autologin as well as support for exporting exofs via a pNFSD server in user-mode. (.eg NFS-Ganesha) Signed-off-by: Sachin Bhamare <sbhamare@panasas.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
| * exofs: Fix CRASH on very early IO errors.Boaz Harrosh2012-05-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If at exofs_fill_super() we had an early termination do to any error, like an IO error while reading the super-block. We would crash inside exofs_free_sbi(). This is because sbi->oc.numdevs was set to 1, before we actually have a device table at all. Fix it by moving the sbi->oc.numdevs = 1 to after the allocation of the device table. Reported-by: Johannes Schild <JSchild@gmx.de> Stable: This is a bug since v3.2.0 CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* | Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linuxLinus Torvalds2012-05-2847-186/+267
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull writeback tree from Wu Fengguang: "Mainly from Jan Kara to avoid iput() in the flusher threads." * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Avoid iput() from flusher thread vfs: Rename end_writeback() to clear_inode() vfs: Move waiting for inode writeback from end_writeback() to evict_inode() writeback: Refactor writeback_single_inode() writeback: Remove wb->list_lock from writeback_single_inode() writeback: Separate inode requeueing after writeback writeback: Move I_DIRTY_PAGES handling writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() writeback: Move clearing of I_SYNC into inode_sync_complete() writeback: initialize global_dirty_limit fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds mm: page-writeback.c: local functions should not be exposed globally
| * | writeback: Avoid iput() from flusher threadJan Kara2012-05-062-14/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Doing iput() from flusher thread (writeback_sb_inodes()) can create problems because iput() can do a lot of work - for example truncate the inode if it's the last iput on unlinked file. Some filesystems depend on flusher thread progressing (e.g. because they need to flush delay allocated blocks to reduce allocation uncertainty) and so flusher thread doing truncate creates interesting dependencies and possibilities for deadlocks. We get rid of iput() in flusher thread by using the fact that I_SYNC inode flag effectively pins the inode in memory. So if we take care to either hold i_lock or have I_SYNC set, we can get away without taking inode reference in writeback_sb_inodes(). As a side effect of these changes, we also fix possible use-after-free in wb_writeback() because inode_wait_for_writeback() call could try to reacquire i_lock on the inode that was already free. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
| * | vfs: Rename end_writeback() to clear_inode()Jan Kara2012-05-0646-54/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | After we moved inode_sync_wait() from end_writeback() it doesn't make sense to call the function end_writeback() anymore. Rename it to clear_inode() which well says what the function really does - set I_CLEAR flag. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
| * | vfs: Move waiting for inode writeback from end_writeback() to evict_inode()Jan Kara2012-05-061-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, I_SYNC can never be set when evict_inode() (and thus end_writeback()) is called because flusher thread holds inode reference while inode is under writeback. As a result inode_sync_wait() in those places currently does nothing. However that is going to change and unveils problems with calling inode_sync_wait() from end_writeback(). Several filesystems call end_writeback() after they have deleted the inode (btrfs, gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when waiting for I_SYNC because they call end_writeback() from within a transaction. To avoid these issues, we move inode_sync_wait() into evict_inode() before calling ->evict_inode(). That way we preserve the current property that ->evict_inode() and writeback never run in parallel and all filesystems are safe. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
| * | writeback: Refactor writeback_single_inode()Jan Kara2012-05-061-56/+86
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code in writeback_single_inode() is relatively complex. The list requeing logic makes sense only for flusher thread but not really for sync_inode() or write_inode_now() callers. Also when we want to get rid of inode references held by flusher thread, we will need a special I_SYNC handling there. So separate part of writeback_single_inode() which does the real writeback work into __writeback_single_inode() and make writeback_single_inode() do only stuff necessary for callers writing only one inode, moving the special list handling into writeback_sb_inodes(). As a sideeffect this fixes a possible race where we could skip some inode during sync(2) because other writer refiled it from b_io to b_dirty list. Also I_SYNC handling is moved into the callers of __writeback_single_inode() to make locking easier. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
| * | writeback: Remove wb->list_lock from writeback_single_inode()Jan Kara2012-05-061-10/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | writeback_single_inode() doesn't need wb->list_lock for anything on entry now. So remove the requirement. This makes locking of writeback_single_inode() temporarily awkward (entering with i_lock, returning with i_lock and wb->list_lock) but it will be sanitized in the next patch. Also inode_wait_for_writeback() doesn't need wb->list_lock for anything. It was just taking it to make usage convenient for callers but with writeback_single_inode() changing it's not very convenient anymore. So remove the lock from that function. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
| * | writeback: Separate inode requeueing after writebackJan Kara2012-05-061-47/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | Move inode requeueing after inode has been written out into a separate function. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
| * | writeback: Move I_DIRTY_PAGES handlingJan Kara2012-05-061-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of clearing I_DIRTY_PAGES and resetting it when we didn't succeed in writing them all, just clear the bit only when we succeeded writing all the pages. We also move the clearing of the bit close to other i_state handling to separate it from writeback list handling. This is desirable because list handling will differ for flusher thread and other writeback_single_inode() callers in future. No filesystem plays any tricks with I_DIRTY_PAGES (like checking it in ->writepages or ->write_inode implementation) so this movement is safe. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
| * | writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()Jan Kara2012-05-061-14/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When writeback_single_inode() is called on inode which has I_SYNC already set while doing WB_SYNC_NONE, inode is moved to b_more_io list. However this makes sense only if the caller is flusher thread. For other callers of writeback_single_inode() it doesn't really make sense and may be even wrong - flusher thread may be doing WB_SYNC_ALL writeback in parallel. So we move requeueing from writeback_single_inode() to writeback_sb_inodes(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
| * | writeback: Move clearing of I_SYNC into inode_sync_complete()Jan Kara2012-05-061-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move clearing of I_SYNC into inode_sync_complete(). It is more logical to have clearing of I_SYNC bit and waking of waiters in one place. Also later we will have two places needing to clear I_SYNC and wake up waiters so this allows them to use the common helper. Moving of I_SYNC clearing to a later stage of writeback_single_inode() is safe since we hold i_lock all the time. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
* | | word-at-a-time: make the interfaces truly genericLinus Torvalds2012-05-261-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This changes the interfaces in <asm/word-at-a-time.h> to be a bit more complicated, but a lot more generic. In particular, it allows us to really do the operations efficiently on both little-endian and big-endian machines, pretty much regardless of machine details. For example, if you can rely on a fast population count instruction on your architecture, this will allow you to make your optimized <asm/word-at-a-time.h> file with that. NOTE! The "generic" version in include/asm-generic/word-at-a-time.h is not truly generic, it actually only works on big-endian. Why? Because on little-endian the generic algorithms are wasteful, since you can inevitably do better. The x86 implementation is an example of that. (The only truly non-generic part of the asm-generic implementation is the "find_zero()" function, and you could make a little-endian version of it. And if the Kbuild infrastructure allowed us to pick a particular header file, that would be lovely) The <asm/word-at-a-time.h> functions are as follows: - WORD_AT_A_TIME_CONSTANTS: specific constants that the algorithm uses. - has_zero(): take a word, and determine if it has a zero byte in it. It gets the word, the pointer to the constant pool, and a pointer to an intermediate "data" field it can set. This is the "quick-and-dirty" zero tester: it's what is run inside the hot loops. - "prep_zero_mask()": take the word, the data that has_zero() produced, and the constant pool, and generate an *exact* mask of which byte had the first zero. This is run directly *outside* the loop, and allows the "has_zero()" function to answer the "is there a zero byte" question without necessarily getting exactly *which* byte is the first one to contain a zero. If you do multiple byte lookups concurrently (eg "hash_name()", which looks for both NUL and '/' bytes), after you've done the prep_zero_mask() phase, the result of those can be or'ed together to get the "either or" case. - The result from "prep_zero_mask()" can then be fed into "find_zero()" (to find the byte offset of the first byte that was zero) or into "zero_bytemask()" (to find the bytemask of the bytes preceding the zero byte). The existence of zero_bytemask() is optional, and is not necessary for the normal string routines. But dentry name hashing needs it, so if you enable DENTRY_WORD_AT_A_TIME you need to expose it. This changes the generic strncpy_from_user() function and the dentry hashing functions to use these modified word-at-a-time interfaces. This gets us back to the optimized state of the x86 strncpy that we lost in the previous commit when moving over to the generic version. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'for_linus' of ↵Linus Torvalds2012-05-2516-211/+313
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2, ext3 and quota fixes from Jan Kara: "Interesting bits are: - removal of a special i_mutex locking subclass (I_MUTEX_QUOTA) since quota code does not need i_mutex anymore in any unusual way. - backport (from ext4) of a fix of a checkpointing bug (missing cache flush) that could lead to fs corruption on power failure The rest are just random small fixes & cleanups." * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: trivial fix to comment for ext2_free_blocks ext2: remove the redundant comment for ext2_export_ops ext3: return 32/64-bit dir name hash according to usage type quota: Get rid of nested I_MUTEX_QUOTA locking subclass quota: Use precomputed value of sb_dqopt in dquot_quota_sync ext2: Remove i_mutex use from ext2_quota_write() reiserfs: Remove i_mutex use from reiserfs_quota_write() ext4: Remove i_mutex use from ext4_quota_write() ext3: Remove i_mutex use from ext3_quota_write() quota: Fix double lock in add_dquot_ref() with CONFIG_QUOTA_DEBUG jbd: Write journal superblock with WRITE_FUA after checkpointing jbd: protect all log tail updates with j_checkpoint_mutex jbd: Split updating of journal superblock and marking journal empty ext2: do not register write_super within VFS ext2: Remove s_dirt handling ext2: write superblock only once on unmount ext3: update documentation with barrier=1 default ext3: remove max_debt in find_group_orlov() jbd: Refine commit writeout logic
| * | | ext2: trivial fix to comment for ext2_free_blocksWang Sheng-Hui2012-05-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The function is ext2_free_blocks(), not ext2_free_blocks_sb(). Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | ext2: remove the redundant comment for ext2_export_opsWang Sheng-Hui2012-05-151-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The comment is outdated and isn't particularly informative anyway - NULL meaning the default behavior is very common in kernel. And we really set about half of entries. So remove the whole comment for ext2_export_ops. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | ext3: return 32/64-bit dir name hash according to usage typeEric Sandeen2012-05-153-48/+129
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is based on commit d1f5273e9adb40724a85272f248f210dc4ce919a ext4: return 32/64-bit dir name hash according to usage type by Fan Yong <yong.fan@whamcloud.com> Traditionally ext2/3/4 has returned a 32-bit hash value from llseek() to appease NFSv2, which can only handle a 32-bit cookie for seekdir() and telldir(). However, this causes problems if there are 32-bit hash collisions, since the NFSv2 server can get stuck resending the same entries from the directory repeatedly. Allow ext3 to return a full 64-bit hash (both major and minor) for telldir to decrease the chance of hash collisions. This patch does implement a new ext3_dir_llseek op, because with 64-bit hashes, nfs will attempt to seek to a hash "offset" which is much larger than ext3's s_maxbytes. So for dx dirs, we call generic_file_llseek_size() with the appropriate max hash value as the maximum seekable size. Otherwise we just pass through to generic_file_llseek(). Patch-updated-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Patch-updated-by: Eric Sandeen <sandeen@redhat.com> (blame us if something is not correct) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | quota: Get rid of nested I_MUTEX_QUOTA locking subclassJan Kara2012-05-151-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So far i_mutex was ranking above dqonoff_mutex and i_mutex on quota files was special and ranking below dqonoff_mutex (and several other locks). However there's no real need for i_mutex on quota files to be special. IO on quota files is serialized by dqio_mutex anyway so we don't need to take i_mutex when writing to quota files. Other places where we take i_mutex on quota file can accomodate standard i_mutex lock ranking, we only need to change the lock ranking to be dqonoff_mutex > i_mutex which is a matter of changing documentation because there's no place which would enforce ordering in the other direction. Signed-off-by: Jan Kara <jack@suse.cz>
| * | | quota: Use precomputed value of sb_dqopt in dquot_quota_syncJan Kara2012-05-151-6/+6
| | | | | | | | | | | | | | | | Signed-off-by: Jan Kara <jack@suse.cz>
| * | | ext2: Remove i_mutex use from ext2_quota_write()Jan Kara2012-05-151-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't need i_mutex in ext2_quota_write() because writes to quota file are serialized by dqio_mutex anyway. Changes to quota files outside of quota code are forbidded and enforced by NOATIME and IMMUTABLE bits. Signed-off-by: Jan Kara <jack@suse.cz>
| * | | reiserfs: Remove i_mutex use from reiserfs_quota_write()Jan Kara2012-05-151-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't need i_mutex in reiserfs_quota_write() because writes to quota file are serialized by dqio_mutex anyway. Changes to quota files outside of quota code are forbidded and enforced by NOATIME and IMMUTABLE bits. Signed-off-by: Jan Kara <jack@suse.cz>
| * | | ext4: Remove i_mutex use from ext4_quota_write()Jan Kara2012-05-151-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't need i_mutex in ext4_quota_write() because writes to quota file are serialized by dqio_mutex anyway. Changes to quota files outside of quota code are forbidded and enforced by NOATIME and IMMUTABLE bits. Signed-off-by: Jan Kara <jack@suse.cz>
| * | | ext3: Remove i_mutex use from ext3_quota_write()Jan Kara2012-05-151-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't need i_mutex in ext3_quota_write() because writes to quota file are serialized by dqio_mutex anyway. Changes to quota files outside of quota code are forbidded and enforced by NOATIME and IMMUTABLE bits. Signed-off-by: Jan Kara <jack@suse.cz>
| * | | quota: Fix double lock in add_dquot_ref() with CONFIG_QUOTA_DEBUGJan Kara2012-05-151-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_QUOTA_DEBUG is enabled we call inode_get_rsv_space() from add_dquot_ref() while holding i_lock. But inode_get_rsv_space() is trying to get i_lock as well resulting in double lock. Fix the problem by moving inode_get_rsv_space() call out of i_lock. Reported-and-analyzed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | jbd: Write journal superblock with WRITE_FUA after checkpointingJan Kara2012-05-153-35/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If journal superblock is written only in disk's caches and other transaction starts reusing space of the transaction cleaned from the log, it can happen blocks of a new transaction reach the disk before journal superblock. When power failure happens in such case, subsequent journal replay would still try to replay the old transaction but some of it's blocks may be already overwritten by the new transaction. For this reason we must use WRITE_FUA when updating log tail and we must first write new log tail to disk and update in-memory information only after that. Signed-off-by: Jan Kara <jack@suse.cz>
| * | | jbd: protect all log tail updates with j_checkpoint_mutexJan Kara2012-05-152-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are some log tail updates that are not protected by j_checkpoint_mutex. Some of these are harmless because they happen during startup or shutdown but updates in journal_commit_transaction() and journal_flush() can really race with other log tail updates (e.g. someone doing journal_flush() with someone running cleanup_journal_tail()). So protect all log tail updates with j_checkpoint_mutex. Signed-off-by: Jan Kara <jack@suse.cz>
| * | | jbd: Split updating of journal superblock and marking journal emptyJan Kara2012-05-153-69/+99
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are three case of updating journal superblock. In the first case, we want to mark journal as empty (setting s_sequence to 0), in the second case we want to update log tail, in the third case we want to update s_errno. Split these cases into separate functions. It makes the code slightly more straightforward and later patches will make the distinction even more important. Signed-off-by: Jan Kara <jack@suse.cz>
| * | | ext2: do not register write_super within VFSArtem Bityutskiy2012-04-111-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jan Kara removed 'sb->s_dirt' VFS flag references, so we do not need to register the ext2 'ext2_write_super()' method in the VFS superblock operations, because 'sb->s_dirt' won't be ever set to 1 and VFS won't ever call '->write_super()' anyway. Thus, remove the method. Tested using xfstests. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | ext2: Remove s_dirt handlingJan Kara2012-04-114-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Places which modify superblock feature / state fields mark the superblock buffer dirty so it is written out by flusher thread. Thus there's no need to set s_dirt there. The only other fields changing in the superblock are the numbers of free blocks, free inodes and s_wtime. There's no real need to write (or even compute) these periodically. Free blocks / inodes counters are recomputed on every mount from group counters anyway and value of s_wtime is only informational and imprecise anyway. So it should be enough to write these opportunistically on mount, remount, umount, and sync_fs times. Signed-off-by: Jan Kara <jack@suse.cz>
| * | | ext2: write superblock only once on unmountArtem Bityutskiy2012-04-111-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently on unmount if we are mounted R/W, we first write the superblock to the media if it is dirty, and then write it again, which is not optimal. This patch makes ext2 write the superblock on unmount less times. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | ext3: remove max_debt in find_group_orlov()Akira Fujita2012-04-111-18/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | max_debt, involved variables and calculations are no longer needed, clean them up. Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | jbd: Refine commit writeout logicJan Kara2012-04-113-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we write out all journal buffers in WRITE_SYNC mode. This improves performance for fsync heavy workloads but hinders performance when writes are mostly asynchronous, most noticably it slows down readers and users complain about slow desktop response etc. So submit writes as asynchronous in the normal case and only submit writes as WRITE_SYNC if we detect someone is waiting for current transaction commit. I've gathered some numbers to back this change. The first is the read latency test. It measures time to read 1 MB after several seconds of sleeping in presence of streaming writes. Top 10 times (out of 90) in us: Before After 2131586 697473 1709932 557487 1564598 535642 1480462 347573 1478579 323153 1408496 222181 1388960 181273 1329565 181070 1252486 172832 1223265 172278 Average: 619377 82180 So the improvement in both maximum and average latency is massive. I've measured fsync throughput by: fs_mark -n 100 -t 1 -s 16384 -d /mnt/fsync/ -S 1 -L 4 in presence of streaming reader. The numbers (fsyncs/s) are: Before After 9.9 6.3 6.8 6.0 6.3 6.2 5.8 6.1 So fsync performance seems unharmed by this change. Signed-off-by: Jan Kara <jack@suse.cz>
* | | | Merge tag 'stable/for-linus-3.5-rc0-tag' of ↵Linus Torvalds2012-05-241-0/+128
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull Xen updates from Konrad Rzeszutek Wilk: "Features: * Extend the APIC ops implementation and add IRQ_WORKER vector support so that 'perf' can work properly. * Fix self-ballooning code, and balloon logic when booting as initial domain. * Move array printing code to generic debugfs * Support XenBus domains. * Lazily free grants when a domain is dead/non-existent. * In M2P code use batching calls Bug-fixes: * Fix NULL dereference in allocation failure path (hvc_xen) * Fix unbinding of IRQ_WORKER vector during vCPU hot-unplug * Fix HVM guest resume - we would leak an PIRQ value instead of reusing the existing one." Fix up add-add onflicts in arch/x86/xen/enlighten.c due to addition of apic ipi interface next to the new apic_id functions. * tag 'stable/for-linus-3.5-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen: do not map the same GSI twice in PVHVM guests. hvc_xen: NULL dereference on allocation failure xen: Add selfballoning memory reservation tunable. xenbus: Add support for xenbus backend in stub domain xen/smp: unbind irqworkX when unplugging vCPUs. xen: enter/exit lazy_mmu_mode around m2p_override calls xen/acpi/sleep: Enable ACPI sleep via the __acpi_os_prepare_sleep xen: implement IRQ_WORK_VECTOR handler xen: implement apic ipi interface xen/setup: update VA mapping when releasing memory during setup xen/setup: Combine the two hypercall functions - since they are quite similar. xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM xen/setup: Only print "Freeing XXX-YYY pfn range: Z pages freed" if Z > 0 xen/gnttab: add deferred freeing logic debugfs: Add support to print u32 array in debugfs xen/p2m: An early bootup variant of set_phys_to_machine xen/p2m: Collapse early_alloc_p2m_middle redundant checks. xen/p2m: Allow alloc_p2m_middle to call reserve_brk depending on argument xen/p2m: Move code around to allow for better re-usage.
| * | | | debugfs: Add support to print u32 array in debugfsSrivatsa Vaddagiri2012-04-171-0/+128
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the code from Xen to debugfs to make the code common for other users as well. Accked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com> [v1: Fixed rebase issues] [v2: Fixed PPC compile issues] Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds2012-05-241-0/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull sparc changes from David S. Miller: "This has the generic strncpy_from_user() implementation architectures can now use, which we've been developing on linux-arch over the past few days. For good measure I ran both a 32-bit and a 64-bit glibc testsuite run, and the latter of which pointed out an adjustment I needed to make to sparc's user_addr_max() definition. Linus, you were right, STACK_TOP was not the right thing to use, even on sparc itself :-) From Sam Ravnborg, we have a conversion of sparc32 over to the common alloc_thread_info_node(), since the aspect which originally blocked our doing so (sun4c) has been removed." Fix up trivial arch/sparc/Kconfig and lib/Makefile conflicts. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: Fix user_addr_max() definition. lib: Sparc's strncpy_from_user is generic enough, move under lib/ kernel: Move REPEAT_BYTE definition into linux/kernel.h sparc: Increase portability of strncpy_from_user() implementation. sparc: Optimize strncpy_from_user() zero byte search. sparc: Add full proper error handling to strncpy_from_user(). sparc32: use the common implementation of alloc_thread_info_node()
| * | | | kernel: Move REPEAT_BYTE definition into linux/kernel.hDavid S. Miller2012-05-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | And make sure that everything using it explicitly includes that header file. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds2012-05-2479-2851/+2458
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull XFS update from Ben Myers: - Removal of xfsbufd - Background CIL flushes have been moved to a workqueue. - Fix to xfs_check_page_type applicable to filesystems where blocksize < page size - Fix for stale data exposure when extsize hints are used. - A series of xfs_buf cache cleanups. - Fix for XFS_IOC_ALLOCSP - Cleanups for includes and removal of xfs_lrw.[ch]. - Moved all busy extent handling to it's own file so that it is easier to merge with userspace. - Fix for log mount failure. - Fix to enable inode reclaim during quotacheck at mount time. - Fix for delalloc quota accounting. - Fix for memory reclaim deadlock on agi buffer. - Fixes for failed writes and to clean up stale delalloc blocks. - Fix to use GFP_NOFS in blkdev_issue_flush - SEEK_DATA/SEEK_HOLE support * 'for-linus' of git://oss.sgi.com/xfs/xfs: (57 commits) xfs: add trace points for log forces xfs: fix memory reclaim deadlock on agi buffer xfs: fix delalloc quota accounting on failure xfs: protect xfs_sync_worker with s_umount semaphore xfs: introduce SEEK_DATA/SEEK_HOLE support xfs: make xfs_extent_busy_trim not static xfs: make XBF_MAPPED the default behaviour xfs: flush outstanding buffers on log mount failure xfs: Properly exclude IO type flags from buffer flags xfs: clean up xfs_bit.h includes xfs: move xfs_do_force_shutdown() and kill xfs_rw.c xfs: move xfs_get_extsz_hint() and kill xfs_rw.h xfs: move xfs_fsb_to_db to xfs_bmap.h xfs: clean up busy extent naming xfs: move busy extent handling to it's own file xfs: move xfsagino_t to xfs_types.h xfs: use iolock on XFS_IOC_ALLOCSP calls xfs: kill XBF_DONTBLOCK xfs: kill xfs_read_buf() xfs: kill XBF_LOCK ...
| * | | | | xfs: add trace points for log forcesDave Chinner2012-05-212-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To enable easy tracing of the location of log forces and the frequency of them via perf, add a pair of trace points to the log force functions. This will help debug where excessive log forces are being issued from by simple perf commands like: # ~/perf/perf top -e xfs:xfs_log_force -G -U Which gives this sort of output: Events: 141 xfs:xfs_log_force - 100.00% [kernel] [k] xfs_log_force - xfs_log_force 87.04% xfsaild kthread kernel_thread_helper - 12.87% xfs_buf_lock _xfs_buf_find xfs_buf_get xfs_trans_get_buf xfs_da_do_buf xfs_da_get_buf xfs_dir2_data_init xfs_dir2_leaf_addname xfs_dir_createname xfs_create xfs_vn_mknod xfs_vn_create vfs_create do_last.isra.41 path_openat do_filp_open do_sys_open sys_open system_call_fastpath Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sig.com>
| * | | | | xfs: fix memory reclaim deadlock on agi bufferPeter Watkins2012-05-211-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Note xfs_iget can be called while holding a locked agi buffer. If it goes into memory reclaim then inode teardown may try to lock the same buffer. Prevent the deadlock by calling radix_tree_preload with GFP_NOFS. Signed-off-by: Peter Watkins <treestem@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | | xfs: fix delalloc quota accounting on failureDave Chinner2012-05-213-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfstest 270 was causing quota reservations way beyond what was sane (ten to hundreds of TB) for a 4GB filesystem. There's a sign problem in the error handling path of xfs_bmapi_reserve_delalloc() because xfs_trans_unreserve_quota_nblks() simple negates the value passed - which doesn't work for an unsigned variable. This causes reservations of close to 2^32 block instead of removing a reservation of a handful of blocks. Fix the same problem in the other xfs_trans_unreserve_quota_nblks() callers where unsigned integer variables are used, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | | xfs: protect xfs_sync_worker with s_umount semaphoreBen Myers2012-05-151-13/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_sync_worker checks the MS_ACTIVE flag in s_flags to avoid doing work during mount and unmount. This flag can be cleared by unmount after the xfs_sync_worker checks it but before the work is completed. The has caused crashes in the completion handler for the dummy transaction commited by xfs_sync_worker: PID: 27544 TASK: ffff88013544e040 CPU: 3 COMMAND: "kworker/3:0" #0 [ffff88016fdff930] machine_kexec at ffffffff810244e9 #1 [ffff88016fdff9a0] crash_kexec at ffffffff8108d053 #2 [ffff88016fdffa70] oops_end at ffffffff813ad1b8 #3 [ffff88016fdffaa0] no_context at ffffffff8102bd48 #4 [ffff88016fdffaf0] __bad_area_nosemaphore at ffffffff8102c04d #5 [ffff88016fdffb40] bad_area_nosemaphore at ffffffff8102c12e #6 [ffff88016fdffb50] do_page_fault at ffffffff813afaee #7 [ffff88016fdffc60] page_fault at ffffffff813ac635 [exception RIP: xlog_get_lowest_lsn+0x30] RIP: ffffffffa04a9910 RSP: ffff88016fdffd10 RFLAGS: 00010246 RAX: ffffc90014e48000 RBX: ffff88014d879980 RCX: ffff88014d879980 RDX: ffff8802214ee4c0 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88016fdffd10 R8: ffff88014d879a80 R9: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802214ee400 R13: ffff88014d879980 R14: 0000000000000000 R15: ffff88022fd96605 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffff88016fdffd18] xlog_state_do_callback at ffffffffa04aa186 [xfs] #9 [ffff88016fdffd98] xlog_state_done_syncing at ffffffffa04aa568 [xfs] Protect xfs_sync_worker by using the s_umount semaphore at the read level to provide exclusion with unmount while work is progressing. Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | | xfs: introduce SEEK_DATA/SEEK_HOLE supportJeff Liu2012-05-141-1/+142
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds lseek(2) SEEK_DATA/SEEK_HOLE functionality to xfs. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | | xfs: make xfs_extent_busy_trim not staticBen Myers2012-05-143-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e459df5, 'xfs: move busy extent handling to it's own file' moved some code from xfs_alloc.c into xfs_extent_busy.c for convenience in userspace code merges. One of the functions moved is xfs_extent_busy_trim (formerly xfs_alloc_busy_trim) which is defined STATIC. Unfortunately this function is still used in xfs_alloc.c, and this results in an undefined symbol in xfs.ko. Make xfs_extent_busy_trim not static and add its prototype to xfs_extent_busy.h. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com>
| * | | | | xfs: make XBF_MAPPED the default behaviourDave Chinner2012-05-147-32/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than specifying XBF_MAPPED for almost all buffers, introduce XBF_UNMAPPED for the couple of users that use unmapped buffers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | | xfs: flush outstanding buffers on log mount failureDave Chinner2012-05-141-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we fail to mount the log in xfs_mountfs(), we tear down all the infrastructure we have already allocated. However, the process of mounting the log may have progressed to the point of reading, caching and modifying buffers in memory. Hence before we can free all the infrastructure, we have to flush and remove all the buffers from memory. Problem first reported by Eric Sandeen, later a different incarnation was reported by Ben Myers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | | xfs: Properly exclude IO type flags from buffer flagsDave Chinner2012-05-141-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent event tracing during a debugging session showed that flags that define the IO type for a buffer are leaking into the flags on the buffer incorrectly. Fix the flag exclusion mask in xfs_buf_alloc() to avoid problems that may be caused by such leakage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | | xfs: clean up xfs_bit.h includesDave Chinner2012-05-1428-35/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the removal of xfs_rw.h and other changes over time, xfs_bit.h is being included in many files that don't actually need it. Clean up the includes as necessary. Also move the only-used-once xfs_ialloc_find_free() static inline function out of a header file that is widely included to reduce the number of needless dependencies on xfs_bit.h. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | | xfs: move xfs_do_force_shutdown() and kill xfs_rw.cDave Chinner2012-05-143-91/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_do_force_shutdown now is the only thing in xfs_rw.c. There is no need to keep it in it's own file anymore, so move it to xfs_fsops.c next to xfs_fs_goingdown() and kill xfs_rw.c. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
| * | | | | xfs: move xfs_get_extsz_hint() and kill xfs_rw.hDave Chinner2012-05-1416-57/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only thing left in xfs_rw.h is a function prototype for an inode function. Move that to xfs_inode.h, and kill xfs_rw.h. Also move the function implementing the prototype from xfs_rw.c to xfs_inode.c so we only have one function left in xfs_rw.c Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
OpenPOWER on IntegriCloud