path: root/fs
Commit message  Author  Age  Files  Lines
* nfsd: Use vfs_fsync_range() in nfsd_commitTrond Myklebust2010-01-291-10/+20
| | | | | | | | | The NFS COMMIT operation allows the client to specify the exact byte range that it wishes to sync to disk in order to optimise server performance. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
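To illustrate the byte-range idea, here is a minimal userspace sketch (not the patch itself) of turning a COMMIT request's offset and count into an inclusive end offset that a range-sync call such as vfs_fsync_range() could take. The helper name `commit_range_end` is hypothetical, and treating a count of 0 as "sync to end of file" is an assumption of this sketch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute the inclusive end offset of a COMMIT range.
 * Assumption: count == 0 means "everything from offset to end of file".
 * Returns 0 on success, -1 if the range is invalid or would overflow.
 */
static int commit_range_end(int64_t offset, uint32_t count, int64_t *end)
{
    if (offset < 0)
        return -1;
    if (count == 0) {
        *end = INT64_MAX;               /* sync through end of file */
        return 0;
    }
    /* Reject ranges where offset + count - 1 would overflow. */
    if ((int64_t)count - 1 > INT64_MAX - offset)
        return -1;
    *end = offset + (int64_t)count - 1; /* inclusive end offset */
    return 0;
}
```

With this shape, a request for 8192 bytes at offset 4096 yields the range [4096, 12287], so only those bytes need to be written back rather than the whole file.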
* NFSD: Create PF_INET6 listener in write_portsChuck Lever2010-01-271-1/+17
| | | | | | | | | | | | | | | | | | Try to create a PF_INET6 listener for NFSD, if IPv6 is enabled in the kernel. Make sure nfsd_serv's reference count is decreased if __write_ports_addxprt() failed to create a listener. See __write_ports_addfd(). Our current plan is to rely on rpc.nfsd to create appropriate IPv6 listeners when server-side NFS/IPv6 support is desired. Legacy behavior, via the write_threads or write_svc kernel APIs, will remain the same -- only IPv4 listeners are created. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [bfields@citi.umich.edu: Move error-handling code to end] Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* SUNRPC: NFS kernel APIs shouldn't return ENOENT for "transport not found"Chuck Lever2010-01-261-5/+1
| | | | | | | | | | | | | | | | | | | | write_ports() converts svc_create_xprt()'s ENOENT error return to EPROTONOSUPPORT so that rpc.nfsd (in user space) can report an error message that makes sense. It turns out that several of the other kernel APIs rpc.nfsd use can also return ENOENT from svc_create_xprt(), by way of lockd_up(). On the client side, an NFSv2 or NFSv3 mount request can also return the result of lockd_up(). This error may also be returned during an NFSv4 mount request, since the NFSv4 callback service uses svc_create_xprt() to create the callback listener. An ENOENT error return results in a confusing error message from the mount command. Let's have svc_create_xprt() return EPROTONOSUPPORT instead of ENOENT. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* SUNRPC: Bury "#ifdef IPV6" in svc_create_xprt()Chuck Lever2010-01-262-4/+0
| | | | | | | | | | | | | | Clean up: Bruce observed we have more or less common logic in each of svc_create_xprt()'s callers: the check to create an IPv6 RPC listener socket only if CONFIG_IPV6 is set. I'm about to add another case that does just the same. If we move the ifdefs into __svc_xpo_create(), then svc_create_xprt() call sites can get rid of the "#ifdef" ugliness, and can use the same logic with or without IPv6 support available in the kernel. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* nfsd41: Create the recovery entry for the NFSv4.1 clientRicardo Labiaga2010-01-141-1/+3
| | | | | Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* nfsd: use vfs_fsync for non-directoriesChristoph Hellwig2010-01-131-32/+17
| Instead of open-coding the fsync calling sequence, use vfs_fsync. This also gets rid of the useless i_mutex over the data writeout.
|
| Consolidate the remaining special code for syncing directories and document its quirks.
|
| Signed-off-by: Christoph Hellwig <hch@lst.de>
| Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* nfsd4: Use FIRST_NFS4_OP in nfsd4_decode_compound()Ricardo Labiaga2010-01-131-1/+1
| | | | | | | Since we're checking for LAST_NFS4_OP, use FIRST_NFS4_OP to be consistent. Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* nfsd41: nfsd4_decode_compound() does not recognize all opsRicardo Labiaga2010-01-131-1/+1
| | | | | | | | | | | | | | | The server incorrectly assumes that the operations in the array start with value 0. The first operation (OP_ACCESS) has a value of 3, causing the check in nfsd4_decode_compound to be off. Instead of comparing that the operation number is less than the number of elements in the array, the server should verify that it is less than the maximum valid operation number defined by LAST_NFS4_OP. Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
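The off-by-base problem described above can be sketched as follows. This is a simplified stand-in, not the kernel code that was changed: only OP_ACCESS == 3 comes from the commit message, while the value chosen for LAST_NFS4_OP, the NUM_OPS macro, and both function names are assumptions made for illustration.

```c
#include <stdbool.h>

/* Only OP_ACCESS == 3 is taken from the commit message; the value of
 * LAST_NFS4_OP below is an assumption for this sketch. */
enum {
    FIRST_NFS4_OP = 3,   /* OP_ACCESS */
    LAST_NFS4_OP  = 39   /* assumed, illustrative */
};

/* Number of valid operations (hypothetical helper macro). */
#define NUM_OPS (LAST_NFS4_OP - FIRST_NFS4_OP + 1)

/* Buggy shape: compares against a count as if numbering started at 0,
 * so op 0 is accepted and the highest valid ops are rejected. */
static bool op_valid_buggy(unsigned int opnum)
{
    return opnum < NUM_OPS;
}

/* Fixed shape: bound the op number by the first and last valid values. */
static bool op_valid_fixed(unsigned int opnum)
{
    return opnum >= FIRST_NFS4_OP && opnum <= LAST_NFS4_OP;
}
```

Bounding by FIRST_NFS4_OP and LAST_NFS4_OP keeps the check correct even if the table layout changes, which matches the intent of using both constants consistently.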
* lib: Introduce generic list_sort functionDave Chinner2010-01-121-95/+1
| | | | | | | | | | | | There are two copies of list_sort() in the tree already, one in the DRM code, another in ubifs. Now XFS needs this as well. Create a generic list_sort() function from the ubifs version and convert existing users to it so we don't end up with yet another copy in the tree. Signed-off-by: Dave Chinner <david@fromorbit.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Artem Bityutskiy <dedekind@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds2010-01-115-625/+666
|\ | | | | | | | | | | | | | | * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: Ensure we force all busy extents in range to disk xfs: Don't flush stale inodes xfs: fix timestamp handling in xfs_setattr xfs: use DECLARE_EVENT_CLASS
| * xfs: Ensure we force all busy extents in range to diskDave Chinner2010-01-102-29/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we search for and find a busy extent during allocation we force the log out to ensure the extent free transaction is on disk before the allocation transaction. The current implementation has a subtle bug in it--it does not handle multiple overlapping ranges. That is, if we free lots of little extents into a single contiguous extent, then allocate the contiguous extent, the busy search code stops searching at the first extent it finds that overlaps the allocated range. It then uses the commit LSN of the transaction to force the log out to. Unfortunately, the other busy ranges might have more recent commit LSNs than the first busy extent that is found, and this results in xfs_alloc_search_busy() returning before all the extent free transactions are on disk for the range being allocated. This can lead to potential metadata corruption or stale data exposure after a crash because log replay won't replay all the extent free transactions that cover the allocation range. Modified-by: Alex Elder <aelder@sgi.com> (Dropped the "found" argument from the xfs_alloc_busysearch trace event.) Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
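The fix described above amounts to scanning every busy extent that overlaps the allocated range and forcing the log to the most recent commit LSN, instead of stopping at the first overlap. A minimal sketch of that search, assuming a flat array and a simplified, hypothetical `struct busy_extent` (the real XFS data structures differ):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified busy-extent record. */
struct busy_extent {
    uint64_t start;  /* first block of the freed extent */
    uint64_t len;    /* number of blocks */
    int64_t  lsn;    /* commit LSN of the transaction that freed it */
};

/*
 * Return the highest commit LSN among all busy extents overlapping
 * [start, start + len), or -1 if none overlap.
 */
static int64_t busy_search_max_lsn(const struct busy_extent *busy, size_t n,
                                   uint64_t start, uint64_t len)
{
    int64_t max_lsn = -1;
    uint64_t end = start + len;

    for (size_t i = 0; i < n; i++) {
        uint64_t bend = busy[i].start + busy[i].len;

        /* Half-open interval overlap test. */
        if (busy[i].start < end && start < bend) {
            if (busy[i].lsn > max_lsn)
                max_lsn = busy[i].lsn;
            /* Keep scanning: a later extent may carry a newer LSN. */
        }
    }
    return max_lsn;
}
```

Forcing the log to the returned LSN covers every extent-free transaction in the range, which is the property the original first-match search did not guarantee.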
| * xfs: Don't flush stale inodesDave Chinner2010-01-101-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | Because inodes remain in cache much longer than inode buffers do under memory pressure, we can get the situation where we have stale, dirty inodes being reclaimed but the backing storage has been freed. Hence we should never, ever flush XFS_ISTALE inodes to disk as there is no guarantee that the backing buffer is in cache and still marked stale when the flush occurs. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
| * xfs: fix timestamp handling in xfs_setattrChristoph Hellwig2010-01-102-55/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently have some rather odd code in xfs_setattr for updating the a/c/mtime timestamps: - first we do a non-transaction update if all three are updated together - second we implicitly update the ctime for various changes instead of relying on the ATTR_CTIME flag - third we set the timestamps to the current time instead of the arguments in the iattr structure in many cases. This patch makes sure we update it in a consistent way: - always transactional - ctime is only updated if ATTR_CTIME is set or we do a size update, which is a special case - always to the times passed in from the caller instead of the current time The only non-size caller of xfs_setattr that doesn't come from the VFS is updated to set ATTR_CTIME and pass in a valid ctime value. Reported-by: Eric Blake <ebb9@byu.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * xfs: use DECLARE_EVENT_CLASSChristoph Hellwig2010-01-101-538/+591
| | Using DECLARE_EVENT_CLASS allows us to reuse trace event code instead of duplicating it in the binary. This was not available before 2.6.33, so it had to be done as a separate step once the prerequisite was merged.
| |
| | This only requires changes to xfs_trace.h and the results are rather impressive:
| |
| |     hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o*
| |        text    data     bss     dec     hex filename
| |      607732   41884    3616  653232   9f7b0 fs/xfs/xfs.o
| |     1026732   41884    3808 1072424  105d28 fs/xfs/xfs.o.old
| |
| | Signed-off-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Alex Elder <aelder@sgi.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixesLinus Torvalds2010-01-114-15/+52
|\ \ | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Use MAX_LFS_FILESIZE for meta inode size GFS2: Fix gfs2_xattr_acl_chmod() GFS2: Fix locking bug in rename GFS2: Ensure uptodate inode size when using O_APPEND
| * | GFS2: Use MAX_LFS_FILESIZE for meta inode sizeSteven Whitehouse2010-01-111-1/+1
| | | Using ~0ULL was causing sign issues in filemap_fdatawrite_range, so use MAX_LFS_FILESIZE instead.
| | |
| | | Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| * | GFS2: Fix gfs2_xattr_acl_chmod()Steven Whitehouse2010-01-081-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | The ref counting for the bh returned by gfs2_ea_find() was wrong. This patch ensures that we always drop the ref count to that bh correctly. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| * | GFS2: Fix locking bug in renameSteven Whitehouse2010-01-081-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | The rename code was taking a resource group lock in cases where it wasn't actually needed, this caused problems if the rename was resulting in an inode being unlinked. The patch ensures that we only take the rgrp lock early if it is really needed. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| * | GFS2: Ensure uptodate inode size when using O_APPENDSteven Whitehouse2010-01-081-2/+36
| | | The VFS reads the inode size during generic_file_aio_write() but with no locking around it. In order to get the expected result from O_APPEND opens, this patch updates the inode size before calling generic_file_aio_write().
| | |
| | | There is of course still a race here, in that there is nothing to prevent another node coming in and extending the file in the meantime. On the other hand, when used with file locking this will ensure that the expected results are obtained.
| | |
| | | Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* | | Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6  Linus Torvalds  2010-01-11  1  -0/+3
|\ \ \
| | | | * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
| | | |   quota: Fix dquot_transfer for filesystems different from ext4
| * | | quota: Fix dquot_transfer for filesystems different from ext4Jan Kara2010-01-111-0/+3
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit fd8fbfc1 modified the way we find amount of reserved space belonging to an inode. The amount of reserved space is checked from dquot_transfer and thus inode_reserved_space gets called even for filesystems that don't provide get_reserved_space callback which results in a BUG. Fix the problem by checking get_reserved_space callback and return 0 if the filesystem does not provide it. CC: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
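The guard described above (check the callback, return 0 if the filesystem does not provide it) can be sketched in plain C. All type and function names below are hypothetical stand-ins with a `_stub` or `_sketch` suffix; they only mirror the shape of the fix, not the kernel's quota structures.

```c
#include <stddef.h>

struct inode_stub;  /* hypothetical stand-in for struct inode */

/* Hypothetical stand-in for the per-filesystem quota operations. */
struct quota_ops_stub {
    long long *(*get_reserved_space)(struct inode_stub *inode);
};

struct inode_stub {
    const struct quota_ops_stub *ops;
    long long reserved;  /* bytes reserved for this inode */
};

/* Example callback, as a filesystem that tracks reservations might supply. */
static long long *stub_get_reserved(struct inode_stub *inode)
{
    return &inode->reserved;
}

/* Return the reserved space, or 0 when no callback is provided,
 * instead of calling through a missing callback. */
static long long inode_reserved_space_sketch(struct inode_stub *inode)
{
    if (!inode->ops || !inode->ops->get_reserved_space)
        return 0;
    return *inode->ops->get_reserved_space(inode);
}
```

A filesystem without the callback then simply reports no reserved space during a quota transfer.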
* | | smaps: fix wrong rss countMinchan Kim2010-01-111-2/+1
| | | A long time ago we regarded the zero page as file_rss, and vm_normal_page did not return NULL for it. Now that ZERO_PAGE has been reinstated, vm_normal_page's implementation can return NULL for the zero page, and we no longer count it in file_rss. As a result, RSS and PSS can no longer match.
| | |
| | | For consistency, let's ignore the zero page in smaps_pte_range.
| | |
| | | Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
| | | Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
| | | Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
| | | Acked-by: Matt Mackall <mpm@selenic.com>
| | | Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
| | | Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | proc: partially revert "procfs: provide stack information for threads"KOSAKI Motohiro2010-01-111-89/+0
|/ /
| | Commit d899bf7b (procfs: provide stack information for threads) started showing stack information in /proc/{pid}/status, but it causes a large performance regression. Unfortunately, /proc/{pid}/status is also used by the ps command, and ps is one of the most important tools. Both taking mmap_sem and walking the page tables are heavy operations.
| |
| | With many processes running, ps performance is:
| |
| | [before d899bf7b]
| |
| |     % perf stat ps >/dev/null
| |
| |      Performance counter stats for 'ps':
| |
| |         4090.435806  task-clock-msecs   #      0.032 CPUs
| |                 229  context-switches   #      0.000 M/sec
| |                   0  CPU-migrations     #      0.000 M/sec
| |                 234  page-faults        #      0.000 M/sec
| |          8587565207  cycles             #   2099.425 M/sec
| |          9866662403  instructions       #      1.149 IPC
| |          3789415411  cache-references   #    926.409 M/sec
| |            30419509  cache-misses       #      7.437 M/sec
| |
| |       128.859521955  seconds time elapsed
| |
| | [after d899bf7b]
| |
| |     % perf stat ps > /dev/null
| |
| |      Performance counter stats for 'ps':
| |
| |         4305.081146  task-clock-msecs   #      0.028 CPUs
| |                 480  context-switches   #      0.000 M/sec
| |                   2  CPU-migrations     #      0.000 M/sec
| |                 237  page-faults        #      0.000 M/sec
| |          9021211334  cycles             #   2095.480 M/sec
| |         10605887536  instructions       #      1.176 IPC
| |          3612650999  cache-references   #    839.160 M/sec
| |            23917502  cache-misses       #      5.556 M/sec
| |
| |       152.277819582  seconds time elapsed
| |
| | Thus, this patch reverts it. Fortunately, /proc/{pid}/task/{tid}/smaps provides almost the same information, so we can use that instead.
| |
| | Commit d899bf7b introduced two features:
| |
| |  1) Add the [thread stack: xxxx] annotation to /proc/{pid}/task/{tid}/maps.
| |  2) Add the StackUsage field to /proc/{pid}/status.
| |
| | I only revert (2), because I haven't seen (1) cause a regression.
| |
| | Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
| | Cc: Stefani Seibold <stefani@seibold.net>
| | Cc: Ingo Molnar <mingo@elte.hu>
| | Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
| | Cc: Alexey Dobriyan <adobriyan@gmail.com>
| | Cc: "Eric W. Biederman" <ebiederm@xmission.com>
| | Cc: Randy Dunlap <randy.dunlap@oracle.com>
| | Cc: Andrew Morton <akpm@linux-foundation.org>
| | Cc: Andi Kleen <andi@firstfloor.org>
| | Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing  Linus Torvalds  2010-01-08  4  -9/+33
|\ \
| | | * 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
| | |   reiserfs: Relax reiserfs_xattr_set_handle() while acquiring xattr locks
| | |   reiserfs: Fix unreachable statement
| | |   reiserfs: Don't call reiserfs_get_acl() with the reiserfs lock
| | |   reiserfs: Relax lock on xattr removing
| | |   reiserfs: Relax the lock before truncating pages
| | |   reiserfs: Fix recursive lock on lchown
| | |   reiserfs: Fix mistake in down_write() conversion
| * | reiserfs: Relax reiserfs_xattr_set_handle() while acquiring xattr locksFrederic Weisbecker2010-01-071-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix remaining xattr locks acquired in reiserfs_xattr_set_handle() while we are holding the reiserfs lock to avoid lock inversions. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
| * | reiserfs: Fix unreachable statementJiri Slaby2010-01-071-1/+2
| | | Stanse found an unreachable statement in reiserfs_ioctl. There is an if followed by an error assignment and a `break' with no braces. Add the braces so that we don't break every time, but only in the error case, so that REISERFS_IOC_SETVERSION actually works when it returns no error.
| | |
| | | Signed-off-by: Jiri Slaby <jslaby@suse.cz>
| | | Cc: Reiserfs <reiserfs-devel@vger.kernel.org>
| | | Cc: Andrew Morton <akpm@linux-foundation.org>
| | | Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
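The missing-braces pattern is easy to miss in review, so here is a small sketch of it. The buggy shape is shown only as a comment; the function below has the braces in place. The `is_owner` check, the `SKETCH_*` codes, and the function name are hypothetical and stand in for whatever condition and error value the real ioctl handler uses.

```c
#include <stdbool.h>

enum { SKETCH_OK = 0, SKETCH_EPERM = -1 };  /* hypothetical result codes */

/*
 * Buggy shape described in the commit message (comment only):
 *
 *     if (!is_owner)
 *         err = SKETCH_EPERM;
 *         break;              <- not part of the if; always executed
 *     *generation = new_gen;  <- never reached
 *
 * Fixed shape: the braces make the break conditional, so the update
 * runs whenever the check passes.
 */
static int setversion_fixed(bool is_owner, int *generation, int new_gen)
{
    int err = SKETCH_OK;

    switch (0) {
    default:
        if (!is_owner) {
            err = SKETCH_EPERM;
            break;
        }
        *generation = new_gen;  /* reached only when no error occurred */
        break;
    }
    return err;
}
```

Compilers that support a misleading-indentation warning can flag the buggy shape, which is one reason to keep such warnings enabled.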
| * | reiserfs: Don't call reiserfs_get_acl() with the reiserfs lockFrederic Weisbecker2010-01-071-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | reiserfs_get_acl is usually not called under the reiserfs lock, as it doesn't need it. But it happens when it is called by reiserfs_acl_chmod(), which creates a dependency inversion against the private xattr inodes mutexes for the given inode. We need to call it without the reiserfs lock, especially since it's unnecessary. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
| * | reiserfs: Relax lock on xattr removingFrederic Weisbecker2010-01-051-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we remove an xattr, we call lookup_and_delete_xattr() that takes some private xattr inodes mutexes. But we hold the reiserfs lock at this time, which leads to dependency inversions. We can safely call lookup_and_delete_xattr() without the reiserfs lock, where xattr inodes lookups only need the xattr inodes mutexes. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
| * | reiserfs: Relax the lock before truncating pagesFrederic Weisbecker2010-01-051-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While truncating a file, reiserfs_setattr() calls inode_setattr() that will truncate the mapping for the given inode, but for that it needs the pages locks. In order to release these, the owners need the reiserfs lock to complete their jobs. But they can't, as we don't release it before calling inode_setattr(). We need to do that to fix the following softlockups: INFO: task flush-8:0:2149 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. flush-8:0 D f51af998 0 2149 2 0x00000000 f51af9ac 00000092 00000002 f51af998 c2803304 00000000 c1894ad0 010f3000 f51af9cc c1462604 c189ef80 f51af974 c1710304 f715b450 f715b5ec c2807c40 00000000 0005bb00 c2803320 c102c55b c1710304 c2807c50 c2803304 00000246 Call Trace: [<c1462604>] ? schedule+0x434/0xb20 [<c102c55b>] ? resched_task+0x4b/0x70 [<c106fa22>] ? mark_held_locks+0x62/0x80 [<c146414d>] ? mutex_lock_nested+0x1fd/0x350 [<c14640b9>] mutex_lock_nested+0x169/0x350 [<c1178cde>] ? reiserfs_write_lock+0x2e/0x40 [<c1178cde>] reiserfs_write_lock+0x2e/0x40 [<c11719a2>] do_journal_end+0xc2/0xe70 [<c1172912>] journal_end+0xb2/0x120 [<c11686b3>] ? pathrelse+0x33/0xb0 [<c11729e4>] reiserfs_end_persistent_transaction+0x64/0x70 [<c1153caa>] reiserfs_get_block+0x12ba/0x15f0 [<c106fa22>] ? mark_held_locks+0x62/0x80 [<c1154b24>] reiserfs_writepage+0xa74/0xe80 [<c1465a27>] ? 
_raw_spin_unlock_irq+0x27/0x50 [<c11f3d25>] ? radix_tree_gang_lookup_tag_slot+0x95/0xc0 [<c10b5377>] ? find_get_pages_tag+0x127/0x1a0 [<c106fa22>] ? mark_held_locks+0x62/0x80 [<c106fcd4>] ? trace_hardirqs_on_caller+0x124/0x170 [<c10bc1e0>] __writepage+0x10/0x40 [<c10bc9ab>] write_cache_pages+0x16b/0x320 [<c10bc1d0>] ? __writepage+0x0/0x40 [<c10bcb88>] generic_writepages+0x28/0x40 [<c10bcbd5>] do_writepages+0x35/0x40 [<c11059f7>] writeback_single_inode+0xc7/0x330 [<c11067b2>] writeback_inodes_wb+0x2c2/0x490 [<c1106a86>] wb_writeback+0x106/0x1b0 [<c1106cf6>] wb_do_writeback+0x106/0x1e0 [<c1106c18>] ? wb_do_writeback+0x28/0x1e0 [<c1106e0a>] bdi_writeback_task+0x3a/0xb0 [<c10cbb13>] bdi_start_fn+0x63/0xc0 [<c10cbab0>] ? bdi_start_fn+0x0/0xc0 [<c105d1f4>] kthread+0x74/0x80 [<c105d180>] ? kthread+0x0/0x80 [<c100327a>] kernel_thread_helper+0x6/0x10 3 locks held by flush-8:0/2149: #0: (&type->s_umount_key#30){+++++.}, at: [<c110676f>] writeback_inodes_wb+0x27f/0x490 #1: (&journal->j_mutex){+.+...}, at: [<c117199a>] do_journal_end+0xba/0xe70 #2: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1178cde>] reiserfs_write_lock+0x2e/0x40 INFO: task fstest:3813 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. fstest D 00000002 0 3813 3812 0x00000000 f5103c94 00000082 f5103c40 00000002 f5ad5450 00000007 f5103c28 011f3000 00000006 f5ad5450 c10bb005 00000480 c1710304 f5ad5450 f5ad55ec c2907c40 00000001 f5ad5450 f5103c74 00000046 00000002 f5ad5450 00000007 f5103c6c Call Trace: [<c10bb005>] ? free_hot_cold_page+0x1d5/0x280 [<c1462d64>] io_schedule+0x74/0xc0 [<c10b5a45>] sync_page+0x35/0x60 [<c146325a>] __wait_on_bit_lock+0x4a/0x90 [<c10b5a10>] ? sync_page+0x0/0x60 [<c10b59e5>] __lock_page+0x85/0x90 [<c105d660>] ? 
wake_bit_function+0x0/0x60 [<c10bf654>] truncate_inode_pages_range+0x1e4/0x2d0 [<c10bf75f>] truncate_inode_pages+0x1f/0x30 [<c10bf7cf>] truncate_pagecache+0x5f/0xa0 [<c10bf86a>] vmtruncate+0x5a/0x70 [<c10fdb7d>] inode_setattr+0x5d/0x190 [<c1150117>] reiserfs_setattr+0x1f7/0x2f0 [<c1464569>] ? down_write+0x49/0x70 [<c10fde01>] notify_change+0x151/0x330 [<c10e6f3d>] do_truncate+0x6d/0xa0 [<c10f4ce2>] do_filp_open+0x9a2/0xcf0 [<c1465aec>] ? _raw_spin_unlock+0x2c/0x50 [<c10fec50>] ? alloc_fd+0xe0/0x100 [<c10e602d>] do_sys_open+0x6d/0x130 [<c1002cfb>] ? sysenter_exit+0xf/0x16 [<c10e615e>] sys_open+0x2e/0x40 [<c1002ccc>] sysenter_do_call+0x12/0x32 3 locks held by fstest/3813: #0: (&sb->s_type->i_mutex_key#4){+.+.+.}, at: [<c10e6f33>] do_truncate+0x63/0xa0 #1: (&sb->s_type->i_alloc_sem_key#3){+.+.+.}, at: [<c10fdf07>] notify_change+0x257/0x330 #2: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1178c8e>] reiserfs_write_lock_once+0x2e/0x50 Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
| * | reiserfs: Fix recursive lock on lchownFrederic Weisbecker2010-01-051-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On chown, reiserfs will call reiserfs_setattr() to change the owner of the given inode, but it may also recursively call reiserfs_setattr() to propagate the owner change to the private xattr files for this inode. Hence, the reiserfs lock may be acquired twice which is not wanted as reiserfs_setattr() calls journal_begin() that is going to try to relax the lock in order to safely acquire the journal mutex. Using reiserfs_write_lock_once() from reiserfs_setattr() solves the problem. This fixes the following warning, that precedes a lockdep report. WARNING: at fs/reiserfs/lock.c:95 reiserfs_lock_check_recursive+0x3f/0x50() Hardware name: MS-7418 Unwanted recursive reiserfs lock! Pid: 4189, comm: fsstress Not tainted 2.6.33-rc2-tip-atom+ #195 Call Trace: [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50 [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50 [<c103f7ac>] warn_slowpath_common+0x6c/0xc0 [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50 [<c103f84b>] warn_slowpath_fmt+0x2b/0x30 [<c1178bff>] reiserfs_lock_check_recursive+0x3f/0x50 [<c1172ae3>] do_journal_begin_r+0x83/0x350 [<c1172f2d>] journal_begin+0x7d/0x140 [<c106509a>] ? in_group_p+0x2a/0x30 [<c10fda71>] ? inode_change_ok+0x91/0x140 [<c115007d>] reiserfs_setattr+0x15d/0x2e0 [<c10f9bf3>] ? dput+0xe3/0x140 [<c1465adc>] ? _raw_spin_unlock+0x2c/0x50 [<c117831d>] chown_one_xattr+0xd/0x10 [<c11780a3>] reiserfs_for_each_xattr+0x113/0x2c0 [<c1178310>] ? chown_one_xattr+0x0/0x10 [<c14641e9>] ? mutex_lock_nested+0x2a9/0x350 [<c117826f>] reiserfs_chown_xattrs+0x1f/0x60 [<c106509a>] ? in_group_p+0x2a/0x30 [<c10fda71>] ? 
inode_change_ok+0x91/0x140 [<c1150046>] reiserfs_setattr+0x126/0x2e0 [<c1177c20>] ? reiserfs_getxattr+0x0/0x90 [<c11b0d57>] ? cap_inode_need_killpriv+0x37/0x50 [<c10fde01>] notify_change+0x151/0x330 [<c10e659f>] chown_common+0x6f/0x90 [<c10e67bd>] sys_lchown+0x6d/0x80 [<c1002ccc>] sysenter_do_call+0x12/0x32 ---[ end trace 7c2b77224c1442fc ]--- Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
| * | reiserfs: Fix mistake in down_write() conversionFrederic Weisbecker2010-01-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a mistake in commit 0719d3434747889b314a1e8add776418c4148bcf (reiserfs: Fix reiserfs lock <-> i_xattr_sem dependency inversion) that has converted a down_write() into a down_read() accidentally. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
* | | Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds2010-01-081-4/+4
|\ \ \ | | | | | | | | | | | | | | | | * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: kill some warnings on i386 builds
| * | | xfs: kill some warnings on i386 buildsDave Chinner2010-01-081-4/+4
| | |/
| |/|
| | | Randy Dunlap reported printk() format-related warnings on i386 builds in his environment. Dave Chinner provided this patch to eliminate them.
| | |
| | | Signed-off-by: Dave Chinner <david@fromorbit.com>
| | | Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
| | | Signed-off-by: Alex Elder <aelder@sgi.com>
* | | Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6Linus Torvalds2010-01-081-0/+1
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: nfs: fix oops in nfs_rename() sunrpc: fix build-time warning sunrpc: on successful gss error pipe write, don't return error SUNRPC: Fix the return value in gss_import_sec_context() SUNRPC: Fix up an error return value in gss_import_sec_context_kerberos()
| * | | nfs: fix oops in nfs_rename()OGAWA Hirofumi2010-01-061-0/+1
| | | | A recent change fails to update "rehash", which causes the dentry to be added to the hash twice. This explains the oops (dereferencing a freed dentry in __d_lookup()) on my machine.
| | | |
| | | | Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
| | | | Reported-by: Marvin <marvin24@gmx.de>
| | | | Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
| | | | Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | | Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* | | | FDPIC: Respect PT_GNU_STACK exec protection markings when creating NOMMU stackMike Frysinger2010-01-061-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current code will load the stack size and protection markings, but then only use the markings in the MMU code path. The NOMMU code path always passes PROT_EXEC to the mmap() call. While this doesn't matter to most people whilst the code is running, it will cause a pointless icache flush when starting every FDPIC application. Typically this icache flush will be of a region on the order of 128KB in size, or may be the entire icache, depending on the facilities available on the CPU. In the case where the arch default behaviour seems to be desired (EXSTACK_DEFAULT), we probe VM_STACK_FLAGS for VM_EXEC to determine whether we should be setting PROT_EXEC or not. For arches that support an MPU (Memory Protection Unit - an MMU without the virtual mapping capability), setting PROT_EXEC or not will make an important difference. It should be noted that this change also affects the executability of the brk region, since ELF-FDPIC has that share with the stack. However, this is probably irrelevant as NOMMU programs aren't likely to use the brk region, preferring instead allocation via mmap(). Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
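The decision described above can be sketched as a small pure function. Everything here is illustrative: the `_SK` constants only mirror names from the commit message (EXSTACK_DEFAULT, VM_EXEC, PROT_EXEC), their numeric values are made up, and the enable/disable markings are an assumption about how the non-default cases behave.

```c
/* Illustrative stand-ins; names mirror the commit message, values are
 * arbitrary and the ENABLE/DISABLE variants are assumed. */
enum exstack_sk {
    EXSTACK_DEFAULT_SK,    /* no explicit marking: follow the arch default */
    EXSTACK_DISABLE_X_SK,  /* marking asks for a non-executable stack */
    EXSTACK_ENABLE_X_SK    /* marking asks for an executable stack */
};

#define PROT_READ_SK  0x1
#define PROT_WRITE_SK 0x2
#define PROT_EXEC_SK  0x4
#define VM_EXEC_SK    0x4UL

/* Pick the protection bits for the NOMMU stack mapping. */
static int nommu_stack_prot(enum exstack_sk marking,
                            unsigned long vm_stack_flags)
{
    int prot = PROT_READ_SK | PROT_WRITE_SK;

    if (marking == EXSTACK_ENABLE_X_SK ||
        (marking == EXSTACK_DEFAULT_SK && (vm_stack_flags & VM_EXEC_SK)))
        prot |= PROT_EXEC_SK;  /* only executable when asked for */

    return prot;
}
```

Leaving the exec bit off when it is not requested is what avoids the icache flush mentioned in the message.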
* | | | Merge branch 'for-2.6.33' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2010-01-061-4/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.33' of git://linux-nfs.org/~bfields/linux: sunrpc: fix peername failed on closed listener nfsd: make sure data is on disk before calling ->fsync nfsd: fix "insecure" export option
| * | | | nfsd: make sure data is on disk before calling ->fsyncChristoph Hellwig2010-01-061-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nfsd is not using vfs_fsync, so I missed it when changing the calling convention during the 2.6.32 window. This patch fixes it to not only start the data writeout, but also wait for it to complete before calling into ->fsync. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | nfsd: fix "insecure" export optionJ. Bruce Fields2009-12-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A typo in 12045a6ee9908b "nfsd: let "insecure" flag vary by pseudoflavor" reversed the sense of the "insecure" flag. Reported-by: Michael Guntsche <mike@it-loops.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* | | | | Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osdLinus Torvalds2010-01-062-9/+18
|\ \ \ \ \ | |_|_|/ / |/| | | | | | | | | | | | | | | | | | | * 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: simple_write_end does not mark_inode_dirty exofs: fix pnfs_osd re-definitions in pre-pnfs trees
| * | | | exofs: simple_write_end does not mark_inode_dirtyBoaz Harrosh2010-01-051-1/+16
| | | | | exofs uses simple_write_end() for its .write_end handler. But that is not enough, because simple_write_end() does not call mark_inode_dirty() when it extends i_size. So even if we do call mark_inode_dirty() at the beginning of writeout, with a very long IO and a saturated system we might get .write_inode() called while still extend-writing to the file, and miss out on the last i_size updates.
| | | | |
| | | | | So override .write_end, call simple_write_end(), and afterwards, if i_size was changed, call mark_inode_dirty().
| | | | |
| | | | | It stands to reason that since simple_write_end() was the one extending i_size, it should also call mark_inode_dirty(). But it looks like all users of simple_write_end() are memory-bound pseudo filesystems, which couldn't care less about mark_inode_dirty(). I might submit a warning-comment patch to simple_write_end() in future.
| | | | |
| | | | | CC: Stable <stable@kernel.org>
| | | | | Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
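The wrapper pattern described in the message above can be modelled in user space. This is only a sketch under stated assumptions: struct toy_inode and every toy_* function are hypothetical stand-ins, not the kernel's struct inode, simple_write_end() or mark_inode_dirty(), and the real handler takes different arguments.

```c
#include <assert.h>   /* assert(), used when exercising the sketch */
#include <stdbool.h>

/* Toy stand-in for the kernel inode: only the fields this sketch needs. */
struct toy_inode {
    long long i_size;  /* file size in bytes */
    bool dirty;        /* set once the inode must be written back */
};

/* Models simple_write_end(): may extend i_size, never marks the inode dirty. */
static int toy_simple_write_end(struct toy_inode *inode, long long pos,
                                int copied)
{
    if (pos + copied > inode->i_size)
        inode->i_size = pos + copied;
    return copied;
}

/* Models mark_inode_dirty(). */
static void toy_mark_inode_dirty(struct toy_inode *inode)
{
    inode->dirty = true;
}

/*
 * Wrapper in the spirit of the fix: remember i_size, delegate to the
 * generic helper, then mark the inode dirty only if the size changed.
 */
static int toy_write_end(struct toy_inode *inode, long long pos, int copied)
{
    long long old_size = inode->i_size;
    int ret = toy_simple_write_end(inode, pos, copied);

    if (inode->i_size != old_size)
        toy_mark_inode_dirty(inode);
    return ret;
}
```

Comparing i_size before and after the delegated call keeps the wrapper independent of how the generic helper decides to extend the file; a write that stays inside the current size leaves the dirty flag alone.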
| * | | | exofs: fix pnfs_osd re-definitions in pre-pnfs treesBoaz Harrosh2010-01-051-8/+2
| | | | | Some on-disk exofs constants and types are defined in the pnfs_osd_xdr.h file. Since we needed these types before the pnfs-objects code was accepted to mainline, we duplicated the minimal needed definitions into an exofs local header. The definitions were conditionally included depending on CONFIG_PNFS not being defined. So if PNFS was present in the tree, definitions are taken from there, and if not, they are defined locally.
| | | | |
| | | | | That was all good, but CONFIG_PNFS is planned to be included upstream before pnfs-objects is also included. (The first pnfs batch might be pnfs-files only.)
| | | | |
| | | | | So condition exofs local definitions on the absence of pnfs_osd_xdr.h inclusion (__PNFS_OSD_XDR_H__ not defined). User code must make sure that in future pnfs_osd_xdr.h will be included before fs/exofs/pnfs.h, which happens to be so in current code.
| | | | |
| | | | | Once pnfs-objects hits mainline, exofs's local header will be removed.
| | | | |
| | | | | Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
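Keying the local definitions on the other header's include guard, as the message above describes, could look roughly like the sketch below. Only the guard name __PNFS_OSD_XDR_H__ comes from the commit message; the enum, its values, and the SKETCH_* names are hypothetical and do not reflect the real contents of either header.

```c
#include <assert.h>   /* assert(), used when exercising the sketch */

/*
 * If pnfs_osd_xdr.h had been included before this point, its include
 * guard __PNFS_OSD_XDR_H__ would be defined and the fallback below
 * would be skipped in favour of the definitions from that header.
 */
#ifndef __PNFS_OSD_XDR_H__

/* Hypothetical local fallback; names and values are illustrative only. */
enum sketch_osd_raid_algorithm {
    SKETCH_OSD_RAID_0 = 1,
    SKETCH_OSD_RAID_4 = 2
};
#define SKETCH_USING_LOCAL_DEFS 1

#else
#define SKETCH_USING_LOCAL_DEFS 0
#endif

/* Report which set of definitions this translation unit ended up with. */
static const char *sketch_defs_origin(void)
{
    return SKETCH_USING_LOCAL_DEFS ? "exofs-local" : "pnfs_osd_xdr.h";
}
```

The approach depends on include order: the guard test only sees the other header if that header was included first, which is the constraint the message places on user code.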
* | | | | Merge branch 'upstream-linus' of ↵Linus Torvalds2010-01-051-6/+15
|\ \ \ \ \ | |/ / / / |/| | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: ocfs2: Handle O_DIRECT when writing to a refcounted cluster.
| * | | | ocfs2: Handle O_DIRECT when writing to a refcounted cluster.Tao Ma2009-12-301-6/+15
| | | | | In case of writing to a refcounted cluster with O_DIRECT, we need to fall back to buffered write. And when it is finished, we need to flush the page and the journal as we did for other O_DIRECT writes.
| | | | |
| | | | | This patch fixes oss bug 1191.
| | | | | http://oss.oracle.com/bugzilla/show_bug.cgi?id=1191
| | | | |
| | | | | Signed-off-by: Tao Ma <tao.ma@oracle.com>
| | | | | Tested-by: Tristan Ye <tristan.ye@oracle.com>
| | | | | Signed-off-by: Joel Becker <joel.becker@oracle.com>
* | | | | sysfs: Add lockdep annotations for the sysfs active referenceEric W. Biederman2010-01-042-2/+27
| | | | | Holding locks over device_del -> kobject_del -> sysfs_deactivate can cause deadlocks if those same locks are grabbed in sysfs show or store methods.
| | | | |
| | | | | I model the s_active count + completion as a sleeping read/write lock. I describe to lockdep sysfs_get_active as a read_trylock, sysfs_put_active as a read_unlock, and sysfs_deactivate as a write_lock and write_unlock pair. This seems to capture the essence for purposes of finding deadlocks, and in my testing finds real issues and ignores non-issues.
| | | | |
| | | | | This brings us back to the point that holding locks over kobject_del is a problem that ideally we should find a way of addressing, but at least lockdep can tell us about the problems instead of requiring developers to debug rare, strange system deadlocks that happen when sysfs files are removed while being written to.
| | | | |
| | | | | Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| | | | | Acked-by: Tejun Heo <tj@kernel.org>
| | | | | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | Merge branch 'sh/for-2.6.33' of ↵Linus Torvalds2010-01-041-2/+2
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6 * 'sh/for-2.6.33' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: binfmt_elf_fdpic: Fix build breakage introduced by coredump changes. sh: update defconfigs. sh: Don't default enable PMB support. sh: Disable PMB for SH4AL-DSP CPUs. sh: Only provide a PCLK definition for legacy CPG CPUs.
| * | | | | binfmt_elf_fdpic: Fix build breakage introduced by coredump changes.Daisuke HATAYAMA2010-01-041-2/+2
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit f6151dfea21496d43dbaba32cfcd9c9f404769bc introduces build breakage, so this patch fixes it together with some printk formatting cleanup. Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
* | | | | Merge branch 'for_linus' of ↵Linus Torvalds2010-01-045-48/+77
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Calculate metadata requirements more accurately ext4: Fix accounting of reserved metadata blocks
| * | | | | ext4: Calculate metadata requirements more accuratelyTheodore Ts'o2010-01-015-44/+73
| | | | | | In the past, ext4_calc_metadata_amount(), and its sub-functions ext4_ext_calc_metadata_amount() and ext4_indirect_calc_metadata_amount() badly over-estimated the number of metadata blocks that might be required for delayed allocation blocks. This didn't matter as much when functions which managed the reserved metadata blocks were more aggressive about dropping reserved metadata blocks as delayed allocation blocks were written, but unfortunately they were too aggressive. This was fixed in commit 0637c6f, but as a result the over-estimation by ext4_calc_metadata_amount() would lead to reserving 2-3 times the number of pending delayed allocation blocks as potentially required metadata blocks. So if there is 1 megabyte of blocks which have not yet been allocated, up to 3 megabytes of space would get reserved out of the user's quota and from the file system free space pool until all of the inode's data blocks have been allocated.
| | | | | |
| | | | | | This commit addresses this problem by much more accurately estimating the number of metadata blocks that will be required. It will still somewhat over-estimate the number of blocks needed, since it must make a worst case estimate not knowing which physical blocks will be needed, but it is much more accurate than before.
| | | | | |
| | | | | | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * | | | | ext4: Fix accounting of reserved metadata blocksTheodore Ts'o2010-01-011-4/+4
| | | | | | Commit 0637c6f had a typo which caused the reserved metadata blocks to not be released correctly. Fix this.
| | | | | |
| | | | | | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>