summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* ceph: re-request max_size if cap auth changesSage Weil2010-11-071-0/+5
| | | | | | | | | If the auth cap migrates to another MDS, clear requested_max_size so that we resend any pending max_size increase requests. This fixes potential hangs on writes that extend a file and race with an cap migration between MDSs. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: only let auth caps update max_sizeSage Weil2010-11-071-1/+8
| | | | | | | | | | | | | Only the auth MDS has a meaningful max_size value for us, so only update it in fill_inode if we're being issued an auth cap. Otherwise, a random stat result from a non-auth MDS can clobber a meaningful max_size, get the client<->mds cap state out of sync, and make writes hang. Specifically, even if the client re-requests a larger max_size (which it will), the MDS won't respond because as far as it knows we already have a sufficiently large value. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix open for write on clustered mdsSage Weil2010-11-071-2/+4
| | | | | | | | | Normally when we open a file we already have a cap, and simply update the wanted set. However, if we open a file for write, but don't have an auth cap, that doesn't work; we need to open a new cap with the auth MDS. Only reuse existing caps if we are opening for read or the existing cap is auth. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix bad pointer dereference in ceph_fill_traceSage Weil2010-11-071-1/+2
| | | | | | | | We dereference *in a few lines down, but only set it on rename. It is apparently pretty rare for this to trigger, but I have been hitting it with a clustered MDSs. Signed-off-by: Sage Weil <sage@newdream.net>
* Revert "ceph: update issue_seq on cap grant"Sage Weil2010-10-271-5/+3
| | | | | | | | | | | | | | This reverts commit d91f2438d881514e4a923fd786dbd94b764a9440. The intent of issue_seq is to distinguish between mds->client messages that (re)create the cap and those that do not, which means we should _only_ be updating that value in the create paths. By updating it in handle_cap_grant, we reset it to zero, which then breaks release. The larger question is what workload/problem made me think it should be updated here... Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: do not carry i_lock for readdir from dcacheSage Weil2010-10-202-26/+18
| | | | | | | | | | | | We were taking dcache_lock inside of i_lock, which introduces a dependency not found elsewhere in the kernel, complicationg the vfs locking scalability work. Since we don't actually need it here anyway, remove it. We only need i_lock to test for the I_COMPLETE flag, so be careful to do so without dcache_lock held. Signed-off-by: Sage Weil <sage@newdream.net>
* fs/ceph/xattr.c: Use kmemdupJulia Lawall2010-10-201-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Convert a sequence of kmalloc and memcpy to use kmemdup. The semantic patch that performs this transformation is: (http://coccinelle.lip6.fr/) // <smpl> @@ expression a,flag,len; expression arg,e1,e2; statement S; @@ a = - \(kmalloc\|kzalloc\)(len,flag) + kmemdup(arg,len,flag) <... when != a if (a == NULL || ...) S ...> - memcpy(a,arg,len+1); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: add CEPH_MDS_OP_SETDIRLAYOUT and associated ioctl.Greg Farnum2010-10-202-1/+69
| | | | Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix debugfs warningsRandy Dunlap2010-10-201-1/+2
| | | | | | | | | | | Include "super.h" outside of CONFIG_DEBUG_FS to eliminate a compiler warning: fs/ceph/debugfs.c:266: warning: 'struct ceph_fs_client' declared inside parameter list fs/ceph/debugfs.c:266: warning: its scope is only this definition or declaration, which is probably not what you want fs/ceph/debugfs.c:271: warning: 'struct ceph_fs_client' declared inside parameter list Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
* ceph: switch from BKL to lock_flocks()Sage Weil2010-10-201-5/+6
| | | | | | | Switch from using the BKL explicitly to the new lock_flocks() interface. Eventually this will turn into a spinlock. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: preallocate flock state without locks heldGreg Farnum2010-10-202-15/+44
| | | | | | | | | When the lock_kernel() turns into lock_flocks() and a spinlock, we won't be able to do allocations with the lock held. Preallocate space without the lock, and retry if the lock state changes out from underneath us. Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use mapping->nrpages to determine if mapping is emptySage Weil2010-10-201-12/+1
| | | | | | This is simpler and faster. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: only invalidate on check_caps if we actually have pagesSage Weil2010-10-201-1/+1
| | | | | | | The i_rdcache_gen value only implies we MAY have cached pages; actually check the mapping to see if it's worth bothering with an invalidate. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: do not hide .snap in root directorySage Weil2010-10-201-1/+0
| | | | | | | Snaps in the root directory are now supported by the MDS, and harmless on older versions. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: factor out libceph from Ceph file systemYehuda Sadeh2010-10-2062-14078/+926
| | | | | | | | | | | | | | | | | | | | | | This factors out protocol and low-level storage parts of ceph into a separate libceph module living in net/ceph and include/linux/ceph. This is mostly a matter of moving files around. However, a few key pieces of the interface change as well: - ceph_client becomes ceph_fs_client and ceph_client, where the latter captures the mon and osd clients, and the fs_client gets the mds client and file system specific pieces. - Mount option parsing and debugfs setup is correspondingly broken into two pieces. - The mon client gets a generic handler callback for otherwise unknown messages (mds map, in this case). - The basic supported/required feature bits can be expanded (and are by ceph_fs_client). No functional change, aside from some subtle error handling cases that got cleaned up in the refactoring process. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph-rbd: osdc support for osd call and rollback operationsYehuda Sadeh2010-10-203-0/+29
| | | | | | This will be used for rbd snapshots administration. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
* ceph: messenger and osdc changes for rbdYehuda Sadeh2010-10-206-101/+436
| | | | | | | | | | | | Allow the messenger to send/receive data in a bio. This is added so that we wouldn't need to copy the data into pages or some other buffer when doing IO for an rbd block device. We can now have trailing variable sized data for osd ops. Also osd ops encoding is more modular. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: refactor osdc requests creation functionsYehuda Sadeh2010-10-202-57/+155
| | | | | | | | | | The osd requests creation are being decoupled from the vino parameter, allowing clients using the osd to use other arbitrary object names that are not necessarily vino based. Also, calc_raw_layout now takes a snap id. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: lookup pool in osdmap by nameYehuda Sadeh2010-10-202-0/+15
| | | | | | | Implement a pool lookup by name. This will be used by rbd. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
* Un-inline the core-dump helper functionsLinus Torvalds2010-10-141-0/+38
| | | | | | | | | | | | | | | | | | Tony Luck reports that the addition of the access_ok() check in commit 0eead9ab41da ("Don't dump task struct in a.out core-dumps") broke the ia64 compile due to missing the necessary header file includes. Rather than add yet another include (<asm/unistd.h>) to make everything happy, just uninline the silly core dump helper functions and move the bodies to fs/exec.c where they make a lot more sense. dump_seek() in particular was too big to be an inline function anyway, and none of them are in any way performance-critical. And we really don't need to mess up our include file headers more than they already are. Reported-and-tested-by: Tony Luck <tony.luck@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Don't dump task struct in a.out core-dumpsLinus Torvalds2010-10-141-4/+0
| | | | | | | | | | | | | | | | | | | | | | akiphie points out that a.out core-dumps have that odd task struct dumping that was never used and was never really a good idea (it goes back into the mists of history, probably the original core-dumping code). Just remove it. Also do the access_ok() check on dump_write(). It probably doesn't matter (since normal filesystems all seem to do it anyway), but he points out that it's normally done by the VFS layer, so ... [ I suspect that we should possibly do "vfs_write()" instead of calling ->write directly. That also does the whole fsnotify and write statistics thing, which may or may not be a good idea. ] And just to be anal, do this all for the x86-64 32-bit a.out emulation code too, even though it's not enabled (and won't currently even compile) Reported-by: akiphie <akiphie@lavabit.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-2.6.36' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2010-10-131-2/+0
|\ | | | | | | | | * 'for-2.6.36' of git://linux-nfs.org/~bfields/linux: nfsd: fix BUG at fs/nfsd/nfsfh.h:199 on unlink
| * nfsd: fix BUG at fs/nfsd/nfsfh.h:199 on unlinkJ. Bruce Fields2010-10-131-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | As of commit 43a9aa64a2f4330a9cb59aaf5c5636566bce067c "NFSD: Fill in WCC data for REMOVE, RMDIR, MKNOD, and MKDIR", we sometimes call fh_unlock on a filehandle that isn't fully initialized. We should fix up the callers, but as a quick fix it is also sufficient just to remove this assertion. Reported-by: Marius Tolzmann <tolzmann@molgen.mpg.de> Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | fanotify: disable fanotify syscallsEric Paris2010-10-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch disables the fanotify syscalls by just not building them and letting the cond_syscall() statements in kernel/sys_ni.c redirect them to sys_ni_syscall(). It was pointed out by Tvrtko Ursulin that the fanotify interface did not include an explicit prioritization between groups. This is necessary for fanotify to be usable for hierarchical storage management software, as they must get first access to the file, before inotify-like notifiers see the file. This feature can be added in an ABI compatible way in the next release (by using a number of bits in the flags field to carry the info) but it was suggested by Alan that maybe we should just hold off and do it in the next cycle, likely with an (new) explicit argument to the syscall. I don't like this approach best as I know people are already starting to use the current interface, but Alan is all wise and noone on list backed me up with just using what we have. I feel this is needlessly ripping the rug out from under people at the last minute, but if others think it needs to be a new argument it might be the best way forward. Three choices: Go with what we got (and implement the new feature next cycle). Add a new field right now (and implement the new feature next cycle). Wait till next cycle to release the ABI (and implement the new feature next cycle). This is number 3. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus' of ↵Linus Torvalds2010-10-094-23/+33
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: update issue_seq on cap grant ceph: send cap release message early on failed revoke. ceph: Update max_len with minimum required size ceph: Fix return value of encode_fh function ceph: avoid null deref in osd request error path ceph: fix list_add usage on unsafe_writes list
| * | ceph: update issue_seq on cap grantSage Weil2010-10-071-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to update the issue_seq on any grant operation, be it via an MDS reply or a separate grant message. The update in the grant path was missing. This broke cap release for inodes in which the MDS sent an explicit grant message that was not soon after followed by a successful MDS reply on the same inode. Also fix the signedness on seq locals. Signed-off-by: Sage Weil <sage@newdream.net>
| * | ceph: send cap release message early on failed revoke.Greg Farnum2010-10-071-10/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an MDS tries to revoke caps that we don't have, we want to send releases early since they probably contain the caps message the MDS is looking for. Previously, we only sent the messages if we didn't have the inode either. But in a multi-mds system we can retain the inode after dropping all caps for a single MDS. Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * | ceph: Update max_len with minimum required sizeAneesh Kumar K.V2010-10-071-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | encode_fh on error should update max_len with minimum required size, so that caller can redo the call with the reallocated buffer. This is required with open by handle patch series Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Sage Weil <sage@newdream.net>
| * | ceph: Fix return value of encode_fh functionAneesh Kumar K.V2010-10-071-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | encode_fh function should return 255 on error as done by other file system to indicate EOVERFLOW. Also max_len is in sizeof(u32) units and not in bytes. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Sage Weil <sage@newdream.net>
| * | ceph: avoid null deref in osd request error pathSage Weil2010-10-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | If we interrupt an osd request, we call __cancel_request, but it wasn't verifying that req->r_osd was non-NULL before dereferencing it. This could cause a crash if osds were flapping and we aborted a request on said osd. Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>
| * | ceph: fix list_add usage on unsafe_writes listHenry C Chang2010-10-071-1/+1
| |/ | | | | | | | | | | | | Fix argument order. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>
* | Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osdLinus Torvalds2010-10-091-1/+7
|\ \ | | | | | | | | | | | | * 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: Fix double page_unlock BUG in write_begin/end
| * | exofs: Fix double page_unlock BUG in write_begin/endBoaz Harrosh2010-10-081-1/+7
| |/ | | | | | | | | | | | | | | | | | | This BUG is there since the first submit of the code, but only triggered in last Kernel. It's timing related do to the asynchronous object-creation behaviour of exofs. (Which should be investigated farther) The bug is obvious hence the fixed. Signed-off-by: Boaz Harrosh <Boaz Harrosh bharrosh@panasas.com>
* | Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds2010-10-071-5/+14
|\ \ | |/ |/| | | | | * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: properly account for reclaimed inodes
| * xfs: properly account for reclaimed inodesJohannes Weiner2010-10-061-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When marking an inode reclaimable, a per-AG counter is increased, the inode is tagged reclaimable in its per-AG tree, and, when this is the first reclaimable inode in the AG, the AG entry in the per-mount tree is also tagged. When an inode is finally reclaimed, however, it is only deleted from the per-AG tree. Neither the counter is decreased, nor is the parent tree's AG entry untagged properly. Since the tags in the per-mount tree are not cleared, the inode shrinker iterates over all AGs that have had reclaimable inodes at one point in time. The counters on the other hand signal an increasing amount of slab objects to reclaim. Since "70e60ce xfs: convert inode shrinker to per-filesystem context" this is not a real issue anymore because the shrinker bails out after one iteration. But the problem was observable on a machine running v2.6.34, where the reclaimable work increased and each process going into direct reclaim eventually got stuck on the xfs inode shrinking path, trying to scan several million objects. Fix this by properly unwinding the reclaimable-state tracking of an inode when it is reclaimed. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: stable@kernel.org Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
* | Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2010-10-061-15/+4
|\ \ | | | | | | | | | | | | * 'for-linus' of git://git.kernel.dk/linux-2.6-block: writeback: always use sb->s_bdi for writeback purposes
| * | writeback: always use sb->s_bdi for writeback purposesChristoph Hellwig2010-10-041-15/+4
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently use struct backing_dev_info for various different purposes. Originally it was introduced to describe a backing device which includes an unplug and congestion function and various bits of readahead information and VM-relevant flags. We're also using for tracking dirty inodes for writeback. To make writeback properly find all inodes we need to only access the per-filesystem backing_device pointed to by the superblock in ->s_bdi inside the writeback code, and not the instances pointeded to by inode->i_mapping->backing_dev which can be overriden by special devices or might not be set at all by some filesystems. Long term we should split out the writeback-relevant bits of struct backing_device_info (which includes more than the current bdi_writeback) and only point to it from the superblock while leaving the traditional backing device as a separate structure that can be overriden by devices. The one exception for now is the block device filesystem which really wants different writeback contexts for it's different (internal) inodes to handle the writeout more efficiently. For now we do this with a hack in fs-writeback.c because we're so late in the cycle, but in the future I plan to replace this with a superblock method that allows for multiple writeback contexts per filesystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2010-10-061-1/+1
|\ \ | |/ |/| | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: Initialize total_len in fuse_retrieve()
| * fuse: Initialize total_len in fuse_retrieve()Geert Uytterhoeven2010-10-041-1/+1
| | | | | | | | | | | | | | | | | | | | fs/fuse/dev.c:1357: warning: ‘total_len’ may be used uninitialized in this function Initialize total_len to zero, else its value will be undefined. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6Linus Torvalds2010-10-012-16/+35
|\ \ | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: prevent infinite recursion in cifs_reconnect_tcon cifs: set backing_dev_info on new S_ISREG inodes
| * | cifs: prevent infinite recursion in cifs_reconnect_tconJeff Layton2010-10-011-16/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cifs_reconnect_tcon is called from smb_init. After a successful reconnect, cifs_reconnect_tcon will call reset_cifs_unix_caps. That function will, in turn call CIFSSMBQFSUnixInfo and CIFSSMBSetFSUnixInfo. Those functions also call smb_init. It's possible for the session and tcon reconnect to succeed, and then for another cifs_reconnect to occur before CIFSSMBQFSUnixInfo or CIFSSMBSetFSUnixInfo to be called. That'll cause those functions to call smb_init and cifs_reconnect_tcon again, ad infinitum... Break the infinite recursion by having those functions use a new smb_init variant that doesn't attempt to perform a reconnect. Reported-and-Tested-by: Michal Suchanek <hramrach@centrum.cz> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
| * | cifs: set backing_dev_info on new S_ISREG inodesJeff Layton2010-09-291-0/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Testing on very recent kernel (2.6.36-rc6) made this warning pop: WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x65/0x70() Hardware name: Dirtiable inode bdi default != sb bdi cifs ...the following patch fixes it and seems to be the obviously correct thing to do for cifs. Cc: stable@kernel.org Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
* | reiserfs: fix unwanted reiserfs lock recursionFrederic Weisbecker2010-10-011-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prevent from recursively locking the reiserfs lock in reiserfs_unpack() because we may call journal_begin() that requires the lock to be taken only once, otherwise it won't be able to release the lock while taking other mutexes, ending up in inverted dependencies between the journal mutex and the reiserfs lock for example. This fixes: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.35.4.4a #3 ------------------------------------------------------- lilo/1620 is trying to acquire lock: (&journal->j_mutex){+.+...}, at: [<d0325bff>] do_journal_begin_r+0x7f/0x340 [reiserfs] but task is already holding lock: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<d032a278>] reiserfs_write_lock+0x28/0x40 [reiserfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&REISERFS_SB(s)->lock){+.+.+.}: [<c10562b7>] lock_acquire+0x67/0x80 [<c12facad>] __mutex_lock_common+0x4d/0x410 [<c12fb0c8>] mutex_lock_nested+0x18/0x20 [<d032a278>] reiserfs_write_lock+0x28/0x40 [reiserfs] [<d0325c06>] do_journal_begin_r+0x86/0x340 [reiserfs] [<d0325f77>] journal_begin+0x77/0x140 [reiserfs] [<d0315be4>] reiserfs_remount+0x224/0x530 [reiserfs] [<c10b6a20>] do_remount_sb+0x60/0x110 [<c10cee25>] do_mount+0x625/0x790 [<c10cf014>] sys_mount+0x84/0xb0 [<c12fca3d>] syscall_call+0x7/0xb -> #0 (&journal->j_mutex){+.+...}: [<c10560f6>] __lock_acquire+0x1026/0x1180 [<c10562b7>] lock_acquire+0x67/0x80 [<c12facad>] __mutex_lock_common+0x4d/0x410 [<c12fb0c8>] mutex_lock_nested+0x18/0x20 [<d0325bff>] do_journal_begin_r+0x7f/0x340 [reiserfs] [<d0325f77>] journal_begin+0x77/0x140 [reiserfs] [<d0326271>] reiserfs_persistent_transaction+0x41/0x90 [reiserfs] [<d030d06c>] reiserfs_get_block+0x22c/0x1530 [reiserfs] [<c10db9db>] __block_prepare_write+0x1bb/0x3a0 [<c10dbbe6>] block_prepare_write+0x26/0x40 [<d030b738>] reiserfs_prepare_write+0x88/0x170 [reiserfs] [<d03294d6>] reiserfs_unpack+0xe6/0x120 [reiserfs] [<d0329782>] reiserfs_ioctl+0x272/0x320 [reiserfs] [<c10c3188>] vfs_ioctl+0x28/0xa0 [<c10c3bbd>] do_vfs_ioctl+0x32d/0x5c0 [<c10c3eb3>] sys_ioctl+0x63/0x70 [<c12fca3d>] syscall_call+0x7/0xb other info that might help us debug this: 2 locks held by lilo/1620: #0: (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [<d032945a>] reiserfs_unpack+0x6a/0x120 [reiserfs] #1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<d032a278>] reiserfs_write_lock+0x28/0x40 [reiserfs] stack backtrace: Pid: 1620, comm: lilo Not tainted 2.6.35.4.4a #3 Call Trace: [<c10560f6>] __lock_acquire+0x1026/0x1180 [<c10562b7>] lock_acquire+0x67/0x80 [<c12facad>] __mutex_lock_common+0x4d/0x410 [<c12fb0c8>] mutex_lock_nested+0x18/0x20 [<d0325bff>] do_journal_begin_r+0x7f/0x340 [reiserfs] [<d0325f77>] journal_begin+0x77/0x140 [reiserfs] [<d0326271>] reiserfs_persistent_transaction+0x41/0x90 [reiserfs] [<d030d06c>] reiserfs_get_block+0x22c/0x1530 [reiserfs] [<c10db9db>] __block_prepare_write+0x1bb/0x3a0 [<c10dbbe6>] block_prepare_write+0x26/0x40 [<d030b738>] reiserfs_prepare_write+0x88/0x170 [reiserfs] [<d03294d6>] reiserfs_unpack+0xe6/0x120 [reiserfs] [<d0329782>] reiserfs_ioctl+0x272/0x320 [reiserfs] [<c10c3188>] vfs_ioctl+0x28/0xa0 [<c10c3bbd>] do_vfs_ioctl+0x32d/0x5c0 [<c10c3eb3>] sys_ioctl+0x63/0x70 [<c12fca3d>] syscall_call+0x7/0xb Reported-by: Jarek Poplawski <jarkao2@gmail.com> Tested-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: All since 2.6.32 <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | reiserfs: fix dependency inversion between inode and reiserfs mutexesFrederic Weisbecker2010-10-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The reiserfs mutex already depends on the inode mutex, so we can't lock the inode mutex in reiserfs_unpack() without using the safe locking API, because reiserfs_unpack() is always called with the reiserfs mutex locked. This fixes: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.35c #13 ------------------------------------------------------- lilo/1606 is trying to acquire lock: (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [<d0329450>] reiserfs_unpack+0x60/0x110 [reiserfs] but task is already holding lock: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<d032a268>] reiserfs_write_lock+0x28/0x40 [reiserfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&REISERFS_SB(s)->lock){+.+.+.}: [<c1056347>] lock_acquire+0x67/0x80 [<c12f083d>] __mutex_lock_common+0x4d/0x410 [<c12f0c58>] mutex_lock_nested+0x18/0x20 [<d032a268>] reiserfs_write_lock+0x28/0x40 [reiserfs] [<d0329e9a>] reiserfs_lookup_privroot+0x2a/0x90 [reiserfs] [<d0316b81>] reiserfs_fill_super+0x941/0xe60 [reiserfs] [<c10b7d17>] get_sb_bdev+0x117/0x170 [<d0313e21>] get_super_block+0x21/0x30 [reiserfs] [<c10b74ba>] vfs_kern_mount+0x6a/0x1b0 [<c10b7659>] do_kern_mount+0x39/0xe0 [<c10cebe0>] do_mount+0x340/0x790 [<c10cf0b4>] sys_mount+0x84/0xb0 [<c12f25cd>] syscall_call+0x7/0xb -> #0 (&sb->s_type->i_mutex_key#8){+.+.+.}: [<c1056186>] __lock_acquire+0x1026/0x1180 [<c1056347>] lock_acquire+0x67/0x80 [<c12f083d>] __mutex_lock_common+0x4d/0x410 [<c12f0c58>] mutex_lock_nested+0x18/0x20 [<d0329450>] reiserfs_unpack+0x60/0x110 [reiserfs] [<d0329772>] reiserfs_ioctl+0x272/0x320 [reiserfs] [<c10c3228>] vfs_ioctl+0x28/0xa0 [<c10c3c5d>] do_vfs_ioctl+0x32d/0x5c0 [<c10c3f53>] sys_ioctl+0x63/0x70 [<c12f25cd>] syscall_call+0x7/0xb other info that might help us debug this: 1 lock held by lilo/1606: #0: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<d032a268>] reiserfs_write_lock+0x28/0x40 [reiserfs] stack backtrace: Pid: 1606, comm: lilo Not tainted 2.6.35c #13 Call Trace: [<c1056186>] __lock_acquire+0x1026/0x1180 [<c1056347>] lock_acquire+0x67/0x80 [<c12f083d>] __mutex_lock_common+0x4d/0x410 [<c12f0c58>] mutex_lock_nested+0x18/0x20 [<d0329450>] reiserfs_unpack+0x60/0x110 [reiserfs] [<d0329772>] reiserfs_ioctl+0x272/0x320 [reiserfs] [<c10c3228>] vfs_ioctl+0x28/0xa0 [<c10c3c5d>] do_vfs_ioctl+0x32d/0x5c0 [<c10c3f53>] sys_ioctl+0x63/0x70 [<c12f25cd>] syscall_call+0x7/0xb Reported-by: Jarek Poplawski <jarkao2@gmail.com> Tested-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: <stable@kernel.org> [2.6.32 and later] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | proc: make /proc/pid/limits world readableJiri Olsa2010-10-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Having the limits file world readable will ease the task of system management on systems where root privileges might be restricted. Having admin restricted with root priviledges, he/she could not check other users process' limits. Also it'd align with most of the /proc stat files. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: Eugene Teo <eugene@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'fixes' of ↵Linus Torvalds2010-09-291-1/+1
|\ \ | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: ocfs2: Don't walk off the end of fast symlinks.
| * | ocfs2: Don't walk off the end of fast symlinks.Joel Becker2010-09-291-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | ocfs2 fast symlinks are NUL terminated strings stored inline in the inode data area. However, disk corruption or a local attacker could, in theory, remove that NUL. Because we're using strlen() (my fault, introduced in a731d1 when removing vfs_follow_link()), we could walk off the end of that string. Signed-off-by: Joel Becker <joel.becker@oracle.com> Cc: stable@kernel.org
* | xfs: force background CIL push under sustained loadDave Chinner2010-09-292-19/+30
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I have been seeing occasional pauses in transaction throughput up to 30s long under heavy parallel workloads. The only notable thing was that the xfsaild was trying to be active during the pauses, but making no progress. It was running exactly 20 times a second (on the 50ms no-progress backoff), and the number of pushbuf events was constant across this time as well. IOWs, the xfsaild appeared to be stuck on buffers that it could not push out. Further investigation indicated that it was trying to push out inode buffers that were pinned and/or locked. The xfsbufd was also getting woken at the same frequency (by the xfsaild, no doubt) to push out delayed write buffers. The xfsbufd was not making any progress because all the buffers in the delwri queue were pinned. This scan- and-make-no-progress dance went one in the trace for some seconds, before the xfssyncd came along an issued a log force, and then things started going again. However, I noticed something strange about the log force - there were way too many IO's issued. 516 log buffers were written, to be exact. That added up to 129MB of log IO, which got me very interested because it's almost exactly 25% of the size of the log. He delayed logging code is suppose to aggregate the minimum of 25% of the log or 8MB worth of changes before flushing. That's what really puzzled me - why did a log force write 129MB instead of only 8MB? Essentially what has happened is that no CIL pushes had occurred since the previous tail push which cleared out 25% of the log space. That caused all the new transactions to block because there wasn't log space for them, but they kick the xfsaild to push the tail. However, the xfsaild was not making progress because there were buffers it could not lock and flush, and the xfsbufd could not flush them because they were pinned. As a result, both the xfsaild and the xfsbufd could not move the tail of the log forward without the CIL first committing. The cause of the problem was that the background CIL push, which should happen when 8MB of aggregated changes have been committed, is being held off by the concurrent transaction commit load. The background push does a down_write_trylock() which will fail if there is a concurrent transaction commit holding the push lock in read mode. With 8 CPUs all doing transactions as fast as they can, there was enough concurrent transaction commits to hold off the background push until tail-pushing could no longer free log space, and the halt would occur. It should be noted that there is no reason why it would halt at 25% of log space used by a single CIL checkpoint. This bug could definitely violate the "no transaction should be larger than half the log" requirement and hence result in corruption if the system crashed under heavy load. This sort of bug is exactly the reason why delayed logging was tagged as experimental.... The fix is to start blocking background pushes once the threshold has been exceeded. Rework the threshold calculations to keep the amount of log space a CIL checkpoint can use to below that of the AIL push threshold to avoid the problem completely. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* Merge branch 'upstream-linus' of ↵Linus Torvalds2010-09-2414-44/+117
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: o2dlm: force free mles during dlm exit ocfs2: Sync inode flags with ext2. ocfs2: Move 'wanted' into parens of ocfs2_resmap_resv_bits. ocfs2: Use cpu_to_le16 for e_leaf_clusters in ocfs2_bg_discontig_add_extent. ocfs2: update ctime when changing the file's permission by setfacl ocfs2/net: fix uninitialized ret in o2net_send_message_vec() Ocfs2: Handle empty list in lockres_seq_start() for dlmdebug.c Ocfs2: Re-access the journal after ocfs2_insert_extent() in dxdir codes. ocfs2: Fix lockdep warning in reflink. ocfs2/lockdep: Move ip_xattr_sem out of ocfs2_xattr_get_nolock.
| * o2dlm: force free mles during dlm exitSrinivas Eeda2010-09-233-0/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | While umounting, a block mle doesn't get freed if dlm is shutdown after master request is received but before assert master. This results in unclean shutdown of dlm domain. This patch frees all mles that lie around after other nodes were notified about exiting the dlm and marking dlm state as leaving. Only block mles are expected to be around, so we log ERROR for other mles but still free them. Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
OpenPOWER on IntegriCloud