path: root/fs/ceph
Commit message | Author | Age | Files | Lines
* Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-clientLinus Torvalds2011-09-092-3/+3
|\
| |     * 'for-linus' of git://ceph.newdream.net/git/ceph-client:
| |       libceph: fix leak of osd structs during shutdown
| |       ceph: fix memory leak
| |       ceph: fix encoding of ino only (not relative) paths
| |       libceph: fix msgpool
| * ceph: fix memory leakNoah Watkins2011-08-221-2/+2
| |     kfree does not clean up indirect allocations in ceph_fs_client and ceph_options (e.g. snapdir_name).
| |
| |     Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: fix encoding of ino only (not relative) pathsSage Weil2011-08-151-1/+1
| |     A 'path' consists of a starting ino and relative component. Encode even when there is no relative component. This is primarily needed by the NFS reexport code.
| |
| |     Signed-off-by: Sage Weil <sage@newdream.net>
* | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-clientLinus Torvalds2011-07-2613-137/+249
|\ \
| |/
| |     * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
| |       ceph: document unlocked d_parent accesses
| |       ceph: explicitly reference rename old_dentry parent dir in request
| |       ceph: document locking for ceph_set_dentry_offset
| |       ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
| |       ceph: protect d_parent access in ceph_d_revalidate
| |       ceph: protect access to d_parent
| |       ceph: handle racing calls to ceph_init_dentry
| |       ceph: set dir complete frag after adding capability
| |       rbd: set blk_queue request sizes to object size
| |       ceph: set up readahead size when rsize is not passed
| |       rbd: cancel watch request when releasing the device
| |       ceph: ignore lease mask
| |       ceph: fix ceph_lookup_open intent usage
| |       ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
| |       ceph: fix bad parent_inode calc in ceph_lookup_open
| |       ceph: avoid carrying Fw cap during write into page cache
| |       libceph: don't time out osd requests that haven't been received
| |       ceph: report f_bfree based on kb_avail rather than diffing.
| |       ceph: only queue capsnap if caps are dirty
| |       ceph: fix snap writeback when racing with writes
| |       ...
| * ceph: document unlocked d_parent accessesSage Weil2011-07-262-4/+11
| |     For the most part we don't care about racing with rename when directing MDS requests; either the old or new parent is fine. Document that, and do some minor cleanup.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: explicitly reference rename old_dentry parent dir in requestSage Weil2011-07-264-11/+17
| |     We carry a pin on the parent directory for the rename source and dest dentries. For the source it's r_locked_dir; we need to explicitly reference the old_dentry parent as well, since the dentry's d_parent may change between when the request was created and pinned and when it is freed.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: document locking for ceph_set_dentry_offsetSage Weil2011-07-261-1/+3
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bugSage Weil2011-07-264-13/+18
| |     Have caller pass in a safely-obtained reference to the parent directory for calculating a dentry's hash value.
| |
| |     While we're here, simplify the flow through ceph_encode_fh() so that there is a single exit point and cleanup.
| |
| |     Also fix a bug with the dentry hash calculation: calculate the hash for the dentry we were given, not its parent.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: protect d_parent access in ceph_d_revalidateSage Weil2011-07-261-15/+17
| |     Protect d_parent with d_lock. Carry a reference. Simplify the flow so that there is a single exit point and cleanup.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: protect access to d_parentSage Weil2011-07-266-15/+33
| |     d_parent is protected by d_lock: use it when looking up a dentry's parent directory inode. Also take a reference and drop it in the caller to avoid a use-after-free.
| |
| |     Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: handle racing calls to ceph_init_dentrySage Weil2011-07-261-9/+12
| |     The ->lookup() and prepopulate_readdir() callers are working with unhashed dentries, so we don't have to worry. The export.c callers, though, need to initialize something they got back from d_obtain_alias() and are potentially racing with other callers. Make sure we don't return unless the dentry is properly initialized (by us or someone else).
| |
| |     Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: set dir complete frag after adding capabilitySage Weil2011-07-261-13/+17
| |     Currently ceph_add_cap clears the complete bit if we are newly issued the FILE_SHARED cap, which is normally the case for a newly issued cap on a new directory. That means we clear the just-set bit. Move the check that sets the flag to after the cap is added/updated.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: set up readahead size when rsize is not passedYehuda Sadeh2011-07-261-0/+4
| |     This should improve the default read performance, as without it readahead is practically disabled.
| |
| |     Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| * ceph: ignore lease maskSage Weil2011-07-263-18/+12
| |     The lease mask is no longer used (and it changed a while back). Instead, use a non-zero duration to indicate that there is a lease being issued.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: fix ceph_lookup_open intent usageSage Weil2011-07-263-19/+37
| |     We weren't properly calling lookup_instantiate_filp when setting up the lookup intent, which could lead to file leakage on errors. So:
| |
| |     - use separate helper for the hidden snapdir translation, immediately following the mds request
| |     - use ceph_finish_lookup for the final dentry/return value dance in the exit path
| |     - lookup_instantiate_filp on success
| |
| |     Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNCSage Weil2011-07-261-1/+2
| |     We only need to put these on the directory unsafe list if they have side effects that fsync(2) should flush out.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: fix bad parent_inode calc in ceph_lookup_openSage Weil2011-07-261-2/+3
| |     We were always getting NULL here because the intent file f_dentry is always NULL at this point, which means we were always passing NULL to ceph_mdsc_do_request. In reality, this was fine, since this isn't currently ever a write operation that needs to get strung on the dir's unsafe list.
| |
| |     Use the dir explicitly, and only pass it if this open has side-effects that a dir fsync should flush.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: avoid carrying Fw cap during write into page cacheSage Weil2011-07-261-3/+19
| |     The generic_file_aio_write call may block on balance_dirty_pages while we flush data to the OSDs. If we hold a reference to the FILE_WR cap during that interval, revocation by the MDS (e.g., to do a stat(2)) may be very slow.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: report f_bfree based on kb_avail rather than diffing.Greg Farnum2011-07-261-2/+1
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
| * ceph: only queue capsnap if caps are dirtySage Weil2011-07-261-4/+2
| |     We used to go into this branch if i_wrbuffer_ref_head was non-zero. This was an ancient check from before we were careful about dealing with all kinds of caps (and not just dirty pages). It is cleaner to only queue a capsnap if there is an actual dirty cap. If we are racing with... something... we will end up here with ci->i_wrbuffer_refs but no dirty caps.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: fix snap writeback when racing with writesSage Weil2011-07-261-3/+20
| |     There are two problems that come up when we try to queue a capsnap while a write is in progress:
| |
| |     - The FILE_WR cap is held, but not yet dirty, so we may queue a capsnap with dirty == 0. That will crash later in __ceph_flush_snaps(). Or on the FILE_WR cap if a write is in progress.
| |     - We may not have i_head_snapc set, which causes problems pretty quickly. Look to the snaprealm in this case.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: use flag bit for at_end readdir flagSage Weil2011-07-262-6/+6
| |     This saves us a word of memory per file.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: add F_SYNC file flag to force sync (non-O_DIRECT) ioSage Weil2011-07-264-2/+18
| |     This allows us to force IO through the sync path which you normally only get when multiple clients are reading/writing to the same file or by mounting with -o sync. Among other things, this lets test programs verify correctness with a single mount.
| |
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: add flags field to file_infoSage Weil2011-07-261-1/+2
| |     Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| |     Signed-off-by: Sage Weil <sage@newdream.net>
* | fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlersJosef Bacik2011-07-203-4/+15
| |     Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness' sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks,
| |
| |     Acked-by: Jan Kara <jack@suse.cz>
| |     Signed-off-by: Josef Bacik <josef@redhat.com>
| |     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseekJosef Bacik2011-07-202-3/+25
| |     This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases we just return -EINVAL, in others we do the normal generic thing, and in others we're simply making sure that the proper due diligence is done. For example in NFS/CIFS we need to make sure the file size is updated properly for the SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself that is all we have to do. Thanks,
| |
| |     Signed-off-by: Josef Bacik <josef@redhat.com>
| |     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | don't open-code parent_ino() in assorted ->readdir()Al Viro2011-07-201-1/+1
| |     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | ceph: LOOKUP_OPEN is set only when it's the last componentAl Viro2011-07-201-1/+0
| |     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | don't transliterate lower bits of ->intent.open.flags to FMODE_...Al Viro2011-07-201-1/+1
| |     ->create() instances are much happier that way...
| |
| |     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | ->permission() sanitizing: don't pass flags to ->permission()Al Viro2011-07-202-3/+3
| |     not used by the instances anymore.
| |
| |     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | ->permission() sanitizing: don't pass flags to generic_permission()Al Viro2011-07-201-1/+1
| |     redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of them removes that bit.
| |
| |     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | kill check_acl callback of generic_permission()Al Viro2011-07-201-1/+1
|/
|     its value depends only on inode and does not change; we might as well store it in ->i_op->check_acl and be done with that.
|
|     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ceph analog of cifs build_path_from_dentry() race fixAl Viro2011-07-161-3/+16
|     ... unfortunately, cifs bug got copied. Fix is essentially the same.
|
|     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ceph: fix sync and dio writes across stripe boundariesSage Weil2011-06-131-3/+3
|     We were iterating across stripe boundaries properly, but not moving the write buffer pointer forward. This caused us to rewrite the same data after the break. Fix by adjusting the data pointer forward, and recalculating the io and buffer alignment after the break.
|
|     Signed-off-by: Sage Weil <sage@newdream.net>
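|     The stripe-boundary fix above can be illustrated with a minimal user-space sketch. This is not the kernel code: write_across_stripes(), the flat 'backing' array and the stripe_size parameter are hypothetical stand-ins, and the alignment recalculation mentioned in the message is not modelled. It only shows why the source pointer has to advance with the file position after each break.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space model: 'backing' stands in for the striped
 * objects, and each loop iteration is one write that stops at a stripe
 * boundary. Returns the number of separate writes issued.
 */
size_t write_across_stripes(char *backing, const char *data, size_t pos,
                            size_t len, size_t stripe_size)
{
    size_t done = 0;
    size_t writes = 0;

    assert(stripe_size > 0);

    while (done < len) {
        /* Bytes left before the next stripe boundary. */
        size_t room = stripe_size - ((pos + done) % stripe_size);
        size_t chunk = (len - done < room) ? len - done : room;

        /* 'data + done' is the point: the source pointer must advance
         * together with the file position after every break. Copying
         * from plain 'data' would rewrite the same bytes each time. */
        memcpy(backing + pos + done, data + done, chunk);

        done += chunk;
        writes++;
    }
    return writes;
}
```

|     With a 4-byte stripe, a 10-byte write starting at offset 3 is split into four pieces, and the destination ends up holding the source bytes in order.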
* ceph: fix page alignment correctionsSage Weil2011-06-131-5/+3
|     dd if=/dev/urandom of=/mnt/fs_depot/dd10 bs=500 seek=8388 count=1
|     dd if=/mnt/fs_depot/dd10 of=/root/dd10out bs=500 skip=8388 count=1
|
|     Reported-by: Henry C Chang <henry.cy.chang@gmail.com>
|     Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: unwind canceled flock stateSage Weil2011-06-071-10/+16
|     If we request a lock and then abort (e.g., ^C), we need to send a matching unlock request to the MDS to unwind our lock attempt to avoid indefinitely blocking other clients.
|
|     Reported-by: Brian Chrisman <brchrisman@gmail.com>
|     Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix ENOENT logic in striped_readSage Weil2011-06-071-2/+2
|     Getting ENOENT is equivalent to reading 0 bytes. Make that correction before setting up the hit_stripe and was_short flags.
|
|     Fixes the following case:
|
|     dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
|     dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct
|
|     Reported-by: Henry C Chang <henry.cy.chang@gmail.com>
|     Signed-off-by: Sage Weil <sage@newdream.net>
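|     A rough sketch of the ordering described above, in plain C. The struct and function names are hypothetical, and only a was_short-style flag is modelled (hit_stripe is left out). The point is that the -ENOENT to 0 correction happens before any flag is derived from the result.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical result of one object read. */
struct read_outcome {
    long bytes;       /* bytes read, or a negative errno */
    bool was_short;   /* fewer bytes than requested came back */
};

struct read_outcome classify_read(long ret, long want)
{
    struct read_outcome out;

    assert(want >= 0);

    /* A missing object is treated like a hole: zero bytes read. */
    if (ret == -ENOENT)
        ret = 0;

    /* Flags are computed only after the correction above. */
    out.bytes = ret;
    out.was_short = (ret >= 0 && ret < want);
    return out;
}
```

|     Other errors pass through unchanged, so only the missing-object case is reinterpreted as a short read.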
* ceph: fix short sync reads from the OSDSage Weil2011-06-071-13/+15
|     If we get a short read from the OSD because the object is small, we need to zero the remainder of the buffer. For O_DIRECT reads, the attempted range is not trimmed to i_size by the VFS, so we were actually looping indefinitely. Fix by trimming by i_size, and then unconditionally zeroing the trailing range.
|
|     Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
|     Signed-off-by: Sage Weil <sage@newdream.net>
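|     The trim-then-zero idea above might look roughly like this in user space. finish_short_read() and its parameters are hypothetical; file_size plays the role of i_size, and the real kernel path works on pages rather than a flat buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: 'got' bytes arrived for a read of 'want' bytes
 * at file offset 'pos'. Returns the number of bytes the caller should
 * report as read.
 */
size_t finish_short_read(char *buf, size_t pos, size_t want, size_t got,
                         size_t file_size)
{
    assert(buf != NULL);
    assert(got <= want);

    /* Trim the attempted range to the file size first. */
    if (pos >= file_size)
        return 0;
    if (want > file_size - pos)
        want = file_size - pos;
    if (got > want)
        got = want;

    /* Unconditionally zero whatever the short read did not fill. */
    memset(buf + got, 0, want - got);
    return want;
}
```

|     Because the returned length is bounded by the file size, a caller that loops until the request is satisfied reaches end of file and stops instead of retrying the same range.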
* ceph: use ihold when we already have an inode refSage Weil2011-06-0710-28/+37
|     We should use ihold whenever we already have a stable inode ref, even when we aren't holding i_lock. This avoids adding new and unnecessary locking dependencies.
|
|     Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix cap flush race reentrancySage Weil2011-05-243-29/+31
|     In e9964c10 we changed cap flushing to do a delicate dance because some inodes on the cap_dirty list could be in a migrating state (got EXPORT but not IMPORT) in which we couldn't actually flush and move from dirty->flushing, breaking the while (!empty) { process first } loop structure. It worked for a single sync thread, but was not reentrant and triggered infinite loops when multiple syncers came along.
|
|     Instead, move inodes with dirty caps to a separate cap_dirty_migrating list when in the limbo export-but-no-import state, allowing us to go back to the simple loop structure (which was reentrant). This is cleaner and more robust.
|
|     Audited the cap_dirty users and this looks fine: list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we have dirty caps (which list we're on is irrelevant) and list_del_init() calls still do the right thing.
|
|     Signed-off-by: Sage Weil <sage@newdream.net>
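|     The two-list arrangement above can be sketched with a simplified model. Everything here is hypothetical: the kernel uses struct list_head and locking, while this sketch uses a singly linked list and no locks. It only shows why keeping unflushable entries on their own list lets the "while not empty, process first" loop always make progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode with dirty caps. */
struct inode_stub {
    bool migrating;   /* exported but not yet imported: cannot flush */
    bool flushed;
    struct inode_stub *next;
};

struct cap_lists {
    struct inode_stub *dirty;            /* always flushable */
    struct inode_stub *dirty_migrating;  /* parked until import */
};

static void push(struct inode_stub **head, struct inode_stub *n)
{
    n->next = *head;
    *head = n;
}

/* Choose the list according to the migration state. */
void mark_dirty(struct cap_lists *l, struct inode_stub *n)
{
    push(n->migrating ? &l->dirty_migrating : &l->dirty, n);
}

/* Simple loop: every entry on 'dirty' can be flushed, so the list
 * shrinks on each iteration and the loop terminates. */
size_t flush_dirty(struct cap_lists *l)
{
    size_t count = 0;

    assert(l != NULL);

    while (l->dirty != NULL) {
        struct inode_stub *first = l->dirty;

        l->dirty = first->next;
        first->next = NULL;
        first->flushed = true;
        count++;
    }
    return count;
}
```

|     A second caller entering flush_dirty() after the first simply finds an empty list, which is the reentrancy property the message refers to.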
* ceph: avoid inode lookup on nfs fh reconnectSage Weil2011-05-241-2/+6
|     If we get the inode from the MDS, we have a reference in req; don't do a fresh lookup.
|
|     Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use LOOKUPINO to make unconnected nfs fh more reliableSage Weil2011-05-241-2/+17
|     If we are unable to locate an inode by ino, ask the MDS using the new LOOKUPINO command.
|
|     Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: check return value for start_request in writepagesSage Weil2011-05-191-1/+2
|     Since we pass the nofail arg, we should never get an error; BUG if we do. (And fix the function to not return an error if __map_request fails.)
|
|     Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: remove useless checkSage Weil2011-05-191-2/+0
|     rc is only ever 0 or negative in this method.
|
|     Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix broken comparison in readdir loopSage Weil2011-05-191-1/+1
|     Both off and fi->offset are unsigned, so the difference is always >= 0. Compare them directly instead of the sign of the difference.
|
|     Signed-off-by: Sage Weil <sage@newdream.net>
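|     The unsigned-difference pitfall above is easy to reproduce outside the kernel. The function names below are hypothetical and the operands are plain unsigned int rather than the actual readdir fields; the sketch only shows the wraparound and the direct comparison that replaces the sign test.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Difference of two unsigned offsets: wraps around, never goes negative. */
unsigned int offset_diff(unsigned int off, unsigned int start)
{
    return off - start;
}

/* Direct comparison, which is what the sign test was meant to express. */
bool offset_reached(unsigned int off, unsigned int start)
{
    return off >= start;
}

/* Small self-check of the two behaviours. */
void offset_demo(void)
{
    /* 2 - 5 wraps to a huge positive value instead of -3. */
    assert(offset_diff(2u, 5u) == UINT_MAX - 2u);
    /* The direct comparison gives the intended answer. */
    assert(!offset_reached(2u, 5u));
}
```

|     Any test of the form "difference >= 0" on such operands is therefore always true, while offset_reached() distinguishes the two cases.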
* ceph: fix rare potential cap leakSage Weil2011-05-191-1/+2
|     If we grab new_cap, retake the lock, and find we already have a cap now for the given mds, release new_cap.
|
|     Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use snprintf for dirstat contentSage Weil2011-05-191-2/+3
|     We allocate a buffer for rstats if the dirstat option is enabled. Use snprintf.
|
|     Signed-off-by: Sage Weil <sage@newdream.net>
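|     As a small illustration of why a bounded formatter suits a fixed-size buffer, here is a user-space sketch. format_rstats() and its output format are hypothetical and do not reproduce the actual dirstat content; only the standard snprintf() behaviour is relied on.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical formatter: writes at most size - 1 characters plus a
 * terminating NUL, and returns the length the full text would need.
 */
int format_rstats(char *buf, size_t size, unsigned long long files,
                  unsigned long long subdirs)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "files: %llu\nsubdirs: %llu\n",
                    files, subdirs);
}
```

|     When the buffer is too small the output is truncated but still NUL-terminated, and the return value tells the caller how much space the full text would have needed.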
* libceph: remove unused variableSage Weil2011-05-191-2/+0
|     Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: take reference on mds request r_unsafe_dirSage Weil2011-05-191-0/+4
|     We put ourselves on an inode list for the parent directory of metadata operations so that an fsync on the directory will wait for metadata updates to commit to disk. We weren't holding a reference to that directory, however, and under certain workloads (fsstress in this case) the directory can go away.
|
|     Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: do not use i_wrbuffer_ref as refcount for Fb capHenry C Chang2011-05-113-9/+10
|     We increment i_wrbuffer_ref when taking the Fb cap. This breaks the dirty page accounting and causes looping in __ceph_do_pending_vmtruncate, and ceph client hangs.
|
|     This bug can be reproduced occasionally by running blogbench.
|
|     Add a new field i_wb_ref to inode and dedicate it to Fb reference counting.
|
|     Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
|     Signed-off-by: Sage Weil <sage@newdream.net>