Commit message | Author | Age | Files | Lines
* libceph: rename __decode_pool{,_names}() to decode_pool{,_names}() | Ilya Dryomov | 2014-04-04 | 1 | -6/+8
  To be in line with all the other osdmap decode helpers.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: fix and clarify ceph_decode_need() sizes | Ilya Dryomov | 2014-04-04 | 1 | -6/+7
  Sum up sizeof(...) results instead of (incorrectly) hard-coding the number of bytes, expressed in ints and longs.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
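The idea of summing sizeof(...) results can be illustrated with a small userspace sketch. The header layout and the helper names `header_need()` and `have_bytes()` are hypothetical and are not taken from the libceph sources; the point is only that each field's size is spelled out rather than counted by hand.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical fixed-size header: two 32-bit fields followed by one
 * 64-bit field. This layout is illustrative, not the osdmap wire format.
 */
static size_t header_need(void)
{
    /* One sizeof per field, instead of "2 ints + 1 long" counted by hand. */
    return sizeof(uint32_t) + sizeof(uint32_t) + sizeof(uint64_t);
}

/* Return 1 if the range [p, end) holds at least `need` bytes, else 0. */
static int have_bytes(const char *p, const char *end, size_t need)
{
    return end >= p && (size_t)(end - p) >= need;
}
```

A caller would check `have_bytes(p, end, header_need())` once before decoding the three fields.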
* libceph: nuke bogus encoding version check in osdmap_apply_incremental() | Ilya Dryomov | 2014-04-04 | 1 | -5/+4
  Only version 6 of osdmap encoding is supported, anything other than version 6 results in an error and halts the decoding process. Checking if version is >= 5 is therefore bogus.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: fixup error handling in osdmap_apply_incremental() | Ilya Dryomov | 2014-04-04 | 1 | -32/+34
  The existing error handling scheme requires resetting err to -EINVAL prior to calling any ceph_decode_* macro. This is ugly and fragile, and there already are a few places where we would return 0 on error, due to a missing reset. Follow osdmap_decode() and fix this by adding a special e_inval label to be used by all ceph_decode_* macros.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: fix crush_decode() call site in osdmap_decode() | Ilya Dryomov | 2014-04-04 | 1 | -5/+2
  The size of the memory area fed to crush_decode() should be limited not only by osdmap end, but also by the crush map length. Also, drop unnecessary dout() (dout() in crush_decode() conveys the same info) and step past crush map only if it is decoded successfully.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: check length of osdmap osd arrays | Ilya Dryomov | 2014-04-04 | 1 | -4/+10
  Check length of osd_state, osd_weight and osd_addr arrays. They should all have exactly max_osd elements after the call to osdmap_set_max_osd().
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: safely decode max_osd value in osdmap_decode() | Ilya Dryomov | 2014-04-04 | 1 | -2/+4
  max_osd value is not covered by any ceph_decode_need(). Use a safe version of ceph_decode_* macro to decode it.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: fixup error handling in osdmap_decode() | Ilya Dryomov | 2014-04-04 | 1 | -26/+27
  The existing error handling scheme requires resetting err to -EINVAL prior to calling any ceph_decode_* macro. This is ugly and fragile, and there already are a few places where we would return 0 on error, due to a missing reset. Fix this by adding a special e_inval label to be used by all ceph_decode_* macros.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
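The dedicated-label pattern described above can be sketched in userspace C. The macro `decode_u32_safe()`, the struct `toy_map` and the function `toy_decode()` are hypothetical stand-ins, not the real ceph_decode_* macros, and the sketch ignores byte order for brevity.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "safe" decode helper: jump to `bad` if the input is too short. */
#define decode_u32_safe(p, end, v, bad)                       \
    do {                                                      \
        if ((size_t)((end) - (p)) < sizeof(uint32_t))         \
            goto bad;                                         \
        memcpy(&(v), (p), sizeof(uint32_t));                  \
        (p) += sizeof(uint32_t);                              \
    } while (0)

struct toy_map {
    uint32_t epoch;
    uint32_t max_osd;
};

/* Every decode step shares one error label, so no err reset is needed. */
static int toy_decode(const char *p, const char *end, struct toy_map *m)
{
    decode_u32_safe(p, end, m->epoch, e_inval);
    decode_u32_safe(p, end, m->max_osd, e_inval);
    return 0;

e_inval:
    return -EINVAL;
}
```

Because the error code is set in exactly one place, a truncated buffer can no longer fall through and return 0 by accident.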
* libceph: split osdmap allocation and decode steps | Ilya Dryomov | 2014-04-04 | 3 | -17/+31
  Split osdmap allocation and initialization into a separate function, ceph_osdmap_decode().
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: dump osdmap and enhance output on decode errors | Ilya Dryomov | 2014-04-04 | 1 | -6/+15
  Dump osdmap in hex on both full and incremental decode errors, to make it easier to match the contents with error offset. dout() map epoch and max_osd value on success.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: dump pg_temp mappings to debugfs | Ilya Dryomov | 2014-04-04 | 1 | -0/+11
  Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap, one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g.:
      pg_temp 2.6 [2,3,4]
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
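The line format quoted above can be reproduced with a short userspace sketch. The function `format_pg_temp()` is hypothetical; splitting the pgid into a pool number and a seed, and printing the seed in hex, are assumptions made for illustration rather than details stated in the commit message.

```c
#include <stdio.h>

/*
 * Format "pg_temp <pool>.<seed> [<osd>,...,<osd>]" into buf.
 * Returns the string length, or -1 if buf is too small.
 */
static int format_pg_temp(char *buf, size_t size, unsigned long long pool,
                          unsigned int seed, const int *osds, int n)
{
    size_t off;
    int i, w;

    w = snprintf(buf, size, "pg_temp %llu.%x [", pool, seed);
    if (w < 0 || (size_t)w >= size)
        return -1;
    off = (size_t)w;

    for (i = 0; i < n; i++) {
        /* Comma-separate every element after the first. */
        w = snprintf(buf + off, size - off, "%s%d", i ? "," : "", osds[i]);
        if (w < 0 || (size_t)w >= size - off)
            return -1;
        off += (size_t)w;
    }

    w = snprintf(buf + off, size - off, "]");
    if (w < 0 || (size_t)w >= size - off)
        return -1;
    return (int)(off + (size_t)w);
}
```

Calling it with pool 2, seed 6 and OSDs {2, 3, 4} yields the example line from the commit message.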
* libceph: do not prefix osd lines with \t in debugfs output | Ilya Dryomov | 2014-04-04 | 1 | -1/+1
  To save screen space in anticipation of more fields (e.g. primary affinity).
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: refer to osdmap directly in osdmap_show() | Ilya Dryomov | 2014-04-04 | 1 | -12/+14
  To make it more readable and save screen space.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* crush: support chooseleaf_vary_r tunable (tunables3) by default | Ilya Dryomov | 2014-04-04 | 1 | -1/+9
  Add TUNABLES3 feature (chooseleaf_vary_r tunable) to a set of features supported by default.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* crush: add SET_CHOOSELEAF_VARY_R step | Ilya Dryomov | 2014-04-04 | 2 | -0/+6
  This lets you adjust the vary_r tunable on a per-rule basis.
  Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* crush: add chooseleaf_vary_r tunable | Ilya Dryomov | 2014-04-04 | 2 | -6/+30
  The current crush_choose_firstn code will re-use the same 'r' value for the recursive call. That means that if we are hitting a collision or rejection for some reason (say, an OSD that is marked out) and need to retry, we will keep making the same (bad) choice in that recursive selection.
  Introduce a tunable that fixes that behavior by incorporating the parent 'r' value into the recursive starting point, so that a different path will be taken in subsequent placement attempts.
  Note that this was done from the get-go for the new crush_choose_indep algorithm.
  This was exposed by a user who was seeing PGs stuck in active+remapped after reweight-by-utilization because the up set mapped to a single OSD.
  Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
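One way a tunable could fold the parent 'r' into the recursive starting point is sketched below. The function name `child_start_r()` is hypothetical, and the shift-based derivation is an assumption made for illustration; it is not quoted from this commit.

```c
/*
 * Derive the starting 'r' for the recursive (leaf) selection.
 * vary_r == 0 models the legacy behaviour: every retry of the parent
 * starts the recursion from the same value. A non-zero vary_r lets the
 * parent's 'r' influence the child, so retries explore different paths.
 */
static unsigned int child_start_r(unsigned int parent_r, unsigned int vary_r)
{
    if (vary_r == 0)
        return 0;
    /* vary_r == 1 passes parent_r through; larger values dampen it. */
    return parent_r >> (vary_r - 1);
}
```

With the tunable off, parent retries 0, 1, 2 all map to the same child start; with it set to 1 they map to 0, 1, 2 respectively.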
* crush: allow crush rules to set (re)tries counts to 0 | Ilya Dryomov | 2014-04-04 | 1 | -2/+2
  These two fields are misnomers; they are *retry* counts.
  Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* crush: fix off-by-one errors in total_tries refactor | Ilya Dryomov | 2014-04-04 | 1 | -19/+27
  Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH code to allow adjustment of the retry counts on a per-pool basis. That commit had an off-by-one bug: the previous "tries" counter was a *retry* count, not a *try* count, but the new code was passing in 1 meaning there should be no retries.
  Fix the ftotal vs tries comparison to use < instead of <= to fix the problem. Note that the original code used <= here, which means the global "choose_total_tries" tunable is actually counting retries. Compensate for that by adding 1 in crush_do_rule when we pull the tunable into the local variable.
  This was noticed looking at output from a user provided osdmap. Unfortunately the map doesn't illustrate the change in mapping behavior and I haven't managed to construct one yet that does. Inspection of the crush debug output now aligns with prior versions, though.
  Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
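The difference between the two comparisons can be seen in a toy loop in which every attempt fails. The function `attempts_with()` is a hypothetical illustration and does not reproduce the CRUSH mapper; only the names ftotal and tries follow the commit message.

```c
/*
 * Count how many attempts a retry loop makes when every attempt fails.
 * inclusive != 0 models the old "ftotal <= tries" check,
 * inclusive == 0 models the fixed "ftotal < tries" check.
 */
static int attempts_with(int tries, int inclusive)
{
    int ftotal = 0;    /* failures seen so far */
    int attempts = 0;

    for (;;) {
        attempts++;
        ftotal++;      /* this attempt failed */
        if (inclusive ? (ftotal <= tries) : (ftotal < tries))
            continue;  /* retry */
        break;         /* give up */
    }
    return attempts;
}
```

With tries set to 1, the `<=` form makes two attempts (one retry) while the `<` form makes exactly one. Adding 1 to a limit that used to count retries keeps the total number of attempts unchanged under the `<` form.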
* ceph: don't include ceph.{file,dir}.layout vxattr in listxattr() | Yan, Zheng | 2014-04-04 | 1 | -2/+2
  This avoids 'cp -a' modifying layout of new files/directories.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: check buffer size in ceph_vxattrcb_layout() | Yan, Zheng | 2014-04-04 | 1 | -9/+25
  If buffer size is zero, return the size of layout vxattr. If buffer size is not zero, check if it is large enough for layout vxattr.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
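The two cases described above follow the usual getxattr(2) convention, which can be sketched in userspace as follows. The function `get_layout_vxattr()` is hypothetical, and returning -ERANGE for a too-small buffer is an assumption based on that convention rather than something stated in the commit message.

```c
#include <errno.h>
#include <string.h>

/*
 * Copy a precomputed attribute value into a caller-supplied buffer.
 * size == 0 is a pure size query; a non-zero size must be large enough.
 */
static long get_layout_vxattr(const char *value, char *buf, size_t size)
{
    size_t len = strlen(value);

    if (size == 0)
        return (long)len;   /* caller only wants the required size */
    if (size < len)
        return -ERANGE;     /* assumed error code for a short buffer */

    memcpy(buf, value, len);
    return (long)len;
}
```

A caller would typically query with size 0 first, allocate that many bytes, then call again to fetch the value.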
* ceph: fix null pointer dereference in discard_cap_releases() | Yan, Zheng | 2014-04-04 | 1 | -9/+12
  send_mds_reconnect() may call discard_cap_releases() after all release messages have been dropped by cleanup_cap_releases()
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
* libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance() | Yan, Zheng | 2014-04-04 | 1 | -0/+6
  When there is no more data, ceph_msg_data_{pages,pagelist}_advance() should not move on to the next page.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: Remove get/set acl on symlinks | Fabian Frederick | 2014-04-04 | 1 | -2/+0
  Remove unsupported symlink operations.
  Signed-off-by: Fabian Frederick <fabf@skynet.be>
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
* ceph: set mds_wanted when MDS reply changes a cap to auth cap | Yan, Zheng | 2014-04-04 | 1 | -1/+3
  When adjusting the caps a client wants, the MDS does not record caps that are not allowed. A non-auth MDS does not record WR caps. So when an MDS reply changes a non-auth cap to an auth cap, the client needs to set the cap's mds_wanted according to the reply.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: use fl->fl_file as owner identifier of flock and posix lock | Yan, Zheng | 2014-04-04 | 4 | -22/+45
  flock and posix lock should use fl->fl_file instead of process ID as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner is usually equal to fl->fl_file, but it also can be a customized value). The process ID of who holds the lock is just for F_GETLK fcntl(2).
  The fix is to rename the 'pid' fields of struct ceph_mds_request_args and struct ceph_filelock to 'owner', and rename the 'pid_namespace' fields to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages. We also set the most significant bit of the 'owner' field. MDS can use that bit to distinguish between old and new clients.
  The MDS counterpart of this patch modifies the flock code to not take the 'pid_namespace' into consideration when checking for conflicting locks.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
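Tagging the 'owner' field with its most significant bit can be sketched as below. The macro `TOY_OWNER_NEW_BIT` and both helper functions are hypothetical names, and treating the owner as a 64-bit value derived from a pointer is an assumption for illustration.

```c
#include <stdint.h>

/* Hypothetical name for the marker bit that identifies a new-style owner. */
#define TOY_OWNER_NEW_BIT (1ULL << 63)

/* Build an owner identifier from an opaque per-open-file pointer. */
static uint64_t toy_lock_owner(const void *file)
{
    return (uint64_t)(uintptr_t)file | TOY_OWNER_NEW_BIT;
}

/* Server-side check: was this owner produced by a new-style client? */
static int toy_owner_is_new_style(uint64_t owner)
{
    return (owner & TOY_OWNER_NEW_BIT) != 0;
}
```

An old client that still sends a process ID in this field would leave the top bit clear, which is what lets the receiver tell the two apart.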
* ceph: forbid mandatory file lock | Yan, Zheng | 2014-04-04 | 1 | -0/+12
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: use fl->fl_type to decide flock operation | Yan, Zheng | 2014-04-04 | 1 | -12/+9
  VFS does not directly pass flock's operation code to the filesystem's flock callback. It translates the operation code into the form in which posix lock parameters are presented.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
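Since the callback sees a posix-style lock type rather than the flock operation code, a filesystem has to choose its lock command from that type. The enum `toy_lock_cmd` and the function `toy_cmd_from_fl_type()` below are hypothetical; only F_RDLCK, F_WRLCK and F_UNLCK are standard constants from <fcntl.h>.

```c
#include <fcntl.h>

/* Hypothetical lock commands a filesystem might send to its server. */
enum toy_lock_cmd {
    TOY_LOCK_SHARED,
    TOY_LOCK_EXCL,
    TOY_LOCK_UNLOCK,
    TOY_LOCK_INVALID
};

/* Pick the command from the posix-style lock type, not from LOCK_SH/LOCK_EX. */
static enum toy_lock_cmd toy_cmd_from_fl_type(int fl_type)
{
    switch (fl_type) {
    case F_RDLCK:
        return TOY_LOCK_SHARED;
    case F_WRLCK:
        return TOY_LOCK_EXCL;
    case F_UNLCK:
        return TOY_LOCK_UNLOCK;
    default:
        return TOY_LOCK_INVALID;
    }
}
```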
* ceph: update i_max_size even if inode version does not change | Yan, Zheng | 2014-04-04 | 1 | -8/+8
  Handle the following sequence of events:
  - client releases an inode with i_max_size > 0. The release message is queued. (is not sent to the auth MDS)
  - a 'lookup' request reply from non-auth MDS returns the same inode.
  - client opens the inode in write mode. The version of inode trace in 'open' request reply is equal to the cached inode's version.
  - client requests new max size. The MDS ignores the request because it does not affect client's write range
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
* ceph: make sure write caps are registered with auth MDS | Yan, Zheng | 2014-04-04 | 1 | -1/+4
  Only auth MDS can issue write caps to clients, so don't consider write caps registered with non-auth MDS as valid.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: print inode number for LOOKUPINO request | Yan, Zheng | 2014-04-03 | 1 | -0/+2
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
* ceph: add get_name() NFS export callback | Yan, Zheng | 2014-04-03 | 4 | -1/+92
  Use the newly introduced LOOKUPNAME MDS request to connect child inode to its parent directory.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
* ceph: fix ceph_fh_to_parent() | Yan, Zheng | 2014-04-03 | 1 | -33/+9
  ceph_fh_to_parent() returns dentry that corresponds to the 'ino' field of struct ceph_nfs_confh. This is wrong, it should return dentry that corresponds to the 'parent_ino' field.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
* ceph: add get_parent() NFS export callback | Yan, Zheng | 2014-04-03 | 1 | -0/+60
  The callback uses LOOKUPPARENT MDS request to find parent.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
* ceph: simplify ceph_fh_to_dentry() | Yan, Zheng | 2014-04-03 | 1 | -135/+32
  MDS handles LOOKUPHASH and LOOKUPINO MDS requests in the same way. So __cfh_to_dentry() is redundant.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
* ceph: fscache: Wait for completion of object initialization | Yunchuan Wen | 2014-04-03 | 1 | -0/+1
  The object store limit needs to be updated after writing, and this can be done provided the corresponding object has already been initialized. Current object initialization is done asynchronously, which introduces a race: if a file is opened and then immediately written to, the initialization may not have completed, and the code will hit the ASSERT in fscache_submit_exclusive_op() and trigger a kernel bug.
  Tested-by: Milosz Tanski <milosz@adfin.com>
  Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
  Signed-off-by: Min Chen <minchen@ubuntukylin.com>
  Signed-off-by: Li Wang <liwang@ubuntukylin.com>
* ceph: fscache: Update object store limit after file writing | Yunchuan Wen | 2014-04-03 | 1 | -0/+3
  Synchronize object->store_limit[_l] with new inode->i_size after file writing.
  Tested-by: Milosz Tanski <milosz@adfin.com>
  Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
  Signed-off-by: Min Chen <minchen@ubuntukylin.com>
  Signed-off-by: Li Wang <liwang@ubuntukylin.com>
* ceph: fscache: add an interface to synchronize object store limit | Yunchuan Wen | 2014-04-03 | 1 | -0/+10
  Add an interface to explicitly synchronize object->store_limit[_l] with inode->i_size
  Tested-by: Milosz Tanski <milosz@adfin.com>
  Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
  Signed-off-by: Min Chen <minchen@ubuntukylin.com>
  Signed-off-by: Li Wang <liwang@ubuntukylin.com>
* ceph: do not set r_old_dentry_dir on link() | Sage Weil | 2014-04-03 | 1 | -2/+1
  This is racy--we do not know whether d_parent has changed out from underneath us because i_mutex is not held on the source inode's directory.
  Also, taking this reference is useless.
  Reported-by: Al Viro <viro@zeniv.linux.org.uk>
  Signed-off-by: Sage Weil <sage@inktank.com>
  Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: do not assume r_old_dentry[_dir] always set together | Sage Weil | 2014-04-03 | 2 | -4/+6
  Do not assume that r_old_dentry implies that r_old_dentry_dir is also set. Separate out the ref cleanup and make the debugfs dump behave when it is NULL.
  Signed-off-by: Sage Weil <sage@inktank.com>
  Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: do not chain inode updates to parent fsync | Sage Weil | 2014-04-03 | 4 | -17/+5
  The fsync(dirfd) only covers namespace operations, not inode updates. We do not need to cover setattr variants or O_TRUNC.
  Reported-by: Al Viro <viro@zeniv.linux.org.uk>
  Signed-off-by: Sage Weil <sage@inktank.com>
  Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: avoid useless ceph_get_dentry_parent_inode() in ceph_rename() | Sage Weil | 2014-04-03 | 1 | -1/+2
  This is just old_dir; no reason to abuse the dcache pointers.
  Reported-by: Al Viro <viro@zeniv.linux.org.uk>
  Signed-off-by: Sage Weil <sage@inktank.com>
  Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: let MDS adjust readdir 'frag' | Yan, Zheng | 2014-04-03 | 1 | -3/+0
  If readdir 'frag' is adjusted, readdir 'offset' should be reset. Otherwise some dentries may be lost when readdir and directory fragmentation happen at the same time.
  Another way to fix this issue is to let the MDS adjust readdir 'frag'. The code that handles the MDS reply resets the readdir 'offset' if the 'frag' in the readdir reply differs from the requested one.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: fix reset_readdir() | Yan, Zheng | 2014-04-03 | 1 | -3/+6
  When changing readdir position, fi->next_offset should be set to 0 if the new position is not in the first dirfrag.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* ceph: fix ceph_dir_llseek() | Yan, Zheng | 2014-04-03 | 2 | -7/+7
  Comparing offset with inode->i_sb->s_maxbytes doesn't make sense for a directory. For a fragmented directory, offset (frag_t, off) can be larger than inode->i_sb->s_maxbytes.
  At the very beginning of ceph_dir_llseek(), local variable old_offset is initialized to parameter offset. This doesn't make sense either. old_offset should be ceph_make_fpos(fi->frag, fi->next_offset).
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
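Why a (frag_t, off) position can exceed a byte-size limit becomes clearer with a sketch of how such a position might be packed. The helpers below are hypothetical, and placing the frag in the upper 32 bits of a 64-bit position is an assumption made for illustration, not a quotation of ceph_make_fpos().

```c
#include <stdint.h>

/* Pack a directory fragment id and an in-fragment offset into one position. */
static int64_t toy_make_fpos(uint32_t frag, uint32_t off)
{
    return (int64_t)(((uint64_t)frag << 32) | off);
}

/* Recover the fragment id from a packed position. */
static uint32_t toy_fpos_frag(int64_t pos)
{
    return (uint32_t)((uint64_t)pos >> 32);
}

/* Recover the in-fragment offset from a packed position. */
static uint32_t toy_fpos_off(int64_t pos)
{
    return (uint32_t)((uint64_t)pos & 0xffffffffu);
}
```

Under this packing, any non-zero frag yields a position of at least 2^32, so comparing it against a limit meant for file sizes says nothing useful about a directory.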
* rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op | Ilya Dryomov | 2014-04-03 | 1 | -15/+39
  In an effort to reduce fragmentation, prefix every rbd write with a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set to the object size (1 << order). Backwards compatibility is taken care of on the libceph/osd side.
  "The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to do it once. The reason every rbd write is prefixed is that rbd doesn't explicitly create objects and relies on writes creating them implicitly, so there is no place to stick a single hint op into. To get around that we decided to prefix every rbd write with a hint (just like write and setattr ops, hint op will create an object implicitly if it doesn't exist)."
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
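The shape of a hint-prefixed write can be sketched as a two-op request. The types `toy_opcode` and `toy_write_req` and the function `toy_build_write()` are hypothetical and do not mirror the rbd or libceph structures; only the hint-before-write ordering and the (1 << order) size come from the commit message.

```c
#include <stdint.h>

/* Hypothetical op codes standing in for the real osd ops. */
enum toy_opcode {
    TOY_OP_SETALLOCHINT,
    TOY_OP_WRITE
};

struct toy_write_req {
    enum toy_opcode ops[2];
    int num_ops;
    uint64_t expected_write_size;  /* carried by the hint op */
};

/* Build a write request whose first op is the allocation hint. */
static struct toy_write_req toy_build_write(unsigned int order)
{
    struct toy_write_req req;

    req.ops[0] = TOY_OP_SETALLOCHINT;
    req.ops[1] = TOY_OP_WRITE;
    req.num_ops = 2;
    req.expected_write_size = (uint64_t)1 << order;  /* object size */
    return req;
}
```

For an image with order 22, the hint would carry an expected write size of 4 MiB.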
* rbd: num_ops parameter for rbd_osd_req_create() | Ilya Dryomov | 2014-04-03 | 1 | -10/+18
  In preparation for prefixing rbd writes with an allocation hint, introduce a num_ops parameter for rbd_osd_req_create(). The rationale is that not every write request is a write op that needs to be prefixed (e.g. watch op), so the num_ops logic needs to be in the callers.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: bump CEPH_OSD_MAX_OP to 3 | Ilya Dryomov | 2014-04-03 | 2 | -2/+2
  Our longest osd request now contains 3 ops: copyup+hint+write.
  Also, CEPH_OSD_MAX_OP value in a BUG_ON in rbd_osd_req_callback() was hard-coded to 2. Fix it, and switch to rbd_assert while at it.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op | Ilya Dryomov | 2014-04-03 | 3 | -0/+42
  This is primarily for rbd's benefit and is supposed to combat fragmentation:
  "... knowing that rbd images have a 4m size, librbd can pass a hint that will let the osd do the xfs allocation size ioctl on new files so that they are allocated in 1m or 4m chunks. We've seen cases where users with rbd workloads have very high levels of fragmentation in xfs and this would mitigate that and probably have a pretty nice performance benefit."
  SETALLOCHINT is considered advisory, so our backwards compatibility mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: encode CEPH_OSD_OP_FLAG_* op flags | Ilya Dryomov | 2014-04-03 | 3 | -1/+4
  Encode ceph_osd_op::flags field so that it gets sent over the wire.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: fix error paths in rbd_img_request_fill() | Ilya Dryomov | 2014-04-03 | 1 | -1/+1
  Doing rbd_obj_request_put() in rbd_img_request_fill() error paths is not only insufficient, but also triggers an rbd_assert() in rbd_obj_request_destroy():
      Assertion failure in rbd_obj_request_destroy() at line 1867:
      rbd_assert(obj_request->img_request == NULL);
  rbd_img_obj_request_add() adds obj_requests to the img_request, the opposite is rbd_img_obj_request_del(). Use it.
  Fixes: http://tracker.ceph.com/issues/7327
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>