Commit message | Author | Age | Files | Lines
* ceph: check directory's completeness before emitting directory entry | Yan, Zheng | 2014-04-28 | 1 | -10/+12
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
* ceph: skip invalid dentry during dcache readdir | Yan, Zheng | 2014-04-06 | 1 | -5/+8
  Skip dentries that were added before the MDS issued FILE_SHARED to the client.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
* libceph: dump pool {read,write}_tier to debugfs | Ilya Dryomov | 2014-04-04 | 1 | -3/+3
  Dump pool {read,write}_tier to debugfs. While at it, fix up printk type specifiers and remove the unnecessary cast to unsigned long long.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
* libceph: output primary affinity values on osdmap updates | Ilya Dryomov | 2014-04-04 | 1 | -0/+2
  Similar to osd weights, output primary affinity values on incremental osdmap updates.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
* ceph: flush cap release queue when trimming session caps | Yan, Zheng | 2014-04-04 | 1 | -0/+3
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: don't grab open file reference for aborted request | Yan, Zheng | 2014-04-04 | 1 | -1/+1
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: drop extra open file reference in ceph_atomic_open() | Yan, Zheng | 2014-04-04 | 1 | -1/+2
  ceph_atomic_open() calls ceph_open() after receiving the MDS reply. ceph_open() grabs an extra open file reference. (The open request already holds an open file reference.)
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: preallocate buffer for readdir reply | Yan, Zheng | 2014-04-04 | 3 | -21/+59
  Preallocate the buffer for the readdir reply. Limit the number of entries in the readdir reply according to the buffer size.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* libceph: enable PRIMARY_AFFINITY feature bit | Ilya Dryomov | 2014-04-04 | 1 | -1/+2
  Announce our support for osdmaps with non-default primary affinity values.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() | Ilya Dryomov | 2014-04-04 | 1 | -75/+4
  Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() and get rid of the now unused calc_pg_raw().
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: add support for osd primary affinity | Ilya Dryomov | 2014-04-04 | 1 | -0/+68
  Respond to non-default primary_affinity values accordingly. (Primary affinity allows the admin to shift 'primary responsibility' away from specific osds, effectively shifting around the read side of the workload and whatever overhead is incurred by peering and writes by virtue of being the primary.)
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: add support for primary_temp mappings | Ilya Dryomov | 2014-04-04 | 1 | -1/+6
  Change apply_temps() to override the primary in the same way pg_temp overrides the osd set. primary_temp overrides the pg_temp primary too.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: return primary from ceph_calc_pg_acting() | Ilya Dryomov | 2014-04-04 | 3 | -15/+17
  In preparation for adding support for primary_temp, stop assuming primaryness: add a primary out parameter to ceph_calc_pg_acting() and change call sites accordingly. Primary is now specified separately from the order of osds in the set.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: switch ceph_calc_pg_acting() to new helpers | Ilya Dryomov | 2014-04-04 | 2 | -14/+39
  Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(), raw_to_up_osds() and apply_temps().
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: introduce apply_temps() helper | Ilya Dryomov | 2014-04-04 | 1 | -0/+52
  apply_temps() helper for applying various temporary mappings (at this point only pg_temp mappings) to the up set, thereby transforming it into an acting set.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers | Ilya Dryomov | 2014-04-04 | 1 | -0/+76
  pg_to_raw_osds() helper for computing a raw (crush) set, which can contain nonexistent and down osds.
  raw_to_up_osds() helper for pruning nonexistent and down osds from the raw set, thereby transforming it into an up set, and determining the up primary.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
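The raw-to-up pruning step described above can be sketched in userspace C. This is a minimal illustration, not the kernel helper: the `demo_` names, the state flags, and the choice of the first surviving osd as the up primary are all assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical state flags; the real definitions live in the ceph headers. */
#define DEMO_OSD_EXISTS 1u
#define DEMO_OSD_UP     2u

/*
 * Copy into `up` every osd from `raw` that both exists and is up,
 * preserving order. `state` is indexed by osd id and has `max_osd`
 * entries. Returns the size of the up set and stores the first
 * surviving osd in *primary (-1 if the up set is empty).
 * Assumption: the up primary is simply the first osd of the up set.
 */
static int demo_raw_to_up(const int *raw, int raw_len,
                          const unsigned *state, int max_osd,
                          int *up, int *primary)
{
    const unsigned want = DEMO_OSD_EXISTS | DEMO_OSD_UP;
    int i, n = 0;

    for (i = 0; i < raw_len; i++) {
        int osd = raw[i];

        if (osd < 0 || osd >= max_osd)
            continue;                   /* id outside the map */
        if ((state[osd] & want) != want)
            continue;                   /* nonexistent or down */
        up[n++] = osd;
    }
    *primary = n > 0 ? up[0] : -1;
    return n;
}
```

The sketch always shifts the surviving osds left. Whether a pool may shift osds at all is a separate question that this example does not model.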
* libceph: ceph_can_shift_osds(pool) and pool type defines | Ilya Dryomov | 2014-04-04 | 2 | -2/+15
  Bring in pg_pool_t::can_shift_osds() counterpart along with pool type defines.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions | Ilya Dryomov | 2014-04-04 | 1 | -1/+13
  Sync up with ceph.git definitions. Bring in ceph_osd_is_down().
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: enable OSDMAP_ENC feature bit | Ilya Dryomov | 2014-04-04 | 1 | -0/+1
  Announce our support for "new" (v7 - split and separately versioned client and osd sections) osdmap encoding.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: primary_affinity decode bits | Ilya Dryomov | 2014-04-04 | 1 | -0/+72
  Add two helpers to decode primary_affinity (full map, vector<u32>) and new_primary_affinity (inc map, map<u32, u32>) and switch to them.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: primary_affinity infrastructure | Ilya Dryomov | 2014-04-04 | 4 | -2/+57
  Add primary_affinity infrastructure. primary_affinity values are stored in a max_osd-sized array, hanging off ceph_osdmap, similar to the osd_weight array.
  Introduce {get,set}_primary_affinity() helpers, primarily to return CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to abstract out osd_primary_affinity array allocation and initialization.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
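A getter/setter pair of the kind described above might look roughly like the following userspace sketch. The `demo_` names, the struct layout, and the numeric default are assumptions for illustration; only the idea (return a default while the array is unallocated, allocate and initialize it lazily on first set) comes from the commit message.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed default value, standing in for CEPH_OSD_DEFAULT_PRIMARY_AFFINITY. */
#define DEMO_DEFAULT_PRIMARY_AFFINITY 0x10000u

/* Hypothetical, stripped-down osdmap. */
struct demo_osdmap {
    int max_osd;
    uint32_t *osd_primary_affinity;   /* NULL until first set */
};

/* Return the stored affinity, or the default if nothing was ever set. */
static uint32_t demo_get_primary_affinity(const struct demo_osdmap *map, int osd)
{
    if (!map->osd_primary_affinity || osd < 0 || osd >= map->max_osd)
        return DEMO_DEFAULT_PRIMARY_AFFINITY;
    return map->osd_primary_affinity[osd];
}

/* Allocate the array on first use, fill it with the default, then store. */
static int demo_set_primary_affinity(struct demo_osdmap *map, int osd, uint32_t aff)
{
    int i;

    if (osd < 0 || osd >= map->max_osd)
        return -1;
    if (!map->osd_primary_affinity) {
        map->osd_primary_affinity =
            malloc((size_t)map->max_osd * sizeof(uint32_t));
        if (!map->osd_primary_affinity)
            return -1;
        for (i = 0; i < map->max_osd; i++)
            map->osd_primary_affinity[i] = DEMO_DEFAULT_PRIMARY_AFFINITY;
    }
    map->osd_primary_affinity[osd] = aff;
    return 0;
}
```

Keeping the array unallocated in the common case means maps that never change an affinity pay no memory cost, and callers never need to check for NULL themselves.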
* libceph: primary_temp decode bits | Ilya Dryomov | 2014-04-04 | 1 | -0/+69
  Add a common helper to decode both primary_temp (full map, map<pg_t, u32>) and new_primary_temp (inc map, same) and switch to it.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: primary_temp infrastructure | Ilya Dryomov | 2014-04-04 | 3 | -1/+21
  Add primary_temp mappings infrastructure. struct ceph_pg_mapping is overloaded; primary_temp mappings are stored in an rb-tree, rooted at ceph_osdmap, in a manner similar to pg_temp mappings.
  Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap, one 'primary_temp <pgid> <osd>' per line, e.g.:
      primary_temp 2.6 4
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: generalize ceph_pg_mapping | Ilya Dryomov | 2014-04-04 | 3 | -8/+13
  In preparation for adding support for primary_temp mappings, generalize struct ceph_pg_mapping so it can hold mappings other than pg_temp.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: introduce get_osdmap_client_data_v() | Ilya Dryomov | 2014-04-04 | 1 | -16/+65
  Full and incremental osdmaps are structured identically and have identical headers. Add a helper to decode both "old" (16-bit version, v6) and "new" (8-bit struct_v+struct_compat+struct_len, v7) osdmap encoding headers and switch to it.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: introduce decode{,_new}_pg_temp() and switch to them | Ilya Dryomov | 2014-04-04 | 1 | -72/+67
  Consolidate pg_temp (full map, map<pg_t, vector<u32>>) and new_pg_temp (inc map, same) decoding logic into a common helper and switch to it.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: switch osdmap_set_max_osd() to krealloc() | Ilya Dryomov | 2014-04-04 | 1 | -15/+17
  Use krealloc() instead of rolling our own. (krealloc() with a NULL first argument acts as a kmalloc().) Properly initialize the new array elements. This is needed to make future additions to osdmap easier.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
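The grow-and-initialize pattern mentioned above can be illustrated in userspace with realloc(), which treats a NULL pointer the same way krealloc() does. This is a sketch under assumptions: the `demo_` struct and function are hypothetical and carry a single array, whereas the real osdmap has several.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one of the per-osd arrays in an osdmap. */
struct demo_map {
    uint32_t *weight;   /* one entry per osd, NULL before the first call */
    int max_osd;        /* number of valid entries */
};

/*
 * Resize the array to `max` entries. realloc(NULL, n) behaves like
 * malloc(n), so the very first call needs no special case. Entries past
 * the old length are zeroed explicitly because realloc() leaves them
 * uninitialized. Returns 0 on success, -1 on a bad size or allocation
 * failure (the old array stays valid in the failure case).
 */
static int demo_set_max_osd(struct demo_map *map, int max)
{
    uint32_t *weight;

    if (max <= 0)
        return -1;
    weight = realloc(map->weight, (size_t)max * sizeof(*weight));
    if (!weight)
        return -1;
    if (max > map->max_osd)
        memset(weight + map->max_osd, 0,
               (size_t)(max - map->max_osd) * sizeof(*weight));
    map->weight = weight;
    map->max_osd = max;
    return 0;
}
```

Assigning the realloc() result to a temporary first avoids losing the original pointer when the allocation fails.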
* libceph: introduce decode{,_new}_pools() and switch to them | Ilya Dryomov | 2014-04-04 | 1 | -37/+57
  Consolidate pools (full map, map<u64, pg_pool_t>) and new_pools (inc map, same) decoding logic into a common helper and switch to it.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: rename __decode_pool{,_names}() to decode_pool{,_names}() | Ilya Dryomov | 2014-04-04 | 1 | -6/+8
  To be in line with all the other osdmap decode helpers.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: fix and clarify ceph_decode_need() sizes | Ilya Dryomov | 2014-04-04 | 1 | -6/+7
  Sum up sizeof(...) results instead of (incorrectly) hard-coding the number of bytes, expressed in ints and longs.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: nuke bogus encoding version check in osdmap_apply_incremental() | Ilya Dryomov | 2014-04-04 | 1 | -5/+4
  Only version 6 of osdmap encoding is supported; anything other than version 6 results in an error and halts the decoding process. Checking if version is >= 5 is therefore bogus.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: fixup error handling in osdmap_apply_incremental() | Ilya Dryomov | 2014-04-04 | 1 | -32/+34
  The existing error handling scheme requires resetting err to -EINVAL prior to calling any ceph_decode_* macro. This is ugly and fragile, and there already are a few places where we would return 0 on error, due to a missing reset. Follow osdmap_decode() and fix this by adding a special e_inval label to be used by all ceph_decode_* macros.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: fix crush_decode() call site in osdmap_decode() | Ilya Dryomov | 2014-04-04 | 1 | -5/+2
  The size of the memory area fed to crush_decode() should be limited not only by the osdmap end, but also by the crush map length. Also, drop the unnecessary dout() (the dout() in crush_decode() conveys the same info) and step past the crush map only if it is decoded successfully.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: check length of osdmap osd arrays | Ilya Dryomov | 2014-04-04 | 1 | -4/+10
  Check length of osd_state, osd_weight and osd_addr arrays. They should all have exactly max_osd elements after the call to osdmap_set_max_osd().
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: safely decode max_osd value in osdmap_decode() | Ilya Dryomov | 2014-04-04 | 1 | -2/+4
  The max_osd value is not covered by any ceph_decode_need(). Use a safe version of the ceph_decode_* macro to decode it.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: fixup error handling in osdmap_decode() | Ilya Dryomov | 2014-04-04 | 1 | -26/+27
  The existing error handling scheme requires resetting err to -EINVAL prior to calling any ceph_decode_* macro. This is ugly and fragile, and there already are a few places where we would return 0 on error, due to a missing reset. Fix this by adding a special e_inval label to be used by all ceph_decode_* macros.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
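The dedicated-label scheme described above can be sketched as follows. The macro and function names are hypothetical stand-ins for the ceph_decode_* macros, and the sketch copies values in host byte order for brevity, which is an assumption of the example rather than a statement about the wire format.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical safe-decode macro: jump to `bad` if fewer than four bytes
 * remain, otherwise copy a 32-bit value and advance the cursor.
 */
#define DEMO_DECODE_32_SAFE(p, end, v, bad)                 \
    do {                                                    \
        if ((size_t)((end) - *(p)) < sizeof(uint32_t))      \
            goto bad;                                       \
        memcpy(&(v), *(p), sizeof(uint32_t));               \
        *(p) += sizeof(uint32_t);                           \
    } while (0)

/*
 * Decode two fields. Every macro jumps to the same e_inval label, so no
 * call site has to remember to preset err to -EINVAL beforehand.
 */
static int demo_decode(const unsigned char **p, const unsigned char *end,
                       uint32_t *epoch, uint32_t *max_osd)
{
    DEMO_DECODE_32_SAFE(p, end, *epoch, e_inval);
    DEMO_DECODE_32_SAFE(p, end, *max_osd, e_inval);
    return 0;

e_inval:
    return -EINVAL;
}
```

With the old scheme, a forgotten `err = -EINVAL;` before a decode macro meant a truncated buffer could return whatever err last held, including 0. Routing all short-buffer exits through one label removes that failure mode.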
* libceph: split osdmap allocation and decode steps | Ilya Dryomov | 2014-04-04 | 3 | -17/+31
  Split osdmap allocation and initialization into a separate function, ceph_osdmap_decode().
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: dump osdmap and enhance output on decode errors | Ilya Dryomov | 2014-04-04 | 1 | -6/+15
  Dump osdmap in hex on both full and incremental decode errors, to make it easier to match the contents with the error offset. dout() the map epoch and max_osd value on success.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: dump pg_temp mappings to debugfs | Ilya Dryomov | 2014-04-04 | 1 | -0/+11
  Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap, one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g.:
      pg_temp 2.6 [2,3,4]
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: do not prefix osd lines with \t in debugfs output | Ilya Dryomov | 2014-04-04 | 1 | -1/+1
  To save screen space in anticipation of more fields (e.g. primary affinity).
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: refer to osdmap directly in osdmap_show() | Ilya Dryomov | 2014-04-04 | 1 | -12/+14
  To make it more readable and save screen space.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Alex Elder <elder@linaro.org>
* crush: support chooseleaf_vary_r tunable (tunables3) by default | Ilya Dryomov | 2014-04-04 | 1 | -1/+9
  Add the TUNABLES3 feature (chooseleaf_vary_r tunable) to the set of features supported by default.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* crush: add SET_CHOOSELEAF_VARY_R step | Ilya Dryomov | 2014-04-04 | 2 | -0/+6
  This lets you adjust the vary_r tunable on a per-rule basis.
  Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* crush: add chooseleaf_vary_r tunable | Ilya Dryomov | 2014-04-04 | 2 | -6/+30
  The current crush_choose_firstn code will re-use the same 'r' value for the recursive call. That means that if we are hitting a collision or rejection for some reason (say, an OSD that is marked out) and need to retry, we will keep making the same (bad) choice in that recursive selection.
  Introduce a tunable that fixes that behavior by incorporating the parent 'r' value into the recursive starting point, so that a different path will be taken in subsequent placement attempts.
  Note that this was done from the get-go for the new crush_choose_indep algorithm.
  This was exposed by a user who was seeing PGs stuck in active+remapped after reweight-by-utilization because the up set mapped to a single OSD.
  Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
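One way to fold the parent 'r' into the recursive starting point is sketched below. The commit message does not spell out the formula, so the shift-based derivation and the `demo_sub_r` name are assumptions chosen for illustration; the point is only that a non-zero tunable makes the recursive call start from a value that changes with the parent's attempt number.

```c
/*
 * Derive the starting 'r' for the recursive (leaf) selection.
 *   vary_r == 0: legacy behavior, the recursion always starts from 0 and
 *                therefore repeats the same choice on every retry.
 *   vary_r >= 1: the parent's r is mixed in, scaled down by (vary_r - 1)
 *                bits, so retries at the parent level explore new paths.
 * The shift is an assumed scheme for this sketch.
 */
static unsigned int demo_sub_r(unsigned int parent_r, unsigned int vary_r)
{
    if (vary_r == 0)
        return 0;
    return parent_r >> (vary_r - 1);
}
```

With the tunable off, `demo_sub_r(5, 0)` and `demo_sub_r(6, 0)` both yield 0, which models the repeated bad choice. With it set to 1, they yield 5 and 6, so each parent retry descends along a different path.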
* crush: allow crush rules to set (re)tries counts to 0 | Ilya Dryomov | 2014-04-04 | 1 | -2/+2
  These two fields are misnomers; they are *retry* counts.
  Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
* crush: fix off-by-one errors in total_tries refactor | Ilya Dryomov | 2014-04-04 | 1 | -19/+27
  Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH code to allow adjustment of the retry counts on a per-pool basis. That commit had an off-by-one bug: the previous "tries" counter was a *retry* count, not a *try* count, but the new code was passing in 1 meaning there should be no retries.
  Fix the ftotal vs tries comparison to use < instead of <= to fix the problem. Note that the original code used <= here, which means the global "choose_total_tries" tunable is actually counting retries. Compensate for that by adding 1 in crush_do_rule when we pull the tunable into the local variable.
  This was noticed looking at output from a user provided osdmap. Unfortunately the map doesn't illustrate the change in mapping behavior and I haven't managed to construct one yet that does. Inspection of the crush debug output now aligns with prior versions, though.
  Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138.
  Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
  Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
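The difference between the two comparisons can be seen in a small model of the retry loop. This is a simplified sketch, not the CRUSH code: `demo_attempts` is a hypothetical function in which every attempt fails, so the return value is the total number of attempts a given budget allows.

```c
/*
 * Model a selection loop in which every attempt fails.
 *   strict != 0: retry while ftotal <  tries  (tries is a *try* count)
 *   strict == 0: retry while ftotal <= tries  (tries is a *retry* count)
 * Returns the number of attempts made before giving up.
 */
static int demo_attempts(int tries, int strict)
{
    int ftotal = 0;     /* failures so far */
    int attempts = 0;
    int retry;

    do {
        attempts++;
        ftotal++;       /* this attempt failed */
        retry = strict ? (ftotal < tries) : (ftotal <= tries);
    } while (retry);

    return attempts;
}
```

In this model, passing 1 with `<=` still allows a second attempt, while passing 1 with `<` allows exactly one, which is the "no retries" meaning the refactored callers intended. It also shows why the global tunable needs the +1 compensation: a retry budget of N under `<=` gives the same number of attempts as a try budget of N + 1 under `<`.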
* ceph: don't include ceph.{file,dir}.layout vxattr in listxattr() | Yan, Zheng | 2014-04-04 | 1 | -2/+2
  This avoids 'cp -a' modifying the layout of new files/directories.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* ceph: check buffer size in ceph_vxattrcb_layout() | Yan, Zheng | 2014-04-04 | 1 | -9/+25
  If the buffer size is zero, return the size of the layout vxattr. If the buffer size is not zero, check that it is large enough for the layout vxattr.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
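The size-probing convention described above follows the usual getxattr-style contract, sketched here in userspace C. The `demo_get_layout` name, the two fields, and the output format are assumptions for illustration and do not reproduce the actual layout vxattr string.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a (hypothetical) layout string and hand it to the caller.
 *   size == 0: only report how many bytes the value needs.
 *   size  > 0: fail with -ERANGE if the buffer is too small, otherwise
 *              copy the value (without a trailing NUL) and return its length.
 */
static long demo_get_layout(char *val, size_t size,
                            unsigned stripe_unit, unsigned stripe_count)
{
    char buf[64];
    int len;

    len = snprintf(buf, sizeof(buf), "stripe_unit=%u stripe_count=%u",
                   stripe_unit, stripe_count);
    if (len < 0 || (size_t)len >= sizeof(buf))
        return -EINVAL;             /* formatting failed or truncated */

    if (size == 0)
        return len;                 /* size query only */
    if (size < (size_t)len)
        return -ERANGE;             /* caller's buffer is too small */

    memcpy(val, buf, (size_t)len);
    return len;
}
```

A caller would typically probe with a zero-sized buffer first, allocate the returned number of bytes, and then call again with the real buffer.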
* ceph: fix null pointer dereference in discard_cap_releases() | Yan, Zheng | 2014-04-04 | 1 | -9/+12
  send_mds_reconnect() may call discard_cap_releases() after all release messages have been dropped by cleanup_cap_releases().
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  Reviewed-by: Sage Weil <sage@inktank.com>
* libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance() | Yan, Zheng | 2014-04-04 | 1 | -0/+6
  When there is no more data, ceph_msg_data_{pages,pagelist}_advance() should not move on to the next page.
  Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>