path: root/include/linux/ceph
Commit message (Author; Age; Files; Lines)
* libceph: fix ceph_msg_revoke() (Ilya Dryomov; 2016-01-21; 1 file; -1/+1)

    There are a number of problems with revoking a "was sending" message:

    (1) We never make any attempt to revoke data - only kvecs contribute to con->out_skip. However, once the header (envelope) is written to the socket, our peer learns data_len and sets itself to expect at least data_len bytes to follow front or front+middle. If ceph_msg_revoke() is called while the messenger is sending message's data portion, anything we send after that call is counted by the OSD towards the now revoked message's data portion. The effects vary, the most common one is the eventual hang - higher layers get stuck waiting for the reply to the message that was sent out after ceph_msg_revoke() returned and treated by the OSD as a bunch of data bytes. This is what Matt ran into.

    (2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs is wrong. If ceph_msg_revoke() is called before the tag is sent out or while the messenger is sending the header, we will get a connection reset, either due to a bad tag (0 is not a valid tag) or a bad header CRC, which kind of defeats the purpose of revoke. Currently the kernel client refuses to work with header CRCs disabled, but that will likely change in the future, making this even worse.

    (3) con->out_skip is not reset on connection reset, leading to one or more spurious connection resets if we happen to get a real one between con->out_skip being set in ceph_msg_revoke() and it being cleared in write_partial_skip().

    Fixing (1) and (3) is trivial. The idea behind fixing (2) is to never zero the tag or the header, i.e. send out tag+header regardless of when ceph_msg_revoke() is called. That way the header is always correct, no unnecessary resets are induced and revoke stands ready for disabled CRCs. Since ceph_msg_revoke() rips out con->out_msg, introduce a new "message out temp" and copy the header into it before sending.

    Cc: stable@vger.kernel.org # 4.0+
    Reported-by: Matt Conner <matt.conner@keepertech.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Tested-by: Matt Conner <matt.conner@keepertech.com>
    Reviewed-by: Sage Weil <sage@redhat.com>
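The accounting described in (1) and (2) can be pictured with a small user-space sketch. This is only an illustration under stated assumptions: the struct, its field names and `sketch_revoke_skip()` are hypothetical and are not the messenger's real data structures. The point is that tag and header are never part of the skip count, while every remaining byte the peer was told to expect (front, middle, data, footer) is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of a partially sent message.
 * Every field counts bytes that have NOT been written to the socket yet. */
struct sketch_out_msg {
	size_t tag_hdr_left;  /* tag + header: always sent as-is, never skipped */
	size_t front_left;
	size_t middle_left;
	size_t data_left;     /* the data portion must be counted too, see (1) */
	size_t footer_left;
};

/* Number of bytes that would have to be zero-filled after a revoke so that
 * the peer still receives as many bytes as the header announced. */
static size_t sketch_revoke_skip(const struct sketch_out_msg *m)
{
	assert(m != NULL);
	return m->front_left + m->middle_left + m->data_left + m->footer_left;
}
```

Note that `tag_hdr_left` deliberately does not enter the sum: in this model the real tag and header still go out, so the header CRC stays valid.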
* ceph: ceph_frag_contains_value can be boolean (Yaowei Bai; 2016-01-21; 1 file; -1/+1)

    This patch makes ceph_frag_contains_value return bool to improve readability due to this particular function only using either one or zero as its return value.

    No functional change.

    Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
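For readers unfamiliar with the helper, a bool-returning containment test might look like the user-space sketch below. The bit layout (top 8 bits hold the split depth, low 24 bits hold the value) and all `sketch_` names are assumptions made for illustration; they are not copied from ceph_frag.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bits 24..31 = number of split bits, bits 0..23 = value. */
static inline uint32_t sketch_frag_bits(uint32_t f)  { return f >> 24; }
static inline uint32_t sketch_frag_value(uint32_t f) { return f & 0xffffffu; }

/* Mask selecting the top `bits` bits of the 24-bit value space. */
static inline uint32_t sketch_frag_mask(uint32_t f)
{
	assert(sketch_frag_bits(f) <= 24);  /* larger depths would shift out of range */
	return (0xffffffu << (24 - sketch_frag_bits(f))) & 0xffffffu;
}

/* True if value v falls inside fragment f; the result is only ever yes or no,
 * which is why bool is the natural return type. */
static inline bool sketch_frag_contains_value(uint32_t f, uint32_t v)
{
	return (v & sketch_frag_mask(f)) == sketch_frag_value(f);
}
```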
* ceph: remove unused functions in ceph_frag.h (Yaowei Bai; 2016-01-21; 1 file; -35/+0)

    These functions were introduced in commit 3d14c5d2b ("ceph: factor out libceph from Ceph file system"). However, there's no user of these functions since then, so remove them for simplicity.

    Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
* libceph: add nocephx_sign_messages option (Ilya Dryomov; 2015-11-02; 1 file; -1/+2)

    Support for message signing was merged into 3.19, along with nocephx_require_signatures option. But, all that option does is allow the kernel client to talk to clusters that don't support MSG_AUTH feature bit. That's pretty useless, given that it's been supported since bobtail.

    Meanwhile, if one disables message signing on the server side with "cephx sign messages = false", it becomes impossible to use the kernel client since it expects messages to be signed if MSG_AUTH was negotiated. Add nocephx_sign_messages option to support this use case.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: stop duplicating client fields in messenger (Ilya Dryomov; 2015-11-02; 2 files; -10/+2)

    supported_features and required_features serve no purpose at all, while nocrc and tcp_nodelay belong to ceph_options::flags.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: msg signing callouts don't need con argument (Ilya Dryomov; 2015-11-02; 1 file; -3/+2)

    We can use msg->con instead - at the point we sign an outgoing message or check the signature on the incoming one, msg->con is always set. We wouldn't know how to sign a message without an associated session (i.e. msg->con == NULL) and being able to sign a message using an explicitly provided authorizer is of no use.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: advertise support for keepalive2 (Ilya Dryomov; 2015-09-17; 1 file; -0/+1)

    We are the client, but advertise keepalive2 anyway - for consistency, if nothing else. In the future the server might want to know whether its clients support keepalive2.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Yan, Zheng <zyan@redhat.com>
* libceph: don't access invalid memory in keepalive2 path (Ilya Dryomov; 2015-09-17; 1 file; -1/+3)

    This

        struct ceph_timespec ceph_ts;
        ...
        con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);

    wraps ceph_ts into a kvec and adds it to con->out_kvec array, yet ceph_ts becomes invalid on return from prepare_write_keepalive(). As a result, we send out bogus keepalive2 stamps. Fix this by encoding into a ceph_timespec member, similar to how acks are read and written.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Yan, Zheng <zyan@redhat.com>
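The lifetime problem and the shape of the fix can be shown with a user-space sketch. All types and names prefixed `sketch_` are hypothetical stand-ins for the kernel structures mentioned in the message, and endianness conversion is left out. The key point is that the queued kvec points at storage owned by the connection, which outlives the function that prepares the write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins for the kernel types named in the commit message. */
struct sketch_timespec { uint32_t tv_sec; uint32_t tv_nsec; };
struct sketch_kvec { const void *iov_base; size_t iov_len; };

struct sketch_con {
	struct sketch_kvec out_kvec[4];
	int out_kvec_left;
	/* Lives as long as the connection, so a queued kvec may point at it. */
	struct sketch_timespec out_temp_keepalive2;
};

/* Queue a buffer for sending; only the pointer is stored, not a copy. */
static void sketch_kvec_add(struct sketch_con *con, size_t len, const void *p)
{
	assert(con->out_kvec_left < 4);
	con->out_kvec[con->out_kvec_left].iov_base = p;
	con->out_kvec[con->out_kvec_left].iov_len = len;
	con->out_kvec_left++;
}

/* Encode the stamp into the connection member instead of a stack local,
 * so the pointer stored in out_kvec stays valid after this returns. */
static void sketch_prepare_keepalive2(struct sketch_con *con,
				      const struct timespec *now)
{
	con->out_temp_keepalive2.tv_sec = (uint32_t)now->tv_sec;
	con->out_temp_keepalive2.tv_nsec = (uint32_t)now->tv_nsec;
	sketch_kvec_add(con, sizeof(con->out_temp_keepalive2),
			&con->out_temp_keepalive2);
}
```

Had the stamp been a local variable inside `sketch_prepare_keepalive2()`, the stored `iov_base` would dangle as soon as the function returned, which is the bug the commit describes.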
* libceph: use keepalive2 to verify the mon session is alive (Yan, Zheng; 2015-09-08; 3 files; -1/+9)

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: enable ceph in a non-default network namespace (Ilya Dryomov; 2015-07-09; 1 file; -0/+3)

    Grab a reference on a network namespace of the 'rbd map' (in case of rbd) or 'mount' (in case of ceph) process and use that to open sockets instead of always using init_net and bailing if network namespace is anything but init_net. Be careful to not share struct ceph_client instances between different namespaces and don't add any code in the !CONFIG_NET_NS case.

    This is based on a patch from Hong Zhiguo <zhiguohong@tencent.com>.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Sage Weil <sage@redhat.com>
* ceph: pre-allocate data structure that tracks caps flushing (Yan, Zheng; 2015-06-25; 1 file; -0/+1)

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
* libceph: store timeouts in jiffies, verify user input (Ilya Dryomov; 2015-06-25; 1 file; -6/+11)

    There are currently three libceph-level timeouts that the user can specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of these are in seconds and no checking is done on user input: negative values are accepted, we multiply them all by HZ which may or may not overflow, arbitrarily large jiffies then get added together, etc.

    There is also a bug in the way mount_timeout=0 is handled. It's supposed to mean "infinite timeout", but that's not how wait.h APIs treat it and so __ceph_open_session() for example will busy loop without much chance of being interrupted if none of ceph-mons are there.

    Fix all this by verifying user input, storing timeouts capped by msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies() helper for all user-specified waits to handle infinite timeouts correctly.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
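The two ideas in this message, rejecting bad input up front and mapping a stored 0 to "wait forever", can be sketched in user space as follows. The tick rate, the upper bound and every `sketch_` name are assumptions chosen for illustration; this is not the kernel helper itself.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_HZ 250                 /* assumed tick rate, illustration only */
#define SKETCH_MAX_TIMEOUT LONG_MAX   /* stand-in for an infinite wait */

/* Validate a timeout given in seconds and convert it to ticks.
 * Negative values and values large enough to risk overflow are rejected. */
static bool sketch_parse_timeout(long secs, unsigned long *ticks)
{
	assert(ticks != NULL);
	if (secs < 0 || secs > INT_MAX / 1000)
		return false;
	*ticks = (unsigned long)secs * SKETCH_HZ;
	return true;
}

/* A stored value of 0 means "no timeout": hand the wait primitive the
 * infinite sentinel instead of 0, which it would treat as "do not wait". */
static unsigned long sketch_timeout_ticks(unsigned long ticks)
{
	return ticks ? ticks : (unsigned long)SKETCH_MAX_TIMEOUT;
}
```

Every user-specified wait would then pass its stored value through `sketch_timeout_ticks()` rather than using it directly.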
* libceph: nuke time_sub() (Ilya Dryomov; 2015-06-25; 1 file; -9/+0)

    Unused since ceph got merged into mainline I guess.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: allow setting osd_req_op's flags (Yan, Zheng; 2015-06-25; 1 file; -1/+1)

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: announce support for straw2 buckets (Ilya Dryomov; 2015-04-22; 1 file; -1/+15)

    Sync up feature bits and enable CEPH_FEATURE_CRUSH_V4.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: rename snapshot support (Yan, Zheng; 2015-04-22; 1 file; -0/+1)

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
* libceph: simplify our debugfs attr macro (Ilya Dryomov; 2015-04-20; 1 file; -7/+1)

    No need to do single_open()'s job ourselves.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: expose client options through debugfs (Ilya Dryomov; 2015-04-20; 1 file; -0/+1)

    Add a client_options attribute for showing libceph options.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph, ceph: split ceph_show_options() (Ilya Dryomov; 2015-04-20; 1 file; -0/+1)

    Split ceph_show_options() into two pieces and move the piece responsible for printing client (libceph) options into net/ceph. This way people adding a libceph option wouldn't have to remember to update code in fs/ceph.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: osdmap.h: Add missing format newlines (Joe Perches; 2015-04-20; 1 file; -3/+2)

    To avoid possible interleaving, add missing '\n' to formats.

    Convert pr_warning to pr_warn while there.

    Signed-off-by: Joe Perches <joe@perches.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: tcp_nodelay support (Chaitanya Huilgol; 2015-02-19; 2 files; -2/+5)

    Setting the TCP_NODELAY socket option on connection sockets disables Nagle’s algorithm and improves latency characteristics. The tcp_nodelay (default) and notcp_nodelay option flags are provided to enable or disable setting the socket option.

    Signed-off-by: Chaitanya Huilgol <chaitanya.huilgol@sandisk.com>
    [idryomov@redhat.com: NO_TCP_NODELAY -> TCP_NODELAY, minor adjustments]
    Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
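For reference, this is what toggling the option looks like from user space with the standard sockets API. The kernel client sets the option on its own in-kernel sockets, so the snippet below is only an analogy; the `sketch_` function names are made up for this example.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Enable or disable TCP_NODELAY on a TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
static int sketch_set_nodelay(int fd, bool enable)
{
	int optval = enable ? 1 : 0;

	assert(fd >= 0);
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &optval, sizeof(optval));
}

/* Read the option back: 1 if set, 0 if clear, -1 on error. */
static int sketch_get_nodelay(int fd)
{
	int optval = 0;
	socklen_t len = sizeof(optval);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &optval, &len) < 0)
		return -1;
	return optval != 0;
}
```

With the option set, small writes are sent immediately instead of being coalesced, which is the latency improvement the commit message refers to.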
* ceph: handle SESSION_FORCE_RO message (Yan, Zheng; 2015-02-19; 1 file; -0/+1)

    Mark session as readonly and wake up all cap waiters.

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
* libceph: nuke pool op infrastructure (Ilya Dryomov; 2015-02-19; 2 files; -44/+1)

    On Mon, Dec 22, 2014 at 5:35 PM, Sage Weil <sage@newdream.net> wrote:
    > On Mon, 22 Dec 2014, Ilya Dryomov wrote:
    >> Actually, pool op stuff has been unused for over two years - looks like
    >> it was added for rbd create_snap and that got ripped out in 2012. It's
    >> unlikely we'd ever need to manage pools or snaps from the kernel client
    >> so I think it makes sense to nuke it. Sage?
    >
    > Yep!

    Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: fix sparse endianness warnings (Ilya Dryomov; 2015-01-08; 1 file; -2/+2)

    The only real issue is the one in auth_x.c and it came with 3.19-rc1 merge.

    Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: fixup includes in pagelist.h (Ilya Dryomov; 2014-12-17; 1 file; -1/+3)

    pagelist.h needs to include linux/types.h and asm/byteorder.h and not rely on other headers pulling yet another set of headers.

    Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* ceph: use getattr request to fetch inline data (Yan, Zheng; 2014-12-17; 1 file; -0/+2)

    Add a new parameter 'locked_page' to ceph_do_getattr(). Inline data in the getattr reply, if any, will be copied to the page.

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: add inline data to pagecache (Yan, Zheng; 2014-12-17; 1 file; -0/+1)

    Request reply and cap message can contain inline data. Add inline data to the page cache if there is an Fc cap.

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
* libceph: specify position of extent operation (Yan, Zheng; 2014-12-17; 1 file; -1/+2)

    Allow specifying the position of an extent operation in a multi-operation osd request. This is required for cephfs to convert inline data to normal data (compare xattr, then write object).

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
    Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: add SETXATTR/CMPXATTR osd operations support (Yan, Zheng; 2014-12-17; 1 file; -0/+10)

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
    Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: require cephx message signature by default (Yan, Zheng; 2014-12-17; 1 file; -0/+1)

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
    Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: update ceph_msg_header structure (John Spray; 2014-12-17; 1 file; -1/+2)

    2 bytes of what was reserved space is now used by userspace for the compat_version field.

    Signed-off-by: John Spray <john.spray@redhat.com>
    Reviewed-by: Sage Weil <sage@redhat.com>
* libceph: message signature support (Yan, Zheng; 2014-12-17; 4 files; -1/+43)

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
* libceph: nuke ceph_kvfree() (Ilya Dryomov; 2014-12-17; 2 files; -3/+1)

    Use kvfree() from linux/mm.h instead, which is identical. Also fix the ceph_buffer comment: we will allocate with kmalloc() up to 32k - the value of PAGE_ALLOC_COSTLY_ORDER, but that really is just an implementation detail so don't mention it at all.

    Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* ceph: fix file lock interruption (Yan, Zheng; 2014-12-17; 1 file; -2/+5)

    When a lock operation is interrupted, current code sends an unlock request to MDS to undo the lock operation. This method does not work as expected because the unlock request can drop locks that have already been acquired.

    The fix is to use the newly introduced CEPH_LOCK_FCNTL_INTR/CEPH_LOCK_FLOCK_INTR requests to interrupt a blocked file lock request. These requests do not drop locks that have already been acquired, they only interrupt the blocked file lock request.

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
* libceph: sync osd op definitions in rados.h (Ilya Dryomov; 2014-10-14; 1 file; -96/+129)

    Bring in missing osd ops and strings, use macros to eliminate multiple points of maintenance.

    Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
    Reviewed-by: Sage Weil <sage@redhat.com>
* libceph: remove redundant declaration (Fabian Frederick; 2014-10-14; 1 file; -1/+0)

    ceph_release_page_vector was defined twice in libceph.h

    Signed-off-by: Fabian Frederick <fabf@skynet.be>
    Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: reference counting pagelist (Yan, Zheng; 2014-10-14; 1 file; -1/+4)

    This allows a pagelist to present data that may be sent multiple times.

    Signed-off-by: Yan, Zheng <zyan@redhat.com>
    Reviewed-by: Sage Weil <sage@redhat.com>
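The idea of a reference-counted buffer that several senders can hold at once can be sketched in user space with C11 atomics. The struct and the `sketch_` functions are hypothetical and model only the counting, not the page handling of the real pagelist.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, simplified pagelist: only the reference count is modelled. */
struct sketch_pagelist {
	atomic_int refcnt;
	size_t length;
};

/* Allocate a list; the creator holds the first reference. */
static struct sketch_pagelist *sketch_pagelist_alloc(void)
{
	struct sketch_pagelist *pl = calloc(1, sizeof(*pl));

	if (pl)
		atomic_init(&pl->refcnt, 1);
	return pl;
}

/* Take an additional reference, e.g. before queueing the data for a resend. */
static void sketch_pagelist_get(struct sketch_pagelist *pl)
{
	atomic_fetch_add(&pl->refcnt, 1);
}

/* Drop one reference. Frees the list and returns 1 if it was the last one,
 * otherwise returns 0 and leaves the list alive for the other holders. */
static int sketch_pagelist_put(struct sketch_pagelist *pl)
{
	int old = atomic_fetch_sub(&pl->refcnt, 1);

	assert(old > 0);
	if (old != 1)
		return 0;
	free(pl);
	return 1;
}
```

Each message that carries the data takes its own reference, so the pages stay valid until the last sender has dropped it.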
* libceph: nuke ceph_osdc_unregister_linger_request() (Ilya Dryomov; 2014-07-08; 1 file; -2/+0)

    Remove now unused ceph_osdc_unregister_linger_request().

    Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: introduce ceph_osdc_cancel_request() (Ilya Dryomov; 2014-07-08; 1 file; -0/+1)

    Introduce ceph_osdc_cancel_request() intended for canceling requests from the higher layers (rbd and cephfs). Because higher layers are in charge and are supposed to know what and when they are canceling, the request is not completed, only unref'ed and removed from the libceph data structures.

    __cancel_request() is no longer called before __unregister_request(), because __unregister_request() unconditionally revokes r_request and there is no point in trying to do it twice.

    Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: move and add dout()s to ceph_osdc_request_{get,put}() (Ilya Dryomov; 2014-07-08; 1 file; -9/+2)

    Add dout()s to ceph_osdc_request_{get,put}(). Also move them to .c and turn kref release callback into a static function.

    Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: move and add dout()s to ceph_msg_{get,put}() (Ilya Dryomov; 2014-07-08; 1 file; -12/+2)

    Add dout()s to ceph_msg_{get,put}(). Also move them to .c and turn kref release callback into a static function.

    Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: rename ceph_osd_request::r_linger_osd to r_linger_osd_item (Ilya Dryomov; 2014-07-08; 1 file; -1/+1)

    So that:

        req->r_osd_item --> osd->o_requests list
        req->r_linger_osd_item --> osd->o_linger_requests list

    Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
* Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client (Linus Torvalds; 2014-06-12; 2 files; -3/+10)
|\
| |   Pull Ceph updates from Sage Weil:
| |   "This has a mix of bug fixes and cleanups.
| |
| |   Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT check when a second rbd image is mapped and a couple memory leaks. Zheng fixes several issues with fragmented directories and multiple MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's patches fix setting and unsetting RBD images read-only.
| |
| |   Naturally there are several other cleanups mixed in for good measure"
| |
| |   * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
| |     rbd: only set disk to read-only once
| |     rbd: move calls that may sleep out of spin lock range
| |     rbd: add ioctl for rbd
| |     ceph: use truncate_pagecache() instead of truncate_inode_pages()
| |     ceph: include time stamp in every MDS request
| |     rbd: fix ida/idr memory leak
| |     rbd: use reference counts for image requests
| |     rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()
| |     rbd: make sure we have latest osdmap on 'rbd map'
| |     libceph: add ceph_monc_wait_osdmap()
| |     libceph: mon_get_version request infrastructure
| |     libceph: recognize poolop requests in debugfs
| |     ceph: refactor readpage_nounlock() to make the logic clearer
| |     mds: check cap ID when handling cap export message
| |     ceph: remember subtree root dirfrag's auth MDS
| |     ceph: introduce ceph_fill_fragtree()
| |     ceph: handle cap import atomically
| |     ceph: pre-allocate ceph_cap struct for ceph_add_cap()
| |     ceph: update inode fields according to issued caps
| |     rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
| |     ...
| * libceph: add ceph_monc_wait_osdmap() (Ilya Dryomov; 2014-06-06; 1 file; -0/+2)
| |
| |   Add ceph_monc_wait_osdmap(), which will block until the osdmap with the specified epoch is received or timeout occurs.
| |
| |   Export both of these as they are going to be needed by rbd.
| |
| |   Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
| |   Reviewed-by: Sage Weil <sage@inktank.com>
| * libceph: mon_get_version request infrastructure (Ilya Dryomov; 2014-06-06; 1 file; -3/+6)
| |
| |   Add support for mon_get_version requests to libceph. This reuses much of the ceph_mon_generic_request infrastructure, with one exception. Older OSDs don't set mon_get_version reply hdr->tid even if the original request had a non-zero tid, which makes it impossible to lookup ceph_mon_generic_request contexts by tid in get_generic_reply() for such replies. As a workaround, we allocate a reply message on the reply path. This can probably interfere with revoke, but I don't see a better way.
| |
| |   Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
| |   Reviewed-by: Sage Weil <sage@inktank.com>
| * ceph: update inode fields according to issued caps (Yan, Zheng; 2014-06-06; 1 file; -0/+2)
| |
| |   Cap message and request reply from non-auth MDS may carry stale information (corresponding locks are in LOCK states) even if they have the newest inode version. So the client should update inode fields according to issued caps.
| |
| |   Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
* | ceph_sync_read: stop poking into iov_iter guts (Al Viro; 2014-05-06; 1 file; -2/+0)
|/
|   Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* libceph: enable PRIMARY_AFFINITY feature bit (Ilya Dryomov; 2014-04-04; 1 file; -1/+2)

    Announce our support for osdmaps with non-default primary affinity values.

    Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: return primary from ceph_calc_pg_acting() (Ilya Dryomov; 2014-04-04; 1 file; -1/+1)

    In preparation for adding support for primary_temp, stop assuming primaryness: add a primary out parameter to ceph_calc_pg_acting() and change call sites accordingly. Primary is now specified separately from the order of osds in the set.

    Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
    Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: switch ceph_calc_pg_acting() to new helpers (Ilya Dryomov; 2014-04-04; 1 file; -1/+1)

    Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(), raw_to_up_osds() and apply_temps().

    Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
    Reviewed-by: Alex Elder <elder@linaro.org>