summaryrefslogtreecommitdiffstats
path: root/drivers/block/rbd.c
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds2016-10-101-264/+1168
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull Ceph updates from Ilya Dryomov: "The big ticket item here is support for rbd exclusive-lock feature, with maintenance operations offloaded to userspace (Douglas Fuller, Mike Christie and myself). Another block device bullet is a series fixing up layering error paths (myself). On the filesystem side, we've got patches that improve our handling of buffered vs dio write races (Neil Brown) and a few assorted fixes from Zheng. Also included a couple of random cleanups and a minor CRUSH update" * tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client: (39 commits) crush: remove redundant local variable crush: don't normalize input of crush_ln iteratively libceph: ceph_build_auth() doesn't need ceph_auth_build_hello() libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello() ceph: fix description for rsize and rasize mount options rbd: use kmalloc_array() in rbd_header_from_disk() ceph: use list_move instead of list_del/list_add ceph: handle CEPH_SESSION_REJECT message ceph: avoid accessing / when mounting a subpath ceph: fix mandatory flock check ceph: remove warning when ceph_releasepage() is called on dirty page ceph: ignore error from invalidate_inode_pages2_range() in direct write ceph: fix error handling of start_read() rbd: add rbd_obj_request_error() helper rbd: img_data requests don't own their page array rbd: don't call rbd_osd_req_format_read() for !img_data requests rbd: rework rbd_img_obj_exists_submit() error paths rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback() rbd: move bumping img_request refcount into rbd_obj_request_submit() rbd: mark the original request as done if stat request fails ...
| * rbd: use kmalloc_array() in rbd_header_from_disk()Markus Elfring2016-10-031-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | * A multiplication for the size determination of a memory allocation indicated that an array data structure should be processed. Thus use the corresponding function "kmalloc_array". This issue was detected by using the Coccinelle software. * Delete the local variable "size" which became unnecessary with this refactoring. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * rbd: add rbd_obj_request_error() helperIlya Dryomov2016-10-031-10/+18
| | | | | | | | | | | | | | | | Pull setting an error and marking a request done code into a new helper. obj_request_img_data_test() check isn't strictly needed right now, but makes it applicable to !img_data requests and a bit safer. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * rbd: img_data requests don't own their page arrayIlya Dryomov2016-10-031-8/+3
| | | | | | | | | | | | | | | | | | | | Move the check into rbd_obj_request_destroy() to avoid use-after-free on errors in rbd_img_request_fill(..., OBJ_REQUEST_PAGES, ...), where pages, owned by the caller, gets freed in rbd_img_request_fill(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
| * rbd: don't call rbd_osd_req_format_read() for !img_data requestsIlya Dryomov2016-10-031-7/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Accessing obj_request->img_request union field is only valid for object requests associated with an image (i.e. if obj_request_img_data_test() returns true). rbd_osd_req_format_read() used to do more, but now it just sets osd_req->snap_id. Standalone and stat object requests always go to the HEAD revision and are fine with CEPH_NOSNAP set by libceph, so get around the invalid union field use by simply not calling rbd_osd_req_format_read() in those places. Reported-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
| * rbd: rework rbd_img_obj_exists_submit() error pathsIlya Dryomov2016-10-031-20/+22
| | | | | | | | | | | | | | | | | | | | | | | | - don't put obj_request before rbd_obj_request_get() if rbd_obj_request_create() fails - don't leak pages if rbd_obj_request_create() fails - don't leak stat_request if rbd_osd_req_create() fails Reported-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
| * rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback()Ilya Dryomov2016-10-031-1/+2
| | | | | | | | | | | | | | | | | | | | | | - fix parent_length == img_request->xferred assert to not fire on copyup read failures - don't leak pages if copyup read fails or we can't allocate a new osd request Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
| * rbd: move bumping img_request refcount into rbd_obj_request_submit()Ilya Dryomov2016-10-031-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0f2d5be792b0 ("rbd: use reference counts for image requests") added rbd_img_request_get(), which rbd_img_request_fill() calls for each obj_request added to img_request. It was an urgent band-aid for the uglyness that is rbd_img_obj_callback() and none of the error paths were updated. Given that this img_request reference is meant to represent an obj_request that hasn't passed through rbd_img_obj_callback() yet, proper cleanup in appropriate destructors is a challenge. However, noting that if we don't get a chance to call rbd_obj_request_complete(), there is not going to be a call to rbd_img_obj_callback(), we can move rbd_img_request_get() into rbd_obj_request_submit() and fixup the two places that call rbd_obj_request_complete() directly and not through rbd_obj_request_submit() to temporarily bump img_request, so that rbd_img_obj_callback() can put as usual. This takes care of img_request leaks on errors on the submit side. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
| * rbd: mark the original request as done if stat request failsIlya Dryomov2016-10-031-13/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | If stat request fails with something other than -ENOENT (which just means that we need to copyup), the original object request is never marked as done and therefore never completed. Fix this by moving the mark done + complete snippet from rbd_img_obj_parent_read_full() into rbd_img_obj_exists_callback(). The former remains covered, as the latter is its only caller (through rbd_img_obj_request_submit()). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
| * rbd: clean up asserts in rbd_img_obj_request_submit() helpersIlya Dryomov2016-10-031-20/+10
| | | | | | | | | | | | | | | | Assert once in rbd_img_obj_request_submit(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
| * rbd: change rbd_obj_request_submit() signatureIlya Dryomov2016-10-031-47/+23
| | | | | | | | | | | | | | | | | | | | - osdc parameter is useless - starting with commit 5aea3dcd5021 ("libceph: a major OSD client update"), ceph_osdc_start_request() always returns success Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
| * rbd: lock_on_read map optionIlya Dryomov2016-10-031-1/+12
| | | | | | | | | | | | | | | | | | Add a per-device option to acquire exclusive lock on reads (in addition to writes and discards). The use case is iSCSI, where it will be used to prevent execution of stale writes after the implicit failover. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Mike Christie <mchristi@redhat.com>
| * rbd: add force close optionMike Christie2016-08-241-9/+26
| | | | | | | | | | | | | | | | | | | | | | This adds a force close option, so we can force the unmapping of a rbd device that is open. If a path/device is blacklisted, apps like multipathd can map a new device and then unmap the old one. The unmapping cleanup would then be handled by the generic hotunplug code paths in multipahd like is done for iSCSI, FC/FCOE, SAS, etc. Signed-off-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * rbd: add 'config_info' sysfs rbd device attributeMike Christie2016-08-241-2/+21
| | | | | | | | | | | | | | | | | | Export the info used to setup the rbd image, so it can be used to remap the image. Signed-off-by: Mike Christie <mchristi@redhat.com> [idryomov@gmail.com: do_rbd_add() EH] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * rbd: add 'snap_id' sysfs rbd device attributeMike Christie2016-08-241-0/+10
| | | | | | | | | | | | | | Export snap id in sysfs, so tools like multipathd can use it in a uuid. Signed-off-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * rbd: add 'cluster_fsid' sysfs rbd device attributeMike Christie2016-08-241-0/+10
| | | | | | | | | | | | | | | | Export the cluster fsid, so tools like udev and multipath-tools can use it for part of the uuid. Signed-off-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * rbd: add 'client_addr' sysfs rbd device attributeIlya Dryomov2016-08-241-0/+13
| | | | | | | | | | | | | | | | | | Export client addr/nonce, so userspace can check if a image is being blacklisted. Signed-off-by: Mike Christie <mchristi@redhat.com> [idryomov@gmail.com: ceph_client_addr(), endianess fix] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * rbd: print capacity in decimal and features in hexIlya Dryomov2016-08-241-2/+3
| | | | | | | | | | | | | | | | With exclusive-lock added and more to come, print features into dmesg. Change capacity to decimal while at it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Mike Christie <mchristi@redhat.com>
| * rbd: support for exclusive-lock featureIlya Dryomov2016-08-241-16/+796
| | | | | | | | | | | | | | | | | | | | | | Add basic support for RBD_FEATURE_EXCLUSIVE_LOCK feature. Maintenance operations (resize, snapshot create, etc) are offloaded to librbd via returning -EOPNOTSUPP - librbd should request the lock and execute the operation. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Tested-by: Mike Christie <mchristi@redhat.com>
| * rbd: retry watch re-registration periodicallyIlya Dryomov2016-08-241-29/+109
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Revamp watch code to support retrying watch re-registration: - add rbd_dev->watch_state for more robust errcb handling - store watch cookie separately to avoid dereferencing watch_handle which is set to NULL on unwatch - move re-register code into a delayed work and retry re-registration every second, unless the client is blacklisted Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Tested-by: Mike Christie <mchristi@redhat.com>
| * rbd: introduce a per-device ordered workqueueIlya Dryomov2016-08-241-80/+71
| | | | | | | | | | | | | | | | | | | | This is going to be used for re-registering watch requests and exclusive-lock tasks: acquire/request lock, notify-acquired, release lock, notify-released. Some refactoring in the map/unmap paths was necessary to give this workqueue a meaningful name: "rbdX-tasks". Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Mike Christie <mchristi@redhat.com>
| * libceph: rename ceph_client_id() -> ceph_client_gid()Ilya Dryomov2016-08-241-1/+1
| | | | | | | | | | | | | | | | It's gid / global_id in other places. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
* | blk-mq: remove ->map_queueChristoph Hellwig2016-09-151-1/+0
|/ | | | | | | | | | | | | All drivers use the default, so provide an inline version of it. If we ever need other queue mapping we can add an optional method back, although supporting will also require major changes to the queue setup code. This provides better code generation, and better debugability as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* rbd: nuke the 32-bit pool id checkIlya Dryomov2016-08-091-9/+0
| | | | | | | | ceph_file_layout::pool_id is now s64. rbd_add_get_pool_id() and ceph_pg_poolid_by_name() both return an int, so it's bogus anyway. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: destroy header_oloc in rbd_dev_release()Ilya Dryomov2016-08-081-0/+1
| | | | | | | Purely cosmetic at this point, as rbd doesn't use RADOS namespaces and hence rbd_dev->header_oloc->pool_ns is always NULL. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* Merge tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds2016-08-021-7/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull Ceph updates from Ilya Dryomov: "The highlights are: - RADOS namespace support in libceph and CephFS (Zheng Yan and myself). The stopgaps added in 4.5 to deny access to inodes in namespaces are removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature bit is now fully supported - A large rework of the MDS cap flushing code (Zheng Yan) - Handle some of ->d_revalidate() in RCU mode (Jeff Layton). We were overly pessimistic before, bailing at the first sight of LOOKUP_RCU On top of that we've got a few CephFS bug fixes, a couple of cleanups and Arnd's workaround for a weird genksyms issue" * tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client: (34 commits) ceph: fix symbol versioning for ceph_monc_do_statfs ceph: Correctly return NXIO errors from ceph_llseek ceph: Mark the file cache as unreclaimable ceph: optimize cap flush waiting ceph: cleanup ceph_flush_snaps() ceph: kick cap flushes before sending other cap message ceph: introduce an inode flag to indicates if snapflush is needed ceph: avoid sending duplicated cap flush message ceph: unify cap flush and snapcap flush ceph: use list instead of rbtree to track cap flushes ceph: update types of some local varibles ceph: include 'follows' of pending snapflush in cap reconnect message ceph: update cap reconnect message to version 3 ceph: mount non-default filesystem by name libceph: fsmap.user subscription support ceph: handle LOOKUP_RCU in ceph_d_revalidate ceph: allow dentry_lease_is_valid to work under RCU walk ceph: clear d_fsinfo pointer under d_lock ceph: remove ceph_mdsc_lease_release ceph: don't use ->d_time ...
| * libceph: rados pool namespace supportYan, Zheng2016-07-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Add pool namesapce pointer to struct ceph_file_layout and struct ceph_object_locator. Pool namespace is used by when mapping object to PG, it's also used when composing OSD request. The namespace pointer in struct ceph_file_layout is RCU protected. So libceph can read namespace without taking lock. Signed-off-by: Yan, Zheng <zyan@redhat.com> [idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: define new ceph_file_layout structureYan, Zheng2016-07-281-7/+7
| | | | | | | | | | | | | | | | Define new ceph_file_layout structure and rename old ceph_file_layout to ceph_file_layout_legacy. This is preparation for adding namespace to ceph_file_layout structure. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* | drivers: use req op accessorMike Christie2016-06-071-2/+2
|/ | | | | | | | | | The req operation REQ_OP is separated from the rq_flag_bits definition. This converts the block layer drivers to use req_op to get the op from the request struct. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* libceph: replace ceph_monc_request_next_osdmap()Ilya Dryomov2016-05-261-1/+1
| | | | | | | ... with a wrapper around maybe_request_map() - no need for two osdmap-specific functions. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: async MON client generic requestsIlya Dryomov2016-05-261-2/+2
| | | | | | | | | | For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION messages asynchronously and get a callback on completion. Refactor MON client to allow firing off generic requests asynchronously and add an async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is switched over and remains sync. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph, rbd: ceph_osd_linger_request, watch/notify v2Ilya Dryomov2016-05-261-138/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | This adds support and switches rbd to a new, more reliable version of watch/notify protocol. As with the OSD client update, this is mostly about getting the right structures linked into the right places so that reconnects are properly sent when needed. watch/notify v2 also requires sending regular pings to the OSDs - send_linger_ping(). A major change from the old watch/notify implementation is the introduction of ceph_osd_linger_request - linger requests no longer piggy back on ceph_osd_request. ceph_osd_event has been merged into ceph_osd_linger_request. All the details are now hidden within libceph, the interface consists of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack(). ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep the lifetime management simple. ceph_osdc_notify_ack() accepts an optional data payload, which is relayed back to the notifier. Portions of this patch are loosely based on work by Douglas Fuller <dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: rbd_dev_header_unwatch_sync() variantIlya Dryomov2016-05-261-4/+9
| | | | | | | | Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify callbacks. This is for the new rados_watcherrcb_t, which would be called from a notify callback. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: drop msg argument from ceph_osdc_callback_tIlya Dryomov2016-05-261-3/+2
| | | | | | | | finish_read(), its only user, uses it to get to hdr.data_len, which is what ->r_result is set to on success. This gains us the ability to safely call callbacks from contexts other than reply, e.g. map check. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: switch to calc_target(), part 2Ilya Dryomov2016-05-261-14/+4
| | | | | | | | | | | | | | | | | | | | The crux of this is getting rid of ceph_osdc_build_request(), so that MOSDOp can be encoded not before but after calc_target() calculates the actual target. Encoding now happens within ceph_osdc_start_request(). Also nuked is the accompanying bunch of pointers into the encoded buffer that was used to update fields on each send - instead, the entire front is re-encoded. If we want to support target->name_len != base->name_len in the future, there is no other way, because oid is surrounded by other fields in the encoded buffer. Encoding OSD ops and adding data items to the request message were mixed together in osd_req_encode_op(). While we want to re-encode OSD ops, we don't want to add duplicate data items to the message when resending, so all call to ceph_osdc_msg_data_add() are factored out into a new setup_request_data(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: use header_oid instead of header_nameIlya Dryomov2016-05-261-33/+24
| | | | | | | Switch to ceph_object_id and use ceph_oid_aprintf() instead of a bare const char *. This reduces noise in rbd_dev_header_name(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: variable-sized ceph_object_idIlya Dryomov2016-05-261-2/+6
| | | | | | | | | | | | | | | | | | | | Currently ceph_object_id can hold object names of up to 100 (CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases, expect one - long rbd image names: - a format 1 header is named "<imgname>.rbd" - an object that points to a format 2 header is named "rbd_id.<imgname>" We operate on these potentially long-named objects during rbd map, and, for format 1 images, during header refresh. (A format 2 header name is a small system-generated string.) Lift this 100 character limit by making ceph_object_id be able to point to an externally-allocated string. Apart from being able to work with almost arbitrarily-long named objects, this allows us to reduce the size of ceph_object_id from >100 bytes to 64 bytes. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: move message allocation out of ceph_osdc_alloc_request()Ilya Dryomov2016-05-261-2/+16
| | | | | | | | | | | | | | | | The size of ->r_request and ->r_reply messages depends on the size of the object name (ceph_object_id), while the size of ceph_osd_request is fixed. Move message allocation into a separate function that would have to be called after ceph_object_id and ceph_object_locator (which is also going to become variable in size with RADOS namespaces) have been filled in: req = ceph_osdc_alloc_request(...); <fill in req->r_base_oid> <fill in req->r_base_oloc> ceph_osdc_alloc_messages(req); Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: get/put img_request in rbd_img_request_submit()Ilya Dryomov2016-05-261-4/+7
| | | | | | | | | | | | | | By the time we get to checking for_each_obj_request_safe(img_request) terminating condition, all obj_requests may be complete and img_request ref, that rbd_img_request_submit() takes away from its caller, may be put. Moving the next_obj_request cursor is then a use-after-free on img_request. It's totally benign, as the value that's read is never used, but I think it's still worth fixing. Cc: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: report unsupported features to syslogIlya Dryomov2016-04-281-3/+6
| | | | | | | ... instead of just returning an error. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>
* rbd: fix rbd map vs notify racesIlya Dryomov2016-04-281-24/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A while ago, commit 9875201e1049 ("rbd: fix use-after free of rbd_dev->disk") fixed rbd unmap vs notify race by introducing an exported wrapper for flushing notifies and sticking it into do_rbd_remove(). A similar problem exists on the rbd map path, though: the watch is registered in rbd_dev_image_probe(), while the disk is set up quite a few steps later, in rbd_dev_device_setup(). Nothing prevents a notify from coming in and crashing on a NULL rbd_dev->disk: BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 Call Trace: [<ffffffffa0508344>] rbd_watch_cb+0x34/0x180 [rbd] [<ffffffffa04bd290>] do_event_work+0x40/0xb0 [libceph] [<ffffffff8109d5db>] process_one_work+0x17b/0x470 [<ffffffff8109e3ab>] worker_thread+0x11b/0x400 [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400 [<ffffffff810a5acf>] kthread+0xcf/0xe0 [<ffffffff810b41b3>] ? finish_task_switch+0x53/0x170 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 [<ffffffff81645dd8>] ret_from_fork+0x58/0x90 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 RIP [<ffffffffa050828a>] rbd_dev_refresh+0xfa/0x180 [rbd] If an error occurs during rbd map, we have to error out, potentially tearing down a watch. Just like on rbd unmap, notifies have to be flushed, otherwise rbd_watch_cb() may end up trying to read in the image header after rbd_dev_image_release() has run: Assertion failure in rbd_dev_header_info() at line 4722: rbd_assert(rbd_image_format_valid(rbd_dev->image_format)); Call Trace: [<ffffffff81cccee0>] ? rbd_parent_request_create+0x150/0x150 [<ffffffff81cd4e59>] rbd_dev_refresh+0x59/0x390 [<ffffffff81cd5229>] rbd_watch_cb+0x69/0x290 [<ffffffff81fde9bf>] do_event_work+0x10f/0x1c0 [<ffffffff81107799>] process_one_work+0x689/0x1a80 [<ffffffff811076f7>] ? process_one_work+0x5e7/0x1a80 [<ffffffff81132065>] ? finish_task_switch+0x225/0x640 [<ffffffff81107110>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0 [<ffffffff81108c69>] worker_thread+0xd9/0x1320 [<ffffffff81108b90>] ? process_one_work+0x1a80/0x1a80 [<ffffffff8111b02d>] kthread+0x21d/0x2e0 [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550 [<ffffffff82022802>] ret_from_fork+0x22/0x40 [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550 RIP [<ffffffff81ccd8f9>] rbd_dev_header_info+0xa19/0x1e30 To fix this, a) check if RBD_DEV_FLAG_EXISTS is set before calling revalidate_disk(), b) move ceph_osdc_flush_notifies() call into rbd_dev_header_unwatch_sync() to cover rbd map error paths and c) turn header read-in into a critical section. The latter also happens to take care of rbd map foo@bar vs rbd snap rm foo@bar race. Fixes: http://tracker.ceph.com/issues/15490 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>
* rbd: use GFP_NOIO consistently for request allocationsDavid Disseldorp2016-04-051-3/+3
| | | | | | | | | | | | | | | | | | As of 5a60e87603c4c533492c515b7f62578189b03c9c, RBD object request allocations are made via rbd_obj_request_create() with GFP_NOIO. However, subsequent OSD request allocations in rbd_osd_req_create*() use GFP_ATOMIC. With heavy page cache usage (e.g. OSDs running on same host as krbd client), rbd_osd_req_create() order-1 GFP_ATOMIC allocations have been observed to fail, where direct reclaim would have allowed GFP_NOIO allocations to succeed. Cc: stable@vger.kernel.org # 3.18+ Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Neil Brown <neilb@suse.com> Signed-off-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: use KMEM_CACHE macroGeliang Tang2016-03-251-8/+2
| | | | | | | Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: enable large, variable-sized OSD requestsIlya Dryomov2016-03-251-2/+0
| | | | | | | | | | | | | | | | | Turn r_ops into a flexible array member to enable large, consisting of up to 16 ops, OSD requests. The use case is scattered writeback in cephfs and, as far as the kernel client is concerned, 16 is just a made up number. r_ops had size 3 for copyup+hint+write, but copyup is really a special case - it can only happen once. ceph_osd_request_cache is therefore stuffed with num_ops=2 requests, anything bigger than that is allocated with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which means either num_ops=1 or num_ops=2 for use_mempool=true - all existing users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with that. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: move r_reply_op_{len,result} into struct ceph_osd_req_opYan, Zheng2016-03-251-1/+1
| | | | | | | | This avoids defining large array of r_reply_op_{len,result} in in struct ceph_osd_request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: delete an unnecessary check before rbd_dev_destroy()Markus Elfring2016-01-211-2/+1
| | | | | | | | | | The rbd_dev_destroy() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: don't put snap_context twice in rbd_queue_workfn()Ilya Dryomov2015-12-041-0/+1
| | | | | | | | | | | | | Commit 4e752f0ab0e8 ("rbd: access snapshot context and mapping size safely") moved ceph_get_snap_context() out of rbd_img_request_create() and into rbd_queue_workfn(), adding a ceph_put_snap_context() to the error path in rbd_queue_workfn(). However, rbd_img_request_create() consumes a ref on snapc, so calling ceph_put_snap_context() after a successful rbd_img_request_create() leads to an extra put. Fix it. Cc: stable@vger.kernel.org # 3.18+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>
* rbd: remove duplicate calls to rbd_dev_mapping_clear()Ilya Dryomov2015-11-021-3/+0
| | | | | | | | | | | Commit d1cf5788450e ("rbd: set mapping info earlier") defined rbd_dev_mapping_clear(), but, just a few days after, commit f35a4dee14c3 ("rbd: set the mapping size and features later") moved rbd_dev_mapping_set() calls and added another rbd_dev_mapping_clear() call instead of moving the old one. Around the same time, another duplicate was introduced in rbd_dev_device_release() - kill both. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: set device_type::release instead of device::releaseIlya Dryomov2015-11-021-5/+2
| | | | | | | No point in providing an empty device_type::release callback and then setting device::release for each rbd_dev dynamically. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: don't free rbd_dev outside of the release callbackIlya Dryomov2015-11-021-42/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct rbd_device has struct device embedded in it, which means it's part of kobject universe and has an unpredictable life cycle. Freeing its memory outside of the release callback is flawed, yet commits 200a6a8be5db ("rbd: don't destroy rbd_dev in device release function") and 8ad42cd0c002 ("rbd: don't have device release destroy rbd_dev") moved rbd_dev_destroy() out to rbd_dev_image_release(). This commit reverts most of that, the key points are: - rbd_dev->dev is initialized in rbd_dev_create(), making it possible to use rbd_dev_destroy() - which is just a put_device() - both before we register with device core and after. - rbd_dev_release() (the release callback) is the only place we kfree(rbd_dev). It's also where we do module_put(), keeping the module unload race window as small as possible. - We pin the module in rbd_dev_create(), but only for mapping rbd_dev-s. Moving image related stuff out of struct rbd_device into another struct which isn't tied with sysfs and device core is long overdue, but until that happens, this will keep rbd module refcount (which users can observe with lsmod) sane. Fixes: http://tracker.ceph.com/issues/12697 Cc: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
OpenPOWER on IntegriCloud