op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	Merge branch 'for-linus' of ↵	Linus Torvalds	2013-02-28	1	-0/+6
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "A few groups of patches here. Alex has been hard at work improving the RBD code, layout groundwork for understanding the new formats and doing layering. Most of the infrastructure is now in place for the final bits that will come with the next window. There are a few changes to the data layout. Jim Schutt's patch fixes some non-ideal CRUSH behavior, and a set of patches from me updates the client to speak a newer version of the protocol and implement an improved hashing strategy across storage nodes (when the server side supports it too). A pair of patches from Sam Lang fix the atomicity of open+create operations. Several patches from Yan, Zheng fix various mds/client issues that turned up during multi-mds torture tests. A final set of patches expose file layouts via virtual xattrs, and allow the policies to be set on directories via xattrs as well (avoiding the awkward ioctl interface and providing a consistent interface for both kernel mount and ceph-fuse users)." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (143 commits) libceph: add support for HASHPSPOOL pool flag libceph: update osd request/reply encoding libceph: calculate placement based on the internal data types ceph: update support for PGID64, PGPOOL3, OSDENC protocol features ceph: update "ceph_features.h" libceph: decode into cpu-native ceph_pg type libceph: rename ceph_pg -> ceph_pg_v1 rbd: pass length, not op for osd completions rbd: move rbd_osd_trivial_callback() libceph: use a do..while loop in con_work() libceph: use a flag to indicate a fault has occurred libceph: separate non-locked fault handling libceph: encapsulate connection backoff libceph: eliminate sparse warnings ceph: eliminate sparse warnings in fs code rbd: eliminate sparse warnings libceph: define connection flag helpers rbd: normalize dout() calls rbd: barriers are hard rbd: ignore zero-length requests ...
\| *	ceph: Check for created flag in response from mds	Sam Lang	2013-01-17	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The mds now sends back a created inode if the create request performed the create. If the file already existed, no inode is returned in the reply. This allows ceph to set the created flag in atomic_open so that permissions are properly checked in the case that the file wasn't created by the create call to the mds. To ensure compability with previous kernels, a feature for sending back the inode in the create reply was added, so that the mds will only send back the inode if the client indicates it supports the feature. Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* \|	ceph: Convert struct ceph_mds_request to use kuid_t and kgid_t	Eric W. Biederman	2013-02-12	1	-2/+2
\|/ \| \| \| \| \| \| \| \| \|	Hold the uid and gid for a pending ceph mds request using the types kuid_t and kgid_t. When a request message is finally created convert the kuid_t and kgid_t values into uids and gids in the initial user namespace. Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
*	ceph: define ceph_auth_handshake type	Alex Elder	2012-05-17	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	The definitions for the ceph_mds_session and ceph_osd both contain five fields related only to "authorizers." Encapsulate those fields into their own struct type, allowing for better isolation in some upcoming patches. Fix the #includes in "linux/ceph/osd_client.h" to lay out their more complete canonical path. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
*	ceph: create a new session lock to avoid lock inversion	Alex Elder	2012-02-02	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Lockdep was reporting a possible circular lock dependency in dentry_lease_is_valid(). That function needs to sample the session's s_cap_gen and and s_cap_ttl fields coherently, but needs to do so while holding a dentry lock. The s_cap_lock field was being used to protect the two fields, but that can't be taken while holding a lock on a dentry within the session. In most cases, the s_cap_gen and s_cap_ttl fields only get operated on separately. But in three cases they need to be updated together. Implement a new lock to protect the spots updating both fields atomically is required. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
*	ceph: use i_ceph_lock instead of i_lock	Sage Weil	2011-12-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have been using i_lock to protect all kinds of data structures in the ceph_inode_info struct, including lists of inodes that we need to iterate over while avoiding races with inode destruction. That requires grabbing a reference to the inode with the list lock protected, but igrab() now takes i_lock to check the inode flags. Changing the list lock ordering would be a painful process. However, using a ceph-specific i_ceph_lock in the ceph inode instead of i_lock is a simple mechanical change and avoids the ordering constraints imposed by igrab(). Reported-by: Amon Ott <a.ott@m-privacy.de> Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: explicitly reference rename old_dentry parent dir in request	Sage Weil	2011-07-26	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	We carry a pin on the parent directory for the rename source and dest dentries. For the source it's r_locked_dir; we need to explicitly reference the old_dentry parent as well, since the dentry's d_parent may change between when the request was created and pinned and when it is freed. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: ignore lease mask	Sage Weil	2011-07-26	1	-1/+1
\| \| \| \| \| \| \| \|	The lease mask is no longer used (and it changed a while back). Instead, use a non-zero duration to indicate that there is a lease being issued. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: fix cap flush race reentrancy	Sage Weil	2011-05-24	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In e9964c10 we change cap flushing to do a delicate dance because some inodes on the cap_dirty list could be in a migrating state (got EXPORT but not IMPORT) in which we couldn't actually flush and move from dirty->flushing, breaking the while (!empty) { process first } loop structure. It worked for a single sync thread, but was not reentrant and triggered infinite loops when multiple syncers came along. Instead, move inodes with dirty to a separate cap_dirty_migrating list when in the limbo export-but-no-import state, allowing us to go back to the simple loop structure (which was reentrant). This is cleaner and more robust. Audited the cap_dirty users and this looks fine: list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we have dirty caps (which list we're on is irrelevant) and list_del_init() calls still do the right thing. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: drop redundant r_mds field	Sage Weil	2011-01-12	1	-1/+0
\| \| \| \| \| \| \|	The r_mds field is redundant, since we can find the same information at r_session->s_mds, and when r_session is NULL then r_mds is meaningless. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS	Sage Weil	2011-01-12	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This implements the DIRLAYOUTHASH protocol feature, which passes the dir layout over the wire from the MDS. This gives the client knowledge of the correct hash function to use for mapping dentries among dir fragments. Note that if this feature is _not_ present on the client but is on the MDS, the client may misdirect requests. This will result in a forward and degrade performance. It may also result in inaccurate NFS filehandle generation, which will prevent fh resolution when the inode is not present in the client cache and the parent directories have been fragmented. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: Handle file locks in replies from the MDS.	Herb Shiu	2010-12-01	1	-10/+21
\| \| \| \| \| \| \| \|	Previously the kernel client incorrectly assumed everything was a directory. Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com> Acked-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: fix uid/gid on resent mds requests	Sage Weil	2010-11-08	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \|	MDS requests can be rebuilt and resent in non-process context, but were filling in uid/gid from current_fsuid/gid. Put that information in the request struct on request setup. This fixes incorrect (and root) uid/gid getting set for requests that are forwarded between MDSs, usually due to metadata migrations. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: factor out libceph from Ceph file system	Yehuda Sadeh	2010-10-20	1	-13/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This factors out protocol and low-level storage parts of ceph into a separate libceph module living in net/ceph and include/linux/ceph. This is mostly a matter of moving files around. However, a few key pieces of the interface change as well: - ceph_client becomes ceph_fs_client and ceph_client, where the latter captures the mon and osd clients, and the fs_client gets the mds client and file system specific pieces. - Mount option parsing and debugfs setup is correspondingly broken into two pieces. - The mon client gets a generic handler callback for otherwise unknown messages (mds map, in this case). - The basic supported/required feature bits can be expanded (and are by ceph_fs_client). No functional change, aside from some subtle error handling cases that got cleaned up in the refactoring process. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: fix multiple mds session shutdown	Sage Weil	2010-08-22	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \|	The use of a completion when waiting for session shutdown during umount is inappropriate, given the complexity of the condition. For multiple MDS's, this resulted in the umount thread spinning, often preventing the session close message from being processed in some cases. Switch to a waitqueue and defined a condition helper. This cleans things up nicely. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: handle ESTALE properly; on receipt send to authority if it wasn't	Greg Farnum	2010-08-01	1	-1/+1
\| \| \| \| \|	Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: connect to export targets on cap export	Sage Weil	2010-08-01	1	-0/+3
\| \| \| \| \| \| \|	When we get a cap EXPORT message, make sure we are connected to all export targets to ensure we can handle the matching IMPORT. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: do caps accounting per mds_client	Yehuda Sadeh	2010-08-01	1	-0/+22
\| \| \| \| \| \| \| \| \|	Caps related accounting is now being done per mds client instead of just being global. This prepares ground work for a later revision of the caps preallocated reservation list. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: drop unused argument	Sage Weil	2010-08-01	1	-2/+1
\| \| \| \|	Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: do not include cap/dentry releases in replayed messages	Sage Weil	2010-07-16	1	-0/+1
\| \| \| \| \| \| \| \| \|	Strip the cap and dentry releases from replayed messages. They can cause the shared state to get out of sync because they were generated (with the request message) earlier, and no longer reflect the current client state. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: try to send partial cap release on cap message on missing inode	Sage Weil	2010-06-10	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \|	If we have enough memory to allocate a new cap release message, do so, so that we can send a partial release message immediately. This keeps us from making the MDS wait when the cap release it needs is in a partially full release message. If we fail because of ENOMEM, oh well, they'll just have to wait a bit longer. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: release cap on import if we don't have the inode	Sage Weil	2010-06-10	1	-0/+3
\| \| \| \| \| \| \| \|	If we get an IMPORT that give us a cap, but we don't have the inode, queue a release (and try to send it immediately) so that the MDS doesn't get stuck waiting for us. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: use common helper for aborted dir request invalidation	Sage Weil	2010-05-17	1	-0/+2
\| \| \| \| \| \| \|	We invalidate I_COMPLETE and dentry leases in two places: on aborted mds request and on request replay. Use common helper to avoid duplicate code. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: fix race between aborted requests and fill_trace	Sage Weil	2010-05-17	1	-0/+2
\| \| \| \| \| \| \| \| \|	When we abort requests we need to prevent fill_trace et al from doing anything that relies on locks held by the VFS caller. This fixes a race between the reply handler and the abort code, ensuring that continue holding the dir mutex until the reply handler completes. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: clean up mds reply, error handling	Sage Weil	2010-05-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We would occasionally BUG out in the reply handler because r_reply was nonzero, due to a race with ceph_mdsc_do_request temporarily setting r_reply to an ERR_PTR value. This is unnecessary, messy, and also wrong in the EIO case. Clean up by consistently using r_err for errors and r_reply for messages. Also fix the abort logic to trigger consistently for all errors that return to the caller early (e.g., EIO from timeout case). If an abort races with a reply, use the result from the reply. Also fix locking for r_err, r_reply update in the reply handler. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: fix iterate_caps removal race	Sage Weil	2010-02-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We need to be able to iterate over all caps on a session with a possibly slow callback on each cap. To allow this, we used to prevent cap reordering while we were iterating. However, we were not safe from races with removal: removing the 'next' cap would make the next pointer from list_for_each_entry_safe be invalid, and cause a lock up or similar badness. Instead, we keep an iterator pointer in the session pointing to the current cap. As before, we avoid reordering. For removal, if the cap isn't the current cap we are iterating over, we are fine. If it is, we clear cap->ci (to mark the cap as pending removal) but leave it in the session list. In iterate_caps, we can safely finish removal and get the next cap pointer. While we're at it, clean up put_cap to not take a cap reservation context, as it was never used. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: use rbtree for snap_realms	Sage Weil	2010-02-16	1	-2/+1
\| \| \| \| \| \| \|	Switch from radix tree to rbtree for snap realms. This is much more appropriate given that realm keys are few and far between. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: use rbtree for mds requests	Sage Weil	2010-02-16	1	-1/+3
\| \| \| \| \| \| \| \| \| \|	The rbtree is a more appropriate data structure than a radix_tree. It avoids extra memory usage and simplifies the code. It also fixes a bug where the debugfs 'mdsc' file wasn't including the most recent mds request. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: properly handle aborted mds requests	Sage Weil	2010-01-25	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, if the MDS request was interrupted, we would unregister the request and ignore any reply. This could cause the caps or other cache state to become out of sync. (For instance, aborting dbench and doing rm -r on clients would complain about a non-empty directory because the client didn't realize it's aborted file create request completed.) Even we don't unregister, we still can't process the reply normally because we are no longer holding the caller's locks (like the dir i_mutex). So, mark aborted operations with r_aborted, and in the reply handler, be sure to process all the caps. Do not process the namespace changes, though, since we no longer will hold the dir i_mutex. The dentry lease state can also be ignored as it's more forgiving. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: do not touch_caps while iterating over caps list	Sage Weil	2009-12-23	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	Avoid confusing iterate_session_caps(), flag the session while we are iterating so that __touch_cap does not rearrange items on the list. All other modifiers of session->s_caps do so under the protection of s_mutex. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: use kref for struct ceph_mds_request	Sage Weil	2009-12-07	1	-3/+8
\| \| \| \|	Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: negotiate authentication protocol; implement AUTH_NONE protocol	Sage Weil	2009-11-18	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we open a monitor session, we send an initial AUTH message listing the auth protocols we support, our entity name, and (possibly) a previously assigned global_id. The monitor chooses a protocol and responds with an initial message. Initially implement AUTH_NONE, a dummy protocol that provides no security, but works within the new framework. It generates 'authorizers' that are used when connecting to (mds, osd) services that simply state our entity name and global_id. This is a wire protocol change. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: handle errors during osd client init	Sage Weil	2009-11-18	1	-1/+1
\| \| \| \| \| \|	Unwind initializing if we get ENOMEM during client initialization. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: build cleanly without CONFIG_DEBUG_FS	Sage Weil	2009-11-12	1	-0/+2
\| \| \| \|	Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: remove recon_gen logic	Sage Weil	2009-11-10	1	-2/+0
\| \| \| \| \| \| \| \|	We don't get an explicit affirmative confirmation that our caps reconnect, nor do we necessarily want to pay that cost. So, take all this code out for now. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: do not confuse stale and dead (unreconnected) caps	Sage Weil	2009-11-09	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \|	We were using the cap_gen to track both stale caps (caps that timed out due to temporarily losing touch with the mds) and dead caps that did not reconnect after an MDS failure. Introduce a recon_gen counter to track reconnections to restarted MDSs and kill dead caps based on that instead. Rename gen to cap_gen while we're at it to make it more clear which is which. Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: MDS client	Sage Weil	2009-10-06	1	-0/+321
	The MDS (metadata server) client is responsible for submitting requests to the MDS cluster and parsing the response. We decide which MDS to submit each request to based on cached information about the current partition of the directory hierarchy across the cluster. A stateful session is opened with each MDS before we submit requests to it, and a mutex is used to control the ordering of messages within each session. An MDS request may generate two responses. The first indicates the operation was a success and returns any result. A second reply is sent when the operation commits to disk. Note that locking on the MDS ensures that the results of updates are visible only to the updating client before the operation commits. Requests are linked to the containing directory so that an fsync will wait for them to commit. If an MDS fails and/or recovers, we resubmit requests as needed. We also reconnect existing capabilities to a recovering MDS to reestablish that shared session state. Old dentry leases are invalidated. Signed-off-by: Sage Weil <sage@newdream.net>