op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	sx8: use real time for the command seconds	Jens Axboe	2015-12-23	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Commit 8182503df1ba used monotonic time, but if the adapter is using the seconds for logging entries, then we'll get duplicate entries if the system is rebooted. Use real time instead. Reported-by: Arnd Bergmann <arnd@arndb.de> Fixes: 8182503df1ba ("block: sx8.c: Replace timeval with ktime_t") Signed-off-by: Jens Axboe <axboe@fb.com>
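The difference between the two clocks can be observed from user space. The following is a minimal sketch, assuming a POSIX system; `clock_seconds()` is a hypothetical helper and not part of the sx8 driver. In the kernel, the corresponding helpers are `ktime_get_seconds()` (monotonic) and `ktime_get_real_seconds()` (wall clock).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical helper: whole seconds read from the given POSIX clock,
 * or -1 if the clock cannot be read. */
static long long clock_seconds(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0)
        return -1;
    return (long long)ts.tv_sec;
}
```

`clock_seconds(CLOCK_MONOTONIC)` starts again from a small value after every boot, which is why log entries stamped with it can collide across reboots; `clock_seconds(CLOCK_REALTIME)` keeps counting from the epoch.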
*	block: sx8.c: Replace timeval with ktime_t	Shraddha Barke	2015-12-22	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \|	32-bit systems using 'struct timeval' will break in the year 2038, in order to avoid that replace the code with more appropriate types. This patch replaces timeval with 64 bit ktime_t which is y2038 safe. Since st->timestamp is only interested in seconds, directly using time64_t here. Function ktime_get_seconds is used since it uses monotonic instead of real time and thus will not cause overflow. Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: fix error path during resize	Lars Ellenberg	2015-11-25	1	-30/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In case the lower level device size changed, but some other internal details of the resize did not work out, drbd_determine_dev_size() would try to restore the previous settings, trusting drbd_md_set_sector_offsets() to "do the right thing", but overlooked that this internally may set the meta data base offset based on device size. This could end up with incomplete on-disk meta data layout change, and ultimately lead to data corruption (if the failure was not noticed or ignored by the operator, and other things go wrong as well). Just remember all meta data related offsets/sizes, and on error restore them all. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: avoid potential deadlock during handshake	Lars Ellenberg	2015-11-25	3	-23/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	During handshake communication, we also reconsider our device size, using drbd_determine_dev_size(). Just in case we need to change the offsets or layout of our on-disk metadata, we lock out application and other meta data IO, and wait for the activity log to be "idle" (no more referenced extents). If this handshake happens just after a connection loss, with a fencing policy of "resource-and-stonith", we have frozen IO. If, additionally, the activity log was "starving" (too many incoming random writes at that point in time), it won't become idle, ever, because of the frozen IO, and this would be a lockup of the receiver thread, and consequently of DRBD. Previous logic (re-)initialized with a special "empty" transaction block, which required the activity log to fully drain first. Instead, write out some standard activity log transactions. Using lc_try_lock_for_transaction() instead of lc_try_lock() does not care about pending activity log references, avoiding the potential deadlock. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: separate out __al_write_transaction helper function	Lars Ellenberg	2015-11-25	1	-148/+156
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To be able to "force out" an activity log transaction, even if there are no pending updates. This will be used to relocate the on-disk activity log, if the on-disk offsets have to be changed, without the need to empty the activity log first. While at it, move the definition, so we can drop the forward declaration of a static helper. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: make suspend_io() / resume_io() must be thread and recursion safe	Philipp Reisner	2015-11-25	3	-6/+8
\| \| \| \| \| \| \| \| \|	Avoid prematurely resuming application IO: don't set/clear a single bit, but inc/dec an atomic counter. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
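Why a counter works where a single bit does not can be shown with a small user-space model. This is only a sketch using C11 atomics; the `model_*` names are hypothetical and do not reproduce the DRBD implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: number of outstanding suspend requests. */
static atomic_int suspend_cnt;

static void model_suspend_io(void)
{
    atomic_fetch_add(&suspend_cnt, 1);
}

static void model_resume_io(void)
{
    int prev = atomic_fetch_sub(&suspend_cnt, 1);

    assert(prev > 0); /* a resume without a matching suspend is a bug */
}

/* IO stays suspended until every suspender has resumed. */
static bool model_io_suspended(void)
{
    return atomic_load(&suspend_cnt) > 0;
}
```

With a single flag, the first of two nested or concurrent resumes would clear it and let application IO through while the other suspender still relies on it being blocked. With the counter, IO only resumes once the count drops back to zero.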
*	drbd: fix "endless" transfer log walk in protocol A	Lars Ellenberg	2015-11-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Don't remember a DRBD request as ack_pending, if it is not. In protocol A, we usually clear RQ_NET_PENDING at the same time we set RQ_NET_SENT, so when deciding to remember it as ack_pending, mod_rq_state needs to look at the current request state, not at the previous state before the current modification was applied. This should prevent advance_conn_req_ack_pending() from walking the full transfer log just to find NULL in protocol A, which would cause serious performance degradation with many "in-flight" requests, e.g. when working via DRBD-proxy, or with a huge bandwidth-delay product. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
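The "look at the current state, not the previous one" rule can be sketched as follows. The flag values and the `request_model` type are illustrative assumptions; only the flag names are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the real DRBD constants differ. */
#define RQ_NET_PENDING (1u << 0)
#define RQ_NET_SENT    (1u << 1)

struct request_model {
    unsigned int rq_state;
};

/* Apply the modification first, then decide from the resulting state
 * whether the request still waits for an ack from the peer. */
static bool mod_state_ack_pending(struct request_model *req,
                                  unsigned int clear, unsigned int set)
{
    req->rq_state = (req->rq_state & ~clear) | set;

    return (req->rq_state & RQ_NET_SENT) &&
           (req->rq_state & RQ_NET_PENDING);
}
```

In the protocol A case described above, RQ_NET_PENDING is cleared in the same call that sets RQ_NET_SENT, so the function returns false. A check against the state before the modification would still see RQ_NET_PENDING and wrongly remember the request.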
*	drbd: fix memory leak in drbd_adm_resize	Oleg Drokin	2015-11-25	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \|	new_disk_conf could be leaked if the follow on checks fail, so make sure to free it on error if it was not assigned yet. Found with smatch. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: don't block forever in disconnect during resync if fencing=r-a-stonith	Lars Ellenberg	2015-11-25	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Disconnect should wait for pending bitmap IO. But if that bitmap IO is not happening, because it is waiting for pending application IO, and there is no progress, because the fencing policy suspended application IO because of the disconnect, then we deadlock. The bitmap writeout in this case does not care for concurrent application IO, so there is no point waiting for it. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: make drbd known to lsblk: use bd_link_disk_holder	Lars Ellenberg	2015-11-25	4	-51/+90
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	lsblk should be able to pick up stacking device driver relations involving DRBD conveniently. Even though upstream kernel since 2011 says "DON'T USE THIS UNLESS YOU'RE ALREADY USING IT." a new user has been added since (bcache), which sets the precedent for us to use it as well. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: fix queue limit setup for discard	Lars Ellenberg	2015-11-25	1	-9/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We cannot possibly support SECDISCARD, even if all backend devices would support it: if our peer is currently unreachable, some instance of the data may obviously still be recoverable. We did not set discard_granularity at all. We don't really care (yet), we only pass them on, so for now, set our granularity to one sector. blkdev_stack_limits() takes care of the rest. If we decide we cannot support discards, not only clear the (not user visible) QUEUE_FLAG_DISCARD, but set both (user visible) discard_granularity and max_discard_sectors to zero, to avoid confusion with e.g. lsblk -D. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: fix spurious alert level printk	Lars Ellenberg	2015-11-25	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When accessing our meta data area on disk, we double check the plausibility of the requested sector offsets, and are very noisy about it if they look suspicious. During initial read of our "superblock", for "external" meta data, this triggered because the range estimate returned by drbd_md_last_sector() was still wrong. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: use bitmap_weight() helper, don't open code	Lars Ellenberg	2015-11-25	1	-8/+8
\| \| \| \| \| \| \| \|	Suggested by Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: avoid redefinition of BITS_PER_PAGE	Lars Ellenberg	2015-11-25	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \|	Apparently we now implicitly get definitions for BITS_PER_PAGE and BITS_PER_PAGE_MASK from the pid_namespace.h Instead of renaming our defines, I chose to define only if not yet defined, but to double check the value if already defined. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
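The "define only if not yet defined, but double check the value" approach might look like the sketch below. The fallback `PAGE_SHIFT` of 12 (4 KiB pages) is an assumption so that the fragment compiles outside the kernel, and the error text is made up.

```c
#include <assert.h>

/* Assumption for this sketch: 4 KiB pages. The kernel provides PAGE_SHIFT. */
#ifndef PAGE_SHIFT
#define PAGE_SHIFT 12
#endif

#ifndef BITS_PER_PAGE
/* Nobody defined it yet: provide our own definitions. */
#define BITS_PER_PAGE      (1UL << (PAGE_SHIFT + 3))
#define BITS_PER_PAGE_MASK (BITS_PER_PAGE - 1)
#else
/* Already defined elsewhere: verify the value instead of trusting it. */
# if BITS_PER_PAGE != (1UL << (PAGE_SHIFT + 3))
#  error "conflicting definition of BITS_PER_PAGE"
# endif
#endif
```

A mismatch then fails at compile time instead of silently changing the bitmap arithmetic.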
*	drbd: use resource name in workqueue	Lars Ellenberg	2015-11-25	2	-3/+6
\| \| \| \| \| \| \| \| \|	Since kernel 3.3, we can use snprintf-style arguments to create a workqueue. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: debugfs: expose ed_data_gen_id	Lars Ellenberg	2015-11-25	2	-0/+11
\| \| \| \| \| \| \| \| \|	The effective data generation ID may be interesting for debugging purposes of scenarios involving diskless states. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: prevent NULL pointer deref when resuming diskless primary	Lars Ellenberg	2015-11-25	1	-1/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a multiple error scenario, we may end up with a "frozen" Primary, that has no access to any data (no local disk, no replication link). If we then resume-io, we try to generate a new data generation id, which will fail if there is no longer a local disk. Double check for available local data, which prevents the NULL pointer deref. If we are diskless, turn the resume-io in this situation into the first stage of a "force down", by bumping the "effective" data gen id, which will prevent later attach or connect to the former data set without first being demoted (deconfigured). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: Create a dedicated workqueue for sending acks on the control connection	Philipp Reisner	2015-11-25	7	-115/+141
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The intention is to reduce CPU utilization. Recent measurements unveiled that the current performance bottleneck is CPU utilization on the receiving node. The asender thread became CPU limited. One of the main points is to eliminate the idr_for_each_entry() loop from the sending acks code path. One exception in that is sending back ping_acks. These stay in the ack-receiver thread. Otherwise the logic becomes too complicated for no added value. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: Rename asender to ack_receiver	Philipp Reisner	2015-11-25	3	-11/+11
\| \| \| \| \| \| \| \| \|	This prepares the next patch where the sending on the meta (or control) socket is moved to a dedicated workqueue. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: fix refcount error during detach of an already failed disk	Lars Ellenberg	2015-11-25	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A D_FAILED disk transitions as quickly as possible to D_DISKLESS. But in the "unresponsive local disk" case, there remains a time window where an administrative detach command could find the disk already failed, but some internal meta data IO against the unresponsive local disk still pending. In that case, drbd_md_get_buffer() will return NULL. Don't unconditionally call drbd_md_put_buffer(), or it will cause refcount imbalance, and prevent any further re-attach on this volume (until it is deleted and re-created). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
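The get/put pairing rule can be illustrated with a simplified model. The `md_model` type and its functions are hypothetical stand-ins for drbd_md_get_buffer() and drbd_md_put_buffer(), and the failure condition is reduced to a single flag.

```c
#include <assert.h>
#include <stddef.h>

struct md_model {
    int  refcnt;      /* current users of the meta data buffer */
    int  disk_failed; /* non-zero: get fails and takes no reference */
    char buf[16];
};

static void *md_get_buffer(struct md_model *md)
{
    if (md->disk_failed)
        return NULL;
    md->refcnt++;
    return md->buf;
}

static void md_put_buffer(struct md_model *md)
{
    md->refcnt--;
}

static void detach_model(struct md_model *md)
{
    void *buffer = md_get_buffer(md);

    /* ... the state change would happen here ... */

    if (buffer) /* only drop a reference that was actually taken */
        md_put_buffer(md);
}
```

Without the `if (buffer)` check, a detach on an already failed disk would leave `refcnt` at -1, which corresponds to the imbalance that blocks later re-attach.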
*	drbd: fix NULL deref in remember_new_state	Lars Ellenberg	2015-11-25	1	-32/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The recent (not yet released) backport of the extended state broadcasts to support the "events2" subcommand of drbdsetup had some glitches. remember_old_state() would first count all connections with a net_conf != NULL, then allocate a suitable array, then populate that array with all connections found to have net_conf != NULL. This races with the state change to C_STANDALONE, and the NULL assignment there. remember_new_state() then iterates over said connection array, assuming that it would be fully populated. But rcu_lock() just makes sure the thing some pointer points to, if any, won't go away. It does not make the pointer itself immutable. In fact there is no need to "filter" connections based on whether or not they have a currently valid configuration. Just record them always, if they don't have a config, that's fine, there will be no change then. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: improve network timeout detection	Lars Ellenberg	2015-11-25	3	-27/+100
\| \| \| \| \| \| \| \| \|	Don't blame the peer for being unresponsive, if we did not even ask the question yet. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: drbd_panic_after_delayed_completion_of_aborted_request()	Lars Ellenberg	2015-11-25	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The only way to make DRBD intentionally call panic is to set a disk timeout, have that trigger, "abort" some request and complete to upper layers, then have the backend IO subsystem later complete these requests successfully regardless. As the attached IO pages have been recycled for other purposes meanwhile, this will cause unexpected random memory changes. To prevent corruption, we rather panic in that case. Make it obvious from stack traces that this was the case by introducing drbd_panic_after_delayed_completion_of_aborted_request(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: add comment why we want to first call local-io-error, then send state	Lars Ellenberg	2015-11-25	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Even though we really want to get the state information about our bad disk to the peer as soon as possible, it is useful to first call the local-io-error handler. People may choose to hard-reset the box from there. If that looks and behaves exactly like a "regular node crash", without bumping the data generation UUIDs on the peer in between, it makes it easier to deal with. If you intend to return from the local-io-error handler, then better return as quickly as possible to avoid triggering other timeouts. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: also bump UUIDs if a diskless primary connects	Lars Ellenberg	2015-11-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If for some reason the primary lost its disk and the replication link before it is able to communicate the disk loss, probably blocked IO, then later is able to re-establish the connection, the peer needs to bump its UUIDs just like it does when peer only loses the disk and is able to communicate this in time. Otherwise, a later re-attach of the disk on the primary may start a resync in the "wrong" direction. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: drbdsetup detach of an unresponsive local disk should not block IO "forever"	Lars Ellenberg	2015-11-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When detaching, we make sure no application IO is in-flight by internally suspending IO, then trigger the state change, wait for the result, and finally internally resume IO again. Once we triggered the state change to "Failed", we expect it to change from Failed to Diskless. (To avoid races, we actually wait for it to leave "Failed"). On an unresponsive local IO backend, this may not happen, ever. Don't have a "hung" detach block IO "forever", but resume IO before waiting for the state change to Diskless. We may well be able to continue IO to and from a healthy peer. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: Fix spurious disk-timeout	Lars Ellenberg	2015-11-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	(You should not use disk-timeout anyways, see the man page for why...) We add incoming requests to the tail of some ring list. On local completion, requests are removed from that list. The timer looks only at the head of that ring list, so is supposed to only see the oldest request. All protected by a spinlock. The request object is created with timestamps zeroed out. The timestamp was only filled in just before the actual submit. But to actually submit the request, we need to give up the spinlock. If you are unlucky, there is no older still pending request, the timer looks at a new request with timestamp still zero (before it even was submitted), and 0 + timeout is most likely older than "now". Better assign the timestamp right when we put the request object on said ring list. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
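The arithmetic behind the spurious timeout is easy to reproduce. The following sketch uses a wrap-safe comparison in the style of the kernel's time_after(); `request_timed_out()` and the tick values in the note are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* True if 'now' is later than 'start + timeout'. The signed difference
 * keeps the comparison correct across counter wrap-around. */
static bool request_timed_out(unsigned long start, unsigned long timeout,
                              unsigned long now)
{
    return (long)(now - (start + timeout)) > 0;
}
```

With the clock at tick 5000 and a timeout of 100 ticks, a request whose timestamp is still zero yields `request_timed_out(0, 100, 5000)`, which is true even though the request was never submitted. Stamping it when it is queued, for example at tick 4990, gives false as expected.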
*	drbd: Replace 0 with the more meaningful GFP_NOWAIT	Philipp Reisner	2015-11-25	1	-6/+6
\| \| \| \| \| \| \| \|	GFP_NOWAIT has a value of 0. I.e. functionality not changed. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: Deletion of an unnecessary check before the function call "lc_destroy"	Markus Elfring	2015-11-25	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The lc_destroy() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: Backport the "status" command	Andreas Gruenbacher	2015-11-25	1	-79/+487
\| \| \| \| \| \| \| \| \| \| \| \| \|	The status command originates from the drbd9 code base. While for now we keep the status information in /proc/drbd available, this commit allows the user base to gracefully migrate their monitoring infrastructure to the new status reporting interface. In drbd9 no status information is exposed through /proc/drbd. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: Backport the "events2" command	Andreas Gruenbacher	2015-11-25	5	-12/+1151
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The events2 command originates from drbd-9 development. It features more information but requires an incompatible change in output format. Therefore the previous events command continues to exist, the new improved events2 command becomes available now. This prepares the user-base for a later switch to the complete drbd9 code base. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: Fix locking across all resources	Andreas Gruenbacher	2015-11-25	6	-93/+99
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of using a rwlock for synchronizing state changes across resources, take the request locks of all resources for global state changes. Use resources_mutex to serialize global state changes. This means that taking the request lock of a resource is now enough to prevent changes of that resource. (Previously, a read lock on the global state lock was needed as well.) Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: drbd_adm_attach(): Add missing drbd_resync_after_changed()	Andreas Gruenbacher	2015-11-25	1	-12/+16
\| \| \| \| \| \|	Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: Move enum write_ordering_e to drbd.h	Andreas Gruenbacher	2015-11-25	5	-26/+20
\| \| \| \| \| \| \| \|	Also change the enum values to all-capital letters. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: Get rid of some first_peer_device() calls	Andreas Gruenbacher	2015-11-25	1	-4/+4
\| \| \| \| \| \|	Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: De-inline drbd_should_do_remote() and drbd_should_send_out_of_sync()	Andreas Gruenbacher	2015-11-25	2	-16/+19
\| \| \| \| \| \| \| \| \|	There is no need to have these two as inline functions. In addition, drbd_should_send_out_of_sync() is only used in a single place, anyway. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	drbd: Remove pointless check	Philipp Reisner	2015-11-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	In drbd-8.4 there is always a single connection per resource, and there is always exactly one peer_device for a device. peer_device can not be NULL here. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	mtip32xx: use formatting capability of kthread_create_on_node	Rasmus Villemoes	2015-11-20	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \|	kthread_create_on_node takes format+args, so there's no need to do the pretty-printing in advance. Moreover, "mtip_svc_thd_99" (including its '\0') only just fits in 16 bytes, so if index could ever go above 99 we'd have a stack buffer overflow. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
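The size argument can be checked in user space. This sketch assumes the name was built with a `"mtip_svc_thd_%02d"` format into a 16-byte buffer; `format_thread_name()` is a hypothetical helper, and `snprintf()` is used so the overflow case is truncated rather than triggered.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 16

/* Returns the length the complete name needs, excluding the NUL.
 * A result >= NAME_LEN means the name did not fit and was truncated. */
static int format_thread_name(char buf[NAME_LEN], int index)
{
    return snprintf(buf, NAME_LEN, "mtip_svc_thd_%02d", index);
}
```

For index 99 the name needs 15 characters plus the terminating NUL, exactly 16 bytes. For index 100 it needs 17 bytes, which an unbounded sprintf() would write past the end of the buffer. Passing the format and arguments directly to kthread_create_on_node() removes the intermediate buffer altogether.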
*	null_blk: do not del gendisk with lightnvm	Matias Bjørling	2015-11-19	1	-3/+5
\| \| \| \| \| \| \| \| \|	The gendisk structure has not been initialized when using lightnvm. Make sure to not delete it upon exit. Also make sure that we use the appropriate disk_name at unregistration. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
*	null_blk: use device addressing mode	Matias Bjørling	2015-11-19	1	-5/+23
\| \| \| \| \| \| \| \|	The linear addressing mode was removed in 7386af2. Make null_blk instead expose the ppa format geometry and support the generic addressing mode. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
*	null_blk: use ppa_cache pool	Matias Bjørling	2015-11-19	1	-2/+23
\| \| \| \| \| \| \| \| \|	Instead of using a page pool, we can save memory by only allocating room for 64 entries for the ppa command. Introduce a ppa_cache to allocate only the required memory for the ppa list. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
*	null_blk: register as a LightNVM device	Matias Bjørling	2015-11-16	1	-6/+154
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for registering as a LightNVM device. This allows us to evaluate the performance of the LightNVM subsystem. In /drivers/Makefile, LightNVM is moved above block device drivers to make sure that the LightNVM media managers have been initialized before drivers under /drivers/block are initialized. Signed-off-by: Matias Bjørling <m@bjorling.me> Fix by Jens Axboe to remove unneeded slab cache and the following memory leak. Signed-off-by: Jens Axboe <axboe@fb.com>
*	Merge branch 'for-linus' of ↵	Linus Torvalds	2015-11-13	1	-55/+54
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There are several patches from Ilya fixing RBD allocation lifecycle issues, a series adding a nocephx_sign_messages option (and associated bug fixes/cleanups), several patches from Zheng improving the (directory) fsync behavior, a big improvement in IO for direct-io requests when striping is enabled from Caifeng, and several other small fixes and cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: libceph: clear msg->con in ceph_msg_release() only libceph: add nocephx_sign_messages option libceph: stop duplicating client fields in messenger libceph: drop authorizer check from cephx msg signing routines libceph: msg signing callouts don't need con argument libceph: evaluate osd_req_op_data() arguments only once ceph: make fsync() wait unsafe requests that created/modified inode ceph: add request to i_unsafe_dirops when getting unsafe reply libceph: introduce ceph_x_authorizer_cleanup() ceph: don't invalidate page cache when inode is no longer used rbd: remove duplicate calls to rbd_dev_mapping_clear() rbd: set device_type::release instead of device::release rbd: don't free rbd_dev outside of the release callback rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails libceph: use local variable cursor instead of &msg->cursor libceph: remove con argument in handle_reply() ceph: combine as many iovec as possile into one OSD request ceph: fix message length computation ceph: fix a comment typo rbd: drop null test before destroy functions
\| *	rbd: remove duplicate calls to rbd_dev_mapping_clear()	Ilya Dryomov	2015-11-02	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit d1cf5788450e ("rbd: set mapping info earlier") defined rbd_dev_mapping_clear(), but, just a few days after, commit f35a4dee14c3 ("rbd: set the mapping size and features later") moved rbd_dev_mapping_set() calls and added another rbd_dev_mapping_clear() call instead of moving the old one. Around the same time, another duplicate was introduced in rbd_dev_device_release() - kill both. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
\| *	rbd: set device_type::release instead of device::release	Ilya Dryomov	2015-11-02	1	-5/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	No point in providing an empty device_type::release callback and then setting device::release for each rbd_dev dynamically. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
\| *	rbd: don't free rbd_dev outside of the release callback	Ilya Dryomov	2015-11-02	1	-42/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	struct rbd_device has struct device embedded in it, which means it's part of kobject universe and has an unpredictable life cycle. Freeing its memory outside of the release callback is flawed, yet commits 200a6a8be5db ("rbd: don't destroy rbd_dev in device release function") and 8ad42cd0c002 ("rbd: don't have device release destroy rbd_dev") moved rbd_dev_destroy() out to rbd_dev_image_release(). This commit reverts most of that, the key points are: - rbd_dev->dev is initialized in rbd_dev_create(), making it possible to use rbd_dev_destroy() - which is just a put_device() - both before we register with device core and after. - rbd_dev_release() (the release callback) is the only place we kfree(rbd_dev). It's also where we do module_put(), keeping the module unload race window as small as possible. - We pin the module in rbd_dev_create(), but only for mapping rbd_dev-s. Moving image related stuff out of struct rbd_device into another struct which isn't tied with sysfs and device core is long overdue, but until that happens, this will keep rbd module refcount (which users can observe with lsmod) sane. Fixes: http://tracker.ceph.com/issues/12697 Cc: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
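The rule that only the release callback frees the object can be modelled with a plain reference count. This is a user-space sketch with hypothetical names (`dev_model`, `dev_get()`, `dev_put()`); it mirrors the idea of put_device() but not the rbd code.

```c
#include <assert.h>
#include <stdlib.h>

struct dev_model {
    int refcount;
    void (*release)(struct dev_model *dev);
};

static int released; /* counts release calls, for demonstration only */

/* The only place where the object is freed. */
static void dev_release(struct dev_model *dev)
{
    released++;
    free(dev);
}

/* Returns a new object holding one reference, or NULL on allocation failure. */
static struct dev_model *dev_create(void)
{
    struct dev_model *dev = calloc(1, sizeof(*dev));

    if (!dev)
        return NULL;
    dev->refcount = 1;
    dev->release = dev_release;
    return dev;
}

static void dev_get(struct dev_model *dev)
{
    dev->refcount++;
}

/* Destroying is just dropping a reference; the last put triggers release. */
static void dev_put(struct dev_model *dev)
{
    if (--dev->refcount == 0)
        dev->release(dev);
}
```

Because every path, including error paths before registration, ends in `dev_put()`, nobody frees the memory while another holder still has a reference.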
\| *	rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails	Ilya Dryomov	2015-11-02	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Returning pool id (i.e. >= 0) from a sysfs ->store() callback makes userspace think it needs to retry the write. Fix it - it's a leftover from the times when the equivalent of rbd_dev_create() was the first action in rbd_add(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
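A store callback is expected to return the number of bytes it consumed or a negative errno. The sketch below models only that contract; `add_store_model()` and its `create_ok` parameter are hypothetical and do not correspond to the rbd function signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model of a ->store() style callback: on failure return a
 * negative errno, on success report that all 'count' bytes were consumed. */
static ssize_t add_store_model(int create_ok, size_t count)
{
    if (!create_ok)
        return -ENOMEM;

    return (ssize_t)count;
}
```

Returning a small non-negative value such as a pool id would instead look like a short write, which is what prompts userspace to retry.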
\| *	rbd: drop null test before destroy functions	Julia Lawall	2015-11-02	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) { \(kmem_cache_destroy\\|mempool_destroy\\|dma_pool_destroy\)(x); x = NULL; -} // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* \|	brd: Refuse improperly aligned discard requests	Jan Kara	2015-11-11	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently when an improperly aligned discard request is submitted, we just silently discard more / less data which results in filesystem corruption in some cases. Refuse such misaligned requests. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
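An alignment check of this kind can be sketched as below, assuming 512-byte sectors, 4 KiB pages, and a backing store that can only drop whole pages. `discard_is_aligned()` and `PAGE_SIZE_MODEL` are hypothetical names, not the brd code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT    9     /* 512-byte sectors */
#define PAGE_SIZE_MODEL 4096u /* assumption: 4 KiB pages */

/* A discard is acceptable only if it starts on a page boundary and
 * covers a whole number of pages. */
static bool discard_is_aligned(uint64_t sector, uint32_t bytes)
{
    uint64_t sectors_per_page = PAGE_SIZE_MODEL >> SECTOR_SHIFT;

    return (sector & (sectors_per_page - 1)) == 0 &&
           (bytes & (PAGE_SIZE_MODEL - 1)) == 0;
}
```

A request starting at sector 1, or one covering only 512 bytes, fails the check and would be rejected instead of being rounded to page boundaries.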
* \|	Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block	Linus Torvalds	2015-11-10	9	-18/+24
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull block IO poll support from Jens Axboe: "Various groups have been doing experimentation around IO polling for (really) fast devices. The code has been reviewed and has been sitting on the side for a few releases, but this is now good enough for coordinated benchmarking and further experimentation. Currently O_DIRECT sync read/write are supported. A framework is in the works that allows scalable stats tracking so we can auto-tune this. And we'll add libaio support as well soon. For now, it's an opt-in feature for test purposes" * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block: direct-io: be sure to assign dio->bio_bdev for both paths directio: add block polling support NVMe: add blk polling support block: add block polling support blk-mq: return tag/queue combo in the make_request_fn handlers block: change ->make_request_fn() and users to return a queue cookie