op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	dm thin: fix crash by initializing thin device's refcount and completion earlier	Marc Dionne	2014-12-17	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 80e96c5484be ("dm thin: do not allow thin device activation while pool is suspended") delayed the initialization of a new thin device's refcount and completion until after this new thin was added to the pool's active_thins list and the pool lock is released. This opens a race with a worker thread that walks the list and calls thin_get/put, noticing that the refcount goes to 0 and calling complete, freezing up the system and giving the oops below: kernel: BUG: unable to handle kernel NULL pointer dereference at (null) kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90 kernel: Call Trace: kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20 kernel: [<ffffffff810d3dc7>] complete+0x37/0x50 kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool] kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool] kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0 kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400 kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490 kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260 kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170 kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170 Set the thin device's initial refcount and initialize the completion before adding it to the pool's active_thins list in thin_ctr(). Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: fix missing out-of-data-space to write mode transition if blocks ↵	Joe Thornber	2014-12-17	1	-2/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	are released Discard bios and thin device deletion have the potential to release data blocks. If the thin-pool is in out-of-data-space mode, and blocks were released, transition the thin-pool back to full write mode. The correct time to do this is just after the thin-pool metadata commit. It cannot be done before the commit because the space maps will not allow immediate reuse of the data blocks in case there's a rollback following power failure. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
*	dm thin: fix inability to discard blocks when in out-of-data-space mode	Joe Thornber	2014-12-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard function pointer was incorrectly being set to process_prepared_discard_passdown rather than process_prepared_discard. This incorrect function pointer meant the discard was being passed down, but not effecting the mapping. As such any discard that was issued, in an attempt to reclaim blocks, would not successfully free data space. Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
*	dm thin: fix pool_io_hints to avoid looking at max_hw_sectors	Mike Snitzer	2014-11-21	1	-14/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Simplify the pool_io_hints code that works to establish a max_sectors value that is a power-of-2 factor of the thin-pool's blocksize. The biggest associated improvement is that the DM thin-pool is no longer concerning itself with the data device's max_hw_sectors when adjusting max_sectors. This fixes the relative fragility of the original "dm thin: adjust max_sectors_kb based on thinp blocksize" commit that only became apparent when testing was performed using a DM thin-pool ontop of a virtio_blk device. One proposed upstream patch detailed the problems inherent in virtio_blk: https://lkml.org/lkml/2014/11/20/611 So even though virtio_blk incorrectly set its max_hw_sectors it actually helped make it clear that we need DM thinp to be tolerant of any future Linux driver that incorrectly sets max_hw_sectors. We only need to be concerned with modifying the thin-pool device's max_sectors limit if it is smaller than the thin-pool's blocksize. In this case the value of max_sectors does become a limiting factor when upper layers (e.g. filesystems) construct their bios. But if the hardware can support IOs larger than the thin-pool's blocksize the user is encouraged to adjust the thin-pool's data device's max_sectors accordingly -- doing so will enable the thin-pool to inherit the established user-defined max_sectors. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: suspend/resume active thin devices when reloading thin-pool	Mike Snitzer	2014-11-19	1	-2/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Before this change it was expected that userspace would first suspend all active thin devices, reload/resize the thin-pool target, then resume all active thin devices. Now the thin-pool suspend/resume will trigger the suspend/resume of all active thins via appropriate calls to dm_internal_suspend and dm_internal_resume. Store the mapped_device for each thin device in struct thin_c to make these calls possible. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
*	dm thin: do not allow thin device activation while pool is suspended	Mike Snitzer	2014-11-19	1	-10/+45
\| \| \| \| \| \| \| \| \| \| \| \|	Otherwise IO could be issued to the pool while it is suspended. Care was taken to properly interlock between the thin and thin-pool targets when accessing the pool's 'suspended' flag. The thin_ctr will not add a new thin device to the pool's active_thins list if the pool is susepended. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
*	dm thin: remove stale 'trim' message in block comment above pool_message	Mike Snitzer	2014-11-12	1	-1/+0
\| \| \| \|	Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: fix a race in thin_dtr	Mikulas Patocka	2014-11-12	1	-3/+3
\| \| \| \| \| \| \| \| \|	As long as struct thin_c is in the list, anyone can grab a reference of it. Consequently, we must wait for the reference count to drop to zero after we remove the structure from the list, not before. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm bio prison: introduce support for locking ranges of blocks	Joe Thornber	2014-11-10	1	-2/+4
\| \| \| \| \| \| \| \| \| \|	Ranges will be placed in the same cell if they overlap. Range locking is a prerequisite for more efficient multi-block discard support in both the cache and thin-provisioning targets. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: refactor requeue_io to eliminate spinlock bouncing	Mike Snitzer	2014-11-10	1	-20/+23
\| \| \| \| \| \|	Also refactor some other bio_list erroring helpers. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: optimize retry_bios_on_resume	Mike Snitzer	2014-11-10	1	-7/+2
\| \| \| \| \| \| \|	Eliminate redundant should_error_unserviceable_bio check and error loop. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: sort the deferred cells	Joe Thornber	2014-11-10	1	-20/+68
\| \| \| \| \| \| \| \| \|	Sort the cells in logical block order before processing each cell in process_thin_deferred_cells(). This significantly improves the ondisk layout on rotational storage, whereby improving read performance. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: direct dispatch when breaking sharing	Joe Thornber	2014-11-10	1	-13/+57
\| \| \| \| \| \| \| \| \|	This use of direct submission in process_shared_bio() reduces latency for submitting bios in the shared cell by avoiding adding those bios to the deferred list and waiting for the next iteration of the worker. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: remap the bios in a cell immediately	Joe Thornber	2014-11-10	1	-29/+61
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This use of direct submission in process_prepared_mapping() reduces latency for submitting bios in a cell by avoiding adding those bios to the deferred list and waiting for the next iteration of the worker. But this direct submission exposes the potential for a race between releasing a cell and incrementing deferred set. Fix this by introducing dm_cell_visit_release() and refactoring inc_remap_and_issue_cell() accordingly. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: defer whole cells rather than individual bios	Joe Thornber	2014-11-10	1	-47/+207
\| \| \| \| \| \| \| \| \| \| \| \|	This avoids dropping the cell, so increases the probability that other bios will collect within the cell, rather than being passed individually to the worker. Also add required process_cell and process_discard_cell error handling wrappers and set associated pool-mode function pointers accordingly. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: factor out remap_and_issue_overwrite	Mike Snitzer	2014-11-10	1	-18/+20
\| \| \| \| \| \|	Purely cleanup of duplicated code, no functional change. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: performance improvement to discard processing	Joe Thornber	2014-11-10	1	-7/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When processing a discard bio, if the block is already quiesced do the discard immediately rather than adding the mapping to a list for the next iteration of the worker thread. Discarding a fully provisioned 100G thin volume with 64k block size goes from 860s to 95s with this change. Clearly there's something wrong with the worker architecture, more investigation needed. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: implement thin_merge	Mike Snitzer	2014-11-10	1	-2/+18
\| \| \| \| \| \| \| \| \| \| \| \|	Introduce thin_merge so that any additional constraints from the data volume may be taken into account when determing the maximum number of sectors that can be issued relative to the specified logical offset. This is particularly important if/when the data volume is layered ontop of a more sophisticated device (e.g. dm-raid or some other DM target). Reviewed-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: adjust max_sectors_kb based on thinp blocksize	Mike Snitzer	2014-11-10	1	-3/+31
\| \| \| \| \| \| \| \| \| \| \|	Allows for filesystems to submit bios that are a factor of the thinp blocksize, improving dm-thinp efficiency (particularly when the data volume is RAID). Also set io_min to max_sectors_kb if it is a factor of the thinp blocksize. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: throttle incoming IO	Joe Thornber	2014-11-10	1	-1/+65
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Throttle IO based on the time it's taking the worker to do one loop. There were reports of hung task timeouts occuring and it was observed that the excessively long avgqu-sz (as reported by iostat) was contributing to these hung tasks. Throttling definitely helps dm-thinp perform better under heavy IO load (without being detremental by being overzealous). It reduces avgqu-sz drastically, e.g.: from 60K to ~6K, and even as low as 150 once metadata is cached by bufio, when dirty_ratio=5, dirty_background_ratio=2. And avgqu-sz stays at or below 30K even with dirty_ratio=20, dirty_background_ratio=10. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: prefetch missing metadata pages	Joe Thornber	2014-11-10	1	-4/+6
\| \| \| \| \| \| \| \|	Prefetch metadata at the start of the worker thread and then again every 128th bio processed from the deferred list. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm bio prison: switch to using a red black tree	Joe Thornber	2014-11-10	1	-2/+1
\| \| \| \| \| \| \| \| \| \|	Previously it was using a fixed sized hash table. There are times when very many concurrent cells are held (such as when processing a very large discard). When this happens the hash table performance becomes very poor. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: grab a virtual cell before looking up the mapping	Joe Thornber	2014-11-04	1	-4/+12
\| \| \| \| \| \| \| \|	Avoids normal IO racing with discard. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
*	dm thin: set minimum_io_size to pool's data block size	Mike Snitzer	2014-08-01	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Before, if the block layer's limit stacking didn't establish an optimal_io_size that was compatible with the thin-pool's data block size we'd set optimal_io_size to the data block size and minimum_io_size to 0 (which the block layer adjusts to be physical_block_size). Update pool_io_hints() to set both minimum_io_size and optimal_io_size to the thin-pool's data block size. This fixes an issue reported where mkfs.xfs would create more XFS Allocation Groups on thinp volumes than on a normal linear LV of comparable size, see: https://bugzilla.redhat.com/show_bug.cgi?id=1003227 Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: relax external origin size constraints	Joe Thornber	2014-08-01	1	-43/+115
\| \| \| \| \| \| \| \| \| \|	Track the size of any external origin. Previously the external origin's size had to be a multiple of the thin-pool's block size, that is no longer a requirement. In addition, snapshots that are larger than the external origin are now supported. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: switch to an atomic_t for tracking pending new block preparations	Joe Thornber	2014-08-01	1	-13/+16
\| \| \| \| \| \| \| \| \|	Previously we used separate boolean values to track quiescing and copying actions. By switching to an atomic_t we can support blocks that need a partial copy and partial zero. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: update discard_granularity to reflect the thin-pool blocksize	Lukas Czerner	2014-06-11	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	DM thinp already checks whether the discard_granularity of the data device is a factor of the thin-pool block size. But when using the dm-thin-pool's discard passdown support, DM thinp was not selecting the max of the underlying data device's discard_granularity and the thin-pool's block size. Update set_discard_limits() to set discard_granularity to the max of these values. This enables blkdev_issue_discard() to properly align the discards that are sent to the DM thin device on a full block boundary. As such each discard will now cover an entire DM thin-pool block and the block will be reclaimed. Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
*	dm thin: return ENOSPC instead of EIO when error_if_no_space enabled	Mike Snitzer	2014-06-03	1	-14/+24
\| \| \| \| \| \| \| \| \| \| \| \| \|	Update the DM thin provisioning target's allocation failure error to be consistent with commit a9d6ceb8 ("[SCSI] return ENOSPC on thin provisioning failure"). The DM thin target now returns -ENOSPC rather than -EIO when block allocation fails due to the pool being out of data space (and the 'error_if_no_space' thin-pool feature is enabled). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-By: Joe Thornber <ejt@redhat.com>
*	dm thin: cleanup noflush_work to use a proper completion	Joe Thornber	2014-06-03	1	-18/+34
\| \| \| \| \| \| \| \| \| \|	Factor out a pool_work interface that noflush_work makes use of to wait for and complete work items (in terms of a proper completion struct). Allows discontinuing the use of a custom completion in terms of atomic_t and wait_event. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: add 'no_space_timeout' dm-thin-pool module param	Mike Snitzer	2014-05-20	1	-3/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 85ad643b ("dm thin: add timeout to stop out-of-data-space mode holding IO forever") introduced a fixed 60 second timeout. Users may want to either disable or modify this timeout. Allow the out-of-data-space timeout to be configured using the 'no_space_timeout' dm-thin-pool module param. Setting it to 0 will disable the timeout, resulting in IO being queued until more data space is added to the thin-pool. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.14+
*	dm thin: add timeout to stop out-of-data-space mode holding IO forever	Joe Thornber	2014-05-14	1	-0/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the pool runs out of data space, dm-thin can be configured to either error IOs that would trigger provisioning, or hold those IOs until the pool is resized. Unfortunately, holding IOs until the pool is resized can result in a cascade of tasks hitting the hung_task_timeout, which may render the system unavailable. Add a fixed timeout so IOs can only be held for a maximum of 60 seconds. If LVM is going to resize a thin-pool that is out of data space it needs to be prompt about it. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.14+
*	dm thin: allow metadata commit if pool is in PM_OUT_OF_DATA_SPACE mode	Joe Thornber	2014-05-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 3e1a0699 ("dm thin: fix out of data space handling") introduced a regression in the metadata commit() method by returning an error if the pool is in PM_OUT_OF_DATA_SPACE mode. This oversight caused a thin device to return errors even if the default queue_if_no_space ENOSPC handling mode is used. Fix commit() to only fail if pool is in PM_READ_ONLY or PM_FAIL mode. Reported-by: qindehua@163.com Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.14+
*	dm thin: use INIT_WORK_ONSTACK in noflush_work to avoid ODEBUG warning	Mike Snitzer	2014-04-29	1	-1/+1
\| \| \| \| \| \| \| \| \|	Use INIT_WORK_ONSTACK to silence "ODEBUG: object is on stack, but not annotated". Reported-by: Zdeněk Kabeláč <zkabelac@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
*	dm thin: fix rcu_read_lock being held in code that can sleep	Joe Thornber	2014-04-08	1	-3/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit c140e1c4e23 ("dm thin: use per thin device deferred bio lists") introduced the use of an rculist for all active thin devices. The use of rcu_read_lock() in process_deferred_bios() can result in a BUG if a dm_bio_prison_cell must be allocated as a side-effect of bio_detain(): BUG: sleeping function called from invalid context at mm/mempool.c:203 in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u8:0 3 locks held by kworker/u8:0/6: #0: ("dm-" "thin"){.+.+..}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550 #1: ((&pool->worker)){+.+...}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550 #2: (rcu_read_lock){.+.+..}, at: [<ffffffff816360b5>] do_worker+0x5/0x4d0 We can't process deferred bios with the rcu lock held, since dm_bio_prison_cell allocation may block if the bio-prison's cell mempool is exhausted. To fix: - Introduce a refcount and completion field to each thin_c - Add thin_get/put methods for adjusting the refcount. If the refcount hits zero then the completion is triggered. - Initialise refcount to 1 when creating thin_c - When iterating the active_thins list we thin_get() whilst the rcu lock is held. - After the rcu lock is dropped we process the deferred bios for that thin. - When destroying a thin_c we thin_put() and then wait for the completion -- to avoid a race between the worker thread iterating from that thin_c and destroying the thin_c. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: irqsave must always be used with the pool->lock spinlock	Joe Thornber	2014-04-08	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \|	Commit c140e1c4e23 ("dm thin: use per thin device deferred bio lists") incorrectly stopped disabling irqs when taking the pool's spinlock. Irqs must be disabled when taking the pool's spinlock otherwise a thread could spin_lock(), then get interrupted to service thin_endio() in interrupt context, which would then deadlock in spin_lock_irqsave(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: sort the per thin deferred bios using an rb_tree	Mike Snitzer	2014-04-04	1	-2/+82
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A thin-pool will allocate blocks using FIFO order for all thin devices which share the thin-pool. Because of this simplistic allocation the thin-pool's space can become fragmented quite easily; especially when multiple threads are requesting blocks in parallel. Sort each thin device's deferred_bio_list based on logical sector to help reduce fragmentation of the thin-pool's ondisk layout. The following tables illustrate the realized gains/potential offered by sorting each thin device's deferred_bio_list. An "io size"-sized random read of the device would result in "seeks/io" fragments being read, with an average "distance/seek" between each fragment. Data was written to a single thin device using multiple threads via iozone (8 threads, 64K for both the block_size and io_size). unsorted: io size seeks/io distance/seek -------------------------------------- 4k 0.000 0b 16k 0.013 11m 64k 0.065 11m 256k 0.274 10m 1m 1.109 10m 4m 4.411 10m 16m 17.097 11m 64m 60.055 13m 256m 148.798 25m 1g 809.929 21m sorted: io size seeks/io distance/seek -------------------------------------- 4k 0.000 0b 16k 0.000 1g 64k 0.001 1g 256k 0.003 1g 1m 0.011 1g 4m 0.045 1g 16m 0.181 1g 64m 0.747 1011m 256m 3.299 1g 1g 14.373 1g Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
*	dm thin: use per thin device deferred bio lists	Mike Snitzer	2014-03-31	1	-61/+104
\| \| \| \| \| \| \| \| \| \| \| \| \|	The thin-pool previously only had a single deferred_bios list that would collect bios for all thin devices in the pool. Split this per-pool deferred_bios list out to per-thin deferred_bios_list -- doing so enables increased parallelism when processing deferred bios. And now that each thin device has it's own deferred_bios_list we can sort all bios in the list using logical sector. The requeue code in error handling path is also cleaner as a side-effect. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
*	dm thin: simplify pool_is_congested	Mike Snitzer	2014-03-31	1	-11/+5
\| \| \| \| \| \| \| \| \|	The pool is congested if the pool is in PM_OUT_OF_DATA_SPACE mode. This is more explicit/clear/efficient than inferring whether or not the pool is congested by checking if retry_on_resume_list is empty. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
*	dm thin: fix dangling bio in process_deferred_bios error path	Mike Snitzer	2014-03-28	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	If unable to ensure_next_mapping() we must add the current bio, which was removed from the @bios list via bio_list_pop, back to the deferred_bios list before all the remaining @bios. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
*	dm thin: fix noflush suspend IO queueing	Joe Thornber	2014-03-05	1	-2/+72
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	i) by the time DM core calls the postsuspend hook the dm_noflush flag has been cleared. So the old thin_postsuspend did nothing. We need to use the presuspend hook instead. ii) There was a race between bios leaving DM core and arriving in the deferred queue. thin_presuspend now sets a 'requeue' flag causing all bios destined for that thin to be requeued back to DM core. Then it requeues all held IO, and all IO on the deferred queue (destined for that thin). Finally postsuspend clears the 'requeue' flag. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: fix deadlock in __requeue_bio_list	Joe Thornber	2014-03-05	1	-6/+7
\| \| \| \| \| \| \| \| \| \| \|	The spin lock in requeue_io() was held for too long, allowing deadlock. Don't worry, due to other issues addressed in the following "dm thin: fix noflush suspend IO queueing" commit, this code was never called. Fix this by taking the spin lock for a much shorter period of time. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: fix out of data space handling	Joe Thornber	2014-03-05	1	-45/+102
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ideally a thin pool would never run out of data space; the low water mark would trigger userland to extend the pool before we completely run out of space. However, many small random IOs to unprovisioned space can consume data space at an alarming rate. Adjust your low water mark if you're frequently seeing "out-of-data-space" mode. Before this fix, if data space ran out the pool would be put in PM_READ_ONLY mode which also aborted the pool's current metadata transaction (data loss for any changes in the transaction). This had a side-effect of needlessly compromising data consistency. And retry of queued unserviceable bios, once the data pool was resized, could initiate changes to potentially inconsistent pool metadata. Now when the pool's data space is exhausted transition to a new pool mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but data may not be allocated. This allows users to remove thin volumes or discard data to recover data space. The pool is no longer put in PM_READ_ONLY mode in response to the pool running out of data space. And PM_READ_ONLY mode no longer aborts the pool's current metadata transaction. Also, set_pool_mode() will now notify userspace when the pool mode is changed. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: ensure user takes action to validate data and metadata consistency	Mike Snitzer	2014-03-05	1	-22/+54
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a thin metadata operation fails the current transaction will abort, whereby causing potential for IO layers up the stack (e.g. filesystems) to have data loss. As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the thin metadata's superblock which: 1) requires the user verify the thin metadata is consistent (e.g. use thin_check, etc) 2) suggests the user verify the thin data is consistent (e.g. use fsck) The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is to run thin_repair. On metadata operation failure: abort current metadata transaction, set pool in read-only mode, and now set the needs_check flag. As part of this change, constraints are introduced or relaxed: * don't allow a pool to transition to write mode if needs_check is set * don't allow data or metadata space to be resized if needs_check is set * if a thin pool's metadata space is exhausted: the kernel will now force the user to take the pool offline for repair before the kernel will allow the metadata space to be extended. Also, update Documentation to include information about when the thin provisioning target commits metadata, how it handles metadata failures and running out of space. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com>
*	dm thin: synchronize the pool mode during suspend	Mike Snitzer	2014-03-04	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit b5330655 ("dm thin: handle metadata failures more consistently") increased potential for the pool's mode to be changed in response to metadata operation failures. When the pool mode is changed it isn't synchronized with the mode in pool_features stored in the target's context (ti->private) that is used as the basis for (re)establishing the pool mode during resume via bind_control_target. It is important that we synchronize the pool mode when it is changed otherwise the pool may experience and unexpected mode transition on the next resume (especially if there was no new table load). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
*	dm thin: allow metadata space larger than supported to go unused	Mike Snitzer	2014-02-27	1	-12/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It was always intended that a user could provide a thin metadata device that is larger than the max supported by the on-disk format. The extra space would just go unused. Unfortunately that never worked. If the user attempted to use a larger metadata device on creation they would get an error like the following: device-mapper: space map common: space map too large device-mapper: transaction manager: couldn't create metadata space map device-mapper: thin metadata: tm_create_with_sm failed device-mapper: table: 252:17: thin-pool: Error creating metadata object device-mapper: ioctl: error adding target to table Fix this by allowing the initial metadata space map creation to cap its size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS). get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported). Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for the sizeof the disk_bitmap_header. So the supported maximum metadata size is a bit smaller (reduced from 33423360 to 33292800 sectors). Lastly, remove the "excess space will not be used" warning message from get_metadata_dev_size(); it resulted in printing the warning multiple times. Factor out warn_if_metadata_device_too_big(), call it from pool_ctr() and maybe_resize_metadata_dev(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
*	dm thin: fix the error path for the thin device constructor	Mike Snitzer	2014-02-24	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	dm_pool_close_thin_device() must be called if dm_set_target_max_io_len() fails in thin_ctr(). Otherwise __pool_destroy() will fail because the pool will still have an open thin device: device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed. Also, must establish error code if failing thin_ctr() because the pool is in fail_io mode. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
*	dm thin: avoid metadata commit if a pool's thin devices haven't changed	Mike Snitzer	2014-02-17	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 905e51b ("dm thin: commit outstanding data every second") introduced a periodic commit. This commit occurs regardless of whether any thin devices have made changes. Fix the periodic commit to check if any of a pool's thin devices have changed using dm_pool_changed_this_transaction(). Reported-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
*	Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block	Linus Torvalds	2014-01-30	1	-12/+18
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...
\| *	Merge tag 'v3.13-rc6' into for-3.14/core	Jens Axboe	2013-12-31	1	-27/+39
\| \|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Needed to bring blk-mq uptodate, since changes have been going in since for-3.14/core was established. Fixup merge issues related to the immutable biovec changes. Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/blk-flush.c fs/btrfs/check-integrity.c fs/btrfs/extent_io.c fs/btrfs/scrub.c fs/logfs/dev_bdev.c
\| * \|	block: Generic bio chaining	Kent Overstreet	2013-11-23	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds a generic mechanism for chaining bio completions. This is going to be used for a bio_split() replacement, and it turns out to be very useful in a fair amount of driver code - a fair number of drivers were implementing this in their own roundabout ways, often painfully. Note that this means it's no longer to call bio_endio() more than once on the same bio! This can cause problems for drivers that save/restore bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all - in all but the simplest cases they'd be better off just cloning the bio, and immutable biovecs is making bio cloning cheaper. But for now, we add a bio_endio_nodec() for these cases. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk>