op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	dm cache: fix access beyond end of origin device	Heinz Mauelshagen	2014-03-12	1	-5/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In order to avoid wasting cache space a partial block at the end of the origin device is not cached. Unfortunately, the check for such a partial block at the end of the origin device was flawed. Fix accesses beyond the end of the origin device that occured due to attempted promotion of an undetected partial block by: - initializing the per bio data struct to allow cache_end_io to work properly - recognizing access to the partial block at the end of the origin device - avoiding out of bounds access to the discard bitset Otherwise, users can experience errors like the following: attempt to access beyond end of device dm-5: rw=0, want=20971520, limit=20971456 ... device-mapper: cache: promotion failed; couldn't copy block Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
*	dm cache: fix truncation bug when copying a block to/from >2TB fast device	Heinz Mauelshagen	2014-03-12	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	During demotion or promotion to a cache's >2TB fast device we must not truncate the cache block's associated sector to 32bits. The 32bit temporary result of from_cblock() caused a 32bit multiplication when calculating the sector of the fast device in issue_copy_real(). Use an intermediate 64bit type to store the 32bit from_cblock() to allow for proper 64bit multiplication. Here is an example of how this bug manifests on an ext4 filesystem: EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt. JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
*	dm space map metadata: fix refcount decrement below 0 which caused corruption	Joe Thornber	2014-03-07	1	-21/+92
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This has been a relatively long-standing issue that wasn't nailed down until Teng-Feng Yang's meticulous bug report to dm-devel on 3/7/2014, see: http://www.redhat.com/archives/dm-devel/2014-March/msg00021.html From that report: "When decreasing the reference count of a metadata block with its reference count equals 3, we will call dm_btree_remove() to remove this enrty from the B+tree which keeps the reference count info in metadata device. The B+tree will try to rebalance the entry of the child nodes in each node it traversed, and the rebalance process contains the following steps. (1) Finding the corresponding children in current node (shadow_current(s)) (2) Shadow the children block (issue BOP_INC) (3) redistribute keys among children, and free children if necessary (issue BOP_DEC) Since the update of a metadata block's reference count could be recursive, we will stash these reference count update operations in smm->uncommitted and then process them in a FILO fashion. The problem is that step(3) could free the children which is created in step(2), so the BOP_DEC issued in step(3) will be carried out before the BOP_INC issued in step(2) since these BOPs will be processed in FILO fashion. Once the BOP_DEC from step(3) tries to decrease the reference count of newly shadow block, it will report failure for its reference equals 0 before decreasing. It looks like we can solve this issue by processing these BOPs in a FIFO fashion instead of FILO." Commit 5b564d80 ("dm space map: disallow decrementing a reference count below zero") changed the code to report an error for this temporary refcount decrement below zero. So what was previously a harmless invalid refcount became a hard failure due to the new error path: device-mapper: space map common: unable to decrement a reference count below 0 device-mapper: thin: 253:6: dm_thin_insert_block() failed: error = -22 device-mapper: thin: 253:6: switching pool to read-only mode This bug is in dm persistent-data code that is common to the DM thin and cache targets. So any users of those targets should apply this fix. Fix this by applying recursive space map operations in FIFO order rather than FILO. Resolves: https://bugzilla.kernel.org/show_bug.cgi?id=68801 Reported-by: Apollon Oikonomopoulos <apoikos@debian.org> Reported-by: edwillam1007@gmail.com Reported-by: Teng-Feng Yang <shinrairis@gmail.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.13+
*	dm thin: fix noflush suspend IO queueing	Joe Thornber	2014-03-05	1	-2/+72
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	i) by the time DM core calls the postsuspend hook the dm_noflush flag has been cleared. So the old thin_postsuspend did nothing. We need to use the presuspend hook instead. ii) There was a race between bios leaving DM core and arriving in the deferred queue. thin_presuspend now sets a 'requeue' flag causing all bios destined for that thin to be requeued back to DM core. Then it requeues all held IO, and all IO on the deferred queue (destined for that thin). Finally postsuspend clears the 'requeue' flag. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: fix deadlock in __requeue_bio_list	Joe Thornber	2014-03-05	1	-6/+7
\| \| \| \| \| \| \| \| \| \| \|	The spin lock in requeue_io() was held for too long, allowing deadlock. Don't worry, due to other issues addressed in the following "dm thin: fix noflush suspend IO queueing" commit, this code was never called. Fix this by taking the spin lock for a much shorter period of time. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: fix out of data space handling	Joe Thornber	2014-03-05	1	-45/+102
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ideally a thin pool would never run out of data space; the low water mark would trigger userland to extend the pool before we completely run out of space. However, many small random IOs to unprovisioned space can consume data space at an alarming rate. Adjust your low water mark if you're frequently seeing "out-of-data-space" mode. Before this fix, if data space ran out the pool would be put in PM_READ_ONLY mode which also aborted the pool's current metadata transaction (data loss for any changes in the transaction). This had a side-effect of needlessly compromising data consistency. And retry of queued unserviceable bios, once the data pool was resized, could initiate changes to potentially inconsistent pool metadata. Now when the pool's data space is exhausted transition to a new pool mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but data may not be allocated. This allows users to remove thin volumes or discard data to recover data space. The pool is no longer put in PM_READ_ONLY mode in response to the pool running out of data space. And PM_READ_ONLY mode no longer aborts the pool's current metadata transaction. Also, set_pool_mode() will now notify userspace when the pool mode is changed. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: ensure user takes action to validate data and metadata consistency	Mike Snitzer	2014-03-05	3	-23/+101
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a thin metadata operation fails the current transaction will abort, whereby causing potential for IO layers up the stack (e.g. filesystems) to have data loss. As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the thin metadata's superblock which: 1) requires the user verify the thin metadata is consistent (e.g. use thin_check, etc) 2) suggests the user verify the thin data is consistent (e.g. use fsck) The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is to run thin_repair. On metadata operation failure: abort current metadata transaction, set pool in read-only mode, and now set the needs_check flag. As part of this change, constraints are introduced or relaxed: * don't allow a pool to transition to write mode if needs_check is set * don't allow data or metadata space to be resized if needs_check is set * if a thin pool's metadata space is exhausted: the kernel will now force the user to take the pool offline for repair before the kernel will allow the metadata space to be extended. Also, update Documentation to include information about when the thin provisioning target commits metadata, how it handles metadata failures and running out of space. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com>
*	dm thin: synchronize the pool mode during suspend	Mike Snitzer	2014-03-04	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit b5330655 ("dm thin: handle metadata failures more consistently") increased potential for the pool's mode to be changed in response to metadata operation failures. When the pool mode is changed it isn't synchronized with the mode in pool_features stored in the target's context (ti->private) that is used as the basis for (re)establishing the pool mode during resume via bind_control_target. It is important that we synchronize the pool mode when it is changed otherwise the pool may experience and unexpected mode transition on the next resume (especially if there was no new table load). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
*	dm snapshot: fix metadata corruption	Mikulas Patocka	2014-03-03	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 55494bf2947dccdf2 ("dm snapshot: use dm-bufio") broke snapshots. Before that 3.14-rc1 commit, loading a snapshot's list of exceptions involved reading exception areas one by one into ps->area and inserting those exceptions into the hash table. Commit 55494bf2947dccdf2 changed it so that dm-bufio with prefetch is used to load exceptions in batchs. Exceptions are loaded correctly, but ps->area is left uninitialized. When a new exception is allocated, it is stored in this uninitialized ps->area which will be written to the disk. This causes metadata corruption. Fix this corruption by copying the last area that was read via dm-bufio into ps->area. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm: fix Kconfig indentation	Mike Snitzer	2014-03-03	2	-10/+10
\| \| \| \| \| \| \| \| \|	Since DM_DEBUG_BLOCK_STACK_TRACING is a DM_PERSISTENT_DATA config option move it from drivers/md/Kconfig to drivers/md/persistent-data/Kconfig. Doing so fixes indentation for other DM config options. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm cache mq: fix memory allocation failure for large cache devices	Heinz Mauelshagen	2014-02-28	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The memory allocated for the multiqueue policy's hash table doesn't need to be physically contiguous. Use vzalloc() instead of kzalloc(). Fedora has been carrying this fix since 10/10/2013. Failure seen during creation of a 10TB cached device with a 2048 sector block size and 411GB cache size: dmsetup: page allocation failure: order:9, mode:0x10c0d0 CPU: 11 PID: 29235 Comm: dmsetup Not tainted 3.10.4 #3 Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a 12/30/2011 000000000010c0d0 ffff880090941898 ffffffff81387ab4 ffff880090941928 ffffffff810bb26f 0000000000000009 000000000010c0d0 ffff880090941928 ffffffff81385dbc ffffffff815f3840 ffffffff00000000 000002000010c0d0 Call Trace: [<ffffffff81387ab4>] dump_stack+0x19/0x1b [<ffffffff810bb26f>] warn_alloc_failed+0x110/0x124 [<ffffffff81385dbc>] ? __alloc_pages_direct_compact+0x17c/0x18e [<ffffffff810bda2e>] __alloc_pages_nodemask+0x6c7/0x75e [<ffffffff810bdad7>] __get_free_pages+0x12/0x3f [<ffffffff810ea148>] kmalloc_order_trace+0x29/0x88 [<ffffffff810ec1fd>] __kmalloc+0x36/0x11b [<ffffffffa031eeed>] ? mq_create+0x1dc/0x2cf [dm_cache_mq] [<ffffffffa031efc0>] mq_create+0x2af/0x2cf [dm_cache_mq] [<ffffffffa0314605>] dm_cache_policy_create+0xa7/0xd2 [dm_cache] [<ffffffffa0312530>] ? cache_ctr+0x245/0xa13 [dm_cache] [<ffffffffa031263e>] cache_ctr+0x353/0xa13 [dm_cache] [<ffffffffa012b916>] dm_table_add_target+0x227/0x2ce [dm_mod] [<ffffffffa012e8e4>] table_load+0x286/0x2ac [dm_mod] [<ffffffffa012e65e>] ? dev_wait+0x8a/0x8a [dm_mod] [<ffffffffa012e324>] ctl_ioctl+0x39a/0x3c2 [dm_mod] [<ffffffffa012e35a>] dm_ctl_ioctl+0xe/0x12 [dm_mod] [<ffffffff81101181>] vfs_ioctl+0x21/0x34 [<ffffffff811019d3>] do_vfs_ioctl+0x3b1/0x3f4 [<ffffffff810f4d2e>] ? ____fput+0x9/0xb [<ffffffff81050b6c>] ? task_work_run+0x7e/0x92 [<ffffffff81101a68>] SyS_ioctl+0x52/0x82 [<ffffffff81391d92>] system_call_fastpath+0x16/0x1b Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
*	dm cache: fix truncation bug when mapping I/O to >2TB fast device	Heinz Mauelshagen	2014-02-28	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When remapping a block to the cache's fast device that is larger than 2TB we must not truncate the destination sector to 32bits. The 32bit temporary result of from_cblock() was being overflowed in remap_to_cache() due to the logical left shift. Use an intermediate 64bit type to store the 32bit from_cblock() result to fix the overflow. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
*	dm thin: allow metadata space larger than supported to go unused	Mike Snitzer	2014-02-27	5	-19/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It was always intended that a user could provide a thin metadata device that is larger than the max supported by the on-disk format. The extra space would just go unused. Unfortunately that never worked. If the user attempted to use a larger metadata device on creation they would get an error like the following: device-mapper: space map common: space map too large device-mapper: transaction manager: couldn't create metadata space map device-mapper: thin metadata: tm_create_with_sm failed device-mapper: table: 252:17: thin-pool: Error creating metadata object device-mapper: ioctl: error adding target to table Fix this by allowing the initial metadata space map creation to cap its size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS). get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported). Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for the sizeof the disk_bitmap_header. So the supported maximum metadata size is a bit smaller (reduced from 33423360 to 33292800 sectors). Lastly, remove the "excess space will not be used" warning message from get_metadata_dev_size(); it resulted in printing the warning multiple times. Factor out warn_if_metadata_device_too_big(), call it from pool_ctr() and maybe_resize_metadata_dev(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
*	dm mpath: fix stalls when handling invalid ioctls	Hannes Reinecke	2014-02-26	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	An invalid ioctl will never be valid, irrespective of whether multipath has active paths or not. So for invalid ioctls we do not have to wait for multipath to activate any paths, but can rather return an error code immediately. This fix resolves numerous instances of: udevd[]: worker [] unexpectedly returned with status 0x0100 that have been seen during testing. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
*	dm thin: fix the error path for the thin device constructor	Mike Snitzer	2014-02-24	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	dm_pool_close_thin_device() must be called if dm_set_target_max_io_len() fails in thin_ctr(). Otherwise __pool_destroy() will fail because the pool will still have an open thin device: device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed. Also, must establish error code if failing thin_ctr() because the pool is in fail_io mode. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
*	dm raid1: fix immutable biovec related BUG when retrying read bio	Mikulas Patocka	2014-02-18	1	-0/+3
\| \| \| \| \| \| \| \|	When restoring bi_end_io, increase bi_remaining before retrying the bio to avoid BUG_ON(atomic_read(&bio->bi_remaining) <= 0) in bio_endio(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm io: fix I/O to multiple destinations	Mikulas Patocka	2014-02-17	1	-12/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 003b5c5719f159f4f4bf97511c4702a0638313dd ("block: Convert drivers to immutable biovecs") broke dm-mirror due to dm-io breakage. dm-io had three possible iterators (DM_IO_PAGE_LIST, DM_IO_BVEC, DM_IO_VMA) that iterate over pages where the I/O should be performed. The switch to immutable biovecs changed the DM_IO_BVEC iterator to DM_IO_BIO. Before this change the iterator stored the pointer to a bio vector in the dpages structure. The iterator incremented the pointer in the dpages structure as it advanced over the pages. After the immutable biovecs change, the DM_IO_BIO iterator stores a pointer to the bio in the dpages structure and uses bio_advance to change the bio as it advances. The problem is that the function dispatch_io stores the content of the dpages structure into the variable old_pages and restores it before issuing I/O to each of the devices. Before the change, the statement "dp = old_pages;" restored the iterator to its starting position. After the change, struct dpages holds a pointer to the bio, thus the statement "dp = old_pages;" doesn't restore the iterator. Consequently, in the context of dm-mirror: only the first mirror leg is written correctly, the kernel locks up when trying to write the other mirror legs because the number of sectors to write in the where->count variable doesn't match the number of sectors returned by the iterator. This patch fixes the bug by partially reverting the original patch - it changes the code so that struct dpages holds a pointer to the bio vector, so that the statement "*dp = old_pages;" restores the iterator correctly. The field "context_u" holds the offset from the beginning of the current bio vector entry, just like the "bio->bi_iter.bi_bvec_done" field. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
*	dm thin: avoid metadata commit if a pool's thin devices haven't changed	Mike Snitzer	2014-02-17	3	-1/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 905e51b ("dm thin: commit outstanding data every second") introduced a periodic commit. This commit occurs regardless of whether any thin devices have made changes. Fix the periodic commit to check if any of a pool's thin devices have changed using dm_pool_changed_this_transaction(). Reported-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
*	dm cache: do not add migration to completed list before unhooking bio	Mike Snitzer	2014-02-17	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When completing an overwrite bio, in overwrite_endio(), the associated migration should not be added to the 'completed_migrations' until the bio's fields are restored with dm_unhook_bio(). Otherwise, do_worker() can race to process 'completed_migrations' before dm_unhook_bio() -- so the bio's bi_end_io is incorrect. This is unlikely to cause any problems given the current code but should be fixed on the basis of correctness. Also, the cache's spinlock only needs to be held when manipulating the 'completed_migrations' list -- other changes don't need protection. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
*	dm cache: move hook_info into common portion of per_bio_data structure	Mike Snitzer	2014-02-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit c9d28d5d ("dm cache: promotion optimisation for writes") incorrectly placed the 'hook_info' member in the writethrough-only portion of the per_bio_data structure. Given that the overwrite optimization may be used for writeback the 'hook_info' member must be placed above the 'cache' member of the per_bio_data structure. Any members above 'cache' are available from both writeback and writethrough modes' per_bio_data structure. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org # 3.13+
*	Merge tag 'md/3.14-fixes' of git://neil.brown.name/md	Linus Torvalds	2014-02-14	2	-49/+54
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull md fixes from Neil Brown: "Two bugfixes for md both tagged for -stable" * tag 'md/3.14-fixes' of git://neil.brown.name/md: md/raid5: Fix CPU hotplug callback registration md/raid1: restore ability for check and repair to fix read errors.
\| *	md/raid5: Fix CPU hotplug callback registration	Oleg Nesterov	2014-02-13	1	-46/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below: get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); register_cpu_notifier(&foobar_cpu_notifier); put_online_cpus(); This is wrong, since it is prone to ABBA deadlocks involving the cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently with CPU hotplug operations). Interestingly, the raid5 code can actually prevent double initialization and hence can use the following simplified form of callback registration: register_cpu_notifier(&foobar_cpu_notifier); get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); put_online_cpus(); A hotplug operation that occurs between registering the notifier and calling get_online_cpus(), won't disrupt anything, because the code takes care to perform the memory allocations only once. So reorganize the code in raid5 this way to fix the deadlock with callback registration. Cc: linux-raid@vger.kernel.org Cc: stable@vger.kernel.org (v2.6.32+) Fixes: 36d1c6476be51101778882897b315bd928c8c7b5 Signed-off-by: Oleg Nesterov <oleg@redhat.com> [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the free_scratch_buffer() helper to condense code further and wrote the changelog.] Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md/raid1: restore ability for check and repair to fix read errors.	NeilBrown	2014-02-05	1	-3/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a md/raid1: fix bio handling problems in process_checks() Move the bio_reset() to a point before where BIO_UPTODATE is checked, so that check now always report that the bio is uptodate, even if it is not. This causes process_check() to sometimes treat read-errors as successful matches so the good data isn't written out. This patch preserves the flag until it is needed. Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed an even worse bug). So suitable for any -stable since 3.10. Reported-and-tested-by: Michael Tokarev <mjt@tls.msk.ru> Cc: stable@vger.kernel.org (3.10+) Fixed: 30bc9b53878a9921b02e3b5bc4283ac1c6de102a Signed-off-by: NeilBrown <neilb@suse.de>
* \|	Merge branch 'bcache-for-3.14' of git://evilpiepirate.org/~kent/linux-bcache ↵	Jens Axboe	2014-01-30	6	-10/+15
\|\ \ \| \|/ \|/\| \| \|	into for-linus
\| *	bcache: bugfix - gc thread now gets woken when cache is full	Nicholas Swenson	2014-01-29	1	-3/+3
\| \| \| \| \| \| \| \|	Signed-off-by: Nicholas Swenson <nks@daterainc.com>
\| *	bcache: Minor fixes from kbuild robot	Kent Overstreet	2014-01-29	4	-5/+8
\| \| \| \| \| \| \| \|	Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: fix BUG_ON due to integer overflow with GC_SECTORS_USED	Darrick J. Wong	2014-01-29	2	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The BUG_ON at the end of __bch_btree_mark_key can be triggered due to an integer overflow error: BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 13); ... SET_GC_SECTORS_USED(g, min_t(unsigned, GC_SECTORS_USED(g) + KEY_SIZE(k), (1 << 14) - 1)); BUG_ON(!GC_SECTORS_USED(g)); In bcache.h, the SECTORS_USED bitfield is defined to be 13 bits wide. While the SET_ code tries to ensure that the field doesn't overflow by clamping it to (1<<14)-1 == 16383, this is incorrect because 16383 requires 14 bits. Therefore, if GC_SECTORS_USED() + KEY_SIZE() = 8192, the SET_ statement tries to store 8192 into a 13-bit field. In a 13-bit field, 8192 becomes zero, thus triggering the BUG_ON. Therefore, create a field width constant and a max value constant, and use those to create the bitfield and check the inputs to SET_GC_SECTORS_USED. Arguably the BITMASK() template ought to have BUG_ON checks for too-large values, but that's a separate patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* \|	Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block	Linus Torvalds	2014-01-30	22	-1755/+2213
\|\ \ \| \|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull block IO driver changes from Jens Axboe: - bcache update from Kent Overstreet. - two bcache fixes from Nicholas Swenson. - cciss pci init error fix from Andrew. - underflow fix in the parallel IDE pg_write code from Dan Carpenter. I'm sure the 1 (or 0) users of that are now happy. - two PCI related fixes for sx8 from Jingoo Han. - floppy init fix for first block read from Jiri Kosina. - pktcdvd error return miss fix from Julia Lawall. - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from Michael Opdenacker. - comment typo fix for the loop driver from Olaf Hering. - potential oops fix for null_blk from Raghavendra K T. - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing an OOM problem and a problem with handling security locked conditions * 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits) mg_disk: Spelling s/finised/finished/ null_blk: Null pointer deference problem in alloc_page_buffers mtip32xx: Correctly handle security locked condition mtip32xx: Make SGL container per-command to eliminate high order dma allocation drivers/block/loop.c: fix comment typo in loop_config_discard drivers/block/cciss.c:cciss_init_one(): use proper errnos drivers/block/paride/pg.c: underflow bug in pg_write() drivers/block/sx8.c: remove unnecessary pci_set_drvdata() drivers/block/sx8.c: use module_pci_driver() floppy: bail out in open() if drive is not responding to block0 read bcache: Fix auxiliary search trees for key size > cacheline size bcache: Don't return -EINTR when insert finished bcache: Improve bucket_prio() calculation bcache: Add bch_bkey_equal_header() bcache: update bch_bkey_try_merge bcache: Move insert_fixup() to btree_keys_ops bcache: Convert sorting to btree_keys bcache: Convert debug code to btree_keys bcache: Convert btree_iter to struct btree_keys bcache: Refactor bset_tree sysfs stats ...
\| *	bcache: Fix auxiliary search trees for key size > cacheline size	Kent Overstreet	2014-01-08	1	-14/+14
\| \| \| \| \| \| \| \|	Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Don't return -EINTR when insert finished	Kent Overstreet	2014-01-08	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We need to return -EINTR after a split because we invalidated iterators (and freed the btree node) - but if we were finished inserting, we don't want to redo the traversal. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Improve bucket_prio() calculation	Kent Overstreet	2014-01-08	2	-3/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When deciding what order to reuse buckets we take into account both the bucket's priority (which indicates lru order) and also the amount of live data in that bucket. The way they were scaled together wasn't as correct as it could be... this patch improves and documents it. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Add bch_bkey_equal_header()	Nicholas Swenson	2014-01-08	3	-8/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Checks if two keys have equivalent header fields. (good enough for replacement or merging) Used in bch_bkey_try_merge, and replacing a key in the btree. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: update bch_bkey_try_merge	Nicholas Swenson	2014-01-08	3	-16/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Added generic header checks to bch_bkey_try_merge, which then calls the bkey specific function Removed extraneous checks from bch_extent_merge Signed-off-by: Nicholas Swenson <nks@daterainc.com>
\| *	bcache: Move insert_fixup() to btree_keys_ops	Kent Overstreet	2014-01-08	4	-229/+257
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now handling overlapping extents/keys is a method that's specific to what the btree node contains. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Convert sorting to btree_keys	Kent Overstreet	2014-01-08	3	-36/+33
\| \| \| \| \| \| \| \| \| \| \| \|	More work to disentangle various code from struct btree Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Convert debug code to btree_keys	Kent Overstreet	2014-01-08	9	-217/+264
\| \| \| \| \| \| \| \| \| \| \| \|	More work to disentangle various code from struct btree Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Convert btree_iter to struct btree_keys	Kent Overstreet	2014-01-08	6	-38/+41
\| \| \| \| \| \| \| \| \| \| \| \|	More work to disentangle bset.c from struct btree Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Refactor bset_tree sysfs stats	Kent Overstreet	2014-01-08	3	-47/+54
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We're in the process of turning bset.c into library code, so none of the code in that file should know about struct cache_set or struct btree - so, move the btree traversal part of the stats code to sysfs.c. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Add bch_btree_keys_u64s_remaining()	Kent Overstreet	2014-01-08	2	-13/+30
\| \| \| \| \| \| \| \| \| \| \| \|	Helper function to explicitly check how much space is free in a btree node Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Add struct btree_keys	Kent Overstreet	2014-01-08	8	-263/+322
\| \| \| \| \| \| \| \| \| \| \| \|	Soon, bset.c won't need to depend on struct btree. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Abstract out stuff needed for sorting	Kent Overstreet	2014-01-08	9	-289/+423
\| \| \| \| \| \| \| \|	Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Rename/shuffle various code around	Kent Overstreet	2014-01-08	8	-276/+341
\| \| \| \| \| \| \| \| \| \| \| \|	More work to disentangle bset.c from the rest of the code: Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Add struct bset_sort_state	Kent Overstreet	2014-01-08	6	-49/+87
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	More disentangling bset.c from the rest of the bcache code - soon, the sorting routines won't have any dependencies on any outside structs. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Split out sort_extent_cmp()	Kent Overstreet	2014-01-08	4	-32/+73
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Only use extent comparison for comparing extents, so we're not using START_KEY() on other key types (i.e. btree pointers) Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Bkey indexing renaming	Kent Overstreet	2014-01-08	6	-52/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	More refactoring: node() -> bset_bkey_idx() end() -> bset_bkey_last() Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Make bch_keylist_realloc() take u64s, not nptrs	Kent Overstreet	2014-01-08	4	-16/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Getting away from KEY_PTRS and moving toward KEY_U64s - and getting rid of magic 2s Also - split out the part that checks against journal entry size so as to avoid a dependancy on struct cache_set in bset.c Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Remove/fix some header dependencies	Kent Overstreet	2014-01-08	3	-24/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the process of disentagling/libraryizing bset.c from the rest of the bcache code. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Use a mempool for mergesort temporary space	Kent Overstreet	2014-01-08	3	-16/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	It was a single element mempool before, it's slightly cleaner to just use a real mempool. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: Btree verify code improvements	Kent Overstreet	2014-01-08	6	-40/+83
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Used this fixed code to find and fix the bug fixed by a4d885097b0ac0cd1337f171f2d4b83e946094d4. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
\| *	bcache: kill index()	Kent Overstreet	2014-01-08	4	-8/+24
\| \| \| \| \| \| \| \| \| \| \| \|	That was a terrible name for a macro, add some better helpers to replace it. Signed-off-by: Kent Overstreet <kmo@daterainc.com>