summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* md/raid1: fix bio handling problems in process_checks()NeilBrown2013-07-181-23/+30
| | | | | | | | | | | | | | | | | | | | | | Recent change to use bio_copy_data() in raid1 when repairing an array is faulty. The underlying may have changed the bio in various ways using bio_advance and these need to be undone not just for the 'sbio' which is being copied to, but also the 'pbio' (primary) which is being copied from. So perform the reset on all bios that were read from and do it early. This also ensure that the sbio->bi_io_vec[j].bv_len passed to memcmp is correct. This fixes a crash during a 'check' of a RAID1 array. The crash was introduced in 3.10 so this is suitable for 3.10-stable. Cc: stable@vger.kernel.org (3.10) Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: Remove recent change which allows devices to skip recovery.NeilBrown2013-07-181-14/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86 md: Allow devices to be re-added to a read-only array. allowed a bit more than just that. It also allows devices to be added to a read-write array and to end up skipping recovery. This patch removes the offending piece of code pending a rewrite for a subsequent release. More specifically: If the array has a bitmap, then the device will still need a bitmap based resync ('saved_raid_disk' is set under different conditions is a bitmap is present). If the array doesn't have a bitmap, then this is correct as long as nothing has been written to the array since the metadata was checked by ->validate_super. However there is no locking to ensure that there was no write. Bug was introduced in 3.10 and causes data corruption so patch is suitable for 3.10-stable. Cc: stable@vger.kernel.org (3.10) Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md/raid10: fix two problems with RAID10 resync.NeilBrown2013-07-181-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | 1/ When an different between blocks is found, data is copied from one bio to the other. However bv_len is used as the length to copy and this could be zero. So use r10_bio->sectors to calculate length instead. Using bv_len was probably always a bit dubious, but the introduction of bio_advance made it much more likely to be a problem. 2/ When preparing some blocks for sync, we don't set BIO_UPTODATE except on bios that we schedule for a read. This ensures that missing/failed devices don't confuse the loop at the top of sync_request write. Commit 8be185f2c9d54d6 "raid10: Use bio_reset()" removed a loop which set BIO_UPTDATE on all appropriate bios. So we need to re-add that flag. These bugs were introduced in 3.10, so this patch is suitable for 3.10-stable, and can remove a potential for data corruption. Cc: stable@vger.kernel.org (3.10) Reported-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* Merge tag 'dm-3.11-changes' of ↵Linus Torvalds2013-07-1111-180/+818
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm Pull device-mapper changes from Alasdair G Kergon: "Add a device-mapper target called dm-switch to provide a multipath framework for storage arrays that dynamically reconfigure their preferred paths for different device regions. Fix a bug in the verity target that prevented its use with some specific sizes of devices. Improve some locking mechanisms in the device-mapper core and bufio. Add Mike Snitzer as a device-mapper maintainer. A few more clean-ups and fixes" * tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: dm: add switch target dm: update maintainers dm: optimize reorder structure dm: optimize use SRCU and RCU dm bufio: submit writes outside lock dm cache: fix arm link errors with inline dm verity: use __ffs and __fls dm flakey: correct ctr alloc failure mesg dm verity: remove pointless comparison dm: use __GFP_HIGHMEM in __vmalloc dm verity: fix inability to use a few specific devices sizes dm ioctl: set noio flag to avoid __vmalloc deadlock dm mpath: fix ioctl deadlock when no paths
| * dm: add switch targetJim Ramsay2013-07-103-0/+553
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dm-switch is a new target that maps IO to underlying block devices efficiently when there is a large number of fixed-sized address regions but there is no simple pattern to allow for a compact mapping representation such as dm-stripe. Though we have developed this target for a specific storage device, Dell EqualLogic, we have made an effort to keep it as general purpose as possible in the hope that others may benefit. Originally developed by Jim Ramsay. Simplified by Mikulas Patocka. Signed-off-by: Jim Ramsay <jim_ramsay@dell.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: optimize reorder structureMikulas Patocka2013-07-101-7/+7
| | | | | | | | | | | | | | | | | | | | | | This reorder actually improves performance by 20% (from 39.1s to 32.8s) on x86-64 quad core Opteron. I have no explanation for this, possibly it makes some other entries are better cache-aligned. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: optimize use SRCU and RCUMikulas Patocka2013-07-103-142/+175
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes "io_lock" and "map_lock" in struct mapped_device and "holders" in struct dm_table and replaces these mechanisms with sleepable-rcu. Previously, the code would call "dm_get_live_table" and "dm_table_put" to get and release table. Now, the code is changed to call "dm_get_live_table" and "dm_put_live_table". dm_get_live_table locks sleepable-rcu and dm_put_live_table unlocks it. dm_get_live_table_fast/dm_put_live_table_fast can be used instead of dm_get_live_table/dm_put_live_table. These *_fast functions use non-sleepable RCU, so the caller must not block between them. If the code changes active or inactive dm table, it must call dm_sync_table before destroying the old table. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm bufio: submit writes outside lockMikulas Patocka2013-07-101-15/+58
| | | | | | | | | | | | | | | | | | | | This patch changes dm-bufio so that it submits write I/Os outside of the lock. If the number of submitted buffers is greater than the number of requests on the target queue, submit_bio blocks. We want to block outside of the lock to improve latency of other threads that may need the lock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm cache: fix arm link errors with inlineMikulas Patocka2013-07-101-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | Use __always_inline to avoid a link failure with gcc 4.6 on ARM. gcc 4.7 is OK. It creates a function block_div.part.8, it references __udivdi3 and __umoddi3 and it is never called. The references to __udivdi3 and __umoddi3 cause a link failure. Reported-by: Rob Herring <robherring2@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm verity: use __ffs and __flsMikulas Patocka2013-07-101-4/+4
| | | | | | | | | | | | | | | | This patch changes ffs() to __ffs() and fls() to __fls() which don't add one to the result. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm flakey: correct ctr alloc failure mesgAlasdair G Kergon2013-07-101-1/+1
| | | | | | | | | | | | | | | | Remove the reference to the "linear" target from the error message issued when allocation fails in the flakey target. Cc: Robin Dong <sanbai@taobao.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm verity: remove pointless comparisonMikulas Patocka2013-07-101-2/+2
| | | | | | | | | | | | | | | | Remove num < 0 test in verity_ctr because num is unsigned. (Found by Coverity.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: use __GFP_HIGHMEM in __vmallocMikulas Patocka2013-07-102-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Use __GFP_HIGHMEM in __vmalloc. Pages allocated with __vmalloc can be allocated in high memory that is not directly mapped to kernel space, so use __GFP_HIGHMEM just like vmalloc does. This patch reduces memory pressure slightly because pages can be allocated in the high zone. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm verity: fix inability to use a few specific devices sizesMikulas Patocka2013-07-101-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a boundary condition that caused failure for certain device sizes. The problem is reported at http://code.google.com/p/cryptsetup/issues/detail?id=160 For certain device sizes the number of hashes at a specific level was calculated incorrectly. It happens for example for a device with data and metadata block size 4096 that has 16385 blocks and algorithm sha256. The user can test if he is affected by this bug by running the "veritysetup verify" command and also by activating the dm-verity kernel driver and reading the whole block device. If it passes without an error, then the user is not affected. The condition for the bug is: Split the total number of data blocks (data_block_bits) into bit strings, each string has hash_per_block_bits bits. hash_per_block_bits is rounddown(log2(metadata_block_size/hash_digest_size)). Equivalently, you can say that you convert data_blocks_bits to 2^hash_per_block_bits base. If there some zero bit string below the most significant bit string and at least one bit below this zero bit string is set, then the bug happens. The same bug exists in the userspace veritysetup tool, so you must use fixed veritysetup too if you want to use devices that are affected by this boundary condition. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # 3.4+ Cc: Milan Broz <gmazyland@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm ioctl: set noio flag to avoid __vmalloc deadlockMikulas Patocka2013-07-101-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Set noio flag while calling __vmalloc() because it doesn't fully respect gfp flags to avoid a possible deadlock (see commit 502624bdad3dba45dfaacaf36b7d83e39e74b2d2). This should be backported to stable kernels 3.8 and newer. The kernel 3.8 doesn't have memalloc_noio_save(), so we should set and restore process flag PF_MEMALLOC instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm mpath: fix ioctl deadlock when no pathsHannes Reinecke2013-07-102-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When multipath needs to retry an ioctl the reference to the current live table needs to be dropped. Otherwise a deadlock occurs when all paths are down: - dm_blk_ioctl takes a reference to the current table and spins in multipath_ioctl(). - A new table is being loaded, but upon resume the process hangs in dm_table_destroy() waiting for references to drop to zero. With this patch the reference to the old table is dropped prior to retry, thereby avoiding the deadlock. Signed-off-by: Hannes Reinecke <hare@suse.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2013-07-041-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "The usual stuff from trivial tree" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) treewide: relase -> release Documentation/cgroups/memory.txt: fix stat file documentation sysctl/net.txt: delete reference to obsolete 2.4.x kernel spinlock_api_smp.h: fix preprocessor comments treewide: Fix typo in printk doc: device tree: clarify stuff in usage-model.txt. open firmware: "/aliasas" -> "/aliases" md: bcache: Fixed a typo with the word 'arithmetic' irq/generic-chip: fix a few kernel-doc entries frv: Convert use of typedef ctl_table to struct ctl_table sgi: xpc: Convert use of typedef ctl_table to struct ctl_table doc: clk: Fix incorrect wording Documentation/arm/IXP4xx fix a typo Documentation/networking/ieee802154 fix a typo Documentation/DocBook/media/v4l fix a typo Documentation/video4linux/si476x.txt fix a typo Documentation/virtual/kvm/api.txt fix a typo Documentation/early-userspace/README fix a typo Documentation/video4linux/soc-camera.txt fix a typo lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment ...
| * | md: bcache: Fixed a typo with the word 'arithmetic'Phil Viana2013-06-181-2/+2
| | | | | | | | | | | | | | | | | | | | | The word 'arithmetic' was typed as 'arithmatic' Signed-off-by: Phil Viana <phillip.l.viana@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | md/raid10: fix bug which causes all RAID10 reshapes to move no data.NeilBrown2013-07-041-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent comment: commit 7e83ccbecd608b971f340e951c9e84cd0343002f md/raid10: Allow skipping recovery when clean arrays are assembled Causes raid10 to skip a recovery in certain cases where it is safe to do so. Unfortunately it also causes a reshape to be skipped which is never safe. The result is that an attempt to reshape a RAID10 will appear to complete instantly, but no data will have been moves so the array will now contain garbage. (If nothing is written, you can recovery by simple performing the reverse reshape which will also complete instantly). Bug was introduced in 3.10, so this is suitable for 3.10-stable. Cc: stable@vger.kernel.org (3.10) Cc: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
* | | md/raid5: allow 5-device RAID6 to be reshaped to 4-device.NeilBrown2013-07-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a bug in 'check_reshape' for raid5.c To checks that the new minimum number of devices is large enough (which is good), but it does so also after the reshape has started (bad). This is bad because - the calculation is now wrong as mddev->raid_disks has changed already, and - it is pointless because it is now too late to stop. So only perform that test when reshape has not been committed to. Signed-off-by: NeilBrown <neilb@suse.de>
* | | md/raid10: fix two bugs affecting RAID10 reshape.NeilBrown2013-07-031-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1/ If a RAID10 is being reshaped to a fewer number of devices and is stopped while this is ongoing, then when the array is reassembled the 'mirrors' array will be allocated too small. This will lead to an access error or memory corruption. 2/ A sanity test for a reshaping RAID10 array is restarted is slightly incorrect. Due to the first bug, this is suitable for any -stable kernel since 3.5 where this code was introduced. Cc: stable@vger.kernel.org (v3.5+) Signed-off-by: NeilBrown <neilb@suse.de>
* | | MD: Remember the last sync operation that was performedJonathan Brassow2013-06-263-9/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD: Remember the last sync operation that was performed This patch adds a field to the mddev structure to track the last sync operation that was performed. This is especially useful when it comes to what is recorded in mismatch_cnt in sysfs. If the last operation was "data-check", then it reports the number of descrepancies found by the user-initiated check. If it was a "repair" operation, then it is reporting the number of descrepancies repaired. etc. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | | md: fix buglet in RAID5 -> RAID0 conversion.NeilBrown2013-06-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RAID5 uses a 'per-array' value for the 'size' of each device. RAID0 uses a 'per-device' value - it can be different for each device. When converting a RAID5 to a RAID0 we must ensure that the per-device size of each device matches the per-array size for the RAID5, else the array will change size. If the metadata cannot record a changed per-device size (as is the case with v0.90 metadata) the array could get bigger on restart. This does not cause data corruption, so it not a big issue and is mainly yet another a reason to not use 0.90. Signed-off-by: NeilBrown <neilb@suse.de>
* | | md/raid10: check In_sync flag in 'enough()'.NeilBrown2013-06-141-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It isn't really enough to check that the rdev is present, we need to also be sure that the device is still In_sync. Doing this requires using rcu_dereference to access the rdev, and holding the rcu_read_lock() to ensure the rdev doesn't disappear while we look at it. Signed-off-by: NeilBrown <neilb@suse.de>
* | | md/raid10: locking changes for 'enough()'.NeilBrown2013-06-141-16/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As 'enough' accesses conf->prev and conf->geo, which can change spontanously, it should guard against changes. This can be done with device_lock as start_reshape holds device_lock while updating 'geo' and end_reshape holds it while updating 'prev'. So 'error' needs to hold 'device_lock'. On the other hand, raid10_end_read_request knows which of the two it really wants to access, and as it is an active request on that one, the value cannot change underneath it. So change _enough to take flag rather than a pointer, pass the appropriate flag from raid10_end_read_request(), and remove the locking. All other calls to 'enough' are made with reconfig_mutex held, so neither 'prev' nor 'geo' can change. Signed-off-by: NeilBrown <neilb@suse.de>
* | | md: replace strict_strto*() with kstrto*()Jingoo Han2013-06-144-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | The usage of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | | md: Wait for md_check_recovery before attempting device removal.Hannes Reinecke2013-06-141-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a device has failed, it needs to be removed from the personality module before it can be removed from the array as a whole. The first step is performed by md_check_recovery() which is called from the raid management thread. So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery to have run. This increases the chance that the ioctl will succeed. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Neil Brown <nfbrown@suse.de>
* | | dm-raid: silence compiler warning on rebuilds_per_group.NeilBrown2013-06-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | This doesn't really need to be initialised, but it doesn't hurt, silences the compiler, and as it is a counter it makes sense for it to start at zero. Signed-off-by: NeilBrown <neilb@suse.de>
* | | DM RAID: Fix raid_resume not reviving failed devices in all casesJonathan Brassow2013-06-141-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DM RAID: Fix raid_resume not reviving failed devices in all cases When a device fails in a RAID array, it is marked as Faulty. Later, md_check_recovery is called which (through the call chain) calls 'hot_remove_disk' in order to have the personalities remove the device from use in the array. Sometimes, it is possible for the array to be suspended before the personalities get their chance to perform 'hot_remove_disk'. This is normally not an issue. If the array is deactivated, then the failed device will be noticed when the array is reinstantiated. If the array is resumed and the disk is still missing, md_check_recovery will be called upon resume and 'hot_remove_disk' will be called at that time. However, (for dm-raid) if the device has been restored, a resume on the array would cause it to attempt to revive the device by calling 'hot_add_disk'. If 'hot_remove_disk' had not been called, a situation is then created where the device is thought to concurrently be the replacement and the device to be replaced. Thus, the device is first sync'ed with the rest of the array (because it is the replacement device) and then marked Faulty and removed from the array (because it is also the device being replaced). The solution is to check and see if the device had properly been removed before the array was suspended. This is done by seeing whether the device's 'raid_disk' field is -1 - a condition that implies that 'md_check_recovery -> remove_and_add_spares (where raid_disk is set to -1) -> hot_remove_disk' has been called. If 'raid_disk' is not -1, then 'hot_remove_disk' must be called to complete the removal of the previously faulty device before it can be revived via 'hot_add_disk'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | | DM RAID: Break-up untidy functionJonathan Brassow2013-06-141-33/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DM RAID: Break-up untidy function Clean-up excessive indentation by moving some code in raid_resume() into its own function. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | | DM RAID: Add ability to restore transiently failed devices on resumeJonathan Brassow2013-06-143-8/+53
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | DM RAID: Add ability to restore transiently failed devices on resume This patch adds code to the resume function to check over the devices in the RAID array. If any are found to be marked as failed and their superblocks can be read, an attempt is made to reintegrate them into the array. This allows the user to refresh the array with a simple suspend and resume of the array - rather than having to load a completely new table, allocate and initialize all the structures and throw away the old instantiation. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | Merge tag 'md-3.10-fixes' of git://neil.brown.name/mdLinus Torvalds2013-06-134-26/+47
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull md bugfixes from Neil Brown: "A few bugfixes for md Some tagged for -stable" * tag 'md-3.10-fixes' of git://neil.brown.name/md: md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place md/raid1,raid10: use freeze_array in place of raise_barrier in various places. md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. md: md_stop_writes() should always freeze recovery.
| * | md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in placeH. Peter Anvin2013-06-133-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drivers behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs. After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession resulting in data loss. However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero. Until the failure scenario has been sorted out, disable WRITE SAME for raid1, raid5, and raid10. [neilb: added raid5] This patch is appropriate for any -stable since 3.7 when write_same support was added. Cc: stable@vger.kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | md/raid1,raid10: use freeze_array in place of raise_barrier in various places.NeilBrown2013-06-132-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Various places in raid1 and raid10 are calling raise_barrier when they really should call freeze_array. The former is only intended to be called from "make_request". The later has extra checks for 'nr_queued' and makes a call to flush_pending_writes(), so it is safe to call it from within the management thread. Using raise_barrier will sometimes deadlock. Using freeze_array should not. As 'freeze_array' currently expects one request to be pending (in handle_read_error - the only previous caller), we need to pass it the number of pending requests (extra) to ignore. The deadlock was made particularly noticeable by commits 050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which appeared in 3.4, so the fix is appropriate for any -stable kernel since then. This patch probably won't apply directly to some early kernels and will need to be applied by hand. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | md/raid1: consider WRITE as successful only if at least one non-Faulty and ↵Alex Lyakas2013-06-132-2/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | non-rebuilding drive completed it. Without that fix, the following scenario could happen: - RAID1 with drives A and B; drive B was freshly-added and is rebuilding - Drive A fails - WRITE request arrives to the array. It is failed by drive A, so r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B succeeds in writing it, so the same r1_bio is marked as R1BIO_Uptodate. - r1_bio arrives to handle_write_finished, badblocks are disabled, md_error()->error() does nothing because we don't fail the last drive of raid1 - raid_end_bio_io() calls call_bio_endio() - As a result, in call_bio_endio(): if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) clear_bit(BIO_UPTODATE, &bio->bi_flags); this code doesn't clear the BIO_UPTODATE flag, and the whole master WRITE succeeds, back to the upper layer. So we returned success to the upper layer, even though we had written the data onto the rebuilding drive only. But when we want to read the data back, we would not read from the rebuilding drive, so this data is lost. [neilb - applied identical change to raid10 as well] This bug can result in lost data, so it is suitable for any -stable kernel. Cc: stable@vger.kernel.org Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | md: md_stop_writes() should always freeze recovery.NeilBrown2013-06-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __md_stop_writes() will currently sometimes freeze recovery. So any caller must be ready for that to happen, and indeed they are. However if __md_stop_writes() doesn't freeze_recovery, then a recovery could start before mddev_suspend() is called, which could be awkward. This can particularly cause problems or dm-raid. So change __md_stop_writes() to always freeze recovery. This is safe and more predicatable. Reported-by: Brassow Jonathan <jbrassow@redhat.com> Tested-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2013-06-126-124/+102
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer fixes from Jens Axboe: "Outside of bcache (which really isn't super big), these are all few-liners. There are a few important fixes in here: - Fix blk pm sleeping when holding the queue lock - A small collection of bcache fixes that have been done and tested since bcache was included in this merge window. - A fix for a raid5 regression introduced with the bio changes. - Two important fixes for mtip32xx, fixing an oops and potential data corruption (or hang) due to wrong bio iteration on stacked devices." * 'for-linus' of git://git.kernel.dk/linux-block: scatterlist: sg_set_buf() argument must be in linear mapping raid5: Initialize bi_vcnt pktcdvd: silence static checker warning block: remove refs to XD disks from documentation blkpm: avoid sleep when holding queue lock mtip32xx: Correctly handle bio->bi_idx != 0 conditions mtip32xx: Fix NULL pointer dereference during module unload bcache: Fix error handling in init code bcache: clarify free/available/unused space bcache: drop "select CLOSURES" bcache: Fix incompatible pointer type warning
| * | raid5: Initialize bi_vcntKent Overstreet2013-05-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch that converted raid5 to use bio_reset() forgot to initialize bi_vcnt. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: NeilBrown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Tested-by: Ilia Mirkin <imirkin@alum.mit.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | Merge branch 'bcache-for-upstream' of ↵Jens Axboe2013-05-155-124/+100
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://evilpiepirate.org/~kent/linux-bcache into for-linus Kent writes: Jens - couple more bcache patches. Bug fixes and a doc update.
| | * | bcache: Fix error handling in init codeKent Overstreet2013-05-154-121/+99
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This code appears to have rotted... fix various bugs and do some refactoring. Signed-off-by: Kent Overstreet <koverstreet@google.com>
| | * | bcache: drop "select CLOSURES"Paul Bolle2013-05-151-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Kconfig entry for BCACHE selects CLOSURES. But there's no Kconfig symbol CLOSURES. That symbol was used in development versions of bcache, but was removed when the closures code was no longer provided as a kernel library. It can safely be dropped. Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
| | * | bcache: Fix incompatible pointer type warningEmil Goode2013-05-151-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function pointer release in struct block_device_operations should point to functions declared as void. Sparse warnings: drivers/md/bcache/super.c:656:27: warning: incorrect type in initializer (different base types) drivers/md/bcache/super.c:656:27: expected void ( *release )( ... ) drivers/md/bcache/super.c:656:27: got int ( static [toplevel] *<noident> )( ... ) drivers/md/bcache/super.c:656:2: warning: initialization from incompatible pointer type [enabled by default] drivers/md/bcache/super.c:656:2: warning: (near initialization for ‘bcache_ops.release’) [enabled by default] Signed-off-by: Emil Goode <emilgoode@gmail.com> Signed-off-by: Kent Overstreet <koverstreet@google.com>
* | | | dm thin: fix metadata dev resize detectionAlasdair G Kergon2013-05-191-2/+2
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix detection of the need to resize the dm thin metadata device. The code incorrectly tried to extend the metadata device when it didn't need to due to a merging error with patch 24347e9 ("dm thin: detect metadata device resizing"). device-mapper: transaction manager: couldn't open metadata space map device-mapper: thin metadata: tm_open_with_sm failed device-mapper: thin: aborting transaction failed device-mapper: thin: switching pool to failure mode Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | dm cache: set config valueJoe Thornber2013-05-101-28/+31
| | | | | | | | | | | | | | | | | | | | | | | | Share configuration option processing code between the dm cache ctr and message functions. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | dm cache: move config fnsAlasdair G Kergon2013-05-101-17/+17
| | | | | | | | | | | | | | | | | | | | | Move process_config_option() in dm-cache-target.c to make the next patch more readable. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | dm thin: generate event when metadata threshold passedJoe Thornber2013-05-103-0/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Generate a dm event when the amount of remaining thin pool metadata space falls below a certain level. The threshold is taken to be a quarter of the size of the metadata device with a minimum threshold of 4MB. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | dm persistent metadata: add space map threshold callbackJoe Thornber2013-05-101-1/+76
| | | | | | | | | | | | | | | | | | | | | Add a threshold callback to dm persistent data space maps. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | dm persistent data: add threshold callback to space mapJoe Thornber2013-05-103-3/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a threshold callback function to the persistent data space map interface for a subsequent patch to use. dm-thin and dm-cache are interested in knowing when they're getting low on metadata or data blocks. This patch introduces a new method for registering a callback against a threshold. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | dm thin: detect metadata device resizingJoe Thornber2013-05-103-3/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow the dm thin pool metadata device to be extended. Whenever a pool is resumed, detect whether the size of the metadata device has increased, and if so, extend the metadata to use the new space. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | dm persistent data: support space map resizingJoe Thornber2013-05-101-6/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support extending a dm persistent data metadata space map. The extend itself is implemented by switching back to the boostrap allocator and pointing to the new space. The extra bitmap indexes are then allocated from the new space, and finally we switch back to the proper space map ops and tweak the reference counts. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
OpenPOWER on IntegriCloud