summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* dm: use unlocked variants of queue flag check/setJens Axboe2008-04-291-5/+3
| | | | | | | dm.c already provides mutual exclusion through ->map_lock. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2008-04-295-7/+7
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: Skip I/O merges when disabled block: add large command support block: replace sizeof(rq->cmd) with BLK_MAX_CDB ide: use blk_rq_init() to initialize the request block: use blk_rq_init() to initialize the request block: rename and export rq_init() block: no need to initialize rq->cmd with blk_get_request block: no need to initialize rq->cmd in prepare_flush_fn hook block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline block/elevator.c:elv_rq_merge_ok() mustn't be inline block: make queue flags non-atomic block: add dma alignment and padding support to blk_rq_map_kern unexport blk_max_pfn ps3disk: Remove superfluous cast block: make rq_init() do a full memset() relay: fix splice problem
| * block: no need to initialize rq->cmd with blk_get_requestFUJITA Tomonori2008-04-293-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_get_request initializes rq->cmd (rq_init does) so the users don't need to do that. The purpose of this patch is to remove sizeof(rq->cmd) and &rq->cmd, as a preparation for large command support, which changes rq->cmd from the static array to a pointer. sizeof(rq->cmd) will not make sense and &rq->cmd won't work. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
| * block: make queue flags non-atomicNick Piggin2008-04-292-3/+7
| | | | | | | | | | | | | | We can save some atomic ops in the IO path, if we clearly define the rules of how to modify the queue flags. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* | drivers: use non-racy method for proc entries creation (2)Denis V. Lunev2008-04-291-5/+1
|/ | | | | | | | | | | | | | | | | | | Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data be setup before gluing PDE to main tree. Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: Neil Brown <neilb@suse.de> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/md: use time_before, time_before_eq, etcJulia Lawall2008-04-281-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The functions time_before, time_before_eq, time_after, and time_after_eq are more robust for comparing jiffies against other values. A simplified version of the semantic patch making this change is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @ change_compare_np @ expression E; @@ ( - jiffies <= E + time_before_eq(jiffies,E) | - jiffies >= E + time_after_eq(jiffies,E) | - jiffies < E + time_before(jiffies,E) | - jiffies > E + time_after(jiffies,E) ) @ include depends on change_compare_np @ @@ #include <linux/jiffies.h> @ no_include depends on !include && change_compare_np @ @@ #include <linux/...> + #include <linux/jiffies.h> // </smpl> [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* raid: remove leading TAB on printk messagesNick Andrew2008-04-284-7/+8
| | | | | | | | | | | | | | | | MD drivers use one printk() call to print 2 log messages and the second line may be prefixed by a TAB character. It may also output a trailing space before newline. klogd (I think) turns the TAB character into the 2 characters '^I' when logging to a file. This looks ugly. Instead of a leading TAB to indicate continuation, prefix both output lines with 'raid:' or similar. Also remove any trailing space in the vicinity of the affected code and consistently end the sentences with a period. Signed-off-by: Nick Andrew <nick@nick-andrew.net> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: raid5.c convert simple_strtoul to strict_strtoulDan Williams2008-04-281-9/+5
| | | | | | | | | | strict_strtoul handles the open-coded sanity checks in raid5_store_stripe_cache_size and raid5_store_preread_threshold Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: introduce get_priority_stripe() to improve raid456 write performanceDan Williams2008-04-281-10/+112
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improve write performance by preventing the delayed_list from dumping all its stripes onto the handle_list in one shot. Delayed stripes are now further delayed by being held on the 'hold_list'. The 'hold_list' is bypassed when: * a STRIPE_IO_STARTED stripe is found at the head of 'handle_list' * 'handle_list' is empty and i/o is being done to satisfy full stripe-width write requests * 'bypass_count' is less than 'bypass_threshold'. By default the threshold is 1, i.e. every other stripe handled is a preread stripe provided the top two conditions are false. Benchmark data: System: 2x Xeon 5150, 4x SATA, mem=1GB Baseline: 2.6.24-rc7 Configuration: mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean Test1: dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 * patched: +33% (stripe_cache_size = 256), +25% (stripe_cache_size = 512) Test2: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 (XFS) * patched: +13% * patched + preread_bypass_threshold = 0: +37% Changes since v1: * reduce bypass_threshold from (chunk_size / sectors_per_chunk) to (1) and make it configurable. This defaults to fairness and modest performance gains out of the box. Changes since v2: * [neilb@suse.de]: kill STRIPE_PRIO_HI and preread_needed as they are not necessary, the important change was clearing STRIPE_DELAYED in add_stripe_bio and this has been moved out to make_request for the hang fix. * [neilb@suse.de]: simplify get_priority_stripe * [dan.j.williams@intel.com]: reset the bypass_count when ->hold_list is sampled empty (+11%) * [dan.j.williams@intel.com]: decrement the bypass_count at the detection of stripes being naturally promoted off of hold_list +2%. Note, resetting bypass_count instead of decrementing on these events yields +4% but that is probably too aggressive. Changes since v3: * cosmetic fixups Tested-by: James W. Laferriere <babydr@baby-dragons.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: replace remaining __FUNCTION__ occurrencesHarvey Harrison2008-04-282-25/+25
| | | | | | | | | __FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: fix integer as NULL pointer warnings in md.cHarvey Harrison2008-04-281-4/+4
| | | | | | | | | | | | drivers/md/md.c:734:16: warning: Using plain integer as NULL pointer drivers/md/md.c:1115:16: warning: Using plain integer as NULL pointer Add some braces to match the else-block as well. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dm: remove md argument from specific_minorFrederik Deweerdt2008-04-251-8/+6
| | | | | | | | | | | The small patch below: - Removes the unused md argument from both specific_minor() and next_free_minor() - Folds kmalloc + memset(0) into a single kzalloc call in alloc_dev() This has been compile tested on x86. Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: remove unused dm_create_error_tableAdrian Bunk2008-04-251-38/+0
| | | | | | | dm_create_error_table() was added in kernel 2.6.18 and never used... Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: drop void suspend_targets returnAdrian Bunk2008-04-251-2/+2
| | | | | | | | | | void returning functions returned the return value of another void returning function... Spotted by sparse. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: unplug queues in threadsMikulas Patocka2008-04-253-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O. It is specified that any submitted bio without BIO_RW_SYNC flag may plug the queue (i.e. block the requests from being dispatched to the physical device). The queue is unplugged when the caller calls blk_unplug() function. Usually, the sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs the queue and waits (to be possibly joined with other adjacent bios). Then, when the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to the disk. This was happenning: When doing O_SYNC writes, function fsync_buffers_list() submits a list of bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is woken up. fsync_buffers_list() calls wait_on_buffer(). That unplugs the queue, but there are no bios on the device queue as they are still in the dm_raid1 queue. wait_on_buffer() starts waiting until the IO is finished. kmirrord is scheduled, kmirrord takes bios and submits them to the devices. The submitted bio plugs the harddisk queue but there is no one to unplug it. (The process that called wait_on_buffer() is already sleeping.) So there is a 3ms timeout, after which the queues on the harddisks are unplugged and requests are processed. This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes), dm-raid1 is 10 times slower than md raid1. Every time we submit something asynchronously via dm_io, we must unplug the queue actually to send the request to the device. This patch adds an unplug call to kmirrord - while processing requests, it keeps the queue plugged (so that adjacent bios can be merged); when it finishes processing all the bios, it unplugs the queue to submit the bios. It also fixes kcopyd which has the same potential problem. All kcopyd requests are submitted with BIO_RW_SYNC. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Acked-by: Jens Axboe <jens.axboe@oracle.com>
* dm raid1: use timerMikulas Patocka2008-04-251-20/+28
| | | | | | | | | | | | This patch replaces the schedule() in the main kmirrord thread with a timer. The schedule() could introduce an unwanted delay when work is ready to be processed. The code instead calls wake() when there's work to be done immediately, and delayed_wake() after a failure to give a short delay before retrying. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: move include filesAlasdair G Kergon2008-04-2510-274/+11
| | | | | | Publish the dm-io, dm-log and dm-kcopyd headers in include/linux. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm kcopyd: renameAlasdair G Kergon2008-04-252-0/+0
| | | | | | Rename kcopyd.[ch] to dm-kcopyd.[ch]. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: expose macrosAlasdair G Kergon2008-04-251-89/+0
| | | | | | Make dm.h macros and inlines available in include/linux/device-mapper.h Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm kcopyd: remove redundant client countingMikulas Patocka2008-04-253-98/+29
| | | | | | | | | | | | Remove client counting code that is no longer needed. Initialization and destruction is made globally from dm_init and dm_exit and is not based on client counts. Initialization allocates only one empty slab cache, so there is no negative impact from performing the initialization always, regardless of whether some client uses kcopyd or not. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm kcopyd: private mempoolMikulas Patocka2008-04-251-13/+19
| | | | | | | | Change the global mempool in kcopyd into a per-device mempool to avoid deadlock possibilities. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm kcopyd: per deviceMikulas Patocka2008-04-251-59/+73
| | | | | | | | Make one kcopyd thread per device. The original shared kcopyd could deadlock. Configuration:
* dm log: make module use tracking internalJonathan Brassow2008-04-252-31/+94
| | | | | | | Remove internal module reference fields from the interface. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm log: move register functionsAlasdair G Kergon2008-04-251-26/+26
| | | | | | Reorder a couple of functions in the file so the next patch is readable. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm log: clean interfaceHeinz Mauelshagen2008-04-254-103/+112
| | | | | | | Clean up the dm-log interface to prepare for publishing it in include/linux. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm kcopyd: clean interfaceHeinz Mauelshagen2008-04-255-58/+62
| | | | | | | Clean up the kcopyd interface to prepare for publishing it in include/linux. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm io: clean interfaceHeinz Mauelshagen2008-04-258-25/+33
| | | | | | | Clean up the dm-io interface to prepare for publishing it in include/linux. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm io: rename error to error_bitsAlasdair G Kergon2008-04-251-7/+7
| | | | | | Rename 'error' to 'error_bits' for clarity. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm snapshot: store pointer to target instanceMikulas Patocka2008-04-252-4/+4
| | | | | | | Save pointer to dm_target in dm_snapshot structure. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm log: move dirty region log code into separate moduleHeinz Mauelshagen2008-04-253-12/+11
| | | | | | | | Move the dirty region log code into a separate module so other targets can share the code. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm log: generalise name in messagesHeinz Mauelshagen2008-04-251-7/+8
| | | | | | | | Change dm-log.c messages from "mirror log" to "dirty region log" as a new dm target wants to share this code. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm raid1: use list_split_initRobert P. J. Day2008-04-251-6/+4
| | | | | | | Use shorter list_splice_init() for brevity. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm snapshot: reduce default memory allocationMilan Broz2008-04-251-2/+2
| | | | | | | | | Limit the amount of memory allocated per snapshot on systems with a large page size. (The larger default chunk size on these systems compensates for the smaller number of pages reserved.) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm snapshot: fix chunksize sector conversionMikulas Patocka2008-04-251-1/+1
| | | | | | | | | | | If a snapshot has a smaller chunksize than the page size the conversion to pages currently returns 0 instead of 1, causing: kernel BUG in mempool_resize. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
* RAID: remove trailing space from printk lineNick Andrew2008-04-211-1/+1
| | | | | | | | drivers/md/*.[ch] contains only one more printk line with a trailing space. Remove it. Signed-off-by: Nick Andrew <nick@nick-andrew.net> Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
* md: close a livelock window in handle_parity_checks5Dan Williams2008-04-111-22/+29
| | | | | | | | | | | | | | | | | | If a failure is detected after a parity check operation has been initiated, but before it completes handle_parity_checks5 will never quiesce operations on the stripe. Explicitly handle this case by "canceling" the parity check, i.e. clear the STRIPE_OP_CHECK flags and queue the stripe on the handle list again to refresh any non-uptodate blocks. Kernel versions >= 2.6.23 are susceptible. Cc: <stable@kernel.org> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dm io: write error bits form long not intAlasdair G Kergon2008-03-285-11/+11
| | | | | | | | | | | | write_err is an unsigned long used with set_bit() so should not be passed around as unsigned int. http://bugzilla.kernel.org/show_bug.cgi?id=10271 Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dm crypt: fix ctx pendingMilan Broz2008-03-281-28/+30
| | | | | | | | | | | | | | | | | | | | | | | Fix regression in dm-crypt introduced in commit 3a7f6c990ad04e6f576a159876c602d14d6f7fef ("dm crypt: use async crypto"). If write requests need to be split into pieces, the code must not process them in parallel because the crypto context cannot be shared. So there can be parallel crypto operations on one part of the write, but only one write bio can be processed at a time. This is not optimal and the workqueue code needs to be optimized for parallel processing, but for now it solves the problem without affecting the performance of synchronous crypto operation (most of current dm-crypt users). http://bugzilla.kernel.org/show_bug.cgi?id=10242 http://bugzilla.kernel.org/show_bug.cgi?id=10207 Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/md/raid5.c: fix printk warningsAndrew Morton2008-03-191-3/+3
| | | | | | | | | | | | | | | | | gcc-3.4.5 on sparc64: drivers/md/raid5.c: In function `raid5_end_read_request': drivers/md/raid5.c:1147: warning: long long unsigned int format, long unsigned int arg (arg 4) drivers/md/raid5.c:1164: warning: long long unsigned int format, long unsigned int arg (arg 3) drivers/md/raid5.c:1170: warning: long long unsigned int format, long unsigned int arg (arg 3) sector_t is u64, and we don't know what type the architecture uses to implement u64 (on some it is unsigned long). Cc: Neil Brown <neilb@suse.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: remove the 'super' sysfs attribute from devices in an 'md' arrayNeilBrown2008-03-191-12/+0
| | | | | | | | | | | | | | | | Exposing the binary blob which is the md 'super-block' via sysfs doesn't really fit with the whole sysfs model, and ever since commit 8118a859dc7abd873193986c77a8d9bdb877adc8 ("sysfs: fix off-by-one error in fill_read_buffer()") it doesn't actually work at all (as the size of the blob is often one page). (akpm: as in, fs/sysfs/file.c:fill_read_buffer() goes BUG) So just remove it altogether. It isn't really useful. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: reduce CPU wastage on idle md array with a write-intent bitmapNeilBrown2008-03-101-1/+3
| | | | | | | | | | | | | | | | Recent patch titled Reduce CPU wastage on idle md array with a write-intent bitmap. would sometimes leave the array with dirty bitmap bits that stay dirty. A subsequent write would sort things out so it isn't a big problem, but should be fixed nonetheless. We need to make sure that when the bitmap becomes not "allclean", the daemon_sleep really does get set to a sensible value. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: fix formatting error in /proc/mdstatNeilBrown2008-03-101-1/+1
| | | | | | | | | | | | | | | | | | | | If an md array is "auto-read-only", then this appears in /proc/mdstat as /dev/md0: active(auto-read-only) whereas if it is truely readonly, it appears as /dev/md0: active (read-only) The difference being a space. One program known to parse this file expects the space and gets badly confused. It will be fixed, but it would be best if what the kernel generates is more consistent too. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: the md RAID10 resync thread could cause a md RAID10 array deadlockK.Tanaka2008-03-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This message describes another issue about md RAID10 found by testing the 2.6.24 md RAID10 using new scsi fault injection framework. Abstract: When a scsi error results in disabling a disk during RAID10 recovery, the resync threads of md RAID10 could stall. This case, the raid array has already been broken and it may not matter. But I think stall is not preferable. If it occurs, even shutdown or reboot will fail because of resource busy. The deadlock mechanism: The r10bio_s structure has a "remaining" member to keep track of BIOs yet to be handled when recovering. The "remaining" counter is incremented when building a BIO in sync_request() and is decremented when finish a BIO in end_sync_write(). If building a BIO fails for some reasons in sync_request(), the "remaining" should be decremented if it has already been incremented. I found a case where this decrement is forgotten. This causes a md_do_sync() deadlock because md_do_sync() waits for md_done_sync() called by end_sync_write(), but end_sync_write() never calls md_done_sync() because of the "remaining" counter mismatch. For example, this problem would be reproduced in the following case: Personalities : [raid10] md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F) 3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_] [>....................] recovery = 2.2% (45376/1959808) finish=0.7min speed=45376K/sec This case, sdf1 is recovering, sdb1 and sde1 are disabled. An additional error with detaching sdd will cause a deadlock. md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F) 3919616 blocks 64K chunks 2 near-copies [4/1] [_U__] [=>...................] recovery = 5.0% (99520/1959808) finish=5.9min speed=5237K/sec 2739 ? S< 0:17 [md0_raid10] 28608 ? D< 0:00 [md0_resync] 28629 pts/1 Ss 0:00 bash 28830 pts/1 R+ 0:00 ps ax 31819 ? D< 0:00 [kjournald] The resync thread keeps working, but actually it is deadlocked. Patch: By this patch, the remaining counter will be decremented if needed. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: fix possible raid1/raid10 deadlock on read error during resyncNeilBrown2008-03-042-4/+18
| | | | | | | | | | | | | | | | | Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for another possible deadlock in raid1/raid10 error handing. If a read request returns an error while a resync is happening and a resync request is pending, the attempt to fix the error will block until the resync progresses, and the resync will block until the read request completes. Thus a deadlock. This patch fixes the problem. Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: don't attempt read-balancing for raid10 'far' layoutsKeld Simonsen2008-03-041-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes the disk to be read for layout "far > 1" to always be the disk with the lowest block address. Thus the chunks to be read will always be (for a fully functioning array) from the first band of stripes, and the raid will then work as a raid0 consisting of the first band of stripes. Some advantages: The fastest part which is the outer sectors of the disks involved will be used. The outer blocks of a disk may be as much as 100 % faster than the inner blocks. Average seek time will be smaller, as seeks will always be confined to the first part of the disks. Mixed disks with different performance characteristics will work better, as they will work as raid0, the sequential read rate will be number of disks involved times the IO rate of the slowest disk. If a disk is malfunctioning, the first disk which is working, and has the lowest block address for the logical block will be used. Signed-off-by: Keld Simonsen <keld@dkuug.dk> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: lock access to rdev attributes properlyNeilBrown2008-03-041-9/+26
| | | | | | | | | | | | | | | | | | | | When we access attributes of an rdev (component device on an md array) through sysfs, we really need to lock the array against concurrent changes. We currently do that when we change an attribute, but not when we read an attribute. We need to lock when reading as well else rdev->mddev could become NULL while we are accessing it. So add appropriate locking (mddev_lock) to rdev_attr_show. rdev_size_store requires some extra care as well as it needs to unlock the mddev while scanning other mddevs for overlapping regions. We currently assume that rdev->mddev will still be unchanged after the scan, but that cannot be certain. So take a copy of rdev->mddev for use at the end of the function. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: make sure a reshape is started when device switches to read-writeNeilBrown2008-03-041-0/+1
| | | | | | | | | | | | | | | | | | A resync/reshape/recovery thread will refuse to progress when the array is marked read-only. So whenever it mark it not read-only, it is important to wake up thread resync thread. There is one place we didn't do this. The problem manifests if the start_ro module parameters is set, and a raid5 array that is in the middle of a reshape (restripe) is started. The array will initially be semi-read-only (meaning it acts like it is readonly until the first write). So the reshape will not proceed. On the first write, the array will become read-write, but the reshape will not be started, and there is no event which will ever restart that thread. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: clean up irregularity with raid autodetectNeilBrown2008-03-041-1/+3
| | | | | | | | | | | When a raid1 array is stopped, all components currently get added to the list for auto-detection. However we should really only add components that were found by autodetection in the first place. So add a flag to record that information, and use it. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: guard against possible bad array geometry in v1 metadataNeilBrown2008-03-041-2/+6
| | | | | | | | | Make sure the data doesn't start before the end of the superblock when the superblock is at the start of the device. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* md: reduce CPU wastage on idle md array with a write-intent bitmapNeilBrown2008-03-041-2/+17
| | | | | | | | | | | | | On an md array with a write-intent bitmap, a thread wakes up every few seconds and scans the bitmap looking for work to do. If the array is idle, there will be no work to do, but a lot of scanning is done to discover this. So cache the fact that the bitmap is completely clean, and avoid scanning the whole bitmap when the cache is known to be clean. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud