op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	dm exception store: move cow pointer	Jonathan Brassow	2009-04-02	6	-23/+30
\| \| \| \| \| \| \|	Move COW device from snapshot to exception store. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm exception store: move chunk_fields	Jonathan Brassow	2009-04-02	6	-53/+67
\| \| \| \| \| \| \|	Move chunk fields from snapshot to exception store. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm exception store: move dm_target pointer	Jonathan Brassow	2009-04-02	4	-7/+7
\| \| \| \| \| \| \|	Move target pointer from snapshot to exception store. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm exception store: introduce registry	Jonathan Brassow	2009-04-02	6	-49/+307
\| \| \| \| \| \| \|	Move exception stores into a registry. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm raid1: add is_remote_recovering hook for clusters	Jonathan Brassow	2009-04-02	1	-2/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The logging API needs an extra function to make cluster mirroring possible. This new function allows us to check whether a mirror region is being recovered on another machine in the cluster. This helps us prevent simultaneous recovery I/O and process I/O to the same locations on disk. Cluster-aware log modules will implement this function. Single machine log modules will not. So, there is no performance penalty for single machine mirrors. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm exception store: separate type from instance	Jonathan Brassow	2009-04-02	4	-25/+35
\| \| \| \| \| \| \|	Introduce struct dm_exception_store_type. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm log: remove struct dm_dirty_log_internal	Mike Snitzer	2009-04-02	1	-44/+14
\| \| \| \| \| \| \| \| \| \|	Remove the 'dm_dirty_log_internal' structure. The resulting cleanup eliminates extra memory allocations. Therefore exposing the internal list_head to the external 'dm_dirty_log_type' structure is a worthwhile compromise. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm log: use standard kernel module refcount	Mike Snitzer	2009-04-02	1	-16/+3
\| \| \| \| \| \| \| \| \|	Avoid private module usage accounting by removing 'use' from dm_dirty_log_internal. The standard module reference counting is sufficient. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm crypt: use kzfree	Johannes Weiner	2009-04-02	1	-4/+2
\| \| \| \| \| \| \| \| \|	Use kzfree() instead of memset() + kfree(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm target: remove struct tt_internal	Cheng Renquan	2009-04-02	2	-61/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The tt_internal is really just a list_head to manage registered target_type in a double linked list, Here embed the list_head into target_type directly, 1. to avoid kmalloc/kfree; 2. then tt_internal is really unneeded; Cc: stable@kernel.org Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com>
*	dm table: fix upgrade mode race	Alasdair G Kergon	2009-04-02	1	-12/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	upgrade_mode() sets bdev to NULL temporarily, and does not have any locking to exclude anything from seeing that NULL. In dm_table_any_congested() bdev_get_queue() can dereference that NULL and cause a reported oops. Fix this by not changing that field during the mode upgrade. Cc: stable@kernel.org Cc: Neil Brown <neilb@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm: path selector use module refcount directly	Jun'ichi Nomura	2009-04-02	1	-18/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix refcount corruption in dm-path-selector Refcounting with non-atomic ops under shared lock will corrupt the counter in multi-processor system and may trigger BUG_ON(). Use module refcount. # same approach as dm-target-use-module-refcount-directly.patch here # https://www.redhat.com/archives/dm-devel/2008-December/msg00075.html Typical oops: kernel BUG at linux-2.6.29-rc3/drivers/md/dm-path-selector.c:90! Pid: 11148, comm: dmsetup Not tainted 2.6.29-rc3-nm #1 dm_put_path_selector+0x4d/0x61 [dm_multipath] Call Trace: [<ffffffffa031d3f9>] free_priority_group+0x33/0xb3 [dm_multipath] [<ffffffffa031d4aa>] free_multipath+0x31/0x67 [dm_multipath] [<ffffffffa031d50d>] multipath_dtr+0x2d/0x32 [dm_multipath] [<ffffffffa015d6c2>] dm_table_destroy+0x64/0xd8 [dm_mod] [<ffffffffa015b73a>] __unbind+0x46/0x4b [dm_mod] [<ffffffffa015b79f>] dm_swap_table+0x60/0x14d [dm_mod] [<ffffffffa015f963>] dev_suspend+0xfd/0x177 [dm_mod] [<ffffffffa0160250>] dm_ctl_ioctl+0x24c/0x29c [dm_mod] [<ffffffff80288cd3>] ? get_page_from_freelist+0x49c/0x61d [<ffffffffa015f866>] ? dev_suspend+0x0/0x177 [dm_mod] [<ffffffff802bf05c>] vfs_ioctl+0x2a/0x77 [<ffffffff802bf4f1>] do_vfs_ioctl+0x448/0x4a0 [<ffffffff802bf5a0>] sys_ioctl+0x57/0x7a [<ffffffff8020c05b>] system_call_fastpath+0x16/0x1b Cc: stable@kernel.org Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm target: use module refcount directly	Cheng Renquan	2009-04-02	1	-17/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The tt_internal's 'use' field is superfluous: the module's refcount can do the work properly. An acceptable side-effect is that this increases the reference counts reported by 'lsmod'. Remove the superfluous test when removing a target module. [Crash possible without this on SMP - agk] Cc: stable@kernel.org Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
*	dm snapshot: avoid having two exceptions for the same chunk	Mikulas Patocka	2009-04-02	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We need to check if the exception was completed after dropping the lock. After regaining the lock, __find_pending_exception checks if the exception was already placed into &s->pending hash. But we don't check if the exception was already completed and placed into &s->complete hash. If the process waiting in alloc_pending_exception was delayed at this point because of a scheduling latency and the exception was meanwhile completed, we'd miss that and allocate another pending exception for already completed chunk. It would lead to a situation where two records for the same chunk exist and potential data corruption because multiple snapshot I/Os to the affected chunk could be redirected to different locations in the snapshot. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm snapshot: avoid dropping lock in __find_pending_exception	Mikulas Patocka	2009-04-02	1	-18/+24
\| \| \| \| \| \| \| \| \|	It is uncommon and bug-prone to drop a lock in a function that is called with the lock held, so this is moved to the caller. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm snapshot: refactor __find_pending_exception	Mikulas Patocka	2009-04-02	1	-24/+28
\| \| \| \| \| \| \| \| \|	Move looking-up of a pending exception from __find_pending_exception to another function. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm io: make sync_io uninterruptible	Mikulas Patocka	2009-04-02	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If someone sends signal to a process performing synchronous dm-io call, the kernel may crash. The function sync_io attempts to exit with -EINTR if it has pending signal, however the structure "io" is allocated on stack, so already submitted io requests end up touching unallocated stack space and corrupting kernel memory. sync_io sets its state to TASK_UNINTERRUPTIBLE, so the signal can't break out of io_schedule() --- however, if the signal was pending before sync_io entered while (1) loop, the corruption of kernel memory will happen. There is no way to cancel in-progress IOs, so the best solution is to ignore signals at this point. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm raid1: switch read_record from kmalloc to slab to save memory	Mikulas Patocka	2009-04-02	1	-4/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With my previous patch to save bi_io_vec, the size of dm_raid1_read_record is significantly increased (the vector list takes 3072 bytes on 32-bit machines and 4096 bytes on 64-bit machines). The structure dm_raid1_read_record used to be allocated with kmalloc, but kmalloc aligns the size on the next power-of-two so an object slightly greater than 4096 will allocate 8192 bytes of memory and half of that memory will be wasted. This patch turns kmalloc into a slab cache which doesn't have this padding so it will reduce the memory consumed. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm: preserve bi_io_vec when resubmitting bios	Mikulas Patocka	2009-04-02	1	-0/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Device mapper saves and restores various fields in the bio, but it doesn't save bi_io_vec. If the device driver modifies this after a partially successful request, dm-raid1 and dm-multipath may attempt to resubmit a bio that has bi_size inconsistent with the size of vector. To make requests resubmittable in dm-raid1 and dm-multipath, we must save and restore the bio vector as well. To reduce the memory overhead involved in this, we do not save the pages in a vector and use a 16-bit field size if the page size is less than 65536. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm crypt: wait for endio to complete before destruction	Milan Broz	2009-03-16	2	-20/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The following oops has been reported when dm-crypt runs over a loop device. ... [ 70.381058] Process loop0 (pid: 4268, ti=cf3b2000 task=cf1cc1f0 task.ti=cf3b2000) ... [ 70.381058] Call Trace: [ 70.381058] [<d0d76601>] ? crypt_dec_pending+0x5e/0x62 [dm_crypt] [ 70.381058] [<d0d767b8>] ? crypt_endio+0xa2/0xaa [dm_crypt] [ 70.381058] [<d0d76716>] ? crypt_endio+0x0/0xaa [dm_crypt] [ 70.381058] [<c01a2f24>] ? bio_endio+0x2b/0x2e [ 70.381058] [<d0806530>] ? dec_pending+0x224/0x23b [dm_mod] [ 70.381058] [<d08066e4>] ? clone_endio+0x79/0xa4 [dm_mod] [ 70.381058] [<d080666b>] ? clone_endio+0x0/0xa4 [dm_mod] [ 70.381058] [<c01a2f24>] ? bio_endio+0x2b/0x2e [ 70.381058] [<c02bad86>] ? loop_thread+0x380/0x3b7 [ 70.381058] [<c02ba8a1>] ? do_lo_send_aops+0x0/0x165 [ 70.381058] [<c013754f>] ? autoremove_wake_function+0x0/0x33 [ 70.381058] [<c02baa06>] ? loop_thread+0x0/0x3b7 When a table is being replaced, it waits for I/O to complete before destroying the mempool, but the endio function doesn't call mempool_free() until after completing the bio. Fix it by swapping the order of those two operations. The same problem occurs in dm.c with md referenced after dec_pending. Again, we swap the order. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm crypt: fix kcryptd_async_done parameter	Huang Ying	2009-03-16	1	-5/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the async encryption-complete function (kcryptd_async_done), the crypto_async_request passed in may be different from the one passed to crypto_ablkcipher_encrypt/decrypt. Only crypto_async_request->data is guaranteed to be same as the one passed in. The current kcryptd_async_done uses the passed-in crypto_async_request directly which may cause the AES-NI-based AES algorithm implementation to panic. This patch fixes this bug by only using crypto_async_request->data, which points to dm_crypt_request, the crypto_async_request passed in. The original data (convert_context) is gotten from dm_crypt_request. [mbroz@redhat.com: reworked] Cc: stable@kernel.org Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm io: respect BIO_MAX_PAGES limit	Mikulas Patocka	2009-03-16	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	dm-io calls bio_get_nr_vecs to get the maximum number of pages to use for a given device. It allocates one additional bio_vec to use internally but failed to respect BIO_MAX_PAGES, so fix this. This was the likely cause of: https://bugzilla.redhat.com/show_bug.cgi?id=173153 Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm table: rework reference counting fix	Mikulas Patocka	2009-03-16	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix an error introduced in dm-table-rework-reference-counting.patch. When there is failure after table initialization, we need to use dm_table_destroy, not dm_table_put, to free the table. dm_table_put may be used only after dm_table_get. Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	dm ioctl: validate name length when renaming	Milan Broz	2009-03-16	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When renaming a mapped device validate the length of the new name. The rename ioctl accepted any correctly-terminated string enclosed within the data passed from userspace. The other ioctls enforce a size limit of DM_NAME_LEN. If the name is changed and becomes longer than that, the device can no longer be addressed by name. Fix it by properly checking for device name length (including terminating zero). Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
*	md: fix deadlock when stopping arrays	Dan Williams	2009-03-04	1	-12/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Resolve a deadlock when stopping redundant arrays, i.e. ones that require a call to sysfs_remove_group when shutdown. The deadlock is summarized below: Thread1 Thread2 ------- ------- read sysfs attribute stop array take mddev lock sysfs_remove_group sysfs_get_active wait for mddev lock wait for active Sysrq-w: -------- mdmon S 00000017 2212 4163 1 f1982ea8 00000046 2dcf6b85 00000017 c0b23100 f2f83ed0 c0b23100 f2f8413c c0b23100 c0b23100 c0b1fb98 f2f8413c 00000000 f2f8413c c0b23100 f2291ecc 00000002 c0b23100 00000000 00000017 f2f83ed0 f1982eac 00000046 c044d9dd Call Trace: [<c044d9dd>] ? debug_mutex_add_waiter+0x1d/0x58 [<c06ef451>] __mutex_lock_common+0x1d9/0x338 [<c06ef451>] ? __mutex_lock_common+0x1d9/0x338 [<c06ef5e3>] mutex_lock_interruptible_nested+0x33/0x3a [<c0634553>] ? mddev_lock+0x14/0x16 [<c0634553>] mddev_lock+0x14/0x16 [<c0634eda>] md_attr_show+0x2a/0x49 [<c04e9997>] sysfs_read_file+0x93/0xf9 mdadm D 00000017 2812 4177 1 f0401d78 00000046 430456f8 00000017 f0401d58 f0401d20 c0b23100 f2da2c4c c0b23100 c0b23100 c0b1fb98 f2da2c4c 0a10fc36 00000000 c0b23100 f0401d70 00000003 c0b23100 00000000 00000017 f2da29e0 00000001 00000002 00000000 Call Trace: [<c06eed1b>] schedule_timeout+0x1b/0x95 [<c06eed1b>] ? schedule_timeout+0x1b/0x95 [<c06eeb97>] ? wait_for_common+0x34/0xdc [<c044fa8a>] ? trace_hardirqs_on_caller+0x18/0x145 [<c044fbc2>] ? trace_hardirqs_on+0xb/0xd [<c06eec03>] wait_for_common+0xa0/0xdc [<c0428c7c>] ? default_wake_function+0x0/0x12 [<c06eeccc>] wait_for_completion+0x17/0x19 [<c04ea620>] sysfs_addrm_finish+0x19f/0x1d1 [<c04e920e>] sysfs_hash_and_remove+0x42/0x55 [<c04eb4db>] sysfs_remove_group+0x57/0x86 [<c0638086>] do_md_stop+0x13a/0x499 This has been there for a while, but is easier to trigger now that mdmon is closely watching sysfs. Cc: <stable@kernel.org> Reported-by: Jacek Danecki <jacek.danecki@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
*	md: avoid races when stopping resync.	NeilBrown	2009-02-25	2	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There has been a race in raid10 and raid1 for a long time which has only recently started showing up due to a scheduler changed. When a sync_read request finishes, as soon as reschedule_retry is called, another thread can mark the resync request as having completed, so md_do_sync can finish, ->stop can be called, and ->conf can be freed. So using conf after reschedule_retry is not safe. Similarly, when finishing a sync_write, calling md_done_sync must be the last thing we do, as it allows a chain of events which will free conf and other data structures. The first of these requires action in raid10.c The second requires action in raid1.c and raid10.c Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery.	NeilBrown	2009-02-25	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For raid1/4/5/6, resync (fixing inconsistencies between devices) is very similar to recovery (rebuilding a failed device onto a spare). The both walk through the device addresses in order. For raid10 it can be quite different. resync follows the 'array' address, and makes sure all copies are the same. Recover walks through 'device' addresses and recreates each missing block. The 'bitmap_cond_end_sync' function allows the write-intent-bitmap (When present) to be updated to reflect a partially completed resync. It makes assumptions which mean that it does not work correctly for raid10 recovery at all. In particularly, it can cause bitmap-directed recovery of a raid10 to not recovery some of the blocks that need to be recovered. So move the call to bitmap_cond_end_sync into the resync path, rather than being in the common "resync or recovery" path. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
*	md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery.	NeilBrown	2009-02-25	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When doing recovery on a raid10 with a write-intent bitmap, we only need to recovery chunks that are flagged in the bitmap. However if we choose to skip a chunk as it isn't flag, the code currently skips the whole raid10-chunk, thus it might not recovery some blocks that need recovering. This patch fixes it. In case that is confusing, it might help to understand that there is a 'raid10 chunk size' which guides how data is distributed across the devices, and a 'bitmap chunk size' which says how much data corresponds to a single bit in the bitmap. This bug only affects cases where the bitmap chunk size is smaller than the raid10 chunk size. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
*	block: fix bad definition of BIO_RW_SYNC	Jens Axboe	2009-02-18	3	-4/+4
\| \| \| \| \| \| \| \|	We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before 213d9417fec62ef4c3675621b9364a667954d4dd. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	md: Ensure an md array never has too many devices.	NeilBrown	2009-02-06	1	-10/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Each different metadata format supported by md supports a different maximum number of devices. We really should be enforcing this maximum in the kernel, but we aren't quite doing that properly. We currently only enforce it at the 'hot_add' point, which is an older interface which is not used by current userspace. We need to also enforce it at 'add_new_disk' time for active arrays and at 'do_md_run' time when starting a new array. So move the test from 'hot_add' into 'bind_rdev_to_array' which is called from both 'hot_add' and 'add_new_disk, and add a new test in 'analyse_sbs' which is called from 'do_md_run'. This bug (or missing feature) has been around "forever" and so the patch is suitable for any -stable that is currently maintained. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
*	md: Fix a bug in linear.c causing which_dev() to return the wrong device.	Andre Noll	2009-02-06	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ab5bd5cbc8d4b868378d062eed3d4240930fbb86 introduced the following bug in linear software raid for large arrays on 32 bit machines: which_dev() computes the device holding a given sector by shifting down the sector number to a 32 bit range, dividing by the array spacing and looking up the resulting index in the hash table of the array. Because the computed index might be slightly too small, a loop at the end of which_dev() increases the index until the given sector actually falls into the range of the device associated with that index. The changes of the above mentioned commit caused this loop to check whether the _index_ rather than the sector number is small enough, effectively bypassing the loop and thus possibly returning the wrong device. As reported by Simon Kirby, this leads to errors such as linear_make_request: Sector 2340486136 out of bounds on dev sdi: 156301312 sectors, offset 2109870464 Fix this bug by introducing a local variable for the index so that the variable containing the passed sector is left unchanged. Cc: stable@kernel.org Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: Allow read error in a single drive raid1 to be passed up.	NeilBrown	2009-02-06	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a raid1 only has a single working device and gets a read error, we choose to simply return that error up to the filesystem (or whatever) rather than failing the whole array. However the codes doesn't quite do that. We attempt a readbalance which allocates the same drive, so we retry the read - indefinitely. Instead: If read_balance in the error case chooses the same drive that just failed, treat it as a failure and don't retry. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: don't retry recovery of raid1 that fails due to error on source drive.	NeilBrown	2009-01-09	2	-3/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a raid1 has only one working drive and it has a sector which gives an error on read, then an attempt to recover onto a spare will fail, but as the single remaining drive is not removed from the array, the recovery will be immediately re-attempted, resulting in an infinite recovery loop. So detect this situation and don't retry recovery once an error on the lone remaining drive is detected. Allow recovery to be retried once every time a spare is added in case the problem wasn't actually a media error. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: Allow md devices to be created by name.	NeilBrown	2009-01-09	1	-18/+101
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Using sequential numbers to identify md devices is somewhat artificial. Using names can be a lot more user-friendly. Also, creating md devices by opening the device special file is a bit awkward. So this patch provides a new option for creating and naming devices. Writing a name such as "md_home" to /sys/modules/md_mod/parameters/new_array will cause an array with that name to be created. It will appear in /sys/block/ /proc/partitions and /proc/mdstat as 'md_home'. It will have an arbitrary minor number allocated. md devices that a created by an open are destroyed on the last close when the device is inactive. For named md devices, they will not be destroyed until the array is explicitly stopped, either with the STOP_ARRAY ioctl or by writing 'clear' to /sys/block/md_XXXX/md/array_state. The name of the array must start 'md_' to avoid conflict with other devices. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: make devices disappear when they are no longer needed.	NeilBrown	2009-01-09	1	-12/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently md devices, once created, never disappear until the module is unloaded. This is essentially because the gendisk holds a reference to the mddev, and the mddev holds a reference to the gendisk, this a circular reference. If we drop the reference from mddev to gendisk, then we need to ensure that the mddev is destroyed when the gendisk is destroyed. However it is not possible to hook into the gendisk destruction process to enable this. So we drop the reference from the gendisk to the mddev and destroy the gendisk when the mddev gets destroyed. However this has a complication. Between the call __blkdev_get->get_gendisk->kobj_lookup->md_probe and the call __blkdev_get->md_open there is no obvious way to hold a reference on the mddev any more, so unless something is done, it will disappear and gendisk will be destroyed prematurely. Also, once we decide to destroy the mddev, there will be an unlockable moment before the gendisk is unlinked (blk_unregister_region) during which a new reference to the gendisk can be created. We need to ensure that this reference can not be used. i.e. the ->open must fail. So: 1/ in md_probe we set a flag in the mddev (hold_active) which indicates that the array should be treated as active, even though there are no references, and no appearance of activity. This is cleared by md_release when the device is closed if it is no longer needed. This ensures that the gendisk will survive between md_probe and md_open. 2/ In md_open we check if the mddev we expect to open matches the gendisk that we did open. If there is a mismatch we return -ERESTARTSYS and modify __blkdev_get to retry from the top in that case. In the -ERESTARTSYS sys case we make sure to wait until the old gendisk (that we succeeded in opening) is really gone so we loop at most once. Some udev configurations will always open an md device when it first appears. If we allow an md device that was just created by an open to disappear on an immediate close, then this can race with such udev configurations and result in an infinite loop the device being opened and closed, then re-open due to the 'ADD' even from the first open, and then close and so on. So we make sure an md device, once created by an open, remains active at least until some md 'ioctl' has been made on it. This means that all normal usage of md devices will allow them to disappear promptly when not needed, but the worst that an incorrect usage will do it cause an inactive md device to be left in existence (it can easily be removed). As an array can be stopped by writing to a sysfs attribute echo clear > /sys/block/mdXXX/md/array_state we need to use scheduled work for deleting the gendisk and other kobjects. This allows us to wait for any pending gendisk deletion to complete by simply calling flush_scheduled_work(). Signed-off-by: NeilBrown <neilb@suse.de>
*	md: centralise all freeing of an 'mddev' in 'md_free'	NeilBrown	2009-01-09	1	-9/+11
\| \| \| \| \| \| \| \| \|	md_free is the .release handler for the md kobj_type. So it makes sense to release all the objects referenced by the mddev in there, rather than just prior to calling kobject_put for what we think is the last time. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: move allocation of ->queue from mddev_find to md_probe	NeilBrown	2009-01-09	1	-11/+17
\| \| \| \| \| \| \| \| \| \|	It is more balanced to just do simple initialisation in mddev_find, which allocates and links a new md device, and leave all the more sophisticated allocation to md_probe (which calls mddev_find). md_probe already allocated the gendisk. It should allocate the queue too. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: need another print_sb for mdp_superblock_1	Cheng Renquan	2009-01-09	1	-6/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	md_print_devices is called in two code path: MD_BUG(...), and md_ioctl with PRINT_RAID_DEBUG. it will dump out all in use md devices information; However, it wrongly processed two types of superblock in one: The header file <linux/raid/md_p.h> has defined two types of superblock, struct mdp_superblock_s (typedefed with mdp_super_t) according to md with metadata 0.90, and struct mdp_superblock_1 according to md with metadata 1.0 and later, These two types of superblock are very different, The md_print_devices code processed them both in mdp_super_t, that would lead to wrong informaton dump like: [ 6742.345877] [ 6742.345887] md: ********************************** [ 6742.345890] md: * <COMPLETE RAID STATE PRINTOUT> * [ 6742.345892] md: ******************************** [ 6742.345896] md1: <ram7><ram6><ram5><ram4> [ 6742.345907] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3 [ 6742.345909] md: rdev superblock: [ 6742.345914] md: SB: (V:0.90.0) ID:<42ef13c7.598c059a.5f9f1645.801e9ee6> CT:4919856d [ 6742.345918] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536 [ 6742.345922] md: UT:4919856d ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:b7992907 E:00000001 [ 6742.345924] D 0: DISK<N:0,(1,8),R:0,S:6> [ 6742.345930] D 1: DISK<N:1,(1,10),R:1,S:6> [ 6742.345933] D 2: DISK<N:2,(1,12),R:2,S:6> [ 6742.345937] D 3: DISK<N:3,(1,14),R:3,S:6> [ 6742.345942] md: THIS: DISK<N:3,(1,14),R:3,S:6> ... [ 6742.346058] md0: <ram3><ram2><ram1><ram0> [ 6742.346067] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3 [ 6742.346070] md: rdev superblock: [ 6742.346073] md: SB: (V:1.0.0) ID:<369aad81.00000000.00000000.00000000> CT:9a322a9c [ 6742.346077] md: L-1507699579 S976570180 ND:48 RD:0 md0 LO:65536 CS:196610 [ 6742.346081] md: UT:00000018 ST:0 AD:131048 WD:0 FD:8 SD:0 CSUM:00000000 E:00000000 [ 6742.346084] D 0: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346089] D 1: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346092] D 2: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346096] D 3: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346102] md: THIS: DISK<N:0,(0,0),R:0,S:0> ... [ 6742.346219] md: ****************************** [ 6742.346221] Here md1 is metadata 0.90.0, and md0 is metadata 1.2 After some more code to distinguish these two types of superblock, in this patch, it will generate dump information like: [ 7906.755790] [ 7906.755799] md: ******************************** [ 7906.755802] md: * <COMPLETE RAID STATE PRINTOUT> * [ 7906.755804] md: ******************************** [ 7906.755808] md1: <ram7><ram6><ram5><ram4> [ 7906.755819] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3 [ 7906.755821] md: rdev superblock (MJ:0): [ 7906.755826] md: SB: (V:0.90.0) ID:<3fca7a0d.a612bfed.5f9f1645.801e9ee6> CT:491989f3 [ 7906.755830] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536 [ 7906.755834] md: UT:491989f3 ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:00fb52ad E:00000001 [ 7906.755836] D 0: DISK<N:0,(1,8),R:0,S:6> [ 7906.755842] D 1: DISK<N:1,(1,10),R:1,S:6> [ 7906.755845] D 2: DISK<N:2,(1,12),R:2,S:6> [ 7906.755849] D 3: DISK<N:3,(1,14),R:3,S:6> [ 7906.755855] md: THIS: DISK<N:3,(1,14),R:3,S:6> ... [ 7906.755972] md0: <ram3><ram2><ram1><ram0> [ 7906.755981] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3 [ 7906.755984] md: rdev superblock (MJ:1): [ 7906.755989] md: SB: (V:1) (F:0) Array-ID:<5fbcf158:55aa:5fbe:9a79:1e939880dcbd> [ 7906.755990] md: Name: "DG5:0" CT:1226410480 [ 7906.755998] md: L5 SZ130944 RD:4 LO:2 CS:128 DO:24 DS:131048 SO:8 RO:0 [ 7906.755999] md: Dev:00000003 UUID: 9194d744:87f7:a448:85f2:7497b84ce30a [ 7906.756001] md: (F:0) UT:1226410480 Events:0 ResyncOffset:-1 CSUM:0dbcd829 [ 7906.756003] md: (MaxDev:384) ... [ 7906.756113] md: ******************************** [ 7906.756116] this md0 (metadata 1.2) information dumping is exactly according to struct mdp_superblock_1. Signed-off-by: Cheng Renquan <crquan@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Dan Williams <dan.j.williams@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: use list_for_each_entry macro directly	Cheng Renquan	2009-01-09	9	-87/+56
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to list_for_each_entry_safe, from <linux/list.h>, it should be defined to use list_for_each_entry_safe, instead of reinventing the wheel. But some calls to each_entry_safe don't really need a safe version, just a direct list_for_each_entry is enough, this could save a temp variable (tmp) in every function that used rdev_for_each. In this patch, most rdev_for_each loops are replaced by list_for_each_entry, totally save many tmp vars; and only in the other situations that will call list_del to delete an entry, the safe version is used. Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: raid0: make hash_spacing and preshift sector-based.	Andre Noll	2009-01-09	1	-29/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch renames the hash_spacing and preshift members of struct raid0_private_data to spacing and sector_shift respectively and changes the semantics as follows: We always have spacing = 2 * hash_spacing. In case sizeof(sector_t) > sizeof(u32) we also have sector_shift = preshift + 1 while sector_shift = preshift = 0 otherwise. Note that the values of nb_zone and zone are unaffected by these changes because in the sector_div() preceeding the assignement of these two variables both arguments double. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: raid0: Represent the size of strip zones in sectors.	Andre Noll	2009-01-09	1	-13/+13
\| \| \| \| \| \| \|	This completes the block -> sector conversion of struct strip_zone. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: raid0 create_strip_zones(): Add KERN_INFO/KERN_ERR to printk's.	Andre Noll	2009-01-09	1	-21/+25
\| \| \| \| \| \| \|	This patch consists only of these trivial changes. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: raid0 create_strip_zones(): Make two local variables sector-based.	Andre Noll	2009-01-09	1	-14/+13
\| \| \| \| \| \| \| \| \| \| \| \| \|	current_offset and curr_zone_offset stored the corresponding offsets as 1K quantities. Rename them to current_start and curr_zone_start to match the naming of struct strip_zone and store the offsets as sector counts. Also, add KERN_INFO to the printk() affected by this change to make checkpatch happy. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: raid0: Represent zone->zone_offset in sectors.	Andre Noll	2009-01-09	1	-6/+6
\| \| \| \| \| \| \| \|	For the same reason as in the previous patch, rename it from zone_offset to zone_start. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: raid0: Represent device offset in sectors.	Andre Noll	2009-01-09	1	-5/+4
\| \| \| \| \| \| \| \|	Rename zone->dev_offset to zone->dev_start to make sure all users have been converted. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: raid0_make_request(): Replace local variable block by sector.	Andre Noll	2009-01-09	1	-7/+6
\| \| \| \| \| \| \|	This change already simplifies the code a bit. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: raid0_make_request(): Remove local variable chunk_size.	Andre Noll	2009-01-09	1	-4/+3
\| \| \| \| \| \| \|	We might as well use chunk_sects instead. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: raid0_make_request(): Replace chunksize_bits by chunksect_bits.	Andre Noll	2009-01-09	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \|	As ffz(~(2 * x)) = ffz(~x) + 1, we have chunksect_bits = chunksize_bits + 1. Fixup all users accordingly. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: use sysfs_notify_dirent to notify changes to md/sync_action.	NeilBrown	2009-01-09	1	-7/+13
\| \| \| \| \| \| \|	There is no compelling need for this, but sysfs_notify_dirent is a nicer interface and the change is good for consistency. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: fix bitmap-on-external-file bug.	NeilBrown	2009-01-09	1	-3/+5
\| \| \| \| \| \| \| \| \| \| \|	commit a2ed9615e3222645007fc19991aedf30eed3ecfd fixed a bug with 'internal' bitmaps, but in the process broke 'in a file' bitmaps. So they are broken in 2.6.28 This fixes it, and needs to go in 2.6.28-stable. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org