op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	RAID5: batch adjacent full stripe write	shli@kernel.org	2015-04-22	2	-25/+336
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k unit. Idealy we should use big size for adjacent full stripe writes. Bigger stripe cache size means less stripes runing in the state machine so can reduce cpu overhead. And also bigger size can cause bigger IO size dispatched to under layer disks. With below patch, we will automatically batch adjacent full stripe write together. Such stripes will be added to the batch list. Only the first stripe of the list will be put to handle_list and so run handle_stripe(). Some steps of handle_stripe() are extended to cover all stripes of the list, including ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes running in handle_stripe() and we send IO of whole stripe list together to increase IO size. Stripes added to a batch list have some limitations. A batch list can only include full stripe write and can't cross chunk boundary to make sure stripes have the same parity disks. Stripes in a batch list must be in the same state (no written, toread and so on). If a stripe is in a batch list, all new read/write to add_stripe_bio will be blocked to overlap conflict till the batch list is handled. The limitations will make sure stripes in a batch list be in exactly the same state in the life circly. I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6 PCIe SSD. This patch improves around 30% performance and IO size to under layer disk is exactly 32k. I also run a 4k randwrite test in the same array to make sure the performance isn't changed with the patch. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	raid5: track overwrite disk count	shli@kernel.org	2015-04-22	2	-1/+17
\| \| \| \| \| \| \|	Track overwrite disk count, so we can know if a stripe is a full stripe write. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	raid5: add a new flag to track if a stripe can be batched	shli@kernel.org	2015-04-22	2	-3/+10
\| \| \| \| \| \| \| \|	A freshly new stripe with write request can be batched. Any time the stripe is handled or new read is queued, the flag will be cleared. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	raid5: use flex_array for scribble data	shli@kernel.org	2015-04-22	2	-38/+57
\| \| \| \| \| \| \| \| \|	Use flex_array for scribble data. Next patch will batch several stripes together, so scribble data should be able to cover several stripes, so this patch also allocates scribble data for stripes across a chunk. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	md raid0: access mddev->queue (request queue member) conditionally because ↵	Heinz Mauelshagen	2015-04-22	1	-21/+27
\| \| \| \| \| \| \| \| \| \| \| \| \|	it is not set when accessed from dm-raid The patch makes 3 references to mddev->queue in the raid0 personality conditional in order to allow for it to be accessed from dm-raid. Mandatory, because md instances underneath dm-raid don't manage a request queue of their own which'd lead to oopses without the patch. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
*	md: allow resync to go faster when there is competing IO.	NeilBrown	2015-04-22	1	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When md notices non-sync IO happening while it is trying to resync (or reshape or recover) it slows down to the set minimum. The default minimum might have made sense many years ago but the drives have become faster. Changing the default to match the times isn't really a long term solution. This patch changes the code so that instead of waiting until the speed has dropped to the target, it just waits until pending requests have completed. This means that the delay inserted is a function of the speed of the devices. Testing shows that: - for some loads, the resync speed is unchanged. For those loads increasing the minimum doesn't change the speed either. So this is a good result. To increase resync speed under such loads we would probably need to increase the resync window size. - for other loads, resync speed does increase to a reasonable fraction (e.g. 20%) of maximum possible, and throughput of the load only drops a little bit (e.g. 10%) - for other loads, throughput of the non-sync load drops quite a bit more. These seem to be latency-sensitive loads. So it isn't a perfect solution, but it is mostly an improvement. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: remove 'go_faster' option from ->sync_request()	NeilBrown	2015-04-22	5	-21/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This option is not well justified and testing suggests that it hardly ever makes any difference. The comment suggests there might be a need to wait for non-resync activity indicated by ->nr_waiting, however raise_barrier() already waits for all of that. So just remove it to simplify reasoning about speed limiting. This allows us to remove a 'FIXME' comment from raid5.c as that never used the flag. Signed-off-by: NeilBrown <neilb@suse.de>
*	md: don't require sync_min to be a multiple of chunk_size.	NeilBrown	2015-04-22	1	-11/+2
\| \| \| \| \| \| \| \| \| \| \| \|	There is really no need for sync_min to be a multiple of chunk_size, and values read from here often aren't. That means you cannot read a value and expect to be able to write it back later. So remove the chunk_size check, and round down to a multiple of 4K, to be sure everything works with 4K-sector devices. Signed-off-by: NeilBrown <neilb@suse.de>
*	Merge branch 'cluster' into for-next	NeilBrown	2015-04-22	12	-82/+1709
\|\
\| *	md-cluster: re-add capabilities	Goldwyn Rodrigues	2015-04-22	5	-13/+71
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state, the clustered md: 1. Sends RE_ADD message with the desc_nr. Nodes receiving the message clear the Faulty bit in their respective rdev->flags. 2. The node initiating re-add, gathers the bitmaps of all nodes and copies them into the local bitmap. It does not clear the bitmap from which it is copying. 3. Initiating node schedules a md recovery to sync the devices. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md: re-add a failed disk	Goldwyn Rodrigues	2015-04-22	1	-20/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds the capability of re-adding a failed disk by writing "re-add" to /sys/block/mdXX/md/dev-YYY/state. This facilitates adding disks which have encountered a temporary error such as a network disconnection/hiccup in an iSCSI device, or a SAN cable disconnection which has been restored. In such a situation, you do not need to remove and re-add the device. Writing re-add to the failed device's state would add it again to the array and perform the recovery of only the blocks which were written after the device failed. This works for generic md, and is not related to clustering. However, this patch is to ease re-add operations listed above in clustering environments. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md-cluster: remove capabilities	Goldwyn Rodrigues	2015-04-22	3	-1/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds "remove" capabilities for the clustered environment. When a user initiates removal of a device from the array, a REMOVE message with disk number in the array is sent to all the nodes which kick the respective device in their own array. This facilitates the removal of failed devices. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md: Export and rename find_rdev_nr_rcu	Goldwyn Rodrigues	2015-04-22	2	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is required by the clustering module (patches to follow) to find the device to remove or re-add. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md: Export and rename kick_rdev_from_array	Goldwyn Rodrigues	2015-04-22	2	-8/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This export is required for clustering module in order to co-ordinate remove/readd a rdev from all nodes. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md-cluster: correct the num for comparison	Guoqing Jiang	2015-04-22	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since the node num of md-cluster is from zero, and cinfo->slot_number represents the slot num of dlm, no need to check for equality. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md: Fix bitmap offset calculations	Goldwyn Rodrigues	2015-03-25	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The calculations of bitmap offset is incorrect with respect to bits to bytes conversion. Also, remove an irrelevant duplicate message. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md/cluster: Communication Framework: fix semicolon.cocci warnings	kbuild test robot	2015-03-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	drivers/md/md-cluster.c:328:2-3: Unneeded semicolon Removes unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md: recover_bitmaps() can be static	kbuild test robot	2015-03-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	drivers/md/md-cluster.c:190:6: sparse: symbol 'recover_bitmaps' was not declared. Should it be static? Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md: Fix stray --cluster-confirm crash	Goldwyn Rodrigues	2015-03-21	3	-4/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A --cluster-confirm without an --add (by another node) can crash the kernel. Fix it by guarding it using a state. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md/bitmap: use sector_div for sector_t divisions	Stephen Rothwell	2015-03-04	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	neilb: modified to not corrupt ->resync_max_sectors. sector_div usage fixed by Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md/bitmap: fix incorrect DIV_ROUND_UP usage.	NeilBrown	2015-03-04	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	DIV_ROUTND_UP doesn't work on "long long", - and it should be sector_t anyway. Signed-off-by: NeilBrown <neilb@suse.de>
\| *	md: fix error paths from bitmap_create.	NeilBrown	2015-02-25	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recent change to bitmap_create mishandles errors. In particular a failure doesn't alway cause 'err' to be set. Signed-off-by: NeilBrown <neilb@suse.de>
\| *	Add new disk to clustered array	Goldwyn Rodrigues	2015-02-23	7	-4/+169
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Algorithm: 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD) 2. Node 1 sends NEWDISK with uuid and slot number 3. Other nodes issue kobject_uevent_env with uuid and slot number (Steps 4,5 could be a udev rule) 4. In userspace, the node searches for the disk, perhaps using blkid -t SUB_UUID="" 5. Other nodes issue either of the following depending on whether the disk was found: ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and disc.number set to slot number) ioctl(CLUSTERED_DISK_NACK) 6. Other nodes drop lock on no-new-devs (CR) if device is found 7. Node 1 attempts EX lock on no-new-devs 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk as SpareLocal 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED 10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Read from the first device when an area is resyncing	Goldwyn Rodrigues	2015-02-23	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	set choose_first true for cluster read in read balance when the area is resyncing. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Suspend writes in RAID1 if within range	Goldwyn Rodrigues	2015-02-23	4	-3/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If there is a resync going on, all nodes must suspend writes to the range. This is recorded in the suspend_info/suspend_list. If there is an I/O within the ranges of any of the suspend_info, should_suspend will return 1. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Resync start/Finish actions	Goldwyn Rodrigues	2015-02-23	1	-0/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a RESYNC_START message arrives, the node removes the entry with the current slot number and adds the range to the suspend_list. Simlarly, when a RESYNC_FINISHED message is received, node clears entry with respect to the bitmap number. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Send RESYNCING while performing resync start/stop	Goldwyn Rodrigues	2015-02-23	3	-2/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a resync is initiated, RESYNCING message is sent to all active nodes with the range (lo,hi). When the resync is over, a RESYNCING message is sent with (0,0). A high sector value of zero indicates that the resync is over. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Reload superblock if METADATA_UPDATED is received	Goldwyn Rodrigues	2015-02-23	3	-0/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Re-reads the devices by invalidating the cache. Since we don't write to faulty devices, this is detected using events recorded in the devices. If it is old as compared to the mddev mark it is faulty. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	metadata_update sends message to other nodes	Goldwyn Rodrigues	2015-02-23	3	-14/+106
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- request to send a message - make changes to superblock - send messages telling everyone that the superblock has changed - other nodes all read the superblock - other nodes all ack the messages - updating node release the "I'm sending a message" resource. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Communication Framework: Sending functions	Goldwyn Rodrigues	2015-02-23	1	-0/+87
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The sending part is split in two functions to make sure atomicity of the operations, such as the MD superblock update. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Communication Framework: Receiving	Goldwyn Rodrigues	2015-02-23	1	-0/+102
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. receive status sender receiver receiver ACK:CR ACK:CR ACK:CR 2. sender get EX of TOKEN sender get EX of MESSAGE sender receiver receiver TOKEN:EX ACK:CR ACK:CR MESSAGE:EX ACK:CR 3. sender write LVB. sender down-convert MESSAGE from EX to CR sender try to get EX of ACK [ wait until all receiver has processed the MESSAGE ] [ triggered by bast of ACK ] receiver get CR of MESSAGE receiver read LVB receiver processes the message [ wait finish ] receiver release ACK sender receiver receiver TOKEN:EX MESSAGE:CR MESSAGE:CR MESSAGE:CR ACK:EX 4. sender down-convert ACK from EX to CR sender release MESSAGE sender release TOKEN receiver upconvert to EX of MESSAGE receiver get CR of ACK receiver release MESSAGE sender receiver receiver ACK:CR ACK:CR ACK:CR Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Perform resync for cluster node failure	Goldwyn Rodrigues	2015-02-23	1	-1/+11
\| \| \| \| \| \| \| \| \| \| \| \|	If bitmap_copy_slot returns hi>0, we need to perform resync. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Initiate recovery on node failure	Goldwyn Rodrigues	2015-02-23	1	-0/+58
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The DLM informs us in case of node failure with the DLM slot number. cluster_info->recovery_map sets the bit corresponding to the slot number and wakes up the recovery thread. The recovery thread: 1. Derives the slot number from the recovery_map 2. Locks the bitmap corresponding to the slot 3. Copies the set bits to the node-local bitmap Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Copy set bits from another slot	Goldwyn Rodrigues	2015-02-23	2	-1/+79
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bitmap_copy_from_slot reads the bitmap from the slot mentioned. It then copies the set bits to the node local bitmap. This is helper function for the resync operation on node failure. bitmap_set_memory_bits() currently assumes it is only run at startup and that they bitmap is currently empty. So if it finds that a region is already marked as dirty, it won't mark it dirty again. Change bitmap_set_memory_bits() to always set the NEEDED_MASK bit if 'needed' is set. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	bitmap_create returns bitmap pointer	Goldwyn Rodrigues	2015-02-23	3	-31/+56
\| \| \| \| \| \| \| \| \| \| \| \|	This is done to have multiple bitmaps open at the same time. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Gather on-going resync information of other nodes	Goldwyn Rodrigues	2015-02-23	3	-0/+120
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a node joins, it does not know of other nodes performing resync. So, each node keeps the resync information in it's LVB. When a new node joins, it reads the LVB of each "online" bitmap. [TODO] The new node attempts to get the PW lock on other bitmap, if it is successful, it reads the bitmap and performs the resync (if required) on it's behalf. If the node does not get the PW, it requests CR and reads the LVB for the resync information. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Lock bitmap while joining the cluster	Goldwyn Rodrigues	2015-02-23	1	-0/+14
\| \| \| \| \| \| \| \|	Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Use separate bitmaps for each nodes in the cluster	Goldwyn Rodrigues	2015-02-23	3	-10/+64
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On-disk format: 0 4k 8k 12k ------------------------------------------------------------------- \| idle \| md super \| bm super [0] + bits \| \| bm bits[0, contd] \| bm super[1] + bits \| bm bits[1, contd] \| \| bm super[2] + bits \| bm bits [2, contd] \| bm super[3] + bits \| \| bm bits [3, contd] \| \| \| Bitmap super has a field nodes, which defines the maximum number of nodes the device can use. While reading the bitmap super, if the cluster finds out that the number of nodes is > 0: 1. Requests the md-cluster module. 2. Calls md_cluster_ops->join(), which sets up clustering such as joining DLM lockspace. Since the first time, the first bitmap is read. After the call to the cluster_setup, the bitmap offset is adjusted and the superblock is re-read. This also ensures the bitmap is read the bitmap lock (when bitmap lock is introduced in later patches) Questions: 1. cluster name is repeated in all bitmap supers. Is that okay? Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Add node recovery callbacks	Goldwyn Rodrigues	2015-02-23	6	-9/+64
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	DLM offers callbacks when a node fails and the lock remastery is performed: 1. recover_prep: called when DLM discovers a node is down 2. recover_slot: called when DLM identifies the node and recovery can start 3. recover_done: called when all nodes have completed recover_slot recover_slot() and recover_done() are also called when the node joins initially in order to inform the node with its slot number. These slot numbers start from one, so we deduct one to make it start with zero which the cluster-md code uses. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Return MD_SB_CLUSTERED if mddev is clustered	Goldwyn Rodrigues	2015-02-23	2	-0/+4
\| \| \| \| \| \| \| \|	Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Introduce md_cluster_info	Goldwyn Rodrigues	2015-02-23	4	-4/+80
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	md_cluster_info stores the cluster information in the MD device. The join() is called when mddev detects it is a clustered device. The main responsibilities are: 1. Setup a DLM lockspace 2. Setup all initial locks such as super block locks and bitmap lock (will come later) The leave() clears up the lockspace and all the locks held. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Introduce md_cluster_operations to handle cluster functions	Goldwyn Rodrigues	2015-02-23	4	-0/+92
\| \| \| \| \| \| \| \| \| \| \| \|	This allows dynamic registering of cluster hooks. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	DLM lock and unlock functions	Goldwyn Rodrigues	2015-02-23	1	-0/+102
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A dlm_lock_resource is a structure which contains all information required for locking using DLM. The init function allocates the lock and acquires the lock in NL mode. The unlock function converts the lock resource to NL mode. This is done to preserve LVB and for faster processing of locks. The lock resource is DLM unlocked only in the lockres_free function, which is the end of life of the lock resource. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Create a separate module for clustering support	Goldwyn Rodrigues	2015-02-23	3	-0/+45
\| \| \| \| \| \| \| \| \| \| \| \|	Tagged as EXPERIMENTAL for now. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	Add number of nodes to bitmap structure for clustering	Goldwyn Rodrigues	2015-02-23	1	-1/+2
\| \| \| \| \| \| \| \|	Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
\| *	md-cluster: Design Documentation	Goldwyn Rodrigues	2015-02-23	1	-0/+176
\| \| \| \| \| \| \| \|	Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
* \|	md/raid0: fix bug with chunksize not a power of 2.	NeilBrown	2015-04-10	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1 in v3.14-rc1 RAID0 has performed incorrect calculations when the chunksize is not a power of 2. This happens because "sector_div()" modifies its first argument, but this wasn't taken into account in the patch. So restore that first arg before re-using the variable. Reported-by: Joe Landman <joe.landman@gmail.com> Reported-by: Dave Chinner <david@fromorbit.com> Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1 Cc: stable@vger.kernel.org (3.14 and later). Signed-off-by: NeilBrown <neilb@suse.de>
* \|	md: fix md io stats accounting broken	Gu Zheng	2015-04-08	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Simon reported the md io stats accounting issue: " I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5. It's not abnormal other than it's 3-disk, with one being SSD (sdc) and the other two being write-mostly: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 345.00 0.00 0.00 0.00 0.00 100.00 md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 58779.00 0.00 0.00 0.00 0.00 100.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 100.00 " The cause is commit "18c0b223cf9901727ef3b02da6711ac930b4e5d4" uses the generic_start_io_acct to account the disk stats rather than the open code, but it also introduced the increase to .in_flight[rw] which is needless to md. So we re-use the open code here to fix it. Reported-by: Simon Kirby <sim@hostway.ca> Cc: <stable@vger.kernel.org> 3.19 Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: NeilBrown <neilb@suse.de>
* \|	Linux 4.0-rc7v4.0-rc7	Linus Torvalds	2015-04-06	1	-1/+1
\| \|
* \|	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	Linus Torvalds	2015-04-06	21	-71/+78
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull networking fixes from David Miller: 1) In TCP, don't register an FRTO for cumulatively ACK'd data that was previously SACK'd, from Neal Cardwell. 2) Need to hold RNL mutex in ipv4 multicast code namespace cleanup, from Cong WANG. 3) Similarly we have to hold RNL mutex for fib_rules_unregister(), also from Cong WANG. 4) Revert and rework netns nsid allocation fix, from Nicolas Dichtel. 5) When we encapsulate for a tunnel device, skb->sk still points to the user socket. So this leads to cases where we retraverse the ipv4/ipv6 output path with skb->sk being of some other address family (f.e. AF_PACKET). This can cause things to crash since the ipv4 output path is dereferencing an AF_PACKET socket as if it were an ipv4 one. The short term fix for 'net' and -stable is to elide these socket checks once we've entered an encapsulation sequence by testing xmit_recursion. Longer term we have a better solution wherein we pass the tunnel's socket down through the output paths, but that is way too invasive for 'net' and -stable. From Hannes Frederic Sowa. 6) l2tp_init() failure path forgets to unregister per-net ops, from Cong WANG. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net/mlx4_core: Fix error message deprecation for ConnectX-2 cards net: dsa: fix filling routing table from OF description l2tp: unregister l2tp_net_ops on failure path mvneta: dont call mvneta_adjust_link() manually ipv6: protect skb->sk accesses from recursive dereference inside the stack netns: don't allocate an id for dead netns Revert "netns: don't clear nsid too early on removal" ip6mr: call del_timer_sync() in ip6mr_free_table() net: move fib_rules_unregister() under rtnl lock ipv4: take rtnl_lock and mark mrt table as freed on namespace cleanup tcp: fix FRTO undo on cumulative ACK of SACKed range xen-netfront: transmit fully GSO-sized packets