author     Linus Torvalds <torvalds@linux-foundation.org>   2015-04-24 09:28:01 -0700
committer  Linus Torvalds <torvalds@linux-foundation.org>   2015-04-24 09:28:01 -0700
commit     474095e46cd14421821da3201a9fd6a4c070996b (patch)
tree       7203d36f53c376a96099ed0310787b1fb0c4f7a5 /Documentation
parent     d56a669ca59c37ed0a7282a251b2f2f22533343a (diff)
parent     9ffc8f7cb9647b13dfe4d1ad0d5e1427bb8b46d6 (diff)
Merge tag 'md/4.1' of git://neil.brown.name/md
Pull md updates from Neil Brown:
 "More updates than usual this time. A few have performance impacts
  which should mostly be positive, but RAID5 (in particular) can be
  very work-load sensitive... We'll have to wait and see.

  Highlights:

   - "experimental" code for managing md/raid1 across a cluster using
     DLM. Code is not ready for general use and triggers a WARNING if
     used. However it is looking good and mostly done and having in
     mainline will help co-ordinate development.

   - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
     handle a full (chunk wide) stripe as a single unit.

   - RAID6 can now perform read-modify-write cycles which should help
     performance on larger arrays: 6 or more devices.

   - RAID5/6 stripe cache now grows and shrinks dynamically. The value
     set is used as a minimum.

   - Resync is now allowed to go a little faster than the 'minimum' when
     there is competing IO. How much faster depends on the speed of the
     devices, so the effective minimum should scale with device speed to
     some extent"

* tag 'md/4.1' of git://neil.brown.name/md: (58 commits)
  md/raid5: don't do chunk aligned read on degraded array.
  md/raid5: allow the stripe_cache to grow and shrink.
  md/raid5: change ->inactive_blocked to a bit-flag.
  md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
  md/raid5: pass gfp_t arg to grow_one_stripe()
  md/raid5: introduce configuration option rmw_level
  md/raid5: activate raid6 rmw feature
  md/raid6 algorithms: xor_syndrome() for SSE2
  md/raid6 algorithms: xor_syndrome() for generic int
  md/raid6 algorithms: improve test program
  md/raid6 algorithms: delta syndrome functions
  raid5: handle expansion/resync case with stripe batching
  raid5: handle io error of batch list
  RAID5: batch adjacent full stripe write
  raid5: track overwrite disk count
  raid5: add a new flag to track if a stripe can be batched
  raid5: use flex_array for scribble data
  md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
  md: allow resync to go faster when there is competing IO.
  md: remove 'go_faster' option from ->sync_request()
  ...
Diffstat (limited to 'Documentation')
-rw-r--r--   Documentation/md-cluster.txt   176
1 file changed, 176 insertions, 0 deletions
diff --git a/Documentation/md-cluster.txt b/Documentation/md-cluster.txt
new file mode 100644
index 0000000..de1af7d
--- /dev/null
+++ b/Documentation/md-cluster.txt
@@ -0,0 +1,176 @@
+The cluster MD is a shared-device RAID for a cluster.
+
+
+1. On-disk format
+
+A separate write-intent bitmap is used for each cluster node.
+The bitmaps record all writes that may have been started on that node,
+and may not yet have finished. The on-disk layout is:
+
+0                    4k                     8k                    12k
+-------------------------------------------------------------------
+| idle                | md super            | bm super [0] + bits |
+| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
+| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
+| bm bits [3, contd]  |                     |                     |
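+
+For illustration, a minimal sketch of how a node's bitmap region can be
+located from this layout, assuming (as the table suggests) two reserved 4K
+pages followed by two 4K pages per bitmap slot; the helper and constants are
+illustrative only, not the kernel's actual code:
+
+    #include <stdint.h>
+
+    #define PAGE_4K          4096ULL
+    #define RESERVED_PAGES   2    /* the "idle" page and the md super page */
+    #define PAGES_PER_BITMAP 2    /* "bm super + bits" page plus "contd"   */
+
+    /* Byte offset of the bitmap superblock for a given bitmap slot
+     * (slot 0 for the first node, slot 1 for the second, ...). */
+    static inline uint64_t bitmap_super_offset(unsigned int slot)
+    {
+        return (RESERVED_PAGES + (uint64_t)slot * PAGES_PER_BITMAP) * PAGE_4K;
+    }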
+
+During "normal" functioning we assume the filesystem ensures that only one
+node writes to any given block at a time, so a write
+request will
+ - set the appropriate bit (if not already set)
+ - commit the write to all mirrors
+ - schedule the bit to be cleared after a timeout.
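+
+A very rough sketch of that sequence; every helper name below is a
+hypothetical stand-in, not an actual md function:
+
+    /* Clustered write path as described above (sketch only). */
+    static void clustered_write(sector_t sector, struct bio *bio)
+    {
+        if (!bitmap_test_bit(sector))
+            bitmap_set_bit(sector);                /* record write intent    */
+        write_to_all_mirrors(bio);                 /* commit to every mirror */
+        schedule_bit_clear(sector, CLEAR_TIMEOUT); /* clear the bit later    */
+    }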
+
+Reads are just handled normally. It is up to the filesystem to
+ensure one node doesn't read from a location where another node (or the same
+node) is writing.
+
+
+2. DLM Locks for management
+
+There are two locks for managing the device:
+
+2.1 Bitmap lock resource (bm_lockres)
+
+ The bm_lockres protects individual node bitmaps. They are named in the
+ form bitmap001 for node 1, bitmap002 for node 2, and so on. When a node
+ joins the cluster, it acquires the lock in PW mode and holds it for as
+ long as the node is part of the cluster. The lock resource
+ number is based on the slot number returned by the DLM subsystem. Since
+ DLM starts node count from one and bitmap slots start from zero, one is
+ subtracted from the DLM slot number to arrive at the bitmap slot number.
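+
+ A small sketch of the naming and slot-number rule described above (the
+ helpers are illustrative, not the kernel's actual code):
+
+    #include <stdio.h>
+
+    /* Lock resource name for a node's bitmap, e.g. "bitmap001" for node 1. */
+    static void bitmap_lockres_name(int dlm_slot, char *name, size_t len)
+    {
+        snprintf(name, len, "bitmap%03d", dlm_slot);
+    }
+
+    /* DLM slots start at 1, bitmap slots at 0. */
+    static int bitmap_slot(int dlm_slot)
+    {
+        return dlm_slot - 1;
+    }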
+
+3. Communication
+
+Each node has to communicate with other nodes when starting or ending
+resync, and for metadata superblock updates.
+
+3.1 Message Types
+
+ There are 3 types of messages which are passed:
+
+ 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
+ updated, and the node must re-read the md superblock. This is performed
+ synchronously.
+
+ 3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
+ so that each node may suspend or resume the region.
+
+ 3.1.3 NEWDISK: informs other nodes that a device is being added to the
+ array, identified by its uuid and slot number (see section 5).
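+
+ For illustration only, the message types could be represented as a simple
+ enum plus a structure whose payload travels in the DLM LVB; the definitions
+ below are a sketch, not the kernel's actual ones:
+
+    enum cluster_msg_type {
+        METADATA_UPDATED,   /* re-read the md superblock        */
+        RESYNC,             /* suspend/resume a (lo, hi) region */
+        NEWDISK,            /* a new device is being added      */
+    };
+
+    struct cluster_msg {
+        enum cluster_msg_type type;
+        /* type-specific payload, e.g. the (lo, hi) range for RESYNC
+         * or the uuid and slot number for NEWDISK */
+    };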
+
+3.2 Communication mechanism
+
+ The DLM LVB is used to communicate between the nodes of the cluster. There
+ are three resources used for the purpose:
+
+ 3.2.1 Token: The resource which protects the entire communication
+ system. The node having the token resource is allowed to
+ communicate.
+
+ 3.2.2 Message: The lock resource which carries the data to
+ communicate.
+
+ 3.2.3 Ack: The resource whose acquisition means the message has been
+ acknowledged by all nodes in the cluster. The BAST of the resource
+ is used to inform the receiving node that a node wants to communicate.
+
+The algorithm is:
+
+ 1. receive status
+
+   sender                        receiver                receiver
+   ACK:CR                        ACK:CR                  ACK:CR
+
+ 2. sender get EX of TOKEN
+ sender get EX of MESSAGE
+   sender                        receiver                receiver
+   TOKEN:EX                      ACK:CR                  ACK:CR
+   MESSAGE:EX
+   ACK:CR
+
+ Sender checks that it still needs to send a message. Messages received
+ or other events that happened while waiting for the TOKEN may have made
+ this message inappropriate or redundant.
+
+ 3. sender write LVB.
+ sender down-convert MESSAGE from EX to CR
+ sender try to get EX of ACK
+    [ wait until all receivers have *processed* the MESSAGE ]
+
+ [ triggered by bast of ACK ]
+ receiver get CR of MESSAGE
+ receiver read LVB
+ receiver processes the message
+ [ wait finish ]
+ receiver release ACK
+
+   sender                        receiver                receiver
+   TOKEN:EX                      MESSAGE:CR              MESSAGE:CR
+   MESSAGE:CR
+   ACK:EX
+
+ 4. triggered by grant of EX on ACK (indicating all receivers have processed
+ message)
+ sender down-convert ACK from EX to CR
+ sender release MESSAGE
+ sender release TOKEN
+ receiver upconvert to EX of MESSAGE
+ receiver get CR of ACK
+ receiver release MESSAGE
+
+   sender                        receiver                receiver
+   ACK:CR                        ACK:CR                  ACK:CR
+
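+ A condensed sketch of the sender's side of this exchange, written with
+ hypothetical lock_resource()/upconvert()/downconvert()/unlock_resource()
+ wrappers around the DLM calls (names are illustrative only):
+
+    static int send_cluster_msg(struct cluster_msg *msg)
+    {
+        lock_resource(TOKEN, DLM_LOCK_EX);    /* only one sender at a time  */
+        lock_resource(MESSAGE, DLM_LOCK_EX);
+        if (!message_still_needed(msg)) {     /* events while waiting for   */
+            unlock_resource(MESSAGE);         /* TOKEN may have made this   */
+            unlock_resource(TOKEN);           /* message redundant          */
+            return 0;
+        }
+        write_lvb(MESSAGE, msg);              /* the payload travels in LVB */
+        downconvert(MESSAGE, DLM_LOCK_CR);    /* receivers may now read it  */
+        upconvert(ACK, DLM_LOCK_EX);          /* BASTs the receivers; grant */
+                                              /* means all have processed   */
+        downconvert(ACK, DLM_LOCK_CR);
+        unlock_resource(MESSAGE);
+        unlock_resource(TOKEN);
+        return 0;
+    }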
+
+4. Handling Failures
+
+4.1 Node Failure
+ When a node fails, the DLM informs the cluster with the slot number. The
+ node starts a cluster recovery thread. The cluster recovery thread:
+ - acquires the bitmap<number> lock of the failed node
+ - opens the bitmap
+ - reads the bitmap of the failed node
+ - copies the set bits into the local node's bitmap
+ - cleans the bitmap of the failed node
+ - releases bitmap<number> lock of the failed node
+ - initiates resync of the bitmap on the current node
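+
+ Sketched as pseudocode, with illustrative helper names only:
+
+    static void recover_failed_node(int failed_slot)
+    {
+        acquire_bitmap_lock(failed_slot);           /* bitmap<number> lock */
+        open_and_read_bitmap(failed_slot);
+        copy_set_bits_to_local_bitmap(failed_slot);
+        clear_bitmap(failed_slot);
+        release_bitmap_lock(failed_slot);
+        initiate_resync_on_local_node();
+    }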
+
+ The resync process is the regular md resync. However, in a clustered
+ environment, when a resync is performed, it needs to tell other nodes
+ of the areas which are suspended. Before a resync starts, the node
+ sends out RESYNC_START with the (lo,hi) range of the area which needs
+ to be suspended. Each node maintains a suspend_list, which contains
+ the list of ranges which are currently suspended. On receiving
+ RESYNC_START, the node adds the range to the suspend_list. Similarly,
+ when the node performing resync finishes, it sends RESYNC_FINISHED
+ to other nodes and other nodes remove the corresponding entry from
+ the suspend_list.
+
+ A helper function, should_suspend(), can be used to check if a particular
+ I/O range should be suspended or not.
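+
+ A minimal sketch of the suspend_list bookkeeping and of should_suspend();
+ the structure layout and signature are illustrative, not the kernel's
+ actual definitions:
+
+    struct suspend_info {
+        sector_t lo, hi;          /* range announced by RESYNC_START */
+        struct list_head list;    /* linked into the suspend_list    */
+    };
+
+    /* True if the I/O range [lo, hi] overlaps any suspended area. */
+    static bool should_suspend(struct list_head *suspend_list,
+                               sector_t lo, sector_t hi)
+    {
+        struct suspend_info *s;
+
+        list_for_each_entry(s, suspend_list, list)
+            if (lo <= s->hi && hi >= s->lo)
+                return true;
+        return false;
+    }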
+
+4.2 Device Failure
+ Device failures are handled and communicated via the metadata update
+ routine.
+
+5. Adding a new Device
+For adding a new device, it is necessary that all nodes "see" the new device
+to be added. For this, the following algorithm is used:
+
+ 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
+    ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
+ 2. Node 1 sends NEWDISK with uuid and slot number
+ 3. Other nodes issue kobject_uevent_env with uuid and slot number
+ (Steps 4,5 could be a udev rule)
+ 4. In userspace, the node searches for the disk, perhaps
+ using blkid -t SUB_UUID=""
+ 5. Other nodes issue either of the following depending on whether the disk
+ was found:
+ ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
+ disc.number set to slot number)
+ ioctl(CLUSTERED_DISK_NACK)
+ 6. Other nodes drop lock on no-new-devs (CR) if device is found
+ 7. Node 1 attempts EX lock on no-new-devs
+ 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
+ as SpareLocal
+ 9. If node 1 cannot get the no-new-devs lock, it fails the operation and sends METADATA_UPDATED
+ 10. Other nodes learn whether or not the disk was added from the following
+ METADATA_UPDATED message.
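+
+As a rough userspace illustration of step 5 (the path where the disk was
+found), using the uapi definitions from linux/raid/md_u.h and md_p.h; error
+handling is minimal and the exact state handling is a sketch, not a
+reference implementation:
+
+    #include <fcntl.h>
+    #include <unistd.h>
+    #include <sys/ioctl.h>
+    #include <linux/raid/md_u.h>
+    #include <linux/raid/md_p.h>
+
+    /* 'slot' comes from the NEWDISK message; major/minor identify the
+     * locally discovered device. */
+    static int add_candidate_disk(const char *md_dev, int slot,
+                                  int major, int minor)
+    {
+        mdu_disk_info_t info = { 0 };
+        int fd, ret;
+
+        fd = open(md_dev, O_RDWR);
+        if (fd < 0)
+            return -1;
+
+        info.number = slot;
+        info.major  = major;
+        info.minor  = minor;
+        info.state  = 1 << MD_DISK_CANDIDATE;
+
+        ret = ioctl(fd, ADD_NEW_DISK, &info);
+        close(fd);
+        return ret;
+    }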