summaryrefslogtreecommitdiffstats
path: root/fs/exofs/ore_raid.c
Commit message (Collapse)AuthorAgeFilesLines
* ore: Support for raid 6Boaz Harrosh2014-05-221-13/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This simple patch adds support for raid6 to the ORE. Most operations and calculations where already for the general case. Only things left: * call async_gen_syndrome() in the case of raid6 (NOTE that the raid6 math is the one supported by the Linux Kernel see: crypto/async_tx/async_pq.c) * call _ore_add_parity_unit() twice with only last call generating the redundancy pages. * Fix couple BUGS in old code a. In reads when parity==2 it can happen that per_dev->length=0 but per_dev->offset was set and adjusted by _ore_add_sg_seg(). Don't let it be overwritten. b. The all 'cur_comp > starting_dev' thing to determine if: "per_dev->offset is in the current stripe number or the next one." Was a complete raid5/4 accident. When parity==2 this is not at all true usually. All we need to do is increment si->ob_offset once we pass by the first parity device. (This also greatly simplifies the code, amen) c. Calculation of si->dev rotation can overflow when parity==2. * Then last enable raid6 in ore_verify_layout() I want to deeply thank Daniel Gryniewicz who found first all the bugs in the old raid code, and inspired these patches: Inspired-by Daniel Gryniewicz <dang@linuxbox.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* ore: Remove redundant dev_order(), more cleanupsBoaz Harrosh2014-05-221-9/+4
| | | | | | | | | | | | | | | | | | | | | Two cleanups: * si->cur_comp, si->cur_pg where always calculated after the call to ore_calc_stripe_info() with the help of _dev_order(...). But these are already calculated by ore_calc_stripe_info() and can be just set there. (This is left over from the time that si->cur_comp, si->cur_pg were only used by raid code, but now the main loop manages them anyway even though they are ultimately not used in none raid code) * si->cur_comp - For the very last stripe case, was set inside _ore_add_parity_unit(). This is not clear and will be wrong for coming raid6 so move this to only caller. Now si->cur_comp is only manipulated within _prepare_for_striping(), always next to the manipulation of cur_dev. Which is much easier to understand and follow. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* ore: (trivial) reformat some codeBoaz Harrosh2014-05-221-3/+1
| | | | | | rearrange some source lines. Nothing changed. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* fs: Mark functions as static in exofs/ore_raid.cRashika Kheria2014-04-031-2/+2
| | | | | | | | | | | | | Mark functions as static in exofs/ore_raid.c because they are not used outside this file. This also eliminates the following warning in exofs/ore_raid.c: fs/exofs/ore_raid.c:24:14: warning: no previous prototype for _raid_page_alloc [-Wmissing-prototypes] fs/exofs/ore_raid.c:29:6: warning: no previous prototype for _raid_page_free [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* block: Add bio_for_each_segment_all()Kent Overstreet2013-03-231-1/+1
| | | | | | | | | | | | | | | | | __bio_for_each_segment() iterates bvecs from the specified index instead of bio->bv_idx. Currently, the only usage is to walk all the bvecs after the bio has been advanced by specifying 0 index. For immutable bvecs, we need to split these apart; bio_for_each_segment() is going to have a different implementation. This will also help document the intent of code that's using it - bio_for_each_segment_all() is only legal to use for code that owns the bio. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Neil Brown <neilb@suse.de> CC: Boaz Harrosh <bharrosh@panasas.com>
* ore: signedness bug in _sp2d_min_pg()Dan Carpenter2012-10-031-1/+1
| | | | | | This for loop doesn't work correctly when "p" is unsigned. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
* ore: Unlock r4w pages in exact reverse order of lockingBoaz Harrosh2012-07-201-12/+12
| | | | | | | | | | | | | The read-4-write pages are locked in address ascending order. But where unlocked in a way easiest for coding. Fix that, locks should be released in opposite order of locking, .i.e descending address order. I have not hit this dead-lock. It was found by inspecting the dbug print-outs. I suspect there is an higher lock at caller that protects us, but fix it regardless. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* ore: Fix NFS crash by supporting any unaligned RAID IOBoaz Harrosh2012-07-201-31/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In RAID_5/6 We used to not permit an IO that it's end byte is not stripe_size aligned and spans more than one stripe. .i.e the caller must check if after submission the actual transferred bytes is shorter, and would need to resubmit a new IO with the remainder. Exofs supports this, and NFS was supposed to support this as well with it's short write mechanism. But late testing has exposed a CRASH when this is used with none-RPC layout-drivers. The change at NFS is deep and risky, in it's place the fix at ORE to lift the limitation is actually clean and simple. So here it is below. The principal here is that in the case of unaligned IO on both ends, beginning and end, we will send two read requests one like old code, before the calculation of the first stripe, and also a new site, before the calculation of the last stripe. If any "boundary" is aligned or the complete IO is within a single stripe. we do a single read like before. The code is clean and simple by splitting the old _read_4_write into 3 even parts: 1._read_4_write_first_stripe 2. _read_4_write_last_stripe 3. _read_4_write_execute And calling 1+3 at the same place as before. 2+3 before last stripe, and in the case of all in a single stripe then 1+2+3 is preformed additively. Why did I not think of it before. Well I had a strike of genius because I have stared at this code for 2 years, and did not find this simple solution, til today. Not that I did not try. This solution is much better for NFS than the previous supposedly solution because the short write was dealt with out-of-band after IO_done, which would cause for a seeky IO pattern where as in here we execute in order. At both solutions we do 2 separate reads, only here we do it within a single IO request. (And actually combine two writes into a single submission) NFS/exofs code need not change since the ORE API communicates the new shorter length on return, what will happen is that this case would not occur anymore. hurray!! [Stable this is an NFS bug since 3.2 Kernel should apply cleanly] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* ore: Must support none-PAGE-aligned IOBoaz Harrosh2012-01-081-12/+60
| | | | | | | | | | | | | | | | | | NFS might send us offsets that are not PAGE aligned. So we must read in the reminder of the first/last pages, in cases we need it for Parity calculations. We only add an sg segments to read the partial page. But we don't mark it as read=true because it is a lock-for-write page. TODO: In some cases (IO spans a single unit) we can just adjust the raid_unit offset/length, but this is left for later Kernels. [Bug in 3.2.0 Kernel] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* ore: fix BUG_ON, too few sgs when readingBoaz Harrosh2012-01-061-1/+5
| | | | | | | | | | | | | | | When reading RAID5 files, in rare cases, we calculated too few sg segments. There should be two extra for the beginning and end partial units. Also "too few sg segments" should not be a BUG_ON there is all the mechanics in place to handle it, as a short read. So just return -ENOMEM and the rest of the code will gracefully split the IO. [Bug in 3.2.0 Kernel] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* ore: RAID5 WriteBoaz Harrosh2011-10-241-7/+527
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is finally the RAID5 Write support. The bigger part of this patch is not the XOR engine itself, But the read4write logic, which is a complete mini prepare_for_striping reading engine that can read scattered pages of a stripe into cache so it can be used for XOR calculation. That is, if the write was not stripe aligned. The main algorithm behind the XOR engine is the 2 dimensional array: struct __stripe_pages_2d. A drawing might save 1000 words --- __stripe_pages_2d | n = pages_in_stripe_unit; w = group_width - parity; | pages array presented to the XOR lib | | V | __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---| | | __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <--- | ... | ... | __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par] ^ | data added columns first then row --- The pages are put on this array columns first. .i.e: p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ... So we are doing a corner turn of the pages. Note that pages will zigzag down and left. but are put sequentially in growing order. So when the time comes to XOR the stripe, only the beginning and end of the array need be checked. We scan the array and any NULL spot will be field by pages-to-be-read. The FS that wants to support RAID5 needs to supply an operations-vector that searches a given page in cache, and specifies if the page is uptodate or need reading. All these pages to be read are put on a slave ore_io_state and synchronously read. All the pages of a stripe are read in one IO, using the scatter gather mechanism. In write we constrain our IO to only be incomplete on a single stripe. Meaning either the complete IO is within a single stripe so we might have pages to read from both beginning or end of the strip. Or we have some reading to do at beginning but end at strip boundary. The left over pages are pushed to the next IO by the API already established by previous work, where an IO offset/length combination presented to the ORE might get the length truncated and the user must re-submit the leftover pages. (Both exofs and NFS support this) But any ORE user should make it's best effort to align it's IO before hand and avoid complications. A cached ore_layout->stripe_size member can be used for that calculation. (NOTE: that ORE demands that stripe_size may not be bigger then 32bit) What else? Well read it and tell me. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* ore: RAID5 readBoaz Harrosh2011-10-241-0/+140
This patch introduces the first stage of RAID5 support mainly the skip-over-raid-units when reading. For writes it inserts BLANK units, into where XOR blocks should be calculated and written to. It introduces the new "general raid maths", and the main additional parameters and components needed for raid5. Since at this stage it could corrupt future version that actually do support raid5. The enablement of raid5 mounting and setting of parity-count > 0 is disabled. So the raid5 code will never be used. Mounting of raid5 is only enabled later once the basic XOR write is also in. But if the patch "enable RAID5" is applied this code has been tested to be able to properly read raid5 volumes and is according to standard. Also it has been tested that the new maths still properly supports RAID0 and grouping code just as before. (BTW: I have found more bugs in the pnfs-obj RAID math fixed here) The ore.c file is getting too big, so new ore_raid.[hc] files are added that will include the special raid stuff that are not used in striping and mirrors. In future write support these will get bigger. When adding the ore_raid.c to Kbuild file I was forced to rename ore.ko to libore.ko. Is it possible to keep source file, say ore.c and module file ore.ko the same even if there are multiple files inside ore.ko? Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
OpenPOWER on IntegriCloud