op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message	Author	Age	Files	Lines
*	Btrfs: Handle data checksumming on bios that span multiple ordered extents	Chris Mason	2008-09-25	1	-2/+2
	Data checksumming is done right before the bio is sent down the IO stack, which means a single bio might span more than one ordered extent. In this case, the checksumming data is split between two ordered extents. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	btrfs_start_transaction: wait for commits in progress to finish	Chris Mason	2008-09-25	1	-0/+1
	btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Use async helpers to deal with pages that have been improperly dirtied	Chris Mason	2008-09-25	1	-0/+6
	Higher layers sometimes call set_page_dirty without asking the filesystem to help. This causes many problems for the data=ordered and cow code. This commit detects pages that haven't been properly set up for IO and kicks off an async helper to deal with them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: New data=ordered implementation	Chris Mason	2008-09-25	1	-2/+17
	The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work:
	* When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated and inserted into a per-inode rbtree to track the pending extents.
	* As each page is written, EXTENT_ORDERED is cleared on the bytes corresponding to that page.
	* When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, the previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated.
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
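The accounting described in the three steps above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the kernel code: `struct ordered_extent_sketch` and both functions are hypothetical names, the per-inode rbtree and the EXTENT_ORDERED bit are left out, and each byte range is assumed to be reported as written exactly once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct btrfs_ordered_extent. */
struct ordered_extent_sketch {
	uint64_t file_offset; /* start of the reserved range in the file */
	uint64_t len;         /* length of the reserved range in bytes */
	uint64_t bytes_left;  /* bytes that have not been written yet */
};

/* Step 1: the extent is reserved and every byte in it is still pending. */
void ordered_extent_init(struct ordered_extent_sketch *oe,
			 uint64_t file_offset, uint64_t len)
{
	assert(len > 0);
	oe->file_offset = file_offset;
	oe->len = len;
	oe->bytes_left = len;
}

/*
 * Step 2: account for the written range [start, start + len).
 * Returns true only for the write that completes the extent, which is
 * the point where step 3 (inserting the extent and its checksums into
 * the trees) would run.
 */
bool ordered_extent_written(struct ordered_extent_sketch *oe,
			    uint64_t start, uint64_t len)
{
	uint64_t end = start + len;
	uint64_t oe_end = oe->file_offset + oe->len;
	uint64_t done;

	/* Clamp the written range to the part that overlaps this extent. */
	if (start < oe->file_offset)
		start = oe->file_offset;
	if (end > oe_end)
		end = oe_end;
	if (start >= end || oe->bytes_left == 0)
		return false;

	done = end - start;
	if (done > oe->bytes_left)
		done = oe->bytes_left;
	oe->bytes_left -= done;
	return oe->bytes_left == 0;
}
```

Because completion is detected per extent rather than per transaction, a commit in this model never has to wait for unrelated data writes to reach disk.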
*	Btrfs: Add locking around volume management (device add/remove/balance)	Chris Mason	2008-09-25	1	-0/+1
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Online btree defragmentation fixes	Chris Mason	2008-09-25	1	-1/+6
	The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag; it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add btree locking to the tree defragmentation code	Chris Mason	2008-09-25	1	-0/+2
	The online btree defragger is simplified and rewritten to use standard btree searches instead of a walk up / down mechanism. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Replace the transaction work queue with kthreads	Chris Mason	2008-09-25	1	-9/+4
	This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it	Chris Mason	2008-09-25	1	-0/+1
	Allocations may need to read in block groups from the extent allocation tree, which will require a tree search and take locks on the extent allocation tree. But, those locks might already be held in other places, leading to deadlocks. Since the alloc_mutex serializes everything right now, it is safe to skip the btree locking while caching block groups. A better fix will be to either create a recursive lock or find a way to back off existing locks while caching block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Drop locks in btrfs_search_slot when reading a tree block.	Chris Mason	2008-09-25	1	-1/+0
	One lock per btree block can make for significant congestion if everyone has to wait for IO at the high levels of the btree. This drops locks held by a path when doing reads during a tree search. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Replace the big fs_mutex with a collection of other locks	Chris Mason	2008-09-25	1	-2/+3
	Extent allocations are still protected by a large alloc_mutex.
	Objectid allocations are covered by an objectid mutex.
	Other btree operations are protected by a lock on individual btree nodes.
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Start btree concurrency work.	Chris Mason	2008-09-25	1	-8/+15
	The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation locking is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add a thread pool just for submit_bio	Chris Mason	2008-09-25	1	-0/+4
	If a bio submission is after a lock holder waiting for the bio on the work queue, it is possible to deadlock. Move the bios into their own pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: split out ioctl.c	Christoph Hellwig	2008-09-25	1	-1/+8
	Split the ioctl handling out of inode.c into a file of its own. Also fix up checkpatch.pl warnings for the moved code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add a mount option to control worker thread pool size	Chris Mason	2008-09-25	1	-0/+1
	mount -o thread_pool_size changes the default, which is min(num_cpus + 2, 8). Larger thread pools would make more sense on very large disk arrays. This mount option controls the max size of each thread pool. There are multiple thread pools, so the total worker count will be larger than the mount option. Signed-off-by: Chris Mason <chris.mason@oracle.com>
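The sizing rule in the message above is simple enough to write out. The following is a minimal sketch, assuming a mount option value of 0 means "not given"; both function names are hypothetical and do not come from the btrfs source.

```c
#include <assert.h>

/* Hypothetical helper: the default stated above, min(num_cpus + 2, 8). */
int default_thread_pool_size(int num_cpus)
{
	int size;

	assert(num_cpus > 0);
	size = num_cpus + 2;
	return size < 8 ? size : 8;
}

/*
 * Hypothetical helper: a positive thread_pool_size mount option
 * replaces the default; 0 is assumed to mean the option was absent.
 */
int effective_thread_pool_size(int num_cpus, int mount_opt)
{
	return mount_opt > 0 ? mount_opt : default_thread_pool_size(num_cpus);
}
```

Since the limit applies to each pool separately, the total worker count is roughly this value multiplied by the number of pools.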
*	Btrfs: Add async worker threads for pre and post IO checksumming	Chris Mason	2008-09-25	1	-3/+11
	Btrfs has been using workqueues to spread the checksumming load across other CPUs in the system. But, workqueues only schedule work on the same CPU that queued the work, giving them a limited benefit for systems with higher CPU counts. This code adds a generic facility to schedule work with pools of kthreads, and changes the bio submission code to queue bios up. The queueing is important to make sure large numbers of procs on the system don't turn streaming workloads into random workloads by sending IO down concurrently. The end result of all of this is much higher performance (and CPU usage) when doing checksumming on large machines. Two worker pools are created, one for writes and one for endio processing. The two could deadlock if we tried to service both from a single pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	btrfs: sanity mount option parsing and early mount code	Christoph Hellwig	2008-09-25	1	-2/+1
	Also adds lots of comments to describe what's going on here. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: transaction ioctls	Sage Weil	2008-09-25	1	-0/+4
	These ioctls let a user application hold a transaction open while it performs a series of operations. A final ioctl does a sync on the fs (closing the current transaction). This is the main requirement for Ceph's OSD to be able to keep the data it's storing in a btrfs volume consistent, and AFAICS it works just fine. The application would do something like
		fd = ::open("some/file", O_RDONLY);
		::ioctl(fd, BTRFS_IOC_TRANS_START);
		/* do a bunch of stuff */
		::ioctl(fd, BTRFS_IOC_TRANS_END);
	or just
		::close(fd);
	And to ensure it commits to disk,
		::ioctl(fd, BTRFS_IOC_SYNC);
	When a transaction is held open, the trans_handle is attached to the struct file (via private_data) so that it will get cleaned up if the process dies unexpectedly. A held transaction is also ended on fsync() to avoid a deadlock. A misbehaving application could also deliberately hold a transaction open, effectively locking up the FS, so it may make sense to restrict something like this to root or something.
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Invalidate dcache entry after creating snapshot and	Sven Wegener	2008-09-25	1	-0/+3
	We need to invalidate an existing dcache entry after creating a new snapshot or subvolume, because a negative dcache entry will stop us from accessing the new snapshot or subvolume.
	---
	 ctree.h       | 23 +++++++++++++++++++++++
	 inode.c       |  4 ++++
	 transaction.c |  4 ++++
	 3 files changed, 31 insertions(+)
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Allocator fix variety pack	Chris Mason	2008-09-25	1	-0/+2
	* Force chunk allocation when find_free_extent has to do a full scan
	* Record the max key at the start of defrag so it doesn't run forever
	* Block groups might not be contiguous, make a forward search for the next block group in extent-tree.c
	* Get rid of extra checks for total fs size
	* Fix relocate_one_reference to avoid relocating the same file data block twice when referenced by an older transaction
	* Use the open device count when allocating chunks so that we don't try to allocate from devices that don't exist
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Change the congestion functions to meter the number of async submits as well	Chris Mason	2008-09-25	1	-0/+1
	The async submit workqueue was absorbing too many requests, leading to long stalls where the async submitters were stalling. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add mount -o degraded to allow mounts to continue with missing devices	Chris Mason	2008-09-25	1	-0/+3
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Update nodatacow mode to support cloned single files and resizing	Chris Mason	2008-09-25	1	-1/+1
	Before, nodatacow only checked to make sure multiple roots didn't have references on a single extent. This check makes sure that multiple inodes don't have references. nodatacow needed an extra check to see if the block group was currently readonly. This way cows forced by the chunk relocation code are honored. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Properly find the root for snapshotted blocks during chunk relocation	Chris Mason	2008-09-25	1	-0/+2
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add support for online device removal	Chris Mason	2008-09-25	1	-1/+2
	This required a few structural changes to the code that manages bdev pointers: The VFS super block now gets an anon-bdev instead of a pointer to the lowest bdev. This allows us to avoid swapping the super block bdev pointer around at run time. The code to read in the super block no longer goes through the extent buffer interface. Things got ugly keeping the mapping constant. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Clone file data ioctl	Sage Weil	2008-09-25	1	-2/+2
	Add a new ioctl to clone file data. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add balance ioctl to restripe the chunks	Chris Mason	2008-09-25	1	-1/+1
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add new ioctl to add devices	Chris Mason	2008-09-25	1	-0/+2
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Make the resizer work based on shrinking and growing devices	Chris Mason	2008-09-25	1	-0/+1
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add support for labels in the super block	Chris Mason	2008-09-25	1	-0/+2
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Check device uuids along with devids	Chris Mason	2008-09-25	1	-0/+5
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Write bio checksumming outside the FS mutex	Chris Mason	2008-09-25	1	-1/+3
	This significantly improves streaming write performance by allowing concurrency in the data checksumming. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Create a work queue for bio writes	Chris Mason	2008-09-25	1	-0/+3
	This allows checksumming to happen in parallel among many cpus, and keeps us from bogging down pdflush with the checksumming code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add RAID10 support	Chris Mason	2008-09-25	1	-0/+7
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add chunk uuids and update multi-device back references	Chris Mason	2008-09-25	1	-13/+67
	Block headers now store the chunk tree uuid.
	Chunk items record the device uuid for each stripe.
	Device extent items record better back refs to the chunk tree.
	Block groups record better back refs to the chunk tree.
	The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY used to be the logical offset of the chunk. Now it is a chunk tree id, with the logical offset being stored in the offset field of the key. This allows a single chunk tree to record multiple logical address spaces, upping the number of bytes indexed by a chunk tree from 2^64 to 2^128.
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Add a min size parameter to btrfs_alloc_extent	Chris Mason	2008-09-25	1	-1/+2
	On huge machines, delayed allocation may try to allocate massive extents. This change allows btrfs_alloc_extent to return something smaller than the caller asked for, and the data allocation routines will loop over the allocations until they fill the whole delayed alloc. Signed-off-by: Chris Mason <chris.mason@oracle.com>
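The retry loop this message describes might look roughly like the sketch below. It is not the btrfs allocator: `alloc_extent_sketch`, `fill_delalloc_sketch` and the `largest_free` parameter are hypothetical, and free space is modeled as a single upper bound on what one call can grant.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical allocator: grants at most largest_free bytes per call and
 * may return less than num_bytes, but never less than min_size.
 * Returns 0 when it cannot satisfy min_size.
 */
uint64_t alloc_extent_sketch(uint64_t num_bytes, uint64_t min_size,
			     uint64_t largest_free)
{
	uint64_t got = num_bytes < largest_free ? num_bytes : largest_free;

	return got >= min_size ? got : 0;
}

/*
 * Hypothetical caller: keeps allocating until the whole delayed
 * allocation of `total` bytes is covered. Returns the number of extents
 * used, or -1 if an allocation fails.
 */
int fill_delalloc_sketch(uint64_t total, uint64_t min_size,
			 uint64_t largest_free)
{
	int extents = 0;

	assert(min_size > 0);
	while (total > 0) {
		/* The final piece may be smaller than min_size. */
		uint64_t min = min_size < total ? min_size : total;
		uint64_t got = alloc_extent_sketch(total, min, largest_free);

		if (got == 0)
			return -1;
		total -= got;
		extents++;
	}
	return extents;
}
```

With these assumptions, a 10 MiB delayed allocation against 4 MiB of contiguous free space is covered by three extents (4 MiB, 4 MiB, 2 MiB) instead of failing outright.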
*	Btrfs: Do metadata checksums for reads via a workqueue	Chris Mason	2008-09-25	1	-0/+4
	Before, metadata checksumming was done by the callers of read_tree_block, which would set EXTENT_CSUM bits in the extent tree to show that a given range of pages was already checksummed and didn't need to be verified again. But, those bits could go away via try_to_releasepage, and the end result was bogus checksum failures on pages that never left the cache. The new code validates checksums when the page is read. It is a little tricky because metadata blocks can span pages and a single read may end up going via multiple bios. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Fix allocation profile init	Chris Mason	2008-09-25	1	-6/+7
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add support for duplicate blocks on a single spindle	Chris Mason	2008-09-25	1	-0/+1
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add support for mirroring across drives	Chris Mason	2008-09-25	1	-2/+7
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Reorder the flags field in struct btrfs_header and record a flag on writeout	Chris Mason	2008-09-25	1	-3/+25
	This allows detection of blocks that have already been written in the running transaction so they can be recowed instead of modified again. It is step one in trusting the transid field of the block pointers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Create a btrfs backing dev info	Chris Mason	2008-09-25	1	-0/+2
	This allows intelligent versions of unplug and congestion functions. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Implement raid0 when multiple devices are present	Chris Mason	2008-09-25	1	-0/+3
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add support for device scanning and detection ioctls	Chris Mason	2008-09-25	1	-2/+19
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Bring back mount -o ssd optimizations	Chris Mason	2008-09-25	1	-0/+3
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Move device information into the super block so it can be scanned	Chris Mason	2008-09-25	1	-19/+2
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Make the FS tree the last objectid in the tree of tree roots	Chris Mason	2008-09-25	1	-9/+8
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Dynamic chunk and block group allocation	Chris Mason	2008-09-25	1	-1/+11
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add support for multiple devices per filesystem	Chris Mason	2008-09-25	1	-27/+286
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: checksum file data at bio submission time instead of during writepage	Chris Mason	2008-09-25	1	-5/+3
	When we checksum file data during writepage, the checksumming is done one page at a time, making it difficult to do bulk metadata modifications to insert checksums for large ranges of the file at once. This patch changes btrfs to checksum on a per-bio basis instead. The bios are checksummed before they are handed off to the block layer, so each bio is contiguous and only has pages from the same inode. Checksumming on a bio basis allows us to insert and modify the file checksum items in large groups. It also allows the checksumming to be done more easily by async worker threads. Signed-off-by: Chris Mason <chris.mason@oracle.com>