op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	Btrfs: Change btree locking to use explicit blocking points	Chris Mason	2009-02-04	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Most of the btrfs metadata operations can be protected by a spinlock, but some operations still need to schedule. So far, btrfs has been using a mutex along with a trylock loop, most of the time it is able to avoid going for the full mutex, so the trylock loop is a big performance gain. This commit is step one for getting rid of the blocking locks entirely. btrfs_tree_lock takes a spinlock, and the code explicitly switches to a blocking lock when it starts an operation that can schedule. We'll be able get rid of the blocking locks in smaller pieces over time. Tracing allows us to find the most common cause of blocking, so we can start with the hot spots first. The basic idea is: btrfs_tree_lock() returns with the spin lock held btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in the extent buffer flags, and then drops the spin lock. The buffer is still considered locked by all of the btrfs code. If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops the spin lock and waits on a wait queue for the blocking bit to go away. Much of the code that needs to set the blocking bit finishes without actually blocking a good percentage of the time. So, an adaptive spin is still used against the blocking bit to avoid very high context switch rates. btrfs_clear_lock_blocking() clears the blocking bit and returns with the spinlock held again. btrfs_tree_unlock() can be called on either blocking or spinning locks, it does the right thing based on the blocking bit. ctree.c has a helper function to set/clear all the locked buffers in a path as blocking. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: fix tree logs parallel sync	Yan Zheng	2009-01-21	1	-184/+166
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To improve performance, btrfs_sync_log merges tree log sync requests. But it wrongly merges sync requests for different tree logs. If multiple tree logs are synced at the same time, only one of them actually gets synced. This patch has following changes to fix the bug: Move most tree log related fields in btrfs_fs_info to btrfs_root. This allows merging sync requests separately for each tree log. Don't insert root item into the log root tree immediately after log tree is allocated. Root item for log tree is inserted when log tree get synced for the first time. This allows syncing the log root tree without first syncing all log trees. At tree-log sync, btrfs_sync_log first sync the log tree; then updates corresponding root item in the log root tree; sync the log root tree; then update the super block. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
*	Btrfs: explicitly mark the tree log root for writeback	Chris Mason	2009-01-09	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Each subvolume has an extent_state_tree used to mark metadata that needs to be sent to disk while syncing the tree. This is used in addition to the dirty bits on the pages themselves so that a single subvolume can be sent to disk efficiently in disk order. Normally this marking happens in btrfs_alloc_free_block, which also does special recording of dirty tree blocks for the tree log roots. Yan Zheng noticed that when the root of the log tree is allocated, it is added to the wrong writeback list. The fix used here is to explicitly set it dirty as part of tree log creation. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: tree logging checksum fixes	Yan Zheng	2009-01-06	1	-202/+91
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch contains following things. 1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE. This struct is kmalloced so we want to keep it reasonable. 2) Replace copy_extent_csums by btrfs_lookup_csums_range. This was duplicated code in tree-log.c 3) Remove replay_one_csum. csum items are replayed at the same time as replaying file extents. This guarantees we only replay useful csums. 4) nbytes accounting fix. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
*	Btrfs: Fix checkpatch.pl warnings	Chris Mason	2009-01-05	1	-36/+34
\| \| \| \| \| \| \|	There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: avoid orphan inode caused by log replay	Yan Zheng	2009-01-05	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	drop_one_dir_item does not properly update inode's link count. It can be reproduced by executing following commands: #touch test #sync #rm -f test #dd if=/dev/zero bs=4k count=1 of=test conv=fsync #echo b > /proc/sysrq-trigger This fixes it by adding an BTRFS_ORPHAN_ITEM_KEY for the inode Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
*	Btrfs: properly check free space for tree balancing	Yan Zheng	2008-12-17	1	-7/+2
\| \| \| \| \| \| \| \| \| \| \| \|	btrfs_insert_empty_items takes the space needed by the btrfs_item structure into account when calculating the required free space. So the tree balancing code shouldn't add sizeof(struct btrfs_item) to the size when checking the free space. This patch removes these superfluous additions. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
*	Btrfs: Fix compressed checksum fsync log copies	Chris Mason	2008-12-08	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \|	The fsync logging code makes sure to onl copy the relevant checksum for each extent based on the file extent pointers it finds. But for compressed extents, it needs to copy the checksum for the entire extent. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: superblock duplication	Yan Zheng	2008-12-08	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \|	This patch implements superblock duplication. Superblocks are stored at offset 16K, 64M and 256G on every devices. Spaces used by superblocks are preserved by the allocator, which uses a reverse mapping function to find the logical addresses that correspond to superblocks. Thank you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
*	Btrfs: move data checksumming into a dedicated tree	Chris Mason	2008-12-08	1	-13/+108
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Btrfs stores checksums for each data block. Until now, they have been stored in the subvolume trees, indexed by the inode that is referencing the data block. This means that when we read the inode, we've probably read in at least some checksums as well. But, this has a few problems: * The checksums are indexed by logical offset in the file. When compression is on, this means we have to do the expensive checksumming on the uncompressed data. It would be faster if we could checksum the compressed data instead. * If we implement encryption, we'll be checksumming the plain text and storing that on disk. This is significantly less secure. * For either compression or encryption, we have to get the plain text back before we can verify the checksum as correct. This makes the raid layer balancing and extent moving much more expensive. * It makes the front end caching code more complex, as we have touch the subvolume and inodes as we cache extents. * There is potentitally one copy of the checksum in each subvolume referencing an extent. The solution used here is to store the extent checksums in a dedicated tree. This allows us to index the checksums by phyiscal extent start and length. It means: * The checksum is against the data stored on disk, after any compression or encryption is done. * The checksum is stored in a central location, and can be verified without following back references, or reading inodes. This makes compression significantly faster by reducing the amount of data that needs to be checksummed. It will also allow much faster raid management code in general. The checksums are indexed by a key with a fixed objectid (a magic value in ctree.h) and offset set to the starting byte of the extent. This allows us to copy the checksum items into the fsync log tree directly (or any other tree), without having to invent a second format for them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: add support for multiple csum algorithms	Josef Bacik	2008-12-02	1	-4/+6
\| \| \| \| \| \| \| \| \|	This patch gives us the space we will need in order to have different csum algorithims at some point in the future. We save the csum algorithim type in the superblock, and use those instead of define's. Signed-off-by: Josef Bacik <jbacik@redhat.com>
*	Btrfs: make things static and include the right headers	Christoph Hellwig	2008-12-02	1	-2/+3
\| \| \| \| \| \| \| \|	Shut up various sparse warnings about symbols that should be either static or have their declarations in scope. Signed-off-by: Christoph Hellwig <hch@lst.de>
*	Btrfs: Add fallocate support v2	Yan Zheng	2008-10-30	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \|	This patch updates btrfs-progs for fallocate support. fallocate is a little different in Btrfs because we need to tell the COW system that a given preallocated extent doesn't need to be cow'd as long as there are no snapshots of it. This leverages the -o nodatacow checks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
*	Btrfs: Add root tree pointer transaction ids	Yan Zheng	2008-10-29	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \|	This patch adds transaction IDs to root tree pointers. Transaction IDs in tree pointers are compared with the generation numbers in block headers when reading root blocks of trees. This can detect some types of IO errors. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
*	Btrfs: nuke fs wide allocation mutex V2	Josef Bacik	2008-10-29	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch of little locks. There is now a pinned_mutex, which is used when messing with the pinned_extents extent io tree, and the extent_ins_mutex which is used with the pending_del and extent_ins extent io trees. The locking for the extent tree stuff was inspired by a patch that Yan Zheng wrote to fix a race condition, I cleaned it up some and changed the locking around a little bit, but the idea remains the same. Basically instead of holding the extent_ins_mutex throughout the processing of an extent on the extent_ins or pending_del trees, we just hold it while we're searching and when we clear the bits on those trees, and lock the extent for the duration of the operations on the extent. Also to keep from getting hung up waiting to lock an extent, I've added a try_lock_extent so if we cannot lock the extent, move on to the next one in the tree and we'll come back to that one. I have tested this heavily and it does not appear to break anything. This has to be applied on top of my find_free_extent redo patch. I tested this patch on top of Yan's space reblancing code and it worked fine. The only thing that has changed since the last version is I pulled out all my debugging stuff, apparently I forgot to run guilt refresh before I sent the last patch out. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>
*	Btrfs: Add zlib compression support	Chris Mason	2008-10-29	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a large change for adding compression on reading and writing, both for inline and regular extents. It does some fairly large surgery to the writeback paths. Compression is off by default and enabled by mount -o compress. Even when the -o compress mount option is not used, it is possible to read compressed extents off the disk. If compression for a given set of pages fails to make them smaller, the file is flagged to avoid future compression attempts later. * While finding delalloc extents, the pages are locked before being sent down to the delalloc handler. This allows the delalloc handler to do complex things such as cleaning the pages, marking them writeback and starting IO on their behalf. * Inline extents are inserted at delalloc time now. This allows us to compress the data before inserting the inline extent, and it allows us to insert an inline extent that spans multiple pages. * All of the in-memory extent representations (extent_map.c, ordered-data.c etc) are changed to record both an in-memory size and an on disk size, as well as a flag for compression. From a disk format point of view, the extent pointers in the file are changed to record the on disk size of a given extent and some encoding flags. Space in the disk format is allocated for compression encoding, as well as encryption and a generic 'other' field. Neither the encryption or the 'other' field are currently used. In order to limit the amount of data read for a single random read in the file, the size of a compressed extent is limited to 128k. This is a software only limit, the disk format supports u64 sized compressed extents. In order to limit the ram consumed while processing extents, the uncompressed size of a compressed extent is limited to 256k. This is a software only limit and will be subject to tuning later. Checksumming is still done on compressed extents, and it is done on the uncompressed version of the data. This way additional encodings can be layered on without having to figure out which encoding to checksum. Compression happens at delalloc time, which is basically singled threaded because it is usually done by a single pdflush thread. This makes it tricky to spread the compression load across all the cpus on the box. We'll have to look at parallel pdflush walks of dirty inodes at a later time. Decompression is hooked into readpages and it does spread across CPUs nicely. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Remove offset field from struct btrfs_extent_ref	Yan Zheng	2008-10-09	1	-5/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The offset field in struct btrfs_extent_ref records the position inside file that file extent is referenced by. In the new back reference system, tree leaves holding references to file extent are recorded explicitly. We can scan these tree leaves very quickly, so the offset field is not required. This patch also makes the back reference system check the objectid when extents are in deleting. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
*	Btrfs: Count space allocated to file in bytes	Yan Zheng	2008-10-09	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \|	This patch makes btrfs count space allocated to file in bytes instead of 512 byte sectors. Everything else in btrfs uses a byte count instead of sector sizes or blocks sizes, so this fits better. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
*	Btrfs: Full back reference support	Zheng Yan	2008-09-25	1	-50/+58
\| \| \| \| \| \| \| \| \| \|	This patch makes the back reference system to explicit record the location of parent node for all types of extents. The location of parent node is placed into the offset field of backref key. Every time a tree block is balanced, the back references for the affected lower level extents are updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Disable the dir fsync optimization to skip logging the dir sometimes	Chris Mason	2008-09-25	1	-2/+1
\| \| \| \| \| \|	More testing has turned up a bug, disable this for now. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Record dirty pages tree-log pages in an extent_io tree	Chris Mason	2008-09-25	1	-14/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is the same way the transaction code makes sure that all the other tree blocks are safely on disk. There's an extent_io tree for each root, and any blocks allocated to the tree logs are recorded in that tree. At tree-log sync, the extent_io tree is walked to flush down the dirty pages and wait for them. The main benefit is less time spent walking the tree log and skipping clean pages, and getting sequential IO down to the drive. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Copy into the log tree in big batches	Chris Mason	2008-09-25	1	-61/+122
\| \| \| \| \| \| \|	This changes the log tree copy code to use btrfs_insert_items and to work in larger batches where possible. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Optimize tree log block allocations	Chris Mason	2008-09-25	1	-13/+7
\| \| \| \| \| \| \| \|	Since tree log blocks get freed every transaction, they never really need to be written to disk. This skips the step where we update metadata to record they were allocated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Optimize btree walking while logging inodes	Chris Mason	2008-09-25	1	-6/+19
\| \| \| \|	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Dir fsync optimizations	Chris Mason	2008-09-25	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	Drop i_mutex during the commit Don't bother doing the fsync at all unless the dir is marked as dirtied and needing fsync in this transaction. For directories, this means that someone has unlinked a file from the dir without fsyncing the file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Update the highest objectid in a root after log replay is done	Chris Mason	2008-09-25	1	-0/+7
\| \| \| \|	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Tree logging fixes	Chris Mason	2008-09-25	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* Pin down data blocks to prevent them from being reallocated like so: trans 1: allocate file extent trans 2: free file extent trans 3: free file extent during old snapshot deletion trans 3: allocate file extent to new file trans 3: fsync new file Before the tree logging code, this was legal because the fsync would commit the transation that did the final data extent free and the transaction that allocated the extent to the new file at the same time. With the tree logging code, the tree log subtransaction can commit before the transaction that freed the extent. If we crash, we're left with two different files using the extent. * Don't wait in start_transaction if log replay is going on. This avoids deadlocks from iput while we're cleaning up link counts in the replay code. * Don't deadlock in replay_one_name by trying to read an inode off the disk while holding paths for the directory * Hold the buffer lock while we mark a buffer as written. This closes a race where someone is changing a buffer while we write it. They are supposed to mark it dirty again after they change it, but this violates the cow rules. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add a write ahead tree log to optimize synchronous operations	Chris Mason	2008-09-25	1	-0/+2804
	File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <chris.mason@oracle.com>