op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message	Author	Age	Files	Lines
*	Remove Btrfs compat code for older kernels	Chris Mason	2008-09-25	1	-23/+1
	Btrfs had compatibility code for kernels back to 2.6.18. This code has been removed and will be maintained in a separate backport git tree from now on. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Full back reference support	Zheng Yan	2008-09-25	1	-65/+56
	This patch makes the back reference system explicitly record the location of the parent node for all types of extents. The location of the parent node is placed into the offset field of the backref key. Every time a tree block is balanced, the back references for the affected lower level extents are updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Dir fsync optimizations	Chris Mason	2008-09-25	1	-1/+18
	Drop i_mutex during the commit. Don't bother doing the fsync at all unless the dir is marked as dirtied and needing fsync in this transaction. For directories, this means that someone has unlinked a file from the dir without fsyncing the file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add a write ahead tree log to optimize synchronous operations	Chris Mason	2008-09-25	1	-5/+34
	File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add debugging checks to track down corrupted metadata	Chris Mason	2008-09-25	1	-7/+8
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them	Chris Mason	2008-09-25	1	-2/+1
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Improve and cleanup locking done by walk_down_tree	Chris Mason	2008-09-25	1	-5/+8
	While dropping snapshots, walk_down_tree does most of the work of checking reference counts and limiting tree traversal to just the blocks that we are freeing. It dropped and held the allocation mutex in strange and confusing ways; this commit changes it to only hold the mutex while actually freeing a block. The rest of the checks around reference counts should be safe without the lock because we only allow one process in btrfs_drop_snapshot at a time. Other processes dropping reference counts should not drop it to 1 because their tree roots already have an extra ref on the block. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Drop some debugging around the extent_map pinned flag	Chris Mason	2008-09-25	1	-9/+1
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Throttle tuning	Chris Mason	2008-09-25	1	-1/+1
	This avoids waiting for transactions with pages locked by breaking out the code to wait for the current transaction to close into a function called by btrfs_throttle. It also lowers the limits for where we start throttling. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add compatibility for kernels >= 2.6.27-rc1	Sven Wegener	2008-09-25	1	-0/+4
	Add a couple of #if's to follow API changes. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: implement memory reclaim for leaf reference cache	Yan	2008-09-25	1	-1/+0
	The memory reclaim issue occurs when a snapshot exists. In that case, some cache entries may not be used while an old snapshot is being dropped, so they remain in the cache until umount. The patch adds a field to struct btrfs_leaf_ref to record its creation time. In addition, the patch links all dead roots of a given snapshot together in order of creation time. After an old snapshot has been completely dropped, we check the dead root list and remove all cache entries created before the oldest dead root in the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Throttle operations if the reference cache gets too large	Chris Mason	2008-09-25	1	-0/+1
	A large reference cache is directly related to a lot of work pending for the cleaner thread. This throttles back new operations based on the size of the reference cache so the cleaner thread will be able to keep up. Overall, this actually makes the FS faster because the cleaner thread will be more likely to find things in cache. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Leaf reference cache update	Chris Mason	2008-09-25	1	-1/+1
	This changes the reference cache to make a single cache per root instead of one cache per transaction, and to key by the byte number of the disk block instead of the keys inside. This makes it much less likely to have cache misses if a snapshot or something has an extra reference on a higher node or a leaf while the first transaction that added the leaf into the cache is dropping. Some throttling is added to functions that free blocks heavily so they wait for old transactions to drop. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Fix some data=ordered related data corruptions	Chris Mason	2008-09-25	1	-8/+7
	Stress testing was showing data checksum errors, most of which were caused by a lookup bug in the extent_map tree. The tree was caching the last pointer returned, and searches would check the last pointer first. But, search callers also expect the search to return the very first matching extent in the range, which wasn't always true with the last pointer usage. For now, the code to cache the last return value is just removed. It is easy to fix, but I think lookups are rare enough that it isn't required anymore. This commit also replaces do_sync_mapping_range with a local copy of the related functions. Signed-off-by: Chris Mason <chris.mason@oracle.com>
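The "first matching extent in the range" contract can be illustrated with a minimal sketch. It assumes a sorted array of non-overlapping extents instead of the tree btrfs uses, and `struct extent` and `first_extent_in_range` are hypothetical names chosen for the example, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent covering bytes [start, start + len). */
struct extent {
	uint64_t start;
	uint64_t len;
};

/*
 * Return the first extent (lowest start) that overlaps
 * [start, start + len), or NULL if none does. The array must be sorted
 * by start and must not contain overlapping extents; len must be > 0.
 */
static const struct extent *
first_extent_in_range(const struct extent *map, size_t n,
		      uint64_t start, uint64_t len)
{
	uint64_t end = start + len;	/* exclusive end of the search range */
	size_t lo = 0, hi = n;

	assert(len > 0);

	/* Binary search for the first extent that ends after 'start'. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (map[mid].start + map[mid].len <= start)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* That extent matches only if it also begins before 'end'. */
	if (lo < n && map[lo].start < end)
		return &map[lo];
	return NULL;
}
```

A cached "last returned" pointer can overlap the requested range and still not be the lowest extent in it, which is why a shortcut of that kind breaks callers relying on this contract.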
*	Btrfs: Data ordered fixes	Chris Mason	2008-09-25	1	-0/+1
	* In btrfs_delete_inode, wait for ordered extents after calling truncate_inode_pages. This is much faster and more correct.
	* Properly clear out the PageChecked bit everywhere we redirty the page.
	* Change the writepage fixup handler to lock the page range and check to see if an ordered extent had been inserted since the improperly dirtied page was discovered.
	* Wait for ordered extents outside the transaction. This isn't required for locking rules but does improve transaction latencies.
	* Reduce contention on the alloc_mutex by dropping it while incrementing refs on a node/leaf and while dropping refs on a leaf.
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Keep extent mappings in ram until pending ordered extents are done	Chris Mason	2008-09-25	1	-4/+10
	It was possible for stale mappings from disk to be used instead of the new pending ordered extent. This adds a flag to the extent map struct to keep it pinned until the pending ordered extent is actually on disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Add a per-inode lock around btrfs_drop_extents	Chris Mason	2008-09-25	1	-0/+8
	btrfs_drop_extents is always called with a range lock held on the inode. But, it may operate on extents outside that range as it drops and splits them. This patch adds a per-inode mutex that is held while calling btrfs_drop_extents and while inserting new extents into the tree. It prevents races from two procs working against adjacent ranges in the tree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Don't pin pages in ram until the entire ordered extent is on disk.	Chris Mason	2008-09-25	1	-1/+1
	Checksum items are not inserted until the entire ordered extent is on disk, but individual pages might be clean and available for reclaim long before the whole extent is on disk. In order to allow those pages to be freed, we need to be able to search the list of ordered extents to find the checksum that is going to be inserted in the tree. This way if the page needs to be read back in before the checksums are in the btree, we'll be able to verify the checksum on the page. This commit adds the ability to search the pending ordered extents for a given offset in the file, and changes btrfs_releasepage to allow ordered pages to be freed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	btrfs_start_transaction: wait for commits in progress to finish	Chris Mason	2008-09-25	1	-1/+1
	btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Update on disk i_size only after pending ordered extents are done	Chris Mason	2008-09-25	1	-1/+1
	This changes the ordered data code to update i_size after the extent is on disk. An on disk i_size is maintained in the in-memory btrfs inode structures, and this is updated as extents finish. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Use async helpers to deal with pages that have been improperly dirtied	Chris Mason	2008-09-25	1	-0/+1
	Higher layers sometimes call set_page_dirty without asking the filesystem to help. This causes many problems for the data=ordered and cow code. This commit detects pages that haven't been properly set up for IO and kicks off an async helper to deal with them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: New data=ordered implementation	Chris Mason	2008-09-25	1	-22/+45
	The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work:
	* When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated and inserted into a per-inode rbtree to track the pending extents.
	* As each page is written, EXTENT_ORDERED is cleared on the bytes corresponding to that page.
	* When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, the previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated.
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
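The completion rule in the last bullet can be sketched with a byte counter. This is a simplification under stated assumptions: the commit describes clearing EXTENT_ORDERED bits per page, whereas this example only counts outstanding bytes, and `struct ordered_extent` with its helpers are hypothetical stand-ins for struct btrfs_ordered_extent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct btrfs_ordered_extent. */
struct ordered_extent {
	uint64_t file_offset;	/* first byte of the reserved extent */
	uint64_t len;		/* total bytes in the extent */
	uint64_t bytes_left;	/* bytes not yet written back */
};

/* Start tracking a freshly reserved extent: nothing is on disk yet. */
static void ordered_extent_init(struct ordered_extent *oe,
				uint64_t file_offset, uint64_t len)
{
	oe->file_offset = file_offset;
	oe->len = len;
	oe->bytes_left = len;
}

/*
 * Record that 'bytes' more bytes of the extent reached disk.
 * Returns true only for the write that completes the extent, which is
 * the point where the extent and its checksums would be inserted into
 * the btree.
 */
static bool ordered_extent_io_done(struct ordered_extent *oe, uint64_t bytes)
{
	assert(bytes > 0 && bytes <= oe->bytes_left);
	oe->bytes_left -= bytes;
	return oe->bytes_left == 0;
}
```

Because the metadata update happens only when the counter reaches zero, a commit never has to wait for data writeback; it simply does not see the extent until the data is on disk.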
*	Btrfs: Add a per-inode csum mutex to avoid races creating csum items	Chris Mason	2008-09-25	1	-3/+3
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Add btrfs_end_transaction_throttle to force writers to wait for pending commits	Chris Mason	2008-09-25	1	-1/+0
	The existing throttle mechanism was often not sufficient to prevent new writers from coming in and making a given transaction run forever. This adds an explicit wait at the end of most operations so they will allow the current transaction to close. There is no wait inside file_write, inode updates, or cow filling, all of which have different deadlock possibilities. This is a temporary measure until better asynchronous commit support is added. This code leads to stalls as it waits for data=ordered writeback, and it really needs to be fixed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks	Chris Mason	2008-09-25	1	-1/+1
	This allows us to delete an unlinked inode with dirty pages from the list instead of forcing commit to write these out before deleting the inode. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Replace the big fs_mutex with a collection of other locks	Chris Mason	2008-09-25	1	-6/+1
	Extent allocations are still protected by a large alloc_mutex. Objectid allocations are covered by an objectid mutex. Other btree operations are protected by a lock on individual btree nodes. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: transaction ioctls	Sage Weil	2008-09-25	1	-1/+6
	These ioctls let a user application hold a transaction open while it performs a series of operations. A final ioctl does a sync on the fs (closing the current transaction). This is the main requirement for Ceph's OSD to be able to keep the data it's storing in a btrfs volume consistent, and AFAICS it works just fine.
	The application would do something like
		fd = ::open("some/file", O_RDONLY);
		::ioctl(fd, BTRFS_IOC_TRANS_START);
		/* do a bunch of stuff */
		::ioctl(fd, BTRFS_IOC_TRANS_END);
	or just
		::close(fd);
	And to ensure it commits to disk,
		::ioctl(fd, BTRFS_IOC_SYNC);
	When a transaction is held open, the trans_handle is attached to the struct file (via private_data) so that it will get cleaned up if the process dies unexpectedly. A held transaction is also ended on fsync() to avoid a deadlock. A misbehaving application could also deliberately hold a transaction open, effectively locking up the FS, so it may make sense to restrict something like this to root or something.
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
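The start / work / end / sync sequence from this commit message could be wrapped in a small user-space helper along the following lines. This is a sketch, not code from the patch: `run_in_transaction` and `example_work` are hypothetical names, and the ioctl request codes are passed in as parameters on the assumption that the caller takes BTRFS_IOC_TRANS_START, BTRFS_IOC_TRANS_END and BTRFS_IOC_SYNC from the btrfs ioctl header of the kernel in use.

```c
#include <assert.h>
#include <sys/ioctl.h>

/*
 * Run work(fd) while a transaction is held open on the filesystem
 * behind fd. The request codes are supplied by the caller (assumed to
 * be the btrfs TRANS_START, TRANS_END and SYNC ioctls).
 * Returns 0 on success and -1 on the first failure.
 */
static int run_in_transaction(int fd, unsigned long start_req,
			      unsigned long end_req, unsigned long sync_req,
			      int (*work)(int fd))
{
	int ret;

	if (ioctl(fd, start_req) < 0)
		return -1;	/* nothing is held, so nothing to undo */

	ret = work(fd);		/* the "bunch of stuff" */

	/* Always end the held transaction, even if the work failed. */
	if (ioctl(fd, end_req) < 0)
		ret = -1;

	/* Request a sync so the closed transaction is committed to disk. */
	if (ret == 0 && ioctl(fd, sync_req) < 0)
		ret = -1;

	return ret ? -1 : 0;
}

/* Placeholder callback; a real caller would modify files here. */
static int example_work(int fd)
{
	(void)fd;
	return 0;
}
```

Ending the transaction on every path matters because, as the message notes, a transaction that stays open can effectively lock up the filesystem.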
*	btrfs delete ordered inode handling fix	Mingming	2008-09-25	1	-0/+7
	Use btrfs_release_file instead of a put_inode call. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Fix corners in writepage and btrfs_truncate_page	Chris Mason	2008-09-25	1	-8/+0
	The extent_io writepage calls needed an extra check for discarding pages that started on the last byte in the file. btrfs_truncate_page needed checks to make sure the page was still part of the file after reading it, and most importantly, needed to wait for all IO to the page to finish before freeing the corresponding extents on disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add workaround for AppArmor changing remove_suid()	Jeff Mahoney	2008-09-25	1	-0/+5
	In openSUSE 10.3, AppArmor modifies remove_suid to take a struct path rather than just a dentry. This patch tests that the kernel is openSUSE 10.3 or newer and adjusts the call accordingly. Debian/Ubuntu with AppArmor applied will also need a similar patch. Maintainers of btrfs under those distributions should build on this patch or, alternatively, alter their package descriptions to add -DREMOVE_SUID_PATH to the compiler command line.
	Signed-off-by: Jeff Mahoney <jeffm@suse.com>
	- --- /dev/null	1970-01-01 00:00:00.000000000 +0000
	+++ b/compat.h	2008-02-06 16:46:13.000000000 -0500
	@@ -0,0 +1,15 @@
	+#ifndef _COMPAT_H_
	+#define _COMPAT_H_
	+
	+
	+/*
	+ * Even if AppArmor isn't enabled, it still has different prototypes.
	+ * Add more distro/version pairs here to declare which has AppArmor applied.
	+ */
	+#if defined(CONFIG_SUSE_KERNEL)
	+# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,22)
	+# define REMOVE_SUID_PATH 1
	+# endif
	+#endif
	+
	+#endif /* _COMPAT_H_ */
	- --- a/file.c	2008-02-06 11:37:39.000000000 -0500
	+++ b/file.c	2008-02-06 16:46:23.000000000 -0500
	@@ -37,6 +37,7 @@
	 #include "ordered-data.h"
	 #include "ioctl.h"
	 #include "print-tree.h"
	+#include "compat.h"
	 
	 
	 static int btrfs_copy_from_user(loff_t pos, int num_pages, int write_bytes,
	@@ -790,7 +791,11 @@ static ssize_t btrfs_file_write(struct f
	 		goto out_nolock;
	 	if (count == 0)
	 		goto out_nolock;
	+#ifdef REMOVE_SUID_PATH
	+	err = remove_suid(&file->f_path);
	+#else
	 	err = remove_suid(fdentry(file));
	+#endif
	 	if (err)
	 		goto out_nolock;
	 	file_update_time(file);
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Fix do_sync_file_range ifdefs (2.6.22)	Chris Mason	2008-09-25	1	-1/+1
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Clone file data ioctl	Sage Weil	2008-09-25	1	-1/+1
	Add a new ioctl to clone file data. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Throttle file_write when data=ordered is flushing the inode	Chris Mason	2008-09-25	1	-0/+1
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Set nodatasum on the inode when written by a nodatasum mount	Chris Mason	2008-09-25	1	-0/+8
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Use the extent map cache to find the logical disk block during data retries	Chris Mason	2008-09-25	1	-1/+46
	The data read retry code needs to find the logical disk block before it can resubmit new bios. But, finding this block isn't allowed to take the fs_mutex because that will deadlock with a number of different callers. This changes the retry code to use the extent map cache instead, but that requires the extent map cache to have the extent we're looking for. This is a problem because btrfs_drop_extent_cache just drops the entire extent instead of the little tiny part it is invalidating. The bulk of the code in this patch changes btrfs_drop_extent_cache to invalidate only a portion of the extent cache, and changes btrfs_get_extent to deal with the results. Signed-off-by: Chris Mason <chris.mason@oracle.com>
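Invalidating only part of a cached extent amounts to splitting it around the dropped range. The sketch below shows just that arithmetic; `struct extent_range` and `drop_range` are hypothetical names, and a real mapping would presumably also need to adjust the disk block of the trailing piece, which is left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cached mapping covering bytes [start, start + len). */
struct extent_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Invalidate [drop_start, drop_start + drop_len) inside 'em'.
 * The surviving pieces are stored in 'front' and 'back'; a piece with
 * len == 0 does not exist. Returns the number of surviving pieces.
 */
static int drop_range(const struct extent_range *em,
		      uint64_t drop_start, uint64_t drop_len,
		      struct extent_range *front, struct extent_range *back)
{
	uint64_t em_end = em->start + em->len;		/* exclusive */
	uint64_t drop_end = drop_start + drop_len;	/* exclusive */
	int pieces = 0;

	front->start = em->start;
	front->len = 0;
	back->start = em_end;
	back->len = 0;

	/* No overlap: the whole extent survives as the front piece. */
	if (drop_len == 0 || drop_end <= em->start || drop_start >= em_end) {
		front->len = em->len;
		return em->len ? 1 : 0;
	}

	/* Bytes before the dropped range stay cached. */
	if (drop_start > em->start) {
		front->len = drop_start - em->start;
		pieces++;
	}

	/* Bytes after the dropped range stay cached. */
	if (drop_end < em_end) {
		back->start = drop_end;
		back->len = em_end - drop_end;
		pieces++;
	}

	return pieces;
}
```

Keeping the front and back pieces in the cache is what lets a later lookup, such as the read retry path, still find a mapping for bytes that were never invalidated.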
*	Btrfs: A few updates for 2.6.18 and versions older than 2.6.25	Chris Mason	2008-09-25	1	-1/+7
	This includes fixing a missing spinlock init call that caused oops on mount for most kernels other than 2.6.25. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add O_DIRECT read and write (writes == buffered + cache flush)	Chris Mason	2008-09-25	1	-2/+9
	This adds basic O_DIRECT read and write support. In the write case, we just do a normal buffered write followed by a cache flush. O_DIRECT + O_SYNC are required to trigger metadata syncs. In the read case, there is a basic btrfs_get_block call for use by the generic O_DIRECT code. This does honor multi-volume mapping rules but it skips all checksumming. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Properly cast before shifting	Chris Mason	2008-09-25	1	-1/+1
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Take the extent lock before dropping the delalloc bits	Chris Mason	2008-09-25	1	-0/+4
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Properly clear dirty and delalloc extent bits while preparing the file for write	Chris Mason	2008-09-25	1	-0/+7
	Yan Zheng noticed that we don't clear the extent state tree dirty and delalloc bits when we clear the dirty bits on the page during file write. This leads to csum errors later on. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Fix "no csum found for inode" issue.	Yan	2008-09-25	1	-1/+4
	A few code paths were not properly updated for the extent map changes. This may be the cause of the "no csum found for inode" issue. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Fix i_blocks accounting	Chris Mason	2008-09-25	1	-6/+11
	Now that delayed allocation accounting works, i_blocks accounting is changed to only modify i_blocks when extents are inserted or removed. The fillattr call is changed to include the delayed allocation byte count in the i_blocks result. Signed-off-by: Chris Mason <chris.mason@oracle.com>
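The reporting rule can be written as a one-line calculation. This is a sketch with hypothetical names (`reported_blocks`, `STAT_BLOCK_SHIFT`); it relies only on the fact that stat reports block counts in 512-byte units and on the commit's statement that delayed allocation bytes are added to the bytes already accounted to extents.

```c
#include <assert.h>
#include <stdint.h>

/* stat reports block counts in 512-byte units (2^9 bytes). */
#define STAT_BLOCK_SHIFT 9

/*
 * Blocks to report for an inode: bytes already backed by extents plus
 * bytes still waiting in delayed allocation.
 */
static uint64_t reported_blocks(uint64_t allocated_bytes,
				uint64_t delalloc_bytes)
{
	return (allocated_bytes + delalloc_bytes) >> STAT_BLOCK_SHIFT;
}
```

With this rule, a file with 4096 bytes of dirty delalloc data and no extents on disk is reported as 8 blocks rather than 0, so tools that check the block count do not mistake it for an empty file.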
*	btrfs_drop_extents: handle BTRFS_INODE_REF_KEY types	Yan	2008-09-25	1	-3/+4
	It is possible for both "key.type == BTRFS_INODE_REF_KEY" and "key.offset >= end" to hold. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Fix hole creation in file_write	Yan	2008-09-25	1	-4/+2
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	btrfs_drop_extent fix for inline items > 8K	Yan	2008-09-25	1	-5/+3
	When truncating an inline extent, btrfs_drop_extents doesn't properly handle the case "key.offset > inline_limit". This bug can only happen when the max inline size is larger than 8K. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: mount -o max_inline=size to control the maximum inline extent size	Chris Mason	2008-09-25	1	-1/+2
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Do delalloc accounting via hooks in the extent_state code	Chris Mason	2008-09-25	1	-4/+0
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Split the extent_map code into two parts	Chris Mason	2008-09-25	1	-14/+15
	There is now extent_map for mapping offsets in the file to disk and extent_io for state tracking, IO submission and extent_buffers. The new extent_map code shifts from [start,end] pairs to [start,len], and pushes the locking out into the caller. This allows a few performance optimizations and is easier to use. A number of extent_map usage bugs were fixed, mostly with failing to remove extent_map entries when changing the file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
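The shift from [start,end] to [start,len] can be shown with a small conversion. The sketch assumes the old end value was inclusive (the last byte of the range), which the commit message does not state; `struct range_end`, `struct range_len` and both helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old style: 'end' is assumed to be the last byte in the range. */
struct range_end {
	uint64_t start;
	uint64_t end;
};

/* New style: the range covers bytes [start, start + len). */
struct range_len {
	uint64_t start;
	uint64_t len;
};

/* Convert an inclusive [start,end] pair to [start,len]. */
static struct range_len to_start_len(struct range_end r)
{
	struct range_len out;

	assert(r.end >= r.start);
	out.start = r.start;
	out.len = r.end - r.start + 1;
	return out;
}

/* True if 'b' begins exactly where 'a' stops. */
static bool ranges_adjacent(struct range_len a, struct range_len b)
{
	return a.start + a.len == b.start;
}
```

With a length, checks such as adjacency need no "+ 1" or "- 1" corrections, which is one plausible reading of the "easier to use" remark in the message.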
*	Btrfs: Fix hole insertion corner cases	Chris Mason	2008-09-25	1	-1/+77
	There were a few places that could cause duplicate extent insertion; this adjusts the code that creates holes to avoid it. lookup_extent_map is changed to correctly return all of the extents in a range, even when there are none matching at the start of the range. Signed-off-by: Chris Mason <chris.mason@oracle.com>
*	Btrfs: Add some simple throttling to wait for data=ordered and snapshot deletion	Chris Mason	2008-09-25	1	-0/+1
	Signed-off-by: Chris Mason <chris.mason@oracle.com>