summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/transaction.c
Commit message (Collapse)AuthorAgeFilesLines
* Btrfs: Fix nodatacow for the new data=ordered modeYan Zheng2008-09-251-0/+11
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Various small fixes.Yan Zheng2008-09-251-4/+5
| | | | | | | | This trivial patch contains two locking fixes and a off by one fix. --- Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix ioctl-initiated transactions vs wait_current_trans()Sage Weil2008-09-251-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in progress) breaks the transaction start/stop ioctls by making btrfs_start_transaction conditionally wait for the next transaction to start. If an application artificially is holding a transaction open, things deadlock. This workaround maintains a count of open ioctl-initiated transactions in fs_info, and avoids wait_current_trans() if any are currently open (in start_transaction() and btrfs_throttle()). The start transaction ioctl uses a new btrfs_start_ioctl_transaction() that _does_ call wait_current_trans(), effectively pushing the join/wait decision to the outer ioctl-initiated transaction. This more or less neuters btrfs_throttle() when ioctl-initiated transactions are in use, but that seems like a pretty fundamental consequence of wrapping lots of write()'s in a transaction. Btrfs has no way to tell if the application considers a given operation as part of it's transaction. Obviously, if the transaction start/stop ioctls aren't being used, there is no effect on current behavior. Signed-off-by: Sage Weil <sage@newdream.net> --- ctree.h | 1 + ioctl.c | 12 +++++++++++- transaction.c | 18 +++++++++++++----- transaction.h | 2 ++ 4 files changed, 27 insertions(+), 6 deletions(-) Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: More throttle tuningChris Mason2008-09-251-0/+15
| | | | | | | | | | * Make walk_down_tree wake up throttled tasks more often * Make walk_down_tree call cond_resched during long loops * As the size of the ref cache grows, wait longer in throttle * Get rid of the reada code in walk_down_tree, the leaves don't get read anymore, thanks to the ref cache. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* btrfs_search_slot: reduce lock contention by cowing in two stagesChris Mason2008-09-251-1/+1
| | | | | | | | | | | | | | | | A btree block cow has two parts, the first is to allocate a destination block and the second is to copy the old bock over. The first part needs locks in the extent allocation tree, and may need to do IO. This changeset splits that into a separate function that can be called without any tree locks held. btrfs_search_slot is changed to drop its path and start over if it has to COW a contended block. This often means that many writers will pre-alloc a new destination for a the same contended block, but they cache their prealloc for later use on lower levels in the tree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Throttle less often waiting for snapshots to deleteChris Mason2008-09-251-14/+0
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Throttle tuningChris Mason2008-09-251-11/+27
| | | | | | | | | | This avoids waiting for transactions with pages locked by breaking out the code to wait for the current transaction to close into a function called by btrfs_throttle. It also lowers the limits for where we start throttling. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: implement memory reclaim for leaf reference cacheYan2008-09-251-10/+30
| | | | | | | | | | | | | | The memory reclaiming issue happens when snapshot exists. In that case, some cache entries may not be used during old snapshot dropping, so they will remain in the cache until umount. The patch adds a field to struct btrfs_leaf_ref to record create time. Besides, the patch makes all dead roots of a given snapshot linked together in order of create time. After a old snapshot was completely dropped, we check the dead root list and remove all cache entries created before the oldest dead root in the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Update and fix mount -o nodatacowYan Zheng2008-09-251-11/+5
| | | | | | | | | | | | | | To check whether a given file extent is referenced by multiple snapshots, the checker walks down the fs tree through dead root and checks all tree blocks in the path. We can easily detect whether a given tree block is directly referenced by other snapshot. We can also detect any indirect reference from other snapshot by checking reference's generation. The checker can always detect multiple references, but can't reliably detect cases of single reference. So btrfs may do file data cow even there is only one reference. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Throttle operations if the reference cache gets too largeChris Mason2008-09-251-15/+44
| | | | | | | | | | | | A large reference cache is directly related to a lot of work pending for the cleaner thread. This throttles back new operations based on the size of the reference cache so the cleaner thread will be able to keep up. Overall, this actually makes the FS faster because the cleaner thread will be more likely to find things in cache. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Leaf reference cache updateChris Mason2008-09-251-22/+12
| | | | | | | | | | | | | | | This changes the reference cache to make a single cache per root instead of one cache per transaction, and to key by the byte number of the disk block instead of the keys inside. This makes it much less likely to have cache misses if a snapshot or something has an extra reference on a higher node or a leaf while the first transaction that added the leaf into the cache is dropping. Some throttling is added to functions that free blocks heavily so they wait for old transactions to drop. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a leaf reference cacheYan Zheng2008-09-251-18/+49
| | | | | | | | | | Much of the IO done while dropping snapshots is done looking up leaves in the filesystem trees to see if they point to any extents and to drop the references on any extents found. This creates a cache so that IO isn't required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Implement new dir index formatJosef Bacik2008-09-251-2/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Take the csum mutex while reading checksumsChris Mason2008-09-251-0/+3
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix some data=ordered related data corruptionsChris Mason2008-09-251-2/+0
| | | | | | | | | | | | | | | | | | Stress testing was showing data checksum errors, most of which were caused by a lookup bug in the extent_map tree. The tree was caching the last pointer returned, and searches would check the last pointer first. But, search callers also expect the search to return the very first matching extent in the range, which wasn't always true with the last pointer usage. For now, the code to cache the last return value is just removed. It is easy to fix, but I think lookups are rare enough that it isn't required anymore. This commit also replaces do_sync_mapping_range with a local copy of the related functions. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* btrfs_start_transaction: wait for commits in progress to finishChris Mason2008-09-251-3/+40
| | | | | | | | | | | | | btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: New data=ordered implementationChris Mason2008-09-251-58/+9
| | | | | | | | | | | | | | | | | | | | | | | | The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work: * When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated an inserted into a per-inode rbtree to track the pending extents. * As each page is written EXTENT_ORDERED is cleared on the bytes corresponding to that page. * When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, The previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Drop some verbose printksChris Mason2008-09-251-2/+0
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Online btree defragmentation fixesChris Mason2008-09-251-34/+1
| | | | | | | | | | The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a per-inode csum mutex to avoid races creating csum itemsChris Mason2008-09-251-1/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Replace the transaction work queue with kthreadsChris Mason2008-09-251-70/+2
| | | | | | | This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Add btrfs_end_transaction_throttle to force writers to wait for pending commitsChris Mason2008-09-251-21/+45
| | | | | | | | | | | | | | | | The existing throttle mechanism was often not sufficient to prevent new writers from coming in and making a given transaction run forever. This adds an explicit wait at the end of most operations so they will allow the current transaction to close. There is no wait inside file_write, inode updates, or cow filling, all which have different deadlock possibilities. This is a temporary measure until better asynchronous commit support is added. This code leads to stalls as it waits for data=ordered writeback, and it really needs to be fixed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Replace the big fs_mutex with a collection of other locksChris Mason2008-09-251-26/+16
| | | | | | | | Extent alloctions are still protected by a large alloc_mutex. Objectid allocations are covered by a objectid mutex Other btree operations are protected by a lock on individual btree nodes Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Start btree concurrency work.Chris Mason2008-09-251-6/+9
| | | | | | | | | | | | | | | The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Invalidate dcache entry after creating snapshot andSven Wegener2008-09-251-1/+7
| | | | | | | | | | | | | | We need to invalidate an existing dcache entry after creating a new snapshot or subvolume, because a negative dache entry will stop us from accessing the new snapshot or subvolume. --- ctree.h | 23 +++++++++++++++++++++++ inode.c | 4 ++++ transaction.c | 4 ++++ 3 files changed, 31 insertions(+) Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix race in running_transaction checksChris Mason2008-09-251-1/+3
| | | | | | | | When a new transaction was started, the code would incorrectly set the pointer in fs_info before all the data structures were setup. fsync heavy workloads hit races on the setup of the ordered inode spinlock Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add support for online device removalChris Mason2008-09-251-3/+2
| | | | | | | | | | | | | This required a few structural changes to the code that manages bdev pointers: The VFS super block now gets an anon-bdev instead of a pointer to the lowest bdev. This allows us to avoid swapping the super block bdev pointer around at run time. The code to read in the super block no longer goes through the extent buffer interface. Things got ugly keeping the mapping constant. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fixes for 2.6.18 enterprise kernelsChris Mason2008-09-251-2/+6
| | | | | | | | | | | | | | | 2.6.18 seems to get caught in an infinite loop when cancel_rearming_delayed_workqueue is called more than once, so this switches to cancel_delayed_work, which is arguably more correct. Also, balance_dirty_pages can run into problems with 2.6.18 based kernels because it doesn't have the per-bdi dirty limits. This avoids calling balance_dirty_pages on the btree inode unless there is actually something to balance, which is a good optimization in general. Finally there's a compile fix for ordered-data.h Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Throttle file_write when data=ordered is flushing the inodeChris Mason2008-09-251-2/+8
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Do metadata checksums for reads via a workqueueChris Mason2008-09-251-1/+1
| | | | | | | | | | | | | | | | Before, metadata checksumming was done by the callers of read_tree_block, which would set EXTENT_CSUM bits in the extent tree to show that a given range of pages was already checksummed and didn't need to be verified again. But, those bits could go away via try_to_releasepage, and the end result was bogus checksum failures on pages that never left the cache. The new code validates checksums when the page is read. It is a little tricky because metadata blocks can span pages and a single read may end up going via multiple bios. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add support for multiple devices per filesystemChris Mason2008-09-251-16/+34
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Lower stack usage in transaction.cChris Mason2008-09-251-18/+24
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add data block hints to SSD mode tooChris Mason2008-09-251-0/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Split the extent_map code into two partsChris Mason2008-09-251-4/+4
| | | | | | | | | | | | | | There is now extent_map for mapping offsets in the file to disk and extent_io for state tracking, IO submission and extent_bufers. The new extent_map code shifts from [start,end] pairs to [start,len], and pushes the locking out into the caller. This allows a few performance optimizations and is easier to use. A number of extent_map usage bugs were fixed, mostly with failing to remove extent_map entries when changing the file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add mount -o ssd, which includes optimizations for seek free storageChris Mason2008-09-251-0/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix data=ordered vs wait_on_inode deadlock on older kernelsChris Mason2008-09-251-17/+13
| | | | | | | Using ilookup5 during data=ordered writeback could deadlock on I_LOCK. This saves a pointer to the inode instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Run igrab on data=ordered inodes to prevent deadlocks during writeoutChris Mason2008-09-251-0/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Rework btrfs_drop_inode to avoid schedulingChris Mason2008-09-251-0/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add some simple throttling to wait for data=ordered and snapshot deletionChris Mason2008-09-251-0/+4
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Move snapshot creation to commit timeChris Mason2008-09-251-3/+78
| | | | | | | | | | It is very difficult to create a consistent snapshot of the btree when other writers may update the btree before the commit is done. This changes the snapshot creation to happen during the commit, while no other updates are possible. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add data=ordered supportChris Mason2008-09-251-0/+58
| | | | | | | | This forces file data extents down the disk along with the metadata that references them. The current implementation is fairly simple, and just writes out all of the dirty pages in an inode before the commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Reduce stack usage in the resizer, fix 32 bit compilesChris Mason2008-09-251-5/+15
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Back port to 2.6.18-el kernelsChris Mason2008-09-251-0/+8
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: section mismatch warningsChristian Hesse2008-09-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | --Boundary-00=_CcOWHFYK4T+JwSj Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Hello everybody, compiling btrfs into the kernel results in section mismatch warnings. __exit functions are called where they are not allowed to. The attached patch fixes this for me. Not sure if it is correct though. Signed-off-by: Christian Hesse <mail@earthworm.de> -- Regards, Chris --Boundary-00=_CcOWHFYK4T+JwSj Content-Type: text/x-diff; charset="iso-8859-1"; name="btrfs-section_mismatches.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="btrfs-section_mismatches.patch" Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix PAGE_CACHE_SHIFT shifts on 32 bit machinesChris Mason2008-09-251-1/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Defrag only leaves, and only when the parent node has a single objectidChris Mason2008-09-251-2/+0
| | | | | | | This allows us to defrag huge directories, but skip the expensive defrag case in more common usage, where it does not help as much. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add an extent buffer LRU to reduce radix tree hitsChris Mason2008-09-251-3/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add back the online defragging codeChris Mason2008-09-251-2/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Allow tree blocks larger than the page sizeChris Mason2008-09-251-17/+22
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Change the remaining radix trees used by extent-tree.c to extent_map ↵Chris Mason2008-09-251-2/+3
| | | | | | trees Signed-off-by: Chris Mason <chris.mason@oracle.com>
OpenPOWER on IntegriCloud