summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/inode.c
Commit message (Collapse)AuthorAgeFilesLines
* Btrfs: fix btrfs fallocate oops and deadlockChris Mason2009-04-211-8/+28
| | | | | | | | | | | | | | | | | Btrfs fallocate was incorrectly starting a transaction with a lock held on the extent_io tree for the file, which could deadlock. Strictly speaking it was using join_transaction which would be safe, but it is better to move the transaction outside of the lock. When preallocated extents are overwritten, btrfs_mark_buffer_dirty was being called on an unlocked buffer. This was triggering an assertion and oops because the lock is supposed to be held. The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had been run. btrfs_del_item takes care of dirtying things, so the solution is a to skip the btrfs_mark_buffer_dirty call in this case. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds2009-04-031-1/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: BUG to BUG_ON changes Btrfs: remove dead code Btrfs: remove dead code Btrfs: fix typos in comments Btrfs: remove unused ftrace include Btrfs: fix __ucmpdi2 compile bug on 32 bit builds Btrfs: free inode struct when btrfs_new_inode fails Btrfs: fix race in worker_loop Btrfs: add flushoncommit mount option Btrfs: notreelog mount option Btrfs: introduce btrfs_show_options Btrfs: rework allocation clustering Btrfs: Optimize locking in btrfs_next_leaf() Btrfs: break up btrfs_search_slot into smaller pieces Btrfs: kill the pinned_mutex Btrfs: kill the block group alloc mutex Btrfs: clean up find_free_extent Btrfs: free space cache cleanups Btrfs: unplug in the async bio submission threads Btrfs: keep processing bios for a given bdev if our proc is batching
| * Btrfs: free inode struct when btrfs_new_inode failsShen Feng2009-04-021-1/+4
| | | | | | | | | | | | | | | | | | btrfs_new_inode doesn't call iput to free the inode when it fails. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds2009-04-011-23/+171
|\ \ | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: try to free metadata pages when we free btree blocks Btrfs: add extra flushing for renames and truncates Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod Btrfs: optimize fsyncs on old files Btrfs: tree logging unlink/rename fixes Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay Btrfs: limit balancing work while flushing delayed refs Btrfs: readahead checksums during btrfs_finish_ordered_io Btrfs: leave btree locks spinning more often Btrfs: Only let very young transactions grow during commit Btrfs: Check for a blocking lock before taking the spin Btrfs: reduce stack in cow_file_range Btrfs: reduce stalls during transaction commit Btrfs: process the delayed reference queue in clusters Btrfs: try to cleanup delayed refs while freeing extents Btrfs: reduce stack usage in some crucial tree balancing functions Btrfs: do extent allocation and reference count updates in the background Btrfs: don't preallocate metadata blocks during btrfs_search_slot
| * Btrfs: add extra flushing for renames and truncatesChris Mason2009-03-311-7/+74
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Renames and truncates are both common ways to replace old data with new data. The filesystem can make an effort to make sure the new data is on disk before actually replacing the old data. This is especially important for rename, which many application use as though it were atomic for both the data and the metadata involved. The current btrfs code will happily replace a file that is fully on disk with one that was just created and still has pending IO. If we crash after transaction commit but before the IO is done, we'll end up replacing a good file with a zero length file. The solution used here is to create a list of inodes that need special ordering and force them to disk before the commit is done. This is similar to the ext3 style data=ordering, except it is only done on selected files. Btrfs is able to get away with this because it does not wait on commits very often, even for fsync (which use a sub-commit). For renames, we order the file when it wasn't already on disk and when it is replacing an existing file. Larger files are sent to filemap_flush right away (before the transaction handle is opened). For truncates, we order if the file goes from non-zero size down to zero size. This is a little different, because at the time of the truncate the file has no dirty bytes to order. But, we flag the inode so that it is added to the ordered list on close (via release method). We also immediately add it to the ordered list of the current transaction so that we can try to flush down any writes the application sneaks in before commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: tree logging unlink/rename fixesChris Mason2009-03-241-3/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The tree logging code allows individual files or directories to be logged without including operations on other files and directories in the FS. It tries to commit the minimal set of changes to disk in order to fsync the single file or directory that was sent to fsync or O_SYNC. The tree logging code was allowing files and directories to be unlinked if they were part of a rename operation where only one directory in the rename was in the fsync log. This patch adds a few new rules to the tree logging. 1) on rename or unlink, if the inode being unlinked isn't in the fsync log, we must force a full commit before doing an fsync of the directory where the unlink was done. The commit isn't done during the unlink, but it is forced the next time we try to log the parent directory. Solution: record transid of last unlink/rename per directory when the directory wasn't already logged. For renames this is only done when renaming to a different directory. mkdir foo/some_dir normal commit rename foo/some_dir foo2/some_dir mkdir foo/some_dir fsync foo/some_dir/some_file The fsync above will unlink the original some_dir without recording it in its new location (foo2). After a crash, some_dir will be gone unless the fsync of some_file forces a full commit 2) we must log any new names for any file or dir that is in the fsync log. This way we make sure not to lose files that are unlinked during the same transaction. 2a) we must log any new names for any file or dir during rename when the directory they are being removed from was logged. 2a is actually the more important variant. Without the extra logging a crash might unlink the old name without recreating the new one 3) after a crash, we must go through any directories with a link count of zero and redo the rm -rf mkdir f1/foo normal commit rm -rf f1/foo fsync(f1) The directory f1 was fully removed from the FS, but fsync was never called on f1, only its parent dir. After a crash the rm -rf must be replayed. This must be able to recurse down the entire directory tree. The inode link count fixup code takes care of the ugly details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: readahead checksums during btrfs_finish_ordered_ioChris Mason2009-03-241-2/+33
| | | | | | | | | | | | | | | | | | This reads in blocks in the checksum btree before starting the transaction in btrfs_finish_ordered_io. It makes it much more likely we'll be able to do operations inside the transaction without needing any btree reads, which limits transaction latencies overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: leave btree locks spinning more oftenChris Mason2009-03-241-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_mark_buffer dirty would set dirty bits in the extent_io tree for the buffers it was dirtying. This may require a kmalloc and it was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to set any btree locks they were holding to blocking first. This commit changes dirty tracking for extent buffers to just use a flag in the extent buffer. Now that we have one and only one extent buffer per page, this can be safely done without losing dirty bits along the way. This also introduces a path->leave_spinning flag that callers of btrfs_search_slot can use to indicate they will properly deal with a path returned where all the locks are spinning instead of blocking. Many of the btree search callers now expect spinning paths, resulting in better btree concurrency overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: reduce stack in cow_file_rangeChris Mason2009-03-241-8/+7
| | | | | | | | | | | | | | | | | | The fs/btrfs/inode.c code to run delayed allocation during writout needed some stack usage optimization. This is the first pass, it does the check for compression earlier on, which allows us to do the common (no compression) case higher up in the call chain. Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: reduce stalls during transaction commitChris Mason2009-03-241-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To avoid deadlocks and reduce latencies during some critical operations, some transaction writers are allowed to jump into the running transaction and make it run a little longer, while others sit around and wait for the commit to finish. This is a bit unfair, especially when the callers that jump in do a bunch of IO that makes all the others procs on the box wait. This commit reduces the stalls this produces by pre-reading file extent pointers during btrfs_finish_ordered_io before the transaction is joined. It also tunes the drop_snapshot code to politely wait for transactions that have started writing out their delayed refs to finish. This avoids new delayed refs being flooded into the queue while we're trying to close off the transaction. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* | fs: fix page_mkwrite error cases in core code and btrfsNick Piggin2009-04-011-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | page_mkwrite is called with neither the page lock nor the ptl held. This means a page can be concurrently truncated or invalidated out from underneath it. Callers are supposed to prevent truncate races themselves, however previously the only thing they can do in case they hit one is to raise a SIGBUS. A sigbus is wrong for the case that the page has been invalidated or truncated within i_size (eg. hole punched). Callers may also have to perform memory allocations in this path, where again, SIGBUS would be wrong. The previous patch ("mm: page_mkwrite change prototype to match fault") made it possible to properly specify errors. Convert the generic buffer.c code and btrfs to return sane error values (in the case of page removed from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit without doing anything, and the fault will be retried properly). This fixes core code, and converts btrfs as a template/example. All other filesystems defining their own page_mkwrite should be fixed in a similar manner. Acked-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: page_mkwrite change prototype to match faultNick Piggin2009-04-011-1/+4
|/ | | | | | | | | | | | | | | | | | | | | | | | Change the page_mkwrite prototype to take a struct vm_fault, and return VM_FAULT_xxx flags. There should be no functional change. This makes it possible to return much more detailed error information to the VM (and also can provide more information eg. virtual_address to the driver, which might be important in some special cases). This is required for a subsequent fix. And will also make it easier to merge page_mkwrite() with fault() in future. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Cc: Felix Blyakher <felixb@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Btrfs: add better -ENOSPC handlingJosef Bacik2009-02-201-46/+16
| | | | | | | | | | | | | | | | | | | | | | | | This is a step in the direction of better -ENOSPC handling. Instead of checking the global bytes counter we check the space_info bytes counters to make sure we have enough space. If we don't we go ahead and try to allocate a new chunk, and then if that fails we return -ENOSPC. This patch adds two counters to btrfs_space_info, bytes_delalloc and bytes_may_use. bytes_delalloc account for extents we've actually setup for delalloc and will be allocated at some point down the line. bytes_may_use is to keep track of how many bytes we may use for delalloc at some point. When we actually set the extent_bit for the delalloc bytes we subtract the reserved bytes from the bytes_may_use counter. This keeps us from not actually being able to allocate space for any delalloc bytes. Signed-off-by: Josef Bacik <jbacik@redhat.com>
* Btrfs: remove btrfs_init_pathJeff Mahoney2009-02-121-2/+0
| | | | | | | | | | | btrfs_init_path was initially used when the path objects were on the stack. Now all the work is done by btrfs_alloc_path and btrfs_init_path isn't required. This patch removes it, and just uses kmem_cache_zalloc to zero out the object. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Avoid using __GFP_HIGHMEM with slab allocatorYan Zheng2009-02-121-1/+1
| | | | | | | | | | btrfs_releasepage may call kmem_cache_alloc indirectly, and provide same GFP flags it gets to kmem_cache_alloc. So it's possible to use __GFP_HIGHMEM with the slab allocator. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: Make sure dir is non-null before doing S_ISGID checksChris Mason2009-02-061-1/+1
| | | | | | | The S_ISGID check in btrfs_new_inode caused an oops during subvol creation because sometimes the dir is null. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inodeChris Mason2009-02-041-1/+5
| | | | | | | | | | | | | | | btrfs_truncate_inode_items is setup to stop doing btree searches when it has finished removing the items for the inode. It used to detect the end of the inode by looking for an objectid that didn't match the one we were searching for. But, this would result in an extra search through the btree, which adds extra balancing and cow costs to the operation. This commit adds a check to see if we found the inode item, which means we can stop searching early. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Don't try to compress pages past i_sizeChris Mason2009-02-041-0/+14
| | | | | | | | | | | | | The compression code had some checks to make sure we were only compressing bytes inside of i_size, but it wasn't catching every case. To make things worse, some incorrect math about the number of bytes remaining would make it try to compress more pages than the file really had. The fix used here is to fall back to the non-compression code in this case, which does all the proper cleanup of delalloc and other accounting. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Handle SGID bit when creating inodesChris Ball2009-02-041-1/+8
| | | | | | | | | | Before this patch, new files/dirs would ignore the SGID bit on their parent directory and always be owned by the creating user's uid/gid. Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunksChris Mason2009-02-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every transaction in btrfs creates a new snapshot, and then schedules the snapshot from the last transaction for deletion. Snapshot deletion works by walking down the btree and dropping the reference counts on each btree block during the walk. If if a given leaf or node has a reference count greater than one, the reference count is decremented and the subtree pointed to by that node is ignored. If the reference count is one, walking continues down into that node or leaf, and the references of everything it points to are decremented. The old code would try to work in small pieces, walking down the tree until it found the lowest leaf or node to free and then returning. This was very friendly to the rest of the FS because it didn't have a huge impact on other operations. But it wouldn't always keep up with the rate that new commits added new snapshots for deletion, and it wasn't very optimal for the extent allocation tree because it wasn't finding leaves that were close together on disk and processing them at the same time. This changes things to walk down to a level 1 node and then process it in bulk. All the leaf pointers are sorted and the leaves are dropped in order based on their extent number. The extent allocation tree and commit code are now fast enough for this kind of bulk processing to work without slowing the rest of the FS down. Overall it does less IO and is better able to keep up with snapshot deletions under high load. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Change btree locking to use explicit blocking pointsChris Mason2009-02-041-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the btrfs metadata operations can be protected by a spinlock, but some operations still need to schedule. So far, btrfs has been using a mutex along with a trylock loop, most of the time it is able to avoid going for the full mutex, so the trylock loop is a big performance gain. This commit is step one for getting rid of the blocking locks entirely. btrfs_tree_lock takes a spinlock, and the code explicitly switches to a blocking lock when it starts an operation that can schedule. We'll be able get rid of the blocking locks in smaller pieces over time. Tracing allows us to find the most common cause of blocking, so we can start with the hot spots first. The basic idea is: btrfs_tree_lock() returns with the spin lock held btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in the extent buffer flags, and then drops the spin lock. The buffer is still considered locked by all of the btrfs code. If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops the spin lock and waits on a wait queue for the blocking bit to go away. Much of the code that needs to set the blocking bit finishes without actually blocking a good percentage of the time. So, an adaptive spin is still used against the blocking bit to avoid very high context switch rates. btrfs_clear_lock_blocking() clears the blocking bit and returns with the spinlock held again. btrfs_tree_unlock() can be called on either blocking or spinning locks, it does the right thing based on the blocking bit. ctree.c has a helper function to set/clear all the locked buffers in a path as blocking. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: selinux supportJim Owens2009-02-041-4/+19
| | | | | | | | | | | Add call to LSM security initialization and save resulting security xattr for new inodes. Add xattr support to symlink inode ops. Set inode->i_op for existing special files. Signed-off-by: jim owens <jowens@hp.com>
* Btrfs: fix readdir on 32 bit machinesChris Mason2009-01-281-1/+1
| | | | | | | | | | | | | | | | | | | | After btrfs_readdir has gone through all the directory items, it sets the directory f_pos to the largest possible int. This way applications that mix readdir with creating new files don't end up in an endless loop finding the new directory items as they go. It was a workaround for a bug in git, but the assumption was that if git could make this looping mistake than it would be a common problem. The largest possible int chosen was INT_LIMIT(typeof(file->f_pos), and it is possible for that to be a larger number than 32 bit glibc expects to come out of readdir. This patches switches that to INT_LIMIT(off_t), which should keep applications happy on 32 and 64 bit machines. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fiemap supportYehuda Sadeh2009-01-211-0/+7
| | | | | | | | | Now that bmap support is gone, this is the only way to get extent mappings for userland. These are still not valid for IO, but they can tell us if a file has holes or how much fragmentation there is. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
* Btrfs: stop providing a bmap operation to avoid swapfile corruptionsChris Mason2009-01-211-6/+12
| | | | | | | | | | | | | | | | | | | | | Swapfiles use bmap to build a list of extents belonging to the file, and they assume these extents won't change over the life of the file. They also use resulting list to do IO directly to the block device. This causes problems for btrfs in a few ways: btrfs returns logical block numbers through bmap, and these are not suitable for IO. They might translate to different devices, raid etc. COW means that file block mappings are going to change frequently. Using swapfiles on btrfs will lead to corruption, so we're avoiding the problem for now by dropping bmap support entirely. A later commit will add fiemap support for people that really want to know how a file is laid out. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: simplify iteration codesQinghuang Feng2009-01-211-3/+2
| | | | | | | | Merge list_for_each* and list_entry to list_for_each_entry* Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: removed unused #include <version.h>'sHuang Weiyi2009-01-211-1/+0
| | | | | | | | Removed unused #include <version.h>'s in btrfs Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: kmap_atomic(KM_USER0) is safe for btrfs_readpage_end_io_hookChris Mason2009-01-071-3/+3
| | | | | | | None of the checksum verification code schedules, so we can use the faster kmap_atomic Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Don't use kmap_atomic(..., KM_IRQ0) during checksum verifiesChris Mason2009-01-061-7/+3
| | | | | | | Checksum verification happens in a helper thread, and there is no need to mess with interrupts. This switches to kmap() instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: tree logging checksum fixesYan Zheng2009-01-061-3/+2
| | | | | | | | | | | | | | | | | | This patch contains following things. 1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE. This struct is kmalloced so we want to keep it reasonable. 2) Replace copy_extent_csums by btrfs_lookup_csums_range. This was duplicated code in tree-log.c 3) Remove replay_one_csum. csum items are replayed at the same time as replaying file extents. This guarantees we only replay useful csums. 4) nbytes accounting fix. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creationYan Zheng2009-01-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Snapshot creation happens at a specific time during transaction commit. We need to make sure the code called by snapshot creation doesn't wait for the running transaction to commit. This changes btrfs_delete_inode and finish_pending_snaps to use btrfs_join_transaction instead of btrfs_start_transaction to avoid deadlocks. It would be better if btrfs_delete_inode didn't use the join, but the call path that triggers it is: btrfs_commit_transaction->create_pending_snapshots-> create_pending_snapshot->btrfs_lookup_dentry-> fixup_tree_root_location->btrfs_read_fs_root-> btrfs_read_fs_root_no_name->btrfs_orphan_cleanup->iput This will be fixed in a later patch by moving the orphan cleanup to the cleaner thread. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix checkpatch.pl warningsChris Mason2009-01-051-87/+86
| | | | | | | There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: shift all end_io work to thread poolsChris Mason2008-12-171-6/+6
| | | | | | | | | | | | | | | | | | bio_end_io for reads without checksumming on and btree writes were happening without using async thread pools. This means the extent_io.c code had to use spin_lock_irq and friends on the rb tree locks for extent state. There were some irq safe vs unsafe lock inversions between the delallock lock and the extent state locks. This patch gets rid of them by moving all end_io code into the thread pools. To avoid contention and deadlocks between the data end_io processing and the metadata end_io processing yet another thread pool is added to finish off metadata writes. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Don't use spin*lock_irq for the delalloc lockChris Mason2008-12-151-14/+20
| | | | | | | The delalloc lock doesn't need to have irqs disabled, nobody that changes the number of delalloc bytes in the FS is running with irqs off. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix compressed writes on truncated pagesChris Mason2008-12-151-3/+5
| | | | | | | | | | | The compression code was using isize to limit the amount of data it sent through zlib. But, it wasn't properly limiting the looping to just the pages inside i_size. The end result was trying to compress too many pages, including those that had not been setup and properly locked down. This made the compression code oops while trying find_get_page on a page that didn't exist. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix nodatasum handling in balancing codeYan Zheng2008-12-121-13/+61
| | | | | | | | | | | | | | | | | | | | | | Checksums on data can be disabled by mount option, so it's possible some data extents don't have checksums or have invalid checksums. This causes trouble for data relocation. This patch contains following things to make data relocation work. 1) make nodatasum/nodatacow mount option only affects new files. Checksums and COW on data are only controlled by the inode flags. 2) check the existence of checksum in the nodatacow checker. If checksums exist, force COW the data extent. This ensure that checksum for a given block is either valid or does not exist. 3) update data relocation code to properly handle the case of checksum missing. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: fix leaking block group on balanceYan Zheng2008-12-111-30/+13
| | | | | | | | | | | | | The block group structs are referenced in many different places, and it's not safe to free while balancing. So, those block group structs were simply leaked instead. This patch replaces the block group pointer in the inode with the starting byte offset of the block group and adds reference counting to the block group struct. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: Add inode sequence number for NFS and reserved space in a few structsChris Mason2008-12-081-1/+3
| | | | | | | | | | | | | | | | | | | This adds a sequence number to the btrfs inode that is increased on every update. NFS will be able to use that to detect when an inode has changed, without relying on inaccurate time fields. While we're here, this also: Puts reserved space into the super block and inode Adds a log root transid to the super so we can pick the newest super based on the fsync log as well as the main transaction ID. For now the log root transid is always zero, but that'll get fixed. Adds a starting offset to the dev_item. This will let us do better alignment calculations if we know the start of a partition on the disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: move data checksumming into a dedicated treeChris Mason2008-12-081-21/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs stores checksums for each data block. Until now, they have been stored in the subvolume trees, indexed by the inode that is referencing the data block. This means that when we read the inode, we've probably read in at least some checksums as well. But, this has a few problems: * The checksums are indexed by logical offset in the file. When compression is on, this means we have to do the expensive checksumming on the uncompressed data. It would be faster if we could checksum the compressed data instead. * If we implement encryption, we'll be checksumming the plain text and storing that on disk. This is significantly less secure. * For either compression or encryption, we have to get the plain text back before we can verify the checksum as correct. This makes the raid layer balancing and extent moving much more expensive. * It makes the front end caching code more complex, as we have touch the subvolume and inodes as we cache extents. * There is potentitally one copy of the checksum in each subvolume referencing an extent. The solution used here is to store the extent checksums in a dedicated tree. This allows us to index the checksums by phyiscal extent start and length. It means: * The checksum is against the data stored on disk, after any compression or encryption is done. * The checksum is stored in a central location, and can be verified without following back references, or reading inodes. This makes compression significantly faster by reducing the amount of data that needs to be checksummed. It will also allow much faster raid management code in general. The checksums are indexed by a key with a fixed objectid (a magic value in ctree.h) and offset set to the starting byte of the extent. This allows us to copy the checksum items into the fsync log tree directly (or any other tree), without having to invent a second format for them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: delete unused function: btrfs_invalidate_dcache_rootChris Mason2008-12-021-25/+0
| | | | | | | Snapshot and subvolume creation no longer need this helper. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: make things static and include the right headersChristoph Hellwig2008-12-021-13/+13
| | | | | | | | Shut up various sparse warnings about symbols that should be either static or have their declarations in scope. Signed-off-by: Christoph Hellwig <hch@lst.de>
* Btrfs: Fix cow semantic in run_delalloc_nocow()Liu Hui2008-12-011-2/+2
| | | | | | The file preallocation code reversed the logic to force nodatacow. This fixes it.
* Btrfs: compat code fixesChris Mason2008-11-201-1/+1
| | | | | | | | The btrfs git kernel trees is used to build a standalone tree for compiling against older kernels. This commit makes the standalone tree work with 2.6.27 Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Use current_fsuid/gidChris Mason2008-11-191-2/+2
| | | | | | This fixes compile problems with linux-next Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Avoid writeback stallsChris Mason2008-11-191-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | While building large bios in writepages, btrfs may end up waiting for other page writeback to finish if WB_SYNC_ALL is used. While it is waiting, the bio it is building has a number of pages with the writeback bit set and they aren't getting to the disk any time soon. This lowers the latencies of writeback in general by sending down the bio being built before waiting for other pages. The bio submission code tries to limit the total number of async bios in flight by waiting when we're over a certain number of async bios. But, the waits are happening while writepages is building bios, and this can easily lead to stalls and other problems for people calling wait_on_page_writeback. The current fix is to let the congestion tests take care of waiting. sync() and others make sure to drain the current async requests to make sure that everything that was pending when the sync was started really get to disk. The code would drain pending requests both before and after submitting a new request. But, if one of the requests is waiting for page writeback to finish, the draining waits might block that page writeback. This changes the draining code to only wait after submitting the bio being processed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add backrefs and forward refs for subvols and snapshotsChris Mason2008-11-171-0/+6
| | | | | | | | | | | | | Subvols and snapshots can now be referenced from any point in the directory tree. We need to maintain back refs for them so we can find lost subvols. Forward refs are added so that we know all of the subvols and snapshots referenced anywhere in the directory tree of a single subvol. This can be used to do recursive snapshotting (but they aren't yet) and it is also used to detect and prevent directory loops when creating new snapshots. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Give each subvol and snapshot their own anonymous devidChris Mason2008-11-171-2/+13
| | | | | | | | | | | | | | Each subvolume has its own private inode number space, and so we need to fill in different device numbers for each subvolume to avoid confusing applications. This commit puts a struct super_block into struct btrfs_root so it can call set_anon_super() and get a different device number generated for each root. btrfs_rename is changed to prevent renames across subvols. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Allow subvolumes and snapshots anywhere in the directory treeChris Mason2008-11-171-18/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Before, all snapshots and subvolumes lived in a single flat directory. This was awkward and confusing because the single flat directory was only writable with the ioctls. This commit changes the ioctls to create subvols and snapshots at any point in the directory tree. This requires making separate ioctls for snapshot and subvol creation instead of a combining them into one. The subvol ioctl does: btrfsctl -S subvol_name parent_dir After the ioctl is done subvol_name lives inside parent_dir. The snapshot ioctl does: btrfsctl -s path_for_snapshot root_to_snapshot path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up into directory and basename components. root_to_snapshot can be any file or directory in the FS. The snapshot is taken of the entire root where that file lives. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: mount ro and remount supportYan Zheng2008-11-121-10/+5
| | | | | | | | | | | This patch adds mount ro and remount support. The main changes in patch are: adding btrfs_remount and related helper function; splitting the transaction related code out of close_ctree into btrfs_commit_super; updating allocator to properly handle read only block group. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: Fix compile warnings on 32 bit machinesChris Mason2008-11-111-2/+2
| | | | | | | Simple casting here and there to fix things up. Signed-off-by: Chris Mason <chris.mason@oracle.com>
OpenPOWER on IntegriCloud