Btrfs: add extra flushing for renames and truncates

Renames and truncates are both common ways to replace old data with new data. The filesystem can make an effort to make sure the new data is on disk before actually replacing the old data. This is especially important for rename, which many application use as though it were atomic for both the data and the metadata involved. The current btrfs code will happily replace a file that is fully on disk with one that was just created and still has pending IO. If we crash after transaction commit but before the IO is done, we'll end up replacing a good file with a zero length file. The solution used here is to create a list of inodes that need special ordering and force them to disk before the commit is done. This is similar to the ext3 style data=ordering, except it is only done on selected files. Btrfs is able to get away with this because it does not wait on commits very often, even for fsync (which use a sub-commit). For renames, we order the file when it wasn't already on disk and when it is replacing an existing file. Larger files are sent to filemap_flush right away (before the transaction handle is opened). For truncates, we order if the file goes from non-zero size down to zero size. This is a little different, because at the time of the truncate the file has no dirty bytes to order. But, we flag the inode so that it is added to the ordered list on close (via release method). We also immediately add it to the ordered list of the current transaction so that we can try to flush down any writes the application sneaks in before commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
author: Chris Mason <chris.mason@oracle.com> 2009-03-31 13:27:11 -0400
committer: Chris Mason <chris.mason@oracle.com> 2009-03-31 14:27:58 -0400
commit: 5a3f23d515a2ebf0c750db80579ca57b28cbce6d (patch)
tree: e0ffb43dd35f1c3def9a74ec7a6f4470902c9761 /fs/btrfs/ctree.h
parent: 1a81af4d1d9c60d4313309f937a1fc5567205a87 (diff)
download: op-kernel-dev-5a3f23d515a2ebf0c750db80579ca57b28cbce6d.zip
op-kernel-dev-5a3f23d515a2ebf0c750db80579ca57b28cbce6d.tar.gz
1 files changed, 35 insertions, 0 deletions
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2737fac..f48905e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -45,6 +45,13 @@ struct btrfs_ordered_sum;
 
 #define BTRFS_MAX_LEVEL 8
 
+/*
+ * files bigger than this get some pre-flushing when they are added
+ * to the ordered operations list.  That way we limit the total
+ * work done by the commit
+ */
+#define BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT (8 * 1024 * 1024)
+
 /* holds pointers to all of the tree roots */
 #define BTRFS_ROOT_TREE_OBJECTID 1ULL
 
@@ -727,6 +734,15 @@ struct btrfs_fs_info {
 	struct mutex volume_mutex;
 	struct mutex tree_reloc_mutex;
 
+	/*
+	 * this protects the ordered operations list only while we are
+	 * processing all of the entries on it.  This way we make
+	 * sure the commit code doesn't find the list temporarily empty
+	 * because another function happens to be doing non-waiting preflush
+	 * before jumping into the main commit.
+	 */
+	struct mutex ordered_operations_mutex;
+
 	struct list_head trans_list;
 	struct list_head hashers;
 	struct list_head dead_roots;
@@ -741,10 +757,29 @@ struct btrfs_fs_info {
 	 * ordered extents
 	 */
 	spinlock_t ordered_extent_lock;
+
+	/*
+	 * all of the data=ordered extents pending writeback
+	 * these can span multiple transactions and basically include
+	 * every dirty data page that isn't from nodatacow
+	 */
 	struct list_head ordered_extents;
+
+	/*
+	 * all of the inodes that have delalloc bytes.  It is possible for
+	 * this list to be empty even when there is still dirty data=ordered
+	 * extents waiting to finish IO.
+	 */
 	struct list_head delalloc_inodes;
 
 	/*
+	 * special rename and truncate targets that must be on disk before
+	 * we're allowed to commit.  This is basically the ext3 style
+	 * data=ordered list.
+	 */
+	struct list_head ordered_operations;
+
+	/*
 	 * there is a pool of worker threads for checksumming during writes
 	 * and a pool for checksumming after reads.  This is because readers
 	 * can run with FS locks held, and the writers may be waiting for
author	Chris Mason <chris.mason@oracle.com>	2009-03-31 13:27:11 -0400
committer	Chris Mason <chris.mason@oracle.com>	2009-03-31 14:27:58 -0400
commit	5a3f23d515a2ebf0c750db80579ca57b28cbce6d (patch)
tree	e0ffb43dd35f1c3def9a74ec7a6f4470902c9761 /fs/btrfs/ctree.h
parent	1a81af4d1d9c60d4313309f937a1fc5567205a87 (diff)
download	op-kernel-dev-5a3f23d515a2ebf0c750db80579ca57b28cbce6d.zip op-kernel-dev-5a3f23d515a2ebf0c750db80579ca57b28cbce6d.tar.gz