Btrfs: fix data corruption after fast fsync and writeback error

When we do a fast fsync, we start all ordered operations and then, while they are running in parallel, we visit the list of modified extent maps, construct their matching file extent items and write them to the log btree. After that, in btrfs_sync_log(), we wait for all the ordered operations to finish (via btrfs_wait_logged_extents).

The problem with this is that we were completely ignoring errors that can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM or -EIO. When such an error happens, it means that parts of the on-disk extent were never written to, so we end up logging file extent items that point to extents containing garbage/random data. After a crash/reboot plus log replay, our inode's metadata then points to those extents.

This is in contrast with the full (non-fast) fsync path, where we start all ordered operations, wait for them to finish and only then write to the log btree. In that path, after each ordered operation completes we check if it is flagged with an error (BTRFS_ORDERED_IOERR) and return -EIO if so (via btrfs_wait_ordered_range).

So if an error happens with any ordered operation, just return an -EIO error to userspace, so that it knows that not all of its previous writes were durably persisted and the application can take proper action (such as redoing the writes). We must also never leave file extent items in the log that refer to extents that were not fully written.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
author: Filipe Manana <fdmanana@suse.com> 2014-09-05 15:14:39 +0100
committer: Chris Mason <clm@fb.com> 2014-09-19 06:57:51 -0700
commit: 8407f553268a4611f2542ed90677f0edfaa2c9c4 (patch)
tree: 2d0c17df51443c5d5e2cde9c079362e840f5c637 /fs/btrfs/file.c
parent: 669249eea365dd32b793b58891c74281c0aac47e (diff)
1 files changed, 19 insertions, 0 deletions
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index cdb7146..29b147d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2029,6 +2029,25 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 */
 	mutex_unlock(&inode->i_mutex);
 
+	/*
+	 * If any of the ordered extents had an error, just return it to user
+	 * space, so that the application knows some writes didn't succeed and
+	 * can take proper action (retry for e.g.). Blindly committing the
+	 * transaction in this case, would fool userspace that everything was
+	 * successful. And we also want to make sure our log doesn't contain
+	 * file extent items pointing to extents that weren't fully written to -
+	 * just like in the non fast fsync path, where we check for the ordered
+	 * operation's error flag before writing to the log tree and return -EIO
+	 * if any of them had this flag set (btrfs_wait_ordered_range) -
+	 * therefore we need to check for errors in the ordered operations,
+	 * which are indicated by ctx.io_err.
+	 */
+	if (ctx.io_err) {
+		btrfs_end_transaction(trans, root);
+		ret = ctx.io_err;
+		goto out;
+	}
+
 	if (ret != BTRFS_NO_LOG_SYNC) {
 		if (!ret) {
 			ret = btrfs_sync_log(trans, root, &ctx);
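
The control flow that this hunk adds can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_log_ctx`, `toy_state`, `toy_log_inode()`, `toy_fast_fsync()` and `TOY_NO_LOG_SYNC` are hypothetical stand-ins invented for illustration and are not btrfs code. The only part taken from the patch is the shape of the check: if the context carries an I/O error, end the transaction and return that error before the log is synced.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for BTRFS_NO_LOG_SYNC; the value is arbitrary. */
#define TOY_NO_LOG_SYNC 256

/* Hypothetical stand-in for the log context that carries io_err. */
struct toy_log_ctx {
	int io_err;             /* error from an ordered write, or 0 */
};

/* Records what the modelled fsync did, so callers can inspect it. */
struct toy_state {
	int transaction_ended;
	int log_synced;
};

/* Models logging the inode while ordered writes run in parallel. */
static int toy_log_inode(struct toy_log_ctx *ctx, int write_error)
{
	ctx->io_err = write_error;  /* e.g. -EIO or -ENOSPC, 0 on success */
	return 0;                   /* in this model, logging itself succeeds */
}

/* Models the fast fsync path with the error check from the patch. */
static int toy_fast_fsync(struct toy_state *st, int write_error)
{
	struct toy_log_ctx ctx = { 0 };
	int ret;

	assert(st != NULL);
	ret = toy_log_inode(&ctx, write_error);

	/* The added check: report the write error, never sync the log. */
	if (ctx.io_err) {
		st->transaction_ended = 1;
		return ctx.io_err;
	}

	if (ret != TOY_NO_LOG_SYNC && !ret)
		st->log_synced = 1;     /* stands in for btrfs_sync_log() */

	st->transaction_ended = 1;
	return ret;
}
```

In this model, calling `toy_fast_fsync()` with a non-zero `write_error` returns that error and leaves `log_synced` at 0, which mirrors the intent of the patch: the log is not synced when an ordered write has failed.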
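
The commit message notes that an application can react to a failed fsync by redoing its writes. A minimal userspace sketch of that reaction follows, assuming the application still holds its own copy of the data. `write_all()` and `write_durably()` are hypothetical helper names, and the retry count is an arbitrary policy choice; only `open`, `pwrite`, `fsync` and `close` are real POSIX calls.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer starting at offset 0, handling short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = pwrite(fd, buf + done, len - done, (off_t)done);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		done += (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: persist buf to path and redo the writes from the
 * caller's buffer if write or fsync reports an error.
 * Returns 0 on success or a negative errno value.
 */
static int write_durably(const char *path, const char *buf, size_t len,
			 int max_attempts)
{
	int fd;
	int ret = -EINVAL;      /* returned if max_attempts <= 0 */

	assert(path != NULL && buf != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -errno;

	for (int attempt = 0; attempt < max_attempts; attempt++) {
		ret = write_all(fd, buf, len);
		if (ret == 0 && fsync(fd) == 0)
			break;              /* data reported as durable */
		if (ret == 0)
			ret = -errno;       /* fsync failed; try again */
	}

	close(fd);
	return ret;
}
```

The design point is that the return value of fsync is checked and acted upon. Whether a retry is the right response, and how many attempts are sensible, depends on the application.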