xfs: always drain dio before extending aio write submission

XFS supports and typically allows concurrent asynchronous direct I/O submission to a single file. One exception to the rule is that file extending dio writes that start beyond the current EOF (e.g., potentially create a hole at EOF) require exclusive I/O access to the file. This is because such writes must zero any pre-existing blocks beyond EOF that are exposed by virtue of now residing within EOF as a result of the write about to be submitted. Before EOF zeroing can occur, the current file i_size must be stabilized to avoid data corruption. In this scenario, XFS upgrades the iolock to exclude any further I/O submission, waits on in-flight I/O to complete to ensure i_size is up to date (i_size is updated on dio write completion) and restarts the various checks against the state of the file. The problem is that this protection sequence is triggered only when the iolock is currently held shared. While this is true for async dio in most cases, the caller may upgrade the lock in advance based on arbitrary circumstances with respect to EOF zeroing. For example, the iolock is always acquired exclusively if the start offset is not block aligned. This means that even though the iolock is already held exclusive for such I/Os, pending I/O is not drained and thus EOF zeroing can occur based on an unstable i_size. This problem has been reproduced as guest data corruption in virtual machines with file-backed qcow2 virtual disks hosted on an XFS filesystem. The virtual disks must be configured with aio=native mode and the must not be truncated out to the maximum file size (as some virt managers will do). Update xfs_file_aio_write_checks() to unconditionally drain in-flight dio before EOF zeroing can occur. Rather than trigger the wait based on iolock state, use a new flag and upgrade the iolock when necessary. Note that this results in a full restart of the inode checks even when the iolock was already held exclusive when technically it is only required to recheck i_size. This should be a rare enough occurrence that it is preferable to keep the code simple rather than create an alternate restart jump target. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
author: Brian Foster <bfoster@redhat.com> 2015-10-12 16:02:05 +1100
committer: Dave Chinner <david@fromorbit.com> 2015-10-12 16:02:05 +1100
commit: 3136e8bb3054d3bb68942f8f1ee6c26c05f798b0 (patch)
tree: 8f69aadc30dc84f2b1d33e2c31f873486eb0674d /fs
parent: 009c6e871e98aa23bc2e58474c3d9feb05dd1ae6 (diff)
download: op-kernel-dev-3136e8bb3054d3bb68942f8f1ee6c26c05f798b0.zip
op-kernel-dev-3136e8bb3054d3bb68942f8f1ee6c26c05f798b0.tar.gz
1 files changed, 9 insertions, 6 deletions
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e78feb4..347b3e0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -574,6 +574,7 @@ xfs_file_aio_write_checks(
 	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			error = 0;
 	size_t			count = iov_iter_count(from);
+	bool			drained_dio = false;
 
 restart:
 	error = generic_write_checks(iocb, from);
@@ -611,12 +612,13 @@ restart:
 		bool	zero = false;
 
 		spin_unlock(&ip->i_flags_lock);
-		if (*iolock == XFS_IOLOCK_SHARED) {
-			xfs_rw_iunlock(ip, *iolock);
-			*iolock = XFS_IOLOCK_EXCL;
-			xfs_rw_ilock(ip, *iolock);
-			iov_iter_reexpand(from, count);
-
+		if (!drained_dio) {
+			if (*iolock == XFS_IOLOCK_SHARED) {
+				xfs_rw_iunlock(ip, *iolock);
+				*iolock = XFS_IOLOCK_EXCL;
+				xfs_rw_ilock(ip, *iolock);
+				iov_iter_reexpand(from, count);
+			}
 			/*
 			 * We now have an IO submission barrier in place, but
 			 * AIO can do EOF updates during IO completion and hence
@@ -626,6 +628,7 @@ restart:
 			 * no-op.
 			 */
 			inode_dio_wait(inode);
+			drained_dio = true;
 			goto restart;
 		}
 		error = xfs_zero_eof(ip, iocb->ki_pos, i_size_read(inode), &zero);
author	Brian Foster <bfoster@redhat.com>	2015-10-12 16:02:05 +1100
committer	Dave Chinner <david@fromorbit.com>	2015-10-12 16:02:05 +1100
commit	3136e8bb3054d3bb68942f8f1ee6c26c05f798b0 (patch)
tree	8f69aadc30dc84f2b1d33e2c31f873486eb0674d /fs
parent	009c6e871e98aa23bc2e58474c3d9feb05dd1ae6 (diff)
download	op-kernel-dev-3136e8bb3054d3bb68942f8f1ee6c26c05f798b0.zip op-kernel-dev-3136e8bb3054d3bb68942f8f1ee6c26c05f798b0.tar.gz