shmem: fix shm fallocate() list corruption

The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not want to race with generating new pages by faulting them in. However, the wait-queue used to delay the page faulting has a serious problem: the wait queue head (in shmem_fallocate()) is allocated on the stack, and the code expects that "wake_up_all()" will make sure that all the queue entries are gone before the stack frame is de-allocated. And that is not at all necessarily the case. Yes, a normal wake-up sequence will remove the wait-queue entry that caused the wakeup (see "autoremove_wake_function()"), but the key wording there is "that caused the wakeup". When there are multiple possible wakeup sources, the wait queue entry may well stay around. And _particularly_ in a page fault path, we may be faulting in new pages from user space while we also have other things going on, and there may well be other pending wakeups. So despite the "wake_up_all()", it's not at all guaranteed that all list entries are removed from the wait queue head on the stack. Fix this by introducing a new wakeup function that removes the list entry unconditionally, even if the target process had already woken up for other reasons. Use that "synchronous" function to set up the waiters in shmem_fault(). This problem has never been seen in the wild afaik, but Dave Jones has reported it on and off while running trinity. We thought we fixed the stack corruption with the blk-mq rq_list locking fix (commit 7fe311302f7d: "blk-mq: update hardware and software queues for sleeping alloc"), but it turns out there was _another_ stack corruptor hiding in the trinity runs. Vegard Nossum (also running trinity) was able to trigger this one fairly consistently, and made us look once again at the shmem code due to the faults often being in that area. Reported-and-tested-by: Vegard Nossum <vegard.nossum@oracle.com>. Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Linus Torvalds <torvalds@linux-foundation.org> 2016-12-05 12:10:29 -0800
committer: Linus Torvalds <torvalds@linux-foundation.org> 2016-12-06 08:59:05 -0800
commit: 10d20bd25e06b220b1d816228b036e367215dc60 (patch)
tree: 5f19ca03611e4d35581d102ee64a2e72a4623369 /mm
parent: d9d04527c79f0f7d9186272866526e871ef4ac6f (diff)
download: op-kernel-dev-10d20bd25e06b220b1d816228b036e367215dc60.zip
op-kernel-dev-10d20bd25e06b220b1d816228b036e367215dc60.tar.gz
1 files changed, 14 insertions, 1 deletions
diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5..9d32e1c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1848,6 +1848,18 @@ unlock:
 	return error;
 }
 
+/*
+ * This is like autoremove_wake_function, but it removes the wait queue
+ * entry unconditionally - even if something else had already woken the
+ * target.
+ */
+static int synchronous_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	int ret = default_wake_function(wait, mode, sync, key);
+	list_del_init(&wait->task_list);
+	return ret;
+}
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vma->vm_file);
@@ -1883,7 +1895,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		    vmf->pgoff >= shmem_falloc->start &&
 		    vmf->pgoff < shmem_falloc->next) {
 			wait_queue_head_t *shmem_falloc_waitq;
-			DEFINE_WAIT(shmem_fault_wait);
+			DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
 
 			ret = VM_FAULT_NOPAGE;
 			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
@@ -2665,6 +2677,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		spin_lock(&inode->i_lock);
 		inode->i_private = NULL;
 		wake_up_all(&shmem_falloc_waitq);
+		WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list));
 		spin_unlock(&inode->i_lock);
 		error = 0;
 		goto out;
author	Linus Torvalds <torvalds@linux-foundation.org>	2016-12-05 12:10:29 -0800
committer	Linus Torvalds <torvalds@linux-foundation.org>	2016-12-06 08:59:05 -0800
commit	10d20bd25e06b220b1d816228b036e367215dc60 (patch)
tree	5f19ca03611e4d35581d102ee64a2e72a4623369 /mm
parent	d9d04527c79f0f7d9186272866526e871ef4ac6f (diff)
download	op-kernel-dev-10d20bd25e06b220b1d816228b036e367215dc60.zip op-kernel-dev-10d20bd25e06b220b1d816228b036e367215dc60.tar.gz