op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	proc: set attributes of pde using accessor functions	Rui Xiang	2014-01-23	2	-4/+3
\| \| \| \| \| \| \| \| \|	Use existing accessors proc_set_user() and proc_set_size() to set attributes. Just a cleanup. Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	proc: fix ->f_pos overflows in first_tid()	Oleg Nesterov	2014-01-23	1	-5/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. proc_task_readdir()->first_tid() path truncates f_pos to int, this is wrong even on 64bit. We could check that f_pos < PID_MAX or even INT_MAX in proc_task_readdir(), but this patch simply checks the potential overflow in first_tid(), this check is nop on 64bit. We do not care if it was negative and the new unsigned value is huge, all we need to ensure is that we never wrongly return !NULL. 2. Remove the 2nd "nr != 0" check before get_nr_threads(), nr_threads == 0 is not distinguishable from !pid_task() above. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sameer Nanda <snanda@chromium.org> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	proc: don't (ab)use ->group_leader in proc_task_readdir() paths	Oleg Nesterov	2014-01-23	1	-28/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	proc_task_readdir() does not really need "leader", first_tid() has to revalidate it anyway. Just pass proc_pid(inode) to first_tid() instead, it can do pid_task(PIDTYPE_PID) itself and read ->group_leader only if necessary. The patch also extracts the "inode is dead" code from pid_delete_dentry(dentry) into the new trivial helper, proc_inode_is_dead(inode), proc_task_readdir() uses it to return -ENOENT if this dir was removed. This is a bit racy, but the race is very inlikely and the getdents() after openndir() can see the empty "." + ".." dir only once. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sameer Nanda <snanda@chromium.org> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	proc: change first_tid() to use while_each_thread() rather than next_thread()	Oleg Nesterov	2014-01-23	1	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rerwrite the main loop to use while_each_thread() instead of next_thread(). We are going to fix or replace while_each_thread(), next_thread() should be avoided whenever possible. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sameer Nanda <snanda@chromium.org> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	proc: fix the potential use-after-free in first_tid()	Oleg Nesterov	2014-01-23	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	proc_task_readdir() verifies that the result of get_proc_task() is pid_alive() and thus its ->group_leader is fine too. However this is not necessarily true after rcu_read_unlock(), we need to recheck this again after first_tid() does rcu_read_lock(). Otherwise leader->thread_group.next (used by next_thread()) can be invalid if the rcu grace period expires in between. The race is subtle and unlikely, but still it is possible afaics. To simplify lets ignore the "likely" case when tid != 0, f_version can be cleared by proc_task_operations->llseek(). Suppose we have a main thread M and its subthread T. Suppose that f_pos == 3, iow first_tid() should return T. Now suppose that the following happens between rcu_read_unlock() and rcu_read_lock(): 1. T execs and becomes the new leader. This removes M from ->thread_group but next_thread(M) is still T. 2. T creates another thread X which does exec as well, T goes away. 3. X creates another subthread, this increments nr_threads. 4. first_tid() does next_thread(M) and returns the already dead T. Note also that we need 2. and 3. only because of get_nr_threads() check, and this check was supposed to be optimization only. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sameer Nanda <snanda@chromium.org> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	proc: cleanup/simplify get_task_state/task_state_array	Oleg Nesterov	2014-01-23	1	-12/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	get_task_state() and task_state_array[] look confusing and suboptimal, it is not clear what it can actually report to user-space and task_state_array[] blows .data for no reason. 1. state = (tsk->state & TASK_REPORT) \| tsk->exit_state is not clear. TASK_REPORT is self-documenting but it is not clear what ->exit_state can add. Move the potential exit_state's (EXIT_ZOMBIE and EXIT_DEAD) into TASK_REPORT and use it to calculate the final result. 2. With the change above it is obvious that task_state_array[] has the unused entries just to make BUILD_BUG_ON() happy. Change this BUILD_BUG_ON() to use TASK_REPORT rather than TASK_STATE_MAX and shrink task_state_array[]. 3. Turn the "while (state)" loop into fls(state). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	coredump: make __get_dumpable/get_dumpable inline, kill fs/coredump.h	Oleg Nesterov	2014-01-23	3	-25/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. Remove fs/coredump.h. It is not clear why do we need it, it only declares __get_dumpable(), signal.c includes it for no reason. 2. Now that get_dumpable() and __get_dumpable() are really trivial make them inline in linux/sched.h. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alex Kelly <alex.page.kelly@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	coredump: kill MMF_DUMPABLE and MMF_DUMP_SECURELY	Oleg Nesterov	2014-01-23	1	-15/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Nobody actually needs MMF_DUMPABLE/MMF_DUMP_SECURELY, they are only used to enforce the encoding of SUID_DUMP_* enum in mm->flags & MMF_DUMPABLE_MASK. Now that set_dumpable() updates both bits atomically we can kill them and simply store the value "as is" in 2 lower bits. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alex Kelly <alex.page.kelly@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	coredump: set_dumpable: fix the theoretical race with itself	Oleg Nesterov	2014-01-23	1	-34/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	set_dumpable() updates MMF_DUMPABLE_MASK in a non-trivial way to ensure that get_dumpable() can't observe the intermediate state, but this all can't help if multiple threads call set_dumpable() at the same time. And in theory commit_creds()->set_dumpable(SUID_DUMP_ROOT) racing with sys_prctl()->set_dumpable(SUID_DUMP_DISABLE) can result in SUID_DUMP_USER. Change this code to update both bits atomically via cmpxchg(). Note: this assumes that it is safe to mix bitops and cmpxchg. IOW, if, say, an architecture implements cmpxchg() using the locking (like arch/parisc/lib/bitops.c does), then it should use the same locks for set_bit/etc. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alex Kelly <alex.page.kelly@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	hfsplus: remove hfsplus_file_lookup()	Sougata Santra	2014-01-23	1	-59/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	HFS+ resource fork lookup breaks opendir() library function. Since opendir first calls open() with O_DIRECTORY flag set. O_DIRECTORY means "refuse to open if not a directory". The open system call in the kernel does a check for inode->i_op->lookup and returns -ENOTDIR. So if hfsplus_file_lookup is set it allows opendir() for plain files. Also resource fork lookup in HFS+ does not work. Since it is never invoked after VFS permission checking. It will always return with -EACCES. When we call opendir() on a file, it does not return NULL. opendir() library call is based on open with O_DIRECTORY flag passed and then layered on top of getdents() system call. O_DIRECTORY means "refuse to open if not a directory". The open() system call in the kernel does a check for: do_sys_open() -->..--> can_lookup() i.e it only checks inode->i_op->lookup and returns ENOTDIR if this function pointer is not set. In OSX, we can open "file/rsrc" to get the resource fork of "file". This behavior is emulated inside hfsplus on Linux, which means that to some degree every file acts like a directory. That is the reason lookup() inode operations is supported for files, and it is possible to do a lookup on this specific name. As a result of this open succeeds without returning ENOTDIR for HFS+ Please see the LKML discussion thread on this issue: http://marc.info/?l=linux-fsdevel&m=122823343730412&w=2 I tried to test file/rsrc lookup in HFS+ driver and the feature does not work. From OSX: $ touch test $ echo "1234" > test/..namedfork/rsrc $ ls -l test..namedfork/rsrc --rw-r--r-- 1 tuxera staff 5 10 dec 12:59 test/..namedfork/rsrc [sougata@ultrabook tmp]$ id uid=1000(sougata) gid=1000(sougata) groups=1000(sougata),5(tty),18(dialout),1001(vboxusers) [sougata@ultrabook tmp]$ mount /dev/sdb1 on /mnt/tmp type hfsplus (rw,relatime,umask=0,uid=1000,gid=1000,nls=utf8) [sougata@ultrabook tmp]$ ls -l test/rsrc ls: cannot access test/rsrc: Permission denied According to this LKML thread it is expected behavior. http://marc.info/?t=121139033800008&r=1&w=4 I guess now that permission checking happens in vfs generic_permission() ? So it turns out that even though the lookup() inode_operation exists for HFS+ files. It cannot really get invoked ?. So if we can disable this feature to make opendir() work for HFS+. Signed-off-by: Sougata Santra <sougata@tuxera.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Anton Altaparmakov <aia21@cam.ac.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	nilfs2: add comments for ioctls	Vyacheslav Dubeyko	2014-01-23	1	-1/+362
\| \| \| \| \| \| \| \| \| \| \|	Add comments for ioctls in fs/nilfs2/ioctl.c file and describe NILFS2 specific ioctls in Documentation/filesystems/nilfs2.txt. Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Reviewed-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Wenliang Fan <fanwlexca@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs/nilfs2: fix integer overflow in nilfs_ioctl_wrap_copy()	Wenliang Fan	2014-01-23	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The local variable 'pos' in nilfs_ioctl_wrap_copy function can overflow if a large number was passed to argv->v_index from userspace and the sum of argv->v_index and argv->v_nmembs exceeds the maximum value of __u64 type integer (= ~(__u64)0 = 18446744073709551615). Here, argv->v_index is a 64-bit width argument to specify the start position of target data items (such as segment number, checkpoint number, or virtual block address of nilfs), and argv->v_nmembs gives the total number of the items that userland programs (such as lssu, lscp, or cleanerd) want to get information about, which also gives the maximum element count of argv->v_base[] array. nilfs_ioctl_wrap_copy() calls dofunc() repeatedly and increments the position variable 'pos' at the end of each iteration if dofunc() itself didn't update 'pos': if (pos == ppos) pos += n; This patch prevents the overflow here by rejecting pairs of a start position (argv->v_index) and a total count (argv->v_nmembs) which leads to the overflow. [konishi.ryusuke@lab.ntt.co.jp: fix signedness issue] Signed-off-by: Wenliang Fan <fanwlexca@gmail.com> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs/pipe.c: skip file_update_time on frozen fs	Dmitry Monakhov	2014-01-23	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pipe has no data associated with fs so it is not good idea to block pipe_write() if FS is frozen, but we can not update file's time on such filesystem. Let's use same idea as we use in touch_time(). Addresses https://bugzilla.kernel.org/show_bug.cgi?id=65701 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	autofs: fix symlinks aren't checked for expiry	Ian Kent	2014-01-23	2	-0/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The autofs4 module doesn't consider symlinks for expire as it did in the older autofs v3 module (so it's actually a long standing regression). The user space daemon has focused on the use of bind mounts instead of symlinks for a long time now and that's why this has not been noticed. But with the future addition of amd map parsing to automount(8), not to mention amd itself (of am-utils), symlink expiry will be needed. The direct and offset mount types can't be symlinks and the tree mounts of version 4 were always real mounts so only indirect mounts need expire symlinks. Since the current users of the autofs4 module haven't reported this as a problem to date this patch probably isn't a candidate for backport to stable. Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	autofs: use IS_ROOT to replace root dentry checks	Rui Xiang	2014-01-23	1	-3/+3
\| \| \| \| \| \| \| \| \| \|	Use the helper macro !IS_ROOT to replace parent != dentry->d_parent. Just clean up. Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	autofs: fix the return value of autofs4_fill_super	Rui Xiang	2014-01-23	1	-5/+8
\| \| \| \| \| \| \| \| \| \| \|	While kzallocing sbi/ino fails, it should return -ENOMEM. And it should return the err value from autofs_prepare_pipe. Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	autofs4: translate pids to the right namespace for the daemon	Miklos Szeredi	2014-01-23	1	-2/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The PID and the TGID of the process triggering the mount are sent to the daemon. Currently the global pid values are sent (ones valid in the initial pid namespace) but this is wrong if the autofs daemon itself is not running in the initial pid namespace. So send the pid values that are valid in the namespace of the autofs daemon. The namespace to use is taken from the oz_pgrp pid pointer, which was set at mount time to the mounting process' pid namespace. If the pid translation fails (the triggering process is in an unrelated pid namespace) then the automount fails with ENOENT. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Eric Biederman <ebiederm@xmission.com> Acked-by: Ian Kent <raven@themaw.net> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	autofs4: allow autofs to work outside the initial PID namespace	Sukadev Bhattiprolu	2014-01-23	3	-13/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Enable autofs4 to work in a "container". oz_pgrp is converted from pid_t to struct pid and this is stored at mount time based on the "pgrp=" option or if the option is missing then the current pgrp. The "pgrp=" option is interpreted in the PID namespace of the current process. This option is flawed in that it doesn't carry the namespace information, so it should be deprecated. AFAICS the autofs daemon always sends the current pgrp, which is the default anyway. The oz_pgrp is also set from the AUTOFS_DEV_IOCTL_SETPIPEFD_CMD ioctl. This ioctl sets oz_pgrp to the current pgrp. It is not allowed to change the pid namespace. oz_pgrp is used mainly to determine whether the process traversing the autofs mount tree is the autofs daemon itself or not. This function now compares the pid pointers instead of the pid_t values. One other use of oz_pgrp is in autofs4_show_options. There is shows the virtual pid number (i.e. the one that is valid inside the PID namespace of the calling process) For debugging printk convert oz_pgrp to the value in the initial pid namespace. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Eric Biederman <ebiederm@xmission.com> Acked-by: Ian Kent <raven@themaw.net> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs/ramfs: move ramfs_aops to inode.c	Axel Lin	2014-01-23	4	-15/+7
\| \| \| \| \| \| \| \| \| \|	ramfs_aops is identical in file-mmu.c and file-nommu.c. Thus move it to fs/ramfs/inode.c and make it static. Signed-off-by: Axel Lin <axel.lin@ingics.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs/ramfs/file-nommu.c: make ramfs_nommu_get_unmapped_area() and ↵	Axel Lin	2014-01-23	1	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \| \|	ramfs_nommu_mmap() static Since commit 853ac43ab194 ("shmem: unify regular and tiny shmem"), ramfs_nommu_get_unmapped_area() and ramfs_nommu_mmap() are not directly referenced outside of file-nommu.c. Thus make them static. Signed-off-by: Axel Lin <axel.lin@ingics.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs: binfmt_elf: remove unused defines INTERPRETER_NONE and INTERPRETER_ELF	Todor Minchev	2014-01-23	1	-3/+0
\| \| \| \| \| \| \| \| \|	These two defines are unused since the removal of the a.out interpreter support in the ELF loader in kernel 2.6.25 Signed-off-by: Todor Minchev <todor@minchev.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	remove extra definitions of U32_MAX	Alex Elder	2014-01-23	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Now that the definition is centralized in <linux/kernel.h>, the definitions of U32_MAX (and related) elsewhere in the kernel can be removed. Signed-off-by: Alex Elder <elder@linaro.org> Acked-by: Sage Weil <sage@inktank.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	conditionally define U32_MAX	Alex Elder	2014-01-23	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \|	The symbol U32_MAX is defined in several spots. Change these definitions to be conditional. This is in preparation for the next patch, which centralizes the definition in <linux/kernel.h>. Signed-off-by: Alex Elder <elder@linaro.org> Cc: Sage Weil <sage@inktank.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	logfs: check for the return value after calling find_or_create_page()	Younger Liu	2014-01-23	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In get_mapping_page(), after calling find_or_create_page(), the return value should be checked. This patch has been provided: http://www.spinics.net/lists/linux-fsdevel/msg66948.html but not been applied now. Signed-off-by: Younger Liu <liuyiyang@hisense.com> Cc: Younger Liu <younger.liucn@gmail.com> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Reviewed-by: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Jörn Engel <joern@logfs.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs/proc/page.c: add PageAnon check to surely detect thp	Naoya Horiguchi	2014-01-23	1	-3/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	stable_page_flags() checks !PageHuge && PageTransCompound && PageLRU to know that a specified page is thp or not. But sometimes it's not enough and we fail to detect thp when the thp is on pagevec. This happens only for a few seconds after LRU list operations, but it makes it difficult to control our applications depending on this flag. So this patch adds another check PageAnon to detect thps on pagevec. It might not give the future extensibility for thp pagecache, but it's OK at least for now. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Merge branch 'for_linus' of ↵	Linus Torvalds	2014-01-23	3	-8/+6
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF & jbd fixes from Jan Kara: "A cleanup of JBD log messages and UDF fix of a lockdep warning" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Fix lockdep warning from udf_symlink() jbd: Revise KERN_EMERG error messages
\| *	udf: Fix lockdep warning from udf_symlink()	Jan Kara	2013-12-23	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Lockdep is complaining about UDF: ============================================= [ INFO: possible recursive locking detected ] 3.12.0+ #16 Not tainted --------------------------------------------- ln/7386 is trying to acquire lock: (&ei->i_data_sem){+.+...}, at: [<ffffffff8142f06d>] udf_get_block+0x8d/0x130 but task is already holding lock: (&ei->i_data_sem){+.+...}, at: [<ffffffff81431a8d>] udf_symlink+0x8d/0x690 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&ei->i_data_sem); lock(&ei->i_data_sem); * DEADLOCK * This is because we hold i_data_sem of the symlink inode while calling udf_add_entry() for the directory. I don't think this can ever lead to deadlocks since we never hold i_data_sem for two inodes in any other place. The fix is simple - move unlock of i_data_sem for symlink inode up. We don't need it for anything when linking symlink inode to directory. Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jan Kara <jack@suse.cz>
\| *	jbd: Revise KERN_EMERG error messages	Jan Kara	2013-12-04	2	-7/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some of KERN_EMERG printk messages do not really deserve this log level and the one in log_wait_commit() is even rather useless (the journal has been previously aborted and that is where we should have been complaining). So make some messages just KERN_ERR and remove the useless message. Signed-off-by: Jan Kara <jack@suse.cz>
* \|	Merge branch 'for-linus' of ↵	Linus Torvalds	2014-01-23	5	-32/+71
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse update from Miklos Szeredi: "This contains a fix for a potential use-after-module-unload bug noticed by Al and caching improvements for read-only fuse filesystems by Andrew Gallagher" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: support clients that don't implement 'open' fuse: don't invalidate attrs when not using atime fuse: fix SetPageUptodate() condition in STORE fuse: fix pipe_buf_operations
\| * \|	fuse: support clients that don't implement 'open'	Andrew Gallagher	2014-01-22	2	-10/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	open/release operations require userspace transitions to keep track of the open count and to perform any FS-specific setup. However, for some purely read-only FSs which don't need to perform any setup at open/release time, we can avoid the performance overhead of calling into userspace for open/release calls. This patch adds the necessary support to the fuse kernel modules to prevent open/release operations from hitting in userspace. When the client returns ENOSYS, we avoid sending the subsequent release to userspace, and also remember this so that future opens also don't trigger a userspace operation. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
\| * \|	fuse: don't invalidate attrs when not using atime	Andrew Gallagher	2014-01-22	3	-4/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Various read operations (e.g. readlink, readdir) invalidate the cached attrs for atime changes. This patch adds a new function 'fuse_invalidate_atime', which checks for a read-only super block and avoids the attr invalidation in that case. Signed-off-by: Andrew Gallagher <andrewjcg@fb.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
\| * \|	fuse: fix SetPageUptodate() condition in STORE	Miklos Szeredi	2014-01-22	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As noticed by Coverity the "num != 0" condition never triggers. Instead it should check for a complete page. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
\| * \|	fuse: fix pipe_buf_operations	Miklos Szeredi	2014-01-22	2	-17/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Having this struct in module memory could Oops when if the module is unloaded while the buffer still persists in a pipe. Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal merge them into nosteal_pipe_buf_ops (this is the same as default_pipe_buf_ops except stealing the page from the buffer is not allowed). Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org
* \| \|	Merge tag 'for-f2fs-3.14' of ↵	Linus Torvalds	2014-01-23	19	-800/+1739
\|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, a couple of sysfs entries were introduced to tune the f2fs at runtime. In addition, f2fs starts to support inline_data and improves the read/write performance in some workloads by refactoring bio-related flows. This patch-set includes the following major enhancement patches. - support inline_data - refactor bio operations such as merge operations and rw type assignment - enhance the direct IO path - enhance bio operations - truncate a node page when it becomes obsolete - add sysfs entries: small_discards, max_victim_search, and in-place-update - add a sysfs entry to control max_victim_search The other bug fixes are as follows. - fix a bug in truncate_partial_nodes - avoid warnings during sparse and build process - fix error handling flows - fix potential bit overflows And, there are a bunch of cleanups" * tag 'for-f2fs-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (95 commits) f2fs: drop obsolete node page when it is truncated f2fs: introduce NODE_MAPPING for code consistency f2fs: remove the orphan block page array f2fs: add help function META_MAPPING f2fs: move a branch for code redability f2fs: call mark_inode_dirty to flush dirty pages f2fs: clean checkpatch warnings f2fs: missing REQ_META and REQ_PRIO when sync_meta_pages(META_FLUSH) f2fs: avoid f2fs_balance_fs call during pageout f2fs: add delimiter to seperate name and value in debug phrase f2fs: use spinlock rather than mutex for better speed f2fs: move alloc new orphan node out of lock protection region f2fs: move grabing orphan pages out of protection region f2fs: remove the needless parameter of f2fs_wait_on_page_writeback f2fs: update documents and a MAINTAINERS entry f2fs: add a sysfs entry to control max_victim_search f2fs: improve write performance under frequent fsync calls f2fs: avoid to read inline data except first page f2fs: avoid to left uninitialized data in page when read inline data f2fs: fix truncate_partial_nodes bug ...
\| * \| \|	f2fs: drop obsolete node page when it is truncated	Jaegeuk Kim	2014-01-23	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a node page is trucated, we'd better drop the page in the node_inode's page cache for better memory footprint. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: introduce NODE_MAPPING for code consistency	Jaegeuk Kim	2014-01-22	4	-31/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds NODE_MAPPING which is similar as META_MAPPING introduced by Gu Zheng. Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: remove the orphan block page array	Gu Zheng	2014-01-22	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As the orphan_blocks may be max to 504, so it is not security and rigorous to store such a large array in the kernel stack as Dan Carpenter said. In fact, grab_meta_page has locked the page in the page cache, and we can use find_get_page() to fetch the page safely in the downstream, so we can remove the page array directly. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: add help function META_MAPPING	Gu Zheng	2014-01-22	5	-8/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce help function META_MAPPING() to get the cache meta blocks' address space. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: move a branch for code redability	Jaegeuk Kim	2014-01-22	1	-5/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch moves a function in f2fs_delete_entry for code readability. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: call mark_inode_dirty to flush dirty pages	Jaegeuk Kim	2014-01-22	3	-5/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a dentry page is updated, we should call mark_inode_dirty to add the inode into the dirty list, so that its dentry pages are flushed to the disk. Otherwise, the inode can be evicted without flush. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: clean checkpatch warnings	Chris Fries	2014-01-20	11	-30/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fixed a variety of trivial checkpatch warnings. The only delta should be some minor formatting on log strings that were split / too long. Signed-off-by: Chris Fries <cfries@motorola.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: missing REQ_META and REQ_PRIO when sync_meta_pages(META_FLUSH)	Changman Lee	2014-01-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Doing sync_meta_pages with META_FLUSH when checkpoint, we overide rw using WRITE_FLUSH_FUA. At this time, we also should set REQ_META\|REQ_PRIO. Signed-off-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: avoid f2fs_balance_fs call during pageout	Jaegeuk Kim	2014-01-16	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch should resolve the following bug. ========================================================= [ INFO: possible irq lock inversion dependency detected ] 3.13.0-rc5.f2fs+ #6 Not tainted --------------------------------------------------------- kswapd0/41 just changed the state of lock: (&sbi->gc_mutex){+.+.-.}, at: [<ffffffffa030503e>] f2fs_balance_fs+0xae/0xd0 [f2fs] but this lock took another, RECLAIM_FS-READ-unsafe lock in the past: (&sbi->cp_rwsem){++++.?} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Chain exists of: &sbi->gc_mutex --> &sbi->cp_mutex --> &sbi->cp_rwsem Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sbi->cp_rwsem); local_irq_disable(); lock(&sbi->gc_mutex); lock(&sbi->cp_mutex); <Interrupt> lock(&sbi->gc_mutex); * DEADLOCK * This bug is due to the f2fs_balance_fs call in f2fs_write_data_page. If f2fs_write_data_page is triggered by wbc->for_reclaim via kswapd, it should not call f2fs_balance_fs which tries to get a mutex grabbed by original syscall flow. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: add delimiter to seperate name and value in debug phrase	Changman Lee	2014-01-14	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Support for f2fs-tools/tools/f2stat to monitor /sys/kernel/debug/f2fs/status Signed-off-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: use spinlock rather than mutex for better speed	Gu Zheng	2014-01-14	2	-13/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With the 2 previous changes, all the long time operations are moved out of the protection region, so here we can use spinlock rather than mutex (orphan_inode_mutex) for lower overhead. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: move alloc new orphan node out of lock protection region	Gu Zheng	2014-01-14	1	-6/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move alloc new orphan node out of lock protection region. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: move grabing orphan pages out of protection region	Gu Zheng	2014-01-14	1	-7/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move grabing orphan block page out of protection region, and grab all the orphan block pages ahead. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> [Jaegeuk Kim: remove unnecessary code pointed by Chao Yu] Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: remove the needless parameter of f2fs_wait_on_page_writeback	Yuan Zhong	2014-01-14	6	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	"boo sync" parameter is never referenced in f2fs_wait_on_page_writeback. We should remove this parameter. Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: add a sysfs entry to control max_victim_search	Jaegeuk Kim	2014-01-08	4	-3/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously during SSR and GC, the maximum number of retrials to find a victim segment was hard-coded by MAX_VICTIM_SEARCH, 4096 by default. This number makes an effect on IO locality, when SSR mode is activated, which results in performance fluctuation on some low-end devices. If max_victim_search = 4, the victim will be searched like below. ("D" represents a dirty segment, and "" indicates a selected victim segment.) D1 D2 D3 D4 D5 D6 D7 D8 D9 [ ] [ * ] [ * ] [ ....] This patch adds a sysfs entry to control the number dynamically through: /sys/fs/f2fs/$dev/max_victim_search Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
\| * \| \|	f2fs: improve write performance under frequent fsync calls	Jaegeuk Kim	2014-01-08	4	-10/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When considering a bunch of data writes with very frequent fsync calls, we are able to think the following performance regression. N: Node IO, D: Data IO, IO scheduler: cfq Issue pending IOs D1 D2 D3 D4 D1 D2 D3 D4 N1 D2 D3 D4 N1 N2 N1 D3 D4 N2 D1 --> N1 can be selected by cfq becase of the same priority of N and D. Then D3 and D4 would be delayed, resuling in performance degradation. So, when processing the fsync call, it'd better give higher priority to data IOs than node IOs by assigning WRITE and WRITE_SYNC respectively. This patch improves the random wirte performance with frequent fsync calls by up to 10%. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>