op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	kconfig: make allnoconfig disable options behind EMBEDDED and EXPERT	Josh Triplett	2014-04-07	8	-6/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	"make allnoconfig" exists to ease testing of minimal configurations. Documentation/SubmitChecklist includes a note to test with allnoconfig. This helps catch missing dependencies on common-but-not-required functionality, which might otherwise go unnoticed. However, allnoconfig still leaves many symbols enabled, because they're hidden behind CONFIG_EMBEDDED or CONFIG_EXPERT. For instance, allnoconfig still has CONFIG_PRINTK and CONFIG_BLOCK enabled, so drivers don't typically get build-tested with those disabled. To address this, introduce a new Kconfig option "allnoconfig_y", used on symbols which only exist to hide other symbols. Set it on CONFIG_EMBEDDED (which then selects CONFIG_EXPERT). allnoconfig will then disable all the symbols hidden behind those. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Michal Marek <mmarek@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ppc: make PPC_BOOK3S_64 select IRQ_WORK	Josh Triplett	2014-04-07	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix breakage which will be exposed by the patch "kconfig: make allnoconfig disable options behind EMBEDDED and EXPERT". arch/powerpc/kernel/mce.c, compiled in for PPC_BOOK3S_64, calls functions only built when IRQ_WORK, so select it. Fixes the following build error: arch/powerpc/kernel/built-in.o: In function `.machine_check_queue_event': (.text+0x11260): undefined reference to `.irq_work_queue' Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	ia64: select CONFIG_TTY for use of tty_write_message in unaligned	Josh Triplett	2014-04-07	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix breakage which will be exposed by the patch "kconfig: make allnoconfig disable options behind EMBEDDED and EXPERT". arch/ia64/kernel/unaligned.c uses tty_write_message to print an unaligned access exception to the TTY of the current user process. Enable TTY to prevent a build error. Minimal fix, on the basis that few people on ia64 will care deeply about kernel size enough to turn off TTY. Ideally, I'd instead suggest dropping the tty_write_message entirely, and just leaving the printk. Bonus: no need to sprintf first. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	cris: cpuinfo_op should depend on CONFIG_PROC_FS	Geert Uytterhoeven	2014-04-07	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix breakage which will be exposed by the patch "kconfig: make allnoconfig disable options behind EMBEDDED and EXPERT". Now allnoconfig started disabling CONFIG_PROC_FS: arch/cris/kernel/built-in.o:(.rodata+0xc): undefined reference to `show_cpuinfo' make: *** [vmlinux] Error 1 Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Mikael Starvik <starvik@axis.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	cris: make ETRAX_ARCH_V10 select TTY for use in debugport	Josh Triplett	2014-04-07	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix breakage which will be exposed by the patch "kconfig: make allnoconfig disable options behind EMBEDDED and EXPERT". arch/cris/arch-v10/kernel/debugport.c, compiled in unconditionally with ETRAX_ARCH_V10, requires TTY, so select TTY to avoid a build failure. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Mikael Starvik <starvik@axis.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	drivers/misc/sgi-gru/grukdump.c: cleanup gru_dump_context() a little	Dan Carpenter	2014-04-07	1	-3/+3
\| \| \| \| \| \| \| \| \| \|	"ret" is zero here so we can remove the "!ret" part of the condition. "uhdr" is alread a __user pointer so we can remove the cast. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	kernel/panic.c: display reason at end + pr_emerg	Fabian Frederick	2014-04-07	1	-7/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, booting without initrd specified on 80x25 screen gives a call trace followed by atkbd : Spurious ACK. Original message ("VFS: Unable to mount root fs") is not available. Of course this could happen in other situations... This patch displays panic reason after call trace which could help lot of people even if it's not the very last line on screen. Also, convert all panic.c printk(KERN_EMERG to pr_emerg( [akpm@linux-foundation.org: missed a couple of pr_ conversions] Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs/bfs/inode.c: add __init to init_inodecache()	Fabian Frederick	2014-04-07	1	-1/+1
\| \| \| \| \| \| \| \|	init_inodecache is only called by __init init_bfs_fs Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	affs: add mount option to avoid filename truncates	Fabian Frederick	2014-04-07	5	-33/+57
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Normal behavior for filenames exceeding specific filesystem limits is to refuse operation. AFFS standard name length being only 30 characters against 255 for usual Linux filesystems, original implementation does filename truncate by default with a define value AFFS_NO_TRUNCATE which can be enabled but needs module compilation. This patch adds 'nofilenametruncate' mount option so that user can easily activate that feature and avoid a lot of problems (eg overwrite files ...) Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs/affs/dir.c: unlock/brelse dir on failure + code clean-up	Fabian Frederick	2014-04-07	1	-10/+18
\| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 0edf977d2ae3 ("[readdir] convert affs") returns directly -EIO without unlocking dir inode and releasing dir bh when second affs_bread sequence fails. This patch restores initial behaviour. It also fixes pr_debug and affs_error to fit in 80 columns + removes reference to filldir (replaced by dir_emit in the commit above). Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	affs: add __init to init_inodecache ()	Fabian Frederick	2014-04-07	1	-1/+1
\| \| \| \| \| \| \| \|	init_inodecache is only called by __init init_affs_fs Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs/adfs/super.c: add __init to init_inodecache()	Fabian Frederick	2014-04-07	1	-1/+1
\| \| \| \| \| \| \| \|	init_inodecache is only called by __init init_adfs_fs. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	hung_task: check the value of "sysctl_hung_task_timeout_sec"	Liu Hua	2014-04-07	2	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As sysctl_hung_task_timeout_sec is unsigned long, when this value is larger then LONG_MAX/HZ, the function schedule_timeout_interruptible in watchdog will return immediately without sleep and with print : schedule_timeout: wrong timeout value ffffffffffffff83 and then the funtion watchdog will call schedule_timeout_interruptible again and again. The screen will be filled with "schedule_timeout: wrong timeout value ffffffffffffff83" This patch does some check and correction in sysctl, to let the function schedule_timeout_interruptible allways get the valid parameter. Signed-off-by: Liu Hua <sdu.liu@huawei.com> Tested-by: Satoru Takeuchi <satoru.takeuchi@gmail.com> Cc: <stable@vger.kernel.org> [3.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	rapidio: rework device hierarchy and introduce mport class of devices	Alexandre Bounine	2014-04-07	10	-16/+133
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch removes an artificial RapidIO bus root device and establishes actual device hierarchy by providing reference to real parent devices. It also introduces device class for RapidIO controller devices (on-chip or an eternal bridge, known as "mport"). Existing implementation was sufficient for SoC-based platforms that have a single RapidIO controller. With introduction of devices using multiple RapidIO controllers and PCIe-to-RapidIO bridges the old scheme is very limiting or does not work at all. The implemented changes allow to properly reference platform's local RapidIO mport devices and provide device details needed for upper layers. This change to RapidIO device hierarchy does not break any known existing kernel or user space interfaces. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com> Cc: Stef van Os <stef.van.os@prodrive-technologies.com> Cc: Jerry Jacobs <jerry.jacobs@prodrive-technologies.com> Cc: Arno Tiemersma <arno.tiemersma@prodrive-technologies.com> Cc: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	drivers/rapidio/devices/tsi721_dma.c: optimize use of BDMA descriptors	Alexandre Bounine	2014-04-07	2	-33/+82
\| \| \| \| \| \| \| \| \| \| \| \|	Combine SG entries describing single contiguous memory block into one Tsi721 BDMA descriptor. This reduces number of hardware descriptors required for large data transfers and improves performance on the PCIe side by reducing number of descriptor fetch requests. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Matt Porter <mporter@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	lib/idr.c: use RCU_INIT_POINTER(x, NULL)	Monam Agarwal	2014-04-07	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL) The rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. And in the case of the NULL pointer, there is no structure to initialize. So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL) Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	idr: remove dead code	Stephen Hemminger	2014-04-07	2	-81/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove no longer used deprecated code, and make local functions static. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Jean Delvare <jdelvare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jeff Layton <jlayton@redhat.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: George Spelvin <linux@horizon.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	vmcore: continue vmcore initialization if PT_NOTE is found empty	WANG Chao	2014-04-07	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently when an empty PT_NOTE is detected, vmcore initialization fails. It sounds too harsh. Because PT_NOTE could be empty, for example, one offlined a cpu but never restarted kdump service, and after crash, PT_NOTE program header is there but no data contains. It's better to warn about the empty PT_NOTE and continue to initialise vmcore. And ultimately the multiple PT_NOTE are merged into a single one, all empty PT_NOTE are discarded naturally during the merge. So empty PT_NOTE is not visible to user space and vmcore is as good as expected. Signed-off-by: WANG Chao <chaowang@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Greg Pearson <greg.pearson@hp.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	include/linux/crash_dump.h: add vmcore_cleanup() prototype	Rashika Kheria	2014-04-07	3	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	Eliminate the following warning in proc/vmcore.c: fs/proc/vmcore.c:1088:6: warning: no previous prototype for `vmcore_cleanup' [-Wmissing-prototypes] [akpm@linux-foundation.org: clean up powerpc, remove unneeded EXPORT_SYMBOL] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	wait: WSTOPPED\|WCONTINUED doesn't work if a zombie leader is traced by ↵	Oleg Nesterov	2014-04-07	1	-13/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	another process Even if the main thread is dead the process still can stop/continue. However, if the leader is ptraced wait_consider_task(ptrace => false) always skips wait_task_stopped/wait_task_continued, so WSTOPPED or WCONTINUED can never work for the natural parent in this case. Move the "A zombie ptracee is only visible to its ptracer" check into the "if (!delay_group_leader(p))" block. ->notask_error is cleared by the "fall through" code below. This depends on the previous change, wait_task_stopped/continued must be avoided if !delay_group_leader() and the tracer is ->real_parent. Otherwise WSTOPPED\|WEXITED could wrongly report "stopped" when the child is already dead (single-threaded or not). If it is traced by another task then the "stopped" state is fine until the debugger detaches and reveals a zombie state. Stupid test-case: void tfunc(void arg) { sleep(1); // wait for zombie leader raise(SIGSTOP); exit(0x13); return NULL; } int run_child(void) { pthread_t thread; if (!fork()) { int tracee = getppid(); assert(ptrace(PTRACE_ATTACH, tracee, 0,0) == 0); do ptrace(PTRACE_CONT, tracee, 0,0); while (wait(NULL) > 0); return 0; } sleep(1); // wait for PTRACE_ATTACH assert(pthread_create(&thread, NULL, tfunc, NULL) == 0); pthread_exit(NULL); } int main(void) { int child, stat; child = fork(); if (!child) return run_child(); assert(child == waitpid(-1, &stat, WSTOPPED)); assert(stat == 0x137f); kill(child, SIGCONT); assert(child == waitpid(-1, &stat, WCONTINUED)); assert(stat == 0xffff); assert(child == waitpid(-1, &stat, 0)); assert(stat == 0x1300); return 0; } Without this patch it hangs in waitpid(WSTOPPED), wait_task_stopped() is never called. Note: this doesn't fix all problems with a zombie delay_group_leader(), WCONTINUED \| WEXITED check is not exactly right. debugger can't assume it will be notified if another thread reaps the whole thread group. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Michal Schmidt <mschmidt@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	wait: WSTOPPED\|WCONTINUED hangs if a zombie child is traced by real_parent	Oleg Nesterov	2014-04-07	1	-13/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	"A zombie is only visible to its ptracer" logic in wait_consider_task() is very wrong. Trivial test-case: #include <unistd.h> #include <sys/ptrace.h> #include <sys/wait.h> #include <assert.h> int main(void) { int child = fork(); if (!child) { assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); return 0x23; } assert(waitid(P_ALL, child, NULL, WEXITED \| WNOWAIT) == 0); assert(waitid(P_ALL, 0, NULL, WSTOPPED) == -1); return 0; } it hangs in waitpid(WSTOPPED) despite the fact it has a single zombie child. This is because wait_consider_task(ptrace => 0) sees p->ptrace and cleares ->notask_error assuming that the debugger should detach and notify us. Change wait_consider_task(ptrace => 0) to pretend that ptrace == T if the child is traced by us. This really simplifies the logic and allows us to do more fixes, see the next changes. This also hides the unwanted group stop state automatically, we can remove another ptrace_reparented() check. Unfortunately, this adds the following behavioural changes: 1. Before this patch wait(WEXITED \| __WNOTHREAD) does not reap a natural child if it is traced by the caller's sub-thread. Hopefully nobody will ever notice this change, and I think that nobody should rely on this behaviour anyway. 2. SIGNAL_STOP_CONTINUED is no longer hidden from debugger if it is real parent. While this change comes as a side effect, I think it is good by itself. The group continued state can not be consumed by another process in this case, it doesn't depend on ptrace, it doesn't make sense to hide it from real parent. Perhaps we should add the thread_group_leader() check before wait_task_continued()? May be, but this shouldn't depend on ptrace_reparented(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Michal Schmidt <mschmidt@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide EXIT_TRACE from user-space	Oleg Nesterov	2014-04-07	2	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	get_task_state() uses the most significant bit to report the state to user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition can be noticed via /proc as Z -> X -> Z change. Note that this was possible even before EXIT_TRACE was introduced. This is not really bad but imho it make sense to hide EXIT_TRACE from user-space completely. So the patch simply swaps EXIT_ZOMBIE and EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Michal Schmidt <mschmidt@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	wait: completely ignore the EXIT_DEAD tasks	Oleg Nesterov	2014-04-07	1	-5/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that EXIT_DEAD is the terminal state it doesn't make sense to call eligible_child() or security_task_wait() if the task is really dead. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Michal Schmidt <mschmidt@redhat.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	wait: use EXIT_TRACE only if thread_group_leader(zombie)	Oleg Nesterov	2014-04-07	1	-10/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	wait_task_zombie() always uses EXIT_TRACE/ptrace_unlink() if ptrace_reparented(). This is suboptimal and a bit confusing: we do not need do_notify_parent(p) if !thread_group_leader(p) and in this case we also do not need ptrace_unlink(), we can rely on ptrace_release_task(). Change wait_task_zombie() to check thread_group_leader() along with ptrace_reparented() and simplify the final p->exit_state transition. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Michal Schmidt <mschmidt@redhat.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transition	Oleg Nesterov	2014-04-07	2	-29/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and drops tasklist_lock. If this task is not the natural child and it is traced, we change its state back to EXIT_ZOMBIE for ->real_parent. The last transition is racy, this is even documented in 50b8d257486a "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race". wait_consider_task() tries to detect this transition and clear ->notask_error but we can't rely on ptrace_reparented(), debugger can exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE. And there is another problem which were missed before: this transition can also race with reparent_leader() which doesn't reset >exit_signal if EXIT_DEAD, assuming that this task must be reaped by someone else. So the tracee can be re-parented with ->exit_signal != SIGCHLD, and if /sbin/init doesn't use __WALL it becomes unreapable. This was fixed by the previous commit, but it was the temporary hack. 1. Add the new exit_state, EXIT_TRACE. It means that the task is the traced zombie, debugger is going to detach and notify its natural parent. This new state is actually EXIT_ZOMBIE \| EXIT_DEAD. This way we can avoid the changes in proc/kgdb code, get_task_state() still reports "X (dead)" in this case. Note: with or without this change userspace can see Z -> X -> Z transition. Not really bad, but probably makes sense to fix. 2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD if we need to notify the ->real_parent. 3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD is always the final state we can safely ignore such a task. 4. Change wait_consider_task() to check EXIT_TRACE separately and kill the racy and no longer needed ptrace_reparented() case. If ptrace == T an EXIT_TRACE thread should be simply ignored, the owner of this state is going to ptrace_unlink() this task. We can pretend that it was already removed from ->ptraced list. Otherwise we should skip this thread too but clear ->notask_error, we must be the natural parent and debugger is going to untrace and notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace" even if the task was already untraced. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com> Reported-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Michal Schmidt <mschmidt@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	wait: fix reparent_leader() vs EXIT_DEAD->EXIT_ZOMBIE race	Oleg Nesterov	2014-04-07	1	-4/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and drops tasklist_lock. If this task is not the natural child and it is traced, we change its state back to EXIT_ZOMBIE for ->real_parent. The last transition is racy, this is even documented in 50b8d257486a "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race". wait_consider_task() tries to detect this transition and clear ->notask_error but we can't rely on ptrace_reparented(), debugger can exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE. And there is another problem which were missed before: this transition can also race with reparent_leader() which doesn't reset >exit_signal if EXIT_DEAD, assuming that this task must be reaped by someone else. So the tracee can be re-parented with ->exit_signal != SIGCHLD, and if /sbin/init doesn't use __WALL it becomes unreapable. Change reparent_leader() to update ->exit_signal even if EXIT_DEAD. Note: this is the simple temporary hack for -stable, it doesn't try to solve all problems, it will be reverted by the next changes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com> Reported-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Michal Schmidt <mschmidt@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	exec: kill bprm->tcomm[], simplify the "basename" logic	Oleg Nesterov	2014-04-07	4	-22/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Starting from commit c4ad8f98bef7 ("execve: use 'struct filename ' for executable name passing") bprm->filename can not go away after flush_old_exec(), so we do not need to save the binary name in bprm->tcomm[] added by 96e02d158678 ("exec: fix use-after-free bug in setup_new_exec()"). And there was never need for filename_to_taskname-like code, we can simply do set_task_comm(kbasename(filename). This patch has to change set_task_comm() and trace_task_rename() to accept "const char ", but I think this change is also good. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	procfs: make /proc/*/pagemap 0400	Djalal Harouni	2014-04-07	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The /proc//pagemap contain sensitive information and currently its mode is 0444. Change this to 0400, so the VFS will prevent unprivileged processes from getting file descriptors on arbitrary privileged /proc//pagemap files. This reduces the scope of address space leaking and bypasses by protecting already running processes. Signed-off-by: Djalal Harouni <tixxdz@opendz.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	procfs: make /proc/*/{stack,syscall,personality} 0400	Djalal Harouni	2014-04-07	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These procfs files contain sensitive information and currently their mode is 0444. Change this to 0400, so the VFS will be able to block unprivileged processes from getting file descriptors on arbitrary privileged /proc/*/{stack,syscall,personality} files. This reduces the scope of ASLR leaking and bypasses by protecting already running processes. Signed-off-by: Djalal Harouni <tixxdz@opendz.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs/proc/inode.c: use RCU_INIT_POINTER(x, NULL)	Monam Agarwal	2014-04-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL) The rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. And in the case of the NULL pointer, there is no structure to initialize. So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL) Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	proc: show mnt_id in /proc/pid/fdinfo	Andrey Vagin	2014-04-07	2	-7/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we don't have a way how to determing from which mount point file has been opened. This information is required for proper dumping and restoring file descriptos due to presence of mount namespaces. It's possible, that two file descriptors are opened using the same paths, but one fd references mount point from one namespace while the other fd -- from other namespace. $ ls -l /proc/1/fd/1 lrwx------ 1 root root 64 Mar 19 23:54 /proc/1/fd/1 -> /dev/null $ cat /proc/1/fdinfo/1 pos: 0 flags: 0100002 mnt_id: 16 $ cat /proc/1/mountinfo \| grep ^16 16 32 0:4 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw,size=1013356k,nr_inodes=253339,mode=755 Signed-off-by: Andrey Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Rob Landley <rob@landley.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs/proc/meminfo: meminfo_proc_show(): fix typo in comment	Luiz Capitulino	2014-04-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	It should read "reclaimable slab" and not "reclaimable swap". Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	kernel/exit.c: call proc_exit_connector() after exit_state is set	Guillaume Morin	2014-04-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The process events connector delivers a notification when a process exits. This is really convenient for a process that spawns and wants to monitor its children through an epoll-able() interface. Unfortunately, there is a small window between when the event is delivered and the child become wait()-able. This is creates a race if the parent wants to make sure that it knows about the exit, e.g pid_t pid = fork(); if (pid > 0) { register_interest_for_pid(pid); if (waitpid(pid, NULL, WNOHANG) > 0) { /* We might have raced with exit() / } return; } / Child */ execve(...) register_interest_for_pid() would be telling the the connector socket reader to pay attention to events related to pid. Though this is not a bug, I think it would make the connector a bit more usable if this race was closed by simply moving the call to proc_exit_connector() from just before exit_notify() to right after. Oleg said: : Even with this patch the code above is still "racy" if the child is : multi-threaded. Plus it should obviously filter-out subthreads. And : afaics there is no way to make it reliable, even if you change the code : above so that waitpid() is called only after the last thread exits WNOHANG : still can fail. Signed-off-by: Guillaume Morin <guillaume@morinfr.org> Cc: Matt Helsley <matt.helsley@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	exit: move check_stack_usage() to the end of do_exit()	Oleg Nesterov	2014-04-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	It is not clear why check_stack_usage() is called so early and thus it never checks the stack usage in, say, exit_notify() or flush_ptrace_hw_breakpoint() or other functions which are only called by do_exit(). Move the callsite down to the last preempt_disable/schedule. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	exit: call disassociate_ctty() before exit_task_namespaces()	Oleg Nesterov	2014-04-07	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 8aac62706ada ("move exit_task_namespaces() outside of exit_notify()") breaks pppd and the exiting service crashes the kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 IP: ppp_register_channel+0x13/0x20 [ppp_generic] Call Trace: ppp_asynctty_open+0x12b/0x170 [ppp_async] tty_ldisc_open.isra.2+0x27/0x60 tty_ldisc_hangup+0x1e3/0x220 __tty_hangup+0x2c4/0x440 disassociate_ctty+0x61/0x270 do_exit+0x7f2/0xa50 ppp_register_channel() needs ->net_ns and current->nsproxy == NULL. Move disassociate_ctty() before exit_task_namespaces(), it doesn't make sense to delay it after perf_event_exit_task() or cgroup_exit(). This also allows to use task_work_add() inside the (nontrivial) code paths in disassociate_ctty(). Investigated by Peter Hurley. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Sree Harsha Totakura <sreeharsha@totakura.in> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Sree Harsha Totakura <sreeharsha@totakura.in> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andrey Vagin <avagin@openvz.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> [v3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm/zswap.c: remove unnecessary parentheses	SeongJae Park	2014-04-07	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \|	Fix following trivial checkpatch error: ERROR: return is not a function, parentheses are not required Signed-off-by: SeongJae Park <sj38.park@gmail.com> Acked-by: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm/zswap: support multiple swap devices	Minchan Kim	2014-04-07	1	-31/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Cai Liu reporeted that now zbud pool pages counting has a problem when multiple swap is used because it just counts only one swap intead of all of swap so zswap cannot control writeback properly. The result is unnecessary writeback or no writeback when we should really writeback. IOW, it made zswap crazy. Another problem in zswap is: For example, let's assume we use two swap A and B with different priority and A already has charged 19% long time ago and let's assume that A swap is full now so VM start to use B so that B has charged 1% recently. It menas zswap charged (19% + 1%) is full by default. Then, if VM want to swap out more pages into B, zbud_reclaim_page would be evict one of pages in B's pool and it would be repeated continuously. It's totally LRU reverse problem and swap thrashing in B would happen. This patch makes zswap consider mutliple swap by creating a zbud pool which will be shared by multiple swap so all of zswap pages in multiple swap keep order by LRU so it can prevent above two problems. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Cai Liu <cai.liu@samsung.com> Suggested-by: Weijie Yang <weijie.yang.kh@gmail.com> Cc: Seth Jennings <sjennings@variantweb.net> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm/zswap.c: update zsmalloc in comment to zbud	SeongJae Park	2014-04-07	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	zswap used zsmalloc before and now using zbud. But, some comments saying it use zsmalloc yet. Fix the trivial problems. Signed-off-by: SeongJae Park <sj38.park@gmail.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm/zswap.c: fix trivial typo and arrange indentation	SeongJae Park	2014-04-07	1	-2/+2
\| \| \| \| \| \| \| \|	Signed-off-by: SeongJae Park <sj38.park@gmail.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	zram: support REQ_DISCARD	Joonsoo Kim	2014-04-07	1	-0/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	zram is ram based block device and can be used by backend of filesystem. When filesystem deletes a file, it normally doesn't do anything on data block of that file. It just marks on metadata of that file. This behavior has no problem on disk based block device, but has problems on ram based block device, since we can't free memory used for data block. To overcome this disadvantage, there is REQ_DISCARD functionality. If block device support REQ_DISCARD and filesystem is mounted with discard option, filesystem sends REQ_DISCARD to block device whenever some data blocks are discarded. All we have to do is to handle this request. This patch implements to flag up QUEUE_FLAG_DISCARD and handle this REQ_DISCARD request. With it, we can free memory used by zram if it isn't used. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	zram: use scnprintf() in attrs show() methods	Sergey Senozhatsky	2014-04-07	2	-9/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	sysfs.txt documentation lists the following requirements: - The buffer will always be PAGE_SIZE bytes in length. On i386, this is 4096. - show() methods should return the number of bytes printed into the buffer. This is the return value of scnprintf(). - show() should always use scnprintf(). Use scnprintf() in show() functions. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	zram: propagate error to user	Minchan Kim	2014-04-07	4	-15/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we initialized zcomp with single, we couldn't change max_comp_streams without zram reset but current interface doesn't show any error to user and even it changes max_comp_streams's value without any effect so it would make user very confusing. This patch prevents max_comp_streams's change when zcomp was initialized as single zcomp and emit the error to user(ex, echo). [akpm@linux-foundation.org: don't return with the lock held, per Sergey] [fengguang.wu@intel.com: fix coccinelle warnings] Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	zram: return error-valued pointer from zcomp_create()	Sergey Senozhatsky	2014-04-07	2	-15/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of returning just NULL, return ERR_PTR from zcomp_create() if compressing backend creation has failed. ERR_PTR(-EINVAL) for unsupported compression algorithm request, ERR_PTR(-ENOMEM) for allocation (zcomp or compression stream) error. Perform IS_ERR() check of returned from zcomp_create() value in disksize_store() and set return code to PTR_ERR(). Change suggested by Jerome Marchand. [akpm@linux-foundation.org: clean up error recovery flow] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Jerome Marchand <jmarchan@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	zram: move comp allocation out of init_lock	Sergey Senozhatsky	2014-04-07	1	-12/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	While fixing lockdep spew of ->init_lock reported by Sasha Levin [1], Minchan Kim noted [2] that it's better to move compression backend allocation (using GPF_KERNEL) out of the ->init_lock lock, same way as with zram_meta_alloc(), in order to prevent the same lockdep spew. [1] https://lkml.org/lkml/2014/2/27/337 [2] https://lkml.org/lkml/2014/3/3/32 Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	zram: add lz4 algorithm backend	Sergey Senozhatsky	2014-04-07	5	-0/+82
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce LZ4 compression backend and make it available for selection. LZ4 support is optional and requires user to set ZRAM_LZ4_COMPRESS config option. The default compression backend is LZO. TEST (x86_64, core i5, 2 cores + 2 hyperthreading, zram disk size 1G, ext4 file system, 3 compression streams) iozone -t 3 -R -r 16K -s 60M -I +Z Test LZO LZ4 ---------------------------------------------- Initial write 1642744.62 1317005.09 Rewrite 2498980.88 1800645.16 Read 3957026.38 5877043.75 Re-read 3950997.38 5861847.00 Reverse Read 2937114.56 5047384.00 Stride read 2948163.19 4929587.38 Random read 3292692.69 4880793.62 Mixed workload 1545602.62 3502940.38 Random write 2448039.75 1758786.25 Pwrite 1670051.03 1338329.69 Pread 2530682.00 5097177.62 Fwrite 3232085.62 3275942.56 Fread 6306880.25 6645271.12 So on my system LZ4 is slower in write-only tests, while it performs better in read-only and mixed (reads + writes) tests. Official LZ4 benchmarks available here http://code.google.com/p/lz4/ (linux kernel uses revision r90). Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	zram: make compression algorithm selection possible	Sergey Senozhatsky	2014-04-07	6	-11/+93
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add and document `comp_algorithm' device attribute. This attribute allows to show supported compression and currently selected compression algorithms: cat /sys/block/zram0/comp_algorithm [lzo] lz4 and change selected compression algorithm: echo lzo > /sys/block/zram0/comp_algorithm Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	zram: add set_max_streams knob	Sergey Senozhatsky	2014-04-07	3	-3/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch allows to change max_comp_streams on initialised zcomp. Introduce zcomp set_max_streams() knob, zcomp_strm_multi_set_max_streams() and zcomp_strm_single_set_max_streams() callbacks to change streams limit for zcomp_strm_multi and zcomp_strm_single, accordingly. set_max_streams for single steam zcomp does nothing. If user has lowered the limit, then zcomp_strm_multi_set_max_streams() attempts to immediately free extra streams (as much as it can, depending on idle streams availability). Note, this patch does not allow to change stream 'policy' from single to multi stream (or vice versa) on already initialised compression backend. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	zram: add multi stream functionality	Sergey Senozhatsky	2014-04-07	6	-11/+201
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Existing zram (zcomp) implementation has only one compression stream (buffer and algorithm private part), so in order to prevent data corruption only one write (compress operation) can use this compression stream, forcing all concurrent write operations to wait for stream lock to be released. This patch changes zcomp to keep a compression streams list of user-defined size (via sysfs device attr). Each write operation still exclusively holds compression stream, the difference is that we can have N write operations (depending on size of streams list) executing in parallel. See TEST section later in commit message for performance data. Introduce struct zcomp_strm_multi and a set of functions to manage zcomp_strm stream access. zcomp_strm_multi has a list of idle zcomp_strm structs, spinlock to protect idle list and wait queue, making it possible to perform parallel compressions. The following set of functions added: - zcomp_strm_multi_find()/zcomp_strm_multi_release() find and release a compression stream, implement required locking - zcomp_strm_multi_create()/zcomp_strm_multi_destroy() create and destroy zcomp_strm_multi zcomp ->strm_find() and ->strm_release() callbacks are set during initialisation to zcomp_strm_multi_find()/zcomp_strm_multi_release() correspondingly. Each time zcomp issues a zcomp_strm_multi_find() call, the following set of operations performed: - spin lock strm_lock - if idle list is not empty, remove zcomp_strm from idle list, spin unlock and return zcomp stream pointer to caller - if idle list is empty, current adds itself to wait queue. it will be awaken by zcomp_strm_multi_release() caller. zcomp_strm_multi_release(): - spin lock strm_lock - add zcomp stream to idle list - spin unlock, wake up sleeper Minchan Kim reported that spinlock-based locking scheme has demonstrated a severe perfomance regression for single compression stream case, comparing to mutex-based (see https://lkml.org/lkml/2014/2/18/16) base spinlock mutex ==Initial write ==Initial write ==Initial write records: 5 records: 5 records: 5 avg: 1642424.35 avg: 699610.40 avg: 1655583.71 std: 39890.95(2.43%) std: 232014.19(33.16%) std: 52293.96 max: 1690170.94 max: 1163473.45 max: 1697164.75 min: 1568669.52 min: 573429.88 min: 1553410.23 ==Rewrite ==Rewrite ==Rewrite records: 5 records: 5 records: 5 avg: 1611775.39 avg: 501406.64 avg: 1684419.11 std: 17144.58(1.06%) std: 15354.41(3.06%) std: 18367.42 max: 1641800.95 max: 531356.78 max: 1706445.84 min: 1593515.27 min: 488817.78 min: 1655335.73 When only one compression stream available, mutex with spin on owner tends to perform much better than frequent wait_event()/wake_up(). This is why single stream implemented as a special case with mutex locking. Introduce and document zram device attribute max_comp_streams. This attr shows and stores current zcomp's max number of zcomp streams (max_strm). Extend zcomp's zcomp_create() with `max_strm' parameter. `max_strm' limits the number of zcomp_strm structs in compression backend's idle list (max_comp_streams). max_comp_streams used during initialisation as follows: -- passing to zcomp_create() max_strm equals to 1 will initialise zcomp using single compression stream zcomp_strm_single (mutex-based locking). -- passing to zcomp_create() max_strm greater than 1 will initialise zcomp using multi compression stream zcomp_strm_multi (spinlock-based locking). default max_comp_streams value is 1, meaning that zram with single stream will be initialised. Later patch will introduce configuration knob to change max_comp_streams on already initialised and used zcomp. TEST iozone -t 3 -R -r 16K -s 60M -I +Z test base 1 strm (mutex) 3 strm (spinlock) ----------------------------------------------------------------------- Initial write 589286.78 583518.39 718011.05 Rewrite 604837.97 596776.38 1515125.72 Random write 584120.11 595714.58 1388850.25 Pwrite 535731.17 541117.38 739295.27 Fwrite 1418083.88 1478612.72 1484927.06 Usage example: set max_comp_streams to 4 echo 4 > /sys/block/zram0/max_comp_streams show current max_comp_streams (default value is 1). cat /sys/block/zram0/max_comp_streams Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	zram: factor out single stream compression	Sergey Senozhatsky	2014-04-07	2	-10/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is preparation patch to add multi stream support to zcomp. Introduce struct zcomp_strm_single and a set of functions to manage zcomp_strm stream access. zcomp_strm_single implements single compession stream, same way as current zcomp implementation. This moves zcomp_strm stream control and locking from zcomp, so compressing backend zcomp is not aware of required locking. Single and multi streams require different locking schemes. Minchan Kim reported that spinlock-based locking scheme (which is used in multi stream implementation) has demonstrated a severe perfomance regression for single compression stream case, comparing to mutex-based. see https://lkml.org/lkml/2014/2/18/16 The following set of functions added: - zcomp_strm_single_find()/zcomp_strm_single_release() find and release a compression stream, implement required locking - zcomp_strm_single_create()/zcomp_strm_single_destroy() create and destroy zcomp_strm_single New ->strm_find() and ->strm_release() callbacks added to zcomp, which are set to zcomp_strm_single_find() and zcomp_strm_single_release() during initialisation. Instead of direct locking and zcomp_strm access from zcomp_strm_find() and zcomp_strm_release(), zcomp now calls ->strm_find() and ->strm_release() correspondingly. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	zram: use zcomp compressing backends	Sergey Senozhatsky	2014-04-07	3	-43/+36
\| \| \| \| \| \| \| \| \| \| \| \| \|	Do not perform direct LZO compress/decompress calls, initialise and use zcomp LZO backend (single compression stream) instead. [akpm@linux-foundation.org: resolve conflicts with zram-delete-zram_init_device-fix.patch] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>