summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* softirq: A single rcu_bh_qs() call per softirq set is enoughEric Dumazet2014-04-291-3/+1
| | | | | | | | | | | | | | Calling rcu_bh_qs() after every softirq action is not really needed. What RCU needs is at least one rcu_bh_qs() per softirq round to note a quiescent state was passed for rcu_bh. Note for Paul and myself : this could be inlined as a single instruction and avoid smp_processor_id() (sone this_cpu_write(rcu_bh_data.passed_quiesce, 1)) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Replace __this_cpu_ptr() uses with raw_cpu_ptr()Christoph Lameter2014-04-292-5/+5
| | | | | | | | | | | | __this_cpu_ptr is being phased out. One special case is increment_cpu_stall_ticks(). A per cpu variable is incremented so use raw_cpu_inc(). Cc: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Remove duplicate resched_cpu() declarationPranith Kumar2014-04-291-6/+0
| | | | | | Signed-off-by: Pranith Kumar <pranith@gatech.edu> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Make large and small sysidle systems use same state machinePaul E. McKenney2014-04-291-1/+2
| | | | | | | | | | | | | | | Currently, small systems move back into RCU_SYSIDLE_NOT from RCU_SYSIDLE_SHORT and large systems do not. This works because moving aggressively to RCU_SYSIDLE_NOT affects only performance, not correctness, and on small systems, the performance impact should be negligible. That said, this difference does make RCU a bit more complex, and RCU does not seem to be suffering from any lack of complexity. This commit therefore adjusts small-system operation to match that of large systems, so that the state never moves back to RCU_SYSIDLE_NOT from RCU_SYSIDLE_SHORT. Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Bind RCU grace-period kthreads if NO_HZ_FULLPaul E. McKenney2014-04-291-18/+16
| | | | | | | | | | | | | | Currently, RCU binds the grace-period kthreads to the timekeeping CPU only if CONFIG_NO_HZ_FULL_SYSIDLE=y. This means that these kthreads must be bound manually when CONFIG_NO_HZ_FULL_SYSIDLE=n and CONFIG_NO_HZ_FULL=y: Otherwise, these kthreads will induce OS jitter on random CPUs. Given that we are trying to reduce the amount of manual tweaking required to make CONFIG_NO_HZ_FULL=y work nicely, this commit makes this binding happen when CONFIG_NO_HZ_FULL=y, even in cases where CONFIG_NO_HZ_FULL_SYSIDLE=n. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Merge rcu_sched_force_quiescent_state() with rcu_force_quiescent_state()Andreea-Cristina Bernat2014-04-292-19/+9
| | | | | | | | | | | | | | | This patch merges the function rcu_force_quiescent_state() with rcu_sched_force_quiescent_state(), using the rcu_state pointer. Firstly, the rcu_sched_force_quiescent_state() function is deleted from the file kernel/rcu/tree.c. Also, the rcu_force_quiescent_state() function that was calling force_quiescent_state with the argument rcu_preempt_state pointer was deleted as well. The new function that combines the old ones uses the rcu_state pointer and is located after rcu_batches_completed_bh() in kernel/rcu/tree.c. Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Consolidate kfree_call_rcu() to use rcu_state pointerAndreea-Cristina Bernat2014-04-292-30/+14
| | | | | | | | | | | | kfree_call_rcu is defined two times. When defined under CONFIG_TREE_PREEMPT_RCU, it uses rcu_preempt_state. Otherwise, it uses rcu_sched_state. This patch uses the rcu_state_pointer to combine the two definitions into one. The resulting function is placed after the closing of the preprocessor conditional CONFIG_TREE_PREEMPT_RCU. Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Replace NR_CPUS with nr_cpu_idsHimangi Saraogi2014-04-291-2/+2
| | | | | | | | | This patch replaces NR_CPUS with nr_cpu_ids as NR_CPUS should consider cpumask_var_t. Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Add event tracing to dyntick_save_progress_counter().Andreea-Cristina Bernat2014-04-291-1/+6
| | | | | | | | | | This patch adds event tracing to dyntick_save_progress_counter() in the case where it returns 1. I used the tracepoint string "dti" because this function returns 1 in case the CPU is in dynticks idle mode. Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Protect uses of ->jiffies_stall with ACCESS_ONCE()Himangi Saraogi2014-04-291-4/+4
| | | | | | | | | | | | | | | | | | | | Some of the accesses to the rcu_state structure's ->jiffies_stall field are unprotected. This patch protects them with ACCESS_ONCE(). The following coccinelle script was used to acheive this: /* coccinelle script to protect uses of ->jiffies_stall with ACCESS_ONCE() */ @@ identifier a; @@ ( ACCESS_ONCE(a->jiffies_stall) | - a->jiffies_stall + ACCESS_ONCE(a->jiffies_stall) ) Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Make callers awaken grace-period kthreadPaul E. McKenney2014-04-293-54/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rcu_start_gp_advanced() function currently uses irq_work_queue() to defer wakeups of the RCU grace-period kthread. This deferring is necessary to avoid RCU-scheduler deadlocks involving the rcu_node structure's lock, meaning that RCU cannot call any of the scheduler's wake-up functions while holding one of these locks. Unfortunately, the second and subsequent calls to irq_work_queue() are ignored, and the first call will be ignored (aside from queuing the work item) if the scheduler-clock tick is turned off. This is OK for many uses, especially those where irq_work_queue() is called from an interrupt or softirq handler, because in those cases the scheduler-clock-tick state will be re-evaluated, which will turn the scheduler-clock tick back on. On the next tick, any deferred work will then be processed. However, this strategy does not always work for RCU, which can be invoked at process level from idle CPUs. In this case, the tick might never be turned back on, indefinitely defering a grace-period start request. Note that the RCU CPU stall detector cannot see this condition, because there is no RCU grace period in progress. Therefore, we can (and do!) see long tens-of-seconds stalls in grace-period handling. In theory, we could see a full grace-period hang, but rcutorture testing to date has seen only the tens-of-seconds stalls. Event tracing demonstrates that irq_work_queue() is being called repeatedly to no effect during these stalls: The "newreq" event appears repeatedly from a task that is not one of the grace-period kthreads. In theory, irq_work_queue() might be fixed to avoid this sort of issue, but RCU's requirements are unusual and it is quite straightforward to pass wake-up responsibility up through RCU's call chain, so that the wakeup happens when the offending locks are released. This commit therefore makes this change. The rcu_start_gp_advanced(), rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(), __note_gp_changes(), and rcu_start_gp() functions now return a boolean which indicates when a wake-up is needed. A new rcu_gp_kthread_wake() does the wakeup when it is necessary and safe to do so: No self-wakes, no wake-ups if the ->gp_flags field indicates there is no need (as in someone else did the wake-up before we got around to it), and no wake-ups before the grace-period kthread has been created. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Protect uses of jiffies_stall field with ACCESS_ONCE()Iulia Manda2014-04-291-6/+6
| | | | | | | | | | Some of the uses of the rcu_state structure's ->jiffies_stall field do not use ACCESS_ONCE(), despite there being unprotected accesses. This commit therefore uses the ACCESS_ONCE() macro to protect this field. Signed-off-by: Iulia Manda <iulia.manda21@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Remove unused rcu_data structure fieldIulia Manda2014-04-292-5/+2
| | | | | | | | | | | The ->preemptible field in rcu_data is only initialized in the function rcu_init_percpu_data(), and never used. This commit therefore removes this field. Signed-off-by: Iulia Manda <iulia.manda21@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Update cpu_needs_another_gp() for futures from non-NOCB CPUsPaul E. McKenney2014-04-293-29/+29
| | | | | | | | | | | | | | In the old days, the only source of requests for future grace periods was NOCB CPUs. This has changed: CPUs routinely post requests for future grace periods in order to promote power efficiency and reduce OS jitter with minimal impact on grace-period latency. This commit therefore updates cpu_needs_another_gp() to invoke rcu_future_needs_gp() instead of rcu_nocb_needs_gp(). The latter is no longer used, so is now removed. This commit also adds tracing for the irq_work_queue() wakeup case. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Print negatives for stall-warning counter wraparoundPaul E. McKenney2014-04-291-4/+5
| | | | | | | | | | | The print_other_cpu_stall() and print_cpu_stall() functions print grace-period numbers using an unsigned format, which means that the number one less than zero is a very large number. This commit therefore causes these numbers to be printed with a signed format in order to improve readability of the RCU CPU stall-warning output. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Fix incorrect notes for codeLiu Ping Fan2014-04-291-1/+1
| | | | | | Signed-off-by: Liu Ping Fan <kernelfans@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rcu: Protect ->gp_flags accesses with ACCESS_ONCE()Paul E. McKenney2014-04-291-6/+6
| | | | | | | | | | A number of ->gp_flags accesses don't have ACCESS_ONCE(), but all of the can race against other loads or stores. This commit therefore applies ACCESS_ONCE() to the unprotected ->gp_flags accesses. Reported-by: Alexey Roytman <alexey.roytman@oracle.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* futex: update documentation for ordering guaranteesDavidlohr Bueso2014-04-121-9/+23
| | | | | | | | | | | | | | Commits 11d4616bd07f ("futex: revert back to the explicit waiter counting code") and 69cd9eba3886 ("futex: avoid race between requeue and wake") changed some of the finer details of how we think about futexes. One was a late fix and the other a consequence of overlooking the whole requeuing logic. The first change caused our documentation to be incorrect, and the second made us aware that we need to explicitly add more details to it. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2014-04-122-9/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...
| * missing bits of "splice: fix racy pipe->buffers uses"Al Viro2014-04-122-3/+3
| | | | | | | | | | | | | | that commit has fixed only the parts of that mess in fs/splice.c itself; there had been more in several other ->splice_read() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * pipe: kill ->map() and ->unmap()Al Viro2014-04-012-6/+0
| | | | | | | | | | | | all pipe_buffer_operations have the same instances of those... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'trace-3.15-v2' of ↵Linus Torvalds2014-04-128-335/+289
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull more tracing updates from Steven Rostedt: "This includes the final patch to clean up and fix the issue with the design of tracepoints and how a user could register a tracepoint and have that tracepoint not be activated but no error was shown. The design was for an out of tree module but broke in tree users. The clean up was to remove the saving of the hash table of tracepoint names such that they can be enabled before they exist (enabling a module tracepoint before that module is loaded). This added more complexity than needed. The clean up was to remove that code and just enable tracepoints that exist or fail if they do not. This removed a lot of code as well as the complexity that it brought. As a side effect, instead of registering a tracepoint by its name, the tracepoint needs to be registered with the tracepoint descriptor. This removes having to duplicate the tracepoint names that are enabled. The second patch was added that simplified the way modules were searched for. This cleanup required changes that were in the 3.15 queue as well as some changes that were added late in the 3.14-rc cycle. This final change waited till the two were merged in upstream and then the change was added and full tests were run. Unfortunately, the test found some errors, but after it was already submitted to the for-next branch and not to be rebased. Sparse errors were detected by Fengguang Wu's bot tests, and my internal tests discovered that the anonymous union initialization triggered a bug in older gcc compilers. Luckily, there was a bugzilla for the gcc bug which gave a work around to the problem. The third and fourth patch handled the sparse error and the gcc bug respectively. A final patch was tagged along to fix a missing documentation for the README file" * tag 'trace-3.15-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Add missing function triggers dump and cpudump to README tracing: Fix anonymous unions in struct ftrace_event_call tracepoint: Fix sparse warnings in tracepoint.c tracepoint: Simplify tracepoint module search tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints
| * | tracing: Add missing function triggers dump and cpudump to READMESteven Rostedt (Red Hat)2014-04-101-0/+2
| | | | | | | | | | | | | | | | | | | | | The debugfs tracing README file lists all the function triggers except for dump and cpudump. These should be added too. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | tracing: Fix anonymous unions in struct ftrace_event_callMathieu Desnoyers2014-04-091-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gcc <= 4.5.x has significant limitations with respect to initialization of anonymous unions within structures. They need to be surrounded by brackets, _and_ they need to be initialized in the same order in which they appear in the structure declaration. Link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676 Link: http://lkml.kernel.org/r/1397077568-3156-1-git-send-email-mathieu.desnoyers@efficios.com Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | tracepoint: Fix sparse warnings in tracepoint.cMathieu Desnoyers2014-04-091-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following sparse warnings: CHECK kernel/tracepoint.c kernel/tracepoint.c:184:18: warning: incorrect type in assignment (different address spaces) kernel/tracepoint.c:184:18: expected struct tracepoint_func *tp_funcs kernel/tracepoint.c:184:18: got struct tracepoint_func [noderef] <asn:4>*funcs kernel/tracepoint.c:216:18: warning: incorrect type in assignment (different address spaces) kernel/tracepoint.c:216:18: expected struct tracepoint_func *tp_funcs kernel/tracepoint.c:216:18: got struct tracepoint_func [noderef] <asn:4>*funcs kernel/tracepoint.c:392:24: error: return expression in void function CC kernel/tracepoint.o kernel/tracepoint.c: In function tracepoint_module_going: kernel/tracepoint.c:491:6: warning: symbol 'syscall_regfunc' was not declared. Should it be static? kernel/tracepoint.c:508:6: warning: symbol 'syscall_unregfunc' was not declared. Should it be static? Link: http://lkml.kernel.org/r/1397049883-28692-1-git-send-email-mathieu.desnoyers@efficios.com Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | tracepoint: Simplify tracepoint module searchSteven Rostedt (Red Hat)2014-04-081-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of copying the num_tracepoints and tracepoints_ptrs from the module structure to the tp_mod structure, which only uses it to find the module associated to tracepoints of modules that are coming and going, simply copy the pointer to the module struct to the tracepoint tp_module structure. Also removed un-needed brackets around an if statement. Link: http://lkml.kernel.org/r/20140408201705.4dad2c4a@gandalf.local.home Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | tracepoint: Use struct pointer instead of name hash for reg/unreg tracepointsMathieu Desnoyers2014-04-086-331/+280
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Register/unregister tracepoint probes with struct tracepoint pointer rather than tracepoint name. This change, which vastly simplifies tracepoint.c, has been proposed by Steven Rostedt. It also removes 8.8kB (mostly of text) to the vmlinux size. From this point on, the tracers need to pass a struct tracepoint pointer to probe register/unregister. A probe can now only be connected to a tracepoint that exists. Moreover, tracers are responsible for unregistering the probe before the module containing its associated tracepoint is unloaded. text data bss dec hex filename 10443444 4282528 10391552 25117524 17f4354 vmlinux.orig 10434930 4282848 10391552 25109330 17f2352 vmlinux Link: http://lkml.kernel.org/r/1396992381-23785-2-git-send-email-mathieu.desnoyers@efficios.com CC: Ingo Molnar <mingo@kernel.org> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Frank Ch. Eigler <fche@redhat.com> CC: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> [ SDR - fixed return val in void func in tracepoint_module_going() ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | | Merge git://git.infradead.org/users/eparis/auditLinus Torvalds2014-04-125-54/+149
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull audit updates from Eric Paris. * git://git.infradead.org/users/eparis/audit: (28 commits) AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range audit: do not cast audit_rule_data pointers pointlesly AUDIT: Allow login in non-init namespaces audit: define audit_is_compat in kernel internal header kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c sched: declare pid_alive as inline audit: use uapi/linux/audit.h for AUDIT_ARCH declarations syscall_get_arch: remove useless function arguments audit: remove stray newline from audit_log_execve_info() audit_panic() call audit: remove stray newlines from audit_log_lost messages audit: include subject in login records audit: remove superfluous new- prefix in AUDIT_LOGIN messages audit: allow user processes to log from another PID namespace audit: anchor all pid references in the initial pid namespace audit: convert PPIDs to the inital PID namespace. pid: get pid_t ppid of task in init_pid_ns audit: rename the misleading audit_get_context() to audit_take_context() audit: Add generic compat syscall support audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL ...
| * | | audit: do not cast audit_rule_data pointers pointleslyEric Paris2014-04-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For some sort of legacy support audit_rule is a subset of (and first entry in) audit_rule_data. We don't actually need or use audit_rule. We just do a cast from one to the other for no gain what so ever. Stop the crazy casting. Signed-off-by: Eric Paris <eparis@redhat.com>
| * | | AUDIT: Allow login in non-init namespacesEric Paris2014-03-311-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It its possible to configure your PAM stack to refuse login if audit messages (about the login) were unable to be sent. This is common in many distros and thus normal configuration of many containers. The PAM modules determine if audit is enabled/disabled in the kernel based on the return value from sending an audit message on the netlink socket. If userspace gets back ECONNREFUSED it believes audit is disabled in the kernel. If it gets any other error else it refuses to let the login proceed. Just about ever since the introduction of namespaces the kernel audit subsystem has returned EPERM if the task sending a message was not in the init user or pid namespace. So many forms of containers have never worked if audit was enabled in the kernel. BUT if the container was not in net_init then the kernel network code would send ECONNREFUSED (instead of the audit code sending EPERM). Thus by pure accident/dumb luck/bug if an admin configured the PAM stack to reject all logins that didn't talk to audit, but then ran the login untility in the non-init_net namespace, it would work!! Clearly this was a bug, but it is a bug some people expected. With the introduction of network namespace support in 3.14-rc1 the two bugs stopped cancelling each other out. Now, containers in the non-init_net namespace refused to let users log in (just like PAM was configfured!) Obviously some people were not happy that what used to let users log in, now didn't! This fix is kinda hacky. We return ECONNREFUSED for all non-init relevant namespaces. That means that not only will the old broken non-init_net setups continue to work, now the broken non-init_pid or non-init_user setups will 'work'. They don't really work, since audit isn't logging things. But it's what most users want. In 3.15 we should have patches to support not only the non-init_net (3.14) namespace but also the non-init_pid and non-init_user namespace. So all will be right in the world. This just opens the doors wide open on 3.14 and hopefully makes users happy, if not the audit system... Reported-by: Andre Tomt <andre@tomt.net> Reported-by: Adam Richter <adam_richter2004@yahoo.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Conflicts: kernel/audit.c
| * | | kernel: Use RCU_INIT_POINTER(x, NULL) in audit.cMonam Agarwal2014-03-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL) The rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. And in the case of the NULL pointer, there is no structure to initialize. So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL) Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * | | syscall_get_arch: remove useless function argumentsEric Paris2014-03-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every caller of syscall_get_arch() uses current for the task and no implementors of the function need args. So just get rid of both of those things. Admittedly, since these are inline functions we aren't wasting stack space, but it just makes the prototypes better. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mips@linux-mips.org Cc: linux390@de.ibm.com Cc: x86@kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linux-arch@vger.kernel.org
| * | | audit: remove stray newline from audit_log_execve_info() audit_panic() callJoe Perches2014-03-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | There's an unnecessary use of a \n in audit_panic. Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * | | audit: remove stray newlines from audit_log_lost messagesJosh Boyer2014-03-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Calling audit_log_lost with a \n in the format string leads to extra newlines in dmesg. That function will eventually call audit_panic which uses pr_err with an explicit \n included. Just make these calls match the others that lack \n. Reported-by: Jonathan Kamens <jik@kamens.brookline.ma.us> Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * | | audit: include subject in login recordsEric Paris2014-03-201-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The login uid change record does not include the selinux context of the task logging in. Add that information. (Updated from 2011-01: RHBZ:670328 -- RGB) Reported-by: Steve Grubb <sgrubb@redhat.com> Acked-by: James Morris <jmorris@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * | | audit: remove superfluous new- prefix in AUDIT_LOGIN messagesRichard Guy Briggs2014-03-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | The new- prefix on ses and auid are un-necessary and break ausearch. Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * | | audit: allow user processes to log from another PID namespaceRichard Guy Briggs2014-03-201-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Still only permit the audit logging daemon and control to operate from the initial PID namespace, but allow processes to log from another PID namespace. Cc: "Eric W. Biederman" <ebiederm@xmission.com> (informed by ebiederman's c776b5d2) Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * | | audit: anchor all pid references in the initial pid namespaceRichard Guy Briggs2014-03-203-10/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Store and log all PIDs with reference to the initial PID namespace and use the access functions task_pid_nr() and task_tgid_nr() for task->pid and task->tgid. Cc: "Eric W. Biederman" <ebiederm@xmission.com> (informed by ebiederman's c776b5d2) Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * | | audit: convert PPIDs to the inital PID namespace.Richard Guy Briggs2014-03-202-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sys_getppid() returns the parent pid of the current process in its own pid namespace. Since audit filters are based in the init pid namespace, a process could avoid a filter or trigger an unintended one by being in an alternate pid namespace or log meaningless information. Switch to task_ppid_nr() for PPIDs to anchor all audit filters in the init_pid_ns. (informed by ebiederman's 6c621b7e) Cc: stable@vger.kernel.org Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * | | audit: rename the misleading audit_get_context() to audit_take_context()Richard Guy Briggs2014-03-201-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "get" usually implies incrementing a refcount into a structure to indicate a reference being held by another part of code. Change this function name to indicate it is in fact being taken from it, returning the value while clearing it in the supplying structure. Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * | | audit: Send replies in the proper network namespace.Eric W. Biederman2014-03-202-13/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In perverse cases of file descriptor passing the current network namespace of a process and the network namespace of a socket used by that socket may differ. Therefore use the network namespace of the appropiate socket to ensure replies always go to the appropiate socket. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * | | audit: Use struct net not pid_t to remember the network namespce to reply inEric W. Biederman2014-03-203-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While reading through 3.14-rc1 I found a pretty siginficant mishandling of network namespaces in the recent audit changes. In struct audit_netlink_list and audit_reply add a reference to the network namespace of the caller and remove the userspace pid of the caller. This cleanly remembers the callers network namespace, and removes a huge class of races and nasty failure modes that can occur when attempting to relook up the callers network namespace from a pid_t (including the caller's network namespace changing, pid wraparound, and the pid simply not being present). Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * | | audit: Audit proc/<pid>/cmdline aka proctitleWilliam Roberts2014-03-202-0/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During an audit event, cache and print the value of the process's proctitle value (proc/<pid>/cmdline). This is useful in situations where processes are started via fork'd virtual machines where the comm field is incorrect. Often times, setting the comm field still is insufficient as the comm width is not very wide and most virtual machine "package names" do not fit. Also, during execution, many threads have their comm field set as well. By tying it back to the global cmdline value for the process, audit records will be more complete in systems with these properties. An example of where this is useful and applicable is in the realm of Android. With Android, their is no fork/exec for VM instances. The bare, preloaded Dalvik VM listens for a fork and specialize request. When this request comes in, the VM forks, and the loads the specific application (specializing). This was done to take advantage of COW and to not require a load of basic packages by the VM on very app spawn. When this spawn occurs, the package name is set via setproctitle() and shows up in procfs. Many of these package names are longer then 16 bytes, the historical width of task->comm. Having the cmdline in the audit records will couple the application back to the record directly. Also, on my Debian development box, some audit records were more useful then what was printed under comm. The cached proctitle is tied to the life-cycle of the audit_context structure and is built on demand. Proctitle is controllable by userspace, and thus should not be trusted. It is meant as an aid to assist in debugging. The proctitle event is emitted during syscall audits, and can be filtered with auditctl. Example: type=AVC msg=audit(1391217013.924:386): avc: denied { getattr } for pid=1971 comm="mkdir" name="/" dev="selinuxfs" ino=1 scontext=system_u:system_r:consolekit_t:s0-s0:c0.c255 tcontext=system_u:object_r:security_t:s0 tclass=filesystem type=SYSCALL msg=audit(1391217013.924:386): arch=c000003e syscall=137 success=yes exit=0 a0=7f019dfc8bd7 a1=7fffa6aed2c0 a2=fffffffffff4bd25 a3=7fffa6aed050 items=0 ppid=1967 pid=1971 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mkdir" exe="/bin/mkdir" subj=system_u:system_r:consolekit_t:s0-s0:c0.c255 key=(null) type=UNKNOWN[1327] msg=audit(1391217013.924:386): proctitle=6D6B646972002D70002F7661722F72756E2F636F6E736F6C65 Acked-by: Steve Grubb <sgrubb@redhat.com> (wrt record formating) Signed-off-by: William Roberts <wroberts@tresys.com> Signed-off-by: Eric Paris <eparis@redhat.com>
| * | | audit: Use more current logging style againRichard Guy Briggs2014-03-072-21/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add pr_fmt to prefix "audit: " to output Convert printk(KERN_<LEVEL> to pr_<level> Coalesce formats Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
| * | | Merge tag 'v3.13' into for-3.15Eric Paris2014-03-07147-2588/+6384
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Linux 3.13 Conflicts: include/net/xfrm.h Simple merge where v3.13 removed 'extern' from definitions and the audit tree did s/u32/unsigned int/ to the same definitions.
* | | | | futex: avoid race between requeue and wakeLinus Torvalds2014-04-091-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jan Stancek reported: "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP) occasionally fails, because some threads fail to wake up. Testcase creates 5 threads, which are all waiting on same condition. Main thread then calls pthread_cond_broadcast() without holding mutex, which calls: futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..) This immediately wakes up single thread A, which unlocks mutex and tries to wake up another thread: futex(uaddr2, FUTEX_WAKE_PRIVATE, 1) If thread A manages to call futex_wake() before any waiters are requeued for uaddr2, no other thread is woken up" The ordering constraints for the hash bucket waiter counting are that the waiter counts have to be incremented _before_ getting the spinlock (because the spinlock acts as part of the memory barrier), but the "requeue" operation didn't honor those rules, and nobody had even thought about that case. This fairly simple patch just increments the waiter count for the target hash bucket (hb2) when requeing a futex before taking the locks. It then decrements them again after releasing the lock - the code that actually moves the futex(es) between hash buckets will do the additional required waiter count housekeeping. Reported-and-tested-by: Jan Stancek <jstancek@redhat.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # 3.14 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | Merge branch 'akpm' (incoming from Andrew)Linus Torvalds2014-04-0721-117/+155
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge second patch-bomb from Andrew Morton: - the rest of MM - zram updates - zswap updates - exit - procfs - exec - wait - crash dump - lib/idr - rapidio - adfs, affs, bfs, ufs - cris - Kconfig things - initramfs - small amount of IPC material - percpu enhancements - early ioremap support - various other misc things * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (156 commits) MAINTAINERS: update Intel C600 SAS driver maintainers fs/ufs: remove unused ufs_super_block_third pointer fs/ufs: remove unused ufs_super_block_second pointer fs/ufs: remove unused ufs_super_block_first pointer fs/ufs/super.c: add __init to init_inodecache() doc/kernel-parameters.txt: add early_ioremap_debug arm64: add early_ioremap support arm64: initialize pgprot info earlier in boot x86: use generic early_ioremap mm: create generic early_ioremap() support x86/mm: sparse warning fix for early_memremap lglock: map to spinlock when !CONFIG_SMP percpu: add preemption checks to __this_cpu ops vmstat: use raw_cpu_ops to avoid false positives on preemption checks slub: use raw_cpu_inc for incrementing statistics net: replace __this_cpu_inc in route.c with raw_cpu_inc modules: use raw_cpu_write for initialization of per cpu refcount. mm: use raw_cpu ops for determining current NUMA node percpu: add raw_cpu_ops slub: fix leak of 'name' in sysfs_slab_add ...
| * | | | | lglock: map to spinlock when !CONFIG_SMPJosh Triplett2014-04-071-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the system has only one CPU, lglock is effectively a spinlock; map it directly to spinlock to eliminate the indirection and duplicate code. In addition to removing overhead, this drops 1.6k of code with a defconfig modified to have !CONFIG_SMP, and 1.1k with a minimal config. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | modules: use raw_cpu_write for initialization of per cpu refcount.Christoph Lameter2014-04-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The initialization of a structure is not subject to synchronization. The use of __this_cpu would trigger a false positive with the additional preemption checks for __this_cpu ops. So simply disable the check through the use of raw_cpu ops. Trace: __this_cpu_write operation in preemptible [00000000] code: modprobe/286 caller is __this_cpu_preempt_check+0x38/0x60 CPU: 3 PID: 286 Comm: modprobe Tainted: GF 3.12.0-rc4+ #187 Call Trace: dump_stack+0x4e/0x82 check_preemption_disabled+0xec/0x110 __this_cpu_preempt_check+0x38/0x60 load_module+0xcfd/0x2650 SyS_init_module+0xa6/0xd0 tracesys+0xe1/0xe6 Signed-off-by: Christoph Lameter <cl@linux.com> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | kernel: use macros from compiler.h instead of __attribute__((...))Gideon Israel Dsouza2014-04-0713-21/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To increase compiler portability there is <linux/compiler.h> which provides convenience macros for various gcc constructs. Eg: __weak for __attribute__((weak)). I've replaced all instances of gcc attributes with the right macro in the kernel subsystem. Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud