summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* tracing: Comment the use of event_mutex with trace event flagsSteven Rostedt2010-05-141-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The flags variable is protected by the event_mutex when modifying, but the event_mutex is not held when reading the variable. This is due to the fact that the reads occur in critical sections where taking a mutex (or even a spinlock) is not wanted. But the two flags that exist (enable and filter_active) have the code written as such to handle the reads to not need a lock. The enable flag is used just to know if the event is enabled or not and its use is always under the event_mutex. Whether or not the event is actually enabled is really determined by the tracepoint being registered. The flag is just a way to let the code know if the tracepoint is registered. The filter_active is different. It is read without the lock. If it is set, then the event probes jump to the filter code. There can be a slight mismatch between filters available and filter_active. If the flag is set but no filters are available, the code safely jumps to a filter nop. If the flag is not set and the filters are available, then the filters are skipped. This is acceptable since filters are usually set before tracing or they are set by humans, which would not notice the slight delay that this causes. v2: Fixed typo: "cacheing" -> "caching" Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Combine event filter_active and enable into single flags fieldSteven Rostedt2010-05-141-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The filter_active and enable both use an int (4 bytes each) to set a single flag. We can save 4 bytes per event by combining the two into a single integer. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4894944 1018052 861512 6774508 675eec vmlinux.id 4894871 1012292 861512 6768675 674823 vmlinux.flags This gives us another 5K in savings. The modification of both the enable and filter fields are done under the event_mutex, so it is still safe to combine the two. Note: Although Mathieu gave his Acked-by, he would like it documented that the reads of flags are not protected by the mutex. The way the code works, these reads will not break anything, but will have a residual effect. Since this behavior is the same even before this patch, describing this situation is left to another patch, as this patch does not change the behavior, but just brought it to Mathieu's attention. v2: Updated the event trace self test to for this change. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Remove duplicate id information in event structureSteven Rostedt2010-05-142-9/+8
| | | | | | | | | | | | | | | | | | | Now that the trace_event structure is embedded in the ftrace_event_call structure, there is no need for the ftrace_event_call id field. The id field is the same as the trace_event type field. Removing the id and re-arranging the structure brings down the tracepoint footprint by another 5K. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4895024 1023812 861512 6780348 6775bc vmlinux.print 4894944 1018052 861512 6774508 675eec vmlinux.id Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Move print functions into event classSteven Rostedt2010-05-143-41/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, every event has its own trace_event structure. This is fine since the structure is needed anyway. But the print function structure (trace_event_functions) is now separate. Since the output of the trace event is done by the class (with the exception of events defined by DEFINE_EVENT_PRINT), it makes sense to have the class define the print functions that all events in the class can use. This makes a bigger deal with the syscall events since all syscall events use the same class. The savings here is another 30K. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900382 1048964 861512 6810858 67ecea vmlinux.init 4900446 1049028 861512 6810986 67ed6a vmlinux.preprint 4895024 1023812 861512 6780348 6775bc vmlinux.print To accomplish this, and to let the class know what event is being printed, the event structure is embedded in the ftrace_event_call structure. This should not be an issues since the event structure was created for each event anyway. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Allow events to share their print functionsSteven Rostedt2010-05-144-14/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Multiple events may use the same method to print their data. Instead of having all events have a pointer to their print funtions, the trace_event structure now points to a trace_event_functions structure that will hold the way to print ouf the event. The event itself is now passed to the print function to let the print function know what kind of event it should print. This opens the door to consolidating the way several events print their output. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900382 1048964 861512 6810858 67ecea vmlinux.init 4900446 1049028 861512 6810986 67ed6a vmlinux.preprint This change slightly increases the size but is needed for the next change. v3: Fix the branch tracer events to handle this change. v2: Fix the new function graph tracer event calls to handle this change. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Move raw_init from events to classSteven Rostedt2010-05-143-8/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The raw_init function pointer in the event is used to initialize various kinds of events. The type of initialization needed is usually classed to the kind of event it is. Two events with the same class will always have the same initialization function, so it makes sense to move this to the class structure. Perhaps even making a special system structure would work since the initialization is the same for all events within a system. But since there's no system structure (yet), this will just move it to the class. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900375 1053380 861512 6815267 67fe23 vmlinux.fields 4900382 1048964 861512 6810858 67ecea vmlinux.init The text grew very slightly, but this is a constant growth that happened with the changing of the C files that call the init code. The bigger savings is the data which will be saved the more events share a class. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Move fields from event to class structureSteven Rostedt2010-05-144-15/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the defined fields from the event to the class structure. Since the fields of the event are defined by the class they belong to, it makes sense to have the class hold the information instead of the individual events. The events of the same class would just hold duplicate information. After this change the size of the kernel dropped another 3K: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900252 1057412 861512 6819176 680d68 vmlinux.regs 4900375 1053380 861512 6815267 67fe23 vmlinux.fields Although the text increased, this was mainly due to the C files having to adapt to the change. This is a constant increase, where new tracepoints will not increase the Text. But the big drop is in the data size (as well as needed allocations to hold the fields). This will give even more savings as more tracepoints are created. Note, if just TRACE_EVENT()s are used and not DECLARE_EVENT_CLASS() with several DEFINE_EVENT()s, then the savings will be lost. But we are pushing developers to consolidate events with DEFINE_EVENT() so this should not be an issue. The kprobes define a unique class to every new event, but are dynamic so it should not be a issue. The syscalls however have a single class but the fields for the individual events are different. The syscalls use a metadata to define the fields. I moved the fields list from the event to the metadata and added a "get_fields()" function to the class. This function is used to find the fields. For normal events and kprobes, get_fields() just returns a pointer to the fields list_head in the class. For syscall events, it returns the fields list_head in the metadata for the event. v2: Fixed the syscall fields. The syscall metadata needs a list of fields for both enter and exit. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Remove per event trace registeringSteven Rostedt2010-05-143-124/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void *data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void * and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Let tracepoints have data passed to tracepoint callbacksSteven Rostedt2010-05-142-30/+79
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds data to be passed to tracepoint callbacks. The created functions from DECLARE_TRACE() now need a mandatory data parameter. For example: DECLARE_TRACE(mytracepoint, int value, value) Will create the register function: int register_trace_mytracepoint((void(*)(void *data, int value))probe, void *data); As the first argument, all callbacks (probes) must take a (void *data) parameter. So a callback for the above tracepoint will look like: void myprobe(void *data, int value) { } The callback may choose to ignore the data parameter. This change allows callbacks to register a private data pointer along with the function probe. void mycallback(void *data, int value); register_trace_mytracepoint(mycallback, mydata); Then the mycallback() will receive the "mydata" as the first parameter before the args. A more detailed example: DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); /* In the C file */ DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); [...] trace_mytracepoint(status); /* In a file registering this tracepoint */ int my_callback(void *data, int status) { struct my_struct my_data = data; [...] } [...] my_data = kmalloc(sizeof(*my_data), GFP_KERNEL); init_my_data(my_data); register_trace_mytracepoint(my_callback, my_data); The same callback can also be registered to the same tracepoint as long as the data registered is different. Note, the data must also be used to unregister the callback: unregister_trace_mytracepoint(my_callback, my_data); Because of the data parameter, tracepoints declared this way can not have no args. That is: DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS()); will cause an error. If no arguments are needed, a new macro can be used instead: DECLARE_TRACE_NOARGS(mytracepoint); Since there are no arguments, the proto and args fields are left out. This is part of a series to make the tracepoint footprint smaller: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint Again, this patch also increases the size of the kernel, but lays the ground work for decreasing it. v5: Fixed net/core/drop_monitor.c to handle these updates. v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the #ifdef CONFIG_TRACE_POINTS, since the two are the same in both cases. The __DECLARE_TRACE() is what changes. Thanks to Frederic Weisbecker for pointing this out. v3: Made all register_* functions require data to be passed and all callbacks to take a void * parameter as its first argument. This makes the calling functions comply with C standards. Also added more comments to the modifications of DECLARE_TRACE(). v2: Made the DECLARE_TRACE() have the ability to pass arguments and added a new DECLARE_TRACE_NOARGS() for tracepoints that do not need any arguments. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracepoints: Add check trace callback typeMathieu Desnoyers2010-05-141-1/+6
| | | | | | | | | | | | | | | | | | | | | | | This check is meant to be used by tracepoint users which do a direct cast of callbacks to (void *) for direct registration, thus bypassing the register_trace_##name and unregister_trace_##name checks. This permits to ensure that the callback type matches the function type at the call site, but without generating any code. Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> LKML-Reference: <20100430165959.GA25605@Krystal> CC: Ingo Molnar <mingo@elte.hu> CC: Andrew Morton <akpm@linux-foundation.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Peter Zijlstra <peterz@infradead.org> CC: Arnaldo Carvalho de Melo <acme@redhat.com> CC: Lai Jiangshan <laijs@cn.fujitsu.com> CC: Li Zefan <lizf@cn.fujitsu.com> CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* tracing: Create class struct for eventsSteven Rostedt2010-05-143-28/+28
| | | | | | | | | | | | | | | | | | | | | | | | | This patch creates a ftrace_event_class struct that event structs point to. This class struct will be made to hold information to modify the events. Currently the class struct only holds the events system name. This patch slightly increases the size, but this change lays the ground work of other changes to make the footprint of tracepoints smaller. With 82 standard tracepoints, and 618 system call tracepoints (two tracepoints per syscall: enter and exit): text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class This patch also cleans up some stale comments in ftrace.h. v2: Fixed missing semi-colon in macro. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* Merge branch 'sched/core' of ↵Steven Rostedt2010-05-1423-133/+211
|\ | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into trace/tip/tracing/core-4
| * sched, wait: Use wrapper functionsChangli Gao2010-05-111-20/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | epoll should not touch flags in wait_queue_t. This patch introduces a new function __add_wait_queue_exclusive(), for the users, who use wait queue as a LIFO queue. __add_wait_queue_tail_exclusive() is introduced too instead of add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed, as it is a duplicate of __remove_wait_queue(), disliked by users, and with less users. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: <containers@lists.linux-foundation.org> LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Intoduce get_cpu_iowait_time_us()Arjan van de Ven2010-05-091-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the ondemand cpufreq governor, it is desired that the iowait time is microaccounted in a similar way as idle time is. This patch introduces the infrastructure to account and expose this information via the get_cpu_iowait_time_us() function. [akpm@linux-foundation.org: fix CONFIG_NO_HZ=n build] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082523.284feab6@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Eliminate the ts->idle_lastupdate fieldArjan van de Ven2010-05-091-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that the only user of ts->idle_lastupdate is update_ts_time_stats(), the entire field can be eliminated. In update_ts_time_stats(), idle_lastupdate is first set to "now", and a few lines later, the only user is an if() statement that assigns a variable either to "now" or to ts->idle_lastupdate, which has the value of "now" at that point. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082439.2fab0b4f@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * Merge branch 'cpu_stop' of ↵Ingo Molnar2010-05-083-30/+95
| |\ | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into sched/core
| | * cpu_stop: add dummy implementation for UPTejun Heo2010-05-081-6/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When !CONFIG_SMP, cpu_stop functions weren't defined at all which could lead to build failures if UP code uses cpu_stop facility. Add dummy cpu_stop implementation for UP. The waiting variants execute the work function directly with preempt disabled and stop_one_cpu_nowait() schedules a workqueue work. Makefile and ifdefs around stop_machine implementation are updated to accomodate CONFIG_SMP && !CONFIG_STOP_MACHINE case. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ingo Molnar <mingo@elte.hu>
| | * sched: replace migration_thread with cpu_stopTejun Heo2010-05-062-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently migration_thread is serving three purposes - migration pusher, context to execute active_load_balance() and forced context switcher for expedited RCU synchronize_sched. All three roles are hardcoded into migration_thread() and determining which job is scheduled is slightly messy. This patch kills migration_thread and replaces all three uses with cpu_stop. The three different roles of migration_thread() are splitted into three separate cpu_stop callbacks - migration_cpu_stop(), active_load_balance_cpu_stop() and synchronize_sched_expedited_cpu_stop() - and each use case now simply asks cpu_stop to execute the callback as necessary. synchronize_sched_expedited() was implemented with private preallocated resources and custom multi-cpu queueing and waiting logic, both of which are provided by cpu_stop. synchronize_sched_expedited_count is made atomic and all other shared resources along with the mutex are dropped. synchronize_sched_expedited() also implemented a check to detect cases where not all the callback got executed on their assigned cpus and fall back to synchronize_sched(). If called with cpu hotplug blocked, cpu_stop already guarantees that and the condition cannot happen; otherwise, stop_machine() would break. However, this patch preserves the paranoid check using a cpumask to record on which cpus the stopper ran so that it can serve as a bisection point if something actually goes wrong theree. Because the internal execution state is no longer visible, rcu_expedited_torture_stats() is removed. This patch also renames cpu_stop threads to from "stopper/%d" to "migration/%d". The names of these threads ultimately don't matter and there's no reason to make unnecessary userland visible changes. With this patch applied, stop_machine() and sched now share the same resources. stop_machine() is faster without wasting any resources and sched migration users are much cleaner. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Josh Triplett <josh@freedesktop.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>
| | * stop_machine: reimplement using cpu_stopTejun Heo2010-05-061-20/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reimplement stop_machine using cpu_stop. As cpu stoppers are guaranteed to be available for all online cpus, stop_machine_create/destroy() are no longer necessary and removed. With resource management and synchronization handled by cpu_stop, the new implementation is much simpler. Asking the cpu_stop to execute the stop_cpu() state machine on all online cpus with cpu hotplug disabled is enough. stop_machine itself doesn't need to manage any global resources anymore, so all per-instance information is rolled into struct stop_machine_data and the mutex and all static data variables are removed. The previous implementation created and destroyed RT workqueues as necessary which made stop_machine() calls highly expensive on very large machines. According to Dimitri Sivanich, preventing the dynamic creation/destruction makes booting faster more than twice on very large machines. cpu_stop resources are preallocated for all online cpus and should have the same effect. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>
| | * cpu_stop: implement stop_cpu[s]()Tejun Heo2010-05-061-4/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement a simplistic per-cpu maximum priority cpu monopolization mechanism. A non-sleeping callback can be scheduled to run on one or multiple cpus with maximum priority monopolozing those cpus. This is primarily to replace and unify RT workqueue usage in stop_machine and scheduler migration_thread which currently is serving multiple purposes. Four functions are provided - stop_one_cpu(), stop_one_cpu_nowait(), stop_cpus() and try_stop_cpus(). This is to allow clean sharing of resources among stop_cpu and all the migration thread users. One stopper thread per cpu is created which is currently named "stopper/CPU". This will eventually replace the migration thread and take on its name. * This facility was originally named cpuhog and lived in separate files but Peter Zijlstra nacked the name and thus got renamed to cpu_stop and moved into stop_machine.c. * Better reporting of preemption leak as per Peter's suggestion. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>
| * | sched: Remove rq argument to the tracepointsPeter Zijlstra2010-05-071-22/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct rq isn't visible outside of sched.o so its near useless to expose the pointer, also there are no users of it, so remove it. Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1272997616.1642.207.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | Merge commit 'v2.6.34-rc6' into sched/coreIngo Molnar2010-05-0718-48/+184
| |\ \ | | |/ | |/|
| | * Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6Linus Torvalds2010-04-291-0/+1
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: nfs: fix memory leak in nfs_get_sb with CONFIG_NFS_V4 nfs: fix some issues in nfs41_proc_reclaim_complete() NFS: Ensure that nfs_wb_page() waits for Pg_writeback to clear NFS: Fix an unstable write data integrity race nfs: testing for null instead of ERR_PTR() NFS: rsize and wsize settings ignored on v4 mounts NFSv4: Don't attempt an atomic open if the file is a mountpoint SUNRPC: Fix a bug in rpcauth_prune_expired
| | | * NFS: Fix an unstable write data integrity raceTrond Myklebust2010-04-221-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 2c61be0a9478258f77b66208a0c4b1f5f8161c3c (NFS: Ensure that the WRITE and COMMIT RPC calls are always uninterruptible) exposed a race on file close. In order to ensure correct close-to-open behaviour, we want to wait for all outstanding background commit operations to complete. This patch adds an inode flag that indicates if a commit operation is under way, and provides a mechanism to allow ->write_inode() to wait for its completion if this is a data integrity flush. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| | * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds2010-04-282-0/+2
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits) sfc: Change falcon_probe_board() to fail for unsupported boards sfc: Always close net device at the end of a disabling reset sfc: Wait at most 10ms for the MC to finish reading out MAC statistics sctp: Fix oops when sending queued ASCONF chunks sctp: fix to calc the INIT/INIT-ACK chunk length correctly is set sctp: per_cpu variables should be in bh_disabled section sctp: fix potential reference of a freed pointer sctp: avoid irq lock inversion while call sk->sk_data_ready() Revert "tcp: bind() fix when many ports are bound" net/usb: add sierra_net.c driver cdc_ether: fix autosuspend for mbm devices bluetooth: handle l2cap_create_connless_pdu() errors gianfar: Wait for both RX and TX to stop ipheth: potential null dereferences on error path smc91c92_cs: spin_unlock_irqrestore before calling smc_interrupt() drivers/usb/net/kaweth.c: add device "Allied Telesyn AT-USB10 USB Ethernet Adapter" bnx2: Update version to 2.0.9. bnx2: Prevent "scheduling while atomic" warning with cnic, bonding and vlan. bnx2: Fix lost MSI-X problem on 5709 NICs. cxgb3: Wait longer for control packets on initialization ...
| | | * | sctp: Fix oops when sending queued ASCONF chunksVlad Yasevich2010-04-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we finish processing ASCONF_ACK chunk, we try to send the next queued ASCONF. This action runs the sctp state machine recursively and it's not prepared to do so. kernel BUG at kernel/timer.c:790! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/module/ipv6/initstate Modules linked in: sha256_generic sctp libcrc32c ipv6 dm_multipath uinput 8139too i2c_piix4 8139cp mii i2c_core pcspkr virtio_net joydev floppy virtio_blk virtio_pci [last unloaded: scsi_wait_scan] Pid: 0, comm: swapper Not tainted 2.6.34-rc4 #15 /Bochs EIP: 0060:[<c044a2ef>] EFLAGS: 00010286 CPU: 0 EIP is at add_timer+0xd/0x1b EAX: cecbab14 EBX: 000000f0 ECX: c0957b1c EDX: 03595cf4 ESI: cecba800 EDI: cf276f00 EBP: c0957aa0 ESP: c0957aa0 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process swapper (pid: 0, ti=c0956000 task=c0988ba0 task.ti=c0956000) Stack: c0957ae0 d1851214 c0ab62e4 c0ab5f26 0500ffff 00000004 00000005 00000004 <0> 00000000 d18694fd 00000004 1666b892 cecba800 cecba800 c0957b14 00000004 <0> c0957b94 d1851b11 ceda8b00 cecba800 cf276f00 00000001 c0957b14 000000d0 Call Trace: [<d1851214>] ? sctp_side_effects+0x607/0xdfc [sctp] [<d1851b11>] ? sctp_do_sm+0x108/0x159 [sctp] [<d1863386>] ? sctp_pname+0x0/0x1d [sctp] [<d1861a56>] ? sctp_primitive_ASCONF+0x36/0x3b [sctp] [<d185657c>] ? sctp_process_asconf_ack+0x2a4/0x2d3 [sctp] [<d184e35c>] ? sctp_sf_do_asconf_ack+0x1dd/0x2b4 [sctp] [<d1851ac1>] ? sctp_do_sm+0xb8/0x159 [sctp] [<d1863334>] ? sctp_cname+0x0/0x52 [sctp] [<d1854377>] ? sctp_assoc_bh_rcv+0xac/0xe1 [sctp] [<d1858f0f>] ? sctp_inq_push+0x2d/0x30 [sctp] [<d186329d>] ? sctp_rcv+0x797/0x82e [sctp] Tested-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Yuansong Qiao <ysqiao@research.ait.ie> Signed-off-by: Shuaijun Zhang <szhang@research.ait.ie> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * | sctp: avoid irq lock inversion while call sk->sk_data_ready()Wei Yongjun2010-04-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk->sk_data_ready() of sctp socket can be called from both BH and non-BH contexts, but the default sk->sk_data_ready(), sock_def_readable(), can not be used in this case. Therefore, we have to make a new function sctp_data_ready() to grab sk->sk_data_ready() with BH disabling. ========================================================= [ INFO: possible irq lock inversion dependency detected ] 2.6.33-rc6 #129 --------------------------------------------------------- sctp_darn/1517 just changed the state of lock: (clock-AF_INET){++.?..}, at: [<c06aab60>] sock_def_readable+0x20/0x80 but this lock took another, SOFTIRQ-unsafe lock in the past: (slock-AF_INET){+.-...} and interrupts could create inverse lock ordering between them. other info that might help us debug this: 1 lock held by sctp_darn/1517: #0: (sk_lock-AF_INET){+.+.+.}, at: [<cdfe363d>] sctp_sendmsg+0x23d/0xc00 [sctp] Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | Merge branch 'urgent' of ↵Linus Torvalds2010-04-282-9/+6
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia-2.6 * 'urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia-2.6: pcmcia: fix matching rules for pseudo-multi-function cards pcmcia: pcmcia_dev_present bugfix
| | | * | | pcmcia: pcmcia_dev_present bugfixDominik Brodowski2010-04-212-9/+6
| | | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pcmcia_dev_present is in and by itself buggy. Add a note specifying why it is broken, and replace the broken locking -- taking a mutex is a bad idea in IRQ context, from which this function is rarely called -- by an atomic_t. Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
| | * | | Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2010-04-284-0/+10
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://git.kernel.dk/linux-2.6-block: coda: move backing-dev.h kernel include inside __KERNEL__ mtd: ensure that bdi entries are properly initialized and registered Move mtd_bdi_*mappable to mtdcore.c btrfs: convert to using bdi_setup_and_register() Catch filesystems lacking s_bdi drbd: Terminate a connection early if sending the protocol fails drbd: fix memory leak Fix JFFS2 sync silent failure smbfs: add bdi backing to mount session ncpfs: add bdi backing to mount session exofs: add bdi backing to mount session ecryptfs: add bdi backing to mount session coda: add bdi backing to mount session cifs: add bdi backing to mount session afs: add bdi backing to mount session. 9p: add bdi backing to mount session bdi: add helper function for doing init and register of a bdi for a file system block: ensure jiffies wrap is handled correctly in blk_rq_timed_out_timer
| | | * | | coda: move backing-dev.h kernel include inside __KERNEL__Jens Axboe2010-04-281-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Otherwise we must export backing-dev.h as well, which doesn't make any sense. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
| | | * | | Catch filesystems lacking s_bdiJörn Engel2010-04-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | noop_backing_dev_info is used only as a flag to mark filesystems that don't have any backing store, like tmpfs, procfs, spufs, etc. Signed-off-by: Joern Engel <joern@logfs.org> Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes to the noop_backing_dev_info is not legal and will not result in them being flushed, but we already catch this condition in __mark_inode_dirty() when checking for a registered bdi. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
| | | * | | smbfs: add bdi backing to mount sessionJens Axboe2010-04-221-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This ensures that dirty data gets flushed properly. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
| | | * | | ncpfs: add bdi backing to mount sessionJens Axboe2010-04-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This ensures that dirty data gets flushed properly. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
| | | * | | coda: add bdi backing to mount sessionJens Axboe2010-04-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This ensures that dirty data gets flushed properly. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
| | | * | | bdi: add helper function for doing init and register of a bdi for a file systemJens Axboe2010-04-221-0/+1
| | | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pretty trivial helper, just sets up the bdi and registers it. An atomic sequence count is used to ensure that the registered sysfs names are unique. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
| | * | | hugetlb: fix infinite loop in get_futex_key() when backed by huge pagesMel Gorman2010-04-241-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a futex key happens to be located within a huge page mapped MAP_PRIVATE, get_futex_key() can go into an infinite loop waiting for a page->mapping that will never exist. See https://bugzilla.redhat.com/show_bug.cgi?id=552257 for more details about the problem. This patch makes page->mapping a poisoned value that includes PAGE_MAPPING_ANON mapped MAP_PRIVATE. This is enough for futex to continue but because of PAGE_MAPPING_ANON, the poisoned value is not dereferenced or used by futex. No other part of the VM should be dereferencing the page->mapping of a hugetlbfs page as its page cache is not on the LRU. This patch fixes the problem with the test case described in the bugzilla. [akpm@linux-foundation.org: mel cant spel] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Darren Hart <darren@dvhart.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | Cleanup generic block based fiemapJosef Bacik2010-04-231-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This cleans up a few of the complaints of __generic_block_fiemap. I've fixed all the typing stuff, used inline functions instead of macros, gotten rid of a couple of variables, and made sure the size and block requests are all block aligned. It also fixes a problem where sometimes FIEMAP_EXTENT_LAST wasn't being set properly. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | Merge branch 'for-linus' of ↵Linus Torvalds2010-04-222-2/+2
| | |\ \ \ | | | |_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6: firewire: ohci: wait for local CSR lock access to finish firewire: ohci: prevent aliasing of locally handled register addresses firewire: core: fw_iso_resource_manage: return -EBUSY when out of resources firewire: core: fix retries calculation in iso manage_channel() firewire: cdev: fix cut+paste mistake in disclaimer
| | | * | firewire: cdev: fix cut+paste mistake in disclaimerStefan Richter2010-04-152-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This was supposed to be generic "authors or copyright holders"; I mistakenly picked up text from a wrong file. Reported-by: Daniel K. <dk@uw.no> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
| | * | | Merge branch 'for-linus' of ↵Linus Torvalds2010-04-211-2/+6
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6: mc13783-regulator: fix a memory leak in mc13783_regulator_remove regulator: Let drivers know when they use the stub API
| | | * | | regulator: Let drivers know when they use the stub APIJean Delvare2010-04-191-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Have the stub variant of regulator_get() return NULL, so that drivers can (but still don't have to) handle this case specifically. Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Jerome Oufella <jerome.oufella@savoirfairelinux.com> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
| | * | | | KVM: Increase NR_IOBUS_DEVS limit to 200Sridhar Samudrala2010-04-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch increases the current hardcoded limit of NR_IOBUS_DEVS from 6 to 200. We are hitting this limit when creating a guest with more than 1 virtio-net device using vhost-net backend. Each virtio-net device requires 2 such devices to service notifications from rx/tx queues. Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
| | * | | | KVM: fix the handling of dirty bitmaps to avoid overflowsTakuya Yoshikawa2010-04-201-0/+5
| | | |_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Int is not long enough to store the size of a dirty bitmap. This patch fixes this problem with the introduction of a wrapper function to calculate the sizes of dirty bitmaps. Note: in mark_page_dirty(), we have to consider the fact that __set_bit() takes the offset as int, not long. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * | | | sched: Pre-compute cpumask_weight(sched_domain_span(sd))Peter Zijlstra2010-04-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave reported that his large SPARC machines spend lots of time in hweight64(), try and optimize some of those needless cpumask_weight() invocations (esp. with the large offstack cpumasks these are very expensive indeed). Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | Merge branch 'linus' into sched/coreIngo Molnar2010-04-1556-109/+317
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge reason: merge the latest fixes, update to -rc4. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | sched: Add enqueue/dequeue flagsPeter Zijlstra2010-04-021-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to reduce the dependency on TASK_WAKING rework the enqueue interface to support a proper flags field. Replace the int wakeup, bool head arguments with an int flags argument and create the following flags: ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task, ENQUEUE_WAKING - the enqueue has relative vruntime due to having sched_class::task_waking() called, ENQUEUE_HEAD - the waking task should be places on the head of the priority queue (where appropriate). For symmetry also convert sched_class::dequeue() to a flags scheme. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | sched: Fix TASK_WAKING vs fork deadlockPeter Zijlstra2010-04-021-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Oleg noticed a few races with the TASK_WAKING usage on fork. - since TASK_WAKING is basically a spinlock, it should be IRQ safe - since we set TASK_WAKING (*) without holding rq->lock it could be there still is a rq->lock holder, thereby not actually providing full serialization. (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING. Cure the second issue by not setting TASK_WAKING in sched_fork(), but only temporarily in wake_up_new_task() while calling select_task_rq(). Cure the first by holding rq->lock around the select_task_rq() call, this will disable IRQs, this however requires that we push down the rq->lock release into select_task_rq_fair()'s cgroup stuff. Because select_task_rq_fair() still needs to drop the rq->lock we cannot fully get rid of TASK_WAKING. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | sched: Make select_fallback_rq() cpuset friendlyOleg Nesterov2010-04-021-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems with select_fallback_rq(). It can be called from any context and can't use any cpuset locks including task_lock(). It is called when the task doesn't have online cpus in ->cpus_allowed but ttwu/etc must be able to find a suitable cpu. I am not proud of this patch. Everything which needs such a fat comment can't be good even if correct. But I'd prefer to not change the locking rules in the code I hardly understand, and in any case I believe this simple change make the code much more correct compared to deadlocks we currently have. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091027.GA9155@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | sched: _cpu_down(): Don't play with current->cpus_allowedOleg Nesterov2010-04-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | _cpu_down() changes the current task's affinity and then recovers it at the end. The problems are well known: we can't restore old_allowed if it was bound to the now-dead-cpu, and we can race with the userspace which can change cpu-affinity during unplug. _cpu_down() should not play with current->cpus_allowed at all. Instead, take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable() removes the dying cpu from cpu_online_mask. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091023.GA9148@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
OpenPOWER on IntegriCloud