summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* sched/isolation: Offload residual 1Hz scheduler tickFrederic Weisbecker2018-02-218-1/+136
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a CPU runs in full dynticks mode, a 1Hz tick remains in order to keep the scheduler stats alive. However this residual tick is a burden for bare metal tasks that can't stand any interruption at all, or want to minimize them. The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now outsource these scheduler ticks to the global workqueue so that a housekeeping CPU handles those remotely. The sched_class::task_tick() implementations have been audited and look safe to be called remotely as the target runqueue and its current task are passed in parameter and don't seem to be accessed locally. Note that in the case of using isolcpus, it's still up to the user to affine the global workqueues to the housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or domains isolation "isolcpus=nohz,domain". Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/isolation: Isolate workqueues when "nohz_full=" is setFrederic Weisbecker2018-02-212-2/+4
| | | | | | | | | | | | | | | | | | | | As we prepare for offloading the residual 1hz scheduler ticks to workqueue, let's affine those to housekeepers so that they don't interrupt the CPUs that don't want to be disturbed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-5-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* nohz: Allow to check if remote CPU tick is stoppedFrederic Weisbecker2018-02-211-0/+7
| | | | | | | | | | | | | | | | | | | This check is racy but provides a good heuristic to determine whether a CPU may need a remote tick or not. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-4-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* nohz: Convert tick_nohz_tick_stopped() to boolFrederic Weisbecker2018-02-211-1/+1
| | | | | | | | | | | | | | | | | | | | It makes this function more self-explanatory about what it does and how to use it. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-3-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/core: Rename init_rq_hrtick() to hrtick_rq_init()Frederic Weisbecker2018-02-211-3/+3
| | | | | | | | | | | | | | | | | | Do that rename in order to normalize the hrtick namespace. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-2-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/numa: Delay retrying placement for automatic NUMA balance after ↵Mel Gorman2018-02-211-1/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wake_affine() If wake_affine() pulls a task to another node for any reason and the node is no longer preferred then temporarily stop automatic NUMA balancing pulling the task back. Otherwise, tasks with a strong waker/wakee relationship may constantly fight automatic NUMA balancing over where a task should be placed. Once again netperf is interesting here. The performance barely changes but automatic NUMA balancing is interesting: Hmean send-64 354.67 ( 0.00%) 352.15 ( -0.71%) Hmean send-128 702.91 ( 0.00%) 693.84 ( -1.29%) Hmean send-256 1350.07 ( 0.00%) 1344.19 ( -0.44%) Hmean send-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%) Hmean send-2048 9687.44 ( 0.00%) 9624.45 ( -0.65%) Hmean send-3312 14577.64 ( 0.00%) 14514.35 ( -0.43%) Hmean send-4096 16393.62 ( 0.00%) 16488.30 ( 0.58%) Hmean send-8192 26877.26 ( 0.00%) 26431.63 ( -1.66%) Hmean send-16384 38683.43 ( 0.00%) 38264.91 ( -1.08%) Hmean recv-64 354.67 ( 0.00%) 352.15 ( -0.71%) Hmean recv-128 702.91 ( 0.00%) 693.84 ( -1.29%) Hmean recv-256 1350.07 ( 0.00%) 1344.19 ( -0.44%) Hmean recv-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%) Hmean recv-2048 9687.43 ( 0.00%) 9624.45 ( -0.65%) Hmean recv-3312 14577.59 ( 0.00%) 14514.35 ( -0.43%) Hmean recv-4096 16393.55 ( 0.00%) 16488.20 ( 0.58%) Hmean recv-8192 26876.96 ( 0.00%) 26431.29 ( -1.66%) Hmean recv-16384 38682.41 ( 0.00%) 38263.94 ( -1.08%) NUMA alloc hit 1465986 1423090 NUMA alloc miss 0 0 NUMA interleave hit 0 0 NUMA alloc local 1465897 1423003 NUMA base PTE updates 1473 1420 NUMA huge PMD updates 0 0 NUMA page range updates 1473 1420 NUMA hint faults 1383 1312 NUMA hint local faults 451 124 NUMA hint local percent 32 9 There is a slight degrading in performance but there are slightly fewer NUMA faults. There is a large drop in the percentage of local faults but the bulk of migrations for netperf are in small shared libraries so it's reflecting the fact that automatic NUMA balancing has backed off. This is a case where despite wake_affine() and automatic NUMA balancing fighting for placement that there is a marginal benefit to rescheduling to local data quickly. However, it should be noted that wake_affine() and automatic NUMA balancing fighting each other constantly is undesirable. However, the benefit in other cases is large. This is the result for NAS with the D class sizing on a 4-socket machine: nas-mpi 4.15.0 4.15.0 sdnuma-v1r23 delayretry-v1r23 Time cg.D 557.00 ( 0.00%) 431.82 ( 22.47%) Time ep.D 77.83 ( 0.00%) 79.01 ( -1.52%) Time is.D 26.46 ( 0.00%) 26.64 ( -0.68%) Time lu.D 727.14 ( 0.00%) 597.94 ( 17.77%) Time mg.D 191.35 ( 0.00%) 146.85 ( 23.26%) 4.15.0 4.15.0 sdnuma-v1r23delayretry-v1r23 User 75665.20 70413.30 System 20321.59 8861.67 Elapsed 766.13 634.92 Minor Faults 16528502 7127941 Major Faults 4553 5068 NUMA alloc local 6963197 6749135 NUMA base PTE updates 366409093 107491434 NUMA huge PMD updates 687556 198880 NUMA page range updates 718437765 209317994 NUMA hint faults 13643410 4601187 NUMA hint local faults 9212593 3063996 NUMA hint local percent 67 66 Note the massive reduction in system CPU usage even though the percentage of local faults is barely affected. There is a massive reduction in the number of PTE updates showing that automatic NUMA balancing has backed off. A critical observation is also that there is a massive reduction in minor faults which is due to far fewer NUMA hinting faults being trapped. There were questions on NAS OMP and how it behaved related to threads being bound to CPUs. First, there are more gains than losses with this patch applied and a reduction in system CPU usage: nas-omp 4.16.0-rc1 4.16.0-rc1 sdnuma-v2r1 delayretry-v2r1 Time bt.D 436.71 ( 0.00%) 430.05 ( 1.53%) Time cg.D 201.02 ( 0.00%) 180.87 ( 10.02%) Time ep.D 32.84 ( 0.00%) 32.68 ( 0.49%) Time is.D 9.63 ( 0.00%) 9.64 ( -0.10%) Time lu.D 331.20 ( 0.00%) 304.80 ( 7.97%) Time mg.D 54.87 ( 0.00%) 52.72 ( 3.92%) Time sp.D 1108.78 ( 0.00%) 917.10 ( 17.29%) Time ua.D 378.81 ( 0.00%) 398.83 ( -5.28%) 4.16.0-rc1 4.16.0-rc1 sdnuma-v2r1delayretry-v2r1 User 305633.08 296751.91 System 451.75 357.80 Elapsed 2595.73 2368.13 However, it does not close the gap between binding and being unbound. There is negligible difference between the performance of the baseline and a patched kernel when threads are bound so it is not presented here: 4.16.0-rc1 4.16.0-rc1 delayretry-bind delayretry-unbound Time bt.D 385.02 ( 0.00%) 430.05 ( -11.70%) Time cg.D 144.02 ( 0.00%) 180.87 ( -25.59%) Time ep.D 32.85 ( 0.00%) 32.68 ( 0.52%) Time is.D 10.52 ( 0.00%) 9.64 ( 8.37%) Time lu.D 285.31 ( 0.00%) 304.80 ( -6.83%) Time mg.D 43.21 ( 0.00%) 52.72 ( -22.01%) Time sp.D 820.24 ( 0.00%) 917.10 ( -11.81%) Time ua.D 337.09 ( 0.00%) 398.83 ( -18.32%) 4.16.0-rc1 4.16.0-rc1 delayretry-binddelayretry-unbound User 277731.25 296751.91 System 261.29 357.80 Elapsed 2100.55 2368.13 Unfortunately, while performance is improved by the patch, there is still quite a long way to go before it's equivalent to hard binding. Other workloads like hackbench, tbench, dbench and schbench are barely affected. dbench shows a mix of gains and losses depending on the machine although in general, the results are more stable. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-7-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/fair: Consider SD_NUMA when selecting the most idle group to schedule onMel Gorman2018-02-211-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | find_idlest_group() compares a local group with each other group to select the one that is most idle. When comparing groups in different NUMA domains, a very slight imbalance is enough to select a remote NUMA node even if the runnable load on both groups is 0 or close to 0. This ignores the cost of remote accesses entirely and is a problem when selecting the CPU for a newly forked task to run on. This is problematic when a forking server is almost guaranteed to run on a remote node incurring numerous remote accesses and potentially causing automatic NUMA balancing to try migrate the task back or migrate the data to another node. Similar weirdness is observed if a basic shell command pipes output to another as each process in the pipeline is likely to start on different nodes and then get adjusted later by wake_affine(). This patch adds imbalance to remote domains when considering whether to select CPUs from remote domains. If the local domain is selected, imbalance will still be used to try select a CPU from a lower scheduler domain's group instead of stacking tasks on the same CPU. A variety of workloads and machines were tested and as expected, there is no difference on UMA. The difference on NUMA can be dramatic. This is a comparison of elapsed times running the git regression test suite. It's fork-intensive with short-lived processes: 4.15.0 4.15.0 noexit-v1r23 sdnuma-v1r23 Elapsed min 1706.06 ( 0.00%) 1435.94 ( 15.83%) Elapsed mean 1709.53 ( 0.00%) 1436.98 ( 15.94%) Elapsed stddev 2.16 ( 0.00%) 1.01 ( 53.38%) Elapsed coeffvar 0.13 ( 0.00%) 0.07 ( 44.54%) Elapsed max 1711.59 ( 0.00%) 1438.01 ( 15.98%) 4.15.0 4.15.0 noexit-v1r23 sdnuma-v1r23 User 5434.12 5188.41 System 4878.77 3467.09 Elapsed 10259.06 8624.21 That shows a considerable reduction in elapsed times. It's important to note that automatic NUMA balancing does not affect this load as processes are too short-lived. There is also a noticable impact on hackbench such as this example using processes and pipes: hackbench-process-pipes 4.15.0 4.15.0 noexit-v1r23 sdnuma-v1r23 Amean 1 1.0973 ( 0.00%) 0.9393 ( 14.40%) Amean 4 1.3427 ( 0.00%) 1.3730 ( -2.26%) Amean 7 1.4233 ( 0.00%) 1.6670 ( -17.12%) Amean 12 3.0250 ( 0.00%) 3.3013 ( -9.13%) Amean 21 9.0860 ( 0.00%) 9.5343 ( -4.93%) Amean 30 14.6547 ( 0.00%) 13.2433 ( 9.63%) Amean 48 22.5447 ( 0.00%) 20.4303 ( 9.38%) Amean 79 29.2010 ( 0.00%) 26.7853 ( 8.27%) Amean 110 36.7443 ( 0.00%) 35.8453 ( 2.45%) Amean 141 45.8533 ( 0.00%) 42.6223 ( 7.05%) Amean 172 55.1317 ( 0.00%) 50.6473 ( 8.13%) Amean 203 64.4420 ( 0.00%) 58.3957 ( 9.38%) Amean 234 73.2293 ( 0.00%) 67.1047 ( 8.36%) Amean 265 80.5220 ( 0.00%) 75.7330 ( 5.95%) Amean 296 88.7567 ( 0.00%) 82.1533 ( 7.44%) It's not a universal win as there are occasions when spreading wide and quickly is a benefit but it's more of a win than it is a loss. For other workloads, there is little difference but netperf is interesting. Without the patch, the server and client starts on different nodes but quickly get migrated due to wake_affine. Hence, the difference is overall performance is marginal but detectable: 4.15.0 4.15.0 noexit-v1r23 sdnuma-v1r23 Hmean send-64 349.09 ( 0.00%) 354.67 ( 1.60%) Hmean send-128 699.16 ( 0.00%) 702.91 ( 0.54%) Hmean send-256 1316.34 ( 0.00%) 1350.07 ( 2.56%) Hmean send-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%) Hmean send-2048 9705.19 ( 0.00%) 9687.44 ( -0.18%) Hmean send-3312 14359.48 ( 0.00%) 14577.64 ( 1.52%) Hmean send-4096 16324.20 ( 0.00%) 16393.62 ( 0.43%) Hmean send-8192 26112.61 ( 0.00%) 26877.26 ( 2.93%) Hmean send-16384 37208.44 ( 0.00%) 38683.43 ( 3.96%) Hmean recv-64 349.09 ( 0.00%) 354.67 ( 1.60%) Hmean recv-128 699.16 ( 0.00%) 702.91 ( 0.54%) Hmean recv-256 1316.34 ( 0.00%) 1350.07 ( 2.56%) Hmean recv-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%) Hmean recv-2048 9705.16 ( 0.00%) 9687.43 ( -0.18%) Hmean recv-3312 14359.42 ( 0.00%) 14577.59 ( 1.52%) Hmean recv-4096 16323.98 ( 0.00%) 16393.55 ( 0.43%) Hmean recv-8192 26111.85 ( 0.00%) 26876.96 ( 2.93%) Hmean recv-16384 37206.99 ( 0.00%) 38682.41 ( 3.97%) However, what is very interesting is how automatic NUMA balancing behaves. Each netperf instance runs long enough for balancing to activate: NUMA base PTE updates 4620 1473 NUMA huge PMD updates 0 0 NUMA page range updates 4620 1473 NUMA hint faults 4301 1383 NUMA hint local faults 1309 451 NUMA hint local percent 30 32 NUMA pages migrated 1335 491 AutoNUMA cost 21% 6% There is an unfortunate number of remote faults although tracing indicated that the vast majority are in shared libraries. However, the tendency to start tasks on the same node if there is capacity means that there were far fewer PTE updates and faults incurred overall. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/fair: Do not migrate due to a sync wakeup on exitPeter Zijlstra2018-02-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | When a task exits, it notifies the parent that it has exited. This is a sync wakeup and the exiting task may pull the parent towards the wakers CPU. For simple workloads like using a shell, it was observed that the shell is pulled across nodes by exiting processes. This is daft as the parent may be long-lived and properly placed. This patch special cases a sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range of workloads and machines showed very little differences in performance although there was a small 3% boost on some machines running a shellscript intensive workload (git regression test suite). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/fair: Do not migrate on wake_affine_weight() if weights are equalMel Gorman2018-02-211-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wake_affine_weight() will consider migrating a task to, or near, the current CPU if there is a load imbalance. If the CPUs share LLC then either CPU is valid as a search-for-idle-sibling target and equally appropriate for stacking two tasks on one CPU if an idle sibling is unavailable. If they do not share cache then a cross-node migration potentially impacts locality so while they are equal from a CPU capacity point of view, they are not equal in terms of memory locality. In either case, it's more appropriate to migrate only if there is a difference in their effective load. This patch modifies wake_affine_weight() to only consider migrating a task if there is a load imbalance for normal wakeups but will allow potential stacking if the loads are equal and it's a sync wakeup. For the most part, the different in performance is marginal. For example, on a 4-socket server running netperf UDP_STREAM on localhost the differences are as follows: 4.15.0 4.15.0 16rc0 noequal-v1r23 Hmean send-64 355.47 ( 0.00%) 349.50 ( -1.68%) Hmean send-128 697.98 ( 0.00%) 693.35 ( -0.66%) Hmean send-256 1328.02 ( 0.00%) 1318.77 ( -0.70%) Hmean send-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%) Hmean send-2048 9637.02 ( 0.00%) 9601.34 ( -0.37%) Hmean send-3312 14355.37 ( 0.00%) 14414.51 ( 0.41%) Hmean send-4096 16464.97 ( 0.00%) 16301.37 ( -0.99%) Hmean send-8192 26722.42 ( 0.00%) 26428.95 ( -1.10%) Hmean send-16384 38137.81 ( 0.00%) 38046.11 ( -0.24%) Hmean recv-64 355.47 ( 0.00%) 349.50 ( -1.68%) Hmean recv-128 697.98 ( 0.00%) 693.35 ( -0.66%) Hmean recv-256 1328.02 ( 0.00%) 1318.77 ( -0.70%) Hmean recv-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%) Hmean recv-2048 9636.95 ( 0.00%) 9601.30 ( -0.37%) Hmean recv-3312 14355.32 ( 0.00%) 14414.48 ( 0.41%) Hmean recv-4096 16464.74 ( 0.00%) 16301.16 ( -0.99%) Hmean recv-8192 26721.63 ( 0.00%) 26428.17 ( -1.10%) Hmean recv-16384 38136.00 ( 0.00%) 38044.88 ( -0.24%) Stddev send-64 7.30 ( 0.00%) 4.75 ( 34.96%) Stddev send-128 15.15 ( 0.00%) 22.38 ( -47.66%) Stddev send-256 13.99 ( 0.00%) 19.14 ( -36.81%) Stddev send-1024 105.73 ( 0.00%) 67.38 ( 36.27%) Stddev send-2048 294.57 ( 0.00%) 223.88 ( 24.00%) Stddev send-3312 302.28 ( 0.00%) 271.74 ( 10.10%) Stddev send-4096 195.92 ( 0.00%) 121.10 ( 38.19%) Stddev send-8192 399.71 ( 0.00%) 563.77 ( -41.04%) Stddev send-16384 1163.47 ( 0.00%) 1103.68 ( 5.14%) Stddev recv-64 7.30 ( 0.00%) 4.75 ( 34.96%) Stddev recv-128 15.15 ( 0.00%) 22.38 ( -47.66%) Stddev recv-256 13.99 ( 0.00%) 19.14 ( -36.81%) Stddev recv-1024 105.73 ( 0.00%) 67.38 ( 36.27%) Stddev recv-2048 294.59 ( 0.00%) 223.89 ( 24.00%) Stddev recv-3312 302.24 ( 0.00%) 271.75 ( 10.09%) Stddev recv-4096 196.03 ( 0.00%) 121.14 ( 38.20%) Stddev recv-8192 399.86 ( 0.00%) 563.65 ( -40.96%) Stddev recv-16384 1163.79 ( 0.00%) 1103.86 ( 5.15%) The difference in overall performance is marginal but note that most measurements are less variable. There were similar observations for other netperf comparisons. hackbench with sockets or threads with processes or threads showed minor difference with some reduction of migration. tbench showed only marginal differences that were within the noise. dbench, regardless of filesystem, showed minor differences all of which are within noise. Multiple machines, both UMA and NUMA were tested without any regressions showing up. The biggest risk with a patch like this is affecting wakeup latencies. However, the schbench load from Facebook which is very sensitive to wakeup latency showed a mixed result with mostly improvements in wakeup latency: 4.15.0 4.15.0 16rc0 noequal-v1r23 Lat 50.00th-qrtle-1 38.00 ( 0.00%) 38.00 ( 0.00%) Lat 75.00th-qrtle-1 49.00 ( 0.00%) 41.00 ( 16.33%) Lat 90.00th-qrtle-1 52.00 ( 0.00%) 50.00 ( 3.85%) Lat 95.00th-qrtle-1 54.00 ( 0.00%) 51.00 ( 5.56%) Lat 99.00th-qrtle-1 63.00 ( 0.00%) 60.00 ( 4.76%) Lat 99.50th-qrtle-1 66.00 ( 0.00%) 61.00 ( 7.58%) Lat 99.90th-qrtle-1 78.00 ( 0.00%) 65.00 ( 16.67%) Lat 50.00th-qrtle-2 38.00 ( 0.00%) 38.00 ( 0.00%) Lat 75.00th-qrtle-2 42.00 ( 0.00%) 43.00 ( -2.38%) Lat 90.00th-qrtle-2 46.00 ( 0.00%) 48.00 ( -4.35%) Lat 95.00th-qrtle-2 49.00 ( 0.00%) 50.00 ( -2.04%) Lat 99.00th-qrtle-2 55.00 ( 0.00%) 57.00 ( -3.64%) Lat 99.50th-qrtle-2 58.00 ( 0.00%) 60.00 ( -3.45%) Lat 99.90th-qrtle-2 65.00 ( 0.00%) 68.00 ( -4.62%) Lat 50.00th-qrtle-4 41.00 ( 0.00%) 41.00 ( 0.00%) Lat 75.00th-qrtle-4 45.00 ( 0.00%) 46.00 ( -2.22%) Lat 90.00th-qrtle-4 50.00 ( 0.00%) 50.00 ( 0.00%) Lat 95.00th-qrtle-4 54.00 ( 0.00%) 53.00 ( 1.85%) Lat 99.00th-qrtle-4 61.00 ( 0.00%) 61.00 ( 0.00%) Lat 99.50th-qrtle-4 65.00 ( 0.00%) 64.00 ( 1.54%) Lat 99.90th-qrtle-4 76.00 ( 0.00%) 82.00 ( -7.89%) Lat 50.00th-qrtle-8 48.00 ( 0.00%) 46.00 ( 4.17%) Lat 75.00th-qrtle-8 55.00 ( 0.00%) 54.00 ( 1.82%) Lat 90.00th-qrtle-8 60.00 ( 0.00%) 59.00 ( 1.67%) Lat 95.00th-qrtle-8 63.00 ( 0.00%) 63.00 ( 0.00%) Lat 99.00th-qrtle-8 71.00 ( 0.00%) 69.00 ( 2.82%) Lat 99.50th-qrtle-8 74.00 ( 0.00%) 73.00 ( 1.35%) Lat 99.90th-qrtle-8 98.00 ( 0.00%) 90.00 ( 8.16%) Lat 50.00th-qrtle-16 56.00 ( 0.00%) 55.00 ( 1.79%) Lat 75.00th-qrtle-16 68.00 ( 0.00%) 67.00 ( 1.47%) Lat 90.00th-qrtle-16 77.00 ( 0.00%) 78.00 ( -1.30%) Lat 95.00th-qrtle-16 82.00 ( 0.00%) 84.00 ( -2.44%) Lat 99.00th-qrtle-16 90.00 ( 0.00%) 93.00 ( -3.33%) Lat 99.50th-qrtle-16 93.00 ( 0.00%) 97.00 ( -4.30%) Lat 99.90th-qrtle-16 110.00 ( 0.00%) 110.00 ( 0.00%) Lat 50.00th-qrtle-32 68.00 ( 0.00%) 62.00 ( 8.82%) Lat 75.00th-qrtle-32 90.00 ( 0.00%) 83.00 ( 7.78%) Lat 90.00th-qrtle-32 110.00 ( 0.00%) 100.00 ( 9.09%) Lat 95.00th-qrtle-32 122.00 ( 0.00%) 111.00 ( 9.02%) Lat 99.00th-qrtle-32 145.00 ( 0.00%) 133.00 ( 8.28%) Lat 99.50th-qrtle-32 154.00 ( 0.00%) 143.00 ( 7.14%) Lat 99.90th-qrtle-32 2316.00 ( 0.00%) 515.00 ( 77.76%) Lat 50.00th-qrtle-35 69.00 ( 0.00%) 72.00 ( -4.35%) Lat 75.00th-qrtle-35 92.00 ( 0.00%) 95.00 ( -3.26%) Lat 90.00th-qrtle-35 111.00 ( 0.00%) 114.00 ( -2.70%) Lat 95.00th-qrtle-35 122.00 ( 0.00%) 124.00 ( -1.64%) Lat 99.00th-qrtle-35 142.00 ( 0.00%) 144.00 ( -1.41%) Lat 99.50th-qrtle-35 150.00 ( 0.00%) 154.00 ( -2.67%) Lat 99.90th-qrtle-35 6104.00 ( 0.00%) 5640.00 ( 7.60%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() ↵Mel Gorman2018-02-211-1/+1
| | | | | | | | | | | | | | | | | | until needed On sync wakeups, the previous CPU effective load may not be used so delay the calculation until it's needed. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affineMel Gorman2018-02-211-3/+2
| | | | | | | | | | | | | | | | The only caller of wake_affine() knows the CPU ID. Pass it in instead of rechecking it. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge tag 'v4.16-rc2' into sched/core, to pick up fixesIngo Molnar2018-02-2118-176/+301
|\ | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2018-02-181-50/+128
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Thomas Gleixner: "Perf tool updates and kprobe fixes: - perf_mmap overwrite mode fixes/overhaul, prep work to get 'perf top' using it, making it bearable to use it in large core count systems such as Knights Landing/Mill Intel systems (Kan Liang) - s/390 now uses syscall.tbl, just like x86-64 to generate the syscall table id -> string tables used by 'perf trace' (Hendrik Brueckner) - Use strtoull() instead of home grown function (Andy Shevchenko) - Synchronize kernel ABI headers, v4.16-rc1 (Ingo Molnar) - Document missing 'perf data --force' option (Sangwon Hong) - Add perf vendor JSON metrics for ARM Cortex-A53 Processor (William Cohen) - Improve error handling and error propagation of ftrace based kprobes so failures when installing kprobes are not silently ignored and create disfunctional tracepoints" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) kprobes: Propagate error from disarm_kprobe_ftrace() kprobes: Propagate error from arm_kprobe_ftrace() Revert "tools include s390: Grab a copy of arch/s390/include/uapi/asm/unistd.h" perf s390: Rework system call table creation by using syscall.tbl perf s390: Grab a copy of arch/s390/kernel/syscall/syscall.tbl tools/headers: Synchronize kernel ABI headers, v4.16-rc1 perf test: Fix test trace+probe_libc_inet_pton.sh for s390x perf data: Document missing --force option perf tools: Substitute yet another strtoull() perf top: Check the latency of perf_top__mmap_read() perf top: Switch default mode to overwrite mode perf top: Remove lost events checking perf hists browser: Add parameter to disable lost event warning perf top: Add overwrite fall back perf evsel: Expose the perf_missing_features struct perf top: Check per-event overwrite term perf mmap: Discard legacy interface for mmap read perf test: Update mmap read functions for backward-ring-buffer test perf mmap: Introduce perf_mmap__read_event() perf mmap: Introduce perf_mmap__read_done() ...
| | * kprobes: Propagate error from disarm_kprobe_ftrace()Jessica Yu2018-02-161-25/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improve error handling when disarming ftrace-based kprobes. Like with arm_kprobe_ftrace(), propagate any errors from disarm_kprobe_ftrace() so that we do not disable/unregister kprobes that are still armed. In other words, unregister_kprobe() and disable_kprobe() should not report success if the kprobe could not be disarmed. disarm_all_kprobes() keeps its current behavior and attempts to disarm all kprobes. It returns the last encountered error and gives a warning if not all probes could be disarmed. This patch is based on Petr Mladek's original patchset (patches 2 and 3) back in 2015, which improved kprobes error handling, found here: https://lkml.org/lkml/2015/2/26/452 However, further work on this had been paused since then and the patches were not upstreamed. Based-on-patches-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Jiri Kosina <jikos@kernel.org> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/20180109235124.30886-3-jeyu@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * kprobes: Propagate error from arm_kprobe_ftrace()Jessica Yu2018-02-161-25/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improve error handling when arming ftrace-based kprobes. Specifically, if we fail to arm a ftrace-based kprobe, register_kprobe()/enable_kprobe() should report an error instead of success. Previously, this has lead to confusing situations where register_kprobe() would return 0 indicating success, but the kprobe would not be functional if ftrace registration during the kprobe arming process had failed. We should therefore take any errors returned by ftrace into account and propagate this error so that we do not register/enable kprobes that cannot be armed. This can happen if, for example, register_ftrace_function() finds an IPMODIFY conflict (since kprobe_ftrace_ops has this flag set) and returns an error. Such a conflict is possible since livepatches also set the IPMODIFY flag for their ftrace_ops. arm_all_kprobes() keeps its current behavior and attempts to arm all kprobes. It returns the last encountered error and gives a warning if not all probes could be armed. This patch is based on Petr Mladek's original patchset (patches 2 and 3) back in 2015, which improved kprobes error handling, found here: https://lkml.org/lkml/2015/2/26/452 However, further work on this had been paused since then and the patches were not upstreamed. Based-on-patches-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Jiri Kosina <jikos@kernel.org> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/20180109235124.30886-2-jeyu@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds2018-02-181-14/+4
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "A small set of updates mostly for irq chip drivers: - MIPS GIC fix for spurious, masked interrupts - fix for a subtle IPI bug in GICv3 - do not probe GICv3 ITSs that are marked as disabled - multi-MSI support for GICv2m - various small cleanups" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqdomain: Re-use DEFINE_SHOW_ATTRIBUTE() macro irqchip/bcm: Remove hashed address printing irqchip/gic-v2m: Add PCI Multi-MSI support irqchip/gic-v3: Ignore disabled ITS nodes irqchip/gic-v3: Use wmb() instead of smb_wmb() in gic_raise_softirq() irqchip/gic-v3: Change pr_debug message to pr_devel irqchip/mips-gic: Avoid spuriously handling masked interrupts
| | * | irqdomain: Re-use DEFINE_SHOW_ATTRIBUTE() macroAndy Shevchenko2018-02-161-14/+4
| | |/ | | | | | | | | | | | | | | | | | | | | | ...instead of open coding file operations followed by custom ->open() callbacks per each attribute. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| * | Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds2018-02-154-16/+22
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes: - fix rq->lock lockdep annotation bug - fix/improve update_curr_rt() and update_curr_dl() accounting - update documentation - remove unused macro" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cpufreq: Remove unused SUGOV_KTHREAD_PRIORITY macro sched/core: Fix DEBUG_SPINLOCK annotation for rq->lock sched/rt: Make update_curr_rt() more accurate sched/deadline: Make update_curr_dl() more accurate membarrier-sync-core: Document architecture support
| | * | sched/cpufreq: Remove unused SUGOV_KTHREAD_PRIORITY macroLeo Yan2018-02-131-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since schedutil kernel thread directly set priority to 0, the macro SUGOV_KTHREAD_PRIORITY is not used. So remove it. Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/1518097702-9665-1-git-send-email-leo.yan@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | sched/core: Fix DEBUG_SPINLOCK annotation for rq->lockPeter Zijlstra2018-02-131-11/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark noticed that he had sporadic "spinlock recursion" warnings from the DEBUG_SPINLOCK code. Now rq->lock is special in that the owner changes in the middle of a context switch. It so happens that we fix up the lock.owner too late, @prev can run (remotely) the moment prev->on_cpu is cleared, this then allows @prev to again try and acquire this rq->lock and trigger this warning. So we have to switch lock.owner before clearing prev->on_cpu. Do this by moving the DEBUG_SPINLOCK annotation from after switch_to() to before switch_to() and collect all lockdep annotations there into prepare_lock_switch() to mirror the existing finish_lock_switch(). Debugged-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | sched/rt: Make update_curr_rt() more accurateWen Yang2018-02-131-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rq->clock_task may be updated between the two calls of rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only once makes it more accurate and efficient, taking update_curr() as reference. Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: zhong.weidong@zte.com.cn Link: http://lkml.kernel.org/r/1517882008-44552-1-git-send-email-wen.yang99@zte.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | sched/deadline: Make update_curr_dl() more accurateWen Yang2018-02-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rq->clock_task may be updated between the two calls of rq_clock_task() in update_curr_dl(). Calling rq_clock_task() only once makes it more accurate and efficient, taking update_curr() as reference. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: zhong.weidong@zte.com.cn Link: http://lkml.kernel.org/r/1517882148-44599-1-git-send-email-wen.yang99@zte.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds2018-02-151-6/+15
| |\ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "This contains two qspinlock fixes and three documentation and comment fixes" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/semaphore: Update the file path in documentation locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit() locking/qspinlock: Ensure node->count is updated before initialising node locking/qspinlock: Ensure node is initialised before updating prev->next Documentation/locking/mutex-design: Update to reflect latest changes
| | * | locking/qspinlock: Ensure node->count is updated before initialising nodeWill Deacon2018-02-131-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When queuing on the qspinlock, the count field for the current CPU's head node is incremented. This needn't be atomic because locking in e.g. IRQ context is balanced and so an IRQ will return with node->count as it found it. However, the compiler could in theory reorder the initialisation of node[idx] before the increment of the head node->count, causing an IRQ to overwrite the initialised node and potentially corrupt the lock state. Avoid the potential for this harmful compiler reordering by placing a barrier() between the increment of the head node->count and the subsequent node initialisation. Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | locking/qspinlock: Ensure node is initialised before updating prev->nextWill Deacon2018-02-131-6/+7
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a locker ends up queuing on the qspinlock locking slowpath, we initialise the relevant mcs node and publish it indirectly by updating the tail portion of the lock word using xchg_tail. If we find that there was a pre-existing locker in the queue, we subsequently update their ->next field to point at our node so that we are notified when it's our turn to take the lock. This can be roughly illustrated as follows: /* Initialise the fields in node and encode a pointer to node in tail */ tail = initialise_node(node); /* * Exchange tail into the lockword using an atomic read-modify-write * operation with release semantics */ old = xchg_tail(lock, tail); /* If there was a pre-existing waiter ... */ if (old & _Q_TAIL_MASK) { prev = decode_tail(old); smp_read_barrier_depends(); /* ... then update their ->next field to point to node. WRITE_ONCE(prev->next, node); } The conditional update of prev->next therefore relies on the address dependency from the result of xchg_tail ensuring order against the prior initialisation of node. However, since the release semantics of the xchg_tail operation apply only to the write portion of the RmW, then this ordering is not guaranteed and it is possible for the CPU to return old before the writes to node have been published, consequently allowing us to point prev->next to an uninitialised node. This patch fixes the problem by making the update of prev->next a RELEASE operation, which also removes the reliance on dependency ordering. Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1518528177-19169-2-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | vfs: do bulk POLL* -> EPOLL* replacementLinus Torvalds2018-02-117-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script: for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'` for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done done with de-mangling cleanups yet to come. NOTE! On almost all architectures, the EPOLL* constants have the same values as the POLL* constants do. But they keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al. The next patch from Al will sort out the final differences, and we should be all done. Scripted-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | Merge tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-02-101-0/+1
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Radim Krčmář: "ARM: - icache invalidation optimizations, improving VM startup time - support for forwarded level-triggered interrupts, improving performance for timers and passthrough platform devices - a small fix for power-management notifiers, and some cosmetic changes PPC: - add MMIO emulation for vector loads and stores - allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without requiring the complex thread synchronization of older CPU versions - improve the handling of escalation interrupts with the XIVE interrupt controller - support decrement register migration - various cleanups and bugfixes. s390: - Cornelia Huck passed maintainership to Janosch Frank - exitless interrupts for emulated devices - cleanup of cpuflag handling - kvm_stat counter improvements - VSIE improvements - mm cleanup x86: - hypervisor part of SEV - UMIP, RDPID, and MSR_SMI_COUNT emulation - paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit - allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more AVX512 features - show vcpu id in its anonymous inode name - many fixes and cleanups - per-VCPU MSR bitmaps (already merged through x86/pti branch) - stable KVM clock when nesting on Hyper-V (merged through x86/hyperv)" * tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (197 commits) KVM: PPC: Book3S: Add MMIO emulation for VMX instructions KVM: PPC: Book3S HV: Branch inside feature section KVM: PPC: Book3S HV: Make HPT resizing work on POWER9 KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code KVM: PPC: Book3S PR: Fix broken select due to misspelling KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs() KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled KVM: PPC: Book3S HV: Drop locks before reading guest memory kvm: x86: remove efer_reload entry in kvm_vcpu_stat KVM: x86: AMD Processor Topology Information x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested kvm: embed vcpu id to dentry of vcpu anon inode kvm: Map PFN-type memory regions as writable (if possible) x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n KVM: arm/arm64: Fixup userspace irqchip static key optimization KVM: arm/arm64: Fix userspace_irqchip_in_use counting KVM: arm/arm64: Fix incorrect timer_is_pending logic MAINTAINERS: update KVM/s390 maintainers MAINTAINERS: add Halil as additional vfio-ccw maintainer MAINTAINERS: add David as a reviewer for KVM/s390 ...
| | * \ Merge branch 'x86/hyperv' of ↵Radim Krčmář2018-02-0185-2142/+2337
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Topic branch for stable KVM clockource under Hyper-V. Thanks to Christoffer Dall for resolving the ARM conflict.
| | * | | kvm_config: add CONFIG_S390_GUESTChristian Borntraeger2018-01-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | make kvmconfig currently does not select CONFIG_S390_GUEST. Since the virtio-ccw transport depends on CONFIG_S390_GUEST, we want to add CONFIG_S390_GUEST to kvmconfig. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com>
| * | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2018-02-091-72/+115
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Make allocations less aggressive in x_tables, from Minchal Hocko. 2) Fix netfilter flowtable Kconfig deps, from Pablo Neira Ayuso. 3) Fix connection loss problems in rtlwifi, from Larry Finger. 4) Correct DRAM dump length for some chips in ath10k driver, from Yu Wang. 5) Fix ABORT handling in rxrpc, from David Howells. 6) Add SPDX tags to Sun networking drivers, from Shannon Nelson. 7) Some ipv6 onlink handling fixes, from David Ahern. 8) Netem packet scheduler interval calcualtion fix from Md. Islam. 9) Don't put crypto buffers on-stack in rxrpc, from David Howells. 10) Fix handling of error non-delivery status in netlink multicast delivery over multiple namespaces, from Nicolas Dichtel. 11) Missing xdp flush in tuntap driver, from Jason Wang. 12) Synchonize RDS protocol netns/module teardown with rds object management, from Sowini Varadhan. 13) Add nospec annotations to mpls, from Dan Williams. 14) Fix SKB truesize handling in TIPC, from Hoang Le. 15) Interrupt masking fixes in stammc from Niklas Cassel. 16) Don't allow ptr_ring objects to be sized outside of kmalloc's limits, from Jason Wang. 17) Don't allow SCTP chunks to be built which will have a length exceeding the chunk header's 16-bit length field, from Alexey Kodanev. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (82 commits) ibmvnic: Remove skb->protocol checks in ibmvnic_xmit bpf: fix rlimit in reuseport net selftest sctp: verify size of a new chunk in _sctp_make_chunk() s390/qeth: fix SETIP command handling s390/qeth: fix underestimated count of buffer elements ptr_ring: try vmalloc() when kmalloc() fails ptr_ring: fail early if queue occupies more than KMALLOC_MAX_SIZE net: stmmac: remove redundant enable of PMT irq net: stmmac: rename GMAC_INT_DEFAULT_MASK for dwmac4 net: stmmac: discard disabled flags in interrupt status register ibmvnic: Reset long term map ID counter tools/libbpf: handle issues with bpf ELF objects containing .eh_frames selftests/bpf: add selftest that use test_libbpf_open selftests/bpf: add test program for loading BPF ELF files tools/libbpf: improve the pr_debug statements to contain section numbers bpf: Sync kernel ABI header with tooling header for bpf_common.h net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT net: thunder: change q_len's type to handle max ring size tipc: fix skb truesize/datasize ratio control net/sched: cls_u32: fix cls_u32 on filter replace ...
| | * \ \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2018-02-091-72/+115
| | |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2018-02-09 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Two fixes for BPF sockmap in order to break up circular map references from programs attached to sockmap, and detaching related sockets in case of socket close() event. For the latter we get rid of the smap_state_change() and plug into ULP infrastructure, which will later also be used for additional features anyway such as TX hooks. For the second issue, dependency chain is broken up via map release callback to free parse/verdict programs, all from John. 2) Fix a libbpf relocation issue that was found while implementing XDP support for Suricata project. Issue was that when clang was invoked with default target instead of bpf target, then various other e.g. debugging relevant sections are added to the ELF file that contained relocation entries pointing to non-BPF related sections which libbpf trips over instead of skipping them. Test cases for libbpf are added as well, from Jesper. 3) Various misc fixes for bpftool and one for libbpf: a small addition to libbpf to make sure it recognizes all standard section prefixes. Then, the Makefile in bpftool/Documentation is improved to explicitly check for rst2man being installed on the system as we otherwise risk installing empty man pages; the man page for bpftool-map is corrected and a set of missing bash completions added in order to avoid shipping bpftool where the completions are only partially working, from Quentin. 4) Fix applying the relocation to immediate load instructions in the nfp JIT which were missing a shift, from Jakub. 5) Two fixes for the BPF kernel selftests: handle CONFIG_BPF_JIT_ALWAYS_ON=y gracefully in test_bpf.ko module and mark them as FLAG_EXPECTED_FAIL in this case; and explicitly delete the veth devices in the two tests test_xdp_{meta,redirect}.sh before dismantling the netnses as when selftests are run in batch mode, then workqueue to handle destruction might not have finished yet and thus veth creation in next test under same dev name would fail, from Yonghong. 6) Fix test_kmod.sh to check the test_bpf.ko module path before performing an insmod, and fallback to modprobe. Especially the latter is useful when having a device under test that has the modules installed instead, from Naresh. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * | | | bpf: sockmap, fix leaking maps with attached but not detached progsJohn Fastabend2018-02-061-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a program is attached to a map we increment the program refcnt to ensure that the program is not removed while it is potentially being referenced from sockmap side. However, if this same program also references the map (this is a reasonably common pattern in my programs) then the verifier will also increment the maps refcnt from the verifier. This is to ensure the map doesn't get garbage collected while the program has a reference to it. So we are left in a state where the map holds the refcnt on the program stopping it from being removed and releasing the map refcnt. And vice versa the program holds a refcnt on the map stopping it from releasing the refcnt on the prog. All this is fine as long as users detach the program while the map fd is still around. But, if the user omits this detach command we are left with a dangling map we can no longer release. To resolve this when the map fd is released decrement the program references and remove any reference from the map to the program. This fixes the issue with possibly dangling map and creates a user side API constraint. That is, the map fd must be held open for programs to be attached to a map. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | | * | | | bpf: sockmap, add sock close() hook to remove socksJohn Fastabend2018-02-061-67/+101
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The selftests test_maps program was leaving dangling BPF sockmap programs around because not all psock elements were removed from the map. The elements in turn hold a reference on the BPF program they are attached to causing BPF programs to stay open even after test_maps has completed. The original intent was that sk_state_change() would be called when TCP socks went through TCP_CLOSE state. However, because socks may be in SOCK_DEAD state or the sock may be a listening socket the event is not always triggered. To resolve this use the ULP infrastructure and register our own proto close() handler. This fixes the above case. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | | | Merge tag 'trace-v4.16-rc1' of ↵Linus Torvalds2018-02-092-6/+4
| |\ \ \ \ \ \ | | |_|_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Al Viro discovered some breakage with the parsing of the set_ftrace_filter as well as the removing of function probes. This fixes the code with Al's suggestions. I also added a few selftests to test the broken cases such that they wont happen again" * tag 'trace-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: selftests/ftrace: Add more tests for removing of function probes selftests/ftrace: Add some missing glob checks selftests/ftrace: Have reset_ftrace_filter handle multiple instances selftests/ftrace: Have reset_ftrace_filter handle modules tracing: Fix parsing of globs with a wildcard at the beginning ftrace: Remove incorrect setting of glob search field
| | * | | | | tracing: Fix parsing of globs with a wildcard at the beginningSteven Rostedt (VMware)2018-02-081-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Al Viro reported: For substring - sure, but what about something like "*a*b" and "a*b"? AFAICS, filter_parse_regex() ends up with identical results in both cases - MATCH_GLOB and *search = "a*b". And no way for the caller to tell one from another. Testing this with the following: # cd /sys/kernel/tracing # echo '*raw*lock' > set_ftrace_filter bash: echo: write error: Invalid argument With this patch: # echo '*raw*lock' > set_ftrace_filter # cat set_ftrace_filter _raw_read_trylock _raw_write_trylock _raw_read_unlock _raw_spin_unlock _raw_write_unlock _raw_spin_trylock _raw_spin_lock _raw_write_lock _raw_read_lock Al recommended not setting the search buffer to skip the first '*' unless we know we are not using MATCH_GLOB. This implements his suggested logic. Link: http://lkml.kernel.org/r/20180127170748.GF13338@ZenIV.linux.org.uk Cc: stable@vger.kernel.org Fixes: 60f1d5e3bac44 ("ftrace: Support full glob matching") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Suggsted-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| | * | | | | ftrace: Remove incorrect setting of glob search fieldSteven Rostedt (VMware)2018-02-081-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __unregister_ftrace_function_probe() will incorrectly parse the glob filter because it resets the search variable that was setup by filter_parse_regex(). Al Viro reported this: After that call of filter_parse_regex() we could have func_g.search not equal to glob only if glob started with '!' or '*'. In the former case we would've buggered off with -EINVAL (not = 1). In the latter we would've set func_g.search equal to glob + 1, calculated the length of that thing in func_g.len and proceeded to reset func_g.search back to glob. Suppose the glob is e.g. *foo*. We end up with func_g.type = MATCH_MIDDLE_ONLY; func_g.len = 3; func_g.search = "*foo"; Feeding that to ftrace_match_record() will not do anything sane - we will be looking for names containing "*foo" (->len is ignored for that one). Link: http://lkml.kernel.org/r/20180127031706.GE13338@ZenIV.linux.org.uk Cc: stable@vger.kernel.org Fixes: 3ba009297149f ("ftrace: Introduce ftrace_glob structure") Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
* | | | | | | sched/fair: Remove stray space in #ifdefVincent Guittot2018-02-131-2/+2
|/ / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove a useless space in # ifdef and align it with others. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1518512382-29426-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | | Merge tag 'modules-for-v4.16' of ↵Linus Torvalds2018-02-071-6/+1
|\ \ \ \ \ \ | |_|/ / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules updates from Jessica Yu: "Minor code cleanups and MAINTAINERS update" * tag 'modules-for-v4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: modpost: Remove trailing semicolon ftrace/module: Move ftrace_release_mod() to ddebug_cleanup label MAINTAINERS: Remove from module & paravirt maintenance
| * | | | | ftrace/module: Move ftrace_release_mod() to ddebug_cleanup labelNamit Gupta2018-01-151-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ftrace_module_init happen after dynamic_debug_setup, it is desired that cleanup should be called after this label however in current implementation it is called in free module label,ie:even though ftrace in not initialized, from so many fail case ftrace_release_mod() will be called and unnecessary traverse the whole list. In below patch we moved ftrace_release_mod() from free_module label to ddebug_cleanup label. that is the best possible location, other solution is to make new label to ftrace_release_mod() but since ftrace_module_init() is not return with minimum changes it should be in ddebug_cleanup label. Signed-off-by: Namit Gupta <gupta.namit@samsung.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Jessica Yu <jeyu@kernel.org>
* | | | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2018-02-0616-120/+83
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc updates from Andrew Morton: - kasan updates - procfs - lib/bitmap updates - other lib/ updates - checkpatch tweaks - rapidio - ubsan - pipe fixes and cleanups - lots of other misc bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits) Documentation/sysctl/user.txt: fix typo MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns MAINTAINERS: update various PALM patterns MAINTAINERS: update "ARM/OXNAS platform support" patterns MAINTAINERS: update Cortina/Gemini patterns MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern MAINTAINERS: remove ANDROID ION pattern mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors mm: docs: fix parameter names mismatch mm: docs: fixup punctuation pipe: read buffer limits atomically pipe: simplify round_pipe_size() pipe: reject F_SETPIPE_SZ with size over UINT_MAX pipe: fix off-by-one error when checking buffer limits pipe: actually allow root to exceed the pipe buffer limits pipe, sysctl: remove pipe_proc_fn() pipe, sysctl: drop 'min' parameter from pipe-max-size converter kasan: rework Kconfig settings crash_dump: is_kdump_kernel can be boolean kernel/mutex: mutex_is_locked can be boolean ...
| * | | | | | pipe: reject F_SETPIPE_SZ with size over UINT_MAXEric Biggers2018-02-061-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A pipe's size is represented as an 'unsigned int'. As expected, writing a value greater than UINT_MAX to /proc/sys/fs/pipe-max-size fails with EINVAL. However, the F_SETPIPE_SZ fcntl silently truncates such values to 32 bits, rather than failing with EINVAL as expected. (It *does* fail with EINVAL for values above (1 << 31) but <= UINT_MAX.) Fix this by moving the check against UINT_MAX into round_pipe_size() which is called in both cases. Link: http://lkml.kernel.org/r/20180111052902.14409-6-ebiggers3@gmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | pipe, sysctl: remove pipe_proc_fn()Eric Biggers2018-02-061-10/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pipe_proc_fn() is no longer needed, as it only calls through to proc_dopipe_max_size(). Just put proc_dopipe_max_size() in the ctl_table entry directly, and remove the unneeded EXPORT_SYMBOL() and the ENOSYS stub for it. (The reason the ENOSYS stub isn't needed is that the pipe-max-size ctl_table entry is located directly in 'kern_table' rather than being registered separately. Therefore, the entry is already only defined when the kernel is built with sysctl support.) Link: http://lkml.kernel.org/r/20180111052902.14409-3-ebiggers3@gmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | pipe, sysctl: drop 'min' parameter from pipe-max-size converterEric Biggers2018-02-061-14/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "pipe: buffer limits fixes and cleanups", v2. This series simplifies the sysctl handler for pipe-max-size and fixes another set of bugs related to the pipe buffer limits: - The root user wasn't allowed to exceed the limits when creating new pipes. - There was an off-by-one error when checking the limits, so a limit of N was actually treated as N - 1. - F_SETPIPE_SZ accepted values over UINT_MAX. - Reading the pipe buffer limits could be racy. This patch (of 7): Before validating the given value against pipe_min_size, do_proc_dopipe_max_size_conv() calls round_pipe_size(), which rounds the value up to pipe_min_size. Therefore, the second check against pipe_min_size is redundant. Remove it. Link: http://lkml.kernel.org/r/20180111052902.14409-2-ebiggers3@gmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | kernel/resource: iomem_is_exclusive can be booleanYaowei Bai2018-02-061-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make iomem_is_exclusive return bool due to this particular function only using either one or zero as its return value. No functional change. Link: http://lkml.kernel.org/r/1513266622-15860-5-git-send-email-baiyaowei@cmss.chinamobile.com Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | kernel/cpuset: current_cpuset_is_being_rebound can be booleanYaowei Bai2018-02-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make current_cpuset_is_being_rebound return bool due to this particular function only using either one or zero as its return value. No functional change. Link: http://lkml.kernel.org/r/1513266622-15860-4-git-send-email-baiyaowei@cmss.chinamobile.com Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | genirq: remove unneeded kallsyms includeSergey Senozhatsky2018-02-061-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The file was converted from print_symbol() to %pf some time ago in commit ef26f20cd117 ("genirq: Print threaded handler in spurious debug output"). kallsyms does not seem to be needed anymore. Link: http://lkml.kernel.org/r/20171208025616.16267-10-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | hrtimer: remove unneeded kallsyms includeSergey Senozhatsky2018-02-061-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | hrtimer does not seem to use any of kallsyms functions/defines. Link: http://lkml.kernel.org/r/20171208025616.16267-9-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | kcov: detect double association with a single taskDmitry Vyukov2018-02-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently KCOV_ENABLE does not check if the current task is already associated with another kcov descriptor. As the result it is possible to associate a single task with more than one kcov descriptor, which later leads to a memory leak of the old descriptor. This relation is really meant to be one-to-one (task has only one back link). Extend validation to detect such misuse. Link: http://lkml.kernel.org/r/20180122082520.15716-1-dvyukov@google.com Fixes: 5c9a8750a640 ("kernel: add kcov code coverage") Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Shankara Pailoor <sp3485@columbia.edu> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: syzbot <syzkaller@googlegroups.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | kernel/relay.c: revert "kernel/relay.c: fix potential memory leak"Eric Biggers2018-02-061-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit ba62bafe942b ("kernel/relay.c: fix potential memory leak"). This commit introduced a double free bug, because 'chan' is already freed by the line: kref_put(&chan->kref, relay_destroy_channel); This bug was found by syzkaller, using the BLKTRACESETUP ioctl. Link: http://lkml.kernel.org/r/20180127004759.101823-1-ebiggers3@gmail.com Fixes: ba62bafe942b ("kernel/relay.c: fix potential memory leak") Signed-off-by: Eric Biggers <ebiggers@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Zhouyi Zhou <yizhouzhou@ict.ac.cn> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | pids: introduce find_get_task_by_vpid() helperMike Rapoport2018-02-064-45/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several functions that do find_task_by_vpid() followed by get_task_struct(). We can use a helper function instead. Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud