summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* mm: mmap: add new /proc tunable for mmap_base ASLRDaniel Cashman2016-01-141-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Address Space Layout Randomization (ASLR) provides a barrier to exploitation of user-space processes in the presence of security vulnerabilities by making it more difficult to find desired code/data which could help an attack. This is done by adding a random offset to the location of regions in the process address space, with a greater range of potential offset values corresponding to better protection/a larger search-space for brute force, but also to greater potential for fragmentation. The offset added to the mmap_base address, which provides the basis for the majority of the mappings for a process, is set once on process exec in arch_pick_mmap_layout() and is done via hard-coded per-arch values, which reflect, hopefully, the best compromise for all systems. The trade-off between increased entropy in the offset value generation and the corresponding increased variability in address space fragmentation is not absolute, however, and some platforms may tolerate higher amounts of entropy. This patch introduces both new Kconfig values and a sysctl interface which may be used to change the amount of entropy used for offset generation on a system. The direct motivation for this change was in response to the libstagefright vulnerabilities that affected Android, specifically to information provided by Google's project zero at: http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html The attack presented therein, by Google's project zero, specifically targeted the limited randomness used to generate the offset added to the mmap_base address in order to craft a brute-force-based attack. Concretely, the attack was against the mediaserver process, which was limited to respawning every 5 seconds, on an arm device. The hard-coded 8 bits used resulted in an average expected success rate of defeating the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a piece). With this patch, and an accompanying increase in the entropy value to 16 bits, the same attack would take an average expected time of over 45 hours (32768 tries), which makes it both less feasible and more likely to be noticed. The introduced Kconfig and sysctl options are limited by per-arch minimum and maximum values, the minimum of which was chosen to match the current hard-coded value and the maximum of which was chosen so as to give the greatest flexibility without generating an invalid mmap_base address, generally a 3-4 bits less than the number of bits in the user-space accessible virtual address space. When decided whether or not to change the default value, a system developer should consider that mmap_base address could be placed anywhere up to 2^(value) bits away from the non-randomized location, which would introduce variable-sized areas above and below the mmap_base address such that the maximum vm_area_struct size may be reduced, preventing very large allocations. This patch (of 4): ASLR only uses as few as 8 bits to generate the random offset for the mmap base address on 32 bit architectures. This value was chosen to prevent a poorly chosen value from dividing the address space in such a way as to prevent large allocations. This may not be an issue on all platforms. Allow the specification of a minimum number of bits so that platforms desiring greater ASLR protection may determine where to place the trade-off. Signed-off-by: Daniel Cashman <dcashman@google.com> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Rientjes <rientjes@google.com> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Nick Kralevich <nnk@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Borislav Petkov <bp@suse.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, shmem: add internal shmem resident memory accountingJerome Marchand2016-01-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported "shmem-rss" from "file-rss". [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kmemcg: account certain kmem allocations to memcgVladimir Davydov2016-01-144-13/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'libnvdimm-for-4.5' of ↵Linus Torvalds2016-01-131-2/+9
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this has appeared in -next and independently received a build success notification from the kbuild robot. The 'for-4.5/block- dax' topic branch was rebased over the weekend to drop the "block device end-of-life" rework that Al would like to see re-implemented with a notifier, and to address bug reports against the badblocks integration. There is pending feedback against "libnvdimm: Add a poison list and export badblocks" received last week. Linda identified some localized fixups that we will handle incrementally. Summary: - Media error handling: The 'badblocks' implementation that originated in md-raid is up-levelled to a generic capability of a block device. This initial implementation is limited to being consulted in the pmem block-i/o path. Later, 'badblocks' will be consulted when creating dax mappings. - Raw block device dax: For virtualization and other cases that want large contiguous mappings of persistent memory, add the capability to dax-mmap a block device directly. - Increased /dev/mem restrictions: Add an option to treat all io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access while a driver is actively using an address range. This behavior is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be overridden by the existing "iomem=relaxed" kernel command line option. - Miscellaneous fixes include a 'pfn'-device huge page alignment fix, block device shutdown crash fix, and other small libnvdimm fixes" * tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits) block: kill disk_{check|set|clear|alloc}_badblocks libnvdimm, pmem: nvdimm_read_bytes() badblocks support pmem, dax: disable dax in the presence of bad blocks pmem: fail io-requests to known bad blocks libnvdimm: convert to statically allocated badblocks libnvdimm: don't fail init for full badblocks list block, badblocks: introduce devm_init_badblocks block: clarify badblocks lifetime badblocks: rename badblocks_free to badblocks_exit libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h libnvdimm: Add a poison list and export badblocks nfit_test: Enable DSMs for all test NFITs md: convert to use the generic badblocks code block: Add badblock management for gendisks badblocks: Add core badblock management code block: fix del_gendisk() vs blkdev_ioctl crash block: enable dax for raw block devices block: introduce bdev_file_inode() restrict /dev/mem to idle io memory ranges arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug ...
| * restrict /dev/mem to idle io memory rangesDan Williams2016-01-091-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This effectively promotes IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE semantics by default. If userspace really believes it is safe to access the memory region it can also perform the extra step of disabling an active driver. This protects device address ranges with read side effects and otherwise directs userspace to use the driver. Persistent memory presents a large "mistake surface" to /dev/mem as now accidental writes can corrupt a filesystem. In general if a device driver is busily using a memory region it already informs other parts of the kernel to not touch it via request_mem_region(). /dev/mem should honor the same safety restriction by default. Debugging a device driver from userspace becomes more difficult with this enabled. Any application using /dev/mem or mmap of sysfs pci resources will now need to perform the extra step of either: 1/ Disabling the driver, for example: echo <device id> > /dev/bus/<parent bus>/drivers/<driver name>/unbind 2/ Rebooting with "iomem=relaxed" on the command line 3/ Recompiling with CONFIG_IO_STRICT_DEVMEM=n Traditional users of /dev/mem like dosemu are unaffected because the first 1MB of memory is not subject to the IO_STRICT_DEVMEM restriction. Legacy X configurations use /dev/mem to talk to graphics hardware, but that functionality has since moved to kernel graphics drivers. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | Merge tag 'pm+acpi-4.5-rc1-1' of ↵Linus Torvalds2016-01-122-15/+11
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull oower management and ACPI updates from Rafael Wysocki: "As far as the number of commits goes, ACPICA takes the lead this time, followed by cpufreq and the device properties framework changes. The most significant new feature is the debugfs-based interface to the ACPICA's AML debugger added in the previous cycle and a new user space tool for accessing it. On the cpufreq front, the core is updated to handle governors more efficiently, particularly on systems where a single cpufreq policy object is shared between multiple CPUs, and there are quite a few changes in drivers (intel_pstate, cpufreq-dt etc). The device properties framework is updated to handle built-in (ie included in the kernel itself) device properties better, among other things by adding a fallback mechanism that will allow drivers to provide default properties to be used in case the plaform firmware doesn't provide the properties expected by them. The Operating Performance Points (OPP) framework gets new DT bindings and debugfs support. A new cpufreq driver for ST platforms is added and the ACPI driver for AMD SoCs will now support the APM X-Gene ACPI I2C device. The rest is mostly fixes and cleanups all over. Specifics: - Add a debugfs-based interface for interacting with the ACPICA's AML debugger introduced in the previous cycle and a new user space tool for that, fix some bugs related to the AML debugger and clean up the code in question (Lv Zheng, Dan Carpenter, Colin Ian King, Markus Elfring). - Update ACPICA to upstream revision 20151218 including a number of fixes and cleanups in the ACPICA core (Bob Moore, Lv Zheng, Labbe Corentin, Prarit Bhargava, Colin Ian King, David E Box, Rafael Wysocki). In particular, the previously added erroneous support for the _SUB object is dropped, the concatenate operator will support all ACPI objects now, the Debug Object handling is improved, the SuperName handling of parameters being control methods is fixed, the ObjectType operator handling is updated to follow ACPI 5.0A and the handling of CondRefOf and RefOf is updated accordingly, module- level code will be executed after loading each ACPI table now (instead of being run once after all tables containing AML have been loaded), the Operation Region handlers management is updated to fix some reported problems and a the ACPICA code in the kernel is more in line with the upstream now. - Update the ACPI backlight driver to provide information on whether or not it will generate key-presses for brightness change hotkeys and update some platform drivers (dell-wmi, thinkpad_acpi) to use that information to avoid sending double key-events to users pace for these, add new ACPI backlight quirks (Hans de Goede, Aaron Lu, Adrien Schildknecht). - Improve the ACPI handling of interrupt GPIOs (Christophe Ricard). - Fix the handling of the list of device IDs of device objects found in the ACPI namespace and add a helper for checking if there is a device object for a given device ID (Lukas Wunner). - Change the logic in the ACPI namespace scanning code to create struct acpi_device objects for all ACPI device objects found in the namespace even if _STA fails for them which helps to avoid device enumeration problems on Microsoft Surface 3 (Aaron Lu). - Add support for the APM X-Gene ACPI I2C device to the ACPI driver for AMD SoCs (Loc Ho). - Fix the long-standing issue with the DMA controller on Intel SoCs where ACPI tables have no power management support for the DMA controller itself, but it can be powered off automatically when the last (other) device on the SoC is powered off via ACPI and clean up the ACPI driver for Intel SoCs (acpi-lpss) after previous attempts to fix that problem (Andy Shevchenko). - Assorted ACPI fixes and cleanups (Andy Lutomirski, Colin Ian King, Javier Martinez Canillas, Ken Xue, Mathias Krause, Rafael Wysocki, Sinan Kaya). - Update the device properties framework for better handling of built-in properties, add support for built-in properties to the platform bus type, update the MFD subsystem's handling of device properties and add support for passing default configuration data as device properties to the intel-lpss MFD drivers, convert the designware I2C driver to use the unified device properties API and add a fallback mechanism for using default built-in properties if the platform firmware fails to provide the properties as expected by drivers (Andy Shevchenko, Mika Westerberg, Heikki Krogerus, Andrew Morton). - Add new Device Tree bindings to the Operating Performance Points (OPP) framework and update the exynos4412 DT binding accordingly, introduce debugfs support for the OPP framework (Viresh Kumar, Bartlomiej Zolnierkiewicz). - Migrate the mt8173 cpufreq driver to the new OPP bindings (Pi-Cheng Chen). - Update the cpufreq core to make the handling of governors more efficient, especially on systems where policy objects are shared between multiple CPUs (Viresh Kumar, Rafael Wysocki). - Fix cpufreq governor handling on configurations with CONFIG_HZ_PERIODIC set (Chen Yu). - Clean up the cpufreq core code related to the boost sysfs knob support and update the ACPI cpufreq driver accordingly (Rafael Wysocki). - Add a new cpufreq driver for ST platforms and corresponding Device Tree bindings (Lee Jones). - Update the intel_pstate driver to allow the P-state selection algorithm used by it to depend on the CPU ID of the processor it is running on, make it use a special P-state selection algorithm (with an IO wait time compensation tweak) on Atom CPUs based on the Airmont and Silvermont cores so as to reduce their energy consumption and improve intel_pstate documentation (Philippe Longepe, Srinivas Pandruvada). - Update the cpufreq-dt driver to support registering cooling devices that use the (P * V^2 * f) dynamic power draw formula where V is the voltage, f is the frequency and P is a constant coefficient provided by Device Tree and update the arm_big_little cpufreq driver to use that support (Punit Agrawal). - Assorted cpufreq driver (cpufreq-dt, qoriq, pcc-cpufreq, blackfin-cpufreq) updates (Andrzej Hajda, Hongtao Jia, Jacob Tanenbaum, Markus Elfring). - cpuidle core tweaks related to polling and measured_us calculation (Rik van Riel). - Removal of modularity from a few cpuidle drivers (clps711x, ux500, exynos) that cannot be built as modules in practice (Paul Gortmaker). - PM core update to prevent devices from being probed during system suspend/resume which is generally problematic and may lead to inconsistent behavior (Grygorii Strashko). - Assorted updates of the PM core and related code (Julia Lawall, Manuel Pégourié-Gonnard, Maruthi Bayyavarapu, Rafael Wysocki, Ulf Hansson). - PNP bus type updates (Christophe Le Roy, Heiner Kallweit). - PCI PM code cleanups (Jarkko Nikula, Julia Lawall). - cpupower tool updates (Jacob Tanenbaum, Thomas Renninger)" * tag 'pm+acpi-4.5-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (177 commits) PM / clk: don't leave clocks enabled when driver not bound i2c: dw: Add APM X-Gene ACPI I2C device support ACPI / APD: Add APM X-Gene ACPI I2C device support ACPI / LPSS: change 'does not have' to 'has' in comment Revert "dmaengine: dw: platform: provide platform data for Intel" dmaengine: dw: return immediately from IRQ when DMA isn't in use dmaengine: dw: platform: power on device on shutdown ACPI / LPSS: override power state for LPSS DMA device PM / OPP: Use snprintf() instead of sprintf() Documentation: cpufreq: intel_pstate: enhance documentation ACPI, PCI, irq: remove redundant check for null string pointer ACPI / video: driver must be registered before checking for keypresses cpufreq-dt: fix handling regulator_get_voltage() result cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC PM / sleep: Add support for read-only sysfs attributes ACPI: Fix white space in a structure definition ACPI / SBS: fix inconsistent indenting inside if statement PNP: respect PNP_DRIVER_RES_DO_NOT_CHANGE when detaching ACPI / PNP: constify device IDs ACPI / PCI: Simplify acpi_penalize_isa_irq() ...
| | \
| | \
| *-. \ Merge branches 'pm-sleep' and 'pm-tools'Rafael J. Wysocki2016-01-122-15/+11
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * pm-sleep: PM / sleep: Add support for read-only sysfs attributes * pm-tools: cpupower: fix how "cpupower frequency-info" interprets latency cpupower: rework the "cpupower frequency-info" command cpupower: Do not analyse offlined cpus cpupower: Provide STATIC variable in Makefile for debug builds cpupower: Fix precedence issue
| | * | | PM / sleep: Add support for read-only sysfs attributesRafael J. Wysocki2016-01-042-15/+11
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some sysfs attributes in /sys/power/ should really be read-only, so add support for that, convert those attributes to read-only and drop the stub .show() routines from them. Original-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | | | Merge tag 'trace-v4.5' of ↵Linus Torvalds2016-01-126-228/+300
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "Not much new with tracing for this release. Mostly just clean ups and minor fixes. Here's what else is new: - A new TRACE_EVENT_FN_COND macro, combining both _FN and _COND for those that want both. - New selftest to test the instance create and delete - Better debug output when ftrace fails" * tag 'trace-v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (24 commits) ftrace: Fix the race between ftrace and insmod ftrace: Add infrastructure for delayed enabling of module functions x86: ftrace: Fix the comments for ftrace_modify_code_direct() tracing: Fix comment to use tracing_on over tracing_enable metag: ftrace: Fix the comments for ftrace_modify_code sh: ftrace: Fix the comments for ftrace_modify_code() ia64: ftrace: Fix the comments for ftrace_modify_code() ftrace: Clean up ftrace_module_init() code ftrace: Join functions ftrace_module_init() and ftrace_init_module() tracing: Introduce TRACE_EVENT_FN_COND macro tracing: Use seq_buf_used() in seq_buf_to_user() instead of len bpf: Constify bpf_verifier_ops structure ftrace: Have ftrace_ops_get_func() handle RCU and PER_CPU flags too ftrace: Remove use of control list and ops ftrace: Fix output of enabled_functions for showing tramp ftrace: Fix a typo in comment ftrace: Show all tramps registered to a record on ftrace_bug() ftrace: Add variable ftrace_expected for archs to show expected code ftrace: Add new type to distinguish what kind of ftrace_bug() tracing: Update cond flag when enabling or disabling a trigger ...
| * | | | ftrace: Fix the race between ftrace and insmodQiu Peiyang2016-01-071-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We hit ftrace_bug report when booting Android on a 64bit ATOM SOC chip. Basically, there is a race between insmod and ftrace_run_update_code. After load_module=>ftrace_module_init, another thread jumps in to call ftrace_run_update_code=>ftrace_arch_code_modify_prepare =>set_all_modules_text_rw, to change all modules as RW. Since the new module is at MODULE_STATE_UNFORMED, the text attribute is not changed. Then, the 2nd thread goes ahead to change codes. However, load_module continues to call complete_formation=>set_section_ro_nx, then 2nd thread would fail when probing the module's TEXT. The patch fixes it by using notifier to delay the enabling of ftrace records to the time when module is at state MODULE_STATE_COMING. Link: http://lkml.kernel.org/r/567CE628.3000609@intel.com Signed-off-by: Qiu Peiyang <peiyangx.qiu@intel.com> Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ftrace: Add infrastructure for delayed enabling of module functionsSteven Rostedt (Red Hat)2016-01-071-55/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Qiu Peiyang pointed out that there's a race when enabling function tracing and loading a module. In order to make the modifications of converting nops in the prologue of functions into callbacks, the text needs to be converted from read-only to read-write. When enabling function tracing, the text permission is updated, the functions are modified, and then they are put back. When loading a module, the updates to convert function calls to mcount is done before the module text is set to read-only. But after it is done, the module text is visible by the function tracer. Thus we have the following race: CPU 0 CPU 1 ----- ----- start function tracing set text to read-write load_module add functions to ftrace set module text read-only update all functions to callbacks modify module functions too < Can't it's read-only > When this happens, ftrace detects the issue and disables itself till the next reboot. To fix this, a new DISABLED flag is added for ftrace records, which all module functions get when they are added. Then later, after the module code is all set, the records will have the DISABLED flag cleared, and they will be enabled if any callback wants all functions to be traced. Note, this doesn't add the delay to later. It simply changes the ftrace_module_init() to do both the setting of DISABLED records, and then immediately calls the enable code. This helps with testing this new code as it has the same behavior as previously. Another change will come after this to have the ftrace_module_enable() called after the text is set to read-only. Cc: Qiu Peiyang <peiyangx.qiu@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | tracing: Fix comment to use tracing_on over tracing_enableChuyu Hu2015-12-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The file tracing_enable is obsolete and does not exist anymore. Replace the comment that references it with the proper tracing_on file. Link: http://lkml.kernel.org/r/1450787141-45544-1-git-send-email-chuhu@redhat.com Signed-off-by: Chuyu Hu <chuhu@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ftrace: Clean up ftrace_module_init() codeSteven Rostedt (Red Hat)2015-12-231-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The start and end variables were only used when ftrace_module_init() was split up into multiple functions. No need to keep them around after the merger. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ftrace: Join functions ftrace_module_init() and ftrace_init_module()Abel Vesa2015-12-231-9/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simple cleanup. No need for two functions here. The whole work can simply be done inside 'ftrace_module_init'. Link: http://lkml.kernel.org/r/1449067197-5718-1-git-send-email-abelvesa@linux.com Signed-off-by: Abel Vesa <abelvesa@linux.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | bpf: Constify bpf_verifier_ops structureJulia Lawall2015-12-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This bpf_verifier_ops structure is never modified, like the other bpf_verifier_ops structures, so declare it as const. Done with the help of Coccinelle. Link: http://lkml.kernel.org/r/1449855359-13724-1-git-send-email-Julia.Lawall@lip6.fr Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ftrace: Have ftrace_ops_get_func() handle RCU and PER_CPU flags tooSteven Rostedt (Red Hat)2015-12-231-12/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jiri Olsa noted that the change to replace the control_ops did not update the trampoline for when running perf on a single CPU and with CONFIG_PREEMPT disabled (where dynamic ops, like perf, can use trampolines directly). The result was that perf function could be called when RCU is not watching as well as not handle the ftrace_local_disable(). Modify the ftrace_ops_get_func() to also check the RCU and PER_CPU ops flags and use the recursive function if they are set. The recursive function is modified to check those flags and execute the appropriate checks if they are set. Link: http://lkml.kernel.org/r/20151201134213.GA14155@krava.brq.redhat.com Reported-by: Jiri Olsa <jolsa@redhat.com> Patch-fixed-up-by: Jiri Olsa <jolsa@redhat.com> Tested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ftrace: Remove use of control list and opsSteven Rostedt (Red Hat)2015-12-233-91/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently perf has its own list function within the ftrace infrastructure that seems to be used only to allow for it to have per-cpu disabling as well as a check to make sure that it's not called while RCU is not watching. It uses something called the "control_ops" which is used to iterate over ops under it with the control_list_func(). The problem is that this control_ops and control_list_func unnecessarily complicates the code. By replacing FTRACE_OPS_FL_CONTROL with two new flags (FTRACE_OPS_FL_RCU and FTRACE_OPS_FL_PER_CPU) we can remove all the code that is special with the control ops and add the needed checks within the generic ftrace_list_func(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ftrace: Fix output of enabled_functions for showing trampSteven Rostedt (Red Hat)2015-12-231-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When showing all tramps registered to a ftrace record in the file enabled_functions, it exits the loop with ops == NULL. But then it is suppose to show the function on the ops->trampoline and add_trampoline_func() is called with the given ops. But because ops is now NULL (to exit the loop), it always shows the static trampoline instead of the one that is really registered to the record. The call to add_trampoline_func() that shows the trampoline for the given ops needs to be called at every iteration. Fixes: 39daa7b9e895 "ftrace: Show all tramps registered to a record on ftrace_bug()" Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ftrace: Fix a typo in commentLi Bin2015-12-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | s/ARCH_SUPPORT_FTARCE_OPS/ARCH_SUPPORTS_FTRACE_OPS/ Link: http://lkml.kernel.org/r/1448879016-8659-1-git-send-email-huawei.libin@huawei.com Signed-off-by: Li Bin <huawei.libin@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ftrace: Show all tramps registered to a record on ftrace_bug()Steven Rostedt (Red Hat)2015-11-251-9/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an anomaly is detected in the function call modification code, ftrace_bug() is called to disable function tracing as well as give any information that may help debug the problem. Currently, only the first found trampoline that is attached to the failed record is reported. Instead, show all trampolines that are hooked to it. Also, not only show the ops pointer but also report the function it calls. While at it, add this info to the enabled_functions debug file too. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ftrace: Add variable ftrace_expected for archs to show expected codeSteven Rostedt (Red Hat)2015-11-251-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an anomaly is found while modifying function code, ftrace_bug() is called which disables the function tracing infrastructure and reports information about what failed. If the code that is to be replaced does not match what is expected, then actual code is shown. Currently there is no arch generic way to show what was expected. Add a new variable pointer calld ftrace_expected that the arch code can set to point to what it expected so that ftrace_bug() can report the actual text as well as the text that was expected to be there. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ftrace: Add new type to distinguish what kind of ftrace_bug()Steven Rostedt (Red Hat)2015-11-251-1/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ftrace function hook utility has several internal checks to make sure that whatever it modifies is exactly what it expects to be modifying. This is essential as modifying running code can be extremely dangerous to the system. When an anomaly is detected, ftrace_bug() is called which sends a splat to the console and disables function tracing. There's some extra information that is printed to help diagnose the issue. One thing that is missing though is output of what ftrace was doing at the time of the crash. Was it updating a call site or perhaps converting a call site to a nop? A new global enum variable is created to state what ftrace was doing at the time of the anomaly, and this is reported in ftrace_bug(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | tracing: Update cond flag when enabling or disabling a triggerTom Zanussi2015-11-251-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a trigger is enabled, the cond flag should be set beforehand, otherwise a trigger that's expecting to process a trace record (e.g. one with post_trigger set) could be invoked without one. Likewise a trigger's cond flag should be reset after it's disabled, not before. Link: http://lkml.kernel.org/r/a420b52a67b1c2d3cab017914362d153255acb99.1448303214.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ring-buffer: Process commits whenever moving to a new page.Steven Rostedt (Red Hat)2015-11-251-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When crossing over to a new page, commit the current work. This will allow readers to get data with less latency, and also simplifies the work to get timestamps working for interrupted events. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ring-buffer: Remove redundant update of page timestampSteven Rostedt (Red Hat)2015-11-241-24/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The first commit of a buffer page updates the timestamp of that page. No need to have the update to the next page add the timestamp too. It will only be replaced by the first commit on that page anyway. Only update to a page if it contains an event. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | | | ring-buffer: Use READ_ONCE() for most tail_page accessSteven Rostedt (Red Hat)2015-11-241-7/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As cpu_buffer->tail_page may be modified by interrupts at almost any time, the flow of logic is very important. Do not let gcc get smart with re-reading cpu_buffer->tail_page by adding READ_ONCE() around most of its accesses. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | | | | Merge branch 'for-4.5' of ↵Linus Torvalds2016-01-126-62/+48
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cgroup v2 interface is now official. It's no longer hidden behind a devel flag and can be mounted using the new cgroup2 fs type. Unfortunately, cpu v2 interface hasn't made it yet due to the discussion around in-process hierarchical resource distribution and only memory and io controllers can be used on the v2 interface at the moment. - The existing documentation which has always been a bit of mess is relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt is added as the authoritative documentation for the v2 interface. - Some features are added through for-4.5-ancestor-test branch to enable netfilter xt_cgroup match to use cgroup v2 paths. The actual netfilter changes will be merged through the net tree which pulled in the said branch. - Various cleanups * 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rename cgroup documentations cgroup: fix a typo. cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX. cgroup: demote subsystem init messages to KERN_DEBUG cgroup: Fix uninitialized variable warning cgroup: put controller Kconfig options in meaningful order cgroup: clean up the kernel configuration menu nomenclature cgroup_pids: fix a typo. Subject: cgroup: Fix incomplete dd command in blkio documentation cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends cpuset: Replace all instances of time_t with time64_t cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/ cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
| * | | | | cgroup: fix a typo.Rami Rosen2016-01-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a typo in a comment in cgroup.c. Signed-off-by: Rami Rosen <rami.rosen@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | | cgroup: demote subsystem init messages to KERN_DEBUGTejun Heo2016-01-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These are noisy during boot and not all that interesting. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | | cgroup_pids: fix a typo.Rami Rosen2015-12-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a typo in pids_charge() method. Signed-off-by: Rami Rosen <rami.rosen@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | | Merge branch 'for-4.5-ancestor-test' of ↵Tejun Heo2015-12-071-27/+44
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-4.5 Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | | | cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friendsOleg Nesterov2015-12-035-31/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | | | Merge branch 'for-4.4-fixes' into for-4.5Tejun Heo2015-12-037-110/+107
| |\ \ \ \ \ \
| * | | | | | | cpuset: Replace all instances of time_t with time64_tArnd Bergmann2015-11-251-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following patch replaces all instances of time_t with time64_t i.e. change the type used for representing time from 32-bit to 64-bit. All 32-bit kernels to date use a signed 32-bit time_t type, which can only represent time until January 2038. Since embedded systems running 32-bit Linux are going to survive beyond that date, we have to change all current uses, in a backwards compatible way. The patch also changes the function get_seconds() that returns a 32-bit integer to ktime_get_seconds() that returns seconds as 64-bit integer. The patch changes the type of ticks from time_t to u32. We keep ticks as 32-bits as the function uses 32-bit arithmetic which would prove less expensive than 64-bit arithmetic and the function is expected to be called atleast once every 32 seconds. Signed-off-by: Heena Sirwani <heenasirwani@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | | | | cgroup: replace __DEVEL__sane_behavior with cgroup2 fs typeTejun Heo2015-11-161-24/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With major controllers - cpu, memory and io - shaping up for the unified hierarchy, cgroup2 is about ready to be, gradually, released into the wild. Replace __DEVEL__sane_behavior flag which was used to select the unified hierarchy with a separate filesystem type "cgroup2" so that unified hierarchy can be mounted as follows. mount -t cgroup2 none $MOUNT_POINT The cgroup2 fs has its own magic number - 0x63677270 ("cgrp"). v2: Assign a different magic number to cgroup2 fs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org>
* | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2016-01-126-59/+187
|\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from Davic Miller: 1) Support busy polling generically, for all NAPI drivers. From Eric Dumazet. 2) Add byte/packet counter support to nft_ct, from Floriani Westphal. 3) Add RSS/XPS support to mvneta driver, from Gregory Clement. 4) Implement IPV6_HDRINCL socket option for raw sockets, from Hannes Frederic Sowa. 5) Add support for T6 adapter to cxgb4 driver, from Hariprasad Shenai. 6) Add support for VLAN device bridging to mlxsw switch driver, from Ido Schimmel. 7) Add driver for Netronome NFP4000/NFP6000, from Jakub Kicinski. 8) Provide hwmon interface to mlxsw switch driver, from Jiri Pirko. 9) Reorganize wireless drivers into per-vendor directories just like we do for ethernet drivers. From Kalle Valo. 10) Provide a way for administrators "destroy" connected sockets via the SOCK_DESTROY socket netlink diag operation. From Lorenzo Colitti. 11) Add support to add/remove multicast routes via netlink, from Nikolay Aleksandrov. 12) Make TCP keepalive settings per-namespace, from Nikolay Borisov. 13) Add forwarding and packet duplication facilities to nf_tables, from Pablo Neira Ayuso. 14) Dead route support in MPLS, from Roopa Prabhu. 15) TSO support for thunderx chips, from Sunil Goutham. 16) Add driver for IBM's System i/p VNIC protocol, from Thomas Falcon. 17) Rationalize, consolidate, and more completely document the checksum offloading facilities in the networking stack. From Tom Herbert. 18) Support aborting an ongoing scan in mac80211/cfg80211, from Vidyullatha Kanchanapally. 19) Use per-bucket spinlock for bpf hash facility, from Tom Leiming. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1375 commits) net: bnxt: always return values from _bnxt_get_max_rings net: bpf: reject invalid shifts phonet: properly unshare skbs in phonet_rcv() dwc_eth_qos: Fix dma address for multi-fragment skbs phy: remove an unneeded condition mdio: remove an unneed condition mdio_bus: NULL dereference on allocation error net: Fix typo in netdev_intersect_features net: freescale: mac-fec: Fix build error from phy_device API change net: freescale: ucc_geth: Fix build error from phy_device API change bonding: Prevent IPv6 link local address on enslaved devices IB/mlx5: Add flow steering support net/mlx5_core: Export flow steering API net/mlx5_core: Make ipv4/ipv6 location more clear net/mlx5_core: Enable flow steering support for the IB driver net/mlx5_core: Initialize namespaces only when supported by device net/mlx5_core: Set priority attributes net/mlx5_core: Connect flow tables net/mlx5_core: Introduce modify flow table command net/mlx5_core: Managing root flow table ...
| * | | | | | | | net: bpf: reject invalid shiftsRabin Vincent2016-01-121-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a constant shift that can't be encoded in the immediate field of the UBFM/SBFM instructions is passed to the JIT. Since these shifts amounts, which are negative or >= regsize, are invalid, reject them in the eBPF verifier and the classic BPF filter checker, for all architectures. Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-01-061-0/+1
| |\ \ \ \ \ \ \ \
| * | | | | | | | | bpf: hash: use per-bucket spinlocktom.leiming@gmail.com2015-12-291-18/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both htab_map_update_elem() and htab_map_delete_elem() can be called from eBPF program, and they may be in kernel hot path, so it isn't efficient to use a per-hashtable lock in this two helpers. The per-hashtable spinlock is used for protecting bucket's hlist, and per-bucket lock is just enough. This patch converts the per-hashtable lock into per-bucket spinlock, so that contention can be decreased a lot. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | bpf: hash: move select_bucket() out of htab's spinlocktom.leiming@gmail.com2015-12-291-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The spinlock is just used for protecting the per-bucket hlist, so it isn't needed for selecting bucket. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | bpf: hash: use atomic counttom.leiming@gmail.com2015-12-291-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Preparing for removing global per-hashtable lock, so the counter need to be defined as aotmic_t first. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | bpf: move clearing of A/X into classic to eBPF migration prologueDaniel Borkmann2015-12-181-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Back in the days where eBPF (or back then "internal BPF" ;->) was not exposed to user space, and only the classic BPF programs internally translated into eBPF programs, we missed the fact that for classic BPF A and X needed to be cleared. It was fixed back then via 83d5b7ef99c9 ("net: filter: initialize A and X registers"), and thus classic BPF specifics were added to the eBPF interpreter core to work around it. This added some confusion for JIT developers later on that take the eBPF interpreter code as an example for deriving their JIT. F.e. in f75298f5c3fe ("s390/bpf: clear correct BPF accumulator register"), at least X could leak stack memory. Furthermore, since this is only needed for classic BPF translations and not for eBPF (verifier takes care that read access to regs cannot be done uninitialized), more complexity is added to JITs as they need to determine whether they deal with migrations or native eBPF where they can just omit clearing A/X in their prologue and thus reduce image size a bit, see f.e. cde66c2d88da ("s390/bpf: Only clear A and X for converted BPF programs"). In other cases (x86, arm64), A and X is being cleared in the prologue also for eBPF case, which is unnecessary. Lets move this into the BPF migration in bpf_convert_filter() where it actually belongs as long as the number of eBPF JITs are still few. It can thus be done generically; allowing us to remove the quirk from __bpf_prog_run() and to slightly reduce JIT image size in case of eBPF, while reducing code duplication on this matter in current(/future) eBPF JITs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Zi Shen Lim <zlim.lnx@gmail.com> Cc: Yang Shi <yang.shi@linaro.org> Acked-by: Yang Shi <yang.shi@linaro.org> Acked-by: Zi Shen Lim <zlim.lnx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-12-1723-177/+270
| |\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/geneve.c Here we had an overlapping change, where in 'net' the extraneous stats bump was being removed whilst in 'net-next' the final argument to udp_tunnel6_xmit_skb() was being changed. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | | net, cgroup: cgroup_sk_updat_lock was missing initializerTejun Heo2015-12-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") added global spinlock cgroup_sk_update_lock but erroneously skipped initializer leading to uninitialized spinlock warning. Fix it by using DEFINE_SPINLOCK(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dexuan Cui <decui@microsoft.com> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | | bpf, inode: allow for rename and link opsDaniel Borkmann2015-12-121-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for renaming and hard links to the fs. Most of this can be implemented by using simple library operations under the same constraints that we don't use a reserved name like elsewhere. Linking can be useful to share/manage things like maps across subsystem users. It works within the file system boundary, but is not allowed for directories. Symbolic links are explicitly not implemented here, as it can be better done already by doing bind mounts inside bpf fs to set up shared directories f.e. useful when using volumes in docker containers that map a private working directory into /sys/fs/bpf/ which contains itself a bind mounted path from the host's /sys/fs/bpf/ mount that is shared among multiple containers. For single maps instead of whole directory, hard links can be easily used to do the same. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | | sock, cgroup: add sock->sk_cgroupTejun Heo2015-12-081-1/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In cgroup v1, dealing with cgroup membership was difficult because the number of membership associations was unbound. As a result, cgroup v1 grew several controllers whose primary purpose is either tagging membership or pull in configuration knobs from other subsystems so that cgroup membership test can be avoided. net_cls and net_prio controllers are examples of the latter. They allow configuring network-specific attributes from cgroup side so that network subsystem can avoid testing cgroup membership; unfortunately, these are not only cumbersome but also problematic. Both net_cls and net_prio aren't properly hierarchical. Both inherit configuration from the parent on creation but there's no interaction afterwards. An ancestor doesn't restrict the behavior in its subtree in anyway and configuration changes aren't propagated downwards. Especially when combined with cgroup delegation, this is problematic because delegatees can mess up whatever network configuration implemented at the system level. net_prio would allow the delegatees to set whatever priority value regardless of CAP_NET_ADMIN and net_cls the same for classid. While it is possible to solve these issues from controller side by implementing hierarchical allowable ranges in both controllers, it would involve quite a bit of complexity in the controllers and further obfuscate network configuration as it becomes even more difficult to tell what's actually being configured looking from the network side. While not much can be done for v1 at this point, as membership handling is sane on cgroup v2, it'd be better to make cgroup matching behave like other network matches and classifiers than introducing further complications. In preparation, this patch updates sock->sk_cgrp_data handling so that it points to the v2 cgroup that sock was created in until either net_prio or net_cls is used. Once either of the two is used, sock->sk_cgrp_data reverts to its previous role of carrying prioidx and classid. This is to avoid adding yet another cgroup related field to struct sock. As the mode switching can happen at most once per boot, the switching mechanism is aimed at lowering hot path overhead. It may leak a finite, likely small, number of cgroup refs and report spurious prioidx or classid on switching; however, dynamic updates of prioidx and classid have always been racy and lossy - socks between creation and fd installation are never updated, config changes don't update existing sockets at all, and prioidx may index with dead and recycled cgroup IDs. Non-critical inaccuracies from small race windows won't make any noticeable difference. This patch doesn't make use of the pointer yet. The following patch will implement netfilter match for cgroup2 membership. v2: Use sock_cgroup_data to avoid inflating struct sock w/ another cgroup specific field. v3: Add comments explaining why sock_data_prioidx() and sock_data_classid() use different fallback values. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Daniel Wagner <daniel.wagner@bmw-carit.de> CC: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | | Merge branch 'for-4.5-ancestor-test' of ↵David S. Miller2015-12-081-27/+44
| |\ \ \ \ \ \ \ \ \ \ | | | |_|_|_|/ / / / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Preparatory changes for some new socket cgroup infrastructure and netfilter targets. Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | | | | | cgroup: implement cgroup_get_from_path() and expose cgroup_put()Tejun Heo2015-11-201-5/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement cgroup_get_from_path() using kernfs_walk_and_get() which obtains a default hierarchy cgroup from its path. This will be used to allow cgroup path based matching from outside cgroup proper - e.g. networking and perf. v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path). Signed-off-by: Tejun Heo <tj@kernel.org>
| | * | | | | | | | | cgroup: record ancestor IDs and reimplement cgroup_is_descendant() using itTejun Heo2015-11-201-22/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_is_descendant() currently walks up the hierarchy and compares each ancestor to the cgroup in question. While enough for cgroup core usages, this can't be used in hot paths to test cgroup membership. This patch adds cgroup->ancestor_ids[] which records the IDs of all ancestors including self and cgroup->level for the nesting level. This allows testing whether a given cgroup is a descendant of another in three finite steps - testing whether the two belong to the same hierarchy, whether the descendant candidate is at the same or a higher level than the ancestor and comparing the recorded ancestor_id at the matching level. cgroup_is_descendant() is accordingly reimplmented and made inline. Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-12-0311-46/+107
| |\ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/renesas/ravb_main.c kernel/bpf/syscall.c net/ipv4/ipmr.c All three conflicts were cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud