summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* ARM: 7461/1: topology: Add arch_scale_freq_power functionVincent Guittot2012-07-121-1/+37
| | | | | | | | | Add infrastructure to be able to modify the cpu_power of each core Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7450/1: dcache: select DCACHE_WORD_ACCESS for little-endian ARMv6+ CPUsWill Deacon2012-07-092-0/+42
| | | | | | | | | | | | | DCACHE_WORD_ACCESS uses the word-at-a-time API for optimised string comparisons in the vfs layer. This patch implements support for load_unaligned_zeropad for ARM CPUs with native support for unaligned memory accesses (v6+) when running little-endian. Reviewed-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7449/1: use generic strnlen_user and strncpy_from_user functionsWill Deacon2012-07-097-109/+63
| | | | | | | | | | | | | | | | This patch implements the word-at-a-time interface for ARM using the same algorithm as x86. We use the fls macro from ARMv5 onwards, where we have a clz instruction available which saves us a mov instruction when targetting Thumb-2. For older CPUs, we use the magic 0x0ff0001 constant. Big-endian configurations make use of the implementation from asm-generic. With this implemented, we can replace our byte-at-a-time strnlen_user and strncpy_from_user functions with the optimised generic versions. Reviewed-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7448/1: perf: remove arm_perf_pmu_ids global enumerationWill Deacon2012-07-097-53/+36
| | | | | | | | | | | | | | | In order to provide PMU name strings compatible with the OProfile user ABI, an enumeration of all PMUs is currently used by perf to identify each PMU uniquely. Unfortunately, this does not scale well in the presence of multiple PMUs and creates a single, global namespace across all PMUs in the system. This patch removes the enumeration and instead uses the name string for the PMU to map onto the OProfile variant. perf_pmu_name is implemented for CPU PMUs, which is all that OProfile cares about anyway. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7447/1: rwlocks: remove unused branch labels from trylock routinesWill Deacon2012-07-091-2/+2
| | | | | | | | | | | | The ARM arch_{read,write}_trylock implementations include unused backwards branch labels, since we don't retry the locking operation if the exclusive store fails. This patch removes the labels. Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7446/1: spinlock: use ticket algorithm for ARMv6+ locking implementationWill Deacon2012-07-092-26/+63
| | | | | | | | | | | | | | | | | | Ticket spinlocks ensure locking fairness by introducing a FIFO-like nature to the granting of lock acquisitions and also reducing the thundering herd effect when spinning on a lock by allowing the cacheline to remain in a shared state amongst the waiting CPUs. This is especially important on systems where memory-access times are not necessarily uniform when accessing the lock structure (for example, on a multi-cluster platform where the lock is allocated into L1 when a CPU releases it). This patch implements the ticket spinlock algorithm for ARM, replacing the simpler implementation for ARMv6+ processors. Reviewed-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7445/1: mm: update CONTEXTIDR register to contain PID of current processWill Deacon2012-07-094-0/+55
| | | | | | | | | | | | | | | | | This patch introduces a new Kconfig option which, when enabled, causes the kernel to write the PID of the current task into the PROCID field of the CONTEXTIDR on context switch. This is useful when analysing hardware trace, since writes to this register can be configured to emit an event into the trace stream. The thread notifier for writing the PID is deliberately kept separate from the ASID-writing code so that we can support newer processors using LPAE, where the ASID is stored in TTBR0. As such, the switch_mm code is updated to perform a read-modify-write sequence to ensure that we don't clobber the PID on CPUs using the classic 2-level page tables. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7444/1: kernel: add arch-timer C3STOP featureLorenzo Pieralisi2012-07-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | When a CPU is shutdown its architected timer comparators registers are lost. Within CPU idle, before processors enter shutdown they enter clock events broadcast mode through the clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, cpuid); function where the local timers are emulated by a global always-on timer. On CPU resume, the per-CPU tick device normal mode is restored by exiting broadcast mode through clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, cpuid); In order for this mechanism to function, architected timers should add to their feature C3STOP, which means that they are not able to function when the CPU is in off-mode. Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7460/1: remove asm/locks.hPaul Bolle2012-07-091-274/+0
| | | | | | | | | | Commit 64ac24e738823161693bf791f87adc802cf529ff ("Generic semaphore implementation") removed the last include of this header. Apparently it was just an oversight to keep this header. It can safely be removed now. Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7439/1: head.S: simplify initial page table mappingNicolas Pitre2012-07-091-36/+23
| | | | | | | | | | Let's map the initial RAM up to the end of the kernel .bss instead of the strict kernel image area. This simplifies the code as the kernel image only needs to be handled specially in the XIP case. That covers the legacy ATAG location as well. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7437/1: zImage: Allow DTB command line concatenation with ATAG_CMDLINEGenoud Richard2012-07-092-2/+79
| | | | | | | | | | | | | | | | | | | | | This patch allows the ATAG_CMDLINE provided by the bootloader to be concatenated to the bootargs property of the device tree. This is useful to merge static values defined in the device tree with the boot loader's (possibly) more dynamic values, such as startup reasons and more. The bootloader should use the device tree to pass those values to the kernel, but that's not always simple (old bootloader or very small one). The behaviour is the same as the one introduced by Victor Boivie in 4394c1244249198c6b85093d46935b761b36ae05 by extending the CONFIG_CMDLINE. Signed-off-by: Richard Genoud <richard.genoud@gmail.com> Tested-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7436/1: Do not map the vectors page as write-through on UP systemsCatalin Marinas2012-07-091-6/+0
| | | | | | | | | | | | | | The vectors page has been traditionally mapped as WT on UP systems but this creates a mismatched alias with the directly mapped RAM that is using WB attributes. On newer processors like Cortex-A15 this has implications on the data/instructions coherency at the point of unification (usually L2). This patch removes such restriction. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7424/1: update die handler from x86Rabin Vincent2012-07-091-23/+55
| | | | | | | | | | | | | | | | | | | Robustify ARM's die() handling with improvements from x86: - Fix for a deadlock (before panic in the case of panic_on_oops) if we oops under a spinlock which is also used from interrupt handler, since the old code was unconditionally enabling interrupts. - Usage of arch spinlock so lockdep etc doesn't get involved while we're trying to dump out oopses. - Deadlock prevention in the unlikely event that die() recurses. The changes all touch the same few lines of code, so they're done together in one patch. Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: MSM: use SGI0 to wake secondary CPUsRussell King2012-07-091-1/+1
| | | | Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: OMAP: use SGI0 to wake secondary CPUsRussell King2012-07-091-1/+1
| | | | Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: Realview: use SGI0 to wake secondary CPUsRussell King2012-07-091-1/+1
| | | | Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7459/1: irda/pxa: use readl_relaxed() to access OSCR registerArnd Bergmann2012-07-091-6/+6
| | | | | | | | | | | | | | | After c00184f9ab4 "ARM: sa11x0/pxa: convert OS timer registers to IOMEM", magician_defconfig and a few others fail to build because the OSCR register is accessed by the drivers/net/irda/pxaficp_ir.c but has turned into a pointer that needs to be read using readl. There are other registers in the same driver that eventually should be converted, and it's unclear whether we would want a better interface to access the OSCR from a device driver. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: sa11x0/pxa: convert OS timer registers to IOMEMRussell King2012-07-0919-82/+97
| | | | | | | | | | | | | | | | | | | | | | Make the OS timer registers have IOMEM like properities so they can be passed to readl_relaxed/writel_relaxed() et.al. rather than being straight volatile dereferences. Add linux/io.h includes where required. linux/io.h includes added to arch/arm/mach-sa1100/cpu-sa1100.c, arch/arm/mach-sa1100/jornada720_ssp.c, arch/arm/mach-sa1100/leds-lart.c drivers/input/touchscreen/jornada720_ts.c, drivers/pcmcia/sa1100_shannon.c from Arnd. This fixes these warnings: arch/arm/mach-sa1100/time.c: In function 'sa1100_timer_init': arch/arm/mach-sa1100/time.c:104: warning: passing argument 1 of 'clocksource_mmio_init' discards qualifiers from pointer target type arch/arm/mach-pxa/time.c: In function 'pxa_timer_init': arch/arm/mach-pxa/time.c:126: warning: passing argument 1 of 'clocksource_mmio_init' discards qualifiers from pointer target type Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* ARM: 7420/1: Improve build environment isolationVincent Sanders2012-06-031-0/+3
| | | | | | | | | | | | | | | Increasingly distributions are setting default build environments to have LDFLAGS with hardening options. There seems to be an assumption with those options that LDFLAGS are passed to the compiler frontend rather than used directly with ld (which the kernel build process assumes) To prevent build failures in such environments this patch changes the ARM architecture Makefile to override the LDFLAGS from the environment similar to the behaviour on other common architectures e.g. x86 Signed-off-by: Vincent Sanders <vince@collabora.co.uk> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* Linux 3.5-rc1v3.5-rc1Linus Torvalds2012-06-021-2/+2
|
* Merge tag 'dm-3.5-changes-1' of ↵Linus Torvalds2012-06-026-90/+322
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm Pull device-mapper updates from Alasdair G Kergon: "Improve multipath's retrying mechanism in some defined circumstances and provide a simple reserve/release mechanism for userspace tools to access thin provisioning metadata while the pool is in use." * tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: dm thin: provide userspace access to pool metadata dm thin: use slab mempools dm mpath: allow ioctls to trigger pg init dm mpath: delay retry of bypassed pg dm mpath: reduce size of struct multipath
| * dm thin: provide userspace access to pool metadataJoe Thornber2012-06-035-11/+193
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements two new messages that can be sent to the thin pool target allowing it to take a snapshot of the _metadata_. This, read-only snapshot can be accessed by userland, concurrently with the live target. Only one metadata snapshot can be held at a time. The pool's status line will give the block location for the current msnap. Since version 0.1.5 of the userland thin provisioning tools, the thin_dump program displays the msnap as follows: thin_dump -m <msnap root> <metadata dev> Available here: https://github.com/jthornber/thin-provisioning-tools Now that userland can access the metadata we can do various things that have traditionally been kernel side tasks: i) Incremental backups. By using metadata snapshots we can work out what blocks have changed over time. Combined with data snapshots we can ensure the data doesn't change while we back it up. A short proof of concept script can be found here: https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb ii) Migration of thin devices from one pool to another. iii) Merging snapshots back into an external origin. iv) Asyncronous replication. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm thin: use slab mempoolsMike Snitzer2012-06-031-62/+99
| | | | | | | | | | | | | | | | | | Use dedicated caches prefixed with a "dm_" name rather than relying on kmalloc mempools backed by generic slab caches so the memory usage of thin provisioning (and any leaks) can be accounted for independently. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm mpath: allow ioctls to trigger pg initMikulas Patocka2012-06-031-9/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the failure of a group of paths, any alternative paths that need initialising do not become available until further I/O is sent to the device. Until this has happened, ioctls return -EAGAIN. With this patch, new paths are made available in response to an ioctl too. The processing of the ioctl gets delayed until this has happened. Instead of returning an error, we submit a work item to kmultipathd (that will potentially activate the new path) and retry in ten milliseconds. Note that the patch doesn't retry an ioctl if the ioctl itself fails due to a path failure. Such retries should be handled intelligently by the code that generated the ioctl in the first place, noting that some SCSI commands should not be retried because they are not idempotent (XOR write commands). For commands that could be retried, there is a danger that if the device rejected the SCSI command, the path could be errorneously marked as failed, and the request would be retried on another path which might fail too. It can be determined if the failure happens on the device or on the SCSI controller, but there is no guarantee that all SCSI drivers set these flags correctly. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm mpath: delay retry of bypassed pgMike Christie2012-06-031-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If I/O needs retrying and only bypassed priority groups are available, set the pg_init_delay_retry flag to wait before retrying. If, for example, the reason for the bypass is that the controller is getting reset or there is a firmware upgrade happening, retrying right away would cause a flood of log messages and retries for what could be a few seconds or even several minutes. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm mpath: reduce size of struct multipathMike Snitzer2012-06-031-6/+7
| | | | | | | | | | | | | | | | | | | | | | Move multipath structure's 'lock' and 'queue_size' members to eliminate two 4-byte holes. Also use a bit within a single unsigned int for each existing flag (saves 8-bytes). This allows future flags to be added without each consuming an unsigned int. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2012-06-0221-84/+201
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: 1) Make syn floods consume significantly less resources by a) Not pre-COW'ing routing metrics for SYN/ACKs b) Mirroring the device queue mapping of the SYN for the SYN/ACK reply. Both from Eric Dumazet. 2) Fix calculation errors in Byte Queue Limiting, from Hiroaki SHIMODA. 3) Validate the length requested when building a paged SKB for a socket, so we don't overrun the page vector accidently. From Jason Wang. 4) When netlabel is disabled, we abort all IP option processing when we see a CIPSO option. This isn't the right thing to do, we should simply skip over it and continue processing the remaining options (if any). Fix from Paul Moore. 5) SRIOV fixes for the mellanox driver from Jack orgenstein and Marcel Apfelbaum. 6) 8139cp enables the receiver before the ring address is properly programmed, which potentially lets the device crap over random memory. Fix from Jason Wang. 7) e1000/e1000e fixes for i217 RST handling, and an improper buffer address reference in jumbo RX frame processing from Bruce Allan and Sebastian Andrzej Siewior, respectively. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: fec_mpc52xx: fix timestamp filtering mcs7830: Implement link state detection e1000e: fix Rapid Start Technology support for i217 e1000: look into the page instead of skb->data for e1000_tbi_adjust_stats() r8169: call netif_napi_del at errpaths and at driver unload tcp: reflect SYN queue_mapping into SYNACK packets tcp: do not create inetpeer on SYNACK message 8139cp/8139too: terminate the eeprom access with the right opmode 8139cp: set ring address before enabling receiver cipso: handle CIPSO options correctly when NetLabel is disabled net: sock: validate data_len before allocating skb in sock_alloc_send_pskb() bql: Avoid possible inconsistent calculation. bql: Avoid unneeded limit decrement. bql: Fix POSDIFF() to integer overflow aware. net/mlx4_core: Fix obscure mlx4_cmd_box parameter in QUERY_DEV_CAP net/mlx4_core: Check port out-of-range before using in mlx4_slave_cap net/mlx4_core: Fixes for VF / Guest startup flow net/mlx4_en: Fix improper use of "port" parameter in mlx4_en_event net/mlx4_core: Fix number of EQs used in ICM initialisation net/mlx4_core: Fix the slave_id out-of-range test in mlx4_eq_int
| * | fec_mpc52xx: fix timestamp filteringStephan Gatzka2012-06-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | skb_defer_rx_timestamp was called with a freshly allocated skb but must be called with rskb instead. Signed-off-by: Stephan Gatzka <stephan@gatzka.org> Cc: stable <stable@vger.kernel.org> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mcs7830: Implement link state detectionOndrej Zary2012-06-021-2/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add .status callback that detects link state changes. Tested with MCS7832CV-AA chip (9710:7830, identified as rev.C by the driver). Fixes https://bugzilla.kernel.org/show_bug.cgi?id=28532 Signed-off-by: Ondrej Zary <linux@rainbow-software.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | e1000e: fix Rapid Start Technology support for i217Bruce Allan2012-06-021-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The definition of I217_PROXY_CTRL must use the BM_PHY_REG() macro instead of the PHY_REG() macro for PHY page 800 register 70 since it is for a PHY register greater than the maximum allowed by the latter macro, and fix a typo setting the I217_MEMPWR register in e1000_suspend_workarounds_ich8lan. Also for clarity, rename a few defines as bit definitions instead of masks. Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * | e1000: look into the page instead of skb->data for e1000_tbi_adjust_stats()Sebastian Andrzej Siewior2012-06-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | This is another fixup where the data is not transfered into buffer addressed by skb->data but into a page. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * | r8169: call netif_napi_del at errpaths and at driver unloadDevendra Naga2012-06-011-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | when register_netdev fails, the init'ed NAPIs by netif_napi_add must be deleted with netif_napi_del, and also when driver unloads, it should delete the NAPI before unregistering netdevice using unregister_netdev. Signed-off-by: Devendra Naga <devendra.aaru@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: reflect SYN queue_mapping into SYNACK packetsEric Dumazet2012-06-012-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While testing how linux behaves on SYNFLOOD attack on multiqueue device (ixgbe), I found that SYNACK messages were dropped at Qdisc level because we send them all on a single queue. Obvious choice is to reflect incoming SYN packet @queue_mapping to SYNACK packet. Under stress, my machine could only send 25.000 SYNACK per second (for 200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues. After patch, not a single SYNACK is dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: do not create inetpeer on SYNACK messageEric Dumazet2012-06-011-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Another problem on SYNFLOOD/DDOS attack is the inetpeer cache getting larger and larger, using lots of memory and cpu time. tcp_v4_send_synack() ->inet_csk_route_req() ->ip_route_output_flow() ->rt_set_nexthop() ->rt_init_metrics() ->inet_getpeer( create = true) This is a side effect of commit a4daad6b09230 (net: Pre-COW metrics for TCP) added in 2.6.39 Possible solution : Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS Before patch : # grep peer /proc/slabinfo inet_peer_cache 4175430 4175430 192 42 2 : tunables 0 0 0 : slabdata 99415 99415 0 Samples: 41K of event 'cycles', Event count (approx.): 30716565122 + 20,24% ksoftirqd/0 [kernel.kallsyms] [k] inet_getpeer + 8,19% ksoftirqd/0 [kernel.kallsyms] [k] peer_avl_rebalance.isra.1 + 4,81% ksoftirqd/0 [kernel.kallsyms] [k] sha_transform + 3,64% ksoftirqd/0 [kernel.kallsyms] [k] fib_table_lookup + 2,36% ksoftirqd/0 [ixgbe] [k] ixgbe_poll + 2,16% ksoftirqd/0 [kernel.kallsyms] [k] __ip_route_output_key + 2,11% ksoftirqd/0 [kernel.kallsyms] [k] kernel_map_pages + 2,11% ksoftirqd/0 [kernel.kallsyms] [k] ip_route_input_common + 2,01% ksoftirqd/0 [kernel.kallsyms] [k] __inet_lookup_established + 1,83% ksoftirqd/0 [kernel.kallsyms] [k] md5_transform + 1,75% ksoftirqd/0 [kernel.kallsyms] [k] check_leaf.isra.9 + 1,49% ksoftirqd/0 [kernel.kallsyms] [k] ipt_do_table + 1,46% ksoftirqd/0 [kernel.kallsyms] [k] hrtimer_interrupt + 1,45% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_alloc + 1,29% ksoftirqd/0 [kernel.kallsyms] [k] inet_csk_search_req + 1,29% ksoftirqd/0 [kernel.kallsyms] [k] __netif_receive_skb + 1,16% ksoftirqd/0 [kernel.kallsyms] [k] copy_user_generic_string + 1,15% ksoftirqd/0 [kernel.kallsyms] [k] kmem_cache_free + 1,02% ksoftirqd/0 [kernel.kallsyms] [k] tcp_make_synack + 0,93% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock_bh + 0,87% ksoftirqd/0 [kernel.kallsyms] [k] __call_rcu + 0,84% ksoftirqd/0 [kernel.kallsyms] [k] rt_garbage_collect + 0,84% ksoftirqd/0 [kernel.kallsyms] [k] fib_rules_lookup Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | 8139cp/8139too: terminate the eeprom access with the right opmodeJason Wang2012-06-012-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we terminate the eeprom access through clearing the CS by: RTL_W8 (Cfg9346, ~EE_CS); or writeb (~EE_CS, ee_addr); This would left the eeprom into "Config. Register Write Enable:" state which is not expcted as the highest two bits were set to 0x11 ( expected is the "Normal" mode (0x00)). Solving this by write 0x0 instead of ~EE_CS when terminating the eeprom access. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | 8139cp: set ring address before enabling receiverJason Wang2012-06-011-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we enable the receiver before setting the ring address which could lead the card DMA into unexpected areas. Solving this by set the ring address before enabling the receiver. btw. I find and test this in qemu as I didn't have a 8139cp card in hand. please review it carefully. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | cipso: handle CIPSO options correctly when NetLabel is disabledPaul Moore2012-06-011-1/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When NetLabel is not enabled, e.g. CONFIG_NETLABEL=n, and the system receives a CIPSO tagged packet it is dropped (cipso_v4_validate() returns non-zero). In most cases this is the correct and desired behavior, however, in the case where we are simply forwarding the traffic, e.g. acting as a network bridge, this becomes a problem. This patch fixes the forwarding problem by providing the basic CIPSO validation code directly in ip_options_compile() without the need for the NetLabel or CIPSO code. The new validation code can not perform any of the CIPSO option label/value verification that cipso_v4_validate() does, but it can verify the basic CIPSO option format. The behavior when NetLabel is enabled is unchanged. Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: sock: validate data_len before allocating skb in sock_alloc_send_pskb()Jason Wang2012-05-311-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | We need to validate the number of pages consumed by data_len, otherwise frags array could be overflowed by userspace. So this patch validate data_len and return -EMSGSIZE when data_len may occupies more frags than MAX_SKB_FRAGS. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | bql: Avoid possible inconsistent calculation.Hiroaki SHIMODA2012-05-311-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dql->num_queued could change while processing dql_completed(). To provide consistent calculation, added an on stack variable. Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | bql: Avoid unneeded limit decrement.Hiroaki SHIMODA2012-05-311-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When below pattern is observed, TIME dql_queued() dql_completed() | a) initial state | | b) X bytes queued V c) Y bytes queued d) X bytes completed e) Z bytes queued f) Y bytes completed a) dql->limit has already some value and there is no in-flight packet. b) X bytes queued. c) Y bytes queued and excess limit. d) X bytes completed and dql->prev_ovlimit is set and also dql->prev_num_queued is set Y. e) Z bytes queued. f) Y bytes completed. inprogress and prev_inprogress are true. At f), according to the comment, all_prev_completed becomes true and limit should be increased. But POSDIFF() ignores (completed == dql->prev_num_queued) case, so limit is decreased. Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Denys Fedoryshchenko <denys@visp.net.lb> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | bql: Fix POSDIFF() to integer overflow aware.Hiroaki SHIMODA2012-05-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | POSDIFF() fails to take into account integer overflow case. Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Denys Fedoryshchenko <denys@visp.net.lb> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/mlx4_core: Fix obscure mlx4_cmd_box parameter in QUERY_DEV_CAPJack Morgenstein2012-05-311-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | The "!mlx4_is_slave" is totally confusing. Fix with constant MLX4_CMD_NATIVE, which is the intended behavior. Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/mlx4_core: Check port out-of-range before using in mlx4_slave_capJack Morgenstein2012-05-311-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The range check was performed after using the port number. Reverse this to prevent a potential array overflow. Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/mlx4_core: Fixes for VF / Guest startup flowJack Morgenstein2012-05-314-14/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - pass the following parameters: - firmware version (added QUERY_FW paravirtualization for that) - disable Blueflame on slaves. KVM disables write combining on guests, and we get better performance without BF in this case. (This requires QUERY_DEV_CAP paravirtualization, also in this commit) - max qp rdma as destination - get rid of a chunk of "if (0)" dead code Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/mlx4_en: Fix improper use of "port" parameter in mlx4_en_eventJack Morgenstein2012-05-311-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Port is used as an array index before we know if that is proper. For example, in the catas event case, port is zero; however, the port index should lie in the range (1..2). Fix this by using 'port' only in the events where it is of interest. Test for port out of range in the default (unhandled event) case, and do not output a message if it is not an ethernet port. Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/mlx4_core: Fix number of EQs used in ICM initialisationMarcel Apfelbaum2012-05-313-15/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In SRIOV mode, the number of EQs used when computing the total ICM size was incorrect. To fix this, we do the following: 1. We add a new structure to mlx4_dev, mlx4_phys_caps, to contain physical HCA capabilities. The PPF uses the phys capabilities when it computes things like ICM size. The dev_caps structure will then contain the paravirtualized values, making bookkeeping much easier in SRIOV mode. We add a structure rather than a single parameter because there will be other fields in the phys_caps. The first field we add to the mlx4_phys_caps structure is num_phys_eqs. 2. In INIT_HCA, when running in SRIOV mode, the "log_num_eqs" parameter passed to the FW is the number of EQs per VF/PF; each function (PF or VF) has this number of EQs available. However, the total number of EQs which must be allowed for in the ICM is (1 << log_num_eqs) * (#VFs + #PFs). Rather than compute this quantity, we allocate ICM space for 1024 EQs (which is the device maximum number of EQs, and which is the value we place in the mlx4_phys_caps structure). For INIT_HCA, however, we use the per-function number of EQs as described above. Signed-off-by: Marcel Apfelbaum <marcela@dev.mellanox.co.il> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/mlx4_core: Fix the slave_id out-of-range test in mlx4_eq_intJack Morgenstein2012-05-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Ths fixes the comparison in the FLR (Function Level Reset) event case. Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2012-06-0211-39/+297
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull straggler x86 fixes from Peter Anvin: "Three groups of patches: - EFI boot stub documentation and the ability to print error messages; - Removal for PTRACE_ARCH_PRCTL for x32 (obsolete interface which should never have been ported, and the port is broken and potentially dangerous.) - ftrace stack corruption fixes. I'm not super-happy about the technical implementation, but it is probably the least invasive in the short term. In the future I would like a single method for nesting the debug stack, however." * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32 x86, efi: Add EFI boot stub documentation x86, efi; Add EFI boot stub console support x86, efi: Only close open files in error path ftrace/x86: Do not change stacks in DEBUG when calling lockdep x86: Allow nesting of the debug stack IDT setting x86: Reset the debug_stack update counter ftrace: Use breakpoint method to update ftrace caller ftrace: Synchronize variable setting with breakpoints
| * \ \ Merge remote-tracking branch 'rostedt/tip/perf/urgent-2' into ↵H. Peter Anvin2012-06-016-16/+154
| |\ \ \ | | | | | | | | | | | | | | | x86-urgent-for-linus
| | * | | ftrace/x86: Do not change stacks in DEBUG when calling lockdepSteven Rostedt2012-05-311-3/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When both DYNAMIC_FTRACE and LOCKDEP are set, the TRACE_IRQS_ON/OFF will call into the lockdep code. The lockdep code can call lots of functions that may be traced by ftrace. When ftrace is updating its code and hits a breakpoint, the breakpoint handler will call into lockdep. If lockdep happens to call a function that also has a breakpoint attached, it will jump back into the breakpoint handler resetting the stack to the debug stack and corrupt the contents currently on that stack. The 'do_sym' call that calls do_int3() is protected by modifying the IST table to point to a different location if another breakpoint is hit. But the TRACE_IRQS_OFF/ON are outside that protection, and if a breakpoint is hit from those, the stack will get corrupted, and the kernel will crash: [ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002 [ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff [ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0 [ 1013.298832] Oops: 0010 [#1] PREEMPT SMP [ 1013.310600] CPU 2 [ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan] [ 1013.401848] [ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30 [ 1013.437943] RIP: 8eb8:[<ffff88014630a000>] [<ffff88014630a000>] 0xffff880146309fff [ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408 EFLAGS: 00010046 [ 1013.477909] RAX: 0000000000000001 RBX: ffffffff81104020 RCX: 0000000000000000 [ 1013.499458] RDX: ffff880148008ea8 RSI: ffffffff8131ef40 RDI: ffffffff82203b20 [ 1013.521612] RBP: ffffffff81005751 R08: 0000000000000000 R09: 0000000000000000 [ 1013.543121] R10: ffffffff82cdc318 R11: 0000000000000000 R12: ffff880145cc0000 [ 1013.564614] R13: ffff880148008eb8 R14: 0000000000000002 R15: ffff88014780cb40 [ 1013.586108] FS: 0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000 [ 1013.609458] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1013.627420] CR2: 0000000000000002 CR3: 0000000141f10000 CR4: 00000000001407e0 [ 1013.649051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1013.670724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1013.692376] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0) [ 1013.717028] Stack: [ 1013.724131] ffff88014780f570 ffff880145cc0000 0000400000004000 0000000000000000 [ 1013.745918] cccccccccccccccc ffff88014780cca8 ffffffff811072bb ffffffff81651627 [ 1013.767870] ffffffff8118f8a7 ffffffff811072bb ffffffff81f2b6c5 ffffffff81f11bdb [ 1013.790021] Call Trace: [ 1013.800701] Code: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a <e7> d7 64 81 ff ff ff ff 01 00 00 00 00 00 00 00 65 d9 64 81 ff [ 1013.861443] RIP [<ffff88014630a000>] 0xffff880146309fff [ 1013.884466] RSP <ffff88014780f408> [ 1013.901507] CR2: 0000000000000002 The solution was to reuse the NMI functions that change the IDT table to make the debug stack keep its current stack (in kernel mode) when hitting a breakpoint: call debug_stack_set_zero TRACE_IRQS_ON call debug_stack_reset If the TRACE_IRQS_ON happens to hit a breakpoint then it will keep the current stack and not crash the box. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
OpenPOWER on IntegriCloud