summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* fou: Fix typo in returning flags in netlinkTom Herbert2014-11-053-3/+3
| | | | | | | | When filling netlink info, dport is being returned as flags. Fix instances to return correct value. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* r8152: disable the tasklet by defaulthayeswang2014-11-051-2/+6
| | | | | | | | | Let the tasklet only be enabled after open(), and be disabled for the other situation. The tasklet is only necessary after open() for tx/rx, so it could be disabled by default. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUsDaniel Borkmann2014-11-052-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It has been reported that generating an MLD listener report on devices with large MTUs (e.g. 9000) and a high number of IPv6 addresses can trigger a skb_over_panic(): skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20 head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0 dev:port1 ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:100! invalid opcode: 0000 [#1] SMP Modules linked in: ixgbe(O) CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4 [...] Call Trace: <IRQ> [<ffffffff80578226>] ? skb_put+0x3a/0x3b [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4 [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45 [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68 [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182 [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3 [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46 [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70 mld_newpack() skb allocations are usually requested with dev->mtu in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations") we have changed the limit in order to be less likely to fail. However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb) macros, which determine if we may end up doing an skb_put() for adding another record. To avoid possible fragmentation, we check the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong assumption as the actual max allocation size can be much smaller. The IGMP case doesn't have this issue as commit 57e1ab6eaddc ("igmp: refine skb allocations") stores the allocation size in the cb[]. Set a reserved_tailroom to make it fit into the MTU and use skb_availroom() helper instead. This also allows to get rid of igmp_skb_size(). Reported-by: Wei Liu <lw1a2.jing@gmail.com> Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: David L Stevens <david.stevens@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Convert SEQ_START_TOKEN/seq_printf to seq_putsJoe Perches2014-11-053-12/+3
| | | | | | | | | | | | | | | | | | | Using a single fixed string is smaller code size than using a format and many string arguments. Reduces overall code size a little. $ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o* text data bss dec hex filename 34269 7012 14824 56105 db29 net/ipv4/igmp.o.new 34315 7012 14824 56151 db57 net/ipv4/igmp.o.old 30078 7869 13200 51147 c7cb net/ipv6/mcast.o.new 30105 7869 13200 51174 c7e6 net/ipv6/mcast.o.old 11434 3748 8580 23762 5cd2 net/ipv6/ip6_flowlabel.o.new 11491 3748 8580 23819 5d0b net/ipv6/ip6_flowlabel.o.old Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* fast_hash: avoid indirect function callsHannes Frederic Sowa2014-11-056-93/+98
| | | | | | | | | | | | | | | | | | By default the arch_fast_hash hashing function pointers are initialized to jhash(2). If during boot-up a CPU with SSE4.2 is detected they get updated to the CRC32 ones. This dispatching scheme incurs a function pointer lookup and indirect call for every hashing operation. rhashtable as a user of arch_fast_hash e.g. stores pointers to hashing functions in its structure, too, causing two indirect branches per hashing operation. Using alternative_call we can get away with one of those indirect branches. Acked-by: Daniel Borkmann <dborkman@redhat.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'amd-xgbe-next'David S. Miller2014-11-0511-277/+955
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tom Lendacky says: ==================== amd-xgbe: AMD XGBE driver updates 2014-11-04 The following series of patches includes functional updates to the driver as well as some trivial changes for function renaming and spelling fixes. - Move channel and ring structure allocation into the device open path - Rename the pre_xmit function to dev_xmit - Explicitly use the u32 data type for the device descriptors - Use page allocation for the receive buffers - Add support for split header/payload receive - Add support for per DMA channel interrupts - Add support for receive side scaling (RSS) - Add support for ethtool receive side scaling commands - Fix the spelling of descriptors - After a PCS reset, sync the PCS and PHY modes - Add dependency on HAS_IOMEM to both the amd-xgbe and amd-xgbe-phy drivers This patch series is based on net-next. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe-phy: Let AMD_XGBE_PHY depend on HAS_IOMEMLendacky, Thomas2014-11-051-1/+1
| | | | | | | | | | | | | | | | The amd-xgbe-phy driver needs to perform ioremap calls, so add HAS_IOMEM to its build dependency. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe: Let AMD_XGBE depend on HAS_IOMEMLendacky, Thomas2014-11-051-1/+1
| | | | | | | | | | | | | | | | The amd-xgbe driver needs to perform ioremap calls, so add HAS_IOMEM to its build dependency. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe-phy: Sync PCS and PHY modes after resetLendacky, Thomas2014-11-051-1/+2
| | | | | | | | | | | | | | | | | | | | | | This patch adds support to sync the states of the PCS and the PHY after a reset is performed. If the PCS and the PHY are not in the same state after reset an extra mode change would be performed. This extra mode change might not be needed if the PCS and the PHY are synced up after reset. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe: Fix a spelling errorLendacky, Thomas2014-11-051-2/+2
| | | | | | | | | | | | | | | | This patch fixes the spelling of the word "descriptor" in a couple of locations. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe: Add receive side scaling ethtool supportLendacky, Thomas2014-11-053-0/+96
| | | | | | | | | | | | | | | | This patch adds support for ethtool receive side scaling (RSS) commands. Support is added to get/set the RSS hash key and the RSS lookup table. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe: Provide support for receive side scalingLendacky, Thomas2014-11-055-2/+233
| | | | | | | | | | | | | | | | | | | | This patch provides support for receive side scaling (RSS). RSS allows for spreading incoming network packets across the Rx queues. When used in conjunction with the per DMA channel interrupt support, this allows the receive processing to be spread across multiple processors. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe: Add support for per DMA channel interruptsLendacky, Thomas2014-11-055-51/+219
| | | | | | | | | | | | | | | | | | This patch provides support for interrupts that are generated by the Tx/Rx DMA channel pairs of the device. This allows for Tx and Rx processing to run across multiple processsors. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe: Implement split header receive supportLendacky, Thomas2014-11-055-111/+201
| | | | | | | | | | | | | | | | | | | | Provide support for splitting IP packets so that the header and payload can be sent to different DMA addresses. This will allow the IP header to be put into the linear part of the skb while the payload can be added as frags. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe: Use page allocations for Rx buffersLendacky, Thomas2014-11-054-127/+196
| | | | | | | | | | | | | | | | Use page allocations for Rx buffers instead of pre-allocating skbs of a set size. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe: Use the u32 data type for descriptorsLendacky, Thomas2014-11-051-4/+4
| | | | | | | | | | | | | | | | The Tx and Rx descriptors are unsigned 32 bit values. Use the u32 type, rather than unsigned int, to map these descriptors. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe: Rename pre_xmit function to dev_xmitLendacky, Thomas2014-11-053-6/+6
| | | | | | | | | | | | | | | | | | | | | | The pre_xmit function name implies that it performs operations prior to transmitting the packet when in fact it is responsible for setting up the descriptors and initiating the transmit. Rename this to function from pre_xmit to dev_xmit, which is consistent with the name used during receive processing - dev_read. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * amd-xgbe: Move ring allocation to device openLendacky, Thomas2014-11-053-70/+93
|/ | | | | | | | | Move the channel and ring tracking structures allocation to device open. This will allow for future support to vary the number of Tx/Rx queues without unloading the module. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: move INET6_MATCH() to include/net/inet6_hashtables.hWANG Cong2014-11-052-10/+10
| | | | | | | | It is only used in net/ipv6/inet6_hashtables.c. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add and use skb_copy_datagram_msg() helper.David S. Miller2014-11-0543-57/+58
| | | | | | | | | | | | | | | This encapsulates all of the skb_copy_datagram_iovec() callers with call argument signature "skb, offset, msghdr->msg_iov, length". When we move to iov_iters in the networking, the iov_iter object will sit in the msghdr. Having a helper like this means there will be less places to touch during that transformation. Based upon descriptions and patch from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'gue-next'David S. Miller2014-11-0515-118/+565
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tom Herbert says: ==================== gue: Remote checksum offload This patch set implements remote checksum offload for GUE, which is a mechanism that provides checksum offload of encapsulated packets using rudimentary offload capabilities found in most Network Interface Card (NIC) devices. The outer header checksum for UDP is enabled in packets and, with some additional meta information in the GUE header, a receiver is able to deduce the checksum to be set for an inner encapsulated packet. Effectively this offloads the computation of the inner checksum. Enabling the outer checksum in encapsulation has the additional advantage that it covers more of the packet than the inner checksum including the encapsulation headers. Remote checksum offload is described in: http://tools.ietf.org/html/draft-herbert-remotecsumoffload-01 The GUE transmit and receive paths are modified to support the remote checksum offload option. The option contains a checksum offset and checksum start which are directly derived from values set in stack when doing CHECKSUM_PARTIAL. On receipt of the option, the operation is to calculate the packet checksum from "start" to end of the packet (normally derived for checksum complete), and then set the resultant value at checksum "offset" (the checksum field has already been primed with the pseudo header). This emulates a NIC that implements NETIF_F_HW_CSUM. The primary purpose of this feature is to eliminate cost of performing checksum calculation over a packet when encpasulating. In this patch set: - Move fou_build_header into fou.c and split it into a couple of functions - Enable offloading of outer UDP checksum in encapsulation - Change udp_offload to support remote checksum offload, includes new GSO type and ensuring encapsulated layers (TCP) doesn't try to set a checksum covered by RCO - TX support for RCO with GUE. This is configured through ip_tunnel and set the option on transmit when packet being encapsulated is CHECKSUM_PARTIAL - RX support for RCO with GUE for normal and GRO paths. Includes resolving the offloaded checksum v2: Address comments from davem: Move accounting for private option field in gue_encap_hlen to patch in which we add the remote checksum offload option. Testing: I ran performance numbers using netperf TCP_STREAM and TCP_RR with 200 streams, comparing GUE with and without remote checksum offload (doing checksum-unnecessary to complete conversion in both cases). These were run on mlnx4 and bnx2x. Some mlnx4 results are below. GRE/GUE TCP_STREAM IPv4, with remote checksum offload 9.71% TX CPU utilization 7.42% RX CPU utilization 36380 Mbps IPv4, without remote checksum offload 12.40% TX CPU utilization 7.36% RX CPU utilization 36591 Mbps TCP_RR IPv4, with remote checksum offload 77.79% CPU utilization 91/144/216 90/95/99% latencies 1.95127e+06 tps IPv4, without remote checksum offload 78.70% CPU utilization 89/152/297 90/95/99% latencies 1.95458e+06 tps IPIP/GUE TCP_STREAM With remote checksum offload 10.30% TX CPU utilization 7.43% RX CPU utilization 36486 Mbps Without remote checksum offload 12.47% TX CPU utilization 7.49% RX CPU utilization 36694 Mbps TCP_RR With remote checksum offload 77.80% CPU utilization 87/153/270 90/95/99% latencies 1.98735e+06 tps Without remote checksum offload 77.98% CPU utilization 87/150/287 90/95/99% latencies 1.98737e+06 tps SIT/GUE TCP_STREAM With remote checksum offload 9.68% TX CPU utilization 7.36% RX CPU utilization 35971 Mbps Without remote checksum offload 12.95% TX CPU utilization 8.04% RX CPU utilization 36177 Mbps TCP_RR With remote checksum offload 79.32% CPU utilization 94/158/295 90/95/99% latencies 1.88842e+06 tps Without remote checksum offload 80.23% CPU utilization 94/149/226 90/95/99% latencies 1.90338e+06 tps VXLAN TCP_STREAM 35.03% TX CPU utilization 20.85% RX CPU utilization 36230 Mbps TCP_RR 77.36% CPU utilization 84/146/270 90/95/99% latencies 2.08063e+06 tps We can also look at CPU time in csum_partial using perf (with bnx2x setup). For GRE with TCP_STREAM I see: With remote checksum offload 0.33% TX 1.81% RX Without remote checksum offload 6.00% TX 0.51% RX I suspect the fact that time in csum_partial noticably increases with remote checksum offload for RX is due to taking the cache miss on the encapsulated header in that function. By similar reasoning, if on the TX side the packet were not in cache (say we did a splice from a file whose data was never touched by the CPU) the CPU savings for TX would probably be more pronounced. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * gue: Receive side of remote checksum offloadTom Herbert2014-11-051-9/+161
| | | | | | | | | | | | | | | | | | Add processing of the remote checksum offload option in both the normal path as well as the GRO path. The implements patching the affected checksum to derive the offloaded checksum. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * gue: TX support for using remote checksum offload optionTom Herbert2014-11-053-4/+46
| | | | | | | | | | | | | | | | | | Add if_tunnel flag TUNNEL_ENCAP_FLAG_REMCSUM to configure remote checksum offload on an IP tunnel. Add logic in gue_build_header to insert remote checksum offload option. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * gue: Protocol constants for remote checksum offloadTom Herbert2014-11-051-1/+4
| | | | | | | | | | | | | | | | Define a private flag for remote checksun offload as well as a length for the option. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * udp: Changes to udp_offload to support remote checksum offloadTom Herbert2014-11-059-6/+29
| | | | | | | | | | | | | | | | | | | | | | | | Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote checksum offload being done (in this case inner checksum must not be offloaded to the NIC). Added logic in __skb_udp_tunnel_segment to handle remote checksum offload case. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * gue: Add infrastructure for flags and optionsTom Herbert2014-11-052-53/+189
| | | | | | | | | | | | | | | | | | | | | | | | Add functions and basic definitions for processing standard flags, private flags, and control messages. This includes definitions to compute length of optional fields corresponding to a set of flags. Flag validation is in validate_gue_flags function. This checks for unknown flags, and that length of optional fields is <= length in guehdr hlen. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * udp: Offload outer UDP tunnel csum if availableTom Herbert2014-11-051-16/+36
| | | | | | | | | | | | | | | | | | In __skb_udp_tunnel_segment if outer UDP checksums are enabled and ip_summed is not already CHECKSUM_PARTIAL, set up checksum offload if device features allow it. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Move fou_build_header into fou.c and refactorTom Herbert2014-11-054-49/+120
|/ | | | | | | | | | | | | | | | Move fou_build_header out of ip_tunnel.c and into fou.c splitting it up into fou_build_header, gue_build_header, and fou_build_udp. This allows for other users for TX of FOU or GUE. Change ip_tunnel_encap to call fou_build_header or gue_build_header based on the tunnel encapsulation type. Similarly, added fou_encap_hlen and gue_encap_hlen functions which are called by ip_encap_hlen. New net/fou.h has prototypes and defines for this. Added NET_FOU_IP_TUNNELS configuration. When this is set, IP tunnels can use FOU/GUE and fou module is also selected. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'stmmac-next'David S. Miller2014-11-056-97/+16
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Giuseppe Cavallaro says: ==================== stmmac: review driver Koptions Recently many Koption options have been added to have new glue logic on several platforms. The main goal behind this work is to guarantee that the driver built fine on all the branches where it is present independently of which glue logic is selected. IMHO, it is better to remove all the not necessary Koption(s) that can hide build problems when something changes in the driver and especially when the DT compatibility allows us to manage all the platform data. I compiled the driver w/o any issue on net-next Git for: x86, arm and sh4. In case of there are build problems on some repos now it will be easy to catch them and cherry-pick patches from mainstream. For sure, do not hesitate to contact me in case of issue. Also this set removes STMMAC_DEBUG_FS and BUS_MODE_DA. The latter is useless and the former can be replaced by DEBUG_FS (always to make safe the build). V2: patch-set re-based on top of the latest updates for net-next ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * stmmac: remove BUS_MODE_DAGiuseppe CAVALLARO2014-11-052-14/+0
| | | | | | | | | | | | | | | | | | | | | | This is a very old and often unused option to configure a bit in a register inside the DMA. This support should not stay under Koption and should be extended for new chips too. This will be do later maybe via device-tree parameters. Also no performance impact when remove this setting on STi platforms. Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * stmmac: remove STMMAC_DEBUG_FSGiuseppe CAVALLARO2014-11-052-15/+7
| | | | | | | | | | | | | | | | | | | | the STMMAC_DEBUG_FS Koption is now removed from the driver configuration and this support will be built by default when DEBUG_FS is present. This can also be useful on building driver verification. Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * stmmac: remove specific SoC Koption from platform.Giuseppe CAVALLARO2014-11-054-68/+9
|/ | | | | | | | | | | This patch removes all the Koptions added to build the glue-logic files for all different architectures: DWMAC_MESON, DWMAC_SUNXI, DWMAC_STI ... Nowadays the stmmac needs to be compiled on several platforms; in some case it very convenient to guarantee that its build is always completed with success on all the branches where the driver is present. Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: phy: spi_ks8995: remove sysfs bin file by registered attributeVladimir Zapolskiy2014-11-041-1/+3
| | | | | | | | | | | | | | When a sysfs binary file is asked to be removed, it is found by attribute name, so strictly speaking this change is not a fix, but just in case when attribute name is changed in the driver or sysfs internals are changed, it might be better to remove the previously created file using right the same binary attribute. Signed-off-by: Vladimir Zapolskiy <vz@mleia.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: David S. Miller <davem@davemloft.net> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* udp: remove blank line between set and testFabian Frederick2014-11-041-1/+0
| | | | | | Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: trivial, add bracket for the if blockFlorent Fourcot2014-11-041-2/+2
| | | | | | | The "else" block is on several lines and use bracket. Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
* esp4: remove assignment in if conditionFabian Frederick2014-11-041-1/+3
| | | | | Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'ecn_via_routing_table'David S. Miller2014-11-045-44/+87
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Florian Westphal says: ==================== net: allow setting ecn via routing table Here is v4 of the patchset, its exactly the same as v3 except in patch3/3 where I added the missing 'const' qualifier to a function argument that Eric spotted during review. I preserved Erics Acks so that he doesn't have to resend them. v3 cover letter: When using syn cookies, then do not simply trust that the echoed timestamp was not modified to make sure that ecn is not turned on magically when it is disabled on the host. The first two patches, which were not part of earlier series, prepare the cookie code for the ecn route metrics change by allowing is to more easily use the existing dst object for ecn validation. The 3rd patch adds the ecn route metric feature support. It is almost the same as in v2, except that we'll now also test the dst_features when decoding a syn cookie timestamp that indicates ecn support. These three patches then allow turning on explicit congestion notification based on the destination network. For example, assuming the default tcp_ecn sysctl '2', the following will enable ecn (tcp_ecn=1 behaviour, i.e. request ecn to be enabled for a tcp connection) for all connections to hosts inside the 192.168.2/24 network: ip route change 192.168.2.0/24 dev eth0 features ecn Having a more fine-grained per-route setting can be beneficial for various reasons, for example 1) within data centers, or 2) local ISPs may deploy ECN support for their own video/streaming services [1], etc. Joint work with Daniel Borkmann, feature suggested by Hannes Frederic Sowa. The patch to enable this in iproute2 will be posted shortly, it is currently also available here: http://git.breakpoint.cc/cgit/fw/iproute2.git/commit/?h=iproute_features&id=8843d2d8973fb81c78a7efe6d42e3a17d739003e [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: allow setting ecn via routing tableFlorian Westphal2014-11-045-17/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows to set ECN on a per-route basis in case the sysctl tcp_ecn is not set to 1. In other words, when ECN is set for specific routes, it provides a tcp_ecn=1 behaviour for that route while the rest of the stack acts according to the global settings. One can use 'ip route change dev $dev $net features ecn' to toggle this. Having a more fine-grained per-route setting can be beneficial for various reasons, for example, 1) within data centers, or 2) local ISPs may deploy ECN support for their own video/streaming services [1], etc. There was a recent measurement study/paper [2] which scanned the Alexa's publicly available top million websites list from a vantage point in US, Europe and Asia: Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely blamed to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side only ECN") ;)); the break in connectivity on-path was found is about 1 in 10,000 cases. Timeouts rather than receiving back RSTs were much more common in the negotiation phase (and mostly seen in the Alexa middle band, ranks around 50k-150k): from 12-thousand hosts on which there _may_ be ECN-linked connection failures, only 79 failed with RST when _not_ failing with RST when ECN is not requested. It's unclear though, how much equipment in the wild actually marks CE when buffers start to fill up. We thought about a fallback to non-ECN for retransmitted SYNs as another global option (which could perhaps one day be made default), but as Eric points out, there's much more work needed to detect broken middleboxes. Two examples Eric mentioned are buggy firewalls that accept only a single SYN per flow, and middleboxes that successfully let an ECN flow establish, but later mark CE for all packets (so cwnd converges to 1). [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15 [2] http://ecn.ethz.ch/ Joint work with Daniel Borkmann. Reference: http://thread.gmane.org/gmane.linux.network/335797 Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * syncookies: split cookie_check_timestamp() into two functionsFlorian Westphal2014-11-043-18/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function cookie_check_timestamp(), both called from IPv4/6 context, is being used to decode the echoed timestamp from the SYN/ACK into TCP options used for follow-up communication with the peer. We can remove ECN handling from that function, split it into a separate one, and simply rename the original function into cookie_decode_options(). cookie_decode_options() just fills in tcp_option struct based on the echoed timestamp received from the peer. Anything that fails in this function will actually discard the request socket. While this is the natural place for decoding options such as ECN which commit 172d69e63c7f ("syncookies: add support for ECN") added, we argue that in particular for ECN handling, it can be checked at a later point in time as the request sock would actually not need to be dropped from this, but just ECN support turned off. Therefore, we split this functionality into cookie_ecn_ok(), which tells us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled. This prepares for per-route ECN support: just looking at the tcp_ecn sysctl won't be enough anymore at that point; if the timestamp indicates ECN and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric. This would mean adding a route lookup to cookie_check_timestamp(), which we definitely want to avoid. As we already do a route lookup at a later point in cookie_{v4,v6}_check(), we can simply make use of that as well for the new cookie_ecn_ok() function w/o any additional cost. Joint work with Daniel Borkmann. Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * syncookies: avoid magic values and document which-bit-is-what-optionFlorian Westphal2014-11-041-15/+35
|/ | | | | | | | | | | | Was a bit more difficult to read than needed due to magic shifts; add defines and document the used encoding scheme. Joint work with Daniel Borkmann. Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* igmp: remove camel case definitionsFabian Frederick2014-11-041-14/+14
| | | | | | | use standard uppercase for definitions Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* udp: remove else after returnFabian Frederick2014-11-041-6/+6
| | | | | | | else is unnecessary after return 0 in __udp4_lib_rcv() Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: frags: remove inline on static in c fileFabian Frederick2014-11-041-8/+8
| | | | | | | | | remove __inline__ / inline and let compiler decide what to do with static functions Inspired-by: "David S. Miller" <davem@davemloft.net> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: remove 0/NULL assignment on staticFabian Frederick2014-11-041-8/+8
| | | | | | | static values are automatically initialized to 0 Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: use seq_puts instead of seq_printf where possibleFabian Frederick2014-11-041-3/+3
| | | | | Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: spelling s/plugable/pluggableFabian Frederick2014-11-041-1/+1
| | | | | Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* cipso: remove NULL assignment on staticFabian Frederick2014-11-041-1/+3
| | | | | | | Also add blank line after structure declarations Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: include linux/bug.h instead of asm/bug.hFabian Frederick2014-11-041-1/+1
| | | | | Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* cipso: kerneldoc warning fixFabian Frederick2014-11-041-1/+1
| | | | | Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add rbnode to struct sk_buffEric Dumazet2014-11-032-27/+20
| | | | | | | | | | | Yaogong replaces TCP out of order receive queue by an RB tree. As netem already does a private skb->{next/prev/tstamp} union with a 'struct rb_node', lets do this in a cleaner way. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yaogong Wang <wygivan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud