summaryrefslogtreecommitdiffstats
path: root/net/ipv4
Commit message (Collapse)AuthorAgeFilesLines
* tcp: Do not apply TSO segment limit to non-TSO packetsHerbert Xu2015-01-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs. In fact the problem was completely unrelated to IPsec. The bug is also reproducible if you just disable TSO/GSO. The problem is that when the MSS goes down, existing queued packet on the TX queue that have not been transmitted yet all look like TSO packets and get treated as such. This then triggers a bug where tcp_mss_split_point tells us to generate a zero-sized packet on the TX queue. Once that happens we're screwed because the zero-sized packet can never be removed by ACKs. Fixes: 1485348d242 ("tcp: Apply device TSO segment limit earlier") Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Cheers, Signed-off-by: David S. Miller <davem@davemloft.net>
* openvswitch: Fix vport_send double freePravin B Shelar2014-12-231-1/+5
| | | | | | | | | | | | | | Today vport-send has complex error handling because it involves freeing skb and updating stats depending on return value from vport send implementation. This can be simplified by delegating responsibility of freeing skb to the vport implementation for all cases. So that vport-send needs just update stats. Fixes: 91b7514cdf ("openvswitch: Unify vport error stats handling") Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* geneve: Fix races between socket add and release.Jesse Gross2014-12-181-6/+7
| | | | | | | | | | | | | | | | | | Currently, searching for a socket to add a reference to is not synchronized with deletion of sockets. This can result in use after free if there is another operation that is removing a socket at the same time. Solving this requires both holding the appropriate lock and checking the refcount to ensure that it has not already hit zero. Inspired by a related (but not exactly the same) issue in the VXLAN driver. Fixes: 0b5e8b8e ("net: Add Geneve tunneling protocol driver") CC: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* geneve: Remove socket and offload handlers at destruction.Jesse Gross2014-12-181-0/+17
| | | | | | | | | | | | Sockets aren't currently removed from the the global list when they are destroyed. In addition, offload handlers need to be cleaned up as well. Fixes: 0b5e8b8e ("net: Add Geneve tunneling protocol driver") CC: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip_tunnel: Add missing validation of encap type to ip_tunnel_encap_setup()Thomas Graf2014-12-161-0/+3
| | | | | | | | | The encap->type comes straight from Netlink. Validate it against max supported encap types just like ip_encap_hlen() already does. Fixes: a8c5f9 ("ip_tunnel: Ops registration for secondary encap (fou, gue)") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip_tunnel: Add sanity checks to ip_tunnel_encap_add_ops()Thomas Graf2014-12-161-0/+6
| | | | | | | | The symbols are exported and could be used by external modules. Fixes: a8c5f9 ("ip_tunnel: Ops registration for secondary encap (fou, gue)") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* gre: fix the inner mac header in nbma tunnel xmit pathTimo Teräs2014-12-151-4/+5
| | | | | | | | | | | | | | | | | | | | | | | The NBMA GRE tunnels temporarily push GRE header that contain the per-packet NBMA destination on the skb via header ops early in xmit path. It is the later pulled before the real GRE header is constructed. The inner mac was thus set differently in nbma case: the GRE header has been pushed by neighbor layer, and mac header points to beginning of the temporary gre header (set by dev_queue_xmit). Now that the offloads expect mac header to point to the gre payload, fix the xmit patch to: - pull first the temporary gre header away - and reset mac header to point to gre payload This fixes tso to work again with nbma tunnels. Fixes: 14051f0452a2 ("gre: Use inner mac length when computing tunnel length") Signed-off-by: Timo Teräs <timo.teras@iki.fi> Cc: Tom Herbert <therbert@google.com> Cc: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* fib_trie: Fix trie balancing issue if new node pushes down existing nodeAlexander Duyck2014-12-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch addresses an issue with the level compression of the fib_trie. Specifically in the case of adding a new leaf that triggers a new node to be added that takes the place of the old node. The result is a trie where the 1 child tnode is on one side and one leaf is on the other which gives you a very deep trie. Below is the script I used to generate a trie on dummy0 with a 10.X.X.X family of addresses. ip link add type dummy ipval=184549374 bit=2 for i in `seq 1 23` do ifconfig dummy0:$bit $ipval/8 ipval=`expr $ipval - $bit` bit=`expr $bit \* 2` done cat /proc/net/fib_triestat Running the script before the patch: Local: Aver depth: 10.82 Max depth: 23 Leaves: 29 Prefixes: 30 Internal nodes: 27 1: 26 2: 1 Pointers: 56 Null ptrs: 1 Total size: 5 kB After applying the patch and repeating: Local: Aver depth: 4.72 Max depth: 9 Leaves: 29 Prefixes: 30 Internal nodes: 12 1: 3 2: 2 3: 7 Pointers: 70 Null ptrs: 30 Total size: 4 kB What this fix does is start the rebalance at the newly created tnode instead of at the parent tnode. This way if there is a gap between the parent and the new node it doesn't prevent the new tnode from being coalesced with any pre-existing nodes that may have been pushed into one of the new nodes child branches. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2014-12-1140-561/+1175
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: 1) New offloading infrastructure and example 'rocker' driver for offloading of switching and routing to hardware. This work was done by a large group of dedicated individuals, not limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend, Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu 2) Start making the networking operate on IOV iterators instead of modifying iov objects in-situ during transfers. Thanks to Al Viro and Herbert Xu. 3) A set of new netlink interfaces for the TIPC stack, from Richard Alpe. 4) Remove unnecessary looping during ipv6 routing lookups, from Martin KaFai Lau. 5) Add PAUSE frame generation support to gianfar driver, from Matei Pavaluca. 6) Allow for larger reordering levels in TCP, which are easily achievable in the real world right now, from Eric Dumazet. 7) Add a variable of napi_schedule that doesn't need to disable cpu interrupts, from Eric Dumazet. 8) Use a doubly linked list to optimize neigh_parms_release(), from Nicolas Dichtel. 9) Various enhancements to the kernel BPF verifier, and allow eBPF programs to actually be attached to sockets. From Alexei Starovoitov. 10) Support TSO/LSO in sunvnet driver, from David L Stevens. 11) Allow controlling ECN usage via routing metrics, from Florian Westphal. 12) Remote checksum offload, from Tom Herbert. 13) Add split-header receive, BQL, and xmit_more support to amd-xgbe driver, from Thomas Lendacky. 14) Add MPLS support to openvswitch, from Simon Horman. 15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen Klassert. 16) Do gro flushes on a per-device basis using a timer, from Eric Dumazet. This tries to resolve the conflicting goals between the desired handling of bulk vs. RPC-like traffic. 17) Allow userspace to ask for the CPU upon what a packet was received/steered, via SO_INCOMING_CPU. From Eric Dumazet. 18) Limit GSO packets to half the current congestion window, from Eric Dumazet. 19) Add a generic helper so that all drivers set their RSS keys in a consistent way, from Eric Dumazet. 20) Add xmit_more support to enic driver, from Govindarajulu Varadarajan. 21) Add VLAN packet scheduler action, from Jiri Pirko. 22) Support configurable RSS hash functions via ethtool, from Eyal Perry. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits) Fix race condition between vxlan_sock_add and vxlan_sock_release net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header net/mlx4: Add support for A0 steering net/mlx4: Refactor QUERY_PORT net/mlx4_core: Add explicit error message when rule doesn't meet configuration net/mlx4: Add A0 hybrid steering net/mlx4: Add mlx4_bitmap zone allocator net/mlx4: Add a check if there are too many reserved QPs net/mlx4: Change QP allocation scheme net/mlx4_core: Use tasklet for user-space CQ completion events net/mlx4_core: Mask out host side virtualization features for guests net/mlx4_en: Set csum level for encapsulated packets be2net: Export tunnel offloads only when a VxLAN tunnel is created gianfar: Fix dma check map error when DMA_API_DEBUG is enabled cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call net: fec: only enable mdio interrupt before phy device link up net: fec: clear all interrupt events to support i.MX6SX net: fec: reset fep link status in suspend function net: sock: fix access via invalid file descriptor net: introduce helper macro for_each_cmsghdr ...
| * net: introduce helper macro for_each_cmsghdrGu Zheng2014-12-101-1/+1
| | | | | | | | | | | | | | | | Introduce helper macro for_each_cmsghdr as a wrapper of the enumerating cmsghdr from msghdr, just cleanup. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-12-102-2/+5
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/amd/xgbe/xgbe-desc.c drivers/net/ethernet/renesas/sh_eth.c Overlapping changes in both conflict cases. Signed-off-by: David S. Miller <davem@davemloft.net>
| | * tcp: fix more NULL deref after prequeue changesEric Dumazet2014-12-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I cooked commit c3658e8d0f1 ("tcp: fix possible NULL dereference in tcp_vX_send_reset()") I missed other spots we could deref a NULL skb_dst(skb) Again, if a socket is provided, we do not need skb_dst() to get a pointer to network namespace : sock_net(sk) is good enough. Reported-by: Dann Frazier <dann.frazier@canonical.com> Bisected-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: ca777eff51f7 ("tcp: remove dst refcount false sharing for prequeue mode") Signed-off-by: David S. Miller <davem@davemloft.net>
| | * gre: Set inner mac header in gro completeTom Herbert2014-12-051-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Set the inner mac header to point to the GRE payload when doing GRO. This is needed if we proceed to send the packet through GRE GSO which now uses the inner mac header instead of inner network header to determine the length of encapsulation headers. Fixes: 14051f0452a2 ("gre: Use inner mac length when computing tunnel length") Reported-by: Wolfgang Walter <linux@stwm.de> Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge branch 'for-davem-2' of ↵David S. Miller2014-12-107-20/+23
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs More iov_iter work for the networking from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | put iov_iter into msghdrAl Viro2014-12-095-8/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Note that the code _using_ ->msg_iter at that point will be very unhappy with anything other than unshifted iovec-backed iov_iter. We still need to convert users to proper primitives. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | switch tcp_sock->ucopy from iovec (ucopy.iov) to msghdr (ucopy.msg)Al Viro2014-12-092-5/+4
| | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | ip_generic_getfrag, udplite_getfrag: switch to passing msghdrAl Viro2014-12-093-6/+6
| | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | raw.c: stick msghdr into raw_frag_vecAl Viro2014-12-091-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | we'll want access to ->msg_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | tcp: refine TSO autosizingEric Dumazet2014-12-092-57/+62
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to control TSO size, but did this at the wrong place (sendmsg() time) At sendmsg() time, we might have a pessimistic view of flow rate, and we end up building very small skbs (with 2 MSS per skb). This is bad because : - It sends small TSO packets even in Slow Start where rate quickly increases. - It tends to make socket write queue very big, increasing tcp_ack() processing time, but also increasing memory needs, not necessarily accounted for, as fast clones overhead is currently ignored. - Lower GRO efficiency and more ACK packets. Servers with a lot of small lived connections suffer from this. Lets instead fill skbs as much as possible (64KB of payload), but split them at xmit time, when we have a precise idea of the flow rate. skb split is actually quite efficient. Patch looks bigger than necessary, because TCP Small Queue decision now has to take place after the eventual split. As Neal suggested, introduce a new tcp_tso_autosize() helper, so that tcp_tso_should_defer() can be synchronized on same goal. Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable contains number of mss that we can put in GSO packet, and is not related to the autosizing goal anymore. Tested: 40 ms rtt link nstat >/dev/null netperf -H remote -l -2000000 -- -s 1000000 nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets" Before patch : Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/s 87380 2000000 2000000 0.36 44.22 IpInReceives 600 0.0 IpOutRequests 599 0.0 TcpOutSegs 1397 0.0 IpExtOutOctets 2033249 0.0 After patch : Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 2000000 2000000 0.36 44.27 IpInReceives 221 0.0 IpOutRequests 232 0.0 TcpOutSegs 1397 0.0 IpExtOutOctets 2013953 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp_cubic: refine Hystart delay thresholdEric Dumazet2014-12-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 2b4636a5f8ca ("tcp_cubic: make the delay threshold of HyStart less sensitive"), HYSTART_DELAY_MIN was changed to 4 ms. The remaining problem is that using delay_min + (delay_min/16) as the threshold is too sensitive. 6.25 % of variation is too small for rtt above 60 ms, which are not uncommon. Lets use 12.5 % instead (delay_min + (delay_min/8)) Tested: 80 ms RTT between peers, FQ/pacing packet scheduler on sender. 10 bulk transfers of 10 seconds : nstat >/dev/null for i in `seq 1 10` do netperf -H remote -- -k THROUGHPUT | grep THROUGHPUT done nstat | grep Hystart With the 6.25 % threshold : THROUGHPUT=20.66 THROUGHPUT=249.38 THROUGHPUT=254.10 THROUGHPUT=14.94 THROUGHPUT=251.92 THROUGHPUT=237.73 THROUGHPUT=19.18 THROUGHPUT=252.89 THROUGHPUT=21.32 THROUGHPUT=15.58 TcpExtTCPHystartTrainDetect 2 0.0 TcpExtTCPHystartTrainCwnd 4756 0.0 TcpExtTCPHystartDelayDetect 5 0.0 TcpExtTCPHystartDelayCwnd 180 0.0 With the 12.5 % threshold THROUGHPUT=251.09 THROUGHPUT=247.46 THROUGHPUT=250.92 THROUGHPUT=248.91 THROUGHPUT=250.88 THROUGHPUT=249.84 THROUGHPUT=250.51 THROUGHPUT=254.15 THROUGHPUT=250.62 THROUGHPUT=250.89 TcpExtTCPHystartTrainDetect 1 0.0 TcpExtTCPHystartTrainCwnd 3175 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp_cubic: add SNMP counters to track how effective is HystartEric Dumazet2014-12-092-9/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When deploying FQ pacing, one thing we noticed is that CUBIC Hystart triggers too soon. Having SNMP counters to have an idea of how often the various Hystart methods trigger is useful prior to any modifications. This patch adds SNMP counters tracking, how many time "ack train" or "Delay" based Hystart triggers, and cumulative sum of cwnd at the time Hystart decided to end SS (Slow Start) myhost:~# nstat -a | grep Hystart TcpExtTCPHystartTrainDetect 9 0.0 TcpExtTCPHystartTrainCwnd 20650 0.0 TcpExtTCPHystartDelayDetect 10 0.0 TcpExtTCPHystartDelayCwnd 360 0.0 -> Train detection was triggered 9 times, and average cwnd was 20650/9=2294, Delay detection was triggered 10 times and average cwnd was 36 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | udp: Neaten and reduce size of compute_score functionsJoe Perches2014-12-081-49/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The compute_score functions are a bit difficult to read. Neaten them a bit to reduce object sizes and make them a bit more intelligible. Return early to avoid indentation and avoid unnecessary initializations. (allyesconfig, but w/ -O2 and no profiling) $ size net/ipv[46]/udp.o.* text data bss dec hex filename 28680 1184 25 29889 74c1 net/ipv4/udp.o.new 28756 1184 25 29965 750d net/ipv4/udp.o.old 17600 1010 2 18612 48b4 net/ipv6/udp.o.new 17632 1010 2 18644 48d4 net/ipv6/udp.o.old Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net-timestamp: allow reading recv cmsg on errqueue with origin tstampWillem de Bruijn2014-12-081-2/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow reading of timestamps and cmsg at the same time on all relevant socket families. One use is to correlate timestamps with egress device, by asking for cmsg IP_PKTINFO. on AF_INET sockets, call the relevant function (ip_cmsg_recv). To avoid changing legacy expectations, only do so if the caller sets a new timestamping flag SOF_TIMESTAMPING_OPT_CMSG. on AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already returned for all origins. only change is to set ifindex, which is not initialized for all error origins. In both cases, only generate the pktinfo message if an ifindex is known. This is not the case for ACK timestamps. The difference between the protocol families is probably a historical accident as a result of the different conditions for generating cmsg in the relevant ip(v6)_recv_error function: ipv4: if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) { ipv6: if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) { At one time, this was the same test bar for the ICMP/ICMP6 distinction. This is no longer true. Signed-off-by: Willem de Bruijn <willemb@google.com> ---- Changes v1 -> v2 large rewrite - integrate with existing pktinfo cmsg generation code - on ipv4: only send with new flag, to maintain legacy behavior - on ipv6: send at most a single pktinfo cmsg - on ipv6: initialize fields if not yet initialized The recv cmsg interfaces are also relevant to the discussion of whether looping packet headers is problematic. For v6, cmsgs that identify many headers are already returned. This patch expands that to v4. If it sounds reasonable, I will follow with patches 1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY (http://patchwork.ozlabs.org/patch/366967/) 2. sysctl to conditionally drop all timestamps that have payload or cmsg from users without CAP_NET_RAW. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: warn once on passing AF_INET6 socket to ip_recv_errorWillem de Bruijn2014-12-081-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | One line change, in response to catching an occurrence of this bug. See also fix f4713a3dfad0 ("net-timestamp: make tcp_recvmsg call ...") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2014-12-054-91/+2
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains netfilter updates for net-next. Basically, enhancements for xt_recent, skip zeroing of timer in conntrack, fix linking problem with recent redirect support for nf_tables, ipset updates and a couple of cleanups. More specifically, they are: 1) Rise maximum number per IP address to be remembered in xt_recent while retaining backward compatibility, from Florian Westphal. 2) Skip zeroing timer area in nf_conn objects, also from Florian. 3) Inspect IPv4 and IPv6 traffic from the bridge to allow filtering using using meta l4proto and transport layer header, from Alvaro Neira. 4) Fix linking problems in the new redirect support when CONFIG_IPV6=n and IP6_NF_IPTABLES=n. And ipset updates from Jozsef Kadlecsik: 5) Support updating element extensions when the set is full (fixes netfilter bugzilla id 880). 6) Fix set match with 32-bits userspace / 64-bits kernel. 7) Indicate explicitly when /0 networks are supported in ipset. 8) Simplify cidr handling for hash:*net* types. 9) Allocate the proper size of memory when /0 networks are supported. 10) Explicitly add padding elements to hash:net,net and hash:net,port, because the elements must be u32 sized for the used hash function. Jozsef is also cooking ipset RCU conversion which should land soon if they reach the merge window in time. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | netfilter: combine IPv4 and IPv6 nf_nat_redirect code in one modulePablo Neira Ayuso2014-11-274-91/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This resolves linking problems with CONFIG_IPV6=n: net/built-in.o: In function `redirect_tg6': xt_REDIRECT.c:(.text+0x6d021): undefined reference to `nf_nat_redirect_ipv6' Reported-by: Andreas Ruprecht <rupran@einserver.de> Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-11-295-12/+21
| |\ \ \ | | | |/ | | |/|
| * | | gue: Call remcsum_adjustTom Herbert2014-11-261-67/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change remote checksum offload to call remcsum_adjust. This also eliminates the optimization to skip an IP header as part of the adjustment (really does not seem to be much of a win). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge branch 'for-davem' of ↵David S. Miller2014-11-254-6/+5
| |\ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs More work from Al Viro to move away from modifying iovecs by using iov_iter instead. Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | new helper: memcpy_to_msg()Al Viro2014-11-241-1/+1
| | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | new helper: memcpy_from_msg()Al Viro2014-11-242-2/+2
| | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | * | new helper: skb_copy_and_csum_datagram_msg()Al Viro2014-11-241-3/+2
| | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2014-11-248-12/+208
| |\ \ \ | | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== netfilter/ipvs updates for net-next The following patchset contains Netfilter updates for your net-next tree, this includes the NAT redirection support for nf_tables, the cgroup support for nft meta and conntrack zone support for the connlimit match. Coming after those, a bunch of sparse warning fixes, missing netns bits and cleanups. More specifically, they are: 1) Prepare IPv4 and IPv6 NAT redirect code to use it from nf_tables, patches from Arturo Borrero. 2) Introduce the nf_tables redir expression, from Arturo Borrero. 3) Remove an unnecessary assignment in ip_vs_xmit/__ip_vs_get_out_rt(). Patch from Alex Gartrell. 4) Add nft_log_dereference() macro to the nf_log infrastructure, patch from Marcelo Leitner. 5) Add some extra validation when registering logger families, also from Marcelo. 6) Some spelling cleanups from stephen hemminger. 7) Fix sparse warning in nf_logger_find_get(). 8) Add cgroup support to nf_tables meta, patch from Ana Rey. 9) A Kconfig fix for the new redir expression and fix sparse warnings in the new redir expression. 10) Fix several sparse warnings in the netfilter tree, from Florian Westphal. 11) Reduce verbosity when OOM in nfnetlink_log. User can basically do nothing when this situation occurs. 12) Add conntrack zone support to xt_connlimit, again from Florian. 13) Add netnamespace support to the h323 conntrack helper, contributed by Vasily Averin. 14) Remove unnecessary nul-pointer checks before free_percpu() and module_put(), from Markus Elfring. 15) Use pr_fmt in nfnetlink_log, again patch from Marcelo Leitner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | netfilter: fix various sparse warningsFlorian Westphal2014-11-133-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net/bridge/br_netfilter.c:870:6: symbol 'br_netfilter_enable' was not declared. Should it be static? no; add include net/ipv4/netfilter/nft_reject_ipv4.c:22:6: symbol 'nft_reject_ipv4_eval' was not declared. Should it be static? yes net/ipv6/netfilter/nf_reject_ipv6.c:16:6: symbol 'nf_send_reset6' was not declared. Should it be static? no; add include net/ipv6/netfilter/nft_reject_ipv6.c:22:6: symbol 'nft_reject_ipv6_eval' was not declared. Should it be static? yes net/netfilter/core.c:33:32: symbol 'nf_ipv6_ops' was not declared. Should it be static? no; add include net/netfilter/xt_DSCP.c:40:57: cast truncates bits from constant value (ffffff03 becomes 3) net/netfilter/xt_DSCP.c:57:59: cast truncates bits from constant value (ffffff03 becomes 3) add __force, 3 is what we want. net/ipv4/netfilter/nf_log_arp.c:77:6: symbol 'nf_log_arp_packet' was not declared. Should it be static? yes net/ipv4/netfilter/nf_reject_ipv4.c:17:6: symbol 'nf_send_reset' was not declared. Should it be static? no; add include Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: log: protect nf_log_register against double registeringMarcelo Leitner2014-10-302-2/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, despite the comment right before the function, nf_log_register allows registering two loggers on with the same type and end up overwriting the previous register. Not a real issue today as current tree doesn't have two loggers for the same type but it's better to get this protected. Also make sure that all of its callers do error checking. Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: nf_tables: add new expression nft_redirArturo Borrero2014-10-273-0/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This new expression provides NAT in the redirect flavour, which is to redirect packets to local machine. Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: refactor NAT redirect IPv4 to use it from nf_tablesArturo Borrero2014-10-273-0/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch refactors the IPv4 code so it can be usable both from xt and nf_tables. A similar patch follows-up to handle IPv6. Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-11-213-2/+7
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ieee802154/fakehard.c A bug fix went into 'net' for ieee802154/fakehard.c, which is removed in 'net-next'. Add build fix into the merge from Stephen Rothwell in openvswitch, the logging macros take a new initial 'log' argument, a new call was added in 'net' so when we merge that in here we have to explicitly add the new 'log' arg to it else the build fails. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | vlan: introduce *vlan_hwaccel_push_inside helpersJiri Pirko2014-11-211-8/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use them to push skb->vlan_tci into the payload and avoid code duplication. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | vlan: rename __vlan_put_tag to vlan_insert_tag_set_protoJiri Pirko2014-11-211-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Name fits better. Plus there's going to be introduced __vlan_insert_tag later on. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | tcp: make connect() mem charging friendlyEric Dumazet2014-11-191-40/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While working on sk_forward_alloc problems reported by Denys Fedoryshchenko, we found that tcp connect() (and fastopen) do not call sk_wmem_schedule() for SYN packet (and/or SYN/DATA packet), so sk_forward_alloc is negative while connect is in progress. We can fix this by calling regular sk_stream_alloc_skb() both for the SYN packet (in tcp_connect()) and the syn_data packet in tcp_send_syn_data() Then, tcp_send_syn_data() can avoid copying syn_data as we simply can manipulate syn_data->cb[] to remove SYN flag (and increment seq) Instead of open coding memcpy_fromiovecend(), simply use this helper. This leaves in socket write queue clean fast clone skbs. This was tested against our fastopen packetdrill tests. Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | icmp: Remove some spurious dropped packet profile hits from the ICMP pathRick Jones2014-11-182-18/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If icmp_rcv() has successfully processed the incoming ICMP datagram, we should use consume_skb() rather than kfree_skb() because a hit on the likes of perf -e skb:kfree_skb is not called-for. Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-11-142-1/+3
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/chelsio/cxgb4vf/sge.c drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c sge.c was overlapping two changes, one to use the new __dev_alloc_page() in net-next, and one to use s->fl_pg_order in net. ixgbe_phy.c was a set of overlapping whitespace changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | tcp: limit GSO packets to half cwndEric Dumazet2014-11-131-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In DC world, GSO packets initially cooked by tcp_sendmsg() are usually big, as sk_pacing_rate is high. When network is congested, cwnd can be smaller than the GSO packets found in socket write queue. tcp_write_xmit() splits GSO packets using the available cwnd, and we end up sending a single GSO packet, consuming all available cwnd. With GRO aggregation on the receiver, we might handle a single GRO packet, sending back a single ACK. 1) This single ACK might be lost TLP or RTO are forced to attempt a retransmit. 2) This ACK releases a full cwnd, sender sends another big GSO packet, in a ping pong mode. This behavior does not fill the pipes in the best way, because of scheduling artifacts. Make sure we always have at least two GSO packets in flight. This allows us to safely increase GRO efficiency without risking spurious retransmits. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | FOU: Fix no return statement warning for !CONFIG_NET_FOU_IP_TUNNELSThomas Graf2014-11-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net/ipv4/fou.c: In function ‘ip_tunnel_encap_del_fou_ops’: net/ipv4/fou.c:861:1: warning: no return statement in function returning non-void [-Wreturn-type] Fixes: a8c5f90fb5 ("ip_tunnel: Ops registration for secondary encap (fou, gue)") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | ip_tunnel: Ops registration for secondary encap (fou, gue)Tom Herbert2014-11-122-25/+138
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of calling fou and gue functions directly from ip_tunnel use ops for these that were previously registered. This patch adds the logic to add and remove encapsulation operations for ip_tunnel, and modified fou (and gue) to register with ip_tunnels. This patch also addresses a circular dependency between ip_tunnel and fou that was causing link errors when CONFIG_NET_IP_TUNNEL=y and CONFIG_NET_FOU=m. References to fou an gue have been removed from ip_tunnel.c Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | udp: Neaten function pointer calls and add bracesJoe Perches2014-11-121-17/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Standardize function pointer uses. Convert calling style from: (*foo)(args...); to: foo(args...); Other miscellanea: o Add braces around loops with single ifs on multiple lines o Realign arguments around these functions o Invert logic in if to return immediately. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | neigh: remove dynamic neigh table registration supportWANG Cong2014-11-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently there are only three neigh tables in the whole kernel: arp table, ndisc table and decnet neigh table. What's more, we don't support registering multiple tables per family. Therefore we can just make these tables statically built-in. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: Convert LIMIT_NETDEBUG to net_dbg_ratelimitedJoe Perches2014-11-116-34/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the more common dynamic_debug capable net_dbg_ratelimited and remove the LIMIT_NETDEBUG macro. All messages are still ratelimited. Some KERN_<LEVEL> uses are changed to KERN_DEBUG. This may have some negative impact on messages that were emitted at KERN_INFO that are not not enabled at all unless DEBUG is defined or dynamic_debug is enabled. Even so, these messages are now _not_ emitted by default. This also eliminates the use of the net_msg_warn sysctl "/proc/sys/net/core/warnings". For backward compatibility, the sysctl is not removed, but it has no function. The extern declaration of net_msg_warn is removed from sock.h and made static in net/core/sysctl_net_core.c Miscellanea: o Update the sysctl documentation o Remove the embedded uses of pr_fmt o Coalesce format fragments o Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: introduce SO_INCOMING_CPUEric Dumazet2014-11-112-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alternative to RPS/RFS is to use hardware support for multiple queues. Then split a set of million of sockets into worker threads, each one using epoll() to manage events on its own socket pool. Ideally, we want one thread per RX/TX queue/cpu, but we have no way to know after accept() or connect() on which queue/cpu a socket is managed. We normally use one cpu per RX queue (IRQ smp_affinity being properly set), so remembering on socket structure which cpu delivered last packet is enough to solve the problem. After accept(), connect(), or even file descriptor passing around processes, applications can use : int cpu; socklen_t len = sizeof(cpu); getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len); And use this information to put the socket into the right silo for optimal performance, as all networking stack should run on the appropriate cpu, without need to send IPI (RPS/RFS). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud