summaryrefslogtreecommitdiffstats
path: root/net/ipv4
Commit message (Collapse)AuthorAgeFilesLines
* genetlink: only pass array to genl_register_family_with_ops()Johannes Berg2013-11-191-2/+1
| | | | | | | | | | | | | | As suggested by David Miller, make genl_register_family_with_ops() a macro and pass only the array, evaluating ARRAY_SIZE() in the macro, this is a little safer. The openvswitch has some indirection, assing ops/n_ops directly in that code. This might ultimately just assign the pointers in the family initializations, saving the struct genl_family_and_ops and code (once mcast groups are handled differently.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: don't update snd_nxt, when a socket is switched from repair modeAndrey Vagin2013-11-191-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | snd_nxt must be updated synchronously with sk_send_head. Otherwise tp->packets_out may be updated incorrectly, what may bring a kernel panic. Here is a kernel panic from my host. [ 103.043194] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 [ 103.044025] IP: [<ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150 ... [ 146.301158] Call Trace: [ 146.301158] [<ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0 Before this panic a tcp socket was restored. This socket had sent and unsent data in the write queue. Sent data was restored in repair mode, then the socket was switched from reapair mode and unsent data was restored. After that the socket was switched back into repair mode. In that moment we had a socket where write queue looks like this: snd_una snd_nxt write_seq |_________|________| | sk_send_head After a second switching from repair mode the state of socket was changed: snd_una snd_nxt, write_seq |_________ ________| | sk_send_head This state is inconsistent, because snd_nxt and sk_send_head are not synchronized. Bellow you can find a call trace, how packets_out can be incremented twice for one skb, if snd_nxt and sk_send_head are not synchronized. In this case packets_out will be always positive, even when sk_write_queue is empty. tcp_write_wakeup skb = tcp_send_head(sk); tcp_fragment if (!before(tp->snd_nxt, TCP_SKB_CB(buff)->end_seq)) tcp_adjust_pcount(sk, skb, diff); tcp_event_new_data_sent tp->packets_out += tcp_skb_pcount(skb); I think update of snd_nxt isn't required, when a socket is switched from repair mode. Because it's initialized in tcp_connect_init. Then when a write queue is restored, snd_nxt is incremented in tcp_event_new_data_sent, so it's always is in consistent state. I have checked, that the bug is not reproduced with this patch and all tests about restoring tcp connections work fine. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* xfrm: Release dst if this dst is improper for vti tunnelfan.du2013-11-191-0/+1
| | | | | | | | | | | | After searching rt by the vti tunnel dst/src parameter, if this rt has neither attached to any transformation nor the transformation is not tunnel oriented, this rt should be released back to ip layer. otherwise causing dst memory leakage. Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ping: prevent NULL pointer dereference on write to msg_nameHannes Frederic Sowa2013-11-181-15/+19
| | | | | | | | A plain read() on a socket does set msg->msg_name to NULL. So check for NULL pointer first. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: prevent leakage of uninitialized memory to user in recv syscallsHannes Frederic Sowa2013-11-183-21/+9
| | | | | | | | | | | | | | | | Only update *addr_len when we actually fill in sockaddr, otherwise we can return uninitialized memory from the stack to the caller in the recvfrom, recvmmsg and recvmsg syscalls. Drop the the (addr_len == NULL) checks because we only get called with a valid addr_len pointer either from sock_common_recvmsg or inet_recvmsg. If a blocking read waits on a socket which is concurrently shut down we now return zero and set msg_msgnamelen to 0. Reported-by: mpb <mpb.mail@gmail.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: fix possible seqlock deadlockEric Dumazet2013-11-141-1/+1
| | | | | | | | | | | | ip4_datagram_connect() being called from process context, it should use IP_INC_STATS() instead of IP_INC_STATS_BH() otherwise we can deadlock on 32bit arches, or get corruptions of SNMP counters. Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* genetlink: make all genl_ops users constJohannes Berg2013-11-141-1/+1
| | | | | | | | | | | | | | | | | | | Now that genl_ops are no longer modified in place when registering, they can be made const. This patch was done mostly with spatch: @@ identifier ops; @@ +const struct genl_ops ops[] = { ... }; (except the struct thing in net/openvswitch/datapath.c) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net-tcp: fix panic in tcp_fastopen_cache_set()Eric Dumazet2013-11-141-1/+4
| | | | | | | | | | | | | | | We had some reports of crashes using TCP fastopen, and Dave Jones gave a nice stack trace pointing to the error. Issue is that tcp_get_metrics() should not be called with a NULL dst Fixes: 1fe4c481ba637 ("net-tcp: Fast Open client - cookie cache") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dave Jones <davej@redhat.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Tested-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: tsq: restore minimal amount of queueingEric Dumazet2013-11-142-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several users reported throughput regressions, notably on mvneta and wifi adapters. 802.11 AMPDU requires a fair amount of queueing to be effective. This patch partially reverts the change done in tcp_write_xmit() so that the minimal amount is sysctl_tcp_limit_output_bytes. It also remove the use of this sysctl while building skb stored in write queue, as TSO autosizing does the right thing anyway. Users with well behaving NICS and correct qdisc (like sch_fq), can then lower the default sysctl_tcp_limit_output_bytes value from 128KB to 8KB. This new usage of sysctl_tcp_limit_output_bytes permits each driver authors to check how their driver performs when/if the value is set to a minimum of 4KB. Normally, line rate for a single TCP flow should be possible, but some drivers rely on timers to perform TX completion and too long TX completion delays prevent reaching full throughput. Fixes: c9eeec26e32e ("tcp: TSQ can use a dynamic limit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sujith Manoharan <sujith@msujith.org> Reported-by: Arnaud Ebalard <arno@natisbad.org> Tested-by: Sujith Manoharan <sujith@msujith.org> Cc: Felix Fietkau <nbd@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* core/dev: do not ignore dmac in dev_forward_skb()Alexei Starovoitov2013-11-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | commit 06a23fe31ca3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") and refactoring 64261f230a91 ("dev: move skb_scrub_packet() after eth_type_trans()") are forcing pkt_type to be PACKET_HOST when skb traverses veth. which means that ip forwarding will kick in inside netns even if skb->eth->h_dest != dev->dev_addr Fix order of eth_type_trans() and skb_scrub_packet() in dev_forward_skb() and in ip_tunnel_rcv() Fixes: 06a23fe31ca3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") CC: Isaku Yamahata <yamahatanetdev@gmail.com> CC: Maciej Zenczykowski <zenczykowski@gmail.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: fix a UFO regressionEric Dumazet2013-11-081-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While testing virtio_net and skb_segment() changes, Hannes reported that UFO was sending wrong frames. It appears this was introduced by a recent commit : 8c3a897bfab1 ("inet: restore gso for vxlan") The old condition to perform IP frag was : tunnel = !!skb->encapsulation; ... if (!tunnel && proto == IPPROTO_UDP) { So the new one should be : udpfrag = !skb->encapsulation && proto == IPPROTO_UDP; ... if (udpfrag) { Initialization of udpfrag must be done before call to ops->callbacks.gso_segment(skb, features), as skb_udp_tunnel_segment() clears skb->encapsulation (We want udpfrag to be true for UFO, false for VXLAN) With help from Alexei Starovoitov Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACEHannes Frederic Sowa2013-11-054-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery, their sockets won't accept and install new path mtu information and they will always use the interface mtu for outgoing packets. It is guaranteed that the packet is not fragmented locally. But we won't set the DF-Flag on the outgoing frames. Florian Weimer had the idea to use this flag to ensure DNS servers are never generating outgoing fragments. They may well be fragmented on the path, but the server never stores or usees path mtu values, which could well be forged in an attack. (The root of the problem with path MTU discovery is that there is no reliable way to authenticate ICMP Fragmentation Needed But DF Set messages because they are sent from intermediate routers with their source addresses, and the IMCP payload will not always contain sufficient information to identify a flow.) Recent research in the DNS community showed that it is possible to implement an attack where DNS cache poisoning is feasible by spoofing fragments. This work was done by Amir Herzberg and Haya Shulman: <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf> This issue was previously discussed among the DNS community, e.g. <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>, without leading to fixes. This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK" for the enforcement of the non-fragmentable checks. If other users than ip_append_page/data should use this semantic too, we have to add a new flag to IPCB(skb)->flags to suppress local fragmentation and check for this in ip_finish_output. Many thanks to Florian Weimer for the idea and feedback while implementing this patch. Cc: David S. Miller <davem@davemloft.net> Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: properly handle stretch acks in slow startYuchung Cheng2013-11-0414-67/+56
| | | | | | | | | | | | | | | | | | | | | | | | | Slow start now increases cwnd by 1 if an ACK acknowledges some packets, regardless the number of packets. Consequently slow start performance is highly dependent on the degree of the stretch ACKs caused by receiver or network ACK compression mechanisms (e.g., delayed-ACK, GRO, etc). But slow start algorithm is to send twice the amount of packets of packets left so it should process a stretch ACK of degree N as if N ACKs of degree 1, then exits when cwnd exceeds ssthresh. A follow up patch will use the remainder of the N (if greater than 1) to adjust cwnd in the congestion avoidance phase. In addition this patch retires the experimental limited slow start (LSS) feature. LSS has multiple drawbacks but questionable benefit. The fractional cwnd increase in LSS requires a loop in slow start even though it's rarely used. Configuring such an increase step via a global sysctl on different BDPS seems hard. Finally and most importantly the slow start overshoot concern is now better covered by the Hybrid slow start (hystart) enabled by default. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: enable sockets to use MSG_FASTOPEN by defaultYuchung Cheng2013-11-041-1/+1
| | | | | | | | | | | | | Applications have started to use Fast Open (e.g., Chrome browser has such an optional flag) and the feature has gone through several generations of kernels since 3.7 with many real network tests. It's time to enable this flag by default for applications to test more conveniently and extensively. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2013-11-041-1/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables Pablo Neira Ayuso says: ==================== This batch contains fives nf_tables patches for your net-next tree, they are: * Fix possible use after free in the module removal path of the x_tables compatibility layer, from Dan Carpenter. * Add filter chain type for the bridge family, from myself. * Fix Kconfig dependencies of the nf_tables bridge family with the core, from myself. * Fix sparse warnings in nft_nat, from Tomasz Bursztyka. * Remove duplicated include in the IPv4 family support for nf_tables, from Wei Yongjun. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * netfilter: nf_tables: remove duplicated include from nf_tables_ipv4.cWei Yongjun2013-11-031-1/+0
| | | | | | | | | | | | | | Remove duplicated include. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | Merge branch 'master' of ↵David S. Miller2013-11-041-35/+75
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== This is another batch containing Netfilter/IPVS updates for your net-next tree, they are: * Six patches to make the ipt_CLUSTERIP target support netnamespace, from Gao feng. * Two cleanups for the nf_conntrack_acct infrastructure, introducing a new structure to encapsulate conntrack counters, from Holger Eitzenberger. * Fix missing verdict in SCTP support for IPVS, from Daniel Borkmann. * Skip checksum recalculation in SCTP support for IPVS, also from Daniel Borkmann. * Fix behavioural change in xt_socket after IP early demux, from Florian Westphal. * Fix bogus large memory allocation in the bitmap port set type in ipset, from Jozsef Kadlecsik. * Fix possible compilation issues in the hash netnet set type in ipset, also from Jozsef Kadlecsik. * Define constants to identify netlink callback data in ipset dumps, again from Jozsef Kadlecsik. * Use sock_gen_put() in xt_socket to replace xt_socket_put_sk, from Eric Dumazet. * Improvements for the SH scheduler in IPVS, from Alexander Frolkin. * Remove extra delay due to unneeded rcu barrier in IPVS net namespace cleanup path, from Julian Anastasov. * Save some cycles in ip6t_REJECT by skipping checksum validation in packets leaving from our stack, from Stanislav Fomichev. * Fix IPVS_CMD_ATTR_MAX definition in IPVS, larger that required, from Julian Anastasov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: ipt_CLUSTERIP: use proper net namespace to operate CLUSTERIPGao feng2013-10-171-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | we can allow users in uninit net namespace to operate ipt_CLUSTERIP now. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ipt_CLUSTERIP: create proc entry under proper ipt_CLUSTERIP directoryGao feng2013-10-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Create proc entries under the ipt_CLUSTERIP directory of proper net namespace. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ipt_CLUSTERIP: add parameter net in clusterip_config_find_getGao feng2013-10-171-6/+6
| | | | | | | | | | | | | | | | | | | | | Inorder to find clusterip_config in net namespace. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ipt_CLUSTERIP: make clusterip_lock per net namespaceGao feng2013-10-171-7/+11
| | | | | | | | | | | | | | | | | | | | | | | | this lock is used for protecting clusterip_configs of per net namespace, it should be per net namespace too. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ipt_CLUSTERIP: make clusterip_list per net namespaceGao feng2013-10-171-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | clusterip_configs should be per net namespace, so operate cluster in one net namespace won't affect other net namespace. right now, only allow to operate the clusterip_configs of init net namespace. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ipt_CLUSTERIP: make proc directory per net namespaceGao feng2013-10-171-19/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | Create /proc/net/ipt_CLUSTERIP directory for per net namespace. Right now,only allow to create entries under the ipt_CLUSTERIP in init net namespace. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-11-046-25/+47
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/emulex/benet/be.h drivers/net/netconsole.c net/bridge/br_private.h Three mostly trivial conflicts. The net/bridge/br_private.h conflict was a function signature (argument addition) change overlapping with the extern removals from Joe Perches. In drivers/net/netconsole.c we had one change adjusting a printk message whilst another changed "printk(KERN_INFO" into "pr_info(". Lastly, the emulex change was a new inline function addition overlapping with Joe Perches's extern removals. Signed-off-by: David S. Miller <davem@davemloft.net>
| * \ \ Merge branch 'master' of ↵David S. Miller2013-11-021-2/+6
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== 1) Fix a possible race on ipcomp scratch buffers because of too early enabled siftirqs. From Michal Kubecek. 2) The current xfrm garbage collector threshold is too small for some workloads, resulting in bad performance on these workloads. Increase the threshold from 1024 to 32768. 3) Some codepaths might not have a dst_entry attached to the skb when calling xfrm_decode_session(). So add a check to prevent a null pointer dereference in this case. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | xfrm: Fix null pointer dereference when decoding sessionsSteffen Klassert2013-11-011-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On some codepaths the skb does not have a dst entry when xfrm_decode_session() is called. So check for a valid skb_dst() before dereferencing the device interface index. We use 0 as the device index if there is no valid skb_dst(), or at reverse decoding we use skb_iif as device interface index. Bug was introduced with git commit bafd4bd4dc ("xfrm: Decode sessions with output interface."). Reported-by: Meelis Roos <mroos@linux.ee> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | * | | xfrm: Increase the garbage collector thresholdSteffen Klassert2013-10-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the removal of the routing cache, we lost the option to tweak the garbage collector threshold along with the maximum routing cache size. So git commit 703fb94ec ("xfrm: Fix the gc threshold value for ipv4") moved back to a static threshold. It turned out that the current threshold before we start garbage collecting is much to small for some workloads, so increase it from 1024 to 32768. This means that we start the garbage collector if we have more than 32768 dst entries in the system and refuse new allocations if we are above 65536. Reported-by: Wolfgang Walter <linux@stwm.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | tcp: gso: fix truesize trackingEric Dumazet2013-10-291-8/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6ff50cd55545 ("tcp: gso: do not generate out of order packets") had an heuristic that can trigger a warning in skb_try_coalesce(), because skb->truesize of the gso segments were exactly set to mss. This breaks the requirement that skb->truesize >= skb->len + truesizeof(struct sk_buff); It can trivially be reproduced by : ifconfig lo mtu 1500 ethtool -K lo tso off netperf As the skbs are looped into the TCP networking stack, skb_try_coalesce() warns us of these skb under-estimating their truesize. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | tcp: do not rearm RTO when future data are sackedYuchung Cheng2013-10-271-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch ed08495c3 "tcp: use RTT from SACK for RTO" always re-arms RTO upon obtaining a RTT sample from newly sacked data. But technically RTO should only be re-armed when the data sent before the last (re)transmission of write queue head are (s)acked. Otherwise the RTO may continue to extend during loss recovery on data sent in the future. Note that RTTs from ACK or timestamps do not have this problem, as the RTT source must be from data sent before. The new RTO re-arm policy is 1) Always re-arm RTO if SND.UNA is advanced 2) Re-arm RTO if sack RTT is available, provided the sacked data was sent before the last time write_queue_head was sent. Signed-off-by: Larry Brakmo <brakmo@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | tcp: only take RTT from timestamps if new data is ackedYuchung Cheng2013-10-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch ed08495c3 "tcp: use RTT from SACK for RTO" has a bug that it does not check if the ACK acknowledge new data before taking the RTT sample from TCP timestamps. This patch adds the check back as required by the RFC. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | tcp: fix SYNACK RTT estimation in Fast OpenYuchung Cheng2013-10-271-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tp->lsndtime may not always be the SYNACK timestamp if a passive Fast Open socket sends data before handshake completes. And if the remote acknowledges both the data and the SYNACK, the RTT sample is already taken in tcp_ack(), so no need to call tcp_update_ack_rtt() in tcp_synack_rtt_meas() aagain. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2013-10-233-6/+11
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== The following patchset contains three netfilter fixes for your net tree, they are: * A couple of fixes to resolve info leak to userspace due to uninitialized memory area in ulogd, from Mathias Krause. * Fix instruction ordering issues that may lead to the access of uninitialized data in x_tables. The problem involves the table update (producer) and the main packet matching (consumer) routines. Detected in SMP ARMv7, from Will Deacon. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | netfilter: x_tables: fix ordering of jumpstack allocation and table updateWill Deacon2013-10-222-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During kernel stability testing on an SMP ARMv7 system, Yalin Wang reported the following panic from the netfilter code: 1fe0: 0000001c 5e2d3b10 4007e779 4009e110 60000010 00000032 ff565656 ff545454 [<c06c48dc>] (ipt_do_table+0x448/0x584) from [<c0655ef0>] (nf_iterate+0x48/0x7c) [<c0655ef0>] (nf_iterate+0x48/0x7c) from [<c0655f7c>] (nf_hook_slow+0x58/0x104) [<c0655f7c>] (nf_hook_slow+0x58/0x104) from [<c0683bbc>] (ip_local_deliver+0x88/0xa8) [<c0683bbc>] (ip_local_deliver+0x88/0xa8) from [<c0683718>] (ip_rcv_finish+0x418/0x43c) [<c0683718>] (ip_rcv_finish+0x418/0x43c) from [<c062b1c4>] (__netif_receive_skb+0x4cc/0x598) [<c062b1c4>] (__netif_receive_skb+0x4cc/0x598) from [<c062b314>] (process_backlog+0x84/0x158) [<c062b314>] (process_backlog+0x84/0x158) from [<c062de84>] (net_rx_action+0x70/0x1dc) [<c062de84>] (net_rx_action+0x70/0x1dc) from [<c0088230>] (__do_softirq+0x11c/0x27c) [<c0088230>] (__do_softirq+0x11c/0x27c) from [<c008857c>] (do_softirq+0x44/0x50) [<c008857c>] (do_softirq+0x44/0x50) from [<c0088614>] (local_bh_enable_ip+0x8c/0xd0) [<c0088614>] (local_bh_enable_ip+0x8c/0xd0) from [<c06b0330>] (inet_stream_connect+0x164/0x298) [<c06b0330>] (inet_stream_connect+0x164/0x298) from [<c061d68c>] (sys_connect+0x88/0xc8) [<c061d68c>] (sys_connect+0x88/0xc8) from [<c000e340>] (ret_fast_syscall+0x0/0x30) Code: 2a000021 e59d2028 e59de01c e59f011c (e7824103) ---[ end trace da227214a82491bd ]--- Kernel panic - not syncing: Fatal exception in interrupt This comes about because CPU1 is executing xt_replace_table in response to a setsockopt syscall, resulting in: ret = xt_jumpstack_alloc(newinfo); --> newinfo->jumpstack = kzalloc(size, GFP_KERNEL); [...] table->private = newinfo; newinfo->initial_entries = private->initial_entries; Meanwhile, CPU0 is handling the network receive path and ends up in ipt_do_table, resulting in: private = table->private; [...] jumpstack = (struct ipt_entry **)private->jumpstack[cpu]; On weakly ordered memory architectures, the writes to table->private and newinfo->jumpstack from CPU1 can be observed out of order by CPU0. Furthermore, on architectures which don't respect ordering of address dependencies (i.e. Alpha), the reads from CPU0 can also be re-ordered. This patch adds an smp_wmb() before the assignment to table->private (which is essentially publishing newinfo) to ensure that all writes to newinfo will be observed before plugging it into the table structure. A dependent-read barrier is also added on the consumer sides, to ensure the same ordering requirements are also respected there. Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reported-by: Wang, Yalin <Yalin.Wang@sonymobile.com> Tested-by: Wang, Yalin <Yalin.Wang@sonymobile.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | netfilter: ipt_ULOG: fix info leaksMathias Krause2013-10-021-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ulog messages leak heap bytes by the means of padding bytes and incompletely filled string arrays. Fix those by memset(0)'ing the whole struct before filling it. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | | | | Merge branch 'master' of ↵David S. Miller2013-11-021-34/+15
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Conflicts: net/xfrm/xfrm_policy.c Minor merge conflict in xfrm_policy.c, consisting of overlapping changes which were trivial to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | net: esp{4,6}: get rid of struct esp_dataMathias Krause2013-10-291-26/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct esp_data consists of a single pointer, vanishing the need for it to be a structure. Fold the pointer into 'data' direcly, removing one level of pointer indirection. Signed-off-by: Mathias Krause <mathias.krause@secunet.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | | net: esp{4,6}: remove padlen from struct esp_dataMathias Krause2013-10-291-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The padlen member of struct esp_data is always zero. Get rid of it. Signed-off-by: Mathias Krause <mathias.krause@secunet.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | | | | | | tcp: temporarily disable Fast Open on SYN timeoutYuchung Cheng2013-10-292-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fast Open currently has a fall back feature to address SYN-data being dropped but it requires the middle-box to pass on regular SYN retry after SYN-data. This is implemented in commit aab487435 ("net-tcp: Fast Open client - detecting SYN-data drops") However some NAT boxes will drop all subsequent packets after first SYN-data and blackholes the entire connections. An example is in commit 356d7d8 "netfilter: nf_conntrack: fix tcp_in_window for Fast Open". The sender should note such incidents and fall back to use the regular TCP handshake on subsequent attempts temporarily as well: after the second SYN timeouts the original Fast Open SYN is most likely lost. When such an event recurs Fast Open is disabled based on the number of recurrences exponentially. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORKHannes Frederic Sowa2013-10-291-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UFO as well as UDP_CORK do not respect IP_PMTUDISC_DO and IP_PMTUDISC_PROBE well enough. UFO enabled packet delivery just appends all frags to the cork and hands it over to the network card. So we just deliver non-DF udp fragments (DF-flag may get overwritten by hardware or virtual UFO enabled interface). UDP_CORK does enqueue the data until the cork is disengaged. At this point it sets the correct IP_DF and local_df flags and hands it over to ip_fragment which in this case will generate an icmp error which gets appended to the error socket queue. This is not reflected in the syscall error (of course, if UFO is enabled this also won't happen). Improve this by checking the pmtudisc flags before appending data to the socket and if we still can fit all data in one packet when IP_PMTUDISC_DO or IP_PMTUDISC_PROBE is set, only then proceed. We use (mtu-fragheaderlen) to check for the maximum length because we ensure not to generate a fragment and non-fragmented data does not need to have its length aligned on 64 bit boundaries. Also the passed in ip_options are already aligned correctly. Maybe, we can relax some other checks around ip_fragment. This needs more research. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | inet: restore gso for vxlanEric Dumazet2013-10-281-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexei reported a performance regression on vxlan, caused by commit 3347c9602955 "ipv4: gso: make inet_gso_segment() stackable" GSO vxlan packets were not properly segmented, adding IP fragments while they were not expected. Rename 'bool tunnel' to 'bool encap', and add a new boolean to express the fact that UDP should be fragmented. This fragmentation is triggered by skb->encapsulation being set. Remove a "skb->encapsulation = 1" added in above commit, as its not needed, as frags inherit skb->frag from original GSO skb. Reported-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | inet: remove old fragmentation hash initializingHannes Frederic Sowa2013-10-231-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All fragmentation hash secrets now get initialized by their corresponding hash function with net_get_random_once. Thus we can eliminate the initial seeding. Also provide a comment that hash secret seeding happens at the first call to the corresponding hashing function. Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | ipv4: initialize ip4_frags hash secret as late as possibleHannes Frederic Sowa2013-10-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Defer the generation of the first hash secret for the ipv4 fragmentation cache as late as possible. ip4_frags.rnd gets initial seeded by inet_frags_init and regulary reseeded by inet_frag_secret_rebuild. Either we call ipqhashfn directly from ip_fragment.c in which case we initialize the secret directly. If we first get called by inet_frag_secret_rebuild we install a new secret by a manual call to get_random_bytes. This secret will be overwritten as soon as the first call to ipqhashfn happens. This is safe because we won't race while publishing the new secrets with anyone else. Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-10-235-13/+31
|\ \ \ \ \ \ \ | | |/ / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/usb/qmi_wwan.c include/net/dst.h Trivial merge conflicts, both were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | tcp: initialize passive-side sk_pacing_rate after 3WHSNeal Cardwell2013-10-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For passive TCP connections, upon receiving the ACK that completes the 3WHS, make sure we set our pacing rate after we get our first RTT sample. On passive TCP connections, when we receive the ACK completing the 3WHS we do not take an RTT sample in tcp_ack(), but rather in tcp_synack_rtt_meas(). So upon receiving the ACK that completes the 3WHS, tcp_ack() leaves sk_pacing_rate at its initial value. Originally the initial sk_pacing_rate value was 0, so passive-side connections defaulted to sysctl_tcp_min_tso_segs (2 segs) in skbuffs made in the first RTT. With a default initial cwnd of 10 packets, this happened to be correct for RTTs 5ms or bigger, so it was hard to see problems in WAN or emulated WAN testing. Since 7eec4174ff ("pkt_sched: fq: fix non TCP flows pacing"), the initial sk_pacing_rate is 0xffffffff. So after that change, passive TCP connections were keeping this value (and using large numbers of segments per skbuff) until receiving an ACK for data. Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | ip_output: do skb ufo init for peeked non ufo skb as wellJiri Pirko2013-10-191-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now, if user application does: sendto len<mtu flag MSG_MORE sendto len>mtu flag 0 The skb is not treated as fragmented one because it is not initialized that way. So move the initialization to fix this. introduced by: commit e89e9cf539a28df7d0eb1d0a545368e9920b34ac "[IPv4/IPv6]: UFO Scatter-gather approach" Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | tcp: remove the sk_can_gso() check from tcp_set_skb_tso_segs()Eric Dumazet2013-10-171-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk_can_gso() should only be used as a hint in tcp_sendmsg() to build GSO packets in the first place. (As a performance hint) Once we have GSO packets in write queue, we can not decide they are no longer GSO only because flow now uses a route which doesn't handle TSO/GSO. Core networking stack handles the case very well for us, all we need is keeping track of packet counts in MSS terms, regardless of segmentation done later (in GSO or hardware) Right now, if tcp_fragment() splits a GSO packet in two parts, @left and @right, and route changed through a non GSO device, both @left and @right have pcount set to 1, which is wrong, and leads to incorrect packet_count tracking. This problem was added in commit d5ac99a648 ("[TCP]: skb pcount with MTU discovery") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | tcp: must unclone packets before mangling themEric Dumazet2013-10-171-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP stack should make sure it owns skbs before mangling them. We had various crashes using bnx2x, and it turned out gso_size was cleared right before bnx2x driver was populating TC descriptor of the _previous_ packet send. TCP stack can sometime retransmit packets that are still in Qdisc. Of course we could make bnx2x driver more robust (using ACCESS_ONCE(shinfo->gso_size) for example), but the bug is TCP stack. We have identified two points where skb_unclone() was needed. This patch adds a WARN_ON_ONCE() to warn us if we missed another fix of this kind. Kudos to Neal for finding the root cause of this bug. Its visible using small MSS. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | tcp: fix incorrect ca_state in tail loss probeYuchung Cheng2013-10-171-1/+1
| | |_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On receiving an ACK that covers the loss probe sequence, TLP immediately sets the congestion state to Open, even though some packets are not recovered and retransmisssion are on the way. The later ACks may trigger a WARN_ON check in step D of tcp_fastretrans_alert(), e.g., https://bugzilla.redhat.com/show_bug.cgi?id=989251 The fix is to follow the similar procedure in recovery by calling tcp_try_keep_open(). The sender switches to Open state if no packets are retransmissted. Otherwise it goes to Disorder and let subsequent ACKs move the state to Recovery or Open. Reported-By: Michael Sterrett <michael@sterretts.net> Tested-By: Dormando <dormando@rydia.net> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | vti: get rid of nf mark rule in preroutingChristophe Gouault2013-10-111-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes and improves the use of vti interfaces (while lightly changing the way of configuring them). Currently: - it is necessary to identify and mark inbound IPsec packets destined to each vti interface, via netfilter rules in the mangle table at prerouting hook. - the vti module cannot retrieve the right tunnel in input since commit b9959fd3: vti tunnels all have an i_key, but the tunnel lookup is done with flag TUNNEL_NO_KEY, so there no chance to retrieve them. - the i_key is used by the outbound processing as a mark to lookup for the right SP and SA bundle. This patch uses the o_key to store the vti mark (instead of i_key) and enables: - to avoid the need for previously marking the inbound skbuffs via a netfilter rule. - to properly retrieve the right tunnel in input, only based on the IPsec packet outer addresses. - to properly perform an inbound policy check (using the tunnel o_key as a mark). - to properly perform an outbound SPD and SAD lookup (using the tunnel o_key as a mark). - to keep the current mark of the skbuff. The skbuff mark is neither used nor changed by the vti interface. Only the vti interface o_key is used. SAs have a wildcard mark. SPs have a mark equal to the vti interface o_key. The vti interface must be created as follows (i_key = 0, o_key = mark): ip link add vti1 mode vti local 1.1.1.1 remote 2.2.2.2 okey 1 The SPs attached to vti1 must be created as follows (mark = vti1 o_key): ip xfrm policy add dir out mark 1 tmpl src 1.1.1.1 dst 2.2.2.2 \ proto esp mode tunnel ip xfrm policy add dir in mark 1 tmpl src 2.2.2.2 dst 1.1.1.1 \ proto esp mode tunnel The SAs are created with the default wildcard mark. There is no distinction between global vs. vti SAs. Just their addresses will possibly link them to a vti interface: ip xfrm state add src 1.1.1.1 dst 2.2.2.2 proto esp spi 1000 mode tunnel \ enc "cbc(aes)" "azertyuiopqsdfgh" ip xfrm state add src 2.2.2.2 dst 1.1.1.1 proto esp spi 2000 mode tunnel \ enc "cbc(aes)" "sqbdhgqsdjqjsdfh" To avoid matching "global" (not vti) SPs in vti interfaces, global SPs should no use the default wildcard mark, but explicitly match mark 0. To avoid a double SPD lookup in input and output (in global and vti SPDs), the NOPOLICY and NOXFRM options should be set on the vti interfaces: echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_policy echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_xfrm The outgoing traffic is steered to vti1 by a route via the vti interface: ip route add 192.168.0.0/16 dev vti1 The incoming IPsec traffic is steered to vti1 because its outer addresses match the vti1 tunnel configuration. Signed-off-by: Christophe Gouault <christophe.gouault@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | Merge branch 'master' of ↵David S. Miller2013-10-091-0/+1
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== 1) We used the wrong netlink attribute to verify the lenght of the replay window on async events. Fix this by using the right netlink attribute. 2) Policy lookups can not match the output interface on forwarding. Add the needed informations to the flow informations. 3) We update the pmtu when we receive a ICMPV6_DEST_UNREACH message on IPsec with ipv6. This is wrong and leads to strange fragmented packets, only ICMPV6_PKT_TOOBIG messages should update the pmtu. Fix this by removing the ICMPV6_DEST_UNREACH check from the IPsec protocol error handlers. 4) The legacy IPsec anti replay mechanism supports anti replay windows up to 32 packets. If a user requests for a bigger anti replay window, we use 32 packets but pretend that we use the requested window size. Fix from Fan Du. 5) If asynchronous events are enabled and replay_maxdiff is set to zero, we generate an async event for every received packet instead of checking whether a timeout occurred. Fix from Thomas Egerer. 6) Policies need a refcount when the state resolution timer is armed. Otherwise the timer can fire after the policy is deleted. 7) We might dreference a NULL pointer if the hold_queue is empty, add a check to avoid this. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud