summaryrefslogtreecommitdiffstats
path: root/net/ipv4
Commit message (Collapse)AuthorAgeFilesLines
* tcp: correct code comment stating 3 min timeout for FIN_WAIT2, we only do 1 minJesper Juhl2014-02-091-1/+1
| | | | | | | | | | | | | As far as I can tell we have used a default of 60 seconds for FIN_WAIT2 timeout for ages (since 2.x times??). In any case, the timeout these days is 60 seconds, so the 3 min comment is wrong (and cost me a few minutes of my life when I was debugging a FIN_WAIT2 related problem in a userspace application and checked the kernel source for details). Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2014-02-094-1/+85
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter/nftables/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes, mostly nftables fixes, most relevantly they are: * Fix a crash in the h323 conntrack NAT helper due to expectation list corruption, from Alexey Dobriyan. * A couple of RCU race fixes for conntrack, one manifests by hitting BUG_ON in nf_nat_setup_info() and the destroy path, patches from Andrey Vagin and me. * Dump direction attribute in nft_ct only if it is set, from Arturo Borrero. * Fix IPVS bug in its own connection tracking system that may lead to copying only 4 bytes of the IPv6 address when initializing the ip_vs_conn object, from Michal Kubecek. * Fix -EBUSY errors in nftables when deleting the rules, chain and tables in a row due mixture of asynchronous and synchronous object releasing, from me. * Three fixes for the nf_tables set infrastructure when using intervals and mappings, from me. * Four patches to fixing the nf_tables log, reject and ct expressions from the new inet table, from Patrick McHardy. * Fix memory overrun in the map that is used to dynamically allocate names from anonymous sets, also from Patrick. * Fix a potential oops if you dump a set with NFPROTO_UNSPEC and a table name, from Patrick McHardy. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * netfilter: nf_tables: add reject module for NFPROTO_INETPatrick McHardy2014-02-061-3/+4
| | | | | | | | | | | | | | | | Add a reject module for NFPROTO_INET. It does nothing but dispatch to the AF-specific modules based on the hook family. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nft_reject: split up reject module into IPv4 and IPv6 specifc partsPatrick McHardy2014-02-063-0/+80
| | | | | | | | | | | | | | | | | | Currently the nft_reject module depends on symbols from ipv6. This is wrong since no generic module should force IPv6 support to be loaded. Split up the module into AF-specific and a generic part. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nf_nat_h323: fix crash in nf_ct_unlink_expect_report()Alexey Dobriyan2014-02-051-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar bug fixed in SIP module in 3f509c6 ("netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectation"). BUG: unable to handle kernel paging request at 00100104 IP: [<f8214f07>] nf_ct_unlink_expect_report+0x57/0xf0 [nf_conntrack] ... Call Trace: [<c0244bd8>] ? del_timer+0x48/0x70 [<f8215687>] nf_ct_remove_expectations+0x47/0x60 [nf_conntrack] [<f8211c99>] nf_ct_delete_from_lists+0x59/0x90 [nf_conntrack] [<f8212e5e>] death_by_timeout+0x14e/0x1c0 [nf_conntrack] [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack] [<c024442d>] call_timer_fn+0x1d/0x80 [<c024461e>] run_timer_softirq+0x18e/0x1a0 [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack] [<c023e6f3>] __do_softirq+0xa3/0x170 [<c023e650>] ? __local_bh_enable+0x70/0x70 <IRQ> [<c023e587>] ? irq_exit+0x67/0xa0 [<c0202af6>] ? do_IRQ+0x46/0xb0 [<c027ad05>] ? clockevents_notify+0x35/0x110 [<c066ac6c>] ? common_interrupt+0x2c/0x40 [<c056e3c1>] ? cpuidle_enter_state+0x41/0xf0 [<c056e6fb>] ? cpuidle_idle_call+0x8b/0x100 [<c02085f8>] ? arch_cpu_idle+0x8/0x30 [<c027314b>] ? cpu_idle_loop+0x4b/0x140 [<c0273258>] ? cpu_startup_entry+0x18/0x20 [<c066056d>] ? rest_init+0x5d/0x70 [<c0813ac8>] ? start_kernel+0x2ec/0x2f2 [<c081364f>] ? repair_env_string+0x5b/0x5b [<c0813269>] ? i386_start_kernel+0x33/0x35 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | tcp: remove 1ms offset in srtt computationEric Dumazet2014-02-062-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP pacing depends on an accurate srtt estimation. Current srtt estimation is using jiffie resolution, and has an artificial offset of at least 1 ms, which can produce slowdowns when FQ/pacing is used, especially in DC world, where typical rtt is below 1 ms. We are planning a switch to usec resolution for linux-3.15, but in the meantime, this patch removes the 1 ms offset. All we need is to have tp->srtt minimal value of 1 to differentiate the case of srtt being initialized or not, not 8. The problematic behavior was observed on a 40Gbit testbed, where 32 concurrent netperf were reaching 12Gbps of aggregate speed, instead of line speed. This patch also has the effect of reporting more accurate srtt and send rates to iproute2 ss command as in : $ ss -i dst cca2 Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port tcp ESTAB 0 0 10.244.129.1:56984 10.244.129.2:12865 cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send 463.4Mbps rcv_rtt:1 rcv_space:29200 tcp ESTAB 0 390960 10.244.129.1:60247 10.244.129.2:50204 cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51 send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200 Reported-by: Vytautas Valancius <valas@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Fix runtime WARNING in rtmsg_ifa()Geert Uytterhoeven2014-02-061-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On m68k/ARAnyM: WARNING: CPU: 0 PID: 407 at net/ipv4/devinet.c:1599 0x316a99() Modules linked in: CPU: 0 PID: 407 Comm: ifconfig Not tainted 3.13.0-atari-09263-g0c71d68014d1 #1378 Stack from 10c4fdf0: 10c4fdf0 002ffabb 000243e8 00000000 008ced6c 00024416 00316a99 0000063f 00316a99 00000009 00000000 002501b4 00316a99 0000063f c0a86117 00000080 c0a86117 00ad0c90 00250a5a 00000014 00ad0c90 00000000 00000000 00000001 00b02dd0 00356594 00000000 00356594 c0a86117 eff6c9e4 008ced6c 00000002 008ced60 0024f9b4 00250b52 00ad0c90 00000000 00000000 00252390 00ad0c90 eff6c9e4 0000004f 00000000 00000000 eff6c9e4 8000e25c eff6c9e4 80001020 Call Trace: [<000243e8>] warn_slowpath_common+0x52/0x6c [<00024416>] warn_slowpath_null+0x14/0x1a [<002501b4>] rtmsg_ifa+0xdc/0xf0 [<00250a5a>] __inet_insert_ifa+0xd6/0x1c2 [<0024f9b4>] inet_abc_len+0x0/0x42 [<00250b52>] inet_insert_ifa+0xc/0x12 [<00252390>] devinet_ioctl+0x2ae/0x5d6 Adding some debugging code reveals that net_fill_ifaddr() fails in put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp, preferred, valid)) nla_put complains: lib/nlattr.c:454: skb_tailroom(skb) = 12, nla_total_size(attrlen) = 20 Apparently commit 5c766d642bcaffd0c2a5b354db2068515b3846cf ("ipv4: introduce address lifetime") forgot to take into account the addition of struct ifa_cacheinfo in inet_nlmsg_size(). Hence add it, like is already done for ipv6. Suggested-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/ipv4: Use proper RCU APIs for writer-side in udp_offload.cShlomo Pongratz2014-02-041-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RCU writer side should use rcu_dereference_protected() and not rcu_dereference(), fix that. This also removes the "suspicious RCU usage" warning seen when running with CONFIG_PROVE_RCU. Also, don't use rcu_assign_pointer/rcu_dereference for pointers which are invisible beyond the udp offload code. Fixes: b582ef0 ('net: Add GRO support for UDP encapsulating protocols') Reported-by: Eric Dumazet <edumazet@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip_tunnel: fix panic in ip_tunnel_xmit()Eric Dumazet2014-02-031-18/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Setting rt variable to NULL at the beginning of ip_tunnel_xmit() missed possible use of this variable as a scratch value. Also fixes a possible dst leak in tunnel_dst_check() : If we had to call tunnel_dst_reset(), we forgot to release the reference on dst. Merges tunnel_dst_get()/tunnel_dst_check() into a single tunnel_rtable_get() function for clarity. Many thanks to Tommi for his report and tests. Fixes: 7d442fab0a67 ("ipv4: Cache dst in tunnels") Reported-by: Tommi Rantala <tt.rantala@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Tommi Rantala <tt.rantala@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/ipv4: Use non-atomic allocation of udp offloads structure instanceOr Gerlitz2014-01-301-1/+1
|/ | | | | | | | | | | | | | | | Since udp_add_offload() can be called from non-sleepable context e.g under this call tree from the vxlan driver use case: vxlan_socket_create() <-- holds the spinlock -> vxlan_notify_add_rx_port() -> udp_add_offload() <-- schedules we should allocate the udp_offloads structure in atomic manner. Fixes: b582ef0 ('net: Add GRO support for UDP encapsulating protocols') Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: gre: use icmp_hdr() to get inner ip headerDuan Jiong2014-01-271-1/+1
| | | | | | | | | | | | | | | | | | When dealing with icmp messages, the skb->data points the ip header that triggered the sending of the icmp message. In gre_cisco_err(), the parse_gre_header() is called, and the iptunnel_pull_header() is called to pull the skb at the end of the parse_gre_header(), so the skb->data doesn't point the inner ip header. Unfortunately, the ipgre_err still needs those ip addresses in inner ip header to look up tunnel by ip_tunnel_lookup(). So just use icmp_hdr() to get inner ip header instead of skb->data. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Fix memory leak if TPROXY used with TCP early demuxHolger Eitzenberger2014-01-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I see a memory leak when using a transparent HTTP proxy using TPROXY together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable): unreferenced object 0xffff88008cba4a40 (size 1696): comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s) hex dump (first 32 bytes): 0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00 .. j@..7..2..... 02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9 [<ffffffff81270185>] sk_prot_alloc+0x29/0xc5 [<ffffffff812702cf>] sk_clone_lock+0x14/0x283 [<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b [<ffffffff8129a893>] netlink_broadcast+0x14/0x16 [<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3 [<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d [<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0 [<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e [<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55 [<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725 [<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154 [<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514 [<ffffffff8127aa77>] process_backlog+0xee/0x1c5 [<ffffffff8127c949>] net_rx_action+0xa7/0x200 [<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157 But there are many more, resulting in the machine going OOM after some days. From looking at the TPROXY code, and with help from Florian, I see that the memory leak is introduced in tcp_v4_early_demux(): void tcp_v4_early_demux(struct sk_buff *skb) { /* ... */ iph = ip_hdr(skb); th = tcp_hdr(skb); if (th->doff < sizeof(struct tcphdr) / 4) return; sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo, iph->saddr, th->source, iph->daddr, ntohs(th->dest), skb->skb_iif); if (sk) { skb->sk = sk; where the socket is assigned unconditionally to skb->sk, also bumping the refcnt on it. This is problematic, because in our case the skb has already a socket assigned in the TPROXY target. This then results in the leak I see. The very same issue seems to be with IPv6, but haven't tested. Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv4: Use PTR_ERR_OR_ZEROSachin Kamat2014-01-271-1/+2
| | | | | | | | PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it also include missing err.h header. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2014-01-2556-933/+1120
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: 1) BPF debugger and asm tool by Daniel Borkmann. 2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann. 3) Correct reciprocal_divide and update users, from Hannes Frederic Sowa and Daniel Borkmann. 4) Currently we only have a "set" operation for the hw timestamp socket ioctl, add a "get" operation to match. From Ben Hutchings. 5) Add better trace events for debugging driver datapath problems, also from Ben Hutchings. 6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we have a small send and a previous packet is already in the qdisc or device queue, defer until TX completion or we get more data. 7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko. 8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel Borkmann. 9) Share IP header compression code between Bluetooth and IEEE802154 layers, from Jukka Rissanen. 10) Fix ipv6 router reachability probing, from Jiri Benc. 11) Allow packets to be captured on macvtap devices, from Vlad Yasevich. 12) Support tunneling in GRO layer, from Jerry Chu. 13) Allow bonding to be configured fully using netlink, from Scott Feldman. 14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can already get the TCI. From Atzm Watanabe. 15) New "Heavy Hitter" qdisc, from Terry Lam. 16) Significantly improve the IPSEC support in pktgen, from Fan Du. 17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom Herbert. 18) Add Proportional Integral Enhanced packet scheduler, from Vijay Subramanian. 19) Allow openvswitch to mmap'd netlink, from Thomas Graf. 20) Key TCP metrics blobs also by source address, not just destination address. From Christoph Paasch. 21) Support 10G in generic phylib. From Andy Fleming. 22) Try to short-circuit GRO flow compares using device provided RX hash, if provided. From Tom Herbert. The wireless and netfilter folks have been busy little bees too. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits) net/cxgb4: Fix referencing freed adapter ipv6: reallocate addrconf router for ipv6 address when lo device up fib_frontend: fix possible NULL pointer dereference rtnetlink: remove IFLA_BOND_SLAVE definition rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info qlcnic: update version to 5.3.55 qlcnic: Enhance logic to calculate msix vectors. qlcnic: Refactor interrupt coalescing code for all adapters. qlcnic: Update poll controller code path qlcnic: Interrupt code cleanup qlcnic: Enhance Tx timeout debugging. qlcnic: Use bool for rx_mac_learn. bonding: fix u64 division rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC sfc: Use the correct maximum TX DMA ring size for SFC9100 Add Shradha Shah as the sfc driver maintainer. net/vxlan: Share RX skb de-marking and checksum checks with ovs tulip: cleanup by using ARRAY_SIZE() ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called net/cxgb4: Don't retrieve stats during recovery ...
| * fib_frontend: fix possible NULL pointer dereferenceOliver Hartkopp2014-01-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | The two commits 0115e8e30d (net: remove delay at device dismantle) and 748e2d9396a (net: reinstate rtnl in call_netdevice_notifiers()) silently removed a NULL pointer check for in_dev since Linux 3.7. This patch re-introduces this check as it causes crashing the kernel when setting small mtu values on non-ip capable netdevices. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is calledDuan Jiong2014-01-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a622260254ee48("ip_tunnel: fix kernel panic with icmp_dest_unreach") clear IPCB in ip_tunnel_xmit() , or else skb->cb[] may contain garbage from GSO segmentation layer. But commit 0e6fbc5b6c621("ip_tunnels: extend iptunnel_xmit()") refactor codes, and it clear IPCB behind the dst_link_failure(). So clear IPCB in ip_tunnel_xmit() just like commti a622260254ee48("ip_tunnel: fix kernel panic with icmp_dest_unreach"). Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/udp_offload: Handle static checker complaintsShlomo Pongratz2014-01-231-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | Fixed few issues around using __rcu prefix and rcu_assign_pointer, also fixed a warning print to use ntohs(port) and not htons(port). net/ipv4/udp_offload.c:112:9: error: incompatible types in comparison expression (different address spaces) net/ipv4/udp_offload.c:113:9: error: incompatible types in comparison expression (different address spaces) net/ipv4/udp_offload.c:176:19: error: incompatible types in comparison expression (different address spaces) Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: metrics: Handle v6/v4-mapped sockets in tcp-metricsChristoph Paasch2014-01-231-24/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A socket may be v6/v4-mapped. In that case sk->sk_family is AF_INET6, but the IP being used is actually an IPv4-address. Current's tcp-metrics will thus represent it as an IPv6-address: root@server:~# ip tcp_metrics ::ffff:10.1.1.2 age 22.920sec rtt 18750us rttvar 15000us cwnd 10 10.1.1.2 age 47.970sec rtt 16250us rttvar 10000us cwnd 10 This patch modifies the tcp-metrics so that they are able to handle the v6/v4-mapped sockets correctly. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv4: queue work on power efficient wqviresh kumar2014-01-221-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Workqueue used in ipv4 layer have no real dependency of scheduling these on the cpu which scheduled them. On a idle system, it is observed that an idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which the scheduler believes to be the most appropriate one. This patch replaces normal workqueues with power efficient versions. This doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is enabled. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: metrics: Fix rcu-race when deleting multiple entriesChristoph Paasch2014-01-221-9/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In bbf852b96ebdc6d1 I introduced the tmlist, which allows to delete multiple entries from the cache that match a specified destination if no source-IP is specified. However, as the cache is an RCU-list, we should not create this tmlist, as it will change the tcpm_next pointer of the element that will be deleted and so a thread iterating over the cache's entries while holding the RCU-lock might get "redirected" to this tmlist. This patch fixes this, by reverting back to the old behavior prior to bbf852b96ebdc6d1, which means that we simply change the tcpm_next pointer of the previous element (pp) to jump over the one we are deleting. The difference is that we call kfree_rcu() directly on the cache entry, which allows us to delete multiple entries from the list. Fixes: bbf852b96ebdc6d1 (tcp: metrics: Delete all entries matching a certain destination) Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Add GRO support for UDP encapsulating protocolsOr Gerlitz2014-01-211-0/+143
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add GRO handlers for protocols that do UDP encapsulation, with the intent of being able to coalesce packets which encapsulate packets belonging to the same TCP session. For GRO purposes, the destination UDP port takes the role of the ether type field in the ethernet header or the next protocol in the IP header. The UDP GRO handler will only attempt to coalesce packets whose destination port is registered to have gro handler. Use a mark on the skb GRO CB data to disallow (flush) running the udp gro receive code twice on a packet. This solves the problem of udp encapsulated packets whose inner VM packet is udp and happen to carry a port which has registered offloads. Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: protect protocols not handling ipv4 from v4 connection/bind attemptsHannes Frederic Sowa2014-01-211-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Some ipv6 protocols cannot handle ipv4 addresses, so we must not allow connecting and binding to them. sendmsg logic does already check msg->name for this but must trust already connected sockets which could be set up for connection to ipv4 address family. Per-socket flag ipv6only is of no use here, as it is under users control by setsockopt. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: delete redundant calls of tcp_mtup_init()Peter Pan(潘卫平)2014-01-211-1/+0
| | | | | | | | | | | | | | | | | | As tcp_rcv_state_process() has already calls tcp_mtup_init() for non-fastopen sock, we can delete the redundant calls of tcp_mtup_init() in tcp_{v4,v6}_syn_recv_sock(). Signed-off-by: Weiping Pan <panweiping3@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: remove the useless argument from ip_tunnel_hash()Duan Jiong2014-01-211-5/+4
| | | | | | | | | | | | | | | | | | Since commit c544193214("GRE: Refactor GRE tunneling code") introduced function ip_tunnel_hash(), the argument itn is no longer in use, so remove it. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: make IPV6_RECVPKTINFO work for ipv4 datagramsHannes Frederic Sowa2014-01-192-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data for IPv4 datagrams on IPv6 sockets. This patch splits the ip6_datagram_recv_ctl into two functions, one which handles both protocol families, AF_INET and AF_INET6, while the ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data. ip6_datagram_recv_*_ctl never reported back any errors, so we can make them return void. Also provide a helper for protocols which don't offer dual personality to further use ip6_datagram_recv_ctl, which is exported to modules. I needed to shuffle the code for ping around a bit to make it easier to implement dual personality for ping ipv6 sockets in future. Reported-by: Gert Doering <gert@space.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: be friend with drop monitorEric Dumazet2014-01-184-6/+6
| | | | | | | | | | | | | | | | Replace some dev_kfree_skb() with kfree_skb() calls when we drop one skb, this might help bug tracking. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: add build-time checks for msg->msg_name sizeSteffen Hurrle2014-01-184-10/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg handler msg_name and msg_namelen logic"). DECLARE_SOCKADDR validates that the structure we use for writing the name information to is not larger than the buffer which is reserved for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR consistently in sendmsg code paths. Signed-off-by: Steffen Hurrle <steffen@hurrle.net> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-01-182-21/+38
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c net/ipv4/tcp_metrics.c Overlapping changes between the "don't create two tcp metrics objects with the same key" race fix in net and the addition of the destination address in the lookup key in net-next. Minor overlapping changes in bnx2x driver. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: fix a dst leak in tunnelsEric Dumazet2014-01-171-20/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch : 1) Remove a dst leak if DST_NOCACHE was set on dst Fix this by holding a reference only if dst really cached. 2) Remove a lockdep warning in __tunnel_dst_set() This was reported by Cong Wang. 3) Remove usage of a spinlock where xchg() is enough 4) Remove some spurious inline keywords. Let compiler decide for us. Fixes: 7d442fab0a67 ("ipv4: Cache dst in tunnels") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <cwang@twopensource.com> Cc: Tom Herbert <therbert@google.com> Cc: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAITFlorent Fourcot2014-01-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is following the commit b903d324bee262 (ipv6: tcp: fix TCLASS value in ACK messages sent from TIME_WAIT). For the same reason than tclass, we have to store the flow label in the inet_timewait_sock to provide consistency of flow label on the last ACK. Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/ipv4: don't use module_init in non-modular gre_offloadPaul Gortmaker2014-01-161-8/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent commit 438e38fadca2f6e57eeecc08326c8a95758594d4 ("gre_offload: statically build GRE offloading support") added new module_init/module_exit calls to the gre_offload.c file. The file is obj-y and can't be anything other than built-in. Currently it can never be built modular, so using module_init as an alias for __initcall can be somewhat misleading. Fix this up now, so that we can relocate module_init from init.h into module.h in the future. If we don't do this, we'd have to add module.h to obviously non-modular code, and that would be a worse thing. We also make the inclusion explicit. Note that direct use of __initcall is discouraged, vs. one of the priority categorized subgroups. As __initcall gets mapped onto device_initcall, our use of device_initcall directly in this change means that the runtime impact is zero -- it will remain at level 6 in initcall ordering. As for the module_exit, rather than replace it with __exitcall, we simply remove it, since it appears only UML does anything with those, and even for UML, there is no relevant cleanup to be done here. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: do not export tcp_gso_segment() and tcp_gro_receive()Eric Dumazet2014-01-141-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_gso_segment() and tcp_gro_receive() no longer need to be exported. IPv4 and IPv6 offloads are statically linked. Note that tcp_gro_complete() is still used by bnx2x, unfortunately. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: replace macros net_random and net_srandom with direct calls to prandomAruna-Hewapathirane2014-01-144-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes the net_random and net_srandom macros and replaces them with direct calls to the prandom ones. As new commits only seem to use prandom_u32 there is no use to keep them around. This change makes it easier to grep for users of prandom_u32. Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: register igmp_notifier even when !CONFIG_PROC_FSWANG Cong2014-01-142-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We still need this notifier even when we don't config PROC_FS. It should be rare to have a kernel without PROC_FS, so just for completeness. Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-01-141-1/+4
| |\ \
| * | | gre_offload: simplify GRE header length calculation in gre_gso_segment()Neal Cardwell2014-01-131-10/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify the GRE header length calculation in gre_gso_segment(). Switch to an approach that is simpler, faster, and more general. The new approach will continue to be correct even if we add support for the optional variable-length routing info that may be present in a GRE header. Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Pravin B Shelar <pshelar@nicira.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | gre_offload: fix sparse non static symbol warningWei Yongjun2014-01-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes the following sparse warning: net/ipv4/gre_offload.c:253:5: warning: symbol 'gre_gro_complete' was not declared. Should it be static? Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv4: introduce hardened ip_no_pmtu_disc modeHannes Frederic Sowa2014-01-132-4/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors to be honored by protocols which do more stringent validation on the ICMP's packet payload. This knob is useful for people who e.g. want to run an unmodified DNS server in a namespace where they need to use pmtu for TCP connections (as they are used for zone transfers or fallback for requests) but don't want to use possibly spoofed UDP pmtu information. Currently the whitelisted protocols are TCP, SCTP and DCCP as they check if the returned packet is in the window or if the association is valid. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against ↵Hannes Frederic Sowa2014-01-134-8/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pmtu spoofing While forwarding we should not use the protocol path mtu to calculate the mtu for a forwarded packet but instead use the interface mtu. We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was introduced for multicast forwarding. But as it does not conflict with our usage in unicast code path it is perfect for reuse. I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular dependencies because of IPSKB_FORWARDED. Because someone might have written a software which does probe destinations manually and expects the kernel to honour those path mtus I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone can disable this new behaviour. We also still use mtus which are locked on a route for forwarding. The reason for this change is, that path mtus information can be injected into the kernel via e.g. icmp_err protocol handler without verification of local sockets. As such, this could cause the IPv4 forwarding path to wrongfully emit fragmentation needed notifications or start to fragment packets along a path. Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED won't be set and further fragmentation logic will use the path mtu to determine the fragmentation size. They also recheck packet size with help of path mtu discovery and report appropriate errors. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | tcp: metrics: Allow selective get/del of tcp-metrics based on src IPChristoph Paasch2014-01-101-10/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want to be able to get/del tcp-metrics based on the src IP. This patch adds the necessary parsing of the netlink attribute and if the source address is set, it will match on this one too. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | tcp: metrics: Delete all entries matching a certain destinationChristoph Paasch2014-01-101-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As we now can have multiple entries per destination-IP, the "ip tcp_metrics delete address ADDRESS" command deletes all of them. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | tcp: metrics: New netlink attribute for src IP and dumped in netlink replyChristoph Paasch2014-01-101-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a new netlink attribute for the source-IP and appends it to the netlink reply. Now, iproute2 can have access to the source-IP. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | tcp: metrics: Add source-address to tcp-metricsChristoph Paasch2014-01-101-9/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We add the source-address to the tcp-metrics, so that different metrics will be used per source/destination-pair. We use the destination-hash to store the metric inside the hash-table. That way, deleting and dumping via "ip tcp_metrics" is easy. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | tcp: metrics: rename tcpm_addr to tcpm_daddrChristoph Paasch2014-01-101-36/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As we will add also the source-address, we rename all accesses to the tcp-metrics address to use "daddr". Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge branch 'master' of ↵David S. Miller2014-01-094-60/+64
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables Pablo Neira Ayuso says: ==================== nf_tables updates for net-next The following patchset contains the following nf_tables updates, mostly updates from Patrick McHardy, they are: * Add the "inet" table and filter chain type for this new netfilter family: NFPROTO_INET. This special table/chain allows IPv4 and IPv6 rules, this should help to simplify the burden in the administration of dual stack firewalls. This also includes several patches to prepare the infrastructure for this new table and a new meta extension to match the layer 3 and 4 protocol numbers, from Patrick McHardy. * Load both IPv4 and IPv6 conntrack modules in nft_ct if the rule is used in NFPROTO_INET, as we don't certainly know which one would be used, also from Patrick McHardy. * Do not allow to delete a table that contains sets, otherwise these sets become orphan, from Patrick McHardy. * Hold a reference to the corresponding nf_tables family module when creating a table of that family type, to avoid the module deletion when in use, from Patrick McHardy. * Update chain counters before setting the chain policy to ensure that we don't leave the chain in inconsistent state in case of errors (aka. restore chain atomicity). This also fixes a possible leak if it fails to allocate the chain counters if no counters are passed to be restored, from Patrick McHardy. * Don't check for overflows in the table counter if we are just renaming a chain, from Patrick McHardy. * Replay the netlink request after dropping the nfnl lock to load the module that supports provides a chain type, from Patrick. * Fix chain type module references, from Patrick. * Several cleanups, function renames, constification and code refactorizations also from Patrick McHardy. * Add support to set the connmark, this can be used to set it based on the meta mark (similar feature to -j CONNMARK --restore), from Kristian Evensen. * A couple of fixes to the recently added meta/set support and nft_reject, and fix missing chain type unregistration if we fail to register our the family table/filter chain type, from myself. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | netfilter: nf_tables: fix error path in the init functionsPablo Neira Ayuso2014-01-091-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have to unregister chain type if this fails to register netns. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | netfilter: nf_tables: rename nft_do_chain_pktinfo() to nft_do_chain()Patrick McHardy2014-01-094-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't encode argument types into function names and since besides nft_do_chain() there are only AF-specific versions, there is no risk of confusion. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | netfilter: nf_tables: minor nf_chain_type cleanupsPatrick McHardy2014-01-094-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Minor nf_chain_type cleanups: - reorder struct to plug a hoe - rename struct module member to "owner" for consistency - rename nf_hookfn array to "hooks" for consistency - reorder initializers for better readability Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | netfilter: nf_tables: constify chain type definitions and pointersPatrick McHardy2014-01-094-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | netfilter: nf_tables: add missing module references to chain typesPatrick McHardy2014-01-092-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In some cases we neither take a reference to the AF info nor to the chain type, allowing the module to be unloaded while in use. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
OpenPOWER on IntegriCloud