summaryrefslogtreecommitdiffstats
path: root/net/ipv4
Commit message (Collapse)AuthorAgeFilesLines
* tcp: fix for zero packets_in_flight was too broadIlpo Järvinen2013-02-061-2/+6
| | | | | | | | | | | | | | | | There are transients during normal FRTO procedure during which the packets_in_flight can go to zero between write_queue state updates and firing the resulting segments out. As FRTO processing occurs during that window the check must be more precise to not match "spuriously" :-). More specificly, e.g., when packets_in_flight is zero but FLAG_DATA_ACKED is true the problematic branch that set cwnd into zero would not be taken and new segments might be sent out later. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Tested-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: Update MIB counters for dropsVijay Subramanian2013-02-041-1/+2
| | | | | | | | | This patch updates LINUX_MIB_LISTENDROPS in tcp_v4_conn_request() and tcp_v4_err(). tcp_v4_conn_request() in particular can drop SYNs for various reasons which are not currently tracked. Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: frto should not set snd_cwnd to 0Eric Dumazet2013-02-031-1/+2
| | | | | | | | | | | | | | | | | | | Commit 9dc274151a548 (tcp: fix ABC in tcp_slow_start()) uncovered a bug in FRTO code : tcp_process_frto() is setting snd_cwnd to 0 if the number of in flight packets is 0. As Neal pointed out, if no packet is in flight we lost our chance to disambiguate whether a loss timeout was spurious. We should assume it was a proper loss. Reported-by: Pasi Kärkkäinen <pasik@iki.fi> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix an infinite loop in tcp_slow_start()Eric Dumazet2013-02-031-4/+10
| | | | | | | | | | | | | | Since commit 9dc274151a548 (tcp: fix ABC in tcp_slow_start()), a nul snd_cwnd triggers an infinite loop in tcp_slow_start() Avoid this infinite loop and log a one time error for further analysis. FRTO code is suspected to cause this bug. Reported-by: Pasi Kärkkäinen <pasik@iki.fi> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: detect SYN/data drop when F-RTO is disabledYuchung Cheng2013-01-311-2/+1
| | | | | | | | | | | | On receiving the SYN-ACK, Fast Open checks icsk_retransmit for SYN retransmission to detect SYN/data drops. But if F-RTO is disabled, icsk_retransmit is reset at step D of tcp_fastretrans_alert() ( under tcp_ack()) before tcp_rcv_fastopen_synack(). The fix is to use total_retrans instead which accounts for SYN retransmission regardless the use of F-RTO. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: Increment LISTENOVERFLOW and LISTENDROPS in tcp_v4_conn_request()Nivedita Singhvi2013-01-291-1/+4
| | | | | | | | | | | | | | We drop a connection request if the accept backlog is full and there are sufficient packets in the syn queue to warrant starting drops. Increment the appropriate counters so this isn't silent, for accurate stats and help in debugging. This patch assumes LINUX_MIB_LISTENDROPS is a superset of/includes the counter LINUX_MIB_LISTENOVERFLOWS. Signed-off-by: Nivedita Singhvi <niv@us.ibm.com> Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* IP_GRE: Fix kernel panic in IP_GRE with GRE csum.Pravin B Shelar2013-01-281-1/+5
| | | | | | | | | | | | Due to IP_GRE GSO support, GRE can recieve non linear skb which results in panic in case of GRE_CSUM. Following patch fixes it by using correct csum API. Bug introduced in commit 6b78f16e4bdde3936b (gre: add GSO support) Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Fix route refcount on pmtu discoverySteffen Klassert2013-01-221-2/+9
| | | | | | | | | | | | | | | git commit 9cb3a50c (ipv4: Invalidate the socket cached route on pmtu events if possible) introduced a refcount problem. We don't get a refcount on the route if we get it from__sk_dst_get(), but we need one if we want to reuse this route because __sk_dst_set() releases the refcount of the old route. This patch adds proper refcount handling for that case. We introduce a 'new' flag to indicate that we are going to use a new route and we release the old route only if we replace it by a new one. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2013-01-223-9/+28
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== 1) The transport header did not point to the right place after esp/ah processing on tunnel mode in the receive path. As a result, the ECN field of the inner header was not set correctly, fixes from Li RongQing. 2) We did a null check too late in one of the xfrm_replay advance functions. This can lead to a division by zero, fix from Nickolai Zeldovich. 3) The size calculation of the hash table missed the muiltplication with the actual struct size when the hash table is freed. We might call the wrong free function, fix from Michal Kubecek. 4) On IPsec pmtu events we can't access the transport headers of the original packet, so force a relookup for all routes to notify about the pmtu event. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * xfrm4: Invalidate all ipv4 routes on IPsec pmtu eventsSteffen Klassert2013-01-213-6/+15
| | | | | | | | | | | | | | | | | | | | On IPsec pmtu events we can't access the transport headers of the original packet, so we can't find the socket that sent the packet. The only chance to notify the socket about the pmtu change is to force a relookup for all routes. This patch implenents this for the IPsec protocols. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * ah4/esp4: set transport header correctly for IPsec tunnel mode.Li RongQing2013-01-082-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | IPsec tunnel does not set ECN field to CE in inner header when the ECN field in the outer header is CE, and the ECN field in the inner header is ECT(0) or ECT(1). The cause is ipip_hdr() does not return the correct address of inner header since skb->transport-header is not the inner header after esp_input_done2(), or ah_input(). Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | ipv4: Add a socket release callback for datagram socketsSteffen Klassert2013-01-214-0/+28
| | | | | | | | | | | | | | | | | | | | This implements a socket release callback function to check if the socket cached route got invalid during the time we owned the socket. The function is used from udp, raw and ping sockets. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Invalidate the socket cached route on pmtu events if possibleSteffen Klassert2013-01-211-1/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The route lookup in ipv4_sk_update_pmtu() might return a route different from the route we cached at the socket. This is because standart routes are per cpu, so each cpu has it's own struct rtable. This means that we do not invalidate the socket cached route if the NET_RX_SOFTIRQ is not served by the same cpu that the sending socket uses. As a result, the cached route reused until we disconnect. With this patch we invalidate the socket cached route if possible. If the socket is owened by the user, we can't update the cached route directly. A followup patch will implement socket release callback functions for datagram sockets to handle this case. Reported-by: Yurij M. Plotnikov <Yurij.Plotnikov@oktetlabs.ru> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: fix incorrect LOCKDROPPEDICMPS counterEric Dumazet2013-01-201-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 563d34d057 (tcp: dont drop MTU reduction indications) added an error leading to incorrect accounting of LINUX_MIB_LOCKDROPPEDICMPS If socket is owned by the user, we want to increment this SNMP counter, unless the message is a (ICMP_DEST_UNREACH,ICMP_FRAG_NEEDED) one. Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Don't update the pmtu on mtu locked routesSteffen Klassert2013-01-171-0/+3
| | | | | | | | | | | | | | | | | | Routes with locked mtu should not use learned pmtu informations, so do not update the pmtu on these routes. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Remove output route check in ipv4_mtuSteffen Klassert2013-01-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | The output route check was introduced with git commit 261663b0 (ipv4: Don't use the cached pmtu informations for input routes) during times when we cached the pmtu informations on the inetpeer. Now the pmtu informations are back in the routes, so this check is obsolete. It also had some unwanted side effects, as reported by Timo Teras and Lukas Tribus. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: accept RST without ACK flagEric Dumazet2013-01-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c3ae62af8e755 (tcp: should drop incoming frames without ACK flag set) added a regression on the handling of RST messages. RST should be allowed to come even without ACK bit set. We validate the RST by checking the exact sequence, as requested by RFC 793 and 5961 3.2, in tcp_validate_incoming() Reported-by: Eric Wong <normalperson@yhbt.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Eric Wong <normalperson@yhbt.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: fix splice() and tcp collapsing interactionEric Dumazet2013-01-101-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under unusual circumstances, TCP collapse can split a big GRO TCP packet while its being used in a splice(socket->pipe) operation. skb_splice_bits() releases the socket lock before calling splice_to_pipe(). [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc() [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]- [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13 To fix this problem, we must eat skbs in tcp_recv_skb(). Remove the inline keyword from tcp_recv_skb() definition since it has three call sites. Reported-by: Christian Becker <c.becker@traviangames.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: splice: fix an infinite loop in tcp_read_sock()Eric Dumazet2013-01-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 02275a2ee7c0 (tcp: don't abort splice() after small transfers) added a regression. [ 83.843570] INFO: rcu_sched self-detected stall on CPU [ 83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132) [ 83.844582] Task dump for CPU 6: [ 83.844584] netperf R running task 0 8966 8952 0x0000000c [ 83.844587] 0000000000000000 0000000000000006 0000000000006c6c 0000000000000000 [ 83.844589] 000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10 [ 83.844592] ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8 [ 83.844594] Call Trace: [ 83.844596] [<ffffffff81088679>] ? vprintk_emit+0x1c9/0x4c0 [ 83.844601] [<ffffffff815ad449>] ? schedule+0x29/0x70 [ 83.844606] [<ffffffff81537bd2>] ? tcp_splice_data_recv+0x42/0x50 [ 83.844610] [<ffffffff8153beaa>] ? tcp_read_sock+0xda/0x260 [ 83.844613] [<ffffffff81537b90>] ? tcp_prequeue_process+0xb0/0xb0 [ 83.844615] [<ffffffff8153c0f0>] ? tcp_splice_read+0xc0/0x250 [ 83.844618] [<ffffffff814dc0c2>] ? sock_splice_read+0x22/0x30 [ 83.844622] [<ffffffff811b820b>] ? do_splice_to+0x7b/0xa0 [ 83.844627] [<ffffffff811ba4bc>] ? sys_splice+0x59c/0x5d0 [ 83.844630] [<ffffffff8119745b>] ? putname+0x2b/0x40 [ 83.844633] [<ffffffff8118bcb4>] ? do_sys_open+0x174/0x1e0 [ 83.844636] [<ffffffff815b6202>] ? system_call_fastpath+0x16/0x1b if recv_actor() returns 0, we should stop immediately, because looping wont give a chance to drain the pipe. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: prevent setting ttl=0 via IP_TTLCong Wang2013-01-081-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | A regression is introduced by the following commit: commit 4d52cfbef6266092d535237ba5a4b981458ab171 Author: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue Jun 2 00:42:16 2009 -0700 net: ipv4/ip_sockglue.c cleanups Pure cleanups but it is not a pure cleanup... - if (val != -1 && (val < 1 || val>255)) + if (val != -1 && (val < 0 || val > 255)) Since there is no reason provided to allow ttl=0, change it back. Reported-by: nitin padalia <padalia.nitin@gmail.com> Cc: nitin padalia <padalia.nitin@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: fix NULL checking in devinet_ioctl()Xi Wang2013-01-061-1/+1
| | | | | | | | | | | The NULL pointer check `!ifa' should come before its first use. [ Bug origin : commit fd23c3b31107e2fc483301ee923d8a1db14e53f4 (ipv4: Add hash table of interface addresses) in linux-2.6.39 ] Signed-off-by: Xi Wang <xi.wang@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/ipv4/ipconfig: really display the BOOTP/DHCP server's address.Philippe De Muyter2013-01-041-2/+6
| | | | | | | | | Up to now, the debug and info messages from the ipconfig subsytem claim to display the IP address of the DHCP/BOOTP server but display instead the IP address of the bootserver. Fix that. Signed-off-by: Philippe De Muyter <phdm@macqel.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of git://1984.lsi.us.es/nfDavid S. Miller2012-12-282-5/+11
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== The following batch contains Netfilter fixes for 3.8-rc1. They are a mixture of old bugs that have passed unnoticed (I'll pass these to stable) and more fresh ones from the previous merge window, they are: * Fix for MAC address in 6in4 tunnels via NFLOG that results in ulogd showing up wrong address, from Bob Hockney. * Fix a comment in nf_conntrack_ipv6, from Florent Fourcot. * Fix a leak an error path in ctnetlink while creating an expectation, from Jesper Juhl. * Fix missing ICMP time exceeded in the IPv6 defragmentation code, from Haibo Xi. * Fix inconsistent handling of routing changes in MASQUERADE for the new connections case, from Andrew Collins. * Fix a missing skb_reset_transport in ip[6]t_REJECT that leads to crashes in the ixgbe driver (since it seems to access the transport header with TSO enabled), from Mukund Jampala. * Recover obsoleted NOTRACK target by including it into the CT and spot a warning via printk about being obsoleted. Many people don't check the scheduled to be removal file under Documentation, so we follow some less agressive approach to kill this in a year or so. Spotted by Florian Westphal, patch from myself. * Fix race condition in xt_hashlimit that allows to create two or more entries, from myself. * Fix crash if the CT is used due to the recently added facilities to consult the dying and unconfirmed conntrack lists, from myself. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * netfilter: nf_nat: Also handle non-ESTABLISHED routing changes in MASQUERADEAndrew Collins2012-12-161-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since (a0ecb85 netfilter: nf_nat: Handle routing changes in MASQUERADE target), the MASQUERADE target handles routing changes which affect the output interface of a connection, but only for ESTABLISHED connections. It is also possible for NEW connections which already have a conntrack entry to be affected by routing changes. This adds a check to drop entries in the NEW+conntrack state when the oif has changed. Signed-off-by: Andrew Collins <bsderandrew@gmail.com> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: ip[6]t_REJECT: fix wrong transport header pointer in TCP resetMukund Jampala2012-12-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem occurs when iptables constructs the tcp reset packet. It doesn't initialize the pointer to the tcp header within the skb. When the skb is passed to the ixgbe driver for transmit, the ixgbe driver attempts to access the tcp header and crashes. Currently, other drivers (such as our 1G e1000e or igb drivers) don't access the tcp header on transmit unless the TSO option is turned on. <1>BUG: unable to handle kernel NULL pointer dereference at 0000000d <1>IP: [<d081621c>] ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe] <4>*pdpt = 0000000085e5d001 *pde = 0000000000000000 <0>Oops: 0000 [#1] SMP [...] <4>Pid: 0, comm: swapper Tainted: P 2.6.35.12 #1 Greencity/Thurley <4>EIP: 0060:[<d081621c>] EFLAGS: 00010246 CPU: 16 <4>EIP is at ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe] <4>EAX: c7628820 EBX: 00000007 ECX: 00000000 EDX: 00000000 <4>ESI: 00000008 EDI: c6882180 EBP: dfc6b000 ESP: ced95c48 <4> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 <0>Process swapper (pid: 0, ti=ced94000 task=ced73bd0 task.ti=ced94000) <0>Stack: <4> cbec7418 c779e0d8 c77cc888 c77cc8a8 0903010a 00000000 c77c0008 00000002 <4><0> cd4997c0 00000010 dfc6b000 00000000 d0d176c9 c77cc8d8 c6882180 cbec7318 <4><0> 00000004 00000004 cbec7230 cbec7110 00000000 cbec70c0 c779e000 00000002 <0>Call Trace: <4> [<d0d176c9>] ? 0xd0d176c9 <4> [<d0d18a4d>] ? 0xd0d18a4d <4> [<411e243e>] ? dev_hard_start_xmit+0x218/0x2d7 <4> [<411f03d7>] ? sch_direct_xmit+0x4b/0x114 <4> [<411f056a>] ? __qdisc_run+0xca/0xe0 <4> [<411e28b0>] ? dev_queue_xmit+0x2d1/0x3d0 <4> [<411e8120>] ? neigh_resolve_output+0x1c5/0x20f <4> [<411e94a1>] ? neigh_update+0x29c/0x330 <4> [<4121cf29>] ? arp_process+0x49c/0x4cd <4> [<411f80c9>] ? nf_hook_slow+0x3f/0xac <4> [<4121ca8d>] ? arp_process+0x0/0x4cd <4> [<4121ca8d>] ? arp_process+0x0/0x4cd <4> [<4121c6d5>] ? T.901+0x38/0x3b <4> [<4121c918>] ? arp_rcv+0xa3/0xb4 <4> [<4121ca8d>] ? arp_process+0x0/0x4cd <4> [<411e1173>] ? __netif_receive_skb+0x32b/0x346 <4> [<411e19e1>] ? netif_receive_skb+0x5a/0x5f <4> [<411e1ea9>] ? napi_skb_finish+0x1b/0x30 <4> [<d0816eb4>] ? ixgbe_xmit_frame_ring+0x1564/0x2260 [ixgbe] <4> [<41013468>] ? lapic_next_event+0x13/0x16 <4> [<410429b2>] ? clockevents_program_event+0xd2/0xe4 <4> [<411e1b03>] ? net_rx_action+0x55/0x127 <4> [<4102da1a>] ? __do_softirq+0x77/0xeb <4> [<4102dab1>] ? do_softirq+0x23/0x27 <4> [<41003a67>] ? do_IRQ+0x7d/0x8e <4> [<41002a69>] ? common_interrupt+0x29/0x30 <4> [<41007bcf>] ? mwait_idle+0x48/0x4d <4> [<4100193b>] ? cpu_idle+0x37/0x4c <0>Code: df 09 d7 0f 94 c2 0f b6 d2 e9 e7 fb ff ff 31 db 31 c0 e9 38 ff ff ff 80 78 06 06 0f 85 3e fb ff ff 8b 7c 24 38 8b 8f b8 00 00 00 <0f> b6 51 0d f6 c2 01 0f 85 27 fb ff ff 80 e2 02 75 0d 8b 6c 24 <0>EIP: [<d081621c>] ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe] SS:ESP Signed-off-by: Mukund Jampala <jbmukund@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | ipv4/ip_gre: set transport header correctly to gre headerIsaku Yamahata2012-12-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ipgre_tunnel_xmit() incorrectly sets transport header to inner payload instead of GRE header. It seems copy-and-pasted from ipip.c. So set transport header to gre header. (In ipip case the transport header is the inner ip header, so that's correct.) Found by inspection. In practice the incorrect transport header doesn't matter because the skb usually is sent to another net_device or socket, so the transport header isn't referenced. Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: should drop incoming frames without ACK flag setEric Dumazet2012-12-261-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 96e0bf4b5193d (tcp: Discard segments that ack data not yet sent) John Dykstra enforced a check against ack sequences. In commit 354e4aa391ed5 (tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation) I added more safety tests. But we missed fact that these tests are not performed if ACK bit is not set. RFC 793 3.9 mandates TCP should drop a frame without ACK flag set. " fifth check the ACK field, if the ACK bit is off drop the segment and return" Not doing so permits an attacker to only guess an acceptable sequence number, evading stronger checks. Many thanks to Zhiyun Qian for bringing this issue to our attention. See : http://web.eecs.umich.edu/~zhiyunq/pub/ccs12_TCP_sequence_number_inference.pdf Reported-by: Zhiyun Qian <zhiyunq@umich.edu> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: John Dykstra <john.dykstra1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | arp: fix a regression in arp_solicit()Cong Wang2012-12-241-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sedat reported the following commit caused a regression: commit 9650388b5c56578fdccc79c57a8c82fb92b8e7f1 Author: Eric Dumazet <edumazet@google.com> Date: Fri Dec 21 07:32:10 2012 +0000 ipv4: arp: fix a lockdep splat in arp_solicit This is due to the 6th parameter of arp_send() needs to be NULL for the broadcast case, the above commit changed it to an all-zero array by mistake. Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: arp: fix a lockdep splat in arp_solicit()Eric Dumazet2012-12-211-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Yan Burman reported following lockdep warning : ============================================= [ INFO: possible recursive locking detected ] 3.7.0+ #24 Not tainted --------------------------------------------- swapper/1/0 is trying to acquire lock: (&n->lock){++--..}, at: [<ffffffff8139f56e>] __neigh_event_send +0x2e/0x2f0 but task is already holding lock: (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit+0x1d4/0x280 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&n->lock); lock(&n->lock); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by swapper/1/0: #0: (((&n->timer))){+.-...}, at: [<ffffffff8104b350>] call_timer_fn+0x0/0x1c0 #1: (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit +0x1d4/0x280 #2: (rcu_read_lock_bh){.+....}, at: [<ffffffff81395400>] dev_queue_xmit+0x0/0x5d0 #3: (rcu_read_lock_bh){.+....}, at: [<ffffffff813cb41e>] ip_finish_output+0x13e/0x640 stack backtrace: Pid: 0, comm: swapper/1 Not tainted 3.7.0+ #24 Call Trace: <IRQ> [<ffffffff8108c7ac>] validate_chain+0xdcc/0x11f0 [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30 [<ffffffff81120565>] ? kmem_cache_free+0xe5/0x1c0 [<ffffffff8108d570>] __lock_acquire+0x440/0xc30 [<ffffffff813c3570>] ? inet_getpeer+0x40/0x600 [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30 [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0 [<ffffffff8108ddf5>] lock_acquire+0x95/0x140 [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0 [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30 [<ffffffff81448d4b>] _raw_write_lock_bh+0x3b/0x50 [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0 [<ffffffff8139f56e>] __neigh_event_send+0x2e/0x2f0 [<ffffffff8139f99b>] neigh_resolve_output+0x16b/0x270 [<ffffffff813cb62d>] ip_finish_output+0x34d/0x640 [<ffffffff813cb41e>] ? ip_finish_output+0x13e/0x640 [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan] [<ffffffff813cb9a0>] ip_output+0x80/0xf0 [<ffffffff813ca368>] ip_local_out+0x28/0x80 [<ffffffffa046f25a>] vxlan_xmit+0x66a/0xbec [vxlan] [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan] [<ffffffff81394a50>] ? skb_gso_segment+0x2b0/0x2b0 [<ffffffff81449355>] ? _raw_spin_unlock_irqrestore+0x65/0x80 [<ffffffff81394c57>] ? dev_queue_xmit_nit+0x207/0x270 [<ffffffff813950c8>] dev_hard_start_xmit+0x298/0x5d0 [<ffffffff813956f3>] dev_queue_xmit+0x2f3/0x5d0 [<ffffffff81395400>] ? dev_hard_start_xmit+0x5d0/0x5d0 [<ffffffff813f5788>] arp_xmit+0x58/0x60 [<ffffffff813f59db>] arp_send+0x3b/0x40 [<ffffffff813f6424>] arp_solicit+0x204/0x280 [<ffffffff813a1a70>] ? neigh_add+0x310/0x310 [<ffffffff8139f515>] neigh_probe+0x45/0x70 [<ffffffff813a1c10>] neigh_timer_handler+0x1a0/0x2a0 [<ffffffff8104b3cf>] call_timer_fn+0x7f/0x1c0 [<ffffffff8104b350>] ? detach_if_pending+0x120/0x120 [<ffffffff8104b748>] run_timer_softirq+0x238/0x2b0 [<ffffffff813a1a70>] ? neigh_add+0x310/0x310 [<ffffffff81043e51>] __do_softirq+0x101/0x280 [<ffffffff814518cc>] call_softirq+0x1c/0x30 [<ffffffff81003b65>] do_softirq+0x85/0xc0 [<ffffffff81043a7e>] irq_exit+0x9e/0xc0 [<ffffffff810264f8>] smp_apic_timer_interrupt+0x68/0xa0 [<ffffffff8145122f>] apic_timer_interrupt+0x6f/0x80 <EOI> [<ffffffff8100a054>] ? mwait_idle+0xa4/0x1c0 [<ffffffff8100a04b>] ? mwait_idle+0x9b/0x1c0 [<ffffffff8100a6a9>] cpu_idle+0x89/0xe0 [<ffffffff81441127>] start_secondary+0x1b2/0x1b6 Bug is from arp_solicit(), releasing the neigh lock after arp_send() In case of vxlan, we eventually need to write lock a neigh lock later. Its a false positive, but we can get rid of it without lockdep annotations. We can instead use neigh_ha_snapshot() helper. Reported-by: Yan Burman <yanb@mellanox.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip_gre: fix possible use after freeEric Dumazet2012-12-211-1/+5
| | | | | | | | | | | | | | | | | | | | | | Once skb_realloc_headroom() is called, tiph might point to freed memory. Cache tiph->ttl value before the reallocation, to avoid unexpected behavior. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Isaku Yamahata <yamahata@valinux.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip_gre: make ipgre_tunnel_xmit() not parse network header as IP unconditionallyIsaku Yamahata2012-12-211-1/+4
|/ | | | | | | | | | | ipgre_tunnel_xmit() parses network header as IP unconditionally. But transmitting packets are not always IP packet. For example such packet can be sent by packet socket with sockaddr_ll.sll_protocol set. So make the function check if skb->protocol is IP. Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sockChristoph Paasch2012-12-142-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If in either of the above functions inet_csk_route_child_sock() or __inet_inherit_port() fails, the newsk will not be freed: unreferenced object 0xffff88022e8a92c0 (size 1592): comm "softirq", pid 0, jiffies 4294946244 (age 726.160s) hex dump (first 32 bytes): 0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................ 02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e [<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5 [<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd [<ffffffff8149b784>] sk_clone_lock+0x16/0x21e [<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b [<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481 [<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b [<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416 [<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc [<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701 [<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4 [<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f [<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233 [<ffffffff814cee68>] ip_rcv+0x217/0x267 [<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553 [<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82 This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus a single sock_put() is not enough to free the memory. Additionally, things like xfrm, memcg, cookie_values,... may have been initialized. We have to free them properly. This is fixed by forcing a call to tcp_done(), ending up in inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary, because it ends up doing all the cleanup on xfrm, memcg, cookie_values, xfrm,... Before calling tcp_done, we have to set the socket to SOCK_DEAD, to force it entering inet_csk_destroy_sock. To avoid the warning in inet_csk_destroy_sock, inet_num has to be set to 0. As inet_csk_destroy_sock does a dec on orphan_count, we first have to increase it. Calling tcp_done() allows us to remove the calls to tcp_clear_xmit_timer() and tcp_cleanup_congestion_control(). A similar approach is taken for dccp by calling dccp_done(). This is in the kernel since 093d282321 (tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()), thus since version >= 2.6.37. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'for-linus' of ↵Linus Torvalds2012-12-131-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial branch from Jiri Kosina: "Usual stuff -- comment/printk typo fixes, documentation updates, dead code elimination." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) HOWTO: fix double words typo x86 mtrr: fix comment typo in mtrr_bp_init propagate name change to comments in kernel source doc: Update the name of profiling based on sysfs treewide: Fix typos in various drivers treewide: Fix typos in various Kconfig wireless: mwifiex: Fix typo in wireless/mwifiex driver messages: i2o: Fix typo in messages/i2o scripts/kernel-doc: check that non-void fcts describe their return value Kernel-doc: Convention: Use a "Return" section to describe return values radeon: Fix typo and copy/paste error in comments doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c various: Fix spelling of "asynchronous" in comments. Fix misspellings of "whether" in comments. eisa: Fix spelling of "asynchronous". various: Fix spelling of "registered" in comments. doc: fix quite a few typos within Documentation target: iscsi: fix comment typos in target/iscsi drivers treewide: fix typo of "suport" in various comments and Kconfig treewide: fix typo of "suppport" in various comments ...
| * treewide: fix typo of "suport" in various comments and KconfigMasanari Iida2012-11-191-1/+1
| | | | | | | | | | Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2012-12-1232-257/+815
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking changes from David Miller: 1) Allow to dump, monitor, and change the bridge multicast database using netlink. From Cong Wang. 2) RFC 5961 TCP blind data injection attack mitigation, from Eric Dumazet. 3) Networking user namespace support from Eric W. Biederman. 4) tuntap/virtio-net multiqueue support by Jason Wang. 5) Support for checksum offload of encapsulated packets (basically, tunneled traffic can still be checksummed by HW). From Joseph Gasparakis. 6) Allow BPF filter access to VLAN tags, from Eric Dumazet and Daniel Borkmann. 7) Bridge port parameters over netlink and BPDU blocking support from Stephen Hemminger. 8) Improve data access patterns during inet socket demux by rearranging socket layout, from Eric Dumazet. 9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and Jon Maloy. 10) Update TCP socket hash sizing to be more in line with current day realities. The existing heurstics were choosen a decade ago. From Eric Dumazet. 11) Fix races, queue bloat, and excessive wakeups in ATM and associated drivers, from Krzysztof Mazur and David Woodhouse. 12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions in VXLAN driver, from David Stevens. 13) Add "oops_only" mode to netconsole, from Amerigo Wang. 14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also allow DCB netlink to work on namespaces other than the initial namespace. From John Fastabend. 15) Support PTP in the Tigon3 driver, from Matt Carlson. 16) tun/vhost zero copy fixes and improvements, plus turn it on by default, from Michael S. Tsirkin. 17) Support per-association statistics in SCTP, from Michele Baldessari. And many, many, driver updates, cleanups, and improvements. Too numerous to mention individually. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits) net/mlx4_en: Add support for destination MAC in steering rules net/mlx4_en: Use generic etherdevice.h functions. net: ethtool: Add destination MAC address to flow steering API bridge: add support of adding and deleting mdb entries bridge: notify mdb changes via netlink ndisc: Unexport ndisc_{build,send}_skb(). uapi: add missing netconf.h to export list pkt_sched: avoid requeues if possible solos-pci: fix double-free of TX skb in DMA mode bnx2: Fix accidental reversions. bna: Driver Version Updated to 3.1.2.1 bna: Firmware update bna: Add RX State bna: Rx Page Based Allocation bna: TX Intr Coalescing Fix bna: Tx and Rx Optimizations bna: Code Cleanup and Enhancements ath9k: check pdata variable before dereferencing it ath5k: RX timestamp is reported at end of frame ath9k_htc: RX timestamp is reported at end of frame ...
| * | net: remove obsolete simple_strto<foo>Abhijit Pawar2012-12-101-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | This patch replace the obsolete simple_strto<foo> with kstrto<foo> Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: Handle encapsulated offloads before fragmentation or handing to lower devAlexander Duyck2012-12-091-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change allows the VXLAN to enable Tx checksum offloading even on devices that do not support encapsulated checksum offloads. The advantage to this is that it allows for the lower device to change due to routing table changes without impacting features on the VXLAN itself. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4/route/rtnl: get mcast attributes when dst is multicastNicolas Dichtel2012-12-071-2/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit f1ce3062c538 (ipv4: Remove 'rt_dst' from 'struct rtable') removes the call to ipmr_get_route(), which will get multicast parameters of the route. I revert the part of the patch that remove this call. I think the goal was only to get rid of rt_dst field. The patch is only compiled-tested. My first idea was to remove ipmr_get_route() because rt_fill_info() was the only user, but it seems the previous patch cleans the code a bit too much ;-) Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipmr: advertise new mfc entries via rtnlNicolas Dichtel2012-12-041-5/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows to monitor mfc activities via rtnetlink. To avoid parsing two times the mfc oifs, we use maxvif to allocate the rtnl msg, thus we may allocate some superfluous space. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipmr/ip6mr: allow to get unresolved cache via netlinkNicolas Dichtel2012-12-041-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | /proc/net/ip[6]_mr_cache allows to get all mfc entries, even if they are put in the unresolved list (mfc[6]_unres_queue). But only the table RT_TABLE_DEFAULT is displayed. This patch adds the parsing of the unresolved list when the dump is made via rtnetlink, hence each table can be checked. In IPv6, we set rtm_type in ip6mr_fill_mroute(), because in case of unresolved mfc __ip6mr_fill_mroute() will not set it. In IPv4, it is already done. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipmr/ip6mr: report origin of mfc entry into rtnl msgNicolas Dichtel2012-12-041-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | A mfc entry can be static or not (added via the mroute_sk socket). The patch reports MFC_STATIC flag into rtm_protocol by setting rtm_protocol to RTPROT_STATIC or RTPROT_MROUTED. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipmr/ip6mr: advertise mfc stats via rtnetlinkNicolas Dichtel2012-12-041-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These statistics can be checked only via /proc/net/ip_mr_cache or SIOCGETSGCNT[_IN6] and thus only for the table RT_TABLE_DEFAULT. Advertising them via rtnetlink allows to get statistics for all cache entries, whatever the table is. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netconf: advertise mc_forwarding statusNicolas Dichtel2012-12-042-2/+20
| | | | | | | | | | | | | | | | | | | | | | | | This patch advertise the MC_FORWARDING status for IPv4 and IPv6. This field is readonly, only multicast engine in the kernel updates it. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: nf_nat: Handle routing changes in MASQUERADE targetJozsef Kadlecsik2012-12-031-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the route changes (backup default route, VPNs) which affect a masqueraded target, the packets were sent out with the outdated source address. The patch addresses the issue by comparing the outgoing interface directly with the masqueraded interface in the nat table. Events are inefficient in this case, because it'd require adding route events to the network core and then scanning the whole conntrack table and re-checking the route for all entry. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | tcp: don't abort splice() after small transfersWilly Tarreau2012-12-021-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP coalescing added a regression in splice(socket->pipe) performance, for some workloads because of the way tcp_read_sock() is implemented. The reason for this is the break when (offset + 1 != skb->len). As we released the socket lock, this condition is possible if TCP stack added a fragment to the skb, which can happen with TCP coalescing. So let's go back to the beginning of the loop when this happens, to give a chance to splice more frags per system call. Doing so fixes the issue and makes GRO 10% faster than LRO on CPU-bound splice() workloads instead of the opposite. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: change default tcp hash sizeEric Dumazet2012-12-011-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As time passed, available memory increased faster than number of concurrent tcp sockets. As a result, a machine with 4GB of ram gets a hash table with 524288 slots, using 8388608 bytes of memory. Lets change that by a 16x factor (one slot for 128 KB of ram) Even if a small machine needs a _lot_ of sockets, tcp lookups are now very efficient, using one cache line per socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: move inet_dport/inet_num in sock_commonEric Dumazet2012-11-301-13/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 68835aba4d9b (net: optimize INET input path further) moved some fields used for tcp/udp sockets lookup in the first cache line of struct sock_common. This patch moves inet_dport/inet_num as well, filling a 32bit hole on 64 bit arches and reducing number of cache line misses in lookups. Also change INET_MATCH()/INET_TW_MATCH() to perform the ports match before addresses match, as this check is more discriminant. Remove the hash check from MATCH() macros because we dont need to re validate the hash value after taking a refcount on socket, and use likely/unlikely compiler hints, as the sk_hash/hash check makes the following conditional tests 100% predicted by cpu. Introduce skc_addrpair/skc_portpair pair values to better document the alignment requirements of the port/addr pairs used in the various MATCH() macros, and remove some casts. The namespace check can also be done at last. This slightly improves TCP/UDP lookup times. IP/TCP early demux needs inet->rx_dst_ifindex and TCP needs inet->min_ttl, lets group them together in same cache line. With help from Ben Hutchings & Joe Perches. Idea of this patch came after Ling Ma proposal to move skc_hash to the beginning of struct sock_common, and should allow him to submit a final version of his patch. My tests show an improvement doing so. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Joe Perches <joe@perches.com> Cc: Ling Ma <ling.ma.program@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2012-11-292-1/+6
| |\ \ | | | | | | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv4/ipmr and ipv6/ip6mr: Convert int mroute_do_<foo> to boolJoe Perches2012-11-251-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Save a few bytes per table by convert mroute_do_assert and mroute_do_pim from int to bool. Remove !! as the compiler does that when assigning int to bool. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv4: ipmr: various fixes and cleanupsEric Dumazet2012-11-251-7/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1) ip_mroute_setsockopt() & ip_mroute_getsockopt() should not access/set raw_sk(sk)->ipmr_table before making sure the socket is a raw socket, and protocol is IGMP 2) MRT_INIT should return -EINVAL if optlen != sizeof(int), not -ENOPROTOOPT 3) MRT_ASSERT & MRT_PIM should validate optlen 4) " (v) ? 1 : 0 " can be written as " !!v " Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud