summaryrefslogtreecommitdiffstats
path: root/net/ipv4/tcp_input.c
Commit message (Collapse)AuthorAgeFilesLines
...
* | tcp: move tcp_openreq_init() to tcp_input.cEric Dumazet2015-03-171-0/+25
| | | | | | | | | | | | | | | | This big helper is called once from tcp_conn_request(), there is no point having it in an include. Compiler will inline it anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: fill request sock ir_iif for IPv4Eric Dumazet2015-03-141-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Once request socks will be in ehash table, they will need to have a valid ir_iff field. This is currently true only for IPv6. This patch extends support for IPv4 as well. This means inet_diag_fill_req() can now properly use ir_iif, which is better for IPv6 link locals anyway, as request sockets and established sockets will propagate consistent netlink idiag_if. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: add missing ireq_net & ir_cookie initializationsEric Dumazet2015-03-121-1/+0
| | | | | | | | | | | | | | | | | | | | | | I forgot to update dccp_v6_conn_request() & cookie_v6_check(). They both need to set ireq->ireq_net and ireq->ir_cookie Lets clear ireq->ir_cookie in inet_reqsk_alloc() Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 33cf7c90fe2f ("net: add real socket cookies") Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: fix CONFIG_NET_NS=n compilationEric Dumazet2015-03-111-1/+1
| | | | | | | | | | | | | | | | | | I forgot to use write_pnet() in three locations. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 33cf7c90fe2f9 ("net: add real socket cookies") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: add real socket cookiesEric Dumazet2015-03-111-0/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | A long standing problem in netlink socket dumps is the use of kernel socket addresses as cookies. 1) It is a security concern. 2) Sockets can be reused quite quickly, so there is no guarantee a cookie is used once and identify a flow. 3) request sock, establish sock, and timewait socks for a given flow have different cookies. Part of our effort to bring better TCP statistics requires to switch to a different allocator. In this patch, I chose to use a per network namespace 64bit generator, and to use it only in the case a socket needs to be dumped to netlink. (This might be refined later if needed) Note that I tried to carry cookies from request sock, to establish sock, then timewait sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eric Salo <salo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix tcp_should_expand_sndbuf() to use tcp_packets_in_flight()Neal Cardwell2015-02-221-1/+1
| | | | | | | | | | | | | | | | | tcp_should_expand_sndbuf() does not expand the send buffer if we have filled the congestion window. However, it should use tcp_packets_in_flight() instead of tp->packets_out to make this check. Testing has established that the difference matters a lot if there are many SACKed packets, causing a needless performance shortfall. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: mitigate ACK loops for connections as tcp_sockNeal Cardwell2015-02-081-7/+22
| | | | | | | | | | | | | | | | | | | | | | Ensure that in state ESTABLISHED, where the connection is represented by a tcp_sock, we rate limit dupacks in response to incoming packets (a) with TCP timestamps that fail PAWS checks, or (b) with sequence numbers or ACK numbers that are out of the acceptable window. We do not send a dupack in response to out-of-window packets if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a dupack in response to an out-of-window packet. There is already a similar (although global) rate-limiting mechanism for "challenge ACKs". When deciding whether to send a challence ACK, we first consult the new per-connection rate limit, and then the global rate limit. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacksNeal Cardwell2015-02-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Helpers for mitigating ACK loops by rate-limiting dupacks sent in response to incoming out-of-window packets. This patch includes: - rate-limiting logic - sysctl to control how often we allow dupacks to out-of-window packets - SNMP counter for cases where we rate-limited our dupack sending The rate-limiting logic in this patch decides to not send dupacks in response to out-of-window segments if (a) they are SYNs or pure ACKs and (b) the remote endpoint is sending them faster than the configured rate limit. We rate-limit our responses rather than blocking them entirely or resetting the connection, because legitimate connections can rely on dupacks in response to some out-of-window segments. For example, zero window probes are typically sent with a sequence number that is below the current window, and ZWPs thus expect to thus elicit a dupack in response. We allow dupacks in response to TCP segments with data, because these may be spurious retransmissions for which the remote endpoint wants to receive DSACKs. This is safe because segments with data can't realistically be part of ACK loops, which by their nature consist of each side sending pure/data-less ACKs to each other. The dupack interval is controlled by a new sysctl knob, tcp_invalid_ratelimit, given in milliseconds, in case an administrator needs to dial this upward in the face of a high-rate DoS attack. The name and units are chosen to be analogous to the existing analogous knob for ICMP, icmp_ratelimit. The default value for tcp_invalid_ratelimit is 500ms, which allows at most one such dupack per 500ms. This is chosen to be 2x faster than the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule 2.4). We allow the extra 2x factor because network delay variations can cause packets sent at 1 second intervals to be compressed and arrive much closer. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dctcp: loosen requirement to assert ECT(0) during 3WHSFlorian Westphal2015-02-021-9/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One deployment requirement of DCTCP is to be able to run in a DC setting along with TCP traffic. As Glenn Judd's NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter" [1] (tba) explains, one way to solve this on switch side is to split DCTCP and TCP traffic in two queues per switch port based on the DSCP: one queue soley intended for DCTCP traffic and one for non-DCTCP traffic. For the DCTCP queue, there's the marking threshold K as explained in commit e3118e8359bb ("net: tcp: add DCTCP congestion control algorithm") for RED marking ECT(0) packets with CE. For the non-DCTCP queue, there's f.e. a classic tail drop queue. As already explained in e3118e8359bb, running DCTCP at scale when not marking SYN/SYN-ACK packets with ECT(0) has severe consequences as for non-ECT(0) packets, traversing the RED marking DCTCP queue will result in a severe reduction of connection probability. This is due to the DCTCP queue being dominated by ECT(0) traffic and switches handle non-ECT traffic in the RED marking queue after passing K as drops, where K is usually a low watermark in order to leave enough tailroom for bursts. Splitting DCTCP traffic among several queues (ECN and non-ECN queue) is being considered a terrible idea in the network community as it splits single flows across multiple network paths. Therefore, commit e3118e8359bb implements this on Linux as ECT(0) marked traffic, as we argue that marking all packets of a DCTCP flow is the only viable solution and also doesn't speak against the draft. However, recently, a DCTCP implementation for FreeBSD hit also their mainline kernel [2]. In order to let them play well together with Linux' DCTCP, we would need to loosen the requirement that ECT(0) has to be asserted during the 3WHS as not implemented in FreeBSD. This simplifies the ECN test and lets DCTCP work together with FreeBSD. Joint work with Daniel Borkmann. [1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd [2] https://github.com/freebsd/freebsd/commit/8ad879445281027858a7fa706d13e458095b595f Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Glenn Judd <glenn.judd@morganstanley.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: use SACK RTTs for CCKenneth Klette Jonassen2015-01-311-2/+4
| | | | | | | | | | | | | | Current behavior only passes RTTs from sequentially acked data to CC. If sender gets a combined ACK for segment 1 and SACK for segment 3, then the computed RTT for CC is the time between sending segment 1 and receiving SACK for segment 3. Pass the minimum computed RTT from any acked data to CC, i.e. time between sending segment 3 and receiving SACK for segment 3. Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: avoid reducing cwnd when ACK+DSACK is receivedSébastien Barré2015-01-131-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With TLP, the peer may reply to a probe with an ACK+D-SACK, with ack value set to tlp_high_seq. In the current code, such ACK+DSACK will be missed and only at next, higher ack will the TLP episode be considered done. Since the DSACK is not present anymore, this will cost a cwnd reduction. This patch ensures that this scenario does not cause a cwnd reduction, since receiving an ACK+DSACK indicates that both the initial segment and the probe have been received by the peer. The following packetdrill test, from Neal Cardwell, validates this patch: // Establish a connection. 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6> +.020 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 // Send 1 packet. +0 write(4, ..., 1000) = 1000 +0 > P. 1:1001(1000) ack 1 // Loss probe retransmission. // packets_out == 1 => schedule PTO in max(2*RTT, 1.5*RTT + 200ms) // In this case, this means: 1.5*RTT + 200ms = 230ms +.230 > P. 1:1001(1000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10 }% // Receiver ACKs at tlp_high_seq with a DSACK, // indicating they received the original packet and probe. +.020 < . 1:1(0) ack 1001 win 257 <sack 1:1001,nop,nop> +0 %{ assert tcpi_snd_cwnd == 10 }% // Send another packet. +0 write(4, ..., 1000) = 1000 +0 > P. 1001:2001(1000) ack 1 // Receiver ACKs above tlp_high_seq, which should end the TLP episode // if we haven't already. We should not reduce cwnd. +.020 < . 1:1(0) ack 2001 win 257 +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }% Credits: -Gregory helped in finding that tcp_process_tlp_ack was where the cwnd got reduced in our MPTCP tests. -Neal wrote the packetdrill test above -Yuchung reworked the patch to make it more readable. Cc: Gregory Detal <gregory.detal@uclouvain.be> Cc: Nandita Dukkipati <nanditad@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* switch tcp_sock->ucopy from iovec (ucopy.iov) to msghdr (ucopy.msg)Al Viro2014-12-091-4/+3
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* new helper: memcpy_from_msg()Al Viro2014-11-241-1/+1
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-11-211-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ieee802154/fakehard.c A bug fix went into 'net' for ieee802154/fakehard.c, which is removed in 'net-next'. Add build fix into the merge from Stephen Rothwell in openvswitch, the logging macros take a new initial 'log' argument, a new call was added in 'net' so when we merge that in here we have to explicitly add the new 'log' arg to it else the build fails. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: Restore RFC5961-compliant behavior for SYN packetsCalvin Owens2014-11-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c3ae62af8e755 ("tcp: should drop incoming frames without ACK flag set") was created to mitigate a security vulnerability in which a local attacker is able to inject data into locally-opened sockets by using TCP protocol statistics in procfs to quickly find the correct sequence number. This broke the RFC5961 requirement to send a challenge ACK in response to spurious RST packets, which was subsequently fixed by commit 7b514a886ba50 ("tcp: accept RST without ACK flag"). Unfortunately, the RFC5961 requirement that spurious SYN packets be handled in a similar manner remains broken. RFC5961 section 4 states that: ... the handling of the SYN in the synchronized state SHOULD be performed as follows: 1) If the SYN bit is set, irrespective of the sequence number, TCP MUST send an ACK (also referred to as challenge ACK) to the remote peer: <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> After sending the acknowledgment, TCP MUST drop the unacceptable segment and stop processing further. By sending an ACK, the remote peer is challenged to confirm the loss of the previous connection and the request to start a new connection. A legitimate peer, after restart, would not have a TCB in the synchronized state. Thus, when the ACK arrives, the peer should send a RST segment back with the sequence number derived from the ACK field that caused the RST. This RST will confirm that the remote peer has indeed closed the previous connection. Upon receipt of a valid RST, the local TCP endpoint MUST terminate its connection. The local TCP endpoint should then rely on SYN retransmission from the remote end to re-establish the connection. This patch lets SYN packets through the discard added in c3ae62af8e755, so that spurious SYN packets are properly dealt with as per the RFC. The challenge ACK is sent unconditionally and is rate-limited, so the original vulnerability is not reintroduced by this patch. Signed-off-by: Calvin Owens <calvinowens@fb.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Convert LIMIT_NETDEBUG to net_dbg_ratelimitedJoe Perches2014-11-111-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the more common dynamic_debug capable net_dbg_ratelimited and remove the LIMIT_NETDEBUG macro. All messages are still ratelimited. Some KERN_<LEVEL> uses are changed to KERN_DEBUG. This may have some negative impact on messages that were emitted at KERN_INFO that are not not enabled at all unless DEBUG is defined or dynamic_debug is enabled. Even so, these messages are now _not_ emitted by default. This also eliminates the use of the net_msg_warn sysctl "/proc/sys/net/core/warnings". For backward compatibility, the sysctl is not removed, but it has no function. The extern declaration of net_msg_warn is removed from sock.h and made static in net/core/sysctl_net_core.c Miscellanea: o Update the sysctl documentation o Remove the embedded uses of pr_fmt o Coalesce format fragments o Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-11-061-29/+31
|\ \ | |/
| * tcp: zero retrans_stamp if all retrans were ackedMarcelo Leitner2014-11-051-29/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ueki Kohei reported that when we are using NewReno with connections that have a very low traffic, we may timeout the connection too early if a second loss occurs after the first one was successfully acked but no data was transfered later. Below is his description of it: When SACK is disabled, and a socket suffers multiple separate TCP retransmissions, that socket's ETIMEDOUT value is calculated from the time of the *first* retransmission instead of the *latest* retransmission. This happens because the tcp_sock's retrans_stamp is set once then never cleared. Take the following connection: Linux remote-machine | | send#1---->(*1)|--------> data#1 --------->| | | | RTO : : | | | ---(*2)|----> data#1(retrans) ---->| | (*3)|<---------- ACK <----------| | | | | : : | : : | : : 16 minutes (or more) : | : : | : : | : : | | | send#2---->(*4)|--------> data#2 --------->| | | | RTO : : | | | ---(*5)|----> data#2(retrans) ---->| | | | | | | RTO*2 : : | | | | | | ETIMEDOUT<----(*6)| | (*1) One data packet sent. (*2) Because no ACK packet is received, the packet is retransmitted. (*3) The ACK packet is received. The transmitted packet is acknowledged. At this point the first "retransmission event" has passed and been recovered from. Any future retransmission is a completely new "event". (*4) After 16 minutes (to correspond with retries2=15), a new data packet is sent. Note: No data is transmitted between (*3) and (*4). The socket's timeout SHOULD be calculated from this point in time, but instead it's calculated from the prior "event" 16 minutes ago. (*5) Because no ACK packet is received, the packet is retransmitted. (*6) At the time of the 2nd retransmission, the socket returns ETIMEDOUT. Therefore, now we clear retrans_stamp as soon as all data during the loss window is fully acked. Reported-by: Ueki Kohei Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: allow setting ecn via routing tableFlorian Westphal2014-11-041-10/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows to set ECN on a per-route basis in case the sysctl tcp_ecn is not set to 1. In other words, when ECN is set for specific routes, it provides a tcp_ecn=1 behaviour for that route while the rest of the stack acts according to the global settings. One can use 'ip route change dev $dev $net features ecn' to toggle this. Having a more fine-grained per-route setting can be beneficial for various reasons, for example, 1) within data centers, or 2) local ISPs may deploy ECN support for their own video/streaming services [1], etc. There was a recent measurement study/paper [2] which scanned the Alexa's publicly available top million websites list from a vantage point in US, Europe and Asia: Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely blamed to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side only ECN") ;)); the break in connectivity on-path was found is about 1 in 10,000 cases. Timeouts rather than receiving back RSTs were much more common in the negotiation phase (and mostly seen in the Alexa middle band, ranks around 50k-150k): from 12-thousand hosts on which there _may_ be ECN-linked connection failures, only 79 failed with RST when _not_ failing with RST when ECN is not requested. It's unclear though, how much equipment in the wild actually marks CE when buffers start to fill up. We thought about a fallback to non-ECN for retransmitted SYNs as another global option (which could perhaps one day be made default), but as Eric points out, there's much more work needed to detect broken middleboxes. Two examples Eric mentioned are buggy firewalls that accept only a single SYN per flow, and middleboxes that successfully let an ECN flow establish, but later mark CE for all packets (so cwnd converges to 1). [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15 [2] http://ecn.ethz.ch/ Joint work with Daniel Borkmann. Reference: http://thread.gmane.org/gmane.linux.network/335797 Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Correction to RFC number in commentSowmini Varadhan2014-10-301-1/+1
| | | | | | | | | | | | | | | | Challenge ACK is described in RFC 5961, fix typo. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: minor spelling fixesstephen hemminger2014-10-301-1/+1
| | | | | | | | | | Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: allow for bigger reordering levelEric Dumazet2014-10-291-1/+2
|/ | | | | | | | | | | | | | | | | | | While testing upcoming Yaogong patch (converting out of order queue into an RB tree), I hit the max reordering level of linux TCP stack. Reordering level was limited to 127 for no good reason, and some network setups [1] can easily reach this limit and get limited throughput. Allow a new max limit of 300, and add a sysctl to allow admins to even allow bigger (or lower) values if needed. [1] Aggregation of links, per packet load balancing, fabrics not doing deep packet inspections, alternative TCP congestion modules... Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yaogong Wang <wygivan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix tcp_ack() performance problemEric Dumazet2014-10-141-9/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We worked hard to improve tcp_ack() performance, by not accessing skb_shinfo() in fast path (cd7d8498c9a5 tcp: change tcp_skb_pcount() location) We still have one spurious access because of ACK timestamping, added in commit e1c8a607b281 ("net-timestamp: ACK timestamp for bytestreams") By checking if sk_tsflags has SOF_TIMESTAMPING_TX_ACK set, we can avoid two cache line misses for the common case. While we are at it, add two prefetchw() : One in tcp_ack() to bring skb at the head of write queue. One in tcp_clean_rtx_queue() loop to bring following skb, as we will delete skb from the write queue and dirty skb->next->prev. Add a couple of [un]likely() clauses. After this patch, tcp_ack() is no longer the most consuming function in tcp stack. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Van Jacobson <vanj@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2014-10-081-122/+163
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: "Most notable changes in here: 1) By far the biggest accomplishment, thanks to a large range of contributors, is the addition of multi-send for transmit. This is the result of discussions back in Chicago, and the hard work of several individuals. Now, when the ->ndo_start_xmit() method of a driver sees skb->xmit_more as true, it can choose to defer the doorbell telling the driver to start processing the new TX queue entires. skb->xmit_more means that the generic networking is guaranteed to call the driver immediately with another SKB to send. There is logic added to the qdisc layer to dequeue multiple packets at a time, and the handling mis-predicted offloads in software is now done with no locks held. Finally, pktgen is extended to have a "burst" parameter that can be used to test a multi-send implementation. Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4, virtio_net Adding support is almost trivial, so export more drivers to support this optimization soon. I want to thank, in no particular or implied order, Jesper Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann, David Tat, Hannes Frederic Sowa, and Rusty Russell. 2) PTP and timestamping support in bnx2x, from Michal Kalderon. 3) Allow adjusting the rx_copybreak threshold for a driver via ethtool, and add rx_copybreak support to enic driver. From Govindarajulu Varadarajan. 4) Significant enhancements to the generic PHY layer and the bcm7xxx driver in particular (EEE support, auto power down, etc.) from Florian Fainelli. 5) Allow raw buffers to be used for flow dissection, allowing drivers to determine the optimal "linear pull" size for devices that DMA into pools of pages. The objective is to get exactly the necessary amount of headers into the linear SKB area pre-pulled, but no more. The new interface drivers use is eth_get_headlen(). From WANG Cong, with driver conversions (several had their own by-hand duplicated implementations) by Alexander Duyck and Eric Dumazet. 6) Support checksumming more smoothly and efficiently for encapsulations, and add "foo over UDP" facility. From Tom Herbert. 7) Add Broadcom SF2 switch driver to DSA layer, from Florian Fainelli. 8) eBPF now can load programs via a system call and has an extensive testsuite. Alexei Starovoitov and Daniel Borkmann. 9) Major overhaul of the packet scheduler to use RCU in several major areas such as the classifiers and rate estimators. From John Fastabend. 10) Add driver for Intel FM10000 Ethernet Switch, from Alexander Duyck. 11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric Dumazet. 12) Add Datacenter TCP congestion control algorithm support, From Florian Westphal. 13) Reorganize sk_buff so that __copy_skb_header() is significantly faster. From Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits) netlabel: directly return netlbl_unlabel_genl_init() net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers net: description of dma_cookie cause make xmldocs warning cxgb4: clean up a type issue cxgb4: potential shift wrapping bug i40e: skb->xmit_more support net: fs_enet: Add NAPI TX net: fs_enet: Remove non NAPI RX r8169:add support for RTL8168EP net_sched: copy exts->type in tcf_exts_change() wimax: convert printk to pr_foo() af_unix: remove 0 assignment on static ipv6: Do not warn for informational ICMP messages, regardless of type. Update Intel Ethernet Driver maintainers list bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING tipc: fix bug in multicast congestion handling net: better IFF_XMIT_DST_RELEASE support net/mlx4_en: remove NETDEV_TX_BUSY 3c59x: fix bad split of cpu_to_le32(pci_map_single()) net: bcmgenet: fix Tx ring priority programming ...
| * tcp: change TCP_ECN prefixes to lower caseFlorian Westphal2014-09-291-19/+22
| | | | | | | | | | | | | | | | | | | | | | | | Suggested by Stephen. Also drop inline keyword and let compiler decide. gcc 4.7.3 decides to no longer inline tcp_ecn_check_ce, so split it up. The actual evaluation is not inlined anymore while the ECN_OK test is. Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: move TCP_ECN_create_request out of headerFlorian Westphal2014-09-291-1/+35
| | | | | | | | | | | | | | | | | | | | | | | | After Octavian Purdilas tcp ipv4/ipv6 unification work this helper only has a single callsite. While at it, convert name to lowercase, suggested by Stephen. Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: tcp: more detailed ACK events and events for CE marked packetsFlorian Westphal2014-09-291-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DataCenter TCP (DCTCP) determines cwnd growth based on ECN information and ACK properties, e.g. ACK that updates window is treated differently than DUPACK. Also DCTCP needs information whether ACK was delayed ACK. Furthermore, DCTCP also implements a CE state machine that keeps track of CE markings of incoming packets. Therefore, extend the congestion control framework to provide these event types, so that DCTCP can be properly implemented as a normal congestion algorithm module outside of the core stack. Joint work with Daniel Borkmann and Glenn Judd. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: tcp: split ack slow/fast events from cwnd_eventFlorian Westphal2014-09-291-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The congestion control ops "cwnd_event" currently supports CA_EVENT_FAST_ACK and CA_EVENT_SLOW_ACK events (among others). Both FAST and SLOW_ACK are only used by Westwood congestion control algorithm. This removes both flags from cwnd_event and adds a new in_ack_event callback for this. The goal is to be able to provide more detailed information about ACKs, such as whether ECE flag was set, or whether the ACK resulted in a window update. It is required for DataCenter TCP (DCTCP) congestion control algorithm as it makes a different choice depending on ECE being set or not. Joint work with Daniel Borkmann and Glenn Judd. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: tcp: add flag for ca to indicate that ECN is requiredDaniel Borkmann2014-09-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a flag to TCP congestion algorithms that allows for requesting to mark IPv4/IPv6 sockets with transport as ECN capable, that is, ECT(0), when required by a congestion algorithm. It is currently used and needed in DataCenter TCP (DCTCP), as it requires both peers to assert ECT on all IP packets sent - it uses ECN feedback (i.e. CE, Congestion Encountered information) from switches inside the data center to derive feedback to the end hosts. Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's algorithm/behaviour slightly diverges from RFC3168, therefore this is only (!) enabled iff the assigned congestion control ops module has requested this. By that, we can tightly couple this logic really only to the provided congestion control ops. Joint work with Florian Westphal and Glenn Judd. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: use tcp_flags in tcp_data_queue()Peter Pan(潘卫平)2014-09-281-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is a cleanup which follows the idea in commit e11ecddf5128 (tcp: use TCP_SKB_CB(skb)->tcp_flags in input path), and it may reduce register pressure since skb->cb[] access is fast, bacause skb is probably in a register. v2: remove variable th v3: reword the changelog Signed-off-by: Weiping Pan <panweiping3@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: change tcp_skb_pcount() locationEric Dumazet2014-09-281-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our goal is to access no more than one cache line access per skb in a write or receive queue when doing the various walks. After recent TCP_SKB_CB() reorganizations, it is almost done. Last part is tcp_skb_pcount() which currently uses skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs 3 cache lines in current kernel (skb->head, skb->end, and shinfo->gso_segs are all in 3 different cache lines, far from skb->cb) This very simple patch reuses space currently taken by tcp_tw_isn only in input path, as tcp_skb_pcount is only needed for skb stored in write queue. This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags to get SKBTX_ACK_TSTAMP, which seems possible. This also speeds up all sack processing in general. This speeds up tcp_sendmsg() because it no longer has to access/dirty shinfo. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: add coalescing attempt in tcp_ofo_queue()Eric Dumazet2014-09-231-42/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | In order to make TCP more resilient in presence of reorders, we need to allow coalescing to happen when skbs from out of order queue are transferred into receive queue. LRO/GRO can be completely canceled in some pathological cases, like per packet load balancing on aggregated links. I had to move tcp_try_coalesce() up in the file above tcp_ofo_queue() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: avoid possible arithmetic overflowsEric Dumazet2014-09-221-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | icsk_rto is a 32bit field, and icsk_backoff can reach 15 by default, or more if some sysctl (eg tcp_retries2) are changed. Better use 64bit to perform icsk_rto << icsk_backoff operations As Joe Perches suggested, add a helper for this. Yuchung spotted the tcp_v4_err() case. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: do not fake tcp headers in tcp_send_rcvq()Eric Dumazet2014-09-191-9/+4
| | | | | | | | | | | | | | | | | | Now we no longer rely on having tcp headers for skbs in receive queue, tcp repair do not need to build fake ones. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: do not copy headers in tcp_collapse()Eric Dumazet2014-09-151-15/+2
| | | | | | | | | | | | | | | | | | | | | | | | tcp_collapse() wants to shrink skb so that the overhead is minimal. Now we store tcp flags into TCP_SKB_CB(skb)->tcp_flags, we no longer need to keep around full headers. Whole available space is dedicated to the payload. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: allow segment with FIN in tcp_try_coalesce()Eric Dumazet2014-09-151-3/+1
| | | | | | | | | | | | | | | | | | | | We can allow a segment with FIN to be aggregated, if we take care to add tcp flags, and if skb_try_coalesce() takes care of zero sized skbs. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: use TCP_SKB_CB(skb)->tcp_flags in input pathEric Dumazet2014-09-151-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Input path of TCP do not currently uses TCP_SKB_CB(skb)->tcp_flags, which is only used in output path. tcp_recvmsg(), looks at tcp_hdr(skb)->syn for every skb found in receive queue, and its unfortunate because this bit is located in a cache line right before the payload. We can simplify TCP by copying tcp flags into TCP_SKB_CB(skb)->tcp_flags. This patch does so, and avoids the cache line miss in tcp_recvmsg() Following patches will - allow a segment with FIN being coalesced in tcp_try_coalesce() - simplify tcp_collapse() by not copying the headers. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: remove TCP_SKB_CB(skb)->whenEric Dumazet2014-09-051-1/+2
| | | | | | | | | | | | | | | | | | | | | | After commit 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution"), we no longer need to maintain timestamps in two different fields. TCP_SKB_CB(skb)->when can be removed, as same information sits in skb_mstamp.stamp_jiffies Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: introduce TCP_SKB_CB(skb)->tcp_tw_isnEric Dumazet2014-09-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP_SKB_CB(skb)->when has different meaning in output and input paths. In output path, it contains a timestamp. In input path, it contains an ISN, chosen by tcp_timewait_state_process() Lets add a different name to ease code comprehension. Note that 'when' field will disappear in following patch, as skb_mstamp already contains timestamp, the anonymous union will promptly disappear as well. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: improve undo on timeoutYuchung Cheng2014-08-221-15/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Upon timeout, undo (via both timestamps/Eifel and DSACKs) was disabled if any retransmits were still in flight. The concern was perhaps that spurious retransmission sent in a previous recovery episode may trigger DSACKs to falsely undo the current recovery. However, this inadvertently misses undo opportunities (using either TCP timestamps or DSACKs) when timeout occurs during a loss episode, i.e. recurring timeouts or timeout during fast recovery. In these cases some retransmissions will be in flight but we should allow undo. Furthermore, we should only reset undo_marker and undo_retrans upon timeout if we are starting a new recovery episode. Finally, when we do reset our undo state, we now do so in a manner similar to tcp_enter_recovery(), so that we require a DSACK for each of the outstsanding retransmissions. This will achieve the original goal by requiring that we receive the same number of DSACKs as retransmissions. This patch increases the undo events by 50% on Google servers. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge tag 'dmaengine-3.17' of ↵Linus Torvalds2014-10-071-75/+8
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine Pull dmaengine updates from Dan Williams: "Even though this has fixes marked for -stable, given the size and the needed conflict resolutions this is 3.18-rc1/merge-window material. These patches have been languishing in my tree for a long while. The fact that I do not have the time to do proper/prompt maintenance of this tree is a primary factor in the decision to step down as dmaengine maintainer. That and the fact that the bulk of drivers/dma/ activity is going through Vinod these days. The net_dma removal has not been in -next. It has developed simple conflicts against mainline and net-next (for-3.18). Continuing thanks to Vinod for staying on top of drivers/dma/. Summary: 1/ Step down as dmaengine maintainer see commit 08223d80df38 "dmaengine maintainer update" 2/ Removal of net_dma, as it has been marked 'broken' since 3.13 (commit 77873803363c "net_dma: mark broken"), without reports of performance regression. 3/ Miscellaneous fixes" * tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine: net: make tcp_cleanup_rbuf private net_dma: revert 'copied_early' net_dma: simple removal dmaengine maintainer update dmatest: prevent memory leakage on error path in thread ioat: Use time_before_jiffies() dmaengine: fix xor sources continuation dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup() dma: mv_xor: Remove all callers of mv_xor_slot_cleanup() dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call ioat: Use pci_enable_msix_exact() instead of pci_enable_msix() drivers: dma: Include appropriate header file in dca.c drivers: dma: Mark functions as static in dma_v3.c dma: mv_xor: Add DMA API error checks ioat/dca: Use dev_is_pci() to check whether it is pci device
| * net_dma: revert 'copied_early'Dan Williams2014-09-281-14/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that tcp_dma_try_early_copy() is gone nothing ever sets copied_early. Also reverts "53240c208776 tcp: Fix possible double-ack w/ user dma" since it is no longer necessary. Cc: Ali Saidi <saidi@engin.umich.edu> Cc: James Morris <jmorris@namei.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Neal Cardwell <ncardwell@google.com> Reported-by: Dave Jones <davej@redhat.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * net_dma: simple removalDan Williams2014-09-281-61/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per commit "77873803363c net_dma: mark broken" net_dma is no longer used and there is no plan to fix it. This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards. Reverting the remainder of the net_dma induced changes is deferred to subsequent patches. Marked for stable due to Roman's report of a memory leak in dma_pin_iovec_pages(): https://lkml.org/lkml/2014/9/3/177 Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vinod Koul <vinod.koul@intel.com> Cc: David Whipple <whipple@securedatainnovations.ch> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: <stable@vger.kernel.org> Reported-by: Roman Gushchin <klamm@yandex-team.ru> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | tcp: fix ssthresh and undo for consecutive short FRTO episodesNeal Cardwell2014-08-141-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix TCP FRTO logic so that it always notices when snd_una advances, indicating that any RTO after that point will be a new and distinct loss episode. Previously there was a very specific sequence that could cause FRTO to fail to notice a new loss episode had started: (1) RTO timer fires, enter FRTO and retransmit packet 1 in write queue (2) receiver ACKs packet 1 (3) FRTO sends 2 more packets (4) RTO timer fires again (should start a new loss episode) The problem was in step (3) above, where tcp_process_loss() returned early (in the spot marked "Step 2.b"), so that it never got to the logic to clear icsk_retransmits. Thus icsk_retransmits stayed non-zero. Thus in step (4) tcp_enter_loss() would see the non-zero icsk_retransmits, decide that this RTO is not a new episode, and decide not to cut ssthresh and remember the current cwnd and ssthresh for undo. There were two main consequences to the bug that we have observed. First, ssthresh was not decreased in step (4). Second, when there was a series of such FRTO (1-4) sequences that happened to be followed by an FRTO undo, we would restore the cwnd and ssthresh from before the entire series started (instead of the cwnd and ssthresh from before the most recent RTO). This could result in cwnd and ssthresh being restored to values much bigger than the proper values. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Fixes: e33099f96d99c ("tcp: implement RFC5682 F-RTO") Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: don't allow syn packets without timestamps to pass tcp_tw_recycle logicHannes Frederic Sowa2014-08-141-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_tw_recycle heavily relies on tcp timestamps to build a per-host ordering of incoming connections and teardowns without the need to hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for the last measured RTO. To do so, we keep the last seen timestamp in a per-host indexed data structure and verify if the incoming timestamp in a connection request is strictly greater than the saved one during last connection teardown. Thus we can verify later on that no old data packets will be accepted by the new connection. During moving a socket to time-wait state we already verify if timestamps where seen on a connection. Only if that was the case we let the time-wait socket expire after the RTO, otherwise normal TCP_TIMEWAIT_LEN will be used. But we don't verify this on incoming SYN packets. If a connection teardown was less than TCP_PAWS_MSL seconds in the past we cannot guarantee to not accept data packets from an old connection if no timestamps are present. We should drop this SYN packet. This patch closes this loophole. Please note, this patch does not make tcp_tw_recycle in any way more usable but only adds another safety check: Sporadic drops of SYN packets because of reordering in the network or in the socket backlog queues can happen. Users behing NAT trying to connect to a tcp_tw_recycle enabled server can get caught in blackholes and their connection requests may regullary get dropped because hosts behind an address translator don't have synchronized tcp timestamp clocks. tcp_tw_recycle cannot work if peers don't have tcp timestamps enabled. In general, use of tcp_tw_recycle is disadvised. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net-timestamp: fix missing ACK timestampWillem de Bruijn2014-08-131-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ACK timestamps are generated in tcp_clean_rtx_queue. The TSO datapath can break out early, causing the timestamp code to be skipped. Move the code up before the break. Reported-by: David S. Miller <davem@davemloft.net> Also fix a boundary condition: tp->snd_una is the next unacknowledged byte and between tests inclusive (a <= b <= c), so generate a an ACK timestamp if (prior_snd_una <= tskey <= tp->snd_una - 1). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net-timestamp: ACK timestamp for bytestreamsWillem de Bruijn2014-08-051-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte in the send() call is acknowledged. It implements the feature for TCP. The timestamp is generated when the TCP socket cumulative ACK is moved beyond the tracked seqno for the first time. The feature ignores SACK and FACK, because those acknowledge the specific byte, but not necessarily the entire contents of the buffer up to that byte. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: reduce spurious retransmits due to transient SACK renegingNeal Cardwell2014-08-051-11/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit reduces spurious retransmits due to apparent SACK reneging by only reacting to SACK reneging that persists for a short delay. When a sequence space hole at snd_una is filled, some TCP receivers send a series of ACKs as they apparently scan their out-of-order queue and cumulatively ACK all the packets that have now been consecutiveyly received. This is essentially misbehavior B in "Misbehaviors in TCP SACK generation" ACM SIGCOMM Computer Communication Review, April 2011, so we suspect that this is from several common OSes (Windows 2000, Windows Server 2003, Windows XP). However, this issue has also been seen in other cases, e.g. the netdev thread "TCP being hoodwinked into spurious retransmissions by lack of timestamps?" from March 2014, where the receiver was thought to be a BSD box. Since snd_una would temporarily be adjacent to a previously SACKed range in these scenarios, this receiver behavior triggered the Linux SACK reneging code path in the sender. This led the sender to clear the SACK scoreboard, enter CA_Loss, and spuriously retransmit (potentially) every packet from the entire write queue at line rate just a few milliseconds before the ACK for each packet arrives at the sender. To avoid such situations, now when a sender sees apparent reneging it does not yet retransmit, but rather adjusts the RTO timer to give the receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs that will restore sanity to the SACK scoreboard. If the reneging persists until this RTO then, as before, we clear the SACK scoreboard and enter CA_Loss. A 10ms delay tolerates a receiver sending such a stream of ACKs at 56Kbit/sec. And to allow for receivers with slower or more congested paths, we wait for at least RTT/2. We validated the resulting max(RTT/2, 10ms) delay formula with a mix of North American and South American Google web server traffic, and found that for ACKs displaying transient reneging: (1) 90% of inter-ACK delays were less than 10ms (2) 99% of inter-ACK delays were less than RTT/2 In tests on Google web servers this commit reduced reneging events by 75%-90% (as measured by the TcpExtTCPSACKReneging counter), without any measurable impact on latency for user HTTP and SPDY requests. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-07-161-4/+4
|\ \ | | | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: fix false undo corner casesYuchung Cheng2014-07-071-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The undo code assumes that, upon entering loss recovery, TCP 1) always retransmit something 2) the retransmission never fails locally (e.g., qdisc drop) so undo_marker is set in tcp_enter_recovery() and undo_retrans is incremented only when tcp_retransmit_skb() is successful. When the assumption is broken because TCP's cwnd is too small to retransmit or the retransmit fails locally. The next (DUP)ACK would incorrectly revert the cwnd and the congestion state in tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs may enter the recovery state. The sender repeatedly enter and (incorrectly) exit recovery states if the retransmits continue to fail locally while receiving (DUP)ACKs. The fix is to initialize undo_retrans to -1 and start counting on the first retransmission. Always increment undo_retrans even if the retransmissions fail locally because they couldn't cause DSACKs to undo the cwnd reduction. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud