summaryrefslogtreecommitdiffstats
path: root/net/ipv4/tcp_output.c
Commit message (Collapse)AuthorAgeFilesLines
* tcp: Fix CWV being too strict on thin streamsBendik Rønning Opstad2015-09-281-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Application limited streams such as thin streams, that transmit small amounts of payload in relatively few packets per RTT, can be prevented from growing the CWND when in congestion avoidance. This leads to increased sojourn times for data segments in streams that often transmit time-dependent data. Currently, a connection is considered CWND limited only after having successfully transmitted at least one packet with new data, while at the same time failing to transmit some unsent data from the output queue because the CWND is full. Applications that produce small amounts of data may be left in a state where it is never considered to be CWND limited, because all unsent data is successfully transmitted each time an incoming ACK opens up for more data to be transmitted in the send window. Fix by always testing whether the CWND is fully used after successful packet transmissions, such that a connection is considered CWND limited whenever the CWND has been filled. This is the correct behavior as specified in RFC2861 (section 3.1). Cc: Andreas Petlund <apetlund@simula.no> Cc: Carsten Griwodz <griff@simula.no> Cc: Jonas Markussen <jonassm@ifi.uio.no> Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Cc: Mads Johannessen <madsjoh@ifi.uio.no> Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-09-261-0/+1
|\ | | | | | | | | | | | | | | | | | | Conflicts: net/ipv4/arp.c The net/ipv4/arp.c conflict was one commit adding a new local variable while another commit was deleting one. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: add proper TS val into RST packetsEric Dumazet2015-09-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RST packets sent on behalf of TCP connections with TS option (RFC 7323 TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr. A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100 ecr 0], length 0 B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss 1460,nop,nop,TS val 7264344 ecr 100], length 0 A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr 7264344], length 0 B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0 ecr 110], length 0 We need to call skb_mstamp_get() to get proper TS val, derived from skb->skb_mstamp Note that RFC 1323 was advocating to not send TS option in RST segment, but RFC 7323 recommends the opposite : Once TSopt has been successfully negotiated, that is both <SYN> and <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST> segment for the duration of the connection, and SHOULD be sent in an <RST> segment (see Section 5.2 for details) Note this RFC recommends to send TS val = 0, but we believe it is premature : We do not know if all TCP stacks are properly handling the receive side : When an <RST> segment is received, it MUST NOT be subjected to the PAWS check by verifying an acceptable value in SEG.TSval, and information from the Timestamps option MUST NOT be used to update connection state information. SEG.TSecr MAY be used to provide stricter <RST> acceptance checks. In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider to decide to send TS val = 0, if it buys something. Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp/dccp: constify rtx_synack() and friendsEric Dumazet2015-09-251-1/+1
| | | | | | | | | | | | | | | | This is done to make sure we do not change listener socket while sending SYNACK packets while socket lock is not held. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: constify tcp_make_synack() socket argumentEric Dumazet2015-09-251-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | listener socket is not locked when tcp_make_synack() is called. We better make sure no field is written. There is one exception : Since SYNACK packets are attached to the listener at this moment (or SYN_RECV child in case of Fast Open), sock_wmalloc() needs to update sk->sk_wmem_alloc, but this is done using atomic operations so this is safe. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: remove tcp_ecn_make_synack() socket argumentEric Dumazet2015-09-251-7/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | SYNACK packets might be sent without holding socket lock. For DCTCP/ECN sake, we should call INET_ECN_xmit() while socket lock is owned, and only when we init/change congestion control. This also fixies a bug if congestion module is changed from dctcp to another one on a listener : we now clear ECN bits properly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: remove tcp_synack_options() socket argumentEric Dumazet2015-09-251-8/+7
| | | | | | | | | | | | | | We do not use the socket in this function. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: send loss probe after 1s if no RTT availableYuchung Cheng2015-09-211-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch makes TLP to use 1 sec timer by default when RTT is not available due to SYN/ACK retransmission or SYN cookies. Prior to this change, the lack of RTT prevents TLP so the first data packets sent can only be recovered by fast recovery or RTO. If the fast recovery fails to trigger the RTO is 3 second when SYN/ACK is retransmitted. With this patch we can trigger fast recovery in 1sec instead. Note that we need to check Fast Open more properly. A Fast Open connection could be (accepted then) closed before it receives the final ACK of 3WHS so the state is FIN_WAIT_1. Without the new check, TLP will retransmit FIN instead of SYN/ACK. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: provide skb->hash to synack packetsEric Dumazet2015-09-171-0/+2
|/ | | | | | | | | | | | | | | | | | | | | | In commit b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf on xmit"), Tom provided a l4 hash to most outgoing TCP packets. We'd like to provide one as well for SYNACK packets, so that all packets of a given flow share same txhash, to later enable bonding driver to also use skb->hash to perform slave selection. Note that a SYNACK retransmit shuffles the tx hash, as Tom did in commit 265f94ff54d62 ("net: Recompute sk_txhash on negative routing advice") for established sockets. This has nice effect making TCP flows resilient to some kind of black holes, even at connection establish phase. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Mahesh Bandewar <maheshb@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: generate CA_EVENT_TX_START on data framesNeal Cardwell2015-09-101-3/+3
| | | | | | | | | | | | | | | | | | | | Issuing a CC TX_START event on control frames like pure ACK is a waste of time, as a CC should not care. Following patch needs this change, as we want CUBIC to properly track idle time at a low cost, with a single TX_START being generated. Yuchung might slightly refine the condition triggering TX_START on a followup patch. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Jana Iyengar <jri@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Sangtae Ha <sangtae.ha@gmail.com> Cc: Lawrence Brakmo <lawrence@brakmo.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix slow start after idle vs TSO/GSOEric Dumazet2015-08-251-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | slow start after idle might reduce cwnd, but we perform this after first packet was cooked and sent. With TSO/GSO, it means that we might send a full TSO packet even if cwnd should have been reduced to IW10. Moving the SSAI check in skb_entail() makes sense, because we slightly reduce number of times this check is done, especially for large send() and TCP Small queue callbacks from softirq context. As Neal pointed out, we also need to perform the check if/when receive window opens. Tested: Following packetdrill test demonstrates the problem // Test of slow start after idle `sysctl -q net.ipv4.tcp_slow_start_after_idle=1` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6> +.100 < . 1:1(0) ack 1 win 511 +0 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0 +0 write(4, ..., 26000) = 26000 +0 > . 1:5001(5000) ack 1 +0 > . 5001:10001(5000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10 }% +.100 < . 1:1(0) ack 10001 win 511 +0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }% +0 > . 10001:20001(10000) ack 1 +0 > P. 20001:26001(6000) ack 1 +.100 < . 1:1(0) ack 26001 win 511 +0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }% +4 write(4, ..., 20000) = 20000 // If slow start after idle works properly, we should send 5 MSS here (cwnd/2) +0 > . 26001:31001(5000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }% +0 > . 31001:36001(5000) ack 1 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: TLP retransmits last if failed to send new packetYuchung Cheng2015-08-131-16/+22
| | | | | | | | | | When TLP fails to send new packet because of receive window limit, it should fall back to retransmit the last packet instead. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: don't extend RTO on failed loss probe attemptsYuchung Cheng2015-08-131-7/+6
| | | | | | | | | | | | | | If TLP was unable to send a probe, it extended the RTO to now + icsk_rto. But extending the RTO makes little sense if no TLP probe went out. With this commit, instead of extending the RTO we re-arm it relative to the transmit time of the write queue head. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: tso: allow deferring under reordering stateEric Dumazet2015-07-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | While doing experiments with reordering resilience, we found linux senders were not able to send at full speed under reordering, because every incoming SACK was releasing one MSS. This patch removes the limitation, as we did for CWR state in commit a0ea700e409 ("tcp: tso: allow CA_CWR state in tcp_tso_should_defer()") Neal Cardwell had a concern about limited transmit so Yuchung conducted experiments on GFE and found nothing worth adding an extra check on fast path : if (icsk->icsk_ca_state == TCP_CA_Disorder && tcp_sk(sk)->reordering == sysctl_tcp_reordering) goto send_now; Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: v1 always send a quick ack when quickacks are enabledJon Maxwell2015-07-091-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | V1 of this patch contains Eric Dumazet's suggestion to move the per dst RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks Eric. I ran some tests and after setting the "ip route change quickack 1" knob there were still many delayed ACKs sent. This occured because when icsk_ack.quick=0 the !icsk_ack.pingpong value is subsequently ignored as tcp_in_quickack_mode() checks both these values. The condition for a quick ack to trigger requires that both icsk_ack.quick != 0 and icsk_ack.pingpong=0. Currently only icsk_ack.pingpong is controlled by the knob. But the icsk_ack.quick value changes dynamically depending on heuristics. The crux of the matter is that delayed acks still cannot be entirely disabled even with the RTAX_QUICKACK per dst knob enabled. This patch ensures that a quick ack is always sent when the RTAX_QUICKACK per dst knob is turned on. The "ip route change quickack 1" knob was recently added to enable quickacks. It was modeled around the TCP_QUICKACK setsockopt() option. This issue is that even with "ip route change quickack 1" enabled we still see delayed ACKs under some conditions. It would be nice to be able to completely disable delayed ACKs. Here is an example: # netstat -s|grep dela 3 delayed acks sent For all routes enable the knob # ip route change quickack 1 Generate some traffic across a slow link and we still see the delayed acks. # netstat -s|grep dela 106 delayed acks sent 1 delayed acks further delayed because of locked socket The issue is that both the "ip route change quickack 1" knob and the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0. However at the business end in the __tcp_ack_snd_check() routine, tcp_in_quickack_mode() checks that both icsk_ack.quick != 0 and icsk_ack.pingpong=0 in order to trigger a quickack. As icsk_ack.quick is determined by heuristics it can be 0. When that occurs the icsk_ack.pingpong value is ignored and a delayed ACK is sent regardless. This patch moves the RTAX_QUICKACK per dst check into the tcp_in_quickack_mode() routine which ensures that a quickack is always sent when the quickack knob is enabled for that dst. Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: remove obsolete check in tcp_set_skb_tso_segs()Eric Dumazet2015-06-111-3/+0
| | | | | | | | | | | | | | | | | | We had various issues in the past when TCP stack was modifying gso_size/gso_segs while clones were in flight. Commit c52e2421f73 ("tcp: must unclone packets before mangling them") fixed these bugs and added a WARN_ON_ONCE(skb_cloned(skb)); in tcp_set_skb_tso_segs() These bugs are now fixed, and because TCP stack now only sets shinfo->gso_size|segs on the clone itself, the check can be removed. As a result of this change, compiler inlines tcp_set_skb_tso_segs() in tcp_init_tso_segs() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fill shinfo->gso_size at last momentEric Dumazet2015-06-111-8/+4
| | | | | | | | | | | | | | In commit cd7d8498c9a5 ("tcp: change tcp_skb_pcount() location") we stored gso_segs in a temporary cache hot location. This patch does the same for gso_size. This allows to save 2 cache line misses in tcp xmit path for the last packet that is considered but not sent because of various conditions (cwnd, tso defer, receiver window, TSQ...) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: tcp_set_skb_tso_segs() no longer need struct sock parameterEric Dumazet2015-06-111-16/+14
| | | | | | | | tcp_set_skb_tso_segs() & tcp_init_tso_segs() no longer use the sock pointer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fill shinfo->gso_type at last momentEric Dumazet2015-06-111-3/+1
| | | | | | | | | | | | | | | Our goal is to touch skb_shinfo(skb) only when absolutely needed, to avoid two cache line misses in TCP output path for last skb that is considered but not sent because of various conditions (cwnd, tso defer, receiver window, TSQ...) A packet is GSO only when skb_shinfo(skb)->gso_size is not zero. We can set skb_shinfo(skb)->gso_type to sk->sk_gso_type even for non GSO packets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: double default TSQ output bytes limitWei Liu2015-06-041-2/+2
| | | | | | | | | | | | | | | Xen virtual network driver has higher latency than a physical NIC. Having only 128K as limit for TSQ introduced 30% regression in guest throughput. This patch raises the limit to 256K. This reduces the regression to 8%. This buys us more time to work out a proper solution in the long run. Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: tcp_tso_autosize() minimum is one packetEric Dumazet2015-05-261-2/+2
| | | | | | | | | | | | | | | | | By making sure sk->sk_gso_max_segs minimal value is one, and sysctl_tcp_min_tso_segs minimal value is one as well, tcp_tso_autosize() will return a non zero value. We can then revert 843925f33fcc293d80acf2c5c8a78adf3344d49b ("tcp: Do not apply TSO segment limit to non-TSO packets") and save few cpu cycles in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add tcpi_segs_in and tcpi_segs_out to tcp_infoMarcelo Ricardo Leitner2015-05-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | This patch tracks the total number of inbound and outbound segments on a TCP socket. One may use this number to have an idea on connection quality when compared against the retransmissions. RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut These are a 32bit field each and can be fetched both from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow) Note that tp->segs_out was placed near tp->snd_nxt for good data locality and minimal performance impact, while tp->segs_in was placed near tp->bytes_received for the same reason. Join work with Eric Dumazet. Note that received SYN are accounted on the listener, but sent SYNACK are not accounted. Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add a force_schedule argument to sk_stream_alloc_skb()Eric Dumazet2015-05-211-5/+5
| | | | | | | | | | | | | | | | | | | In commit 8e4d980ac215 ("tcp: fix behavior for epoll edge trigger") we fixed a possible hang of TCP sockets under memory pressure, by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule() if no packet is in socket write queue. It turns out there are other cases where we want to force memory schedule : tcp_fragment() & tso_fragment() need to split a big TSO packet into two smaller ones. If we block here because of TCP memory pressure, we can effectively block TCP socket from sending new data. If no further ACK is coming, this hang would be definitive, and socket has no chance to effectively reduce its memory usage. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add rfc3168, section 6.1.1.1. fallbackDaniel Borkmann2015-05-191-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work as a follow-up of commit f7b3bec6f516 ("net: allow setting ecn via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing ECN connections. In other words, this work adds a retry with a non-ECN setup SYN packet, as suggested from the RFC on the first timeout: [...] A host that receives no reply to an ECN-setup SYN within the normal SYN retransmission timeout interval MAY resend the SYN and any subsequent SYN retransmissions with CWR and ECE cleared. [...] Schematic client-side view when assuming the server is in tcp_ecn=2 mode, that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend ECN sysctl to allow server-side only ECN"): 1) Normal ECN-capable path: SYN ECE CWR -----> <----- SYN ACK ECE ACK -----> 2) Path with broken middlebox, when client has fallback: SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN -----> <----- SYN ACK ACK -----> In case we would not have the fallback implemented, the middlebox drop point would basically end up as: SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) In any case, it's rather a smaller percentage of sites where there would occur such additional setup latency: it was found in end of 2014 that ~56% of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the fallback would mitigate with a slight latency trade-off. Recent related paper on this topic: Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth, Gorry Fairhurst, and Richard Scheffenegger: "Enabling Internet-Wide Deployment of Explicit Congestion Notification." Proc. PAM 2015, New York. http://ecn.ethz.ch/ecn-pam15.pdf Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168, section 6.1.1.1. fallback on timeout. For users explicitly not wanting this which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that allows for disabling the fallback. tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but rather we let tcp_ecn_rcv_synack() take that over on input path in case a SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent ECN being negotiated eventually in that case. Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch> Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch> Cc: Eric Dumazet <edumazet@google.com> Cc: Dave That <dave.taht@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: introduce tcp_under_memory_pressure()Eric Dumazet2015-05-171-2/+2
| | | | | | | | Introduce an optimized version of sk_under_memory_pressure() for TCP. Our intent is to use it in fast paths. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: rename sk_forced_wmem_schedule() to sk_forced_mem_schedule()Eric Dumazet2015-05-171-2/+4
| | | | | | | | We plan to use sk_forced_wmem_schedule() in input path as well, so make it non static and rename it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add TCPWinProbe and TCPKeepAlive SNMP countersEric Dumazet2015-05-091-6/+7
| | | | | | | | | | | | | | | | | | Diagnosing problems related to Window Probes has been hard because we lack a counter. TCPWinProbe counts the number of ACK packets a sender has to send at regular intervals to make sure a reverse ACK packet opening back a window had not been lost. TCPKeepAlive counts the number of ACK packets sent to keep TCP flows alive (SO_KEEPALIVE) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: adjust window probe timers to safer valuesEric Dumazet2015-05-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the advent of small rto timers in datacenter TCP, (ip route ... rto_min x), the following can happen : 1) Qdisc is full, transmit fails. TCP sets a timer based on icsk_rto to retry the transmit, without exponential backoff. With low icsk_rto, and lot of sockets, all cpus are servicing timer interrupts like crazy. Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN) and 500ms (TCP_RESOURCE_PROBE_INTERVAL) 2) Receivers can send zero windows if they don't drain their receive queue. TCP sends zero window probes, based on icsk_rto current value, with exponential backoff. With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in some cases), sender can abort in less than one or two minutes ! If receiver stops the sender, it obviously doesn't care of very tight rto. Probability of dropping the ACK reopening the window is not worth the risk. Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these events (but not normal RTO based retransmits) A followup patch adds a new SNMP counter, as it would have helped a lot diagnosing this issue. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: avoid looping in tcp_send_fin()Eric Dumazet2015-04-241-21/+29
| | | | | | | | | | | | | | | | | | | | | | | | | Presence of an unbound loop in tcp_send_fin() had always been hard to explain when analyzing crash dumps involving gigantic dying processes with millions of sockets. Lets try a different strategy : In case of memory pressure, try to add the FIN flag to last packet in write queue, even if packet was already sent. TCP stack will be able to deliver this FIN after a timeout event. Note that this FIN being delivered by a retransmit, it also carries a Push flag given our current implementation. By checking sk_under_memory_pressure(), we anticipate that cooking many FIN packets might deplete tcp memory. In the case we could not allocate a packet, even with __GFP_WAIT allocation, then not sending a FIN seems quite reasonable if it allows to get rid of this socket, free memory, and not block the process from eventually doing other useful work. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix possible deadlock in tcp_send_fin()Eric Dumazet2015-04-221-1/+19
| | | | | | | | | | | | | | | | | | | Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in case a huge process is killed by OOM, and tcp_mem[2] is hit. To be able to free memory we need to make progress, so this patch allows FIN packets to not care about tcp_mem[2], if skb allocation succeeded. In a follow-up patch, we might abort tcp_send_fin() infinite loop in case TIF_MEMDIE is set on this thread, as memory allocator did its best getting extra memory already. This patch reverts d22e15371811 ("tcp: fix tcp fin memory accounting") Fixes: d22e15371811 ("tcp: fix tcp fin memory accounting") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-04-141-0/+2
|\ | | | | | | | | | | | | The dwmac-socfpga.c conflict was a case of a bug fix overlapping changes in net-next to handle an error pointer differently. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: tcp_make_synack() should clear skb->tstampEric Dumazet2015-04-091-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I noticed tcpdump was giving funky timestamps for locally generated SYNACK messages on loopback interface. 11:42:46.938990 IP 127.0.0.1.48245 > 127.0.0.2.23850: S 945476042:945476042(0) win 43690 <mss 65495,nop,nop,sackOK,nop,wscale 7> 20:28:58.502209 IP 127.0.0.2.23850 > 127.0.0.1.48245: S 3160535375:3160535375(0) ack 945476043 win 43690 <mss 65495,nop,nop,sackOK,nop,wscale 7> This is because we need to clear skb->tstamp before entering lower stack, otherwise net_timestamp_check() does not set skb->tstamp. Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: RFC7413 option support for Fast Open clientDaniel Lee2015-04-071-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fast Open has been using an experimental option with a magic number (RFC6994). This patch makes the client by default use the RFC7413 option (34) to get and send Fast Open cookies. This patch makes the client solicit cookies from a given server first with the RFC7413 option. If that fails to elicit a cookie, then it tries the RFC6994 experimental option. If that also fails, it uses the RFC7413 option on all subsequent connect attempts. If the server returns a Fast Open cookie then the client caches the form of the option that successfully elicited a cookie, and uses that form on later connects when it presents that cookie. The idea is to gradually obsolete the use of experimental options as the servers and clients upgrade, while keeping the interoperability meanwhile. Signed-off-by: Daniel Lee <Longinus00@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: RFC7413 option support for Fast Open serverDaniel Lee2015-04-071-11/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fast Open has been using the experimental option with a magic number (RFC6994) to request and grant Fast Open cookies. This patch enables the server to support the official IANA option 34 in RFC7413 in addition. The change has passed all existing Fast Open tests with both old and new options at Google. Signed-off-by: Daniel Lee <Longinus00@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: coding style: comparison for inequality with NULLIan Morris2015-04-031-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | The ipv4 code uses a mixture of coding styles. In some instances check for non-NULL pointer is done as x != NULL and sometimes as x. x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: coding style: comparison for equality with NULLIan Morris2015-04-031-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | The ipv4 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x == NULL and sometimes as !x. !x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: md5: get rid of tcp_v[46]_reqsk_md5_lookup()Eric Dumazet2015-03-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | With request socks convergence, we no longer need different lookup methods. A request socket can use generic lookup function. Add const qualifier to 2nd tcp_v[46]_md5_lookup() parameter. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: md5: remove request sock argument of calc_md5_hash()Eric Dumazet2015-03-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | Since request and established sockets now have same base, there is no need to pass two pointers to tcp_v4_md5_hash_skb() or tcp_v6_md5_hash_skb() Also add a const qualifier to their struct tcp_md5sig_key argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: md5: fix rcu lockdep splatEric Dumazet2015-03-241-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While timer handler effectively runs a rcu read locked section, there is no explicit rcu_read_lock()/rcu_read_unlock() annotations and lockdep can be confused here : net/ipv4/tcp_ipv4.c-906- /* caller either holds rcu_read_lock() or socket lock */ net/ipv4/tcp_ipv4.c:907: md5sig = rcu_dereference_check(tp->md5sig_info, net/ipv4/tcp_ipv4.c-908- sock_owned_by_user(sk) || net/ipv4/tcp_ipv4.c-909- lockdep_is_held(&sk->sk_lock.slock)); Let's explicitely acquire rcu_read_lock() in tcp_make_synack() Before commit fa76ce7328b ("inet: get rid of central tcp/dccp listener timer"), we were holding listener lock so lockdep was happy. Fixes: fa76ce7328b ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric DUmazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Revert "selinux: add a skb_owned_by() hook"Eric Dumazet2015-03-201-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit ca10b9e9a8ca7342ee07065289cbe74ac128c169. No longer needed after commit eb8895debe1baba41fcb62c78a16f0c63c21662a ("tcp: tcp_make_synack() should use sock_wmalloc") When under SYNFLOOD, we build lot of SYNACK and hit false sharing because of multiple modifications done on sk_listener->sk_wmem_alloc Since tcp_make_synack() uses sock_wmalloc(), there is no need to call skb_set_owner_w() again, as this adds two atomic operations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-03-201-5/+1
|\ \ | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/emulex/benet/be_main.c net/core/sysctl_net_core.c net/ipv4/inet_diag.c The be_main.c conflict resolution was really tricky. The conflict hunks generated by GIT were very unhelpful, to say the least. It split functions in half and moved them around, when the real actual conflict only existed solely inside of one function, that being be_map_pci_bars(). So instead, to resolve this, I checked out be_main.c from the top of net-next, then I applied the be_main.c changes from 'net' since the last time I merged. And this worked beautifully. The inet_diag.c and sysctl_net_core.c conflicts were simple overlapping changes, and were easily to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: fix tcp fin memory accountingJosh Hunt2015-03-201-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | tcp_send_fin() does not account for the memory it allocates properly, so sk_forward_alloc can be negative in cases where we've sent a FIN: ss example output (ss -amn | grep -B1 f4294): tcp FIN-WAIT-1 0 1 192.168.0.1:45520 192.0.2.1:8080 skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0) Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Create probe timer for tcp PMTU as per RFC4821Fan Du2015-03-061-2/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As per RFC4821 7.3. Selecting Probe Size, a probe timer should be armed once probing has converged. Once this timer expired, probing again to take advantage of any path PMTU change. The recommended probing interval is 10 minutes per RFC1981. Probing interval could be sysctled by sysctl_tcp_probe_interval. Eric Dumazet suggested to implement pseudo timer based on 32bits jiffies tcp_time_stamp instead of using classic timer for such rare event. Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: Use binary search to choose tcp PMTU probe_sizeFan Du2015-03-061-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current probe_size is chosen by doubling mss_cache, the probing process will end shortly with a sub-optimal mss size, and the link mtu will not be taken full advantage of, in return, this will make user to tweak tcp_base_mss with care. Use binary search to choose probe_size in a fine granularity manner, an optimal mss will be found to boost performance as its maxmium. In addition, introduce a sysctl_tcp_probe_threshold to control when probing will stop in respect to the width of search range. Test env: Docker instance with vxlan encapuslation(82599EB) iperf -c 10.0.0.24 -t 60 before this patch: 1.26 Gbits/sec After this patch: increase 26% 1.59 Gbits/sec Signed-off-by: Fan Du <fan.du@intel.com> Acked-by: John Heffner <johnwheffner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: tso: allow CA_CWR state in tcp_tso_should_defer()Eric Dumazet2015-02-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Another TCP issue is triggered by ECN. Under pressure, receiver gets ECN marks, and send back ACK packets with ECE TCP flag. Senders enter CA_CWR state. In this state, tcp_tso_should_defer() is short cut : if (icsk->icsk_ca_state != TCP_CA_Open) goto send_now; This means that about all ACK packets we receive are triggering a partial send, and because cwnd is kept small, we can only send a small amount of data for each incoming ACK, which in return generate more ACK packets. Allowing CA_Open and CA_CWR states to enable TSO defer in tcp_tso_should_defer() brings performance back : TSO autodefer has more chance to defer under pressure. This patch increases TSO and LRO/GRO efficiency back to normal levels, and does not impact overall ECN behavior. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: tso: restore IW10 after TSO autosizingEric Dumazet2015-02-281-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With sysctl_tcp_min_tso_segs being 4, it is very possible that tcp_tso_should_defer() decides not sending last 2 MSS of initial window of 10 packets. This also applies if autosizing decides to send X MSS per GSO packet, and cwnd is not a multiple of X. This patch implements an heuristic based on age of first skb in write queue : If it was sent very recently (less than half srtt), we can predict that no ACK packet will come in less than half rtt, so deferring might cause an under utilization of our window. This is visible on initial send (IW10) on web servers, but more generally on some RPC, as the last part of the message might need an extra RTT to get delivered. Tested: Ran following packetdrill test // A simple server-side test that sends exactly an initial window (IW10) // worth of packets. `sysctl -e -q net.ipv4.tcp_min_tso_segs=4` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +.1 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6> +.1 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 write(4, ..., 14600) = 14600 +0 > . 1:5841(5840) ack 1 win 457 +0 > . 5841:11681(5840) ack 1 win 457 // Following packet should be sent right now. +0 > P. 11681:14601(2920) ack 1 win 457 +.1 < . 1:1(0) ack 14601 win 257 +0 close(4) = 0 +0 > F. 14601:14601(0) ack 1 +.1 < F. 1:1(0) ack 14602 win 257 +0 > . 14602:14602(0) ack 2 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: tso: remove tp->tso_deferredEric Dumazet2015-02-281-9/+5
|/ | | | | | | | | | | | | | | | | | | | | | | TSO relies on ability to defer sending a small amount of packets. Heuristic is to wait for future ACKS in hope to send more packets at once. Current algorithm uses a per socket tso_deferred field as a pseudo timer. This pseudo timer relies on future ACK, but there is no guarantee we receive them in time. Fix would be to use a real timer, but cost of such timer is probably too expensive for typical cases. This patch changes the logic to test the time of last transmit, because we should not add bursts of more than 1ms for any given flow. We've used this patch for about two years at Google, before FQ/pacing as it would reduce a fair amount of bursts. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Namespecify TCP PMTU mechanismFan Du2015-02-091-5/+3
| | | | | | | | | | | | | | | | Packetization Layer Path MTU Discovery works separately beside Path MTU Discovery at IP level, different net namespace has various requirements on which one to chose, e.g., a virutalized container instance would require TCP PMTU to probe an usable effective mtu for underlying tunnel, while the host would employ classical ICMP based PMTU to function. Hence making TCP PMTU mechanism per net namespace to decouple two functionality. Furthermore the probe base MSS should also be configured separately for each namespace. Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'for-davem' of ↵David S. Miller2015-02-041-3/+8
|\ | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs More iov_iter work from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ip: convert tcp_sendmsg() to iov_iter primitivesAl Viro2015-02-041-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | patch is actually smaller than it seems to be - most of it is unindenting the inner loop body in tcp_sendmsg() itself... the bit in tcp_input.c is going to get reverted very soon - that's what memcpy_from_msg() will become, but not in this commit; let's keep it reasonably contained... There's one potentially subtle change here: in case of short copy from userland, mainline tcp_send_syn_data() discards the skb it has allocated and falls back to normal path, where we'll send as much as possible after rereading the same data again. This patch trims SYN+data skb instead - that way we don't need to copy from the same place twice. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
OpenPOWER on IntegriCloud