summaryrefslogtreecommitdiffstats
path: root/sys/netinet/tcp_input.c
Commit message (Collapse)AuthorAgeFilesLines
* Revert the partial MFC of r313045 which broke dtracesmh2017-05-181-1/+2
| | | | | | | | | | | | This removes the mbuf to ipinfo_t translator and switches tcp_autorcvbuf to use the older mtod macro. This was originally merged to stable/10 as part of r317375. Reported by: markj Reviewed by: markj, hiren Sponsored by: Multiplay Differential Revision: https://reviews.freebsd.org/D10769
* Partial MFC r316676 and the required r313045smh2017-04-241-58/+63
| | | | | | | | | | | | | | | | | | MFC r316676: Use estimated RTT for receive buffer auto resizing instead of timestamps. This is a partial MFC as stable/10 doesn't include the TCP stack modularisation. MFC r313045: Add an mbuf to ipinfo_t translator to finish cleanup of mbuf passing to TCP probes. This is a partial MFC (missing debug__output & debug__drop changes) due to the massive amount of additional dtrace changes that would be required for a full MFC. Relnotes: Yes Sponsored by: Multiplay
* MFC r286227, r286443:jch2016-11-241-57/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | r286227: Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability: - The existing TCP INP_INFO lock continues to protect the global inpcb list stability during full list traversal (e.g. tcp_pcblist()). - A new INP_LIST lock protects inpcb list actual modifications (inp allocation and free) and inpcb global counters. It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input()) and INP_INFO_WLOCK only in occasional operations that walk all connections. PR: 183659 Differential Revision: https://reviews.freebsd.org/D2599 Reviewed by: jhb, adrian Tested by: adrian, nitroboost-gmail.com Sponsored by: Verisign, Inc. r286443: Fix a kernel assertion issue introduced with r286227: Avoid too strict INP_INFO_RLOCK_ASSERT checks due to tcp_notify() being called from in6_pcbnotify(). Reported by: Larry Rosenman <ler@lerctr.org> Submitted by: markj, jch
* MFC r307551:jch2016-10-251-0/+18
| | | | | | | | | | | | | | | | | | Fix a double-free when an inp transitions to INP_TIMEWAIT state after having been dropped. This change enforces in_pcbdrop() logic in tcp_input(): "in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet delivery or event notification when a socket remains open but TCP has closed." PR: 203175 Reported by: Palle Girgensohn, Slawa Olhovchenkov Tested by: Slawa Olhovchenkov Reviewed by: Slawa Olhovchenkov Approved by: gnn, Slawa Olhovchenkov Differential Revision: https://reviews.freebsd.org/D8211 Sponsored by: Verisign, inc
* MFC r271119, r272081:jch2016-07-271-12/+16
| | | | | | | | | | | | | | | | r271119: In tcp_input(), don't acquire the pcbinfo global write lock for SYN packets targeting a listening socket. Permit to reduce TCP input processing starvation in context of high SYN load (e.g. short-lived TCP connections or SYN flood). Submitted by: Julien Charbon <jcharbon@verisign.com> Reviewed by: adrian, hiren, jhb, Mike Bentkofsky r272081: Catch up with r271119.
* MFC r300240truckman2016-06-201-1/+1
| | | | | | | | | | | | | | | | | | | | Change net.inet.tcp.ecn.enable sysctl mib from a binary off/on control to a three way setting. 0 - Totally disable ECN. (no change) 1 - Enable ECN if incoming connections request it. Outgoing connections will request ECN. (no change from present != 0 setting) 2 - Enable ECN if incoming connections request it. Outgoing conections will not request ECN. Change the default value of net.inet.tcp.ecn.enable from 0 to 2. Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have similar capabilities. The actual values above match Linux, and the default matches the current Linux default. Reviewed by: eadler Relnotes: yes Differential Revision: https://reviews.freebsd.org/D6386
* MFC r298408:jtl2016-05-061-2/+11
| | | | | | Prevent underflows in tp->snd_wnd if the remote side ACKs more than tp->snd_wnd. This can happen, for example, when the remote side responds to a window probe by ACKing the one byte it contains.
* MFC: r292003hiren2016-01-111-8/+33
| | | | Improve tcp duplicate ack processing when SACK is present.
* MFC: r290122hiren2016-01-111-2/+25
| | | | | | | | | Calculate the correct amount of bytes that are in-flight for a connection as suggested by RFC 6675. MFC: r292046 r290122 added 4 bytes and removed 8 in struct sackhint. Add a pad entry of 4 bytes to restore the size.
* MFC r292706:pkelsey2015-12-281-6/+86
| | | | | | | | | | Implementation of server-side TCP Fast Open (TFO) [RFC7413]. TFO is disabled by default in the kernel build. See the top comment in sys/netinet/tcp_fastopen.c for implementation particulars. Differential Revision: https://reviews.freebsd.org/D4350 Sponsored by: Verisign, Inc.
* MFC r288914hiren2015-10-141-0/+10
| | | | Add a comment specifying how we implement rfc3042.
* MFC r285567:pkelsey2015-07-211-0/+1
| | | | | | | | Check TCP timestamp option flag so that the automatic receive buffer scaling code does not use an uninitialized timestamp echo reply value from the stack when timestamps are not enabled. Approved by: re (gjb)
* MFC r266420 (by adrian)hiren2015-06-191-0/+1
| | | | | | | | Ensure that the flowid hashtype is assigned to the inp if the flowid is also assigned. Spotted by: gallatin Tested by: gallatin
* MFC r275358 r275483 r276982 - Removing M_FLOWID by hps@hiren2015-04-241-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | r275358: Start process of removing the use of the deprecated "M_FLOWID" flag from the FreeBSD network code. The flag is still kept around in the "sys/mbuf.h" header file, but does no longer have any users. Instead the "m_pkthdr.rsstype" field in the mbuf structure is now used to decide the meaning of the "m_pkthdr.flowid" field. To modify the "m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX" macros as defined in the "sys/mbuf.h" header file. This patch introduces new behaviour in the transmit direction. Previously network drivers checked if "M_FLOWID" was set in "m_flags" before using the "m_pkthdr.flowid" field. This check has now now been replaced by checking if "M_HASHTYPE_GET(m)" is different from "M_HASHTYPE_NONE". In the future more hashtypes will be added, for example hashtypes for hardware dedicated flows. "M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is valid and has no particular type. This change removes the need for an "if" statement in TCP transmit code checking for the presence of a valid flowid value. The "if" statement mentioned above is now a direct variable assignment which is then later checked by the respective network drivers like before. r275483: Remove M_FLOWID from SCTP code. r276982: Remove no longer used "M_FLOWID" flag from mbuf.h and update the netisr manpage. Note: The FreeBSD version has been bumped. Reviewed by: hps, tuexen Sponsored by: Limelight Networks
* MFC r271946 and r272595:hselasky2014-11-031-0/+2
| | | | | | | | | Improve transmit sending offload, TSO, algorithm in general. This change allows all HCAs from Mellanox Technologies to function properly when TSO is enabled. See r271946 and r272595 for more details about this commit. Sponsored by: Mellanox Technologies
* Fix Denial of Service in TCP packet processing.delphij2014-09-161-5/+1
| | | | | Security: FreeBSD-SA-14:19.tcp Approved by: re (implicit, security advisory)
* MFC r266620:bz2014-08-161-4/+0
| | | | | | Remove the prototpye for the static inline function tcp_signature_verify_input(). The function is defined before first use already.
* MFC r266597:bz2014-08-161-2/+0
| | | | | | | | Remove the prototypes for things that are no longer file local but were moved to the header file. Was suppoed to be MFCed with: r266596 Pointy hat to: bz
* MFC r266596:bz2014-08-161-20/+0
| | | | | | | | Move the tcp_fields_to_host() and tcp_fields_to_net() (inline) functions to the tcp_var.h header file in order to avoid further duplication with upcoming commits. Reviewed by: np
* MFC r258622: dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINEavg2014-01-171-3/+3
|
* MFC r258605: Convert over the TCP probes to use mtod()avg2014-01-171-8/+8
| | | | MFC slacker: adrian
* Revert MFC of r258821 - it was already handled by MFC of r239672.peter2014-01-081-4/+2
| | | | Pointy hat to: peter
* MFC r258821 - fix tcp simultaneous closepeter2014-01-071-2/+4
| | | | PR: kern/99188
* MFC r259906: Draft-ietf-tcpm-initcwnd-05 became RFC6928.pluknet2014-01-021-2/+2
|
* MFC r256920:andre2013-10-291-4/+7
| | | | | | | | | | | | | | | | | | | | | | | The TCP delayed ACK logic isn't aware of LRO passing up large aggregated segments thinking it received only one segment. This causes it to enable the delay the ACK for 100ms to wait for another segment which may never come because all the data was received already. Doing delayed ACK for LRO segments is bogus for two reasons: a) it pushes us further away from acking every other packet; b) it introduces additional delay in responding to the sender. The latter is especially bad because it is in the nature of LRO to aggregated all segments of a burst with no more coming until an ACK is sent back. Change the delayed ACK logic to detect LRO segments by being larger than the MSS for this connection and issuing an immediate ACK for them to keep the ACK clock ticking without interruption. Reported by: julian, cperciva Tested by: cperciva Reviewed by: lstewart Approved by: re (glebius)
* When processing ACK in tcp_do_segment, use sbcut_locked() instead ofglebius2013-10-091-2/+5
| | | | | | | | | | | sbdrop_locked() to cut acked mbufs from the socket buffer. Free this chain a batch manner after the socket buffer lock is dropped. This measurably reduces contention on socket buffer. Sponsored by: Netflix Sponsored by: Nginx, Inc. Approved by: re (marius)
* Implement the ip, tcp, and udp DTrace providers. The probe definitions usemarkj2013-08-251-10/+30
| | | | | | | | | dynamic translation so that their arguments match the definitions for these providers in Solaris and illumos. Thus, existing scripts for these providers should work unmodified on FreeBSD. Tested by: gnn, hiren MFC after: 1 month
* Remove the large part of struct ipsecstat. Only few fields of thisae2013-07-231-2/+2
| | | | | | | | | | | structure is used, but they already have equal fields in the struct newipsecstat, that was introduced with FAST_IPSEC and then was merged together with old ipsecstat structure. This fixes kernel stack overflow on some architectures after migration ipsecstat to PCPU counters. Reported by: Taku YAMAMOTO, Maciej Milewski
* Extend debug logging of TCP timestamp related specificationandre2013-07-101-5/+25
| | | | | | violations. Update related comments and style.
* Use new macros to implement ipstat and tcpstat using PCPU counters.ae2013-07-091-58/+7
| | | | Change interface of kread_counters() similar ot kread() in the netstat(1).
* Fix kmod_*stat_inc() after r249276. The incorrect code actuallyglebius2013-06-211-1/+1
| | | | | | | | increased the pointer, not the memory it points to. In collaboration with: kib Reported & tested by: Ian FREISLICH <ianf clue.co.za> Sponsored by: Nginx, Inc.
* Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statisticsae2013-06-201-2/+2
| | | | | | accounting. MFC after: 2 weeks
* Allow drivers to specify a maximum TSO length in bytes if they areandre2013-06-031-7/+10
| | | | | | | | | | | | | | | | | | | | | | | limited in the amount of data they can handle at once. Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to change the limit. The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything less wouldn't be very useful anymore. The upper limit is still at IP_MAXPACKET (65536 bytes). Raising it requires further auditing of the IPv4/v6 code path's as the length field in the IP header would overflow leading to confusion in firewalls and others packet handler on the real size of the packet. The placement into "struct ifnet" is a bit hackish but the best place that was found. When the stack/driver boundary is updated it should be handled in a better way. Submitted by: cperciva (earlier version) Reviewed by: cperciva Tested by: cperciva MFC after: 1 week (using spare struct members to preserve ABI)
* When doing RFC3042 limited transmit on the first on secondandre2013-04-231-1/+12
| | | | | | | | | duplicate ACK make sure we actually have new data to send. This prevents us from sending unneccessary pure ACKs. Reported by: Matt Miller <matt@matthewjmiller.net> Tested by: Matt Miller <matt@matthewjmiller.net> MFC after: 2 weeks
* Fix a race condition on tcp listen socket teardown with pendingandre2013-04-091-0/+9
| | | | | | | | | | | | | | connections in the accept queue and contiguous new incoming SYNs. Compared to the original submitters patch I've moved the test next to the SYN handling to have it together in a logical unit and reworded the comment explaining the issue. Submitted by: Matt Miller <matt@matthewjmiller.net> Submitted by: Juan Mojica <jmojica@gmail.com> Reviewed by: Matt Miller (changes) Tested by: pho MFC after: 1 week
* Fix VIMAGE build.glebius2013-04-091-2/+2
|
* Merge from projects/counters: TCP/IP stats.glebius2013-04-081-10/+64
| | | | | | | | | Convert 'struct ipstat' and 'struct tcpstat' to counter(9). This speeds up IP forwarding at extreme packet rates, and makes accounting more precise. Sponsored by: Nginx, Inc.
* Keep fwd_tag around for subsequent pcb lookupsemaste2013-03-291-17/+8
| | | | | | | | | | For TIMEWAIT handling tcp_input may have to jump back for an additional pass through pcblookup. Prior to this change the fwd_tag had been discarded after the first lookup, so a new connection attempt delivered locally via 'ipfw fwd' would fail to find a match. As of r248886 the tag will be detached and freed when passed to the socket buffer.
* Simplify and fix a bug in cc_ack_received()'s "are we congestion window limited"lstewart2013-01-221-1/+1
| | | | | | | | | | | | logic (refer to [1] for associated discussion). snd_cwnd and snd_wnd are unsigned long and on 64 bit hosts, min() will truncate them to 32 bits and could therefore potentially corrupt the result (although under normal operation, neither variable should legitmately exceed 32 bits). [1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html Submitted by: jhb MFC after: 1 week
* Fix !INET6 build after r244365.glebius2012-12-181-2/+11
|
* Clear correct flag in INET6 case.glebius2012-12-181-1/+1
|
* Since we use different flags to detect tcp forwarding, and we share theae2012-12-171-1/+2
| | | | | | | same code for IPv4 and IPv6 in tcp_input, we should check both M_IP_NEXTHOP and M_IP6_NEXTHOP flags. MFC after: 3 days
* Fix a crash in tcp_input(), that happens when mbuf has a fwd_tag on it,glebius2012-12-121-0/+2
| | | | | | | | | but later after processing and freeing the tag, we need to jump back again to the findpcb label. Since the fwd_tag pointer wasn't NULL we tried to process and free the tag for second time. Reported & tested by: Pawel Tyll <ptyll nitronet.pl> MFC after: 3 days
* Back out r242262. The simplified window change/update logic wasn'tandre2012-11-051-47/+16
| | | | | | complete and ready for production use. PR: kern/173309
* Remove the recently added sysctl variable net.pfil.forward.ae2012-11-021-2/+3
| | | | | | | | | Instead, add protocol specific mbuf flags M_IP_NEXTHOP and M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup only when this flag is set. Suggested by: andre
* Increase the initial CWND to 10 segments as defined in IETF TCPMandre2012-10-281-0/+12
| | | | | | | | | | | | | | | | | | | | | | draft-ietf-tcpm-initcwnd-05. It explains why the increased initial window improves the overall performance of many web services without risking congestion collapse. As long as it remains a draft it is placed under a sysctl marking it as experimental: net.inet.tcp.experimental.initcwnd10 = 1 When it becomes an official RFC soon the sysctl will be changed to the RFC number and moved to net.inet.tcp. This implementation differs from the RFC draft in that it is a bit more conservative in the case of packet loss on SYN or SYN|ACK because we haven't reduced the default RTO to 1 second yet. Also the restart window isn't yet increased as allowed. Both will be adjusted with upcoming changes. Is is enabled by default. In Linux it is enabled since kernel 3.0. MFC after: 2 weeks
* Simplify and enhance the window change/update acceptance logic,andre2012-10-281-16/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | especially in the presence of bi-directional data transfers. snd_wl1 tracks the right edge, including data in the reassembly queue, of valid incoming data. This makes it like rcv_nxt plus reassembly. It never goes backwards to prevent older, possibly reordered segments from updating the window. snd_wl2 tracks the left edge of sent data. This makes it a duplicate of snd_una. However joining them right now is difficult due to separate update dependencies in different places in the code flow. snd_wnd tracks the current advertized send window by the peer. In tcp_output() the effective window is calculated by subtracting the already in-flight data, snd_nxt less snd_una, from it. ACK's become the main clock of window updates and will always update the window when the left edge of what we sent is advanced. The ACK clock is the primary signaling mechanism in ongoing data transfers. This works reliably even in the presence of reordering, reassembly and retransmitted segments. The ACK clock is most important because it determines how much data we are allowed to inject into the network. Zero window updates get us out of persistence mode are crucial. Here a segment that neither moves ACK nor SEQ but enlarges WND is accepted. When the ACK clock is not active (that is we're not or no longer sending any data) any segment that moves the extended right SEQ edge, including out-of-order segments, updates the window. This gives us updates especially during ping-pong transfers where the peer isn't done consuming the already acknowledged data from the receive buffer while responding with data. The SSH protocol is a prime candidate to benefit from the improved bi-directional window update logic as it has its own windowing mechanism on top of TCP and is frequently sending back protocol ACK's. Tcpdump provided by: darrenr Tested by: darrenr MFC after: 2 weeks
* Allow arbitrary MSS sizes and don't mind about the cluster size anymore.andre2012-10-281-11/+2
| | | | | | | | We've got more cluster sizes for quite some time now and the orginally imposed limits and the previously codified thoughts on efficiency gains are no longer true. MFC after: 2 weeks
* When SYN or SYN/ACK had to be retransmitted RFC5681 requires us toandre2012-10-281-3/+9
| | | | | | | | | | reduce the initial CWND to one segment. This reduction got lost some time ago due to a change in initialization ordering. Additionally in tcp_timer_rexmt() avoid entering fast recovery when we're still in TCPS_SYN_SENT state. MFC after: 2 weeks
* Adjust the initial default CWND upon connection establishment to theandre2012-10-281-2/+9
| | | | | | | | new and increased values specified by RFC5681 Section 3.1. The even larger initial CWND per RFC3390, if enabled, is not affected. MFC after: 2 weeks
OpenPOWER on IntegriCloud