path: root/sys/netinet/tcp_input.c
* MFC r258622: dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE  (avg, 2014-01-17, 1 file, -3/+3)
* MFC r258605: Convert over the TCP probes to use mtod()  (avg, 2014-01-17, 1 file, -8/+8)
  MFC slacker: adrian
* Revert MFC of r258821 - it was already handled by MFC of r239672.  (peter, 2014-01-08, 1 file, -4/+2)
  Pointy hat to: peter
* MFC r258821 - fix tcp simultaneous close  (peter, 2014-01-07, 1 file, -2/+4)
  PR: kern/99188
* MFC r259906: Draft-ietf-tcpm-initcwnd-05 became RFC6928.  (pluknet, 2014-01-02, 1 file, -2/+2)
* MFC r256920:  (andre, 2013-10-29, 1 file, -4/+7)
  The TCP delayed ACK logic isn't aware of LRO passing up large aggregated
  segments, so it thinks it received only one segment. This causes it to delay
  the ACK for 100ms to wait for another segment which may never come, because
  all the data was received already. Doing delayed ACK for LRO segments is
  bogus for two reasons: a) it pushes us further away from acking every other
  packet; b) it introduces additional delay in responding to the sender. The
  latter is especially bad because it is in the nature of LRO to aggregate all
  segments of a burst, with no more coming until an ACK is sent back. Change
  the delayed ACK logic to detect LRO segments by their being larger than the
  MSS for this connection and to issue an immediate ACK for them, keeping the
  ACK clock ticking without interruption.
  Reported by:  julian, cperciva
  Tested by:    cperciva
  Reviewed by:  lstewart
  Approved by:  re (glebius)
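  A minimal standalone sketch of the decision described above, using made-up
  types rather than the committed FreeBSD diff: segments larger than one MSS
  are assumed to be LRO aggregates and get an immediate ACK instead of a
  delayed one.

      #include <stdbool.h>
      #include <stdio.h>

      struct conn {
              unsigned int mss;       /* negotiated maximum segment size */
      };

      /* Return true if the ACK for this incoming segment may be delayed. */
      static bool
      may_delay_ack(const struct conn *c, unsigned int seg_len)
      {
              /* Larger than one MSS: almost certainly an LRO aggregate. */
              return (seg_len <= c->mss);
      }

      int
      main(void)
      {
              struct conn c = { .mss = 1460 };

              printf("%d\n", may_delay_ack(&c, 1460));    /* 1: may delay */
              printf("%d\n", may_delay_ack(&c, 8760));    /* 0: ACK immediately */
              return (0);
      }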
* When processing ACK in tcp_do_segment, use sbcut_locked() instead of  (glebius, 2013-10-09, 1 file, -2/+5)
  sbdrop_locked() to cut acked mbufs from the socket buffer. Free this chain
  in a batch manner after the socket buffer lock is dropped. This measurably
  reduces contention on the socket buffer.
  Sponsored by:  Netflix
  Sponsored by:  Nginx, Inc.
  Approved by:   re (marius)
* Implement the ip, tcp, and udp DTrace providers. The probe definitions use  (markj, 2013-08-25, 1 file, -10/+30)
  dynamic translation so that their arguments match the definitions for these
  providers in Solaris and illumos. Thus, existing scripts for these providers
  should work unmodified on FreeBSD.
  Tested by:  gnn, hiren
  MFC after:  1 month
* Remove the large part of struct ipsecstat. Only a few fields of this  (ae, 2013-07-23, 1 file, -2/+2)
  structure are used, and they already have equivalent fields in struct
  newipsecstat, which was introduced with FAST_IPSEC and later merged with the
  old ipsecstat structure. This fixes a kernel stack overflow on some
  architectures after the migration of ipsecstat to PCPU counters.
  Reported by:  Taku YAMAMOTO, Maciej Milewski
* Extend debug logging of TCP timestamp related specification  (andre, 2013-07-10, 1 file, -5/+25)
  violations. Update related comments and style.
* Use new macros to implement ipstat and tcpstat using PCPU counters.  (ae, 2013-07-09, 1 file, -58/+7)
  Change the interface of kread_counters() to be similar to kread() in
  netstat(1).
* Fix kmod_*stat_inc() after r249276. The incorrect code actually  (glebius, 2013-06-21, 1 file, -1/+1)
  increased the pointer, not the memory it points to.
  In collaboration with:  kib
  Reported & tested by:   Ian FREISLICH <ianf clue.co.za>
  Sponsored by:           Nginx, Inc.
* Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statistics  (ae, 2013-06-20, 1 file, -2/+2)
  accounting.
  MFC after:  2 weeks
* Allow drivers to specify a maximum TSO length in bytes if they are  (andre, 2013-06-03, 1 file, -7/+10)
  limited in the amount of data they can handle at once. Drivers can set
  ifp->if_hw_tsomax before calling ether_ifattach() to change the limit. The
  lowest allowable size is IP_MAXPACKET / 8 (8192 bytes), as anything less
  wouldn't be very useful anymore. The upper limit is still at IP_MAXPACKET
  (65536 bytes). Raising it requires further auditing of the IPv4/v6 code
  paths, as the length field in the IP header would overflow, leading to
  confusion in firewalls and other packet handlers about the real size of the
  packet. The placement into "struct ifnet" is a bit hackish but the best
  place that was found. When the stack/driver boundary is updated it should be
  handled in a better way.
  Submitted by:  cperciva (earlier version)
  Reviewed by:   cperciva
  Tested by:     cperciva
  MFC after:     1 week (using spare struct members to preserve ABI)
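  A hedged sketch of the driver-side usage described above; the driver names
  and the 32 KB limit are invented for illustration, only the
  ifp->if_hw_tsomax field and the "set it before ether_ifattach()" rule come
  from the commit.

      /* Fragment of a hypothetical NIC driver's attach routine. */
      ifp = if_alloc(IFT_ETHER);
      if_initname(ifp, device_get_name(dev), device_get_unit(dev));
      ifp->if_softc = sc;
      /* ... remaining ifnet setup: capabilities, ioctl, transmit, ... */

      /* Hardware can only handle 32 KB per TSO request (hypothetical limit). */
      ifp->if_hw_tsomax = 32 * 1024;

      /* The limit must be in place before the interface is attached. */
      ether_ifattach(ifp, sc->eaddr);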
* When doing RFC3042 limited transmit on the first or second  (andre, 2013-04-23, 1 file, -1/+12)
  duplicate ACK, make sure we actually have new data to send. This prevents
  us from sending unnecessary pure ACKs.
  Reported by:  Matt Miller <matt@matthewjmiller.net>
  Tested by:    Matt Miller <matt@matthewjmiller.net>
  MFC after:    2 weeks
* Fix a race condition on tcp listen socket teardown with pending  (andre, 2013-04-09, 1 file, -0/+9)
  connections in the accept queue and contiguous new incoming SYNs. Compared
  to the original submitter's patch I've moved the test next to the SYN
  handling to have it together in a logical unit and reworded the comment
  explaining the issue.
  Submitted by:  Matt Miller <matt@matthewjmiller.net>
  Submitted by:  Juan Mojica <jmojica@gmail.com>
  Reviewed by:   Matt Miller (changes)
  Tested by:     pho
  MFC after:     1 week
* Fix VIMAGE build.  (glebius, 2013-04-09, 1 file, -2/+2)
* Merge from projects/counters: TCP/IP stats.  (glebius, 2013-04-08, 1 file, -10/+64)
  Convert 'struct ipstat' and 'struct tcpstat' to counter(9). This speeds up
  IP forwarding at extreme packet rates, and makes accounting more precise.
  Sponsored by:  Nginx, Inc.
* Keep fwd_tag around for subsequent pcb lookups  (emaste, 2013-03-29, 1 file, -17/+8)
  For TIMEWAIT handling tcp_input may have to jump back for an additional
  pass through pcblookup. Prior to this change the fwd_tag had been discarded
  after the first lookup, so a new connection attempt delivered locally via
  'ipfw fwd' would fail to find a match. As of r248886 the tag will be
  detached and freed when passed to the socket buffer.
* Simplify and fix a bug in cc_ack_received()'s "are we congestion window limited"  (lstewart, 2013-01-22, 1 file, -1/+1)
  logic (refer to [1] for associated discussion). snd_cwnd and snd_wnd are
  unsigned long, and on 64-bit hosts min() will truncate them to 32 bits and
  could therefore potentially corrupt the result (although under normal
  operation neither variable should legitimately exceed 32 bits).
  [1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html
  Submitted by:  jhb
  MFC after:     1 week
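  A standalone demonstration of the truncation described above, for an LP64
  host; min32() and ulmin_demo() are local stand-ins for a 32-bit-wide and a
  u_long-wide minimum, and the oversized window values are hypothetical.

      #include <stdio.h>

      /* Models a min() whose arguments are narrowed to 32 bits. */
      static unsigned int
      min32(unsigned int a, unsigned int b)
      {
              return (a < b ? a : b);
      }

      /* A u_long-wide minimum does not truncate. */
      static unsigned long
      ulmin_demo(unsigned long a, unsigned long b)
      {
              return (a < b ? a : b);
      }

      int
      main(void)
      {
              /* Hypothetical oversized values; normally both fit in 32 bits. */
              unsigned long snd_cwnd = 0x100000010UL;
              unsigned long snd_wnd  = 0x200000000UL;

              /* Prints 0: both arguments were silently truncated to 32 bits. */
              printf("%lu\n", (unsigned long)min32(snd_cwnd, snd_wnd));
              /* Prints 4294967312 (0x100000010): the expected minimum. */
              printf("%lu\n", ulmin_demo(snd_cwnd, snd_wnd));
              return (0);
      }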
* Fix !INET6 build after r244365.  (glebius, 2012-12-18, 1 file, -2/+11)
* Clear correct flag in INET6 case.  (glebius, 2012-12-18, 1 file, -1/+1)
* Since we use different flags to detect tcp forwarding, and we share the  (ae, 2012-12-17, 1 file, -1/+2)
  same code for IPv4 and IPv6 in tcp_input, we should check both M_IP_NEXTHOP
  and M_IP6_NEXTHOP flags.
  MFC after:  3 days
* Fix a crash in tcp_input() that happens when an mbuf has a fwd_tag on it  (glebius, 2012-12-12, 1 file, -0/+2)
  but, after processing and freeing the tag, we need to jump back to the
  findpcb label. Since the fwd_tag pointer wasn't NULL, we tried to process
  and free the tag a second time.
  Reported & tested by:  Pawel Tyll <ptyll nitronet.pl>
  MFC after:             3 days
* Back out r242262. The simplified window change/update logic wasn't  (andre, 2012-11-05, 1 file, -47/+16)
  complete and ready for production use.
  PR:  kern/173309
* Remove the recently added sysctl variable net.pfil.forward.  (ae, 2012-11-02, 1 file, -2/+3)
  Instead, add protocol-specific mbuf flags M_IP_NEXTHOP and M_IP6_NEXTHOP.
  Use them to indicate that the mbuf's chain contains the PACKET_TAG_IPFORWARD
  tag, and do a tag lookup only when this flag is set.
  Suggested by:  andre
* Increase the initial CWND to 10 segments as defined in IETF TCPM  (andre, 2012-10-28, 1 file, -0/+12)
  draft-ietf-tcpm-initcwnd-05. It explains why the increased initial window
  improves the overall performance of many web services without risking
  congestion collapse. As long as it remains a draft it is placed under a
  sysctl marking it as experimental:
      net.inet.tcp.experimental.initcwnd10 = 1
  When it becomes an official RFC soon, the sysctl will be changed to the RFC
  number and moved to net.inet.tcp. This implementation differs from the
  draft in that it is a bit more conservative in the case of packet loss on
  SYN or SYN|ACK, because we haven't reduced the default RTO to 1 second yet.
  Also the restart window isn't yet increased as allowed; both will be
  adjusted with upcoming changes. It is enabled by default. In Linux it has
  been enabled since kernel 3.0.
  MFC after:  2 weeks
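  A small userland sketch for toggling the knob with sysctlbyname(3), which
  needs root. The OID name is the one quoted in the commit above; the draft
  later became RFC6928 (see the MFC r259906 entry above), so treat the name
  as historical.

      #include <sys/types.h>
      #include <sys/sysctl.h>
      #include <err.h>

      int
      main(void)
      {
              int on = 1;

              /* Equivalent to: sysctl net.inet.tcp.experimental.initcwnd10=1 */
              if (sysctlbyname("net.inet.tcp.experimental.initcwnd10",
                  NULL, NULL, &on, sizeof(on)) == -1)
                      err(1, "sysctlbyname");
              return (0);
      }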
* Simplify and enhance the window change/update acceptance logic,  (andre, 2012-10-28, 1 file, -16/+47)
  especially in the presence of bi-directional data transfers.

  snd_wl1 tracks the right edge, including data in the reassembly queue, of
  valid incoming data. This makes it like rcv_nxt plus reassembly. It never
  goes backwards, to prevent older, possibly reordered segments from updating
  the window.

  snd_wl2 tracks the left edge of sent data. This makes it a duplicate of
  snd_una. However, joining them right now is difficult due to separate
  update dependencies in different places in the code flow.

  snd_wnd tracks the current send window advertised by the peer. In
  tcp_output() the effective window is calculated by subtracting the already
  in-flight data, snd_nxt less snd_una, from it.

  ACKs become the main clock of window updates and will always update the
  window when the left edge of what we sent is advanced. The ACK clock is the
  primary signaling mechanism in ongoing data transfers. This works reliably
  even in the presence of reordering, reassembly and retransmitted segments.
  The ACK clock is most important because it determines how much data we are
  allowed to inject into the network.

  Zero window updates, which get us out of persistence mode, are crucial.
  Here a segment that neither moves ACK nor SEQ but enlarges WND is accepted.

  When the ACK clock is not active (that is, we're not or no longer sending
  any data) any segment that moves the extended right SEQ edge, including
  out-of-order segments, updates the window. This gives us updates especially
  during ping-pong transfers where the peer isn't done consuming the already
  acknowledged data from the receive buffer while responding with data.

  The SSH protocol is a prime candidate to benefit from the improved
  bi-directional window update logic, as it has its own windowing mechanism
  on top of TCP and is frequently sending back protocol ACKs.
  Tcpdump provided by:  darrenr
  Tested by:            darrenr
  MFC after:            2 weeks
* Allow arbitrary MSS sizes and don't mind about the cluster size anymore.  (andre, 2012-10-28, 1 file, -11/+2)
  We've had more cluster sizes for quite some time now, and the originally
  imposed limits and the previously codified thoughts on efficiency gains are
  no longer true.
  MFC after:  2 weeks
* When a SYN or SYN/ACK has to be retransmitted, RFC5681 requires us to  (andre, 2012-10-28, 1 file, -3/+9)
  reduce the initial CWND to one segment. This reduction got lost some time
  ago due to a change in initialization ordering. Additionally, in
  tcp_timer_rexmt() avoid entering fast recovery when we're still in the
  TCPS_SYN_SENT state.
  MFC after:  2 weeks
* Adjust the initial default CWND upon connection establishment to the  (andre, 2012-10-28, 1 file, -2/+9)
  new and increased values specified by RFC5681 Section 3.1. The even larger
  initial CWND per RFC3390, if enabled, is not affected.
  MFC after:  2 weeks
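  For reference, a small sketch of the RFC 5681 Section 3.1 upper bound on
  the initial window, paraphrased from the RFC rather than taken from the
  committed FreeBSD code:

      #include <stdio.h>

      /* Upper bound on the initial congestion window per RFC 5681, 3.1. */
      static unsigned int
      rfc5681_initial_cwnd(unsigned int smss)
      {
              if (smss > 2190)
                      return (2 * smss);      /* at most 2 segments */
              else if (smss > 1095)
                      return (3 * smss);      /* at most 3 segments */
              else
                      return (4 * smss);      /* at most 4 segments */
      }

      int
      main(void)
      {
              /* A common Ethernet-derived SMSS of 1460 allows 3 segments. */
              printf("%u\n", rfc5681_initial_cwnd(1460));     /* 4380 */
              return (0);
      }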
* Remove the IPFIREWALL_FORWARD kernel option and make it possible to turn  (ae, 2012-10-25, 1 file, -12/+5)
  on the related functionality at runtime via the sysctl variable
  net.pfil.forward. It is turned off by default.
  Sponsored by:    Yandex LLC
  Discussed with:  net@
  MFC after:       2 weeks
* Do not reduce ip_len by the size of the IP header in ip_input()  (glebius, 2012-10-23, 1 file, -1/+1)
  before passing a packet to the protocol input routines. For several
  protocols this means that the protocol now needs to do the subtraction
  itself, and for the other half it means that we no longer need to add the
  header length back to the packet. Make ip_stripoptions() adjust ip_len,
  since we now enter this function with a packet header whose ip_len
  represents the length of the entire packet, not the payload only.
* Switch the entire IPv4 stack to keep the IP packet header  (glebius, 2012-10-22, 1 file, -17/+8)
  in network byte order. Any host byte order processing is done in local
  variables and host byte order values are never[1] written to a packet.
  After this change a packet processed by the stack isn't modified at all[2]
  except for the TTL.

  After this change a network stack hacker doesn't need to scratch his head
  trying to figure out what the byte order is at a given place in the stack.

  [1] One exception still remains. Raw sockets convert to host byte order
  before passing a packet to an application. Probably this will remain for
  ages for compatibility.

  [2] ip_input() still subtracts the header length from ip->ip_len, but this
  is planned to be fixed soon.

  Reviewed by:  luigi, Maxim Dounin <mdounin mdounin.ru>
  Tested by:    ray, Olivier Cochard-Labbe <olivier cochard.me>
* In ip_stripoptions():  (glebius, 2012-10-12, 1 file, -1/+1)
  - Remove unused argument and incorrect comment.
  - Fixup ip_len after stripping.
* This small change takes care of a race condition  (rrs, 2012-08-25, 1 file, -0/+30)
  that can occur when both sides close at the same time. If that occurs,
  without this fix the connection enters FIN1 on both sides and they will
  forever send FIN|ACK at each other until the connection times out. This is
  because we stopped processing the FIN|ACK and thus did not advance the
  sequence and so never ACK'd each other's FIN. This fix adjusts it so we
  *do* process the FIN properly and the race goes away ;-)
  MFC after:  1 month
* Update some stale comments regarding tcbinfo locking in the TCP input  (rwatson, 2012-07-22, 1 file, -10/+3)
  path: read locks on tcbinfo are no longer used, so won't happen. No
  functional change.
  MFC after:  3 days
* - Updated TOE support in the kernel.  (np, 2012-06-19, 1 file, -1/+12)
  - Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
    These are available as t3_tom and t4_tom modules that augment cxgb(4) and
    cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as usual
    with or without these extra features.
  - iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the works
    and will follow soon.
  Build-tested with make universe.

  30s overview
  ============
  What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
  capabilities of an interface:
  # ifconfig -m | grep TOE

  Enable/disable TCP offload on an interface (just like any other ifnet
  capability):
  # ifconfig cxgbe0 toe
  # ifconfig cxgbe0 -toe

  Which connections are offloaded? Look for toe4 and/or toe6 in the output of
  netstat and sockstat:
  # netstat -np tcp | grep toe
  # sockstat -46c | grep toe

  Reviewed by:   bz, gnn
  Sponsored by:  Chelsio communications.
  MFC after:     ~3 months (after 9.1, and after ensuring MFC is feasible)
* Plug more refcount leaks and possible NULL deref for interface  (emax, 2012-06-04, 1 file, -1/+4)
  address list.
  Submitted by:  scottl@
  MFC after:     3 days
* It turns out that too many drivers are not only parsing the L2/3/4  (bz, 2012-05-28, 1 file, -1/+1)
  headers for TSO but also for generic checksum offloading. Ideally we would
  only have one common function shared amongst all drivers, and perhaps when
  updating them for IPv6 we should introduce that. Eventually we should
  provide the meta information along with mbufs to avoid (re-)parsing
  entirely.

  To not break IPv6 (checksums and offload) and to be able to MFC the changes
  without risking breakage of 3rd party drivers, duplicate the v4 framework,
  as other OSes have done as well.

  Introduce interface capability flags for TX/RX checksum offload with IPv6,
  to allow independent toggling (where possible). Add CSUM_*_IPV6 flags for
  UDP/TCP over IPv6, and reserve further ones for SCTP and IPv6
  fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and add
  an alias for CSUM_DATA_VALID_IPV6.

  This pretty much brings IPv6 handling in line with IPv4. TSO is still
  handled in a different way and not via if_hwassist.

  Update ifconfig to allow (un)setting of the new capability flags. Update
  loopback to announce the new capabilities and if_hwassist flags. Individual
  driver updates will have to follow, as will SCTP.

  Reported by:  gallatin, dim, ..
  Reviewed by:  gallatin (glanced at?)
  MFC after:    3 days
  X-MFC with:   r235961,235959,235958
* MFp4 bz_ipv6_fast:  (bz, 2012-05-25, 1 file, -2/+20)
  Add code to handle pre-checked TCP checksums, as indicated by mbuf flags,
  to save the entire validation computation when it is not needed. In the
  IPv6 TCP output path only compute the pseudo-header checksum, and set the
  checksum offset in the mbuf field along with the appropriate flag, as done
  in IPv4. In tcp_respond() just initialize the IPv6 payload length to 0, as
  ip6_output() will properly set it.
  Sponsored by:  The FreeBSD Foundation
  Sponsored by:  iXsystems
  Reviewed by:   gnn (as part of the whole)
  MFC after:     3 days
* MFp4 bz_ipv6_fast:  (bz, 2012-05-25, 1 file, -2/+4)
  Factor out the tcp_hc_getmtu() call. As the comments say, it applies to
  both v4 and v6, so write it only once, making it easier to read the
  protocol-family-specific code.
  Sponsored by:  The FreeBSD Foundation
  Sponsored by:  iXsystems
  Reviewed by:   gnn (as part of the whole)
  MFC after:     3 days
* When we receive an ICMP "unreachable - fragmentation needed" datagram, we take  (glebius, 2012-04-16, 1 file, -8/+11)
  the proposed MTU value from it and update the TCP host cache. Then
  tcp_mss_update() is called on the corresponding tcpcb. It finds the just
  allocated entry in the TCP host cache and updates the MSS on the tcpcb.
  And then we do a fast retransmit of what we have in the tcp send buffer.

  This sequence gets broken if the TCP host cache is exhausted. In this case
  the allocation fails, and the later call to tcp_mss_update() finds nothing
  in the cache. The fast retransmit is done with a non-reduced MSS and is
  immediately answered by the remote host with new ICMP datagrams, and the
  cycle repeats. This ping-pong can go up to wirespeed.

  To fix this:
  - tcp_mss_update() gets a new parameter, mtuoffer, which is like offer but
    needs to have min_protoh subtracted.
  - tcp_mtudisc() as a notification method is renamed to
    tcp_mtudisc_notify().
  - tcp_mtudisc() now accepts not a useless error argument, but the proposed
    MTU value, which is passed to tcp_mss_update() as mtuoffer.
  Reported by:  az
  Reported by:  Andrey Zonov <andrey zonov.org>
  Reviewed by:  andre (previous version of patch)
* Fix PAWS (Protect Against Wrapped Sequence numbers) in cases when  (bz, 2012-02-15, 1 file, -14/+18)
  hz >> 1000, which puts us outside the timestamp clock frequency of
  1ms < x < 1s per tick as mandated by RFC1323, leading to connection resets
  on idle connections.

  Always use a granularity of 1ms using getmicrouptime(), making all but the
  relevant callouts independent of hz. Use getmicrouptime(), not
  getmicrotime(), as the latter may jump, possibly breaking TCP nfsroot
  mounts by having our timestamps move forward by more than 24.8 days in a
  second without having been idle for that long.
  PR:              kern/61404
  Reviewed by:     jhb, mav, rrs
  Discussed with:  silby, lstewart
  Sponsored by:    Sandvine Incorporated (originally in 2011)
  MFC after:       6 weeks
* Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and  (glebius, 2012-02-05, 1 file, -8/+8)
  TCP_KEEPCNT, which allow controlling the initial connection timeout, idle
  time, idle re-send interval and idle send count on a per-socket basis.
  Reviewed by:  andre, bz, lstewart
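  A usage sketch for the new per-socket knobs; the option names come from the
  commit above, while the numeric values are arbitrary examples (the idle and
  interval values are in seconds).

      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <netinet/tcp.h>
      #include <err.h>
      #include <unistd.h>

      int
      main(void)
      {
              int s, on = 1;
              int idle = 60;    /* seconds of idle time before the first probe */
              int intvl = 10;   /* seconds between keepalive probes */
              int cnt = 5;      /* unanswered probes before the connection drops */

              if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
                      err(1, "socket");
              /* SO_KEEPALIVE still has to be enabled for probes to be sent. */
              if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1 ||
                  setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) == -1 ||
                  setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) == -1 ||
                  setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) == -1)
                      err(1, "setsockopt");
              close(s);
              return (0);
      }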
* Remove the assertion from tcp_input() that rcv_nxt is always greater  (jhb, 2012-01-05, 1 file, -3/+0)
  than or equal to rcv_adv and fix tcp_twstart() to handle this case by
  assuming the last window was zero rather than a negative value. The code in
  tcp_input() already safely handled this case. It can happen due to delayed
  ACKs along with a remote sender that sends data beyond the window we
  previously advertised. If we have room in our socket buffer for the extra
  data beyond the advertised window, we will accept it. However, if the ACK
  for that segment is delayed, then we will not effectively fixup rcv_adv to
  account for that extra data until the next segment arrives and forces out
  an ACK. When that next segment arrives, rcv_nxt will be beyond rcv_adv.
  Tested by:  pjd
  MFC after:  1 week
* Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.  (ed, 2011-11-07, 1 file, -1/+1)
  The SYSCTL_NODE macro defines a list that stores all child-elements of that
  node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why
  it shouldn't be static.
* Restore sysctl names for tcp_sendspace/tcp_recvspace.  (pluknet, 2011-11-02, 1 file, -1/+1)
  They seem to have been changed unintentionally in r226437, and there was no
  mention of the renaming in the commit log message.
  Reported by:  Anton Yuzhaninov <citrin citrin ru>
* Add syntactic sugar missed in r226437 and then not added either when moving  (bz, 2011-10-17, 1 file, -1/+1)
  things around in r226448, but desperately needed to always make things
  compile successfully.
  MFC after:  1 week
* Move the tcp_sendspace and tcp_recvspace sysctls from  (andre, 2011-10-16, 1 file, -0/+5)
  the middle of tcp_usrreq.c to the top of tcp_output.c and tcp_input.c
  respectively, next to the socket buffer autosizing controls.
  MFC after:  1 week