summaryrefslogtreecommitdiffstats
path: root/sys/netinet/tcp_input.c
Commit message (Collapse)AuthorAgeFilesLines
* MFC r266420 (by adrian)hiren2015-06-191-0/+1
| | | | | | | | Ensure that the flowid hashtype is assigned to the inp if the flowid is also assigned. Spotted by: gallatin Tested by: gallatin
* MFC r275358 r275483 r276982 - Removing M_FLOWID by hps@hiren2015-04-241-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | r275358: Start process of removing the use of the deprecated "M_FLOWID" flag from the FreeBSD network code. The flag is still kept around in the "sys/mbuf.h" header file, but does no longer have any users. Instead the "m_pkthdr.rsstype" field in the mbuf structure is now used to decide the meaning of the "m_pkthdr.flowid" field. To modify the "m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX" macros as defined in the "sys/mbuf.h" header file. This patch introduces new behaviour in the transmit direction. Previously network drivers checked if "M_FLOWID" was set in "m_flags" before using the "m_pkthdr.flowid" field. This check has now now been replaced by checking if "M_HASHTYPE_GET(m)" is different from "M_HASHTYPE_NONE". In the future more hashtypes will be added, for example hashtypes for hardware dedicated flows. "M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is valid and has no particular type. This change removes the need for an "if" statement in TCP transmit code checking for the presence of a valid flowid value. The "if" statement mentioned above is now a direct variable assignment which is then later checked by the respective network drivers like before. r275483: Remove M_FLOWID from SCTP code. r276982: Remove no longer used "M_FLOWID" flag from mbuf.h and update the netisr manpage. Note: The FreeBSD version has been bumped. Reviewed by: hps, tuexen Sponsored by: Limelight Networks
* MFC r271946 and r272595:hselasky2014-11-031-0/+2
| | | | | | | | | Improve transmit sending offload, TSO, algorithm in general. This change allows all HCAs from Mellanox Technologies to function properly when TSO is enabled. See r271946 and r272595 for more details about this commit. Sponsored by: Mellanox Technologies
* Fix Denial of Service in TCP packet processing.delphij2014-09-161-5/+1
| | | | | Security: FreeBSD-SA-14:19.tcp Approved by: re (implicit, security advisory)
* MFC r266620:bz2014-08-161-4/+0
| | | | | | Remove the prototpye for the static inline function tcp_signature_verify_input(). The function is defined before first use already.
* MFC r266597:bz2014-08-161-2/+0
| | | | | | | | Remove the prototypes for things that are no longer file local but were moved to the header file. Was suppoed to be MFCed with: r266596 Pointy hat to: bz
* MFC r266596:bz2014-08-161-20/+0
| | | | | | | | Move the tcp_fields_to_host() and tcp_fields_to_net() (inline) functions to the tcp_var.h header file in order to avoid further duplication with upcoming commits. Reviewed by: np
* MFC r258622: dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINEavg2014-01-171-3/+3
|
* MFC r258605: Convert over the TCP probes to use mtod()avg2014-01-171-8/+8
| | | | MFC slacker: adrian
* Revert MFC of r258821 - it was already handled by MFC of r239672.peter2014-01-081-4/+2
| | | | Pointy hat to: peter
* MFC r258821 - fix tcp simultaneous closepeter2014-01-071-2/+4
| | | | PR: kern/99188
* MFC r259906: Draft-ietf-tcpm-initcwnd-05 became RFC6928.pluknet2014-01-021-2/+2
|
* MFC r256920:andre2013-10-291-4/+7
| | | | | | | | | | | | | | | | | | | | | | | The TCP delayed ACK logic isn't aware of LRO passing up large aggregated segments thinking it received only one segment. This causes it to enable the delay the ACK for 100ms to wait for another segment which may never come because all the data was received already. Doing delayed ACK for LRO segments is bogus for two reasons: a) it pushes us further away from acking every other packet; b) it introduces additional delay in responding to the sender. The latter is especially bad because it is in the nature of LRO to aggregated all segments of a burst with no more coming until an ACK is sent back. Change the delayed ACK logic to detect LRO segments by being larger than the MSS for this connection and issuing an immediate ACK for them to keep the ACK clock ticking without interruption. Reported by: julian, cperciva Tested by: cperciva Reviewed by: lstewart Approved by: re (glebius)
* When processing ACK in tcp_do_segment, use sbcut_locked() instead ofglebius2013-10-091-2/+5
| | | | | | | | | | | sbdrop_locked() to cut acked mbufs from the socket buffer. Free this chain a batch manner after the socket buffer lock is dropped. This measurably reduces contention on socket buffer. Sponsored by: Netflix Sponsored by: Nginx, Inc. Approved by: re (marius)
* Implement the ip, tcp, and udp DTrace providers. The probe definitions usemarkj2013-08-251-10/+30
| | | | | | | | | dynamic translation so that their arguments match the definitions for these providers in Solaris and illumos. Thus, existing scripts for these providers should work unmodified on FreeBSD. Tested by: gnn, hiren MFC after: 1 month
* Remove the large part of struct ipsecstat. Only few fields of thisae2013-07-231-2/+2
| | | | | | | | | | | structure is used, but they already have equal fields in the struct newipsecstat, that was introduced with FAST_IPSEC and then was merged together with old ipsecstat structure. This fixes kernel stack overflow on some architectures after migration ipsecstat to PCPU counters. Reported by: Taku YAMAMOTO, Maciej Milewski
* Extend debug logging of TCP timestamp related specificationandre2013-07-101-5/+25
| | | | | | violations. Update related comments and style.
* Use new macros to implement ipstat and tcpstat using PCPU counters.ae2013-07-091-58/+7
| | | | Change interface of kread_counters() similar ot kread() in the netstat(1).
* Fix kmod_*stat_inc() after r249276. The incorrect code actuallyglebius2013-06-211-1/+1
| | | | | | | | increased the pointer, not the memory it points to. In collaboration with: kib Reported & tested by: Ian FREISLICH <ianf clue.co.za> Sponsored by: Nginx, Inc.
* Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statisticsae2013-06-201-2/+2
| | | | | | accounting. MFC after: 2 weeks
* Allow drivers to specify a maximum TSO length in bytes if they areandre2013-06-031-7/+10
| | | | | | | | | | | | | | | | | | | | | | | limited in the amount of data they can handle at once. Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to change the limit. The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything less wouldn't be very useful anymore. The upper limit is still at IP_MAXPACKET (65536 bytes). Raising it requires further auditing of the IPv4/v6 code path's as the length field in the IP header would overflow leading to confusion in firewalls and others packet handler on the real size of the packet. The placement into "struct ifnet" is a bit hackish but the best place that was found. When the stack/driver boundary is updated it should be handled in a better way. Submitted by: cperciva (earlier version) Reviewed by: cperciva Tested by: cperciva MFC after: 1 week (using spare struct members to preserve ABI)
* When doing RFC3042 limited transmit on the first on secondandre2013-04-231-1/+12
| | | | | | | | | duplicate ACK make sure we actually have new data to send. This prevents us from sending unneccessary pure ACKs. Reported by: Matt Miller <matt@matthewjmiller.net> Tested by: Matt Miller <matt@matthewjmiller.net> MFC after: 2 weeks
* Fix a race condition on tcp listen socket teardown with pendingandre2013-04-091-0/+9
| | | | | | | | | | | | | | connections in the accept queue and contiguous new incoming SYNs. Compared to the original submitters patch I've moved the test next to the SYN handling to have it together in a logical unit and reworded the comment explaining the issue. Submitted by: Matt Miller <matt@matthewjmiller.net> Submitted by: Juan Mojica <jmojica@gmail.com> Reviewed by: Matt Miller (changes) Tested by: pho MFC after: 1 week
* Fix VIMAGE build.glebius2013-04-091-2/+2
|
* Merge from projects/counters: TCP/IP stats.glebius2013-04-081-10/+64
| | | | | | | | | Convert 'struct ipstat' and 'struct tcpstat' to counter(9). This speeds up IP forwarding at extreme packet rates, and makes accounting more precise. Sponsored by: Nginx, Inc.
* Keep fwd_tag around for subsequent pcb lookupsemaste2013-03-291-17/+8
| | | | | | | | | | For TIMEWAIT handling tcp_input may have to jump back for an additional pass through pcblookup. Prior to this change the fwd_tag had been discarded after the first lookup, so a new connection attempt delivered locally via 'ipfw fwd' would fail to find a match. As of r248886 the tag will be detached and freed when passed to the socket buffer.
* Simplify and fix a bug in cc_ack_received()'s "are we congestion window limited"lstewart2013-01-221-1/+1
| | | | | | | | | | | | logic (refer to [1] for associated discussion). snd_cwnd and snd_wnd are unsigned long and on 64 bit hosts, min() will truncate them to 32 bits and could therefore potentially corrupt the result (although under normal operation, neither variable should legitmately exceed 32 bits). [1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html Submitted by: jhb MFC after: 1 week
* Fix !INET6 build after r244365.glebius2012-12-181-2/+11
|
* Clear correct flag in INET6 case.glebius2012-12-181-1/+1
|
* Since we use different flags to detect tcp forwarding, and we share theae2012-12-171-1/+2
| | | | | | | same code for IPv4 and IPv6 in tcp_input, we should check both M_IP_NEXTHOP and M_IP6_NEXTHOP flags. MFC after: 3 days
* Fix a crash in tcp_input(), that happens when mbuf has a fwd_tag on it,glebius2012-12-121-0/+2
| | | | | | | | | but later after processing and freeing the tag, we need to jump back again to the findpcb label. Since the fwd_tag pointer wasn't NULL we tried to process and free the tag for second time. Reported & tested by: Pawel Tyll <ptyll nitronet.pl> MFC after: 3 days
* Back out r242262. The simplified window change/update logic wasn'tandre2012-11-051-47/+16
| | | | | | complete and ready for production use. PR: kern/173309
* Remove the recently added sysctl variable net.pfil.forward.ae2012-11-021-2/+3
| | | | | | | | | Instead, add protocol specific mbuf flags M_IP_NEXTHOP and M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup only when this flag is set. Suggested by: andre
* Increase the initial CWND to 10 segments as defined in IETF TCPMandre2012-10-281-0/+12
| | | | | | | | | | | | | | | | | | | | | | draft-ietf-tcpm-initcwnd-05. It explains why the increased initial window improves the overall performance of many web services without risking congestion collapse. As long as it remains a draft it is placed under a sysctl marking it as experimental: net.inet.tcp.experimental.initcwnd10 = 1 When it becomes an official RFC soon the sysctl will be changed to the RFC number and moved to net.inet.tcp. This implementation differs from the RFC draft in that it is a bit more conservative in the case of packet loss on SYN or SYN|ACK because we haven't reduced the default RTO to 1 second yet. Also the restart window isn't yet increased as allowed. Both will be adjusted with upcoming changes. Is is enabled by default. In Linux it is enabled since kernel 3.0. MFC after: 2 weeks
* Simplify and enhance the window change/update acceptance logic,andre2012-10-281-16/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | especially in the presence of bi-directional data transfers. snd_wl1 tracks the right edge, including data in the reassembly queue, of valid incoming data. This makes it like rcv_nxt plus reassembly. It never goes backwards to prevent older, possibly reordered segments from updating the window. snd_wl2 tracks the left edge of sent data. This makes it a duplicate of snd_una. However joining them right now is difficult due to separate update dependencies in different places in the code flow. snd_wnd tracks the current advertized send window by the peer. In tcp_output() the effective window is calculated by subtracting the already in-flight data, snd_nxt less snd_una, from it. ACK's become the main clock of window updates and will always update the window when the left edge of what we sent is advanced. The ACK clock is the primary signaling mechanism in ongoing data transfers. This works reliably even in the presence of reordering, reassembly and retransmitted segments. The ACK clock is most important because it determines how much data we are allowed to inject into the network. Zero window updates get us out of persistence mode are crucial. Here a segment that neither moves ACK nor SEQ but enlarges WND is accepted. When the ACK clock is not active (that is we're not or no longer sending any data) any segment that moves the extended right SEQ edge, including out-of-order segments, updates the window. This gives us updates especially during ping-pong transfers where the peer isn't done consuming the already acknowledged data from the receive buffer while responding with data. The SSH protocol is a prime candidate to benefit from the improved bi-directional window update logic as it has its own windowing mechanism on top of TCP and is frequently sending back protocol ACK's. Tcpdump provided by: darrenr Tested by: darrenr MFC after: 2 weeks
* Allow arbitrary MSS sizes and don't mind about the cluster size anymore.andre2012-10-281-11/+2
| | | | | | | | We've got more cluster sizes for quite some time now and the orginally imposed limits and the previously codified thoughts on efficiency gains are no longer true. MFC after: 2 weeks
* When SYN or SYN/ACK had to be retransmitted RFC5681 requires us toandre2012-10-281-3/+9
| | | | | | | | | | reduce the initial CWND to one segment. This reduction got lost some time ago due to a change in initialization ordering. Additionally in tcp_timer_rexmt() avoid entering fast recovery when we're still in TCPS_SYN_SENT state. MFC after: 2 weeks
* Adjust the initial default CWND upon connection establishment to theandre2012-10-281-2/+9
| | | | | | | | new and increased values specified by RFC5681 Section 3.1. The even larger initial CWND per RFC3390, if enabled, is not affected. MFC after: 2 weeks
* Remove the IPFIREWALL_FORWARD kernel option and make possible to turnae2012-10-251-12/+5
| | | | | | | | | on the related functionality in the runtime via the sysctl variable net.pfil.forward. It is turned off by default. Sponsored by: Yandex LLC Discussed with: net@ MFC after: 2 weeks
* Do not reduce ip_len by size of IP header in the ip_input()glebius2012-10-231-1/+1
| | | | | | | | | | | before passing a packet to protocol input routines. For several protocols this mean that now protocol needs to do subtraction itself, and for another half this means that we do not need to add header length back to the packet. Make ip_stripoptions() to adjust ip_len, since now we enter this function with a packet header whose ip_len does represent length of entire packet, not payload only.
* Switch the entire IPv4 stack to keep the IP packet headerglebius2012-10-221-17/+8
| | | | | | | | | | | | | | | | | | | | | | | in network byte order. Any host byte order processing is done in local variables and host byte order values are never[1] written to a packet. After this change a packet processed by the stack isn't modified at all[2] except for TTL. After this change a network stack hacker doesn't need to scratch his head trying to figure out what is the byte order at the given place in the stack. [1] One exception still remains. The raw sockets convert host byte order before pass a packet to an application. Probably this would remain for ages for compatibility. [2] The ip_input() still subtructs header len from ip->ip_len, but this is planned to be fixed soon. Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru> Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
* In ip_stripoptions():glebius2012-10-121-1/+1
| | | | | - Remove unused argument and incorrect comment. - Fixup ip_len after stripping.
* This small change takes care of a race conditionrrs2012-08-251-0/+30
| | | | | | | | | | | | | that can occur when both sides close at the same time. If that occurs, without this fix the connection enters FIN1 on both sides and they will forever send FIN|ACK at each other until the connection times out. This is because we stopped processing the FIN|ACK and thus did not advance the sequence and so never ACK'd each others FIN. This fix adjusts it so we *do* process the FIN properly and the race goes away ;-) MFC after: 1 month
* Update some stale comments regarding tcbinfo locking in the TCP inputrwatson2012-07-221-10/+3
| | | | | | | path: read locks on tcbinfo are no longer used, so won't happen. No functional change. MFC after: 3 days
* - Updated TOE support in the kernel.np2012-06-191-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs. These are available as t3_tom and t4_tom modules that augment cxgb(4) and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as usual with or without these extra features. - iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the works and will follow soon. Build-tested with make universe. 30s overview ============ What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the capabilities of an interface: # ifconfig -m | grep TOE Enable/disable TCP offload on an interface (just like any other ifnet capability): # ifconfig cxgbe0 toe # ifconfig cxgbe0 -toe Which connections are offloaded? Look for toe4 and/or toe6 in the output of netstat and sockstat: # netstat -np tcp | grep toe # sockstat -46c | grep toe Reviewed by: bz, gnn Sponsored by: Chelsio communications. MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
* Plug more refcount leaks and possible NULL deref for interfaceemax2012-06-041-1/+4
| | | | | | | address list. Submitted by: scottl@ MFC after: 3 days
* It turns out that too many drivers are not only parsing the L2/3/4bz2012-05-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | headers for TSO but also for generic checksum offloading. Ideally we would only have one common function shared amongst all drivers, and perhaps when updating them for IPv6 we should introduce that. Eventually we should provide the meta information along with mbufs to avoid (re-)parsing entirely. To not break IPv6 (checksums and offload) and to be able to MFC the changes without risking to hurt 3rd party drivers, duplicate the v4 framework, as other OSes have done as well. Introduce interface capability flags for TX/RX checksum offload with IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6 flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6 fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and add an alias for CSUM_DATA_VALID_IPV6. This pretty much brings IPv6 handling in line with IPv4. TSO is still handled in a different way and not via if_hwassist. Update ifconfig to allow (un)setting of the new capability flags. Update loopback to announce the new capabilities and if_hwassist flags. Individual driver updates will have to follow, as will SCTP. Reported by: gallatin, dim, .. Reviewed by: gallatin (glanced at?) MFC after: 3 days X-MFC with: r235961,235959,235958
* MFp4 bz_ipv6_fast:bz2012-05-251-2/+20
| | | | | | | | | | | | | | | | | | Add code to handle pre-checked TCP checksums as indicated by mbuf flags to save the entire computation for validation if not needed. In the IPv6 TCP output path only compute the pseudo-header checksum, set the checksum offset in the mbuf field along the appropriate flag as done in IPv4. In tcp_respond() just initialize the IPv6 payload length to 0 as ip6_output() will properly set it. Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems Reviewed by: gnn (as part of the whole) MFC After: 3 days
* MFp4 bz_ipv6_fast:bz2012-05-251-2/+4
| | | | | | | | | | | | Factor out the tcp_hc_getmtu() call. As the comments say it applies to both v4 and v6, so only write it once making it easier to read the protocol family specifc code. Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems Reviewed by: gnn (as part of the whole) MFC After: 3 days
* When we receive an ICMP unreach need fragmentation datagram, we takeglebius2012-04-161-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | proposed MTU value from it and update the TCP host cache. Then tcp_mss_update() is called on the corresponding tcpcb. It finds the just allocated entry in the TCP host cache and updates MSS on the tcpcb. And then we do a fast retransmit of what we have in the tcp send buffer. This sequence gets broken if the TCP host cache is exausted. In this case allocation fails, and later called tcp_mss_update() finds nothing in cache. The fast retransmit is done with not reduced MSS and is immidiately replied by remote host with new ICMP datagrams and the cycle repeats. This ping-pong can go up to wirespeed. To fix this: - tcp_mss_update() gets new parameter - mtuoffer, that is like offer, but needs to have min_protoh subtracted. - tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify(). - tcp_mtudisc() now accepts not a useless error argument, but proposed MTU value, that is passed to tcp_mss_update() as mtuoffer. Reported by: az Reported by: Andrey Zonov <andrey zonov.org> Reviewed by: andre (previous version of patch)
OpenPOWER on IntegriCloud