path: root/sys/netinet/tcp_output.c
Commit message | Author | Age | Files | Lines
...
* When doing TSO subtract hdrlen from TCP_MAXWIN to prevent ip->ip_len | andre | 2006-09-15 | 1 | -5/+7
  from wrapping when we generate a maximally sized packet for later segmentation.

  Noticed by: gallatin
  Sponsored by: TCP/IP Optimization Fundraise 2005
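The arithmetic behind this change can be sketched as follows. This is an illustration only, not the code from the commit: the helper name `tso_clamp_len` is hypothetical, and the sketch assumes `TCP_MAXWIN` is 65535, the largest value that fits in the 16-bit `ip_len` field.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: the largest length a 16-bit ip_len field can carry. */
#define TCP_MAXWIN 65535u

/*
 * Hypothetical helper: clamp the payload length of a TSO burst so that
 * payload plus headers (hdrlen) still fits into 16 bits.
 */
static uint32_t
tso_clamp_len(uint32_t len, uint32_t hdrlen)
{
    assert(hdrlen <= TCP_MAXWIN);
    uint32_t limit = TCP_MAXWIN - hdrlen;

    return (len > limit ? limit : len);
}
```

Without the subtraction, a 65535-byte payload plus 40 bytes of headers would truncate to 39 when stored in a 16-bit field.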
* Rewrite of TCP syncookies to remove locking requirements and to enhance | andre | 2006-09-13 | 1 | -1/+1
  functionality:
  - Remove a rwlock acquisition/release per generated syncookie. Locking is now integrated with the bucket row locking of syncache itself and syncookies no longer add any additional lock overhead.
  - Syncookie secrets are different for and stored per syncache bucket row. Secrets expire after 16 seconds and are reseeded on-demand.
  - The computational overhead for syncookie generation and verification is one MD5 hash computation as before.
  - Syncache can be turned off and run with syncookies only by setting the sysctl net.inet.tcp.syncookies_only=1.

  This implementation extends the original idea and first implementation of FreeBSD by using not only the initial sequence number field to store information but also the timestamp field if present. This way we can keep track of the entire state we need to know to recreate the session in its original form. Almost all TCP speakers implement RFC1323 timestamps these days. For those that do not we still have to live with the known shortcomings of the ISN only SYN cookies. The use of the timestamp field causes the timestamps to be randomized if syncookies are enabled.

  The idea of SYN cookies is to encode and include all necessary information about the connection setup state within the SYN-ACK we send back and thus to get along without keeping any local state until the ACK to the SYN-ACK arrives (if ever). Everything we need to know should be available from the information we encoded in the SYN-ACK.

  A detailed description of the inner working of the syncookies mechanism is included in the comments in tcp_syncache.c.

  Reviewed by: silby (slightly earlier version)
  Sponsored by: TCP/IP Optimization Fundraise 2005
* Second step of TSO (TCP segmentation offload) support in our network stack. | andre | 2006-09-07 | 1 | -15/+73
  TSO is only used if we are in a pure bulk sending state. The presence of TCP-MD5, SACK retransmits, SACK advertisements, IPSEC and IP options prevents using TSO. With TSO the TCP header is the same (except for the sequence number) for all generated packets. This makes it impossible to transmit any options which vary per generated segment or packet.

  The length of TSO bursts is limited to TCP_MAXWIN. The sysctl net.inet.tcp.tso globally controls the use of TSO and is enabled.

  TSO enabled sends originating from tcp_output() have the CSUM_TCP and CSUM_TSO flags set, m_pkthdr.csum_data filled with the header pseudo-checksum and m_pkthdr.tso_segsz set to the segment size (net payload size, not counting IP+TCP headers or TCP options).

  IPv6 currently lacks a pseudo-header checksum function and thus doesn't support TSO yet.

  Tested by: Jack Vogel <jfvogel-at-gmail.com>
  Sponsored by: TCP/IP Optimization Fundraise 2005
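The eligibility rules in this message can be read as one predicate. The sketch below is a simplified model rather than the tcp_output() logic: `struct send_state`, its field names and `can_use_tso` are hypothetical, and treating "pure bulk sending" as "more than one segment of payload is ready" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the per-send state the message refers to. */
struct send_state {
    bool tso_enabled;     /* stands in for sysctl net.inet.tcp.tso */
    bool tcp_md5;         /* TCP-MD5 signature option in use */
    bool sack_rexmit;     /* this send is a SACK retransmit */
    bool sack_advertise;  /* SACK blocks must be advertised */
    bool ipsec;           /* IPSEC applies to the connection */
    bool ip_options;      /* IP options are present */
    bool ipv6;            /* no pseudo-header checksum helper yet */
    uint32_t len;         /* payload bytes ready to send */
    uint32_t maxseg;      /* payload bytes per segment */
};

/* True only when none of the listed conditions rules TSO out. */
static bool
can_use_tso(const struct send_state *s)
{
    return (s->tso_enabled && !s->tcp_md5 && !s->sack_rexmit &&
        !s->sack_advertise && !s->ipsec && !s->ip_options &&
        !s->ipv6 && s->len > s->maxseg);
}
```

Each excluded condition implies a header or option that would differ between the generated segments, which is what the single shared TCP header cannot express.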
* This patch fixes the problem where the current TCP code can not handle | qingli | 2006-02-23 | 1 | -1/+2
  simultaneous open. Both the bug and the patch were verified using the ANVL test suite.

  PR: kern/74935
  Submitted by: qingli (before I became committer)
  Reviewed by: andre
  MFC after: 5 days
* Consolidate all IP Options handling functions into ip_options.[ch] and | andre | 2005-11-18 | 1 | -0/+1
  include ip_options.h into all files making use of IP Options functions.

  From ip_input.c rev 1.306:
    ip_dooptions(struct mbuf *m, int pass)
    save_rte(m, option, dst)
    ip_srcroute(m0)
    ip_stripoptions(m, mopt)

  From ip_output.c rev 1.249:
    ip_insertoptions(m, opt, phlen)
    ip_optcopy(ip, jp)
    ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)

  No functional changes in this commit.

  Discussed with: rwatson
  Sponsored by: TCP/IP Optimization Fundraise 2005
* Retire MT_HEADER mbuf type and change its users to use MT_DATA. | andre | 2005-11-02 | 1 | -2/+2
  Having an additional MT_HEADER mbuf type is superfluous and redundant as nothing depends on it. It only adds a layer of confusion. The distinction between header mbuf's and data mbuf's is solely done through the m->m_flags M_PKTHDR flag.

  Non-native code is not changed in this commit. For compatibility MT_HEADER is mapped to MT_DATA.

  Sponsored by: TCP/IP Optimization Fundraise 2005
* Replace t_force with a t_flag (TF_FORCEDATA). | ps | 2005-05-21 | 1 | -6/+8
  Submitted by: Raja Mukerji.
  Reviewed by: Mohan, Silby, Andre Oppermann.
* When looking for the next hole to retransmit from the scoreboard, | ps | 2005-05-11 | 1 | -2/+7
  or to compute the total retransmitted bytes in this sack recovery episode, the scoreboard is traversed. While in sack recovery, this traversal occurs on every call to tcp_output(), every dupack and every partial ack. The scoreboard could potentially get quite large, making this traversal expensive.

  This change optimizes this by storing hints (for the next hole to retransmit and the total retransmitted bytes in this sack recovery episode) reducing the complexity to find these values from O(n) to constant time.

  The debug code that sanity checks the hints against the computed value will be removed eventually.

  Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.
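The hinting idea can be illustrated with a toy scoreboard. None of the names below come from the FreeBSD sources: `struct hole`, `struct scoreboard`, `next_hole_slow` and `mark_rexmit` are hypothetical, and the sketch ignores sequence-number wraparound. The slow walk plays the role of the sanity-checking debug code the message mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hole: bytes [start, end) are missing at the receiver and
 * bytes [start, rxmit) have already been retransmitted. */
struct hole {
    uint32_t start, end, rxmit;
    struct hole *next;
};

struct scoreboard {
    struct hole *head;
    struct hole *nexthole_hint;  /* cached: first hole with data left */
    uint32_t rexmit_bytes_hint;  /* cached: bytes resent this episode */
};

/* O(n) reference walk: the traversal the hints are meant to avoid. */
static struct hole *
next_hole_slow(const struct scoreboard *sb, uint32_t *rexmit_bytes)
{
    struct hole *first = NULL;

    *rexmit_bytes = 0;
    for (struct hole *h = sb->head; h != NULL; h = h->next) {
        *rexmit_bytes += h->rxmit - h->start;
        if (first == NULL && h->rxmit < h->end)
            first = h;
    }
    return (first);
}

/* Record a retransmission of len bytes from the hinted hole and keep both
 * hints current, so later lookups just read the cached fields. */
static void
mark_rexmit(struct scoreboard *sb, uint32_t len)
{
    struct hole *h = sb->nexthole_hint;

    assert(h != NULL && len <= h->end - h->rxmit);
    h->rxmit += len;
    sb->rexmit_bytes_hint += len;
    /* Skip holes that are fully retransmitted. */
    while (sb->nexthole_hint != NULL &&
        sb->nexthole_hint->rxmit >= sb->nexthole_hint->end)
        sb->nexthole_hint = sb->nexthole_hint->next;
}
```

Reading `nexthole_hint` and `rexmit_bytes_hint` costs constant time; the bookkeeping moves to the point where a retransmission is recorded.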
* Fix for interaction problems between TCP SACK and TCP Signature. | ps | 2005-04-21 | 1 | -45/+84
  If TCP Signatures are enabled, the maximum allowed sack blocks aren't going to fit. The fix is to compute how many sack blocks fit and tack these on last. Also on SYNs, defer padding until after the SACK PERMITTED option has been added.

  Found by: Mohan Srinivasan.
  Submitted by: Mohan Srinivasan, Noritoshi Demizu.
  Reviewed by: Raja Mukerji.
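A rough sketch of "compute how many sack blocks fit" follows. The constants reflect the standard TCP layout (40 bytes of option space, a 2-byte SACK option header, 8 bytes per SACK block); the function name `sack_blocks_that_fit` and the macro names are hypothetical and not taken from the commit.

```c
#include <assert.h>

#define TCP_OPT_SPACE   40  /* maximum bytes of TCP options in a header */
#define SACK_OPT_HDRLEN  2  /* kind + length bytes of the SACK option */
#define SACK_BLOCK_LEN   8  /* two 32-bit sequence numbers per block */

/*
 * Hypothetical helper: given the option bytes already used (timestamps,
 * signature, padding), return how many of the wanted SACK blocks fit.
 */
static int
sack_blocks_that_fit(int optlen_used, int wanted)
{
    int room = TCP_OPT_SPACE - optlen_used - SACK_OPT_HDRLEN;

    if (room < SACK_BLOCK_LEN || wanted <= 0)
        return (0);
    int fit = room / SACK_BLOCK_LEN;
    return (fit < wanted ? fit : wanted);
}
```

Assuming an 18-byte signature option padded to 20 bytes, two blocks fit; with a 12-byte padded timestamp option on top of that, none do. That is why the block count has to be computed after the other options are placed.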
* Ignore ICMP Source Quench messages for TCP sessions. Source Quench is | andre | 2005-04-21 | 1 | -1/+1
  ineffective, deprecated and can be abused to degrade the performance of active TCP sessions if spoofed.

  Replace a bogus call to tcp_quench() in tcp_output() with the direct equivalent tcpcb variable assignment.

  Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
  MFC after: 3 days
* Fix a TCP SACK related crash resulting from incorrect computation | ps | 2005-01-12 | 1 | -6/+16
  of len in tcp_output(), in the case where the FIN has already been transmitted. The mis-computation of len is because of a gcc optimization issue, which this change works around.

  Submitted by: Mohan Srinivasan
* /* -> /*- for license, minor formatting changes | imp | 2005-01-07 | 1 | -1/+1
* Fixes a bug in SACK causing us to send data beyond the receive window. | ps | 2004-11-29 | 1 | -2/+4
  Found by: Pawel Worach and Daniel Hartmeier
  Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
* Remove RFC1644 T/TCP support from the TCP side of the network stack. | andre | 2004-11-02 | 1 | -82/+2
  A complete rationale and discussion is given in this message and the resulting discussion: http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

  Note that this commit removes only the functional part of T/TCP from the tcp_* related functions in the kernel. Other features introduced with RFC1644 are left intact (socket layer changes, sendmsg(2) on connection oriented protocols) and are meant to be reused by a simpler and less intrusive reimplementation of the previous T/TCP functionality.

  Discussed on: -arch
* Correct a bug in TCP SACK that could result in wedging of the TCP stack | rwatson | 2004-10-30 | 1 | -2/+2
  under high load: only set function state to loop and continue sending if there is no data left to send.

  RELENG_5_3 candidate.

  Feet provided: Peter Losher <Peter underscore Losher at isc dot org>
  Diagnosed by: Daniel Hartmeier <daniel at benzedrine dot cx>
  Submitted by: mohan <mohans at yahoo-inc dot com>
* Acquire the send socket buffer lock around tcp_output() activities | rwatson | 2004-10-09 | 1 | -2/+14
  reaching into the socket buffer. This prevents a number of potential races, including dereferencing of sb_mb while unlocked leading to a NULL pointer deref (how I found it). Potentially this might also explain other "odd" TCP behavior on SMP boxes (although haven't seen it reported).

  RELENG_5 candidate.
* - Estimate the amount of data in flight in sack recovery and use it | ps | 2004-10-05 | 1 | -26/+40
    to control the packets injected while in sack recovery (for both retransmissions and new data).
  - Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
  - Add a new sysctl (net.inet.tcp.sack.initburst) that controls the number of sack retransmissions done upon initiation of sack recovery.

  Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>
* fix up socket/ip layer violation... don't assume/know that | jmg | 2004-09-05 | 1 | -3/+4
  SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...
* White space cleanup for netinet before branch: | rwatson | 2004-08-16 | 1 | -71/+71
  - Trailing tab/space cleanup
  - Remove spurious spaces between or before tabs

  This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET.

  Approved by: re (scottl)
  Submitted by: Xin LI <delphij@frontfree.net>
* Fix a bug in the sack code that was causing data to be retransmitted | jayanth | 2004-07-28 | 1 | -4/+13
  with the FIN bit set for all segments, if a FIN has already been sent before. The fix will allow the FIN bit to be set for only the last segment, in case it has to be retransmitted.

  Fix another bug that would have caused snd_nxt to be pulled by len if there was an error from ip_output. snd_nxt should not be touched during sack retransmissions.
* Fix for a SACK bug where the very last segment retransmitted | jayanth | 2004-07-26 | 1 | -1/+1
  from the SACK scoreboard could result in the next (untransmitted) segment being skipped.
* Let IN_FASTRECOVERY macro decide if we are in recovery mode. | jayanth | 2004-07-19 | 1 | -1/+1
  Nuke sackhole_limit for now. We need to add it back to limit the total number of sack blocks in the system.
* Fix a potential panic in the SACK code that was causing | jayanth | 2004-07-19 | 1 | -8/+29
  1) data to be sent to the right of snd_recover.
  2) more data to be sent than what's in the send buffer.

  The fix is to postpone sack retransmit to a subsequent recovery episode if the current retransmit pointer is beyond snd_recover.

  Thanks to Mohan Srinivasan for helping fix the bug.

  Submitted by: Daniel Lang
* Add support for TCP Selective Acknowledgements. The work for this | ps | 2004-06-23 | 1 | -3/+115
  originated on RELENG_4 and was ported to -CURRENT.

  The scoreboarding code was obtained from OpenBSD, and many of the remaining changes were inspired by OpenBSD, but not taken directly from there.

  You can enable/disable sack using net.inet.tcp.do_sack. You can also limit the number of sack holes that all senders can have in the scoreboard with net.inet.tcp.sackhole_limit.

  Reviewed by: gnn
  Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
* Appease GCC. | bms | 2004-06-18 | 1 | -1/+1
* If SO_DEBUG is enabled for a TCP socket, and a received segment is | bms | 2004-06-18 | 1 | -2/+11
  encapsulated within an IPv6 datagram, do not abuse the 'ipov' pointer when registering trace records. 'ipov' is specific to IPv4, and will therefore be uninitialized.

  [This fandango is only necessary in the first place because of our host-byte-order IP field pessimization.]

  PR: kern/60856
  Submitted by: Galois Zheng
* Don't set FIN on a retransmitted segment after a FIN has been sent, | bms | 2004-06-18 | 1 | -1/+1
  unless the segment really contains the last of the data for the stream.

  PR: kern/34619
  Obtained from: OpenBSD (tcp_output.c rev 1.47)
  Noticed by: Joseph Ishac
  Reviewed by: George Neville-Neil
* Switch to using the inpcb MAC label instead of socket MAC label when | rwatson | 2004-05-04 | 1 | -1/+1
  labeling new mbufs created from sockets/inpcbs in IPv4. This helps avoid the need for socket layer locking in the lower level network paths where inpcb locks are already frequently held where needed. In particular:
  - Use the inpcb for label instead of socket in raw_append().
  - Use the inpcb for label instead of socket in tcp_output().
  - Use the inpcb for label instead of socket in tcp_respond().
  - Use the inpcb for label instead of socket in tcp_twrespond().
  - Use the inpcb for label instead of socket in syncache_respond().

  While here, modify tcp_respond() to avoid assigning NULL to a stack variable and centralize assertions about the inpcb when inp is assigned.

  Obtained from: TrustedBSD Project
  Sponsored by: DARPA, McAfee Research
* Remove advertising clause from University of California Regent's | imp | 2004-04-07 | 1 | -4/+0
  license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson.

  Approved by: core, peter, alc, rwatson
* Brucification. | bms | 2004-02-13 | 1 | -3/+3
  Submitted by: bde
* style(9) pass; whitespace and comments. | bms | 2004-02-12 | 1 | -6/+4
  Submitted by: njl
* Fix a typo; left out preprocessor conditional for sigoff variable, which | bms | 2004-02-11 | 1 | -0/+2
  is only used by TCP_SIGNATURE code.

  Noticed by: Roop Nanuwa
* Initial import of RFC 2385 (TCP-MD5) digest support. | bms | 2004-02-11 | 1 | -0/+37
  This is the first of two commits; bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC.

  For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence.

  Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB.

  There is a limitation here in that as there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity.

  Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem.

  This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment.

  Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request.

  Sponsored by: sentex.net
* pass pcb rather than so. it is expected that per socket policy | ume | 2004-02-03 | 1 | -7/+0
  works again.
* Split the overloaded variable 'win' into two for their specific purposes: | andre | 2004-01-22 | 1 | -21/+22
  recwin and sendwin. This removes a big source of confusion and makes following the code much easier.

  Reviewed by: sam (mentor)
  Obtained from: DragonFlyBSD rev 1.6 (hsu)
* Introduce tcp_hostcache and remove the tcp specific metrics from | andre | 2003-11-20 | 1 | -26/+16
  the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache.

  It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve.

  tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address.

  It removes significant locking requirements from the tcp stack with regard to the routing table.

  Reviewed by: sam (mentor), bms
  Reviewed by: -net, -current, core@kame.net (IPv6 parts)
  Approved by: re (scottl)
* replace mtx_assert by INP_LOCK_ASSERT | sam | 2003-11-08 | 1 | -3/+1
  Supported by: FreeBSD Foundation
* unbreak compilation of FAST_IPSEC | sam | 2003-11-08 | 1 | -1/+1
  Supported by: FreeBSD Foundation
* - cleanup SP refcnt issue. | ume | 2003-11-04 | 1 | -0/+7
  - share policy-on-socket for listening socket.
  - don't copy policy-on-socket at all. secpolicy no longer contains spidx, which saves a lot of memory.
  - deep-copy pcb policy if it is an ipsec policy. assign ID field to all SPD entries. make it possible for racoon to grab SPD entry on pcb.
  - fixed the order of searching SA table for packets.
  - fixed to get a security association header. a mode is always needed to compare them.
  - fixed that the incorrect time was set to sadb_comb_{hard|soft}_usetime.
  - disallow port spec for tunnel mode policy (as we don't reassemble).
  - a user can define a policy-id.
  - clear enc/auth key before freeing.
  - fixed that the kernel crashed when key_spdacquire() was called because key_spdacquire() had been implemented incompletely.
  - preparation for 64bit sequence number.
  - maintain ordered list of SA, based on SA id.
  - cleanup secasvar management; refcnt is key.c responsibility; alloc/free is keydb.c responsibility.
  - cleanup, avoid double-loop.
  - use hash for spi-based lookup.
  - mark persistent SP "persistent". XXX in theory refcnt should do the right thing, however, we have "spdflush" which would touch all SPs. another solution would be to de-register persistent SPs from sptree.
  - u_short -> u_int16_t
  - reduce kernel stack usage by auto variable secasindex.
  - clarify function name confusion. ipsec_*_policy -> ipsec_*_pcbpolicy.
  - avoid variable name confusion. (struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct secpolicy *)
  - count number of ipsec encapsulations on ipsec4_output, so that we can tell ip_output() how to handle the packet further.
  - When the value of the ul_proto is ICMP or ICMPV6, the port field in "src" of the spidx specifies ICMP type, and the port field in "dst" of the spidx specifies ICMP code.
  - avoid applying IPsec transport mode to the packets when the kernel forwards the packets.

  Tested by: nork
  Obtained from: KAME
* The tcp_trace call needs the length of the header. Unfortunately the | harti | 2003-08-13 | 1 | -1/+5
  code has rotted a bit so that the header length is not correct at the point when tcp_trace is called. Temporarily compute the correct value before the call and restore the old value after. This makes ports/benchmarks/dbs almost work. This is a NOP unless you compile with TCPDEBUG.
* Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the | jlemon | 2003-02-19 | 1 | -9/+4
  routine does not require a tcpcb to operate. Since we no longer keep template mbufs around, move pseudo checksum out of this routine, and merge it with the length update.

  Sponsored by: DARPA, NAI Labs
* Clean up delayed acks and T/TCP interactions: | jlemon | 2003-02-19 | 1 | -3/+4
  - delay acks for T/TCP regardless of delack setting
  - fix bug where a single pass through tcp_input might not delay acks
  - use callout_active() instead of callout_pending()

  Sponsored by: DARPA, NAI Labs
* Back out M_* changes, per decision of the TRB. | imp | 2003-02-19 | 1 | -3/+3
  Approved by: trb
* Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. | alfred | 2003-01-21 | 1 | -3/+3
  Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
* Optimize away call to bzero() in the common case by directly checking | hsu | 2003-01-18 | 1 | -6/+3
  if a connection has any cached TAO information.
* Fix oops in my last commit, I was calculating a new length but then not | dillon | 2002-10-16 | 1 | -1/+1
  using it. (The code is already correct in -stable).

  Found by: silby
* Tie new "Fast IPsec" code into the build. This involves the usual | sam | 2002-10-16 | 1 | -0/+5
  configuration stuff as well as conditional code in the IPv4 and IPv6 areas. Everything is conditional on FAST_IPSEC which is mutually exclusive with IPSEC (KAME IPsec implementation).

  As noted previously, don't use FAST_IPSEC with INET6 at the moment.

  Reviewed by: KAME, rwatson
  Approved by: silence
  Supported by: Vernier Networks
* Replace aux mbufs with packet tags: | sam | 2002-10-16 | 1 | -12/+3
  o instead of a list of mbufs use a list of m_tag structures a la openbsd
  o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit ABI/module number cookie
  o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and use this in defining openbsd-compatible m_tag_find and m_tag_get routines
  o rewrite KAME use of aux mbufs in terms of packet tags
  o eliminate the most heavily used aux mbufs by adding an additional struct inpcb parameter to ip_output and ip6_output to allow the IPsec code to locate the security policy to apply to outbound packets
  o bump __FreeBSD_version so code can be conditionalized
  o fixup ipfilter's call to ip_output based on __FreeBSD_version

  Reviewed by: julian, luigi (silent), -arch, -net, darren
  Approved by: julian, silence from everyone else
  Obtained from: openbsd (mostly)
  MFC after: 1 month
* Update various comments mainly related to retransmit/FIN that I | dillon | 2002-10-10 | 1 | -6/+36
  documented while working on a previous bug.

  Fix a PERSIST bug. Properly account for a FIN sent during a PERSIST.

  MFC after: 7 days
* Temporary fix for inet6. The final fix is to change in6_pcbnotify to take | jennifer | 2002-09-17 | 1 | -0/+2
  pcbinfo instead of pcbhead. It is on the way.