path: root/sys/netinet/tcp.h
Commit log: each entry gives author, date, files changed and lines removed/added, followed by the commit message.
* glebius, 2016-01-22, 1 file, -0/+1
  Provide new socket option TCP_CCALGOOPT, which stands for TCP congestion control algorithm options. The argument is variable length and is opaque to TCP, forwarded directly to the algorithm's ctl_output method. Provide new includes directory netinet/cc, where algorithm-specific headers can be installed. The new API doesn't yet have any in-tree consumers. The original code was written by lstewart.

  Reviewed by: rrs, emax
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D711
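A minimal userland sketch of the opaque pass-through described above: TCP only bounds the length and hands the bytes to the algorithm's ctl_output hook without interpreting them. The names `struct cc_ops`, `ccalgo_setopt()`, the demo algorithm and the 256-byte cap are all hypothetical illustrations, not the kernel's actual cc_algo interface.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a congestion control module's method table. */
struct cc_ops {
    const char *name;
    /* Receives the option bytes exactly as the application passed them. */
    int (*ctl_output)(void *cc_state, const void *buf, size_t len);
};

/* Hypothetical per-algorithm state and a demo handler expecting one int. */
struct demo_state {
    int beta;
};

static int demo_ctl_output(void *cc_state, const void *buf, size_t len)
{
    struct demo_state *st = cc_state;
    int v;

    if (len != sizeof(v))
        return EINVAL;              /* only the algorithm knows the layout */
    memcpy(&v, buf, sizeof(v));     /* buf may be unaligned: copy, don't cast */
    st->beta = v;
    return 0;
}

#define CCALGOOPT_MAX 256           /* assumed cap, not taken from tcp.h */

/* Hypothetical TCP-side handler: bound the length, then forward blindly. */
static int ccalgo_setopt(const struct cc_ops *ops, void *cc_state,
                         const void *buf, size_t len)
{
    if (len == 0 || len > CCALGOOPT_MAX)
        return EINVAL;
    if (ops->ctl_output == NULL)
        return ENOPROTOOPT;         /* this algorithm accepts no options */
    return ops->ctl_output(cc_state, buf, len);
}
```

Keeping TCP ignorant of the payload layout means each algorithm can define its own option structure in a header under netinet/cc without changes to the generic socket option code.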
* pkelsey, 2015-12-24, 1 file, -0/+5
  Implementation of server-side TCP Fast Open (TFO) [RFC7413].

  TFO is disabled by default in the kernel build. See the top comment in sys/netinet/tcp_fastopen.c for implementation particulars.

  Reviewed by: gnn, jch, stas
  MFC after: 3 days
  Sponsored by: Verisign, Inc.
  Differential Revision: https://reviews.freebsd.org/D4350
* rrs, 2015-12-16, 1 file, -1/+7
  First cut of the modularization of our TCP stack. Still to do is to clean up the timer handling using the async-drain. Other optimizations may be coming to go with this. What's here will allow different TCP implementations (one is included).

  Reviewed by: jtl, hiren, transports
  Sponsored by: Netflix Inc.
  Differential Revision: D4055
* hiren, 2015-10-14, 1 file, -0/+2
  There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that.

  It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb.

  To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call).

  There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format.

  I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature.

  The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly.

  Differential Revision: D3100
  Submitted by: Jonathan Looney <jlooney at juniper dot net>
  Reviewed by: gnn, hiren
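Keeping "the last N packets" is a bounded ring buffer. This is a minimal userland sketch, assuming fixed-size truncated copies in a plain array; the kernel version keeps mbufs in a list in the tcpcb and caps cluster usage instead. The names `pcap_ring`, `ring_push()` and `ring_get()` and the 64-byte snapshot length are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 5      /* per-direction capacity, like "five packets" */
#define SNAP_LEN   64     /* bytes kept per packet in this sketch */

struct snap {
    size_t len;                        /* bytes actually stored */
    unsigned char data[SNAP_LEN];
};

/* Hypothetical per-socket, per-direction capture ring. */
struct pcap_ring {
    struct snap slot[RING_SLOTS];
    size_t head;                       /* next slot to overwrite */
    size_t count;                      /* number of valid slots */
};

/* Store a copy of pkt, overwriting the oldest entry once the ring is full. */
static void ring_push(struct pcap_ring *r, const void *pkt, size_t len)
{
    struct snap *s = &r->slot[r->head];

    s->len = len < SNAP_LEN ? len : SNAP_LEN;   /* truncate large packets */
    memcpy(s->data, pkt, s->len);
    r->head = (r->head + 1) % RING_SLOTS;
    if (r->count < RING_SLOTS)
        r->count++;
}

/* Return the i-th oldest stored packet (0 = oldest), or NULL if out of range. */
static const struct snap *ring_get(const struct pcap_ring *r, size_t i)
{
    size_t start;

    if (i >= r->count)
        return NULL;
    /* Before wrapping the oldest entry is slot 0; after, it is at head. */
    start = (r->count == RING_SLOTS) ? r->head : 0;
    return &r->slot[(start + i) % RING_SLOTS];
}
```

Memory per socket stays constant no matter how long the connection lives. With one ring per direction, the commit's estimate follows directly: 5 packets × 2 directions × 3,000 sockets = 30,000 stored packets.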
* jhb, 2013-02-01, 1 file, -0/+3
  Add placeholder constants to reserve a portion of the socket option name space for use by downstream vendors to add custom options.

  MFC after: 2 weeks
* jhb, 2013-01-22, 1 file, -12/+14
  Use decimal values for UDP and TCP socket options rather than hex to avoid implying that these constants should be treated as bit masks.

  Reviewed by: net
  MFC after: 1 week
* glebius, 2012-02-05, 1 file, -0/+4
  Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT, which allow controlling the initial timeout, idle time, idle re-send interval and idle send count on a per-socket basis.

  Reviewed by: andre, bz, lstewart
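A minimal userland sketch of using these options on one socket. It assumes the platform headers define TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT (FreeBSD and Linux both do, with the first two in seconds); TCP_KEEPINIT is left out because it is FreeBSD-specific. The helpers `set_keepalive()` and `keepalive_detect_secs()` are hypothetical names. SO_KEEPALIVE still has to be enabled for the per-socket timers to be used at all.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: enable keepalives and tune them for one socket.
 * idle:  seconds of idleness before the first probe (TCP_KEEPIDLE)
 * intvl: seconds between unanswered probes (TCP_KEEPINTVL)
 * cnt:   unanswered probes before the connection is dropped (TCP_KEEPCNT)
 * Returns 0 on success, or -1 with errno set on the first failure.
 */
static int set_keepalive(int fd, int idle, int intvl, int cnt)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) == -1)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) == -1)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) == -1)
        return -1;
    return 0;
}

/* Approximate worst case before a silent peer is declared dead. */
static long keepalive_detect_secs(int idle, int intvl, int cnt)
{
    return (long)idle + (long)intvl * cnt;
}
```

For example, idle = 60, intvl = 10 and cnt = 5 means a vanished peer is noticed roughly 60 + 10 × 5 = 110 seconds after the connection goes quiet, instead of after the system-wide defaults.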
* ed, 2011-10-21, 1 file, -0/+1
  Add missing #includes.

  According to POSIX, these two header files should be able to be included by themselves, not depending on other headers. The <net/if.h> header uses struct sockaddr when __BSD_VISIBLE=1, while <netinet/tcp.h> uses integer datatypes (u_int32_t, u_short, etc).

  MFC after: 2 months
* gnn, 2010-11-17, 1 file, -1/+4
  Add new, per connection, statistics for TCP, including:
  - Retransmitted Packets
  - Zero Window Advertisements
  - Out of Order Receives

  These statistics are available via the -T argument to netstat(1).

  MFC after: 2 weeks
* andre, 2010-09-16, 1 file, -1/+1
  Remove the TCP inflight bandwidth limiter as announced in r211315 to give way for the pluggable congestion control framework. It is the task of the congestion control algorithm to set the congestion window and amount of inflight data without external interference.

  In 'struct tcpcb' the variables previously used by the inflight limiter are renamed to spares to keep the ABI intact and to have some more space for future extensions.

  In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to preserve the ABI. It is always set to 0.

  In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed to preserve the ABI. It is always set to 0.

  These unused variables in the various structures may be reused in the future or garbage collected before the next release or at some other point when an ABI change happens anyway for other reasons.

  No MFC is planned. The inflight bandwidth limiter stays disabled by default in the other branches but remains available.
* andre, 2010-09-16, 1 file, -7/+7
  Improve comment to TCP_MINMSS by taking the wording from lstewart (with a small difference in the last paragraph though) as suggested by jhb.

  Clarify that the 'reviewed by' in r212653 by lstewart was for the functional change, not the comments in the committed version.
* andre, 2010-09-15, 1 file, -19/+27
  Change the default MSS for IPv4 and IPv6 TCP connections from an artificial power-of-2 rounded number to their real values specified in RFC879 and RFC2460.

  From the history and existing comments it appears that the rounded numbers were intended to be advantageous for the kernel and mbuf system. However, this hasn't been the case for at least a long time. The mbuf clusters used in tcp_output() have enough space to hold the larger real value for the default MSS for both IPv4 and IPv6.

  Note that the default MSS is only used when path MTU discovery is disabled.

  Update and expand related comments.

  Reviewed by: lstewart (including some word-smithing)
  MFC after: 2 weeks
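The "real values" are plain header arithmetic. This sketch assumes the minimum sizes from RFC 879 (every IPv4 host accepts 576-octet datagrams) and RFC 2460 (1280-octet minimum link MTU), with option-less 20-byte IPv4 and TCP headers and the fixed 40-byte IPv6 header; `default_mss()` is a hypothetical name.

```c
/* Header sizes without options, in octets. */
enum {
    IPV4_HDR_LEN   = 20,
    IPV6_HDR_LEN   = 40,
    TCP_HDR_LEN    = 20,
    IPV4_MIN_DGRAM = 576,    /* RFC 879: datagram size every host accepts */
    IPV6_MIN_MTU   = 1280    /* RFC 2460: minimum link MTU */
};

/* Default MSS = minimum packet size minus network and TCP headers. */
static int default_mss(int min_packet, int net_hdr_len)
{
    return min_packet - net_hdr_len - TCP_HDR_LEN;
}
```

This yields 576 - 20 - 20 = 536 for IPv4 and 1280 - 40 - 20 = 1220 for IPv6.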
* luigi, 2010-02-01, 1 file, -2/+2
  Use u_char instead of u_int for short bitfields.

  For our compiler the two constructs are completely equivalent, but some compilers (including MSC and tcc) use the base type for alignment, which in the cases touched here results in aligning the bitfields to 32 bits instead of the 8 bits that are meant here.

  Note that almost all other headers where small bitfields are used have u_int8_t instead of u_int.

  MFC after: 3 days
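Because bitfield layout and alignment are implementation-defined, code that parses raw header bytes can avoid the th_x2/th_off bitfield pair entirely. A sketch, assuming `hdr` points at a raw TCP header in network byte order; `tcp_header_len()` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Byte 12 of a TCP header carries the data offset (header length in
 * 32-bit words) in its high 4 bits; the low 4 bits are the reserved
 * th_x2 field. A shift reads it the same way on every compiler and
 * byte order, with no dependence on bitfield layout.
 * Returns the header length in bytes, or 0 if the header is malformed.
 */
static size_t tcp_header_len(const uint8_t *hdr, size_t avail)
{
    size_t len;

    if (avail < 20)
        return 0;                       /* shorter than a minimal header */
    len = (size_t)(hdr[12] >> 4) * 4;
    if (len < 20 || len > avail)
        return 0;                       /* offset below minimum or past buffer */
    return len;
}
```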
* jhb, 2009-12-22, 1 file, -4/+4
  - Rename the __tcpi_(snd|rcv)_mss fields of the tcp_info structure to remove the leading underscores since they are now implemented.
  - Implement the tcpi_rto and tcpi_last_data_recv fields in the tcp_info structure.

  Reviewed by: rwatson
  MFC after: 2 weeks
* kmacy, 2008-05-05, 1 file, -2/+6
  Add rcv_nxt, snd_nxt, and toe offload id to FreeBSD-specific extension fields for tcp_info.
* andre, 2008-04-07, 1 file, -0/+2
  Use #defines for TCP options padding after EOL to be consistent.

  Reviewed by: bz
* kmacy, 2007-12-16, 1 file, -0/+3
  Add socket option for setting and retrieving the congestion control algorithm.

  The option name was chosen for compatibility with Linux.
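The option selects an algorithm by name string rather than by number (the option is TCP_CONGESTION on both FreeBSD and Linux). A userland sketch of the name lookup, assuming a 16-byte name limit like TCP_CA_NAME_MAX and a fixed table of example names; `cc_lookup()`, the registry and the error codes chosen are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define CA_NAME_MAX 16          /* assumed limit, mirroring TCP_CA_NAME_MAX */

/* Hypothetical registry of available congestion control algorithms. */
static const char *const cc_registry[] = { "newreno", "cubic", "htcp" };

/*
 * Look up an algorithm by the name passed in a setsockopt() buffer.
 * The buffer need not be NUL-terminated, so it is copied into a bounded,
 * terminated array first. Returns the registry index, or a negative errno.
 */
static int cc_lookup(const char *buf, size_t len)
{
    char name[CA_NAME_MAX];

    if (len == 0 || len >= sizeof(name))
        return -EINVAL;
    memcpy(name, buf, len);
    name[len] = '\0';
    for (size_t i = 0; i < sizeof(cc_registry) / sizeof(cc_registry[0]); i++)
        if (strcmp(name, cc_registry[i]) == 0)
            return (int)i;
    return -ENOENT;
}
```

Copying into a terminated local buffer before comparing keeps a non-terminated or oversized user buffer from being read past its stated length.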
* andre, 2007-05-25, 1 file, -1/+1
  The printf %b list in PRINT_TH_FLAGS has to be in octal numbering. Thus convert \8 to \10 and the warnings go away.

  Pointed out by: sam, ru, thompsa
* andre, 2007-05-23, 1 file, -1/+1
  Add CWR back into the PRINT_TH_FLAGS list as gcc42 doesn't complain about \8 in a string anymore.
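The kernel's %b conversion reads a format string whose first byte is the output base, followed by entries of a 1-based bit number (written as an octal escape) and that bit's name. That is why the eighth bit must be written \10: \8 is not a valid octal escape. The sketch below is a userland re-implementation for illustration only; `format_bits()` is a hypothetical helper, not a libc function, and the flag string is assumed to match PRINT_TH_FLAGS with CWR included.

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* TCP header flag bits. */
#define TH_FIN  0x01
#define TH_SYN  0x02
#define TH_RST  0x04
#define TH_PUSH 0x08
#define TH_ACK  0x10
#define TH_URG  0x20
#define TH_ECE  0x40
#define TH_CWR  0x80

/* \20 = base 16, then <bit number><name> pairs; CWR is bit \10 (octal 8). */
#define TH_FLAGS_FMT "\20\1FIN\2SYN\3RST\4PUSH\5ACK\6URG\7ECE\10CWR"

/* Append at most n bytes of s at out[pos], keeping out NUL-terminated. */
static size_t append(char *out, size_t outlen, size_t pos, const char *s, size_t n)
{
    while (n-- > 0 && pos + 1 < outlen)
        out[pos++] = *s++;
    out[pos] = '\0';
    return pos;
}

/* Render value the way "%b" does, e.g. 0x12 -> "12<SYN,ACK>". */
static void format_bits(char *out, size_t outlen, unsigned int value, const char *fmt)
{
    int base = (unsigned char)*fmt++;
    int any = 0;
    size_t pos;
    int n;

    if (outlen == 0)
        return;
    n = (base == 8)  ? snprintf(out, outlen, "%o", value)
      : (base == 10) ? snprintf(out, outlen, "%u", value)
      :                snprintf(out, outlen, "%x", value);
    if (n < 0) {
        out[0] = '\0';
        n = 0;
    }
    pos = (size_t)n < outlen ? (size_t)n : outlen - 1;

    while (*fmt != '\0') {
        int bit = (unsigned char)*fmt++;    /* 1-based bit number */
        const char *name = fmt;

        while ((unsigned char)*fmt > ' ')   /* a name ends at the next control byte */
            fmt++;
        if (value & (1u << (bit - 1))) {
            pos = append(out, outlen, pos, any ? "," : "<", 1);
            pos = append(out, outlen, pos, name, (size_t)(fmt - name));
            any = 1;
        }
    }
    if (any)
        append(out, outlen, pos, ">", 1);
}
```

A SYN+ACK segment renders as "12<SYN,ACK>" and a segment with only CWR set as "80<CWR>"; with the invalid \8 escape the compiler cannot produce the byte value 8 that this decoder expects for the CWR entry.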
* andre, 2007-05-18, 1 file, -0/+1
  Add tcp_log_addrs() function to generate a standardized TCP log line for use throughout the TCP subsystem. It is IPv4 and IPv6 aware and creates a line in the following format:

      "TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags <RST>"

  A "\n" is not included at the end. The caller is supposed to add further information after the standard TCP log header.

  The function returns a NUL-terminated string s which the caller has to free(s, M_TCPLOG) after use. All memory allocation is done with M_NOWAIT and the return value may be NULL in memory shortage situations.

  Either struct in_conninfo || (struct tcphdr && (struct ip || struct ip6_hdr)) have to be supplied.

  Due to ip[6].h header inclusion limitations and ordering issues the struct ip and struct ip6_hdr parameters have to be cast and passed as void * pointers.

      tcp_log_addrs(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr, void *ip6hdr)

  Usage example:

      struct ip *ip;        /* ip and th point at the received segment's headers */
      char *tcplog;

      if ((tcplog = tcp_log_addrs(NULL, th, (void *)ip, NULL)) != NULL) {
          log(LOG_DEBUG, "%s; %s: Connection attempt to closed port\n",
              tcplog, __func__);
          free(tcplog, M_TCPLOG);
      }
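A userland approximation of the log-line format, using inet_ntop(3) for the address text. It is a sketch under assumptions: addresses are passed as raw in_addr/in6_addr values with host-order ports, only the FIN through URG flag names are decoded, and `log_addrs()` is a hypothetical name. Like the kernel function, it returns a heap string without a trailing newline, or NULL when allocation fails.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LOG_LINE_LEN (2 * INET6_ADDRSTRLEN + 96)

/* Hypothetical: build "TCP: [src]:sport to [dst]:dport tcpflags <A,B>". */
static char *log_addrs(int af, const void *src, unsigned int sport,
                       const void *dst, unsigned int dport, unsigned int flags)
{
    static const char *const names[] = { "FIN", "SYN", "RST", "PUSH", "ACK", "URG" };
    char s[INET6_ADDRSTRLEN], d[INET6_ADDRSTRLEN], fl[64] = "";
    char *line;
    size_t i;

    if (inet_ntop(af, src, s, sizeof(s)) == NULL ||
        inet_ntop(af, dst, d, sizeof(d)) == NULL)
        return NULL;                    /* unsupported address family */
    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if (flags & (1u << i)) {
            if (fl[0] != '\0')
                strcat(fl, ",");
            strcat(fl, names[i]);       /* at most 24 chars in total */
        }
    }
    line = malloc(LOG_LINE_LEN);
    if (line == NULL)
        return NULL;                    /* mirrors the M_NOWAIT NULL return */
    snprintf(line, LOG_LINE_LEN, "TCP: [%s]:%u to [%s]:%u tcpflags <%s>",
             s, sport, d, dport, fl);
    return line;                        /* caller must free() it */
}
```

The square brackets around each address keep an IPv6 address's colons from being confused with the port separator, which is why the same format works for both families.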
* andre, 2007-04-20, 1 file, -10/+1
  o Remove unused and redundant TCP option definitions
  o Replace usage of MAX_TCPOPTLEN with the correctly constructed and derived MAX_TCPOPTLEN
* andre, 2007-03-21, 1 file, -8/+0
  Remove tcp_minmssoverload DoS detection logic. The problem it tried to protect us from wasn't really there and it only bloats the code. Should the problem surface in the future we can simply resurrect it from cvs history.
* andre, 2007-03-15, 1 file, -2/+5
  Consolidate insertion of TCP options into a segment from within tcp_output() and syncache_respond() into its own generic function tcp_addoptions(). tcp_addoptions() is alignment agnostic and does optimal packing in all cases.

  In struct tcpopt rename to_requested_s_scale to just to_wscale.

  Add a comment with quote from RFC1323: "The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself is never scaled."

  Reviewed by: silby, mohans, julian
  Sponsored by: TCP/IP Optimization Fundraise 2005
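The packing idea can be illustrated with a small encoder. This sketch assumes only the MSS (kind 2), window scale (kind 3), SACK-permitted (kind 4) and timestamp (kind 8) options; it writes them back to back with no NOP alignment bytes and then pads the total to a 32-bit boundary with zero (EOL) bytes, since the header length is counted in 32-bit words. It is not FreeBSD's tcp_addoptions(); `struct opt_req` and `add_options()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_OPT_LEN 40          /* 60-byte maximum TCP header minus 20 fixed */

/* Hypothetical option request; a zero field (or wscale < 0) means "omit". */
struct opt_req {
    uint16_t mss;               /* kind 2, length 4 */
    int      wscale;            /* kind 3, length 3 */
    int      sack_ok;           /* kind 4, length 2 */
    int      use_ts;            /* kind 8, length 10 */
    uint32_t ts_val, ts_ecr;
};

/* Store a 32-bit value in network byte order. */
static void put32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Encode the requested options into buf (room for MAX_OPT_LEN bytes).
 * These four options total at most 19 bytes, so the padded result never
 * exceeds 20. Returns the padded length, always a multiple of 4.
 */
static int add_options(uint8_t *buf, const struct opt_req *o)
{
    size_t n = 0;

    if (o->mss) {
        buf[n++] = 2; buf[n++] = 4;
        buf[n++] = (uint8_t)(o->mss >> 8);
        buf[n++] = (uint8_t)o->mss;
    }
    if (o->wscale >= 0) {
        buf[n++] = 3; buf[n++] = 3;
        buf[n++] = (uint8_t)o->wscale;
    }
    if (o->sack_ok) {
        buf[n++] = 4; buf[n++] = 2;
    }
    if (o->use_ts) {
        buf[n++] = 8; buf[n++] = 10;
        put32(buf + n, o->ts_val);
        put32(buf + n + 4, o->ts_ecr);
        n += 8;
    }
    while (n % 4 != 0)
        buf[n++] = 0;           /* EOL followed by zero padding */
    return (int)n;
}
```

Packing without per-option NOPs saves bytes; an implementation that prefers aligned 32-bit timestamp fields would insert NOPs before the timestamp option instead, trading space for alignment.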
* bms, 2007-02-02, 1 file, -2/+2
  Expose smoothed RTT and RTT variance measurements to userland via socket option TCP_INFO. Note that the units used in the original Linux API are in microseconds, so use a 64-bit mantissa to convert FreeBSD's internal measurements in struct tcpcb from ticks.
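The conversion can be sketched as below. It assumes, as in FreeBSD's struct tcpcb of that era, that t_srtt holds ticks scaled by 2^TCP_RTT_SHIFT (shift 5) and t_rttvar holds ticks scaled by 2^TCP_RTTVAR_SHIFT (shift 4), and that `tick` is the number of microseconds per clock tick; `scaled_ticks_to_us()` is a hypothetical name. Widening to 64 bits before the multiply keeps a large scaled RTT times `tick` from overflowing 32 bits.

```c
#include <stdint.h>

#define TCP_RTT_SHIFT    5      /* assumed: srtt stored as ticks * 32 */
#define TCP_RTTVAR_SHIFT 4      /* assumed: rttvar stored as ticks * 16 */

/* Convert a fixed-point tick count to microseconds using 64-bit math. */
static uint32_t scaled_ticks_to_us(uint32_t scaled, uint32_t us_per_tick, int shift)
{
    return (uint32_t)(((uint64_t)scaled * us_per_tick) >> shift);
}
```

With `tick` = 1000 (a 1000 Hz clock), a scaled srtt of 5,000,000 gives 5,000,000 × 1000 = 5 × 10^9, which would wrap in 32 bits; in 64 bits the result is 5 × 10^9 / 32 = 156,250,000 µs.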
* andre, 2006-02-18, 1 file, -1/+1
  Add missing TH_PUSH to the TH_FLAGS enumeration.

  Submitted by: Andre Albsmeier <Andre.Albsmeier-at-siemens.com>
  PR: kern/85203
* ps, 2005-08-24, 1 file, -1/+1
  Fix up the comment for MAX_SACK_BLKS.

  Submitted by: Noritoshi Demizu
* ps, 2005-05-23, 1 file, -1/+1
  Rewrite of tcp_sack_option(). Kentaro Kurahone (NetBSD) pointed out that if we sort the incoming SACK blocks, we can update the scoreboard in one pass of the scoreboard. The added overhead of sorting up to 4 SACK blocks is much lower than traversing (potentially) large scoreboards multiple times. The code was updating the scoreboard with multiple passes over it (once for each SACK option). The rewrite fixes that, reducing the complexity of the main loop from O(n^2) to O(n).

  Submitted by: Mohan Srinivasan, Noritoshi Demizu.
  Reviewed by: Raja Mukerji.
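Sorting at most four blocks is cheap; an insertion sort keyed on each block's start sequence number is enough. A sketch, assuming a simple start/end block struct and the usual wraparound-safe comparison (the difference of two sequence numbers taken as a signed 32-bit value); the names are hypothetical, and the single scoreboard pass that follows the sort is not shown.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* a < b in sequence space, correct across 2^32 wraparound. */
static int seq_lt(tcp_seq a, tcp_seq b)
{
    return (int32_t)(a - b) < 0;
}

struct sack_blk {
    tcp_seq start;
    tcp_seq end;
};

/* Insertion sort by start sequence number; ideal for n <= 4. */
static void sort_sack_blocks(struct sack_blk *b, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        struct sack_blk key = b[i];
        size_t j = i;

        while (j > 0 && seq_lt(key.start, b[j - 1].start)) {
            b[j] = b[j - 1];
            j--;
        }
        b[j] = key;
    }
}
```

With the blocks in ascending sequence order, the scoreboard walk can advance monotonically and visit each hole once, instead of restarting from the head for every block.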
* imp, 2005-01-07, 1 file, -1/+1
  /* -> /*- for license, minor formatting changes
* rwatson, 2004-11-27, 1 file, -1/+1
  Do export the advertised receive window via the tcpi_rcv_space field of struct tcp_info.
* rwatson, 2004-11-26, 1 file, -0/+66
  Implement parts of the TCP_INFO socket option as found in Linux 2.6. This socket option allows processes to query a TCP socket for some low level transmission details, such as the current send, bandwidth, and congestion windows.

  Linux provides a 'struct tcp_info' structure containing various variables, rather than separate socket options; this makes the API somewhat fragile as it makes it difficult to add new entries of interest as requirements and implementation evolve. As such, I've included a large pad at the end of the structure.

  Right now, relatively few of the Linux API fields are filled in, and some contain no logical equivalent on FreeBSD. I've included __'d entries in the structure to make it easier to figure out what is and isn't omitted.

  This API/ABI should be considered unstable for the time being.
* andre, 2004-11-02, 1 file, -8/+0
  Remove RFC1644 T/TCP support from the TCP side of the network stack.

  A complete rationale and discussion is given in this message and the resulting discussion:

      http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

  Note that this commit removes only the functional part of T/TCP from the tcp_* related functions in the kernel. Other features introduced with RFC1644 are left intact (socket layer changes, sendmsg(2) on connection oriented protocols) and are meant to be reused by a simpler and less intrusive reimplementation of the previous T/TCP functionality.

  Discussed on: -arch
* rwatson, 2004-08-16, 1 file, -1/+1
  White space cleanup for netinet before branch:
  - Trailing tab/space cleanup
  - Remove spurious spaces between or before tabs

  This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET.

  Approved by: re (scottl)
  Submitted by: Xin LI <delphij@frontfree.net>
* ps, 2004-06-23, 1 file, -0/+12
  Add support for TCP Selective Acknowledgements. The work for this originated on RELENG_4 and was ported to -CURRENT.

  The scoreboarding code was obtained from OpenBSD, and many of the remaining changes were inspired by OpenBSD, but not taken directly from there.

  You can enable/disable sack using net.inet.tcp.do_sack. You can also limit the number of sack holes that all senders can have in the scoreboard with net.inet.tcp.sackhole_limit.

  Reviewed by: gnn
  Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
* imp, 2004-04-07, 1 file, -4/+0
  Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson.

  Approved by: core, peter, alc, rwatson
* bms, 2004-02-16, 1 file, -1/+1
  Shorten the name of the socket option used to enable TCP-MD5 packet treatment.

  Submitted by: Vincent Jardin
* bms, 2004-02-11, 1 file, -0/+3
  Initial import of RFC 2385 (TCP-MD5) digest support.

  This is the first of two commits, bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC.

  For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence.

  Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB.

  There is a limitation here in that there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity.

  Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem.

  This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment.

  Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request.

  Sponsored by: sentex.net
* andre, 2004-01-12, 1 file, -1/+1
  Disable the minmssoverload connection drop by default until the detection logic is refined.
* andre, 2004-01-09, 1 file, -4/+3
  Reduce TCP_MINMSS default to 216. The AX.25 protocol (packet radio) is frequently used with an MTU of 256 because of slow speeds and a high packet loss rate.
* andre, 2004-01-08, 1 file, -1/+19
  Limiters and sanity checks for TCP MSS (maximum segment size) resource exhaustion attacks.

  For network link optimization TCP can adjust its MSS and thus packet size according to the observed path MTU. This is done dynamically based on feedback from the remote host and network components along the packet path. This information can be abused to pretend an extremely low path MTU.

  The resource exhaustion works in two ways:

  o During TCP connection setup the advertised local MSS is exchanged between the endpoints. The remote endpoint can set this arbitrarily low (except for a minimum MTU of 64 octets enforced in the BSD code). When the local host is sending data it is forced to send many small IP packets instead of a large one.

    For example, instead of the normal TCP payload size of 1448 it forces a TCP payload size of 12 (MTU 64) and thus we have a 120 times increase in workload and packets. On fast links this quickly saturates the local CPU and may also hit pps processing limits of network components along the path.

    This type of attack is particularly effective for servers where the attacker can download large files (WWW and FTP).

    We mitigate it by enforcing a minimum MSS settable by sysctl net.inet.tcp.minmss, defaulting to 256 octets.

  o The local host is receiving data on a TCP connection from the remote host. The local host has no control over the packet size the remote host is sending. The remote host may choose to do what is described in the first attack and send the data in packets with a TCP payload of as little as one byte. For each packet the tcp_input() function will be entered, the packet is processed and a sowakeup() is signalled to the connected process.

    For example, an attack with 2 Mbit/s gives 4716 packets per second and the same amount of sowakeup()s to the process (and context switches).

    This type of attack is particularly effective for servers where the attacker can upload large amounts of data. Normally this is the case with WWW servers where large POSTs can be made.

    We mitigate this by calculating the average MSS payload per second. If it goes below 'net.inet.tcp.minmss' and the pps rate is above 'net.inet.tcp.minmssoverload', defaulting to 1000, this particular TCP connection is reset and dropped.

  MITRE CVE: CAN-2004-0002
  Reviewed by: sam (mentor)
  MFC after: 1 day
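The figures in this message follow from header arithmetic. This worked sketch assumes 20-byte IPv4 and TCP headers plus a 12-byte timestamp option (the combination that yields the quoted 1448- and 12-byte payloads) and reproduces them, along with an MSS clamp and the overload test. All function names are hypothetical, and `minmss_overloaded()` is a simplified reading of the description above, not the kernel code.

```c
#include <stdint.h>

#define HDR_OVERHEAD 52u   /* 20 IPv4 + 20 TCP + 12 timestamp option */

/* TCP payload carried by a packet of the given MTU. */
static uint32_t payload_for_mtu(uint32_t mtu)
{
    return mtu > HDR_OVERHEAD ? mtu - HDR_OVERHEAD : 0;
}

/* Packets per second needed to fill bits_per_sec with pkt_bytes-sized packets. */
static uint32_t pps_at_rate(uint64_t bits_per_sec, uint32_t pkt_bytes)
{
    return (uint32_t)(bits_per_sec / 8 / pkt_bytes);
}

/* Sending side: never honor a peer MSS below the configured minimum. */
static uint32_t clamp_mss(uint32_t peer_mss, uint32_t minmss)
{
    return peer_mss < minmss ? minmss : peer_mss;
}

/*
 * Receiving side: flag a connection whose average payload per packet over
 * the last second is below minmss while its packet rate exceeds overload_pps.
 */
static int minmss_overloaded(uint64_t bytes_last_sec, uint32_t pkts_last_sec,
                             uint32_t minmss, uint32_t overload_pps)
{
    if (pkts_last_sec <= overload_pps)
        return 0;
    return bytes_last_sec / pkts_last_sec < minmss;
}
```

A 1500-byte MTU leaves 1448 payload bytes and a 64-byte MTU leaves 12, so about 1448 / 12 ≈ 120 times as many packets. A 1-byte payload makes a 53-byte packet, and 2,000,000 / 8 / 53 ≈ 4716 packets per second, matching the message. The overload check was later removed and the default minimum lowered to 216, as later entries in this log record.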
* mike, 2002-10-02, 1 file, -0/+2
  Include <sys/cdefs.h> so the visibility conditionals are available. (This should have been included with the previous revision.)
* mike, 2002-10-02, 1 file, -1/+6
  Use visibility conditionals. Only TCP_NODELAY ends up being defined in the standards case.
* rwatson, 2001-01-09, 1 file, -1/+1
  o Minor style(9)ism to make consistent with -STABLE
* rwatson, 2001-01-09, 1 file, -1/+3
  o IPFW incorrectly handled filtering in the presence of previously reserved and now allocated TCP flags in incoming packets. This patch stops overloading those bits in the IP firewall rules, and moves colliding flags to a separate field, ipflg. The IPFW userland management tool, ipfw(8), is updated to reflect this change. New TCP flags related to ECN are now included in tcp.h for reference, although we don't currently implement TCP+ECN.

  o To use this fix without completely rebuilding, it is sufficient to copy ip_fw.h and tcp.h into your appropriate include directory, then rebuild the ipfw kernel module and ipfw tool, and install both. Note that a mismatch between module and userland tool will result in incorrect installation of firewall rules that may have unexpected effects.

  This is an MFC candidate, following shakedown.

  This bug does not appear to affect ipfilter.

  Reviewed by: security-officer, billf
  Reported by: Aragon Gouveia <aragon@phat.za.net>
* jlemon, 2000-05-06, 1 file, -0/+2
  Implement TCP NewReno, as documented in RFC 2582. This allows better recovery for multiple packet losses in a single window. The algorithm can be toggled via the sysctl net.inet.tcp.newreno, which defaults to "on".

  Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>
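The key NewReno addition (RFC 2582) is how an ACK for new data that arrives during fast recovery is classified against `recover`, taken here as one past the highest sequence number sent when loss was detected. A simplified sketch of that decision: a partial ACK means another segment from the same window was lost, so it is retransmitted and recovery continues; a full ACK ends recovery. The struct, field and function names are hypothetical, cwnd/ssthresh are in bytes, and the caller is assumed to filter out duplicate ACKs.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/* a >= b in sequence space, correct across 2^32 wraparound. */
static int seq_geq(tcp_seq a, tcp_seq b)
{
    return (int32_t)(a - b) >= 0;
}

enum ack_action { ACK_NORMAL, ACK_PARTIAL_RETRANSMIT, ACK_FULL_EXIT };

/* Hypothetical subset of connection state. */
struct nr_state {
    int      in_recovery;
    tcp_seq  snd_una;        /* oldest unacknowledged sequence number */
    tcp_seq  recover;        /* snd_max when fast retransmit began */
    uint32_t cwnd, ssthresh, mss;
};

/* Handle an ACK that acknowledges new data (ack > snd_una). */
static enum ack_action newreno_ack(struct nr_state *s, tcp_seq ack)
{
    uint32_t acked = ack - s->snd_una;

    s->snd_una = ack;
    if (!s->in_recovery)
        return ACK_NORMAL;
    if (seq_geq(ack, s->recover)) {
        /* Full ACK: everything outstanding at loss time has arrived. */
        s->in_recovery = 0;
        s->cwnd = s->ssthresh;           /* deflate the inflated window */
        return ACK_FULL_EXIT;
    }
    /* Partial ACK: deflate by the newly acked amount, add back one MSS,
     * retransmit the next unacked segment and stay in fast recovery. */
    s->cwnd = (s->cwnd > acked ? s->cwnd - acked : 0) + s->mss;
    return ACK_PARTIAL_RETRANSMIT;
}
```

Classic Reno would leave fast recovery on the first partial ACK and often fall back to a retransmission timeout when a second segment from the same window was lost; staying in recovery until `recover` is covered is what lets NewReno repair several losses per window.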
* shin, 2000-01-09, 1 file, -0/+3
  TCP updates to support IPv6. Also a small patch to sys/nfs/nfs_socket.c, as the max_hdr size changed.

  Reviewed by: freebsd-arch, cvs-committers
  Obtained from: KAME project
* shin, 1999-11-05, 1 file, -0/+8
  KAME related header files additions and merges. (only those which don't affect c source files so much)

  Reviewed by: cvs-committers
  Obtained from: KAME project
* peter, 1999-08-28, 1 file, -1/+1
  $Id$ -> $FreeBSD$
* bde, 1998-07-13, 1 file, -3/+3
  Declare tcp_seq and tcp_cc as fixed-size types. Half fixed type mismatches exposed by this (the prototype for tcp_respond() didn't match the function definition lexically, and still depends on a gcc feature to match if ints have more than 32 bits).
* bde, 1998-06-08, 1 file, -3/+3
  Fixed pedantic semantics errors (bitfields not of type int, signed int or unsigned int (this doesn't change the struct layout, size or alignment in any of the files changed in this commit, at least for gcc on i386's. Using bitfields of type u_char may affect size and alignment but not packing)).
* peter, 1997-02-22, 1 file, -1/+1
  Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.