path: root/sys/netinet/tcp_input.c
Commit message | Author | Age | Files | Lines
* o Remove unnecessary TOF_SIGLEN flag from struct tcpopt | andre | 2007-04-20 | 1 | -1/+2
  o Correctly set to->to_signature in tcp_dooptions()
  o Update comments
* Add more KASSERTs. | andre | 2007-04-20 | 1 | -0/+4
* Remove bogus check for accept queue length and associated failure handling | andre | 2007-04-20 | 1 | -16/+10
  from the incoming SYN handling section of tcp_input().
  Enforcement of the accept queue limits is done by sonewconn() after the 3WHS is completed. It is not necessary to have an earlier check before a connection request enters the SYN cache awaiting the full handshake. Rather, it limits the effectiveness of the syncache by keeping legitimate and illegitimate connections from entering it and being shaken out before we hit the real limit, which may have vanished by then.
  Change return value of syncache_add() to void. No status communication is required.
* Simplify syncache_expand() and clarify its semantics. Zero is returned | andre | 2007-04-20 | 1 | -8/+8
  when the ACK is invalid and doesn't belong to any registered connection, either in syncache or through SYN cookies. True but a NULL struct socket is returned when the 3WHS completed but the socket could not be created due to insufficient resources or limits reached. For both cases an RST is sent back in tcp_input().
  A logic error leading to a panic is fixed where syncache_expand() would free the mbuf on socket allocation failure but tcp_input() later supplies it to tcp_dropwithreset() to issue an RST to the peer.
  Reported by: kris (the panic)
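The return convention described above can be sketched in userland C. This is a minimal illustration only: expand_sketch(), handle_ack(), the enum and the stand-in struct socket are hypothetical names for this sketch, not the kernel's code, and the inputs valid_ack and have_resources replace the real syncache and SYN cookie lookups.

```c
#include <assert.h>
#include <stddef.h>

struct socket { int unused; };	/* stand-in, not the kernel's struct socket */

enum action { ACT_DROP_WITH_RESET, ACT_CONTINUE };

/*
 * Hypothetical stand-in for the described semantics:
 *   returns 0                  -> ACK is invalid, matches no connection
 *   returns 1 with *sop NULL   -> 3WHS completed, socket creation failed
 *   returns 1 with *sop set    -> 3WHS completed, new socket available
 * The callee never frees the segment, so the caller may still use it
 * to build the RST.
 */
static int
expand_sketch(int valid_ack, int have_resources, struct socket *newso,
    struct socket **sop)
{
	assert(sop != NULL);
	*sop = NULL;
	if (!valid_ack)
		return (0);
	if (have_resources)
		*sop = newso;
	return (1);
}

/* Caller side: both failure cases end in a reset to the peer. */
static enum action
handle_ack(int valid_ack, int have_resources, struct socket *newso)
{
	struct socket *so;

	if (!expand_sketch(valid_ack, have_resources, newso, &so))
		return (ACT_DROP_WITH_RESET);	/* no such connection */
	if (so == NULL)
		return (ACT_DROP_WITH_RESET);	/* out of resources or limits */
	return (ACT_CONTINUE);
}
```

Keeping ownership of the segment with the caller in every case is what avoids the use-after-free described in the commit message.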
* Remove unused variable tcbinfo_mtx. | rwatson | 2007-04-15 | 1 | -1/+0
* Change the TCP timer system from using the callout system five times | andre | 2007-04-11 | 1 | -30/+22
  directly to a merged model where only one callout, the next to fire, is registered.
  Instead of callout_reset(9) and callout_stop(9) the new function tcp_timer_activate() is used which then internally manages the callout.
  The single new callout is a mutex callout on inpcb simplifying the locking a bit.
  tcp_timer() is the called function which handles all race conditions in one place and then dispatches the individual timer functions.
  Reviewed by: rwatson (earlier version)
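The merged model can be pictured as five deadlines with a single "next to fire" selection. The following is a rough userland sketch, not the kernel implementation: the TT_* labels, struct timers, sketch_timer_activate() and next_to_fire() are hypothetical names, deadlines are plain tick counts, and tick wraparound is ignored.

```c
#include <assert.h>

/* Hypothetical labels for the five TCP timers in this sketch. */
enum { TT_DELACK, TT_REXMT, TT_PERSIST, TT_KEEP, TT_2MSL, TT_MAX };

struct timers {
	int deadline[TT_MAX];	/* absolute tick of expiry, 0 = inactive */
};

/* Arm one timer 'delta' ticks from 'now'; delta == 0 disarms it. */
static void
sketch_timer_activate(struct timers *t, int which, int now, int delta)
{
	assert(which >= 0 && which < TT_MAX);
	t->deadline[which] = (delta == 0) ? 0 : now + delta;
}

/*
 * Return the index of the timer that fires first, or -1 if none is
 * active. Only this one would be registered as a callout.
 */
static int
next_to_fire(const struct timers *t)
{
	int best = -1;

	for (int i = 0; i < TT_MAX; i++) {
		if (t->deadline[i] == 0)
			continue;
		if (best < 0 || t->deadline[i] < t->deadline[best])
			best = i;
	}
	return (best);
}
```

After every activate or deactivate, the single callout would be re-armed for whatever next_to_fire() reports.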
* Add INP_INFO_UNLOCK_ASSERT() and use it in tcp_input(). Also add some | andre | 2007-04-04 | 1 | -0/+3
  further INP_INFO_WLOCK_ASSERT() while there.
* Move last tcpcb initialization for the inbound connection case from | andre | 2007-04-04 | 1 | -10/+2
  tcp_input() to syncache_socket() where it belongs and the majority of it already happens.
  The "tp->snd_up = tp->snd_una" is removed as it is done with the tcp_sendseqinit() macro a few lines earlier.
* Retire unused TCP_SACK_DEBUG. | andre | 2007-04-04 | 1 | -1/+0
* In tcp_dooptions() skip over SACK options if it is a SYN segment. | andre | 2007-04-04 | 1 | -0/+2
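A simplified option walker shows what skipping SACK on a SYN looks like. This is a sketch under stated assumptions, not tcp_dooptions() itself: parse_options(), struct parsed, FLAG_SYN and the OPT_* names are hypothetical, while the option kind numbers (0, 1, 2, 5) and the kind/length layout are the standard TCP ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OPT_EOL		0	/* end of option list */
#define OPT_NOP		1	/* padding */
#define OPT_MSS		2	/* maximum segment size, length 4 */
#define OPT_SACK	5	/* SACK blocks, length 2 + 8*n */

#define FLAG_SYN	0x01	/* hypothetical; plays the role of a SYN flag */

struct parsed {
	int		has_mss;
	uint16_t	mss;
	int		nsacks;
};

static void
parse_options(const uint8_t *cp, size_t cnt, int flags, struct parsed *out)
{
	size_t optlen;

	assert(out != NULL);
	out->has_mss = 0;
	out->mss = 0;
	out->nsacks = 0;
	for (; cnt > 0; cnt -= optlen, cp += optlen) {
		uint8_t opt = cp[0];

		if (opt == OPT_EOL)
			break;
		if (opt == OPT_NOP) {
			optlen = 1;
			continue;
		}
		if (cnt < 2)
			break;
		optlen = cp[1];
		if (optlen < 2 || optlen > cnt)
			break;		/* malformed length, stop parsing */
		switch (opt) {
		case OPT_MSS:
			if (optlen != 4 || !(flags & FLAG_SYN))
				continue;	/* MSS only counts on a SYN */
			out->has_mss = 1;
			out->mss = (uint16_t)((cp[2] << 8) | cp[3]);
			break;
		case OPT_SACK:
			if (flags & FLAG_SYN)
				continue;	/* skip SACK blocks on a SYN */
			if (optlen <= 2 || (optlen - 2) % 8 != 0)
				continue;
			out->nsacks = (int)((optlen - 2) / 8);
			break;
		default:
			break;
		}
	}
}
```

Note that `continue` inside the `switch` still runs the loop increment, so a skipped option is stepped over by its length rather than re-read.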
* When blackholing do a 'dropunlock' in the new world order to prevent the | andre | 2007-03-28 | 1 | -1/+1
  INP_INFO_LOCK from leaking.
  Reported by: ache
  Found by: rwatson
* o Use a define for a buffer size. | maxim | 2007-03-24 | 1 | -1/+10
    Prodded by: db
  o Add missing vars for TCPDEBUG in tcp_do_segment().
    Prodded by: tinderbox
* Split tcp_input() into its two functional parts: | andre | 2007-03-23 | 1 | -132/+208
  o tcp_input() now handles TCP segment sanity checks and preparations including the INPCB lookup and syncache.
  o tcp_do_segment() handles all data and ACK processing and is IPv4/v6 agnostic.
  Change all KASSERT() messages to ("%s: ", __func__).
  The changes in this commit are primarily of mechanical nature and no functional changes besides the function split are made.
  Discussed with: rwatson
* Tidy up some code to conform better to surroundings and style(9), 0 = NULL | andre | 2007-03-23 | 1 | -17/+16
  and space/tab.
* Bring SACK option handling in tcp_dooptions() in line with all other | andre | 2007-03-23 | 1 | -4/+7
  options and adjust users accordingly.
* ANSIfy function declarations and remove register keywords for variables. | andre | 2007-03-21 | 1 | -50/+24
  Consistently apply style to all function declarations.
* Tidy up IPFIREWALL_FORWARD sections and comments. | andre | 2007-03-21 | 1 | -4/+3
* Update and clarify comments in first section of tcp_input(). | andre | 2007-03-21 | 1 | -15/+13
* Tidy up the ACCEPTCONN section of tcp_input(), adjust comments and remove | andre | 2007-03-21 | 1 | -57/+27
  old dead T/TCP code.
* Tidy up tcp_log_in_vain and blackhole. | andre | 2007-03-21 | 1 | -44/+31
* Make TCP_DROP_SYNFIN a standard part of TCP. Disabled by default, it | andre | 2007-03-21 | 1 | -5/+0
  doesn't impede normal operation and is only a few lines of code. Its close relatives blackhole and log_in_vain aren't options either.
* Remove tcp_minmssoverload DoS detection logic. The problem it tried to | andre | 2007-03-21 | 1 | -59/+0
  protect us from wasn't really there and it only bloats the code. Should the problem surface in the future we can simply resurrect it from cvs history.
* Match up SYSCTL declaration style. | andre | 2007-03-19 | 1 | -14/+15
* Consolidate insertion of TCP options into a segment from within tcp_output() | andre | 2007-03-15 | 1 | -2/+2
  and syncache_respond() into its own generic function tcp_addoptions().
  tcp_addoptions() is alignment agnostic and does optimal packing in all cases.
  In struct tcpopt rename to_requested_s_scale to just to_wscale.
  Add a comment with quote from RFC1323: "The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself is never scaled."
  Reviewed by: silby, mohans, julian
  Sponsored by: TCP/IP Optimization Fundraise 2005
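The RFC1323 rule quoted above amounts to applying the shift only to non-SYN segments. A minimal sketch, assuming a hypothetical helper effective_window() and the RFC's maximum shift count of 14:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Interpret the 16-bit window field of a received segment.
 * 'scale' is the shift count the peer announced in its window scale
 * option; it is never applied to the window carried by a SYN or
 * SYN,ACK itself.
 */
static uint32_t
effective_window(uint16_t th_win, int is_syn, unsigned scale)
{
	assert(scale <= 14);	/* RFC1323 limits the shift count to 14 */
	if (is_syn)
		return (th_win);
	return ((uint32_t)th_win << scale);
}
```

For example, a window field of 65535 with a shift of 7 means 65535 bytes on the SYN but 8388480 bytes on any later segment.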
* This patch is provided to fix a couple of deployment issues observed | qingli | 2007-03-07 | 1 | -5/+7
  in the field. In one situation, one end of the TCP connection sends a back-to-back RST packet; with delayed ACK, the last_ack_sent variable has not been updated yet. When tcp_insecure_rst is turned off, the code treats the RST as invalid because last_ack_sent instead of rcv_nxt is compared against th_seq. Apparently there is some kind of firewall that sits in between the two ends and that RST packet is the only RST packet received. With short-lived HTTP connections, the symptom is a large accumulation of connections over a short period of time.
  The +/-(1) factor is to take care of implementations out there that generate RST packets with these types of sequence numbers. This behavior has also been observed in live environments.
  Reviewed by: silby, Mike Karels
  MFC after: 1 week
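One way to read the described check is sketched below. It is an assumption that the strict mode accepts an RST whose sequence number is within one of either rcv_nxt or last_ack_sent; the helper names are hypothetical, and the wraparound-safe comparison macros are redefined locally in the style of the kernel's sequence-number macros.

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound-safe sequence comparisons, defined locally for the sketch. */
#define SEQ_GEQ(a, b)	((int32_t)((a) - (b)) >= 0)
#define SEQ_LEQ(a, b)	((int32_t)((a) - (b)) <= 0)

/* True if 'seq' lies in [ref - 1, ref + 1] in sequence space. */
static int
seq_within_one(uint32_t seq, uint32_t ref)
{
	return (SEQ_GEQ(seq, ref - 1) && SEQ_LEQ(seq, ref + 1));
}

/*
 * Hypothetical strict-mode RST check: with delayed ACKs last_ack_sent
 * may lag behind rcv_nxt, so the RST is compared against both.
 */
static int
rst_seq_acceptable(uint32_t th_seq, uint32_t last_ack_sent, uint32_t rcv_nxt)
{
	assert(SEQ_LEQ(last_ack_sent, rcv_nxt));	/* ACKs never run ahead */
	return (seq_within_one(th_seq, rcv_nxt) ||
	    seq_within_one(th_seq, last_ack_sent));
}
```

With last_ack_sent stale at 900 and rcv_nxt at 1000, an RST carrying sequence 1000 is accepted here, which is the case the commit message describes as previously rejected.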
* In the SYN_SENT case, initialize the snd_wnd before the call to tcp_mss(). | mohans | 2007-02-28 | 1 | -3/+2
  The TCP hostcache logic in tcp_mss() depends on the snd_wnd being initialized.
* Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigates | mohans | 2007-02-26 | 1 | -1/+5
  potential issues where the peer does not close, potentially leaving thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl fast_finwait2_recycle, which is disabled by default.
  Reviewed by: gnn, silby.
* Rename two identically named log_in_vain variables: tcp_input.c's static | rwatson | 2007-02-20 | 1 | -4/+4
  log_in_vain to tcp_log_in_vain, and udp_usrreq's global log_in_vain to udp_log_in_vain.
  MFC after: 1 week
* Auto sizing TCP socket buffers. | andre | 2007-02-01 | 1 | -3/+81
  Normally the socket buffers are static (either derived from global defaults or set with setsockopt) and do not adapt to real network conditions. Two things happen: a) your socket buffers are too small and you can't reach the full potential of the network between both hosts; b) your socket buffers are too big and you waste a lot of kernel memory for data just sitting around.
  With automatic TCP send and receive socket buffers we can start with a small buffer and quickly grow it in parallel with the TCP congestion window to match real network conditions.
  FreeBSD has a default 32K send socket buffer. This supports a maximal transfer rate of only slightly more than 2Mbit/s on a 100ms RTT trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send buffer auto scaling and the default values below it supports 20Mbit/s at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or 1000%. For the receive side it looks slightly better with a default of 64K buffer size.
  New sysctls are:
    net.inet.tcp.sendbuf_auto=1 (enabled)
    net.inet.tcp.sendbuf_inc=8192 (8K, step size)
    net.inet.tcp.sendbuf_max=262144 (256K, growth limit)
    net.inet.tcp.recvbuf_auto=1 (enabled)
    net.inet.tcp.recvbuf_inc=16384 (16K, step size)
    net.inet.tcp.recvbuf_max=262144 (256K, growth limit)
  Tested by: many (on HEAD and RELENG_6)
  Approved by: re
  MFC after: 1 month
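The rates quoted above follow from the fact that at most one buffer's worth of data can be in flight per round trip, so the ceiling is buffer size divided by RTT. A small sketch of that arithmetic, plus a stepwise growth helper built from the step size and limit sysctls; both function names are hypothetical, and the condition that triggers a growth step is not described in the commit message and is left out.

```c
#include <assert.h>

/* Upper bound on throughput in Mbit/s for a given buffer and RTT. */
static double
max_rate_mbit(unsigned long bufbytes, double rtt_ms)
{
	assert(rtt_ms > 0.0);
	return (((double)bufbytes * 8.0) / (rtt_ms / 1000.0) / 1e6);
}

/* One growth step: add 'inc' bytes but never exceed 'max'. */
static unsigned long
grow_buf(unsigned long cur, unsigned long inc, unsigned long max)
{
	unsigned long next = cur + inc;

	return ((next > max) ? max : next);
}
```

With these, a 32768-byte buffer at 100ms gives about 2.6 Mbit/s and a 262144-byte buffer about 21 Mbit/s, matching the "slightly more than 2Mbit/s" and "20Mbit/s" figures in the text.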
* MFp4: 92972, 98913 + one more change | bz | 2006-12-12 | 1 | -3/+7
  In ip6_sprintf no longer use and return one of eight static buffers for printing/logging ipv6 addresses. The caller now has to hand in a sufficiently large buffer as first argument.
* Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h | rwatson | 2006-10-22 | 1 | -1/+2
  begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead.
  This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd.
  Obtained from: TrustedBSD Project
  Sponsored by: SPARTA
* fix calculating to_tsecr... This prevents the rtt calculations from | jmg | 2006-09-26 | 1 | -1/+1
  going all wonky...
* Always set the IP version in the TCP input path, to preserve | bms | 2006-09-23 | 1 | -2/+0
  the header field for possible later IPSEC SPD lookup, even when the kernel is built without 'options INET6'.
  PR: kern/57760
  MFC after: 1 week
  Submitted by: Joachim Schueth
* Rewrite of TCP syncookies to remove locking requirements and to enhance | andre | 2006-09-13 | 1 | -6/+16
  functionality:
  - Remove a rwlock acquisition/release per generated syncookie. Locking is now integrated with the bucket row locking of syncache itself and syncookies no longer add any additional lock overhead.
  - Syncookie secrets differ between syncache bucket rows and are stored per row. Secrets expire after 16 seconds and are reseeded on-demand.
  - The computational overhead for syncookie generation and verification is one MD5 hash computation as before.
  - Syncache can be turned off and run with syncookies only by setting the sysctl net.inet.tcp.syncookies_only=1.
  This implementation extends the original idea and first implementation of FreeBSD by using not only the initial sequence number field to store information but also the timestamp field if present. This way we can keep track of the entire state we need to know to recreate the session in its original form. Almost all TCP speakers implement RFC1323 timestamps these days. For those that do not we still have to live with the known shortcomings of the ISN only SYN cookies. The use of the timestamp field causes the timestamps to be randomized if syncookies are enabled.
  The idea of SYN cookies is to encode and include all necessary information about the connection setup state within the SYN-ACK we send back and thus to get along without keeping any local state until the ACK to the SYN-ACK arrives (if ever). Everything we need to know should be available from the information we encoded in the SYN-ACK.
  A detailed description of the inner working of the syncookies mechanism is included in the comments in tcp_syncache.c.
  Reviewed by: silby (slightly earlier version)
  Sponsored by: TCP/IP Optimization Fundraise 2005
* Back when we had T/TCP support, we used to apply different | ru | 2006-09-07 | 1 | -2/+2
  timeouts for TCP and T/TCP connections in the TIME_WAIT state, and we had two separate timed wait queues for them. Now that it has gone, the timeout is always 2*MSL again, and there is no reason to keep two queues (the first was unused anyway!).
  Also, reimplement the remaining queue using a TAILQ (it was technically impossible before, with two queues).
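Because every entry now gets the same 2*MSL timeout, appending at the tail keeps a single queue sorted by expiry, and expiring entries only requires looking at the head. A rough userland sketch using the TAILQ macros from <sys/queue.h>; struct tw, tw_start() and tw_scan() are hypothetical names, and tick wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct tw {
	int		expire;		/* absolute tick of expiry */
	TAILQ_ENTRY(tw)	link;
};
TAILQ_HEAD(twlist, tw);

/* Enter TIME_WAIT: same timeout for everyone, so tail insert is sorted. */
static void
tw_start(struct twlist *q, struct tw *e, int now, int two_msl)
{
	assert(two_msl > 0);
	e->expire = now + two_msl;
	TAILQ_INSERT_TAIL(q, e, link);
}

/* Remove all entries that have expired by 'now'; return how many. */
static int
tw_scan(struct twlist *q, int now)
{
	struct tw *e;
	int n = 0;

	while ((e = TAILQ_FIRST(q)) != NULL && e->expire <= now) {
		TAILQ_REMOVE(q, e, link);
		n++;
	}
	return (n);
}
```

The scan stops at the first entry that has not expired, since everything behind it was queued later and expires later.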
* First step of TSO (TCP segmentation offload) support in our network stack. | andre | 2006-09-06 | 1 | -4/+9
  o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
  o add CSUM_TSO flag to mbuf pkthdr csum_flags field
  o add tso_segsz field to mbuf pkthdr
  o enhance ip_output() packet length check to allow for large TSO packets
  o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
  o adjust all callers of tcp_maxmtu[46]() accordingly
  Discussed on: -current, -net
  Sponsored by: TCP/IP Optimization Fundraise 2005
* Fixes an edge case bug in timewait handling where ticks rolling over, causing | mohans | 2006-08-11 | 1 | -1/+1
  the timewait expiry to be exactly 0, corrupts the timewait queues (and that entry).
  Reviewed by: silby
* Use INPLOOKUP_WILDCARD instead of just 1 more consistently. | bz | 2006-06-29 | 1 | -3/+6
  OKed by: rwatson (some weeks ago)
* Some cleanups and janitorial work to tcp_syncache: | andre | 2006-06-26 | 1 | -0/+1
  o don't assign remote/local host/port information manually between provided struct in_conninfo and struct syncache, bcopy() it instead
  o rename sc_tsrecent to sc_tsreflect in struct syncache to better capture the purpose of this field
  o rename sc_request_r_scale to sc_requested_r_scale for ditto reasons
  o fix IPSEC error case printf's to report correct function name
  o in syncache_socket() only transpose enhanced tcp options parameters to struct tcpcb when the inpcb doesn't have TF_NOOPT set
  o in syncache_respond() reorder stack variables
  o in syncache_respond() remove bogus KASSERT()
  No functional changes.
  Sponsored by: TCP/IP Optimization Fundraise 2005
* Some cleanups and janitorial work to tcp_dooptions(): | andre | 2006-06-26 | 1 | -19/+25
  o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its usage accordingly
  o update the comments to the tcp_dooptions() invocation in tcp_input():after_listen to reflect reality
  o move the logic checking the echoed timestamp out of tcp_dooptions() to the only place that uses it next to the invocation described in the previous item
  o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others
  o add comments to the struct tcpopt.to_flags #defines
  No functional changes.
  Sponsored by: TCP/IP Optimization Fundraise 2005
* When we receive an out-of-window SYN for an "ESTABLISHED" connection, | dwmalone | 2006-06-19 | 1 | -0/+2
  ACK the SYN as required by RFC793, rather than ignoring it. NetBSD has had a similar change since 1999.
  PR: 93236
  Submitted by: Grant Edwards <grante@visi.com>
  MFC after: 1 month
* Add locking to TCP syncache and drop the global tcpinfo lock as early | andre | 2006-06-17 | 1 | -6/+9
  as possible for the syncache_add() case. The syncache timer no longer acquires the tcpinfo lock and timeout/retransmit runs can happen in parallel with bucket granularity.
  On a P4 the additional locks cause a slight degradation of 0.7% in tcp connections per second. When IP and TCP input are deserialized and can run in parallel this little overhead can be neglected.
  The syncookie handling still leaves room for improvement and its random salts may be moved to the syncache bucket head structures to remove the second lock operation currently required for it. However this would be a more involved change from the way syncookies work at the moment.
  Reviewed by: rwatson
  Tested by: rwatson, ps (earlier version)
  Sponsored by: TCP/IP Optimization Fundraise 2005
* Allow for nmbclusters and maxsockets to be increased via sysctl. | ps | 2006-04-21 | 1 | -0/+10
  An eventhandler is used to update all the various zones that depend on these values.
* Modify tcp_timewait() to accept an inpcb reference, not a tcptw | rwatson | 2006-04-09 | 1 | -11/+12
  reference. For now, we allow the possibility that the in_ppcb pointer in the inpcb may be NULL if a timewait socket has had its tcptw structure recycled. This allows tcp_timewait() to consistently unlock the inpcb.
  Reported by: Kazuaki Oda <kaakun at highway dot ne dot jp>
  MFC after: 3 months
* Don't unlock a timewait structure if the pointer is NULL in | rwatson | 2006-04-05 | 1 | -1/+2
  tcp_timewait(). This corrects a bug (or lack of fixing of a bug) in tcp_input.c:1.295.
  Submitted by: Kazuaki Oda <kaakun at highway dot ne dot jp>
  MFC after: 3 months
* Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being | rwatson | 2006-04-04 | 1 | -0/+9
  NULL. We currently do allow this to happen, but may want to remove that possibility in the future. This case can occur when a socket is left open after TCP wraps up, and the timewait state is recycled. This will be cleaned up in the future.
  Found by: Kazuaki Oda <kaakun at highway dot ne dot jp>
  MFC after: 3 months
* Change inp_ppcb from caddr_t to void *, fix/remove associated | rwatson | 2006-04-03 | 1 | -2/+1
  casts.
  Consistently use intotw() to cast inp_ppcb pointers to struct tcptw * pointers.
  Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb * pointers.
  Don't assign tp to the result of intotcpcb() during variable declaration at the top of functions, as that is before the asserts relating to locking have been performed. Do this later in the function after appropriate assertions have run to allow that operation to be considered safe.
  MFC after: 3 months
* Update TCP for infrastructural changes to the socket/pcb refcount model, | rwatson | 2006-04-01 | 1 | -1/+1
  pru_abort(), pru_detach(), and in_pcbdetach():
  - Universally support and enforce the invariant that so_pcb is never NULL, converting dozens of unnecessary NULL checks into assertions, and eliminating dozens of unnecessary error handling cases in protocol code.
  - In some cases, eliminate unnecessary pcbinfo locking, as it is no longer required to ensure so_pcb != NULL. For example, the receive code no longer requires the pcbinfo lock, and the send code only requires it if building a new connection on an otherwise unconnected socket triggered via sendto() with an address. This should significantly reduce tcbinfo lock contention in the receive and send cases.
  - In order to support the invariant that so_pcb != NULL, it is now necessary for the TCP code to not discard the tcpcb any time a connection is dropped, but instead leave the tcpcb until the socket is shut down. This case is handled by setting INP_DROPPED, to substitute for using a NULL so_pcb to indicate that the connection has been dropped. This requires the inpcb lock, but not the pcbinfo lock.
  - Unlike all other protocols in the tree, TCP may need to retain access to the socket after the file descriptor has been closed. Set SS_PROTOREF in tcp_detach() in order to prevent the socket from being freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether or not it needs to free the socket when the connection finally does close. The typical case where this occurs is if close() is called on a TCP socket before all sent data in the send socket buffer has been transmitted or acknowledged. If INP_SOCKREF is found when the connection is dropped, we release the inpcb, tcpcb, and socket instead of flagging INP_DROPPED.
  - Abort and detach protocol switch methods no longer return failures, nor attempt to free sockets, as the socket layer does this.
  - Annotate the existence of a long-standing race in the TCP timer code, in which timers are stopped but not drained when the socket is freed, as waiting for drain may lead to deadlocks, or have to occur in a context where waiting is not permitted. This race has been handled by testing to see if the tcpcb pointer in the inpcb is NULL (and vice versa), which is not normally permitted, but may be true if an inpcb and tcpcb have been freed. Add a counter to test how often this race has actually occurred, and a large comment for each instance where we compare potentially freed memory with NULL. This will have to be fixed in the near future, but requires us to further address how to handle the timer shutdown issue.
  - Several TCP calls no longer potentially free the passed inpcb/tcpcb, so no longer need to return a pointer to indicate whether the argument passed in is still valid.
  - Un-macroize debugging and locking setup for various protocol switch methods for TCP, as it led to more obscurity, and as locking becomes more customized to the methods, offers less benefit.
  - Assert copyright on tcp_usrreq.c due to significant modifications that have been made as part of this work.
  These changes significantly modify the memory management and connection logic of our TCP implementation, and are (as such) High Risk Changes, and likely to contain serious bugs. Please report problems to the current@ mailing list ASAP, ideally with simple test cases, and optionally, packet traces.
  MFC after: 3 months
* Explicitly assert socket pointer is non-NULL in tcp_input() so as to | rwatson | 2006-03-26 | 1 | -3/+4
  provide better debugging information. Prefer explicit comparison to NULL for tcpcb pointers rather than treating them as booleans.
  MFC after: 1 month
* Rework TCP window scaling (RFC1323) to properly scale the send window | andre | 2006-02-28 | 1 | -19/+20
  right from the beginning and partly clean up the differences in handling between SYN_SENT and SYN_RCVD (syncache).
  Further changes to this code to come. This is a first incremental step to a general overhaul and streamlining of the TCP code.
  PR: kern/15095
  PR: kern/92690 (partly)
  Reviewed by: qingli (and tested with ANVL)
  Sponsored by: TCP/IP Optimization Fundraise 2005