summaryrefslogtreecommitdiffstats
path: root/sys/netinet/tcp_usrreq.c
Commit message (Collapse)AuthorAgeFilesLines
* Change the TCP timer system from using the callout system five timesandre2007-04-111-8/+10
| | | | | | | | | | | | | | | | directly to a merged model where only one callout, the next to fire, is registered. Instead of callout_reset(9) and callout_stop(9) the new function tcp_timer_activate() is used which then internally manages the callout. The single new callout is a mutex callout on inpcb simplifying the locking a bit. tcp_timer() is the called function which handles all race conditions in one place and then dispatches the individual timer functions. Reviewed by: rwatson (earlier version)
* ANSIfy function declarations and remove register keywords for variables.andre2007-03-211-22/+9
| | | | Consistently apply style to all function declarations.
* Remove tcp_minmssoverload DoS detection logic. The problem it tried toandre2007-03-211-5/+0
| | | | | | protect us from wasn't really there and it only bloats the code. Should the problem surface in the future we can simply resurrect it from cvs history.
* Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigatemohans2007-02-261-2/+7
| | | | | | | | potential issues where the peer does not close, potentially leaving thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl fast_finwait2_recycle, which is disabled by default. Reviewed by: gnn, silby.
* Add "show inpcb", "show tcpcb" DDB commands, which should come in handyrwatson2007-02-171-1/+321
| | | | for debugging sblock and other network panics.
* Expose smoothed RTT and RTT variance measurements to userland viabms2007-02-021-0/+4
| | | | | | | socket option TCP_INFO. Note that the units used in the original Linux API are in microseconds, so use a 64-bit mantissa to convert FreeBSD's internal measurements from struct tcpcb from ticks.
* Auto sizing TCP socket buffers.andre2007-02-011-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Normally the socket buffers are static (either derived from global defaults or set with setsockopt) and do not adapt to real network conditions. Two things happen: a) your socket buffers are too small and you can't reach the full potential of the network between both hosts; b) your socket buffers are too big and you waste a lot of kernel memory for data just sitting around. With automatic TCP send and receive socket buffers we can start with a small buffer and quickly grow it in parallel with the TCP congestion window to match real network conditions. FreeBSD has a default 32K send socket buffer. This supports a maximal transfer rate of only slightly more than 2Mbit/s on a 100ms RTT trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send buffer auto scaling and the default values below it supports 20Mbit/s at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or 1000%. For the receive side it looks slightly better with a default of 64K buffer size. New sysctls are: net.inet.tcp.sendbuf_auto=1 (enabled) net.inet.tcp.sendbuf_inc=8192 (8K, step size) net.inet.tcp.sendbuf_max=262144 (256K, growth limit) net.inet.tcp.recvbuf_auto=1 (enabled) net.inet.tcp.recvbuf_inc=16384 (16K, step size) net.inet.tcp.recvbuf_max=262144 (256K, growth limit) Tested by: many (on HEAD and RELENG_6) Approved by: re MFC after: 1 month
* Change the way the advertized TCP window scaling is computed. Instead ofandre2007-02-011-2/+7
| | | | | | | | | | | | | | | upper-bounding it to the size of the initial socket buffer lower-bound it to the smallest MSS we accept. Ideally we'd use the actual MSS information here but it is not available yet. For socket buffer auto sizing to be effective we need room to grow the receive window. The window scale shift is determined at connection setup and can't be changed afterwards. The previous, original, method effectively just did a power of two roundup of the socket buffer size at connection setup severely limiting the headroom for larger socket buffers. Tested by: many (as part of the socket buffer auto sizing patch) MFC after: 1 month
* Change error codes returned by protocol operations when an inpcb issam2006-11-221-6/+6
| | | | | | | | | | | | | | | | | | | | marked INP_DROPPED or INP_TIMEWAIT: o return ECONNRESET instead of EINVAL for close, disconnect, shutdown, rcvd, rcvoob, and send operations o return ECONNABORTED instead of EINVAL for accept These changes should reduce confusion in applications since EINVAL is normally interpreted to mean an invalid file descriptor. This change does not conflict with POSIX or other standards I checked. The return of EINVAL has always been possible but rare; it's become more common with recent changes to the socket/inpcb handling and with finer-grained locking and preemption. Note: there are other instances of EINVAL for this state that were left unchanged; they should be reviewed. Reviewed by: rwatson, andre, ru MFC after: 1 month
* Make tcp_usr_send() free the passed mbufs on error in all cases as theandre2006-09-171-0/+4
| | | | | | comment to it claims. Sponsored by: TCP/IP Optimization Fundraise 2005
* Change semantics of socket close and detach. Add a new protocol switchrwatson2006-07-211-55/+92
| | | | | | | | | | | | | | | | | | | function, pru_close, to notify protocols that the file descriptor or other consumer of a socket is closing the socket. pru_abort is now a notification of close also, and no longer detaches. pru_detach is no longer used to notify of close, and will be called during socket tear-down by sofree() when all references to a socket evaporate after an earlier call to abort or close the socket. This means detach is now an unconditional teardown of a socket, whereas previously sockets could persist after detach of the protocol retained a reference. This faciliates sharing mutexes between layers of the network stack as the mutex is required during the checking and removal of references at the head of sofree(). With this change, pru_detach can now assume that the mutex will no longer be required by the socket layer after completion, whereas before this was not necessarily true. Reviewed by: gnn
* Fix race conditions on enumerating pcb lists by moving the initializationups2006-07-181-2/+1
| | | | | | | | | | | | | | | ( and where appropriate the destruction) of the pcb mutex to the init/finit functions of the pcb zones. This allows locking of the pcb entries and race condition free comparison of the generation count. Rearrange locking a bit to avoid extra locking operation to update the generation count in in_pcballoc(). (in_pcballoc now returns the pcb locked) I am planning to convert pcb list handling from a type safe to a reference count model soon. ( As this allows really freeing the PCBs) Reviewed by: rwatson@, mohans@ MFC after: 1 week
* In tcp6_usr_attach(), return immediately if SS_ISDISCONNECTED, torwatson2006-06-261-4/+2
| | | | | | | avoid dereferencing an uninitialized inp variable. Submitted by: Michiel Boland <michiel at boland dot org> MFC after: 1 month
* Push acquisition of pcbinfo lock out of tcp_usr_attach() intorwatson2006-06-041-6/+8
| | | | | | | tcp_attach() after the call to soreserve(), as it doesn't require the global lock. Rearrange inpcb locking here also. MFC after: 1 month
* Instead of calling tcp_usr_detach() from tcp_usr_abort(), break outrwatson2006-04-241-52/+64
| | | | | | | | | | | | | | | common pcb tear-down logic into tcp_detach(), which is called from either. Invoke tcp_drop() from the tcp_usr_abort() path rather than tcp_disconnect(), as we want to drop it immediately not perform a FIN sequence. This is one reason why some people were experiencing panics in sodealloc(), as the netisr and aborting thread were simultaneously trying to tear down the socket. This bug could often be reproduced using repeated runs of the listenclose regression test. MFC after: 3 months PR: 96090 Reported by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris Tested by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris
* Clarify comment on handling of non-timewait TCP states inrwatson2006-04-031-5/+7
| | | | | | tcp_usr_detach(). MFC after: 3 months
* After checking for SO_ISDISCONNECTED in tcp_usr_accept(), returnrwatson2006-04-031-5/+3
| | | | | | | | | | | | | | immediately rather than jumping to the normal output handling, which assumes we've pulled out the inpcb, which hasn't happened at this point (and isn't necessary). Return ECONNABORTED instead of EINVAL when the inpcb has entered INP_TIMEWAIT or INP_DROPPED, as this is the documented error value. This may correct the panic seen by Ganbold. MFC after: 1 month Reported by: Ganbold <ganbold at micom dot mng dot net>
* During reformulation of tcp_usr_detach(), the call to initiate TCPrwatson2006-04-021-7/+18
| | | | | | | | | | disconnect for fully connected sockets was dropped, meaning that if the socket was closed while the connection was alive, it would be leaked. Structure tcp_usr_detach() so that there are two clear parts: initiating disconnect, and reclaiming state, and reintroduce the tcp_disconnect() call in the first part. MFC after: 3 months
* Properly handle an edge case previously not handled correctly: arwatson2006-04-011-2/+5
| | | | | | | | | | | socket can have a tcp connection that has entered time wait attached to it, in the event that shutdown() is called on the socket and the FINs properly exchange before close(). In this case we don't detach or free the inpcb, just leave the tcptw detached and freed, but we must release the inpcb lock (which we didn't previously). MFC after: 3 months
* Update TCP for infrastructural changes to the socket/pcb refcount model,rwatson2006-04-011-187/+382
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pru_abort(), pru_detach(), and in_pcbdetach(): - Universally support and enforce the invariant that so_pcb is never NULL, converting dozens of unnecessary NULL checks into assertions, and eliminating dozens of unnecessary error handling cases in protocol code. - In some cases, eliminate unnecessary pcbinfo locking, as it is no longer required to ensure so_pcb != NULL. For example, the receive code no longer requires the pcbinfo lock, and the send code only requires it if building a new connection on an otherwise unconnected socket triggered via sendto() with an address. This should significnatly reduce tcbinfo lock contention in the receive and send cases. - In order to support the invariant that so_pcb != NULL, it is now necessary for the TCP code to not discard the tcpcb any time a connection is dropped, but instead leave the tcpcb until the socket is shutdown. This case is handled by setting INP_DROPPED, to substitute for using a NULL so_pcb to indicate that the connection has been dropped. This requires the inpcb lock, but not the pcbinfo lock. - Unlike all other protocols in the tree, TCP may need to retain access to the socket after the file descriptor has been closed. Set SS_PROTOREF in tcp_detach() in order to prevent the socket from being freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether or not it needs to free the socket when the connection finally does close. The typical case where this occurs is if close() is called on a TCP socket before all sent data in the send socket buffer has been transmitted or acknowledged. If INP_SOCKREF is found when the connection is dropped, we release the inpcb, tcpcb, and socket instead of flagging INP_DROPPED. - Abort and detach protocol switch methods no longer return failures, nor attempt to free sockets, as the socket layer does this. - Annotate the existence of a long-standing race in the TCP timer code, in which timers are stopped but not drained when the socket is freed, as waiting for drain may lead to deadlocks, or have to occur in a context where waiting is not permitted. This race has been handled by testing to see if the tcpcb pointer in the inpcb is NULL (and vice versa), which is not normally permitted, but may be true of a inpcb and tcpcb have been freed. Add a counter to test how often this race has actually occurred, and a large comment for each instance where we compare potentially freed memory with NULL. This will have to be fixed in the near future, but requires is to further address how to handle the timer shutdown shutdown issue. - Several TCP calls no longer potentially free the passed inpcb/tcpcb, so no longer need to return a pointer to indicate whether the argument passed in is still valid. - Un-macroize debugging and locking setup for various protocol switch methods for TCP, as it lead to more obscurity, and as locking becomes more customized to the methods, offers less benefit. - Assert copyright on tcp_usrreq.c due to significant modifications that have been made as part of this work. These changes significantly modify the memory management and connection logic of our TCP implementation, and are (as such) High Risk Changes, and likely to contain serious bugs. Please report problems to the current@ mailing list ASAP, ideally with simple test cases, and optionally, packet traces. MFC after: 3 months
* Chance protocol switch method pru_detach() so that it returns voidrwatson2006-04-011-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | rather than an error. Detaches do not "fail", they other occur or the protocol flags SS_PROTOREF to take ownership of the socket. soclose() no longer looks at so_pcb to see if it's NULL, relying entirely on the protocol to decide whether it's time to free the socket or not using SS_PROTOREF. so_pcb is now entirely owned and managed by the protocol code. Likewise, no longer test so_pcb in other socket functions, such as soreceive(), which have no business digging into protocol internals. Protocol detach routines no longer try to free the socket on detach, this is performed in the socket code if the protocol permits it. In rts_detach(), no longer test for rp != NULL in detach, and likewise in other protocols that don't permit a NULL so_pcb, reduce the incidence of testing for it during detach. netinet and netinet6 are not fully updated to this change, which will be in an upcoming commit. In their current state they may leak memory or panic. MFC after: 3 months
* Change protocol switch pru_abort() API so that it returns void ratherrwatson2006-04-011-5/+13
| | | | | | | | | | | | | | than an int, as an error here is not meaningful. Modify soabort() to unconditionally free the socket on the return of pru_abort(), and modify most protocols to no longer conditionally free the socket, since the caller will do this. This commit likely leaves parts of netinet and netinet6 in a situation where they may panic or leak memory, as they have not are not fully updated by this commit. This will be corrected shortly in followup commits to these components. MFC after: 3 months
* Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() tomux2005-12-141-2/+2
| | | | | | | match the type of the variable they are exporting. Spotted by: Thomas Hurst <tom@hur.st> MFC after: 3 days
* Push the assignment of a new or updated so_qlimit from solisten()rwatson2005-10-301-4/+4
| | | | | | | | | | | | | | following the protocol pru_listen() call to solisten_proto(), so that it occurs under the socket lock acquisition that also sets SO_ACCEPTCONN. This requires passing the new backlog parameter to the protocol, which also allows the protocol to be aware of changes in queue limit should it wish to do something about the new queue limit. This continues a move towards the socket layer acting as a library for the protocol. Bump __FreeBSD_version due to a change in the in-kernel protocol interface. This change has been tested with IPv4 and UNIX domain sockets, but not other protocols.
* Remove unnecessary IPSEC includes.andre2005-08-231-5/+0
| | | | | MFC after: 2 weeks Sponsored by: TCP/IP Optimization Fundraise 2005
* scope cleanup. with this changeume2005-07-251-0/+2
| | | | | | | | | | | | | | | | | | | - most of the kernel code will not care about the actual encoding of scope zone IDs and won't touch "s6_addr16[1]" directly. - similarly, most of the kernel code will not care about link-local scoped addresses as a special case. - scope boundary check will be stricter. For example, the current *BSD code allows a packet with src=::1 and dst=(some global IPv6 address) to be sent outside of the node, if the application do: s = socket(AF_INET6); bind(s, "::1"); sendto(s, some_global_IPv6_addr); This is clearly wrong, since ::1 is only meaningful within a single node, but the current implementation of the *BSD kernel cannot reject this attempt. Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp> Obtained from: KAME
* When aborting tcp_attach() due to a problem allocating or attaching therwatson2005-06-011-0/+2
| | | | | | | tcpcb, lock the inpcb before calling in_pcbdetach() or in6_pcbdetach(), as they expect the inpcb to be passed locked. MFC after: 7 days
* Assert tcbinfo lock, inpcb lock in tcp_disconnect().rwatson2005-06-011-1/+8
| | | | | | Assert tcbinfo lock, inpcb lock in in tcp_usrclosed(). MFC after: 7 days
* Assert tcbinfo lock in tcp_attach(), as it is required; the callerrwatson2005-06-011-0/+2
| | | | | | (tcp_usr_attach()) currently grabs it. MFC after: 7 days
* Replace t_force with a t_flag (TF_FORCEDATA).ps2005-05-211-2/+2
| | | | | Submitted by: Raja Mukerji. Reviewed by: Mohan, Silby, Andre Opperman.
* Remove now unused inirw variable from previous use of COMMON_END().rwatson2005-05-011-1/+0
| | | | Reported by: csjp
* Fix typo in last commit.grehan2005-05-011-1/+1
| | | | Approved by: rwatson
* Slide unlocking of the tcbinfo lock earlier in tcp_usr_send(), as it'srwatson2005-05-011-2/+13
| | | | | | | | | | | needed only for implicit connect cases. Under load, especially on SMP, this can greatly reduce contention on the tcbinfo lock. NB: Ambiguities about the state of so_pcb need to be resolved so that all use of the tcbinfo lock in non-implicit connection cases can be eliminated. Submited by: Kazuaki Oda <kaakun at highway dot ne dot jp>
* eliminate extraneous null ptr checkssam2005-03-291-1/+1
| | | | Noticed by: Coverity Prevent analysis tool
* In tcp_usr_send(), broaden coverage of the socket buffer lock in therwatson2005-03-141-1/+4
| | | | | | | non-OOB case so that the sbspace() check is performed under the same lock instance as the append to the send socket buffer. MFC after: 1 week
* In the current world order, solisten() implements the state transition ofrwatson2005-02-211-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | a socket from a regular socket to a listening socket able to accept new connections. As part of this state transition, solisten() calls into the protocol to update protocol-layer state. There were several bugs in this implementation that could result in a race wherein a TCP SYN received in the interval between the protocol state transition and the shortly following socket layer transition would result in a panic in the TCP code, as the socket would be in the TCPS_LISTEN state, but the socket would not have the SO_ACCEPTCONN flag set. This change does the following: - Pushes the socket state transition from the socket layer solisten() to to socket "library" routines called from the protocol. This permits the socket routines to be called while holding the protocol mutexes, preventing a race exposing the incomplete socket state transition to TCP after the TCP state transition has completed. The check for a socket layer state transition is performed by solisten_proto_check(), and the actual transition is performed by solisten_proto(). - Holds the socket lock for the duration of the socket state test and set, and over the protocol layer state transition, which is now possible as the socket lock is acquired by the protocol layer, rather than vice versa. This prevents additional state related races in the socket layer. This permits the dual transition of socket layer and protocol layer state to occur while holding locks for both layers, making the two changes atomic with respect to one another. Similar changes are likely require elsewhere in the socket/protocol code. Reported by: Peter Holm <peter@holm.cc> Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net> Philosophical head nod: gnn
* o Add handling of an IPv4-mapped IPv6 address.maxim2005-02-141-87/+0
| | | | | | | | | | | | | o Use SYSCTL_IN() macro instead of direct call of copyin(9). Submitted by: ume o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where most of tcp sysctls live. o There are net.inet[6].tcp[6].getcred sysctls already, no needs in a separate struct tcp_ident_mapping. Suggested by: ume
* o Implement net.inet.tcp.drop sysctl and userland part, tcpdrop(8)maxim2005-02-061-0/+86
| | | | | | | | | | | | utility: The tcpdrop command drops the TCP connection specified by the local address laddr, port lport and the foreign address faddr, port fport. Obtained from: OpenBSD Reviewed by: rwatson (locking), ru (man page), -current MFC after: 1 month
* /* -> /*- for license, minor formatting changesimp2005-01-071-1/+1
|
* Do export the advertised receive window via the tcpi_rcv_space field ofrwatson2004-11-271-0/+1
| | | | struct tcp_info.
* Implement parts of the TCP_INFO socket option as found in Linux 2.6.rwatson2004-11-261-2/+54
| | | | | | | | | | | | | | | This socket option allows processes query a TCP socket for some low level transmission details, such as the current send, bandwidth, and congestion windows. Linux provides a 'struct tcpinfo' structure containing various variables, rather than separate socket options; this makes the API somewhat fragile as it makes it dificult to add new entries of interest as requirements and implementation evolve. As such, I've included a large pad at the end of the structure. Right now, relatively few of the Linux API fields are filled in, and some contain no logical equivilent on FreeBSD. I've include __'d entries in the structure to make it easier to figure ou what is and isn't omitted. This API/ABI should be considered unstable for the time being.
* Initialize struct pr_userreqs in new/sparse style and fill in commonphk2004-11-081-11/+32
| | | | | | default elements in net_init_domain(). This makes it possible to grep these structures and see any bogosities.
* Remove RFC1644 T/TCP support from the TCP side of the network stack.andre2004-11-021-69/+4
| | | | | | | | | | | | | | | | A complete rationale and discussion is given in this message and the resulting discussion: http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706 Note that this commit removes only the functional part of T/TCP from the tcp_* related functions in the kernel. Other features introduced with RFC1644 are left intact (socket layer changes, sendmsg(2) on connection oriented protocols) and are meant to be reused by a simpler and less intrusive reimplemention of the previous T/TCP functionality. Discussed on: -arch
* White space cleanup for netinet before branch:rwatson2004-08-161-9/+9
| | | | | | | | | | | - Trailing tab/space cleanup - Remove spurious spaces between or before tabs This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET. Approved by: re (scottl) Submitted by: Xin LI <delphij@frontfree.net>
* Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSDdwmalone2004-08-141-6/+1
| | | | | | | | | | | | | | | | | | | | | have already done this, so I have styled the patch on their work: 1) introduce a ip_newid() static inline function that checks the sysctl and then decides if it should return a sequential or random IP ID. 2) named the sysctl net.inet.ip.random_id 3) IPv6 flow IDs and fragment IDs are now always random. Flow IDs and frag IDs are significantly less common in the IPv6 world (ie. rarely generated per-packet), so there should be smaller performance concerns. The sysctl defaults to 0 (sequential IP IDs). Reviewed by: andre, silby, mlaier, ume Based on: NetBSD MFC after: 2 months
* compare pointer against NULL, not 0jmg2004-07-261-2/+2
| | | | | | | | when inpcb is NULL, this is no longer invalid since jlemon added the tcp_twstart function... this prevents close "failing" w/ EINVAL when it really was successful... Reviewed by: jeremy (NetBSD)
* when IN6P_AUTOFLOWLABEL is set, the flowlabel is not set onume2004-07-161-2/+10
| | | | | | | | outgoing tcp connections. Reported by: Orla McGann <orly@cnri.dit.ie> Reviewed by: Orla McGann <orly@cnri.dit.ie> Obtained from: KAME
* Remove spl's from TCP protocol entry points. While not all lockingrwatson2004-06-261-32/+1
| | | | | is merged here yet, this will ease the merge process by bringing the locked and unlocked versions into sync.
* In tcp_ctloutput(), don't hold the inpcb lock over a call torwatson2004-06-181-1/+1
| | | | | | | | ip_ctloutput(), as it may need to perform blocking memory allocations. This also improves consistency with locking relative to other points that call into ip_ctloutput(). Bumped into by: Grover Lines <grover@ceribus.net>
* The socket field so_state is used to hold a variety of socket relatedrwatson2004-06-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | flags relating to several aspects of socket functionality. This change breaks out several bits relating to send and receive operation into a new per-socket buffer field, sb_state, in order to facilitate locking. This is required because, in order to provide more granular locking of sockets, different state fields have different locking properties. The following fields are moved to sb_state: SS_CANTRCVMORE (so_state) SS_CANTSENDMORE (so_state) SS_RCVATMARK (so_state) Rename respectively to: SBS_CANTRCVMORE (so_rcv.sb_state) SBS_CANTSENDMORE (so_snd.sb_state) SBS_RCVATMARK (so_rcv.sb_state) This facilitates locking by isolating fields to be located with other identically locked fields, and permits greater granularity in socket locking by avoiding storing fields with different locking semantics in the same short (avoiding locking conflicts). In the future, we may wish to coallesce sb_state and sb_flags; for the time being I leave them separate and there is no additional memory overhead due to the packing/alignment of shorts in the socket buffer structure.
OpenPOWER on IntegriCloud