summaryrefslogtreecommitdiffstats
path: root/sys/netinet/tcp_timewait.c
Commit message (Collapse)AuthorAgeFilesLines
* White space cleanup for netinet before branch:rwatson2004-08-161-68/+68
| | | | | | | | | | | - Trailing tab/space cleanup - Remove spurious spaces between or before tabs This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET. Approved by: re (scottl) Submitted by: Xin LI <delphij@frontfree.net>
* In tcp6_ctlinput, lock tcbinfo around the call to syncache_unreachdwmalone2004-08-121-0/+2
| | | | | | so that the locks held are the same as the IPv4 case. Reviewed by: rwatson
* Backout removal of UMA_ZONE_NOFREE flag for all zones which are establishedandre2004-08-111-4/+4
| | | | | | | | | for structures with timers in them. It might be that a timer might fire even when the associated structure has already been free'd. Having type- stable storage in this case is beneficial for graceful failure handling and debugging. Discussed with: bosko, tegge, rwatson
* Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP andandre2004-08-111-4/+4
| | | | | TCP code. This flag would have prevented giving back excessive free slabs to the global pool after a transient peak usage.
* Pass pcbinfo structures to in6_pcbnotify() rather than pcbheadrwatson2004-08-061-2/+2
| | | | | | | | | | structures, allowing in6_pcbnotify() to lock the pcbinfo and each inpcb that it notifies of ICMPv6 events. This prevents inpcb assertions from firing when IPv6 generates and delievers event notifications for inpcbs. Reported by: kuriyama Tested by: kuriyama
* o Move the inflight sysctls to their own sub-tree under net.inet.tcp to beandre2004-08-031-5/+9
| | | | more consistent with the other sysctls around it.
* Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This iscperciva2004-07-261-2/+2
| | | | | | | | | | | somewhat clearer, but more importantly allows for a consistent naming scheme for suser_cred flags. The old name is still defined, but will be removed in a few days (unless I hear any complaints...) Discussed with: rwatson, scottl Requested by: jhb
* Let IN_FASTREOCOVERY macro decide if we are in recovery mode.jayanth2004-07-191-4/+0
| | | | | Nuke sackhole_limit for now. We need to add it back to limit the total number of sack blocks in the system.
* Move the sack sysctl's under net.inet.tcp.sackps2004-06-231-4/+4
| | | | | | | net.inet.tcp.do_sack -> net.inet.tcp.sack.enable net.inet.tcp.sackhole_limit -> net.inet.tcp.sack.sackhole_limit Requested by: wollman
* Add support for TCP Selective Acknowledgements. The work for thisps2004-06-231-0/+16
| | | | | | | | | | | | | | | originated on RELENG_4 and was ported to -CURRENT. The scoreboarding code was obtained from OpenBSD, and many of the remaining changes were inspired by OpenBSD, but not taken directly from there. You can enable/disable sack using net.inet.tcp.do_sack. You can also limit the number of sack holes that all senders can have in the scoreboard with net.inet.tcp.sackhole_limit. Reviewed by: gnn Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
* If debug.mpsafenet is set, initialize TCP callouts as CALLOUT_MPSAFE.rwatson2004-06-201-5/+12
|
* Extend coverage of SOCK_LOCK(so) to include so_count, the socketrwatson2004-06-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | reference count: - Assert SOCK_LOCK(so) macros that directly manipulate so_count: soref(), sorele(). - Assert SOCK_LOCK(so) in macros/functions that rely on the state of so_count: sofree(), sotryfree(). - Acquire SOCK_LOCK(so) before calling these functions or macros in various contexts in the stack, both at the socket and protocol layers. - In some cases, perform soisdisconnected() before sotryfree(), as this could result in frobbing of a non-present socket if sotryfree() actually frees the socket. - Note that sofree()/sotryfree() will release the socket lock even if they don't free the socket. Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
* Switch to using the inpcb MAC label instead of socket MAC label whenrwatson2004-05-041-2/+7
| | | | | | | | | | | | | | | | | | | | labeling new mbufs created from sockets/inpcbs in IPv4. This helps avoid the need for socket layer locking in the lower level network paths where inpcb locks are already frequently held where needed. In particular: - Use the inpcb for label instead of socket in raw_append(). - Use the inpcb for label instead of socket in tcp_output(). - Use the inpcb for label instead of socket in tcp_respond(). - Use the inpcb for label instead of socket in tcp_twrespond(). - Use the inpcb for label instead of socket in syncache_respond(). While here, modify tcp_respond() to avoid assigning NULL to a stack variable and centralize assertions about the inpcb when inp is assigned. Obtained from: TrustedBSD Project Sponsored by: DARPA, McAfee Research
* Enhance our RFC1948 implementation to perform better in some pathlogicalsilby2004-04-201-2/+53
| | | | | | | | | | | | | | | | | | | | TIME_WAIT recycling cases I was able to generate with http testing tools. In short, as the old algorithm relied on ticks to create the time offset component of an ISN, two connections with the exact same host, port pair that were generated between timer ticks would have the exact same sequence number. As a result, the second connection would fail to pass the TIME_WAIT check on the server side, and the SYN would never be acknowledged. I've "fixed" this by adding random positive increments to the time component between clock ticks so that ISNs will *always* be increasing, no matter how quickly the port is recycled. Except in such contrived benchmarking situations, this problem should never come up in normal usage... until networks get faster. No MFC planned, 4.x is missing other optimizations that are needed to even create the situation in which such quick port recycling will occur.
* Remove advertising clause from University of California Regent'simp2004-04-071-4/+0
| | | | | | | license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson
* Two missed in previous commit -- compare pointer with NULL rather thanrwatson2004-04-051-2/+2
| | | | using it as a boolean.
* Prefer NULL to 0 when checking pointer values as integers or booleans.rwatson2004-04-051-19/+20
|
* Remove now unneeded arguments to tcp_twrespond() -- so and msrc. Theserwatson2004-02-281-10/+2
| | | | | | were needed by the MAC Framework until inpcbs gained labels. Submitted by: sam
* Split the mlock() kernel code into two parts, mlock(), which unpackstruckman2004-02-261-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the syscall arguments and does the suser() permission check, and kern_mlock(), which does the resource limit checking and calls vm_map_wire(). Split munlock() in a similar way. Enable the RLIMIT_MEMLOCK checking code in kern_mlock(). Replace calls to vslock() and vsunlock() in the sysctl code with calls to kern_mlock() and kern_munlock() so that the sysctl code will obey the wired memory limits. Nuke the vslock() and vsunlock() implementations, which are no longer used. Add a member to struct sysctl_req to track the amount of memory that is wired to handle the request. Modify sysctl_wire_old_buffer() to return an error if its call to kern_mlock() fails. Only wire the minimum of the length specified in the sysctl request and the length specified in its argument list. It is recommended that sysctl handlers that use sysctl_wire_old_buffer() should specify reasonable estimates for the amount of data they want to return so that only the minimum amount of memory is wired no matter what length has been specified by the request. Modify the callers of sysctl_wire_old_buffer() to look for the error return. Modify sysctl_old_user to obey the wired buffer length and clean up its implementation. Reviewed by: bms
* Convert the tcp segment reassembly queue to UMA and limit the maximumandre2004-02-241-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | amount of segments it will hold. The following tuneables and sysctls control the behaviour of the tcp segment reassembly queue: net.inet.tcp.reass.maxsegments (loader tuneable) specifies the maximum number of segments all tcp reassemly queues can hold (defaults to 1/16 of nmbclusters). net.inet.tcp.reass.maxqlen specifies the maximum number of segments any individual tcp session queue can hold (defaults to 48). net.inet.tcp.reass.cursegments (readonly) counts the number of segments currently in all reassembly queues. net.inet.tcp.reass.overflows (readonly) counts how often either the global or local queue limit has been reached. Tested by: bms, silby Reviewed by: bms, silby
* Fixed ucred structure leak.pjd2004-02-191-0/+2
| | | | | | Approved by: scottl (mentor) PR: 54163 MFC after: 3 days
* Final brucification pass. Spell types consistently (u_int). Remove bogusbms2004-02-141-1/+1
| | | | | | casts. Remove unnecessary parenthesis. Submitted by: bde
* Brucification.bms2004-02-131-10/+14
| | | | Submitted by: bde
* supported IPV6_RECVPATHMTU socket option.ume2004-02-131-2/+2
| | | | Obtained from: KAME
* Update the prototype for tcpsignature_apply() to reflect the spelling ofbms2004-02-121-2/+2
| | | | | | the types used by m_apply()'s callback function, f, as documented in mbuf(9). Noticed by: njl
* style(9) pass; whitespace and comments.bms2004-02-121-17/+22
| | | | Submitted by: njl
* Initial import of RFC 2385 (TCP-MD5) digest support.bms2004-02-111-0/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the first of two commits; bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC. For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence. Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB. There is a limitation here in that as there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity. Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem. This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment. Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request. Sponsored by: sentex.net
* Limiters and sanity checks for TCP MSS (maximum segement size)andre2004-01-081-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | resource exhaustion attacks. For network link optimization TCP can adjust its MSS and thus packet size according to the observed path MTU. This is done dynamically based on feedback from the remote host and network components along the packet path. This information can be abused to pretend an extremely low path MTU. The resource exhaustion works in two ways: o during tcp connection setup the advertized local MSS is exchanged between the endpoints. The remote endpoint can set this arbitrarily low (except for a minimum MTU of 64 octets enforced in the BSD code). When the local host is sending data it is forced to send many small IP packets instead of a large one. For example instead of the normal TCP payload size of 1448 it forces TCP payload size of 12 (MTU 64) and thus we have a 120 times increase in workload and packets. On fast links this quickly saturates the local CPU and may also hit pps processing limites of network components along the path. This type of attack is particularly effective for servers where the attacker can download large files (WWW and FTP). We mitigate it by enforcing a minimum MTU settable by sysctl net.inet.tcp.minmss defaulting to 256 octets. o the local host is reveiving data on a TCP connection from the remote host. The local host has no control over the packet size the remote host is sending. The remote host may chose to do what is described in the first attack and send the data in packets with an TCP payload of at least one byte. For each packet the tcp_input() function will be entered, the packet is processed and a sowakeup() is signalled to the connected process. For example an attack with 2 Mbit/s gives 4716 packets per second and the same amount of sowakeup()s to the process (and context switches). This type of attack is particularly effective for servers where the attacker can upload large amounts of data. Normally this is the case with WWW server where large POSTs can be made. We mitigate this by calculating the average MSS payload per second. If it goes below 'net.inet.tcp.minmss' and the pps rate is above 'net.inet.tcp.minmssoverload' defaulting to 1000 this particular TCP connection is resetted and dropped. MITRE CVE: CAN-2004-0002 Reviewed by: sam (mentor) MFC after: 1 day
* If path mtu discovery is enabled set the DF bit in all cases weandre2004-01-081-0/+4
| | | | | | | | send packets on a tcp connection. PR: kern/60889 Tested by: Richard Wendland <richard@wendland.org.uk> Approved by: re (scottl)
* Enable the following TCP options by default to give it more exposure:andre2004-01-061-1/+1
| | | | | | | | | | | | rfc3042 Limited retransmit rfc3390 Increasing TCP's initial congestion Window inflight TCP inflight bandwidth limiting All my production server have it enabled and there have been no issues. I am confident about having them on by default and it gives us better overall TCP performance. Reviewed by: sam (mentor)
* Fix some becuase -> because typos.jhb2003-12-171-1/+1
| | | | Reported by: Marco Wertejuk <wertejuk@mwcis.com>
* Switch TCP over to using the inpcb label when responding in timedrwatson2003-12-171-4/+1
| | | | | | | | | | | | | | | | wait, rather than the socket label. This avoids reaching up to the socket layer during connection close, which requires locking changes. To do this, introduce MAC Framework entry point mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond() instead of calling mac_create_mbuf_from_socket() or mac_create_mbuf_netlayer(). Introduce MAC Policy entry point mpo_create_mbuf_from_inpcb(), and implementations for various policies, which generally just copy label data from the inpcb to the mbuf. Assert the inpcb lock in the entry point since we require consistency for the inpcb label reference. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
* Make sure all uses of stack allocated struct route's are properlyandre2003-11-261-2/+2
| | | | | | | | zeroed. Doing a bzero on the entire struct route is not more expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst. Reviewed by: sam (mentor) Approved by: re (scottl)
* Introduce tcp_hostcache and remove the tcp specific metrics fromandre2003-11-201-223/+125
| | | | | | | | | | | | | | | | | | | | | | | the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
* o correct locking problem: the inpcb must be held across tcp_respondsam2003-11-081-15/+20
| | | | | | | o add assertions in tcp_respond to validate inpcb locking assumptions o use local variable instead of chasing pointers in tcp_respond Supported by: FreeBSD Foundation
* Add an additional check to the tcp_twrecycleable function; I hadsilby2003-11-021-3/+16
| | | | | | | | | previously only considered the send sequence space. Unfortunately, some OSes (windows) still use a random positive increments scheme for their syn-ack ISNs, so I must consider receive sequence space as well. The value of 250000 bytes / second for Microsoft's ISN rate of increase was determined by testing with an XP machine.
* - Add a new function tcp_twrecycleable, which tells us if the ISN whichsilby2003-11-011-0/+19
| | | | | | | | | | | | | we will generate for a given ip/port tuple has advanced far enough for the time_wait socket in question to be safely recycled. - Have in_pcblookup_local use tcp_twrecycleable to determine if time_Wait sockets which are hogging local ports can be safely freed. This change preserves proper TIME_WAIT behavior under normal circumstances while allowing for safe and fast recycling whenever ephemeral port space is scarce.
* Reduce the number of tcp time_wait structs to maxsockets / 5; this ensuressilby2003-10-241-1/+1
| | | | | | | | | | | | that at most 20% of sockets can be in time_wait at one time, ensuring that time_wait sockets do not starve real connections from inpcb structures. No implementation change is needed, jlemon already implemented a nice LRU-ish algorithm for tcp_tw structure recycling. This should reduce the need for sysadmins to lower the default msl on busy servers.
* Change all SYSCTLS which are readonly and have a related TUNABLEsilby2003-10-211-1/+1
| | | | | from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide more useful error messages.
* Fix a bunch of off-by-one errors in the range checking code.ru2003-09-111-2/+2
|
* Introduce two new MAC Framework and MAC policy entry points:rwatson2003-08-211-3/+3
| | | | | | | | | | | | | | mac_reflect_mbuf_icmp() mac_reflect_mbuf_tcp() These entry points permit MAC policies to do "update in place" changes to the labels on ICMP and TCP mbuf headers when an ICMP or TCP response is generated to a packet outside of the context of an existing socket. For example, in respond to a ping or a RST packet to a SYN on a closed port. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
* Correct a bug introduced with reduced TCP state handling; makerwatson2003-05-071-3/+18
| | | | | | | | | | | | | | | | | | | sure that the MAC label on TCP responses during TIMEWAIT is properly set from either the socket (if available), or the mbuf that it's responding to. Unfortunately, this is made somewhat difficult by the TCP code, as tcp_twstart() calls tcp_twrespond() after discarding the socket but without a reference to the mbuf that causes the "response". Passing both the socket and the mbuf works arounds this--eventually it might be good to make sure the mbuf always gets passed in in "response" scenarios but working through this provided to complicate things too much. Approved by: re (scottl) Reviewed by: hsu Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
* Remove a potential panic condition introduced by reduced TCP waitrwatson2003-04-101-5/+15
| | | | | | | | | | | | | | | | | | state. Those changed attempted to work around the changed invariant that inp->in_socket was sometimes now NULL, but the logic wasn't quite right, meaning that inp->in_socket would be dereferenced by cr_canseesocket() if security.bsd.see_other_uids, jail, or MAC were in use. Attempt to clarify and correct the logic. Note: the work-around originally introduced with the reduced TCP wait state handling to use cr_cansee() instead of cr_canseesocket() in this case isn't really right, although it "Does the right thing" for most of the cases in the base system. We'll need to address this at some point in the future. Pointed out by: dcs Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
* Remove a panic(); if the zone allocator can't provide more timewaitjlemon2003-03-081-20/+19
| | | | | | | structures, reuse the oldest one. Also move the expiry timer from a per-structure callout to the tcp slow timer. Sponsored by: DARPA, NAI Labs
* More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).des2003-03-021-1/+1
|
* When generating a TCP response to a connection, not only test if therwatson2003-02-251-1/+1
| | | | | | | | | | tcpcb is NULL, but also its connected inpcb, since we now allow elements of a TCP connection to hang around after other state, such as the socket, has been recycled. Tested by: dcs Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
* - m = m_gethdr(M_NOWAIT, MT_HEADER);phk2003-02-211-1/+1
| | | | | | + m = m_gethdr(M_DONTWAIT, MT_HEADER); 'nuff said.
* Unbreak non-IPV6 compilation.jlemon2003-02-191-4/+10
| | | | | Caught by: phk Sponsored by: DARPA, NAI Labs
* Add a TCP TIMEWAIT state which uses less space than a fullblown TCPjlemon2003-02-191-48/+268
| | | | | | | | control block. Allow the socket and tcpcb structures to be freed earlier than inpcb. Update code to understand an inp w/o a socket. Reviewed by: hsu, silby, jayanth Sponsored by: DARPA, NAI Labs
* Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so thejlemon2003-02-191-35/+32
| | | | | | | | routine does not require a tcpcb to operate. Since we no longer keep template mbufs around, move pseudo checksum out of this routine, and merge it with the length update. Sponsored by: DARPA, NAI Labs
OpenPOWER on IntegriCloud