summaryrefslogtreecommitdiffstats
path: root/sys/netinet/tcp_timewait.c
Commit message (Collapse)AuthorAgeFilesLines
* Enhance our RFC1948 implementation to perform better in some pathlogicalsilby2004-04-201-2/+53
| | | | | | | | | | | | | | | | | | | | TIME_WAIT recycling cases I was able to generate with http testing tools. In short, as the old algorithm relied on ticks to create the time offset component of an ISN, two connections with the exact same host, port pair that were generated between timer ticks would have the exact same sequence number. As a result, the second connection would fail to pass the TIME_WAIT check on the server side, and the SYN would never be acknowledged. I've "fixed" this by adding random positive increments to the time component between clock ticks so that ISNs will *always* be increasing, no matter how quickly the port is recycled. Except in such contrived benchmarking situations, this problem should never come up in normal usage... until networks get faster. No MFC planned, 4.x is missing other optimizations that are needed to even create the situation in which such quick port recycling will occur.
* Remove advertising clause from University of California Regent'simp2004-04-071-4/+0
| | | | | | | license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson
* Two missed in previous commit -- compare pointer with NULL rather thanrwatson2004-04-051-2/+2
| | | | using it as a boolean.
* Prefer NULL to 0 when checking pointer values as integers or booleans.rwatson2004-04-051-19/+20
|
* Remove now unneeded arguments to tcp_twrespond() -- so and msrc. Theserwatson2004-02-281-10/+2
| | | | | | were needed by the MAC Framework until inpcbs gained labels. Submitted by: sam
* Split the mlock() kernel code into two parts, mlock(), which unpackstruckman2004-02-261-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the syscall arguments and does the suser() permission check, and kern_mlock(), which does the resource limit checking and calls vm_map_wire(). Split munlock() in a similar way. Enable the RLIMIT_MEMLOCK checking code in kern_mlock(). Replace calls to vslock() and vsunlock() in the sysctl code with calls to kern_mlock() and kern_munlock() so that the sysctl code will obey the wired memory limits. Nuke the vslock() and vsunlock() implementations, which are no longer used. Add a member to struct sysctl_req to track the amount of memory that is wired to handle the request. Modify sysctl_wire_old_buffer() to return an error if its call to kern_mlock() fails. Only wire the minimum of the length specified in the sysctl request and the length specified in its argument list. It is recommended that sysctl handlers that use sysctl_wire_old_buffer() should specify reasonable estimates for the amount of data they want to return so that only the minimum amount of memory is wired no matter what length has been specified by the request. Modify the callers of sysctl_wire_old_buffer() to look for the error return. Modify sysctl_old_user to obey the wired buffer length and clean up its implementation. Reviewed by: bms
* Convert the tcp segment reassembly queue to UMA and limit the maximumandre2004-02-241-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | amount of segments it will hold. The following tuneables and sysctls control the behaviour of the tcp segment reassembly queue: net.inet.tcp.reass.maxsegments (loader tuneable) specifies the maximum number of segments all tcp reassemly queues can hold (defaults to 1/16 of nmbclusters). net.inet.tcp.reass.maxqlen specifies the maximum number of segments any individual tcp session queue can hold (defaults to 48). net.inet.tcp.reass.cursegments (readonly) counts the number of segments currently in all reassembly queues. net.inet.tcp.reass.overflows (readonly) counts how often either the global or local queue limit has been reached. Tested by: bms, silby Reviewed by: bms, silby
* Fixed ucred structure leak.pjd2004-02-191-0/+2
| | | | | | Approved by: scottl (mentor) PR: 54163 MFC after: 3 days
* Final brucification pass. Spell types consistently (u_int). Remove bogusbms2004-02-141-1/+1
| | | | | | casts. Remove unnecessary parenthesis. Submitted by: bde
* Brucification.bms2004-02-131-10/+14
| | | | Submitted by: bde
* supported IPV6_RECVPATHMTU socket option.ume2004-02-131-2/+2
| | | | Obtained from: KAME
* Update the prototype for tcpsignature_apply() to reflect the spelling ofbms2004-02-121-2/+2
| | | | | | the types used by m_apply()'s callback function, f, as documented in mbuf(9). Noticed by: njl
* style(9) pass; whitespace and comments.bms2004-02-121-17/+22
| | | | Submitted by: njl
* Initial import of RFC 2385 (TCP-MD5) digest support.bms2004-02-111-0/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the first of two commits; bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC. For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence. Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB. There is a limitation here in that as there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity. Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem. This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment. Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request. Sponsored by: sentex.net
* Limiters and sanity checks for TCP MSS (maximum segement size)andre2004-01-081-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | resource exhaustion attacks. For network link optimization TCP can adjust its MSS and thus packet size according to the observed path MTU. This is done dynamically based on feedback from the remote host and network components along the packet path. This information can be abused to pretend an extremely low path MTU. The resource exhaustion works in two ways: o during tcp connection setup the advertized local MSS is exchanged between the endpoints. The remote endpoint can set this arbitrarily low (except for a minimum MTU of 64 octets enforced in the BSD code). When the local host is sending data it is forced to send many small IP packets instead of a large one. For example instead of the normal TCP payload size of 1448 it forces TCP payload size of 12 (MTU 64) and thus we have a 120 times increase in workload and packets. On fast links this quickly saturates the local CPU and may also hit pps processing limites of network components along the path. This type of attack is particularly effective for servers where the attacker can download large files (WWW and FTP). We mitigate it by enforcing a minimum MTU settable by sysctl net.inet.tcp.minmss defaulting to 256 octets. o the local host is reveiving data on a TCP connection from the remote host. The local host has no control over the packet size the remote host is sending. The remote host may chose to do what is described in the first attack and send the data in packets with an TCP payload of at least one byte. For each packet the tcp_input() function will be entered, the packet is processed and a sowakeup() is signalled to the connected process. For example an attack with 2 Mbit/s gives 4716 packets per second and the same amount of sowakeup()s to the process (and context switches). This type of attack is particularly effective for servers where the attacker can upload large amounts of data. Normally this is the case with WWW server where large POSTs can be made. We mitigate this by calculating the average MSS payload per second. If it goes below 'net.inet.tcp.minmss' and the pps rate is above 'net.inet.tcp.minmssoverload' defaulting to 1000 this particular TCP connection is resetted and dropped. MITRE CVE: CAN-2004-0002 Reviewed by: sam (mentor) MFC after: 1 day
* If path mtu discovery is enabled set the DF bit in all cases weandre2004-01-081-0/+4
| | | | | | | | send packets on a tcp connection. PR: kern/60889 Tested by: Richard Wendland <richard@wendland.org.uk> Approved by: re (scottl)
* Enable the following TCP options by default to give it more exposure:andre2004-01-061-1/+1
| | | | | | | | | | | | rfc3042 Limited retransmit rfc3390 Increasing TCP's initial congestion Window inflight TCP inflight bandwidth limiting All my production server have it enabled and there have been no issues. I am confident about having them on by default and it gives us better overall TCP performance. Reviewed by: sam (mentor)
* Fix some becuase -> because typos.jhb2003-12-171-1/+1
| | | | Reported by: Marco Wertejuk <wertejuk@mwcis.com>
* Switch TCP over to using the inpcb label when responding in timedrwatson2003-12-171-4/+1
| | | | | | | | | | | | | | | | wait, rather than the socket label. This avoids reaching up to the socket layer during connection close, which requires locking changes. To do this, introduce MAC Framework entry point mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond() instead of calling mac_create_mbuf_from_socket() or mac_create_mbuf_netlayer(). Introduce MAC Policy entry point mpo_create_mbuf_from_inpcb(), and implementations for various policies, which generally just copy label data from the inpcb to the mbuf. Assert the inpcb lock in the entry point since we require consistency for the inpcb label reference. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
* Make sure all uses of stack allocated struct route's are properlyandre2003-11-261-2/+2
| | | | | | | | zeroed. Doing a bzero on the entire struct route is not more expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst. Reviewed by: sam (mentor) Approved by: re (scottl)
* Introduce tcp_hostcache and remove the tcp specific metrics fromandre2003-11-201-223/+125
| | | | | | | | | | | | | | | | | | | | | | | the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
* o correct locking problem: the inpcb must be held across tcp_respondsam2003-11-081-15/+20
| | | | | | | o add assertions in tcp_respond to validate inpcb locking assumptions o use local variable instead of chasing pointers in tcp_respond Supported by: FreeBSD Foundation
* Add an additional check to the tcp_twrecycleable function; I hadsilby2003-11-021-3/+16
| | | | | | | | | previously only considered the send sequence space. Unfortunately, some OSes (windows) still use a random positive increments scheme for their syn-ack ISNs, so I must consider receive sequence space as well. The value of 250000 bytes / second for Microsoft's ISN rate of increase was determined by testing with an XP machine.
* - Add a new function tcp_twrecycleable, which tells us if the ISN whichsilby2003-11-011-0/+19
| | | | | | | | | | | | | we will generate for a given ip/port tuple has advanced far enough for the time_wait socket in question to be safely recycled. - Have in_pcblookup_local use tcp_twrecycleable to determine if time_Wait sockets which are hogging local ports can be safely freed. This change preserves proper TIME_WAIT behavior under normal circumstances while allowing for safe and fast recycling whenever ephemeral port space is scarce.
* Reduce the number of tcp time_wait structs to maxsockets / 5; this ensuressilby2003-10-241-1/+1
| | | | | | | | | | | | that at most 20% of sockets can be in time_wait at one time, ensuring that time_wait sockets do not starve real connections from inpcb structures. No implementation change is needed, jlemon already implemented a nice LRU-ish algorithm for tcp_tw structure recycling. This should reduce the need for sysadmins to lower the default msl on busy servers.
* Change all SYSCTLS which are readonly and have a related TUNABLEsilby2003-10-211-1/+1
| | | | | from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide more useful error messages.
* Fix a bunch of off-by-one errors in the range checking code.ru2003-09-111-2/+2
|
* Introduce two new MAC Framework and MAC policy entry points:rwatson2003-08-211-3/+3
| | | | | | | | | | | | | | mac_reflect_mbuf_icmp() mac_reflect_mbuf_tcp() These entry points permit MAC policies to do "update in place" changes to the labels on ICMP and TCP mbuf headers when an ICMP or TCP response is generated to a packet outside of the context of an existing socket. For example, in respond to a ping or a RST packet to a SYN on a closed port. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
* Correct a bug introduced with reduced TCP state handling; makerwatson2003-05-071-3/+18
| | | | | | | | | | | | | | | | | | | sure that the MAC label on TCP responses during TIMEWAIT is properly set from either the socket (if available), or the mbuf that it's responding to. Unfortunately, this is made somewhat difficult by the TCP code, as tcp_twstart() calls tcp_twrespond() after discarding the socket but without a reference to the mbuf that causes the "response". Passing both the socket and the mbuf works arounds this--eventually it might be good to make sure the mbuf always gets passed in in "response" scenarios but working through this provided to complicate things too much. Approved by: re (scottl) Reviewed by: hsu Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
* Remove a potential panic condition introduced by reduced TCP waitrwatson2003-04-101-5/+15
| | | | | | | | | | | | | | | | | | state. Those changed attempted to work around the changed invariant that inp->in_socket was sometimes now NULL, but the logic wasn't quite right, meaning that inp->in_socket would be dereferenced by cr_canseesocket() if security.bsd.see_other_uids, jail, or MAC were in use. Attempt to clarify and correct the logic. Note: the work-around originally introduced with the reduced TCP wait state handling to use cr_cansee() instead of cr_canseesocket() in this case isn't really right, although it "Does the right thing" for most of the cases in the base system. We'll need to address this at some point in the future. Pointed out by: dcs Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
* Remove a panic(); if the zone allocator can't provide more timewaitjlemon2003-03-081-20/+19
| | | | | | | structures, reuse the oldest one. Also move the expiry timer from a per-structure callout to the tcp slow timer. Sponsored by: DARPA, NAI Labs
* More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).des2003-03-021-1/+1
|
* When generating a TCP response to a connection, not only test if therwatson2003-02-251-1/+1
| | | | | | | | | | tcpcb is NULL, but also its connected inpcb, since we now allow elements of a TCP connection to hang around after other state, such as the socket, has been recycled. Tested by: dcs Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
* - m = m_gethdr(M_NOWAIT, MT_HEADER);phk2003-02-211-1/+1
| | | | | | + m = m_gethdr(M_DONTWAIT, MT_HEADER); 'nuff said.
* Unbreak non-IPV6 compilation.jlemon2003-02-191-4/+10
| | | | | Caught by: phk Sponsored by: DARPA, NAI Labs
* Add a TCP TIMEWAIT state which uses less space than a fullblown TCPjlemon2003-02-191-48/+268
| | | | | | | | control block. Allow the socket and tcpcb structures to be freed earlier than inpcb. Update code to understand an inp w/o a socket. Reviewed by: hsu, silby, jayanth Sponsored by: DARPA, NAI Labs
* Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so thejlemon2003-02-191-35/+32
| | | | | | | | routine does not require a tcpcb to operate. Since we no longer keep template mbufs around, move pseudo checksum out of this routine, and merge it with the length update. Sponsored by: DARPA, NAI Labs
* Back out M_* changes, per decision of the TRB.imp2003-02-191-4/+4
| | | | Approved by: trb
* Take advantage of pre-existing lock-free synchronization and type stable memoryhsu2003-02-151-4/+3
| | | | to avoid acquiring SMP locks during expensive copyout process.
* Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.alfred2003-01-211-4/+4
| | | | Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
* Validate inp to prevent an use after free.hsu2002-12-241-1/+2
|
* Change tcp.inflight_min from 1024 to a production default of 6144. Createdillon2002-12-141-4/+14
| | | | | | | a sysctl for the stabilization value for the bandwidth delay product (inflight) algorithm and document it. MFC after: 3 days
* Fix two instances of variant struct definitions in sys/netinet:phk2002-10-201-4/+4
| | | | | | | | | | | | | | Remove the never completed _IP_VHL version, it has not caught on anywhere and it would make us incompatible with other BSD netstacks to retain this version. Add a CTASSERT protecting sizeof(struct ip) == 20. Don't let the size of struct ipq depend on the IPDIVERT option. This is a functional no-op commit. Approved by: re
* Tie new "Fast IPsec" code into the build. This involves the usualsam2002-10-161-0/+8
| | | | | | | | | | | | configuration stuff as well as conditional code in the IPv4 and IPv6 areas. Everything is conditional on FAST_IPSEC which is mutually exclusive with IPSEC (KAME IPsec implmentation). As noted previously, don't use FAST_IPSEC with INET6 at the moment. Reviewed by: KAME, rwatson Approved by: silence Supported by: Vernier Networks
* Replace aux mbufs with packet tags:sam2002-10-161-8/+3
| | | | | | | | | | | | | | | | | | | o instead of a list of mbufs use a list of m_tag structures a la openbsd o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit ABI/module number cookie o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and use this in defining openbsd-compatible m_tag_find and m_tag_get routines o rewrite KAME use of aux mbufs in terms of packet tags o eliminate the most heavily used aux mbufs by adding an additional struct inpcb parameter to ip_output and ip6_output to allow the IPsec code to locate the security policy to apply to outbound packets o bump __FreeBSD_version so code can be conditionalized o fixup ipfilter's call to ip_output based on __FreeBSD_version Reviewed by: julian, luigi (silent), -arch, -net, darren Approved by: julian, silence from everyone else Obtained from: openbsd (mostly) MFC after: 1 month
* turn off debugging by default if bandwidth delay product limiting isdillon2002-10-101-1/+1
| | | | turned on (it is already off in -stable).
* Correct bug in t_bw_rtttime rollover, #undef USERTTdillon2002-08-241-1/+5
|
* Implement TCP bandwidth delay product window limiting, similar to (butdillon2002-08-171-0/+158
| | | | | | | | | | | | not meant to duplicate) TCP/Vegas. Add four sysctls and default the implementation to 'off'. net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off) net.inet.tcp.inflight_debug debugging (defaults to 1=on) net.inet.tcp.inflight_min minimum window limit net.inet.tcp.inflight_max maximum window limit MFC after: 1 week
* Document the undocumented assumption that at least one of the PCBrwatson2002-08-011-0/+2
| | | | | | | | | pointer and incoming mbuf pointer will be non-NULL in tcp_respond(). This is relied on by the MAC code for correctness, as well as existing code. Obtained from: TrustedBSD PRoject Sponsored by: DARPA, NAI Labs
* Introduce support for Mandatory Access Control and extensiblerwatson2002-07-311-0/+17
| | | | | | | | | | | | | | | | | | kernel access control. Instrument the TCP socket code for packet generation and delivery: label outgoing mbufs with the label of the socket, and check socket and mbuf labels before permitting delivery to a socket. Assign labels to newly accepted connections when the syncache/cookie code has done its business. Also set peer labels as convenient. Currently, MAC policies cannot influence the PCB matching algorithm, so cannot implement polyinstantiation. Note that there is at least one case where a PCB is not available due to the TCP packet not being associated with any socket, so we don't label in that case, but need to handle it in a special manner. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
OpenPOWER on IntegriCloud