FreeBSD-src - Raptor Engineering's fork of pfsense FreeBSD src with pfSense changes

	Commit message (Collapse)	Author	Age	Files	Lines
*	MFC r312982:	cy	2017-02-06	1	-2/+2
\| \| \| \|	Correct comment grammar and make it easier to understand.
*	MFC r309858	hiren	2017-01-04	1	-19/+21
\| \| \| \| \| \| \| \| \| \| \|	We currently don't do TSO if ip options are present. In case of IPv6, we look at in6p_options to check that. That is incorrect as we carry ip options in in6p_outputopts. Also, just checking for in6p_outputopts being NULL won't suffice as we combine ip options and ip header fields both in that one field. The commit fixes this by using ip6_optlen() which correctly calculates length of only ip options for IPv6. Sponsored by: Limelight Networks
*	tcp: Don't prematurely drop receiving-only connections	sephe	2016-05-30	1	-3/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the connection was persistent and receiving-only, several (12) sporadic device insufficient buffers would cause the connection be dropped prematurely: Upon ENOBUFS in tcp_output() for an ACK, retransmission timer is started. No one will stop this retransmission timer for receiving- only connection, so the retransmission timer promises to expire and t_rxtshift is promised to be increased. And t_rxtshift will not be reset to 0, since no RTT measurement will be done for receiving-only connection. If this receiving-only connection lived long enough (e.g. >350sec, given the RTO starts from 200ms), and it suffered 12 sporadic device insufficient buffers, i.e. t_rxtshift >= 12, this receiving-only connection would be dropped prematurely by the retransmission timer. We now assert that for data segments, SYNs or FINs either rexmit or persist timer was wired upon ENOBUFS. And don't set rexmit timer for other cases, i.e. ENOBUFS upon ACKs. Discussed with: lstewart, hiren, jtl, Mike Karels MFC after: 3 weeks Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D5872
*	Change net.inet.tcp.ecn.enable sysctl mib from a binary off/on	truckman	2016-05-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	control to a three way setting. 0 - Totally disable ECN. (no change) 1 - Enable ECN if incoming connections request it. Outgoing connections will request ECN. (no change from present != 0 setting) 2 - Enable ECN if incoming connections request it. Outgoing conections will not request ECN. Change the default value of net.inet.tcp.ecn.enable from 0 to 2. Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have similar capabilities. The actual values above match Linux, and the default matches the current Linux default. Reviewed by: eadler MFC after: 1 month MFH: yes Sponsored by: https://reviews.freebsd.org/D6386
*	sys/net*: minor spelling fixes.	pfg	2016-05-03	1	-4/+4
\| \| \| \|	No functional change.
*	FreeBSD previously provided route caching for TCP (and UDP). Re-add	gnn	2016-03-24	1	-7/+3
\| \| \| \| \| \| \| \| \| \|	route caching for TCP, with some improvements. In particular, invalidate the route cache if a new route is added, which might be a better match. The cache is automatically invalidated if the old route is deleted. Submitted by: Mike Karels Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D4306
*	to_flags is currently a 64-bit integer; however, we only use 7 bits.	jtl	2016-03-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Furthermore, there is no reason this needs to be a 64-bit integer for the forseeable future. Also, there is an inconsistency between to_flags and the mask in tcp_addoptions(). Before r195654, to_flags was a u_long and the mask in tcp_addoptions() was a u_int. r195654 changed to_flags to be a u_int64_t but left the mask in tcp_addoptions() as a u_int, meaning that these variables will only be the same width on platforms with 64-bit integers. Convert both to_flags and the mask in tcp_addoptions() to be explicitly 32-bit variables. This may save a few cycles on 32-bit platforms, and avoids unnecessarily mixing types. Differential Revision: https://reviews.freebsd.org/D5584 Reviewed by: hiren MFC after: 2 weeks Sponsored by: Juniper Networks
*	Fix dtrace probes (introduced in 287759): debug__input was used	gnn	2016-03-03	1	-1/+1
\| \| \| \| \| \| \| \| \|	for output and drop; connect didn't always fire a user probe some probes were missing in fastpath Submitted by: Hannes Mehnert Sponsored by: REMS, EPSRC Differential Revision: https://reviews.freebsd.org/D5525
*	Rename netinet/tcp_cc.h to netinet/cc/cc.h.	glebius	2016-01-27	1	-1/+1
\| \| \| \|	Discussed with: lstewart
*	Persist timers TCPTV_PERSMIN and TCPTV_PERSMAX are hardcoded with 5 seconds and	hiren	2016-01-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	60 seconds, respectively. Turn them into sysctls that can be tuned live. The default values of 5 seconds and 60 seconds have been retained. Submitted by: Jason Wolfe (j at nitrology dot com) Reviewed by: gnn, rrs, hiren, bz MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D5024
*	- Rename cc.h to more meaningful tcp_cc.h.	glebius	2016-01-21	1	-1/+2
\| \| \| \| \|	- Declare it a kernel only include, which it already is. - Don't include tcp.h implicitly from tcp_cc.h
*	There is a bug in tcp_output()'s implementation of the TCP_SIGNATURE	glebius	2016-01-14	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \|	(RFC 2385/TCP-MD5) kernel option. If a tcpcb has TF_NOOPT flag, then tcp_addoptions() is not called, and to.to_signature is an uninitialized stack variable. The value is later used as write offset, which leads to writing to random address. Submitted by: rstone, jtl Security: SA-16:05.tcp
*	Historically we have two fields in tcpcb to describe sender MSS: t_maxopd,	glebius	2016-01-07	1	-8/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned up after T/TCP removal. After all permutations over the years the result is that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if timestamps are in action, or is equal to t_maxopd otherwise. That's a very rough estimate of MSS reduced by options length. Throughout the code it was used in places, where preciseness was not important, like cwnd or ssthresh calculations. With this change: - t_maxopd goes away. - t_maxseg now stores MSS not adjusted by options. - new function tcp_maxseg() is provided, that calculates MSS reduced by options length. The functions gives a better estimate, since it takes into account SACK state as well. Reviewed by: jtl Differential Revision: https://reviews.freebsd.org/D3593
*	Implementation of server-side TCP Fast Open (TFO) [RFC7413].	pkelsey	2015-12-24	1	-1/+70
\| \| \| \| \| \| \| \| \| \|	TFO is disabled by default in the kernel build. See the top comment in sys/netinet/tcp_fastopen.c for implementation particulars. Reviewed by: gnn, jch, stas MFC after: 3 days Sponsored by: Verisign, Inc. Differential Revision: https://reviews.freebsd.org/D4350
*	There are times when it would be really nice to have a record of the last few	hiren	2015-10-14	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren
*	Update TSO limits to include all headers.	hselasky	2015-09-14	1	-1/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To make driver programming easier the TSO limits are changed to reflect the values used in the BUSDMA tag a network adapter driver is using. The TCP/IP network stack will subtract space for all linklevel and protocol level headers and ensure that the full mbuf chain passed to the network adapter fits within the given limits. Implementation notes: If a network adapter driver needs to fixup the first mbuf in order to support VLAN tag insertion, the size of the VLAN tag should be subtracted from the TSO limit. Else not. Network adapters which typically inline the complete header mbuf could technically transmit one more segment. This patch does not implement a mechanism to recover the last segment for data transmission. It is believed when sufficiently large mbuf clusters are used, the segment limit will not be reached and recovering the last segment will not have any effect. The current TSO algorithm tries to send MTU-sized packets, where the MTU typically is 1500 bytes, which gives 1448 bytes of TCP data payload per packet for IPv4. That means if the TSO length limitiation is set to 65536 bytes, there will be a data payload remainder of (65536 - 1500) mod 1448 bytes which is equal to 324 bytes. Trying to recover total TSO length due to inlining mbuf header data will not have any effect, because adding or removing the ETH/IP/TCP headers to or from 324 bytes will not cause more or less TCP payload to be TSO'ed. Existing network adapter limits will be updated separately. Differential Revision: https://reviews.freebsd.org/D3458 Reviewed by: rmacklem MFC after: 2 weeks
*	dd DTrace probe points, translators and a corresponding script	gnn	2015-09-13	1	-0/+1
\| \| \| \| \| \| \| \| \|	to provide the TCPDEBUG functionality with pure DTrace. Reviewed by: rwatson MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: D3530
*	Remove stale comment.	kp	2015-07-25	1	-1/+0
\| \| \| \| \| \|	The IPv6 pseudo header checksum was added by bz in r235961. Sponsored by: Essen FreeBSD Hackathon
*	Fix resource exhaustion due to sessions stuck in LAST_ACK state.	delphij	2015-07-21	1	-2/+9
\| \| \| \| \| \| \|	Submitted by: Jonathan Looney (Juniper SIRT) Reviewed by: lstewart Security: CVE-2015-5358 Security: SA-15:13.tcp
*	Avoid a situation where we do not set persist timer after a zero window	hiren	2015-06-29	1	-0/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	condition. If you send a 0-length packet, but there is data is the socket buffer, and neither the rexmt or persist timer is already set, then activate the persist timer. PR: 192599 Differential Revision: D2946 Submitted by: jlott at averesystems dot com Reviewed by: jhb, jch, gnn, hiren Tested by: jlott at averesystems dot com, jch MFC after: 2 weeks
*	To ease changes to underlying mbuf structure and the mbuf allocator, reduce	rwatson	2015-01-05	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	the knowledge of mbuf layout, and in particular constants such as M_EXT, MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single M_ALIGN() macro, implemented by a now-inlined m_align() function: - Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline. - Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align(). - Update consumers around the tree to simply use M_ALIGN(). This change eliminates a number of cases where mbuf consumers must be aware of whether or not mbufs returned by the allocator use external storage, but also assumptions about the size of the returned mbuf. This will make it easier to introduce changes in how we use external storage, as well as features such as variable-size mbufs. Differential Revision: https://reviews.freebsd.org/D1436 Reviewed by: glebius, trasz, gnn, bz Sponsored by: EMC / Isilon Storage Division
*	In preparation of merging projects/sendfile, transform bare access to	glebius	2014-11-12	1	-13/+18
\| \| \| \| \| \| \| \| \| \| \| \|	sb_cc member of struct sockbuf to a couple of inline functions: sbavail() and sbused() Right now they are equal, but once notion of "not ready socket buffer data", will be checked in, they are going to be different. Sponsored by: Netflix Sponsored by: Nginx, Inc.
*	Fix some minor TSO issues:	hselasky	2014-11-11	1	-8/+8
\| \| \| \| \| \| \| \| \| \|	- Improve description of TSO limits. - Remove a not needed KASSERT() - Remove some not needed variable casts. Sponsored by: Mellanox Technologies Discussed with: lstewart @ MFC after: 1 week
*	Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.	glebius	2014-11-07	1	-6/+6
\| \| \| \|	Sponsored by: Nginx, Inc.
*	Catch ipv6 case when attempting to do PLPMTUD blackhole detection.	sbruno	2014-10-13	1	-0/+5
\| \| \| \| \| \|	Submitted by: Mikhail <mp@lenta.ru> MFC after: 2 weeks Relnotes: yes
*	Implement PLPMTUD blackhole detection (RFC 4821), inspired by code	sbruno	2014-10-07	1	-1/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	from xnu sources. If we encounter a network where ICMP is blocked the Needs Frag indicator may not propagate back to us. Attempt to downshift the mss once to a preconfigured value. Default this feature to off for now while we do not have a full PLPMTUD implementation in our stack. Adds the following new sysctl's for control: net.inet.tcp.pmtud_blackhole_detection -- turns on/off this feature net.inet.tcp.pmtud_blackhole_mss -- mss to try for ipv4 net.inet.tcp.v6pmtud_blackhole_mss -- mss to try for ipv6 Adds the following new sysctl's for monitoring: -- Number of times the code was activated to attempt a mss downshift net.inet.tcp.pmtud_blackhole_activated -- Number of times the blackhole mss was used in an attempt to downshift net.inet.tcp.pmtud_blackhole_min_activated -- Number of times that we failed to connect after we downshifted the mss net.inet.tcp.pmtud_blackhole_failed Phabricator: https://reviews.freebsd.org/D506 Reviewed by: rpaulo bz MFC after: 2 weeks Relnotes: yes Sponsored by: Limelight Networks
*	Minor code styling.	hselasky	2014-10-06	1	-17/+16
\| \| \| \|	Suggested by: glebius @
*	Improve transmit sending offload, TSO, algorithm in general.	hselasky	2014-09-22	1	-11/+96
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current TSO limitation feature only takes the total number of bytes in an mbuf chain into account and does not limit by the number of mbufs in a chain. Some kinds of hardware is limited by two factors. One is the fragment length and the second is the fragment count. Both of these limits need to be taken into account when doing TSO. Else some kinds of hardware might have to drop completely valid mbuf chains because they cannot loaded into the given hardware's DMA engine. The new way of doing TSO limitation has been made backwards compatible as input from other FreeBSD developers and will use defaults for values not set. Reviewed by: adrian, rmacklem Sponsored by: Mellanox Technologies MFC after: 1 week
*	Revert r271504. A new patch to solve this issue will be made.	hselasky	2014-09-13	1	-74/+3
\| \| \| \|	Suggested by: adrian @
*	Improve transmit sending offload, TSO, algorithm in general.	hselasky	2014-09-13	1	-3/+74
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current TSO limitation feature only takes the total number of bytes in an mbuf chain into account and does not limit by the number of mbufs in a chain. Some kinds of hardware is limited by two factors. One is the fragment length and the second is the fragment count. Both of these limits need to be taken into account when doing TSO. Else some kinds of hardware might have to drop completely valid mbuf chains because they cannot loaded into the given hardware's DMA engine. The new way of doing TSO limitation has been made backwards compatible as input from other FreeBSD developers and will use defaults for values not set. MFC after: 1 week Sponsored by: Mellanox Technologies
*	Fix a typo.	hiren	2014-07-03	1	-1/+1
\|
*	- Remove rt_metrics_lite and simply put its members into rtentry.	glebius	2014-03-05	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This removes another cache trashing ++ from packet forwarding path. - Create zini/fini methods for the rtentry UMA zone. Via initialize mutex and counter in them. - Fix reporting of rmx_pksent to routing socket. - Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode. The change is mostly targeted for stable/10 merge. For head, rt_pksent is expected to just disappear. Discussed with: melifaro Sponsored by: Netflix Sponsored by: Nginx, Inc.
*	dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE	avg	2013-11-26	1	-2/+2
\| \| \| \| \| \| \| \|	In its stead use the Solaris / illumos approach of emulating '-' (dash) in probe names with '__' (two consecutive underscores). Reviewed by: markj MFC after: 3 weeks
*	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging	attilio	2013-11-25	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invokation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip
*	Implement the ip, tcp, and udp DTrace providers. The probe definitions use	markj	2013-08-25	1	-0/+20
\| \| \| \| \| \| \| \| \|	dynamic translation so that their arguments match the definitions for these providers in Solaris and illumos. Thus, existing scripts for these providers should work unmodified on FreeBSD. Tested by: gnn, hiren MFC after: 1 month
*	Allow drivers to specify a maximum TSO length in bytes if they are	andre	2013-06-03	1	-4/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	limited in the amount of data they can handle at once. Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to change the limit. The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything less wouldn't be very useful anymore. The upper limit is still at IP_MAXPACKET (65536 bytes). Raising it requires further auditing of the IPv4/v6 code path's as the length field in the IP header would overflow leading to confusion in firewalls and others packet handler on the real size of the packet. The placement into "struct ifnet" is a bit hackish but the best place that was found. When the stack/driver boundary is updated it should be handled in a better way. Submitted by: cperciva (earlier version) Reviewed by: cperciva Tested by: cperciva MFC after: 1 week (using spare struct members to preserve ABI)
*	Fix tcp_output() so that tcpcb is updated in the same manner when an	glebius	2013-04-11	1	-70/+74
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	mbuf allocation fails, as in a case when ip_output() returns error. To achieve that, move large block of code that updates tcpcb below the out: label. This fixes a panic, that requires the following sequence to happen: 1) The SYN was sent to the network, tp->snd_nxt = iss + 1, tp->snd_una = iss 2) The retransmit timeout happened for the SYN we had sent, tcp_timer_rexmt() sets tp->snd_nxt = tp->snd_una, and calls tcp_output(). In tcp_output m_get() fails. 3) Later on the SYN\|ACK for the SYN sent in step 1) came, tcp_input sets tp->snd_una += 1, which leads to tp->snd_una > tp->snd_nxt inconsistency, that later panics in socket buffer code. For reference, this bug fixed in DragonflyBSD repo: http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/1ff9b7d322dc5a26f7173aa8c38ecb79da80e419 Reviewed by: andre Tested by: pho Sponsored by: Nginx, Inc. PR: kern/177456 Submitted by: HouYeFei&XiBoLiu <lglion718 163.com>
*	- Replace compat macros with function calls.	glebius	2013-03-16	1	-1/+1
\|
*	- Use m_getcl() instead of hand allocating.	glebius	2013-03-15	1	-12/+8
\| \| \| \|	Sponsored by: Nginx, Inc.
*	Mechanically substitute flags from historic mbuf allocator with	glebius	2012-12-05	1	-3/+3
\| \| \| \| \| \| \| \| \|	malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually
*	Fix typo; s/ouput/output	kevlo	2012-11-07	1	-1/+1
\|
*	Forced commit to provide the correct commit message to r242251:	andre	2012-10-29	1	-3/+3
\| \| \| \| \| \| \| \| \| \|	Defer sending an independent window update if a delayed ACK is pending saving a packet. The window update then gets piggy-backed on the next already scheduled ACK. Added grammar fixes as well. MFC after: 2 weeks
*	Prevent a flurry of forced window updates when an application is	andre	2012-10-28	1	-11/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	doing small reads on a (partially) filled receive socket buffer. Normally one would a send a window update every time the available space in the socket buffer increases by two times MSS. This leads to a flurry of window updates that do not provide any meaningful new information to the sender. There still is available space in the window and the sender can continue sending data. All window updates then get carried by the regular ACKs. Only when the socket buffer was (almost) full and the window closed accordingly a window updates delivery new information and allows the sender to start sending more data again. Send window updates only every two MSS when the socket buffer has less than 1/8 space available, or the available space in the socket buffer increased by 1/4 its full capacity, or the socket buffer is very small. The next regular data ACK will carry and report the exact window size again. Reported by: sbruno Tested by: darrenr Tested by: Darren Baginski PR: kern/116335 MFC after: 2 weeks
*	When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to	andre	2012-10-28	1	-2/+6
\| \| \| \| \| \| \| \| \| \|	reduce the initial CWND to one segment. This reduction got lost some time ago due to a change in initialization ordering. Additionally in tcp_timer_rexmt() avoid entering fast recovery when we're still in TCPS_SYN_SENT state. MFC after: 2 weeks
*	Switch the entire IPv4 stack to keep the IP packet header	glebius	2012-10-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	in network byte order. Any host byte order processing is done in local variables and host byte order values are never[1] written to a packet. After this change a packet processed by the stack isn't modified at all[2] except for TTL. After this change a network stack hacker doesn't need to scratch his head trying to figure out what is the byte order at the given place in the stack. [1] One exception still remains. The raw sockets convert host byte order before pass a packet to an application. Probably this would remain for ages for compatibility. [2] The ip_input() still subtructs header len from ip->ip_len, but this is planned to be fixed soon. Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru> Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
*	If ip_output() returns EMSGSIZE to tcp_output(), then the latter calls	glebius	2012-07-16	1	-16/+27
\| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_mtudisc(), which in its turn may call tcp_output(). Under certain conditions (must admit they are very special) an infinite recursion can happen. To avoid recursion we can pass struct route to ip_output() and obtain correct mtu. This allows us not to use tcp_mtudisc() but call tcp_mss_update() directly. PR: kern/155585 Submitted by: Andrey Zonov <andrey zonov.org> (original version of patch)
*	- Updated TOE support in the kernel.	np	2012-06-19	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs. These are available as t3_tom and t4_tom modules that augment cxgb(4) and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as usual with or without these extra features. - iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the works and will follow soon. Build-tested with make universe. 30s overview ============ What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the capabilities of an interface: # ifconfig -m \| grep TOE Enable/disable TCP offload on an interface (just like any other ifnet capability): # ifconfig cxgbe0 toe # ifconfig cxgbe0 -toe Which connections are offloaded? Look for toe4 and/or toe6 in the output of netstat and sockstat: # netstat -np tcp \| grep toe # sockstat -46c \| grep toe Reviewed by: bz, gnn Sponsored by: Chelsio communications. MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
*	It turns out that too many drivers are not only parsing the L2/3/4	bz	2012-05-28	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	headers for TSO but also for generic checksum offloading. Ideally we would only have one common function shared amongst all drivers, and perhaps when updating them for IPv6 we should introduce that. Eventually we should provide the meta information along with mbufs to avoid (re-)parsing entirely. To not break IPv6 (checksums and offload) and to be able to MFC the changes without risking to hurt 3rd party drivers, duplicate the v4 framework, as other OSes have done as well. Introduce interface capability flags for TX/RX checksum offload with IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6 flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6 fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and add an alias for CSUM_DATA_VALID_IPV6. This pretty much brings IPv6 handling in line with IPv4. TSO is still handled in a different way and not via if_hwassist. Update ifconfig to allow (un)setting of the new capability flags. Update loopback to announce the new capabilities and if_hwassist flags. Individual driver updates will have to follow, as will SCTP. Reported by: gallatin, dim, .. Reviewed by: gallatin (glanced at?) MFC after: 3 days X-MFC with: r235961,235959,235958
*	MFp4 bz_ipv6_fast:	bz	2012-05-25	1	-6/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add code to handle pre-checked TCP checksums as indicated by mbuf flags to save the entire computation for validation if not needed. In the IPv6 TCP output path only compute the pseudo-header checksum, set the checksum offset in the mbuf field along the appropriate flag as done in IPv4. In tcp_respond() just initialize the IPv6 payload length to 0 as ip6_output() will properly set it. Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems Reviewed by: gnn (as part of the whole) MFC After: 3 days
*	When we receive an ICMP unreach need fragmentation datagram, we take	glebius	2012-04-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	proposed MTU value from it and update the TCP host cache. Then tcp_mss_update() is called on the corresponding tcpcb. It finds the just allocated entry in the TCP host cache and updates MSS on the tcpcb. And then we do a fast retransmit of what we have in the tcp send buffer. This sequence gets broken if the TCP host cache is exausted. In this case allocation fails, and later called tcp_mss_update() finds nothing in cache. The fast retransmit is done with not reduced MSS and is immidiately replied by remote host with new ICMP datagrams and the cycle repeats. This ping-pong can go up to wirespeed. To fix this: - tcp_mss_update() gets new parameter - mtuoffer, that is like offer, but needs to have min_protoh subtracted. - tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify(). - tcp_mtudisc() now accepts not a useless error argument, but proposed MTU value, that is passed to tcp_mss_update() as mtuoffer. Reported by: az Reported by: Andrey Zonov <andrey zonov.org> Reviewed by: andre (previous version of patch)