FreeBSD-src - Raptor Engineering's fork of pfsense FreeBSD src with pfSense changes

	Commit message	Author	Age	Files	Lines
*	Build on Jeff Roberson's linker-set based dynamic per-CPU allocator	rwatson	2009-07-14	1	-27/+0
	(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing them in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING.
	Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
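	The reference-copy and accessor idea in the entry above can be pictured with a small userland sketch. This is not the kernel implementation: a plain struct stands in for the contiguous linker-set region purely for illustration (the kernel gathers individually tagged globals instead), and the names vnet_ref_region, vnet_alloc(), vnet_free() and VNET_PTR() are hypothetical, as are the variable names and initial values.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Stand-in for the contiguous region that the linker set produces.
 * Assumption: the kernel collects separately declared globals; a struct
 * is used here only to get a contiguous "reference copy" in portable C.
 */
struct vnet_ref_region {
	int ip_defttl;
	int tcp_do_rfc1323;
};

/* Statically initialized reference copy (illustrative values). */
static const struct vnet_ref_region vnet_ref = { 64, 1 };

/* One network stack instance: owns a private copy of the region. */
struct vnet {
	unsigned char *data;
};

/* Create an instance and initialize its storage from the reference copy. */
struct vnet *
vnet_alloc(void)
{
	struct vnet *v = malloc(sizeof(*v));

	if (v == NULL)
		return (NULL);
	v->data = malloc(sizeof(vnet_ref));
	if (v->data == NULL) {
		free(v);
		return (NULL);
	}
	memcpy(v->data, &vnet_ref, sizeof(vnet_ref));
	return (v);
}

void
vnet_free(struct vnet *v)
{
	free(v->data);
	free(v);
}

/* Accessor: per-instance base plus the symbol's offset in the region. */
#define VNET_PTR(v, type, name) \
	((type *)(void *)((v)->data + offsetof(struct vnet_ref_region, name)))
```

	Writing through VNET_PTR(a, int, ip_defttl) changes only instance a; a second instance and the reference copy keep their initial values.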
*	Make callers to in6_selectsrc() and in6_pcbladdr() pass in memory	bz	2009-06-23	1	-3/+3
	to save the selected source address rather than returning an unreferenced copy to a pointer that might long be gone by the time we use the pointer for anything meaningful. Asked for by: rwatson Reviewed by: rwatson
*	Add soreceive_stream(), an optimized version of soreceive() for	andre	2009-06-22	1	-0/+6
	stream (TCP) sockets. It is functionally identical to generic soreceive() but has a number of stream-specific optimizations: o does only one sockbuf unlock/lock per receive independent of the length of data to be moved into the uio compared to soreceive() which unlocks/locks per mbuf. o uses m_mbuftouio() instead of its own copy(out) variant. o much more compact code flow as a large number of special cases is removed. o much improved readability. It offers significantly reduced CPU usage and lock contention when receiving fast TCP streams. Additional gains are obtained when the receiving application is using SO_RCVLOWAT to batch up some data before a read (and wakeup) is done. This function was written by "reverse engineering" and is not just a stripped down variant of soreceive(). It is not yet enabled by default on TCP sockets. Instead it is commented out in the protocol initialization in tcp_usrreq.c until more widespread testing has been done. Testers, especially with 10GigE gear, are welcome. MFP4: r164817 //depot/user/andre/soreceive_stream/
*	- Change members of tcpcb that cache values of ticks from int to u_int:	jhb	2009-06-16	1	-4/+4
	t_rcvtime, t_starttime, t_rtttime, t_bw_rtttime, ts_recent_age, t_badrxtwin. - Change t_recent in struct timewait from u_long to u_int32_t to match the type of the field it shadows from tcpcb: ts_recent. - Change t_starttime in struct timewait from u_long to u_int to match the t_starttime field in tcpcb. Requested by: bde (1, 3)
*	Correct printf format type mismatches.	jhb	2009-06-11	1	-3/+3
*	Change a few members of tcpcb that store cached copies of ticks to be ints	jhb	2009-06-10	1	-3/+3
	instead of unsigned longs. This fixes a few overflow edge cases on 64-bit platforms. Specifically, if an idle connection receives a packet shortly before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep alive timer fires after 2^31 clock ticks, the keep alive timer will think that the connection has been idle for a very long time and will immediately drop the connection instead of sending a keep alive probe. Reviewed by: silby, gnn, lstewart MFC after: 1 week
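	The edge case described above can be reproduced in userland. The sketch below is an illustration under assumptions, not the kernel code: the function names are hypothetical, and the narrow variant uses unsigned 32-bit arithmetic so that the example stays free of signed overflow, whereas the commit itself switches the fields to int.

```c
#include <stdint.h>

/*
 * Idle time when the cached tick value lives in a 64-bit field: the
 * 32-bit signed tick counter is sign-extended before the subtraction,
 * so a counter that has wrapped negative yields an enormous result.
 */
uint64_t
idle_ticks_wide(int32_t now, uint64_t cached)
{
	return ((uint64_t)(int64_t)now - cached);
}

/*
 * Idle time when both operands are 32 bits wide: the subtraction is
 * modular, so crossing the 2^31 boundary does not distort the result.
 */
uint32_t
idle_ticks_narrow(int32_t now, uint32_t cached)
{
	return ((uint32_t)now - cached);
}
```

	With a packet received 10 ticks before the counter reaches 2^31 and the timer firing 21 ticks later, the narrow variant reports 21 idle ticks while the wide variant reports a value above 2^32.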
*	Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and	rwatson	2009-04-11	1	-2/+2
	TCPSTAT_INC(), rather than directly manipulating the fields across the kernel. This will make it easier to change the implementation of these statistics, such as using per-CPU versions of the data structures. MFC after: 3 days
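	A minimal sketch of how such wrapper macros hide the storage behind a single definition. The structure tcpstat_sketch, the two fields chosen and the helper account_segment() are assumptions made for illustration; the real struct tcpstat and the macro bodies in the tree may differ.

```c
#include <stdint.h>

/* Illustrative subset of a statistics structure (assumed fields). */
struct tcpstat_sketch {
	uint64_t tcps_sndtotal;	/* segments sent */
	uint64_t tcps_sndbyte;	/* bytes sent */
};

struct tcpstat_sketch tcpstat_sketch;

/*
 * All updates go through these two macros, so switching to per-CPU
 * storage later would only require changing the macro bodies.
 */
#define TCPSTAT_ADD(name, val)	(tcpstat_sketch.name += (val))
#define TCPSTAT_INC(name)	TCPSTAT_ADD(name, 1)

/* Hypothetical caller: account for one transmitted segment. */
void
account_segment(uint64_t len)
{
	TCPSTAT_INC(tcps_sndtotal);
	TCPSTAT_ADD(tcps_sndbyte, len);
}
```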
*	With the right comparison we get a proper wscale value and thus	bz	2009-04-07	1	-1/+1
	more adequate TCP performance with IPv6. Changes for IPv4, r166403 and r172795, both ignored the IPv6 counterpart and left it in the state of art of year 2000. The same logic in syncache already shares code between v4 and v6 so things do not need to be adapted there. Reported by: Steinar Haug (sthaug nethelp.no) Tested by: Steinar Haug (sthaug nethelp.no) MFC after: 3 days
*	Correct a number of evolved problems with inp_vflag and inp_flags:	rwatson	2009-03-15	1	-29/+29
	certain flags that should have been in inp_flags ended up in inp_vflag, meaning that they were inconsistently locked, and in one case, interpreted. Move the following flags from inp_vflag to gaps in the inp_flags space (and clean up the inp_flags constants to make gaps more obvious to future takers): INP_TIMEWAIT INP_SOCKREF INP_ONESBCAST INP_DROPPED Some aspects of this change have no effect on kernel ABI at all, as these are UDP/TCP/IP-internal uses; however, netstat and sockstat detect INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this into account. MFC after: 1 week (or after dependencies are MFC'd) Reviewed by: bz
*	In tcp_usr_shutdown() and tcp_usr_send(), I missed converting NULL	rwatson	2009-02-24	1	-2/+3
	checks for the tcpcb, previously used to detect complete disconnection, with INP_DROPPED checks. Correct that, preventing shutdown() from improperly generating a TCP segment with destination IP and port of 0.0.0.0:0. PR: kern/132050 Reported by: david gueluy <david.gueluy at netasq.com> MFC after: 3 weeks
*	Standardize the various prison_foo_ip[46] functions and prison_if to	jamie	2009-02-05	1	-8/+5
	return zero on success and an error code otherwise. The possible errors are EADDRNOTAVAIL if an address being checked for doesn't match the prison, and EAFNOSUPPORT if the prison doesn't have any addresses in that address family. For most callers of these functions, use the returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or EINVAL. Always include a jailed() check in these functions, where a non-jailed cred always returns success (and makes no changes). Remove the explicit jailed() checks that preceded many of the function calls. Approved by: bz (mentor)
*	Use inc_flags instead of the inc_isipv6 alias which so far	bz	2008-12-17	1	-1/+1
	had been the only flag with random usage patterns. Switch inc_flags to be used as a real bit field by using INC_ISIPV6 with bitops to check for the 'isipv6' condition. While here fix a place or two where in case of v4 inc_flags were not properly initialized before.[1] Found by: rwatson during review [1] Discussed with: rwatson Reviewed by: rwatson MFC after: 4 weeks
*	Another step assimilating IPv[46] PCB code - directly use	bz	2008-12-15	1	-3/+3
	the inpcb names rather than the following IPv6 compat macros: in6pcb, in6p_sp, in6p_ip6_nxt, in6p_flowinfo, in6p_vflag, in6p_flags, in6p_socket, in6p_lport, in6p_fport, in6p_ppcb and sotoin6pcb(). Apart from removing duplicate code in netipsec, this is a pure whitespace, not a functional change. Discussed with: rwatson Reviewed by: rwatson (version before review requested changes) MFC after: 4 weeks (set the timer and see then)
*	Rather than using hidden includes (with circular dependencies),	bz	2008-12-02	1	-0/+1
	directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation
*	MFp4:	bz	2008-11-29	1	-2/+10
	Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addition to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the aforementioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
*	Replace most INP_CHECK_SOCKAF() uses checking if it is an	bz	2008-11-27	1	-5/+2
	IPv6 socket by comparing a constant inp vflag. This is expected to help to reduce extra locking. Suggested by: rwatson Reviewed by: rwatson MFC after: 6 weeks
*	Merge in6_pcbfree() into in_pcbfree() which after the previous	bz	2008-11-27	1	-24/+5
	IPsec change in r185366 only differed in two additional IPv6 lines. Rather than splattering conditional code everywhere add the v6 check centrally at this single place. Reviewed by: rwatson (as part of a larger changeset) MFC after: 6 weeks (*) (*) possibly need to leave a stub wrapper in 7 to keep the symbol.
*	Remove in6_pcbdetach() as it is exactly the same function	bz	2008-11-26	1	-32/+10
	as in_pcbdetach() and we don't need the code twice. Reviewed by: rwatson MFC after: 6 weeks (*) (*) possibly need to leave a stub wrapper in 7 to keep the symbol.
*	Step 1.5 of importing the network stack virtualization infrastructure	zec	2008-10-02	1	-0/+26
	from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_*() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparison between pre- and post-change object files(*). (*) netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
*	Commit step 1 of the vimage project, (network stack)	bz	2008-08-17	1	-45/+46
	virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch
*	MFp4 (//depot/projects/tcpecn/):	rpaulo	2008-07-31	1	-0/+4
	TCP ECN support. Merge of my GSoC 2006 work for NetBSD. TCP ECN is defined in RFC 3168. Partly reviewed by: dwmalone, silby Obtained from: NetBSD
*	replace spaces added in last change with tabs	kmacy	2008-05-05	1	-5/+5
*	add rcv_nxt, snd_nxt, and toe offload id to FreeBSD-specific	kmacy	2008-05-05	1	-0/+6
	extension fields for tcp_info
*	Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to	rwatson	2008-04-17	1	-68/+68
	explicitly select write locking for all use of the inpcb mutex. Update some pcbinfo lock assertions to assert locked rather than write-locked, although in practice almost all uses of the pcbinfo rwlock remain exclusive, and all instances of inpcb lock acquisition are exclusive. This change should introduce (ideally) little functional change. However, it lays the groundwork for significantly increased parallelism in the TCP/IP code. MFC after: 3 months Tested by: kris (superset of committed patch)
*	tcp_usrreq.c:1.313 removed tcbinfo locking from tcp_usr_accept(), which	rwatson	2008-01-23	1	-0/+2
	while in principle a good idea, opened us up to a race inherent to the syncache's direct insertion of incoming TCP connections into the "completed connection" listen queue, as it transpires that the socket is inserted before the inpcb is fully filled in by syncache_expand(). The bug manifested with the occasional returning of 0.0.0.0:0 in the address returned by the accept() system call, which occurred if accept managed to execute tcp_usr_accept() before syncache_expand() had copied the endpoint addresses into inpcb connection state. Re-add tcbinfo locking around the address copyout, which has the effect of delaying the copy until syncache_expand() has finished running, as it is run while the tcbinfo lock is held. This is undesirable in that it increases contention on tcbinfo further, but a more significant change will be required to how the syncache inserts new sockets in order to fix this and keep more granular locking here. In particular, either more state needs to be passed into sonewconn() so that pru_attach() can fill in the fields before the socket is inserted, or the socket needs to be inserted in the incomplete connection queue until it is actually ready to be used. Reported by: glebius (and kris) Tested by: glebius
*	In tcp_ctloutput(), don't hold the inpcb lock over sooptcopyin(), rather,	rwatson	2008-01-18	1	-25/+55
	drop the lock and then re-acquire it, revalidating TCP connection state assumptions when we do so. This avoids a potential lock order reversal (and potential deadlock, although none have been reported) due to the inpcb lock being held over a page fault. MFC after: 1 week PR: 102752 Reviewed by: bz Reported by: Václav Haisman <v dot haisman at sh dot cvut dot cz>
*	Incorporate TCP offload hooks in to core TCP code.	kmacy	2007-12-18	1	-9/+13
	- Rename output routines tcp_gen_* -> tcp_output_*. - Rename notification routines that turn in to no-ops in the absence of TOE from tcp_gen_* -> tcp_offload_*. - Fix some minor comment nits. - Add a /* FALLTHROUGH */ Reviewed by: Sam Leffler, Robert Watson, and Mike Silbersack
*	Pick the smallest possible TCP window scaling factor that will still allow	silby	2007-10-19	1	-2/+1
	us to scale up to sb_max, aka kern.ipc.maxsockbuf. We do this because there are broken firewalls that will corrupt the window scale option, leading to the other endpoint believing that our advertised window is unscaled. At scale factors larger than 5 the unscaled window will drop below 1500 bytes, leading to serious problems when traversing these broken firewalls. With the default maxsockbuf of 256K, a scale factor of 3 will be chosen by this algorithm. Those who choose a larger maxsockbuf should watch out for the compatibility problems mentioned above. Reviewed by: andre
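	The selection rule described above can be sketched as a small loop. This is an illustration rather than the committed code: the helper names are hypothetical, while 65535 and 14 are the maximum unscaled window and maximum shift permitted by the TCP window scale option.

```c
#include <stdint.h>

#define TCP_MAXWIN		65535	/* largest unscaled 16-bit window */
#define TCP_MAX_WINSHIFT	14	/* largest shift the option allows */

/* Smallest shift whose scaled maximum window still reaches sb_max. */
int
pick_wscale(uint64_t sb_max)
{
	int scale = 0;

	while (scale < TCP_MAX_WINSHIFT &&
	    ((uint64_t)TCP_MAXWIN << scale) < sb_max)
		scale++;
	return (scale);
}

/*
 * Value placed in the 16-bit window field for a given real window;
 * an endpoint that lost the scale option reads this number as bytes.
 */
uint32_t
window_field(uint32_t window, int scale)
{
	return (window >> scale);
}
```

	For sb_max = 262144 (256K) the loop stops at a shift of 3, matching the figure in the commit message. A 65535-byte window is advertised as 2047 at shift 5 and as 1023 at shift 6, which is where the field falls below 1500.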
*	Add FBSDID to all files in netinet so that people can more	silby	2007-10-07	1	-1/+3
	easily include file version information in bug reports. Approved by: re (kensmith)
*	Two changes:	silby	2007-09-24	1	-3/+3
	- Reintegrate the ANSI C function declaration change from tcp_timer.c rev 1.92 - Reorganize the tcpcb structure so that it has a single pointer to the "tcp_timer" structure which contains all of the tcp timer callouts. This change means that when the single tcp timer change is reintegrated, tcpcb will not change in size, and therefore the ABI between netstat and the kernel will not change. Neither of these changes should have any functional impact. Reviewed by: bmah, rrs Approved by: re (bmah)
*	Back out tcp_timer.c:1.93 and associated changes that reimplemented the many	rwatson	2007-09-07	1	-7/+4
	TCP timers as a single timer, but retain the API changes necessary to reintroduce this change. This will back out the source of at least two reported problems: lock leaks in certain timer edge cases, and TCP timers continuing to fire after a connection has closed (a bug previously fixed and then reintroduced with the timer rewrite). In a follow-up commit, some minor restylings and comment changes performed after the TCP timer rewrite will be reapplied, and a further change to allow the TCP timer rewrite to be added back without disturbing the ABI. The new design is believed to be a good thing, but the outstanding issues are leading to significant stability/correctness problems that are holding up 7.0. This patch was generated by silby, but is being committed by proxy due to poor network connectivity for silby this week. Approved by: re (kensmith) Submitted by: silby Tested by: rwatson, kris Problems reported by: peter, kris, others
*	Make tcpstates[] static, and make sure TCPSTATES is defined before	des	2007-07-30	1	-4/+0
	<netinet/tcp_fsm.h> is included into any compilation unit that needs tcpstates[]. Also remove incorrect extern declarations and TCPDEBUG conditionals. This allows kernels both with and without TCPDEBUG to build, and unbreaks the tinderbox. Approved by: re (rwatson)
*	Fix compilation problems- tcpstates is only available if TCPDEBUG	mjacob	2007-07-29	1	-1/+3
	is set. Approved by: re (in spirit)
*	Garbage collect some debug code that not only no longer could	mjacob	2007-06-15	1	-6/+0
	work but in fact probably causes random pointer dereferences. Garbage collect the tp variable too.
*	(1) In tcp_usrclosed(), tp can never become NULL, so don't test for NULL	rwatson	2007-05-31	1	-4/+3
	before handling the socket disconnection case. (2) Clean up surrounding comments and formatting. Found with: Coverity Prevent(tm) (1) CID: 2203
*	Reduce network stack oddness: implement .pru_sockaddr and .pru_peeraddr	rwatson	2007-05-11	1	-3/+3
	protocol entry points using functions named proto_getsockaddr and proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr. While it's true that sockaddrs are allocated and set, the net effect is to retrieve (get) the socket address or peer address from a socket, not set it, so align names to that intent.
*	Remove unneeded wrappers for in_setsockaddr() and in_setpeeraddr(), which	rwatson	2007-05-11	1	-25/+2
	used to exist so pcbinfo locks could be acquired, but are no longer required as a result of socket/pcb reference model refinements.
*	Move universally to ANSI C function declarations, with relatively	rwatson	2007-05-10	1	-0/+2
	consistent style(9)-ish layout.
*	Remove unused requested_s_scale from struct tcpcb.	andre	2007-05-06	1	-2/+2
*	Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of	andre	2007-05-06	1	-3/+3
	a dedicated sack_enable int for this bool. Change all users accordingly.
*	Remove unused pcbinfo arguments to in_setsockaddr() and	rwatson	2007-05-01	1	-2/+2
	in_setpeeraddr().
*	Change the TCP timer system from using the callout system five times	andre	2007-04-11	1	-8/+10
	directly to a merged model where only one callout, the next to fire, is registered. Instead of callout_reset(9) and callout_stop(9) the new function tcp_timer_activate() is used which then internally manages the callout. The single new callout is a mutex callout on inpcb simplifying the locking a bit. tcp_timer() is the called function which handles all race conditions in one place and then dispatches the individual timer functions. Reviewed by: rwatson (earlier version)
*	ANSIfy function declarations and remove register keywords for variables.	andre	2007-03-21	1	-22/+9
	Consistently apply style to all function declarations.
*	Remove tcp_minmssoverload DoS detection logic. The problem it tried to	andre	2007-03-21	1	-5/+0
	protect us from wasn't really there and it only bloats the code. Should the problem surface in the future we can simply resurrect it from cvs history.
*	Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigates	mohans	2007-02-26	1	-2/+7
	potential issues where the peer does not close, potentially leaving thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl fast_finwait2_recycle, which is disabled by default. Reviewed by: gnn, silby.
*	Add "show inpcb", "show tcpcb" DDB commands, which should come in handy	rwatson	2007-02-17	1	-1/+321
	for debugging sblock and other network panics.
*	Expose smoothed RTT and RTT variance measurements to userland via	bms	2007-02-02	1	-0/+4
	socket option TCP_INFO. Note that the units used in the original Linux API are in microseconds, so use a 64-bit mantissa to convert FreeBSD's internal measurements from struct tcpcb from ticks.
*	Auto sizing TCP socket buffers.	andre	2007-02-01	1	-0/+2
	Normally the socket buffers are static (either derived from global defaults or set with setsockopt) and do not adapt to real network conditions. Two things happen: a) your socket buffers are too small and you can't reach the full potential of the network between both hosts; b) your socket buffers are too big and you waste a lot of kernel memory for data just sitting around. With automatic TCP send and receive socket buffers we can start with a small buffer and quickly grow it in parallel with the TCP congestion window to match real network conditions. FreeBSD has a default 32K send socket buffer. This supports a maximal transfer rate of only slightly more than 2Mbit/s on a 100ms RTT trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send buffer auto scaling and the default values below it supports 20Mbit/s at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or 1000%. For the receive side it looks slightly better with a default of 64K buffer size. New sysctls are: net.inet.tcp.sendbuf_auto=1 (enabled) net.inet.tcp.sendbuf_inc=8192 (8K, step size) net.inet.tcp.sendbuf_max=262144 (256K, growth limit) net.inet.tcp.recvbuf_auto=1 (enabled) net.inet.tcp.recvbuf_inc=16384 (16K, step size) net.inet.tcp.recvbuf_max=262144 (256K, growth limit) Tested by: many (on HEAD and RELENG_6) Approved by: re MFC after: 1 month
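	The transfer-rate figures quoted above follow from the rule that a sender can have at most one buffer's worth of data in flight per round trip. A minimal sketch of that arithmetic, with a hypothetical helper name and integer units chosen for the example:

```c
#include <stdint.h>

/*
 * Upper bound on throughput, in bits per second, for a sender limited
 * to buf_bytes of unacknowledged data per round trip of rtt_ms
 * milliseconds. Returns 0 for a zero RTT rather than dividing by zero.
 */
uint64_t
max_throughput_bps(uint64_t buf_bytes, uint32_t rtt_ms)
{
	if (rtt_ms == 0)
		return (0);
	return (buf_bytes * 8 * 1000 / rtt_ms);
}
```

	A 32768-byte buffer gives 2621440 bit/s at 100 ms and 1310720 bit/s at 200 ms; the 262144-byte growth limit gives 20971520 bit/s and 10485760 bit/s respectively, in line with the rounded numbers in the commit message.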
*	Change the way the advertized TCP window scaling is computed. Instead of	andre	2007-02-01	1	-2/+7
	upper-bounding it to the size of the initial socket buffer lower-bound it to the smallest MSS we accept. Ideally we'd use the actual MSS information here but it is not available yet. For socket buffer auto sizing to be effective we need room to grow the receive window. The window scale shift is determined at connection setup and can't be changed afterwards. The previous, original, method effectively just did a power of two roundup of the socket buffer size at connection setup severely limiting the headroom for larger socket buffers. Tested by: many (as part of the socket buffer auto sizing patch) MFC after: 1 month
*	Change error codes returned by protocol operations when an inpcb is	sam	2006-11-22	1	-6/+6
	marked INP_DROPPED or INP_TIMEWAIT: o return ECONNRESET instead of EINVAL for close, disconnect, shutdown, rcvd, rcvoob, and send operations o return ECONNABORTED instead of EINVAL for accept These changes should reduce confusion in applications since EINVAL is normally interpreted to mean an invalid file descriptor. This change does not conflict with POSIX or other standards I checked. The return of EINVAL has always been possible but rare; it's become more common with recent changes to the socket/inpcb handling and with finer-grained locking and preemption. Note: there are other instances of EINVAL for this state that were left unchanged; they should be reviewed. Reviewed by: rwatson, andre, ru MFC after: 1 month
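	The mapping listed above can be summarized in one function. The enum and the function name are hypothetical and exist only for this illustration; in the tree the error is chosen inside each protocol operation rather than through a shared helper.

```c
#include <errno.h>

/* Operations named in the commit message (hypothetical enum). */
enum pr_op {
	PR_ACCEPT,
	PR_CLOSE,
	PR_DISCONNECT,
	PR_SHUTDOWN,
	PR_RCVD,
	PR_RCVOOB,
	PR_SEND
};

/*
 * Error returned when the inpcb is marked INP_DROPPED or INP_TIMEWAIT:
 * ECONNABORTED for accept, ECONNRESET for the other listed operations.
 */
int
dropped_errno(enum pr_op op)
{
	return (op == PR_ACCEPT ? ECONNABORTED : ECONNRESET);
}
```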