path: root/sys/kern/uipc_sockbuf.c
Commit message [author, date, files, lines]
* Mechanically substitute flags from historic mbuf allocator with [glebius, 2012-12-05, 1 file, -4/+4]
| | | | | | | | | malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually
* Document a large number of currently undocumented sysctls. While here [eadler, 2011-12-13, 1 file, -1/+1]
| | | | | | | | | | | | fix some style(9) issues and reduce redundancy. PR: kern/155491 PR: kern/155490 PR: kern/155489 Submitted by: Galimov Albert <wtfcrap@mail.ru> Approved by: bde Reviewed by: jhb MFC after: 1 week
* Increase the defaults for the maximum socket buffer limit, [bz, 2011-08-25, 1 file, -1/+1]
| | | | | | | | | | | | | | | | | | and the maximum TCP send and receive buffer limits from 256kB to 2MB. For sb_max_adj we need to add the cast as already used in the sysctl handler to not overflow the type doing the maths. Note that this is just the defaults. They will allow more memory to be consumed per socket/connection if needed but not change the default "idle" memory consumption. All values are still tunable by sysctls. Suggested by: gnn Discussed on: arch (Mar and Aug 2011) MFC after: 3 weeks Approved by: re (kib)
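The overflow that the cast avoids can be illustrated in user space. This is a minimal sketch, not the kernel code: it assumes the commonly used values MSIZE = 256 and MCLBYTES = 2048, uses a 32-bit integer to stand in for a 32-bit `u_long`, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants: typical values, hard-coded for illustration only. */
#define MSIZE     256u   /* bytes in one mbuf */
#define MCLBYTES  2048u  /* bytes in one mbuf cluster */

/* Hypothetical helper: the adjustment computed entirely in 32 bits. */
static uint32_t
sb_max_adj_narrow(uint32_t sb_max)
{
	/* sb_max * MCLBYTES wraps modulo 2^32 before the division happens. */
	return (sb_max * MCLBYTES / (MSIZE + MCLBYTES));
}

/* Hypothetical helper: widen to 64 bits first, as a cast would. */
static uint32_t
sb_max_adj_wide(uint32_t sb_max)
{
	return ((uint32_t)((uint64_t)sb_max * MCLBYTES / (MSIZE + MCLBYTES)));
}
```

With the old 256kB limit the product is 2^29 and both helpers agree. With the new 2MB limit the product is exactly 2^32, which wraps to zero in 32-bit arithmetic, so only the widened version yields a usable value.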
* Revert r194662, since it breaks ng_ksocket(4) and may break [glebius, 2011-04-14, 1 file, -3/+0]
| | | | | | | | other socket consumers with alternate sb_upcall. PR: kern/154676 Submitted by: Arnaud Lacombe <lacombar gmail.com> MFC after: 7 days
* In sbappendstream_locked() demote all incoming packet mbufs (and [andre, 2009-06-22, 1 file, -0/+3]
| | | | | | | | | | | | | | | | | | | | | chains) to pure data mbufs using m_demote(). This removes the packet header and all m_tag information as they are not meaningful anymore on a stream socket where mbufs are linked through m->m_next. Strictly speaking a packet header can be only ever valid on the first mbuf in an m_next chain. sbcompress() was doing this already when the mbuf chain layout lent itself to it (e.g. header splitting or merge-append), just not consistently. This frees resources at socket buffer append time instead of at sbdrop_internal() time after data has been read from the socket. For MAC the per packet information has done its duty and during socket buffer appending the policy of the socket itself takes over. With the append the packet boundaries disappear naturally and with it any context that was based on it. None of the residual information from mbuf headers in the socket buffer on stream sockets was looked at.
* Rework socket upcalls to close some races with setup/teardown of upcalls. [jhb, 2009-06-01, 1 file, -4/+14]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Each socket upcall is now invoked with the appropriate socket buffer locked. It is not permissible to call soisconnected() with this lock held, however, so socket upcalls now return an integer value. The two possible values are SU_OK and SU_ISCONNECTED. If an upcall returns SU_ISCONNECTED, then soisconnected() will be invoked on the socket after the socket buffer lock is dropped. - A new API is provided for setting and clearing socket upcalls. The API consists of soupcall_set() and soupcall_clear(). - To simplify locking, each socket buffer now has a separate upcall. - When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from the receive socket buffer automatically. Note that a SO_SND upcall should never return SU_ISCONNECTED. - All this means that accept filters should now return SU_ISCONNECTED instead of calling soisconnected() directly. They also no longer need to explicitly clear the upcall on the new socket. - The HTTP accept filter still uses soupcall_set() to manage its internal state machine, but other accept filters no longer have any explicit knowledge of socket upcall internals aside from their return value. - The various RPC client upcalls currently drop the socket buffer lock while invoking soreceive() as a temporary band-aid. The plan for the future is to add a new flag to allow soreceive() to be called with the socket buffer locked. - The AIO callback for socket I/O is now also invoked with the socket buffer locked. Previously sowakeup() would drop the socket buffer lock only to call aio_swake() which immediately re-acquired the socket buffer lock for the duration of the function call. Discussed with: rwatson, rmacklem
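The calling convention described in this entry can be modelled in user space. The following is a rough sketch under stated assumptions, not the kernel implementation: every `toy_` name is hypothetical, a boolean stands in for the socket buffer mutex, and the numeric values of SU_OK and SU_ISCONNECTED are chosen arbitrarily.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Names taken from the commit message; the numeric values are assumed. */
enum { SU_OK = 0, SU_ISCONNECTED = 1 };

/* Hypothetical stand-in for one socket buffer with its own upcall. */
struct toy_sockbuf {
	bool locked;                                  /* models the sockbuf mutex */
	int (*upcall)(struct toy_sockbuf *, void *);  /* per-buffer upcall */
	void *upcall_arg;
	int connected_calls;                          /* times toy_soisconnected() ran */
	bool connected_with_lock;                     /* set if it ever ran locked */
};

static void
toy_soisconnected(struct toy_sockbuf *sb)
{
	sb->connected_calls++;
	if (sb->locked)
		sb->connected_with_lock = true;
}

/* Invoke the upcall with the lock held; defer soisconnected() until after. */
static void
toy_sowakeup(struct toy_sockbuf *sb)
{
	int ret = SU_OK;

	sb->locked = true;
	if (sb->upcall != NULL) {
		ret = sb->upcall(sb, sb->upcall_arg);
		if (ret == SU_ISCONNECTED) {
			/* The upcall is cleared automatically in this case. */
			sb->upcall = NULL;
			sb->upcall_arg = NULL;
		}
	}
	sb->locked = false;
	if (ret == SU_ISCONNECTED)
		toy_soisconnected(sb);
}

/* Toy accept filter: reports "connected" once *arg becomes nonzero. */
static int
toy_accept_filter(struct toy_sockbuf *sb, void *arg)
{
	(void)sb;
	return (*(int *)arg ? SU_ISCONNECTED : SU_OK);
}
```

The point of the return value is visible in `toy_sowakeup()`: the filter never calls the connect notification itself, so that notification cannot run while the buffer lock is held.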
* Fix sbappendrecord_locked(). [emax, 2009-04-21, 1 file, -8/+2]
| | | | | | | | | | | | | | | | | | The main problem is that sbappendrecord_locked() relies on sbcompress() to set sb_mbtail. This will not happen if sbappendrecord_locked() is called with an mbuf chain made of exactly one mbuf (i.e. m0->m_next == NULL). In this case sbcompress() will be called with m == NULL and will do nothing. I'm not entirely sure if m == NULL is a valid argument for sbcompress(), and it is rather pointless to call it like that, but keep calling it so it can do SBLASTMBUFCHK(). The problem is triggered by the SOCKBUF_DEBUG kernel option that enables SBLASTRECORDCHK() and SBLASTMBUFCHK() checks. PR: kern/126742 Investigated by: pluknet < pluknet -at- gmail -dot- com > No response from: freebsd-current@, freebsd-bluetooth@ MFC after: 3 days
* Rewrite sbreserve_locked()'s comment on NULL thread pointers, eliminating [rwatson, 2008-10-07, 1 file, -4/+5]
| | | | | | an XXXRW about the comment being stale. MFC after: 3 days
* Catch a possible NULL pointer deref in case the offsets got mangled [bz, 2008-09-07, 1 file, -1/+3]
| | | | | | | | | | | | | | somehow. As a consequence we may now get an unexpected result(*). Catch that error case with a well defined panic giving appropriate pointers to ease debugging. (*) While the consensus was that the case should never happen unless there was a bug, no one was definitively sure. Discussed with: kmacy (about 8 months back) Reviewed by: silby (as part of a larger patch in March) MFC after: 2 months
* Update the kernel to count the number of mbufs and clusters [gnn, 2008-05-15, 1 file, -0/+2]
| | | | | | | | | | | | (all types) used per socket buffer. Add support to netstat to print out all of the socket buffer statistics. Update the netstat manual page to describe the new -x flag which gives the extended output. Reviewed by: rwatson, julian
* Further clean up sorflush: [rwatson, 2008-02-04, 1 file, -2/+1]
| | | | | | | | | | | | | | - Expose sbrelease_internal(), a variant of sbrelease() with no expectations about the validity of locks in the socket buffer. - Use sbrelease_internal() in sorflush(), and as a result avoid initializing and destroying a socket buffer lock for the temporary stack copy of the actual buffer, asb. - Add a comment indicating why we do what we do, and remove an XXX since things have gotten less ugly in sorflush() lately. This makes socket close cleaner, and possibly also marginally faster. MFC after: 3 weeks
* Correct two problems relating to sorflush(), which is called to flush [rwatson, 2008-01-31, 1 file, -2/+6]
| | | | | | | | | | | | | | | | | | | | | | | | | read socket buffers in shutdown() and close(): - Call socantrcvmore() before sblock() to dislodge any threads that might be sleeping (potentially indefinitely) while holding sblock(), such as a thread blocked in recv(). - Flag the sblock() call as non-interruptible so that a signal delivered to the thread calling sorflush() doesn't cause sblock() to fail. The sblock() is required to ensure that all other socket consumer threads have, in fact, left, and do not enter, the socket buffer until we're done flushing it. To implement the latter, change the 'flags' argument to sblock() to accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK flag. When SBL_NOINTR is set, it forces a non-interruptible sx acquisition, regardless of the setting of the disposition of SB_NOINTR on the socket buffer; without this change it would be possible for another thread to clear SB_NOINTR between when the socket buffer mutex is released and sblock() is invoked. Reviewed by: bz, kmacy Reported by: Jos Backus <jos at catnook dot com>
* Add SB_NOCOALESCE flag to disable socket buffer update in place [kmacy, 2007-12-17, 1 file, -0/+1]
* Refactor select to reduce contention and hide internal implementation [jeff, 2007-12-16, 1 file, -1/+2]
| | | | | | | | | | | | | | | | | | | | | details from consumers. - Track individual selecters on a per-descriptor basis such that there are no longer collisions and after sleeping for events only those descriptors which triggered events must be rescanned. - Protect the selinfo (per descriptor) structure with a mtx pool mutex. mtx pool mutexes were chosen to preserve API compatibility with existing code which does nothing but bzero() to set up selinfo structures. - Use a per-thread wait channel rather than a global wait channel. - Hide select implementation details in a seltd structure which is opaque to the rest of the kernel. - Provide a 'selsocket' interface for those kernel consumers who wish to select on a socket when they have no fd so they no longer have to be aware of select implementation details. Tested by: kris Reviewed on: arch
* Set the NFS server sockbuf high watermarks to the system defaults [mohans, 2007-10-12, 1 file, -2/+2]
| | | | | | | (up from 32KB). The low high-watermark setting caused UDP fullsock request drops, throttling throughput greatly. Reported by: Kris Kennaway Approved by: re@ (Ken Smith)
* Now that sx(9) locks support an interruptible lock acquire primitive, [rwatson, 2007-05-31, 1 file, -2/+5]
| | | | | | | | | | | properly observe the SB_NOINTR flag in sblock. This restores the required behavior that lock acquisition be interruptible on the socket buffer I/O serialization lock to allow threads waiting for I/O to be signaled even if they aren't the thread currently holding the I/O lock. With this change, the sblock regression test is again passed. Reported by: alfred sx(9) handiwork: attilio
* Generally migrate to ANSI function headers, and remove 'register' use. [rwatson, 2007-05-16, 1 file, -7/+3]
* sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags [rwatson, 2007-05-03, 1 file, -19/+16]
| | | | | | | | | | | | | | | | | | | | | | | on each socket buffer with the socket buffer's mutex. This sleep lock is used to serialize I/O on sockets in order to prevent I/O interlacing. This change replaces the custom sleep lock with an sx(9) lock, which results in marginally better performance, better handling of contention during simultaneous socket I/O across multiple threads, and a cleaner separation between the different layers of locking in socket buffers. Specifically, the socket buffer mutex is now solely responsible for serializing simultaneous operation on the socket buffer data structure, and not for I/O serialization. While here, fix two historic bugs: (1) a bug allowing I/O to be occasionally interlaced during long I/O operations (discovered by Isilon). (2) a bug in which failed non-blocking acquisition of the socket buffer I/O serialization lock might be ignored (discovered by sam). SCTP portion of this patch submitted by rrs.
* Following movement of functions from uipc_socket2.c to uipc_socket.c and [rwatson, 2007-03-26, 1 file, -6/+7]
| | | | uipc_sockbuf.c, clean up and update comments.
* Complete removal of uipc_socket2.c by moving the last few functions to [rwatson, 2007-03-26, 1 file, -0/+52]
| | | | | | | | | | | | | other C files: - Move sbcreatecontrol() and sbtoxsockbuf() to uipc_sockbuf.c. While sbcreatecontrol() is really an mbuf allocation routine, it does its work with awareness of the layout of socket buffer memory. - Move pru_*() protocol switch stubs to uipc_socket.c where the non-stub versions of several of these functions live. Likewise, move socket state transition calls (soisconnecting(), etc) to uipc_socket.c. Move sodupsockaddr() and sotoxsocket().
* Maintain a pointer and offset pair into the socket buffer mbuf chain to [andre, 2007-03-19, 1 file, -0/+41]
| | | | | | | | | avoid traversal of the entire socket buffer for larger offsets on stream sockets. Adjust tcp_output() to make use of it. Tested by: gallatin
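The idea of caching a pointer and offset pair can be sketched in user space as follows. This is only an illustration under assumptions: the `toy_` types and `toy_sbsndptr()` are hypothetical, the field names are borrowed loosely from the kernel, the `steps` counter exists purely to make the saved work observable, and the sketch ignores invalidating the cache when data is dropped from the front of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down mbuf: only the chain link and data length. */
struct toy_mbuf {
	struct toy_mbuf *m_next;
	size_t m_len;
};

struct toy_sockbuf {
	struct toy_mbuf *sb_mb;      /* head of the mbuf chain */
	struct toy_mbuf *sb_sndptr;  /* cached mbuf from the last lookup */
	size_t sb_sndptroff;         /* byte offset at which sb_sndptr starts */
	unsigned long steps;         /* mbufs walked so far (illustration only) */
};

/*
 * Return the mbuf containing byte 'off' of the buffer and store the offset
 * within that mbuf in *moff. Returns NULL if 'off' is past the end.
 */
static struct toy_mbuf *
toy_sbsndptr(struct toy_sockbuf *sb, size_t off, size_t *moff)
{
	struct toy_mbuf *m = sb->sb_mb;
	size_t base = 0;

	/* Resume from the cached position when the request is at or past it. */
	if (sb->sb_sndptr != NULL && off >= sb->sb_sndptroff) {
		m = sb->sb_sndptr;
		base = sb->sb_sndptroff;
	}
	while (m != NULL && off >= base + m->m_len) {
		base += m->m_len;
		m = m->m_next;
		sb->steps++;
	}
	if (m == NULL)
		return (NULL);
	/* Remember where we ended up for the next, usually larger, offset. */
	sb->sb_sndptr = m;
	sb->sb_sndptroff = base;
	*moff = off - base;
	return (m);
}
```

Because a stream sender usually asks for monotonically increasing offsets, each lookup after the first tends to walk only the mbufs added since the previous call rather than the whole chain.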
* Use sysctl_handle_long() instead of duplicating its logic for [jhb, 2006-09-06, 1 file, -8/+4]
| | | | | | | kern.ipc.maxsockbuf so that this sysctl works for 32-bit binaries running on amd64 via compat/freebsd32. MFC after: 3 days
* Remove 'register'. [rwatson, 2006-08-02, 1 file, -196/+135]
| | | | | Use ANSI C prototypes/function headers. More deterministically line wrap comments.
* Reimplement socket buffer tear-down in sofree(): as the socket is no [rwatson, 2006-08-01, 1 file, -13/+53]
| | | | | | | | | | | | | longer referenced by other threads (hence our freeing it), we don't need to set the can't send and can't receive flags, wake up the consumers, perform two levels of locking, etc. Implement a fast-path teardown, sbdestroy(), which flushes and releases each socket buffer. A manual dom_dispose of the receive buffer is still required explicitly to GC any in-flight file descriptors, etc, before flushing the buffer. This results in a 9% UP performance improvement and 16% SMP performance improvement on a tight loop of socket();close(); in micro-benchmarking, but will likely also affect CPU-bound macro-benchmark performance.
* Remove non-socket buffer routines from uipc_sockbuf.c, and socket buffer [rwatson, 2006-07-24, 1 file, -355/+7]
| | | | | | | specific routines from uipc_socket2.c following repo-copy. We might rethink the location of one or two at some point, but the division was relatively clean. uipc_sockbuf.c is now the home of routines that manipulate socket buffers.
* Several protocol switch functions (pru_abort, pru_detach, pru_sosetlabel) [rwatson, 2006-07-11, 1 file, -22/+0]
| | | | | | return void, so don't implement no-op versions of these functions. Instead, consistently check if those switch pointers are NULL before invoking them.
* Remove now unneeded opt_mac.h and mac.h includes. [rwatson, 2006-07-06, 1 file, -2/+0]
* Remove sbinsertoob(), sbinsertoob_locked(). They violate (and have [rwatson, 2006-06-17, 1 file, -64/+0]
| | | | | | | | | | | | | basically always violated) invariants of soreceive(), which assume that the first mbuf pointer in a receive socket buffer can't change while the SB_LOCK sleepable lock is held on the socket buffer, which is precisely what these functions do. No current protocols invoke these functions, and removing them will help discourage them from ever being used. I should have removed them years ago, but lost track of it. MFC after: 1 week Prodded almost by accident by: peter
* Move some functions and definitions from uipc_socket2.c to uipc_socket.c: [rwatson, 2006-06-10, 1 file, -138/+0]
| | | | | | | | | | | | | | | | | | | | | | | | - Move sonewconn(), which creates new sockets for incoming connections on listen sockets, so that all socket allocation code is together in uipc_socket.c. - Move 'maxsockets' and associated sysctls to uipc_socket.c with the socket allocation code. - Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it to sysctl.h and remove lots of scattered implementations in various IPC modules. - Sort sodealloc() after soalloc() in uipc_socket.c for dependency order reasons. Staticize soalloc() and sodealloc() as they are now required only in uipc_socket.c, and are internal to the socket implementation. After this change, socket allocation and deallocation is entirely centralized in one file, and uipc_socket2.c consists entirely of socket buffer manipulation and default protocol switch functions. MFC after: 1 month
* Allow for nmbclusters and maxsockets to be increased via sysctl. [ps, 2006-04-21, 1 file, -2/+24]
| | | | | An eventhandler is used to update all the various zones that depend on these values.
* Change protocol switch method pru_detach() so that it returns void [rwatson, 2006-04-01, 1 file, -2/+2]
| | | | | | | | | | | | | | | | | | | | | | | | | rather than an error. Detaches do not "fail", they either occur or the protocol flags SS_PROTOREF to take ownership of the socket. soclose() no longer looks at so_pcb to see if it's NULL, relying entirely on the protocol to decide whether it's time to free the socket or not using SS_PROTOREF. so_pcb is now entirely owned and managed by the protocol code. Likewise, no longer test so_pcb in other socket functions, such as soreceive(), which have no business digging into protocol internals. Protocol detach routines no longer try to free the socket on detach, this is performed in the socket code if the protocol permits it. In rts_detach(), no longer test for rp != NULL in detach, and likewise in other protocols that don't permit a NULL so_pcb, reduce the incidence of testing for it during detach. netinet and netinet6 are not fully updated to this change, which will be in an upcoming commit. In their current state they may leak memory or panic. MFC after: 3 months
* Change protocol switch pru_abort() API so that it returns void rather [rwatson, 2006-04-01, 1 file, -2/+2]
| | | | | | | | | | | | | | than an int, as an error here is not meaningful. Modify soabort() to unconditionally free the socket on the return of pru_abort(), and modify most protocols to no longer conditionally free the socket, since the caller will do this. This commit likely leaves parts of netinet and netinet6 in a situation where they may panic or leak memory, as they are not fully updated by this commit. This will be corrected shortly in followup commits to these components. MFC after: 3 months
* Add a sysctl, regression.sonewconn_earlytest, which when options [rwatson, 2006-03-26, 1 file, -0/+10]
| | | | | | | | | | REGRESSION is enabled, allows user space to dictate that sonewconn() should skip its "skip the hard work" check to see if the listen queue is full, and instead proceed with allocation of a socket and trimming of the overflowed queue. This makes it easier to test the queue overflow logic. MFC after: 1 month
* Change soabort() from returning int to returning void, since all [rwatson, 2006-03-16, 1 file, -1/+1]
| | | | | | consumers ignore the return value, soabort() is required to succeed, and protocols produce errors here to report multiple freeing of the pcb, which we hope to eliminate.
* Fix a bug in the loop in sonewconn that makes room on the incomplete [jdp, 2005-11-22, 1 file, -1/+1]
| | | | | | | | | connection queue for a new connection. It was removing connections from the wrong list. Submitted by: Paul Mikesell Sponsored by: Isilon Systems MFC after: 1 week
* Retire MT_HEADER mbuf type and change its users to use MT_DATA. [andre, 2005-11-02, 1 file, -4/+2]
| | | | | | | | | | | | Having an additional MT_HEADER mbuf type is superfluous and redundant as nothing depends on it. It only adds a layer of confusion. The distinction between header mbuf's and data mbuf's is solely done through the m->m_flags M_PKTHDR flag. Non-native code is not changed in this commit. For compatibility MT_HEADER is mapped to MT_DATA. Sponsored by: TCP/IP Optimization Fundraise 2005
* Push the assignment of a new or updated so_qlimit from solisten() [rwatson, 2005-10-30, 1 file, -1/+1]
| | | | | | | | | | | | | | following the protocol pru_listen() call to solisten_proto(), so that it occurs under the socket lock acquisition that also sets SO_ACCEPTCONN. This requires passing the new backlog parameter to the protocol, which also allows the protocol to be aware of changes in queue limit should it wish to do something about the new queue limit. This continues a move towards the socket layer acting as a library for the protocol. Bump __FreeBSD_version due to a change in the in-kernel protocol interface. This change has been tested with IPv4 and UNIX domain sockets, but not other protocols.
* Re-comment sbcompress() to explain what it is it does; it took me [rwatson, 2005-09-18, 1 file, -7/+20]
| | | | | | | | | | quite a bit of reading to figure it out, and I want to avoid figuring it out again. Convert an if (foo) else printf("this is almost a panic") into a KASSERT. MFC after: 3 days
* Fix the recent panics/LORs/hangs created by my kqueue commit by: [ssouhlal, 2005-07-01, 1 file, -2/+4]
| | | | | | | | | | | | | | | | | - Introducing the possibility of using locks different than mutexes for the knlist locking. In order to do this, we add three arguments to knlist_init() to specify the functions to use to lock, unlock and check if the lock is owned. If these arguments are NULL, we assume mtx_lock, mtx_unlock and mtx_owned, respectively. - Using the vnode lock for the knlist locking, when doing kqueue operations on a vnode. This way, we don't have to lock the vnode while holding a mutex, in filt_vfsread. Reviewed by: jmg Approved by: re (scottl), scottl (mentor override) Pointyhat to: ssouhlal Will be happy: everyone
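The "NULL means use the mutex functions" convention described in this entry can be modelled with function pointers. This is a user-space sketch under assumptions, not the kernel's knlist code: all `toy_` names and the structure layout are hypothetical, and the toy lock types only record whether they are held.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy mutex and toy vnode lock; they only track state. */
struct toy_mtx { int held; };
struct toy_vnode { int v_lockcount; };

typedef void (*toy_lock_fn)(void *);
typedef int (*toy_owned_fn)(void *);

/* Hypothetical list head carrying its own locking callbacks. */
struct toy_knlist {
	void *kl_lockarg;
	toy_lock_fn kl_lock;
	toy_lock_fn kl_unlock;
	toy_owned_fn kl_locked;
};

/* Default callbacks, standing in for mtx_lock/mtx_unlock/mtx_owned. */
static void toy_mtx_lock(void *arg) { ((struct toy_mtx *)arg)->held = 1; }
static void toy_mtx_unlock(void *arg) { ((struct toy_mtx *)arg)->held = 0; }
static int toy_mtx_owned(void *arg) { return (((struct toy_mtx *)arg)->held); }

/* Alternative callbacks, standing in for a vnode lock. */
static void toy_vn_lock(void *arg) { ((struct toy_vnode *)arg)->v_lockcount++; }
static void toy_vn_unlock(void *arg) { ((struct toy_vnode *)arg)->v_lockcount--; }
static int toy_vn_owned(void *arg) { return (((struct toy_vnode *)arg)->v_lockcount > 0); }

/* Any callback passed as NULL falls back to the mutex version. */
static void
toy_knlist_init(struct toy_knlist *kl, void *lockarg, toy_lock_fn lock,
    toy_lock_fn unlock, toy_owned_fn locked)
{
	kl->kl_lockarg = lockarg;
	kl->kl_lock = (lock != NULL) ? lock : toy_mtx_lock;
	kl->kl_unlock = (unlock != NULL) ? unlock : toy_mtx_unlock;
	kl->kl_locked = (locked != NULL) ? locked : toy_mtx_owned;
}
```

Callers that protect their list with a plain mutex pass NULL for all three callbacks, while a caller such as the vnode code in this sketch supplies its own trio and never touches a mutex.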
* In the current world order, each socket has two mutexes: a mutex [rwatson, 2005-05-27, 1 file, -13/+6]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | that protects socket and receive socket buffer state, and a second mutex to protect send socket buffer state. In some places, the mutex shared between the socket and receive socket buffer will be acquired twice, once by each layer, resulting in some inconsistency, but providing the abstraction benefit of being able to more easily separate the two mutexes in the future if desired. When transitioning a socket to the SS_ISDISCONNECTING or SS_ISDISCONNECTED states, grab the socket/receive socket buffer lock once rather than grabbing it as the socket lock, modifying socket state, then grabbing a second time as the receive lock in order to modify the socket buffer state to indicate no further data can be read. This change is believed to close a race between the change in socket state and the change in socket buffer state, which for a remotely initiated close on a UNIX domain socket, resulted in soreceive() returning ENOTCONN rather than an EOF condition. A similar race still exists in the case of send, however, and is harder to fix as the socket and send socket buffer mutexes are not the same, and we would like to avoid holding combinations of socket mutexes over sb_upcall until we've finished clarifying the locking protocol for upcalls. This change has the side effect of reducing the number of mutex operations to initiate disconnect or perform disconnect on a socket by two. PR: 78824 Reported by: Marc Olzheim <marcolz@stack.nl> MFC after: 2 weeks
* Extend the coverage of the accept and socket mutexes in soisconnected() [rwatson, 2005-03-12, 1 file, -3/+3]
| | | | | | so that the socket lock is held over the test-and-set removal of the accept filter option during connect, and the two socket mutex regions (transition to connected, perform accept filter) are combined.
* When upcalling from a socket in soisconnected() for an accept filter, [rwatson, 2005-03-07, 1 file, -1/+1]
| | | | | | | call with flag M_DONTWAIT rather than M_TRYWAIT, as we don't want to do blocking memory allocation (etc) in the netisr. MFC after: 3 days
* Prefer NULL to returning 0 cast to a pointer type. [rwatson, 2005-02-20, 1 file, -3/+3]
| | | | MFC after: 3 days
* In sonewconn(), set the new socket's state to show the protocol-provided [rwatson, 2005-02-17, 1 file, -1/+1]
| | | | | | | | | | | connection status before inserting the new socket into the listen socket's accept queue, or there might be a race in which another thread wakes up when the accept lock is released, and sees the socket before its state is set correctly. The wakeup still occurs after the accept lock is released. There have been no diagnoses of this bug in real-world systems (as yet). MFC after: 3 days
* /* -> /*- for copyright notices, minor format tweaks as necessary [imp, 2005-01-06, 1 file, -1/+1]
* In sonewconn(), the s/if/while/ change to wait for room at the tail of [rwatson, 2004-12-23, 1 file, -5/+5]
| | | | | the accept queue is a feature, not a bug/issue, so remove the XXXRW from the comment.
* Fix a typo in a comparison appeared in rev. 1.125. [maxim, 2004-10-27, 1 file, -1/+1]
| | | | Submitted by: JINMEI Tatuya
* Support for dynamically loadable and unloadable protocols within existing [andre, 2004-10-19, 1 file, -1/+78]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | protocol families. The protosw[] array of any particular protocol family ("domain") is of fixed size defined at compile time. This made it impossible to dynamically add or remove any protocols to or from it. We work around this by introducing so called SPACER's which are embedded into the protosw[] array at compile time. The SPACER's have a special protocol number (32767) to indicate the fact that they are SPACER's but are otherwise NULL. Only as many protocols can be dynamically loaded as SPACER's are provided in the protosw[] structure. The pr_usrreqs structure is treated specially and contains pointers to dummy functions only returning EOPNOTSUPP. This is needed because the use of those function pointers is usually not checked within the kernel because until now it was assumed to be a valid function pointer. Instead of fixing all potential callers we just return a proper error code. Two new functions provide a clean API to register and unregister a protocol. The register function expects a pointer to a valid and complete struct protosw including a pointer to struct pr_usrreqs provided by the caller. Upon successful registration the pr_init() function will be called to finish initialization of the protocol. The unregister function restores the SPACER in place of the protocol again. It is the responsibility of the caller to ensure proper closing of all sockets and freeing of memory allocation by the unloading protocol. sys/protosw.h o Define generic PROTO_SPACER to be 32767 o Prototypes for all pru_*_notsupp() functions o Prototypes for pf_proto_[un]register() functions kern/uipc_domain.c o Global struct pr_usrreqs nousrreqs containing valid pointers to the pru_*_notsupp() functions o New functions pf_proto_[un]register() kern/uipc_socket2.c o New function bodies for all pru_*_notsupp() functions
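The SPACER scheme can be sketched in user space as a fixed table with reserved slots. This is an illustration under assumptions rather than the kernel's pf_proto_register(): the `toy_` names, the four-entry table, and the choice of error codes (EEXIST, ENOMEM, EPROTONOSUPPORT) belong to this sketch; only the PROTO_SPACER value of 32767 comes from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PROTO_SPACER 32767  /* value stated in the commit message */
#define TOY_NPROTO   4      /* table size is fixed at compile time */

/* Hypothetical, stripped-down protocol switch entry. */
struct toy_protosw {
	int pr_protocol;
	void (*pr_init)(void);
};

/* Two built-in protocols followed by two SPACER slots. */
static struct toy_protosw toy_domain[TOY_NPROTO] = {
	{ 6, NULL }, { 17, NULL }, { PROTO_SPACER, NULL }, { PROTO_SPACER, NULL },
};

static int toy_init_calls;
static void toy_init(void) { toy_init_calls++; }

/* Copy *npr into the first free SPACER slot, then run its pr_init(). */
static int
toy_proto_register(const struct toy_protosw *npr)
{
	struct toy_protosw *fpr = NULL;

	for (size_t i = 0; i < TOY_NPROTO; i++) {
		if (toy_domain[i].pr_protocol == npr->pr_protocol)
			return (EEXIST);        /* already present */
		if (fpr == NULL && toy_domain[i].pr_protocol == PROTO_SPACER)
			fpr = &toy_domain[i];
	}
	if (fpr == NULL)
		return (ENOMEM);                /* no SPACER left */
	*fpr = *npr;
	if (fpr->pr_init != NULL)
		fpr->pr_init();
	return (0);
}

/* Put the SPACER back in place of a previously registered protocol. */
static int
toy_proto_unregister(int protocol)
{
	if (protocol == PROTO_SPACER)
		return (EPROTONOSUPPORT);
	for (size_t i = 0; i < TOY_NPROTO; i++) {
		if (toy_domain[i].pr_protocol == protocol) {
			toy_domain[i].pr_protocol = PROTO_SPACER;
			toy_domain[i].pr_init = NULL;
			return (0);
		}
	}
	return (EPROTONOSUPPORT);
}
```

The table never grows, so the number of SPACER entries compiled in is a hard cap on how many protocols can be loaded at once, which matches the limitation the commit message describes.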
* Add locking to the kqueue subsystem. This also makes the kqueue subsystem [jmg, 2004-08-15, 1 file, -1/+3]
| | | | | | | | | | | | | a more complete subsystem, and removes the knowledge of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using its filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to acquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)
* Reduce the number of unnecessary unlock-relocks on socket buffer mutexes [rwatson, 2004-06-26, 1 file, -8/+5]
| | | | | | | | | | | | | | | | | | | | associated with performing a wakeup on the socket buffer: - When performing an sbappend*() followed by a so[rw]wakeup(), explicitly acquire the socket buffer lock and use the _locked() variants of both calls. Note that the _locked() sowakeup() versions unlock the mutex on return. This is done in uipc_send(), divert_packet(), mroute socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append(). - When the socket buffer lock is dropped before a sowakeup(), remove the explicit unlock and use the _locked() sowakeup() variant. This is done in soisdisconnecting(), soisdisconnected() when setting the can't send/ receive flags and dropping data, and in uipc_rcvd() when adjusting back-pressure on the sockets. For UNIX domain sockets running mpsafe with a contention-intensive SMP mysql benchmark, this results in a 1.6% query rate improvement due to reduced mutex costs.