summaryrefslogtreecommitdiffstats
path: root/sys/kern/uipc_socket.c
Commit message (Collapse)AuthorAgeFilesLines
* Configure UMA warnings for the following zones:pjd2012-12-071-0/+1
| | | | | | | | | | | | | | | | | - unp_zone: kern.ipc.maxsockets limit reached - socket_zone: kern.ipc.maxsockets limit reached - zone_mbuf: kern.ipc.nmbufs limit reached - zone_clust: kern.ipc.nmbclusters limit reached - zone_jumbop: kern.ipc.nmbjumbop limit reached - zone_jumbo9: kern.ipc.nmbjumbo9 limit reached - zone_jumbo16: kern.ipc.nmbjumbo16 limit reached Note that those warnings are printed not often than every five minutes and can be globally turned off by setting sysctl/tunable vm.zone_warnings to 0. Discussed on: arch Obtained from: WHEEL Systems MFC after: 2 weeks
* - Make socket_zone static - it is used only in this file.pjd2012-12-071-3/+3
| | | | | | - Update maxsockets on uma_zone_set_max(). Obtained from: WHEEL Systems
* Style cleanups.pjd2012-12-071-51/+54
|
* - according to POSIX, make socket(2) return EAFNOSUPPORT rather thankevlo2012-12-071-1/+10
| | | | | | | | EPROTONOSUPPORT if the address family is not supported. - introduce pffinddomain() to find a domain by family and use it as appropriate. Reviewed by: glebius
* Mechanically substitute flags from historic mbuf allocator withglebius2012-12-051-13/+13
| | | | | | | | | malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually
* Fix r243627 by testing against the head socket instead of the socketandre2012-11-271-1/+1
| | | | | | | just created. MFC after: 1 week X-MFC-with: r243627
* Base the mbuf related limits on the available physical memory orandre2012-11-271-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kernel memory, whichever is lower. The overall mbuf related memory limit must be set so that mbufs (and clusters of various sizes) can't exhaust physical RAM or KVM. The limit is set to half of the physical RAM or KVM (whichever is lower) as the baseline. In any normal scenario we want to leave at least half of the physmem/kvm for other kernel functions and userspace to prevent it from swapping too easily. Via a tunable kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm. At the same time divorce maxfiles from maxusers and set maxfiles to physpages / 8 with a floor based on maxusers. This way busy servers can make use of the significantly increased mbuf limits with a much larger number of open sockets. Tidy up ordering in init_param2() and check up on some users of those values calculated here. Out of the overall mbuf memory limit 2K clusters and 4K (page size) clusters to get 1/4 each because these are the most heavily used mbuf sizes. 2K clusters are used for MTU 1500 ethernet inbound packets. 4K clusters are used whenever possible for sends on sockets and thus outbound packets. The larger cluster sizes of 9K and 16K are limited to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used these large clusters will end up only on the inbound path. They are not used on outbound, there it's still 4K. Yes, that will stay that way because otherwise we run into lots of complications in the stack. And it really isn't a problem, so don't make a scene. Normal mbufs (256B) weren't limited at all previously. This was problematic as there are certain places in the kernel that on allocation failure of clusters try to piece together their packet from smaller mbufs. The mbuf limit is the number of all other mbuf sizes together plus some more to allow for standalone mbufs (ACK for example) and to send off a copy of a cluster. Unfortunately there isn't a way to set an overall limit for all mbuf memory together as UMA doesn't support such a limiting. NB: Every cluster also has an mbuf associated with it. Two examples on the revised mbuf sizing limits: 1GB KVM: 512MB limit for mbufs 419,430 mbufs 65,536 2K mbuf clusters 32,768 4K mbuf clusters 9,709 9K mbuf clusters 5,461 16K mbuf clusters 16GB RAM: 8GB limit for mbufs 33,554,432 mbufs 1,048,576 2K mbuf clusters 524,288 4K mbuf clusters 155,344 9K mbuf clusters 87,381 16K mbuf clusters These defaults should be sufficient for even the most demanding network loads. MFC after: 1 month
* Fix a race on listen socket teardown where while draining theandre2012-11-271-4/+21
| | | | | | | | | | | | accept queues a new socket/connection may be added to the queue due to a race on the ACCEPT_LOCK. The submitted patch is slightly changed in comments, teardown and locking order and extended with KASSERT's. Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com> Found by: His team. MFC after: 1 week
* In soreceive_stream() don't drop an already dequeued mbuf chain byandre2012-10-291-5/+11
| | | | | | | | | | | | | | | | | | | | overwriting the return mbuf pointer with newly received data after a loop. Instead append the new mbuf chain to the existing one. Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream() doesn't get confused. For the remainder copy case in the mbuf delivery part deduct the copied length len instead of the whole mbuf length. Additionally don't depend on 'n' being being available which isn't true in the case of MSG_PEEK. Fix the MSG_WAITALL case by comparing against sb_hiwat. Before it was looping for every receive as sb_lowat normally is zero. Add comment about issue with (MSG_WAITALL | MSG_PEEK) which isn't properly handled. Submitted by: trociny (except for the change in last paragraph)
* Add logging for socket attach failures in sonewconn() during accept(2).andre2012-10-291-5/+21
| | | | | | | Include the pointer to the PCB so it can be attributed to a particular application by corresponding it to "netstat -A" output. MFC after: 2 weeks
* Replace the ill-named ZERO_COPY_SOCKET kernel option with twoandre2012-10-231-16/+19
| | | | | | | | | | | | | | | | | | | | | more appropriate named kernel options for the very distinct send and receive path. "options SOCKET_SEND_COW" enables VM page copy-on-write based sending of data on an outbound socket. NB: The COW based send mechanism is not safe and may result in kernel crashes. "options SOCKET_RECV_PFLIP" enables VM kernel/userspace page flipping for special disposable pages attached as external storage to mbufs. Only the naming of the kernel options is changed and their corresponding #ifdef sections are adjusted. No functionality is added or removed. Discussed with: alc (mechanism and limitations of send side COW)
* Grammar fixes to r241781.andre2012-10-201-1/+1
| | | | Submitted by: alc
* Hide the unfortunate named sysctl kern.ipc.somaxconn from sysctl -aandre2012-10-201-1/+7
| | | | | | | | | | | output and replace it with a new visible sysctl kern.ipc.acceptqueue of the same functionality. It specifies the maximum length of the accept queue on a listen socket. The old kern.ipc.somaxconn remains available for reading and writing for compatibility reasons so that existing programs, scripts and configurations continue to work. There no plans to ever remove the orginal and now hidden kern.ipc.somaxconn.
* Tidy up somaxconn (accept queue limit) and related functionsandre2012-10-201-22/+26
| | | | and move it together into one place.
* Move socket UMA zone initialization functionality together intoandre2012-10-191-17/+16
| | | | one place.
* Move UMA socket zone initialization from uipc_domain.c to uipc_socket.candre2012-10-191-0/+23
| | | | into one place next to its other related functions to avoid confusion.
* Remove unnecessary includes from sosend_copyin() and fixandre2012-10-181-10/+4
| | | | a couple of style issues.
* Remove double-wrapping of #ifdef ZERO_COPY_SOCKETS withinandre2012-10-181-17/+1
| | | | zero copy specialized sosend_copyin() helper function.
* Fix spelling of the function name in two assertion messages.wollman2012-10-021-2/+2
|
* In soreceive_generic() remove the optimization for the case whentrociny2012-09-021-8/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MSG_WAITALL is set, and it is possible to do the entire receive operation at once if we block (resid <= hiwat). Actually it might make the recv(2) with MSG_WAITALL flag get stuck when there is enough space in the receiver buffer to satisfy the request but not enough to open the window closed previously due to the buffer being full. The issue can be reproduced using the following scenario: On the sender side do 2 send(2) requests: 1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10); 2) data of size equal to SOBUF_SIZE. On the receiver side do 2 recv(2) requests with MSG_WAITALL flag set: 1) recv() data of SOBUF_SIZE / 10 size; 2) recv() data of SOBUF_SIZE size; We totally fill the receiver buffer with one SOBUF_SIZE/10 size request and partial SOBUF_SIZE request. When the first request is processed we get SOBUF_SIZE/10 free space. It is just enough to receive the rest of bytes for the second request, and soreceive_generic() blocks in the part that is a subject of this change waiting for the rest. But the window was closed when the buffer was filled and to avoid silly window syndrome it opens only when available space is larger than sb_hiwat/4 or maxseg. So it is stuck and pending data is only sent via TCP window probes. Discussed with: kib (long ago) MFC after: 2 weeks
* In soreceive_generic() when checking if the type of mbuf has changedtrociny2012-09-021-2/+2
| | | | | | | | | | | | check it for MT_CONTROL type too, otherwise the assertion "m->m_type == MT_DATA" below may be triggered by the following scenario: - the sender sends some data (MT_DATA) and then a file descriptor (MT_CONTROL); - the receiver calls recv(2) with a MSG_WAITALL asking for data larger than the receive buffer (uio_resid > hiwat). MFC after: 2 week
* Fix KASSERT message.trociny2012-07-031-1/+1
| | | | MFC after: 3 days
* - Remove redundant call to pr_ctloutput from code that handles SO_SETFIB.np2012-04-031-6/+5
| | | | | | | - Add a check for errors during copyin while here. Reviewed by: julian, bz MFC after: 2 weeks
* Add SO_PROTOCOL/SO_PROTOTYPE socket SOL_SOCKET-level option to get thekib2012-02-261-0/+4
| | | | | | | | | | | | socket protocol number. This is useful since the socket type can be implemented by different protocols in the same protocol family, e.g. SOCK_STREAM may be provided by both TCP and SCTP. Submitted by: Jukka A. Ukkonen <jau iki fi> PR: kern/162352 Discussed with: bz Reviewed by: glebius MFC after: 2 weeks
* Remove apparently redundand checks for socket so_proto being non-NULLkib2012-02-261-9/+5
| | | | | | | | | from sosetopt() and sogetopt(). No exposed sockets may have so_proto invalid. Discussed with: bz, rwatson Reviewed by: glebius MFC after: 2 weeks
* Fix found places where uio_resid is truncated to int.kib2012-02-211-6/+11
| | | | | | | | | Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. Discussed with: bde, das (previous versions) MFC after: 1 month
* Merge multi-FIB IPv6 support from projects/multi-fibv6/head/:bz2012-02-171-0/+2
| | | | | | | | | | | | Extend the so far IPv4-only support for multiple routing tables (FIBs) introduced in r178888 to IPv6 providing feature parity. This includes an extended rtalloc(9) KPI for IPv6, the necessary adjustments to the network stack, and user land support as in netstat. Sponsored by: Cisco Systems, Inc. Reviewed by: melifaro (basically) MFC after: 10 days
* Fix input validation in SO_SETFIB.hrs2012-02-041-1/+1
| | | | | Reviewed by: bz MFC after: 1 day
* Remove a few bits of FreeBSD 2.x compatibility code.rmh2011-11-141-3/+0
| | | | Approved by: kib (mentor)
* Fix a deficiency in the selinfo interface:attilio2011-08-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a selinfo object is recorded (via selrecord()) and then it is quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#. That happens because the selinfo interface has no way to drain the waiters before to destroy the registered selinfo object. Also this race is quite rare to get in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object. Fix this by adding the seldrain() routine which should be called before to destroy the selinfo objects (in order to avoid such case), and fix the present cases where it might have already been called. Sometimes, the context is safe enough to prevent this type of race, like it happens in device drivers which installs selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the filedescriptors should be already closed, thus there cannot be a race. For this case, mfi(4) device driver can be set as an example, as it implements a full correct logic for preventing this from happening. Sponsored by: Sandvine Incorporated Reported by: rstone Tested by: pluknet Reviewed by: jhb, kib Approved by: re (bz) MFC after: 3 weeks
* In the experimental soreceive_stream():andre2011-07-081-13/+9
| | | | | | | | | | | o Move the non-blocking socket test below the SBS_CANTRCVMORE so that EOF is correctly returned on a remote connection close. o In the non-blocking socket test compare SS_NBIO against the so->so_state field instead of the incorrect sb->sb_state field. o Simplify the ENOTCONN test by removing cases that can't occur. Submitted by: trociny (with some further tweaks by committer) Tested by: trociny
* Remove the TCP_SORECEIVE_STREAM compile time option. The use ofandre2011-07-071-2/+0
| | | | | | | soreceive_stream() for TCP still has to be enabled with the loader tuneable net.inet.tcp.soreceive_stream. Suggested by: trociny and others
* In soreceive_generic(), if MSG_WAITALL is set but the request istrociny2011-05-291-4/+10
| | | | | | | | | | | | | | | | | | larger than the receive buffer, we have to receive in sections. When notifying the protocol that some data has been drained the lock is released for a moment. Returning we block waiting for the rest of data. There is a race, when data could arrive while the lock was released and then the connection stalls in sbwait. Fix this by checking for data before blocking and skip blocking if there are some. PR: kern/154504 Reported by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua> Tested by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua> Reviewed by: rwatson Approved by: kib (co-mentor) MFC after: 2 weeks
* Mfp4 CH=177274,177280,177284-177285,177297,177324-177325bz2011-02-161-20/+78
| | | | | | | | | | | | | | | | | | | | | | VNET socket push back: try to minimize the number of places where we have to switch vnets and narrow down the time we stay switched. Add assertions to the socket code to catch possibly unset vnets as seen in r204147. While this reduces the number of vnet recursion in some places like NFS, POSIX local sockets and some netgraph, .. recursions are impossible to fix. The current expectations are documented at the beginning of uipc_socket.c along with the other information there. Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH Reviewed by: jhb Tested by: zec Tested by: Mikolaj Golub (to.my.trociny gmail.com) MFC after: 2 weeks
* Allow the SO_SETFIB socket option to select the default (0)deischen2011-02-131-4/+5
| | | | | | routing table. Reviewed by: julian
* Mfp4 CH=177255:bz2011-02-111-1/+2
| | | | | | | | | | | | | | | | | Make VNET_ASSERT() available with either VNET_DEBUG or INVARIANTS. Change the syntax to match KASSERT() to allow more flexible panic messages rather than having a printf with hardcoded arguments before panic. Adjust the few assertions we have to the new format (and enhance the output). Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH Reviewed by: jhb MFC after: 2 weeks
* This commit implements the SO_USER_COOKIE socket option, which letsluigi2010-11-121-0/+10
| | | | | | | | | | | | | | | | | | | | | | | you tag a socket with an uint32_t value. The cookie can then be used by the kernel for various purposes, e.g. setting the skipto rule or pipe number in ipfw (this is the reason SO_USER_COOKIE has been implemented; however there is nothing ipfw-specific in its implementation). The ipfw-related code that uses the optopn will be committed separately. This change adds a field to 'struct socket', but the struct is not part of any driver or userland-visible ABI so the change should be harmless. See the discussion at http://lists.freebsd.org/pipermail/freebsd-ipfw/2009-October/004001.html Idea and code from Paul Joe, small modifications and manpage changes by myself. Submitted by: Paul Joe MFC after: 1 week
* With reworking of the socket life cycle in 7.x, the need for a "sotryfree()"rwatson2010-09-181-5/+5
| | | | | | | | | was eliminated: all references to sockets are explicitly managed by sorele() and the protocols. As such, garbage collect sotryfree(), and update sofree() comments to make the new world order more clear. MFC after: 3 days Reported by: Anuranjan Shukla <anshukla at juniper dot net>
* Fix a bug where MSG_TRUNC was not returned in all necessary cases fortuexen2010-08-071-1/+6
| | | | | | | | | SOCK_DGRAM socket. MSG_TRUNC was only returned when some mbufs could not be copied to the application. If some data was left in the last mbuf, it was correctly discarded, but MSG_TRUNC was not set. Reviewed by: bz MFC after: 3 weeks
* When close() is called on a connected socket pair, SO_ISCONNECTED might berwatson2010-05-271-1/+4
| | | | | | | | | | | | set but be cleared before the call to sodisconnect(). In this case, ENOTCONN is returned: suppress this error rather than returning it to userspace so that close() doesn't report an error improperly. PR: kern/144061 Reported by: Matt Reimer <mreimer at vpop.net>, Nikolay Denev <ndenev at gmail.com>, Mikolaj Golub <to.my.trociny at gmail.com> MFC after: 3 days
* Provide groundwork for 32-bit binary compatibility on non-x86 platforms,nwhitehorn2010-03-111-3/+3
| | | | | | | | | for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32 option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts of the kernel and enhances the freebsd32 compatibility code to support big-endian platforms. Reviewed by: kib, jhb
* Set curvnet earlier so that it also covers calls to sodisconnect(), whichbz2010-02-201-2/+3
| | | | | | | before were possibly panicing the system in ULP code in the VIMAGE case. Submitted by: Igor (igor ispsystem.com) MFC after: 5 days
* Don't comment on stream socket handling in sosend_dgram, since that'srwatson2009-10-021-3/+0
| | | | | | not handled. MFC after: 3 weeks
* -Put the optimized soreceive_stream() under a compile time option calledandre2009-09-151-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP_SORECEIVE_STREAM for the time being. Requested by: brooks Once compiled in make it easily switchable for testers by using a tuneable net.inet.tcp.soreceive_stream and a corresponding read-only sysctl to report the current state. Suggested by: rwatson MFC after: 2 days -This line, and those below, will be ignored-- > Description of fields to fill in above: 76 columns --| > PR: If a GNATS PR is affected by the change. > Submitted by: If someone else sent in the change. > Reviewed by: If someone else reviewed your modification. > Approved by: If you needed approval for this commit. > Obtained from: If the change is from a third party. > MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email. > Security: Vulnerability reference (one per line) or description. > Empty fields above will be automatically removed. M sys/conf/options M sys/kern/uipc_socket.c M sys/netinet/tcp_subr.c M sys/netinet/tcp_usrreq.c
* Use C99 initialization for struct filterops.rwatson2009-09-121-6/+15
| | | | | | Obtained from: Mac OS X Sponsored by: Apple Inc. MFC after: 3 weeks
* Fix poll() on half-closed sockets, while retaining POLLHUP for fifos.jilles2009-08-251-5/+7
| | | | | | | | | | | | | This reverts part of r196460, so that sockets only return POLLHUP if both directions are closed/error. Fifos get POLLHUP by closing the unused direction immediately after creating the sockets. The tools/regression/poll/*poll.c tests now pass except for two other things: - if POLLHUP is returned, POLLIN is always returned as well instead of only when there is data left in the buffer to be read - fifo old/new reader distinction does not work the way POSIX specs it Reviewed by: kib, bde
* Fix the conformance of poll(2) for sockets after r195423 bykib2009-08-231-7/+5
| | | | | | | | | | | | | returning POLLHUP instead of POLLIN for several cases. Now, the tools/regression/poll results for FreeBSD are closer to that of the Solaris and Linux. Also, improve the POSIX conformance by explicitely clearing POLLOUT when POLLHUP is reported in pollscan(), making the fix global. Submitted by: bde Reviewed by: rwatson MFC after: 1 week
* Merge the remainder of kern_vimage.c and vimage.h into vnet.c andrwatson2009-08-011-1/+2
| | | | | | | | | | vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
* Somewhere along the line accept sockets stopped honoring thejulian2009-07-281-0/+1
| | | | | | | | FIB selected for them. Fix this. Reviewed by: ambrisko Approved by: re (kib) MFC after: 3 days
* Normalize field naming for struct vnet, fix two debugging printfs thatrwatson2009-07-191-2/+2
| | | | | | | print them. Reviewed by: bz Approved by: re (kensmith, kib)
OpenPOWER on IntegriCloud