path: root/sys/kern/uipc_socket.c
Commit message | Author | Age | Files | Lines
* MFC r285522: | markj | 2016-09-02 | 1 | -10/+19
    Fix cleanup race between unp_dispose and unp_gc.
    This change modifies the original commit to avoid changing the domain_dispose KPI.
    Tested by: Oliver Pinter
* MFC r298819: | bdrewery | 2016-06-27 | 1 | -3/+3
    sys/kern: spelling fixes in comments.
* MFC r279206: | ae | 2015-03-02 | 1 | -5/+7
    In some cases soreceive_dgram() can return no data, but has a control message. This can happen when an application is sending packets too big for the path MTU and recvmsg() will return zero (indicating no data) but there will be a cmsghdr with cmsg_type set to IPV6_PATHMTU.
    Remove KASSERT() which does a NULL pointer dereference in such a case. Also call m_freem() only when m isn't NULL.
    MFC r279209:
    soreceive_generic() still has a similar KASSERT(), therefore instead of removing the KASSERT(), change it to check that the mbuf isn't NULL.
    PR: 197882
    Sponsored by: Yandex LLC
* MFC 275808: | jhb | 2015-02-06 | 1 | -1/+1
    Check for SS_NBIO in so->so_state instead of sb->sb_flags in soreceive_stream().
* MFC r269502: | davide | 2014-08-20 | 1 | -6/+6
    Fix an overflow in getsockopt(). optval isn't big enough to hold sbintime_t.
    Re-introduce r255030 behaviour capping socket timeouts to INT_32 if they're too large.
* MFC r257472 | hiren | 2014-02-27 | 1 | -3/+15
    Rate limit (to once per minute) "Listen queue overflow" message in sonewconn().
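The rate limiting described in this entry can be illustrated with a small userland sketch. It is not the code from sonewconn(); `struct log_limiter` and `log_allowed()` are hypothetical names, and the caller is assumed to pass in the current time in seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical limiter state: remembers when a message was last emitted. */
struct log_limiter {
	time_t last;	/* time of the last emitted message */
	bool primed;	/* false until the first message has been emitted */
};

/*
 * Return true if a message may be logged at time 'now', allowing at most
 * one message per 'interval' seconds.  The first call always succeeds.
 */
static bool
log_allowed(struct log_limiter *ll, time_t now, time_t interval)
{
	assert(ll != NULL && interval > 0);
	if (ll->primed && now - ll->last < interval)
		return (false);
	ll->last = now;
	ll->primed = true;
	return (true);
}
```

With an interval of 60, a burst of overflow events produces one log line and the rest are suppressed until a minute has passed.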
* Remove zero-copy sockets code. It only worked for anonymous memory, | kib | 2013-09-16 | 1 | -167/+0
    and the equivalent functionality is now provided by sendfile(2) over posix shared memory filedescriptor.
    Remove the cow member of struct vm_page, and rearrange the remaining members. While there, make hold_count unsigned.
    Requested and reviewed by: alc
    Tested by: pho
    Sponsored by: The FreeBSD Foundation
    Approved by: re (delphij)
* Fix socket buffer timeouts precision using the new sbintime_t KPI instead | davide | 2013-09-01 | 1 | -4/+3
    of relying on the tvtohz() workaround. The latter has been introduced lately by jhb@ (r254699) in order to have a fix that can be backported to STABLE.
    Reported by: Vitja Makarov <vitja.makarov at gmail dot com>
    Reviewed by: jhb (earlier version)
* Don't return an error for socket timeouts that are too large. Just | jhb | 2013-08-29 | 1 | -7/+2
    cap them to INT_MAX ticks instead.
    PR: kern/181416 (r254699 really)
    Requested by: bde
    MFC after: 2 weeks
* Use tvtohz() to convert a socket buffer timeout to a tick value rather | jhb | 2013-08-23 | 1 | -7/+2
    than using a home-rolled version. The home-rolled version could result in shorter-than-requested sleeps.
    Reported by: Vitja Makarov <vitja.makarov@gmail.com>
    MFC after: 2 weeks
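As a rough userland illustration of the conversion problem in this entry and its two follow-ups, the sketch below rounds a timeout up to whole ticks, so the sleep is never shorter than requested, and caps oversized values at INT_MAX instead of returning an error. It is not the kernel's tvtohz(); `SKETCH_HZ` and `timeout_to_ticks()` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical tick rate for the sketch; the kernel uses its hz variable. */
#define SKETCH_HZ 1000

/*
 * Convert seconds plus microseconds to ticks.  Partial ticks are rounded
 * up, and results that do not fit in an int are capped at INT_MAX.
 */
static int
timeout_to_ticks(int64_t sec, int64_t usec)
{
	int64_t ticks;

	assert(sec >= 0 && usec >= 0 && usec < 1000000);
	/* Large second counts would overflow an int tick value. */
	if (sec > INT_MAX / SKETCH_HZ)
		return (INT_MAX);
	ticks = sec * SKETCH_HZ + (usec * SKETCH_HZ + 999999) / 1000000;
	return (ticks > INT_MAX ? INT_MAX : (int)ticks);
}
```

A truncating conversion would turn a 1 microsecond timeout into zero ticks; rounding up avoids that kind of shorter-than-requested sleep.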
* When the accept queue is full print the number of already pending | andre | 2013-05-08 | 1 | -1/+1
    new connections instead of by how many we're over the limit, which is always 1.
    Noticed by: jmallet
    MFC after: 1 week
* Back out r249318, r249320 and r249327 due to a heisenbug most | andre | 2013-05-06 | 1 | -2/+2
    likely related to a race condition in the ipi_hash_lock with the exact cause currently unknown but under investigation.
* socket: Make shutdown() wake up a blocked accept(). | jilles | 2013-04-30 | 1 | -0/+2
    A blocking accept (and some other operations) waits on &so->so_timeo. Once it wakes up, it will detect the SBS_CANTRCVMORE bit.
    The error from accept() is [ECONNABORTED] which is not the nicest one -- the thread calling accept() needs to know out-of-band what is happening.
    A spurious wakeup on so->so_timeo appears harmless (sleep retried) except when lingering on close (SO_LINGER, and in that case there is no descriptor to call shutdown() on) so this should be fairly safe.
    A shutdown() already woke up a blocked accept() for TCP sockets, but not for Unix domain sockets. This fix is generic for all domains.
    This patch was sent to -hackers@ and -net@ on April 5.
    MFC after: 2 weeks
* Fix the build. | jimharris | 2013-04-10 | 1 | -1/+1
* Change certain heavily used network related mutexes and rwlocks to | andre | 2013-04-09 | 1 | -2/+2
    reside on their own cache line to prevent false sharing with other nearby structures, especially for those in the .bss segment.
    NB: Those mutexes and rwlocks with variables next to them that get changed on every invocation do not benefit from their own cache line. Actually it may be net negative because two cache misses would be incurred in those cases.
* When soreceive_generic() hands off an mbuf from buffer, | glebius | 2013-03-29 | 1 | -0/+1
    clear its pointer to next record, since next record belongs to the buffer, and shouldn't be leaked.
    The ng_ksocket(4) used to clear this pointer itself, but the correct place is here.
    Sponsored by: Nginx, Inc
* Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC. | jilles | 2013-03-19 | 1 | -2/+2
    This change allows creating file descriptors with close-on-exec set in some situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file descriptors (SCM_RIGHTS) atomically close-on-exec.
    The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD. MSG_CMSG_CLOEXEC is the first free bit for MSG_*.
    The SOCK_* flags are not passed to MAC because this may cause incorrect failures and can be done later via fcntl() anyway. On the other hand, audit is expected to cope with the new flags.
    For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags argument.
    Reviewed by: kib
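A minimal userland sketch of the socket() part of this entry follows: the two flags are OR'ed into the type argument and then read back through fcntl(). The helper names `make_cloexec_nonblock_socket()` and `has_cloexec_and_nonblock()` are hypothetical, and AF_UNIX is chosen only because it needs no network configuration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if fd has both close-on-exec and non-blocking set, else 0. */
static int
has_cloexec_and_nonblock(int fd)
{
	int fdflags, flflags;

	assert(fd >= 0);
	fdflags = fcntl(fd, F_GETFD);	/* descriptor flags: FD_CLOEXEC */
	flflags = fcntl(fd, F_GETFL);	/* file status flags: O_NONBLOCK */
	if (fdflags < 0 || flflags < 0)
		return (0);
	return ((fdflags & FD_CLOEXEC) != 0 && (flflags & O_NONBLOCK) != 0);
}

/* Create a socket with both flags applied at creation time. */
static int
make_cloexec_nonblock_socket(void)
{
	return (socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC | SOCK_NONBLOCK, 0));
}
```

Setting the flags in the socket() call avoids the window between socket() and a later fcntl() in which another thread could fork and exec with the descriptor still inheritable.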
* Return an error if sctp_peeloff() fails because a socket can't be allocated. | tuexen | 2013-03-11 | 1 | -1/+6
    MFC after: 3 days
* - Implement two new system calls: | pjd | 2013-03-02 | 1 | -3/+43
        int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
        int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);
      which allow binding and connecting, respectively, to a UNIX domain socket with a path relative to the directory associated with the given file descriptor 'fd'.
    - Add manual pages for the new syscalls.
    - Make the new syscalls available for processes in capability mode sandbox.
    - Add capability rights CAP_BINDAT and CAP_CONNECTAT that have to be present on the directory descriptor for the syscalls to work.
    - Update audit(4) to support those two new syscalls and to handle path in sockaddr_un structure relative to the given directory descriptor.
    - Update procstat(1) to recognize the new capability rights.
    - Document the new capability rights in cap_rights_limit(2).
    Sponsored by: The FreeBSD Foundation
    Discussed with: rwatson, jilles, kib, des
* Configure UMA warnings for the following zones: | pjd | 2012-12-07 | 1 | -0/+1
    - unp_zone: kern.ipc.maxsockets limit reached
    - socket_zone: kern.ipc.maxsockets limit reached
    - zone_mbuf: kern.ipc.nmbufs limit reached
    - zone_clust: kern.ipc.nmbclusters limit reached
    - zone_jumbop: kern.ipc.nmbjumbop limit reached
    - zone_jumbo9: kern.ipc.nmbjumbo9 limit reached
    - zone_jumbo16: kern.ipc.nmbjumbo16 limit reached
    Note that those warnings are printed no more often than every five minutes and can be globally turned off by setting sysctl/tunable vm.zone_warnings to 0.
    Discussed on: arch
    Obtained from: WHEEL Systems
    MFC after: 2 weeks
* - Make socket_zone static - it is used only in this file. | pjd | 2012-12-07 | 1 | -3/+3
    - Update maxsockets on uma_zone_set_max().
    Obtained from: WHEEL Systems
* Style cleanups. | pjd | 2012-12-07 | 1 | -51/+54
* - according to POSIX, make socket(2) return EAFNOSUPPORT rather than | kevlo | 2012-12-07 | 1 | -1/+10
      EPROTONOSUPPORT if the address family is not supported.
    - introduce pffinddomain() to find a domain by family and use it as appropriate.
    Reviewed by: glebius
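The user-visible effect of this entry can be observed with a short userland sketch. `BOGUS_FAMILY` is a hypothetical value assumed not to correspond to any address family on the host, and `socket_errno_for_family()` is a helper name invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical family number, assumed to be unsupported by the host. */
#define BOGUS_FAMILY 9999

/*
 * Try to create a stream socket in the given family.  Returns 0 if the
 * socket could be created, otherwise the errno reported by socket(2).
 */
static int
socket_errno_for_family(int family)
{
	int s;

	assert(family >= 0);
	s = socket(family, SOCK_STREAM, 0);
	if (s >= 0) {
		close(s);
		return (0);
	}
	return (errno);
}
```

On a system that follows POSIX here, `socket_errno_for_family(BOGUS_FAMILY)` is expected to yield EAFNOSUPPORT rather than EPROTONOSUPPORT.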
* Mechanically substitute flags from historic mbuf allocator with | glebius | 2012-12-05 | 1 | -13/+13
    malloc(9) flags within sys.
    Exceptions:
    - sys/contrib not touched
    - sys/mbuf.h edited manually
* Fix r243627 by testing against the head socket instead of the socket | andre | 2012-11-27 | 1 | -1/+1
    just created.
    MFC after: 1 week
    X-MFC-with: r243627
* Base the mbuf related limits on the available physical memory or | andre | 2012-11-27 | 1 | -6/+3
    kernel memory, whichever is lower.
    The overall mbuf related memory limit must be set so that mbufs (and clusters of various sizes) can't exhaust physical RAM or KVM.
    The limit is set to half of the physical RAM or KVM (whichever is lower) as the baseline. In any normal scenario we want to leave at least half of the physmem/kvm for other kernel functions and userspace to prevent it from swapping too easily. Via a tunable kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm.
    At the same time divorce maxfiles from maxusers and set maxfiles to physpages / 8 with a floor based on maxusers. This way busy servers can make use of the significantly increased mbuf limits with a much larger number of open sockets.
    Tidy up ordering in init_param2() and check up on some users of those values calculated here.
    Out of the overall mbuf memory limit 2K clusters and 4K (page size) clusters get 1/4 each because these are the most heavily used mbuf sizes. 2K clusters are used for MTU 1500 ethernet inbound packets. 4K clusters are used whenever possible for sends on sockets and thus outbound packets.
    The larger cluster sizes of 9K and 16K are limited to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used these large clusters will end up only on the inbound path. They are not used on outbound, there it's still 4K. Yes, that will stay that way because otherwise we run into lots of complications in the stack. And it really isn't a problem, so don't make a scene.
    Normal mbufs (256B) weren't limited at all previously. This was problematic as there are certain places in the kernel that on allocation failure of clusters try to piece together their packet from smaller mbufs.
    The mbuf limit is the number of all other mbuf sizes together plus some more to allow for standalone mbufs (ACK for example) and to send off a copy of a cluster. Unfortunately there isn't a way to set an overall limit for all mbuf memory together as UMA doesn't support such a limit.
    NB: Every cluster also has an mbuf associated with it.
    Two examples on the revised mbuf sizing limits:
    1GB KVM: 512MB limit for mbufs
        419,430 mbufs
        65,536 2K mbuf clusters
        32,768 4K mbuf clusters
        9,709 9K mbuf clusters
        5,461 16K mbuf clusters
    16GB RAM: 8GB limit for mbufs
        33,554,432 mbufs
        1,048,576 2K mbuf clusters
        524,288 4K mbuf clusters
        155,344 9K mbuf clusters
        87,381 16K mbuf clusters
    These defaults should be sufficient for even the most demanding network loads.
    MFC after: 1 month
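The cluster counts quoted in the two examples of this entry can be reproduced with simple integer arithmetic. The sketch below assumes that the 2K and 4K clusters each get 1/4 of the overall limit and that the 9K and 16K clusters each get 1/6, with 9K meaning 9216 bytes and 16K meaning 16384 bytes. `struct cluster_limits` and `cluster_limits_for()` are hypothetical names, not kernel code, and the plain mbuf count is left out because the message does not state its formula.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the per-size cluster limits. */
struct cluster_limits {
	uint64_t clust_2k;
	uint64_t clust_4k;
	uint64_t clust_9k;
	uint64_t clust_16k;
};

/* Derive cluster counts from the overall mbuf memory limit in bytes. */
static struct cluster_limits
cluster_limits_for(uint64_t maxmbufmem)
{
	struct cluster_limits l;

	assert(maxmbufmem > 0);
	l.clust_2k = maxmbufmem / 4 / 2048;		/* 1/4 of the limit */
	l.clust_4k = maxmbufmem / 4 / 4096;		/* 1/4 of the limit */
	l.clust_9k = maxmbufmem / 6 / (9 * 1024);	/* 1/6 of the limit */
	l.clust_16k = maxmbufmem / 6 / (16 * 1024);	/* 1/6 of the limit */
	return (l);
}
```

For a 512MB limit this gives 65,536 / 32,768 / 9,709 / 5,461 clusters, and for an 8GB limit 1,048,576 / 524,288 / 155,344 / 87,381, matching the figures in the commit message.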
* Fix a race on listen socket teardown where while draining the | andre | 2012-11-27 | 1 | -4/+21
    accept queues a new socket/connection may be added to the queue due to a race on the ACCEPT_LOCK.
    The submitted patch is slightly changed in comments, teardown and locking order and extended with KASSERT's.
    Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com>
    Found by: His team.
    MFC after: 1 week
* In soreceive_stream() don't drop an already dequeued mbuf chain by | andre | 2012-10-29 | 1 | -5/+11
    overwriting the return mbuf pointer with newly received data after a loop. Instead append the new mbuf chain to the existing one.
    Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream() doesn't get confused.
    For the remainder copy case in the mbuf delivery part deduct the copied length len instead of the whole mbuf length. Additionally don't depend on 'n' being available which isn't true in the case of MSG_PEEK.
    Fix the MSG_WAITALL case by comparing against sb_hiwat. Before it was looping for every receive as sb_lowat normally is zero.
    Add comment about issue with (MSG_WAITALL | MSG_PEEK) which isn't properly handled.
    Submitted by: trociny (except for the change in last paragraph)
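For readers unfamiliar with the two receive flags mentioned in this entry, here is a small userland sketch of their documented semantics on a stream socket: MSG_PEEK copies data without consuming it, and MSG_WAITALL asks for the whole request in one call. It uses a UNIX-domain socketpair rather than TCP and does not exercise soreceive_stream() specifically; `peek_then_waitall()` is a hypothetical helper.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Queue 8 bytes in two sends, peek at the first 4, then read all 8 with
 * MSG_WAITALL into 'out'.  Returns 1 if both receives behaved as
 * expected, 0 if not, and -1 on setup failure.
 */
static int
peek_then_waitall(char *out, size_t len)
{
	int sv[2];
	char peekbuf[4];
	ssize_t peeked, got;

	assert(out != NULL && len == 8);
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return (-1);
	/* Two separate sends; both are queued before the receiver reads. */
	if (send(sv[0], "abcd", 4, 0) != 4 || send(sv[0], "efgh", 4, 0) != 4) {
		close(sv[0]);
		close(sv[1]);
		return (-1);
	}
	/* MSG_PEEK copies data out but leaves it in the receive buffer. */
	peeked = recv(sv[1], peekbuf, sizeof(peekbuf), MSG_PEEK);
	/* MSG_WAITALL asks for all len bytes in a single call. */
	got = recv(sv[1], out, len, MSG_WAITALL);
	close(sv[0]);
	close(sv[1]);
	return (peeked == 4 && got == (ssize_t)len);
}
```

Because the peek does not consume anything, the second receive still sees all 8 bytes starting from the first one.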
* Add logging for socket attach failures in sonewconn() during accept(2). | andre | 2012-10-29 | 1 | -5/+21
    Include the pointer to the PCB so it can be attributed to a particular application by corresponding it to "netstat -A" output.
    MFC after: 2 weeks
* Replace the ill-named ZERO_COPY_SOCKET kernel option with two | andre | 2012-10-23 | 1 | -16/+19
    more appropriately named kernel options for the very distinct send and receive path.
    "options SOCKET_SEND_COW" enables VM page copy-on-write based sending of data on an outbound socket.
    NB: The COW based send mechanism is not safe and may result in kernel crashes.
    "options SOCKET_RECV_PFLIP" enables VM kernel/userspace page flipping for special disposable pages attached as external storage to mbufs.
    Only the naming of the kernel options is changed and their corresponding #ifdef sections are adjusted. No functionality is added or removed.
    Discussed with: alc (mechanism and limitations of send side COW)
* Grammar fixes to r241781. | andre | 2012-10-20 | 1 | -1/+1
    Submitted by: alc
* Hide the unfortunately named sysctl kern.ipc.somaxconn from sysctl -a | andre | 2012-10-20 | 1 | -1/+7
    output and replace it with a new visible sysctl kern.ipc.acceptqueue of the same functionality. It specifies the maximum length of the accept queue on a listen socket.
    The old kern.ipc.somaxconn remains available for reading and writing for compatibility reasons so that existing programs, scripts and configurations continue to work. There are no plans to ever remove the original and now hidden kern.ipc.somaxconn.
* Tidy up somaxconn (accept queue limit) and related functions | andre | 2012-10-20 | 1 | -22/+26
    and move it together into one place.
* Move socket UMA zone initialization functionality together into | andre | 2012-10-19 | 1 | -17/+16
    one place.
* Move UMA socket zone initialization from uipc_domain.c to uipc_socket.c | andre | 2012-10-19 | 1 | -0/+23
    into one place next to its other related functions to avoid confusion.
* Remove unnecessary includes from sosend_copyin() and fix | andre | 2012-10-18 | 1 | -10/+4
    a couple of style issues.
* Remove double-wrapping of #ifdef ZERO_COPY_SOCKETS within | andre | 2012-10-18 | 1 | -17/+1
    zero copy specialized sosend_copyin() helper function.
* Fix spelling of the function name in two assertion messages. | wollman | 2012-10-02 | 1 | -2/+2
* In soreceive_generic() remove the optimization for the case when | trociny | 2012-09-02 | 1 | -8/+2
    MSG_WAITALL is set, and it is possible to do the entire receive operation at once if we block (resid <= hiwat). Actually it might make the recv(2) with MSG_WAITALL flag get stuck when there is enough space in the receiver buffer to satisfy the request but not enough to open the window closed previously due to the buffer being full.
    The issue can be reproduced using the following scenario:
    On the sender side do 2 send(2) requests:
    1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10);
    2) data of size equal to SOBUF_SIZE.
    On the receiver side do 2 recv(2) requests with MSG_WAITALL flag set:
    1) recv() data of SOBUF_SIZE / 10 size;
    2) recv() data of SOBUF_SIZE size;
    We totally fill the receiver buffer with one SOBUF_SIZE/10 size request and a partial SOBUF_SIZE request. When the first request is processed we get SOBUF_SIZE/10 free space. It is just enough to receive the rest of the bytes for the second request, and soreceive_generic() blocks in the part that is the subject of this change, waiting for the rest. But the window was closed when the buffer was filled, and to avoid silly window syndrome it opens only when the available space is larger than sb_hiwat/4 or maxseg. So it is stuck and pending data is only sent via TCP window probes.
    Discussed with: kib (long ago)
    MFC after: 2 weeks
* In soreceive_generic() when checking if the type of mbuf has changed | trociny | 2012-09-02 | 1 | -2/+2
    check it for MT_CONTROL type too, otherwise the assertion "m->m_type == MT_DATA" below may be triggered by the following scenario:
    - the sender sends some data (MT_DATA) and then a file descriptor (MT_CONTROL);
    - the receiver calls recv(2) with a MSG_WAITALL asking for data larger than the receive buffer (uio_resid > hiwat).
    MFC after: 2 weeks
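The sender side of the scenario in this entry relies on passing a file descriptor as control data. The sketch below shows the usual SCM_RIGHTS pattern in userland over a UNIX-domain socketpair; it only illustrates how data and control information travel together and does not reproduce the assertion failure. The helper names `send_fd()`, `recv_fd()` and `fd_passing_roundtrip()` and the `union fd_ctl` type are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
	struct cmsghdr hdr;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Send one data byte plus fd as SCM_RIGHTS control data. */
static int
send_fd(int sock, int fd)
{
	char byte = 'x';
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cm;

	assert(sock >= 0 && fd >= 0);
	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);
	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_RIGHTS;
	cm->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cm), &fd, sizeof(int));
	return (sendmsg(sock, &msg, 0) == 1 ? 0 : -1);
}

/* Receive the data byte and return the passed descriptor, or -1. */
static int
recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cm;
	int fd = -1;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);
	if (recvmsg(sock, &msg, 0) != 1)
		return (-1);
	cm = CMSG_FIRSTHDR(&msg);
	if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
	    cm->cmsg_type != SCM_RIGHTS)
		return (-1);
	memcpy(&fd, CMSG_DATA(cm), sizeof(int));
	return (fd);
}

/* Pass the read end of a pipe over a socketpair; returns 1 on success. */
static int
fd_passing_roundtrip(void)
{
	int sv[2], p[2], got, ok = 0;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return (-1);
	if (pipe(p) != 0) {
		close(sv[0]);
		close(sv[1]);
		return (-1);
	}
	if (send_fd(sv[0], p[0]) == 0 && (got = recv_fd(sv[1])) >= 0) {
		/* Data written to the pipe is readable via the received copy. */
		if (write(p[1], "z", 1) == 1 && read(got, &c, 1) == 1)
			ok = (c == 'z');
		close(got);
	}
	close(p[0]);
	close(p[1]);
	close(sv[0]);
	close(sv[1]);
	return (ok);
}
```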
* Fix KASSERT message. | trociny | 2012-07-03 | 1 | -1/+1
    MFC after: 3 days
* - Remove redundant call to pr_ctloutput from code that handles SO_SETFIB. | np | 2012-04-03 | 1 | -6/+5
    - Add a check for errors during copyin while here.
    Reviewed by: julian, bz
    MFC after: 2 weeks
* Add SO_PROTOCOL/SO_PROTOTYPE socket SOL_SOCKET-level option to get the | kib | 2012-02-26 | 1 | -0/+4
    socket protocol number. This is useful since the socket type can be implemented by different protocols in the same protocol family, e.g. SOCK_STREAM may be provided by both TCP and SCTP.
    Submitted by: Jukka A. Ukkonen <jau iki fi>
    PR: kern/162352
    Discussed with: bz
    Reviewed by: glebius
    MFC after: 2 weeks
* Remove apparently redundant checks for socket so_proto being non-NULL | kib | 2012-02-26 | 1 | -9/+5
    from sosetopt() and sogetopt(). No exposed sockets may have so_proto invalid.
    Discussed with: bz, rwatson
    Reviewed by: glebius
    MFC after: 2 weeks
* Fix found places where uio_resid is truncated to int. | kib | 2012-02-21 | 1 | -6/+11
    Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows performing SSIZE_MAX-sized I/O requests from user mode.
    Discussed with: bde, das (previous versions)
    MFC after: 1 month
* Merge multi-FIB IPv6 support from projects/multi-fibv6/head/: | bz | 2012-02-17 | 1 | -0/+2
    Extend the so far IPv4-only support for multiple routing tables (FIBs) introduced in r178888 to IPv6 providing feature parity.
    This includes an extended rtalloc(9) KPI for IPv6, the necessary adjustments to the network stack, and user land support as in netstat.
    Sponsored by: Cisco Systems, Inc.
    Reviewed by: melifaro (basically)
    MFC after: 10 days
* Fix input validation in SO_SETFIB. | hrs | 2012-02-04 | 1 | -1/+1
    Reviewed by: bz
    MFC after: 1 day
* Remove a few bits of FreeBSD 2.x compatibility code. | rmh | 2011-11-14 | 1 | -3/+0
    Approved by: kib (mentor)
* Fix a deficiency in the selinfo interface: | attilio | 2011-08-25 | 1 | -0/+2
    If a selinfo object is recorded (via selrecord()) and then it is quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#.
    That happens because the selinfo interface has no way to drain the waiters before destroying the registered selinfo object. Also this race is quite rare to get in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object.
    Fix this by adding the seldrain() routine which should be called before destroying the selinfo objects (in order to avoid such a case), and fix the present cases where it might have already been called.
    Sometimes, the context is safe enough to prevent this type of race, like it happens in device drivers which install selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the filedescriptors should be already closed, thus there cannot be a race.
    For this case, the mfi(4) device driver can be taken as an example, as it implements a fully correct logic for preventing this from happening.
    Sponsored by: Sandvine Incorporated
    Reported by: rstone
    Tested by: pluknet
    Reviewed by: jhb, kib
    Approved by: re (bz)
    MFC after: 3 weeks
* In the experimental soreceive_stream(): | andre | 2011-07-08 | 1 | -13/+9
    o Move the non-blocking socket test below the SBS_CANTRCVMORE so that EOF is correctly returned on a remote connection close.
    o In the non-blocking socket test compare SS_NBIO against the so->so_state field instead of the incorrect sb->sb_state field.
    o Simplify the ENOTCONN test by removing cases that can't occur.
    Submitted by: trociny (with some further tweaks by committer)
    Tested by: trociny
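The first item of this entry concerns what a non-blocking receive should report once the peer has closed. A userland sketch of the expected behaviour follows, using a UNIX-domain socketpair instead of the experimental TCP path: with the peer open and nothing queued the call fails with EAGAIN or EWOULDBLOCK, and after the peer closes it returns 0 (EOF). `nonblocking_eof_check()` is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a non-blocking recv() reports "would block" while the
 * peer is open and EOF once the peer has closed, 0 if not, -1 on
 * setup failure.
 */
static int
nonblocking_eof_check(void)
{
	int sv[2], fl, e1;
	ssize_t r1, r2;
	char c;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return (-1);
	fl = fcntl(sv[1], F_GETFL);
	if (fl < 0 || fcntl(sv[1], F_SETFL, fl | O_NONBLOCK) != 0) {
		close(sv[0]);
		close(sv[1]);
		return (-1);
	}
	/* Peer still open, nothing queued: expect -1 and EAGAIN. */
	r1 = recv(sv[1], &c, 1, 0);
	e1 = errno;
	assert(r1 <= 0);	/* no data was ever sent */
	close(sv[0]);
	/* Peer closed: expect 0 (EOF), not another "would block" error. */
	r2 = recv(sv[1], &c, 1, 0);
	close(sv[1]);
	return (r1 == -1 && (e1 == EAGAIN || e1 == EWOULDBLOCK) && r2 == 0);
}
```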