summaryrefslogtreecommitdiffstats
path: root/sys/kern/uipc_socket.c
Commit message (Collapse)AuthorAgeFilesLines
* Introduce a new protocol hook pru_aio_queue.jhb2016-04-291-0/+7
| | | | | | | This allows a protocol to claim individual AIO requests instead of using the default socket AIO handling. Sponsored by: Chelsio Communications
* o "avaliable" -> "available".maxim2016-03-211-1/+1
| | | | | PR: 208141 Submitted by: Tyler Littlefield
* Refactor the AIO subsystem to permit file-type-specific handling andjhb2016-03-011-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | improve cancellation robustness. Introduce a new file operation, fo_aio_queue, which is responsible for queueing and completing an asynchronous I/O request for a given file. The AIO subystem now exports library of routines to manipulate AIO requests as well as the ability to run a handler function in the "default" pool of AIO daemons to service a request. A default implementation for file types which do not include an fo_aio_queue method queues requests to the "default" pool invoking the fo_read or fo_write methods as before. The AIO subsystem permits file types to install a private "cancel" routine when a request is queued to permit safe dequeueing and cleanup of cancelled requests. Sockets now use their own pool of AIO daemons and service per-socket requests in FIFO order. Socket requests will not block indefinitely permitting timely cancellation of all requests. Due to the now-tight coupling of the AIO subsystem with file types, the AIO subsystem is now a standard part of all kernels. The VFS_AIO kernel option and aio.ko module are gone. Many file types may block indefinitely in their fo_read or fo_write callbacks resulting in a hung AIO daemon. This can result in hung user processes (when processes attempt to cancel all outstanding requests during exit) or a hung system. To protect against this, AIO requests are only permitted for known "safe" files by default. AIO requests for all file types can be enabled by setting the new vfs.aio.enable_usafe sysctl to a non-zero value. The AIO tests have been updated to skip operations on unsafe file types if the sysctl is zero. Currently, AIO requests on sockets and raw disks are considered safe and are enabled by default. aio_mlock() is also enabled by default. Reviewed by: cem, jilles Discussed with: kib (earlier version) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D5289
* Increase max allowed backlog for listen socketsalfred2016-02-021-2/+8
| | | | | | | | from short to int. PR: 203922 Submitted by: White Knight <white_knight@2ch.net> MFC After: 4 weeks
* Make shutdown() return ENOTCONN as required by POSIX, part deux.ed2015-07-271-0/+3
| | | | | | | | | | | | | | | | | | | | | | Summary: Back in 2005, maxim@ attempted to fix shutdown() to return ENOTCONN in case the socket was not connected (r150152). This had to be rolled back (r150155), as it broke some of the existing programs that depend on this behavior. I reapplied this change on my system and indeed, syslogd failed to start up. I fixed this back in February (279016) and MFC'ed it to the supported stable branches. Apart from that, things seem to work out all right. Since at least Linux and Mac OS X do the right thing, I'd like to go ahead and give this another try. To keep old copies of syslogd working, only start returning ENOTCONN for recent binaries. I took a look at the XNU sources and they seem to test against both SS_ISCONNECTED, SS_ISCONNECTING and SS_ISDISCONNECTING, instead of just SS_ISCONNECTED. That seams reasonable, so let's do the same. Test Plan: This issue was uncovered while writing tests for shutdown() in CloudABI: https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/sys/socket/shutdown_test.c#L26 Reviewers: glebius, rwatson, #manpages, gnn, #network Reviewed By: gnn, #network Subscribers: bms, mjg, imp Differential Revision: https://reviews.freebsd.org/D3039
* Fix a typo in comment.delphij2015-07-241-1/+1
| | | | | Submitted by: Yanhui Shen via twitter MFC after: 3 days
* Fix cleanup race between unp_dispose and unp_gccem2015-07-141-8/+9
| | | | | | | | | | | | | | | | | | | unp_dispose and unp_gc could race to teardown the same mbuf chains, which can lead to dereferencing freed filedesc pointers. This patch adds an IGNORE_RIGHTS flag on unpcbs marking the unpcb's RIGHTS as invalid/freed. The flag is protected by UNP_LIST_LOCK. To serialize against unp_gc, unp_dispose needs the socket object. Change the dom_dispose() KPI to take a socket object instead of an mbuf chain directly. PR: 194264 Differential Revision: https://reviews.freebsd.org/D3044 Reviewed by: mjg (earlier version) Approved by: markj (mentor) Obtained from: mjg MFC after: 1 month Sponsored by: EMC / Isilon Storage Division
* soreceive_generic() still has similar KASSERT(), therefore instead ofae2015-02-231-0/+2
| | | | | | | remove KASSERT(), change it to check mbuf isn't NULL. Suggested by: kib MFC after: 1 week
* In some cases soreceive_dgram() can return no data, but has controlae2015-02-231-5/+5
| | | | | | | | | | | | message. This can happen when application is sending packets too big for the path MTU and recvmsg() will return zero (indicating no data) but there will be a cmsghdr with cmsg_type set to IPV6_PATHMTU. Remove KASSERT() which does NULL pointer dereference in such case. Also call m_freem() only when m isn't NULL. PR: 197882 MFC after: 1 week Sponsored by: Yandex LLC
* Don't access sockbuf fields directly, use accessor functions instead.davide2015-02-141-8/+4
| | | | | | | | | It is safe to move the call to socantsendmore_locked() after sbdrop_locked() as long as we hold the sockbuf lock across the two calls. CR: D1805 Reviewed by: adrian, kmacy, julian, rwatson
* Revert r274494, r274712, r275955 and provide extra comments explainingglebius2014-12-201-3/+6
| | | | | | | | why there could appear a zero-sized mbufs in socket buffers. A proper fix would be to divorce record socket buffers and stream socket buffers, and divorce pru_send that accepts normal data from pru_send that accepts control data.
* Check for SS_NBIO in so->so_state instead of sb->sb_flags injhb2014-12-151-1/+1
| | | | | | | | soreceive_stream(). Differential Revision: https://reviews.freebsd.org/D1299 Reviewed by: bz, gnn MFC after: 1 week
* Merge from projects/sendfile: extend protocols API to supportglebius2014-11-301-0/+7
| | | | | | | | | sending not ready data: o Add new flag to pru_send() flags - PRUS_NOTREADY. o Add new protocol method pru_ready(). Sponsored by: Nginx, Inc. Sponsored by: Netflix
* Merge from projects/sendfile:glebius2014-11-301-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | o Introduce a notion of "not ready" mbufs in socket buffers. These mbufs are now being populated by some I/O in background and are referenced outside. This forces following implications: - An mbuf which is "not ready" can't be taken out of the buffer. - An mbuf that is behind a "not ready" in the queue neither. - If sockbet buffer is flushed, then "not ready" mbufs shouln't be freed. o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc. The sb_ccc stands for ""claimed character count", or "committed character count". And the sb_acc is "available character count". Consumers of socket buffer API shouldn't already access them directly, but use sbused() and sbavail() respectively. o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones with M_BLOCKED. o New field sb_fnrdy points to the first not ready mbuf, to avoid linear search. o New function sbready() is provided to activate certain amount of mbufs in a socket buffer. A special note on SCTP: SCTP has its own sockbufs. Unfortunately, FreeBSD stack doesn't yet allow protocol specific sockbufs. Thus, SCTP does some hacks to make itself compatible with FreeBSD: it manages sockbufs on its own, but keeps sb_cc updated to inform the stack of amount of data in them. The new notion of "not ready" data isn't supported by SCTP. Instead, only a mechanical substitute is done: s/sb_cc/sb_ccc/. A proper solution would be to take away struct sockbuf from struct socket and allow protocols to implement their own socket buffers, like SCTP already does. This was discussed with rrs@. Sponsored by: Netflix Sponsored by: Nginx, Inc.
* Do not allocate zero-length mbuf in sosend_generic().glebius2014-11-191-1/+1
| | | | | Found by: pho Sponsored by: Nginx, Inc.
* Merge from projects/sendfile:glebius2014-11-141-3/+1
| | | | | | Use sbcut_locked() instead of manually editing a sockbuf. Sponsored by: Nginx, Inc.
* In preparation of merging projects/sendfile, transform bare access toglebius2014-11-121-21/+21
| | | | | | | | | | | | sb_cc member of struct sockbuf to a couple of inline functions: sbavail() and sbused() Right now they are equal, but once notion of "not ready socket buffer data", will be checked in, they are going to be different. Sponsored by: Netflix Sponsored by: Nginx, Inc.
* - Make hhook_run_socket() vnet-aware instead of adding CURVNET_SET() aroundhrs2014-09-081-31/+20
| | | | | | | | the function calls. - Fix a memory leak and stats in the case that hhook_run_socket() fails in soalloc(). PR: 193265
* Fix for r271182.glebius2014-09-071-4/+6
| | | | | Submitted by: mjg Pointy hat to: me, submitter and everyone who urged me to commit
* Set vnet context before accessing V_socket_hhh[].glebius2014-09-051-0/+4
| | | | Submitted by: "Hiroo Ono (小野寛生)" <hiroo.ono+freebsd gmail.com>
* - Remove socket file operations declaration from sys/file.h.glebius2014-08-261-0/+1
| | | | | | | | - Make them static in sys_socket.c. - Provide generic invfo_truncate() instead of soo_truncate(). Sponsored by: Netflix Sponsored by: Nginx, Inc.
* Fix a panic which occurs in a VIMAGE-enabled kernel after r270158, andhrs2014-08-221-5/+35
| | | | | | | separate socket_hhook_register() part and put it into VNET_SYS{,UN}INIT() handler. Discussed with: marcel
* For vendors like Juniper, extensibility for sockets is important. Amarcel2014-08-181-6/+85
| | | | | | | | | | | | | | | | | good example is socket options that aren't necessarily generic. To this end, OSD is added to the socket structure and hooks are defined for key operations on sockets. These are: o soalloc() and sodealloc() o Get and set socket options o Socket related kevent filters. One aspect about hhook that appears to be not fully baked is the return semantics (the return value from the hook is ignored in hhook_run_hooks() at the time of commit). To support return values, the socket_hhook_data structure contains a 'status' field to hold return values. Submitted by: Anuranjan Shukla <anshukla@juniper.net> Obtained from: Juniper Networks, Inc.
* Fix an overflow in getsockopt(). optval isn't big enough to holddavide2014-08-041-6/+6
| | | | | | | | | | | sbintime_t. Re-introduce r255030 behaviour capping socket timeouts to INT_32 if they're too large. CR: https://phabric.freebsd.org/D433 Reported by: demon Reviewed by: bde [1], jhb [2] MFC after: 2 weeks
* The accept filter code is not specific to the FreeBSD IPv4 network stack,marcel2014-07-261-6/+2
| | | | | | | | | | | | | | | | | | so it really should not be under "optional inet". The fact that uipc_accf.c lives under kern/ lends some weight to making it a "standard" file. Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the need for #ifdef INET in kern/uipc_socket.c. Also, this meant the net.inet.accf.unloadable sysctl needed to move, as net.inet does not exist without networking compiled in (as it lives in netinet/in_proto.c.) The new sysctl has been named net.accf.unloadable. In order to support existing accept filter sysctls, the net.inet.accf node has been added netinet/in_proto.c. Submitted by: Steve Kiernan <stevek@juniper.net> Obtained from: Juniper Networks, Inc.
* Simplify wait/nowait code, eventually killing last remnant ofglebius2014-01-161-20/+19
| | | | | | historical mbuf(9) allocator flag. Sponsored by: Nginx, Inc.
* Fix typo in a comment.hiren2013-11-081-1/+1
|
* Rate limit (to once per minute) "Listen queue overflow" message inemax2013-10-311-3/+15
| | | | | | | | sonewconn(). Reviewed by: scottl, lstewart Obtained from: Netflix, Inc MFC after: 2 weeks
* Remove zero-copy sockets code. It only worked for anonymous memory,kib2013-09-161-167/+0
| | | | | | | | | | | | | and the equivalent functionality is now provided by sendfile(2) over posix shared memory filedescriptor. Remove the cow member of struct vm_page, and rearrange the remaining members. While there, make hold_count unsigned. Requested and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (delphij)
* Fix socket buffer timeouts precision using the new sbintime_t KPI insteaddavide2013-09-011-4/+3
| | | | | | | | | of relying on the tvtohz() workaround. The latter has been introduced lately by jhb@ (r254699) in order to have a fix that can be backported to STABLE. Reported by: Vitja Makarov <vitja.makarov at gmail dot com> Reviewed by: jhb (earlier version)
* Don't return an error for socket timeouts that are too large. Justjhb2013-08-291-7/+2
| | | | | | | | cap them to INT_MAX ticks instead. PR: kern/181416 (r254699 really) Requested by: bde MFC after: 2 weeks
* Use tvtohz() to convert a socket buffer timeout to a tick value ratherjhb2013-08-231-7/+2
| | | | | | | | than using a home-rolled version. The home-rolled version could result in shorter-than-requested sleeps. Reported by: Vitja Makarov <vitja.makarov@gmail.com> MFC after: 2 weeks
* When the accept queue is full print the number of already pendingandre2013-05-081-1/+1
| | | | | | | | new connections instead of by how many we're over the limit, which is always 1. Noticed by: jmallet MFC after: 1 week
* Back out r249318, r249320 and r249327 due to a heisenbug mostandre2013-05-061-2/+2
| | | | | likely related to a race condition in the ipi_hash_lock with the exact cause currently unknown but under investigation.
* socket: Make shutdown() wake up a blocked accept().jilles2013-04-301-0/+2
| | | | | | | | | | | | | | | | | | | A blocking accept (and some other operations) waits on &so->so_timeo. Once it wakes up, it will detect the SBS_CANTRCVMORE bit. The error from accept() is [ECONNABORTED] which is not the nicest one -- the thread calling accept() needs to know out-of-band what is happening. A spurious wakeup on so->so_timeo appears harmless (sleep retried) except when lingering on close (SO_LINGER, and in that case there is no descriptor to call shutdown() on) so this should be fairly safe. A shutdown() already woke up a blocked accept() for TCP sockets, but not for Unix domain sockets. This fix is generic for all domains. This patch was sent to -hackers@ and -net@ on April 5. MFC after: 2 weeks
* Fix the build.jimharris2013-04-101-1/+1
|
* Change certain heavily used network related mutexes and rwlocks toandre2013-04-091-2/+2
| | | | | | | | | | reside on their own cache line to prevent false sharing with other nearby structures, especially for those in the .bss segment. NB: Those mutexes and rwlocks with variables next to them that get changed on every invocation do not benefit from their own cache line. Actually it may be net negative because two cache misses would be incurred in those cases.
* When soreceive_generic() hands off an mbuf from buffer,glebius2013-03-291-0/+1
| | | | | | | | | | clear its pointer to next record, since next record belongs to the buffer, and shouldn't be leaked. The ng_ksocket(4) used to clear this pointer itself, but the correct place is here. Sponsored by: Nginx, Inc
* Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC.jilles2013-03-191-2/+2
| | | | | | | | | | | | | | | | | | | This change allows creating file descriptors with close-on-exec set in some situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file descriptors (SCM_RIGHTS) atomically close-on-exec. The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD. MSG_CMSG_CLOEXEC is the first free bit for MSG_*. The SOCK_* flags are not passed to MAC because this may cause incorrect failures and can be done later via fcntl() anyway. On the other hand, audit is expected to cope with the new flags. For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags argument. Reviewed by: kib
* Return an error if sctp_peeloff() fails because a socket can't be allocated.tuexen2013-03-111-1/+6
| | | | MFC after: 3 days
* - Implement two new system calls:pjd2013-03-021-3/+43
| | | | | | | | | | | | | | | | | | | | | | | | | int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen); int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen); which allow to bind and connect respectively to a UNIX domain socket with a path relative to the directory associated with the given file descriptor 'fd'. - Add manual pages for the new syscalls. - Make the new syscalls available for processes in capability mode sandbox. - Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on the directory descriptor for the syscalls to work. - Update audit(4) to support those two new syscalls and to handle path in sockaddr_un structure relative to the given directory descriptor. - Update procstat(1) to recognize the new capability rights. - Document the new capability rights in cap_rights_limit(2). Sponsored by: The FreeBSD Foundation Discussed with: rwatson, jilles, kib, des
* Configure UMA warnings for the following zones:pjd2012-12-071-0/+1
| | | | | | | | | | | | | | | | | - unp_zone: kern.ipc.maxsockets limit reached - socket_zone: kern.ipc.maxsockets limit reached - zone_mbuf: kern.ipc.nmbufs limit reached - zone_clust: kern.ipc.nmbclusters limit reached - zone_jumbop: kern.ipc.nmbjumbop limit reached - zone_jumbo9: kern.ipc.nmbjumbo9 limit reached - zone_jumbo16: kern.ipc.nmbjumbo16 limit reached Note that those warnings are printed not often than every five minutes and can be globally turned off by setting sysctl/tunable vm.zone_warnings to 0. Discussed on: arch Obtained from: WHEEL Systems MFC after: 2 weeks
* - Make socket_zone static - it is used only in this file.pjd2012-12-071-3/+3
| | | | | | - Update maxsockets on uma_zone_set_max(). Obtained from: WHEEL Systems
* Style cleanups.pjd2012-12-071-51/+54
|
* - according to POSIX, make socket(2) return EAFNOSUPPORT rather thankevlo2012-12-071-1/+10
| | | | | | | | EPROTONOSUPPORT if the address family is not supported. - introduce pffinddomain() to find a domain by family and use it as appropriate. Reviewed by: glebius
* Mechanically substitute flags from historic mbuf allocator withglebius2012-12-051-13/+13
| | | | | | | | | malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually
* Fix r243627 by testing against the head socket instead of the socketandre2012-11-271-1/+1
| | | | | | | just created. MFC after: 1 week X-MFC-with: r243627
* Base the mbuf related limits on the available physical memory orandre2012-11-271-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kernel memory, whichever is lower. The overall mbuf related memory limit must be set so that mbufs (and clusters of various sizes) can't exhaust physical RAM or KVM. The limit is set to half of the physical RAM or KVM (whichever is lower) as the baseline. In any normal scenario we want to leave at least half of the physmem/kvm for other kernel functions and userspace to prevent it from swapping too easily. Via a tunable kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm. At the same time divorce maxfiles from maxusers and set maxfiles to physpages / 8 with a floor based on maxusers. This way busy servers can make use of the significantly increased mbuf limits with a much larger number of open sockets. Tidy up ordering in init_param2() and check up on some users of those values calculated here. Out of the overall mbuf memory limit 2K clusters and 4K (page size) clusters to get 1/4 each because these are the most heavily used mbuf sizes. 2K clusters are used for MTU 1500 ethernet inbound packets. 4K clusters are used whenever possible for sends on sockets and thus outbound packets. The larger cluster sizes of 9K and 16K are limited to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used these large clusters will end up only on the inbound path. They are not used on outbound, there it's still 4K. Yes, that will stay that way because otherwise we run into lots of complications in the stack. And it really isn't a problem, so don't make a scene. Normal mbufs (256B) weren't limited at all previously. This was problematic as there are certain places in the kernel that on allocation failure of clusters try to piece together their packet from smaller mbufs. The mbuf limit is the number of all other mbuf sizes together plus some more to allow for standalone mbufs (ACK for example) and to send off a copy of a cluster. Unfortunately there isn't a way to set an overall limit for all mbuf memory together as UMA doesn't support such a limiting. NB: Every cluster also has an mbuf associated with it. Two examples on the revised mbuf sizing limits: 1GB KVM: 512MB limit for mbufs 419,430 mbufs 65,536 2K mbuf clusters 32,768 4K mbuf clusters 9,709 9K mbuf clusters 5,461 16K mbuf clusters 16GB RAM: 8GB limit for mbufs 33,554,432 mbufs 1,048,576 2K mbuf clusters 524,288 4K mbuf clusters 155,344 9K mbuf clusters 87,381 16K mbuf clusters These defaults should be sufficient for even the most demanding network loads. MFC after: 1 month
* Fix a race on listen socket teardown where while draining theandre2012-11-271-4/+21
| | | | | | | | | | | | accept queues a new socket/connection may be added to the queue due to a race on the ACCEPT_LOCK. The submitted patch is slightly changed in comments, teardown and locking order and extended with KASSERT's. Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com> Found by: His team. MFC after: 1 week
* In soreceive_stream() don't drop an already dequeued mbuf chain byandre2012-10-291-5/+11
| | | | | | | | | | | | | | | | | | | | overwriting the return mbuf pointer with newly received data after a loop. Instead append the new mbuf chain to the existing one. Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream() doesn't get confused. For the remainder copy case in the mbuf delivery part deduct the copied length len instead of the whole mbuf length. Additionally don't depend on 'n' being being available which isn't true in the case of MSG_PEEK. Fix the MSG_WAITALL case by comparing against sb_hiwat. Before it was looping for every receive as sb_lowat normally is zero. Add comment about issue with (MSG_WAITALL | MSG_PEEK) which isn't properly handled. Submitted by: trociny (except for the change in last paragraph)
OpenPOWER on IntegriCloud