path: root/sys/netinet
Commit message | Author | Age | Files | Lines
* * Provide information in error causes in ASCII instead of | tuexen | 2014-03-16 | 12 | -665/+275
    proprietary binary format.
  * Add support for a diagnostic information error cause. The code is sysctlable and the default is 0, which means it is not sent.

  This is joint work with rrs@.

  MFC after: 1 week
* Several years after initial development, merge prototype support for | rwatson | 2014-03-15 | 6 | -9/+794
  linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets.

  (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs.

  (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid() to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed.

  (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context).

  (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead.

  Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement.

  No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers.

  Sponsored by: Juniper Networks (original work)
  Sponsored by: EMC/Isilon (patch update and merge)
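The software Toeplitz hash mentioned in item (1) can be sketched as follows. This is a minimal bit-at-a-time illustration, not the code merged by the commit: the function name `toeplitz_hash_sketch` and its signature are assumptions made for this example. For every set bit of the input, the hash XORs in the 32-bit window of the key that starts at that bit position, which is also why the commit calls the hash expensive on general-purpose CPUs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative Toeplitz hash (hypothetical name and signature).
 * key/keylen: RSS keying material; data/datalen: the tuple to hash.
 */
uint32_t
toeplitz_hash_sketch(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0;
	uint32_t window;

	/* The key must cover the last input bit plus a 32-bit window. */
	assert(keylen >= datalen + 4);

	/* Start with the first 32 bits of the key. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | (uint32_t)key[3];

	for (size_t i = 0; i < datalen; i++) {
		for (int bit = 7; bit >= 0; bit--) {
			/* A set input bit selects the current key window. */
			if (data[i] & (1u << bit))
				hash ^= window;
			/* Slide the window forward by one key bit. */
			window <<= 1;
			if (key[i + 4] & (1u << bit))
				window |= 1;
		}
	}
	return (hash);
}
```

Because the hash is linear over XOR, the hash of a tuple equals the XOR of the hashes of its individual set bits; the inner loop runs once per input bit, which is the per-packet cost the commit avoids by preferring a hardware-generated hash.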
* Remove AppleTalk support. | glebius | 2014-03-14 | 1 | -13/+0
  AppleTalk was a network transport protocol for Apple Macintosh devices in the 1980s and 1990s. Starting with Mac OS X in 2000, AppleTalk was a legacy protocol and the primary networking protocol was TCP/IP. The last Mac OS X release to support AppleTalk shipped in 2009. The same year, routing equipment vendors (namely Cisco) ended their support.

  Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.
* Remove IPX support. | glebius | 2014-03-14 | 1 | -1/+0
  IPX was a network transport protocol in Novell's NetWare network operating system in the late 1980s and the 1990s. NetWare itself switched to TCP/IP as its default transport in 1998. Later, in this century, the Novell Open Enterprise Server became the successor of Novell NetWare. The last release that claimed to still support IPX was OES 2 in 2007. Routing equipment vendors (e.g. Cisco) discontinued support for IPX in 2011.

  Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.
* Put the offset of the CRC32C in csum_data instead of 0. | tuexen | 2014-03-12 | 1 | -4/+4
  The virtio driver needs the offset to be stored in csum_data, like in the case for UDP and TCP. The virtio problem was reported by Niu Zhixiong <kaiaixi@gmail.com>, who helped in debugging and testing the patch.

  MFC after: 3 days
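The idea of storing the checksum field's offset can be sketched as below. All type, macro, and function names here are hypothetical stand-ins rather than kernel definitions; only the layout of the 12-byte SCTP common header (two ports, verification tag, checksum) is taken from RFC 4960.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the SCTP common header (RFC 4960). */
struct sctp_common_hdr_sketch {
	uint16_t src_port;
	uint16_t dest_port;
	uint32_t v_tag;
	uint32_t checksum;	/* CRC32C over the whole packet */
};

/* Hypothetical stand-in for a packet's checksum-offload fields. */
struct pkt_csum_sketch {
	uint32_t csum_flags;
	uint32_t csum_data;
};

#define	CSUM_SCTP_SKETCH	0x1u	/* made-up flag value */

/*
 * Ask for checksum offload and record where, relative to the start of
 * the transport header, the result has to be written.
 */
void
request_sctp_csum_offload(struct pkt_csum_sketch *p)
{
	assert(p != NULL);
	p->csum_flags |= CSUM_SCTP_SKETCH;
	p->csum_data = offsetof(struct sctp_common_hdr_sketch, checksum);
}
```

With this layout the stored offset is 8 bytes, which gives a consumer such as a paravirtual driver enough information to locate the checksum field without parsing the header itself.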
* SCTP uses CRC32C and not Adler anymore. While there change the reference | tuexen | 2014-03-12 | 1 | -2/+2
  to RFC 4960. This does not change any code, just comments.

  MFC after: 3 days
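For readers unfamiliar with the checksum the corrected comments refer to, here is a minimal bitwise CRC32C (Castagnoli polynomial, reflected form 0x82F63B78). It is a reference-style sketch with an assumed function name, not the table-driven code a kernel would use, and it leaves aside the byte order in which SCTP stores the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C; hypothetical helper name, illustration only. */
uint32_t
crc32c_sketch(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xffffffffU;	/* initial value: all ones */

	assert(buf != NULL || len == 0);

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int k = 0; k < 8; k++) {
			/* Shift right; fold in the polynomial if a 1 fell out. */
			crc = (crc >> 1) ^ (0x82f63b78U & (0U - (crc & 1U)));
		}
	}
	return (crc ^ 0xffffffffU);	/* final inversion */
}
```

The standard check value for this CRC is 0xE3069283 for the nine ASCII bytes "123456789", which makes a convenient self-test.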
* Since both netinet/ and netinet6/ call into netipsec/ and netpfil/, | glebius | 2014-03-12 | 2 | -9/+3
  the protocol specific mbuf flags are shared between them.

  - Move all M_FOO definitions into a single place: netinet/in6.h, to avoid future clashes.
  - Resolve clash between M_DECRYPTED and M_SKIP_FIREWALL which resulted in a failure of operation of IPSEC and packet filters.

  Thanks to Nicolas and Georgios for all the hard work on bisecting, testing and finally finding the root of the problem.

  PR: kern/186755
  PR: kern/185876
  In collaboration with: Georgios Amanakis <gamanakis gmail.com>
  In collaboration with: Nicolas DEFFAYET <nicolas-ml deffayet.com>
  Sponsored by: Nginx, Inc.
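One way to keep shared flag bits from colliding is to define them in one place and let the compiler check them. The sketch below assumes C11 and uses made-up names and bit positions (the `_SKETCH` suffix marks everything that is not from the commit); the real values live in the kernel headers.

```c
#include <assert.h>

/* Hypothetical protocol-specific flag bits, all defined in one place. */
#define	M_PROTO_BIT_SKETCH(n)	(1u << (12 + (n)))

#define	M_DECRYPTED_SKETCH	M_PROTO_BIT_SKETCH(0)
#define	M_SKIP_FIREWALL_SKETCH	M_PROTO_BIT_SKETCH(1)

/* Fail the build if two flags ever end up on the same bit. */
static_assert((M_DECRYPTED_SKETCH & M_SKIP_FIREWALL_SKETCH) == 0,
    "mbuf protocol flags must not overlap");

/* Returns non-zero when a packet is marked to bypass packet filters. */
int
pkt_skips_firewall_sketch(unsigned int flags)
{
	return ((flags & M_SKIP_FIREWALL_SKETCH) != 0);
}
```

Had two such flags shared a bit, a packet marked as decrypted would also have looked like one that should skip the firewall, which is the kind of failure the commit describes.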
* - Remove rt_metrics_lite and simply put its members into rtentry. | glebius | 2014-03-05 | 8 | -33/+26
  - Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This removes another cache thrashing ++ from the packet forwarding path.
  - Create init/fini methods for the rtentry UMA zone, and initialize the mutex and counter in them.
  - Fix reporting of rmx_pksent to routing socket.
  - Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

  The change is mostly targeted for stable/10 merge. For head, rt_pksent is expected to just disappear.

  Discussed with: melifaro
  Sponsored by: Netflix
  Sponsored by: Nginx, Inc.
* Remove ifa_ref()/ifa_free(), which are atomic(9), from ip_output(). | glebius | 2014-03-04 | 1 | -9/+1
  The ifaddr is already referenced by the rtentry, and we are holding reference on the rtentry throughout the function execution.

  Sponsored by: Netflix
  Sponsored by: Nginx, Inc.
* Remove more constants related to static sysctl nodes. The MAXID constants | jhb | 2014-02-25 | 6 | -20/+6
  were primarily used to size the sysctl name list macros that were removed in r254295. A few other constants either did not have an associated sysctl node, or the associated node used OID_AUTO instead.

  PR: ports/184525 (exp-run)
* Improve logging of send errors, reporting error code and interface. | glebius | 2014-02-22 | 1 | -38/+33
  Reduce code duplication between INET and INET6.

  Tested by: Lytochkin Boris <lytboris gmail.com>
* Remove redundant code and fix a style error. | tuexen | 2014-02-20 | 2 | -6/+2
  MFC after: 3 days
* o Remove at compile time the HASH_ALL code, that was never | glebius | 2014-02-17 | 1 | -13/+2
    tested and is unfinished. However, I've tested my version, it works okay. As before it is unfinished: timeouts aren't driven by TCP session state. To enable the HASH_ALL mode, one needs in kernel config:
        options FLOWTABLE_HASH_ALL
  o Reduce the alignment on flentry to 64 bytes. Without the FLOWTABLE_HASH_ALL option, flows consume half as much memory.
  o API to ip_output()/ip6_output() got even thinner: a one-liner.
  o Remove unused unions. Simply use fle->f_key[].
  o Merge all IPv4 code into flowtable_lookup_ipv4(), and do the same for flowtable_lookup_ipv6(). Stop copying data to on-stack sockaddr structures, simply use key[] on stack.
  o Move code from flowtable_lookup_common() that actually works on insertion into flowtable_insert().

  Sponsored by: Netflix
  Sponsored by: Nginx, Inc.
* Fixup for r261590 (vnet sysctl handlers cleanup). | trociny | 2014-02-09 | 2 | -11/+2
  Reviewed by: glebius
* o Revamp API between flowtable and netinet, netinet6. | glebius | 2014-02-07 | 2 | -30/+2
    - ip_output() and ip6_output() simply call flowtable_lookup(), passing mbuf and address family. That's the only code under #ifdef FLOWTABLE in the protocols code now.
  o Revamp statistics gathering and export.
    - Remove hand made pcpu stats, and utilize counter(9).
    - Snapshot of statistics is available via 'netstat -rs'.
    - All sysctls are moved into net.flowtable namespace, since spreading them over net.inet isn't correct.
  o Properly separate at compile time INET and INET6 parts.
  o General cleanup.
    - Remove chain of multiple flowtables. We simply have one for IPv4 and one for IPv6.
    - Flowtables are allocated in flowtable.c, symbols are static.
    - With proper argument to SYSINIT() we no longer need flowtable_ready.
    - Hash salt doesn't need to be per-VNET.
    - Removed rudimentary debugging, which is quite useless in the dtrace era.

  The runtime behavior of flowtable shouldn't be changed by this commit.

  Sponsored by: Netflix
  Sponsored by: Nginx, Inc.
* Utilize SYSCTL_UMA_CUR() to export usage of syncache and | glebius | 2014-02-07 | 2 | -28/+5
  tcp reassembly zones.

  Sponsored by: Nginx, Inc.
* Catch up on r261590. | glebius | 2014-02-07 | 1 | -4/+0
* Adjust r239672 from rrs and r258821 from eadler. | peter | 2014-01-28 | 1 | -32/+13
  By definition, the very first FIN is not a duplicate. Process it normally and don't feed it to congestion control as though it were a dupe. Don't prevent CC from seeing later dupe acks while in a half close state.
* Decrease lock contention within the TCP accept case by removing | gnn | 2014-01-28 | 2 | -10/+3
  the INP_INFO lock from tcp_usr_accept. As the PR/patch states this was following the advice already in the code. See the PR below for a full discussion of this change and its measured effects.

  PR: 183659
  Submitted by: Julian Charbon
  Reviewed by: jhb
* Fix fallout from r241923. Calculate length of payload in | glebius | 2014-01-22 | 1 | -5/+3
  pim_input() properly. While here, remove extra variable and incorrect condition before m_pullup().

  Reported by: Olivier Cochard-Labbé <olivier cochard.me>
  Sponsored by: Nginx, Inc.
* Further rework netinet6 address handling code: | melifaro | 2014-01-19 | 1 | -2/+2
  * Set ia address/mask values BEFORE attaching to address lists. Inet6 address assignment is not atomic, so the simplest way to do this atomically is to fill in ia before attach.
  * Validate irfa->ia_addr field before use (we permit ANY sockaddr in old code).
  * Do some renamings:
      in6_ifinit -> in6_notify_ifa (interaction with other subsystems is here)
      in6_setup_ifa -> in6_broadcast_ifa (LLE/Multicast/DaD code)
      in6_ifaddloop -> nd6_add_ifa_lle
      in6_ifremloop -> nd6_rem_ifa_lle
  * Split working with LLE and route announce code for last two. Add temporary in6_newaddrmsg() function to mimic current rtsock behaviour.
  * Call device SIOCSIFADDR handler IFF we're adding first address. In IPv4 we have to call it on every address change since ARP record is installed by arp_ifinit() which is called by given handler. IPv6 stack, on the opposite is responsible to call nd6_add_ifa_lle() so there is no reason to call SIOCSIFADDR often.
* If the flowid is available for the mbuf that finalised the creation | adrian | 2014-01-18 | 1 | -0/+10
  of a syncache connection, copy it into the inp_flowid field.

  Without this, an incoming TCP connection won't have an inp_flowid marked until some data comes in, and this means that things like the per-CPU TCP timer option will choose a different CPU for the timer work. (It also means that if one grabbed the flowid via an ioctl from userland, it won't be available until some data has been received.)

  Sponsored by: Netflix, Inc.
* Fix various places where we don't properly release a lock | gnn | 2014-01-16 | 1 | -8/+18
  PR: 185043
  Submitted by: Michael Bentkofsky
  MFC after: 2 weeks
* Cleanup comments and whitespace. No functional changes. | glebius | 2014-01-16 | 1 | -18/+14
* Fix refcount leak on netinet ifa. | melifaro | 2014-01-16 | 1 | -4/+4
  Reviewed by: glebius
  MFC after: 2 weeks
  Sponsored by: Yandex LLC
* Fix ipfw fwd for IPv4 traffic broken by r249894. | melifaro | 2014-01-16 | 1 | -0/+8
  Problem case: Original lookup returns route with GW set, so gw points to rte->rt_gateway. After that we're changing dst and performing lookup another time. Since fwd host is most probably directly reachable, resulting rte does not contain rt_gateway, so gw is not set. Finally, we end with packet transmitted to proper interface but wrong link-layer address.

  Found by: lstewart
  Discussed with: ae,lstewart
  MFC after: 2 weeks
  Sponsored by: Yandex LLC
* Simplify inet alias handling code: if we're adding/removing alias which | melifaro | 2014-01-10 | 1 | -43/+8
  has the same prefix as some other alias on the same interface, use newly-added rt_addrmsg() instead of hand-rolled in_addralias_rtmsg().

  This eliminates the following rtsock messages:
    Pinned RTM_ADD for prefix (for alias addition).
    Pinned RTM_DELETE for prefix (for alias withdrawal).

  Example (got 10.0.0.1/24 on vlan4, playing with 10.0.0.2/24):

  before commit, addition:
    got message of size 116 on Fri Jan 10 14:13:15 2014
    RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
    sockaddrs: <NETMASK,IFP,IFA,BRD>
     255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

    got message of size 192 on Fri Jan 10 14:13:15 2014
    RTM_ADD: Add Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
    locks: inits:
    sockaddrs: <DST,GATEWAY,NETMASK>
     10.0.0.0 10.0.0.2 (255) ffff ffff ff

  after commit, addition:
    got message of size 116 on Fri Jan 10 13:56:26 2014
    RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
    sockaddrs: <NETMASK,IFP,IFA,BRD>
     255.255.255.0 vlan4:8.0.27.c5.29.d4 14.0.0.2 14.0.0.255

  before commit, withdrawal:
    got message of size 192 on Fri Jan 10 13:58:59 2014
    RTM_DELETE: Delete Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
    locks: inits:
    sockaddrs: <DST,GATEWAY,NETMASK>
     10.0.0.0 10.0.0.2 (255) ffff ffff ff

    got message of size 116 on Fri Jan 10 13:58:59 2014
    RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
    sockaddrs: <NETMASK,IFP,IFA,BRD>
     255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

  after commit, withdrawal:
    got message of size 116 on Fri Jan 10 14:14:11 2014
    RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
    sockaddrs: <NETMASK,IFP,IFA,BRD>
     255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

  Sending both RTM_ADD/RTM_DELETE messages to rtsock is completely wrong (and requires some hacks to keep prefix in route table on RTM_DELETE).

  I've tested this change with quagga (no change) and bird (*).

  bird alias handling is already broken in *BSD sysdep code, so nothing changes here, too.

  I'm going to MFC this change if there are no complaints about the behavior change.

  While here, fix some style(9) bugs introduced by r260488 (pointed out by glebius and bde).

  Sponsored by: Yandex LLC
  MFC after: 4 weeks
* Make failure of ifpromisc() a non-fatal error. This makes it possible to | glebius | 2014-01-03 | 1 | -17/+11
  run carp(4) on vtnet(4).

  Sponsored by: Nginx, Inc.
* Add IF_AFDATA_WLOCK_ASSERT() in case lla_lookup() is called with | ae | 2014-01-03 | 1 | -0/+1
  LLE_CREATE flag.

  MFC after: 1 week
* Fix regression from r249894. Now we pass "gw" as argument to if_output | glebius | 2014-01-02 | 1 | -0/+6
  method, thus for multicast case we need it to point at "dst".

  PR: 185395
  Submitted by: ae
* lla_lookup() does modification only when LLE_CREATE is specified. | ae | 2014-01-02 | 1 | -4/+4
  Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when doing lla_lookup() without LLE_CREATE flag.

  Reviewed by: glebius, adrian
  MFC after: 1 week
  Sponsored by: Yandex LLC
* Fix couple of bugs from r257692 related to scan of address list on | glebius | 2013-12-29 | 1 | -4/+8
  an interface:
  - in in_control() skip over not AF_INET addresses.
  - in in_aifaddr_ioctl() and in_difaddr_ioctl() do correct check of address family, w/o accessing memory beyond struct ifaddr.

  Sponsored by: Nginx, Inc.
* Address some warnings which showed up on the userland version. | tuexen | 2013-12-27 | 2 | -3/+3
  MFC after: 1 week
* Draft-ietf-tcpm-initcwnd-05 became RFC6928. | pluknet | 2013-12-26 | 1 | -2/+2
  MFC after: 1 week
* Add more (IPv6) related Internet Protocols: | bz | 2013-12-25 | 1 | -0/+4
  - Host Identity Protocol (RFC5201)
  - Shim6 Protocol (RFC5533)
  - 2x experimentation and testing (RFC3692, RFC4727)

  This does not indicate interest to implement/support these protocols, but they are part of the "IPv6 Extension Header Types" [1] based on RFC7045 and might thus be needed by filtering and next header parsing implementations.

  References: [1] http://www.iana.org/assignments/ipv6-parameters
  Obtained from: http://www.iana.org/assignments/protocol-numbers
  MFC after: 1 week
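A next-header parser or filter might use the new numbers roughly as sketched below. The numeric values are the IANA assignments (139 for HIP, 140 for Shim6, 253 and 254 for experimentation and testing); the enumerator and function names are invented for this example and are not the identifiers the commit adds.

```c
#include <assert.h>
#include <stdbool.h>

/* IANA protocol numbers; the enumerator names are hypothetical. */
enum {
	PROTO_HIP_SKETCH = 139,		/* Host Identity Protocol */
	PROTO_SHIM6_SKETCH = 140,	/* Shim6 Protocol */
	PROTO_EXP_253_SKETCH = 253,	/* experimentation and testing */
	PROTO_EXP_254_SKETCH = 254	/* experimentation and testing */
};

/* True if proto is one of the four numbers covered by this commit. */
bool
is_newly_listed_proto_sketch(int proto)
{
	assert(proto >= 0 && proto <= 255);

	switch (proto) {
	case PROTO_HIP_SKETCH:
	case PROTO_SHIM6_SKETCH:
	case PROTO_EXP_253_SKETCH:
	case PROTO_EXP_254_SKETCH:
		return (true);
	default:
		return (false);
	}
}
```

A real parser would fold these cases into its existing switch over extension header types rather than keep a separate helper.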
* It'll be okay to use LibAliasDetachHandlers() here, relying | glebius | 2013-12-25 | 1 | -1/+1
  on the fact that all handlers come from modules' bss and are followed by NODIR handler.
* Cleanup alias module handler register/unregister. | glebius | 2013-12-25 | 4 | -186/+63
  - Remove locking, since all module(9) events are running under &Giant.
  - Use TAILQ for protocol handlers and fix a bug which led to infinite cycle. Bug found in VirtualBox [1]
  - Simplify code everywhere.
  - Fix documentation.

  [1] https://www.virtualbox.org/pipermail/vbox-dev/2013-November/011936.html

  PR: 183792 [1]
  Submitted by: Valery Ushakov <uwe NetBSD.org> [1]
  Sponsored by: Nginx, Inc.
* Kill space at eols. | glebius | 2013-12-25 | 7 | -125/+125
* Remove from kernel the "dll" code. | glebius | 2013-12-25 | 2 | -17/+18
* Whitespace cleanup. | glebius | 2013-12-25 | 2 | -79/+71
* In sys/netinet/in_mcast.c, inm_is_ifp_detached() is only used whenever | dim | 2013-12-24 | 1 | -0/+4
  KTR is defined, so put it between #ifdef KTR guards. This avoids a warning about an unused function if KTR is not enabled.

  MFC after: 3 days
* Disable the now unpredictably bogus check for whether we have | adrian | 2013-12-20 | 1 | -0/+22
  enough queue space before queuing a bunch of IP fragments.

  As the comment in the committed change says, in the post-if_transmit(), post-SMP, post-preemption world, there are just too many overlapping concurrent code paths and different approaches to driver transmit queue management for this code to be even remotely effective.

  The only specific place it could be useful is if ALTQ is enabled but again it doesn't at all promise that all the fragments will be transmitted anyway.

  The main reason for committing this change is to disable a parallel place where the drops counter is incremented. This is a side effect of an upcoming change to ixgbe/cxgbe to handle the queue drops counter slightly better.

  Sponsored by: Netflix, Inc.
* In a situation where: | eadler | 2013-12-02 | 1 | -2/+4
  - The remote host sends a FIN
  - in an ACK for a sequence number for which an ACK has already been received
  - There is still unacked data on route to the remote host
  - The packet does not contain a window update

  The packet may be dropped without processing the FIN flag.

  PR: kern/99188
  Submitted by: Staffan Ulfberg <staffan@ulfberg.se>
  Discussed with: andre
  MFC after: never
* In | tuexen | 2013-11-30 | 2 | -8/+3
  http://svnweb.freebsd.org/changeset/base/258221
  I introduced a bug which initialized global locks whenever the SCTP stack initialized. This was fixed in
  http://svnweb.freebsd.org/changeset/base/258574
  by rodrigc@. He just initialized the locks for the default vnet. This fix reverts to the old behaviour before r258221, which explicitly makes sure it is only called once, because this works also on other platforms.

  MFC after: 3 days
  X-MFC with: r258574.
* dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE | avg | 2013-11-26 | 6 | -37/+37
  In its stead use the Solaris / illumos approach of emulating '-' (dash) in probe names with '__' (two consecutive underscores).

  Reviewed by: markj
  MFC after: 3 weeks
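The naming convention can be illustrated with a small user-space helper that maps the C-identifier form of a probe name to the form a user would see. The helper name and signature are assumptions for this sketch; it only demonstrates the '__' to '-' mapping and is not how the kernel performs the translation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src into dst (at most dstlen - 1 characters), turning every
 * pair of consecutive underscores into a single dash.
 */
void
sdt_probe_name_sketch(char *dst, size_t dstlen, const char *src)
{
	size_t o = 0;

	assert(dst != NULL && src != NULL && dstlen > 0);

	while (*src != '\0' && o + 1 < dstlen) {
		if (src[0] == '_' && src[1] == '_') {
			dst[o++] = '-';
			src += 2;	/* consume both underscores */
		} else {
			dst[o++] = *src++;
		}
	}
	dst[o] = '\0';
}
```

With this mapping a definition written with the identifier-safe name `state__change` would surface as `state-change`, so no separate spelled-out name argument is needed.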
* Convert over the TCP probes to use mtod() rather than directly | adrian | 2013-11-25 | 2 | -10/+11
  dereferencing m->m_data.

  Sponsored by: Netflix, Inc.
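The difference between the two styles can be sketched in user space. The macro body below mirrors the usual cast-the-data-pointer form of mtod(); the struct definitions, the `_sketch` names, and the helper function are simplified stand-ins invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an mbuf: just a data pointer and a length. */
struct mbuf_sketch {
	char	*m_data;
	int	 m_len;
};

/* Typed accessor: cast the data pointer to the requested type. */
#define	mtod_sketch(m, t)	((t)((m)->m_data))

/* Hypothetical header layout used only for this illustration. */
struct ports_sketch {
	uint16_t sport;
	uint16_t dport;
};

/* Hand a typed header pointer to a consumer such as a probe argument. */
const struct ports_sketch *
probe_arg_sketch(const struct mbuf_sketch *m)
{
	assert(m != NULL && (size_t)m->m_len >= sizeof(struct ports_sketch));
	return (mtod_sketch(m, const struct ports_sketch *));
}
```

Going through the accessor keeps the cast in one place, so call sites do not depend on the type or name of the underlying data field.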
* Only initialize some mutexes for the default VNET. | rodrigc | 2013-11-25 | 1 | -2/+8
  In r208160, sctp_it_ctl was made a global variable, across all VNETs. However, sctp_init() is called for every VNET that is created. This results in the same global mutexes which are part of sctp_it_ctl being initialized. This can result in crashes if many jails are created.

  To reproduce the problem:
  (1) Take a GENERIC kernel config, and add options for: VIMAGE, WITNESS, INVARIANTS.
  (2) Run this command in a loop:
      jail -l -u root -c path=/ name=foo persist vnet && jexec foo ifconfig lo0 127.0.0.1/8 && jail -r foo
      (see http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021280.html )

  Witness will warn about the same mutex being initialized.

  Fix the problem by only initializing these mutexes in the default VNET.
* - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging | attilio | 2013-11-25 | 10 | -11/+0
    option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invocation for unlocking.
  - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency on opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0].

  [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including opt_kdtrace.h correctly before, it was never enabled, so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1].

  Sponsored by: EMC / Isilon storage division
  Discussed with: rstone
  [0] Reported by: rstone
  [1] Discussed with: philip
* In r257692 I intentionally deleted code that handled P2P interfaces | glebius | 2013-11-17 | 1 | -1/+3
  with equal addresses on both sides. It appeared that OpenVPN uses such configurations.

  Submitted by: trociny
* Deregister helper hooks on vnet destroy. | trociny | 2013-11-17 | 1 | -0/+14