path: root/sys/netinet/ip_output.c
Columns: commit message, author, age, files, lines
* MFC r265691:tuexen2014-06-221-10/+6
| | | | | | | | | | For some UDP packets (for example with 200 byte payload) and IP options, the IP header and the UDP header are not in the same mbuf. Add code to in_delayed_cksum() to deal with this case. MFC r265713: Use KASSERTs as suggested by glebius@
* Merge r262763, r262767, r262771, r262806 from head:glebius2014-03-211-4/+4
| - Remove rt_metrics_lite and simply put its members into rtentry.
| - Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This removes another cache-thrashing ++ from the packet forwarding path.
| - Create init/fini methods for the rtentry UMA zone, and initialize the mutex and counter in them.
| - Fix reporting of rmx_pksent to routing socket.
| - Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.
* Merge r262747: remove extraneous ifa_ref()/ifa_free().glebius2014-03-191-9/+1
|
* Merge r261582, r261601, r261610, r261613, r261627, r261640, r261641, r261823,glebius2014-03-041-13/+3
| | | | | | | | | | r261825, r261859, r261875, r261883, r261911, r262027, r262028, r262029, r262030, r262162 from head. Large flowtable revamp. See commit messages for merged revisions for details. Sponsored by: Netflix
* MFC r260702 (by melifaro):ae2014-02-061-0/+8
| | | | | | | | | | | | | Fix ipfw fwd for IPv4 traffic broken by r249894. Problem case: Original lookup returns route with GW set, so gw points to rte->rt_gateway. After that we're changing dst and performing lookup another time. Since fwd host is most probably directly reachable, resulting rte does not contain rt_gateway, so gw is not set. Finally, we end with packet transmitted to proper interface but wrong link-layer address.
* Merge r260188 from head:glebius2014-01-051-0/+6
| | | | | | | Fix regression from r249894. Now we pass "gw" as argument to if_output method, thus for multicast case we need it to point at "dst". PR: 185395
* Implement the ip, tcp, and udp DTrace providers. The probe definitions usemarkj2013-08-251-1/+6
| | | | | | | | | dynamic translation so that their arguments match the definitions for these providers in Solaris and illumos. Thus, existing scripts for these providers should work unmodified on FreeBSD. Tested by: gnn, hiren MFC after: 1 month
* Add m_clrprotoflags() to clear protocol specific mbuf flags at up andandre2013-08-191-2/+2
| | | | | | | | downwards layer crossings. Consistently use it within IP, IPv6 and ethernet protocols. Discussed with: trociny, glebius
* Remove unused M_FRAG, M_FIRSTFRAG and M_LASTFRAG tagging from ip_fragment().andre2013-08-191-8/+3
| | | | | There wasn't any real driver (and hardware) support for it. Modern hardware does full fragmentation/segmentation offload instead.
* In r227207, to fix the issue with possible NULL inp_socket pointertrociny2013-07-041-7/+4
| | | | | | | | | | | | | | | | | | | | | dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR for multicast), INP_REUSEPORT flag was introduced to cache the socket option. It was decided then that one flag would be enough to cache both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR setsockopt(2), it was checked if it was called for a multicast address and INP_REUSEPORT was set accordingly. Unfortunately that approach does not work when setsockopt(2) is called before binding to a multicast address: the multicast check fails and INP_REUSEPORT is not set. Fix this by adding INP_REUSEADDR flag to unconditionally cache SO_REUSEADDR. PR: 179901 Submitted by: Michael Gmelin freebsd grem.de (initial version) Reviewed by: rwatson MFC after: 1 week
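The flag caching described in this entry can be sketched in user-space C. This is only an illustration under stated assumptions: the `DEMO_` constants, the `demo_opt` enum and `demo_cache_sockopt()` are hypothetical names, not the kernel's definitions. The point it shows is that each socket option gets its own cached bit, set unconditionally, so no multicast check is needed at setsockopt(2) time.

```c
/* Hypothetical stand-ins for the cached per-pcb flags. */
#define DEMO_INP_REUSEPORT 0x1u
#define DEMO_INP_REUSEADDR 0x2u

/* Hypothetical stand-ins for the two socket options. */
enum demo_opt { DEMO_SO_REUSEADDR, DEMO_SO_REUSEPORT };

/*
 * Return the new flag word after caching one socket option.
 * Each option maps to its own bit, regardless of the bound address.
 */
unsigned
demo_cache_sockopt(unsigned flags, enum demo_opt opt, int on)
{
	unsigned bit = (opt == DEMO_SO_REUSEADDR) ?
	    DEMO_INP_REUSEADDR : DEMO_INP_REUSEPORT;

	return (on ? (flags | bit) : (flags & ~bit));
}
```

Because the two bits are independent, setting SO_REUSEADDR before binding to a multicast address is still remembered when the bind happens later.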
* Add const qualifier to the dst parameter of the ifnet if_output method.glebius2013-04-261-2/+3
|
* Introduce a pointer to const variable gw, which points either at theglebius2013-04-251-13/+6
| same place as dst, or to the sockaddr in the routing table. The const constraint of gw makes us safe from modifying the routing table accidentally. And the "constantness" of dst allows us to remove several bandaids where we switched it back to &ro->ro_dst; now it always points there.
| Reviewed by: rrs
* This fixes the issue with the "randomly changing" defaultrrs2013-04-241-1/+1
| route. There are two places in ip_output.c where we do a goto again. One place was fine: it copies out the new address and then resets dst = ro->rt_dst. But the other place does *not* do that, which means that earlier, when we found the gateway, dst was left pointing there (dst = ro->rt_gateway). Then we do a goto again and clobber the default route. The fix is just to move the again label so that we always do dst = &ro->rt_dst in the again loop.
| PR: 174749,157796
| MFC after: 1 week
* Use m_get/m_gethdr instead of compat macros.glebius2013-03-151-2/+2
| | | | Sponsored by: Nginx, Inc.
* Mechanically substitute flags from historic mbuf allocator withglebius2012-12-051-5/+5
| malloc(9) flags within sys.
| Exceptions:
| - sys/contrib not touched
| - sys/mbuf.h edited manually
* Remove unused and unnecessary CSUM_IP_FRAGS checksumming capability.andre2012-11-271-4/+2
| | | | | | | | Checksumming the IP header of fragments is no different from doing normal IP headers. Discussed with: yongari MFC after: 1 week
* Remove the recently added sysctl variable net.pfil.forward.ae2012-11-021-5/+3
| | | | | | | | | Instead, add protocol specific mbuf flags M_IP_NEXTHOP and M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup only when this flag is set. Suggested by: andre
* o Remove last argument to ip_fragment(), and obtain all needed informationglebius2012-10-261-13/+16
| on checksums directly from mbuf flags. This simplifies code.
| o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in hardware. Some drivers may not announce CSUM_IP in their if_hwassist, yet still try to do checksums if CSUM_IP is set on the mbuf. An example is em(4).
| o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP. After this change CSUM_DELAY_IP vanishes from the stack.
| Submitted by: Sebastian Kuzminsky <seb lineratesystems.com>
* Remove the IPFIREWALL_FORWARD kernel option and make possible to turnae2012-10-251-8/+3
| | | | | | | | | on the related functionality in the runtime via the sysctl variable net.pfil.forward. It is turned off by default. Sponsored by: Yandex LLC Discussed with: net@ MFC after: 2 weeks
* Switch the entire IPv4 stack to keep the IP packet headerglebius2012-10-221-21/+5
| in network byte order. Any host byte order processing is done in local variables and host byte order values are never[1] written to a packet. After this change a packet processed by the stack isn't modified at all[2] except for TTL. After this change a network stack hacker doesn't need to scratch his head trying to figure out what the byte order is at a given place in the stack.
| [1] One exception still remains. The raw sockets convert to host byte order before passing a packet to an application. This will probably remain for ages for compatibility.
| [2] ip_input() still subtracts the header length from ip->ip_len, but this is planned to be fixed soon.
| Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
| Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
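The convention in this entry (header fields stay in network byte order, host-order values live only in locals) can be sketched in user-space C. This is a minimal sketch, not kernel code: `struct demo_iphdr` and `demo_payload_len()` are hypothetical names, and the struct keeps only the fields the example needs.

```c
#include <arpa/inet.h>	/* htons(), ntohs() */
#include <stdint.h>

/* Hypothetical, simplified header: only the fields this sketch needs. */
struct demo_iphdr {
	uint8_t  vhl;	/* version (high nibble), header length in 32-bit words (low nibble) */
	uint8_t  tos;
	uint16_t len;	/* total length, always stored in network byte order */
};

/*
 * Return the payload length in host byte order.
 * The header is const: the conversion happens in a local variable only.
 */
uint16_t
demo_payload_len(const struct demo_iphdr *ip)
{
	uint16_t total = ntohs(ip->len);	/* host-order local copy */
	uint16_t hlen = (uint16_t)((ip->vhl & 0x0f) << 2);

	return (total >= hlen ? (uint16_t)(total - hlen) : 0);
}
```

For a header built with `vhl = 0x45` and `len = htons(220)`, the function yields 200 and the stored `len` field is left untouched.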
* Fix a miss from r241344: in ip_mloopback() we need to go toglebius2012-10-141-3/+3
| | | | | | net byte order prior to calling in_delayed_cksum(). Reported by: Olivier Cochard-Labbe <olivier cochard.me>
* After r241245 it appeared that in_delayed_cksum(), which still expectsglebius2012-10-081-2/+3
| host byte order, was sometimes called with net byte order. Since we are moving towards net byte order throughout the stack, the function was converted to expect net byte order, and its consumers fixed appropriately:
| - ip_output(), ipfilter(4) not changed, since they already call in_delayed_cksum() with header in net byte order.
| - divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order there and back.
| - mrouting code and IPv6 ipsec now need to switch byte order there and back, but I hope this is a temporary solution.
| - In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum().
| - pf_route() catches up on r241245 changes to ip_output().
* A step in resolving mess with byte ordering for AF_INET. After this change:glebius2012-10-061-15/+32
| - All packets in NETISR_IP queue are in net byte order.
| - ip_input() is entered in net byte order and converts packet to host byte order right _after_ processing pfil(9) hooks.
| - ip_output() is entered in host byte order and converts packet to net byte order right _before_ processing pfil(9) hooks.
| - ip_fragment() accepts and emits packet in net byte order.
| - ip_forward(), ip_mloopback() use host byte order (untouched actually).
| - ip_fastforward() no longer modifies packet at all (except ip_ttl).
| - Swapping of byte order there and back removed from the following modules: pf(4), ipfw(4), enc(4), if_bridge(4).
| - Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
| - __FreeBSD_version bumped.
| - pfil(9) manual page updated.
| Reviewed by: ray, luigi, eri, melifaro
| Tested by: glebius (LE), ray (BE)
* Plug a reference leak: before doing 'goto again' we need to unrefglebius2012-07-181-2/+8
| | | | | | ia->ia_ifa if there is any. Submitted by: Andrey Zonov <andrey zonov.org>
* When ip_output()/ip6_output() is supplied a struct route *ro argument,glebius2012-07-041-22/+22
| it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning here: it may be supplied to provide a route, and it may be supplied to store and return to the caller the route that ip_output()/ip6_output() finds. In the latter case skipping FLOWTABLE lookup is a pessimisation.
| The difference between a struct route filled by FLOWTABLE and one filled by the rtalloc() family is that the former doesn't hold a reference on its rtentry. The reference is held by the flow entry, and it is going to be released in the future. Thus, a route filled by FLOWTABLE shouldn't be passed to the RTFREE() macro.
| - Introduce new flag for struct route/route_in6, that marks route not holding a reference on rtentry.
| - Introduce new macro RO_RTFREE() that cleans up a struct route depending on its kind.
| - All callers to ip_output()/ip6_output() that do supply non-NULL but empty route should use RO_RTFREE() to free results of lookup.
| - ip_output()/ip6_output() now do FLOWTABLE lookup always when ro->ro_rt == NULL.
| Tested by: tuexen (SCTP part)
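The cleanup rule in this entry (release the rtentry only when the route actually owns a reference) can be sketched as follows. This is a user-space illustration under stated assumptions: `demo_rtentry`, `demo_route`, `DEMO_RO_NOREF` and `demo_ro_rtfree()` are hypothetical names standing in for the real structures, flag and RO_RTFREE() macro, and a plain integer stands in for the real reference counting.

```c
#include <stddef.h>

/* Hypothetical stand-in for an rtentry with a reference count. */
struct demo_rtentry {
	int refcnt;
};

/* Hypothetical flag: the route borrows its rtentry and owns no reference. */
#define DEMO_RO_NOREF 0x1

struct demo_route {
	struct demo_rtentry *ro_rt;
	int ro_flags;
};

/*
 * Clean up a route depending on its kind: drop the reference only if
 * this route held one, then forget the pointer in both cases.
 */
void
demo_ro_rtfree(struct demo_route *ro)
{
	if (ro->ro_rt == NULL)
		return;
	if ((ro->ro_flags & DEMO_RO_NOREF) == 0)
		ro->ro_rt->refcnt--;	/* owned reference: release it */
	ro->ro_rt = NULL;
	ro->ro_flags &= ~DEMO_RO_NOREF;
}
```

A caller that passes an empty route to an output routine and gets it filled in can then call one cleanup function without knowing which lookup path filled it.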
* Add an IP_RECVTOS socket option to receive, for received UDP/IPv4tuexen2012-06-121-0/+8
| packets, a cmsg of type IP_RECVTOS which contains the TOS byte, much like IP_RECVTTL does for TTL. This allows implementing a protocol with ECN support on top of UDP.
| MFC after: 3 days
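From an application's point of view, using this option might look like the sketch below. The helper names `demo_enable_recvtos()` and `demo_find_tos()` are hypothetical. One assumption to note: the commit message says the cmsg type is IP_RECVTOS, while some other systems (Linux, for instance) deliver the byte with type IP_TOS, so the sketch accepts either.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>

/* Ask the stack to attach the TOS byte to received datagrams on fd. */
int
demo_enable_recvtos(int fd)
{
	int on = 1;

	return (setsockopt(fd, IPPROTO_IP, IP_RECVTOS, &on, sizeof(on)));
}

/*
 * Scan the ancillary data of a message filled in by recvmsg(2) and
 * return the TOS byte, or -1 if no such control message is present.
 */
int
demo_find_tos(struct msghdr *msg)
{
	struct cmsghdr *cm;
	unsigned char tos;

	for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level == IPPROTO_IP &&
		    (cm->cmsg_type == IP_RECVTOS || cm->cmsg_type == IP_TOS)) {
			memcpy(&tos, CMSG_DATA(cm), sizeof(tos));
			return (tos);
		}
	}
	return (-1);
}
```

The two low-order bits of the returned byte carry the ECN field, which is what a UDP-based protocol implementing ECN would inspect.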
* Cache SO_REUSEPORT socket option in inpcb-layer in order to avoidtrociny2011-11-061-5/+36
| | | | | | | | | | | | | | inp_socket->so_options dereference when we may not acquire the lock on the inpcb. This fixes the crash due to NULL pointer dereference in in_pcbbind_setup() when inp_socket->so_options in a pcb returned by in_pcblookup_local() was checked. Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com> Suggested by: rwatson Glanced by: rwatson Tested by: dave jones <s.dave.jones@gmail.com>
* The mbuf_frag_size always was and is file local and not queried from basebz2011-04-141-1/+1
| | | | | | user space tools via kvm. Mark it static. MFC after: 3 days
* Try to catch a possible divide-by-zero as early as possible if "mtu" is 0bz2010-12-311-0/+3
| | | | | | | | | (also test for negative MTUs if checking it anyway). An MTU of 0 is arguably a bug elsewhere, but this at least gives us some more debugging hints. Sponsored by: ISPsystem (Early 2010) MFC after: 1 week
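The kind of early guard this entry describes can be sketched in user-space C. This is not the kernel's code: `demo_fragment_count()` is a hypothetical helper, and the arithmetic is the usual IPv4 rule that every fragment payload except the last is a multiple of 8 bytes. The sketch rejects a zero or negative MTU before any division takes place.

```c
/*
 * Return how many fragments a datagram of total_len bytes (including a
 * header of hlen bytes) needs on a link with the given MTU, or -1 if
 * the inputs make fragmentation impossible.
 */
int
demo_fragment_count(int total_len, int hlen, int mtu)
{
	int payload, per_frag;

	/* Catch a bogus MTU first, so it can never become a divisor. */
	if (mtu <= 0 || hlen <= 0 || total_len < hlen)
		return (-1);

	per_frag = (mtu - hlen) & ~7;	/* payload per fragment, multiple of 8 */
	if (per_frag < 8)
		return (-1);

	payload = total_len - hlen;
	if (payload == 0)
		return (1);
	return ((payload + per_frag - 1) / per_frag);	/* round up */
}
```

With a 4020-byte datagram, a 20-byte header and an MTU of 1500, each fragment carries 1480 payload bytes and three fragments are needed; an MTU of 0 is reported as an error instead of faulting.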
* IP_BINDANY is not correctly handled in getsockopt() case.attilio2010-09-241-0/+4
| | | | | | | | | Fix it by specifying the correct bits. Sponsored by: Sandvine Incorporated Reviewed by: bz, emaste, rstone Obtained from: Sandvine Incorporated MFC after: 10 days
* This patch fixes the problem where proxy ARP entries cannot be addedqingli2010-05-251-1/+1
| | | | | | over the if_ng interface. MFC after: 3 days
* The proper fix for the delayed SCTP checksum is torrs2010-03-121-2/+2
| | | | | | | | | | have the delayed function take an argument as to the offset to the SCTP header. This allows it to work for V4 and V6. This of course means changing all callers of the function to either pass the header len, if they have it, or create it (ip_hl << 2 or sizeof(ip6_hdr)). PR: 144529 MFC after: 2 weeks
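The offset computation mentioned in this entry (ip_hl << 2 for IPv4, the fixed header size for IPv6) can be sketched as below. `demo_transport_offset()` and `DEMO_IP6_HDRLEN` are hypothetical names, and the sketch assumes an IPv6 packet without extension headers, which is the same simplification as using sizeof(ip6_hdr).

```c
#include <stddef.h>
#include <stdint.h>

/* The fixed IPv6 header is 40 bytes long. */
#define DEMO_IP6_HDRLEN 40

/*
 * Return the offset of the transport (e.g. SCTP) header from the start
 * of the IP packet, or 0 if the version nibble is not 4 or 6.
 */
size_t
demo_transport_offset(const uint8_t *pkt)
{
	unsigned version = pkt[0] >> 4;

	if (version == 4)
		return ((size_t)(pkt[0] & 0x0f) << 2);	/* header words * 4 */
	if (version == 6)
		return (DEMO_IP6_HDRLEN);
	return (0);
}
```

Passing the offset in as an argument, rather than deriving it inside the checksum routine, is what lets one routine serve both address families.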
* - restructure flowtable to support ipv6kmacy2010-03-121-8/+14
| - add a name argument to flowtable_alloc for printing with ddb commands
| - extend ddb commands to print destination address or 4-tuples
| - don't parse ports in ulp header if FL_HASH_ALL is not passed
| - add kern_flowtable_insert to enable more generic use of flowtable (e.g. system calls for adding entries)
| - don't hash loopback addresses
| - cleanup whitespace
| - keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
| - add sysctls to accumulate stats and report aggregate
| MFC after: 7 days
* One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is toqingli2010-03-091-1/+5
| | | | | | | | | | | | | | | | | | | | | allow for connection load balancing across interfaces. Currently the address alias handling method is colliding with the ECMP code. For example, when two interfaces are configured on the same prefix, only one prefix route is installed. So connection load balancing among the available interfaces is not possible. The other advantage of ECMP is for failover. The issue with the current code, is that the interface link-state is not reflected in the route entry. For example, if there are two interfaces on the same prefix, the cable on one interface is unplugged, new and existing connections should switch over to the other interface. This is not done today and packets go into a black hole. Also, there is a small bug in the kernel where deleting ECMP routes in the userland will always return an error even though the command is successfully executed. MFC after: 5 days
* Make the compiler happy after r201125:bz2009-12-281-1/+1
| - + remove two unnecessary initializations in ip_output;
| + + remove one unnecessary initialization in ip_output;
* introduce a local variable rte acting as a cache of ro->ro_rtluigi2009-12-281-18/+22
| | | | | | | within ip_output, achieving (in random order of importance): - a reduction of the number of 'r's in the source code; - improved legibility; - a reduction of 64 bytes in the .text
* + remove an unused #define print_ip;luigi2009-12-281-19/+13
| | | | | | | | | + remove two unnecessary initializations in ip_output; + localize 'len'; + introduce a temporary variable n to count the number of fragments, the compiler seems unable to identify a common subexpression (written 3 times, used twice); + document some assumptions on ip_len and ip_hl
* Remove ifdefed out part of code, which seems to have originated a decade agotrasz2009-11-091-1/+1
| | | | | | | | | in OpenBSD. As it is now, there is no way for this to be useful, since IPsec is free to forward packets via whatever interface it wants, so checking capabilities of the interface passed from ip_output (fetched from the routing table) serves no purpose. Discussed with: sam@
* Virtualize the pfil hooks so that different jails may choose differentjulian2009-10-111-2/+2
| packet filters. Also allows ipfw to be enabled on one jail and disabled on another. In 8.0 it's a global setting.
| Sitting around in tree waiting to commit for: 2 months
| MFC after: 2 months
* Do not try to free the rt_lle entry of the cached route inqingli2009-08-281-3/+1
| | | | | | | | | ip_output() if the cached route was not initialized from the flow-table. The rt_lle entry is invalid unless it has been initialized through the flow-table. Reviewed by: kmacy, rwatson MFC after: immediately
* - change the interface to flowtable_lookup so that we don't rely onkmacy2009-08-181-1/+1
| the mbuf for obtaining the fib index
| - check that a cached flow corresponds to the same fib index as the packet for which we are doing the lookup
| - at interface detach time flush any flows referencing stale rtentrys associated with the interface that is going away (fixes reported panics)
| - reduce the time between cleans, so that if the cleaner is running when the eventhandler is called and the wakeup is missed, less time will elapse before the eventhandler returns
| - separate per-vnet initialization from global initialization (pointed out by jeli@)
| Reviewed by: sam@
| Approved by: re@
* In function ip_output(), the cached route is flushed when there is aqingli2009-08-141-1/+5
| mismatch between the cached entry and the intended destination. The cached rtentry{} is flushed but the associated llentry{} is not. This causes the wrong destination MAC address to be used in the output packets. The fix is to flush the llentry{} when rtentry{} is cleared.
| Reviewed by: kmacy, rwatson
| Approved by: re
* Merge the remainder of kern_vimage.c and vimage.h into vnet.c andrwatson2009-08-011-1/+0
| | | | | | | | | | vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
* Build on Jeff Roberson's linker-set based dynamic per-CPU allocatorrwatson2009-07-141-7/+1
| (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables.
| Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing them in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of the kernel linker.
| Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided.
| This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS.
| Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
* Modify most routines returning 'struct ifaddr *' to return referencesrwatson2009-06-231-0/+3
| rather than pointers, requiring callers to properly dispose of those references. The following routines now return references:
| ifaddr_byindex, ifa_ifwithaddr, ifa_ifwithbroadaddr, ifa_ifwithdstaddr, ifa_ifwithnet, ifaof_ifpforaddr, ifa_ifwithroute, ifa_ifwithroute_fib, rt_getifa, rt_getifa_fib, IFP_TO_IA, ip_rtaddr, in6_ifawithifp, in6ifa_ifpforlinklocal, in6ifa_ifpwithaddr, in6_ifadd, carp_iamatch6, ip6_getdstifaddr
| Remove unused macro which didn't have required referencing: IFP_TO_IA6
| This closes many small races in which changes to interface or address lists while an ifaddr was in use could lead to use of freed memory (etc). In a few cases, add missing if_addr_list locking required to safely acquire references.
| Because of a lack of deep copying support, we accept a race in which an in6_ifaddr pointed to by mbuf tags and extracted with ip6_getdstifaddr() doesn't hold a reference while in transmit. Once we have mbuf tag deep copy support, this can be fixed.
| Reviewed by: bz
| Obtained from: Apple, Inc. (portions)
| MFC after: 6 weeks (portions)
* V_irtualize flowtable state.zec2009-06-221-1/+1
| | | | | | | | | | | | This change should make options VIMAGE kernel builds usable again, to some extent at least. Note that the size of struct vnet_inet has changed, though in accordance with one-bump-per-day policy we didn't update the __FreeBSD_version number, given that it has already been touched by r194640 a few hours ago. Reviewed by: bz Approved by: julian (mentor)
* Move the kernel option FLOWTABLE checking from the header file to thebz2009-06-121-0/+2
| | | | | | | | | | | actual implementation. Remove the accessor functions for the compiled out case, just returning "unavail" values. Remove the kernel conditional from the header file as it is no longer needed, only leaving the externs. Hide the improperly virtualized SYSCTL/TUNABLE for the flowtable size under the kernel option as well. Reviewed by: rwatson
* Only four out of nine arguments for ip_ipsec_output() are actually used.pjd2009-06-051-1/+1
| | | | | Kill unused arguments except for 'ifp' as it might be used in the future for detecting IPsec-capable interfaces.
* Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERICrwatson2009-06-051-1/+0
| | | | | | | | and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
* - Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistentpjd2009-06-011-17/+8
| with OpenBSD (and BSD/OS originally). We can't easily make it a SOL_SOCKET option as there is no more space for more SOL_SOCKET options, but this option also fits better as an IP socket option, it seems.
| - Implement this functionality also for IPv6 and RAW IP sockets.
| - Always compile it in (don't use additional kernel options).
| - Remove sysctl to turn this functionality on and off.
| - Introduce new privilege - PRIV_NETINET_BINDANY, which allows use of this functionality (currently only unjailed root can use it).
| Discussed with: julian, adrian, jhb, rwatson, kmacy