path: root/sys/net/if_llatbl.c
Commit history (most recent first); each entry shows the commit subject (author, date, files changed, lines -removed/+added).
* sys/net*: minor spelling fixes. (pfg, 2016-05-03, 1 file, -1/+1)

  No functional change.
* Implement interface link header precomputation API. (melifaro, 2015-12-31, 1 file, -8/+94)

  Add the if_requestencap() interface method, which is capable of calculating
  various link headers for a given interface. Right now there is support for
  INET/INET6/ARP llheader calculation (IFENCAP_LL type request). Other types
  are planned to support more complex calculations (L2 multipath lagg
  nexthops, tunnel encap nexthops, etc.).

  Reshape 'struct route' to be able to pass additional data (with its length)
  to prepend to an mbuf.

  These two changes permit the routing code to pass pre-calculated nexthop
  data (like the L2 header for a route with a gateway) down to the stack,
  eliminating the need for other lookups. They also bring us closer to more
  complex scenarios like transparently handling MPLS nexthops and tunnel
  interfaces. Last, but not least, they remove the layering violation
  introduced by the flowtable code (ro_lle) and simplify handling of existing
  if_output consumers.

  ARP/ND changes: Make the ARP/NDP stack pre-calculate the link header upon
  installing/updating an lle record. Interface link address changes are
  handled by re-calculating headers for all lles, driven by the if_lladdr
  event. After these changes, arpresolve()/nd6_resolve() return the full
  pre-calculated header for supported interfaces, thus simplifying
  if_output(). Move these lookups to a separate ether_resolve_addr()
  function, which either returns an error or a fully prepared link header.
  Add <arp|nd6_>resolve_addr() compat versions to return link addresses
  instead of pre-calculated data.

  BPF changes: Raw bpf writes occupied _two_ cases: AF_UNSPEC and
  pseudo_AF_HDRCMPLT. Despite the naming, both of these have their header
  "complete". The only difference is that the interface source MAC has to be
  filled in by the OS for AF_UNSPEC (controlled via BIOCGHDRCMPLT). This
  logic has to stay inside BPF and not pollute the if_output() routines.
  Convert BPF to pass prepend data via the new 'struct route' mechanism.
  Note that it does not change non-optimized if_output(): ro_prepend handling
  is purely optional. Side note: the hackish pseudo_AF_HDRCMPLT is supported
  for ethernet and FDDI. It is not needed for ethernet anymore. The only
  remaining FDDI user is dev/pdq, mostly untouched since 2007. FDDI support
  was eliminated from OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65).

  Flowtable changes: Flowtable violates layering by saving (and not correctly
  managing) rtes/lles. Instead of passing an lle pointer, pass a pointer to
  the pre-calculated header data from that lle.

  Differential Revision: https://reviews.freebsd.org/D4102
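The central idea in the commit above — cache the fully built link-layer header next to the nexthop and copy it in front of each outgoing packet, rather than resolving it per packet — can be sketched in a few lines of plain C. The structure and function names below are invented for illustration and do not match the real struct route / if_output() definitions.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical nexthop carrying a pre-built link-layer header. */
    struct nexthop_prepend {
        uint8_t data[32];   /* pre-computed L2 header bytes */
        size_t  len;        /* number of valid bytes in data */
    };

    /*
     * "Transmit" a payload by copying the cached header in front of it,
     * instead of doing an ARP/ND lookup for every packet.
     */
    static size_t
    output_frame(const struct nexthop_prepend *nh, const uint8_t *payload,
        size_t plen, uint8_t *frame, size_t framesz)
    {
        if (nh->len + plen > framesz)
            return (0);
        memcpy(frame, nh->data, nh->len);         /* cached L2 header */
        memcpy(frame + nh->len, payload, plen);   /* L3 payload */
        return (nh->len + plen);
    }

    int
    main(void)
    {
        /* 14-byte Ethernet header: dst MAC, src MAC, ethertype 0x0800. */
        struct nexthop_prepend nh = {
            .data = { 0x00, 0x22, 0x44, 0x66, 0x88, 0x00,
                      0x02, 0x00, 0x00, 0x00, 0x00, 0x01, 0x08, 0x00 },
            .len = 14,
        };
        uint8_t payload[] = { 0x45, 0x00 };   /* start of an IPv4 header */
        uint8_t frame[64];
        size_t n;

        n = output_frame(&nh, payload, sizeof(payload), frame, sizeof(frame));
        printf("built %zu-byte frame\n", n);
        return (0);
    }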
* Simplify bringup order by removing a SYSINIT, making it a static list
  initialization. (bz, 2015-12-22, 1 file, -12/+2)

  MFp4 @180384,180385: There is no need for a dedicated SYSINIT here. The
  list can be initialized statically.

  Sponsored by: CK Software GmbH
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks
  Reviewed by: gnn
  Differential Revision: https://reviews.freebsd.org/D4528
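The change described above boils down to replacing run-time (SYSINIT-style) initialization with a compile-time initializer. A minimal userspace sketch using the queue(3) macros; the names below are stand-ins, not the actual lltable globals.

    #include <sys/queue.h>
    #include <stdio.h>

    struct lltable_stub {
        int llt_af;
        SLIST_ENTRY(lltable_stub) llt_link;
    };

    /*
     * Statically initialized list head: no SYSINIT/constructor needed,
     * the list is usable from the moment the program (or kernel) starts.
     */
    static SLIST_HEAD(, lltable_stub) lltables =
        SLIST_HEAD_INITIALIZER(lltables);

    int
    main(void)
    {
        struct lltable_stub t = { .llt_af = 2 /* AF_INET */ };
        struct lltable_stub *p;

        SLIST_INSERT_HEAD(&lltables, &t, llt_link);
        SLIST_FOREACH(p, &lltables, llt_link)
            printf("lltable af=%d\n", p->llt_af);
        return (0);
    }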
* Remove LLE read lock from IPv6 fast path. (melifaro, 2015-12-13, 1 file, -0/+41)

  The LLE structure is mostly unchanged during its lifecycle: there are only
  2 things relevant for the fast path lookup code:

  1) Link-level address change. Since r286722, these updates are performed
     under the AFDATA WLOCK.
  2) Some sort of feedback indicating that this particular entry is used, so
     we send an NS to perform reachability verification instead of expiring
     the entry. The only signal needed from the fast path is something like a
     binary yes/no.

  The latter is solved by the following changes:

  The special r_skip_req field (introduced in D3688) is used for fast path
  feedback. It is read locklessly by the fast path, but updated under the
  req_mutex mutex. If this field is non-zero, the fast path will acquire the
  lock and set it back to 0.

  After transitioning to the STALE state, the callout timer is armed to run
  every V_nd6_delay seconds, to make sure that if a packet was transmitted at
  the start of a given interval, we would be able to switch to the PROBE
  state in V_nd6_delay seconds as the user expects.

  (In STALE state) the timer is rescheduled until the original V_nd6_gctimer
  expires, keeping the lle in STALE state (the remaining timer value is
  stored in lle_remtime).

  (In STALE state) the timer is rescheduled if a packet was transmitted less
  than V_nd6_delay seconds ago, to make sure we transition to the PROBE state
  exactly after V_nd6_delay seconds.

  As a result, all packets towards an lle in REACHABLE/STALE/PROBE states are
  handled by the fast path without acquiring the lle read lock.

  Differential Revision: https://reviews.freebsd.org/D3780
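The r_skip_req feedback described above — a flag the timer arms under a mutex and the fast path clears, so the fast path takes the lock only when it actually has something to report — can be approximated in userspace C11 as follows. The names and the use of C11 atomics are illustrative; the kernel relies on its own locking and memory-ordering primitives.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct lle_stub {
        atomic_int      r_skip_req;   /* 0 means "entry was used recently" */
        pthread_mutex_t req_mutex;    /* serializes slow-path updates */
    };

    /* Fast path: read the flag without the lock; lock only if it is set. */
    static void
    fastpath_used(struct lle_stub *lle)
    {
        if (atomic_load_explicit(&lle->r_skip_req, memory_order_relaxed) != 0) {
            pthread_mutex_lock(&lle->req_mutex);
            atomic_store(&lle->r_skip_req, 0);   /* "yes, still in use" */
            pthread_mutex_unlock(&lle->req_mutex);
        }
    }

    /* Timer: check whether the fast path cleared the flag, then re-arm it. */
    static bool
    timer_entry_was_used(struct lle_stub *lle)
    {
        bool used;

        pthread_mutex_lock(&lle->req_mutex);
        used = (atomic_load(&lle->r_skip_req) == 0);
        atomic_store(&lle->r_skip_req, 1);   /* re-arm for the next interval */
        pthread_mutex_unlock(&lle->req_mutex);
        return (used);
    }

    int
    main(void)
    {
        struct lle_stub lle = { .r_skip_req = 1,
            .req_mutex = PTHREAD_MUTEX_INITIALIZER };

        fastpath_used(&lle);   /* a packet went out via this entry */
        printf("used since last check: %s\n",
            timer_entry_was_used(&lle) ? "yes" : "no");
        return (0);
    }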
* Remove LLE read lock from IPv4 fast path. (melifaro, 2015-12-05, 1 file, -0/+2)

  The LLE structure is mostly unchanged during its lifecycle. To be more
  specific, there are 2 things relevant for the fast path lookup code:

  1) Link-level address change. Since r286722, these updates are performed
     under the AFDATA WLOCK.
  2) Some sort of feedback indicating that this particular entry is used, so
     we re-send an ARP request to perform reachability verification instead
     of expiring the entry. The only signal needed from the fast path is
     something like a binary yes/no.

  The latter is solved by the following changes:

  1) Introduce the special r_skip_req field, which is read locklessly by the
     fast path but updated under the (new) req_mutex mutex. If this field is
     non-zero, the fast path will acquire the lock and set it back to 0.
  2) Introduce a simple state machine: incomplete->reachable<->verify->deleted.
     Before that we implicitly had an incomplete->reachable->deleted state
     machine, with V_arpt_keep between "reachable" and "deleted".
     Verification was performed at runtime 5 seconds before V_arpt_keep
     expired. This is changed to "change state to verify 5 seconds before
     V_arpt_keep, set r_skip_req to a non-zero value and check it every
     second; if the value is zero, send an ARP verification probe".

  These changes do not introduce any significant control plane overhead:
  typically the lle callout timer would fire 1 more time each V_arpt_keep
  (1200s) for used lles and up to arp_maxtries (5) times for dead lles.

  As a result, all packets towards a "reachable" lle are handled by the fast
  path without acquiring the lle read lock.

  The additional "req_mutex" is needed because the callout /
  arpresolve_slow() or an eventhandler might keep the LLE lock for a
  significant amount of time, which might not be feasible for fast path
  locking (e.g. having an rmlock as either the AFDATA or the lltable's own
  lock).

  Differential Revision: https://reviews.freebsd.org/D3688
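A rough sketch of the incomplete -> reachable <-> verify -> deleted machinery described above. The constants mirror values quoted in the commit message, but the code is only an approximation of the logic, not the kernel implementation.

    #include <stdbool.h>
    #include <stdio.h>

    enum lle_state { LLE_INCOMPLETE, LLE_REACHABLE, LLE_VERIFY, LLE_DELETED };

    struct arp_lle {
        enum lle_state state;
        int            ttl;      /* seconds left before expiry */
        bool           used;     /* fast path saw traffic since last check */
        int            probes;   /* verification probes sent so far */
    };

    #define ARPT_KEEP   1200   /* "reachable" lifetime, seconds */
    #define VERIFY_LEAD    5   /* start verifying this long before expiry */
    #define MAX_TRIES      5   /* probes before giving up */

    /* One one-second timer tick for a single entry. */
    static void
    arp_tick(struct arp_lle *lle)
    {
        switch (lle->state) {
        case LLE_REACHABLE:
            if (--lle->ttl <= VERIFY_LEAD) {
                lle->state = LLE_VERIFY;   /* almost expired */
                lle->used = false;         /* arm the feedback flag */
            }
            break;
        case LLE_VERIFY:
            if (lle->used) {
                /* Entry is in active use: probe the neighbor to make
                 * sure it is still there before extending its life. */
                if (++lle->probes > MAX_TRIES)
                    lle->state = LLE_DELETED;
                else
                    printf("send verification probe #%d\n", lle->probes);
                lle->used = false;
            } else if (--lle->ttl <= 0) {
                lle->state = LLE_DELETED;  /* idle entry expired */
            }
            break;
        default:
            break;
        }
    }

    /* An ARP reply confirms reachability and renews the entry. */
    static void
    arp_reply_received(struct arp_lle *lle)
    {
        lle->state = LLE_REACHABLE;
        lle->ttl = ARPT_KEEP;
        lle->probes = 0;
    }

    int
    main(void)
    {
        struct arp_lle lle = { .state = LLE_REACHABLE, .ttl = VERIFY_LEAD + 1 };

        arp_tick(&lle);            /* drops into VERIFY */
        lle.used = true;           /* fast path transmitted a packet */
        arp_tick(&lle);            /* sends a probe */
        arp_reply_received(&lle);
        printf("final state: %d (REACHABLE is %d)\n", lle.state, LLE_REACHABLE);
        return (0);
    }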
* This fixes several places where callout_stop's return value is examined.
  (rrs, 2015-11-13, 1 file, -1/+1)

  The new return codes of -1 were mistakenly being considered "true".
  callout_stop now returns -1 to indicate the callout had either already
  completed or was not running, and 0 to indicate it could not be stopped.
  Also update the manual page to make it more consistent: no "non-zero" in
  the callout_stop or callout_reset descriptions.

  MFC after: 1 month, with the associated callout change.
* Unify setting lladdr for AF_INET[6]. (melifaro, 2015-11-07, 1 file, -0/+9)
* Eliminate nd6_nud_hint() and its TCP bindings. (melifaro, 2015-09-27, 1 file, -1/+0)

  Initially the function was introduced in r53541 (KAME initial commit) to
  "provide hints from upper layer protocols that indicate a connection is
  making "forward progress"" (quote from RFC 2461 7.3.1 Reachability
  Confirmation). However, it was converted to do nothing (e.g. just return)
  in r122922 (tcp_hostcache implementation) back in 2003. Some defines were
  moved to tcp_var.h in r169541. Then, it was broken (for non-corner cases)
  by r186119 (L2<>L3 split) in 2008 (NULL ifp in nd6_lookup). So, right now
  this code is broken and has no "real" base users.

  Differential Revision: https://reviews.freebsd.org/D3699
* Require explicit lle unlink prior to calling llentry_delete(). (melifaro,
  2015-09-15, 1 file, -7/+1)

  This slightly decreases the time the afdata wlock is held.

  * While here, make nd6_free() return void. No one has used its return
    value since r186119.
* Do more fine-grained locking: call eventhandlers/free_entry without
  holding afdata wlock. (melifaro, 2015-09-14, 1 file, -10/+41)

  * Convert per-af delete_address callback to global lltable_delete_entry()
    and a more low-level "delete this lle" per-af callback.
  * Fix some bugs/inconsistencies in IPv4/IPv6 ifscrub procedures.

  Sponsored by: Yandex LLC
  Differential Revision: https://reviews.freebsd.org/D3573
* Simplify lla_rt_output()/nd6_add_ifa_lle() by setting lle state in the
  alloc handler, based on flags. (melifaro, 2015-08-31, 1 file, -16/+5)
* Split allocation and table linking for lle's. (melifaro, 2015-08-20, 1 file, -20/+79)

  Before that, the logic behind lle_create() was the following: return the
  existing entry if found, create one if not. This behaviour was error-prone
  since we had to deal with 'sudden' static<>dynamic lle changes.

  This commit fixes a bunch of different issues like:

  - refcount leak when an lle is converted to static.
    Simple check case:
      console 1: while true; do for i in `arp -an|awk '$4~/incomp/{print$2}'|tr -d '()'`; do arp -s $i 00:22:44:66:88:00 ; arp -d $i; done; done
      console 2: ping -f any-dead-host-in-L2
      console 3: # watch for memory consumption: vmstat -m | awk '$1~/lltable/{print$2}'
  - possible problems in arptimer() / nd6_timer() when dropping/reacquiring
    the lock.

  The new logic explicitly handles use-or-create cases in every lla_create
  user. Basically, most of the changes are purely mechanical. However, we
  explicitly avoid using existing lle's for interface/static LLE records.

  * While here, call lle_event handlers on every real table lle change.
  * Create lltable_free_entry(), calling the existing per-lltable lle_free_t
    callback for entry deletion.
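The alloc/link split can be illustrated with a toy table: allocation touches no shared state, and the short critical section only links the new entry or reports that one already exists, leaving the use-or-create decision visible to the caller. All names and the userspace locking below are hypothetical.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct lle {
        unsigned    addr;     /* key: IPv4 address, host order for brevity */
        int         refcnt;
        struct lle *next;
    };

    static struct lle      *lle_head;
    static pthread_mutex_t  table_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Allocation does not touch the table and needs no lock. */
    static struct lle *
    lle_alloc(unsigned addr)
    {
        struct lle *lle = calloc(1, sizeof(*lle));

        if (lle != NULL) {
            lle->addr = addr;
            lle->refcnt = 1;
        }
        return (lle);
    }

    /*
     * Linking is a separate, short critical section.  If an entry for the
     * same address appeared in the meantime, the existing one is reused
     * and the caller can see that nothing new was linked.
     */
    static struct lle *
    lle_link(struct lle *new)
    {
        struct lle *cur;

        if (new == NULL)
            return (NULL);
        pthread_mutex_lock(&table_lock);
        for (cur = lle_head; cur != NULL; cur = cur->next) {
            if (cur->addr == new->addr) {
                cur->refcnt++;                 /* reuse existing entry */
                pthread_mutex_unlock(&table_lock);
                free(new);
                return (cur);
            }
        }
        new->next = lle_head;
        lle_head = new;
        pthread_mutex_unlock(&table_lock);
        return (new);
    }

    int
    main(void)
    {
        struct lle *a = lle_link(lle_alloc(0x0a000001));
        struct lle *b = lle_link(lle_alloc(0x0a000001));

        printf("same entry: %s, refcnt %d\n", a == b ? "yes" : "no", a->refcnt);
        return (0);
    }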
* Use single 'lle_timer' callout in lltable instead of two different names
  of the same timer. (melifaro, 2015-08-11, 1 file, -2/+2)
* MFP r276712. (melifaro, 2015-08-11, 1 file, -18/+42)

  * Split lltable_init() into lltable_allocate_htbl() (alloc hash table with
    default callbacks) and lltable_link() (links any lltable to the list).
  * Switch from LLTBL_HASHTBL_SIZE to a per-lltable hash size field.
  * Move lltable setup to separate functions in in[6]_domifattach.
* Partially merge r274887,r275334,r275577,r275578,r275586 to minimize
  differences between projects/routing and HEAD. (melifaro, 2015-08-10,
  1 file, -21/+260)

  This commit tries to keep the code logic the same while changing the
  underlying code to use unified callbacks.

  * Add llt_foreach_entry method to traverse all entries in a given llt.
  * Add llt_dump_entry method to export a particular lle entry in
    sysctl/rtsock format (code is not indented properly to minimize diff).
    Will be fixed in the next commits.
  * Add llt_link_entry/llt_unlink_entry methods to link/unlink a particular
    lle.
  * Add llt_fill_sa_entry method to export the address in the lle in
    sockaddr format.
  * Add llt_hash method to use in the generic hash table support code.
  * Add llt_free_entry method, which is used in the llt_prefix_free code.
  * Prepare for fine-grained locking by separating lle unlink and deletion
    in lltable_free() and lltable_prefix_free().
  * Provide lltable_get<ifp|af>() functions to reduce direct 'struct
    lltable' access by external callers.
  * Remove the @llt argument from the lle_free() lle callback since it was
    unused.
  * Temporarily add an L3_CADDR() macro for 'const' sockaddr typecasting.
  * Switch to per-af hashing code.
  * Rename the LLE_FREE_LOCKED() callback from in[6]_lltable_free() to
    in_[6]lltable_destroy() to avoid clashing with the llt_free_entry()
    method. Update the descriptions of these functions.
  * Use the unified lltable_free_entry() function instead of the per-af one.

  Reviewed by: ae
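Conceptually, the unified callbacks turn each lltable into a small method table that per-address-family code fills in. The sketch below loosely borrows the llt_* names from the commit message; the surrounding structures are invented for the example and need not match the real struct lltable layout.

    #include <stdio.h>

    struct lltable;
    struct llentry { unsigned key; };

    /* Per-address-family operations, filled in by the in.c / in6.c analogues. */
    struct lltable_ops {
        unsigned (*llt_hash)(const struct llentry *, unsigned hsize);
        int      (*llt_dump_entry)(struct lltable *, struct llentry *);
    };

    struct lltable {
        int                       llt_af;
        const struct lltable_ops *llt_ops;
    };

    /* IPv4 flavour of the methods. */
    static unsigned
    in_lltable_hash(const struct llentry *lle, unsigned hsize)
    {
        return (lle->key % hsize);
    }

    static int
    in_lltable_dump_entry(struct lltable *llt, struct llentry *lle)
    {
        printf("af=%d entry key=%#x\n", llt->llt_af, lle->key);
        return (0);
    }

    static const struct lltable_ops in_ops = {
        .llt_hash = in_lltable_hash,
        .llt_dump_entry = in_lltable_dump_entry,
    };

    int
    main(void)
    {
        struct lltable llt = { .llt_af = 2 /* AF_INET */, .llt_ops = &in_ops };
        struct llentry lle = { .key = 0x0a000001 };

        /* Generic code only ever goes through the method table. */
        printf("bucket %u\n", llt.llt_ops->llt_hash(&lle, 32));
        llt.llt_ops->llt_dump_entry(&llt, &lle);
        return (0);
    }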
* MFP r274553. (melifaro, 2015-08-08, 1 file, -70/+55)

  * Move lle creation/deletion from lla_lookup to separate functions:
      lla_lookup(LLE_CREATE) -> lla_create
      lla_lookup(LLE_DELETE) -> lla_delete
    lla_create now returns with an LLE_EXCLUSIVE lock on the lle.
  * Provide typedefs for new/existing lltable callbacks.

  Reviewed by: ae
* Back out r249318, r249320 and r249327 due to a heisenbug most likely
  related to a race condition in the ipi_hash_lock, with the exact cause
  currently unknown but under investigation. (andre, 2013-05-06, 1 file, -1/+1)
* Change certain heavily used network related mutexes and rwlocks to reside
  on their own cache line to prevent false sharing with other nearby
  structures, especially for those in the .bss segment. (andre, 2013-04-09,
  1 file, -1/+1)

  NB: Those mutexes and rwlocks with variables next to them that get changed
  on every invocation do not benefit from their own cache line. Actually it
  may be a net negative because two cache misses would be incurred in those
  cases.
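The underlying technique is plain cache-line padding: give the lock a cache line of its own so that writes to unrelated neighbouring data do not keep invalidating it on other CPUs. A generic illustration, assuming a 64-byte line; FreeBSD itself uses its CACHE_LINE_SIZE machinery rather than a hard-coded constant.

    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64   /* assumed cache line size */

    /* The mutex gets a full cache line to itself. */
    struct padded_mutex {
        pthread_mutex_t mtx;
        char            pad[CACHE_LINE - sizeof(pthread_mutex_t) % CACHE_LINE];
    } __attribute__((aligned(CACHE_LINE)));

    static struct padded_mutex table_lock = {
        .mtx = PTHREAD_MUTEX_INITIALIZER,
    };

    int
    main(void)
    {
        printf("sizeof(padded_mutex) = %zu, alignment = %zu\n",
            sizeof(struct padded_mutex), _Alignof(struct padded_mutex));
        pthread_mutex_lock(&table_lock.mtx);
        pthread_mutex_unlock(&table_lock.mtx);
        return (0);
    }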
* Retire struct sockaddr_inarp. (glebius, 2013-01-31, 1 file, -28/+5)

  Since ARP and routing are separated, "proxy only" entries don't have any
  meaning, thus we don't need an additional field in the sockaddr to pass the
  SIN_PROXY flag. The new kernel is binary compatible with old tools, since
  the sizes of sockaddr_inarp and sockaddr_in match, and sa_family is filled
  with the same value.

  The structure declaration is left for compatibility with third-party
  software, but in-tree code no longer uses it.

  Reviewed by: ru, andre, net@
* route_output() always supplies info with an RTAX_GATEWAY member that
  points to a sockaddr of the AF_LINK family. Assert this instead of
  checking. (glebius, 2013-01-29, 1 file, -4/+3)
* Fix a problem in r238990. The LLE_LINKED flag should be tested prior to
  entering llentry_free(), and in case we lose the race, we should simply
  perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
  performing arptimer(), it will remove two references from the lle instead
  of one. (glebius, 2012-12-13, 1 file, -6/+0)

  Reported by: Ian FREISLICH <ianf clue.co.za>
* Fix races between in_lltable_prefix_free(), lla_lookup(), llentry_free()
  and arptimer(). (glebius, 2012-08-02, 1 file, -6/+13)

  o Use callout_init_rw() for the lle timeout; this allows us to safely
    disestablish them.
    - This allows us to simplify arptimer() and make it race safe.
  o Consistently use ifp->if_afdata_lock to lock access to the linked lists
    in the lle hashes.
  o Introduce a new lle flag LLE_LINKED, which marks an entry that is
    attached to the hash.
    - Use LLE_LINKED to avoid double unlinking via consequent calls to
      llentry_free().
    - Mark lles with LLE_DELETED via the |= operation instead of =, so that
      other flags won't be lost.
  o Make LLE_ADDREF(), LLE_REMREF() and LLE_FREE_LOCKED() more consistent
    and provide more informative KASSERTs.

  The patch is a collaborative work of all submitters and myself.

  PR: kern/165863
  Submitted by: Andrey Zonov <andrey zonov.org>
  Submitted by: Ryan Stone <rysto32 gmail.com>
  Submitted by: Eric van Gyzen <eric_van_gyzen dell.com>
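Both of the LLE_LINKED/LLE_DELETED points above come down to careful flag handling on a shared entry: unlink (and drop the table's reference) exactly once, and OR new flag bits in rather than overwriting the existing ones. A toy, single-threaded version — the real code of course does this under the lle lock, and the flag values here are made up.

    #include <stdio.h>

    #define LLE_LINKED   0x01   /* entry is on the hash list */
    #define LLE_DELETED  0x02   /* entry is logically gone */
    #define LLE_STATIC   0x04   /* some other flag we must not lose */

    struct lle {
        unsigned flags;
        int      refcnt;
    };

    /*
     * Free path that may be reached from two racing contexts (timer and
     * table teardown).  Only the caller that actually unlinks the entry
     * drops the reference held by the table; the loser drops only its own.
     */
    static void
    llentry_free_once(struct lle *lle)
    {
        if (lle->flags & LLE_LINKED) {
            lle->flags &= ~LLE_LINKED;
            lle->refcnt--;               /* reference held by the table */
        }
        /* |= keeps LLE_STATIC and friends intact; '=' would wipe them. */
        lle->flags |= LLE_DELETED;
        lle->refcnt--;                   /* caller's own reference */
    }

    int
    main(void)
    {
        struct lle lle = { .flags = LLE_LINKED | LLE_STATIC, .refcnt = 3 };

        llentry_free_once(&lle);   /* e.g. arptimer() wins the race */
        llentry_free_once(&lle);   /* second caller: no double unlink */
        printf("flags=%#x refcnt=%d\n", lle.flags, lle.refcnt);
        return (0);
    }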
* The llentry_update() is used only by flowtable, and the latter always
  passes a NULL pointer to it. Thus, the code can be simplified and the
  function renamed to llentry_alloc() to match rtalloc(). (glebius,
  2012-08-02, 1 file, -20/+11)
* Some more whitespace cleanup. (glebius, 2012-08-01, 1 file, -5/+5)
* Some style(9) and whitespace changes. (glebius, 2012-07-31, 1 file, -7/+7)

  Together with: Andrey Zonov <andrey zonov.org>
* A flowtable entry can continue referencing an llentry indefinitely if the
  entry is repeatedly referenced within its timeout window. (kmacy,
  2012-01-26, 1 file, -0/+1)

  This change clears the LLE_VALID flag when an llentry is removed from an
  interface's hash table and adds an extra check to the flowtable code for
  the LLE_VALID flag in the llentry to avoid retaining and using a stale
  reference.

  Reviewed by: qingli@
  MFC after: 2 weeks
* Move arprequest() declaration to if_ether.h. (glebius, 2012-01-08, 1 file, -3/+0)
* The statically configured (permanent) ARP entries are removed when an
  interface is brought down, even though the interface address is still
  valid. This patch maintains the permanent ARP entries as long as the
  interface address (having the same prefix as that of the ARP entries) is
  valid. (qingli, 2011-05-20, 1 file, -2/+3)

  Reviewed by: delphij
  MFC after: 5 days
* After some off-list discussion, revert a number of changes to the
  DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
  people working on the affected files. (dim, 2010-11-22, 1 file, -1/+1)

  A better long-term solution is still being considered. This reversal may
  give some modules empty set_pcpu or set_vnet sections, but these are
  harmless.

  Changes reverted:
  ------------------------------------------------------------------------
  r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

  Instead of unconditionally emitting .globl's for the __start_set_xxx and
  __stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
  sections are actually defined.
  ------------------------------------------------------------------------
  r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

  Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
  the tree.
  ------------------------------------------------------------------------
  r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

  Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
* Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
  the tree. (dim, 2010-11-14, 1 file, -1/+1)
* Use 'z' modifier for size_t printing. (kib, 2010-11-13, 1 file, -1/+1)
* Add a queue to hold packets while we await an ARP reply. (gnn, 2010-11-12,
  1 file, -3/+20)

  When a fast machine first brings up some non-TCP networking program, it is
  quite possible that we will drop packets due to the fact that only one
  packet can be held per ARP entry. This leads to packets being missed when
  a program starts or restarts if the ARP data is not currently in the ARP
  cache.

  This code adds a new sysctl, net.link.ether.inet.maxhold, which defines a
  system-wide maximum number of packets to be held in each ARP entry. Up to
  maxhold packets are queued until an ARP reply is received or the ARP entry
  times out. The default setting is the old value of 1, which has been part
  of the BSD networking code since time immemorial.

  Expose the time we hold an incomplete ARP entry by adding the sysctl
  net.link.ether.inet.wait, which defaults to 20 seconds, the value used
  when the new ARP code was added.

  Reviewed by: bz, rpaulo
  MFC after: 3 weeks
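The change is from "hold exactly one packet per unresolved ARP entry" to "hold up to maxhold packets". A minimal bounded FIFO conveys the idea; the constant, names, and drop-oldest policy here are assumptions made for the sketch, whereas the kernel keeps an mbuf chain on the llentry and enforces the sysctl value.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXHOLD 16   /* analogous to net.link.ether.inet.maxhold */

    struct pkt {
        struct pkt *next;
        char        data[64];
    };

    struct hold_queue {
        struct pkt *head, *tail;
        int         count;
    };

    /* Queue a packet while ARP resolution is pending; drop the oldest when full. */
    static void
    hold_packet(struct hold_queue *q, const char *data)
    {
        struct pkt *p = calloc(1, sizeof(*p));

        if (p == NULL)
            return;
        strncpy(p->data, data, sizeof(p->data) - 1);
        if (q->count == MAXHOLD) {
            struct pkt *old = q->head;

            q->head = old->next;
            free(old);                 /* oldest held packet is dropped */
            q->count--;
        }
        if (q->tail != NULL)
            q->tail->next = p;
        else
            q->head = p;
        q->tail = p;
        q->count++;
    }

    /* ARP reply arrived: transmit everything we were holding. */
    static void
    flush_queue(struct hold_queue *q)
    {
        while (q->head != NULL) {
            struct pkt *p = q->head;

            q->head = p->next;
            printf("sending held packet: %s\n", p->data);
            free(p);
        }
        q->tail = NULL;
        q->count = 0;
    }

    int
    main(void)
    {
        struct hold_queue q = { 0 };

        hold_packet(&q, "dns query");
        hold_packet(&q, "ntp request");
        flush_queue(&q);   /* pretend the ARP reply just came in */
        return (0);
    }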
* lltable_drain() has never been used so far, thus #if 0 it for now. (bz,
  2010-10-16, 1 file, -0/+4)

  While touching it, add the missing locking to the now-disabled code for
  the time when we'll resurrect it.

  MFC after: 3 days
* Don't check malloc(M_WAITOK) result. (glebius, 2010-07-27, 1 file, -2/+0)
* When installing a new ARP entry via 'arp -S', lla_lookup() will either
  find an existing entry or allocate a new one. (glebius, 2010-07-27,
  1 file, -0/+1)

  In the latter case the entry would have the flags that were supplied as an
  argument to lla_lookup(). In the case of an existing entry, flags aren't
  modified. This led to losing the LLE_PUB and/or LLE_PROXY flags.

  We should apply these flags either in lla_rt_output() or in
  in.c:in_lltable_lookup(). It seems to me that lla_rt_output() is a more
  correct choice.

  PR: kern/148784, kern/146539
  Silence from: qingli, 5 days
* Fix an obvious typo from r1.1. We were acquiring an exclusive writer lock
  regardless of the given flags. (jkim, 2010-07-22, 1 file, -1/+1)

  MFC after: 3 days
* Plug reference leaks in the link-layer code ("new-arp") that previously
  prevented the link-layer entry from being freed. (bz, 2010-04-11, 1 file,
  -1/+4)

  In both in.c and in6.c (though that code path seems to be basically dead),
  plug a reference leak in case of a pending callout being drained.

  In if_ether.c consistently add a reference before resetting the callout,
  and in case we canceled a pending one, remove the reference for that. In
  the final case in arptimer, before freeing the expired entry, remove the
  reference again and explicitly call callout_stop() to clear the active
  flag.

  In nd6.c:nd6_free() we are only ever called from the callout function and
  thus need to remove the reference there as well before calling into
  llentry_free().

  In if_llatbl.c when freeing entire tables, make sure that in case we
  cancel a pending callout we remove the reference as well.

  Reviewed by: qingli (earlier version)
  MFC after: 10 days
  Problem observed, patch tested by: simon on ipv6gw.f.o, Christian Kratzer
    (ck cksoft.de), Evgenii Davidov (dado korolev-net.ru)
  PR: kern/144564
  Configurations still affected: with options FLOWTABLE
* Add ddb support to the "new" link layer code ("new-arp"). (bz, 2010-03-18,
  1 file, -0/+136)

  - show all lltables [1] (optional flag to also show the llentries as well)
  - show lltable <struct lltable *>
  - show llentry <struct llentry *>

  MFC after: 6 days
* Restructure flowtable to support ipv6. (kmacy, 2010-03-12, 1 file, -2/+2)

  - add a name argument to flowtable_alloc for printing with ddb commands
  - extend ddb commands to print destination address or 4-tuples
  - don't parse ports in ulp header if FL_HASH_ALL is not passed
  - add kern_flowtable_insert to enable more generic use of flowtable
    (e.g. system calls for adding entries)
  - don't hash loopback addresses
  - cleanup whitespace
  - keep statistics per-cpu for per-cpu flowtables to avoid cache line
    contention
  - add sysctls to accumulate stats and report aggregate

  MFC after: 7 days
* The proxy ARP entries could not be added into the system over
  IFF_POINTOPOINT link types. (qingli, 2009-12-30, 1 file, -1/+3)

  The reason was that the routing entry returned from the kernel covering
  the remote end is of an interface type that does not support ARP. This
  patch fixes the problem by providing a hint to the kernel routing code
  indicating that the prefix route, instead of the PPP host route, should be
  returned to the caller.

  Since a host route to the local end point is also added into the routing
  table, and there could be multiple such instantiations because multiple
  PPP links can be created with the same local end IP address, this patch
  also fixes the loopback route installation failure problem observed prior
  to this patch. The reference count of the loopback route to the local end
  is either incremented or decremented. The first instantiation creates the
  entry and the last removal deletes the route entry.

  MFC after: 5 days
* Style fix - break too long a line in two. (zec, 2009-09-18, 1 file, -1/+2)

  Spotted by: bz
  MFC after: 3 days
* V_irtualize the lltables list, making ARP and ND reasonably usable again
  with options VIMAGE kernels. (zec, 2009-09-17, 1 file, -7/+19)

  Submitted by: bz (the original version, probably identical to this one)
  Reviewed by: many @ DevSummit Cambridge
  MFC after: 3 days
* The addresses that are assigned to the loopback interface should be part
  of the kernel routing table. (qingli, 2009-09-05, 1 file, -9/+0)

  Reviewed by: bz
  MFC after: immediately
* This patch fixes the following issues. (qingli, 2009-09-05, 1 file, -0/+9)

  - An interface link-local address is not reachable within the node that
    owns the interface; this is due to the mismatch in address scope that
    results from the installed interface address loopback route. Therefore,
    for each interface address loopback route, the rt_gateway field (of
    AF_LINK type) will be used to track which interface a given address
    belongs to. This will aid the address source in using the proper
    interface for address scope/zone validation.
  - The loopback address is not reachable. The root cause is the same as the
    above.
  - Empty nd6 entries are created for the IPv6 loopback addresses only for
    validation reasons. Doing so eliminates as much of the special-case
    (loopback addresses) handling code as possible; however, these empty nd6
    entries should not be returned to userland applications such as the
    "ndp" command.

  Since the fixes for the above issues touch common files, these files are
  committed together.

  Reviewed by: bz
  MFC after: immediately
* Use locks specific to the lltable code, rather than borrowing the ifnet
  list/index locks, to protect link layer address tables. (rwatson,
  2009-08-25, 1 file, -14/+15)

  This avoids lock order issues during interface teardown, but maintains the
  bug that sysctl copy routines may be called while a non-sleepable lock is
  held.

  Reviewed by: bz, kmacy
  MFC after: 3 days
* Rework global locks for interface list and index management, correcting
  several critical bugs, including race conditions and lock order issues.
  (rwatson, 2009-08-23, 1 file, -4/+4)

  Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
  sxlock. Either can be held to stabilize the lists and indexes, but both
  are required to write. This allows the list to be held stable in both
  network interrupt contexts and sleepable user threads across sleeping
  memory allocations or device driver interactions. As before, writes to
  the interface list must occur from sleepable contexts.

  Reviewed by: bz, julian
  MFC after: 3 days
* Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h.
  (rwatson, 2009-08-01, 1 file, -1/+1)

  We now use jails (rather than vimages) as the abstraction for
  virtualization management, and what remained was specific to virtual
  network stacks. Minor cleanups are done in the process, and comments are
  updated to reflect these changes.

  Reviewed by: bz
  Approved by: re (vimage blanket)
* When an interface address is removed and the last prefix route is also
  being deleted, the link-layer address table (arp or nd6) will flush those
  L2 llinfo entries that match the removed prefix. (qingli, 2009-05-20,
  1 file, -0/+17)

  Reviewed by: kmacy
* Clarify state of llentry that is passed back. (kmacy, 2009-04-17, 1 file, -0/+2)
* Add comment to llentry_update. (kmacy, 2009-04-16, 1 file, -0/+4)

  Requested by: sam