summaryrefslogtreecommitdiffstats
path: root/net/ipv4/route.c
Commit message (Collapse)AuthorAgeFilesLines
* ipv4: route: fix inet_rtm_getroute induced crashFlorian Westphal2017-08-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "ip route get $daddr iif eth0 from $saddr" causes: BUG: KASAN: use-after-free in ip_route_input_rcu+0x1535/0x1b50 Call Trace: ip_route_input_rcu+0x1535/0x1b50 ip_route_input_noref+0xf9/0x190 tcp_v4_early_demux+0x1a4/0x2b0 ip_rcv+0xbcb/0xc05 __netif_receive_skb+0x9c/0xd0 netif_receive_skb_internal+0x5a8/0x890 Problem is that inet_rtm_getroute calls either ip_route_input_rcu (if an iif was provided) or ip_route_output_key_hash_rcu. But ip_route_input_rcu, unlike ip_route_output_key_hash_rcu, already associates the dst_entry with the skb. This clears the SKB_DST_NOREF bit (i.e. skb_dst_drop will release/free the entry while it should not). Thus only set the dst if we called ip_route_output_key_hash_rcu(). I tested this patch by running: while true;do ip r get 10.0.1.2;done > /dev/null & while true;do ip r get 10.0.1.2 iif eth0 from 10.0.1.1;done > /dev/null & ... and saw no crash or memory leak. Cc: Roopa Prabhu <roopa@cumulusnetworks.com> Cc: David Ahern <dsahern@gmail.com> Fixes: ba52d61e0ff ("ipv4: route: restore skb_dst_set in inet_rtm_getroute") Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge tag 'random_for_linus' of ↵Linus Torvalds2017-07-151-2/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random Pull random updates from Ted Ts'o: "Add wait_for_random_bytes() and get_random_*_wait() functions so that callers can more safely get random bytes if they can block until the CRNG is initialized. Also print a warning if get_random_*() is called before the CRNG is initialized. By default, only one single-line warning will be printed per boot. If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a warning will be printed for each function which tries to get random bytes before the CRNG is initialized. This can get spammy for certain architecture types, so it is not enabled by default" * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: random: reorder READ_ONCE() in get_random_uXX random: suppress spammy warnings about unseeded randomness random: warn when kernel uses unseeded randomness net/route: use get_random_int for random counter net/neighbor: use get_random_u32 for 32-bit hash random rhashtable: use get_random_u32 for hash_rnd ceph: ensure RNG is seeded before using iscsi: ensure RNG is seeded before use cifs: use get_random_u32 for 32-bit lock random random: add get_random_{bytes,u32,u64,int,long,once}_wait family random: add wait_for_random_bytes() API
| * net/route: use get_random_int for random counterJason A. Donenfeld2017-06-191-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Using get_random_int here is faster, more fitting of the use case, and just as cryptographically secure. It also has the benefit of providing better randomness at early boot, which is when many of these structures are assigned. Also, semantically, it's not really proper to have been assigning an atomic_t in this way before, even if in practice it works fine. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | net: remove DST_NOCACHE flagWei Wang2017-06-171-12/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | DST_NOCACHE flag check has been removed from dst_release() and dst_hold_safe() in a previous patch because all the dst are now ref counted properly and can be released based on refcnt only. Looking at the rest of the DST_NOCACHE use, all of them can now be removed or replaced with other checks. So this patch gets rid of all the DST_NOCACHE usage and remove this flag completely. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: remove DST_NOGC flagWei Wang2017-06-171-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that all the components have been changed to release dst based on refcnt only and not depend on dst gc anymore, we can remove the temporary flag DST_NOGC. Note that we also need to remove the DST_NOCACHE check in dst_release() and dst_hold_safe() because now all the dst are released based on refcnt and behaves as DST_NOCACHE. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: mark DST_NOGC and remove the operation of dst_free()Wei Wang2017-06-171-12/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | With the previous preparation patches, we are ready to get rid of the dst gc operation in ipv4 code and release dst based on refcnt only. So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls to dst_free(). At this point, all dst created in ipv4 code do not use the dst gc anymore and will be destroyed at the point when refcnt drops to 0. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: call dst_hold_safe() properlyWei Wang2017-06-171-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch checks all the calls to dst_hold()/skb_dst_force()/dst_clone()/dst_use() to see if dst_hold_safe() is needed to avoid double free issue if dst gc is removed and dst_release() directly destroys dst when dst->__refcnt drops to 0. In tx path, TCP hold sk->sk_rx_dst ref count and also hold sock_lock(). UDP and other similar protocols always hold refcount for skb->_skb_refdst. So both paths seem to be safe. In rx path, as it is lockless and skb_dst_set_noref() is likely to be used, dst_hold_safe() should always be used when trying to hold dst. In the routing code, if dst is held during an rcu protected session, it is necessary to call dst_hold_safe() as the current dst might be in its rcu grace period. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: call dst_dev_put() properlyWei Wang2017-06-171-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As the intend of this patch series is to completely remove dst gc, we need to call dst_dev_put() to release the reference to dst->dev when removing routes from fib because we won't keep the gc list anymore and will lose the dst pointer right after removing the routes. Without the gc list, there is no way to find all the dst's that have dst->dev pointing to the going-down dev. Hence, we are doing dst_dev_put() immediately before we lose the last reference of the dst from the routing code. The next dst_check() will trigger a route re-lookup to find another route (if there is any). Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: take dst->__refcnt when caching dst in fibWei Wang2017-06-171-3/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In IPv4 routing code, fib_nh and fib_nh_exception can hold pointers to struct rtable but they never increment dst->__refcnt. This leads to the need of the dst garbage collector because when user is done with this dst and calls dst_release(), it can only decrement dst->__refcnt and can not free the dst even it sees dst->__refcnt drops from 1 to 0 (unless DST_NOCACHE flag is set) because the routing code might still hold reference to it. And when the routing code tries to delete a route, it has to put the dst to the gc_list if dst->__refcnt is not yet 0 and have a gc thread running periodically to check on dst->__refcnt and finally to free dst when refcnt becomes 0. This patch increments dst->__refcnt when fib_nh/fib_nh_exception holds reference to this dst and properly release the dst when fib_nh/fib_nh_exception has been updated with a new dst. This patch is a preparation in order to fully get rid of dst gc later. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: use loopback dev when generating blackhole routeWei Wang2017-06-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Existing ipv4/6_blackhole_route() code generates a blackhole route with dst->dev pointing to the passed in dst->dev. It is not necessary to hold reference to the passed in dst->dev because the packets going through this route are dropped anyway. A loopback interface is good enough so that we don't need to worry about releasing this dst->dev when this dev is going down. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: route: restore skb_dst_set in inet_rtm_getrouteRoopa Prabhu2017-06-011-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | recent updates to inet_rtm_getroute dropped skb_dst_set in inet_rtm_getroute. This patch restores it because it is needed to release the dst correctly. Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup") Reported-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-05-261-1/+9
|\ \ | | | | | | | | | | | | | | | | | | Overlapping changes in drivers/net/phy/marvell.c, bug fix in 'net' restricting a HW workaround alongside cleanups in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: add reference counting to metricsEric Dumazet2017-05-261-1/+9
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Andrey Konovalov reported crashes in ipv4_mtu() I could reproduce the issue with KASAN kernels, between 10.246.7.151 and 10.246.7.152 : 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 & 2) At the same time run following loop : while : do ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 done Cong Wang attempted to add back rt->fi in commit 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting") but this proved to add some issues that were complex to solve. Instead, I suggested to add a refcount to the metrics themselves, being a standalone object (in particular, no reference to other objects) I tried to make this patch as small as possible to ease its backport, instead of being super clean. Note that we believe that only ipv4 dst need to take care of the metric refcount. But if this is wrong, this patch adds the basic infrastructure to extend this to other families. Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang for his efforts on this problem. Fixes: 2860583fe840 ("ipv4: Kill rt->fi") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: RTM_GETROUTE: return matched fib result when requestedRoopa Prabhu2017-05-261-2/+11
| | | | | | | | | | | | | | | | | | This patch adds support to return matched fib result when RTM_F_FIB_MATCH flag is specified in RTM_GETROUTE request. This is useful for user-space applications/controllers wanting to query a matching route. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookupDavid Ahern2017-05-261-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert inet_rtm_getroute to use ip_route_input_rcu and ip_route_output_key_hash_rcu passing the fib_result arg to both. The rcu lock is held through the creation of the response, so the rtable/dst does not need to be attached to the skb and is passed to rt_fill_info directly. In converting from ip_route_output_key to ip_route_output_key_hash_rcu the xfrm_lookup_route in ip_route_output_flow is dropped since flowi4_proto is not set for a route get request. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: Remove event arg to rt_fill_infoDavid Ahern2017-05-261-4/+3
| | | | | | | | | | | | | | | | | | rt_fill_info has 1 caller with the event set to RTM_NEWROUTE. Given that remove the arg and use RTM_NEWROUTE directly in rt_fill_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: refactor ip_route_input_norefDavid Ahern2017-05-261-29/+37
| | | | | | | | | | | | | | | | | | | | | | A later patch wants access to the fib result on an input route lookup with the rcu lock held. Refactor ip_route_input_noref pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper that takes the fib_result as an input arg. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: refactor __ip_route_output_key_hashDavid Ahern2017-05-261-22/+31
|/ | | | | | | | | | | | | | | A later patch wants access to the fib result on an output route lookup with the rcu lock held. Refactor __ip_route_output_key_hash, pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper with the fib_result as an input arg. To keep the name length under control remove the leading underscores from the name and add _rcu to the name of the new helper indicating it is called with the rcu read lock held. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-04-261-1/+2
|\ | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: Avoid caching l3mdev dst on mismatched local routeRobert Shearman2017-04-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | David reported that doing the following: ip li add red type vrf table 10 ip link set dev eth1 vrf red ip addr add 127.0.0.1/8 dev red ip link set dev eth1 up ip li set red up ping -c1 -w1 -I red 127.0.0.1 ip li del red when either policy routing IP rules are present or the local table lookup ip rule is before the l3mdev lookup results in a hang with these messages: unregister_netdevice: waiting for red to become free. Usage count = 1 The problem is caused by caching the dst used for sending the packet out of the specified interface on a local route with a different nexthop interface. Thus the dst could stay around until the route in the table the lookup was done is deleted which may be never. Address the problem by not forcing output device to be the l3mdev in the flow's output interface if the lookup didn't use the l3mdev. This then results in the dst using the right device according to the route. Changes in v2: - make the dev_out passed in by __ip_route_output_key_hash correct instead of checking the nh dev if FLOWI_FLAG_SKIP_NH_OIF is set as suggested by David. Fixes: 5f02ce24c2696 ("net: l3mdev: Allow the l3mdev to be a loopback") Reported-by: David Ahern <dsa@cumulusnetworks.com> Suggested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ipv4: fix multipath RTM_GETROUTE behavior when iif is givenFlorian Larysch2017-04-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to ip_route_input when iif is given. If a multipath route is present for the designated destination, ip_multipath_icmp_hash ends up being called, which uses the source/destination addresses within the skb to calculate a hash. However, those are not set in the synthetic skb, causing it to return an arbitrary and incorrect result. Instead, use UDP, which gets no such special treatment. Signed-off-by: Florian Larysch <fl@n621.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: rtnetlink: plumb extended ack to doit functionDavid Ahern2017-04-171-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse for doit functions that call it directly. This is the first step to using extended error reporting in rtnetlink. >From here individual subsystems can be updated to set netlink_ext_ack as needed. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netlink: pass extended ACK struct to parsing functionsJohannes Berg2017-04-131-1/+2
| | | | | | | | | | | | | | | | | | Pass the new extended ACK reporting struct to all of the generic netlink parsing functions. For now, pass NULL in almost all callers (except for some in the core.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: Refine the ipv4_default_advmssGao Feng2017-04-131-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. Don't get the metric RTAX_ADVMSS of dst. There are two reasons. 1) Its caller dst_metric_advmss has already invoke dst_metric_advmss before invoke default_advmss. 2) The ipv4_default_advmss is used to get the default mss, it should not try to get the metric like ip6_default_advmss. 2. Use sizeof(tcphdr)+sizeof(iphdr) instead of literal 40. 3. Define one new macro IPV4_MAX_PMTU instead of 65535 according to RFC 2675, section 5.1. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: fix multipath RTM_GETROUTE behavior when iif is givenFlorian Larysch2017-04-071-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to ip_route_input when iif is given. If a multipath route is present for the designated destination, fib_multipath_hash ends up being called with that skb. However, as that skb contains no information beyond the protocol type, the calculated hash does not match the one we would see for a real packet. There is currently no way to fix this for layer 4 hashing, as RTM_GETROUTE doesn't have the necessary information to create layer 4 headers. To fix this for layer 3 hashing, set appropriate saddr/daddrs in the skb and also change the protocol to UDP to avoid special treatment for ICMP. Signed-off-by: Florian Larysch <fl@n621.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: add support for ECMP hash policy choiceNikolay Aleksandrov2017-03-211-22/+70
|/ | | | | | | | | | | | | | | | | | This patch adds support for ECMP hash policy choice via a new sysctl called fib_multipath_hash_policy and also adds support for L4 hashes. The current values for fib_multipath_hash_policy are: 0 - layer 3 (default) 1 - layer 4 If there's an skb hash already set and it matches the chosen policy then it will be used instead of being calculated (currently only for L4). In L3 mode we always calculate the hash due to the ICMP error special case, the flow dissector's field consistentification should handle the address order thus we can remove the address reversals. If the skb is provided we always use it for the hash calculation, otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: mask tos for input routeJulian Anastasov2017-02-261-0/+1
| | | | | | | | | | | | | Restore the lost masking of TOS in input route code to allow ip rules to match it properly. Problem [1] noticed by Shmulik Ladkani <shmulik.ladkani@gmail.com> [1] http://marc.info/?t=137331755300040&r=1&w=2 Fixes: 89aef8921bfb ("ipv4: Delete routing cache.") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: add missing initialization for flowi4_uidJulian Anastasov2017-02-261-0/+1
| | | | | | | | | | | Avoid matching of random stack value for uid when rules are looked up on input route or when RP filter is used. Problem should affect only setups that use ip rules with uid range. Fixes: 622ec2c9d524 ("net: core: add UID to flows, rules, and routes") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add confirm_neigh method to dst_opsJulian Anastasov2017-02-071-0/+19
| | | | | | | | | | | | | | Add confirm_neigh method to dst_ops and use it from IPv4 and IPv6 to lookup and confirm the neighbour. Its usage via the new helper dst_confirm_neigh() should be restricted to MSG_PROBE users for performance reasons. For XFRM prefer the last tunnel address, if present. With help from Steffen Klassert. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-01-171-1/+1
|\
| * net: ipv4: fix table id in getroute responseDavid Ahern2017-01-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | rtm_table is an 8-bit field while table ids are allowed up to u32. Commit 709772e6e065 ("net: Fix routing tables with id > 255 for legacy software") added the preference to set rtm_table in dumps to RT_TABLE_COMPAT if the table id is > 255. The table id returned on get route requests should do the same. Fixes: c36ba6603a11 ("net: Allow user to get table id from route lookup") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: remove disable of bottom half in inet_rtm_getrouteDavid Ahern2017-01-091-2/+0
| | | | | | | | | | | | | | | | | | | | Nothing about the route lookup requires bottom half to be disabled. Remove the local_bh_disable ... local_bh_enable around ip_route_input. This appears to be a vestige of days gone by as it has been there since the beginning of git time. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: Remove flow arg from ip_mkroute_inputDavid Ahern2017-01-081-2/+1
| | | | | | | | | | | | | | fl4 arg is not used; remove it. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipmr: Remove nowait arg to ipmr_get_routeDavid Ahern2017-01-081-1/+1
| | | | | | | | | | | | | | | | ipmr_get_route has 1 caller and the nowait arg is 0. Remove the arg and simplify ipmr_get_route accordingly. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv4: Simplify rt_fill_infoDavid Ahern2017-01-061-13/+7
|/ | | | | | | | | rt_fill_info has only 1 caller and both of the last 2 args -- nowait and flags -- are hardcoded to 0. Given that remove them as input arguments and simplify rt_fill_info accordingly. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv4: dst for local input routes should use l3mdev if relevantDavid Ahern2016-12-291-1/+2
| | | | | | | | | | | | | | | IPv4 output routes already use l3mdev device instead of loopback for dst's if it is applicable. Change local input routes to do the same. This fixes icmp responses for unreachable UDP ports which are directed to the wrong table after commit 9d1a6c4ea43e4 because local_input routes use the loopback device. Moving from ingress device to loopback loses the L3 domain causing responses based on the dst to get to lost. Fixes: 9d1a6c4ea43e4 ("net: icmp_route_lookup should use rt dev to determine L3 domain") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds2016-12-241-1/+1
| | | | | | | | | | | | | This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* net: ipv4: Don't crash if passing a null sk to ip_do_redirect.Lorenzo Colitti2016-12-221-1/+2
| | | | | | | | | | | | | Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") made ip_do_redirect call sock_net(sk) to determine the network namespace of the passed-in socket. This crashes if sk is NULL. Fix this by getting the network namespace from the skb instead. Fixes: e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* route: Set lwtstate for local traffic and cached input dstsThomas Graf2016-12-021-13/+26
| | | | | | | | | | | | | | A route on the output path hitting a RTN_LOCAL route will keep the dst associated on its way through the loopback device. On the receive path, the dst_input() call will thus invoke the input handler of the route created in the output path. Thus, lwt redirection for input must be done for dsts allocated in the otuput path as well. Also, if a route is cached in the input path, the allocated dst should respect lwtunnel configuration on the nexthop as well. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* route: Set orig_output when redirecting to lwt on locally generated trafficThomas Graf2016-12-021-1/+3
| | | | | | | | | | orig_output for IPv4 was only set for dsts which hit an input route. Set it consistently for locally generated traffic as well to allow lwt to continue the dst_output() path as configured by the nexthop. Fixes: 2536862311d ("lwt: Add support to redirect dst.input") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu.Lorenzo Colitti2016-11-301-1/+2
| | | | | | | | | | | | | | Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") made __build_flow_key call sock_net(sk) to determine the network namespace of the passed-in socket. This crashes if sk is NULL. Fix this by getting the network namespace from the skb instead. Fixes: e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") Reported-by: Erez Shitrit <erezsh@dev.mellanox.co.il> Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-11-151-1/+3
|\ | | | | | | | | | | | | Several cases of bug fixes in 'net' overlapping other changes in 'net-next-. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: use new_gw for redirect neigh lookupStephen Suryaputra Lin2016-11-131-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0 and then the state of the neigh for the new_gw is checked. If the state isn't valid then the redirected route is deleted. This behavior is maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway is assigned to peer->redirect_learned.a4 before calling ipv4_neigh_lookup(). After commit 5943634fc559 ("ipv4: Maintain redirect and PMTU info in struct rtable again."), ipv4_neigh_lookup() is performed without the rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw) isn't zero, the function uses it as the key. The neigh is most likely valid since the old_gw is the one that sends the ICMP redirect message. Then the new_gw is assigned to fib_nh_exception. The problem is: the new_gw ARP may never gets resolved and the traffic is blackholed. So, use the new_gw for neigh lookup. Changes from v1: - use __ipv4_neigh_lookup instead (per Eric Dumazet). Fixes: 5943634fc559 ("ipv4: Maintain redirect and PMTU info in struct rtable again.") Signed-off-by: Stephen Suryaputra Lin <ssurya@ieee.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: inet: Support UID-based routing in IP protocols.Lorenzo Colitti2016-11-041-11/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Use the UID in routing lookups made by protocol connect() and sendmsg() functions. - Make sure that routing lookups triggered by incoming packets (e.g., Path MTU discovery) take the UID of the socket into account. - For packets not associated with a userspace socket, (e.g., ping replies) use UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. This allows all namespaces to apply routing and iptables rules to kernel-originated traffic in that namespaces by matching UID 0. This is better than using the UID of the kernel socket that is sending the traffic, because the UID of kernel sockets created at namespace creation time (e.g., the per-processor ICMP and TCP sockets) is the UID of the user that created the socket, which might not be mapped in the namespace. Tested: compiles allnoconfig, allyesconfig, allmodconfig Tested: https://android-review.googlesource.com/253302 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: core: add UID to flows, rules, and routesLorenzo Colitti2016-11-041-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | - Define a new FIB rule attributes, FRA_UID_RANGE, to describe a range of UIDs. - Define a RTA_UID attribute for per-UID route lookups and dumps. - Support passing these attributes to and from userspace via rtnetlink. The value INVALID_UID indicates no UID was specified. - Add a UID field to the flow structures. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Enable support for VRF with ipv4 multicastDavid Ahern2016-11-011-15/+26
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | Enable support for IPv4 multicast: - similar to unicast the flow struct is updated to L3 master device if relevant prior to calling fib_rules_lookup. The table id is saved to the lookup arg so the rule action for ipmr can return the table associated with the device. - ip_mr_forward needs to check for master device mismatch as well since the skb->dev is set to it - allow multicast address on VRF device for Rx by checking for the daddr in the VRF device as well as the original ingress device - on Tx need to drop to __mkroute_output when FIB lookup fails for multicast destination address. - if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled VRF driver creates IPMR FIB rules on first device create similar to FIB rules. In addition the VRF driver does not divert IPv4 multicast packets: it breaks on Tx since the fib lookup fails on the mcast address. With this patch, ipmr forwarding and local rx/tx work. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv4: Do not drop to make_route if oif is l3mdevDavid Ahern2016-10-131-1/+2
| | | | | | | | | | | | | Commit e0d56fdd7342 was a bit aggressive removing l3mdev calls in the IPv4 stack. If the fib_lookup fails we do not want to drop to make_route if the oif is an l3mdev device. Also reverts 19664c6a0009 ("net: l3mdev: Remove netif_index_is_l3_master") which removed netif_index_is_l3_master. Fixes: e0d56fdd7342 ("net: l3mdev: remove redundant calls") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-10-021-1/+2
|\ | | | | | | | | | | Three sets of overlapping changes. Nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_routeNikolay Aleksandrov2016-09-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the commit below the ipmr/ip6mr rtnl_unicast() code uses the portid instead of the previous dst_pid which was copied from in_skb's portid. Since the skb is new the portid is 0 at that point so the packets are sent to the kernel and we get scheduling while atomic or a deadlock (depending on where it happens) by trying to acquire rtnl two times. Also since this is RTM_GETROUTE, it can be triggered by a normal user. Here's the sleeping while atomic trace: [ 7858.212557] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620 [ 7858.212748] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0 [ 7858.212881] 2 locks held by swapper/0/0: [ 7858.213013] #0: (((&mrt->ipmr_expire_timer))){+.-...}, at: [<ffffffff810fbbf5>] call_timer_fn+0x5/0x350 [ 7858.213422] #1: (mfc_unres_lock){+.....}, at: [<ffffffff8161e005>] ipmr_expire_process+0x25/0x130 [ 7858.213807] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc7+ #179 [ 7858.213934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 7858.214108] 0000000000000000 ffff88005b403c50 ffffffff813a7804 0000000000000000 [ 7858.214412] ffffffff81a1338e ffff88005b403c78 ffffffff810a4a72 ffffffff81a1338e [ 7858.214716] 000000000000026c 0000000000000000 ffff88005b403ca8 ffffffff810a4b9f [ 7858.215251] Call Trace: [ 7858.215412] <IRQ> [<ffffffff813a7804>] dump_stack+0x85/0xc1 [ 7858.215662] [<ffffffff810a4a72>] ___might_sleep+0x192/0x250 [ 7858.215868] [<ffffffff810a4b9f>] __might_sleep+0x6f/0x100 [ 7858.216072] [<ffffffff8165bea3>] mutex_lock_nested+0x33/0x4d0 [ 7858.216279] [<ffffffff815a7a5f>] ? netlink_lookup+0x25f/0x460 [ 7858.216487] [<ffffffff8157474b>] rtnetlink_rcv+0x1b/0x40 [ 7858.216687] [<ffffffff815a9a0c>] netlink_unicast+0x19c/0x260 [ 7858.216900] [<ffffffff81573c70>] rtnl_unicast+0x20/0x30 [ 7858.217128] [<ffffffff8161cd39>] ipmr_destroy_unres+0xa9/0xf0 [ 7858.217351] [<ffffffff8161e06f>] ipmr_expire_process+0x8f/0x130 [ 7858.217581] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180 [ 7858.217785] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180 [ 7858.217990] [<ffffffff810fbc95>] call_timer_fn+0xa5/0x350 [ 7858.218192] [<ffffffff810fbbf5>] ? call_timer_fn+0x5/0x350 [ 7858.218415] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180 [ 7858.218656] [<ffffffff810fde10>] run_timer_softirq+0x260/0x640 [ 7858.218865] [<ffffffff8166379b>] ? __do_softirq+0xbb/0x54f [ 7858.219068] [<ffffffff816637c8>] __do_softirq+0xe8/0x54f [ 7858.219269] [<ffffffff8107a948>] irq_exit+0xb8/0xc0 [ 7858.219463] [<ffffffff81663452>] smp_apic_timer_interrupt+0x42/0x50 [ 7858.219678] [<ffffffff816625bc>] apic_timer_interrupt+0x8c/0xa0 [ 7858.219897] <EOI> [<ffffffff81055f16>] ? native_safe_halt+0x6/0x10 [ 7858.220165] [<ffffffff810d64dd>] ? trace_hardirqs_on+0xd/0x10 [ 7858.220373] [<ffffffff810298e3>] default_idle+0x23/0x190 [ 7858.220574] [<ffffffff8102a20f>] arch_cpu_idle+0xf/0x20 [ 7858.220790] [<ffffffff810c9f8c>] default_idle_call+0x4c/0x60 [ 7858.221016] [<ffffffff810ca33b>] cpu_startup_entry+0x39b/0x4d0 [ 7858.221257] [<ffffffff8164f995>] rest_init+0x135/0x140 [ 7858.221469] [<ffffffff81f83014>] start_kernel+0x50e/0x51b [ 7858.221670] [<ffffffff81f82120>] ? early_idt_handler_array+0x120/0x120 [ 7858.221894] [<ffffffff81f8243f>] x86_64_start_reservations+0x2a/0x2c [ 7858.222113] [<ffffffff81f8257c>] x86_64_start_kernel+0x13b/0x14a Fixes: 2942e9005056 ("[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-09-231-2/+8
|\ \ | |/
OpenPOWER on IntegriCloud