summaryrefslogtreecommitdiffstats
path: root/net/ipv6/addrconf.c
Commit message (Collapse)AuthorAgeFilesLines
* netconf: add macro to represent all attributesZhang Shengju2016-03-131-15/+21
| | | | | | | | This patch adds macro NETCONFA_ALL to represent all type of netconf attributes for IPv4 and IPv6. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv6: Fix refcnt on host routesDavid Ahern2016-03-031-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Andrew and Ying Huang's test robot both reported usage count problems that trace back to the 'keep address on ifdown' patch. >From Andrew: We execute CRIU test on linux-next. On the current linux-next kernel they hangs on creating a network namespace. The kernel log contains many massages like this: [ 1036.122108] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 1046.165156] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 1056.210287] unregister_netdevice: waiting for lo to become free. Usage count = 2 I tried to revert this patch and the bug disappeared. Here is a set of commands to reproduce this bug: [root@linux-next-test linux-next]# uname -a Linux linux-next-test 4.5.0-rc6-next-20160301+ #3 SMP Wed Mar 2 17:32:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux [root@linux-next-test ~]# unshare -n [root@linux-next-test ~]# ip link set up dev lo [root@linux-next-test ~]# ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever [root@linux-next-test ~]# logout [root@linux-next-test ~]# unshare -n ----- The problem is a change made to RTM_DELADDR case in __ipv6_ifa_notify that was added in an early version of the offending patch and is no longer needed. Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional") Cc: Andrey Wagin <avagin@gmail.com> Cc: Ying Huang <ying.huang@linux.intel.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: Jeremiah Mahler <jmmahler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv6/l3mdev: Move host route on saved address if necessaryDavid Ahern2016-03-011-0/+26
| | | | | | | | | | | | | Commit f1705ec197e70 allows IPv6 addresses to be retained on a link down. The address can have a cached host route which can point to the wrong FIB table if the L3 enslavement is changed (e.g., route can point to local table instead of VRF table if device is added to an L3 domain). On link up check the table of the cached host route against the FIB table associated with the device and correct if needed. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv6: Make address flushing on ifdown optionalDavid Ahern2016-02-251-15/+121
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, all ipv6 addresses are flushed when the interface is configured down, including global, static addresses: $ ip -6 addr show dev eth1 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000 inet6 2100:1::2/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe79:34bd/64 scope link valid_lft forever preferred_lft forever $ ip link set dev eth1 down $ ip -6 addr show dev eth1 << nothing; all addresses have been flushed>> Add a new sysctl to make this behavior optional. The new setting defaults to flush all addresses to maintain backwards compatibility. When the set global addresses with no expire times are not flushed on an admin down. The sysctl is per-interface or system-wide for all interfaces $ sysctl -w net.ipv6.conf.eth1.keep_addr_on_down=1 or $ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1 Will keep addresses on eth1 on an admin down. $ ip -6 addr show dev eth1 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000 inet6 2100:1::2/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe79:34bd/64 scope link valid_lft forever preferred_lft forever $ ip link set dev eth1 down $ ip -6 addr show dev eth1 3: eth1: <BROADCAST,MULTICAST> mtu 1500 state DOWN qlen 1000 inet6 2100:1::2/120 scope global tentative valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe79:34bd/64 scope link tentative valid_lft forever preferred_lft forever Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-02-231-2/+5
|\ | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/phy/bcm7xxx.c drivers/net/phy/marvell.c drivers/net/vxlan.c All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * rtnl: RTM_GETNETCONF: fix wrong return valueAnton Protopopov2016-02-191-1/+1
| | | | | | | | | | | | | | | | | | An error response from a RTM_GETNETCONF request can return the positive error value EINVAL in the struct nlmsgerr that can mislead userspace. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: addrconf: Fix recursive spin lock callsubashab@codeaurora.org2016-02-061-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A rcu stall with the following backtrace was seen on a system with forwarding, optimistic_dad and use_optimistic set. To reproduce, set these flags and allow ipv6 autoconf. This occurs because the device write_lock is acquired while already holding the read_lock. Back trace below - INFO: rcu_preempt self-detected stall on CPU { 1} (t=2100 jiffies g=3992 c=3991 q=4471) <6> Task dump for CPU 1: <2> kworker/1:0 R running task 12168 15 2 0x00000002 <2> Workqueue: ipv6_addrconf addrconf_dad_work <6> Call trace: <2> [<ffffffc000084da8>] el1_irq+0x68/0xdc <2> [<ffffffc000cc4e0c>] _raw_write_lock_bh+0x20/0x30 <2> [<ffffffc000bc5dd8>] __ipv6_dev_ac_inc+0x64/0x1b4 <2> [<ffffffc000bcbd2c>] addrconf_join_anycast+0x9c/0xc4 <2> [<ffffffc000bcf9f0>] __ipv6_ifa_notify+0x160/0x29c <2> [<ffffffc000bcfb7c>] ipv6_ifa_notify+0x50/0x70 <2> [<ffffffc000bd035c>] addrconf_dad_work+0x314/0x334 <2> [<ffffffc0000b64c8>] process_one_work+0x244/0x3fc <2> [<ffffffc0000b7324>] worker_thread+0x2f8/0x418 <2> [<ffffffc0000bb40c>] kthread+0xe0/0xec v2: do addrconf_dad_kick inside read lock and then acquire write lock for ipv6_ifa_notify as suggested by Eric Fixes: 7fd2561e4ebdd ("net: ipv6: Add a sysctl to make optimistic addresses useful candidates") Cc: Eric Dumazet <edumazet@google.com> Cc: Erik Kline <ek@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: add option to drop unsolicited neighbor advertisementsJohannes Berg2016-02-111-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | In certain 802.11 wireless deployments, there will be NA proxies that use knowledge of the network to correctly answer requests. To prevent unsolicitd advertisements on the shared medium from being a problem, on such deployments wireless needs to drop them. Enable this by providing an option called "drop_unsolicited_na". Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: add option to drop unicast encapsulated in L2 multicastJohannes Berg2016-02-111-0/+8
|/ | | | | | | | | | | | | | In order to solve a problem with 802.11, the so-called hole-196 attack, add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if enabled, causes the stack to drop IPv6 unicast packets encapsulated in link-layer multi- or broadcast frames. Such frames can (as an attack) be created by any member of the same wireless network and transmitted as valid encrypted frames since the symmetric key for broadcast frames is shared between all stations. Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: always add flag an address that failed DAD with DADFAILEDLubomir Rintel2016-01-101-2/+3
| | | | | | | | | | | | | The userspace needs to know why is the address being removed so that it can perhaps obtain a new address. Without the DADFAILED flag it's impossible to distinguish removal of a temporary and tentative address due to DAD failure from other reasons (device removed, manual address removal). Signed-off-by: Lubomir Rintel <lkundrak@v3.sk> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-12-311-7/+4
|\
| * addrconf: always initialize sysctl table dataWANG Cong2015-12-221-7/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | When sysctl performs restrict writes, it allows to write from a middle position of a sysctl file, which requires us to initialize the table data before calling proc_dostring() for the write case. Fixes: 3d1bec99320d ("ipv6: introduce secret_stable to ipv6_devconf") Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: addrconf: use stable address generator for ARPHRD_NONEBjørn Mork2015-12-181-6/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new address generator mode, using the stable address generator with an automatically generated secret. This is intended as a default address generator mode for device types with no EUI64 implementation. The new generator is used for ARPHRD_NONE interfaces initially, adding default IPv6 autoconf support to e.g. tun interfaces. If the addrgenmode is set to 'random', either by default or manually, and no stable secret is available, then a random secret is used as input for the stable-privacy address generator. The secret can be read and modified like manually configured secrets, using the proc interface. Modifying the secret will change the addrgen mode to 'stable-privacy' to indicate that it operates on a known secret. Existing behaviour of the 'stable-privacy' mode is kept unchanged. If a known secret is available when the device is created, then the mode will default to 'stable-privacy' as before. The mode can be manually set to 'random' but it will behave exactly like 'stable-privacy' in this case. The secret will not change. Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: 吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Bjørn Mork <bjorn@mork.no> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-12-171-1/+7
|\ \ | |/ | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/geneve.c Here we had an overlapping change, where in 'net' the extraneous stats bump was being removed whilst in 'net-next' the final argument to udp_tunnel6_xmit_skb() was being changed. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: automatically enable stable privacy mode if stable_secret setHannes Frederic Sowa2015-12-151-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | Bjørn reported that while we switch all interfaces to privacy stable mode when setting the secret, we don't set this mode for new interfaces. This does not make sense, so change this behaviour. Fixes: 622c81d57b392cc ("ipv6: generation of stable privacy addresses for link-local and autoconf") Reported-by: Bjørn Mork <bjorn@mork.no> Cc: Bjørn Mork <bjorn@mork.no> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: keep existing flags when setting IFA_F_OPTIMISTICBjørn Mork2015-12-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 64236f3f3d74 ("ipv6: introduce IFA_F_STABLE_PRIVACY flag") failed to update the setting of the IFA_F_OPTIMISTIC flag, causing the IFA_F_STABLE_PRIVACY flag to be lost if IFA_F_OPTIMISTIC is set. Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Fixes: 64236f3f3d74 ("ipv6: introduce IFA_F_STABLE_PRIVACY flag") Signed-off-by: Bjørn Mork <bjorn@mork.no> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: addrconf: drop ieee802154 specific thingsAlexander Aring2015-12-141-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes ARPHRD_IEEE802154 from addrconf handling. In the earlier days of 802.15.4 6LoWPAN, the interface type was ARPHRD_IEEE802154 which introduced several issues, because 802.15.4 interfaces used the same type. Since commit 965e613d299c ("ieee802154: 6lowpan: fix ARPHRD to ARPHRD_6LOWPAN") we use ARPHRD_6LOWPAN for 6LoWPAN interfaces. This patch will remove ARPHRD_IEEE802154 which is currently deadcode, because ARPHRD_IEEE802154 doesn't reach the minimum 1280 MTU of IPv6. Also we use 6LoWPAN EUI64 specific defines instead using link-layer constanst from 802.15.4 link-layer header. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Only act upon NETDEV_*_TYPE_CHANGE if we have ipv6 addressesAndrew Lunn2015-12-051-1/+2
| | | | | | | | | | | | | | | | An interface changing type may not have IPv6 addresses. Don't call the address configuration type change in this case. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-12-031-1/+1
|\ \ | |/ | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/renesas/ravb_main.c kernel/bpf/syscall.c net/ipv4/ipmr.c All three conflicts were cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * Revert "ipv6: ndisc: inherit metadata dst when creating ndisc requests"Nicolas Dichtel2015-12-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit ab450605b35caa768ca33e86db9403229bf42be4. In IPv6, we cannot inherit the dst of the original dst. ndisc packets are IPv6 packets and may take another route than the original packet. This patch breaks the following scenario: a packet comes from eth0 and is forwarded through vxlan1. The encapsulated packet triggers an NS which cannot be sent because of the wrong route. CC: Jiri Benc <jbenc@redhat.com> CC: Thomas Graf <tgraf@suug.ch> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv6: restrict hop_limit sysctl setting to range [1; 255]Phil Sutter2015-12-031-1/+15
|/ | | | | | | | | | | | | | | | | | | Setting a value bigger than 255 resulted in using only the lower eight bits of that value as it is assigned to the u8 header field. To avoid this unexpected result, reject such values. Setting a value of zero is technically possible, but hosts receiving such a packet have to treat it like hop_limit was set to one, according to RFC2460. Therefore I don't see a use-case for that. Setting a route's hop_limit to zero in iproute2 means to use the sysctl default, which is not the case here: Setting e.g. net.conf.eth0.hop_limit=0 will not make the kernel use net.conf.all.hop_limit for outgoing packets on eth0. To avoid these kinds of confusion, reject zero. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: clean up dev_snmp6 proc entry when we fail to initialize inet6_devSabrina Dubroca2015-11-041-0/+1
| | | | | | | | | | | In ipv6_add_dev, when addrconf_sysctl_register fails, we do not clean up the dev_snmp6 entry that we have already registered for this device. Call snmp6_unregister_dev in this case. Fixes: a317a2f19da7d ("ipv6: fail early when creating netdev named all or default") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: recreate ipv6 link-local addresses when increasing MTU over IPV6_MIN_MTUAlexander Duyck2015-10-301-19/+27
| | | | | | | | | This change makes it so that we reinitialize the interface if the MTU is increased back above IPV6_MIN_MTU and the interface is up. Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netlink: Rightsize IFLA_AF_SPEC size calculationArad, Ronen2015-10-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | if_nlmsg_size() overestimates the minimum allocation size of netlink dump request (when called from rtnl_calcit()) or the size of the message (when called from rtnl_getlink()). This is because ext_filter_mask is not supported by rtnl_link_get_af_size() and rtnl_link_get_size(). The over-estimation is significant when at least one netdev has many VLANs configured (8 bytes for each configured VLAN). This patch-set "rightsizes" the protocol specific attribute size calculation by propagating ext_filter_mask to rtnl_link_get_af_size() and adding this a argument to get_link_af_size op in rtnl_af_ops. Bridge module already used filtering aware sizing for notifications. br_get_link_af_size_filtered() is consistent with the modified get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops. br_get_link_af_size() becomes unused and thus removed. Signed-off-by: Ronen Arad <ronen.arad@intel.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-10-201-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/usb/asix_common.c net/ipv4/inet_connection_sock.c net/switchdev/switchdev.c In the inet_connection_sock.c case the request socket hashing scheme is completely different in net-next. The other two conflicts were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: gre: setup default multicast routes over PtP linksHannes Frederic Sowa2015-10-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | GRE point-to-point interfaces should also support ipv6 multicast. Setting up default multicast routes on interface creation was forgotten. Add it. Bugzilla: <https://bugzilla.kernel.org/show_bug.cgi?id=103231> Cc: Julien Muchembled <jm@jmuchemb.eu> Cc: Eric Dumazet <edumazet@google.com> Cc: Nicolas Dumazet <ndumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Add VRF support to IPv6 stackDavid Ahern2015-10-131-3/+9
| | | | | | | | | | | | | | | | | | | | As with IPv4 support for VRFs added to IPv6 stack by replacing hardcoded table ids with possibly device specific ones and manipulating the oif in the flowi6. The flow flags are used to skip oif compare in nexthop lookups if the device is enslaved to a VRF via the L3 master device. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-09-261-4/+3
|\ \ | |/ | | | | | | | | | | | | | | | | Conflicts: net/ipv4/arp.c The net/ipv4/arp.c conflict was one commit adding a new local variable while another commit was deleting one. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: Avoid double dst_freeMartin KaFai Lau2015-09-151-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is a prep work to get dst freeing from fib tree undergo a rcu grace period. The following is a common paradigm: if (ip6_del_rt(rt)) dst_free(rt) which means, if rt cannot be deleted from the fib tree, dst_free(rt) now. 1. We don't know the ip6_del_rt(rt) failure is because it was not managed by fib tree (e.g. DST_NOCACHE) or it had already been removed from the fib tree. 2. If rt had been managed by the fib tree, ip6_del_rt(rt) failure means dst_free(rt) has been called already. A second dst_free(rt) is not always obviously safe. The rt may have been destroyed already. 3. If rt is a DST_NOCACHE, dst_free(rt) should not be called. 4. It is a stopper to make dst freeing from fib tree undergo a rcu grace period. This patch is to use a DST_NOCACHE flag to indicate a rt is not managed by the fib tree. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: remove unused neigh parameter from ndisc functionsJiri Benc2015-09-241-1/+1
| | | | | | | | | | | | | | | | | | | | Since commit 12fd84f4383b1 ("ipv6: Remove unused neigh argument for icmp6_dst_alloc() and its callers."), the neigh parameter of ndisc_send_na and ndisc_send_ns is unused. CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 statsSowmini Varadhan2015-09-151-4/+9
|/ | | | | | | | | | | | | | | | | | Many commonly used functions like getifaddrs() invoke RTM_GETLINK to dump the interface information, and do not need the the AF_INET6 statististics that are always returned by default from rtnl_fill_ifinfo(). Computing the statistics can be an expensive operation that impacts scaling, so it is desirable to avoid this if the information is not needed. This patch adds a the RTEXT_FILTER_SKIP_STATS extended info flag that can be passed with netlink_request() to avoid statistics computation for the ifinfo path. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge tag 'for-linus' of ↵Linus Torvalds2015-09-091-31/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull inifiniband/rdma updates from Doug Ledford: "This is a fairly sizeable set of changes. I've put them through a decent amount of testing prior to sending the pull request due to that. There are still a few fixups that I know are coming, but I wanted to go ahead and get the big, sizable chunk into your hands sooner rather than waiting for those last few fixups. Of note is the fact that this creates what is intended to be a temporary area in the drivers/staging tree specifically for some cleanups and additions that are coming for the RDMA stack. We deprecated two drivers (ipath and amso1100) and are waiting to hear back if we can deprecate another one (ehca). We also put Intel's new hfi1 driver into this area because it needs to be refactored and a transfer library created out of the factored out code, and then it and the qib driver and the soft-roce driver should all be modified to use that library. I expect drivers/staging/rdma to be around for three or four kernel releases and then to go away as all of the work is completed and final deletions of deprecated drivers are done. Summary of changes for 4.3: - Create drivers/staging/rdma - Move amso1100 driver to staging/rdma and schedule for deletion - Move ipath driver to staging/rdma and schedule for deletion - Add hfi1 driver to staging/rdma and set TODO for move to regular tree - Initial support for namespaces to be used on RDMA devices - Add RoCE GID table handling to the RDMA core caching code - Infrastructure to support handling of devices with differing read and write scatter gather capabilities - Various iSER updates - Kill off unsafe usage of global mr registrations - Update SRP driver - Misc mlx4 driver updates - Support for the mr_alloc verb - Support for a netlink interface between kernel and user space cache daemon to speed path record queries and route resolution - Ininitial support for safe hot removal of verbs devices" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits) IB/ipoib: Suppress warning for send only join failures IB/ipoib: Clean up send-only multicast joins IB/srp: Fix possible protection fault IB/core: Move SM class defines from ib_mad.h to ib_smi.h IB/core: Remove unnecessary defines from ib_mad.h IB/hfi1: Add PSM2 user space header to header_install IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY mlx5: Fix incorrect wc pkey_index assignment for GSI messages IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow IB/uverbs: reject invalid or unknown opcodes IB/cxgb4: Fix if statement in pick_local_ip6adddrs IB/sa: Fix rdma netlink message flags IB/ucma: HW Device hot-removal support IB/mlx4_ib: Disassociate support IB/uverbs: Enable device removal when there are active user space applications IB/uverbs: Explicitly pass ib_dev to uverbs commands IB/uverbs: Fix race between ib_uverbs_open and remove_one IB/uverbs: Fix reference counting usage of event files IB/core: Make ib_dealloc_pd return void IB/srp: Create an insecure all physical rkey only if needed ...
| * net/ipv6: Export addrconf_ifid_eui48Matan Barak2015-08-301-31/+0
| | | | | | | | | | | | | | | | | | | | | | | | For loopback purposes, RoCE devices should have a default GID in the port GID table, even when the interface is down. In order to do so, we use the IPv6 link local address which would have been genenrated for the related Ethernet netdevice when it goes up as a default GID. addrconf_ifid_eui48 is used to gernerate this address, export it. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* | net: Optimize snmp stat aggregation by walking all the percpu data at onceRaghavendra K T2015-08-301-10/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field. reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data of an item (iteratively for around 36 items). idea: This patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Docker creation got faster by more than 2x after the patch. Result: Before After Docker creation time 6.836s 3.25s cache miss 2.7% 1.41% perf before: 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock perf after: 10.57% docker docker [.] scanblock 8.37% swapper [kernel.kallsyms] [k] snooze_loop 6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field 6.67% docker [kernel.kallsyms] [k] veth_stats_one changes/ideas suggested: Using buffer in stack (Eric), Usage of memset (David), Using memcpy in place of unaligned_put (Joe). Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: ndisc: inherit metadata dst when creating ndisc requestsJiri Benc2015-08-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | If output device wants to see the dst, inherit the dst of the original skb in the ndisc request. This is an IPv6 counterpart of commit 0accfc268f4d ("arp: Inherit metadata dst when creating ARP requests"). Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: addr IFLA_OPERSTATE to netlink message for ipv6 ifinfoAndy Gospodarek2015-08-131-1/+4
| | | | | | | | | | | | | | | | | | This is useful information to include in ipv6 netlink messages that report interface information. IFLA_OPERSTATE is already included in ipv4 messages, but missing for ipv6. This closes that gap. Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv6 sysctl option to ignore routes when nexthop link is downAndy Gospodarek2015-08-131-1/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Like the ipv4 patch with a similar title, this adds a sysctl to allow the user to change routing behavior based on whether or not the interface associated with the nexthop was an up or down link. The default setting preserves the current behavior, but anyone that enables it will notice that nexthops on down interfaces will no longer be selected: net.ipv6.conf.all.ignore_routes_with_linkdown = 0 net.ipv6.conf.default.ignore_routes_with_linkdown = 0 net.ipv6.conf.lo.ignore_routes_with_linkdown = 0 ... When the above sysctls are set, not only will link status be reported to userspace, but an indication that a nexthop is dead and will not be used is also reported. 1000::/8 via 7000::2 dev p7p1 metric 1024 dead linkdown pref medium 1000::/8 via 8000::2 dev p8p1 metric 1024 pref medium 7000::/8 dev p7p1 proto kernel metric 256 dead linkdown pref medium 8000::/8 dev p8p1 proto kernel metric 256 pref medium 9000::/8 via 8000::2 dev p8p1 metric 2048 pref medium 9000::/8 via 7000::2 dev p7p1 metric 1024 dead linkdown pref medium fe80::/64 dev p7p1 proto kernel metric 256 dead linkdown pref medium fe80::/64 dev p8p1 proto kernel metric 256 pref medium This also adds devconf support and notification when sysctl values change. v2: drop use of rt6i_nhflags since it is not needed right now Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/ipv6: add sysctl option accept_ra_min_hop_limitHangbin Liu2015-07-301-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6fd99094de2b ("ipv6: Don't reduce hop limit for an interface") disabled accept hop limit from RA if it is smaller than the current hop limit for security stuff. But this behavior kind of break the RFC definition. RFC 4861, 6.3.4. Processing Received Router Advertisements A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time, and Retrans Timer) may contain a value denoting that it is unspecified. In such cases, the parameter should be ignored and the host should continue using whatever value it is already using. If the received Cur Hop Limit value is non-zero, the host SHOULD set its CurHopLimit variable to the received value. So add sysctl option accept_ra_min_hop_limit to let user choose the minimum hop limit value they can accept from RA. And set default to 1 to meet RFC standards. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: sysctl to restrict candidate source addressesErik Kline2015-07-221-3/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | Per RFC 6724, section 4, "Candidate Source Addresses": It is RECOMMENDED that the candidate source addresses be the set of unicast addresses assigned to the interface that will be used to send to the destination (the "outgoing" interface). Add a sysctl to enable this behaviour. Signed-off-by: Erik Kline <ek@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Remove unused arguments for __ipv6_dev_get_saddr().YOSHIFUJI Hideaki2015-07-161-4/+2
| | | | | | | | | | Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Fix finding best source address in ipv6_dev_get_saddr().YOSHIFUJI Hideaki/吉藤英明2015-07-151-12/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 9131f3de2 ("ipv6: Do not iterate over all interfaces when finding source address on specific interface.") did not properly update best source address available. Plus, it introduced possible NULL pointer dereference. Bug was reported by Erik Kline <ek@google.com>. Based on patch proposed by Hajime Tazaki <thehajime@gmail.com>. Fixes: 9131f3de24db4dc12199aede7d931e6703e97f3b ("ipv6: Do not iterate over all interfaces when finding source address on specific interface.") Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com> Acked-by: Hajime Tazaki <thehajime@gmail.com> Acked-by: Erik Kline <ek@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Do not iterate over all interfaces when finding source address on ↵YOSHIFUJI Hideaki/吉藤英明2015-07-101-90/+107
|/ | | | | | | | | | | | specific interface. If outgoing interface is specified and the candidate address is restricted to the outgoing interface, it is enough to iterate over that given interface only. Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com> Acked-by: Erik Kline <ek@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Consider RTF_CACHE when searching the fib6 treeMartin KaFai Lau2015-05-011-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is a prep work for the later bug-fix patch which will stop /128 route from disappearing after pmtu update. The later bug-fix patch will allow a /128 route and its RTF_CACHE clone both exist at the same fib6_node. To do this, we need to prepare the existing fib6 tree search to expect RTF_CACHE for /128 route. Note that the fn->leaf is sorted by rt6i_metric. Hence, RTF_CACHE (if there is any) is always at the front. This property leads to the following: 1. When doing ip6_route_del(), it should honor the RTF_CACHE flag which the caller is used to ask for deleting clone or non-clone. The rtm_to_fib6_config() should also check the RTM_F_CLONED and then set RTF_CACHE accordingly so that: - 'ip -6 r del...' will make ip6_route_del() to delete a route and all its clones. Note that its clones is flushed by fib6_del() - 'ip -6 r flush table cache' will make ip6_route_del() to only delete clone(s). 2. Exclude RTF_CACHE from addrconf_get_prefix_route() which should not configure on a cloned route. 3. No change is need for rt6_device_match() since it currently could return a RTF_CACHE clone route, so the later bug-fix patch will not affect it. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* dev: introduce dev_get_iflink()Nicolas Dichtel2015-04-021-2/+2
| | | | | | | | | | | The goal of this patch is to prepare the removal of the iflink field. It introduces a new ndo function, which will be implemented by virtual interfaces. There is no functional change into this patch. All readers of iflink field now call dev_get_iflink(). Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netlink: implement nla_put_in_addr and nla_put_in6_addrJiri Benc2015-03-311-5/+5
| | | | | | | | | | | | IP addresses are often stored in netlink attributes. Add generic functions to do that. For nla_put_in_addr, it would be nicer to pass struct in_addr but this is not used universally throughout the kernel, in way too many places __be32 is used to store IPv4 address. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: coding style: comparison for equality with NULLIan Morris2015-03-311-44/+44
| | | | | | | | | | | | The ipv6 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x == NULL and sometimes as !x. !x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: fix sparse warnings in privacy stable addresses generationHannes Frederic Sowa2015-03-241-5/+5
| | | | | | | | | | | Those warnings reported by sparse endianness check (via kbuild test robot) are harmless, nevertheless fix them up and make the code a little bit easier to read. Reported-by: kbuild test robot <fengguang.wu@intel.com> Fixes: 622c81d57b392cc ("ipv6: generation of stable privacy addresses for link-local and autoconf") Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: introduce idgen_delay and idgen_retries knobsHannes Frederic Sowa2015-03-231-7/+4
| | | | | | | | | | | This is specified by RFC 7217. Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: do retries on stable privacy addressesHannes Frederic Sowa2015-03-231-3/+54
| | | | | | | | | | | | | | | If a DAD conflict is detected, we want to retry privacy stable address generation up to idgen_retries (= 3) times with a delay of idgen_delay (= 1 second). Add the logic to addrconf_dad_failure. By design, we don't clean up dad failed permanent addresses. Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: collapse state_lock and lockHannes Frederic Sowa2015-03-231-16/+15
| | | | | | | | | Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud