summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* Merge remote-tracking branch 'tip/timers/core' into afs-nextDavid Howells2017-11-139-26/+27
|\ | | | | | | | | | | These AFS patches need the timer_reduce() patch from timers/core. Signed-off-by: David Howells <dhowells@redhat.com>
| * netfilter: ipvs: Convert timers to use timer_setup()Kees Cook2017-11-085-22/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Cc: Wensong Zhang <wensong@linux-vs.org> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: Florian Westphal <fw@strlen.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Cc: lvs-devel@vger.kernel.org Cc: netfilter-devel@vger.kernel.org Cc: coreteam@netfilter.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au>
| * timer: Remove expires and data arguments from DEFINE_TIMERKees Cook2017-10-054-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop the arguments from the macro and adjust all callers with the following script: perl -pi -e 's/DEFINE_TIMER\((.*), 0, 0\);/DEFINE_TIMER($1);/g;' \ $(git grep DEFINE_TIMER | cut -d: -f1 | sort -u | grep -v timer.h) Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # for m68k parts Acked-by: Guenter Roeck <linux@roeck-us.net> # for watchdog parts Acked-by: David S. Miller <davem@davemloft.net> # for networking parts Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Kalle Valo <kvalo@codeaurora.org> # for wireless parts Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: linux-mips@linux-mips.org Cc: Petr Mladek <pmladek@suse.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Sebastian Reichel <sre@kernel.org> Cc: Kalle Valo <kvalo@qca.qualcomm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: linux1394-devel@lists.sourceforge.net Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: linux-s390@vger.kernel.org Cc: linux-wireless@vger.kernel.org Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com> Cc: Wim Van Sebroeck <wim@iguana.be> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ursula Braun <ubraun@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Harish Patil <harish.patil@cavium.com> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Michael Reed <mdr@sgi.com> Cc: Manish Chopra <manish.chopra@cavium.com> Cc: Len Brown <len.brown@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-pm@vger.kernel.org Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Julian Wiedmann <jwi@linux.vnet.ibm.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Gross <mark.gross@intel.com> Cc: linux-watchdog@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: Guenter Roeck <linux@roeck-us.net> Cc: netdev@vger.kernel.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Link: https://lkml.kernel.org/r/1507159627-127660-11-git-send-email-keescook@chromium.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | ip6_tunnel: clean up ip4ip6 and ip6ip6's err_handlersXin Long2017-11-131-28/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch is to remove some useless codes of redirect and fix some indents on ip4ip6 and ip6ip6's err_handlers. Note that redirect icmp packet is already processed in ip6_tnl_err, the old redirect codes in ip4ip6_err actually never worked even before this patch. Besides, there's no need to send redirect to user's sk, it's for lower dst, so just remove it in this patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip6_tunnel: process toobig in a better wayXin Long2017-11-131-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | The same improvement in "ip6_gre: process toobig in a better way" is needed by ip4ip6 and ip6ip6 as well. Note that ip4ip6 and ip6ip6 will also update sk dst pmtu in their err_handlers. Like I said before, gre6 could not do this as it's inner proto is not certain. But for all of them, sk dst pmtu will be updated in tx path if in need. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip6_tunnel: add the process for redirect in ip6_tnl_errXin Long2017-11-131-5/+10
| | | | | | | | | | | | | | | | The same process for redirect in "ip6_gre: add the process for redirect in ip6gre_err" is needed by ip4ip6 and ip6ip6 as well. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip6_gre: process toobig in a better wayXin Long2017-11-131-13/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now ip6gre processes toobig icmp packet by setting gre dev's mtu in ip6gre_err, which would cause few things not good: - It couldn't set mtu with dev_set_mtu due to it's not in user context, which causes route cache and idev->cnf.mtu6 not to be updated. - It has to update sk dst pmtu in tx path according to gredev->mtu for ip6gre, while it updates pmtu again according to lower dst pmtu in ip6_tnl_xmit. - To change dev->mtu by toobig icmp packet is not a good idea, it should only work on pmtu. This patch is to process toobig by updating the lower dst's pmtu, as later sk dst pmtu will be updated in ip6_tnl_xmit, the same way as in ip4gre. Note that gre dev's mtu will not be updated any more, it doesn't make any sense to change dev's mtu after receiving a toobig packet. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip6_gre: add the process for redirect in ip6gre_errXin Long2017-11-131-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | This patch is to add redirect icmp packet process for ip6gre by calling ip6_redirect() in ip6gre_err(), as in vti6_err. Prior to this patch, there's even no route cache generated after receiving redirect. Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge tag 'nfc-next-4.15-1' of ↵David S. Miller2017-11-136-32/+52
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next Samuel Ortiz says: ==================== NFC 4.15 pull request This is the NFC pull request for 4.15. We have: - A new netlink command for explicitly deactivating NFC targets - i2c constification for all NFC drivers - One NFC device allocation error path fix ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | NFC: Add NFC_CMD_DEACTIVATE_TARGET supportMark Greer2017-11-101-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Once an NFC target (i.e., a tag) is found, it remains active until there is a failure reading or writing it (often caused by the target moving out of range). While the target is active, the NFC adapter and antenna must remain powered. This wastes power when the target remains in range but the client application no longer cares whether it is there or not. To mitigate this, add a new netlink command that allows userspace to deactivate an active target. When issued, this command will cause the NFC subsystem to act as though the target was moved out of range. Once the command has been executed, the client application can power off the NFC adapter to reduce power consumption. Signed-off-by: Mark Greer <mgreer@animalcreek.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * | NFC: digital: Abort cmd when deactivating targetMark Greer2017-11-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | When deactivating an active target, the outstanding command should be aborted. Signed-off-by: Mark Greer <mgreer@animalcreek.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * | NFC: Convert timers to use timer_setup()Allen Pais2017-11-064-31/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | Switch to using the new timer_setup() and from_timer() for net/nfc/* Signed-off-by: Allen Pais <allen.pais@oracle.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
| * | NFC: fix device-allocation error returnJohan Hovold2017-11-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A recent change fixing NFC device allocation itself introduced an error-handling bug by returning an error pointer in case device-id allocation failed. This is clearly broken as the callers still expected NULL to be returned on errors as detected by Dan's static checker. Fix this up by returning NULL in the event that we've run out of memory when allocating a new device id. Note that the offending commit is marked for stable (3.8) so this fix needs to be backported along with it. Fixes: 20777bc57c34 ("NFC: fix broken device allocation") Cc: stable <stable@vger.kernel.org> # 3.8 Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
* | | openvswitch: Add meter action supportAndy Zhou2017-11-133-0/+13
| | | | | | | | | | | | | | | | | | | | | Implements OVS kernel meter action support. Signed-off-by: Andy Zhou <azhou@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | openvswitch: Add meter infrastructureAndy Zhou2017-11-135-2/+674
| | | | | | | | | | | | | | | | | | | | | | | | | | | OVS kernel datapath so far does not support Openflow meter action. This is the first stab at adding kernel datapath meter support. This implementation supports only drop band type. Signed-off-by: Andy Zhou <azhou@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | openvswitch: export get_dp() API.Andy Zhou2017-11-132-29/+31
| | | | | | | | | | | | | | | | | | | | | Later patches will invoke get_dp() outside of datapath.c. Export it. Signed-off-by: Andy Zhou <azhou@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: dsa: Support prepended Broadcom tagFlorian Fainelli2017-11-135-7/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new type: DSA_TAG_PROTO_PREPEND which allows us to support for the 4-bytes Broadcom tag that we already support, but in a format where it is pre-pended to the packet instead of located between the MAC SA and the Ethertyper (DSA_TAG_PROTO_BRCM). Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: dsa: tag_brcm: Prepare for supporting prepended tagFlorian Fainelli2017-11-131-11/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for supporting the same Broadcom tag format, but instead of inserted between the MAC SA and EtherType, prepended to the Ethernet frame, restructure the code a little bit to make that possible and take an offset parameter. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: dsa: Pass a port to get_tag_protocol()Florian Fainelli2017-11-132-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A number of drivers want to check whether the configured CPU port is a possible configuration for enabling tagging, pass down the CPU port number so they verify that. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net/sched/sch_red.c: work around gcc-4.4.4 anon union initializer issueAndrew Morton2017-11-131-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gcc-4.4.4 (at lest) has issues with initializers and anonymous unions: net/sched/sch_red.c: In function 'red_dump_offload': net/sched/sch_red.c:282: error: unknown field 'stats' specified in initializer net/sched/sch_red.c:282: warning: initialization makes integer from pointer without a cast net/sched/sch_red.c:283: error: unknown field 'stats' specified in initializer net/sched/sch_red.c:283: warning: initialization makes integer from pointer without a cast net/sched/sch_red.c: In function 'red_dump_stats': net/sched/sch_red.c:352: error: unknown field 'xstats' specified in initializer net/sched/sch_red.c:352: warning: initialization makes integer from pointer without a cast Work around this. Fixes: 602f3baf2218 ("net_sch: red: Add offload ability to RED qdisc") Cc: Nogah Frankel <nogahf@mellanox.com> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Simon Horman <simon.horman@netronome.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | af_netlink: ensure that NLMSG_DONE never fails in dumpsJason A. Donenfeld2017-11-132-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The way people generally use netlink_dump is that they fill in the skb as much as possible, breaking when nla_put returns an error. Then, they get called again and start filling out the next skb, and again, and so forth. The mechanism at work here is the ability for the iterative dumping function to detect when the skb is filled up and not fill it past the brim, waiting for a fresh skb for the rest of the data. However, if the attributes are small and nicely packed, it is possible that a dump callback function successfully fills in attributes until the skb is of size 4080 (libmnl's default page-sized receive buffer size). The dump function completes, satisfied, and then, if it happens to be that this is actually the last skb, and no further ones are to be sent, then netlink_dump will add on the NLMSG_DONE part: nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI); It is very important that netlink_dump does this, of course. However, in this example, that call to nlmsg_put_answer will fail, because the previous filling by the dump function did not leave it enough room. And how could it possibly have done so? All of the nla_put variety of functions simply check to see if the skb has enough tailroom, independent of the context it is in. In order to keep the important assumptions of all netlink dump users, it is therefore important to give them an skb that has this end part of the tail already reserved, so that the call to nlmsg_put_answer does not fail. Otherwise, library authors are forced to find some bizarre sized receive buffer that has a large modulo relative to the common sizes of messages received, which is ugly and buggy. This patch thus saves the NLMSG_DONE for an additional message, for the case that things are dangerously close to the brim. This requires keeping track of the errno from ->dump() across calls. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | netem: support delivering packets in delayed time slotsDave Taht2017-11-131-3/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Slotting is a crude approximation of the behaviors of shared media such as cable, wifi, and LTE, which gather up a bunch of packets within a varying delay window and deliver them, relative to that, nearly all at once. It works within the existing loss, duplication, jitter and delay parameters of netem. Some amount of inherent latency must be specified, regardless. The new "slot" parameter specifies a minimum and maximum delay between transmission attempts. The "bytes" and "packets" parameters can be used to limit the amount of information transferred per slot. Examples of use: tc qdisc add dev eth0 root netem delay 200us \ slot 800us 10ms bytes 64k packets 42 A more correct example, using stacked netem instances and a packet limit to emulate a tail drop wifi queue with slots and variable packet delivery, with a 200Mbit isochronous underlying rate, and 20ms path delay: tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \ limit 10000 tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \ slot 800us 10ms bytes 64k packets 42 limit 512 Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | netem: add uapi to express delay and jitter in nanosecondsDave Taht2017-11-131-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | netem userspace has long relied on a horrible /proc/net/psched hack to translate the current notion of "ticks" to nanoseconds. Expressing latency and jitter instead, in well defined nanoseconds, increases the dynamic range of emulated delays and jitter in netem. It will also ease a transition where reducing a tick to nsec equivalence would constrain the max delay in prior versions of netem to only 4.3 seconds. Signed-off-by: Dave Taht <dave.taht@gmail.com> Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | netem: convert to qdisc_watchdog_schedule_nsDave Taht2017-11-131-28/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Upgrade the internal netem scheduler to use nanoseconds rather than ticks throughout. Convert to and from the std "ticks" userspace api automatically, while allowing for finer grained scheduling to take place. Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: try not to take rtnl_lock in ip6mr_sk_doneFrancesco Ruggeri2017-11-131-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid traversing the list of mr6_tables (which requires the rtnl_lock) in ip6mr_sk_done(), when we know in advance that a match will not be found. This can happen when rawv6_close()/ip6mr_sk_done() is invoked on non-mroute6 sockets. This patch helps reduce rtnl_lock contention when destroying a large number of net namespaces, each having a non-mroute6 raw socket. v2: same patch, only fixed subject line and expanded comment. Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-11-125-12/+23
|\ \ \
| * | | vlan: fix a use-after-free in vlan_device_event()Cong Wang2017-11-111-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After refcnt reaches zero, vlan_vid_del() could free dev->vlan_info via RCU: RCU_INIT_POINTER(dev->vlan_info, NULL); call_rcu(&vlan_info->rcu, vlan_info_rcu_free); However, the pointer 'grp' still points to that memory since it is set before vlan_vid_del(): vlan_info = rtnl_dereference(dev->vlan_info); if (!vlan_info) goto out; grp = &vlan_info->grp; Depends on when that RCU callback is scheduled, we could trigger a use-after-free in vlan_group_for_each_dev() right following this vlan_vid_del(). Fix it by moving vlan_vid_del() before setting grp. This is also symmetric to the vlan_vid_add() we call in vlan_device_event(). Reported-by: Fengguang Wu <fengguang.wu@intel.com> Fixes: efc73f4bbc23 ("net: Fix memory leak - vlan_info struct") Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Girish Moodalbail <girish.moodalbail@oracle.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Girish Moodalbail <girish.moodalbail@oracle.com> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: dsa: return after vlan prepare phaseVivien Didelot2017-11-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current code does not return after successfully preparing the VLAN addition on every ports member of a it. Fix this. Fixes: 1ca4aa9cd4cc ("net: dsa: check VLAN capability of every switch") Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: dsa: return after mdb prepare phaseVivien Didelot2017-11-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current code does not return after successfully preparing the MDB addition on every ports member of a multicast group. Fix this. Fixes: a1a6b7ea7f2d ("net: dsa: add cross-chip multicast support") Reported-by: Egil Hjelmeland <privat@egil-hjelmeland.no> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | tcp: fix tcp_fastretrans_alert warningYuchung Cheng2017-11-101-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the cause of an WARNING indicatng TCP has pending retransmission in Open state in tcp_fastretrans_alert(). The root cause is a bad interaction between path mtu probing, if enabled, and the RACK loss detection. Upong receiving a SACK above the sequence of the MTU probing packet, RACK could mark the probe packet lost in tcp_fastretrans_alert(), prior to calling tcp_simple_retransmit(). tcp_simple_retransmit() only enters Loss state if it newly marks the probe packet lost. If the probe packet is already identified as lost by RACK, the sender remains in Open state with some packets marked lost and retransmitted. Then the next SACK would trigger the warning. The likely scenario is that the probe packet was lost due to its size or network congestion. The actual impact of this warning is small by potentially entering fast recovery an ACK later. The simple fix is always entering recovery (Loss) state if some packet is marked lost during path MTU probing. Fixes: a0370b3f3f2c ("tcp: enable RACK loss detection to trigger recovery") Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reported-by: Roman Gushchin <guro@fb.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | tcp: gso: avoid refcount_t warning from tcp_gso_segment()Eric Dumazet2017-11-101-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a GSO skb of truesize O is segmented into 2 new skbs of truesize N1 and N2, we want to transfer socket ownership to the new fresh skbs. In order to avoid expensive atomic operations on a cache line subject to cache bouncing, we replace the sequence : refcount_add(N1, &sk->sk_wmem_alloc); refcount_add(N2, &sk->sk_wmem_alloc); // repeated by number of segments refcount_sub(O, &sk->sk_wmem_alloc); by a single refcount_add(sum_of(N) - O, &sk->sk_wmem_alloc); Problem is : In some pathological cases, sum(N) - O might be a negative number, and syzkaller bot was apparently able to trigger this trace [1] atomic_t was ok with this construct, but we need to take care of the negative delta with refcount_t [1] refcount_t: saturated; leaking memory. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 8404 at lib/refcount.c:77 refcount_add_not_zero+0x198/0x200 lib/refcount.c:77 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 8404 Comm: syz-executor2 Not tainted 4.14.0-rc5-mm1+ #20 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 panic+0x1e4/0x41c kernel/panic.c:183 __warn+0x1c4/0x1e0 kernel/panic.c:546 report_bug+0x211/0x2d0 lib/bug.c:183 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:177 do_trap_no_signal arch/x86/kernel/traps.c:211 [inline] do_trap+0x260/0x390 arch/x86/kernel/traps.c:260 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:297 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310 invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 RIP: 0010:refcount_add_not_zero+0x198/0x200 lib/refcount.c:77 RSP: 0018:ffff8801c606e3a0 EFLAGS: 00010282 RAX: 0000000000000026 RBX: 0000000000001401 RCX: 0000000000000000 RDX: 0000000000000026 RSI: ffffc900036fc000 RDI: ffffed0038c0dc68 RBP: ffff8801c606e430 R08: 0000000000000001 R09: 0000000000000000 R10: ffff8801d97f5eba R11: 0000000000000000 R12: ffff8801d5acf73c R13: 1ffff10038c0dc75 R14: 00000000ffffffff R15: 00000000fffff72f refcount_add+0x1b/0x60 lib/refcount.c:101 tcp_gso_segment+0x10d0/0x16b0 net/ipv4/tcp_offload.c:155 tcp4_gso_segment+0xd4/0x310 net/ipv4/tcp_offload.c:51 inet_gso_segment+0x60c/0x11c0 net/ipv4/af_inet.c:1271 skb_mac_gso_segment+0x33f/0x660 net/core/dev.c:2749 __skb_gso_segment+0x35f/0x7f0 net/core/dev.c:2821 skb_gso_segment include/linux/netdevice.h:3971 [inline] validate_xmit_skb+0x4ba/0xb20 net/core/dev.c:3074 __dev_queue_xmit+0xe49/0x2070 net/core/dev.c:3497 dev_queue_xmit+0x17/0x20 net/core/dev.c:3538 neigh_hh_output include/net/neighbour.h:471 [inline] neigh_output include/net/neighbour.h:479 [inline] ip_finish_output2+0xece/0x1460 net/ipv4/ip_output.c:229 ip_finish_output+0x85e/0xd10 net/ipv4/ip_output.c:317 NF_HOOK_COND include/linux/netfilter.h:238 [inline] ip_output+0x1cc/0x860 net/ipv4/ip_output.c:405 dst_output include/net/dst.h:459 [inline] ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124 ip_queue_xmit+0x8c6/0x18e0 net/ipv4/ip_output.c:504 tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1137 tcp_write_xmit+0x663/0x4de0 net/ipv4/tcp_output.c:2341 __tcp_push_pending_frames+0xa0/0x250 net/ipv4/tcp_output.c:2513 tcp_push_pending_frames include/net/tcp.h:1722 [inline] tcp_data_snd_check net/ipv4/tcp_input.c:5050 [inline] tcp_rcv_established+0x8c7/0x18a0 net/ipv4/tcp_input.c:5497 tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1460 sk_backlog_rcv include/net/sock.h:909 [inline] __release_sock+0x124/0x360 net/core/sock.c:2264 release_sock+0xa4/0x2a0 net/core/sock.c:2776 tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763 sock_sendmsg_nosec net/socket.c:632 [inline] sock_sendmsg+0xca/0x110 net/socket.c:642 ___sys_sendmsg+0x31c/0x890 net/socket.c:2048 __sys_sendmmsg+0x1e6/0x5f0 net/socket.c:2138 Fixes: 14afee4b6092 ("net: convert sock.sk_wmem_alloc from atomic_t to refcount_t") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | rds: ib: Fix NULL pointer dereference in debug codeHåkon Bugge2017-11-101-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rds_ib_recv_refill() is a function that refills an IB receive queue. It can be called from both the CQE handler (tasklet) and a worker thread. Just after the call to ib_post_recv(), a debug message is printed with rdsdebug(): ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr); rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv, recv->r_ibinc, sg_page(&recv->r_frag->f_sg), (long) ib_sg_dma_address( ic->i_cm_id->device, &recv->r_frag->f_sg), ret); Now consider an invocation of rds_ib_recv_refill() from the worker thread, which is preemptible. Further, assume that the worker thread is preempted between the ib_post_recv() and rdsdebug() statements. Then, if the preemption is due to a receive CQE event, the rds_ib_recv_cqe_handler() will be invoked. This function processes receive completions, including freeing up data structures, such as the recv->r_frag. In this scenario, rds_ib_recv_cqe_handler() will process the receive WR posted above. That implies, that the recv->r_frag has been freed before the above rdsdebug() statement has been executed. When it is later executed, we will have a NULL pointer dereference: [ 4088.068008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 [ 4088.076754] IP: rds_ib_recv_refill+0x87/0x620 [rds_rdma] [ 4088.082686] PGD 0 P4D 0 [ 4088.085515] Oops: 0000 [#1] SMP [ 4088.089015] Modules linked in: rds_rdma(OE) rds(OE) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) mlx4_ib(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) rdma_cm(E) ib_cm(E) iw_cm(E) ib_core(E) binfmt_misc(E) sb_edac(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) aesni_intel(E) crypto_simd(E) iTCO_wdt(E) glue_helper(E) iTCO_vendor_support(E) sg(E) cryptd(E) pcspkr(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) shpchp(E) ioatdma(E) i2c_i801(E) wmi(E) lpc_ich(E) mei_me(E) mei(E) mfd_core(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ip_tables(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) mgag200(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) [ 4088.168486] fb_sys_fops(E) ahci(E) ixgbe(E) libahci(E) ttm(E) mdio(E) ptp(E) pps_core(E) drm(E) sd_mod(E) libata(E) crc32c_intel(E) mlx4_core(E) i2c_core(E) dca(E) megaraid_sas(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: rds] [ 4088.193442] CPU: 20 PID: 1244 Comm: kworker/20:2 Tainted: G OE 4.14.0-rc7.master.20171105.ol7.x86_64 #1 [ 4088.205097] Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 31110000 03/03/2017 [ 4088.216074] Workqueue: ib_cm cm_work_handler [ib_cm] [ 4088.221614] task: ffff885fa11d0000 task.stack: ffffc9000e598000 [ 4088.228224] RIP: 0010:rds_ib_recv_refill+0x87/0x620 [rds_rdma] [ 4088.234736] RSP: 0018:ffffc9000e59bb68 EFLAGS: 00010286 [ 4088.240568] RAX: 0000000000000000 RBX: ffffc9002115d050 RCX: ffffc9002115d050 [ 4088.248535] RDX: ffffffffa0521380 RSI: ffffffffa0522158 RDI: ffffffffa0525580 [ 4088.256498] RBP: ffffc9000e59bbf8 R08: 0000000000000005 R09: 0000000000000000 [ 4088.264465] R10: 0000000000000339 R11: 0000000000000001 R12: 0000000000000000 [ 4088.272433] R13: ffff885f8c9d8000 R14: ffffffff81a0a060 R15: ffff884676268000 [ 4088.280397] FS: 0000000000000000(0000) GS:ffff885fbec80000(0000) knlGS:0000000000000000 [ 4088.289434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4088.295846] CR2: 0000000000000020 CR3: 0000000001e09005 CR4: 00000000001606e0 [ 4088.303816] Call Trace: [ 4088.306557] rds_ib_cm_connect_complete+0xe0/0x220 [rds_rdma] [ 4088.312982] ? __dynamic_pr_debug+0x8c/0xb0 [ 4088.317664] ? __queue_work+0x142/0x3c0 [ 4088.321944] rds_rdma_cm_event_handler+0x19e/0x250 [rds_rdma] [ 4088.328370] cma_ib_handler+0xcd/0x280 [rdma_cm] [ 4088.333522] cm_process_work+0x25/0x120 [ib_cm] [ 4088.338580] cm_work_handler+0xd6b/0x17aa [ib_cm] [ 4088.343832] process_one_work+0x149/0x360 [ 4088.348307] worker_thread+0x4d/0x3e0 [ 4088.352397] kthread+0x109/0x140 [ 4088.355996] ? rescuer_thread+0x380/0x380 [ 4088.360467] ? kthread_park+0x60/0x60 [ 4088.364563] ret_from_fork+0x25/0x30 [ 4088.368548] Code: 48 89 45 90 48 89 45 98 eb 4d 0f 1f 44 00 00 48 8b 43 08 48 89 d9 48 c7 c2 80 13 52 a0 48 c7 c6 58 21 52 a0 48 c7 c7 80 55 52 a0 <4c> 8b 48 20 44 89 64 24 08 48 8b 40 30 49 83 e1 fc 48 89 04 24 [ 4088.389612] RIP: rds_ib_recv_refill+0x87/0x620 [rds_rdma] RSP: ffffc9000e59bb68 [ 4088.397772] CR2: 0000000000000020 [ 4088.401505] ---[ end trace fe922e6ccf004431 ]--- This bug was provoked by compiling rds out-of-tree with EXTRA_CFLAGS="-DRDS_DEBUG -DDEBUG" and inserting an artificial delay between the rdsdebug() and ib_ib_port_recv() statements: /* XXX when can this fail? */ ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr); + if (can_wait) + usleep_range(1000, 5000); rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv, recv->r_ibinc, sg_page(&recv->r_frag->f_sg), (long) ib_sg_dma_address( The fix is simply to move the rdsdebug() statement up before the ib_post_recv() and remove the printing of ret, which is taken care of anyway by the non-debug code. Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Knut Omang <knut.omang@oracle.com> Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: Remove unused skb_shared_info memberMat Martineau2017-11-111-31/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ip6_frag_id was only used by UFO, which has been removed. ipv6_proxy_select_ident() only existed to set ip6_frag_id and has no in-tree callers. Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | l2tp: remove the .tunnel_sock field from struct pppol2tp_sessionGuillaume Nault2017-11-111-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The last user of .tunnel_sock is pppol2tp_connect() which defensively uses it to verify internal data consistency. This check isn't necessary: l2tp_session_get() guarantees that the returned session belongs to the tunnel passed as parameter. And .tunnel_sock is never updated, so checking that it still points to the parent tunnel socket is useless; that test can never fail. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | l2tp: avoid using ->tunnel_sock for getting session's parent tunnelGuillaume Nault2017-11-111-54/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sessions don't need to use l2tp_sock_to_tunnel(xxx->tunnel_sock) for accessing their parent tunnel. They have the .tunnel field in the l2tp_session structure for that. Furthermore, in all these cases, the session is registered, so we're guaranteed that .tunnel isn't NULL and that the session properly holds a reference on the tunnel. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | l2tp: remove .tunnel_sock from struct l2tp_ethGuillaume Nault2017-11-111-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This field has never been used. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: dsa: lan9303: Clear offload_fwd_mark for IGMPEgil Hjelmeland2017-11-111-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that IGMP packets no longer is flooded in HW, we want the SW bridge to forward packets based on bridge configuration. To make that happen, IGMP packets must have skb->offload_fwd_mark = 0. Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: dsa: {e}dsa: set offload_fwd_mark on received packetsAndrew Lunn2017-11-112-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The software bridge needs to know if a packet has already been bridged by hardware offload to ports in the same hardware offload, in order that it does not re-flood them, causing duplicates. This is particularly true for broadcast and multicast traffic which the host has requested. By setting offload_fwd_mark in the skb the bridge will only flood to ports in other offloads and other netifs. Set this flag in the DSA and EDSA tag driver. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: dsa: Fix SWITCHDEV_ATTR_ID_PORT_PARENT_IDAndrew Lunn2017-11-111-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SWITCHDEV_ATTR_ID_PORT_PARENT_ID is used by the software bridge when determining which ports to flood a packet out. If the packet originated from a switch, it assumes the switch has already flooded the packet out the switches ports, so the bridge should not flood the packet itself out switch ports. Ports on the same switch are expected to return the same parent ID when SWITCHDEV_ATTR_ID_PORT_PARENT_ID is called. DSA gets this wrong with clusters of switches. As far as the software bridge is concerned, the cluster is all one switch. A packet from any switch in the cluster can be assumed to have been flooded as needed out of all ports of the cluster, not just the switch it originated from. Hence all ports of a cluster should return the same parent. The old implementation did not, each switch in the cluster had its own ID. Also wrong was that the ID was not unique if multiple DSA instances are in operation. Use the tree ID as the parent ID, which is the same for all switches in a cluster and unique across switch clusters. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | sock: Remove the global prot_inuse counter.Tonghao Zhang2017-11-111-22/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The per-cpu counter for init_net is prepared in core_initcall. The patch 7d720c3e ("percpu: add __percpu sparse annotations to net") and d6d9ca0fe ("net: this_cpu_xxx conversions") optimize the routines. Then remove the old counter. Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: decnet: dn_table: mark expected switch fall-throughGustavo A. R. Silva2017-11-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Addresses-Coverity-ID: 115106 Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | l2tp: don't close sessions in l2tp_tunnel_destruct()Guillaume Nault2017-11-111-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sessions are already removed by the proto ->destroy() handlers, and since commit f3c66d4e144a ("l2tp: prevent creation of sessions on terminated tunnels"), we're guaranteed that no new session can be created afterwards. Furthermore, l2tp_tunnel_closeall() can sleep when there are sessions left to close. So we really shouldn't call it in a ->sk_destruct() handler, as it can be used from atomic context. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tcp: use sequence distance to detect reorderingYuchung Cheng2017-11-114-106/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the reordering distance measurement in packet unit with sequence based approach. Previously it trackes the number of "packets" toward the forward ACK (i.e. highest sacked sequence)in a state variable "fackets_out". Precisely measuring reordering degree on packet distance has not much benefit, as the degree constantly changes by factors like path, load, and congestion window. It is also complicated and prone to arcane bugs. This patch replaces with sequence-based approach that's much simpler. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tcp: retire FACK loss detectionYuchung Cheng2017-11-116-60/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | FACK loss detection has been disabled by default and the successor RACK subsumed FACK and can handle reordering better. This patch removes FACK to simplify TCP loss recovery. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tipc: improve link resiliency when rps is activatedJon Maloy2017-11-113-26/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the TIPC RPS dissector is based only on the incoming packets' source node address, hence steering all traffic from a node to the same core. We have seen that this makes the links vulnerable to starvation and unnecessary resets when we turn down the link tolerance to very low values. To reduce the risk of this happening, we exempt probe and probe replies packets from the convergence to one core per source node. Instead, we do the opposite, - we try to diverge those packets across as many cores as possible, by randomizing the flow selector key. To make such packets identifiable to the dissector, we add a new 'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is set both for PROBE and PROBE_REPLY messages, and only for those. It should be noted that these packets are not part of any flow anyway, and only constitute a minuscule fraction of all packets sent across a link. Hence, there is no risk that this will affect overall performance. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: ipv6: sysctl to specify IPv6 ND traffic classMaciej Żenczykowski2017-11-112-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a per-device sysctl to specify the default traffic class to use for kernel originated IPv6 Neighbour Discovery packets. Currently this includes: - Router Solicitation (ICMPv6 type 133) ndisc_send_rs() -> ndisc_send_skb() -> ip6_nd_hdr() - Neighbour Solicitation (ICMPv6 type 135) ndisc_send_ns() -> ndisc_send_skb() -> ip6_nd_hdr() - Neighbour Advertisement (ICMPv6 type 136) ndisc_send_na() -> ndisc_send_skb() -> ip6_nd_hdr() - Redirect (ICMPv6 type 137) ndisc_send_redirect() -> ndisc_send_skb() -> ip6_nd_hdr() and if the kernel ever gets around to generating RA's, it would presumably also include: - Router Advertisement (ICMPv6 type 134) (radvd daemon could pick up on the kernel setting and use it) Interface drivers may examine the Traffic Class value and translate the DiffServ Code Point into a link-layer appropriate traffic prioritization scheme. An example of mapping IETF DSCP values to IEEE 802.11 User Priority values can be found here: https://tools.ietf.org/html/draft-ietf-tsvwg-ieee-802-11 The expected primary use case is to properly prioritize ND over wifi. Testing: jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass 0 jzem22:~# echo -1 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass -bash: echo: write error: Invalid argument jzem22:~# echo 256 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass -bash: echo: write error: Invalid argument jzem22:~# echo 0 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass jzem22:~# echo 255 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass 255 jzem22:~# echo 34 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass 34 jzem22:~# echo $[0xDC] > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass jzem22:~# tcpdump -v -i eth0 icmp6 and src host jzem22.pgc and dst host fe80::1 tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes IP6 (class 0xdc, hlim 255, next-header ICMPv6 (58) payload length: 24) jzem22.pgc > fe80::1: [icmp6 sum ok] ICMP6, neighbor advertisement, length 24, tgt is jzem22.pgc, Flags [solicited] (based on original change written by Erik Kline, with minor changes) v2: fix 'suspicious rcu_dereference_check() usage' by explicitly grabbing the rcu_read_lock. Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Erik Kline <ek@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net/ncsi: Don't return error on normal responseSamuel Mendoza-Jonas2017-11-111-17/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Several response handlers return EBUSY if the data corresponding to the command/response pair is already set. There is no reason to return an error here; the channel is advertising something as enabled because we told it to enable it, and it's possible that the feature has been enabled previously. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net/ncsi: Improve general state loggingSamuel Mendoza-Jonas2017-11-113-21/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The NCSI driver is mostly silent which becomes a headache when trying to determine what has occurred on the NCSI connection. This adds additional logging in a few key areas such as state transitions and calling out certain errors more visibly. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | act_vlan: VLAN action rewrite to use RCU lock/unlock and updateManish Kurup2017-11-101-24/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using a spinlock in the VLAN action causes performance issues when the VLAN action is used on multiple cores. Rewrote the VLAN action to use RCU read locking for reads and updates instead. All functions now use an RCU dereferenced pointer to access the VLAN action context. Modified helper functions used by other modules, to use the RCU as opposed to directly accessing the structure. Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Manish Kurup <manish.kurup@verizon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | act_vlan: Change stats update to use per-core statsManish Kurup2017-11-101-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The VLAN action maintains one set of stats across all cores, and uses a spinlock to synchronize updates to it from the same. Changed this to use a per-CPU stats context instead. This change will result in better performance. Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Manish Kurup <manish.kurup@verizon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud