summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* bpf: Hooks for sys_sendmsgAndrey Ignatov2018-05-283-2/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In addition to already existing BPF hooks for sys_bind and sys_connect, the patch provides new hooks for sys_sendmsg. It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that provides access to socket itlself (properties like family, type, protocol) and user-passed `struct sockaddr *` so that BPF program can override destination IP and port for system calls such as sendto(2) or sendmsg(2) and/or assign source IP to the socket. The hooks are implemented as two new attach types: `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and UDPv6 correspondingly. UDPv4 and UDPv6 separate attach types for same reason as sys_bind and sys_connect hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound. The difference with already existing hooks is sys_sendmsg are implemented only for unconnected UDP. For TCP it doesn't make sense to change user-provided `struct sockaddr *` at sendto(2)/sendmsg(2) time since socket either was already connected and has source/destination set or wasn't connected and call to sendto(2)/sendmsg(2) would lead to ENOTCONN anyway. Connected UDP is already handled by sys_connect hooks that can override source/destination at connect time and use fast-path later, i.e. these hooks don't affect UDP fast-path. Rewriting source IP is implemented differently than that in sys_connect hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to just bind socket to desired local IP address since source IP can be set on per-packet basis by using ancillary data (cmsg(3)). So no matter if socket is bound or not, source IP has to be rewritten on every call to sys_sendmsg. To do so two new fields are added to UAPI `struct bpf_sock_addr`; * `msg_src_ip4` to set source IPv4 for UDPv4; * `msg_src_ip6` to set source IPv6 for UDPv6. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-05-2617-51/+124
|\ | | | | | | | | | | | | Lots of easy overlapping changes in the confict resolutions here. Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2018-05-2514-48/+121
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: "Let's begin the holiday weekend with some networking fixes: 1) Whoops need to restrict cfg80211 wiphy names even more to 64 bytes. From Eric Biggers. 2) Fix flags being ignored when using kernel_connect() with SCTP, from Xin Long. 3) Use after free in DCCP, from Alexey Kodanev. 4) Need to check rhltable_init() return value in ipmr code, from Eric Dumazet. 5) XDP handling fixes in virtio_net from Jason Wang. 6) Missing RTA_TABLE in rtm_ipv4_policy[], from Roopa Prabhu. 7) Need to use IRQ disabling spinlocks in mlx4_qp_lookup(), from Jack Morgenstein. 8) Prevent out-of-bounds speculation using indexes in BPF, from Daniel Borkmann. 9) Fix regression added by AF_PACKET link layer cure, from Willem de Bruijn. 10) Correct ENIC dma mask, from Govindarajulu Varadarajan. 11) Missing config options for PMTU tests, from Stefano Brivio" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits) ibmvnic: Fix partial success login retries selftests/net: Add missing config options for PMTU tests mlx4_core: allocate ICM memory in page size chunks enic: set DMA mask to 47 bit ppp: remove the PPPIOCDETACH ioctl ipv4: remove warning in ip_recv_error net : sched: cls_api: deal with egdev path only if needed vhost: synchronize IOTLB message with dev cleanup packet: fix reserve calculation net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands net/mlx5e: When RXFCS is set, add FCS data into checksum calculation bpf: properly enforce index mask to prevent out-of-bounds speculation net/mlx4: Fix irq-unsafe spinlock usage net: phy: broadcom: Fix bcm_write_exp() net: phy: broadcom: Fix auxiliary control register reads net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message ibmvnic: Only do H_EOI for mobility events tuntap: correctly set SOCKWQ_ASYNC_NOSPACE virtio-net: fix leaking page for gso packet during mergeable XDP ...
| | * Merge tag 'batadv-net-for-davem-20180524' of git://git.open-mesh.org/linux-mergeDavid S. Miller2018-05-252-18/+68
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simon Wunderlich says: ==================== Here are some batman-adv bugfixes: - prevent hardif_put call with NULL parameter, by Colin Ian King - Avoid race in Translation Table allocator, by Sven Eckelmann - Fix Translation Table sync flags for intermediate Responses, by Linus Luessing - prevent sending inconsistent Translation Table TVLVs, by Marek Lindner ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * batman-adv: prevent TT request storms by not sending inconsistent TT TLVLsMarek Lindner2018-05-121-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A translation table TVLV changset sent with an OGM consists of a number of headers (one per VLAN) plus the changeset itself (addition and/or deletion of entries). The per-VLAN headers are used by OGM recipients for consistency checks. Said consistency check might determine that a full translation table request is needed to restore consistency. If the TT sender adds per-VLAN headers of empty VLANs into the OGM, recipients are led to believe to have reached an inconsistent state and thus request a full table update. The full table does not contain empty VLANs (due to missing entries) the cycle restarts when the next OGM is issued. Consequently, when the translation table TVLV headers are composed, empty VLANs are to be excluded. Fixes: 21a57f6e7a3b ("batman-adv: make the TT CRC logic VLAN specific") Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
| | | * batman-adv: Fix TT sync flags for intermediate TT responsesLinus Lüssing2018-05-121-10/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous TT sync fix so far only fixed TT responses issued by the target node directly. So far, TT responses issued by intermediate nodes still lead to the wrong flags being added, leading to CRC mismatches. This behaviour was observed at Freifunk Hannover in a 800 nodes setup where a considerable amount of nodes were still infected with 'WI' TT flags even with (most) nodes having the previous TT sync fix applied. I was able to reproduce the issue with intermediate TT responses in a four node test setup and this patch fixes this issue by ensuring to use the per originator instead of the summarized, OR'd ones. Fixes: e9c00136a475 ("batman-adv: fix tt_global_entries flags update") Reported-by: Leonardo Mörlein <me@irrelefant.net> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
| | | * batman-adv: Avoid race in TT TVLV allocator helperSven Eckelmann2018-05-101-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The functions batadv_tt_prepare_tvlv_local_data and batadv_tt_prepare_tvlv_global_data are responsible for preparing a buffer which can be used to store the TVLV container for TT and add the VLAN information to it. This will be done in three phases: 1. count the number of VLANs and their entries 2. allocate the buffer using the counters from the previous step and limits from the caller (parameter tt_len) 3. insert the VLAN information to the buffer The step 1 and 3 operate on a list which contains the VLANs. The access to these lists must be protected with an appropriate lock or otherwise they might operate on on different entries. This could for example happen when another context is adding VLAN entries to this list. This could lead to a buffer overflow in these functions when enough entries were added between step 1 and 3 to the VLAN lists that the buffer room for the entries (*tt_change) is smaller then the now required extra buffer for new VLAN entries. Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific") Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
| | | * batman-adv: don't pass a NULL hard_iface to batadv_hardif_putColin Ian King2018-04-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the case where hard_iface is NULL, the error path may pass a null pointer to batadv_hardif_put causing a null pointer dereference error. Avoid this by only calling the function if hard_iface not null. Detected by CoverityScan, CID#1466456 ("Explicit null dereferenced") Fixes: 53dd9a68ba68 ("batman-adv: add multicast flags netlink support") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
| | * | ipv4: remove warning in ip_recv_errorWillem de Bruijn2018-05-241-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A precondition check in ip_recv_error triggered on an otherwise benign race. Remove the warning. The warning triggers when passing an ipv6 socket to this ipv4 error handling function. RaceFuzzer was able to trigger it due to a race in setsockopt IPV6_ADDRFORM. --- CPU0 do_ipv6_setsockopt sk->sk_socket->ops = &inet_dgram_ops; --- CPU1 sk->sk_prot->recvmsg udp_recvmsg ip_recv_error WARN_ON_ONCE(sk->sk_family == AF_INET6); --- CPU0 do_ipv6_setsockopt sk->sk_family = PF_INET; This socket option converts a v6 socket that is connected to a v4 peer to an v4 socket. It updates the socket on the fly, changing fields in sk as well as other structs. This is inherently non-atomic. It races with the lockless udp_recvmsg path. No other code makes an assumption that these fields are updated atomically. It is benign here, too, as ip_recv_error cares only about the protocol of the skbs enqueued on the error queue, for which sk_family is not a precise predictor (thanks to another isue with IPV6_ADDRFORM). Link: http://lkml.kernel.org/r/20180518120826.GA19515@dragonet.kaist.ac.kr Fixes: 7ce875e5ecb8 ("ipv4: warn once on passing AF_INET6 socket to ip_recv_error") Reported-by: DaeRyong Jeong <threeearcat@gmail.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net : sched: cls_api: deal with egdev path only if neededOr Gerlitz2018-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When dealing with ingress rule on a netdev, if we did fine through the conventional path, there's no need to continue into the egdev route, and we can stop right there. Not doing so may cause a 2nd rule to be added by the cls api layer with the ingress being the egdev. For example, under sriov switchdev scheme, a user rule of VFR A --> VFR B will end up with two HW rules (1) VF A --> VF B and (2) uplink --> VF B Fixes: 208c0f4b5237 ('net: sched: use tc_setup_cb_call to call per-block callbacks') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | packet: fix reserve calculationWillem de Bruijn2018-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b84bbaf7a6c8 ("packet: in packet_snd start writing at link layer allocation") ensures that packet_snd always starts writing the link layer header in reserved headroom allocated for this purpose. This is needed because packets may be shorter than hard_header_len, in which case the space up to hard_header_len may be zeroed. But that necessary padding is not accounted for in skb->len. The fix, however, is buggy. It calls skb_push, which grows skb->len when moving skb->data back. But in this case packet length should not change. Instead, call skb_reserve, which moves both skb->data and skb->tail back, without changing length. Fixes: b84bbaf7a6c8 ("packet: in packet_snd start writing at link layer allocation") Reported-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net: ipv4: add missing RTA_TABLE to rtm_ipv4_policyRoopa Prabhu2018-05-231-0/+1
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | Merge tag 'mac80211-for-davem-2018-05-23' of ↵David S. Miller2018-05-233-5/+9
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== A handful of fixes: * hwsim radio dump wasn't working for the first radio * mesh was updating statistics incorrectly * a netlink message allocation was possibly too short * wiphy name limit was still too long * in certain cases regdb query could find a NULL pointer ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * | cfg80211: fix NULL pointer derference when querying regdbHaim Dreyfuss2018-05-221-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some drivers may call this function when regdb is not initialized yet, so we need to make sure regdb is valid before trying to access it. Make sure regdb is initialized before trying to access it in reg_query_regdb_wmm() and query_regdb(). Reported-by: Eric Biggers <ebiggers3@gmail.com> Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | | * | mac80211: mesh: fix premature update of rc statsBob Copeland2018-05-181-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The mesh_neighbour_update() function, queued via beacon rx, can race with userspace creating the same station. If the station already exists by the time mesh_neighbour_update() is called, the function wrongly assumes rate control has been initialized and calls rate_control_rate_update(), which in turn calls into the driver. Updating the rate control before it has been initialized can cause a crash in some drivers, for example this firmware crash in ath10k due to sta->rx_nss being 0: [ 3078.088247] mesh0: Inserted STA 5c:e2:8c:f1:ab:ba [ 3078.258407] ath10k_pci 0000:0d:00.0: firmware crashed! (uuid d6ed5961-93cc-4d61-803f-5eda55bb8643) [ 3078.258421] ath10k_pci 0000:0d:00.0: qca988x hw2.0 target 0x4100016c chip_id 0x043202ff sub 0000:0000 [ 3078.258426] ath10k_pci 0000:0d:00.0: kconfig debug 1 debugfs 1 tracing 1 dfs 0 testmode 0 [ 3078.258608] ath10k_pci 0000:0d:00.0: firmware ver 10.2.4.70.59-2 api 5 features no-p2p,raw-mode,mfp crc32 4159f498 [ 3078.258613] ath10k_pci 0000:0d:00.0: board_file api 1 bmi_id N/A crc32 bebc7c08 [ 3078.258617] ath10k_pci 0000:0d:00.0: htt-ver 2.1 wmi-op 5 htt-op 2 cal otp max-sta 128 raw 0 hwcrypto 1 [ 3078.260627] ath10k_pci 0000:0d:00.0: firmware register dump: [ 3078.260640] ath10k_pci 0000:0d:00.0: [00]: 0x4100016C 0x000015B3 0x009A31BB 0x00955B31 [ 3078.260647] ath10k_pci 0000:0d:00.0: [04]: 0x009A31BB 0x00060130 0x00000008 0x00000007 [ 3078.260652] ath10k_pci 0000:0d:00.0: [08]: 0x00000000 0x00955B31 0x00000000 0x0040F89E [ 3078.260656] ath10k_pci 0000:0d:00.0: [12]: 0x00000009 0xFFFFFFFF 0x009580F5 0x00958117 [ 3078.260660] ath10k_pci 0000:0d:00.0: [16]: 0x00958080 0x0094085D 0x00000000 0x00000000 [ 3078.260664] ath10k_pci 0000:0d:00.0: [20]: 0x409A31BB 0x0040AA84 0x00000002 0x00000001 [ 3078.260669] ath10k_pci 0000:0d:00.0: [24]: 0x809A2B8D 0x0040AAE4 0x00000088 0xC09A31BB [ 3078.260673] ath10k_pci 0000:0d:00.0: [28]: 0x809898C8 0x0040AB04 0x0043F91C 0x009C6458 [ 3078.260677] ath10k_pci 0000:0d:00.0: [32]: 0x809B66AC 0x0040AB34 0x009C6458 0x0043F91C [ 3078.260686] ath10k_pci 0000:0d:00.0: [36]: 0x809B2824 0x0040ADA4 0x00400000 0x00416EB4 [ 3078.260692] ath10k_pci 0000:0d:00.0: [40]: 0x809C07D9 0x0040ADE4 0x0040AE08 0x00412028 [ 3078.260696] ath10k_pci 0000:0d:00.0: [44]: 0x809486FA 0x0040AE04 0x00000001 0x00000000 [ 3078.260700] ath10k_pci 0000:0d:00.0: [48]: 0x80948E2C 0x0040AEA4 0x0041F4F0 0x00412634 [ 3078.260704] ath10k_pci 0000:0d:00.0: [52]: 0x809BFC39 0x0040AEC4 0x0041F4F0 0x00000001 [ 3078.260709] ath10k_pci 0000:0d:00.0: [56]: 0x80940F18 0x0040AF14 0x00000010 0x00403AC0 [ 3078.284130] ath10k_pci 0000:0d:00.0: failed to to request monitor vdev 1 stop: -108 Fix this by checking whether the sta has already initialized rate control using the flag for that purpose. We can also drop the unnecessary insert parameter here. Signed-off-by: Bob Copeland <bobcopeland@fb.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | | * | nl80211: fix nlmsg allocation in cfg80211_ft_eventDedy Lansky2018-05-181-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allocation size of nlmsg in cfg80211_ft_event is based on ric_ies_len and doesn't take into account ies_len. This leads to NL80211_CMD_FT_EVENT message construction failure in case ft_event contains large enough ies buffer. Add ies_len to the nlmsg allocation size. Signed-off-by: Dedy Lansky <dlansky@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | | ipmr: properly check rhltable_init() return valueEric Dumazet2018-05-221-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8fb472c09b9d ("ipmr: improve hash scalability") added a call to rhltable_init() without checking its return value. This problem was then later copied to IPv6 and factorized in commit 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table") kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 31552 Comm: syz-executor7 Not tainted 4.17.0-rc5+ #60 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:rht_key_hashfn include/linux/rhashtable.h:277 [inline] RIP: 0010:__rhashtable_lookup include/linux/rhashtable.h:630 [inline] RIP: 0010:rhltable_lookup include/linux/rhashtable.h:716 [inline] RIP: 0010:mr_mfc_find_parent+0x2ad/0xbb0 net/ipv4/ipmr_base.c:63 RSP: 0018:ffff8801826aef70 EFLAGS: 00010203 RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffffc90001ea0000 RDX: 0000000000000079 RSI: ffffffff8661e859 RDI: 000000000000000c RBP: ffff8801826af1c0 R08: ffff8801b2212000 R09: ffffed003b5e46c2 R10: ffffed003b5e46c2 R11: ffff8801daf23613 R12: dffffc0000000000 R13: ffff8801826af198 R14: ffff8801cf8225c0 R15: ffff8801826af658 FS: 00007ff7fa732700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000003ffffff9c CR3: 00000001b0210000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ip6mr_cache_find_parent net/ipv6/ip6mr.c:981 [inline] ip6mr_mfc_delete+0x1fe/0x6b0 net/ipv6/ip6mr.c:1221 ip6_mroute_setsockopt+0x15c6/0x1d70 net/ipv6/ip6mr.c:1698 do_ipv6_setsockopt.isra.9+0x422/0x4660 net/ipv6/ipv6_sockglue.c:163 ipv6_setsockopt+0xbd/0x170 net/ipv6/ipv6_sockglue.c:922 rawv6_setsockopt+0x59/0x140 net/ipv6/raw.c:1060 sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3039 __sys_setsockopt+0x1bd/0x390 net/socket.c:1903 __do_sys_setsockopt net/socket.c:1914 [inline] __se_sys_setsockopt net/socket.c:1911 [inline] __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 8fb472c09b9d ("ipmr: improve hash scalability") Fixes: 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Cc: Yuval Mintz <yuvalm@mellanox.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | dccp: don't free ccid2_hc_tx_sock struct in dccp_disconnect()Alexey Kodanev2018-05-221-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot reported the use-after-free in timer_is_static_object() [1]. This can happen because the structure for the rto timer (ccid2_hc_tx_sock) is removed in dccp_disconnect(), and ccid2_hc_tx_rto_expire() can be called after that. The report [1] is similar to the one in commit 120e9dabaf55 ("dccp: defer ccid_hc_tx_delete() at dismantle time"). And the fix is the same, delay freeing ccid2_hc_tx_sock structure, so that it is freed in dccp_sk_destruct(). [1] ================================================================== BUG: KASAN: use-after-free in timer_is_static_object+0x80/0x90 kernel/time/timer.c:607 Read of size 8 at addr ffff8801bebb5118 by task syz-executor2/25299 CPU: 1 PID: 25299 Comm: syz-executor2 Not tainted 4.17.0-rc5+ #54 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 timer_is_static_object+0x80/0x90 kernel/time/timer.c:607 debug_object_activate+0x2d9/0x670 lib/debugobjects.c:508 debug_timer_activate kernel/time/timer.c:709 [inline] debug_activate kernel/time/timer.c:764 [inline] __mod_timer kernel/time/timer.c:1041 [inline] mod_timer+0x4d3/0x13b0 kernel/time/timer.c:1102 sk_reset_timer+0x22/0x60 net/core/sock.c:2742 ccid2_hc_tx_rto_expire+0x587/0x680 net/dccp/ccids/ccid2.c:147 call_timer_fn+0x230/0x940 kernel/time/timer.c:1326 expire_timers kernel/time/timer.c:1363 [inline] __run_timers+0x79e/0xc50 kernel/time/timer.c:1666 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692 __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285 invoke_softirq kernel/softirq.c:365 [inline] irq_exit+0x1d1/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:525 [inline] smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863 </IRQ> ... Allocated by task 25374: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490 kmem_cache_alloc+0x12e/0x760 mm/slab.c:3554 ccid_new+0x25b/0x3e0 net/dccp/ccid.c:151 dccp_hdlr_ccid+0x27/0x150 net/dccp/feat.c:44 __dccp_feat_activate+0x184/0x270 net/dccp/feat.c:344 dccp_feat_activate_values+0x3a7/0x819 net/dccp/feat.c:1538 dccp_create_openreq_child+0x472/0x610 net/dccp/minisocks.c:128 dccp_v4_request_recv_sock+0x12c/0xca0 net/dccp/ipv4.c:408 dccp_v6_request_recv_sock+0x125d/0x1f10 net/dccp/ipv6.c:415 dccp_check_req+0x455/0x6a0 net/dccp/minisocks.c:197 dccp_v4_rcv+0x7b8/0x1f3f net/dccp/ipv4.c:841 ip_local_deliver_finish+0x2e3/0xd80 net/ipv4/ip_input.c:215 NF_HOOK include/linux/netfilter.h:288 [inline] ip_local_deliver+0x1e1/0x720 net/ipv4/ip_input.c:256 dst_input include/net/dst.h:450 [inline] ip_rcv_finish+0x81b/0x2200 net/ipv4/ip_input.c:396 NF_HOOK include/linux/netfilter.h:288 [inline] ip_rcv+0xb70/0x143d net/ipv4/ip_input.c:492 __netif_receive_skb_core+0x26f5/0x3630 net/core/dev.c:4592 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4657 process_backlog+0x219/0x760 net/core/dev.c:5337 napi_poll net/core/dev.c:5735 [inline] net_rx_action+0x7b7/0x1930 net/core/dev.c:5801 __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285 Freed by task 25374: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kmem_cache_free+0x86/0x2d0 mm/slab.c:3756 ccid_hc_tx_delete+0xc3/0x100 net/dccp/ccid.c:190 dccp_disconnect+0x130/0xc66 net/dccp/proto.c:286 dccp_close+0x3bc/0xe60 net/dccp/proto.c:1045 inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427 inet6_release+0x50/0x70 net/ipv6/af_inet6.c:460 sock_release+0x96/0x1b0 net/socket.c:594 sock_close+0x16/0x20 net/socket.c:1149 __fput+0x34d/0x890 fs/file_table.c:209 ____fput+0x15/0x20 fs/file_table.c:243 task_work_run+0x1e4/0x290 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:191 [inline] exit_to_usermode_loop+0x2bd/0x310 arch/x86/entry/common.c:166 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline] syscall_return_slowpath arch/x86/entry/common.c:265 [inline] do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff8801bebb4cc0 which belongs to the cache ccid2_hc_tx_sock of size 1240 The buggy address is located 1112 bytes inside of 1240-byte region [ffff8801bebb4cc0, ffff8801bebb5198) The buggy address belongs to the page: page:ffffea0006faed00 count:1 mapcount:0 mapping:ffff8801bebb41c0 index:0xffff8801bebb5240 compound_mapcount: 0 flags: 0x2fffc0000008100(slab|head) raw: 02fffc0000008100 ffff8801bebb41c0 ffff8801bebb5240 0000000100000003 raw: ffff8801cdba3138 ffffea0007634120 ffff8801cdbaab40 0000000000000000 page dumped because: kasan: bad access detected ... ================================================================== Reported-by: syzbot+5d47e9ec91a6f15dbd6f@syzkaller.appspotmail.com Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | sctp: fix the issue that flags are ignored when using kernel_connectXin Long2018-05-223-18/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now sctp uses inet_dgram_connect as its proto_ops .connect, and the flags param can't be passed into its proto .connect where this flags is really needed. sctp works around it by getting flags from socket file in __sctp_connect. It works for connecting from userspace, as inherently the user sock has socket file and it passes f_flags as the flags param into the proto_ops .connect. However, the sock created by sock_create_kern doesn't have a socket file, and it passes the flags (like O_NONBLOCK) by using the flags param in kernel_connect, which calls proto_ops .connect later. So to fix it, this patch defines a new proto_ops .connect for sctp, sctp_inet_connect, which calls __sctp_connect() directly with this flags param. After this, the sctp's proto .connect can be removed. Note that sctp_inet_connect doesn't need to do some checks that are not needed for sctp, which makes thing better than with inet_dgram_connect. Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds2018-05-243-3/+3
| |\ \ \ \ | | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull rdma fixes from Jason Gunthorpe: "This is pretty much just the usual array of smallish driver bugs. - remove bouncing addresses from the MAINTAINERS file - kernel oops and bad error handling fixes for hfi, i40iw, cxgb4, and hns drivers - various small LOC behavioral/operational bugs in mlx5, hns, qedr and i40iw drivers - two fixes for patches already sent during the merge window - a long-standing bug related to not decreasing the pinned pages count in the right MM was found and fixed" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (28 commits) RDMA/hns: Move the location for initializing tmp_len RDMA/hns: Bugfix for cq record db for kernel IB/uverbs: Fix uverbs_attr_get_obj RDMA/qedr: Fix doorbell bar mapping for dpi > 1 IB/umem: Use the correct mm during ib_umem_release iw_cxgb4: Fix an error handling path in 'c4iw_get_dma_mr()' RDMA/i40iw: Avoid panic when reading back the IRQ affinity hint RDMA/i40iw: Avoid reference leaks when processing the AEQ RDMA/i40iw: Avoid panic when objects are being created and destroyed RDMA/hns: Fix the bug with NULL pointer RDMA/hns: Set NULL for __internal_mr RDMA/hns: Enable inner_pa_vld filed of mpt RDMA/hns: Set desc_dma_addr for zero when free cmq desc RDMA/hns: Fix the bug with rq sge RDMA/hns: Not support qp transition from reset to reset for hip06 RDMA/hns: Add return operation when configured global param fail RDMA/hns: Update convert function of endian format RDMA/hns: Load the RoCE dirver automatically RDMA/hns: Bugfix for rq record db for kernel RDMA/hns: Add rq inline flags judgement ...
| | * | | IB: remove redundant INFINIBAND kconfig dependenciesGreg Thelen2018-05-093-3/+3
| | | |/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | INFINIBAND_ADDR_TRANS depends on INFINIBAND. So there's no need for options which depend INFINIBAND_ADDR_TRANS to also depend on INFINIBAND. Remove the unnecessary INFINIBAND depends. Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* | | | openvswitch: Support conntrack zone limitYi-Hung Wei2018-05-255-6/+567
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, nf_conntrack_max is used to limit the maximum number of conntrack entries in the conntrack table for every network namespace. For the VMs and containers that reside in the same namespace, they share the same conntrack table, and the total # of conntrack entries for all the VMs and containers are limited by nf_conntrack_max. In this case, if one of the VM/container abuses the usage the conntrack entries, it blocks the others from committing valid conntrack entries into the conntrack table. Even if we can possibly put the VM in different network namespace, the current nf_conntrack_max configuration is kind of rigid that we cannot limit different VM/container to have different # conntrack entries. To address the aforementioned issue, this patch proposes to have a fine-grained mechanism that could further limit the # of conntrack entries per-zone. For example, we can designate different zone to different VM, and set conntrack limit to each zone. By providing this isolation, a mis-behaved VM only consumes the conntrack entries in its own zone, and it will not influence other well-behaved VMs. Moreover, the users can set various conntrack limit to different zone based on their preference. The proposed implementation utilizes Netfilter's nf_conncount backend to count the number of connections in a particular zone. If the number of connection is above a configured limitation, ovs will return ENOMEM to the userspace. If userspace does not configure the zone limit, the limit defaults to zero that is no limitation, which is backward compatible to the behavior without this patch. The following high leve APIs are provided to the userspace: - OVS_CT_LIMIT_CMD_SET: * set default connection limit for all zones * set the connection limit for a particular zone - OVS_CT_LIMIT_CMD_DEL: * remove the connection limit for a particular zone - OVS_CT_LIMIT_CMD_GET: * get the default connection limit for all zones * get the connection limit for a particular zone Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge tag 'mlx5e-updates-2018-05-19' of ↵David S. Miller2018-05-251-0/+20
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5e-updates-2018-05-19 This series contains updates for mlx5e netdevice driver with one subject, DSCP to priority mapping, in the first patch Huy adds the needed API in dcbnl, the second patch adds the needed mlx5 core capability bits for the feature, and all other patches are mlx5e (netdev) only changes to add support for the feature. From: Huy Nguyen Dscp to priority mapping for Ethernet packet: These patches enable differentiated services code point (dscp) to priority mapping for Ethernet packet. Once this feature is enabled, the packet is routed to the corresponding priority based on its dscp. User can combine this feature with priority flow control (pfc) feature to have priority flow control based on the dscp. Firmware interface: Mellanox firmware provides two control knobs for this feature: QPTS register allow changing the trust state between dscp and pcp mode. The default is pcp mode. Once in dscp mode, firmware will route the packet based on its dscp value if the dscp field exists. QPDPM register allow mapping a specific dscp (0 to 63) to a specific priority (0 to 7). By default, all the dscps are mapped to priority zero. Software interface: This feature is controlled via application priority TLV. IEEE specification P802.1Qcd/D2.1 defines priority selector id 5 for application priority TLV. This APP TLV selector defines DSCP to priority map. This APP TLV can be sent by the switch or can be set locally using software such as lldptool. In mlx5 drivers, we add the support for net dcb's getapp and setapp call back. Mlx5 driver only handles the selector id 5 application entry (dscp application priority application entry). If user sends multiple dscp to priority APP TLV entries on the same dscp, the last sent one will take effect. All the previous sent will be deleted. This attribute combined with pfc attribute allows advanced user to fine tune the qos setting for specific priority queue. For example, user can give dedicated buffer for one or more priorities or user can give large buffer to certain priorities. The dcb buffer configuration will be controlled by lldptool. >> lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6 maps priorities 0,1,2,3,4,5,6,7 to receive buffer 0,2,5,7,1,2,3,6 >> lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0 sets receive buffer size for buffer 0,1,2,3,4,5,6,7 respectively After discussion on mailing list with Jakub, Jiri, Ido and John, we agreed to choose dcbnl over devlink interface since this feature is intended to set port attributes which are governed by the netdev instance of that port, where devlink API is more suitable for global ASIC configurations. The firmware trust state (in QPTS register) is changed based on the number of dscp to priority application entries. When the first dscp to priority application entry is added by the user, the trust state is changed to dscp. When the last dscp to priority application entry is deleted by the user, the trust state is changed to pcp. When the port is in DSCP trust state, the transmit queue is selected based on the dscp of the skb. When the port is in DSCP trust state and vport inline mode is not NONE, firmware requires mlx5 driver to copy the IP header to the wqe ethernet segment inline header if the skb has it. This is done by changing the transmit queue sq's min inline mode to L3. Note that the min inline mode of sqs that belong to other features such as xdpsq, icosq are not modified. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net/dcb: Add dcbnl buffer attributeHuy Nguyen2018-05-241-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In this patch, we add dcbnl buffer attribute to allow user change the NIC's buffer configuration such as priority to buffer mapping and buffer size of individual buffer. This attribute combined with pfc attribute allows advanced user to fine tune the qos setting for specific priority queue. For example, user can give dedicated buffer for one or more priorities or user can give large buffer to certain priorities. The dcb buffer configuration will be controlled by lldptool. lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6 maps priorities 0,1,2,3,4,5,6,7 to receive buffer 0,2,5,7,1,2,3,6 lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0 sets receive buffer size for buffer 0,1,2,3,4,5,6,7 respectively After discussion on mailing list with Jakub, Jiri, Ido and John, we agreed to choose dcbnl over devlink interface since this feature is intended to set port attributes which are governed by the netdev instance of that port, where devlink API is more suitable for global ASIC configurations. We present an use case scenario where dcbnl buffer attribute configured by advance user helps reduce the latency of messages of different sizes. Scenarios description: On ConnectX-5, we run latency sensitive traffic with small/medium message sizes ranging from 64B to 256KB and bandwidth sensitive traffic with large messages sizes 512KB and 1MB. We group small, medium, and large message sizes to their own pfc enables priorities as follow. Priorities 1 & 2 (64B, 256B and 1KB) Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB) Priorities 5 & 6 (512KB and 1MB) By default, ConnectX-5 maps all pfc enabled priorities to a single lossless fixed buffer size of 50% of total available buffer space. The other 50% is assigned to lossy buffer. Using dcbnl buffer attribute, we create three equal size lossless buffers. Each buffer has 25% of total available buffer space. Thus, the lossy buffer size reduces to 25%. Priority to lossless buffer mappings are set as follow. Priorities 1 & 2 on lossless buffer #1 Priorities 3 & 4 on lossless buffer #2 Priorities 5 & 6 on lossless buffer #3 We observe improvements in latency for small and medium message sizes as follows. Please note that the large message sizes bandwidth performance is reduced but the total bandwidth remains the same. 256B message size (42 % latency reduction) 4K message size (21% latency reduction) 64K message size (16% latency reduction) CC: Ido Schimmel <idosch@idosch.org> CC: Jakub Kicinski <jakub.kicinski@netronome.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Or Gerlitz <gerlitz.or@gmail.com> CC: Parav Pandit <parav@mellanox.com> CC: Aron Silverton <aron.silverton@oracle.com> Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
* | | | | net: bridge: add support for port isolationNikolay Aleksandrov2018-05-255-2/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds support for a new port flag - BR_ISOLATED. If it is set then isolated ports cannot communicate between each other, but they can still communicate with non-isolated ports. The same can be achieved via ACLs but they can't scale with large number of ports and also the complexity of the rules grows. This feature can be used to achieve isolated vlan functionality (similar to pvlan) as well, though currently it will be port-wide (for all vlans on the port). The new test in should_deliver uses data that is already cache hot and the new boolean is used to avoid an additional source port test in should_deliver. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | net/ipv4: Remove tracepoint in fib_validate_sourceDavid Ahern2018-05-241-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tracepoint does not add value and the call to fib_lookup follows it which shows the same information and the fib lookup result. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | net/ipv6: Udate fib6_table_lookup tracepointDavid Ahern2018-05-242-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit bb0ad1987e96 ("ipv6: fib6_rules: support for match on sport, dport and ip proto") added support for protocol and ports to FIB rules. Update the FIB lookup tracepoint to dump the parameters. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | net/ipv4: Udate fib_table_lookup tracepointDavid Ahern2018-05-241-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4a2d73a4fb36 ("ipv4: fib_rules: support match on sport, dport and ip proto") added support for protocol and ports to FIB rules. Update the FIB lookup tracepoint to dump the parameters. In addition, make the IPv4 tracepoint similar to the IPv6 one where the lookup parameters and result are dumped in 1 event. It is much easier to use and understand the outcome of the lookup. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | net_sched: switch to rcu_workCong Wang2018-05-2412-220/+84
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 05f0fe6b74db ("RCU, workqueue: Implement rcu_work") introduces new API's for dispatching work in a RCU callback. Now we can just switch to the new API's for tc filters. This could get rid of a lot of code. Cc: Tejun Heo <tj@kernel.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2018-05-2415-266/+933
|\ \ \ \ \ | |/ / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexei Starovoitov says: ==================== pull-request: bpf-next 2018-05-24 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc). 2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers. 3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit. 4) Jiong Wang adds support for indirect and arithmetic shifts to NFP 5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible. 6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions. 7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT. 8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events. 9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | xdp: change ndo_xdp_xmit API to support bulkingJesper Dangaard Brouer2018-05-241-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch change the API for ndo_xdp_xmit to support bulking xdp_frames. When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown. Most of the slowdown is caused by DMA API indirect function calls, but also the net_device->ndo_xdp_xmit() call. Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed performance improved: for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps With frames avail as a bulk inside the driver ndo_xdp_xmit call, further optimizations are possible, like bulk DMA-mapping for TX. Testing without CONFIG_RETPOLINE show the same performance for physical NIC drivers. The virtual NIC driver tun sees a huge performance boost, as it can avoid doing per frame producer locking, but instead amortize the locking cost over the bulk. V2: Fix compile errors reported by kbuild test robot <lkp@intel.com> V4: Isolated ndo, driver changes and callers. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | | | xdp: introduce xdp_return_frame_rx_napiJesper Dangaard Brouer2018-05-241-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When sending an xdp_frame through xdp_do_redirect call, then error cases can happen where the xdp_frame needs to be dropped, and returning an -errno code isn't sufficient/possible any-longer (e.g. for cpumap case). This is already fully supported, by simply calling xdp_return_frame. This patch is an optimization, which provides xdp_return_frame_rx_napi, which is a faster variant for these error cases. It take advantage of the protection provided by XDP RX running under NAPI protection. This change is mostly relevant for drivers using the page_pool allocator as it can take advantage of this. (Tested with mlx5). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | | | xdp: add tracepoint for devmap like cpumap haveJesper Dangaard Brouer2018-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Notice how this allow us get XDP statistic without affecting the XDP performance, as tracepoint is no-longer activated on a per packet basis. V5: Spotted by John Fastabend. Fix 'sent' also counted 'drops' in this patch, a later patch corrected this, but it was a mistake in this intermediate step. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | | | bpf: devmap introduce dev_map_enqueueJesper Dangaard Brouer2018-05-241-13/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Functionality is the same, but the ndo_xdp_xmit call is now simply invoked from inside the devmap.c code. V2: Fix compile issue reported by kbuild test robot <lkp@intel.com> V5: Cleanups requested by Daniel - Newlines before func definition - Use BUILD_BUG_ON checks - Remove unnecessary use return value store in dev_map_enqueue Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | | | ipv6: sr: Add seg6local action End.BPFMathieu Xhonneux2018-05-242-2/+191
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the End.BPF action to the LWT seg6local infrastructure. This action works like any other seg6local End action, meaning that an IPv6 header with SRH is needed, whose DA has to be equal to the SID of the action. It will also advance the SRH to the next segment, the BPF program does not have to take care of this. Since the BPF program may not be a source of instability in the kernel, it is important to ensure that the integrity of the packet is maintained before yielding it back to the IPv6 layer. The hook hence keeps track if the SRH has been altered through the helpers, and re-validates its content if needed with seg6_validate_srh. The state kept for validation is stored in a per-CPU buffer. The BPF program is not allowed to directly write into the packet, and only some fields of the SRH can be altered through the helper bpf_lwt_seg6_store_bytes. Performances profiling has shown that the SRH re-validation does not induce a significant overhead. If the altered SRH is deemed as invalid, the packet is dropped. This validation is also done before executing any action through bpf_lwt_seg6_action, and will not be performed again if the SRH is not modified after calling the action. The BPF program may return 3 types of return codes: - BPF_OK: the End.BPF action will look up the next destination through seg6_lookup_nexthop. - BPF_REDIRECT: if an action has been executed through the bpf_lwt_seg6_action helper, the BPF program should return this value, as the skb's destination is already set and the default lookup should not be performed. - BPF_DROP : the packet will be dropped. Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | bpf: Split lwt inout verifier structuresMathieu Xhonneux2018-05-241-31/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new bpf_lwt_push_encap helper should only be accessible within the LWT BPF IN hook, and not the OUT one, as this may lead to a skb under panic. At the moment, both LWT BPF IN and OUT share the same list of helpers, whose calls are authorized by the verifier. This patch separates the verifier ops for the IN and OUT hooks, and allows the IN hook to call the bpf_lwt_push_encap helper. This patch is also the occasion to put all lwt_*_func_proto functions together for clarity. At the moment, socks_op_func_proto is in the middle of lwt_inout_func_proto and lwt_xmit_func_proto. Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | bpf: Add IPv6 Segment Routing helpersMathieu Xhonneux2018-05-243-23/+269
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The BPF seg6local hook should be powerful enough to enable users to implement most of the use-cases one could think of. After some thinking, we figured out that the following actions should be possible on a SRv6 packet, requiring 3 specific helpers : - bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH - bpf_lwt_seg6_adjust_srh: Allow to grow or shrink a SRH (to add/delete TLVs) - bpf_lwt_seg6_action: Apply some SRv6 network programming actions (specifically End.X, End.T, End.B6 and End.B6.Encap) The specifications of these helpers are provided in the patch (see include/uapi/linux/bpf.h). The non-sensitive fields of the SRH are the following : flags, tag and TLVs. The other fields can not be modified, to maintain the SRH integrity. Flags, tag and TLVs can easily be modified as their validity can be checked afterwards via seg6_validate_srh. It is not allowed to modify the segments directly. If one wants to add segments on the path, he should stack a new SRH using the End.B6 action via bpf_lwt_seg6_action. Growing, shrinking or editing TLVs via the helpers will flag the SRH as invalid, and it will have to be re-validated before re-entering the IPv6 layer. This flag is stored in a per-CPU buffer, along with the current header length in bytes. Storing the SRH len in bytes in the control block is mandatory when using bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH len rounded to 8 bytes (a padding TLV can be inserted to ensure the 8-bytes boundary). When adding/deleting TLVs within the BPF program, the SRH may temporary be in an invalid state where its length cannot be rounded to 8 bytes without remainder, hence the need to store the length in bytes separately. The caller of the BPF program can then ensure that the SRH's final length is valid using this value. Again, a final SRH modified by a BPF program which doesn’t respect the 8-bytes boundary will be discarded as it will be considered as invalid. Finally, a fourth helper is provided, bpf_lwt_push_encap, which is available from the LWT BPF IN hook, but not from the seg6local BPF one. This helper allows to encapsulate a Segment Routing Header (either with a new outer IPv6 header, or by inlining it directly in the existing IPv6 header) into a non-SRv6 packet. This helper is required if we want to offer the possibility to dynamically encapsulate a SRH for non-SRv6 packet, as the BPF seg6local hook only works on traffic already containing a SRH. This is the BPF equivalent of the seg6 LWT infrastructure, which achieves the same purpose but with a static SRH per route. These helpers require CONFIG_IPV6=y (and not =m). Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | ipv6: sr: export function lookup_nexthopMathieu Xhonneux2018-05-241-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function lookup_nexthop is essential to implement most of the seg6local actions. As we want to provide a BPF helper allowing to apply some of these actions on the packet being processed, the helper should be able to call this function, hence the need to make it public. Moreover, if one argument is incorrect or if the next hop can not be found, an error should be returned by the BPF helper so the BPF program can adapt its processing of the packet (return an error, properly force the drop, ...). This patch hence makes this function return dst->error to indicate a possible error. Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | bpf: Add mtu checking to FIB forwarding helperDavid Ahern2018-05-221-7/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add check that egress MTU can handle packet to be forwarded. If the MTU is less than the packet length, return 0 meaning the packet is expected to continue up the stack for help - eg., fragmenting the packet or sending an ICMP. The XDP path needs to leverage the FIB entry for an MTU on the route spec or an exception entry for a given destination. The skb path lets is_skb_forwardable decide if the packet can be sent. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | net/ipv6: Add helper to return path MTU based on fib resultDavid Ahern2018-05-223-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Determine path MTU from a FIB lookup result. Logic is based on ip6_dst_mtu_forward plus lookup of nexthop exception. Add ip6_dst_mtu_forward to ipv6_stubs to handle access by core bpf code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | net/ipv4: Add helper to return path MTU based on fib resultDavid Ahern2018-05-221-0/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Determine path MTU from a FIB lookup result. Logic is a distillation of ip_dst_mtu_maybe_forward. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | xsk: convert atomic_t to refcount_tBjörn Töpel2018-05-222-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce refcount_t, in favor of atomic_t. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | xsk: simplified umem setupBjörn Töpel2018-05-223-55/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As suggested by Daniel Borkmann, the umem setup code was a too defensive and complex. Here, we reduce the number of checks. Also, the memory pinning is now folded into the umem creation, and we do correct locking. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | xsk: add missing write- and data-dependency barrierBjörn Töpel2018-05-221-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here, we add a missing write-barrier, and use READ_ONCE for the data-dependency barrier. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | xsk: remove explicit ring structure from uapiBjörn Töpel2018-05-222-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In this commit we remove the explicit ring structure from the the uapi. It is tricky for an uapi to depend on a certain L1 cache line size, since it can differ for variants of the same architecture. Now, we let the user application determine the offsets of the producer, consumer and descriptors by asking the socket via getsockopt. A typical flow would be (Rx ring): struct xdp_mmap_offsets off; struct xdp_desc *ring; u32 *prod, *cons; void *map; ... getsockopt(fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen); map = mmap(NULL, off.rx.desc + NUM_DESCS * sizeof(struct xdp_desc), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, sfd, XDP_PGOFF_RX_RING); prod = map + off.rx.producer; cons = map + off.rx.consumer; ring = map + off.rx.desc; Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | xsk: proper queue id check at bindMagnus Karlsson2018-05-221-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Validate the queue id against both Rx and Tx on the netdev. Also, make sure that the queue exists at xmit time. Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | xsk: remove rebind supportBjörn Töpel2018-05-221-21/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Supporting rebind, i.e. after a successful bind the process can call bind again without closing the socket, makes the AF_XDP setup state machine more complex. Constrain the state space, by not supporting rebind. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | bpf: allow sk_msg programs to read sock fieldsJohn Fastabend2018-05-181-3/+111
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently sk_msg programs only have access to the raw data. However, it is often useful when building policies to have the policies specific to the socket endpoint. This allows using the socket tuple as input into filters, etc. This patch adds ctx access to the sock fields. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | xsk: proper '=' alignmentBjörn Töpel2018-05-181-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Properly align xsk_proto_ops initialization. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | xsk: fixed some cases of unnecessary parenthesesBjörn Töpel2018-05-183-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Removed some cases of unnecessary parentheses. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
OpenPOWER on IntegriCloud