summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* netfilter: Use nf_hook_state.netEric W. Biederman2015-09-1721-58/+41
| | | | | | | | | Instead of saying "net = dev_net(state->in?state->in:state->out)" just say "state->net". As that information is now availabe, much less confusing and much less error prone. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netfilter: Pass struct net into the netfilter hooksEric W. Biederman2015-09-1729-102/+126
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass a network namespace parameter into the netfilter hooks. At the call site of the netfilter hooks the path a packet is taking through the network stack is well known which allows the network namespace to be easily and reliabily. This allows the replacement of magic code like "dev_net(state->in?:state->out)" that appears at the start of most netfilter hooks with "state->net". In almost all cases the network namespace passed in is derived from the first network device passed in, guaranteeing those paths will not see any changes in practice. The exceptions are: xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm) ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp) ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp) ipv4/raw.c:raw_send_hdrinc() sock_net(sk) ipv6/ip6_output.c:ip6_xmit() sock_net(sk) ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev) ipv6/raw.c:raw6_send_hdrinc() sock_net(sk) br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev In all cases these exceptions seem to be a better expression for the network namespace the packet is being processed in then the historic "dev_net(in?in:out)". I am documenting them in case something odd pops up and someone starts trying to track down what happened. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bridge: Add br_netif_receive_skb remove netif_receive_skb_skEric W. Biederman2015-09-172-3/+8
| | | | | | | | netif_receive_skb_sk is only called once in the bridge code, replace it with a bridge specific function that calls netif_receive_skb. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bridge: Cache net in br_nf_pre_routing_finishEric W. Biederman2015-09-171-1/+2
| | | | | | | This is prep work for passing net to the netfilter hooks. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bridge: Pass net into br_nf_push_frag_xmitEric W. Biederman2015-09-171-3/+8
| | | | | | | | | | | When struct net starts being passed through the ipv4 and ipv6 fragment routines br_nf_push_frag_xmit will need to take a net parameter. Prepare br_nf_push_frag_xmit before that is needed and introduce br_nf_push_frag_xmit_sk for the call sites that still need the old calling conventions. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bridge: Pass net into br_nf_ip_fragmentEric W. Biederman2015-09-171-6/+6
| | | | | | | | | This is a prep work for passing struct net through ip_do_fragment and later the netfilter okfn. Doing this independently makes the later code changes clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Compute net once in raw6_send_hdrincEric W. Biederman2015-09-171-2/+3
| | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Cache net in ip6_outputEric W. Biederman2015-09-171-2/+2
| | | | | | | | Keep net in a local variable so I can use it in NF_HOOK_COND when I pass struct net to all of the netfilter hooks. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Only compute net once in ip6_finish_output2Eric W. Biederman2015-09-171-6/+5
| | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Don't recompute net in ip6_rcvEric W. Biederman2015-09-171-1/+1
| | | | | | | Avoid silly redundant code Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Remove dev_queue_xmit_skEric W. Biederman2015-09-171-2/+2
| | | | | | | | | | | | | | | | | A function with weird arguments that it will never use to accomdate a netfilter callback prototype is absolutely in the core of the networking stack. Frankly it does not make sense and it causes a lot of confusion as to why arguments that are never used are being passed to the function. As I am preparing to make a second change to arguments to the okfn even the names stops making sense. As I have removed the two callers of this function remove this confusion from the networking stack. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bridge: Introduce br_send_bpdu_finishEric W. Biederman2015-09-171-1/+6
| | | | | | | | | The function dev_queue_xmit_skb_sk is unncessary and very confusing. Introduce br_send_bpdu_finish to remove the need for dev_queue_xmit_skb_sk, and have br_send_bpdu_finish call dev_queue_xmit. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* arp: Introduce arp_xmit_finishEric W. Biederman2015-09-171-1/+6
| | | | | | | | | The function dev_queue_xmit_skb_sk is unncessary and very confusing. Introduce arp_xmit_finish to remove the need for dev_queue_xmit_skb_sk, and have arp_xmit_finish call dev_queue_xmit. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Only compute net once in ip6mr_forward2_finishEric W. Biederman2015-09-171-2/+3
| | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Only compute net once in ipmr_forward_finishEric W. Biederman2015-09-171-2/+3
| | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Only compute net once in ip_rcv_finishEric W. Biederman2015-09-171-6/+4
| | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Only compute net once in ip_finish_output2Eric W. Biederman2015-09-171-2/+3
| | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Explicitly compute net in ip_fragmentEric W. Biederman2015-09-171-3/+2
| | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Only compute net once in ip_do_fragmentEric W. Biederman2015-09-171-6/+8
| | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Don't recompute net in ipmr_queue_xmitEric W. Biederman2015-09-171-1/+1
| | | | | | | Calling dev_net(dev) for is just silly. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Remember the net in ip_output and ip_mc_outputEric W. Biederman2015-09-171-2/+4
| | | | | | | | This is a prepatory patch to passing net int the netfilter hooks, where net will be used again. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Compute net once in ip_rcvEric W. Biederman2015-09-171-7/+9
| | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Compute net once in ip_forward_finishEric W. Biederman2015-09-171-2/+3
| | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Compute net once in ip_forwardEric W. Biederman2015-09-171-2/+4
| | | | | | | | Compute struct net from the input device in ip_forward before it is used. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Merge dst_output and dst_output_skEric W. Biederman2015-09-1718-25/+25
| | | | | | | | | Add a sock paramter to dst_output making dst_output_sk superfluous. Add a skb->sk parameter to all of the callers of dst_output Have the callers of dst_output_sk call dst_output. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* xfrm: Remove unused afinfo method init_dstEric W. Biederman2015-09-171-2/+0
| | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net-sysfs: get_netdev_queue_index() cleanupThadeu Lima de Souza Cascardo2015-09-171-6/+3
| | | | | | | | | | | | | | | | | | | Redo commit ed1acc8cd8c22efa919da8d300bab646e01c2dce. Commit 822b3b2ebfff8e9b3d006086c527738a7ca00cd0 ("net: Add max rate tx queue attribute") moved get_netdev_queue_index around, but kept the old version. Probably because of a reuse of the original patch from before Eric's change to that function. Remove one inline keyword, and no need for a loop to find an index into a table. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Fixes: 822b3b2ebfff ("net: Add max rate tx queue attribute") Acked-by: Or Gerlitz <ogerlitz@mellanox.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 statsSowmini Varadhan2015-09-153-6/+12
| | | | | | | | | | | | | | | | | | Many commonly used functions like getifaddrs() invoke RTM_GETLINK to dump the interface information, and do not need the the AF_INET6 statististics that are always returned by default from rtnl_fill_ifinfo(). Computing the statistics can be an expensive operation that impacts scaling, so it is desirable to avoid this if the information is not needed. This patch adds a the RTEXT_FILTER_SKIP_STATS extended info flag that can be passed with netlink_request() to avoid statistics computation for the ifinfo path. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Allow user to get table id from route lookupDavid Ahern2015-09-151-4/+8
| | | | | | | | | | | | | | rt_fill_info which is called for 'route get' requests hardcodes the table id as RT_TABLE_MAIN which is not correct when multiple tables are used. Use the newly added table id in the rtable to send back the correct table similar to what is done for IPv6. To maintain current ABI a new request flag, RTM_F_LOOKUP_TABLE, is added to indicate the actual table is wanted versus the hardcoded response. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add FIB table id to rtableDavid Ahern2015-09-152-0/+9
| | | | | | | | Add the FIB table id to rtable to make the information available for IPv4 as it is for IPv6. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Refactor rtable initializationDavid Ahern2015-09-151-52/+33
| | | | | | | | All callers to rt_dst_alloc have nearly the same initialization following a successful allocation. Consolidate it into rt_dst_alloc. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2015-09-1034-108/+461
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix out-of-bounds array access in netfilter ipset, from Jozsef Kadlecsik. 2) Use correct free operation on netfilter conntrack templates, from Daniel Borkmann. 3) Fix route leak in SCTP, from Marcelo Ricardo Leitner. 4) Fix sizeof(pointer) in mac80211, from Thierry Reding. 5) Fix cache pointer comparison in ip6mr leading to missed unlock of mrt_lock. From Richard Laing. 6) rds_conn_lookup() needs to consider network namespace in key comparison, from Sowmini Varadhan. 7) Fix deadlock in TIPC code wrt broadcast link wakeups, from Kolmakov Dmitriy. 8) Fix fd leaks in bpf syscall, from Daniel Borkmann. 9) Fix error recovery when installing ipv6 multipath routes, we would delete the old route before we would know if we could fully commit to the new set of nexthops. Fix from Roopa Prabhu. 10) Fix run-time suspend problems in r8152, from Hayes Wang. 11) In fec, don't program the MAC address into the chip when the clocks are gated off. From Fugang Duan. 12) Fix poll behavior for netlink sockets when using rx ring mmap, from Daniel Borkmann. 13) Don't allocate memory with GFP_KERNEL from get_stats64 in r8169 driver, from Corinna Vinschen. 14) In TCP Cubic congestion control, handle idle periods better where we are application limited, in order to keep cwnd from growing out of control. From Eric Dumzet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits) tcp_cubic: better follow cubic curve after idle period tcp: generate CA_EVENT_TX_START on data frames xen-netfront: respect user provided max_queues xen-netback: respect user provided max_queues r8169: Fix sleeping function called during get_stats64, v2 ether: add IEEE 1722 ethertype - TSN netlink, mmap: fix edge-case leakages in nf queue zero-copy netlink, mmap: don't walk rx ring on poll if receive queue non-empty cxgb4: changes for new firmware 1.14.4.0 net: fec: add netif status check before set mac address r8152: fix the runtime suspend issues r8152: split DRIVER_VERSION ipv6: fix ifnullfree.cocci warnings add microchip LAN88xx phy driver stmmac: fix check for phydev being open net: qlcnic: delete redundant memsets net: mv643xx_eth: use kzalloc net: jme: use kzalloc() instead of kmalloc+memset net: cavium: liquidio: use kzalloc in setup_glist() net: ipv6: use common fib_default_rule_pref ...
| * tcp_cubic: better follow cubic curve after idle periodEric Dumazet2015-09-101-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jana Iyengar found an interesting issue on CUBIC : The epoch is only updated/reset initially and when experiencing losses. The delta "t" of now - epoch_start can be arbitrary large after app idle as well as the bic_target. Consequentially the slope (inverse of ca->cnt) would be really large, and eventually ca->cnt would be lower-bounded in the end to 2 to have delayed-ACK slow-start behavior. This particularly shows up when slow_start_after_idle is disabled as a dangerous cwnd inflation (1.5 x RTT) after few seconds of idle time. Jana initial fix was to reset epoch_start if app limited, but Neal pointed out it would ask the CUBIC algorithm to recalculate the curve so that we again start growing steeply upward from where cwnd is now (as CUBIC does just after a loss). Ideally we'd want the cwnd growth curve to be the same shape, just shifted later in time by the amount of the idle period. Reported-by: Jana Iyengar <jri@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Sangtae Ha <sangtae.ha@gmail.com> Cc: Lawrence Brakmo <lawrence@brakmo.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: generate CA_EVENT_TX_START on data framesNeal Cardwell2015-09-101-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Issuing a CC TX_START event on control frames like pure ACK is a waste of time, as a CC should not care. Following patch needs this change, as we want CUBIC to properly track idle time at a low cost, with a single TX_START being generated. Yuchung might slightly refine the condition triggering TX_START on a followup patch. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Jana Iyengar <jri@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Sangtae Ha <sangtae.ha@gmail.com> Cc: Lawrence Brakmo <lawrence@brakmo.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * netlink, mmap: fix edge-case leakages in nf queue zero-copyDaniel Borkmann2015-09-092-8/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When netlink mmap on receive side is the consumer of nf queue data, it can happen that in some edge cases, we write skb shared info into the user space mmap buffer: Assume a possible rx ring frame size of only 4096, and the network skb, which is being zero-copied into the netlink skb, contains page frags with an overall skb->len larger than the linear part of the netlink skb. skb_zerocopy(), which is generic and thus not aware of the fact that shared info cannot be accessed for such skbs then tries to write and fill frags, thus leaking kernel data/pointers and in some corner cases possibly writing out of bounds of the mmap area (when filling the last slot in the ring buffer this way). I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has an advertised length larger than 4096, where the linear part is visible at the slot beginning, and the leaked sizeof(struct skb_shared_info) has been written to the beginning of the next slot (also corrupting the struct nl_mmap_hdr slot header incl. status etc), since skb->end points to skb->data + ring->frame_size - NL_MMAP_HDRLEN. The fix adds and lets __netlink_alloc_skb() take the actual needed linear room for the network skb + meta data into account. It's completely irrelevant for non-mmaped netlink sockets, but in case mmap sockets are used, it can be decided whether the available skb_tailroom() is really large enough for the buffer, or whether it needs to internally fallback to a normal alloc_skb(). >From nf queue side, the information whether the destination port is an mmap RX ring is not really available without extra port-to-socket lookup, thus it can only be determined in lower layers i.e. when __netlink_alloc_skb() is called that checks internally for this. I chose to add the extra ldiff parameter as mmap will then still work: We have data_len and hlen in nfqnl_build_packet_message(), data_len is the full length (capped at queue->copy_range) for skb_zerocopy() and hlen some possible part of data_len that needs to be copied; the rem_len variable indicates the needed remaining linear mmap space. The only other workaround in nf queue internally would be after allocation time by f.e. cap'ing the data_len to the skb_tailroom() iff we deal with an mmap skb, but that would 1) expose the fact that we use a mmap skb to upper layers, and 2) trim the skb where we otherwise could just have moved the full skb into the normal receive queue. After the patch, in my test case the ring slot doesn't fit and therefore shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and thus needs to be picked up via recv(). Fixes: 3ab1f683bf8b ("nfnetlink: add support for memory mapped netlink") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * netlink, mmap: don't walk rx ring on poll if receive queue non-emptyDaniel Borkmann2015-09-091-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of netlink mmap, there can be situations where received frames have to be placed into the normal receive queue. The ring buffer indicates this through NL_MMAP_STATUS_COPY, so the user is asked to pick them up via recvmsg(2) syscall, and to put the slot back to NL_MMAP_STATUS_UNUSED. Commit 0ef707700f1c ("netlink: rx mmap: fix POLLIN condition") changed polling, so that we walk in the worst case the whole ring through the new netlink_has_valid_frame(), for example, when the ring would have no NL_MMAP_STATUS_VALID, but at least one NL_MMAP_STATUS_COPY frame. Since we do a datagram_poll() already earlier to pick up a mask that could possibly contain POLLIN | POLLRDNORM already (due to NL_MMAP_STATUS_COPY), we can skip checking the rx ring entirely. In case the kernel is compiled with !CONFIG_NETLINK_MMAP, then all this is irrelevant anyway as netlink_poll() is just defined as datagram_poll(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: fix ifnullfree.cocci warningsWu Fengguang2015-09-091-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net/ipv6/route.c:2946:3-8: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values. NULL check before some freeing functions is not needed. Based on checkpatch warning "kfree(NULL) is safe this check is probably not required" and kfreeaddr.cocci by Julia Lawall. Generated by: scripts/coccinelle/free/ifnullfree.cocci CC: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ipv6: use common fib_default_rule_prefPhil Sutter2015-09-096-17/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This switches IPv6 policy routing to use the shared fib_default_rule_pref() function of IPv4 and DECnet. It is also used in multicast routing for IPv4 as well as IPv6. The motivation for this patch is a complaint about iproute2 behaving inconsistent between IPv4 and IPv6 when adding policy rules: Formerly, IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the assigned priority value was decreased with each rule added. Since then all users of the default_pref field have been converted to assign the generic function fib_default_rule_pref(), fib_nl_newrule() may just use it directly instead. Therefore get rid of the function pointer altogether and make fib_default_rule_pref() static, as it's not used outside fib_rules.c anymore. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: fix multipath route replace error recoveryRoopa Prabhu2015-09-091-26/+175
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: The ecmp route replace support for ipv6 in the kernel, deletes the existing ecmp route too early, ie when it installs the first nexthop. If there is an error in installing the subsequent nexthops, its too late to recover the already deleted existing route leaving the fib in an inconsistent state. This patch reduces the possibility of this by doing the following: a) Changes the existing multipath route add code to a two stage process: build rt6_infos + insert them ip6_route_add rt6_info creation code is moved into ip6_route_info_create. b) This ensures that most errors are caught during building rt6_infos and we fail early c) Separates multipath add and del code. Because add needs the special two stage mode in a) and delete essentially does not care. d) In any event if the code fails during inserting a route again, a warning is printed (This should be unlikely) Before the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: File exists $ip -6 route show /* previously added ecmp route 3000:1000:1000:1000::2 dissappears from * kernel */ After the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: File exists $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 Fixes: 27596472473a ("ipv6: fix ECMP route replacement") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * RDS: verify the underlying transport exists before creating a connectionSasha Levin2015-09-091-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was no verification that an underlying transport exists when creating a connection, this would cause dereferencing a NULL ptr. It might happen on sockets that weren't properly bound before attempting to send a message, which will cause a NULL ptr deref: [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN [135546.051270] Modules linked in: [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527 [135546.053217] task: ffff8800835bc000 ti: ffff8800bc708000 task.ti: ffff8800bc708000 [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194) [135546.055666] RSP: 0018:ffff8800bc70fab0 EFLAGS: 00010202 [135546.056457] RAX: dffffc0000000000 RBX: 0000000000000f2c RCX: ffff8800835bc000 [135546.057494] RDX: 0000000000000007 RSI: ffff8800835bccd8 RDI: 0000000000000038 [135546.058530] RBP: ffff8800bc70fb18 R08: 0000000000000001 R09: 0000000000000000 [135546.059556] R10: ffffed014d7a3a23 R11: ffffed014d7a3a21 R12: 0000000000000000 [135546.060614] R13: 0000000000000001 R14: ffff8801ec3d0000 R15: 0000000000000000 [135546.061668] FS: 00007faad4ffb700(0000) GS:ffff880252000000(0000) knlGS:0000000000000000 [135546.062836] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [135546.063682] CR2: 000000000000846a CR3: 000000009d137000 CR4: 00000000000006a0 [135546.064723] Stack: [135546.065048] ffffffffafe2055c ffffffffafe23fc1 ffffed00493097bf ffff8801ec3d0008 [135546.066247] 0000000000000000 00000000000000d0 0000000000000000 ac194a24c0586342 [135546.067438] 1ffff100178e1f78 ffff880320581b00 ffff8800bc70fdd0 ffff880320581b00 [135546.068629] Call Trace: [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134) [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298) [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278) [135546.071981] rds_sendmsg (net/rds/send.c:1058) [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38) [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298) [135546.074577] ? rds_send_drop_to (net/rds/send.c:976) [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795) [135546.076349] ? __might_fault (mm/memory.c:3795) [135546.077179] ? rds_send_drop_to (net/rds/send.c:976) [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620) [135546.078856] SYSC_sendto (net/socket.c:1657) [135546.079596] ? SYSC_connect (net/socket.c:1628) [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926) [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674) [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749) [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16) [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16) [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749) [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1 Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: tipc: fix stall during bclink wakeup procedureKolmakov Dmitriy2015-09-081-1/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an attempt to wake up users of broadcast link is made when there is no enough place in send queue than it may hang up inside the tipc_sk_rcv() function since the loop breaks only after the wake up queue becomes empty. This can lead to complete CPU stall with the following message generated by RCU: INFO: rcu_sched self-detected stall on CPU { 0} (t=2101 jiffies g=54225 c=54224 q=11465) Task dump for CPU 0: tpch R running task 0 39949 39948 0x0000000a ffffffff818536c0 ffff88181fa037a0 ffffffff8106a4be 0000000000000000 ffffffff818536c0 ffff88181fa037c0 ffffffff8106d8a8 ffff88181fa03800 0000000000000001 ffff88181fa037f0 ffffffff81094a50 ffff88181fa15680 Call Trace: <IRQ> [<ffffffff8106a4be>] sched_show_task+0xae/0x120 [<ffffffff8106d8a8>] dump_cpu_task+0x38/0x40 [<ffffffff81094a50>] rcu_dump_cpu_stacks+0x90/0xd0 [<ffffffff81097c3b>] rcu_check_callbacks+0x3eb/0x6e0 [<ffffffff8106e53f>] ? account_system_time+0x7f/0x170 [<ffffffff81099e64>] update_process_times+0x34/0x60 [<ffffffff810a84d1>] tick_sched_handle.isra.18+0x31/0x40 [<ffffffff810a851c>] tick_sched_timer+0x3c/0x70 [<ffffffff8109a43d>] __run_hrtimer.isra.34+0x3d/0xc0 [<ffffffff8109aa95>] hrtimer_interrupt+0xc5/0x1e0 [<ffffffff81030d52>] ? native_smp_send_reschedule+0x42/0x60 [<ffffffff81032f04>] local_apic_timer_interrupt+0x34/0x60 [<ffffffff810335bc>] smp_apic_timer_interrupt+0x3c/0x60 [<ffffffff8165a3fb>] apic_timer_interrupt+0x6b/0x70 [<ffffffff81659129>] ? _raw_spin_unlock_irqrestore+0x9/0x10 [<ffffffff8107eb9f>] __wake_up_sync_key+0x4f/0x60 [<ffffffffa313ddd1>] tipc_write_space+0x31/0x40 [tipc] [<ffffffffa313dadf>] filter_rcv+0x31f/0x520 [tipc] [<ffffffffa313d699>] ? tipc_sk_lookup+0xc9/0x110 [tipc] [<ffffffff81659259>] ? _raw_spin_lock_bh+0x19/0x30 [<ffffffffa314122c>] tipc_sk_rcv+0x2dc/0x3e0 [tipc] [<ffffffffa312e7ff>] tipc_bclink_wakeup_users+0x2f/0x40 [tipc] [<ffffffffa313ce26>] tipc_node_unlock+0x186/0x190 [tipc] [<ffffffff81597c1c>] ? kfree_skb+0x2c/0x40 [<ffffffffa313475c>] tipc_rcv+0x2ac/0x8c0 [tipc] [<ffffffffa312ff58>] tipc_l2_rcv_msg+0x38/0x50 [tipc] [<ffffffff815a76d3>] __netif_receive_skb_core+0x5a3/0x950 [<ffffffff815a98d3>] __netif_receive_skb+0x13/0x60 [<ffffffff815a993e>] netif_receive_skb_internal+0x1e/0x90 [<ffffffff815aa138>] napi_gro_receive+0x78/0xa0 [<ffffffffa07f93f4>] tg3_poll_work+0xc54/0xf40 [tg3] [<ffffffff81597c8c>] ? consume_skb+0x2c/0x40 [<ffffffffa07f9721>] tg3_poll_msix+0x41/0x160 [tg3] [<ffffffff815ab0f2>] net_rx_action+0xe2/0x290 [<ffffffff8104b92a>] __do_softirq+0xda/0x1f0 [<ffffffff8104bc26>] irq_exit+0x76/0xa0 [<ffffffff81004355>] do_IRQ+0x55/0xf0 [<ffffffff8165a12b>] common_interrupt+0x6b/0x6b <EOI> The issue occurs only when tipc_sk_rcv() is used to wake up postponed senders: tipc_bclink_wakeup_users() // wakeupq - is a queue which consists of special // messages with SOCK_WAKEUP type. tipc_sk_rcv(wakeupq) ... while (skb_queue_len(inputq)) { filter_rcv(skb) // Here the type of message is checked // and if it is SOCK_WAKEUP then // it tries to wake up a sender. tipc_write_space(sk) wake_up_interruptible_sync_poll() } After the sender thread is woke up it can gather control and perform an attempt to send a message. But if there is no enough place in send queue it will call link_schedule_user() function which puts a message of type SOCK_WAKEUP to the wakeup queue and put the sender to sleep. Thus the size of the queue actually is not changed and the while() loop never exits. The approach I proposed is to wake up only senders for which there is enough place in send queue so the described issue can't occur. Moreover the same approach is already used to wake up senders on unicast links. I have got into the issue on our product code but to reproduce the issue I changed a benchmark test application (from tipcutils/demos/benchmark) to perform the following scenario: 1. Run 64 instances of test application (nodes). It can be done on the one physical machine. 2. Each application connects to all other using TIPC sockets in RDM mode. 3. When setup is done all nodes start simultaneously send broadcast messages. 4. Everything hangs up. The issue is reproducible only when a congestion on broadcast link occurs. For example, when there are only 8 nodes it works fine since congestion doesn't occur. Send queue limit is 40 in my case (I use a critical importance level) and when 64 nodes send a message at the same moment a congestion occurs every time. Signed-off-by: Dmitry S Kolmakov <kolmakov.dmitriy@huawei.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bridge: remove unnecessary switchdev includeVivien Didelot2015-09-081-1/+0
| | | | | | | | | | | | | | | | Remove the unnecessary switchdev.h include from br_netlink.c. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bridge: check __vlan_vid_del for errorVivien Didelot2015-09-081-4/+13
| | | | | | | | | | | | | | | | | | Since __vlan_del can return an error code, change its inner function __vlan_vid_del to return an eventual error from switchdev_port_obj_del. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
| * openvswitch: Remove conntrack Kconfig option.Joe Stringer2015-09-063-14/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | There's no particular desire to have conntrack action support in Open vSwitch as an independently configurable bit, rather just to ensure there is not a hard dependency. This exposed option doesn't accurately reflect the conntrack dependency when enabled, so simplify this by removing the option. Compile the support if NF_CONNTRACK is enabled. Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge tag 'mac80211-for-davem-2015-09-04' of ↵David S. Miller2015-09-067-6/+113
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== For the first round of fixes, we have this: * fix for the sizeof() pointer type issue * a fix for regulatory getting into a restore loop * a fix for rfkill global 'all' state, it needs to be stored everywhere to apply correctly to new rfkill instances * properly refuse CQM RSSI when it cannot actually be used * protect HT TDLS traffic properly in non-HT networks * don't incorrectly advertise 80 MHz support when not allowed ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * mac80211: reject software RSSI CQM with beacon filteringJohannes Berg2015-09-041-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | When beacon filtering is enabled the mac80211 software implementation for RSSI CQM cannot work as beacons will not be available. Rather than accepting such a configuration without proper effect, reject it. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * mac80211: avoid VHT usage with no 80MHz chans allowedArik Nemtsov2015-09-042-0/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently if 80MHz channels are not allowed for use, the VHT IE is not included in the probe request for an AP. This is not good enough if the AP is configured with the wrong regulatory and supports VHT even where prohibited or in TDLS scenarios. Mark the ifmgd with the DISABLE_VHT flag for the misbehaving-AP case, and unset VHT support from the peer-station entry for the TDLS case. Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * cfg80211: regulatory: restore proper user alpha2Maciej S. Szmigiero2015-09-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | restore_regulatory_settings() should restore alpha2 as computed in restore_alpha2(), not raw user_alpha2 to behave as described in the comment just above that code. This fixes endless loop of calling CRDA for "00" and "97" countries after resume from suspend on my laptop. Looks like others had the same problem, too: http://ath9k-devel.ath9k.narkive.com/knY5W6St/ath9k-and-crda-messages-in-logs https://bugs.launchpad.net/ubuntu/+source/linux/+bug/899335 https://forum.porteus.org/viewtopic.php?t=4975&p=36436 https://forums.opensuse.org/showthread.php/483356-Authentication-Regulatory-Domain-issues-ath5k-12-2 Signed-off-by: Maciej Szmigiero <mail@maciej.szmigiero.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * rfkill: Copy "all" global state to other typesJoão Paulo Rechi Vita2015-09-041-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When switching the state of all RFKill switches of type all we need to replicate the RFKILL_TYPE_ALL global state to all the other types global state, so it is used to initialize persistent RFKill switches on register. Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com> Acked-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * mac80211: protect non-HT BSS when HT TDLS traffic existsAvri Altman2015-09-041-3/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | HT TDLS traffic should be protected in a non-HT BSS to avoid collisions. Therefore, when TDLS peers join/leave, check if protection is (now) needed and set the ht_operation_mode of the virtual interface according to the HT capabilities of the TDLS peer(s). This works because a non-HT BSS connection never sets (or otherwise uses) the ht_operation_mode; it just means that drivers must be aware that this field applies to all HT traffic for this virtual interface, not just the traffic within the BSS. Document that. Signed-off-by: Avri Altman <avri.altman@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
OpenPOWER on IntegriCloud