summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2015-07-3143-228/+402
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Must teardown SR-IOV before unregistering netdev in igb driver, from Alex Williamson. 2) Fix ipv6 route unreachable crash in IPVS, from Alex Gartrell. 3) Default route selection in ipv4 should take the prefix length, table ID, and TOS into account, from Julian Anastasov. 4) sch_plug must have a reset method in order to purge all buffered packets when the qdisc is reset, likewise for sch_choke, from WANG Cong. 5) Fix deadlock and races in slave_changelink/br_setport in bridging. From Nikolay Aleksandrov. 6) mlx4 bug fixes (wrong index in port even propagation to VFs, overzealous BUG_ON assertion, etc.) from Ido Shamay, Jack Morgenstein, and Or Gerlitz. 7) Turn off klog message about SCTP userspace interface compat that makes no sense at all, from Daniel Borkmann. 8) Fix unbounded restarts of inet frag eviction process, causing NMI watchdog soft lockup messages, from Florian Westphal. 9) Suspend/resume fixes for r8152 from Hayes Wang. 10) Fix busy loop when MSG_WAITALL|MSG_PEEK is used in TCP recv, from Sabrina Dubroca. 11) Fix performance regression when removing a lot of routes from the ipv4 routing tables, from Alexander Duyck. 12) Fix device leak in AF_PACKET, from Lars Westerhoff. 13) AF_PACKET also has a header length comparison bug due to signedness, from Alexander Drozdov. 14) Fix bug in EBPF tail call generation on x86, from Daniel Borkmann. 15) Memory leaks, TSO stats, watchdog timeout and other fixes to thunderx driver from Sunil Goutham and Thanneeru Srinivasulu. 16) act_bpf can leak memory when replacing programs, from Daniel Borkmann. 17) WOL packet fixes in gianfar driver, from Claudiu Manoil. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (79 commits) stmmac: fix missing MODULE_LICENSE in stmmac_platform gianfar: Enable device wakeup when appropriate gianfar: Fix suspend/resume for wol magic packet gianfar: Fix warning when CONFIG_PM off act_pedit: check binding before calling tcf_hash_release() net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket net: sched: fix refcount imbalance in actions r8152: reset device when tx timeout r8152: add pre_reset and post_reset qlcnic: Fix corruption while copying act_bpf: fix memory leaks when replacing bpf programs net: thunderx: Fix for crash while BGX teardown net: thunderx: Add PCI driver shutdown routine net: thunderx: Fix crash when changing rss with mutliple traffic flows net: thunderx: Set watchdog timeout value net: thunderx: Wakeup TXQ only if CQE_TX are processed net: thunderx: Suppress alloc_pages() failure warnings net: thunderx: Fix TSO packet statistic net: thunderx: Fix memory leak when changing queue count net: thunderx: Fix RQ_DROP miscalculation ...
| * act_pedit: check binding before calling tcf_hash_release()WANG Cong2015-07-311-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | When we share an action within a filter, the bind refcnt should increase, therefore we should not call tcf_hash_release(). Fixes: 1a29321ed045 ("net_sched: act: Dont increment refcnt on replace") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sk_clone_lock() should only do get_net() if the parent is not a kernel ↵Sowmini Varadhan2015-07-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | socket The newsk returned by sk_clone_lock should hold a get_net() reference if, and only if, the parent is not a kernel socket (making this similar to sk_alloc()). E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock sets up the syn_recv newsk from sk_clone_lock. When the parent (listen) socket is a kernel socket (defined in sk_alloc() as having sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt and should not hold a get_net() reference. Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Acked-by: Eric Dumazet <edumazet@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: fix refcount imbalance in actionsDaniel Borkmann2015-07-301-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 55334a5db5cd ("net_sched: act: refuse to remove bound action outside"), we end up with a wrong reference count for a tc action. Test case 1: FOO="1,6 0 0 4294967295," BAR="1,6 0 0 4294967294," tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \ action bpf bytecode "$FOO" tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 1 ref 1 bind 1 tc actions replace action bpf bytecode "$BAR" index 1 tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe index 1 ref 2 bind 1 tc actions replace action bpf bytecode "$FOO" index 1 tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 1 ref 3 bind 1 Test case 2: FOO="1,6 0 0 4294967295," tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 1 bind 1 tc actions add action drop index 1 RTNETLINK answers: File exists [...] tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 2 bind 1 tc actions add action drop index 1 RTNETLINK answers: File exists [...] tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 3 bind 1 What happens is that in tcf_hash_check(), we check tcf_common for a given index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've found an existing action. Now there are the following cases: 1) We do a late binding of an action. In that case, we leave the tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init() handler. This is correctly handeled. 2) We replace the given action, or we try to add one without replacing and find out that the action at a specific index already exists (thus, we go out with error in that case). In case of 2), we have to undo the reference count increase from tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to do so because of the 'tcfc_bindcnt > 0' check which bails out early with an -EPERM error. Now, while commit 55334a5db5cd prevents 'tc actions del action ...' on an already classifier-bound action to drop the reference count (which could then become negative, wrap around etc), this restriction only accounts for invocations outside a specific action's ->init() handler. One possible solution would be to add a flag thus we possibly trigger the -EPERM ony in situations where it is indeed relevant. After the patch, above test cases have correct reference count again. Fixes: 55334a5db5cd ("net_sched: act: refuse to remove bound action outside") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * act_bpf: fix memory leaks when replacing bpf programsDaniel Borkmann2015-07-291-18/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently trigger multiple memory leaks when replacing bpf actions, besides others: comm "tc", pid 1909, jiffies 4294851310 (age 1602.796s) hex dump (first 32 bytes): 01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 ................ 18 b0 98 6d 00 88 ff ff 00 00 00 00 00 00 00 00 ...m............ backtrace: [<ffffffff817e623e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff8120a22d>] __vmalloc_node_range+0x1bd/0x2c0 [<ffffffff8120a37a>] __vmalloc+0x4a/0x50 [<ffffffff811a8d0a>] bpf_prog_alloc+0x3a/0xa0 [<ffffffff816c0684>] bpf_prog_create+0x44/0xa0 [<ffffffffa09ba4eb>] tcf_bpf_init+0x28b/0x3c0 [act_bpf] [<ffffffff816d7001>] tcf_action_init_1+0x191/0x1b0 [<ffffffff816d70a2>] tcf_action_init+0x82/0xf0 [<ffffffff816d4d12>] tcf_exts_validate+0xb2/0xc0 [<ffffffffa09b5838>] cls_bpf_modify_existing+0x98/0x340 [cls_bpf] [<ffffffffa09b5cd6>] cls_bpf_change+0x1a6/0x274 [cls_bpf] [<ffffffff816d56e5>] tc_ctl_tfilter+0x335/0x910 [<ffffffff816b9145>] rtnetlink_rcv_msg+0x95/0x240 [<ffffffff816df34f>] netlink_rcv_skb+0xaf/0xc0 [<ffffffff816b909e>] rtnetlink_rcv+0x2e/0x40 [<ffffffff816deaaf>] netlink_unicast+0xef/0x1b0 Issue is that the old content from tcf_bpf is allocated and needs to be released when we replace it. We seem to do that since the beginning of act_bpf on the filter and insns, later on the name as well. Example test case, after patch: # FOO="1,6 0 0 4294967295," # BAR="1,6 0 0 4294967294," # tc actions add action bpf bytecode "$FOO" index 2 # tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 2 ref 1 bind 0 # tc actions replace action bpf bytecode "$BAR" index 2 # tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe index 2 ref 1 bind 0 # tc actions replace action bpf bytecode "$FOO" index 2 # tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 2 ref 1 bind 0 # tc actions del action bpf index 2 [...] # echo "scan" > /sys/kernel/debug/kmemleak # cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l 0 Fixes: d23b8ad8ab23 ("tc: add BPF based action") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: flush nd cache on IFF_NOARP changeEric Dumazet2015-07-291-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is the IPv6 equivalent of commit 6c8b4e3ff81b ("arp: flush arp cache on IFF_NOARP change") Without it, we keep buggy neighbours in the cache, with destination MAC address equal to our own MAC address. Tested: tcpdump -i eth0 -s 0 ip6 -n -e & ip link set dev eth0 arp off ping6 remote // sends buggy frames ip link set dev eth0 arp on ping6 remote // should work once kernel is patched Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Mario Fanelli <mariofanelli@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * bridge: mdb: fix delmdb state in the notificationNikolay Aleksandrov2015-07-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since mdb states were introduced when deleting an entry the state was left as it was set in the delete request from the user which leads to the following output when doing a monitor (for example): $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent (monitor) dev br0 port eth3 grp 239.0.0.1 permanent $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent (monitor) dev br0 port eth3 grp 239.0.0.1 temp ^^^ Note the "temp" state in the delete notification which is wrong since the entry was permanent, the state in a delete is always reported as "temp" regardless of the real state of the entry. After this patch: $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent (monitor) dev br0 port eth3 grp 239.0.0.1 permanent $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent (monitor) dev br0 port eth3 grp 239.0.0.1 permanent There's one important note to make here that the state is actually not matched when doing a delete, so one can delete a permanent entry by stating "temp" in the end of the command, I've chosen this fix in order not to break user-space tools which rely on this (incorrect) behaviour. So to give an example after this patch and using the wrong state: $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent (monitor) dev br0 port eth3 grp 239.0.0.1 permanent $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp (monitor) dev br0 port eth3 grp 239.0.0.1 permanent Note the state of the entry that got deleted is correct in the notification. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: ccb1c31a7a87 ("bridge: add flags to distinguish permanent mdb entires") Signed-off-by: David S. Miller <davem@davemloft.net>
| * bridge: mcast: give fast leave precedence over multicast router and querierSatish Ashok2015-07-291-24/+26
| | | | | | | | | | | | | | | | | | | | | | When fast leave is configured on a bridge port and an IGMP leave is received for a group, the group is not deleted immediately if there is a router detected or if multicast querier is configured. Ideally the group should be deleted immediately when fast leave is configured. Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * bridge: Fix network header pointer for vlan tagged packetsToshiaki Makita2015-07-291-7/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several devices that can receive vlan tagged packets with CHECKSUM_PARTIAL like tap, possibly veth and xennet. When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded by bridge to a device with the IP_CSUM feature, they end up with checksum error because before entering bridge, the network header is set to ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(), get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later. Since the network header is exepected to be ETH_HLEN in flow-dissection and hash-calculation in RPS in rx path, and since the header pointer fix is needed only in tx path, set the appropriate network header on forwarding packets. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
| * packet: tpacket_snd(): fix signed/unsigned comparisonAlexander Drozdov2015-07-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tpacket_fill_skb() can return a negative value (-errno) which is stored in tp_len variable. In that case the following condition will be (but shouldn't be) true: tp_len > dev->mtu + dev->hard_header_len as dev->mtu and dev->hard_header_len are both unsigned. That may lead to just returning an incorrect EMSGSIZE errno to the user. Fixes: 52f1454f629fa ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case") Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * arp: filter NOARP neighbours for SIOCGARPEric Dumazet2015-07-281-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When arp is off on a device, and ioctl(SIOCGARP) is queried, a buggy answer is given with MAC address of the device, instead of the mac address of the destination/gateway. We filter out NUD_NOARP neighbours for /proc/net/arp, we must do the same for SIOCGARP ioctl. Tested: lpaa23:~# ./arp 10.246.7.190 MAC=00:01:e8:22:cb:1d // correct answer lpaa23:~# ip link set dev eth0 arp off lpaa23:~# cat /proc/net/arp # check arp table is now 'empty' IP address HW type Flags HW address Mask Device lpaa23:~# ./arp 10.246.7.190 MAC=00:1a:11:c3:0d:7f // buggy answer before patch (this is eth0 mac) After patch : lpaa23:~# ip link set dev eth0 arp off lpaa23:~# ./arp 10.246.7.190 ioctl(SIOCGARP) failed: No such device or address Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Vytautas Valancius <valas@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/ipv4: suppress NETDEV_UP notification on address lifetime updateDavid Ward2015-07-281-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This notification causes the FIB to be updated, which is not needed because the address already exists, and more importantly it may undo intentional changes that were made to the FIB after the address was originally added. (As a point of comparison, when an address becomes deprecated because its preferred lifetime expired, a notification on this chain is not generated.) The motivation for this commit is fixing an incompatibility between DHCP clients which set and update the address lifetime according to the lease, and a commercial VPN client which replaces kernel routes in a way that outbound traffic is sent only through the tunnel (and disconnects if any further route changes are detected via netlink). Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
| * bridge: stp: when using userspace stp stop kernel hello and hold timersNikolay Aleksandrov2015-07-283-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These should be handled only by the respective STP which is in control. They become problematic for devices with limited resources with many ports because the hold_timer is per port and fires each second and the hello timer fires each 2 seconds even though it's global. While in user-space STP mode these timers are completely unnecessary so it's better to keep them off. Also ensure that when the bridge is up these timers are started only when running with kernel STP. Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * packet: missing dev_put() in packet_do_bind()Lars Westerhoff2015-07-271-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When binding a PF_PACKET socket, the use count of the bound interface is always increased with dev_hold in dev_get_by_{index,name}. However, when rebound with the same protocol and device as in the previous bind the use count of the interface was not decreased. Ultimately, this caused the deletion of the interface to fail with the following message: unregister_netdevice: waiting for dummy0 to become free. Usage count = 1 This patch moves the dev_put out of the conditional part that was only executed when either the protocol or device changed on a bind. Fixes: 902fefb82ef7 ('packet: improve socket create/bind latency in some cases') Signed-off-by: Lars Westerhoff <lars.westerhoff@newtec.eu> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * fib_trie: Drop unnecessary calls to leaf_pull_suffixAlexander Duyck2015-07-271-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It was reported that update_suffix was taking a long time on systems where a large number of leaves were attached to a single node. As it turns out fib_table_flush was calling update_suffix for each leaf that didn't have all of the aliases stripped from it. As a result, on this large node removing one leaf would result in us calling update_suffix for every other leaf on the node. The fix is to just remove the calls to leaf_pull_suffix since they are redundant as we already have a call in resize that will go through and update the suffix length for the node before we exit out of fib_table_flush or fib_table_flush_external. Reported-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: fix recv with flags MSG_WAITALL | MSG_PEEKSabrina Dubroca2015-07-274-9/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called with flags = MSG_WAITALL | MSG_PEEK. sk_wait_data waits for sk_receive_queue not empty, but in this case, the receive queue is not empty, but does not contain any skb that we can use. Add a "last skb seen on receive queue" argument to sk_wait_data, so that it sleeps until the receive queue has new skbs. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493 Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258 Reported-by: Enrico Scholz <rh-bugzilla@ensc.de> Reported-by: Dan Searle <dan@censornet.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'for-upstream' of ↵David S. Miller2015-07-261-0/+4
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Johan Hedberg says: ==================== pull request: bluetooth 2015-07-23 Here's another one-patch pull request for 4.2 which targets a potential NULL pointer dereference in the LE Security Manager code that can be triggered by using older user space tools. The issue has been there since 4.0 so there's the appropriate "Cc: stable" in place. Let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * Bluetooth: Fix NULL pointer dereference in smp_conn_securityJohan Hedberg2015-07-231-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The l2cap_conn->smp pointer may be NULL for various valid reasons where SMP has failed to initialize properly. One such scenario is when crypto support is missing, another when the adapter has been powered on through a legacy method. The smp_conn_security() function should have the appropriate check for this situation to avoid NULL pointer dereferences. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org # 4.0+
| * | inet: frags: remove INET_FRAG_EVICTED and use list_evictor for the testNikolay Aleksandrov2015-07-263-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can simply remove the INET_FRAG_EVICTED flag to avoid all the flags race conditions with the evictor and use a participation test for the evictor list, when we're at that point (after inet_frag_kill) in the timer there're 2 possible cases: 1. The evictor added the entry to its evictor list while the timer was waiting for the chainlock or 2. The timer unchained the entry and the evictor won't see it In both cases we should be able to see list_evictor correctly due to the sync on the chainlock. Joint work with Florian Westphal. Tested-by: Frank Schreuder <fschreuder@transip.nl> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | inet: frag: don't wait for timer deletion when evictingFlorian Westphal2015-07-261-18/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Frank reports 'NMI watchdog: BUG: soft lockup' errors when load is high. Instead of (potentially) unbounded restarts of the eviction process, just skip to the next entry. One caveat is that, when a netns is exiting, a timer may still be running by the time inet_evict_bucket returns. We use the frag memory accounting to wait for outstanding timers, so that when we free the percpu counter we can be sure no running timer will trip over it. Reported-and-tested-by: Frank Schreuder <fschreuder@transip.nl> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | inet: frag: change *_frag_mem_limit functions to take netns_frags as argumentFlorian Westphal2015-07-265-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | Followup patch will call it after inet_frag_queue was freed, so q->net doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would). Tested-by: Frank Schreuder <fschreuder@transip.nl> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | inet: frag: don't re-use chainlist for evictorFlorian Westphal2015-07-261-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 65ba1f1ec0eff ("inet: frags: fix a race between inet_evict_bucket and inet_frag_kill") describes the bug, but the fix doesn't work reliably. Problem is that ->flags member can be set on other cpu without chainlock being held by that task, i.e. the RMW-Cycle can clear INET_FRAG_EVICTED bit after we put the element on the evictor private list. We can crash when walking the 'private' evictor list since an element can be deleted from list underneath the evictor. Join work with Nikolay Alexandrov. Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue") Reported-by: Johan Schuijt <johan@transip.nl> Tested-by: Frank Schreuder <fschreuder@transip.nl> Signed-off-by: Nikolay Alexandrov <nikolay@cumulusnetworks.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: sctp: stop spamming klog with rfc6458, 5.3.2. deprecation warningsDaniel Borkmann2015-07-261-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Back then when we added support for SCTP_SNDINFO/SCTP_RCVINFO from RFC6458 5.3.4/5.3.5, we decided to add a deprecation warning for the (as per RFC deprecated) SCTP_SNDRCV via commit bbbea41d5e53 ("net: sctp: deprecate rfc6458, 5.3.2. SCTP_SNDRCV support"), see [1]. Imho, it was not a good idea, and we should just revert that message for a couple of reasons: 1) It's uapi and therefore set in stone forever. 2) To be able to run on older and newer kernels, an SCTP application would need to probe for both, SCTP_SNDRCV, but also SCTP_SNDINFO/ SCTP_RCVINFO support, so that on older kernels, it can make use of SCTP_SNDRCV, and on newer kernels SCTP_SNDINFO/SCTP_RCVINFO. In my (limited) experience, a lot of SCTP appliances are migrating to newer kernels only ve(ee)ry slowly. 3) Some people don't have the chance to change their applications, f.e. due to proprietary legacy stuff. So, they'll hit this warning in fast path and are stuck with older kernels. But i.e. due to point 1) I really fail to see the benefit of a warning. So just revert that for now, the issue was reported up Jamal. [1] http://thread.gmane.org/gmane.linux.network/321960/ Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Michael Tuexen <tuexen@fh-muenster.de> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | bridge: netlink: fix slave_changelink/br_setport race conditionsNikolay Aleksandrov2015-07-261-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since slave_changelink support was added there have been a few race conditions when using br_setport() since some of the port functions it uses require the bridge lock. It is very easy to trigger a lockup due to some internal spin_lock() usage without bh disabled, also it's possible to get the bridge into an inconsistent state. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: 3ac636b8591c ("bridge: implement rtnl_link_ops->slave_changelink") Reviewed-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2015-07-2511-77/+163
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains ten Netfilter/IPVS fixes, they are: 1) Address refcount leak when creating an expectation from the ctnetlink interface. 2) Fix bug splat in the IDLETIMER target related to sysfs, from Dmitry Torokhov. 3) Resolve panic for unreachable route in IPVS with locally generated traffic in the output path, from Alex Gartrell. 4) Fix wrong source address in rare cases for tunneled traffic in IPVS, from Julian Anastasov. 5) Fix crash if scheduler is changed via ipvsadm -E, again from Julian. 6) Make sure skb->sk is unset for forwarded traffic through IPVS, again from Alex Gartrell. 7) Fix crash with IPVS sync protocol v0 and FTP, from Julian. 8) Reset sender cpu for forwarded traffic in IPVS, also from Julian. 9) Allocate template conntracks through kmalloc() to resolve netns dependency problems with the conntrack kmem_cache. 10) Fix zones with expectations that clash using the same tuple, from Joe Stringer. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | netfilter: nf_conntrack: Support expectations in different zonesJoe Stringer2015-07-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When zones were originally introduced, the expectation functions were all extended to perform lookup using the zone. However, insertion was not modified to check the zone. This means that two expectations which are intended to apply for different connections that have the same tuple but exist in different zones cannot both be tracked. Fixes: 5d0aa2ccd4 (netfilter: nf_conntrack: add support for "conntrack zones") Signed-off-by: Joe Stringer <joestringer@nicira.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | Merge tag 'ipvs-fixes-for-v4.2' of ↵Pablo Neira Ayuso2015-07-205-39/+110
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs Simon Horman says: ==================== IPVS Fixes for v4.2 please consider this fix for v4.2. For reasons that are not clear to me it is a bumper crop. It seems to me that they are all relevant to stable. Please let me know if you need my help to get the fixes into stable. * ipvs: fix ipv6 route unreach panic This problem appears to be present since IPv6 support was added to IPVS in v2.6.28. * ipvs: skb_orphan in case of forwarding This appears to resolve a problem resulting from a side effect of 41063e9dd119 ("ipv4: Early TCP socket demux.") which was included in v3.6. * ipvs: do not use random local source address for tunnels This appears to resolve a problem introduced by 026ace060dfe ("ipvs: optimize dst usage for real server") in v3.10. * ipvs: fix crash if scheduler is changed This appears to resolve a problem introduced by ceec4c381681 ("ipvs: convert services to rcu") in v3.10. Julian has provided backports of the fix: * [PATCHv2 3.10.81] ipvs: fix crash if scheduler is changed http://www.spinics.net/lists/lvs-devel/msg04008.html * [PATCHv2 3.12.44,3.14.45,3.18.16,4.0.6] ipvs: fix crash if scheduler is changed http://www.spinics.net/lists/lvs-devel/msg04007.html Please let me know how you would like to handle guiding these backports into stable. * ipvs: fix crash with sync protocol v0 and FTP This appears to resolve a problem introduced by 749c42b620a9 ("ipvs: reduce sync rate with time thresholds") in v3.5 ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | | * | ipvs: call skb_sender_cpu_clearJulian Anastasov2015-07-141-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reset XPS's sender_cpu on forwarding. Signed-off-by: Julian Anastasov <ja@ssi.bg> Fixes: 2bd82484bb4c ("xps: fix xps for stacked devices") Signed-off-by: Simon Horman <horms@verge.net.au>
| | | * | ipvs: fix crash with sync protocol v0 and FTPJulian Anastasov2015-07-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix crash in 3.5+ if FTP is used after switching sync_version to 0. Fixes: 749c42b620a9 ("ipvs: reduce sync rate with time thresholds") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
| | | * | ipvs: skb_orphan in case of forwardingAlex Gartrell2015-07-141-0/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible that we bind against a local socket in early_demux when we are actually going to want to forward it. In this case, the socket serves no purpose and only serves to confuse things (particularly functions which implicitly expect sk_fullsock to be true, like ip_local_out). Additionally, skb_set_owner_w is totally broken for non full-socks. Signed-off-by: Alex Gartrell <agartrell@fb.com> Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.") Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
| | | * | ipvs: fix crash if scheduler is changedJulian Anastasov2015-07-143-37/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I overlooked the svc->sched_data usage from schedulers when the services were converted to RCU in 3.10. Now the rare ipvsadm -E command can change the scheduler but due to the reverse order of ip_vs_bind_scheduler and ip_vs_unbind_scheduler we provide new sched_data to the old scheduler resulting in a crash. To fix it without changing the scheduler methods we have to use synchronize_rcu() only for the editing case. It means all svc->scheduler readers should expect a NULL value. To avoid breakage for the service listing and ipvsadm -R we can use the "none" name to indicate that scheduler is not assigned, a state when we drop new connections. Reported-by: Alexander Vasiliev <a.vasylev@404-group.com> Fixes: ceec4c381681 ("ipvs: convert services to rcu") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
| | | * | ipvs: do not use random local source address for tunnelsJulian Anastasov2015-07-141-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Michael Vallaly reports about wrong source address used in rare cases for tunneled traffic. Looks like __ip_vs_get_out_rt in 3.10+ is providing uninitialized dest_dst->dst_saddr.ip because ip_vs_dest_dst_alloc uses kmalloc. While we retry after seeing EINVAL from routing for data that does not look like valid local address, it still succeeded when this memory was previously used from other dests and with different local addresses. As result, we can use valid local address that is not suitable for our real server. Fix it by providing 0.0.0.0 every time our cache is refreshed. By this way we will get preferred source address from routing. Reported-by: Michael Vallaly <lvs@nolatency.com> Fixes: 026ace060dfe ("ipvs: optimize dst usage for real server") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
| | | * | ipvs: fix ipv6 route unreach panicAlex Gartrell2015-07-141-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously there was a trivial panic unshare -n /bin/bash <<EOF ip addr add dev lo face::1/128 ipvsadm -A -t [face::1]:15213 ipvsadm -a -t [face::1]:15213 -r b00c::1 echo boom | nc face::1 15213 EOF This patch allows us to replicate the net logic above and simply capture the skb_dst(skb)->dev and use that for the purpose of the invocation. Signed-off-by: Alex Gartrell <agartrell@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
| | * | | netfilter: fix netns dependencies with conntrack templatesPablo Neira Ayuso2015-07-203-32/+50
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Quoting Daniel Borkmann: "When adding connection tracking template rules to a netns, f.e. to configure netfilter zones, the kernel will endlessly busy-loop as soon as we try to delete the given netns in case there's at least one template present, which is problematic i.e. if there is such bravery that the priviledged user inside the netns is assumed untrusted. Minimal example: ip netns add foo ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1 ip netns del foo What happens is that when nf_ct_iterate_cleanup() is being called from nf_conntrack_cleanup_net_list() for a provided netns, we always end up with a net->ct.count > 0 and thus jump back to i_see_dead_people. We don't get a soft-lockup as we still have a schedule() point, but the serving CPU spins on 100% from that point onwards. Since templates are normally allocated with nf_conntrack_alloc(), we also bump net->ct.count. The issue why they are not yet nf_ct_put() is because the per netns .exit() handler from x_tables (which would eventually invoke xt_CT's xt_ct_tg_destroy() that drops reference on info->ct) is called in the dependency chain at a *later* point in time than the per netns .exit() handler for the connection tracker. This is clearly a chicken'n'egg problem: after the connection tracker .exit() handler, we've teared down all the connection tracking infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be invoked at a later point in time during the netns cleanup, as that would lead to a use-after-free. At the same time, we cannot make x_tables depend on the connection tracker module, so that the xt_ct_tg_destroy() would be invoked earlier in the cleanup chain." Daniel confirms this has to do with the order in which modules are loaded or having compiled nf_conntrack as modules while x_tables built-in. So we have no guarantees regarding the order in which netns callbacks are executed. Fix this by allocating the templates through kmalloc() from the respective SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache. Then, release then via nf_ct_tmpl_free() from destroy_conntrack(). This branch is marked as unlikely since conntrack templates are rarely allocated and only from the configuration plane path. Note that templates are not kept in any list to avoid further dependencies with nf_conntrack anymore, thus, the tmpl larval list is removed. Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | netfilter: IDLETIMER: fix lockdep warningDmitry Torokhov2015-07-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dynamically allocated sysfs attributes should be initialized with sysfs_attr_init() otherwise lockdep will be angry with us: [ 45.468653] BUG: key ffffffc030fad4e0 not in .data! [ 45.468655] ------------[ cut here ]------------ [ 45.468666] WARNING: CPU: 0 PID: 1176 at /mnt/host/source/src/third_party/kernel/v3.18/kernel/locking/lockdep.c:2991 lockdep_init_map+0x12c/0x490() [ 45.468672] DEBUG_LOCKS_WARN_ON(1) [ 45.468672] CPU: 0 PID: 1176 Comm: iptables Tainted: G U W 3.18.0 #43 [ 45.468674] Hardware name: XXX [ 45.468675] Call trace: [ 45.468680] [<ffffffc0002072b4>] dump_backtrace+0x0/0x10c [ 45.468683] [<ffffffc0002073d0>] show_stack+0x10/0x1c [ 45.468688] [<ffffffc000a86cd4>] dump_stack+0x74/0x94 [ 45.468692] [<ffffffc000217ae0>] warn_slowpath_common+0x84/0xb0 [ 45.468694] [<ffffffc000217b84>] warn_slowpath_fmt+0x4c/0x58 [ 45.468697] [<ffffffc0002530a4>] lockdep_init_map+0x128/0x490 [ 45.468701] [<ffffffc000367ef0>] __kernfs_create_file+0x80/0xe4 [ 45.468704] [<ffffffc00036862c>] sysfs_add_file_mode_ns+0x104/0x170 [ 45.468706] [<ffffffc00036870c>] sysfs_create_file_ns+0x58/0x64 [ 45.468711] [<ffffffc000930430>] idletimer_tg_checkentry+0x14c/0x324 [ 45.468714] [<ffffffc00092a728>] xt_check_target+0x170/0x198 [ 45.468717] [<ffffffc000993efc>] check_target+0x58/0x6c [ 45.468720] [<ffffffc000994c64>] translate_table+0x30c/0x424 [ 45.468723] [<ffffffc00099529c>] do_ipt_set_ctl+0x144/0x1d0 [ 45.468728] [<ffffffc0009079f0>] nf_setsockopt+0x50/0x60 [ 45.468732] [<ffffffc000946870>] ip_setsockopt+0x8c/0xb4 [ 45.468735] [<ffffffc0009661c0>] raw_setsockopt+0x10/0x50 [ 45.468739] [<ffffffc0008c1550>] sock_common_setsockopt+0x14/0x20 [ 45.468742] [<ffffffc0008bd190>] SyS_setsockopt+0x88/0xb8 [ 45.468744] ---[ end trace 41d156354d18c039 ]--- Signed-off-by: Dmitry Torokhov <dtor@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: ctnetlink: put back references to master ct and expect objectsPablo Neira Ayuso2015-07-101-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have to put back the references to the master conntrack and the expectation that we just created, otherwise we'll leak them. Fixes: 0ef71ee1a5b9 ("netfilter: ctnetlink: refactor ctnetlink_create_expect") Reported-by: Tim Wiess <Tim.Wiess@watchguard.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | cgroup: net_cls: fix false-positive "suspicious RCU usage"Konstantin Khlebnikov2015-07-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In dev_queue_xmit() net_cls protected with rcu-bh. [ 270.730026] =============================== [ 270.730029] [ INFO: suspicious RCU usage. ] [ 270.730033] 4.2.0-rc3+ #2 Not tainted [ 270.730036] ------------------------------- [ 270.730040] include/linux/cgroup.h:353 suspicious rcu_dereference_check() usage! [ 270.730041] other info that might help us debug this: [ 270.730043] rcu_scheduler_active = 1, debug_locks = 1 [ 270.730045] 2 locks held by dhclient/748: [ 270.730046] #0: (rcu_read_lock_bh){......}, at: [<ffffffff81682b70>] __dev_queue_xmit+0x50/0x960 [ 270.730085] #1: (&qdisc_tx_lock){+.....}, at: [<ffffffff81682d60>] __dev_queue_xmit+0x240/0x960 [ 270.730090] stack backtrace: [ 270.730096] CPU: 0 PID: 748 Comm: dhclient Not tainted 4.2.0-rc3+ #2 [ 270.730098] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011 [ 270.730100] 0000000000000001 ffff8800bafeba58 ffffffff817ad487 0000000000000007 [ 270.730103] ffff880232a0a780 ffff8800bafeba88 ffffffff810ca4f2 ffff88022fb23e00 [ 270.730105] ffff880232a0a780 ffff8800bafebb68 ffff8800bafebb68 ffff8800bafebaa8 [ 270.730108] Call Trace: [ 270.730121] [<ffffffff817ad487>] dump_stack+0x4c/0x65 [ 270.730148] [<ffffffff810ca4f2>] lockdep_rcu_suspicious+0xe2/0x120 [ 270.730153] [<ffffffff816a62d2>] task_cls_state+0x92/0xa0 [ 270.730158] [<ffffffffa00b534f>] cls_cgroup_classify+0x4f/0x120 [cls_cgroup] [ 270.730164] [<ffffffff816aac74>] tc_classify_compat+0x74/0xc0 [ 270.730166] [<ffffffff816ab573>] tc_classify+0x33/0x90 [ 270.730170] [<ffffffffa00bcb0a>] htb_enqueue+0xaa/0x4a0 [sch_htb] [ 270.730172] [<ffffffff81682e26>] __dev_queue_xmit+0x306/0x960 [ 270.730174] [<ffffffff81682b70>] ? __dev_queue_xmit+0x50/0x960 [ 270.730176] [<ffffffff816834a3>] dev_queue_xmit_sk+0x13/0x20 [ 270.730185] [<ffffffff81787770>] dev_queue_xmit+0x10/0x20 [ 270.730187] [<ffffffff8178b91c>] packet_snd.isra.62+0x54c/0x760 [ 270.730190] [<ffffffff8178be25>] packet_sendmsg+0x2f5/0x3f0 [ 270.730203] [<ffffffff81665245>] ? sock_def_readable+0x5/0x190 [ 270.730210] [<ffffffff817b64bb>] ? _raw_spin_unlock+0x2b/0x40 [ 270.730216] [<ffffffff8173bcbc>] ? unix_dgram_sendmsg+0x5cc/0x640 [ 270.730219] [<ffffffff8165f367>] sock_sendmsg+0x47/0x50 [ 270.730221] [<ffffffff8165f42f>] sock_write_iter+0x7f/0xd0 [ 270.730232] [<ffffffff811fd4c7>] __vfs_write+0xa7/0xf0 [ 270.730234] [<ffffffff811fe5b8>] vfs_write+0xb8/0x190 [ 270.730236] [<ffffffff811fe8c2>] SyS_write+0x52/0xb0 [ 270.730239] [<ffffffff817b6bae>] entry_SYSCALL_64_fastpath+0x12/0x76 Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | sch_choke: drop all packets in queue during resetWANG Cong2015-07-241-0/+13
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | sch_plug: purge buffered packets during resetWANG Cong2015-07-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Otherwise the skbuff related structures are not correctly refcount'ed. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv4: consider TOS in fib_select_defaultJulian Anastasov2015-07-244-13/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fib_select_default considers alternative routes only when res->fi is for the first alias in res->fa_head. In the common case this can happen only when the initial lookup matches the first alias with highest TOS value. This prevents the alternative routes to require specific TOS. This patch solves the problem as follows: - routes that require specific TOS should be returned by fib_select_default only when TOS matches, as already done in fib_table_lookup. This rule implies that depending on the TOS we can have many different lists of alternative gateways and we have to keep the last used gateway (fa_default) in first alias for the TOS instead of using single tb_default value. - as the aliases are ordered by many keys (TOS desc, fib_priority asc), we restrict the possible results to routes with matching TOS and lowest metric (fib_priority) and routes that match any TOS, again with lowest metric. For example, packet with TOS 8 can not use gw3 (not lowest metric), gw4 (different TOS) and gw6 (not lowest metric), all other gateways can be used: tos 8 via gw1 metric 2 <--- res->fa_head and res->fi tos 8 via gw2 metric 2 tos 8 via gw3 metric 3 tos 4 via gw4 tos 0 via gw5 tos 0 via gw6 metric 1 Reported-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv4: fib_select_default should match the prefixJulian Anastasov2015-07-241-0/+5
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | fib_trie starting from 4.1 can link fib aliases from different prefixes in same list. Make sure the alternative gateways are in same table and for same prefix (0) by checking tb_id and fa_slen. Fixes: 79e5ad2ceb00 ("fib_trie: Remove leaf_info") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2015-07-283-13/+23
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable patches: - Fix a situation where the client uses the wrong (zero) stateid. - Fix a memory leak in nfs_do_recoalesce Bugfixes: - Plug a memory leak when ->prepare_layoutcommit fails - Fix an Oops in the NFSv4 open code - Fix a backchannel deadlock - Fix a livelock in sunrpc when sendmsg fails due to low memory availability - Don't revalidate the mapping if both size and change attr are up to date - Ensure we don't miss a file extension when doing pNFS - Several fixes to handle NFSv4.1 sequence operation status bits correctly - Several pNFS layout return bugfixes" * tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits) nfs: Fix an oops caused by using other thread's stack space in ASYNC mode nfs: plug memory leak when ->prepare_layoutcommit fails SUNRPC: Report TCP errors to the caller sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable. NFSv4.2: handle NFS-specific llseek errors NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce() NFS: Fix a memory leak in nfs_do_recoalesce NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised NFS: Don't revalidate the mapping if both size and change attr are up to date NFSv4/pnfs: Ensure we don't miss a file extension NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked SUNRPC: xprt_complete_bc_request must also decrement the free slot count SUNRPC: Fix a backchannel deadlock pNFS: Don't throw out valid layout segments pNFS: pnfs_roc_drain() fix a race with open pNFS: Fix races between return-on-close and layoutreturn. pNFS: pnfs_roc_drain should return 'true' when sleeping pNFS: Layoutreturn must invalidate all existing layout segments. ...
| * | | SUNRPC: Report TCP errors to the callerTrond Myklebust2015-07-271-7/+6
| | | | | | | | | | | | | | | | Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.NeilBrown2015-07-271-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The networking layer does not reliably report the distinction between a non-block write failing because: 1/ the queue is too full already and 2/ a memory allocation attempt failed. The distinction is important because in the first case it is appropriate to retry as soon as the socket reports that it is writable, and in the second case a small delay is required as the socket will most likely report as writable but kmalloc could still fail. sk_stream_wait_memory() exhibits this distinction nicely, setting 'vm_wait' if a small wait is needed. However in the non-blocking case it always returns -EAGAIN no matter the cause of the failure. This -EAGAIN call get all the way to sunrpc. The sunrpc layer expects EAGAIN to indicate the first cause, and ENOBUFS to indicate the second. Various documentation suggests that this is not unreasonable, but does not guarantee the desired error codes. The result of getting -EAGAIN when -ENOBUFS is expected is that the send is tried again in a tight loop and soft lockups are reported. so: add tests after calls to xs_sendpages() to translate -EAGAIN into -ENOBUFS if the socket is writable. This cannot happen inside xs_sendpages() as the test for "is socket writable" is different between TCP and UDP. With this change, the tight loop retrying xs_sendpages() becomes a loop which only retries every 250ms, and so will not trigger a soft-lockup warning. It is possible that the write did fail because the queue was too full and by the time xs_sendpages() completed, the queue was writable again. In this case an extra 250ms delay is inserted that isn't really needed. This circumstance suggests a degree of congestion so a delay is not necessarily a bad thing, and it can only cause a single 250ms delay, not a series of them. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | SUNRPC: xprt_complete_bc_request must also decrement the free slot countTrond Myklebust2015-07-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Calling xprt_complete_bc_request() effectively causes the slot to be allocated, so it needs to decrement the backchannel free slot count as well. Fixes: 0d2a970d0ae5 ("SUNRPC: Fix a backchannel race") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | SUNRPC: Fix a backchannel deadlockTrond Myklebust2015-07-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xprt_alloc_bc_request() cannot call xprt_free_bc_request() without deadlocking, since it already holds the xprt->bc_pa_lock. Reported-by: Chuck Lever <chuck.lever@oracle.com> Fixes: 0d2a970d0ae55 ("SUNRPC: Fix a backchannel race") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | SUNRPC: Don't confuse ENOBUFS with a write_space issueTrond Myklebust2015-07-031-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ENOBUFS means that memory allocations are failing due to an actual low memory situation. It should not be confused with being out of socket buffer space. Handle the problem by just punting to the delay in call_status. Reported-by: Neil Brown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | SUNRPC: Don't reencode message if transmission failed with ENOBUFSTrond Myklebust2015-07-031-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we're running out of buffer memory when transmitting data, then we want to just delay for a moment, and then continue transmitting the remainder of the message. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* | | | Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds2015-07-231-0/+1
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull virtio/vhost fixes from Michael Tsirkin: "Bugfixes and documentation fixes. Igor's patch that allows users to tweak memory table size is borderline, but it does fix known crashes, so I merged it" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost: add max_mem_regions module parameter vhost: extend memory regions allocation to vmalloc 9p/trans_virtio: reset virtio device on remove virtio/s390: rename drivers/s390/kvm -> drivers/s390/virtio MAINTAINERS: separate section for s390 virtio drivers virtio: define virtio_pci_cfg_cap in header. virtio: Fix typecast of pointer in vring_init() virtio scsi: fix unused variable warning vhost: use binary search instead of linear in find_region() virtio_net: document VIRTIO_NET_CTRL_GUEST_OFFLOADS
| * | | 9p/trans_virtio: reset virtio device on removePierre Morel2015-07-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On device shutdown/removal, virtio drivers need to trigger a reset on the device; if this is neglected, the virtio core will complain about non-zero device status. This patch resets the status when the 9p virtio driver is removed from the system by calling vdev->config->reset on the virtio_device to send a reset to the host virtio device. Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
OpenPOWER on IntegriCloud