summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* tipc: Remove obsolete broadcast tag capabilityAllan Stephens2012-02-063-9/+1
| | | | | | | | Eliminates support for the broadcast tag field, which is no longer used by broadcast link NACK messages. Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* tipc: Major redesign of broadcast link ACK/NACK algorithmsAllan Stephens2012-02-065-165/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Completely redesigns broadcast link ACK and NACK mechanisms to prevent spurious retransmit requests in dual LAN networks, and to prevent the broadcast link from stalling due to the failure of a receiving node to acknowledge receiving a broadcast message or request its retransmission. Note: These changes only impact the timing of when ACK and NACK messages are sent, and not the basic broadcast link protocol itself, so inter- operability with nodes using the "classic" algorithms is maintained. The revised algorithms are as follows: 1) An explicit ACK message is still sent after receiving 16 in-sequence messages, and implicit ACK information continues to be carried in other unicast link message headers (including link state messages). However, the timing of explicit ACKs is now based on the receiving node's absolute network address rather than its relative network address to ensure that the failure of another node does not delay the ACK beyond its 16 message target. 2) A NACK message is now typically sent only when a message gap persists for two consecutive incoming link state messages; this ensures that a suspected gap is not confirmed until both LANs in a dual LAN network have had an opportunity to deliver the message, thereby preventing spurious NACKs. A NACK message can also be generated by the arrival of a single link state message, if the deferred queue is so big that the current message gap cannot be the result of "normal" mis-ordering due to the use of dual LANs (or one LAN using a bonded interface). Since link state messages typically arrive at different nodes at different times the problem of multiple nodes issuing identical NACKs simultaneously is inherently avoided. 3) Nodes continue to "peek" at NACK messages sent by other nodes. If another node requests retransmission of a message gap suspected (but not yet confirmed) by the peeking node, the peeking node forgets about the gap and does not generate a duplicate retransmit request. (If the peeking node subsequently fails to receive the lost message, later link state messages will cause it to rediscover and confirm the gap and send another NACK.) 4) Message gap "equality" is now determined by the start of the gap only. This is sufficient to deal with the most common cases of message loss, and eliminates the need for complex end of gap computations. 5) A peeking node no longer tries to determine whether it should send a complementary NACK, since the most common cases of message loss don't require it to be sent. Consequently, the node no longer examines the "broadcast tag" field of a NACK message when peeking. Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* tipc: Add missing locks in broadcast link statistics accumulationAllan Stephens2012-02-061-0/+11
| | | | | | | | | | Ensures that all attempts to update broadcast link statistics are done only while holding the lock that protects the link's main data structures, to prevent interference by simultaneous updates caused by messages arriving on other interfaces. Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* tipc: Fix bug in broadcast link duplicate message statisticsAllan Stephens2012-02-061-0/+2
| | | | | | | | | | Modifies broadcast link so that it increments the "received duplicate message" count if an incoming message cannot be added to the deferred message queue because it is already present in the queue. (The aligns broadcast link behavior with that of TIPC's unicast links.) Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* tipc: Fix node lock reclamation issues in broadcast link receptionAllan Stephens2012-02-061-18/+40
| | | | | | | | | | | | | | | | Fixes a pair of problems in broadcast link message reception code relating to the reclamation of the node lock after consuming an in-sequence message. 1) Now retests to see if the sending node is still up after reclaiming the node lock, and bails out if it is non-operational. 2) Now manipulates the node's deferred message queue only after reclaiming the node lock, rather than using queue head pointer information that was cached previously. Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* tipc: Add missing broadcast link lock when sending NACKAllan Stephens2012-02-061-0/+2
| | | | | | | | | | Ensures that any attempt to send a NACK message over TIPC's broadcast link has exclusive access to the link's main data structures, to prevent interference with a simultaneous attempt to send other broadcast link traffic (such as application-generated multicast messages). Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* tipc: Fix problem with broadcast link synchronization between nodesAllan Stephens2012-02-061-2/+5
| | | | | | | | | | | | | | Corrects a problem in which a link endpoint that activates as the result of receiving a RESET/STATE sequence of link protocol messages fails to properly record the broadcast link status information about the node to which it is now communicating with. (The problem does not occur with the more common RESET/ACTIVATE sequence of messages.) The fix ensures that the broadcast link status info is updated after the RESET message resets the link endpoint, rather than before, thereby preventing new information from being overwritten by the reset operation. Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* tipc: Ensure broadcast link re-acquires node after link failureAllan Stephens2012-02-063-4/+8
| | | | | | | | | | | | | | | | | | | | | | | Fix a bug that can prevent TIPC from sending broadcast messages to a node if contact with the node is lost and then regained. The problem occurs if the broadcast link first clears the flag indicating the node is part of the link's distribution set (when it loses contact with the node), and later fails to restore the flag (when contact is regained); restoration fails if contact with the node is regained by implicit unicast link activation triggered by the arrival of a data message, rather than explicitly by the arrival of a link activation message. The broadcast link now uses separate fields to track whether a node is theoretically capable of receiving broadcast messages versus whether it is actually part of the link's distribution set. The former member is updated by the receipt of link protocol messages, which can occur at any time; the latter member is updated only when contact with the node is gained or lost. This change also permits the simplification of several conditional expressions since the broadcast link's "supported" field can now only be set if there are working links to the associated node. Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* tipc: Prevent broadcast link stalling in dual LAN environmentsAllan Stephens2012-02-061-4/+5
| | | | | | | | | | | | | | | | | | Ensure that sequence number information about incoming broadcast link messages is initialized only by the activation of the first link to a given cluster node. Previously, a race condition allowed reset and/or activation messages for a second link to re-initialize this sequence number information with obsolete values. This could trigger TIPC to request the retransmission of previously acknowledged broadcast link messages from that node, resulting in broadcast link processing becoming stalled if the node had already released one or more of those messages and was unable to perform the required retransmission. Thanks to Laser <gotolaser@gmail.com> for identifying this problem and assisting in the development of this fix. Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* tipc: Prevent transmission of outdated link protocol messagesAllan Stephens2012-02-061-26/+27
| | | | | | | | | | | | | | | | Ensures that a link endpoint discards any previously deferred link protocol message whenever it attempts to send a new one. Previously, it was possible for a link protocol message that was unsent due to congestion to be transmitted after newer protocol messages had been sent. The stale link protocol message might then cause the receiving link endpoint to malfunction because of its outdated conent. Thanks to Osamu Kaminuma [okaminum@avaya.com] for diagnosing the problem and contributing a prototype patch. Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* tipc: improve the link deferred queue insertion algorithmAllan Stephens2012-02-061-25/+23
| | | | | | | | | Re-code the algorithm for inserting an out-of-sequence message into a unicast or broadcast link's deferred message queue. It remains functionally equivalent but should be easier to understand/maintain. Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* caif: caifdev is never used in net/caif/caif_dev.c::transmit() - remove it.Jesper Juhl2012-02-051-2/+0
| | | | | Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* decnet: remove unused variable from dn_output()Jesper Juhl2012-02-051-2/+1
| | | | | | | | The variable 'neigh' is assigned to, but otherwise completely unused. So let's remove it. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2012-02-0420-96/+99
|\
| * netprio_cgroup: Fix obo in get_prioidxNeil Horman2012-02-041-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | It was recently pointed out to me that the get_prioidx function sets a bit in the prioidx map prior to checking to see if the index being set is out of bounds. This patch corrects that, avoiding the possiblity of us writing beyond the end of the array Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Stanislaw Gruszka <sgruszka@redhat.com> CC: Stanislaw Gruszka <sgruszka@redhat.com> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'master' of ↵John W. Linville2012-02-031-1/+1
| |\ | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem
| | * mac80211: timeout a single frame in the rx reorder bufferEliad Peller2012-02-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current code checks for stored_mpdu_num > 1, causing the reorder_timer to be triggered indefinitely, but the frame is never timed-out (until the next packet is received) Signed-off-by: Eliad Peller <eliad@wizery.com> Cc: <stable@vger.kernel.org> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
| * | caif: Bugfix double kfree_skb upon xmit failureDmitry Tarnyagin2012-02-021-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | SKB is freed twice upon send error. The Network stack consumes SKB even when it returns error code. Signed-off-by: Sjur Brændeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | caif: Bugfix list_del_rcu race in cfmuxl_ctrlcmd.sjur.brandeland@stericsson.com2012-02-021-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Always use cfmuxl_remove_uplayer when removing a up-layer. cfmuxl_ctrlcmd() can be called independently and in parallel with cfmuxl_remove_uplayer(). The race between them could cause list_del_rcu to be called on a node which has been already taken out from the list. That lead to a (rare) crash on accessing poisoned node->prev inside list_del_rcu. This fix ensures that deletion are done holding the same lock. Reported-by: Dmitry Tarnyagin <dmitry.tarnyagin@stericsson.com> Signed-off-by: Sjur Brændeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: properly initialize tcp memory limitsJason Wang2012-02-022-8/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4acb4190 tries to fix the using uninitialized value introduced by commit 3dc43e3, but it would make the per-socket memory limits too small. This patch fixes this and also remove the redundant codes introduced in 4acb4190. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ethtool: Null-terminate filename passed to ethtool_ops::flash_deviceBen Hutchings2012-02-011-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The parameters for ETHTOOL_FLASHDEV include a filename, which ought to be null-terminated. Currently the only driver that implements ethtool_ops::flash_device attempts to add a null terminator if necessary, but does it wrongly. Do it in the ethtool core instead. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: Disambiguate kernel messageArun Sharma2012-02-012-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some of our machines were reporting: TCP: too many of orphaned sockets even when the number of orphaned sockets was well below the limit. We print a different message depending on whether we're out of TCP memory or there are too many orphaned sockets. Also move the check out of line and cleanup the messages that were printed. Signed-off-by: Arun Sharma <asharma@fb.com> Suggested-by: Mohan Srinivasan <mohan@fb.com> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: David Miller <davem@davemloft.net> Cc: Glauber Costa <glommer@parallels.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2012-01-3013-65/+52
| |\ \ | | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1) Setting link attributes can modify the size of the attributes that would be reported on a subsequent getlink netlink operation, therefore min_ifinfo_dump_size needs to be adjusted. From Stefan Gula. 2) Resegmentation of TSO frames while trimming can violate invariants expected by callers, namely that the number of segments can only stay the same or decrease, never increase. If MSS changes, however, we can trim data but then end up with more segments. Fix this by only segmenting to the MSS already recorded in the SKB. That's the simplest fix for now and if we want to get more fancy in the future that's a more involved change. This probably explains some retransmit counter inaccuracies. From Neal Cardwell. 3) Fix too-many-wakeups in POLL with AF_UNIX sockets, from Eric Dumazet. 4) Fix CAIF crashes wrt. namespace handling. From Eric Dumazet and Eric W. Biederman. 5) TCP port selection fixes from Flavio Leitner. 6) More socket memory cgroup build fixes in certain randonfig situations. From Glauber Costa. 7) Fix TCP memory sysctl regression reported by Ingo Molnar, also from Glauber Costa. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: af_unix: fix EPOLLET regression for stream sockets tcp: fix tcp_trim_head() to adjust segment count with skb MSS net/tcp: Fix tcp memory limits initialization when !CONFIG_SYSCTL net caif: Register properly as a pernet subsystem. netns: Fail conspicously if someone uses net_generic at an inappropriate time. net: explicitly add jump_label.h header to sock.h net: RTNETLINK adjusting values of min_ifinfo_dump_size ipv6: Fix ip_gre lockless xmits. xen-netfront: correct MAX_TX_TARGET calculation. netns: fix net_alloc_generic() tcp: bind() optimize port allocation tcp: bind() fix autoselection to share ports l2tp: l2tp_ip - fix possible oops on packet receive iwlwifi: fix PCI-E transport "inta" race mac80211: set bss_conf.idle when vif is connected mac80211: update oper_channel on ibss join
| | * af_unix: fix EPOLLET regression for stream socketsEric Dumazet2012-01-301-15/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0884d7aa24 (AF_UNIX: Fix poll blocking problem when reading from a stream socket) added a regression for epoll() in Edge Triggered mode (EPOLLET) Appropriate fix is to use skb_peek()/skb_unlink() instead of skb_dequeue(), and only call skb_unlink() when skb is fully consumed. This remove the need to requeue a partial skb into sk_receive_queue head and the extra sk->sk_data_ready() calls that added the regression. This is safe because once skb is given to sk_receive_queue, it is not modified by a writer, and readers are serialized by u->readlock mutex. This also reduce number of spinlock acquisition for small reads or MSG_PEEK users so should improve overall performance. Reported-by: Nick Mathewson <nickm@freehaven.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Alexey Moiseytsev <himeraster@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * tcp: fix tcp_trim_head() to adjust segment count with skb MSSNeal Cardwell2012-01-301-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit fixes tcp_trim_head() to recalculate the number of segments in the skb with the skb's existing MSS, so trimming the head causes the skb segment count to be monotonically non-increasing - it should stay the same or go down, but not increase. Previously tcp_trim_head() used the current MSS of the connection. But if there was a decrease in MSS between original transmission and ACK (e.g. due to PMTUD), this could cause tcp_trim_head() to counter-intuitively increase the segment count when trimming bytes off the head of an skb. This violated assumptions in tcp_tso_acked() that tcp_trim_head() only decreases the packet count, so that packets_acked in tcp_tso_acked() could underflow, leading tcp_clean_rtx_queue() to pass u32 pkts_acked values as large as 0xffffffff to ca_ops->pkts_acked(). As an aside, if tcp_trim_head() had really wanted the skb to reflect the current MSS, it should have called tcp_set_skb_tso_segs() unconditionally, since a decrease in MSS would mean that a single-packet skb should now be sliced into multiple segments. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * net/tcp: Fix tcp memory limits initialization when !CONFIG_SYSCTLGlauber Costa2012-01-302-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sysctl_tcp_mem() initialization was moved to sysctl_tcp_ipv4.c in commit 3dc43e3e4d0b52197d3205214fe8f162f9e0c334, since it became a per-ns value. That code, however, will never run when CONFIG_SYSCTL is disabled, leading to bogus values on those fields - causing hung TCP sockets. This patch fixes it by keeping an initialization code in tcp_init(). It will be overwritten by the first net namespace init if CONFIG_SYSCTL is compiled in, and do the right thing if it is compiled out. It is also named properly as tcp_init_mem(), to properly signal its non-sysctl side effect on TCP limits. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: David S. Miller <davem@davemloft.net> Link: http://lkml.kernel.org/r/4F22D05A.8030604@parallels.com [ renamed the function, tidied up the changelog a bit ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * net caif: Register properly as a pernet subsystem.Eric W. Biederman2012-01-272-21/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | caif is a subsystem and as such it needs to register with register_pernet_subsys instead of register_pernet_device. Among other problems using register_pernet_device was resulting in net_generic being called before the caif_net structure was allocated. Which has been causing net_generic to fail with either BUG_ON's or by return NULL pointers. A more ugly problem that could be caused is packets in flight why the subsystem is shutting down. To remove confusion also remove the cruft cause by inappropriately trying to fix this bug. With the aid of the previous patch I have tested this patch and confirmed that using register_pernet_subsys makes the failure go away as it should. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Sjur Brændeland <sjur.brandeland@stericsson.com> Tested-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * Merge branch 'master' of ↵David S. Miller2012-01-272-0/+2
| | |\ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless
| | | * mac80211: set bss_conf.idle when vif is connectedEliad Peller2012-01-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __ieee80211_recalc_idle() iterates through the vifs, sets bss_conf.idle = true if they are disconnected, and increases "count" if they are not (which later gets evaluated in order to determine whether the device is idle). However, the loop doesn't set bss_conf.idle = false (along with increasing "count"), causing the device idle state and the vif idle state to get out of sync in some cases. Signed-off-by: Eliad Peller <eliad@wizery.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
| | | * mac80211: update oper_channel on ibss joinEliad Peller2012-01-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 13c40c5 ("mac80211: Add HT operation modes for IBSS") broke ibss operation by mistakenly removing the local->oper_channel update (causing ibss to start on the wrong channel). fix it. Signed-off-by: Eliad Peller <eliad@wizery.com> Acked-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: John W. Linville <linville@tuxdriver.com>
| | * | net: RTNETLINK adjusting values of min_ifinfo_dump_sizeStefan Gula2012-01-261-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Setting link parameters on a netdevice changes the value of if_nlmsg_size(), therefore it is necessary to recalculate min_ifinfo_dump_size. Signed-off-by: Stefan Gula <steweg@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | ipv6: Fix ip_gre lockless xmits.Willem de Bruijn2012-01-261-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tunnel devices set NETIF_F_LLTX to bypass HARD_TX_LOCK. Sit and ipip set this unconditionally in ops->setup, but gre enables it conditionally after parameter passing in ops->newlink. This is not called during tunnel setup as below, however, so GRE tunnels are still taking the lock. modprobe ip_gre ip tunnel add test0 mode gre remote 10.5.1.1 dev lo ip link set test0 up ip addr add 10.6.0.1 dev test0 # cat /sys/class/net/test0/features # $DIR/test_tunnel_xmit 10 10.5.2.1 ip route add 10.5.2.0/24 dev test0 ip tunnel del test0 The newlink callback is only called in rtnl_netlink, and only if the device is new, as it calls register_netdevice internally. Gre tunnels are created at 'ip tunnel add' with ioctl SIOCADDTUNNEL, which calls ipgre_tunnel_locate, which calls register_netdev. rtnl_newlink is called at 'ip link set', but skips ops->newlink and the device is up with locking still enabled. The equivalent ipip tunnel works fine, btw (just substitute 'method gre' for 'method ipip'). On kernels before /sys/class/net/*/features was removed [1], the first commented out line returns 0x6000 with method gre, which indicates that NETIF_F_LLTX (0x1000) is not set. With ipip, it reports 0x7000. This test cannot be used on recent kernels where the sysfs file is removed (and ETHTOOL_GFEATURES does not currently work for tunnel devices, because they lack dev->ethtool_ops). The second commented out line calls a simple transmission test [2] that sends on 24 cores at maximum rate. Results of a single run: ipip: 19,372,306 gre before patch: 4,839,753 gre after patch: 19,133,873 This patch replicates the condition check in ipgre_newlink to ipgre_tunnel_locate. It works for me, both with oseq on and off. This is the first time I looked at rtnetlink and iproute2 code, though, so someone more knowledgeable should probably check the patch. Thanks. The tail of both functions is now identical, by the way. To avoid code duplication, I'll be happy to rework this and merge the two. [1] http://patchwork.ozlabs.org/patch/104610/ [2] http://kernel.googlecode.com/files/xmit_udp_parallel.c Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | netns: fix net_alloc_generic()Eric Dumazet2012-01-261-15/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a new net namespace is created, we should attach to it a "struct net_generic" with enough slots (even empty), or we can hit the following BUG_ON() : [ 200.752016] kernel BUG at include/net/netns/generic.h:40! ... [ 200.752016] [<ffffffff825c3cea>] ? get_cfcnfg+0x3a/0x180 [ 200.752016] [<ffffffff821cf0b0>] ? lockdep_rtnl_is_held+0x10/0x20 [ 200.752016] [<ffffffff825c41be>] caif_device_notify+0x2e/0x530 [ 200.752016] [<ffffffff810d61b7>] notifier_call_chain+0x67/0x110 [ 200.752016] [<ffffffff810d67c1>] raw_notifier_call_chain+0x11/0x20 [ 200.752016] [<ffffffff821bae82>] call_netdevice_notifiers+0x32/0x60 [ 200.752016] [<ffffffff821c2b26>] register_netdevice+0x196/0x300 [ 200.752016] [<ffffffff821c2ca9>] register_netdev+0x19/0x30 [ 200.752016] [<ffffffff81c1c67a>] loopback_net_init+0x4a/0xa0 [ 200.752016] [<ffffffff821b5e62>] ops_init+0x42/0x180 [ 200.752016] [<ffffffff821b600b>] setup_net+0x6b/0x100 [ 200.752016] [<ffffffff821b6466>] copy_net_ns+0x86/0x110 [ 200.752016] [<ffffffff810d5789>] create_new_namespaces+0xd9/0x190 net_alloc_generic() should take into account the maximum index into the ptr array, as a subsystem might use net_generic() anytime. This also reduces number of reallocations in net_assign_generic() Reported-by: Sasha Levin <levinsasha928@gmail.com> Tested-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sjur Brændeland <sjur.brandeland@stericsson.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | tcp: bind() optimize port allocationFlavio Leitner2012-01-251-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Port autoselection finds a port and then drop the lock, then right after that, gets the hash bucket again and lock it. Fix it to go direct. Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | tcp: bind() fix autoselection to share portsFlavio Leitner2012-01-251-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current code checks for conflicts when the application requests a specific port. If there is no conflict, then the request is granted. On the other hand, the port autoselection done by the kernel fails when all ports are bound even when there is a port with no conflict available. The fix changes port autoselection to check if there is a conflict and use it if not. Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | l2tp: l2tp_ip - fix possible oops on packet receiveJames Chapman2012-01-251-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a packet is received on an L2TP IP socket (L2TPv3 IP link encapsulation), the l2tpip socket's backlog_rcv function calls xfrm4_policy_check(). This is not necessary, since it was called before the skb was added to the backlog. With CONFIG_NET_NS enabled, xfrm4_policy_check() will oops if skb->dev is null, so this trivial patch removes the call. This bug has always been present, but only when CONFIG_NET_NS is enabled does it cause problems. Most users are probably using UDP encapsulation for L2TP, hence the problem has only recently surfaced. EIP: 0060:[<c12bb62b>] EFLAGS: 00210246 CPU: 0 EIP is at l2tp_ip_recvmsg+0xd4/0x2a7 EAX: 00000001 EBX: d77b5180 ECX: 00000000 EDX: 00200246 ESI: 00000000 EDI: d63cbd30 EBP: d63cbd18 ESP: d63cbcf4 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Call Trace: [<c1218568>] sock_common_recvmsg+0x31/0x46 [<c1215c92>] __sock_recvmsg_nosec+0x45/0x4d [<c12163a1>] __sock_recvmsg+0x31/0x3b [<c1216828>] sock_recvmsg+0x96/0xab [<c10b2693>] ? might_fault+0x47/0x81 [<c10b2693>] ? might_fault+0x47/0x81 [<c1167fd0>] ? _copy_from_user+0x31/0x115 [<c121e8c8>] ? copy_from_user+0x8/0xa [<c121ebd6>] ? verify_iovec+0x3e/0x78 [<c1216604>] __sys_recvmsg+0x10a/0x1aa [<c1216792>] ? sock_recvmsg+0x0/0xab [<c105a99b>] ? __lock_acquire+0xbdf/0xbee [<c12d5a99>] ? do_page_fault+0x193/0x375 [<c10d1200>] ? fcheck_files+0x9b/0xca [<c10d1259>] ? fget_light+0x2a/0x9c [<c1216bbb>] sys_recvmsg+0x2b/0x43 [<c1218145>] sys_socketcall+0x16d/0x1a5 [<c11679f0>] ? trace_hardirqs_on_thunk+0xc/0x10 [<c100305f>] sysenter_do_call+0x12/0x38 Code: c6 05 8c ea a8 c1 01 e8 0c d4 d9 ff 85 f6 74 07 3e ff 86 80 00 00 00 b9 17 b6 2b c1 ba 01 00 00 00 b8 78 ed 48 c1 e8 23 f6 d9 ff <ff> 76 0c 68 28 e3 30 c1 68 2d 44 41 c1 e8 89 57 01 00 83 c4 0c Signed-off-by: James Chapman <jchapman@katalix.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge tag 'nfs-for-3.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2012-01-301-1/+16
| |\ \ \ | | |/ / | |/| | | | | | | | | | | | | | | | | | NFS client bugfixes for Linux 3.3 (pull 3) * tag 'nfs-for-3.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: Fix machine creds in generic_create_cred and generic_match
| | * | SUNRPC: Fix machine creds in generic_create_cred and generic_matchTrond Myklebust2012-01-231-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - generic_create_cred needs to copy the '.principal' field. - generic_match needs to ignore the groups and match on the '.principal' field. This fixes an Oops that was introduced by commit 68c9715 (SUNRPC: Clean up the RPCSEC_GSS service ticket requests) Reported-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: J. Bruce Fields <bfields@redhat.com>
* | | | caif: Add drop count for caif_net device.sjur.brandeland@stericsson.com2012-02-041-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Count dropped packets in CAIF Netdevice. Signed-off-by: Sjur Brændeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | caif: Kill debugfs vars for caif socketsjur.brandeland@stericsson.com2012-02-041-112/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Kill off the debug-fs exposed varaibles from caif_socket. Signed-off-by: Sjur Brændeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | atm: clip: Convert over to dst_neigh_lookup().David S. Miller2012-02-011-4/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CLIP only support ipv4, and this is evidenced by the fact that it is a device specific extension of arp_tbl, so this conversion is pretty straightforward. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | decnet: Add missing neigh->ha locking to dn_neigh_output_packet()David S. Miller2012-02-011-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | Basically, mirror the logic in neigh_connected_output(). Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | ipv6: Remove never used function inet6_ac_check().David S. Miller2012-02-011-29/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It went from unused, to commented out, and never changing after that. Just get rid of it, if someone wants it they can unearth it from the history. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | PATCH V2 net-next] net: dev: Convert printks to pr_<level>Joe Perches2012-02-011-55/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the current logging style. Coalesce formats where appropriate. Update grammar where appropriate. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | netpoll: Neaten MAX_SKB_SIZE macroJoe Perches2012-02-011-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the types in the packet layout order. Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | netpoll: Convert printks to np_<level> and add pr_fmtJoe Perches2012-02-011-35/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use a more current message logging style. Add pr_fmt to prefix dmesg output with "netpoll: " Add macros to print np->name. Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tcp: md5: RST: getting md5 key from listenerShawn Lu2012-02-012-5/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP RST mechanism is broken in TCP md5(RFC2385). When connection is gone, md5 key is lost, sending RST without md5 hash is deem to ignored by peer. This can be a problem since RST help protocal like bgp to fast recove from peer crash. In most case, users of tcp md5, such as bgp and ldp, have listener on both sides to accept connection from peer. md5 keys for peers are saved in listening socket. There are two cases in finding md5 key when connection is lost: 1.Passive receive RST: The message is send to well known port, tcp will associate it with listner. md5 key is gotten from listener. 2.Active receive RST (no sock): The message is send to ative side, there is no socket associated with the message. In this case, finding listener from source port, then find md5 key from listener. we are not loosing sercuriy here: packet is checked with md5 hash. No RST is generated if md5 hash doesn't match or no md5 key can be found. Signed-off-by: Shawn Lu <shawn.lu@ericsson.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | xfrm6: remove unneeded NULL check in __xfrm6_output()Dan Carpenter2012-02-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't check for NULL consistently in __xfrm6_output(). If "x" were NULL here it would lead to an OOPs later. I asked Steffen Klassert about this and he suggested that we remove the NULL check. On 10/29/11, Steffen Klassert <steffen.klassert@secunet.com> wrote: >> net/ipv6/xfrm6_output.c >> 148 >> 149 if ((x && x->props.mode == XFRM_MODE_TUNNEL) && >> ^ > > x can't be null here. It would be a bug if __xfrm6_output() is called > without a xfrm_state attached to the skb. I think we can just remove > this null check. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tcp: md5: protects md5sig_info with RCUEric Dumazet2012-02-012-14/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch makes sure we use appropriate memory barriers before publishing tp->md5sig_info, allowing tcp_md5_do_lookup() being used from tcp_v4_send_reset() without holding socket lock (upcoming patch from Shawn Lu) Note we also need to respect rcu grace period before its freeing, since we can free socket without this grace period thanks to SLAB_DESTROY_BY_RCU Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Shawn Lu <shawn.lu@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tcp: md5: use sock_kmalloc() to limit md5 keysEric Dumazet2012-01-311-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no limit on number of MD5 keys an application can attach to a tcp socket. This patch adds a per tcp socket limit based on /proc/sys/net/core/optmem_max With current default optmem_max values, this allows about 150 keys on 64bit arches, and 88 keys on 32bit arches. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud