op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	netfilter: Fix build errors with xt_socket.c	David S. Miller	2013-09-05	1	-0/+1
\|	As reported by Randy Dunlap:
\|	====================
\|	when CONFIG_IPV6=m and CONFIG_NETFILTER_XT_MATCH_SOCKET=y:
\|	net/built-in.o: In function `socket_mt6_v1_v2':
\|	xt_socket.c:(.text+0x51b55): undefined reference to `udp6_lib_lookup'
\|	net/built-in.o: In function `socket_mt_init':
\|	xt_socket.c:(.init.text+0x1ef8): undefined reference to `nf_defrag_ipv6_enable'
\|	====================
\|	Like several other modules under net/netfilter/ we have to have a dependency "IPV6 disabled or set compatibly with this module" clause.
\|	Reported-by: Randy Dunlap <rdunlap@infradead.org>
\|	Signed-off-by: David S. Miller <davem@davemloft.net>
*	icplus: Use netif_running to determine device state	Jon Mason	2013-09-05	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	Remove the __LINK_STATE_START check to verify the device is running, in favor of netif_running(). netif_running() performs the same check of __LINK_STATE_START, so the code should behave the same. Signed-off-by: Jon Mason <jdmason@kudzu.us> Cc: Francois Romieu <romieu@fr.zoreil.com> Cc: Sorbica Shieh <sorbica@icplus.com.tw> Signed-off-by: David S. Miller <davem@davemloft.net>
*	ethernet/arc/arc_emac: Fix huge delays in large file copies	Vineet Gupta	2013-09-05	1	-2/+2
\|	Copying large files to an NFS-mounted host was taking an absurdly long time.
\|	It turns out that TX BD reclaim had a subtle bug. The loop starts off from the @txbd_dirty cursor and stops when it hits a BD still in use by the controller. When it stops, it needs to keep the cursor at that very BD so that scanning resumes there in the next iteration. However, it was erroneously incrementing the cursor, causing the next scan(s) to fail too, unless the BD chain was completely drained.
\|	[ARCLinux]$ ls -l -sh /disk/log.txt
\|	17976 -rw-r--r-- 1 root root 17.5M Sep /disk/log.txt
\|	========== Before =====================
\|	[ARCLinux]$ time cp /disk/log.txt /mnt/.
\|	real 31m 7.95s user 0m 0.00s sys 0m 0.10s
\|	========== After =====================
\|	[ARCLinux]$ time cp /disk/log.txt /mnt/.
\|	real 0m 24.33s user 0m 0.00s sys 0m 0.19s
\|	Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
\|	Cc: Alexey Brodkin <abrodkin@synopsys.com> (commit_signer:3/4=75%)
\|	Cc: "David S. Miller" <davem@davemloft.net> (commit_signer:3/4=75%)
\|	Cc: netdev@vger.kernel.org
\|	Cc: linux-kernel@vger.kernel.org
\|	Cc: arc-linux-dev@synopsys.com
\|	Signed-off-by: David S. Miller <davem@davemloft.net>
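The cursor bug described in this message can be illustrated with a small userspace sketch. This is not the arc_emac driver code: the `struct bd` layout, `RING_SIZE`, and the two function names are hypothetical and exist only to contrast a reclaim loop that leaves the cursor on the first busy descriptor with one that steps past it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8          /* hypothetical ring size for this sketch */

struct bd {                  /* hypothetical, simplified buffer descriptor */
    bool in_use;             /* controller still owns this BD */
    bool has_skb;            /* a transmitted buffer is waiting to be freed */
};

/* Reclaim finished BDs starting at *dirty and return how many were freed.
 * The cursor is left on the first BD that cannot be reclaimed yet. */
size_t reclaim_fixed(struct bd *ring, size_t *dirty)
{
    size_t freed = 0;

    assert(*dirty < RING_SIZE);
    for (size_t i = 0; i < RING_SIZE; i++) {
        struct bd *b = &ring[*dirty];

        if (b->in_use || !b->has_skb)
            break;                           /* stop and keep the cursor here */
        b->has_skb = false;
        freed++;
        *dirty = (*dirty + 1) % RING_SIZE;   /* advance only after reclaiming */
    }
    return freed;
}

/* Buggy variant: the cursor is advanced before the ownership check, so the
 * busy BD is skipped and is not revisited on the next call. */
size_t reclaim_buggy(struct bd *ring, size_t *dirty)
{
    size_t freed = 0;

    assert(*dirty < RING_SIZE);
    for (size_t i = 0; i < RING_SIZE; i++) {
        struct bd *b = &ring[*dirty];

        *dirty = (*dirty + 1) % RING_SIZE;   /* advanced unconditionally */
        if (b->in_use || !b->has_skb)
            break;
        b->has_skb = false;
        freed++;
    }
    return freed;
}
```

In this simplified model the buggy variant leaves a completed descriptor unreclaimed until the cursor wraps around, which is one way to picture the stalls reported above.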
*	tuntap: orphan frags before trying to set tx timestamp	Jason Wang	2013-09-05	1	-3/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	sock_tx_timestamp() will clear all zerocopy flags of skb, which may cause the frags never to be orphaned. This will break guest to guest traffic when zerocopy is enabled. Fix this by orphaning the frags before trying to set tx time stamp. The issue was introduced by commit eda297729171fe16bf34fe5b0419dfb69060f623 (tun: Support software transmit time stamping). Cc: Richard Cochran <richardcochran@gmail.com> Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	tuntap: purge socket error queue on detach	Jason Wang	2013-09-05	1	-3/+9
\| \| \| \| \| \| \| \| \| \| \| \|	Commit eda297729171fe16bf34fe5b0419dfb69060f623 (tun: Support software transmit time stamping) will queue skbs into error queue when tx stamping is enabled. But it forgets to purge the error queue during detach. This patch fixes this. Cc: Richard Cochran <richardcochran@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	qlcnic: use standard NAPI weights	Michal Schmidt	2013-09-05	2	-19/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since commit 82dc3c63 ("net: introduce NAPI_POLL_WEIGHT") netif_napi_add() produces an error message if a NAPI poll weight greater than 64 is requested. qlcnic requests the weight as large as 256 for some of its rings, and smaller values for other rings. For instance in qlcnic_82xx_napi_add() I think the intention was to give the tx+rx ring a bigger weight than to rx-only rings, but it's actually doing the opposite. So I'm assuming the weights do not really matter much. Just use the standard NAPI weights for all rings. Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Acked-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	ipv6:introduce function to find route for redirect	Duan Jiong	2013-09-05	6	-11/+81
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	RFC 4861 says that the IP source address of the Redirect is the same as the current first-hop router for the specified ICMP Destination Address, so the gateway should be taken into consideration when we find the route for redirect. There was once a check in commit a6279458c534d01ccc39498aba61c93083ee0372 ("NDISC: Search over all possible rules on receipt of redirect.") and the check went away in commit b94f1c0904da9b8bf031667afc48080ba7c3e8c9 ("ipv6: Use icmpv6_notify() to propagate redirect, instead of rt6_redirect()"). The bug is only "exploitable" on layer-2 because the source address of the redirect is checked to be a valid link-local address but it makes spoofing a lot easier in the same L2 domain nonetheless. Thanks very much for Hannes's help. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	bnx2x: VF RSS support - VF side	Ariel Elior	2013-09-05	6	-60/+145
\| \| \| \| \| \| \| \| \| \|	This patch adds capabilities to the VF driver to request multiple queues over the VF PF channel, along with the logic for requesting RSS configuration for said queues. Signed-off-by: Ariel Elior <ariele@broadcom.com> Signed-off-by: Eilong Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	bnx2x: VF RSS support - PF side	Ariel Elior	2013-09-05	10	-144/+513
\| \| \| \| \| \| \| \| \| \| \| \|	This patch adds support for Receive Side Scaling for queues of Virtual Functions on the PF side. This includes support for the requests for multiple queues from VF drivers, configuration of the HW for multiple queues per VF, and support for rss configuration of said queues. Signed-off-by: Ariel Elior <ariele@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	vxlan: Notify drivers for listening UDP port changes	Joseph Gasparakis	2013-09-05	3	-1/+87
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds two more ndo ops: ndo_add_rx_vxlan_port() and ndo_del_rx_vxlan_port(). Drivers can get notifications through the above functions about changes of the UDP listening port of VXLAN. Also, when physical ports come up, they can now call vxlan_get_rx_port() to obtain the port number(s) of existing VXLAN interfaces, in case those were already up before them. This information about the listening UDP port would be used for VXLAN related offloads. A big thank you to John Fastabend (john.r.fastabend@intel.com) for his input and his suggestions on this patch set. CC: John Fastabend <john.r.fastabend@intel.com> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: usbnet: update addr_assign_type if appropriate	Bjørn Mork	2013-09-05	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \|	This module generates a common default address on init, using eth_random_addr. Set addr_assign_type to let userspace know the address is random unless it was overridden by the minidriver. Signed-off-by: Bjørn Mork <bjorn@mork.no> Acked-by: Oliver Neukum <oliver@neukum.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'enic'	David S. Miller	2013-09-05	5	-19/+54
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Govindarajulu Varadarajan says: ==================== The following patch adds multi tx support for enic. Signed-off-by: Nishank Trivedi <nistrive@cisco.com> Signed-off-by: Christian Benvenuti <benve@cisco.com> Signed-off-by: Govindarajulu Varadarajan <govindarajulu90@gmail.com> ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	driver/net: enic: update enic maintainers and driver	govindarajulu.v	2013-09-05	2	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Signed-off-by: Govindarajulu Varadarajan <govindarajulu90@gmail.com> Signed-off-by: Sujith Sankar <ssujith@cisco.com> Signed-off-by: Nishank Trivedi <nistrive@cisco.com> Signed-off-by: Christian Benvenuti <benve@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	driver/net: enic: Exposing symbols for Cisco's low latency driver	govindarajulu.v	2013-09-05	2	-0/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch exposes symbols for usnic low latency driver that can be used to register and unregister vNics as well to traverse the resources on vNics. Signed-off-by: Upinder Malhi <umalhi@cisco.com> Signed-off-by: Nishank Trivedi <nistrive@cisco.com> Signed-off-by: Christian Benvenuti <benve@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	driver/net: enic: Try DMA 64 first, then failover to DMA	govindarajulu.v	2013-09-05	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In servers with more than 1.1 TB of RAM, the existing 40/32 bit DMA could cause failure as the DMA-able address could go outside the range addressable using 40/32 bits. The following patch tries 64-bit DMA first and, if that fails, falls back to 32-bit. Signed-off-by: Sujith Sankar <ssujith@cisco.com> Signed-off-by: Christian Benvenuti <benve@cisco.com> Signed-off-by: Govindarajulu Varadarajan <govindarajulu90@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
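The address-range argument can be checked with a few lines of userspace arithmetic. A 40-bit mask covers 2^40 bytes (1,099,511,627,776, roughly 1.1 TB), which is presumably where the figure in the message comes from. The sketch below is not the enic driver code; `dma_mask`, `addr_fits`, and `first_accepted_width` are hypothetical helpers, and the "platform limit" parameter merely stands in for whatever makes a real mask request succeed or fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mask with the low `bits` bits set (bits must be 1..64). */
uint64_t dma_mask(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/* True if `addr` is reachable by a device limited to `bits` address bits. */
bool addr_fits(uint64_t addr, unsigned bits)
{
    return (addr & ~dma_mask(bits)) == 0;
}

/* Walk the candidate widths in order (widest first) and return the first
 * one the hypothetical platform limit accepts, or 0 if none is accepted. */
unsigned first_accepted_width(const unsigned *candidates, size_t n,
                              unsigned platform_max_bits)
{
    for (size_t i = 0; i < n; i++)
        if (candidates[i] <= platform_max_bits)
            return candidates[i];
    return 0;
}
```

Trying the widest mask first means that memory above the 40-bit boundary stays reachable whenever the platform allows it, and the narrower mask is only used as a fallback.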
\| *	driver/net: enic: record q_number and rss_hash for skb	govindarajulu.v	2013-09-05	1	-0/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The following patch sets the skb->rxhash and skb->q_number. This is used by RPS and RFS. Kernel can make use of hw provided hash instead of calculating the hash. Signed-off-by: Govindarajulu Varadarajan <govindarajulu90@gmail.com> Signed-off-by: Nishank Trivedi <nistrive@cisco.com> Signed-off-by: Christian Benvenuti <benve@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	driver/net: enic: Add multi tx support for enic	govindarajulu.v	2013-09-05	2	-13/+25
\|/ \| \| \| \| \| \| \| \|	The following patch adds multi tx support for enic. Signed-off-by: Nishank Trivedi <nistrive@cisco.com> Signed-off-by: Christian Benvenuti <benve@cisco.com> Signed-off-by: Govindarajulu Varadarajan <govindarajulu90@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	bridge: apply multicast snooping to IPv6 link-local, too	Linus Lüssing	2013-09-05	3	-14/+6
\| \| \| \| \| \| \| \| \|	The multicast snooping code should have matured enough to be safely applicable to IPv6 link-local multicast addresses (excluding the link-local all nodes address, ff02::1), too. Signed-off-by: Linus Lüssing <linus.luessing@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>
*	bridge: prevent flooding IPv6 packets that do not have a listener	Linus Lüssing	2013-09-05	1	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if there is no listener for a certain group then IPv6 packets for that group are flooded on all ports, even though there might be no host and router interested in it on a port. With this commit they are only forwarded to ports with a multicast router. Just like commit bd4265fe36 ("bridge: Only flood unregistered groups to routers") did for IPv4, let's do the same for IPv6 with the same reasoning. Signed-off-by: Linus Lüssing <linus.luessing@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: ipv6: mld: document force_mld_version in ip-sysctl.txt	Daniel Borkmann	2013-09-04	1	-0/+5
\| \| \| \| \| \| \| \| \|	Document force_mld_version parameter in ip-sysctl.txt. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: ipv6: mld: introduce mld_{gq, ifc, dad}_stop_timer functions	Daniel Borkmann	2013-09-04	1	-16/+25
\| \| \| \| \| \| \| \| \| \| \|	We already have mld_{gq,ifc,dad}_start_timer() functions, so introduce mld_{gq,ifc,dad}_stop_timer() functions to reduce code size and make it more readable. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: ipv6: mld: refactor query processing into v1/v2 functions	Daniel Borkmann	2013-09-04	1	-33/+56
\| \| \| \| \| \| \| \| \| \|	Make igmp6_event_query() a bit easier to read by refactoring code parts into mld_process_v1() and mld_process_v2(). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: ipv6: mld: similarly to MLDv2 have min max_delay of 1	Daniel Borkmann	2013-09-04	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \|	Similarly as we do in MLDv2 queries, set a forged MLDv1 query with 0 ms mld_maxdelay to minimum timer shot time of 1 jiffies. This is eventually done in igmp6_group_queried() anyway, so we can simplify a check there. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: ipv6: mld: implement RFC3810 MLDv2 mode only	Daniel Borkmann	2013-09-04	1	-4/+30
\|	RFC3810, 10. Security Considerations says under subsection 10.1. Query Message:
\|	A forged Version 1 Query message will put MLDv2 listeners on that link in MLDv1 Host Compatibility Mode. This scenario can be avoided by providing MLDv2 hosts with a configuration option to ignore Version 1 messages completely.
\|	Hence, implement a MLDv2-only mode that will ignore MLDv1 traffic:
\|	echo 2 > /proc/sys/net/ipv6/conf/ethX/force_mld_version
\|	or
\|	echo 2 > /proc/sys/net/ipv6/conf/all/force_mld_version
\|	Note that <all> device has a higher precedence as it was previously also the case in the macro MLD_V1_SEEN() that would "short-circuit" if condition on <all> case.
\|	Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
\|	Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
\|	Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
\|	Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: ipv6: mld: get rid of MLDV2_MRC and simplify calculation	Daniel Borkmann	2013-09-04	3	-26/+23
\| \| \| \| \| \| \| \| \| \| \|	Get rid of MLDV2_MRC and use our new macros for mantissa and exponent to calculate Maximum Response Delay out of the Maximum Response Code. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: ipv6: mld: clean up MLD_V1_SEEN macro	Daniel Borkmann	2013-09-04	1	-13/+21
\| \| \| \| \| \| \| \| \| \| \|	Replace the macro with a function to make it more readable. GCC will eventually decide whether to inline this or not (also, that's not fast-path anyway). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: ipv6: mld: fix v1/v2 switchback timeout to rfc3810, 9.12.	Daniel Borkmann	2013-09-04	3	-9/+143
\|	i) RFC3810, 9.2. Query Interval [QI] says:
\|	The Query Interval variable denotes the interval between General Queries sent by the Querier. Default value: 125 seconds. [...]
\|	ii) RFC3810, 9.3. Query Response Interval [QRI] says:
\|	The Maximum Response Delay used to calculate the Maximum Response Code inserted into the periodic General Queries. Default value: 10000 (10 seconds) [...] The number of seconds represented by the [Query Response Interval] must be less than the [Query Interval].
\|	iii) RFC3810, 9.12. Older Version Querier Present Timeout [OVQPT] says:
\|	The Older Version Querier Present Timeout is the time-out for transitioning a host back to MLDv2 Host Compatibility Mode. When an MLDv1 query is received, MLDv2 hosts set their Older Version Querier Present Timer to [Older Version Querier Present Timeout]. This value MUST be ([Robustness Variable] times (the [Query Interval] in the last Query received)) plus ([Query Response Interval]).
\|	Hence, by default the timeout results in:
\|	[RV] = 2, [QI] = 125sec, [QRI] = 10sec
\|	[OVQPT] = [RV] * [QI] + [QRI] = 260sec
\|	Having that said, we currently calculate [OVQPT] (here given as 'switchback' variable) as ...
\|	switchback = (idev->mc_qrv + 1) * max_delay
\|	RFC3810, 9.12. says "the [Query Interval] in the last Query received". In section "9.14. Configuring timers", it is said:
\|	This section is meant to provide advice to network administrators on how to tune these settings to their network. Ambitious router implementations might tune these settings dynamically based upon changing characteristics of the network. [...]
\|	iv) RFC3810, 9.14.2. Query Interval:
\|	The overall level of periodic MLD traffic is inversely proportional to the Query Interval. A longer Query Interval results in a lower overall level of MLD traffic. The value of the Query Interval MUST be equal to or greater than the Maximum Response Delay used to calculate the Maximum Response Code inserted in General Query messages.
\|	I assume that was why switchback is calculated as is (3 * max_delay), although this setting seems to be meant for routers only to configure their [QI] interval for non-default intervals. So usage here like this is clearly wrong.
\|	Concluding, the current behaviour in IPv6's multicast code does not conform to the RFC, as the switchback is calculated wrongly. That is, it has a too small value, so MLDv2 hosts switch back again to MLDv2 way too early, i.e. ~30secs instead of ~260secs by default.
\|	Hence, introduce necessary helper functions and fix this up properly as it should be.
\|	Introduced in 06da92283 ("[IPV6]: Add MLDv2 support."). Credits to Hannes Frederic Sowa who also had a hand in this. Also thanks to Hangbin Liu who did initial testing.
\|	Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
\|	Cc: David Stevens <dlstevens@us.ibm.com>
\|	Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
\|	Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
\|	Signed-off-by: David S. Miller <davem@davemloft.net>
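The two timeout computations contrasted in this message can be reproduced with a short worked example. The function names `ovqpt_ms` and `old_switchback_ms` are hypothetical and are not the helpers the patch introduces; the sketch only evaluates the RFC3810 9.12 formula and the old `(mc_qrv + 1) * max_delay` expression with the default values quoted above, all in milliseconds.

```c
#include <assert.h>

/* RFC3810 9.12: [OVQPT] = [RV] * [QI] + [QRI], here in milliseconds. */
unsigned long ovqpt_ms(unsigned long rv, unsigned long qi_ms,
                       unsigned long qri_ms)
{
    /* RFC3810 9.3 requires the Query Response Interval to be
     * smaller than the Query Interval. */
    assert(qri_ms < qi_ms);
    return rv * qi_ms + qri_ms;
}

/* Old computation quoted in the message: (qrv + 1) * max_delay. */
unsigned long old_switchback_ms(unsigned long qrv, unsigned long max_delay_ms)
{
    return (qrv + 1) * max_delay_ms;
}
```

With [RV] = 2, [QI] = 125000 ms and [QRI] = 10000 ms, `ovqpt_ms` yields 260000 ms, while `old_switchback_ms(2, 10000)` yields 30000 ms, matching the ~260 s versus ~30 s figures in the message.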
*	tcp: better comments for RTO initiallization	Yuchung Cheng	2013-09-04	1	-6/+20
\| \| \| \| \| \| \| \| \| \| \|	Commit 1b7fdd2ab585 ("tcp: do not use cached RTT for RTT estimation") removed important comments on how RTO is initialized and updated. Hopefully this patch puts that information back. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	vxlan: Optimize vxlan rcv	Pravin B Shelar	2013-09-04	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \|	The vxlan-udp-recv function looks up the vxlan_sock struct on every received packet using the UDP port number. We can use sk->sk_user_data to store the vxlan_sock and avoid the lookup. I have open-coded the RCU API to store and read vxlan_sock from sk_user_data to avoid a sparse warning, as sk_user_data is not an __rcu pointer. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	atm: he: print MAC via %pM	Andy Shevchenko	2013-09-04	1	-9/+2
\| \| \| \| \|	Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	atm: nicstar: re-use native mac_pton() helper	Andy Shevchenko	2013-09-04	1	-25/+1
\| \| \| \| \| \| \| \|	There is a nice helper to parse MAC. Let's use it and remove custom implementation. Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	driver:stmmac: Adjust time stamp increase for 0.465 ns accurate only when Time stamp binary rollover is set.	Sonic Zhang	2013-09-04	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	The Synopsys spec says that when TSCRLSSR is cleared, the rollover value of the sub-second register is 0x7FFFFFFF (0.465 ns per clock). Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: sctp: Fix data chunk fragmentation for MTU values which are not multiple of 4	Alexander Sverdlin	2013-09-04	1	-2/+2
\|	Initially the problem was observed with ipsec, but later it became clear that SCTP data chunk fragmentation algorithm has problems with MTU values which are not multiple of 4. Test program was used which just transmits 2000 bytes long packets to other host. tcpdump was used to observe re-fragmentation in IP layer after SCTP already fragmented data chunks.
\|	With MTU 1500:
\|	12:54:34.082904 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 1500) 10.151.38.153.39303 > 10.151.24.91.54321: sctp (1) [DATA] (B) [TSN: 2366088589] [SID: 0] [SSEQ 1] [PPID 0x0]
\|	12:54:34.082933 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 596) 10.151.38.153.39303 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 2366088590] [SID: 0] [SSEQ 1] [PPID 0x0]
\|	12:54:34.090576 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48) 10.151.24.91.54321 > 10.151.38.153.39303: sctp (1) [SACK] [cum ack 2366088590] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]
\|	With MTU 1499:
\|	13:02:49.955220 IP (tos 0x2,ECT(0), ttl 64, id 48215, offset 0, flags [+], proto SCTP (132), length 1492) 10.151.38.153.39084 > 10.151.24.91.54321: sctp[\|sctp]
\|	13:02:49.955249 IP (tos 0x2,ECT(0), ttl 64, id 48215, offset 1472, flags [none], proto SCTP (132), length 28) 10.151.38.153 > 10.151.24.91: ip-proto-132
\|	13:02:49.955262 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 600) 10.151.38.153.39084 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 404355346] [SID: 0] [SSEQ 1] [PPID 0x0]
\|	13:02:49.956770 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48) 10.151.24.91.54321 > 10.151.38.153.39084: sctp (1) [SACK] [cum ack 404355346] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]
\|	Here problem in data portion limit calculation leads to re-fragmentation in IP, which is sub-optimal. The problem is max_data initial value, which doesn't take into account the fact, that data chunk must be padded to 4-bytes boundary. It's enough to correct max_data, because all later adjustments are correctly aligned to 4-bytes boundary.
\|	After the fix is applied, everything is fragmented correctly for uneven MTUs:
\|	15:16:27.083881 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 1496) 10.151.38.153.53417 > 10.151.24.91.54321: sctp (1) [DATA] (B) [TSN: 3077098183] [SID: 0] [SSEQ 1] [PPID 0x0]
\|	15:16:27.083907 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 600) 10.151.38.153.53417 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 3077098184] [SID: 0] [SSEQ 1] [PPID 0x0]
\|	15:16:27.085640 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48) 10.151.24.91.54321 > 10.151.38.153.53417: sctp (1) [SACK] [cum ack 3077098184] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]
\|	The bug was there for years already, but
\|	- is a performance issue, the packets are still transmitted
\|	- doesn't show up with default MTU 1500, but possibly with ipsec (MTU 1438)
\|	Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
\|	Acked-by: Vlad Yasevich <vyasevich@gmail.com>
\|	Acked-by: Neil Horman <nhorman@tuxdriver.com>
\|	Signed-off-by: David S. Miller <davem@davemloft.net>
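The packet lengths in the tcpdump output can be reproduced with a small calculation. This sketch is not the kernel's fragmentation code: it assumes an IPv4 header without options (20 bytes), the 12-byte SCTP common header and the 16-byte DATA chunk header, and the function names are hypothetical. It contrasts an unaligned `max_data` with one rounded down to a 4-byte boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed header sizes for this sketch. */
#define IPV4_HDR        20u   /* IPv4 header without options */
#define SCTP_COMMON_HDR 12u   /* SCTP common header */
#define SCTP_DATA_HDR   16u   /* DATA chunk header */
#define ALL_HDRS (IPV4_HDR + SCTP_COMMON_HDR + SCTP_DATA_HDR)

/* Payload limit per chunk without taking chunk padding into account. */
size_t max_data_unaligned(size_t mtu)
{
    assert(mtu > ALL_HDRS);
    return mtu - ALL_HDRS;
}

/* Payload limit rounded down to a multiple of 4, so that the padded
 * chunk still fits into the MTU. */
size_t max_data_aligned(size_t mtu)
{
    return max_data_unaligned(mtu) & ~(size_t)3;
}

/* On-wire IP packet length for one DATA chunk carrying data_len bytes;
 * the chunk payload is padded up to a 4-byte boundary. */
size_t packet_len(size_t data_len)
{
    size_t padded = (data_len + 3) & ~(size_t)3;
    return ALL_HDRS + padded;
}
```

For MTU 1499 the unaligned limit is 1451 bytes, which pads to 1452 and produces a 1500-byte packet that IP must fragment again. The aligned limit of 1448 bytes gives a 1496-byte first packet and a 600-byte second packet for a 2000-byte message, consistent with the trace taken after the fix.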
*	drivers:net: delete premature free_irq	Julia Lawall	2013-09-04	1	-9/+2
\|	Free_irq is not needed if there has been no request_irq. Free_irq is removed from both the probe and remove functions. The correct request_irq and free_irq are found in the open and close functions.
\|	A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/)
\|	// <smpl>
\|	@@
\|	expression e;
\|	@@
\|	e = platform_get_irq(...);
\|	... when != request_irq(e,...)
\|	free_irq(e,...)
\|	// </smpl>
\|	Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
\|	Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: sync some IP headers with glibc	Carlos O'Donell	2013-09-04	4	-20/+169
\|	Solution:
\|	=========
\|	- Synchronize linux's `include/uapi/linux/in6.h' with glibc's `inet/netinet/in.h'.
\|	- Synchronize glibc's `inet/netinet/in.h' with linux's `include/uapi/linux/in6.h'.
\|	- Allow including the headers in either order.
\|	- First header included defines the structures and macros.
\|	Details:
\|	========
\|	The kernel promises not to break the UAPI ABI so I don't see why we can't just have the two userspace headers coordinate? If you include the kernel headers first you get those, and if you include the glibc headers first you get those, and the following patch arranges a coordination and synchronization between the two.
\|	Let's handle `include/uapi/linux/in6.h' from linux, and `inet/netinet/in.h' from glibc and ensure they compile in any order and preserve the required ABI.
\|	These two patches pass the following compile tests:
\|	cat >> test1.c <<EOF
\|	int main (void) {
\|	return 0;
\|	}
\|	EOF
\|	gcc -c test1.c
\|	cat >> test2.c <<EOF
\|	int main (void) {
\|	return 0;
\|	}
\|	EOF
\|	gcc -c test2.c
\|	One wrinkle is that the kernel has a different name for one of the members in ipv6_mreq. In the kernel patch we create a macro to cover the uses of the old name, and while that's not entirely clean it's one of the best solutions (aside from an anonymous union which has other issues).
\|	I've reviewed the code and it looks to me like the ABI is assured and everything matches on both sides.
\|	Notes:
\|	- You want netinet/in.h to include bits/in.h as early as possible, but it needs in_addr so define in_addr early.
\|	- You want bits/in.h included as early as possible so you can use the linux specific code to define __USE_KERNEL_DEFS based on the _UAPI_* macro definition and use those to cull in.h.
\|	- glibc was missing IPPROTO_MH, added here.
\|	Compile tested and inspected.
\|	Reported-by: Thomas Backlund <tmb@mageia.org>
\|	Cc: Thomas Backlund <tmb@mageia.org>
\|	Cc: libc-alpha@sourceware.org
\|	Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
\|	Cc: David S. Miller <davem@davemloft.net>
\|	Tested-by: Cong Wang <amwang@redhat.com>
\|	Signed-off-by: Carlos O'Donell <carlos@redhat.com>
\|	Signed-off-by: Cong Wang <amwang@redhat.com>
\|	Signed-off-by: David S. Miller <davem@davemloft.net>
*	sfc: check for allocation failure	Dan Carpenter	2013-09-04	1	-0/+2
\| \| \| \| \| \| \|	It upsets static analyzers when we don't check for allocation failure. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next	David S. Miller	2013-09-04	7	-60/+143
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Jeff Kirsher says: ==================== This series contains updates to igb only. Todd provides a fix for igb to not look for a PBA in the iNVM on devices that are flashless. Akeem provides igb patches to add a new PHY id for i354, as well as a couple of patches to implement the new PHY id. He also provides several patches to correctly report the appropriate media type as well as correctly report advertised/supported link for i354 devices. Lastly Akeem implements a 1 second delay mechanism for i210 devices to avoid erroneous link issue with the link partner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	igb: Update version number	Akeem G Abodunrin	2013-09-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch updates igb driver version to 5.0.5 Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	igb: Implementation to report advertised/supported link on i354 devices	Akeem G Abodunrin	2013-09-04	1	-11/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch changes the way we report supported/advertised link for i354 devices, especially for 2.5 GB. Instead of reporting 2.5 GB for all i354 devices erroneously, check first, if it is 2.5 GB capable. Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	igb: Get speed and duplex for 1G non_copper devices	Akeem G Abodunrin	2013-09-04	1	-1/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch changes how we get speed/duplex for non_copper devices; it now uses pcs register to get current speed and duplex instead of using generic status register that we use to detect speed/duplex for copper devices. Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	igb: Support to get 2_5G link status for appropriate media type	Akeem G Abodunrin	2013-09-04	2	-18/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since i354 2.5Gb devices are not Copper media type but SerDes, this patch changes the way we detect speed/duplex link info for this device. Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	igb: No PHPM support in i354 devices	Akeem G Abodunrin	2013-09-04	1	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	PHY Power Management does not exist for i354 device. So, there is no need to read and write this register or clear go link Disconnect bit, which could cause a lot of issues. Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	igb: M88E1543 PHY downshift implementation	Akeem G Abodunrin	2013-09-04	1	-9/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch implements downshift mechanism for M88E1543 PHY, so that downshift is disabled first during link setup process, and later enabled if we are master and downshift link is negotiated. Also cleaned up return code implementation. Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	igb: New PHY_ID for i354 device	Akeem G Abodunrin	2013-09-04	3	-14/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch changes PHY_ID for i354 device, now using M88E1543 instead of M88E1545. Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	igb: Implementation of 1-sec delay for i210 devices	Akeem G Abodunrin	2013-09-04	2	-3/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds a 1-second delay mechanism to the i210 device family, in order to avoid erroneous link issues with the link partner. Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	igb: Don't look for a PBA in the iNVM when flashless	Todd Fujinaka	2013-09-04	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a part is flashless, do not look for a PBA in the iNVM. Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
* \|	Merge branch 'master' of ↵	David S. Miller	2013-09-04	4	-9/+17
\|\ \ \| \|/ \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== The following batch contains: * Three fixes for the new synproxy target available in your net-next tree, from Jesper D. Brouer and Patrick McHardy. * One fix for TCPMSS to correctly handle the fragmentation case, from Phil Oester. I'll pass this one to -stable. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	netfilter: xt_TCPMSS: correct return value in tcpmss_mangle_packet	Phil Oester	2013-09-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In commit b396966c4 (netfilter: xt_TCPMSS: Fix missing fragmentation handling), I attempted to add safe fragment handling to xt_TCPMSS. However, Andy Padavan of Project N56U correctly points out that returning XT_CONTINUE in this function does not work. The callers (tcpmss_tg[46]) expect to receive a value of 0 in order to return XT_CONTINUE. Signed-off-by: Phil Oester <kernel@linuxace.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: SYNPROXY: let unrelated packets continue	Jesper Dangaard Brouer	2013-09-04	2	-4/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Packets reaching SYNPROXY were dropped by default, as they were most likely invalid (given the recommended state matching). This patch changes the SYNPROXY target to let packets that it does not consume continue being processed by the stack. This is more in line with other target modules, as it allows more flexible configurations for handling, logging or matching on packets in INVALID states. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: synproxy_core: fix warning in __nf_ct_ext_add_length()	Patrick McHardy	2013-09-04	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With CONFIG_NETFILTER_DEBUG we get the following warning during SYNPROXY init: [ 80.558906] WARNING: CPU: 1 PID: 4833 at net/netfilter/nf_conntrack_extend.c:80 __nf_ct_ext_add_length+0x217/0x220 [nf_conntrack]() The reason is that the conntrack template is set to confirmed before adding the extension and it is invalid to add extensions to already confirmed conntracks. Fix by adding the extensions before setting the conntrack to confirmed. Reported-by: Jesper Dangaard Brouer <jesper.brouer@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>