summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-09-0534-200/+439
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c net/bridge/br_multicast.c net/ipv6/sit.c The conflicts were minor: 1) sit.c changes overlap with change to ip_tunnel_xmit() signature. 2) br_multicast.c had an overlap between computing max_delay using msecs_to_jiffies and turning MLDV2_MRC() into an inline function with a name using lowercase instead of uppercase letters. 3) stmmac had two overlapping changes, one which conditionally allocated and hooked up a dma_cfg based upon the presence of the pbl OF property, and another one handling store-and-forward DMA made. The latter of which should not go into the new of_find_property() basic block. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: Add missing braces to do_tcp_setsockoptDave Jones2013-09-051-1/+2
| | | | | | | | | | | | Signed-off-by: Dave Jones <davej@fedoraproject.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * caif: Add missing braces to multiline if in cfctrl_linkup_requestDave Jones2013-09-051-1/+2
| | | | | | | | | | | | | | | | | | | | The indentation here implies this was meant to be a multi-line if. Introduced several years back in commit c85c2951d4da1236e32f1858db418221e624aba5 ("caif: Handle dev_queue_xmit errors.") Signed-off-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ipv6: tcp: fix potential use after free in tcp_v6_do_rcvDaniel Borkmann2013-09-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In tcp_v6_do_rcv() code, when processing pkt options, we soley work on our skb clone opt_skb that we've created earlier before entering tcp_rcv_established() on our way. However, only in condition ... if (np->rxopt.bits.rxtclass) np->rcv_tclass = ipv6_get_dsfield(ipv6_hdr(skb)); ... we work on skb itself. As we extract every other information out of opt_skb in ipv6_pktoptions path, this seems wrong, since skb can already be released by tcp_rcv_established() earlier on. When we try to access it in ipv6_hdr(), we will dereference freed skb. [ Bug added by commit 4c507d2897bd9b ("net: implement IP_RECVTOS for IP_PKTOPTIONS") ] Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: Don't depend on per socket memory for neighbour discovery messagesThomas Graf2013-09-041-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allocating skbs when sending out neighbour discovery messages currently uses sock_alloc_send_skb() based on a per net namespace socket and thus share a socket wmem buffer space. If a netdevice is temporarily unable to transmit due to carrier loss or for other reasons, the queued up ndisc messages will cosnume all of the wmem space and will thus prevent from any more skbs to be allocated even for netdevices that are able to transmit packets. The number of neighbour discovery messages sent is very limited, use of alloc_skb() bypasses the socket wmem buffer size enforcement while the manual call to skb_set_owner_w() maintains the socket reference needed for the IPv6 output path. This patch has orginally been posted by Eric Dumazet in a modified form. Signed-off-by: Thomas Graf <tgraf@suug.ch> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Fabio Estevam <festevam@gmail.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Tested-by: Stephen Warren <swarren@nvidia.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: fix null pointer dereference in __ip6addrlbl_addHannes Frederic Sowa2013-09-041-25/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b67bfe0d42cac56c512dd5da4b1b347a23f4b70a ("hlist: drop the node parameter from iterators") changed the behavior of hlist_for_each_entry_safe to leave the p argument NULL. Fix this up by tracking the last argument. Reported-by: Michele Baldessari <michele@acksyn.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Tested-by: Michele Baldessari <michele@acksyn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: ipv6_create_tempaddr cleanupPetr Holasek2013-09-031-2/+0
| | | | | | | | | | | | | | | | | | | | This two-liner removes max_addresses variable which is now unecessary related to patch [ipv6: remove max_addresses check from ipv6_create_tempaddr]. Signed-off-by: Petr Holasek <pholasek@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ICMPv6: treat dest unreachable codes 5 and 6 as EACCES, not EPROTOJiri Bohac2013-09-031-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RFC 4443 has defined two additional codes for ICMPv6 type 1 (destination unreachable) messages: 5 - Source address failed ingress/egress policy 6 - Reject route to destination Now they are treated as protocol error and icmpv6_err_convert() converts them to EPROTO. RFC 4443 says: "Codes 5 and 6 are more informative subsets of code 1." Treat codes 5 and 6 as code 1 (EACCES) Btw, connect() returning -EPROTO confuses firefox, so that fallback to other/IPv4 addresses does not work: https://bugzilla.mozilla.org/show_bug.cgi?id=910773 Signed-off-by: Jiri Bohac <jbohac@suse.cz> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2013-08-3028-165/+423
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) There was a simplification in the ipv6 ndisc packet sending attempted here, which avoided using memory accounting on the per-netns ndisc socket for sending NDISC packets. It did fix some important issues, but it causes regressions so it gets reverted here too. Specifically, the problem with this change is that the IPV6 output path really depends upon there being a valid skb->sk attached. The reason we want to do this change in some form when we figure out how to do it right, is that if a device goes down the ndisc_sk socket send queue will fill up and block NDISC packets that we want to send to other devices too. That's really bad behavior. Hopefully Thomas can come up with a better version of this change. 2) Fix a severe TCP performance regression by reverting a change made to dev_pick_tx() quite some time ago. From Eric Dumazet. 3) TIPC returns wrongly signed error codes, fix from Erik Hugne. 4) Fix OOPS when doing IPSEC over ipv4 tunnels due to orphaning the skb->sk too early. Fix from Li Hongjun. 5) RAW ipv4 sockets can use the wrong routing key during lookup, from Chris Clark. 6) Similar to #1 revert an older change that tried to use plain alloc_skb() for SYN/ACK TCP packets, this broke the netfilter owner mark which needs to see the skb->sk for such frames. From Phil Oester. 7) BNX2x driver bug fixes from Ariel Elior and Yuval Mintz, specifically in the handling of virtual functions. 8) IPSEC path error propagations to sockets is not done properly when we have v4 in v6, and v6 in v4 type rules. Fix from Hannes Frederic Sowa. 9) Fix missing channel context release in mac80211, from Johannes Berg. 10) Fix network namespace handing wrt. SCM_RIGHTS, from Andy Lutomirski. 11) Fix usage of bogus NAPI weight in jme, netxen, and ps3_gelic drivers. From Michal Schmidt. 12) Hopefully a complete and correct fix for the genetlink dump locking and module reference counting. From Pravin B Shelar. 13) sk_busy_loop() must do a cpu_relax(), from Eliezer Tamir. 14) Fix handling of timestamp offset when restoring a snapshotted TCP socket. From Andrew Vagin. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits) net: fec: fix time stamping logic after napi conversion net: bridge: convert MLDv2 Query MRC into msecs_to_jiffies for max_delay mISDN: return -EINVAL on error in dsp_control_req() net: revert 8728c544a9c ("net: dev_pick_tx() fix") Revert "ipv6: Don't depend on per socket memory for neighbour discovery messages" ipv4 tunnels: fix an oops when using ipip/sit with IPsec tipc: set sk_err correctly when connection fails tcp: tcp_make_synack() should use sock_wmalloc bridge: separate querier and query timer into IGMP/IPv4 and MLD/IPv6 ones ipv6: Don't depend on per socket memory for neighbour discovery messages ipv4: sendto/hdrincl: don't use destination address found in header tcp: don't apply tsoffset if rcv_tsecr is zero tcp: initialize rcv_tstamp for restored sockets net: xilinx: fix memleak net: usb: Add HP hs2434 device to ZLP exception table net: add cpu_relax to busy poll loop net: stmmac: fixed the pbl setting with DT genl: Hold reference on correct module while netlink-dump. genl: Fix genl dumpit() locking. xfrm: Fix potential null pointer dereference in xdst_queue_output ...
| | * net: bridge: convert MLDv2 Query MRC into msecs_to_jiffies for max_delayDaniel Borkmann2013-08-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While looking into MLDv1/v2 code, I noticed that bridging code does not convert it's max delay into jiffies for MLDv2 messages as we do in core IPv6' multicast code. RFC3810, 5.1.3. Maximum Response Code says: The Maximum Response Code field specifies the maximum time allowed before sending a responding Report. The actual time allowed, called the Maximum Response Delay, is represented in units of milliseconds, and is derived from the Maximum Response Code as follows: [...] As we update timers that work with jiffies, we need to convert it. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Linus Lüssing <linus.luessing@web.de> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * net: revert 8728c544a9c ("net: dev_pick_tx() fix")Eric Dumazet2013-08-301-8/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8728c544a9cbdc ("net: dev_pick_tx() fix") and commit b6fe83e9525a ("bonding: refine IFF_XMIT_DST_RELEASE capability") are quite incompatible : Queue selection is disabled because skb dst was dropped before entering bonding device. This causes major performance regression, mainly because TCP packets for a given flow can be sent to multiple queues. This is particularly visible when using the new FQ packet scheduler with MQ + FQ setup on the slaves. We can safely revert the first commit now that 416186fbf8c5b ("net: Split core bits of netdev_pick_tx into __netdev_pick_tx") properly caps the queue_index. Reported-by: Xi Wang <xii@google.com> Diagnosed-by: Xi Wang <xii@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Denys Fedorysychenko <nuclearcat@nuclearcat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * Revert "ipv6: Don't depend on per socket memory for neighbour discovery ↵David S. Miller2013-08-301-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | messages" This reverts commit 1f324e38870cc09659cf23bc626f1b8869e201f2. It seems to cause regressions, and in particular the output path really depends upon there being a socket attached to skb->sk for checks such as sk_mc_loop(skb->sk) for example. See ip6_output_finish2(). Reported-by: Stephen Warren <swarren@wwwdotorg.org> Reported-by: Fabio Estevam <festevam@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * ipv4 tunnels: fix an oops when using ipip/sit with IPsecLi Hongjun2013-08-302-7/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 3d7b46cd20e3 (ip_tunnel: push generic protocol handling to ip_tunnel module.), an Oops is triggered when an xfrm policy is configured on an IPv4 over IPv4 tunnel. xfrm4_policy_check() calls __xfrm_policy_check2(), which uses skb_dst(skb). But this field is NULL because iptunnel_pull_header() calls skb_dst_drop(skb). Signed-off-by: Li Hongjun <hongjun.li@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * tipc: set sk_err correctly when connection failsErik Hugne2013-08-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Should a connect fail, if the publication/server is unavailable or due to some other error, a positive value will be returned and errno is never set. If the application code checks for an explicit zero return from connect (success) or a negative return (failure), it will not catch the error and subsequent send() calls will fail as shown from the strace snippet below. socket(0x1e /* PF_??? */, SOCK_SEQPACKET, 0) = 3 connect(3, {sa_family=0x1e /* AF_??? */, sa_data="\2\1\322\4\0\0\322\4\0\0\0\0\0\0"}, 16) = 111 sendto(3, "test", 4, 0, NULL, 0) = -1 EPIPE (Broken pipe) The reason for this behaviour is that TIPC wrongly inverts error codes set in sk_err. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * tcp: tcp_make_synack() should use sock_wmallocPhil Oester2013-08-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 90ba9b19 (tcp: tcp_make_synack() can use alloc_skb()), Eric changed the call to sock_wmalloc in tcp_make_synack to alloc_skb. In doing so, the netfilter owner match lost its ability to block the SYNACK packet on outbound listening sockets. Revert the change, restoring the owner match functionality. This closes netfilter bugzilla #847. Signed-off-by: Phil Oester <kernel@linuxace.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * bridge: separate querier and query timer into IGMP/IPv4 and MLD/IPv6 onesLinus Lüssing2013-08-305-93/+240
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we would still potentially suffer multicast packet loss if there is just either an IGMP or an MLD querier: For the former case, we would possibly drop IPv6 multicast packets, for the latter IPv4 ones. This is because we are currently assuming that if either an IGMP or MLD querier is present that the other one is present, too. This patch makes the behaviour and fix added in "bridge: disable snooping if there is no querier" (b00589af3b04) to also work if there is either just an IGMP or an MLD querier on the link: It refines the deactivation of the snooping to be protocol specific by using separate timers for the snooped IGMP and MLD queries as well as separate timers for our internal IGMP and MLD queriers. Signed-off-by: Linus Lüssing <linus.luessing@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * Merge branch 'master' of ↵David S. Miller2013-08-2913-32/+72
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== This pull request fixes some issues that arise when 6in4 or 4in6 tunnels are used in combination with IPsec, all from Hannes Frederic Sowa and a null pointer dereference when queueing packets to the policy hold queue. 1) We might access the local error handler of the wrong address family if 6in4 or 4in6 tunnel is protected by ipsec. Fix this by addind a pointer to the correct local_error to xfrm_state_afinet. 2) Add a helper function to always refer to the correct interpretation of skb->sk. 3) Call skb_reset_inner_headers to record the position of the inner headers when adding a new one in various ipv6 tunnels. This is needed to identify the addresses where to send back errors in the xfrm layer. 4) Dereference inner ipv6 header if encapsulated to always call the right error handler. 5) Choose protocol family by skb protocol to not call the wrong xfrm{4,6}_local_error handler in case an ipv6 sockets is used in ipv4 mode. 6) Partly revert "xfrm: introduce helper for safe determination of mtu" because this introduced pmtu discovery problems. 7) Set skb->protocol on tcp, raw and ip6_append_data genereated skbs. We need this to get the correct mtu informations in xfrm. 8) Fix null pointer dereference in xdst_queue_output. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * xfrm: Fix potential null pointer dereference in xdst_queue_outputSteffen Klassert2013-08-281-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The net_device might be not set on the skb when we try refcounting. This leads to a null pointer dereference in xdst_queue_output(). It turned out that the refcount to the net_device is not needed after all. The dst_entry has a refcount to the net_device before we queue the skb, so it can't go away. Therefore we can remove the refcount on queueing to fix the null pointer dereference. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | | * ipv6: set skb->protocol on tcp, raw and ip6_append_data genereated skbsHannes Frederic Sowa2013-08-262-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we don't initialize skb->protocol when transmitting data via tcp, raw(with and without inclhdr) or udp+ufo or appending data directly to the socket transmit queue (via ip6_append_data). This needs to be done so that we can get the correct mtu in the xfrm layer. Setting of skb->protocol happens only in functions where we also have a transmitting socket and a new skb, so we don't overwrite old values. Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | | * xfrm: revert ipv4 mtu determination to dst_mtuHannes Frederic Sowa2013-08-262-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 0ea9d5e3e0e03a63b11392f5613378977dae7eca ("xfrm: introduce helper for safe determination of mtu") I switched the determination of ipv4 mtus from dst_mtu to ip_skb_dst_mtu. This was an error because in case of IP_PMTUDISC_PROBE we fall back to the interface mtu, which is never correct for ipv4 ipsec. This patch partly reverts 0ea9d5e3e0e03a63b11392f5613378977dae7eca ("xfrm: introduce helper for safe determination of mtu"). Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | | * xfrm: choose protocol family by skb protocolHannes Frederic Sowa2013-08-191-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to choose the protocol family by skb->protocol. Otherwise we call the wrong xfrm{4,6}_local_error handler in case an ipv6 sockets is used in ipv4 mode, in which case we should call down to xfrm4_local_error (ip6 sockets are a superset of ip4 ones). We are called before before ip_output functions, so skb->protocol is not reset. Cc: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | | * ipv6: xfrm: dereference inner ipv6 header if encapsulatedHannes Frederic Sowa2013-08-191-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In xfrm6_local_error use inner_header if the packet was encapsulated. Cc: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | | * ipv6: wire up skb->encapsulationHannes Frederic Sowa2013-08-193-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When pushing a new header before current one call skb_reset_inner_headers to record the position of the inner headers in the various ipv6 tunnel protocols. We later need this to correctly identify the addresses needed to send back an error in the xfrm layer. This change is safe, because skb->protocol is always checked before dereferencing data from the inner protocol. Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | | * xfrm: introduce helper for safe determination of mtuHannes Frederic Sowa2013-08-143-12/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | skb->sk socket can be of AF_INET or AF_INET6 address family. Thus we always have to make sure we a referring to the correct interpretation of skb->sk. We only depend on header defines to query the mtu, so we don't introduce a new dependency to ipv6 by this change. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | | * xfrm: make local error reporting more robustHannes Frederic Sowa2013-08-146-11/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In xfrm4 and xfrm6 we need to take care about sockets of the other address family. This could happen because a 6in4 or 4in6 tunnel could get protected by ipsec. Because we don't want to have a run-time dependency on ipv6 when only using ipv4 xfrm we have to embed a pointer to the correct local_error function in xfrm_state_afinet and look it up when returning an error depending on the socket address family. Thanks to vi0ss for the great bug report: <https://bugzilla.kernel.org/show_bug.cgi?id=58691> v2: a) fix two more unsafe interpretations of skb->sk as ipv6 socket (xfrm6_local_dontfrag and __xfrm6_output) v3: a) add an EXPORT_SYMBOL_GPL(xfrm_local_error) to fix a link error when building ipv6 as a module (thanks to Steffen Klassert) Reported-by: <vi0oss@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| | * | ipv6: Don't depend on per socket memory for neighbour discovery messagesThomas Graf2013-08-291-7/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allocating skbs when sending out neighbour discovery messages currently uses sock_alloc_send_skb() based on a per net namespace socket and thus share a socket wmem buffer space. If a netdevice is temporarily unable to transmit due to carrier loss or for other reasons, the queued up ndisc messages will cosnume all of the wmem space and will thus prevent from any more skbs to be allocated even for netdevices that are able to transmit packets. The number of neighbour discovery messages sent is very limited, simply use alloc_skb() and don't depend on any socket wmem space any longer. This patch has orginally been posted by Eric Dumazet in a modified form. Signed-off-by: Thomas Graf <tgraf@suug.ch> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | ipv4: sendto/hdrincl: don't use destination address found in headerChris Clark2013-08-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ipv4: raw_sendmsg: don't use header's destination address A sendto() regression was bisected and found to start with commit f8126f1d5136be1 (ipv4: Adjust semantics of rt->rt_gateway.) The problem is that it tries to ARP-lookup the constructed packet's destination address rather than the explicitly provided address. Fix this using FLOWI_FLAG_KNOWN_NH so that given nexthop is used. cf. commit 2ad5b9e4bd314fc685086b99e90e5de3bc59e26b Reported-by: Chris Clark <chris.clark@alcatel-lucent.com> Bisected-by: Chris Clark <chris.clark@alcatel-lucent.com> Tested-by: Chris Clark <chris.clark@alcatel-lucent.com> Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Chris Clark <chris.clark@alcatel-lucent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | tcp: don't apply tsoffset if rcv_tsecr is zeroAndrew Vagin2013-08-291-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The zero value means that tsecr is not valid, so it's a special case. tsoffset is used to customize tcp_time_stamp for one socket. tsoffset is usually zero, it's used when a socket was moved from one host to another host. Currently this issue affects logic of tcp_rcv_rtt_measure_ts. Due to incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to TCP_RTO_MAX. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | tcp: initialize rcv_tstamp for restored socketsAndrew Vagin2013-08-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | u32 rcv_tstamp; /* timestamp of last received ACK */ Its value used in tcp_retransmit_timer, which closes socket if the last ack was received more then TCP_RTO_MAX ago. Currently rcv_tstamp is initialized to zero and if tcp_retransmit_timer is called before receiving a first ack, the connection is closed. This patch initializes rcv_tstamp to a timestamp, when a socket was restored. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | genl: Hold reference on correct module while netlink-dump.Pravin B Shelar2013-08-281-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | netlink dump operations take module as parameter to hold reference for entire netlink dump duration. Currently it holds ref only on genl module which is not correct when we use ops registered to genl from another module. Following patch adds module pointer to genl_ops so that netlink can hold ref count on it. CC: Jesse Gross <jesse@nicira.com> CC: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | genl: Fix genl dumpit() locking.Pravin B Shelar2013-08-281-5/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of genl-family with parallel ops off, dumpif() callback is expected to run under genl_lock, But commit def3117493eafd9df (genl: Allow concurrent genl callbacks.) changed this behaviour where only first dumpit() op was called under genl-lock. For subsequent dump, only nlk->cb_lock was taken. Following patch fixes it by defining locked dumpit() and done() callback which takes care of genl-locking. CC: Jesse Gross <jesse@nicira.com> CC: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | Merge branch 'for-davem' of ↵David S. Miller2013-08-272-4/+33
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless John W. Linville says: ==================== This is one more set of fixes intended for the 3.11 stream... For the mac80211 bits, Johannes says: "I have three more patches for the 3.11 stream: Felix's fix for the fairly visible brcmsmac crash, a fix from Simon for an IBSS join bug I found and a fix for a channel context bug in IBSS I'd introduced." Along with those... Sujith Manoharan makes a minor change to not use a PLL hang workaroun for AR9550. This one-liner fixes a couple of bugs reported in the Red Hat bugzilla. Helmut Schaa addresses an ath9k_htc bug that mangles frame headers during Tx. This fix is small, tested by the bug reported and isolated to ath9k_htc. Stanislaw Gruszka reverts a recent iwl4965 change that broke rfkill notification to user space. Please let me know if there are problems! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * \ Merge branch 'master' of ↵John W. Linville2013-08-232-4/+33
| | | |\ \ | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem
| | * | | | net: Check the correct namespace when spoofing pid over SCM_RIGHTSAndy Lutomirski2013-08-271-1/+1
| | |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a security bug. The follow-up will fix nsproxy to discourage this type of issue from happening again. Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | SUNRPC: Fix memory corruption issue on 32-bit highmem systemsTrond Myklebust2013-08-281-3/+6
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some architectures, such as ARM-32 do not return the same base address when you call kmap_atomic() twice on the same page. This causes problems for the memmove() call in the XDR helper routine "_shift_data_right_pages()", since it defeats the detection of overlapping memory ranges, and has been seen to corrupt memory. The fix is to distinguish between the case where we're doing an inter-page copy or not. In the former case of we know that the memory ranges cannot possibly overlap, so we can additionally micro-optimise by replacing memmove() with memcpy(). Reported-by: Mark Young <MYoung@nvidia.com> Reported-by: Matt Craighead <mcraighead@nvidia.com> Cc: Bruce Fields <bfields@fieldses.org> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Matt Craighead <mcraighead@nvidia.com>
* | | | netfilter: Fix build errors with xt_socket.cDavid S. Miller2013-09-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As reported by Randy Dunlap: ==================== when CONFIG_IPV6=m and CONFIG_NETFILTER_XT_MATCH_SOCKET=y: net/built-in.o: In function `socket_mt6_v1_v2': xt_socket.c:(.text+0x51b55): undefined reference to `udp6_lib_lookup' net/built-in.o: In function `socket_mt_init': xt_socket.c:(.init.text+0x1ef8): undefined reference to `nf_defrag_ipv6_enable' ==================== Like several other modules under net/netfilter/ we have to have a dependency "IPV6 disabled or set compatibly with this module" clause. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | ipv6:introduce function to find route for redirectDuan Jiong2013-09-056-11/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RFC 4861 says that the IP source address of the Redirect is the same as the current first-hop router for the specified ICMP Destination Address, so the gateway should be taken into consideration when we find the route for redirect. There was once a check in commit a6279458c534d01ccc39498aba61c93083ee0372 ("NDISC: Search over all possible rules on receipt of redirect.") and the check went away in commit b94f1c0904da9b8bf031667afc48080ba7c3e8c9 ("ipv6: Use icmpv6_notify() to propagate redirect, instead of rt6_redirect()"). The bug is only "exploitable" on layer-2 because the source address of the redirect is checked to be a valid link-local address but it makes spoofing a lot easier in the same L2 domain nonetheless. Thanks very much for Hannes's help. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | bridge: apply multicast snooping to IPv6 link-local, tooLinus Lüssing2013-09-053-14/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The multicast snooping code should have matured enough to be safely applicable to IPv6 link-local multicast addresses (excluding the link-local all nodes address, ff02::1), too. Signed-off-by: Linus Lüssing <linus.luessing@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | bridge: prevent flooding IPv6 packets that do not have a listenerLinus Lüssing2013-09-051-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently if there is no listener for a certain group then IPv6 packets for that group are flooded on all ports, even though there might be no host and router interested in it on a port. With this commit they are only forwarded to ports with a multicast router. Just like commit bd4265fe36 ("bridge: Only flood unregistered groups to routers") did for IPv4, let's do the same for IPv6 with the same reasoning. Signed-off-by: Linus Lüssing <linus.luessing@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: ipv6: mld: introduce mld_{gq, ifc, dad}_stop_timer functionsDaniel Borkmann2013-09-041-16/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already have mld_{gq,ifc,dad}_start_timer() functions, so introduce mld_{gq,ifc,dad}_stop_timer() functions to reduce code size and make it more readable. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: ipv6: mld: refactor query processing into v1/v2 functionsDaniel Borkmann2013-09-041-33/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make igmp6_event_query() a bit easier to read by refactoring code parts into mld_process_v1() and mld_process_v2(). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: ipv6: mld: similarly to MLDv2 have min max_delay of 1Daniel Borkmann2013-09-041-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similarly as we do in MLDv2 queries, set a forged MLDv1 query with 0 ms mld_maxdelay to minimum timer shot time of 1 jiffies. This is eventually done in igmp6_group_queried() anyway, so we can simplify a check there. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: ipv6: mld: implement RFC3810 MLDv2 mode onlyDaniel Borkmann2013-09-041-4/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RFC3810, 10. Security Considerations says under subsection 10.1. Query Message: A forged Version 1 Query message will put MLDv2 listeners on that link in MLDv1 Host Compatibility Mode. This scenario can be avoided by providing MLDv2 hosts with a configuration option to ignore Version 1 messages completely. Hence, implement a MLDv2-only mode that will ignore MLDv1 traffic: echo 2 > /proc/sys/net/ipv6/conf/ethX/force_mld_version or echo 2 > /proc/sys/net/ipv6/conf/all/force_mld_version Note that <all> device has a higher precedence as it was previously also the case in the macro MLD_V1_SEEN() that would "short-circuit" if condition on <all> case. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: ipv6: mld: get rid of MLDV2_MRC and simplify calculationDaniel Borkmann2013-09-042-17/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Get rid of MLDV2_MRC and use our new macros for mantisse and exponent to calculate Maximum Response Delay out of the Maximum Response Code. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: ipv6: mld: clean up MLD_V1_SEEN macroDaniel Borkmann2013-09-041-13/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the macro with a function to make it more readable. GCC will eventually decide whether to inline this or not (also, that's not fast-path anyway). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: ipv6: mld: fix v1/v2 switchback timeout to rfc3810, 9.12.Daniel Borkmann2013-09-041-6/+110
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | i) RFC3810, 9.2. Query Interval [QI] says: The Query Interval variable denotes the interval between General Queries sent by the Querier. Default value: 125 seconds. [...] ii) RFC3810, 9.3. Query Response Interval [QRI] says: The Maximum Response Delay used to calculate the Maximum Response Code inserted into the periodic General Queries. Default value: 10000 (10 seconds) [...] The number of seconds represented by the [Query Response Interval] must be less than the [Query Interval]. iii) RFC3810, 9.12. Older Version Querier Present Timeout [OVQPT] says: The Older Version Querier Present Timeout is the time-out for transitioning a host back to MLDv2 Host Compatibility Mode. When an MLDv1 query is received, MLDv2 hosts set their Older Version Querier Present Timer to [Older Version Querier Present Timeout]. This value MUST be ([Robustness Variable] times (the [Query Interval] in the last Query received)) plus ([Query Response Interval]). Hence, on *default* the timeout results in: [RV] = 2, [QI] = 125sec, [QRI] = 10sec [OVQPT] = [RV] * [QI] + [QRI] = 260sec Having that said, we currently calculate [OVQPT] (here given as 'switchback' variable) as ... switchback = (idev->mc_qrv + 1) * max_delay RFC3810, 9.12. says "the [Query Interval] in the last Query received". In section "9.14. Configuring timers", it is said: This section is meant to provide advice to network administrators on how to tune these settings to their network. Ambitious router implementations might tune these settings dynamically based upon changing characteristics of the network. [...] iv) RFC38010, 9.14.2. Query Interval: The overall level of periodic MLD traffic is inversely proportional to the Query Interval. A longer Query Interval results in a lower overall level of MLD traffic. The value of the Query Interval MUST be equal to or greater than the Maximum Response Delay used to calculate the Maximum Response Code inserted in General Query messages. I assume that was why switchback is calculated as is (3 * max_delay), although this setting seems to be meant for routers only to configure their [QI] interval for non-default intervals. So usage here like this is clearly wrong. Concluding, the current behaviour in IPv6's multicast code is not conform to the RFC as switch back is calculated wrongly. That is, it has a too small value, so MLDv2 hosts switch back again to MLDv2 way too early, i.e. ~30secs instead of ~260secs on default. Hence, introduce necessary helper functions and fix this up properly as it should be. Introduced in 06da92283 ("[IPV6]: Add MLDv2 support."). Credits to Hannes Frederic Sowa who also had a hand in this as well. Also thanks to Hangbin Liu who did initial testing. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: David Stevens <dlstevens@us.ibm.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | tcp: better comments for RTO initiallizationYuchung Cheng2013-09-041-6/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 1b7fdd2ab585("tcp: do not use cached RTT for RTT estimation") removes important comments on how RTO is initialized and updated. Hopefully this patch puts those information back. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: sctp: Fix data chunk fragmentation for MTU values which are not ↵Alexander Sverdlin2013-09-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | multiple of 4 net: sctp: Fix data chunk fragmentation for MTU values which are not multiple of 4 Initially the problem was observed with ipsec, but later it became clear that SCTP data chunk fragmentation algorithm has problems with MTU values which are not multiple of 4. Test program was used which just transmits 2000 bytes long packets to other host. tcpdump was used to observe re-fragmentation in IP layer after SCTP already fragmented data chunks. With MTU 1500: 12:54:34.082904 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 1500) 10.151.38.153.39303 > 10.151.24.91.54321: sctp (1) [DATA] (B) [TSN: 2366088589] [SID: 0] [SSEQ 1] [PPID 0x0] 12:54:34.082933 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 596) 10.151.38.153.39303 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 2366088590] [SID: 0] [SSEQ 1] [PPID 0x0] 12:54:34.090576 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48) 10.151.24.91.54321 > 10.151.38.153.39303: sctp (1) [SACK] [cum ack 2366088590] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0] With MTU 1499: 13:02:49.955220 IP (tos 0x2,ECT(0), ttl 64, id 48215, offset 0, flags [+], proto SCTP (132), length 1492) 10.151.38.153.39084 > 10.151.24.91.54321: sctp[|sctp] 13:02:49.955249 IP (tos 0x2,ECT(0), ttl 64, id 48215, offset 1472, flags [none], proto SCTP (132), length 28) 10.151.38.153 > 10.151.24.91: ip-proto-132 13:02:49.955262 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 600) 10.151.38.153.39084 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 404355346] [SID: 0] [SSEQ 1] [PPID 0x0] 13:02:49.956770 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48) 10.151.24.91.54321 > 10.151.38.153.39084: sctp (1) [SACK] [cum ack 404355346] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0] Here problem in data portion limit calculation leads to re-fragmentation in IP, which is sub-optimal. The problem is max_data initial value, which doesn't take into account the fact, that data chunk must be padded to 4-bytes boundary. It's enough to correct max_data, because all later adjustments are correctly aligned to 4-bytes boundary. After the fix is applied, everything is fragmented correctly for uneven MTUs: 15:16:27.083881 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 1496) 10.151.38.153.53417 > 10.151.24.91.54321: sctp (1) [DATA] (B) [TSN: 3077098183] [SID: 0] [SSEQ 1] [PPID 0x0] 15:16:27.083907 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 600) 10.151.38.153.53417 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 3077098184] [SID: 0] [SSEQ 1] [PPID 0x0] 15:16:27.085640 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48) 10.151.24.91.54321 > 10.151.38.153.53417: sctp (1) [SACK] [cum ack 3077098184] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0] The bug was there for years already, but - is a performance issue, the packets are still transmitted - doesn't show up with default MTU 1500, but possibly with ipsec (MTU 1438) Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nsn.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge branch 'master' of ↵David S. Miller2013-09-044-9/+17
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== The following batch contains: * Three fixes for the new synproxy target available in your net-next tree, from Jesper D. Brouer and Patrick McHardy. * One fix for TCPMSS to correctly handling the fragmentation case, from Phil Oester. I'll pass this one to -stable. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | netfilter: xt_TCPMSS: correct return value in tcpmss_mangle_packetPhil Oester2013-09-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit b396966c4 (netfilter: xt_TCPMSS: Fix missing fragmentation handling), I attempted to add safe fragment handling to xt_TCPMSS. However, Andy Padavan of Project N56U correctly points out that returning XT_CONTINUE in this function does not work. The callers (tcpmss_tg[46]) expect to receive a value of 0 in order to return XT_CONTINUE. Signed-off-by: Phil Oester <kernel@linuxace.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
OpenPOWER on IntegriCloud