summaryrefslogtreecommitdiffstats
path: root/net/ipv4
Commit message (Collapse)AuthorAgeFilesLines
* net: lwtunnel: Add extack to encap attr validationDavid Ahern2017-05-301-2/+4
| | | | | | | | | Pass extack down to lwtunnel_valid_encap_type and lwtunnel_valid_encap_type_attr. Add messages for unknown or unsupported encap types. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv4: Add extack message for invalid prefix or lengthDavid Ahern2017-05-302-9/+15
| | | | | | | | | | Add extack error message for invalid prefix length and invalid prefix. Example of the latter is a route spec containing 172.16.100.1/24, where the /24 mask means the lower 8-bits should be 0. Amazing how easy that one is to overlook when an EINVAL is returned. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv4: refactor key and length checksDavid Ahern2017-05-301-10/+15
| | | | | | | | | fib_table_insert and fib_table_delete have the same checks on the prefix and length. Refactor into a helper. Avoids duplicate extack messages in the next patch. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-05-265-12/+29
|\ | | | | | | | | | | | | Overlapping changes in drivers/net/phy/marvell.c, bug fix in 'net' restricting a HW workaround alongside cleanups in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: add reference counting to metricsEric Dumazet2017-05-262-8/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Andrey Konovalov reported crashes in ipv4_mtu() I could reproduce the issue with KASAN kernels, between 10.246.7.151 and 10.246.7.152 : 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 & 2) At the same time run following loop : while : do ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 done Cong Wang attempted to add back rt->fi in commit 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting") but this proved to add some issues that were complex to solve. Instead, I suggested to add a refcount to the metrics themselves, being a standalone object (in particular, no reference to other objects) I tried to make this patch as small as possible to ease its backport, instead of being super clean. Note that we believe that only ipv4 dst need to take care of the metric refcount. But if this is wrong, this patch adds the basic infrastructure to extend this to other families. Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang for his efforts on this problem. Fixes: 2860583fe840 ("ipv4: Kill rt->fi") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * arp: fixed -Wuninitialized compiler warningIhar Hrachyshka2017-05-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 7d472a59c0e5ec117220a05de6b370447fb6cb66 ("arp: always override existing neigh entries with gratuitous ARP") introduced a compiler warning: net/ipv4/arp.c:880:35: warning: 'addr_type' may be used uninitialized in this function [-Wmaybe-uninitialized] While the code logic seems to be correct and doesn't allow the variable to be used uninitialized, and the warning is not consistently reproducible, it's still worth fixing it for other people not to waste time looking at the warning in case it pops up in the build environment. Yes, compiler is probably at fault, but we will need to accommodate. Fixes: 7d472a59c0e5 ("arp: always override existing neigh entries with gratuitous ARP") Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: avoid fastopen API to be used on AF_UNSPECWei Wang2017-05-251-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fastopen API should be used to perform fastopen operations on the TCP socket. It does not make sense to use fastopen API to perform disconnect by calling it with AF_UNSPEC. The fastopen data path is also prone to race conditions and bugs when using with AF_UNSPEC. One issue reported and analyzed by Vegard Nossum is as follows: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Thread A: Thread B: ------------------------------------------------------------------------ sendto() - tcp_sendmsg() - sk_stream_memory_free() = 0 - goto wait_for_sndbuf - sk_stream_wait_memory() - sk_wait_event() // sleep | sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC) | - tcp_sendmsg() | - tcp_sendmsg_fastopen() | - __inet_stream_connect() | - tcp_disconnect() //because of AF_UNSPEC | - tcp_transmit_skb()// send RST | - return 0; // no reconnect! | - sk_stream_wait_connect() | - sock_error() | - xchg(&sk->sk_err, 0) | - return -ECONNRESET - ... // wake up, see sk->sk_err == 0 - skb_entail() on TCP_CLOSE socket If the connection is reopened then we will send a brand new SYN packet after thread A has already queued a buffer. At this point I think the socket internal state (sequence numbers etc.) becomes messed up. When the new connection is closed, the FIN-ACK is rejected because the sequence number is outside the window. The other side tries to retransmit, but __tcp_retransmit_skb() calls tcp_trim_head() on an empty skb which corrupts the skb data length and hits a BUG() in copy_and_csum_bits(). +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Hence, this patch adds a check for AF_UNSPEC in the fastopen data path and return EOPNOTSUPP to user if such case happens. Fixes: cf60af03ca4e7 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)") Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'master' of ↵David S. Miller2017-05-231-1/+4
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2017-05-23 1) Fix wrong header offset for esp4 udpencap packets. 2) Fix a stack access out of bounds when creating a bundle with sub policies. From Sabrina Dubroca. 3) Fix slab-out-of-bounds in pfkey due to an incorrect sadb_x_sec_len calculation. 4) We checked the wrong feature flags when taking down an interface with IPsec offload enabled. Fix from Ilan Tayari. 5) Copy the anti replay sequence numbers when doing a state migration, otherwise we get out of sync with the sequence numbers. Fix from Antony Antony. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * esp4: Fix udpencap for local TCP packets.Steffen Klassert2017-05-041-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Locally generated TCP packets are usually cloned, so we do skb_cow_data() on this packets. After that we need to reload the pointer to the esp header. On udpencap this header has an offset to skb_transport_header, so take this offset into account. Fixes: 67d349ed603 ("net/esp4: Fix invalid esph pointer crash") Fixes: fca11ebde3f0 ("esp4: Reorganize esp_output") Reported-by: Don Bowman <db@donbowman.ca> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | | net: ipv4: RTM_GETROUTE: return matched fib result when requestedRoopa Prabhu2017-05-261-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds support to return matched fib result when RTM_F_FIB_MATCH flag is specified in RTM_GETROUTE request. This is useful for user-space applications/controllers wanting to query a matching route. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: ipv4: Save trie prefix to fib lookup resultDavid Ahern2017-05-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Prefix is needed for returning matching route spec on get route request. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookupDavid Ahern2017-05-261-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert inet_rtm_getroute to use ip_route_input_rcu and ip_route_output_key_hash_rcu passing the fib_result arg to both. The rcu lock is held through the creation of the response, so the rtable/dst does not need to be attached to the skb and is passed to rt_fill_info directly. In converting from ip_route_output_key to ip_route_output_key_hash_rcu the xfrm_lookup_route in ip_route_output_flow is dropped since flowi4_proto is not set for a route get request. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: ipv4: Remove event arg to rt_fill_infoDavid Ahern2017-05-261-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | rt_fill_info has 1 caller with the event set to RTM_NEWROUTE. Given that remove the arg and use RTM_NEWROUTE directly in rt_fill_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: ipv4: refactor ip_route_input_norefDavid Ahern2017-05-261-29/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A later patch wants access to the fib result on an input route lookup with the rcu lock held. Refactor ip_route_input_noref pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper that takes the fib_result as an input arg. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: ipv4: refactor __ip_route_output_key_hashDavid Ahern2017-05-262-23/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A later patch wants access to the fib result on an output route lookup with the rcu lock held. Refactor __ip_route_output_key_hash, pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper with the fib_result as an input arg. To keep the name length under control remove the leading underscores from the name and add _rcu to the name of the new helper indicating it is called with the rcu read lock held. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: better validation of received ack sequencesEric Dumazet2017-05-251-13/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Paul Fiterau Brostean reported : <quote> Linux TCP stack we analyze exhibits behavior that seems odd to me. The scenario is as follows (all packets have empty payloads, no window scaling, rcv/snd window size should not be a factor): TEST HARNESS (CLIENT) LINUX SERVER 1. - LISTEN (server listen, then accepts) 2. - --> <SEQ=100><CTL=SYN> --> SYN-RECEIVED 3. - <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED 4. - --> <SEQ=101><ACK=301><CTL=ACK> --> ESTABLISHED 5. - <-- <SEQ=301><ACK=101><CTL=FIN,ACK> <-- FIN WAIT-1 (server opts to close the data connection calling "close" on the connection socket) 6. - --> <SEQ=101><ACK=99999><CTL=FIN,ACK> --> CLOSING (client sends FIN,ACK with not yet sent acknowledgement number) 7. - <-- <SEQ=302><ACK=102><CTL=ACK> <-- CLOSING (ACK is 102 instead of 101, why?) ... (silence from CLIENT) 8. - <-- <SEQ=301><ACK=102><CTL=FIN,ACK> <-- CLOSING (retransmission, again ACK is 102) Now, note that packet 6 while having the expected sequence number, acknowledges something that wasn't sent by the server. So I would expect the packet to maybe prompt an ACK response from the server, and then be ignored. Yet it is not ignored and actually leads to an increase of the acknowledgement number in the server's retransmission of the FIN,ACK packet. The explanation I found is that the FIN in packet 6 was processed, despite the acknowledgement number being unacceptable. Further experiments indeed show that the server processes this FIN, transitioning to CLOSING, then on receiving an ACK for the FIN it had send in packet 5, the server (or better said connection) transitions from CLOSING to TIME_WAIT (as signaled by netstat). </quote> Indeed, tcp_rcv_state_process() calls tcp_ack() but does not exploit the @acceptable status but for TCP_SYN_RECV state. What we want here is to send a challenge ACK, if not in TCP_SYN_RECV state. TCP_FIN_WAIT1 state is not the only state we should fix. Add a FLAG_NO_CHALLENGE_ACK so that tcp_rcv_state_process() can choose to send a challenge ACK and discard the packet instead of wrongly change socket state. With help from Neal Cardwell. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Paul Fiterau Brostean <p.fiterau-brostean@science.ru.nl> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: fix TCP_SYNCNT flakesEric Dumazet2017-05-241-15/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the mentioned commit, some of our packetdrill tests became flaky. TCP_SYNCNT socket option can limit the number of SYN retransmits. retransmits_timed_out() has to compare times computations based on local_clock() while timers are based on jiffies. With NTP adjustments and roundings we can observe 999 ms delay for 1000 ms timers. We end up sending one extra SYN packet. Gimmick added in commit 6fa12c850314 ("Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value") makes no real sense for TCP_SYN_SENT sockets where no RTO backoff can happen at all. Lets use a simpler logic for TCP_SYN_SENT sockets and remove @syn_set parameter from retransmits_timed_out() Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-05-222-17/+43
|\ \ \ | |/ /
| * | arp: always override existing neigh entries with gratuitous ARPIhar Hrachyshka2017-05-211-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, when arp_accept is 1, we always override existing neigh entries with incoming gratuitous ARP replies. Otherwise, we override them only if new replies satisfy _locktime_ conditional (packets arrive not earlier than _locktime_ seconds since the last update to the neigh entry). The idea behind locktime is to pick the very first (=> close) reply received in a unicast burst when ARP proxies are used. This helps to avoid ARP thrashing where Linux would switch back and forth from one proxy to another. This logic has nothing to do with gratuitous ARP replies that are generally not aligned in time when multiple IP address carriers send them into network. This patch enforces overriding of existing neigh entries by all incoming gratuitous ARP packets, irrespective of their time of arrival. This will make the kernel honour all incoming gratuitous ARP packets. Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | arp: postpone addr_type calculation to as late as possibleIhar Hrachyshka2017-05-211-7/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The addr_type retrieval can be costly, so it's worth trying to avoid its calculation as much as possible. This patch makes it calculated only for gratuitous ARP packets. This is especially important since later we may want to move is_garp calculation outside of arp_accept block, at which point the costly operation will be executed for all setups. The patch is the result of a discussion in net-dev: http://marc.info/?l=linux-netdev&m=149506354216994 Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | arp: decompose is_garp logic into a separate functionIhar Hrachyshka2017-05-211-12/+23
| | | | | | | | | | | | | | | | | | | | | | | | The code is quite involving already to earn a separate function for itself. If anything, it helps arp_process readability. Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | arp: fixed error in a commentIhar Hrachyshka2017-05-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the is_garp code deals just with gratuitous ARP packets, not every unsolicited packet. This patch is a result of a discussion in netdev: http://marc.info/?l=linux-netdev&m=149506354216994 Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0Wei Wang2017-05-211-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When tcp_disconnect() is called, inet_csk_delack_init() sets icsk->icsk_ack.rcv_mss to 0. This could potentially cause tcp_recvmsg() => tcp_cleanup_rbuf() => __tcp_select_window() call path to have division by 0 issue. So this patch initializes rcv_mss to TCP_MIN_MSS instead of 0. Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: ipv4: tcp: fixed comment coding style issueRohit Chavan2017-05-221-2/+3
| | | | | | | | | | | | | | | | | | | | | Fixed a coding style issue Signed-off-by: Rohit Chavan <roheetchavan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: ipv4: Add extack messages for route add failuresDavid Ahern2017-05-222-22/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add messages for non-obvious errors (e.g, no need to add text for malloc failures or ENODEV failures). This mostly covers the annoying EINVAL errors Some message strings violate the 80-columns but searchable strings need to trump that rule. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: ipv4: Plumb extack through route add functionsDavid Ahern2017-05-224-19/+26
| | | | | | | | | | | | | | | | | | | | | Plumb extack argument down to route add functions. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: fix tcp_probe_timer() for TCP_USER_TIMEOUTEric Dumazet2017-05-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP_USER_TIMEOUT is still converted to jiffies value in icsk_user_timeout So we need to make a conversion for the cases HZ != 1000 Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | fou: make local function staticstephen hemminger2017-05-211-41/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The build header functions are not used by any other code. net/ipv6/fou6.c:36:5: warning: no previous prototype for ‘fou6_build_header’ [-Wmissing-prototypes] net/ipv6/fou6.c:54:5: warning: no previous prototype for ‘gue6_build_header’ [-Wmissing-prototypes] Need to do some code rearranging to satisfy different Kconfig possiblities. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcpnv: do not export local functionstephen hemminger2017-05-211-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | The TCP New Vegas congestion control was exporting an internal function tcpnv_get_info which is not used by any other in tree kernel code. Make it static. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | inet: fix warning about missing prototypestephen hemminger2017-05-211-0/+1
| | | | | | | | | | | | | | | | | | | | | The prototype for inet_rcv_saddr_equal was not being included. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | sk_buff: remove support for csum_bad in sk_buffDavide Caratti2017-05-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This bit was introduced with commit 5a21232983aa ("net: Support for csum_bad in skbuff") to reduce the stack workload when processing RX packets carrying a wrong Internet Checksum. Up to now, only one driver and GRO core are setting it. Suggested-by: Tom Herbert <tom@herbertland.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: warn on negative reordering valuesSoheil Hassas Yeganeh2017-05-191-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit bafbb9c73241 ("tcp: eliminate negative reordering in tcp_clean_rtx_queue") fixes an issue for negative reordering metrics. To be resilient to such errors, warn and return when a negative metric is passed to tcp_update_reordering(). Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-05-187-24/+58
|\ \ \ | |/ /
| * | udp: make *udp*_queue_rcv_skb() functions staticPaolo Abeni2017-05-182-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the udp memory accounting refactor, we don't need any more to export the *udp*_queue_rcv_skb(). Make them static and fix a couple of sparse warnings: net/ipv4/udp.c:1615:5: warning: symbol 'udp_queue_rcv_skb' was not declared. Should it be static? net/ipv6/udp.c:572:5: warning: symbol 'udpv6_queue_rcv_skb' was not declared. Should it be static? Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema") Fixes: c915fe13cbaa ("udplite: fix NULL pointer dereference") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | arp: honour gratuitous ARP _replies_Ihar Hrachyshka2017-05-171-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When arp_accept is 1, gratuitous ARPs are supposed to override matching entries irrespective of whether they arrive during locktime. This was implemented in commit 56022a8fdd87 ("ipv4: arp: update neighbour address when a gratuitous arp is received and arp_accept is set") There is a glitch in the patch though. RFC 2002, section 4.6, "ARP, Proxy ARP, and Gratuitous ARP", defines gratuitous ARPs so that they can be either of Request or Reply type. Those Reply gratuitous ARPs can be triggered with standard tooling, for example, arping -A option does just that. This patch fixes the glitch, making both Request and Reply flavours of gratuitous ARPs to behave identically. As per RFC, if gratuitous ARPs are of Reply type, their Target Hardware Address field should also be set to the link-layer address to which this cache entry should be updated. The field is present in ARP over Ethernet but not in IEEE 1394. In this patch, I don't consider any broadcasted ARP replies as gratuitous if the field is not present, to conform the standard. It's not clear whether there is such a thing for IEEE 1394 as a gratuitous ARP reply; until it's cleared up, we will ignore such broadcasts. Note that they will still update existing ARP cache entries, assuming they arrive out of locktime time interval. Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: Improve handling of failures on link and route dumpsDavid Ahern2017-05-162-16/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In general, rtnetlink dumps do not anticipate failure to dump a single object (e.g., link or route) on a single pass. As both route and link objects have grown via more attributes, that is no longer a given. netlink dumps can handle a failure if the dump function returns an error; specifically, netlink_dump adds the return code to the response if it is <= 0 so userspace is notified of the failure. The missing piece is the rtnetlink dump functions returning the error. Fix route and link dump functions to return the errors if no object is added to an skb (detected by skb->len != 0). IPv6 route dumps (rt6_dump_route) already return the error; this patch updates IPv4 and link dumps. Other dump functions may need to be ajusted as well. Reported-by: Jan Moskyto Matejka <mq@ucw.cz> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipmr: vrf: Find VIFs using the actual deviceThomas Winter2017-05-161-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The skb->dev that is passed into ip_mr_input is the loX device for VRFs. When we lookup a vif for this dev, none is found as we do not create vifs for loopbacks. Instead lookup a vif for the actual device that the packet was received on, eg the vlan. Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz> cc: David Ahern <dsa@cumulusnetworks.com> cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> cc: roopa <roopa@cumulusnetworks.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: eliminate negative reordering in tcp_clean_rtx_queueSoheil Hassas Yeganeh2017-05-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_ack() can call tcp_fragment() which may dededuct the value tp->fackets_out when MSS changes. When prior_fackets is larger than tp->fackets_out, tcp_clean_rtx_queue() can invoke tcp_update_reordering() with negative values. This results in absurd tp->reodering values higher than sysctl_tcp_max_reordering. Note that tcp_update_reordering indeeds sets tp->reordering to min(sysctl_tcp_max_reordering, metric), but because the comparison is signed, a negative metric always wins. Fixes: c7caf8d3ed7a ("[TCP]: Fix reord detection due to snd_una covered holes") Reported-by: Rebecca Isaacs <risaacs@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: fix tcp_rearm_rto()Eric Dumazet2017-05-181-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | skbs in (re)transmit queue no longer have a copy of jiffies at the time of the transmit : skb->skb_mstamp is now in usec unit, with no correlation to tcp_jiffies32. We have to convert rto from jiffies to usec, compute a time difference in usec, then convert the delta to HZ units. Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: fix __skb_try_recv_from_queue to return the old behaviorAndrey Vagin2017-05-181-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function has to return NULL on a error case, because there is a separate error variable. The offset has to be changed only if skb is returned v2: fix udp code to not use an extra variable Cc: Paolo Abeni <pabeni@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: David S. Miller <davem@davemloft.net> Fixes: 65101aeca522 ("net/sock: factor out dequeue/peek with offset cod") Signed-off-by: Andrei Vagin <avagin@openvz.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | udp: make function udp_skb_dtor_locked staticColin Ian King2017-05-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Function udp_skb_dtor_locked does not need to be in global scope so make it static to fix sparse warning: net/ipv4/udp.c: warning: symbol 'udp_skb_dtor_locked' was not declared. Should it be static? Fixes: 6dfb4367cd911d ("udp: keep the sk_receive_queue held when splicing") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: switch TCP TS option (RFC 7323) to 1ms clockEric Dumazet2017-05-1711-114/+112
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP Timestamps option is defined in RFC 7323 Traditionally on linux, it has been tied to the internal 'jiffies' variable, because it had been a cheap and good enough generator. For TCP flows on the Internet, 1 ms resolution would be much better than 4ms or 10ms (HZ=250 or HZ=100 respectively) For TCP flows in the DC, Google has used usec resolution for more than two years with great success [1] Receive size autotuning (DRS) is indeed more precise and converges faster to optimal window size. This patch converts tp->tcp_mstamp to a plain u64 value storing a 1 usec TCP clock. This choice will allow us to upstream the 1 usec TS option as discussed in IETF 97. [1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: replace misc tcp_time_stamp to tcp_jiffies32Eric Dumazet2017-05-175-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | After this patch, all uses of tcp_time_stamp will require a change when we introduce 1 ms and/or 1 us TCP TS option. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp_lp: cache tcp_time_stampEric Dumazet2017-05-171-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_time_stamp will become slightly more expensive soon, cache its value. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp_westwood: use tcp_jiffies32 instead of tcp_time_stampEric Dumazet2017-05-171-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | This CC does not need 1 ms tcp_time_stamp and can use the jiffy based 'timestamp'. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: use tcp_jiffies32 in __tcp_oow_rate_limited()Eric Dumazet2017-05-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | This place wants to use tcp_jiffies32, this is good enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: uses jiffies_32 to feed tp->chrono_startEric Dumazet2017-05-172-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | tcp_time_stamp will no longer be tied to jiffies. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: use tcp_jiffies32 to feed probe_timestampEric Dumazet2017-05-172-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | Use tcp_jiffies32 instead of tcp_time_stamp, since tcp_time_stamp will soon be only used for TCP TS option. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: use tcp_jiffies32 for rcv_tstamp and lrcvtimeEric Dumazet2017-05-174-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | Use tcp_jiffies32 instead of tcp_time_stamp, since tcp_time_stamp will soon be only used for TCP TS option. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stampEric Dumazet2017-05-172-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | Use tcp_jiffies32 instead of tcp_time_stamp, since tcp_time_stamp will soon be only used for TCP TS option. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud