| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
| |
Don't bother testing if we need to switch to alternate stack
unless TEE target is used.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In most cases there is no reentrancy into ip/ip6tables.
For skbs sent by REJECT or SYNPROXY targets, there is one level
of reentrancy, but its not relevant as those targets issue an absolute
verdict, i.e. the jumpstack can be clobbered since its not used
after the target issues absolute verdict (ACCEPT, DROP, STOLEN, etc).
So the only special case where it is relevant is the TEE target, which
returns XT_CONTINUE.
This patch changes ip(6)_do_table to always use the jump stack starting
from 0.
When we detect we're operating on an skb sent via TEE (percpu
nf_skb_duplicated is 1) we switch to an alternate stack to leave
the original one alone.
Since there is no TEE support for arptables, it doesn't need to
test if tee is active.
The jump stack overflow tests are no longer needed as well --
since ->stacksize is the largest call depth we cannot exceed it.
A much better alternative to the external jumpstack would be to just
declare a jumps[32] stack on the local stack frame, but that would mean
we'd have to reject iptables rulesets that used to work before.
Another alternative would be to start rejecting rulesets with a larger
call depth, e.g. 1000 -- in this case it would be feasible to allocate the
entire stack in the percpu area which would avoid one dereference.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The {arp,ip,ip6tables} jump stack is currently sized based
on the number of user chains.
However, its rather unlikely that every user defined chain jumps to the
next, so lets use the existing loop detection logic to also track the
chain depths.
The stacksize is then set to the largest chain depth seen.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Conflicts:
net/bridge/br_mdb.c
Minor conflict in br_mdb.c, in 'net' we added a memset of the
on-stack 'ip' variable whereas in 'net-next' we assign a new
member 'vid'.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY
sockopt", I am not happy with the limitations it causes for socket
analysing code in userspace. Exporting the value only if it is set makes
it hard for userspace to decide whether the option is not set or the
kernel does not support exporting the option at all.
>From an auditor's perspective, the interesting question for listening
AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is
the unexpected case. This patch allows to answer this question reliably.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |\
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree. This batch
mostly comes with patches to address fallout from the previous merge window
cycle, they are:
1) Use entry->state.hook_list from nf_queue() instead of the global nf_hooks
which is not valid when used from NFPROTO_NETDEV, this should cause no
problems though since we have no userspace queueing for that family, but
let's fix this now for the sake of correctness. Patch from Eric W. Biederman.
2) Fix compilation breakage in bridge netfilter if CONFIG_NF_DEFRAG_IPV4 is not
set, from Bernhard Thaler.
3) Use percpu jumpstack in arptables too, now that there's a single copy of the
rule blob we can't store the return address there anymore. Patch from
Florian Westphal.
4) Fix a skb leak in the xmit path of bridge netfilter, problem there since
2.6.37 although it should be not possible to hit invalid traffic there, also
from Florian.
5) Eric Leblond reports that when loading a large ruleset with many missing
modules after a fresh boot, nf_tables can take long time commit it. Fix this
by processing the full batch until the end, even on missing modules, then
abort only once and restart processing.
6) Add bridge netfilter files to the MAINTAINER files.
7) Fix a net_device refcount leak in the new IPV6 bridge netfilter code, from
Julien Grall.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
commit 482cfc318559 ("netfilter: xtables: avoid percpu ruleset duplication")
Unlike ip and ip6tables, arp tables were never converted to use the percpu
jump stack.
It still uses the rule blob to store return address, which isn't safe
anymore since we now share this blob among all processors.
Because there is no TEE support for arptables, we don't need to cope
with reentrancy, so we can use loocal variable to hold stack offset.
Fixes: 482cfc318559 ("netfilter: xtables: avoid percpu ruleset duplication")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This kernel patch exports the value of the new
ignore_routes_with_linkdown via netconf.
v2: changes to notify userspace via netlink when sysctl values change
and proposed for 'net' since this could be considered a bugfix
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Frag needed should be sent only if the inner header asked
to not fragment. Currently fragmentation is broken if the
tunnel has df set, but df was not asked in the original
packet. The tunnel's df needs to be still checked to update
internally the pmtu cache.
Commit 23a3647bc4f93bac broke it, and this commit fixes
the ipv4 df check back to the way it was.
Fixes: 23a3647bc4f93bac ("ip_tunnels: Use skb-len to PMTU check.")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This reverts commit 3cc4949269e01f39443d0fcfffb5bc6b47878d45.
There is nothing wrong with coalescing during defragmentation, it
reduces truesize overhead and simplifies things for the receiving
socket (no fraglist walk needed).
However, it also destroys geometry of the original fragments.
While that doesn't cause any breakage (we make sure to not exceed largest
original size) ip_do_fragment contains a 'fastpath' that takes advantage
of a present frag list and results in fragments that (in most cases)
match what was received.
In case its needed the coalescing could be done later, when we're sure
the skb is not forwarded. But discussion during NFWS resulted in
'lets just remove this for now'.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
After commit 900f65d361d3 ("tcp: move duplicate code from
tcp_v4_init_sock()/tcp_v6_init_sock()"), we no longer
need to export tcp_init_xmit_timers()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
flags local variable in __mkroute_input is not used as a variable.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add support to allow non-local binds similar to how this was done for IPv4.
Non-local binds are very useful in emulating the Internet in a box, etc.
This add the ip_nonlocal_bind sysctl under ipv6.
Testing:
Set up nonlocal binding and receive routing on a host, e.g.:
ip -6 rule add from ::/0 iif eth0 lookup 200
ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
sysctl -w net.ipv6.ip_nonlocal_bind=1
Set up routing to 2001:0:0:1::/64 on peer to go to first host
ping6 -I 2001:0:0:1::1 peer-address -- to verify
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
inet_twsk_deschedule() calls are followed by inet_twsk_put().
Only particular case is in inet_twsk_purge() but there is no point
to defer the inet_twsk_put() after re-enabling BH.
Lets rename inet_twsk_deschedule() to inet_twsk_deschedule_put()
and move the inet_twsk_put() inside.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
timewait sockets have a complex refcounting logic.
Once we realize it should be similar to established and
syn_recv sockets, we can use sk_nulls_del_node_init_rcu()
and remove inet_twsk_unhash()
In particular, deferred inet_twsk_put() added in commit
13475a30b66cd ("tcp: connect() race with timewait reuse")
looks unecessary : When removing a timewait socket from
ehash or bhash, caller must own a reference on the socket
anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The congestion state and cwnd can be updated in the wrong order.
For example, upon receiving a dubious ACK, we incorrectly raise
the cwnd first (tcp_may_raise_cwnd()/tcp_cong_avoid()) because
the state is still Open, then enter recovery state to reduce cwnd.
For another example, if the ACK indicates spurious timeout or
retransmits, we first revert the cwnd reduction and congestion
state back to Open state. But we don't raise the cwnd even though
the ACK does not indicate any congestion.
To fix this problem we should first call tcp_fastretrans_alert() to
process the dubious ACK and update the congestion state, then call
tcp_may_raise_cwnd() that raises cwnd based on the current state.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In the original design slow start is only used to raise cwnd
when cwnd is stricly below ssthresh. It makes little sense
to slow start when cwnd == ssthresh: especially
when hystart has set ssthresh in the initial ramp, or after
recovery when cwnd resets to ssthresh. Not doing so will
also help reduce the buffer bloat slightly.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add a helper to test the slow start condition in various congestion
control modules and other places. This is to prepare a slight improvement
in policy as to exactly when to slow start.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
V1 of this patch contains Eric Dumazet's suggestion to move the per
dst RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks Eric.
I ran some tests and after setting the "ip route change quickack 1"
knob there were still many delayed ACKs sent. This occured
because when icsk_ack.quick=0 the !icsk_ack.pingpong value is
subsequently ignored as tcp_in_quickack_mode() checks both these
values. The condition for a quick ack to trigger requires
that both icsk_ack.quick != 0 and icsk_ack.pingpong=0. Currently
only icsk_ack.pingpong is controlled by the knob. But the
icsk_ack.quick value changes dynamically depending on heuristics.
The crux of the matter is that delayed acks still cannot be entirely
disabled even with the RTAX_QUICKACK per dst knob enabled. This
patch ensures that a quick ack is always sent when the RTAX_QUICKACK
per dst knob is turned on.
The "ip route change quickack 1" knob was recently added to enable
quickacks. It was modeled around the TCP_QUICKACK setsockopt() option.
This issue is that even with "ip route change quickack 1" enabled
we still see delayed ACKs under some conditions. It would be nice
to be able to completely disable delayed ACKs.
Here is an example:
# netstat -s|grep dela
3 delayed acks sent
For all routes enable the knob
# ip route change quickack 1
Generate some traffic across a slow link and we still see the delayed
acks.
# netstat -s|grep dela
106 delayed acks sent
1 delayed acks further delayed because of locked socket
The issue is that both the "ip route change quickack 1" knob and
the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0.
However at the business end in the __tcp_ack_snd_check() routine,
tcp_in_quickack_mode() checks that both icsk_ack.quick != 0
and icsk_ack.pingpong=0 in order to trigger a quickack. As
icsk_ack.quick is determined by heuristics it can be 0. When
that occurs the icsk_ack.pingpong value is ignored and a delayed
ACK is sent regardless.
This patch moves the RTAX_QUICKACK per dst check into the
tcp_in_quickack_mode() routine which ensures that a quickack is
always sent when the quickack knob is enabled for that dst.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
PRR slow start is often too aggressive especially when drops are
caused by traffic policers. The policers mainly use token bucket
to enforce the rate so sending (twice) faster than the delivery
rate causes excessive drops.
This patch changes PRR to the conservative reduction bound
(CRB) mode in RFC 6937 by default. CRB follows the packet
conservation rule to send at most the delivery rate by default.
But if many packets are lost and the pipe is empty, CRB may take N
round trips to repair N losses. We conditionally turn on slow start
mode if all these conditions are made to speed up the recovery:
1) on the second round or later in recovery
2) retransmission sent in the previous round is delivered on this ACK
3) no retransmission is marked lost on this ACK
By using packet conservation by default, this change reduces the loss
retransmits signicantly on networks that deploy traffic policers,
up to 20% reduction of overall loss rate.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If the retransmission in CA_Loss is lost again, we should not
continue to slow start or raise cwnd in congestion avoidance mode.
Instead we should enter fast recovery and use PRR to reduce cwnd,
following the principle in RFC5681:
"... or the loss of a retransmission, should be taken as two
indications of congestion and, therefore, cwnd (and ssthresh) MUST
be lowered twice in this case."
This is especially important to reduce loss when the CA_Loss
state was caused by a traffic policer dropping the entire inflight.
The CA_Loss state has a problem where a loss of L packets causes the
sender to send a burst of L packets. So a policer that's dropping
most packets in a given RTT can cause a huge retransmit storm. By
contrast, PRR includes logic to bound the number of outbound packets
that result from a given ACK. So switching to CA_Recovery on lost
retransmits in CA_Loss avoids this retransmit storm problem when
in CA_Loss.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
Pull module_init replacement part two from Paul Gortmaker:
"Replace module_init with appropriate alternate initcall in non
modules.
This series converts non-modular code that is using the module_init()
call to hook itself into the system to instead use one of our
alternate priority initcalls.
Unlike the previous series that used device_initcall and hence was a
runtime no-op, these commits change to one of the alternate initcalls,
because (a) we have them and (b) it seems like the right thing to do.
For example, it would seem logical to use arch_initcall for arch
specific setup code and fs_initcall for filesystem setup code.
This does mean however, that changes in the init ordering will be
taking place, and so there is a small risk that some kind of implicit
init ordering issue may lie uncovered. But I think it is still better
to give these ones sensible priorities than to just assign them all to
device_initcall in order to exactly preserve the old ordering.
Thad said, we have already made similar changes in core kernel code in
commit c96d6660dc65 ("kernel: audit/fix non-modular users of
module_init in core code") without any regressions reported, so this
type of change isn't without precedent. It has also got the same
local testing and linux-next coverage as all the other pull requests
that I'm sending for this merge window have got.
Once again, there is an unused module_exit function removal that shows
up as an outlier upon casual inspection of the diffstat"
* tag 'module_init-alternate_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
x86: perf_event_intel_pt.c: use arch_initcall to hook in enabling
x86: perf_event_intel_bts.c: use arch_initcall to hook in enabling
mm/page_owner.c: use late_initcall to hook in enabling
lib/list_sort: use late_initcall to hook in self tests
arm: use subsys_initcall in non-modular pl320 IPC code
powerpc: don't use module_init for non-modular core hugetlb code
powerpc: use subsys_initcall for Freescale Local Bus
x86: don't use module_init for non-modular core bootflag code
netfilter: don't use module_init/exit in core IPV4 code
fs/notify: don't use module_init for non-modular inotify_user code
mm: replace module_init usages with subsys_initcall in nommu.c
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The file net/ipv4/netfilter.o is created based on whether
CONFIG_NETFILTER is set. However that is defined as a bool, and
hence this file with the core netfilter hooks will never be
modular. So using module_init as an alias for __initcall can be
somewhat misleading.
Fix this up now, so that we can relocate module_init from
init.h into module.h in the future. If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing. Also add an inclusion of init.h, as
that was previously implicit here in the netfilter.c file.
Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups. As __initcall gets
mapped onto device_initcall, our use of subsys_initcall (which
seems to make sense for netfilter code) will thus change this
registration from level 6-device to level 4-subsys (i.e. slightly
earlier). However no observable impact of that small difference
has been observed during testing, or is expected. (i.e. the
location of the netfilter messages in dmesg remains unchanged
with respect to all the other surrounding messages.)
As for the module_exit, rather than replace it with __exitcall,
we simply remove it, since it appears only UML does anything
with those, and even for UML, there is no relevant cleanup
to be done here.
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netfilter-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The following lockdep splat was seen due to the wrong context for
grabbing in_dev.
===============================
[ INFO: suspicious RCU usage. ]
4.1.0-next-20150626-dbg-00020-g54a6d91-dirty #244 Not tainted
-------------------------------
include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by ip/403:
#0: (rtnl_mutex){+.+.+.}, at: [<ffffffff81453305>] rtnl_lock+0x17/0x19
#1: ((inetaddr_chain).rwsem){.+.+.+}, at: [<ffffffff8105c327>] __blocking_notifier_call_chain+0x35/0x6a
stack backtrace:
CPU: 2 PID: 403 Comm: ip Not tainted 4.1.0-next-20150626-dbg-00020-g54a6d91-dirty #244
0000000000000001 ffff8800b189b728 ffffffff8150a542 ffffffff8107a8b3
ffff880037bbea40 ffff8800b189b758 ffffffff8107cb74 ffff8800379dbd00
ffff8800bec85800 ffff8800bf9e13c0 00000000000000ff ffff8800b189b7d8
Call Trace:
[<ffffffff8150a542>] dump_stack+0x4c/0x6e
[<ffffffff8107a8b3>] ? up+0x39/0x3e
[<ffffffff8107cb74>] lockdep_rcu_suspicious+0xf7/0x100
[<ffffffff814b63c3>] fib_dump_info+0x227/0x3e2
[<ffffffff814b6624>] rtmsg_fib+0xa6/0x116
[<ffffffff814b978f>] fib_table_insert+0x316/0x355
[<ffffffff814b362e>] fib_magic+0xb7/0xc7
[<ffffffff814b4803>] fib_add_ifaddr+0xb1/0x13b
[<ffffffff814b4d09>] fib_inetaddr_event+0x36/0x90
[<ffffffff8105c086>] notifier_call_chain+0x4c/0x71
[<ffffffff8105c340>] __blocking_notifier_call_chain+0x4e/0x6a
[<ffffffff8105c370>] blocking_notifier_call_chain+0x14/0x16
[<ffffffff814a7f50>] __inet_insert_ifa+0x1a5/0x1b3
[<ffffffff814a894d>] inet_rtm_newaddr+0x350/0x35f
[<ffffffff81457866>] rtnetlink_rcv_msg+0x17b/0x18a
[<ffffffff8107e7c3>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff8146965f>] ? netlink_deliver_tap+0x1cb/0x1f7
[<ffffffff814576eb>] ? rtnl_newlink+0x72a/0x72a
...
This patch resolves that splat.
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull networking updates from David Miller:
1) Add TX fast path in mac80211, from Johannes Berg.
2) Add TSO/GRO support to ibmveth, from Thomas Falcon
3) Move away from cached routes in ipv6, just like ipv4, from Martin
KaFai Lau.
4) Lots of new rhashtable tests, from Thomas Graf.
5) Run ingress qdisc lockless, from Alexei Starovoitov.
6) Allow servers to fetch TCP packet headers for SYN packets of new
connections, for fingerprinting. From Eric Dumazet.
7) Add mode parameter to pktgen, for testing receive. From Alexei
Starovoitov.
8) Cache access optimizations via simplifications of build_skb(), from
Alexander Duyck.
9) Move page frag allocator under mm/, also from Alexander.
10) Add xmit_more support to hv_netvsc, from KY Srinivasan.
11) Add a counter guard in case we try to perform endless reclassify
loops in the packet scheduler.
12) Extern flow dissector to be programmable and use it in new "Flower"
classifier. From Jiri Pirko.
13) AF_PACKET fanout rollover fixes, performance improvements, and new
statistics. From Willem de Bruijn.
14) Add netdev driver for GENEVE tunnels, from John W Linville.
15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.
16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.
17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
Borkmann.
18) Add tail call support to BPF, from Alexei Starovoitov.
19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.
20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.
21) Favor even port numbers for allocation to connect() requests, and
odd port numbers for bind(0), in an effort to help avoid
ip_local_port_range exhaustion. From Eric Dumazet.
22) Add Cavium ThunderX driver, from Sunil Goutham.
23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
from Alexei Starovoitov.
24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.
25) Double TCP Small Queues default to 256K to accomodate situations
like the XEN driver and wireless aggregation. From Wei Liu.
26) Add more entropy inputs to flow dissector, from Tom Herbert.
27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
Jonassen.
28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.
29) Track and act upon link status of ipv4 route nexthops, from Andy
Gospodarek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
bridge: vlan: flush the dynamically learned entries on port vlan delete
bridge: multicast: add a comment to br_port_state_selection about blocking state
net: inet_diag: export IPV6_V6ONLY sockopt
stmmac: troubleshoot unexpected bits in des0 & des1
net: ipv4 sysctl option to ignore routes when nexthop link is down
net: track link-status of ipv4 nexthops
net: switchdev: ignore unsupported bridge flags
net: Cavium: Fix MAC address setting in shutdown state
drivers: net: xgene: fix for ACPI support without ACPI
ip: report the original address of ICMP messages
net/mlx5e: Prefetch skb data on RX
net/mlx5e: Pop cq outside mlx5e_get_cqe
net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
net/mlx5e: Remove extra spaces
net/mlx5e: Avoid TX CQE generation if more xmit packets expected
net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
...
|
| |\ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Conflicts:
drivers/net/ethernet/mellanox/mlx4/main.c
net/packet/af_packet.c
Both conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
ICMP messages can trigger ICMP and local errors. In this case
serr->port is 0 and starting from Linux 4.0 we do not return
the original target address to the error queue readers.
Add function to define which errors provide addr_offset.
With this fix my ping command is not silent anymore.
Fixes: c247f0534cc5 ("ip: fix error queue empty skb handling")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |/
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
tcp_fastopen_reset_cipher really cannot be called from interrupt
context. It allocates the tcp_fastopen_context with GFP_KERNEL and
calls crypto_alloc_cipher, which allocates all kind of stuff with
GFP_KERNEL.
Thus, we might sleep when the key-generation is triggered by an
incoming TFO cookie-request which would then happen in interrupt-
context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP:
[ 36.001813] BUG: sleeping function called from invalid context at mm/slub.c:1266
[ 36.003624] in_atomic(): 1, irqs_disabled(): 0, pid: 1016, name: packetdrill
[ 36.004859] CPU: 1 PID: 1016 Comm: packetdrill Not tainted 4.1.0-rc7 #14
[ 36.006085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 36.008250] 00000000000004f2 ffff88007f8838a8 ffffffff8171d53a ffff880075a084a8
[ 36.009630] ffff880075a08000 ffff88007f8838c8 ffffffff810967d3 ffff88007f883928
[ 36.011076] 0000000000000000 ffff88007f8838f8 ffffffff81096892 ffff88007f89be00
[ 36.012494] Call Trace:
[ 36.012953] <IRQ> [<ffffffff8171d53a>] dump_stack+0x4f/0x6d
[ 36.014085] [<ffffffff810967d3>] ___might_sleep+0x103/0x170
[ 36.015117] [<ffffffff81096892>] __might_sleep+0x52/0x90
[ 36.016117] [<ffffffff8118e887>] kmem_cache_alloc_trace+0x47/0x190
[ 36.017266] [<ffffffff81680d82>] ? tcp_fastopen_reset_cipher+0x42/0x130
[ 36.018485] [<ffffffff81680d82>] tcp_fastopen_reset_cipher+0x42/0x130
[ 36.019679] [<ffffffff81680f01>] tcp_fastopen_init_key_once+0x61/0x70
[ 36.020884] [<ffffffff81680f2c>] __tcp_fastopen_cookie_gen+0x1c/0x60
[ 36.022058] [<ffffffff816814ff>] tcp_try_fastopen+0x58f/0x730
[ 36.023118] [<ffffffff81671788>] tcp_conn_request+0x3e8/0x7b0
[ 36.024185] [<ffffffff810e3872>] ? __module_text_address+0x12/0x60
[ 36.025327] [<ffffffff8167b2e1>] tcp_v4_conn_request+0x51/0x60
[ 36.026410] [<ffffffff816727e0>] tcp_rcv_state_process+0x190/0xda0
[ 36.027556] [<ffffffff81661f97>] ? __inet_lookup_established+0x47/0x170
[ 36.028784] [<ffffffff8167c2ad>] tcp_v4_do_rcv+0x16d/0x3d0
[ 36.029832] [<ffffffff812e6806>] ? security_sock_rcv_skb+0x16/0x20
[ 36.030936] [<ffffffff8167cc8a>] tcp_v4_rcv+0x77a/0x7b0
[ 36.031875] [<ffffffff816af8c3>] ? iptable_filter_hook+0x33/0x70
[ 36.032953] [<ffffffff81657d22>] ip_local_deliver_finish+0x92/0x1f0
[ 36.034065] [<ffffffff81657f1a>] ip_local_deliver+0x9a/0xb0
[ 36.035069] [<ffffffff81657c90>] ? ip_rcv+0x3d0/0x3d0
[ 36.035963] [<ffffffff81657569>] ip_rcv_finish+0x119/0x330
[ 36.036950] [<ffffffff81657ba7>] ip_rcv+0x2e7/0x3d0
[ 36.037847] [<ffffffff81610652>] __netif_receive_skb_core+0x552/0x930
[ 36.038994] [<ffffffff81610a57>] __netif_receive_skb+0x27/0x70
[ 36.040033] [<ffffffff81610b72>] process_backlog+0xd2/0x1f0
[ 36.041025] [<ffffffff81611482>] net_rx_action+0x122/0x310
[ 36.042007] [<ffffffff81076743>] __do_softirq+0x103/0x2f0
[ 36.042978] [<ffffffff81723e3c>] do_softirq_own_stack+0x1c/0x30
This patch moves the call to tcp_fastopen_init_key_once to the places
where a listener socket creates its TFO-state, which always happens in
user-context (either from the setsockopt, or implicitly during the
listen()-call)
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Fixes: 222e83d2e0ae ("tcp: switch tcp_fastopen key generation to net_get_random_once")
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
For AF_INET6 sockets, the value of struct ipv6_pinfo.ipv6only is
exported to userspace. It indicates whether a socket bound to in6addr_any
listens on IPv4 as well as IPv6. Since the socket is natively IPv6, it is not
listed by e.g. 'ss -l -4'.
This patch is accompanied by an appropriate one for iproute2 to enable
the additional information in 'ss -e'.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This feature is only enabled with the new per-interface or ipv4 global
sysctls called 'ignore_routes_with_linkdown'.
net.ipv4.conf.all.ignore_routes_with_linkdown = 0
net.ipv4.conf.default.ignore_routes_with_linkdown = 0
net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
...
When the above sysctls are set, will report to userspace that a route is
dead and will no longer resolve to this nexthop when performing a fib
lookup. This will signal to userspace that the route will not be
selected. The signalling of a RTNH_F_DEAD is only passed to userspace
if the sysctl is enabled and link is down. This was done as without it
the netlink listeners would have no idea whether or not a nexthop would
be selected. The kernel only sets RTNH_F_DEAD internally if the
interface has IFF_UP cleared.
With the new sysctl set, the following behavior can be observed
(interface p8p1 is link-down):
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 dead linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 dead linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
90.0.0.1 via 70.0.0.2 dev p7p1 src 70.0.0.1
cache
local 80.0.0.1 dev lo src 80.0.0.1
cache <local>
80.0.0.2 via 10.0.5.2 dev p9p1 src 10.0.5.15
cache
While the route does remain in the table (so it can be modified if
needed rather than being wiped away as it would be if IFF_UP was
cleared), the proper next-hop is chosen automatically when the link is
down. Now interface p8p1 is linked-up:
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
192.168.56.0/24 dev p2p1 proto kernel scope link src 192.168.56.2
90.0.0.1 via 80.0.0.2 dev p8p1 src 80.0.0.1
cache
local 80.0.0.1 dev lo src 80.0.0.1
cache <local>
80.0.0.2 dev p8p1 src 80.0.0.1
cache
and the output changes to what one would expect.
If the sysctl is not set, the following output would be expected when
p8p1 is down:
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
Since the dead flag does not appear, there should be no expectation that
the kernel would skip using this route due to link being down.
v2: Split kernel changes into 2 patches, this actually makes a
behavioral change if the sysctl is set. Also took suggestion from Alex
to simplify code by only checking sysctl during fib lookup and
suggestion from Scott to add a per-interface sysctl.
v3: Code clean-ups to make it more readable and efficient as well as a
reverse path check fix.
v4: Drop binary sysctl
v5: Whitespace fixups from Dave
v6: Style changes from Dave and checkpatch suggestions
v7: One more checkpatch fixup
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
reachable via an interface where carrier is off. No action is taken,
but additional flags are passed to userspace to indicate carrier status.
This also includes a cleanup to fib_disable_ip to more clearly indicate
what event made the function call to replace the more cryptic force
option previously used.
v2: Split out kernel functionality into 2 patches, this patch simply
sets and clears new nexthop flag RTNH_F_LINKDOWN.
v3: Cleanups suggested by Alex as well as a bug noticed in
fib_sync_down_dev and fib_sync_up when multipath was not enabled.
v5: Whitespace and variable declaration fixups suggested by Dave.
v6: Style fixups noticed by Dave; ran checkpatch to be sure I got them
all.
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
inet_diag_dump_reqs() is called from inet_diag_dump_icsk() with BH
disabled. So no need to disable BH in inet_diag_dump_reqs().
Signed-off-by: Hiroaki Shimoda <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch adds NLM_F_APPEND flag to struct nlmsg_hdr->nlmsg_flags
in newroute notifications if the route add was an append.
(This is similar to how NLM_F_REPLACE is already part of new
route replace notifications today)
This helps userspace determine if the route add operation was
an append.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When an inet_sock is destroyed, its source port (sk_num) is set to
zero as part of the unhash procedure. In order to supply a source
port as part of the NETLINK_SOCK_DIAG socket destruction broadcasts,
the source port number must be read from inet_sport instead.
Tested: ss -E
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This get_info handler will simply dispatch to the appropriate
existing inet protocol handler.
This patch also includes a new netlink attribute
(INET_DIAG_PROTOCOL). This attribute is currently only used
for multicast messages. Without this attribute, there is no
way of knowing the IP protocol used by the socket information
being broadcast. This attribute is not necessary in the 'dump'
variant of this protocol (though it could easily be added)
because dump requests are issued for specific family/protocol
pairs.
Tested: ss -E (note, the -E option has not yet been merged into
the upstream version of ss).
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Previously, there was no clear distinction between the inet protocols
that used struct tcp_info to report information and those that didn't.
This change adds a specific size attribute to the inet_diag_handler
struct which defines these interfaces. This will make dispatching
sock_diag get_info requests identical for all inet protocols in a
following patch.
Tested: ss -au
Tested: ss -at
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |\ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
This a bit large (and late) patchset that contains Netfilter updates for
net-next. Most relevantly br_netfilter fixes, ipset RCU support, removal of
x_tables percpu ruleset copy and rework of the nf_tables netdev support. More
specifically, they are:
1) Warn the user when there is a better protocol conntracker available, from
Marcelo Ricardo Leitner.
2) Fix forwarding of IPv6 fragmented traffic in br_netfilter, from Bernhard
Thaler. This comes with several patches to prepare the change in first place.
3) Get rid of special mtu handling of PPPoE/VLAN frames for br_netfilter. This
is not needed anymore since now we use the largest fragment size to
refragment, from Florian Westphal.
4) Restore vlan tag when refragmenting in br_netfilter, also from Florian.
5) Get rid of the percpu ruleset copy in x_tables, from Florian. Plus another
follow up patch to refine it from Eric Dumazet.
6) Several ipset cleanups, fixes and finally RCU support, from Jozsef Kadlecsik.
7) Get rid of parens in Netfilter Kconfig files.
8) Attach the net_device to the basechain as opposed to the initial per table
approach in the nf_tables netdev family.
9) Subscribe to netdev events to detect the removal and registration of a
device that is referenced by a basechain.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
After Florian patches, there is no need for XT_TABLE_INFO_SZ anymore :
Only one copy of table is kept, instead of one copy per cpu.
We also can avoid a dereference if we put table data right after
xt_table_info. It reduces register pressure and helps compiler.
Then, we attempt a kmalloc() if total size is under order-3 allocation,
to reduce TLB pressure, as in many cases, rules fit in 32 KB.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
According to the reporter, they are not needed.
Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We store the rule blob per (possible) cpu. Unfortunately this means we can
waste lot of memory on big smp machines. ipt_entry structure ('rule head')
is 112 byte, so e.g. with maxcpu=64 one single rule eats
close to 8k RAM.
Since previous patch made counters percpu it appears there is nothing
left in the rule blob that needs to be percpu.
On my test system (144 possible cpus, 400k dummy rules) this
change saves close to 9 Gigabyte of RAM.
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The binary arp/ip/ip6tables ruleset is stored per cpu.
The only reason left as to why we need percpu duplication are the rule
counters embedded into ipt_entry et al -- since each cpu has its own copy
of the rules, all counters can be lockless.
The downside is that the more cpus are supported, the more memory is
required. Rules are not just duplicated per online cpu but for each
possible cpu, i.e. if maxcpu is 144, then rule is duplicated 144 times,
not for the e.g. 64 cores present.
To save some memory and also improve utilization of shared caches it
would be preferable to only store the rule blob once.
So we first need to separate counters and the rule blob.
Instead of using entry->counters, allocate this percpu and store the
percpu address in entry->counters.pcnt on CONFIG_SMP.
This change makes no sense as-is; it is merely an intermediate step to
remove the percpu duplication of the rule set in a followup patch.
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
since commit d6b915e29f4adea9
("ip_fragment: don't forward defragmented DF packet") the largest
fragment size is available in the IPCB.
Therefore we no longer need to care about 'encapsulation'
overhead of stripped PPPOE/VLAN headers since ip_do_fragment
doesn't use device mtu in such cases.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Fixes cross-compile to mips.
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
text data bss dec hex filename
old: 16527 44 0 16571 40bb net/ipv4/ip_output.o
new: 14935 44 0 14979 3a83 net/ipv4/ip_output.o
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We had various issues in the past when TCP stack was modifying
gso_size/gso_segs while clones were in flight.
Commit c52e2421f73 ("tcp: must unclone packets before mangling them")
fixed these bugs and added a WARN_ON_ONCE(skb_cloned(skb)); in
tcp_set_skb_tso_segs()
These bugs are now fixed, and because TCP stack now only sets
shinfo->gso_size|segs on the clone itself, the check can be removed.
As a result of this change, compiler inlines tcp_set_skb_tso_segs() in
tcp_init_tso_segs()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In commit cd7d8498c9a5 ("tcp: change tcp_skb_pcount() location") we stored
gso_segs in a temporary cache hot location.
This patch does the same for gso_size.
This allows to save 2 cache line misses in tcp xmit path for
the last packet that is considered but not sent because of
various conditions (cwnd, tso defer, receiver window, TSQ...)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
tcp_set_skb_tso_segs() & tcp_init_tso_segs() no longer
use the sock pointer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Our goal is to touch skb_shinfo(skb) only when absolutely needed,
to avoid two cache line misses in TCP output path for last skb
that is considered but not sent because of various conditions
(cwnd, tso defer, receiver window, TSQ...)
A packet is GSO only when skb_shinfo(skb)->gso_size is not zero.
We can set skb_shinfo(skb)->gso_type to sk->sk_gso_type even for
non GSO packets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
tcp_gso_segment() and tcp_gro_receive() are not strictly
part of TCP stack. They should not assume tcp_skb_mss(skb)
is in fact skb_shinfo(skb)->gso_size.
This will allow us to change tcp_skb_mss() in following patches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
CAIA Delay-Gradient (CDG) is a TCP congestion control that modifies
the TCP sender in order to [1]:
o Use the delay gradient as a congestion signal.
o Back off with an average probability that is independent of the RTT.
o Coexist with flows that use loss-based congestion control, i.e.,
flows that are unresponsive to the delay signal.
o Tolerate packet loss unrelated to congestion. (Disabled by default.)
Its FreeBSD implementation was presented for the ICCRG in July 2012;
slides are available at http://www.ietf.org/proceedings/84/iccrg.html
Running the experiment scenarios in [1] suggests that our implementation
achieves more goodput compared with FreeBSD 10.0 senders, although it also
causes more queueing delay for a given backoff factor.
The loss tolerance heuristic is disabled by default due to safety concerns
for its use in the Internet [2, p. 45-46].
We use a variant of the Hybrid Slow start algorithm in tcp_cubic to reduce
the probability of slow start overshoot.
[1] D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using
delay gradients." In Networking 2011, pages 328-341. Springer, 2011.
[2] K.K. Jonassen. "Implementing CAIA Delay-Gradient in Linux."
MSc thesis. Department of Informatics, University of Oslo, 2015.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: David Hayes <davihay@ifi.uio.no>
Cc: Andreas Petlund <apetlund@simula.no>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu>
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|