op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	netfilter: remove unused hooknum arg from packet functions	Florian Westphal	2017-09-04	9	-12/+4
\| \| \| \| \| \|	tested with allmodconfig build. Signed-off-by: Florian Westphal <fw@strlen.de>
*	netfilter: nft_limit: add stateful object type	Pablo M. Bermudo Garay	2017-09-04	1	-1/+121
\| \| \| \| \| \| \| \|	Register a new limit stateful object type into the stateful object infrastructure. Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	netfilter: nft_limit: replace pkt_bytes with bytes	Pablo M. Bermudo Garay	2017-09-04	1	-13/+13
\| \| \| \| \| \| \|	Just a small refactor patch in order to improve the code readability. Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	netfilter: nf_tables: add select_ops for stateful objects	Pablo M. Bermudo Garay	2017-09-04	5	-35/+66
\| \| \| \| \| \| \| \| \| \| \| \|	This patch adds support for overloading stateful objects operations through the select_ops() callback, just as it is implemented for expressions. This change is needed for upcoming additions to the stateful objects infrastructure. Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	netfilter: xt_hashlimit: add rate match mode	Vishwanath Pai	2017-09-04	1	-24/+253
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds a new feature to hashlimit that allows matching on the current packet/byte rate without rate limiting. This can be enabled with a new flag --hashlimit-rate-match. The match returns true if the current rate of packets is above/below the user specified value. The main difference between the existing algorithm and the new one is that the existing algorithm rate-limits the flow whereas the new algorithm does not. Instead it classifies the flow based on whether it is above or below a certain rate. I will demonstrate this with an example below. Let us assume this rule: iptables -A INPUT -m hashlimit --hashlimit-above 10/s -j new_chain If the packet rate is 15/s, the existing algorithm would ACCEPT 10 packets every second and send 5 packets to "new_chain". But with the new algorithm, as long as the rate of 15/s is sustained, all packets will continue to match and every packet is sent to new_chain. This new functionality will let us classify different flows based on their current rate, so that further decisions can be made on them based on what the current rate is. This is how the new algorithm works: We divide time into intervals of 1 (sec/min/hour) as specified by the user. We keep track of the number of packets/bytes processed in the current interval. After each interval we reset the counter to 0. When we receive a packet for match, we look at the packet rate during the current interval and the previous interval to make a decision: if [ prev_rate < user and cur_rate < user ] return Below else return Above Where cur_rate is the number of packets/bytes seen in the current interval, prev is the number of packets/bytes seen in the previous interval and 'user' is the rate specified by the user. We also provide flexibility to the user for choosing the time interval using the option --hashilmit-interval. For example the user can keep a low rate like x/hour but still keep the interval as small as 1 second. To preserve backwards compatibility we have to add this feature in a new revision, so I've created revision 3 for hashlimit. The two new options we add are: --hashlimit-rate-match --hashlimit-rate-interval I have updated the help text to add these new options. Also added a few tests for the new options. Suggested-by: Igor Lubashev <ilubashe@akamai.com> Reviewed-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Vishwanath Pai <vpai@akamai.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next	David S. Miller	2017-09-03	77	-787/+1333
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. Basically, updates to the conntrack core, enhancements for nf_tables, conversion of netfilter hooks from linked list to array to improve memory locality and asorted improvements for the Netfilter codebase. More specifically, they are: 1) Add expection to hashes after timer initialization to prevent access from another CPU that walks on the hashes and calls del_timer(), from Florian Westphal. 2) Don't update nf_tables chain counters from hot path, this is only used by the x_tables compatibility layer. 3) Get rid of nested rcu_read_lock() calls from netfilter hook path. Hooks are always guaranteed to run from rcu read side, so remove nested rcu_read_lock() where possible. Patch from Taehee Yoo. 4) nf_tables new ruleset generation notifications include PID and name of the process that has updated the ruleset, from Phil Sutter. 5) Use skb_header_pointer() from nft_fib, so we can reuse this code from the nf_family netdev family. Patch from Pablo M. Bermudo. 6) Add support for nft_fib in nf_tables netdev family, also from Pablo. 7) Use deferrable workqueue for conntrack garbage collection, to reduce power consumption, from Patch from Subash Abhinov Kasiviswanathan. 8) Add nf_ct_expect_iterate_net() helper and use it. From Florian Westphal. 9) Call nf_ct_unconfirmed_destroy only from cttimeout, from Florian. 10) Drop references on conntrack removal path when skbuffs has escaped via nfqueue, from Florian. 11) Don't queue packets to nfqueue with dying conntrack, from Florian. 12) Constify nf_hook_ops structure, from Florian. 13) Remove neededlessly branch in nf_tables trace code, from Phil Sutter. 14) Add nla_strdup(), from Phil Sutter. 15) Rise nf_tables objects name size up to 255 chars, people want to use DNS names, so increase this according to what RFC 1035 specifies. Patch series from Phil Sutter. 16) Kill nf_conntrack_default_on, it's broken. Default on conntrack hook registration on demand, suggested by Eric Dumazet, patch from Florian. 17) Remove unused variables in compat_copy_entry_from_user both in ip_tables and arp_tables code. Patch from Taehee Yoo. 18) Constify struct nf_conntrack_l4proto, from Julia Lawall. 19) Constify nf_loginfo structure, also from Julia. 20) Use a single rb root in connlimit, from Taehee Yoo. 21) Remove unused netfilter_queue_init() prototype, from Taehee Yoo. 22) Use audit_log() instead of open-coding it, from Geliang Tang. 23) Allow to mangle tcp options via nft_exthdr, from Florian. 24) Allow to fetch TCP MSS from nft_rt, from Florian. This includes a fix for a miscalculation of the minimal length. 25) Simplify branch logic in h323 helper, from Nick Desaulniers. 26) Calculate netlink attribute size for conntrack tuple at compile time, from Florian. 27) Remove protocol name field from nf_conntrack_{l3,l4}proto structure. From Florian. 28) Remove holes in nf_conntrack_l4proto structure, so it becomes smaller. From Florian. 29) Get rid of print_tuple() indirection for /proc conntrack listing. Place all the code in net/netfilter/nf_conntrack_standalone.c. Patch from Florian. 30) Do not built in print_conntrack() if CONFIG_NF_CONNTRACK_PROCFS is off. From Florian. 31) Constify most nf_conntrack_{l3,l4}proto helper functions, from Florian. 32) Fix broken indentation in ebtables extensions, from Colin Ian King. 33) Fix several harmless sparse warning, from Florian. 34) Convert netfilter hook infrastructure to use array for better memory locality, joint work done by Florian and Aaron Conole. Moreover, add some instrumentation to debug this. 35) Batch nf_unregister_net_hooks() calls, to call synchronize_net once per batch, from Florian. 36) Get rid of noisy logging in ICMPv6 conntrack helper, from Florian. 37) Get rid of obsolete NFDEBUG() instrumentation, from Varsha Rao. 38) Remove unused code in the generic protocol tracker, from Davide Caratti. I think I will have material for a second Netfilter batch in my queue if time allow to make it fit in this merge window. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	netfilter: rt: account for tcp header size too	Florian Westphal	2017-08-28	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This needs to accout for the ipv4/ipv6 header size and the tcp header without options. Fixes: 6b5dc98e8fac0 ("netfilter: rt: add support to fetch path mss") Reported-by: Matteo Croce <technoboy85@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: conntrack: remove unused code in nf_conntrack_proto_generic.c	Davide Caratti	2017-08-28	1	-12/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	L4 protocol helpers for DCCP, SCTP and UDPlite can't be built as kernel modules anymore, so we can remove code enclosed in #ifdef CONFIG_NF_CT_PROTO_{DCCP,SCTP,UDPLITE}_MODULE Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: Remove NFDEBUG()	Varsha Rao	2017-08-28	2	-7/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove NFDEBUG and use pr_debug() instead of it. Signed-off-by: Varsha Rao <rvarsha016@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: conntrack: don't log "invalid" icmpv6 connections	Florian Westphal	2017-08-28	1	-5/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When enabling logging for invalid connections we currently also log most icmpv6 types, which we don't track intentionally (e.g. neigh discovery). "invalid" should really mean "invalid", i.e. short header or bad checksum. We don't do any logging for icmp(v4) either, its just useless noise. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: core: batch nf_unregister_net_hooks synchronize_net calls	Florian Westphal	2017-08-28	1	-3/+56
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	re-add batching in nf_unregister_net_hooks(). Similar as before, just store an array with to-be-free'd rule arrays on stack, then call synchronize_net once per batch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: debug: check for sorted array	Florian Westphal	2017-08-28	1	-0/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Make sure our grow/shrink routine places them in the correct order. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: convert hook list to an array	Aaron Conole	2017-08-28	4	-107/+279
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This converts the storage and layout of netfilter hook entries from a linked list to an array. After this commit, hook entries will be stored adjacent in memory. The next pointer is no longer required. The ops pointers are stored at the end of the array as they are only used in the register/unregister path and in the legacy br_netfilter code. nf_unregister_net_hooks() is slower than needed as it just calls nf_unregister_net_hook in a loop (i.e. at least n synchronize_net() calls), this will be addressed in followup patch. Test setup: - ixgbe 10gbit - netperf UDP_STREAM, 64 byte packets - 5 hooks: (raw + mangle prerouting, mangle+filter input, inet filter): empty mangle and raw prerouting, mangle and filter input hooks: 353.9 this patch: 364.2 Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: fix a few (harmless) sparse warnings	Florian Westphal	2017-08-28	3	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	net/netfilter/nft_payload.c:187:18: warning: incorrect type in return expression (expected bool got restricted __sum16 [usertype] check) net/netfilter/nft_exthdr.c:222:14: warning: cast to restricted __be32 net/netfilter/nft_rt.c:49:23: warning: incorrect type in assignment (different base types expected unsigned int got restricted __be32) net/netfilter/nft_rt.c:70:25: warning: symbol 'nft_rt_policy' was not declared. Should it be static? Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: ebtables: fix indent on if statements	Colin Ian King	2017-08-24	2	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The returns on some if statements are not indented correctly, add in the missing tab. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: conntrack: make protocol tracker pointers const	Florian Westphal	2017-08-24	6	-37/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Doesn't change generated code, but will make it easier to eventually make the actual trackers themselvers const. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: conntrack: print_conntrack only needed if CONFIG_NF_CONNTRACK_PROCFS	Florian Westphal	2017-08-24	4	-0/+22
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: conntrack: place print_tuple in procfs part	Florian Westphal	2017-08-24	12	-108/+56
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	CONFIG_NF_CONNTRACK_PROCFS is deprecated, no need to use a function pointer in the trackers for this. Place the printf formatting in the one place that uses it. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: conntrack: remove protocol name from l4proto struct	Florian Westphal	2017-08-24	10	-19/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	no need to waste storage for something that is only needed in one place and can be deduced from protocol number. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: conntrack: remove protocol name from l3proto struct	Florian Westphal	2017-08-24	4	-4/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	no need to waste storage for something that is only needed in one place and can be deduced from protocol number. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: conntrack: compute l3proto nla size at compile time	Florian Westphal	2017-08-24	4	-19/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	avoids a pointer and allows struct to be const later on. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nf_nat_h323: fix logical-not-parentheses warning	Nick Desaulniers	2017-08-24	1	-27/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Clang produces the following warning: net/ipv4/netfilter/nf_nat_h323.c:553:6: error: logical not is only applied to the left hand side of this comparison [-Werror,-Wlogical-not-parentheses] if (!set_h225_addr(skb, protoff, data, dataoff, taddr, ^ add parentheses after the '!' to evaluate the comparison first add parentheses around left hand side expression to silence this warning There's not necessarily a bug here, but it's cleaner to return early, ex: if (x) return ... rather than: if (x == 0) ... else return Also added a return code check that seemed to be missing in one instance. Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: rt: add support to fetch path mss	Florian Westphal	2017-08-19	1	-0/+66
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	to be used in combination with tcp option set support to mimic iptables TCPMSS --clamp-mss-to-pmtu. v2: Eric Dumazet points out dst must be initialized. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: exthdr: tcp option set support	Florian Westphal	2017-08-19	1	-2/+162
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This allows setting 2 and 4 byte quantities in the tcp option space. Main purpose is to allow native replacement for xt_TCPMSS to work around pmtu blackholes. Writes to kind and len are now allowed at the moment, it does not seem useful to do this as it causes corruption of the tcp option space. We can always lift this restriction later if a use-case appears. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: exthdr: split netlink dump function	Florian Westphal	2017-08-19	1	-5/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	so eval and uncoming eval_set versions can reuse a common helper. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: exthdr: factor out tcp option access	Florian Westphal	2017-08-19	1	-12/+21
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: use audit_log()	Geliang Tang	2017-08-19	2	-19/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use audit_log() instead of open-coding it. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: remove prototype of netfilter_queue_init	Taehee Yoo	2017-08-19	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The netfilter_queue_init() has been removed. so we can remove the prototype of that. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: connlimit: merge root4 and root6.	Taehee Yoo	2017-08-19	1	-15/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The root4 variable is used only when connlimit extension module has been stored by the iptables command. and the roo6 variable is used only when connlimit extension module has been stored by the ip6tables command. So the root4 and roo6 variable does not be used at the same time. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: constify nf_loginfo structures	Julia Lawall	2017-08-02	7	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The nf_loginfo structures are only passed as the seventh argument to nf_log_trace, which is declared as const or stored in a local const variable. Thus the nf_loginfo structures themselves can be const. Done with the help of Coccinelle. // <smpl> @r disable optional_qualifier@ identifier i; position p; @@ static struct nf_loginfo i@p = { ... }; @ok1@ identifier r.i; expression list[6] es; position p; @@ nf_log_trace(es,&i@p,...) @ok2@ identifier r.i; const struct nf_loginfo *e; position p; @@ e = &i@p @bad@ position p != {r.p,ok1.p,ok2.p}; identifier r.i; struct nf_loginfo e; @@ e@i@p @depends on !bad disable optional_qualifier@ identifier r.i; @@ static +const struct nf_loginfo i = { ... }; // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: constify nf_conntrack_l3/4proto parameters	Julia Lawall	2017-08-02	4	-21/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a nf_conntrack_l3/4proto parameter is not on the left hand side of an assignment, its address is not taken, and it is not passed to a function that may modify its fields, then it can be declared as const. This change is useful from a documentation point of view, and can possibly facilitate making some nf_conntrack_l3/4proto structures const subsequently. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: xtables: Remove unused variable in compat_copy_entry_from_user()	Taehee Yoo	2017-08-02	2	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The target variable is not used in the compat_copy_entry_from_user(). So It can be removed. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: conntrack: do not enable connection tracking unless needed	Florian Westphal	2017-07-31	4	-68/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Discussion during NFWS 2017 in Faro has shown that the current conntrack behaviour is unreasonable. Even if conntrack module is loaded on behalf of a single net namespace, its turned on for all namespaces, which is expensive. Commit 481fa373476 ("netfilter: conntrack: add nf_conntrack_default_on sysctl") attempted to provide an alternative to the 'default on' behaviour by adding a sysctl to change it. However, as Eric points out, the sysctl only becomes available once the module is loaded, and then its too late. So we either have to move the sysctl to the core, or, alternatively, change conntrack to become active only once the rule set requires this. This does the latter, conntrack is only enabled when a rule needs it. Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nft_set_rbtree: use seqcount to avoid lock in most cases	Florian Westphal	2017-07-31	1	-12/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	switch to lockless lockup. write side now also increments sequence counter. On lookup, sample counter value and only take the lock if we did not find a match and the counter has changed. This avoids need to write to private area in normal (lookup) cases. In case we detect a writer (seqretry is true) we fall back to taking the readlock. The readlock is also used during dumps to ensure we get a consistent tree walk. Similar technique (rbtree+seqlock) was used by David Howells in rxrpc. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nf_tables: Allow object names of up to 255 chars	Phil Sutter	2017-07-31	1	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Same conversion as for table names, use NFT_NAME_MAXLEN as upper boundary as well. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nf_tables: Allow set names of up to 255 chars	Phil Sutter	2017-07-31	1	-4/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Same conversion as for table names, use NFT_NAME_MAXLEN as upper boundary as well. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nf_tables: Allow chain name of up to 255 chars	Phil Sutter	2017-07-31	2	-10/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Same conversion as for table names, use NFT_NAME_MAXLEN as upper boundary as well. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nf_tables: Allow table names of up to 255 chars	Phil Sutter	2017-07-31	2	-14/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Allocate all table names dynamically to allow for arbitrary lengths but introduce NFT_NAME_MAXLEN as an upper sanity boundary. It's value was chosen to allow using a domain name as per RFC 1035. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nf_tables: No need to check chain existence when tracing	Phil Sutter	2017-07-31	1	-8/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nft_trace_notify() is called only from __nft_trace_packet(), which assigns its parameter 'chain' to info->chain. __nft_trace_packet() in turn later dereferences 'chain' unconditionally, which indicates that it's never NULL. Same does nft_do_chain(), the only user of the tracing infrastructure. Hence it is safe to assume the check removed here is not needed. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nf_hook_ops structs can be const	Florian Westphal	2017-07-31	15	-15/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	We no longer place these on a list so they can be const. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nfnetlink_queue: don't queue dying conntracks to userspace	Florian Westphal	2017-07-31	1	-0/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When skb is queued to userspace it leaves softirq/rcu protection. skb->nfct (via conntrack extensions such as helper) could then reference modules that no longer exist if the conntrack was not yet confirmed. nf_ct_iterate_destroy() will set the DYING bit for unconfirmed conntracks, we therefore solve this race as follows: 1. take the queue spinlock. 2. check if the conntrack is unconfirmed and has dying bit set. In this case, we must discard skb while we're still inside rcu read-side section. 3. If nf_ct_iterate_destroy() is called right after the packet is queued to userspace, it will be removed from the queue via nf_ct_iterate_destroy -> nf_queue_nf_hook_drop. When userspace sends the verdict (nfnetlink takes rcu read lock), there are two cases to consider: 1. nf_ct_iterate_destroy() was called while packet was out. In this case, skb will have been removed from the queue already and no reinject takes place as we won't find a matching entry for the packet id. 2. nf_ct_iterate_destroy() gets called right after verdict callback found and removed the skb from queue list. In this case, skb->nfct is marked as dying but it is still valid. The skb will be dropped either in nf_conntrack_confirm (we don't insert DYING conntracks into hash table) or when we try to queue the skb again, but either events don't occur before the rcu read lock is dropped. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: conntrack: destroy functions need to free queued packets	Florian Westphal	2017-07-31	2	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	queued skbs might be using conntrack extensions that are being removed, such as timeout. This happens for skbs that have a skb->nfct in unconfirmed state (i.e., not in hash table yet). This is destructive, but there are only two use cases: - module removal (rare) - netns cleanup (most likely no conntracks exist, and if they do, they are removed anyway later on). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: add and use nf_ct_unconfirmed_destroy	Florian Westphal	2017-07-31	2	-4/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This also removes __nf_ct_unconfirmed_destroy() call from nf_ct_iterate_cleanup_net, so that function can be used only when missing conntracks from unconfirmed list isn't a problem. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: expect: add and use nf_ct_expect_iterate helpers	Florian Westphal	2017-07-31	3	-61/+90
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have several spots that open-code a expect walk, add a helper that is similar to nf_ct_iterate_destroy/nf_ct_iterate_cleanup. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: conntrack: Change to deferable work queue	subashab@codeaurora.org	2017-07-31	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Delayed workqueue causes wakeups to idle CPUs. This was causing a power impact for devices. Use deferable work queue instead so that gc_worker runs when CPU is active only. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nf_tables: add fib expression to the netdev family	Pablo M. Bermudo Garay	2017-07-31	3	-0/+97
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add fib expression support for netdev family. Like inet family, netdev delegates the actual decision to the corresponding backend, either ipv4 or ipv6. This allows to perform very early reverse path filtering, among other things. You can find more information about fib expression in the f6d0cbcf09c5 ("<netfilter: nf_tables: add fib expression>") commit message. Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nf_tables: fib: use skb_header_pointer	Pablo M. Bermudo Garay	2017-07-31	2	-10/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a preparatory patch for adding fib support to the netdev family. The netdev family receives the packets from ingress hook. At this point we have no guarantee that the ip header is linear. So this patch replaces ip_hdr with skb_header_pointer in order to address that possible situation. Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nf_tables: Attach process info to NFT_MSG_NEWGEN notifications	Phil Sutter	2017-07-24	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is helpful for 'nft monitor' to track which process caused a given change to the ruleset. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: Remove duplicated rcu_read_lock.	Taehee Yoo	2017-07-24	20	-128/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch removes duplicate rcu_read_lock(). 1. IPVS part: According to Julian Anastasov's mention, contexts of ipvs are described at: http://marc.info/?l=netfilter-devel&m=149562884514072&w=2, in summary: - packet RX/TX: does not need locks because packets come from hooks. - sync msg RX: backup server uses RCU locks while registering new connections. - ip_vs_ctl.c: configuration get/set, RCU locks needed. - xt_ipvs.c: It is a netfilter match, running from hook context. As result, rcu_read_lock and rcu_read_unlock can be removed from: - ip_vs_core.c: all - ip_vs_ctl.c: - only from ip_vs_has_real_service - ip_vs_ftp.c: all - ip_vs_proto_sctp.c: all - ip_vs_proto_tcp.c: all - ip_vs_proto_udp.c: all - ip_vs_xmit.c: all (contains only packet processing) 2. Netfilter part: There are three types of functions that are guaranteed the rcu_read_lock(). First, as result, functions are only called by nf_hook(): - nf_conntrack_broadcast_help(), pptp_expectfn(), set_expected_rtp_rtcp(). - tcpmss_reverse_mtu(), tproxy_laddr4(), tproxy_laddr6(). - match_lookup_rt6(), check_hlist(), hashlimit_mt_common(). - xt_osf_match_packet(). Second, functions that caller already held the rcu_read_lock(). - destroy_conntrack(), ctnetlink_conntrack_event(). - ctnl_timeout_find_get(), nfqnl_nf_hook_drop(). Third, functions that are mixed with type1 and type2. These functions are called by nf_hook() also these are called by ordinary functions that already held the rcu_read_lock(): - __ctnetlink_glue_build(), ctnetlink_expect_event(). - ctnetlink_proto_size(). Applied files are below: - nf_conntrack_broadcast.c, nf_conntrack_core.c, nf_conntrack_netlink.c. - nf_conntrack_pptp.c, nf_conntrack_sip.c, nfnetlink_cttimeout.c. - nfnetlink_queue.c, xt_TCPMSS.c, xt_TPROXY.c, xt_addrtype.c. - xt_connlimit.c, xt_hashlimit.c, xt_osf.c Detailed calltrace can be found at: http://marc.info/?l=netfilter-devel&m=149667610710350&w=2 Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
\| *	netfilter: nf_tables: keep chain counters away from hot path	Pablo Neira Ayuso	2017-07-24	2	-16/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These chain counters are only used by the iptables-compat tool, that allow users to use the x_tables extensions from the existing nf_tables framework. This patch makes nf_tables by ~5% for the general usecase, ie. native nft users, where no chain counters are used at all. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>