summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* bridge: use either ndo VLAN ops or switchdev VLAN ops to install MASTER vlansScott Feldman2015-06-151-2/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | v2: Move struct switchdev_obj automatics to inner scope where there used. v1: To maintain backward compatibility with the existing iproute2 "bridge vlan" command, let bridge's setlink/dellink handler call into either the port driver's 8021q ndo ops or the port driver's bridge_setlink/dellink ops. This allows port driver to choose 8021q ops or the newer bridge_setlink/dellink ops when implementing VLAN add/del filtering on the device. The iproute "bridge vlan" command does not need to be modified. To summarize using the "bridge vlan" command examples, we have: 1) bridge vlan add|del vid VID dev DEV Here iproute2 sets MASTER flag. Bridge's bridge_setlink/dellink is called. Vlan is set on bridge for port. If port driver implements ndo 8021q ops, call those to port driver can install vlan filter on device. Otherwise, if port driver implements bridge_setlink/dellink ops, call those to install vlan filter to device. This option only works if port is bridged. 2) bridge vlan add|del vid VID dev DEV master Same as 1) 3) bridge vlan add|del vid VID dev DEV self Bridge's bridge_setlink/dellink isn't called. Port driver's bridge_setlink/dellink is called, if implemented. This option works if port is bridged or not. If port is not bridged, a VLAN can still be added/deleted to device filter using this variant. 4) bridge vlan add|del vid VID dev DEV master self This is a combination of 1) and 3), but will only work if port is bridged. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bpf: allow networking programs to use bpf_trace_printk() for debuggingAlexei Starovoitov2015-06-151-0/+2
| | | | | | | | | | | bpf_trace_printk() is a helper function used to debug eBPF programs. Let socket and TC programs use it as well. Note, it's DEBUG ONLY helper. If it's used in the program, the kernel will print warning banner to make sure users don't use it in production. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bpf: introduce current->pid, tgid, uid, gid, comm accessorsAlexei Starovoitov2015-06-151-0/+6
| | | | | | | | | | | | | | | | | | | | | | | eBPF programs attached to kprobes need to filter based on current->pid, uid and other fields, so introduce helper functions: u64 bpf_get_current_pid_tgid(void) Return: current->tgid << 32 | current->pid u64 bpf_get_current_uid_gid(void) Return: current_gid << 32 | current_uid bpf_get_current_comm(char *buf, int size_of_buf) stores current->comm into buf They can be used from the programs attached to TC as well to classify packets based on current task fields. Update tracex2 example to print histogram of write syscalls for each process instead of aggregated for all. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2015-06-1535-1659/+1843
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next This a bit large (and late) patchset that contains Netfilter updates for net-next. Most relevantly br_netfilter fixes, ipset RCU support, removal of x_tables percpu ruleset copy and rework of the nf_tables netdev support. More specifically, they are: 1) Warn the user when there is a better protocol conntracker available, from Marcelo Ricardo Leitner. 2) Fix forwarding of IPv6 fragmented traffic in br_netfilter, from Bernhard Thaler. This comes with several patches to prepare the change in first place. 3) Get rid of special mtu handling of PPPoE/VLAN frames for br_netfilter. This is not needed anymore since now we use the largest fragment size to refragment, from Florian Westphal. 4) Restore vlan tag when refragmenting in br_netfilter, also from Florian. 5) Get rid of the percpu ruleset copy in x_tables, from Florian. Plus another follow up patch to refine it from Eric Dumazet. 6) Several ipset cleanups, fixes and finally RCU support, from Jozsef Kadlecsik. 7) Get rid of parens in Netfilter Kconfig files. 8) Attach the net_device to the basechain as opposed to the initial per table approach in the nf_tables netdev family. 9) Subscribe to netdev events to detect the removal and registration of a device that is referenced by a basechain. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * netfilter: nf_tables_netdev: unregister hooks on net_device removalPablo Neira Ayuso2015-06-152-4/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | In case the net_device is gone, we have to unregister the hooks and put back the reference on the net_device object. Once it comes back, register them again. This also covers the device rename case. This patch also adds a new flag to indicate that the basechain is disabled, so their hooks are not registered. This flag is used by the netdev family to handle the case where the net_device object is gone. Currently this flag is not exposed to userspace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nf_tables: add nft_register_basechain() and ↵Pablo Neira Ayuso2015-06-151-15/+37
| | | | | | | | | | | | | | | | nft_unregister_basechain() This wrapper functions take care of hook registration for basechains. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nf_tables: attach net_device to basechainPablo Neira Ayuso2015-06-151-37/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The device is part of the hook configuration, so instead of a global configuration per table, set it to each of the basechain that we create. This patch reworks ebddf1a8d78a ("netfilter: nf_tables: allow to bind table to net_device"). Note that this adds a dev_name field in the nft_base_chain structure which is required the netdev notification subscription that follows up in a patch to handle gone net_devices. Suggested-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: x_tables: remove XT_TABLE_INFO_SZ and a dereference.Eric Dumazet2015-06-154-26/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After Florian patches, there is no need for XT_TABLE_INFO_SZ anymore : Only one copy of table is kept, instead of one copy per cpu. We also can avoid a dereference if we put table data right after xt_table_info. It reduces register pressure and helps compiler. Then, we attempt a kmalloc() if total size is under order-3 allocation, to reduce TLB pressure, as in many cases, rules fit in 32 KB. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * Merge branch 'master' of git://blackhole.kfki.hu/nf-nextPablo Neira Ayuso2015-06-1521-1270/+1261
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jozsef Kadlecsik says: ==================== ipset patches for nf-next Please consider to apply the next bunch of patches for ipset. First comes the small changes, then the bugfixes and at the end the RCU related patches. * Use MSEC_PER_SEC consistently instead of the number. * Use SET_WITH_*() helpers to test set extensions from Sergey Popovich. * Check extensions attributes before getting extensions from Sergey Popovich. * Permit CIDR equal to the host address CIDR in IPv6 from Sergey Popovich. * Make sure we always return line number on batch in the case of error from Sergey Popovich. * Check CIDR value only when attribute is given from Sergey Popovich. * Fix cidr handling for hash:*net* types, reported by Jonathan Johnson. * Fix parallel resizing and listing of the same set so that the original set is kept for the whole dumping. * Make sure listing doesn't grab a set which is just being destroyed. * Remove rbtree from ip_set_hash_netiface.c in order to introduce RCU. * Replace rwlock_t with spinlock_t in "struct ip_set", change the locking in the core and simplifications in the timeout routines. * Introduce RCU locking in bitmap:* types with a slight modification in the logic on how an element is added. * Introduce RCU locking in hash:* types. This is the most complex part of the changes. * Introduce RCU locking in list type where standard rculist is used. * Fix coding styles reported by checkpatch.pl. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: ipset: Fix coding styles reported by checkpatch.plJozsef Kadlecsik2015-06-1421-289/+322
| | | | | | | | | | | | Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Introduce RCU locking in list typeJozsef Kadlecsik2015-06-141-209/+189
| | | | | | | | | | | | | | | | | | Standard rculist is used. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Introduce RCU locking in hash:* typesJozsef Kadlecsik2015-06-1412-236/+364
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Three types of data need to be protected in the case of the hash types: a. The hash buckets: standard rcu pointer operations are used. b. The element blobs in the hash buckets are stored in an array and a bitmap is used for book-keeping to tell which elements in the array are used or free. c. Networks per cidr values and the cidr values themselves are stored in fix sized arrays and need no protection. The values are modified in such an order that in the worst case an element testing is repeated once with the same cidr value. The ipset hash approach uses arrays instead of lists and therefore is incompatible with rhashtable. Performance is tested by Jesper Dangaard Brouer: Simple drop in FORWARD ~~~~~~~~~~~~~~~~~~~~~~ Dropping via simple iptables net-mask match:: iptables -t raw -N simple || iptables -t raw -F simple iptables -t raw -I simple -s 198.18.0.0/15 -j DROP iptables -t raw -D PREROUTING -j simple iptables -t raw -I PREROUTING -j simple Drop performance in "raw": 11.3Mpps Generator: sending 12.2Mpps (tx:12264083 pps) Drop via original ipset in RAW table ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Create a set with lots of elements:: sudo ./ipset destroy test echo "create test hash:ip hashsize 65536" > test.set for x in `seq 0 255`; do for y in `seq 0 255`; do echo "add test 198.18.$x.$y" >> test.set done done sudo ./ipset restore < test.set Dropping via ipset:: iptables -t raw -F iptables -t raw -N net198 || iptables -t raw -F net198 iptables -t raw -I net198 -m set --match-set test src -j DROP iptables -t raw -I PREROUTING -j net198 Drop performance in "raw" with ipset: 8Mpps Perf report numbers ipset drop in "raw":: + 24.65% ksoftirqd/1 [ip_set] [k] ip_set_test - 21.42% ksoftirqd/1 [kernel.kallsyms] [k] _raw_read_lock_bh - _raw_read_lock_bh + 99.88% ip_set_test - 19.42% ksoftirqd/1 [kernel.kallsyms] [k] _raw_read_unlock_bh - _raw_read_unlock_bh + 99.72% ip_set_test + 4.31% ksoftirqd/1 [ip_set_hash_ip] [k] hash_ip4_kadt + 2.27% ksoftirqd/1 [ixgbe] [k] ixgbe_fetch_rx_buffer + 2.18% ksoftirqd/1 [ip_tables] [k] ipt_do_table + 1.81% ksoftirqd/1 [ip_set_hash_ip] [k] hash_ip4_test + 1.61% ksoftirqd/1 [kernel.kallsyms] [k] __netif_receive_skb_core + 1.44% ksoftirqd/1 [kernel.kallsyms] [k] build_skb + 1.42% ksoftirqd/1 [kernel.kallsyms] [k] ip_rcv + 1.36% ksoftirqd/1 [kernel.kallsyms] [k] __local_bh_enable_ip + 1.16% ksoftirqd/1 [kernel.kallsyms] [k] dev_gro_receive + 1.09% ksoftirqd/1 [kernel.kallsyms] [k] __rcu_read_unlock + 0.96% ksoftirqd/1 [ixgbe] [k] ixgbe_clean_rx_irq + 0.95% ksoftirqd/1 [kernel.kallsyms] [k] __netdev_alloc_frag + 0.88% ksoftirqd/1 [kernel.kallsyms] [k] kmem_cache_alloc + 0.87% ksoftirqd/1 [xt_set] [k] set_match_v3 + 0.85% ksoftirqd/1 [kernel.kallsyms] [k] inet_gro_receive + 0.83% ksoftirqd/1 [kernel.kallsyms] [k] nf_iterate + 0.76% ksoftirqd/1 [kernel.kallsyms] [k] put_compound_page + 0.75% ksoftirqd/1 [kernel.kallsyms] [k] __rcu_read_lock Drop via ipset in RAW table with RCU-locking ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ With RCU locking, the RW-lock is gone. Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps Performance-tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Introduce RCU locking in bitmap:* typesJozsef Kadlecsik2015-06-144-14/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's nothing much required because the bitmap types use atomic bit operations. However the logic of adding elements slightly changed: first the MAC address updated (which is not atomic), then the element activated (added). The extensions may call kfree_rcu() therefore we call rcu_barrier() at module removal. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Prepare the ipset core to use RCU at set levelJozsef Kadlecsik2015-06-141-22/+22
| | | | | | | | | | | | | | | | | | | | | | | | Replace rwlock_t with spinlock_t in "struct ip_set" and change the locking accordingly. Convert the comment extension into an rcu-avare object. Also, simplify the timeout routines. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter:ipset Remove rbtree from hash:net,ifaceJozsef Kadlecsik2015-06-141-143/+20
| | | | | | | | | | | | | | | | | | Remove rbtree in order to introduce RCU instead of rwlock in ipset Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Make sure listing doesn't grab a set which is just being ↵Jozsef Kadlecsik2015-06-141-15/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | destroyed. There was a small window when all sets are destroyed and a concurrent listing of all sets could grab a set which is just being destroyed. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Fix parallel resizing and listing of the same setJozsef Kadlecsik2015-06-142-15/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When elements added to a hash:* type of set and resizing triggered, parallel listing could start to list the original set (before resizing) and "continue" with listing the new set. Fix it by references and using the original hash table for listing. Therefore the destroying of the original hash table may happen from the resizing or listing functions. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Fix cidr handling for hash:*net* typesJozsef Kadlecsik2015-06-147-34/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit "Simplify cidr handling for hash:*net* types" broke the cidr handling for the hash:*net* types when the sets were used by the SET target: entries with invalid cidr values were added to the sets. Reported by Jonathan Johnson. Testsuite entry is added to verify the fix. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Check CIDR value only when attribute is givenSergey Popovich2015-06-144-49/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no reason to check CIDR value regardless attribute specifying CIDR is given. Initialize cidr array in element structure on element structure declaration to let more freedom to the compiler to optimize initialization right before element structure is used. Remove local variables cidr and cidr2 for netnet and netportnet hashes as we do not use packed cidr value for such set types and can store value directly in e.cidr[]. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Make sure we always return line number on batchSergey Popovich2015-06-1415-75/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even if we return with generic IPSET_ERR_PROTOCOL it is good idea to return line number if we called in batch mode. Moreover we are not always exiting with IPSET_ERR_PROTOCOL. For example hash:ip,port,net may return IPSET_ERR_HASH_RANGE_UNSUPPORTED or IPSET_ERR_INVALID_CIDR. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Permit CIDR equal to the host address CIDR in IPv6Sergey Popovich2015-06-145-15/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Permit userspace to supply CIDR length equal to the host address CIDR length in netlink message. Prohibit any other CIDR length for IPv6 variant of the set. Also return -IPSET_ERR_HASH_RANGE_UNSUPPORTED instead of generic -IPSET_ERR_PROTOCOL in IPv6 variant of hash:ip,port,net when IPSET_ATTR_IP_TO attribute is given. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Check extensions attributes before getting extensions.Sergey Popovich2015-06-1416-170/+29
| | | | | | | | | | | | | | | | | | | | | | | | Make all extensions attributes checks within ip_set_get_extensions() and reduce number of duplicated code. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| | * netfilter: ipset: Use SET_WITH_*() helpers to test set extensionsSergey Popovich2015-06-142-7/+7
| | | | | | | | | | | | | | | Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| * | netfilter: Kconfig: get rid of parens around depends onPablo Neira Ayuso2015-06-153-11/+13
| |/ | | | | | | | | | | | | According to the reporter, they are not needed. Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: xtables: avoid percpu ruleset duplicationFlorian Westphal2015-06-124-147/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We store the rule blob per (possible) cpu. Unfortunately this means we can waste lot of memory on big smp machines. ipt_entry structure ('rule head') is 112 byte, so e.g. with maxcpu=64 one single rule eats close to 8k RAM. Since previous patch made counters percpu it appears there is nothing left in the rule blob that needs to be percpu. On my test system (144 possible cpus, 400k dummy rules) this change saves close to 9 Gigabyte of RAM. Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: xtables: use percpu rule countersFlorian Westphal2015-06-123-14/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The binary arp/ip/ip6tables ruleset is stored per cpu. The only reason left as to why we need percpu duplication are the rule counters embedded into ipt_entry et al -- since each cpu has its own copy of the rules, all counters can be lockless. The downside is that the more cpus are supported, the more memory is required. Rules are not just duplicated per online cpu but for each possible cpu, i.e. if maxcpu is 144, then rule is duplicated 144 times, not for the e.g. 64 cores present. To save some memory and also improve utilization of shared caches it would be preferable to only store the rule blob once. So we first need to separate counters and the rule blob. Instead of using entry->counters, allocate this percpu and store the percpu address in entry->counters.pcnt on CONFIG_SMP. This change makes no sense as-is; it is merely an intermediate step to remove the percpu duplication of the rule set in a followup patch. Suggested-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: bridge: restore vlan tag when refragmentingFlorian Westphal2015-06-121-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If bridge netfilter is used with both bridge-nf-call-iptables and bridge-nf-filter-vlan-tagged enabled then ip fragments in VLAN frames are sent without the vlan header. This has never worked reliably. Turns out this relied on pre-3.5 behaviour where skb frag_list was used to store ip fragments; ip_fragment() then re-used these skbs. But since commit 3cc4949269e01f39443d0fcfffb5bc6b47878d45 ("ipv4: use skb coalescing in defragmentation") this is no longer the case. ip_do_fragment now needs to allocate new skbs, but these don't contain the vlan tag information anymore. Fix it by storing vlan information of the ressembled skb in the br netfilter percpu frag area, and restore them for each of the fragments. Fixes: 3cc4949269e01f3 ("ipv4: use skb coalescing in defragmentation") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * net: ip_fragment: remove BRIDGE_NETFILTER mtu special handlingFlorian Westphal2015-06-122-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | since commit d6b915e29f4adea9 ("ip_fragment: don't forward defragmented DF packet") the largest fragment size is available in the IPCB. Therefore we no longer need to care about 'encapsulation' overhead of stripped PPPOE/VLAN headers since ip_do_fragment doesn't use device mtu in such cases. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: bridge: forward IPv6 fragmented packetsBernhard Thaler2015-06-123-42/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IPv6 fragmented packets are not forwarded on an ethernet bridge with netfilter ip6_tables loaded. e.g. steps to reproduce 1) create a simple bridge like this modprobe br_netfilter brctl addbr br0 brctl addif br0 eth0 brctl addif br0 eth2 ifconfig eth0 up ifconfig eth2 up ifconfig br0 up 2) place a host with an IPv6 address on each side of the bridge set IPv6 address on host A: ip -6 addr add fd01:2345:6789:1::1/64 dev eth0 set IPv6 address on host B: ip -6 addr add fd01:2345:6789:1::2/64 dev eth0 3) run a simple ping command on host A with packets > MTU ping6 -s 4000 fd01:2345:6789:1::2 4) wait some time and run e.g. "ip6tables -t nat -nvL" on the bridge IPv6 fragmented packets traverse the bridge cleanly until somebody runs. "ip6tables -t nat -nvL". As soon as it is run (and netfilter modules are loaded) IPv6 fragmented packets do not traverse the bridge any more (you see no more responses in ping's output). After applying this patch IPv6 fragmented packets traverse the bridge cleanly in above scenario. Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> [pablo@netfilter.org: small changes to br_nf_dev_queue_xmit] Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: bridge: re-order check_hbh_len()Bernhard Thaler2015-06-121-55/+56
| | | | | | | | | | | | | | | | Prepare check_hbh_len() to be called from newly introduced br_validate_ipv6() in next commit. Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: bridge: rename br_parse_ip_optionsBernhard Thaler2015-06-121-5/+4
| | | | | | | | | | | | | | | | | | | | | | br_parse_ip_options() does not parse any IP options, it validates IP packets as a whole and the function name is misleading. Rename br_parse_ip_options() to br_validate_ipv4() and remove unneeded commments. Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: bridge: refactor frag_max_sizeBernhard Thaler2015-06-122-14/+7
| | | | | | | | | | | | | | | | | | | | | | | | Currently frag_max_size is member of br_input_skb_cb and copied back and forth using IPCB(skb) and BR_INPUT_SKB_CB(skb) each time it is changed or used. Attach frag_max_size to nf_bridge_info and set value in pre_routing and forward functions. Use its value in forward and xmit functions. Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: bridge: detect NAT66 correctly and change MAC addressBernhard Thaler2015-06-122-9/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IPv4 iptables allows to REDIRECT/DNAT/SNAT any traffic over a bridge. e.g. REDIRECT $ sysctl -w net.bridge.bridge-nf-call-iptables=1 $ iptables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \ -j REDIRECT --to-ports 81 This does not work with ip6tables on a bridge in NAT66 scenario because the REDIRECT/DNAT/SNAT is not correctly detected. The bridge pre-routing (finish) netfilter hook has to check for a possible redirect and then fix the destination mac address. This allows to use the ip6tables rules for local REDIRECT/DNAT/SNAT REDIRECT similar to the IPv4 iptables version. e.g. REDIRECT $ sysctl -w net.bridge.bridge-nf-call-ip6tables=1 $ ip6tables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \ -j REDIRECT --to-ports 81 This patch makes it possible to use IPv6 NAT66 on a bridge. It was tested on a bridge with two interfaces using SNAT/DNAT NAT66 rules. Reported-by: Artie Hamilton <artiemhamilton@yahoo.com> Signed-off-by: Sven Eckelmann <sven@open-mesh.com> [bernhard.thaler@wvnet.at: rebased, add indirect call to ip6_route_input()] [bernhard.thaler@wvnet.at: rebased, split into separate patches] Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: bridge: re-order br_nf_pre_routing_finish_ipv6()Bernhard Thaler2015-06-121-31/+32
| | | | | | | | | | | | | | | | | | Put br_nf_pre_routing_finish_ipv6() after daddr_was_changed() and br_nf_pre_routing_finish_bridge() to prepare calling these functions from there. Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: bridge: refactor clearing BRNF_NF_BRIDGE_PREROUTINGBernhard Thaler2015-06-121-2/+2
| | | | | | | | | | | | | | | | use binary AND on complement of BRNF_NF_BRIDGE_PREROUTING to unset bit in nf_bridge->mask. Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: conntrack: warn the user if there is a better helper to useMarcelo Ricardo Leitner2015-06-121-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After db29a9508a92 ("netfilter: conntrack: disable generic tracking for known protocols"), if the specific helper is built but not loaded (a standard for most distributions) systems with a restrictive firewall but weak configuration regarding netfilter modules to load, will silently stop working. This patch then puts a warning message so the sysadmin knows where to start looking into. It's a pr_warn_once regardless of protocol itself but it should be enough to give a hint on where to look. Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | tcp: cdg: use div_u64()Kenneth Klette Jonassen2015-06-141-1/+1
| | | | | | | | | | | | | | Fixes cross-compile to mips. Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-06-138-25/+47
|\ \
| * | sctp: allow authenticating DATA chunks that are bundled with COOKIE_ECHOMarcelo Ricardo Leitner2015-06-121-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we can ask to authenticate DATA chunks and we can send DATA chunks on the same packet as COOKIE_ECHO, but if you try to combine both, the DATA chunk will be sent unauthenticated and peer won't accept it, leading to a communication failure. This happens because even though the data was queued after it was requested to authenticate DATA chunks, it was also queued before we could know that remote peer can handle authenticating, so sctp_auth_send_cid() returns false. The fix is whenever we set up an active key, re-check send queue for chunks that now should be authenticated. As a result, such packet will now contain COOKIE_ECHO + AUTH + DATA chunks, in that order. Reported-by: Liu Wei <weliu@redhat.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: don't wait for order-3 page allocationShaohua Li2015-06-112-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We saw excessive direct memory compaction triggered by skb_page_frag_refill. This causes performance issues and add latency. Commit 5640f7685831e0 introduces the order-3 allocation. According to the changelog, the order-3 allocation isn't a must-have but to improve performance. But direct memory compaction has high overhead. The benefit of order-3 allocation can't compensate the overhead of direct memory compaction. This patch makes the order-3 page allocation atomic. If there is no memory pressure and memory isn't fragmented, the alloction will still success, so we don't sacrifice the order-3 benefit here. If the atomic allocation fails, direct memory compaction will not be triggered, skb_page_frag_refill will fallback to order-0 immediately, hence the direct memory compaction overhead is avoided. In the allocation failure case, kswapd is waken up and doing compaction, so chances are allocation could success next time. alloc_skb_with_frags is the same. The mellanox driver does similar thing, if this is accepted, we must fix the driver too. V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric V2: make the changelog clearer Cc: Eric Dumazet <edumazet@google.com> Cc: Chris Mason <clm@fb.com> Cc: Debabrata Banerjee <dbavatar@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mpls: handle device renames for per-device sysctlsRobert Shearman2015-06-111-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a device is renamed and the original name is subsequently reused for a new device, the following warning is generated: sysctl duplicate entry: /net/mpls/conf/veth0//input CPU: 3 PID: 1379 Comm: ip Not tainted 4.1.0-rc4+ #20 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 0000000000000000 0000000000000000 ffffffff81566aaf 0000000000000000 ffffffff81236279 ffff88002f7d7f00 0000000000000000 ffff88000db336d8 ffff88000db33698 0000000000000005 ffff88002e046000 ffff8800168c9280 Call Trace: [<ffffffff81566aaf>] ? dump_stack+0x40/0x50 [<ffffffff81236279>] ? __register_sysctl_table+0x289/0x5a0 [<ffffffffa051a24f>] ? mpls_dev_notify+0x1ff/0x300 [mpls_router] [<ffffffff8108db7f>] ? notifier_call_chain+0x4f/0x70 [<ffffffff81470e72>] ? register_netdevice+0x2b2/0x480 [<ffffffffa0524748>] ? veth_newlink+0x178/0x2d3 [veth] [<ffffffff8147f84c>] ? rtnl_newlink+0x73c/0x8e0 [<ffffffff8147f27a>] ? rtnl_newlink+0x16a/0x8e0 [<ffffffff81459ff2>] ? __kmalloc_reserve.isra.30+0x32/0x90 [<ffffffff8147ccfd>] ? rtnetlink_rcv_msg+0x8d/0x250 [<ffffffff8145b027>] ? __alloc_skb+0x47/0x1f0 [<ffffffff8149badb>] ? __netlink_lookup+0xab/0xe0 [<ffffffff8147cc70>] ? rtnetlink_rcv+0x30/0x30 [<ffffffff8149e7a0>] ? netlink_rcv_skb+0xb0/0xd0 [<ffffffff8147cc64>] ? rtnetlink_rcv+0x24/0x30 [<ffffffff8149df17>] ? netlink_unicast+0x107/0x1a0 [<ffffffff8149e4be>] ? netlink_sendmsg+0x50e/0x630 [<ffffffff8145209c>] ? sock_sendmsg+0x3c/0x50 [<ffffffff81452beb>] ? ___sys_sendmsg+0x27b/0x290 [<ffffffff811bd258>] ? mem_cgroup_try_charge+0x88/0x110 [<ffffffff811bd5b6>] ? mem_cgroup_commit_charge+0x56/0xa0 [<ffffffff811d7700>] ? do_filp_open+0x30/0xa0 [<ffffffff8145336e>] ? __sys_sendmsg+0x3e/0x80 [<ffffffff8156c3f2>] ? system_call_fastpath+0x16/0x75 Fix this by unregistering the previous sysctl table (registered for the path containing the original device name) and re-registering the table for the path containing the new device name. Fixes: 37bde79979c3 ("mpls: Per-device enabling of packet input") Reported-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net, swap: Remove a warning and clarify why sk_mem_reclaim is required when ↵Mel Gorman2015-06-101-8/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | deactivating swap Jeff Layton reported the following; [ 74.232485] ------------[ cut here ]------------ [ 74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80() [ 74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio [ 74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5 [ 74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 74.245546] 0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d [ 74.246786] 0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa [ 74.248175] 0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8 [ 74.249483] Call Trace: [ 74.249872] [<ffffffff8179263d>] dump_stack+0x45/0x57 [ 74.250703] [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0 [ 74.251655] [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20 [ 74.252585] [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80 [ 74.253519] [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc] [ 74.254537] [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc] [ 74.255610] [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs] [ 74.256582] [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80 [ 74.257496] [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0 [ 74.258318] [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140 [ 74.259253] [<ffffffff81798dae>] system_call_fastpath+0x12/0x71 [ 74.260158] ---[ end trace 2530722966429f10 ]--- The warning in question was unnecessary but with Jeff's series the rules are also clearer. This patch removes the warning and updates the comment to explain why sk_mem_reclaim() may still be called. [jlayton: remove if (sk->sk_forward_alloc) conditional. As Leon points out that it's not needed.] Cc: Leon Romanovsky <leon@leon.nu> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | bridge: fix multicast router rlist endless loopNikolay Aleksandrov2015-06-101-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the addition of sysfs multicast router support if one set multicast_router to "2" more than once, then the port would be added to the hlist every time and could end up linking to itself and thus causing an endless loop for rlist walkers. So to reproduce just do: echo 2 > multicast_router; echo 2 > multicast_router; in a bridge port and let some igmp traffic flow, for me it hangs up in br_multicast_flood(). Fix this by adding a check in br_multicast_add_router() if the port is already linked. The reason this didn't happen before the addition of multicast_router sysfs entries is because there's a !hlist_unhashed check that prevents it. Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Fixes: 0909e11758bd ("bridge: Add multicast_router sysfs entries") Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tipc: disconnect socket directly after probe failureErik Hugne2015-06-101-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the TIPC connection timer expires in a probing state, a self abort message is supposed to be generated and delivered to the local socket. This is currently broken, and the abort message is actually sent out to the peer node with invalid addressing information. This will cause the link to enter a constant retransmission state and eventually reset. We fix this by removing the self-abort message creation and tear down connection immediately instead. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Revert "ipv6: Fix protocol resubmission"David S. Miller2015-06-101-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 0243508edd317ff1fa63b495643a7c192fbfcd92. It introduces new regressions. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | cfg80211: wext: clear sinfo struct before calling driverJohannes Berg2015-06-091-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Until recently, mac80211 overwrote all the statistics it could provide when getting called, but it now relies on the struct having been zeroed by the caller. This was always the case in nl80211, but wext used a static struct which could even cause values from one device leak to another. Using a static struct is OK (as even documented in a comment) since the whole usage of this function and its return value is always locked under RTNL. Not clearing the struct for calling the driver has always been wrong though, since drivers were free to only fill values they could report, so calling this for one device and then for another would always have leaked values from one to the other. Fix this by initializing the structure in question before the driver method call. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=99691 Cc: stable@vger.kernel.org Reported-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Reported-by: Alexander Kaltsas <alexkaltsas@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: tcp_v6_connect() cleanupEric Dumazet2015-06-121-2/+0
| | | | | | | | | | | | | | | | | | | | | Remove dead code from tcp_v6_connect() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | flow_dissector: fix ipv6 dst, hop-by-hop and routing ext hdrsEric Dumazet2015-06-121-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __skb_header_pointer() returns a pointer that must be checked. Fixes infinite loop reported by Alexei, and add __must_check to catch these errors earlier. Fixes: 6a74fcf426f5 ("flow_dissector: add support for dst, hop-by-hop and routing ext hdrs") Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | flow_dissector: add support for dst, hop-by-hop and routing ext hdrsTom Herbert2015-06-121-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | If dst, hop-by-hop or routing extension headers are present determine length of the options and skip over them in flow dissection. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | flow_dissector: Fix MPLS entropy label handling in flow dissectorTom Herbert2015-06-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Need to shift after masking to get label value for comparison. Fixes: b3baa0fbd02a1a9d493d8 ("mpls: Add MPLS entropy label in flow_keys") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud