summaryrefslogtreecommitdiffstats
path: root/net/ipv6
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2011-07-281-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits) tg3: Remove 5719 jumbo frames and TSO blocks tg3: Break larger frags into 4k chunks for 5719 tg3: Add tx BD budgeting code tg3: Consolidate code that calls tg3_tx_set_bd() tg3: Add partial fragment unmapping code tg3: Generalize tg3_skb_error_unmap() tg3: Remove short DMA check for 1st fragment tg3: Simplify tx bd assignments tg3: Reintroduce tg3_tx_ring_info ASIX: Use only 11 bits of header for data size ASIX: Simplify condition in rx_fixup() Fix cdc-phonet build bonding: reduce noise during init bonding: fix string comparison errors net: Audit drivers to identify those needing IFF_TX_SKB_SHARING cleared net: add IFF_SKB_TX_SHARED flag to priv_flags net: sock_sendmsg_nosec() is static forcedeth: fix vlans gianfar: fix bug caused by 87c288c6e9aa31720b72e2bc2d665e24e1653c3e gro: Only reset frag0 when skb can be pulled ...
| * ipv6: Do not leave router anycast address for /127 prefixes.YOSHIFUJI Hideaki2011-07-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Original commit 2bda8a0c8af... "Disable router anycast address for /127 prefixes" says: | No need for matching code in addrconf_leave_anycast() as it | will silently ignore any attempt to leave an unknown anycast | address. After analysis, because 1) we may add two or more prefixes on the same interface, or 2)user may have manually joined that anycast, we may hit chances to have anycast address which as if we had generated one by /127 prefix and we should not leave from subnet- router anycast address unconditionally. CC: Bjørn Mork <bjorn@mork.no> CC: Brian Haley <brian.haley@hp.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | atomic: use <linux/atomic.h>Arun Sharma2011-07-261-1/+1
|/ | | | | | | | | | | | | | This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ipv6: make fragment identifications less predictableEric Dumazet2011-07-212-6/+32
| | | | | | | | | | | | | | | | | | IPv6 fragment identification generation is way beyond what we use for IPv4 : It uses a single generator. Its not scalable and allows DOS attacks. Now inetpeer is IPv6 aware, we can use it to provide a more secure and scalable frag ident generator (per destination, instead of system wide) This patch : 1) defines a new secure_ipv6_id() helper 2) extends inet_getid() to provide 32bit results 3) extends ipv6_select_ident() with a new dest parameter Reported-by: Fernando Gont <fernando@gont.com.ar> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: unshare inetpeersEric Dumazet2011-07-211-13/+20
| | | | | | | | | | | | | | | | | We currently cow metrics a bit too soon in IPv6 case : All routes are tied to a single inetpeer entry. Change ip6_rt_copy() to get destination address as second argument, so that we fill rt6i_dst before the dst_copy_metrics() call. icmp6_dst_alloc() must set rt6i_dst before calling dst_metric_set(), or else the cow is done while rt6i_dst is still NULL. If orig route points to readonly metrics, we can share the pointer instead of performing the memory allocation and copy. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add ->neigh_lookup() operation to dst_opsDavid S. Miller2011-07-181-0/+7
| | | | | | | | In the future dst entries will be neigh-less. In that environment we need to have an easy transition point for current users of dst->neighbour outside of the packet output fast path. Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Abstract dst->neighbour accesses behind helpers.David S. Miller2011-07-176-31/+36
| | | | | | dst_{get,set}_neighbour() Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Get rid of rt6i_nexthop macro.David S. Miller2011-07-174-19/+19
| | | | | | | It just makes it harder to see 1) what the code is doing and 2) grep for all users of dst{->,.}neighbour Signed-off-by: David S. Miller <davem@davemloft.net>
* neigh: Pass neighbour entry to output ops.David S. Miller2011-07-171-3/+3
| | | | | | | | | | This will get us closer to being able to do "neigh stuff" completely independent of the underlying dst_entry for protocols (ipv4/ipv6) that wish to do so. We will also be able to make dst entries neigh-less. Signed-off-by: David S. Miller <davem@davemloft.net>
* neigh: Kill ndisc_ops->queue_xmitDavid S. Miller2011-07-161-4/+1
| | | | | | It is always dev_queue_xmit(). Signed-off-by: David S. Miller <davem@davemloft.net>
* neigh: Kill neigh_ops->hh_outputDavid S. Miller2011-07-161-3/+0
| | | | | | It's always dev_queue_xmit(). Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Create and use new helper, neigh_output().David S. Miller2011-07-161-7/+3
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Use calculated 'neigh' instead of re-evaluating dst->neighbourDavid S. Miller2011-07-161-1/+1
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Embed hh_cache inside of struct neighbour.David S. Miller2011-07-141-5/+9
| | | | | | | | | | | | | | | Now that there is a one-to-one correspondance between neighbour and hh_cache entries, we no longer need: 1) dynamic allocation 2) attachment to dst->hh 3) refcounting Initialization of the hh_cache entry is indicated by hh_len being non-zero, and such initialization is always done with the neighbour's lock held as a writer. Signed-off-by: David S. Miller <davem@davemloft.net>
* Disable router anycast address for /127 prefixesBjørn Mork2011-07-071-0/+2
| | | | | | | | | | | | RFC 6164 requires that routers MUST disable Subnet-Router anycast for the prefix when /127 prefixes are used. No need for matching code in addrconf_leave_anycast() as it will silently ignore any attempt to leave an unknown anycast address. Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2011-07-053-18/+14
|\ | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
| * net: bind() fix error return on wrong address familyMarcus Meissner2011-07-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hi, Reinhard Max also pointed out that the error should EAFNOSUPPORT according to POSIX. The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN. Other protocols error values in their af bind() methods in current mainline git as far as a brief look shows: EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25, No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip Ciao, Marcus Signed-off-by: Marcus Meissner <meissner@suse.de> Cc: Reinhard Max <max@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: Don't put artificial limit on routing table size.David S. Miller2011-07-011-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IPV6, unlike IPV4, doesn't have a routing cache. Routing table entries, as well as clones made in response to route lookup requests, all live in the same table. And all of these things are together collected in the destination cache table for ipv6. This means that routing table entries count against the garbage collection limits, even though such entries cannot ever be reclaimed and are added explicitly by the administrator (rather than being created in response to lookups). Therefore it makes no sense to count ipv6 routing table entries against the GC limits. Add a DST_NOCOUNT destination cache entry flag, and skip the counting if it is set. Use this flag bit in ipv6 when adding routing table entries. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: Don't change dst->flags using assignments.David S. Miller2011-07-011-10/+2
| | | | | | | | | | | | This blows away any flags already set in the entry. Signed-off-by: David S. Miller <davem@davemloft.net>
| * udp/recvmsg: Clear MSG_TRUNC flag when starting over for a new packetXufeng Zhang2011-06-211-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Consider this scenario: When the size of the first received udp packet is bigger than the receive buffer, MSG_TRUNC bit is set in msg->msg_flags. However, if checksum error happens and this is a blocking socket, it will goto try_again loop to receive the next packet. But if the size of the next udp packet is smaller than receive buffer, MSG_TRUNC flag should not be set, but because MSG_TRUNC bit is not cleared in msg->msg_flags before receive the next packet, MSG_TRUNC is still set, which is wrong. Fix this problem by clearing MSG_TRUNC flag when starting over for a new packet. Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6/udp: Use the correct variable to determine non-blocking conditionXufeng Zhang2011-06-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | udpv6_recvmsg() function is not using the correct variable to determine whether or not the socket is in non-blocking operation, this will lead to unexpected behavior when a UDP checksum error occurs. Consider a non-blocking udp receive scenario: when udpv6_recvmsg() is called by sock_common_recvmsg(), MSG_DONTWAIT bit of flags variable in udpv6_recvmsg() is cleared by "flags & ~MSG_DONTWAIT" in this call: err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); i.e. with udpv6_recvmsg() getting these values: int noblock = flags & MSG_DONTWAIT int flags = flags & ~MSG_DONTWAIT So, when udp checksum error occurs, the execution will go to csum_copy_err, and then the problem happens: csum_copy_err: ............... if (flags & MSG_DONTWAIT) return -EAGAIN; goto try_again; ............... But it will always go to try_again as MSG_DONTWAIT has been cleared from flags at call time -- only noblock contains the original value of MSG_DONTWAIT, so the test should be: if (noblock) return -EAGAIN; This is also consistent with what the ipv4/udp code does. Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Reduce switch/case indentJoe Perches2011-07-011-76/+69
| | | | | | | | | | | | | | | | | | | | | | Make the case labels the same indent as the switch. git diff -w shows 80 column reflowing, removal of a useless break after return, and moving open brace after case instead of separate line. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of ↵David S. Miller2011-06-205-4/+11
|\ \ | |/ | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-agn-rxon.c drivers/net/wireless/rtlwifi/pci.c net/netfilter/ipvs/ip_vs_core.c
| * net: rfs: enable RFS before first data packet is receivedEric Dumazet2011-06-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit : > From: Ben Hutchings <bhutchings@solarflare.com> > Date: Fri, 17 Jun 2011 00:50:46 +0100 > > > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote: > >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) > >> goto discard; > >> > >> if (nsk != sk) { > >> + sock_rps_save_rxhash(nsk, skb->rxhash); > >> if (tcp_child_process(sk, nsk, skb)) { > >> rsk = nsk; > >> goto reset; > >> > > > > I haven't tried this, but it looks reasonable to me. > > > > What about IPv6? The logic in tcp_v6_do_rcv() looks very similar. > > Indeed ipv6 side needs the same fix. > > Eric please add that part and resubmit. And in fact I might stick > this into net-2.6 instead of net-next-2.6 > OK, here is the net-2.6 based one then, thanks ! [PATCH v2] net: rfs: enable RFS before first data packet is received First packet received on a passive tcp flow is not correctly RFS steered. One sock_rps_record_flow() call is missing in inet_accept() But before that, we also must record rxhash when child socket is setup. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Tom Herbert <therbert@google.com> CC: Ben Hutchings <bhutchings@solarflare.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@conan.davemloft.net>
| * netfilter: fix looped (broad|multi)cast's MAC handlingNicolas Cavallari2011-06-161-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By default, when broadcast or multicast packet are sent from a local application, they are sent to the interface then looped by the kernel to other local applications, going throught netfilter hooks in the process. These looped packet have their MAC header removed from the skb by the kernel looping code. This confuse various netfilter's netlink queue, netlink log and the legacy ip_queue, because they try to extract a hardware address from these packets, but extracts a part of the IP header instead. This patch prevent NFQUEUE, NFLOG and ip_QUEUE to include a MAC header if there is none in the packet. Signed-off-by: Nicolas Cavallari <cavallar@lri.fr> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * net/ipv6: check for mistakenly passed in non-AF_INET6 sockaddrsMarcus Meissner2011-06-061-0/+4
| | | | | | | | | | | | | | | | | | | | | | Same check as for IPv4, also do for IPv6. (If you passed in a IPv4 sockaddr_in here, the sizeof check in the line before would have triggered already though.) Signed-off-by: Marcus Meissner <meissner@suse.de> Cc: Reinhard Max <max@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * netfilter: use unsigned variables for packet lengths in ip[6]_queue.Dave Jones2011-06-061-1/+2
| | | | | | | | | | | | | | Netlink message lengths can't be negative, so use unsigned variables. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nf_conntrack: fix ct refcount leak in l4proto->error()Pablo Neira Ayuso2011-06-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a refcount leak of ct objects that may occur if l4proto->error() assigns one conntrack object to one skbuff. In that case, we have to skip further processing in nf_conntrack_in(). With this patch, we can also fix wrong return values (-NF_ACCEPT) for special cases in ICMP[v6] that should not bump the invalid/error statistic counters. Reported-by: Zoltan Menyhart <Zoltan.Menyhart@bull.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: add more values to enum ip_conntrack_infoEric Dumazet2011-06-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Following error is raised (and other similar ones) : net/ipv4/netfilter/nf_nat_standalone.c: In function ‘nf_nat_fn’: net/ipv4/netfilter/nf_nat_standalone.c:119:2: warning: case value ‘4’ not in enumerated type ‘enum ip_conntrack_info’ gcc barfs on adding two enum values and getting a not enumerated result : case IP_CT_RELATED+IP_CT_IS_REPLY: Add missing enum values Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: David Miller <davem@davemloft.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | rtnetlink: Compute and store minimum ifinfo dump sizeGreg Rose2011-06-095-14/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The message size allocated for rtnl ifinfo dumps was limited to a single page. This is not enough for additional interface info available with devices that support SR-IOV and caused a bug in which VF info would not be displayed if more than approximately 40 VFs were created per interface. Implement a new function pointer for the rtnl_register service that will calculate the amount of data required for the ifinfo dump and allocate enough data to satisfy the request. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
* | tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open sideJerry Chu2011-06-082-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch lowers the default initRTO from 3secs to 1sec per RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet has been retransmitted, AND the TCP timestamp option is not on. It also adds support to take RTT sample during 3WHS on the passive open side, just like its active open counterpart, and uses it, if valid, to seed the initRTO for the data transmission phase. The patch also resets ssthresh to its initial default at the beginning of the data transmission phase, and reduces cwnd to 1 if there has been MORE THAN ONE retransmission during 3WHS per RFC5681. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: generate link local address for GRE tunnelstephen hemminger2011-06-081-0/+35
|/ | | | | | | | Use same logic as SIT tunnel to handle link local address for GRE tunnel. OSPFv3 requires link-local address to function. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert %p usage to %pKDan Rosenberg2011-05-243-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The %pK format specifier is designed to hide exposed kernel pointers, specifically via /proc interfaces. Exposing these pointers provides an easy target for kernel write vulnerabilities, since they reveal the locations of writable structures containing easily triggerable function pointers. The behavior of %pK depends on the kptr_restrict sysctl. If kptr_restrict is set to 0, no deviation from the standard %p behavior occurs. If kptr_restrict is set to 1, the default, if the current user (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG (currently in the LSM tree), kernel pointers using %pK are printed as 0's. If kptr_restrict is set to 2, kernel pointers using %pK are printed as 0's regardless of privileges. Replacing with 0's was chosen over the default "(null)", which cannot be parsed by userland %p, which expects "(nil)". The supporting code for kptr_restrict and %pK are currently in the -mm tree. This patch converts users of %p in net/ to %pK. Cases of printing pointers to the syslog are not covered, since this would eliminate useful information for postmortem debugging and the reading of the syslog is already optionally protected by the dmesg_restrict sysctl. Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Cc: James Morris <jmorris@namei.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Thomas Graf <tgraf@infradead.org> Cc: Eugene Teo <eugeneteo@kernel.org> Cc: Kees Cook <kees.cook@canonical.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Fix return of xfrm6_tunnel_rcv()David S. Miller2011-05-241-1/+1
| | | | | | Like ipv4, just return xfrm6_rcv_spi()'s return value directly. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: copy prefsrc setting when copying route entryFlorian Westphal2011-05-211-0/+1
| | | | | | | | | | | commit c3968a857a6b6c3d2ef4ead35776b055fb664d74 ('ipv6: RTA_PREFSRC support for ipv6 route source address selection') added support for ipv6 prefsrc as an alternative to ipv6 addrlabels, but it did not work because the prefsrc entry was not copied. Cc: Daniel Walter <sahne@0x90.at> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6Linus Torvalds2011-05-2030-295/+404
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits) macvlan: fix panic if lowerdev in a bond tg3: Add braces around 5906 workaround. tg3: Fix NETIF_F_LOOPBACK error macvlan: remove one synchronize_rcu() call networking: NET_CLS_ROUTE4 depends on INET irda: Fix error propagation in ircomm_lmp_connect_response() irda: Kill set but unused variable 'bytes' in irlan_check_command_param() irda: Kill set but unused variable 'clen' in ircomm_connect_indication() rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport() be2net: Kill set but unused variable 'req' in lancer_fw_download() irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication() atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined. rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer(). rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler() rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection() rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window() pkt_sched: Kill set but unused variable 'protocol' in tc_classify() isdn: capi: Use pr_debug() instead of ifdefs. tg3: Update version to 3.119 tg3: Apply rx_discards fix to 5719/5720 ... Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c as per Davem.
| * ipv6: reduce per device ICMP mib sizesEric Dumazet2011-05-192-26/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ipv6 has per device ICMP SNMP counters, taking too much space because they use percpu storage. needed size per device is : (512+4)*sizeof(long)*number_of_possible_cpus*2 On a 32bit kernel, 16 possible cpus, this wastes more than 64kbytes of memory per ipv6 enabled network device, taken in vmalloc pool. Since ICMP messages are rare, just use shared counters (atomic_long_t) Per network space ICMP counters are still using percpu memory, we might also convert them to shared counters in a future patch. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'master' of ↵David S. Miller2011-05-113-4/+7
| |\ | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-3.6 Conflicts: drivers/net/benet/be_main.c
| * | inet: Pass flowi to ->queue_xmit().David S. Miller2011-05-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows us to acquire the exact route keying information from the protocol, however that might be managed. It handles all of the possibilities, from the simplest case of storing the key in inet->cork.fl to the more complex setup SCTP has where individual transports determine the flow. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | inet: Decrease overhead of on-stack inet_cork.David S. Miller2011-05-062-18/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we fast path datagram sends to avoid locking by putting the inet_cork on the stack we use up lots of space that isn't necessary. This is because inet_cork contains a "struct flowi" which isn't used in these code paths. Split inet_cork to two parts, "inet_cork" and "inet_cork_full". Only the latter of which has the "struct flowi" and is what is stored in inet_sock. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
| * | Merge branch 'master' of ↵David S. Miller2011-05-052-2/+2
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/tg3.c
| * | | net: call dev_alloc_name from register_netdeviceJiri Pirko2011-05-052-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Force dev_alloc_name() to be called from register_netdevice() by dev_get_valid_name(). That allows to remove multiple explicit dev_alloc_name() calls. The possibility to call dev_alloc_name in advance remains. This also fixes veth creation regresion caused by 84c49d8c3e4abefb0a41a77b25aa37ebe8d6b743 Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv6: Use flowi4->{daddr,saddr} in ipip6_tunnel_xmit().David S. Miller2011-05-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Instead of rt->rt_{dst,src} Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv4: Make caller provide on-stack flow key to ip_route_output_ports().David S. Miller2011-05-032-4/+7
| | | | | | | | | | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: dont hold rtnl mutex during netlink dump callbacksEric Dumazet2011-05-021-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Four years ago, Patrick made a change to hold rtnl mutex during netlink dump callbacks. I believe it was a wrong move. This slows down concurrent dumps, making good old /proc/net/ files faster than rtnetlink in some situations. This occurred to me because one "ip link show dev ..." was _very_ slow on a workload adding/removing network devices in background. All dump callbacks are able to use RCU locking now, so this patch does roughly a revert of commits : 1c2d670f366 : [RTNETLINK]: Hold rtnl_mutex during netlink dump callbacks 6313c1e0992 : [RTNETLINK]: Remove unnecessary locking in dump callbacks This let writers fight for rtnl mutex and readers going full speed. It also takes care of phonet : phonet_route_get() is now called from rcu read section. I renamed it to phonet_route_get_rcu() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Remi Denis-Courmont <remi.denis-courmont@nokia.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv4, ipv6, bonding: Restore control over number of peer notificationsBen Hutchings2011-04-291-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For backward compatibility, we should retain the module parameters and sysfs attributes to control the number of peer notifications (gratuitous ARPs and unsolicited NAs) sent after bonding failover. Also, it is possible for failover to take place even though the new active slave does not have link up, and in that case the peer notification should be deferred until it does. Change ipv4 and ipv6 so they do not automatically send peer notifications on bonding failover. Change the bonding driver to send separate NETDEV_NOTIFY_PEERS notifications when the link is up, as many times as requested. Since it does not directly control which protocols send notifications, make num_grat_arp and num_unsol_na aliases for a single parameter. Bump the bonding version number and update its documentation. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Acked-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: Use non-zero allocations in dst_alloc().David S. Miller2011-04-281-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make dst_alloc() and it's users explicitly initialize the entire entry. The zero'ing done by kmem_cache_zalloc() was almost entirely redundant. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: Make dst_alloc() take more explicit initializations.David S. Miller2011-04-281-18/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Now the dst->dev, dev->obsolete, and dst->flags values can be specified as well. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net:use help function of skb_checksum_start_offset to calculate offsetShan Wei2011-04-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Although these are equivalent, but the skb_checksum_start_offset() is more readable. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | inet: add RCU protection to inet->optEric Dumazet2011-04-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We lack proper synchronization to manipulate inet->opt ip_options Problem is ip_make_skb() calls ip_setup_cork() and ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options), without any protection against another thread manipulating inet->opt. Another thread can change inet->opt pointer and free old one under us. Use RCU to protect inet->opt (changed to inet->inet_opt). Instead of handling atomic refcounts, just copy ip_options when necessary, to avoid cache line dirtying. We cant insert an rcu_head in struct ip_options since its included in skb->cb[], so this patch is large because I had to introduce a new ip_options_rcu structure. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud