op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message	Author	Age	Files	Lines
*	netns xfrm: per-netns sysctls	Alexey Dobriyan	2008-11-25	3	-5/+20
	Make the net.core.xfrm_aevent_etime, net.core.xfrm_acq_expires, net.core.xfrm_aevent_rseqth and net.core.xfrm_larval_drop sysctls per-netns.

	For that make net_core_path[] global, register it to prevent two /proc/net/core entries and change initcall position -- xfrm_init() is called from fs_initcall, so this one should be fs_initcall at least.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: /proc/net/xfrm_stat in netns	Alexey Dobriyan	2008-11-25	1	-1/+2
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns MIBs	Alexey Dobriyan	2008-11-25	2	-7/+9
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: ->get_saddr in netns	Alexey Dobriyan	2008-11-25	1	-1/+1
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: ->dst_lookup in netns	Alexey Dobriyan	2008-11-25	1	-1/+2
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: KM reporting in netns	Alexey Dobriyan	2008-11-25	1	-2/+2
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: pass netns with KM notifications	Alexey Dobriyan	2008-11-25	1	-0/+1
	SA and SPD flush are executed with NULL SA and SPD respectively; for these cases pass netns explicitly from the userspace socket.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns NETLINK_XFRM socket	Alexey Dobriyan	2008-11-25	2	-3/+6
	Stub senders to init_net's one temporarily.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: dst garbage-collecting in netns	Alexey Dobriyan	2008-11-25	1	-1/+1
	Pass netns pointer to struct xfrm_policy_afinfo::garbage_collect()

	[This needs more thoughts on what to do with dst_ops]
	[Currently stub to init_net]

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: xfrm_route_forward() in netns	Alexey Dobriyan	2008-11-25	1	-1/+3
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: xfrm_policy_check in netns	Alexey Dobriyan	2008-11-25	1	-1/+2
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: lookup in netns	Alexey Dobriyan	2008-11-25	2	-12/+13
	Pass netns to xfrm_lookup()/__xfrm_lookup(). For that pass netns to flow_cache_lookup() and resolver callback.

	Take it from socket or netdevice. Stub DECnet to init_net.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: policy walking in netns	Alexey Dobriyan	2008-11-25	1	-1/+1
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: finding policy in netns	Alexey Dobriyan	2008-11-25	1	-2/+2
	Add netns parameter to xfrm_policy_bysel_ctx(), xfrm_policy_byidx().

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: policy flushing in netns	Alexey Dobriyan	2008-11-25	1	-1/+1
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: state walking in netns	Alexey Dobriyan	2008-11-25	1	-1/+1
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: finding states in netns	Alexey Dobriyan	2008-11-25	1	-3/+4
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: state lookup in netns	Alexey Dobriyan	2008-11-25	1	-2/+2
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: state flush in netns	Alexey Dobriyan	2008-11-25	1	-1/+1
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns policy hash resizing work	Alexey Dobriyan	2008-11-25	1	-0/+1
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns policy counts	Alexey Dobriyan	2008-11-25	2	-4/+3
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns xfrm_policy_bydst hash	Alexey Dobriyan	2008-11-25	1	-0/+6
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns inexact policies	Alexey Dobriyan	2008-11-25	1	-0/+2
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns xfrm_policy_byidx hashmask	Alexey Dobriyan	2008-11-25	1	-0/+1
	Per-netns hashes are independently resizeable.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns xfrm_policy_byidx hash	Alexey Dobriyan	2008-11-25	1	-0/+1
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns policy list	Alexey Dobriyan	2008-11-25	1	-0/+2
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: add struct xfrm_policy::xp_net	Alexey Dobriyan	2008-11-25	1	-1/+9
	Again, to avoid complications with passing netns when not necessary. Again, ->xp_net is a set-once field: once set, it never changes.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns km_waitq	Alexey Dobriyan	2008-11-25	2	-1/+3
	Disallow spurious wakeups in __xfrm_lookup().

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns state GC work	Alexey Dobriyan	2008-11-25	1	-0/+1
	State GC is per-netns, and this is part of it.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns state GC list	Alexey Dobriyan	2008-11-25	1	-0/+1
	km_waitq is going to be made per-netns to disallow spurious wakeups in __xfrm_lookup().

	To not wake up after every garbage-collected xfrm_state (which potentially can be from a different netns), make the state GC list per-netns.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns xfrm_hash_work	Alexey Dobriyan	2008-11-25	1	-0/+2
	All of this implicitly passes which netns's hashes should be resized.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns xfrm_state counts	Alexey Dobriyan	2008-11-25	1	-0/+1
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns xfrm_state_hmask	Alexey Dobriyan	2008-11-25	1	-0/+1
	Since hashtables are per-netns, they can be independently resized.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns xfrm_state_byspi hash	Alexey Dobriyan	2008-11-25	1	-0/+1
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns xfrm_state_bysrc hash	Alexey Dobriyan	2008-11-25	1	-0/+1
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns xfrm_state_bydst hash	Alexey Dobriyan	2008-11-25	1	-0/+9
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: per-netns xfrm_state_all list	Alexey Dobriyan	2008-11-25	1	-0/+3
	This is done to get
	a) simple "something leaked" check
	b) cover possible DoSes when other netns puts many, many xfrm_states onto a list.
	c) not miss "alien xfrm_state" check in some of list iterators in future.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: add struct xfrm_state::xs_net	Alexey Dobriyan	2008-11-25	1	-1/+9
	To avoid unnecessary complications with passing netns around.

	* set once, very early after allocating
	* once set, never changes

	For a while create every xfrm_state in init_net.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: add netns boilerplate	Alexey Dobriyan	2008-11-25	3	-1/+13
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	tcp: tcp_limit_reno_sacked can become static	Ilpo Järvinen	2008-11-25	1	-2/+0
	Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	DCB: fix kconfig option	Jeff Kirsher	2008-11-25	1	-2/+2
	Since the netlink option for DCB is necessary to actually be useful, simplified the Kconfig option. In addition, added useful help text for the Kconfig option.

	Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	netdev: add HAVE_NET_DEVICE_OPS	Stephen Hemminger	2008-11-25	1	-0/+1
	As a concession to vendors who have to deal with one source for different kernel versions, add a HAVE_NET_DEVICE_OPS define so they don't end up hard-coding ifdefs against the kernel version.

	Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	tcp: add some mibs to track collapsing	Ilpo Järvinen	2008-11-24	1	-0/+3
	Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	tcp: Try to restore large SKBs while SACK processing	Ilpo Järvinen	2008-11-24	2	-0/+38
	During SACK processing, most of the benefits of TSO are eaten by the SACK blocks that one-by-one fragment SKBs to MSS sized chunks. Then we're in problems when cleanup work for them has to be done when a large cumulative ACK comes. Try to return back to pre-split state already while more and more SACK info gets discovered by combining newly discovered SACK areas with the previous skb if that's SACKed as well.

	This approach has a number of benefits:

	1) The processing overhead is spread more equally over the RTT
	2) Write queue has less skbs to process (affects everything which has to walk in the queue past the sacked areas)
	3) Write queue is consistent the whole time, so no other parts of TCP have to be aware of this (this was not the case with some other approach that was, well, quite intrusive all around).
	4) Clean_rtx_queue can release most of the pages using single put_page instead of previous PAGE_SIZE/mss+1 calls

	In case a hole is fully filled by the new SACK block, we attempt to combine the next skb too which allows construction of skbs that are even larger than what tso split them to and it handles hole per on every nth patterns that often occur during slow start overshoot pretty nicely. Though this to be really useful also a retransmission would have to get lost since cumulative ACKs advance one hole at a time in the most typical case.

	TODO: handle upwards only merging. That should be rather easy when segment is fully sacked but I'm leaving that as future work item (it won't make very large difference anyway since this current approach already covers quite a lot of normal cases).

	I was earlier thinking of some sophisticated way of tracking timestamps of the first and the last segment but later on realized that it won't be that necessary at all to store the timestamp of the last segment. The cases that can occur are basically either:

	1) ambiguous => no sensible measurement can be taken anyway
	2) non-ambiguous is due to reordering => having the timestamp of the last segment there is just skewing things more off than does some good since the ack got triggered by one of the holes (besides some subtle issues that would make determining right hole/skb even harder problem). Anyway, it has nothing to do with this change then.

	I choose to route some abnormal looking cases with goto noop, some could be handled differently (eg., by stopping the walking at that skb but again). In general, they either shouldn't happen at all or are rare enough to make no difference in practice.

	In theory this change (as whole) could cause some macroscale regression (global) because of cache misses that are taken over the round-trip time but it gets very likely better because of much less (local) cache misses per other write queue walkers and the big recovery clearing cumulative ack.

	Worth to note that these benefits would be very easy to get also without TSO/GSO being on as long as the data is in pages so that we can merge them. Currently I won't let that happen because DSACK splitting at fragment that would mess up pcounts due to sk_can_gso in tcp_set_skb_tso_segs. Once DSACKs fragments gets avoided, we have some conditions that can be made less strict.

	TODO: I will probably have to convert the excessive pointer passing to struct sacktag_state... :-)

	My testing revealed that considerable amount of skbs couldn't be shifted because they were cloned (most likely still awaiting tx reclaim)...

	[The rest is considering future work instead since I got repeatably EFAULT to tcpdump's recvfrom when I added pskb_expand_head to deal with clones, so I separated that into another, later patch]

	...To counter that, I gave up on the fifth advantage:

	5) When growing previous SACK block, less allocs for new skbs are done, basically a new alloc is needed only when new hole is detected and when the previous skb runs out of frags space

	...which now only happens if reclaim is fast enough to dispose the clone before the SACK block comes in (the window is RTT long), otherwise we'll have to alloc some.

	With clones being handled I got these numbers (will be somewhat worse without that), taken with fine-grained mibs:

	TCPSackShifted 398
	TCPSackMerged 877
	TCPSackShiftFallback 320
	TCPSACKCOLLAPSEFALLBACKGSO 0
	TCPSACKCOLLAPSEFALLBACKSKBBITS 0
	TCPSACKCOLLAPSEFALLBACKSKBDATA 0
	TCPSACKCOLLAPSEFALLBACKBELOW 0
	TCPSACKCOLLAPSEFALLBACKFIRST 1
	TCPSACKCOLLAPSEFALLBACKPREVBITS 318
	TCPSACKCOLLAPSEFALLBACKMSS 1
	TCPSACKCOLLAPSEFALLBACKNOHEAD 0
	TCPSACKCOLLAPSEFALLBACKSHIFT 0
	TCPSACKCOLLAPSENOOPSEQ 0
	TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
	TCPSACKCOLLAPSENOOPSMALLLEN 0
	TCPSACKCOLLAPSEHOLE 12

	Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
	Signed-off-by: David S. Miller <davem@davemloft.net>
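
The merging described in the commit message above can be illustrated with a small userspace sketch. This is not the kernel code: the write queue is modelled as a plain array of sequence ranges, and `struct seg`, `absorb_next()` and `sack_and_merge()` are hypothetical names invented for this illustration. Real skbs carry page fragments, and the actual patch shifts data between them. The sketch only merges a newly SACKed segment into an already SACKed predecessor, and then into the successor if the hole is fully filled, which matches the message's note that upwards-only merging is left as a TODO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an skb on the write queue. */
struct seg {
	uint32_t start;   /* first sequence number covered */
	uint32_t end;     /* one past the last sequence number */
	int sacked;       /* nonzero once the peer has SACKed it */
	int pcount;       /* number of MSS-sized packets it represents */
};

/* Fold q[k + 1] into q[k] and close the gap in the array. */
void absorb_next(struct seg *q, size_t *len, size_t k)
{
	assert(k + 1 < *len);
	q[k].end = q[k + 1].end;
	q[k].pcount += q[k + 1].pcount;
	memmove(&q[k + 1], &q[k + 2], (*len - k - 2) * sizeof(*q));
	(*len)--;
}

/*
 * Mark q[i] as SACKed. If the previous segment is SACKed and contiguous,
 * merge into it; if that fills the hole up to an already SACKed successor,
 * merge that one too. Returns the new queue length.
 */
size_t sack_and_merge(struct seg *q, size_t len, size_t i)
{
	assert(i < len);
	q[i].sacked = 1;

	/* No SACKed predecessor: nothing to merge into in this sketch. */
	if (i == 0 || !q[i - 1].sacked || q[i - 1].end != q[i].start)
		return len;

	absorb_next(q, &len, i - 1);
	/* q[i] is now the old successor, if there is one. */
	if (i < len && q[i].sacked && q[i - 1].end == q[i].start)
		absorb_next(q, &len, i - 1);
	return len;
}
```

With four 1000-byte segments where the second and fourth are already SACKed, a SACK for the third collapses the last three into a single range, so later queue walkers see two entries instead of four.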
*	tcp: move tcp_simple_retransmit to tcp_input	Ilpo Järvinen	2008-11-24	1	-2/+0
	Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: avoid a pair of dst_hold()/dst_release() in ip_append_data()	Eric Dumazet	2008-11-24	1	-1/+1
	We can reduce pressure on the dst entry refcount that slows down the UDP transmit path on SMP machines. This pressure is visible on RTP servers when delivering content to mediagateways, especially big ones, handling thousands of streams.

	Several cpus send UDP frames to the same destination, hence use the same dst entry.

	This patch makes ip_append_data() eventually steal the refcount its callers had to take on the dst entry.

	This doesn't avoid all refcounting, but still gives speedups on SMP, on the UDP/RAW transmit path.

	Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	eth: Declare an optimized compare_ether_addr_64bits() function	Eric Dumazet	2008-11-23	1	-0/+42
	Linus mentioned we could try to perform long word operations, even on potentially unaligned addresses, on x86 at least. David mentioned the HAVE_EFFICIENT_UNALIGNED_ACCESS test to handle this on all arches that have efficient unaligned accesses.

	I tried this idea and got nice assembly on 32 bits:

	 158: 33 82 38 01 00 00    xor  0x138(%edx),%eax
	 15e: 33 8a 34 01 00 00    xor  0x134(%edx),%ecx
	 164: c1 e0 10             shl  $0x10,%eax
	 167: 09 c1                or   %eax,%ecx
	 169: 74 0b                je   176 <eth_type_trans+0x87>

	And very nice assembly on 64 bits of course (one xor, one shl)

	Nice oprofile improvement in eth_type_trans(), 0.17 % instead of 0.41 %, expected since we remove 8 instructions on a fast path.

	This patch implements a compare_ether_addr_64bits() function, that uses the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS ifdef to efficiently perform the 6 bytes comparison on all capable arches.

	Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
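
The wide comparison idea in the message above can be sketched in portable userspace C. This is an illustration only, not the kernel's compare_ether_addr_64bits(): the function names `ether_addr_differs_bytewise()` and `ether_addr_differs_64bits()` are hypothetical, the loads go through memcpy() so they are valid at any alignment, and the two trailing bytes are discarded with a mask built from bytes rather than with the shift the message mentions, so the sketch does not depend on byte order. It assumes both buffers have at least 8 readable bytes (6 address bytes plus 2 bytes of padding).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference version: compare the six address bytes directly. */
int ether_addr_differs_bytewise(const uint8_t *a, const uint8_t *b)
{
	return memcmp(a, b, 6) != 0;
}

/*
 * Wide version: load 8 bytes from each buffer, XOR them, and ignore the
 * two bytes that follow the address.
 * ASSUMPTION: both buffers have at least 8 readable bytes.
 */
int ether_addr_differs_64bits(const uint8_t *a, const uint8_t *b)
{
	static const uint8_t keep[8] = {
		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00
	};
	uint64_t x, y, mask;

	memcpy(&x, a, sizeof(x));        /* alignment-safe 64-bit load */
	memcpy(&y, b, sizeof(y));
	memcpy(&mask, keep, sizeof(mask));
	return ((x ^ y) & mask) != 0;
}

/* Small usage demo: padding bytes must not affect the result. */
void ether_addr_demo(void)
{
	uint8_t a[8] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0xaa, 0xbb };
	uint8_t b[8] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0xcc, 0xdd };

	assert(!ether_addr_differs_64bits(a, b));
	assert(!ether_addr_differs_bytewise(a, b));
}
```

The design point is that a single wide XOR replaces several narrow compares; whether that is a win depends on the architecture handling unaligned loads cheaply, which is what the config test in the commit is for.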
*	net: Convert TCP/DCCP listening hash tables to use RCU	Eric Dumazet	2008-11-23	1	-1/+8
	This is the last step to be able to perform full RCU lookups in __inet_lookup() : After established/timewait tables, we add RCU lookups to listening hash table.

	The only trick here is that a socket of a given type (TCP ipv4, TCP ipv6, ...) can now move between two different tables (established and listening) during an RCU grace period, so we must use different 'nulls' end-of-chain values for the two tables.

	We define a large value :

	#define LISTENING_NULLS_BASE (1U << 29)

	So that slots in listening table are guaranteed to have different end-of-chain values than slots in established table. A reader can still detect it finished its lookup in the right chain.

	Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
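
The end-of-chain check described above can be sketched as follows. This is a single-threaded userspace illustration with no RCU, not the kernel's hlist_nulls implementation: `struct node`, `make_nulls()`, `is_nulls()`, `nulls_value()` and `chain_lookup()` are hypothetical names, and only `LISTENING_NULLS_BASE` is taken from the commit message. Each chain ends in a marker that encodes which chain it belongs to (low bit set, id in the upper bits), so a reader that was carried onto another chain mid-walk can notice and restart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LISTENING_NULLS_BASE (1U << 29)

/* Hypothetical chain node: 'next' is a node pointer or an end marker. */
struct node {
	uintptr_t next;
	int key;
};

/* Encode an end-of-chain marker: low bit set, chain id above it. */
uintptr_t make_nulls(unsigned long id)
{
	return ((uintptr_t)id << 1) | 1;
}

int is_nulls(uintptr_t p)
{
	return (p & 1) != 0;
}

unsigned long nulls_value(uintptr_t p)
{
	return (unsigned long)(p >> 1);
}

/*
 * Walk a chain looking for 'key'. On a miss, *wrong_chain is set when the
 * walk ended on a marker other than 'expected_id', meaning the reader
 * drifted onto a different chain and should restart the lookup.
 */
struct node *chain_lookup(uintptr_t head, unsigned long expected_id,
			  int key, int *wrong_chain)
{
	uintptr_t p = head;

	assert(wrong_chain != NULL);
	*wrong_chain = 0;
	while (!is_nulls(p)) {
		struct node *n = (struct node *)p;

		if (n->key == key)
			return n;
		p = n->next;
	}
	if (nulls_value(p) != expected_id)
		*wrong_chain = 1;
	return NULL;
}
```

Because listening slots use ids starting at LISTENING_NULLS_BASE, slot 3 of the established table and slot 3 of the listening table end in different markers, so a walk that started in one and finished in the other is detectable.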
*	dccp: Set per-connection CCIDs via socket options	Gerrit Renker	2008-11-23	1	-0/+5
	With this patch, TX/RX CCIDs can now be changed on a per-connection basis, which overrides the defaults set by the global sysctl variables for TX/RX CCIDs.

	To make full use of this facility, the remaining patches of this patch set are needed, which track dependencies and activate negotiated feature values.

	Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
	Signed-off-by: David S. Miller <davem@davemloft.net>
*	WAN: syncppp.c is no longer used by any kernel code. Remove it.	Krzysztof Hałasa	2008-11-22	1	-102/+0
	Signed-off-by: Krzysztof Hałasa <khc@pm.waw.pl>