op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	IPoIB: improve IPv4/IPv6 to IB mcast mapping functions	Rolf Manderscheid	2008-01-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	An IPoIB subnet on an IB fabric that spans multiple IB subnets can't use link-local scope in multicast GIDs. The existing routines that map IP/IPv6 multicast addresses into IB link-level addresses hard-code the scope to link-local, and they also leave the partition key field uninitialised. This patch adds a parameter (the link-level broadcast address) to the mapping routines, allowing them to initialise both the scope and the P_Key appropriately, and fixes up the call sites. The next step will be to add a way to configure the scope for an IPoIB interface. Signed-off-by: Rolf Manderscheid <rvm@obsidianresearch.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
*	[INET]: Fix truesize setting in ip_append_data	Herbert Xu	2008-01-23	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As it is ip_append_data only counts page fragments to the skb that allocated it. As such it means that the first skb gets hit with a 4K charge even though it might have only used a fraction of it while all subsequent skb's that use the same page gets away with no charge at all. This bug was exposed by the UDP accounting patch. [ The wmem_alloc bumping needs to be moved with the truesize, noticed by Takahiro Yasui. -DaveM ] Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: Add missing skb->truesize increment in ip_append_page().	David S. Miller	2008-01-23	1	-0/+2
\| \| \| \| \| \| \|	And as noted by Takahiro Yasui, we thus need to bump the sk->sk_wmem_alloc at this spot as well. Signed-off-by: David S. Miller <davem@davemloft.net>
*	[ICMP]: ICMP_MIB_OUTMSGS increment duplicated	Wang Chen	2008-01-21	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit "96793b482540f3a26e2188eaf75cb56b7829d3e3" (Add ICMPMsgStats MIB (RFC 4293)) made a mistake. In that patch, David L added a icmp_out_count() in ip_push_pending_frames(), remove icmp_out_count() from icmp_reply(). But he forgot to remove icmp_out_count() from icmp_send() too. Since icmp_send and icmp_reply will call icmp_push_reply, which will call ip_push_pending_frames, a duplicated increment happened in icmp_send. This patch remove the icmp_out_count from icmp_send too. Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] FIB_HASH : Avoid unecessary loop in fn_hash_dump_zone()	Eric Dumazet	2008-01-20	1	-11/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I noticed "ip route list" was slower than "cat /proc/net/route" on a machine with a full Internet routing table (214392 entries : Special thanks to Robert ;) ) This is similar to problem reported in commit d8c9283089287341c85a0a69de32c2287a990e71 ("[IPV4] ROUTE: ip_rt_dump() is unecessary slow") Fix is to avoid scanning the begining of fz_hash table, but directly seek to the right offset. Before patch : time ip route >/tmp/ROUTE real 0m1.285s user 0m0.712s sys 0m0.436s After patch # time ip route >/tmp/ROUTE real 0m0.835s user 0m0.692s sys 0m0.124s Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] fib_trie: fix duplicated route issue	Joonwoo Park	2008-01-20	1	-0/+3
\| \| \| \| \| \| \| \| \| \|	http://bugzilla.kernel.org/show_bug.cgi?id=9493 The fib allows making identical routes with 'ip route replace'. This patch makes the fib return -EEXIST if replacement would cause duplication. Signed-off-by: Joonwoo Park <joonwpark81@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] fib_hash: fix duplicated route issue	Joonwoo Park	2008-01-20	1	-0/+3
\| \| \| \| \| \| \| \| \| \|	http://bugzilla.kernel.org/show_bug.cgi?id=9493 The fib allows making identical routes with 'ip route replace'. This patch makes the fib return -EEXIST if replacement would cause duplication. Signed-off-by: Joonwoo Park <joonwpark81@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache	Eric Dumazet	2008-01-10	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In rt_cache_get_next(), no need to guard seq->private by a rcu_dereference() since seq is private to the thread running this function. Reading seq.private once (as guaranted bu rcu_dereference()) or several time if compiler really is dumb enough wont change the result. But we miss real spots where rcu_dereference() are needed, both in rt_cache_get_first() and rt_cache_get_next() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[LRO] Fix lro_mgr->features checks	Brice Goglin	2008-01-08	1	-8/+8
\| \| \| \| \| \| \| \| \| \|	lro_mgr->features contains a bitmask of LRO_F_* values which are defined as power of two, not as bit indexes. They must be checked with x&LRO_F_FOO, not with test_bit(LRO_F_FOO,&x). Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Acked-by: Andrew Gallatin <gallatin@myri.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] ROUTE: ip_rt_dump() is unecessary slow	Eric Dumazet	2008-01-08	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I noticed "ip route list cache x.y.z.t" can be very slow. While strace-ing -T it I also noticed that first part of route cache is fetched quite fast : recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202 GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3772 <0.000047> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3736 <0.000042> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3740 <0.000055> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3712 <0.000043> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3732 <0.000053> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202 GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3708 <0.000052> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202 GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3680 <0.000041> while the part at the end of the table is more expensive: recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3656 <0.003857> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3772 <0.003891> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3712 <0.003765> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3700 <0.003879> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3676 <0.003797> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3724 <0.003856> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3736 <0.003848> The following patch corrects this performance/latency problem, removing quadratic behavior. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] ipconfig: Fix regression in ip command line processing	Amos Waterland	2008-01-08	1	-4/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The recent changes for ip command line processing fixed some problems but unfortunately broke some common usage scenarios. In current 2.6.24-rc6 the following command line results in no IP address assignment, which is surely a regression: ip=10.0.2.15::10.0.2.2:255.255.255.0::eth0:off Please find below a patch that works for all cases I can find. Signed-off-by: Amos Waterland <apw@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] raw: Strengthen check on validity of iph->ihl	Herbert Xu	2008-01-08	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	We currently check that iph->ihl is bounded by the real length and that the real length is greater than the minimum IP header length. However, we did not check the caes where iph->ihl is less than the minimum IP header length. This breaks because some ip_fast_csum implementations assume that which is quite reasonable. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[INET]: Fix netdev renaming and inet address labels	Mark McLoughlin	2008-01-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When re-naming an interface, the previous secondary address labels get lost e.g. $> brctl addbr foo $> ip addr add 192.168.0.1 dev foo $> ip addr add 192.168.0.2 dev foo label foo:00 $> ip addr show dev foo \| grep inet inet 192.168.0.1/32 scope global foo inet 192.168.0.2/32 scope global foo:00 $> ip link set foo name bar $> ip addr show dev bar \| grep inet inet 192.168.0.1/32 scope global bar inet 192.168.0.2/32 scope global bar:2 Turns out to be a simple thinko in inetdev_changename() - clearly we want to look at the address label, rather than the device name, for a suffix to retain. Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: use non-delayed ACK for congestion control RTT	Gavin McCullagh	2007-12-29	1	-8/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a delayed ACK representing two packets arrives, there are two RTT samples available, one for each packet. The first (in order of seq number) will be artificially long due to the delay waiting for the second packet, the second will trigger the ACK and so will not itself be delayed. According to rfc1323, the SRTT used for RTO calculation should use the first rtt, so receivers echo the timestamp from the first packet in the delayed ack. For congestion control however, it seems measuring delayed ack delay is not desirable as it varies independently of congestion. The patch below causes seq_rtt and last_ackt to be updated with any available later packet rtts which should have less (and hopefully zero) delack delay. The rtt value then gets passed to ca_ops->pkts_acked(). Where TCP_CONG_RTT_STAMP was set, effort was made to supress RTTs from within a TSO chunk (!fully_acked), using only the final ACK (which includes any TSO delay) to generate RTTs. This patch removes these checks so RTTs are passed for each ACK to ca_ops->pkts_acked(). For non-delay based congestion control (cubic, h-tcp), rtt is sometimes used for rtt-scaling. In shortening the RTT, this may make them a little less aggressive. Delay-based schemes (eg vegas, veno, illinois) should get a cleaner, more accurate congestion signal, particularly for small cwnds. The congestion control module can potentially also filter out bad RTTs due to the delayed ack alarm by looking at the associated cnt which (where delayed acking is in use) should probably be 1 if the alarm went off or greater if the ACK was triggered by a packet. Signed-off-by: Gavin McCullagh <gavin.mccullagh@nuim.ie> Acked-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] Fix ip=dhcp regression	Simon Horman	2007-12-28	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	David Brownell pointed out a regression in my recent "Fix ip command line processing" patch. It turns out to be a fairly blatant oversight on my part whereby ic_enable is never set, and thus autoconfiguration is never enabled. Clearly my testing was broken :-( The solution that I have is to set ic_enable to 1 if we hit ip_auto_config_setup(), which basically means that autoconfiguration is activated unless told otherwise. I then flip ic_enable to 0 if ip=off, ip=none, ip=::::::off or ip=::::::none using ic_proto_name(); The incremental patch is below, let me know if a non-incremental version is prepared, as I did as for the original patch to be reverted pending a fix. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: Fix ip command line processing.	Simon Horman	2007-12-26	1	-6/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recently the documentation in Documentation/nfsroot.txt was update to note that in fact ip=off and ip=::::::off as the latter is ignored and the default (on) is used. This was certainly a step in the direction of reducing confusion. But it seems to me that the code ought to be fixed up so that ip=::::::off actually turns off ip autoconfiguration. This patch also notes more specifically that ip=on (aka ip=::::::on) is the default. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER]: nf_conntrack_ipv4: fix module parameter compatibility	Patrick McHardy	2007-12-26	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some users do "modprobe ip_conntrack hashsize=...". Since we have the module aliases this loads nf_conntrack_ipv4 and nf_conntrack, the hashsize parameter is unknown for nf_conntrack_ipv4 however and makes it fail. Allow to specify hashsize= for both nf_conntrack and nf_conntrack_ipv4. Note: the nf_conntrack message in the ringbuffer will display an incorrect hashsize since nf_conntrack is first pulled in as a dependency and calculates the size itself, then it gets changed through a call to nf_conntrack_set_hashsize(). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: OOPS with NETLINK_FIB_LOOKUP netlink socket	Denis V. Lunev	2007-12-21	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	[ Regression added by changeset: cd40b7d3983c708aabe3d3008ec64ffce56d33b0 [NET]: make netlink user -> kernel interface synchronious -DaveM ] nl_fib_input re-reuses incoming skb to send the reply. This means that this packet will be freed twice, namely in: - netlink_unicast_kernel - on receive path Use clone to send as a cure, the caller is responsible for kfree_skb on error. Thanks to Alexey Dobryan, who originally found the problem. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER] ipv4: Spelling fixes	Joe Perches	2007-12-20	1	-1/+1
\| \| \| \| \|	Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] ip_gre: set mac_header correctly in receive path	Timo Teras	2007-12-20	1	-1/+1
\| \| \| \| \| \| \| \|	mac_header update in ipgre_recv() was incorrectly changed to skb_reset_mac_header() when it was introduced. Signed-off-by: Timo Teras <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] ARP: Remove not used code	Mark Ryden	2007-12-19	1	-2/+1
\| \| \| \| \| \| \| \|	In arp_process() (net/ipv4/arp.c), there is unused code: definition and assignment of tha (target hw address ). Signed-off-by: Mark Ryden <markryde@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: Make tcp_input_metrics() get minimum RTO via tcp_rto_min()	Satoru SATOH	2007-12-16	1	-1/+1
\| \| \| \| \| \| \| \|	tcp_input_metrics() refers to the built-time constant TCP_RTO_MIN regardless of configured minimum RTO with iproute2. Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: Updates to nfsroot documentation	Amos Waterland	2007-12-14	1	-19/+1
\| \| \| \| \| \| \| \| \| \| \|	The difference between ip=off and ip=::::::off has been a cause of much confusion. Document how each behaves, and do not contradict ourselves by saying that "off" is the default when in fact "any" is the default and is descibed as being so lower in the file. Signed-off-by: Amos Waterland <apw@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER]: ip_tables: fix compat copy race	Patrick McHardy	2007-12-14	1	-45/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When copying entries to user, the kernel makes two passes through the data, first copying all the entries, then fixing up names and counters. On the second pass it copies the kernel and match data from userspace to the kernel again to find the corresponding structures, expecting that kernel pointers contained in the data are still valid. This is obviously broken, fix by avoiding the second pass completely and fixing names and counters while dumping the ruleset, using the kernel-internal data structures. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPv4] ESP: Discard dummy packets introduced in rfc4303	Thomas Graf	2007-12-11	1	-0/+5
\| \| \| \| \| \| \| \| \| \|	RFC4303 introduces dummy packets with a nexthdr value of 59 to implement traffic confidentiality. Such packets need to be dropped silently and the payload may not be attempted to be parsed as it consists of random chunk. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: Swap the ifa allocation with the"ipv4_devconf_setall" call	Pavel Emelyanov	2007-12-11	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	According to Herbert, the ipv4_devconf_setall should be called only when the ifa is added to the device. However, failed ifa allocation may bring things into inconsistent state. Move the call to ipv4_devconf_setall after the ifa allocation. Fits both net-2.6 (with offsets) and net-2.6.25 (cleanly). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: Remove prototype of ip_rt_advice	Denis V. Lunev	2007-12-07	1	-1/+1
\| \| \| \| \| \| \|	ip_rt_advice has been gone, so no need to keep prototype and debug message. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPv4]: Reply net unreachable ICMP message	Mitsuru Chinen	2007-12-07	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \|	IPv4 stack doesn't reply any ICMP destination unreachable message with net unreachable code when IP detagrams are being discarded because of no route could be found in the forwarding path. Incidentally, IPv6 stack replies such ICMPv6 message in the similar situation. Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[LRO]: fix lro_gen_skb() alignment	Andrew Gallatin	2007-12-05	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add a field to the lro_mgr struct so that drivers can specify how much padding is required to align layer 3 headers when a packet is copied into a freshly allocated skb by inet_lro.c:lro_gen_skb(). Without padding, skbs generated by LRO will cause alignment warnings on architectures which require strict alignment (seen on sparc64). Myri10GE is updated to use this field. Signed-off-by: Andrew Gallatin <gallatin@myri.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: NAGLE_PUSH seems to be a wrong way around	Ilpo J�rvinen	2007-12-05	1	-2/+1
\| \| \| \| \| \| \| \|	The comment in tcp_nagle_test suggests that. This bug is very very old, even 2.4.0 seems to have it. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Move prior_in_flight collect to more robust place	Ilpo J�rvinen	2007-12-05	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \|	The previous location is after sacktag processing, which affects counters tcp_packets_in_flight depends on. This may manifest as wrong behavior if new SACK blocks are present and all is clear for call to tcp_cong_avoid, which in the case of tcp_reno_cong_avoid bails out early because it thinks that TCP is not limited by cwnd. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] FRTO: Use of existing funcs make code more obvious & robust	Ilpo J�rvinen	2007-12-05	1	-9/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Though there's little need for everything that tcp_may_send_now does (actually, even the state had to be adjusted to pass some checks FRTO does not want to occur), it's more robust to let it make the decision if sending is allowed. State adjustments needed: - Make sure snd_cwnd limit is not hit in there - Disable nagle (if necessary) through the frto_counter == 2 The result of check for frto_counter in argument to call for tcp_enter_frto_loss can just be open coded, therefore there isn't need to store the previous frto_counter past tcp_may_send_now. In addition, returns can then be combined. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPVS]: Fix sched registration race when checking for name collision.	Pavel Emelyanov	2007-12-05	1	-13/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The register_ip_vs_scheduler() checks for the scheduler with the same name under the read-locked __ip_vs_sched_lock, then drops, takes it for writing and puts the scheduler in list. This is racy, since we can have a race window between the lock being re-locked for writing. The fix is to search the scheduler with the given name right under the write-locked __ip_vs_sched_lock. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPVS]: Don't leak sysctl tables if the scheduler registration fails.	Pavel Emelyanov	2007-12-05	2	-2/+12
\| \| \| \| \| \| \| \| \| \| \| \| \|	In case we load lblc or lblcr module we can leak some sysctl tables if the call to register_ip_vs_scheduler() fails. I've looked at the register_ip_vs_scheduler() code and saw, that the only reason to fail is the name collision, so I think that with some 3rd party schedulers this becomes a relevant issue. No? Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[INET]: Fix inet_diag dead-lock regression	Herbert Xu	2007-12-03	1	-21/+46
\| \| \| \| \| \| \| \| \| \| \|	The inet_diag register fix broke inet_diag module loading because the loaded module had to take the same mutex that's already held by the loader in order to register the new handler. This patch fixes it by introducing a separate mutex to protect the handling of handlers. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
*	[TCP] illinois: Incorrect beta usage	Stephen Hemminger	2007-11-30	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Lachlan Andrew observed that my TCP-Illinois implementation uses the beta value incorrectly: The parameter beta in the paper specifies the amount to decrease by: that is, on loss, W <- W - betaW but in tcp_illinois_ssthresh() uses beta as the amount to decrease to: W <- betaW This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network, hurting performance. Note: since the base beta value is .5, it has no impact on a congested network. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
*	[INET]: Fix inet_diag register vs rcv race	Pavel Emelyanov	2007-11-30	1	-9/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The following race is possible when one cpu unregisters the handler while other one is trying to receive a message and call this one: CPU1: CPU2: inet_diag_rcv() inet_diag_unregister() mutex_lock(&inet_diag_mutex); netlink_rcv_skb(skb, &inet_diag_rcv_msg); if (inet_diag_table[nlh->nlmsg_type] == NULL) /* false handler is still registered / ... netlink_dump_start(idiagnl, skb, nlh, inet_diag_dump, NULL); cb = kzalloc(sizeof(cb), GFP_KERNEL); /* sleep here freeing memory * or preempt * or sleep later on nlk->cb_mutex / spin_lock(&inet_diag_register_lock); inet_diag_table[type] = NULL; ... spin_unlock(&inet_diag_register_lock); synchronize_rcu(); / CPU1 is sleeping - RCU quiescent * state is passed / return; / inet_diag_dump is finally called: / inet_diag_dump() handler = inet_diag_table[cb->nlh->nlmsg_type]; BUG_ON(handler == NULL); / OOPS! While we slept the unregister has set * handler to NULL :( */ Grep showed, that the register/unregister functions are called from init/fini module callbacks for tcp_/dccp_diag, so it's OK to use the inet_diag_mutex to synchronize manipulations with the inet_diag_table and the access to it. Besides, as Herbert pointed out, asynchronous dumps should hold this mutex as well, and thus, we provide the mutex as cb_mutex one. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
*	[IPV4]: Remove bogus ifdef mess in arp_process	Adrian Bunk	2007-11-26	1	-19/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The #ifdef's in arp_process() were not only a mess, they were also wrong in the CONFIG_NET_ETHERNET=n and (CONFIG_NETDEV_1000=y or CONFIG_NETDEV_10000=y) cases. Since they are not required this patch removes them. Also removed are some #ifdef's around #include's that caused compile errors after this change. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
*	[TCP] MTUprobe: Cleanup send queue check (no need to loop)	Ilpo Järvinen	2007-11-23	1	-6/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The original code has striking complexity to perform a query which can be reduced to a very simple compare. FIN seqno may be included to write_seq but it should not make any significant difference here compared to skb->len which was used previously. One won't end up there with SYN still queued. Use of write_seq check guarantees that there's a valid skb in send_head so I removed the extra check. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: John Heffner <jheffner@psc.edu> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
*	[TCP]: MTUprobe: receiver window & data available checks fixed	Ilpo Järvinen	2007-11-23	1	-9/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It seems that the checked range for receiver window check should begin from the first rather than from the last skb that is going to be included to the probe. And that can be achieved without reference to skbs at all, snd_nxt and write_seq provides the correct seqno already. Plus, it SHOULD account packets that are necessary to trigger fast retransmit [RFC4821]. Location of snd_wnd < probe_size/size_needed check is bogus because it will cause the other if() match as well (due to snd_nxt >= snd_una invariant). Removed dead obvious comment. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
*	[IPVS]: Fix compiler warning about unused register_ip_vs_protocol	Pavel Emelyanov	2007-11-20	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is silly, but I have turned the CONFIG_IP_VS to m, to check the compilation of one (recently sent) fix and set all the CONFIG_IP_VS_PROTO_XXX options to n to speed up the compilation. In this configuration the compiler warns me about CC [M] net/ipv4/ipvs/ip_vs_proto.o net/ipv4/ipvs/ip_vs_proto.c:49: warning: 'register_ip_vs_protocol' defined but not used Indeed. With no protocols selected there are no calls to this function - all are compiled out with ifdefs. Maybe the best fix would be to surround this call with ifdef-s or tune the Kconfig dependences, but I think that marking this register function as __used is enough. No? Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[ARP]: Fix arp reply when sender ip 0	Jonas Danielsson	2007-11-20	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	Fix arp reply when received arp probe with sender ip 0. Send arp reply with target ip address 0.0.0.0 and target hardware address set to hardware address of requester. Previously sent reply with target ip address and target hardware address set to same as source fields. Signed-off-by: Jonas Danielsson <the.sator@gmail.com> Acked-by: Alexey Kuznetov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] TCPMD5: Use memmove() instead of memcpy() because we have overlaps.	YOSHIFUJI Hideaki	2007-11-20	1	-4/+4
\| \| \| \| \|	Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4] TCPMD5: Omit redundant NULL check for kfree() argument.	YOSHIFUJI Hideaki	2007-11-20	1	-2/+1
\| \| \| \| \|	Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER]: Fix kernel panic with REDIRECT target.	Evgeniy Polyakov	2007-11-20	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When connection tracking entry (nf_conn) is about to copy itself it can have some of its extension users (like nat) as being already freed and thus not required to be copied. Actually looking at this function I suspect it was copied from nf_nat_setup_info() and thus bug was introduced. Report and testing from David <david@unsolicited.net>. [ Patrick McHardy states: I now understand whats happening: - new connection is allocated without helper - connection is REDIRECTed to localhost - nf_nat_setup_info adds NAT extension, but doesn't initialize it yet - nf_conntrack_alter_reply performs a helper lookup based on the new tuple, finds the SIP helper and allocates a helper extension, causing reallocation because of too little space - nf_nat_move_storage is called with the uninitialized nat extension So your fix is entirely correct, thanks a lot :) ] Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: Add missing "space"	Joe Perches	2007-11-19	2	-2/+2
\| \| \| \| \|	Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Problem bug with sysctl_tcp_congestion_control function	Sam Jansen	2007-11-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	From: "Sam Jansen" <sjansen@google.com> sysctl_tcp_congestion_control seems to have a bug that prevents it from actually calling the tcp_set_default_congestion_control function. This is not so apparent because it does not return an error and generally the /proc interface is used to configure the default TCP congestion control algorithm. This is present in 2.6.18 onwards and probably earlier, though I have not inspected 2.6.15--2.6.17. sysctl_tcp_congestion_control calls sysctl_string and expects a successful return code of 0. In such a case it actually sets the congestion control algorithm with tcp_set_default_congestion_control. Otherwise, it returns the value returned by sysctl_string. This was correct in 2.6.14, as sysctl_string returned 0 on success. However, sysctl_string was updated to return 1 on success around about 2.6.15 and sysctl_tcp_congestion_control was not updated. Even though sysctl_tcp_congestion_control returns 1, do_sysctl_strategy converts this return code to '0', so the caller never notices the error. Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP] MTUprobe: fix potential sk_send_head corruption	Ilpo J�rvinen	2007-11-19	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \|	When the abstraction functions got added, conversion here was made incorrectly. As a result, the skb may end up pointing to skb which got included to the probe skb and then was freed. For it to trigger, however, skb_transmit must fail sending as well. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPVS]: Move remaining sysctl handlers over to CTL_UNNUMBERED	Simon Horman	2007-11-19	3	-22/+0
\| \| \| \| \| \| \| \| \|	Switch the remaining IPVS sysctl entries over to to use CTL_UNNUMBERED, I stronly doubt that anyone is using the sys_sysctl interface to these variables. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPVS]: Fix sysctl warnings about missing strategy in schedulers	Simon Horman	2007-11-19	2	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	sysctl table check failed: /net/ipv4/vs/lblc_expiration .3.5.21.19 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/lblcr_expiration .3.5.21.20 Missing strategy Switch these entried over to use CTL_UNNUMBERED as clearly the sys_syscal portion wasn't working. This is along the same lines as Christian Borntraeger's patch that fixes up entries with no stratergy in net/ipv4/ipvs/ip_vs_ctl.c Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>