op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SOCK]: Adds a rcu_dereference() in sk_filter	Eric Dumazet	2008-01-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	It seems commit fda9ef5d679b07c9d9097aaf6ef7f069d794a8f9 introduced a RCU protection for sk_filter(), without a rcu_dereference() Either we need a rcu_dereference(), either a comment should explain why we dont need it. I vote for the former. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[XFRM]: xfrm_algo_clone() allocates too much memory	Eric Dumazet	2008-01-08	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \|	alg_key_len is the length in bits of the key, not in bytes. Best way to fix this is to move alg_len() function from net/xfrm/xfrm_user.c to include/net/xfrm.h, and to use it in xfrm_algo_clone() alg_len() is renamed to xfrm_alg_len() because of its global exposition. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NET]: Clone the sk_buff 'iif' field in __skb_clone()	Paul Moore	2008-01-08	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Both NetLabel and SELinux (other LSMs may grow to use it as well) rely on the 'iif' field to determine the receiving network interface of inbound packets. Unfortunately, at present this field is not preserved across a skb clone operation which can lead to garbage values if the cloned skb is sent back through the network stack. This patch corrects this problem by properly copying the 'iif' field in __skb_clone() and removing the 'iif' field assignment from skb_act_clone() since it is no longer needed. Also, while we are here, put the assignments in the same order as the offsets to reduce cacheline bounces. Signed-off-by: Paul Moore <paul.moore@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[SCTP]: Fix the name of the authentication event.	Vlad Yasevich	2008-01-08	1	-1/+1
\| \| \| \| \| \| \|	The even should be called SCTP_AUTHENTICATION_INDICATION. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[VETH]: move veth.h to include/linux	Stephen Hemminger	2007-12-26	1	-12/+0
\| \| \| \| \| \| \| \| \| \|	Move veth.h from net/ to linux/ since it is a user api, and add it to user header processing Kbuild. [ Use header-y as suggested by Sam Ravnborg. -DaveM ] Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER]: nf_conntrack_ipv4: fix module parameter compatibility	Patrick McHardy	2007-12-26	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some users do "modprobe ip_conntrack hashsize=...". Since we have the module aliases this loads nf_conntrack_ipv4 and nf_conntrack, the hashsize parameter is unknown for nf_conntrack_ipv4 however and makes it fail. Allow to specify hashsize= for both nf_conntrack and nf_conntrack_ipv4. Note: the nf_conntrack message in the ringbuffer will display an incorrect hashsize since nf_conntrack is first pulled in as a dependency and calculates the size itself, then it gets changed through a call to nf_conntrack_set_hashsize(). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NET] include/net/: Spelling fixes	Joe Perches	2007-12-20	4	-6/+6
\| \| \| \| \|	Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[SCTP]: Fix the bind_addr info during migration.	Vlad Yasevich	2007-12-07	1	-0/+3
\| \| \| \| \| \| \| \| \| \|	During accept/migrate the code attempts to copy the addresses from the parent endpoint to the new endpoint. However, if the parent was bound to a wildcard address, then we end up pointlessly copying all of the current addresses on the system. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: Remove prototype of ip_rt_advice	Denis V. Lunev	2007-12-07	1	-1/+0
\| \| \| \| \| \| \|	ip_rt_advice has been gone, so no need to keep prototype and debug message. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	SCTP: Fix build issues with SCTP AUTH.	Vlad Yasevich	2007-11-29	1	-3/+6
\| \| \| \| \| \| \| \|	SCTP-AUTH requires selection of CRYPTO, HMAC and SHA1 since SHA1 is a MUST requirement for AUTH. We also support SHA256, but that's optional, so fix the code to treat it as such. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
*	[IPV4]: Fix memory leak in inet_hashtables.h when NUMA is on	Pavel Emelyanov	2007-11-26	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The inet_ehash_locks_alloc() looks like this: #ifdef CONFIG_NUMA if (size > PAGE_SIZE) x = vmalloc(...); else #endif x = kmalloc(...); Unlike it, the inet_ehash_locks_alloc() looks like this: #ifdef CONFIG_NUMA if (size > PAGE_SIZE) vfree(x); else #else kfree(x); #endif The error is obvious - if the NUMA is on and the size is less than the PAGE_SIZE we leak the pointer (kfree is inside the #else branch). Compiler doesn't warn us because after the kfree(x) there's a "x = NULL" assignment, so here's another (minor?) bug: we don't set x to NULL under certain circumstances. Boring explanation, I know... Patch explains it better. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
*	Merge branch 'fixes-davem' of ↵	David S. Miller	2007-11-20	1	-0/+8
\|\ \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6
\| *	ieee80211: Stop net_ratelimit/IEEE80211_DEBUG_DROP log pollution	Guillaume Chazarain	2007-11-20	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	if (net_ratelimit()) IEEE80211_DEBUG_DROP(...) can pollute the logs with messages like: printk: 1 messages suppressed. printk: 2 messages suppressed. printk: 7 messages suppressed. if debugging information is disabled. These messages are printed by net_ratelimit(). Add a wrapper to net_ratelimit() that takes into account the log level, so that net_ratelimit() is called only when we really want to print something. Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Signed-off-by: John W. Linville <linville@tuxdriver.com>
* \|	[TCP] MTUprobe: fix potential sk_send_head corruption	Ilpo J�rvinen	2007-11-19	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When the abstraction functions got added, conversion here was made incorrectly. As a result, the skb may end up pointing to skb which got included to the probe skb and then was freed. For it to trigger, however, skb_transmit must fail sending as well. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	[IPVS]: Move remaining sysctl handlers over to CTL_UNNUMBERED	Simon Horman	2007-11-19	1	-28/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Switch the remaining IPVS sysctl entries over to to use CTL_UNNUMBERED, I stronly doubt that anyone is using the sys_sysctl interface to these variables. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	[IPVS]: Fix sysctl warnings about missing strategy in schedulers	Simon Horman	2007-11-19	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	sysctl table check failed: /net/ipv4/vs/lblc_expiration .3.5.21.19 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/lblcr_expiration .3.5.21.20 Missing strategy Switch these entried over to use CTL_UNNUMBERED as clearly the sys_syscal portion wasn't working. This is along the same lines as Christian Borntraeger's patch that fixes up entries with no stratergy in net/ipv4/ipvs/ip_vs_ctl.c Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	[IPVS]: Fix sysctl warnings about missing strategy	Christian Borntraeger	2007-11-19	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Running the latest git code I get the following messages during boot: sysctl table check failed: /net/ipv4/vs/drop_entry .3.5.21.4 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/drop_packet .3.5.21.5 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/secure_tcp .3.5.21.6 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/sync_threshold .3.5.21.24 Missing strategy I removed the binary sysctl handler for those messages and also removed the definitions in ip_vs.h. The alternative would be to implement a proper strategy handler, but syscall sysctl is deprecated. There are other sysctl definitions that are commented out or work with the default sysctl_data strategy. I did not touch these. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	[TCP]: Fix TCP header misalignment	Herbert Xu	2007-11-18	1	-0/+3
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Indeed my previous change to alloc_pskb has made it possible for the TCP header to be misaligned iff the MTU is not a multiple of 4 (and less than a page). So I suspect the optimised IPsec MTU calculation is giving you just such an MTU :) This patch fixes it by changing alloc_pskb to make sure that the size is at least 32-bit aligned. This does not cause the problem fixed by the previous patch because max_header is always 32-bit aligned which means that in the SG/NOTSO case this will be a no-op. I thought about putting this in the callers but all the current callers are from TCP. If and when we get a non-TCP caller we can always create a TCP wrapper for this function and move the alignment over there. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[INET]: Fix potential kfree on vmalloc-ed area of request_sock_queue	Pavel Emelyanov	2007-11-15	1	-17/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The request_sock_queue's listen_opt is either vmalloc-ed or kmalloc-ed depending on the number of table entries. Thus it is expected to be handled properly on free, which is done in the reqsk_queue_destroy(). However the error path in inet_csk_listen_start() calls the lite version of reqsk_queue_destroy, called __reqsk_queue_destroy, which calls the kfree unconditionally. Fix this and move the __reqsk_queue_destroy into a .c file as it looks too big to be inline. As David also noticed, this is an error recovery path only, so no locking is required and the lopt is known to be not NULL. reqsk_queue_yank_listen_sk is also now only used in net/core/request_sock.c so we should move it there too. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Fix size calculation in sk_stream_alloc_pskb	Herbert Xu	2007-11-14	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We round up the header size in sk_stream_alloc_pskb so that TSO packets get zero tail room. Unfortunately this rounding up is not coordinated with the select_size() function used by TCP to calculate the second parameter of sk_stream_alloc_pskb. As a result, we may allocate more than a page of data in the non-TSO case when exactly one page is desired. In fact, rounding up the head room is detrimental in the non-TSO case because it makes memory that would otherwise be available to the payload head room. TSO doesn't need this either, all it wants is the guarantee that there is no tail room. So this patch fixes this by adjusting the skb_reserve call so that exactly the requested amount (which all callers have calculated in a precise way) is made available as tail room. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NET]: Move unneeded data to initdata section.	Denis V. Lunev	2007-11-13	1	-0/+2
\| \| \| \| \| \| \| \| \| \|	This patch reverts Eric's commit 2b008b0a8e96b726c603c5e1a5a7a509b5f61e35 It diets .text & .data section of the kernel if CONFIG_NET_NS is not set. This is safe after list operations cleanup. Signed-of-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[INET]: Use list_head-s in inetpeer.c	Pavel Emelyanov	2007-11-12	1	-1/+1
\| \| \| \| \| \| \| \|	The inetpeer.c tracks the LRU list of inet_perr-s, but makes it by hands. Use the list_head-s for this. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[INET]: Remove leftover prototypes from include/net/inet_common.h	Arnaldo Carvalho de Melo	2007-11-12	1	-4/+0
\| \| \| \| \|	Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'pending' of ↵	David S. Miller	2007-11-12	4	-13/+18
\|\ \| \| \| \| \| \|	master.kernel.org:/pub/scm/linux/kernel/git/vxy/lksctp-dev
\| *	SCTP: Clean-up some defines for regressions tests.	Vlad Yasevich	2007-11-09	1	-1/+0
\| \| \| \| \| \| \| \|	Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
\| *	SCTP: Make sctp_verify_param return multiple indications.	Vlad Yasevich	2007-11-09	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	SCTP-AUTH and future ADD-IP updates have a requirement to do additional verification of parameters and an ability to ABORT the association if verification fails. So, introduce additional return code so that we can clear signal a required action. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
\| *	SCTP: Convert custom hash lists to use hlist.	Vlad Yasevich	2007-11-09	2	-6/+7
\| \| \| \| \| \| \| \| \| \| \| \|	Convert the custom hash list traversals to use hlist functions. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
\| *	SCTP: Allow ADD_IP to work with AUTH for backward compatibility.	Vlad Yasevich	2007-11-07	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds a tunable that will allow ADD_IP to work without AUTH for backward compatibility. The default value is off since the default value for ADD_IP is off as well. People who need to use ADD-IP with older implementations take risks of connection hijacking and should consider upgrading or turning this tunable on. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
\| *	SCTP: Correctly disable ADD-IP when AUTH is not supported.	Vlad Yasevich	2007-11-07	1	-1/+0
\| \| \| \| \| \| \| \|	Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
\| *	SCTP: Update RCU handling during the ADD-IP case	Vlad Yasevich	2007-11-07	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After learning more about rcu, it looks like the ADD-IP hadling doesn't need to call call_rcu_bh. All the rcu critical sections use rcu_read_lock, so using call_rcu_bh is wrong here. Now, restore the local_bh_disable() code blocks and use normal call_rcu() calls. Also restore the missing return statement. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
\| *	SCTP: Fix difference cases of retransmit.	Vlad Yasevich	2007-11-07	4	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit d0ce92910bc04e107b2f3f2048f07e94f570035d broke several retransmit cases including fast retransmit. The reason is that we should only delay by rto while doing retranmists as a result of a timeout. Retransmit as a result of path mtu discover, fast retransmit, or other evernts that should trigger immidiate retransmissions got broken. Also, since rto is doubled prior to marking of packets elegable for retransmission, we never marked correct chunks anyway. The fix is provide a reason for a given retransmission so that we can mark chunks appropriately and to save the old rto value to do comparisons against. All regressions tests passed with this code. Spotted by Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
* \|	[INET]: Small possible memory leak in FIB rules	Denis V. Lunev	2007-11-10	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes a small memory leak. Default fib rules can be deleted by the user if the rule does not carry FIB_RULE_PERMANENT flag, f.e. by ip rule flush Such a rule will not be freed as the ref-counter has 2 on start and becomes clearly unreachable after removal. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	[AF_UNIX]: Make unix_tot_inflight counter non-atomic	Pavel Emelyanov	2007-11-10	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This counter is _always_ modified under the unix_gc_lock spinlock, so its atomicity can be provided w/o additional efforts. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	mac80211: remove unused driver ops	Johannes Berg	2007-11-10	1	-21/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The driver operations set_ieee8021x(), set_port_auth() and set_privacy_invoked() are not used by any drivers, except set_privacy_invoked() they aren't even used by mac80211. Remove them at least until we need to support drivers with mac80211 that require getting this information. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Michael Wu <flamingice@sourmilk.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
* \|	mac80211: allow driver to ask for a rate control algorithm	Johannes Berg	2007-11-10	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This allows a driver to ask for a specific rate control algorithm. The rate control algorithm asked for must be registered and be available as a module or built-in. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
* \|	[NET]: Make helper to get dst entry and "use" it	Pavel Emelyanov	2007-11-10	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are many places that get the dst entry, increase the __use counter and set the "lastuse" time stamp. Make a helper for this. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	[INET]: Add a missing include <linux/vmalloc.h> to inet_hashtables.h	Eric Dumazet	2007-11-10	1	-0/+1
\|/ \| \| \| \|	Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[INET]: Remove per bucket rwlock in tcp/dccp ehash table.	Eric Dumazet	2007-11-07	1	-6/+65
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As done two years ago on IP route cache table (commit 22c047ccbc68fa8f3fa57f0e8f906479a062c426) , we can avoid using one lock per hash bucket for the huge TCP/DCCP hash tables. On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for litle performance differences. (we hit a different cache line for the rwlock, but then the bucket cache line have a better sharing factor among cpus, since we dirty it less often). For netstat or ss commands that want a full scan of hash table, we perform fewer memory accesses. Using a 'small' table of hashed rwlocks should be more than enough to provide correct SMP concurrency between different buckets, without using too much memory. Sizing of this table depends on num_possible_cpus() and various CONFIG settings. This patch provides some locking abstraction that may ease a future work using a different model for TCP/DCCP table. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPVS]: Synchronize closing of Connections	Rumen G. Bogdanovski	2007-11-07	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch makes the master daemon to sync the connection when it is about to close. This makes the connections on the backup to close or timeout according their state. Before the sync was performed only if the connection is in ESTABLISHED state which always made the connections to timeout in the hard coded 3 minutes. However the Andy Gospodarek's patch ([IPVS]: use proper timeout instead of fixed value) effectively did nothing more than increasing this to 15 minutes (Established state timeout). So this patch makes use of proper timeout since it syncs the connections on status changes to FIN_WAIT (2min timeout) and CLOSE (10sec timeout). However if the backup misses CLOSE hopefully it did not miss FIN_WAIT. Otherwise we will just have to wait for the ESTABLISHED state timeout. As it is without this patch. This way the number of the hanging connections on the backup is kept to minimum. And very few of them will be left to timeout with a long timeout. This is important if we want to make use of the fix for the real server overcommit on master/backup fail-over. Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPVS]: Bind connections on stanby if the destination exists	Rumen G. Bogdanovski	2007-11-07	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes the problem with node overload on director fail-over. Given the scenario: 2 nodes each accepting 3 connections at a time and 2 directors, director failover occurs when the nodes are fully loaded (6 connections to the cluster) in this case the new director will assign another 6 connections to the cluster, If the same real servers exist there. The problem turned to be in not binding the inherited connections to the real servers (destinations) on the backup director. Therefore: "ipvsadm -l" reports 0 connections: root@test2:~# ipvsadm -l IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP test2.local:5999 wlc -> node473.local:5999 Route 1000 0 0 -> node484.local:5999 Route 1000 0 0 while "ipvs -lnc" is right root@test2:~# ipvsadm -lnc IPVS connection entries pro expire state source virtual destination TCP 14:56 ESTABLISHED 192.168.0.10:39164 192.168.0.222:5999 192.168.0.51:5999 TCP 14:59 ESTABLISHED 192.168.0.10:39165 192.168.0.222:5999 192.168.0.52:5999 So the patch I am sending fixes the problem by binding the received connections to the appropriate service on the backup director, if it exists, else the connection will be handled the old way. So if the master and the backup directors are synchronized in terms of real services there will be no problem with server over-committing since new connections will not be created on the nonexistent real services on the backup. However if the service is created later on the backup, the binding will be performed when the next connection update is received. With this patch the inherited connections will show as inactive on the backup: root@test2:~# ipvsadm -l IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP test2.local:5999 wlc -> node473.local:5999 Route 1000 0 1 -> node484.local:5999 Route 1000 0 1 rumen@test2:~$ cat /proc/net/ip_vs IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP C0A800DE:176F wlc -> C0A80033:176F Route 1000 0 1 -> C0A80032:176F Route 1000 0 1 Regards, Rumen Bogdanovski Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com> Signed-off-by: Simon Horman <horms@verge.net.au>
*	[IPV4]: Compact some ifdefs in the fib code.	Pavel Emelyanov	2007-11-07	1	-9/+6
\| \| \| \| \| \| \| \| \| \| \|	There are places that check for CONFIG_IP_MULTIPLE_TABLES twice in the same file, but the internals of these #ifdefs can be merged. As a side effect - remove one ifdef from inside a function. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NET]: Define infrastructure to keep 'inuse' changes in an efficent SMP/NUMA ↵	Eric Dumazet	2007-11-07	1	-6/+57
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	way. "struct proto" currently uses an array stats[NR_CPUS] to track change on 'inuse' sockets per protocol. If NR_CPUS is big, this means we use a big memory area for this. Moreover, all this memory area is located on a single node on NUMA machines, increasing memory pressure on the boot node. In this patch, I tried to : - Keep a fast !CONFIG_SMP implementation - Keep a fast CONFIG_SMP implementation for often used protocols (tcp,udp,raw,...) - Introduce a NUMA efficient implementation Some helper macros are defined in include/net/sock.h These macros take into account CONFIG_SMP If a "struct proto" is declared without using DEFINE_PROTO_INUSE / REF_PROTO_INUSE macros, it will automatically use a default implementation, using a dynamically allocated percpu zone. This default implementation will be NUMA efficient, but might use 32/64 bytes per possible cpu because of current alloc_percpu() implementation. However it still should be better than previous implementation based on stats[NR_CPUS] field. When a "struct proto" is changed to use the new macros, we use a single static "int" percpu variable, lowering the memory and cpu costs, still preserving NUMA efficiency. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	cleanup asm/scatterlist.h includes	Adrian Bunk	2007-11-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	Not architecture specific code should not #include <asm/scatterlist.h>. This patch therefore either replaces them with #include <linux/scatterlist.h> or simply removes them if they were unused. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	[NET]: Relax the reference counting of init_net_ns	Pavel Emelyanov	2007-11-01	1	-8/+25
\| \| \| \| \| \| \| \| \|	When the CONFIG_NET_NS is n there's no need in refcounting the initial net namespace. So relax this code by making a stupid stubs for the "n" case. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NET]: Forget the zero_it argument of sk_alloc()	Pavel Emelyanov	2007-11-01	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Finally, the zero_it argument can be completely removed from the callers and from the function prototype. Besides, fix the checkpatch.pl warnings about using the assignments inside if-s. This patch is rather big, and it is a part of the previous one. I splitted it wishing to make the patches more readable. Hope this particular split helped. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NET]: Move the sock_copy() from the header	Pavel Emelyanov	2007-11-01	1	-14/+0
\| \| \| \| \| \| \| \|	The sock_copy() call is not used outside the sock.c file, so just move it into a sock.c Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	SCTP endianness annotations regression	Al Viro	2007-10-29	1	-1/+1
\| \| \| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	[NET]: Marking struct pernet_operations __net_initdata was inappropriate	Eric W. Biederman	2007-10-26	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is not safe to to place struct pernet_operations in a special section. We need struct pernet_operations to last until we call unregister_pernet_subsys. Which doesn't happen until module unload. So marking struct pernet_operations is a disaster for modules in two ways. - We discard it before we call the exit method it points to. - Because I keep struct pernet_operations on a linked list discarding it for compiled in code removes elements in the middle of a linked list and does horrible things for linked insert. So this looks safe assuming __exit_refok is not discarded for modules. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[SCTP] net/sctp/auth.c: make 3 functions static	Adrian Bunk	2007-10-26	1	-1/+0
\| \| \| \| \| \| \| \|	This patch makes three needlessly global functions static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[SCTP]: #if 0 sctp_update_copy_cksum()	Adrian Bunk	2007-10-26	1	-1/+0
\| \| \| \| \| \| \| \|	sctp_update_copy_cksum() is no longer used. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>