op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	net: mark read-only arrays as const	Jan Engelhardt	2009-08-05	3	-7/+8
\| \| \| \| \| \| \| \|	String literals are constant, and usually, we can also tag the array of pointers const too, moving it to the .rodata section. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: adding memory barrier to the poll and receive callbacks	Jiri Olsa	2009-07-09	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Adding memory barrier after the poll_wait function, paired with receive callbacks. Adding fuctions sock_poll_wait and sk_has_sleeper to wrap the memory barrier. Without the memory barrier, following race can happen. The race fires, when following code paths meet, and the tp->rcv_nxt and __add_wait_queue updates stay in CPU caches. CPU1 CPU2 sys_select receive packet ... ... __add_wait_queue update tp->rcv_nxt ... ... tp->rcv_nxt check sock_def_readable ... { schedule ... if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible(sk->sk_sleep) ... } If there was no cache the code would work ok, since the wait_queue and rcv_nxt are opposit to each other. Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already passed the tp->rcv_nxt check and sleeps, or will get the new value for tp->rcv_nxt and will return with new data mask. In both cases the process (CPU1) is being added to the wait queue, so the waitqueue_active (CPU2) call cannot miss and will wake up CPU1. The bad case is when the __add_wait_queue changes done by CPU1 stay in its cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1 will then endup calling schedule and sleep forever if there are no more data on the socket. Calls to poll_wait in following modules were ommited: net/bluetooth/af_bluetooth.c net/irda/af_irda.c net/irda/irnet/irnet_ppp.c net/mac80211/rc80211_pid_debugfs.c net/phonet/socket.c net/rds/af_rds.c net/rfkill/core.c net/sunrpc/cache.c net/sunrpc/rpc_pipe.c net/tipc/socket.c Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	ipv6: Use correct data types for ICMPv6 type and code	Brian Haley	2009-06-23	1	-1/+1
\| \| \| \| \| \| \| \| \|	Change all the code that deals directly with ICMPv6 type and code values to use u8 instead of a signed int as that's the actual data type. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: skb->dst accessors	Eric Dumazet	2009-06-03	3	-6/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Define three accessors to get/set dst attached to a skb struct dst_entry skb_dst(const struct sk_buff skb) void skb_dst_set(struct sk_buff skb, struct dst_entry dst) void skb_dst_drop(struct sk_buff *skb) This one should replace occurrences of : dst_release(skb->dst) skb->dst = NULL; Delete skb->dst field Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: skb->rtable accessor	Eric Dumazet	2009-06-03	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \|	Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb Delete skb->rtable field Setting rtable is not allowed, just set dst instead as rtable is an alias. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Do not let initial option overhead shrink the MPS	Gerrit Renker	2009-03-02	2	-2/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This fixes a problem caused by the overlap of the connection-setup and established-state phases of DCCP connections. During connection setup, the client retransmits Confirm Feature-Negotiation options until a response from the server signals that it can move from the half-established PARTOPEN into the OPEN state, whereupon the connection is fully established on both ends (RFC 4340, 8.1.5). However, since the client may already send data while it is in the PARTOPEN state, consequences arise for the Maximum Packet Size: the problem is that the initial option overhead is much higher than for the subsequent established phase, as it involves potentially many variable-length list-type options (server-priority options, RFC 4340, 6.4). Applying the standard MPS is insufficient here: especially with larger payloads this can lead to annoying, counter-intuitive EMSGSIZE errors. On the other hand, reducing the MPS available for the established phase by the added initial overhead is highly wasteful and inefficient. The solution chosen therefore is a two-phase strategy: If the payload length of the DataAck in PARTOPEN is too large, an Ack is sent to carry the options, and the feature-negotiation list is then flushed. This means that the server gets two Acks for one Response. If both Acks get lost, it is probably better to restart the connection anyway and devising yet another special-case does not seem worth the extra complexity. The result is a higher utilisation of the available packet space for the data transmission phase (established state) of a connection. The patch (over-)estimates the initial overhead to be 32*4 bytes -- commonly seen values were around 90 bytes for initial feature-negotiation options. It uses sizeof(u32) to mean "aligned units of 4 bytes". For consistency, another use of 4-byte alignment is adapted. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Minimise header option overhead in setting the MPS	Gerrit Renker	2009-03-02	2	-8/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch resolves a long-standing FIXME to dynamically update the Maximum Packet Size depending on actual options usage. It uses the flags set by the feature-negotiation infrastructure to compute the required header option size. Most options are fixed-size, a notable exception are Ack Vectors (required currently only by CCID-2). These can have any length between 3 and 1020 bytes. As a result of testing, 16 bytes (2 bytes for type/length plus 14 Ack Vector cells) have been found to be sufficient for loss-free situations. There are currently no CCID-specific header options which may appear on data packets, thus it is not necessary to define a corresponding CCID field as suggested in the old comment. Further changes: ---------------- Adjusted the type of 'cur_mps' to match the unsigned return type of the function. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Debugging functions for feature negotiation	Gerrit Renker	2009-01-21	4	-59/+109
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since all feature-negotiation processing now takes place in feat.c, functions for producing verbose debugging output are concentrated there. New functions to print out values, entry records, and options are provided, and also a macro is defined to not always have the function name in the output line. Thanks a lot to Wei Yongjun and Giuseppe Galeota for help and discussion with an earlier revision of this patch. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Initialisation and type-checking of feature sysctls	Gerrit Renker	2009-01-21	5	-23/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch takes care of initialising and type-checking sysctls related to feature negotiation. Type checking is important since some of the sysctls now directly impact the feature-negotiation process. The sysctls are initialised with the known default values for each feature. For the type-checking the value constraints from RFC 4340 are used: * Sequence Window uses the specified Wmin=32, the maximum is ulong (4 bytes), tested and confirmed that it works up to 4294967295 - for Gbps speed; * Ack Ratio is between 0 .. 0xffff (2-byte unsigned integer); * CCIDs are between 0 .. 255; * request_retries, retries1, retries2 also between 0..255 for good measure; * tx_qlen is checked to be non-negative; * sync_ratelimit remains as before. Notes: ------ 1. Die s@sysctl_dccp_feat@sysctl_dccp@g since the sysctls are now in feat.c. 2. As pointed out by Arnaldo, the pattern of type-checking repeats itself in other places, sometimes with exactly the same kind of definitions (e.g. "static int zero;"). It may be a good idea (kernel janitors?) to consolidate type checking. For the sake of keeping the changeset small and in order not to affect other subsystems, I have not strived to generalise here. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Implement both feature-local and feature-remote Sequence Window feature	Gerrit Renker	2009-01-21	4	-24/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds full support for local/remote Sequence Window feature, from which the * sequence-number-validity (W) and * acknowledgment-number-validity (W') windows derive as specified in RFC 4340, 7.5.3. Specifically, the following is contained in this patch: * integrated new socket fields into dccp_sk; * updated the update_gsr/gss routines with regard to these fields; * updated handler code: the Sequence Window feature is located at the TX side, so the local feature is meant if the handler-rx flag is false; * the initialisation of `rcv_wnd' in reqsk is removed, since - rcv_wnd is not used by the code anywhere; - sequence number checks are not done in the LISTEN state (cf. 7.5.3); - dccp_check_req checks the Ack number validity more rigorously; * the `struct dccp_minisock' became empty and is now removed. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Initialisation framework for feature negotiation	Gerrit Renker	2009-01-21	2	-10/+57
\| \| \| \| \| \| \| \| \| \| \| \|	This initialises feature negotiation from two tables, which are in turn are initialised from sysctls. As a novel feature, specifics of the implementation (e.g. that short seqnos and ECN are not yet available) are advertised for robustness. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp ccid-3: Fix RFC reference	Gerrit Renker	2009-01-11	1	-1/+1
\| \| \| \| \| \| \| \|	Thanks to Wei and Arnaldo for pointing out the correct new reference for CCID-3. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: fix section mismatch warnings in dccp/ccids/lib/tfrc.c	Leonardo Potenza	2009-01-11	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Removed the __exit annotation of tfrc_lib_exit(), in order to suppress the following section mismatch messages: WARNING: net/dccp/dccp.o(.text+0xd9): Section mismatch in reference from the function ccid_cleanup_builtins() to the function .exit.text:tfrc_lib_exit() The function ccid_cleanup_builtins() references a function in an exit section. Often the function tfrc_lib_exit() has valid usage outside the exit section and the fix is to remove the __exit annotation of tfrc_lib_exit. WARNING: net/dccp/dccp.o(.init.text+0x48): Section mismatch in reference from the function ccid_initialize_builtins() to the function .exit.text:tfrc_lib_exit() The function __init ccid_initialize_builtins() references a function __exit tfrc_lib_exit(). This is often seen when error handling in the init function uses functionality in the exit path. The fix is often to remove the __exit annotation of tfrc_lib_exit() so it may be used outside an exit section. Signed-off-by: Leonardo Potenza <lpotenza@inwind.it> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Integrate the TFRC library with DCCP	Gerrit Renker	2009-01-04	12	-52/+27
\| \| \| \| \| \| \| \|	This patch integrates the TFRC library, which is a dependency of CCID-3 (and CCID-4), with the new use of CCIDs in the DCCP module. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Clean up ccid.c after integration of CCID plugins	Gerrit Renker	2009-01-04	4	-176/+14
\| \| \| \| \| \| \| \| \| \|	This patch cleans up after integrating the CCID modules and, in addition, * moves the if/else cases from ccid_delete() into ccid_hc_{tx,rx}_delete(); * removes the 'gfp' argument to ccid_new() - since it is always gfp_any(). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Lockless integration of CCID congestion-control plugins	Gerrit Renker	2009-01-04	11	-168/+148
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Based on Arnaldo's earlier patch, this patch integrates the standardised CCID congestion control plugins (CCID-2 and CCID-3) of DCCP with dccp.ko: * enables a faster connection path by eliminating the need to always go through the CCID registration lock; * updates the implementation to use only a single array whose size equals the number of configured CCIDs instead of the maximum (256); * since the CCIDs are now fixed array elements, synchronization is no longer needed, simplifying use and implementation. CCID-2 is suggested as minimum for a basic DCCP implementation (RFC 4340, 10); CCID-3 is a standards-track CCID supported by RFC 4342 and RFC 5348. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: Fix percpu counters deadlock	Herbert Xu	2008-12-29	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we converted the protocol atomic counters such as the orphan count and the total socket count deadlocks were introduced due to the mismatch in BH status of the spots that used the percpu counter operations. Based on the diagnosis and patch by Peter Zijlstra, this patch fixes these issues by disabling BH where we may be in process context. Reported-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp_diag: LISTEN sockets don't have CCIDs	Arnaldo Carvalho de Melo	2008-12-17	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	And thus when we try to use 'ss -danemi' on these sockets that have no ccid blocks (data collected using systemtap after I fixed the problem): dccp_diag_get_info sk=0xffff8801220a3100, dp->dccps_hc_rx_ccid=0x0000000000000000, dp->dccps_hc_tx_ccid=0x0000000000000000 We get an OOPS: mica.ghostprotocols.net login: BUG: unable to handle kernel NULL pointer dereferenc0 IP: [<ffffffffa0136082>] dccp_diag_get_info+0x82/0xc0 [dccp_diag] PGD 12106f067 PUD 122488067 PMD 0 Oops: 0000 [#1] PREEMPT Fix is trivial, and 'ss -d' is working again: [root@mica ~]# ss -danemi State Recv-Q Send-Q Local Address:Port Peer Address:Port LISTEN 0 0 :5001 :* ino:7288 sk:220a3100ffff8801 mem:(r0,w0,f0,t0) cwnd:0 ssthresh:0 [root@mica ~]# Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp ccid-2: Phase out the use of boolean Ack Vector sysctl	Gerrit Renker	2008-12-08	7	-27/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This removes the use of the sysctl and the minisock variable for the Send Ack Vector feature, as it now is handled fully dynamically via feature negotiation (i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per RFC 4341, 4.). Using a sysctl in parallel to this implementation would open the door to crashes, since much of the code relies on tests of the boolean minisock / sysctl variable. Thus, this patch replaces all tests of type if (dccp_msk(sk)->dccpms_send_ack_vector) /* ... / with if (dp->dccps_hc_rx_ackvec != NULL) / ... / The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature negotiation concluded that Ack Vectors are to be used on the half-connection. Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child), so that the test is a valid one. The activation handler for Ack Vectors is called as soon as the feature negotiation has concluded at the server when the Ack marking the transition RESPOND => OPEN arrives; * client after it has sent its ACK, marking the transition REQUEST => PARTOPEN. Adding the sequence number of the Response packet to the Ack Vector has been removed, since (a) connection establishment implies that the Response has been received; (b) the CCIDs only look at packets received in the (PART)OPEN state, i.e. this entry will always be ignored; (c) it can not be used for anything useful - to detect loss for instance, only packets received after the loss can serve as pseudo-dupacks. There was a FIXME to change the error code when dccp_ackvec_add() fails. I removed this after finding out that: * the check whether ackno < ISN is already made earlier, * this Response is likely the 1st packet with an Ackno that the client gets, * so when dccp_ackvec_add() fails, the reason is likely not a packet error. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Remove manual influence on NDP Count feature	Gerrit Renker	2008-12-08	5	-13/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Updating the NDP count feature is handled automatically now: * for CCID-2 it is disabled, since the code does not use NDP counts; * for CCID-3 it is enabled, as NDP counts are used to determine loss lengths. Allowing the user to change NDP values leads to unpredictable and failing behaviour, since it is then possible to disable NDP counts even when they are needed (e.g. in CCID-3). This means that only those user settings are sensible that agree with the values for Send NDP Count implied by the choice of CCID. But those settings are already activated by the feature negotiation (CCID dependency tracking), hence this form of support is redundant. At startup the initialisation of the NDP count feature uses the default value of 0, which is done implicitly by the zeroing-out of the socket when it is allocated. If the choice of CCID or feature negotiation enables NDP count, this will then be updated via the NDP activation handler. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Remove obsolete parts of the old CCID interface	Gerrit Renker	2008-12-08	4	-34/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector case, their value equals initially that of the sysctl, but at the end of feature negotiation may be something different. The old interface removed by this patch thus has been replaced by the newer interface to dynamically query the currently loaded CCIDs. Also removed are the constructors for the TX CCID and the RX CCID, since the switch "rx <-> non-rx" is done by the handler in minisocks.c (and the handler is the only place in the code where CCIDs are loaded). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Clean up old feature-negotiation infrastructure	Gerrit Renker	2008-12-08	2	-505/+11
\| \| \| \| \| \| \| \| \|	The code removed by this patch is no longer referenced or used, the added lines update documentation and copyrights. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Integration of dynamic feature activation - part 3 (client side)	Gerrit Renker	2008-12-08	1	-4/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This integrates feature-activation in the client: 1. When dccp_parse_options() fails, the reset code is already set; request_sent\ _state_process() currently overrides this with `Packet Error', which is not intended - changed to use the reset code supplied by dccp_parse_options(). 2. When feature negotiation fails, the socket should be marked as not usable, so that the application is notified that an error occurred. This is achieved by a new label 'unable_to_proceed': generating an error code of `Aborted', setting the socket state to CLOSED, returning with ECOMM in sk_err. 3. Avoids parsing the Ack twice in Respond state by not doing option processing again in dccp_rcv_respond_partopen_state_process (as option processing has already been done on the request_sock in dccp_check_req). Since this addresses congestion-control initialisation, a corresponding FIXME has been removed. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Integration of dynamic feature activation - part 2 (server side)	Gerrit Renker	2008-12-08	1	-30/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch integrates the activation of features at the end of negotiation into the server-side code. Note regarding the removal of 'const': -------------------------------------- The 'const' attribute has been removed from 'dreq' since dccp_activate_values() needs to operate on dreq's feature list. Part of the activation is to remove those options from the list that have already been confirmed, hence it is not purely read-only. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Integration of dynamic feature activation - part 1 (socket setup)	Gerrit Renker	2008-12-08	1	-40/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This first patch out of three replaces the hardcoded default settings with initialisation code for the dynamic feature negotiation. The patch also ensures that the client feature-negotiation queue is flushed only when entering the OPEN state. Since confirmed Change options are removed as soon as they are confirmed (in the DCCP-Response), this ensures that Confirm options are retransmitted. Note on retransmitting Confirm options: --------------------------------------- Implementation experience showed that it is necessary to retransmit Confirm options. Thanks to Leandro Melo de Sales who reported a bug in an earlier revision of the patch set, resulting from not retransmitting these options. As long as the client is in PARTOPEN, it needs to retransmit the Confirm options for the Change options received on the DCCP-Response from the server. Otherwise, if the packet containing the Confirm options gets dropped in the network, the connection aborts due to undefined feature negotiation state. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: use roundup instead of opencoding	Ilpo Järvinen	2008-12-05	1	-1/+1
\| \| \| \| \|	Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Feature activation handlers	Gerrit Renker	2008-12-01	2	-10/+204
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch provides the post-processing of feature negotiation state, after the negotiation has completed. To this purpose, handlers are used and added to the dccp_feat_table. Each handler is passed a boolean flag whether the RX or TX side of the feature is meant. Several handlers are provided already, new handlers can easily be added. The initialisation is now fully dynamic, i.e. CCIDs are activated only after the feature negotiation. The integration of this dynamic activation is done in the subsequent patches. Thanks to Wei Yongjun for pointing out the necessity of skipping over empty Confirm options while copying the negotiated feature values. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Processing Confirm options	Gerrit Renker	2008-12-01	3	-17/+101
\| \| \| \| \| \| \| \| \| \| \| \| \|	Analogous to the previous patch, this adds code to interpret incoming Confirm feature-negotiation options. Both functions operate on the feature-negotiation list of either the request_sock (server) or the dccp_sock (client). Thanks to Wei Yongjun for pointing out that it is overly restrictive to check the entire list of confirmed SP values. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Process incoming Change feature-negotiation options	Gerrit Renker	2008-12-01	3	-18/+189
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds/replaces code for processing incoming ChangeL/R options. The main difference is that: * mandatory FN options are now interpreted inside the function (there are too many individual cases to do this externally); * the function returns an appropriate Reset code or 0, which is then used to fill in the data for the Reset packet. Old code, which is no longer used or referenced, has been removed. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Preference list reconciliation	Gerrit Renker	2008-12-01	1	-2/+75
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This provides two functions to * reconcile preference lists (with appropriate return codes) and * reorder the preference list if successful reconciliation changed the preferred value. The patch also removes the old code for processing SP/NN Change options, since new code to process these is mostly there already; related references have been commented out. The code for processing Change options follows in the next patch. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Integrate feature-negotiation insertion code	Gerrit Renker	2008-12-01	1	-12/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The patch implements insertion of feature negotiation at the server (listening and request socket) and the client (connecting socket). In dccp_insert_options(), several statements have been grouped together now to achieve (it is hoped) better efficiency by reducing the number of tests each packet has to go through: - Ack Vectors are sent if the packet is neither a Data or a Request packet; - a previous issue is corrected - feature negotiation options are allowed on DataAck packets (5.8). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Insert feature-negotiation options into skb	Gerrit Renker	2008-12-01	2	-0/+67
\| \| \| \| \| \| \| \| \| \|	This patch replaces the earlier insertion routine from options.c, so that code specific to feature negotiation can remain in feat.c. This is possible by calling a function already existing in options.c. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: Use a percpu_counter for orphan_count	Eric Dumazet	2008-11-25	2	-7/+11
\| \| \| \| \| \| \| \| \|	Instead of using one atomic_t per protocol, use a percpu_counter for "orphan_count", to reduce cache line contention on heavy duty network servers. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	netns xfrm: lookup in netns	Alexey Dobriyan	2008-11-25	1	-5/+5
\| \| \| \| \| \| \| \| \| \|	Pass netns to xfrm_lookup()/__xfrm_lookup(). For that pass netns to flow_cache_lookup() and resolver callback. Take it from socket or netdevice. Stub DECnet to init_net. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: fix warning in net/dccp/options.c	Ingo Molnar	2008-11-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	this warning: net/dccp/options.c: In function ‘dccp_parse_options’: net/dccp/options.c:67: warning: ‘value’ may be used uninitialized in this function is a bogus GCC warning. The compiler does not recognize the relation between "value" and "mandatory" variables: the code flow can ever reach the "out_invalid_option:" label if 'mandatory' is set to 1, and when 'mandatory' is non-zero, we'll always have 'value' initialized. Help out the compiler by annotating the variable. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Header option insertion routine for feature-negotiation	Gerrit Renker	2008-11-23	2	-60/+33
\| \| \| \| \| \| \| \| \| \| \| \|	The patch extends existing code: * Confirm options divide into the confirmed value plus an optional preference list for SP values. Previously only the preference list was echoed for SP values, now the confirmed value is added as per RFC 4340, 6.1; * length and sanity checks are added to avoid illegal memory (or NULL) access. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Support for Mandatory options	Gerrit Renker	2008-11-23	2	-0/+17
\| \| \| \| \| \| \| \| \| \|	Support for Mandatory options is provided by this patch, which will be used by subsequent feature-negotiation patches. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Increase the scope of variable-length htonl/ntohl functions	Gerrit Renker	2008-11-23	2	-7/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This extends the scope of two available functions, encode\|decode_value_var, to work up to 6 (8) bytes, to match maximum requirements in the RFC. These functions are going to be used both by general option processing and feature negotiation code, hence declarations have been put into feat.h. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: API to query the current TX/RX CCID	Gerrit Renker	2008-11-23	3	-5/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This provides function to query the current TX/RX CCID dynamically, without reliance on the minisock value, using dynamic information available in the currently loaded CCID module. This query function is then used to (a) provide the getsockopt part for getting/setting CCIDs via sockopts; (b) replace the current test for "which CCID is in use" in probe.c. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Set per-connection CCIDs via socket options	Gerrit Renker	2008-11-23	4	-8/+42
\| \| \| \| \| \| \| \| \| \| \| \| \|	With this patch, TX/RX CCIDs can now be changed on a per-connection basis, which overrides the defaults set by the global sysctl variables for TX/RX CCIDs. To make full use of this facility, the remaining patches of this patch set are needed, which track dependencies and activate negotiated feature values. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Fix bracing in dccp_feat_list_lookup.	Gerrit Renker	2008-11-20	1	-1/+2
\| \| \| \| \| \|	From: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: listening_hash get a spinlock per bucket	Eric Dumazet	2008-11-20	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	This patch prepares RCU migration of listening_hash table for TCP/DCCP protocols. listening_hash table being small (32 slots per protocol), we add a spinlock for each slot, instead of a single rwlock for whole table. This should reduce hold time of readers, and writers concurrency. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: inet_diag_handler structs can be const	Eric Dumazet	2008-11-19	1	-1/+1
\| \| \| \| \|	Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Tidy up setsockopt calls	Gerrit Renker	2008-11-16	1	-11/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This splits the setsockopt calls into two groups, depending on whether an integer argument (val) is required and whether routines being called do their own locking. Some options (such as setting the CCID) use u8 rather than int, so that for these the test with regard to integer-sizeof can not be used. The second switch-case statement now only has those statements which need locking and which make use of `val'. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Reviewed-by: Eugene Teo <eugeneteo@kernel.sg> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Deprecate Ack Ratio sysctl	Gerrit Renker	2008-11-16	4	-10/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch deprecates the Ack Ratio sysctl, since * Ack Ratio is entirely ignored by CCID-3 and CCID-4, * Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1); * even if it would work in CCID-2, there is no point for a user to change it: - Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2), - if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts (since waiting for Acks which will never arrive in this window), - cwnd is not a user-configurable value. The only reasonable place for Ack Ratio is to print it for debugging. It is planned to do this later on, as part of e.g. dccp_probe. With this patch Ack Ratio is now under full control of feature negotiation: * Ack Ratio is resolved as a dependency of the selected CCID; * if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to the default of 2, following RFC 4340, 11.3 - "New connections start with Ack Ratio 2 for both endpoints"; * what happens then is part of another patch set, since it concerns the dynamic update of Ack Ratio while the connection is in full flight. Thanks to Tomasz Grobelny for discussion leading up to this patch. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Feature negotiation for minimum-checksum-coverage	Gerrit Renker	2008-11-16	1	-13/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This provides feature negotiation for server minimum checksum coverage which so far has been missing. Since sender/receiver coverage values range only from 0...15, their type has also been reduced in size from u16 to u4. Feature-negotiation options are now generated for both sender and receiver coverage, i.e. when the peer has `forgotten' to enable partial coverage then feature negotiation will automatically enable (negotiate) the partial coverage value for this connection. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Deprecate old setsockopt framework	Gerrit Renker	2008-11-16	3	-98/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The previous setsockopt interface, which passed socket options via struct dccp_so_feat, is complicated/difficult to use. Continuing to support it leads to ugly code since the old approach did not distinguish between NN and SP values. This patch removes the old setsockopt interface and replaces it with two new functions to register NN/SP values for feature negotiation. These are essentially wrappers around the internal __feat_register functions, with checking added to avoid * wrong usage (type); * changing values while the connection is in progress. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Mechanism to resolve CCID dependencies	Gerrit Renker	2008-11-16	3	-4/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds a hook to resolve features whose value depends on the choice of CCID. It is done at the server since it can only be done after the CCID values have been negotiated; i.e. the client will add its CCID preference list on the Change options sent in the Request, which will be reconciled with the local preference list of the server. The concept is documented on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\ implementation_notes.html#ccid_dependencies Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls	Eric Dumazet	2008-11-16	3	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	RCU was added to UDP lookups, using a fast infrastructure : - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the price of call_rcu() at freeing time. - hlist_nulls permits to use few memory barriers. This patch uses same infrastructure for TCP/DCCP established and timewait sockets. Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications using short lived TCP connections. A followup patch, converting rwlocks to spinlocks will even speedup this case. __inet_lookup_established() is pretty fast now we dont have to dirty a contended cache line (read_lock/read_unlock) Only established and timewait hashtable are converted to RCU (bind table and listen table are still using traditional locking) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	dccp: Resolve dependencies of features on choice of CCID	Gerrit Renker	2008-11-12	4	-0/+168
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This provides a missing link in the code chain, as several features implicitly depend and/or rely on the choice of CCID. Most notably, this is the Send Ack Vector feature, but also Ack Ratio and Send Loss Event Rate (also taken care of). For Send Ack Vector, the situation is as follows: * since CCID2 mandates the use of Ack Vectors, there is no point in allowing endpoints which use CCID2 to disable Ack Vector features such a connection; * a peer with a TX CCID of CCID2 will always expect Ack Vectors, and a peer with a RX CCID of CCID2 must always send Ack Vectors (RFC 4341, sec. 4); * for all other CCIDs, the use of (Send) Ack Vector is optional and thus negotiable. However, this implies that the code negotiating the use of Ack Vectors also supports it (i.e. is able to supply and to either parse or ignore received Ack Vectors). Since this is not the case (CCID-3 has no Ack Vector support), the use of Ack Vectors is here disabled, with a comment in the source code. An analogous consideration arises for the Send Loss Event Rate feature, since the CCID-3 implementation does not support the loss interval options of RFC 4342. To make such use explicit, corresponding feature-negotiation options are inserted which signal the use of the loss event rate option, as it is used by the CCID3 code. Lastly, the values of the Ack Ratio feature are matched to the choice of CCID. The patch implements this as a function which is called after the user has made all other registrations for changing default values of features. The table is variable-length, the reserved (and hence for feature-negotiation invalid, confirmed by considering section 19.4 of RFC 4340) feature number `0' is used to mark the end of the table. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>