summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* tipc: Don't use iocb argument in socket layerYing Xue2015-03-021-38/+44
| | | | | | | | | | | | | | | | | Currently the iocb argument is used to idenfiy whether or not socket lock is hold before tipc_sendmsg()/tipc_send_stream() is called. But this usage prevents iocb argument from being dropped through sendmsg() at socket common layer. Therefore, in the commit we introduce two new functions called __tipc_sendmsg() and __tipc_send_stream(). When they are invoked, it assumes that their callers have taken socket lock, thereby avoiding the weird usage of iocb argument. Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'dropcount'David S. Miller2015-03-0218-48/+79
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eyal Birger says: ==================== net: move skb->dropcount to skb->cb[] Commit 977750076d98 ("af_packet: add interframe drop cmsg (v6)") unionized skb->mark and skb->dropcount in order to allow recording of the socket drop count while maintaining struct sk_buff size. skb->dropcount was introduced since there was no available room in skb->cb[] in packet sockets. However, its introduction led to the inability to export skb->mark to userspace. It was considered to alias skb->priority instead of skb->mark. However, that would lead to the inabilty to export skb->priority to userspace if desired. Such change may also lead to hard-to-find issues as skb->priority is assumed to be alias free, and, as noted by Shmulik Ladkani, is not 'naturally orthogonal' with other skb fields. This patch series follows the suggestions made by Eric Dumazet moving the dropcount metric to skb->cb[], eliminating this problem at the expense of 4 bytes less in skb->cb[] for protocol families using it. The patch series include compactization of bluetooth and packet use of skb->cb[] as well as the infrastructure for placing dropcount in skb->cb[]. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: move skb->dropcount to skb->cb[]Eyal Birger2015-03-023-6/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 977750076d98 ("af_packet: add interframe drop cmsg (v6)") unionized skb->mark and skb->dropcount in order to allow recording of the socket drop count while maintaining struct sk_buff size. skb->dropcount was introduced since there was no available room in skb->cb[] in packet sockets. However, its introduction led to the inability to export skb->mark, or any other aliased field to userspace if so desired. Moving the dropcount metric to skb->cb[] eliminates this problem at the expense of 4 bytes less in skb->cb[] for protocol families using it. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: add common accessor for setting dropcount on packetsEyal Birger2015-03-023-2/+8
| | | | | | | | | | | | | | | | As part of an effort to move skb->dropcount to skb->cb[], use a common function in order to set dropcount in struct sk_buff. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: use common macro for assering skb->cb[] available size in protocol familiesEyal Birger2015-03-029-14/+13
| | | | | | | | | | | | | | | | | | As part of an effort to move skb->dropcount to skb->cb[] use a common macro in protocol families using skb->cb[] for ancillary data to validate available room in skb->cb[]. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: packet: use sockaddr_ll fields as storage for skb original length in ↵Eyal Birger2015-03-021-6/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | recvmsg path As part of an effort to move skb->dropcount to skb->cb[], 4 bytes of additional room are needed in skb->cb[] in packet sockets. Store the skb original length in the first two fields of sockaddr_ll (sll_family and sll_protocol) as they can be derived from the skb when needed. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: rxrpc: change call to sock_recv_ts_and_drops() on rxrpc recvmsg to ↵Eyal Birger2015-03-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sock_recv_timestamp() Commit 3b885787ea4112 ("net: Generalize socket rx gap / receive queue overflow cmsg") allowed receiving packet dropcount information as a socket level option. RXRPC sockets recvmsg function was changed to support this by calling sock_recv_ts_and_drops() instead of sock_recv_timestamp(). However, protocol families wishing to receive dropcount should call sock_queue_rcv_skb() or set the dropcount specifically (as done in packet_rcv()). This was not done for rxrpc and thus this feature never worked on these sockets. Formalizing this by not calling sock_recv_ts_and_drops() in rxrpc as part of an effort to move skb->dropcount into skb->cb[] Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bluetooth: compact struct bt_skb_cb by converting boolean fields to bit ↵Eyal Birger2015-03-024-6/+6
| | | | | | | | | | | | | | | | | | | | | | fields Convert boolean fields incoming and req_start to bit fields and move force_active in order save space in bt_skb_cb in an effort to use a portion of skb->cb[] for storing skb->dropcount. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bluetooth: compact struct bt_skb_cb by inlining struct hci_req_ctrlEyal Birger2015-03-025-19/+15
|/ | | | | | | | | | struct hci_req_ctrl is never used outside of struct bt_skb_cb; Inlining it frees 8 bytes on a 64 bit system in skb->cb[] allowing the addition of more ancillary data. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* pppoe: Use workqueue to die properly when a PADT is receivedSimon Farnsworth2015-03-022-1/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When a PADT frame is received, the socket may not be in a good state to close down the PPP interface. The current implementation handles this by simply blocking all further PPP traffic, and hoping that the lack of traffic will trigger the user to investigate. Use schedule_work to get to a process context from which we clear down the PPP interface, in a fashion analogous to hangup on a TTY-based PPP interface. This causes pppd to disconnect immediately, and allows tools to take immediate corrective action. Note that pppd's rp_pppoe.so plugin has code in it to disable the session when it disconnects; however, as a consequence of this patch, the session is already disabled before rp_pppoe.so is asked to disable the session. The result is a harmless error message: Failed to disconnect PPPoE socket: 114 Operation already in progress This message is safe to ignore, as long as the error is 114 Operation already in progress; in that specific case, it means that the PPPoE session has already been disabled before pppd tried to disable it. Signed-off-by: Simon Farnsworth <simon@farnz.org.uk> Tested-by: Dan Williams <dcbw@redhat.com> Tested-by: Christoph Schulz <develop@kristov.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* bnx2: disable toggling of rxvlan if necessaryIvan Vecera2015-03-011-12/+3
| | | | | | | | | | | | The bnx2 driver uses .ndo_fix_features to force enable of Rx VLAN tag stripping when the card cannot disable it. The driver should remove NETIF_F_HW_VLAN_CTAG_RX flag from hw_features instead so it is fixed for the ethtool. Cc: Sony Chacko <sony.chacko@qlogic.com> Cc: Dept-HSGLinuxNICDev@qlogic.com Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: macb: Properly add DMACFG bit definitionsArun Chandran2015-03-011-1/+2
| | | | | | | | | Add *_SIZE macros for the bits ENDIA_DESC and ENDIA_PKT Signed-off-by: Arun Chandran <achandran@mvista.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: macb: Add on the fly CPU endianness detectionArun Chandran2015-03-011-4/+18
| | | | | | | | | | Program management descriptor's access mode according to the dynamically detected CPU endianness. Signed-off-by: Arun Chandran <achandran@mvista.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Tested-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Driver: Vmxnet3: Copy TCP header to mapped frame for IPv6 packetsShrikrishna Khare2015-03-012-11/+23
| | | | | | | | | Allows for packet parsing to be done by the fast path. This performance optimization already exists for IPv4. Add similar logic for IPv6. Signed-off-by: Amitabha Banerjee <banerjeea@vmware.com> Signed-off-by: Shrikrishna Khare <skhare@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'ebpf_support_for_cls_bpf'David S. Miller2015-03-0115-193/+243
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== eBPF support for cls_bpf This is the non-RFC version of my patchset posted before netdev01 [1] conference. It contains a couple of eBPF cleanups and preparation patches to get eBPF support into cls_bpf. The last patch adds the actual support. I'll post the iproute2 parts after the kernel bits are merged, an initial preview link to the code is mentioned in the last patch. Patch 4 and 5 were originally one patch, but I've split them into two parts upon request as patch 4 only is also needed for Alexei's tracing patches that go via tip tree. Tested with tc and all in-kernel available BPF test suites. I have configured and built LLVM with --enable-experimental-targets=BPF but as Alexei put it, the plan is to get rid of the experimental status in future [2]. Thanks a lot! v1 -> v2: - Removed arch patches from this series - x86 is already queued in tip tree, under x86/mm - arm64 just reposted directly to arm folks - Rest is unchanged [1] http://thread.gmane.org/gmane.linux.network/350191 [2] http://article.gmane.org/gmane.linux.kernel/1874969 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * cls_bpf: add initial eBPF support for programmable classifiersDaniel Borkmann2015-03-013-52/+158
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work extends the "classic" BPF programmable tc classifier by extending its scope also to native eBPF code! This allows for user space to implement own custom, 'safe' C like classifiers (or whatever other frontend language LLVM et al may provide in future), that can then be compiled with the LLVM eBPF backend to an eBPF elf file. The result of this can be loaded into the kernel via iproute2's tc. In the kernel, they can be JITed on major archs and thus run in native performance. Simple, minimal toy example to demonstrate the workflow: #include <linux/ip.h> #include <linux/if_ether.h> #include <linux/bpf.h> #include "tc_bpf_api.h" __section("classify") int cls_main(struct sk_buff *skb) { return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos)); } char __license[] __section("license") = "GPL"; The classifier can then be compiled into eBPF opcodes and loaded via tc, for example: clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o tc filter add dev em1 parent 1: bpf cls.o [...] As it has been demonstrated, the scope can even reach up to a fully fledged flow dissector (similarly as in samples/bpf/sockex2_kern.c). For tc, maps are allowed to be used, but from kernel context only, in other words, eBPF code can keep state across filter invocations. In future, we perhaps may reattach from a different application to those maps e.g., to read out collected statistics/state. Similarly as in socket filters, we may extend functionality for eBPF classifiers over time depending on the use cases. For that purpose, cls_bpf programs are using BPF_PROG_TYPE_SCHED_CLS program type, so we can allow additional functions/accessors (e.g. an ABI compatible offset translation to skb fields/metadata). For an initial cls_bpf support, we allow the same set of helper functions as eBPF socket filters, but we could diverge at some point in time w/o problem. I was wondering whether cls_bpf and act_bpf could share C programs, I can imagine that at some point, we introduce i) further common handlers for both (or even beyond their scope), and/or if truly needed ii) some restricted function space for each of them. Both can be abstracted easily through struct bpf_verifier_ops in future. The context of cls_bpf versus act_bpf is slightly different though: a cls_bpf program will return a specific classid whereas act_bpf a drop/non-drop return code, latter may also in future mangle skbs. That said, we can surely have a "classify" and "action" section in a single object file, or considered mentioned constraint add a possibility of a shared section. The workflow for getting native eBPF running from tc [1] is as follows: for f_bpf, I've added a slightly modified ELF parser code from Alexei's kernel sample, which reads out the LLVM compiled object, sets up maps (and dynamically fixes up map fds) if any, and loads the eBPF instructions all centrally through the bpf syscall. The resulting fd from the loaded program itself is being passed down to cls_bpf, which looks up struct bpf_prog from the fd store, and holds reference, so that it stays available also after tc program lifetime. On tc filter destruction, it will then drop its reference. Moreover, I've also added the optional possibility to annotate an eBPF filter with a name (e.g. path to object file, or something else if preferred) so that when tc dumps currently installed filters, some more context can be given to an admin for a given instance (as opposed to just the file descriptor number). Last but not least, bpf_prog_get() and bpf_prog_put() needed to be exported, so that eBPF can be used from cls_bpf built as a module. Thanks to 60a3b2253c41 ("net: bpf: make eBPF interpreter images read-only") I think this is of no concern since anything wanting to alter eBPF opcode after verification stage would crash the kernel. [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ebpf: move read-only fields to bpf_prog and shrink bpf_prog_auxDaniel Borkmann2015-03-015-12/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | is_gpl_compatible and prog_type should be moved directly into bpf_prog as they stay immutable during bpf_prog's lifetime, are core attributes and they can be locked as read-only later on via bpf_prog_select_runtime(). With a bit of rearranging, this also allows us to shrink bpf_prog_aux to exactly 1 cacheline. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ebpf: add sched_cls_type and map it to sk_filter's verifier opsDaniel Borkmann2015-03-013-2/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As discussed recently and at netconf/netdev01, we want to prevent making bpf_verifier_ops registration available for modules, but have them at a controlled place inside the kernel instead. The reason for this is, that out-of-tree modules can go crazy and define and register any verfifier ops they want, doing all sorts of crap, even bypassing available GPLed eBPF helper functions. We don't want to offer such a shiny playground, of course, but keep strict control to ourselves inside the core kernel. This also encourages us to design eBPF user helpers carefully and generically, so they can be shared among various subsystems using eBPF. For the eBPF traffic classifier (cls_bpf), it's a good start to share the same helper facilities as we currently do in eBPF for socket filters. That way, we have BPF_PROG_TYPE_SCHED_CLS look like it's own type, thus one day if there's a good reason to diverge the set of helper functions from the set available to socket filters, we keep ABI compatibility. In future, we could place all bpf_prog_type_list at a central place, perhaps. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ebpf: remove CONFIG_BPF_SYSCALL ifdefs in socket filter codeDaniel Borkmann2015-03-011-22/+14
| | | | | | | | | | | | | | | | | | | | | | | | This gets rid of CONFIG_BPF_SYSCALL ifdefs in the socket filter code, now that the BPF internal header can deal with it. While going over it, I also changed eBPF related functions to a sk_filter prefix to be more consistent with the rest of the file. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ebpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefsDaniel Borkmann2015-03-011-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Socket filter code and other subsystems with upcoming eBPF support should not need to deal with the fact that we have CONFIG_BPF_SYSCALL defined or not. Having the bpf syscall as a config option is a nice thing and I'd expect it to stay that way for expert users (I presume one day the default setting of it might change, though), but code making use of it should not care if it's actually enabled or not. Instead, hide this via header files and let the rest deal with it. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ebpf: export BPF_PSEUDO_MAP_FD to uapiDaniel Borkmann2015-03-013-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to export BPF_PSEUDO_MAP_FD to user space, as it's used in the ELF BPF loader where instructions are being loaded that need map fixups. An initial stage loads all maps into the kernel, and later on replaces related instructions in the eBPF blob with BPF_PSEUDO_MAP_FD as source register and the actual fd as immediate value. The kernel verifier recognizes this keyword and replaces the map fd with a real pointer internally. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ebpf: constify various function pointer structsDaniel Borkmann2015-03-015-19/+19
| | | | | | | | | | | | | | | | | | We can move bpf_map_ops and bpf_verifier_ops and other structs into ro section, bpf_map_type_list and bpf_prog_type_list into read mostly. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ebpf: remove kernel test stubsDaniel Borkmann2015-03-013-83/+3
|/ | | | | | | | | | | | | Now that we have BPF_PROG_TYPE_SOCKET_FILTER up and running, we can remove the test stubs which were added to get the verifier suite up. We can just let the test cases probe under socket filter type instead. In the fill/spill test case, we cannot (yet) access fields from the context (skb), but we may adapt that test case in future. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 's390-next'David S. Miller2015-02-287-3742/+4
|\ | | | | | | | | | | | | | | | | | | | | | | Ursula Braun says: ==================== s390: network patches for net-next here are some s390 related patches for net-next ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * MAINTAINERS: update S390 NETWORK DRIVERS maintainerUrsula Braun2015-02-281-1/+0
| | | | | | | | | | | | | | | | remove Frank Blaschka as S390 NETWORK DRIVERS maintainer Acked-by: Frank Blaschka <blaschka@linux.vnet.ibm.com> Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * qeth: Fix command sizesStefan Raspl2015-02-281-2/+3
| | | | | | | | | | | | | | | | | | | | This patch adjusts two instances where we were using the (too big) struct qeth_ipacmd_setadpparms size instead of the commands' actual size. This didn't do any harm, but wasted a few bytes. Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com> Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390: remove claw driverUrsula Braun2015-02-285-3739/+1
|/ | | | | | | | claw devices are outdated and no longer supported. This patch removes the claw driver. Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: cleanup static functionsEric Dumazet2015-02-283-5/+4
| | | | | | | | | tcp_fastopen_create_child() is static and should not be exported. tcp4_gso_segment() and tcp6_gso_segment() should be static. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* hyperv: Implement netvsc_get_channels() ethool opAndrew Schwartzmeyer2015-02-283-1/+20
| | | | | | | | | | | | | | This adds support for reporting the actual and maximum combined channels count of the hv_netvsc driver via 'ethtool --show-channels'. This required adding 'max_chn' to 'struct netvsc_device', and assigning it 'rsscap.num_recv_que' in 'rndis_filter_device_add'. Now we can access the combined maximum channel count via 'struct netvsc_device' in the ethtool callback. Signed-off-by: Andrew Schwartzmeyer <andrew@schwartzmeyer.com> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'tcp-tso'David S. Miller2015-02-282-13/+17
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eric Dumazet says: ==================== tcp: tso improvements This patch serie reworks tcp_tso_should_defer() a bit to get less bursts, and better ECN behavior. We also removed tso_deferred field in tcp socket. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: tso: allow CA_CWR state in tcp_tso_should_defer()Eric Dumazet2015-02-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Another TCP issue is triggered by ECN. Under pressure, receiver gets ECN marks, and send back ACK packets with ECE TCP flag. Senders enter CA_CWR state. In this state, tcp_tso_should_defer() is short cut : if (icsk->icsk_ca_state != TCP_CA_Open) goto send_now; This means that about all ACK packets we receive are triggering a partial send, and because cwnd is kept small, we can only send a small amount of data for each incoming ACK, which in return generate more ACK packets. Allowing CA_Open and CA_CWR states to enable TSO defer in tcp_tso_should_defer() brings performance back : TSO autodefer has more chance to defer under pressure. This patch increases TSO and LRO/GRO efficiency back to normal levels, and does not impact overall ECN behavior. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: tso: restore IW10 after TSO autosizingEric Dumazet2015-02-281-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With sysctl_tcp_min_tso_segs being 4, it is very possible that tcp_tso_should_defer() decides not sending last 2 MSS of initial window of 10 packets. This also applies if autosizing decides to send X MSS per GSO packet, and cwnd is not a multiple of X. This patch implements an heuristic based on age of first skb in write queue : If it was sent very recently (less than half srtt), we can predict that no ACK packet will come in less than half rtt, so deferring might cause an under utilization of our window. This is visible on initial send (IW10) on web servers, but more generally on some RPC, as the last part of the message might need an extra RTT to get delivered. Tested: Ran following packetdrill test // A simple server-side test that sends exactly an initial window (IW10) // worth of packets. `sysctl -e -q net.ipv4.tcp_min_tso_segs=4` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +.1 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6> +.1 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 write(4, ..., 14600) = 14600 +0 > . 1:5841(5840) ack 1 win 457 +0 > . 5841:11681(5840) ack 1 win 457 // Following packet should be sent right now. +0 > P. 11681:14601(2920) ack 1 win 457 +.1 < . 1:1(0) ack 14601 win 257 +0 close(4) = 0 +0 > F. 14601:14601(0) ack 1 +.1 < F. 1:1(0) ack 14602 win 257 +0 > . 14602:14602(0) ack 2 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: tso: remove tp->tso_deferredEric Dumazet2015-02-282-10/+5
|/ | | | | | | | | | | | | | | | | | | | | | | TSO relies on ability to defer sending a small amount of packets. Heuristic is to wait for future ACKS in hope to send more packets at once. Current algorithm uses a per socket tso_deferred field as a pseudo timer. This pseudo timer relies on future ACK, but there is no guarantee we receive them in time. Fix would be to use a real timer, but cost of such timer is probably too expensive for typical cases. This patch changes the logic to test the time of last transmit, because we should not add bursts of more than 1ms for any given flow. We've used this patch for about two years at Google, before FQ/pacing as it would reduce a fair amount of bursts. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* usbnet: Fix tx_packets stat for FLAG_MULTI_FRAME driversBen Hutchings2015-02-285-3/+20
| | | | | | | | | | | | | | | | | Currently the usbnet core does not update the tx_packets statistic for drivers with FLAG_MULTI_PACKET and there is no hook in the TX completion path where they could do this. cdc_ncm and dependent drivers are bumping tx_packets stat on the transmit path while asix and sr9800 aren't updating it at all. Add a packet count in struct skb_data so these drivers can fill it in, initialise it to 1 for other drivers, and add the packet count to the tx_packets statistic on completion. Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Tested-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'tipc-next'David S. Miller2015-02-277-31/+15
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Erik Hugne says: ==================== tipc: bug fix and some improvements Most important is a fix for a nullptr exception that would occur when name table subscriptions fail. The remaining patches are performance improvements and cosmetic changes. v2: remove unnecessary whitespace in patch #2 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: make media address offset a common defineErik Hugne2015-02-272-4/+3
| | | | | | | | | | | | | | | | | | | | With the exception of infiniband media which does not use media offsets, the media address is always located at offset 4 in the media info field as defined by the protocol, so we move the definition to the generic bearer.h Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: rename media/msg related definitionsErik Hugne2015-02-274-6/+6
| | | | | | | | | | | | | | | | | | | | The TIPC_MEDIA_ADDR_SIZE and TIPC_MEDIA_ADDR_OFFSET names are misleading, as they actually define the size and offset of the whole media info field and not the address part. This patch does not have any functional changes. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: purge links when bearer is disabledErik Hugne2015-02-271-1/+1
| | | | | | | | | | | | | | | | | | | | If a bearer is disabled by manual intervention, all links over that bearer should be purged, indicated with the 'shutting_down' flag. Otherwise tipc will get confused if a new bearer is enabled using a different media type. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: fix nullpointer bug when subscribing to eventsErik Hugne2015-02-271-19/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a subscription request is sent to a topology server connection, and any error occurs (malformed request, oom or limit reached) while processing this request, TIPC should terminate the subscriber connection. While doing so, it tries to access fields in an already freed (or never allocated) subscription element leading to a nullpointer exception. We fix this by removing the subscr_terminate function and terminate the connection immediately upon any subscription failure. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: only create header copy for name distr messagesErik Hugne2015-02-271-1/+1
|/ | | | | | | | | | | | The TIPC name distributor pushes topology updates to the cluster neighbors. Currently this is done in a unicast manner, and the skb holding the update is cloned for each cluster member. This is unnecessary, as we only modify the destnode field in the header so we change it to do pskb_copy instead. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* team: allow TSO being set on masterJiri Pirko2015-02-271-0/+3
| | | | | | | | | | | | | | This patch allows TSO being set/unset on the master, so that GSO segmentation is done after team layer. Similar patch is present for bonding: b0ce3508b25e ("bonding: allow TSO being set on bonding master") and bridge: f902e8812ef6 ("bridge: Add ability to enable TSO") Suggested-by: Jiri Prochazka <jprochaz@redhat.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'fib_trie_remove_leaf_info'David S. Miller2015-02-274-322/+176
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexander Duyck says: ==================== fib_trie: Remove leaf_info structure This patch set removes the leaf_info structure from the IPv4 fib_trie. The general idea is that the leaf_info structure itself only held about 6 actual bits of data, beyond that it was mostly just waste. As such we can drop the structure, move the 1 byte representing the prefix/suffix length into the fib_alias and just link it all into one list. My testing shows that this saves somewhere between 4 to 10ns depending on the type of test performed. I'm suspecting that this represents 1 to 2 L1 cache misses saved per look-up. One side effect of this change is that semantic_match_miss will now only increment once per leaf instead of once per leaf_info miss. However the stat is already skewed now that we perform a preliminary check on the leaf as a part of the look-up. I also have gone through and addressed a number of ordering issues in the first patch since I had misread the behavior of list_add_tail. I have since run some additional testing and verified the resulting lists are in the same order when combining multiple prefix length and tos values in a single leaf. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * fib_trie: Remove leaf_infoAlexander Duyck2015-02-271-307/+156
| | | | | | | | | | | | | | | | | | | | | | At this point the leaf_info hash is redundant. By adding the suffix length to the fib_alias hash list we no longer have need of leaf_info as we can determine the prefix length from fa_slen. So we can compress things by dropping the leaf_info structure from fib_trie and instead directly connect the leaves to the fib_alias hash list. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * fib_trie: Add slen to fib aliasAlexander Duyck2015-02-272-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | Make use of an empty spot in the alias to store the suffix length so that we don't need to pull that information from the leaf_info structure. This patch also makes a slight change to the user statistics. Instead of incrementing semantic_match_miss once per leaf_info miss we now just increment it once per leaf if a match was not found. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * fib_trie: Replace plen with slen in leaf_infoAlexander Duyck2015-02-271-33/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces the prefix length variable in the leaf_info structure with a suffix length value, or host identifier length in bits. By doing this it makes it easier to sort out since the tnodes and leaf are carrying this value as well since it is compatible with the ->pos field in tnodes. I also cleaned up one spot that had some list manipulation that could be simplified. I basically updated it so that we just use hlist_add_head_rcu instead of calling hlist_add_before_rcu on the first node in the list. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * fib_trie: Convert fib_alias to hlist from listAlexander Duyck2015-02-274-40/+48
|/ | | | | | | | | There isn't any advantage to having it as a list and by making it an hlist we make the fib_alias more compatible with the list_info in terms of the type of list used. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'ip_level_multicast_join_leave'David S. Miller2015-02-278-16/+137
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Madhu Challa says: ==================== Multicast group join/leave at ip level This series enables configuring multicast group join/leave at ip level by extending the "ip address" command. It adds a new control socket mc_autojoin_sock and ifa_flag IFA_F_MCAUTOJOIN to invoke the corresponding igmp group join/leave api. Since the igmp group join/leave api takes the rtnl_lock the code had to be refactored by adding a shim layer prefixed by __ that can be invoked by code that already has the rtnl_lock. This way we avoid proliferation of work queues. The first patch in this series does the refactoring for igmp v6. Its based on igmp v4 changes that were added by Eric Dumazet. The second patch in this series does the group join/leave based on the setting of the IFA_F_MCAUTOJOIN flag. v5: - addressed comments from Daniel Borkmann. - removed blank line in patch 1/2 - removed unused variable, const arg in patch 2/2 v4: - addressed comments from Yoshifuji Hideaki. - Remove WARN_ON not needed because we return a value from v2. - addressed comments from Daniel Borkmann. - rename sock to mc_autojoin_sk - ip_mc_config() pass ifa so it needs one less argument. - igmp_net_{init|destroy}() use inet_ctl_sock_{create|destroy} - inet_rtm_newaddr() change scope of ret. - igmp_net_init() no need to initialize sock to NULL. v3: - addressed comments from David Miller. - fixed indentation and local variable order. v2: - addressed comments from Eric Dumazet. - removed workqueue and call __ip_mc_{join|leave}_group or __ipv6_sock_mc_{join|drop} ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * multicast: Extend ip address command to enable multicast group join/leave onMadhu Challa2015-02-277-7/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Joining multicast group on ethernet level via "ip maddr" command would not work if we have an Ethernet switch that does igmp snooping since the switch would not replicate multicast packets on ports that did not have IGMP reports for the multicast addresses. Linux vxlan interfaces created via "ip link add vxlan" have the group option that enables then to do the required join. By extending ip address command with option "autojoin" we can get similar functionality for openvswitch vxlan interfaces as well as other tunneling mechanisms that need to receive multicast traffic. The kernel code is structured similar to how the vxlan driver does a group join / leave. example: ip address add 224.1.1.10/24 dev eth5 autojoin ip address del 224.1.1.10/24 dev eth5 Signed-off-by: Madhu Challa <challa@noironetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * igmp v6: add __ipv6_sock_mc_join and __ipv6_sock_mc_dropMadhu Challa2015-02-272-9/+39
|/ | | | | | | | | | | | | | | | Based on the igmp v4 changes from Eric Dumazet. 959d10f6bbf6("igmp: add __ip_mc_{join|leave}_group()") These changes are needed to perform igmp v6 join/leave while RTNL is held. Make ipv6_sock_mc_join and ipv6_sock_mc_drop wrappers around __ipv6_sock_mc_join and __ipv6_sock_mc_drop to avoid proliferation of work queues. Signed-off-by: Madhu Challa <challa@noironetworks.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* udp: In udp_flow_src_port use random hash value if skb_get_hash failsTom Herbert2015-02-272-6/+25
| | | | | | | | | | In the unlikely event that skb_get_hash is unable to deduce a hash in udp_flow_src_port we use a consistent random value instead. This is specified in GRE/UDP draft section 3.2.1: https://tools.ietf.org/html/draft-ietf-tsvwg-gre-in-udp-encap-04 Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud