path: root/net/core/skbuff.c
Commit message (Author, Date, Files changed, Lines -/+)
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (David S. Miller, 2017-12-29, 1 file, -3/+3)
  net/ipv6/ip6_gre.c is a case of parallel adds.

  include/trace/events/tcp.h is a little bit more tricky: the removal of in-trace-macro ifdefs in 'net' paralleled the move of show_tcp_state_name and friends over to include/trace/events/sock.h in 'net-next'.

  Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: in skb_copy_ubufs unclone before releasing zerocopy (Willem de Bruijn, 2017-12-28, 1 file, -3/+3)
  skb_copy_ubufs must unclone before it is safe to modify its skb_shared_info with skb_zcopy_clear.

  Commit b90ddd568792 ("skbuff: skb_copy_ubufs must release uarg even without user frags") ensures that all skbs release their zerocopy state, even those without frags. But I forgot an edge case where such an skb arrives that is cloned.

  The stack does not build such packets. Vhost/tun skbs have their frags orphaned before cloning. TCP skbs only attach zerocopy state when a frag is added. But if TCP packets can be trimmed or linearized, this might occur. Tracing the code I found no instance so far (e.g., skb_linearize ends up calling skb_zcopy_clear if !skb->data_len). Still, it is non-obvious that no path exists. And it is fragile to rely on this.

  Fixes: b90ddd568792 ("skbuff: skb_copy_ubufs must release uarg even without user frags")
  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: in skb_segment, call zerocopy functions once per nskb (Willem de Bruijn, 2017-12-27, 1 file, -5/+9)
  This is a net-next follow-up to commit 268b79067942 ("skbuff: orphan frags before zerocopy clone"), which fixed a bug in net, but added a call to skb_zerocopy_clone at each frag to do so.

  When segmenting skbs with user frags, either the user frags must be replaced with private copies and uarg released, or the uarg must have its refcount increased for each new skb. skb_orphan_frags does the first, except for cases that can handle reference counting. skb_zerocopy_clone then does the second.

  Call these once per nskb, instead of once per frag. That is, in the common case. With a frag list, also refresh when the origin skb (frag_skb) changes.

  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: skb_copy_ubufs must release uarg even without user frags (Willem de Bruijn, 2017-12-21, 1 file, -1/+2)
  skb_copy_ubufs creates a private copy of frags[] to release its hold on user frags, then calls uarg->callback to notify the owner. Call uarg->callback even when no frags exist. This edge case can happen when zerocopy_sg_from_iter finds enough room in skb_headlen to copy all the data.

  Fixes: 3ece782693c4 ("sock: skb_copy_ubufs support for compound pages")
  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: orphan frags before zerocopy clone (Willem de Bruijn, 2017-12-21, 1 file, -2/+2)
  Call skb_zerocopy_clone after skb_orphan_frags, to avoid duplicate calls to skb_uarg(skb)->callback for the same data.

  skb_zerocopy_clone associates skb_shinfo(skb)->uarg from frag_skb with each segment. This is only safe for uargs that do refcounting, which is those that pass skb_orphan_frags without dropping their shared frags. For others, skb_orphan_frags drops the user frags and sets the uarg to NULL, after which sock_zerocopy_clone has no effect.

  Qemu hangs were reported due to duplicate vhost_net_zerocopy_callback calls for the same data causing the vhost_net_ubuf_ref->refcount to drop below zero.

  Link: http://lkml.kernel.org/r/<CAF=yD-LWyCD4Y0aJ9O0e_CHLR+3JOeKicRRTEVCPxgw4XOcqGQ@mail.gmail.com>
  Fixes: 1f8b977ab32d ("sock: enable MSG_ZEROCOPY")
  Reported-by: Andreas Hartmann <andihartmann@01019freenet.de>
  Reported-by: David Hill <dhill@redhat.com>
  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* sock: free skb in skb_complete_tx_timestamp on error (Willem de Bruijn, 2017-12-15, 1 file, -1/+5)
  skb_complete_tx_timestamp must ingest the skb it is passed. Call kfree_skb if the skb cannot be enqueued.

  Fixes: b245be1f4db1 ("net-timestamp: no-payload only sysctl")
  Fixes: 9ac25fc06375 ("net: fix socket refcounting in skb_complete_tx_timestamp()")
  Reported-by: Richard Cochran <richardcochran@gmail.com>
  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Reviewed-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
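  A minimal sketch of the rule this fix enforces, assuming the enqueue step is sock_queue_err_skb() (the real error path in skb_complete_tx_timestamp() has more steps):

      /* The function owns the skb it is handed: any path that cannot
       * queue it must free it rather than leak it. */
      int err = sock_queue_err_skb(sk, skb);

      if (err)
          kfree_skb(skb);  /* enqueue failed: consume the skb here */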
* Merge branch 'akpm' (patches from Andrew) (Linus Torvalds, 2017-11-15, 1 file, -7/+2)
  Merge updates from Andrew Morton:
  - a few misc bits
  - ocfs2 updates
  - almost all of MM

  * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits)
    memory hotplug: fix comments when adding section
    mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
    mm: simplify nodemask printing
    mm,oom_reaper: remove pointless kthread_run() error check
    mm/page_ext.c: check if page_ext is not prepared
    writeback: remove unused function parameter
    mm: do not rely on preempt_count in print_vma_addr
    mm, sparse: do not swamp log with huge vmemmap allocation failures
    mm/hmm: remove redundant variable align_end
    mm/list_lru.c: mark expected switch fall-through
    mm/shmem.c: mark expected switch fall-through
    mm/page_alloc.c: broken deferred calculation
    mm: don't warn about allocations which stall for too long
    fs: fuse: account fuse_inode slab memory as reclaimable
    mm, page_alloc: fix potential false positive in __zone_watermark_ok
    mm: mlock: remove lru_add_drain_all()
    mm, sysctl: make NUMA stats configurable
    shmem: convert shmem_init_inodecache() to void
    Unify migrate_pages and move_pages access checks
    mm, pagevec: rename pagevec drained field
    ...
* mm: remove __GFP_COLD (Mel Gorman, 2017-11-15, 1 file, -2/+2)
  As the page free path makes no distinction between cache hot and cold pages, there is no real useful ordering of pages in the free list that allocation requests can take advantage of. Judging from the users of __GFP_COLD, it is likely that a number of them are the result of copying other sites instead of actually measuring the impact. Remove the __GFP_COLD parameter, which simplifies a number of paths in the page allocator.

  This is potentially controversial, but bear in mind that the size of the per-cpu pagelists versus modern cache sizes means that the whole per-cpu list can often fit in the L3 cache. Hence, there is only a potential benefit for microbenchmarks that alloc/free pages in a tight loop. It's even worse when THP is taken into account, which has little or no chance of getting a cache-hot page as the per-cpu list is bypassed and the zeroing of multiple pages will thrash the cache anyway.

  The truncate microbenchmarks are not shown as this patch affects the allocation path and not the free path. A page fault microbenchmark was tested but it showed no significant difference, which is not surprising given that the __GFP_COLD branches are a minuscule percentage of the fault path.

  Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
  Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
  Acked-by: Vlastimil Babka <vbabka@suse.cz>
  Cc: Andi Kleen <ak@linux.intel.com>
  Cc: Dave Chinner <david@fromorbit.com>
  Cc: Dave Hansen <dave.hansen@intel.com>
  Cc: Jan Kara <jack@suse.cz>
  Cc: Johannes Weiner <hannes@cmpxchg.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kmemcheck: remove annotations (Levin, Alexander (Sasha Levin), 2017-11-15, 1 file, -5/+0)
  Patch series "kmemcheck: kill kmemcheck", v2.

  As discussed at LSF/MM, kill kmemcheck. KASan is a replacement that is able to work without the limitations of kmemcheck (single CPU, slow). KASan is already upstream.

  We are also not aware of any users of kmemcheck (or users who don't consider KASan a suitable replacement).

  The only objection was that since KASAN wasn't supported by all GCC versions provided by distros at that time, we should hold off for 2 years and try again. Now that 2 years have passed, and all distros provide gcc that supports KASAN, kill kmemcheck again for the very same reasons.

  This patch (of 4): Remove kmemcheck annotations, and calls to kmemcheck, from the kernel.

  [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
  Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
  Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
  Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
  Cc: Alexander Potapenko <glider@google.com>
  Cc: Eric W. Biederman <ebiederm@xmission.com>
  Cc: Michal Hocko <mhocko@kernel.org>
  Cc: Pekka Enberg <penberg@kernel.org>
  Cc: Steven Rostedt <rostedt@goodmis.org>
  Cc: Tim Hansen <devtimhansen@gmail.com>
  Cc: Vegard Nossum <vegardno@ifi.uio.no>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (David S. Miller, 2017-11-10, 1 file, -0/+1)
  Simple cases of overlapping changes in the packet scheduler. Much easier to resolve this time. Which probably means that I screwed it up somehow.

  Signed-off-by: David S. Miller <davem@davemloft.net>
* netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed (Ye Yin, 2017-11-04, 1 file, -0/+1)
  When running ipvs in two different network namespaces on the same host, with one ipvs forwarding network traffic to the ipvs in the other namespace, the 'ipvs_property' flag makes the second ipvs take no effect. So we should clear 'ipvs_property' when the SKB's network namespace changes.

  Fixes: 621e84d6f373 ("dev: introduce skb_scrub_packet()")
  Signed-off-by: Ye Yin <hustcat@gmail.com>
  Signed-off-by: Wei Zhou <chouryzhou@gmail.com>
  Signed-off-by: Julian Anastasov <ja@ssi.bg>
  Signed-off-by: Simon Horman <horms@verge.net.au>
  Signed-off-by: David S. Miller <davem@davemloft.net>
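  A sketch of the idea, assuming a small reset helper invoked from skb_scrub_packet() when the skb changes namespace (the helper name and placement are inferred from the description, not quoted from the patch):

      /* Clear the per-namespace IPVS marking when an skb crosses netns
       * boundaries, so the receiving namespace's IPVS processes it. */
      static inline void ipvs_reset(struct sk_buff *skb)
      {
      #if IS_ENABLED(CONFIG_IP_VS)
          skb->ipvs_property = 0;
      #endif
      }

      /* called from skb_scrub_packet(skb, xnet) when xnet is true */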
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (David S. Miller, 2017-10-22, 1 file, -1/+5)
  There were quite a few overlapping sets of changes here.

  Daniel's bug fix for off-by-ones in the new BPF branch instructions, along with the added allowances for "data_end > ptr + x" forms, collided with the metadata additions. Along with those three changes came verifier test cases, which in their final form I tried to group together properly. If I had just trimmed GIT's conflict tags as-is, this would have split up the meta tests unnecessarily.

  In the socketmap code, a set of preemption disabling changes overlapped with the rename of bpf_compute_data_end() to bpf_compute_data_pointers().

  Changes were made to the mv88e6060.c driver set addr method which got removed in net-next.

  The hyperv transport socket layer had a locking change in 'net' which overlapped with a change of socket state macro usage in 'net-next'.

  Signed-off-by: David S. Miller <davem@davemloft.net>
* sock: correct sk_wmem_queued accounting on efault in tcp zerocopy (Willem de Bruijn, 2017-10-22, 1 file, -0/+4)
  Syzkaller hits WARN_ON(sk->sk_wmem_queued) in sk_stream_kill_queues after triggering an EFAULT in __zerocopy_sg_from_iter.

  On this error, skb_zerocopy_stream_iter resets the skb to its state before the operation with __pskb_trim. It cannot kfree_skb like datagram callers, as the skb may have data from a previous send call.

  __pskb_trim calls skb_condense for unowned skbs, which adjusts their truesize. These tcp skbuffs are owned and their truesize must add up to sk_wmem_queued, but they match the unowned test because their skb->sk is NULL until tcp_transmit_skb.

  Temporarily set skb->sk when calling __pskb_trim to signal that the skbuffs are owned and avoid the skb_condense path.

  Fixes: 52267790ef52 ("sock: add MSG_ZEROCOPY")
  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Reviewed-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: fix typo in skbuff.c (Wenhua Shi, 2017-10-14, 1 file, -1/+1)
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net/core: Fix BUG to BUG_ON conditionals. (Tim Hansen, 2017-10-10, 1 file, -9/+6)
  Fix BUG() calls to use BUG_ON(conditional) macros.

  This was found using make coccicheck M=net/core on linux next tag next-2017092.

  Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
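  For illustration, the shape of this cleanup (the condition here is a representative skbuff example, not necessarily one from the patch):

      /* before: open-coded test around BUG() */
      if (skb_shared(skb))
          BUG();

      /* after: equivalent, and what coccinelle suggests */
      BUG_ON(skb_shared(skb));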
* net: cache skb_shinfo() in skb_try_coalesce() (Eric Dumazet, 2017-10-04, 1 file, -13/+16)
  The compiler does not really know that skb_shinfo(to|from) are constants in skb_try_coalesce(); let's cache their values to shrink the code. We might even take care of skb_zcopy() calls later.

  $ size net/core/skbuff.o.before net/core/skbuff.o
     text    data     bss     dec     hex filename
    40727    1298       0   42025    a429 net/core/skbuff.o.before
    40631    1298       0   41929    a3c9 net/core/skbuff.o

  Signed-off-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
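  A sketch of the pattern, hoisting the skb_shinfo() lookups into locals once so later tests reuse them (variable names are illustrative):

      struct skb_shared_info *to_shinfo = skb_shinfo(to);
      struct skb_shared_info *from_shinfo = skb_shinfo(from);

      /* every subsequent check uses the cached pointers instead of
       * re-deriving them from the skb, shrinking the generated code */
      if (to_shinfo->frag_list || from_shinfo->frag_list)
          return false;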
* bpf: add meta pointer for direct access (Daniel Borkmann, 2017-09-26, 1 file, -0/+2)
  This work enables generic transfer of metadata from XDP into skb. The basic idea is that we can make use of the fact that the resulting skb must be linear and already comes with a larger headroom for supporting bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work on a similar principle and introduce a small helper bpf_xdp_adjust_meta() for adjusting a new pointer called xdp->data_meta. Thus, the packet has a flexible and programmable room for meta data, followed by the actual packet data. struct xdp_buff is therefore laid out such that we first point to data_hard_start, then data_meta directly prepended to data, followed by data_end marking the end of the packet. bpf_xdp_adjust_head() takes into account whether we have meta data already prepended and, if so, memmove()s this along with the given offset, provided there's enough room.

  xdp->data_meta is optional and programs are not required to use it. The rationale is that when we process the packet in XDP (e.g. as a DoS filter), we can push further meta data along with it for the XDP_PASS case, and give the guarantee that a clsact ingress BPF program on the same device can pick this up for further post-processing. Since we work with skb there, we can also set skb->mark, skb->priority or other skb meta data out of BPF. Having this scratch space generic and programmable allows for more flexibility than defining a direct 1:1 transfer of potentially new XDP members into skb (it's also more efficient as we don't need to initialize/handle each of such new members). The facility also works together with GRO aggregation.

  The scratch space at the head of the packet can be a multiple of 4 bytes, up to 32 bytes large. Drivers not yet supporting xdp->data_meta can simply be set up with xdp->data_meta as xdp->data + 1, as bpf_xdp_adjust_meta() will detect this and bail out, such that the subsequent match against xdp->data for later access is guaranteed to fail.

  The verifier treats xdp->data_meta/xdp->data the same way as we treat xdp->data/xdp->data_end pointer comparisons. The requirement for doing the compare against xdp->data is that it hasn't been modified from its original address we got from ctx access. It may have a range marking already from prior successful xdp->data/xdp->data_end pointer comparisons though.

  Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  Acked-by: Alexei Starovoitov <ast@kernel.org>
  Acked-by: John Fastabend <john.fastabend@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
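  A sketch of how an XDP program might use the helper to prepend one word of metadata for a clsact program to pick up (assumes the usual libbpf headers; the value written is illustrative):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      SEC("xdp")
      int xdp_mark_meta(struct xdp_md *ctx)
      {
          __u32 *meta;
          void *data;

          /* grow the metadata area by 4 bytes; fails on drivers
           * that do not support data_meta yet */
          if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
              return XDP_PASS;

          data = (void *)(long)ctx->data;
          meta = (void *)(long)ctx->data_meta;
          if ((void *)(meta + 1) > data)  /* bounds check for the verifier */
              return XDP_PASS;

          *meta = 0xcafe;  /* scratch value travels with the packet */
          return XDP_PASS;
      }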
* net: speed up skb_rbtree_purge() (Eric Dumazet, 2017-09-25, 1 file, -4/+7)
  As measured in my prior patch ("sch_netem: faster rb tree removal"), rbtree_postorder_for_each_entry_safe() is nice looking but much slower than using rb_next() directly, except when the tree is small enough to fit in CPU caches (then the cost is the same).

  Also note that there is not even an increase of text size:

  $ size net/core/skbuff.o.before net/core/skbuff.o
     text    data     bss     dec     hex filename
    40711    1298       0   42009    a419 net/core/skbuff.o.before
    40711    1298       0   42009    a419 net/core/skbuff.o

  From: Eric Dumazet <edumazet@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
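  A sketch of the rb_next()-based walk described above (this mirrors the simple shape of skb_rbtree_purge(); the rbnode member is the sk_buff field used for rbtree-queued skbs):

      void skb_rbtree_purge(struct rb_root *root)
      {
          struct rb_node *p = rb_first(root);

          while (p) {
              struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);

              p = rb_next(p);           /* advance before erasing */
              rb_erase(&skb->rbnode, root);
              kfree_skb(skb);
          }
      }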
* udp: drop head states only when all skb references are gone (Paolo Abeni, 2017-09-07, 1 file, -6/+3)
  After commit 0ddf3fb2c43d ("udp: preserve skb->dst if required for IP options processing") we clear the skb head state as soon as the skb carrying them is first processed.

  Since the same skb can be processed several times when MSG_PEEK is used, we can end up lacking the required head states, and eventually oopsing.

  Fix this by clearing the skb head state only when processing the last skb reference.

  Reported-by: Eric Dumazet <edumazet@google.com>
  Fixes: 0ddf3fb2c43d ("udp: preserve skb->dst if required for IP options processing")
  Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert (struct ubuf_info)->refcnt to refcount_t (Eric Dumazet, 2017-09-01, 1 file, -3/+3)
  refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows avoiding accidental refcounter overflows that might lead to use-after-free situations.

  v2: added the change in drivers/vhost/net.c as spotted by Willem.

  Signed-off-by: Eric Dumazet <edumazet@google.com>
  Acked-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: prepare (struct ubuf_info)->refcnt conversion (Eric Dumazet, 2017-09-01, 1 file, -8/+2)
  In order to convert this atomic_t refcnt to refcount_t, we need to init the refcount to one so as not to trigger a 0 -> 1 transition. This also removes one atomic operation in the fast path.

  v2: removed dead code in sock_zerocopy_put_abort() as suggested by Willem.

  Signed-off-by: Eric Dumazet <edumazet@google.com>
  Acked-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (David S. Miller, 2017-09-01, 1 file, -5/+8)
  Three cases of simple overlapping changes.

  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: core: Specify skb_pad()/skb_put_padto() SKB freeing (Florian Fainelli, 2017-08-23, 1 file, -5/+8)
  Rename skb_pad() into __skb_pad() and make it take a third argument, free_on_error, which controls whether kfree_skb() should be called or not; skb_pad() directly makes use of it and passes true to preserve its existing behavior. Do exactly the same thing with __skb_put_padto() and skb_put_padto().

  Suggested-by: David Miller <davem@davemloft.net>
  Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
  Reviewed-by: Woojung Huh <Woojung.Huh@microchip.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
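  The wrapper shape this describes, sketched (prototype simplified):

      int __skb_pad(struct sk_buff *skb, int pad, bool free_on_error);

      /* existing callers keep the old semantics: free the skb on error */
      static inline int skb_pad(struct sk_buff *skb, int pad)
      {
          return __skb_pad(skb, pad, true);
      }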
* net: fixes for skb_send_sock (John Fastabend, 2017-08-16, 1 file, -1/+1)
  A couple of fixes to the new skb_send_sock infrastructure. However, no users currently exist for this code (a user is added in the next handful of patches), so it should not be possible to trigger a panic with existing in-kernel code.

  Fixes: 306b13eb3cf9 ("proto_ops: Add locked held versions of sendmsg and sendpage")
  Signed-off-by: John Fastabend <john.fastabend@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* sock: fix zerocopy panic in mem accounting (Willem de Bruijn, 2017-08-09, 1 file, -2/+2)
  Only call mm_unaccount_pinned_pages when releasing a struct ubuf_info that has initialized its field uarg->mmp.

  Before this patch, a vhost-net with experimental_zcopytx can crash in

    mm_unaccount_pinned_pages
    sock_zerocopy_put
    skb_zcopy_clear
    skb_release_data

  Only sock_zerocopy_alloc initializes this field. Move the unaccount call from the generic sock_zerocopy_put to its specific callback sock_zerocopy_callback.

  Fixes: a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")
  Reported-by: David Ahern <dsahern@gmail.com>
  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* sock: ulimit on MSG_ZEROCOPY pages (Willem de Bruijn, 2017-08-03, 1 file, -0/+48)
  Bound the number of pages that a user may pin.

  Follow the lead of perf tools to maintain a per-user bound on memory locked pages, commit 789f90fcf6b0 ("perf_counter: per user mlock gift").

  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* sock: MSG_ZEROCOPY notification coalescing (Willem de Bruijn, 2017-08-03, 1 file, -7/+92)
  In the simple case, each sendmsg() call generates data and eventually a zerocopy ready notification N, where N indicates the Nth successful invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

  TCP and corked sockets can cause send() calls to append new data to an existing sk_buff and, thus, ubuf_info. In that case the notification must hold a range. Modify ubuf_info to store an inclusive range [N..N+m] and add skb_zerocopy_realloc() to optionally extend an existing range.

  Also coalesce notifications in this common case: if a notification [1, 1] is about to be queued while [0, 0] is the queue tail, just modify the head of the queue to read [0, 1]. Coalescing is limited to a few TSO frames' worth of data to bound notification latency.

  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
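  A sketch of the coalescing test, assuming the [N..N+m] range rides in the ee_info/ee_data fields of the extended error (helper name is made up):

      /* Fold a new notification [lo..hi] into the already-queued
       * notification when the ranges are adjacent. */
      static bool zc_try_coalesce(struct sock_extended_err *tail_ee,
                                  u32 lo, u32 hi)
      {
          if (lo != tail_ee->ee_data + 1)
              return false;        /* gap: queue a separate notification */
          tail_ee->ee_data = hi;   /* now covers [old lo .. hi] */
          return true;
      }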
* sock: enable MSG_ZEROCOPY (Willem de Bruijn, 2017-08-03, 1 file, -29/+19)
  Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with skb_zerocopy_clone() wherever needed due to skb split, merge, resize or clone.

  Split skb_orphan_frags into two variants. The split, merge, .. paths support reference counted zerocopy buffers, so do not do a deep copy. Add skb_orphan_frags_rx for paths that may loop packets to receive sockets. That is not allowed, as it may cause unbounded latency. Deep copy all zerocopy copy buffers, ref-counted or not, in this path (see the sketch below).

  The exact locations to modify were chosen by exhaustively searching through all code that might modify skb_frag references and/or the SKBTX_DEV_ZEROCOPY tx_flags bit.

  The changes err on the safe side, in two ways.

  (1) legacy ubuf_info paths virtio and tap are not modified. They keep a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags still call skb_copy_ubufs and thus copy frags in this case.

  (2) not all copies deep in the stack are addressed yet. skb_shift, skb_split and skb_try_coalesce can be refined to avoid copying. These are not in the hot path and this patch is hairy enough as is, so that is left for future refinement.

  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
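  A sketch of the rx variant described above, per the description (skb_zcopy() tests for attached zerocopy state):

      /* Paths that may loop packets to local receive sockets must
       * always deep-copy user frags, refcounted uarg or not. */
      static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp)
      {
          if (likely(!skb_zcopy(skb)))
              return 0;
          return skb_copy_ubufs(skb, gfp);
      }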
* sock: add SOCK_ZEROCOPY sockopt (Willem de Bruijn, 2017-08-03, 1 file, -0/+3)
  The send call ignores unknown flags. Legacy applications may already unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a socket opts in to zerocopy.

  Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing. Processes can also query this socket option to detect kernel support for the feature. Older kernels will return ENOPROTOOPT.

  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
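  From userspace, the opt-in sequence might look like this sketch (SO_ZEROCOPY and MSG_ZEROCOPY come from the uapi headers; handle_no_zerocopy() is a placeholder, and error handling is trimmed):

      #include <sys/socket.h>

      int one = 1;

      if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == -1)
          handle_no_zerocopy();  /* ENOPROTOOPT: kernel predates SO_ZEROCOPY */

      /* only after opting in does the flag take effect */
      send(fd, buf, len, MSG_ZEROCOPY);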
* sock: add MSG_ZEROCOPY (Willem de Bruijn, 2017-08-03, 1 file, -0/+133)
  The kernel supports zerocopy sendmsg in virtio and tap. Expand the infrastructure to support other socket types. Introduce a completion notification channel over the socket error queue. Notifications are returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid blocking the send/recv path on receiving notifications.

  Add reference counting, to support the skb split, merge, resize and clone operations possible with SOCK_STREAM and other socket types.

  The patch does not yet modify any datapaths.

  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
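  Reading a completion notification from the error queue might look like this sketch (control-message level/type vary by protocol family; note_completed() is a placeholder, and headers/error handling are trimmed):

      struct sock_extended_err *serr;
      struct msghdr msg = { 0 };
      char control[64];
      struct cmsghdr *cm;

      msg.msg_control = control;
      msg.msg_controllen = sizeof(control);

      if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
          return;

      cm = CMSG_FIRSTHDR(&msg);
      serr = (struct sock_extended_err *)CMSG_DATA(cm);
      if (serr->ee_errno == 0 &&
          serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
          /* sends numbered ee_info..ee_data completed;
           * their buffers may now be reused */
          note_completed(serr->ee_info, serr->ee_data);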
* sock: skb_copy_ubufs support for compound pages (Willem de Bruijn, 2017-08-03, 1 file, -15/+38)
  Refine skb_copy_ubufs to support compound pages. With upcoming TCP zerocopy sendmsg, such fragments may appear.

  The existing code replaces each page one for one. Splitting each compound page into an independent number of regular pages can result in exceeding the limit MAX_SKB_FRAGS if data is not exactly page aligned. Instead, fill all destination pages but the last to PAGE_SIZE.

  Split the existing alloc + copy loop into separate stages:
  1. compute byte length and the minimum number of pages to store it
  2. allocate
  3. copy, filling each page except the last to PAGE_SIZE bytes
  4. update the skb frag array

  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add skb_frag_foreach_page and use with kmap_atomic (Willem de Bruijn, 2017-08-01, 1 file, -29/+59)
  Skb frags may contain compound pages. Various operations map frags temporarily using kmap_atomic, but this function works on single pages, not whole compound pages. The distinction is only relevant for highmem pages that require temporary mappings.

  Introduce a looping mechanism that, for compound highmem pages, maps one page at a time; it does not change behavior on other pages. Use the loop in the kmap_atomic callers in net/core/skbuff.c.

  Verified by triggering skb_copy_bits with

    tcpdump -n -c 100 -i ${DEV} -w /dev/null &
    netperf -t TCP_STREAM -H ${HOST}

  and by triggering __skb_checksum with

    ethtool -K ${DEV} tx off

  Repeated the tests with looping on a non-highmem platform (x86_64) by making skb_frag_must_loop always return true.

  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
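  Usage might be sketched like this (argument order per the helper's in-tree definition of this era; frag->page_offset predates skb_frag_off(), and dst is an assumed destination buffer):

      struct page *p;
      u32 p_off, p_len, copied;
      u8 *vaddr;

      /* walk a possibly-compound frag one single page at a time, so
       * kmap_atomic() only ever maps individual pages */
      skb_frag_foreach_page(frag, frag->page_offset, skb_frag_size(frag),
                            p, p_off, p_len, copied) {
          vaddr = kmap_atomic(p);
          memcpy(dst + copied, vaddr + p_off, p_len);
          kunmap_atomic(vaddr);
      }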
* skbuff: Function to send an skbuf on a socket (Tom Herbert, 2017-08-01, 1 file, -0/+101)
  Add skb_send_sock to send an skbuff on a socket within the kernel. Arguments include an offset so that an skbuff might be sent in multiple calls (e.g. when a send buffer limit is hit).

  Signed-off-by: Tom Herbert <tom@quantonium.net>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: re-add check for NULL skb->head in kfree_skb path (Florian Westphal, 2017-07-24, 1 file, -1/+2)
  A null check is needed after all. Netlink skbs can have skb->head be backed by vmalloc. The netlink destructor vfree()s head, then sets it to NULL. We then panic in skb_release_data with a NULL dereference.

  Re-add such a test.

  An alternative would be to switch to kvfree to free skb->head memory and remove the special handling in the netlink destructor.

  Reported-by: kernel test robot <fengguang.wu@intel.com>
  Fixes: 06dc75ab06943 ("net: Revert "net: add function to allocate sk_buff head without data area")
  Signed-off-by: Florian Westphal <fw@strlen.de>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (David S. Miller, 2017-07-21, 1 file, -1/+1)
* mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic (Michal Hocko, 2017-07-12, 1 file, -1/+1)
  __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocation requests larger than PAGE_ALLOC_COSTLY_ORDER. It has always been ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail, so they might end up looping in the page allocator forever, similarly to GFP_NOFAIL requests.

  Now that the whole tree has been cleaned up and accidental or misled usage of the __GFP_REPEAT flag has been removed for !costly requests, we can give the original flag a better name and, more importantly, a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL, which tells the user that the allocator would try really hard but there is no promise of success. This will work independent of the order and overrides the default allocator behavior.

  Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example):

  - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most lightweight mode, which doesn't even kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim.

  - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic allocation without any attempt to free memory from the current context, but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path.

  - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non-sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback.

  - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior, so failures have to be checked properly by callers (e.g. an OOM killer victim is allowed to fail currently).

  - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked.

  - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered.

  - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous, especially for larger orders.

  Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user-defined fallback behavior is more sensible than keep retrying in the page allocator.

  [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
  [mhocko@suse.com: semantic fix]
  Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
  [mhocko@kernel.org: address other thing spotted by Vlastimil]
  Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
  Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
  Signed-off-by: Michal Hocko <mhocko@suse.com>
  Acked-by: Vlastimil Babka <vbabka@suse.cz>
  Cc: Alex Belits <alex.belits@cavium.com>
  Cc: Chris Wilson <chris@chris-wilson.co.uk>
  Cc: Christoph Hellwig <hch@infradead.org>
  Cc: Darrick J. Wong <darrick.wong@oracle.com>
  Cc: David Daney <david.daney@cavium.com>
  Cc: Johannes Weiner <hannes@cmpxchg.org>
  Cc: Mel Gorman <mgorman@suse.de>
  Cc: NeilBrown <neilb@suse.com>
  Cc: Ralf Baechle <ralf@linux-mips.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
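  A sketch of the caller pattern this flag enables — try hard for a large allocation but tolerate failure with an explicit fallback (the vmalloc fallback is illustrative):

      void *buf;

      /* try hard for contiguous memory, but fail rather than
       * invoke the OOM killer */
      buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
      if (!buf)
          buf = vmalloc(size);  /* user-defined fallback */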
* net: Revert "net: add function to allocate sk_buff head without data area" (Florian Westphal, 2017-07-17, 1 file, -29/+2)
  It was added for netlink mmap tx, there are no callers in the tree. The commit also added a check for skb->head != NULL in the kfree_skb path, remove that too -- all skbs ought to have skb->head set.

  Signed-off-by: Florian Westphal <fw@strlen.de>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: optimize the pull_pages code in __pskb_pull_tail() (linzhang, 2017-07-17, 1 file, -0/+3)
  In the pull_pages code block, if the first frag size > eat, we can end the loop in advance to avoid an extra copy.

  Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
  Acked-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert sock.sk_refcnt from atomic_t to refcount_t (Reshetova, Elena, 2017-07-01, 1 file, -3/+3)
  refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows avoiding accidental refcounter overflows that might lead to use-after-free situations.

  This patch uses refcount_inc_not_zero() instead of atomic_inc_not_zero_hint() due to the absence of a _hint() version of the refcount API. If the hint() version must be used, we might need to revisit the API.

  Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
  Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
  Signed-off-by: Kees Cook <keescook@chromium.org>
  Signed-off-by: David Windsor <dwindsor@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert sock.sk_wmem_alloc from atomic_t to refcount_t (Reshetova, Elena, 2017-07-01, 1 file, -1/+1)
  refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows avoiding accidental refcounter overflows that might lead to use-after-free situations.

  Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
  Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
  Signed-off-by: Kees Cook <keescook@chromium.org>
  Signed-off-by: David Windsor <dwindsor@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert sk_buff_fclones.fclone_ref from atomic_t to refcount_t (Reshetova, Elena, 2017-07-01, 1 file, -5/+5)
  refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows avoiding accidental refcounter overflows that might lead to use-after-free situations.

  Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
  Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
  Signed-off-by: Kees Cook <keescook@chromium.org>
  Signed-off-by: David Windsor <dwindsor@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert sk_buff.users from atomic_t to refcount_t (Reshetova, Elena, 2017-07-01, 1 file, -4/+4)
  refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows avoiding accidental refcounter overflows that might lead to use-after-free situations.

  Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
  Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
  Signed-off-by: Kees Cook <keescook@chromium.org>
  Signed-off-by: David Windsor <dwindsor@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
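  The mechanical shape of such a conversion, sketched on a made-up structure (struct foo is illustrative, not from the patch):

      #include <linux/refcount.h>

      struct foo {
          refcount_t users;  /* was: atomic_t users */
      };

      static void foo_get(struct foo *f)
      {
          refcount_inc(&f->users);  /* saturates and WARNs on overflow */
      }

      static void foo_put(struct foo *f)
      {
          if (refcount_dec_and_test(&f->users))
              kfree(f);
      }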
* networking: make skb_push & __skb_push return void pointers (Johannes Berg, 2017-06-16, 1 file, -1/+1)
  It seems like a historic accident that these return unsigned char *, and in many places that means casts are required, more often than not.

  Make these functions return void * and remove all the casts across the tree, adding a (u8 *) cast only where the unsigned char pointer was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    @@
    expression SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - fn(SKB, LEN)[0]
    + *(u8 *)fn(SKB, LEN)

  Note that the last part there converts from push(...)[0] to the more idiomatic *(u8 *)push(...).

  Signed-off-by: Johannes Berg <johannes.berg@intel.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* networking: make skb_pull & friends return void pointers (Johannes Berg, 2017-06-16, 1 file, -3/+3)
  It seems like a historic accident that these return unsigned char *, and in many places that means casts are required, more often than not.

  Make these functions return void * and remove all the casts across the tree, adding a (u8 *) cast only where the unsigned char pointer was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_pull, __skb_pull, skb_pull_inline, __pskb_pull_tail, __pskb_pull, pskb_pull };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_pull, __skb_pull, skb_pull_inline, __pskb_pull_tail, __pskb_pull, pskb_pull };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

  Signed-off-by: Johannes Berg <johannes.berg@intel.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* networking: make skb_put & friends return void pointers (Johannes Berg, 2017-06-16, 1 file, -3/+3)
  It seems like a historic accident that these return unsigned char *, and in many places that means casts are required, more often than not.

  Make these functions (skb_put, __skb_put and pskb_put) return void * and remove all the casts across the tree, adding a (u8 *) cast only where the unsigned char pointer was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_put, __skb_put };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_put, __skb_put };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

  which actually doesn't cover pskb_put since there are only three users overall. A handful of stragglers were converted manually, notably a macro in drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many instances in net/bluetooth/hci_sock.c. In the former file, I also had to fix one whitespace problem spatch introduced.

  Signed-off-by: Johannes Berg <johannes.berg@intel.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
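  The practical effect at call sites, sketched with a hypothetical header struct:

      struct foo_hdr *hdr;  /* made-up example type */

      /* before: skb_put() returned unsigned char *, so a cast was needed */
      hdr = (struct foo_hdr *)skb_put(skb, sizeof(*hdr));

      /* after: void * converts implicitly to any object pointer type */
      hdr = skb_put(skb, sizeof(*hdr));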
* net: use skb_unref() in napi_consume_skb() (Paolo Abeni, 2017-06-14, 1 file, -3/+2)
  Commit 83ada39bb79d ("net: factor out a helper to decrement the skb refcount") provided and used a helper for decrementing skb usage, but I missed at least one spot for it.

  This change removes some more duplicated code by reusing skb_unref() in napi_consume_skb(), too. The helper uses an additional, unneeded unlikely(!skb) test - napi_consume_skb() already checks it a few lines above - but the compiler is smart enough to optimize the duplicated test out.

  Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
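  The helper in question has roughly this shape (sketched from the description; the real definition lives in include/linux/skbuff.h):

      /* Return true when the caller holds the last reference and
       * should actually free the skb. */
      static inline bool skb_unref(struct sk_buff *skb)
      {
          if (unlikely(!skb))
              return false;
          if (likely(refcount_read(&skb->users) == 1))
              smp_rmb();  /* sole owner: no atomic op needed */
          else if (likely(!refcount_dec_and_test(&skb->users)))
              return false;
          return true;
      }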
* udp: avoid a cache miss on dequeue (Paolo Abeni, 2017-06-12, 1 file, -4/+20)
  Since UDP no longer uses sk->destructor, we can clear completely the skb head state before enqueuing. Amend and use skb_release_head_state() for that.

  All head states share a single cacheline, which is not normally used/accessed on dequeue. We can avoid accessing that cacheline entirely by implementing and using in the UDP code a specialized skb free helper which ignores the skb head state. This saves a cacheline miss at skb deallocation time.

  v1 -> v2: replaced secpath_reset() with skb_release_head_state()

  Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  Acked-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: factor out a helper to decrement the skb refcount (Paolo Abeni, 2017-06-12, 1 file, -10/+4)
  The same code is replicated in 3 different places; move it to a common helper.

  Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  Acked-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: only inherit relevant tx_flags (Willem de Bruijn, 2017-06-08, 1 file, -4/+6)
  When inheriting tx_flags from one skbuff to another, always apply a mask to avoid overwriting unrelated other bits in the field.

  The two SKBTX_SHARED_FRAG cases clear all other bits. In practice, tx_flags are zero at this point now, but this is fragile. Timestamp flags are set, for instance, if in tcp_gso_segment, after this clear in skb_segment.

  The SKBTX_ANY_TSTAMP mask in __skb_tstamp_tx ensures that new skbs do not accidentally inherit flags such as SKBTX_SHARED_FRAG.

  Signed-off-by: Willem de Bruijn <willemb@google.com>
  Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
  Acked-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
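  A sketch of the masked-inheritance change for the SKBTX_SHARED_FRAG case (exact call sites differ; this shows the before/after shape):

      /* before: plain assignment clobbers whatever flags nskb
       * already carries, e.g. timestamp bits */
      skb_shinfo(nskb)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;

      /* after: OR in only the relevant bit, preserving the rest */
      skb_shinfo(nskb)->tx_flags |= skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;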
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (David S. Miller, 2017-06-06, 1 file, -1/+4)
  Just some simple overlapping changes in the marvell PHY driver and the DSA core code.

  Signed-off-by: David S. Miller <davem@davemloft.net>