summaryrefslogtreecommitdiffstats
path: root/net/core/sock.c
Commit message (Collapse)AuthorAgeFilesLines
* net: core: drop null test before destroy functionsJulia Lawall2015-09-151-8/+4
| | | | | | | | | | | | | | | | | | Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) { \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x); x = NULL; -} // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
* sock: fix kernel doc errorJean Sacren2015-08-271-1/+1
| | | | | | | | | | The symbol '__sk_reclaim' is not present in the current tree. Apparently '__sk_reclaim' was meant to be '__sk_mem_reclaim', so fix it with the right symbol name for the kernel doc. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Cc: Hideo Aoki <haoki@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sk_clone_lock() should only do get_net() if the parent is not a kernel ↵Sowmini Varadhan2015-07-301-1/+2
| | | | | | | | | | | | | | | | | | | | | socket The newsk returned by sk_clone_lock should hold a get_net() reference if, and only if, the parent is not a kernel socket (making this similar to sk_alloc()). E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock sets up the syn_recv newsk from sk_clone_lock. When the parent (listen) socket is a kernel socket (defined in sk_alloc() as having sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt and should not hold a get_net() reference. Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Acked-by: Eric Dumazet <edumazet@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix recv with flags MSG_WAITALL | MSG_PEEKSabrina Dubroca2015-07-271-2/+3
| | | | | | | | | | | | | | | | | | | | | Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called with flags = MSG_WAITALL | MSG_PEEK. sk_wait_data waits for sk_receive_queue not empty, but in this case, the receive queue is not empty, but does not contain any skb that we can use. Add a "last skb seen on receive queue" argument to sk_wait_data, so that it sleeps until the receive queue has new skbs. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493 Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258 Reported-by: Enrico Scholz <rh-bugzilla@ensc.de> Reported-by: Dan Searle <dan@censornet.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sock_diag: don't broadcast kernel socketsCraig Gallek2015-06-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel sockets do not hold a reference for the network namespace to which they point. Socket destruction broadcasting relies on the network namespace and will cause the splat below when a kernel socket is destroyed. This fix simply ignores kernel sockets when they are destroyed. Reported as: general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 1 PID: 9130 Comm: kworker/1:1 Not tainted 4.1.0-gelk-debug+ #1 Workqueue: sock_diag_events sock_diag_broadcast_destroy_work Stack: ffff8800b9c586c0 ffff8800b9c586c0 ffff8800ac4692c0 ffff8800936d4a90 ffff8800352efd38 ffffffff8469a93e ffff8800352efd98 ffffffffc09b9b90 ffff8800352efd78 ffff8800ac4692c0 ffff8800b9c586c0 ffff8800831b6ab8 Call Trace: [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10 [<ffffffffc09b9b90>] ? inet_diag_handler_get_info+0x110/0x1fb [inet_diag] [<ffffffff845c868d>] netlink_broadcast+0x1d/0x20 [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10 [<ffffffff845b2bf5>] sock_diag_broadcast_destroy_work+0xd5/0x160 [<ffffffff8408ea97>] process_one_work+0x147/0x420 [<ffffffff8408f0f9>] worker_thread+0x69/0x470 [<ffffffff8409fda3>] ? preempt_count_sub+0xa3/0xf0 [<ffffffff8408f090>] ? rescuer_thread+0x320/0x320 [<ffffffff84093cd7>] kthread+0x107/0x120 [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0 [<ffffffff8469d31f>] ret_from_fork+0x3f/0x70 [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0 Tested: Using a debug kernel while 'ss -E' is running: ip netns add test-ns ip netns delete test-ns Fixes: eb4cb008529c sock_diag: define destruction multicast groups Fixes: 26abe14379f8 net: Modify sk_alloc to not reference count the netns of kernel sockets. Reported-by: Dave Jones <davej@codemonkey.org.uk> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Kill sock->sk_protinfoDavid Miller2015-06-281-1/+0
| | | | | | No more users, so it can now be removed. Signed-off-by: David S. Miller <davem@davemloft.net>
* sock_diag: define destruction multicast groupsCraig Gallek2015-06-151-1/+10
| | | | | | | | | | | | | | | These groups will contain socket-destruction events for AF_INET/AF_INET6, IPPROTO_TCP/IPPROTO_UDP. Near the end of socket destruction, a check for listeners is performed. In the presence of a listener, rather than completely cleanup the socket, a unit of work will be added to a private work queue which will first broadcast information about the socket and then finish the cleanup operation. Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-06-131-9/+6
|\
| * net: don't wait for order-3 page allocationShaohua Li2015-06-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We saw excessive direct memory compaction triggered by skb_page_frag_refill. This causes performance issues and add latency. Commit 5640f7685831e0 introduces the order-3 allocation. According to the changelog, the order-3 allocation isn't a must-have but to improve performance. But direct memory compaction has high overhead. The benefit of order-3 allocation can't compensate the overhead of direct memory compaction. This patch makes the order-3 page allocation atomic. If there is no memory pressure and memory isn't fragmented, the alloction will still success, so we don't sacrifice the order-3 benefit here. If the atomic allocation fails, direct memory compaction will not be triggered, skb_page_frag_refill will fallback to order-0 immediately, hence the direct memory compaction overhead is avoided. In the allocation failure case, kswapd is waken up and doing compaction, so chances are allocation could success next time. alloc_skb_with_frags is the same. The mellanox driver does similar thing, if this is accepted, we must fix the driver too. V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric V2: make the changelog clearer Cc: Eric Dumazet <edumazet@google.com> Cc: Chris Mason <clm@fb.com> Cc: Debabrata Banerjee <dbavatar@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net, swap: Remove a warning and clarify why sk_mem_reclaim is required when ↵Mel Gorman2015-06-101-8/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | deactivating swap Jeff Layton reported the following; [ 74.232485] ------------[ cut here ]------------ [ 74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80() [ 74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio [ 74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5 [ 74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 74.245546] 0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d [ 74.246786] 0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa [ 74.248175] 0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8 [ 74.249483] Call Trace: [ 74.249872] [<ffffffff8179263d>] dump_stack+0x45/0x57 [ 74.250703] [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0 [ 74.251655] [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20 [ 74.252585] [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80 [ 74.253519] [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc] [ 74.254537] [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc] [ 74.255610] [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs] [ 74.256582] [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80 [ 74.257496] [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0 [ 74.258318] [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140 [ 74.259253] [<ffffffff81798dae>] system_call_fastpath+0x12/0x71 [ 74.260158] ---[ end trace 2530722966429f10 ]--- The warning in question was unnecessary but with Jeff's series the rules are also clearer. This patch removes the warning and updates the comment to explain why sk_mem_reclaim() may still be called. [jlayton: remove if (sk->sk_forward_alloc) conditional. As Leon points out that it's not needed.] Cc: Leon Romanovsky <leon@leon.nu> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Revert "net: kernel socket should be released in init_net namespace"Herbert Xu2015-05-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit c243d7e20996254f89c28d4838b5feca735c030d. That patch is solving a non-existant problem while creating a real problem. Just because a socket is allocated in the init name space doesn't mean that it gets hashed in the init name space. When we unhash it the name space must be the same as the one we had when we hashed it. So this patch is completely bogus and causes socket leaks. Reported-by: Andrey Wagin <avagin@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: tcp_tso_autosize() minimum is one packetEric Dumazet2015-05-261-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By making sure sk->sk_gso_max_segs minimal value is one, and sysctl_tcp_min_tso_segs minimal value is one as well, tcp_tso_autosize() will return a non zero value. We can then revert 843925f33fcc293d80acf2c5c8a78adf3344d49b ("tcp: Do not apply TSO segment limit to non-TSO packets") and save few cpu cycles in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: fix sk_mem_reclaim_partial()Eric Dumazet2015-05-171-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | sk_mem_reclaim_partial() goal is to ensure each socket has one SK_MEM_QUANTUM forward allocation. This is needed both for performance and better handling of memory pressure situations in follow up patches. SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes as some arches have 64KB pages. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: kill sk_change_net and sk_release_kernelEric W. Biederman2015-05-111-19/+0
| | | | | | | | | | | | | | These functions are no longer needed and no longer used kill them. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Modify sk_alloc to not reference count the netns of kernel sockets.Eric W. Biederman2015-05-111-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that sk_alloc knows when a kernel socket is being allocated modify it to not reference count the network namespace of kernel sockets. Keep track of if a socket needs reference counting by adding a flag to struct sock called sk_net_refcnt. Update all of the callers of sock_create_kern to stop using sk_change_net and sk_release_kernel as those hacks are no longer needed, to avoid reference counting a kernel socket. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Pass kern from net_proto_family.create to sk_allocEric W. Biederman2015-05-111-1/+2
|/ | | | | | | | | In preparation for changing how struct net is refcounted on kernel sockets pass the knowledge that we are creating a kernel socket from sock_create_kern through to sk_alloc. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: do not cache align timewait socketsEric Dumazet2015-04-121-2/+1
| | | | | | | | | | | | | | With recent adoption of skc_cookie in struct sock_common, struct tcp_timewait_sock size increased from 192 to 200 bytes on 64bit arches. SLAB rounds then to 256 bytes. It is time to drop SLAB_HWCACHE_ALIGN constraint for twsk_slab. This saves about 12 MB of memory on typical configuration reaching 262144 timewait sockets, and has no noticeable impact on performance. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-04-061-0/+19
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/mellanox/mlx4/cmd.c net/core/fib_rules.c net/ipv4/fib_frontend.c The fib_rules.c and fib_frontend.c conflicts were locking adjustments in 'net' overlapping addition and removal of code in 'net-next'. The mlx4 conflict was a bug fix in 'net' happening in the same place a constant was being replaced with a more suitable macro. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: protect skb->sk accesses from recursive dereference inside the stackhannes@stressinduktion.org2015-04-061-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should not consult skb->sk for output decisions in xmit recursion levels > 0 in the stack. Otherwise local socket settings could influence the result of e.g. tunnel encapsulation process. ipv6 does not conform with this in three places: 1) ip6_fragment: we do consult ipv6_npinfo for frag_size 2) sk_mc_loop in ipv6 uses skb->sk and checks if we should loop the packet back to the local socket 3) ip6_skb_dst_mtu could query the settings from the user socket and force a wrong MTU Furthermore: In sk_mc_loop we could potentially land in WARN_ON(1) if we use a PF_PACKET socket ontop of an IPv6-backed vxlan device. Reuse xmit_recursion as we are currently only interested in protecting tunnel devices. Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Move the comment about unsettable socket-level options to default ↵YOSHIFUJI Hideaki/吉藤英明2015-03-231-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | clause and update its reference. We implement the SO_SNDLOWAT etc not to be settable and return ENOPROTOOPT per 1003.1g 7. Move the comment to appropriate position and update the reference. Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-03-201-0/+4
|\ \ | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/emulex/benet/be_main.c net/core/sysctl_net_core.c net/ipv4/inet_diag.c The be_main.c conflict resolution was really tricky. The conflict hunks generated by GIT were very unhelpful, to say the least. It split functions in half and moved them around, when the real actual conflict only existed solely inside of one function, that being be_map_pci_bars(). So instead, to resolve this, I checked out be_main.c from the top of net-next, then I applied the be_main.c changes from 'net' since the last time I merged. And this worked beautifully. The inet_diag.c and sysctl_net_core.c conflicts were simple overlapping changes, and were easily to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: add comment for sock_efree() usageOliver Hartkopp2015-03-101-0/+4
| | | | | | | | | | | | Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: get rid of central tcp/dccp listener timerEric Dumazet2015-03-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the major issue for TCP is the SYNACK rtx handling, done by inet_csk_reqsk_queue_prune(), fired by the keepalive timer of a TCP_LISTEN socket. This function runs for awful long times, with socket lock held, meaning that other cpus needing this lock have to spin for hundred of ms. SYNACK are sent in huge bursts, likely to cause severe drops anyway. This model was OK 15 years ago when memory was very tight. We now can afford to have a timer per request sock. Timer invocations no longer need to lock the listener, and can be run from all cpus in parallel. With following patch increasing somaxconn width to 32 bits, I tested a listener with more than 4 million active request sockets, and a steady SYNFLOOD of ~200,000 SYN per second. Host was sending ~830,000 SYNACK per second. This is ~100 times more what we could achieve before this patch. Later, we will get rid of the listener hash and use ehash instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: kernel socket should be released in init_net namespaceYing Xue2015-03-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | Creating a kernel socket with sock_create_kern() happens in "init_net" namespace, however, releasing it with sk_release_kernel() occurs in the current namespace which may be different with "init_net" namespace. Therefore, we should guarantee that the namespace in which a kernel socket is created is same as the socket is created. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: factorize sock_edemux()/sock_gen_put() codeEric Dumazet2015-03-161-15/+0
| | | | | | | | | | | | | | | | sock_edemux() is not used in fast path, and should really call sock_gen_put() to save some code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: prepare sock_edemux() & sock_gen_put() for new SYN_RECV stateEric Dumazet2015-03-121-0/+2
| | | | | | | | | | | | | | sock_edemux() & sock_gen_put() should be ready to cope with request socks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: add req_prot_cleanup() & req_prot_init() helpersEric Dumazet2015-03-121-28/+41
| | | | | | | | | | | | | | Make proto_register() & proto_unregister() a bit nicer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Kill hold_net release_netEric W. Biederman2015-03-121-1/+0
| | | | | | | | | | | | | | | | | | hold_net and release_net were an idea that turned out to be useless. The code has been disabled since 2008. Kill the code it is long past due. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: add real socket cookiesEric Dumazet2015-03-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A long standing problem in netlink socket dumps is the use of kernel socket addresses as cookies. 1) It is a security concern. 2) Sockets can be reused quite quickly, so there is no guarantee a cookie is used once and identify a flow. 3) request sock, establish sock, and timewait socks for a given flow have different cookies. Part of our effort to bring better TCP statistics requires to switch to a different allocator. In this patch, I chose to use a per network namespace 64bit generator, and to use it only in the case a socket needs to be dumped to netlink. (This might be refined later if needed) Note that I tried to carry cookies from request sock, to establish sock, then timewait sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eric Salo <salo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Remove iocb argument from sendmsg and recvmsgYing Xue2015-03-021-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | After TIPC doesn't depend on iocb argument in its internal implementations of sendmsg() and recvmsg() hooks defined in proto structure, no any user is using iocb argument in them at all now. Then we can drop the redundant iocb argument completely from kinds of implementations of both sendmsg() and recvmsg() in the entire networking stack. Cc: Christoph Hellwig <hch@lst.de> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: add common accessor for setting dropcount on packetsEyal Birger2015-03-021-1/+1
|/ | | | | | | | As part of an effort to move skb->dropcount to skb->cb[], use a common function in order to set dropcount in struct sk_buff. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net-timestamp: no-payload only sysctlWillem de Bruijn2015-02-021-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tx timestamps are looped onto the error queue on top of an skb. This mechanism leaks packet headers to processes unless the no-payload options SOF_TIMESTAMPING_OPT_TSONLY is set. Add a sysctl that optionally drops looped timestamp with data. This only affects processes without CAP_NET_RAW. The policy is checked when timestamps are generated in the stack. It is possible for timestamps with data to be reported after the sysctl is set, if these were queued internally earlier. No vulnerability is immediately known that exploits knowledge gleaned from packet headers, but it may still be preferable to allow administrators to lock down this path at the cost of possible breakage of legacy applications. Signed-off-by: Willem de Bruijn <willemb@google.com> ---- Changes (v1 -> v2) - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW) (rfc -> v1) - document the sysctl in Documentation/sysctl/net.txt - fix access control race: read .._OPT_TSONLY only once, use same value for permission check and skb generation. Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds2014-12-131-4/+20
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull crypto update from Herbert Xu: - The crypto API is now documented :) - Disallow arbitrary module loading through crypto API. - Allow get request with empty driver name through crypto_user. - Allow speed testing of arbitrary hash functions. - Add caam support for ctr(aes), gcm(aes) and their derivatives. - nx now supports concurrent hashing properly. - Add sahara support for SHA1/256. - Add ARM64 version of CRC32. - Misc fixes. * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (77 commits) crypto: tcrypt - Allow speed testing of arbitrary hash functions crypto: af_alg - add user space interface for AEAD crypto: qat - fix problem with coalescing enable logic crypto: sahara - add support for SHA1/256 crypto: sahara - replace tasklets with kthread crypto: sahara - add support for i.MX53 crypto: sahara - fix spinlock initialization crypto: arm - replace memset by memzero_explicit crypto: powerpc - replace memset by memzero_explicit crypto: sha - replace memset by memzero_explicit crypto: sparc - replace memset by memzero_explicit crypto: algif_skcipher - initialize upon init request crypto: algif_skcipher - removed unneeded code crypto: algif_skcipher - Fixed blocking recvmsg crypto: drbg - use memzero_explicit() for clearing sensitive data crypto: drbg - use MODULE_ALIAS_CRYPTO crypto: include crypto- module prefix in template crypto: user - add MODULE_ALIAS crypto: sha-mb - remove a bogus NULL check crytpo: qat - Fix 64 bytes requests ...
| * crypto: algif - add and use sock_kzfree_s() instead of memzero_explicit()Daniel Borkmann2014-11-251-4/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e1bd95bf7c25 ("crypto: algif - zeroize IV buffer") and 2a6af25befd0 ("crypto: algif - zeroize message digest buffer") added memzero_explicit() calls on buffers that are later on passed back to sock_kfree_s(). This is a discussed follow-up that, instead, extends the sock API and adds sock_kzfree_s(), which internally uses kzfree() instead of kfree() for passing the buffers back to slab. Having sock_kzfree_s() allows to keep the changes more minimal by just having a drop-in replacement instead of adding memzero_explicit() calls everywhere before sock_kfree_s(). In kzfree(), the compiler is not allowed to optimize the memset() away and thus there's no need for memzero_explicit(). Both, sock_kfree_s() and sock_kzfree_s() are wrappers for __sock_kfree_s() and call into kfree() resp. kzfree(); here, __sock_kfree_s() needs to be explicitly inlined as we want the compiler to optimize the call and condition away and thus it produces e.g. on x86_64 the _same_ assembler output for sock_kfree_s() before and after, and thus also allows for avoiding code duplication. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* | net: sock: allow eBPF programs to be attached to socketsAlexei Starovoitov2014-12-051-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | introduce new setsockopt() command: setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...) and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER setsockopt() calls bpf_prog_get() which increments refcnt of the program, so it doesn't get unloaded while socket is using the program. The same eBPF program can be attached to multiple sockets. User task exit automatically closes socket which calls sk_filter_uncharge() which decrements refcnt of eBPF program Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: introduce SO_INCOMING_CPUEric Dumazet2014-11-111-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alternative to RPS/RFS is to use hardware support for multiple queues. Then split a set of million of sockets into worker threads, each one using epoll() to manage events on its own socket pool. Ideally, we want one thread per RX/TX queue/cpu, but we have no way to know after accept() or connect() on which queue/cpu a socket is managed. We normally use one cpu per RX queue (IRQ smp_affinity being properly set), so remembering on socket structure which cpu delivered last packet is enough to solve the problem. After accept(), connect(), or even file descriptor passing around processes, applications can use : int cpu; socklen_t len = sizeof(cpu); getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len); And use this information to put the socket into the right silo for optimal performance, as all networking stack should run on the appropriate cpu, without need to send IPI (RPS/RFS). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Add and use skb_copy_datagram_msg() helper.David S. Miller2014-11-051-1/+1
|/ | | | | | | | | | | | | | | This encapsulates all of the skb_copy_datagram_iovec() callers with call argument signature "skb, offset, msghdr->msg_iov, length". When we move to iov_iters in the networking, the iov_iter object will sit in the msghdr. Having a helper like this means there will be less places to touch during that transformation. Based upon descriptions and patch from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Trap attempts to call sock_kfree_s() with a NULL pointer.David S. Miller2014-10-141-0/+2
| | | | | | | | Unlike normal kfree() it is never right to call sock_kfree_s() with a NULL pointer, because sock_kfree_s() also has the side effect of discharging the memory from the sockets quota. Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2014-10-081-85/+25
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: "Most notable changes in here: 1) By far the biggest accomplishment, thanks to a large range of contributors, is the addition of multi-send for transmit. This is the result of discussions back in Chicago, and the hard work of several individuals. Now, when the ->ndo_start_xmit() method of a driver sees skb->xmit_more as true, it can choose to defer the doorbell telling the driver to start processing the new TX queue entires. skb->xmit_more means that the generic networking is guaranteed to call the driver immediately with another SKB to send. There is logic added to the qdisc layer to dequeue multiple packets at a time, and the handling mis-predicted offloads in software is now done with no locks held. Finally, pktgen is extended to have a "burst" parameter that can be used to test a multi-send implementation. Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4, virtio_net Adding support is almost trivial, so export more drivers to support this optimization soon. I want to thank, in no particular or implied order, Jesper Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann, David Tat, Hannes Frederic Sowa, and Rusty Russell. 2) PTP and timestamping support in bnx2x, from Michal Kalderon. 3) Allow adjusting the rx_copybreak threshold for a driver via ethtool, and add rx_copybreak support to enic driver. From Govindarajulu Varadarajan. 4) Significant enhancements to the generic PHY layer and the bcm7xxx driver in particular (EEE support, auto power down, etc.) from Florian Fainelli. 5) Allow raw buffers to be used for flow dissection, allowing drivers to determine the optimal "linear pull" size for devices that DMA into pools of pages. The objective is to get exactly the necessary amount of headers into the linear SKB area pre-pulled, but no more. The new interface drivers use is eth_get_headlen(). From WANG Cong, with driver conversions (several had their own by-hand duplicated implementations) by Alexander Duyck and Eric Dumazet. 6) Support checksumming more smoothly and efficiently for encapsulations, and add "foo over UDP" facility. From Tom Herbert. 7) Add Broadcom SF2 switch driver to DSA layer, from Florian Fainelli. 8) eBPF now can load programs via a system call and has an extensive testsuite. Alexei Starovoitov and Daniel Borkmann. 9) Major overhaul of the packet scheduler to use RCU in several major areas such as the classifiers and rate estimators. From John Fastabend. 10) Add driver for Intel FM10000 Ethernet Switch, from Alexander Duyck. 11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric Dumazet. 12) Add Datacenter TCP congestion control algorithm support, From Florian Westphal. 13) Reorganize sk_buff so that __copy_skb_header() is significantly faster. From Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits) netlabel: directly return netlbl_unlabel_genl_init() net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers net: description of dma_cookie cause make xmldocs warning cxgb4: clean up a type issue cxgb4: potential shift wrapping bug i40e: skb->xmit_more support net: fs_enet: Add NAPI TX net: fs_enet: Remove non NAPI RX r8169:add support for RTL8168EP net_sched: copy exts->type in tcf_exts_change() wimax: convert printk to pr_foo() af_unix: remove 0 assignment on static ipv6: Do not warn for informational ICMP messages, regardless of type. Update Intel Ethernet Driver maintainers list bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING tipc: fix bug in multicast congestion handling net: better IFF_XMIT_DST_RELEASE support net/mlx4_en: remove NETDEV_TX_BUSY 3c59x: fix bad split of cpu_to_le32(pci_map_single()) net: bcmgenet: fix Tx ring priority programming ...
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-09-231-1/+1
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: arch/mips/net/bpf_jit.c drivers/net/can/flexcan.c Both the flexcan and MIPS bpf_jit conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: add alloc_skb_with_frags() helperEric Dumazet2014-09-191-63/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extract from sock_alloc_send_pskb() code building skb with frags, so that we can reuse this in other contexts. Intent is to use it from tcp_send_rcvq(), tcp_collapse(), ... We also want to replace some skb_linearize() calls to a more reliable strategy in pathological cases where we need to reduce number of frags. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-09-071-16/+18
| |\ \
| * | | net: merge cases where sock_efree and sock_edemux are the same functionAlexander Duyck2014-09-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since sock_efree and sock_demux are essentially the same code for non-TCP sockets and the case where CONFIG_INET is not defined we can combine the code or replace the call to sock_edemux in several spots. As a result we can avoid a bit of unnecessary code or code duplication. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net-timestamp: Make the clone operation stand-alone from phy timestampingAlexander Duyck2014-09-051-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The phy timestamping takes a different path than the regular timestamping does in that it will create a clone first so that the packets needing to be timestamped can be placed in a queue, or the context block could be used. In order to support these use cases I am pulling the core of the code out so it can be used in other drivers beyond just phy devices. In addition I have added a destructor named sock_efree which is meant to provide a simple way for dropping the reference to skb exceptions that aren't part of either the receive or send windows for the socket, and I have removed some duplication in spots where this destructor could be used in place of sock_edemux. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | sock: deduplicate errqueue dequeueWillem de Bruijn2014-09-011-12/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk->sk_error_queue is dequeued in four locations. All share the exact same logic. Deduplicate. Also collapse the two critical sections for dequeue (at the top of the recv handler) and signal (at the bottom). This moves signal generation for the next packet forward, which should be harmless. It also changes the behavior if the recv handler exits early with an error. Previously, a signal for follow-up packets on the errqueue would then not be scheduled. The new behavior, to always signal, is arguably a bug fix. For rxrpc, the change causes the same function to be called repeatedly for each queued packet (because the recv handler == sk_error_report). It is likely that all packets will fail for the same reason (e.g., memory exhaustion). This code runs without sk_lock held, so it is not safe to trust that sk->sk_err is immutable inbetween releasing q->lock and the subsequent test. Introduce int err just to avoid this potential race. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: remove dead code after sk_data_ready changeEric Dumazet2014-08-221-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As a followup to commit 676d23690fb ("net: Fix use after free by removing length arg from sk_data_ready callbacks"), we can remove some useless code in sock_queue_rcv_skb() and rxrpc_queue_rcv_skb() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge tag 'dmaengine-3.17' of ↵Linus Torvalds2014-10-071-6/+0
|\ \ \ \ | |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine Pull dmaengine updates from Dan Williams: "Even though this has fixes marked for -stable, given the size and the needed conflict resolutions this is 3.18-rc1/merge-window material. These patches have been languishing in my tree for a long while. The fact that I do not have the time to do proper/prompt maintenance of this tree is a primary factor in the decision to step down as dmaengine maintainer. That and the fact that the bulk of drivers/dma/ activity is going through Vinod these days. The net_dma removal has not been in -next. It has developed simple conflicts against mainline and net-next (for-3.18). Continuing thanks to Vinod for staying on top of drivers/dma/. Summary: 1/ Step down as dmaengine maintainer see commit 08223d80df38 "dmaengine maintainer update" 2/ Removal of net_dma, as it has been marked 'broken' since 3.13 (commit 77873803363c "net_dma: mark broken"), without reports of performance regression. 3/ Miscellaneous fixes" * tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine: net: make tcp_cleanup_rbuf private net_dma: revert 'copied_early' net_dma: simple removal dmaengine maintainer update dmatest: prevent memory leakage on error path in thread ioat: Use time_before_jiffies() dmaengine: fix xor sources continuation dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup() dma: mv_xor: Remove all callers of mv_xor_slot_cleanup() dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call ioat: Use pci_enable_msix_exact() instead of pci_enable_msix() drivers: dma: Include appropriate header file in dca.c drivers: dma: Mark functions as static in dma_v3.c dma: mv_xor: Add DMA API error checks ioat/dca: Use dev_is_pci() to check whether it is pci device
| * | | net_dma: simple removalDan Williams2014-09-281-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per commit "77873803363c net_dma: mark broken" net_dma is no longer used and there is no plan to fix it. This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards. Reverting the remainder of the net_dma induced changes is deferred to subsequent patches. Marked for stable due to Roman's report of a memory leak in dma_pin_iovec_pages(): https://lkml.org/lkml/2014/9/3/177 Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vinod Koul <vinod.koul@intel.com> Cc: David Whipple <whipple@securedatainnovations.ch> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: <stable@vger.kernel.org> Reported-by: Roman Gushchin <klamm@yandex-team.ru> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | | | net: fix skb_page_frag_refill() kerneldocEric Dumazet2014-09-081-1/+1
| |_|/ |/| | | | | | | | | | | | | | | | | | | | In commit d9b2938aabf7 ("net: attempt a single high order allocation) I forgot to update kerneldoc, as @prio parameter was renamed to @gfp Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: treewide: Fix typo found in DocBook/networking.xmlMasanari Iida2014-09-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fix spelling typo found in DocBook/networking.xml. It is because the neworking.xml is generated from comments in the source, I have to fix typo in comments within the source. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud