summaryrefslogtreecommitdiffstats
path: root/include/net/sock.h
Commit message (Collapse)AuthorAgeFilesLines
* sock: struct proto hash function may errorCraig Gallek2016-02-111-3/+3
| | | | | | | | | | | In order to support fast reuseport lookups in TCP, the hash function defined in struct proto must be capable of returning an error code. This patch changes the function signature of all related hash functions to return an integer and handles or propagates this return value at all call sites. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sock: remove dead cgroup methods from struct protoJohannes Weiner2016-01-211-12/+0
| | | | | | | | | | | The cgroup methods are no longer used after baac50bbc3cd ("net: tcp_memcontrol: simplify linkage between socket and page counter"). The hunk to delete them was included in the original patch but must have gotten lost during conflict resolution on the way upstream. Fixes: baac50bbc3cd ("net: tcp_memcontrol: simplify linkage between socket and page counter") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* mm: memcontrol: generalize the socket accounting jump labelJohannes Weiner2016-01-141-7/+0
| | | | | | | | | | | | | The unified hierarchy memory controller is going to use this jump label as well to control the networking callbacks. Move it to the memory controller code and give it a more generic name. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* net: tcp_memcontrol: simplify linkage between socket and page counterJohannes Weiner2016-01-141-21/+4
| | | | | | | | | | | | There won't be any separate counters for socket memory consumed by protocols other than TCP in the future. Remove the indirection and link sockets directly to their owning memory cgroup. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* net: tcp_memcontrol: sanitize tcp memory accounting callbacksJohannes Weiner2016-01-141-56/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There won't be a tcp control soft limit, so integrating the memcg code into the global skmem limiting scheme complicates things unnecessarily. Replace this with simple and clear charge and uncharge calls--hidden behind a jump label--to account skb memory. Note that this is not purely aesthetic: as a result of shoehorning the per-memcg code into the same memory accounting functions that handle the global level, the old code would compare the per-memcg consumption against the smaller of the per-memcg limit and the global limit. This allowed the total consumption of multiple sockets to exceed the global limit, as long as the individual sockets stayed within bounds. After this change, the code will always compare the per-memcg consumption to the per-memcg limit, and the global consumption to the global limit, and thus close this loophole. Without a soft limit, the per-memcg memory pressure state in sockets is generally questionable. However, we did it until now, so we continue to enter it when the hard limit is hit, and packets are dropped, to let other sockets in the cgroup know that they shouldn't grow their transmit windows, either. However, keep it simple in the new callback model and leave memory pressure lazily when the next packet is accepted (as opposed to doing it synchroneously when packets are processed). When packets are dropped, network performance will already be in the toilet, so that should be a reasonable trade-off. As described above, consumption is now checked on the per-memcg level and the global level separately. Likewise, memory pressure states are maintained on both the per-memcg level and the global level, and a socket is considered under pressure when either level asserts as much. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* net: tcp_memcontrol: simplify the per-memcg limit accessJohannes Weiner2016-01-141-3/+5
| | | | | | | | | | | | tcp_memcontrol replicates the global sysctl_mem limit array per cgroup, but it only ever sets these entries to the value of the memory_allocated page_counter limit. Use the latter directly. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* net: tcp_memcontrol: remove dead per-memcg count of allocated socketsJohannes Weiner2016-01-141-36/+3
| | | | | | | | | | | | | | | | | The number of allocated sockets is used for calculations in the soft limit phase, where packets are accepted but the socket is under memory pressure. Since there is no soft limit phase in tcp_memcontrol, and memory pressure is only entered when packets are already dropped, this is actually dead code. Remove it. As this is the last user of parent_cg_proto(), remove that too. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* net: tcp_memcontrol: remove bogus hierarchy pressure propagationJohannes Weiner2016-01-141-15/+4
| | | | | | | | | | | | | | | | | | | | When a cgroup currently breaches its socket memory limit, it enters memory pressure mode for itself and its *ancestors*. This throttles transmission in unrelated sibling and cousin subtrees that have nothing to do with the breached limit. On the contrary, breaching a limit should make that group and its *children* enter memory pressure mode. But this happens already, albeit lazily: if an ancestor limit is breached, siblings will enter memory pressure on their own once the next packet arrives for them. So no additional hierarchy code is needed. Remove the bogus stuff. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* net: tcp_memcontrol: properly detect ancestor socket pressureJohannes Weiner2016-01-141-4/+6
| | | | | | | | | | | | | | | | | | | When charging socket memory, the code currently checks only the local page counter for excess to determine whether the memcg is under socket pressure. But even if the local counter is fine, one of the ancestors could have breached its limit, which should also force this child to enter socket pressure. This currently doesn't happen. Fix this by using page_counter_try_charge() first. If that fails, it means that either the local counter or one of the ancestors are in excess of their limit, and the child should enter socket pressure. Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* soreuseport: define reuseport groupsCraig Gallek2016-01-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | struct sock_reuseport is an optional shared structure referenced by each socket belonging to a reuseport group. When a socket is bound to an address/port not yet in use and the reuseport flag has been set, the structure will be allocated and attached to the newly bound socket. When subsequent calls to bind are made for the same address/port, the shared structure will be updated to include the new socket and the newly bound socket will reference the group structure. Usually, when an incoming packet was destined for a reuseport group, all sockets in the same group needed to be considered before a dispatching decision was made. With this structure, an appropriate socket can be found after looking up just one socket in the group. This shared structure will also allow for more complicated decisions to be made when selecting a socket (eg a BPF filter). This work is based off a similar implementation written by Ying Cai <ycai@google.com> for implementing policy-based reuseport selection. Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-12-171-2/+5
|\ | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/geneve.c Here we had an overlapping change, where in 'net' the extraneous stats bump was being removed whilst in 'net-next' the final argument to udp_tunnel6_xmit_skb() was being changed. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: fix warnings in 'make htmldocs' by moving macro definition out of field ↵Hannes Frederic Sowa2015-12-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | declaration Docbook does not like the definition of macros inside a field declaration and adds a warning. Move the definition out. Fixes: 79462ad02e86180 ("net: add validation for the socket syscall protocol argument") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: fix IP early demux racesEric Dumazet2015-12-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | David Wilder reported crashes caused by dst reuse. <quote David> I am seeing a crash on a distro V4.2.3 kernel caused by a double release of a dst_entry. In ipv4_dst_destroy() the call to list_empty() finds a poisoned next pointer, indicating the dst_entry has already been removed from the list and freed. The crash occurs 18 to 24 hours into a run of a network stress exerciser. </quote> Thanks to his detailed report and analysis, we were able to understand the core issue. IP early demux can associate a dst to skb, after a lookup in TCP/UDP sockets. When socket cache is not properly set, we want to store into sk->sk_dst_cache the dst for future IP early demux lookups, by acquiring a stable refcount on the dst. Problem is this acquisition is simply using an atomic_inc(), which works well, unless the dst was queued for destruction from dst_release() noticing dst refcount went to zero, if DST_NOCACHE was set on dst. We need to make sure current refcount is not zero before incrementing it, or risk double free as David reported. This patch, being a stable candidate, adds two new helpers, and use them only from IP early demux problematic paths. It might be possible to merge in net-next skb_dst_force() and skb_dst_force_safe(), but I prefer having the smallest patch for stable kernels : Maybe some skb_dst_force() callers do not expect skb->dst can suddenly be cleared. Can probably be backported back to linux-3.6 kernels Reported-by: David J. Wilder <dwilder@us.ibm.com> Tested-by: David J. Wilder <dwilder@us.ibm.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: add validation for the socket syscall protocol argumentHannes Frederic Sowa2015-12-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 郭永刚 reported that one could simply crash the kernel as root by using a simple program: int socket_fd; struct sockaddr_in addr; addr.sin_port = 0; addr.sin_addr.s_addr = INADDR_ANY; addr.sin_family = 10; socket_fd = socket(10,3,0x40000000); connect(socket_fd , &addr,16); AF_INET, AF_INET6 sockets actually only support 8-bit protocol identifiers. inet_sock's skc_protocol field thus is sized accordingly, thus larger protocol identifiers simply cut off the higher bits and store a zero in the protocol fields. This could lead to e.g. NULL function pointer because as a result of the cut off inet_num is zero and we call down to inet_autobind, which is NULL for raw sockets. kernel: Call Trace: kernel: [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70 kernel: [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80 kernel: [<ffffffff81645069>] SYSC_connect+0xd9/0x110 kernel: [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80 kernel: [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200 kernel: [<ffffffff81645e0e>] SyS_connect+0xe/0x10 kernel: [<ffffffff81779515>] tracesys_phase2+0x84/0x89 I found no particular commit which introduced this problem. CVE: CVE-2015-8543 Cc: Cong Wang <cwang@twopensource.com> Reported-by: 郭永刚 <guoyonggang@360.cn> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * xfrm: add rcu protection to sk->sk_policy[]Eric Dumazet2015-12-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | XFRM can deal with SYNACK messages, sent while listener socket is not locked. We add proper rcu protection to __xfrm_sk_clone_policy() and xfrm_sk_policy_lookup() This might serve as the first step to remove xfrm.xfrm_policy_lock use in fast path. Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: update the netstamp_needed counter when copying socketsMarcelo Ricardo Leitner2015-12-051-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dmitry Vyukov reported that SCTP was triggering a WARN on socket destroy related to disabling sock timestamp. When SCTP accepts an association or peel one off, it copies sock flags but forgot to call net_enable_timestamp() if a packet timestamping flag was copied, leading to extra calls to net_disable_timestamp() whenever such clones were closed. The fix is to call net_enable_timestamp() whenever we copy a sock with that flag on, like tcp does. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: diag: Add the ability to destroy a socket.Lorenzo Colitti2015-12-151-0/+1
| | | | | | | | | | | | | | | | | | | | This patch adds a SOCK_DESTROY operation, a destroy function pointer to sock_diag_handler, and a diag_destroy function pointer. It does not include any implementation code. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: Fix conditions to determine checksum offloadTom Herbert2015-12-151-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | In tcp_send_sendpage and tcp_sendmsg we check the route capabilities to determine if checksum offload can be performed. This check currently does not take the IP protocol into account for devices that advertise only one of NETIF_F_IPV6_CSUM or NETIF_F_IP_CSUM. This patch adds a function to check capabilities for checksum offload with a socket called sk_check_csum_caps. This function checks for specific IPv4 or IPv6 offload support based on the family of the socket. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a structTejun Heo2015-12-081-8/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data. ->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct and its accessors are defined in cgroup-defs.h. This is to prepare for overloading the fields with a cgroup pointer. This patch mostly performs equivalent conversions but the followings are noteworthy. * Equality test before updating classid is removed from sock_update_classid(). This shouldn't make any noticeable difference and a similar test will be implemented on the helper side later. * sock_update_netprioidx() now takes struct sock_cgroup_data and can be moved to netprio_cgroup.h without causing include dependency loop. Moved. * The dummy version of sock_update_netprioidx() converted to a static inline function while at it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netprio_cgroup: limit the maximum css->id to USHRT_MAXTejun Heo2015-12-081-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | netprio builds per-netdev contiguous priomap array which is indexed by css->id. The array is allocated using kzalloc() effectively limiting the maximum ID supported to some thousand range. This patch caps the maximum supported css->id to USHRT_MAX which should be way above what is actually useable. This allows reducing sock->sk_cgrp_prioidx to u16 from u32. The freed up part will be used to overload the cgroup related fields. sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the two cgroup related fields are adjacent. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Daniel Borkmann <daniel@iogearbox.net> CC: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-12-031-7/+25
|\ \ | |/ | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/renesas/ravb_main.c kernel/bpf/syscall.c net/ipv4/ipmr.c All three conflicts were cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: kill sk_dst_lockEric Dumazet2015-12-031-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While testing the np->opt RCU conversion, I found that UDP/IPv6 was using a mixture of xchg() and sk_dst_lock to protect concurrent changes to sk->sk_dst_cache, leading to possible corruptions and crashes. ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest way to fix the mess is to remove sk_dst_lock completely, as we did for IPv4. __ip6_dst_store() and ip6_dst_store() share same implementation. sk_setup_caps() being called with socket lock being held or not, we have to use sk_dst_set() instead of __sk_dst_set() Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);" in ip6_dst_store() before the sk_setup_caps(sk, dst) call. This is because ip6_dst_store() can be called from process context, without any lock held. As soon as the dst is installed in sk->sk_dst_cache, dst can be freed from another cpu doing a concurrent ip6_dst_store() Doing the dst dereference before doing the install is needed to make sure no use after free would trigger. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: fix sock_wake_async() rcu protectionEric Dumazet2015-12-011-7/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dmitry provided a syzkaller (http://github.com/google/syzkaller) triggering a fault in sock_wake_async() when async IO is requested. Said program stressed af_unix sockets, but the issue is generic and should be addressed in core networking stack. The problem is that by the time sock_wake_async() is called, we should not access the @flags field of 'struct socket', as the inode containing this socket might be freed without further notice, and without RCU grace period. We already maintain an RCU protected structure, "struct socket_wq" so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it is the safe route. It also reduces number of cache lines needing dirtying, so might provide a performance improvement anyway. In followup patches, we might move remaining flags (SOCK_NOSPACE, SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket' being mostly read and let it being shared between cpus. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATAEric Dumazet2015-12-011-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is a cleanup to make following patch easier to review. Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA from (struct socket)->flags to a (struct socket_wq)->flags to benefit from RCU protection in sock_wake_async() To ease backports, we rename both constants. Two new helpers, sk_set_bit(int nr, struct sock *sk) and sk_clear_bit(int net, struct sock *sk) are added so that following patch can change their implementation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: suppress too verbose messages in tcp_send_ack()Eric Dumazet2015-12-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If tcp_send_ack() can not allocate skb, we properly handle this and setup a timer to try later. Use __GFP_NOWARN to avoid polluting syslog in the case host is under memory pressure, so that pertinent messages are not lost under a flood of useless information. sk_gfp_atomic() can use its gfp_mask argument (all callers currently were using GFP_ATOMIC before this patch) We rename sk_gfp_atomic() to sk_gfp_mask() to clearly express this function now takes into account its second argument (gfp_mask) Note that when tcp_transmit_skb() is called with clone_it set to false, we do not attempt memory allocations, so can pass a 0 gfp_mask, which most compilers can emit faster than a non zero or constant value. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Generalise wq_has_sleeper helperHerbert Xu2015-11-301-10/+5
|/ | | | | | | | | | The memory barrier in the helper wq_has_sleeper is needed by just about every user of waitqueue_active. This patch generalises it by making it take a wait_queue_head_t directly. The existing helper is renamed to skwq_has_sleeper. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: ensure proper barriers in lockless contextsEric Dumazet2015-11-151-0/+25
| | | | | | | | | | | | | | Some functions access TCP sockets without holding a lock and might output non consistent data, depending on compiler and or architecture. tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ... Introduce sk_state_load() and sk_state_store() to fix the issues, and more clearly document where this lack of locking is happening. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mm, page_alloc: distinguish between being unable to sleep, unwilling to ↵Mel Gorman2015-11-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* net: make skb_set_owner_w() more robustEric Dumazet2015-11-021-15/+2
| | | | | | | | | | | | | | | | | | skb_set_owner_w() is called from various places that assume skb->sk always point to a full blown socket (as it changes sk->sk_wmem_alloc) We'd like to attach skb to request sockets, and in the future to timewait sockets as well. For these kind of pseudo sockets, we need to take a traditional refcount and use sock_edemux() as the destructor. It is now time to un-inline skb_set_owner_w(), being too big. Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Bisected-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-10-201-0/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/usb/asix_common.c net/ipv4/inet_connection_sock.c net/switchdev/switchdev.c In the inet_connection_sock.c case the request socket hashing scheme is completely different in net-next. The other two conflicts were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: add pfmemalloc check in sk_add_backlog()Eric Dumazet2015-10-171-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Greg reported crashes hitting the following check in __sk_backlog_rcv() BUG_ON(!sock_flag(sk, SOCK_MEMALLOC)); The pfmemalloc bit is currently checked in sk_filter(). This works correctly for TCP, because sk_filter() is ran in tcp_v[46]_rcv() before hitting the prequeue or backlog checks. For UDP or other protocols, this does not work, because the sk_filter() is ran from sock_queue_rcv_skb(), which might be called _after_ backlog queuing if socket is owned by user by the time packet is processed by softirq handler. Fixes: b4b9e35585089 ("netvm: set PF_MEMALLOC as appropriate during SKB processing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Greg Thelen <gthelen@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: shrink tcp_timewait_sock by 8 bytesEric Dumazet2015-10-121-0/+2
| | | | | | | | | | | | | | | | Reducing tcp_timewait_sock from 280 bytes to 272 bytes allows SLAB to pack 15 objects per page instead of 14 (on x86) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: shrink struct sock and request_sock by 8 bytesEric Dumazet2015-10-121-5/+9
| | | | | | | | | | | | | | | | One 32bit hole is following skc_refcnt, use it. skc_incoming_cpu can also be an union for request_sock rcv_wnd. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: align sk_refcnt on 128 bytes boundaryEric Dumazet2015-10-121-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk->sk_refcnt is dirtied for every TCP/UDP incoming packet. This is a performance issue if multiple cpus hit a common socket, or multiple sockets are chained due to SO_REUSEPORT. By moving sk_refcnt 8 bytes further, first 128 bytes of sockets are mostly read. As they contain the lookup keys, this has a considerable performance impact, as cpus can cache them. These 8 bytes are not wasted, we use them as a place holder for various fields, depending on the socket type. Tested: SYN flood hitting a 16 RX queues NIC. TCP listener using 16 sockets and SO_REUSEPORT and SO_INCOMING_CPU for proper siloing. Could process 6.0 Mpps SYN instead of 4.2 Mpps Kernel profile looked like : 11.68% [kernel] [k] sha_transform 6.51% [kernel] [k] __inet_lookup_listener 5.07% [kernel] [k] __inet_lookup_established 4.15% [kernel] [k] memcpy_erms 3.46% [kernel] [k] ipt_do_table 2.74% [kernel] [k] fib_table_lookup 2.54% [kernel] [k] tcp_make_synack 2.34% [kernel] [k] tcp_conn_request 2.05% [kernel] [k] __netif_receive_skb_core 2.03% [kernel] [k] kmem_cache_alloc Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: SO_INCOMING_CPU setsockopt() supportEric Dumazet2015-10-121-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SO_INCOMING_CPU as added in commit 2c8c56e15df3 was a getsockopt() command to fetch incoming cpu handling a particular TCP flow after accept() This commits adds setsockopt() support and extends SO_REUSEPORT selection logic : If a TCP listener or UDP socket has this option set, a packet is delivered to this socket only if CPU handling the packet matches the specified one. This allows to build very efficient TCP servers, using one listener per RX queue, as the associated TCP listener should only accept flows handled in softirq by the same cpu. This provides optimal NUMA behavior and keep cpu caches hot. Note that __inet_lookup_listener() still has to iterate over the list of all listeners. Following patch puts sk_refcnt in a different cache line to let this iteration hit only shared and read mostly cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sock: support per-packet fwmarkEdward Jee2015-10-121-0/+7
| | | | | | | | | | | | | | | | | | | | | | It's useful to allow users to set fwmark for an individual packet, without changing the socket state. The function this patch adds in sock layer can be used by the protocols that need such a feature. Signed-off-by: Edward Hyunkoo Jee <edjee@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: synack packets can be attached to request socketsEric Dumazet2015-10-111-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | selinux needs few changes to accommodate fact that SYNACK messages can be attached to a request socket, lacking sk_security pointer (Only syncookies are still attached to a TCP_LISTEN socket) Adds a new sk_listener() helper, and use it in selinux and sch_fq Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported by: kernel test robot <ying.huang@linux.intel.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Eric Paris <eparis@parisplace.org> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: constify sk_gfp_atomic() sock argumentEric Dumazet2015-09-291-1/+1
| | | | | | | | | | Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: provide skb->hash to synack packetsEric Dumazet2015-09-171-4/+8
|/ | | | | | | | | | | | | | | | | | | | | | In commit b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf on xmit"), Tom provided a l4 hash to most outgoing TCP packets. We'd like to provide one as well for SYNACK packets, so that all packets of a given flow share same txhash, to later enable bonding driver to also use skb->hash to perform slave selection. Note that a SYNACK retransmit shuffles the tx hash, as Tom did in commit 265f94ff54d62 ("net: Recompute sk_txhash on negative routing advice") for established sockets. This has nice effect making TCP flows resilient to some kind of black holes, even at connection establish phase. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Mahesh Bandewar <maheshb@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* memcg: move memcg_proto_active from sock.hMichal Hocko2015-09-081-5/+0
| | | | | | | | | | | | | The only user is sock_update_memcg which is living in memcontrol.c so it doesn't make much sense to pollute sock.h by this inline helper. Move it to memcontrol.c and open code it into its only caller. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: export struct mem_cgroupMichal Hocko2015-09-081-28/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mem_cgroup structure is defined in mm/memcontrol.c currently which means that the code outside of this file has to use external API even for trivial access stuff. This patch exports mm_struct with its dependencies and makes some of the exported functions inlines. This even helps to reduce the code size a bit (make defconfig + CONFIG_MEMCG=y) text data bss dec hex filename 12355346 1823792 1089536 15268674 e8fb42 vmlinux.before 12354970 1823792 1089536 15268298 e8f9ca vmlinux.after This is not much (370B) but better than nothing. We also save a function call in some hot paths like callers of mem_cgroup_count_vm_event which is used for accounting. The patch doesn't introduce any functional changes. [vdavykov@parallels.com: inline memcg_kmem_is_active] [vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG] [akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx] [akpm@linux-foundation.org: export mem_cgroup_from_task() to modules] Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-07-311-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: arch/s390/net/bpf_jit_comp.c drivers/net/ethernet/ti/netcp_ethss.c net/bridge/br_multicast.c net/ipv4/ip_fragment.c All four conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: fix recv with flags MSG_WAITALL | MSG_PEEKSabrina Dubroca2015-07-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called with flags = MSG_WAITALL | MSG_PEEK. sk_wait_data waits for sk_receive_queue not empty, but in this case, the receive queue is not empty, but does not contain any skb that we can use. Add a "last skb seen on receive queue" argument to sk_wait_data, so that it sleeps until the receive queue has new skbs. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493 Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258 Reported-by: Enrico Scholz <rh-bugzilla@ensc.de> Reported-by: Dan Searle <dan@censornet.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Recompute sk_txhash on negative routing adviceTom Herbert2015-07-291-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | When a connection is failing a transport protocol calls dst_negative_advice to try to get a better route. This patch includes changing the sk_txhash in that function. This provides a rudimentary method to try to find a different path in the network since sk_txhash affects ECMP on the local host and through the network (via flow labels or UDP source port in encapsulation). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Set sk_txhash from a random numberTom Herbert2015-07-291-0/+8
| | | | | | | | | | | | | | | | | | | | | | This patch creates sk_set_txhash and eliminates protocol specific inet_set_txhash and ip6_set_txhash. sk_set_txhash simply sets a random number instead of performing flow dissection. sk_set_txash is also allowed to be called multiple times for the same socket, we'll need this when redoing the hash for negative routing advice. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: #ifdefify sk_classid member of struct sockMathias Krause2015-07-211-0/+2
|/ | | | | | | | | The sk_classid member is only required when CONFIG_CGROUP_NET_CLASSID is enabled. #ifdefify it to reduce the size of struct sock on 32 bit systems, at least. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Kill sock->sk_protinfoDavid Miller2015-06-281-2/+0
| | | | | | No more users, so it can now be removed. Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Update out-of-date commentZhaowei Yuan2015-06-211-1/+0
| | | | | | | | Struct inet_proto no longer exists, so update the comment which is out of date. Signed-off-by: Zhaowei Yuan <zhaowei.yuan@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sock_diag: define destruction multicast groupsCraig Gallek2015-06-151-0/+1
| | | | | | | | | | | | | | | These groups will contain socket-destruction events for AF_INET/AF_INET6, IPPROTO_TCP/IPPROTO_UDP. Near the end of socket destruction, a check for listeners is performed. In the presence of a listener, rather than completely cleanup the socket, a unit of work will be added to a private work queue which will first broadcast information about the socket and then finish the cleanup operation. Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add a force_schedule argument to sk_stream_alloc_skb()Eric Dumazet2015-05-211-1/+2
| | | | | | | | | | | | | | | | | | | In commit 8e4d980ac215 ("tcp: fix behavior for epoll edge trigger") we fixed a possible hang of TCP sockets under memory pressure, by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule() if no packet is in socket write queue. It turns out there are other cases where we want to force memory schedule : tcp_fragment() & tso_fragment() need to split a big TSO packet into two smaller ones. If we block here because of TCP memory pressure, we can effectively block TCP socket from sending new data. If no further ACK is coming, this hang would be definitive, and socket has no chance to effectively reduce its memory usage. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud