summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | | | sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqstChuck Lever2017-07-121-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | svcrdma needs 259 pages allocated to receive 1MB NFSv4.0 WRITE requests: - 1 page for the transport header and head iovec - 256 pages for the data payload - 1 page for the trailing GETATTR request (since NFSD XDR decoding does not look for a tail iovec, the GETATTR is stuck at the end of the rqstp->rq_arg.pages list) - 1 page for building the reply xdr_buf But RPCSVC_MAXPAGES is already 259 (on x86_64). The problem is that svc_alloc_arg never allocates that many pages. To address this: 1. The final element of rq_pages always points to NULL. To accommodate up to 259 pages in rq_pages, add an extra element to rq_pages for the array termination sentinel. 2. Adjust the calculation of "pages" to match how RPCSVC_MAXPAGES is calculated, so it can go up to 259. Bruce noted that the calculation assumes sv_max_mesg is a multiple of PAGE_SIZE, which might not always be true. I didn't change this assumption. 3. Change the loop boundaries to allow 259 pages to be allocated. Additional clean-up: WARN_ON_ONCE adds an extra conditional branch, which is basically never taken. And there's no need to dump the stack here because svc_alloc_arg has only one caller. Keeping that NULL "array termination sentinel"; there doesn't appear to be any code that depends on it, only code in nfsd_splice_actor() which needs the 259th element to be initialized to *something*. So it's possible we could just keep the array at 259 elements and drop that final NULL, but we're being conservative for now. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | svcrdma: Don't account for Receive queue "starvation"Chuck Lever2017-06-281-15/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | >From what I can tell, calling ->recvfrom when there is no work to do is a normal part of operation. This is the only way svc_recv can tell when there is no more data ready to receive on the transport. Neither the TCP nor the UDP transport implementations have a "starve" metric. The cost of receive starvation accounting is bumping an atomic, which results in extra (IMO unnecessary) bus traffic between CPU sockets, while holding a spin lock. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | svcrdma: Improve Reply chunk sanity checkingChuck Lever2017-06-281-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Identify malformed transport headers and unsupported chunk combinations as early as possible. - Ensure that segment lengths are not crazy. - Ensure that the Reply chunk's segment count is not crazy. With a 1KB inline threshold, the largest number of Write segments that can be conveyed is about 60 (for a RDMA_NOMSG Reply message). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | svcrdma: Improve Write chunk sanity checkingChuck Lever2017-06-281-5/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Identify malformed transport headers and unsupported chunk combinations as early as possible. - Reject RPC-over-RDMA messages that contain more than one Write chunk, since this implementation does not support more than one per message. - Ensure that segment lengths are not crazy. - Ensure that the chunk's segment count is not crazy. With a 1KB inline threshold, the largest number of Write segments that can be conveyed is about 60 (for a RDMA_NOMSG Reply message). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | svcrdma: Improve Read chunk sanity checkingChuck Lever2017-06-281-18/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Identify malformed transport headers and unsupported chunk combinations as early as possible. - Reject RPC-over-RDMA messages that contain more than one Read chunk, since this implementation currently does not support more than one per RPC transaction. - Ensure that segment lengths are not crazy. - Remove the segment count check. With a 1KB inline threshold, the largest number of Read segments that can be conveyed is about 40 (for a RDMA_NOMSG Call message). This is nowhere near RPCSVC_MAXPAGES. As far as I can tell, that was just a sanity check and does not enforce an implementation limit. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | svcrdma: Remove svc_rdma_marshal.cChuck Lever2017-06-283-170/+128
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | svc_rdma_marshal.c has one remaining exported function -- svc_rdma_xdr_decode_req -- and it has a single call site. Take the same approach as the sendto path, and move this function into the source file where it is called. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | svcrdma: Avoid Send Queue overflowChuck Lever2017-06-282-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sanity case: Catch the case where more Work Requests are being posted to the Send Queue than there are Send Queue Entries. This might happen if a client sends a chunk with more segments than there are SQEs for the transport. The server can't send that reply, so the transport will deadlock unless the client drops the RPC. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | svcrdma: Squelch disconnection messagesChuck Lever2017-06-281-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The server displays "svcrdma: failed to post Send WR (-107)" in the kernel log when the client disconnects. This could flood the server's log, so remove the message. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | sunrpc: Disable splice for krb5iChuck Lever2017-06-282-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running a multi-threaded 8KB fio test (70/30 mix), three or four out of twelve of the jobs fail when using krb5i. The failure is an EIO on a read. Troubleshooting confirmed the EIO results when the client fails to verify the MIC of an NFS READ reply. Bruce suggested the problem could be due to the data payload changing between the time the reply's MIC was computed on the server and the time the reply was actually sent. krb5p gets around this problem by disabling RQ_SPLICE_OK. Use the same mechanism for krb5i RPCs. "iozone -i0 -i1 -s128m -y1k -az -I", export is tmpfs, mount is sec=krb5i,vers=3,proto=rdma. The important numbers are the read / reread column. Here's without the RQ_SPLICE_OK patch: kB reclen write rewrite read reread 131072 1 7546 7929 8396 8267 131072 2 14375 14600 15843 15639 131072 4 19280 19248 21303 21410 131072 8 32350 31772 35199 34883 131072 16 36748 37477 49365 51706 131072 32 55669 56059 57475 57389 131072 64 74599 75190 74903 75550 131072 128 99810 101446 102828 102724 131072 256 122042 122612 124806 125026 131072 512 137614 138004 141412 141267 131072 1024 146601 148774 151356 151409 131072 2048 180684 181727 293140 292840 131072 4096 206907 207658 552964 549029 131072 8192 223982 224360 454493 473469 131072 16384 228927 228390 654734 632607 And here's with it: kB reclen write rewrite read reread 131072 1 7700 7365 7958 8011 131072 2 13211 13303 14937 14414 131072 4 19001 19265 20544 20657 131072 8 30883 31097 34255 33566 131072 16 36868 34908 51499 49944 131072 32 56428 55535 58710 56952 131072 64 73507 74676 75619 74378 131072 128 100324 101442 103276 102736 131072 256 122517 122995 124639 124150 131072 512 137317 139007 140530 140830 131072 1024 146807 148923 151246 151072 131072 2048 179656 180732 292631 292034 131072 4096 206216 208583 543355 541951 131072 8192 223738 224273 494201 489372 131072 16384 229313 229840 691719 668427 I would say that there is not much difference in this test. For good measure, here's the same test with sec=krb5p: kB reclen write rewrite read reread 131072 1 5982 5881 6137 6218 131072 2 10216 10252 10850 10932 131072 4 12236 12575 15375 15526 131072 8 15461 15462 23821 22351 131072 16 25677 25811 27529 27640 131072 32 31903 32354 34063 33857 131072 64 42989 43188 45635 45561 131072 128 52848 53210 56144 56141 131072 256 59123 59214 62691 62933 131072 512 63140 63277 66887 67025 131072 1024 65255 65299 69213 69140 131072 2048 76454 76555 133767 133862 131072 4096 84726 84883 251925 250702 131072 8192 89491 89482 270821 276085 131072 16384 91572 91597 361768 336868 BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=307 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | | | Merge tag 'v4.12-rc5' into nfsd treeJ. Bruce Fields2017-06-28106-638/+938
| |\ \ \ \ \ | | |/ / / / | | | | | | | | | | | | | | | | | | Update to get f0c3192ceee3 "virtio_net: lower limit on buffer size". That bug was interfering with my nfsd testing.
| * | | | | sunrpc: mark all struct svc_version instances as constChristoph Hellwig2017-05-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | sunrpc: mark all struct svc_procinfo instances as constChristoph Hellwig2017-05-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct svc_procinfo contains function pointers, and marking it as constant avoids it being able to be used as an attach vector for code injections. Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | | | sunrpc: move pc_count out of struct svc_procinfoChristoph Hellwig2017-05-152-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pc_count is the only writeable memeber of struct svc_procinfo, which is a good candidate to be const-ified as it contains function pointers. This patch moves it into out out struct svc_procinfo, and into a separate writable array that is pointed to by struct svc_version. Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | | | sunrpc: properly type pc_encode callbacksChristoph Hellwig2017-05-151-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop the resp argument as it can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to kxdrproc_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | sunrpc: properly type pc_decode callbacksChristoph Hellwig2017-05-151-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop the argp argument as it can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to kxdrproc_t. Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | | | sunrpc: properly type pc_release callbacksChristoph Hellwig2017-05-151-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop the p and resp arguments as they are always NULL or can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to kxdrproc_t. Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | | | sunrpc: properly type pc_func callbacksChristoph Hellwig2017-05-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop the argp and resp arguments as they can trivially be derived from the rqstp argument. With that all functions now have the same prototype, and we can remove the unsafe casting to svc_procfunc as well as the svc_procfunc typedef itself. Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | | | sunrpc: mark all struct rpc_procinfo instances as constChristoph Hellwig2017-05-154-13/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct rpc_procinfo contains function pointers, and marking it as constant avoids it being able to be used as an attach vector for code injections. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | sunrpc: move p_count out of struct rpc_procinfoChristoph Hellwig2017-05-154-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | p_count is the only writeable memeber of struct rpc_procinfo, which is a good candidate to be const-ified as it contains function pointers. This patch moves it into out out struct rpc_procinfo, and into a separate writable array that is pointed to by struct rpc_version and indexed by p_statidx. Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | | | sunrpc/auth_gss: fix decoder callback prototypesChristoph Hellwig2017-05-153-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Declare the p_decode callbacks with the proper prototype instead of casting to kxdrdproc_t and losing all type safety. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | sunrpc: fix decoder callback prototypesChristoph Hellwig2017-05-151-12/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Declare the p_decode callbacks with the proper prototype instead of casting to kxdrdproc_t and losing all type safety. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com>
| * | | | | sunrpc: properly type argument to kxdrdproc_tChristoph Hellwig2017-05-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass struct rpc_request as the first argument instead of an untyped blob. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | sunrpc/auth_gss: nfsd: fix encoder callback prototypesChristoph Hellwig2017-05-153-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Declare the p_encode callbacks with the proper prototype instead of casting to kxdreproc_t and losing all type safety. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | sunrpc: fix encoder callback prototypesChristoph Hellwig2017-05-151-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Declare the p_encode callbacks with the proper prototype instead of casting to kxdreproc_t and losing all type safety. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | sunrpc: properly type argument to kxdreproc_tChristoph Hellwig2017-05-151-1/+2
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass struct rpc_request as the first argument instead of an untyped blob, and mark the data object as const. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com>
* | | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2017-07-134-13/+9
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge yet more updates from Andrew Morton: - various misc things - kexec updates - sysctl core updates - scripts/gdb udpates - checkpoint-restart updates - ipc updates - kernel/watchdog updates - Kees's "rough equivalent to the glibc _FORTIFY_SOURCE=1 feature" - "stackprotector: ascii armor the stack canary" - more MM bits - checkpatch updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits) writeback: rework wb_[dec|inc]_stat family of functions ARM: samsung: usb-ohci: move inline before return type video: fbdev: omap: move inline before return type video: fbdev: intelfb: move inline before return type USB: serial: safe_serial: move __inline__ before return type drivers: tty: serial: move inline before return type drivers: s390: move static and inline before return type x86/efi: move asmlinkage before return type sh: move inline before return type MIPS: SMP: move asmlinkage before return type m68k: coldfire: move inline before return type ia64: sn: pci: move inline before type ia64: move inline before return type FRV: tlbflush: move asmlinkage before return type CRIS: gpio: move inline before return type ARM: HP Jornada 7XX: move inline before return type ARM: KVM: move asmlinkage before type checkpatch: improve the STORAGE_CLASS test mm, migration: do not trigger OOM killer when migrating memory drm/i915: use __GFP_RETRY_MAYFAIL ...
| * | | | | mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko2017-07-123-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()Michal Hocko2017-07-121-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xt_alloc_table_info() basically opencodes kvmalloc() so use the library function instead. Link: http://lkml.kernel.org/r/20170531155145.17111-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2017-07-123-6/+7
|\ \ \ \ \ \ | |/ / / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix 64-bit division in mlx5 IPSEC offload support, from Ilan Tayari and Arnd Bergmann. 2) Fix race in statistics gathering in bnxt_en driver, from Michael Chan. 3) Can't use a mutex in RCU reader protected section on tap driver, from Cong WANG. 4) Fix mdb leak in bridging code, from Eduardo Valentin. 5) Fix free of wrong pointer variable in nfp driver, from Dan Carpenter. 6) Buffer overflow in brcmfmac driver, from Arend van SPriel. 7) ioremap_nocache() return value needs to be checked in smsc911x driver, from Alexey Khoroshilov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (34 commits) net: stmmac: revert "support future possible different internal phy mode" sfc: don't read beyond unicast address list datagram: fix kernel-doc comments socket: add documentation for missing elements smsc911x: Add check for ioremap_nocache() return code brcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx() net: hns: Bugfix for Tx timeout handling in hns driver net: ipmr: ipmr_get_table() returns NULL nfp: freeing the wrong variable mlxsw: spectrum_switchdev: Check status of memory allocation mlxsw: spectrum_switchdev: Remove unused variable mlxsw: spectrum_router: Fix use-after-free in route replace mlxsw: spectrum_router: Add missing rollback samples/bpf: fix a build issue bridge: mdb: fix leak on complete_info ptr on fail path tap: convert a mutex to a spinlock cxgb4: fix BUG() on interrupt deallocating path of ULD qed: Fix printk option passed when printing ipv6 addresses net: Fix minor code bug in timestamping.txt net: stmmac: Make 'alloc_dma_[rt]x_desc_resources()' look even closer ...
| * | | | | datagram: fix kernel-doc commentsstephen hemminger2017-07-121-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An underscore in the kernel-doc comment section has special meaning and mis-use generates an errors. ./net/core/datagram.c:207: ERROR: Unknown target name: "msg". ./net/core/datagram.c:379: ERROR: Unknown target name: "msg". ./net/core/datagram.c:816: ERROR: Unknown target name: "t". Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: ipmr: ipmr_get_table() returns NULLDan Carpenter2017-07-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ipmr_get_table() function doesn't return error pointers it returns NULL on error. Fixes: 4f75ba6982bc ("net: ipmr: Add ipmr_rtm_getroute") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | bridge: mdb: fix leak on complete_info ptr on fail pathEduardo Valentin2017-07-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently get the following kmemleak report: unreferenced object 0xffff8800039d9820 (size 32): comm "softirq", pid 0, jiffies 4295212383 (age 792.416s) hex dump (first 32 bytes): 00 0c e0 03 00 88 ff ff ff 02 00 00 00 00 00 00 ................ 00 00 00 01 ff 11 00 02 86 dd 00 00 ff ff ff ff ................ backtrace: [<ffffffff8152b4aa>] kmemleak_alloc+0x4a/0xa0 [<ffffffff811d8ec8>] kmem_cache_alloc_trace+0xb8/0x1c0 [<ffffffffa0389683>] __br_mdb_notify+0x2a3/0x300 [bridge] [<ffffffffa038a0ce>] br_mdb_notify+0x6e/0x70 [bridge] [<ffffffffa0386479>] br_multicast_add_group+0x109/0x150 [bridge] [<ffffffffa0386518>] br_ip6_multicast_add_group+0x58/0x60 [bridge] [<ffffffffa0387fb5>] br_multicast_rcv+0x1d5/0xdb0 [bridge] [<ffffffffa037d7cf>] br_handle_frame_finish+0xcf/0x510 [bridge] [<ffffffffa03a236b>] br_nf_hook_thresh.part.27+0xb/0x10 [br_netfilter] [<ffffffffa03a3738>] br_nf_hook_thresh+0x48/0xb0 [br_netfilter] [<ffffffffa03a3fb9>] br_nf_pre_routing_finish_ipv6+0x109/0x1d0 [br_netfilter] [<ffffffffa03a4400>] br_nf_pre_routing_ipv6+0xd0/0x14c [br_netfilter] [<ffffffffa03a3c27>] br_nf_pre_routing+0x197/0x3d0 [br_netfilter] [<ffffffff814a2952>] nf_iterate+0x52/0x60 [<ffffffff814a29bc>] nf_hook_slow+0x5c/0xb0 [<ffffffffa037ddf4>] br_handle_frame+0x1a4/0x2c0 [bridge] This happens when switchdev_port_obj_add() fails. This patch frees complete_info object in the fail path. Reviewed-by: Vallish Vaidyeshwara <vallish@amazon.com> Signed-off-by: Eduardo Valentin <eduval@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | Merge tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds2017-07-118-303/+1657
|\ \ \ \ \ \ | |/ / / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph updates from Ilya Dryomov: "The main item here is support for v12.y.z ("Luminous") clusters: RESEND_ON_SPLIT, RADOS_BACKOFF, OSDMAP_PG_UPMAP and CRUSH_CHOOSE_ARGS feature bits, and various other changes in the RADOS client protocol. On top of that we have a new fsc mount option to allow supplying fscache uniquifier (similar to NFS) and the usual pile of filesystem fixes from Zheng" * tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client: (44 commits) libceph: advertise support for NEW_OSDOP_ENCODING and SERVER_LUMINOUS libceph: osd_state is 32 bits wide in luminous crush: remove an obsolete comment crush: crush_init_workspace starts with struct crush_work libceph, crush: per-pool crush_choose_arg_map for crush_do_rule() crush: implement weight and id overrides for straw2 libceph: apply_upmap() libceph: compute actual pgid in ceph_pg_to_up_acting_osds() libceph: pg_upmap[_items] infrastructure libceph: ceph_decode_skip_* helpers libceph: kill __{insert,lookup,remove}_pg_mapping() libceph: introduce and switch to decode_pg_mapping() libceph: don't pass pgid by value libceph: respect RADOS_BACKOFF backoffs libceph: make DEFINE_RB_* helpers more general libceph: avoid unnecessary pi lookups in calc_target() libceph: use target pi for calc_target() calculations libceph: always populate t->target_{oid,oloc} in calc_target() libceph: make sure need_resend targets reflect latest map libceph: delete from need_resend_linger before check_linger_pool_dne() ...
| * | | | | libceph: osd_state is 32 bits wide in luminousIlya Dryomov2017-07-072-10/+21
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | crush: remove an obsolete commentIlya Dryomov2017-07-071-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reflects ceph.git commit dca1ae1e0a6b02029c3a7f9dec4114972be26d50. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | crush: crush_init_workspace starts with struct crush_workIlya Dryomov2017-07-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is not just a pointer to crush_work, it is the whole structure. That is not a problem since it only contains a pointer. But it will be a problem if new data members are added to crush_work. Reflects ceph.git commit ee957dd431bfbeb6dadaf77764db8e0757417328. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph, crush: per-pool crush_choose_arg_map for crush_do_rule()Ilya Dryomov2017-07-072-3/+200
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If there is no crush_choose_arg_map for a given pool, a NULL pointer is passed to preserve existing crush_do_rule() behavior. Reflects ceph.git commits 55fb91d64071552ea1bc65ab4ea84d3c8b73ab4b, dbe36e08be00c6519a8c89718dd47b0219c20516. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | crush: implement weight and id overrides for straw2Ilya Dryomov2017-07-072-19/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bucket_straw2_choose needs to use weights that may be different from weight_items. For instance to compensate for an uneven distribution caused by a low number of values. Or to fix the probability biais introduced by conditional probabilities (see http://tracker.ceph.com/issues/15653 for more information). We introduce a weight_set for each straw2 bucket to set the desired weight for a given item at a given position. The weight of a given item when picking the first replica (first position) may be different from the weight the second replica (second position). For instance the weight matrix for a given bucket containing items 3, 7 and 13 could be as follows: position 0 position 1 item 3 0x10000 0x100000 item 7 0x40000 0x10000 item 13 0x40000 0x10000 When crush_do_rule picks the first of two replicas (position 0), item 7, 3 are four times more likely to be choosen by bucket_straw2_choose than item 13. When choosing the second replica (position 1), item 3 is ten times more likely to be choosen than item 7, 13. By default the weight_set of each bucket exactly matches the content of item_weights for each position to ensure backward compatibility. bucket_straw2_choose compares items by using their id. The same ids are also used to index buckets and they must be unique. For each item in a bucket an array of ids can be provided for placement purposes and they are used instead of the ids. If no replacement ids are provided, the legacy behavior is preserved. Reflects ceph.git commit 19537a450fd5c5a0bb8b7830947507a76db2ceca. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: apply_upmap()Ilya Dryomov2017-07-071-2/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, pg_to_raw_osds() didn't filter for existent OSDs because raw_to_up_osds() would filter for "up" ("up" is predicated on "exists") and raw_to_up_osds() was called directly after pg_to_raw_osds(). Now, with apply_upmap() call in there, nonexistent OSDs in pg_to_raw_osds() output can affect apply_upmap(). Introduce remove_nonexistent_osds() to deal with that. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: compute actual pgid in ceph_pg_to_up_acting_osds()Ilya Dryomov2017-07-071-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move raw_pg_to_pg() call out of get_temp_osds() and into ceph_pg_to_up_acting_osds(), for upcoming apply_upmap(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: pg_upmap[_items] infrastructureIlya Dryomov2017-07-072-3/+155
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pg_temp and pg_upmap encodings are the same (PG -> array of osds), except for the incremental remove: it's an empty mapping in new_pg_temp for pg_temp and a separate old_pg_upmap set for pg_upmap. (This isn't to allow for empty pg_upmap mappings -- apparently, pg_temp just wasn't looked at as an example for pg_upmap encoding.) Reuse __decode_pg_temp() for decoding pg_upmap and new_pg_upmap. __decode_pg_temp() stores into pg_temp union member, but since pg_upmap union member is identical, reading through pg_upmap later is OK. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: ceph_decode_skip_* helpersIlya Dryomov2017-07-071-22/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some of these won't be as efficient as they could be (e.g. ceph_decode_skip_set(... 32 ...) could advance by len * sizeof(u32) once instead of advancing by sizeof(u32) len times), but that's fine and not worth a bunch of extra macro code. Replace skip_name_map() with ceph_decode_skip_map as an example. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: kill __{insert,lookup,remove}_pg_mapping()Ilya Dryomov2017-07-071-72/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Switch to DEFINE_RB_FUNCS2-generated {insert,lookup,erase}_pg_mapping(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: introduce and switch to decode_pg_mapping()Ilya Dryomov2017-07-071-67/+83
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: don't pass pgid by valueIlya Dryomov2017-07-071-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make __{lookup,remove}_pg_mapping() look like their ceph_spg_mapping counterparts: take const struct ceph_pg *. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: respect RADOS_BACKOFF backoffsIlya Dryomov2017-07-074-0/+684
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: avoid unnecessary pi lookups in calc_target()Ilya Dryomov2017-07-072-28/+34
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: use target pi for calc_target() calculationsIlya Dryomov2017-07-071-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For luminous and beyond we are encoding the actual spgid, which requires operating with the correct pg_num, i.e. that of the target pool. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: always populate t->target_{oid,oloc} in calc_target()Ilya Dryomov2017-07-071-11/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | need_check_tiering logic doesn't make a whole lot of sense. Drop it and apply tiering unconditionally on every calc_target() call instead. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | libceph: make sure need_resend targets reflect latest mapIlya Dryomov2017-07-072-9/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Otherwise we may miss events like PG splits, pool deletions, etc when we get multiple incremental maps at once. Because check_pool_dne() can now be fed an unlinked request, finish_request() needed to be taught to handle unlinked requests. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
OpenPOWER on IntegriCloud