| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* limit size of buffers to RPC_MAXDATASIZE
* don't leak memory
* be more picky about bad parameters
From:
https://raw.githubusercontent.com/guidovranken/rpcbomb/master/libtirpc_patch.txt
https://github.com/guidovranken/rpcbomb/blob/master/rpcbind_patch.txt
via NetBSD.
Approved by: re (kib)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix the client side krpc from doing TCP reconnects for ERESTART from sosend().
When sosend() replies ERESTART in the client side krpc, it indicates that
the RPC message hasn't yet been sent and that the send queue is full or
locked while a signal is posted for the process.
Without this patch, this would result in a RPC_CANTSEND reply from
clnt_vc_call(), which would cause clnt_reconnect_call() to create a new
TCP transport connection. For most NFS servers, this wasn't a serious problem,
although it did imply retries of outstanding RPCs, which could possibly
have missed the DRC.
For an NFSv4.1 mount to AmazonEFS, this caused a serious problem, since
AmazonEFS often didn't retain the NFSv4.1 session and would reply with
NFS4ERR_BAD_SESSION. This implies to the client a crash/reboot which
requires open/lock state recovery.
Three options were considered to fix this:
- Return the ERESTART all the way up to the system call boundary and then
have the system call redone. This is fraught with risk, due to convoluted
code paths, asynchronous I/O RPCs etc. cperciva@ worked on this, but it
is still a work in prgress and may not be feasible.
- Set SB_NOINTR for the socket buffer. This fixes the problem, but makes
the sosend() completely non interruptible, which kib@ considered
inappropriate. It also would break forced dismount when a thread
was blocked in sosend().
- Modify the retry loop in clnt_vc_call(), so that it loops for this case
for up to 15sec. Testing showed that the sosend() usually succeeded by
the 2nd retry. The extreme case observed was 111 loop iterations, or
about 100msec of delay.
This third alternative is what is implemented in this patch, since the
change is:
- localized
- straightforward
- forced dismount is not broken by it.
This patch has been tested by cperciva@ extensively against AmazonEFS.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix a crash during unmount of an NFSv4.1 mount.
Larry Rosenman reported a crash on freebsd-current@ which was caused by
a premature release of the krpc backchannel socket structure.
I believe this was caused by a race between the SVC_RELEASE() in clnt_vc.c
and the xprt_unregister() in the higher layer (clnt_rc.c), which tried
to lock the mutex in the xprt structure and crashed.
This patch fixes this by removing the xprt_unregister() in the clnt_vc
layer and allowing this to always be done by the clnt_rc (higher reconnect
layer).
|
|
|
|
|
|
|
| |
PR: 204340
Reported by: Panzura
Approved by: rmacklem
Obtained from: rmacklem
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Provide the getboottime(9) and getboottimebin(9) KPI.
MFC r303387:
Prevent parallel tc_windup() calls. Keep boottime in timehands,
and adjust it from tc_windup().
MFC notes:
The boottime and boottimebin globals are still exported from
the kernel dyn symbol table in stable/11, but their declarations are
removed from sys/time.h. This preserves KBI but not KPI, while all
in-tree consumers are converted to getboottime().
The variables are updated after tc_setclock_mtx is dropped, which gives
approximately same unlocked bugs as before.
The boottime and boottimebin locals in several sys/kern_tc.c functions
were renamed by adding the '_x' suffix to avoid name conficts.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Approved by: re (gjb)
r302550:
Deobfuscate cleanup path in clnt_dg_create(..)
Similar to r300836 and r301800, cl and cu will always be non-NULL as they're
allocated using the mem_alloc routines, which always use
`malloc(..., M_WAITOK)`.
Deobfuscating the cleanup path fixes a leak where if cl was NULL and
cu was not, cu would not be free'd, and also removes a duplicate test for
cl not being NULL.
CID: 1007033, 1007344
r302551:
Deobfuscate cleanup path in clnt_vc_create(..)
Similar to r300836, r301800, and r302550, cl and ct will always
be non-NULL as they're allocated using the mem_alloc routines,
which always use `malloc(..., M_WAITOK)`.
CID: 1007342
r302552:
Convert `svc_xprt_alloc(..)` and `svc_xprt_free(..)`'s prototypes to
ANSI C style prototypes
r302553:
Don't test for xpt not being NULL before calling svc_xprt_free(..)
svc_xprt_alloc(..) will always return initialized memory as it uses
mem_alloc(..) under the covers, which uses malloc(.., M_WAITOK, ..).
CID: 1007341
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Similar to r300836, cl and ct will always be non-NULL as they're allocated
using the mem_alloc routines, which always use `malloc(..., M_WAITOK)`.
Deobfuscating the cleanup path fixes a leak where if cl was NULL and
ct was not, ct would not be free'd, and also removes a duplicate test for
cl not being NULL.
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D6801
MFC after: 1 week
Reported by: Coverity
CID: 1229999
Reviewed by: cem
Sponsored by: EMC / Isilon Storage Division
|
|
|
|
| |
Submitted by: Sebastian Huber <sebastian dot huber at embedded-brains dot de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Both cd and xprt will be non-NULL after their respective malloc(9) wrappers are
called (mem_alloc and svc_xprt_alloc, which calls mem_alloc) as mem_alloc
always gets called with M_WAITOK|M_ZERO today. Thus, testing for them being
non-NULL is incorrect -- it misleads Coverity and it misleads the reader.
Remove some unnecessary NULL initializations as a follow up to help solidify
the fact that these pointers will be initialized properly in sys/rpc/.. with
the interfaces the way they are currently.
Differential Revision: https://reviews.freebsd.org/D6572
MFC after: 2 weeks
Reported by: Coverity
CID: 1007338, 1007339, 1007340
Reviewed by: markj, truckman
Sponsored by: EMC / Isilon Storage Division
|
|
|
|
|
|
|
|
|
| |
The mem_alloc macro calls calloc (userspace) / malloc(.., M_WAITOK|M_ZERO)
under the covers, so zeroing out memory is already handled by the underlying
calls
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
|
|
|
|
| |
No functional change.
|
|
|
|
| |
No functional change.
|
|
|
|
|
|
|
|
|
| |
'buf.value' was previously treated as a nul-terminated string, but only
allocated with strlen() space. Rectify this.
Reported by: Coverity
CID: 1007639
Sponsored by: EMC / Isilon Storage Division
|
|
|
|
|
|
| |
These are mostly cosmetical, no functional change.
Found with devel/coccinelle.
|
|
|
|
| |
Found with devel/coccinelle.
|
|
|
|
|
| |
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
|
|
|
|
|
| |
Submitted by: pfg
MFC after: 1 week
|
|
|
|
| |
via sys/mbuf.h
|
|
|
|
| |
MFC after: 1 week
|
|
|
|
|
|
|
|
| |
PR: 202659
Submitted by: matthew.l.dailey@dartmouth.edu
Reviewed by: rmacklem dfr
MFC after: 1 week
Sponsored by: iXsystems
|
|
|
|
|
|
| |
Reviewed by: melifaro
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3398
|
|
|
|
|
|
|
|
|
|
| |
sosend(). The only release on the xp_snt_cnt is done after sosend(),
with an intent to synchronize with load_acq in svc_vc_ack().
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
|
|
|
|
|
|
|
| |
Limits of 5 connections set long ago creates problems for SPEC benchmark.
Make the NFS follow system-wide maximum.
MFC after: 1 week
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
implementation.
The kernel RPC code, which is responsible for the low-level scheduling
of incoming NFS requests, contains a throttling mechanism that
prevents too much kernel memory from being tied up by NFS requests
that are being serviced. When the throttle is engaged, the RPC layer
stops servicing incoming NFS sockets, resulting ultimately in
backpressure on the clients (if they're using TCP). However, this is
a very heavy-handed mechanism as it prevents all clients from making
any requests, regardless of how heavy or light they are. (Thus, when
engaged, the throttle often prevents clients from even mounting the
filesystem.) The throttle mechanism applies specifically to requests
that have been received by the RPC layer (from a TCP or UDP socket)
and are queued waiting to be serviced by one of the nfsd threads; it
does not limit the amount of backlog in the socket buffers.
The original implementation limited the total bytes of queued requests
to the minimum of a quarter of (nmbclusters * MCLBYTES) and 45 MiB.
The former limit seems reasonable, since requests queued in the socket
buffers and replies being constructed to the requests in progress will
all require some amount of network memory, but the 45 MiB limit is
plainly ridiculous for modern memory sizes: when running 256 service
threads on a busy server, 45 MiB would result in just a single
maximum-sized NFS3PROC_WRITE queued per thread before throttling.
Removing this limit exposed integer-overflow bugs in the original
computation, and related bugs in the routines that actually account
for the amount of traffic enqueued for service threads. The old
implementation also attempted to reduce accounting overhead by
batching updates until each queue is fully drained, but this is prone
to livelock, resulting in repeated accumulate-throttle-drain cycles on
a busy server. Various data types are changed to long or unsigned
long; explicit 64-bit types are not used due to the unavailability of
64-bit atomics on many 32-bit platforms, but those platforms also
cannot support nmbclusters large enough to cause overflow.
This code (in a 10.1 kernel) is presently running on production NFS
servers at CSAIL.
Summary of this revision:
* Removes 45 MiB limit on requests queued for nfsd service threads
* Fixes integer-overflow and signedness bugs
* Avoids unnecessary throttling by not deferring accounting for
completed requests
Differential Revision: https://reviews.freebsd.org/D2165
Reviewed by: rmacklem, mav
MFC after: 30 days
Relnotes: yes
Sponsored by: MIT Computer Science & Artificial Intelligence Laboratory
|
|
|
|
|
|
|
|
| |
Initialize *xprt to avoid exposing a random value
in cleanup_svc_vc_create.
This is the kernel counterpart of r278041.
CID: 1007340
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
feature is to quisce the system before suspend.
Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC. SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) caller to operate on
other process. Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.
Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume. It
is used for debugging; should be removed after the real use of the
interface is added.
In collaboration with: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
|
|
|
|
|
|
|
|
|
| |
This is not correct at least for the stop requests. Check for stop
conditions and suspend threads if requested.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
|
|
|
|
|
|
|
|
|
|
|
|
| |
sb_cc member of struct sockbuf to a couple of inline functions:
sbavail() and sbused()
Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
|
|
|
|
|
|
|
|
|
| |
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.
MFC after: 1 month
|
|
|
|
| |
MFC after: 2 weeks
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Old design with unified thread pool was good from the point of thread
utilization. But single pool-wide mutex became huge congestion point
for systems with many CPUs. To reduce the congestion create several
thread groups within a pool (one group for every 6 CPUs and 12 threads),
each group with own mutex. Each connection during its registration is
assigned to one of the groups in round-robin fashion. File affinify
code may still move requests between the groups, but otherwise groups
are self-contained.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
|
|
|
|
| |
MFC after: 2 weeks
|
|
|
|
|
|
|
| |
This allows to slightly simplify svc_run_internal() code: if we processed
all the requests in a queue, then we know that new one will not appear.
MFC after: 2 weeks
|
|
|
|
|
|
| |
CID: 1007032
Found with: Coverity Prevent(tm)
MFC after: 2 weeks
|
|
|
|
| |
MFC after: 3 days
|
|
|
|
| |
additions from r260229 and the SVCPOOL type doesn't exist in userland.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
New algorithm does not create additional lock congestion, while some races
it includes should not be a problem. Those races may keep requests in DRC
cache for some more time by returning ACK position smaller then actual,
but it still should be able to drop thems when proper ACK finally read.
Races of the original algorithm based on TCP seq number were worse because
they happened when reply sequence number were recorded. After that even
correctly read ACKs could not clean DRC sometimes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Introduce additional hash to group requests by hash of sockref. This
allows to process TCP acknowledgements without looping though all the cache,
and as result allows to do it every time.
- Indroduce additional callbacks to notify application layer about sockets
disconnection. Without this last few requests processed just before socket
disconnection never processed their ACKs and stuck in cache for many hours.
- Implement transport-specific method for tracking reply acknowledgements.
New implementation does not cross multiple stack layers to get the data and
does not have race conditions that previously made some requests stuck
in cache. This could be done more efficiently at sockbuf layer, but that
would broke some KBIs, while I don't know other consumers for it aside NFS.
- Instead of traversing all DRC twice per request, run cleaning only once
per request, and except in some conditions traverse only single hash slot
at a time.
Together this limits NFS DRC growth only to situations of real connectivity
problems. If network is working well, and so all replies are acknowledged,
cache remains almost empty even after hours of heavy load. Without this
change on the same test cache was growing to many thousand requests even
with perfectly working local network.
As another result this reduces CPU time spent on the DRC handling during
SPEC NFS benchmark from about 10% to 0.5%.
Sponsored by: iXsystems, Inc.
|
|
|
|
|
|
| |
global RPC thread pool lock and protect it with own set of locks.
On synthetic benchmarks this improves peak NFS request rate by 40%.
|
|
|
|
|
| |
is assigned to thread. For example, withing receive handlers. In that
case the function reduces to single assignment and can avoid locking.
|
|
|
|
| |
data than we need. This reduces lock pressure from xprt_active() side.
|
|
|
|
|
|
|
| |
(Note the #if 0 part has been inactive since the initial commit,
r177633, so maybe it should be removed altogether).
MFC after: 3 days
|
|
|
|
|
|
| |
been used since the initial commit (r177633).
MFC after: 3 days
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
Do not insert active ports into pool->sp_active list if they are success-
fully assigned to some thread. This makes that list include only ports that
really require attention, and so traversal can be reduced to simple taking
the first one.
Remove idle thread from pool->sp_idlethreads list when assigning some
work (port of requests) to it. That again makes possible to replace list
traversals with simple taking the first element.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When processing receive buffer, write the amount of data, expected
in present request record, into socket's so_rcv.sb_lowat to make stack
aware about our needs. When processing following upcalls, ignore them
until socket collect enough data to be read and processed in one turn.
This change reduces number of context switches and other operations
in RPC stack during large NFS writes (especially via non-Jumbo networks)
by order of magnitude.
After precessing current packet, take another look into the pending
buffer to find out whether the next packet had been already received.
If not, deactivate this port right there without making RPC code to
push this port to another thread just to find that there is nothing.
If the next packet is received partially, also deactivate the port, but
also update socket's so_rcv.sb_lowat to not be woken up prematurely.
This change additionally reduces number of context switches per NFS
request about in half.
|
|
|
|
|
|
| |
3-clause BSD license as specified by Oracle America, Inc. in 2010.
This license change was approved by Wim Coekaerts, Senior Vice
President, Linux and Virtualization at Oracle Corporation.
|
|
|
|
| |
with the explicit permission of Sun Microsystems in 2009.
|