summaryrefslogtreecommitdiffstats
path: root/lib
diff options
context:
space:
mode:
authorglebius <glebius@FreeBSD.org>2016-01-08 20:34:57 +0000
committerglebius <glebius@FreeBSD.org>2016-01-08 20:34:57 +0000
commitaaa09777e1d9fde5591814af536025be01a0182f (patch)
treeffa24924b5fa8fa8a6a9433a56b73c595e5d33b6 /lib
parentad8ebc35bb92f3a458fcde1049b19656ef5e614d (diff)
downloadFreeBSD-src-aaa09777e1d9fde5591814af536025be01a0182f.zip
FreeBSD-src-aaa09777e1d9fde5591814af536025be01a0182f.tar.gz
New sendfile(2) syscall. A joint effort of NGINX and Netflix from 2013 and
up to now. The new sendfile is the code that Netflix uses to send their multiple tens of gigabits of data per second. The new implementation features asynchronous I/O, when I/O operations are launched, but not awaited to be complete. An explanation of why such behavior is beneficial compared to old one is going to be too long for a commit message, so we will skip it here. Additional features of new syscall are extra flags, which provide an application more control over data sent. The SF_NOCACHE flag tells kernel that data shouldn't be cached after it was sent. The SF_READAHEAD() macro allows to specify readahead size in pages. The new syscalls is a drop in replacement. No modifications are required to applications. One can take nginx binary for stable/10 and run it successfully on head. Although SF_NODISKIO lost its original sense, as now sendfile doesn't block, and now means something completely different (tm), using the new sendfile the old way is absolutely safe. Celebrates: Netflix global launch! Sponsored by: Nginx, Inc. Sponsored by: Netflix Relnotes: yes
Diffstat (limited to 'lib')
-rw-r--r--lib/libc/sys/sendfile.2123
1 files changed, 91 insertions, 32 deletions
diff --git a/lib/libc/sys/sendfile.2 b/lib/libc/sys/sendfile.2
index b363382..2b52dd9 100644
--- a/lib/libc/sys/sendfile.2
+++ b/lib/libc/sys/sendfile.2
@@ -25,7 +25,7 @@
.\"
.\" $FreeBSD$
.\"
-.Dd January 7, 2010
+.Dd January 7, 2016
.Dt SENDFILE 2
.Os
.Sh NAME
@@ -46,7 +46,7 @@
The
.Fn sendfile
system call
-sends a regular file specified by descriptor
+sends a regular file or shared memory object specified by descriptor
.Fa fd
out a stream socket specified by descriptor
.Fa s .
@@ -101,32 +101,55 @@ the system will write the total number of bytes sent on the socket to the
variable pointed to by
.Fa sbytes .
.Pp
-The
+The least significant 16 bits of
.Fa flags
argument is a bitmap of these values:
-.Bl -item -offset indent
-.It
-.Dv SF_NODISKIO .
-This flag causes any
-.Fn sendfile
-call which would block on disk I/O to instead
-return
-.Er EBUSY .
-Busy servers may benefit by transferring requests that would
-block to a separate I/O worker thread.
-.It
-.Dv SF_MNOWAIT .
-Do not wait for some kernel resource to become available,
-in particular,
-.Vt mbuf
-and
-.Vt sf_buf .
-The flag does not make the
-.Fn sendfile
-syscall truly non-blocking, since other resources are still allocated
-in a blocking fashion.
-.It
-.Dv SF_SYNC .
+.Bl -tag -offset indent
+.It Dv SF_NODISKIO
+This flag causes
+.Nm
+to return
+.Er EBUSY
+instead of blocking when a busy page is encountered.
+This rare situation can happen if some other process is now working
+with the same region of the file.
+It is advised to retry the operation after a short period.
+.Pp
+Note that in older
+.Fx
+versions the
+.Dv SF_NODISKIO
+had slightly different notion.
+The flag prevented
+.Nm
+to run I/O operations in case if an invalid (not cached) page is encountered,
+thus avoiding blocking on I/O.
+Starting with
+.Fx 11
+.Nm
+sending files off the
+.Xr ffs 7
+filesystem doesn't block on I/O
+(see
+.Sx IMPLEMENTATION NOTES
+), so the condition no longer applies.
+However, it is safe if an application utilizes
+.Dv SF_NODISKIO
+and on
+.Er EBUSY
+performs the same action as it did in
+older
+.Fx
+versions, e.g.
+.Xr aio_read 2,
+.Xr read 2
+or
+.Nm
+in a different context.
+.It Dv SF_NOCACHE
+The data sent to socket will not be cached by the virtual memory system,
+and will be freed directly to the pool of free pages.
+.It Dv SF_SYNC
.Nm
sleeps until the network stack no longer references the VM pages
of the file, making subsequent modifications to it safe.
@@ -134,6 +157,22 @@ Please note that this is not a guarantee that the data has actually
been sent.
.El
.Pp
+The most significant 16 bits of
+.Fa flags
+specify amount of pages that
+.Nm
+may read ahead when reading the file.
+A macro
+.Fn SF_FLAGS
+is provided to combine readahead amount and flags.
+Example shows specifing readahead of 16 pages and
+.Dv SF_NOCACHE
+flag:
+.Pp
+.Bd -literal -offset indent -compact
+ SF_FLAGS(16, SF_NOCACHE)
+.Ed
+.Pp
When using a socket marked for non-blocking I/O,
.Fn sendfile
may send fewer bytes than requested.
@@ -149,6 +188,18 @@ The
.Fx
implementation of
.Fn sendfile
+doesn't block on disk I/O when it sends a file off the
+.Xr ffs 7
+filesystem.
+The syscall returns success before the actual I/O completes, and data
+is put into the socket later unattended.
+However, the order of data in the socket is preserved, so it is safe
+to do further writes to the socket.
+.Pp
+The
+.Fx
+implementation of
+.Fn sendfile
is "zero-copy", meaning that it has been optimized so that copying of the file data is avoided.
.Sh TUNING
On some architectures, this system call internally uses a special
@@ -232,12 +283,10 @@ The
argument
is not a valid socket descriptor.
.It Bq Er EBUSY
-Completing the entire transfer would have required disk I/O, so
-it was aborted.
-Partial data may have been sent.
-(This error can only occur when
+A busy page was encountered and
.Dv SF_NODISKIO
-is specified.)
+had been specified.
+Partial data may have been sent.
.It Bq Er EFAULT
An invalid address was specified for an argument.
.It Bq Er EINTR
@@ -310,9 +359,19 @@ first appeared in
.Fx 3.0 .
This manual page first appeared in
.Fx 3.1 .
+In
+.Fx 10
+support for sending shared memory descriptors had been introduced.
+In
+.Fx 11
+a non-blocking implementation had been introduced.
.Sh AUTHORS
-The
+The initial implementation of
.Fn sendfile
system call
and this manual page were written by
.An David G. Lawrence Aq Mt dg@dglawrence.com .
+The
+.Fx 11
+implementation was written by
+.An Gleb Smirnoff Aq Mt glebius@FreeBSD.org .
OpenPOWER on IntegriCloud