diff options
Diffstat (limited to 'share/man/man9/zero_copy.9')
-rw-r--r-- | share/man/man9/zero_copy.9 | 171 |
1 files changed, 171 insertions, 0 deletions
diff --git a/share/man/man9/zero_copy.9 b/share/man/man9/zero_copy.9 new file mode 100644 index 0000000..aa11e8c --- /dev/null +++ b/share/man/man9/zero_copy.9 @@ -0,0 +1,171 @@ +.\" +.\" Copyright (c) 2002 Kenneth D. Merry. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions, and the following disclaimer, +.\" without modification, immediately at the beginning of the file. +.\" 2. The name of the author may not be used to endorse or promote products +.\" derived from this software without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR +.\" ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd October 23, 2012 +.Dt ZERO_COPY 9 +.Os +.Sh NAME +.Nm zero_copy , +.Nm zero_copy_sockets +.Nd "zero copy sockets code" +.Sh SYNOPSIS +.Cd "options SOCKET_SEND_COW" +.Cd "options SOCKET_RECV_PFLIP" +.Sh DESCRIPTION +The +.Fx +kernel includes a facility for eliminating data copies on +socket reads and writes. +.Pp +This code is collectively known as the zero copy sockets code, because during +normal network I/O, data will not be copied by the CPU at all. +Rather it +will be DMAed from the user's buffer to the NIC (for sends), or DMAed from +the NIC to a buffer that will then be given to the user (receives). +.Pp +The zero copy sockets code uses the standard socket read and write +semantics, and therefore has some limitations and restrictions that +programmers should be aware of when trying to take advantage of this +functionality. +.Pp +For sending data, there are no special requirements or capabilities that +the sending NIC must have. +The data written to the socket, though, must be +at least a page in size and page aligned in order to be mapped into the +kernel. +If it does not meet the page size and alignment constraints, it +will be copied into the kernel, as is normally the case with socket I/O. +.Pp +The user should be careful not to overwrite buffers that have been written +to the socket before the data has been freed by the kernel, and the +copy-on-write mapping cleared. +If a buffer is overwritten before it has +been given up by the kernel, the data will be copied, and no savings in CPU +utilization and memory bandwidth utilization will be realized. +.Pp +The +.Xr socket 2 +API does not really give the user any indication of when his data has +actually been sent over the wire, or when the data has been freed from +kernel buffers. +For protocols like TCP, the data will be kept around in +the kernel until it has been acknowledged by the other side; it must be +kept until the acknowledgement is received in case retransmission is required. +.Pp +From an application standpoint, the best way to guarantee that the data has +been sent out over the wire and freed by the kernel (for TCP-based sockets) +is to set a socket buffer size (see the +.Dv SO_SNDBUF +socket option in the +.Xr setsockopt 2 +manual page) appropriate for the application and network environment and then +make sure you have sent out twice as much data as the socket buffer size +before reusing a buffer. +For TCP, the send and receive socket buffer sizes +generally directly correspond to the TCP window size. +.Pp +For receiving data, in order to take advantage of the zero copy receive +code, the user must have a NIC that is configured for an MTU greater than +the architecture page size. +(E.g., for i386 it would be 4KB.) +Additionally, in order for zero copy receive to work, +packet payloads must be at least a page in size and page aligned. +.Pp +Achieving page aligned payloads requires a NIC that can split an incoming +packet into multiple buffers. +It also generally requires some sort of +intelligence on the NIC to make sure that the payload starts in its own +buffer. +This is called +.Dq "header splitting" . +Currently the only NICs with +support for header splitting are Alteon Tigon 2 based boards running +slightly modified firmware. +The +.Fx +.Xr ti 4 +driver includes modified firmware for Tigon 2 boards only. +Header +splitting code can be written, however, for any NIC that allows putting +received packets into multiple buffers and that has enough programmability +to determine that the header should go into one buffer and the payload into +another. +.Pp +You can also do a form of header splitting that does not require any NIC +modifications if your NIC is at least capable of splitting packets into +multiple buffers. +This requires that you optimize the NIC driver for your +most common packet header size. +If that size (ethernet + IP + TCP headers) +is generally 66 bytes, for instance, you would set the first buffer in a +set for a particular packet to be 66 bytes long, and then subsequent +buffers would be a page in size. +For packets that have headers that are +exactly 66 bytes long, your payload will be page aligned. +.Pp +The other requirement for zero copy receive to work is that the buffer that +is the destination for the data read from a socket must be at least a page +in size and page aligned. +.Pp +Obviously the requirements for receive side zero copy are impossible to +meet without NIC hardware that is programmable enough to do header +splitting of some sort. +Since most NICs are not that programmable, or their +manufacturers will not share the source code to their firmware, this approach +to zero copy receive is not widely useful. +.Pp +There are other approaches, such as RDMA and TCP Offload, that may +potentially help alleviate the CPU overhead associated with copying data +out of the kernel. +Most known techniques require some sort of support at +the NIC level to work, and describing such techniques is beyond the scope +of this manual page. +.Pp +The zero copy send and zero copy receive code can be individually turned +off via the +.Va kern.ipc.zero_copy.send +and +.Va kern.ipc.zero_copy.receive +.Nm sysctl +variables respectively. +.Sh SEE ALSO +.Xr sendfile 2 , +.Xr socket 2 , +.Xr ti 4 +.Sh HISTORY +The zero copy sockets code first appeared in +.Fx 5.0 , +although it has +been in existence in patch form since at least mid-1999. +.Sh AUTHORS +.An -nosplit +The zero copy sockets code was originally written by +.An Andrew Gallatin Aq Mt gallatin@FreeBSD.org +and substantially modified and updated by +.An Kenneth Merry Aq Mt ken@FreeBSD.org . +.Sh BUGS +The COW based send mechanism is not safe and may result in kernel crashes. |