author luigi <luigi@FreeBSD.org> 2014-02-15 08:23:31 +0000
committer luigi <luigi@FreeBSD.org> 2014-02-15 08:23:31 +0000
commit 0cf286382f1b96d8b03564ebe5e49a6e13dbb060
tree 0b67d58ec487f7b187ad6c2a6c552c549c5fcaeb /share
parent c11e4e28da4667555d0133cbb5d237f1b1711bb2
complete svn 261909 - new netmap version.
since i updated the manpage i might as well commit it.
MFC after: 3 days
Diffstat (limited to 'share')
-rw-r--r-- share/man/man4/netmap.4 283
1 files changed, 193 insertions, 90 deletions
diff --git a/share/man/man4/netmap.4 b/share/man/man4/netmap.4
index 523d8dd..1b2dc7a 100644
--- a/share/man/man4/netmap.4
+++ b/share/man/man4/netmap.4
@@ -27,7 +27,7 @@
.\"
.\" $FreeBSD$
.\"
-.Dd January 4, 2014
+.Dd February 13, 2014
.Dt NETMAP 4
.Os
.Sh NAME
@@ -36,6 +36,9 @@
.br
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
+.br
+.Nm netmap pipes
+.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
@@ -45,38 +48,55 @@ for both userspace and kernel clients.
It runs on FreeBSD and Linux,
and includes
.Nm VALE ,
-a very fast and modular in-kernel software switch/dataplane.
+a very fast and modular in-kernel software switch/dataplane,
+and
+.Nm netmap pipes ,
+a shared memory packet transport channel.
+All these are accessed interchangeably with the same API.
.Pp
-.Nm
+.Nm , VALE
and
-.Nm VALE
-are one order of magnitude faster than sockets, bpf or
-native switches based on
-.Xr tun/tap 4 ,
-reaching 14.88 Mpps with much less than one core on a 10 Gbit NIC,
-and 20 Mpps per core for VALE ports.
+.Nm netmap pipes
+are at least one order of magnitude faster than
+standard OS mechanisms
+(sockets, bpf, tun/tap interfaces, native switches, pipes),
+reaching 14.88 million packets per second (Mpps)
+with much less than one core on a 10 Gbit NIC,
+about 20 Mpps per core for VALE ports,
+and over 100 Mpps for netmap pipes.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
-A selectable file descriptor supports
-synchronization and blocking I/O.
-.Pp
Similarly,
.Nm VALE
-can dynamically create switch instances and ports,
+switch instances and ports, and
+.Nm netmap pipes
+can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
-For best performance,
.Nm
-requires explicit support in device drivers;
-however, the
+supports both non-blocking I/O through
+.Xr ioctl 2 ,
+synchronization and blocking I/O through a file descriptor
+and standard OS mechanisms such as
+.Xr select 2 ,
+.Xr poll 2 ,
+.Xr epoll 2 ,
+.Xr kqueue 2 .
+.Nm VALE
+and
+.Nm netmap pipes
+are implemented by a single kernel module, which also emulates the
+.Nm
+API over standard drivers for devices without native
+.Nm
+support.
+For best performance,
.Nm
-API can be emulated on top of unmodified device drivers,
-at the price of reduced performance
-(but still better than sockets or BPF/pcap).
+requires explicit support in device drivers.
.Pp
In the rest of this (long) manual page we document
various aspects of the
@@ -114,10 +134,26 @@ mode use the same memory region,
accessible to all processes who own
.Nm /dev/netmap
file descriptors bound to NICs.
+Independent
.Nm VALE
-ports instead use separate memory regions.
+and
+.Nm netmap pipe
+ports
+by default use separate memory regions,
+but can be independently configured to share memory.
.Pp
.Sh ENTERING AND EXITING NETMAP MODE
+The following section describes the system calls to create
+and control
+.Nm netmap
+ports (including
+.Nm VALE
+and
+.Nm netmap pipe
+ports).
+Simpler, higher level functions are described in section
+.Sx LIBRARIES .
+.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap");
@@ -186,12 +222,11 @@ API. The main structures and fields are indicated below:
.Bd -literal
struct netmap_if {
...
- const uint32_t ni_flags; /* properties */
+ const uint32_t ni_flags; /* properties */
...
- const uint32_t ni_tx_rings; /* NIC tx rings */
- const uint32_t ni_rx_rings; /* NIC rx rings */
- const uint32_t ni_extra_tx_rings; /* extra tx rings */
- const uint32_t ni_extra_rx_rings; /* extra rx rings */
+ const uint32_t ni_tx_rings; /* NIC tx rings */
+ const uint32_t ni_rx_rings; /* NIC rx rings */
+ uint32_t ni_bufs_head; /* head of extra bufs list */
...
};
.Ed
@@ -204,11 +239,14 @@ The number of tx and rx rings
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
-can request additional tx/rx rings,
-to be used between multiple processes/threads
-accessing the same
-.Nm
-port.
+can also request additional unbound buffers in the same memory space,
+to be used as temporary storage for packets.
+.Pa ni_bufs_head
+contains the index of the first of these free buffers,
+which are connected in a list (the first uint32_t of each
+buffer being the index of the next buffer in the list).
+A 0 indicates the end of the list.
+.Pp
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
@@ -221,9 +259,9 @@ struct netmap_ring {
const uint32_t tail; /* (k) first buf owned by kernel */
...
uint32_t flags;
- struct timeval ts; /* (k) time of last rxsync() */
+ struct timeval ts; /* (k) time of last rxsync() */
...
- struct netmap_slot slot[0]; /* array of slots */
+ struct netmap_slot slot[0]; /* array of slots */
}
.Ed
.Pp
@@ -482,14 +520,16 @@ struct nmreq {
uint32_t nr_version; /* (i) API version */
uint32_t nr_offset; /* (o) nifp offset in mmap region */
uint32_t nr_memsize; /* (o) size of the mmap region */
- uint32_t nr_tx_slots; /* (o) slots in tx rings */
- uint32_t nr_rx_slots; /* (o) slots in rx rings */
- uint16_t nr_tx_rings; /* (o) number of tx rings */
- uint16_t nr_rx_rings; /* (o) number of tx rings */
- uint16_t nr_ringid; /* (i) ring(s) we care about */
+ uint32_t nr_tx_slots; /* (i/o) slots in tx rings */
+ uint32_t nr_rx_slots; /* (i/o) slots in rx rings */
+ uint16_t nr_tx_rings; /* (i/o) number of tx rings */
+ uint16_t nr_rx_rings; /* (i/o) number of rx rings */
+ uint16_t nr_ringid; /* (i/o) ring(s) we care about */
uint16_t nr_cmd; /* (i) special command */
- uint16_t nr_arg1; /* (i) extra arguments */
- uint16_t nr_arg2; /* (i) extra arguments */
+ uint16_t nr_arg1; /* (i/o) extra arguments */
+ uint16_t nr_arg2; /* (i/o) extra arguments */
+ uint32_t nr_arg3; /* (i/o) extra arguments */
+ uint32_t nr_flags; /* (i/o) open mode */
...
};
.Ed
@@ -537,20 +577,59 @@ it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
-On return, it gives the same info as NIOCGINFO, and nr_ringid
-indicates the identity of the rings controlled through the file
+.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a
+.Em netmap pipe ,
+consisting of two netmap ports with a crossover connection.
+A netmap pipe shares the same memory space as the parent port,
+and is meant to enable configurations where a master process acts
+as a dispatcher towards slave processes.
+.Pp
+To enable this function, the
+.Pa nr_arg1
+field of the structure can be used as a hint to the kernel to
+indicate how many pipes we expect to use, and reserve extra space
+in the memory region.
+.Pp
+On return, it gives the same info as NIOCGINFO,
+with
+.Pa nr_ringid
+and
+.Pa nr_flags
+indicating the identity of the rings controlled through the file
descriptor.
.Pp
+.Va nr_flags
.Va nr_ringid
selects which rings are controlled through this file descriptor.
-Possible values are:
+Possible values of
+.Pa nr_flags
+are indicated below, together with the naming schemes
+that application libraries (such as the
+.Nm nm_open
+indicated below) can use to indicate the specific set of rings.
+In the example below, "netmap:foo" is any valid netmap port name.
+.Pp
.Bl -tag -width XXXXX
-.It 0
-(default) all hardware rings
-.It NETMAP_SW_RING
+.It NR_REG_ALL_NIC "netmap:foo"
+(default) all hardware ring pairs
+.It NR_REG_SW "netmap:foo^"
the ``host rings'', connecting to the host stack.
-.It NETMAP_HW_RING | i
-the i-th hardware ring .
+.It NR_REG_NIC_SW "netmap:foo+"
+all hardware rings and the host rings
+.It NR_REG_ONE_NIC "netmap:foo-i"
+only the i-th hardware ring pair, where the number is in
+.Pa nr_ringid ;
+.It NR_REG_PIPE_MASTER "netmap:foo{i"
+the master side of the netmap pipe whose identifier (i) is in
+.Pa nr_ringid ;
+.It NR_REG_PIPE_SLAVE "netmap:foo}i"
+the slave side of the netmap pipe whose identifier (i) is in
+.Pa nr_ringid .
+.Pp
+The identifier of a pipe must be thought of as part of the pipe name,
+and does not need to be sequential. On return the pipe
+will only have a single ring pair with index 0,
+irrespective of the value of i.
.El
.Pp
By default, a
@@ -579,7 +658,7 @@ number of slots available for transmission.
tells the hardware of consumed packets, and asks for newly available
packets.
.El
-.Sh SELECT AND POLL
+.Sh SELECT, POLL, EPOLL, KQUEUE.
.Xr select 2
and
.Xr poll 2
@@ -588,16 +667,26 @@ on a
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
-.Sx RECEIVE RINGS
-when write (POLLOUT) and read (POLLIN) events are requested.
-.Pp
-Both block if no slots are available in the ring (
-.Va ring->cur == ring->tail )
+.Sx RECEIVE RINGS ,
+respectively when write (POLLOUT) and read (POLLIN) events are requested.
+Both block if no slots are available in the ring
+.Va ( ring->cur == ring->tail ) .
+Depending on the platform,
+.Xr epoll 2
+and
+.Xr kqueue 2
+are supported too.
.Pp
-Packets in transmit rings are normally pushed out even without
+Packets in transmit rings are normally pushed out
+(and buffers reclaimed) even without
requesting write events. Passing the NETMAP_NO_TX_SYNC flag to
.Em NIOCREGIF
disables this feature.
+By default, receive rings are processed only if read
+events are requested. Passing the NETMAP_DO_RX_SYNC flag to
+.Em NIOCREGIF updates receive rings even without read events.
+Note that on epoll and kqueue, NETMAP_NO_TX_SYNC and NETMAP_DO_RX_SYNC
+only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
@@ -620,7 +709,7 @@ before
.Pp
The following functions are available:
.Bl -tag -width XXXXX
-.It Va struct nm_desc_t * nm_open(const char *ifname, const char *ring_name, int flags, int ring_flags)
+.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open ,
binds a file descriptor to a port.
@@ -629,26 +718,36 @@ binds a file descriptor to a port.
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
+.It Va req
+provides the initial values for the argument to the NIOCREGIF ioctl.
+The nr_flags and nr_ringid values are overwritten by parsing
+ifname and flags, and other fields can be overridden through
+the other two arguments.
+.It Va arg
+points to a struct nm_desc containing arguments (e.g. from a previously
+open file descriptor) that should override the defaults.
+The fields are used as described below.
.It Va flags
-can be set to
-.Va NETMAP_SW_RING
-to bind to the host ring pair,
-or to NETMAP_HW_RING to bind to a specific ring.
-.Va ring_name
-with NETMAP_HW_RING,
-is interpreted as a string or an integer indicating the ring to use.
-.It Va ring_flags
-is copied directly into the ring flags, to specify additional parameters
-such as NR_TIMESTAMP or NR_FORWARD.
+can be set to a combination of the following flags:
+.Va NETMAP_NO_TX_POLL ,
+.Va NETMAP_DO_RX_POLL
+(copied into nr_ringid);
+.Va NM_OPEN_NO_MMAP (if arg points to the same memory region,
+avoids the mmap and uses the values from it);
+.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg);
+.Va NM_OPEN_ARG1 ,
+.Va NM_OPEN_ARG2 ,
+.Va NM_OPEN_ARG3 (uses the fields from arg);
+.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg).
.El
-.It Va int nm_close(struct nm_desc_t *d)
+.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
-.It Va int nm_inject(struct nm_desc_t *d, const void *buf, size_t size)
+.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet is successful, or 0 on error;
-.It Va int nm_dispatch(struct nm_desc_t *d, int cnt, nm_cb_t cb, u_char *arg)
+.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets
-.It Va u_char * nm_nextpkt(struct nm_desc_t *d, struct nm_hdr_t *hdr)
+.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet
.Pp
.El
@@ -740,9 +839,11 @@ performance.
.Sh SYSTEM CALLS
.Nm
uses
-.Xr select 2
+.Xr select 2 ,
+.Xr poll 2 ,
+.Xr epoll 2
and
-.Xr poll 2
+.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
@@ -872,10 +973,10 @@ A simple receiver can be implemented using the helper functions
...
void receiver(void)
{
- struct nm_desc_t *d;
+ struct nm_desc *d;
struct pollfd fds;
u_char *buf;
- struct nm_hdr_t h;
+ struct nm_pkthdr h;
...
d = nm_open("netmap:ix0", NULL, 0, 0);
fds.fd = NETMAP_FD(d);
@@ -910,6 +1011,13 @@ to replenish the receive ring:
...
.Ed
.Ss ACCESSING THE HOST STACK
+The host stack is for all practical purposes just a regular ring pair,
+which you can access with the netmap API (e.g. with
+.Dl nm_open("netmap:eth0^", ... ) ;
+All packets that the host would send to an interface in
+.Nm
+mode end up in the RX ring, whereas all packets queued to the
+TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
@@ -917,6 +1025,10 @@ switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
+The same example can be used to test netmap pipes, by simply
+changing port names, e.g.
+.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
+.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
@@ -935,6 +1047,14 @@ Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
+.Pp
+Luigi Rizzo, Giuseppe Lettieri,
+VALE, a switched ethernet for virtual machines,
+ACM CoNEXT'12, December 2012, Nice
+.Pp
+Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
+Speeding up packet I/O in virtual machines,
+ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
@@ -953,20 +1073,3 @@ and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
-.Pp
-.Ss SPECIAL MODES
-When the device name has the form
-.Dl valeXXX:ifname (ifname is an existing interface)
-the physical interface
-(and optionally the corrisponding host stack endpoint)
-are connected or disconnected from the
-.Nm VALE
-switch named XXX.
-In this case the
-.Pa ioctl()
-is only used only for configuration, typically through the
-.Xr vale-ctl
-command.
-The file descriptor cannot be used for I/O, and should be
-closed after issuing the
-.Pa ioctl() .
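The port naming scheme this commit documents for NIOCREGIF (a "netmap:foo" name optionally followed by `^`, `+`, `-i`, `{i` or `}i`) can be illustrated with a small standalone parser. This is only a sketch under stated assumptions: `parse_port_name`, `struct port_spec` and the `MODE_*` constants are hypothetical names invented for this example. They are not part of netmap or of nm_open(), and the real library may parse names differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-ins for the ring selection modes listed in the man page.
 * Names and values are illustrative only, not those of net/netmap.h. */
enum port_mode {
    MODE_ALL_NIC,     /* "netmap:foo"   all hardware ring pairs    */
    MODE_SW,          /* "netmap:foo^"  host rings only            */
    MODE_NIC_SW,      /* "netmap:foo+"  hardware and host rings    */
    MODE_ONE_NIC,     /* "netmap:foo-i" i-th hardware ring pair    */
    MODE_PIPE_MASTER, /* "netmap:foo{i" master side of pipe i      */
    MODE_PIPE_SLAVE   /* "netmap:foo}i" slave side of pipe i       */
};

/* Hypothetical result structure for this sketch. */
struct port_spec {
    char ifname[64];      /* port name without prefix and suffix */
    enum port_mode mode;  /* which set of rings was requested    */
    unsigned ringid;      /* ring index or pipe identifier       */
};

/* Parse "netmap:NAME[suffix]" into *out.
 * Returns 0 on success, -1 if the name is malformed. */
int parse_port_name(const char *name, struct port_spec *out)
{
    static const char prefix[] = "netmap:";
    const char *suffix;
    char *end;
    size_t len;

    assert(name != NULL && out != NULL);
    if (strncmp(name, prefix, sizeof(prefix) - 1) != 0)
        return -1;
    name += sizeof(prefix) - 1;

    /* The port name runs up to the first suffix character. */
    len = strcspn(name, "^+-{}");
    if (len == 0 || len >= sizeof(out->ifname))
        return -1;
    memcpy(out->ifname, name, len);
    out->ifname[len] = '\0';
    out->ringid = 0;

    suffix = name + len;
    switch (*suffix) {
    case '\0':
        out->mode = MODE_ALL_NIC;
        return 0;
    case '^':
        out->mode = MODE_SW;
        return suffix[1] == '\0' ? 0 : -1;
    case '+':
        out->mode = MODE_NIC_SW;
        return suffix[1] == '\0' ? 0 : -1;
    case '-':
        out->mode = MODE_ONE_NIC;
        break;
    case '{':
        out->mode = MODE_PIPE_MASTER;
        break;
    case '}':
        out->mode = MODE_PIPE_SLAVE;
        break;
    default:
        return -1;
    }

    /* The remaining modes need a decimal ring or pipe identifier. */
    if (!isdigit((unsigned char)suffix[1]))
        return -1;
    out->ringid = (unsigned)strtoul(suffix + 1, &end, 10);
    return *end == '\0' ? 0 : -1;
}
```

With this sketch, "netmap:foo{3" and "netmap:foo}3" yield the two ends of the same pipe: both carry identifier 3 and differ only in mode. A deliberate simplification is that the first `-` always starts a ring suffix, so port names that themselves contain a hyphen are not handled.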