1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
|
.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd February 13, 2014
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.br
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.br
.Nm netmap pipes
.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
for both userspace and kernel clients.
It runs on FreeBSD and Linux,
and includes
.Nm VALE ,
a very fast and modular in-kernel software switch/dataplane,
and
.Nm netmap pipes ,
a shared memory packet transport channel.
All these are accessed interchangeably with the same API.
.Pp
.Nm , VALE
and
.Nm netmap pipes
are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes),
reaching 14.88 million packets per second (Mpps)
with much less than one core on a 10 Gbit NIC,
about 20 Mpps per core for VALE ports,
and over 100 Mpps for netmap pipes.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
Similarly,
.Nm VALE
switch instances and ports, and
.Nm netmap pipes
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
suports both non-blocking I/O through
.Xr ioctls() ,
synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2 ,
.Xr kqueue 2 .
.Nm VALE
and
.Nm netmap pipes
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers for devices without native
.Nm
support.
For best performance,
.Nm
requires explicit support in device drivers.
.Pp
In the rest of this (long) manual page we document
various aspects of the
.Nm
and
.Nm VALE
architecture, features and usage.
.Pp
.Sh ARCHITECTURE
.Nm
supports raw packet I/O through a
.Em port ,
which can be connected to a physical interface
.Em ( NIC ) ,
to the host stack,
or to a
.Nm VALE
switch).
Ports use preallocated circular queues of buffers
.Em ( rings )
residing in an mmapped region.
There is one ring for each transmit/receive queue of a
NIC or virtual port.
An additional ring pair connects to the host stack.
.Pp
After binding a file descriptor to a port, a
.Nm
client can send or receive packets in batches through
the rings, and possibly implement zero-copy forwarding
between ports.
.Pp
All NICs operating in
.Nm
mode use the same memory region,
accessible to all processes who own
.Nm /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Pp
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in section
.Xr LIBRARIES .
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap");
and then bound to a specific port with an
.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
.Pp
.Nm
has multiple modes of operation controlled by the
.Vt struct nmreq
argument.
.Va arg.nr_name
specifies the port name, as follows:
.Bl -tag -width XXXX
.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
the data path of the NIC is disconnected from the host stack,
and the file descriptor is bound to the NIC (one or all queues),
or to the host stack;
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
the file descriptor is bound to port YYY of a VALE switch called XXX,
both dynamically created if necessary.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
be the name of any existing OS network interface.
.El
.Pp
On return,
.Va arg
indicates the size of the shared memory region,
and the number, size and location of all the
.Nm
data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(0, arg.nr_memsize, fd);
.Pp
Non blocking I/O is done with special
.Xr ioctl 2
.Xr select 2
and
.Xr poll 2
on the file descriptor permit blocking I/O.
.Xr epoll 2
and
.Xr kqueue 2
are not supported on
.Nm
file descriptors.
.Pp
While a NIC is in
.Nm
mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up into a
.Nm
ring, and another ring is used to send packets into the OS network stack.
A
.Xr close 2
on the file descriptor removes the binding,
and returns the NIC to normal mode (reconnecting the data path
to the host stack), or destroys the virtual port.
.Pp
.Sh DATA STRUCTURES
The data structures in the mmapped memory region are detailed in
.Xr sys/net/netmap.h ,
which is the ultimate reference for the
.Nm
API. The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
.Bd -literal
struct netmap_if {
...
const uint32_t ni_flags; /* properties */
...
const uint32_t ni_tx_rings; /* NIC tx rings */
const uint32_t ni_rx_rings; /* NIC rx rings */
uint32_t ni_bufs_head; /* head of extra bufs list */
...
};
.Ed
.Pp
Indicates the number of available rings
.Pa ( struct netmap_rings )
and their position in the mmapped region.
The number of tx and rx rings
.Pa ( ni_tx_rings , ni_rx_rings )
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free rings,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A 0 indicates the end of the list.
.Pp
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
...
const uint32_t num_slots; /* slots in each ring */
const uint32_t nr_buf_size; /* size of each buffer */
...
uint32_t head; /* (u) first buf owned by user */
uint32_t cur; /* (u) wakeup position */
const uint32_t tail; /* (k) first buf owned by kernel */
...
uint32_t flags;
struct timeval ts; /* (k) time of last rxsync() */
...
struct netmap_slot slot[0]; /* array of slots */
}
.Ed
.Pp
Implements transmit and receive rings, with read/write
pointers, metadata and and an array of
.Pa slots
describing the buffers.
.Pp
.It Dv struct netmap_slot (one per buffer)
.Bd -literal
struct netmap_slot {
uint32_t buf_idx; /* buffer index */
uint16_t len; /* packet length */
uint16_t flags; /* buf changed, etc. */
uint64_t ptr; /* address for indirect buffers */
};
.Ed
.Pp
Describes a packet buffer, which normally is identified by
an index and resides in the mmapped region.
.It Dv packet buffers
Fixed size (normally 2 KB) packet buffers allocated by the kernel.
.El
.Pp
The offset of the
.Pa struct netmap_if
in the mmapped region is indicated by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF .
From there, all other objects are reachable through
relative references (offsets or indexes).
Macros and functions in <net/netmap_user.h>
help converting them into actual pointers:
.Pp
.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
.Pp
.Dl char *buf = NETMAP_BUF(ring, buffer_index);
.Sh RINGS, BUFFERS AND DATA I/O
.Va Rings
are circular queues of packets with three indexes/pointers
.Va ( head , cur , tail ) ;
one slot is always kept empty.
The ring size
.Va ( num_slots )
should not be assumed to be a power of two.
.br
(NOTE: older versions of netmap used head/count format to indicate
the content of a ring).
.Pp
.Va head
is the first slot available to userspace;
.br
.Va cur
is the wakeup point:
select/poll will unblock when
.Va tail
passes
.Va cur ;
.br
.Va tail
is the first slot reserved to the kernel.
.Pp
Slot indexes MUST only move forward;
for convenience, the function
.Dl nm_ring_next(ring, index)
returns the next index modulo the ring size.
.Pp
.Va head
and
.Va cur
are only modified by the user program;
.Va tail
is only modified by the kernel.
The kernel only reads/writes the
.Vt struct netmap_ring
slots and buffers
during the execution of a netmap-related system call.
The only exception are slots (and buffers) in the range
.Va tail\ . . . head-1 ,
that are explicitly assigned to the kernel.
.Pp
.Ss TRANSMIT RINGS
On transmit rings, after a
.Nm
system call, slots in the range
.Va head\ . . . tail-1
are available for transmission.
User code should fill the slots sequentially
and advance
.Va head
and
.Va cur
past slots ready to transmit.
.Va cur
may be moved further ahead if the user code needs
more slots before further transmissions (see
.Sx SCATTER GATHER I/O ) .
.Pp
At the next NIOCTXSYNC/select()/poll(),
slots up to
.Va head-1
are pushed to the port, and
.Va tail
may advance if further slots have become available.
Below is an example of the evolution of a TX ring:
.Pp
.Bd -literal
after the syscall, slots between cur and tail are (a)vailable
head=cur tail
| |
v v
TX [.....aaaaaaaaaaa.............]
user creates new packets to (T)ransmit
head=cur tail
| |
v v
TX [.....TTTTTaaaaaa.............]
NIOCTXSYNC/poll()/select() sends packets and reports new slots
head=cur tail
| |
v v
TX [..........aaaaaaaaaaa........]
.Ed
.Pp
select() and poll() wlll block if there is no space in the ring, i.e.
.Dl ring->cur == ring->tail
and return when new slots have become available.
.Pp
High speed applications may want to amortize the cost of system calls
by preparing as many packets as possible before issuing them.
.Pp
A transmit ring with pending transmissions has
.Dl ring->head != ring->tail + 1 (modulo the ring size).
The function
.Va int nm_tx_pending(ring)
implements this test.
.Pp
.Ss RECEIVE RINGS
On receive rings, after a
.Nm
system call, the slots in the range
.Va head\& . . . tail-1
contain received packets.
User code should process them and advance
.Va head
and
.Va cur
past slots it wants to return to the kernel.
.Va cur
may be moved further ahead if the user code wants to
wait for more packets
without returning all the previous slots to the kernel.
.Pp
At the next NIOCRXSYNC/select()/poll(),
slots up to
.Va head-1
are returned to the kernel for further receives, and
.Va tail
may advance to report new incoming packets.
.br
Below is an example of the evolution of an RX ring:
.Bd -literal
after the syscall, there are some (h)eld and some (R)eceived slots
head cur tail
| | |
v v v
RX [..hhhhhhRRRRRRRR..........]
user advances head and cur, releasing some slots and holding others
head cur tail
| | |
v v v
RX [..*****hhhRRRRRR...........]
NICRXSYNC/poll()/select() recovers slots and reports new packets
head cur tail
| | |
v v v
RX [.......hhhRRRRRRRRRRRR....]
.Ed
.Pp
.Sh SLOTS AND PACKET BUFFERS
Normally, packets should be stored in the netmap-allocated buffers
assigned to slots when ports are bound to a file descriptor.
One packet is fully contained in a single buffer.
.Pp
The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
it MUST be used when the buf_idx in the slot is changed.
This can be used to implement
zero-copy forwarding, see
.Sx ZERO-COPY FORWARDING .
.Pp
.It NS_REPORT
reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely. This flag helps detecting
when packets have been send and a file descriptor can be closed.
.It NS_FORWARD
When a ring is in 'transparent' mode (see
.Sx TRANSPARENT MODE ) ,
packets marked with this flags are forwarded to the other endpoint
at the next system call, thus restoring (in a selective way)
the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet must not be used in the learning bridge code.
.It NS_INDIRECT
indicates that the packet's payload is in a user-supplied buffer,
whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.br
This is only supported on the transmit ring of
.Nm VALE
ports, and it helps reducing data copies in the interconnection
of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
.El
.Sh SCATTER GATHER I/O
Packets can span multiple slots if the
.Va NS_MOREFRAG
flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
This is normally used with
.Nm VALE
ports when connecting virtual machines, as they generate large
TSO segments that are not split unless they reach a physical device.
.Pp
NOTE: The length field always refers to the individual
fragment; there is no place with the total length of a packet.
.Pp
On receive rings the macro
.Va NS_RFRAGS(slot)
indicates the remaining number of slots for this packet,
including the current one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
.Sh IOCTLS
.Nm
uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
for non-blocking I/O. They take no argument.
Two more ioctls (NIOCGINFO, NIOCREGIF) are used
to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
char nr_name[IFNAMSIZ]; /* (i) port name */
uint32_t nr_version; /* (i) API version */
uint32_t nr_offset; /* (o) nifp offset in mmap region */
uint32_t nr_memsize; /* (o) size of the mmap region */
uint32_t nr_tx_slots; /* (i/o) slots in tx rings */
uint32_t nr_rx_slots; /* (i/o) slots in rx rings */
uint16_t nr_tx_rings; /* (i/o) number of tx rings */
uint16_t nr_rx_rings; /* (i/o) number of tx rings */
uint16_t nr_ringid; /* (i/o) ring(s) we care about */
uint16_t nr_cmd; /* (i) special command */
uint16_t nr_arg1; /* (i/o) extra arguments */
uint16_t nr_arg2; /* (i/o) extra arguments */
uint32_t nr_arg3; /* (i/o) extra arguments */
uint32_t nr_flags /* (i/o) open mode */
...
};
.Ed
.Pp
A file descriptor obtained through
.Pa /dev/netmap
also supports the ioctl supported by network devices, see
.Xr netintro 4 .
.Pp
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Bl -tag -width XX
.It Pa nr_memsize
indicates the size of the
.Nm
memory region. NICs in
.Nm
mode all share the same memory region,
whereas
.Nm VALE
ports have independent regions for each port.
.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Xr ethtool
).
.El
.It Dv NIOCREGIF
binds the port named in
.Va nr_name
to the file descriptor. For a physical device this also switches it into
.Nm
mode, disconnecting
it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe share the same memory space of the parent port,
and is meant to enable configuration where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
.Va nr_ringid
selects which rings are controlled through this file descriptor.
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
indicated below) can use to indicate the specific set of rings.
In the example below, "netmap:foo" is any valid netmap port name.
.Pp
.Bl -tag -width XXXXX
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW_NIC "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NR_RING_NIC_SW "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe must be thought as part of the pipe name,
and does not need to be sequential. On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of i.
.El
.Pp
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Va NETMAP_NO_TX_SYNC
to the value written to
.Va nr_ringid.
When this feature is used,
packets are transmitted only on
.Va ioctl(NIOCTXSYNC)
or select()/poll() are called with a write event (POLLOUT/wfdset) or a full ring.
.Pp
When registering a virtual interface that is dynamically created to a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT, POLL, EPOLL, KQUEUE.
.Xr select 2
and
.Xr poll 2
on a
.Nm
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events. Passing the NETMAP_NO_TX_SYNC flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested. Passing the NETMAP_DO_RX_SYNC flag to
.Em NIOCREGIF updates receive rings even without read events.
Note that on epoll and kqueue, NETMAP_NO_TX_SYNC and NETMAP_DO_RX_SYNC
only have an effect when some event is posted for the file descriptor.
.Sh LIBRARIES
The
.Nm
API is supposed to be used directly, both because of its simplicity and
for efficient integration with applications.
.Pp
For conveniency, the
.Va <net/netmap_user.h>
header provides a few macros and functions to ease creating
a file descriptor and doing I/O with a
.Nm
port. These are loosely modeled after the
.Xr pcap 3
API, to ease porting of libpcap-based applications to
.Nm .
To use these extra functions, programs should
.Dl #define NETMAP_WITH_LIBS
before
.Dl #include <net/netmap_user.h>
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open ,
binds a file descriptor to a port.
.Bl -tag -width XX
.It Va ifname
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nm_flags and nm_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g. from a previously
open file descriptor) that should override the defaults.
The fields are used as described below
.It Va flags
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP (if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3 (uses the fields from arg);
.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet is successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet
.Pp
.El
.Sh SUPPORTED DEVICES
.Nm
natively supports the following devices:
.Pp
On FreeBSD:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Pp
On Linux
.Xr e1000 4 ,
.Xr e1000e 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr mlx4 4 ,
.Xr forcedeth 4 ,
.Xr r8169 4 .
.Pp
NICs without native support can still be used in
.Nm
mode through emulation. Performance is inferior to native netmap
mode but still significantly higher than sockets, and approaching
that of in-kernel solutions such as Linux's
.Xr pktgen .
.Pp
Emulation is also available for devices with native netmap support,
which can be used for testing or performance comparison.
The sysctl variable
.Va dev.netmap.admode
globally controls how netmap mode is implemented.
.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspect of the operation of
.Nm
are controlled through sysctl variables on FreeBSD
.Em ( dev.netmap.* )
and module parameters on Linux
.Em ( /sys/module/netmap_lin/parameters/* ) :
.Pp
.Bl -tag -width indent
.It Va dev.netmap.admode: 0
Controls the use of native or emulated adapter mode.
0 uses the best available option, 1 forces native and
fails if not available, 2 forces emulated hence never fails.
.It Va dev.netmap.generic_ringsize: 1024
Ring size used for emulated netmap mode
.It Va dev.netmap.generic_mit: 100000
Controls interrupt moderation for emulated mode
.It Va dev.netmap.mmap_unreg: 0
.It Va dev.netmap.fwd: 0
Forces NS_FORWARD mode
.It Va dev.netmap.flags: 0
.It Va dev.netmap.txsync_retry: 2
.It Va dev.netmap.no_pendintr: 1
Forces recovery of transmit buffers on system calls
.It Va dev.netmap.mitigate: 1
Propagates interrupt mitigation to user processes
.It Va dev.netmap.no_timestamp: 0
Disables the update of the timestamp in the netmap ring
.It Va dev.netmap.verbose: 0
Verbose kernel messages
.It Va dev.netmap.buf_num: 163840
.It Va dev.netmap.buf_size: 2048
.It Va dev.netmap.ring_num: 200
.It Va dev.netmap.ring_size: 36864
.It Va dev.netmap.if_num: 100
.It Va dev.netmap.if_size: 1024
Sizes and number of objects (netmap_if, netmap_ring, buffers)
for the global memory region. The only parameter worth modifying is
.Va dev.netmap.buf_num
as it impacts the total amount of memory used by netmap.
.It Va dev.netmap.buf_curr_num: 0
.It Va dev.netmap.buf_curr_size: 0
.It Va dev.netmap.ring_curr_num: 0
.It Va dev.netmap.ring_curr_size: 0
.It Va dev.netmap.if_curr_num: 0
.It Va dev.netmap.if_curr_size: 0
Actual values in use.
.It Va dev.netmap.bridge_batch: 1024
Batch size used when moving packets across a
.Nm VALE
switch. Values above 64 generally guarantee good
performance.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll
and
.Xr kqueue
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Xr ioctl 2
is used to configure ports and
.Nm VALE switches .
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh CAVEATS
No matter how fast the CPU and OS are,
achieving line rate on 10G and faster interfaces
requires hardware with sufficient performance.
Several NICs are unable to sustain line rate with
small packet sizes. Insufficient PCIe or memory bandwidth
can also cause reduced performance.
.Pp
Another frequent reason for low performance is the use
of flow control on the link: a slow receiver can limit
the transmit speed.
Be sure to disable flow control when running high
speed experiments.
.Pp
.Ss SPECIAL NIC FEATURES
.Nm
is orthogonal to some NIC features such as
multiqueue, schedulers, packet filters.
.Pp
Multiple transmit and receive rings are supported natively
and can be configured with ordinary OS tools,
such as
.Xr ethtool
or
device-specific sysctl variables.
The same goes for Receive Packet Steering (RPS)
and filtering of incoming traffic.
.Pp
.Nm
.Em does not use
features such as
.Em checksum offloading , TCP segmentation offloading ,
.Em encryption , VLAN encapsulation/decapsulation ,
etc. .
When using netmap to exchange packets with the host stack,
make sure to disable these features.
.Sh EXAMPLES
.Ss TEST PROGRAMS
.Nm
comes with a few programs that can be used for testing or
simple applications.
See the
.Va examples/
directory in
.Nm
distributions, or
.Va tools/tools/netmap/
directory in FreeBSD distributions.
.Pp
.Xr pkt-gen
is a general purpose traffic source/sink.
.Pp
As an example
.Dl pkt-gen -i ix0 -f tx -l 60
can generate an infinite stream of minimum size packets, and
.Dl pkt-gen -i ix0 -f rx
is a traffic sink.
Both print traffic statistics, to help monitor
how the system performs.
.Pp
.Xr pkt-gen
has many options can be uses to set packet sizes, addresses,
rates, and use multiple send/receive threads and cores.
.Pp
.Xr bridge
is another test program which interconnects two
.Nm
ports. It can be used for transparent forwarding between
interfaces, as in
.Dl bridge -i ix0 -i ix1
or even connect the NIC to the host stack using netmap
.Dl bridge -i ix0 -i ix0
.Ss USING THE NATIVE API
The following code implements a traffic generator
.Pp
.Bd -literal -compact
#include <net/netmap_user.h>
...
void sender(void)
{
struct netmap_if *nifp;
struct netmap_ring *ring;
struct nmreq nmr;
struct pollfd fds;
fd = open("/dev/netmap", O_RDWR);
bzero(&nmr, sizeof(nmr));
strcpy(nmr.nr_name, "ix0");
nmr.nm_version = NETMAP_API;
ioctl(fd, NIOCREGIF, &nmr);
p = mmap(0, nmr.nr_memsize, fd);
nifp = NETMAP_IF(p, nmr.nr_offset);
ring = NETMAP_TXRING(nifp, 0);
fds.fd = fd;
fds.events = POLLOUT;
for (;;) {
poll(&fds, 1, -1);
while (!nm_ring_empty(ring)) {
i = ring->cur;
buf = NETMAP_BUF(ring, ring->slot[i].buf_index);
... prepare packet in buf ...
ring->slot[i].len = ... packet length ...
ring->head = ring->cur = nm_ring_next(ring, i);
}
}
}
.Ed
.Ss HELPER FUNCTIONS
A simple receiver can be implemented using the helper functions
.Bd -literal -compact
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
...
void receiver(void)
{
struct nm_desc *d;
struct pollfd fds;
u_char *buf;
struct nm_pkthdr h;
...
d = nm_open("netmap:ix0", NULL, 0, 0);
fds.fd = NETMAP_FD(d);
fds.events = POLLIN;
for (;;) {
poll(&fds, 1, -1);
while ( (buf = nm_nextpkt(d, &h)) )
consume_pkt(buf, h->len);
}
nm_close(d);
}
.Ed
.Ss ZERO-COPY FORWARDING
Since physical interfaces share the same memory region,
it is possible to do packet forwarding between ports
swapping buffers. The buffer from the transmit ring is used
to replenish the receive ring:
.Bd -literal -compact
uint32_t tmp;
struct netmap_slot *src, *dst;
...
src = &src_ring->slot[rxr->cur];
dst = &dst_ring->slot[txr->cur];
tmp = dst->buf_idx;
dst->buf_idx = src->buf_idx;
dst->len = src->len;
dst->flags = NS_BUF_CHANGED;
src->buf_idx = tmp;
src->flags = NS_BUF_CHANGED;
rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
txr->head = txr->cur = nm_ring_next(txr, txr->cur);
...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API (e.g. with
.Dl nm_open("netmap:eth0^", ... ) ;
All packets that the host would send to an interface in
.Nm
mode end up into the RX ring, whereas all packets queued to the
TX ring are send up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.
.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
.Dl vale-ctl -h vale2:em0
Other
.Nm
clients attached to the same switch can now communicate
with the network card or the host.
.Pp
.Sh SEE ALSO
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
.Nm
framework has been originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
|