summaryrefslogtreecommitdiffstats
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/ieee802154.txt5
-rw-r--r--Documentation/networking/ip-sysctl.txt53
-rw-r--r--Documentation/networking/ipvs-sysctl.txt7
-rw-r--r--Documentation/networking/packet_mmap.txt327
-rw-r--r--Documentation/networking/stmmac.txt45
5 files changed, 389 insertions, 48 deletions
diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt
index 703cf43..67a9cb2 100644
--- a/Documentation/networking/ieee802154.txt
+++ b/Documentation/networking/ieee802154.txt
@@ -71,8 +71,9 @@ submits skb to qdisc), so if you need something from that cb later, you should
store info in the skb->data on your own.
To hook the MLME interface you have to populate the ml_priv field of your
-net_device with a pointer to struct ieee802154_mlme_ops instance. All fields are
-required.
+net_device with a pointer to struct ieee802154_mlme_ops instance. The fields
+assoc_req, assoc_resp, disassoc_req, start_req, and scan_req are optional.
+All other fields are required.
We provide an example of simple HardMAC driver at drivers/ieee802154/fakehard.c
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index dc2dc87..f98ca63 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -29,7 +29,7 @@ route/max_size - INTEGER
neigh/default/gc_thresh1 - INTEGER
Minimum number of entries to keep. Garbage collector will not
purge entries if there are fewer than this number.
- Default: 256
+ Default: 128
neigh/default/gc_thresh3 - INTEGER
Maximum number of neighbor entries allowed. Increase this
@@ -175,14 +175,6 @@ tcp_congestion_control - STRING
is inherited.
[see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ]
-tcp_cookie_size - INTEGER
- Default size of TCP Cookie Transactions (TCPCT) option, that may be
- overridden on a per socket basis by the TCPCT socket option.
- Values greater than the maximum (16) are interpreted as the maximum.
- Values greater than zero and less than the minimum (8) are interpreted
- as the minimum. Odd values are interpreted as the next even value.
- Default: 0 (off).
-
tcp_dsack - BOOLEAN
Allows TCP to send "duplicate" SACKs.
@@ -190,7 +182,9 @@ tcp_early_retrans - INTEGER
Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
for triggering fast retransmit when the amount of outstanding data is
small and when no previously unsent data can be transmitted (such
- that limited transmit could be used).
+ that limited transmit could be used). Also controls the use of
+ Tail loss probe (TLP) that converts RTOs occuring due to tail
+ losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01).
Possible values:
0 disables ER
1 enables ER
@@ -198,7 +192,9 @@ tcp_early_retrans - INTEGER
by a fourth of RTT. This mitigates connection falsely
recovers when network has a small degree of reordering
(less than 3 packets).
- Default: 2
+ 3 enables delayed ER and TLP.
+ 4 enables TLP only.
+ Default: 3
tcp_ecn - INTEGER
Control use of Explicit Congestion Notification (ECN) by TCP.
@@ -229,36 +225,13 @@ tcp_fin_timeout - INTEGER
Default: 60 seconds
tcp_frto - INTEGER
- Enables Forward RTO-Recovery (F-RTO) defined in RFC4138.
+ Enables Forward RTO-Recovery (F-RTO) defined in RFC5682.
F-RTO is an enhanced recovery algorithm for TCP retransmission
- timeouts. It is particularly beneficial in wireless environments
- where packet loss is typically due to random radio interference
- rather than intermediate router congestion. F-RTO is sender-side
- only modification. Therefore it does not require any support from
- the peer.
-
- If set to 1, basic version is enabled. 2 enables SACK enhanced
- F-RTO if flow uses SACK. The basic version can be used also when
- SACK is in use though scenario(s) with it exists where F-RTO
- interacts badly with the packet counting of the SACK enabled TCP
- flow.
-
-tcp_frto_response - INTEGER
- When F-RTO has detected that a TCP retransmission timeout was
- spurious (i.e, the timeout would have been avoided had TCP set a
- longer retransmission timeout), TCP has several options what to do
- next. Possible values are:
- 0 Rate halving based; a smooth and conservative response,
- results in halved cwnd and ssthresh after one RTT
- 1 Very conservative response; not recommended because even
- though being valid, it interacts poorly with the rest of
- Linux TCP, halves cwnd and ssthresh immediately
- 2 Aggressive response; undoes congestion control measures
- that are now known to be unnecessary (ignoring the
- possibility of a lost retransmission that would require
- TCP to be more cautious), cwnd and ssthresh are restored
- to the values prior timeout
- Default: 0 (rate halving based)
+ timeouts. It is particularly beneficial in networks where the
+ RTT fluctuates (e.g., wireless). F-RTO is sender-side only
+ modification. It does not require any support from the peer.
+
+ By default it's enabled with a non-zero value. 0 disables F-RTO.
tcp_keepalive_time - INTEGER
How often TCP sends out keepalive messages when keepalive is enabled.
diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt
index f2a2488..9573d0c 100644
--- a/Documentation/networking/ipvs-sysctl.txt
+++ b/Documentation/networking/ipvs-sysctl.txt
@@ -15,6 +15,13 @@ amemthresh - INTEGER
enabled and the variable is automatically set to 2, otherwise
the strategy is disabled and the variable is set to 1.
+backup_only - BOOLEAN
+ 0 - disabled (default)
+ not 0 - enabled
+
+ If set, disable the director function while the server is
+ in backup mode to avoid packet loops for DR/TUN methods.
+
conntrack - BOOLEAN
0 - disabled (default)
not 0 - enabled
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 94444b1..65efb85 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -685,6 +685,333 @@ int main(int argc, char **argp)
}
-------------------------------------------------------------------------------
++ AF_PACKET TPACKET_V3 example
+-------------------------------------------------------------------------------
+
+AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
+sizes by doing it's own memory management. It is based on blocks where polling
+works on a per block basis instead of per ring as in TPACKET_V2 and predecessor.
+
+It is said that TPACKET_V3 brings the following benefits:
+ *) ~15 - 20% reduction in CPU-usage
+ *) ~20% increase in packet capture rate
+ *) ~2x increase in packet density
+ *) Port aggregation analysis
+ *) Non static frame size to capture entire packet payload
+
+So it seems to be a good candidate to be used with packet fanout.
+
+Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
+it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <assert.h>
+#include <net/if.h>
+#include <arpa/inet.h>
+#include <netdb.h>
+#include <poll.h>
+#include <unistd.h>
+#include <signal.h>
+#include <inttypes.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <linux/if_packet.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+
+#define BLOCK_SIZE (1 << 22)
+#define FRAME_SIZE 2048
+
+#define NUM_BLOCKS 64
+#define NUM_FRAMES ((BLOCK_SIZE * NUM_BLOCKS) / FRAME_SIZE)
+
+#define BLOCK_RETIRE_TOV_IN_MS 64
+#define BLOCK_PRIV_AREA_SZ 13
+
+#define ALIGN_8(x) (((x) + 8 - 1) & ~(8 - 1))
+
+#define BLOCK_STATUS(x) ((x)->h1.block_status)
+#define BLOCK_NUM_PKTS(x) ((x)->h1.num_pkts)
+#define BLOCK_O2FP(x) ((x)->h1.offset_to_first_pkt)
+#define BLOCK_LEN(x) ((x)->h1.blk_len)
+#define BLOCK_SNUM(x) ((x)->h1.seq_num)
+#define BLOCK_O2PRIV(x) ((x)->offset_to_priv)
+#define BLOCK_PRIV(x) ((void *) ((uint8_t *) (x) + BLOCK_O2PRIV(x)))
+#define BLOCK_HDR_LEN (ALIGN_8(sizeof(struct block_desc)))
+#define BLOCK_PLUS_PRIV(sz_pri) (BLOCK_HDR_LEN + ALIGN_8((sz_pri)))
+
+#ifndef likely
+# define likely(x) __builtin_expect(!!(x), 1)
+#endif
+#ifndef unlikely
+# define unlikely(x) __builtin_expect(!!(x), 0)
+#endif
+
+struct block_desc {
+ uint32_t version;
+ uint32_t offset_to_priv;
+ struct tpacket_hdr_v1 h1;
+};
+
+struct ring {
+ struct iovec *rd;
+ uint8_t *map;
+ struct tpacket_req3 req;
+};
+
+static unsigned long packets_total = 0, bytes_total = 0;
+static sig_atomic_t sigint = 0;
+
+void sighandler(int num)
+{
+ sigint = 1;
+}
+
+static int setup_socket(struct ring *ring, char *netdev)
+{
+ int err, i, fd, v = TPACKET_V3;
+ struct sockaddr_ll ll;
+
+ fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+ if (fd < 0) {
+ perror("socket");
+ exit(1);
+ }
+
+ err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
+ if (err < 0) {
+ perror("setsockopt");
+ exit(1);
+ }
+
+ memset(&ring->req, 0, sizeof(ring->req));
+ ring->req.tp_block_size = BLOCK_SIZE;
+ ring->req.tp_frame_size = FRAME_SIZE;
+ ring->req.tp_block_nr = NUM_BLOCKS;
+ ring->req.tp_frame_nr = NUM_FRAMES;
+ ring->req.tp_retire_blk_tov = BLOCK_RETIRE_TOV_IN_MS;
+ ring->req.tp_sizeof_priv = BLOCK_PRIV_AREA_SZ;
+ ring->req.tp_feature_req_word |= TP_FT_REQ_FILL_RXHASH;
+
+ err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
+ sizeof(ring->req));
+ if (err < 0) {
+ perror("setsockopt");
+ exit(1);
+ }
+
+ ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
+ fd, 0);
+ if (ring->map == MAP_FAILED) {
+ perror("mmap");
+ exit(1);
+ }
+
+ ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
+ assert(ring->rd);
+ for (i = 0; i < ring->req.tp_block_nr; ++i) {
+ ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
+ ring->rd[i].iov_len = ring->req.tp_block_size;
+ }
+
+ memset(&ll, 0, sizeof(ll));
+ ll.sll_family = PF_PACKET;
+ ll.sll_protocol = htons(ETH_P_ALL);
+ ll.sll_ifindex = if_nametoindex(netdev);
+ ll.sll_hatype = 0;
+ ll.sll_pkttype = 0;
+ ll.sll_halen = 0;
+
+ err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+ if (err < 0) {
+ perror("bind");
+ exit(1);
+ }
+
+ return fd;
+}
+
+#ifdef __checked
+static uint64_t prev_block_seq_num = 0;
+
+void assert_block_seq_num(struct block_desc *pbd)
+{
+ if (unlikely(prev_block_seq_num + 1 != BLOCK_SNUM(pbd))) {
+ printf("prev_block_seq_num:%"PRIu64", expected seq:%"PRIu64" != "
+ "actual seq:%"PRIu64"\n", prev_block_seq_num,
+ prev_block_seq_num + 1, (uint64_t) BLOCK_SNUM(pbd));
+ exit(1);
+ }
+
+ prev_block_seq_num = BLOCK_SNUM(pbd);
+}
+
+static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
+{
+ if (BLOCK_NUM_PKTS(pbd)) {
+ if (unlikely(bytes != BLOCK_LEN(pbd))) {
+ printf("block:%u with %upackets, expected len:%u != actual len:%u\n",
+ block_num, BLOCK_NUM_PKTS(pbd), bytes, BLOCK_LEN(pbd));
+ exit(1);
+ }
+ } else {
+ if (unlikely(BLOCK_LEN(pbd) != BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ))) {
+ printf("block:%u, expected len:%lu != actual len:%u\n",
+ block_num, BLOCK_HDR_LEN, BLOCK_LEN(pbd));
+ exit(1);
+ }
+ }
+}
+
+static void assert_block_header(struct block_desc *pbd, const int block_num)
+{
+ uint32_t block_status = BLOCK_STATUS(pbd);
+
+ if (unlikely((block_status & TP_STATUS_USER) == 0)) {
+ printf("block:%u, not in TP_STATUS_USER\n", block_num);
+ exit(1);
+ }
+
+ assert_block_seq_num(pbd);
+}
+#else
+static inline void assert_block_header(struct block_desc *pbd, const int block_num)
+{
+}
+static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
+{
+}
+#endif
+
+static void display(struct tpacket3_hdr *ppd)
+{
+ struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
+ struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
+
+ if (eth->h_proto == htons(ETH_P_IP)) {
+ struct sockaddr_in ss, sd;
+ char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
+
+ memset(&ss, 0, sizeof(ss));
+ ss.sin_family = PF_INET;
+ ss.sin_addr.s_addr = ip->saddr;
+ getnameinfo((struct sockaddr *) &ss, sizeof(ss),
+ sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
+
+ memset(&sd, 0, sizeof(sd));
+ sd.sin_family = PF_INET;
+ sd.sin_addr.s_addr = ip->daddr;
+ getnameinfo((struct sockaddr *) &sd, sizeof(sd),
+ dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
+
+ printf("%s -> %s, ", sbuff, dbuff);
+ }
+
+ printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
+}
+
+static void walk_block(struct block_desc *pbd, const int block_num)
+{
+ int num_pkts = BLOCK_NUM_PKTS(pbd), i;
+ unsigned long bytes = 0;
+ unsigned long bytes_with_padding = BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ);
+ struct tpacket3_hdr *ppd;
+
+ assert_block_header(pbd, block_num);
+
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + BLOCK_O2FP(pbd));
+ for (i = 0; i < num_pkts; ++i) {
+ bytes += ppd->tp_snaplen;
+ if (ppd->tp_next_offset)
+ bytes_with_padding += ppd->tp_next_offset;
+ else
+ bytes_with_padding += ALIGN_8(ppd->tp_snaplen + ppd->tp_mac);
+
+ display(ppd);
+
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + ppd->tp_next_offset);
+ __sync_synchronize();
+ }
+
+ assert_block_len(pbd, bytes_with_padding, block_num);
+
+ packets_total += num_pkts;
+ bytes_total += bytes;
+}
+
+void flush_block(struct block_desc *pbd)
+{
+ BLOCK_STATUS(pbd) = TP_STATUS_KERNEL;
+ __sync_synchronize();
+}
+
+static void teardown_socket(struct ring *ring, int fd)
+{
+ munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
+ free(ring->rd);
+ close(fd);
+}
+
+int main(int argc, char **argp)
+{
+ int fd, err;
+ socklen_t len;
+ struct ring ring;
+ struct pollfd pfd;
+ unsigned int block_num = 0;
+ struct block_desc *pbd;
+ struct tpacket_stats_v3 stats;
+
+ if (argc != 2) {
+ fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
+ return EXIT_FAILURE;
+ }
+
+ signal(SIGINT, sighandler);
+
+ memset(&ring, 0, sizeof(ring));
+ fd = setup_socket(&ring, argp[argc - 1]);
+ assert(fd > 0);
+
+ memset(&pfd, 0, sizeof(pfd));
+ pfd.fd = fd;
+ pfd.events = POLLIN | POLLERR;
+ pfd.revents = 0;
+
+ while (likely(!sigint)) {
+ pbd = (struct block_desc *) ring.rd[block_num].iov_base;
+retry_block:
+ if ((BLOCK_STATUS(pbd) & TP_STATUS_USER) == 0) {
+ poll(&pfd, 1, -1);
+ goto retry_block;
+ }
+
+ walk_block(pbd, block_num);
+ flush_block(pbd);
+ block_num = (block_num + 1) % NUM_BLOCKS;
+ }
+
+ len = sizeof(stats);
+ err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
+ if (err < 0) {
+ perror("getsockopt");
+ exit(1);
+ }
+
+ fflush(stdout);
+ printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
+ stats.tp_packets, bytes_total, stats.tp_drops,
+ stats.tp_freeze_q_cnt);
+
+ teardown_socket(&ring, fd);
+ return 0;
+}
+
+-------------------------------------------------------------------------------
+ PACKET_TIMESTAMP
-------------------------------------------------------------------------------
diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
index f9fa6db..654d2e5 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -1,6 +1,6 @@
STMicroelectronics 10/100/1000 Synopsys Ethernet driver
-Copyright (C) 2007-2010 STMicroelectronics Ltd
+Copyright (C) 2007-2013 STMicroelectronics Ltd
Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers
@@ -10,7 +10,7 @@ Currently this network device driver is for all STM embedded MAC/GMAC
(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000
FF1152AMT0221 D1215994A VIRTEX FPGA board.
-DWC Ether MAC 10/100/1000 Universal version 3.60a (and older) and DWC Ether
+DWC Ether MAC 10/100/1000 Universal version 3.70a (and older) and DWC Ether
MAC 10/100 Universal version 4.0 have been used for developing this driver.
This driver supports both the platform bus and PCI.
@@ -32,6 +32,8 @@ The kernel configuration option is STMMAC_ETH:
watchdog: transmit timeout (in milliseconds);
flow_ctrl: Flow control ability [on/off];
pause: Flow Control Pause Time;
+ eee_timer: tx EEE timer;
+ chain_mode: select chain mode instead of ring.
3) Command line options
Driver parameters can be also passed in command line by using:
@@ -164,12 +166,12 @@ Where:
o bus_setup: perform HW setup of the bus. For example, on some ST platforms
this field is used to configure the AMBA bridge to generate more
efficient STBus traffic.
- o init/exit: callbacks used for calling a custom initialisation;
+ o init/exit: callbacks used for calling a custom initialization;
this is sometime necessary on some platforms (e.g. ST boxes)
where the HW needs to have set some PIO lines or system cfg
registers.
o custom_cfg/custom_data: this is a custom configuration that can be passed
- while initialising the resources.
+ while initializing the resources.
o bsp_priv: another private poiter.
For MDIO bus The we have:
@@ -273,6 +275,8 @@ reset procedure etc).
o norm_desc.c: functions for handling normal descriptors;
o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes;
o mmc_core.c/mmc.h: Management MAC Counters;
+ o stmmac_hwtstamp.c: HW timestamp support for PTP
+ o stmmac_ptp.c: PTP 1588 clock
5) Debug Information
@@ -326,6 +330,35 @@ To enter in Tx LPI mode the driver needs to have a software timer
that enable and disable the LPI mode when there is nothing to be
transmitted.
-7) TODO:
+7) Extended descriptors
+The extended descriptors give us information about the receive Ethernet payload
+when it is carrying PTP packets or TCP/UDP/ICMP over IP.
+These are not available on GMAC Synopsys chips older than the 3.50.
+At probe time the driver will decide if these can be actually used.
+This support also is mandatory for PTPv2 because the extra descriptors 6 and 7
+are used for saving the hardware timestamps.
+
+8) Precision Time Protocol (PTP)
+The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP),
+which enables precise synchronization of clocks in measurement and
+control systems implemented with technologies such as network
+communication.
+
+In addition to the basic timestamp features mentioned in IEEE 1588-2002
+Timestamps, new GMAC cores support the advanced timestamp features.
+IEEE 1588-2008 that can be enabled when configure the Kernel.
+
+9) SGMII/RGMII supports
+New GMAC devices provide own way to manage RGMII/SGMII.
+This information is available at run-time by looking at the
+HW capability register. This means that the stmmac can manage
+auto-negotiation and link status w/o using the PHYLIB stuff
+In fact, the HW provides a subset of extended registers to
+restart the ANE, verify Full/Half duplex mode and Speed.
+Also thanks to these registers it is possible to look at the
+Auto-negotiated Link Parter Ability.
+
+10) TODO:
o XGMAC is not supported.
- o Add the PTP - precision time protocol
+ o Complete the TBI & RTBI support.
+ o extened VLAN support for 3.70a SYNP GMAC.
OpenPOWER on IntegriCloud