author     Linus Torvalds <torvalds@linux-foundation.org>  2014-08-06 09:38:14 -0700
committer  Linus Torvalds <torvalds@linux-foundation.org>  2014-08-06 09:38:14 -0700
commit     ae045e2455429c418a418a3376301a9e5753a0a8
tree       b445bdeecd3f38aa0d0a29c9585cee49e4ccb0f1 /Documentation/networking
parent     f4f142ed4ef835709c7e6d12eaca10d190bcebed
parent     d247b6ab3ce6dd43665780865ec5fa145d9ab6bd
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

  1) Steady transitioning of the BPF infrastructure to a generic spot so
     all kernel subsystems can make use of it, from Alexei Starovoitov.

  2) SFC driver supports busy polling, from Alexandre Rames.

  3) Take advantage of hash table in UDP multicast delivery, from David
     Held.

  4) Lighten locking, in particular by getting rid of the LRU lists, in
     inet frag handling.  From Florian Westphal.

  5) Add support for various RFC6458 control messages in SCTP, from
     Geir Ola Vaagland.

  6) Allow to filter bridge forwarding database dumps by device, from
     Jamal Hadi Salim.

  7) virtio-net also now supports busy polling, from Jason Wang.

  8) Some low level optimization tweaks in pktgen from Jesper Dangaard
     Brouer.

  9) Add support for ipv6 address generation modes, so that userland
     can have some input into the process.  From Jiri Pirko.

  10) Consolidate common TCP connection request code in ipv4 and ipv6,
      from Octavian Purdila.

  11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

  12) Generic resizable RCU hash table, with initial users in netlink
      and nftables.  From Thomas Graf.

  13) Maintain a name assignment type so that userspace can see where a
      network device name came from (enumerated by kernel, assigned
      explicitly by userspace, etc.)  From Tom Gundersen.

  14) Automatic flow label generation on transmit in ipv6, from Tom
      Herbert.

  15) New packet timestamping facilities from Willem de Bruijn, meant
      to assist in measuring latencies going into/out-of the packet
      scheduler, latency from TCP data transmission to ACK, etc"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
  cxgb4 : Disable recursive mailbox commands when enabling vi
  net: reduce USB network driver config options.
  tg3: Modify tg3_tso_bug() to handle multiple TX rings
  amd-xgbe: Perform phy connect/disconnect at dev open/stop
  amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
  net: sun4i-emac: fix memory leak on bad packet
  sctp: fix possible seqlock seadlock in sctp_packet_transmit()
  Revert "net: phy: Set the driver when registering an MDIO bus device"
  cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
  team: Simplify return path of team_newlink
  bridge: Update outdated comment on promiscuous mode
  net-timestamp: ACK timestamp for bytestreams
  net-timestamp: TCP timestamping
  net-timestamp: SCHED timestamp on entering packet scheduler
  net-timestamp: add key to disambiguate concurrent datagrams
  net-timestamp: move timestamp flags out of sk_flags
  net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
  cxgb4i : Move stray CPL definitions to cxgb4 driver
  tcp: reduce spurious retransmits due to transient SACK reneging
  qlcnic: Initialize dcbnl_ops before register_netdev
  ...
Diffstat (limited to 'Documentation/networking')
-rw-r--r--  Documentation/networking/bonding.txt                   | 31
-rw-r--r--  Documentation/networking/filter.txt                    | 12
-rw-r--r--  Documentation/networking/i40e.txt                      |  7
-rw-r--r--  Documentation/networking/ip-sysctl.txt                 | 38
-rw-r--r--  Documentation/networking/packet_mmap.txt               | 18
-rw-r--r--  Documentation/networking/phy.txt                       | 18
-rw-r--r--  Documentation/networking/pktgen.txt                    | 28
-rw-r--r--  Documentation/networking/timestamping.txt              | 16
-rw-r--r--  Documentation/networking/timestamping/timestamping.c   |  7
9 files changed, 112 insertions, 63 deletions
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 9c723ec..eeb5b2e 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -542,10 +542,10 @@ mode
XOR policy: Transmit based on the selected transmit
hash policy. The default policy is a simple [(source
- MAC address XOR'd with destination MAC address) modulo
- slave count]. Alternate transmit policies may be
- selected via the xmit_hash_policy option, described
- below.
+ MAC address XOR'd with destination MAC address XOR
+ packet type ID) modulo slave count]. Alternate transmit
+ policies may be selected via the xmit_hash_policy option,
+ described below.
This mode provides load balancing and fault tolerance.
@@ -801,10 +801,11 @@ xmit_hash_policy
layer2
- Uses XOR of hardware MAC addresses to generate the
- hash. The formula is
+ Uses XOR of hardware MAC addresses and packet type ID
+ field to generate the hash. The formula is
- (source MAC XOR destination MAC) modulo slave count
+ hash = source MAC XOR destination MAC XOR packet type ID
+ slave number = hash modulo slave count
This algorithm will place all traffic to a particular
network peer on the same slave.
@@ -819,7 +820,7 @@ xmit_hash_policy
Uses XOR of hardware MAC addresses and IP addresses to
generate the hash. The formula is
- hash = source MAC XOR destination MAC
+ hash = source MAC XOR destination MAC XOR packet type ID
hash = hash XOR source IP XOR destination IP
hash = hash XOR (hash RSHIFT 16)
hash = hash XOR (hash RSHIFT 8)
@@ -2301,13 +2302,13 @@ broadcast: Like active-backup, there is not much advantage to this
bandwidth.
Additionally, the linux bonding 802.3ad implementation
- distributes traffic by peer (using an XOR of MAC addresses),
- so in a "gatewayed" configuration, all outgoing traffic will
- generally use the same device. Incoming traffic may also end
- up on a single device, but that is dependent upon the
- balancing policy of the peer's 8023.ad implementation. In a
- "local" configuration, traffic will be distributed across the
- devices in the bond.
+ distributes traffic by peer (using an XOR of MAC addresses
+ and packet type ID), so in a "gatewayed" configuration, all
+ outgoing traffic will generally use the same device. Incoming
+ traffic may also end up on a single device, but that is
+ dependent upon the balancing policy of the peer's 802.3ad
+ implementation. In a "local" configuration, traffic will be
+ distributed across the devices in the bond.
Finally, the 802.3ad mode mandates the use of the MII monitor,
therefore, the ARP monitor is not available in this mode.
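
The layer2 and layer2+3 formulas in the bonding.txt hunks above can be sketched in C as follows. This is a minimal illustration, not the bonding driver's code: the function names are hypothetical, and two details are assumptions of the sketch, namely that only the last byte of each MAC address is folded into the hash and that the packet type ID is passed as a plain 16-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: hash = source MAC XOR destination MAC XOR packet type ID.
 * Assumption: only the last byte of each MAC address takes part in the XOR. */
static uint32_t layer2_hash(const uint8_t src[6], const uint8_t dst[6],
                            uint16_t ptype)
{
    return (uint32_t)(src[5] ^ dst[5] ^ ptype);
}

/* slave number = hash modulo slave count */
static unsigned int layer2_slave(const uint8_t src[6], const uint8_t dst[6],
                                 uint16_t ptype, unsigned int slave_count)
{
    assert(slave_count > 0); /* a bond with no slaves cannot transmit */
    return layer2_hash(src, dst, ptype) % slave_count;
}

/* layer2+3: start from the layer2 hash, mix in the IP addresses, then fold
 * the upper bits down as in the formula lines of the hunk above. */
static uint32_t layer23_hash(const uint8_t src[6], const uint8_t dst[6],
                             uint16_t ptype, uint32_t src_ip, uint32_t dst_ip)
{
    uint32_t hash = layer2_hash(src, dst, ptype);

    hash ^= src_ip ^ dst_ip;
    hash ^= hash >> 16;
    hash ^= hash >> 8;
    return hash;
}
```

Because the packet type ID is now part of the hash, two frames between the same pair of MAC addresses can map to different slaves when their packet types differ, which is the behavioral change these hunks document.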
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index ee78eba..c48a970 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -586,12 +586,12 @@ team driver's classifier for its load-balancing mode, netfilter's xt_bpf
extension, PTP dissector/classifier, and much more. They are all internally
converted by the kernel into the new instruction set representation and run
in the eBPF interpreter. For in-kernel handlers, this all works transparently
-by using sk_unattached_filter_create() for setting up the filter, resp.
-sk_unattached_filter_destroy() for destroying it. The macro
-SK_RUN_FILTER(filter, ctx) transparently invokes eBPF interpreter or JITed
-code to run the filter. 'filter' is a pointer to struct sk_filter that we
-got from sk_unattached_filter_create(), and 'ctx' the given context (e.g.
-skb pointer). All constraints and restrictions from sk_chk_filter() apply
+by using bpf_prog_create() for setting up the filter, resp.
+bpf_prog_destroy() for destroying it. The macro
+BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed
+code to run the filter. 'filter' is a pointer to struct bpf_prog that we
+got from bpf_prog_create(), and 'ctx' the given context (e.g.
+skb pointer). All constraints and restrictions from bpf_check_classic() apply
before a conversion to the new layout is being done behind the scenes!
Currently, the classic BPF format is being used for JITing on most of the
diff --git a/Documentation/networking/i40e.txt b/Documentation/networking/i40e.txt
index f737273..a251bf4 100644
--- a/Documentation/networking/i40e.txt
+++ b/Documentation/networking/i40e.txt
@@ -69,8 +69,11 @@ Additional Configurations
FCoE
----
- Fiber Channel over Ethernet (FCoE) hardware offload is not currently
- supported.
+ The driver supports Fiber Channel over Ethernet (FCoE) and Data Center
+ Bridging (DCB) functionality. Configuring DCB and FCoE is outside the scope
+ of this driver doc. Refer to http://www.open-fcoe.org/ for FCoE project
+ information and http://www.open-lldp.org/ or email list
+ e1000-eedc@lists.sourceforge.net for DCB information.
MAC and VLAN anti-spoofing feature
----------------------------------
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ab42c95..29a9351 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -101,19 +101,17 @@ ipfrag_high_thresh - INTEGER
Maximum memory used to reassemble IP fragments. When
ipfrag_high_thresh bytes of memory is allocated for this purpose,
the fragment handler will toss packets until ipfrag_low_thresh
- is reached.
+ is reached. This also serves as a maximum limit to namespaces
+ different from the initial one.
ipfrag_low_thresh - INTEGER
- See ipfrag_high_thresh
+ Maximum memory used to reassemble IP fragments before the kernel
+ begins to remove incomplete fragment queues to free up resources.
+ The kernel still accepts new fragments for defragmentation.
ipfrag_time - INTEGER
Time in seconds to keep an IP fragment in memory.
-ipfrag_secret_interval - INTEGER
- Regeneration interval (in seconds) of the hash secret (or lifetime
- for the hash secret) for IP fragments.
- Default: 600
-
ipfrag_max_dist - INTEGER
ipfrag_max_dist is a non-negative integer value which defines the
maximum "disorder" which is allowed among fragments which share a
@@ -1132,6 +1130,15 @@ flowlabel_consistency - BOOLEAN
FALSE: disabled
Default: TRUE
+auto_flowlabels - BOOLEAN
+ Automatically generate flow labels based on a flow hash
+ of the packet. This allows intermediate devices, such as routers,
+ to identify packet flows for mechanisms like Equal Cost Multipath
+ Routing (see RFC 6438).
+ TRUE: enabled
+ FALSE: disabled
+ Default: false
+
anycast_src_echo_reply - BOOLEAN
Controls the use of anycast addresses as source addresses for ICMPv6
echo reply
@@ -1153,11 +1160,6 @@ ip6frag_low_thresh - INTEGER
ip6frag_time - INTEGER
Time in seconds to keep an IPv6 fragment in memory.
-ip6frag_secret_interval - INTEGER
- Regeneration interval (in seconds) of the hash secret (or lifetime
- for the hash secret) for IPv6 fragments.
- Default: 600
-
conf/default/*:
Change the interface-specific default settings.
@@ -1210,6 +1212,18 @@ accept_ra_defrtr - BOOLEAN
Functional default: enabled if accept_ra is enabled.
disabled if accept_ra is disabled.
+accept_ra_from_local - BOOLEAN
+ Accept RA with source-address that is found on local machine
+ if the RA is otherwise proper and able to be accepted.
+ Default is to NOT accept these as it may be an un-intended
+ network loop.
+
+ Functional default:
+ enabled if accept_ra_from_local is enabled
+ on a specific interface.
+ disabled if accept_ra_from_local is disabled
+ on a specific interface.
+
accept_ra_pinfo - BOOLEAN
Learn Prefix Information in Router Advertisement.
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 38112d5..a6d7cb9 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -1008,14 +1008,9 @@ hardware timestamps to be used. Note: you may need to enable the generation
of hardware timestamps with SIOCSHWTSTAMP (see related information from
Documentation/networking/timestamping.txt).
-PACKET_TIMESTAMP accepts the same integer bit field as
-SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE
-and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by
-PACKET_TIMESTAMP. SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over
-SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set.
-
- int req = 0;
- req |= SOF_TIMESTAMPING_SYS_HARDWARE;
+PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING:
+
+ int req = SOF_TIMESTAMPING_RAW_HARDWARE;
setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req))
For the mmap(2)ed ring buffers, such timestamps are stored in the
@@ -1023,14 +1018,13 @@ tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine
what kind of timestamp has been reported, the tp_status field is binary |'ed
with the following possible bits ...
- TP_STATUS_TS_SYS_HARDWARE
TP_STATUS_TS_RAW_HARDWARE
TP_STATUS_TS_SOFTWARE
... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the
-RX_RING, if none of those 3 are set (i.e. PACKET_TIMESTAMP is not set),
-then this means that a software fallback was invoked *within* PF_PACKET's
-processing code (less precise).
+RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
+software fallback was invoked *within* PF_PACKET's processing code (less
+precise).
Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant
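
The packet_mmap.txt hunks above can be illustrated with a short userspace sketch: request raw hardware timestamps on a PF_PACKET socket, then classify a frame's tp_status. The helper names and the enum are hypothetical; the constants come from <linux/if_packet.h> and <linux/net_tstamp.h>. Checking the raw hardware bit before the software bit is an assumption of this sketch, not something the text specifies.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/net_tstamp.h>

/* Hypothetical classification of the timestamp source for one ring frame. */
enum ts_kind { TS_FALLBACK, TS_SOFTWARE, TS_RAW_HARDWARE };

/* Ask PF_PACKET to report raw hardware timestamps, as in the updated text.
 * Returns the setsockopt() result: 0 on success, -1 on error. */
static int request_raw_hw_timestamps(int fd)
{
    int req = SOF_TIMESTAMPING_RAW_HARDWARE;

    assert(fd >= 0); /* caller must pass an open PF_PACKET socket */
    return setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, &req, sizeof(req));
}

/* Decide which kind of timestamp a frame carries from its tp_status bits.
 * If neither bit is set, PF_PACKET used its own (less precise) fallback. */
static enum ts_kind classify_ts(unsigned int tp_status)
{
    if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
        return TS_RAW_HARDWARE;
    if (tp_status & TP_STATUS_TS_SOFTWARE)
        return TS_SOFTWARE;
    return TS_FALLBACK;
}
```

A ring consumer would call classify_ts() on each frame header's tp_status before interpreting tp_sec and tp_{n,u}sec. As the text notes, hardware timestamps may additionally need to be enabled on the device with SIOCSHWTSTAMP.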
diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
index 3544c98..e839e7e 100644
--- a/Documentation/networking/phy.txt
+++ b/Documentation/networking/phy.txt
@@ -272,6 +272,8 @@ Writing a PHY driver
txtsamp: Requests a transmit timestamp at the PHY level for a 'skb'
set_wol: Enable Wake-on-LAN at the PHY level
get_wol: Get the Wake-on-LAN status at the PHY level
+ read_mmd_indirect: Read PHY MMD indirect register
+ write_mmd_indirect: Write PHY MMD indirect register
Of these, only config_aneg and read_status are required to be
assigned by the driver code. The rest are optional. Also, it is
@@ -284,7 +286,21 @@ Writing a PHY driver
Feel free to look at the Marvell, Cicada, and Davicom drivers in
drivers/net/phy/ for examples (the lxt and qsemi drivers have
- not been tested as of this writing)
+ not been tested as of this writing).
+
+ The PHY's MMD register accesses are handled by the PAL framework
+ by default, but can be overridden by a specific PHY driver if
+ required. This could be the case if a PHY was released for
+ manufacturing before the MMD PHY register definitions were
+ standardized by the IEEE. Most modern PHYs will be able to use
+ the generic PAL framework for accessing the PHY's MMD registers.
+ An example of such usage is for Energy Efficient Ethernet support,
+ implemented in the PAL. This support uses the PAL to access MMD
+ registers for EEE query and configuration if the PHY supports
+ the IEEE standard access mechanisms, or can use the PHY's specific
+ access interfaces if overridden by the specific PHY driver. See
+ the Micrel driver in drivers/net/phy/ for an example of how this
+ can be implemented.
Board Fixups
diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt
index 0e30c78..0dffc6e 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.txt
@@ -24,6 +24,34 @@ For monitoring and control pktgen creates:
/proc/net/pktgen/ethX
+Tuning NIC for max performance
+==============================
+
+The default NIC settings are (likely) not tuned for pktgen's artificial
+overload type of benchmarking, as this could hurt the normal use-case.
+
+Specifically increasing the TX ring buffer in the NIC:
+ # ethtool -G ethX tx 1024
+
+A larger TX ring can improve pktgen's performance, while it can hurt
+in the general case, 1) because the TX ring buffer might get larger
+than the CPU's L1/L2 cache, 2) because it allows more queueing in the
+NIC HW layer (which is bad for bufferbloat).
+
+One should be careful about concluding that packets/descriptors in the
+HW TX ring cause delay. Drivers usually delay cleaning up the
+ring-buffers (for various performance reasons), thus packets stalling
+the TX ring might just be waiting for cleanup.
+
+This cleanup issue is specifically the case for the ixgbe driver
+(Intel 82599 chip). This driver (ixgbe) combines TX+RX ring cleanups,
+and the cleanup interval is affected by the ethtool --coalesce setting
+of parameter "rx-usecs".
+
+For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6):
+ # ethtool -C ethX rx-usecs 30
+
+
Viewing threads
===============
/proc/net/pktgen/kpktgend_0
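
The interrupt-rate figure quoted in the pktgen.txt hunk above (rx-usecs 30 giving roughly 33K interrupts/sec) is plain arithmetic, sketched here in C. The helper name is hypothetical, and the sketch assumes one interrupt per coalescing interval, so the result is an upper bound under load.

```c
#include <assert.h>

/* Hypothetical helper: approximate interrupts per second for a given
 * "rx-usecs" coalescing value, i.e. 10^6 / rx_usecs (integer division). */
static unsigned long irqs_per_sec(unsigned long rx_usecs)
{
    assert(rx_usecs > 0); /* 0 would mean no coalescing delay at all */
    return 1000000UL / rx_usecs;
}
```

For rx_usecs = 30 this gives 33333, matching the "approx 33K" in the text.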
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index bc35541..897f942 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -40,7 +40,7 @@ the set bits correspond to data that is available, then the control
message will not be generated:
SOF_TIMESTAMPING_SOFTWARE: report systime if available
-SOF_TIMESTAMPING_SYS_HARDWARE: report hwtimetrans if available
+SOF_TIMESTAMPING_SYS_HARDWARE: report hwtimetrans if available (deprecated)
SOF_TIMESTAMPING_RAW_HARDWARE: report hwtimeraw if available
It is worth noting that timestamps may be collected for reasons other
@@ -88,13 +88,12 @@ hwtimeraw is the original hardware time stamp. Filled in if
SOF_TIMESTAMPING_RAW_HARDWARE is set. No assumptions about its
relation to system time should be made.
-hwtimetrans is the hardware time stamp transformed so that it
-corresponds as good as possible to system time. This correlation is
-not perfect; as a consequence, sorting packets received via different
-NICs by their hwtimetrans may differ from the order in which they were
-received. hwtimetrans may be non-monotonic even for the same NIC.
-Filled in if SOF_TIMESTAMPING_SYS_HARDWARE is set. Requires support
-by the network device and will be empty without that support.
+hwtimetrans is always zero. This field is deprecated. It used to hold
+hw timestamps converted to system time. Instead, expose the hardware
+clock device on the NIC directly as a HW PTP clock source, to allow
+time conversion in userspace and optionally synchronize system time
+with a userspace PTP stack such as linuxptp. For the PTP clock API,
+see Documentation/ptp/ptp.txt.
SIOCSHWTSTAMP, SIOCGHWTSTAMP:
@@ -185,7 +184,6 @@ struct skb_shared_hwtstamps {
* since arbitrary point in time
*/
ktime_t hwtstamp;
- ktime_t syststamp; /* hwtstamp transformed to system time base */
};
Time stamps for outgoing packets are to be generated as follows:
diff --git a/Documentation/networking/timestamping/timestamping.c b/Documentation/networking/timestamping/timestamping.c
index 8ba82bf..5cdfd74 100644
--- a/Documentation/networking/timestamping/timestamping.c
+++ b/Documentation/networking/timestamping/timestamping.c
@@ -76,7 +76,6 @@ static void usage(const char *error)
" SOF_TIMESTAMPING_RX_HARDWARE - hardware time stamping of incoming packets\n"
" SOF_TIMESTAMPING_RX_SOFTWARE - software fallback for incoming packets\n"
" SOF_TIMESTAMPING_SOFTWARE - request reporting of software time stamps\n"
- " SOF_TIMESTAMPING_SYS_HARDWARE - request reporting of transformed HW time stamps\n"
" SOF_TIMESTAMPING_RAW_HARDWARE - request reporting of raw HW time stamps\n"
" SIOCGSTAMP - check last socket time stamp\n"
" SIOCGSTAMPNS - more accurate socket time stamp\n");
@@ -202,9 +201,7 @@ static void printpacket(struct msghdr *msg, int res,
(long)stamp->tv_sec,
(long)stamp->tv_nsec);
stamp++;
- printf("HW transformed %ld.%09ld ",
- (long)stamp->tv_sec,
- (long)stamp->tv_nsec);
+ /* skip deprecated HW transformed */
stamp++;
printf("HW raw %ld.%09ld",
(long)stamp->tv_sec,
@@ -361,8 +358,6 @@ int main(int argc, char **argv)
so_timestamping_flags |= SOF_TIMESTAMPING_RX_SOFTWARE;
else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SOFTWARE"))
so_timestamping_flags |= SOF_TIMESTAMPING_SOFTWARE;
- else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SYS_HARDWARE"))
- so_timestamping_flags |= SOF_TIMESTAMPING_SYS_HARDWARE;
else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RAW_HARDWARE"))
so_timestamping_flags |= SOF_TIMESTAMPING_RAW_HARDWARE;
else