Commit message (Author, Date; files changed, lines +added/-removed)
* pkt_sched: fq: Fair Queue packet scheduler (Eric Dumazet, 2013-08-29; 4 files, +848/-0)

    - Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
    - Uses the new_flow/old_flow separation from FQ_codel
    - New flows get an initial credit allowing IW10 without added delay.
    - Special FIFO queue for high prio packets (no need for PRIO + FQ)
    - Uses a hash table of RB trees to locate the flows at enqueue() time
    - Smart on demand gc (at enqueue() time, RB tree lookup evicts old
      unused flows)
    - Dynamic memory allocations.
    - Designed to allow millions of concurrent flows per Qdisc.
    - Small memory footprint: ~8K per Qdisc, and 104 bytes per flow.
    - Single high resolution timer for throttled flows (if any).
    - One RB tree to link throttled flows.
    - Ability to have a max rate per flow. We might add a socket option to
      add a per-socket limitation.

    Attempts have been made to add TCP pacing in the TCP stack, but this
    seems to add complex code to an already complex stack.

    TCP pacing is welcomed for flows having idle times, as the cwnd permits
    the TCP stack to queue a possibly large number of packets. This removes
    the 'slow start after idle' choice, badly hitting large-BDP flows and
    applications delivering chunks of data such as video streams.

    Nicely spaced packets: here the interface is 10Gbit, but the flow
    bottleneck is ~20Mbit. cwin is big, yet FQ avoids the typical bursts
    generated by TCP (as in netperf TCP_RR -- -r 100000,100000)

    15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
    15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115>
    15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
    15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115>
    15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
    15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115>
    15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
    15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115>
    15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
    15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115>
    15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
    15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115>
    15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
    15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115>
    15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
    15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
    15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
    15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
    15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
    15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
    15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
    15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
    15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
    15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>

    TCP timestamps show that most packets from B were queued in the same ms
    timeframe (TSval 1159799{3,4}), but FQ managed to send them right in
    time to avoid a big burst.

    In slow start or steady state, very few packets are throttled [1]

    FQ gets a bunch of tunables:

      limit           : max number of packets on whole Qdisc (default 10000)
      flow_limit      : max number of packets per flow (default 100)
      quantum         : the credit per RR round (default is 2 MTU)
      initial_quantum : initial credit for new flows (default is 10 MTU)
      maxrate         : max per flow rate (default: unlimited)
      buckets         : number of RB trees (default: 1024) in hash table.
                        (consumes 8 bytes per bucket)
      [no]pacing      : disable/enable pacing (default is enabled)

    All of them can be changed on a live qdisc.

    $ tc qd add dev eth0 root fq help
    Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
                  [ quantum BYTES ] [ initial_quantum BYTES ]
                  [ maxrate RATE ] [ buckets NUMBER ]
                  [ [no]pacing ]

    $ tc -s -d qd
    qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
      buckets 256 quantum 3028 initial_quantum 15140
     Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
     backlog 0b 0p requeues 14
      511 flows, 511 inactive, 0 throttled
      110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit

    [1] Except if the initial srtt is overestimated, as when using a cached
        srtt from tcp metrics. We'll provide a fix for this issue.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Yuchung Cheng <ycheng@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

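    [ed: the pacing behaviour described above comes down to spacing packets
    by len/rate. A minimal user-space sketch of that arithmetic, with
    illustrative names (not the qdisc's actual code):

        #include <stdio.h>
        #include <stdint.h>

        #define NSEC_PER_SEC 1000000000ULL

        /* gap to leave after a packet of 'len' bytes at 'rate' bytes/sec */
        static uint64_t pacing_delay_ns(uint32_t len, uint64_t rate)
        {
                return (uint64_t)len * NSEC_PER_SEC / rate;
        }

        int main(void)
        {
                /* one 2896-byte TSO chunk at ~20 Mbit/s (2.5e6 B/s) */
                printf("%llu ns\n",
                       (unsigned long long)pacing_delay_ns(2896, 2500000));
                return 0;
        }

    This yields ~1.16 ms per 2896-byte chunk, on the order of the ~1.4 ms
    spacing visible in the B > A trace above.]
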
* net: packet: document available fanout policies (Daniel Borkmann, 2013-08-29; 1 file, +8/-0)

    Update documentation to add fanout policies that are available.

    Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

* net: packet: use reciprocal_divide in fanout_demux_hash (Daniel Borkmann, 2013-08-29; 1 file, +1/-1)

    Instead of hard-coding the reciprocal_divide function, use the inline
    function from reciprocal_div.h.

    Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

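    [ed: at the time, reciprocal_divide() in reciprocal_div.h reduced to a
    multiply-and-shift that maps a 32-bit value A into [0, B) - useful for
    turning a packet hash into a socket index without a '%'. A standalone
    sketch of the idea (illustrative, not the kernel header itself):

        #include <stdint.h>
        #include <stdio.h>

        /* map a uniform 32-bit value 'a' into the range [0, n) */
        static uint32_t reciprocal_scale32(uint32_t a, uint32_t n)
        {
                return (uint32_t)(((uint64_t)a * n) >> 32);
        }

        int main(void)
        {
                uint32_t rxhash = 0xdeadbeef;   /* pretend skb->rxhash */
                printf("socket index: %u of 4\n",
                       reciprocal_scale32(rxhash, 4));
                return 0;
        }
    ]
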
* net: packet: add randomized fanout scheduler (Daniel Borkmann, 2013-08-29; 2 files, +13/-1)

    We currently allow for different fanout scheduling policies in
    pf_packet such as scheduling by skb's rxhash, round-robin, by cpu, and
    rollover. Also allow for a random, equidistributed selection of the
    socket from the fanout process group.

    Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

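    [ed: a sketch of the selection step, assuming the same multiply-shift
    mapping as in the previous note applied to a fresh per-packet random
    draw (names illustrative; the caller must supply a uniform 32-bit
    value):

        #include <stdint.h>

        /* pick one of 'num' sockets in the fanout group, given a
         * uniformly distributed 32-bit random draw 'r' */
        static unsigned int fanout_pick_rnd(uint32_t r, unsigned int num)
        {
                return (unsigned int)(((uint64_t)r * num) >> 32);
        }
    ]
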
* sh_eth: no need to call ether_setup() (Sergei Shtylyov, 2013-08-29; 1 file, +0/-3)

    There's no need to call ether_setup() in the driver since the prior
    alloc_etherdev() call already arranges for it.

    Suggested-by: Denis Kirjanov <kda@linux-powerpc.org>
    Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

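    [ed: the redundancy exists because alloc_etherdev() already runs
    ether_setup() on the newly allocated device. A hypothetical probe
    fragment showing the shape of the change (struct my_priv is
    illustrative):

        struct net_device *ndev;

        ndev = alloc_etherdev(sizeof(struct my_priv));
        if (!ndev)
                return -ENOMEM;
        /* ether_setup(ndev); -- redundant: alloc_etherdev() did it */
    ]
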
* Merge branch 'bond_vlan' (David S. Miller, 2013-08-29; 6 files, +416/-310)

    Veaceslav Falico says:

    ====================
    bonding: remove vlan special handling

    v1: Per Jiri's advice, remove the exported netdev_upper struct to keep
    it inside dev.c only, and instead implement a macro to iterate over the
    list and return only net_device *.

    v2: Jiri noted that we need to see every upper device, not only the
    first level. Modify the netdev_upper logic to include a list of lower
    devices, and for both upper/lower lists every device would see both its
    first-level devices and every other device that is lower/upper of it.
    Also, convert some annoying spamming warnings to pr_debug in
    bond_arp_send_all.

    v3: Move the renaming part completely to patch 1 (did I forget to git
    add before committing?) and address Jiri's input about comments/style
    of patch 2.

    v4: As Vlad spotted, bond_arp_send_all() won't work in a config where
    we have a device with an IP on top of our upper vlan. It fails to send
    packets because we don't tag the packet, while the device on top of the
    vlan will emit tagged packets through this vlan. Fix this by first
    searching for all upper vlans, and for each vlan - for the devs on top
    of it. If we find the dev - then tag the packet with the underlying
    vlan's id, otherwise just search the old way - for all devices on top
    of the bonding. Also, move the version changes under "---" so they
    won't get into the commit message, if/when applied.

    The aim of this patchset is to remove bonding's own vlan handling as
    much as possible and replace it with the netdev upper device
    functionality.

    The upper device functionality is extended to also include lower
    devices and to have, for each device, a full view of every lower and
    upper device, not only the first-level ones. This might permit in the
    future to avoid, for any grouping/teaming/upper/lower devices,
    maintaining their own lists of slaves/vlans/etc.

    This is achieved by adding a helper function to upper dev list handling
    - netdev_upper_get_next_dev(dev, iter), which returns the next device
    after the list_head **iter, and sets *iter to the next list_head *.
    This patchset also adds netdev_for_each_upper_dev(dev, upper, iter),
    which iterates through the whole dev->upper_dev_list, setting upper to
    the net_device.

    The only special treatment of vlans remains in the rlb code.

    This patchset solves several issues with bonding, simplifies it
    overall, RCUifies it further and exports upper list functions for any
    other users which might also want to get rid of their own vlan lists or
    slaves. I'm testing it continuously, no issues found, will update on
    anything.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

| * bonding: pr_debug instead of pr_warn in bond_arp_send_all (Veaceslav Falico, 2013-08-29; 1 file, +6/-8)

    They're simply annoying and will spam dmesg constantly if we hit them,
    so convert them to pr_debug so that we can still access them when
    debugging.

    CC: Jay Vosburgh <fubar@us.ibm.com>
    CC: Andy Gospodarek <andy@greyhouse.net>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * bonding: remove vlan_list/current_alb_vlan (Veaceslav Falico, 2013-08-29; 4 files, +2/-143)

    Currently there are no real users of vlan_list/current_alb_vlan, only
    the helpers which maintain them, so remove them.

    CC: Jay Vosburgh <fubar@us.ibm.com>
    CC: Andy Gospodarek <andy@greyhouse.net>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * bonding: make alb_send_learning_packets() use upper dev list (Veaceslav Falico, 2013-08-29; 2 files, +11/-19)

    Currently, if there are vlans on top of the bond,
    alb_send_learning_packets() will never send LPs from the bond itself
    (i.e. untagged), which might leave untagged clients unupdated.

    Also, the 'circular vlan' logic (i.e. update only MAX_LP_BURST vlans at
    a time, and save the last vlan for the next update) is really
    suboptimal - in case of lots of vlans it will take a lot of time to
    update every vlan. It is also never called in any hot path and sends
    only a few small packets - thus the optimization by itself is useless.

    So remove the whole current_alb_vlan/MAX_LP_BURST logic from
    alb_send_learning_packets(). Instead, we'll first send a packet
    untagged and then traverse the upper dev list, sending a tagged packet
    for each vlan found. Also, remove the MAX_LP_BURST define - we no
    longer need it.

    CC: Jay Vosburgh <fubar@us.ibm.com>
    CC: Andy Gospodarek <andy@greyhouse.net>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * bonding: split alb_send_learning_packets() (Veaceslav Falico, 2013-08-29; 1 file, +35/-24)

    Create alb_send_lp_vid(), which will handle the skb/LP creation, vlan
    tagging and sending, and use it in alb_send_learning_packets(). This
    way all the logic remains in alb_send_learning_packets(), which becomes
    a lot cleaner and easier to understand.

    CC: Jay Vosburgh <fubar@us.ibm.com>
    CC: Andy Gospodarek <andy@greyhouse.net>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * bonding: use vlan_uses_dev() in __bond_release_one() (Veaceslav Falico, 2013-08-29; 1 file, +1/-1)

    We always hold the rtnl_lock() in __bond_release_one(), so use
    vlan_uses_dev() instead of bond_vlan_used().

    CC: Jay Vosburgh <fubar@us.ibm.com>
    CC: Andy Gospodarek <andy@greyhouse.net>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * bonding: convert bond_has_this_ip() to use upper devices (Veaceslav Falico, 2013-08-29; 1 file, +13/-12)

    Currently, bond_has_this_ip() is aware only of vlan upper devices, and
    thus will return false if the address is associated with the upper
    bridge or any other device, and thus will break the arp logic.

    Fix this by using the upper device list. For every upper device we
    verify if the address associated with it is our address, and if yes -
    return true.

    CC: Jay Vosburgh <fubar@us.ibm.com>
    CC: Andy Gospodarek <andy@greyhouse.net>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * bonding: make bond_arp_send_all use upper device list (Veaceslav Falico, 2013-08-29; 1 file, +51/-51)

    Currently, bond_arp_send_all() is aware only of vlans, which breaks
    configurations like bond <- bridge (or any other 'upper' device) with
    IP (which is quite a common scenario for virt setups).

    To fix this we convert bond_arp_send_all() to first verify if the rt
    device is the bond itself, and if not - to go through its list of upper
    vlans and their respective upper devices (if the vlan's upper device
    matches - tag the packet), and if still not found - to go through all
    of our upper list devices to see if any of them match the route device
    for the target. If the match is a vlan device - we also save its
    vlan_id and tag it in bond_arp_send().

    Also, clean the function up a bit to make it more readable.

    CC: Vlad Yasevich <vyasevic@redhat.com>
    CC: Jay Vosburgh <fubar@us.ibm.com>
    CC: Andy Gospodarek <andy@greyhouse.net>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * bonding: use netdev_upper list in bond_vlan_used (Veaceslav Falico, 2013-08-29; 1 file, +14/-1)

    Convert bond_vlan_used() to traverse the upper device list to see if we
    have any vlans above us. It's protected by rcu, and in case we are
    holding rtnl_lock we should call vlan_uses_dev() instead - it's faster.

    CC: Jay Vosburgh <fubar@us.ibm.com>
    CC: Andy Gospodarek <andy@greyhouse.net>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * net: add netdev_for_each_upper_dev_rcu() (Veaceslav Falico, 2013-08-29; 1 file, +10/-0)

    The new macro netdev_for_each_upper_dev_rcu(dev, upper, iter) iterates
    through the dev->upper_dev_list starting from the first element, using
    netdev_upper_get_next_dev_rcu(dev, &iter). Must be called under the RCU
    read lock (see the usage sketch after this entry).

    CC: "David S. Miller" <davem@davemloft.net>
    CC: Eric Dumazet <edumazet@google.com>
    CC: Jiri Pirko <jiri@resnulli.us>
    CC: Alexander Duyck <alexander.h.duyck@intel.com>
    CC: Cong Wang <amwang@redhat.com>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

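    [ed: a usage sketch of the new iterator, assuming a caller that takes
    the RCU read lock as the commit requires:

        struct net_device *upper;
        struct list_head *iter;

        rcu_read_lock();
        netdev_for_each_upper_dev_rcu(dev, upper, iter) {
                /* 'upper' visits every upper device, not only vlans */
                pr_debug("upper device: %s\n", upper->name);
        }
        rcu_read_unlock();
    ]
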
| * net: add netdev_upper_get_next_dev_rcu(dev, iter) (Veaceslav Falico, 2013-08-29; 1 file, +25/-0)

    This function returns the next dev in the dev->upper_dev_list after the
    struct list_head **iter position, and updates *iter accordingly.
    Returns NULL if there are no devices left. Caller must hold the RCU
    read lock.

    CC: "David S. Miller" <davem@davemloft.net>
    CC: Eric Dumazet <edumazet@google.com>
    CC: Jiri Pirko <jiri@resnulli.us>
    CC: Alexander Duyck <alexander.h.duyck@intel.com>
    CC: Cong Wang <amwang@redhat.com>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * net: remove search_list from netdev_adjacent (Veaceslav Falico, 2013-08-29; 1 file, +1/-36)

    We no longer need it because we already see every upper/lower device in
    the list.

    CC: "David S. Miller" <davem@davemloft.net>
    CC: Eric Dumazet <edumazet@google.com>
    CC: Jiri Pirko <jiri@resnulli.us>
    CC: Alexander Duyck <alexander.h.duyck@intel.com>
    CC: Cong Wang <amwang@redhat.com>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * net: add lower_dev_list to net_device and make a full mesh (Veaceslav Falico, 2013-08-29; 2 files, +259/-27)

    This patch adds a lower_dev_list list_head to net_device, which is the
    same as upper_dev_list, only for lower devices, and begins to use it in
    the same way as the upper list.

    It also changes the way the whole adjacent device lists work - now they
    contain *all* of the upper/lower devices, not only the first level. The
    first-level devices are distinguished by the bool neighbour field in
    netdev_adjacent, also added by this patch.

    There are cases when a device can be added several times to the
    adjacent list; the simplest would be:

            /---- eth0.10 ---\
        eth0                  --- bond0
            \---- eth0.20 ---/

    where both bond0 and eth0 'see' each other in the adjacent lists two
    times. To avoid duplication of netdev_adjacent structures, ref_nr is
    kept as the number of times the device was added to the list.

    The 'full view' is achieved by adding, on link creation, all of the
    upper_dev's upper_dev_list devices as upper devices to all of the
    lower_dev's lower_dev_list devices (and to the lower_dev itself), and
    vice versa. On unlink they are removed using the same logic.

    I've tested it with thousands of vlans/bonds/bridges, everything works
    ok and there are no observable lags even on a huge number of
    interfaces.

    Memory footprint for 128 devices interconnected with each other via
    both upper and lower (which is impossible, but useful for comparison)
    lists would be:

        128*128*2*sizeof(netdev_adjacent) = 1.5MB

    but in the real world we usually have at most several devices with
    slaves and a lot of vlans, so the footprint will be much lower.

    CC: "David S. Miller" <davem@davemloft.net>
    CC: Eric Dumazet <edumazet@google.com>
    CC: Jiri Pirko <jiri@resnulli.us>
    CC: Alexander Duyck <alexander.h.duyck@intel.com>
    CC: Cong Wang <amwang@redhat.com>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * net: rename netdev_upper to netdev_adjacent (Veaceslav Falico, 2013-08-29; 1 file, +12/-12)

    Rename the structure to reflect the upcoming addition of
    lower_dev_list.

    CC: "David S. Miller" <davem@davemloft.net>
    CC: Eric Dumazet <edumazet@google.com>
    CC: Jiri Pirko <jiri@resnulli.us>
    CC: Alexander Duyck <alexander.h.duyck@intel.com>
    CC: Cong Wang <amwang@redhat.com>
    Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

* Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next (David S. Miller, 2013-08-29; 9 files, +239/-90)

    Jeff Kirsher says:

    ====================
    This series contains updates to ixgbe.

    Jacob provides a fix for 82599 devices where they can potentially keep
    the link lights up when the adapter has gone down.

    Mark provides a fix to resolve the possible use of uninitialized memory
    by checking the return value on EEPROM reads.

    Don provides 2 patches, one to fix an issue where we were traversing
    the Tx ring with the value of IXGBE_NUM_RX_QUEUES, which currently
    happens to have the correct value but is misleading; a later change
    could easily make this no longer correct. When traversing the Tx ring,
    use netdev->num_tx_queues. His second patch does some minor cleanup of
    log messages.

    Emil provides the remaining ixgbe patches. First he fixes the link test
    where forcing the laser before the link check can lead to inconsistent
    results, because it does not guarantee that the link will be negotiated
    correctly. Then he initializes the message buffer array to 0 in order
    to avoid using random numbers from memory as a MAC address for the VF.
    Emil also fixes the read loop for the I2C data to account for the
    offset for SFP+ modules. Lastly, Emil provides several patches to add
    support for QSFP modules, where 1Gbps support is added as well as
    support for older QSFP active direct attach cables which pre-date
    SFF-8436 v3.6.

    v2: Fixed patch 4 description and added blank line based on feedback
    from Sergei Shtylyov
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

| * ixgbe: add support for older QSFP active DA cables (Emil Tantilov, 2013-08-29; 2 files, +50/-10)

    This patch adds support for QSFP active direct attach (DA) cables which
    pre-date SFF-8436 v3.6.

    Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
    Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

| * ixgbe: include QSFP PHY types in ixgbe_is_sfp() (Emil Tantilov, 2013-08-29; 1 file, +4/-0)

    This patch makes sure that QSFP+ modules use the SFP+ code path for
    setting up link.

    Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
    Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

| * ixgbe: add 1Gbps support for QSFP+ (Emil Tantilov, 2013-08-29; 3 files, +33/-11)

    This patch adds GB speed support for QSFP+ modules. Autonegotiation is
    not supported with QSFP+; the user will have to set the desired speed
    on both link partners using the ethtool advertise setting.

    Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
    Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

| * ixgbe: fix SFF data dumps of SFP+ modules from an offset (Emil Tantilov, 2013-08-29; 1 file, +4/-5)

    This patch fixes the read loop for the I2C data to account for the
    offset. Also includes a whitespace cleanup and removes ret_val as it is
    not needed.

    CC: Ben Hutchings <bhutchings@solarflare.com>
    Reported-by: Ben Hutchings <bhutchings@solarflare.com>
    Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
    Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
    Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

| * ixgbe: cleanup some log messages (Don Skidmore, 2013-08-29; 2 files, +3/-5)

    Some minor log message cleanups: change the level at which one message
    is logged, add a bit of detail to another, and put all the text on one
    line.

    Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
    Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

| * ixgbe: zero out mailbox buffer on init (Emil Tantilov, 2013-08-29; 1 file, +2/-2)

    This patch initializes the msgbuf array to 0 in order to avoid using
    random numbers from memory as a MAC address for the VF.

    Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
    Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

| * ixgbe: fix link test when connected to 1Gbps link partner (Emil Tantilov, 2013-08-29; 1 file, +10/-17)

    This patch is a partial revert of:

        commit dfcc4615f09c33454bc553567f7c7506cae60cb9
        Author: Jacob Keller <jacob.e.keller@intel.com>
        Date:   Thu Nov 8 07:07:08 2012 +0000

            ixgbe: ethtool ixgbe_diag_test cleanup

    Specifically, forcing the laser before the link check can lead to
    inconsistent results because it does not guarantee that the link will
    be negotiated correctly. Such is the case when a dual speed SFP+ module
    is connected to a gigabit link partner.

    Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
    Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

| * ixgbe: fix incorrect limit value in ring traversal (Don Skidmore, 2013-08-29; 1 file, +1/-1)

    We were traversing the tx_ring with IXGBE_NUM_RX_QUEUES. This define
    happens to have the correct value, but it is misleading and a later
    change could easily make it no longer true. Update it to
    netdev->num_tx_queues, as we use in ixgbe_get_strings().

    Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
    Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

| * ixgbe: Check return value on eeprom reads (Mark Rustad, 2013-08-29; 4 files, +107/-39)

    This patch fixes the possible use of uninitialized memory by checking
    the return value on eeprom reads. These issues were identified by
    static analysis. In many cases error messages will be produced so that
    corrupted eeprom issues will be more visible.

    Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
    Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

| * ixgbe: disable link when adapter goes down (Jacob Keller, 2013-08-29; 3 files, +25/-0)

    This patch fixes an issue with the 82599 adapter where it can
    potentially keep the link lights up when the adapter has gone down. The
    patch adds a function which ensures link is disabled, and calls this
    function when the adapter transitions to a down state.

    Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
    Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

* fec: Use NAPI_POLL_WEIGHT (Fabio Estevam, 2013-08-29; 1 file, +1/-2)

    Instead of using a custom 'FEC_NAPI_WEIGHT', just use the generic
    'NAPI_POLL_WEIGHT' definition.

    Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

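    [ed: the change is a one-line swap at napi registration; a sketch in
    which the poll callback and private-struct names are assumed, not
    checked against the driver:

        /* before: netif_napi_add(ndev, &fep->napi, fec_enet_rx_napi,
         *                        FEC_NAPI_WEIGHT); */
        netif_napi_add(ndev, &fep->napi, fec_enet_rx_napi, NAPI_POLL_WEIGHT);
    ]
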
* net: sctp: sctp_verify_init: clean up mandatory checks and add comment (Daniel Borkmann, 2013-08-29; 1 file, +12/-14)

    Add a comment related to RFC4960 explaining why we do not check for the
    initial TSN, and while at it, remove yoda notation checks and clean up
    the code that checks mandatory conditions. That's probably just really
    minor, but it makes reviewing easier.

    Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
    Acked-by: Neil Horman <nhorman@tuxdriver.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

* tcp: TSO packets automatic sizing (Eric Dumazet, 2013-08-29; 7 files, +77/-7)

    After hearing many people over past years complaining about TSO being
    bursty or even buggy, we are proud to present automatic sizing of TSO
    packets.

    One part of the problem is that tcp_tso_should_defer() uses a heuristic
    relying on upcoming ACKs instead of a timer, but more generally, having
    big TSO packets makes little sense for low rates, as it tends to create
    micro bursts on the network, and general consensus is to reduce the
    buffering amount.

    This patch introduces a per-socket sk_pacing_rate, that approximates
    the current sending rate, and allows us to size the TSO packets so that
    we try to send one packet every ms. This field could be set by other
    transports.

    The patch has no impact for high speed flows, where having large TSO
    packets makes sense to reach line rate. For other flows, this helps
    better packet scheduling and ACK clocking.

    This patch increases performance of TCP flows in lossy environments.

    A new sysctl (tcp_min_tso_segs) is added, to specify the minimal size
    of a TSO packet (default being 2).

    A follow-up patch will provide a new packet scheduler (FQ), using
    sk_pacing_rate as an input to perform optional per flow pacing. This
    explains why we chose to set sk_pacing_rate to twice the current rate,
    allowing 'slow start' ramp up:

        sk_pacing_rate = 2 * cwnd * mss / srtt

    v2: Neal Cardwell reported a suspect deferring of the last two segments
    on an initial write of 10 MSS; I had to change tcp_tso_should_defer()
    to take into account tp->xmit_size_goal_segs

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Cc: Yuchung Cheng <ycheng@google.com>
    Cc: Van Jacobson <vanj@google.com>
    Cc: Tom Herbert <therbert@google.com>
    Acked-by: Yuchung Cheng <ycheng@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

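    [ed: a worked instance of the formula as standalone arithmetic, with
    illustrative inputs:

        #include <stdio.h>

        int main(void)
        {
                unsigned int cwnd = 10, mss = 1448;     /* IW10, common MSS */
                double srtt = 0.100;                    /* 100 ms */
                double rate = 2.0 * cwnd * mss / srtt;  /* bytes per second */

                printf("sk_pacing_rate ~ %.0f B/s (~%.2f Mbit/s)\n",
                       rate, rate * 8.0 / 1e6);
                return 0;
        }

    That gives ~289600 B/s (~2.32 Mbit/s); sizing for one packet per ms
    would suggest ~290 bytes, so for such a slow flow the tcp_min_tso_segs
    floor (default 2 segments) applies.]
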
* ipv6: drop fragmented ndisc packets by default (RFC 6980) (Hannes Frederic Sowa, 2013-08-29; 5 files, +35/-0)

    This patch implements RFC 6980: drop fragmented ndisc packets by
    default. If a fragmented ndisc packet is received, the user is informed
    that it is possible to disable the check.

    Cc: Fernando Gont <fernando@gont.com.ar>
    Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
    Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

* ARM: at91/dt: fix phy address in sama5xmb to match the reg property (Boris BREZILLON, 2013-08-29; 1 file, +1/-1)

    Fix the phy0 address to match the reg property defined in the phy0
    node.

    Signed-off-by: Boris BREZILLON <b.brezillon@overkiz.com>
    Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

* net/cadence/macb: fix invalid 0 return if no phy is discovered on mii init (Boris BREZILLON, 2013-08-29; 1 file, +3/-3)

    Replace the misleading -1 (-EPERM) with a more appropriate return code
    (-ENXIO) in the macb_mii_probe function.

    Save the macb_mii_probe return value before branching to
    err_out_unregister, to avoid an erroneous 0 return.

    Signed-off-by: Boris BREZILLON <b.brezillon@overkiz.com>
    Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

* bridge: inherit slave devices needed_headroom (Florian Fainelli, 2013-08-29; 1 file, +3/-0)

    Some slave devices may have set a dev->needed_headroom value which is
    different than the default one, most likely in order to prepend a
    hardware descriptor in front of the Ethernet frame to send. Whenever a
    new slave is added to a bridge, ensure that we update the
    needed_headroom value accordingly to account for the slave's
    needed_headroom value.

    Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

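    [ed: conceptually a take-the-max update when a port joins the bridge; a
    sketch of the idea, not necessarily the patch's exact code:

        static void br_inherit_headroom(struct net_device *brdev,
                                        const struct net_device *slave)
        {
                if (slave->needed_headroom > brdev->needed_headroom)
                        brdev->needed_headroom = slave->needed_headroom;
        }
    ]
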
* net: sctp: reorder sctp_globals to reduce cacheline usage (Daniel Borkmann, 2013-08-29; 1 file, +9/-11)

    Reduce cacheline usage from 2 cachelines to 1 for the sctp_globals
    structure. By reordering elements, we can close gaps and simply achieve
    the following:

    Current situation:
        /* size: 80, cachelines: 2, members: 10 */
        /* sum members: 57, holes: 4, sum holes: 16 */
        /* padding: 7 */
        /* last cacheline: 16 bytes */

    Afterwards:
        /* size: 64, cachelines: 1, members: 10 */
        /* padding: 7 */

    Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
    Acked-by: Neil Horman <nhorman@tuxdriver.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

* net: mdio-sun4i: Convert to devm_* api (Jisheng Zhang, 2013-08-29; 1 file, +9/-9)

    Use devm_ioremap_resource() instead of of_iomap() and devm_kzalloc()
    instead of kmalloc() to make the cleanup paths simpler. This patch also
    fixes the resource leak caused by the missing iounmap() corresponding
    to the of_iomap().

    Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
    Acked-by: Maxime Ripard <maxime.ripard@free-electrons.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

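    [ed: the devm pattern referred to; a probe-path sketch with
    illustrative names ('priv', 'base'):

        res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
        base = devm_ioremap_resource(&pdev->dev, res);
        if (IS_ERR(base))
                return PTR_ERR(base);

        priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
        if (!priv)
                return -ENOMEM;

        /* no explicit iounmap()/kfree(): devm releases both on detach */
    ]
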
* Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next (David S. Miller, 2013-08-29; 20 files, +1536/-1326)

    Ben Hutchings says:

    ====================
    1. Further cleanup and refactoring in preparation for EF10.
    2. Remove ethtool stats that are always zero on Falcon boards.
    3. Add an ethtool stat for merged TX completions.
    4. Prepare to support merged RX completions.
    5. Prepare to support more hwmon sensors.
    6. Add support for new events that are generated by EF10 firmware.
    7. Update MC reboot detection for EF10.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

| * sfc: Use extended MC_CMD_SENSOR_INFO and MC_CMD_READ_SENSORS (Ben Hutchings, 2013-08-27; 1 file, +95/-44)

    We need to use extended requests to read, and get metadata for, sensors
    numbered > 31.

    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

| * sfc: Return an error code when a sensor is busy. (Alexandre Rames, 2013-08-27; 1 file, +6/-1)

    [bwh: Also name this new state, though we don't expect to see it in an
    event]
    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

| * sfc: Add support for reading packet length from prefix (Ben Hutchings, 2013-08-27; 2 files, +14/-2)

    Define a flag for struct efx_rx_buffer and efx_rx_packet() that
    indicates the packet length must be read from the prefix. If this is
    set, read the length in __efx_rx_packet() (when the prefix should have
    arrived in cache).

    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

| * sfc: Add TX merged completion counter (Ben Hutchings, 2013-08-27; 3 files, +6/-0)

    Add a counter for TX merged completion events. This is implemented in
    the common TX path, because the NIC event handlers only know how many
    descriptors were completed, not how many packets.

    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

| * sfc: Generalise packet hash lookup to support EF10 RX prefix (Jon Cooper, 2013-08-27; 6 files, +28/-13)

    EF10 uses an entirely different RX prefix format from Falcon-arch.
    Extend struct efx_nic_type to describe this.

    [bwh: Also replace the magic numbers used for the Falcon-arch RX
    prefix]
    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

| * sfc: Rename EFX_PAGE_BLOCK_SIZE to EFX_VI_PAGE_SIZE and adjust comments (Ben Hutchings, 2013-08-27; 1 file, +4/-4)

    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

| * sfc: Remove early call to efx_nic_type::reconfigure_mac in efx_reset_up() (Ben Hutchings, 2013-08-27; 1 file, +0/-2)

    efx_reset_up() calls efx_nic_type::reconfigure_mac once directly, then
    again through efx_start_all() -> efx_start_port() ->
    efx->type->reconfigure_mac(). The first call is also made too early to
    work properly on EF10.

    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

| * sfc: use MCDI epoch flag to improve MC reboot detection in the driver (Daniel Pieczko, 2013-08-27; 2 files, +15/-6)

    The Huntington MC will reject all MCDI requests after an MC reboot
    until it sees one with the NOT_EPOCH flag clear. This flag is set by
    default for all requests, and then cleared on the first request after
    we detect that an MC reboot has occurred.

    The old MCDI_STATUS_DELAY_COUNT gave a timeout of 10ms, which was not
    long enough for the driver to detect that a reboot had occurred based
    on the warm boot count while calling efx_mcdi_poll_reboot() from the
    loop in efx_mcdi_ev_death().

    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

| * sfc: Add EF10 support for TX/RX DMA error events handling. (Alexandre Rames, 2013-08-27; 5 files, +15/-11)

    Also, since we handle all DMA errors in the same way, merge
    RESET_TYPE_(RX|TX)_DESC_FETCH into RESET_TYPE_DMA_ERROR.

    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

| * sfc: Add a function pointer to abstract write of host time into NIC shared memory (Laurence Evans, 2013-08-27; 3 files, +16/-2)

    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>