op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message	Author	Age	Files	Lines
*	bnx2x: fix DMA API usage	Michal Schmidt	2015-06-28	3	-24/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With CONFIG_DMA_API_DEBUG=y bnx2x triggers the error "DMA-API: device driver frees DMA memory with wrong function". On archs where PAGE_SIZE > SGE_PAGE_SIZE it also triggers "DMA-API: device driver frees DMA memory with different size". Fix this by making the mapping and unmapping symmetric: - Do not map the whole pool page at once. Instead map the SGE_PAGE_SIZE-sized pieces individually, so they can be unmapped in the same manner. - What's mapped using dma_map_page() must be unmapped using dma_unmap_page(). Tested on ppc64. Fixes: 4cace675d687 ("bnx2x: Alloc 4k fragment for each rx ring buffer element") Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: via: VIA_RHINE and VIA_VELOCITY should depend on HAS_DMA	Geert Uytterhoeven	2015-06-28	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If NO_DMA=y: ERROR: "dma_sync_single_for_cpu" [drivers/net/ethernet/via/via-rhine.ko] undefined! ERROR: "dma_set_mask" [drivers/net/ethernet/via/via-rhine.ko] undefined! ERROR: "dma_mapping_error" [drivers/net/ethernet/via/via-rhine.ko] undefined! ERROR: "dma_map_single" [drivers/net/ethernet/via/via-rhine.ko] undefined! ERROR: "dma_alloc_coherent" [drivers/net/ethernet/via/via-rhine.ko] undefined! ERROR: "dma_free_coherent" [drivers/net/ethernet/via/via-rhine.ko] undefined! ERROR: "dma_unmap_single" [drivers/net/ethernet/via/via-rhine.ko] undefined! ERROR: "dma_map_page" [drivers/net/ethernet/via/via-velocity.ko] undefined! ERROR: "dma_sync_single_for_cpu" [drivers/net/ethernet/via/via-velocity.ko] undefined! ERROR: "dma_free_coherent" [drivers/net/ethernet/via/via-velocity.ko] undefined! ERROR: "dma_unmap_single" [drivers/net/ethernet/via/via-velocity.ko] undefined! ERROR: "dma_map_single" [drivers/net/ethernet/via/via-velocity.ko] undefined! ERROR: "dma_alloc_coherent" [drivers/net/ethernet/via/via-velocity.ko] undefined! Before, the symbols depended implicitly on HAS_DMA through PCI or USE_OF. Add explicit dependencies on HAS_DMA to fix this. Fixes: b7d3282a245f4428 ("net: via/Kconfig: replace USE_OF with OF_???") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue	David S. Miller	2015-06-28	9	-104/+198
	Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates 2015-06-26 This series contains fixes for igb, e1000e and i40evf. Todd disables IPv6 extension header processing due to a hardware errata and bumps the driver version. Yanir provides six fixes for e1000e. First is a fix for a locking issue where we were not always taking the pci_bus_sem semaphore when calling pci_disable_link_state_locked(), so fix the code to only call pci_disable_link_state_locked() when the semaphore has been acquired, otherwise call pci_disable_link_state(). A previous fix for i219 where the hardware prevented ULP entry caused EEE in Sx not to be enabled, so modify the code flow that allows both ULP and EEE in Sx. Fix an issue when running 10/100 full duplex on i219 where CRC errors were occurring by increasing the IPG from 8 to 0xC as per the hardware developers. Fix a data corruption issue found on some platforms by increasing the minimum gap between the PHY FIFO read and write pointers. Fix i219, which does not require the K1 workaround for LPT devices. Mitch provides an i40evf fix for a panic when changing MTU. Down was requesting queue disables, but then exited immediately without waiting for the queues to actually be disabled. This could allow any function called after i40evf_down() to run immediately, including i40evf_up(), and causes a memory leak. Fixed the issue by removing the whole reinit_locked function which allows for the driver to handle the state changes by requesting reset from the periodic timer. The second fix resolves an issue where RSS was being configured as though it is using the maximum number of queues. This can cause the device to drop a lot of receive traffic, as the packets get assigned to non-functional queues. This is resolved by only configuring RSS with the number of active queues. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	i40evf: don't configure unused RSS queues	Mitch Williams	2015-06-26	1	-1/+1
	The driver will only configure as many queues as there are available CPUs, up to the maximum number of queues. However, it always configures RSS as though it is using the maximum number of queues. This can cause the device to drop a lot of RX traffic, as the packets get assigned to nonfunctional queues. Fix this by only configuring RSS with the number of active queues. Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
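The idea in the message above can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the i40evf code: `RSS_LUT_SIZE` and `fill_rss_lut()` are hypothetical names, and the table size is arbitrary. The point is that the indirection table is spread over the queues actually in use rather than over the hardware maximum.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table size, chosen only for illustration. */
#define RSS_LUT_SIZE 64

/*
 * Fill an RSS indirection table round-robin over the queues that are
 * actually in use. Returns 0 on success, -1 if no queue is active.
 */
int fill_rss_lut(uint8_t lut[RSS_LUT_SIZE], unsigned int num_active_queues)
{
	if (num_active_queues == 0)
		return -1;

	for (size_t i = 0; i < RSS_LUT_SIZE; i++)
		lut[i] = (uint8_t)(i % num_active_queues);

	return 0;
}
```

With four active queues every table entry is in the range 0..3, so no hash bucket can point at a queue that was never brought up.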
\| *	i40evf: fix panic during MTU change	Mitch Williams	2015-06-26	4	-65/+54
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Down was requesting queue disables, but then exited immediately without waiting for the queues to actually disable. This could allow any function called after i40evf_down to run immediately, including i40evf_up, and causes a memory leak. Removing the whole reinit_locked function is the best way to go about this, and allows for the driver to handle the state changes by requesting reset from the periodic timer. Also, add a couple WARN_ONs in slow path to help us recognize if we re-introduce this issue or missed any cases. Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	e1000e: i219 - k1 workaround for LPT is not required for SPT	Yanir Lubetkin	2015-06-26	1	-2/+1
	SPT hardware does not require this driver workaround. Removed the conditional that caused K1 workaround execution on SPT. Signed-off-by: Yanir Lubetkin <yanirx.lubetkin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	e1000e: i219 - Increase minimum FIFO read/write min gap	Yanir Lubetkin	2015-06-26	1	-0/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Due to clocking changes in the Skylake platform, there was i219 data corruption. To work around this, HW team reported the need to increase the minimum gap between the PHY FIFO read and write pointers. Signed-off-by: Yanir Lubetkin <yanirx.lubetkin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	e1000e: i219 - increase IPG for speed 10/100 full duplex	Yanir Lubetkin	2015-06-26	1	-3/+7
	In SPT/i219, there were CRC errors in speed 10/100 full duplex. The solution given by the HW team is to increase the IPG from 8 to 0xC. Signed-off-by: Yanir Lubetkin <yanirx.lubetkin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	e1000e: i219 - fix to enable both ULP and EEE in Sx state	Yanir Lubetkin	2015-06-26	1	-13/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In i219, there is a hardware bug that prevented ULP entry. A side effect of the original software fix for this was that EEE in Sx couldn't be enabled. This patch implements a modified flow that allows both ULP and EEE in Sx. Signed-off-by: Yanir Lubetkin <yanirx.lubetkin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	e1000e: synchronization of MAC-PHY interface only on non-ME systems	Yanir Lubetkin	2015-06-26	1	-10/+12
	On power up, the MAC-PHY interface needs to be set to PCIe, even if the cable is disconnected. In ME systems, the ME handles this on exit from Sx state. In non-ME systems, the driver handles it. Added a check for non-ME systems to the driver code that handles that. Signed-off-by: Yanir Lubetkin <yanirx.lubetkin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	e1000e: fix locking issue with e1000e_disable_aspm	Yanir Lubetkin	2015-06-26	1	-4/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	e1000e_disable_aspm called pci_disable_link_state_locked which requires pci_bus_sem to be held, but is also called from places where this semaphore was not previously acquired. This patch implements two flavors of disable_aspm, one that acquires the lock, and the other (_locked) which should be called when the semaphore is already acquired. Signed-off-by: Yanir Lubetkin <yanirx.lubetkin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	igb: bump version of igb to 5.2.18	Todd Fujinaka	2015-06-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Bump version of igb to igb-5.2.18 Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
\| *	igb: disable IPv6 extension header processing	Todd Fujinaka	2015-06-26	2	-5/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Disable IPv6 extension header processing as per hardware errata. Also fix copyright date. Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
* \|	net/phy: tune get_phy_c45_ids to support more c45 phy	Shengzhou Liu	2015-06-28	1	-6/+14
	As some C45 10G PHYs (e.g. Cortina CS4315/CS4340 PHY) have zero Devices In Package, the current driver can't get a correct devices_in_package value by looking only for non-zero Devices In Package. So probe further, with zero Devices In Package, to support more C45 PHYs. Signed-off-by: Shengzhou Liu <Shengzhou.Liu@freescale.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	bnx2x: fix lockdep splat	Eric Dumazet	2015-06-28	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Michel reported following lockdep splat [ 44.718117] INFO: trying to register non-static key. [ 44.723081] the code is fine but needs lockdep annotation. [ 44.728559] turning off the locking correctness validator. [ 44.734036] CPU: 8 PID: 5483 Comm: ethtool Not tainted 4.1.0 [ 44.770289] Call Trace: [ 44.772741] [<ffffffff816eb1cd>] dump_stack+0x4c/0x65 [ 44.777879] [<ffffffff8111d921>] ? console_unlock+0x1f1/0x510 [ 44.783708] [<ffffffff811121f5>] __lock_acquire+0x1d05/0x1f10 [ 44.789538] [<ffffffff8111370a>] ? mark_held_locks+0x6a/0x90 [ 44.795276] [<ffffffff81113835>] ? trace_hardirqs_on_caller+0x105/0x1d0 [ 44.801967] [<ffffffff8111390d>] ? trace_hardirqs_on+0xd/0x10 [ 44.807793] [<ffffffff811330fa>] ? hrtimer_try_to_cancel+0x4a/0x250 [ 44.814142] [<ffffffff81112ba6>] lock_acquire+0xb6/0x290 [ 44.819537] [<ffffffff810d6675>] ? flush_work+0x5/0x280 [ 44.824844] [<ffffffff810d66ad>] flush_work+0x3d/0x280 [ 44.830061] [<ffffffff810d6675>] ? flush_work+0x5/0x280 [ 44.835366] [<ffffffff816f3c43>] ? schedule_hrtimeout_range+0x13/0x20 [ 44.841889] [<ffffffff8112ec9b>] ? usleep_range+0x4b/0x50 [ 44.847365] [<ffffffff8111370a>] ? mark_held_locks+0x6a/0x90 [ 44.853102] [<ffffffff810d8585>] ? __cancel_work_timer+0x105/0x1c0 [ 44.859359] [<ffffffff81113835>] ? trace_hardirqs_on_caller+0x105/0x1d0 [ 44.866045] [<ffffffff810d851f>] __cancel_work_timer+0x9f/0x1c0 [ 44.872048] [<ffffffffa0010982>] ? bnx2x_func_stop+0x42/0x90 [bnx2x] [ 44.878481] [<ffffffff810d8670>] cancel_work_sync+0x10/0x20 [ 44.884134] [<ffffffffa00259e5>] bnx2x_chip_cleanup+0x245/0x730 [bnx2x] [ 44.890829] [<ffffffff8110ce02>] ? 
up+0x32/0x50 [ 44.895439] [<ffffffff811306b5>] ? del_timer_sync+0x5/0xd0 [ 44.901005] [<ffffffffa005596d>] bnx2x_nic_unload+0x20d/0x8e0 [bnx2x] [ 44.907527] [<ffffffff811f1aef>] ? might_fault+0x5f/0xb0 [ 44.912921] [<ffffffffa005851c>] bnx2x_reload_if_running+0x2c/0x50 [bnx2x] [ 44.919879] [<ffffffffa005a3c5>] bnx2x_set_ringparam+0x2b5/0x460 [bnx2x] [ 44.926664] [<ffffffff815d498b>] dev_ethtool+0x55b/0x1c40 [ 44.932148] [<ffffffff815dfdc7>] ? rtnl_lock+0x17/0x20 [ 44.937364] [<ffffffff815e7f8b>] dev_ioctl+0x17b/0x630 [ 44.942582] [<ffffffff815abf8d>] sock_do_ioctl+0x5d/0x70 [ 44.947972] [<ffffffff815ac013>] sock_ioctl+0x73/0x280 [ 44.953192] [<ffffffff8124c1c8>] do_vfs_ioctl+0x88/0x5b0 [ 44.958587] [<ffffffff8110d0b3>] ? up_read+0x23/0x40 [ 44.963631] [<ffffffff812584cc>] ? __fget_light+0x6c/0xa0 [ 44.969105] [<ffffffff8124c781>] SyS_ioctl+0x91/0xb0 [ 44.974149] [<ffffffff816f4dd7>] system_call_fastpath+0x12/0x6f As bnx2x_init_ptp() is only called if bp->flags contains PTP_SUPPORTED, we also need to guard bnx2x_stop_ptp() with same condition, otherwise ptp_task workqueue is not initialized and kernel barfs on cancel_work_sync() Fixes: eeed018cbfa30 ("bnx2x: Add timestamping and PTP hardware clock support") Reported-by: Michel Lespinasse <walken@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Michal Kalderon <Michal.Kalderon@qlogic.com> Cc: Ariel Elior <Ariel.Elior@qlogic.com> Cc: Yuval Mintz <Yuval.Mintz@qlogic.com> Cc: David Decotigny <decot@google.com> Acked-by: Sony Chacko <sony.chacko@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
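The fix described above amounts to guarding teardown with the same condition that guarded setup. A minimal user-space sketch of that pattern follows; every name in it (`struct nic`, `FLAG_PTP_SUPPORTED`, `nic_load()` and so on) is hypothetical and only stands in for the driver's real structures, and a boolean takes the place of an initialized work item.

```c
#include <stdbool.h>

#define FLAG_PTP_SUPPORTED 0x1u	/* hypothetical flag bit */

struct nic {
	unsigned int flags;
	bool ptp_work_ready;	/* stands in for an initialized work item */
};

void nic_init_ptp(struct nic *n)
{
	n->ptp_work_ready = true;
}

/* Returns -1 when asked to cancel work that was never initialized. */
int nic_stop_ptp(struct nic *n)
{
	if (!n->ptp_work_ready)
		return -1;
	n->ptp_work_ready = false;
	return 0;
}

void nic_load(struct nic *n)
{
	if (n->flags & FLAG_PTP_SUPPORTED)
		nic_init_ptp(n);
}

int nic_unload(struct nic *n)
{
	/* Same guard as in nic_load(), so stop only runs after init. */
	if (n->flags & FLAG_PTP_SUPPORTED)
		return nic_stop_ptp(n);
	return 0;
}
```

A device without the flag never initializes the work item and never tries to cancel it, which is the symmetry the commit restores.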
* \|	net: fec: don't access RACC register when not available	Greg Ungerer	2015-06-28	2	-13/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Not all silicon implementations of the Freescale FEC hardware module have the RACC (Receive Accelerator Function) register, so we should not be trying to access it on those that don't. Currently none of the ColdFire based parts with a FEC have it. Support for RACC was introduced by commit 4c09eed9 ("net: fec: Enable imx6 enet checksum acceleration"). A fix was introduced in commit d1391930 ("net: fec: Fix build for MCF5272") that disables its use on the ColdFire M5272 part, but it doesn't fix the general case of other ColdFire parts. To fix we create a quirk flag, FEC_QUIRK_HAS_RACC, and check it before working with the RACC register. Signed-off-by: Greg Ungerer <gerg@uclinux.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	net: phy: fix phy link up when limiting speed via device tree	Mugunthan V N	2015-06-28	1	-2/+3
	When limiting the phy link speed to 100 Mbps or less with "max-speed" on a gigabit phy, the phy never completes auto-negotiation and the phy state machine is held in PHY_AN. Fix this by also comparing the gigabit advertisement when the phy has BMSR_ESTATEN set, even though phydev->supported doesn't include it, so that auto-negotiation is restarted because the old and new advertisements differ and the link comes up fine. Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	dsa: fix promiscuity leak on slave dev open error	Gilad Ben-Yossef	2015-06-28	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	DSA master netdev promiscuity counter was not being properly decremented on slave device open error path. Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com> CC: Gilad Ben-Yossef <giladb@ezchip.com> CC: David S. Miller <davem@davemloft.net> CC: Florian Fainelli <f.fainelli@gmail.com> CC: Guenter Roeck <linux@roeck-us.net> CC: Andrew Lunn <andrew@lunn.ch> CC: Scott Feldman <sfeldma@gmail.com> Acked-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	Merge branch 'kill_sk_protinfo'	David S. Miller	2015-06-28	6	-26/+31
	David Miller says: ==================== Get rid of sock->sk_protinfo. These two patches get rid of the last remaining user of sk_protinfo (ax25) and then really get rid of the struct member. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| * \|	net: Kill sock->sk_protinfo	David Miller	2015-06-28	3	-9/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	No more users, so it can now be removed. Signed-off-by: David S. Miller <davem@davemloft.net>
\| * \|	ax25: Stop using sock->sk_protinfo.	David Miller	2015-06-28	3	-17/+31
	Just make an ax25_sock structure that provides the ax25_cb pointer. Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	drivers: net: xgene: Pre-initialize ret in xgene_enet_get_resources()	Geert Uytterhoeven	2015-06-28	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If CONFIG_ACPI=n: drivers/net/ethernet/apm/xgene/xgene_enet_main.c: In function ‘xgene_enet_get_resources’: drivers/net/ethernet/apm/xgene/xgene_enet_main.c:951: warning: ‘ret’ may be used uninitialized in this function If the driver is bound to a legacy platform device, ret will contain arbitrary data. If it is non-zero, it will be returned to the caller as an error code. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>
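The failure mode described above can be sketched as follows. This is a simplified illustration, not the xgene code: `get_resources()` and the two lookup helpers are hypothetical stand-ins, and the two booleans replace the driver's real firmware checks.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two firmware-specific lookups. */
static int lookup_via_dt(void)   { return 0; }
static int lookup_via_acpi(void) { return 0; }

int get_resources(bool has_dt_node, bool has_acpi_companion)
{
	int ret = 0;	/* without "= 0" the legacy path returns garbage */

	if (has_dt_node)
		ret = lookup_via_dt();
	else if (has_acpi_companion)
		ret = lookup_via_acpi();
	/* Legacy platform device: neither branch runs, ret stays 0. */

	return ret;
}
```

Calling `get_resources(false, false)` models the legacy platform device case: with the initializer in place the function reports success instead of whatever happened to be on the stack.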
* \|	flow_dissector: Pre-initialize ip_proto in __skb_flow_dissect()	Geert Uytterhoeven	2015-06-28	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	net/core/flow_dissector.c: In function ‘__skb_flow_dissect’: net/core/flow_dissector.c:132: warning: ‘ip_proto’ may be used uninitialized in this function Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	xen-netfront: Remove the meaningless code	Li, Liang Z	2015-06-28	1	-7/+0
	The function netif_set_real_num_tx_queues() will return -EINVAL if the second parameter < 1, so calling this function with the second parameter set to 0 is meaningless. Signed-off-by: Liang Li <liang.z.li@intel.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	ipv4: fix RCU lockdep warning from linkdown changes	Andy Gospodarek	2015-06-28	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The following lockdep splat was seen due to the wrong context for grabbing in_dev. =============================== [ INFO: suspicious RCU usage. ] 4.1.0-next-20150626-dbg-00020-g54a6d91-dirty #244 Not tainted ------------------------------- include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ip/403: #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff81453305>] rtnl_lock+0x17/0x19 #1: ((inetaddr_chain).rwsem){.+.+.+}, at: [<ffffffff8105c327>] __blocking_notifier_call_chain+0x35/0x6a stack backtrace: CPU: 2 PID: 403 Comm: ip Not tainted 4.1.0-next-20150626-dbg-00020-g54a6d91-dirty #244 0000000000000001 ffff8800b189b728 ffffffff8150a542 ffffffff8107a8b3 ffff880037bbea40 ffff8800b189b758 ffffffff8107cb74 ffff8800379dbd00 ffff8800bec85800 ffff8800bf9e13c0 00000000000000ff ffff8800b189b7d8 Call Trace: [<ffffffff8150a542>] dump_stack+0x4c/0x6e [<ffffffff8107a8b3>] ? up+0x39/0x3e [<ffffffff8107cb74>] lockdep_rcu_suspicious+0xf7/0x100 [<ffffffff814b63c3>] fib_dump_info+0x227/0x3e2 [<ffffffff814b6624>] rtmsg_fib+0xa6/0x116 [<ffffffff814b978f>] fib_table_insert+0x316/0x355 [<ffffffff814b362e>] fib_magic+0xb7/0xc7 [<ffffffff814b4803>] fib_add_ifaddr+0xb1/0x13b [<ffffffff814b4d09>] fib_inetaddr_event+0x36/0x90 [<ffffffff8105c086>] notifier_call_chain+0x4c/0x71 [<ffffffff8105c340>] __blocking_notifier_call_chain+0x4e/0x6a [<ffffffff8105c370>] blocking_notifier_call_chain+0x14/0x16 [<ffffffff814a7f50>] __inet_insert_ifa+0x1a5/0x1b3 [<ffffffff814a894d>] inet_rtm_newaddr+0x350/0x35f [<ffffffff81457866>] rtnetlink_rcv_msg+0x17b/0x18a [<ffffffff8107e7c3>] ? 
trace_hardirqs_on+0xd/0xf [<ffffffff8146965f>] ? netlink_deliver_tap+0x1cb/0x1f7 [<ffffffff814576eb>] ? rtnl_newlink+0x72a/0x72a ... This patch resolves that splat. Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	tipc: purge backlog queue counters when broadcast link is reset	Jon Paul Maloy	2015-06-28	3	-1/+7
	In commit 1f66d161ab3d8b518903fa6c3f9c1f48d6919e74 ("tipc: introduce starvation free send algorithm") we introduced a counter per priority level for buffers in the link backlog queue. We also introduced a new function tipc_link_purge_backlog(), to reset these counters to zero when the link is reset. Unfortunately, we failed to call this function when the broadcast link is reset, with the result that the values of these counters might be permanently skewed when new nodes are attached. This may in the worst case lead to permanent, but spurious, broadcast link congestion, where no broadcast packets can be sent at all. We fix this bug with this commit. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'bnx2x'	David S. Miller	2015-06-25	6	-51/+129
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Yuval Mintz says: ==================== bnx2x: various fixes This patch series contains several small fixes [with the possible exception of the first 2 link fixes] for various driver flows. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	bnx2x: Fix linearization for encapsulated packets	Yuval Mintz	2015-06-25	1	-2/+7
	Due to FW constraints, driver must make sure that transmitted SKBs will not be too fragmented, or in the case that they are - that each 'window' of fragments passed to the FW would contain at least an mss worth of data. For encapsulated packets the calculation is wrong, since it ignores the inner headers in the calculation of the headers' length. This could lead to a FW assertion in case of a too-fragmented encapsulated packet. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	bnx2x: Release nvram lock on error flow	Yuval Mintz	2015-06-25	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	During an error flow when trying to access the nvram the driver doesn't release the hw lock it acquired. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	bnx2x: Fix statistics gathering on link change	Ariel Elior	2015-06-25	1	-0/+2
	Since the driver statistics flow accesses MACs and those might reset during link re-configurations, when we're about to change link properties we have to make sure that statistics are not operational. Statistics would be re-enabled [i.e., gathering of statistics would re-commence] once physical link is achieved again. Since driver employs a link-flap avoidance scheme, there are scenarios where driver will receive no indication that the new link is up, and as a result the statistics would not be re-enabled. Preventing LFA from working in such cases would guarantee that we'll always receive such indications and thus will fix statistics gathering. Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	bnx2x: Fix self-test for 20g devices	Yuval Mintz	2015-06-25	1	-4/+8
	20g-capable devices are not configured properly for self-test, using 10g as their speed, which causes the link indication to remain down and fails the internal loopback test. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	bnx2x: Fix VF MAC removal	Shahed Shaikh	2015-06-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There's a bug in today's driver where VF requests to add/remove MAC filters always reach the Hypervisor as add requests. This prevents the VF from changing its MAC address, as it cannot remove the previously configured MAC and runs out of MAC credits. Signed-off-by: Shahed Shaikh <Shahed.Shaikh@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	bnx2x: Don't notify about scratchpad parities	Manish Chopra	2015-06-25	2	-10/+21
	The scratchpad is a shared block between all functions of a given device. Due to HW limitations, we can't properly close its parity notifications to all functions on legal flows. E.g., it's possible that while taking a register dump from one function a parity error would be triggered on other functions. Today driver doesn't consider this parity as a 'real' parity unless it is accompanied by additional indications [which would happen in a real parity scenario]; but it does print notifications for such events in the system logs. This eliminates such prints - in case of real parities driver would have additional indications; but if this is the only signal, user will not even see a parity being logged in the system. Signed-off-by: Manish Chopra <Manish.Chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	bnx2x: Prevent false warning when accessing MACs	Yuval Mintz	2015-06-25	1	-1/+1
	Each time a flow finishes reads from the classification shadow configuration in the driver, that flow would check for pending commands and pass them to FW if possible. In case there's already a completion pending command, i.e., a ramrod that has been sent to the FW and is yet to be completed while said flow tries to configure the pending command, we would get a false error message in logs [and panic if SOE was used for driver compilation] since the command could not have been completed. This prevents said print [and panic]; the pending command will be sent by the time the completion of the current sent command would arrive. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	bnx2x: Correct speed from baseT into KR.	Yuval Mintz	2015-06-25	3	-19/+59
	ethtool shows KR supported/advertised speeds incorrectly as baseT in cases where the board is in fact KR-based. Signed-off-by: Yaniv Rosner <Yaniv.Rosner@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	bnx2x: Correct asymmetric flow-control	Yuval Mintz	2015-06-25	2	-14/+29
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This fixes several issues relating to asymmetric configuration: 1. When user requests to disable TX, the local-device needs to advertise both PAUSE and ASM_DIR, but to avoid transmitting pause frames. In the 578xx, it would ignore the TX disable. 2. When user advertises RX-only, ASM_DIR was advertised instead of PAUSE/ASM_DIR. 3. When changing mode, the advertised PAUSE/ASM_DIR was not cleared before setting new one, so disabling RX or TX had no impact on the 'advertised' as appeared in the 'ethtool -a' output. Signed-off-by: Yaniv Rosner <Yaniv.Rosner@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
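Points 2 and 3 above can be sketched as a small helper that first clears the old pause bits and then sets the new ones. This is a hedged illustration, not the bnx2x code: `ADV_PAUSE`, `ADV_ASM_DIR` and `update_pause_adv()` are hypothetical names with arbitrary bit values. The RX-only case follows the commit message; the RX+TX and TX-only cases use the conventional IEEE 802.3 pause encoding and are an assumption about what the driver does.

```c
#include <stdbool.h>

/* Hypothetical bit values, chosen only for illustration. */
#define ADV_PAUSE	0x1u
#define ADV_ASM_DIR	0x2u

/* Compute the pause bits to advertise for a requested RX/TX setting. */
unsigned int update_pause_adv(unsigned int adv, bool rx, bool tx)
{
	/* Clear the previous advertisement first (issue 3 above). */
	adv &= ~(ADV_PAUSE | ADV_ASM_DIR);

	if (rx && tx)
		adv |= ADV_PAUSE;			/* symmetric pause */
	else if (rx)
		adv |= ADV_PAUSE | ADV_ASM_DIR;		/* RX only (issue 2) */
	else if (tx)
		adv |= ADV_ASM_DIR;			/* TX only */

	return adv;
}
```

Because the old bits are masked out before the new ones are set, switching from RX-only to fully disabled leaves nothing advertised, which is the behaviour `ethtool -a` would be expected to reflect.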
*	net: sched: flower fix typo	Jamal Hadi Salim	2015-06-25	1	-2/+2
\| \| \| \| \| \| \|	Fix typo in the validation rules for flower's attributes Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	enic: use atomic_t instead of spin_lock in busy poll	Govindarajulu Varadarajan	2015-06-25	2	-66/+29
	We use a spinlock to access a single flag. We can avoid spin_locks by using an atomic variable and atomic_cmpxchg(). Use atomic_cmpxchg to set the flag for idle to poll. And a simple atomic_set to unlock (set idle from poll). In napi poll, if gro is enabled, we call napi_gro_receive() to deliver the packets. Before we call napi_complete(), i.e. while re-polling, if low latency busy poll is called, we use netif_receive_skb() to deliver the packets. At this point if there are some skb's held in GRO, busy poll could deliver the packets out of order. So we call napi_gro_flush() to flush skbs before we move the napi poll to idle. Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
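The locking scheme described above can be approximated in user space with C11 atomics. This is a sketch under stated assumptions rather than the enic code: it uses `atomic_compare_exchange_strong()` from `<stdatomic.h>` in place of the kernel's atomic_cmpxchg(), and the state names and `struct rq` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical poll states for one receive queue. */
enum { POLL_IDLE = 0, POLL_NAPI = 1, POLL_BUSY = 2 };

struct rq {
	atomic_int poll_state;
};

void rq_init(struct rq *rq)
{
	atomic_init(&rq->poll_state, POLL_IDLE);
}

/* Take ownership only if the queue is idle; never blocks or spins. */
bool rq_try_lock(struct rq *rq, int owner)
{
	int expected = POLL_IDLE;

	return atomic_compare_exchange_strong(&rq->poll_state, &expected, owner);
}

/* Releasing is a plain store back to idle. */
void rq_unlock(struct rq *rq)
{
	atomic_store(&rq->poll_state, POLL_IDLE);
}
```

Whichever poller wins the compare-and-exchange owns the queue until it stores the idle state again; the loser sees the call fail and can back off without ever holding a lock.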
*	net/phy: Add Vitesse 8641 phy ID	Shaohui Xie	2015-06-25	1	-0/+14
\| \| \| \| \| \| \|	Vitesse VSC8641 is compatible with Vitesse 82xx Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net/fsl: remove dependency FSL_SOC for Gianfar	Alison Wang	2015-06-25	1	-2/+2
	CONFIG_GIANFAR does not depend on FSL_SOC; it can be built on non-PPC platforms. Signed-off-by: Alison Wang <alison.wang@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	cavium/liquidio: fix some error handling in lio_set_phys_id()	Dan Carpenter	2015-06-25	1	-2/+3
\| \| \| \| \| \| \| \| \|	There was a missing assignment so the "if (ret)" on the next line is never true. Fixes: f21fb3ed364b ('Add support of Cavium Liquidio ethernet adapters') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	renesas: missing unlock on error path	Dan Carpenter	2015-06-25	1	-1/+3
\| \| \| \| \| \| \| \| \|	We need to unlock before returning here. Fixes: a0d2f20650e8 ('Renesas Ethernet AVB PTP clock driver') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'mlx4'	David S. Miller	2015-06-25	5	-25/+26
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Or Gerlitz says: ==================== mlx4 driver fixes, June 24, 2015 Some fixes that we made recently, all need to go into stable. patch #1 "net/mlx4_en: Release TX QP when destroying TX ring" and patch #3 "Fix wrong csum complete report when rxvlan offload is disabled" to >= 3.19 patch #2 "Wake TX queues only when there's enough room" addressing a bug which is there from day one... should go to whatever kernels it's still applicable patch #4 "mlx4: Disable HA for SRIOV PF RoCE devices" to >= 4.0 The patches are marked with net but are made against net-next, as the net tree still doesn't contain all the net-next bits. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	mlx4: Disable HA for SRIOV PF RoCE devices	Or Gerlitz	2015-06-25	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When in HA mode, the driver exposes an IB (RoCE) device instance with only one port. Under SRIOV, the existing implementation doesn't go well with the PF RoCE driver's role of Special QPs Para-Virtualization, etc. As such, disable HA for the mlx4 PF RoCE device in SRIOV mode. Fixes: a57500903093 ('IB/mlx4: Add port aggregation support') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net/mlx4_en: Fix wrong csum complete report when rxvlan offload is disabled	Ido Shamay	2015-06-25	1	-11/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The check_csum() function relied on hwtstamp_rx_filter to know if rxvlan offload is disabled. This is wrong since rxvlan offload can be switched on/off regardless of hwtstamp_rx_filter. Also moved check_csum to query CQE information to identify VLAN packets and removed the check of IP packets, since it has been validated before. Fixes: f8c6455bb04b ('net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE') Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net/mlx4_en: Wake TX queues only when there's enough room	Ido Shamay	2015-06-25	2	-8/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A single completed packet, indicated by txbbs_skipped being greater than zero, is not enough to wake up a stopped TX queue. The completed packet may contain a single TXBB, while the next packet to be sent (after the wake up) may have multiple TXBBs (LSO/TSO packets for example), causing overflow in the queue followed by WQE corruption and TX queue timeout. Instead, wake the stopped queue only when there's enough room for the worst case (maximum sized WQE) packet that we should need to handle after the queue is opened again. Also created a helper routine, mlx4_en_is_tx_ring_full, which checks whether the current TX ring is full. It provides better code readability and removes code duplication. Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
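The wake-up rule described above can be sketched as follows. This is a simplified model under stated assumptions, not the driver's code: the struct `tx_ring`, its fields, and both functions are hypothetical, and the real ring accounting may reserve additional headroom.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring bookkeeping, counted in TXBBs. */
struct tx_ring {
    uint32_t prod;        /* TXBBs posted so far */
    uint32_t cons;        /* TXBBs completed so far */
    uint32_t size;        /* ring capacity in TXBBs */
    uint32_t max_wqe_bbs; /* worst-case WQE size in TXBBs */
    bool stopped;         /* queue currently stopped */
};

/* The ring counts as full when a worst-case WQE would no longer fit.
 * Unsigned subtraction keeps prod - cons correct across wraparound. */
static bool tx_ring_is_full(const struct tx_ring *r)
{
    uint32_t in_flight = r->prod - r->cons;

    assert(r->max_wqe_bbs <= r->size);
    return in_flight > r->size - r->max_wqe_bbs;
}

/* Completion path: wake the queue only when there is room for a
 * maximum sized WQE, not merely because one packet completed. */
static void tx_complete(struct tx_ring *r, uint32_t bbs_done)
{
    assert(bbs_done <= r->prod - r->cons);
    r->cons += bbs_done;
    if (r->stopped && !tx_ring_is_full(r))
        r->stopped = false;
}
```

With a 16-TXBB ring and a 4-TXBB worst case, completing a single TXBB on a full ring leaves the queue stopped; it is woken only once at least 4 TXBBs are free.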
\| *	net/mlx4_en: Release TX QP when destroying TX ring	Eran Ben Elisha	2015-06-25	3	-5/+1
\|/ \| \| \| \| \| \| \| \| \| \|	TX ring QP wasn't released at mlx4_en_destroy_tx_ring. Instead, the code used the deprecated base_tx_qpn field. Move TX QP release to mlx4_en_destroy_tx_ring and remove the base_tx_qpn field. Fixes: ddae0349fdb7 ('net/mlx4: Change QP allocation scheme') Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds	2015-06-24	151	-1321/+2277
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Merge first patchbomb from Andrew Morton: - a few misc things - ocfs2 updates - kernel/watchdog.c feature work (took ages to get right) - most of MM. A few tricky bits are held up and probably won't make 4.2. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (91 commits) mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc() mm, thp: respect MPOL_PREFERRED policy with non-local node tmpfs: truncate prealloc blocks past i_size mm/memory hotplug: print the last vmemmap region at the end of hot add memory mm/mmap.c: optimization of do_mmap_pgoff function mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan mm: kmemleak: avoid deadlock on the kmemleak object insertion error path mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup() mm: kmemleak: fix delete_object_*() race when called on the same memory block mm: kmemleak: allow safe memory scanning during kmemleak disabling memcg: convert mem_cgroup->under_oom from atomic_t to int memcg: remove unused mem_cgroup->oom_wakeups frontswap: allow multiple backends x86, mirror: x86 enabling - find mirrored memory ranges mm/memblock: allocate boot time data structures from mirrored memory mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute mm: do not ignore mapping_gfp_mask in page cache allocation paths mm/cma.c: fix typos in comments mm/oom_kill.c: print points as unsigned int mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages ...
\| *	mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()	Larry Finger	2015-06-24	3	-7/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Beginning at commit d52d3997f843 ("ipv6: Create percpu rt6_info"), the following INFO splat is logged: =============================== [ INFO: suspicious RCU usage. ] 4.1.0-rc7-next-20150612 #1 Not tainted ------------------------------- kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 3 locks held by systemd/1: #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40 #1: (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540 #2: (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540 stack backtrace: CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1 Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014 Call Trace: dump_stack+0x4c/0x6e lockdep_rcu_suspicious+0xe7/0x120 ___might_sleep+0x1d5/0x1f0 __might_sleep+0x4d/0x90 kmem_cache_alloc+0x47/0x250 create_object+0x39/0x2e0 kmemleak_alloc_percpu+0x61/0xe0 pcpu_alloc+0x370/0x630 Additional backtrace lines are truncated. In addition, the above splat is followed by several "BUG: sleeping function called from invalid context at mm/slub.c:1268" outputs. As suggested by Martin KaFai Lau, these are the clue to the fix. Routine kmemleak_alloc_percpu() always uses GFP_KERNEL for its allocations, whereas it should follow the gfp from its callers. 
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@vger.kernel.org> [3.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
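The idea behind the kmemleak fix, forwarding the caller's gfp mask instead of hard-coding GFP_KERNEL, can be modelled in userspace. Everything in this sketch is a stand-in: the `MOCK_GFP_*` values, the `in_atomic_section` flag, and the function names are hypothetical and only mimic the "sleeping function called from invalid context" check.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for kernel gfp flags; the values are illustrative only. */
typedef unsigned int gfp_mask_t;
#define MOCK_GFP_ATOMIC 0x1u /* caller must not sleep */
#define MOCK_GFP_KERNEL 0x2u /* allocation may sleep */

/* Mock context state: true while inside a section that must not sleep,
 * such as an RCU-bh read-side critical section. */
static bool in_atomic_section;
static int sleeping_violations;

/* Mock allocator: records a violation when a sleeping allocation is
 * requested from a context that must not sleep. */
static void mock_alloc(gfp_mask_t gfp)
{
    if ((gfp & MOCK_GFP_KERNEL) && in_atomic_section)
        sleeping_violations++;
}

/* Before the fix: the tracking hook ignores the caller's context. */
static void track_percpu_old(void)
{
    mock_alloc(MOCK_GFP_KERNEL);
}

/* After the fix: the hook forwards the mask chosen by its caller. */
static void track_percpu_new(gfp_mask_t gfp)
{
    assert(gfp != 0);
    mock_alloc(gfp);
}
```

In this model, calling `track_percpu_old()` inside an atomic section records a violation, while `track_percpu_new(MOCK_GFP_ATOMIC)` does not.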
\| *	mm, thp: respect MPOL_PREFERRED policy with non-local node	Vlastimil Babka	2015-06-24	1	-16/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node"), we handle THP allocations on page fault in a special way - for non-interleave memory policies, the allocation is only attempted on the node local to the current CPU, if the policy's nodemask allows the node. This is motivated by the assumption that THP benefits cannot offset the cost of remote accesses, so it's better to fallback to base pages on the local node (which might still be available, while huge pages are not due to fragmentation) than to allocate huge pages on a remote node. The nodemask check prevents us from violating e.g. MPOL_BIND policies where the local node is not among the allowed nodes. However, the current implementation can still give surprising results for the MPOL_PREFERRED policy when the preferred node is different than the current CPU's local node. In such case we should honor the preferred node and not use the local node, which is what this patch does. If hugepage allocation on the preferred node fails, we fall back to base pages and don't try other nodes, with the same motivation as is done for the local node hugepage allocations. The patch also moves the MPOL_INTERLEAVE check around to simplify the hugepage specific test. The difference can be demonstrated using in-tree transhuge-stress test on the following 2-node machine where half memory on one node was occupied to show the difference. 
> numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35 node 0 size: 7878 MB node 0 free: 3623 MB node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47 node 1 size: 8045 MB node 1 free: 7818 MB node distances: node 0 1 0: 10 21 1: 21 10 Before the patch: > numactl -p0 -C0 ./transhuge-stress transhuge-stress: 2.197 s/loop, 0.276 ms/page, 7249.168 MiB/s 7962 succeed, 0 failed, 1786 different pages > numactl -p0 -C12 ./transhuge-stress transhuge-stress: 2.962 s/loop, 0.372 ms/page, 5376.172 MiB/s 7962 succeed, 0 failed, 3873 different pages Number of successful THP allocations corresponds to free memory on node 0 in the first case and node 1 in the second case, i.e. -p parameter is ignored and cpu binding "wins". After the patch: > numactl -p0 -C0 ./transhuge-stress transhuge-stress: 2.183 s/loop, 0.274 ms/page, 7295.516 MiB/s 7962 succeed, 0 failed, 1760 different pages > numactl -p0 -C12 ./transhuge-stress transhuge-stress: 2.878 s/loop, 0.361 ms/page, 5533.638 MiB/s 7962 succeed, 0 failed, 1750 different pages > numactl -p1 -C0 ./transhuge-stress transhuge-stress: 4.628 s/loop, 0.581 ms/page, 3440.893 MiB/s 7962 succeed, 0 failed, 3918 different pages The -p parameter is respected regardless of cpu binding. > numactl -C0 ./transhuge-stress transhuge-stress: 2.202 s/loop, 0.277 ms/page, 7230.003 MiB/s 7962 succeed, 0 failed, 1750 different pages > numactl -C12 ./transhuge-stress transhuge-stress: 3.020 s/loop, 0.379 ms/page, 5273.324 MiB/s 7962 succeed, 0 failed, 3916 different pages Without -p parameter, hugepage restriction to CPU-local node works as before. Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Kirill A. 
Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
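The node selection rule this commit message describes can be summarised in a small sketch. It is a simplified model, assuming a policy reduced to a mode, a preferred node, and a bitmask of allowed nodes; the types and the function `thp_target_node()` are hypothetical and do not reproduce the kernel's `struct mempolicy`.

```c
#include <assert.h>

/* Simplified policy modes, named after the kernel's MPOL_* modes. */
enum mpol_mode { MODE_DEFAULT, MODE_PREFERRED, MODE_BIND, MODE_INTERLEAVE };

/* Hypothetical, reduced policy description. */
struct mempolicy_sketch {
    enum mpol_mode mode;
    int preferred_node;     /* meaningful only for MODE_PREFERRED */
    unsigned long nodemask; /* bit n set = node n allowed; 0 = no restriction */
};

/* Returned when the hugepage-specific path does not apply and the
 * generic allocation path should be used instead. */
#define NO_HUGEPAGE_NODE (-1)

/* Pick the single node on which a THP allocation is attempted.
 * Interleave policies skip the special case. A preferred node overrides
 * the CPU-local node. The nodemask check keeps policies such as
 * MPOL_BIND from being violated. */
static int thp_target_node(const struct mempolicy_sketch *pol, int local_node)
{
    int node = local_node;

    assert(local_node >= 0 && local_node < (int)(8 * sizeof(unsigned long)));

    if (pol->mode == MODE_INTERLEAVE)
        return NO_HUGEPAGE_NODE;

    if (pol->mode == MODE_PREFERRED)
        node = pol->preferred_node;

    if (pol->nodemask != 0 && !(pol->nodemask & (1UL << node)))
        return NO_HUGEPAGE_NODE;

    return node;
}
```

This matches the measurements quoted above: with `-p0 -C12` the preferred node 0 is chosen even though the CPU is local to node 1, and without `-p` the CPU-local node is used as before.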