Commit message   Author   Age   Files   Lines
* powerpc/powernv Adapt opal-elog and opal-dump to new sysfs_remove_file_selfStewart Smith2014-04-092-14/+4
| | | | | | | | We are currently using sysfs_schedule_callback() which is deprecated and about to be removed. Switch to the new interface instead. Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Revert "powerpc/powernv: hwmon driver for power values, fan rpm and temperature"Benjamin Herrenschmidt2014-04-093-538/+0
| This reverts commit 0de7f8a917b5202014430e0055c0e1db0348bd62. This driver wasn't merged via the proper maintainers (my fault ... ooops!) and has serious issues, so let's take it out for now and have a new, better one merged the right way. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* power, sched: stop updating inside arch_update_cpu_topology() when nothing to be updateMichael Wang2014-04-091-0/+15
| Since v1:
|   Edited the comment according to Srivatsa's suggestion.
|
| During testing, we encountered the WARN below, followed by an Oops:
|
|   WARNING: at kernel/sched/core.c:6218
|   ...
|   NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
|   LR [c000000000101358] .build_sched_domains+0xec8/0x1200
|   PACATMSCRATCH [800000000000f032]
|   Call Trace:
|   [c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
|   [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
|   [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
|   [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
|   ...
|   Oops: Kernel access of bad area, sig: 11 [#1]
|   ...
|   NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
|   LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
|   PACATMSCRATCH [8000000000029032]
|   Call Trace:
|   [c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
|   [c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
|   [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
|   [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
|   [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
|   ...
|
| This happened because 'sd->groups == NULL' after building the groups, which in turn was caused by an empty 'sd->span'.
|
| The cpu's domain contained nothing because the cpu was assigned to a wrong node, due to the following unfortunate sequence of events:
|
| 1. The hypervisor sent a topology update to the guest OS, to notify changes to the cpu-node mapping. However, the update was actually redundant - i.e., the "new" mapping was exactly the same as the old one.
|
| 2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting the 'for-loop' in arch_update_cpu_topology().
|
| 3. So we ended up calling stop_machine() with an empty cpumask list, which made stop_machine() internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as the cpu to run the payload (the update_cpu_topology() function).
|
| 4. This causes update_cpu_topology() to be run by CPU0. And since 'updates' is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology() finds update->cpu as well as update->new_nid to be 0. In other words, we end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.
|
| Together with the incorrect updates that follow, this causes the sched-domain rebuild code to break and crash the system.
|
| Fix this by skipping the topology update in cases where we find that the topology has not actually changed (i.e., spurious updates).
|
| CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| CC: Paul Mackerras <paulus@samba.org>
| CC: Nathan Fontenot <nfont@linux.vnet.ibm.com>
| CC: Stephen Rothwell <sfr@canb.auug.org.au>
| CC: Andrew Morton <akpm@linux-foundation.org>
| CC: Robert Jennings <rcj@linux.vnet.ibm.com>
| CC: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
| CC: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
| CC: Alistair Popple <alistair@popple.id.au>
| Suggested-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
| Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
| Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
| Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
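The fix in the commit above amounts to an early return when no CPU actually changed node. A minimal user-space sketch in C of that idea follows; it is not the kernel code. A 64-bit mask stands in for the kernel's cpumask, and `NR_CPUS`, `struct topology`, `collect_updates` and `apply_topology_update` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 8 /* illustrative size, not taken from the commit */

/* Hypothetical stand-in for the cpu-to-node mapping. */
struct topology {
	int node_of[NR_CPUS];
};

/*
 * Build a mask of CPUs whose node really changes.
 * Bit i is set only when new_node[i] differs from the current mapping.
 */
static uint64_t collect_updates(const struct topology *t, const int *new_node)
{
	uint64_t updated = 0;

	for (size_t cpu = 0; cpu < NR_CPUS; cpu++)
		if (t->node_of[cpu] != new_node[cpu])
			updated |= UINT64_C(1) << cpu;
	return updated;
}

/*
 * Returns 1 if the topology was changed, 0 if the update was spurious.
 * The early return is the point of the sketch: with an empty mask there
 * is nothing to hand to the expensive update step, so it is skipped.
 */
int apply_topology_update(struct topology *t, const int *new_node)
{
	uint64_t updated;

	assert(t != NULL && new_node != NULL);

	updated = collect_updates(t, new_node);
	if (updated == 0)
		return 0; /* redundant notification: nothing to do */

	for (size_t cpu = 0; cpu < NR_CPUS; cpu++)
		if (updated & (UINT64_C(1) << cpu))
			t->node_of[cpu] = new_node[cpu];
	return 1;
}
```

Calling `apply_topology_update()` twice with the same mapping changes the state at most once; the second call takes the early-return path.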
* powerpc/le: Avoid creatng R_PPC64_TOCSAVE relocations for modules.Tony Breeds2014-04-091-0/+1
| When building modules with a native le toolchain, the linker will generate R_PPC64_TOCSAVE relocations when it's safe to omit saving r2 on a plt call. This isn't helpful in the context of a kernel module, and the kernel will fail to load those modules with an error like: nf_conntrack: Unknown ADD relocation: 109 This patch tells the linker to avoid creating R_PPC64_TOCSAVE relocations, allowing modules to load. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* arch/powerpc: Use RCU_INIT_POINTER(x, NULL) in platforms/cell/spu_syscalls.cMonam Agarwal2014-04-091-1/+1
| | | | | | | | | | | Here rcu_assign_pointer() is ensuring that the initialization of a structure is carried out before storing a pointer to that structure. So, rcu_assign_pointer(p, NULL) can always safely be converted to RCU_INIT_POINTER(p, NULL). Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
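The ordering argument in the commit above can be illustrated outside the kernel with C11 atomics. This is an analogy only, not the kernel's RCU implementation: a release store plays the role of rcu_assign_pointer(), a relaxed store plays the role of RCU_INIT_POINTER(p, NULL), and `struct callbacks`, `publish_callbacks`, `clear_callbacks` and `get_callbacks` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical structure that readers reach through a shared pointer. */
struct callbacks {
	int version;
};

/* Shared pointer; a stand-in for an RCU-protected pointer. */
static _Atomic(struct callbacks *) current_callbacks;

/*
 * Publishing a non-NULL pointer: release ordering makes the earlier
 * initialisation of *cb visible before the pointer itself.
 */
void publish_callbacks(struct callbacks *cb, int version)
{
	assert(cb != NULL);
	cb->version = version;
	atomic_store_explicit(&current_callbacks, cb, memory_order_release);
}

/*
 * Clearing the pointer: there is no structure whose initialisation has
 * to be ordered before the store, so a relaxed store is sufficient.
 */
void clear_callbacks(void)
{
	atomic_store_explicit(&current_callbacks, (struct callbacks *)NULL,
			      memory_order_relaxed);
}

/* Reader side: acquire pairs with the release in publish_callbacks(). */
struct callbacks *get_callbacks(void)
{
	return atomic_load_explicit(&current_callbacks, memory_order_acquire);
}
```

The sketch only shows why storing NULL needs no ordering; it does not model grace periods or reclamation.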
* powerpc/opal: Add missing includeMichael Neuling2014-04-091-0/+2
| | | | | | | | | | | | | next-20140324 currently fails compiling celleb_defconfig with: arch/powerpc/include/asm/opal.h:894:42: error: 'struct notifier_block' declared inside parameter list [-Werror] arch/powerpc/include/asm/opal.h:894:42: error: its scope is only this definition or declaration, which is probably not what you want [-Werror] arch/powerpc/include/asm/opal.h:896:14: error: 'struct notifier_block' declared inside parameter list [-Werror] This is due to a missing include which is added here. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Convert last uses of __FUNCTION__ to __func__Joe Perches2014-04-091-6/+5
| | | | | | | | Just about all of these have been converted to __func__, so convert the last uses. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Add lq/stq emulationAnton Blanchard2014-04-093-8/+46
| | | | | | | | Recent CPUs support quad word load and store instructions. Add support to the alignment handler for them. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/powernv: Add invalid OPAL callJoel Stanley2014-04-093-0/+6
| This call will not be understood by OPAL, and will cause it to add an error to its log. Among other things, this is useful for testing the behaviour of the log as it fills up. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/powernv: Add OPAL message log interfaceJoel Stanley2014-04-094-1/+128
| | | | | | | | | | | OPAL provides an in-memory circular buffer containing a message log populated with various runtime messages produced by the firmware. Provide a sysfs interface /sys/firmware/opal/msglog for userspace to view the messages. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/book3s: Fix mc_recoverable_range buffer overrun issue.Mahesh Salgaonkar2014-04-091-8/+20
| | | | | | | | | | | | | | | Currently we wrongly allocate mc_recoverable_range buffer (to hold recoverable ranges) based on size of the property "mcheck-recoverable-ranges". This results in allocating less memory to hold available recoverable range entries from /proc/device-tree/ibm,opal/mcheck-recoverable-ranges. This patch fixes this issue by allocating mc_recoverable_range buffer based on number of entries of recoverable ranges instead of device property size. Without this change we end up allocating less memory and run into memory corruption issue. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
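The sizing mistake described in the commit above can be sketched in plain C. The layouts here are assumptions made for illustration, not taken from the commit text: a property entry of 5 cells (20 bytes) and an in-memory entry of three 64-bit fields (24 bytes). `PROP_ENTRY_BYTES`, `struct recoverable_range`, `range_count` and `alloc_ranges` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed size of one entry in the device-tree property: 5 cells of 4 bytes. */
#define PROP_ENTRY_BYTES 20u

/* Assumed in-memory entry: larger than the property entry it is built from. */
struct recoverable_range {
	uint64_t start_addr;
	uint64_t end_addr;
	uint64_t recover_addr;
};

/* Number of whole entries described by a property of prop_bytes bytes. */
size_t range_count(size_t prop_bytes)
{
	return prop_bytes / PROP_ENTRY_BYTES;
}

/*
 * Size the buffer by entry count, not by property size.
 * Allocating prop_bytes directly would provide 20 bytes per entry where
 * 24 are needed, which is the kind of overrun the commit describes.
 */
struct recoverable_range *alloc_ranges(size_t prop_bytes, size_t *count)
{
	assert(count != NULL);

	*count = range_count(prop_bytes);
	if (*count == 0)
		return NULL;
	return calloc(*count, sizeof(struct recoverable_range));
}
```

Under these assumed sizes, a 60-byte property describes 3 entries and needs 72 bytes of buffer, 12 more than the property size.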
* powerpc: Remove dead code in sycall entryMichael Neuling2014-04-091-8/+0
| In commit 742415d6b66bf09e3e73280178ef7ec85c90b7ee (Author: Michael Neuling <mikey@neuling.org>, "powerpc: Turn syscall handler into macros") we converted the syscall entry code into macros, but in doing so we introduced some cruft that's never run and should never have been added. This removes that code. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Use of_node_init() for the fakenode in msi_bitmap.cLi Zhong2014-04-091-1/+1
| This patch uses of_node_init() to initialize the kobject in the fake node used in test_of_node(), to avoid the following kobject warning.
|
| [ 0.897654] kobject: '(null)' (c0000007ca183a08): is not initialized, yet kobject_put() is being called.
| [ 0.897682] ------------[ cut here ]------------
| [ 0.897688] WARNING: at lib/kobject.c:670
| [ 0.897692] Modules linked in:
| [ 0.897701] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 3.14.0+ #1
| [ 0.897708] task: c0000007ca100000 ti: c0000007ca180000 task.ti: c0000007ca180000
| [ 0.897715] NIP: c00000000046a1f0 LR: c00000000046a1ec CTR: 0000000001704660
| [ 0.897721] REGS: c0000007ca1835c0 TRAP: 0700 Not tainted (3.14.0+)
| [ 0.897727] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28000024 XER: 0000000d
| [ 0.897749] CFAR: c0000000008ef4ec SOFTE: 1
| GPR00: c00000000046a1ec c0000007ca183840 c0000000014c59b8 000000000000005c
| GPR04: 0000000000000001 c000000000129770 0000000000000000 0000000000000001
| GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000003fef
| GPR12: 0000000000000000 c00000000f221200 c00000000000c350 0000000000000000
| GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
| GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
| GPR24: 0000000000000000 c00000000144e808 c000000000c56f20 00000000000000d8
| GPR28: c000000000cd5058 0000000000000000 c000000001454ca8 c0000007ca183a08
| [ 0.897856] NIP [c00000000046a1f0] .kobject_put+0xa0/0xb0
| [ 0.897863] LR [c00000000046a1ec] .kobject_put+0x9c/0xb0
| [ 0.897868] Call Trace:
| [ 0.897874] [c0000007ca183840] [c00000000046a1ec] .kobject_put+0x9c/0xb0 (unreliable)
| [ 0.897885] [c0000007ca1838c0] [c000000000743f9c] .of_node_put+0x2c/0x50
| [ 0.897894] [c0000007ca183940] [c000000000c83954] .test_of_node+0x1dc/0x208
| [ 0.897902] [c0000007ca183b80] [c000000000c839a4] .msi_bitmap_selftest+0x24/0x38
| [ 0.897913] [c0000007ca183bf0] [c00000000000bb34] .do_one_initcall+0x144/0x200
| [ 0.897922] [c0000007ca183ce0] [c000000000c748e4] .kernel_init_freeable+0x2b4/0x394
| [ 0.897931] [c0000007ca183db0] [c00000000000c374] .kernel_init+0x24/0x130
| [ 0.897940] [c0000007ca183e30] [c00000000000a2f4] .ret_from_kernel_thread+0x5c/0x68
| [ 0.897947] Instruction dump:
| [ 0.897952] 7fe3fb78 38210080 e8010010 ebe1fff8 7c0803a6 4800014c e89f0000 3c62ff6e
| [ 0.897971] 7fe5fb78 3863a950 48485279 60000000 <0fe00000> 39000000 393f0038 4bffff80
| [ 0.897992] ---[ end trace 1eeffdb9f825a556 ]---
|
| Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
| Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/mm: NUMA pte should be handled via slow path in get_user_pages_fast()Aneesh Kumar K.V2014-04-091-0/+13
| | | | | | | We need to handle numa pte via the slow path Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/powernv: Fix endian issues with sensor codeAnton Blanchard2014-04-092-3/+4
| | | | | | | One OPAL call and one device tree property needed byte swapping. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/powernv: Fix endian issues with OPAL async codeAnton Blanchard2014-04-075-13/+16
| | | | | | | | OPAL defines opal_msg as a big endian struct so we have to byte swap it on little endian builds. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
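The byte swapping mentioned in the commit above can be sketched portably in C. This is an illustration only, not the kernel's accessors: `be32_to_host` and `be64_to_host` are hypothetical helpers that read a big-endian field byte by byte, so they give the same result on little-endian and big-endian hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value from memory, independent of host order. */
uint32_t be32_to_host(const unsigned char *p)
{
	assert(p != NULL);
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a 64-bit big-endian value from memory, independent of host order. */
uint64_t be64_to_host(const unsigned char *p)
{
	uint64_t v = 0;

	assert(p != NULL);
	/* The most significant byte comes first in big-endian layout. */
	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}
```

Reading a firmware-defined struct through accessors like these, rather than casting its fields directly, is what keeps a little-endian build from misinterpreting them.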
* tty/hvc_opal: Kick the HVC thread on OPAL console eventsBenjamin Herrenschmidt2014-04-071-1/+21
| | | | | | | The firmware can notify us when new input data is available, so let's make sure we wakeup the HVC thread in that case. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/powernv: Add opal_notifier_unregister() and export to modulesBenjamin Herrenschmidt2014-04-072-0/+16
| | | | | | | opal_notifier_register() is missing a pending "unregister" variant and should be exposed to modules. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/ppc64: Do not turn AIL (reloc-on interrupts) too earlyBenjamin Herrenschmidt2014-04-072-5/+15
| | | | | | | | Turn them on at the same time as we allow MSR_IR/DR in the paca kernel MSR, ie, after the MMU has been setup enough to be able to handle relocated access to the linear mapping. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/ppc64: Gracefully handle early interruptsBenjamin Herrenschmidt2014-04-072-1/+17
| If we take an interrupt such as a trap caused by a BUG_ON before the MMU has been set up, the interrupt handlers try to enable virtual mode and cause a recursive crash, making the original problem very hard to debug. This fixes it by adjusting the "kernel_msr" value in the PACA so that it only has MSR_IR and MSR_DR (translation for instruction and data) set after the MMU has been initialized for the processor. We may still not have a console yet but at least we don't get into a recursive fault (and early debug console or memory dump via JTAG of the kernel buffer *will* give us the proper error). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/prom: early_init_dt_scan_cpus() updates cpu features only onceBenjamin Herrenschmidt2014-04-071-26/+26
| | | | | | | | All our cpu feature updates were done for every CPU in the device-tree, thus overwriting the cputable bits over and over again. Instead do them only for the boot CPU. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Make boot_cpuid common between 32 and 64-bitBenjamin Herrenschmidt2014-04-074-3/+7
| Move the definition to setup-common.c and set the init value to -1 on both 32 and 64-bit (it was 0 on 64-bit). Additionally add a check to prom.c to guarantee that the init value has been updated after the DT scan. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Adjust CPU_FTR_SMT on all platformsBenjamin Herrenschmidt2014-04-071-1/+1
| | | | | | | For historical reasons that code was under #ifdef CONFIG_PPC_PSERIES but it applies equally to all 64-bit platforms. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/tm: Disable IRQ in tm_recheckpointMichael Neuling2014-04-074-7/+33
| We can't take an IRQ when we're about to do a trechkpt as our GPR state is set to user GPR values.
|
| We've hit this when running some IBM Java stress tests in the lab resulting in the following dump:
|
|   cpu 0x3f: Vector: 700 (Program Check) at [c000000007eb3d40]
|       pc: c000000000050074: restore_gprs+0xc0/0x148
|       lr: 00000000b52a8184
|       sp: ac57d360
|      msr: 8000000100201030
|     current = 0xc00000002c500000
|     paca    = 0xc000000007dbfc00   softe: 0   irq_happened: 0x00
|       pid   = 34535, comm = Pooled Thread #
|   R00 = 00000000b52a8184   R16 = 00000000b3e48fda
|   R01 = 00000000ac57d360   R17 = 00000000ade79bd8
|   R02 = 00000000ac586930   R18 = 000000000fac9bcc
|   R03 = 00000000ade60000   R19 = 00000000ac57f930
|   R04 = 00000000f6624918   R20 = 00000000ade79be8
|   R05 = 00000000f663f238   R21 = 00000000ac218a54
|   R06 = 0000000000000002   R22 = 000000000f956280
|   R07 = 0000000000000008   R23 = 000000000000007e
|   R08 = 000000000000000a   R24 = 000000000000000c
|   R09 = 00000000b6e69160   R25 = 00000000b424cf00
|   R10 = 0000000000000181   R26 = 00000000f66256d4
|   R11 = 000000000f365ec0   R27 = 00000000b6fdcdd0
|   R12 = 00000000f66400f0   R28 = 0000000000000001
|   R13 = 00000000ada71900   R29 = 00000000ade5a300
|   R14 = 00000000ac2185a8   R30 = 00000000f663f238
|   R15 = 0000000000000004   R31 = 00000000f6624918
|   pc  = c000000000050074 restore_gprs+0xc0/0x148
|   cfar= c00000000004fe28 dont_restore_vec+0x1c/0x1a4
|   lr  = 00000000b52a8184
|   msr = 8000000100201030   cr  = 24804888
|   ctr = 0000000000000000   xer = 0000000000000000   trap = 700
|
| This moves tm_recheckpoint to a C function and moves the tm_restore_sprs into that function. It then adds IRQ disabling over the trechkpt critical section. It also sets the TEXASR FS in the signals code to ensure this is never set now that we explicitly write the TM sprs in tm_recheckpoint.
|
| Signed-off-by: Michael Neuling <mikey@neuling.org>
| cc: stable@vger.kernel.org
| Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/le: Enable RTAS events supportGreg Kurz2014-04-075-61/+128
| | | | | | | | | | | | | | | | The current kernel code assumes big endian and parses RTAS events all wrong. The most visible effect is that we cannot honor EPOW events, meaning, for example, we cannot shut down a guest properly from the hypervisor. This new patch is largely inspired by Nathan's work: we get rid of all the bit fields in the RTAS event structures (even the unused ones, for consistency). We also introduce endian safe accessors for the fields used by the kernel (trivial rtas_error_type() accessor added for consistency). Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Correct emulated mtfsf instructionStephen Chivers2014-04-071-35/+23
| The emulated (CONFIG_MATH_EMULATION_FULL) PowerPC floating-point instruction mtfsf does not correctly copy bits from its source register to the Floating-Point Status and Control Register (FPSCR). The error is in the preparation of the mask used to select the bits to be copied from the source to the FPSCR. Execution of the mtfsf instruction does not produce the same results on an MPC8548 platform (emulated floating point) as on MPC7410 or 440EP platforms (hardware floating point). This error has been detected using a Freescale MPC8548 based platform and the patch below tested using that platform. The patch is based on the patch(es) provided by Gabriel Paubert and analysis by Gabriel, James Yang and David Laight. Signed-off-by: Stephen Chivers <schivers@csc.com> Signed-off-by: Gabriel Paubert <paubert@iram.es> Tested-by: Stephen Chivers <schivers@csc.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
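The mask preparation that the commit above refers to can be sketched as follows. This is a simplified illustration, not the kernel's emulation code: it assumes that bit i of the 8-bit field mask (counting from the least significant bit) selects the i-th 4-bit field of the 32-bit FPSCR image, and it leaves out the special treatment of the FPSCR summary bits. `fm_to_mask` and `apply_mtfsf` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Expand an 8-bit field mask into a 32-bit bit mask:
 * each set bit i selects the nibble occupying bits 4*i .. 4*i+3.
 */
uint32_t fm_to_mask(unsigned int fm)
{
	uint32_t mask = 0;

	assert(fm <= 0xffu);
	for (unsigned int i = 0; i < 8; i++)
		if (fm & (1u << i))
			mask |= UINT32_C(0xf) << (4 * i);
	return mask;
}

/*
 * Copy only the selected fields from src into fpscr.
 * Summary-bit handling done by real hardware is intentionally omitted.
 */
uint32_t apply_mtfsf(uint32_t fpscr, uint32_t src, unsigned int fm)
{
	uint32_t mask = fm_to_mask(fm);

	return (fpscr & ~mask) | (src & mask);
}
```

With a field mask of 0x0f only the low 16 bits are taken from the source; the upper 16 bits of the old value are preserved.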
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tileLinus Torvalds2014-04-0615-20/+1295
|\
| | Pull arch/tile updates from Chris Metcalf:
| |  "These fix a few stray build issues seen in linux-next, and also add the minimal required support for perf to tilegx"
| |
| | * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
| |   arch/tile: remove unused variable 'devcap'
| |   tile: Fix vDSO compilation issue with allyesconfig
| |   perf tools: Allow building for tile
| |   tile/perf: Support perf_events on tilegx and tilepro
| |   tile: Enable NMIs on return from handle_nmi() without errors
| |   tile: Add support for handling PMC hardware
| |   tile: don't use __get_cpu_var() with structure-typed arguments
| |   tile: avoid overflow in ns2cycles
| * arch/tile: remove unused variable 'devcap'Chris Metcalf2014-04-041-2/+0
| | | | | | | | | | | | | | Commit 503275bf37 removed the use of the variable but not the variable itself. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
| * tile: Fix vDSO compilation issue with allyesconfigKerry Sheh2014-04-041-1/+1
| | make allyesconfig gives the following build error on tile: tilegx-linux-gcc: error: arch/tile/kernel/vdso/vgettimeofday32.o: No such file or directory tilegx-linux-objcopy: 'arch/tile/kernel/vdso/vdso32.so.dbg': No such file or directory With CONFIG_MODVERSIONS, cmd_cc_o_c only generates .tmp_<file>.o from <file>.c. Fix it by executing rule_cc_o_c instead. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Kerry Sheh <ksheh@tilera.com> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
| * perf tools: Allow building for tileZhigang Lu2014-03-072-1/+10
| | | | | | | | | | | | | | | | | | Tested by building perf: - Cross-compiled for tile on x86_64 - Built natively on tile Signed-off-by: Zhigang Lu <zlu@tilera.com> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
| * tile/perf: Support perf_events on tilegx and tileproZhigang Lu2014-03-075-0/+1048
| | | | | | | | | | | | | | Add perf support for tile architecture. Signed-off-by: Zhigang Lu <zlu@tilera.com> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
| * tile: Enable NMIs on return from handle_nmi() without errorsZhigang Lu2014-03-072-1/+21
| | | | | | | | | | | | | | | | | | | | NMI interrupts mask ALL interrupts before calling the handler, so we need to unmask NMIs according to the value handle_nmi() returns. If it returns zero, the NMIs should be re-enabled; if it returns a non-zero error, the NMIs should be disabled. Signed-off-by: Zhigang Lu <zlu@tilera.com> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
| * tile: Add support for handling PMC hardwareZhigang Lu2014-03-076-12/+204
| | | | | | | | | | | | | | The PMC module is used by perf_events, oprofile and watchdogs. Signed-off-by: Zhigang Lu <zlu@tilera.com> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
| * tile: don't use __get_cpu_var() with structure-typed argumentsChris Metcalf2014-03-061-2/+2
| | | | | | | | | | | | This no longer works with the new per-cpu infrastructure. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
| * tile: avoid overflow in ns2cyclesHenrik Austad2014-03-061-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 4cecf6d401a ("sched, x86: Avoid unnecessary overflow in sched_clock") and in recent patch "clocksource: avoid unnecessary overflow in cyclecounter_cyc2ns()" https://lkml.org/lkml/2014/3/4/17, the mult-shift approach is replaced by 2 steps to avoid storing a large, intermediate value that could overflow. arch/tile/kernel/time.c has a similar pattern in cycles2ns, and this copies the same pattern in this function CC: John Stultz <johnstul@us.ibm.com> CC: Mike Galbraith <bitbucket@online.de> CC: Salman Qazi <sqazi@google.com> Signed-off-by: Henrik Austad <henrik@austad.us> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
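The two-step scaling described in the commit above can be sketched in standalone C. This is an illustration of the technique, not the code from arch/tile: `scale_naive` and `scale_two_step` are hypothetical names, and `mult`/`shift` stand for the usual clock mult-shift pair.

```c
#include <assert.h>
#include <stdint.h>

/* One-step form: the intermediate product value * mult can wrap at 64 bits. */
uint64_t scale_naive(uint64_t value, uint32_t mult, unsigned int shift)
{
	assert(shift <= 32);
	return (value * mult) >> shift;
}

/*
 * Two-step form: split value into quotient and remainder with respect to
 * 2^shift. The remainder is below 2^32, so rem * mult fits in 64 bits, and
 * the large part of the value is never multiplied before being shifted.
 */
uint64_t scale_two_step(uint64_t value, uint32_t mult, unsigned int shift)
{
	uint64_t quot, rem;

	assert(shift <= 32);
	quot = value >> shift;
	rem = value & ((UINT64_C(1) << shift) - 1);
	return quot * mult + ((rem * mult) >> shift);
}
```

Both forms agree while the product fits in 64 bits. The two-step form stays correct for larger inputs, as long as the final result itself fits in 64 bits.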
* | Merge tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dmLinus Torvalds2014-04-0521-478/+2346
|\ \
| | | Pull device mapper changes from Mike Snitzer:
| | |
| | | - Fix dm-cache corruption caused by discard_block_size > cache_block_size
| | |
| | | - Fix a lock-inversion detected by LOCKDEP in dm-cache
| | |
| | | - Fix a dangling bio bug in the dm-thinp target's process_deferred_bios error path
| | |
| | | - Fix corruption due to non-atomic transaction commit which allowed a metadata superblock to be written before all other metadata was successfully written -- this is common to all targets that use the persistent-data library's transaction manager (dm-thinp, dm-cache and dm-era).
| | |
| | | - Various small cleanups in the DM core
| | |
| | | - Add the dm-era target which is useful for keeping track of which blocks were written within a user defined period of time called an 'era'. Use cases include tracking changed blocks for backup software, and partially invalidating the contents of a cache to restore cache coherency after rolling back a vendor snapshot.
| | |
| | | - Improve the on-disk layout of multithreaded writes to the dm-thin-pool by splitting the pool's deferred bio list to be a per-thin device list and then sorting that list using an rb_tree. The subsequent read throughput of the data written via multiple threads improved by ~70%.
| | |
| | | - Simplify the multipath target's handling of queuing IO by pushing requests back to the request queue rather than queueing the IO internally.
| | |
| | | * tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits)
| | |   dm cache: fix a lock-inversion
| | |   dm thin: sort the per thin deferred bios using an rb_tree
| | |   dm thin: use per thin device deferred bio lists
| | |   dm thin: simplify pool_is_congested
| | |   dm thin: fix dangling bio in process_deferred_bios error path
| | |   dm mpath: print more useful warnings in multipath_message()
| | |   dm-mpath: do not activate failed paths
| | |   dm mpath: remove extra nesting in map function
| | |   dm mpath: remove map_io()
| | |   dm mpath: reduce memory pressure when requeuing
| | |   dm mpath: remove process_queued_ios()
| | |   dm mpath: push back requests instead of queueing
| | |   dm table: add dm_table_run_md_queue_async
| | |   dm mpath: do not call pg_init when it is already running
| | |   dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind
| | |   dm: stop using bi_private
| | |   dm: remove dm_get_mapinfo
| | |   dm: make dm_table_alloc_md_mempools static
| | |   dm: take care to copy the space map roots before locking the superblock
| | |   dm transaction manager: fix corruption due to non-atomic transaction commit
| | |   ...
| * | dm cache: fix a lock-inversionJoe Thornber2014-04-043-52/+20
| | | When suspending a cache the policy is walked and the individual policy hints written to the metadata via sync_metadata(). This led to this lock order:
| | |
| | |   policy->lock
| | |     cache_metadata->root_lock
| | |
| | | When loading the cache target the policy is populated while the metadata lock is held:
| | |
| | |   cache_metadata->root_lock
| | |     policy->lock
| | |
| | | Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by ensuring the cache_metadata root_lock is held whilst all the hints are written, rather than being repeatedly locked while policy->lock is held (as was the case with each callout that policy_walk_mappings() made to the old save_hint() method).
| | |
| | | Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove locking correctness") build option. However, it is not clear how the LOCKDEP reported paths can lead to a deadlock since the two paths, suspending a target and loading a target, never occur at the same time. But that doesn't mean the same lock-inversion couldn't have occurred elsewhere.
| | |
| | | Reported-by: Marian Csontos <mcsontos@redhat.com>
| | | Signed-off-by: Joe Thornber <ejt@redhat.com>
| | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | | Cc: stable@vger.kernel.org
| * | dm thin: sort the per thin deferred bios using an rb_treeMike Snitzer2014-04-041-2/+82
| | | A thin-pool will allocate blocks using FIFO order for all thin devices which share the thin-pool. Because of this simplistic allocation the thin-pool's space can become fragmented quite easily; especially when multiple threads are requesting blocks in parallel.
| | |
| | | Sort each thin device's deferred_bio_list based on logical sector to help reduce fragmentation of the thin-pool's ondisk layout.
| | |
| | | The following tables illustrate the realized gains/potential offered by sorting each thin device's deferred_bio_list. An "io size"-sized random read of the device would result in "seeks/io" fragments being read, with an average "distance/seek" between each fragment.
| | |
| | | Data was written to a single thin device using multiple threads via iozone (8 threads, 64K for both the block_size and io_size).
| | |
| | | unsorted:
| | |
| | |   io size   seeks/io   distance/seek
| | |   --------------------------------------
| | |   4k        0.000      0b
| | |   16k       0.013      11m
| | |   64k       0.065      11m
| | |   256k      0.274      10m
| | |   1m        1.109      10m
| | |   4m        4.411      10m
| | |   16m       17.097     11m
| | |   64m       60.055     13m
| | |   256m      148.798    25m
| | |   1g        809.929    21m
| | |
| | | sorted:
| | |
| | |   io size   seeks/io   distance/seek
| | |   --------------------------------------
| | |   4k        0.000      0b
| | |   16k       0.000      1g
| | |   64k       0.001      1g
| | |   256k      0.003      1g
| | |   1m        0.011      1g
| | |   4m        0.045      1g
| | |   16m       0.181      1g
| | |   64m       0.747      1011m
| | |   256m      3.299      1g
| | |   1g        14.373     1g
| | |
| | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | | Acked-by: Joe Thornber <ejt@redhat.com>
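The ordering step in the commit above can be illustrated in user space. The commit itself uses a kernel rb_tree; the sketch below swaps in the C library's qsort() on an array purely to show the effect of ordering deferred I/O by logical sector. `struct deferred_io`, `cmp_sector` and `sort_deferred` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a deferred bio: only the fields the sort needs. */
struct deferred_io {
	uint64_t sector; /* logical start sector */
	int id;          /* arrival order, kept only for illustration */
};

/* Compare by logical sector without risking overflow from subtraction. */
static int cmp_sector(const void *a, const void *b)
{
	const struct deferred_io *x = a;
	const struct deferred_io *y = b;

	if (x->sector < y->sector)
		return -1;
	if (x->sector > y->sector)
		return 1;
	return 0;
}

/* Order a device's deferred I/O by sector before it is processed. */
void sort_deferred(struct deferred_io *ios, size_t n)
{
	assert(ios != NULL || n == 0);
	if (n > 1)
		qsort(ios, n, sizeof(*ios), cmp_sector);
}
```

Processing the list in sector order means blocks are handed out in an order that matches the device's logical layout, which is the effect the tables above measure.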
| * | dm thin: use per thin device deferred bio listsMike Snitzer2014-03-311-61/+104
| | | The thin-pool previously only had a single deferred_bios list that would collect bios for all thin devices in the pool. Split this per-pool deferred_bios list out to per-thin deferred_bios_list -- doing so enables increased parallelism when processing deferred bios. And now that each thin device has its own deferred_bios_list we can sort all bios in the list using logical sector. The requeue code in error handling path is also cleaner as a side-effect. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
| * | dm thin: simplify pool_is_congestedMike Snitzer2014-03-311-11/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | The pool is congested if the pool is in PM_OUT_OF_DATA_SPACE mode. This is more explicit/clear/efficient than inferring whether or not the pool is congested by checking if retry_on_resume_list is empty. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
| * | dm thin: fix dangling bio in process_deferred_bios error pathMike Snitzer2014-03-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If unable to ensure_next_mapping() we must add the current bio, which was removed from the @bios list via bio_list_pop, back to the deferred_bios list before all the remaining @bios. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
| * | dm mpath: print more useful warnings in multipath_message()Jose Castillo2014-03-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The warning message "Unrecognised multipath message received" is displayed in two different situations in multipath_message(): when the number of arguments passed is invalid and when the string passed in argv[0] is not recognized. Make it easier to identify where the problem is by making these warnings more specific with additional context for each case. Signed-off-by: Jose Castillo <jcastillo@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | dm-mpath: do not activate failed pathsHannes Reinecke2014-03-271-2/+5
| | | activate_path() is run without a lock, so the path might be set to failed before activate_path() had a chance to run. This patch adds a check for ->active in activate_path() to avoid unnecessary overhead by calling functions which are known to be failing. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * | dm mpath: remove extra nesting in map functionMike Snitzer2014-03-271-22/+24
| | | Return early for the case when no path exists, and when the pathgroup isn't ready. This eliminates the need for extra nesting for the common case. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de>
| * | dm mpath: remove map_io()  Hannes Reinecke  2014-03-27  1  -13/+6
| | |
| | | multipath_map() is now just a wrapper around map_io(), so we can rename
| | | map_io() to multipath_map().
| | |
| | | Signed-off-by: Hannes Reinecke <hare@suse.de>
| | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | | Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
| * | dm mpath: reduce memory pressure when requeuing  Hannes Reinecke  2014-03-27  1  -23/+15
| | |
| | | When multipath needs to requeue I/O in the block layer the per-request
| | | context shouldn't be allocated, as it will be freed immediately
| | | afterwards anyway. Avoiding this memory allocation will reduce memory
| | | pressure during requeuing.
| | |
| | | Signed-off-by: Hannes Reinecke <hare@suse.de>
| | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | | Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
| * | dm mpath: remove process_queued_ios()  Hannes Reinecke  2014-03-27  1  -42/+27
| | |
| | | process_queued_ios() has served 3 functions:
| | |   1) select pg and pgpath if none is selected
| | |   2) start pg_init if requested
| | |   3) dispatch queued IOs when pg is ready
| | |
| | | Basically, a call to queue_work(process_queued_ios) can be replaced by
| | | dm_table_run_md_queue_async(), which runs the request queue and ends up
| | | calling map_io(), which does 1), 2) and 3).
| | |
| | | The exception is when !pg_ready() (which means either pg_init is running
| | | or requested); then multipath_busy() prevents map_io() being called from
| | | request_fn.
| | |
| | | If pg_init is running, it should be ok as long as pg_init_done() does the
| | | right thing when pg_init is completed, i.e.: restart pg_init if
| | | !pg_ready() or call dm_table_run_md_queue_async() to kick map_io().
| | |
| | | If pg_init is requested, we have to make sure the request is detected and
| | | pg_init will be started. pg_init is requested in 3 places:
| | |   a) __choose_pgpath() in map_io()
| | |   b) __choose_pgpath() in multipath_ioctl()
| | |   c) pg_init retry in pg_init_done()
| | |
| | | a) is ok because map_io() calls __pg_init_all_paths(), which does 2).
| | | b) needs a call to __pg_init_all_paths(), which does 2).
| | | c) needs a call to __pg_init_all_paths(), which does 2).
| | |
| | | So this patch removes process_queued_ios() and ensures that
| | | __pg_init_all_paths() is called at the appropriate locations.
| | |
| | | Signed-off-by: Hannes Reinecke <hare@suse.de>
| | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | | Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
| * | dm mpath: push back requests instead of queueing  Hannes Reinecke  2014-03-27  1  -78/+36
| | |
| | | There is no reason why multipath needs to queue requests internally for
| | | queue_if_no_path or pg_init; we should rather push them back onto the
| | | request queue.
| | |
| | | And while we're at it we can simplify the conditional statement in
| | | map_io() to make it easier to read.
| | |
| | | Since mpath no longer does internal queuing of I/O the table info no
| | | longer emits the internal queue_size. Instead it displays 1 if queuing is
| | | being used or 0 if it is not.
| | |
| | | Signed-off-by: Hannes Reinecke <hare@suse.de>
| | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | | Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
| * | dm table: add dm_table_run_md_queue_async  Mike Snitzer  2014-03-27  4  -0/+30
| | |
| | | Introduce dm_table_run_md_queue_async() to run the request_queue of the
| | | mapped_device associated with a request-based DM table.
| | |
| | | Also add dm_md_get_queue() wrapper to extract the request_queue from a
| | | mapped_device.
| | |
| | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | | Signed-off-by: Hannes Reinecke <hare@suse.de>
| | | Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
| * | dm mpath: do not call pg_init when it is already running  Hannes Reinecke  2014-03-27  1  -2/+4
| | |
| | | This patch moves condition checks in preparation for the following
| | | patches and has no effect on behaviour. process_queued_ios() is the only
| | | caller of __pg_init_all_paths() and 2 condition checks are moved from
| | | outside to inside without side effects.
| | |
| | | Signed-off-by: Hannes Reinecke <hare@suse.de>
| | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | | Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>