summaryrefslogtreecommitdiffstats
path: root/arch/powerpc
Commit message (Collapse)AuthorAgeFilesLines
* powerpc/powernv: Fix indirect XSCOM unmanglingBenjamin Herrenschmidt2014-02-281-9/+12
| | | | | | | | | We need to unmangle the full address, not just the register number, and we also need to support the real indirect bit being set for in-kernel uses. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: <stable@vger.kernel.org> [v3.13]
* powerpc/powernv: Fix opal_xscom_{read,write} prototypeBenjamin Herrenschmidt2014-02-281-2/+2
| | | | | | | | | The OPAL firmware functions opal_xscom_read and opal_xscom_write take a 64-bit argument for the XSCOM (PCB) address in order to support the indirect mode on P8. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: <stable@vger.kernel.org> [v3.13]
* powerpc/powernv: Refactor PHB diag-data dumpGavin Shan2014-02-281-95/+125
| | | | | | | | | | | | | | | | | | | | | | | | | | | As Ben suggested, the patch prints PHB diag-data with multiple fields in one line and omits the line if the fields of that line are all zero. With the patch applied, the PHB3 diag-data dump looks like: PHB3 PHB#3 Diag-data (Version: 1) brdgCtl: 00000002 RootSts: 0000000f 00400000 b0830008 00100147 00002000 nFir: 0000000000000000 0030006e00000000 0000000000000000 PhbSts: 0000001c00000000 0000000000000000 Lem: 0000000000100000 42498e327f502eae 0000000000000000 InAErr: 8000000000000000 8000000000000000 0402030000000000 0000000000000000 PE[ 8] A/B: 8480002b00000000 8000000000000000 [ The current diag data is so big that it overflows the printk buffer pretty quickly in cases when we get a handful of errors at once which can happen. --BenH ] Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> CC: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/powernv: Dump PHB diag-data immediatelyGavin Shan2014-02-281-53/+43
| | | | | | | | | | | | | | | The PHB diag-data is important to help locating the root cause for EEH errors such as frozen PE or fenced PHB. However, the EEH core enables IO path by clearing part of HW registers before collecting this data causing it to be corrupted. This patch fixes this by dumping the PHB diag-data immediately when frozen/fenced state on PE or PHB is detected for the first time in eeh_ops::get_state() or next_error() backend. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> CC: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Increase stack redzone for 64-bit userspace to 512 bytesPaul Mackerras2014-02-283-5/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | The new ELFv2 little-endian ABI increases the stack redzone -- the area below the stack pointer that can be used for storing data -- from 288 bytes to 512 bytes. This means that we need to allow more space on the user stack when delivering a signal to a 64-bit process. To make the code a bit clearer, we define new USER_REDZONE_SIZE and KERNEL_REDZONE_SIZE symbols in ptrace.h. For now, we leave the kernel redzone size at 288 bytes, since increasing it to 512 bytes would increase the size of interrupt stack frames correspondingly. Gcc currently only makes use of 288 bytes of redzone even when compiling for the new little-endian ABI, and the kernel cannot currently be compiled with the new ABI anyway. In the future, hopefully gcc will provide an option to control the amount of redzone used, and then we could reduce it even more. This also changes the code in arch_compat_alloc_user_space() to preserve the expanded redzone. It is not clear why this function would ever be used on a 64-bit process, though. Signed-off-by: Paul Mackerras <paulus@samba.org> CC: <stable@vger.kernel.org> [v3.13] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/ftrace: bugfix for test_24bit_addrLiu Ping Fan2014-02-281-0/+1
| | | | | | | | The branch target should be the func addr, not the addr of func_descr_t. So using ppc_function_entry() to generate the right target addr. Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/crashdump : Fix page frame number check in copy_oldmem_pageLaurent Dufour2014-02-281-3/+5
| | | | | | | | | | | | | | | | | | | | In copy_oldmem_page, the current check using max_pfn and min_low_pfn to decide if the page is backed or not, is not valid when the memory layout is not continuous. This happens when running as a QEMU/KVM guest, where RTAS is mapped higher in the memory. In that case max_pfn points to the end of RTAS, and a hole between the end of the kdump kernel and RTAS is not backed by PTEs. As a consequence, the kdump kernel is crashing in copy_oldmem_page when accessing in a direct way the pages in that hole. This fix relies on the memblock's service memblock_is_region_memory to check if the read page is part or not of the directly accessible memory. Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> CC: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/le: Ensure that the 'stop-self' RTAS token is handled correctlyTony Breeds2014-02-281-11/+11
| | | | | | | | | | | | | | | | | Currently we're storing a host endian RTAS token in rtas_stop_self_args.token. We then pass that directly to rtas. This is fine on big endian however on little endian the token is not what we expect. This will typically result in hitting: panic("Alas, I survived.\n"); To fix this we always use the stop-self token in host order and always convert it to be32 before passing this to rtas. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> Cc: stable@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/eeh: Disable EEH on rebootGavin Shan2014-02-172-1/+22
| | | | | | | | | | | | | | We possiblly detect EEH errors during reboot, particularly in kexec path, but it's impossible for device drivers and EEH core to handle or recover them properly. The patch registers one reboot notifier for EEH and disable EEH subsystem during reboot. That means the EEH errors is going to be cleared by hardware reset or second kernel during early stage of PCI probe. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/eeh: Cleanup on eeh_subsystem_enabledGavin Shan2014-02-174-10/+27
| | | | | | | | | The patch cleans up variable eeh_subsystem_enabled so that we needn't refer the variable directly from external. Instead, we will use function eeh_enabled() and eeh_set_enable() to operate the variable. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/powernv: Rework EEH resetGavin Shan2014-02-171-25/+4
| | | | | | | | | | | | | | | | | When doing reset in order to recover the affected PE, we issue hot reset on PE primary bus if it's not root bus. Otherwise, we issue hot or fundamental reset on root port or PHB accordingly. For the later case, we didn't cover the situation where PE only includes root port and it potentially causes kernel crash upon EEH error to the PE. The patch reworks the logic of EEH reset to improve the code readability and also avoid the kernel crash. Cc: stable@vger.kernel.org Reported-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Use unstripped VDSO image for more accurate profiling dataAnton Blanchard2014-02-172-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are seeing a lot of hits in the VDSO that are not resolved by perf. A while(1) gettimeofday() loop shows the issue: 27.64% [vdso] [.] 0x000000000000060c 22.57% [vdso] [.] 0x0000000000000628 16.88% [vdso] [.] 0x0000000000000610 12.39% [vdso] [.] __kernel_gettimeofday 6.09% [vdso] [.] 0x00000000000005f8 3.58% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18 2.94% [vdso] [.] __kernel_datapage_offset 2.90% test [.] main We are using a stripped VDSO image which means only symbols with relocation info can be resolved. There isn't a lot of point to stripping the VDSO, the debug info is only about 1kB: 4680 arch/powerpc/kernel/vdso64/vdso64.so 5815 arch/powerpc/kernel/vdso64/vdso64.so.dbg By using the unstripped image, we can resolve all the symbols in the VDSO and the perf profile data looks much better: 76.53% [vdso] [.] __do_get_tspec 12.20% [vdso] [.] __kernel_gettimeofday 5.05% [vdso] [.] __get_datapage 3.20% test [.] main 2.92% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18 Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Link VDSOs at 0x0Anton Blanchard2014-02-171-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | perf is failing to resolve symbols in the VDSO. A while (1) gettimeofday() loop shows: 93.99% [vdso] [.] 0x00000000000005e0 3.12% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18 2.81% test [.] main The reason for this is that we are linking our VDSO shared libraries at 1MB, which is a little weird. Even though this is uncommon, Alan points out that it is valid and we should probably fix perf userspace. Regardless, I can't see a reason why we are doing this. The code is all position independent and we never rely on the VDSO ending up at 1M (and we never place it there on 64bit tasks). Changing our link address to 0x0 fixes perf VDSO symbol resolution: 73.18% [vdso] [.] 0x000000000000060c 12.39% [vdso] [.] __kernel_gettimeofday 3.58% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18 2.94% [vdso] [.] __kernel_datapage_offset 2.90% test [.] main We still have some local symbol resolution issues that will be fixed in a subsequent patch. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bitAneesh Kumar K.V2014-02-171-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Archs like ppc64 doesn't do tlb flush in set_pte/pmd functions when using a hash table MMU for various reasons (the flush is handled as part of the PTE modification when necessary). ppc64 thus doesn't implement flush_tlb_range for hash based MMUs. Additionally ppc64 require the tlb flushing to be batched within ptl locks. The reason to do that is to ensure that the hash page table is in sync with linux page table. We track the hpte index in linux pte and if we clear them without flushing hash and drop the ptl lock, we can have another cpu update the pte and can end up with duplicate entry in the hash table, which is fatal. We also want to keep set_pte_at simpler by not requiring them to do hash flush for performance reason. We do that by assuming that set_pte_at() is never *ever* called on a PTE that is already valid. This was the case until the NUMA code went in which broke that assumption. Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a way similar to ptep/pmdp_set_wrprotect(), with a generic implementation using set_pte_at() and a powerpc specific one using the appropriate mechanism needed to keep the hash table in sync. Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/mm: Add new "set" flag argument to pte/pmd update functionAneesh Kumar K.V2014-02-174-18/+24
| | | | | | | | | | | | | | | | | | | | | | | pte_update() is a powerpc-ism used to change the bits of a PTE when the access permission is being restricted (a flush is potentially needed). It uses atomic operations on when needed and handles the hash synchronization on hash based processors. It is currently only used to clear PTE bits and so the current implementation doesn't provide a way to also set PTE bits. The new _PAGE_NUMA bit, when set, is actually restricting access so it must use that function too, so this change adds the ability for pte_update() to also set bits. We will use this later to set the _PAGE_NUMA bit. Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/pseries: Add Gen3 definitions for PCIE link speedKleber Sacilotto de Souza2014-02-171-0/+6
| | | | | | | | | | | | | | | | | | | Rev3 of the PCI Express Base Specification defines a Supported Link Speeds Vector where the bit definitions within this field are: Bit 0 - 2.5 GT/s Bit 1 - 5.0 GT/s Bit 2 - 8.0 GT/s This vector definition is used by the platform firmware to export the maximum and current link speeds of the PCI bus via the "ibm,pcie-link-speed-stats" device-tree property. This patch updates pseries_root_bridge_prepare() to detect Gen3 speed buses (defined by 0x04). Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/pseries: Fix regression on PCI link speedKleber Sacilotto de Souza2014-02-171-7/+9
| | | | | | | | | | | | | | | | | Commit 5091f0c (powerpc/pseries: Fix PCIE link speed endian issue) introduced a regression on the PCI link speed detection using the device-tree property. The ibm,pcie-link-speed-stats property is composed of two 32-bit integers, the first one being the maxinum link speed and the second the current link speed. The changes introduced by the aforementioned commit are considering just the first integer. Fix this issue by changing how the property is accessed, using the helper functions to properly access the array of values. The explicit byte swapping is not needed anymore here, since it's done by the helper functions. Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Set the correct ksp_limit on ppc32 when switching to irq stackKevin Hao2014-02-171-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Guenter Roeck has got the following call trace on a p2020 board: Kernel stack overflow in process eb3e5a00, r1=eb79df90 CPU: 0 PID: 2838 Comm: ssh Not tainted 3.13.0-rc8-juniper-00146-g19eca00 #4 task: eb3e5a00 ti: c0616000 task.ti: ef440000 NIP: c003a420 LR: c003a410 CTR: c0017518 REGS: eb79dee0 TRAP: 0901 Not tainted (3.13.0-rc8-juniper-00146-g19eca00) MSR: 00029000 <CE,EE,ME> CR: 24008444 XER: 00000000 GPR00: c003a410 eb79df90 eb3e5a00 00000000 eb05d900 00000001 65d87646 00000000 GPR08: 00000000 020b8000 00000000 00000000 44008442 NIP [c003a420] __do_softirq+0x94/0x1ec LR [c003a410] __do_softirq+0x84/0x1ec Call Trace: [eb79df90] [c003a410] __do_softirq+0x84/0x1ec (unreliable) [eb79dfe0] [c003a970] irq_exit+0xbc/0xc8 [eb79dff0] [c000cc1c] call_do_irq+0x24/0x3c [ef441f20] [c00046a8] do_IRQ+0x8c/0xf8 [ef441f40] [c000e7f4] ret_from_except+0x0/0x18 --- Exception: 501 at 0xfcda524 LR = 0x10024900 Instruction dump: 7c781b78 3b40000a 3a73b040 543c0024 3a800000 3b3913a0 7ef5bb78 48201bf9 5463103a 7d3b182e 7e89b92e 7c008146 <3ba00000> 7e7e9b78 48000014 57fff87f Kernel panic - not syncing: kernel stack overflow CPU: 0 PID: 2838 Comm: ssh Not tainted 3.13.0-rc8-juniper-00146-g19eca00 #4 Call Trace: The reason is that we have used the wrong register to calculate the ksp_limit in commit cbc9565ee826 (powerpc: Remove ksp_limit on ppc64). Just fix it. As suggested by Benjamin Herrenschmidt, also add the C prototype of the function in the comment in order to avoid such kind of errors in the future. Cc: stable@vger.kernel.org # 3.12 Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/powernv: Add iommu DMA bypass support for IODA2Benjamin Herrenschmidt2014-02-119-4/+137
| | | | | | | | | | | | | | | This patch adds the support for to create a direct iommu "bypass" window on IODA2 bridges (such as Power8) allowing to bypass iommu page translation completely for 64-bit DMA capable devices, thus significantly improving DMA performances. Additionally, this adds a hook to the struct iommu_table so that the IOMMU API / VFIO can disable the bypass when external ownership is requested, since in that case, the device will be used by an environment such as userspace or a KVM guest which must not be allowed to bypass translations. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Fix endian issues in kexec and crash dump codeAnton Blanchard2014-02-112-6/+14
| | | | | | | | | We expose a number of OF properties in the kexec and crash dump code and these need to be big endian. Cc: stable@vger.kernel.org # v3.13 Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/ppc32: Fix the bug in the init of non-base exception stack for UPKevin Hao2014-02-112-0/+10
| | | | | | | | | | | | | We would allocate one specific exception stack for each kind of non-base exceptions for every CPU. For ppc32 the CPU hard ID is used as the subscript to get the specific exception stack for one CPU. But for an UP kernel, there is only one element in the each kind of exception stack array. We would get stuck if the CPU hard ID is not equal to '0'. So in this case we should use the subscript '0' no matter what the CPU hard ID is. Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/xmon: Don't signal we've entered until we're finished printingMichael Ellerman2014-02-111-1/+2
| | | | | | | | | | | | | | | Currently we set our cpu's bit in cpus_in_xmon, and then we take the output lock and print the exception information. This can race with the master cpu entering the command loop and printing the backtrace. The result is that the backtrace gets garbled with another cpu's exception print out. Fix it by delaying the set of cpus_in_xmon until we are finished printing. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/xmon: Fix timeout loop in get_output_lock()Michael Ellerman2014-02-111-2/+9
| | | | | | | | | | | | | | | | | | | | As far as I can tell, our 70s era timeout loop in get_output_lock() is generating no code. This leads to the hostile takeover happening more or less simultaneously on all cpus. The result is "interesting", some example output that is more readable than most: cpu 0x1: Vector: 100 (Scypsut e0mx bR:e setV)e catto xc0p:u[ c 00 c0:0 000t0o0V0erc0td:o5 rfc28050000]0c00 0 0 0 6t(pSrycsV1ppuot uxe 1m 2 0Rx21e3:0s0ce000c00000t00)00 60602oV2SerucSayt0y 0p 1sxs Fix it by using udelay() in the timeout loop. The wait time and check frequency are arbitrary, but seem to work OK. We already rely on udelay() working so this is not a new dependency. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/xmon: Don't loop forever in get_output_lock()Michael Ellerman2014-02-111-5/+5
| | | | | | | | | | | | | | | | | | | | | If we enter with xmon_speaker != 0 we skip the first cmpxchg(), we also skip the while loop because xmon_speaker != last_speaker (0) - meaning we skip the second cmpxchg() also. Following that code path the compiler sees no memory barriers and so is within its rights to never reload xmon_speaker. The end result is we loop forever. This manifests as all cpus being in xmon ('c' command), but they refuse to take control when you switch to them ('c x' for cpu # x). I have seen this deadlock in practice and also checked the generated code to confirm this is what's happening. The simplest fix is just to always try the cmpxchg(). Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/perf: Configure BHRB filter before enabling PMU interruptsAnshuman Khandual2014-02-111-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now the config_bhrb() PMU specific call happens after write_mmcr0(), which actually enables the PMU for event counting and interrupts. So there is a small window of time where the PMU and BHRB runs without the required HW branch filter (if any) enabled in BHRB. This can cause some of the branch samples to be collected through BHRB without any filter applied and hence affects the correctness of the results. This patch moves the BHRB config function call before enabling interrupts. Here are some data points captured via trace prints which depicts how we could get PMU interrupts with BHRB filter NOT enabled with a standard perf record command line (asking for branch record information as well). $ perf record -j any_call ls Before the patch:- ls-1962 [003] d... 2065.299590: .perf_event_interrupt: MMCRA: 40000000000 ls-1962 [003] d... 2065.299603: .perf_event_interrupt: MMCRA: 40000000000 ... All the PMU interrupts before this point did not have the requested HW branch filter enabled in the MMCRA. ls-1962 [003] d... 2065.299647: .perf_event_interrupt: MMCRA: 40040000000 ls-1962 [003] d... 2065.299662: .perf_event_interrupt: MMCRA: 40040000000 After the patch:- ls-1850 [008] d... 190.311828: .perf_event_interrupt: MMCRA: 40040000000 ls-1850 [008] d... 190.311848: .perf_event_interrupt: MMCRA: 40040000000 All the PMU interrupts have the requested HW BHRB branch filter enabled in MMCRA. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> [mpe: Fixed up whitespace and cleaned up changelog] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/pseries: Select ARCH_RANDOM on pseriesMichael Ellerman2014-02-111-0/+1
| | | | | | | | | | | | | We have a driver for the ARCH_RANDOM hook in rng.c, so we should select ARCH_RANDOM on pseries. Without this the build breaks if you turn ARCH_RANDOM off. This hasn't broken the build because pseries_defconfig doesn't specify a value for PPC_POWERNV, which is default y, and selects ARCH_RANDOM. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/perf: Add Power8 cache & TLB eventsMichael Ellerman2014-02-111-0/+144
| | | | | Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/relocate fix relocate processing in LE modeLaurent Dufour2014-02-111-2/+2
| | | | | | | | | | | | | | | | Relocation's code is not working in little endian mode because the r_info field, which is a 64 bits value, should be read from the right offset. The current code is optimized to read the r_info field as a 32 bits value starting at the middle of the double word (offset 12). When running in LE mode, the read value is not correct since only the MSB is read. This patch removes this optimization which consist to deal with a 32 bits value instead of a 64 bits one. This way it works in big and little endian mode. Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Fix kdump hang issue on p8 with relocation on exception enabled.Mahesh Salgaonkar2014-02-112-0/+26
| | | | | | | | | | | | | | | | | | | | | On p8 systems, with relocation on exception feature enabled we are seeing kdump kernel hang at interrupt vector 0xc*4400. The reason is, with this feature enabled, exception are raised with MMU (IR=DR=1) ON with the default offset of 0xc*4000. Since exception is raised in virtual mode it requires the vector region to be executable without which it fails to fetch and execute instruction at 0xc*4xxx. For default kernel since kernel is loaded at real 0, the htab mappings sets the entire kernel text region executable. But for relocatable kernel (e.g. kdump case) we only copy interrupt vectors down to real 0 and never marked that region as executable because in p7 and below we always get exception in real mode. This patch fixes this issue by marking htab mapping range as executable that overlaps with the interrupt vector region for relocatable kernel. Thanks to Ben who helped me to debug this issue and find the root cause. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/pseries: Disable relocation on exception while going down during crash.Mahesh Salgaonkar2014-02-111-2/+1
| | | | | | | | | Disable relocation on exception while going down even in kdump case. This is because we are about clear htab mappings while kexec-ing into kdump kernel and we may run into issues if we still have AIL ON. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc/eeh: Drop taken reference to driver on eeh_rmv_deviceThadeu Lima de Souza Cascardo2014-02-111-2/+6
| | | | | | | | | | | | | | | Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 ("powerpc/eeh: Use partial hotplug for EEH unaware drivers") introduces eeh_rmv_device, which may grab a reference to a driver, but not release it. That prevents a driver from being removed after it has gone through EEH recovery. This patch drops the reference if it was taken. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* powerpc: Fix build failure in sysdev/mpic.c for MPIC_WEIRD=yPaul Gortmaker2014-02-111-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 446f6d06fab0b49c61887ecbe8286d6aaa796637 ("powerpc/mpic: Properly set default triggers") breaks the mpc7447_hpc_defconfig as follows: CC arch/powerpc/sysdev/mpic.o arch/powerpc/sysdev/mpic.c: In function 'mpic_set_irq_type': arch/powerpc/sysdev/mpic.c:886:9: error: case label does not reduce to an integer constant arch/powerpc/sysdev/mpic.c:890:9: error: case label does not reduce to an integer constant arch/powerpc/sysdev/mpic.c:894:9: error: case label does not reduce to an integer constant arch/powerpc/sysdev/mpic.c:898:9: error: case label does not reduce to an integer constant Looking at the cpp output (gcc 4.7.3), I see: case mpic->hw_set[MPIC_IDX_VECPRI_SENSE_EDGE] | mpic->hw_set[MPIC_IDX_VECPRI_POLARITY_POSITIVE]: The pointer into an array appears because CONFIG_MPIC_WEIRD=y is set for this platform, thus enabling the following: ------------------- #ifdef CONFIG_MPIC_WEIRD static u32 mpic_infos[][MPIC_IDX_END] = { [0] = { /* Original OpenPIC compatible MPIC */ [...] #define MPIC_INFO(name) mpic->hw_set[MPIC_IDX_##name] #else /* CONFIG_MPIC_WEIRD */ #define MPIC_INFO(name) MPIC_##name #endif /* CONFIG_MPIC_WEIRD */ ------------------- Here we convert the case section to if/else if, and also add the equivalent of a default case to warn about unknown types. Boot tested on sbc8548, build tested on all defconfigs. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2014-01-3141-1066/+1563
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more KVM updates from Paolo Bonzini: "Second batch of KVM updates. Some minor x86 fixes, two s390 guest features that need some handling in the host, and all the PPC changes. The PPC changes include support for little-endian guests and enablement for new POWER8 features" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (45 commits) x86, kvm: correctly access the KVM_CPUID_FEATURES leaf at 0x40000101 x86, kvm: cache the base of the KVM cpuid leaves kvm: x86: move KVM_CAP_HYPERV_TIME outside #ifdef KVM: PPC: Book3S PR: Cope with doorbell interrupts KVM: PPC: Book3S HV: Add software abort codes for transactional memory KVM: PPC: Book3S HV: Add new state for transactional memory powerpc/Kconfig: Make TM select VSX and VMX KVM: PPC: Book3S HV: Basic little-endian guest support KVM: PPC: Book3S HV: Add support for DABRX register on POWER7 KVM: PPC: Book3S HV: Prepare for host using hypervisor doorbells KVM: PPC: Book3S HV: Handle new LPCR bits on POWER8 KVM: PPC: Book3S HV: Handle guest using doorbells for IPIs KVM: PPC: Book3S HV: Consolidate code that checks reason for wake from nap KVM: PPC: Book3S HV: Implement architecture compatibility modes for POWER8 KVM: PPC: Book3S HV: Add handler for HV facility unavailable KVM: PPC: Book3S HV: Flush the correct number of TLB sets on POWER8 KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbers KVM: PPC: Book3S HV: Don't set DABR on POWER8 kvm/ppc: IRQ disabling cleanup ...
| * Merge branch 'kvm-ppc-next' of git://github.com/agraf/linux-2.6 into kvm-queuePaolo Bonzini2014-01-2941-1066/+1563
| |\ | | | | | | | | | | | | | | | Conflicts: arch/powerpc/kvm/book3s_hv_rmhandlers.S arch/powerpc/kvm/booke.c
| | * KVM: PPC: Book3S PR: Cope with doorbell interruptsPaul Mackerras2014-01-273-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the PR host is running on a POWER8 machine in POWER8 mode, it will use doorbell interrupts for IPIs. If one of them arrives while we are in the guest, we pop out of the guest with trap number 0xA00, which isn't handled by kvmppc_handle_exit_pr, leading to the following BUG_ON: [ 331.436215] exit_nr=0xa00 | pc=0x1d2c | msr=0x800000000000d032 [ 331.437522] ------------[ cut here ]------------ [ 331.438296] kernel BUG at arch/powerpc/kvm/book3s_pr.c:982! [ 331.439063] Oops: Exception in kernel mode, sig: 5 [#2] [ 331.439819] SMP NR_CPUS=1024 NUMA pSeries [ 331.440552] Modules linked in: tun nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw virtio_net kvm binfmt_misc ibmvscsi scsi_transport_srp scsi_tgt virtio_blk [ 331.447614] CPU: 11 PID: 1296 Comm: qemu-system-ppc Tainted: G D 3.11.7-200.2.fc19.ppc64p7 #1 [ 331.448920] task: c0000003bdc8c000 ti: c0000003bd32c000 task.ti: c0000003bd32c000 [ 331.450088] NIP: d0000000025d6b9c LR: d0000000025d6b98 CTR: c0000000004cfdd0 [ 331.451042] REGS: c0000003bd32f420 TRAP: 0700 Tainted: G D (3.11.7-200.2.fc19.ppc64p7) [ 331.452331] MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI> CR: 28004824 XER: 20000000 [ 331.454616] SOFTE: 1 [ 331.455106] CFAR: c000000000848bb8 [ 331.455726] GPR00: d0000000025d6b98 c0000003bd32f6a0 d0000000026017b8 0000000000000032 GPR04: c0000000018627f8 c000000001873208 320d0a3030303030 3030303030643033 GPR08: c000000000c490a8 0000000000000000 0000000000000000 0000000000000002 GPR12: 0000000028004822 c00000000fdc6300 0000000000000000 00000100076ec310 GPR16: 000000002ae343b8 00003ffffd397398 0000000000000000 0000000000000000 GPR20: 00000100076f16f4 00000100076ebe60 0000000000000008 ffffffffffffffff GPR24: 0000000000000000 0000008001041e60 0000000000000000 0000008001040ce8 GPR28: c0000003a2d80000 0000000000000a00 0000000000000001 c0000003a2681810 [ 331.466504] NIP [d0000000025d6b9c] .kvmppc_handle_exit_pr+0x75c/0xa80 [kvm] [ 331.466999] LR [d0000000025d6b98] .kvmppc_handle_exit_pr+0x758/0xa80 [kvm] [ 331.467517] Call Trace: [ 331.467909] [c0000003bd32f6a0] [d0000000025d6b98] .kvmppc_handle_exit_pr+0x758/0xa80 [kvm] (unreliable) [ 331.468553] [c0000003bd32f750] [d0000000025d98f0] kvm_start_lightweight+0xb4/0xc4 [kvm] [ 331.469189] [c0000003bd32f920] [d0000000025d7648] .kvmppc_vcpu_run_pr+0xd8/0x270 [kvm] [ 331.469838] [c0000003bd32f9c0] [d0000000025cf748] .kvmppc_vcpu_run+0xc8/0xf0 [kvm] [ 331.470790] [c0000003bd32fa50] [d0000000025cc19c] .kvm_arch_vcpu_ioctl_run+0x5c/0x1b0 [kvm] [ 331.471401] [c0000003bd32fae0] [d0000000025c4888] .kvm_vcpu_ioctl+0x478/0x730 [kvm] [ 331.472026] [c0000003bd32fc90] [c00000000026192c] .do_vfs_ioctl+0x4dc/0x7a0 [ 331.472561] [c0000003bd32fd80] [c000000000261cc4] .SyS_ioctl+0xd4/0xf0 [ 331.473095] [c0000003bd32fe30] [c000000000009ed8] syscall_exit+0x0/0x98 [ 331.473633] Instruction dump: [ 331.473766] 4bfff9b4 2b9d0800 419efc18 60000000 60420000 3d220000 e8bf11a0 e8df12a8 [ 331.474733] 7fa4eb78 e8698660 48015165 e8410028 <0fe00000> 813f00e4 3ba00000 39290001 [ 331.475386] ---[ end trace 49fc47d994c1f8f2 ]--- [ 331.479817] This fixes the problem by making kvmppc_handle_exit_pr() recognize the interrupt. We also need to jump to the doorbell interrupt handler in book3s_segment.S to handle the interrupt on the way out of the guest. Having done that, there's nothing further to be done in kvmppc_handle_exit_pr(). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Add software abort codes for transactional memoryMichael Neuling2014-01-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the software abort code defines for transactional memory (TM). These values are from PAPR. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Add new state for transactional memoryMichael Neuling2014-01-274-8/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add new state for transactional memory (TM) to kvm_vcpu_arch. Also add asm-offset bits that are going to be required. This also moves the existing TFHAR, TFIAR and TEXASR SPRs into a CONFIG_PPC_TRANSACTIONAL_MEM section. This requires some code changes to ensure we still compile with CONFIG_PPC_TRANSACTIONAL_MEM=N. Much of the added the added #ifdefs are removed in a later patch when the bulk of the TM code is added. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: fix merge conflict] Signed-off-by: Alexander Graf <agraf@suse.de>
| | * powerpc/Kconfig: Make TM select VSX and VMXMichael Neuling2014-01-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are no processors in existence that have TM but no VMX or VSX. So let's makes CONFIG_PPC_TRANSACTIONAL_MEM select both CONFIG_VSX and CONFIG_ALTIVEC. This makes the code a lot simpler by removing the need for a bunch of #ifdefs. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Basic little-endian guest supportAnton Blanchard2014-01-275-11/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We create a guest MSR from scratch when delivering exceptions in a few places. Instead of extracting LPCR[ILE] and inserting it into MSR_LE each time, we simply create a new variable intr_msr which contains the entire MSR to use. For a little-endian guest, userspace needs to set the ILE (interrupt little-endian) bit in the LPCR for each vcpu (or at least one vcpu in each virtual core). [paulus@samba.org - removed H_SET_MODE implementation from original version of the patch, and made kvmppc_set_lpcr update vcpu->arch.intr_msr.] Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Add support for DABRX register on POWER7Paul Mackerras2014-01-276-9/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The DABRX (DABR extension) register on POWER7 processors provides finer control over which accesses cause a data breakpoint interrupt. It contains 3 bits which indicate whether to enable accesses in user, kernel and hypervisor modes respectively to cause data breakpoint interrupts, plus one bit that enables both real mode and virtual mode accesses to cause interrupts. Currently, KVM sets DABRX to allow both kernel and user accesses to cause interrupts while in the guest. This adds support for the guest to specify other values for DABRX. PAPR defines a H_SET_XDABR hcall to allow the guest to set both DABR and DABRX with one call. This adds a real-mode implementation of H_SET_XDABR, which shares most of its code with the existing H_SET_DABR implementation. To support this, we add a per-vcpu field to store the DABRX value plus code to get and set it via the ONE_REG interface. For Linux guests to use this new hcall, userspace needs to add "hcall-xdabr" to the set of strings in the /chosen/hypertas-functions property in the device tree. If userspace does this and then migrates the guest to a host where the kernel doesn't include this patch, then userspace will need to implement H_SET_XDABR by writing the specified DABR value to the DABR using the ONE_REG interface. In that case, the old kernel will set DABRX to DABRX_USER | DABRX_KERNEL. That should still work correctly, at least for Linux guests, since Linux guests cope with getting data breakpoint interrupts in modes that weren't requested by just ignoring the interrupt, and Linux guests never set DABRX_BTI. The other thing this does is to make H_SET_DABR and H_SET_XDABR work on POWER8, which has the DAWR and DAWRX instead of DABR/X. Guests that know about POWER8 should use H_SET_MODE rather than H_SET_[X]DABR, but guests running in POWER7 compatibility mode will still use H_SET_[X]DABR. For them, this adds the logic to convert DABR/X values into DAWR/X values on POWER8. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Prepare for host using hypervisor doorbellsPaul Mackerras2014-01-273-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | POWER8 has support for hypervisor doorbell interrupts. Though the kernel doesn't use them for IPIs on the powernv platform yet, it probably will in future, so this makes KVM cope gracefully if a hypervisor doorbell interrupt arrives while in a guest. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Handle new LPCR bits on POWER8Paul Mackerras2014-01-272-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | POWER8 has a bit in the LPCR to enable or disable the PURR and SPURR registers to count when in the guest. Set this bit. POWER8 has a field in the LPCR called AIL (Alternate Interrupt Location) which is used to enable relocation-on interrupts. Allow userspace to set this field. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Handle guest using doorbells for IPIsPaul Mackerras2014-01-272-5/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * SRR1 wake reason field for system reset interrupt on wakeup from nap is now a 4-bit field on P8, compared to 3 bits on P7. * Set PECEDP in LPCR when napping because of H_CEDE so guest doorbells will wake us up. * Waking up from nap because of a guest doorbell interrupt is not a reason to exit the guest. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Consolidate code that checks reason for wake from napPaul Mackerras2014-01-271-115/+77
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently in book3s_hv_rmhandlers.S we have three places where we have woken up from nap mode and we check the reason field in SRR1 to see what event woke us up. This consolidates them into a new function, kvmppc_check_wake_reason. It looks at the wake reason field in SRR1, and if it indicates that an external interrupt caused the wakeup, calls kvmppc_read_intr to check what sort of interrupt it was. This also consolidates the two places where we synthesize an external interrupt (0x500 vector) for the guest. Now, if the guest exit code finds that there was an external interrupt which has been handled (i.e. it was an IPI indicating that there is now an interrupt pending for the guest), it jumps to deliver_guest_interrupt, which is in the last part of the guest entry code, where we synthesize guest external and decrementer interrupts. That code has been streamlined a little and now clears LPCR[MER] when appropriate as well as setting it. The extra clearing of any pending IPI on a secondary, offline CPU thread before going back to nap mode has been removed. It is no longer necessary now that we have code to read and acknowledge IPIs in the guest exit path. This fixes a minor bug in the H_CEDE real-mode handling - previously, if we found that other threads were already exiting the guest when we were about to go to nap mode, we would branch to the cede wakeup path and end up looking in SRR1 for a wakeup reason. Now we branch to a point after we have checked the wakeup reason. This also fixes a minor bug in kvmppc_read_intr - previously it could return 0xff rather than 1, in the case where we find that a host IPI is pending after we have cleared the IPI. Now it returns 1. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Implement architecture compatibility modes for POWER8Paul Mackerras2014-01-272-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows us to select architecture 2.05 (POWER6) or 2.06 (POWER7) compatibility modes on a POWER8 processor. (Note that transactional memory is disabled for usermode if either or both of the PCR_TM_DIS and PCR_ARCH_206 bits are set.) Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Add handler for HV facility unavailableMichael Ellerman2014-01-272-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At present this should never happen, since the host kernel sets HFSCR to allow access to all facilities. It's better to be prepared to handle it cleanly if it does ever happen, though. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Flush the correct number of TLB sets on POWER8Paul Mackerras2014-01-271-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | POWER8 has 512 sets in the TLB, compared to 128 for POWER7, so we need to do more tlbiel instructions when flushing the TLB on POWER8. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Context-switch new POWER8 SPRsMichael Neuling2014-01-276-3/+361
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds fields to the struct kvm_vcpu_arch to store the new guest-accessible SPRs on POWER8, adds code to the get/set_one_reg functions to allow userspace to access this state, and adds code to the guest entry and exit to context-switch these SPRs between host and guest. Note that DPDES (Directed Privileged Doorbell Exception State) is shared between threads on a core; hence we store it in struct kvmppc_vcore and have the master thread save and restore it. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbersPaul Mackerras2014-01-276-337/+397
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On a threaded processor such as POWER7, we group VCPUs into virtual cores and arrange that the VCPUs in a virtual core run on the same physical core. Currently we don't enforce any correspondence between virtual thread numbers within a virtual core and physical thread numbers. Physical threads are allocated starting at 0 on a first-come first-served basis to runnable virtual threads (VCPUs). POWER8 implements a new "msgsndp" instruction which guest kernels can use to interrupt other threads in the same core or sub-core. Since the instruction takes the destination physical thread ID as a parameter, it becomes necessary to align the physical thread IDs with the virtual thread IDs, that is, to make sure virtual thread N within a virtual core always runs on physical thread N. This means that it's possible that thread 0, which is where we call __kvmppc_vcore_entry, may end up running some other vcpu than the one whose task called kvmppc_run_core(), or it may end up running no vcpu at all, if for example thread 0 of the virtual core is currently executing in userspace. However, we do need thread 0 to be responsible for switching the MMU -- a previous version of this patch that had other threads switching the MMU was found to be responsible for occasional memory corruption and machine check interrupts in the guest on POWER7 machines. To accommodate this, we no longer pass the vcpu pointer to __kvmppc_vcore_entry, but instead let the assembly code load it from the PACA. Since the assembly code will need to know the kvm pointer and the thread ID for threads which don't have a vcpu, we move the thread ID into the PACA and we add a kvm pointer to the virtual core structure. In the case where thread 0 has no vcpu to run, it still calls into kvmppc_hv_entry in order to do the MMU switch, and then naps until either its vcpu is ready to run in the guest, or some other thread needs to exit the guest. In the latter case, thread 0 jumps to the code that switches the MMU back to the host. This control flow means that now we switch the MMU before loading any guest vcpu state. Similarly, on guest exit we now save all the guest vcpu state before switching the MMU back to the host. This has required substantial code movement, making the diff rather large. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * KVM: PPC: Book3S HV: Don't set DABR on POWER8Michael Neuling2014-01-272-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | POWER8 doesn't have the DABR and DABRX registers; instead it has new DAWR/DAWRX registers, which will be handled in a later patch. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
OpenPOWER on IntegriCloud