path: root/sys/amd64/vmm/intel/vmx.c
Commit message  Author  Age  Files  Lines
* Revert "Revert "MFC ↵Luiz Souza2018-02-231-1/+2
| | | | | | r328083,328096,328116,328119,328120,328128,328135,328153,328157,"" This reverts commit d3d59b01294138e59995b31d2bcbbbdf45e26a3c.
* Revert "MFC r328083,328096,328116,328119,328120,328128,328135,328153,328157,"Luiz Souza2018-02-211-2/+1
| | | | This reverts commit 430a2bea3907149b30cc75fc722b6cf1f81da82a.
* MFC r328083,328096,328116,328119,328120,328128,328135,328153,328157,kib2018-02-191-1/+2
| | | | | | | | | | | | | | 328166,328177,328199,328202,328205,328468,328470,328624,328625,328627, 328628,329214,329297,329365: Meltdown mitigation by PTI, PCID optimization of PTI, and kernel use of IBRS for some mitigations of Spectre. Tested by: emaste, Arshan Khanifar <arshankhanifar@gmail.com> Discussed with: jkim Sponsored by: The FreeBSD Foundation (cherry picked from commit 6dd025b40ee6870bea6ba670f30dcf684edc3f6c)
* Restructure memory allocation in bhyve to support "devmem".neel2015-06-181-1/+1
| | | | | | | | | | | | | | | | | | | | | devmem is used to represent MMIO devices like the boot ROM or a VESA framebuffer where doing a trap-and-emulate for every access is impractical. devmem is a hybrid of system memory (sysmem) and emulated device models. devmem is mapped in the guest address space via nested page tables similar to sysmem. However the address range where devmem is mapped may be changed by the guest at runtime (e.g. by reprogramming a PCI BAR). Also devmem is usually mapped RO or RW as compared to RWX mappings for sysmem. Each devmem segment is named (e.g. "bootrom") and this name is used to create a device node for the devmem segment (e.g. /dev/vmm/testvm.bootrom). The device node supports mmap(2) and this decouples the host mapping of devmem from its mapping in the guest address space (which can change). Reviewed by: tychon Discussed with: grehan Differential Revision: https://reviews.freebsd.org/D2762 MFC after: 4 weeks
* Support guest writes to the TSC by enabling the "use TSC offsetting"tychon2015-06-091-4/+21
| | | | | | | | execution control and writing the difference between the host TSC and the guest TSC into the TSC offset in the VMCS upon encountering a write. Reviewed by: neel
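The offset arithmetic described in this commit can be sketched as follows. This is a minimal illustration, not code from vmx.c: the helper names `tsc_offset_for_write` and `guest_tsc_read` are hypothetical, and the host TSC value is passed in as a parameter instead of being read with `rdtsc`.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from vmx.c): value to store in the VMCS TSC
 * offset when the guest writes 'guest_tsc' while the host TSC reads
 * 'host_tsc'. Unsigned arithmetic wraps modulo 2^64, so the result is
 * well defined even when the guest value is smaller than the host value.
 */
static uint64_t
tsc_offset_for_write(uint64_t guest_tsc, uint64_t host_tsc)
{
	return (guest_tsc - host_tsc);
}

/*
 * With the "use TSC offsetting" execution control enabled, a guest read
 * of the TSC observes the host TSC plus the programmed offset.
 */
static uint64_t
guest_tsc_read(uint64_t host_tsc, uint64_t offset)
{
	return (host_tsc + offset);
}
```

Because the offset is a wrapped difference, a guest that resets its TSC to zero keeps counting from zero regardless of how large the host counter already is.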
* Fix non-deterministic delays when accessing a vcpu that was in "running" orneel2015-05-281-3/+9
| | | | | | | "sleeping" state. This is done by forcing the vcpu to transition to "idle" by returning to userspace with an exit code of VM_EXITCODE_REQIDLE. MFC after: 2 weeks
* Don't rely on the 'VM-exit instruction length' field in the VMCS to alwaysneel2015-05-221-0/+1
| | | | | | | | | | have an accurate length on an EPT violation. This is not needed by the instruction decoding code because it also has to work with AMD/SVM that does not provide a valid instruction length on a Nested Page Fault. In collaboration with: Leon Dang (ldang@nahannisys.com) Discussed with: grehan MFC after: 1 week
* When fetching an instruction in non-64bit mode, consider the value of thetychon2015-03-241-0/+6
| | | | | | | | | code segment base address. Also if an instruction doesn't support a mod R/M (modRM) byte, don't be concerned if the CPU is in real mode. Reviewed by: neel
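A rough sketch of the address calculation this commit refers to is shown below. The function name `fetch_linear_addr` and its parameters are assumptions made for illustration; the real code obtains the mode and segment base from the VMCS.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: linear address from which the instruction bytes
 * are fetched. In 64-bit mode the CS base is treated as zero, so %rip is
 * used directly. In every other mode the CS base is added and the result
 * is truncated to 32 bits.
 */
static uint64_t
fetch_linear_addr(bool mode_64bit, uint64_t cs_base, uint64_t rip)
{
	if (mode_64bit)
		return (rip);
	return ((cs_base + rip) & 0xffffffffULL);
}
```

For a real-mode guest running at the reset vector segment, a CS base of 0xf0000 and an instruction pointer of 0xfff0 give a fetch address of 0xffff0 instead of 0xfff0.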
* Use lapic_ipi_alloc() to dynamically allocate IPI slots needed by bhyve whenneel2015-03-141-6/+5
| | | | | | | | vmm.ko is loaded. Also relocate the 'justreturn' IPI handler to be alongside all other handlers. Requested by: kib
* Always emulate MSR_PAT on Intel processors and don't rely on PAT save/restoreneel2015-02-241-9/+2
| | | | | | | | | | | | | capability of VT-x. This lets bhyve run nested in older VMware versions that don't support the PAT save/restore capability. Note that the actual value programmed by the guest in MSR_PAT is irrelevant because bhyve sets the 'Ignore PAT' bit in the nested PTE. Reported by: marcel Tested by: Leon Dang (ldang@nahannisys.com) Sponsored by: Nahanni Systems MFC after: 2 weeks
* Simplify instruction restart logic in bhyve.neel2015-01-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | Keep track of the next instruction to be executed by the vcpu as 'nextrip'. As a result the VM_RUN ioctl no longer takes the %rip where a vcpu should start execution. Also, instruction restart happens implicitly via 'vm_inject_exception()' or explicitly via 'vm_restart_instruction()'. The APIs behave identically in both kernel and userspace contexts. The main beneficiary is the instruction emulation code that executes in both contexts. bhyve(8) VM exit handlers now treat 'vmexit->rip' and 'vmexit->inst_length' as readonly: - Restarting an instruction is now done by calling 'vm_restart_instruction()' as opposed to setting 'vmexit->inst_length' to 0 (e.g. emulate_inout()) - Resuming vcpu at an arbitrary %rip is now done by setting VM_REG_GUEST_RIP as opposed to changing 'vmexit->rip' (e.g. vmexit_task_switch()) Differential Revision: https://reviews.freebsd.org/D1526 Reviewed by: grehan MFC after: 2 weeks
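The 'nextrip' bookkeeping described above might look roughly like this. The structure and function names (`vcpu_sketch`, `vcpu_record_exit`, `vcpu_restart_instruction`) are hypothetical stand-ins and do not reproduce the vmm.ko definitions.

```c
#include <stdint.h>

/* Hypothetical per-vcpu state; the real 'struct vcpu' holds much more. */
struct vcpu_sketch {
	uint64_t nextrip;	/* where the vcpu resumes on the next VM_RUN */
};

/* On a VM exit, default to resuming after the exiting instruction. */
static void
vcpu_record_exit(struct vcpu_sketch *v, uint64_t rip, int inst_length)
{
	v->nextrip = rip + (uint64_t)inst_length;
}

/* Restarting an instruction rewinds 'nextrip' to the exiting %rip. */
static void
vcpu_restart_instruction(struct vcpu_sketch *v, uint64_t rip)
{
	v->nextrip = rip;
}
```

Keeping the resume address inside the vcpu is what lets the exit handlers treat 'vmexit->rip' and 'vmexit->inst_length' as read-only.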
* 'struct vm_exception' was intended to be used only as the collateral for theneel2015-01-131-9/+9
| | | | | | | | | | | | | | | | VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping track of pending exceptions for a vcpu. This in turn causes confusion because some fields in 'struct vm_exception' like 'vcpuid' make sense only in the ioctl context. It also makes it harder to add or remove structure fields. Fix this by using 'struct vm_exception' only to communicate information from userspace to vmm.ko when injecting an exception. Also, add a field 'restart_instruction' to 'struct vm_exception'. This field is set to '1' for exceptions where the faulting instruction is restarted after the exception is handled. MFC after: 1 week
* Clear blocking due to STI or MOV SS in the hypervisor when an instruction isneel2015-01-061-10/+28
| | | | | | | | | | | emulated or when the vcpu incurs an exception. This matches the CPU behavior. Remove special case code in HLT processing that was clearing the interrupt shadow. This is now redundant because the interrupt shadow is always cleared when the vcpu is resumed after an instruction is emulated. Reported by: David Reed (david.reed@tidalscale.com) MFC after: 2 weeks
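Clearing the interrupt shadow amounts to dropping two bits from the guest interruptibility state. The sketch below uses the bit positions documented in the Intel SDM (bit 0 for blocking by STI, bit 1 for blocking by MOV SS); the macro and function names are local to this example and are not the identifiers used in vmx.c.

```c
#include <stdint.h>

/* Guest interruptibility-state bits, named locally for this sketch. */
#define	INTR_STATE_STI_BLOCKING		(1u << 0)
#define	INTR_STATE_MOVSS_BLOCKING	(1u << 1)

/*
 * Hypothetical helper: return the interruptibility state with the
 * one-instruction interrupt shadow removed, leaving other bits (for
 * example NMI blocking) untouched.
 */
static uint32_t
clear_interrupt_shadow(uint32_t interruptibility)
{
	return (interruptibility &
	    ~(INTR_STATE_STI_BLOCKING | INTR_STATE_MOVSS_BLOCKING));
}
```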
* Allow ktr(4) tracing of all guest exceptions via the tunableneel2014-12-231-6/+70
| | | | | | | | | | | | | | | | "hw.vmm.trace_guest_exceptions". To enable this feature set the tunable to "1" before loading vmm.ko. Tracing the guest exceptions can be useful when debugging guest triple faults. Note that there is a performance impact when exception tracing is enabled since every exception will now trigger a VM-exit. Also, handle machine check exceptions that happen during guest execution by vectoring to the host's machine check handler via "int $18". Discussed with: grehan MFC after: 2 weeks
* IFC @r272887neel2014-10-101-0/+8
|\
| * Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT'.neel2014-10-061-0/+8
| | | | | | | | | | | | | | | | The hypervisor hides the MONITOR/MWAIT capability by unconditionally setting CPUID.01H:ECX[3] to 0 so the guest should not expect these instructions to be present anyways. Discussed with: grehan
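Hiding the capability is a single-bit mask on the CPUID result. A minimal sketch, with a macro and function name chosen for this example only:

```c
#include <stdint.h>

/* CPUID.01H:ECX bit 3 advertises MONITOR/MWAIT support. */
#define	CPUID_01_ECX_MONITOR	(1u << 3)

/*
 * Hypothetical helper: value of %ecx returned to the guest for CPUID
 * leaf 1, with the MONITOR/MWAIT bit forced to 0 and all other feature
 * bits passed through unchanged.
 */
static uint32_t
hide_monitor_mwait(uint32_t ecx)
{
	return (ecx & ~CPUID_01_ECX_MONITOR);
}
```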
* | IFC @r272481neel2014-10-051-55/+17
|\ \ | |/
| * Get rid of code that dealt with the hardware not being able to save/restoreneel2014-10-021-55/+17
| | | | | | | | | | | | | | | | | | | | the PAT MSR on guest exit/entry. This workaround was done for a beta release of VMware Fusion 5 but is no longer needed in later versions. All Intel CPUs since Nehalem have supported saving and restoring MSR_PAT in the VM exit and entry controls. Discussed with: grehan
* | IFC @r272185neel2014-09-271-7/+0
|\ \ | |/
| * MSR_KGSBASE is no longer saved and restored from the guest MSR save area. Thisneel2014-09-201-7/+0
| | | | | | | | | | | | | | | | behavior was changed in r271888 so update the comment block to reflect this. MSR_KGSBASE is accessible from the guest without triggering a VM-exit. The permission bitmap for MSR_KGSBASE is modified by vmx_msr_guest_init() so get rid of redundant code in vmx_vminit().
| * Restructure the MSR handling so it is entirely handled by processor-specificneel2014-09-201-36/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | code. There are only a handful of MSRs common between the two so there isn't too much duplicate functionality. The VT-x code has the following types of MSRs: - MSRs that are unconditionally saved/restored on every guest/host context switch (e.g., MSR_GSBASE). - MSRs that are restored to guest values on entry to vmx_run() and saved before returning. This is an optimization for MSRs that are not used in host kernel context (e.g., MSR_KGSBASE). - MSRs that are emulated and every access by the guest causes a trap into the hypervisor (e.g., MSR_IA32_MISC_ENABLE). Reviewed by: grehan
* | IFC r271888.neel2014-09-201-36/+55
| | | | | | | | Restructure MSR emulation so it is all done in processor-specific code.
* | IFC @r271694neel2014-09-171-0/+46
|\ \ | |/
| * Optimize the common case of injecting an interrupt into a vcpu after a HLTneel2014-09-121-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | by explicitly moving it out of the interrupt shadow. The hypervisor is done "executing" the HLT and by definition this moves the vcpu out of the 1-instruction interrupt shadow. Prior to this change the interrupt would be held pending because the VMCS guest-interruptibility-state would indicate that "blocking by STI" was in effect. This resulted in an unnecessary round trip into the guest before the pending interrupt could be injected. Reviewed by: grehan
* | AMD processors that have the SVM decode assist capability will store theneel2014-09-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | instruction bytes in the VMCB on a nested page fault. This is useful because it saves having to walk the guest page tables to fetch the instruction. vie_init() now takes two additional parameters 'inst_bytes' and 'inst_len' that map directly to 'vie->inst[]' and 'vie->num_valid'. The instruction emulation handler skips calling 'vmm_fetch_instruction()' if 'vie->num_valid' is non-zero. The use of this capability can be turned off by setting the sysctl/tunable 'hw.vmm.svm.disable_npf_assist' to '1'. Reviewed by: Anish Gupta (akgupt3@gmail.com) Discussed with: grehan
* | IFC @r269962neel2014-09-021-173/+479
|\ \ | |/ | | | | Submitted by: Anish Gupta (akgupt3@gmail.com)
| * - Output a summary of optional VT-x features in dmesg similar to CPUjhb2014-07-301-7/+26
| | | | | | | | | | | | | | | | | | | | | | | | features. If bootverbose is enabled, a detailed list is provided; otherwise, a single-line summary is displayed. - Add read-only sysctls for optional VT-x capabilities used by bhyve under a new hw.vmm.vmx.cap node. Move a few existing sysctls that indicate the presence of optional capabilities under this node. CR: https://phabric.freebsd.org/D498 Reviewed by: grehan, neel MFC after: 1 week
| * If a vcpu has issued a HLT instruction with interrupts disabled then it sleepsneel2014-07-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | forever in vm_handle_hlt(). This is usually not an issue as long as one of the other vcpus properly resets or powers off the virtual machine. However, if the bhyve(8) process is killed with a signal the halted vcpu cannot be woken up because its sleep cannot be interrupted. Fix this by waking up periodically and returning from vm_handle_hlt() if TDF_ASTPENDING is set. Reported by: Leon Dang Sponsored by: Nahanni Systems
| * Fix build without INVARIANTS defined by getting rid of unused variable 'exc'.neel2014-07-201-2/+1
| | | | | | | | Reported by: adrian, stefanf
| * Handle nested exceptions in bhyve.neel2014-07-191-47/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A nested exception condition arises when a second exception is triggered while delivering the first exception. Most nested exceptions can be handled serially but some are converted into a double fault. If an exception is generated during delivery of a double fault then the virtual machine shuts down as a result of a triple fault. vm_exit_intinfo() is used to record that a VM-exit happened while an event was being delivered through the IDT. If an exception is triggered while handling the VM-exit it will be treated like a nested exception. vm_entry_intinfo() is used by processor-specific code to get the event to be injected into the guest on the next VM-entry. This function is responsible for deciding the disposition of nested exceptions.
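The disposition of a nested exception follows the exception classes in the Intel SDM (benign, contributory, page fault). The sketch below illustrates that rule only; it is not the logic of vm_entry_intinfo(), and every identifier in it is local to the example.

```c
/* Vector numbers used below, named locally for this sketch. */
enum {
	VEC_DE = 0, VEC_DF = 8, VEC_TS = 10, VEC_NP = 11,
	VEC_SS = 12, VEC_GP = 13, VEC_PF = 14
};

enum exc_class { EXC_BENIGN, EXC_CONTRIBUTORY, EXC_PAGEFAULT };
enum nested_action { NESTED_SERIAL, NESTED_DOUBLE_FAULT, NESTED_TRIPLE_FAULT };

static enum exc_class
exception_class(int vector)
{
	switch (vector) {
	case VEC_DE: case VEC_TS: case VEC_NP: case VEC_SS: case VEC_GP:
		return (EXC_CONTRIBUTORY);
	case VEC_PF:
		return (EXC_PAGEFAULT);
	default:
		return (EXC_BENIGN);
	}
}

/* Decide what happens when 'second' is raised while delivering 'first'. */
static enum nested_action
nested_exception(int first, int second)
{
	enum exc_class c1, c2;

	c2 = exception_class(second);
	if (first == VEC_DF)	/* fault while delivering a double fault */
		return (c2 != EXC_BENIGN ? NESTED_TRIPLE_FAULT : NESTED_SERIAL);
	c1 = exception_class(first);
	if ((c1 == EXC_CONTRIBUTORY && c2 == EXC_CONTRIBUTORY) ||
	    (c1 == EXC_PAGEFAULT && c2 != EXC_BENIGN))
		return (NESTED_DOUBLE_FAULT);
	return (NESTED_SERIAL);
}
```

Note the asymmetry: a page fault raised while delivering a contributory exception is handled serially, whereas a contributory exception raised while delivering a page fault becomes a double fault.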
| * Add emulation for legacy x86 task switching mechanism.neel2014-07-161-5/+77
| | | | | | | | | | | | | | FreeBSD/i386 uses task switching to handle double fault exceptions and this change enables that to work. Reported by: glebius
| * Add support for operand size and address size override prefixes in bhyve'sneel2014-07-151-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | instruction emulation [1]. Fix bug in emulation of opcode 0x8A where the destination is a legacy high byte register and the guest vcpu is in 32-bit mode. Prior to this change instead of modifying %ah, %bh, %ch or %dh the emulation would end up modifying %spl, %bpl, %sil or %dil instead. Add support for moffsets by treating it as a 2, 4 or 8 byte immediate value during instruction decoding. Fix bug in verify_gla() where the linear address computed after decoding the instruction was not being truncated to the effective address size [2]. Tested by: Leon Dang [1] Reported by: Peter Grehan [2] Sponsored by: Nahanni Systems
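The high-byte register bug comes from the x86 rule that 8-bit register encodings 4 through 7 select %ah, %ch, %dh and %bh when no REX prefix is present, and %spl, %bpl, %sil and %dil when one is. A minimal sketch of that decoding rule follows; the enum and function are hypothetical, and the REX-extended registers %r8b to %r15b are deliberately left out.

```c
#include <stdbool.h>

enum byte_reg {
	BR_AL, BR_CL, BR_DL, BR_BL,
	BR_AH, BR_CH, BR_DH, BR_BH,	/* encodings 4-7 without REX */
	BR_SPL, BR_BPL, BR_SIL, BR_DIL,	/* encodings 4-7 with REX */
	BR_INVALID
};

/*
 * Hypothetical helper: map a 3-bit register encoding from an 8-bit
 * operand to the register it names.
 */
static enum byte_reg
decode_byte_reg(int reg, bool rex_present)
{
	if (reg < 0 || reg > 7)
		return (BR_INVALID);
	if (reg < 4)
		return ((enum byte_reg)(BR_AL + reg));
	if (rex_present)
		return ((enum byte_reg)(BR_SPL + (reg - 4)));
	return ((enum byte_reg)(BR_AH + (reg - 4)));
}
```

A guest in 32-bit mode can never supply a REX prefix, so an emulator that ignores the prefix check writes the low byte of %rsp where the guest expected bits 15:8 of %rax.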
| * Accurately identify the vcpu's operating mode as 64-bit, compatibility,neel2014-07-081-4/+12
| | | | | | | | protected or real.
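One way to distinguish the four modes is from EFER.LMA, the CS.L bit and CR0.PE, as sketched below. This is an illustration under stated assumptions: the identifiers are local to the example, and CS.L is passed in as a boolean instead of being extracted from the VMCS access-rights field.

```c
#include <stdbool.h>
#include <stdint.h>

#define	EFER_LMA_BIT	(1ULL << 10)	/* long mode active */
#define	CR0_PE_BIT	(1ULL << 0)	/* protection enable */

enum cpu_mode_sketch {
	MODE_REAL, MODE_PROTECTED, MODE_COMPATIBILITY, MODE_64BIT
};

/* Hypothetical helper: classify the vcpu's current operating mode. */
static enum cpu_mode_sketch
classify_cpu_mode(uint64_t efer, uint64_t cr0, bool cs_long)
{
	if (efer & EFER_LMA_BIT)
		return (cs_long ? MODE_64BIT : MODE_COMPATIBILITY);
	if (cr0 & CR0_PE_BIT)
		return (MODE_PROTECTED);
	return (MODE_REAL);
}
```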
| * Invalidate guest TLB mappings as a side-effect of its CR3 being updated.neel2014-07-081-27/+68
| | | | | | | | | | This is a pre-requisite for task switch emulation since the CR3 is loaded from the new TSS.
| * Bring an overly enthusiastic KASSERT in line with the Intel SDM.tychon2014-06-161-2/+18
| | | | | | | | Reviewed by: neel
| * Add helper functions to populate VM exit information for rendezvous andneel2014-06-101-32/+8
| | | | | | | | | | astpending exits. This is to reduce code duplication between VT-x and SVM implementations.
| * Turn on interrupt window exiting unconditionally when an ExtINT is beingneel2014-06-101-2/+6
| | | | | | | | | | | | | | | | | | | | injected into the guest. This allows the hypervisor to inject another ExtINT or APIC vector as soon as the guest is able to process interrupts. This change is not to address any correctness issue but to guarantee that any pending APIC vector that was preempted by the ExtINT will be injected as soon as possible. Prior to this change such pending interrupts could be delayed until the next VM exit.
| * Add reserved bit checking when doing %CR8 emulation and inject #GP if required.neel2014-06-091-6/+9
| | | | | | | | | | Pointed out by: grehan Reviewed by: tychon
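Only bits 3:0 of %cr8 are defined; bits 63:4 are reserved and a write that sets any of them raises #GP. The check, and the mapping of %cr8 onto bits 7:4 of the local APIC task priority register, can be sketched like this (function names are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* A %cr8 write is valid only if the reserved bits 63:4 are all zero. */
static bool
cr8_write_is_valid(uint64_t val)
{
	return ((val & ~0xfULL) == 0);
}

/* %cr8[3:0] corresponds to TPR[7:4]; TPR[3:0] reads as zero here. */
static uint8_t
cr8_to_tpr(uint64_t cr8)
{
	return ((uint8_t)((cr8 & 0xf) << 4));
}
```

An emulator would inject #GP into the guest whenever `cr8_write_is_valid()` returns false and update the virtual TPR otherwise.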
| * Support guest accesses to %cr8.tychon2014-06-061-50/+144
| | | | | | | | Reviewed by: neel
| * If VMX isn't enabled, it can still be enabled so long as the lock bit in MSRtychon2014-05-301-1/+10
| | | | | | | | | | | | IA32_FEATURE_CONTROL isn't set yet. Approved by: grehan (co-mentor)
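The condition can be expressed as a small predicate on the MSR value. The bit positions follow the Intel SDM (bit 0 is the lock bit, bit 2 enables VMX outside SMX operation); the macro and function names are assumptions for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define	FEATURE_CONTROL_LOCK	(1ULL << 0)
#define	FEATURE_CONTROL_VMX_EN	(1ULL << 2)	/* VMX outside SMX */

/*
 * Hypothetical helper: VMX is usable if it is already enabled, or if the
 * MSR is still unlocked so that the host can enable it itself. It is
 * unusable only when the MSR was locked with the enable bit clear.
 */
static bool
vmx_can_be_enabled(uint64_t feature_control)
{
	if ((feature_control & FEATURE_CONTROL_LOCK) == 0)
		return (true);
	return ((feature_control & FEATURE_CONTROL_VMX_EN) != 0);
}
```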
| * - Rework the XSAVE/XRSTOR emulation to only expose XCR0 features to thejhb2014-05-271-2/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | guest for which the rules regarding xsetbv emulation are known. In particular future extensions like AVX-512 have interdependencies among feature bits that could allow a guest to trigger a #GP in the host with the current approach of allowing anything the host supports. - Add proper checking of Intel MPX and AVX-512 XSAVE features in the xsetbv emulation and allow these features to be exposed to the guest if they are enabled in the host. - Expose a subset of known-safe features from leaf 0 of the structured extended features to guests if they are supported on the host including RDFSBASE/RDGSBASE, BMI1/2, AVX2, AVX-512, HLE, ERMS, and RTM. Aside from AVX-512, these features are all new instructions available for use in ring 3 with no additional hypervisor changes needed. Reviewed by: neel
* | ins/outs support for SVM. Modelled on the Intel VT-x code.grehan2014-06-061-2/+0
|/ | | | | | | | | Remove CR2 save/restore - the guest restore/save is done in hardware, and there is no need to save/restore the host version (same as VT-x). Submitted by: neel (SVM segment descriptor 'P' bit code) Reviewed by: neel
* Do the linear address calculation for the ins/outs emulation using a newneel2014-05-251-1/+0
| | | | | | | API function 'vie_calculate_gla()'. While the current implementation is simplistic it forms the basis of doing segmentation checks if the guest is in 32-bit protected mode.
* Consolidate all the information needed by the guest page table walker intoneel2014-05-241-10/+14
| | | | | | | | | | 'struct vm_guest_paging'. Check for canonical addressing in vmm_gla2gpa() and inject a protection fault into the guest if a violation is detected. If the page table walk is restarted in vmm_gla2gpa() then reset 'ptpphys' to point to the root of the page tables.
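A canonical-address check of the kind mentioned here can be sketched as follows, assuming 48-bit linear addresses (4-level paging). The function name is hypothetical and the sketch does not reproduce the check in vmm_gla2gpa().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: with 48-bit linear addresses, an address is
 * canonical when bits 63:47 are all zeros or all ones, that is, when
 * bits 63:48 are a sign extension of bit 47.
 */
static bool
is_canonical48(uint64_t gla)
{
	uint64_t upper = gla >> 47;	/* bits 63:47, 17 bits wide */

	return (upper == 0 || upper == 0x1ffff);
}
```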
* When injecting a page fault into the guest also update the guest's %cr2 toneel2014-05-241-0/+2
| | | | | | | | | | indicate the faulting linear address. If the guest PML4 entry has the PG_PS bit set then inject a page fault into the guest with the PGEX_RSV bit set in the error_code. Get rid of redundant checks for the PG_RW violations when walking the page tables.
* Add emulation of the "outsb" instruction. NetBSD guests use this to write toneel2014-05-231-8/+99
| | | | | | | | | | | | the UART FIFO. The emulation is constrained in a number of ways: 64-bit only, doesn't check for all exception conditions, limited to i/o ports emulated in userspace. Some of these constraints will be relaxed in followup commits. Requested by: grehan Reviewed by: tychon (partially and a much earlier version)
* Allow vmx_getdesc() and vmx_setdesc() to be called for a vcpu that is in theneel2014-05-221-2/+12
| | | | | VCPU_RUNNING state. This will let the VMX exit handler inspect the vcpu's segment descriptors without having to exit the critical section.
* Add PG_U (user/supervisor) checks when translating a guest linear addressneel2014-05-191-12/+27
| | | | | | | | | to a guest physical address. PG_PS (page size) field is valid only in a PDE or a PDPTE so it is now checked only in non-terminal paging entries. Ignore the upper 32-bits of the CR3 for PAE paging.
* Ignore writes to microcode update MSR. This MSR is accessed by RHEL7 guest.neel2014-04-301-0/+3
| | | | Add KTR tracepoints to annotate wrmsr and rdmsr VM exits.
* Allow a virtual machine to be forcibly reset or powered off. This is doneneel2014-04-281-11/+2
| | | | | | | | | | | | | by adding an argument to the VM_SUSPEND ioctl that specifies how the virtual machine should be suspended, viz. VM_SUSPEND_RESET or VM_SUSPEND_POWEROFF. The disposition of VM_SUSPEND is also made available to the exit handler via the 'u.suspended' member of 'struct vm_exit'. This capability is exposed via the '--force-reset' and '--force-poweroff' arguments to /usr/sbin/bhyvectl. Discussed with: grehan@