path: root/sys/amd64/amd64/trap.c
Commit log (each entry shows the commit message, author, date, files changed, and lines changed as -removed/+added):
* MFC r293613:  (dchagin, 2016-01-16, 1 file, -0/+7)
    Implement the vsyscall hack. Prior to 2.13, glibc uses vsyscall instead of the vdso. The upcoming linux_base-c6 port needs it.
* MFC r283870:  (dim, 2015-06-08, 1 file, -8/+2)
    Remove unneeded NULL checks in amd64's trap_fatal(). Since td_name is an array member of struct thread, it can never be NULL, so the check can be removed. In addition, curproc can never be NULL, so remove the if statement and splice the two printf()s together. While here, remove the u_long cast and use the correct printf format specifier for curproc->p_pid.
    Reviewed by:  kib
    Differential Revision:  https://reviews.freebsd.org/D2695
* MFC r280780:  (kib, 2015-03-31, 1 file, -2/+0)
    The #ss fault handler erroneously does not check whether the fault originated from the return to usermode. #ss must be handled the same as #np.
* MFC r277055:  (kib, 2015-01-19, 1 file, -6/+0)
    Revert r263475: TDP_DEVMEMIO no longer needed.
* MFC r271747:  (kib, 2014-10-04, 1 file, -2/+2)
    - Use NULL instead of 0 for fpcurthread.
    - Note the quirk with the interrupt-enabled state of the dna handler.
    - Use just panic() instead of printf() and panic(). Print the tid instead of the pid; the FPU state is per-thread.

    MFC r271924:
    Update and clarify comments. Remove the useless counter for an impossible, but seen in the wild, situation (on buggy hypervisors).
* MFC r266826, r266827  (markj, 2014-08-09, 1 file, -22/+0)
    Move some duplicated hook definitions from machine-dependent files to kern_dtrace.c.
* MFC r263329:  (markj, 2014-07-29, 1 file, -19/+21)
    Only invoke fasttrap hooks for traps from user mode, and ensure that they're called with interrupts enabled. Calling fasttrap_pid_probe() with interrupts disabled can lead to deadlock if fasttrap writes to the process' address space.
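    A minimal sketch of what this gating can look like in trap(); the hook pointer's exact signature has varied across FreeBSD versions, so the types and names below are assumptions, not the committed code:

        /* Illustrative fragment of a trap() handler, not verbatim kernel code. */
        if (TRAPF_USERMODE(frame) && dtrace_pid_probe_ptr != NULL) {
                /*
                 * fasttrap may write to the traced process' address space,
                 * which can fault or sleep, so run it with interrupts on.
                 */
                enable_intr();
                if ((*dtrace_pid_probe_ptr)(frame) == 0)
                        return;         /* the probe consumed the trap */
        }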
* MFC r268383:  (kib, 2014-07-15, 1 file, -0/+4)
    Correct si_code for the SIGBUS signal generated by the alignment trap.
* MFC r263475:  (kib, 2014-03-28, 1 file, -0/+6)
    Fix two issues with /dev/mem access on amd64, both causing kernel page faults. First, accesses to the direct map region should be checked against the limit up to which the direct map is instantiated. Second, for accesses to the kernel map, use a new thread-private flag TDP_DEVMEMIO, which instructs vm_fault() to return an error when the fault happens on a MAP_ENTRY_NOFAULT entry, instead of panicking.

    MFC r263498:
    Add a change forgotten in r263475. Make dmaplimit accessible outside amd64/pmap.c.
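    A hedged sketch of the vm_fault() check described above; TDP_DEVMEMIO and MAP_ENTRY_NOFAULT are named in the commit, while the surrounding variables and the exact return value are assumptions:

        /* Illustrative fragment of the fault path, not verbatim kernel code. */
        if ((entry->eflags & MAP_ENTRY_NOFAULT) != 0) {
                if ((curthread->td_pflags & TDP_DEVMEMIO) != 0)
                        return (KERN_FAILURE);  /* surfaces as an error to the /dev/mem caller */
                panic("vm_fault on nofault entry, addr %#lx", (u_long)vaddr);
        }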
* MFC r257417: Remove references to an unused fasttrap probe hook  (avg, 2014-02-17, 1 file, -11/+4)
* Merge projects/bhyve_npt_pmap into head.  (neel, 2013-10-05, 1 file, -1/+10)
    Make the amd64/pmap code aware of nested page table mappings used by bhyve guests. This allows bhyve to associate each guest with its own vmspace and deal with nested page faults in the context of that vmspace. This also enables features like accessed/dirty bit tracking, swapping to disk and transparent superpage promotions of guest memory.

    Guest vmspace:
    Each bhyve guest has a unique vmspace to represent the physical memory allocated to the guest. Each memory segment allocated by the guest is mapped into the guest's address space via the 'vmspace->vm_map' and is backed by an object of type OBJT_DEFAULT.

    pmap types:
    The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT. The PT_X86 pmap type is used by the vmspace associated with the host kernel as well as user processes executing on the host. The PT_EPT pmap is used by the vmspace associated with a bhyve guest.

    Page Table Entries:
    The EPT page table entries are mostly similar in functionality to regular page table entries, although there are some differences in terms of what bits are used to express that functionality. For example, the dirty bit is represented by bit 9 in the nested PTE as opposed to bit 6 in the regular x86 PTE. Therefore the bitmask representing the dirty bit is now computed at runtime based on the type of the pmap. Thus PG_M, previously a macro, now becomes a local variable that is initialized at runtime using 'pmap_modified_bit(pmap)'.

    An additional wrinkle associated with EPT mappings is that older Intel processors don't have hardware support for tracking accessed/dirty bits in the PTE. This means that the amd64/pmap code needs to emulate these bits to provide proper accounting to the VM subsystem. This is achieved by using the following mapping for EPT entries that need emulation of A/D bits:

        Bit     Position    Interpreted by
        PG_V    52          software (accessed bit emulation handler)
        PG_RW   53          software (dirty bit emulation handler)
        PG_A    0           hardware (aka EPT_PG_RD)
        PG_M    1           hardware (aka EPT_PG_WR)

    The idea to use the mapping listed above for A/D bit emulation came from Alan Cox (alc@).

    The final difference with respect to x86 PTEs is that some EPT implementations do not support superpage mappings. This is recorded in the 'pm_flags' field of the pmap.

    TLB invalidation:
    The amd64/pmap code has a number of ways to do invalidation of mappings that may be cached in the TLB: single page, multiple pages in a range, or the entire TLB. All of these funnel into a single EPT invalidation routine called 'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and sends an IPI to the host cpus that are executing the guest's vcpus. On a subsequent entry into the guest it will detect that the EPT has changed and invalidate the mappings from the TLB.

    Guest memory access:
    Since the guest memory is no longer wired, we need to hold the host physical page that backs the guest physical page before we can access it. The helper functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose.

    PCI passthru:
    Guests with PCI passthru devices will wire the entire guest physical address space. The MMIO BAR associated with the passthru device is backed by a vm_object of type OBJT_SG. An IOMMU domain is created only for guests that have one or more PCI passthru devices attached to them.

    Limitations:
    There isn't a way to map a guest physical page without execute permissions. This is because the amd64/pmap code interprets the guest physical mappings as user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U shares the same bit position as EPT_PG_EXECUTE, all guest mappings become automatically executable.

    Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews as well as their support and encouragement. Thanks to John Baldwin for reviewing the use of OBJT_SG as the backing object for pci passthru mmio regions. Special thanks to Peter Holm for testing the patch on short notice.
    Approved by:  re
    Discussed with:  grehan
    Reviewed by:  alc, kib
    Tested by:  pho
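    The runtime selection of the dirty-bit mask can be pictured with a small, self-contained sketch; the bit positions follow the commit text and the table above, but the struct and helper are simplified stand-ins, not the actual pmap code:

        #include <stdint.h>

        enum pmap_type { PT_X86, PT_EPT };

        struct pmap_sketch {
                enum pmap_type  pm_type;
                int             pm_emulate_ad_bits;     /* EPT without hardware A/D tracking */
        };

        /* Return the PG_M (dirty) bitmask appropriate for this pmap type. */
        static inline uint64_t
        pmap_modified_bit_sketch(const struct pmap_sketch *pmap)
        {
                switch (pmap->pm_type) {
                case PT_X86:
                        return (1ULL << 6);     /* bit 6: dirty bit of a regular x86 PTE */
                case PT_EPT:
                        if (pmap->pm_emulate_ad_bits)
                                return (1ULL << 1);     /* hardware write permission (EPT_PG_WR) doubles as PG_M */
                        return (1ULL << 9);     /* bit 9: hardware EPT dirty bit */
                }
                return (0);
        }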
* Assert that interrupts are enabled in the trap handlers on x86 before calling generic code to deliver signals.  (kib, 2013-06-03, 1 file, -0/+1)
    Discussed with:  bde
    Tested by:  pho
    MFC after:  1 week
* When reporting the fault details, also print %rsp.  (kib, 2013-05-27, 1 file, -2/+2)
    Sponsored by:  The FreeBSD Foundation
    MFC after:  1 week
* Print the %rip value for uprintf_signal.  (kib, 2012-10-14, 1 file, -1/+3)
    MFC after:  1 week
* userret() already checks for td_locks when INVARIANTS is enabled, so there is no need to check if Giant is acquired after it.  (attilio, 2012-09-08, 1 file, -1/+0)
    Reviewed by:  kib
    MFC after:  1 week
* Add a hackish debugging facility to provide a bit of information about the reason for a generated trap.  (kib, 2012-08-14, 1 file, -2/+20)
    The dump of basic signal information and 8 bytes of the faulting instruction are printed on the controlling terminal of the process, if the machdep.uprintf_signal sysctl is enabled.

    The print is the only practical way I am aware of to debug traps from a.out processes. Because I have to reimplement it each time I debug an issue with a.out support on amd64, commit the hack to the main tree.
    MFC after:  1 week
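    A rough sketch of what such a facility can look like in the signal-delivery path of trap(); the sysctl knob is named in the commit, while the trapframe fields and surrounding variables are assumptions for illustration:

        /* Illustrative fragment, gated by the machdep.uprintf_signal sysctl. */
        if (uprintf_signal) {
                uint8_t insn[8];
                int i;

                bzero(insn, sizeof(insn));
                (void)copyin((void *)frame->tf_rip, insn, sizeof(insn));  /* best effort */
                uprintf("pid %d comm %s: signal %d err %#lx code %d rip %#lx rsp %#lx insn",
                    p->p_pid, p->p_comm, sig, frame->tf_err, ucode,
                    frame->tf_rip, frame->tf_rsp);
                for (i = 0; i < nitems(insn); i++)
                        uprintf(" %02x", insn[i]);
                uprintf("\n");
        }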
* Introduce a curpcb magic variable, similar to curthread, which is MD amd64.  (kib, 2012-07-19, 1 file, -6/+5)
    It is implemented as a __pure2 inline with a non-volatile asm read from the pcpu, which allows a compiler to cache its result. Convert most PCPU_GET(pcb) and curthread->td_pcb accesses into curpcb.

    Note that __curthread() uses the magic value 0 as offsetof(struct pcpu, pc_curthread). It seems to be done this way because machine/pcpu.h needs to be processed before sys/pcpu.h, since machine/pcpu.h contributes machine-dependent fields to the struct pcpu definition. As a result, machine/pcpu.h cannot use struct pcpu yet. The __curpcb() also uses a magic constant instead of offsetof(struct pcpu, pc_curpcb) for the same reason. The constants are now defined as symbols and CTASSERTs are added to ensure that future KBI changes do not break the code.
    Requested and reviewed by:  bde
    MFC after:  3 weeks
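    A sketch of the pattern, with a placeholder offset constant; the real header derives the constant from the pcpu layout and CTASSERTs it, as described above:

        #define __pure2                 __attribute__((__const__))
        #define OFFSETOF_CURPCB         32      /* placeholder; the real value matches struct pcpu */

        static __inline __pure2 struct pcb *
        __curpcb(void)
        {
                struct pcb *pcb;

                /* Non-volatile asm: the compiler may cache the loaded value. */
                __asm("movq %%gs:%c1,%0" : "=r" (pcb) : "i" (OFFSETOF_CURPCB));
                return (pcb);
        }
        #define curpcb          (__curpcb())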
* On AMD64, provide siginfo.si_code for floating point errors when the error occurs in the SSE math processor.  (kib, 2012-07-18, 1 file, -2/+4)
    Update comments describing the handling of the exception status bits in coprocessor control words. Remove the GET_FPU_CW and GET_FPU_SW macros which were used only once. Prefer to use curpcb to access pcb_save over the longer path of referencing the pcb through the thread structure.
    Based on the submission by:  Ed Alley <wea llnl gov>
    PR:  amd64/169927
    Reviewed by:  bde
    MFC after:  3 weeks
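    For the SSE case, the si_code has to be derived from the pending, unmasked MXCSR exception bits. A self-contained sketch of such a mapping (the bit layout is the architectural MXCSR layout; the priority order is an assumption):

        #include <signal.h>
        #include <stdint.h>

        /* Map unmasked, pending MXCSR exception flags to an si_code value. */
        static int
        mxcsr_to_si_code(uint32_t mxcsr)
        {
                /* Flag bits are 0-5; the corresponding mask bits are 7-12. */
                uint32_t pending = mxcsr & ~(mxcsr >> 7) & 0x3f;

                if (pending & 0x01)
                        return (FPE_FLTINV);    /* invalid operation */
                if (pending & 0x04)
                        return (FPE_FLTDIV);    /* divide by zero */
                if (pending & 0x08)
                        return (FPE_FLTOVF);    /* overflow */
                if (pending & 0x10)
                        return (FPE_FLTUND);    /* underflow */
                if (pending & 0x20)
                        return (FPE_FLTRES);    /* inexact result */
                if (pending & 0x02)
                        return (FPE_FLTUND);    /* denormal operand */
                return (0);
        }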
* Adjust the fix in r236953 by not generating the signal manually, but performing the return to usermode using the full return path.  (kib, 2012-06-18, 1 file, -11/+5)
    This consolidates the handling of exceptional situations in fewer places, and is less code as well.
    Reviewed by:  jhb
    MFC after:  1 week
* Fix a problem where zero-length RDATA fields can cause named(8) to crash. [12:03]  (bz, 2012-06-12, 1 file, -0/+17)
    Correct a privilege escalation when returning from kernel if running FreeBSD/amd64 on non-AMD processors. [12:04]

    Fix reference count errors in IPv6 code. [EN-12:02]
    Security:  CVE-2012-1667
    Security:  FreeBSD-SA-12:03.bind
    Security:  CVE-2012-0217
    Security:  FreeBSD-SA-12:04.sysret
    Security:  FreeBSD-EN-12:02.ipv6refcount
    Approved by:  so (simon, bz)
* Make machine check exception logging more readable.  (jhb, 2012-04-02, 1 file, -2/+1)
    On newer Intel systems, an uncorrected ECC error tends to fire on all CPUs in a package simultaneously and the current printf hacks are not sufficient to make the messages legible. Instead, use the existing mca_lock spinlock to serialize calls to mca_log() and change the machine check code to panic directly when an unrecoverable error is encountered, rather than falling back to a trap_fatal() call in trap() (which adds nearly a screen-full of logging messages that aren't useful for machine checks).
    MFC after:  2 weeks
* Add software PMC support.  (fabient, 2012-03-28, 1 file, -1/+16)
    New kernel events can be added at various locations for sampling or counting. This will, for example, allow easy system profiling with well-known tools like pmcstat(8), whatever the processor is.

    Simultaneous usage of software PMCs and hardware PMCs is possible, for example looking at lock acquire failures and page faults while sampling on instructions.
    Sponsored by:  NETASQ
    MFC after:  1 month
* Handle spurious page faults that may occur in no-fault sections of the kernel.  (alc, 2012-03-22, 1 file, -20/+44)
    When access restrictions are added to a page table entry, we flush the corresponding virtual address mapping from the TLB. In contrast, when access restrictions are removed from a page table entry, we do not flush the virtual address mapping from the TLB. This is exactly as recommended in AMD's documentation. In effect, when access restrictions are removed from a page table entry, AMD's MMUs will transparently refresh a stale TLB entry. In short, this saves us from having to perform potentially costly TLB flushes. In contrast, Intel's MMUs are allowed to generate a spurious page fault based upon the stale TLB entry.

    Usually, such spurious page faults are handled by vm_fault() without incident. However, when we are executing no-fault sections of the kernel, we are not allowed to execute vm_fault(). This change introduces special-case handling for spurious page faults that occur in no-fault sections of the kernel.
    In collaboration with:  kib
    Tested by:  gibbs (an earlier version)
    I would also like to acknowledge Hiroki Sato's assistance in diagnosing this problem.
    MFC after:  1 week
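    A hedged sketch of the shape of that special case in trap_pfault(); TDP_NOFAULTING is the real thread flag for no-fault sections, while pmap_spurious_fault() is a hypothetical helper standing in for the PTE re-check the message describes:

        /* Illustrative fragment of trap_pfault(), not verbatim kernel code. */
        if ((td->td_pflags & TDP_NOFAULTING) != 0) {
                /*
                 * vm_fault() may not be called here.  Succeed only if the
                 * fault was spurious, i.e. the PTE already grants the access
                 * and only a stale TLB entry caused the trap.
                 */
                if (pmap_spurious_fault(kernel_pmap, eva, ftype))       /* hypothetical helper */
                        return (0);
                trap_fatal(frame, eva);
                return (-1);
        }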
* Simplify the error checking in one branch of trap_pfault() and update the nearby comment.  (alc, 2012-03-12, 1 file, -10/+5)
    Add missing whitespace to a return statement in trap_pfault().
    Submitted by:  kib [2]
* Add support for the extended FPU states on amd64, both for native 64bit and 32bit ABIs.  (kib, 2012-01-21, 1 file, -1/+1)
    As a side-effect, it enables AVX on capable CPUs. In particular:

    - Query the CPU support for XSAVE, the list of supported extensions and the required size of the FPU save area. The hw.use_xsave tunable is provided for disabling XSAVE, and hw.xsave_mask may be used to select the enabled extensions.

    - Remove the FPU save area from the PCB and dynamically allocate the (run-time sized) user save area on the top of the kernel stack, right above the PCB. Reorganize the thread0 PCB initialization to postpone it until after the BSP is queried for the save area size.

    - The dumppcb, stoppcbs and susppcbs now do not carry the FPU state as well. FPU state is only useful for suspend, where it is saved in the dynamically allocated suspfpusave area.

    - Use XSAVE and XRSTOR to save/restore FPU state, if supported and enabled.

    - Define a new mcontext_t flag _MC_HASFPXSTATE, indicating that mcontext_t has a valid pointer to out-of-struct extended FPU state. Signal handlers are supplied with stack-allocated FPU state. The sigreturn(2) and setcontext(2) syscalls honour the flag, allowing signal handlers to inspect and manipulate extended state in the interrupted context.

    - getcontext(2) never returns extended state, since there is no place in the fixed-sized mcontext_t to place the variable-sized save area. And, since mcontext_t is embedded into ucontext_t, this is impossible to fix in a reasonable way. Instead of extending the getcontext(2) syscall, provide a sysarch(2) facility to query extended FPU state.

    - Add ptrace(2) support for getting and setting extended state; while there, implement the missing PT_I386_{GET,SET}XMMREGS for 32bit binaries.

    - Change the fpu_kern KPI to not expose the struct fpu_kern_ctx layout to consumers, making it opaque. Internally, struct fpu_kern_ctx now contains space for the extended state. Convert in-kernel consumers of the fpu_kern KPI both on i386 and amd64.

    The first version of the support for AVX was submitted by Tim Bird <tim.bird am sony com> on behalf of Sony. This version was written from scratch.
    Tested by:  pho (previous version), Yamagi Burmeister <lists yamagi org>
    MFC after:  1 month
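    The save-path selection can be pictured with a short sketch; the wrapper names and the plumbing of use_xsave/xsave_mask here are illustrative, not the committed functions:

        #include <stdint.h>

        /* Minimal wrappers around the save instructions (illustrative). */
        static inline void
        fxsave_area(void *addr)
        {
                __asm __volatile("fxsave %0" : "=m" (*(char *)addr));
        }

        static inline void
        xsave_area(void *addr, uint64_t mask)
        {
                uint32_t lo = (uint32_t)mask, hi = (uint32_t)(mask >> 32);

                /* XSAVE takes the requested-feature bitmap in EDX:EAX. */
                __asm __volatile("xsave %0" : "=m" (*(char *)addr)
                    : "a" (lo), "d" (hi) : "memory");
        }

        /* Save FPU state with XSAVE when supported and enabled, else FXSAVE. */
        static void
        fpusave_sketch(void *addr, int use_xsave, uint64_t xsave_mask)
        {
                if (use_xsave)
                        xsave_area(addr, xsave_mask);
                else
                        fxsave_area(addr);
        }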
* Attempt to improve formatting and content of several comments for amd64 and i386 MD code.  (kib, 2011-11-09, 1 file, -3/+3)
    Based on suggestions by:  bde
    MFC after:  1 week
* Fix the DTrace pid return trap interrupt vector.  (rstone, 2011-11-07, 1 file, -10/+11)
    Previously we were using 31, but that vector is reserved. Without this fix, running dtrace -p <pid> would either cause the target process to crash or the kernel to page fault.
    Obtained from:  rpaulo
    MFC after:  3 days
* Revert rev. 226893: subr_syscall.c is being included from C files and, on amd64 with FREEBSD32 enabled, this means that systrace_probe_func gets defined twice.  (marcel, 2011-10-30, 1 file, -0/+7)
* Define systrace_probe_func in subr_syscall.c, where it's used, instead of defining it in MD code.  (marcel, 2011-10-29, 1 file, -7/+0)
    This eliminates the need to port it to other architectures.
* Do not allow the kernel to access usermode pages without an installed fault handler; panic immediately in such a situation, on i386 and amd64.  (kib, 2011-10-03, 1 file, -0/+13)
    Reviewed by:  avg, jhb
    MFC after:  1 week
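    A sketch of the check in the kernel-mode page-fault path; VM_MAXUSER_ADDRESS, curpcb->pcb_onfault and trap_fatal() exist in this area of the kernel, but the exact placement and variable names are assumptions:

        /* Illustrative fragment of trap_pfault() for a kernel-mode fault. */
        if (!usermode && eva < VM_MAXUSER_ADDRESS &&
            curpcb->pcb_onfault == NULL) {
                /*
                 * No copyin()/copyout()-style handler installed: do not
                 * silently fault in a usermode page; report and panic.
                 */
                trap_fatal(frame, eva);
                return (-1);
        }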
* Put amd64_syscall() prototype in md_var.h.  (kib, 2011-09-15, 1 file, -1/+0)
    Requested by:  jhb
    Reviewed by:  alc, jhb
    Approved by:  re (bz)
    MFC after:  2 weeks
* Perform amd64-specific micro-optimizations for the native syscall entry sequence.  (kib, 2011-09-11, 1 file, -11/+6)
    The effect is ~1% on the microbenchmark. In particular, do not restore registers which are preserved by the C calling sequence. Align the jump target. Avoid unneeded memory accesses by calculating some data in the syscall entry trampoline.
    Reviewed by:  jhb
    Approved by:  re (bz)
    MFC after:  2 weeks
* Inline the syscallenter() and syscallret().  (kib, 2011-09-11, 1 file, -0/+2)
    This reduces the time measured by the syscall entry speed microbenchmarks by ~10% on amd64.
    Submitted by:  jhb
    Approved by:  re (bz)
    MFC after:  2 weeks
* Add tunables that mirror the functionality of the sysctls machdep.panic_on_nmi and machdep.kdb_on_nmi.  (rstone, 2011-04-08, 1 file, -0/+2)
    Approved by:  emaste (mentor)
    MFC after:  1 week
* Fix typos - remove duplicate "the".  (brucec, 2011-02-21, 1 file, -1/+1)
    PR:  bin/154928
    Submitted by:  Eitan Adler <lists at eitanadler.com>
    MFC after:  3 days
* To avoid excessive code duplication, create a wrapper that fills regs from the stack frame.  (dchagin, 2011-02-16, 1 file, -32/+2)
    Change the trap() code to use the newly created function instead of explicit register assignments.
* Do not use __FreeBSD_version prefix for the special osrel version.  (kib, 2010-11-14, 1 file, -2/+1)
    The ports/Mk/bsd.port.mk uses sys/param.h to fetch osrel, and cannot grok several constants with the prefix.
    Reported and tested by:  swell.k gmail com
    MFC after:  1 week
* Use symbolic names instead of hardcoding values for magic p_osrel constants.  (kib, 2010-11-14, 1 file, -3/+3)
    MFC after:  1 week
* Move the <machine/mca.h> header to <x86/mca.h>.  (jhb, 2010-11-01, 1 file, -1/+1)
* Call the necessary DTrace function pointers when we have different kinds of traps.  (rpaulo, 2010-08-25, 1 file, -0/+56)
    Sponsored by:  The FreeBSD Foundation
* Remove unused KTRACE includes.  (jhb, 2010-08-19, 1 file, -4/+0)
* Remove a dead test. We already exclude NMI traps from this code in an earlier condition.  (jhb, 2010-07-12, 1 file, -2/+2)
    MFC after:  1 week
* Introduce the x86 kernel interfaces to allow kernel code to use FPU/SSE hardware.  (kib, 2010-06-05, 1 file, -8/+22)
    The caller should provide a save area that is chained into the stack of save areas; the pcb save_area for usermode FPU state is on top. The pcb now contains a pointer to the current FPU saved area, used during FPUDNA handling and context switches. There is also a facility to allow a kernel thread to use the pcb save_area.

    Change the dreaded warnings "npxdna in kernel mode!" into panics when FPU usage is not registered.
    KPI discussed with:  fabient
    Tested by:  pho, fabient
    Hardware provided by:  Sentex Communications
    MFC after:  1 month
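    A sketch of how a kernel consumer might use this KPI; the calls shown reflect the later, opaque-context form of the interface, so treat the exact allocation calls as assumptions rather than what this particular commit introduced:

        /* Illustrative usage from a kernel thread or crypto driver. */
        struct fpu_kern_ctx *ctx;

        ctx = fpu_kern_alloc_ctx(0);
        fpu_kern_enter(curthread, ctx, FPU_KERN_NORMAL);
        /* ... SSE/AVX work, e.g. an AES-NI block loop ... */
        fpu_kern_leave(curthread, ctx);
        fpu_kern_free_ctx(ctx);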
* Reorganize syscall entry and leave handling.  (kib, 2010-05-23, 1 file, -138/+19)
    Extend struct sysvec with three new elements:

    sv_fetch_syscall_args - the method to fetch syscall arguments from usermode into struct syscall_args. The structure is machine-dependent (this might be reconsidered after all architectures are converted).
    sv_set_syscall_retval - the method to set a return value for usermode from the syscall. It is a generalization of cpu_set_syscall_retval(9) that allows ABIs to override the way a return value is set.
    sv_syscallnames - the table of syscall names.

    Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding the call to cpu_set_syscall_retval().

    The new functions syscallenter(9) and syscallret(9) are provided; they use the sv_*syscall* pointers and contain the common repeated code from the syscall() implementations for the architecture-specific syscall trap handlers. Syscallenter() fetches arguments, calls the syscall implementation from the ABI sysent table, and sets up the return frame. The end-of-syscall bookkeeping is done by syscallret().

    Take advantage of the single place for MI syscall handling code and implement the ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the thread is stopped at the syscall entry or return point, respectively. The EXEC flag augments SCX and notifies the debugger that the process address space was changed by one of the exec(2)-family syscalls.

    The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are changed to use syscallenter()/syscallret(). MIPS and arm are not converted and use the mostly unchanged syscall() implementation.
    Reviewed by:  jhb, marcel, marius, nwhitehorn, stas
    Tested by:  marcel (ia64), marius (sparc64), nwhitehorn (powerpc), stas (mips)
    MFC after:  1 month
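    With the MI helpers in place, an architecture's syscall trap handler reduces to roughly the following shape (a simplified sketch, omitting the MD register fixups and error handling):

        /* Illustrative skeleton of an MD syscall trap handler. */
        void
        md_syscall_sketch(struct thread *td)
        {
                struct syscall_args sa;
                int error;

                error = syscallenter(td, &sa);  /* fetch args via sv_fetch_syscall_args, run the handler */

                /* ... MD part: set return registers, handle ERESTART/EJUSTRETURN ... */

                syscallret(td, error, &sa);     /* ktrace/ptrace bookkeeping, userret() */
        }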
* Remove unneeded overrides of the segment registers in the inner trap frame upon a segment register load fault.  (kib, 2010-05-12, 1 file, -4/+0)
    The doreti procedure does not load segment registers when returning to a kernel frame, and the current values in the segment descriptor cache already allow the kernel mode to run, not modified by the faulted load.
    Suggested by:  bde
    Tested by:  pho
    MFC after:  1 week
* Remove debugging code that has not been used since it was committed.  (kib, 2010-05-01, 1 file, -85/+1)
    Suggested by:  bde
    MFC after:  1 week
* Change printf() calls to uprintf() for sigreturn() and trap() complaints about an inaccessible or wrong mcontext, and for the dreaded "kernel trap with interrupts disabled" situation.  (kib, 2010-04-13, 1 file, -1/+1)
    The latter is changed only when the trap is generated from user mode (which shall never happen?). Normalize the messages to include both the pid and the thread name.
    MFC after:  1 week
* Handle the case when a non-canonical address is loaded into the fsbase or gsbase MSR.  (kib, 2010-04-10, 1 file, -0/+8)
    MFC after:  3 days
* For the PT_TO_SCE stop that stops the ptraced process upon syscall entry, syscall arguments are collected before ptracestop() is called.  (kib, 2010-01-23, 1 file, -69/+107)
    As a consequence, the debugger cannot modify the syscall or its arguments. For the i386, amd64 and ia32-on-amd64 MD syscall(), reread the syscall number and arguments after ptracestop() if the debugger modified anything in the process environment. Since the procfs stop event requires the number of syscall arguments in p_xstat, this cannot be solved by moving the stop/trace point before argument fetching.

    Move the code that reads the arguments into a separate function fetch_syscall_args() to avoid code duplication. Note that the ktrace point for a modified syscall is intentionally recorded twice, once with the original arguments, and a second time with the arguments set by the debugger. The PT_TO_SCX stop is already executed after cpu_set_syscall_retval().
    Reported by:  Ali Polatel <alip exherbo org>
    Briefly discussed with:  jhb
    MFC after:  3 weeks
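    The re-read after the debugger stop can be sketched as follows; TDB_USERWR is the real flag indicating the debugger wrote to the stopped thread, while the control flow here is condensed for illustration:

        /* Illustrative fragment of the syscall entry path. */
        error = fetch_syscall_args(td, &sa);
        /* ... the PT_TO_SCE ptracestop() happens here when the process is traced ... */
        if (error == 0 && (td->td_dbgflags & TDB_USERWR) != 0) {
                /*
                 * The debugger changed registers or memory while the thread
                 * was stopped: re-read the syscall number and arguments.
                 */
                error = fetch_syscall_args(td, &sa);
        }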
* Simplify the invocation of vm_fault(). Specifically, eliminate the flag VM_FAULT_DIRTY.  (alc, 2009-11-27, 1 file, -3/+1)
    The information provided by this flag can be trivially inferred by vm_fault().
    Discussed with:  kib