From c878da3b27ceeed953c9f9a1eb002d59e9dcb4c6 Mon Sep 17 00:00:00 2001 From: malc Date: Mon, 5 Nov 2012 21:47:04 +0400 Subject: tcg/ppc32: Use trampolines to trim the code size for mmu slow path accessors mmu access looks something like: if miss goto slow_path done: ... ; end of the TB slow_path:
 mr r3, r27         ; move areg0 to r3
                    ; (r3 holds the first argument for all the PPC32 ABIs)
 
 b $+8
 .long done
 
 b done

On ppc32  is:

(SysV and Darwin)

mmu_helper is most likely not within direct branching distance from
the call site, necessitating

a. moving 32 bit offset of mmu_helper into a GPR ; 8 bytes
b. moving GPR to CTR/LR                          ; 4 bytes
c. (finally) branching to CTR/LR                 ; 4 bytes

r3 setting              - 4 bytes
call                    - 16 bytes
dummy jump over retaddr - 4 bytes
embedded retaddr        - 4 bytes
         Total overhead - 28 bytes

(PowerOpen (AIX))
a. moving 32 bit offset of mmu_helper's TOC into a GPR1 ; 8 bytes
b. loading 32 bit function pointer into GPR2            ; 4 bytes
c. moving GPR2 to CTR/LR                                ; 4 bytes
d. loading 32 bit small area pointer into R2            ; 4 bytes
e. (finally) branching to CTR/LR                        ; 4 bytes

r3 setting              - 4 bytes
call                    - 24 bytes
dummy jump over retaddr - 4 bytes
embedded retaddr        - 4 bytes
         Total overhead - 36 bytes

Following is done to trim the code size of slow path sections:

In tcg_target_qemu_prologue trampolines are emitted that look like this:

trampoline:
mfspr r3, LR
addi  r3, 4
mtspr LR, r3      ; fixup LR to point over embedded retaddr
mr    r3, r27
 ; tail call of sorts

And slow path becomes:

slow_path:
 
 
 .long done
 
 b done

call                    - 4 bytes (trampoline is within code gen buffer
                                   and most likely accessible via
                                   direct branch)
embedded retaddr        - 4 bytes
         Total overhead - 8 bytes

In the end the icache pressure is decreased by 20/28 bytes at the cost
of an extra jump to trampoline and adjusting LR (to skip over embedded
retaddr) once inside.

Signed-off-by: malc 
---
 exec-all.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'exec-all.h')

diff --git a/exec-all.h b/exec-all.h
index 94ed613..6b3272a 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -337,7 +337,7 @@ extern uintptr_t tci_tb_ptr;
                                     *(int32_t *)((void *)GETRA() + 3) - 1))
 # elif defined (_ARCH_PPC) && !defined (_ARCH_PPC64)
 #  define GETRA() ((uintptr_t)__builtin_return_address(0))
-#  define GETPC_LDST() ((uintptr_t) ((*(int32_t *)(GETRA() + 4)) - 1))
+#  define GETPC_LDST() ((uintptr_t) ((*(int32_t *)(GETRA() - 4)) - 1))
 # else
 #  error "CONFIG_QEMU_LDST_OPTIMIZATION needs GETPC_LDST() implementation!"
 # endif
-- 
cgit v1.1