Latest version of this document

This is a snapshot of the zpu_arch.html document in CVS. Please check out the document from CVS to get the latest version.

The world's smallest 32 bit CPU with GCC toolchain

This CPU is finding a new home at www.opencores.org, please contact me if you are willing and able to help in shaping up the www.opencores.org pages.

The HDL, GCC toolchain and eCos HAL are actually done. Mainly I could use a hand with writing up docs, web pages, examples and bug reports.

The ZPU has a BSD license for the HDL and GPL for the rest (source files are sadly out of date here, patches gladly accepted!). This allows deployments to implement any version of the ZPU they want without running into commercial problems, but if improvements are made to the architecture as such, then they need to be contributed back.

One strength of the ZPU is that it is tiny and therefore easy to implement from scratch to suit specialized needs and optimizations.

Currently there are some pages at http://www.zylin.com/zpu.htm that explain the ZPU. According to OpenCores policy this information should be moved to www.opencores.org. Patches to do so are gratefully accepted!

As of Jan 1, 2008, Zylin holds the copyright for the ZPU, i.e. Zylin is free to decide that the ZPU shall have a BSD license for the HDL and GPL for the rest.

Sincerely,

Øyvind Harboe
Zylin AS

Features

Survey

Please take the time to fill in this short survey so we can gather information about where the ZPU can be the most useful:

http://www.zylin.com/zpusurvey.html

Status

Simulator

The ZPU simulator is integrated into the Zylin Embedded CDT plugin to ease debugging of ZPU applications:

http://www.zylin.com/embeddedcdt.html

The ZPU simulator has many features besides debugging an application:

The plugin is still pretty rough around the edges, and needs to get GUI support for enabling the ModelSim trace input feature.


Compiling ZPU application


Setting up the simulator


Choosing ZPU executable


Debug session


Getting started - FPGA

The simplest version of the ZPU uses BRAM. When getting accustomed to the ZPU, a BRAM ZPU with a UART is a good place to start.

You'll find working simulation scripts in hdl/example/simzpu_small.do and hdl/example_medium/simzpu_medium.do, which show simulation of the small (zpu_core_small.vhd) and medium sized (zpu_core.vhd) ZPU. hdl/example/simzpu_interrupt.do shows use of interrupts.

When implementing the ZPU, copy the following files and modify them to your needs:

  1. hdl/example/zpu_config.vhd - set up RAM size here
  2. hdl/example/helloworld.vhd - dual port BRAM implementation.

Obviously you must also connect the ZPU to the rest of your IO subsystem. IO is memory mapped (read/write) in the ZPU.

Generating VHDL BRAM initialization

../install/bin/zpu-elf-objcopy -O binary hello.elf hello.bin
java -classpath ../simulator/zpusim.jar com.zylin.zpu.simulator.tools.MakeRam hello.bin >hello.bram

Build another test application for example simulation

Here is how to build a ROM image for an application using the zpu/example simulation files.

cd zpu/roadshow/roadshow/dhrystone
sh build.sh
cd zpu/hdl/example
gcc zpuromgen.c
$ ./a
Usage: ./a binary_file
./a ../../roadshow/roadshow/dhrystone/dhrystone.bin >app.txt

Copy and paste app.txt into helloworld.vhd.
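
For readers curious about what the conversion step does, here is a minimal Python sketch of the idea: read the raw binary, pad it to a whole number of 32 bit words and print one VHDL array entry per word. The ZPU is big-endian, so the first byte of each group becomes the most significant byte. The exact text layout emitted by zpuromgen.c and MakeRam is not shown in this document, so the output format and the function name bram_init_lines below are assumptions for illustration only.

```python
def bram_init_lines(image: bytes) -> list[str]:
    """Format a raw binary image as VHDL array entries, one 32 bit word each."""
    # Pad with zero bytes so the image is a whole number of words.
    padded = image + b"\x00" * (-len(image) % 4)
    lines = []
    for index in range(0, len(padded), 4):
        # Big-endian: the first byte is the most significant one.
        word = int.from_bytes(padded[index:index + 4], "big")
        # Hypothetical layout: word index, then the word as 8 hex digits.
        lines.append(f'{index // 4:6d} => x"{word:08x}",')
    return lines


if __name__ == "__main__":
    # Five bytes of input, so the second word is padded with zeros.
    for line in bram_init_lines(bytes([0x80, 0x85, 0x0B, 0x0B, 0x04])):
        print(line)
```

Compare the output with what zpuromgen.c produces for the same input before relying on it.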

Running example simulation

The hdl/example directory has a simulation written for Xilinx WebPack ModelSim. From the ModelSim command prompt:

  1. cd c:/<installfolder>/hdl/example
  2. do simzpu_small.do

After running the hello world simulation (see simzpu_small.do), two files are written to the hdl/example directory:

  1. log.txt - contains the "Hello world!" text written to the debug channel/simplified UART.
  2. trace.txt - a trace file for the CPU. The instruction set simulator has the capability of taking this file as input in order to verify that the HDL implementation matches the instruction set simulator. When a mismatch is found, the GDB debugger will break. Very handy for debugging custom ZPU implementations.

HDL Directories & files

The HDL files need a bit of spit and polish!

Getting started - software

The ZPU comes with a standard GCC toolchain and an instruction set simulator. This allows compiling, running & debugging simple test programs. The Simulator has some very basic peripherals defined: counter, timer interrupt and a debug output port.

Installing

  1. Install Cygwin. http://www.cygwin.com
  2. Install Java
  3. Start Cygwin bash
  4. cd zpu/sw
  5. sh setup.sh
  6. /tmp/zpu/install/bin now has the .exe files for the GCC toolchain & GDB
  7. Optionally you may set up PATH variables to point to /tmp/zpu/install/bin
    source env.sh

Hello world example

The ZPU toolchain comes with newlib & libstdc++ support which means that many C/C++ programs can be compiled without modification.

cd zpu/sw/helloworld
../install/bin/zpu-elf-gcc -phi hello.c -o hello.elf

Running the hello world example in GDB

  1. cd zpu/sw/helloworld
  2. Launch the simulator from a separate bash shell:

    java -classpath ../simulator/zpusim.jar -Xmx512m com.zylin.zpu.simulator.Phi 4444

  3. Launch GDB:

    ../install/bin/zpu-elf-gdb hello.elf

  4. Connect to target, load and run application:

    (gdb) target remote localhost:4444
    (gdb) load
    (gdb) continue

Architecture introduction

The ZPU is a zero operand, or stack based CPU. The opcodes have a fixed width of 8 bits.

Example:

IM 5       ; push 5 onto the stack
LOADSP 20  ; push value at memory location SP+20
ADD        ; pop 2 values on the stack and push the result

As can be seen, a lot of information is packed into the 8 bits, e.g. the IM instruction pushes a 7 bit signed integer onto the stack.
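
To make the stack behaviour concrete, the three-instruction example can be traced with a toy model in Python. This is only a sketch under simplifying assumptions: memory is a dictionary of word-aligned byte addresses, each push moves SP down by 4 bytes, and the class name TinyZpu, the start value of SP and the memory contents are invented for illustration. It is not the Zylin simulator.

```python
class TinyZpu:
    """Toy model of three ZPU instructions; not a full simulator."""

    def __init__(self, sp: int, memory: dict[int, int]) -> None:
        self.sp = sp            # byte address of the top of the stack
        self.memory = memory    # word-aligned byte address -> 32 bit value

    def push(self, value: int) -> None:
        self.sp -= 4            # the stack grows towards lower addresses
        self.memory[self.sp] = value & 0xFFFFFFFF

    def pop(self) -> int:
        value = self.memory[self.sp]
        self.sp += 4
        return value

    def im(self, value: int) -> None:
        self.push(value)        # a single IM pushes a small immediate

    def loadsp(self, offset: int) -> None:
        # The offset is relative to SP as it is before the push.
        self.push(self.memory[self.sp + offset])

    def add(self) -> None:
        self.push(self.pop() + self.pop())


cpu = TinyZpu(sp=0x100, memory={0x110: 37})  # hypothetical initial state
cpu.im(5)        # stack: 5
cpu.loadsp(20)   # stack: 5, 37 (read from SP+20 = 0x110)
cpu.add()        # stack: 42
print(cpu.memory[cpu.sp])
```

With the hypothetical value 37 stored at SP+20, the program leaves 42 on top of the stack and SP ends up one word below where it started.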

The choice of opcodes is intimately tied to the GCC toolchain capabilities.

/* simple program showing some interesting qualities of the ZPU toolchain */
void bar(int);
int j;
void foo(int a, int b, int c)
{
    a++;
    b+=a;
    j=c;
    bar(b);
}

foo:
    loadsp 4    ; a is at memory location SP+4
    im 1
    add
    loadsp 12   ; b is now at memory location SP+12
    add
    loadsp 16   ; c is now at memory location SP+16
    im 24       ; «j» is at absolute memory location 24.
                ; Notice how the ZPU toolchain is using link-time relaxation
                ; to squeeze the address into a single no-op
    store
    im 22       ; the fn bar is at address 22
    call
    im 12
    return      ; 12 bytes of arguments + return from fn

Instruction set

Only the base instructions are implemented in the architecture. More advanced instructions, like ASHIFTLEFT, are emulated in the illegal instruction vector. All operations are 32 bit wide.

Name  Opcode  Description  Definition
BREAKPOINT 0000 0000 The debugger sets a memory location to this value to set a breakpoint. Once a JTAG-like debugger interface is added, it will be convenient to be able to distinguish between a breakpoint and an illegal (possibly emulated) instruction. No effect on registers
IM 1xxx xxxx Pushes 7 bit sign extended integer and sets the «instruction decode interrupt mask» flag (IDIM).

If the IDIM flag is already set, this instruction shifts the value on the stack left by 7 bits and stores the 7 bit immediate value into the lower 7 bits.

Unless an instruction is listed as treating the IDIM flag specially, it should be assumed to clear the IDIM flag.

To push a 14 bit integer onto the stack, use two consecutive IM instructions.

If multiple immediate integers are to be pushed onto the stack, they must be interleaved with another instruction, typically NOP.

pc <= pc + 1
idim <= 1
if (idim=0) then
sp <= sp - 1;
for i in wordSize-1 downto 7 loop
mem(sp)(i) <= opcode(6)
end loop
mem(sp)(6 downto 0) <= opcode(6 downto 0)
else
mem(sp)(wordSize-1 downto 7) <= mem(sp)(wordSize-8 downto 0)
mem(sp)(6 downto 0) <= opcode(6 downto 0)
end if
STORESP 010x xxxx Pop value off stack and store it in the SP+xxxxx*4 memory location, where xxxxx is a positive integer.
LOADSP 011x xxxx Push value of memory location SP+xxxxx*4, where xxxxx is a positive integer, onto stack.
ADDSP 0001 xxxx Add value of memory location SP+xxxx*4 to value on top of stack.
EMULATE 001x xxxx Push PC to stack and set PC to 0x0+xxxxx*32. This is used to emulate opcodes. See zpupkg.vhd for the list of emulate opcode values used. zpu_core.vhd contains reference implementations of these instructions rather than letting the ZPU execute the EMULATE instruction.

One way to improve performance of the ZPU is to implement some of the EMULATE instructions.

PUSHPC emulated Pushes program counter onto the stack.
POPPC 0000 0100 Pops address off stack and sets PC
LOAD 0000 1000 Pops address stored on stack and loads the value of that address onto stack.

Bit 0 and 1 of the address are always treated as 0 (i.e. ignored) by the HDL implementations, and C code is guaranteed by the programming model never to use 32 bit LOAD on non-32 bit aligned addresses (i.e. if a program does this, then it has a bug).

STORE 0000 1100 Pops address, then value from stack and stores the value into the memory location of the address.

Bit 0 and 1 of address are always treated as 0

PUSHSP 0000 0010 Pushes stack pointer.
POPSP 0000 1101 Pops value off top of stack and sets SP to that value. Used to allocate/deallocate space on stack for variables or when changing threads.
ADD 0000 0101 Pops two values off the stack, adds them and pushes the result
AND 0000 0110 Pops two values off the stack and does a bitwise-and & pushes the result onto the stack
OR 0000 0111 Pops two integers, does a bitwise or and pushes result
NOT 0000 1001 Bitwise inverse of value on stack
FLIP 0000 1010 Reverses the bit order of the value on the stack, i.e. abc->cba, 100->001, 110->011, etc.

The raison d'etre for this instruction is mainly to emulate other instructions.

NOP 0000 1011 No operation, clears IDIM flag as side effect, i.e. used between two consecutive IM instructions to push two values onto the stack.
PUSHSPADD 61 a=sp;
b=popIntStack()*4;
pushIntStack(a+b);
POPPCREL 57 setPc(popIntStack()+getPc());
SUB 49 int a=popIntStack();
int b=popIntStack();
pushIntStack(b-a);
XOR 50 pushIntStack(popIntStack() ^ popIntStack());
LOADB 51 8 bit load instruction. Really only here for compatibility with C programming model. Also it has a big impact on DMIPS test.

pushIntStack(cpuReadByte(popIntStack())&0xff);

STOREB 52 8 bit store instruction. Really only here for compatibility with C programming model. Also it has a big impact on DMIPS test.

addr = popIntStack();
val = popIntStack();
cpuWriteByte(addr, val);

LOADH 34 16 bit load instruction. Really only here for compatibility with C programming model.

pushIntStack(cpuReadWord(popIntStack()));

STOREH 35 16 bit store instruction. Really only here for compatibility with C programming model.

addr = popIntStack();
val = popIntStack();
cpuWriteWord(addr, val);

LESSTHAN 36 Signed comparison
a = popIntStack();
b = popIntStack();
pushIntStack((a < b) ? 1 : 0);
LESSTHANOREQUAL 37 Signed comparison
a = popIntStack();
b = popIntStack();
pushIntStack((a <= b) ? 1 : 0);
ULESSTHAN 38 Unsigned comparison
long a;//long is here 64 bit signed integer
long b;
a = ((long) popIntStack()) & INTMASK; // INTMASK is unsigned 0x00000000ffffffff
b = ((long) popIntStack()) & INTMASK;
pushIntStack((a < b) ? 1 : 0);
ULESSTHANOREQUAL 39 Unsigned comparison
long a;//long is here 64 bit signed integer
long b;
a = ((long) popIntStack()) & INTMASK; // INTMASK is unsigned 0x00000000ffffffff
b = ((long) popIntStack()) & INTMASK;
pushIntStack((a <= b) ? 1 : 0);
EQBRANCH 55 int compare;
int target;
target = popIntStack() + pc;
compare = popIntStack();
if (compare == 0)
{
setPc(target);
} else
{
setPc(pc + 1);
}
NEQBRANCH 56 int compare;
int target;
target = popIntStack() + pc;
compare = popIntStack();
if (compare != 0)
{
setPc(target);
} else
{
setPc(pc + 1);
}
MULT 41 Signed 32 bit multiply
pushIntStack(popIntStack() * popIntStack());
DIV 53 Signed 32 bit integer divide.
a = popIntStack();
b = popIntStack();
if (b == 0)
{
// undefined
}
pushIntStack(a / b);
MOD 54 Signed 32 bit integer modulo.
a = popIntStack();
b = popIntStack();
if (b == 0)
{
// undefined
}
pushIntStack(a % b);
LSHIFTRIGHT 42 unsigned shift right.
long shift;
long valX;
int t;
shift = ((long) popIntStack()) & INTMASK;
valX = ((long) popIntStack()) & INTMASK;
t = (int) (valX >> (shift & 0x3f));
pushIntStack(t);
ASHIFTLEFT 43 arithmetic (signed) shift left.
long shift;
long valX;
shift = ((long) popIntStack()) & INTMASK;
valX = ((long) popIntStack()) & INTMASK;
int t = (int) (valX << (shift & 0x3f));
pushIntStack(t);
ASHIFTRIGHT 44 arithmetic (signed) shift right.
long shift;
int valX;
shift = ((long) popIntStack()) & INTMASK;
valX = popIntStack();
int t = valX >> (shift & 0x3f);
pushIntStack(t);
CALL 45 call procedure.

int address = pop();
push(pc + 1);
setPc(address);
CALLPCREL 63 call procedure pc relative

int address = pop();
push(pc + 1);
setPc(address+pc);
EQ 46 pushIntStack((popIntStack() == popIntStack()) ? 1 : 0);
NEQ 48 pushIntStack((popIntStack() != popIntStack()) ? 1 : 0);
NEG 47 pushIntStack(-popIntStack());
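
The IM definition in the table above is easiest to understand by running it. The Python sketch below applies a sequence of IM opcodes the way the definition describes (sign extend on the first IM, shift left by 7 and fill on each following one) and, going the other way, picks a shortest IM sequence for a given 32 bit value. Both function names are hypothetical and the encoder is not taken from the assembler sources; it only illustrates why a 14 bit value needs two IM instructions.

```python
def decode_im(opcodes: list[int]) -> int:
    """Apply consecutive IM opcodes as in the IM definition (32 bit words)."""
    value = 0
    for position, opcode in enumerate(opcodes):
        assert opcode & 0x80, "not an IM opcode"
        low7 = opcode & 0x7F
        if position == 0:
            # IDIM clear: push the 7 bit immediate, sign extended from bit 6.
            value = low7 - 0x80 if low7 & 0x40 else low7
        else:
            # IDIM set: shift left by 7 bits and fill in the low 7 bits.
            value = (value << 7) | low7
    return value & 0xFFFFFFFF


def encode_im(value: int) -> list[int]:
    """Return a shortest IM sequence that leaves `value` on the stack."""
    value &= 0xFFFFFFFF
    signed = value - (1 << 32) if value & 0x80000000 else value
    count = 1
    # Each additional IM adds 7 bits of signed range.
    while not -(1 << (7 * count - 1)) <= signed < (1 << (7 * count - 1)):
        count += 1
    return [0x80 | ((signed >> (7 * shift)) & 0x7F)
            for shift in range(count - 1, -1, -1)]


if __name__ == "__main__":
    for number in (5, 100, -1, 0x080A000C):
        print(number, [f"{op:02x}" for op in encode_im(number)])
```

Small constants such as 5 or -1 fit in a single opcode, while a full 32 bit address such as 0x080A000C takes five.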

Custom startup code (aka crt0.s)

To minimize the size of an application, one important trick is to strip down the startup code. The startup code contains emulation of instructions that may never be used by a particular application.

The startup code is found in the GCC source code under gcc/libgloss/zpu, but to make the startup code more available, it has been duplicated into zpu/sw/startup

To minimize startup size, see codesize demo. This is pretty standard GCC stuff and simple enough once you've been over it a couple of times.

Implementing your own ZPU

One of the neat things about the ZPU is that the instruction set and architecture are very small, so it is easy to implement a ZPU from scratch or modify the existing ZPU implementations.

Implementing a ZPU can be done without understanding the toolchain in detail, i.e. using exclusively HDL skills and only a rudimentary understanding of standard GCC/GDB usage is sufficient.

A few tips:

Vectors

Address  Name  Description
0x000    Reset
    1. When the ZPU boots, this is the first instruction to be executed.
    2. The stack pointer is initialised to maximum RAM address
0x020    Interrupt
    This is the entry point for interrupts.
0x040-   Emulated instructions
    Emulated opcode 34. Note that opcode 32 and opcode 33 are not normally used to emulate instructions as these memory addresses are already used by boot vector, GCC registers and the interrupt vector.
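
The relation between an emulated opcode and its vector follows from the EMULATE definition (PC is set to xxxxx*32, where xxxxx are the low five bits of the opcode). A small Python sketch, with a hypothetical function name, shows why opcode 34 lands at 0x040 and why opcodes 32 and 33 would collide with the reset and interrupt vectors:

```python
def emulate_vector(opcode: int) -> int:
    """Address the ZPU jumps to for an EMULATE opcode (001x xxxx)."""
    if (opcode & 0xE0) != 0x20:
        raise ValueError("not an EMULATE opcode")
    return (opcode & 0x1F) * 32


if __name__ == "__main__":
    for opcode in (32, 33, 34, 35, 63):
        print(opcode, hex(emulate_vector(opcode)))
```

Each emulated instruction therefore has 32 bytes of code space before the next vector starts.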

Phi memory map

The ZPU architecture does not define a memory map as such, but the GCC + libgloss + eCos HAL library uses the memory map below. "Phi" is just a three letter name for the particular memory layout below that came about while developing the ZPU.

Address      Type         Name
    Description

0x080A0000   Write        ZPU enable
    Bit [31:1] Not used
    Bit [0] Enable ZPU operations
        0 ZPU is held in Idle mode
        1 ZPU running

0x080A000C   Read/Write   ZPU Debug channel / UART to ARM7 TX
    NOTE! ZPU side
    Bit [31:9] Not used
    Bit [8] TX buffer ready (valid on read)
        0 TX buffer not ready (full)
        1 TX buffer ready
    Bit [7:0] TX byte (valid on write)

0x080A0010   Read         ZPU Debug channel / UART to ARM7 RX
    NOTE! ZPU side
    Bit [31:9] Not used
    Bit [8] RX buffer data valid
        0 RX buffer not valid
        1 RX buffer valid
    Bit [7:0] RX byte (when valid)

0x080A0014   Read/Write   Counter(1)
    Bit [0] Reset counter (valid for write)
        0 N/A
        1 Reset counter
    Bit [1] Sample counter (valid for write)
        0 N/A
        1 Sample counter
    Bit [31:0] Counter bit 31:0

0x080A0018   Read         Counter(2)
    Bit [31:0] Counter bit 63:32

0x080A0020   Read/Write   Global_Interrupt_mask
    Bit [31:1] Not used
    Bit [0] Global intr. Mask
        0 Interrupts enabled
        1 Interrupts disabled

0x080A0024   Write        UART_INTERRUPT_ENABLE
    Bit [31:1] Not used
    Bit [0] Debug channel / UART RX interrupt enable
        0 Interrupt disable
        1 Interrupt enable

0x080A0028   Read/Write   UART_interrupt
    Bit [31:1] Not used
    Bit [0] Debug channel / UART RX interrupt pending (Read)
        0 No interrupt pending
        1 Interrupt pending
    Bit [0] Clear UART interrupt (Write)
        0 N/A
        1 Interrupt cleared

0x080A002C   Write        Timer_Interrupt_enable
    Bit [31:1] Not used
    Bit [0] Timer interrupt enable
        0 Interrupt disable
        1 Interrupt enable

0x080A0030   Read/Write   Timer_interrupt
    Bit [31:2] Not used
    Bit [0] Timer interrupt pending (Read)
        0 No interrupt pending
        1 Interrupt pending
    Bit [1] Reset Timer counter (Write)
        0 N/A
        1 Timer counter reset
    Bit [0] Clear Timer interrupt (Write)
        0 N/A
        1 Interrupt cleared

0x080A0034   Write        Timer_Period
    Bit [31:0] Interrupt period (write)
    Number of clock cycles between timer interrupts
    NOTE! The timer will start at the Timer_Period value and count down to zero, and generate an interrupt.

0x080A0038   Read         Timer_Counter
    Bit [31:0] Timer counter (read)
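
As an illustration of how the three timer registers fit together, here is a toy Python model of Timer_Period, Timer_interrupt and Timer_Counter. It is a sketch under stated assumptions, not a description of the actual HDL: the class name PhiTimer is hypothetical, the automatic reload when the counter reaches zero is assumed from the phrase "number of clock cycles between timer interrupts", and the enable and global mask registers are left out.

```python
class PhiTimer:
    """Toy model of Timer_Period, Timer_interrupt and Timer_Counter."""

    def __init__(self) -> None:
        self.period = 0
        self.counter = 0
        self.pending = False

    def write_period(self, cycles: int) -> None:     # 0x080A0034
        self.period = cycles & 0xFFFFFFFF

    def write_interrupt(self, value: int) -> None:   # 0x080A0030, write
        if value & 0x2:
            self.counter = self.period               # bit 1: reset timer counter
        if value & 0x1:
            self.pending = False                     # bit 0: clear interrupt

    def read_interrupt(self) -> int:                 # 0x080A0030, read
        return 1 if self.pending else 0              # bit 0: interrupt pending

    def read_counter(self) -> int:                   # 0x080A0038
        return self.counter

    def tick(self) -> None:
        """Advance the model by one clock cycle."""
        if self.counter > 0:
            self.counter -= 1
            if self.counter == 0:
                self.pending = True
                self.counter = self.period           # assumption: automatic reload


if __name__ == "__main__":
    timer = PhiTimer()
    timer.write_period(3)
    timer.write_interrupt(0x2)       # load the counter with the period
    for cycle in range(1, 5):
        timer.tick()
        print(cycle, timer.read_counter(), timer.read_interrupt())
```

An interrupt handler would typically write 1 to Timer_interrupt to clear the pending bit before returning.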

Wishbone

In hdl/wishbone there is an implementation of a wishbone bridge.

However this wishbone bridge was used together with the hdl/zy2000 implementation of the ZPU, which differs slightly from hdl/zpu4/core.

The ZY2000 is a complete implementation of the ZPU including: DRAM, soft-MAC, wishbone bridges, GPIO subsystem, etc. This also included an eCos HAL w/TCP/IP support.

JTAG/hardware debugger for GDB

The Zylin ZY1000 JTAG debugger supports the ZPU. Contact Zylin for pricing and details.

There are two debug modes in which the ZY1000 can operate:

Interrupts

The ZPU supports interrupts.

To trigger an interrupt, the interrupt signal must be asserted. The ZPU does not define any interrupt disabling mechanism; this must be implemented by the interrupt controller and controlled via memory mapped IO.

Interrupts are masked when the IDIM flag is set, i.e. with consecutive IM instructions.

The ZPU has an edge triggered interrupt. As the ZPU notices that the interrupt is asserted, it will execute the interrupt instruction. The interrupt signal must stay asserted until the ZPU acknowledges it.

When the interrupt instruction is executed, the PC will be pushed onto the stack and the PC will be set to the interrupt vector address (0x20).
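
The entry sequence described above can be summarised in a few lines of Python. This is a sketch only: the function name and arguments are hypothetical, memory is modelled as a dictionary, and a push is assumed to move SP down by 4 bytes.

```python
INTERRUPT_VECTOR = 0x20


def take_interrupt(pc: int, sp: int, memory: dict[int, int],
                   idim: bool, interrupt_asserted: bool) -> tuple[int, int]:
    """Return (pc, sp) after the point where an interrupt may be taken."""
    if not interrupt_asserted or idim:
        # Nothing to do, or masked because we are between IM instructions.
        return pc, sp
    sp -= 4
    memory[sp] = pc                  # push the PC of the interrupted code
    return INTERRUPT_VECTOR, sp


if __name__ == "__main__":
    memory = {}
    print(take_interrupt(0x100, 0x200, memory, idim=False, interrupt_asserted=True))
    print(memory)
```

Because the PC sits on top of the stack at entry, returning from the handler amounts to popping it back into the PC.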

Note that the GCC compiler requires the registers r0, r1, r2, r3 for some rather uncommon operations. These 32 bit registers are mapped to memory locations 0x0, 0x4, 0x8, 0xc. The default interrupt vector at address 0x20 will load the value of these memory locations onto the stack, call _zpu_interrupt and restore them.

See zpu/hdl/zpu4/test/interrupt/ for C code and zpu/hdl/example/simzpu_interrupt.do for simulation example.

About zpu_core_small.vhd

The small ZPU implements the minimum instruction set. It is optimized for size and simplicity serving as a reference in both regards.

It uses a BRAM (dual port RAM w/read/write to both ports) as data & code storage and is implemented as a simple state machine.

Essentially it has four states:

  1. Fetch - starts fetch of next instruction
  2. FetchNext - sets up operands for execute cycle
  3. Decode - decodes instruction
  4. Execute - well.. executes instruction

The tricky bit is that there is a tiny bit of interleaving of states since the BRAM takes a cycle to perform a fetch/store. The above are the normal states the ZPU cycles through unless memory fetch, jumps, etc. take place.

Speeding up the ZPU

There are two aspects of speeding up the ZPU: making it perform better for a particular application and toying around with the ZPU architecture.

Performance tips

  1. Profile. Create a small sample and run in a simulator that is as close to the real deployment as possible. zpu4/core/histogram.perl is a script that will tell you which instructions take the most time.
  2. Using the profile output, decide which emulated instructions it makes sense to implement in HDL for your particular application. Modifying zpu_core_small.vhd is not particularly hard. Most instructions can be transliterated into zpu_core_small.vhd from zpu_core.vhd without too much trouble.
  3. The memory subsystem may well turn out to be where you should concentrate your efforts.

Toying around with the architecture

Again: profile 90% of the time and spend the remaining 10% tinkering with the architecture. If you need to get ca. 20-50 DMIPS out of the ZPU you will have to write a heavily pipelined architecture with caches (if you are running against DRAM). This is *tricky*, but some proof of concept work was done to show 20 DMIPS w/the ZPU (the actual result was discarded since it was not complete and contained fatal flaws).

Achieving above 50-100 DMIPS with the current ZPU architecture is probably a non-starter and a more conventional RISC design makes more sense here.

The unique advantage of the ZPU is its small size, both in terms of HDL and code size.

Debug channel / UART

All self respecting embedded projects should have a debug channel to print stuff to. Typically this is a standard RS232 or UART, but it can also be something more exotic like a DCC JTAG channel.

The point is that characters (bytes) are sent to/from the ZPU via some terminal.

The ZPU defines in the memory map a UART / debug channel. This should be implemented by some suitable debug channel for the device in which the ZPU is implemented.

www.opencores.org has several UART implementations. This is one of the simpler ones: http://www.opencores.org/projects.cgi/web/uart/overview

Implementing your own UART / debug channel

The first thing you need to do is to choose a debug channel for your hardware. This could be a UART, but it doesn't have to be.

Secondly you should write a small HDL module that interfaces between the ZPU memory map of the debug channel and the UART. This should be relatively simple, as all you need to do is to let the ZPU query the in/out FIFOs for the busy flag and allow the ZPU to read/write data to the UART via the memory map.
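
Before writing the HDL it can help to model the handshake in software. The Python sketch below mimics the TX and RX registers of the Phi memory map as seen from the ZPU: software polls the ready flag (bit 8) before writing a byte, and a read of the RX register returns the valid flag together with the byte. The class and method names, the FIFO depth and the fact that a read of the RX register consumes the byte are assumptions made for illustration.

```python
from collections import deque

TX_READY = 1 << 8   # bit 8 of the TX register (0x080A000C), read side
RX_VALID = 1 << 8   # bit 8 of the RX register (0x080A0010)


class DebugChannel:
    """Toy model of the debug channel as seen from the ZPU."""

    def __init__(self, tx_depth: int = 4) -> None:
        self.tx_depth = tx_depth      # hypothetical FIFO depth
        self.tx_fifo = deque()
        self.rx_fifo = deque()
        self.wire = bytearray()       # bytes that have left the UART

    def read_tx(self) -> int:
        """Read of the TX register: only the ready flag is meaningful."""
        return TX_READY if len(self.tx_fifo) < self.tx_depth else 0

    def write_tx(self, value: int) -> None:
        """Write of the TX register: bits 7:0 carry the byte."""
        self.tx_fifo.append(value & 0xFF)

    def read_rx(self) -> int:
        """Read of the RX register: valid flag plus byte, or 0 if empty."""
        if not self.rx_fifo:
            return 0
        return RX_VALID | self.rx_fifo.popleft()

    def drain_one(self) -> None:
        """Stand-in for the UART shifting out one byte."""
        if self.tx_fifo:
            self.wire.append(self.tx_fifo.popleft())


def send(channel: DebugChannel, text: bytes) -> None:
    """Software side: poll the ready flag before every write."""
    for byte in text:
        while not channel.read_tx() & TX_READY:
            channel.drain_one()       # in hardware this happens on its own
        channel.write_tx(byte)
```

The HDL module only has to provide the two flags and the byte lanes; everything else is handled by the polling loop on the ZPU.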

About zpu_core.vhd

The zpu_core.vhd has a single port memory interface. All data, code and IO is accessed through this memory interface.

It performs better (despite having less memory bandwidth than zpu_core_small.vhd) since it implements many more instructions.

Compiling hello world program with the ZPU GCC toolchain

The ZPU comes with a standard GCC toolchain and an instruction set simulator. This allows compiling, running & debugging simple test programs. The Simulator has some very basic peripherals defined: counter, timer interrupt and a debug output port.

Installation

  1. Install Cygwin. http://www.cygwin.com
  2. Start Cygwin bash
  3. unzip zputoolchain.zip
  4. Add install/bin from zputoolchain.zip to PATH.
    export PATH=$PATH:/install/bin

Hello world example

The ZPU toolchain comes with newlib & libstdc++ support which means that many C/C++ programs can be compiled without modification.

zpu-elf-gcc -Os -zeta hello.c -o hello.elf -Wl,--relax -Wl,--gc-sections
zpu-elf-size hello.elf

SPI flash controller (read-only)

This is a simple read-only SPI flash controller, with the following characteristics:
  • Fast-READ only implementation.
  • 32-bit only access
  • Fast sequential read access - Uses low-clock approach

    Version

    The current version is 1.2. This is also the first public version available.

    Timing overview

    Simple timing overview, with one nonsequential access to address 0x0, followed by a sequential access to address 0x4. This simulation was done with Xilinx tools, after post-routing, and using a ZPU to access the SPI.

    Image 1: Timing overview

    In Image 2, you can see the clock almost perfectly centered on data when we write to the SPI flash.

    Image 2: Issuing commands to the SPI

    As you can see from Image 3, I assume the worst-case read delay from SPI (which is 15ns, as you can see from the marker).

    Image 3: Reading from the SPI

    Usage

    Simple description of SPI controller interface:
    Symbol    Direction  Bit width  Purpose
    adr       Input      24         Address where to read from SPI
    dat_o     Output     32         Data read from SPI
    clk       Input      1          Input clock. Used for both interface and SPI
    ce        Input      1          Chip Enable
    rst       Input      1          Asynchronous reset
    ack       Output     1          Data valid ACK
    SPI_CLK   Output     1          SPI output clock
    SPI_MOSI  Output     1          SPI output data from controller to chip
    SPI_MISO  Input      1          SPI input data from chip to controller
    SPI_SELN  Output     1          SPI nSEL (deselect, active low) signal

    License

    The Verilog implementation is released under BSD license. See the file itself for more licensing details.

    Download

    Download the Verilog code here: spi_controller.v

    Troubleshooting

    The current implementation is timed and optimized for myself. Your parameters might not be the same as those I defaulted, so read the code carefully. If you have any issue let me know.

    Zealot: Implementing in FPGAs

    The Zealot version of ZPU is a ZPU medium variant ready to be used with FPGAs. It was tested using Xilinx Spartan 3 1500 FPGAs and was contributed by Salvador E. Tropea. The key features are:

    Simulation and implementation files are provided. You need 16 kB of BRAMs for the "hello world" example and 32 kB for the DMIPS benchmark. The medium version takes around 1030 slices and 3 multipliers and the small version around 430 slices.

    The generics for the Zealot Medium ZPU are:

    For more information read the 0README.txt file located inside the zealot directory.

    Optimizing for code size

    The ZPU toolchain produces highly compact code.
    1. Since the ZPU GCC toolchain supports standard ANSI C, it is easy to stumble across functionality that takes up a lot of space. E.g. the standard printf() function is a beast. Some compilers drop e.g. floating point support from the printf() function and thus boast a "smaller" printf() when in fact they have a non-standard printf(). newlib has a standard printf() function and an alternative iprintf() function that works only on integers.
    2. The ZPU ships with default startup code that works across various configurations of the ZPU, so be warned that there is some overhead that will not occur in the final application (anywhere between 1-4 kBytes).
    3. Compilation and linker options matter. The ZPU benefits greatly from the "-Wl,--relax -Wl,--gc-sections" options, which are not used by all architectures (e.g. GCC ARM does not implement/need -Wl,--relax).

    Small code example

    zpu-elf-gcc -Os -abel smallstd.c -o smallstd.elf -Wl,--relax -Wl,--gc-sections
    zpu-elf-size small.elf

    $ zpu-elf-size small.elf
    text data bss dec hex filename
    2845 952 36 3833 ef9 small.elf

    Even smaller code example

    If the ZPU implements the optional instructions, the RAM overhead can be reduced significantly.

    zpu-elf-gcc -Os -abel crt0_phi.S small.c -o small.elf -Wl,--relax -Wl,--gc-sections -nostdlib
    zpu-elf-size small.elf

    $ zpu-elf-size small.elf
    text data bss dec hex filename
    56 8 0 64 40 small.elf

    Installing eCos build tools

    tar -xjvf ecossnapshot.tar.bz2
    tar -xjvf repository.tar.bz2
    tar -xjvf ecostools.tar.bz2
    # run this every time you open the shell
    export PATH=$PATH:`pwd`/ecos-install
    export ECOS_REPOSITORY=`pwd`/ecos/packages:`pwd`/repository

    Compiling eCos tests

    ecosconfig new zeta default
    ecosconfig tree
    make
    cd kernel/current
    make tests

    Code size ZPU

    $ zpu-elf-size *
    text data bss dec hex filename
    15761 1504 12060 29325 728d bin_sem0
    16907 1512 14436 32855 8057 bin_sem1
    17105 1524 30032 48661 be15 bin_sem2
    17186 1512 14436 33134 816e bin_sem3
    18986 1500 12036 32522 7f0a clock0
    15812 1504 13236 30552 7758 clock1
    25095 1972 13224 40291 9d63 clockcnv
    16437 1500 13224 31161 79b9 clocktruth
    15762 1504 12060 29326 728e cnt_sem0
    17124 1512 14436 33072 8130 cnt_sem1
    35947 1564 22512 60023 ea77 dhrystone
    16428 1500 13228 31156 79b4 except1
    15751 1504 12052 29307 727b flag0
    19145 1512 15624 36281 8db9 flag1
    20053 1516 102908 124477 1e63d fptest
    15998 1496 12092 29586 7392 intr0
    16080 1496 12200 29776 7450 kalarm0
    15327 1496 12036 28859 70bb kcache1
    15549 1496 13224 30269 763d kcache2
    18291 1500 12260 32051 7d33 kclock0
    16231 1500 13232 30963 78f3 kclock1
    16572 1496 13228 31296 7a40 kexcept1
    15618 1496 12060 29174 71f6 kflag0
    19287 1500 15624 36411 8e3b kflag1
    16887 1516 15628 34031 84ef kill
    16186 1496 12128 29810 7472 kintr0
    19724 1504 14516 35744 8ba0 klock
    18283 1500 14592 34375 8647 kmbox1
    15539 1496 12064 29099 71ab kmutex0
    16524 1504 15664 33692 839c kmutex1
    18272 1712 20348 40332 9d8c kmutex3
    18682 1608 20352 40642 9ec2 kmutex4
    15619 1496 14412 31527 7b27 ksched1
    15567 1496 12060 29123 71c3 ksem0
    17063 1500 14436 32999 80e7 ksem1
    15504 1496 13228 30228 7614 kthread0
    16167 1496 14412 32075 7d4b kthread1
    18281 1512 14580 34373 8645 mbox1
    20611 1508 14940 37059 90c3 mqueue1
    15672 1504 12064 29240 7238 mutex0
    16678 1516 15664 33858 8442 mutex1
    17694 1508 16868 36070 8ce6 mutex2
    18203 1720 20344 40267 9d4b mutex3
    16352 1508 14428 32288 7e20 release
    15890 1500 14412 31802 7c3a sched1
    44196 1612 286332 332140 5116c stress_threads
    17891 1524 16864 36279 8db7 sync2
    16943 1512 15644 34099 8533 sync3
    15467 1496 13064 30027 754b thread0
    16134 1496 14420 32050 7d32 thread1
    17560 1512 15636 34708 8794 thread2
    16279 1500 24028 41807 a34f thread_gdb
    17051 1504 20376 38931 9813 timeslice
    17146 1504 21564 40214 9d16 timeslice2
    37313 1512 422380 461205 70995 tm_basic

    Code size ARM (non-thumb)

    Thumb does not compile out of the box w/AT91 EB40a for which this test was made.

    $ arm-elf-size *
    text data bss dec hex filename
    25204 692 16976 42872 a778 bin_sem0
    26644 700 22096 49440 c120 bin_sem1
    26996 712 55584 83292 1455c bin_sem2
    27008 700 22100 49808 c290 bin_sem3
    28992 688 16944 46624 b620 clock0
    25456 692 19532 45680 b270 clock1
    34572 1160 19520 55252 d7d4 clockcnv
    26224 688 19508 46420 b554 clocktruth
    25204 692 16976 42872 a778 cnt_sem0
    26888 700 22108 49696 c220 cnt_sem1
    44180 752 27416 72348 11a9c dhrystone
    26088 688 19520 46296 b4d8 except1
    25236 692 16968 42896 a790 flag0
    29532 700 24668 54900 d674 flag1
    29508 704 109652 139864 22258 fptest
    25932 684 17016 43632 aa70 intr0
    25824 684 17112 43620 aa64 kalarm0
    24728 684 16956 42368 a580 kcache1
    25168 684 19512 45364 b134 kcache2
    28112 688 17168 45968 b390 kclock0
    25976 688 19524 46188 b46c kclock1
    26372 684 19512 46568 b5e8 kexcept1
    25140 684 16968 42792 a728 kflag0
    29824 688 24660 55172 d784 kflag1
    26896 704 24656 52256 cc20 kill
    26088 684 17028 43800 ab18 kintr0
    30812 692 22176 53680 d1b0 klock
    28504 688 22260 51452 c8fc kmbox1
    24984 684 16984 42652 a69c kmutex0
    26504 692 24704 51900 cabc kmutex1
    28792 900 34892 64584 fc48 kmutex3
    29264 796 34896 64956 fdbc kmutex4
    25240 684 22084 48008 bb88 ksched1
    25044 684 16968 42696 a6c8 ksem0
    26988 688 22100 49776 c270 ksem1
    25028 684 19512 45224 b0a8 kthread0
    25996 684 22080 48760 be78 kthread1
    28552 700 22252 51504 c930 mbox1
    31324 696 22612 54632 d568 mqueue1
    25108 692 16980 42780 a71c mutex0
    26464 704 24700 51868 ca9c mutex1
    27624 696 27280 55600 d930 mutex2
    28596 908 34884 64388 fb84 mutex3
    26156 696 22100 48952 bf38 release
    25460 688 22084 48232 bc68 sched1
    56356 828 45892 103076 192a4 stress_threads
    27900 712 27288 55900 da5c sync2
    26760 700 24692 52152 cbb8 sync3
    24924 684 19356 44964 afa4 thread0
    25868 684 22084 48636 bdfc thread1
    27452 700 24680 52832 ce60 thread2
    26136 688 42704 69528 10f98 thread_gdb
    27212 692 34916 62820 f564 timeslice
    52728 700 123332 176760 2b278 tm_basic
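
To compare the two listings, it helps to recall that the dec column is text + data + bss and that hex is the same number in hexadecimal. The short Python sketch below copies three rows from each listing above, checks that relation and prints the ratio of ZPU to ARM text size. The choice of rows is arbitrary and the dictionary names are just for illustration.

```python
# (text, data, bss) rows copied from the two listings above.
ZPU = {
    "bin_sem0": (15761, 1504, 12060),
    "dhrystone": (35947, 1564, 22512),
    "tm_basic": (37313, 1512, 422380),
}
ARM = {
    "bin_sem0": (25204, 692, 16976),
    "dhrystone": (44180, 752, 27416),
    "tm_basic": (52728, 700, 123332),
}

# dec is the sum of the three sections: bin_sem0 on the ZPU is 29325 (0x728d).
assert sum(ZPU["bin_sem0"]) == 29325 == 0x728D

for name in ZPU:
    zpu_text, arm_text = ZPU[name][0], ARM[name][0]
    ratio = zpu_text / arm_text
    print(f"{name:10s} text: ZPU {zpu_text:6d}  ARM {arm_text:6d}  ratio {ratio:.2f}")
```

Note that only the text and data columns say something about code density; bss is dominated by stacks and buffers chosen by each eCos configuration.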

    Next generation ZPU

    Based on feedback here is a list of a tenuous "consensus" for the next generation of the ZPU with some tentative ideas on implementation.

    The plan is to update zpu_core.vhd and zpu_core_small.vhd as examples/reference, and to open up for innovation in the HDL implementation.

    1. Reduce minimum code size footprint
      1. Add single entry for unknown instructions. PC and unsupported instruction are pushed onto the stack before jumping to the unknown instruction vector. This makes it possible to write denser microcode for missing instructions. For emulated opcodes that are not in use, the microcode can more easily be disabled. Determining that e.g. MULT is not used can be a bit tricky, but disabling it is easy.

        The address of this entry will be 0x10. The reason 0x00 is not used is that GCC needs 0x00-0x0b inclusive to store R0-R2 (memory mapped GCC registers). The reset vector remains 0x0, so the 0x00-0x0f addresses contain the first few instructions executed by the ZPU. Some very early work has been done in nextgen_crt0.S.

      2. Single entry for *all* unknown instructions does not limit emulation to the EMULATE instructions today, but instructions such as OR, LOADSP, STORESP, ADDSP, etc. can also be emulated. This opens up for further reduction in logic usage.
      3. The single entry for all unknown instructions will make it easier to write a compact custom crt0.s to fit an instruction subset.
      4. The interrupt is basically an unknown instruction that is injected into the execution stream.
      5. Possibly modify the java simulator to support the single entry for unknown instructions.
    2. Add floating point add and mult. FADD & FMULT. Option to generate the instructions from the compiler.
    3. Add GCC support for separate code/data bus. This may be as "simple" as writing a custom linker script for the current GCC compiler.
    4. Add some scheme to support custom instructions. Can this be combined with single entry point for unknown instructions?
    5. Add support to Zylin Embedded CDT for downloading a fully functional ZPU toolchain. The goal is to allow new users to write and simulate simple ZPU programs in less than an hour.
    6. Strip away unused instructions from GCC and add options to GCC for not emitting more advanced instructions. This will e.g. convert MULT/DIV into function calls to libgcc and thus make it easier to determine that microcode is not needed.

    Next generation ZPU HDL work

    1. Incorporate feedback on FPGA tricks to reduce memory usage: do not use asynchronous reset?, use BRAMs in synchronous mode to reduce complexity of state machine?, separate code/data bus? Reduce instruction set further. Goal: <300 LUTs for 32 bit ZPU
    2. Will someone be willing to contribute a heavily pipelined ZPU? For this to make sense, the performance must hit 20 DMIPS w/DRAM & cache. This ZPU could run a TCP/IP stack with relevant performance to compete with stripped down ARM7 type systems.

    Download source code

    The simplest way to get the ZPU HDL source and tools is to check it out from CVS:

    cvs -d :pserver:anonymous@cvs.opencores.org:/cvsroot/anonymous co zpu/zpu

    Start by reading zpu/zpu/hdl/index.html

    Creating a patch


    Please submit changes to the zylin-zpu mailing list as a patch.

    1. Merge your changes with CVS HEAD.
    2. Update the FreeBSD or GPL copyright with your name in the case of non-trivial changes. If in doubt, add the copyright.
    3. Add an entry to zpu/ChangeLog with date, your name, email, the files you changed and a comment.
    4. cd zpu
      cvs diff -upN . > mypatch.txt
    5. Email it to zylin-zpu mailing list. Attach it as an uncompressed .txt file

    Getting help - mailing list

    The place to get help is the zylin-zpu mailing list.