How to add a new architecture to QEMU - Part 4

Why we need branch instructions
How to implement branch instructions
Complexity
Testing the implementation
Expanding the test
The next steps

In the last article, I showed you how AVR32 CPU instructions are emulated in QEMU. As an example, we implemented the ADD instruction. I also encouraged you to implement the MOV instruction, as we need to move immediate values into the CPU registers now. If you did not implement that instruction, you can copy it from my GitHub repository. We also need to implement an instruction that performs branches, because we later need this functionality to test our implementation.

Why we need branch instructions

In high-level programming languages, we usually have if-then-else constructs. Assembled programs have no such constructs. A CPU executes a program instruction by instruction, much like we would read a book: it starts at the first address of the program code and continues until the program ends.

But what if the program needs to skip a certain part? For example, when the condition of an if statement evaluates to false? This is where branch instructions come into play.

Let’s consider the following program:

int val = 5;
if(val == 1){
    printf("Val is 1");
}else{
    printf("Val is not 1");
}

When this code is compiled into an AVR32 binary, the result might look roughly like this (simplified):

0x0000  mov r1, 0x5     #move value 5 into register 1
0x0002  cp.w r1, 0x1    #compare content of register 1 to the integer '1' and set/unset 'equal' flag
0x0004  brne 0x14       #branch to address at offset 0x14 if the 'equal' flag is not set
0x0006 #call printf('Val is 1');   
0x0014 #call printf('Val is not 1');

The if-statement is replaced by a branch operation. The CPU will jump to an address after the first printf call if the value is not 1. Every conditional construct in a program is translated this way, which makes branch instructions indispensable. We will also use branches in QEMU to emulate AVR32 instructions that perform conditional operations but do not jump to another address on their own.
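
To make the control flow visible, the same logic can be written in C with an explicit goto, which mirrors what the branch instruction does. This is only an illustrative sketch: the function name describe_val and the label not_equal are made up for this example, and a real compiler may arrange the code differently.

```c
/* Returns the message the example program would print.
 * describe_val is a hypothetical name used only for this sketch. */
static const char *describe_val(int val)
{
    /* cp.w + brne: skip the "then" part if val is not equal to 1 */
    if (val != 1) {
        goto not_equal;
    }
    return "Val is 1";      /* fall-through path: the condition was true */

not_equal:
    return "Val is not 1";  /* branch target: the condition was false */
}
```

Calling describe_val(5) takes the goto and returns "Val is not 1", just like the brne instruction skips the first printf call.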

Remember that we use the Tiny Code Generator to create an intermediate representation (IR) of the AVR32 instructions that is later translated into x86 instructions. If an instruction internally checks if the equal status register flag is set, we will use a branch to skip the part of the IR that represents the operations that should not be performed.

How to implement branch instructions

Let’s start with an actual AVR32 branch instruction. The AVR32 instruction set has the BR{cond} – Branch if Condition Satisfied instruction. The AVR32 Architecture Document defines the operation of the instruction as follows (format 1):

if (cond3)
    PC = PC + (SE(disp8) << 1);
else
    PC = PC + 2;

The pseudocode shows us that we first need to check which condition needs to be evaluated. The condition is given as a 3-bit-long number. The table on page 96 of the architecture document explains how the conditions are encoded. If the condition is true, we calculate the address where the execution should continue. If the condition is false, the execution continues at the next instruction.
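
Before moving on to the QEMU code, it can help to see the pseudocode as plain C. The following is a minimal sketch outside of QEMU, assuming that the condition has already been evaluated and that pc is the address of the branch instruction itself. The function name br_f1_next_pc is made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Plain C model of BR{cond} format 1 (hypothetical helper, not QEMU code).
 * pc    : address of the branch instruction
 * disp8 : raw 8-bit displacement field from the opcode
 * cond  : result of the condition check */
static uint32_t br_f1_next_pc(uint32_t pc, uint8_t disp8, bool cond)
{
    if (cond) {
        /* SE(disp8): reinterpret the 8-bit field as a signed value */
        int32_t disp = (int8_t)disp8;
        /* << 1: the displacement is counted in 2-byte units */
        return pc + (uint32_t)(disp * 2);
    }
    /* condition false: continue with the next 2-byte instruction */
    return pc + 2;
}
```

For example, a displacement field of 0xFE represents -2, so a taken branch at address 0x100 continues at 0x100 - 4 = 0xFC.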

First, we specify the pattern of the instruction in insn.decode:

BR_f1            1100 ........ 0 ...                            @op_rd_disp8

Next, we add the new instruction to the disas.c file:

INSN(BR_f1, BR, "cond3: %d, disp: [0x%04x]", a->rd, (a->disp))

Now we can implement the translation function:

static bool trans_BR_f1(DisasContext *ctx, arg_BR_rd *a){
    //The displacement from the disp-field.
    int disp = (a->disp);

    //check if bit 7 (the sign bit of the 8-bit field) is set => sign extend disp
    if((disp >> 7) == 1){
        disp |= 0xFFFFFF00;
    }
    disp = disp << 1;
    //Creating a TCG Label
    TCGLabel *no_branch = gen_new_label();
    
    //Checking the condition
    switch(a->rd){ //cond3 is carried in the rd field of @op_rd_disp8, as in disas.c
        case 0x0: //Condition 'Equal' (eq)
            tcg_gen_brcondi_i32(TCG_COND_NE, cpu_sflags[sflagZ], 1, no_branch);
            break;
        case 0x1: //Condition 'NOT Equal' (ne)
            tcg_gen_brcondi_i32(TCG_COND_NE, cpu_sflags[sflagZ], 0, no_branch);
            break;
        //...
    }
    //Performing the branch
    gen_goto_tb(ctx, 0, ctx->base.pc_next+disp);

    //Setting the label
    gen_set_label(no_branch);

    ctx->base.pc_next += 2;
    ctx->base.is_jmp = DISAS_CHAIN;
    return true;
}

At the start of the translation function, we first sign-extend the displacement and shift it one position to the left, as the architecture document specifies. Next, we create a TCGLabel. TCGLabels are used as jump targets in the IR.

QEMU generates the IR code iteratively, in the same order as the tcg_gen functions are called in the translation function. If we want to skip certain IR parts, we can make QEMU jump past them, just like a regular CPU would do. You can see that we place the label further down, near the end of the function. This way, we can skip the actual branch operation and continue at the next instruction.

Before we use the label, we check which condition needs to be evaluated. This is done in a switch statement. We need to implement the evaluation for every condition from the condition table. These evaluations are needed by a lot of instructions. To keep the code readable and prevent redundant code segments, you should move this functionality to a helper file.

I use the tcg_gen_brcondi_i32 function to check whether the condition is false. In that case, no branch should happen. tcg_gen_brcondi_i32 first needs a TCG condition (here we use not equal), then a register and an immediate value that is compared with the register content, and finally the label to jump to. We provide the function with the status register bit for the zero flag. After a comparison of two equal values, this flag holds the value 1. You can look at the definition of the Compare Word instruction for more details.

If the register content and the immediate value are not equal, QEMU will jump to the no_branch label in the IR (not in the actual emulated address space). The IR parts in between are skipped, and the execution continues to the next instruction.

However, if the register content and the immediate value are equal, QEMU will execute the IR generated by the gen_goto_tb call, and the branch to the specified address will be performed.
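
The inverted logic (we compare with "not equal" in order to skip the branch) is easy to get wrong, so here is a plain C model of the control flow for the two conditions shown above. It is only a sketch under simplifying assumptions: in QEMU, the switch runs once at translation time and the comparison runs later when the generated code executes, while this model collapses both into a single function. The name br_taken is made up, and all conditions other than eq and ne are deliberately left unmodelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the modelled IR performs the branch (hypothetical helper).
 * cond3  : condition field of the instruction
 * z_flag : current content of the zero flag (0 or 1) */
static bool br_taken(unsigned cond3, uint32_t z_flag)
{
    switch (cond3) {
    case 0x0: /* eq: mirrors brcondi(TCG_COND_NE, Z, 1, no_branch) */
        if (z_flag != 1) {
            goto no_branch;
        }
        break;
    case 0x1: /* ne: mirrors brcondi(TCG_COND_NE, Z, 0, no_branch) */
        if (z_flag != 0) {
            goto no_branch;
        }
        break;
    default:  /* remaining conditions are not modelled in this sketch */
        goto no_branch;
    }
    return true;   /* corresponds to the IR emitted by gen_goto_tb */

no_branch:
    return false;  /* corresponds to the code after gen_set_label */
}
```

Walking through the four combinations of eq/ne and Z = 0/1 by hand is a quick way to confirm that the compare value in each case is the one for which the branch must be taken.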

One last important aspect is the ctx->base.is_jmp = DISAS_CHAIN; assignment at the end. It tells QEMU that the current translation block ends with this instruction. The next translation block still needs to be chained directly behind the one we just ended, because the branch is not always taken. There is also the DISAS_JUMP value, which we would use if an instruction always performs a jump (for example, the RJMP instruction).

Complexity

Implementing the AVR32 branch instructions and the instructions that evaluate conditions was one of the more challenging aspects of this project. Some instructions perform multiple if-operations, and some require more complex evaluations. The use of the TCG functions adds a layer of abstraction, and it is easy to lose track when working with multiple labels.

On multiple occasions, I made small implementation errors when working on these aspects. The mistakes on their own were minor, like mixing up the second and third arguments of a gen_andc call. The QEMU emulation even worked (seemingly) correctly until it could not reach a point that definitely should have been reached, or some other illogical error occurred. Debugging these issues manually is a time-consuming task.

If you need to implement a more complex branch instruction, be sure to always keep track of the labels and their purpose. Also, think about how many labels you need beforehand. Try to keep the operations simple, and maybe write down a truth table that helps you find an easier approach for a complex condition. None of this will prevent every error, but it will reduce the chance of an error going unnoticed.
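
A truth table can also be checked mechanically. The following sketch compares a condition written the complicated way with a simplified candidate over all input combinations. Both conditions are hypothetical examples and are not taken from the AVR32 instruction set; the function names are made up as well.

```c
#include <stdbool.h>

/* Hypothetical condition, written the "complicated" way. */
static bool cond_reference(bool a, bool b)
{
    return !(a && !b);
}

/* Candidate simplification that we want to verify. */
static bool cond_simplified(bool a, bool b)
{
    return !a || b;
}

/* Walks the whole truth table and returns the number of rows
 * in which the two versions disagree (0 means they are equivalent). */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int a = 0; a <= 1; a++) {
        for (int b = 0; b <= 1; b++) {
            if (cond_reference(a, b) != cond_simplified(a, b)) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

With only a handful of status flags involved, such an exhaustive comparison stays tiny, and it catches a swapped operand before the condition is ever turned into TCG calls.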

However, the only reliable way to detect such implementation errors is through testing.

Testing the implementation

Now that we have multiple instructions implemented, we should test our new implementation. Well-defined tests are the only way to catch implementation errors that only come into effect deep inside an emulation.

If you have an AVR32 assembler at hand, you can assemble the following program (test.asm):

.section .text
.global main
main:
    mov r1, 0x1
    mov r2, 0x2
    add r1, r2

Assemble the program with ./avr32-as -o test.elf test.asm. Depending on your operating system, you may encounter the following error message:

./avr32-as -o test.elf test.asm
avr32-as: loadlocale.c:130: _nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed.

To solve this issue, you simply need to set an environment variable:

export LC_ALL=C

Now, the assembler should work as intended, and it should produce an ELF file. As we did not implement an ELF loader for our emulator, we cannot execute the file directly. We need to extract the text section. In the next article, I will show you how this process can be done automatically. For now, we need to find the text section by hand with the avr32-objdump tool:

./avr32-objdump -h test.elf 

test.elf:     file format elf32-avr32

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00000006  00000000  00000000  00000034  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  0000003a  2**0
                  CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  0000003a  2**0
                  ALLOC

We can see that the .text section starts at offset 0x34 (decimal 52) and is 6 bytes long. We can use standard Unix system tools to extract the section:

dd bs=1 if=test.elf of=test.bin count=6 skip=52
6+0 records in
6+0 records out
6 bytes copied, 9.7795e-05 s, 61.4 kB/s
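
If dd is not available on your system, the same extraction can be done with a few lines of C. This is a minimal sketch, not part of the QEMU code base; the function name extract_bytes is made up, and the offset and size still have to be taken from the objdump output by hand.

```c
#include <stdio.h>

/* Copies `size` bytes starting at `offset` from in_path to out_path.
 * Returns 0 on success and -1 on any error (including a short input file).
 * extract_bytes is a hypothetical helper name. */
static int extract_bytes(const char *in_path, const char *out_path,
                         long offset, size_t size)
{
    FILE *in = fopen(in_path, "rb");
    if (!in) {
        return -1;
    }
    FILE *out = fopen(out_path, "wb");
    if (!out) {
        fclose(in);
        return -1;
    }

    int rc = 0;
    /* Skip everything in front of the section (dd: skip=...) */
    if (fseek(in, offset, SEEK_SET) != 0) {
        rc = -1;
    }
    /* Copy the section byte by byte (dd: bs=1 count=...) */
    for (size_t i = 0; rc == 0 && i < size; i++) {
        int byte = fgetc(in);
        if (byte == EOF || fputc(byte, out) == EOF) {
            rc = -1;
        }
    }

    fclose(in);
    if (fclose(out) != 0) {
        rc = -1;
    }
    return rc;
}
```

For the section shown above, the call would be extract_bytes("test.elf", "test.bin", 0x34, 6).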

Now we can execute QEMU and use the AVR32 binary file as input. If we set the -d cpu parameter, QEMU prints the CPU status at the start of every translation block:

build/avr32-softmmu/qemu-system-avr32 -machine avr32example-board -bios test.bin -d cpu
Setting up board...
Realizing...
Board setup complete
Loading firmware 'test.bin'...
[AVR32-BOOT]: Loading firmware images as raw binary
[AVR32-BOOT]: Loaded boot image successfully
PC:    d0000000
SP:    00000000
LR:    00000000
r0:    00000000
r1:    00000000
r2:    00000000
#...
sregH:    00000000
sreg30:    00000000
sregSS:    00000000

And now we did it. We ran our own CPU emulation for the first time. But why are all the registers empty? Shouldn’t there be the values 1, 2, and 3 somewhere?

As mentioned, the -d cpu parameter makes QEMU print the CPU status at the start of every translation block. Because our AVR32 binary consists of exactly one block, QEMU only shows us the initial register values. And, as expected, they are mostly zero.

Expanding the test

To get the result of our AVR32 test program, we need to add a second translation block. This is why we needed to implement the branch instruction. A new translation block starts after every branch. Let’s expand our test with just two more lines:

.section .text
.global main
main:
    mov r1, 0x1
    mov r2, 0x2
    add r1, r2
    bral end
end:

If we repeat the process from before, we should get a second translation block output:

./avr32-as -o test.elf test.asm
dd bs=1 if=test.elf of=test.bin count=10 skip=52
build/avr32-softmmu/qemu-system-avr32 -machine avr32example-board -bios test.bin -d cpu
Setting up board...
Realizing...
Board setup complete
Loading firmware 'test.bin'...
[AVR32-BOOT]: Loading firmware images as raw binary
[AVR32-BOOT]: Loaded boot image successfully
# ...
PC:    d000000a
SP:    00000000
LR:    00000000
r0:    00000000
r1:    00000003
r2:    00000002
#...

Remember to change the size argument for dd, as the new text section is longer. Now you should see a second status dump at the end. It shows the CPU status at the start of the second translation block, which begins at the end label. This way, we can see the register contents after the end of the first block.

As we expect, register 1 holds the value 3, and register 2 still holds the value 2.

The next steps

Now that we have a working emulation and the ability to test the implementation, we should start to automatically test the translation functions. There are complex instructions that are not as easy to implement as the add instruction. Because of that, there is a high risk of implementation errors that do not break the emulation outright but result in semantically incorrect translations.

In the next article, I will show you how we can do automated tests on the emulated instructions to detect faulty translation functions.