The hell of CPU debugging

This article aims to give an overview of my project named bracoujl, a script I made when writing and debugging my GameBoy color emulator project. Both projects’ code are available on my bitbucket: the emulator (nebula) and the debugging tool (bracoujl).

Tip: If a graph in this blog post is big, browser will not make it easy to move arround to inspect it. For this, eog is cool. It has zoom and drag-and-drop movement features.

Is writing an emulator for the GameBoy easy? Well I did not think it would take me that kind of debug to do so, but I think this is an interesting story.

While starting it, I decided I wanted it to be clean. If you read this nice chart containing all the opcodes of the CPU, you can see patterns. Cool! This means we have an opportunity to factorize the code! The most important thing I learnt about emulation: the CPU is the easiest thing to write and understand, but the easiest to screw too, and even if it’s not directly the CPU that fails, it will probably be impacted. (memory, anyone?)

With bracoujl, the goal was to get a better idea of what the CPU was doing, to understand bugs easier. This blog post will present what bracoujl can understand, and then show examples of debugging done with it.

Beginning with CPU debugging

When starting to debug my emulator, so basically watching what was being executed, I first started with the awesome ROMs written by blargg. It was cool and helped a lot, but one thing really hit me fast: if you don’t really know where the bug is, stepping through the code becomes boring really, really, really quickly.

Most of the ROMs I worked with start with a lot of memory initialization and the simple process of extracting some code or data from somewhere in memory to somewhere else (ROM to RAM for example). Check the function bellow for example. It comes from the bomberman ROM.

Bomberman sub_0237: memset(%hl, 0, %bc)

click image to see full size

This function is small and called 13 times, but the loop is executed 16691 times which represents over 100k instructions linearly. And this is only a small function, when debugging a bug caused by the joypad, I used to generate logs containing millions of instructions executed.

[DEBUG] PC: 0100 | OPCODE: 00 | MEM: C350 | DIS: nop
[DEBUG] PC: 0101 | OPCODE: C3 | MEM: 5001 | DIS: jmp $0x0150
[DEBUG] PC: 0150 | OPCODE: F0 | MEM: 44FE | DIS: ldh %a, ($0xFF44)
[DEBUG] PC: 0152 | OPCODE: FE | MEM: 9138 | DIS: cp %a, $0x91
[DEBUG] PC: 0154 | OPCODE: 38 | MEM: FAAF | DIS: jr cy, $0xFA ; ($-6)
[DEBUG] PC: 0150 | OPCODE: F0 | MEM: 44FE | DIS: ldh %a, ($0xFF44)
[DEBUG] PC: 0152 | OPCODE: FE | MEM: 9138 | DIS: cp %a, $0x91
[DEBUG] PC: 0154 | OPCODE: 38 | MEM: FAAF | DIS: jr cy, $0xFA ; ($-6)
[DEBUG] PC: 0150 | OPCODE: F0 | MEM: 44FE | DIS: ldh %a, ($0xFF44)
[DEBUG] PC: 0152 | OPCODE: FE | MEM: 9138 | DIS: cp %a, $0x91
[DEBUG] PC: 0154 | OPCODE: 38 | MEM: FAAF | DIS: jr cy, $0xFA ; ($-6)
[DEBUG] PC: 0150 | OPCODE: F0 | MEM: 44FE | DIS: ldh %a, ($0xFF44)
[DEBUG] PC: 0152 | OPCODE: FE | MEM: 9138 | DIS: cp %a, $0x91
[DEBUG] PC: 0154 | OPCODE: 38 | MEM: FAAF | DIS: jr cy, $0xFA ; ($-6)
[DEBUG] PC: 0150 | OPCODE: F0 | MEM: 44FE | DIS: ldh %a, ($0xFF44)
[DEBUG] PC: 0152 | OPCODE: FE | MEM: 9138 | DIS: cp %a, $0x91
[DEBUG] PC: 0154 | OPCODE: 38 | MEM: FAAF | DIS: jr cy, $0xFA ; ($-6)
[DEBUG] PC: 0150 | OPCODE: F0 | MEM: 44FE | DIS: ldh %a, ($0xFF44)
[DEBUG] PC: 0152 | OPCODE: FE | MEM: 9138 | DIS: cp %a, $0x91
[DEBUG] PC: 0154 | OPCODE: 38 | MEM: FAAF | DIS: jr cy, $0xFA ; ($-6)

This is the input of bracoujl. This input format is defined by the CPU configuration used. DIS was added by a miscellaneous script (bundled with bracoujl) for your convenience, but it is not in the original logs.

This is only the first problem, because while executing this, you must save all the calling backtrace, and follow in which loop you are in (without forgetting nested loops). Additionally, once you have interruptions implemented, they will start randomly, so you must remember what you were looking at and context switch in your head too, for it.

I must be honnest, I did not really check what other people did on the subject. I thought of this as an interesting problem and wanted to solve it myself. Here are the solutions I weighted:

I was in C++, so simply debug the CPU through my C++ code in gdb.
Write a simple debugger for the z80 CPU, with memory/register dumps and breakpoint management.
Write a tool to transform logs of the linear execution of a ROM into something useful.

The first approach is OK for me if you know what you want to debug. Say find out what contains memory for a certain instruction. But it starts to be as boring as steping through the instructions when you don’t know what you want to find.

I did not take the second approach for a simple reason: the support of this is too complicated. With a gameboy emulator, having another working emulator at hand with sources is easy, and it helps a lot to be able to compare behaviors. It helps a lot to understand the inner-workings of this type of hardware, but if you implement a debugger in yours, you must do it for your reference too, which is not practical.

Writing a debugger brings easy bugs with it too, or it could be fooled by the current bugs of the CPU itself. Instead, only having to print the current program counter, and the instruction opcode on stdout is easy, doesn’t require to understand the whole CPU and memory architecture for a given emulator and avoids making mistakes.

It is in my plans to write the debugger for the z80 for nebula. My point is that it is not a good solution when the CPU is totally broken. Finally, the last solution is re-usable. If you currently write an emulator, you should be able to try it with a really small patch in your code.

What `bracoujl` can understand (to an extent)

So the solution was to automate the process of transforming of the linear logs to graphs, generated with graphviz via a dot file. The visual form of functions is a lot better than instructions on a linear way.

This part will explain what bracoujl understands and why it matters regarding our problem. Titles should give you a nice idea of all the problems I encountered while solving them.

I think you must first understand how bracoujl proceeds. The base of the algorithm is as follow:

For each instruction that was executed. We have pc (program counter), opcode (a byte being the opcode) and, in the case of the z80, two bytes of memory following the opcode for the disassembler, that could be skipped if there was no disassembler.
- Check the pc in our current graph. We store the opcode at that pc as an instruction and link it to the last instruction executed.
- Only the link step is necessary if we already knew the instruction.
- If we already knew the link too, nothing has to be done to construct a graph.
When we read all the instructions, we can now remove all the links between two instructions where the starting instruction doesn’t go anywhere else and the ending instruction isn’t reached by any other instruction. This creates blocks.
Once all the blocks are created, we just need to output all those as a dot file and then use them to generate the image files necessary.

But this will generate big, un-readable graphs that cannot be used easily. Here are a series of fixes that bracoujl does and that enhance the readability of the graphs.

Graph readability and size

(Conditional) Jumps

Lets start with a first, little fix that enhances readability quite a lot. The first fix I decided to do is to cut any block if it ends on a jump. This is done easily through the CPU configuration used by bracoujl.

All that was needed is a list of jumps, relative jumps and the cutting was easy. But something makes the graph a lot more readable: colorize the links on conditional jumps, in green when it is taken, in red when it is not triggered.

This is simple if we know the size of a given jump. When retrieving the next instruction, we can know if it is the instruction behind the jump in memory (which means we fell-through) or another address, in which case we decide we took the jump.

The result can be seen in above graph of the function sub_0237 with the loop that uses a relative jump of -7 bytes if %b | %c is not zero (%bc is not zero). When the relative jump is taken, we go back at the beginning of the loop (green link), else we continue to the end of the function (red link).

Functions

First thing done to enhance readability, was to understand when a call was made, and when it returns, and to continue building from the call opcode on ret. For instance, take this simple x86_64 assembly code:

func:
    push rbx
    xor rbx, rbx
    mov [rax], rbx
    pop rbx
    ret

main:
    mov rax, 0xcafebabedeadbeef
    call func
    xor rax, rax
    inc rax
    ret

Seen linearly, it looks like this log:

mov rax, 0xcafebabedeadbeef
call func
push rbx
xor rbx, rbx
mov [rax], rbx
pop rbx
ret
xor rax, rax
inc rax
ret

But when drawing the graph, you don’t really want to have a ret linked to the xor rax, rax, so the goal was to understand the ret instruction and to go back to the call instruction. This is done easily with a back-trace list of addresses, the list of call and rst (interrupt call) instructions.

But z80 has both conditional call and ret instructions. So the process of understanding if one of these instructions triggered is done the same way than what is done for jump instructions, explained above.

When starting the project, call links would also be colorized, but in blue. The problem is, graphs became bigger and bigger. And if a function was called in a lot of places, it became a hot-spot of the graph. Below is an example of an old version, not too huge, that shows the readability problem of having all functions in one graph.

Uncut functions

click image to see full size

Now only bigger functions look like this, and all calls are replaced with little boxes only saying Call to sub_XXXX. The next graph shows a function with some calls, that looks a lot cleaner now that the functions are split.

Sample function with multiple sub calls

click image to see full size

Interrupts

Another problem are the interruptions done by the CPU. They will start “randomly” between the instructions. The problem is, it breaks a lot of blocks in pieces, and this is annoying. The important thing to understand is that, even if the first execution of the function is broken only in one place, the more it is called, the more it will be broken, and often in a lot of different places.

This example (I crafted by hand) shows the kind of graphs you would get with only the function sub_0237 above and the interrupts not removed out of the way.

Interrupts: sub_0237 brooooken!

click image to see full size

As you can see, the little function above is now a complicated cluster of nodes with links crossed everywhere. The end result looks a lot like function cutting, but for detection, this time I used the fact that all interrupts have know addresses (in practice a jump table located from address 0000h to address 0060h every 8 bytes).

There is another way to trigger an interrupt, and this is with the rst instruction. Now this is not a problem, in fact this is managed the exact same way a call is managed. Both methods use the same algorithm than calls for the back-trace, and the way conditional returns are managed.

Other behaviors

Now we have ok looking graphs of the program being executed. A lot more readable than only the linear logs of execution. But there are still other things done by a ROM that are disturbing, and must be understood.

Memory changes

This problem can come if there is self-modifying code, virtual memory or, like in the GameBoy, memory banking.

Memory banking by example

The first 4000h addresses of the address space on the GameBoy are the first 4000h bytes of the ROM, we will call ROM Bank 0. This only gives 16 KB of space for the ROM in the address space, for all game code and data. This is small, so the next 4000h addresses (from 4000h to 7FFFh) are a view of another 16 KB of the ROM. It is a switchable view of ROM banks. So the GameBoy can address 32 KB of ROM memory at the same time.

This memory is not writable, so when writing to this address space, it switches the current memory viewed by the switchable memory bank. There are several memory controllers that can be used, documentation can be found there.

Lets take an example with MBC1 memory bank controller. Say we currently are on bank 2, which is 8000h:BFFFh of the ROM viewed through 4000h:7FFFh of the address space. If the code writes value 05h on any address of 2000h:3FFFh, this will change the lower 5 bits of the current ROM selector with the lower 5 bits of the value, which will make us view fifth bank in this case. We now use 14000h:17FFFh of the ROM when addressing 4000h:7FFFh of the address space. This system makes it possible to address up to 2 MB of ROM memory, instead of 16 KB.

Memory changes management

All this speech to say what? That whenever any code is executed in memory range 4000h:7FFFh, it can be up to 125 different functions if MBC1 is used, and it happens. This would be the same problem if the code was in memory and changed itself.

To manage this, bracoujl compares the instruction executed with the instructions known for this address. If they are different, it creates different paths from the preceding instruction. Sometimes they merge back with the old flow, sometimes they just totally differ. It often happens for a function that is called and is switched, or the target of a jump.

The blocks are then merged and split as usual, and same-memory blocks are named with a memory label, that is the order in which they were gone through.

This is kind of a proof-of-concept management algorithm, it sadly doesn’t give any insight in the timeline of the memory changes. Even though I didn’t see it become chaotic, I think it could, which is not cool. Anyway, with this feature, bracoujl is not disturbed if the code changes for a same memory address.

push + ret = jump

Another behavior used by some games is to do a push of an address, then a ret and so effectively doing a jump. This used to break completely the back-trace of calls and interrupts.

To fix this, the concept of “ret miss” was introduced. bracoujl checks for the address of the next instruction after a ret against the expected address (computed from the size of call opcodes). If they don’t correspond, this means there was a weird manipulation of the stack done, and is understood as a “ret miss”.

Sometimes this is just because there is a bug in memory or CPU, but in cases like the below function, it is stack manipulation.

"push + ret = jump" in sub_0427

click image to see full size

If you look at the three local blocks named loc_4052, loc_405C and loc_408D. They all go to block loc_0433 and are considered ret misses (brown link). This can be explained by the first block of the function, sub_0427, where we can read ld %bc, $0x0433; push %bc. Since the jump instruction uses %hl register and jumps to switchable ROM banks from ROM bank 0, it was probably the easiest way to go back to the right code in bank 0 without hard-coding any address in rom banks.

As you can see with this example, this method keeps the graph readable and simple, and reading it explains the behavior easily, which is cool.

Other features

Graph comparison

This feature aims to compare two graphs produced by two different flows (from two different emulators, or two different versions of the same emulator) against each other. Once the graph are produced, they should look a lot alike for a given code.

Number of loops (on synchronization mechanisms) and interrupts can vary a lot between the two emulators. Even more if both are not perfect emulators. This is why linear comparison with diff or meld is impossible. Plus files are usually huge, for not that much code.

The output currently is fairly simple, but it at least lists the function where it found different blocks and list all the different blocks. Differences currently include:

A block that reaches more places in one of the graphs.
A block that is reached from more places in one of the graphs.
A block that is present in only one of the graphs.
A block is different in graphs. Which can be caused by:
- A block that was not split in two, because one of its inner instructions was reached once, instead of from multiple other blocks.
- The CPU gets the size of an instruction wrong and started to execute some wrong code.

Disassenbly

Finally, one last choice was to put the optional disassembler directly in bracoujl instead of expecting the emulator to do it. The biggest argument being that then you can compare logs from multiple emulators containing disassembly at no cost.

Case studies

Here are some examples of graphs that really show how much it can help identify bugs. In all these examples, the left image will be the expected behavior, and the right one will be the buggy behavior.

For the simple case studies, the graphs are given with a patch to add the bug to nebula, from commit number b3b72f3 and both logs are from nebula. I indeed added those bugs intentionally to show the graph, but if you go through the commit log, you should find those same bugs when I fixed them. The big case studies are made directly with the fixed commits, referenced in the explanations.

Simple case studies

Multi-byte instruction with wrong size

click image to see full size

diff --git a/src/z80/z80opcodes.cc b/src/z80/z80opcodes.cc
index 8c9ed8e..55b1a49 100644
--- a/src/z80/z80opcodes.cc
+++ b/src/z80/z80opcodes.cc
@@ -61,7 +61,7 @@ uint16_t ld_1B_reg_d8(MMU& mmu, Z80Registers& regs, uint8_t& reg)
 uint16_t ld_2B_reg_d16(MMU& mmu, Z80Registers& regs, WordRegProxy& reg)
 {
     reg.set(mmu.read<uint16_t>(regs.PC + 1));
-    return P(3, 12);
+    return P(2, 12);
 }

 uint16_t ld_2B_reg_d16(MMU& mmu, Z80Registers& regs, uint16_t& reg)

This first example was a little modified so that second image is not too big (this should be fixed quite soon in bracoujl, since I consider this an undesired behavior). Anyway, as you should clearly see, there are additional instructions in second and third blocks. Then later instead of a jump, the CPU goes into irrelevant code that ends up being mostly nops, and this loops endlessly.

This bug was also observed with the CPU starting to execute a part of the memory containing data or not aligned properly on code and then stopping on an undefined instruction.

Relative jump mis-behaving

click image to see full size

diff --git a/src/z80/z80opcodes.cc b/src/z80/z80opcodes.cc
index 8c9ed8e..964394e 100644
--- a/src/z80/z80opcodes.cc
+++ b/src/z80/z80opcodes.cc
@@ -408,7 +408,7 @@ uint16_t jr_if(MMU& mmu, Z80Registers& regs, uint8_t val)
     if (val)
     {
         regs.PC += (int8_t) mmu.read<uint8_t>(regs.PC + 1);
-        return P(2, 12);
+        return P(0, 12);
     }
     return P(2, 8);
 }

This mistake I made on relative jump is computing the target address without counting jump instruction size. The graph makes it really clear that some instructions are not executed. Again this example is an easy one that could have been avoided.

Flags not set correctly

If you don’t have test ROMs as I did for the GameBoy, you can also check for flag consistency with the number of loops you execute. Say one instruction sets the flag triggering a relative jump instruction for a loop one value too soon, then you have one iteration of a loop that is not executed, that can be seen in a graph too. But lets go into more interesting bugs.

Case study: Is joypad OK?

Bomberman: start of the ROM bug (gif).

click image to see animated

This bug was in nebula up until commit 9859ef3 (included). In addition to looking nice, this the kind of bug that bracoujl is made for. When starting to debug, the two usual questions must be answered:

Where is the bug?
What is causing the bug?

My understanding at the time was that to function properly, the minimum necessary to a game was the CPU, the memory and the GPU. Sound and joypad were optional. We could guess that this bug must come from the GPU, but once the first frame is displayed correctly, the game continues correctly, weird.

So lets try to use bracoujl on this. I had another emulator at the time, and I made the stats with this when I was really debugging it, but the following stats are from nebula (9859ef3 for the buggy stats, b3b72f3 for the good ones).

When trying to stop on the first valid frame as soon as possible on both, the correctly working emulator was able to display the first frame with a few letters in 320,138 instructions (13 MB of logs), while for the buggy version, 2,938,046 instructions were necessary to get to the first frame (118 MB of logs), before the first letter could even be displayed. This is interesting, the CPU really is impacted heavily here. We are not going to read this by hand. So lets put bracoujl in use.

When looking at the graphs (or using the comparison function), the first difference is in the function sub_0819.

click image to see full size

Hmmm this is weird: FF00h is an I/O Port to … the JOYPAD status. I honestly could not believe this could be a problem. We also noticed that interrupts were triggered totally differently in the two emulators we had, and I thought this was probably more relevant to the problem.

Since we were two to work on nebula during that time and I knew I had done a shitty job at implementing the joypad, we decided to split jobs: my friend colona would fix the joypad while I would continue to analyze the logs for interesting stuff.

Turns out that once he came back to me with a patch for the joypad, he also fixed the bug with it. Bomberman apparently has a weird behavior if the JOYPAD is not in the state it expects it to be in. Thanks again to Ivan Delalande (colona) for his awesome help on this bug.

I think the lesson learnt with this bug was that trying to explain the bug with a complicated explanation when an obvious issue was there was wrong. Instead, starting by fixing simple issues before reaching for more complicated explanations is a good idea in a program like that.

Case study: Memory banks rock!

Often, bracoujl is an helper with another method. This bug is an example of that. It was present in nebula in revisions up to 142bbd8 included. To explain this bug, I must explain a little bit how nebula was written. One of the first goals was to keep it clean, which means avoid redundant code.

If you look at the opcode table in the CB table, there are 3 instructions that do things with a very similar API: res, set and bit. They all take a register, a bit, do something on it and we are good. There is also the version with %hl dereference. This gives a great opportunity to write a proxy with the basic operations of reading and writing the registers/memory, and the specific function to do just what it’s supposed to do: set, reset or test a bit in the value. But the bit function doesn’t need to modify the value, and I said: whatever, we will just write back the unmodified value to the registers/memory and this will be good, no need for a separate function.

Few weeks/months later, the emulator is a going well and now implements a lot more features, including the graphical emulation and some memory banks controllers, but when launching some games, for example Kirby, it would crash with the error Unknown opcodes E3.... Did we end up in some shady memory part, say data? Lets check what the graphs look like.

click image to see full size

This time, the working emulator was not really necessary, anyway seeing its graph gives a good insight. Lets start where the emulator crashed, in function sub_40E4.

The interesting thing is both graph differ when they start executing memory in switchable memory bank. This is an information we did not have, and which is really interesting. So I started to log when bank switches where done, but without the informations available in the graphs, there would be too much. Using the linear log, I went to the crash point, and searched backward for bank changes. One happens in the function sub_19F9 linked bellow, and called just before the CPU starts executing wrong memory. The guilty instruction is bit $3, (%hl) in local block loc_1A6E.

click image to see full size

This behavior was not expected when I first wrote the CPU. Indeed, memory banking was an unknown concept to me and I did not expect that writing the same value I just read to a memory position could have side-effects like that at the time. Here bracoujl was used with another type of log, to first understand and target what was going on, and then understand the second type of log, which made the debugging a lot easier and faster.

Conclusion

This project was really interesting. The material to test was really diverse, and doing that kind of work raises a lot of interesting corner cases. It also taught me that trusting your debugger is “dangerous” and also really important. Indeed when trying to find a bug, assumptions will be made based on the information taken form the debugger. This project taught me that when writing a debugger, a lot of care must be put into it, so that then when you use it, you can make your assumptions safely. Obviously, there is no warranty bracoujl is perfect, but it works correctly :P.

You should have no hard-time trying it

And if you do, bug me! Again, everything is available on my github. If you want to try it on another architecture, just take the z80 configuration (disassembler is optional), adapt it to your architecture and try to run it. You can even send me a mail and I will be happy to help you with that.

What next?

Missing features

Even though my nebula project is in pause for me, which means bracoujl too, but this does not mean that it will never see new features. Some of the things that could be enhanced are:

Adding serialization of graphs. This cannot be done with pickle trivially since there is a circular dependency right now between the blocks and the links. But it must be possible easily anyway, by either removing the circular dependency (not sure if possible) or simplifying one of the references. It would be awesome speed-wise. Graphs are currently generated from logs each time svgs, dots or comparison is done.
Add a search feature, that gives information, like the local block in which a particular instruction is, or it’s position in the image (see next point). If you really need searching feature now, grep is still cool.
Maybe something other than graphviz would be cool to generate the graphs. Sometimes the results are a little weird, nothing too bad though.
Being able to have information on the time-line, and put annotations to the graph, could be cool :)
Bugfix: Compress nop-sleds (as seen in the graph the “wrong-size” part, where I had to fix them by hand so that they are not too huge.

New project idea

To conclude, this project gave some of my friends and me a fun idea: why not try to infer an instruction set from its behavior? Say you start to reverse engineer a virtual machine for an unknown architecture, it could be fun to work on some tool that would give insight by observing the behavior of it executing code. I don’t know how far this idea could go, but I may work on that some time soon.

Beginning with CPU debugging

What bracoujl can understand (to an extent)

Graph readability and size

(Conditional) Jumps

Functions

Interrupts

Other behaviors

Memory changes

push + ret = jump

Other features

Graph comparison

Disassenbly

Case studies

Simple case studies

Multi-byte instruction with wrong size

Relative jump mis-behaving

Flags not set correctly

Case study: Is joypad OK?

Case study: Memory banks rock!

Conclusion

You should have no hard-time trying it

What next?

Missing features

New project idea

Enjoy Reading This Article?

What `bracoujl` can understand (to an extent)