jak-project/decompiler
2021-06-11 19:04:47 -04:00
..
analysis [decomp] start on actor-link-h (#579) 2021-06-11 19:04:47 -04:00
config [decomp] start on actor-link-h (#579) 2021-06-11 19:04:47 -04:00
data Force-enable assert on release mode Windows build (#440) 2021-05-10 23:03:56 -04:00
Disasm [decompiler] Support accessing 128-bit bitfields (#514) 2021-05-21 20:36:04 -04:00
Function [decompiler] Add pass to generate a symbol definition map file (#546) 2021-05-31 10:43:25 -04:00
gui Make decompiler naming consistent (#94) 2020-10-24 14:27:50 -04:00
IR clean up config (#456) 2021-05-11 19:19:23 -04:00
IR2 [decompiler] fix missing casts issue (#573) 2021-06-09 21:35:13 -04:00
ObjectFile [decompile] bsp-h (#562) 2021-06-07 18:30:16 -04:00
scripts add decompiler 2020-08-22 23:30:17 -04:00
util Change syntax for boxed array to support different length and allocated-length (#568) 2021-06-07 20:22:06 -04:00
VuDisasm [decompiler] support break (#516) 2021-05-23 20:16:34 -04:00
CMakeLists.txt [decompiler] Add pass to generate a symbol definition map file (#546) 2021-05-31 10:43:25 -04:00
config.cpp [compiler/decompiler] Take the address of a variable (#554) 2021-06-04 13:43:19 -04:00
config.h [compiler/decompiler] Take the address of a variable (#554) 2021-06-04 13:43:19 -04:00
main.cpp Organize all types (#523) 2021-06-01 22:41:37 -04:00
README.md add decompiler 2020-08-22 23:30:17 -04:00

How to use

Compile (Linux):

mkdir build
cd build
cmake ..
make -j
cd ..

After compiling: First create a folder for the output and create a folder for the input. Add all of the CGO/DGO files into the input folder.

build/jak_disassembler config/jak1_ntsc_black_label.jsonc in_folder/ out_folder/

Notes

The config folder has settings for the disassembly. Currently Jak 2 and Jak 3 are not as well supported as Jak 1.

Procedure

ObjectFileDB

The ObjectFileDB tracks unique object files. The games have a lot of duplicated objected files, and object files with the same names but different contents, so ObjectFileDB is used to create a unique name for each unique object file. It generates a file named dgo.txt which maps its names to the original name and which DGO files it appears in. The ObjectFileDB extracts all object files from a DGO file, decompressing the DGO first if needed. (note: Jak 2 demo DGOs do not decompress properly). Each object file has a number of segments, which the game can load to separate places. Sometimes there is just a single "data" segment, and other times there are three segments:

  • top-level is executed at the end of the linking process, then discarded and goes in a special temporary heap
  • main is loaded and linked onto the specified heap
  • debug is loaded and linked onto the debug heap

This function interprets breaks the object file's data into segments, and processes the link data. The data is stored as a sequence of LinkedObjectWords, which contain extra data from the link. The LinkedObjectWords are stored by segment in a LinkedObjectFile, which also contains a list of Labels that allow LinkedObjectWords to refer to other LinkedObjectWords. Note that a Label can have a byte-offset into a word, which GOAL uses to load non-4-byte-aligned bytes and halfwords, and also to represent a pair object.

ObjectFileDB::find_code

This function looks through the LinkedObjectFiles and splits each segment into data and code zones.

The only files with code zones are from object files with three segments, and the code always comes first. The end of the code zone is found by looking for the last GOAL function object, then finding the end of this object by looking one word past the last jr ra instruction. This assumes that the last function in each segment doesn't have an extra inline assembly jr ra somewhere in the middle, but functions with multiple jr ra's are extremely rare (and not generated by the GOAL compiler without the use of inline assembly), so this seems like a safe assumption for now.

The code zones are scanned for GOAL function types, which are in front every GOAL function, and used to create Functions. Each Function is disassembled into EE Instructions, which also adds Labels for branch instructions, and can also contain linking data when appropriate. The final step is to look for instructions which use the fp register to reference static data, and insert the apprioriate Labels. GOAL uses the following fp relative addressing modes:

  • lw, lwc1, ld relative to the fp register to load static data.
  • daddiu to create a pointer to fp-relative data within +/- 2^15 bytes
  • Sequence of ori, daddu to generate a pointer that reaches within +2^16 bytes
  • Sequence of lui, ori, daddu to generate any 32-bit offset from fp.

The last two are only found in very large object files, and GOALDIS doesn't handle these.

The fp register is set with this sequence. The function prologue only sets fp if it is needed in the function.

;; goal function call, t9 contains the function address
jalr ra, t9
sll v0, ra, 0 

;; example goal function prologue:
daddiu sp, sp, -16 
sd ra, 0(sp)
sd fp, 8(sp)
or fp, t9, r0

Note: there are a few hacks to avoid generating labels when fp is used as a temporary register in inline assembly. Like ignoring stores/loads of fp from the stack (kernel does this to suspend resume a thread), or ignoring fp when used with the PEXTLW function, or totally skipping this step for a single object file in Jak 2 (effect-control).

ObjectFileDB::process_labels

This step simply renames labels with L1, L2, .... It should happen before any custom label naming as it will overwrite all label names.

ObjectFileDB::find_and_write_scripts

Looks for static linked lists and attempts to print them. Doesn't support printing everything, but can print nested lists, strings, numbers, and symbols.

ObjectFileDB::write_object_file_words

Dumps words in each segment like hexdump. There's an option to only run this on v3 object files, which contain data, as opposed to v2 which are typically large data.

ObjectFileDB::write_disassembly

Like write_object_file_words, but code is replaced with disassembly. There's a config option to avoid running this on object files with no functions, as these are usually large data files which are uninteresting to view as a binary dump and slow to dump.

Basic Block Finding

Look at branch intstructions and their destinations to find all basic blocks. Implemented in find_blocks_in_function as part of analyze_functions. This works for Jak 1, 2 and 3.

Analyze Functions Prologues and Epilogues

This will help us find stack variables and make sure that the prologue/epilogue are ignored by the statement generation.

A "full" prologue looks like this:

    daddiu sp, sp, -208
    sd ra, 0(sp)       
    sd fp, 8(sp)       
    or fp, t9, r0        ;; set fp to the address of this function
    sq s3, 128(sp)     
    sq s4, 144(sp)     
    sq s5, 160(sp)     
    sq gp, 176(sp)     
    swc1 f26, 192(sp)  
    swc1 f28, 196(sp)  
    swc1 f30, 200(sp)  

GOAL will leave out instructions that aren't needed. This prologue is "decoded" into:

Total stack usage: 0xd0 bytes
$fp set? : yes
$ra set? : yes
Stack variables : yes, 112 bytes at sp + 16
Saved gprs: gp s5 s4 s3
Saved fprs: f30 f28 f26

A similar process is done for the epilogue, and it is checked against the prologue.

The prologue is removed from the first basic block and the epilogue + alignment padding is removed from the last one.

Documentation of Planned Steps that are not implemented

Currently the focus is to get these working for Jak 1. But it shouldn't be much extra work to support Jak 2/3.

Guess Function Names (to be implemented)

When possible, we should guess function names. It's not always possible because GOAL supports anonymous lambda functions, like for example:

(lambda ((x int) (y int)) (+ x y))

which will generate a GOAL function object without a name.

But these are pretty uncommon, and the majority of GOAL functions are

  • Normal functions, which are stored into a symbol with the same name as the function
  • Methods, which are stored into the method table of their type with the method-set! function. Sadly we can't get the name of methods, but we can get their ID (to figure out the inheritance hierarchy) and what type they are defined for.
  • State handlers / behaviors (not yet fully understood)
  • Virtual state handlers / behaviors (not yet fully understood)

Currently the state/behavior stuff isn't well understood, or used in the early initialization of the game, so name guessing won't worry about this for now.

Guess Types (to be implemented)

The majority of GOAL types have a compiler-generated inpsect method which prints their fields. We should detect these methods in the previous function name guessing step, and then read through them to determine the data layout of the type.

Control Flow Analysis

The basic blocks should be built into a graph and annotated with control flow patterns, like if, cond, and, and various loops. To do this, register liveliness will be determined for each instruction.

Conversion to Statements

Instructions (or sequences of instructions that should not be separated) should be converted into Statements, which represent something like (add! r1 r2 r3). The registers should be mapped to variables, using as many variables as possible, as we don't know at this point if a register will be holding the same GOAL variable at different instructions.

Type propagation

Variables should get types determined by arguments of the function, which should then be propagated to other Statements in the function, and can then refine the argument types of other functions. This process should be repeated until things stop changing.

Variable declaration

Variables which are actually the same variable will be merged. The point at which variables are first defined/declared will be determined based on liveliness and then expanded to come up with a scope nesting that doesn't cross control flow boundaries.

Statement -> S-Expression map tree

Due to the the simple single pass GOAL compiler design, we build a tree which represents how Statements can be combined to eliminate variables. As an extremely simple example:

(set! r1 thing1)
(set! r2 thing2)
(add-int! r4 r2 r3)
(mult-int! r1 r4)

can be collapsed to

(* thing1 (+ thing2 r3))

But

(set! r2 thing2)
(add-int! r4 r2 r3)
(set! r1 thing1)
(mult-int! r1 r4)

can be collapsed to

(let ((temp0 (+ thing2 r3)))
  (+ thing1 temp0)
)
  

and this difference will actually reflect the difference in how the code was originally written! This is a huge advantage over existing decompilers, which will be unable to tell the subtle difference between the two.

Macro pattern matching

Lots of GOAL language features are implemented with macros, so once the s-expression nesting is recovered, we can pattern match to undo macros very precisely.