jak-project/decompiler
Luminar Light 0ae0938965
Only remove -vis from name if it is part of the name. (#3257)
During level extraction, the last 4 characters of the level name are
always removed, because it is assumed that '-vis' is there. But it isn't
always there. This is especially true in post-TPL games.

This causes multiple problems:
- There can be levels in the extraction that will miss their last 4
characters from their name, which is sad, and may make it harder to
identify them.
- If there are '-vis'-less levels whose names are identical apart from
the last 4 characters, the extractor will only get the last one (it
probably extracts all but overwrites everything but the last one). For
example 'ctyasha' and 'ctykora'.

This issue affects the glb extraction and the entities json extraction.

I personally think that just keeping the -vis in the name would be the
best solution, but I guess there was a reason why it was decided that it
should be removed from the name. So to adapt to this, my implementation
will still remove '-vis' from the name, but only if it is actually in
the name - otherwise it won't remove anything.

I hope my changes didn't break anything. Extraction seemed to run fine
after my changes, and I was able to see both ctyasha and ctykora json
files. And didn't see any '-vis', so it is still properly removed.

---------

Co-authored-by: Tyler Wilding <xtvaser@gmail.com>
2024-02-24 14:13:48 -05:00
..
analysis decomp3: game-info, game-task, game-save, level-info, process-drawable and more (#3374) 2024-02-15 11:16:07 +00:00
config Only remove -vis from name if it is part of the name. (#3257) 2024-02-24 14:13:48 -05:00
data Switch to std::span (#3376) 2024-02-18 13:23:19 -05:00
Disasm i18n: Add jak 2 custom text to Crowdin (#3141) 2023-11-04 13:14:14 -04:00
extractor Update to C++20 (#3193) 2024-02-17 14:14:23 -05:00
Function Update to C++20 (#3193) 2024-02-17 14:14:23 -05:00
gui Make decompiler naming consistent (#94) 2020-10-24 14:27:50 -04:00
IR2 Update to C++20 (#3193) 2024-02-17 14:14:23 -05:00
level_extractor Only remove -vis from name if it is part of the name. (#3257) 2024-02-24 14:13:48 -05:00
ObjectFile decompiler: defskelgroup macro detection for jak 3, fix art group dumping for jak 3 and some more decomp work (#3370) 2024-02-11 09:32:06 -05:00
scripts add decompiler 2020-08-22 23:30:17 -04:00
types2 Decompile joint, collide-func, clean up joint decompression code for all games (#3369) 2024-02-11 09:50:07 -05:00
util Update to C++20 (#3193) 2024-02-17 14:14:23 -05:00
VuDisasm Update to C++20 (#3193) 2024-02-17 14:14:23 -05:00
CMakeLists.txt [glb export] Export bones. (#3087) 2023-10-14 16:49:23 -04:00
config.cpp Rip collision based on config flag (#3348) 2024-01-29 22:15:42 +01:00
config.h Rip collision based on config flag (#3348) 2024-01-29 22:15:42 +01:00
main.cpp Rip collision based on config flag (#3348) 2024-01-29 22:15:42 +01:00
README.md add decompiler 2020-08-22 23:30:17 -04:00

How to use

Compile (Linux):

mkdir build
cd build
cmake ..
make -j
cd ..

After compiling: First create a folder for the output and create a folder for the input. Add all of the CGO/DGO files into the input folder.

build/jak_disassembler config/jak1_ntsc_black_label.jsonc in_folder/ out_folder/

Notes

The config folder has settings for the disassembly. Currently Jak 2 and Jak 3 are not as well supported as Jak 1.

Procedure

ObjectFileDB

The ObjectFileDB tracks unique object files. The games have a lot of duplicated objected files, and object files with the same names but different contents, so ObjectFileDB is used to create a unique name for each unique object file. It generates a file named dgo.txt which maps its names to the original name and which DGO files it appears in. The ObjectFileDB extracts all object files from a DGO file, decompressing the DGO first if needed. (note: Jak 2 demo DGOs do not decompress properly). Each object file has a number of segments, which the game can load to separate places. Sometimes there is just a single "data" segment, and other times there are three segments:

  • top-level is executed at the end of the linking process, then discarded and goes in a special temporary heap
  • main is loaded and linked onto the specified heap
  • debug is loaded and linked onto the debug heap

This function interprets breaks the object file's data into segments, and processes the link data. The data is stored as a sequence of LinkedObjectWords, which contain extra data from the link. The LinkedObjectWords are stored by segment in a LinkedObjectFile, which also contains a list of Labels that allow LinkedObjectWords to refer to other LinkedObjectWords. Note that a Label can have a byte-offset into a word, which GOAL uses to load non-4-byte-aligned bytes and halfwords, and also to represent a pair object.

ObjectFileDB::find_code

This function looks through the LinkedObjectFiles and splits each segment into data and code zones.

The only files with code zones are from object files with three segments, and the code always comes first. The end of the code zone is found by looking for the last GOAL function object, then finding the end of this object by looking one word past the last jr ra instruction. This assumes that the last function in each segment doesn't have an extra inline assembly jr ra somewhere in the middle, but functions with multiple jr ra's are extremely rare (and not generated by the GOAL compiler without the use of inline assembly), so this seems like a safe assumption for now.

The code zones are scanned for GOAL function types, which are in front every GOAL function, and used to create Functions. Each Function is disassembled into EE Instructions, which also adds Labels for branch instructions, and can also contain linking data when appropriate. The final step is to look for instructions which use the fp register to reference static data, and insert the apprioriate Labels. GOAL uses the following fp relative addressing modes:

  • lw, lwc1, ld relative to the fp register to load static data.
  • daddiu to create a pointer to fp-relative data within +/- 2^15 bytes
  • Sequence of ori, daddu to generate a pointer that reaches within +2^16 bytes
  • Sequence of lui, ori, daddu to generate any 32-bit offset from fp.

The last two are only found in very large object files, and GOALDIS doesn't handle these.

The fp register is set with this sequence. The function prologue only sets fp if it is needed in the function.

;; goal function call, t9 contains the function address
jalr ra, t9
sll v0, ra, 0 

;; example goal function prologue:
daddiu sp, sp, -16 
sd ra, 0(sp)
sd fp, 8(sp)
or fp, t9, r0

Note: there are a few hacks to avoid generating labels when fp is used as a temporary register in inline assembly. Like ignoring stores/loads of fp from the stack (kernel does this to suspend resume a thread), or ignoring fp when used with the PEXTLW function, or totally skipping this step for a single object file in Jak 2 (effect-control).

ObjectFileDB::process_labels

This step simply renames labels with L1, L2, .... It should happen before any custom label naming as it will overwrite all label names.

ObjectFileDB::find_and_write_scripts

Looks for static linked lists and attempts to print them. Doesn't support printing everything, but can print nested lists, strings, numbers, and symbols.

ObjectFileDB::write_object_file_words

Dumps words in each segment like hexdump. There's an option to only run this on v3 object files, which contain data, as opposed to v2 which are typically large data.

ObjectFileDB::write_disassembly

Like write_object_file_words, but code is replaced with disassembly. There's a config option to avoid running this on object files with no functions, as these are usually large data files which are uninteresting to view as a binary dump and slow to dump.

Basic Block Finding

Look at branch intstructions and their destinations to find all basic blocks. Implemented in find_blocks_in_function as part of analyze_functions. This works for Jak 1, 2 and 3.

Analyze Functions Prologues and Epilogues

This will help us find stack variables and make sure that the prologue/epilogue are ignored by the statement generation.

A "full" prologue looks like this:

    daddiu sp, sp, -208
    sd ra, 0(sp)       
    sd fp, 8(sp)       
    or fp, t9, r0        ;; set fp to the address of this function
    sq s3, 128(sp)     
    sq s4, 144(sp)     
    sq s5, 160(sp)     
    sq gp, 176(sp)     
    swc1 f26, 192(sp)  
    swc1 f28, 196(sp)  
    swc1 f30, 200(sp)  

GOAL will leave out instructions that aren't needed. This prologue is "decoded" into:

Total stack usage: 0xd0 bytes
$fp set? : yes
$ra set? : yes
Stack variables : yes, 112 bytes at sp + 16
Saved gprs: gp s5 s4 s3
Saved fprs: f30 f28 f26

A similar process is done for the epilogue, and it is checked against the prologue.

The prologue is removed from the first basic block and the epilogue + alignment padding is removed from the last one.

Documentation of Planned Steps that are not implemented

Currently the focus is to get these working for Jak 1. But it shouldn't be much extra work to support Jak 2/3.

Guess Function Names (to be implemented)

When possible, we should guess function names. It's not always possible because GOAL supports anonymous lambda functions, like for example:

(lambda ((x int) (y int)) (+ x y))

which will generate a GOAL function object without a name.

But these are pretty uncommon, and the majority of GOAL functions are

  • Normal functions, which are stored into a symbol with the same name as the function
  • Methods, which are stored into the method table of their type with the method-set! function. Sadly we can't get the name of methods, but we can get their ID (to figure out the inheritance hierarchy) and what type they are defined for.
  • State handlers / behaviors (not yet fully understood)
  • Virtual state handlers / behaviors (not yet fully understood)

Currently the state/behavior stuff isn't well understood, or used in the early initialization of the game, so name guessing won't worry about this for now.

Guess Types (to be implemented)

The majority of GOAL types have a compiler-generated inpsect method which prints their fields. We should detect these methods in the previous function name guessing step, and then read through them to determine the data layout of the type.

Control Flow Analysis

The basic blocks should be built into a graph and annotated with control flow patterns, like if, cond, and, and various loops. To do this, register liveliness will be determined for each instruction.

Conversion to Statements

Instructions (or sequences of instructions that should not be separated) should be converted into Statements, which represent something like (add! r1 r2 r3). The registers should be mapped to variables, using as many variables as possible, as we don't know at this point if a register will be holding the same GOAL variable at different instructions.

Type propagation

Variables should get types determined by arguments of the function, which should then be propagated to other Statements in the function, and can then refine the argument types of other functions. This process should be repeated until things stop changing.

Variable declaration

Variables which are actually the same variable will be merged. The point at which variables are first defined/declared will be determined based on liveliness and then expanded to come up with a scope nesting that doesn't cross control flow boundaries.

Statement -> S-Expression map tree

Due to the the simple single pass GOAL compiler design, we build a tree which represents how Statements can be combined to eliminate variables. As an extremely simple example:

(set! r1 thing1)
(set! r2 thing2)
(add-int! r4 r2 r3)
(mult-int! r1 r4)

can be collapsed to

(* thing1 (+ thing2 r3))

But

(set! r2 thing2)
(add-int! r4 r2 r3)
(set! r1 thing1)
(mult-int! r1 r4)

can be collapsed to

(let ((temp0 (+ thing2 r3)))
  (+ thing1 temp0)
)
  

and this difference will actually reflect the difference in how the code was originally written! This is a huge advantage over existing decompilers, which will be unable to tell the subtle difference between the two.

Macro pattern matching

Lots of GOAL language features are implemented with macros, so once the s-expression nesting is recovered, we can pattern match to undo macros very precisely.