Restructure auto-sync docs to have them more contained (#2355)

* Restructure auto-sync docs to have them more contained in suite/auto-sync

* Enhance Differ documentation

* Fix link and emphasize importance of ARCHITECTURE.md

* Add auto-sync intro.md document, based on @moste00 work

* Be consistent with Auto-Sync naming and use python3
Rot127 2024-06-10 01:55:47 +00:00 committed by GitHub
parent 60d5b7ec2f
commit 03c41e1be4
3 changed files with 219 additions and 122 deletions


@ -1,12 +1,9 @@
# Auto-Sync <!--
Copyright © 2022 Rot127 <unisono@quyllur.org>
SPDX-License-Identifier: BSD-3
-->
`auto-sync` is the architecture update tool for Capstone. # Architecture of the Auto-Sync framework
Because the architecture modules of Capstone use mostly code from LLVM,
we need to update this part with every LLVM release. `auto-sync` helps
with this synchronization between LLVM and Capstone's modules by
automating most of it.
You can find it in `suite/auto-sync`.
This document is split into four parts.
@ -15,8 +12,8 @@ This document is split into four parts.
3. Instructions how to refactor an architecture to use `auto-sync`.
4. Notes about how to add a new architecture to Capstone with `auto-sync`.
Please read the section about architecture module design in Please read the section about capstone module design in
[ARCHITECTURE.md](ARCHITECTURE.md) before proceeding. [ARCHITECTURE.md](https://github.com/capstone-engine/capstone/blob/next/docs/ARCHITECTURE.md) before proceeding.
The architectural understanding is important for the following.
## Update procedure
@ -98,102 +95,30 @@ _Note_: For details about this checkout `suite/auto-sync/CppTranslator/README.md
Because the result of the `CppTranslator` is not perfect,
we still have many syntax problems left.
Those need to be fixed by hand. Those need to be fixed partially by hand.
**Differ**
In order to ease this process we run the `Differ` after the `CppTranslator`.
The `Differ` parses each _translated_ file and the corresponding source file _currently_ used in Capstone. The `Differ` compares our two versions of C files we have now.
It then compares specific nodes from the just translated file to the equivalent nodes in the old file. One of them are the C files currently used by the architecture module.
On the other hand we have the translated C files. Those are still faulty and need to be fixed.
Most fixes are syntactical problems. Those were almost always resolved before, during the last update.
The `Differ` helps you to compare the files and let you select which version to accept.
Sometimes (not very often though), the newly translated C files contain important changes.
Most often though, the old files are already correct.
The `Differ` parses both files into an abstract syntax tree and compares certain nodes with the same name
(mostly functions).
The user can choose if she accepts the version from the translated file or the old file.
This decision is saved for every node.
If there exists a saved decision for a node, the previous decision automatically applied again. If there exists a saved decision for two nodes, and the nodes did not change since the last time,
it applies the previous decision automatically again.
Every other syntax error must be solved manually. The `Differ` is far from perfect. It only helps to automatically apply "known to be good" fixes
and gives the user a better interface to solve the other problems.
## Update an architecture But there will still be syntax errors left afterward. These must be fixed by hand.
To update an architecture do the following:
Rebase `llvm-capstone` onto the new LLVM release (if not already done).
```
# 1. Clone Capstone's LLVM
git clone https://github.com/capstone-engine/llvm-capstone
cd llvm-capstone
git checkout auto-sync
# 2. Rebase onto the new LLVM release and resolve the conflicts.
# 3. Build tblgen
mkdir build
cd build
cmake -G Ninja -DLLVM_TARGETS_TO_BUILD=<ARCH> -DCMAKE_BUILD_TYPE=Debug ../llvm
cmake --build . --target llvm-tblgen --config Debug
# 4. Run the updater
cd ../../suite/auto-sync/
./Updater/ASUpdater.py -a <ARCH>
```
The update script will execute the steps described above and copy the new files to their directories.
Afterward try to build Capstone and fix any build errors left.
If new instructions or operands were added, add test cases for those
(regression tests for instructions are located in `suite/MC/`).
TODO: Operand and detail tests
<!--
TODO: Wait until `cstest` is rewritten and add description about operand testing.
Issue: https://github.com/capstone-engine/capstone/issues/1984
-->
## Refactor an architecture for `auto-sync`
To refactor an architecture to use `auto-sync`, you need to add it to the configuration.
1. Add the architecture to the supported architectures list in `ASUpdater.py`.
2. Configure the `CppTranslator` for your architecture (`suite/auto-sync/CppTranslator/arch_config.json`)
Now, manually run the update commands within `ASUpdater.py` but *skip* the `Differ` step:
```
./Updater/ASUpdater.py -a <ARCH> -s IncGen Translate
```
The task after this is to:
- Replace leftover C++ syntax with its C equivalent.
- Implement the `add_cs_detail()` handler in `<ARCH>Mapping` for each operand type.
- Add any missing logic to the translated files.
- Make it build and write tests.
- Run the Differ again and always select the old nodes.
**Notes:**
- If you find yourself fixing the same syntax error multiple times,
please consider adding a `Patch` to the `CppTranslator` for this case.
- Please check out the implementation of ARM's `add_cs_detail()` before implementing your own.
- Running the `Differ` after everything is done preserves your syntax corrections, so the next user can auto-apply them.
- Sometimes the LLVM code uses a single function from a larger source file.
It is not worth it to translate the whole file just for this function.
Bundle those lonely functions in `<ARCH>DisassemblerExtension.c`.
- Some generated enums must be included in the `include/capstone/<ARCH>.h` header.
At the position where the enum should be inserted, add a comment like this (don't remove the `<>` brackets):
```
// generate content <FILENAME.inc> begin
// generate content <FILENAME.inc> end
```
The update script will insert the content of the `.inc` file at this place.
## Adding a new architecture
Adding a new architecture follows the same steps as above, except that you need
to implement all the Capstone files from scratch.
Check out an architecture that already supports `auto-sync` for guidance and open an issue if you need help.


@ -1,15 +1,19 @@
<!--
Copyright © 2022 Rot127 <unisono@quyllur.org>
Copyright © 2024 2022 Rot127 <unisono@quyllur.org>
SPDX-License-Identifier: BSD-3
-->
# Architecture updater # Architecture updater - Auto-Sync
This is Capstones updater for some architectures. `auto-sync` is the architecture update tool for Capstone.
Unfortunately not all architectures are supported yet. Because the architecture modules of Capstone use mostly code from LLVM,
we need to update this part with every LLVM release. `auto-sync` helps
with this synchronization between LLVM and Capstone's modules by
automating most of it.
## Install dependencies Please refer to [intro.md](intro.md) for an introduction about this tool.
## Install
Setup Python environment and Tree-sitter
@ -20,11 +24,25 @@ sudo apt install python3-venv
# Setup virtual environment in Capstone root dir
python3 -m venv ./.venv
source ./.venv/bin/activate
```
Install Auto-Sync framework
```
cd suite/auto-sync/
pip install -e .
``` ```
## Update ## Architecture
Please read [ARCHITECTURE.md](https://github.com/capstone-engine/capstone/blob/next/docs/ARCHITECTURE.md) to understand how Auto-Sync works.
This step is essential! Please don't skip it.
## Update an architecture
Updating an architecture module to the newest LLVM release is only possible if it uses Auto-Sync.
Not all arch-modules support Auto-Sync yet.
Check if your architecture is supported.
@ -52,6 +70,14 @@ Run the updater
./src/autosync/ASUpdater.py -a <ARCH>
``` ```
## Update procedure
1. Run the `ASUpdater.py` script.
2. Compare the functions in `<ARCH>DisassemblerExtension.*` to LLVM (search the function names in the LLVM root)
and update them if necessary.
3. Try to build Capstone and fix the build errors.
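Step 2 of the procedure above asks you to look up function names in the LLVM tree. A minimal sketch of such a search follows. It is not part of Auto-Sync: the helper name `find_definitions` and the default file suffixes are assumptions made for this example, and the match is purely textual, so call sites are reported as well as definitions.

```python
import re
from pathlib import Path


def find_definitions(llvm_root, function_names, suffixes=(".cpp", ".h")):
    """Return {function name: [(path, line number), ...]} for textual hits."""
    # One pattern per name: the name as a whole word, followed by "(".
    patterns = {n: re.compile(rf"\b{re.escape(n)}\s*\(") for n in function_names}
    hits = {n: [] for n in function_names}
    for path in Path(llvm_root).rglob("*"):
        if path.suffix not in suffixes or not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for name, pattern in patterns.items():
                if pattern.search(line):
                    hits[name].append((str(path), lineno))
    return hits
```

Names that return an empty list were probably renamed or removed upstream, which is a hint that the corresponding function in `<ARCH>DisassemblerExtension.*` needs attention.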
## Post-processing steps
This update translates some LLVM C++ files to C.
@ -60,7 +86,7 @@ you will get build errors if you try to compile Capstone.
The last step to finish the update is to fix those build errors by hand.
## Developer ## Additional details
### Overview updated files
@ -96,14 +122,7 @@ Those files are written by us:
- `<ARCH>Mapping.*`: Binding code between the architecture module and the LLVM files. This is also where the detail is set.
- `<ARCH>Module.*`: Interface to the Capstone core.
### Update procedure ### Relevant documentation and troubleshooting
1. Run the `ASUpdater.py` script.
2. Compare the functions in `<ARCH>DisassemblerExtension.*` to LLVM (search the function names in the LLVM root)
and update them if necessary.
3. Try to build Capstone and fix the build errors.
### Update details
**LLVM file translation**
@ -129,9 +148,66 @@ Documentation about the `.inc` file generation is in the [llvm-capstone](https:/
**Formatting**
- If you make changes to the `CppTranslator` please format the files with `black` - If you make changes to the `CppTranslator` please format the files with `black` and `usort`
``` ```
source ./.venv/bin/activate pip3 install black usort
pip3 install black python3 -m usort format src/autosync
python3 -m black --line-length=120 CppTranslator/*/*.py python3 -m black src/autosync
``` ```
## Refactor an architecture for Auto-Sync framework
Not all architecture modules support Auto-Sync yet.
Here is an overview of the steps to add support for it.
<hr>
To refactor one of them to use `auto-sync`, you need to add it to the configuration.
1. Add the architecture to the supported architectures list in `ASUpdater.py`.
2. Configure the `CppTranslator` for your architecture (`suite/auto-sync/CppTranslator/arch_config.json`)
Now, manually run the update commands within `ASUpdater.py` but *skip* the `Differ` step:
```
./Updater/ASUpdater.py -a <ARCH> -s IncGen Translate
```
The task after this is to:
- Replace leftover C++ syntax with its C equivalent.
- Implement the `add_cs_detail()` handler in `<ARCH>Mapping` for each operand type.
- Edit the main header file of the architecture (`include/capstone/<ARCH>.h`) to include the generated enums (see below)
- Add any missing logic to the translated files.
- Make it build and write tests.
- Run the Differ again and always select the old nodes.
**Notes:**
- Some generated enums must be included in the `include/capstone/<ARCH>.h` header.
At the position where the enum should be inserted, add a comment like this (don't remove the `<>` brackets):
```
// generate content <FILENAME.inc> begin
// generate content <FILENAME.inc> end
```
The update script will insert the content of the `.inc` file at this place.
- If you find yourself fixing the same syntax error multiple times,
please consider adding a `Patch` to the `CppTranslator` for this case.
- Please check out the implementation of ARM's `add_cs_detail()` before implementing your own.
- Running the `Differ` after everything is done preserves your syntax corrections, so the next user can auto-apply them.
- Sometimes the LLVM code uses a single function from a larger source file.
It is not worth it to translate the whole file just for this function.
Bundle those lonely functions in `<ARCH>DisassemblerExtension.c`.
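To make the `generate content` markers from the notes above more concrete, here is a minimal sketch of how content could be placed between such a marker pair. It is an illustration under assumptions, not the update script's actual code: the function name `fill_markers` is hypothetical, and the real script may treat indentation or missing files differently.

```python
import re

# Matches a begin/end marker pair and whatever currently sits between them.
MARKER = re.compile(
    r"(// generate content <(?P<inc>[^>]+)> begin\n)"
    r".*?"
    r"(// generate content <(?P=inc)> end)",
    re.DOTALL,
)


def fill_markers(header: str, inc_files: dict[str, str]) -> str:
    """Replace the text between each marker pair with the .inc content."""

    def repl(match: re.Match) -> str:
        name = match.group("inc")
        if name not in inc_files:
            return match.group(0)  # unknown file: leave the region untouched
        body = inc_files[name].rstrip("\n") + "\n"
        return match.group(1) + body + match.group(3)

    return MARKER.sub(repl, header)
```

Because the markers themselves are kept, running such a step again replaces the previously inserted content instead of duplicating it. That is also why the `<>` brackets and the marker comments must stay in the header.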
## Adding a new architecture
Adding a new architecture follows the same steps as above, except that you need
to implement all the Capstone files from scratch.
Check out an architecture that already supports `auto-sync` for guidance and open an issue if you need help.

suite/auto-sync/intro.md Normal file

@ -0,0 +1,96 @@
## Why the Auto-Sync framework?
Capstone provides a simple API to leverage the LLVM disassemblers, without
having the big footprint of LLVM itself.
It does this by using a stripped down copy of LLVM disassemblers (one for each architecture)
and provides a uniform API to them.
The actual disassembly task (bytes to asm-text and decoded operands) is completely done by
the LLVM code.
Capstone takes the disassembled instructions, adds details to them (operand read/write info etc.)
and organizes them into a uniform structure (`cs_insn`, `cs_detail` etc.).
These objects are then accessible from the API.
Capstone is in C and LLVM is in C++. So to use the disassembler modules of LLVM,
Capstone effectively translates LLVM source files from C++ to C, without changing the semantics.
One could also call it a "disassembler port".
Capstone supports multiple architectures. So whenever LLVM
has a new release and adds more instructions, Capstone needs to update its modules as well.
In the past, the update procedure was done by hand and with some Python scripts.
But the task was tedious and error-prone.
Auto-Sync was built to ease this complicated update procedure.
<hr>
## How LLVM disassemblers work
To use the LLVM disassembler logic effectively, one must understand how the disassemblers operate.
Each architecture is defined in a so-called `.td` file, that is, a "Target Description" file.
Those files are a declarative description of an architecture.
They are written in a Domain-Specific Language called [TableGen](https://llvm.org/docs/TableGen/).
They contain instructions, registers, processor features, which operands an instruction reads and writes, and more information.
These files are consumed by "TableGen Backends". They parse and process them to generate C++ code.
The generated code is for example: enums, decoding algorithms (for instructions and operands) or
lookup tables for register names or aliases.
Additionally, LLVM has handwritten files. They use the generated code to build the actual instruction classes
and handle architecture specific edge cases.
Capstone uses both of those files. The generated ones as well as the handwritten ones.
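The idea of generating code from a declarative description can be shown with a toy example. The sketch below is not TableGen and not how LLVM's backends are written (those are C++ programs that read `.td` files). The register list, the function names and the emitted identifiers are all made up for illustration.

```python
# Toy illustration only: the record layout and all names are invented.
REGISTERS = [
    {"name": "R0", "asm": "r0"},
    {"name": "R1", "asm": "r1"},
    {"name": "SP", "asm": "sp"},
]


def emit_reg_enum(arch: str, regs: list[dict]) -> str:
    """Emit a C enum with one entry per declared register."""
    lines = [f"enum {arch}_reg {{", f"\t{arch}_REG_INVALID = 0,"]
    lines += [f"\t{arch}_REG_{r['name']}," for r in regs]
    lines.append("};")
    return "\n".join(lines)


def emit_reg_names(arch: str, regs: list[dict]) -> str:
    """Emit a C lookup table that maps enum values to assembly names."""
    lines = [f"static const char *const {arch}_reg_names[] = {{", "\tNULL,"]
    lines += [f'\t"{r["asm"]}",' for r in regs]
    lines.append("};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(emit_reg_enum("ARCH", REGISTERS))
    print(emit_reg_names("ARCH", REGISTERS))
```

Both outputs are derived from the same description, so the enum and the name table cannot drift apart. This is the property that makes the generated files safe to regenerate on every LLVM release.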
## Overview of updating steps
An Auto-Sync update has multiple steps:
**(1)** Changes in the auto-generated C++ files are handled completely automatically.
We have an LLVM fork with patched TableGen backends, so they emit C code.
**(2)** Changes in LLVM's handwritten sources are handled semi-automatically.
For each source file, we search for C++ syntax and replace it with the equivalent C syntax.
For this task we have the CppTranslator.
The end result is of course not perfectly valid C code.
It is merely an intermediate file, which still has some C++ syntax in it.
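A heavily simplified sketch of this search-and-replace idea follows. It uses plain regular expressions to stay short, whereas the actual CppTranslator is configured per architecture and works with Tree-sitter and `Patch` classes. The three rules and the C-side names they produce are assumptions chosen for illustration.

```python
import re

# Hypothetical rewrite rules, for illustration only.
RULES = [
    # static_cast<T>(x)  ->  (T)(x)
    (re.compile(r"static_cast<([^<>]+)>\(([^()]*)\)"), r"(\1)(\2)"),
    # nullptr  ->  NULL
    (re.compile(r"\bnullptr\b"), "NULL"),
    # Inst.getOpcode()  ->  MCInst_getOpcode(Inst)
    (re.compile(r"\b(\w+)\.getOpcode\(\)"), r"MCInst_getOpcode(\1)"),
]


def translate(cpp_source: str) -> str:
    """Apply every rewrite rule in order and return the result."""
    out = cpp_source
    for pattern, replacement in RULES:
        out = pattern.sub(replacement, out)
    return out


SAMPLE = """\
unsigned Opc = Inst.getOpcode();
int Imm = static_cast<int>(Val);
if (Ptr == nullptr) return;
template <typename T> void f();
"""

if __name__ == "__main__":
    print(translate(SAMPLE))
```

The first three lines of the sample are rewritten, but the `template` line passes through unchanged because no rule covers it. This is the kind of leftover C++ syntax that makes the result an intermediate file rather than valid C.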
Because this leftover syntax was likely already fixed in the equivalent C file currently in Capstone,
we have a last step.
The translated file is diffed with the corresponding old file in Capstone.
The `Differ` tool parses both files into an abstract syntax tree.
From this AST it picks nodes with the same name and diffs them.
The diff is given to the user, and they can decide which one to accept.
All choices are also recorded and automatically applied next time.
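The record-and-reapply behaviour can be sketched as follows. This is a simplified model, not the `Differ` implementation: nodes are plain strings keyed by name instead of syntax-tree nodes, and the names `pair_fingerprint`, `merge_nodes` and the layout of `decisions` are assumptions for this example.

```python
import hashlib


def pair_fingerprint(old_text: str, new_text: str) -> str:
    """Fingerprint a pair of nodes; it changes when either side changes."""
    return hashlib.sha256((old_text + "\0" + new_text).encode()).hexdigest()


def merge_nodes(old_nodes, new_nodes, decisions, ask):
    """Merge same-named nodes, reusing saved decisions where possible.

    old_nodes / new_nodes map a node name (e.g. a function name) to its text.
    decisions maps a node name to {"id": fingerprint, "choice": "old" | "new"}.
    ask(name, old_text, new_text) must return "old" or "new".
    """
    result = {}
    for name, new_text in new_nodes.items():
        old_text = old_nodes.get(name)
        if old_text is None or old_text == new_text:
            result[name] = new_text  # nothing to decide
            continue
        fingerprint = pair_fingerprint(old_text, new_text)
        saved = decisions.get(name)
        if saved and saved["id"] == fingerprint:
            choice = saved["choice"]  # both nodes unchanged since last time
        else:
            choice = ask(name, old_text, new_text)
            decisions[name] = {"id": fingerprint, "choice": choice}
        result[name] = old_text if choice == "old" else new_text
    return result
```

In this model the user is only asked again when one side of a pair has changed, because the stored fingerprint no longer matches. Since `decisions` is a plain dictionary of strings, it could be kept between runs, for example as a JSON file.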
**Example**
> Suppose there is a file `ArchDisassembler.cpp` in LLVM.
> Capstone has the C equivalent `ArchDisassembler.c`.
>
> Now LLVM has a new release, and there were several additions in `ArchDisassembler.cpp`.
>
> Auto-Sync will pass `ArchDisassembler.cpp` to the CppTranslator, which replaces most C++ syntax.
> The result is an intermediate file `transl_ArchDisassembler.cpp`.
>
> The result is close to what we want (C code), but still contains invalid syntax.
> Most of these syntax errors were fixed before. They must have been, because the C file `ArchDisassembler.c`
> is working fine.
>
> So the intermediate file `transl_ArchDisassembler.cpp` is compared to the old `ArchDisassembler.c`.
> The Differ parses both files into an AST and automatically patches all nodes it can.
>
> This effectively automates most of the boring, mechanical work involved in fixing up `transl_ArchDisassembler.cpp`.
> If something new comes up, it asks the user for a decision.
>
> The result is saved to `ArchDisassembler.c`, which is now up-to-date with the newest LLVM release.
>
> In practice this file will still contain syntax errors. But not many, so they can easily be resolved.
**(3)** After (1) and (2), some changes in Capstone-only files follow.
This step is manual work.