29 Deparsing technology and its use in exact location reporting
R. Bernstein edited this page 2020-02-13 23:37:56 -05:00

One of my pet peeves has been the vagueness of reporting a line number as a position in a program. The Go language Pos type type that is used in Go's AST is more informed in that it can encode the line number, file name, column number and byte offset all very compactly as a pointer. Even better, though, would be extend this to an interval or a pointer to a tuple of Pos'es. What makes this work or makes it sparse is that the number of "interesting locations" in a program, whether a single point or a pair of points, is sparse compared to the number of bytes of a program.

I used this idea in the debugger I did for Go. It would be cool if Go used its Pos idea more pervasively.

But now here's another totally different idea. In any language you can always get something like a program counter. In CPython it is a virtual-machine instruction offset. The packages uncompyle or uncompyle2, go back over a decade. They take bytecode and turn it back into Python. The code itself is written in a pretty cool way as it is all pattern-match driven, using a grammar-parsing tool that uses the Earley parsing algorithm.

Suppose my CPython byte code is:

LOAD_CONST 0
STORE_NAME a

A scanner turns this into a token sequence "LOAD_CONST" and "STORE_NAME" with attributes 0, and "a" respectively attached to the two tokens. There is grammar rule, "assign", that matches LOAD_CONST STORE_NAME (via grammar rules for "expr" and "designator").

From those tokens (or rather, CPython instructions), we can derive non-terminal grammar symbol "assign" using a grammar rule like

assign_smt ::= expr designator

And then we can associate that with Python code: a = 0 using the abstract syntax tree nodes (and their attributes) rooted at that "assign" nonterminal. Of course, close to the grammar "start" symbol there's a grammar for Python that has the expected Python grammar derivations:

stmts ::= stmt+
stmt ::= assign_stmt |  ...

But one thing to notice is that if the code generation in later Python releases changes, the top-level grammar generally doesn't. So we don't need to change those grammar rules above describing the kinds of Python statements.

And vice versa. If new grammar is added to Python, as it probably uses the same kinds of code generation at lower levels for assignment statements so we don't have to change those patterns either.

I'm waving my hands and simplifying a lot. See here and here for the more of the gory details of bytecode deparsing of the above assignment statement. I omitted, for example, operator precedence, and template-driven semantic actions.

But one eerie thing about this code is that people, myself included, are amazed at how exactly the program can be reconstructed. That is probably because Python code generation isn't very sophisticated and doesn't do a lot of optimization: it always handles a single Python construct in a single way.

But now back to my use. The existing uncompyle2 code just reads bytecodes and produces a string. For my purpose, for an offset I need a fragment of Python code around that. So I need to the keep the intermediate results, or at least from the instruction back up to the root of the parsed tree.

In sum if my code is:

def fib(x):
    if x <= 1: return 1
    return fib(x-1) + fib(x-2)

I can give results like:

opcode: CALL_FUNCTION_1
return fib(x - 1) + fib(x - 2)
                    ----------
Contained in...
return fib(x - 1) + fib(x - 2)
       -----------------------
parsed type: binary_expr

The reason I give the "opcode" or "parsed type" is that I realized that

 fib(x - 2)

can be ambiguous. Am I in the middle of the call i, the case above, or has that finished and am about to use the result, here as an operand of the binary_expr? The opcode/parsed type makes it clear which of these two situations applies.

I should say this has been an incredible amount of work to pull off, and bugs pop up every now and then. But rarely occurring bugs are less impacting in this use where one that only cares about fragments of code.