To explain how to add new Python language features to uncompyle6, I'll walk through an example: adding Python 3.6 Formatted String Literals (PEP 498). I don't cover the full spec, just a simplified version of it.
In July 2016, Python 3.6 had not yet been fully released and was still using bytecode instead of wordcode. At that time, if you wrote:
def fn(var1, var2):
    return f'{var1}py36_string_interpolation{var2}'
this got translated to:
2 0 LOAD_CONST ''
3 LOAD_ATTR 'join'
6 LOAD_FAST 'var1'
9 FORMAT_VALUE 0 ''
12 LOAD_CONST 'py36_string_interpolation'
15 LOAD_FAST 'var2'
18 FORMAT_VALUE 0 ''
21 BUILD_LIST 3 ''
24 CALL_FUNCTION 1 '1 positional, 0 keyword pair'
27 RETURN_VALUE
A literal decompilation of the opcodes would be something like:
''.join([fv('var1'), 'py36_string_interpolation', fv('var2')])
Try running uncompyle6 --tree on my rough translation above to get a feel for what the grammar looks like. It's long, so I will not copy all of it here.
But it is a little different from the opcodes above. Instead of:
call (3)
0. expr
6 LOAD_NAME 1 'fv'
1. expr
10 LOAD_CONST 1 'var1'
2. 12 CALL_FUNCTION 1
the call is replaced by:
LOAD_FAST 'var1'
FORMAT_VALUE 0
In other words, a FORMAT_VALUE opcode was added as a special case of a particular kind of function call.
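As a rough check of this reading, the sketch below compares the f-string with the join-based form. Here fv is a hypothetical stand-in, not a real builtin; I am assuming it behaves like format() with an empty format spec, which is roughly what FORMAT_VALUE with attribute 0 amounts to.

```python
def fv(value):
    # Hypothetical stand-in for FORMAT_VALUE with attribute 0:
    # no conversion (!r, !s) and an empty format spec.
    return format(value, "")


def fn_fstring(var1, var2):
    return f"{var1}py36_string_interpolation{var2}"


def fn_join(var1, var2):
    # The "literal decompilation" form of the early 3.6 bytecode.
    return "".join([fv(var1), "py36_string_interpolation", fv(var2)])


result = fn_join("a", 1.5)
print(result)
```

Both functions should give the same text for the same arguments, which is why a decompiler can treat the join-based bytecode as a formatted string literal.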
After the 3.6 release, when bytecode was replaced by wordcode, the generated code changed to:
0 LOAD_FAST 'var1'
2 FORMAT_VALUE 0 ''
4 LOAD_CONST 'py36_string_interpolation'
6 LOAD_FAST 'var2'
8 FORMAT_VALUE 0 ''
10 BUILD_STRING 3 ''
12 RETURN_VALUE
The BUILD_STRING opcode seems to have been added; it replaces the function call to a string join.
Note: because bytecode generation can change while a new feature is being added, as happened here, we don't support such intermediate, "dev", or "release-candidate" versions of Python.
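Since the instruction sequence depends on the interpreter version, it can help to look at what your own Python emits. The following is a minimal sketch using the standard dis module; it assumes Python 3.6 or later, and the name of the formatting opcode may differ from FORMAT_VALUE in later releases.

```python
import dis


def fn(var1, var2):
    return f"{var1}py36_string_interpolation{var2}"


# Collect (offset, opname, argrepr) for each instruction of fn.
instructions = [
    (ins.offset, ins.opname, ins.argrepr) for ins in dis.get_instructions(fn)
]
opnames = [opname for _, opname, _ in instructions]

for offset, opname, argrepr in instructions:
    print(f"{offset:4} {opname:<20} {argrepr}")
```

On a released 3.6 or later interpreter, the listing should contain a BUILD_STRING instruction rather than a call to join.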
In Python's AST, this is called a FormattedValue. We'll use formatted_value to be consistent with the existing grammar conventions. With this, we have:
formatted_value ::= LOAD_FAST FORMAT_VALUE
Also in the list that makes up a "Formatted String Literal" are constant strings, which in the AST are called Str. You will see one above in the second LOAD_CONST instruction. What we should do is add a transformation inside the ingester step to change LOAD_CONST to LOAD_STRING whenever the value loaded is a string. However, that's too much work for now, so we will use the more general expr instead of having a str.
Since this is currently for Python 3.6 only, we add those two grammar rules to class Python36Parser, in the docstring of a method whose name starts with p_.
In the AST, you'll see that the formatted values and/or strings are combined together in a list called JoinedStr, so let's make grammar rules for that:
expr ::= formatted_value
joined_str ::= expr+ BUILD_STRING
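The AST node names these rules are modeled on can be checked with the standard ast module. This is a small sketch, assuming Python 3.6 or later; note that from Python 3.8 on, the constant string part shows up as Constant rather than Str.

```python
import ast

tree = ast.parse("f'{var1}py36_string_interpolation{var2}'", mode="eval")
joined = tree.body  # the JoinedStr node for the whole f-string
node_names = [type(value).__name__ for value in joined.values]

print(type(joined).__name__, node_names)
```

The three entries in joined.values correspond to the three items that BUILD_STRING 3 combines.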
And now I get to the first set of technical issues to discuss.
First, the simple SPARK parser doesn't have nice operators like | or grouping of grammar symbols. It does have a + and a *, but these can be applied only where there is a single symbol on the right-hand side of the rule.
So instead we need to write this as the more cumbersome:
exprs ::= expr+
joined_str ::= exprs BUILD_STRING
I originally had something like this, and it often worked, until I had a tuple containing a formatted value as one of its items:
('a', 'b', f'{foo}')
The grammar doesn't separate the tuple entry boundaries from the joined_str boundary.
There is another, more subtle problem with using expr+, which is that it can lead to exponential parsing time. We need to ensure that we keep grammar parsing efficient. See https://github.com/rocky/python-uncompyle6/wiki/Deparsing-Paper.pdf for details.
So instead, in uncompyle6's "ingest" method, when we see the BUILD_STRING instruction, we note that it builds from a list of size 3 and change the opcode name to BUILD_STRING_3. Then, in parsing, a custom rule is added. In this situation it would be:
joined_str ::= expr expr expr BUILD_STRING_3
and to hook this into the rest of the grammar:
expr ::= joined_str
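The renaming and custom-rule steps can be sketched as follows. This is an illustration only: Token and both function names are hypothetical simplifications, not uncompyle6's actual classes or methods.

```python
from collections import namedtuple

# Hypothetical, simplified token; uncompyle6's real token class has more fields.
Token = namedtuple("Token", ["kind", "attr"])


def specialize_build_string(tokens):
    """Rename each BUILD_STRING token to BUILD_STRING_<n>, n being its operand."""
    result = []
    for tok in tokens:
        if tok.kind == "BUILD_STRING":
            tok = Token("BUILD_STRING_%d" % tok.attr, tok.attr)
        result.append(tok)
    return result


def custom_joined_str_rules(tokens):
    """Build one grammar rule per distinct BUILD_STRING_<n> seen in tokens."""
    rules = set()
    for tok in tokens:
        if tok.kind.startswith("BUILD_STRING_"):
            rhs = ["expr"] * tok.attr + [tok.kind]
            rules.add("joined_str ::= " + " ".join(rhs))
    return sorted(rules)


tokens = specialize_build_string([
    Token("LOAD_FAST", "var1"),
    Token("FORMAT_VALUE", 0),
    Token("LOAD_CONST", "py36_string_interpolation"),
    Token("LOAD_FAST", "var2"),
    Token("FORMAT_VALUE", 0),
    Token("BUILD_STRING", 3),
    Token("RETURN_VALUE", None),
])
rules = custom_joined_str_rules(tokens)
print(rules)
```

Because each rule has a fixed number of expr symbols, the parser knows exactly how many items belong to the joined string, which avoids both the boundary ambiguity and the cost of expr+.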
Finally come the semantic rules, which take the AST and produce the right text.
The text for nonterminal formatted_value has braces around it, so we add an entry for it to TABLE_DIRECT in pysource.py:
'formatted_value': ( '{%c}', 0),
But this really works only if the FORMAT_VALUE has attribute value 0 (no format specifier like !r or !s).
An interpolation rule to first approximation might be:
'joined_str': ( "f'%C'", (0, -1, '') ),
However, the above table rules are not quite right. Format strings can have braces and quotes in them, and those need to be escaped. So instead we need special procedures called n_formatted_value() and n_joined_str() for this. Consult the code for the full details.
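To give a feel for the escaping involved, here is a minimal sketch. It is not the code in uncompyle6: render_joined_str and its (kind, text) input format are hypothetical, and it assumes the expression text itself contains no quotes or backslashes.

```python
def render_joined_str(parts):
    """Render parts as f-string source text.

    parts is a list of ("str", literal_text) or ("fv", expression_source) pairs.
    """
    body = []
    for kind, text in parts:
        if kind == "fv":
            body.append("{" + text + "}")
        else:
            # Literal braces must be doubled inside an f-string.
            body.append(text.replace("{", "{{").replace("}", "}}"))
    # repr() picks a quote style and escapes quotes in the literal text.
    return "f" + repr("".join(body))


source = render_joined_str([("fv", "var1"), ("str", "a{b}'c"), ("fv", "var2")])
print(source)
```

Evaluating the generated source with var1 and var2 bound should give back the original pieces with the literal braces and quote intact.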