Adding a new language construct ‐ Python 3.6 Formatted String Literals
chu23465 edited this page 2024-02-18 03:07:50 +05:30

To explain how to add new Python language features to uncompyle6, I'll walk through an example: adding Python 3.6 Formatted String Literals (PEP 498). I don't cover the full spec, just a simplified version of it.

In July 2016, before Python 3.6 was fully released, it was still using bytecode instead of wordcode. If you wrote:

def fn(var1, var2):
    return f'{var1}py36_string_interpolation{var2}'

this got translated to:

     2           0  LOAD_CONST                ''
                 3  LOAD_ATTR                 'join'
                 6  LOAD_FAST                 'var1'
                 9  FORMAT_VALUE           0  ''
                12  LOAD_CONST                'py36_string_interpolation'
                15  LOAD_FAST                 'var2'
                18  FORMAT_VALUE           0  ''
                21  BUILD_LIST             3  ''
                24  CALL_FUNCTION          1  '1 positional, 0 keyword pair'
                27  RETURN_VALUE

A literal decompilation of the opcodes would be something like:

''.join([fv('var1'), 'py36_string_interpolation', fv('var2')])
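To make the rough translation concrete, here is a minimal runnable sketch. The helper `fv` is hypothetical: it stands in for what FORMAT_VALUE with operand 0 does, and is assumed here to be equivalent to calling `format()` with no format spec. Unlike the rough translation, it is passed the variables themselves rather than their names.

```python
def fv(value):
    """Hypothetical stand-in for FORMAT_VALUE with operand 0."""
    return format(value)


def fn_fstring(var1, var2):
    # The formatted string literal from the example.
    return f'{var1}py36_string_interpolation{var2}'


def fn_join(var1, var2):
    # The "literal decompilation": join the formatted pieces.
    return ''.join([fv(var1), 'py36_string_interpolation', fv(var2)])


print(fn_fstring(1, 'x') == fn_join(1, 'x'))
```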

Try running uncompyle6 --tree on my rough translation above to get a feel for what the grammar looks like. It's long so I will not copy all of it here.

But it is a little different from the opcodes above. Instead of:

  call (3)
       0. expr
            6  LOAD_NAME      1  'fv'
       1. expr
           10  LOAD_CONST     1  'var1'
       2.  12  CALL_FUNCTION  1

The call is replaced by:

    LOAD_FAST      'var1'
    FORMAT_VALUE    0

In other words, a FORMAT_VALUE opcode was added as a special case of a particular kind of function call.

After the 3.6 release, when bytecode was changed to wordcode, the code generation changed to:

                 0  LOAD_FAST                 'var1'
                 2  FORMAT_VALUE           0  ''
                 4  LOAD_CONST                'py36_string_interpolation'
                 6  LOAD_FAST                 'var2'
                 8  FORMAT_VALUE           0  ''
                10  BUILD_STRING           3  ''
                12  RETURN_VALUE

The BUILD_STRING opcode seems to have been added to replace the function call to a string join.
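If you want to see what your own interpreter generates, the standard `dis` module will list the instructions. This is only a sketch for exploration: the helper name `opcode_names` is mine, and the opcode names you get depend on the Python version you run it with (newer releases may not use FORMAT_VALUE at all), so don't expect the listing to match the 3.6 output shown here.

```python
import dis


def fn(var1, var2):
    return f'{var1}py36_string_interpolation{var2}'


def opcode_names(func):
    """Return the opcode names of func, in instruction order."""
    return [ins.opname for ins in dis.get_instructions(func)]


names = opcode_names(fn)
print(names)
```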

Note: because bytecode generation can change while a new feature is being added, as happened here, we don't support such intermediate, "dev", or "release-candidate" versions of Python.

In Python's AST, this construct is called a FormattedValue. We'll use formatted_value to be consistent with the existing grammar conventions. With this, we have:

   formatted_value ::= LOAD_FAST FORMAT_VALUE

Also in the list that makes up a "Formatted String Literal" are constant strings, which in the AST are called Str. You can see one above in the second LOAD_CONST instruction. What we should do is add a transformation inside the ingester step to change LOAD_CONST to LOAD_STRING whenever the value loaded is a string. However, that's too much work for now, so we will use the more general expr instead of having a str.

Since this is currently for Python 3.6 only, we add these grammar rules to class Python36Parser, in the docstring of a method whose name starts with p_.
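As a rough illustration of the docstring convention, here is a self-contained sketch. It does not import uncompyle6: `Python36ParserSketch`, the method name `p_36misc`, and the helper `collect_rules` are stand-ins I made up to mimic how grammar rules are picked up from the docstrings of p_ methods.

```python
class Python36ParserSketch:
    """Stand-in for uncompyle6's Python36Parser; not the real class."""

    def p_36misc(self, args):
        """
        formatted_value ::= LOAD_FAST FORMAT_VALUE
        """


def collect_rules(parser):
    """Gather grammar rules from the docstrings of methods named p_*."""
    rules = []
    for name in dir(parser):
        if name.startswith("p_"):
            doc = getattr(parser, name).__doc__ or ""
            # Keep only lines that look like grammar rules.
            rules.extend(
                line.strip() for line in doc.splitlines() if "::=" in line
            )
    return rules


rules = collect_rules(Python36ParserSketch())
print(rules)
```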

In the AST, you'll see that a list of formatted values and/or strings is combined into a list called JoinedStr, so let's make grammar rules for that:

   expr       ::= formatted_value
   joined_str ::= expr+ BUILD_STRING

And now I get to the first set of technical issues to discuss.

First, the simple SPARK parser doesn't have nice operators like |, or grouping of grammar symbols. It does have a + and a *, but they can be applied only where the repeated symbol is the only thing on the right-hand side of the rule.

So instead we need to write this as the more cumbersome:

exprs       ::=  expr+
joined_str  ::=  exprs BUILD_STRING

I originally had something like this, and it often worked, until I had a tuple that contained a formatted value as one of its items:

('a', 'b', f'{foo}')

The grammar doesn't separate the tuple entry boundaries from the joined_str boundary.

There is another, more subtle problem with using expr+: it can lead to exponential parsing time. We need to ensure that we keep grammar parsing efficient. See https://github.com/rocky/python-uncompyle6/wiki/Deparsing-Paper.pdf for details.

So instead, when uncompyle6's "ingest" method sees the BUILD_STRING, it notices that the operand is a list of size 3 and changes the opcode name to BUILD_STRING_3. Then, in parsing, a custom rule is added. Here it would be:

   joined_str ::= expr expr expr BUILD_STRING_3

and to hook this into the rest of the grammar:

   expr ::= joined_str
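The renaming and custom-rule steps can be sketched as follows. This is not uncompyle6's actual code: `Token` and `customize_build_string` are hypothetical names, and the token stream is a simplified version of the wordcode listing for the example function.

```python
from collections import namedtuple

# Hypothetical, simplified token: opcode name plus operand.
Token = namedtuple("Token", "kind attr")


def customize_build_string(tokens):
    """Rename BUILD_STRING to BUILD_STRING_<n> and emit a matching rule."""
    new_tokens = []
    rules = set()
    for tok in tokens:
        if tok.kind == "BUILD_STRING":
            count = tok.attr
            kind = "BUILD_STRING_%d" % count
            # One "expr" per item that BUILD_STRING pops off the stack.
            rules.add("joined_str ::= %s%s" % ("expr " * count, kind))
            tok = tok._replace(kind=kind)
        new_tokens.append(tok)
    return new_tokens, sorted(rules)


tokens = [
    Token("LOAD_FAST", "var1"),
    Token("FORMAT_VALUE", 0),
    Token("LOAD_CONST", "py36_string_interpolation"),
    Token("LOAD_FAST", "var2"),
    Token("FORMAT_VALUE", 0),
    Token("BUILD_STRING", 3),
]
new_tokens, rules = customize_build_string(tokens)
print(rules)
```

Because the item count is baked into the terminal name, the parser knows exactly how many expr items belong to the joined_str, which avoids the boundary problem that expr+ has.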

Finally come the semantic rules, which take the AST and produce the right text.

The text for nonterminal formatted_value has braces around it. So we add it to TABLE_DIRECT in pysource.py as:

    'formatted_value':	( '{%c}', 0),

But this really works only if the FORMAT_VALUE has attribute value 0 (no conversion like !r or !s, and no format specifier).
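To see what the nonzero cases would involve, here is a small sketch that decodes the FORMAT_VALUE operand as described in the `dis` documentation: the low two bits select the conversion and bit 0x04 says a format spec is on the stack. The names `CONVERSIONS` and `decode_format_value` are mine, not uncompyle6's.

```python
# Conversion selected by the low two bits of the FORMAT_VALUE operand.
CONVERSIONS = {0: "", 1: "!s", 2: "!r", 3: "!a"}


def decode_format_value(flags):
    """Split a FORMAT_VALUE operand into (conversion, has_format_spec)."""
    conversion = CONVERSIONS[flags & 0x03]
    has_format_spec = (flags & 0x04) == 0x04
    return conversion, has_format_spec


print(decode_format_value(0))
print(decode_format_value(2))
```

Only the `("", False)` case is covered by the simple '{%c}' table entry.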

An interpolation rule to first approximation might be:

  'joined_str':	( "f'%C'", (0, -1, '') ),

However, the above table rules are not quite right. Format strings can have braces and quotes in them, so those need to be escaped. Instead, we need special procedures called n_formatted_value() and n_joined_str() for this. Consult the code for the full details.
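To give a feel for the escaping problem, here is a simplified sketch. It is not what n_formatted_value() or n_joined_str() actually do: `escape_literal` and `render_joined_str` are hypothetical helpers, the parts are assumed to arrive as (kind, text) pairs, and only backslashes, single quotes, and braces are handled (newlines and other control characters are ignored).

```python
def escape_literal(text):
    """Escape a constant string so it can sit inside f'...'."""
    # Backslashes first, so the quote escapes added next stay intact.
    text = text.replace("\\", "\\\\").replace("'", "\\'")
    # Literal braces must be doubled inside a format string.
    return text.replace("{", "{{").replace("}", "}}")


def render_joined_str(parts):
    """Build f-string source from ("literal", s) and ("expr", src) parts."""
    body = []
    for kind, value in parts:
        if kind == "literal":
            body.append(escape_literal(value))
        else:
            body.append("{%s}" % value)
    return "f'%s'" % "".join(body)


src = render_joined_str([("literal", "it's {a} "), ("expr", "x")])
print(src)
```

Evaluating the generated source with `x` bound gives back the original literal text followed by the formatted value, which is the round trip a decompiler needs.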