Continuing in the series of how I fix parse bugs encountered.
The basic idea is to find something that works and compare to see where things go wrong. In this example, the problem is in Python 2.7 so I compare uncompyle6
with uncompyle2
which is similar. In Fixing Issue #283, I compare Python 2.6 bytecode which has bugs, with Python 2.7 which decompiles fine.
Issue 151 ultimately was missing a grammar rule, and another too permissive grammar rule was covering for that. This also happens somewhat in Fixing Issue #283.
Here are the steps I have took to address issue #151.
Step one is get a small test case and compile it. The test case provided in the issue was:
def func(a, b, c):
while a:
if b:
continue
return False
return True
To compile it one can use the compile-file.py program in python-uncompyle6/test/stdlib
$ cd test/stdlib
$ pyenv 2.7.14
$ ./compile-file.py /tmp/issue-151.py
compiling /tmp/issue-151.py to /tmp/issue-151-2.7.pyc
Let's get detailed information in decompilation: assembly code, grammar reductions and the AST tree (which will fail below):
$ ../../bin/uncompyle6 -agT /tmp/issue-151-2.7.pyc > /tmp/issue-151.log 2>&1
It so happens that uncompyle2
decompiles correctly. So let's run that
and compare the grammar. The option that corresponds to t
in uncompyle2
is --showast
:
$ uncompyle2 --showast /tmp/issue-151-2.7.pyc > /tmp/issue-151-good.log
Looking at the tree for that we see:
whilestmt
0 SETUP_LOOP '26' (1)
testexpr
testfalse
expr
3 LOAD_FAST 'a' (2)
jmp_false
6 POP_JUMP_IF_FALSE '25' (3)
return_stmts
_stmts
stmt
ifstmt
testexpr
testfalse
expr
9 LOAD_FAST 'b' (4)
jmp_false
12 POP_JUMP_IF_FALSE '21' (5)
_ifstmts_jump
c_stmts_opt
c_stmts
_stmts
stmt
continue_stmt
15 CONTINUE '3' (6)
18 JUMP_FORWARD '21' (7)
21_0 COME_FROM '18' (8)
return_stmt
ret_expr
expr
21 LOAD_GLOBAL 'False' (9)
24 RETURN_VALUE None (10)
25 POP_BLOCK None
26_0 COME_FROM '0'
Here I compared the grammar reductions with the tree that uncompyle2
has. The number
in parenthesis I added and are are the token numbers. Unfortunately
uncompyle2
doesn't have an option to track grammar reductions, so you
can compare this phase with uncompyle6
.
In the /tmp/issue-151.log
log, I saw:
3 expr ::= LOAD_FAST (2)
3 ret_expr ::= expr (2)
3 assert_expr ::= expr (2)
6 jmp_false ::= POP_JUMP_IF_FALSE (3)
3-6 testfalse ::= expr jmp_false (3)
3 testexpr ::= testfalse (3)
That first LOAD_FAST
matches the load fast shown in the tree above. The token number, 2
is given in parenthesis (each token is actually an instruction and the first instruction is SETUP_LOOP
here).
So we have expr ::= LOAD_FAST (2)
in /tmp/issue-151.log
and if you
look in the tree you can find expr
with a child LOAD_FAST
. Note
the 3 in both places refers to offset 3. Then a couple of lines down
in /tmp/issue-151.log
you will see
6 jmp_false ::= POP_JUMP_IF_FALSE (3)
And you will see in the tree jmp_false
has a child
POP_JUMP_IF_FALSE
in the tree. Then after that in
/tmp/issue-151.log
there is:
3-6 testfalse ::= expr jmp_false (3)
The 3-6 means that this rule spans offset 3 to 6 in bytecode
instructions. But token number 3 in parenthesis is the last token
that the Earley parse has seen at this point, In particular it has currently scanned SETUP_LOOP
, LOAD_FAST
, and POP_JUMP_IF_FALSE
.
The reduction rules ret_expr
and assert_expr
don't appear in the
tree. That's okay. What's going on is that there are things the
grammar considered as possibilities. They just didn't pan out though.
Tracing on like this I finally get in issue-151-bad.log
this:
L. 5: 21 expr ::= LOAD_GLOBAL (9)
L. 5: 21 ret_expr ::= expr (9)
L. 5: 21 assert_expr ::= expr (9)
L. 5: 21-24 return ::= ret_expr RETURN_VALUE (10)
this is where things start to go awry...
L. 3: 9-24 returns ::= _stmts return (10)
L. 5: 21 stmt ::= return (10)
3-24 returns ::= _stmts return (10)
After the returns ::= _stmts return (10)
is where we expect to see a rule like
whilestmt ::= SETUP_LOOP testexpr return_stmts
. In uncompyle2
, returns
is called return_stmt
.
Notice that the returns
(or return_stmt
as uncompyle2
calls it) that starts
on line three spans offsets 3 to 24, ending at token 10, and that is wrong.
To address the grabbiness I changed:
iflaststmtl ::= testexpr c_stmts_opt
to:
iflaststmtl ::= testexpr c_stmts
Okay, so I guess now I got to explain nonterminal names and what they mean.
First returns
. This is from uncompyle6/parser.py
:
# The "returns" nonterminal is a sequence of statements that ends in a RETURN statement.
# In later Python versions with jump optimization, this can cause JUMPs
# that would normally appear to be omitted.
returns ::= return
returns ::= _stmts return
In uncompyle6 it is called return_stmts
. To try to keep nonterminal names closer to the
Python AST names change the name to returns
. The AST symbol would be Return
.
The nonterminal iflaststmtl
is an "if" statement which is inside a loop
(that's the "l" suffix. I think the "last" part means it is the last statement of the loop).
The nonterminal c_stmts
are statements that can contain a continue
statement in them. And when there is an _opt
suffix it means it is optional.
So it seems unusual that iflastmtl
could just contain some sort of
test without a body. It's actually more complicated than that because that might not be
true for for other things. So this change might cause a bug in the future.
Lastly, we add a new test case for this particular code. Run inside test
directory:
$ cp /tmp/issue-151.py simple_source/bug27\+/05_while_if_continue.py
Add some additional some comments to the beginning of
05_while_if_continue.py
to describe what's up.
$ git add simple_source/bug27+/05_while_if_continue.py
$ ./add-test.py simple_source/bug27+/05_while_if_continue.py
$ git add -f bytecode_2.7/05_while_if_continue.pyc
Ideally though we need a whole separate process for figuring out control flow