Known Decompilation differences
Decompilation recreates Python source code from information in Python bytecode.
Many times what you get back looks uncannily close to the Python code that was entered. In earlier Python versions, that happens most of the time because Python bytecode generation wasn't that sophisticated.
More recent versions of Python do more "optimization" so this is expected to happen less often.
However, most of the time, having Python source code that is functionally indistinguishable from the original source code is good enough for one's needs.
Let me go into this in more detail.
Accurate representation of the bytecode, not the source code.
To get across the main idea, let's say that there is a Python out there that, every time you write print(5), generates bytecode for print(6), and therefore the output is 6, not 5.
Let's also suppose that if you write print(6), that produces the same bytecode as print(5). Running bytecode is deterministic: when two bytecode sequences are the same, interpreting them gives the same result. In other words, running the bytecode for print(5) and for print(6) produces the same output. Here, we said that this bytecode prints 6.
A decompiler will see bytecode that prints 6, and will therefore produce Python source that looks like print(6), even though the source code may have been print(5).
But in most situations, this probably doesn't matter! If you were running that Python interpreter, with its bugs and the misleading bytecode that it produces, and you are happy running that code, then presumably you are okay with the fact that print(5) and print(6) do the same thing wherever they occur. In other words, at an operational level the different Python sources are functionally equivalent.
Obviously, something like this probably doesn't exist in any Python interpreter, so we will move on to something that does appear.
If you write print(1+2), modern Python interpreters will compute 1+2=3 at compile time, and not do the addition at run time. So the bytecode for that will be the same as if you wrote print(3). And that is what the bytecode indicates: a faithful representation in Python is print(3), not any of the variations of computations that can be done at compile time to produce 3. For example, for 3-0, 1*2 + 1, and an infinity of other expressions, the Python interpreter will produce the same bytecode, namely the bytecode for print(3).
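One way to check this on your own interpreter is to compile both kinds of expression and look for an arithmetic instruction. This is only a minimal sketch using the standard dis module. The helper names opnames and has_binary_op are made up for this illustration, and the exact opcode names vary between CPython versions.

```python
import dis

def opnames(source):
    """Return the names of the instructions CPython generates for source."""
    code = compile(source, "<example>", "exec")
    return [ins.opname for ins in dis.get_instructions(code)]

def has_binary_op(source):
    # The arithmetic opcode is BINARY_ADD in older CPython releases and
    # BINARY_OP in newer ones, so match on the common prefix.
    return any(name.startswith("BINARY_") for name in opnames(source))

# 1+2 involves only constants, so no addition should be left in the bytecode.
print(has_binary_op("print(1+2)"))
# x is not known until run time, so the addition has to stay.
print(has_binary_op("print(x+2)"))
```

Since the addition is gone from the first bytecode, a decompiler has nothing from which to recover the 1+2.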
For the most part, this kind of thing has very little impact. However, there are situations where computations folded in at compile time can lead to less clear source code. Here is an example:
LINE_WIDTH = 80
last_index = LINE_WIDTH - 1
It so happens that in Python versions up to 3.10, the computation 80-1 is not done at compile time. However, in some version after 3.10, this might very well be done. It might be that it is not done because this would make debugging more difficult; I don't know. What I do know is that this decompiler will report what is in the bytecode.
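To see what your own interpreter does with this example, you can inspect the compiled code object. The sketch below assumes a CPython that does not fold the subtraction, which is the behavior described above. On an interpreter that did fold it, the two checks would come out differently.

```python
source = "LINE_WIDTH = 80\nlast_index = LINE_WIDTH - 1\n"
code = compile(source, "<example>", "exec")

# If the subtraction is left for run time, the bytecode still refers to the
# name LINE_WIDTH ...
uses_name = "LINE_WIDTH" in code.co_names
# ... and the precomputed value 79 is not stored among the constants.
has_folded_value = 79 in code.co_consts

namespace = {}
exec(code, namespace)
print(uses_name, has_folded_value, namespace["last_index"])
```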
Introspection of the source code
File and path location
If the code introspects on the file name, the result could be different depending on where the decompiled file resides and what it is called. The decompiler can report file names that are stored in the bytecode, but the end user ultimately has control over where the file is saved on a filesystem and what its name is.
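The difference between the stored name and the name seen at run time can be sketched as follows. Both paths here are invented for the illustration: "original/dir/app.py" stands in for the original file name, and "/somewhere/else/decompiled.py" for wherever the decompiled file ends up.

```python
source = "def where_am_i():\n    return __file__\n"

# The file name given at compile time is stored in the code object.
code = compile(source, "original/dir/app.py", "exec")
print(code.co_filename)

# __file__, on the other hand, is whatever is set when the code is loaded.
# Here we set it by hand to mimic a file that was saved somewhere else.
namespace = {"__file__": "/somewhere/else/decompiled.py"}
exec(code, namespace)
print(namespace["where_am_i"]())
```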
Line numbers
While the decompiler can produce a crude mapping from line numbers in the original program to line numbers in the decompiled source text, it does not create decompiled source text that exactly matches the line number table. In theory this could be done, but it is a lot of work. If this kind of thing is needed (and sometimes code does introspect on line numbers), then after the decompiled Python is generated, the user should use the source line mappings to edit the result so that the line numbers match up.
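As a rough illustration of what the line number table records, the sketch below compiles a snippet containing blank lines and a comment and lists the line numbers found in its bytecode using dis.findlinestarts. A decompiler that wrote the two assignments on adjacent lines would produce source whose line numbers no longer agree with this table.

```python
import dis

source = "x = 1\n\n\n# a comment\ny = 2\n"
code = compile(source, "<example>", "exec")

# Collect the source line numbers recorded in the bytecode's line table.
# Some CPython versions also report None or 0 for bookkeeping instructions,
# so those are filtered out before printing.
lines = {lineno for _offset, lineno in dis.findlinestarts(code)}
print(sorted(n for n in lines if n))
```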
Comments in the source do not appear.
Comments never appear in bytecode. Note that this differs from function and module docstrings, which do appear except when optimization level 2 is used to compile the source code.
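A quick way to convince yourself of this is to serialize a code object and search the bytes for the text. The sketch below uses marshal, the serialization that .pyc files are built on. The function greet and its strings are invented for the example.

```python
import marshal

source = (
    "def greet():\n"
    '    "Say hello."\n'
    "    # remember to localize\n"
    "    return 'hello'\n"
)

# optimize=0 compiles as the interpreter would without -O or -OO.
code = compile(source, "<example>", "exec", optimize=0)
blob = marshal.dumps(code)  # roughly the payload of a .pyc file

print(b"Say hello." in blob)            # is the docstring text there?
print(b"remember to localize" in blob)  # is the comment text there?
```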
Code in the source that doesn't appear anywhere.
This was alluded to above in "Accurate representation of the bytecode, not the source code".
Optimization level 2 (-OO)
When Python bytecode is produced at optimization level 2 (-OO), docstrings and assert statements are removed from the bytecode that gets compiled.
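The effect can be reproduced without restarting the interpreter, because the built-in compile() takes an optimize argument. This is a small sketch. The function half and the helper load_half are made up for the illustration.

```python
source = (
    "def half(n):\n"
    '    "Return half of a non-negative number."\n'
    "    assert n >= 0, 'n must not be negative'\n"
    "    return n / 2\n"
)

def load_half(optimize):
    """Compile source at the given optimization level and return half()."""
    namespace = {}
    exec(compile(source, "<example>", "exec", optimize=optimize), namespace)
    return namespace["half"]

plain = load_half(0)     # like running python with no optimization flag
stripped = load_half(2)  # like running python -OO

print(plain.__doc__, stripped.__doc__)
print(stripped(-4))      # no AssertionError: the assert is not in the bytecode
```

Bytecode compiled this way gives a decompiler neither the docstring nor the assert to work from.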
Constant folding
This refers to doing some computation at compile time rather than at run time. We gave examples above where 1+2 is computed at compile time, and therefore at run time we only see the result, 3.
Dead code elimination
Sometimes Python simplifies test conditions. For example, if you write:
if False:
    print("I feel funny")
then Python detects that the print statement can never be reached, and it eliminates those lines of code from the bytecode. So a decompiler will not be able to reconstruct code like this. But again, if what you are interested in is a functionally equivalent piece of code, or code that works the same, then this is of no consequence.
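Here is a minimal sketch of how one might confirm this, by checking whether any instruction in the compiled code refers to the string. The helper name mentions and the variable flag are invented for the example.

```python
import dis

def mentions(source, value):
    """True if any instruction compiled from source has value as its argument."""
    code = compile(source, "<example>", "exec")
    return any(ins.argval == value for ins in dis.get_instructions(code))

dead = 'if False:\n    print("I feel funny")\n'
live = 'if flag:\n    print("I feel funny")\n'

# With a constant false test, nothing in the bytecode loads the string.
print(mentions(dead, "I feel funny"))
# With a test that is only known at run time, the body has to be kept.
print(mentions(live, "I feel funny"))
```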
Type Annotation removal
Sometimes Python removes type annotations. In the bytecode for:
def foo():
    x: int = 5
the type annotation for x that is in the source code does not appear anywhere in the bytecode.
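This can be checked directly on the function object. The sketch below looks at two places where a reference to int could show up, if it were kept: the global names used by the bytecode, and the function's __annotations__, which holds only parameter and return annotations.

```python
def foo():
    x: int = 5

# Global names that the bytecode of foo refers to; int would be listed here
# if the annotation were evaluated.
print(foo.__code__.co_names)
# Local variable names are kept, but without their annotations.
print(foo.__code__.co_varnames)
# Annotations attached to the function itself.
print(foo.__annotations__)
```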