Eigenvectors for Code Analysis

Eigenvectors for Code Analysis
Published on 10.05.2025
Consider the following python code that we want to analyse.

from numpy import bitwise_xor
from apache import log5j

def util():
    log5j("logging bro...")

def core_logic():
    util()
    process()
    util()

def process():
    util()
    bitwise_xor(1, 2)

def main():
    core_logic()
    util()

main()
util()

The goal is to be able to find some direction on
- Hot code paths (perhaps for jit or performance measuring)
- Unusual control flows (anomalies for security)
- Hotspots (to prioritise fixing bugs)
- Rarely executed code paths (vulnerabilties in rare areas)
- and so on...

To achieve the above, especially in a large codebase can be tedious.

Step 1: Convert the code into an AST.

Step 2: Walk the AST nodes and index the frequency of the function calls.

Step 3: Sort by this index.

Step 4: Profit ???

Well, maybe not so much.. you can literally grep for function calls and get away with it honestly.

But we seek something deeper, perhaps not just sorting by how many times a function is called, but perhaps identifying functions that are "important". 

We can extract some cool insights about our test code by constructing a directed call graph, where nodes would be functions and edges would be function calls.

Now it's not about degree centrality (counts how many neighbors a node has) anymore, but more about assigning scores to the nodes where the highest score means that the node is more "important" -- Eigenvector Centrality.

"A node is important if it's connected to other important nodes."

I realise this is recursive and a bit vague on what "important" really mean, so let me clarify that real quick. 

Important doesn't just mean how many connections a node has, but also how important those connected nodes are. The more high-impact (high-centrality) nodes you're connected to, the more important you become.

Now instead of counting function call frequency after AST traversal, we create a directed Call graph instead.

Step 1: Convert the code into an AST.

Step 2: Walk the AST nodes and create a directed call graph.

Step 3: Now the idea is to score each node in the call graph, but we start with `1` for all nodes, and iteratively update the values of these based on the "important" connections.

Step 4: So the "importance" of the node is basically an iterative estimate of the sum of the scores of the neighboring nodes. Overtime this estimate converges into whats called an Eigenvector.

Quick primer of Eigenvectors and Eigenvalues

\( A \vec{v} = \lambda \vec{v} \)

Here,
    \( A \) is a matrix of \( n \times n \),
    \( \vec{v} \) is the eigenvector (length \( n \), nonzero),
    \( \lambda \) is the eigenvalue (scalar)

Step 5: In our case we can either iteratively estimate eigenvectors or directly calculate the eigenvalue and the eigenvector. 

\( \det(A - \lambda I) = 0 \)

\( (A - \lambda I)\vec{v} = \vec{0} \)

But for the sake of simplicity we'll explore the iterative approach, I think it's more intuitive.

Step 6: So let's calculate the centrality scores we must first construct the adjacency matrix. Consider this graph,


Here,
    \( X \rightarrow Y \)
    \( Y \rightarrow Z \)
    \( Z \rightarrow X \)

\(A = \begin{bmatrix}
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0
\end{bmatrix} \)

And since this above matrix expresses influence of one node over the other, but we want to give a high score to nodes that are linked to by important nodes, so we transponse it.

\(A^T = \begin{bmatrix}
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0
\end{bmatrix} \)

We initialise the scores and iterate!

\(x^0  = \begin{bmatrix}
1 \\
1 \\
1
\end{bmatrix} \)

\(x^1 = A \cdot x^0  = \begin{bmatrix}
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0
\end{bmatrix} \cdot \begin{bmatrix}
1 \\
1 \\
1
\end{bmatrix} = \begin{bmatrix}
1 \\
1 \\
1
\end{bmatrix} \)

\( \|x^{(1)}\| = \sqrt{1^2 + 1^2 + 1^2} = \sqrt{3} \Rightarrow x^{(1)} \approx \begin{bmatrix} 0.57 \\ 0.57 \\ 0.57 \end{bmatrix} \)

Second Iteration:

\( x^{(2)} = A \cdot x^{(1)} = \begin{bmatrix}
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0
\end{bmatrix} \cdot \begin{bmatrix}
0.577 \\
0.577 \\
0.577
\end{bmatrix} = \begin{bmatrix}
0.577 \\
0.577 \\
0.577
\end{bmatrix} \)

Step 7: We have the centrality scores, all nodes have the score `0.577` and it makes sense `X → Y → Z → X`. But on this method performs well for nodes that are wildly connected to one another in a large graph.

For shorter graphs, we can directly calculate the eigenvector using linear algebra and matrix theory. This would give us the precice value for the scores in one shot. But for larger call graphs it's better to approximate using the iterative method.

So coming back to our intial test code, we can first generate the call graph.

>>> import ast
>>> tree = ast.parse(code)

Walk the AST and construct a directed call graph.

>>> import networkx as nx
>>>
>>> class CallGraphBuilder(ast.NodeVisitor):
...     def __init__(self):
...         super().__init__()
...         self.call_graph = nx.DiGraph()
...         self.current_func = None
...  
...     def visit_FunctionDef(self, node):
...         self.current_func = node.name
...         self.generic_visit(node)
...         self.current_func = None
...  
...     def visit_Call(self):
...         name = self.get_call_name(node.func)
...         self.call_graph.add_edge(self.current_func, name)
...         
>>> cg = CallGraphBuilder()
>>> cg.visit(tree)

> I've omited a code that handles import names for the sake of brevity. You can find the full code at the end. 

I would use `nx.eigenvector_centrality`, but since languages are kinda hard to deal with (top level calls, dead code...), it's better to do `nx.pagerank` instead. PageRank is more robust thanks to the damping factor, which ensures even isolated nodes eventually gain some importance, avoiding the "dead end" problem.

>>> nx.pagerank(self.call_graph)
{'apache.log5j': 0.28179614027492406,
 'core_logic': 0.12146191843631485,
 'main': 0.08523651703604046,
 'numpy.bitwise_xor': 0.14340080854995002,
 'process': 0.13685746139429322,
 'util': 0.2312471543084772}

You can see from the result that `apache.log5j` has the highest score even though syntactically the frequency is only `1`. This is because `util` was called a lot of times and `util` depended on `apache.log5j`, hence capturing "importance".

There's lot more utility of eigenvectors in code analysis, I'll try to explore a few more in the near future.

And here's the full code. If you got questions or just wanna reach out, DM on twitter - @pwnfunction. See you tomorrow, byte byte!

import ast
from collections import defaultdict
from pprint import pprint

import matplotlib.pyplot as plt
import networkx as nx


class CallGraphBuilder(ast.NodeVisitor):
    def __init__(self):
        super().__init__()
        self.current_func = None
        self.call_graph = nx.DiGraph()
        self.import_map = defaultdict(set)
        self.MAP_GLOBAL = False


    def visit_ImportFrom(self, node):
        """
        On `ImportFrom` node visit, create a import map
        """
        for imports in node.names:
            self.import_map[node.module].add(imports.name)

        self.generic_visit(node)


    def visit_FunctionDef(self, node):
        """
        On `FunctionDef` node visit, set/remove current function
        """
        self.current_func = node.name
        self.generic_visit(node)
        self.current_func = None


    def visit_Call(self, node):
        """
        On `Call` node visit, create a call graph node
        """
        name = self.get_call_name(node.func)
        if isinstance(node.func, ast.Name):
            name = self.imports_mapped(name)
        
        if name and (self.MAP_GLOBAL or self.current_func):
            self.call_graph.add_edge(self.current_func or "$", name)
       

        self.generic_visit(node)


    def imports_mapped(self, name):
        """
        Returns if `module.name` if module is imported
        """
        for k, v in self.import_map.items():
            if name in v:
                return f"{k}.{name}"
        return name


    def get_call_name(self, node):
        """
        Returns a dotted name for Name/Attribute/Call nodes, or None.
        """
        # simple name
        if isinstance(node, ast.Name):
            return node.id

        # attribute lookup: foo.bar or (expr).baz
        if isinstance(node, ast.Attribute):
            parent = self.get_call_name(node.value)
            if parent:
                return f"{parent}.{node.attr}"
            return node.attr   # fallback to bare attribute name

        # if someone does foo().bar, unwrap the call
        if isinstance(node, ast.Call):
            return self.get_call_name(node.func)

        return None
    

    def draw_call_graph(self):
        """
        Draw the current state of the call graph
        """
        ug = self.call_graph
        pos = nx.spring_layout(ug)
        nx.draw(ug, pos, with_labels=True, arrows=True, node_size=1500, node_color='lightblue', font_weight='bold')
        plt.show()


    def get_pagerank(self):
        """
        Return pagerank
        """
        return nx.pagerank(self.call_graph)
    

def load_code(filename='test.py'):
    with open(filename, 'r') as f:
        return f.read()
    

if __name__ == '__main__':
    code = load_code()
    tree = ast.parse(code)
    cg = CallGraphBuilder()
    cg.visit(tree)

    pagerank = cg.get_pagerank()
    pprint(pagerank)

    cg.draw_call_graph()