This page collects information on work related to malware classification based on characteristics of the code.

A paper describing some of the ideas was presented at VB2004. The paper is available online here (PDF).

Errata & Updates:

  • In References & Notes, point 9:

    "In graph theory they are the leaf vertices with indegree 1 and outdegree 0."

    There's no reason why those functions would have indegree 1; most often they will have a larger indegree.

  • In section 8.1.3, "The index of similarity", the index is not well defined: as expressed, it always returns 1 (identical match). The following definition is correct:

    Let

    T_f = { all functions in T }
    S_f = { all functions in S }

    with

    T_e = { equivalent functions in T }
    S_e = { equivalent functions in S }

    S_e == T_e as they contain the same elements, so |S_e| = |T_e| = |matched functions|.

    The similarity is then:

    (|S_e|/|S_f|)*(|T_e|/|T_f|) = |matched functions|^2 / (|T_f|*|S_f|)

    Some people have also proposed:

    (|S_e|/|S_f|)/2 + (|T_e|/|T_f|)/2

    which behaves more linearly.

    The next picture shows both similarity measures side by side, plotted for an increasing number of matched functions. As seen, the latter behaves better.
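
The two measures above can be compared numerically with a short sketch. This is only an illustration under stated assumptions, not code from the paper: the function names `product_similarity` and `average_similarity` and the parameters `matched`, `total_s` and `total_t` are hypothetical, standing for |matched functions|, |S_f| and |T_f| respectively.

```python
def _check(matched, total_s, total_t):
    # Hypothetical input validation: both samples must contain functions,
    # and the matched set cannot be larger than either sample.
    if total_s <= 0 or total_t <= 0:
        raise ValueError("both samples must contain at least one function")
    if not 0 <= matched <= min(total_s, total_t):
        raise ValueError("matched must be between 0 and min(total_s, total_t)")


def product_similarity(matched, total_s, total_t):
    """(|S_e|/|S_f|) * (|T_e|/|T_f|), with |S_e| = |T_e| = matched."""
    _check(matched, total_s, total_t)
    return (matched / total_s) * (matched / total_t)


def average_similarity(matched, total_s, total_t):
    """(|S_e|/|S_f|)/2 + (|T_e|/|T_f|)/2, with |S_e| = |T_e| = matched."""
    _check(matched, total_s, total_t)
    return (matched / total_s) / 2 + (matched / total_t) / 2


if __name__ == "__main__":
    # Two samples of 10 functions each, with a growing number of matches.
    for matched in range(0, 11, 2):
        p = product_similarity(matched, 10, 10)
        a = average_similarity(matched, 10, 10)
        print(f"matched={matched:2d}  product={p:.2f}  average={a:.2f}")
```

For two samples of equal size, the product form grows quadratically with the number of matched functions while the averaged form grows linearly. Both return 0 when nothing matches and 1 when every function matches.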