Optimising orbit counting of arbitrary order by equation selection

Bioinformatics

The small-scale structure of a graph contains important information about that graph’s function. Network motifs [1] are defined as subgraphs of a larger graph that appear significantly more often in that large graph than would be expected in a purely random graph with the same number of nodes and edges. It was found that motifs can easily be used to distinguish networks of different functions from each other, like genetic transcription networks from ecosystem food webs [1]. To know whether a certain subgraph is a motif in a specific explored graph, its expected number of appearances in random graphs should be known. This means the subgraph must be counted in a large number of random graphs [1]. This may be sped up by sampling [2] or parallellisation [3], but the random graphs are still needed.

It was first proposed in [4] not to restrict interest to those subgraphs that appear more often than expected. Instead, PPI networks were analysed by counting all small connected, induced subgraphs. These subgraphs are called graphlets. All graphlets on 2 to 5 nodes are shown in Fig. 1. They are all assigned a unique identification number, which can be used to refer to that specific graphlet (shown under each graphlet in Fig. 1). Originally, graphlets were restricted to 5 or fewer nodes because of limitations in computing power. Later research [5, 6] increased that number of nodes. Regardless of the exact computing power available, the exponentially exploding number of graphlets enforces the use of some cut-off on the number of nodes in a graphlet. Therefore, if the term “all graphlets” is used in this paper, it means “all graphlets on k or fewer nodes”, with k some arbitrary but fixed cut-off.

Fig. 1

All graphlets of 2-5 nodes with their original numbering. Within each graphlet, grayscales show the graphlets’ orbits. Orbit numbers are shown close to a node of that orbit. Figure adapted from [10]

Each graphlet’s nodes can be subdivided into symmetrically equivalent sets, called orbits [7]. For example, consider graphlet 1, the three-node path, in Fig. 1. By mirroring the graphlet, the two outside nodes can be swapped, which does not change the graphlet’s structure. Therefore, these two nodes are contained in the same orbit. The middle node can never be interchanged with either of the outside nodes without changing the graphlet’s structure. This means that this node is in a different orbit. In Fig. 1, the outer nodes of graphlet 1 are coloured black, whereas the middle node is coloured white, to indicate their respective orbits. Likewise, in all other graphlets in Fig. 1, the nodes are shaded according to their orbit. Like graphlets, orbits are numbered. These numbers are unique across all graphlets, so an orbit number can be used to identify both the orbit and its graphlet. In Fig. 1, the number of each orbit is indicated close to one of its nodes.

When a graphlet is found in an explored graph, it is possible to determine which graph node is in what orbit of that graphlet. That orbit is then said to touch that node. The number of times orbit i touches node v is called v’s ith graphlet degree [7]. Graphlet degrees are an extension of a node’s degree, which itself is equivalent to the 0th graphlet degree, i.e. the graphlet degree of the 2-node graphlet. As an extension of a graph’s degree distribution, the distribution of all of the graphlet degrees can be calculated, which is called the graph’s graphlet degree distribution or GDD. These graphlet degree distributions themselves are a useful tool for network comparison [7] and identification of nodes’ function [8].

To calculate a graph’s GDD, it seems at first sight that all graphlets within that graph need to be found. However, it was proved in [9] that this is not the case. They showed that, to calculate a node’s graphlet degrees (using graphlets on up to k nodes), a system of linear equations can be composed from all graphlets on k−1 nodes and all common neighbours of up to k−2 nodes in the explored graph. Solving this system of equations then results in the node’s graphlet degrees. Using equations allows part of the calculation to be done in advance and reused while counting multiple different orbits.

The ORCA algorithm (ORbit Counting Algorithm) [9] is the fastest available algorithm to calculate all nodes’ graphlet degrees. ORCA can count the orbits of graphlets up to either 4 or 5 nodes and uses such a system of equations to reduce this to finding graphlets on 3 or 4 nodes, respectively.

The Jesse algorithm, which managed to automatically generate equations for orbits of arbitrary order, as well as use them to count those orbits, was described in [5]. To this end, a non-ambiguous, extendable graphlet and orbit numbering was introduced. This numbering does not follow the original numbering in Fig. 1, but is constructed in an algorithmical, consistent way. Likewise, a novel type of graphlet that represents a specific orbit was introduced. In these graphlets, a single node is marked, which singles out that node’s orbit. These so-called orbit representatives were used as the nodes of a directed tree, in which the arcs were the addition of nodes and edges to these orbit representatives. This tree can then be used to find orbit representatives on k−1 nodes within a graph, after which automatically generated [10] equations are used to calculate the number of orbits of graphlets on k nodes.

However, many redundant equations are generated, so a large linearly dependent system of equations arises. Originally, the equations used in Jesse were chosen on a first-come-first-served basis. If Jesse generated more than 1 equation that could be used to count orbit i, only the one that was generated first was saved and used. All other generated equations are just as correct, though.

Once the system of equations is completed, it can be straightforwardly solved as its left-hand side, where graphlet degrees are related to each other, is an upper triangular matrix. However, filling in the right-hand side of all equations is more time-consuming, as this depends on the graph’s structure. Therefore, different, but equally correct sets of equations can be used and have an impact on Jesse’s running time.

Other techniques to count graphlets have been developed: ranging from combinatorial counting that uses graphlets’ symmetries and substructures to count them without finding any graphlet, which is currently limited to graphlets on 4 nodes [11], to sampling techniques that do not count all graphlets but identify randomly sampled subgraphs [6], or incremental counting [12].

In this paper, we investigate whether it is possible to select a set of equations that allows the graphlet degrees to be calculated faster than when using other sets, and whether this set depends on the explored graph’s properties. First, a number of graph theory terms need to be defined. A high-level explanation of Jesse’s internal structure follows. Then, the structure of a graphlet counting equation is explained, and the variable properties of equations are identified. With all of these, the effect of equation selection on Jesse’s speed can be investigated, accounting for different graph types, orders and sizes. Finally, the Cytoscape version of Jesse will be presented.

Formal graph theory definitions

This section is a collection of formal definitions of terms which will be used later on in the paper. These are formal definitions for all terms used in this paper, but for more information on how and why these terms were constructed we refer to [10].

Definition 1

An undirected graph

$$G = (V,E),$$

where V is the collection of nodes and E is the collection of edges of G, such that

$$E subseteq {{v,w} | (v, w in V wedge v neq w)}.$$

Similarly, V(G) and E(G) are used to indicate G’s node and edge set, respectively. n=|V| is called the graph’s order, while m=|E| is called its size.

Definition 2

A graph G’s density

$$D(G) = frac{|E|}{binom{|V|}{2}} = frac{2|E|}{|V|(|V|-1)}.$$

As it is the graph’s size divided by the maximal size that graphs of that order can have,

$$0 leq D(G) leq 1.$$

Definition 3

The isomorphisms between two graphs G and H are the one-to-one functions that map G’s nodes to H’s nodes, such that their edge sets are mapped onto each other as well.

$$begin{aligned}{}text{Iso}(G,H) &= {f:{V}(G) rightarrow {V}(H)|(f text{ is bijective }\ &quad wedge (forall v, w in {V}(G): ({v,w}\&quad in {E}(G) iff { f(v), f(w) } in {E}(V))))}. end{aligned} $$

Two graphs are said to be isomorphic if they have at least one isomorphism.

$$G simeq H iff text{Iso}(G,H) neq emptyset $$

Definition 4

The automorphisms of a graph G are the isomorphisms of G to itself.

$$text{Aut} (G) = text{Iso}(G,G)$$

Definition 5

A graph’s nodes have a unique, ordered index:

$$text{Ind} (v) = text{Ind} (w) iff v = w. $$

This index is chosen arbitrarily at graph creation. As a form of shorthand, we note v<w for Ind(v)<Ind(w).

Definition 6

The induced subgraph on a subset VsV(G)of a graph G’s nodes is the graph formed by Vs and all edges between those nodes that are present in G.

$$G[V_{s}] = (V_{s}, { {v,w} in E(G) : v, w in V_{s}}) $$

Definition 7

The number of common neighbours of a subset VsV(G) of a graph G’s nodes will be noted as c(Vs).

$$c(V_{s}) = | {v in V(G) |(forall w in V_{s} : {v,w} in E(G))} | $$

Following [9], c({u,v,…}) will also be noted as c(u,v,…).

Definition 8

A graphletG is a connected graph. Graphlets are assigned an unique numbering, such that the notation Gi denotes a unique graphlet for each value of i. A k-graphlet is a graphlet with k nodes.

Definition 9

The instances of a graphlet Gi in a graph H are the induced subgraphs of H that are isomorphic with Gi.

Definition 10

The automorphism orbit, or orbit in short, of a node v in a graphlet Gi is the set of nodes to which v can be mapped by an automorphism of Gi.

$$text{Orb}(G_{i},v) = left{w in V(G_{i}) | left(exists f in text{Aut} (G_{i}) : f(v) = w right)right} $$

Orbits are assigned a unique numbering, spanning over all graphlets, such that the notation Orbj can be used to identify both the orbit and its graphlet.

Definition 11

An orbit representative is a graphlet with a marked node x.

$$Omega = (V,E,x) $$

with

$$x in V wedge E subseteq {{v,w} | (v, w in V wedge v neq w) } $$

Orbit representatives follow the same numbering as orbits, such that Ωi is the orbit representative that corresponds with Orbi. The marked node of an orbit representative Ω will be noted as x(Ω).

Definition 12

The isomorphisms between two orbit representatives Ω and Ψ are the isomorphisms of Ω and Ψ’s graphlets that map Ω’s marked node to Ψ’s marked node.

$$begin{aligned} text{Iso}(Omega,Psi) &= {f:{V}(Omega) rightarrow {V}(Psi) |(f text{ is bijective} \ &quad wedge (forall v, w in {V}(Omega): ({v,w}\ &quad in {E}(Omega) iff { f(v), f(w)} in E(Psi)))\ &quad wedge (f({x}(Omega)) = {x}(Psi)))} end{aligned} $$

Two orbit representatives are said to be isomorphic if they have at least one isomorphism.

$$Omega simeq Psi iff text{Iso}(Omega,Psi) neq emptyset $$

Definition 13

The automorphisms of an orbit representative Ω are the isomorphisms of Ω to itself.

$$text{Aut}(Omega) = text{Iso}(Omega,Omega) $$

Definition 14

The induced orbit representative on a subset VsV(G) of a graph G’s nodes and a marked node xVs is the orbit representative formed by Vs and all edges between those nodes that are present in G, with x as the orbit representative’s marked node.

$$G[V_{s}, x] = left(V_{s}, left{ {v,w } in E(G) : v, w in V_{s} right}, x right) $$

Definition 15

The instances of an orbit representative Ωi in a graph H are the induced subgraphs of H that are isomorphic with Ωi.

An orbit Orbitouches a node v in a graph G if its corresponding orbit representative Ωi has an instance in G in which v is the marked node.

Definition 16

The ith graphlet degree oi(v)of a node v in a graph G is the number of times orbit i touches v.

$$o_{i}(v) = | {S subseteq V(G) : G[S,v] simeq Omega_{j} } | $$

Once more, we refer the interested reader to [10] for more information.

Articles You May Like

Neonics hinder bees’ ability to fend off deadly mites
Defying the laws of physics? Engineers demonstrate bubbles of sand
Novel antibody may suppress HIV for up to four months
New study shows people used natural dyes to color their clothing thousands of years ago
Meal Kits Have Smaller Carbon Footprint Than Grocery Shopping, Study Says

Leave a Reply

Your email address will not be published. Required fields are marked *