Model

(i)
Genomes (viewed as gene networks) are represented as undirected graphs with proteincoding genes for vertices (Fig. 1a). Two proteincoding genes are connected by an edge if they are neighbors on the same strand of the same chromosome.

(ii)
Metabolic pathways are represented as directed graphs with reactions for vertices (Fig. 1b). An arc leading from reaction r_{i} to r_{j} signifies that reaction r_{i} produces a metabolite that is a substrate for r_{j}.

(iii)
For a given species S, the relation between one of its metabolic pathways and its genome takes the form of a correspondence function associating genes to reactions: for any given reaction r, the correspondence function returns the set of genes of species S that code for enzymes catalyzing reaction r (Fig. 1c). This information can be found in a knowledge base such as KEGG (Kyoto Encyclopedia of Genes and Genomes) [26] which, for a given species, contains information on its metabolic pathways, the reactions that the species performs, and the genes associated to these reactions.
The method we propose, CoMetGeNe, requires two input graphs possessing the same vertex set. Thus, an additional undirected graph is constructed as described in [27] such that it reflects gene neighborhood with respect to the reactions that the gene products catalyze (Fig. 1d). The additional graph links two reactions r_{i} and r_{j} with an edge if at least one of the genes coding for an enzyme involved in reaction r_{i} is adjacent to a gene coding for an enzyme involved in r_{j}. For example, genes X and Y are neighbors in G^{′} (Fig. 1a). Gene X codes for an enzyme involved in reaction r_{8}, and gene Y codes for an enzyme involved in reactions r_{9} and r_{10}. To reflect adjacency between genes X and Y, reactions r_{8} and r_{9}, respectively r_{8} and r_{10}, are linked by an edge in G (Fig. 1d).
Finding metabolic and genomic patterns for a single species
Problem formulation
Given a metabolic pathway and the gene network for the same species, the objective is to identify a maximal number of consecutive reactions being catalyzed by products of neighboring genes. The problem was initially formulated under the name of LONGEST SUPPORTED PATH (LSP) [28], as follows:
LONGEST SUPPORTED PATH (LSP)
Input: A directed graph D=(V,A), an undirected graph G=(V,E).
Output: A longest path P in D such that G[V(P)] is connected.
In the above formulation, the notation G[X], where G is a graph and X is a set of vertices, stands for the subgraph of G induced by X, that is the subgraph of G with vertices of X as its vertex set, and where edges (or arcs in the directed case) are all the edges (or arcs) of G linking two vertices in X (see [29]). Thus the solution for LSP is a path in the directed graph D inducing a connected subgraph in the undirected graph G.
The vast majority of metabolic pathways, however, exhibit cycles (e.g. reversible reactions). Taking cycles into account requires that solutions be authorized to contain repeated vertices. Recall that, contrary to paths, trails can contain repeated vertices, but not repeated arcs [25].
We now define the concept of span and propose a new problem formulation that provides trails as solutions, instead of paths. The span of a trail T represents the number of distinct vertices in T. For example, if T is the trail (r_{2}, r_{3}, r_{7}, r_{8}, r_{3}, r_{4}) in Fig. 1b, then the span of T is 5, because vertex r_{3} is repeated.
MAXIMUM SPAN SUPPORTED TRAIL (MaSST)
Input: A directed graph D=(V,A), an undirected graph G=(V,E), an arc (u,v) in D.
Output: A trail of maximum span T in D passing through (u,v) such that G[V(T)] is connected.
Whereas LSP produces a path for every graph D, MaSST outputs trails of maximum span passing through arcs of D if the vertex sets of these trails induce connected subgraphs in G. The choice of producing a trail for every arc in D is deliberate in order to ensure that more than a single trail is retrieved per graph. For example, for graphs D (Fig. 1b) and G (Fig. 1d) and the arc (r_{1},r_{2}), MaSST outputs one of the two following trails of span 8: (r_{1}, r_{2}, r_{3}, r_{7}, r_{8}, r_{3}, r_{4}, r_{9}, r_{10}) or (r_{1}, r_{2}, r_{7}, r_{8}, r_{3}, r_{4}, r_{9}, r_{10}). For any other arc in D, the output of MaSST is either of the two following trails of span 9: (r_{5}, r_{6}, r_{2}, r_{3}, r_{7}, r_{8}, r_{3}, r_{4}, r_{9}, r_{10}) or (r_{5}, r_{6}, r_{2}, r_{7}, r_{8}, r_{3}, r_{4}, r_{9}, r_{10}).
For practical purposes (see Path finding in the line graph below), we solve MaSST by using the line graph of D. Given a directed graph D, its line graph L(D) is a directed graph in which vertices are arcs in D. There is an arc in L(D) from a vertex x to another vertex y if and only if x=(r,s) and y=(s,t) with r, s, t∈V(D). For example, the graph in Fig. 1e is the line graph of the graph in Fig. 1b.
Let D be a directed graph and L(D) be its line graph. Let P=(a_{1}, a_{2}, …, a_{k}) be a path in L(D), where a_{i}=(t_{i−1}, t_{i}), 1≤i≤k, are arcs in D. The trail in D corresponding to P, denoted L^{−1}(P), is the trail T=(t_{0}, t_{1}, t_{2}, …, t_{k−1}, t_{k}). If P is an empty path, then L^{−1}(P) is an empty trail. For example, if P is the path ((r_{3},r_{7}), (r_{7},r_{8}), (r_{8},r_{3})) in Fig. 1e, then L^{−1}(P) is the trail (r_{3},r_{7},r_{8},r_{3}) in the directed graph D.
We further propose MAXIMUM SPAN SUPPORTED CORRESPONDING TRAIL (MaSSCoT), a problem formulation equivalent to MaSST:
MAXIMUM SPAN SUPPORTED CORRESPONDING TRAIL (MaSSCoT)
Input: A directed graph D=(V,A), an undirected graph G=(V,E), an arc (u,v) in D.
Output: A path P in the line graph of D such that L^{−1}(P) has maximum span, passes through (u,v), and G[V(L^{−1}(P))] is connected.
Note that, as LSP has been shown to be NPhard in the general case [23, 28], we have proved that MaSST and MaSSCoT are also NPhard (Additional file 1).
Graph reduction
Fertin et al. [23] introduced the concept of a cover set of a path and proposed an algorithm to compute it. Briefly, given two graphs D (directed) and G (undirected) on the same vertex set U, as well as a path P in D, the cover set of P with respect to D and G is a maximal subset of U containing only vertices that might extend P into a path P^{′} such that G[V(P^{′})] and the undirected graph underlying D[V(P^{′})] stay connected. We have shown that, for a given arc (u,v) in D, reducing the input graphs D and G to the cover set U^{′} of (u,v) and feeding these reduced graphs D[U^{′}] and G[U^{′}] as input to MaSST and MaSSCoT yields the same solution as providing D and G as input (Additional file 2). In other words, graphs D and G are reduced to a strict minimum without loss of solutions.
Path finding in the line graph
 1

2
For every SCC of L(D), vertices acting as entry points from predecessor SCCs, as well as vertices acting as exit points to successor SCCs are determined. For example, in Fig. 1e, vertices (r_{2},r_{3}) and (r_{2},r_{7}) are entry points for SCC S_{2} when coming from the predecessor SCC S_{1}. Vertex (r_{3},r_{4}) in S_{2} is an exit point when heading to SCC S_{3}. In S_{3}, vertex (r_{4},r_{9}) is both an entry point when coming from predecessor S_{2} and an exit point when heading to successor S_{4}. S_{1} has no predecessor SCCs and S_{4} has no successor SCCs.

3
For every SCC X of L(D), path enumeration is performed only between strictly necessary source and destination vertices, as follows: (i) if X has at least one predecessor and one successor SCC, then paths are enumerated between feasible pairs of entry and exit points for these SCCs; (ii) if X has no predecessor and at least one successor SCC, then paths are enumerated between every vertex of X and exit points towards the successor SCC(s); (iii) if X has at least one predecessor and no successor SCC, then paths are enumerated between entry points from the predecessor SCC(s) and every vertex of X; (iv) only if X has no predecessor and no successor SCCs, paths are enumerated between every pair of vertices of X.
The paths obtained through step 3 above are evaluated in terms of span of their corresponding trails in D and the best candidate paths among them are retained. They are referred to as best partial paths.
Concatenation of partial paths
A path Q in the condensation graph C of L(D) is “translated” into one or several paths in L(D) by concatenating best partial paths in SCCs of L(D). Let C_{i} and C_{j} be two consecutive vertices of a path Q in C of length at least 1. Let S_{i} and S_{j} be the SCCs in L(D) corresponding to C_{i} and C_{j}, respectively. Then Q has more than one corresponding path in L(D) if S_{i} has at least two exit points when heading to the successor SCC S_{j}, or if S_{j} has at least two entry points when coming from the predecessor SCC S_{i}.
For example, two paths in L(D) (Fig. 1e) correspond to path Q_{1} = (S_{1}, S_{2}, S_{3}, S_{4}) in C (Fig. 1f): P_{1}=((r_{1},r_{2}), (r_{2},r_{7}), (r_{7},r_{8}), (r_{8},r_{3}), (r_{3},r_{4}), (r_{4},r_{9}), (r_{9},r_{10})) and (P^{prime }_{1} = ((r_{1}, r_{2})), (r_{2},r_{3}), (r_{3},r_{7}), (r_{7},r_{8}), (r_{8},r_{3}), (r_{3},r_{4}), (r_{4},r_{9}), (r_{9},r_{10})). The corresponding trails in D (Fig. 1b) are L^{−1}(P_{1})=(r_{1}, r_{2}, r_{7}, r_{8}, r_{3}, r_{4}, r_{9}, r_{10}) and L^{−1}(P1′)=(r_{1}, r_{2}, r_{3}, r_{7}, r_{8}, r_{3}, r_{4}, r_{9}, r_{10}), both with span 8. Note that if P_{1} (respectively (P^{prime }_{1})) passed through arcs (r_{4},r_{5}), (r_{5},r_{6}), or (r_{6},r_{2}), then P_{1} (respectively (P^{prime }_{1})) would be a trail instead of a path, which is not allowed.
In order to determine the solution to the MaSST problem, all paths in the condensation graph of L(D) are enumerated such that their corresponding paths in L(D) contain the SCC possessing the input arc (u,v) as vertex. If a path in L(D) obtained by concatenating best partial paths contains vertex (u,v), it is then evaluated in terms of its span by comparing it to the best current solution and by updating the current solution if necessary.
For example, let (u,v)=(r_{2},r_{7}) (Fig. 1b). After translating path Q_{1} = (S_{1}, S_{2}, S_{3}, S_{4}) in C to a path in L(D), the best current solution P_{1} has span 8 as shown above. Now, suppose path Q_{2} = (S_{2}, S_{3}, S_{4}) (Fig. 1f) is enumerated. There is one corresponding path in L(D) (Fig. 1e) passing through (r_{2},r_{7}), obtained by concatenation of best partial paths in S_{2}, S_{3}, and S_{4}. The best partial path in S_{2} ends in vertex (r_{3},r_{4}) (which is an exit point when heading toward S_{3}) and may start with any vertex in S_{2}, provided the corresponding trail in D has maximum span. The path in L(D) corresponding to Q_{2} is therefore P_{2} = ((r_{5},r_{6}), (r_{6},r_{2}), (r_{2},r_{7}), (r_{7},r_{8}), (r_{8},r_{3}), (r_{3},r_{4}), (r_{4},r_{9}), (r_{9},r_{10})), for which L^{−1}(P_{2}) has span 9. When P_{1} and P_{2} are compared, the best current solution now becomes P_{2} (because L^{−1}(P_{2}) has maximum span and because G[V(L^{−1}(P_{2}))] is connected (Fig. 1d)).
HNET algorithm
We propose HNET (Heterogeneous NETwork mining), an algorithm that solves the MaSST problem using the MaSSCoT formulation internally (Algorithm 1). Unlike the heuristic solution introduced in [23] to the LSP problem, HNET is an exact method. However, it is not exhaustive, meaning that if several trails of maximum span pass through a given arc (u,v) in D, then only one such trail is reported as solution. The bottleneck in HNET is path enumeration at line 5. In effect, the number of paths between two given vertices of a graph can be exponential with respect to the size of the graph. The exponential worstcase complexity of path enumeration is due to the NPhardness of MaSST and MaSSCoT. The worstcase scenario occurs when all possible paths are enumerated between all pairs of vertices in a SCC. This scenario occurs in two distinct cases which nonetheless rarely arise in practice. The first case is that of SCCs of D that are completely disconnected from the rest of the graph. Sequences of reactions in metabolic pathways that are completely disconnected from the rest of the pathway are typically very short and therefore not limiting for exhaustive path enumeration. The second case is when D is strongly connected, corresponding to the infrequent situation in which a chain of reactions leads from any reaction r_{i} to any other reaction r_{j} of a given metabolic pathway, and vice versa.
In the following, assume: D=(V,A) is a directed graph; (u,v), an arc in D; G=(V,E), an undirected graph; L(D), the line graph of D; and C, the condensation graph of L(D).
Algorithm GRAPHREDUCTION (line 1) returns the reduced graphs D and G (see Graph reduction above). For graphs D and G in Fig. 1 (panels b and d), the reduced and unreduced graphs are the same. LINEGRAPH (line 2) returns the line graph L(D) of the reduced input graph (Fig. 1b, e). CONDENSATIONGRAPH (line 3) returns the condensation graph of L(D), i.e. the directed acyclic graph obtained by replacing every SCC of L(D) by a single vertex (Fig. 1e, f).
Algorithm ACCESSPOINTS determines entry and exit points for every SCC X of L(D), from SCCs that are predecessors of X and toward SCCs that are successors of X (see Path finding in the line graph above, step 2). This information is stored in a data structure (mathcal {A}) that the algorithm returns at line 4. Algorithm PARTIALPATHS then uses (mathcal {A}) to compute best paths in every SCC X of L(D) (in terms of span of their corresponding trails in D) between all feasible pairs of source and destination vertices. Source vertices are entry points from predecessor SCCs if X has predecessors, and vertices of X otherwise. Reciprocally, destination vertices are exit points to successor SCCs if X has successors, and vertices of X otherwise. These paths, termed best partial paths, are stored in a data structure (mathcal {B}) that the algorithm returns at line 5 (see Path finding in the line graph above, step 3).
At line 6, HNET determines a, the vertex of C whose corresponding SCC in L(D) contains the input arc (u,v) as a vertex. Next, all possible paths in C are enumerated (lines 814) and, if they contain vertex a, the corresponding paths in L(D) are obtained by concatenation of best partial paths stored in (mathcal {B}). The best current solution is updated accordingly. A path P in L(D) qualifies as a best current solution if the trail in D corresponding to P, L^{−1}(P), fulfills the following conditions: (i) it contains the input arc (u,v); (ii) it induces a connected subgraph in G; (iii) it has maximum span so far.
Algorithm ENUMERATEPATHS at line 10 returns all paths starting with vertex s and ending in vertex t in the condensation graph. If s and t are the same vertex, the algorithm returns either one. Algorithm FINDPATHS at line 12 returns all paths in L(D) corresponding to path Q in the condensation graph C, obtained by concatenation of best partial paths stored in (mathcal {B}). Given two paths in L(D), algorithm BESTPATH at line 14 returns the best current path, i.e. the path among the two whose corresponding trail in D has greater span than the other (see Concatenation of partial paths above).
Finally, HNET returns the trail in D corresponding to the best solution (line 15), effectively solving the MaSST problem. An additional consistency check is performed as detailed in [27] to ensure that the trail L^{−1}(P) also “makes sense” when passing from G to the initial graph G^{′} (see Model and Fig. 1, panels a through d). It is checked whether vertices in G^{′} corresponding to the vertex set of the trail are connected. Note that [27] describes a heuristic solution to LSP (see Problem formulation).
Allowing for skipped vertices
The MaSST and MaSSCoT formulations imply that solutions consist of strictly neighboring genes and reactions. As in a previous graphbased approach for the integration of heterogeneous biological data in another context [19], a preprocessing step was added to CoMetGeNe in order to allow for non contiguous reactions and/or genes. The preprocessing step consists in modifying the input graphs by adding arcs (respectively edges) between vertices separated by at most δ_{D} other reactions (respectively δ_{G} other genes). δ_{D} and δ_{G} are referred to as the gap parameters. Their value should be set quite low (e.g. at most 3) for ensuring that CoMetGeNe results are relevant from a biological point of view.
Finding conserved metabolic and genomic patterns across multiple species
Here we show how trail finding, presented in the previous section, can be used to identify conserved interspecies metabolic and genomic patterns. We developed two methods for grouping trails obtained using the CoMetGeNe pipeline. They rely on examining trails of a given species, the reference species, in terms of either reactions or genes involved in these reactions, with the aim of comparing trails of the reference species with similar trails found for the remaining species. Both methods start out by pooling together all trails produced by the CoMetGeNe pipeline, for every species, every metabolic pathway, and every combination of the gap parameters.
For reasons explained below, both trail grouping methods were designed to treat trails as reaction sets, meaning that the order of reactions is not taken into account and that repeated reactions are ignored. In Fig. 1b, trails t_{1}=(r_{2}, r_{7}, r_{8}, r_{3}, r_{4}) and t_{2}=(r_{2}, r_{3}, r_{7}, r_{8}, r_{3}, r_{4}) both have the same corresponding reaction set {r_{2}, r_{3}, r_{4}, r_{7}, r_{8}}.
As previously explained, CoMetGeNe determines trails of reactions being catalyzed by neighboring genes. The definition of conserved patterns (in terms of metabolic and gene neighborhoods) needs to be able to accommodate slight variations between species. One such variation is encountering a different reaction order between trails. For example, if trails (r_{9},r_{10}) and (r_{10},r_{9}) are identified for two different species for the pathway in Fig. 1b, these trails naturally constitute a conserved pattern for the two species. Another variation that needs to be taken into account is best illustrated with the example of trails t_{1} and t_{2} above. If these trails are obtained for different species, the common feature is that both species perform the same five reactions using products of neighboring genes, irrespective of reaction order and of whether reaction r_{3} is repeated. Another example of variation that should not prevent the identification of conserved patterns is related to reactions (or genes) that are present in trails of some, but not all, of the species. For example, suppose the trails t_{3}=(r_{2}, r_{3}, r_{7}) and t_{4}=(r_{3}, r_{7}, r_{8}) are identified for two different species for the pathway in Fig. 1b. The fact that reactions r_{3} and r_{7} are common to both trails and are catalyzed by products of neighboring genes for both species should be identified as a conserved pattern. The necessity of accommodating these types of trail variations explains the choice for processing trails as reaction sets during the present trail grouping step.
Let (mathcal {P}) be the panel of selected species under study. Species (S in mathcal {P}) denotes the chosen reference species. Let R_{S} be the set of all reaction sets of S. Note that reaction sets in R_{S} are not disjoint. From a biological standpoint, R_{S} represents the pool of trails of the reference species produced by CoMetGeNe, viewed in terms of reaction sets.
In the following, two genes of a given species are said to be neighboring if they are separated by at most three other genes on the same strand of the same chromosome.
Trail grouping by reactions
Briefly, the method of grouping trails by reactions consists in grouping reactions of the reference species according to the reaction sets they belong to. This grouping method focuses more on metabolic rather than genomic conserved patterns.
Grouping trails by reactions for the reference species S consists in constructing a table (T_{S}^{r}) where rows represent reactions in every reaction set of S and columns represent the remaining species in (mathcal {P}). Table (T_{S}^{r}) reflects conserved metabolic patterns between the reference species and the rest of the panel through the three possible symbols that can be assigned to each cell. These symbols allow to easily distinguish which reactions of the reference species are not present in the other species (blanks), and which are catalyzed by products of neighboring (crosses) and non neighboring (dots) genes of the other species.
Trail grouping by reactions and by genes for the reference species S against species S_{1}
r 
U 
A 
. 
. 
r 
X 
X 
x 
x 
r 
V 
− 
. 

r 
W 
W 
x 
x 
r 
T 
T 
x 
x 
({mathcal {R}^{prime }}) 
Neigh. ( 
Neigh. ( 
Rows in table (T_{S}^{r}) represent reactions in R_{S} and are ordered by reaction sets of S. Note that a given reaction performed by species S appears several times in (T_{S}^{r}) if it belongs to several reaction sets. Columns represent the remaining species in (mathcal {P}) and are ordered according to evolutionary distance to S, such that species phylogenetically closer to S have lower column indexes than species phylogenetically distant from S.
Let (T_{S}^{r}[i, j]) denote the cell in (T_{S}^{r}) on row i and column j. Let r_{i} denote the reaction of species S corresponding to row i in (T_{S}^{r}). Let S_{1} denote the species corresponding to column j in (T_{S}^{r}). Let (mathcal {R} subseteq R_{S}) denote the reaction set of species S to which reaction r_{i} belongs. For the example presented above, the reaction set of species S that is investigated is (mathcal {R} = {r_{2}), r_{3}, r_{6}, r_{7}, r_{8}} (see the first column ((mathcal {R})) in Table 1).
Let (mathcal {R’}) denote a maximal subset of (mathcal {R}) such that the genes of S_{1} involved in (mathcal {R’}) are neighbors. For the above example, the subset (mathcal {R’}) is {r_{6}, r_{7}, r_{8}} (see (mathcal {R’}), i.e. entries in bold in the first column ((mathcal {R})) in Table 1) because reactions in (mathcal {R’}) involve the neighboring genes T_{1}, W_{1}, and X_{1}, respectively (even though gene B_{1} is skipped).

a cross (x) if (r_{i} in mathcal {R’}).

a dot (.) if (r_{i} in mathcal {R} – mathcal {R’}) and r_{i} is performed by species S_{1}.

a blank if (r_{i} in mathcal {R} – mathcal {R’}) and r_{i} is not performed by species S_{1}.
For the above example (see the fourth column (S_{1} in (T_{S}^{r})) in Table 1), the cells corresponding to reactions in (mathcal {R’}) receive a cross symbol (x). Since reaction r_{2} is performed in S_{1} by gene A_{1} which is not a neighbor of W_{1}, T_{1}, or X_{1}, the corresponding cell on column S_{1} in (T_{S}^{r}) receives a dot symbol (.). Finally, reaction r_{3} is absent from S_{1}, therefore the corresponding cell receives a blank. The interpretation is that reactions r_{6}, r_{7}, and r_{8} are performed in species S_{1} by products of neighboring genes. Reaction r_{3} is absent from S_{1}, whereas the gene involved in r_{2} is not a neighbor of genes involved in reactions r_{6}, r_{7}, and r_{8}.
Trail grouping by genes
Here, we group CoMetGeNe reaction sets according to the gene order of the reference species. This second grouping method focuses more on genomic rather than metabolic conserved patterns. Two genes coding for enzymes involved in the same metabolic reaction are referred to as functionally similar genes. Functionally similar genes in two species can be either analogues (products of convergent evolution) or homologues (products of divergent evolution).
Grouping trails by genes consists in constructing a table (T_{S}^{g}) where rows represent genes of the reference species S involved in reaction sets shared by S and at least one other species in (mathcal {P}), and columns represent the remaining species in (mathcal {P}). Table (T_{S}^{g}) reflects conserved genomic patterns between the reference species and the rest of the panel through the two possible symbols that can be assigned to each cell. These symbols allow to easily differentiate genes of S with neighboring (crosses) and non neighboring (dots) functionally similar genes in other species.
For example, for the trail t=(r_{6}, r_{2}, r_{3}, r_{7}, r_{8}) in Fig. 1b and the gene neighborhood in Fig. 2 for the reference species S and another species S_{1}, (T_{S}^{g}) is represented by the second ((mathcal {G})) and fifth (S_{1} in (T_{S}^{g})) columns in Table 1. Genes X_{1}, W_{1}, and T_{1} of S_{1} respectively have the neighboring functionally similar genes X, W, and T in the reference species S.
$$R = R_{S} cap left(bigcuplimits_{S_{1} in mathcal{P} – {S}} R_{S_{1}} right) $$
Hence, R represents the set of reaction sets common to S and at least one other species in (mathcal {P}). Let G_{S} be the set of genes of the reference species S that are involved in reactions belonging to reaction sets of R. From a biological standpoint, G_{S} represents the pool of genes of the reference species coding for enzymes involved in reaction sets common to S and at least one other species in (mathcal {P}).
Rows in table (T_{S}^{g}) represent genes from G_{S} and are ordered by chromosome and strand, according to the position of genes on the strand. Columns represent the remaining species in (mathcal {P}) and are ordered according to evolutionary distance to S (see Trail grouping by reactions).
Let S_{1} denote the species corresponding to column j in (T_{S}^{g}). Let (mathcal {G}) be a subset of G_{S} such that genes in (mathcal {G}) are neighbors on the same strand and chromosome of S. For the example presented above, the gene group of species S that is investigated is (mathcal {G} = {U, X, V, W, T}) (see the second column ((mathcal {G})) in Table 1).
Let (mathcal {R}) be the set of reactions in all reaction sets in which the genes in (mathcal {G}) are involved. Formally, (mathcal {R}) is the set of all reactions r such that: (i) there exists a reaction set h of species S such that r∈h, and (ii) there exists a gene (g in mathcal {G}) such that g is involved in r. In other words, given a group (mathcal {G}) of neighboring genes of S, (mathcal {R}) is the set of reactions in trails common to S and at least one other species in (mathcal {P}) such that reactions in (mathcal {R}) are catalyzed by products of genes in (mathcal {G}). For the above example, (mathcal {R}) is {r_{2}, r_{3}, r_{6}, r_{7}, r_{8}} (see the first column ((mathcal {R})) in Table 1).
Let (mathcal {H}) be the set of genes of S_{1} involved in reactions in (mathcal {R}). That is, given (mathcal {R}), the genome for species S_{1}, and the correspondence between reactions in (mathcal {R}) and genes of S_{1}, (mathcal {H}) is the set of genes in S_{1} (along with their position on the chromosome) such that every gene in (mathcal {H}) is involved in at least one reaction in (mathcal {R}). For the above example, (mathcal {H} = {A_{1}, X_{1}, W_{1}, T_{1}}) (see the third column ((mathcal {H})) in Table 1).
Let (mathcal {H’} subseteq mathcal {H}) be neighboring genes in (mathcal {H}), and let (mathcal {G’} subseteq mathcal {G}) such that genes in (mathcal {H’}) and (mathcal {G’}) are involved in the same reactions in (mathcal {R}). (mathcal {H’}) is chosen such as to maximize (mathcal {G’}), i.e. the number of genes in (mathcal {G}) involved in the same reactions as neighboring genes in (mathcal {H}).
For the above example, gene A_{1} is not a neighbor of gene W_{1}, therefore (mathcal {H’}) must be a strict subset of (mathcal {H}). There are several possible strict non empty subsets of (mathcal {H}) of neighboring genes, other than singletons: {W_{1},T_{1}}, {W_{1},X_{1}}, {T_{1},X_{1}}, and {W_{1},T_{1},X_{1}}. The subset of (mathcal {H}) that is of interest is (mathcal {H’} = {W_{1}, T_{1}, X_{1}}), as it maximizes the number of genes in (mathcal {G}) involved in reactions in (mathcal {R}); (mathcal {G’}) is thus {X,W,T} (see (mathcal {H’}) and (mathcal {G’}), i.e. entries in bold in the third ((mathcal {H})) and second ((mathcal {G})) columns, respectively, in Table 1). The genes in (mathcal {H’}) can be considered neighbors because only gene B_{1} needs to be skipped as it does not code for an enzyme. The subset of reactions in (mathcal {R}) catalyzed by genes in (mathcal {H’}) is therefore {r_{6}, r_{7}, r_{8}}.
For the above example, cells for genes U and V receive a dot symbol (.), whereas cells for genes X, W, and T receive a cross symbol (x) (see the second ((mathcal {G})) and fifth (S_{1} in (T_{S}^{g})) columns in Table 1). The interpretation is that genes X, W, and T of the reference species are involved in reactions catalyzed by neighboring genes in species S_{1}. Notice that from trail grouping by genes alone it is not possible to decide whether the reactions catalyzed by genes U and V are absent from S_{1} or performed by products of non neighboring genes. Trail grouping by genes assigns dot symbols to S_{1} for genes U and V of the reference species. However, trail grouping by reactions assigns a dot symbol to r_{2} and a blank to r_{3}, thus effectively distinguishing between reactions present in S_{1} (r_{2}) and absent from S_{1} (r_{3}).
Pipeline
Trail finding (the HNET algorithm) and trail grouping are implemented in the form of CoMetGeNe, a Python 2.7 pipeline available under a MIT license at https://cometgene.lri.fr. CoMetGeNe is not compatible with Python 3. The following Python libraries are required: NetworkX (version ≥ 1.10 and ≤ 2.2) for graph handling, and lxml (version ≥ 3.5.0) for XML parsing. An Internet connection is mandatory for automatic data retrieval. Pipeline usage is detailed in Additional file 3.
CoMetGeNe files
CoMetGeNe automatically extracts the necessary data from KEGG using its REST API [30]. Metabolic pathways are stored in KGML format in a userspecified directory. Only pathways for primary and secondary metabolism excluding global and overview maps are extracted (i.e., maps whose KEGG identifier is at least 01100 are excluded). Genomic and EC number information are stored in binary format.
Trail finding
The script CoMetGeNe.py offers a convenient commandline interface for Finding metabolic and genomic patterns for a single species. The only required information is the species to be analyzed (designated by its three or fourletter KEGG identifier [31]) and the directory where metabolic pathways of the species in question will be stored. Optionally, the gap parameters δ_{D} and δ_{G} can be specified (their default value being 0), as well as an output file for the results.
An important speedup is attained if CoMetGeNe is ran in parallel using the provided script CoMetGeNe_launcher.py. Restrictions inherent to KEGG limit pathway and genomic information retrieval to 3 and 2 threads, respectively. Trail finding in CoMetGeNe can, however, take full advantage of the maximum number of physical threads. See Results for CoMetGeNe run times.
A potential caveat when running CoMetGeNe in parallel is that KEGG may block concurrent downloads when using a fast Internet connection. Another potential caveat is that the machine may run out of memory on very large datasets (hundreds or thousands of species). In both cases, a possible workaround consists in adjusting parameters for CoMetGeNe_launcher.py. In the latter case, the dataset may be split into several smaller batches. For more details, see the “Trail finding” page on the CoMetGeNe website (https://cometgene.lri.fr/tfinding.html).
Storing metabolic pathways and genomic information for a given species allows CoMetGeNe to perform trail finding without redownloading the same data for subsequent executions, e.g. when CoMetGeNe is ran for the same species but with different gap parameters.
CoMetGeNe uses a configurable timeout (defaulting to 5 min) for analyzing a given metabolic pathway. If the timeout is reached without producing any result, the pathway in question is “blacklisted” for the current species and set of gap parameters. This prevents CoMetGeNe from further attempting to analyze the given pathway for subsequent executions if the gap parameters increase. For example, a pathway that is blacklisted for (δ_{D}=2, δ_{G}=2) will not be further analyzed for (δ_{D},δ_{G})∈{(2,3),(3,2),(3,3)}. The blacklist is stored locally as a text file. Blacklisted pathways are computationally prohibitive due to the exponential number of enumerable paths. However, blacklisted pathways only amount to 3.3% of our dataset (121 out of 3709 pathways).
Trail grouping
Once CoMetGeNe results are available for several species, trail grouping can be performed in order to identify conserved metabolic and genomic patterns for several organisms (see Finding conserved metabolic and genomic patterns across multiple species). The script grouping.py provides this functionality and offers the possibility to save tables (T_{S}^{r}) and (T_{S}^{g}) in CSV format.
Three binary files are created when grouping trails by either reactions or genes. They contain pathway data, genomic information, and parsed CoMetGeNe results that can be reused when choosing another species as reference.
Experimental setup
The test machine is a quadcore 2.6 GHz Intel Xeon E52623 v4 (Broadwell) with 10 MB L3 cache and 64 GB of RAM, running under Ubuntu GNU/Linux 16.04.3 LTS. Although the test machine has 64 GB of main memory, running CoMetGeNe on a single thread only requires approximately 100 MB of RAM.