CoMetGeNe: mining conserved neighborhood patterns in metabolic and genomic contexts

Bioinformatics

Model

A non-spontaneous metabolic reaction is catalyzed by one or several enzymes. A given enzyme can be encoded by one or several genes. We regard metabolic pathways and genomic context as networks of reactions and genes, respectively. We represent the relation between metabolic pathways and their encoding genes using a classical model involving two graphs and a correspondence function:

  1. (i)

    Genomes (viewed as gene networks) are represented as undirected graphs with protein-coding genes for vertices (Fig. 1a). Two protein-coding genes are connected by an edge if they are neighbors on the same strand of the same chromosome.

    Fig. 1

    Schematic view of the CoMetGeNe model linking metabolic reactions and their encoding genes. a The undirected graph G represents the gene order of a given species. The reactions that gene products catalyze are indicated above each gene. b The directed graph D represents a metabolic pathway of the same species as in a. c The correspondence between reactions in D and genes in G. d G is an undirected graph with the same vertex set as D built using the correspondence between reactions and genes. G represents gene neighborhood with respect to the reactions that the gene products catalyze. e L(D) is the line graph of D. By definition of the line graph, vertices of L(D) are arcs in D. Strongly connected components (SCCs) of L(D) are shaded in gray and assigned a label Si. f C is the condensation graph of L(D), obtained by replacing every SCC of L(D) with a single vertex

  2. (ii)

    Metabolic pathways are represented as directed graphs with reactions for vertices (Fig. 1b). An arc leading from reaction ri to rj signifies that reaction ri produces a metabolite that is a substrate for rj.

  3. (iii)

    For a given species S, the relation between one of its metabolic pathways and its genome takes the form of a correspondence function associating genes to reactions: for any given reaction r, the correspondence function returns the set of genes of species S that code for enzymes catalyzing reaction r (Fig. 1c). This information can be found in a knowledge base such as KEGG (Kyoto Encyclopedia of Genes and Genomes) [26] which, for a given species, contains information on its metabolic pathways, the reactions that the species performs, and the genes associated to these reactions.

The method we propose, CoMetGeNe, requires two input graphs possessing the same vertex set. Thus, an additional undirected graph is constructed as described in [27] such that it reflects gene neighborhood with respect to the reactions that the gene products catalyze (Fig. 1d). The additional graph links two reactions ri and rj with an edge if at least one of the genes coding for an enzyme involved in reaction ri is adjacent to a gene coding for an enzyme involved in rj. For example, genes X and Y are neighbors in G (Fig. 1a). Gene X codes for an enzyme involved in reaction r8, and gene Y codes for an enzyme involved in reactions r9 and r10. To reflect adjacency between genes X and Y, reactions r8 and r9, respectively r8 and r10, are linked by an edge in G (Fig. 1d).

Finding metabolic and genomic patterns for a single species

Problem formulation

Given a metabolic pathway and the gene network for the same species, the objective is to identify a maximal number of consecutive reactions being catalyzed by products of neighboring genes. The problem was initially formulated under the name of LONGEST SUPPORTED PATH (LSP) [28], as follows:

LONGEST SUPPORTED PATH (LSP)

Input: A directed graph D=(V,A), an undirected graph G=(V,E).

Output: A longest path P in D such that G[V(P)] is connected.

In the above formulation, the notation G[X], where G is a graph and X is a set of vertices, stands for the subgraph of G induced by X, that is the subgraph of G with vertices of X as its vertex set, and where edges (or arcs in the directed case) are all the edges (or arcs) of G linking two vertices in X (see [29]). Thus the solution for LSP is a path in the directed graph D inducing a connected subgraph in the undirected graph G.

The vast majority of metabolic pathways, however, exhibit cycles (e.g. reversible reactions). Taking cycles into account requires that solutions be authorized to contain repeated vertices. Recall that, contrary to paths, trails can contain repeated vertices, but not repeated arcs [25].

We now define the concept of span and propose a new problem formulation that provides trails as solutions, instead of paths. The span of a trail T represents the number of distinct vertices in T. For example, if T is the trail (r2, r3, r7, r8, r3, r4) in Fig. 1b, then the span of T is 5, because vertex r3 is repeated.

MAXIMUM SPAN SUPPORTED TRAIL (MaSST)

Input: A directed graph D=(V,A), an undirected graph G=(V,E), an arc (u,v) in D.

Output: A trail of maximum span T in D passing through (u,v) such that G[V(T)] is connected.

Whereas LSP produces a path for every graph D, MaSST outputs trails of maximum span passing through arcs of D if the vertex sets of these trails induce connected subgraphs in G. The choice of producing a trail for every arc in D is deliberate in order to ensure that more than a single trail is retrieved per graph. For example, for graphs D (Fig. 1b) and G (Fig. 1d) and the arc (r1,r2), MaSST outputs one of the two following trails of span 8: (r1, r2, r3, r7, r8, r3, r4, r9, r10) or (r1, r2, r7, r8, r3, r4, r9, r10). For any other arc in D, the output of MaSST is either of the two following trails of span 9: (r5, r6, r2, r3, r7, r8, r3, r4, r9, r10) or (r5, r6, r2, r7, r8, r3, r4, r9, r10).

For practical purposes (see Path finding in the line graph below), we solve MaSST by using the line graph of D. Given a directed graph D, its line graph L(D) is a directed graph in which vertices are arcs in D. There is an arc in L(D) from a vertex x to another vertex y if and only if x=(r,s) and y=(s,t) with r, s, tV(D). For example, the graph in Fig. 1e is the line graph of the graph in Fig. 1b.

Let D be a directed graph and L(D) be its line graph. Let P=(a1, a2, …, ak) be a path in L(D), where ai=(ti−1, ti), 1≤ik, are arcs in D. The trail in D corresponding to P, denoted L−1(P), is the trail T=(t0, t1, t2, …, tk−1, tk). If P is an empty path, then L−1(P) is an empty trail. For example, if P is the path ((r3,r7), (r7,r8), (r8,r3)) in Fig. 1e, then L−1(P) is the trail (r3,r7,r8,r3) in the directed graph D.

We further propose MAXIMUM SPAN SUPPORTED CORRESPONDING TRAIL (MaSSCoT), a problem formulation equivalent to MaSST:

MAXIMUM SPAN SUPPORTED CORRESPONDING TRAIL (MaSSCoT)

Input: A directed graph D=(V,A), an undirected graph G=(V,E), an arc (u,v) in D.

Output: A path P in the line graph of D such that L−1(P) has maximum span, passes through (u,v), and G[V(L−1(P))] is connected.

Note that, as LSP has been shown to be NP-hard in the general case [23, 28], we have proved that MaSST and MaSSCoT are also NP-hard (Additional file 1).

Graph reduction

Fertin et al. [23] introduced the concept of a cover set of a path and proposed an algorithm to compute it. Briefly, given two graphs D (directed) and G (undirected) on the same vertex set U, as well as a path P in D, the cover set of P with respect to D and G is a maximal subset of U containing only vertices that might extend P into a path P such that G[V(P)] and the undirected graph underlying D[V(P)] stay connected. We have shown that, for a given arc (u,v) in D, reducing the input graphs D and G to the cover set U of (u,v) and feeding these reduced graphs D[U] and G[U] as input to MaSST and MaSSCoT yields the same solution as providing D and G as input (Additional file 2). In other words, graphs D and G are reduced to a strict minimum without loss of solutions.

Path finding in the line graph

The problem of trail enumeration in the directed graph D modeling a metabolic pathway is naturally solved by performing path enumeration in the line graph L(D). In other words, MaSST is solved using the MaSSCoT problem formulation. Path enumeration in L(D) is restricted to a minimum using the following three steps:

  1. 1

    The strongly connected components (SCCs, see [29] for a definition) of L(D) and its condensation graph are computed, where a condensation graph results from replacing every SCC with a single vertex (Fig. 1e, f). Note that condensation graphs are acyclic by definition.

  2. 2

    For every SCC of L(D), vertices acting as entry points from predecessor SCCs, as well as vertices acting as exit points to successor SCCs are determined. For example, in Fig. 1e, vertices (r2,r3) and (r2,r7) are entry points for SCC S2 when coming from the predecessor SCC S1. Vertex (r3,r4) in S2 is an exit point when heading to SCC S3. In S3, vertex (r4,r9) is both an entry point when coming from predecessor S2 and an exit point when heading to successor S4. S1 has no predecessor SCCs and S4 has no successor SCCs.

  3. 3

    For every SCC X of L(D), path enumeration is performed only between strictly necessary source and destination vertices, as follows: (i) if X has at least one predecessor and one successor SCC, then paths are enumerated between feasible pairs of entry and exit points for these SCCs; (ii) if X has no predecessor and at least one successor SCC, then paths are enumerated between every vertex of X and exit points towards the successor SCC(s); (iii) if X has at least one predecessor and no successor SCC, then paths are enumerated between entry points from the predecessor SCC(s) and every vertex of X; (iv) only if X has no predecessor and no successor SCCs, paths are enumerated between every pair of vertices of X.

The paths obtained through step 3 above are evaluated in terms of span of their corresponding trails in D and the best candidate paths among them are retained. They are referred to as best partial paths.

Concatenation of partial paths

A path Q in the condensation graph C of L(D) is “translated” into one or several paths in L(D) by concatenating best partial paths in SCCs of L(D). Let Ci and Cj be two consecutive vertices of a path Q in C of length at least 1. Let Si and Sj be the SCCs in L(D) corresponding to Ci and Cj, respectively. Then Q has more than one corresponding path in L(D) if Si has at least two exit points when heading to the successor SCC Sj, or if Sj has at least two entry points when coming from the predecessor SCC Si.

For example, two paths in L(D) (Fig. 1e) correspond to path Q1 = (S1, S2, S3, S4) in C (Fig. 1f): P1=((r1,r2), (r2,r7), (r7,r8), (r8,r3), (r3,r4), (r4,r9), (r9,r10)) and (P^{prime }_{1} = ((r_{1}, r_{2})), (r2,r3), (r3,r7), (r7,r8), (r8,r3), (r3,r4), (r4,r9), (r9,r10)). The corresponding trails in D (Fig. 1b) are L−1(P1)=(r1, r2, r7, r8, r3, r4, r9, r10) and L−1(P1′)=(r1, r2, r3, r7, r8, r3, r4, r9, r10), both with span 8. Note that if P1 (respectively (P^{prime }_{1})) passed through arcs (r4,r5), (r5,r6), or (r6,r2), then P1 (respectively (P^{prime }_{1})) would be a trail instead of a path, which is not allowed.

In order to determine the solution to the MaSST problem, all paths in the condensation graph of L(D) are enumerated such that their corresponding paths in L(D) contain the SCC possessing the input arc (u,v) as vertex. If a path in L(D) obtained by concatenating best partial paths contains vertex (u,v), it is then evaluated in terms of its span by comparing it to the best current solution and by updating the current solution if necessary.

For example, let (u,v)=(r2,r7) (Fig. 1b). After translating path Q1 = (S1, S2, S3, S4) in C to a path in L(D), the best current solution P1 has span 8 as shown above. Now, suppose path Q2 = (S2, S3, S4) (Fig. 1f) is enumerated. There is one corresponding path in L(D) (Fig. 1e) passing through (r2,r7), obtained by concatenation of best partial paths in S2, S3, and S4. The best partial path in S2 ends in vertex (r3,r4) (which is an exit point when heading toward S3) and may start with any vertex in S2, provided the corresponding trail in D has maximum span. The path in L(D) corresponding to Q2 is therefore P2 = ((r5,r6), (r6,r2), (r2,r7), (r7,r8), (r8,r3), (r3,r4), (r4,r9), (r9,r10)), for which L−1(P2) has span 9. When P1 and P2 are compared, the best current solution now becomes P2 (because L−1(P2) has maximum span and because G[V(L−1(P2))] is connected (Fig. 1d)).

HNET algorithm

We propose HNET (Heterogeneous NETwork mining), an algorithm that solves the MaSST problem using the MaSSCoT formulation internally (Algorithm 1). Unlike the heuristic solution introduced in [23] to the LSP problem, HNET is an exact method. However, it is not exhaustive, meaning that if several trails of maximum span pass through a given arc (u,v) in D, then only one such trail is reported as solution. The bottleneck in HNET is path enumeration at line 5. In effect, the number of paths between two given vertices of a graph can be exponential with respect to the size of the graph. The exponential worst-case complexity of path enumeration is due to the NP-hardness of MaSST and MaSSCoT. The worst-case scenario occurs when all possible paths are enumerated between all pairs of vertices in a SCC. This scenario occurs in two distinct cases which nonetheless rarely arise in practice. The first case is that of SCCs of D that are completely disconnected from the rest of the graph. Sequences of reactions in metabolic pathways that are completely disconnected from the rest of the pathway are typically very short and therefore not limiting for exhaustive path enumeration. The second case is when D is strongly connected, corresponding to the infrequent situation in which a chain of reactions leads from any reaction ri to any other reaction rj of a given metabolic pathway, and vice versa.

In the following, assume: D=(V,A) is a directed graph; (u,v), an arc in D; G=(V,E), an undirected graph; L(D), the line graph of D; and C, the condensation graph of L(D).

Algorithm GRAPHREDUCTION (line 1) returns the reduced graphs D and G (see Graph reduction above). For graphs D and G in Fig. 1 (panels b and d), the reduced and unreduced graphs are the same. LINEGRAPH (line 2) returns the line graph L(D) of the reduced input graph (Fig. 1b, e). CONDENSATIONGRAPH (line 3) returns the condensation graph of L(D), i.e. the directed acyclic graph obtained by replacing every SCC of L(D) by a single vertex (Fig. 1e, f).

Algorithm ACCESSPOINTS determines entry and exit points for every SCC X of L(D), from SCCs that are predecessors of X and toward SCCs that are successors of X (see Path finding in the line graph above, step 2). This information is stored in a data structure (mathcal {A}) that the algorithm returns at line 4. Algorithm PARTIALPATHS then uses (mathcal {A}) to compute best paths in every SCC X of L(D) (in terms of span of their corresponding trails in D) between all feasible pairs of source and destination vertices. Source vertices are entry points from predecessor SCCs if X has predecessors, and vertices of X otherwise. Reciprocally, destination vertices are exit points to successor SCCs if X has successors, and vertices of X otherwise. These paths, termed best partial paths, are stored in a data structure (mathcal {B}) that the algorithm returns at line 5 (see Path finding in the line graph above, step 3).

At line 6, HNET determines a, the vertex of C whose corresponding SCC in L(D) contains the input arc (u,v) as a vertex. Next, all possible paths in C are enumerated (lines 8-14) and, if they contain vertex a, the corresponding paths in L(D) are obtained by concatenation of best partial paths stored in (mathcal {B}). The best current solution is updated accordingly. A path P in L(D) qualifies as a best current solution if the trail in D corresponding to P, L−1(P), fulfills the following conditions: (i) it contains the input arc (u,v); (ii) it induces a connected subgraph in G; (iii) it has maximum span so far.

Algorithm ENUMERATEPATHS at line 10 returns all paths starting with vertex s and ending in vertex t in the condensation graph. If s and t are the same vertex, the algorithm returns either one. Algorithm FINDPATHS at line 12 returns all paths in L(D) corresponding to path Q in the condensation graph C, obtained by concatenation of best partial paths stored in (mathcal {B}). Given two paths in L(D), algorithm BESTPATH at line 14 returns the best current path, i.e. the path among the two whose corresponding trail in D has greater span than the other (see Concatenation of partial paths above).

Finally, HNET returns the trail in D corresponding to the best solution (line 15), effectively solving the MaSST problem. An additional consistency check is performed as detailed in [27] to ensure that the trail L−1(P) also “makes sense” when passing from G to the initial graph G (see Model and Fig. 1, panels a through d). It is checked whether vertices in G corresponding to the vertex set of the trail are connected. Note that [27] describes a heuristic solution to LSP (see Problem formulation).

Fig. 2

Gene neighborhood for species S and S1.

Allowing for skipped vertices

The MaSST and MaSSCoT formulations imply that solutions consist of strictly neighboring genes and reactions. As in a previous graph-based approach for the integration of heterogeneous biological data in another context [19], a preprocessing step was added to CoMetGeNe in order to allow for non contiguous reactions and/or genes. The preprocessing step consists in modifying the input graphs by adding arcs (respectively edges) between vertices separated by at most δD other reactions (respectively δG other genes). δD and δG are referred to as the gap parameters. Their value should be set quite low (e.g. at most 3) for ensuring that CoMetGeNe results are relevant from a biological point of view.

Finding conserved metabolic and genomic patterns across multiple species

Here we show how trail finding, presented in the previous section, can be used to identify conserved interspecies metabolic and genomic patterns. We developed two methods for grouping trails obtained using the CoMetGeNe pipeline. They rely on examining trails of a given species, the reference species, in terms of either reactions or genes involved in these reactions, with the aim of comparing trails of the reference species with similar trails found for the remaining species. Both methods start out by pooling together all trails produced by the CoMetGeNe pipeline, for every species, every metabolic pathway, and every combination of the gap parameters.

For reasons explained below, both trail grouping methods were designed to treat trails as reaction sets, meaning that the order of reactions is not taken into account and that repeated reactions are ignored. In Fig. 1b, trails t1=(r2, r7, r8, r3, r4) and t2=(r2, r3, r7, r8, r3, r4) both have the same corresponding reaction set {r2, r3, r4, r7, r8}.

As previously explained, CoMetGeNe determines trails of reactions being catalyzed by neighboring genes. The definition of conserved patterns (in terms of metabolic and gene neighborhoods) needs to be able to accommodate slight variations between species. One such variation is encountering a different reaction order between trails. For example, if trails (r9,r10) and (r10,r9) are identified for two different species for the pathway in Fig. 1b, these trails naturally constitute a conserved pattern for the two species. Another variation that needs to be taken into account is best illustrated with the example of trails t1 and t2 above. If these trails are obtained for different species, the common feature is that both species perform the same five reactions using products of neighboring genes, irrespective of reaction order and of whether reaction r3 is repeated. Another example of variation that should not prevent the identification of conserved patterns is related to reactions (or genes) that are present in trails of some, but not all, of the species. For example, suppose the trails t3=(r2, r3, r7) and t4=(r3, r7, r8) are identified for two different species for the pathway in Fig. 1b. The fact that reactions r3 and r7 are common to both trails and are catalyzed by products of neighboring genes for both species should be identified as a conserved pattern. The necessity of accommodating these types of trail variations explains the choice for processing trails as reaction sets during the present trail grouping step.

Let (mathcal {P}) be the panel of selected species under study. Species (S in mathcal {P}) denotes the chosen reference species. Let RS be the set of all reaction sets of S. Note that reaction sets in RS are not disjoint. From a biological standpoint, RS represents the pool of trails of the reference species produced by CoMetGeNe, viewed in terms of reaction sets.

In the following, two genes of a given species are said to be neighboring if they are separated by at most three other genes on the same strand of the same chromosome.

Trail grouping by reactions

Briefly, the method of grouping trails by reactions consists in grouping reactions of the reference species according to the reaction sets they belong to. This grouping method focuses more on metabolic rather than genomic conserved patterns.

Grouping trails by reactions for the reference species S consists in constructing a table (T_{S}^{r}) where rows represent reactions in every reaction set of S and columns represent the remaining species in (mathcal {P}). Table (T_{S}^{r}) reflects conserved metabolic patterns between the reference species and the rest of the panel through the three possible symbols that can be assigned to each cell. These symbols allow to easily distinguish which reactions of the reference species are not present in the other species (blanks), and which are catalyzed by products of neighboring (crosses) and non neighboring (dots) genes of the other species.

For example, for the trail t=(r6, r2, r3, r7, r8) in Fig. 1b and the gene neighborhood in Fig. 2 for the reference species S and another species S1, (T_{S}^{r}) is represented by the first ((mathcal {R})) and fourth (S1 in (T_{S}^{r})) columns in Table 1. Reaction r3 is not performed by species S1. Reactions r6, r7, and r8 are performed by neighboring genes of S1 (T1, W1, and X1, respectively), whereas reaction r2 involves the product of a distant gene.

Table 1

Trail grouping by reactions and by genes for the reference species S against species S1

r
2

U

A
1

.

.

r

8

X

X

1

x

x

r
3

V

 

.

r

7

W

W

1

x

x

r

6

T

T

1

x

x

({mathcal {R}^{prime }})

Neigh. (
({boldsymbol{mathcal {G}}^{prime }})
)

Neigh. (
({boldsymbol{mathcal {H}}^{prime }})
)

   

Rows in table (T_{S}^{r}) represent reactions in RS and are ordered by reaction sets of S. Note that a given reaction performed by species S appears several times in (T_{S}^{r}) if it belongs to several reaction sets. Columns represent the remaining species in (mathcal {P}) and are ordered according to evolutionary distance to S, such that species phylogenetically closer to S have lower column indexes than species phylogenetically distant from S.

Let (T_{S}^{r}[i, j]) denote the cell in (T_{S}^{r}) on row i and column j. Let ri denote the reaction of species S corresponding to row i in (T_{S}^{r}). Let S1 denote the species corresponding to column j in (T_{S}^{r}). Let (mathcal {R} subseteq R_{S}) denote the reaction set of species S to which reaction ri belongs. For the example presented above, the reaction set of species S that is investigated is (mathcal {R} = {r_{2}), r3, r6, r7, r8} (see the first column ((mathcal {R})) in Table 1).

Let (mathcal {R’}) denote a maximal subset of (mathcal {R}) such that the genes of S1 involved in (mathcal {R’}) are neighbors. For the above example, the subset (mathcal {R’}) is {r6, r7, r8} (see (mathcal {R’}), i.e. entries in bold in the first column ((mathcal {R})) in Table 1) because reactions in (mathcal {R’}) involve the neighboring genes T1, W1, and X1, respectively (even though gene B1 is skipped).

One of the following three symbols is assigned to each cell (T_{S}^{r}[i,j]):

  • a cross (x) if (r_{i} in mathcal {R’}).

  • a dot (.) if (r_{i} in mathcal {R} – mathcal {R’}) and ri is performed by species S1.

  • a blank if (r_{i} in mathcal {R} – mathcal {R’}) and ri is not performed by species S1.

For the above example (see the fourth column (S1 in (T_{S}^{r})) in Table 1), the cells corresponding to reactions in (mathcal {R’}) receive a cross symbol (x). Since reaction r2 is performed in S1 by gene A1 which is not a neighbor of W1, T1, or X1, the corresponding cell on column S1 in (T_{S}^{r}) receives a dot symbol (.). Finally, reaction r3 is absent from S1, therefore the corresponding cell receives a blank. The interpretation is that reactions r6, r7, and r8 are performed in species S1 by products of neighboring genes. Reaction r3 is absent from S1, whereas the gene involved in r2 is not a neighbor of genes involved in reactions r6, r7, and r8.

Trail grouping by genes

Here, we group CoMetGeNe reaction sets according to the gene order of the reference species. This second grouping method focuses more on genomic rather than metabolic conserved patterns. Two genes coding for enzymes involved in the same metabolic reaction are referred to as functionally similar genes. Functionally similar genes in two species can be either analogues (products of convergent evolution) or homologues (products of divergent evolution).

Grouping trails by genes consists in constructing a table (T_{S}^{g}) where rows represent genes of the reference species S involved in reaction sets shared by S and at least one other species in (mathcal {P}), and columns represent the remaining species in (mathcal {P}). Table (T_{S}^{g}) reflects conserved genomic patterns between the reference species and the rest of the panel through the two possible symbols that can be assigned to each cell. These symbols allow to easily differentiate genes of S with neighboring (crosses) and non neighboring (dots) functionally similar genes in other species.

For example, for the trail t=(r6, r2, r3, r7, r8) in Fig. 1b and the gene neighborhood in Fig. 2 for the reference species S and another species S1, (T_{S}^{g}) is represented by the second ((mathcal {G})) and fifth (S1 in (T_{S}^{g})) columns in Table 1. Genes X1, W1, and T1 of S1 respectively have the neighboring functionally similar genes X, W, and T in the reference species S.

Let (R_{S_{1}}) be the set of all reaction sets for species (S_{1} in mathcal {P} – {S}). Let R be the set of reaction sets defined by:

$$R = R_{S} cap left(bigcuplimits_{S_{1} in mathcal{P} – {S}} R_{S_{1}} right) $$

Hence, R represents the set of reaction sets common to S and at least one other species in (mathcal {P}). Let GS be the set of genes of the reference species S that are involved in reactions belonging to reaction sets of R. From a biological standpoint, GS represents the pool of genes of the reference species coding for enzymes involved in reaction sets common to S and at least one other species in (mathcal {P}).

Rows in table (T_{S}^{g}) represent genes from GS and are ordered by chromosome and strand, according to the position of genes on the strand. Columns represent the remaining species in (mathcal {P}) and are ordered according to evolutionary distance to S (see Trail grouping by reactions).

Let S1 denote the species corresponding to column j in (T_{S}^{g}). Let (mathcal {G}) be a subset of GS such that genes in (mathcal {G}) are neighbors on the same strand and chromosome of S. For the example presented above, the gene group of species S that is investigated is (mathcal {G} = {U, X, V, W, T}) (see the second column ((mathcal {G})) in Table 1).

Let (mathcal {R}) be the set of reactions in all reaction sets in which the genes in (mathcal {G}) are involved. Formally, (mathcal {R}) is the set of all reactions r such that: (i) there exists a reaction set h of species S such that rh, and (ii) there exists a gene (g in mathcal {G}) such that g is involved in r. In other words, given a group (mathcal {G}) of neighboring genes of S, (mathcal {R}) is the set of reactions in trails common to S and at least one other species in (mathcal {P}) such that reactions in (mathcal {R}) are catalyzed by products of genes in (mathcal {G}). For the above example, (mathcal {R}) is {r2, r3, r6, r7, r8} (see the first column ((mathcal {R})) in Table 1).

Let (mathcal {H}) be the set of genes of S1 involved in reactions in (mathcal {R}). That is, given (mathcal {R}), the genome for species S1, and the correspondence between reactions in (mathcal {R}) and genes of S1, (mathcal {H}) is the set of genes in S1 (along with their position on the chromosome) such that every gene in (mathcal {H}) is involved in at least one reaction in (mathcal {R}). For the above example, (mathcal {H} = {A_{1}, X_{1}, W_{1}, T_{1}}) (see the third column ((mathcal {H})) in Table 1).

Let (mathcal {H’} subseteq mathcal {H}) be neighboring genes in (mathcal {H}), and let (mathcal {G’} subseteq mathcal {G}) such that genes in (mathcal {H’}) and (mathcal {G’}) are involved in the same reactions in (mathcal {R}). (mathcal {H’}) is chosen such as to maximize (|mathcal {G’}|), i.e. the number of genes in (mathcal {G}) involved in the same reactions as neighboring genes in (mathcal {H}).

For the above example, gene A1 is not a neighbor of gene W1, therefore (mathcal {H’}) must be a strict subset of (mathcal {H}). There are several possible strict non empty subsets of (mathcal {H}) of neighboring genes, other than singletons: {W1,T1}, {W1,X1}, {T1,X1}, and {W1,T1,X1}. The subset of (mathcal {H}) that is of interest is (mathcal {H’} = {W_{1}, T_{1}, X_{1}}), as it maximizes the number of genes in (mathcal {G}) involved in reactions in (mathcal {R}); (mathcal {G’}) is thus {X,W,T} (see (mathcal {H’}) and (mathcal {G’}), i.e. entries in bold in the third ((mathcal {H})) and second ((mathcal {G})) columns, respectively, in Table 1). The genes in (mathcal {H’}) can be considered neighbors because only gene B1 needs to be skipped as it does not code for an enzyme. The subset of reactions in (mathcal {R}) catalyzed by genes in (mathcal {H’}) is therefore {r6, r7, r8}.

Let (T_{S}^{g}[i, j]) denote the cell in (T_{S}^{g}) on row i and column j, where i is the index in GS of a gene gi in (mathcal {G}). One of the following two symbols is assigned to each cell (T_{S}^{g}[i,j]):

For the above example, cells for genes U and V receive a dot symbol (.), whereas cells for genes X, W, and T receive a cross symbol (x) (see the second ((mathcal {G})) and fifth (S1 in (T_{S}^{g})) columns in Table 1). The interpretation is that genes X, W, and T of the reference species are involved in reactions catalyzed by neighboring genes in species S1. Notice that from trail grouping by genes alone it is not possible to decide whether the reactions catalyzed by genes U and V are absent from S1 or performed by products of non neighboring genes. Trail grouping by genes assigns dot symbols to S1 for genes U and V of the reference species. However, trail grouping by reactions assigns a dot symbol to r2 and a blank to r3, thus effectively distinguishing between reactions present in S1 (r2) and absent from S1 (r3).

Pipeline

Trail finding (the HNET algorithm) and trail grouping are implemented in the form of CoMetGeNe, a Python 2.7 pipeline available under a MIT license at https://cometgene.lri.fr. CoMetGeNe is not compatible with Python 3. The following Python libraries are required: NetworkX (version ≥ 1.10 and ≤ 2.2) for graph handling, and lxml (version ≥ 3.5.0) for XML parsing. An Internet connection is mandatory for automatic data retrieval. Pipeline usage is detailed in Additional file 3.

CoMetGeNe files

CoMetGeNe automatically extracts the necessary data from KEGG using its REST API [30]. Metabolic pathways are stored in KGML format in a user-specified directory. Only pathways for primary and secondary metabolism excluding global and overview maps are extracted (i.e., maps whose KEGG identifier is at least 01100 are excluded). Genomic and EC number information are stored in binary format.

Trail finding

The script CoMetGeNe.py offers a convenient command-line interface for Finding metabolic and genomic patterns for a single species. The only required information is the species to be analyzed (designated by its three- or four-letter KEGG identifier [31]) and the directory where metabolic pathways of the species in question will be stored. Optionally, the gap parameters δD and δG can be specified (their default value being 0), as well as an output file for the results.

An important speedup is attained if CoMetGeNe is ran in parallel using the provided script CoMetGeNe_launcher.py. Restrictions inherent to KEGG limit pathway and genomic information retrieval to 3 and 2 threads, respectively. Trail finding in CoMetGeNe can, however, take full advantage of the maximum number of physical threads. See Results for CoMetGeNe run times.

A potential caveat when running CoMetGeNe in parallel is that KEGG may block concurrent downloads when using a fast Internet connection. Another potential caveat is that the machine may run out of memory on very large datasets (hundreds or thousands of species). In both cases, a possible workaround consists in adjusting parameters for CoMetGeNe_launcher.py. In the latter case, the dataset may be split into several smaller batches. For more details, see the “Trail finding” page on the CoMetGeNe website (https://cometgene.lri.fr/tfinding.html).

Storing metabolic pathways and genomic information for a given species allows CoMetGeNe to perform trail finding without re-downloading the same data for subsequent executions, e.g. when CoMetGeNe is ran for the same species but with different gap parameters.

CoMetGeNe uses a configurable timeout (defaulting to 5 min) for analyzing a given metabolic pathway. If the timeout is reached without producing any result, the pathway in question is “blacklisted” for the current species and set of gap parameters. This prevents CoMetGeNe from further attempting to analyze the given pathway for subsequent executions if the gap parameters increase. For example, a pathway that is blacklisted for (δD=2, δG=2) will not be further analyzed for (δD,δG){(2,3),(3,2),(3,3)}. The blacklist is stored locally as a text file. Blacklisted pathways are computationally prohibitive due to the exponential number of enumerable paths. However, blacklisted pathways only amount to 3.3% of our dataset (121 out of 3709 pathways).

Trail grouping

Once CoMetGeNe results are available for several species, trail grouping can be performed in order to identify conserved metabolic and genomic patterns for several organisms (see Finding conserved metabolic and genomic patterns across multiple species). The script grouping.py provides this functionality and offers the possibility to save tables (T_{S}^{r}) and (T_{S}^{g}) in CSV format.

Three binary files are created when grouping trails by either reactions or genes. They contain pathway data, genomic information, and parsed CoMetGeNe results that can be reused when choosing another species as reference.

Experimental setup

The test machine is a quad-core 2.6 GHz Intel Xeon E5-2623 v4 (Broadwell) with 10 MB L3 cache and 64 GB of RAM, running under Ubuntu GNU/Linux 16.04.3 LTS. Although the test machine has 64 GB of main memory, running CoMetGeNe on a single thread only requires approximately 100 MB of RAM.

Articles You May Like

Kinetic analysis of multistep USP7 mechanism shows critical role for target protein in activity
The Structural Violence of Hyperincarceration — A 44-Year-Old Man with Back Pain
Sacklers Directed Efforts to Mislead Public About OxyContin, New Documents Indicate
Ideas for a bioinformatics based web app
RNA Sequencing – Case Studies & Clinical Relevance

Leave a Reply

Your email address will not be published. Required fields are marked *