Abstract
Gene regulatory networks are composed of subnetworks that are often shared across biological processes, cell types, and organisms. Leveraging multiple sources of information, such as publicly available gene expression datasets, could therefore be helpful when learning a network of interest. Integrating data across different studies, however, raises numerous technical concerns. Hence, a common approach in network inference, and broadly in genomics research, is to separately learn models from each dataset and combine the results. Individual models, however, often suffer from undersampling, poor generalization and limited network recovery. In this study, we explore previous integration strategies, such as batch-correction and model ensembles, and introduce a new multi-task learning approach for joint network inference across several datasets. Our method initially estimates the activities of transcription factors, and subsequently infers the relevant network topology. As regulatory interactions are context-dependent, we estimate model coefficients as a combination of both dataset-specific and conserved components. In addition, adaptive penalties may be used to favor models that include interactions derived from multiple sources of prior knowledge, including orthogonal genomics experiments. We evaluate generalization and network recovery using examples from Bacillus subtilis and Saccharomyces cerevisiae, and show that sharing information across models improves network reconstruction. Finally, we demonstrate robustness to both false positives in the prior information and heterogeneity among datasets.
Author summary
Due to the increasing availability of biological data, methods to properly integrate data generated across the globe become essential for extracting reproducible insights into relevant research questions. In this work, we developed a framework to reconstruct gene regulatory networks from expression datasets generated in separate studies—and thus, because of technical variation (different dates, handlers, laboratories, protocols, etc.), challenging to integrate. Since regulatory mechanisms are often shared across conditions, we hypothesized that drawing conclusions from various data sources would improve performance of gene regulatory network inference. By transferring knowledge among regulatory models, our method is able to detect weaker patterns that are conserved across datasets, while also being able to detect dataset-unique interactions. We also allow incorporation of prior knowledge on network structure to favor models that are somewhat similar to the prior itself. Using two model organisms, we show that joint network inference outperforms inference from a single dataset. We also demonstrate that our method is robust to false edges in the prior and to low condition overlap across datasets, and that it can outperform current data integration strategies.
Citation: Castro DM, de Veaux NR, Miraldi ER, Bonneau R (2019) Multi-study inference of regulatory networks for more accurate models of gene regulation. PLoS Comput Biol 15(1):
e1006591.
https://doi.org/10.1371/journal.pcbi.1006591
Editor: Erik van Nimwegen,
University of Basel, SWITZERLAND
Received: March 6, 2018; Accepted: October 23, 2018; Published: January 24, 2019
Copyright: © 2019 Castro et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Link to the code with example data is provided below. https://github.com/simonsfoundation/multitask_inferelator/tree/AMuSR.
Funding: This work was funded by the Simons Foundation and the U.S. National Institute of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
This is a PLOS Computational Biology Methods paper.
Introduction
Gene regulatory network inference aims at computationally deriving and ranking regulatory hypotheses on transcription factor-target gene interactions [1–3]. Often, these regulatory models are learned from gene expression measurements across a large number of samples. Strategies to obtain such data range from combining several publicly available datasets to generating large expression datasets from scratch [4–7]. Given decreasing costs of sequencing and the exponential growth in the availability of gene expression data in public databases [8, 9], data integration across several studies becomes particularly promising for an increasing number of biological systems.
In theory, multi-study analyses provide a better representation of the underlying cellular regulatory network, possibly revealing insights that could not be uncovered from individual studies [6]. In practice, however, biological datasets are highly susceptible to batch effects [10], systematic sources of technical variation due to different reagents, machines, handlers, etc. that complicate omics meta-analyses [11, 12]. Although several methods to remove batch effects from expression data have been developed, they often rely on evenly distributed experimental designs across batches [13, 14]. Batch-correction methods may deflate relevant biological variability or induce incorrect differences between experimental groups when conditions are unbalanced across batches, which can significantly affect downstream analyses [15]. Therefore, these batch-effect removal methods are not applicable when integrating public data from multiple sources with widely differing experimental designs.
In network inference, an approach often taken to bypass batch effects is to learn models from each dataset separately and combine the resulting networks [16, 17]. Known as ensemble learning, this idea of synthesizing several weaker models into a stronger aggregate model is commonly used in machine learning to prevent overfitting and build more generalizable prediction models [18]. In several scenarios, ensemble learning avoids the additional artifacts and complexity that may be introduced by explicitly modeling batch effects. On the other hand, the relative sample size of each dataset is smaller when using ensemble methods, likely decreasing the ability of an algorithm to detect relevant interactions. As regulatory networks are highly context-dependent [19] (for example, TF binding to several promoters is condition-specific [20]), a drawback of both batch-correction and ensemble methods is that they produce a single network model to explain the data across datasets. Relevant dataset-specific interactions might not be recovered, or may be difficult to tell apart within a single model.
Although it will not be the primary focus of this paper, most modern network inference algorithms integrate multiple data types to derive priors or constraints on network structure. These priors/constraints have been shown to dramatically improve network model selection performance when combined with the state variables provided by expression data. In these methods [17, 21], priors or constraints on network structure (derived from multiple sources such as known interactions, ATAC-seq, DHS, or ChIP-seq experiments [22–24]) are used to influence the penalty on adding model components, where edges in the prior are effectively penalized less. Here we describe a method that builds on that work (and similar work in other fields), but in addition we let the model inference processes (each carried out using a separate dataset) influence each other's model penalties, so that edges that agree across inference tasks are more likely to be uncovered [25–31]. Several previous works on this front focused on enforcing similarity across models by penalizing differences in the strength and direction of regulatory interactions using a fusion penalty [25, 27, 28]. Because the influence of regulators on the expression of targets may vary across datasets, possibly even due to differences in measurement technologies, we instead induce similarity on network structure (the choice of regulators) using a group-sparse penalty. Previous methods also applied this type of penalty [26, 29, 31]; however, they were not robust to differences in relevant edges across datasets.
Here we propose a multi-task learning (MTL) approach that exploits cross-dataset commonalities while recognizing differences, and that can incorporate prior knowledge on network structure when available [32, 33]. In this framework, information flow across datasets leads the algorithm to prefer solutions that better generalize across domains, thus reducing the chance of overfitting and improving model predictive power [34]. Since biological datasets are often undersampled, we hypothesize that sharing information across models inferred from multiple datasets using an explicit multi-task learning framework will improve the accuracy of inferred network models in a variety of common experimental designs/settings.
In this paper, we show that joint inference significantly improves network recovery using examples from two model organisms, Bacillus subtilis and Saccharomyces cerevisiae. We show that models inferred for each dataset using our MTL approach (which adaptively penalizes conserved and dataset-unique model components separately) are substantially more accurate than models inferred separately using a single-task learning (STL) approach. We also explore commonly used data integration strategies, and show that MTL outperforms both batch-correction and ensemble approaches. In addition, we demonstrate that our method is robust to noise in the input prior information. Finally, we examine conserved and dataset-specific inferred interactions, and show that our method can leverage cross-dataset commonalities while being robust to differences.
Results
Overview of network inference algorithm
To improve regulatory network inference from expression data, we developed a framework that leverages training signals across related expression datasets. For each gene, we assume that its regulators may overlap across conditions in related datasets, and thus that we can increase our ability to uncover accurate regulatory interactions by inferring them jointly. Our method takes as input multiple expression datasets and priors on network structure, and outputs regulatory hypotheses, each associated with a confidence score proportional to our belief that the prediction is true (Fig 1A). As in previous studies [17, 35–37], our method includes an intermediate step that estimates transcription factor activities (TFA), and then models gene expression as a function of those estimates (Fig 1B).
Fig 1. Gene regulatory network inference schematic.
(A) Our network inference algorithm takes as input a gene expression matrix, X, and a prior on network structure, and outputs regulatory hypotheses of regulator-target interactions. (B) Using priors on network topology and gene expression data, we estimate transcription factor activities (TFA), and subsequently model gene expression as a function of these activities. (C) We use several possible sources of prior information on network topology. (D) Prior information is encoded in a matrix P, where positive and negative entries represent known activation and repression respectively, whereas zeros represent the absence of a known regulatory interaction. To estimate hidden activities, we consider X = PA (top), where the only unknown is the activities. Of note, a time-delay is implemented for time-series experiments (bottom). (E) Finally, for each gene, we find regulators that influence its expression using regularized linear regression. We learn these influences, or weights, either for each dataset independently, single-task learning (top), or jointly through multi-task learning (bottom).
In our model, TFA represents a relative quantification of the active protein that is inducing or repressing the transcription of its targets in a given sample, and is an attempt to abstract away unmeasured factors that influence TFA in a living cell [37–39], such as post-translational regulation [40], protein-protein interactions [41], and chromatin accessibility [42]. We estimate TFA from partial knowledge of the network topology (Fig 1C) [21, 43–47] and gene expression data, as previously proposed (Fig 1D) [17]. This is comparable to using a TF's targets collectively as a reporter for its activity.
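To make this estimation step concrete, the following minimal sketch (hypothetical variable names and toy dimensions, not the package's actual implementation) solves X ≈ PA for the unknown activities A in the least-squares sense via the pseudo-inverse of the prior matrix P:

```python
import numpy as np

def estimate_tfa(X, P):
    """Estimate TF activities A from expression X (genes x samples) and a
    signed prior P (genes x TFs) by solving X ~= P A in the least-squares
    sense, using the pseudo-inverse of P."""
    return np.linalg.pinv(P) @ X

# Toy example: 3 genes, 2 TFs, 4 samples.
P = np.array([[1.0, 0.0],
              [1.0, -1.0],   # gene 2 activated by TF1, repressed by TF2
              [0.0, 1.0]])
A_true = np.array([[1.0, 2.0, 0.0, 1.0],   # activity of TF1 per sample
                   [0.5, 0.0, 1.0, 1.0]])  # activity of TF2 per sample
X = P @ A_true                 # noiseless expression consistent with the prior
A_hat = estimate_tfa(X, P)     # recovers A_true exactly in this toy case
```

Because P here has full column rank and X is noiseless, the estimate is exact; with real data the same solve gives the least-squares fit.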
Next, we learn the dependencies between gene expression and TFA and score predicted interactions. In this step, our method departs from previous work: we employ multi-task learning to learn regulatory models across datasets jointly, as opposed to single-task learning, where network inference is performed for each dataset independently (Fig 1E). As genes are known to be regulated by a small number of TFs [48], we can assume that these models are sparse, that is, they contain only a few nonzero entries [3]. We thus implement both approaches using sparsity-inducing penalties derived from the lasso [49]. Here the network model is represented as a matrix for each target gene (where columns are datasets/cell types/studies and rows are potential regulators) with signed entries corresponding to the strength and type of regulation.
Importantly, our MTL approach decomposes this model coefficient matrix into a dataset-specific component and a conserved component, enabling us to penalize dataset-unique and conserved interactions separately for each target gene [32]; this separation captures differences in regulatory networks across datasets (Fig 2). Specifically, we apply an l_{1}/l_{∞} penalty to the conserved component to encourage similarity between network models [50], and an l_{1}/l_{1} penalty to the dataset-specific component to accommodate differences [32]. We also incorporate prior knowledge by using adaptive weights when penalizing different coefficients in the l_{1}/l_{1} penalty [33]. Finally, we perform this step for several bootstraps of the conditions in the expression and activities matrices, and calculate a confidence score for each predicted interaction that reflects both the stability across bootstraps and the proportion of variance in the target's expression explained by each predictor.
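The bootstrap scoring step can be sketched as follows. This is a simplified stand-in that scores each predictor only by its selection frequency across bootstraps, whereas the actual score also incorporates variance explained; function and variable names are hypothetical:

```python
import random

def bootstrap_confidences(samples, fit_model, n_boot=20, seed=0):
    """Re-fit a sparse model on resampled conditions and score each
    selected predictor by the fraction of bootstraps in which it appears
    (a simplified stand-in for the combined stability and
    variance-explained score described in the text)."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        boot = [rng.choice(samples) for _ in samples]  # resample conditions
        for predictor in fit_model(boot):
            counts[predictor] = counts.get(predictor, 0) + 1
    return {p: c / n_boot for p, c in counts.items()}

# Toy "model": always selects tf1; selects tf2 only if condition 0 is drawn.
samples = [0, 1, 2, 3]
conf = bootstrap_confidences(
    samples, lambda boot: {"tf1"} | ({"tf2"} if 0 in boot else set()))
```

Here `tf1` earns a confidence of 1.0 (selected in every bootstrap), while `tf2` receives a lower, resampling-dependent score.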
Fig 2. Representation of the weights matrix for one gene in the multi-task setting.
We represent model coefficients as a matrix W (predictors by datasets) where nonzero rows represent predictors relevant to all datasets. We decompose the weights into two components and regularize them differently, using a sparse penalty (l_{1}/l_{1} on the S component) to encode a dataset-specific component and a block-sparse penalty (l_{1}/l_{∞} on the B component) to encode a conserved one. To illustrate, in this example, nonzero weights are shown on the right side. Note that, in this schematic example, regulators w3 and w7 are shared between all datasets. We also show the objective function minimized to estimate S and B at the bottom (for details, see Methods).
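For illustration, the toy sketch below evaluates the two regularization terms on a hypothetical decomposition W = S + B for one target gene (regularization constants omitted; all numbers invented):

```python
import numpy as np

def l1_l1(S):
    # Sparse penalty: sum of absolute values of all entries, allowing
    # the selected regulators to differ per dataset.
    return np.abs(S).sum()

def l1_linf(B):
    # Block-sparse penalty: largest absolute coefficient per predictor
    # (row) across datasets, summed over predictors; rows are kept or
    # dropped jointly, so surviving predictors are shared.
    return np.abs(B).max(axis=1).sum()

# Toy coefficients for one target gene: 4 candidate regulators x 2 datasets.
B = np.array([[1.0, 0.8],    # activator shared by both datasets
              [0.0, 0.0],
              [-0.5, -0.6],  # repressor shared by both datasets
              [0.0, 0.0]])
S = np.array([[0.0, 0.0],
              [0.7, 0.0],    # regulator unique to dataset 1
              [0.0, 0.0],
              [0.0, 0.0]])
W = B + S  # full coefficient matrix
penalty = l1_l1(S) + l1_linf(B)
```

The l_{1}/l_{∞} term charges a shared regulator only once per row regardless of how many datasets use it, which is what makes conserved structure cheaper than dataset-unique structure.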
Our method is readily available in an open-source package, Inferelator-AMuSR (Adaptive Multiple Sparse Regression), enabling TF activity estimation and multi-source gene regulatory network inference, ultimately facilitating mechanistic interpretations of gene expression data for the biology community. In addition, this method allows for adaptive penalties that favor interactions with prior knowledge in proportion to the user-defined belief that interactions in the prior are true. Finally, our implementation includes several mechanisms that speed up computation, making it scalable for the datasets used here, and supports parallel computing across multiple nodes and cores in several computing environments.
Model organisms, expression datasets, and priors
We validated our approach using two model organisms, a gram-positive bacterium, B. subtilis, and a eukaryote, S. cerevisiae. The availability of validated TF-target regulatory interactions, hereafter referred to as the gold-standard, makes both organisms a good choice for exploring inference methods (3040 interactions connecting 153 TFs to 1822 target genes for B. subtilis [17, 46]; 1198 interactions connecting 91 TFs to 842 targets for S. cerevisiae [51]). For B. subtilis, we use two expression datasets. The first, B. subtilis 1, was collected for strain PY79 and contains multiple knockouts, competence- and sporulation-inducing conditions, and chemical treatments (429 samples, 38 experimental designs with multiple time-series experiments) [17]. The second dataset, B. subtilis 2, was collected for strain BSB1 and contains several nutritional and other environmental stresses, as well as competence- and sporulation-inducing conditions (269 samples, 104 conditions) [52]. For S. cerevisiae, we downloaded three expression datasets from the SPELL database [53]. S. cerevisiae 1 is a compendium of steady-state chemostat cultures with several combinations of cultivation parameters (170 samples, 55 conditions) [54]. S. cerevisiae 2 profiles two yeast strains (BY and RM) grown with two carbon sources, glucose and ethanol, at different concentrations (246 samples, 109 conditions) [55]. Finally, S. cerevisiae 3 contains expression profiles following several mutations and chemical treatments (300 samples) [56]. Each dataset was collected using a different microarray platform, and cross-platform data aggregation is well known to cause strong batch effects [10]. For each species, we considered the set of genes present across datasets.
In our inference framework, prior knowledge on network topology is essential, first to estimate transcription factor activities and then to bias model selection towards interactions with prior information during the network inference stage of the algorithm. Therefore, to properly evaluate our method, it is necessary to gather prior interactions independent of the ones in the gold-standard. For B. subtilis, we adopt the previously used strategy of partitioning the initial gold-standard into two disjoint sets, a prior for use in network inference and a gold-standard to evaluate model quality [17]. For S. cerevisiae, on the other hand, we wanted to explore a more realistic scenario, where a gold-standard is often not available. In the absence of such information, we hypothesized that orthogonal high-throughput datasets would provide insight. Because the yeast gold-standard [51] was built as a combination of TF-binding (ChIP-seq, ChIP-chip) and TF knockout datasets available in the YEASTRACT [47] and SGD [57] databases, we propose to derive prior knowledge from chromatin accessibility data [22, 23] and TF binding sites [58] (as this is a realistic and efficient genomic experimental design for non-model organisms). Open regions in the genome can be scanned for transcription factor binding sites, which can provide indirect evidence of regulatory function [59]. We then assigned TFs to the closest downstream gene, and built a prior matrix whose entries represent the number of motifs for a particular TF associated with a gene [60, 61]. We obtained a list of regulators from the YeastMine database [62], which we also used to sign entries in the prior: interactions for regulators described as repressors were marked as negative.
Because genome-wide measurements of DNA accessibility can be obtained in a single experiment, using techniques that take advantage of the sensitivity of nucleosome-free DNA to endonuclease digestion (DNase-seq) or to Tn5 transposase insertion (ATAC-seq) [63], we expect this approach to be generalizable to several biological systems.
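The prior construction described above can be roughly sketched as follows, with illustrative stand-ins (motif hit lists, gene start coordinates, a repressor set) rather than the actual pipeline inputs: motifs found in open chromatin are assigned to the closest downstream gene and counted, and interactions for annotated repressors are signed negative.

```python
def build_prior(motif_hits, gene_starts, repressors):
    """motif_hits: list of (tf_name, genomic_position) for motifs in open
    chromatin; gene_starts: {gene: start_position} on the same strand;
    repressors: set of TF names annotated as repressors.
    Returns {(tf, gene): signed motif count}."""
    prior = {}
    for tf, pos in motif_hits:
        # Assign the motif to the closest downstream gene, if any.
        downstream = [(start - pos, g) for g, start in gene_starts.items()
                      if start >= pos]
        if not downstream:
            continue
        _, gene = min(downstream)
        sign = -1 if tf in repressors else 1
        prior[(tf, gene)] = prior.get((tf, gene), 0) + sign
    return prior

# Toy genome: two motif hits for TF1 upstream of geneA, one TF2 (repressor)
# hit upstream of geneB.
hits = [("TF1", 100), ("TF1", 150), ("TF2", 300)]
genes = {"geneA": 200, "geneB": 400}
prior = build_prior(hits, genes, repressors={"TF2"})
```

A real pipeline would additionally handle strand, distance cutoffs, and overlapping genes; this sketch only captures the count-and-sign idea.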
Sharing information across network models via multi-task learning improves model accuracy
Using the above expression datasets and priors, we learn regulatory networks for each organism employing both single-task and our multi-task approaches. To provide an intuition for cross-dataset transfer of knowledge, we compare the confidence scores attributed to a single gold-standard interaction using either STL or MTL for each organism. For B. subtilis, we look at the interaction between the TF sigB and the gene ydfP (Fig 3A). The relationship between sigB activity and ydfP expression is weaker in B. subtilis 1 than in B. subtilis 2. This is reflected in the predicted confidence scores, a quarter as strong for B. subtilis 1 as for B. subtilis 2, when each dataset is used separately to learn networks through STL. On the other hand, when we learn these networks in the MTL framework, information flows from B. subtilis 2 to B. subtilis 1, and we assign a high confidence score to this interaction in both networks. Similarly, for S. cerevisiae, we look at the interaction between the TF Rap1 and the target gene Rpl12a (Fig 3B). In this particular case, we observe a strong and easier-to-uncover relationship between Rap1 estimated activity and Rpl12a expression for all datasets. Indeed, we assign a nonzero confidence score to this interaction for all datasets using STL, although for S. cerevisiae 2 and 3 these are much smaller than the scores attributed when networks are learned using MTL.
Fig 3. Multi-task learning improves accuracy of inferred networks.
(A) Relationship between TF activity and target expression in B. subtilis 1 (blue) and in B. subtilis 2 (orange), and corresponding STL and MTL inferred confidence scores, for an example interaction in the B. subtilis gold-standard, sigB to ydfP. (B) As in (A), but for an interaction in the S. cerevisiae gold-standard, Rap1 to Rpl12a. (C) Precision-recall curves assessing the accuracy of network models inferred for individual B. subtilis datasets against a leave-out set of interactions. Barplots show the mean area under the precision-recall curve (AUPR) for each method and dataset. Error bars show the standard deviation across 10 splits of the gold-standard into prior and evaluation sets. (D) Precision-recall curves assessing the accuracy of network models inferred for individual S. cerevisiae networks, with the difference that the prior is from an independent source (no splits or replicates).
In order to evaluate the overall quality of the inferred networks, we use the area under the precision-recall curve (AUPR) [16], widely used to quantify a classifier's ability to distinguish two classes and to rank predictions. Networks learned using MTL are significantly more accurate than networks learned using the STL approach. For B. subtilis (Fig 3C), we observe around a 30% gain in AUPR for both datasets, indicating significant complementarity between the datasets. For S. cerevisiae (Fig 3D), we observe a clear increase in performance for networks inferred for every dataset, indicating that our method is robust to both data heterogeneity and potential false edges derived from chromatin accessibility in the prior. These experiments were also performed using TF expression as covariates, instead of TF activities, and those results are shown in (S1A and S1B Fig). Although we recommend using TFA for the organisms tested here, MTL also improves the performance for each dataset-specific network in this scenario.
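For reference, precision, recall, and AUPR over a ranked edge list can be computed as in the following self-contained sketch (toy edges; step-wise integration of the precision-recall curve):

```python
def precision_recall_aupr(scores, gold):
    """scores: {edge: confidence}; gold: set of true edges (restricted to
    the TFs and genes actually evaluated). Returns precision and recall
    along the ranked prediction list, plus the area under the
    precision-recall curve by step-wise integration."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    tp, precision, recall = 0, [], []
    for i, edge in enumerate(ranked, start=1):
        tp += edge in gold
        precision.append(tp / i)
        recall.append(tp / len(gold))
    aupr = recall[0] * precision[0] + sum(
        (recall[i] - recall[i - 1]) * precision[i]
        for i in range(1, len(recall)))
    return precision, recall, aupr

# Toy network: three predicted edges, two of which are in the gold standard.
scores = {("tfA", "g1"): 0.9, ("tfA", "g2"): 0.7, ("tfB", "g1"): 0.4}
gold = {("tfA", "g1"), ("tfB", "g1")}
prec, rec, aupr = precision_recall_aupr(scores, gold)
```

Libraries such as scikit-learn provide equivalent routines; the sketch is shown only to make the metric explicit.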
Benefits of multi-task learning exceed those from batch-correction and ensemble methods
Next, we asked whether the higher performance of the MTL framework could be achieved by other commonly used data integration strategies, such as batch-correction and ensemble methods. Ensemble methods include several algebraic combinations of predictions from separate classifiers trained within a single domain (sum, mean, maximum, minimum [64]). To address this question, we evaluated networks inferred using all available data. First, we combined regulatory models inferred for each dataset either through STL or MTL by taking the average rank for each interaction, generating two networks hereafter called STL-C and MTL-C [16]. For each organism, we also merged all datasets into one and applied ComBat for batch-correction [65], because of its perceived higher performance [66]. We then learned network models from these larger batch-corrected datasets, STL-BC. Both for B. subtilis (Fig 4A) and S. cerevisiae (Fig 4B), the MTL-C networks outperform the STL-C and STL-BC networks, indicating that cross-dataset information sharing during modeling is a better approach to integrating datasets from different domains. Interestingly, for B. subtilis, the STL-BC network has a higher performance than the STL-C network, whereas for yeast we observe the opposite. We speculate that the higher overlap between the conditions in the two B. subtilis datasets improved the performance of the batch-correction algorithm used here. For yeast, on the other hand, conditions were very different across datasets, and although much new information is gained by merging datasets into one, it is likely that incorrect relationships between genes were induced as an artifact, possibly confounding the inference. Of note, these approaches emphasize the commonalities across datasets, whereas the motivation for using MTL frameworks is to increase statistical power while maintaining separate models for each dataset, hopefully improving interpretability.
These experiments were also performed using TF expression as covariates, instead of TF activities, and those results are shown in (S2A and S2B Fig). In that case, the results hold for yeast, but not for B. subtilis.
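The rank-averaging combination used to build the ensemble networks can be sketched as below (hypothetical helper; assigning edges missing from a network the worst possible rank is one simple convention among several):

```python
def rank_combine(networks):
    """networks: list of {edge: confidence} dicts, one per dataset.
    Combine by average rank (rank 1 = highest confidence); an edge
    missing from a network receives the worst possible rank.
    Lower combined values indicate stronger consensus edges."""
    all_edges = set().union(*networks)
    worst = len(all_edges) + 1
    rank_maps = []
    for net in networks:
        ranked = sorted(net, key=net.get, reverse=True)
        rank_maps.append({e: i + 1 for i, e in enumerate(ranked)})
    return {e: sum(rm.get(e, worst) for rm in rank_maps) / len(networks)
            for e in all_edges}

# Two toy networks sharing one edge.
net1 = {("tfA", "g1"): 0.9, ("tfA", "g2"): 0.2}
net2 = {("tfA", "g1"): 0.6, ("tfB", "g1"): 0.5}
combined = rank_combine([net1, net2])
```

The shared edge receives the best combined rank, while edges supported by only one network are pushed down, which is the behavior a rank-average ensemble is meant to produce.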
Fig 4. Multi-task learning performance boost outweighs benefits of other data integration methods.
Assessment of the accuracy of network models learned using three different data integration strategies: data merging and batch correction (STL-BC), an ensemble method combining models learned independently (STL-C), and an ensemble method combining models learned jointly (MTL-C). (A) Precision-recall curves for B. subtilis, again using a leave-out set of interactions. Barplots show the mean area under the precision-recall curve (AUPR) for each method. Error bars show the standard deviation across 10 splits of the gold-standard into prior and evaluation sets. (B) Precision-recall curves for S. cerevisiae, with the difference that the prior is from an independent source (no splits or replicates).
Our method is robust to increasing prior weights and noise in prior
Because genes are frequently co-regulated, and biological networks are redundant and robust to perturbations, spurious correlations between transcription factors and genes are highly prevalent in gene expression data [67, 68]. To help discriminate true from false interactions, it is essential to incorporate prior information to bias model selection towards interactions with prior knowledge. Indeed, incorporating prior knowledge has been shown to increase the accuracy of inferred models in several studies [3, 21, 69].
For example, suppose that two regulators present highly correlated activities, but regulate different sets of genes. A regression-based model would be unable to differentiate between them, and only other sources of information, such as binding evidence near a target gene, could help select one predictor over the other in a principled way. Thus, we provide an option to integrate prior knowledge into our MTL approach at the model selection step by allowing the user to input a "prior weight". This weight is used to increase the presence of prior interactions in the final model, and should be proportional to the quality of the input prior.
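As a minimal illustration of how a prior weight could translate into adaptive penalties (the exact parameterization shown here is a hypothetical one, not necessarily the package's), regulators with prior evidence receive a smaller lasso penalty:

```python
def adaptive_penalties(regulators, prior_edges, lam=1.0, prior_weight=1.0):
    """Per-regulator lasso penalties for one target gene. Regulators with
    prior evidence are penalized lam / prior_weight, so a prior weight
    above 1 favors their inclusion; all others receive lam.
    (This parameterization is an illustrative assumption.)"""
    return [lam / prior_weight if r in prior_edges else lam
            for r in regulators]

# tf2 has prior evidence for this target; a prior weight of 2 halves its
# penalty relative to tf1 and tf3.
pens = adaptive_penalties(["tf1", "tf2", "tf3"], {"tf2"}, prior_weight=2.0)
```

With a prior weight of 1 all regulators are penalized equally, so the prior only influences model selection when the user asserts confidence in it.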
The sources of prior information for the two model organisms used in this study are fundamentally different. The B. subtilis prior is high-quality, derived from small-scale experiments, whereas the S. cerevisiae prior is noisier, likely with both high false-positive and false-negative rates, derived from high-throughput chromatin accessibility experiments and TF binding motifs. To understand differences in prior influence for the same organism, we also include the yeast gold-standard as a possible source of prior in this analysis. The number of TFs per target gene in the B. subtilis (Fig 5A) and S. cerevisiae (Fig 5B) gold-standards (GS) is hardly ever greater than 2, with a median of 1, whereas for the chromatin accessibility-derived prior (ATAC) for S. cerevisiae, the median is 11 (Fig 5C). A large number of regulators per gene likely indicates a high false-positive rate in the yeast ATAC prior. Given the differences in prior quality, we tested the sensitivity of our method to the prior weight parameter. We applied increasing prior weights, and measured how the confidence scores attributed to prior interactions were affected (Fig 5D) for the three sources of priors described above. Interestingly, the confidence score distributions show dependencies on both the prior quality and the prior weights. When the gold-standard interactions for B. subtilis and S. cerevisiae are used as prior knowledge, they receive significantly higher scores than interactions in the S. cerevisiae chromatin accessibility-derived prior, in line with our belief in the quality of the input prior information. Importantly, even when we set the prior weight to a very high value, such as 10, interactions in the ATAC prior are not pushed to very high confidence scores, suggesting that our method is robust to the presence of false interactions in the prior.
Fig 5. Recovery of prior interactions depends on prior quality and is robust to increasing prior weights.
Distribution of the number of regulators per target in the B. subtilis prior (A), the S. cerevisiae gold-standard (B), and the S. cerevisiae chromatin accessibility-derived priors (C). (D) Distributions of MTL inferred confidence scores for interactions in the prior for each dataset. Different colors show the prior weights used, which represent the amount by which interactions in the prior are favored by model selection compared to interactions without prior information. (E) Distributions of MTL inferred confidence scores for true (yellow) and false (gray) interactions in the prior for each dataset. (F) Counts of MTL inferred interactions with nonzero confidence scores for true (yellow) and false (gray) interactions in the prior for each dataset.
In order to test this hypothesis, we artificially introduced false edges into both the B. subtilis and the yeast gold-standards. We added 1 randomly chosen "false" interaction for every 5 true edges in the gold-standard. This affects both TFA estimation and model selection (for prior weights greater than 1). We then ran the inference using the Inferelator-AMuSR method with increasing prior weights, and evaluated both the confidence scores of recovered true and false interactions (Fig 5E) and the counts of true and false interactions that receive nonzero confidence scores (Fig 5F). For both B. subtilis and yeast, we notice that the confidence score distributions depend on whether edges are true or false, indicating that the method is not overfitting the prior for the majority of datasets, even when the prior weights used are as high as 10 (Fig 5E). We speculate that the greater completeness of the B. subtilis gold-standard and expression datasets makes it easier to differentiate true from false prior interactions than for yeast. In addition, inferring networks for prokaryotes is regarded as an easier problem [16]. Importantly, we also show the number of nonzero interactions in each of these distributions (Fig 5F). Taken together, these results show that our method is robust to false interactions in the prior, but requires the user to choose an appropriate prior weight for the specific application. As in previous studies [43], in the presence of a gold-standard, we recommend that the user evaluate performance on leave-out sets of interactions to determine the best prior weight. In the absence of a gold-standard, priors are likely to be of lower confidence, and therefore smaller prior weights should be used.
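The noise-injection experiment described above can be sketched as follows (an illustrative stand-in; the sampling scheme and names are assumptions, not the paper's exact procedure):

```python
import random

def add_false_edges(gold_edges, tfs, genes, ratio=5, seed=0):
    """Add one randomly chosen 'false' TF-gene edge for every `ratio`
    true edges, drawn from pairs absent from the gold standard
    (an illustrative stand-in for the perturbation in the text)."""
    rng = random.Random(seed)
    noisy = set(gold_edges)
    n_false = len(gold_edges) // ratio
    candidates = [(tf, g) for tf in tfs for g in genes
                  if (tf, g) not in gold_edges]
    noisy.update(rng.sample(candidates, n_false))
    return noisy

gold = {("tfA", "g%d" % i) for i in range(10)}  # 10 true edges
noisy = add_false_edges(gold, ["tfA", "tfB"],
                        ["g%d" % i for i in range(10)])
```

With 10 true edges and a 1-in-5 ratio, two false edges are injected; all of them necessarily involve tfB, since every tfA pair is already in the toy gold standard.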
Joint network inference is robust to dataset heterogeneity
Because multi-task learning approaches are inclined to return models that are more similar to each other, we sought to understand how heterogeneity among datasets affected the inferred networks. Specifically, we quantified the overlap between the networks learned for each dataset for B. subtilis and yeast, that is, the number of edges that are unique or shared across the networks inferred for each dataset (Fig 6). In this analysis, we consider valid only predictions within a 0.5 precision cutoff, calculated using only TFs and genes present in the gold-standard. Since the B. subtilis datasets share more conditions than the yeast datasets, we hypothesized that the B. subtilis networks would have a higher overlap than the yeast networks. As expected, we observe that about 40% of the total edges are shared between the two B. subtilis networks (Fig 6A), whereas for yeast only about 27% (Fig 6B) and 22% (Fig 6C) of the total number of edges, using gold-standard and chromatin accessibility-derived priors respectively, are shared by at least two of the three inferred networks. Therefore, our approach to joint inference is robust to cross-dataset influences, preserving relative uniqueness when datasets are more heterogeneous.
Fig 6. Overlap of edges in inferred networks is higher for B. subtilis than for S. cerevisiae.
Edge overlap across networks inferred using multi-task learning for B. subtilis with a prior weight of 1.0 (A), for S. cerevisiae using the gold standard as priors (B), and for S. cerevisiae using the chromatin-accessibility-derived priors (C).
Discussion
In this study, we presented a multi-task learning approach for the joint inference of gene regulatory networks across multiple expression datasets that improves performance and biological interpretation by factoring network models derived from multiple datasets into conserved and dataset-specific components. Our approach is designed to leverage cross-dataset commonalities while preserving relevant differences. While other multi-task methods for network inference penalize differences in model coefficients across datasets [25–28, 30], our method shares the underlying topology rather than the influence of TFs on targets. We expect this to be more robust because, in living cells, a TF's influence on a gene's expression can change across conditions. In addition, previous methods either model dataset-specific interactions [25] or apply proper sparsity-inducing regularization penalties [26–30], but not both; our approach addresses both concerns. Finally, we implemented an additional feature that allows the incorporation of prior knowledge on network topology in the model selection step.
Using two different model organisms, B. subtilis and S. cerevisiae, we show that joint inference results in accurate network models. We also show that multi-task learning leads to more accurate models than other data integration strategies, such as batch correction and combining fitted models. Generally, the benefits of multi-task learning are most apparent when task overlap is high and datasets are somewhat undersampled [34]. Our results support this principle, as the overall performance increase of multi-task network inference is more pronounced for B. subtilis than for S. cerevisiae, whose datasets sample more heterogeneous conditions. Therefore, to benefit from this approach, defining input datasets that share underlying regulatory mechanisms is essential, and currently user-defined.
A key question here, which requires future work, is the partitioning of data into separate datasets. Here we use the boundaries afforded by previous study designs: for B. subtilis, data from two platforms and two strains (a fairly natural boundary), and for yeast, the separation between studies by different groups (again using different technologies). We chose these partitions to illustrate robustness to the more common sources of batch effects in meta-analysis. In the future, we expect that multi-task methods in this domain will integrate dataset partition estimation (which data go in which bucket) with network inference. Such methods would ideally estimate task similarity, taking into account principles of regulatory biology, and apply a weighted approach to information sharing. In addition, a key avenue for future work will be to adapt this method to multi-species studies. Examples of high biological and biomedical interest include joint inference across model systems and organisms of primary interest (for example, datasets that include mouse and human data collected for similar cell types in similar conditions). These results (and previous work on many fronts [7, 25, 70]) suggest that this method would perform well in that setting. Moreover, because of the increasing practice of data sharing in biology, we expect cross-study inference methods to become increasingly valuable, enabling more robust and generalizable hypotheses. Although we present this method as an alternative to batch correction, we should point out that there are many uses of batch correction that fall outside the scope of network inference, and our results do not lessen the applicability of batch correction methods to those tasks. There is still great value in properly balancing experimental designs when possible to allow for the estimation of specific gene- and condition-wise batch effects.
Experiments that combine multi-task learning with properly balanced designs and quality batch correction are not provided here, but would likely perform even better. Thus, the results here should be strictly interpreted in the context of network inference, pathway inference, and the modeling of interactions.
Methods
Expression data selection, preprocessing and batch correction
For B. subtilis, we downloaded normalized expression datasets from the previously published network study by Arrieta-Ortiz et al [17]. Both datasets are available at GEO, B. subtilis 1 with accession number GSE67023 [17] and B. subtilis 2 with accession number GSE27219 [52]. For yeast, we downloaded expression datasets from the SPELL database, where hundreds of reprocessed gene expression datasets are available for this organism. In particular, we selected three datasets from separate studies based on the number of samples, within-dataset condition diversity, and cross-dataset condition overlap (such as nutrient-limited stress). S. cerevisiae 1 and S. cerevisiae 2 are also available at GEO, at accession numbers GSE11452 [54] and GSE9376 [55]. S. cerevisiae 3 does not have a GEO accession number, and was collected on a custom spotted microarray [56]. For network inference, we only kept genes present in all datasets, resulting in 3780 genes for B. subtilis and 4614 genes for yeast. To jointly merge the data for comparison, we considered each dataset to be a separate batch, since they were generated in different labs as part of separate studies, and applied ComBat for batch correction using default parameters and no reference to experimental designs [65].
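For illustration, the batch-handling logic can be sketched as below. This is a minimal per-batch location/scale adjustment in Python with numpy, not ComBat itself (ComBat [65] additionally pools information across genes via empirical Bayes); function and variable names are hypothetical.

```python
import numpy as np

def per_batch_standardize(X, batch_ids):
    """Crude batch adjustment: center and scale each gene within each batch.
    ComBat [65] additionally shrinks the per-batch estimates with an
    empirical-Bayes prior; this sketch omits that step.

    X: genes-by-samples expression matrix.
    batch_ids: length-(n samples) array of batch labels.
    """
    X_adj = np.empty_like(X, dtype=float)
    for b in np.unique(batch_ids):
        cols = batch_ids == b
        mu = X[:, cols].mean(axis=1, keepdims=True)
        sd = X[:, cols].std(axis=1, keepdims=True)
        sd[sd == 0] = 1.0  # avoid dividing by zero for flat genes
        X_adj[:, cols] = (X[:, cols] - mu) / sd
    return X_adj
```

Treating each study as one batch, as above, removes study-wide location/scale differences but, like any batch correction without a balanced design, cannot distinguish technical effects from conditions unique to one study.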
Building priors from chromatin accessibility
ATAC-seq data download, processing, and peak calling.
We downloaded chromatin accessibility data for S. cerevisiae from the European Nucleotide Archive (PRJNA276699) [71, 72]. Reads were mapped to the sacCer3 genome (iGenomes, UCSC) using bowtie2 [73] with the options --very-sensitive --maxins 2000. Reads with low mapping quality (MAPQ < 30) or that mapped to mitochondrial DNA were removed. Duplicates were removed using Picard. Reads mapping to the forward strand were offset by +4 bp, and reads mapping to the reverse strand by −4 bp. Accessible regions were called using MACS2 [74] with the options --qvalue 0.01 --gsize 12100000 --nomodel --shift 20 --extsize 40. We defined the union of peaks called in any of the ATAC-seq samples as the set of putative regulatory regions.
Motif download, assignment to target genes, and prior generation.
We obtained a set of expert-curated motifs for S. cerevisiae containing position frequency matrices for yeast transcription factors from The Yeast Transcription Factor Specificity Compendium (YeTFaSCo) [75]. We then scanned the whole yeast genome for occurrences of these motifs using FIMO with a p-value cutoff of 1e−4 [59], and kept motifs that intersected with putative regulatory regions. Each motif was then assigned to the gene with the closest downstream transcription start site. Gene annotations were obtained from the Saccharomyces Genome Database (SGD) [76]. A list of putative regulators was downloaded from the YeastMine database [62], and we then generated a targets-by-regulators matrix (prior), where entries are the count of motifs for a particular regulator assigned to each gene. Finally, we multiplied entries for repressors by −1.
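The targets-by-regulators prior construction can be sketched as follows. This is a toy illustration with hypothetical gene and regulator names, assuming the motif-to-gene assignments have already been computed (e.g., from FIMO hits intersected with accessible regions):

```python
import numpy as np

genes = ["YAL001C", "YAL002W", "YAL003W"]   # hypothetical target genes
regulators = ["TF_A", "TF_B"]               # hypothetical regulators
# One (gene, regulator) pair per assigned motif occurrence.
hits = [("YAL001C", "TF_A"), ("YAL001C", "TF_A"), ("YAL003W", "TF_B")]
repressors = {"TF_B"}                       # hypothetical known repressors

# Count motif occurrences per (target, regulator) pair.
prior = np.zeros((len(genes), len(regulators)))
for gene, reg in hits:
    prior[genes.index(gene), regulators.index(reg)] += 1

# Flip the sign for known repressors, as described in the text.
for reg in repressors:
    prior[:, regulators.index(reg)] *= -1
```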
Network inference
We approach network inference by modeling gene expression as a weighted sum of the activities of transcription factors [17, 36]. Our goal is to learn these weights from gene expression data as accurately as possible. In this section, we explain our core model of gene regulation and of transcription factor activities, and state our assumptions. We also describe how we extend our framework to support learning multiple networks simultaneously and integrating prior knowledge on network structure. Finally, we explain how we rank predicted interactions, a ranking used to evaluate the ability of these methods to recover the known underlying network.
Core model.
We model the expression of a gene i at condition j, X_{i,j}, as the weighted sum of the activities of each transcription factor k at condition j, A_{k,j} [17, 43]. Note that although several methods use a transcription factor's expression as an approximation of its activity, we explicitly estimate these values from expression data and a set of prior known interactions. The strength and direction (activation or repression) of a regulatory interaction between transcription factor k and gene i is represented by w_{i,k}. At steady state, we assume:
X_{i,j} = ∑_{k} w_{i,k} A_{k,j} (1)
For time-series, we reason that there is a delay τ between transcription factor activities and resulting changes in target gene expression [43]. Given the expression of a gene i at time t_{n}, X_{i,t_{n}}, and the activity of transcription factor k at time t_{n−τ}, A_{k,t_{n−τ}}, we assume:
X_{i,t_{n}} = ∑_{k} w_{i,k} A_{k,t_{n−τ}} (2)
If time t_{n} − τ is not available in the expression data, linear interpolation is used to fit A_{k,t_{n−τ}}.
Finally, because we expect each gene to be regulated by only a few transcription factors, we seek a sparse solution for w, that is, a solution in which most entries of w are zero. Of note, we set τ = 15 for B. subtilis [17]. For S. cerevisiae, all experiments are considered steady-state.
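The interpolation step for unmeasured time points can be illustrated with numpy's np.interp (toy times and values; the actual implementation may differ):

```python
import numpy as np

# Hypothetical measurement times and expression values for one gene.
t_obs = np.array([0.0, 10.0, 30.0, 60.0])
x_obs = np.array([1.0, 2.0, 4.0, 3.0])

tau = 15.0           # time delay, as used for B. subtilis [17]
t_n = 30.0           # time point at which the gene's expression is modeled
t_query = t_n - tau  # 15.0: not measured, so interpolate

# Linear interpolation between the measurements at 10.0 and 30.0.
x_interp = np.interp(t_query, t_obs, x_obs)  # -> 2.5
```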
Estimating transcription factor activities (TFA).
We use the expression of known targets of a transcription factor to estimate its activity. From a set of prior interactions, we build a connectivity matrix P, where entries represent known activation, P_{i,k} = 1, or repression, P_{i,k} = −1, of gene i by transcription factor k; if there is no known interaction, P_{i,k} = 0. We assume that the expression of a gene can be written as a linear combination of the activities of its prior known regulators [17]:
X_{i,j} = ∑_{k} P_{i,k} A_{k,j} (3)
In the case of time-series experiments, we use the expression of genes at time t_{n+τ/2}, X_{i,t_{n+τ/2}}, to inform the activities at time t_{n}, A_{k,t_{n}}. Note that for estimating activities, the time delay used is τ/2. Again, linear interpolation is used to estimate X_{i,t_{n+τ/2}} if gene expression at t_{n+τ/2} was not measured experimentally [17]:
X_{i,t_{n+τ/2}} = ∑_{k} P_{i,k} A_{k,t_{n}} (4)
In matrix form, both the time-series and steady-state equations can be written as X = PA. Since there are more target genes than regulators, this is an overdetermined system with, in general, no exact solution, so we approximate A by finding the Â that minimizes ‖X − PA‖. The solution is given by Â = P*X, where P* = (P^{T} P)^{−1} P^{T} is the pseudoinverse of P. Finally, for transcription factors with no targets in P, we use the measured expression values as a proxy for the activities.
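A minimal numpy sketch of this TFA estimation step follows (toy matrices; np.linalg.pinv computes the same pseudoinverse as (P^{T}P)^{−1}P^{T} when P has full column rank):

```python
import numpy as np

def estimate_tfa(P, X):
    """Estimate TF activities A_hat from the prior connectivity matrix P
    (genes x TFs) and expression X (genes x samples) by least squares:
    A_hat = pinv(P) @ X."""
    return np.linalg.pinv(P) @ X

# Toy example: 4 genes, 2 TFs, 3 conditions.
P = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, -1.0],   # TF 2 represses gene 3
              [0.0, 1.0]])
A_true = np.array([[1.0, 2.0, 0.5],
                   [0.5, 1.0, 2.0]])
X = P @ A_true               # noise-free expression consistent with X = PA
A_hat = estimate_tfa(P, X)   # recovers A_true exactly in this noise-free case
```

With noisy expression, Â is instead the least-squares compromise across a TF's prior targets, which is what makes the estimate more robust than using the TF's own expression.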
Learning regression parameters.
Given gene expression and activity estimates, the next step is to define a set of regulatory hypotheses for the observed changes in gene expression. For each gene, we find a sparse solution for the regression coefficients, where nonzero values indicate the transcription factors that best explain the observed changes in gene expression. In this section, we explain how we learn these parameters from a single dataset (single-task learning) and from multiple datasets (multi-task learning).
Single-task learning using lasso regression (l_{1}).
The lasso (least absolute shrinkage and selection operator) is a method that performs both shrinkage of the regression coefficients and model selection [49]. That is, it shrinks regression coefficients towards zero, while setting some of them to exactly zero. It does so by adding a penalty on the sum of the absolute values of the estimated regression coefficients. Let A be the activities matrix, X_{i} the expression values for gene i, and w the vector of coefficients; the lasso estimates are given by:
ŵ = argmin_{w} ‖X_{i} − wA‖_{2}^{2} + λ‖w‖_{1} (5)
where ‖w‖_{1} = ∑_{k} |w_{k}|. When minimizing the above function, we seek a good fit subject to a "budget" on the regression coefficients. The hyperparameter λ controls how much weight to put on the l_{1} penalty. The lasso has become very popular over the last decade because it reduces overfitting and automatically performs variable selection. We choose the lasso as a single-task baseline because it is equivalent to the S matrix in the multi-task case (see below), but with an independent choice of sparsity parameter for each dataset.
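As an illustration of the single-task problem, the objective in Eq 5 can be minimized with a simple proximal-gradient (ISTA) loop. This is a pedagogical sketch, not the solver used here; in practice, coordinate descent (as described below) or an off-the-shelf lasso implementation is faster:

```python
import numpy as np

def soft_threshold(z, tau):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(A, x, lam, n_iter=5000):
    """Minimize ||x - A w||_2^2 + lam * ||w||_1 by proximal gradient (ISTA).
    A: samples-by-predictors design matrix; x: expression of one gene."""
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)  # 1 / Lipschitz const. of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ w - x)
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Sparse toy problem: only the first 2 of 10 predictors are active.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]
x = A @ w_true + 0.01 * rng.normal(size=50)
w_hat = lasso_ista(A, x, lam=1.0)  # recovers the sparse support
```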
Multi-task learning using sparse block-sparse regression (l_{1}/l_{1} + l_{1}/l_{∞}).
We extend our core model to the multiple linear regression setting to enable simultaneous parameter estimation. Here we represent the regression parameters for a single gene i as a matrix W, where rows are transcription factors k and columns are networks (or datasets) d. We seek to learn the support Supp(W), whose nonzero entries W_{k,d} represent a regulatory interaction between transcription factor k and gene i in the network for dataset d:
X^{(d)}_{i,j} = ∑_{k} W_{k,d} A^{(d)}_{k,j} (6)
For a given gene i, we could assume that the same regulatory network underlies the expression data in all datasets d, that is, that rows in W are either entirely nonzero or entirely zero. Since different sets of experiments may have different regulatory patterns, however, this could be a very strong assumption. A more realistic scenario is that, for each gene i, certain regulators are relevant to the regulatory models for all datasets d, while others may be selected independently by each model d. Thus, some rows in the parameter matrix W are entirely nonzero or zero, while others do not follow any particular rule. In this scenario, the main challenge is that a single structural constraint, such as row sparsity, does not capture the structure of the parameter matrix W. For these problems, a solution is to model the parameter matrix as a combination of structurally constrained parameters [77].
As proposed by Jalali et al. [32], we learn the regression coefficients by decomposing W into B and S, which encode similarities and differences between the regulatory models respectively. This representation combines a block-regularization penalty on B enforcing row sparsity, ‖B‖_{1,∞} = ∑_{k} ‖B_{k}‖_{∞}, where ‖B_{k}‖_{∞} ≔ max_{d} |B_{k,d}| (as in the previous section), and an element-wise penalty on S allowing for deviations across the regulatory models for each dataset, ‖S‖_{1,1} = ∑_{k,d} |S_{k,d}|. The goal is to leverage any parameter overlap between models through B, while accommodating the differences through S. We obtain an estimate for Ŵ = B̂ + Ŝ by solving the following optimization problem:
B̂, Ŝ = argmin_{B,S} ∑_{d} ‖X^{(d)}_{i} − (B_{*,d} + S_{*,d})^{T} A^{(d)}‖_{2}^{2} + λ_{b}‖B‖_{1,∞} + λ_{s}‖S‖_{1,1} (7)
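The role of each penalty becomes explicit when the objective in Eq 7 is written out directly. The sketch below evaluates it for given B and S (toy shapes, hypothetical variable names):

```python
import numpy as np

def dirty_model_objective(X_list, A_list, B, S, lam_b, lam_s):
    """Objective of the dirty-model decomposition for one gene:
    sum_d ||x_d - A_d @ (B[:, d] + S[:, d])||_2^2
      + lam_b * sum_k max_d |B[k, d]|     (block l1/linf penalty: shared support)
      + lam_s * sum_{k,d} |S[k, d]|       (elementwise l1 penalty: dataset-specific)
    X_list[d]: expression of the gene in dataset d, shape (n_d,);
    A_list[d]: activities for dataset d, shape (n_d, p)."""
    W = B + S
    fit = sum(np.sum((x - A @ W[:, d]) ** 2)
              for d, (x, A) in enumerate(zip(X_list, A_list)))
    block = lam_b * np.abs(B).max(axis=1).sum()
    sparse = lam_s * np.abs(S).sum()
    return fit + block + sparse
```

The max over datasets in the block penalty is what couples the models: a row of B pays the same price whether one or all datasets use it, which encourages shared support, while S pays per entry and so absorbs dataset-specific deviations.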
Incorporating prior knowledge using the adaptive lasso.
We incorporate prior knowledge through differential shrinkage of the regression parameters in S via the adaptive lasso [33]. We apply this only to the S component because we wanted to allow the user to input different priors for each dataset if so desired. Intuitively, we penalize interactions present in the prior network less. Let Φ be a matrix of regulators k by datasets d, such that entries Φ_{k,d} are inversely proportional to our prior confidence in the interaction between regulator k and gene i for dataset d. We then optimize the following objective:
B̂, Ŝ = argmin_{B,S} ∑_{d} ‖X^{(d)}_{i} − (B_{*,d} + S_{*,d})^{T} A^{(d)}‖_{2}^{2} + λ_{b}‖B‖_{1,∞} + λ_{s} ∑_{k,d} Φ_{k,d}|S_{k,d}| (8)
We implement this by scaling λ_{s} by Φ; the penalty applied to S_{k,d} then becomes Φ_{k,d}λ_{s}. In the extreme case Φ_{k,d} = 0, the regulator k is not penalized and will necessarily be included in the final model for dataset d. In practice, the algorithm accepts an input prior weight ρ ≥ 1 that is used to generate the matrix Φ. We apply the complexity-penalty reduction afforded by Φ_{k,d} to Ŝ and not to B̂, as this choice penalizes unique terms, creating the desired behavior of encouraging model differences that are in accord with orthogonal data as expressed in the network prior. This choice is also in accord with the interpretation of the prior as valid in one, but not necessarily all, conditions. If regulator k is in the prior for dataset d, then Φ_{k,d} = 1/ρ, otherwise Φ_{k,d} = 1. Finally, we rescale Φ_{*,d} to sum to the number of predictors k. Note that each network model accepts its own set of priors.
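The construction of Φ described above can be sketched as follows (here prior is a hypothetical regulators-by-datasets indicator of whether regulator k is in the prior for dataset d):

```python
import numpy as np

def build_phi(prior, rho):
    """Build the adaptive-penalty matrix Phi (regulators x datasets) from a
    prior indicator matrix and a prior weight rho >= 1. Prior entries get
    1/rho, the rest get 1; each column is then rescaled to sum to the number
    of predictors, as described in the text."""
    Phi = np.where(prior != 0, 1.0 / rho, 1.0)
    k = Phi.shape[0]
    Phi *= k / Phi.sum(axis=0, keepdims=True)
    return Phi

prior = np.array([[1, 0],
                  [0, 0],
                  [1, 1]])   # 3 regulators, 2 datasets
Phi = build_phi(prior, rho=2.0)
```

The column rescaling keeps the total amount of regularization per dataset constant, so increasing ρ redistributes the penalty from prior-supported regulators onto the rest rather than loosening the model overall.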
Model selection.
As proposed by Jalali et al. [32], for MTL, we set λ_{b} = c√(d log(p)/n), with n being the number of samples, d the number of datasets, and p the number of predictors, and search for c in the logarithmic interval [0.01, 10]. For each λ_{b}, we look for values of λ_{s} that satisfy λ_{b}/d < λ_{s} < λ_{b}, the range in which both B and S can enter the model. We choose the optimal combination (λ_{s}, λ_{b}) that minimizes the extended Bayesian information criterion (EBIC) [78], here defined as:
EBIC = ∑_{d} [ n_{d} log(σ̂²_{d}) + k_{d} log(n_{d}) + 2γ k_{d} log(p) ] (9)
with k_{d} being the number of nonzero predictors in W for model d, σ̂²_{d} the residual variance for model d, and 0 ≤ γ ≤ 1. Note that for γ = 0 we recover the original BIC, whereas for γ > 0 the EBIC scales with the size of the predictor space p, making it particularly appropriate for scenarios where p ≫ n, often encountered in biological network inference. In this study, we set γ = 1. For STL, we use the same EBIC measure, but calculate it for each dataset separately. Importantly, model selection using EBIC is significantly faster than resampling approaches, such as cross-validation or stability selection [79]. Cross-validation, for example, was previously reported as an impediment to multi-task learning in large-scale network inference due to its computational cost [29].
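A sketch of the per-model EBIC in the standard Chen and Chen form, which matches the description above (the exact normalization used by the implementation may differ):

```python
import numpy as np

def ebic(rss, n, k, p, gamma=1.0):
    """Extended BIC for one model:
    n * log(rss / n) + k * log(n) + 2 * gamma * k * log(p).
    gamma = 0 recovers the ordinary BIC; gamma > 0 adds a penalty that grows
    with the size of the predictor space p, appropriate when p >> n."""
    return n * np.log(rss / n) + k * np.log(n) + 2.0 * gamma * k * np.log(p)
```

For the multi-task case, this quantity would be summed over datasets; the (λ_{s}, λ_{b}) pair with the smallest total is kept.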
Implementation.
We optimize the MTL objective function using cyclical coordinate descent with covariance updates. That is, at each iteration of the algorithm we cycle through the predictors (coordinates), and minimize the objective at each predictor k while keeping the others fixed. Briefly, for a given (λ_{s}, λ_{b}), we update the entries in S and B respectively, while keeping the other values in these matrices unchanged, for several iterations until convergence. First, we update the values in S by:
S_{k,d} = ST( A^{(d)}_{k} r̃^{(d)T} / ‖A^{(d)}_{k}‖_{2}^{2}, λ_{s}Φ_{k,d} / (2‖A^{(d)}_{k}‖_{2}^{2}) ) (10)
with r̃^{(d)} = X^{(d)}_{i} − ∑_{j≠k} A^{(d)}_{j}(B_{j,d} + S_{j,d}) − A^{(d)}_{k}B_{k,d} being the partial residual vector, and ST(z, τ) ≔ sign(z)max(|z| − τ, 0) the soft-thresholding operator. Intuitively, we remove the effect of the previous coefficient value for S_{k,d}, while keeping B_{k,d} unchanged, and measure how this changes the residuals. This is a measure of how important that feature is to the prediction, and contributes to the decision of whether the feature is pushed towards zero by the lasso penalty. For λ_{s} = 0, we recover the least squares update, Ŝ_{k,d} = A^{(d)}_{k} r̃^{(d)T} / ‖A^{(d)}_{k}‖_{2}^{2}, which can be rewritten in terms of the cross-products A^{(d)}A^{(d)T} and A^{(d)}X^{(d)T}. This formulation can be optimized much more quickly using the covariance updates explained below.
Then, we update B_{k,*}, which represents an entire row in B, by:
B_{k,*} = argmin_{b} ∑_{d} ‖r̃^{(d)} − A^{(d)}_{k} b_{d}‖_{2}^{2} + λ_{b}‖b‖_{∞} (11)
with r̃^{(d)} = X^{(d)}_{i} − ∑_{j≠k} A^{(d)}_{j}(B_{j,d} + S_{j,d}) − A^{(d)}_{k}S_{k,d} being the partial residual vector for this case. Here, we keep the value S_{k,d} unchanged and set B_{k,d} to zero. Similarly, we remove the effect of the previous B_{k,d} and evaluate how important this feature is for the prediction; this then contributes to the decision of whether the entire row is sent to zero by the infinity-norm penalty. For λ_{b} = 0, we recover the least squares update, B̂_{k,d} = A^{(d)}_{k} r̃^{(d)T} / ‖A^{(d)}_{k}‖_{2}^{2}, which can again be rewritten in terms of precomputed cross-products. Finally, we apply soft-thresholding to penalize the least-squares updates.
Using these formulations for the updates, we can use the idea of covariance updates [50, 80], where the cross-products A^{T}A and A^{T}X are stored in separate matrices and reused at every iteration. Because these cross-products account for over 95% of the computation time, this trick decreases runtime significantly. To further decrease runtime, we also employ warm starts when searching for the optimal penalty values (λ_{s}, λ_{b}) [80]. Additionally, since we infer the regulators for each gene separately, we can parallelize the computation by gene.
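The covariance-update trick can be demonstrated directly: the quantity needed by each coordinate update can be computed from precomputed cross-products instead of recomputing an n-dimensional residual every time (toy data, hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(40, 8))   # samples x predictors (activities for one dataset)
x = rng.normal(size=40)        # expression of one gene
w = rng.normal(size=8)         # current coefficients

# Precompute the cross-products once; they are reused across all
# coordinate-descent iterations (and across genes sharing the same A).
G = A.T @ A    # p x p Gram matrix
c = A.T @ x    # p-vector

# The predictor-residual correlations, computed two ways:
direct = A.T @ (x - A @ w)     # recomputes an n-vector residual each time
cached = c - G @ w             # only p-dimensional products per iteration
```

Both expressions are algebraically identical; the cached form replaces the O(np) residual recomputation with an O(p^2) product, which is the source of the reported speedup when n ≫ p work is repeated across many iterations.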
Estimating prediction confidence scores
For each predicted interaction, we compute a confidence score that represents how well a predictor explains the expression data, together with a measure of prediction stability. As previously proposed [17, 43], we calculate confidence scores for each interaction by:
c_{k,i} = 1 − σ²_{full} / σ²_{−k} (12)
where σ²_{full} and σ²_{−k} are the variances of the residuals for the models with and without predictor k, respectively. The score c_{k,i} is proportional to how much removing regulator k from gene i's set of predictors decreases the model fit. To measure stability, we perform the inference across multiple bootstraps of the expression data (we used 20 bootstraps for both B. subtilis and yeast), rank-average the interactions across all bootstraps [16, 43], and rescale the ranking to between 0 and 1 to output a final ranked list of regulatory hypotheses.
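The rank-averaging step can be sketched as follows (toy scores; note that this simple argsort-based ranking breaks ties arbitrarily rather than averaging them, which a production implementation would handle):

```python
import numpy as np

def combine_bootstrap_ranks(score_matrix):
    """Rank-average interaction confidence scores across bootstraps and
    rescale to [0, 1]. score_matrix: bootstraps x interactions; higher score
    means higher confidence."""
    # argsort of argsort converts scores to ranks (0 = lowest) per bootstrap.
    ranks = np.argsort(np.argsort(score_matrix, axis=1), axis=1).astype(float)
    avg = ranks.mean(axis=0)
    return (avg - avg.min()) / (avg.max() - avg.min())

scores = np.array([[0.9, 0.1, 0.5],
                   [0.8, 0.2, 0.3]])  # 2 bootstraps, 3 putative interactions
final = combine_bootstrap_ranks(scores)  # -> [1.0, 0.0, 0.5]
```

Rank-averaging, rather than score-averaging, makes the combination robust to differing score scales across bootstraps.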
Supporting information
S1 Fig. Multi-task learning (without TF activities) improves the accuracy of inferred networks.
(A) Precision-recall curves assessing the accuracy of network models inferred without TF activities for individual B. subtilis datasets against the whole gold-standard set of interactions. Barplots show the mean area under the precision-recall curve (AUPR) for each method and dataset. (B) Precision-recall curves assessing the accuracy of network models inferred without TF activities for individual S. cerevisiae networks, with the difference that priors are derived from chromatin accessibility data.
https://doi.org/10.1371/journal.pcbi.1006591.s001
(TIFF)
S2 Fig. Multi-task learning (without TF activities) performance boost outweighs the benefits of other data integration methods for yeast, but not for B. subtilis.
Assessment of the accuracy of network models learned using three different data integration strategies: data merging with batch correction (STL-BC), an ensemble method combining models learned independently (STL-C), and an ensemble method combining models learned jointly (MTL-C). TF expression was used as the predictor of gene expression. (A) Precision-recall curves for B. subtilis, again using the whole gold-standard set of interactions. Barplots show the mean area under the precision-recall curve (AUPR) for each method. (B) Precision-recall curves for S. cerevisiae, with the difference that priors are derived from chromatin accessibility data.
https://doi.org/10.1371/journal.pcbi.1006591.s002
(TIFF)
References

1.
Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC bioinformatics. 2006;7(1):S7. pmid:16723010 
2.
Petralia F, Wang P, Yang J, Tu Z. Integrative random forest for gene regulatory network inference. Bioinformatics. 2015;31(12):i197–i205. pmid:26072483 
3.
Bonneau R, Reiss DJ, Shannon P, Facciotti M, Hood L, Baliga NS, et al. The Inferelator: an algorithm for learning parsimonious regulatory networks from systemsbiology data sets de novo. Genome biology. 2006;7(5):R36. pmid:16686963 
4.
Yosef N, Shalek AK, Gaublomme JT, Jin H, Lee Y, Awasthi A, et al. Dynamic regulatory network controlling TH17 cell differentiation. Nature. 2013;496(7446):461. pmid:23467089 
5.
Ciofani M, Madar A, Galan C, Sellars M, Mace K, Pauli F, et al. A validated regulatory network for Th17 cell specification. Cell. 2012;151(2):289–303. pmid:23021777 
6.
Rung J, Brazma A. Reuse of public genomewide gene expression data. Nature reviews Genetics. 2013;14(2):89. pmid:23269463 
7.
Koch C, Konieczka J, Delorey T, Lyons A, Socha A, Davis K, et al. Inference and Evolutionary Analysis of GenomeScale Regulatory Networks in Large Phylogenies. Cell systems. 2017;4(5):543–558. pmid:28544882 
8.
Muir P, Li S, Lou S, Wang D, Spakowicz DJ, Salichos L, et al. The real cost of sequencing: scaling computation to keep pace with data generation. Genome biology. 2016;17(1):53. pmid:27009100 
9.
Marx V. Biology: The big challenges of big data. Nature. 2013;498(7453):255–260. pmid:23765498 
10.
Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. Tackling the widespread and critical impact of batch effects in highthroughput data. Nature reviews Genetics. 2010;11(10). pmid:20838408 
11.
Nayfach S, Pollard KS. Toward accurate and quantitative comparative metagenomics. Cell. 2016;166(5):1103–1116. pmid:27565341 
12.
Pritchard CC, Cheng HH, Tewari M. MicroRNA profiling: approaches and considerations. Nature reviews Genetics. 2012;13(5):358. pmid:22510765 
13.
Tung PY, Blischak JD, Hsiao CJ, Knowles DA, Burnett JE, Pritchard JK, et al. Batch effects and the effective design of singlecell gene expression studies. Scientific reports. 2017;7:39921. pmid:28045081 
14.
Auer PL, Doerge R. Statistical design and analysis of RNA sequencing data. Genetics. 2010;185(2):405–416. pmid:20439781 
15.
Nygaard V, Rødland EA, Hovig E. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics. 2016;17(1):29–39. pmid:26272994 
16.
Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, et al. Wisdom of crowds for robust gene network inference. Nature Methods. 2012;9(8):796–804. pmid:22796662 
17.
Arrieta-Ortiz ML, Hafemeister C, Bate AR, Chu T, Greenfield A, Shuster B, et al. An experimentally supported model of the Bacillus subtilis global transcriptional regulatory network. Molecular Systems Biology. 2015;11(11):839. pmid:26577401 
18.
Dietterich TG, et al. Ensemble methods in machine learning. Multiple classifier systems. 2000;1857:1–15. 
19.
Papp B, Oliver S. Genomewide analysis of the contextdependence of regulatory networks. Genome biology. 2005;6(2):206. pmid:15693953 
20.
Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, et al. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431(7004):99–104. pmid:15343339 
21.
Siahpirani AF, Roy S. A priorbased integrative framework for functional transcriptional regulatory network inference. Nucleic acids research. 2017;45(4):e21–e21. pmid:27794550 
22.
Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNAbinding proteins and nucleosome position. Nature methods. 2013;10(12):1213. pmid:24097267 
23.
Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, et al. Highresolution mapping and characterization of open chromatin across the genome. Cell. 2008;132(2):311–322. pmid:18243105 
24.
Johnson DS, Mortazavi A, Myers RM, Wold B. Genomewide mapping of in vivo proteinDNA interactions. Science. 2007;316(5830):1497–1502. pmid:17540862 
25.
Lam KY, Westrick ZM, Müller CL, Christiaen L, Bonneau R. Fused regression for multisource gene regulatory network inference. PLoS computational biology. 2016;12(12):e1005157. pmid:27923054 
26.
Omranian N, EloundouMbebi JM, MuellerRoeber B, Nikoloski Z. Gene regulatory network inference using fused LASSO on multiple data sets. Scientific reports. 2016;6:20533. pmid:26864687 
27.
Jain S, Gitter A, BarJoseph Z. Multitask learning of signaling and regulatory networks with application to studying human response to flu. PLoS computational biology. 2014;10(12):e1003943. pmid:25522349 
28.
Wang Y, Joshi T, Zhang XS, Xu D, Chen L. Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics. 2006;22(19):2413–2420. pmid:16864593 
29.
Chasman D, Walters KB, Lopes TJ, Eisfeld AJ, Kawaoka Y, Roy S. Integrating Transcriptomic and Proteomic Data Using Predictive Regulatory Network Models of Host Response to Pathogens. PLoS computational biology. 2016;12(7):e1005013. pmid:27403523 
30.
Gupta R, Stincone A, Antczak P, Durant S, Bicknell R, Bikfalvi A, et al. A computational framework for gene regulatory network inference that combines multiple methods and datasets. BMC systems biology. 2011;5(1):52. pmid:21489290 
31.
Qin J, Hu Y, Xu F, Yalamanchili HK, Wang J. Inferring gene regulatory networks by integrating ChIPseq/chip and transcriptome data via LASSOtype regularization methods. Methods. 2014;67(3):294–303. pmid:24650566 
32.
Jalali A, Sanghavi S, Ruan C, Ravikumar PK. A dirty model for multitask learning. In: Advances in Neural Information Processing Systems; 2010. p. 964–972. 
33.
Zou H. The adaptive lasso and its oracle properties. Journal of the American statistical association. 2006;101(476):1418–1429. 
34.
Caruana R. Multitask learning. In: Learning to learn. Springer; 1998. p. 95–133. 
35.
Chen X, Xuan J, Wang C, Shajahan AN, Riggins RB, Clarke R. Reconstruction of transcriptional regulatory networks by stabilitybased network component analysis. IEEE/ACM transactions on computational biology and bioinformatics. 2013;10(6):1347–1358. pmid:24407294 
36.
Fu Y, Jarboe LR, Dickerson JA. Reconstructing genomewide regulatory network of E. coli using transcriptome data and predicted transcription factor activities. BMC bioinformatics. 2011;12(1):233. pmid:21668997 
37.
Dai Z, Iqbal M, Lawrence ND, Rattray M. Efficient inference for sparse latent variable models of transcriptional regulation. Bioinformatics. 2017;33(23):3776–3783. pmid:28961802 
38.
Liao JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP. Network component analysis: reconstruction of regulatory signals in biological systems. Proceedings of the National Academy of Sciences. 2003;100(26):15522–15527. 
39.
Sanguinetti G, Lawrence ND, Rattray M. Probabilistic inference of transcription factor concentrations and genespecific regulatory activities. Bioinformatics. 2006;22(22):2775–2781. pmid:16966362 
40.
Filtz TM, Vogel WK, Leid M. Regulation of transcription factor activity by interconnected posttranslational modifications. Trends in pharmacological sciences. 2014;35(2):76–85. pmid:24388790 
41.
Ravasi T, Suzuki H, Cannistraci CV, Katayama S, Bajic VB, Tan K, et al. An atlas of combinatorial transcriptional regulation in mouse and man. Cell. 2010;140(5):744–752. pmid:20211142 
42.
Shlyueva D, Stampfel G, Stark A. Transcriptional enhancers: from properties to genomewide predictions. Nature Reviews Genetics. 2014;15(4):272–286. pmid:24614317 
43.
Greenfield A, Hafemeister C, Bonneau R. Robust datadriven incorporation of prior knowledge into the inference of dynamic regulatory networks. Bioinformatics. 2013;29:1060–1067. pmid:23525069 
44.
Han H, Shim H, Shin D, Shim JE, Ko Y, Shin J, et al. TRRUST: a reference database of human transcriptional regulatory interactions. Scientific reports. 2015;5:11432. pmid:26066708 
45.
Gama-Castro S, Salgado H, Peralta-Gil M, Santos-Zavaleta A, Muniz-Rascado L, Solano-Lira H, et al. RegulonDB version 7.0: transcriptional regulation of Escherichia coli K12 integrated within genetic sensory response units (Gensor Units). Nucleic acids research. 2010;39(suppl_1):D98–D105. pmid:21051347 
46.
Michna RH, Zhu B, Mäder U, Stülke J. Subti Wiki 2.0—an integrated database for the model organism Bacillus subtilis. Nucleic acids research. 2015;44(D1):D654–D662. pmid:26433225 
47.
Teixeira MC, Monteiro P, Jain P, Tenreiro S, Fernandes AR, Mira NP, et al. The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic acids research. 2006;34(suppl_1):D446–D451. pmid:16381908 
48.
Arnone MI, Davidson EH. The hardwiring of development: organization and function of genomic regulatory systems. Development. 1997;124(10):1851–1864. pmid:9169833 
49.
Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological). 1996; p. 267–288. 
50.
Liu H, Palatucci M, Zhang J. Blockwise coordinate descent procedures for the multitask lasso, with applications to neural semantic basis discovery. In: Proceedings of the 26th Annual International Conference on Machine Learning. ACM; 2009. p. 649–656. 
51.
Tchourine K, Vogel C, Bonneau R. ConditionSpecific Modeling of Biophysical Parameters Advances Inference of Regulatory Networks. Cell reports. 2018;23(2):376–388. pmid:29641998 
52.
Nicolas P, Mäder U, Dervyn E, Rochat T, Leduc A, Pigeonneau N, et al. Conditiondependent transcriptome reveals highlevel regulatory architecture in Bacillus subtilis. Science. 2012;335(6072):1103–1106. pmid:22383849 
53.
Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics. 2007;23(20):2692–2699. pmid:17724061 
54.
Knijnenburg TA, Daran JMG, van den Broek MA, Daran-Lapujade PA, de Winde JH, Pronk JT, et al. Combinatorial effects of environmental parameters on transcriptional regulation in Saccharomyces cerevisiae: a quantitative analysis of a compendium of chemostat-based transcriptome data. BMC genomics. 2009;10(1):53. pmid:19173729 
55.
Smith EN, Kruglyak L. Gene–environment interaction in yeast gene expression. PLoS biology. 2008;6(4):e83. pmid:18416601 
56.
Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, et al. Functional discovery via a compendium of expression profiles. Cell. 2000;102(1):109–126. pmid:10929718 
57.
Costanzo MC, Engel SR, Wong ED, Lloyd P, Karra K, Chan ET, et al. Saccharomyces genome database provides new regulation data. Nucleic acids research. 2013;42(D1):D717–D725. pmid:24265222 
58.
Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell. 2014;158(6):1431–1443. pmid:25215497 
59.
Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27(7):1017–1018. pmid:21330290 
60.
Karwacz K, Miraldi ER, Pokrovskii M, Madi A, Yosef N, Wortman I, et al. Critical role of IRF1 and BATF in forming chromatin landscape during type 1 regulatory cell differentiation. Nature immunology. 2017;18(4):412. pmid:28166218 
61.
Wilkins O, Hafemeister C, Plessis A, Holloway-Phillips MM, Pham GM, Nicotra AB, et al. EGRINs (Environmental Gene Regulatory Influence Networks) in rice that function in the response to water deficit, high temperature, and agricultural environments. The Plant Cell. 2016; p. tpc–00158. pmid:27655842 
62.
Balakrishnan R, Park J, Karra K, Hitz BC, Binkley G, Hong EL, et al. YeastMine—an integrated data warehouse for Saccharomyces cerevisiae data as a multipurpose toolkit. Database. 2012;2012. 
63.
Tsompana M, Buck MJ. Chromatin accessibility: a window into the genome. Epigenetics & chromatin. 2014;7(1):33. 
64.
Kittler J, Hatef M, Duin RP, Matas J. On combining classifiers. IEEE transactions on pattern analysis and machine intelligence. 1998;20(3):226–239. 
65.
Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–127. pmid:16632515 
66.
Müller C, Schillert A, Röthemeier C, Trégouët DA, Proust C, Binder H, et al. Removing Batch Effects from Longitudinal Gene Expression–Quantile Normalization Plus ComBat as Best Approach for Microarray Transcriptome Data. PloS one. 2016;11(6):e0156594. pmid:27272489 
67.
MacNeil LT, Walhout AJ. Gene regulatory networks and the role of robustness and stochasticity in the control of gene expression. Genome research. 2011;21(5):645–657. pmid:21324878 
68.
Gitter A, Siegfried Z, Klutstein M, Fornes O, Oliva B, Simon I, et al. Backup in gene regulatory networks explains differences between binding and knockout results. Molecular systems biology. 2009;5(1):276. pmid:19536199 
69.
Hecker M, Lambeck S, Toepfer S, Van Someren E, Guthke R. Gene regulatory network inference: data integration in dynamic models—a review. Biosystems. 2009;96(1):86–103. pmid:19150482 
70.
Waltman P, Kacmarczyk T, Bate AR, Kearns DB, Reiss DJ, Eichenberger P, et al. Multi-species integrative biclustering. Genome biology. 2010;11(9):R96. pmid:20920250 
71.
Schep AN, Buenrostro JD, Denny SK, Schwartz K, Sherlock G, Greenleaf WJ. Structured nucleosome fingerprints enable highresolution mapping of chromatin architecture within regulatory regions. Genome research. 2015;25(11):1757–1770. pmid:26314830 
72.
Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno-Tárraga A, Cheng Y, et al. The European nucleotide archive. Nucleic acids research. 2010;39(suppl_1):D28–D31. pmid:20972220 
73.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature methods. 2012;9(4):357–359. pmid:22388286 
74.
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome biology. 2008;9(9):R137. pmid:18798982 
75.
de Boer CG, Hughes TR. YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic acids research. 2011;40(D1):D169–D179. pmid:22102575 
76.
Cherry JM. The Saccharomyces Genome Database: A Tool for Discovery. Cold Spring Harbor Protocols. 2015;2015(12):pdb–top083840. 
77.
Yang E, Ravikumar PK. Dirty statistical models. In: Advances in Neural Information Processing Systems; 2013. p. 611–619. 
78.
Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika. 2008;95(3):759–771. 
79.
Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2010;72(4):417–473. 
80.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software. 2010;33(1):1. pmid:20808728