Canonical correlation analysis (CCA) is a classic and highly versatile statistical approach to investigate the linear relationship between two sets of variables [1, 2]. CCA helps to decode complex dependency structures in multivariate data and to identify groups of interacting variables. Consequently, it has numerous practical applications in molecular biology, for example omics data integration [3] and network analysis [4], but also in many other areas such as econometrics or social science.

In its original formulation, CCA is viewed as an algorithmic procedure optimizing a set of objective functions rather than as a probabilistic model for the data. Only relatively recently has this perspective changed. Bach and Jordan [5] proposed a latent variable model for CCA building on earlier work on probabilistic principal component analysis (PCA) by [6]. The probabilistic approach to CCA not only allows one to derive the classic CCA algorithm but also provides an avenue for Bayesian variants [7, 8].

In parallel to establishing probabilistic CCA, the classic CCA approach has also been further developed in the last decade by introducing variants of the CCA algorithm that are better suited to the high-dimensional data sets now routinely collected in the life and physical sciences. In particular, the problem of singularity in the original CCA algorithm is resolved by introducing sparsity and regularization [9–13], and, similarly, large-scale computation is addressed by new algorithms [14, 15].

In this note, we revisit both classic and probabilistic CCA from the perspective of whitening of random variables [16]. As a result, we propose a simple yet flexible probabilistic model for CCA linking together multivariate regression, latent variable models, and high-dimensional estimation. Crucially, this model for CCA not only facilitates a comprehensive understanding of both classic and probabilistic CCA via the process of whitening but also extends CCA by allowing for negative canonical correlations and by providing the flexibility to include non-normal latent variables.
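To make the whitening view concrete, the following sketch recovers canonical correlations by whitening each block with its inverse covariance square root and taking an SVD of the whitened cross-covariance. It is written in Python/NumPy as a stand-in for the R implementation described later; all function and variable names are illustrative, not from the paper.

```python
import numpy as np

def cca_via_whitening(Sx, Sy, Sxy):
    """Illustrative sketch: classic CCA through whitening.

    Whitening X and Y with the symmetric inverse square roots
    Sx^{-1/2} and Sy^{-1/2} reduces CCA to an SVD of the whitened
    cross-covariance matrix K = Sx^{-1/2} Sxy Sy^{-1/2}.
    """
    def inv_sqrt(S):
        # symmetric inverse square root via eigendecomposition
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Wx = inv_sqrt(Sx)
    Wy = inv_sqrt(Sy)
    K = Wx @ Sxy @ Wy                 # whitened cross-covariance
    U, lam, Vt = np.linalg.svd(K)
    A = U.T @ Wx                      # canonical directions for X
    B = Vt @ Wy                       # canonical directions for Y
    return lam, A, B                  # lam = canonical correlations
```

The rows of `A` and `B` project onto the canonical variates, which by construction have identity covariance.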

The remainder of this paper is organized as follows. First, we present our main results. After reviewing classical CCA we demonstrate that the classic CCA algorithm is a special form of whitening. Next, we show that the link of CCA with multivariate regression leads to a probabilistic two-level latent variable model for CCA that directly reproduces classic CCA without any rotational ambiguity. Subsequently, we discuss our approach by applying it to synthetic data as well as to multiple integrated omics data sets. Finally, we describe our implementation in R and highlight computational and algorithmic aspects.

Much of our discussion is framed in terms of random vectors and their properties rather than in terms of data matrices. This allows us to study the probabilistic model underlying CCA separately from the associated statistical procedures for estimation.

### Multivariate notation

We consider two random vectors X=(*X*_{1},…,*X*_{p})^{T} and Y=(*Y*_{1},…,*Y*_{q})^{T} of dimension *p* and *q*. Their respective multivariate distributions *F*_{X} and *F*_{Y} have expectations E(X)=μ_{X} and E(Y)=μ_{Y} and covariances var(X)=Σ_{X} and var(Y)=Σ_{Y}. The cross-covariance between X and Y is given by cov(X,Y)=Σ_{XY}. The corresponding correlation matrices are denoted by P_{X}, P_{Y}, and P_{XY}. By V_{X}=diag(Σ_{X}) and V_{Y}=diag(Σ_{Y}) we refer to the diagonal matrices containing the variances only, which allows each covariance matrix to be decomposed as Σ=V^{1/2}PV^{1/2}. The composite vector (X^{T},Y^{T})^{T} therefore has mean $\left(\boldsymbol{\mu}_{\boldsymbol{X}}^{T}, \boldsymbol{\mu}_{\boldsymbol{Y}}^{T}\right)^{T}$ and covariance $\left(\begin{array}{cc} \boldsymbol{\Sigma}_{\boldsymbol{X}} & \boldsymbol{\Sigma}_{\boldsymbol{X}\boldsymbol{Y}} \\ \boldsymbol{\Sigma}_{\boldsymbol{X}\boldsymbol{Y}}^{T} & \boldsymbol{\Sigma}_{\boldsymbol{Y}} \end{array}\right)$.
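As a small numerical illustration of the variance-correlation decomposition Σ=V^{1/2}PV^{1/2}, the following NumPy snippet (with made-up numbers, not taken from the paper) extracts V_{X} and P_{X} from a covariance matrix and rebuilds Σ_{X} from them:

```python
import numpy as np

# Illustrative covariance matrix for X with p = 2 (numbers are arbitrary)
Sigma_X = np.array([[4.0, 1.2],
                    [1.2, 9.0]])

V_X = np.diag(np.diag(Sigma_X))        # diagonal matrix of variances only
V_sqrt = np.sqrt(V_X)                  # V^{1/2}
V_inv_sqrt = np.linalg.inv(V_sqrt)     # V^{-1/2}

P_X = V_inv_sqrt @ Sigma_X @ V_inv_sqrt   # correlation matrix P_X
Sigma_rebuilt = V_sqrt @ P_X @ V_sqrt     # recovers Sigma = V^{1/2} P V^{1/2}
```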

Vector-valued samples of the random vectors X and Y are denoted by x_{i} and y_{i}, so that (x_{1},…,x_{i},…,x_{n})^{T} is the *n*×*p* data matrix for X containing *n* observed samples (one in each row). Correspondingly, the empirical mean for X is given by $\hat{\boldsymbol{\mu}}_{\boldsymbol{X}} = \bar{\boldsymbol{x}} = \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_{i}$, the unbiased covariance estimate is $\widehat{\boldsymbol{\Sigma}}_{\boldsymbol{X}} = \boldsymbol{S}_{\boldsymbol{X}} = \frac{1}{n-1} \sum_{i=1}^{n} \left(\boldsymbol{x}_{i} - \bar{\boldsymbol{x}}\right)\left(\boldsymbol{x}_{i} - \bar{\boldsymbol{x}}\right)^{T}$, and the corresponding correlation estimate is denoted by $\widehat{\boldsymbol{P}}_{\boldsymbol{X}} = \boldsymbol{R}_{\boldsymbol{X}}$.
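These estimators translate directly into code. A minimal NumPy sketch (with simulated data, for illustration only) computes the empirical mean, the unbiased covariance S_{X}, and the correlation estimate R_{X} from an *n*×*p* data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
x = rng.normal(size=(n, p))            # n x p data matrix for X, one sample per row

mu_hat = x.mean(axis=0)                # empirical mean (mu_hat_X)
xc = x - mu_hat                        # centered data
S_X = xc.T @ xc / (n - 1)              # unbiased covariance estimate S_X
d = np.sqrt(np.diag(S_X))
R_X = S_X / np.outer(d, d)             # correlation estimate R_X
```

Note the *n*−1 divisor, matching the unbiased estimator above rather than the maximum likelihood variant with divisor *n*.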