We consider a sample with *n* unrelated individuals. Each individual has *K* (potentially correlated) phenotypes and has been genotyped at a variant of interest. Let *y*_{ik} denote the *k*^{th} phenotype value of the *i*^{th} individual and *x*_{i} denote the genotype score of the *i*^{th} individual, where *x*_{i}∈{0, 1, 2} is the number of minor alleles that the *i*^{th} individual carries. We model the relationship between the multiple phenotypes and the genetic variant using an inverse linear regression model, in which the genotype at the variant of interest is the response variable and the multiple phenotypes are predictors. That is,

$${x}_{i}={beta }_{0}+{beta }_{1}{y}_{i1}+ldots +{beta }_{K}{y}_{iK}+{varepsilon }_{i}.$$

(1)

We are not the first using an ordinal variable as response variable in a linear model. To correct for population stratification, Price *et al*.^{16} used a qualitative phenotype or genotypes as response variables in linear models. To adjust the effects of covariates in rare variant association studies, Sha *et al*.^{17} also used a qualitative phenotype or genotypes as response variables in linear models. To test the association between the *K* multiple phenotypes and the variant, we test the null hypothesis *H*_{0}:*β*_{1} = ··· = *β*_{K} = 0 under model (1).

Let *y*_{i} = (1,*y*_{i1}, …, *y*_{iK})^{T} and *β* = (*β*_{0}, *β*_{1}, …, *β*_{K})^{T}, then the regression model in equation (1) can be written as ({x}_{i}={y}_{i}^{T}beta +{varepsilon }_{i},i=1,2,ldots ,n). The ordinary linear square estimate of *β* is (hat{beta }={({Y}^{T}Y)}^{-1}{Y}^{T}x), where *Y* = (*y*_{1}, …, *y*_{n})^{T} and *x* = (*x*_{1}, …, *x*_{n})^{T}. When multiple phenotypes are highly correlated, the rank of matrix *Y* may be less than *K*, then the inverse of *Y*^{T}*Y* may not exist, which results in that the ordinary linear square estimate of *β* may not be unique^{18}. Since multiple phenotypes in a GWAS are usually highly correlated, we propose to use Ridge regression^{19,20,21,22,23,24}. Ridge regression penalizes the size of the regression coefficients. The Ridge regression estimator of *β* is defined as the value of *β* that minimizes

$$sum _{i}{({x}_{i}-{y}_{i}^{T}beta )}^{2}+lambda sum _{j}{beta }_{j}^{2},$$

where *λ* (*λ* ≥ 0) is a tuning parameter. The solution to the Ridge regression is given by ({hat{beta }}_{lambda }={({Y}^{T}Y+lambda I)}^{-1}{Y}^{T}x). Here the estimator of *β* depends on *λ* and we use the subscript *λ* to indicate that the estimator of *β* is a function of *λ*.

Based on Ridge regression, we propose to use the leave-one-out cross validation (LOOCV) prediction error under model (1) as a test statistic. Let ({hat{x}}_{-i}^{lambda }) denote the LOOCV predicted value (leave the *i*^{th} individual out) of *x*_{i} under model (1) with parameter *λ* in Ridge regression. Then, the statistic can be written as ({T}_{lambda }=sum _{i=1}^{n}{({x}_{i}-{hat{x}}_{-i}^{lambda })}^{2}). Note that *T*_{λ} is the LOOCV prediction error, thus low values of *T*_{λ} would imply significance. Let *p*_{λ} denote the p-value of *T*_{λ} (see next paragraph on how to calculate *p*_{λ}). We define the test statistic of Multiple Phenotypes based on Prediction Error (MultP-PE) as

$${T}_{MultP-PE}={min }_{lambda }{p}_{lambda }.$$

(2)

We propose to use a grid search method in equation (2) to evaluate the minimization. We divide the interval [0, ∞) into subintervals (0le {lambda }_{1} < cdot cdot cdot < {lambda }_{M-1} < {lambda }_{M} < infty ). Then, ({T}_{MultP-PE}={{rm{min }}}_{lambda }{p}_{lambda }={min }_{1le mle M}{p}_{{lambda }_{m}}). We use a permutation procedure to evaluate the p-value of *T*_{MultP−PE}. Intuitively, we need to use two layers of permutations to estimate ({p}_{{lambda }_{m}}) and the overall p-value for the test statistic *T*_{MultP−PE}. For microarray data analysis, Ge *et al*.^{25} proposed that one layer of permutation can be used to estimate p-values. We use the permutation procedure of Ge *et al*. to estimate ({p}_{{lambda }_{m}}) and the overall p-value for the test statistic *T*_{MultP−PE}. In each permutation, we randomly shuffle the genotypes at the variant. Suppose that we perform *B* times of permutations. Let ({T}_{{lambda }_{m}}^{(b)}) denote the value of ({T}_{{lambda }_{m}}) based on the *b*^{th} permuted data for *b* = 0, 1, …, *B* and *m* = 1, …, *M*, and ({p}_{{lambda }_{m}}^{(b)}) denote the p-value of ({T}_{{lambda }_{m}}^{(b)}), where *b* = 0 represents the original data. Then, we can estimate ({p}_{{lambda }_{m}}^{(b)}) using ({p}_{{lambda }_{m}}^{(b)}=frac{#{d:{T}_{{lambda }_{m}}^{(d)} < {T}_{{lambda }_{m}}^{(b)},{rm{for}},d=1,ldots ,B}}{B}). Let ({T}_{MultP-PE}^{(b)}={min }_{1le mle M}{p}_{{lambda }_{m}}^{(b)}) denote the test statistic of *T*_{MultP−PE} based on the *b*^{th} permuted data, then the p-value of *T*_{MultP−PE} is given by

$$frac{#{{T}_{MultP-PE,}^{(b)}:{T}_{MultP-PE}^{(b)} < {T}_{MultP-PE}^{(0)},{rm{for}},b=1,2,ldots ,B}}{B}$$

(3)

To apply MultP-PE to GWAS with hundreds of thousands of SNPs, we also propose an algorithm that can perform the permutation procedure described above more efficiently in the following section.

### A Fast Algorithm for the Permutation Procedure

We use the notations in the above section and let *A*_{λ} = (*Y*^{T}*Y* + *λI*)^{−1}, ({h}_{i}^{lambda }={y}_{i}^{T}{A}_{lambda }{y}_{i}), (,{h}_{lambda }=({h}_{1}^{lambda },ldots ,{h}_{n}^{lambda })), and ({hat{beta }}_{lambda }={A}_{lambda }{Y}^{T}x). Then, the Ridge predicted value of *x*_{i} is ({hat{x}}_{i}^{lambda }={y}_{i}^{T}{hat{beta }}_{lambda }) and ({hat{x}}_{lambda }={({hat{x}}_{1}^{lambda },ldots ,{hat{x}}_{n}^{lambda })}^{T}=Y{({Y}^{T}Y+lambda I)}^{-1}{Y}^{T}x). We can show that the LOOCV prediction error in Ridge regression has a closed-form formula^{24,26}, that is, ({x}_{i}-{hat{x}}_{-i}^{lambda }=({x}_{i}-{hat{x}}_{i}^{lambda })/(1-{h}_{i}^{lambda })). Note that for two matrices or vectors *A* and *B*, we use *A***B* and (frac{A}{B}) to denote the element-wise operations; for a matrix *C*, we use *colSum*(*C*) to denote the sums of the columns of matrix *C*. We assume *n* ≥ *K* + 1. We perform singular value decomposition of *Y*, that is, *Y* = *UDV*, where *U* is an *n* × (*K* + 1) matrix with orthonormal columns, *D* is (*K* + 1) × (*K* + 1) diagonal matrix with non-negative real numbers on the diagonal, and *V* is an (*K* + 1) × (*K* + 1) orthogonal matrix. Let *D* = *diag*(*d*_{1}, …, *d*_{K + 1}). Then, ({hat{x}}_{lambda }=U{C}_{lambda }{U}^{T}x), where *C*_{λ} = *diag*(*c*_{λ,1}, …, *c*_{λ,K + 1}) and ({c}_{lambda ,j}={d}_{j}^{2}/({d}_{j}^{2}+lambda )) for *j* = 1, …, *K* + 1. Let ({c}_{lambda }={({c}_{lambda ,1},ldots ,{c}_{lambda ,K+1})}^{T}) and *x*^{(K)} = *U*^{T}*x* be a *K* + 1 dimensional vector. Then, ({hat{x}}_{lambda }=U{C}_{lambda }{x}^{(K)}=U({c}_{lambda }ast {x}^{(K)})) and *h*_{λ} = *diag*(*UC*_{λ}*U*^{T}). For (0le {lambda }_{1} < ldots < {lambda }_{M} < infty ), let (C=({c}_{{lambda }_{1}},ldots ,{c}_{{lambda }_{M}})) and (H=({h}_{{lambda }_{1}},ldots ,{h}_{{lambda }_{M}})). Then, ((,{hat{x}}_{{lambda }_{1}},ldots ,,{hat{x}}_{{lambda }_{M}})=U(Cast {x}^{(K)})=U({c}_{{lambda }_{1}}ast {x}^{(K)},ldots ,{c}_{{lambda }_{M}}ast {x}^{(K)})). If we denote (Q=frac{(x-{hat{x}}_{{lambda }_{1}},,ldots ,,x-{hat{x}}_{{lambda }_{M}})}{1-H}), then (({T}_{{lambda }_{1}},ldots ,{T}_{{lambda }_{M}})=colSum(Qast Q)). Note that *C*, *U*, and *H* only depend on phenotypes and *λ*_{1}, …, *λ*_{M}. Thus, *C*, *U*, and *H* do not change in each permutation. For a GWAS, *C*, *U*, and *H* also do not change at different SNPs. For 1,000 permutations on one SNP, our fast algorithm is about 20 times faster than the original algorithm (the original algorithm calculates *T*_{λ} by ({T}_{lambda }=sum _{i=1}^{n}{({x}_{i}-{hat{x}}_{-i}^{lambda })}^{2})). To perform a GWAS with hundreds of thousands of SNPs, we can use the same approach as was suggested in Zhu *et al*.^{14}, that is, we can first select SNPs that show evidence of association based on a small number of permutations (e.g. 1,000), then use a large number of permutations to test the selected SNPs. For example, in our real data analysis with 630,860 SNPs, we first performed 1,000 permutations and selected SNPs with *p*–*value* ≤ 0.005, then we performed 10^{8} permutations on the selected SNPs because SNPs with *p-value* > 0.005 are not significantly associated with phenotypes.

Although we use a permutation procedure to calculate the p-value of MultP-PE, by using our fast algorithm, we can use less than one day to perform a typical GWAS. In our read data analysis on COPD in the following section, we performed a GWAS with 5,430 individuals across 630,860 SNPs and seven phenotypes. We completed the analysis in 10 hours on Intel Xeon E5-2680v3 by using a single node.

In the above section, we describe MulP-PE without considering covariates. If covariates are needed to be considered, we can incorporate covariates using the following approach in MultP-PE. Suppose that there are total *G* covariates we would like to consider and let (*z*_{i1}, …, *z*_{iG})^{T} denote the covariates for the *i*^{th} individual. We can adjust esch of the phenotypes by the covariates by applying the linear regression model ({y}_{ik}={a}_{0k}+{a}_{1k}{z}_{i1}+ldots +{a}_{Mk}{z}_{iG}+{varepsilon }_{ik}), for *i* = 1, 2, …, *n*, *k* = 1, 2, …, *K*, and use the residual of *y*_{ik} to replace *y*_{ik} in MultP-PE. In our real data analysis, we used this approach to incorporate four covariates. This approach has been used in the literature. For example, Sha *et al*.^{16} and Zhu *et al*.^{14} also used the same approach to adjust phenotypes for the covariates.

In association studies for unrelated individuals, it has been well known that population stratification can seriously confound association results^{27}. There are several methods that have been developed to control for population stratification. For example, Genomic Control (GC) approach^{28,29}, Principal Component (PC) approach^{16,30,31,32}, and Mixed Linear Model (MLM) approach^{33,34}. Similar to most association tests for unrelated individuals, MulP-PE subjects to bias due to population stratification. To make MultP-PE robust to population stratification, we can use the PC approach. Let *c*_{i1}, …, *c*_{iL} denote the top *L* PCs of the genotypes at a set of genomic markers for the *i*^{th} individual. We can use the residuals of the regression model ({x}_{i}=alpha +{beta }_{1}{c}_{i1}+cdot cdot cdot +{beta }_{L}{c}_{iL}+{varepsilon }_{i}) to replace *x*_{i} and use the residuals of the regression model ({y}_{ik}={alpha }_{k}+{beta }_{1k}{c}_{i1}+cdot cdot cdot +{beta }_{Lk}{c}_{iL}+{varepsilon }_{ik}) to replace *y*_{ik} for *k* = 1, 2, …, *K* in MultP-PE to adjust for population stratification.

### Comparison of Methods

We evaluate the performance of the proposed test MultP-PE by comparing it with five most commonly used methods for association studies using multiple phenotypes. These five methods include the O’Brien’s method (OB)^{5}, Trait-based Association Test that uses Extended Simes procedure (TATES)^{7}, Optimal weight method (OW)^{6}, Multivariate analysis of variance (MANOVA)^{9}, and Joint model of multiple phenotypes (MultiPhen)^{2}.