# Joint Analysis of Multiple Phenotypes in Association Studies based on Cross-Validation Prediction Error

Bioinformatics

We consider a sample with n unrelated individuals. Each individual has K (potentially correlated) phenotypes and has been genotyped at a variant of interest. Let yik denote the kth phenotype value of the ith individual and xi denote the genotype score of the ith individual, where xi{0, 1, 2} is the number of minor alleles that the ith individual carries. We model the relationship between the multiple phenotypes and the genetic variant using an inverse linear regression model, in which the genotype at the variant of interest is the response variable and the multiple phenotypes are predictors. That is,

$${x}_{i}={beta }_{0}+{beta }_{1}{y}_{i1}+ldots +{beta }_{K}{y}_{iK}+{varepsilon }_{i}.$$

(1)

We are not the first using an ordinal variable as response variable in a linear model. To correct for population stratification, Price et al.16 used a qualitative phenotype or genotypes as response variables in linear models. To adjust the effects of covariates in rare variant association studies, Sha et al.17 also used a qualitative phenotype or genotypes as response variables in linear models. To test the association between the K multiple phenotypes and the variant, we test the null hypothesis H0:β1 = ··· = βK = 0 under model (1).

Let yi = (1,yi1, …, yiK)T and β = (β0, β1, …, βK)T, then the regression model in equation (1) can be written as ({x}_{i}={y}_{i}^{T}beta +{varepsilon }_{i},i=1,2,ldots ,n). The ordinary linear square estimate of β is (hat{beta }={({Y}^{T}Y)}^{-1}{Y}^{T}x), where Y = (y1, …, yn)T and x = (x1, …, xn)T. When multiple phenotypes are highly correlated, the rank of matrix Y may be less than K, then the inverse of YTY may not exist, which results in that the ordinary linear square estimate of β may not be unique18. Since multiple phenotypes in a GWAS are usually highly correlated, we propose to use Ridge regression19,20,21,22,23,24. Ridge regression penalizes the size of the regression coefficients. The Ridge regression estimator of β is defined as the value of β that minimizes

$$sum _{i}{({x}_{i}-{y}_{i}^{T}beta )}^{2}+lambda sum _{j}{beta }_{j}^{2},$$

where λ (λ ≥ 0) is a tuning parameter. The solution to the Ridge regression is given by ({hat{beta }}_{lambda }={({Y}^{T}Y+lambda I)}^{-1}{Y}^{T}x). Here the estimator of β depends on λ and we use the subscript λ to indicate that the estimator of β is a function of λ.

Based on Ridge regression, we propose to use the leave-one-out cross validation (LOOCV) prediction error under model (1) as a test statistic. Let ({hat{x}}_{-i}^{lambda }) denote the LOOCV predicted value (leave the ith individual out) of xi under model (1) with parameter λ in Ridge regression. Then, the statistic can be written as ({T}_{lambda }=sum _{i=1}^{n}{({x}_{i}-{hat{x}}_{-i}^{lambda })}^{2}). Note that Tλ is the LOOCV prediction error, thus low values of Tλ would imply significance. Let pλ denote the p-value of Tλ (see next paragraph on how to calculate pλ). We define the test statistic of Multiple Phenotypes based on Prediction Error (MultP-PE) as

$${T}_{MultP-PE}={min }_{lambda }{p}_{lambda }.$$

(2)

We propose to use a grid search method in equation (2) to evaluate the minimization. We divide the interval [0, ∞) into subintervals (0le {lambda }_{1} < cdot cdot cdot < {lambda }_{M-1} < {lambda }_{M} < infty ). Then, ({T}_{MultP-PE}={{rm{min }}}_{lambda }{p}_{lambda }={min }_{1le mle M}{p}_{{lambda }_{m}}). We use a permutation procedure to evaluate the p-value of TMultPPE. Intuitively, we need to use two layers of permutations to estimate ({p}_{{lambda }_{m}}) and the overall p-value for the test statistic TMultPPE. For microarray data analysis, Ge et al.25 proposed that one layer of permutation can be used to estimate p-values. We use the permutation procedure of Ge et al. to estimate ({p}_{{lambda }_{m}}) and the overall p-value for the test statistic TMultPPE. In each permutation, we randomly shuffle the genotypes at the variant. Suppose that we perform B times of permutations. Let ({T}_{{lambda }_{m}}^{(b)}) denote the value of ({T}_{{lambda }_{m}}) based on the bth permuted data for b = 0, 1, …, B and m = 1, …, M, and ({p}_{{lambda }_{m}}^{(b)}) denote the p-value of ({T}_{{lambda }_{m}}^{(b)}), where b = 0 represents the original data. Then, we can estimate ({p}_{{lambda }_{m}}^{(b)}) using ({p}_{{lambda }_{m}}^{(b)}=frac{#{d:{T}_{{lambda }_{m}}^{(d)} < {T}_{{lambda }_{m}}^{(b)},{rm{for}},d=1,ldots ,B}}{B}). Let ({T}_{MultP-PE}^{(b)}={min }_{1le mle M}{p}_{{lambda }_{m}}^{(b)}) denote the test statistic of TMultPPE based on the bth permuted data, then the p-value of TMultPPE is given by

$$frac{#{{T}_{MultP-PE,}^{(b)}:{T}_{MultP-PE}^{(b)} < {T}_{MultP-PE}^{(0)},{rm{for}},b=1,2,ldots ,B}}{B}$$

(3)

To apply MultP-PE to GWAS with hundreds of thousands of SNPs, we also propose an algorithm that can perform the permutation procedure described above more efficiently in the following section.

### A Fast Algorithm for the Permutation Procedure

We use the notations in the above section and let Aλ = (YTY + λI)−1, ({h}_{i}^{lambda }={y}_{i}^{T}{A}_{lambda }{y}_{i}), (,{h}_{lambda }=({h}_{1}^{lambda },ldots ,{h}_{n}^{lambda })), and ({hat{beta }}_{lambda }={A}_{lambda }{Y}^{T}x). Then, the Ridge predicted value of xi is ({hat{x}}_{i}^{lambda }={y}_{i}^{T}{hat{beta }}_{lambda }) and ({hat{x}}_{lambda }={({hat{x}}_{1}^{lambda },ldots ,{hat{x}}_{n}^{lambda })}^{T}=Y{({Y}^{T}Y+lambda I)}^{-1}{Y}^{T}x). We can show that the LOOCV prediction error in Ridge regression has a closed-form formula24,26, that is, ({x}_{i}-{hat{x}}_{-i}^{lambda }=({x}_{i}-{hat{x}}_{i}^{lambda })/(1-{h}_{i}^{lambda })). Note that for two matrices or vectors A and B, we use A*B and (frac{A}{B}) to denote the element-wise operations; for a matrix C, we use colSum(C) to denote the sums of the columns of matrix C. We assume n ≥ K + 1. We perform singular value decomposition of Y, that is, Y = UDV, where U is an n × (K + 1) matrix with orthonormal columns, D is (K + 1) × (K + 1) diagonal matrix with non-negative real numbers on the diagonal, and V is an (K + 1) × (K + 1) orthogonal matrix. Let D = diag(d1, …, dK + 1). Then, ({hat{x}}_{lambda }=U{C}_{lambda }{U}^{T}x), where Cλ = diag(cλ,1, …, cλ,K + 1) and ({c}_{lambda ,j}={d}_{j}^{2}/({d}_{j}^{2}+lambda )) for j = 1, …, K + 1. Let ({c}_{lambda }={({c}_{lambda ,1},ldots ,{c}_{lambda ,K+1})}^{T}) and x(K) = UTx be a K + 1 dimensional vector. Then, ({hat{x}}_{lambda }=U{C}_{lambda }{x}^{(K)}=U({c}_{lambda }ast {x}^{(K)})) and hλ = diag(UCλUT). For (0le {lambda }_{1} < ldots < {lambda }_{M} < infty ), let (C=({c}_{{lambda }_{1}},ldots ,{c}_{{lambda }_{M}})) and (H=({h}_{{lambda }_{1}},ldots ,{h}_{{lambda }_{M}})). Then, ((,{hat{x}}_{{lambda }_{1}},ldots ,,{hat{x}}_{{lambda }_{M}})=U(Cast {x}^{(K)})=U({c}_{{lambda }_{1}}ast {x}^{(K)},ldots ,{c}_{{lambda }_{M}}ast {x}^{(K)})). If we denote (Q=frac{(x-{hat{x}}_{{lambda }_{1}},,ldots ,,x-{hat{x}}_{{lambda }_{M}})}{1-H}), then (({T}_{{lambda }_{1}},ldots ,{T}_{{lambda }_{M}})=colSum(Qast Q)). Note that C, U, and H only depend on phenotypes and λ1, …, λM. Thus, C, U, and H do not change in each permutation. For a GWAS, C, U, and H also do not change at different SNPs. For 1,000 permutations on one SNP, our fast algorithm is about 20 times faster than the original algorithm (the original algorithm calculates Tλ by ({T}_{lambda }=sum _{i=1}^{n}{({x}_{i}-{hat{x}}_{-i}^{lambda })}^{2})). To perform a GWAS with hundreds of thousands of SNPs, we can use the same approach as was suggested in Zhu et al.14, that is, we can first select SNPs that show evidence of association based on a small number of permutations (e.g. 1,000), then use a large number of permutations to test the selected SNPs. For example, in our real data analysis with 630,860 SNPs, we first performed 1,000 permutations and selected SNPs with pvalue ≤ 0.005, then we performed 108 permutations on the selected SNPs because SNPs with p-value > 0.005 are not significantly associated with phenotypes.

Although we use a permutation procedure to calculate the p-value of MultP-PE, by using our fast algorithm, we can use less than one day to perform a typical GWAS. In our read data analysis on COPD in the following section, we performed a GWAS with 5,430 individuals across 630,860 SNPs and seven phenotypes. We completed the analysis in 10 hours on Intel Xeon E5-2680v3 by using a single node.

In the above section, we describe MulP-PE without considering covariates. If covariates are needed to be considered, we can incorporate covariates using the following approach in MultP-PE. Suppose that there are total G covariates we would like to consider and let (zi1, …, ziG)T denote the covariates for the ith individual. We can adjust esch of the phenotypes by the covariates by applying the linear regression model ({y}_{ik}={a}_{0k}+{a}_{1k}{z}_{i1}+ldots +{a}_{Mk}{z}_{iG}+{varepsilon }_{ik}), for i = 1, 2, …, n, k = 1, 2, …, K, and use the residual of yik to replace yik in MultP-PE. In our real data analysis, we used this approach to incorporate four covariates. This approach has been used in the literature. For example, Sha et al.16 and Zhu et al.14 also used the same approach to adjust phenotypes for the covariates.

In association studies for unrelated individuals, it has been well known that population stratification can seriously confound association results27. There are several methods that have been developed to control for population stratification. For example, Genomic Control (GC) approach28,29, Principal Component (PC) approach16,30,31,32, and Mixed Linear Model (MLM) approach33,34. Similar to most association tests for unrelated individuals, MulP-PE subjects to bias due to population stratification. To make MultP-PE robust to population stratification, we can use the PC approach. Let ci1, …, ciL denote the top L PCs of the genotypes at a set of genomic markers for the ith individual. We can use the residuals of the regression model ({x}_{i}=alpha +{beta }_{1}{c}_{i1}+cdot cdot cdot +{beta }_{L}{c}_{iL}+{varepsilon }_{i}) to replace xi and use the residuals of the regression model ({y}_{ik}={alpha }_{k}+{beta }_{1k}{c}_{i1}+cdot cdot cdot +{beta }_{Lk}{c}_{iL}+{varepsilon }_{ik}) to replace yik for k = 1, 2, …, K in MultP-PE to adjust for population stratification.

### Comparison of Methods

We evaluate the performance of the proposed test MultP-PE by comparing it with five most commonly used methods for association studies using multiple phenotypes. These five methods include the O’Brien’s method (OB)5, Trait-based Association Test that uses Extended Simes procedure (TATES)7, Optimal weight method (OW)6, Multivariate analysis of variance (MANOVA)9, and Joint model of multiple phenotypes (MultiPhen)2.