Abstract
Background
Recent advances in genome technologies and the subsequent collection of genomic information at various molecular resolutions hold promise to accelerate the discovery of new therapeutic targets. A critical step in achieving these goals is to develop efficient clinical prediction models that integrate these diverse sources of highthroughput data. This step is challenging due to the presence of highdimensionality and complex interactions in the data. For predicting relevant clinical outcomes, we propose a flexible statistical machine learning approach that acknowledges and models the interaction between platformspecific measurements through nonlinear kernel machines and borrows information within and between platforms through a hierarchical Bayesian framework. Our model has parameters with direct interpretations in terms of the effects of platforms and data interactions within and across platforms. The parameter estimation algorithm in our model uses a computationally efficient variational Bayes approach that scales well to large highthroughput datasets.
Results
We apply our methods of integrating gene/mRNA expression and microRNA profiles for predicting patient survival times to The Cancer Genome Atlas (TCGA) based glioblastoma multiforme (GBM) dataset. In terms of prediction accuracy, we show that our nonlinear and interactionbased integrative methods perform better than linear alternatives and nonintegrative methods that do not account for interactions between the platforms. We also find several prognostic mRNAs and microRNAs that are related to tumor invasion and are known to drive tumor metastasis and severe inflammatory response in GBM. In addition, our analysis reveals several interesting mRNA and microRNA interactions that have known implications in the etiology of GBM.
Conclusions
Our approach gains its flexibility and power by modeling the nonlinear interaction structures between and within the platforms. Our framework is a useful tool for biomedical researchers, since clinical prediction using multiplatform genomic information is an important step towards personalized treatment of many cancers. We have a freely available software at: http://odin.mdacc.tmc.edu/~vbaladan webcite.
Keywords:
Bayesian modeling; Multiple kernel learning; Genomics; Highdimensional data analysis; Prediction; Variational inference1 Introduction
Recent advances in genome technologies such as microarrays and nextgeneration sequencing have enabled the measurement of genomic activity at a very detailed resolution (e.g., base pair, singlenucleotide polymorphisms) as well as across multiple molecular levels: the epigenome, transcriptome and proteome. The collection of genomic information at various resolutions holds promise to accelerate the amalgamation of discovery science and clinical medicine [1]. One of the overarching goals of such studies is to relate these genomic data to relevant (patientspecific) clinical outcomes, not only to find significant biomarkers of disease progression/evolution but also to use the biomarkers to develop prediction models for deployment in future therapeutic studies. Furthermore, genomic data are now available from multiple platforms and resolutions for the same individual, thus allowing a researcher to simultaneously query these multiple sources of data to achieve these goals. Such motivating data have been collected under the aegis of The Cancer Genome Atlas (TCGA) project, wherein data from multiple genomic platforms such as gene/mRNA expression, DNA copy number, methylation and microRNA expression profiles are available for multiple tumor types (see http://cancergenome.nih.gov webcite for more details). In addition, the available clinical information, such as stage of disease and survival times, motivates the analytic frameworks that integrate patientspecific data.
One of the main challenges in modeling the statistical dependence between such highthroughput studies is that a large number of measurements (usually in thousands) is available for a relatively small number (usually in tens or hundreds) of patient samples; therefore, classical statistical approaches based on linear models and hierarchical clustering are prone to overfitting [2,3]. In these situations, [3] recommends accounting for highdimensionality by using approaches that borrow information across covariates to compensate for the limited information available across samples, which leads to better and more reliable inference. Several approaches have been developed to address these challenges in various contexts. Some examples include linear parametric models and hierarchical clustering for inferring the relation between phenotypes and genomic features [4], hierarchical Bayesian modeling approaches based on linear shrinkage estimators [5], linear canonical correlation analysis [6], intensitybased approaches for merging datasets [7], and regularized linear regression approaches [8].
Although these approaches are computationally efficient, interpretable, and simple, they make two unrealistic assumptions for practical data analysis. First, due to the parametric and linear assumptions, they might miss the underlying nonlinear patterns in the data. Second, and more importantly, these nonlinear patterns are further amplified in the presence of complex interactions within and between the different platforms that must be modeled while integrating data from these platforms. In this paper, we present a statistical machine learning approach called the hierarchical relevance vector machines (HRVM) to address these modeling and inferential challenges. Briefly, the framework presented here: (a) models the relation between a relevant clinical outcome (scalar) and highdimensional covariates/features through a dataadaptive and flexible nonparametric approach,(b) borrows information within and between platforms through a hierarchical Bayesian framework, (c) acknowledges and models the interaction between platforms through nonlinear kernelbased functionals, (d) has parameters that have explicit interpretation as the effects of the platforms and their interactions on the outcome, and (e) uses a computationally efficient variational Bayes approach that can be readily scaled to large datasets.
Our methods are motivated by and applied to a TCGA based glioblastoma multiforme (GBM) dataset, for which we integrate gene (mRNA) and microRNA (miRNA) expression profiles to predict patient survival times^{a}. There is an increasing interest in identifying subtypes of GBM based on its gene expression data. The ultimate goal of subtyping GBM is to identify gene expression profiles that are prognostic or predictive of treatment outcomes. The known subtypes of GBM samples in TCGA include proneural, neural, classical, and mesenchymal; with the first two classes of which are suspected to differ from the other two in the cell of origin, which is a critical determinant of effective treatment regimens [9]. Differential expressions of miRNAs were recently found to be associated with many diseases, including cancers [10,11]. Previous studies have shown that combining multiple types of data, such as mRNA and miRNA expressions, could significantly improve the accuracy of detecting GBM subtypes, and thereby potentially predict the clinical outcomes [12]. However, methods are lacking to accurately model the effect of interactions between these data types directly on clinical outcomes. Here we show that our nonlinear and interactionbased integrative methods have better prediction accuracy than linear alternatives and nonintegrative methods that do not account for the interactions between the platforms. We also find several prognostic mRNAs and microRNAs that are related to tumor invasion and that are known to drive tumor metastasis and severe inflammatory response in GBM. In addition, our analysis reveals several interesting mRNAmiRNA interactions that have known implications in the etiology of GBM. The paper is structured as follows. The basic construction of HRVM is detailed in Section 2. The analysis of GBM data is presented in Section 3, and concluding remarks about the HRVM framework are presented in Section 4.
2 Hierarchical Relevance Vector Machine model
For ease of exposition, we illustrate the model building process of HRVM using data from two sources: gene/mRNA and miRNA expression measurements. The framework is easily extended to multiple platforms as discussed in Section 4. Suppose, we have data for N patients, and X and Y represent meancentered and standardized gene and miRNA expression matrices, with rows corresponding to patients and columns representing the G genes and M miRNAs, respectively ^{b}. Centering and standardizing the gene and miRNA expression matrices remove any systematic mean or scaling effects caused by the use of different data sources, and make them compatible for model fitting. We denote the gene and miRNA expression for the ith patient as row vectors and . These covariates are highdimensional, that is, both G and M are much larger than N; for example, in the GBM data G≈12000,M≈540,N≈250. Based on these measurements, our aim is to predict a relevant clinical outcome, which in our case is the (logtransformed) survival time measured from time of diagnosis to death, denoted by the column vector t=(t_{1},…,t_{N}) for the N patients.
2.1 Basic construction
A basic (conceptual) model can be written in a highdimensional regression setting as,
where α_{0} is the overall mean effect and ε_{i} is the random error; f(•)’s, generally referred to as basis functions, are chosen to achieve a desired level of flexibility for modeling the effects of X,Y, and their functions on t. Of these functions, f_{(x⊗y)} models the interactions between X and Y, and the remaining basis functions, f_{x} and f_{y}, respectively, model the main effects of X and Y for predicting t. In most situations the regression coefficients, α=(α_{0},α_{1},α_{2},α_{3}), linearly relate the covariate effects (i.e., values of the basis functions evaluated at the covariates) to the response. Linear regression is a special case of (1) when all the basis functions are linear, and the response for the ith patient,
where (x_{i}⊗y_{i})=(x_{i1}y_{i1},…,x_{i1}y_{iM},…,x_{iG}y_{i1},…,x_{iG}y_{iM}) models the first order interactions between genes and miRNAs and α_{0}=0 because of the centered covariates. Further, due to the highdimensional covariates x_{i}’s and y_{i}’s, a penalty is imposed on the regression coefficients α=(α_{1},α_{2},α_{3}) to avoid overfitting. The most popular of such penalties is the Lasso because it has many desirable properties for highdimensional linear regression and variable selection [13,14]. Although (2) with a Lasso penalty is a popular choice for highdimensional regression, the linearity of the basis functions imposes serious restrictions on the flexibility of the model. For example, (2) does not model nonlinear covariate effects as well as second or higher order interactions between genes and miRNAs.
Through HRVM, we propose a regression model as a special case of (1), using kernelbased functions to respectively model f_{x},f_{y}, and f_{(x⊗y)}. The kernel functions incorporate nonlinear effects of possible interactions within and between highdimensional gene and miRNA expression measurements. Further, HRVM estimates the respective contributions of genes, miRNAs, and their interactions in predicting survival times, which is of primary importance in developing novel drug targets. HRVM posits the following regression of t on X and Y for the ith patient:
where (x⊗y)_{i}=(x_{i1},…,x_{iG},y_{i1},…,y_{iM}) is a vector of length G+M and β=(β_{1},β_{2},β_{3}) is such that its components lie on a probability simplex i.e. β_{m}>0 for m=1,2,3 and . HRVM posits different kernels for the data sources and combines them through weights β. The model parameters have the following interpretation:
•The kernel functions f_{x}(•) and f_{y}(•) model all possible interactions among genes and among miRNAs, respectively, and f_{(x⊗y)}(•) models all possible interactions between genes and miRNAs. The three kernels together account for the highdimensionality and nonlinearity of the covariate effects of X and Y by embedding them in the space of kernels.
•The mth component of β, β_{m}, denotes the influence of the mth source on predicting the log survival time and has the following interpretation: if β={1,0,0}, then (3) corresponds to a functional regression model that predicts t (log survival time) with only X (gene expressions) as covariates. Conversely, if β={1/3,1/3,1/3}, then (3) corresponds to a regression model, with the platforms and their interactions contributing equally to the prediction of the survival time. In reality, we expect (and show) different contributions from each platform and estimate these weights from the data.
The task now remains to explicitly characterize the functions f_{x}(•), f_{y}(•) and f_{(x⊗y)}(•) using multiple kernels, as detailed below.
2.2 Multiple kernel learning
Kernel learning (KL) is an approach for nonparametric classification and regression that can be used for predicting t based on X and Y[14]. First, for simplicity, assume that we want to predict t based on X. KL posits the following relation between t and X
where σ^{2} is a kernelspecific “bandwidth” parameter and depends on the choice of kernel, K(•) (detailed later in the section) and ε=(ε_{1},…,ε_{N}) is the (whitenoise) error. The primary parameter of interest is α=(α_{0},…,α_{N})^{T}, and α_{1},…,α_{N} correspond to weights assigned to the features for Nx_{j}’s. Support Vector Machine (SVM) and Relevance Vector Machine (RVM) are canonical examples of KL [14]. We prefer RVM because of its probabilistic interpretation and other optimal properties compared to those of SVM [15]. There are cases, however, where one feature matrix may not fit the data well. Based on this observation, Multiple Kernel Learning (MKL) extends (4) and replaces K by a weighted average of L feature matrices ,
MKL improves the flexibility of KL by introducing L bandwidth parameters and L weights for feature matrices β=(β_{1},…,β_{L})^{T}. A variety of approaches exist to learn , β, and α for MKL (for details see [14,16,17]). Note that in all these works the data source (i.e., X) remains the same for both KL and MKL. The HRVM framework developed in this article extends KL to include multiple data sources and their interactions, and uses a learning algorithm similar to the MKL framework.
Because the three data sources (gene expressions, miRNA expressions, and their interactions) can be used separately for predicting the log survival time, it is reasonable to combine their predictions to obtain more reliable estimates. To this end, HRVM combines respective predictions obtained from different sources obtained using KL (4) through a weighted average, and chooses appropriate weights using MKL (5). Similar to (4), , and are the predicted values of t that correspond to genes, miRNAs, and their interactions, respectively. Using (5), we combine the predictions through the weight vector β=(β_{1},β_{2},β_{3}) to model t as
We further constrain β such that its components lie on a probability simplex, i.e., . This constraint ensures that the joint (convolved) kernel, K_{β}, is positive definite and that β_{m} denotes the influence of the mth source in predicting the log survival time. Note that HRVM is a special case of (3) with , , and , where k_{m,i} is the ith column of K_{m}. Given , MKL can be used to learn α and β.
Although similar to (5), (6) differs in two important ways. First, (6) obtains kernels using (4) for different data sources, namely gene expression, miRNA expression, and their interaction. Second, we allow for dependence between data sources via the interaction kernel (K_{3}), but MKL does not; instead MKL uses a convex combination of the different kernels from the same data source to aid prediction.
The learning algorithm of HRVM is independent of the choice of kernels, but in this work we use a Gaussian radial basis function (RBF) kernel (denoted by K) [14]. The RBF kernel maps the mth highdimensional covariate to its feature space that is represented as feature matrix K_{m}. The feature matrices K_{1},K_{2}, and K_{3} correspond to genes, miRNAs, and interactions, and their (i,j)th entries are as follows:
where is the “bandwidth” parameter of the mth kernel matrix and is chosen a priori through crossvalidation (see [14] for details). The other choices of kernels include polynomial kernels and matern kernels [18]. To account for the overall mean (or intercept) in (1), an extra row of 1’s is appended to the feature matrices in (7); therefore, hereafter have dimensions (N+1)×N.
2.3 Generative Bayesian model for HRVM
HRVM reformulates (6) as a hierarchical Bayesian model for greater flexibility and better interpretation of its parameters. This reformulation serves two important purposes. First, HRVM is interpreted as a hierarchical Bayesian extension of RVM [15], which is a special case of Bayesian KL. Second, instead of using MKL methods, HRVM learns parameters α and β from t,X, and Y using the variational learning algorithm of hierarchical kernel learning (HKL) [14,16].
HRVM posits the following generative model for the (noisy) log survival time measurements t. Similar to MKL, represents the mean of t. The error distribution is Gaussian with mean 0 and precision parameter γ (8). Further, we impose a Gamma prior on γ such that
where represents a multivariate Gaussian distribution with mean μ and covariance matrix Σ and Gamma(.c_{•},d_{•}) represents a Gamma distribution with respective shape and rate parameters c_{•} and d_{•}.
Motivated by the “automatic relevance determination” idea of RVM, we impose independent Gaussian priors on the α_{j}’s with the same mean 0 and different precision parameters ϕ_{j}’s (10), where ϕ_{j} controls (a priori) predictive power of the jth feature vector from the three data sources for the log survival time. A large ϕ_{j} indicates low predictive power. We also impose independent Gamma priors on the ϕ_{j}’s,
where . This setting forces many α_{j}’s a posteriori to be near 0 with high precision. Most of the variance in t is explained by a small number of feature vectors that correspond to nonzero α_{j}’s. These feature vectors are the “relevance vectors” of HRVM that have the following three characteristics: they prevent overfitting, represent a parsimonious description of the data, and correspond to feature vectors that are most predictive of the log survival time. An equivalent prior setting is found by marginalizing the ϕ_{j}’s from the joint distribution of α and ϕ above, which imposes a multivariate Student’s t prior on α with mean 0.
Finally, we impose a Dirichlet prior on β to ensure that its components lie on a probability simplex:
where the mth component of β, β_{m}, denotes the influence of mth source in predicting the log survival time.
The hierarchical Bayesian model (8) – (12) specifies a complete sampling model for the HRVM framework. It can also be interpreted as a probabilistic approach for combining the predictions of log survival times from the three RVMs respectively corresponding to gene expressions, miRNA expressions, and their interactions. HRVM introduces an additional hierarchy and combines the predictions of these three RVMs as a weighted average, with the weights generated from a Dirichlet distribution (12). The increased flexibility of HRVM over RVM comes at the cost of analytic intractability of the posterior distributions of the HRVM parameters. Estimation of the posterior distributions of the HRVM’s parameters can proceed via either simulationbased Markov chain Monte Carlo (MCMC) approaches or deterministic variational Bayes approaches. Given the complexity and size of highthroughput data in general and GBM data in particular, MCMC approaches tend to be computationally expensive and slow. We employ variational Bayes methods from HKL [16] and obtain the analytically tractable variational posterior distribution, q(α,β,ϕ,γt,X,Y,c_{ϕ},d_{ϕ},c_{γ},d_{γ},a_{1},a_{2},a_{3}), that approximates analytically intractable true posterior distribution, p(α,β,ϕ,γt,X,Y,c_{ϕ},d_{ϕ},c_{γ},d_{γ},a_{1},a_{2},a_{3}). This approximation achieves analytic tractability by assuming that α,β,ϕ, and γ are independent under the variational posterior distribution. The analytic tractability leads to improved computational efficiency of the variational Bayes approach over samplingbased MCMC approaches. The derivations for variational posterior distributions are provided in Appendix A Appendix: Variational inference for HRVM.
3 Data analysis
We apply the HRVM approach to the GBM data as introduced in Section 1. GBM was one of the first cancers evaluated by the TCGA. GBM data have multiple molecular measurements on over 500 samples that include gene expression, copy number, methylation and microRNA expression. TCGA datasets are available at http://tcgadata.nci.nih.gov/tcga/ webcite. The dataset we analyze here includes information about the gene expressions (11972 probes), miRNA expressions (534 probes), and (uncensored) survival times for matched patient samples (248).
To remove the irrelevant noise variables before model fitting, we prescreened the gene and miRNA probes as follows. We performed univariate regression of the log survival times on the gene expression values, obtained pvalues, and retained gene and miRNA probes that cross a liberal pvalue threshold (≤ 0.05 here) – to balance the practical and statistical significance. This preselection identifies 1747 and 43 gene expression and miRNA probes, respectively, for downstream modeling. All our analyses and comparisons were based on these selected gene and miRNA probes.
We compare the predictions of HRVM and three linear methods: penalized regressions (2) with the Lasso penalty [13] and with the following covariates: i. gene expressions (GeneLasso), ii. miRNA expressions (MiRNALasso), and iii. both gene and miRNA expressions, and their first order interactions (InteractionLasso). We randomly split the GBM survival data into a training data and a test data with 223 (90%) and 25 (10%) patients, respectively. HRVM, GeneLasso, MiRNALasso, and Interaction Lasso are fit using the gene and miRNA expressions and log survival times in the training data. The variational inference algorithm is used for fitting HRVM (see Appendix A Appendix: Variational inference for HRVM). The R package glmnet is used for the three penalized linear regressions [19,20]. The log survival times of the test data are predicted for the four methods using the model fits on the training data. The mean squared prediction errors (MSPEs) are respectively calculated for the four models as the average of the squared difference between the observed and predicted values for the test data. This process of randomly splitting the GBM survival data into training+test data and fitting the four models is repeated 50 times. The results are summarized below.
Figure 1 shows the prediction results for HRVM, GeneLasso, MiRNALasso, and InteractionLasso using the kernel density estimates (KDEs) of the MSPEs for the 50 random splits. The KDEs of the MSPEs for HRVM is shifted to lower values than those for the three penalized linear regressions. The KDEs of the MSPEs for GeneLasso, MiRNALasso, and InteractionLasso are close to each other, which implies that the MSPEs for these models are fairly similar. Two observations arise from these results. First, the results indicate that penalized linear regression with the Lasso penalty does not lead to improved performance after accounting for interactions among covariates, which has been wellestablished in literature [19]. Second, the prediction results of the penalized linear regressions do not improve after modeling the first order interactions among genes and miRNAs, thus indicating the presence of nonlinear genomic effects and second or higher order interactions among them. For this case study, we see that HRVM performs better than the penalized linear regression methods. This may be because of HRVM accounts for the nonlinear effects of genes and miRNAs and models the interactions within genes, within miRNAs, and between genes and miRNAs. Further, because we model the log survival time, the gain for survival time predictions is, in fact, exponentially higher for HRVM compared to those for the other methods.
Figure 1. Kernel density estimates. Kernel density estimates (KDEs) of mean square prediction errors (MSPEs) for HRVM, GeneLasso, MiRNALasso, and InteractionLasso. The GBM survival data is randomly split 50 times into training and test data, all four models are fit on the training data, and MSPEs for log survival times are calculated for the test data using the model fit on training data. The xaxis represents the MSPEs and the yaxis represents the respective KDEs for HRVM (in solid red), GeneLasso (in dotted blue), MiRNALasso (in dotted and dashed blue), and InteractionLasso (in dashed blue).
We compared the performance of the predictions of the log survival times from HRVM and the observed survival times using KaplanMeier estimates of the survival curves. We used the R package survival to perform the log rank test and estimate the KaplanMeier survival curves [21]. Figure 2 compares the survival probability curves of the log survival times of patients predicted to be in the long and short survival groups by HRVM. The patients are assigned to the long and short survival groups based on a median cutoff of the predicted log survival times obtained from HRVM. The pvalue of the log rank test that the two survival curves are the same is close to 0, indicating that the survival group predictions of HRVM closely agree with the observed survival groups of the patients. In addition, Figure 3 compares the actual survival probability curves of the observed and predicted log survival times of patients in the test data with the minimum MSPE. The pvalues and the survival probability curves indicate that the log survival time predictions of HRVM agree closely agree with the observed log survival times, as well.
Figure 2. Survival probability curves for TCGA data. Survival probability curves for log survival time in the TCGA GBM data. The solid lines are the KaplanMeier estimates of survival probabilities for the patients predicted to have long survival times (in blue) and for the patients predicted to have short survival times (in red), respectively. The patients are assigned to the long and short survival groups based on the estimates of log survival times obtained from HRVM. The dotted lines indicate pointwise 95% confidence intervals for the survival probabilities. The pvalue of the log rank test is 0.
Figure 3. True and predicted survival probability curves. Survival probability curves for the observed log survival time and predicted log survival time (using HRVM) of the patients in the test data with minimum mean square prediction error. The solid lines are the KaplanMeier estimates of survival probabilities for the predicted (in blue) and observed (in red) log survival times in the test data. The dotted lines indicate pointwise 95% confidence intervals for the survival probabilities. The pvalue of the log rank test is 0.0964.
One of the additional gains of our modeling framework is the determination of which platform has a more profound influence on predicting the response, as captured by the weight parameter β. Figure 4 shows the estimates of the weights β for predicting the log survival time of the patients for gene expression, miRNA expressions, and their interactions obtained from HRVM. The medians of the weights (25% and 75% quartiles) for the three data sources are 0.239 (0.113 and 0.360), 0.504 (0.408 and 0.583), and 0.201 (0.108 and 0.404), respectively. Interestingly, HRVM shows that miRNAs have better predictive power than genes in predicting the log survival times of patients in the GBM data. The nonzero weight for interactions between gene and miRNA expressions further confirms the presence of nonlinear interactions.
Figure 4. Boxplots of weights. Boxplots of weights β=(β_{1},β_{2},β_{3}) of gene expressions, miRNA expressions, and their interactions in predicting log survival time. The GBM survival data is randomly split 50 times into training and test data, HRVM is fit on the training data, and β is obtained from the fit on training data. The yaxis shows the distributions of respective weights for gene expressions, miRNA expressions, and their interactions in predicting log survival time of patients across 50 random splits of the GBM survival data.
To gain biological insights into our results, we performed a functional analysis of our model fitting results. We used Ingenuity Pathway Analysis software to perform functional analysis on selected significant genes used in fitting HRVM. We used targetHub [22] to find the known and predicted interactions between significant genes and miRNAs. mirTarBase, a curated database of experimentally validated miRNA targets, was our choice as a source of known gene and miRNA interactions [23]. To identify the predicted gene and miRNA interactions, we used targetScan data [24] to filter out miRNAgene interactions in which the miRNA and gene effects on survival were concordant, since discordant behavior is expected in biological systems for a direct interaction between miRNA and its targets.
Pathway analyses indicates that the antisurvival genes (i.e., genes with negative effects on survival times) are enriched with signaling pathways related to tumor invasion (see Figure 5). HMGB1 and TWEAK signaling pathways, which are known to drive tumor metastasis and severe inflammatory responses in GBM and other cancers, are associated with these genes [2528]. Prosurvival genes are represented by PDGF, PTEN and other signaling pathways. It is wellestablished that the PDGF signaling pathway dominates the proneural subgroup, which correlates with a good survival time for patients with GBM [29]. The functional terms cellular movement and celltocell signaling and interaction pathways are enriched for antisurvival genes, reinforcing their role in invasive GBM.
Figure 5. Comparison of the signaling pathways. Comparison of the signaling pathways associated with significant prognostic genes in Glioblastoma multiforme.
The target analysis of miRNA revealed 22 known interactions between 8 miRNAs and 20 genes, as shown in Table 1. Four of these eight miRNAs (hsamiR31, hsamiR146b, hsamiR221 and hsamiR222) were previously identified as antisurvival markers of GBM [30]. Mir21 is an established marker of GBM and is known to target many tumor suppressor genes [31]. Mir34a expression is higher in other GBM subtypes compared to that in the prosurvival proneural glioma subtype [32]. The antisurvival patterns of all these miRNAs indicate that these gene and miRNA interactions can be targeted for therapy of GBM subgroups with expected poor survival. We also identified 1006 predicted interactions (by TargetScan) between 31 miRNA and 484 genes that are significant (see Additional file 1).
Table 1. List of known genemicroRNA interactions identified as significant in the HRVM model using target analysis
Additional file 1. mRNAmiRNApredictedinteractions.xlsx. Excel file containing all predicted mRNA and microRNA interactions flagged as significant in our analysis. The file is available at: http://odin.mdacc.tmc.edu/~vbaladan/Veera_Home_Page/Software_files/mRNAmiRNApredictedinteractions.xlsx webcite.
Format: XLSX Size: 63KB Download file
4 Conclusions and future work
We have presented an integrative framework, HRVM, that generalizes the multiple kernel learning framework for integrating highdimensional data from multiple sources, incorporating within and between platform interactions to develop a prediction model for clinical outcomes. We applied HRVM to a highdimensional TCGA GBM data to predict patient survival times using two data sources: gene and miRNA expressions, and found that the predictive performance of HRVM is better than those of linear methods that do not model nonlinear effects and interactions. We hypothesize that HRVM gains flexibility and power by modeling the nonlinear interaction structures between gene and miRNA expressions. HRVM will be a useful tool for biomedical researchers, as clinical prediction using multiplatform genomic information is an important step towards identifying personalized treatments for many cancers. We have code for fitting HRVM that is freely available at the corresponding author’s website (see Additional file 2).
Additional file 2. hrvm0.1.1.tar.gz. R package for fitting HRVM available at: http://odin.mdacc.tmc.edu/~vbaladan/Veera_Home_Page/Software_files/hrvm_0.1.1.tar.gz webcite.
Format: GZ Size: 155KB Download file
Although we have presented the application of HRVM in the context of two platforms, the framework is general and can be extended and adapted to data from multiple platforms with different distributional assumptions. This will essentially entail a generalization of the HRVM model by assuming additional terms for the different platforms. One key issue that warrants further investigation is an increase in the number of (multiplicative) betweenplatform interaction terms. We used the computationally efficient variational Bayes methods, which are extremely useful for handling large datasets from projects such as TCGA. In addition, [17] presents more scalable versions of HKL and MKL that can be adapted to our framework. Our future work will concentrate on extending the HRVM framework using Bayesian spike and slab priors to select variables from the interacting covariates before embedding the data in the space of kernels, as well as incorporating uncertainty estimations of the scale parameters – thus aiding the joint model building process.
Endnotes
^{a} We use gene and mRNA interchangeably to mean transcriptlevel expression.
^{b} We use bold lowercase and uppercase alphabets to denote column vectors and matrices, respectively.
A Appendix: Variational inference for HRVM
Following the hierarchic kernel learning algorithm (HKL) of [16], we provide the derivation for the variational inference algorithm that estimates the variational posterior distributions for the parameters of HRVM. Our interest lies in the posterior distributions of α and β that are obtained using the Bayesian model (8) – (12). Unlike RVM, the posterior distributions of α and β in HRVM are analytically intractable. There are several techniques that can be used to obtain these posteriors distributions. We employ the variational Bayes methods from HKL [16] and obtain analytically tractable variational posterior distribution, q(α,β,ϕ,γ t,X,Y,c_{ϕ},d_{ϕ},c_{γ},d_{γ},a_{1},a_{2},a_{3}), that approximates analytically intractable true posterior distribution, p(α,β,ϕ,γt,X,Y,c_{ϕ},d_{ϕ},c_{γ},d_{γ},a_{1},a_{2},a_{3}). The variational approach minimizes the KullbackLiebler (KL) divergence between q(α, β, ϕ,γt, X,Y,c_{ϕ},d_{ϕ},c_{γ}, d_{γ},a_{1},a_{2},a_{3}) and p(α,β, ϕ,γt,X,Y, c_{ϕ},d_{ϕ},c_{γ},d_{γ},a_{1},a_{2},a_{3}). This approximation achieves analytic tractability by assuming that α,β, ϕ, and γ are independent under q(α,β,ϕ, γt,X,Y,c_{ϕ},d_{ϕ},c_{γ}, d_{γ},a_{1},a_{2},a_{3}). Therefore,
where we have suppressed the conditioning on the data and hyperparameters for the variational posteriors on the right. Notice that the factorization (13) alone guarantees the analytic tractability of q(α,β,ϕ,γt,X,Y,c_{ϕ},d_{ϕ},c_{γ},d_{γ},a_{1},a_{2},a_{3}), and we do not assume any distributional form for the q’s. Following [16] and [14], the variational posterior distributions are derived as
where all expectations are with respect to the variational posterior distributions. Hereafter, we will denote as 〈f〉_{•} for notational simplicity.
Following [16] and using (14), the variational posterior distribution of α,
Following [16] and using (16), the variational posterior distribution of ϕ,
Following [16] and using (17), the variational posterior distribution of γ,
All the expectations above are determined using 〈α〉_{α},〈αα^{T}〉_{α},〈ϕ〉_{ϕ}, and 〈γ〉_{γ}, which are available from the distributional forms of q(α),q(ϕ), and q(γ) in (18) – (20). Specifically,
Instead of q(β), its nonnormalized version q^{∗}(β) is available from [16] as
Following [16], calculate 〈β〉_{β},〈logβ〉_{β}, and 〈ββ^{T}〉_{β}, as follows. Sample Sβ’s from Dirichlet(a_{1},a_{2},a_{3}) and estimate the expectations as where f(β)≡β, logβ,andββ^{T}, respectively, and
The analytic tractability of q(α,β,ϕ,γt,X,Y,c_{ϕ},d_{ϕ},c_{γ}, d_{γ},a_{1},a_{2},a_{3}) in variational inference guarantees that the marginal variational distribution (or likelihood) of the data q(t,X,Yc_{ϕ},d_{ϕ},c_{γ},d_{γ},a_{1},a_{2},a_{3}) is also analytically tractable. Estimate hyperparameters c_{ϕ},d_{ϕ},c_{γ},d_{γ},a_{1},a_{2}, and a_{3} as
which is the type II maximum likelihood procedure as recommended in [16]. The kernel parameters are learned respectively from three RVMs for each of the three sources using crossvalidation as recommended by [15].
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
VB conceived the research. SS and VB worked out the detailed algorithms and derivations. SS implemented the software and conducted the data analysis. WW and VB oversaw the entire research. WW and GM provided the biological insights into the findings. CO provided insights into the predictive analysis. All authors contributed towards writing the manuscript.
Acknowledgements
VB’s research is partially supported by NIH grant R01 CA160736; NSF grant IIS915196 and the Cancer Center Support Grant (CCSG) (P30 CA016672). WW’s work is in part funded by 5U24 CA14388304 and P30 CA016672. CO is supported by NSF grant IIS914861. The content is solely the responsibility of the authors and does not necessarily represent the official views of the U.S. National Cancer Institute, the National Institutes of Health, or the National Science Foundation. We also thank LeeAnn Chastain for editorial assistance with the manuscript.
References

L Chin, JN Andersen, PA Futreal, Cancer genomics: from discovery science to personalized medicine. Nature Med 17(3), 297–303 (2011). PubMed Abstract  Publisher Full Text

D Witten, R Tibshirani, A framework for feature selection in clustering. J. Am. Stat. Assoc 105(490), 713–726 (2010). PubMed Abstract  Publisher Full Text  PubMed Central Full Text

B Efron, LargeScale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction (Cambridge University Press, New York, USA, 2010)

M Diehn, C Nardini, M Kuo, Identification of noninvasive imaging surrogates for brain tumor geneexpression modules. Proc. Natl. Acad. Sci 105(13), 5213 (2008). PubMed Abstract  Publisher Full Text  PubMed Central Full Text

W Wang, V Baladandayuthapani, JS Morris, BM Broom, G Manyam, KA Do, iBAG: integrative Bayesian analysis of highdimensional multiplatform genomics data. Bioinformatics 29(2), 149–159 (http://bioinformatics, 2013), . oxfordjournals.org/content/29/2/149.abstract webcite PubMed Abstract  Publisher Full Text  PubMed Central Full Text

DM Witten, RJ Tibshirani, et al. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat. Appl. Genet. Mol. Biol 8, 28 (2009)

AA Shabalin, H Tjelmeland, C Fan, CM Perou, AB Nobel, Merging two geneexpression studies via crossplatform normalization. Bioinformatics 24(9), 1154–1160 (2008). PubMed Abstract  Publisher Full Text

S Ma, Y Zhang, J Huang, Y Huang, Q Lan, N Rothman, T Zheng, Integrative analysis of cancer prognosis data with multiple subtypes using regularized gradient descent. Genet. Epidemiol 36(8), 829–838 (http://dx, 2012), . doi.org/10.1002/gepi.21669 webcite Publisher Full Text

RGx Verhaak, RG Hoadley, CM Perou, DN Hayes, Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17, 98–110 (2010). PubMed Abstract  Publisher Full Text  PubMed Central Full Text

MV Iorio, M Ferracin, M Negrini, CM Croce, MicroRNA gene expression deregulation in human breast cancer. Cancer Res 65(16), 7065–7070 (2005). PubMed Abstract  Publisher Full Text

P Fasanaro, S Greco, M Ivan, MC Capogrossi, F Martelli, microRNA: emerging therapeutic targets in acute ischemic diseases. Pharmacol. Ther 125, 92–104 (2010). PubMed Abstract  Publisher Full Text

W Tang, J Duan, JG Zhang, YP Wang, Subtyping glioblastoma by combining miRNA and mRNA expression data using compressed sensingbased approach. EURASIP J. Bioinformatics Syst. Biol 2013, 2 (2013). BioMed Central Full Text

R Tibshirani, Regression shrinkage selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological) 58(1), 267–288 (1996)

CM Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006)

M Tipping, Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res 1, 211–244 (2001)

M Girolami, S Rogers, Hierarchic Bayesian models for kernel learning. Proceedings of the 22nd International Conference on Machine Learning (ICML05) (ACM, New York, USA, Bonn, Germany, 2005), pp. 241–248 (http://doi, 2005), . acm.org/10.1145/1102351.1102382 webcite Publisher Full Text

M Gönen, Bayesian efficient multiple kernel learning. in Proceedings of the 29th International Conference on Machine Learning (ICML12), ed. by Langford J, Pineau J (Omnipress, Edinburgh, Scotland, 2012), pp. 1–8

J ShaweTaylor, N Cristianini, Kernel Methods for Pattern Analysis (Cambridge university press, New York, USA, 2004)

J Friedman, T Hastie, R Tibshirani, Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw 33, 1 (2010). PubMed Abstract  PubMed Central Full Text

R Development Core Team R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, 2013) (http://www, 2013), . Rproject.org/ webcite

TM Therneau, Modeling Survival Data: Extending the Cox Model (SpringerVerlag New York, Inc., New York, USA, 2000)

targetHub targetHub (http://app1, 2013), . bioinformatics.mdanderson.org/tarhub/_design/basic/index.html webcite

SD Hsu, FM Lin, WY Wu, C Liang, WC Huang, WL Chan, WT Tsai, GZ Chen, CJ Lee, CM Chiu, et al. miRTarBase: a database curates experimentally validated microRNA–target interactions. Nucleic Acids Res 39(suppl 1), D163—D169 (2011). PubMed Abstract  Publisher Full Text  PubMed Central Full Text

RC Friedman, KKH Farh, CB Burge, DP Bartel, Most mammalian mRNAs are conserved targets of microRNAs. Genome Res 19, 92–105 (2009). PubMed Abstract  Publisher Full Text  PubMed Central Full Text

D Tang, R Kang, HJ Zeh, MT Lotze, Highmobility group box 1 and cancer. Biochimica et Biophysica Acta (BBA)Gene Regul. Mech 1799, 131–140 (2010). Publisher Full Text

NL Tran, WS McDonough, BA Savitch, TF Sawyer, JA Winkles, ME Berens, The tumor necrosis factorlike weak inducer of apoptosis (TWEAK)fibroblast growth factorinducible 14 (Fn14) signaling system regulates glioma cell survival via NFκB pathway activation and BCLXL/BCLW expression. J. Biol. Chem 280(5), 3483–3492 (2005). PubMed Abstract  Publisher Full Text

M Huang, S Narita, N Tsuchiya, Z Ma, K Numakura, T Obara, H Tsuruta, M Saito, T Inoue, Y Horikawa, et al. Overexpression of Fn14 promotes androgenindependent prostate cancer progression through MMP9 and correlates with poor treatment outcome. Carcinogenesis 32(11), 1589–1596 (2011). PubMed Abstract  Publisher Full Text

L Dai, L Gu, C Ding, L Qiu, W Di, TWEAK promotes ovarian cancer cell metastasis via NFÎ°B pathway activation and VEGF expression. Cancer Lett 283(2), 159–167 (http://www, 2009), . sciencedirect.com/science/article/pii/S0304383509002286 webcite PubMed Abstract  Publisher Full Text

JT Huse, E Holland, LM DeAngelis, Glioblastoma: molecular analysis and clinical implications. Ann. Rev. Med 64, 59–70 (http://dx, 2012), . doi.org/10.1146/annurevmed100711143028 webcite PubMed Abstract  Publisher Full Text

S Srinivasan, IRP Patric, K Somasundaram, A tenmicroRNA expression signature predicts survival in glioblastoma. PLoS One 6(3), e17438 (2011). PubMed Abstract  Publisher Full Text  PubMed Central Full Text

T Papagiannakopoulos, A Shapiro, KS Kosik, MicroRNA21 targets a network of key tumorsuppressive pathways in glioblastoma cells. Cancer Res 68(19), 8164–8172 (2008). PubMed Abstract  Publisher Full Text

J Silber, A Jacobsen, T Ozawa, G Harinath, A Pedraza, C Sander, EC Holland, JT Huse, miR34a repression in proneural malignant gliomas upregulates expression of its target PDGFRA and promotes tumorigenesis. PLoS ONE 7(3), e33844 (http://dx, 2012), . doi.org/10.1371 webcite PubMed Abstract  Publisher Full Text  PubMed Central Full Text