6 N-Integration
N-Integration is the framework of having multiple datasets which measure different aspects of the same samples. For example, you may have transcriptomic, genetic and proteomic data for the same set of cells. N-integrative methods are built to use the information in all three of these dataframes simultaenously.
DIABLO is a novel mixOmics
framework for the integration of multiple data sets while explaining their relationship with a categorical outcome variable. DIABLO stands for Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies. It can also be referred to as Multiblock (s)PLS-DA.
6.1 Block sPLS-DA on the TCGA case study
Human breast cancer is a heterogeneous disease in terms of molecular alterations, cellular composition, and clinical outcome. Breast tumours can be classified into several subtypes, according to their levels of mRNA expression (Sørlie et al. 2001). Here we consider a subset of data generated by The Cancer Genome Atlas Network (Cancer Genome Atlas Network et al. 2012). For the package, data were normalised, and then drastically prefiltered for illustrative purposes.
The data were divided into a training set with a subset of 150 samples from the mRNA, miRNA and proteomics data, and a test set including 70 samples, but only with mRNA and miRNA data (the proteomics data are missing). The aim of this integrative analysis is to identify a highly correlated multi-omics signature discriminating the breast cancer subtypes Basal, Her2 and LumA.
The breast.TCGA
(more details can be found in ?breast.TCGA
) is a list containing training and test sets of omics data data.train
and data.test
which include:
$miRNA
: A data frame with 150 (70) rows and 184 columns in the training (test) data set for the miRNA expression levels,$mRNA
: A data frame with 150 (70) rows and 520 columns in the training (test) data set for the mRNA expression levels,$protein
: A data frame with 150 rows and 142 columns in the training data set for the protein abundance (there are no proteomics in the test set),$subtype
: A factor indicating the breast cancer subtypes in the training (for 150 samples) and test sets (for 70 samples).
This case study covers an interesting scenario where one omic data set is missing in the test set, but because the method generates a set of components per training data set, we can still assess the prediction or performance evaluation using majority or weighted prediction vote.
6.2 Load the data
To illustrate the multiblock sPLS-DA approach, we will integrate the expression levels of miRNA, mRNA and the abundance of proteins while discriminating the subtypes of breast cancer, then predict the subtypes of the samples in the test set.
The input data is first set up as a list of \(Q\) matrices \(\boldsymbol X_1, \dots, \boldsymbol X_Q\) and a factor indicating the class membership of each sample \(\boldsymbol Y\). Each data frame in \(\boldsymbol X\) should be named as we will match these names with the keepX
parameter for the sparse method.
library(mixOmics)
data(breast.TCGA)
# Extract training data and name each data frame
# Store as list
<- list(mRNA = breast.TCGA$data.train$mrna,
X miRNA = breast.TCGA$data.train$mirna,
protein = breast.TCGA$data.train$protein)
# Outcome
<- breast.TCGA$data.train$subtype
Y summary(Y)
## Basal Her2 LumA
## 45 30 75
6.3 Parameter choice
6.3.1 Design matrix
The choice of the design can be motivated by different aspects, including:
Biological apriori knowledge: Should we expect
mRNA
andmiRNA
to be highly correlated?Analytical aims: As further developed in Singh et al. (2019), a compromise needs to be achieved between a classification and prediction task, and extracting the correlation structure of the data sets. A full design with weights = 1 will favour the latter, but at the expense of classification accuracy, whereas a design with small weights will lead to a highly predictive signature. This pertains to the complexity of the analytical task involved as several constraints are included in the optimisation procedure. For example, here we choose a 0.1 weighted model as we are interested in predicting test samples later in this case study.
<- matrix(0.1, ncol = length(X), nrow = length(X),
design dimnames = list(names(X), names(X)))
diag(design) <- 0
design
## mRNA miRNA protein
## mRNA 0.0 0.1 0.1
## miRNA 0.1 0.0 0.1
## protein 0.1 0.1 0.0
Note however that even with this design, we will still unravel a correlated signature as we require all data sets to explain the same outcome \(\boldsymbol y\), as well as maximising pairs of covariances between data sets.
- Data-driven option: we could perform regression analyses with PLS to further understand the correlation between data sets. Here for example, we run PLS with one component and calculate the cross-correlations between components associated to each data set:
<- pls(X$mRNA, X$protein, ncomp = 1)
res1.pls.tcga cor(res1.pls.tcga$variates$X, res1.pls.tcga$variates$Y)
<- pls(X$mRNA, X$miRNA, ncomp = 1)
res2.pls.tcga cor(res2.pls.tcga$variates$X, res2.pls.tcga$variates$Y)
<- pls(X$protein, X$miRNA, ncomp = 1)
res3.pls.tcga cor(res3.pls.tcga$variates$X, res3.pls.tcga$variates$Y)
## comp1
## comp1 0.9031761
## comp1
## comp1 0.8456299
## comp1
## comp1 0.7982008
The data sets taken in a pairwise manner are highly correlated, indicating that a design with weights \(\sim 0.8 - 0.9\) could be chosen.
6.3.2 Number of components
As in the PLS-DA framework presented in Module 3, we first fit a block.plsda
model without variable selection to assess the global performance of the model and choose the number of components. We run perf()
with 10-fold cross validation repeated 10 times for up to 5 components and with our specified design matrix. Similar to PLS-DA, we obtain the performance of the model with respect to the different prediction distances (Figure 6.1):
<- block.plsda(X, Y, ncomp = 5, design = design)
diablo.tcga
set.seed(123) # For reproducibility, remove for your analyses
= perf(diablo.tcga, validation = 'Mfold', folds = 10, nrepeat = 10)
perf.diablo.tcga
#perf.diablo.tcga$error.rate # Lists the different types of error rates
# Plot of the error rates based on weighted vote
plot(perf.diablo.tcga)
The performance plot indicates that two components should be sufficient in the final model, and that the centroids distance might lead to better prediction. A balanced error rate (BER) should be considered for further analysis.
The following outputs the optimal number of components according to the prediction distance and type of error rate (overall or balanced), as well as a prediction weighting scheme illustrated further below.
$choice.ncomp$WeightedVote perf.diablo.tcga
## max.dist centroids.dist mahalanobis.dist
## Overall.ER 3 2 3
## Overall.BER 3 2 3
Thus, here we choose our final ncomp
value:
<- perf.diablo.tcga$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"] ncomp
6.3.3 Number of variables to select
We then choose the optimal number of variables to select in each data set using the tune.block.splsda
function. The function tune()
is run with 10-fold cross validation, but repeated only once (nrepeat = 1
) for illustrative and computational reasons here. For a thorough tuning process, we advise increasing the nrepeat
argument to 10-50, or more.
We choose a keepX
grid that is relatively fine at the start, then coarse. If the data sets are easy to classify, the tuning step may indicate the smallest number of variables to separate the sample groups. Hence, we start our grid at the value 5
to avoid a too small signature that may preclude biological interpretation.
# chunk takes about 2 min to run
set.seed(123) # for reproducibility
<- list(mRNA = c(5:9, seq(10, 25, 5)),
test.keepX miRNA = c(5:9, seq(10, 20, 2)),
proteomics = c(seq(5, 25, 5)))
<- tune.block.splsda(X, Y, ncomp = 2,
tune.diablo.tcga test.keepX = test.keepX, design = design,
validation = 'Mfold', folds = 10, nrepeat = 1,
BPPARAM = BiocParallel::SnowParam(workers = 2),
dist = "centroids.dist")
<- tune.diablo.tcga$choice.keepX list.keepX
Note:
- For fast computation, we can use parallel computing here - this option is also enabled on a laptop or workstation, see
?tune.block.splsda
.
The number of features to select on each component is returned and stored for the final model:
list.keepX
## $mRNA
## [1] 8 25
##
## $miRNA
## [1] 14 5
##
## $protein
## [1] 10 5
Note:
- You can skip any of the tuning steps above, and hard code your chosen
ncomp
andkeepX
parameters (as a list for the latter, as shown below).
<- list( mRNA = c(8, 25), miRNA = c(14,5), protein = c(10, 5)) list.keepX
6.4 Final model
The final multiblock sPLS-DA model includes the tuned parameters and is run as:
<- block.splsda(X, Y, ncomp = ncomp,
diablo.tcga keepX = list.keepX, design = design)
## Design matrix has changed to include Y; each block will be
## linked to Y.
#06.tcga # Lists the different functions of interest related to that object
A warning message informs us that the outcome \(\boldsymbol Y\) has been included automatically in the design, so that the covariance between each block’s component and the outcome is maximised, as shown in the final design output:
$design diablo.tcga
## mRNA miRNA protein Y
## mRNA 0.0 0.1 0.1 1
## miRNA 0.1 0.0 0.1 1
## protein 0.1 0.1 0.0 1
## Y 1.0 1.0 1.0 0
The selected variables can be extracted with the function selectVar()
, for example in the mRNA block, along with their loading weights (not output here):
# mRNA variables selected on component 1
selectVar(diablo.tcga, block = 'mRNA', comp = 1)
Note:
- The stability of the selected variables can be extracted from the
perf()
function, similar to the example given in the PLS-DA analysis (Module 3).
6.5 Sample plots
6.5.1 plotDiablo
plotDiablo()
is a diagnostic plot to check whether the correlations between components from each data set were maximised as specified in the design matrix. We specify the dimension to be assessed with the ncomp
argument (Figure 6.2).
plotDiablo(diablo.tcga, ncomp = 1)
The plot indicates that the first components from all data sets are highly correlated. The colours and ellipses represent the sample subtypes and indicate the discriminative power of each component to separate the different tumour subtypes. Thus, multiblock sPLS-DA is able to extract a strong correlation structure between data sets, as well as discriminate the breast cancer subtypes on the first component.
6.5.2 plotIndiv
The sample plot with the plotIndiv()
function projects each sample into the space spanned by the components from each block, resulting in a series of graphs corresponding to each data set (Figure 6.3). The optional argument blocks
can output a specific data set. Ellipse plots are also available (argument ellipse = TRUE
).
plotIndiv(diablo.tcga, ind.names = FALSE, legend = TRUE,
title = 'TCGA, DIABLO comp 1 - 2')
This type of graphic allows us to better understand the information extracted from each data set and its discriminative ability. Here we can see that the LumA group can be difficult to classify in the miRNA data.
Note:
- Additional variants include the argument
block = 'average'
that averages the components from all blocks to produce a single plot. The argumentblock='weighted.average'
is a weighted average of the components according to their correlation with the components associated with the outcome.
6.5.3 plotArrow
In the arrow plot in Figure 6.4, the start of the arrow indicates the centroid between all data sets for a given sample and the tip of the arrow the location of that same sample but in each block. Such graphics highlight the agreement between all data sets at the sample level when modelled with multiblock sPLS-DA.
plotArrow(diablo.tcga, ind.names = FALSE, legend = TRUE,
title = 'TCGA, DIABLO comp 1 - 2')
This plot shows that globally, the discrimination of all breast cancer subtypes can be extracted from all data sets, however, there are some dissimilarities at the samples level across data sets (the common information cannot be extracted in the same way across data sets).
6.6 Variable plots
The visualisation of the selected variables is crucial to mine their associations in multiblock sPLS-DA. Here we revisit existing outputs presented in Module 2 with further developments for multiple data set integration. All the plots presented provide complementary information for interpreting the results.
6.6.1 plotVar
The correlation circle plot highlights the contribution of each selected variable to each component. Important variables should be close to the large circle (see Module 2). Here, only the variables selected on components 1 and 2 are depicted (across all blocks), see Figure 6.5. Clusters of points indicate a strong correlation between variables. For better visibility we chose to hide the variable names.
plotVar(diablo.tcga, var.names = FALSE, style = 'graphics', legend = TRUE,
pch = c(16, 17, 15), cex = c(2,2,2),
col = c('darkorchid', 'brown1', 'lightgreen'),
title = 'TCGA, DIABLO comp 1 - 2')
The correlation circle plot shows some positive correlations (between selected miRNA and proteins, between selected proteins and mRNA) and negative correlations between mRNAand miRNA on component 1. The correlation structure is less obvious on component 2, but we observe some key selected features (proteins and miRNA) that seem to highly contribute to component 2.
Note:
These results can be further investigated by showing the variable names on this plot (or extracting their coordinates available from the plot saved into an object, see
?plotVar
), and looking at various outputs fromselectVar()
andplotLoadings()
.You can choose to only show specific variable type names, e.g.
var.names = c(FALSE, FALSE, TRUE)
(where each argument is assigned to a data set in \(\boldsymbol X\)). Here for example, the protein names only would be output.
6.6.2 circosPlot
The circos plot represents the correlations between variables of different types, represented on the side quadrants. Several display options are possible, to show within and between connections between blocks, and expression levels of each variable according to each class (argument line = TRUE
). The circos plot is built based on a similarity matrix, which was extended to the case of multiple data sets from González et al. (2012) (see also Module 2 and Extra Reading material from that module). A cutoff
argument can be further included to visualise correlation coefficients above this threshold in the multi-omics signature (Figure 6.6). The colours for the blocks and correlation lines can be chosen with color.blocks
and color.cor
respectively:
circosPlot(diablo.tcga, cutoff = 0.7, line = TRUE,
color.blocks = c('darkorchid', 'brown1', 'lightgreen'),
color.cor = c("chocolate3","grey20"), size.labels = 1.5)
The circos plot enables us to visualise cross-correlations between data types, and the nature of these correlations (positive or negative). Here we observe that correlations > 0.7 are between a few mRNAand some Proteins, whereas the majority of strong (negative) correlations are observed between miRNA and mRNAor Proteins. The lines indicating the average expression levels per breast cancer subtype indicate that the selected features are able to discriminate the sample groups.
6.6.3 network
Relevance networks, which are also built on the similarity matrix, can also visualise the correlations between the different types of variables. Each colour represents a type of variable. A threshold can also be set using the argument cutoff
(Figure 6.7). By default the network includes only variables selected on component 1, unless specified in comp
.
Note that sometimes the output may not show with Rstudio due to margin issues. We can either use X11()
to open a new window, or save the plot as an image using the arguments save
and name.save
, as we show below. An interactive
argument is also available for the cutoff
argument, see details in ?network
.
# X11() # Opens a new window
network(diablo.tcga, blocks = c(1,2,3),
cutoff = 0.4,
color.node = c('darkorchid', 'brown1', 'lightgreen'),
# To save the plot, uncomment below line
#save = 'png', name.save = 'diablo-network'
)
The relevance network in Figure 6.7 shows two groups of features of different types. Within each group we observe positive and negative correlations. The visualisation of this plot could be further improved by changing the names of the original features.
Note that the network can be saved in a .gml format to be input into the software Cytoscape, using the R package igraph
(Csardi, Nepusz, et al. 2006):
# Not run
library(igraph)
<- network(diablo.tcga, blocks = c(1,2,3), cutoff = 0.4)
myNetwork write.graph(myNetwork$gR, file = "myNetwork.gml", format = "gml")
6.6.4 plotLoadings
plotLoadings()
visualises the loading weights of each selected variable on each component and each data set. The colour indicates the class in which the variable has the maximum level of expression (contrib = 'max'
) or minimum (contrib = 'min'
), on average (method = 'mean'
) or using the median (method = 'median'
).
plotLoadings(diablo.tcga, comp = 1, contrib = 'max', method = 'median')
The loading plot shows the multi-omics signature selected on component 1, where each panel represents one data type. The importance of each variable is visualised by the length of the bar (i.e. its loading coefficient value). The combination of the sign of the coefficient (positive / negative) and the colours indicate that component 1 discriminates primarily the Basal samples vs. the LumA samples (see the sample plots also). The features selected are highly expressed in one of these two subtypes. One could also plot the second component that discriminates the Her2 samples.
6.6.5 cimDiablo
The cimDiablo()
function is a clustered image map specifically implemented to represent the multi-omics molecular signature expression for each sample. It is very similar to a classical hierarchical clustering (Figure 6.9).
cimDiablo(diablo.tcga, color.blocks = c('darkorchid', 'brown1', 'lightgreen'),
comp = 1, margin=c(8,20), legend.position = "right")
##
## trimming values to [-3, 3] range for cim visualisation. See 'trim' arg in ?cimDiablo
According to the CIM, component 1 seems to primarily classify the Basal samples, with a group of overexpressed miRNA and underexpressed mRNAand proteins. A group of LumA samples can also be identified due to the overexpression of the same mRNAand proteins. Her2 samples remain quite mixed with the other LumA samples.
6.7 Model performance and prediction
We assess the performance of the model using 10-fold cross-validation repeated 10 times with the function perf()
. The method runs a block.splsda()
model on the pre-specified arguments input from our final object diablo.tcga
but on cross-validated samples. We then assess the accuracy of the prediction on the left out samples. Since the tune()
function was used with the centroid.dist
argument, we examine the outputs of the perf()
function for that same distance:
set.seed(123) # For reproducibility with this handbook, remove otherwise
<- perf(diablo.tcga, validation = 'Mfold', folds = 10,
perf.diablo.tcga nrepeat = 10, dist = 'centroids.dist')
#perf.diablo.tcga # Lists the different outputs
We can extract the (balanced) classification error rates globally or overall with
perf.diablo.tcga$error.rate.per.class
, the predicted components associated to \(\boldsymbol Y\), or the stability of the selected features with perf.diablo.tcga$features
.
Here we look at the different performance assessment schemes specific to multiple data set integration.
First, we output the performance with the majority vote, that is, since the prediction is based on the components associated to their own data set, we can then weight those predictions across data sets according to a majority vote scheme. Based on the predicted classes, we then extract the classification error rate per class and per component:
# Performance with Majority vote
$MajorityVote.error.rate perf.diablo.tcga
## $centroids.dist
## comp1 comp2
## Basal 0.02666667 0.04444444
## Her2 0.20666667 0.12333333
## LumA 0.04533333 0.00800000
## Overall.ER 0.07200000 0.04200000
## Overall.BER 0.09288889 0.05859259
The output shows that with the exception of the Basal samples, the classification improves with the addition of the second component.
Another prediction scheme is to weight the classification error rate from each data set according to the correlation between the predicted components and the \(\boldsymbol Y\) outcome.
# Performance with Weighted vote
$WeightedVote.error.rate perf.diablo.tcga
## $centroids.dist
## comp1 comp2
## Basal 0.006666667 0.04444444
## Her2 0.140000000 0.10666667
## LumA 0.045333333 0.00800000
## Overall.ER 0.052666667 0.03866667
## Overall.BER 0.064000000 0.05303704
Compared to the previous majority vote output, we can see that the classification accuracy is slightly better on component 2 for the subtype Her2.
An AUC plot per block is plotted using the function auroc()
. We have already mentioned in Module 3 for PLS-DA, the interpretation of this output may not be particularly insightful in relation to the performance evaluation of our methods, but can complement the statistical analysis. For example, here for the miRNA data set once we have reached component 2 (Figure 6.10):
<- auroc(diablo.tcga, roc.block = "miRNA", roc.comp = 2,
auc.diablo.tcga print = FALSE)
Figure 6.10 shows that the Her2 subtype is the most difficult to classify with multiblock sPLS-DA compared to the other subtypes.
The predict()
function associated with a block.splsda()
object predicts the class of samples from an external test set. In our specific case, one data set is missing in the test set but the method can still be applied. We need to ensure the names of the blocks correspond exactly to those from the training set:
# Prepare test set data: here one block (proteins) is missing
<- list(mRNA = breast.TCGA$data.test$mrna,
data.test.tcga miRNA = breast.TCGA$data.test$mirna)
<- predict(diablo.tcga, newdata = data.test.tcga)
predict.diablo.tcga # The warning message will inform us that one block is missing
#predict.diablo # List the different outputs
The following output is a confusion matrix that compares the real subtypes with the predicted subtypes for a 2 component model, for the distance of interest centroids.dist
and the prediction scheme WeightedVote
:
<- get.confusion_matrix(truth = breast.TCGA$data.test$subtype,
confusion.mat.tcga predicted = predict.diablo.tcga$WeightedVote$centroids.dist[,2])
confusion.mat.tcga
## predicted.as.Basal predicted.as.Her2 predicted.as.LumA
## Basal 20 1 0
## Her2 0 13 1
## LumA 0 3 32
From this table, we see that one Basal and one Her2 sample are wrongly predicted as Her2 and Lum A respectively, and 3 LumA samples are wrongly predicted as Her2. The balanced prediction error rate can be obtained as:
get.BER(confusion.mat.tcga)
## [1] 0.06825397
It would be worthwhile at this stage to revisit the chosen design of the multiblock sPLS-DA model to assess the influence of the design on the prediction performance on this test set - even though this back and forth analysis is a biased criterion to choose the design!