5 PLS-DA on the SRBCT case study
The Small Round Blue Cell Tumours (SRBCT) data set from Khan et al. (2001) includes the expression levels of 2,308 genes measured on 63 samples. The samples are divided into four classes: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS). The data are directly available in a processed and normalised format from the mixOmics package and contain the following:
- $gene: A data frame with 63 rows and 2,308 columns. These are the expression levels of 2,308 genes in 63 subjects,
- $class: A vector containing the class of tumour for each individual (4 classes in total),
- $gene.name: A data frame with 2,308 rows and 2 columns containing further information on the genes.
More details can be found in ?srbct. We will illustrate PLS-DA and sPLS-DA, which are suited for large biological data sets where the aim is to identify molecular signatures, as well as classify samples. We will analyse the gene expression levels of srbct$gene to discover which genes may best discriminate the 4 groups of tumours.
5.1 Load the data
We first load the data from the package. We then set up the data so that \(\boldsymbol X\) is the gene expression matrix and \(\boldsymbol y\) is the factor indicating sample class membership. \(\boldsymbol y\) will be transformed into a dummy matrix \(\boldsymbol Y\) inside the function. We also check that the dimensions are correct and match both \(\boldsymbol X\) and \(\boldsymbol y\):
library(mixOmics)
data(srbct)
X <- srbct$gene

# Outcome y that will be internally coded as dummy:
Y <- srbct$class

dim(X); length(Y)
## [1] 63 2308
## [1] 63
summary(Y)
## EWS BL NB RMS
## 23 8 12 20
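Before fitting the model, it can be instructive to see what the internal dummy coding of \(\boldsymbol y\) looks like. The sketch below uses the unmap() function from mixOmics to reproduce it; this is purely illustrative, as plsda() performs this step for us:
# Illustrative only: plsda() performs this dummy coding internally
Y.dummy <- unmap(Y) # 63 x 4 indicator matrix, one column per class
head(Y.dummy)
dim(Y.dummy)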
5.2 Example: PLS-DA
5.2.1 Initial exploration with PCA
As covered in Module 3, PCA is a useful tool to explore the gene expression data and to assess sample similarities between tumour types. Remember that PCA is an unsupervised approach, but we can colour the samples by their class to assist in interpreting the PCA (Figure 5.1). Here we center (default argument) and scale the data:
pca.srbct <- pca(X, ncomp = 3, scale = TRUE)

plotIndiv(pca.srbct, group = srbct$class, ind.names = FALSE,
legend = TRUE,
title = 'SRBCT, PCA comp 1 - 2')
We observe almost no separation between the different tumour types in the PCA sample plot, with perhaps the exception of the NB samples that tend to cluster away from the other samples. This preliminary exploration highlights two important findings:
- The major source of variation is not attributable to tumour type, but to an unknown source (we tend to observe clusters of samples, but these are not explained by tumour type).
- We need a more ‘directed’ (supervised) analysis to separate the tumour types, and we should expect the amount of variance explained by the PLS-DA components to be small.
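As a quick check of the first finding, we can examine how the variance spreads across the principal components; the plot() method for a pca object displays the proportion of explained variance per component:
# Barplot of the variance explained by each principal component
plot(pca.srbct)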
5.2.2 Number of components in PLS-DA
The perf() function evaluates the performance of PLS-DA, i.e., its ability to correctly classify ‘new’ samples into their tumour category, using repeated cross-validation. We initially choose a large number of components (here ncomp = 10) and assess the model as we gradually increase the number of components. Here we use 3-fold CV repeated 10 times. In Module 2, we provided further guidelines on how to choose the folds and nrepeat parameters:
plsda.srbct <- plsda(X, Y, ncomp = 10)

set.seed(30) # For reproducibility with this handbook, remove otherwise
perf.plsda.srbct <- perf(plsda.srbct, validation = 'Mfold', folds = 3,
                         progressBar = FALSE, # Set to TRUE to track progress
                         nrepeat = 10)        # We suggest nrepeat = 50

plot(perf.plsda.srbct, sd = TRUE, legend.position = 'horizontal')
From the classification performance output presented in Figure 5.2, we observe that:
- There are some slight differences between the overall and balanced error rates (BER), with BER > overall, suggesting that minority classes might be ignored from the classification task when considering the overall performance (summary(Y) shows that BL only includes 8 samples). In general the trend is the same, however, and for further tuning with sPLS-DA we will consider the BER, as sketched below.
- The error rate decreases and reaches a minimum at ncomp = 3 for the max.dist distance. These parameters will be included in further analyses.
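To make the distinction between the two error rates concrete, here is a small sketch that computes both from a hypothetical confusion matrix; the counts below are invented for illustration and are not mixOmics output:
# Hypothetical confusion matrix (rows = predicted, columns = true class):
conf <- matrix(c(20, 3,  0,  0,
                  2, 5,  0,  1,
                  1, 0, 11,  1,
                  0, 0,  1, 18),
               nrow = 4, byrow = TRUE,
               dimnames = list(predicted = c("EWS", "BL", "NB", "RMS"),
                               truth     = c("EWS", "BL", "NB", "RMS")))

# Overall error rate: proportion of misclassified samples
overall.ER <- 1 - sum(diag(conf)) / sum(conf)

# BER: average of the per-class error rates, so a small class such as BL
# weighs as much as a large class such as EWS
BER <- mean(1 - diag(conf) / colSums(conf))

overall.ER; BER
Here BER > overall error rate, because the relatively high error on the small BL class counts as much as the error on any large class.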
Notes:
- PLS-DA is an iterative model, where each component is orthogonal to the previous ones and gradually aims to build more discrimination between sample classes. We should always regard a final PLS-DA (with specified ncomp) as a ‘compounding’ model (i.e. a PLS-DA with component 3 includes the trained model on the previous two components).
- We advise using at least 50 repeats, and choosing a number of folds that is appropriate for the sample size of the data set (Figure 5.2).
Additional numerical outputs from the performance results are listed and can be reported as performance measures (not output here):
perf.plsda.srbct
5.2.3 Final PLS-DA model
We now run our final PLS-DA model that includes three components:
final.plsda.srbct <- plsda(X, Y, ncomp = 3)
We output the sample plots for the dimensions of interest (up to three). By default, the samples are coloured according to their class membership. We also add confidence ellipses (ellipse = TRUE, confidence level set to 95% by default, see the argument ellipse.level) in Figure 5.3. A 3D plot could also be insightful (use the argument type = '3D').
plotIndiv(final.plsda.srbct, ind.names = FALSE, legend=TRUE,
comp=c(1,2), ellipse = TRUE,
title = 'PLS-DA on SRBCT comp 1-2',
X.label = 'PLS-DA comp 1', Y.label = 'PLS-DA comp 2')
plotIndiv(final.plsda.srbct, ind.names = FALSE, legend=TRUE,
comp=c(2,3), ellipse = TRUE,
title = 'PLS-DA on SRBCT comp 2-3',
X.label = 'PLS-DA comp 2', Y.label = 'PLS-DA comp 3')
We can observe improved clustering according to tumour subtypes, compared with PCA. This is to be expected since the PLS-DA model includes the class information of each sample. We observe some discrimination between the NB and BL samples vs. the others on the first component (x-axis), and EWS and RMS vs. the others on the second component (y-axis). In the plotIndiv() output, the axis labels indicate the amount of variation explained per component. However, the interpretation of this amount is not as important as in PCA, as PLS-DA aims to maximise the covariance between components associated with \(\boldsymbol X\) and \(\boldsymbol Y\), rather than the variance of \(\boldsymbol X\).
5.2.4 Classification performance
We can rerun a more extensive performance evaluation with more repeats for our final model:
set.seed(30) # For reproducibility with this handbook, remove otherwise
perf.final.plsda.srbct <- perf(final.plsda.srbct, validation = 'Mfold',
                               folds = 3,
                               progressBar = FALSE, # TRUE to track progress
                               nrepeat = 10)        # we recommend 50
Retaining only the BER and the max.dist distance, numerical outputs of interest include the final overall performance for 3 components:

perf.final.plsda.srbct$error.rate$BER[, 'max.dist']
## comp1 comp2 comp3
## 0.53850543 0.25986413 0.04481884
As well as the error rate per class across each component:
perf.final.plsda.srbct$error.rate.class$max.dist
## comp1 comp2 comp3
## EWS 0.2565217 0.08695652 0.08260870
## BL 0.8125000 0.51250000 0.00000000
## NB 0.3000000 0.37500000 0.04166667
## RMS 0.7850000 0.06500000 0.05500000
From this output, we can see that the first component tends to classify EWS and NB better than the other classes. As components 2 and then 3 are added, the classification improves for all classes: by component 3, BL is perfectly classified and the error rates of the remaining classes are small. A permutation test could also be conducted to assess the significance of the differences between sample groups, but this is not currently implemented in the package.
5.2.5 Background prediction
A prediction background can be added to the sample plot by calculating a background surface first, before overlaying the sample plot (Figure 5.4, see Extra Reading material, or ?background.predict). We give an example of the code below based on the maximum prediction distance:
background.max <- background.predict(final.plsda.srbct,
                                     comp.predicted = 2,
                                     dist = 'max.dist')

plotIndiv(final.plsda.srbct, comp = 1:2, group = srbct$class,
ind.names = FALSE, title = 'Maximum distance',
legend = TRUE, background = background.max)
Figure 5.4 shows the differences in prediction according to the prediction distance, and can be used as a further diagnostic tool for distance choice. It also highlights the characteristics of the distances. For example, max.dist produces a linear boundary, whereas both centroids.dist and mahalanobis.dist produce non-linear boundaries. Our experience has shown that as discrimination of the classes becomes more challenging, the complexity of the distances (from maximum to Mahalanobis distance) should increase; see details in the Extra Reading material.
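The same type of plot can be produced for the other distances to compare the boundaries directly; for example, for the centroids distance (the object name background.cent is ours):
background.cent <- background.predict(final.plsda.srbct,
                                      comp.predicted = 2,
                                      dist = 'centroids.dist')

plotIndiv(final.plsda.srbct, comp = 1:2, group = srbct$class,
          ind.names = FALSE, title = 'Centroids distance',
          legend = TRUE, background = background.cent)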
5.3 Example: sPLS-DA
In high-throughput experiments, we expect that many of the 2,308 genes in \(\boldsymbol X\) are noisy or uninformative for characterising the different classes. An sPLS-DA analysis will help refine the sample clusters and select a small subset of variables relevant to discriminate each class.
5.3.1 Number of variables to select
We estimate the classification error rate with respect to the number of selected variables in the model with the function tune.splsda(). The tuning is performed one component at a time inside the function, and the optimal number of variables to select is automatically retrieved after each component run, as described in Module 2.
Previously, we determined the number of components to be ncomp = 3 with PLS-DA. Here we set ncomp = 4 to further assess if this would be the case for a sparse model, and use 5-fold cross-validation repeated 10 times. We also choose the maximum prediction distance.
Note:
- For a thorough tuning step, the following code should be repeated 10 - 50 times and the error rate averaged across the runs (a sketch of this is given after the tuning chunk below). You may obtain slightly different results below for this reason.
We first define a grid of keepX values. For example here, we define a fine grid at the start, and then specify a coarser, larger sequence of values:
# Grid of possible keepX values that will be tested for each comp
list.keepX <- c(1:10, seq(20, 100, 10))
list.keepX
## [1] 1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100
# This chunk takes ~ 2 min to run
# Some convergence issues may arise but it is ok as this is run on CV folds
tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 4, validation = 'Mfold',
                                 folds = 5, dist = 'max.dist',
                                 test.keepX = list.keepX, nrepeat = 10)
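Following the note above, here is a rough sketch of how the tuning could be repeated and averaged before choosing keepX. Only 3 runs are used for brevity (the text suggests 10 - 50), only the mean error rates are aggregated, and this is slow to run:
# Illustrative sketch only: repeat the tuning and average the mean
# error rates across runs
error.runs <- lapply(1:3, function(run) {
  tune.splsda(X, Y, ncomp = 4, validation = 'Mfold', folds = 5,
              dist = 'max.dist', test.keepX = list.keepX,
              nrepeat = 10)$error.rate
})
mean.error <- Reduce('+', error.runs) / length(error.runs)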
The following command line will output the mean error rate for each component and each tested keepX value, given the past (tuned) components.
# Just a head of the classification error rate per keepX (in rows) and comp
head(tune.splsda.srbct$error.rate)
## comp1 comp2 comp3 comp4
## 1 0.5942844 0.3252355 0.07589674 0.010923913
## 2 0.5638859 0.3071830 0.05585145 0.007798913
## 3 0.5491938 0.2985507 0.04576087 0.006711957
## 4 0.5360054 0.2920743 0.03071558 0.006711957
## 5 0.5211685 0.2855707 0.02821558 0.006711957
## 6 0.5150362 0.2819837 0.02254529 0.005461957
Examining each individual row, this output shows that, overall, the classification error rate continues to decrease after the third component in sparse PLS-DA.
We display the mean classification error rate on each component, bearing in mind that each component is conditional on the previous components calculated with the optimal number of selected variables. The diamond in Figure 5.5 indicates the best keepX value to achieve the lowest error rate per component.
# To show the error bars across the repeats:
plot(tune.splsda.srbct, sd = TRUE)
The tuning results depend on the tuning grid list.keepX, as well as the values chosen for folds and nrepeat. Therefore, we recommend assessing the performance of the final model, as well as examining the stability of the selected variables across the different folds, as detailed in the next section.
Figure 5.5 shows that the error rate decreases when more components are included in sPLS-DA. To obtain a more reliable estimation of the error rate, the number of repeats should be increased (between 50 and 100). This type of graph helps not only to choose the ‘optimal’ number of variables to select, but also to confirm the number of components ncomp. From the code below, we can assess that, in fact, the addition of a fourth component does not improve the classification (no statistically significant improvement according to a one-sided \(t\)-test), hence we can choose ncomp = 3.
# The optimal number of components according to our one-sided t-tests
tune.splsda.srbct$choice.ncomp$ncomp
## [1] 3
# The optimal keepX parameter according to minimal error rate
tune.splsda.srbct$choice.keepX
## comp1 comp2 comp3 comp4
## 6 30 40 6
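To see where these values come from, the minimum of the mean error rate per component can be extracted by hand. This is only a sketch: internally, the choice also involves a one-sided \(t\)-test, so the retained keepX may be smaller than the strict minimiser.
# Lowest mean error rate reached on each component ...
apply(tune.splsda.srbct$error.rate, 2, min)
# ... and the keepX value at which it is reached
list.keepX[apply(tune.splsda.srbct$error.rate, 2, which.min)]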
5.3.2 Final model and performance
Here is our final sPLS-DA model with three components and the optimal keepX obtained from our tuning step.
You can choose to skip the tuning step, and input your arbitrarily chosen parameters in the following code (simply specify your own ncomp and keepX values):
# Optimal number of components based on t-tests on the error rate
ncomp <- tune.splsda.srbct$choice.ncomp$ncomp
ncomp
## [1] 3
# Optimal number of variables to select
select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp]
select.keepX
## comp1 comp2 comp3
## 6 30 40
splsda.srbct <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX)
The performance of the model with the ncomp and keepX parameters is assessed with the perf() function. We use 5-fold cross-validation (folds = 5), repeated 10 times (nrepeat = 10) for illustrative purposes, but we recommend increasing to nrepeat = 50. Here we choose the max.dist prediction distance, based on our results obtained with PLS-DA.
The classification error rates that are output include the overall error rate as well as the balanced error rate (BER), the latter being preferred when the number of samples per group is not balanced - as is the case in this study.
set.seed(34) # For reproducibility with this handbook, remove otherwise
perf.splsda.srbct <- perf(splsda.srbct, folds = 5, validation = "Mfold",
                          dist = "max.dist", progressBar = FALSE, nrepeat = 10)

# perf.splsda.srbct # Lists the different outputs
perf.splsda.srbct$error.rate
## $overall
## max.dist
## comp1 0.44444444
## comp2 0.20317460
## comp3 0.01269841
##
## $BER
## max.dist
## comp1 0.53068841
## comp2 0.27567029
## comp3 0.01138587
We can also examine the error rate per class:
perf.splsda.srbct$error.rate.class
## $max.dist
## comp1 comp2 comp3
## EWS 0.02608696 0.004347826 0.01304348
## BL 0.60000000 0.450000000 0.01250000
## NB 0.91666667 0.483333333 0.00000000
## RMS 0.58000000 0.165000000 0.02000000
These results can be compared with the performance of PLS-DA above and show the benefit of variable selection: not only do we obtain a more parsimonious model, but the classification error rate (overall and per class) also improves, as shown in the sketch below.
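A quick side-by-side comparison of the balanced error rates of both models, using the perf objects computed earlier (the max.dist column is available in both):
# BER per component (max.dist) for PLS-DA vs. sPLS-DA
rbind(PLSDA  = perf.final.plsda.srbct$error.rate$BER[, 'max.dist'],
      sPLSDA = perf.splsda.srbct$error.rate$BER[, 'max.dist'])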
5.3.3 Variable selection and stability
During the repeated cross-validation process in perf(), we can record how often the same variables are selected across the folds. This information is important to answer the question: how reproducible is my gene signature when the training set is perturbed via cross-validation?
par(mfrow=c(1,2))
# For component 1
stable.comp1 <- perf.splsda.srbct$features$stable$comp1
barplot(stable.comp1, xlab = 'variables selected across CV folds',
ylab = 'Stability frequency',
main = 'Feature stability for comp = 1')
# For component 2
stable.comp2 <- perf.splsda.srbct$features$stable$comp2
barplot(stable.comp2, xlab = 'variables selected across CV folds',
ylab = 'Stability frequency',
main = 'Feature stability for comp = 2')
par(mfrow=c(1,1))
Figure 5.6 shows that the genes selected on component 1 are moderately stable (frequency < 0.5) whereas those selected on component 2 are more stable (frequency < 0.7). This can be explained as there are various combinations of genes that are discriminative on component 1, whereas the number of combinations decreases as we move to component 2 which attempts to refine the classification.
The function selectVar() outputs the variables selected for a given component and their loading values (ranked in decreasing absolute value). We concatenate those results with the feature stability, as shown here for the variables selected on component 1:
# First extract the name of selected var:
select.name <- selectVar(splsda.srbct, comp = 1)$name

# Then extract the stability values from perf:
stability <- perf.splsda.srbct$features$stable$comp1[select.name]
# Just the head of the stability of the selected var:
head(cbind(selectVar(splsda.srbct, comp = 1)$value, stability))
## value.var Var1 Freq
## g123 0.82019919 g123 0.46
## g846 0.48384321 g846 0.46
## g335 0.18883438 g335 0.22
## g1606 0.17786558 g1606 0.24
## g836 0.14246204 g836 0.36
## g783 0.07469278 g783 0.20
As we hinted previously, the genes selected on the first component are not necessarily the most stable, suggesting that different combinations can lead to the same discriminative ability of the model. The stability increases in the following components, as the classification task becomes more refined.
Note:
- You can also apply the vip() function on splsda.srbct.
5.3.4 Sample visualisation
Previously, we showed the ellipse plots displayed for each class. Here we also use the star argument (star = TRUE), which displays arrows starting from each group centroid towards each individual sample (Figure 5.7).
plotIndiv(splsda.srbct, comp = c(1,2),
ind.names = FALSE,
ellipse = TRUE, legend = TRUE,
star = TRUE,
title = 'SRBCT, sPLS-DA comp 1 - 2')
plotIndiv(splsda.srbct, comp = c(2,3),
ind.names = FALSE,
ellipse = TRUE, legend = TRUE,
star = TRUE,
title = 'SRBCT, sPLS-DA comp 2 - 3')
The sample plots differ from those of PLS-DA (Figure 5.3), with an overlap of specific classes (i.e. NB + RMS on components 1 and 2) that are then further separated on component 3, thus showing how the genes selected on each component discriminate particular sets of sample groups.
5.3.5 Variable visualisation
We represent the genes selected with sPLS-DA on the correlation circle plot. Here, to aid interpretation, we specify the argument var.names as the first 10 characters of the gene names (Figure 5.8). We also reduce the size of the font with the argument cex.
Note:
- We can store the plotVar() output as an object to retrieve the coordinates and variable names if the plot is too cluttered.
var.name.short <- substr(srbct$gene.name[, 2], 1, 10)

plotVar(splsda.srbct, comp = c(1,2),
var.names = list(var.name.short), cex = 3)
By considering both the correlation circle plot (Figure 5.8) and the sample plot in Figure 5.7, we observe that a group of genes with a positive correlation with component 1 (‘EH domain’, ‘proteasome’ etc.) are associated with the BL samples. We also observe two groups of genes either positively or negatively correlated with component 2. These genes are likely to characterise either the NB + RMS classes, or the EWS class. This interpretation can be further examined with the plotLoadings() function.
In this plot, the loading weights of each selected variable on each component are represented (see Module 2). The colours indicate the group in which the expression of the selected gene is maximal based on the mean (method = 'median' is also available for skewed data). For example, on component 1:
plotLoadings(splsda.srbct, comp = 1, method = 'mean', contrib = 'max',
name.var = var.name.short)
Here all genes are associated with BL (on average, their expression levels are higher in this class than in the other classes).
Notes:
- Consider using the argument ndisplay to only display the top selected genes if the signature is too large.
- Consider using the argument contrib = 'min' to interpret the inverse trend of the signature (i.e. which genes have the smallest expression in which class, here a mix of NB and RMS samples). Both options are combined in the sketch below.
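For example, a variant of the previous call combining both arguments (the values 10 and 'min' are chosen arbitrarily for illustration):
# Show only the top 10 genes, coloured by the class with minimal mean
# expression:
plotLoadings(splsda.srbct, comp = 1, method = 'mean', contrib = 'min',
             ndisplay = 10, name.var = var.name.short)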
To complete the visualisation, the CIM in this special case is a simple hierarchical heatmap (see ?cim) representing the expression levels of the genes selected across all three components with respect to each sample. Here we use a Euclidean distance with the complete agglomeration method, and we specify the argument row.sideColors to colour the samples according to their tumour type (Figure 5.10).
cim(splsda.srbct, row.sideColors = color.mixo(Y))
The heatmap shows the level of expression of the genes selected by sPLS-DA across all three components, and the overall ability of the gene signature to discriminate the tumour subtypes.
Note:
- You can change the argument comp if you wish to visualise a specific set of components in cim(), as in the sketch below.
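For instance, to restrict the heatmap to the genes selected on the first two components only (an illustrative variant of the call above):
# CIM restricted to the genes selected on components 1 and 2
cim(splsda.srbct, comp = c(1, 2), row.sideColors = color.mixo(Y))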
5.4 Take a detour: prediction
In this section, we artificially create an ‘external’ test set on which we want to predict the class membership, to illustrate the prediction process in sPLS-DA (see Extra Reading material). We randomly select 50 samples from the srbct study as part of the training set, and the remainder as part of the test set:
set.seed(33) # For reproducibility with this handbook, remove otherwise
train <- sample(1:nrow(X), 50)    # Randomly select 50 samples in training
test <- setdiff(1:nrow(X), train) # Rest is part of the test set

# Store matrices into training and test set:
X.train <- X[train, ]
X.test <- X[test, ]
Y.train <- Y[train]
Y.test <- Y[test]

# Check dimensions are OK:
dim(X.train); dim(X.test)
## [1] 50 2308
## [1] 13 2308
Here we assume that the tuning step was performed on the training set only (it is really important to tune on the training set only to avoid overfitting), and that the optimal keepX values are, for example, keepX = c(20,30,40) on three components. The final model on the training data is:
train.splsda.srbct <- splsda(X.train, Y.train, ncomp = 3, keepX = c(20,30,40))
We now apply the trained model on the test set X.test and we specify the prediction distance, for example mahalanobis.dist (see also ?predict.splsda):
predict.splsda.srbct <- predict(train.splsda.srbct, X.test,
                                dist = "mahalanobis.dist")
The $class output of our object predict.splsda.srbct gives the predicted classes of the test samples.
First we concatenate the prediction for each of the three components (conditional on the previous components) and the real class - in a real application you may not know the true class.
# Just the head:
head(data.frame(predict.splsda.srbct$class, Truth = Y.test))
## mahalanobis.dist.comp1 mahalanobis.dist.comp2 mahalanobis.dist.comp3
## EWS.T7 EWS EWS EWS
## EWS.T15 EWS EWS EWS
## EWS.C8 EWS EWS EWS
## EWS.C10 EWS EWS EWS
## BL.C8 BL BL BL
## NB.C6 NB NB NB
## Truth
## EWS.T7 EWS
## EWS.T15 EWS
## EWS.C8 EWS
## EWS.C10 EWS
## BL.C8 BL
## NB.C6 NB
If we only look at the final prediction on component 2, compared to the real class:
# Compare prediction on the second component and change as factor
predict.comp2 <- predict.splsda.srbct$class$mahalanobis.dist[, 2]
table(factor(predict.comp2, levels = levels(Y)), Y.test)
## Y.test
## EWS BL NB RMS
## EWS 4 0 0 0
## BL 0 1 0 0
## NB 0 0 1 1
## RMS 0 0 0 6
And on the third component:
# Compare prediction on the third component and change as factor
predict.comp3 <- predict.splsda.srbct$class$mahalanobis.dist[, 3]
table(factor(predict.comp3, levels = levels(Y)), Y.test)
## Y.test
## EWS BL NB RMS
## EWS 4 0 0 0
## BL 0 1 0 0
## NB 0 0 1 0
## RMS 0 0 0 7
The prediction is better on the third component, compared to a 2-component model.
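To quantify this on the test set, the accuracy can be computed directly from the confusion matrix (a sketch; here, as the table above shows, all 13 test samples are correctly classified on component 3):
# Test-set accuracy from the confusion matrix of component 3
conf.comp3 <- table(factor(predict.comp3, levels = levels(Y)), Y.test)
sum(diag(conf.comp3)) / sum(conf.comp3) # = 1 here: 13/13 correct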
Next, we look at the output $predict, which gives the predicted dummy scores assigned to each test sample for each class level on a given component (as explained in Extra Reading material). Each column represents a class category:
# On component 3, just the head:
head(predict.splsda.srbct$predict[, , 3])
## EWS BL NB RMS
## EWS.T7 1.26848551 -0.05273773 -0.24070902 0.024961232
## EWS.T15 1.15058424 -0.02222145 -0.11877994 -0.009582845
## EWS.C8 1.25628411 0.05481026 -0.16500118 -0.146093198
## EWS.C10 0.83995956 0.10871106 0.16452934 -0.113199949
## BL.C8 0.02431262 0.90877176 0.01775304 0.049162580
## NB.C6 0.06738230 0.05086884 0.86247360 0.019275265
In PLS-DA and sPLS-DA, the final prediction call is given based on this matrix, on which a pre-specified distance (such as mahalanobis.dist here) is applied. From this output, we can understand the link between the dummy matrix \(\boldsymbol Y\), the prediction, and the importance of choosing the prediction distance. More details are provided in Extra Reading material.
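As an intuition-building sketch, the simplest distance (max.dist) amounts to assigning each test sample to the class with the highest predicted dummy score. This reproduces the idea only, not the exact mahalanobis.dist call used above, and the object name pred.scores is ours:
# max.dist in a nutshell: pick the class whose predicted score is largest
pred.scores <- predict.splsda.srbct$predict[, , 3] # component 3 scores
colnames(pred.scores)[apply(pred.scores, 1, which.max)]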
5.5 AUROC outputs complement performance evaluation
As PLS-DA acts as a classifier, we can plot ROC (Receiver Operating Characteristic) curves and the associated AUC (Area Under the Curve) to complement the sPLS-DA classification performance results. The AUC is calculated from training cross-validation sets and averaged. The ROC curve is displayed in Figure 5.11. In a multiclass setting, each curve represents one class vs. the others, and the AUC is indicated in the legend as well as in the numerical output:
auc.srbct <- auroc(splsda.srbct)
## $Comp1
## AUC p-value
## EWS vs Other(s) 0.4163 2.717e-01
## BL vs Other(s) 1.0000 5.586e-06
## NB vs Other(s) 0.8448 2.214e-04
## RMS vs Other(s) 0.6000 2.041e-01
##
## $Comp2
## AUC p-value
## EWS vs Other(s) 1.0000 5.135e-11
## BL vs Other(s) 1.0000 5.586e-06
## NB vs Other(s) 0.9020 1.663e-05
## RMS vs Other(s) 0.7895 2.363e-04
##
## $Comp3
## AUC p-value
## EWS vs Other(s) 1 5.135e-11
## BL vs Other(s) 1 5.586e-06
## NB vs Other(s) 1 8.505e-08
## RMS vs Other(s) 1 2.164e-10
The ideal ROC curve should hug the top left corner, indicating a high true positive rate (sensitivity, on the y-axis) and a high true negative rate (i.e. a low ‘100 - specificity’ value, on the x-axis), with an AUC close to 1. This is the case for BL vs. the others on component 1. The numerical output shows a perfect classification on component 3.
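To display the ROC curves for one specific component, the roc.comp argument can be passed to auroc() (an illustrative call; see ?auroc for the available arguments):
# ROC curves based on the first component only
auc.comp1 <- auroc(splsda.srbct, roc.comp = 1)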
Note:
- A word of caution when using the ROC and AUC in s/PLS-DA: these criteria may not be particularly insightful, or may not be in full agreement with the s/PLS-DA performance, as the prediction threshold in PLS-DA is based on a specified distance as we described earlier in this section and in Extra Reading material related to PLS-DA. Thus, such a result complements the sPLS-DA performance we have calculated earlier.