1 Introduction

mixOmics is an R toolkit dedicated to the exploration and integration of biological data sets, with a specific focus on variable selection. The package currently includes more than twenty multivariate methodologies, mostly developed by the mixOmics team (see some of our references in 1.2.3). All methods were originally designed for omics data, but their application is not limited to biological data. Other applications where integration is required can be considered, mostly in cases where the predictor variables are continuous (see also 1.1).

In mixOmics, a strong focus is given to graphical representation to better translate and understand the relationships between the different data types and visualize the correlation structure at both sample and variable levels.

1.1 Input data

Note the data pre-processing requirements before analysing data with mixOmics:

  • Types of data. Different types of biological data can be explored and integrated with mixOmics. Our methods can handle molecular features measured on a continuous scale (e.g. microarray, mass spectrometry-based proteomics and metabolomics) or sequencing-based count data (RNA-seq, 16S, shotgun metagenomics) that become 'continuous' data after pre-processing and normalisation.

  • Normalisation. The package does not handle normalisation, as it is platform-specific and we cover too wide a variety of data. Prior to the analysis, we assume the data sets have been normalised using appropriate normalisation methods and pre-processed when applicable.

  • Prefiltering. While mixOmics methods can handle large data sets (several tens of thousands of predictors), we recommend pre-filtering the data to fewer than 10,000 predictor variables per data set, for example by using Median Absolute Deviation (Teng et al. 2016) for RNA-seq data, by removing consistently low counts in microbiome data sets (Lê Cao et al. 2016) or by removing near-zero variance predictors. This step aims to reduce the computational time during the parameter tuning process.

  • Data format. Our methods use matrix decomposition techniques. Therefore, each numeric data matrix or data frame must have \(n\) observations or samples in rows and \(p\) predictors or variables (e.g. genes, proteins, OTUs) in columns.

  • Covariates. In the current version of mixOmics, covariates that may confound the analysis are not included in the methods. We recommend correcting for those covariates beforehand using appropriate univariate or multivariate methods for batch effect removal. Contact us for more details as we are currently working on this aspect.
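
As an illustration of the data format and prefiltering points above, here is a minimal sketch in base R. It is not the mixOmics implementation: the data are simulated, and the names `X`, `mad_per_variable`, `n_keep` and `X_filtered`, as well as the cut-off of 100 variables, are assumptions made only for this example.

```r
# Simulated data: n = 20 samples in rows, p = 500 variables in columns.
# All object names and values below are illustrative assumptions.
set.seed(42)
n <- 20
p <- 500
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(paste0("sample_", seq_len(n)),
                            paste0("gene_", seq_len(p))))

# Give the first 50 variables more spread so the filter has something to find.
X[, 1:50] <- X[, 1:50] * 5

# Median Absolute Deviation of each variable (i.e. of each column).
mad_per_variable <- apply(X, 2, mad)

# Keep the n_keep columns with the largest MAD (hypothetical cut-off).
n_keep <- 100
keep <- order(mad_per_variable, decreasing = TRUE)[seq_len(n_keep)]
X_filtered <- X[, keep, drop = FALSE]

# Samples stay in rows; only the retained variables remain in columns.
dim(X_filtered)
```

The same pattern applies to a real, already normalised matrix: compute one summary per column, rank the columns, and subset with `drop = FALSE` so that the result remains a matrix with samples in rows.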

1.2 Methods

1.2.1 Some background knowledge

We list here the main methodological or theoretical concepts you need to know to be able to efficiently apply mixOmics:

  • Individuals, observations or samples: the experimental units on which information is collected, e.g. patients, cell lines, cells, faecal samples etc.

  • Variables, predictors: read-out measured on each sample, e.g. gene (expression), protein or OTU (abundance), weight etc.

  • Variance: measures the spread of one variable. In our methods, we estimate the variance of components rather than variable read-outs. A high variance indicates that the data points are very spread out from the mean, and from one another (scattered).

  • Covariance: measures the strength of the relationship between two variables, i.e. whether they co-vary. A high covariance value indicates a strong relationship, e.g. weight and height in individuals frequently vary in the same way; roughly, the heaviest are the tallest. A covariance value has no lower or upper bound.

  • Correlation: a standardized version of the covariance that is bounded by -1 and 1.

  • Linear combination: variables are combined by multiplying each of them by a coefficient and adding the results. A linear combination of height and weight could be \(2 * weight - 1.5 * height\), with the coefficients \(2\) and \(-1.5\) assigned to weight and height respectively.

  • Component: an artificial variable built from a linear combination of the observed variables in a given data set. Variable coefficients are optimally defined based on some statistical criterion. For example, in Principal Component Analysis, the coefficients of a (principal) component are defined so as to maximise the variance of the component.

  • Loadings: variable coefficients used to define a component.

  • Sample plot: representation of the samples projected in a small space spanned (defined) by the components. Sample coordinates are determined by their component values, or scores.

  • Correlation circle plot: representation of the variables in a space spanned by the components. Each variable coordinate is defined as the correlation between the original variable value and each component. A correlation circle plot makes it possible to visualise the correlation between variables (negative or positive, defined by the cosine of the angle formed at the centre of the circle between two variable points) and the contribution of each variable to each component (defined by the absolute value of its coordinate on that component). For this interpretation, data need to be centred and scaled (by default in most of our methods except PCA). For more details on this insightful graphic, see Figure 1 in (González et al. 2012).

  • Unsupervised analysis: the method does not take into account any known sample groups and the analysis is exploratory. Examples of unsupervised methods covered in this vignette are Principal Component Analysis (PCA, Chapter 3), Projection to Latent Structures (PLS, Chapter 4), and also Canonical Correlation Analysis (CCA, not covered here but see the website page).

  • Supervised analysis: the method includes a vector indicating the class membership of each sample. The aim is to discriminate sample groups and perform sample class prediction. Examples of supervised methods covered in this vignette are PLS Discriminant Analysis (PLS-DA, Chapter 5), DIABLO (Chapter 6) and also MINT (Chapter 7).
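
The notions of variance, covariance, correlation and linear combination listed above can be checked numerically. The following base R sketch uses simulated height and weight values; the numbers and the object names (`height`, `weight`, `height_m`, `combo`, `coefs`) are made up for illustration and do not come from any mixOmics data set.

```r
# Simulated height (cm) and weight (kg) for 50 individuals; values are made up.
set.seed(1)
height <- rnorm(50, mean = 170, sd = 10)
weight <- 0.9 * height - 85 + rnorm(50, sd = 5)

var(height)            # spread of a single variable
cov(height, weight)    # unbounded, depends on the measurement units
cor(height, weight)    # bounded between -1 and 1

# Correlation is the covariance divided by both standard deviations.
all.equal(cor(height, weight),
          cov(height, weight) / (sd(height) * sd(weight)))

# Changing units (cm to m) rescales the covariance but not the correlation.
height_m <- height / 100
cov(height_m, weight)  # 100 times smaller than cov(height, weight)
cor(height_m, weight)  # same as cor(height, weight)

# A linear combination with coefficients 2 (weight) and -1.5 (height).
combo <- 2 * weight - 1.5 * height

# Equivalent matrix form: data matrix times a vector of coefficients.
X <- cbind(weight = weight, height = height)
coefs <- c(weight = 2, height = -1.5)
all.equal(combo, as.vector(X %*% coefs))
```

The matrix form in the last lines is the one that matters for the rest of this vignette: a component is obtained in the same way, with the coefficients chosen by the method rather than by hand.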

If these descriptions leave you with further questions, feel free to explore the glossary on our website.
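
To make the definitions of component, loadings, scores and correlation circle coordinates concrete, here is a minimal sketch using base R's `prcomp` on simulated data. It is a stand-in chosen so that the example runs without any package, not the mixOmics implementation; the object names (`X_scaled`, `pca_fit`, `loadings`, `scores`, `circle_coords`) are assumptions for this example.

```r
# Simulated data: 30 samples (rows), 5 variables (columns); names are illustrative.
set.seed(123)
X <- matrix(rnorm(30 * 5), nrow = 30,
            dimnames = list(NULL, paste0("var_", 1:5)))
X[, 2] <- X[, 1] + rnorm(30, sd = 0.3)   # make var_2 correlate with var_1

# Centre and scale the variables, then run PCA with base R.
X_scaled <- scale(X, center = TRUE, scale = TRUE)
pca_fit <- prcomp(X_scaled, center = FALSE, scale. = FALSE)

loadings <- pca_fit$rotation   # one column of variable coefficients per component
scores   <- pca_fit$x          # sample coordinates, as used in a sample plot

# Each component is a linear combination of the variables,
# with the loadings as coefficients.
all.equal(unname(scores[, 1]), as.vector(X_scaled %*% loadings[, 1]))

# Variance of each component; PCA orders components by decreasing variance.
apply(scores, 2, var)

# Correlation circle coordinates: correlation between each original
# variable and each of the first two components.
circle_coords <- cor(X_scaled, scores[, 1:2])
circle_coords
```

Because each coordinate is a correlation, every variable point falls inside the circle of radius 1, and variables that are strongly correlated with each other (here `var_1` and `var_2`) end up close together on the plot.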

1.2.2 Overview

Here is an overview of the most widely used methods in mixOmics that will be further detailed in this vignette, with the exception of rCCA. We depict them along with the type of data set they can handle.

FIGURE 1: An overview of what quantity and type of dataset each method within mixOmics requires. Thin columns represent a single variable, while the larger blocks represent datasets of multiple samples and variables.

Figure 1.1: List of methods in mixOmics, sparse indicates methods that perform variable selection

Figure 1.2: Main functions and parameters of each method

1.2.3 Key publications

The methods implemented in mixOmics are described in detail in the following publications. A more extensive list can be found at this link.

1.3 Outline of this Vignette

  • Chapter 2: details some practical aspects to get started
  • Chapter 3: Principal Component Analysis (PCA)
  • Chapter 4: Projection to Latent Structures (PLS)
  • Chapter 5: Projection to Latent Structures - Discriminant Analysis (PLS-DA)
  • Chapter 6: Integrative analysis for multiple data sets, across samples (namely DIABLO)
  • Chapter 7: Integrative analysis for multiple data sets, across features (namely MINT)

Each methods chapter has the following outline:

  1. Type of biological question to be answered
  2. Brief description of an illustrative data set
  3. Principle of the method
  4. Quick start of the method with the main functions and arguments
  5. To go further: customized plots, additional graphical outputs, and tuning parameters
  6. FAQ

1.4 Other methods not covered in this vignette

Other methods not covered in this document are described on our website and in the following references: