1 The Consensus OPLS method

Omics approaches have proven their value for the broad monitoring of biological systems. However, despite the wealth of data generated by modern analytical platforms, the analysis of a single dataset remains limited and insufficient to reveal the full biochemical complexity of biological samples. Fusing information from several data sources therefore constitutes a relevant approach to assess biochemical events more comprehensively. At the same time, the problems inherent to the analysis of single tables are amplified in multiblock datasets, and uncovering the relationships between data layers of increasing complexity is a challenging task. For that purpose, a versatile methodology was proposed that combines the strengths of established data analysis strategies, namely multiblock approaches and the OPLS-DA framework, to offer an efficient tool for the fusion of omics data obtained from multiple sources (Boccard and Rutledge 2013).

The method, already available in MATLAB (available at a Gitlab repository), has been translated into an R package (available at a Gitlab repository) that includes quality metrics for optimal model selection, such as the R-squared (R²) coefficient, the Stone-Geisser Q² coefficient, the discriminant Q² index (DQ²) (Westerhuis et al. 2008) and permutation diagnostic statistics (Szymańska et al. 2012), as well as many graphical outputs (scores plot, block contributions, loadings, permutation results, etc.). It has been enhanced with additional functionalities, such as the computation of Variable Importance in Projection (VIP) values (Wold, Sjöström, and Eriksson 2001). Moreover, the new implementation offers the possibility of using different kernels, i.e. linear (previously the only option, via a kernel-based reformulation of the NIPALS algorithm (Lindgren, Geladi, and Wold 1993)), polynomial, or Gaussian, which greatly enhances the versatility of the method and extends its scope to a wide range of applications. The package also includes a function to predict new samples using an already computed model.
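
As an illustration, the kernel is selected through the kernelParams argument of the ConsensusOPLS function (the polynomial kernel of order 1, i.e. a linear kernel, is the setting used later in this vignette; the type code "g" and the name of the Gaussian width parameter are assumptions to be checked in the package documentation):

# Linear kernel, expressed as a polynomial kernel of order 1
# (this is the setting used for the model fitted in this vignette)
kernel_linear <- list(type = "p", params = c(order = 1))

# Gaussian kernel: the type code "g" and the parameter name "sigma" are
# assumptions; check ?ConsensusOPLS before using them
kernel_gaussian <- list(type = "g", params = c(sigma = 1))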

A demonstration case study available from a public repository of the National Cancer Institute, namely the NCI-60 dataset, was used to illustrate the method’s potential for omics data fusion. A subset of the NCI-60 data (transcriptomics, proteomics and metabolomics) involving experimental data from 14 cancer cell lines of two tissue origins, i.e. colon and ovary, was used (Shoemaker 2006). Results obtained on this dataset with the ConsensusOPLS R package and the MATLAB implementation were strictly identical (tolerance of 10e-06).

The combination of these data sources was expected to provide a global profiling of the cell lines from an integrative systems biology perspective. The Consensus OPLS-DA strategy was applied to the differential analysis of the two selected tumor origins, with the three blocks of data analyzed simultaneously.

2 R environment preparation

#install.packages("knitr")
library(knitr)
## Warning: package 'knitr' was built under R version 4.3.3
opts_chunk$set(echo = TRUE)

# To ensure reproducibility
set.seed(12)

Before anything else, we need to make sure that the required packages are installed. The code below has been designed to depend on as few R packages as possible, apart from a small set of stable, well-established ones.

pkgs <- c("ggplot2", "ggrepel", "DT", "psych", "ConsensusOPLS")
sapply(pkgs, function(x) {
    if (!requireNamespace(x, quietly = TRUE)) {
        install.packages(x)
    }
})
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
    BiocManager::install("ComplexHeatmap")
}
library(ggplot2) # to make beautiful graphs
library(ggrepel) # to annotate ggplot2 graph
library(DT) # to make interactive data tables
library(psych) # to make specific quantitative summaries
library(ComplexHeatmap) # to make heatmap with density plot
library(ConsensusOPLS) # to load ConsensusOPLS

Then we create a uniform theme that will be used for all graphic output.

theme_graphs <- theme_bw() + theme(strip.text = element_text(size=14),
                                   axis.title = element_text(size=16),
                                   axis.text = element_text(size=14),
                                   plot.title = element_text(size=16),
                                   legend.title = element_text(size=14))

3 Data preprocessing

As mentioned earlier, we use the demonstration dataset provided with the package.

data(demo_3_Omics, package = "ConsensusOPLS")

This will load a data object of type list with 5 matrices: MetaboData, MicroData, ObsNames, ProteoData and Y.

In other words, this list contains three data blocks, the observation (sample) names, and the binary response matrix Y. Since the ConsensusOPLS method performs horizontal integration, all data blocks must share the same samples (rows). We can check the block dimensions with the following command:

# Check dimension
BlockNames <- c("MetaboData", "MicroData", "ProteoData")
nbrBlocs <- length(BlockNames)
dims <- lapply(X=demo_3_Omics[BlockNames], FUN=dim)
names(dims) <- BlockNames
dims
## $MetaboData
## [1]  14 147
## 
## $MicroData
## [1]  14 200
## 
## $ProteoData
## [1] 14 99
# Remove object not needed in the next steps
rm(dims)

It must also be ensured that the samples appear in the same order in the three omics blocks:

rns <- do.call(cbind, lapply(X=demo_3_Omics[BlockNames], rownames))
rns.unique <- apply(rns, 1, function(x) length(unique(x)))
stopifnot(max(rns.unique) == 1)

The list of the data blocks demo_3_Omics[BlockNames] and the response demo_3_Omics$Y are required as input to the ConsensusOPLS analysis.

4 Data visualization

4.1 Summary by Y groups

Before performing the multiblock analysis, we may want to investigate the nature of the variables in each omics block with respect to the response. Interactive tables are produced below, so that the variables can be sorted in ascending or descending order, and a variable of interest can be looked up.

require(psych)
require(DT)
describe_data_by_Y <- function(data, group) {
    bloc_by_Y <- describeBy(x = data, group = group,
                            mat = TRUE)[, c("group1", "n", "mean", "sd",
                                            "median", "min", "max", "range",
                                            "se")]
    bloc_by_Y[3:ncol(bloc_by_Y)] <- round(bloc_by_Y[3:ncol(bloc_by_Y)], 
                                          digits = 2)
  return (datatable(bloc_by_Y))
}

For metabolomic data,

describe_data_by_Y(data = demo_3_Omics[[BlockNames[1]]],
                   group = demo_3_Omics$ObsNames[,1])

For microarray data,

describe_data_by_Y(data = demo_3_Omics[[BlockNames[2]]],
                   group = demo_3_Omics$ObsNames[,1])

For proteomic data,

describe_data_by_Y(data = demo_3_Omics[[BlockNames[3]]],
                   group = demo_3_Omics$ObsNames[,1])

What information do these tables provide? First, the two groups defined by the Y response variable contain the same number of samples. Second, there is a great deal of variability in the data, both within and between blocks. Focusing, for example, on the range of values:

  • MetaboData values range from 0.007 to 39.012,

  • MicroData values range from 0.581023 to 8735.97217, and

  • ProteoData values range from -3.69 to 1.17873 × 10⁵.

A data transformation is therefore recommended before proceeding.

4.2 Unit variance scaling

To use the Consensus OPLS-DA method, it is possible to compute the Z-score of the data, i.e. each column of each data block is centered to mean 0 and scaled to standard deviation 1. The user is free to perform this transformation right after loading the data, before executing the method, using the approach of their choice.

According to the previous results, the scales of the variables differ greatly between data blocks, so the data need to be standardized.

# Keep a copy of the unscaled data
demo_3_Omics_not_scaled <- demo_3_Omics

# Scaling data
demo_3_Omics[BlockNames] <- lapply(X=demo_3_Omics[BlockNames], FUN=function(x)
    scale(x, center = TRUE, scale = TRUE)
)

4.3 Heatmap and density plots

Heat maps can be used to compare the data before and after scaling. Here, the factor of interest is categorical, so a heat map is drawn for each of its groups. The heat maps are created with the following function.

heatmap_data <- function(data, bloc_name, factor = NULL){
  if(!is.null(factor)){
    ht <- Heatmap(
      matrix = data, name = "Values",
      row_dend_width = unit(3, "cm"),
      column_dend_height = unit(3, "cm"),
      column_title = paste0("Heatmap of ", bloc_name),
      row_split = factor,
      row_title = "Y = %s",
      row_title_rot = 0
    )
  } else{
    ht <- Heatmap(
      matrix = data, name = "Values",
      row_dend_width = unit(3, "cm"),
      column_dend_height = unit(3, "cm"),
      column_title = paste0("Heatmap of ", bloc_name)
    )
  }
  return(ht)
}

Let’s apply this function to the demo data:

# Heat map for each data block
lapply(X = 1:nbrBlocs,
       FUN = function(X){
         bloc <- BlockNames[X]
         heatmap_data(data = demo_3_Omics_not_scaled[[bloc]],
                      bloc_name = bloc,
                      factor = demo_3_Omics_not_scaled$Y[,1])})
## [[1]]

## 
## [[2]]

## 
## [[3]]

And on the scaled data:

# Heat map for each data block
lapply(X = 1:nbrBlocs,
       FUN = function(X){
         bloc <- BlockNames[X]
         heatmap_data(data = demo_3_Omics[[bloc]],
                      bloc_name = bloc,
                      factor = demo_3_Omics$Y[,1])})
## [[1]]

## 
## [[2]]

## 
## [[3]]

Comparing these graphs leads to several observations. First, the unscaled proteomics and transcriptomics blocks showed a weak signal, whereas the metabolomics block already contained a relatively usable signal. These graphs therefore confirm that it was wise to scale the data prior to the analyses. Second, the profiles seem to differ according to the Y response variable.

In the same way, the density distributions can be visualized with a density heat map (here on the scaled data):

# Density heatmap for each data block
lapply(X = 1:nbrBlocs,
       FUN = function(X){
         bloc <- BlockNames[X]
         factor <- demo_3_Omics$Y[, 1]
         densityHeatmap(t(demo_3_Omics[[bloc]]),
                        ylab = bloc,
                        column_split  = factor,
                        column_title = "Y = %s")})
## [[1]]

## 
## [[2]]

## 
## [[3]]

In light of these graphs, the Y = 0 data appear denser than the Y = 1 data. This suggests that a discriminant analysis (DA) model should be able to detect the signal contained in these data.

# Remove unscaled data
rm(demo_3_Omics_not_scaled)

5 Consensus OPLS-DA model

A model with one predictive latent variable and up to three orthogonal latent variables was evaluated. For this, the following parameters were defined:

# Number of predictive component(s)
LVsPred <- 1

# Maximum number of orthogonal components
LVsOrtho <- 3

# Number of cross-validation folds
CVfolds <- nrow(demo_3_Omics[[BlockNames[[1]]]])
CVfolds
## [1] 14

Then, to use the Consensus OPLS method, only one function of the package needs to be called. This function, ConsensusOPLS, takes as arguments the data blocks, the response variable, the maximum numbers of predictive and orthogonal components allowed in the model, the number of partitions for the n-fold cross-validation, and the model type indicating a discriminant analysis. The result is the optimal model; the permutation diagnostics requested via nperm are computed as well and stored in the returned object.

copls.da <- ConsensusOPLS(data = demo_3_Omics[BlockNames],
                          Y = demo_3_Omics$Y,
                          maxPcomp = LVsPred,
                          maxOcomp  = LVsOrtho,
                          modelType = "da",
                          nperm = 100,
                          cvType = "nfold",
                          nfold = 14,
                          nMC = 100,
                          cvFrac = 4/5,
                          kernelParams = list(type = "p", 
                                              params = c(order = 1)),
                          mc.cores = 1)

This model gives the following results:

print("The main results of the ConsensusOPLS analysis:")
## [1] "The main results of the ConsensusOPLS analysis:"
copls.da
## ***Optimal Consensus OPLS model***
## 
## Mode:  da 
## 
## Number of predictive components:  1 
## 
## Number of orthogonal components:  1 
## 
## Block contribution:
##                  p_1       o_1
## MetaboData 0.2684595 0.4178126
## MicroData  0.3470002 0.3129129
## ProteoData 0.3845402 0.2692744
## 
## Explained variance R2 in response: 0.995081 
## 
## Predictive ability (cross validation Q2):  0.7654607
print("List of available outputs in the ConsensusOPLS object:")
## [1] "List of available outputs in the ConsensusOPLS object:"
summary(attributes(copls.da))
##                   Length Class  Mode     
## modelType          1     -none- character
## response          14     -none- character
## nPcomp             1     -none- numeric  
## nOcomp             1     -none- numeric  
## blockContribution  6     -none- numeric  
## scores            28     -none- numeric  
## loadings           3     -none- list     
## VIP                3     -none- list     
## R2X                2     -none- numeric  
## R2Y                2     -none- numeric  
## Q2                 2     -none- numeric  
## DQ2                2     -none- numeric  
## permStats          5     -none- list     
## cv                 0     -none- list     
## class              1     -none- character

6 Display the main results

As indicated at the beginning of the file, the R package ConsensusOPLS calculates:

  • the R-squared (R²) coefficient, which measures how much of the variation in the response is explained by the model. An R² of 0 means that none of the variation is explained, whereas an R² of 1 means that the response is fully explained by the model.
position <- copls.da@nOcomp

paste0('R2: ', round(copls.da@R2Y[position], 4))
## [1] "R2: 0.9581"

Here, this means that the model explains 95.81% of the variation in the Y response variable.

  • The Stone-Geisser Q² coefficient, also known as the redundancy index in cross-validation, is used to evaluate the quality of each structural equation, and thus to assess, independently of one another, the predictive quality of each model construct (Tenenhaus et al. 2005). If Q² is positive, the model has predictive validity; if it is negative, it has none. It is defined as Q² = 1 - PRESS/TSS, where PRESS is the prediction error sum of squares and TSS is the total sum of squares of the response vector Y (Westerhuis et al. 2008).
paste0('Q2: ', round(copls.da@Q2[position], 4))
## [1] "Q2: 0.7461"

Here, this means that the model has predictive validity.

  • the discriminant Q² index (DQ²), which assesses the model fit without penalizing class predictions that lie beyond the class label value. It is defined as DQ² = 1 - PRESSD/TSS, where PRESSD is the prediction error sum of squares in which residuals are disregarded whenever the class prediction lies beyond the class label (i.e. > 1 or < 0 for two classes coded 0 and 1), and TSS is the total sum of squares of the response vector Y. This value is a measure of class prediction ability (Westerhuis et al. 2008), as illustrated by the short sketch further below.
paste0('DQ2: ', round(copls.da@DQ2[position], 4))
## [1] "DQ2: 0.7055"

Here, this means that the model can predict classes.
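
To make these two definitions concrete, the following minimal sketch (not the package's internal implementation) computes Q² and DQ² from a response vector and a toy vector of cross-validated predictions, following the formulas given above:

# Illustration of the Q2 and DQ2 definitions (not the package's internal code)
Q2_DQ2 <- function(Y, Yhat_cv) {
  TSS    <- sum((Y - mean(Y))^2)      # total sum of squares
  PRESS  <- sum((Y - Yhat_cv)^2)      # prediction error sum of squares
  # For DQ2, residuals are disregarded when the prediction already lies
  # beyond the class label (> 1 for class 1, < 0 for class 0)
  keep   <- !((Y == 1 & Yhat_cv > 1) | (Y == 0 & Yhat_cv < 0))
  PRESSD <- sum((Y - Yhat_cv)[keep]^2)
  c(Q2 = 1 - PRESS/TSS, DQ2 = 1 - PRESSD/TSS)
}

# Toy example with two classes coded 0/1
Q2_DQ2(Y = c(0, 0, 1, 1), Yhat_cv = c(0.1, -0.2, 0.8, 1.3))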

  • the Variable Importance in Projection (VIP) for each block of data. Within each block, the relevance of the variables in explaining the variation of the Y response was assessed with the VIP values, which reflect the importance of the variables with respect to both the response and the quality of the projection. Using the value VIP × sign(loading), the relevant features can be represented as follows:
# Extract the VIP values
VIP <- copls.da@VIP

# Multiply VIP * sign(loadings for predictive component)
VIP_plot <- lapply(X = 1:nbrBlocs,
                   FUN = function(X){
                       sign_loadings <- sign(copls.da@loadings[[X]][, "p_1"])
                       result <- VIP[[X]][, "p"]*sign_loadings
                       return(sort(result, decreasing = TRUE))})
names(VIP_plot) <- BlockNames
# Metabo data
ggplot(data = data.frame(
  "variables" = factor(names(VIP_plot[[1]]),
                       levels=names(VIP_plot[[1]])[order(abs(VIP_plot[[1]]), 
                                                         decreasing=T)]), 
  "valeur" = VIP_plot[[1]]), 
  aes(x = variables, y = valeur)) +
  geom_bar(stat = "identity") +
  labs(title = paste0("Barplot of ", names(VIP_plot)[1])) +
  xlab("Predictive variables") +
  ylab("VIP x loading sign") +
  theme_graphs +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

# Microarray data
ggplot(data = data.frame(
  "variables" = factor(names(VIP_plot[[2]]),
                       levels=names(VIP_plot[[2]])[order(abs(VIP_plot[[2]]), 
                                                         decreasing=T)]), 
  "valeur" = VIP_plot[[2]]), 
  aes(x = variables, y = valeur)) +
  geom_bar(stat = "identity") +
  labs(title = paste0("Barplot of ", names(VIP_plot)[2])) +
  xlab("Predictive variables") +
  ylab("VIP x loading sign") +
  theme_graphs +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

# Proteo data
ggplot(data = data.frame(
  "variables" = factor(names(VIP_plot[[3]]),
                       levels=names(VIP_plot[[3]])[order(abs(VIP_plot[[3]]), 
                                                         decreasing=T)]), 
  "valeur" = VIP_plot[[3]]), 
  aes(x = variables, y = valeur)) +
  geom_bar(stat = "identity") +
  labs(title = paste0("Barplot of ", names(VIP_plot)[3])) +
  xlab("Predictive variables") +
  ylab("VIP x loading sign") +
  theme_graphs +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

One possibility might be to plot only the 20 most important variables of each block (the first 10 and the last 10 of the signed VIP vector). The user is free to do so, as sketched below.
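
For instance, reusing the VIP_plot object created above (a minimal sketch, not part of the package), the signed VIP vectors, which are already sorted in decreasing order, could be truncated before plotting:

# Keep only the 10 largest and the 10 smallest signed VIP values per block
# (the vectors in VIP_plot are already sorted in decreasing order)
VIP_plot_top20 <- lapply(VIP_plot, function(x) c(head(x, 10), tail(x, 10)))
# These shorter vectors can then be passed to the same ggplot code as above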

7 Plot the main results

7.1 Consensus Score plot

The scores plot shows the representation of the samples in the two new components computed by the optimal model. A separation along the horizontal (predictive) axis is expected.

ggplot(data = data.frame("p_1" = copls.da@scores[, "p_1"],
                                  "o_1" = copls.da@scores[, "o_1"],
                                  "Labs" = as.matrix(unlist(demo_3_Omics$ObsNames[, 1]))),
                aes(x = p_1, y = o_1, label = Labs, 
                    shape = Labs, colour = Labs)) +
  xlab("Predictive component") +
  ylab("Orthogonal component") +
  ggtitle("ConsensusOPLS Score plot")+
  geom_point(size = 2.5) + 
  geom_text_repel(size = 4, show.legend = FALSE) + 
  theme_graphs+
  scale_color_manual(values = c("#7F3C8D", "#11A579"))

Graph of scores obtained by the optimal ConsensusOPLS model for ovarian tissue (triangle) and colon tissue (circle) from the NCI-60 data, for the three data blocks (metabolomics, proteomics and transcriptomics). Each cancer cell line is represented by a unique symbol whose location is determined by the contributions of the predictive and orthogonal components of the ConsensusOPLS-DA model. A clear partition of the classes was obtained.

7.2 Block contributions to the predictive component

ggplot(
  data = data.frame("Values" = copls.da@blockContribution[, "p_1"],
                    "Blocks" = as.factor(labels(demo_3_Omics[1:nbrBlocs]))),
  aes(x = Blocks, y = Values, fill = Blocks)) +
  geom_bar(stat = 'identity') + 
  geom_text(aes(label = round(Values, 2)), vjust = -0.3) +
  ggtitle("Block contributions to the predictive component") +
  xlab("Data blocks") +
  ylab("Weight") +
  theme_graphs +
  scale_fill_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))

The block contributions to the predictive latent variable indicated the specific importance of the proteomic block (38.5%), the transcriptomic block (34.7%) and the metabolomic block (26.8%).

7.3 Block contributions to the first orthogonal component

ggplot(
  data = data.frame("Values" = copls.da@blockContribution[, "o_1"],
                    "Blocks" = as.factor(labels(demo_3_Omics[1:nbrBlocs]))),
  aes(x = Blocks, y = Values, fill = Blocks)) +
  geom_bar(stat = 'identity') + 
  geom_text(aes(label = round(Values, 2)), vjust = -0.3) +
  ggtitle("Block contributions to the first orthogonal component") +
  xlab("Data blocks") +
  ylab("Weight") +
  theme_graphs +
  scale_fill_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))

The block contributions to the first orthogonal component indicated the specific importance of the metabolomic block (41.8%), the transcriptomic block (31.3%) and the proteomic block (26.9%).

7.4 Block contributions: the two previous plots into one

data_two_plots <- data.frame(
  "Values" = c(copls.da@blockContribution[, "p_1"],
               copls.da@blockContribution[, "o_1"]),
  "Type" = rep(c("Pred", "Ortho"), each = nbrBlocs),
  "Blocks" = rep(labels(demo_3_Omics[1:nbrBlocs]), times = 2))

ggplot(data = data_two_plots,
       aes(x = factor(Type), 
           y = Values, 
           fill = factor(Type))) +
  geom_bar(stat = 'identity') + 
  geom_text(aes(label = round(Values, 2)), vjust = -0.3) +
  ggtitle("Block contributions to each component") +
  xlab("Component") +
  ylab("Weight") +
  facet_wrap(. ~ Blocks) +
  theme_graphs +
  scale_fill_manual(name = "Component", values = c("#7F3C8D", "#11A579"))

In the same way, the previous graph can be displayed with the blocks on the x-axis and one panel per component:

ggplot(data = data_two_plots,
       aes(x = Blocks, 
           y = Values, 
           fill = Blocks)) +
    geom_bar(stat = 'identity') +
    geom_text(aes(label = round(Values, 2)), vjust = -0.3) +
    ggtitle("Block contributions to each component") +
    xlab("Data blocks") +
    ylab("Weight") +
    facet_wrap(. ~ factor(Type)) +
    theme_graphs +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
          plot.title = element_text(hjust = 0.5, 
                                    margin = margin(t = 5, r = 0, b = 0, l = 100))) +
    scale_fill_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))

7.5 Block contributions predictive vs. orthogonal

ggplot(data = data.frame("Pred" = copls.da@blockContribution[, "p_1"],
                         "Ortho" = copls.da@blockContribution[, "o_1"],
                         "Labels" = labels(demo_3_Omics[1:nbrBlocs])),
       aes(x = Pred, y = Ortho, label = Labels, 
           shape = Labels, colour = Labels)) +
  xlab("Predictive component") +
  ylab("Orthogonal component") +
  ggtitle("Block contributions predictive vs. orthogonal") +
  geom_point(size = 2.5) + 
  geom_text_repel(size = 4, show.legend = FALSE) + 
  theme_graphs +
  scale_color_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))

7.6 Loading plots (one for each data set)

Individual loadings of each block were calculated for the predictive latent variable of the optimal model, to detect metabolite, protein and transcript level differences between the two groups of tissue cell lines.

loadings <- copls.da@loadings
data_loads <- sapply(X = 1:nbrBlocs,
                     FUN = function(X){
                       data.frame("Pred" = 
                                    loadings[[X]][, grep(pattern = "p_",
                                                               x = colnames(loadings[[X]]),
                                                               fixed = TRUE)],
                                  "Ortho" = 
                                    loadings[[X]][, grep(pattern = "o_",
                                                               x = colnames(loadings[[X]]),
                                                               fixed = TRUE)],
                                  "Labels" = labels(demo_3_Omics[1:nbrBlocs])[[X]])
                     })
data_loads <- as.data.frame(data_loads)

The loading plot shows the representation of variables in the two new components calculated by the optimal model.

ggplot() +
  geom_point(data = as.data.frame(data_loads$V1),
             aes(x = Pred, y = Ortho, colour = Labels), 
             size = 2.5, alpha = 0.5) + 
  geom_point(data = as.data.frame(data_loads$V2),
             aes(x = Pred, y = Ortho, colour = Labels),
             size = 2.5, alpha = 0.5) +
  geom_point(data = as.data.frame(data_loads$V3),
             aes(x = Pred, y = Ortho, colour = Labels),
             size = 2.5, alpha = 0.5) +
  xlab("Predictive component") +
  ylab("Orthogonal component") +
  ggtitle("Loadings plot on first orthogonal and predictive component")+
  theme_graphs+
  scale_color_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))

7.7 Loading and VIP of the optimal model

loadings <- do.call(rbind.data.frame, copls.da@loadings)
loadings$block <- do.call(c, lapply(names(copls.da@loadings), function(x) 
    rep(x, nrow(copls.da@loadings[[x]]))))
loadings$variable <- gsub(paste(paste0(names(copls.da@loadings), '.'), 
                                collapse='|'), '', 
                          rownames(loadings))

VIP <- do.call(rbind.data.frame, copls.da@VIP)
VIP$block <- do.call(c, lapply(names(copls.da@VIP), function(x) 
    rep(x, nrow(copls.da@VIP[[x]]))))
VIP$variable <- gsub(paste(paste0(names(copls.da@VIP), '.'), 
                                collapse='|'), '', 
                          rownames(VIP))
        
loadings_VIP <- merge(x = loadings[, c("p_1", "variable")], 
                      y = VIP[, c("p", "variable")], 
                      by = "variable", all = TRUE)
colnames(loadings_VIP) <- c("variable", "loadings", "VIP")
loadings_VIP <- merge(x = loadings_VIP, 
                      y = loadings[, c("block", "variable")], 
                      by = "variable", all = TRUE)
loadings_VIP$label <- ifelse(loadings_VIP$VIP > 1, loadings_VIP$variable, NA)
ggplot(data = loadings_VIP,
       aes(x=loadings, y=VIP, col=block, label = label)) +
  geom_point(size = 2.5, alpha = 0.5) + 
  xlab("Predictive component") +
  ylab("Variable Importance in Projection") +
  ggtitle("VIP versus loadings on predictive components")+
  theme_graphs+
  scale_color_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))

8 Permutations

Permutation tests were performed to assess the model validity (here with 100 permutations, as specified by nperm = 100 in the ConsensusOPLS call above).

PermRes <- copls.da@permStats
ggplot(data = data.frame("R2Yperm" = PermRes$R2Y),
       aes(x = R2Yperm)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 color="grey", fill="grey") +
  geom_density(color = "blue", linewidth = 0.5) +
  geom_vline(aes(xintercept=PermRes$R2Y[1]), 
             color="blue", linetype="dashed", linewidth=1) +
  xlab("R2 values") +
  ylab("Frequency") +
  ggtitle("R2 Permutation test")+
  theme_graphs

ggplot(data = data.frame("Q2Yperm" = PermRes$Q2Y),
       aes(x = Q2Yperm)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 color="grey", fill="grey") +
  geom_density(color = "blue", size = 0.5) +
  geom_vline(aes(xintercept=PermRes$Q2Y[1]), 
             color="blue", linetype="dashed", size=1) +
  xlab("Q2 values") +
  ylab("Frequency") +
  ggtitle("Q2 Permutation test")+
  theme_graphs

ggplot(data = data.frame("DQ2Yperm" = PermRes$DQ2Y),
       aes(x = DQ2Yperm)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 color="grey", fill="grey") +
  geom_density(color = "blue", size = 0.5) +
  geom_vline(aes(xintercept=PermRes$DQ2Y[1]), 
             color="blue", linetype="dashed", size=1) +
  xlab("DQ2 values") +
  ylab("Frequency") +
  ggtitle("DQ2 Permutation test")+
  theme_graphs
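
These permutation distributions can be summarized by an empirical p-value. The sketch below (not a function of the package) assumes, as in the plots above, that the first element of each permStats vector is the statistic of the unpermuted model and that the remaining elements come from the permuted responses:

# Empirical p-value: proportion of permuted statistics at least as extreme as
# the observed one (the observed value itself is included in the count)
perm_pval <- function(stat) {
  observed <- stat[1]
  permuted <- stat[-1]
  (1 + sum(permuted >= observed)) / (1 + length(permuted))
}

sapply(list(R2Y = PermRes$R2Y, Q2Y = PermRes$Q2Y, DQ2Y = PermRes$DQ2Y), perm_pval)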

9 Reproducibility

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Paris
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ConsensusOPLS_1.0.0   ComplexHeatmap_2.16.0 psych_2.4.3          
## [4] DT_0.33               ggrepel_0.9.5         ggplot2_3.5.1        
## [7] knitr_1.47           
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5        circlize_0.4.16     shape_1.4.6        
##  [4] rjson_0.2.21        xfun_0.44           bslib_0.7.0        
##  [7] htmlwidgets_1.6.4   GlobalOptions_0.1.2 lattice_0.21-8     
## [10] vctrs_0.6.5         tools_4.3.1         crosstalk_1.2.1    
## [13] generics_0.1.3      stats4_4.3.1        parallel_4.3.1     
## [16] tibble_3.2.1        fansi_1.0.6         highr_0.11         
## [19] cluster_2.1.4       pkgconfig_2.0.3     RColorBrewer_1.1-3 
## [22] S4Vectors_0.38.1    lifecycle_1.0.4     farver_2.1.2       
## [25] compiler_4.3.1      stringr_1.5.1       munsell_0.5.1      
## [28] mnormt_2.1.1        codetools_0.2-19    clue_0.3-65        
## [31] htmltools_0.5.8.1   sass_0.4.9          yaml_2.3.8         
## [34] pillar_1.9.0        crayon_1.5.2        jquerylib_0.1.4    
## [37] cachem_1.1.0        iterators_1.0.14    foreach_1.5.2      
## [40] nlme_3.1-163        tidyselect_1.2.1    digest_0.6.35      
## [43] stringi_1.8.4       dplyr_1.1.4         reshape2_1.4.4     
## [46] labeling_0.4.3      fastmap_1.2.0       colorspace_2.1-0   
## [49] cli_3.6.2           magrittr_2.0.3      utf8_1.2.4         
## [52] withr_3.0.0         scales_1.3.0        rmarkdown_2.27     
## [55] matrixStats_1.0.0   png_0.1-8           GetoptLong_1.0.5   
## [58] evaluate_0.24.0     IRanges_2.34.1      doParallel_1.0.17  
## [61] rlang_1.1.4         Rcpp_1.0.12         glue_1.7.0         
## [64] BiocGenerics_0.46.0 rstudioapi_0.15.0   jsonlite_1.8.8     
## [67] R6_2.5.1            plyr_1.8.9

References

Boccard, Julien, and Douglas N Rutledge. 2013. “A Consensus Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) Strategy for Multiblock Omics Data Fusion.” Analytica Chimica Acta 769 (March): 30–39. https://doi.org/10.1016/j.aca.2013.01.022.
Lindgren, Fredrik, Paul Geladi, and Svante Wold. 1993. “The Kernel Algorithm for PLS.” Journal of Chemometrics 7 (1): 45–59. https://doi.org/10.1002/cem.1180070104.
Shoemaker, Robert H. 2006. “The NCI60 Human Tumour Cell Line Anticancer Drug Screen.” Nature Reviews Cancer 6 (10): 813–823. https://doi.org/10.1038/nrc1951.
Szymańska, Ewa, Edoardo Saccenti, Age K Smilde, and Johan A Westerhuis. 2012. “Double-Check: Validation of Diagnostic Statistics for PLS-DA Models in Metabolomics Studies.” Metabolomics 8: 3–16. https://doi.org/10.1007/s11306-011-0330-3.
Tenenhaus, Michel, Vincenzo Esposito Vinzi, Yves-Marie Chatelin, and Carlo Lauro. 2005. “PLS Path Modeling.” Computational Statistics & Data Analysis 48 (1): 159–205. https://doi.org/10.1016/j.csda.2004.03.005.
Westerhuis, Johan, Ewoud van Velzen, Huub Hoefsloot, and Age Smilde. 2008. “Discriminant Q2 (DQ2) for Improved Discrimination in PLSDA Models.” Metabolomics 4 (December): 293–96. https://doi.org/10.1007/s11306-008-0126-2.
Wold, Svante, Michael Sjöström, and Lennart Eriksson. 2001. “PLS-Regression: A Basic Tool of Chemometrics.” Chemometrics and Intelligent Laboratory Systems 58 (2): 109–30. https://doi.org/10.1016/S0169-7439(01)00155-1.