Creating RDML file from raw data

Konstantin A. Blagodatskikh, Michał Burdukiewicz, Stefan Rödiger

2016-12-13

Creating RDML file from raw data

This vignette describes how to create RDML object from user provided data and calculate Cq values with custom RDML class method. Our aim is to a real-world example for the use of the RDML package.

For this purpose we used data from the VideoScan platform. The VideoScan platform is a highly versatile fluorescence microscope based imaging platform for the real-time analysis of biomolecular interactions. VideoScan has been extended by a heating/cooling-unit, which can be used for DNA melting curve analysis [@roediger_RJ_2013], real-time quantitative PCR [@roediger_highly_2013] and quantitative isothermal amplification [@spiess_impact_2014]. In this example we focus on qPCR data obtained from the amplification of the human gene mlc-2v [@sheikh_2015]. The experimental details (e.g., primer sequences, amplification conditions) are described in [@roediger_highly_2013; @george_development_2016].

In detail, this dataset contains three amplification curves collected by three runs monitored with TaqMan qPCR chemistry. Similar to other experimental platforms has the VideoScan systems an output as csv file (comma separated value). In general, csv is format that is sufficient for in house usage. However, the exchange with other scientists is demanding.

After completion of the experiment all csv raw data were transferred to the R environment and stored as rda files in the C54 dataset as described elsewhere [@RDCT2010c]. The R environment provides numerous packages for the import of standard file formats, including formats from spread sheet applications (e.g., XLS, XLSX, ODS) [@Schutten_2014_ods, @Warnes_2015_gdata, @dragulescu_xlsx_2014], an ODBC connection [@ripley_rodbc_2015] or from the clipboard (read.table("clipboard")). There are several graphical user interfaces (GUIs) for R, which facilitate an easy import of raw data [@Valero_2012]. For example RKWard [@roediger_rkward_2012] has import functionality for csv, txt, spss and stata files.

Therefore, we will use the C54 dataset directly from the chipPCR package [@roediger2015chippcr]. The amplification curves have different number of cycles and one of them contains a missing value due to a sensor dropout. Thus, this data should be preprocessed before Cq calculation.

Initialization

First of all we have to load all necessary packages and fluorescence data into our R session.

library(RDML)
library(chipPCR)
library(MBmca)
#> Warning: package 'robustbase' was built under R version 3.3.1
library(ggplot2)
data(C54)

We can see that our data is data.frame with cycle numbers in the first column and fluorescence points in other columns. Such format is ready to use with RDML package and does not need any conversion.

str(C54)
#> 'data.frame':    56 obs. of  4 variables:
#>  $ Cycle: num  0 1 2 3 4 5 6 7 8 9 ...
#>  $ D1   : num  0.141 0.146 0.152 0.151 0.153 ...
#>  $ D2   : num  0.148 0.148 0.148 0.149 0.149 ...
#>  $ D3   : num  0.145 0.147 0.148 0.148 0.149 ...

dat <- data.frame(Cycle = rep(C54[, 1], ncol(C54) - 1),
                  Experiment = unlist(lapply(colnames(C54)[-1], function(i) rep(i, nrow(C54)))),
                  refMFi = unlist(C54[, -1]))
levels(dat[["Experiment"]]) <- c("D1 - Stock cDNA", "D2 - 1/10 cDNA", "D3 - 1/100 cDNA")

ggplot(dat, aes(x = Cycle, y = refMFi, colour = Experiment)) + geom_point()

Creating custom method for RDML class

Preprocessing and Cq calculation steps will be held inside RDML object. To do this we shall add custom method to RDML class and use this advanced class as RDML object initializator (data import).

As described previously [@roediger2015r], we first preprocess the raw data with the CPP function from the chipPCR package. After that preprocessed data will be stored in new runs inside same experiment. For Cq calculation we used the diffQ2 function from the MBmca [@roediger_RJ_2013] package.

Additional our method will have two arguments:

RDML$set("public", "ProcessVideoScan",
         function(last.cycle,
                  bg.range) {
           # Get curves description
           tab <- as.data.frame(self$AsTable())
           # Get fluorescent point
           dat <- as.data.frame(self$GetFData(tab))
           # Give new names to runs
           # Preprocess amplification curve raw data as described in 
           # R&ouml;diger et al. (2015) Bioinformatics. Note that the dataset
           # has different length of cycle numbers and missing values (NAs).
           # The CPP function from the chipPCR package prepares the data for 
           # further analysis.
           tab[, "run.id"] <- paste0(tab[, "run.id"], "_CPP")
           
           # Curves will be used one by one
           for(i in 2:length(dat)) {
             # Preprocess data
             preprocessed <- CPP(dat[1:last.cycle[i - 1], 1], 
                                 dat[1:last.cycle[i - 1], i],
                                 trans = TRUE,
                                 bg.range = bg.range[[i - 1]])[["y.norm"]]
             # Create data.frame with cycle number and preprocessed curve Then
             # give name to fluorescence points columns as before preprocessing
             dat_CPP <- cbind(dat[1:last.cycle[i - 1], 1],
                              preprocessed)
             colnames(dat_CPP)[2] <- tab$fdata.name[i - 1]
             # Set preprocessed data with new description (new run names)
             self$SetFData(dat_CPP, tab)
             # Calculate and set Cq
             # Set Cq from second derivative maximum method as described in 
             # R&ouml;diger et al. (2015) Bioinformatics for preprocessed data.
             # The diffQ2 function from the MBmca package 
             # (R&ouml;diger et al. (2013), The R Journal) was used to calculate the
             # Cq values of each amplificaion curve.
             cq <- diffQ2(dat_CPP, inder = TRUE)[["xTm1.2.D2"]][1]
             self$experiment[[tab[i - 1, "exp.id"]]]$
               run[[tab[i - 1, "run.id"]]]$
               react[[tab[i - 1, "react.id"]]]$
               data[[tab[i - 1, "target"]]]$cq <- cq
           }
         }, overwrite = TRUE
)

RDML object creation

Now we can create empty RDML with our advanced class and set fluorescence data into it. But for setting data we have to create description (metadata) of curves which will contain:

# Create a data frame of metadata 
description <- data.frame(
  fdata.name = c("D1", "D2", "D3"),
  exp.id = c("exp1", "exp1", "exp1"),
  run.id = c("run1", "run2", "run3"),
  react.id = c(1, 1, 1),
  sample = c("D1 - Stock cDNA", "D2 - 1/10 cDNA", "D3 - 1/100 cDNA"),
  sample.type = c("unkn", "unkn", "unkn"),
  target = c("MLC-2v", "MLC-2v", "MLC-2v"),
  target.dyeId = c("Cy5", "Cy5", "Cy5"),
  stringsAsFactors = FALSE
)

# Create an empty RDML object
video.scan <- RDML$new()

# Add fluorescence data and metadata to the RDML object from a given source
# Fare the sake of easyness we use the C54 dataset from the chipPCR package.
video.scan$SetFData(C54, description)

Setting additional information

As optional step we can add some information about our experiment. Such as experimenter contacts, references to documentation, description of samples and targets, thermal cycling conditions, etc.

# Add experimentator information
video.scan$experimenter <- 
  list(
    experimenterType$new(
      idType$new("SR"),
      "Stefan",
      "Roediger",
      "stefan.roediger@b-tu.de"
    ),
    experimenterType$new(
      idType$new("CD"),
      "Claudia",
      "Deutschmann"
    )
  )

# Add a reference to documentation
video.scan$documentation <- list(
  documentationType$new(
    idType$new("Roediger et al. 2013"),
    paste("A Highly Versatile Microscope Imaging Technology Platform for the Multiplex",
          "Real-Time Detection of Biomolecules and Autoimmune Antibodies. S. Roediger,",
          "P. Schierack, A. Boehm, J. Nitschke, I. Berger, U. Froemmel, C. Schmidt, M.",
          "Ruhland, I. Schimke, D. Roggenbuck, W. Lehmann and C. Schroeder. Advances in",
          "Biochemical Bioengineering/Biotechnology. 133:33-74, 2013.",
          "https://www.ncbi.nlm.nih.gov/pubmed/22437246")
  )
)

cdna <- 
  cdnaSynthesisMethodType$new(
    enzyme = "SuperScript II",
    primingMethod =
      primingMethodType$new("oligo-dt"),
    dnaseTreatment = TRUE
  )
video.scan$sample$`D1 - Stock cDNA`$description <- "Input stock cDNA was used undiluted (D1)"
video.scan$sample$`D1 - Stock cDNA`$cdnaSynthesisMethod <- cdna
video.scan$sample$`D2 - 1/10 cDNA`$description <- "1/1000 diluted in A. bidest"
video.scan$sample$`D2 - 1/10 cDNA`$cdnaSynthesisMethod <- cdna
video.scan$sample$`D3 - 1/100 cDNA`$description <- "1/1000000 diluted in A. bidest"
video.scan$sample$`D3 - 1/100 cDNA`$cdnaSynthesisMethod <- cdna

video.scan$target$`MLC-2v`$xRef <- list(
  xRefType$new("uniprot",
               "P10916")
)
video.scan$target$`MLC-2v`$sequences <- 
  sequencesType$new(
    forwardPrimer <- oligoType$new(
      sequence = "ACAGGGATGGCTTCATTGAC"),
    reversePrimer <- oligoType$new(
      sequence = "ATGCGTTGAGAATGGTTTCC"),
    probe1 <- oligoType$new(
      threePrimeTag = "Atto647N",
      sequence = "CAGGGTCCGCTCCCTTAAGTTTCTCC",
      fivePrimeTag = "BHQ2")
  )

tcc <- 
  thermalCyclingConditionsType$new(
    idType$new("Amplification"),
    experimenter = list(
      idReferencesType$new("SR"),
      idReferencesType$new("CD")
    ),
    step = 
      list(
        stepType$new(
          nr = 1,
          temperature = temperatureType$new(95,
                                            600)
        ),
        stepType$new(
          nr = 2,
          temperature = temperatureType$new(95,
                                            40)
        ),
        stepType$new(
          nr = 3,
          temperature = temperatureType$new(58.5,
                                            90)
        ),
        stepType$new(
          nr = 4,
          temperature = temperatureType$new(68.5,
                                            90)
        ),
        stepType$new(
          nr = 5,
          loop = loopType$new(goto = 2,
                              repeat.n = 49)
        )
      )
  )
video.scan$thermalCyclingConditions <- list(
  tcc
)

#add description of the experiment
video.scan$experiment$exp1$description <- 
  paste("The aim was to amplify MLC-2v in the VideoScan and to monitor with a",
        "hydrolysis probe for MLC-2v. The primer sequences for MLC-2v were taken",
        "from Roediger et al. (2013). The amplification was detected in solution of",
        "the 1 HCU (see Roediger et al. 2013 for details). A 20 micro L PCR reaction",
        "was composed of 250 nM primer (forward and reverse), 1x Maxima Probe qPCR",
        "Master Mix (Fermentas), 1 micro L template (MLC-2v amplification product in",
        "different dilutions), 50 nM hydrolysis probe probe for MLC-2v and A.",
        "bidest. During the amplification, fluorescence was measured at 59.5 degree",
        "Celsius. The Cy5 channel was used to monitor the MLC-2v specific hydrolysis",
        "probe. Input stock cDNA was used undiluted (D1). D2 was 1/1000 and D3",
        "1/1000000 diluted in A. bidest. The D1, D2, and D3 have different numbers",
        "measure points and D2 contains a missing value at cycle 37.")
video.scan$experiment$exp1$run$run1$thermalCyclingConditions <- idReferencesType$new("Amplification")
video.scan$experiment$exp1$run$run2$thermalCyclingConditions <- idReferencesType$new("Amplification")
video.scan$experiment$exp1$run$run3$thermalCyclingConditions <- idReferencesType$new("Amplification")

Processing data

After creating RDML object and setting data we can use our method $ProcessVideoScan.

# Process VideoScan data
video.scan$ProcessVideoScan(last.cycle = c(35, 45, 55),
                            bg.range = list(c(1,8),
                                           NULL,
                                           NULL))

Then visualize object with $AsDendrogram method (runs with IDs that include _CPP in names contain preprocessed curves):

# Visualise RDML object
video.scan$AsDendrogram()

And plot preprocessed curves indicating Cq values:

## Visualise preprocessed data with Cq values as vertical dashed lines
# Add custom column that contains calculated Cq
tab <- video.scan$AsTable(cq = {
                 cq <- data$cq
                 if (is.null(cq) || is.na(cq))
                   NULL
                 else
                   cq
               })
# Get preprocessed data in 'long.table' format
dat <- video.scan$GetFData(tab[grepl("_CPP", tab[["run.id"]]), ],
                           long.table = TRUE)
ggplot(dat, aes(x = cyc, y = fluor)) +
  geom_line(aes(group = fdata.name, color = fdata.name),
            size = 1.2) +
  geom_vline(aes(xintercept = cq, color = fdata.name),
             linetype = "longdash") +
  scale_x_continuous("Cycle") +
  scale_y_continuous("Fluorescence") +
  ggtitle("Amplification of human MLC-2v\nVideoScan HCU")

Export to file

To save our RDML object as file we can use $AsXML() method with file name as argument:

video.scan$AsXML("filename.RDML")

Note that currently due to limits of XML package it is rather slow operation!!!

XML files can be compressed to save sapce. The RMDL package uses the zip() function from the utils package with default values to compress the RDML file. Therefore, the compression mode depends on the operating system. The RDML function AsXML() without an argument returns one string, which contians the XML data. This can be used to compress it with any other compression algorithm.

Benchmark of the RDML generation

XML becomes less efficient to work with at larger file sizes and complexer structured information (e.g., matrices). Therefore, a limitation of RMDL is the process of file generation. This characteristic is inherited by the RDML package. Speed is an issue in high-throughput applications. To give the user a foundation for a decision if the implementation of the RMDL format is advisable we performed a benchmark. We used the microbenchmark package [@Mersmann_2014_microbenchmark, @Radford_2014_microbenchmark] to assess the time required for generation of RDML files. The simulation included datasets with ranging from 1 to 1000 entries per file.

Conclusion

In this example we used raw data from the VideoScan platform to demonstrate the creation of an RDML file via the RDML package. This procedure can be easily adapted to other experiments and gives the researcher a tool to create a data file for convenient data exchange. In contrast to simple csv files contains this file metadata in a defined structure. This in turn makes the data available to other scientists for reproducible research.