Digital PCR Cluster Predictor: a Universal Tool for the Automated Analysis of Multiplex Digital PCR Data

Introduction

Digital polymerase chain reaction (dPCR) is a PCR-based technology that enables the sensitive quantification of nucleic acids. In a dPCR experiment, nucleic acids are randomly distributed and end-point amplified in several thousands of partitions that act as micro PCR reactors. The partitioning process and the end-point detection of targets are the foundation of dPCR high sensitivity: each partition receives either zero or few nucleic acid copies, increasing the amplification efficiency; the end-point reaction ensures the amplification of targets to a detectable level. The signal emitted by hydrolysis probes or intercalating binding dyes is used to detect the partitions containing the targets sequence. The maximum number of fluorescence signals read in a single sample represents the major limitation of dPCR technology: the majority of dPCR system on the market is able to detect up to two fluorescence signals, limiting the experiment plexity. Several strategies were developed to overcome that limitation (Whale et al., 2016), however data analysis of multiplex assays and clustering of data generated from low input specimens are still an issue: manual annotation is time-consuming, user-dependent and has poor reproducibility.

Digital PCR Cluster Predictor (dPCP) was developed to automate the analysis of multiplex digital PCR data with up to four targets. dPCP supports the analysis multiple digital PCR systems, is independent of multiplexing geometry, and is not influenced by the amount of input nucleic acid.

dPCP input data and workflow

dPCP requires two types of input files:

A file containing the raw data for each sample and reference (“.eds” for QuantStudio 3D Digital PCR System, “.csv” for QX100/QX200 Droplet Digital PCR System and other digital PCR systems). The raw data of QX100/QX200 system can be exported from QuantSoft software selecting Export Amplitude and Cluster data in the Options pane, it is fundamental to keep the default file name and prepare the plate setup always for two targets (Channel1 and Channel2), even for singleplex assays. For other digital PCR systems, the raw data have to be prepared in the same format of QX100/QX200 system:
- upload for each sample a “.csv” file with two columns (column 1: fluorescence channel to be shown on y axis of plots; column 2: fluorescence channel to be shown on x axis of plots).
- the file name has be structured as follow: WellID_Amplitude.csv. If it is needed to analyse the files of different experiments, ensure that the same well ID is not used multiple times as the software cannot discriminate file names with the same well ID.
A sample table. The sample table is a csv file containing the information about the samples and the experiment settings.

#Some examples of input files are included in the package
system.file("extdata", package = "dPCP")

The sample table has nine columns:

Sample name. The name to be assigned to a sample. The samples with the same name are considerate replicates. The field is not mandatory; the samples without a name are not replicates.
Chip ID/Well ID. This field is mandatory. For QuantStudio 3D Digital PCR System the chip ID is required, while for QX100/QX200 Droplet Digital PCR System and other systems the well ID has to be provided (e.g. A01). Chip and well ID are used to identify the raw data files, therefore the default file name of QuantStudio 3D Digital PCR System and QX100/QX200 Droplet Digital PCR System must not be changed.
No of targets. This field is mandatory and must be an integer number between 0 and 4.
FAM target. The name of the target detected with a FAM probe.
Target 3. The name of the third target. If the assay has an orthogonal geometry, this field should be filled in with the name of the second target detected with a FAM probe;
Target 4. The name of the fourth target. If the assay has an orthogonal geometry, this field should be filled in with the name of the second target detected with a VIC/HEX probe;
VIC/HEX target. The name of the target detected with a VIC/HEX probe.
Reference. This field is optional, a reference must be provided only when sensitive analysis are needed (few positive partition for one or more targets) and/or rain is expexted in sample data. For QuantStudio 3D Digital PCR System the chip ID of the reference is required, while for QX100/QX200 Droplet Digital PCR System and other systems the complete file name (with file extension) of the reference raw data file has to be provided (e.g. example_ref_Amplitude.csv). If the field is empty, the corresponding sample is used as reference to identify the empty partitions and single-target clusters with DBSCAN. In this scenario, the sample must have all the requirements to be used as reference (see below). The samples from other runs can be used as reference but in case of experiments with QX100/QX200 Droplet Digital PCR System or other systems, the name of the reference file has to be changed to remove the well ID, which could interfere with the samples well IDs. Although it is not always required for the analysis, we strongly recommend the use of reference to have faster and more accurate cluster analyses.
Dilution. This field is mandatory and must be a numeric value representing the dilution ratio of the sample (e.g. 1:5 dilution has to be indicated as 0.2).

The sample table has to be filled out by the user with the required information. The table format is fundamental for the analysis and must not be changed. A file named Template_sampleTable.csv is saved as an example file of this package and can be used as template:

#Show the content of sample table template
read.csv(system.file("extdata", "Template_sampleTable.csv", package = "dPCP"),
          stringsAsFactors = FALSE, na.strings = c("NA", ""))

#Copy the template to working directory
file.copy(system.file("extdata", "Template_sampleTable.csv", package = "dPCP"), getwd())

The first step carried out by dPCP is the collection of data and information from the input files. When a reference is used, it is fundamental to have high-quality data as dPCP starts the identification of clusters from the reference. Once a good reference has been identified, it can be used for the analysis of all samples amplified with the same experimental conditions (e.g. same assay, primers and probes concentration, cycling protocol).
The ideal reference has:

The same experimental settings and conditions of the corresponding sample
Sufficient input amount to promote the formation at least of high-density empty partitions and single-target clusters;
Negligible presence of rain;
Non-cross-reactive probes (see Quality control).

dPCP identifies the empty partitions and single-target clusters in the reference using the non-parametric algorithm called density-based spatial clustering of applications with noise (Ester et al., 1996; Hahsler et al., 2019) (DBSCAN). Maximum distance (ε) between cluster elements and the number of minimum elements (minPts) to assemble a cluster are the input parameters to be chosen by the users. The function dbscan_combination() (see Quality control) helps the user to identify the most suitable ε and minPts values.

After the identification of empty partitions and single-target clusters, their centroid position is identified by computing the arithmetic mean of the coordinates of their data elements. The distance between a cluster centroid and the centroid of empty partitions can be represented by a Euclidean vector. As the coordinates of the centroids of multi-target clusters are predicted to be the sum of the coordinates of the centroids of single-target clusters, the position of the centroid of multi-target clusters can be calculated by computing the vector sum of vectors representing the distance of the centroid the single-target clusters to the centroids of empty partitions.

The clustering analysis of sample data is carried out by the unsupervised competitive learning version of the c-means algorithm (Bezdek, 1981; Lai Chung and Lee, 1994; Pal et al., 1996). The principle of fuzzy c-means algorithm is to minimize the variance within the cluster. The intra-cluster variance is defined as the sum of the squared distance of all cluster elements from the cluster centroid. The fluorescence values of sample elements and the coordinates of all centroids are used as input parameter for the analysis. The output of the c-means analysis is a matrix showing the probability of membership of the data elements to each cluster. Each data element is assigned to the cluster whose probability is the highest. If the highest probability is lower than 0.5 a data element is classified as rain and its membership is recalculated with Mahalanobis distance (Mahalanobis, 1936). Mahalanobis distance computes the distance between a point and a distribution, it is based on measuring at multidimensional level how many standard deviations away is a point from the mean of a distribution. The rain-tagged elements are assigned to the cluster with the lowest Mahalanobis distance.

The cluster results can be corrected manually by the user with the shiny-based function manual_correction().

Finally, the copies per partition of each target are calculated according to a Poisson model. (Hindson et al., 2011). Precision is calculated as previously described (Majumdar et al., 2015). Replicates can be combined and the copies per partition are re-calculated.

A complete analysis can be executed by the function dPCP().

library(dPCP)

#Find path of sample table and location of reference and input files
sampleTable <- system.file("extdata", "Template_sampleTable.csv", 
                           package = "dPCP")

fileLoc <- system.file("extdata",package = "dPCP")

#Lunch dPCP analysis
results <- dPCP(sampleTable, system = "bio-rad", file.location = fileLoc,
                 , eps = 200, minPts = 50, save.template = FALSE, rain = TRUE)

Alternatively, a step by step analysis can be carried out following the abovementioned pipeline:

library(dPCP)
#Find path of sample table and location of reference and input files
sampleTable <- system.file("extdata", "Template_sampleTable.csv", 
                           package = "dPCP")

fileLoc <- system.file("extdata",package = "dPCP")

#Read sample table file
sample.table <- read_sampleTable(sampleTable, system = "bio-rad", 
                                 file.location = fileLoc)

#Read reference files
ref <- read_reference(sample.table, system = "bio-rad", 
                      file.location = fileLoc)

#Read samples files
samp <- read_sample(sample.table, system = "bio-rad", file.location = fileLoc)

#Reference DBSCAN clustering
dbref <- reference_dbscan(ref, sample.table, save.template = FALSE)

#Predict position of clusters centroid from reference DBSCAN results
cent <- centers_data(samp, sample.table,dbref)

#Fuzzy c-means clustering
cmclus <- cmeans_clus(cent)

#Rain classification.
rainclus <- rain_reclus(cmclus)

#Quantification
quantcm <- target_quant(cmclus, sample.table)
quant <- target_quant(rainclus, sample.table)

#Replicates pooling
rep.quant <- replicates_quant(quant, sample.table)

dPCP is available also as web app accessible through a web browser at dpcp.lns.lu

Quality control

Quality controls were developed for the fundamental steps of dPCP analysis. Along with the clustering algorithm, the function dbscan_combination() and the plot() S3 method were implemented to have a graphical view of the results and to check the quality of the following processes:

Choice of DBSCAN input parameters. The identification of empty partitions and single-target clusters in the reference is the first step of dPCP analysis and relies on the DBSCAN algorithm. The performance of DBSCAN depends on the input parameters ε and minPts that are the only two values the user has to adapt for dPCP analysis. In order to simplify the choice of input values, we developed the function dbscan_combination() that carries out a DBSCAN simulation for the combinations of ε and minPts values chosen by the user. The function generates a pdf file for each reference, showing a scatterplot for each combination. The ideal combination of input values is chosen according to the following criteria:
- identification of the majority of data elements of empty partitions cluster and all single-target clusters;
- absence of multiple subclusters;
- the identified area has to be centred in the cluster centroid.

An example of dbscan_combination output is showed in Figure 1.

Examples of output plots of dbscan_combination.

Fig. 1: Examples of the output plots of dbscan_combination(). Each graph represents the DBSCAN analysis performed with different combinations of input parameters eps and minPts. Assembled clusters are represented with colored dots; different colors indicate distinct clusters whereas grey dots show not-clustered elements. The combinations (A), (B), (C), and (D) of DBSCAN input parameters are not suitable for a dPCP analysis because: (A) None of the single-target clusters is identified.
(B) One of the single-target clusters is not identified.
(C) One of the single-target clusters (purple) shows multiple subclusters.
(D) In one of the single-target clusters (green), the identified cluster is not centered in the cluster centroid. The combinations (E) and (F) identified the empty partitions cluster and all single-target clusters, therefore they are suitable for the analysis.

DBSCAN analysis of reference. At the end of cluster analysis, the results of DBSCAN analysis of each reference can be represented in a scatterplot using the S3 method plot.
Identification of clusters centroid. The S3 method plot can be used also to verify the correct position of all cluster centroids. If the predicted position of the centroid of multi-target clusters does not match the real position of centroids, cross-reaction between probes or poor assay optimization could be the cause (Fig. 2).
C-means, rain analysis, and silhouette coefficient. The S3 method plot is available also to show the scatterplot of c-means, rain analysis, and silhouette coefficient.

In order to evaluate the structure of the original dPCP clustering in terms of cluster cohesion and separation, the silhouette coefficient (Rousseeuw, 1987) is calculated for each sample. According to Kaufman and Rousseeuw (Kaufman and Rousseeuw, 1990), the mean value of silhouette coefficient has to be interpreted as follow:

between 0.71 and 1: the clustering structure is strong
between 0.51 and 0.70: the clustering structure is reasonable;
between 0.26 and 0.50: the clustering structure is weak and could be artificial;
lower than 0.25: no substantial structure.

Quality control ofthe centroids position prediction.

Fig. 2: Quality control of the centroids coordinates prediction. (A) The prediction of coordinates of multi-target cluster centroid did not match the real position. The shift of centroid position can be the consequence of cross-reactive probes or poor assay optimization. (B) The position of clusters centroids were correctly predicted.

Data export

The results and analysis information can be exported to a csv file with the function export_csv(). The exported file consists of three tables:

Reference samples table reporting the following information:
- Reference: reference ID;
- Quality: quality threshold;
- eps: value used for the reference DBSCAN analysis;
- minPts: values used for the reference DBSCAN analysis;
Results table reporting the following information:
- Sample: sample name and ID;
- Target: target name;
- Negative reactions: number of negative reactions;
- Total reactions: number of total reactions with quality higher than the threshold;
- lambda: average number of copies per reaction;
- Lower CI lambda: lower 95 % confidence interval of lambda;
- Upper CI lambda: upper 95 % confidence interval of lambda;
- Copies/μl: target concentration;
- Lower CI copies/μl: lower 95 % confidence interval of target concentration;
- Upper CI copies/μl: upper 95 % confidence interval of target concentration;
- Copies/μl at sample dilution: target concentration before dilution;
- Lower CI copies/μl at sample dilution: lower 95 % confidence interval of Copies/μl at sample dilution;
- Upper CI copies/μl at sample dilution: upper 95 % confidence interval of Copies/μl at sample dilution;
- Precision: size of the confidence interval for distinguishing between two sample concentrations at a given confidence level;
- Dilution: dilution factor;
- Quality: quality threshold;
- Reference: reference ID;
Replicates results table reporting the following information:
- Sample: sample name and chip ID;
- Target: target name;
- Copies/μl: target concentration;
- Lower CI copies/μl: lower 95 % confidence interval of target concentration;
- Upper CI copies/μl: upper 95 % confidence interval of target concentration;
- Precision: size of the confidence interval for distinguishing between two sample concentrations at a given confidence level;
- No of replicates: number of replicates.

A summary report can be generated with the function report_pdf(). The output is a pdf file containing:

the plot of dPCP clustering analysis,
the sample quality threshold and dilution;
the reference ID, quality threshold and input parameters of DBSCAN analysis;
a table containing the same information reported in the Results table of the abovementioned csv file.

When the shiny-based function manual_correction() is used, the results tables and pdf report can be exported directly within the shiny window clicking the “Export data” button.

References

Bezdek,J.C. (1981) Pattern Recognition with Fuzzy Objective Function Algorithms Springer US, Boston, MA.

Ester,M. et al. (1996) A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining., pp. 226–231.

Hahsler,M. et al. (2019) dbscan: Fast Density-Based Clustering with R. J. Stat. Softw., 91, 1–30.

Hindson,B.J. et al. (2011) High-throughput droplet digital PCR system for absolute quantitation of DNA copy number. Anal. Chem., 83, 8604–8610.

Kaufman,L. and Rousseeuw,P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis.

Lai Chung,F. and Lee,T. (1994) Fuzzy competitive learning. Neural Networks, 7, 539–551.

Mahalanobis,P.P.C. (1936) On the generalized distance in statistics. Proc. Natl. Inst. Sci. India, 2, 49–55.

Majumdar,N. et al. (2015) Digital PCR modeling for maximal sensitivity, dynamic range and measurement precision. PLoS One, 10, e0118833.

Pal,N.R. et al. (1996) Sequential competitive learning and the fuzzy c-means clustering algorithms. Neural Networks, 9, 787–796.

Rousseeuw,P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53–65.

Whale,A.S. et al. (2016) Fundamentals of multiplexing with digital PCR. Biomol. Detect. Quantif., 10, 15–23.