Digital polymerase chain reaction (dPCR) is a PCR-based technology that enables the sensitive quantification of nucleic acids. In a dPCR experiment, nucleic acids are randomly distributed and end-point amplified in several thousand partitions that act as micro PCR reactors. The partitioning process and the end-point detection of targets are the foundation of the high sensitivity of dPCR: each partition receives either zero or a few nucleic acid copies, which increases the amplification efficiency, and the end-point reaction ensures that targets are amplified to a detectable level. The signal emitted by hydrolysis probes or intercalating dyes is used to detect the partitions containing the target sequence. The maximum number of fluorescence signals that can be read in a single sample is the major limitation of dPCR technology: most dPCR systems on the market can detect at most two fluorescence signals, which limits the plexity of the experiment. Several strategies have been developed to overcome this limitation (Whale et al., 2016); however, the analysis of multiplex assay data and the clustering of data generated from low-input specimens remain problematic: manual annotation is time-consuming, user-dependent, and poorly reproducible.
Digital PCR Cluster Predictor (dPCP) was developed to automate the analysis of multiplex digital PCR data with up to four targets. dPCP supports the analysis of data from multiple digital PCR systems, is independent of the multiplexing geometry, and is not influenced by the amount of input nucleic acid.
dPCP requires two types of input files:
The sample table has nine columns:
The sample table has to be filled out by the user with the required information. The table format is fundamental for the analysis and must not be changed. A file named Template_sampleTable.csv is included in this package as an example and can be used as a template:
    # Show the content of the sample table template
    read.csv(system.file("extdata", "Template_sampleTable.csv", package = "dPCP"),
             stringsAsFactors = FALSE, na.strings = c("NA", ""))

    # Copy the template to the working directory
    file.copy(system.file("extdata", "Template_sampleTable.csv", package = "dPCP"),
              getwd())
The first step carried out by dPCP is the collection of data and
information from the input files. When a reference is used, it is
fundamental to have high-quality data as dPCP starts the identification
of clusters from the reference. Once a good reference has been
identified, it can be used for the analysis of all samples amplified
under the same experimental conditions (e.g. same assay, primer and
probe concentrations, and cycling protocol).
The ideal reference has:
dPCP identifies the empty partitions and single-target clusters in the reference using the non-parametric algorithm called density-based spatial clustering of applications with noise (DBSCAN) (Ester et al., 1996; Hahsler et al., 2019). The maximum distance (ε) between cluster elements and the minimum number of elements (minPts) needed to assemble a cluster are the input parameters to be chosen by the user. The function dbscan_combination() (see Quality control) helps the user to identify the most suitable ε and minPts values.
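As a minimal sketch of how DBSCAN behaves on two-channel fluorescence data, the example below clusters simulated partitions with dbscan() from the dbscan package (Hahsler et al., 2019). The simulated amplitudes and the eps and minPts values are illustrative assumptions; the code does not reproduce the internals of reference_dbscan().

```r
library(dbscan)

set.seed(1)
# Simulated reference: empty partitions plus two single-target clusters
# (all amplitudes are made-up values for illustration)
n <- 300
ref <- rbind(
  cbind(rnorm(n, 1000, 50), rnorm(n, 1000, 50)),  # empty partitions
  cbind(rnorm(n, 5000, 50), rnorm(n, 1000, 50)),  # target 1 only
  cbind(rnorm(n, 1000, 50), rnorm(n, 4000, 50))   # target 2 only
)
colnames(ref) <- c("ch1", "ch2")

# eps: maximum distance between neighbouring elements
# minPts: minimum number of elements needed to assemble a cluster
db <- dbscan(ref, eps = 200, minPts = 50)

# Cluster label 0 marks elements that were not assigned to any cluster
n_clusters <- length(setdiff(unique(db$cluster), 0))
table(db$cluster)
```

Lowering eps or raising minPts on the same data leaves more elements unclustered, which is the kind of trade-off the ε and minPts choice has to balance.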
After the empty partitions and single-target clusters have been identified, the position of each centroid is computed as the arithmetic mean of the coordinates of the cluster's data elements. The distance between a cluster centroid and the centroid of the empty partitions can be represented by a Euclidean vector. Because the coordinates of the centroids of multi-target clusters are predicted to be the sum of the coordinates of the centroids of single-target clusters, the position of a multi-target centroid can be calculated as the vector sum of the vectors that connect the centroid of the empty partitions to the centroids of the corresponding single-target clusters.
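The vector sum can be illustrated with a short worked example in base R. The centroid coordinates below are hypothetical values chosen for readability, and the variable names are not part of the dPCP API.

```r
# Hypothetical centroids (channel 1, channel 2) found in the reference
empty   <- c(ch1 = 1000, ch2 = 1000)   # empty partitions
target1 <- c(ch1 = 5000, ch2 = 1200)   # single-target cluster 1
target2 <- c(ch1 = 1300, ch2 = 4000)   # single-target cluster 2

# Vectors from the empty-partition centroid to each single-target centroid
v1 <- target1 - empty
v2 <- target2 - empty

# Predicted centroid of the double-positive cluster: the empty-partition
# centroid displaced by the sum of the two single-target vectors
double_pos <- empty + v1 + v2
double_pos   # ch1 = 5300, ch2 = 4200
```

The same construction extends to three or four targets by adding the vectors of every target present in the multi-target cluster.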
The clustering analysis of sample data is carried out by the unsupervised competitive learning version of the fuzzy c-means algorithm (Bezdek, 1981; Lai Chung and Lee, 1994; Pal et al., 1996). The principle of the fuzzy c-means algorithm is to minimize the variance within each cluster. The intra-cluster variance is defined as the sum of the squared distances of all cluster elements from the cluster centroid. The fluorescence values of the sample elements and the coordinates of all centroids are used as input parameters for the analysis. The output of the c-means analysis is a matrix showing the probability of membership of each data element to each cluster. Each data element is assigned to the cluster for which its membership probability is highest. If the highest probability is lower than 0.5, the data element is classified as rain and its membership is recalculated with the Mahalanobis distance (Mahalanobis, 1936). The Mahalanobis distance measures the distance between a point and a distribution: it expresses, in multiple dimensions, how many standard deviations the point lies from the mean of the distribution. The rain-tagged elements are assigned to the cluster with the lowest Mahalanobis distance.
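The assignment and rain rules can be sketched in base R as follows. The membership matrix is a hand-written stand-in for the output of a fuzzy c-means run, and the cluster data and element coordinates are simulated assumptions; the code is not the implementation used by cmeans_clus() or rain_reclus().

```r
set.seed(42)
# Three hypothetical clusters with made-up amplitudes (channel 1, channel 2)
clusters <- list(
  empty   = cbind(rnorm(200, 1000, 50),  rnorm(200, 1000, 50)),
  target1 = cbind(rnorm(200, 5000, 300), rnorm(200, 1000, 50)),
  target2 = cbind(rnorm(200, 1000, 50),  rnorm(200, 4000, 300))
)

# Three elements to classify and their hypothetical membership
# probabilities (rows: elements, columns: clusters)
elements <- rbind(c(1010, 990), c(4900, 1020), c(3000, 1000))
membership <- rbind(
  c(0.97, 0.02, 0.01),
  c(0.05, 0.90, 0.05),
  c(0.45, 0.40, 0.15)
)
colnames(membership) <- names(clusters)

# Assign each element to the cluster with the highest membership
assigned <- colnames(membership)[max.col(membership, ties.method = "first")]

# Elements whose highest membership is below 0.5 are tagged as rain
is_rain <- apply(membership, 1, max) < 0.5
rain_pts <- elements[is_rain, , drop = FALSE]

# Squared Mahalanobis distance of each rain element to each cluster
d2 <- sapply(clusters, function(cl) mahalanobis(rain_pts, colMeans(cl), cov(cl)))
d2 <- matrix(d2, nrow = nrow(rain_pts), dimnames = list(NULL, names(clusters)))

# Rain elements go to the cluster with the lowest Mahalanobis distance
assigned[is_rain] <- colnames(d2)[apply(d2, 1, which.min)]
assigned
```

In this example the third element lies halfway between the empty partitions and the first target along channel 1. Because the first target cluster is much more dispersed in that direction, the element is fewer standard deviations away from it and is assigned to it, even though its highest membership probability pointed to the empty partitions.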
The cluster results can be corrected manually by the user with the shiny-based function manual_correction().
Finally, the copies per partition of each target are calculated according to a Poisson model (Hindson et al., 2011). Precision is calculated as previously described (Majumdar et al., 2015). Replicates can be combined, in which case the copies per partition are recalculated.
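The standard Poisson estimate of copies per partition can be written in a few lines of base R. The partition counts below are hypothetical, and the sketch shows only the general formula, not the code of target_quant().

```r
# Hypothetical counts for one target in one sample
total_partitions    <- 20000
positive_partitions <- 4000

# Fraction of partitions that do not contain the target
p_negative <- 1 - positive_partitions / total_partitions

# Poisson model: P(0 copies) = exp(-lambda), hence lambda = -log(P(0 copies))
lambda <- -log(p_negative)

# Estimated number of target copies across all analysed partitions
copies_total <- lambda * total_partitions

round(lambda, 4)
```

The estimate is higher than the raw fraction of positive partitions (0.2 here) because the Poisson model accounts for partitions that received more than one copy.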
A complete analysis can be executed by the function dPCP().
    library(dPCP)

    # Find the path of the sample table and the location of reference and input files
    sampleTable <- system.file("extdata", "Template_sampleTable.csv",
                               package = "dPCP")
    fileLoc <- system.file("extdata", package = "dPCP")

    # Launch the dPCP analysis
    results <- dPCP(sampleTable, system = "bio-rad", file.location = fileLoc,
                    eps = 200, minPts = 50, save.template = FALSE, rain = TRUE)
Alternatively, a step-by-step analysis can be carried out following the pipeline described above:
    library(dPCP)

    # Find the path of the sample table and the location of reference and input files
    sampleTable <- system.file("extdata", "Template_sampleTable.csv",
                               package = "dPCP")
    fileLoc <- system.file("extdata", package = "dPCP")

    # Read the sample table file
    sample.table <- read_sampleTable(sampleTable, system = "bio-rad",
                                     file.location = fileLoc)

    # Read the reference files
    ref <- read_reference(sample.table, system = "bio-rad",
                          file.location = fileLoc)

    # Read the sample files
    samp <- read_sample(sample.table, system = "bio-rad",
                        file.location = fileLoc)

    # Reference DBSCAN clustering
    dbref <- reference_dbscan(ref, sample.table, save.template = FALSE)

    # Predict the position of cluster centroids from the reference DBSCAN results
    cent <- centers_data(samp, sample.table, dbref)

    # Fuzzy c-means clustering
    cmclus <- cmeans_clus(cent)

    # Rain classification
    rainclus <- rain_reclus(cmclus)

    # Quantification
    quantcm <- target_quant(cmclus, sample.table)
    quant <- target_quant(rainclus, sample.table)

    # Replicates pooling
    rep.quant <- replicates_quant(quant, sample.table)
dPCP is also available as a web app, accessible through a web browser at dpcp.lns.lu.
Quality controls were developed for the fundamental steps of dPCP analysis. Along with the clustering algorithm, the function dbscan_combination() and the plot() S3 method were implemented to have a graphical view of the results and to check the quality of the following processes:
Choice of DBSCAN input parameters. The identification of empty partitions and single-target clusters in the reference is the first step of dPCP analysis and relies on the DBSCAN algorithm. The performance of DBSCAN depends on the input parameters ε and minPts, which are the only two values the user has to adapt for a dPCP analysis. To simplify the choice of input values, we developed the function dbscan_combination(), which carries out a DBSCAN simulation for each combination of ε and minPts values chosen by the user. The function generates a pdf file for each reference, showing a scatterplot for each combination. The ideal combination of input values is chosen according to the following criteria:
An example of the dbscan_combination() output is shown in Figure 1.
Fig. 1: Examples of the output plots of
dbscan_combination(). Each graph represents the DBSCAN analysis
performed with different combinations of input parameters eps and
minPts. Assembled clusters are represented with colored dots; different
colors indicate distinct clusters whereas grey dots show not-clustered
elements. The combinations (A), (B), (C), and (D) of DBSCAN input
parameters are not suitable for a dPCP analysis because: (A) None of the
single-target clusters is identified.
(B) One of the single-target clusters is not identified.
(C) One of the single-target clusters (purple) shows multiple subclusters.
(D) In one of the single-target clusters (green), the identified cluster is not centered in the cluster centroid. The combinations (E) and (F) identified the empty partitions cluster and all single-target clusters, therefore they are suitable for the analysis.
In order to evaluate the structure of the original dPCP clustering in terms of cluster cohesion and separation, the silhouette coefficient (Rousseeuw, 1987) is calculated for each sample. According to Kaufman and Rousseeuw (1990), the mean value of the silhouette coefficient has to be interpreted as follows:
Fig. 2: Quality control of the prediction of centroid coordinates. (A) The predicted coordinates of the multi-target cluster centroid did not match the real position. The shift of the centroid position can be the consequence of cross-reactive probes or poor assay optimization. (B) The positions of the cluster centroids were correctly predicted.
The results and analysis information can be exported to a csv file with the function export_csv(). The exported file consists of three tables:
A summary report can be generated with the function report_pdf(). The output is a pdf file containing:
When the shiny-based function manual_correction() is used, the results tables and pdf report can be exported directly from the shiny window by clicking the “Export data” button.
Bezdek,J.C. (1981) Pattern Recognition with Fuzzy Objective Function Algorithms. Springer US, Boston, MA.
Ester,M. et al. (1996) A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231.
Hahsler,M. et al. (2019) dbscan: Fast Density-Based Clustering with R. J. Stat. Softw., 91, 1–30.
Hindson,B.J. et al. (2011) High-throughput droplet digital PCR system for absolute quantitation of DNA copy number. Anal. Chem., 83, 8604–8610.
Kaufman,L. and Rousseeuw,P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis.
Lai Chung,F. and Lee,T. (1994) Fuzzy competitive learning. Neural Networks, 7, 539–551.
Mahalanobis,P.C. (1936) On the generalized distance in statistics. Proc. Natl. Inst. Sci. India, 2, 49–55.
Majumdar,N. et al. (2015) Digital PCR modeling for maximal sensitivity, dynamic range and measurement precision. PLoS One, 10, e0118833.
Pal,N.R. et al. (1996) Sequential competitive learning and the fuzzy c-means clustering algorithms. Neural Networks, 9, 787–796.
Rousseeuw,P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53–65.
Whale,A.S. et al. (2016) Fundamentals of multiplexing with digital PCR. Biomol. Detect. Quantif., 10, 15–23.