pcadapt

Keurcien Luu, Michael G.B. Blum

2016-01-05

pcadapt has been developed to detect genetic markers involved in biological adaptation. pcadapt provides statistical tools for outlier detection based on Principal Component Analysis (PCA).

In the following, we show how the pcadapt package can perform genome scans for selection. The package contains two examples: geno3pops (genotype data) and pool3pops (Pool-seq data). The geno3pops data contain simulated genotypes for 1,500 diploid markers. A total of 150 individuals coming from three different populations were genotyped. Simulations were performed with simuPOP using a divergence model assuming that 200 SNPs confer a selective advantage. The pool3pops data contain simulated allele frequencies in 3 populations for 1,500 diploid markers. Allele frequencies have been computed based on the geno3pops dataset.

To run the package, you need to install it and to load it using the following command lines:

install.packages("pcadapt")
library(pcadapt)

Sections A-E explain how to use the package for genotype data. Section F additionally explains how to use the package for Pool-seq data, but the previous sections should be read first. Section G explains how to use the R package with the C software PCAdapt fast, which is well suited to datasets with a large number of SNPs (\(\geq 100,000\)). Section H explains advanced usage of the package.

A. Reading the genotype data

A.1. The read4pcadapt function

The geno3pops example data can be loaded using the following command.

data <- read4pcadapt(x="geno3pops",option="example")
print(dim(data))
## [1]  150 1500

For your own dataset, also use the read4pcadapt function. The current version of read4pcadapt supports two formats. The first format assumes that the genotype is a matrix with individuals in rows and genetic markers in columns (same format as LFMM). Data files should be tab-delimited or space-delimited. An example of genotype data containing 3 diploid individuals and 4 SNPs is given below.

\[ \begin{array}{cccc} 0 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 \\ 2 & 1 & 0 & 1 \end{array} \]

Assuming the data file is called “mydata” and is available at “path_to_directory”, the command line to read the genotype data is

data <- read4pcadapt(x="path_to_directory/mydata")
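
As a minimal sketch of the expected file layout, the 3 x 4 example above can be written to a temporary file and read back with base R's read.table. The file and object names are hypothetical, and the sketch only checks the layout; it does not call read4pcadapt.

```r
# Hypothetical example: 3 diploid individuals (rows) x 4 SNPs (columns)
geno <- matrix(c(0, 0, 1, 2,
                 0, 1, 2, 0,
                 2, 1, 0, 1), nrow = 3, byrow = TRUE)

# Write a space-delimited file without row or column names
path <- tempfile("mydata")
write.table(geno, file = path, row.names = FALSE, col.names = FALSE)

# Read it back with base R to confirm the layout
check <- as.matrix(read.table(path))
print(dim(check))

# A file with markers in rows is simply the transpose of this layout
print(dim(t(check)))
```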

If the genotype is a matrix with genetic markers in rows and individuals in columns, you can also use the read4pcadapt function. The command line to read the genotype data is

data <- read4pcadapt(x="path_to_directory/mydata",transpose=TRUE)

The second supported format is the standard .ped format. To read a .ped file, the argument x should contain the name of the file only without the .ped extension, and the option ped should be specified. If you read a .ped file, make sure the .map file is in the same directory. Assuming the data file is called “mydata.ped” and is available at “path_to_directory”, the command line is

data <- read4pcadapt(x="path_to_directory/mydata",option="ped")

NB: in both cases, standard read.table options header and sep can be specified in the read4pcadapt function. For .ped files, the separator for the .map file can also be specified using the option mapsep.

A.2. How to handle missing values

The read4pcadapt function can account for missing data. By default, the function assumes that missing data are coded with -9, NA or NaN. To specify other coding values, use the option na.strings in the read4pcadapt function. For example, if missing values are coded with NA_code_1 and NA_code_2, use the following command line

data <- read4pcadapt(x="path_to_directory/mydata",na.strings=c("NA_code_1","NA_code_2"))
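
The effect of na.strings can be illustrated with base R's read.table, which has an argument of the same name. This is a sketch under the assumption that read4pcadapt treats the listed codes in the same way; the file content and the missing-value code "-1" are hypothetical.

```r
# Hypothetical file in which missing genotypes are coded as -1
path <- tempfile("mydata")
writeLines(c("0 -1 1 2",
             "0 1 2 0",
             "2 1 -1 1"), path)

# Declare the custom code so that it is converted to NA on reading
geno <- as.matrix(read.table(path, na.strings = c("-1", "NA")))

# Count missing genotypes per SNP (column)
print(colSums(is.na(geno)))
```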

B. Choosing the number K of Principal Components

The pcadapt function performs two successive tasks. First, PCA is performed on the centered and scaled genotype matrix. The second stage consists of computing test statistics and p-values based on the correlations between SNPs and the first K principal components (PCs). To run the function pcadapt, the user should specify the number K of principal components to work with.

To choose K, principal component analysis should first be performed with a large enough number of principal components (e.g. K=20).

x <- pcadapt(data,K=20)

NB: by default, data are assumed to be diploid. To specify the ploidy, use the argument ploidy (ploidy=2 by default) in the pcadapt function. For example, for haploid data:

x <- pcadapt(data,K=20,ploidy=1)

B.1. Scree plot

The 'scree plot' displays the percentage of variance that is explained by each PC. The ideal pattern in a scree plot is a steep curve followed by a bend and an almost horizontal line. The recommended value of K corresponds to the largest value of K before the plateau is reached. In the provided example, K=2 is the optimal choice of K. The plot function displays a scree plot:

plot(x,option="screeplot")


By default, the number of principal components taken into account in the scree plot is set to K but it can be reduced via the argument num_pc.

plot(x,option="screeplot",num_pc=4)
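
The quantity shown in a scree plot can be sketched in base R. The example below simulates a hypothetical genotype matrix (not the geno3pops data), runs PCA on the centered and scaled matrix with prcomp, and computes the proportion of variance explained by each PC. It illustrates the idea only and is not the package's internal code.

```r
set.seed(1)
# Hypothetical data: 60 individuals from 3 populations, 200 SNPs coded 0/1/2
n_per_pop <- 20
L <- 200
freqs <- matrix(runif(3 * L, 0.2, 0.8), nrow = 3)  # one frequency per population and SNP
pop <- rep(1:3, each = n_per_pop)
geno <- matrix(rbinom(length(pop) * L, size = 2, prob = freqs[pop, ]),
               nrow = length(pop))
geno <- geno[, apply(geno, 2, sd) > 0, drop = FALSE]  # drop monomorphic SNPs

# PCA on the centered and scaled genotype matrix
pca <- prcomp(geno, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scree plot for the first 10 PCs
plot(var_explained[1:10], type = "b",
     xlab = "PC", ylab = "Proportion of explained variance")
```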

B.2. Score plot

Another option to choose the number of PCs is based on the 'score plot' that displays population structure. The score plot displays the projections of the individuals onto the specified principal components. Using the score plot, the choice of K can be limited to the values of K that correspond to a relevant level of population structure.

When population labels are known, individuals of the same populations can be displayed with the same color using the pop argument, which should contain the list of indices of the populations of origin. In the geno3pops example, the first population is composed of the first 50 individuals, the second population of the next 50 individuals, and so on. Thus, a vector of indices or characters (population names) that can be provided to the argument pop should look like this:

poplist <- c(rep(1,50),rep(2,50),rep(3,50))
print(poplist)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3

If this field is left empty, the points will be displayed in black. By default, if the values of i and j are not specified, the projection is done onto the first two principal components.

plot(x,option="scores",pop=poplist)


Looking at population structure beyond K=2 confirms the results of the scree plot. The third and the fourth principal components no longer reveal any population structure.

plot(x,option="scores",i=3,j=4,pop=poplist)


NB: for .ped files, the plot function adds colors to the 'score plot' automatically provided that the data have been loaded with read4pcadapt.

C. Computing the test statistic based on PCA

For a given SNP, the test statistic is based on the loadings, which are defined as the correlations between the SNP and the PCs. The test statistic for detecting outlier SNPs is the Mahalanobis distance, a multi-dimensional measure of how distant a point is from the mean. Denoting by \(\rho^j = (\rho_1^j, \dots, \rho_K^j)\) the vector of K correlations between the \(j\)-th SNP and the first K PCs, the Mahalanobis distance is defined as

\[D_j = \sqrt{(\rho^j - \bar{\rho})^T \Sigma^{-1} (\rho^j - \bar{\rho})}\]

where \(\bar{\rho}\) and \(\Sigma\) are robust estimates of the mean and of the covariance matrix. Once divided by a constant \(\lambda\) called the genomic inflation factor, the scaled squared distances \(D_j^2/\lambda\) should have a chi-square distribution with K degrees of freedom under the assumption that there are no outliers.
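
The computation can be sketched in base R on hypothetical correlations. Two simplifications are assumptions of this sketch and not the package's code: it uses the classical mean and covariance instead of robust estimates, and it estimates the genomic inflation factor as a ratio of medians. Note that base R's mahalanobis() returns the squared distance, which is the quantity compared with the chi-square distribution here.

```r
set.seed(2)
K <- 2
L <- 500
# Hypothetical correlations between L SNPs and K PCs
rho <- matrix(rnorm(L * K, sd = 0.1), ncol = K)
# Shift ten hypothetical SNPs on the first PC to mimic outliers
rho[1:10, 1] <- rho[1:10, 1] + 0.6

# Simplification: classical estimates instead of robust ones
center <- colMeans(rho)
sigma <- cov(rho)

# mahalanobis() returns the SQUARED distance
d2 <- mahalanobis(rho, center, sigma)

# Assumed estimator of the genomic inflation factor: ratio of medians
gif <- median(d2) / qchisq(0.5, df = K)
pval <- pchisq(d2 / gif, df = K, lower.tail = FALSE)

# The shifted SNPs are expected to have small p-values
print(summary(pval[1:10]))
```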

For the geno3pops data, it was found in section B that K=2 corresponds to the optimal choice of the number of PCs.

x <- pcadapt(data,K=2)

In addition to the number K of principal components to work with, the user can also set the parameter min.maf that corresponds to a threshold of minor allele frequency. By default, the parameter min.maf is set to 5%. P-values of SNPs with a minor allele frequency smaller than the threshold are not computed (NA is returned).
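
As an illustration of the filter, the minor allele frequency of each SNP can be computed directly from a 0/1/2 genotype matrix. The small matrix below is hypothetical, and the code is a sketch of the idea, not the package's implementation.

```r
# Hypothetical genotype matrix: 4 individuals (rows) x 3 SNPs (columns), coded 0/1/2
geno <- matrix(c(0, 1, 0,
                 0, 2, 0,
                 0, 1, 1,
                 0, 0, 0), nrow = 4, byrow = TRUE)
ploidy <- 2
min.maf <- 0.05

# Allele frequency per SNP, ignoring missing genotypes
freq <- colMeans(geno, na.rm = TRUE) / ploidy
maf <- pmin(freq, 1 - freq)

# SNPs below the threshold would get an NA p-value
keep <- maf >= min.maf
print(maf)
print(keep)
```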

The object x returned by the function pcadapt contains numerical quantities obtained after performing a PCA on the genotype matrix.

summary(x)
##                 Length Class  Mode   
## stat            1500   -none- numeric
## pvalues         1500   -none- numeric
## maf             1500   -none- numeric
## gif                1   -none- numeric
## chi2_stat       1500   -none- numeric
## scores           300   -none- numeric
## loadings        3000   -none- numeric
## singular_values    2   -none- numeric

We assume in the following that there are n individuals and L markers.

All of these elements are accessible using the $ symbol. For example, the p-values are contained in x$pvalues.

D. Graphical tools

D.1. Manhattan Plot

A Manhattan plot displays \(-\text{log}_{10}\) of the p-values.

plot(x,option="manhattan")


D.2. Q-Q Plot

The user can also check the distribution of the p-values using a Q-Q plot:

plot(x,option="qqplot",threshold=0.1)


The top 10% lowest p-values lie to the right of the grey dotted line. This plot confirms that most of the p-values follow the expected uniform distribution. However, the smallest p-values are smaller than expected, confirming the presence of outliers.
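
A Q-Q plot of this kind can be reproduced in base R by comparing observed and expected \(-\text{log}_{10}\) p-values. The p-values below are hypothetical (a uniform bulk plus an excess of small values), and the sketch is independent of the package's plot method.

```r
set.seed(3)
# Hypothetical p-values: 950 uniform (neutral) and 50 very small (outliers)
pval <- c(runif(950), runif(50, max = 1e-4))

obs <- sort(-log10(pval))
# Expected -log10 quantiles under the uniform distribution
expected <- sort(-log10(ppoints(length(pval))))

plot(expected, obs, xlab = "Expected -log10(p-values)",
     ylab = "Observed -log10(p-values)")
abline(0, 1, col = "red")

# Points to the right of this line are the top 10% lowest p-values
threshold <- 0.1
abline(v = -log10(threshold), lty = 3, col = "grey")
```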

D.3. Histograms of the test statistic and of the p-values

A histogram of p-values confirms that most of the p-values follow the uniform distribution, and that the excess of small p-values indicates the presence of outliers.

hist(x$pvalues,xlab="p-values",main=NULL)


The presence of outliers is also visible when plotting a histogram of the test statistic \(D_j\).

plot(x,option="stat.distribution")


E. Choosing a cutoff for outlier detection

To provide a list of outliers, we recommend using the R package qvalue, transforming the p-values into q-values. To install and load the package, type the following command lines:

## try http if https is not available
source("https://bioconductor.org/biocLite.R")
biocLite("qvalue")
library(qvalue)

For a given \(\alpha\) (a real-valued number between \(0\) and \(1\)), SNPs with q-values less than \(\alpha\) will be considered as outliers with an expected false discovery rate bounded by \(\alpha\). The false discovery rate is defined as the percentage of false discoveries among the list of candidate SNPs. Here is an example of how to provide a list of candidate SNPs for the geno3pops data, for an expected false discovery rate lower than 10%:

qval <- qvalue(x$pvalues)$qvalues
alpha <- 0.1
outliers <- which(qval<alpha)
print(outliers)
##   [1]    1    2    3    4    5    6    7    8   11   12   13   15   16   17
##  [15]   18   21   22   23   24   25   26   27   28   29   30   31   32   33
##  [29]   34   36   37   39   40   41   42   43   44   45   46   47   48   51
##  [43]   52   53   54   55   56   57   58   59   60   61   62   63   64   65
##  [57]   66   67   69   70   71   72   73   74   75   76   77   78   79   80
##  [71]   81   82   83   84   85   86   87   88   89   91   92   93   94   95
##  [85]   97   98   99  100  101  102  103  104  105  106  107  110  111  112
##  [99]  113  114  115  116  117  118  119  120  121  122  123  124  125  126
## [113]  127  128  129  130  131  132  133  134  135  137  138  139  141  142
## [127]  143  144  145  146  147  148  149  150  414  525  625  663  932 1406
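
If the qvalue package is not available, Benjamini-Hochberg adjusted p-values from base R's p.adjust can serve as a stand-in. This is a different procedure from the one implemented in qvalue and is generally somewhat more conservative. The p-values below are hypothetical; the trailing NA mimics a SNP whose p-value was not computed.

```r
set.seed(4)
# Hypothetical p-values: 900 neutral SNPs, 100 outliers and one NA
pval <- c(runif(900), rbeta(100, 0.05, 10), NA)

# Benjamini-Hochberg adjustment (not Storey's q-value); NAs stay NA
padj <- p.adjust(pval, method = "BH")

alpha <- 0.1
outliers <- which(padj < alpha)
print(length(outliers))
```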

F. Pool-seq Data

The package also handles Pool-seq data. A Pool-seq example is provided in the package and can be loaded as follows:

pooldata <- read4pcadapt("pool3pops",option="example",transpose=FALSE)

For your own dataset, assuming the data file is called “mydata” and is available at “path_to_directory”, the command line to read the (n,L) frequency matrix data (where n is the number of populations and L is the number of genetic markers) is

pooldata <- read4pcadapt("path_to_directory/mydata",transpose=FALSE)

If the input data is a (L,n) matrix, set transpose to TRUE in the read4pcadapt function.

As for genotype data, the pcadapt function performs two successive tasks. First, PCA is performed on the centered (but not scaled) matrix of allele frequencies. The second stage consists of computing test statistics (Mahalanobis distances by default) and p-values based on the covariances between allele frequencies and the first K PCs. By default, the function assumes that the number K of PCs is equal to the number of populations minus one (K=n-1). The user can use a smaller number of PCs (K < n-1) by determining the optimal number of PCs using the scree plot.

When calling the pcadapt function, make sure to specify data.type="pool".

xpool <- pcadapt(pooldata,data.type="pool")
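
The two stages can be sketched in base R on a hypothetical frequency matrix. The sketch shows why at most n-1 PCs are informative once the matrix is centered, and how the covariances between allele frequencies and the PCs can be obtained. It is an illustration under these assumptions, not the package's internal code.

```r
set.seed(5)
# Hypothetical allele frequencies: n = 3 populations (rows) x L = 100 SNPs (columns)
n <- 3
L <- 100
freq <- matrix(runif(n * L), nrow = n)

# Center each SNP but do not scale it
pca <- prcomp(freq, center = TRUE, scale. = FALSE)

# Centering removes one dimension, so at most n - 1 PCs carry variance
K <- n - 1
scores <- pca$x[, 1:K, drop = FALSE]
print(round(pca$sdev, 6))

# Covariances between allele frequencies and the first K PCs: an L x K matrix
covs <- cov(freq, scores)
print(dim(covs))
```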

Plotting options mentioned in section D such as "manhattan", "qqplot", "scores", "screeplot", or "stat.distribution" are still valid for Pool-seq data.

G. Reading outputs from PCAdapt fast

For large datasets (more than 100,000 SNPs), computing PCA with the R package can be slow. In this case, the PCAdapt C software is faster than the R package. PCA should be performed with the C software and outputs of PCA should be processed with the R package.

Installation

The C software is currently available for UNIX OS or MAC OS and can be downloaded here. To install the software, proceed as follows:

cd PCAdaptPackage
make lapack
make

The software can handle missing values. It assumes missing values are encoded using 9.

G.1 Running PCAdapt fast for genotype data

The C software assumes that individuals are in columns and genotypes in rows. If you have data files in .vcf or .ped format, you can convert the data file using the following command lines:

./vcf2pcadapt inputfile.vcf outputgenotype
./ped2pcadapt inputfile.ped outputgenotype

To run PCAdapt for diploid data, use the following command line:

./PCAdapt fast -i outputgenotype -o myoutput -K number_of_PCs -h 0 -m 0.05

where outputgenotype is the genotype matrix, myoutput is the name of the output file, number_of_PCs is the number of principal components that will be computed, and 0.05 is the minor allele frequency threshold below which SNPs are discarded. Choose a value of K that is large enough (K>20) in order to be able to choose K with the scree plot.

To run PCAdapt for haploid data, use the following command line:

./PCAdapt fast -i outputgenotype -o myoutput -K number_of_PCs -h 1 -m 0.05

For more details about the software, check out the manual.

Read outputs with the R package pcadapt

To read outputs from the PCAdapt software in R, set data.type="PCAdapt" in the pcadapt function. The argument data should be the name of the output provided by PCAdapt without any extension. Assuming the file returned by the PCAdapt software is called “myoutput” and is available at “path_to_directory”, the command line should be (ploidy=2 and min.maf=0.05 by default in the pcadapt function):

x <- pcadapt(data="path_to_directory/myoutput",data.type="PCAdapt")

Use the plot function to display the scree plot and choose the optimal value for K:

plot(x,option="screeplot")

Once the optimal value of K is found, you can compute Mahalanobis distances and p-values by calling the pcadapt function again (ploidy=2 and min.maf=0.05 by default).

x <- pcadapt(data="path_to_directory/myoutput",K=K,data.type="PCAdapt")

G.2 Running PCAdapt fast for Pool-seq data

The C software supports data files that contain allele frequencies. It assumes that populations are in rows and SNPs in columns. To run PCAdapt for Pool-seq data, use the following command line:

./PCAdapt fast -i outputgenotype -o myoutput -K number_of_PCs -S 0

where number_of_PCs should be equal to the number of populations minus one. For example, if there are 5 populations in the dataset, the command line should be

./PCAdapt fast -i outputgenotype -o myoutput -K 4 -S 0

where -S 0 indicates that the data should not be scaled, which is important for Pool-seq data.

Read outputs with the R package pcadapt

To read outputs in R from the software PCAdapt, set data.type="PCAdapt" in the pcadapt function and follow the same routines as described in section G.1.

H. Advanced Usage

The default option uses the Mahalanobis distance as a test statistic to search for outlier SNPs, but other statistics can be computed based on the loadings. However, except for power users, we recommend using the Mahalanobis distance, which provides the best rankings of SNPs in most cases we have investigated.

Communality

The communality statistic measures the proportion of variance explained by the first K PCs. When there are K+1 populations, the communality statistic provides a ranking similar to the widely used Fst statistic. P-values are computed using a chi-square approximation (Duforet-Frebourg et al. 2015). The test based on the communality statistic can be performed by setting method to communality in the pcadapt function.

x_com <- pcadapt(data,K=2,method="communality")

The communality can be approximated by a constant times a chi-square distribution, which makes it possible to compute p-values for each SNP.

plot(x_com,option="stat.distribution")
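
As a rough sketch, the communality of a SNP can be computed as the sum of its squared correlations with the first K PCs, which is the usual factor-analysis definition and is assumed here to match the statistic described above. The genotype matrix is hypothetical and the chi-square approximation for the p-values is not reproduced.

```r
set.seed(6)
# Hypothetical genotypes: 50 individuals x 30 SNPs coded 0/1/2
geno <- matrix(rbinom(50 * 30, size = 2, prob = 0.4), nrow = 50)
geno <- geno[, apply(geno, 2, sd) > 0, drop = FALSE]  # drop monomorphic SNPs
pca <- prcomp(geno, center = TRUE, scale. = TRUE)

K <- 2
# Correlations between each SNP and the first K PCs
rho <- cor(geno, pca$x[, 1:K])

# Communality: share of each SNP's variance explained by the first K PCs
communality <- rowSums(rho^2)
print(summary(communality))
```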


Euclidean

Instead of using the Mahalanobis distance, we can simply use Euclidean distance as a test statistic.

x_eucl <- pcadapt(data,K=2,method="euclidean")
plot(x_eucl,option="stat.distribution")


Component-wise

Another possibility (component-wise p-values) is to perform one genome scan for each principal component. The test statistics are the loadings, which correspond to the correlations between each PC and each SNP. P-values are computed by making a Gaussian approximation for each PC and by estimating the standard deviation of the null distribution.

x_cw <- pcadapt(data,K=2,method="componentwise")
summary(x_cw$pvalues)
##        p1                p2        
##  Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.09287   1st Qu.:0.1327  
##  Median :0.33333   Median :0.3842  
##  Mean   :0.38363   Mean   :0.4166  
##  3rd Qu.:0.64068   3rd Qu.:0.6829  
##  Max.   :0.99932   Max.   :0.9997

pcadapt returns K vectors of p-values (one for each principal component), all of which are accessible using the $ symbol or the [] symbol. For example, typing x_cw$pvalues$p2 or x_cw$pvalues[,2] in the R console returns the p-values associated with the second principal component (provided that K is larger than or equal to 2).

The p-values are computed based on the matrix of loadings. The loadings of the neutral markers are assumed to follow a centered Gaussian distribution. The standard deviation of the Gaussian distribution is estimated using the Median Absolute Deviation.
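
The Gaussian approximation described above can be sketched in base R for a single PC. The loadings are hypothetical, and the use of two-sided p-values is an assumption of this sketch rather than a statement about the package's code.

```r
set.seed(7)
L <- 1000
# Hypothetical loadings on one PC: neutral SNPs plus 20 outliers
loadings <- c(rnorm(L - 20, sd = 0.05), rnorm(20, mean = 0.4, sd = 0.05))

# mad() rescales by 1.4826 by default, so it estimates a Gaussian standard deviation;
# center = 0 reflects the centered null distribution
sigma_hat <- mad(loadings, center = 0)

# Two-sided p-values under a centered Gaussian null (assumption of this sketch)
z <- loadings / sigma_hat
pval <- 2 * pnorm(-abs(z))
print(summary(pval[(L - 19):L]))
```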

To display the neutral distribution for the component-wise case, the value of K has to be specified.

plot(x_cw,option="stat.distribution",K=2)


Changelog pcadapt 2.1.1 (01-05-2016)

Changelog pcadapt 2.1 (12-18-2015)

Changelog pcadapt 2.0.1

Changelog pcadapt 2.0

Reference

Duforet-Frebourg, N., Luu, K., Laval, G., Bazin, E., & Blum, M. G. (2015). Detecting genomic signatures of natural selection with principal component analysis: application to the 1000 Genomes data. arXiv preprint arXiv:1504.04543.