Computing a PGS using RápidoPGS-single and GWAS catalog

Guillermo Reales

2023-10-13

Introduction

RápidoPGS is a tool to quickly compute polygenic scores (PGS) from GWAS summary statistics datasets for either case-control (e.g. asthma) or quantitative (e.g. height and BMI) traits.

Input GWAS summary statistics datasets should have a minimum set of columns, with these names, in any order: SNPID, CHR, BP, REF, ALT, ALT_FREQ, BETA, SE, and P.

Also, if doing a PGS on a quantitative trait GWAS dataset, your file must include the sample size, supplied either as the total number of individuals or as a per-SNP sample size column (see the N parameter below).
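As a minimal, hedged sketch (not part of the package), you could check that a loaded summary statistics data.table, here a hypothetical object called sumstats, has the expected columns before running RápidoPGS:

required_cols <- c("SNPID", "CHR", "BP", "REF", "ALT", "ALT_FREQ", "BETA", "SE", "P")
missing_cols <- setdiff(required_cols, names(sumstats))  # sumstats: your loaded data.table
if (length(missing_cols) > 0) stop("Missing columns: ", paste(missing_cols, collapse = ", "))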

Installation of RápidoPGS

The current RápidoPGS version (v2.1.0) is available on CRAN, so we can install it using the following code:

install.packages("RapidoPGS")

Alternatively, you can install the development version from GitHub. Note: if you don’t have remotes installed, please install it first:

if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_github("GRealesM/RapidoPGS")

Setup

Once installed, let’s load it. This will automatically load all required dependencies.

library(RapidoPGS)

Downloading data from GWAS catalog

The GWAS catalog is an outstanding source of GWAS summary statistics: not only does it gather a large number of public datasets, it also uses a semi-automated pipeline to apply quality control and liftover (a.k.a. harmonise) them, helping to overcome the problem of GWAS sumstats coming in many different formats and builds.

In this vignette, we’ll use the GWAS catalog to obtain an example dataset: a breast cancer (BRCA) dataset from Michailidou et al., 2017, which is one that we used in our manuscript. This GWAS was performed on 119,078 controls and 137,045 breast cancer cases.

It’s a big file and may take a while to download, so here we will show the download command but actually cheat and load a subset of the data that we have prepared in advance.

Note that in this particular case we chose a study that contains a single dataset. In other cases there may be more than one; gwascat.download will then show the list of available options with their accession numbers and, if running interactively, ask you to choose a file by its number in the list.

We use its PubMed ID (29059683). When running gwascat.download without any other arguments, it will try to get the harmonised files associated with the ID. Harmonised files have been processed by the GWAS catalog and are formatted and aligned to the latest genome build. See the GWAS catalog documentation for more information.

Once we download the file, we’ll need to prepare it for RápidoPGS. That involves renaming columns to names RápidoPGS can understand, which is easy to do with data.table. Also, RápidoPGS uses only autosomes, so we remove the X and Y chromosomes at this step.

ds <- gwascat.download(29059683)

# Select the harmonised hg38 file 
# This is equivalent to:
# ds <- fread("ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST004001-GCST005000/GCST004988/harmonised/29059683-GCST004988-EFO_0000305.h.tsv.gz")

# Then apply some reformatting
# Then apply some reformatting
setnames(ds,
         old = c("hm_rsid", "hm_chrom", "hm_pos", "hm_other_allele", "hm_effect_allele",
                 "hm_effect_allele_frequency", "hm_beta", "standard_error", "p_value"),
         new = c("SNPID", "CHR", "BP", "REF", "ALT", "ALT_FREQ", "BETA", "SE", "P"))
ds <- ds[, .(SNPID, CHR, BP, REF, ALT, ALT_FREQ, BETA, SE, P)]  # keep only the columns we need
ds <- ds[CHR != "X"]                                # autosomes only
ds$CHR <- as.numeric(ds$CHR)
ds <- ds[order(CHR, BP)]                            # sort by chromosome and position
ds <- na.omit(ds, cols = c("BETA", "ALT_FREQ"))     # drop SNPs missing effect size or allele frequency

For illustrative purposes, I took a random subset of 100,000 SNPs from this file, which we’ll use in this tutorial. Bear in mind that the final PGS won’t be a valid model, since we randomly removed most SNPs. It will serve our teaching purposes, though! ;) We can load this file straight away.

ds <- michailidou38

Applying quality control to the dataset

The first thing to do is to check what our file looks like. RápidoPGS can handle NAs in crucial columns, so don’t worry too much about that (unless a crucial column is all NA, of course!).

A note of caution when dealing with GWAS catalog files, though: since they use a semi-automated pipeline, it is possible that some columns, even if present, are empty, so be careful with that!
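As a minimal sketch (not part of the package), you could check for entirely empty columns like this:

empty_cols <- names(ds)[sapply(ds, function(x) all(is.na(x)))]  # columns that are all NA
if (length(empty_cols) > 0) message("Empty columns: ", paste(empty_cols, collapse = ", "))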

Also, RápidoPGS requires allele frequencies, so you may need to compute them from an external reference panel (e.g. 1000 Genomes Phase III). We don’t cover that step in this vignette, but we might write instructions explaining how to do it in the future.

Lastly, we applied a number of QC steps in our paper, which we won’t apply here but encourage you to try when using real data. The details of this procedure are described in the paper. You can also take a look at the accompanying code.

Let’s now look at the file.

summary(ds)
##     SNPID                CHR               BP                REF           
##  Length:100000      Min.   : 1.000   Min.   :    19484   Length:100000     
##  Class :character   1st Qu.: 4.000   1st Qu.: 32314220   Class :character  
##  Mode  :character   Median : 7.000   Median : 69249390   Mode  :character  
##                     Mean   : 8.612   Mean   : 78711220                     
##                     3rd Qu.:13.000   3rd Qu.:114464115                     
##                     Max.   :22.000   Max.   :248922216                     
##      ALT               ALT_FREQ           BETA                  SE         
##  Length:100000      Min.   :0.0051   Min.   :-2.1291000   Min.   :0.00620  
##  Class :character   1st Qu.:0.0205   1st Qu.:-0.0100000   1st Qu.:0.00740  
##  Mode  :character   Median :0.1037   Median :-0.0003000   Median :0.01140  
##                     Mean   :0.2245   Mean   :-0.0009533   Mean   :0.01949  
##                     3rd Qu.:0.3571   3rd Qu.: 0.0094000   3rd Qu.:0.02540  
##                     Max.   :0.9949   Max.   : 0.6442000   Max.   :0.83960  
##        P                N         
##  Min.   :0.0000   Min.   :256123  
##  1st Qu.:0.1922   1st Qu.:256123  
##  Median :0.4470   Median :256123  
##  Mean   :0.4597   Mean   :256123  
##  3rd Qu.:0.7174   3rd Qu.:256123  
##  Max.   :0.9991   Max.   :256123

In this case, we don’t have any missing data, which is fantastic.

Computing PGS using RápidoPGS-single

Let’s now compute our PGS! The build of this example file is hg38, so we must tell RápidoPGS about that (default is hg19).

In this new version, we don’t need the sample size for case-control datasets. Note that if this were a quantitative trait dataset, we would need to supply the total number of individuals (the N parameter). Also, if our dataset had a column reporting the number of individuals for each SNP, we could set N to a string naming that column (eg. N = “sample_size”), and RápidoPGS-single would extract this information directly from it.
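For illustration, a hedged sketch of what such a call might look like for a hypothetical quantitative trait dataset ds_qt (not run here):

# ds_qt is a hypothetical quantitative trait dataset with the same columns as above
# qt_PGS <- rapidopgs_single(ds_qt, trait = "quant", build = "hg38", N = 500000)
# or, with a per-SNP sample size column:
# qt_PGS <- rapidopgs_single(ds_qt, trait = "quant", build = "hg38", N = "sample_size")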

Let’s get our hands dirty! Let’s compute first a full RápidoPGS-single model.

Advanced use note: You may want to filter and align your dataset against an external reference. You can do that with RápidoPGS, too, by supplying the path to your reference file via the reference argument. This reference file should have five columns (CHR, BP, REF, ALT, and SNPID) and be in the same build as our summary statistics dataset. RápidoPGS currently supports the hg19 and hg38 builds.
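A hedged sketch of such a call, where "reference_hg38.tsv" is a hypothetical reference file in the format just described (not run here):

# "reference_hg38.tsv": hypothetical file with columns CHR, BP, REF, ALT, SNPID, in hg38
# full_PGS_ref <- rapidopgs_single(ds, trait = "cc", build = "hg38", reference = "reference_hg38.tsv")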

full_PGS <- rapidopgs_single(ds, trait = "cc", build = "hg38")
## Assigning LD blocks...
## Done!
## Computing a RapidoPGS-single model for a case-control dataset...

Well, that was rápido! With increasingly big datasets, it will take a bit longer, of course.

Let’s take a look.

head(full_PGS)
##         SNPID CHR        BP REF ALT ALT_FREQ    BETA     SE      P      N
## 1: rs76197520   2  49673981   C   T   0.0603  0.0149 0.0131 0.2540 256123
## 2: rs12232959   2 149575863   C   G   0.7818 -0.0035 0.0076 0.6450 256123
## 3:  rs7621633   3   7330932   C   T   0.7460 -0.0045 0.0089 0.6151 256123
## 4: rs11129616   3  34773314   A   G   0.6731 -0.0060 0.0069 0.3895 256123
## 5: rs62282714   3 150895438   T   C   0.1539 -0.0052 0.0113 0.6439 256123
## 6: rs62375579   5 109299301   G   A   0.1909  0.0080 0.0102 0.4347 256123
##    ld.block          ppi        weight
## 1:      135 1.244593e-05  1.854443e-07
## 2:      178 4.221391e-06 -1.477487e-08
## 3:      221 5.050474e-06 -2.272713e-08
## 4:      235 5.029883e-06 -3.017930e-08
## 5:      287 6.268896e-06 -3.259826e-08
## 6:      457 6.922043e-06  5.537634e-08

We see three new columns: ld.block corresponds to the ldetect LD block number assigned to the SNP (see the manuscript for more details of where this comes from), ppi is the posterior probability that the SNP is causal, and weight is the weighted effect size (BETA * ppi), which is the column we’re interested in.
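As a quick sanity check (a sketch, not part of the original vignette output), you can verify that the weights are indeed the product of BETA and ppi:

all.equal(full_PGS$weight, full_PGS$BETA * full_PGS$ppi)  # should be TRUE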

Applying different thresholds

Imagine that we want a small, portable PGS. We could apply a threshold (e.g. keep only SNPs with ppi larger than \(10^{-4}\), a reasonable choice) or keep the top SNPs with the largest weights (in absolute value).

We can do that using RápidoPGS; let’s see how, using a \(10^{-4}\) threshold.

For accuracy reasons, we recommend recomputing the PGS on just these SNPs after the threshold is applied, so recalc = TRUE by default.

PGS_1e4 <- rapidopgs_single(ds, trait ="cc", build = "hg38", filt_threshold = 1e-4)
## Assigning LD blocks...
## Done!
## Computing a RapidoPGS-single model for a case-control dataset...
head(PGS_1e4)
##          SNPID CHR       BP REF ALT ALT_FREQ   BETA     SE       P      N
## 1: rs143863360  15 75910714   G   T   0.0222 0.0611 0.0265 0.02123 256123
##    ld.block          ppi       weight
## 1:     1112 0.0001789831 1.093587e-05

You can omit recalculation by setting that argument to FALSE:

PGS_1e4_norecalc <- rapidopgs_single(ds, trait ="cc", build = "hg38", filt_threshold = 1e-4, recalc = FALSE)
## Assigning LD blocks...
## Done!
## Computing a RapidoPGS-single model for a case-control dataset...
head(PGS_1e4_norecalc)
##          SNPID CHR       BP REF ALT ALT_FREQ   BETA     SE       P      N
## 1: rs143863360  15 75910714   G   T   0.0222 0.0611 0.0265 0.02123 256123
##    ld.block          ppi       weight
## 1:     1112 0.0001789831 1.093587e-05

And what if we want the top 10 SNPs? Just change the filt_threshold argument. If it’s larger than 1, RápidoPGS will understand that you want a top list, rather than a thresholded result.

PGS_top10 <- rapidopgs_single(ds, trait ="cc", build = "hg38", filt_threshold = 10)
## Assigning LD blocks...
## Done!
## Computing a RapidoPGS-single model for a case-control dataset...
head(PGS_top10)
##          SNPID CHR       BP REF ALT ALT_FREQ    BETA     SE       P      N
## 1: rs143863360  15 75910714   G   T   0.0222  0.0611 0.0265 0.02123 256123
## 2:  rs55817933  19 36970383   T   C   0.3584 -0.0154 0.0065 0.01691 256123
## 3: rs149557496   9 35906658   G   A   0.0152  0.0297 0.0313 0.34330 256123
## 4:  rs72867697  17 76595643   T   A   0.1731 -0.0165 0.0095 0.08168 256123
## 5:  rs76197520   2 49673981   C   T   0.0603  0.0149 0.0131 0.25400 256123
## 6:  rs61932677  12 98186323   G   A   0.1930  0.0096 0.0093 0.30170 256123
##    ld.block          ppi        weight
## 1:     1112 1.789831e-04  1.093587e-05
## 2:     1257 5.361020e-05 -8.255971e-07
## 3:      757 2.399329e-05  7.126007e-07
## 4:     1198 2.136836e-05 -3.525779e-07
## 5:      135 1.244593e-05  1.854443e-07
## 6:      978 7.904310e-06  7.588138e-08

Conclusion

So those are examples of basic RápidoPGS-single usage!

Drop us a line if you encounter problems, and we’ll be happy to help.

Good luck!

Guillermo