This package is designed to cover the whole processing chain needed to use SamBada, from pre- to post-processing. In this documentation, we work through all these steps with the example data shipped with the package: a subset of ten SNPs from 800 Ugandan cattle, including the sample locations (see SamBada’s documentation below for more information).
If you want more documentation, you can read the documentation of the package or SamBada’s documentation. Also read the article …
NB: As a general rule, avoid spaces in the names of input files (and in the paths leading to them), as they can make some commands crash.
NB2: By default, all examples write to a temporary folder (use tempdir() to see where the files are actually saved). You can remove the tempdir() statements, and the files will then be saved in your current directory.
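A quick way to spot problematic paths before running the package functions is sketched below. The helper hasSpace and the example paths are hypothetical and are not part of R.SamBada; only base R is used.

```r
# Hypothetical helper (not part of R.SamBada): TRUE for paths containing a space
hasSpace <- function(paths) grepl(" ", paths, fixed = TRUE)

# Example paths, invented for illustration
paths <- c("data/uganda-subset-mol.ped", "my data/uganda subset.ped")
hasSpace(paths)

# For a real file, normalizePath() expands it to the full path before checking:
# hasSpace(normalizePath("uganda-subset-mol.ped"))
```

Checking the full path matters because a space in any parent folder can cause the same problem as a space in the file name.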
To run SamBada, you need to download its binaries. This can be done with downloadSambada, which downloads SamBada from GitHub and unpacks it into the directory of your choice. If you are already a SamBada user, you do not have to download it again. Note that if you plan to use SamBada often, we recommend adding its binaries folder to your PATH environment variable (the procedure is OS-dependent; search online for instructions specific to your system). Otherwise, you will have to specify this path every time you start a new R session.
#Load help
?downloadSambada
#Downloads SamBada into the temporary directory
downloadSambada(tempdir()) #remove tempdir() to download it into the current directory
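If you do not want to edit the PATH variable system-wide, one option is to extend it for the current R session only. The sketch below uses base R; the folder stored in sambadaBin is a hypothetical location, so replace it with the folder that actually contains the binaries on your machine.

```r
# Hypothetical location of the unpacked binaries; adapt it to your own setup
sambadaBin <- file.path(tempdir(), "sambada", "binaries")

# Append the folder to PATH for the current R session only
oldPath <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(oldPath, sambadaBin, sep = .Platform$path.sep))

# Check that the folder is now on the search path
pathEntries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
sambadaBin %in% pathEntries
```

This change is lost when the R session ends, so it has to be repeated in each new session unless PATH is set at the operating-system level.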
Most of the functions described here have an interactiveChecks mode. When you run a function for the first time on your dataset, we strongly advise that you set it to TRUE. This prints plots that allow you to detect anomalies.
Once you have your genomic matrix, the first step is to convert it into a format that SamBada accepts. You can use prepareGeno for this, which can process plink ped, plink bed, vcf or gds input files. At the same time, you can filter out SNPs based on Minor Allele Frequency (MAF), Missingness, Linkage Disequilibrium (LD) and Major Genotype Frequency (MGF). To work with your dataset, a GDS file (from the SNPRelate package) is first created. If saveGDS is set to TRUE, the file is saved in the active directory. The GDS file is used by other functions, so we recommend that you keep it.
#Loads data
#================
#These files are distributed within your package. The system.file will return the full path to them. With your data, you can just use the name of the file, provided the file is in the active directory
genoFile=system.file("extdata", "uganda-subset-mol.ped", package = "R.SamBada")
genoFile #Check the path to the file
#Load help
#================
?prepareGeno
#Run prepareGeno
#================
#Warning: the histograms shown due to maf and missingness filtering are really difficult to understand given the small number of SNPs
prepareGeno(fileName=genoFile,outputFile=file.path(tempdir(),'uganda-subset-mol.csv'),saveGDS=TRUE,mafThresh=0.05, missingnessThresh=0.1,interactiveChecks=TRUE)
#remove tempdir(), to save the file to your current directory
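To make the filtering criteria more concrete, the sketch below computes missingness, MAF and MGF on a tiny invented genotype matrix with base R. It is an illustration only and not the implementation used by prepareGeno; the 0/1/2 coding, the toy values and the direction of the comparisons (>= and <=) are assumptions.

```r
# Toy genotypes: 6 individuals x 3 SNPs, coded as allele counts 0/1/2 (NA = missing)
geno <- matrix(c(0, 1, 2, 0, NA, 1,
                 0, 0, 0, 0, 0, 1,
                 2, 2, NA, NA, 2, 1),
               nrow = 6,
               dimnames = list(paste0("ind", 1:6), c("snpA", "snpB", "snpC")))

# Missingness: proportion of missing genotypes per SNP
missingness <- colMeans(is.na(geno))

# Minor allele frequency: frequency of the rarer of the two alleles
alleleFreq <- colMeans(geno, na.rm = TRUE) / 2
maf <- pmin(alleleFreq, 1 - alleleFreq)

# Major genotype frequency: share of the most common genotype (missing excluded)
mgf <- apply(geno, 2, function(g) max(table(g)) / sum(!is.na(g)))

# Keep SNPs passing thresholds analogous to mafThresh=0.05 and missingnessThresh=0.1
keep <- maf >= 0.05 & missingness <= 0.1
data.frame(missingness, maf, mgf, keep)
```

In this toy example only snpB is kept, because the two other SNPs have more than 10% missing genotypes.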
If the sample locations are unknown, you can use setLocation to help you define the sample coordinates. The procedure is self-explanatory: it opens a local web page in which you upload a file with a list of IDs. Then specify the name of the column containing the IDs (and longitude and latitude if present). Once the file is processed, select one or several samples and click “Select coordinate” at the end of the list. Then select a point on the map (zoom to a suitable level first): you should see the coordinates being updated in the list. When finished, click “Export Coordinates” to export the new csv file. In the data presented in this vignette, the samples are already georeferenced. However, if you want to try this function:
#Load help
#================
?setLocation
#Run setLocation
#================
setLocation()
#Once the browser opens, you can load the file uganda-subset-id.csv located in extdata folder of the package
Then, from the point locations, you need to create your environmental dataset. Use createEnv for this task.
You can use rasters of your study site that you already have, or let the function automatically download rasters of your study site from global databases. For example, tmin10 represents the minimum temperature in October. Similarly, tmax, tavg and prec refer to maximum temperature, average temperature and precipitation. The bio1-bio19 variables are bioclim variables computed from these indices and are described in the WorldClim documentation. Temperatures are given in tenths of a degree Celsius (°C × 10) and precipitation in mm. The function always downloads the best resolution available (30 arc-seconds for the WorldClim dataset and 90 m for SRTM).
The tiles are downloaded in the folders wc0.5 and srtm of the active directory if saveDownload is set to TRUE (or to a temporary folder otherwise). The downloaded tiles can get really heavy! Once your final file is exported, you could delete the tiles (but you won’t be able to use them as background in the post-processing mapping function). If the tiles are already present in the active directory, they won’t be downloaded again.
This function requires that you define the EPSG code of your projection system. All systems are listed in the EPSG registry. If you work with global latitude/longitude coordinates, you most probably use WGS 84, whose EPSG code is 4326.
#Loads data
#================
#These files are distributed within your package. The system.file will return the full path to them. With your data, you can just use the name of the file, provided the file is in the active directory
locationFile=system.file("extdata", "uganda-subset.csv", package = "R.SamBada")
locationFile #Check the path to the file
#Load help
#================
?createEnv
#Create environmental dataset
#================
#downloads the raster tiles from global databases and create the environmental file
#Warning: the download and processing of raster is both heavy in space and time-consuming
#If you want to skip this step, continue directly to the next function
createEnv(locationFileName=locationFile, outputFile='uganda-subset-env.csv', x='longitude',y='latitude',locationProj=4326,separator=';', worldclim=TRUE, saveDownload=TRUE, rasterName=NULL,rasterProj=NULL, interactiveChecks=TRUE)
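If you prefer to read temperatures in degrees Celsius, the columns can be rescaled after export. The sketch below assumes that the temperature columns are stored as °C multiplied by 10 and that their names start with tmin, tmax or tavg; the data frame is a toy stand-in for the exported file, and the column prec10 and all values are invented for illustration.

```r
# Toy stand-in for a few columns of the exported environmental file
env <- data.frame(short_name = c("s1", "s2", "s3"),
                  tmin10 = c(152, 148, 163),   # assumed unit: degrees C x 10
                  prec10 = c(110, 95, 130))    # mm, left unchanged

# Columns assumed to hold temperatures (names starting with tmin, tmax or tavg)
tempCols <- grep("^(tmin|tmax|tavg)", names(env), value = TRUE)

# Convert from tenths of a degree to degrees Celsius
env[tempCols] <- lapply(env[tempCols], function(v) v / 10)
env
```

Rescaling does not change correlations or model significance, so it is only a matter of readability.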
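You can now use the prepareEnv function. This function has 3 goals, including removing environmental variables that are too strongly correlated (the maximum tolerated correlation is set with maxCorr) and computing the population structure described below. The function creates a new file with the name specified in outputFile.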
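The idea behind a correlation threshold such as maxCorr can be pictured with a simple greedy filter: go through the variables in order and keep one only if its absolute correlation with every variable already kept stays below the threshold. This is a minimal sketch on invented data, and the selection rule actually applied by prepareEnv may differ.

```r
# Toy environmental table (values invented); bio2 is nearly proportional to bio1
env <- data.frame(bio1 = c(1, 2, 3, 4, 5, 6),
                  bio2 = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
                  bio3 = c(5, 1, 4, 2, 6, 3))
maxCorr <- 0.8

# Absolute pairwise correlations between variables
corMat <- abs(cor(env))

# Greedy selection: keep a variable only if it is not too correlated
# with any variable that has already been kept
kept <- character(0)
for (v in colnames(corMat)) {
  if (all(corMat[v, kept] <= maxCorr)) kept <- c(kept, v)
}
kept
```

With this rule the outcome depends on the order of the columns, which is one reason to inspect the result rather than accept it blindly.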
The population structure is calculated from a PCA of all the SNPs that pass the filtering (MAF, LD, missingness). You can either use the scores on the first X components to describe the population structure (set numPop to NULL), or compute a “membership coefficient” to a cluster of individuals based on the scores on the first X components. Two clustering algorithms are available through the clustMethod argument: k-means or hierarchical clustering.
One option for deciding how many PCs to keep is to look for a bump in the proportion of variance explained and keep all the PCs before the bump.
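The proportion of variance explained by each component can be inspected with base R, as in the sketch below. It runs prcomp on simulated genotypes and is not the PCA routine used inside prepareEnv; the two simulated groups and their allele frequencies are assumptions made for illustration.

```r
set.seed(1)
# Toy genotypes: 20 individuals x 8 SNPs coded 0/1/2, simulated as two
# groups with different allele frequencies
n <- 10
groupA <- matrix(rbinom(n * 8, 2, 0.2), nrow = n)
groupB <- matrix(rbinom(n * 8, 2, 0.8), nrow = n)
geno <- rbind(groupA, groupB)

# PCA on the centred genotype matrix
pca <- prcomp(geno, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component
propVar <- pca$sdev^2 / sum(pca$sdev^2)
round(propVar, 3)

# Bar plot in which to look for the bump
barplot(propVar, names.arg = seq_along(propVar),
        xlab = "Principal component", ylab = "Proportion of variance explained")
```

With a clear two-group structure like this one, the first component typically stands out from the others, which suggests keeping a single PC.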
#Loads data
#================
#If you ran prepareGeno, use the gds file created in this step
gdsFile='uganda-subset-mol.gds'
#Otherwise, take the one provided in the package
gdsFile=system.file("extdata", "uganda-subset-mol_windows.gds", package = "R.SamBada") #If you run Windows
gdsFile=system.file("extdata", "uganda-subset-mol_unix.gds", package = "R.SamBada") #If you run MacOS or Linux
gdsFile #Check the path to the file
#If you ran createEnv, use the .csv file created in this step
envFile='uganda-subset-env.csv'
#Otherwise, take the one provided in the package
envFile=system.file("extdata", "uganda-subset-env.csv", package = "R.SamBada")
envFile #Check the path to the file
#Load help
#================
?prepareEnv
#prepareEnv
#================
#Stores Principal components scores
prepareEnv(envFile=envFile, outputFile=file.path(tempdir(),'uganda-subset-env-export.csv'), maxCorr=0.8, idName='short_name', genoFile=gdsFile, numPc=0.2, mafThresh=0.05, missingnessThresh=0.1, ldThresh=0.2, numPop=NULL, x='longitude', y='latitude', interactiveChecks=TRUE, locationProj=4326 )
#Accept the maxCorr threshold, and set numPc to 1.
#remove "tempdir()," to save to the current directory
If you run SamBada from the command line, you have to create a parameter file. sambadaParallel spares you most of this work: all parameters that can be derived from your files are calculated for you.
Furthermore, samBada includes a module called supervision to split the molecular file into several subfiles to allow parallel computing. The split of the input file and merge of the output file with supervision’s help is integrated in the function.
#Loads data
#================
#If you ran prepareEnv, use the .csv file created in this step
envFile='uganda-subset-env-export.csv'
#Otherwise, take the one provided in the package
envFile=system.file("extdata", "uganda-subset-env-export.csv", package = "R.SamBada")
envFile #Check the path to the file
#If you ran prepareGeno, use the .csv file created in this step
genoFile='uganda-subset-mol.csv'
#Otherwise, take the one provided in the package
genoFile=system.file("extdata", "uganda-subset-mol.csv", package = "R.SamBada")
genoFile #Check the path to the file
#Load help
#================
?sambadaParallel
#sambadaParallel
#================
#Run sambada on two cores.
#prepareEnv puts the population variables at the end of the file (=> LAST).
#Only one pop var was saved, so set dimMax=2
sambadaParallel(genoFile=genoFile, envFile=envFile, idGeno='ID_indiv', idEnv='short_name', dimMax=2, cores=2, saveType='END ALL', populationVar='LAST', outputFile=file.path(tempdir(),'uganda-subset-mol'))
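The principle of splitting the molecular file for parallel computing can be illustrated by cutting a list of markers into contiguous blocks, one per core. This is only a conceptual sketch in base R with invented marker names; how supervision actually splits and merges the files may differ.

```r
# Invented marker names standing in for the columns of the molecular file
markers <- paste0("snp", 1:10)
cores <- 2

# Size of each block, then the block number assigned to every marker
chunkSize <- ceiling(length(markers) / cores)
chunkId <- ceiling(seq_along(markers) / chunkSize)

# One contiguous block of markers per core
chunks <- split(markers, chunkId)
chunks
```

Each block could then be processed independently, and the results concatenated in the original marker order.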
The function prepareOutput
does the following things on sambada’s output
#Loads data
#================
#If you haven't run sambadaParallel, you need to copy the output file into the active directory with the following command
file.copy(system.file("extdata", "uganda-subset-mol-Out-2.csv", package = "R.SamBada"), 'uganda-subset-mol-Out-2.csv')
file.copy(system.file("extdata", "uganda-subset-mol-storey.csv", package = "R.SamBada"), 'uganda-subset-mol-storey.csv')
#Furthermore, if you haven't run prepareGeno, you need to copy the GDS file with the following command
file.copy(system.file("extdata", "uganda-subset-mol_windows.gds", package = "R.SamBada"),'uganda-subset-mol.gds') #If you run Windows
file.copy(system.file("extdata", "uganda-subset-mol_unix.gds", package = "R.SamBada"),'uganda-subset-mol.gds') #If you run MacOS or Linux
#Load help
#================
?prepareOutput
#prepareOutput
#================
prep = prepareOutput(sambadaname='uganda-subset-mol', dimMax=2, popStr=TRUE, interactiveChecks=TRUE)
#Accept the pi0 parameter for both the G and Wald scores
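As background for the p-values used in the following steps, the sketch below shows how a score can be turned into a p-value and corrected for multiple testing in base R. It assumes that the G score is compared to a chi-square distribution with one degree of freedom; the scores are invented, and the Benjamini-Hochberg adjustment is used as a stand-in that is not the Storey q-value approach relying on pi0. It does not reproduce what prepareOutput computes.

```r
# Invented G scores for five marker-variable models
gScore <- c(0.5, 3.841459, 10.8, 25.3, 1.2)

# Assumption: one extra parameter compared with the reference model => 1 df
pvalueG <- pchisq(gScore, df = 1, lower.tail = FALSE)

# Two multiple-testing corrections available in base R
pBonf <- p.adjust(pvalueG, method = "bonferroni")
qBH <- p.adjust(pvalueG, method = "BH")

data.frame(gScore, pvalueG, pBonf, qBH)
```

The Bonferroni correction is the more conservative of the two, so its adjusted values are never smaller than the Benjamini-Hochberg ones.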
To explore your results, the first thing you could do is draw a Manhattan plot for each environmental variable to detect one or several interesting peaks for particular variables.
#Loads data
#================
#You need to run prepareOutput to run this function
#Load help
#================
?plotManhattan
#plotManhattan
#================
#Plot manhattan of all kept variables
plotManhattan(prep, c('bio1','bio2','bio3'),chromo='all',valueName='pvalueG')
#Warning: the Manhattan plot looks different from what we are used to seeing, given the small number of SNPs
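For readers unfamiliar with Manhattan plots, the sketch below draws one from scratch with base R graphics. The chromosomes, positions and p-values are invented, the dashed line is a Bonferroni threshold chosen for illustration, and none of this reproduces the internals of plotManhattan.

```r
# Invented results: chromosome, position and p-value for eight SNPs
res <- data.frame(chr = c(1, 1, 1, 2, 2, 3, 3, 3),
                  pos = c(1e6, 5e6, 9e6, 2e6, 7e6, 1e6, 4e6, 8e6),
                  pvalueG = c(0.4, 0.03, 0.7, 1e-5, 0.2, 0.6, 0.001, 0.9))

# Order SNPs along the genome and transform the p-values
res <- res[order(res$chr, res$pos), ]
res$index <- seq_len(nrow(res))
res$logP <- -log10(res$pvalueG)

# Alternate colours between chromosomes, as is usual for Manhattan plots
plot(res$index, res$logP, pch = 19,
     col = ifelse(res$chr %% 2 == 0, "grey40", "steelblue"),
     xlab = "SNP (ordered by chromosome and position)",
     ylab = "-log10(p-value)")

# Illustrative Bonferroni threshold at 0.05
abline(h = -log10(0.05 / nrow(res)), lty = 2)
```

Peaks correspond to small p-values, so the SNPs of interest are the points rising above the dashed line.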
The function plotResultInteractive can now be invoked. It opens a local page in your web browser with a Manhattan plot (you have to choose the environmental variable you want to study and can choose the chromosomes to draw).
The plot is interactive: you can select a point on the Manhattan plot, which queries the Ensembl database to indicate nearby SNPs and the consequence of the variant (intergenic, synonymous, non-synonymous, stop gained, stop lost,…) with VEP.
A map of the marker, the population variables and the environmental variable is also available.
A boxplot showing the environmental range of the individuals in which the marker is present and of those in which it is absent is drawn as well.
#Loads data
#================
#You need to run prepareOutput to run this function
#If you haven't run prepareEnv, use the file provided in the package
envFile=system.file("extdata", "uganda-subset-env-export.csv", package = "R.SamBada")
#Otherwise use
envFile='uganda-subset-env-export.csv'
#If you haven't run prepareGeno, use the file provided in the package
gdsFile=system.file("extdata", "uganda-subset-mol_windows.gds", package = "R.SamBada") #If you run Windows
gdsFile=system.file("extdata", "uganda-subset-mol_unix.gds", package = "R.SamBada") #If you run MacOS or Linux
#Otherwise use
gdsFile='uganda-subset-mol.gds'
#Load help
#================
?plotResultInteractive
#plotResultInteractive
#================
plotResultInteractive(preparedOutput=prep, varEnv='bio1', envFile=envFile,species='btaurus', pass=50000,x='longitude',y='latitude',gdsFile=gdsFile, IDCol='short_name', popStrCol='pop1')
#Accept the dataset and SNP data found
#Once the interactive window opens, click on any point of the manhattan plot
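The boxplot comparing the environmental range of individuals with and without the marker can also be reproduced by hand. The sketch below uses base R on invented values; the column names bio1 and marker and the 0/1 coding are assumptions, and this is not the code behind plotResultInteractive.

```r
# Invented data: environmental value and marker presence (1) or absence (0)
dat <- data.frame(bio1 = c(180, 185, 190, 210, 215, 220, 225, 200),
                  marker = c(0, 0, 0, 1, 1, 1, 1, 0))

# Readable labels for the two groups
dat$status <- factor(dat$marker, levels = c(0, 1), labels = c("absent", "present"))

# Environmental range of each group
boxplot(bio1 ~ status, data = dat, xlab = "Marker", ylab = "bio1")

# Median of the environmental variable in each group
medians <- tapply(dat$bio1, dat$status, median)
medians
```

A clear shift between the two boxes is what a significant marker-environment association is expected to look like.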
You might be interested in plotting a map (although plotResultInteractive already draws one). The advantages of plotMap are
You can draw a map of
#Loads data
#================
#You need to run prepareOutput to run this function
#If you haven't run prepareEnv, use the file provided in the package
envFile=system.file("extdata", "uganda-subset-env-export.csv", package = "R.SamBada")
#Otherwise use
envFile='uganda-subset-env-export.csv'
#If you haven't run prepareGeno, use the file provided in the package
gdsFile=system.file("extdata", "uganda-subset-mol_windows.gds", package = "R.SamBada") #If you run Windows
gdsFile=system.file("extdata", "uganda-subset-mol_unix.gds", package = "R.SamBada") #If you run MacOS or Linux
#Otherwise use
gdsFile='uganda-subset-mol.gds'
#Load help
#================
?plotMap
#plotMap
#================
plotMap(envFile=envFile, x='longitude', y='latitude', locationProj=4326, popStrCol='pop1', gdsFile=gdsFile, markerName='Hapmap28985-BTA-73836_GG', mapType='marker', varEnvName='bio1', simultaneous=FALSE)
Good luck with the analysis of your own dataset!