BIGDAWG

Derek Pappas, Ph.D. (dpappas@chori.org)

2021-10-23

Overview

‘Bridging ImmunoGenomic Data-Analysis Workflow Gaps’ (‘BIGDAWG’) is an integrated analysis system that automates the manual data-manipulation and trafficking steps (the gaps in an analysis workflow) normally required for analyses of highly polymorphic genetic systems (e.g., the immunological human leukocyte antigen (HLA) and killer-cell Immunoglobulin-like receptor (KIR) genes) and their respective genomic data (immunogenomic) (Pappas DJ, Marin W, Hollenbach JA, Mack SJ. 2016. ‘Bridging ImmunoGenomic Data Analysis Workflow Gaps (BIGDAWG): An integrated case-control analysis pipeline.’ Human Immunology. 77:283-287). Starting with unambiguous genotype data for case-control groups, ‘BIGDAWG’ performs tests of Hardy-Weinberg equilibrium, and carries out case-control association analyses for haplotypes, individual loci, specific HLA exons, and HLA amino acid positions.

Input Data

Data for BIGDAWG should be in a tab delimited text format (UTF-8 encoding). The first row must be a header line and must include column names for genotype data. The first two columns must contain subject IDs and phenotypes (0 = control, 1 = case), respectively. A phenotype is not limited to disease status and may include other phenotypes such as onset, severity, ancestry, etc. However, phenotype designatons in the dataset are restricted to the use of 0s and 1s. Genotype pairs must be located in adjacent columns. Column names for a given locus may use ’_1’, ‘.1’,’_2’,‘.2’ to distinguish each locus pair. Genotype calls may include any text (numeric or character) except the numbers 1 and 0.

Data may also be passed to BIGDAWG as an R object (dataframe) following the same formatting as above for text files. You may also choose to run a synthetic HLA data set (see below) to observe a typical BIGDAWG analysis and experiment with parameter settings.

For HLA alleles, you may choose to format your genotype calls with our without the locus prefix. For example, for HLA-A, a given genotype call maybe 01:01:01:01 or A*01:01:01:01 or HLA-A*01:01:01:01. Allele names can include any level of resolution, from a single field up to the full length name. For HLA-DRB3,-DRB4,-DRB5 genotype calls, you may choose to represent these as a single pair of columns or as separate pairs of columns for each locus. However, when submitted as a single pair of columns, all genotypes must be formatted as Locus*Allele (including non-DRB loci). The single pair column names may be DRB345, DRB3.4.5 or DRB3/4/5. Homozygous or hemizygous status for DRB3, DRB4 and DRB5 genotypes is based on the DRB1 haplotype as defined by Andersson, 1998 (Andersson G. 1998. Evolution of the HLA-DR region. Front Biosci. 3:d739-45). If you wish to define your own zygosity, it is suggested you split them into separate pairs of columns for each locus manually.

Missing Information
When there is missing information, either for lack of genotyping information or absence of genotyped loci, BIGDAWG allows for conventions to differentiate the type of data that is missing.

Data missing due to lack of a molecular genotyping result is considered not available (NA). Acceptable NA strings include: NA, ****, -, na and Na. Empty data cells will be considered NA. If your data is formatted as LocusAllele, please include this formatting for all absent alleles as well (e.g., DRB1NA).

Data missing due to genomic structural variation (i.e., no locus present) is considered absence. Acceptable absence strings include: Absent, absent, Abs, ABS, ab, Ab, AB, @. The last symbol is the unicode at sign. BIGDAWG allows for a special allele name that indicates absence of an HLA locus: 00, 00:00, 00:00:00 and 00:00:00:00 are all acceptable indicators of HLA locus absence. When choosing to use 00’s (zeros) to populate allele name fields, use similar or higher levels of resolution http://hla.alleles.org and follow the same naming convention as with other genotype calls (either with or without locus prefix). If using a single column pair for DRB3/4/5 and the “00” absence indicator, then do NOT affix a locus prefix for the absent calls. In this case, only include the locus prefix for known DRB345 genotypes (i.e., DRB3/4/5*00:00 is NOT an acceptable name). For HLA data, the 00:00 naming convention is preferred and absent designations will be converted to allow the amino acid analysis to test phenotypic associations with locus absence (see below).

Finally, when missing alleles (due to lack of a genotype call) for a locus are included in the haplotype analysis, the haplotype estimation method may impute the identity of the missing alleles for that subject. If such imputation is not desired, the “Missing”” parameter should be set to 0.

Genotype List Strings
For for HLA alleles, you may submit your data formatted as a genotype list strings (GL strings) and BIGDAWG can automatically convert the data to a tabular format. The data should be 3-columns with the GL string in the third column (see table below). Data is restricted to 00:00 for absent designations when using GL strings, you should NOT use any other indicator of absence. Utilizing NA is not compatible with GL2Tab conversion.

BIGDAWG also has a built-in function for converting between GL strings and tabular formatting using the GLSconvert() function. Please see the GLSconvert vignette for more detail.

Novel Alleles
BIGDAWG will accept novel allele names. However, it is suggested you follow the same naming convention for novel alleles as with other genotypes calls in your data, either with or without the locus prefix. For example, novel alleles could be submitted as follows: Novel, 01:Novel, or A*01:Novel. Unfortunately, the BIGDAWG amino acid analysis cannot accept novel allele designations and will display an error. If you would like to run the amino acid analysis, you should replace the novel allele with NA or omit the subject entirely.

Example of data architecture and acceptable values:

Tabular

SubjectID Disease A A B B DRB1 DRB1 DRB3 DRB3
subject1 0 01:01 02:01 08:01 44:02 01:01 08:01 Abs Abs
subject2 1 02:01 24:02 51:01 51:01 11:01 14:01 02:02 02:11
subject3 0 03:01 26:02 NA NA 10:01 08:01 00:00 00:00

Genotype List String

SubjectID Disease GLString
Subject1 0 HLA-A*01:01+HLA-A*02:01^HLA-B*08:01+HLA-B*44:02^HLA-DRB1*01:01+HLA-DRB1*03:01^HLA-DRB3*00:00+HLA-DRB3*00:00
subject2 1 HLA-A*02:01+HLA-A*24:02^HLA-B*51:01+HLA-B*51:01^HLA-DRB1*11:01+HLA-DRB1*14:01^HLA-DRB3*02:02+HLA-DRB3*02:11
subject3 0 HLA-A*01:01+HLA-A*02:01^HLA-DRB1*01:01+HLA-DRB1*03:01^HLA-DRB3*00:00+HLA-DRB3*00:00

Data Output

After the package is run, BIGDAWG will create a new folder entitled ‘output hhmmss ddmmyy’ in the working directory (unless otherwise specified by Results.Dir parameter, see below). Within the output folder will be a precheck file (‘Data_Summary.txt’) detailing the summary statistics of the dataset and the results of the Hardy-Weinberg equilibrium test (‘HWE.txt’). If any errors are present, a log file (‘Error_Log.txt’) will be written. If no locus subsets are specified (see parameters section), a single subfolder entitled ‘Set1’ will contain the outputs of each association analysis run. If multiple locus subsets are defined, subfolders for each locus set will be created containing the respective analytic results for that subset. Within each set subfolder, a parameter file will detail the parameters that are relevant to that subset.

When all pairwise combinations are run in the haplotype analysis, each pairwise set will be written to a single file. A separate file called “haplotype_PairwiseSets.txt” will be written to the corresponding set’s directory and will breakdown the locus make up for each PairwiseSet.

Data output to both the console and text files can be suppressed with Verbose=F and Output=F respectively. This is particularly useful when the user prefers to send the results to an R object (Return=T, see below) rather than to text files for further analysis. Finally, when multiple analyses are run (i.e., HWE, H, L, A) the data for each analysis (Chi Squares, Odds Ratios, Allele Frequencies, Allele Counts, etc.) can be written as set of files labeled ‘Merge’ for convenient access (Merge.Ouput=T).

Output As List Object
With the parameter “Return=T”, the result is returned as a list with indices for each analysis result (HWE, H, L, A). Defined loci sets will exists as list sub-elements under each respective analysis. If no Loci.Sets were specified, only one list element will exist for Set1.

For an example. If BIGDAWG outputs to an object called ‘BD’ in R (see examples below).

BD (e.g., 3 loci sets defined)
..$HWE - Results of the the Hardy-Weinberg equilibrium test
..$H$Set1 - Results of the ‘Haplotype’ analysis for Set1 ..$H$Set2 - Results of the ‘Haplotype’ analysis for Set2
..$H$Set3 - Results of the ‘Haplotype’ analysis for Set3
..$L... - Results of the ‘Locus’ analysis
..$A... - Results of the ‘Amino Acid’ analysis

or when “All.Pairwise=T” (e.g., 3 loci, 3 pairwise comparisons possible, single locus set)
..$HWE - Results of the the Hardy-Weinberg equilibrium test
..$H$Set1.PairwiseSet1 - Results of the ‘Haplotype’ analysis for Set1 Pairwise Set 1
..$H$Set1.PairwiseSet2 - Results of the ‘Haplotype’ analysis for Set1 Pairwise Set 2
..$H$Set1.PairwiseSet3 - Results of the ‘Haplotype’ analysis for Set1 Pairwise Set 3
..$L$Set1 - Results of the ‘Locus’ analysis for Set1
..$A$Set1 - Results of the ‘Amino Acid’ analysis for Set1

names(BD) - Display complete list of available sets to index. For a complete list of available pairwise results, use names(BD$H). This results list is written to the results directory as an R object called ‘Analysis.RData’ for later use.

Error Messages and Codes

BIGDAWG has a few built-in checks to ensure data format consistency and compatibility, especially for HLA data. BIGDAWG also does a parameter review before performing chi-squared tests and returns ‘NCalc’ (not calculated) when all genotypes have expected counts < 5 or the degrees of freedom do not allow for a test (e.g., dof < 1).

Known Issues

BIGDAWG’s output includes locus frequencies to provide convenient access for future reference. However, the values have been rounded to 5 digits (arbitrarily chosen) to make the output more concise. This rounding may introduce errors in the frequencies wherein they do not sum to 1. If downstream use of the allele frequency is required, we suggest the user calculate frequencies directly from the counts tables for their own application. Please refer to ?round help documentation and the section Warning for more detail on rounding considerations.

Parameters

BIGDAWG(Data, HLA=T, Run.Tests, Loci.Set, Exon, All.Pairwise=F, Trim=F, Res=2, Missing=0, Strict.Bin=F, Cores.Lim=1L, Results.Dir, Return=F, Output=T, Verbose=T)

Data

Class: String. Required. No Default.

e.g., Data=HLA_data -or- Data="HLA_data" -or- Data="foo.txt" -or- Data=foo.txt

Specifies genotype data file name. May use file name within working directory or full file name path to specify file location. See Data Input section for details about file formatting. Use ‘Data=HLA_data’ to analyze the bundled synthetic dataset.

HLA

Class: logical. Optional. Default = T.

Indicates whether or not your data is specific for HLA loci. If your data is not HLA, is a mix of HLA and data for other loci, or includes non-standard HLA allele names, you should set HLA = F. This will override the Trim and EVS.rm arguments, and will skip various tests and checks. Set HLA = T if and only if the dataset HLA alleles name are consistent with the most recent IPD-IMGT/HLA database release.

Run.Tests

Class: String or Character vector. Optional. Default = Run all tests.

e.g., Run.Tests = c("L","A") -or- Run.Tests = "HWE"

Specifies which tests to run in analysis. “HWE” will run the Hardy Weinberg Equilibrium test, “H” will run the haplotype association test, “L” will run the locus association test, and “A” will run the amino acid association test. Combinations of the test are permitted as indicated in the example.

The amino acid test generally requires the most processing time. Taking advantage of multi-core machines can minimize this time (see below). Moreover, avoid defining multiple Loci.Sets (see below) with overlapping loci as processing time will increasing redundantly. Currently, the amino acid analysis is limited to the HLA-A, -B, -C, -DRB1, -DRB3, -DRB4, -DRB5, -DQA1, -DQB1, -DPA1 and -DPB1 loci.

Loci.Set

Class: List. Optional. Default = Use all loci.

e.g., Loci.Set=list(c("DRB1","DQB1"),c("A","DRB1","DPB1"), c("DRB1","DRB3")) -or- Loci.Set=list("A")

Input list defining which loci to use for analyses. Combinations are permitted. If you included HLA-DRB3,-DRB4,-DRB5 as a collapsed column pair (‘DRB3/4/5’), you must specify the single locus in the Loci.set if you wish them to be included in an analysis set (i.e., ‘DRB3’ NOT ‘DRB3/4/5’). The pair of alleles for a locus must be in adjacent columns in the analyzed data set.

Running multiple sets is generally only relevant for the haplotype analysis without all pairwise combinations. For all other analyses, loci are treated independently. Consider running haplotype analysis independently when optioning multi-locus sets that included overlapping loci to avoid redundancies. Each locus set’s output will be contained within a separate set folder numbered numerically corresponding to their order in the Loci.set parameter (see Data Output section).

Exon

Class: Numeric. Optional.

e.g., Exon=3 -or- Exon=c(3,5,6) -or- Exon=c(2:3) -or- Exon=1:4

A single numeric or numeric vector that defines exons to target in the amino acid analysis. The amino acid alignment is parsed according to the overlap of the defined exons. When amino acid codons overlap exon boundaries, the exon with the majority overlap (2 out of 3 nucleotides) is assigned that residue. This argument is only relevant to the amino acid analysis. The defined exons are not required to be continuous. Multiple sets are not analyzed separately. The defined exons are applied to all loci in the the analysis. If an exon does not exist for a given locus, BIGDAWG will register an error and the analysis will stop. In such instances, it is recommended you analyze those loci separately.

All.Pairwise

Class: Logical. Optional. Default = F.

Should pairwise combinations of loci be run in the haplotype analysis? Only relevant to haplotype analysis. When optioned, only pairwise combinations of loci will be run and not all the loci in a given data set.

EVS.rm

Class: Logical. Optional. Default = F. (HLA=T specific).

Flags whether or not to strip expression variant suffixes from HLA alleles. Example: A*01:11N will convert to A*01:11. Should not be optioned for data that does not conform to HLA naming conventions.

Trim

Class: Logical. Optional. Default = F. (HLA=T specific).

Flags whether or not to Trim HLA alleles to a specified resolution. Should not be optioned for data that does not conform to HLA naming conventions.

Resolution

Class: Numeric. Optional. Default = 2. (HLA=T specific).

Sets the desired resolution when trimming HLA alleles. Used only when Trim = T. Fields for HLA formatting must follow current colon-delimited nomenclature conventions. Currently, the amino acid analysis will automatically truncate to 2-field resolution. Trimming is automatic and need not be optioned for amino acid analysis to run. This test will not run for data that does not conform to HLA naming conventions.

Missing

Class: String/Numeric. Optional. Default = 0.

Sets the allowable per subject threshold for missing alleles. Relevant to running the haplotype analysis. Effects can be disastrous on processing time and memory allocation for large values (>2) of missing. Missing may be set as a number or as “ignore” to skip removal and retain all subjects. If you find BIGDAWG spending a lot time on the “Estimating Haplotypes…” step, reduced you missing to a value less than or equal to 2.

Strict.Bin

Class: Logical. Optional. Default = FALSE.

Sets whether strict binning should be used during Chi Square testing. Strict binning (Strict.Bin = T) will bin all rare cells (expected count < 5). Otherwise, BIGDAWG will allow for up to 20% of the cells to have expected counts less than 5. Currently limited to the H, L, and A tests. This may rescue haplotypes/alleles/amino acids from binning and help identify significant loci/alleles.

Cores.Lim

Class: Integer. Optional. Default = 1 Core.

Specifies the number of cores accessible by BIGDAWG in amino acid analysis. Not relevant to Windows operating systems which will use only a single core. More than 1 core is best when optioned in command line R and is not recommend when used in combination with a GUI, e.g. RStudio.

Results.Dir

Class: String. Optional. Default = see Data Output section.

String name of a folder for BIGDAWG output. Subfolder for each locus set will be generated within any output folder specified.

Return

Class: Logical. Optional. Default = F.

Specifies if BIGDAWG should output analysis results to a specified object.

Output

Class: Logical. Optional. Default = T.

Turns on or off the writing of results to files. The default will write all results to text files in the output directory.

Merge.Output

Class: Logical. Optional. Default = F.

Turns on or off the merging of all analysis results into single files labeled ’Merged_xxxx.txt”. This process is rapid for smaller sets of loci (less than 50). However it can become increasingly cpu time intensive when there are many loci analyzed in conjunction with all pairwise combinations in the haplotype analysis. This parameter is subordinate to Output=T.

Verbose

Class: Logical. Optional. Default = T.

Sets the levels of detail that should be displayed on the console. The default will display summaries of the analysis from each specified test. When turned off, only the completion status of each test is displayed.

Examples

These are examples only and need not be run as defined below.


# Install the BIGDAWG package
install.packages("BIGDAWG")

# Run the full analysis using the example set bundled with BIGDAWG
BIGDAWG(Data="HLA_data")

# Run the haplotype analysis with all pairwise combinations on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="H", All.Pairwise=T)

# Run the Hardy-Weinberg and Locus analysis with non-HLA data while ignoring any missing data on a file called 'data.txt'
BIGDAWG(Data="data.txt", HLA=F, Run.Tests=c("HWE","L"), Missing="ignore")

# Run the amino acid analysis on exons 2 and 3, trimming data to 2-Field resolution on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="A", Exon=c(2,3), Trim=T, Res=2)

# Run the haplotype analysis with subsets of loci on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="H", Loci.Set=list(c("DRB1","DQB1","DPB1"),c("DRB1","DQB1")))

# Run the full analysis, minimize console output, disable write to file, output to object 'BD'
BD <- BIGDAWG(Data="data.txt", Output=F, Return=T, Verbose=F)

Updating the bundled IMGT/HLA protein alignment

For the amino acid analysis, BIGDAWG is bundled with HLA protein alignment data using the above indicated database release. These bundled alignments can be updated to the most recent release IPD-IMGT/HLA. Future database updates do not guarantee compatibility with the updating tool.

# Identify the installed and current release of the bundled IMGT/HLA database release
# Requires active internet connection.
CheckRelease()
CheckRelease(Package=F) # restricts to IMGT/HLA database versions only

# Update to the most recent IMGT/HLA database release
UpdateRelease()

# Force update
UpdateRelease(Force=T)

# Restore to original bundled update.
UpdateRelease(Restore=T)

Updating BIGDAWG to latest developmental versions

Developmental versions of BIGDAWG can be downloaded through GitHub or using the following code (requires R package ‘devtools’). These versions will include the most up-to-date bug fixes as well as access to new features that are still under development. GitHub versions are constantly under development and if you prefer a more stable fixed release, install BIGDAWG from the CRAN respository. You may check BIGDAWG versions using CheckRelease(). Local builing of vignettes requires pandoc and pandoc-citeproc if you do not use RStudio Pandoc. Before installation from GitHub, it is recommended that all other R packages be up to date.

# Identify the installed (local), release (CRAN), and developmental (GitHub) versions of BIGDAWG.
# Requires active internet connection.
CheckRelease()
CheckRelease(Alignment=F) # restricts to BIGDAWG versions only

# If 'devtools' package installation is required for installation via GitHub.
install.packages("devtools")

# Load latest BIGDAWG version from GitHub
# May require closing and reopening of R Studio after install
library("devtools")
install_github("IgDAWG/BIGDAWG", build_vignettes = TRUE) # Requires Pandoc or RStudio
install_github("IgDAWG/BIGDAWG") # No Pandoc or RStudio

# For a temporary install of the developmental version
library("devtools")
dev_mode(on=T)
install_github("IgDAWG/BIGDAWG")
# .... run BIGDAWG analysis
dev_mode(on=F)

End of vignette.