‘Bridging ImmunoGenomic Data-Analysis Workflow Gaps’ (‘BIGDAWG’) is an integrated analysis system that automates the manual data-manipulation and trafficking steps (the gaps in an analysis workflow) normally required for analyses of highly polymorphic genetic systems (e.g., the immunological human leukocyte antigen (HLA) and killer-cell Immunoglobulin-like receptor (KIR) genes) and their respective genomic data (immunogenomic) (Pappas DJ, Marin W, Hollenbach JA, Mack SJ. 2016. ‘Bridging ImmunoGenomic Data Analysis Workflow Gaps (BIGDAWG): An integrated case-control analysis pipeline.’ Human Immunology. Article in press). Starting with unambiguous genotype data for case-control groups, ‘BIGDAWG’ performs tests of Hardy-Weinberg equilibrium, and carries out case-control association analyses for haplotypes, individual loci, and HLA amino acid positions.
Data for BIGDAWG should be in a tab-delimited text format. The first row must be a header line and must include column names for genotype data. The first two columns must contain subject IDs and phenotypes (0 = control, 1 = case), respectively. Data for each genotype pair must be located in adjacent columns. Both columns for a given locus must have the same name; do not use ‘_1’, ‘.1’, etc. For HLA alleles, names (with or without a locus prefix) can include from a single field up to the full-length name for a given allele.
Missing Information
When information is missing, either for lack of genotyping information or absence of genotyped loci, BIGDAWG allows for conventions to differentiate the type of data that is missing.
Data missing due to lack of a molecular genotyping result is considered not available (NA). Acceptable NA strings include: NA, ****, -, na and Na. Empty data cells will be considered NA.
Data missing due to genomic structural variation (i.e., no locus present) is considered absence. Acceptable absence strings include: Absent, absent, Abs, ABS, ab, Ab, AB, ^. The last symbol is the unicode caret symbol. For HLA data, BIGDAWG allows for a special allele name that indicates absence: 00, 00:00, 00:00:00 and 00:00:00:00 are all acceptable indicators of HLA locus absence. When choosing to use 0’s (zeros) to populate allele name fields, use similar or higher levels of resolution (http://hla.alleles.org/nomenclature/naming.html). When using HLA data, the 00:00 naming convention is preferred and will allow for the amino acid analysis to test a phenotype association for locus absence (see below).
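The two kinds of missing data described above can be told apart with a simple lookup. The following is a minimal sketch only: `classify_cell` is a hypothetical helper written for illustration, not a BIGDAWG function, and it assumes allele cells are compared as exact strings without a locus prefix.

```r
# Strings listed above as "not available" (empty cells count as NA too).
na_strings <- c("NA", "****", "-", "na", "Na", "")

# Strings listed above as locus absence, including the HLA zero-style names.
absent_strings <- c("Absent", "absent", "Abs", "ABS", "ab", "Ab", "AB", "^",
                    "00", "00:00", "00:00:00", "00:00:00:00")

# Hypothetical helper (not part of BIGDAWG): label each cell as
# "not_available", "absent", or "allele".
classify_cell <- function(x) {
  x <- as.character(x)
  out <- rep("allele", length(x))
  out[is.na(x) | x %in% na_strings] <- "not_available"
  out[x %in% absent_strings] <- "absent"
  out
}

classify_cell(c("01:01", "NA", "", "Abs", "00:00", "^"))
```

Running a check like this over a genotype table before analysis can reveal cells that match neither convention and would therefore be treated as allele names.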
Example of data set architecture and acceptable values:
subjectID | Disease | A | A | B | B | DRB1 | DRB1 | DRB3 | DRB3 |
---|---|---|---|---|---|---|---|---|---|
subject1 | 0 | 01:01 | 02:01 | 08:01 | 44:02 | 01:01 | 03:01 | Abs | Abs |
subject2 | 1 | 02:01 | 24:02 | 51:01 | 51:01 | 11:01 | 14:01 | 02:02 | 02:11 |
subject3 | 0 | 03:01 | 26:02 | NA | NA | 13:01 | 15:01 | 00:00 | 00:00 |
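R's default name checking renames duplicated columns (for example, `A` and `A.1`), which conflicts with the same-name requirement above. The sketch below shows one way to build and write a small tab-delimited file that keeps identical column names for each locus. The file path and the reduced set of loci are illustrative assumptions; the values are taken from the example table.

```r
# Build a small genotype table; check.names = FALSE keeps duplicate
# column names ("A", "A") instead of renaming the second to "A.1".
geno <- data.frame(
  subjectID = c("subject1", "subject2", "subject3"),
  Disease   = c(0, 1, 0),
  A         = c("01:01", "02:01", "03:01"),
  A         = c("02:01", "24:02", "26:02"),
  B         = c("08:01", "51:01", "NA"),
  B         = c("44:02", "51:01", "NA"),
  DRB3      = c("Abs", "02:02", "00:00"),
  DRB3      = c("Abs", "02:11", "00:00"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row and no quoting or row names.
# A temporary file is used here; substitute your own path.
path <- tempfile(fileext = ".txt")
write.table(geno, file = path, sep = "\t", quote = FALSE, row.names = FALSE)

# Inspect the header line to confirm the locus names were not altered.
strsplit(readLines(path, n = 1), "\t")[[1]]
```

When reading such a file back into R for inspection, `read.delim(path, check.names = FALSE, colClasses = "character")` keeps the duplicated names and the allele strings intact.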
After the package is run, BIGDAWG will create a new folder entitled ‘output hhmmss ddmmyy’ in the working directory (unless otherwise specified by Results.Dir parameter, see below). Within the output folder will be a precheck file (‘PreCheck.txt’) detailing the summary statistics of the dataset and the results of the Hardy-Weinberg equilibrium test (‘HWE.txt’). If no locus subsets are specified (see parameters section), a single subfolder entitled ‘set1’ will contain the outputs of each association analysis optioned. If multiple locus subsets are optioned, multiple subfolders for each locus set will be written, each containing the analytic results for that subset. Within each set subfolder, a parameter file will detail the parameters that are relevant to that subset, as well as BIGDAWG version numbers, for user reference.
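Because each run writes a new time-stamped folder, it can be handy to locate the most recent one programmatically. This is a sketch under the assumption that the default naming is used, so that folder names begin with `output ` in the working directory; `latest_output_dir` is a hypothetical helper, not a BIGDAWG function.

```r
# Hypothetical helper: return the most recently modified folder whose
# name starts with "output " inside `wd`, or NA if there is none.
latest_output_dir <- function(wd = getwd()) {
  dirs <- list.dirs(wd, full.names = TRUE, recursive = FALSE)
  dirs <- dirs[grepl("^output ", basename(dirs))]
  if (length(dirs) == 0) return(NA_character_)
  dirs[which.max(file.mtime(dirs))]
}

# List everything written by the latest run, including set subfolders.
out <- latest_output_dir()
if (!is.na(out)) print(list.files(out, recursive = TRUE))
```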
BIGDAWG(Data, HLA=T, Run.Tests, Loci.Set, All.Pairwise=F, Trim=F, Res=2, EVS.rm=F, Missing=0, Cores.Lim=1L, Results.Dir)
Data
Class: String. No Default.
e.g., Data="HLA_data"
or Data="foo.txt"
Specifies genotype data file name. May use file name within working directory or full file name path to specify file location. See Data Input section for details about file formatting.
HLA
Class: Logical. Default = T.
Indicates whether or not your data is specific for HLA loci. If your data is not HLA, is a mix of HLA and data for other loci, or includes non-standard HLA allele names, you should set HLA = F. This will override the Trim and EVS.rm arguments, and will skip various tests and checks. Set HLA = T if and only if the dataset's HLA allele names are consistent with the most recent IMGT/HLA Database release (https://www.ebi.ac.uk/ipd/imgt/hla).
Run.Tests
Class: String or Character vector. Default = Run all tests.
e.g., Run.Tests = c("L","A")
Specifies which tests to run in the analysis. “HWE” will call the Hardy-Weinberg equilibrium test, “H” will call the haplotype association test, “L” will call the locus association test, and “A” will call the amino acid test. Combinations of tests are permitted, e.g., Run.Tests="HWE" or Run.Tests=c("HWE","H","L").
Currently, the amino acid analysis is limited to the HLA-A, -B, -C, -DRB1, -DQA1, -DQB1, -DPA1 and -DPB1 loci.
Loci.Set
Class: List. Default = Use all loci.
e.g., Loci.Set=list(c("DRB1","DQB1"),c("A","DRB1","DPB1"))
Input list defining which loci to use for analyses. Combinations are permitted. The pair of alleles for a locus must be in adjacent columns in the analyzed data set. Running multiple sets is only relevant for the haplotype analysis. For all other analyses, loci are treated independently. Consider running the haplotype analysis independently when optioning multi-locus sets to avoid redundancy of the other analyses. Each set output will be contained within a separate folder (see Data Output section).
All.Pairwise
Class: Logical. Default = F.
Should pairwise combinations of loci be run in the haplotype analysis? Only relevant to haplotype analysis.
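To get a sense of how many two-locus analyses a pairwise run implies, the pairs can be enumerated with base R's `combn`. This is only an illustration of what "all pairwise combinations" means for a given set of loci; it is an assumption, not a description of how BIGDAWG builds the pairs internally. The locus names are examples.

```r
# Example loci; with n loci there are n * (n - 1) / 2 unordered pairs.
loci <- c("A", "B", "DRB1", "DQB1")

# Each list element is one two-locus combination, in the same shape
# as an entry of a Loci.Set list.
pairs <- combn(loci, 2, simplify = FALSE)

length(pairs)
pairs[[1]]
```

The number of pairs grows quadratically with the number of loci, which is worth keeping in mind when estimating run time for the haplotype analysis.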
EVS.rm
Class: Logical. Default = F. (HLA=T specific).
Flags whether or not to strip expression variant suffixes from HLA alleles. Example: A*01:11N will convert to A*01:11. Should not be optioned for data that does not conform to HLA naming conventions.
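The conversion in the example above can be sketched with a regular expression. This is an illustrative stand-in, not BIGDAWG's own code: `strip_expression_suffix` is a hypothetical helper, and it assumes the suffix is one of the letters N, L, S, C, A or Q used in HLA nomenclature and that it directly follows a digit.

```r
# Hypothetical helper: drop a trailing expression-variant letter, but only
# when it follows a digit, so strings such as "NA" or "AB" are left alone.
strip_expression_suffix <- function(alleles) {
  sub("(?<=[0-9])[NLSCAQ]$", "", alleles, perl = TRUE)
}

strip_expression_suffix(c("A*01:11N", "A*01:11", "24:02:01:02L", "NA", "AB"))
```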
Trim
Class: Logical. Default = F. (HLA=T specific).
Flags whether or not to Trim HLA alleles to a specified resolution. Should not be optioned for data that does not conform to HLA naming conventions.
Res
Class: Numeric. Default = 2. (HLA=T specific).
Sets the desired resolution when trimming HLA alleles. Used only when Trim = T. Fields for HLA formatting must follow current colon-delimited nomenclature conventions. Currently, amino acid analysis will automatically truncate to 2-field resolution. Trimming is automatic and need not be optioned for amino acid analysis to run. Should not be optioned for data that does not conform to HLA naming conventions.
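Trimming to a given resolution amounts to keeping the first few colon-delimited fields of each allele name. The sketch below illustrates the idea under that assumption; `trim_fields` is a hypothetical helper and does not reproduce BIGDAWG's internal handling (for example, of NA or absence strings).

```r
# Hypothetical helper: keep the first `res` colon-delimited fields.
# Names that already have fewer fields are returned unchanged.
trim_fields <- function(alleles, res = 2) {
  vapply(
    strsplit(alleles, ":", fixed = TRUE),
    function(fields) paste(head(fields, res), collapse = ":"),
    character(1),
    USE.NAMES = FALSE
  )
}

trim_fields(c("A*01:01:01:01", "02:01:01", "03"), res = 2)
```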
Missing
Class: String/Numeric. Default = 0.
Sets the allowable threshold for subjects missing one or both alleles for a given locus. Relevant to running the haplotype analysis. Large values of Missing can drastically increase processing time. Missing may be set as a number or as “ignore” to skip removal and retain all subjects.
Cores.Lim
Class: Integer. Default = 1 Core.
Specifies the number of cores accessible by BIGDAWG in amino acid analysis. Not relevant to Windows operating systems, which will use only a single core. More than 1 core is best when optioned in command line R. Not recommended for GUIs, e.g., RStudio.
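A value for Cores.Lim can be chosen by checking the platform and the number of cores R can detect. The following sketch uses base R's `parallel::detectCores()` and `.Platform$OS.type`; `pick_cores` is a hypothetical helper, and capping the request at the detected core count is an assumption made here for illustration.

```r
# Hypothetical helper: return 1 on Windows, otherwise the requested number
# of cores capped at what the machine reports (never less than 1).
pick_cores <- function(requested,
                       os_type = .Platform$OS.type,
                       available = parallel::detectCores()) {
  if (os_type == "windows") return(1L)
  if (is.na(available)) available <- 1L
  as.integer(max(1L, min(requested, available)))
}

cores <- pick_cores(4)
cores
```

The resulting integer could then be passed on as `Cores.Lim = cores`.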
Results.Dir
Class: String. Default = see Data Output section.
String name of a folder for BIGDAWG output. A subfolder for each set will be generated within any output folder specified.
These are examples only and need not be run as defined below.
#Run the full analysis using the example set bundled with BIGDAWG
BIGDAWG(Data="HLA_data")
#Run the haplotype analysis with all pairwise combinations on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="H", All.Pairwise=T)
#Run the Hardy-Weinberg and Locus analysis with non-HLA data while ignoring any missing data on a file called 'data.txt'
BIGDAWG(Data="data.txt", HLA=F, Run.Tests=c("HWE","L"), Missing="ignore")
#Run the amino acid analysis trimming data to 2-Field resolution on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="A", Trim=T, Res=2)
#Run the haplotype analysis with subsets of loci on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="H", Loci.Set=list(c("DRB1","DQB1","DPB1"),c("DRB1","DQB1")))
The bundled HLA protein alignment used in the amino acid analysis can be updated to the most recent release (https://www.ebi.ac.uk/ipd/imgt/hla). This version of BIGDAWG was bundled using the indicated release (see above). Future database updates do not guarantee compatibility with the updating tool.
# Update to the most recent IMGT/HLA database release
UpdateRelease()
# Force update
UpdateRelease(Force=T)
# Restore to original bundled update.
UpdateRelease(Restore=T)
End of vignette.