Importing data for genome polarisation

Natália Martínková

2023-07-11

Quick start

The package diemr provides a function vcf2diem, which converts diploid genotypes in a vcf file (optionally, the vcf file can be gzipped) or an vcfR object loaded into R to a file with format required by diem. To start, load the diemr package or install it from CRAN if it is not yet available:

# Attempt to load a package, if the package was not available, install and load
if(!require("diemr", character.only = TRUE)){
    install.packages("diemr", dependencies = TRUE)
    library("diemr", character.only = TRUE)
}

Next, decide whether all markers in the vcf file should be polarised from one file, or whether the analysis will benefit from parallelisation. Parallelisation is available on UNIX-based platforms, and we advice to use it for large datasets. diem will process each file separately, so dividing the data into suitable chunks must be included in the data conversion step.

# Path to the vcf file
inputfile <- system.file("extdata", "myotis.vcf", package = "diemr")
# File name for the output
# If working directory does not have writing privileges, the filename must include a path to a suitable folder
outputfile <- "myotis"
# Convert the vcf file to two diem files
vcf2diem(SNP = inputfile, filename = outputfile, chunk = 2)
# Expecting to include 11 markers per diem file.
# If you expect more markers in the file, provide suitable chunk size.
# Done with chunk 1

The vcf2diem function processes diploid data at this time. The marker order will be identical to that in the original vcf file (but see chapter Markers not informative for genome polarisation).

Principle of conversion to diem format

The diemr package uses a concise genome representation, differentiating homozygotes for the two most frequent alleles for each marker, and their respective heterozygotes. All other genotypes represent an unknown state. Specifically, the genotypes encoded as 0 represent homozygotes for one of the two most frequent alleles, 1 are heterozygous genotypes for the two most frequent alleles, 2 are homozygotes for the other allele, and U (symbol “_” is also allowed) is an unknown state that can be a missing genotype or a genotype that includes a third (or a fourth) allele.

Invariant sites and indels

What logically follows from this description is that genome polarisation is meaningful only for variant markers (sites). Invariant markers will have low support and, though they will not change the outcome, they will slow down convergence and obscure the nature of change between taxa. Their obscuring effect is removed by rescaling the hybrid indices onto \([0,1]\) during diem iterations.

Including invariant markers in genome polarisation is acceptable, and likely unavoidable. A proportion of truly invariant markers will be mis-diagnosed as variant due to sequencing errors.

At this time, genome polarisation uses variant markers where alleles differ in nucleotide substitutions. Insertions or deletions (indels) are not allowed among the most frequent alleles in the vcf2diem function.

Markers not informative for genome polarisation

Except invariant markers, some variant markers are also uninformative for genome polarisation. Markers that include only missing data and heterozygous genotypes are unpolarisable. In general applications, one might wish to exclude markers that are only included in a vcf due to the presence of a single heterozygous individual.

Uninformative markers will have support equal to 0, and their polarity will thus be equal to the null polarity. Including uninformative markers has similar consequences to including invariant sites. When a user wishes to retain such markers, we advice to then filter polarised markers with high support for subsequent analyses and interpretation.

Haploid data

Haploid data, including markers on sex chromosomes of the heterogametic sex, are usually shown as sequence alignments. To illustrate the principle of genome representation used in diem, let’s have an 8bp-long alignment with five individuals.

5 8
ind1   AACCTTGG
ind2   TACGATGG
ind3   TACC-TGT
ind4   CACGTTGG
ind5   AACCTTGT

We can see that the example alignment contains four variant markers at sites 1, 4, 5, and 8. Marker 5 contains a deletion in ind3, while other markers vary in different nucleotide substitutions. In marker 1, two individuals have an allele A, two individuals have an allele T and one has a C. This marker can be resolved in terms of two most frequent alleles A and T, where the individual ind4 has a third allele and is therefore considered as having an unknown genomic state. To convert the marker into diem format, we must make a decision on which allele will be encoded as 0. The choice can be arbitrary, but good practice is to keep a record of genotype encoding. Let’s decide that all alleles in ind1 will be encoded as 0. Marker 1 will then be 022_0.

Contrary to DNA sequence alignments, diem format transposes the data. Rows in diem input files are markers and columns represent individual samples. The diem output for the example alignment will thus have four lines representing the four variant markers, and five columns representing the individuals.

Marker1   022_0
Marker4   02020
Marker5   02_00
Marker8   00202

The diem format does not store information on marker identification or location. This metadata must be saved separately. Instead, diem uses a prefix S at line starts to ensure that all operating systems always read genotypes as strings without attempting to convert them into numbers. Encoding of markers 4 and 8 shown above is at risk of being interpreted as numeric. The final file in diem format will be:

S022_0
S02020
S02_00
S00202

Diploid data

Heterozygotes in DNA sequence alignments

Diploid markers in DNA sequence alignment could be resolved in an analogous way as haploid markers. Heterozygotes in alignments can be identified according to their IUPAC DNA ambiguity codes. For example, code W would represent an accepted heterozygote in Marker 1, where the most frequent alleles are A and T.

Variant call format

Variant markers in genomic context will be most often identified using variant callers from sequence reads mapped onto a reference. Such data are stored or can be converted to variant call format (vcf). We implemented conversion of vcf files to diem format for the vcf version 4.2 in function vcf2diem. The function resolves indels and markers with more than two alleles to obtain biallelic genotypes for all markers included in the original vcf file.

  1. Markers with the reference allele (column REF) containing insertions are set as unknown.
  2. Markers with the alternative allele (column ALT) containing only indels are set as unknown.
  3. Indels in the alternative alleles are set as unknown.
  4. Overall allele counts in markers with more than two alleles representing substitutions are calculated and the two most frequent alleles are selected. Ties are resolved according to the allele order in the vcf file.
  5. Third and fourth most frequent substitutions are set as unknown.

The conversion might result in some markers no longer being polymorphic. We advice checking how frequent invariant markers are after conversion.

# Import the first converted file for all individuals 
# and without changing marker polarity
myotis <- importPolarized("myotis-01.txt", 
                          changePolarity = rep(FALSE, 11), 
                          ChosenInds = 1:14)
# Check whether a marker includes heterozygotes 
# or that there is more than one genotype
apply(myotis, MARGIN = 2, 
      FUN = \(x) any(grepl("1", x)) | sum(levels(factor(x)) %in% c("0", "1", "2")) > 1) 
#    m1    m2    m3    m4    m5    m6    m7    m8    m9   m10   m11 
# FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE 
         

Starting from version 1.2.2, vcf2diem removes markers that include 1. only missing data and heterozygotes, 1. missing data, homozygotes for the reference allele and one heterozygote.

We include a list of the removed markers in a file omittedLoci.txt in the working directory. The file includes information from columns CHROM and POS for the respective markers. The file omittedLoci.txt must be checked when interpreting polarised markers to ensure correct marker annotation.

Polyploid data

The same principles apply to higher ploidy datasets. Preparing data in diem format for genome polarisation includes these steps:

  1. Remove indels.
  2. Identify two most frequent alleles for each marker.
  3. Encode homozygous and heterozygous genotypes that include only the two most frequent alleles.