# Haplotype Analysis for discrete TTP

Cycle-specific logistic regression of haplo-type effects with known haplo-type probabilities. Given observed genotype G and unobserved haplotypes H we here mix out over the possible haplotypes using that $$P(H|G)$$ is given as input.

\begin{align*} S(t|x,G) & = E( S(t|x,H) | G) = \sum_{h \in G} P(h|G) S(t|z,h) \end{align*} so survival can be computed by mixing out over possible h given g.

Survival is based on logistic regression for the discrete hazard function of the form \begin{align*} \mbox{logit}(P(T=t| T >= t, x,h)) & = \alpha_t + x(h) beta \end{align*} where x(h) is a regression design of x and haplotypes $$h=(h_1,h_2)$$

For standard errors we assume that haplotype probabilities are known.

We are particularly interested in the types haplotypes:

types <- c("DCGCGCTCACG","DTCCGCTGACG","ITCAGTTGACG","ITCCGCTGAGG")

## some haplotypes frequencies for simulations
data(hapfreqs)
print(hapfreqs)
#>             index   haplotype     freq
#> DCGAGCTCACG     1 DCGAGCTCACG 0.010681
#> DCGCGCTCACG     2 DCGCGCTCACG 0.138387
#> DTGAGCTCACG     3 DTGAGCTCACG 0.000310
#> DTGAGCTCACA     4 DTGAGCTCACA 0.006800
#> DTGAGCTCGCG     5 DTGAGCTCGCG 0.034517
#> DTGACCTCACG     6 DTGACCTCACG 0.001336
#> DTGCGCTCACG     7 DTGCGCTCACG 0.009969
#> DTGCGCTCACA     8 DTGCGCTCACA 0.011833
#> DTGCGCTCGCG     9 DTGCGCTCGCG 0.302389
#> DTGCGCCCGCG    10 DTGCGCCCGCG 0.001604
#> DTGCCCTCACG    11 DTGCCCTCACG 0.003912
#> DTCAGCTGACG    12 DTCAGCTGACG 0.001855
#> DTCCGCTGACG    13 DTCCGCTGACG 0.103394
#> DTCCCCTGACG    14 DTCCCCTGACG 0.000310
#> ITCAGTTGACG    15 ITCAGTTGACG 0.048124
#> ITCCGCTGAGG    16 ITCCGCTGAGG 0.291273
#> ITCCGTTGACG    17 ITCCGTTGACG 0.031089
#> ITCCGTCGACG    18 ITCCGTCGACG 0.001502
#> ITCCCCTGAGG    19 ITCCCCTGAGG 0.000653

Among the types of interest we look up the frequencies and choose a baseline

www <-which(hapfreqs$haplotype %in% types) hapfreqs$freq[www]
#> [1] 0.138387 0.103394 0.048124 0.291273

baseline=hapfreqs$haplotype[9] baseline #> [1] "DTGCGCTCGCG" We have cycle specific data with $$id$$ and outcome $$y$$ data(haploX) dlist(haploX,.~id|id %in% c(1,4,7)) #> id: 1 #> y X1 X2 X3 X4 times lbnr__id Count1 #> 1 0 0 0 0 0 1 1 0 #> 2 0 0 0 0 0 2 2 0 #> 3 0 0 0 0 0 3 3 0 #> 4 0 0 0 0 0 4 4 0 #> 5 0 0 0 0 0 5 5 0 #> 6 0 0 0 0 0 6 6 0 #> attr(,"class") #> [1] matrix array #> ------------------------------------------------------------ #> id: 4 #> y X1 X2 X3 X4 times lbnr__id Count1 #> 19 1 0 0 0 0 1 1 0 #> attr(,"class") #> [1] matrix array #> ------------------------------------------------------------ #> id: 7 #> y X1 X2 X3 X4 times lbnr__id Count1 #> 37 0 1 0 0 0 1 1 0 #> 38 0 1 0 0 0 2 2 0 #> 39 1 1 0 0 0 3 3 0 #> attr(,"class") #> [1] matrix array and a list of possible haplo-types for each id and how likely they are $$p$$ (the sum of within each id is 1): data(hHaplos) ## loads ghaplos head(ghaplos) #> id haplo1 haplo2 p #> 1 1 DTGCGCTCGCG DTGAGCTCGCG 1.00000000 #> 19 2 ITCCGTTGACG DTGAGCTCGCG 0.06867716 #> 21 2 ITCAGTTGACG DTGCGCTCGCG 0.93132284 #> 51 3 ITCCGTTGACG DTGAGCTCGCG 0.06867716 #> 53 3 ITCAGTTGACG DTGCGCTCGCG 0.93132284 #> 66 4 DTGCGCTCGCG DTGCGCTCGCG 1.00000000 The first id=1 has the haplotype fully observed, but id=2 has two possible haplotypes consistent with the observed genotype for this id, the probabiblities are 7% and 93%, respectively. With the baseline given above we can specify a regression design that gives an effect if a “type” is present (sm=0), or an additive effect of haplotypes (sm=1): designftypes <- function(x,sm=0) { hap1=x[1] hap2=x[2] if (sm==0) y <- 1*( (hap1==types) | (hap2==types)) if (sm==1) y <- 1*(hap1==types) + 1*(hap2==types) return(y) } To fit the model we start by constructing a time-design (named X) and takes the haplotype distributions for each id haploX$time <- haploX$times Xdes <- model.matrix(~factor(time),haploX) colnames(Xdes) <- paste("X",1:ncol(Xdes),sep="") X <- dkeep(haploX,~id+y+time) X <- cbind(X,Xdes) Haplos <- dkeep(ghaplos,~id+"haplo*"+p) desnames=paste("X",1:6,sep="") # six X's related to 6 cycles head(X) #> id y time X1 X2 X3 X4 X5 X6 #> 1 1 0 1 1 0 0 0 0 0 #> 2 1 0 2 1 1 0 0 0 0 #> 3 1 0 3 1 0 1 0 0 0 #> 4 1 0 4 1 0 0 1 0 0 #> 5 1 0 5 1 0 0 0 1 0 #> 6 1 0 6 1 0 0 0 0 1 Now we can fit the model with the design given by the designfunction out <- haplo.surv.discrete(X=X,y="y",time.name="time", Haplos=Haplos,desnames=desnames,designfunc=designftypes) names(out$coef) <- c(desnames,types)
out$coef #> X1 X2 X3 X4 X5 X6 #> -1.82153345 -0.61608261 -0.17143057 -1.27152045 -0.28635976 -0.19349091 #> DCGCGCTCACG DTCCGCTGACG ITCAGTTGACG ITCCGCTGAGG #> 0.79753613 0.65747412 0.06119231 0.31666905 summary(out) #> Estimate Std.Err 2.5% 97.5% P-value #> X1 -1.82153 0.1619 -2.13892 -1.5041 2.355e-29 #> X2 -0.61608 0.1895 -0.98748 -0.2447 1.149e-03 #> X3 -0.17143 0.1799 -0.52398 0.1811 3.406e-01 #> X4 -1.27152 0.2631 -1.78719 -0.7559 1.346e-06 #> X5 -0.28636 0.2030 -0.68425 0.1115 1.584e-01 #> X6 -0.19349 0.2134 -0.61184 0.2249 3.647e-01 #> DCGCGCTCACG 0.79754 0.1494 0.50465 1.0904 9.445e-08 #> DTCCGCTGACG 0.65747 0.1621 0.33971 0.9752 5.007e-05 #> ITCAGTTGACG 0.06119 0.2145 -0.35931 0.4817 7.755e-01 #> ITCCGCTGAGG 0.31667 0.1361 0.04989 0.5834 1.999e-02 Haplotypes “DCGCGCTCACG” “DTCCGCTGACG” gives increased hazard of pregnancy The data was generated with these true coefficients tcoef=c(-1.93110204,-0.47531630,-0.04118204,-1.57872602,-0.22176426,-0.13836416, 0.88830288,0.60756224,0.39802821,0.32706859) cbind(out$coef,tcoef)
#>                               tcoef
#> X1          -1.82153345 -1.93110204
#> X2          -0.61608261 -0.47531630
#> X3          -0.17143057 -0.04118204
#> X4          -1.27152045 -1.57872602
#> X5          -0.28635976 -0.22176426
#> X6          -0.19349091 -0.13836416
#> DCGCGCTCACG  0.79753613  0.88830288
#> DTCCGCTGACG  0.65747412  0.60756224
#> ITCAGTTGACG  0.06119231  0.39802821
#> ITCCGCTGAGG  0.31666905  0.32706859

The design fitted can be found in the output

head(out\$X,10)
#>    X1 X2 X3 X4 X5 X6 haplo1 haplo2 haplo3 haplo4
#> 1   1  0  0  0  0  0      0      0      0      0
#> 2   1  1  0  0  0  0      0      0      0      0
#> 3   1  0  1  0  0  0      0      0      0      0
#> 4   1  0  0  1  0  0      0      0      0      0
#> 5   1  0  0  0  1  0      0      0      0      0
#> 6   1  0  0  0  0  1      0      0      0      0
#> 8   1  0  0  0  0  0      0      0      1      0
#> 10  1  1  0  0  0  0      0      0      1      0
#> 12  1  0  1  0  0  0      0      0      1      0
#> 14  1  0  0  1  0  0      0      0      1      0

# SessionInfo

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: Fedora 32 (Workstation Edition)
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.10.so
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] mets_1.2.8.1   lava_1.6.8     timereg_1.9.6  survival_3.2-3
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.5          mvtnorm_1.1-1       lattice_0.20-41
#>  [4] ucminf_1.1-4        digest_0.6.25       grid_4.0.2
#>  [7] magrittr_1.5        evaluate_0.14       rlang_0.4.7
#> [10] stringi_1.4.6       Matrix_1.2-18       rmarkdown_2.3
#> [13] splines_4.0.2       tools_4.0.2         stringr_1.4.0
#> [16] numDeriv_2016.8-1.1 xfun_0.15           yaml_2.2.1
#> [19] compiler_4.0.2      htmltools_0.5.0     knitr_1.29