Package MKomics

Introduction

Package MKomics includes a collection of functions that I found and find useful for the analysis of omics data. It includes:

similarity plots based on correlation and median absolute deviation (MAD)
adjusting colors for heatmaps
aggregate technical replicates
calculate pairwise fold-changes and log fold-changes
compute one- and two-way ANOVA
simplified interface to package ‘limma’ (Ritchie et al. 2015) for moderated t-test and one-way ANOVA
Hamming and the Levenshtein (edit) distance of strings as well as optimal alignment scores for global (Needleman-Wunsch) and local (Smith-Waterman) alignments with constant gap penalties (Merkl and Waack 2009).

The package requires Bioconductor package limma, which can be installed via

## Install package BiocManager
install.packages("BiocManager")
## Use BiocManager to install limma
BiocManager::install("limma")

We first load the package.

library(MKomics)

Moderated Tests Based on Package limma

The functions to compute the moderated t-test and one-way ANOVA were motivated by the fact that my students had problems to understand and correctly adapt the code of the limma package (Ritchie et al. 2015).

Moderated t-Test

We generate some artificial data and compute moderated t-tests.

## One-sample test
X <- matrix(rnorm(10*20, mean = 1), nrow = 10, ncol = 20)
mod.t.test(X)

##         mean      2.5%    97.5%        t      p.value  adj.p.value          B
## 1  0.8167034 0.3347307 1.298676 3.342451 1.000055e-03 1.250069e-03 -1.2889029
## 2  0.8597711 0.3777984 1.341744 3.518711 5.425967e-04 9.043279e-04 -0.6914649
## 3  0.8024507 0.3204780 1.284423 3.284121 1.217895e-03 1.353217e-03 -1.4798576
## 4  1.1286441 0.6466713 1.610617 4.619104 7.092114e-06 2.364038e-05  3.7323676
## 5  0.7657581 0.2837853 1.247731 3.133952 1.998232e-03 1.998232e-03 -1.9559897
## 6  1.2950566 0.8130838 1.777029 5.300166 3.189578e-07 1.594789e-06  7.0697754
## 7  1.0246341 0.5426613 1.506607 4.193431 4.210188e-05 8.420376e-05  1.8791782
## 8  1.5096517 1.0276790 1.991624 6.178421 3.856033e-09 3.856033e-08 12.0500350
## 9  0.8280052 0.3460325 1.309978 3.388705 8.537765e-04 1.219681e-03 -1.1350943
## 10 1.0959262 0.6139534 1.577899 4.485202 1.258857e-05 3.147142e-05  3.1301177

## Two-sample test
set.seed(123)
X <- rbind(matrix(rnorm(5*20), nrow = 5, ncol = 20),
           matrix(rnorm(5*20, mean = 1), nrow = 5, ncol = 20))
g2 <- factor(c(rep("group 1", 10), rep("group 2", 10)))
mod.t.test(X, group = g2)

##    difference in means       2.5%     97.5%           t   p.value adj.p.value
## 1           0.30099317 -0.5318079 1.1337943  0.71317027 0.4766640   0.9665915
## 2          -0.13637650 -0.9691776 0.6964246 -0.32312914 0.7469725   0.9665915
## 3           0.49260925 -0.3401919 1.3254104  1.16718354 0.2446801   0.8156005
## 4          -0.15266680 -0.9854679 0.6801343 -0.36172723 0.7179801   0.9665915
## 5           0.05546448 -0.7773366 0.8882656  0.13141699 0.8955922   0.9665915
## 6           0.03152278 -0.8012783 0.8643239  0.07468976 0.9405445   0.9665915
## 7           0.69311916 -0.1396820 1.5259203  1.64226977 0.1022800   0.8156005
## 8           0.17595066 -0.6568505 1.0087518  0.41689577 0.6772514   0.9665915
## 9           0.54524220 -0.2875589 1.3780433  1.29189154 0.1980510   0.8156005
## 10          0.01770155 -0.8150996 0.8505027  0.04194188 0.9665915   0.9665915
##            B mean of group 1 mean of group 2
## 1  -4.608912      0.07083562     0.371828786
## 2  -4.619655      0.13562592    -0.000750575
## 3  -4.586223     -0.30436677     0.188242477
## 4  -4.618953      0.18497579     0.032308986
## 5  -4.621971      0.08494719     0.140411667
## 6  -4.622282      1.11828347     1.149806246
## 7  -4.550748      0.67404426     1.367163425
## 8  -4.617811      0.53111339     0.707064052
## 9  -4.578072      0.92210638     1.467348578
## 10 -4.622384      0.48495033     0.502651884

## Paired two-sample test
subjID <- factor(rep(1:10, 2))
mod.t.test(X, group = g2, paired = TRUE, subject = subjID)

##    mean of differences       2.5%     97.5%           t   p.value adj.p.value
## 1           0.30099317 -0.5606128 1.1625992  0.69402426 0.4894545   0.9675331
## 2          -0.13637650 -0.9979825 0.7252295 -0.31445430 0.7539038   0.9675331
## 3           0.49260925 -0.3689968 1.3542152  1.13584893 0.2590354   0.8634514
## 4          -0.15266680 -1.0142728 0.7089392 -0.35201618 0.7256490   0.9675331
## 5           0.05546448 -0.8061415 0.9170705  0.12788893 0.8985222   0.9675331
## 6           0.03152278 -0.8300832 0.8931288  0.07268461 0.9422184   0.9675331
## 7           0.69311916 -0.1684868 1.5547252  1.59818084 0.1135079   0.8634514
## 8           0.17595066 -0.6856553 1.0375567  0.40570365 0.6859233   0.9675331
## 9           0.54524220 -0.3163638 1.4068482  1.25720898 0.2119307   0.8634514
## 10          0.01770155 -0.8439044 0.8793075  0.04081589 0.9675331   0.9675331
##            B mean of group 1 mean of group 2
## 1  -4.608863      0.07083562     0.371828786
## 2  -4.618525      0.13562592    -0.000750575
## 3  -4.588455     -0.30436677     0.188242477
## 4  -4.617893      0.18497579     0.032308986
## 5  -4.620608      0.08494719     0.140411667
## 6  -4.620887      1.11828347     1.149806246
## 7  -4.556550      0.67404426     1.367163425
## 8  -4.616866      0.53111339     0.707064052
## 9  -4.581125      0.92210638     1.467348578
## 10 -4.620978      0.48495033     0.502651884

Moderated 1-Way ANOVA

We generate some artifical data and compute moderated one-way ANOVAs.

set.seed(123)
X <- rbind(matrix(rnorm(5*20), nrow = 5, ncol = 20),
           matrix(rnorm(5*20, mean = 1), nrow = 5, ncol = 20))
gr <- factor(c(rep("A1", 5), rep("B2", 5), rep("C3", 5), rep("D4", 5)))
mod.oneway.test(X, gr)

## topTableF is obsolete and will be removed in a future version of limma. Please considering using topTable instead.

##      A1 vs B2   A1 vs C3    A1 vs D4    B2 vs C3   B2 vs D4    C3 vs D4
## 1   1.0974330  0.2271669  0.26827976 -0.87026609 -0.8291532  0.04111286
## 2   0.0769194  0.9622020 -0.61252965  0.88528264 -0.6894491 -1.57473169
## 3  -0.3105477 -0.7130963 -0.58266984 -0.40254866 -0.2721221  0.13042651
## 4  -0.5832304 -0.2440290 -0.03386782  0.33920142  0.5493626  0.21016117
## 5  -0.9579128 -0.4501003 -0.61874146  0.50781250  0.3391713 -0.16864116
## 6  -0.6013595 -0.5109767 -0.15342834  0.09038278  0.4479311  0.35754833
## 7   0.3471695 -0.3493145 -0.68975433 -0.69648400 -1.0369239 -0.34043986
## 8  -0.9277388 -0.5691081 -0.71053196  0.35863064  0.2172068 -0.14142382
## 9  -0.5998638 -1.5488157 -0.14153255 -0.94895185  0.4583313  1.40728311
## 10  1.5527067  0.5179562  0.99934740 -1.03475050 -0.5533593  0.48139118
##     grand mean         F    p.value adj.p.value
## 1   0.22133220 1.4119467 0.23709800   0.5868380
## 2   0.06743767 2.5678204 0.05255479   0.1751826
## 3  -0.05806215 0.6094415 0.60879068   0.7253749
## 4   0.10864239 0.4386138 0.72537489   0.7253749
## 5   0.11267943 0.9691456 0.40611786   0.5868380
## 6   1.13404486 0.5002131 0.68212276   0.7253749
## 7   1.02060384 1.2185746 0.30111975   0.5868380
## 8   0.61908872 0.9594030 0.41078662   0.5868380
## 9   1.19472748 2.9871234 0.02980893   0.1751826
## 10  0.49380111 2.6903326 0.04456696   0.1751826

We can also perform a moderated one-way ANOVA with repeated measures.

X <- rbind(matrix(rnorm(6*18), nrow = 6, ncol = 18),
           matrix(rnorm(6*18, mean = 1), nrow = 6, ncol = 18))
gr <- factor(c(rep("T1", 6), rep("T2", 6), rep("T3", 6)))
subjectID <- factor(c(rep(1:6, 3)))
mod.oneway.test(X, gr, repeated = TRUE, subject = subjectID)

##      grand mean          F    p.value adj.p.value
## 1   0.439887277 0.24258038 0.91426233   0.9856427
## 2  -0.348176298 0.25294001 0.90800757   0.9856427
## 3  -0.001648741 0.19504576 0.94108489   0.9856427
## 4   0.380034581 0.64717729 0.62882454   0.9856427
## 5   0.116088830 0.19890913 0.93903209   0.9856427
## 6  -0.215112459 0.13298495 0.97031807   0.9856427
## 7   1.305439170 0.08992079 0.98564266   0.9856427
## 8   0.927363547 2.87518053 0.02147716   0.2577260
## 9   1.057394093 1.60523754 0.16983981   0.6793592
## 10  0.985162868 1.66804543 0.15425972   0.6793592
## 11  0.780801913 0.48820193 0.74443860   0.9856427
## 12  0.787626019 0.45961349 0.76543996   0.9856427

Pairwise moderated t-tests

As a optional post-hoc analysis after mod.oneway.test one can use pairwise moderated t-tests (only unpaired). One should carefully think about the adjustment of p values in this context.

pairwise.mod.t.test(X, gr)

##         T1-T2   T1-T2: t    T1-T2: p T1-T2: adj.p  T1-T2: B       T1-T3
## 1   0.2342401  0.4014278 0.688581325   0.82059173 -5.803960 -0.05487846
## 2   0.2423784  0.4153749 0.678362198   0.82059173 -5.798701  0.24537369
## 3  -0.1966000 -0.3369222 0.736568028   0.82059173 -5.825944 -0.13384174
## 4  -0.2086677 -0.3576033 0.721059196   0.82059173 -5.819314 -0.26788464
## 5  -0.1845143 -0.3162105 0.752209088   0.82059173 -5.832189  0.17817385
## 6  -0.2977183 -0.5102134 0.610527064   0.82059173 -5.758182 -0.34726718
## 7  -0.3016336 -0.5169231 0.605844436   0.82059173 -5.755000 -0.01197801
## 8   1.6357283  2.8032215 0.005614332   0.06737198 -2.251098  1.58693190
## 9   0.4583369  0.7854726 0.433209202   0.82059173 -5.593554  0.46639256
## 10 -0.2211433 -0.3789833 0.705146931   0.82059173 -5.812045 -0.19783574
## 11  0.4886687  0.8374536 0.403448173   0.82059173 -5.554613  0.54861003
## 12  0.1275920  0.2186602 0.827162496   0.82716250 -5.856273  0.33974584
##       T1-T3: t    T1-T3: p T1-T3: adj.p  T1-T3: B        T2-T3     T2-T3: t
## 1  -0.09404770 0.925175869   0.98364550 -5.806707 -0.289118564 -0.495475544
## 2   0.42050798 0.674615985   0.98260953 -5.730101  0.002995244  0.005133086
## 3  -0.22937064 0.818841277   0.98260953 -5.786748  0.062758227  0.107551609
## 4  -0.45908601 0.646726303   0.98260953 -5.714626 -0.059216945 -0.101482754
## 5   0.30534457 0.760456541   0.98260953 -5.768222  0.362688176  0.621555112
## 6  -0.59512745 0.552505716   0.98260953 -5.649223 -0.049548837 -0.084914080
## 7  -0.02052726 0.983645501   0.98364550 -5.810548  0.289655548  0.496395797
## 8   2.71959689 0.007176506   0.08611807 -2.437802 -0.048796409 -0.083624610
## 9   0.79927800 0.425182677   0.98260953 -5.519404  0.008055682  0.013805387
## 10 -0.33904005 0.734974762   0.98260953 -5.758320  0.023307578  0.039943250
## 11  0.94017779 0.348386345   0.98260953 -5.407634  0.059941321  0.102724150
## 12  0.58223779 0.561135089   0.98260953 -5.656144  0.212153838  0.363577617
##     T2-T3: p T2-T3: adj.p  T2-T3: B
## 1  0.6208688    0.9959101 -4.606091
## 2  0.9959101    0.9959101 -4.609593
## 3  0.9144712    0.9959101 -4.609428
## 4  0.9192802    0.9959101 -4.609446
## 5  0.5350208    0.9959101 -4.604082
## 6  0.9324241    0.9959101 -4.609490
## 7  0.6202208    0.9959101 -4.606078
## 8  0.9334478    0.9959101 -4.609493
## 9  0.9890005    0.9959101 -4.609590
## 10 0.9681826    0.9959101 -4.609570
## 11 0.9182963    0.9959101 -4.609442
## 12 0.7166001    0.9959101 -4.607707

Pairwise Comparisons

Often we are in a situation that we want to compare more than two groups pairwise.

FC and logFC

In the analysis of omics data, the FC or logFC are important measures of effect size and are often used in combination with (adjusted) p values (Shi et al. 2006).

x <- rnorm(100) ## assumed as log-data
g <- factor(sample(1:4, 100, replace = TRUE))
levels(g) <- c("a", "b", "c", "d")
## modified FC
pairwise.fc(x, g)

##    a vs b    a vs c    a vs d    b vs c    b vs d    c vs d 
##  1.444915  1.232566  1.347226 -1.172281 -1.072511  1.093025

## "true" FC
pairwise.fc(x, g, mod.fc = FALSE)

##    a vs b    a vs c    a vs d    b vs c    b vs d    c vs d 
## 1.4449145 1.2325662 1.3472261 0.8530375 0.9323915 1.0930253

## without any transformation
pairwise.logfc(x, g)

##     a vs b     a vs c     a vs d     b vs c     b vs d     c vs d 
##  0.5309842  0.3016652  0.4299920 -0.2293190 -0.1009922  0.1283268

The function returns a modified FC. That is, if the FC is smaller than 1 it is transformed to -1/FC. One can also use other functions than the mean for the aggregation of the data.

pairwise.logfc(x, g, ave = median)

##     a vs b     a vs c     a vs d     b vs c     b vs d     c vs d 
##  0.6128841  0.2182627  0.4992445 -0.3946214 -0.1136396  0.2809818

Aggregating Technical Replicates

In case of omics experiments it is often the case that technical replicates are measured and hence it is part of the preprocessing of the raw data to aggregate these technical replicates. This is the purpose of function repMeans.

M <- matrix(rnorm(100), ncol = 5)
FL <- matrix(rpois(100, lambda = 10), ncol = 5) # only for this example
repMeans(x = M, flags = FL, use.flags = "max", ndups = 5, spacing = 4)

## $exprs
##             [,1]       [,2]       [,3]       [,4]        [,5]
## [1,] -1.18155923  0.2353980  0.9606907  0.1573533 -1.93850470
## [2,]  1.37000399 -1.5160676  0.1781902 -0.4929374  0.12071933
## [3,] -0.54174319  0.8405398 -0.2997625 -0.8341882  0.86364843
## [4,] -0.07335999 -0.2858454  0.3183903  0.8993541 -0.03515489
## 
## $flags
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   11   14   12   14   16
## [2,]   13   13   13   21   13
## [3,]   11   15   18   14   17
## [4,]   13   15   13   14   11

1- and 2-Way ANOVA

Functions oneWayAnova and twoWayAnova return a function that can be used to perform a 1- or 2-way ANOVA, respectively.

af <- oneWayAnova(c(rep(1,5),rep(2,5)))
## p value
af(rnorm(10))

## [1] 0.3936899

x <- matrix(rnorm(12*10), nrow = 10)
## 2-way ANOVA with interaction
af1 <- twoWayAnova(c(rep(1,6),rep(2,6)), rep(c(rep(1,3), rep(2,3)), 2))
## p values
apply(x, 1, af1)

##                  [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
## cov1        0.7189770 0.1199344 0.7797292 0.3399445 0.4791519 0.3287401
## cov2        0.1248696 0.1246937 0.2553811 0.5648113 0.6118778 0.0463841
## interaction 0.3941863 0.1489906 0.9090044 0.1403097 0.1257246 0.8367084
##                  [,7]      [,8]       [,9]     [,10]
## cov1        0.4046772 0.8125137 0.01638546 0.7791182
## cov2        0.2884822 0.7388120 0.85514661 0.8902903
## interaction 0.2453559 0.7667181 0.20801934 0.2081726

## 2-way ANOVA without interaction
af2 <- twoWayAnova(c(rep(1,6),rep(2,6)), rep(c(rep(1,3), rep(2,3)), 2), 
                   interaction = FALSE)
## p values
apply(x, 1, af2)

##           [,1]      [,2]      [,3]      [,4]      [,5]       [,6]      [,7]
## cov1 0.7150619 0.1423638 0.7661585 0.3758909 0.5173798 0.29988569 0.4157757
## cov2 0.1172529 0.1476506 0.2264692 0.5948831 0.6432147 0.03443076 0.2988795
##           [,8]       [,9]     [,10]
## cov1 0.8018355 0.01789566 0.7880695
## cov2 0.7242262 0.86114929 0.8948589

Correlation Distance Matrix

In the analysis of omics data correlation and absolute correlation distance matrices are often used during quality control. Function corDist can compute the Pearson, Spearman, Kendall or Cosine sample correlation and absolute correlation as well as the minimum covariance determinant or the orthogonalized Gnanadesikan-Kettenring correlation and absolute correlation implemented in package robustbase (Maechler et al. 2019).

M <- matrix(rcauchy(1000), nrow = 5)
## Pearson
corDist(M)

##           1         2         3         4
## 2 0.2875716                              
## 3 0.9914823 1.0146980                    
## 4 0.9961413 0.9955740 1.0016289          
## 5 1.0063069 0.9774429 1.0012735 1.0165059

## Spearman
corDist(M, method = "spearman")

##           1         2         3         4
## 2 1.0801350                              
## 3 0.9244866 1.0314483                    
## 4 1.0515473 0.9741859 1.1059056          
## 5 1.0247986 0.9784685 0.9892507 0.9815645

## Minimum Covariance Determinant
corDist(M, method = "mcd")

##           1         2         3         4
## 2 0.9298225                              
## 3 0.9496162 1.0243463                    
## 4 1.0301798 0.9683927 1.0295911          
## 5 0.9421068 0.9974223 1.1032783 0.8426071

MAD Matrix

In case of outliers the MAD (median absolute deviation = median of the absolute deviations from the median) may be useful as measure of scale.

madMatrix(t(M))

##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 0.000000 3.118364 3.006511 2.974456 2.878586
## [2,] 3.118364 0.000000 2.617500 2.612594 2.474732
## [3,] 3.006511 2.617500 0.000000 3.278506 3.096261
## [4,] 2.974456 2.612594 3.278506 0.000000 2.899905
## [5,] 2.878586 2.474732 3.096261 2.899905 0.000000

Similarity Matrices

First, we can plot a similarity matrix based on correlation.

M <- matrix(rnorm(1000), ncol = 20)
colnames(M) <- paste("Sample", 1:20)
M.cor <- cor(M)
corPlot(M.cor, minCor = min(M.cor), labels = colnames(M))

There is a second implementation based on package (Gu, Eils, and Schlesner 2016).

corPlot2(M.cor, minCor = min(M.cor), labels = colnames(M))

Next, we can use the MAD.

## random data
x <- matrix(rnorm(1000), ncol = 10)
## outliers
x[1:20,5] <- x[1:20,5] + 10
madPlot(x, new = TRUE, maxMAD = 2.5, labels = TRUE,
        title = "MAD: Outlier visible")

madPlot2(x, new = TRUE, maxMAD = 2.5, labels = TRUE,
        title = "MAD: Outlier visible")

## in contrast
corPlot2(x, new = TRUE, minCor = -0.5, labels = TRUE,
        title = "Correlation: Outlier masked")

Colors for Heatmaps

Nowadays there are better solutions e.g. provided by Bioconductor package ComplexHeatmap (Gu, Eils, and Schlesner 2016). This is my solution to get a better coloring of heatmaps.

## generate some random data
data.plot <- matrix(rnorm(100*50, sd = 1), ncol = 50)
colnames(data.plot) <- paste("patient", 1:50)
rownames(data.plot) <- paste("gene", 1:100)
data.plot[1:70, 1:30] <- data.plot[1:70, 1:30] + 3
data.plot[71:100, 31:50] <- data.plot[71:100, 31:50] - 1.4
data.plot[1:70, 31:50] <- rnorm(1400, sd = 1.2)
data.plot[71:100, 1:30] <- rnorm(900, sd = 1.2)
nrcol <- 128
## Load required packages
library(RColorBrewer)
myCol <- rev(colorRampPalette(brewer.pal(10, "RdBu"))(nrcol))
heatmap(data.plot, col =  myCol, main = "standard colors")

myCol2 <- heatmapCol(data = data.plot, col = myCol, 
                     lim = min(abs(range(data.plot)))-1)
heatmap(data.plot, col = myCol2, main = "heatmapCol colors")

String Alignment

String Distances

In Bioinformatics the (pairwise and multiple) alignment of strings is an important topic. For this one can use the distance or similarity between strings. The Hamming and the the Levenshtein (edit) distance are implemented in function stringDist (Merkl and Waack 2009).

x <- "GACGGATTATG"
y <- "GATCGGAATAG"
## Hamming distance
stringDist(x, y)

##             GACGGATTATG
## GATCGGAATAG           3

## Levenshtein distance
d <- stringDist(x, y)
d

##             GACGGATTATG
## GATCGGAATAG           3

In case of the Levenshtein (edit) distance, the respective scoring and traceback matrices are attached as attributes to the result.

attr(d, "ScoringMatrix")

##     gap  G A T C G G A A T  A  G
## gap   0  1 2 3 4 5 6 7 8 9 10 11
## G     1  0 1 2 3 4 5 6 7 8  9 10
## A     2  1 0 1 2 3 4 5 6 7  8  9
## C     3  2 1 1 1 2 3 4 5 6  7  8
## G     4  3 2 2 2 1 2 3 4 5  6  7
## G     5  4 3 3 3 2 1 2 3 4  5  6
## A     6  5 4 4 4 3 2 1 2 3  4  5
## T     7  6 5 4 5 4 3 2 2 2  3  4
## T     8  7 6 5 5 5 4 3 3 2  3  4
## A     9  8 7 6 6 6 5 4 3 3  2  3
## T    10  9 8 7 7 7 6 5 4 3  3  3
## G    11 10 9 8 8 7 7 6 5 4  4  3

attr(d, "TraceBackMatrix")

##     gap     G     A     T      C        G      G     A     A      T   A     
## gap "start" "i"   "i"   "i"    "i"      "i"    "i"   "i"   "i"    "i" "i"   
## G   "d"     "m"   "i"   "i"    "i"      "m/i"  "m/i" "i"   "i"    "i" "i"   
## A   "d"     "d"   "m"   "i"    "i"      "i"    "i"   "m/i" "m/i"  "i" "m/i" 
## C   "d"     "d"   "d"   "mm"   "m"      "i"    "i"   "i"   "i"    "i" "i"   
## G   "d"     "d/m" "d"   "d/mm" "d/mm"   "m"    "m/i" "i"   "i"    "i" "i"   
## G   "d"     "d/m" "d"   "d/mm" "d/mm"   "d/m"  "m"   "i"   "i"    "i" "i"   
## A   "d"     "d"   "d/m" "d/mm" "d/mm"   "d"    "d"   "m"   "m/i"  "i" "m/i" 
## T   "d"     "d"   "d"   "m"    "d/mm/i" "d"    "d"   "d"   "mm"   "m" "i"   
## T   "d"     "d"   "d"   "d/m"  "mm"     "d"    "d"   "d"   "d/mm" "m" "mm/i"
## A   "d"     "d"   "d/m" "d"    "d/mm"   "d/mm" "d"   "d/m" "m"    "d" "m"   
## T   "d"     "d"   "d"   "d/m"  "d/mm"   "d/mm" "d"   "d"   "d"    "m" "d"   
## G   "d"     "d/m" "d"   "d"    "d/mm"   "m"    "d/m" "d"   "d"    "d" "d/mm"
##     G     
## gap "i"   
## G   "m/i" 
## A   "i"   
## C   "i"   
## G   "m/i" 
## G   "m/i" 
## A   "i"   
## T   "i"   
## T   "mm/i"
## A   "i"   
## T   "mm"  
## G   "m"

The characters in the trace-back matrix reflect insertion of a gap in string y (d: deletion), match (m), mismatch (mm), and insertion of a gap in string x (i).

String Similarities

The function stringSim computes the optimal alignment scores for global (Needleman-Wunsch) and local (Smith-Waterman) alignments with constant gap penalties (Merkl and Waack 2009). Scoring and trace-back matrix are again attached as attributes to the results.

## optimal global alignment score
d <- stringSim(x, y)
d

##             GACGGATTATG
## GATCGGAATAG           6

attr(d, "ScoringMatrix")

##     gap  G  A  T  C  G  G  A  A  T   A   G
## gap   0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11
## G    -1  1  0 -1 -2 -3 -4 -5 -6 -7  -8  -9
## A    -2  0  2  1  0 -1 -2 -3 -4 -5  -6  -7
## C    -3 -1  1  1  2  1  0 -1 -2 -3  -4  -5
## G    -4 -2  0  0  1  3  2  1  0 -1  -2  -3
## G    -5 -3 -1 -1  0  2  4  3  2  1   0  -1
## A    -6 -4 -2 -2 -1  1  3  5  4  3   2   1
## T    -7 -5 -3 -1 -2  0  2  4  4  5   4   3
## T    -8 -6 -4 -2 -2 -1  1  3  3  5   4   3
## A    -9 -7 -5 -3 -3 -2  0  2  4  4   6   5
## T   -10 -8 -6 -4 -4 -3 -1  1  3  5   5   5
## G   -11 -9 -7 -5 -5 -3 -2  0  2  4   4   6

attr(d, "TraceBackMatrix")

##     gap     G     A     T      C      G     G     A     A      T   A     
## gap "start" "i"   "i"   "i"    "i"    "i"   "i"   "i"   "i"    "i" "i"   
## G   "d"     "m"   "i"   "i"    "i"    "m/i" "m/i" "i"   "i"    "i" "i"   
## A   "d"     "d"   "m"   "i"    "i"    "i"   "i"   "m/i" "m/i"  "i" "m/i" 
## C   "d"     "d"   "d"   "mm"   "m"    "i"   "i"   "i"   "i"    "i" "i"   
## G   "d"     "d/m" "d"   "d/mm" "d"    "m"   "m/i" "i"   "i"    "i" "i"   
## G   "d"     "d/m" "d"   "d/mm" "d"    "d/m" "m"   "i"   "i"    "i" "i"   
## A   "d"     "d"   "d/m" "d/mm" "d"    "d"   "d"   "m"   "m/i"  "i" "m/i" 
## T   "d"     "d"   "d"   "m"    "d/i"  "d"   "d"   "d"   "mm"   "m" "i"   
## T   "d"     "d"   "d"   "d/m"  "mm"   "d"   "d"   "d"   "d/mm" "m" "mm/i"
## A   "d"     "d"   "d/m" "d"    "d/mm" "d"   "d"   "d/m" "m"    "d" "m"   
## T   "d"     "d"   "d"   "d/m"  "d/mm" "d"   "d"   "d"   "d"    "m" "d"   
## G   "d"     "d/m" "d"   "d"    "d/mm" "m"   "d/m" "d"   "d"    "d" "d/mm"
##     G     
## gap "i"   
## G   "m/i" 
## A   "i"   
## C   "i"   
## G   "m/i" 
## G   "m/i" 
## A   "i"   
## T   "i"   
## T   "mm/i"
## A   "i"   
## T   "mm"  
## G   "m"

## optimal local alignment score
d <- stringSim(x, y, global = FALSE)
d

##             GACGGATTATG
## GATCGGAATAG           6

attr(d, "ScoringMatrix")

##     gap G A T C G G A A T A G
## gap   0 0 0 0 0 0 0 0 0 0 0 0
## G     0 1 0 0 0 1 1 0 0 0 0 1
## A     0 0 2 1 0 0 0 2 1 0 1 0
## C     0 0 1 1 2 1 0 1 1 0 0 0
## G     0 1 0 0 1 3 2 1 0 0 0 1
## G     0 1 0 0 0 2 4 3 2 1 0 1
## A     0 0 2 1 0 1 3 5 4 3 2 1
## T     0 0 1 3 2 1 2 4 4 5 4 3
## T     0 0 0 2 2 1 1 3 3 5 4 3
## A     0 0 1 1 1 1 0 2 4 4 6 5
## T     0 0 0 2 1 0 0 1 3 5 5 5
## G     0 1 0 1 1 2 1 0 2 4 4 6

attr(d, "TraceBackMatrix")

##     gap     G        A           T           C        G            
## gap "start" "i"      "i"         "i"         "i"      "i"          
## G   "d"     "m"      "i/stop"    "stop"      "stop"   "m"          
## A   "d"     "d/stop" "m"         "i"         "i/stop" "d/stop"     
## C   "d"     "stop"   "d"         "mm"        "m"      "i"          
## G   "d"     "m"      "d/i/stop"  "d/mm/stop" "d"      "m"          
## G   "d"     "m"      "mm/i/stop" "stop"      "d/stop" "d/m"        
## A   "d"     "d/stop" "m"         "i"         "i/stop" "d"          
## T   "d"     "stop"   "d"         "m"         "i"      "i"          
## T   "d"     "stop"   "d/stop"    "d/m"       "mm"     "mm/i"       
## A   "d"     "stop"   "m"         "d"         "d/mm"   "mm"         
## T   "d"     "stop"   "d/stop"    "m"         "i"      "d/mm/i/stop"
## G   "d"     "m"      "i/stop"    "d"         "mm"     "m"          
##     G             A          A             T           A        G         
## gap "i"           "i"        "i"           "i"         "i"      "i"       
## G   "m"           "i/stop"   "stop"        "stop"      "stop"   "m"       
## A   "d/mm/stop"   "m"        "m/i"         "i/stop"    "m"      "d/i/stop"
## C   "i/stop"      "d"        "mm"          "mm/i/stop" "d/stop" "mm/stop" 
## G   "m/i"         "i"        "d/mm/i/stop" "mm/stop"   "stop"   "m"       
## G   "m"           "i"        "i"           "i"         "i/stop" "m"       
## A   "d"           "m"        "m/i"         "i"         "m/i"    "i"       
## T   "d"           "d"        "mm"          "m"         "i"      "i"       
## T   "d"           "d"        "d/mm"        "m"         "mm/i"   "mm/i"    
## A   "d/mm/i/stop" "d/m"      "m"           "d"         "m"      "i"       
## T   "mm/stop"     "d"        "d"           "m"         "d"      "mm"      
## G   "m/i"         "d/i/stop" "d"           "d"         "d/mm"   "m"

The entry stop indicates that the minimum similarity score has been reached.

Optimal Alignment

Finally, the function traceBack computes an optimal global or local alignment based on a trace back matrix as provided by function stringDist or stringSim (Merkl and Waack 2009).

x <- "GACGGATTATG"
y <- "GATCGGAATAG"
## Levenshtein distance
d <- stringDist(x, y)
## optimal global alignment
traceBack(d)

##    pairwise global alignment
## x' "GA-CGGATTATG"           
## y' "GATCGGAATA-G"

## Optimal global alignment score
d <- stringSim(x, y)
## optimal global alignment
traceBack(d)

##    pairwise global alignment
## x' "GA-CGGATTATG"           
## y' "GATCGGAATA-G"

## Optimal local alignment score
d <- stringSim(x, y, global = FALSE)
## optimal local alignment
traceBack(d, global = FALSE)

##    pairwise local alignment
## x' "GA-CGGATTA"            
## y' "GATCGGAATA"

sessionInfo

sessionInfo()

## R version 4.1.0 Patched (2021-07-20 r80661)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 20.1
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
## LAPACK: /home/kohlm/RTOP/Rbranch/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RColorBrewer_1.1-2 MKomics_0.7       
## 
## loaded via a namespace (and not attached):
##  [1] ComplexHeatmap_2.8.0 highr_0.9            bslib_0.2.5.1       
##  [4] DEoptimR_1.0-9       compiler_4.1.0       jquerylib_0.1.4     
##  [7] iterators_1.0.13     tools_4.1.0          digest_0.6.27       
## [10] clue_0.3-59          jsonlite_1.7.2       evaluate_0.14       
## [13] png_0.1-7            rlang_0.4.11         foreach_1.5.1       
## [16] yaml_2.2.1           parallel_4.1.0       xfun_0.25           
## [19] cluster_2.1.2        stringr_1.4.0        knitr_1.33          
## [22] S4Vectors_0.30.0     GlobalOptions_0.1.2  sass_0.4.0          
## [25] IRanges_2.26.0       stats4_4.1.0         grid_4.1.0          
## [28] robustbase_0.93-8    R6_2.5.0             GetoptLong_1.0.5    
## [31] rmarkdown_2.10       limma_3.48.1         magrittr_2.0.1      
## [34] BiocGenerics_0.38.0  codetools_0.2-18     htmltools_0.5.1.1   
## [37] matrixStats_0.60.0   shape_1.4.6          circlize_0.4.13     
## [40] colorspace_2.0-2     stringi_1.7.3        doParallel_1.0.16   
## [43] crayon_1.4.1         rjson_0.2.20         Cairo_1.5-12.2

References

Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.

Maechler, Martin, Peter Rousseeuw, Christophe Croux, Valentin Todorov, Andreas Ruckstuhl, Matias Salibian-Barrera, Tobias Verbeke, Manuel Koller, Eduardo L. T. Conceicao, and Maria Anna di Palma. 2019. Robustbase: Basic Robust Statistics. http://robustbase.r-forge.r-project.org/.

Merkl, Rainer, and Stephan Waack. 2009. Bioinformatik Interaktiv: Grundlagen, Algorithmen, Anwendungen (German Edition). Wiley-Blackwell.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Shi, L., L. H. Reid, W. D. Jones, R. Shippy, J. A. Warrington, S. C. Baker, P. J. Collins, et al. 2006. “The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements.” Nat. Biotechnol. 24 (9): 1151–61.