The protr package offers a unique and comprehensive toolkit for generating various numerical representation schemes of protein sequences. The descriptors included are extensively utilized in bioinformatics and chemogenomics research.
The commonly used descriptors listed in protr include amino acid composition, autocorrelation, CTD, conjoint traid, quasi-sequence order, pseudo amino acid composition, and profile-based descriptors derived by Position-Specific Scoring Matrix (PSSM).
The descriptors for proteochemometric (PCM) modeling, includes the scales-based descriptors derived by principal components analysis, factor analysis, multidimensional scaling, amino acid properties (AAindex), 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.), and BLOSUM/PAM matrix-derived descriptors.
The protr package also implemented parallelized similarity computation derived by pairwise protein sequence alignment and Gene Ontology (GO) semantic similarity measures. ProtrWeb, the web application built on protr, can be accessed from http://protr.org.
If you find protr useful in your research, please feel free to cite our paper:
Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. (2015). protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 31 (11), 1857-1859.
BibTeX entry:
@article{Xiao2015,
author = {Xiao, Nan and Cao, Dong-Sheng and Zhu, Min-Feng and Xu, Qing-Song.},
title = {protr/{ProtrWeb}: {R} package and web server for generating various numerical representation schemes of protein sequences},
journal = {Bioinformatics},
year = {2015},
volume = {31},
number = {11},
pages = {1857--1859},
doi = {10.1093/bioinformatics/btv042}
}
Here we use the subcellular localization dataset of human proteins presented in Chou and Shen (2008) to demonstrate the basic usage of protr.
The complete dataset includes 3,134 protein sequences (2,750 different proteins), classified into 14 human subcellular locations. We selected two classes of proteins as our benchmark dataset. Class 1 contains 325 extracell proteins, and class 2 includes 307 mitochondrion proteins. Here we aim to build a random forest classification model to classify these two types of proteins.
First, we load the protr package, then read the protein sequences
stored in two separated FASTA files with readFASTA()
:
library("protr")
# load FASTA files
# (system.file is for accessing example file in protr package,
# replace it with your path)
extracell <- readFASTA(system.file(
"protseq/extracell.fasta",
package = "protr"
))
mitonchon <- readFASTA(system.file(
"protseq/mitochondrion.fasta",
package = "protr"
))
To read protein sequences stored in PDB format files, use
readPDB()
instead. The loaded sequences will be stored as
two lists in R, and each component in the list is a character string
representing one protein sequence. In this case, there are 325
extracell protein sequences and 306 mitonchon protein
sequences:
length(extracell)
## [1] 325
length(mitonchon)
## [1] 306
To ensure that the protein sequences only have the 20 standard amino
acid types which is usually required for the descriptor computation, we
use the protcheck()
function to do the amino acid type
sanity check and remove the non-standard sequences:
extracell <- extracell[(sapply(extracell, protcheck))]
mitonchon <- mitonchon[(sapply(mitonchon, protcheck))]
length(extracell)
## [1] 323
length(mitonchon)
## [1] 304
Two protein sequences were removed from each class. For the remaining sequences, we calculate the Type II PseAAC descriptor, i.e., the amphiphilic pseudo amino acid composition (APseAAC) descriptor (Chou, 2005) and make class labels for classification modeling.
# calculate APseAAC descriptors
x1 <- t(sapply(extracell, extractAPAAC))
x2 <- t(sapply(mitonchon, extractAPAAC))
x <- rbind(x1, x2)
# make class labels
labels <- as.factor(c(rep(0, length(extracell)), rep(1, length(mitonchon))))
In protr, the functions of commonly used descriptors for protein
sequences and proteochemometric (PCM) modeling descriptors are named
after extract...()
.
Next, we will split the data into a 75% training set and a 25% test set.
set.seed(1001)
# split training and test set
tr.idx <- c(
sample(1:nrow(x1), round(nrow(x1) * 0.75)),
sample(nrow(x1) + 1:nrow(x2), round(nrow(x2) * 0.75))
)
te.idx <- setdiff(1:nrow(x), tr.idx)
x.tr <- x[tr.idx, ]
x.te <- x[te.idx, ]
y.tr <- labels[tr.idx]
y.te <- labels[te.idx]
We will train a random forest classification model on the training
set with 5-fold cross-validation, using the randomForest
package.
library("randomForest")
rf.fit <- randomForest(x.tr, y.tr, cv.fold = 5)
rf.fit
The training result is:
## Call:
## randomForest(x = x.tr, y = y.tr, cv.fold = 5)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 25.11%
## Confusion matrix:
## 0 1 class.error
## 0 196 46 0.1900826
## 1 72 156 0.3157895
With the model trained on the training set, we predict on the test
set and plot the ROC curve with the pROC
package, as is
shown in Figure 1.
# predict on test set
rf.pred <- predict(rf.fit, newdata = x.te, type = "prob")[, 1]
# plot ROC curve
library("pROC")
plot.roc(y.te, rf.pred, grid = TRUE, print.auc = TRUE)
The area under the ROC curve (AUC) is:
## Call:
## plot.roc.default(x = y.te, predictor = rf.pred, col = "#0080ff",
## grid = TRUE, print.auc = TRUE)
##
## Data: rf.pred in 81 controls (y.te 0) > 76 cases (y.te 1).
## Area under the curve: 0.8697
Figure 1: ROC curve for the test set of protein subcellular localization data
The protr package (Xiao et al.,
2015) implemented most of the state-of-the-art protein sequence
descriptors with R. Generally, each type of the descriptors (features)
can be calculated with a function named extractX()
in the
protr package, where X
stands for the abbrevation of the
descriptor name. The descriptors are:
extractAAC()
- Amino acid compositionextractDC()
- Dipeptide compositionextractTC()
- Tripeptide compositionextractMoreauBroto()
- Normalized Moreau-Broto
autocorrelationextractMoran()
- Moran autocorrelationextractGeary()
- Geary autocorrelationextractCTDC()
- CompositionextractCTDT()
- TransitionextractCTDD()
- DistributionextractCTriad()
- Conjoint triad descriptorsextractSOCN()
- Sequence-order-coupling numberextractQSO()
- Quasi-sequence-order descriptorsextractPAAC()
- Pseudo-amino acid composition
(PseAAC)extractAPAAC()
- Amphiphilic pseudo-amino acid
composition (APseAAC)extractPSSM()
extractPSSMAcc()
extractPSSMFeature()
The descriptors commonly used in Proteochemometric Modeling (PCM) implemented in protr include:
extractScales()
, extractScalesGap()
-
Scales-based descriptors derived by Principal Components Analysis
extractProtFP()
, extractProtFPGap()
-
Scales-based descriptors derived by amino acid properties from AAindex
(a.k.a. Protein Fingerprint)extractDescScales()
- Scales-based descriptors derived
by 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM,
VHSE, etc.)extractFAScales()
- Scales-based descriptors derived by
Factor AnalysisextractMDSScales()
- Scales-based descriptors derived
by Multidimensional ScalingextractBLOSUM()
- BLOSUM and PAM matrix-derived
descriptorsThe protr package integrates the function of parallelized similarity
score computation derived by local or global protein sequence alignment
between a list of protein sequences, the sequence alignment computation
is provided by Biostrings
, the corresponding functions
listed in the protr
package include:
twoSeqSim()
- Similarity calculation derived by
sequence alignment between two protein sequencesparSeqSim()
- Parallelized pairwise similarity
calculation with a list of protein sequences (in-memory version)parSeqSimDisk()
- Parallelized pairwise similarity
calculation with a list of protein sequences (disk-based version)The protr package also integrates the function of parallelized
similarity score computation derived by Gene Ontology (GO) semantic
similarity measures between a list of GO terms / Entrez Gene IDs, the GO
similarity computation is provided by GOSemSim
, the
corresponding functions listed in the protr package include:
twoGOSim()
- Similarity calculation derived by GO-terms
semantic similarity measures between two GO terms / Entrez Gene
IDs;parGOSim()
- Pairwise similarity calculation with a
list of GO terms / Entrez Gene IDs.To use the parSeqSim()
function, we suggest the users to
install the packages foreach
and doParallel
first, in order to make the parallelized pairwise similarity computation
available.
In the following sections, we will introduce the descriptors and function usage in this order.
Note: Users need to intelligently evaluate the underlying details of the descriptors provided, instead of using protr with their data blindly, especially for the descriptor types with more flexibility. It would be wise for the users to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
A protein or peptide sequence with \(N\) amino acid residues can be generally represented as \(\{\,R_1, R_2, \ldots, R_n\,\}\), where \(R_i\) represents the residue at the \(i\)-th position in the sequence. The labels \(i\) and \(j\) are used to index amino acid position in a sequence, and \(r\), \(s\), \(t\) are used to represent the amino acid type. The computed descriptors are roughly categorized into 4 groups according to their major applications.
A protein sequence can be partitioned equally into segments. The descriptors designed for the complete sequence, can be often applied to each individual segment.
The amino acid composition describes the fraction of each amino acid type within a protein sequence. The fractions of all 20 natural amino acids are calculated as:
\[ f(r) = \frac{N_r}{N} \quad r = 1, 2, \ldots, 20. \] where \(N_r\) is the number of the amino acid type \(r\) and \(N\) is the length of the sequence.
As was described above, we can use the function
extractAAC()
to extract the descriptors (features) from
protein sequences:
library("protr")
x <- readFASTA(system.file(
"protseq/P00750.fasta",
package = "protr"
))[[1]]
extractAAC(x)
#> A R N D C E Q
#> 0.06405694 0.07117438 0.03914591 0.05160142 0.06761566 0.04804270 0.04804270
#> G H I L K M F
#> 0.08185053 0.03024911 0.03558719 0.07651246 0.03914591 0.01245552 0.03202847
#> P S T W Y V
#> 0.05338078 0.08896797 0.04448399 0.02313167 0.04270463 0.04982206
Here, with the function readFASTA()
we loaded a single
protein sequence (P00750, Tissue-type plasminogen activator) from a
FASTA format file. Then extracted the AAC descriptors with
extractAAC()
. The result returned is a named vector, whose
elements are tagged with the name of each amino acid.
Dipeptide composition gives a 400-dimensional descriptor, defined as:
\[ f(r, s) = \frac{N_{rs}}{N - 1} \quad r, s = 1, 2, \ldots, 20. \]
where \(N_{rs}\) is the number of
dipeptide represented by amino acid type \(r\) and type \(s\). Similar to extractAAC()
,
here we use extractDC()
to compute the descriptors:
dc <- extractDC(x)
head(dc, n = 30L)
#> AA RA NA DA CA EA
#> 0.003565062 0.003565062 0.000000000 0.007130125 0.003565062 0.003565062
#> QA GA HA IA LA KA
#> 0.007130125 0.007130125 0.001782531 0.003565062 0.001782531 0.001782531
#> MA FA PA SA TA WA
#> 0.000000000 0.005347594 0.003565062 0.007130125 0.003565062 0.000000000
#> YA VA AR RR NR DR
#> 0.000000000 0.000000000 0.003565062 0.007130125 0.005347594 0.001782531
#> CR ER QR GR HR IR
#> 0.005347594 0.005347594 0.000000000 0.007130125 0.001782531 0.003565062
Here we only showed the first 30 elements of the result vector and omitted the rest of the result. The element names of the returned vector are self-explanatory as before.
Tripeptide composition gives a 8000-dimensional descriptor, defined as:
\[ f(r, s, t) = \frac{N_{rst}}{N - 2} \quad r, s, t = 1, 2, \ldots, 20 \]
where \(N_{rst}\) is the number of
tripeptides represented by amino acid type \(r\), \(s\), and \(t\). With function
extractTC()
, we can easily obtain the length-8000
descriptor, to save some space, here we also omitted the long
outputs:
tc <- extractTC(x)
head(tc, n = 36L)
#> AAA RAA NAA DAA CAA EAA
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> QAA GAA HAA IAA LAA KAA
#> 0.001785714 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> MAA FAA PAA SAA TAA WAA
#> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000
#> YAA VAA ARA RRA NRA DRA
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> CRA ERA QRA GRA HRA IRA
#> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000
#> LRA KRA MRA FRA PRA SRA
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
Autocorrelation descriptors are defined based on the distribution of amino acid properties along the sequence. The amino acid properties used here are various types of amino acids index (Retrieved from the AAindex database, see Kawashima et al. (1999), Kawashima and Kanehisa (2000), and Kawashima et al. (2008); see Figure 2 for an illustrated example). Three types of autocorrelation descriptors are defined here and described below.
All the amino acid indices are centralized and standardized before the calculation, i.e.
\[ P_r = \frac{P_r - \bar{P}}{\sigma} \]
where \(\bar{P}\) is the average of the property of the 20 amino acids:
\[ \bar{P} = \frac{\sum_{r=1}^{20} P_r}{20} \quad \textrm{and} \quad \sigma = \sqrt{\frac{1}{2} \sum_{r=1}^{20} (P_r - \bar{P})^2} \]
Figure 2: An illustrated example in the AAIndex database
For protein sequences, the Moreau-Broto autocorrelation descriptors can be defined as:
\[ AC(d) = \sum_{i=1}^{N-d} P_i P_{i + d} \quad d = 1, 2, \ldots, \textrm{nlag} \]
where \(d\) is called the lag of the autocorrelation; \(P_i\) and \(P_{i+d}\) are the properties of the amino acids at position \(i\) and \(i+d\); \(\textrm{nlag}\) is the maximum value of the lag.
The normalized Moreau-Broto autocorrelation descriptors are defined as:
\[ ATS(d) = \frac{AC(d)}{N-d} \quad d = 1, 2, \ldots, \textrm{nlag} \]
The corresponding function for this descriptor is
extractMoreauBroto()
. A typical call would be:
moreau <- extractMoreauBroto(x)
head(moreau, n = 36L)
#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4
#> 0.081573213 -0.016064817 -0.015982990 -0.025739038
#> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8
#> 0.079058632 -0.042771564 -0.036320847 0.024087298
#> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12
#> -0.005273958 0.052274763 0.082170073 0.005419919
#> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16
#> 0.083292042 0.004810584 0.001872446 -0.001531495
#> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20
#> -0.011917230 0.071161551 0.033473197 0.026882737
#> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24
#> 0.073075402 0.115272790 0.041517897 -0.027025993
#> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28
#> 0.033477388 -0.003245255 0.078117010 -0.028177304
#> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2
#> 0.046695832 0.020584423 0.052740185 0.030804784
#> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6
#> 0.037170476 -0.058993771 0.070641780 -0.089192490
The eight default properties used here are:
Users can change the property names of AAindex database with the
argument props
. The AAindex data shipped with protr can be
loaded by data(AAindex)
, which has the detailed information
of each property. With the argument customprops
and
nlag
, users can specify their own properties and lag value
to calculate with. For example:
# Define 3 custom properties
myprops <- data.frame(
AccNo = c("MyProp1", "MyProp2", "MyProp3"),
A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101),
N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59),
C = c(0.29, -1, 47), E = c(-0.74, 3, 73),
Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1),
H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57),
L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73),
M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91),
P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31),
T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130),
Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43)
)
# Use 4 properties in the AAindex database, and 3 cutomized properties
moreau2 <- extractMoreauBroto(
x,
customprops = myprops,
props = c(
"CIDH920105", "BHAR880101",
"CHAM820101", "CHAM820102",
"MyProp1", "MyProp2", "MyProp3"
)
)
head(moreau2, n = 36L)
#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4
#> 0.081573213 -0.016064817 -0.015982990 -0.025739038
#> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8
#> 0.079058632 -0.042771564 -0.036320847 0.024087298
#> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12
#> -0.005273958 0.052274763 0.082170073 0.005419919
#> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16
#> 0.083292042 0.004810584 0.001872446 -0.001531495
#> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20
#> -0.011917230 0.071161551 0.033473197 0.026882737
#> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24
#> 0.073075402 0.115272790 0.041517897 -0.027025993
#> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28
#> 0.033477388 -0.003245255 0.078117010 -0.028177304
#> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2
#> 0.046695832 0.020584423 0.052740185 0.030804784
#> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6
#> 0.037170476 -0.058993771 0.070641780 -0.089192490
About the standard input format of props
and
customprops
, see ?extractMoreauBroto
for
details.
For protein sequences, the Moran autocorrelation descriptors can be defined as:
\[ I(d) = \frac{\frac{1}{N-d} \sum_{i=1}^{N-d} (P_i - \bar{P}') (P_{i+d} - \bar{P}')}{\frac{1}{N} \sum_{i=1}^{N} (P_i - \bar{P}')^2} \quad d = 1, 2, \ldots, 30 \]
where \(d\), \(P_i\), and \(P_{i+d}\) are defined in the same way as in the first place; \(\bar{P}'\) is the considered property \(P\) along the sequence, i.e.,
\[ \bar{P}' = \frac{\sum_{i=1}^N P_i}{N} \]
\(d\), \(P\), \(P_i\) and \(P_{i+d}\), \(\textrm{nlag}\) have the same meaning as above.
With extractMoran()
(which has the identical parameters
as extractMoreauBroto()
), we can compute the Moran
autocorrelation descriptors (only print out the first 36 elements):
# Use the 3 custom properties defined before
# and 4 properties in the AAindex database
moran <- extractMoran(
x,
customprops = myprops,
props = c(
"CIDH920105", "BHAR880101",
"CHAM820101", "CHAM820102",
"MyProp1", "MyProp2", "MyProp3"
)
)
head(moran, n = 36L)
#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4
#> 0.062895724 -0.044827681 -0.045065117 -0.055955678
#> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8
#> 0.060586377 -0.074128412 -0.067308852 -0.001293384
#> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12
#> -0.033747588 0.029392193 0.061789800 -0.023368437
#> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16
#> 0.062769417 -0.024912264 -0.028298043 -0.031584063
#> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20
#> -0.043466730 0.047830694 0.005883901 -0.001769769
#> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24
#> 0.049334048 0.096427969 0.015147594 -0.060092509
#> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28
#> 0.007549152 -0.033987885 0.056307675 -0.061844453
#> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2
#> 0.021484780 -0.008461776 0.014229951 -0.009142419
#> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6
#> -0.003272262 -0.109613332 0.033346233 -0.141538598
Geary autocorrelation descriptors for protein sequences can be defined as:
\[ C(d) = \frac{\frac{1}{2(N-d)} \sum_{i=1}^{N-d} (P_i - P_{i+d})^2}{\frac{1}{N-1} \sum_{i=1}^{N} (P_i - \bar{P}')^2} \quad d = 1, 2, \ldots, 30 \]
where \(d\), \(P\), \(P_i\), \(P_{i+d}\), and \(\textrm{nlag}\) have the same meaning as above.
For each amino acid index, there will be \(3 \times \textrm{nlag}\) autocorrelation
descriptors. The usage of extractGeary()
is exactly the
same as extractMoreauBroto()
and
extractMoran()
:
# Use the 3 custom properties defined before
# and 4 properties in the AAindex database
geary <- extractGeary(
x,
customprops = myprops,
props = c(
"CIDH920105", "BHAR880101",
"CHAM820101", "CHAM820102",
"MyProp1", "MyProp2", "MyProp3"
)
)
head(geary, n = 36L)
#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4
#> 0.9361830 1.0442920 1.0452843 1.0563467
#> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8
#> 0.9406031 1.0765517 1.0675786 0.9991363
#> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12
#> 1.0316555 0.9684585 0.9353130 1.0201990
#> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16
#> 0.9340933 1.0207373 1.0251486 1.0290464
#> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20
#> 1.0414375 0.9494403 0.9905987 0.9987183
#> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24
#> 0.9472542 0.9010009 0.9828848 1.0574098
#> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28
#> 0.9897955 1.0290018 0.9400066 1.0584150
#> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2
#> 0.9762904 1.0029734 0.9818711 1.0051730
#> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6
#> 0.9967069 1.1012905 0.9595859 1.1337056
The CTD descriptors are developed by Dubchak et al. (1995) and Dubchak et al. (1999).
Figure 3: The sequence of a hypothetic protein indicating the
construction of composition, transition, and distribution descriptors of
a protein. Sequence index indicates the position of an amino acid in the
sequence. The index for each type of amino acids in the sequence
(1
, 2
or 3
) indicates the
position of the first, second, third, … of that type of amino acid. 1/2
transition indicates the position of 12
or 21
pairs in the sequence (1/3 and 2/3 are defined in the same way).
Step 1: Sequence Encoding
The amino acids are categorized into three classes according to its attribute, and each amino acid is encoded by one of the indices 1, 2, 3 according to which class it belongs. The attributes used here include hydrophobicity, normalized van der Waals volume, polarity, and polarizability. The corresponding classification for each attribute is listed in Table 1.
Group 1 | Group 2 | Group 3 | |
---|---|---|---|
Hydrophobicity | Polar | Neutral | Hydrophobicity |
R, K, E, D, Q, N | G, A, S, T, P, H, Y | C, L, V, I, M, F, W | |
Normalized van der Waals Volume | 0-2.78 | 2.95-4.0 | 4.03-8.08 |
G, A, S, T, P, D, C | N, V, E, Q, I, L | M, H, K, F, R, Y, W | |
Polarity | 4.9-6.2 | 8.0-9.2 | 10.4-13.0 |
L, I, F, W, C, M, V, Y | P, A, T, G, S | H, Q, R, K, N, E, D | |
Polarizability | 0-1.08 | 0.128-0.186 | 0.219-0.409 |
G, A, S, D, T | C, P, N, V, E, Q, I, L | K, M, H, F, R, Y, W | |
Charge | Positive | Neutral | Negative |
K, R | A, N, C, Q, G, H, I, L, M, F, P, S, T, W, Y, V | D, E | |
Secondary Structure | Helix | Strand | Coil |
E, A, L, M, Q, K, R, H | V, I, Y, C, W, F, T | G, N, P, S, D | |
Solvent Accessibility | Buried | Exposed | Intermediate |
A, L, F, C, G, I, V, W | R, K, Q, E, N, D | M, S, |